MRC: OpenAI’s Open Networking Protocol for Reliable AI Supercomputer Training Clusters

Published 2026-05-07 10:13:35 · Hardware

Training frontier AI models is no longer solely a computing challenge—it has become a networking challenge. OpenAI has stepped up to address this with its new open protocol, Multipath Reliable Connection (MRC).

The Networking Bottleneck in AI Training

To understand why MRC matters, one must look inside a supercomputer during large-scale model training. A single training step can involve millions of data transfers. Even one delayed transfer can ripple through the entire job, causing GPUs to sit idle. This idle time is costly: with over 900 million people using ChatGPT weekly, every second of GPU downtime translates to real financial and performance losses.
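The straggler effect described above can be sketched in a few lines. This is an illustrative model, not anything from the MRC specification: a collective operation finishes only when its slowest transfer does, so a single congested link stalls every GPU waiting on that step. The transfer counts and latencies below are made-up numbers chosen for illustration.

```python
import random

random.seed(42)

# Illustrative model: a training step's collective operation completes
# only when the slowest of its transfers completes, so one straggler
# idles every participating GPU.
def step_time(transfer_times_ms):
    return max(transfer_times_ms)

# 100,000 transfers that normally take 1 ms, each with an assumed
# 0.01% chance of hitting a 100 ms congestion stall.
transfers = [100.0 if random.random() < 1e-4 else 1.0 for _ in range(100_000)]

print(f"median transfer: 1.0 ms, step time: {step_time(transfers):.1f} ms")
```

With that many transfers per step, it is near-certain that at least one hits the tail, so the whole step pays the 100x latency. This is why predictability, not just raw speed, is the design goal.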

Network congestion, link failures, and device faults are the primary sources of delay and jitter. These issues become more frequent and harder to resolve as clusters scale up. OpenAI states its goal is “not just to build a fast network, but also to build one that delivers very predictable performance, even in the presence of failures, to keep training jobs moving.”

Introducing MRC: A Collaborative Open Protocol

OpenAI announced MRC after two years of development in partnership with AMD, Broadcom, Intel, Microsoft, and NVIDIA. The specification was published through the Open Compute Project (OCP), allowing the broader industry to adopt and build upon it. MRC extends RDMA over Converged Ethernet (RoCE), an InfiniBand Trade Association standard that lets network hardware move data directly between the memory of GPUs on different machines, bypassing the CPU and operating system for maximum throughput.

It also draws on techniques from the Ultra Ethernet Consortium (UEC) and extends them with SRv6-based source routing. SRv6 (Segment Routing over IPv6) allows the sending machine to encode the exact route a packet should follow inside the packet header, so switches no longer need complex routing calculations. This reduces processing load and power consumption at data center scale.
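The idea behind SRv6-style source routing can be sketched as follows. This is a simplified illustration of the concept, not the SRv6 wire format: the sender encodes the full path as a segment list in the packet header, and each hop only consults a counter to find the next waypoint, doing no route computation of its own. All names here are invented for the example.

```python
from dataclasses import dataclass

# Simplified sketch of source routing in the SRv6 style. The sending
# host chooses the entire route; switches just read the header.

@dataclass
class Packet:
    payload: bytes
    segments: list          # waypoints chosen by the sender, in travel order
    segments_left: int = 0  # how many waypoints remain to be visited

def build_packet(payload, path):
    # The sender encodes the exact route inside the packet header.
    return Packet(payload=payload, segments=list(path), segments_left=len(path))

def forward(packet):
    """One hop of forwarding: return the next waypoint and decrement the
    counter. A real SRv6 switch does the equivalent in hardware, with no
    routing-table lookups or path computation."""
    if packet.segments_left == 0:
        return None  # arrived at the final destination
    nxt = packet.segments[len(packet.segments) - packet.segments_left]
    packet.segments_left -= 1
    return nxt

pkt = build_packet(b"gradients", ["spine-3", "leaf-7", "gpu-host-42"])
hops = []
while (hop := forward(pkt)) is not None:
    hops.append(hop)
print(hops)  # ['spine-3', 'leaf-7', 'gpu-host-42']
```

Because the per-hop work is a counter decrement and an indexed read, the switch logic stays trivial, which is the source of the processing and power savings the article describes.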

How MRC Works: Core Mechanisms

MRC introduces three core mechanisms to improve reliability and performance. The announcement describes one of them, Adaptive Packet Spraying, in detail; it is covered below. The other two mechanisms are not detailed in the announcement.

Adaptive Packet Spraying: Eliminating Congestion

Traditional RoCEv2 pins each transfer to a single network path, which concentrates traffic and leads to congestion in the network core. MRC instead spreads a transfer's packets across hundreds of paths simultaneously, a technique OpenAI calls intelligent packet-spray load balancing. If the primary path becomes unusable, its packets shift to alternative paths. The result is a network that maintains predictable performance even under failure conditions, keeping GPU utilization high.
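A minimal sketch of packet-spray load balancing, under assumptions of mine rather than the MRC specification: a transfer's packets are round-robined across a set of equal-cost paths, and any path marked unhealthy is simply skipped, so a link failure reroutes traffic instead of stalling it.

```python
import itertools

# Illustrative sketch (not the MRC protocol): spray a transfer's packets
# across all healthy paths in round-robin order.
def spray(packets, paths, healthy):
    """Assign each packet to the next healthy path, skipping failed links."""
    live = [p for p in paths if healthy[p]]
    if not live:
        raise RuntimeError("no healthy paths available")
    return {pkt: path for pkt, path in zip(packets, itertools.cycle(live))}

paths = [f"path-{i}" for i in range(4)]
healthy = {p: True for p in paths}
healthy["path-2"] = False  # a failed link drops out of the rotation

assignment = spray(list(range(6)), paths, healthy)
print(assignment)
# {0: 'path-0', 1: 'path-1', 2: 'path-3', 3: 'path-0', 4: 'path-1', 5: 'path-3'}
```

The load that would have hit the failed path is absorbed evenly by the survivors, which is what keeps tail latency, and therefore GPU idle time, bounded.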

Conclusion

By open-sourcing MRC through OCP, OpenAI aims to accelerate adoption across the industry. The protocol addresses a critical gap in large-scale AI supercomputing—ensuring that network reliability and performance keep pace with the demands of frontier models.