May 22, 2026

Zettascale in Practice: MRC Benefit

Breakout saves Clos layers, not just bandwidth. By turning one large NIC pipe into multiple topology rails, MRC can reduce switch count, hop count, and latency at the scale boundaries where Clos depth changes.

Abstract: MRC is often explained as reliable multipath transport for AI networking. That is correct, but incomplete. The deeper value appears when MRC is combined with NIC breakout. Breakout does not magically reduce total bandwidth demand. If every GPU still needs 800G, switching bandwidth is still determined by GPU count and oversubscription target. But breakout changes the topology envelope: bandwidth becomes radix, radix enables larger flat fabrics, and larger flat fabrics can avoid Clos layers. The benefit is scale-dependent: 8 x 100G can collapse a 512-GPU cluster into single-stage planes, 4 x 200G can save one Clos layer at 16K GPUs, and breakout becomes necessary for practical three-layer topology at 262K GPUs.

Disclaimer: The ideas and analysis presented in this article are based on publicly available technical information, first-principles reasoning, and the author’s personal experience in HPC and distributed systems architecture. The content reflects personal technical exploration and architectural thinking only, and does not represent the views, strategies, products, or business interests of the author’s employer or any affiliated organization. All rights reserved by the author.

MRC and zettascale topology diagram showing that breakout keeps total bandwidth constant while increasing per-plane radix and reducing Clos depth.
Breakout keeps per-GPU bandwidth constant, but changes fabric shape: higher effective radix can keep important AI cluster sizes in fewer Clos layers.

The previous article, MRC and the Future of AI Networking, framed MRC as a way to split one large transport domain into multiple reliable lanes, distributing pressure across paths, buffers, and switching resources.

This article starts from that transport view and asks a topology question: what does MRC-style breakout change when AI clusters grow toward zettascale?

The magic of MRC: rephrasing scaling

Most people explain MRC from the transport side: multipath reliable connection, packet spraying, out-of-order placement, congestion control, failure handling, and better load balancing. That is right, but incomplete.

MRC is not only a transport improvement. It changes how we think about GPU cluster scaling. Traditionally, one GPU with one 800G NIC looks like one big pipe:

GPU -> 800G NIC -> Network

With MRC-style breakout, the same 800G NIC can become multiple coordinated rails:

GPU -> 800G NIC -> 200G rail 0 -> 200G rail 1 -> 200G rail 2 -> 200G rail 3

Or it can be exposed as two 400G rails, or eight 100G rails. The total bandwidth is still 800G, but the topology is no longer the same.

Breakout does not save switches first. Breakout saves Clos layers. Clos layers determine hop count. Hop count determines latency.

The first question: 1 x 800G, 2 x 400G, 4 x 200G, or 8 x 100G?

Suppose each GPU has one 800G NIC. That NIC can be exposed in several ways:

1 x 800G = 800G per GPU 2 x 400G = 800G per GPU 4 x 200G = 800G per GPU 8 x 100G = 800G per GPU

At the bandwidth level, they look the same. At the topology level, they are very different.

Using a 51.2T switch ASIC as the building block, the same ASIC can be viewed as:

64 x 800G ports 128 x 400G ports 256 x 200G ports 512 x 100G ports

When we lower the per-plane link speed, the same switch ASIC gives more ports. More ports means higher radix. Higher radix means a larger fabric plane can stay flat before it needs another Clos layer.

This is the first-principles reason breakout matters. It is not only a port-mode decision. It is a topology-depth decision.

Maximum scalability for each breakout mode

For simplicity, assume a 51.2T switch ASIC, one 800G NIC per GPU, a 1:1 nonblocking fabric target, parallel or bundled links between switch stages, and bandwidth-minimal switch count rather than vendor SKU-specific construction.

One-layer max GPUs ~= R Two-layer max GPUs ~= R^2 / 2 Three-layer max GPUs ~= R^3 / 4

Here R is the port count of one 51.2T switch at the per-plane link speed.

Maximum GPU scalability by breakout mode

Breakout mode Planes Per-plane speed Ports per switch One-layer max GPUs Two-layer max GPUs Three-layer max GPUs
1 x 800G1800G64642,04865,536
2 x 400G2400G1281288,192524,288
4 x 200G4200G25625632,7684,194,304
8 x 100G8100G512512131,07233,554,432

This table is the core of the article. It shows how four breakout choices create four different topology envelopes. The better question is not which breakout is universally best. The better question is: at my target GPU count, which breakout keeps the fabric at the shallowest practical Clos depth while keeping plane count and operations manageable?

Switch count is determined by Clos layer

For a fixed number of GPUs, fixed 800G bandwidth per GPU, fixed oversubscription ratio, and fixed Clos layer count, total switch bandwidth requirement is mostly fixed.

Total endpoint bandwidth = GPU count x 800G

If the fabric is nonblocking, the network must carry this bandwidth. So if we compare 1 x 800G, 2 x 400G, 4 x 200G, and 8 x 100G under the same Clos depth, total switch count is roughly the same.

Breakout does not directly save switch bandwidth. Breakout does not directly save switches if Clos layer is fixed. But: Breakout may reduce the required Clos layer.

Once the required Clos layer changes, switch count, latency, and operational complexity change. That is where MRC breakout becomes powerful.

Three GPU cluster sizes: 512, 16K, and 262K GPUs

Let us compare three cluster sizes under the same simplified assumptions: one 800G NIC per GPU, a 51.2T switch ASIC, and a 1:1 nonblocking fabric target.

Small cluster: 512 GPUs

At 512 GPUs, breakout choice already matters. For 1 x 800G, 2 x 400G, and 4 x 200G, one switch plane does not have enough radix to connect all 512 GPUs directly.

1 x 800G one-layer max = 64 GPUs 2 x 400G one-layer max = 128 GPUs 4 x 200G one-layer max = 256 GPUs

So these modes need a two-layer Clos. But 8 x 100G is different: one 100G plane switch can connect all 512 GPUs directly. Since 8 x 100G has eight planes, the design needs one switch per plane, or eight switches total.

512-GPU simplified topology comparison

Breakout mode Planes Required topology Total switches Saving vs 1 x 800G Meaning
1 x 800G12-layer Clos24baseline64-port radix is too small for one layer.
2 x 400G22-layer Clos240128-port radix is still too small.
4 x 200G42-layer Clos240256-port radix is still too small.
8 x 100G81-layer per plane816 / 66.7%512-port radix exactly fits 512 GPUs.

At small scale, 8 x 100G saves many switches because it turns the fabric into single-stage planes. The tradeoff is eight rails per GPU and eight planes to manage.

Middle-large cluster: 16,384 GPUs

At 16K GPUs, the story changes. A 1 x 800G fabric plane cannot stay in two layers with a 51.2T switch. 2 x 400G is better, but still not enough. 4 x 200G and 8 x 100G can stay in two layers.

16,384-GPU simplified topology comparison

Breakout mode Planes Required topology Total switches Saving vs 1 x 800G Meaning
1 x 800G13-layer Clos1,280baselineOne large 800G plane lacks radix.
2 x 400G23-layer Clos1,2800Better, but still not enough.
4 x 200G42-layer Clos768512 / 40%Saves one Clos layer.
8 x 100G82-layer Clos768512 / 40%Same switch saving, but twice the planes.

This is the most important middle-scale case. It does not win because it saves bandwidth. It wins because it keeps the fabric in two layers.

1 x 800G or 2 x 400G three-layer design: 1,280 switches 4 x 200G or 8 x 100G two-layer design: 768 switches Switch saving = 1,280 - 768 = 512 switches Saving ratio = 512 / 1,280 = 40%

The difference between 4 x 200G and 8 x 100G is not switch count at this size. Both stay in two layers. Both use 768 switches in this simplified model. The difference is plane count and operational complexity.

4 x 200G: 4 planes 4 rails per GPU 768 switches 8 x 100G: 8 planes 8 rails per GPU 768 switches

At middle-large scale, the best breakout is the one that saves the Clos layer without creating unnecessary plane complexity.

Really large cluster: 262,144 GPUs

At 262K GPUs, the design enters another regime. 1 x 800G simply runs out of radix under the simplified three-layer model.

1 x 800G three-layer max ~= 65,536 GPUs Target cluster size = 262,144 GPUs Therefore: 1 x 800G cannot support 262K GPUs in a simple three-layer model.

It would require a deeper hierarchy, multiple fabric blocks, or another architectural workaround. Breakout changes the result.

2 x 400G three-layer max ~= 524,288 GPUs 4 x 200G three-layer max ~= 4,194,304 GPUs 8 x 100G three-layer max ~= 33,554,432 GPUs

262,144-GPU simplified topology comparison

Breakout mode Planes Required topology Total switches Meaning
1 x 800G1More than 3-layer / multiple blocksNot comparable800G radix runs out.
2 x 400G23-layer Clos20,480Fits, but only two planes.
4 x 200G43-layer Clos20,480Fits with moderate plane count.
8 x 100G83-layer Clos20,480Fits with smaller planes, but more operations.

At this scale, breakout is no longer just optimization. It becomes the reason the fabric can still fit into a reasonable three-layer design.

Why fewer Clos layers mean lower latency

Topology matters because every Clos tier adds hops.

A one-layer plane is direct:

GPU NIC -> switch -> GPU NIC

A two-tier path looks like this:

GPU NIC -> leaf -> spine -> leaf -> GPU NIC

A three-tier path looks like this:

GPU NIC -> leaf -> aggregation -> core -> aggregation -> leaf -> GPU NIC

A deeper hierarchy adds even more forwarding stages. More switch stages mean more forwarding latency, more serialization points, more queueing opportunities, more congestion interaction, more failure surfaces, and more telemetry complexity.

For AI training and inference, this is especially important because the network is inside the computation loop. All-reduce, reduce-scatter, all-gather, MoE all-to-all, tensor-parallel synchronization, pipeline bubbles, checkpoint movement, KV-cache movement, and distributed inference routing all feel topology.

These workloads do not only care about average bandwidth. They care about tail latency. A single slow communication phase can delay the whole training step or inference pipeline.

Economics and performance of breakout MRC

MRC breakout is not only a way to build larger clusters. It is a way to improve the economics and performance of useful bandwidth.

For a fixed GPU cluster, total endpoint bandwidth may be the same:

N GPUs x 800G per GPU

But the economic result can be very different because topology is different. A deeper Clos fabric costs more switch stages, optics, cables, racks, power, cooling, failure domains, operational complexity, and latency.

If 8 x 100G keeps a 512-GPU cluster in one layer instead of two, the benefit is direct:

24 switches -> 8 switches 66.7% switch saving lower latency simpler hop path

If 4 x 200G or 8 x 100G keeps a 16K-GPU cluster in two layers instead of three, the benefit is also direct:

1,280 switches -> 768 switches 40% switch saving one fewer Clos layer lower end-to-end latency

If breakout keeps a 262K-GPU cluster inside a three-layer design while 1 x 800G runs out of radix, the benefit is more fundamental: breakout becomes required for practical topology scaling.

But maximum breakout is not automatically best. Too few planes may not provide enough radix. Too many planes increase NIC rails, cabling, MRC path state, packet reordering pressure, congestion-control complexity, telemetry objects, and debugging surface.

The best design balances scalability, latency, switch count, plane count, operations, failure isolation, transport complexity, and cost.

Conclusion

MRC is often introduced as a scalability feature. That is true. But the deeper point is that MRC also becomes a topology, latency, and economics feature.

Once MRC can recombine multiple physical rails into one reliable transport abstraction, breakout becomes a topology design tool.

MRC enables multipath rails Breakout increases effective radix Higher radix expands the flat-fabric envelope A larger flat-fabric envelope can reduce Clos depth Lower Clos depth reduces hop count Lower hop count reduces latency Lower latency improves AI scaling efficiency

The benefit changes by scale. At 512 GPUs, 8 x 100G can save many switches by avoiding Clos entirely. At 16K GPUs, 4 x 200G and 8 x 100G can both save one Clos layer versus 1 x 800G or 2 x 400G. At 262K GPUs, 1 x 800G cannot fit the simplified three-layer model, so breakout becomes necessary for practical zettascale topology.

The real magic is not simply that MRC gives more paths. The real magic is that MRC changes the topology choices available to the system.

In zettascale AI infrastructure, the network is no longer just a packet delivery system. The network becomes part of the distributed computer itself. In that computer, the right breakout choice depends on scale, latency, cost, and operational complexity.

References