As AI workloads scale to thousands of GPUs, the network becomes the critical backbone for performance and ROI. At Juniper Networks’ latest AI Infrastructure Field Day, we detailed how Juniper’s self-optimizing Ethernet fabric and advanced load balancing innovations enable AI clusters to run at peak efficiency, even as traffic patterns and congestion challenges become increasingly complex.
Unlike traditional data center traffic, AI/ML training workloads generate a small number of high-bandwidth (low-entropy), long-lived, and bursty RDMA flows. These flows are tightly synchronized due to the nature of distributed training (e.g., data parallelism), where GPUs train in lockstep through compute and gradient synchronization phases. A single delayed flow can render all GPUs idle, thereby impacting job completion time—a key metric for cluster efficiency.
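The lockstep effect can be seen with a toy calculation (all values illustrative): per-iteration time is gated by the slowest synchronization flow, not the average.

```python
# Toy illustration: in synchronized data-parallel training, each iteration is
# gated by the slowest gradient-sync flow, so one delayed flow idles every GPU.
# All numbers are illustrative.
compute_ms = 10.0
sync_ms = [2.0, 2.0, 2.0, 9.0]   # one flow hit by congestion

iteration_ms = compute_ms + max(sync_ms)
print(iteration_ms)  # 19.0 ms per iteration instead of 12.0 -> longer JCT
```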
Figure 1: Traditional LB is not sufficient for RoCEv2 traffic
Traditional Ethernet was designed around Transmission Control Protocol’s (TCP) in-order delivery, leveraging static ECMP (Equal Cost Multipath) to hash flows and pin each one to a path. This approach works well for entropy-rich, mixed-traffic environments; however, in AI clusters with few, long-lived flows, random distribution leads to congestion hotspots and unpredictable performance. Even with 1:1 oversubscription and rail-optimized topologies, congestion can occur when a leaf hashes too many flows onto one leaf-spine link, or when multiple leaves independently overload the same spine link, creating imbalances and congestion hotspots.
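The failure mode is easy to reproduce with a minimal sketch (in Python, with illustrative addresses, ports, hash function, and a four-way uplink count) of how a static hash pins a handful of long-lived flows:

```python
# Minimal sketch: why static ECMP struggles with a few low-entropy flows.
# Hashes each flow's 5-tuple onto one of 4 uplinks, as a leaf switch would.
# Addresses, ports, and the hash function are illustrative assumptions.
import random
import zlib

UPLINKS = 4

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto="UDP"):
    """Hash a 5-tuple to a fixed uplink, the way static ECMP pins flows."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    return zlib.crc32(key) % UPLINKS

# Eight long-lived RDMA flows (RoCEv2 uses UDP destination port 4791); with
# so few flows, random hashing frequently piles several onto the same uplink.
random.seed(7)
flows = [(f"10.0.0.{i}", f"10.0.1.{i}", random.randint(49152, 65535), 4791)
         for i in range(8)]

load = [0] * UPLINKS
for f in flows:
    load[ecmp_path(*f)] += 1  # each flow runs near line rate, so counts ≈ load

print("flows per uplink:", load)  # uneven counts = congestion hotspots
```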
The problem with current proprietary solutions
To address these challenges, some vendors have introduced packet spraying via specialized “SuperNICs” or distributed scheduled fabrics with deep-buffer leaf switches. Both approaches are expensive, proprietary, and power-hungry.
SuperNIC-based solutions rely on packet spraying in the fabric, so the packets of each flow may be sent down multiple different ECMP paths. While this maximizes load balancing across the fabric, it results in packets arriving out-of-order (OOO) at the destination. These solutions then rely on SuperNICs at the destination to handle OOO packet arrival and place the packets in the correct location in GPU memory. These SuperNICs add significant cost and consume more power per server. For example, BlueField-3 Network Interface Cards (NICs) consume 400W more power per GPU server relative to standard GPU NICs.
Distributed Scheduled Fabric solutions take a different approach by breaking RoCE v2 packets into smaller cells at the ingress leaf, adding sequence numbers to each cell. These cells are then sprayed across the spine layer and reassembled at the egress leaf. However, this architecture relies heavily on deep-buffer leaf devices to manage reassembly, which significantly increases the overall cost of the solution.
Figure 2: Current load balancing methods rely on packet spraying for max performance
Juniper’s open, standards-based AI load balancing
Juniper’s approach is fundamentally different from any other vendor’s: an open, standards-based Ethernet solution that dynamically adapts to congestion and load. It combines three key load balancing techniques: Dynamic Load Balancing (DLB), Global Load Balancing (GLB), and the new RDMA-aware Load Balancing (RLB).
1. Dynamic Load Balancing (DLB)
DLB enhances static ECMP by tracking link quality based on link utilization and buffer pressure at microsecond granularity, making real-time forwarding decisions at the first leaf switch. DLB operates in two modes:
- Flowlet Mode: Breaks long-lived flows into sub-flows (flowlets) based on configurable pauses, allowing for fairer distribution without out-of-order delivery. This is ideal for AI workloads, which naturally experience compute and sync pauses between bursts.
- Packet Mode: Classic packet spraying, distributing packets across all ECMP paths while factoring in real-time link quality. This achieves the highest throughput but requires NICs that can handle OOO packets.
Figure 3: Dynamic Load Balancing
For example, in Figure 3, if GPU G2 is sending traffic to G6, Leaf L1 maintains a DLB table tracking the link quality of each of the ECMP paths. There are eight quality bands, but three are shown for simplicity. Based on the best link quality, it forwards the flowlet to spine S2, minimizing hotspots that lead to congestion.
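A minimal sketch of this decision logic, assuming eight quality bands, an illustrative inactivity threshold, and a simple scoring function (the real measurements and timers are platform-specific):

```python
# Minimal sketch of DLB-style flowlet forwarding; the quality scoring,
# band count, and inactivity threshold are illustrative assumptions.
QUALITY_BANDS = 8
INACTIVITY_GAP_US = 100  # assumed flowlet inactivity threshold

def quality_band(utilization, buffer_pressure):
    """Map link utilization and buffer pressure (0..1) to a band;
    a higher band means better link quality."""
    return int((1.0 - max(utilization, buffer_pressure)) * (QUALITY_BANDS - 1))

class DlbTable:
    def __init__(self, paths):
        self.paths = paths          # spine next-hops in the ECMP group
        self.flowlets = {}          # flow id -> (path, last_seen_us)

    def forward(self, flow_id, now_us, link_stats):
        path, last_seen = self.flowlets.get(flow_id, (None, None))
        # Keep the current path unless the flow paused long enough to start
        # a new flowlet; staying put preserves in-order delivery.
        if path is None or now_us - last_seen > INACTIVITY_GAP_US:
            path = max(self.paths, key=lambda p: quality_band(*link_stats[p]))
        self.flowlets[flow_id] = (path, now_us)
        return path

table = DlbTable(["S1", "S2", "S3", "S4"])
stats = {"S1": (0.9, 0.7), "S2": (0.2, 0.1), "S3": (0.5, 0.3), "S4": (0.6, 0.4)}
print(table.forward("G2->G6", now_us=0, link_stats=stats))  # S2: best band
```

In packet mode, the same quality lookup would simply run per packet instead of per flowlet.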
In Juniper’s AI Lab tests using a standardized ML benchmark (DLRM) and NCCL AllReduce tests, three approaches were compared: Static ECMP, DLB-Flowlet Mode and DLB-Packet Mode. When testing with DLB-Packet mode, we used a NIC capable of handling OOO packet arrivals.
Figure 4 shows DLRM Job Completion Time (JCT, lower is better) for each load balancing technique under increasing loads. We used RoCEv2-capable traffic generators to create background load. The results indicate:
- With Static ECMP, JCT degrades rapidly with higher congestion in the network.
- DLB (Flowlet Mode) delivered highly deterministic JCT, nearly matching the performance of packet spraying but without the need for special NICs.
- DLB (Packet Mode) delivered the lowest JCT, but beat flowlet mode by a mere 0.08%!
Figure 4: MLCommons DLRM Test results comparing DLB flowlet, packet mode and ECMP with background traffic
Figure 5 illustrates NCCL AllReduce test results showing bus bandwidth in GB/s (higher is better) for each load balancing technique under increasing load. We generated background load using RoCEv2-capable traffic generators. Again, ECMP performs poorly under load, while both DLB-Flowlet and DLB-Packet deliver good results. While static ECMP is clearly undesirable for AI workloads, it is also notable that DLB-Flowlet delivered cost-effective, high performance without the need for proprietary, costly gear.
Figure 5: NCCL tests all reduce performance tests comparing ECMP and DLB
Note: PXN was disabled to route more traffic through the leaf-spine links
Figure 6 further illustrates the effectiveness of DLB in evenly utilizing the available fabric capacity. The goal of effective load balancing is to minimize the polarization of traffic on the leaf uplinks; imbalance leads to some links experiencing congestion and traffic drops while others remain underutilized. Each line in the graphs below shows the utilization (Gbps) of one spine uplink on a leaf during a DLRM run. With static ECMP, traffic distribution across a leaf’s spine uplinks is random, with some links utilized far more than others. In contrast, with DLB, all links are utilized equally, even in flowlet mode.
Figure 6: ECMP vs DLB link utilization graph on leaf-spine links on a leaf
2. Global Load Balancing (GLB)
With GLB, spines continuously monitor the quality of their local links to all connected leaves and advertise this link quality information to all other connected leaves. The leaves then combine the received link quality data from the spines with their own local DLB measurements to make more informed load balancing decisions. This enables upstream switches to proactively avoid congested downstream paths and select better end-to-end routes for traffic flow. The approach is similar to how Google Maps uses real-time traffic conditions to recommend the most efficient routes.
Figure 7: Global Load Balancing
For example, when GPU G2 sends traffic to G6, Leaf L1 not only tracks local link quality but also uses the GLB updates to track remote link quality at the spines, giving it a global view. It factors in both and forwards the flowlet to Spine S3, the best end-to-end path, for the most effective load balancing.
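One plausible way to combine the two measurements, sketched below, is to treat the worse hop as the bottleneck and take the minimum band; the actual combining function is an implementation detail, and all band values here are illustrative:

```python
# Minimal sketch of GLB-style selection at leaf L1: combine the local DLB
# band for each leaf->spine link with the band the spine advertises for its
# link down to the egress leaf. Band values are illustrative.
def end_to_end_quality(local_band, remote_band):
    return min(local_band, remote_band)  # the worst hop bounds the path

local = {"S1": 6, "S2": 7, "S3": 5}          # L1's own DLB bands (7 = best)
remote_to_l3 = {"S1": 2, "S2": 3, "S3": 5}   # bands advertised via GLB

best = max(local, key=lambda s: end_to_end_quality(local[s], remote_to_l3[s]))
print(best)  # S3: end-to-end band 5 beats S2's 3, despite S2's better local link
```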
Figure 8: NCCL AllReduce performance results with 30% extra traffic from traffic generator over a spine-leaf link
As shown in the graph above, in a test where we introduced 30% extra traffic on one spine’s spine-leaf link, DLB + GLB rebalanced traffic around the congested link to other spines. In contrast, DLB alone and ECMP continued to send traffic to the congested spine. As a result, GLB achieved higher performance than either DLB alone or ECMP.
3. RDMA-aware Load Balancing (RLB)
RLB introduces deterministic path forwarding for AI workloads using Juniper Networks’ BGP-DPF (Deterministic Path Forwarding) combined with advanced software integration. This approach ensures that traffic flows follow pre-determined paths through the network fabric during steady-state operation while maintaining robust fallback mechanisms for failure scenarios.
Deterministic flow identification
Breaking a single large RoCEv2 flow into multiple sub-flows adds entropy and helps distribute the traffic fairly. Typically, the transfer is disaggregated across multiple memory regions, each associated with a queue pair (QP) on both the sender and receiver ends, and each QP uses a random UDP source port.
Figure 9: Single RDMA QP vs Multiple RDMA QPs
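A minimal sketch of the entropy gain (reusing the illustrative hash and uplink count from the earlier ECMP sketch):

```python
# Minimal sketch: splitting one RDMA transfer across multiple QPs restores
# hash entropy in the fabric. Addresses and port ranges are illustrative.
import random
import zlib

UPLINKS = 4

def ecmp_path(src_ip, dst_ip, src_port, dst_port):
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},UDP".encode()
    return zlib.crc32(key) % UPLINKS

def uplinks_used(num_qps):
    """Distinct uplinks hit when one transfer is spread over num_qps QPs,
    each with its own random UDP source port (RoCEv2 dst port 4791)."""
    ports = random.sample(range(49152, 65536), num_qps)
    return len({ecmp_path("10.0.0.1", "10.0.1.1", p, 4791) for p in ports})

random.seed(7)
print(uplinks_used(1))  # one QP -> one uplink, always
print(uplinks_used(8))  # eight QPs typically spread across most uplinks
```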
RLB introduces determinism by assigning an IP address to each QP. This enables BGP-based traffic engineering to deterministically assign paths and perfectly distribute RDMA flows.
This converts the load balancing problem into a deterministic routing problem, enabling precise, policy-driven traffic engineering. Each RDMA subflow (also known as a QP) can be mapped to a specific path through the fabric, eliminating LB-related randomness and ensuring consistent high performance and in-order delivery.
The solution ensures that each flow follows a pre-determined path through the network fabric (a conceptual sketch follows the list below):
- RDMA traffic flows are mapped to deterministic paths based on BGP policies defined at the control plane.
- This mapping eliminates variability in flow distribution due to current load balancing methods, ensuring predictable performance across all workloads.
- If a link failure occurs within the fabric, RLB falls back to using DLB, ensuring that jobs continue while routing around the failure.
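The sketch below is conceptual, not BGP-DPF configuration syntax; the per-QP addresses and the round-robin pinning policy are hypothetical stand-ins for what the control plane would express as BGP policy:

```python
# Conceptual sketch of RLB-style determinism: each QP gets its own source IP,
# and the control plane installs a fixed spine for each of those IPs, so
# forwarding needs no hashing at all. Names and addresses are hypothetical.

SPINES = ["S1", "S2", "S3", "S4"]

# Control plane: assign QP i the source IP 10.1.0.i and pin it, round-robin,
# to one spine. In practice this is expressed as BGP policy, not a dict.
qp_source_ip = {i: f"10.1.0.{i}" for i in range(8)}
pinned_path = {qp_source_ip[i]: SPINES[i % len(SPINES)] for i in range(8)}

def forward(src_ip, dlb_fallback):
    """Steady state: follow the pinned path. On link failure, the pinned
    route is withdrawn and forwarding falls back to DLB."""
    return pinned_path.get(src_ip) or dlb_fallback(src_ip)

print(forward("10.1.0.5", dlb_fallback=lambda ip: "S1"))  # -> S2, every time
```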
Figure 10: Flow disaggregation and placement in RLB
Lab results
Performance variability observed through benchmarking tools is a strong indicator of how a load balancing strategy will perform in deployment with real AI traffic. Highly deterministic performance (low variability) is desirable to ensure an AI fabric performs predictably and to maximize performance in multi-tenant environments.
In Juniper’s AI Lab tests using NCCL tests and traffic generators, the three approaches were compared again; this time, each test was run three times and the results of each run were recorded:
- ECMP had the most variability due to the random distribution of flows, with measured bus bandwidths of 196 GB/s, 263 GB/s, and 373 GB/s.
- DLB reduced this variability band considerably (353 GB/s to 373 GB/s), a strong result, though some variability remains.
- RLB achieved maximum performance deterministically in every single run: 373 GB/s every time!
Figure 11: Performance Variance across different test runs among ECMP, DLB and RLB
The reason for RLB’s consistent high performance is evident from the following graphs.
- ECN is an effective congestion control method: when a switch experiences congestion on a port queue, it marks a bit in the packet header. Upon detecting the ECN mark, the receiving NIC sends a Congestion Notification Packet (CNP) to the sender, which throttles its transmit rate to alleviate the congestion (sketched after this list).
- When we compared congestion control activity between DLB and RLB, no CNPs or PFC pauses were reported across multiple RLB runs, indicating no congestion; in contrast, DLB generated a few CNP packets.
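A minimal sketch of that reaction loop (a DCQCN-style scheme; the marking threshold and rate-cut factor are illustrative, not the values any particular switch or NIC uses):

```python
# Minimal sketch of the ECN/CNP loop: switch marks, receiver notifies,
# sender backs off. Threshold and rate cut are illustrative assumptions.
ECN_THRESHOLD = 0.8  # mark packets when queue occupancy exceeds 80%

def switch_forward(packet, queue_occupancy):
    if queue_occupancy > ECN_THRESHOLD:
        packet["ecn"] = True  # congestion experienced
    return packet

def sender_on_cnp(rate_gbps, cut=0.5):
    return rate_gbps * cut  # throttle, then recover gradually over time

pkt = switch_forward({"seq": 1}, queue_occupancy=0.9)
rate = 400.0
if pkt.get("ecn"):          # receiver sees the mark and sends a CNP...
    rate = sender_on_cnp(rate)  # ...so the sender cuts its rate
print(rate)  # 200.0: the sender has halved its rate after one CNP
```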
Figure 12: ECN CNP packets generated (Congestion Indicator)
- RLB delivered every packet in order: RoCE NICs reported zero out-of-order packets for RLB.
- DLB can still generate some OOO packets as it rebalances flowlets onto newer, better paths; reordering can occur when the inactivity gap is shorter than the latency difference between the old and new paths. Fine-tuning the DLB inactivity timer may be required to eliminate DLB OOO packets.
Figure 13: RLB achieves zero out-of-order packets compared to a few in DLB
AI Load Balancing: three innovations, one goal—maximize network performance
Juniper’s AI Load Balancing (AI-LB) suite—Dynamic Load Balancing (DLB), Global Load Balancing (GLB), and RDMA-aware Load Balancing (RLB)—delivers a unified strategy to overcome the unique challenges of AI networking. DLB responds to real-time congestion at the first hop, intelligently steering flowlets or packets based on microsecond-level link quality. GLB adds a global view, allowing switches to avoid downstream congestion before it occurs. RLB takes it even further, enabling deterministic paths for RDMA flows through BGP-driven control and completely eliminating out-of-order delivery and performance variability.
Because one size doesn’t fit all in AI networking, Juniper’s AI-LB techniques—DLB, GLB, and RLB—are designed to be used individually or in combination, depending on your unique requirements. The right mix depends on what level of performance is “good enough” for your application, the effort you’re willing to invest in tuning for optimal results, the scale of your cluster, and the characteristics of your AI model, particularly the type of collective communication it relies on. Whether you’re prioritizing simplicity with minimal tuning or targeting deterministic, maximum throughput performance at scale, AI-LB provides the flexibility to align your network strategy with your workload priorities.
The bottom line
As AI clusters scale, network performance determinism, efficiency, and openness become non-negotiable. While proprietary SuperNICs and distributed scheduled fabrics attempt to mask Ethernet’s perceived limitations with expensive, power-hungry workarounds, Juniper takes a fundamentally different approach.
With AI Load Balancing solutions—including Dynamic (DLB), Global (GLB), and RDMA-aware Load Balancing (RLB)—Juniper delivers predictable, congestion-free performance using open, standards-based technologies like BGP and Ethernet. AI-LB isn’t just a smarter load balancing technique—it’s a clear rejection of the myth that AI networking requires closed, proprietary ecosystems.
Our approach is simple and powerful:
- No vendor lock-in.
- No exotic hardware.
- Just smarter load balancing that maximizes GPU utilization.
Stop overpaying for complexity. Don’t tie your infrastructure to a single vendor’s roadmap.
Build for performance, scale, and freedom with Juniper.