We all know to avoid rush hour whenever we can, but in an AI training cluster, it’s always rush hour. In our previous blog on AI cluster design with rail optimization in Juniper Apstra, we looked at how NVIDIA prescribes a topology to minimize latency and congestion between graphics processing units (GPUs). Their rail locality technology, along with this server cabling topology at the data center access layer, keeps more traffic in the “stripe” of servers connected by a group of leaf devices and away from the leaf-to-spine fabric links.
Staying on the smaller streets often works for local drives, but it’s not a solution for longer distances, where the highway is always faster. So, too, must inter-stripe traffic cross the mega highways of the data center Clos leaf-spine fabric. And in training clusters, this traffic is always intense.
In this blog, we look at how Juniper Networks optimizes the ever-present high traffic load in these parts of the Clos fabrics to avoid the slowing effects of congestion and packet drops that would cost our customers GPU training cycles and time to market.
Designing for the worst case is only the beginning
AI servers’ GPU-to-GPU fabrics are built with 200G and 400G access speeds. This is not your usual cloud data center. Expensive training clusters are built to train, and they do, nonstop. Training epochs and jobs are broken down into smaller tasks that use multiple parallelization techniques to keep the GPUs working near 100% utilization. As a result, they also continually drive the network at nearly 100% utilization.
Such networks are built as Clos topologies with multiple equidistant paths from each source to the destination. The leaf-spine fabric is designed without oversubscription: each leaf has as much capacity facing the spines as it has facing the servers. In such designs, at first glance, there’s enough total capacity in the fabric to serve the demand of all access ports at full speed. However, congestion still occurs with incast, when multiple sources overload a single destination. The other cause of congestion is load balancing.
Isn’t balancing the operative word? In theory, yes. In reality, it can do just the opposite.
The commonplace load balancing configuration is equal-cost multipath (ECMP). The path from one leaf to another can go through any spine device in a Clos fabric, so ECMP load balancing distributes traffic across all spines. If we picked a spine for every packet by iterating through the next hops in a loop (round robin), we would distribute the traffic load almost perfectly, not accounting for minor differences in packet size. The same mechanism applies from a spine down to a destination leaf when multiple links are present. However, this so-called packet spraying will certainly result in packets arriving out of order at the destination, which is intolerable to aspects of AI training workloads and most other Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCEv2) traffic. As our Senior Director of Product Management, Dmitry Shokarev, has explained and demonstrated, more forgiving subtypes of RDMA traffic can tolerate reordering when it is performed with hardware assist on the network interface controller (NIC).
To preserve packet ordering, each flow is randomly hashed to select one of the next hops. And this is where the trouble creeps in because, unlike in cloud data centers where there are many flows of varying sizes and durations, AI model training produces few flows of long duration and immense bandwidth needs.
Random hashing spreads the flows unevenly: the number of flows that land on each link follows a binomial distribution, which is approximately Poisson. To most people, it’s fishy to fathom that a random hash function placing m flows across n equal-probability next-hop links can cause huge hotspots (congestion) and cold spots (little or no traffic). Imagine that if n=m, we end up with roughly 36% of the links totally unused (the expected fraction is (1 − 1/n)^n, which approaches 1/e ≈ 36.8% as n grows)! Luckily, we have more flows than links, with at least as many flows as GPUs even in the theoretical worst case, and several times that many in practice. Still, the so-called elephant flows are huge, so when hashing results in a link that doesn’t have enough bandwidth to handle one more elephant… let’s just say that elephant’s favorite part of the (Clos “fat”) tree is not the trunk.
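To make the cold-spot arithmetic concrete, here is a minimal Python sketch that throws flows at links uniformly at random and compares the measured share of idle links with the closed form (1 − 1/n)^m. The function names and the choice of 32 links are illustrative assumptions, not anything taken from a switch implementation, and a real ECMP hash works on packet header fields rather than a random number generator.

```python
import random
from collections import Counter

def hash_flows(num_links, num_flows, seed=0):
    """Place each flow on one link uniformly at random; return flows per link."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(num_links) for _ in range(num_flows))
    return [counts.get(link, 0) for link in range(num_links)]

def unused_fraction(num_links, num_flows, trials=2000, seed=0):
    """Estimate the average fraction of links that carry no flow at all."""
    rng = random.Random(seed)
    idle = 0
    for _ in range(trials):
        used = {rng.randrange(num_links) for _ in range(num_flows)}
        idle += num_links - len(used)
    return idle / (trials * num_links)

def expected_unused(num_links, num_flows):
    """Closed form: probability that a given link is missed by every flow."""
    return (1 - 1 / num_links) ** num_flows

if __name__ == "__main__":
    n = 32  # hypothetical number of equal-cost next hops
    for m in (n, 2 * n, 4 * n):
        per_link = hash_flows(n, m)
        print(f"{m} flows on {n} links: "
              f"simulated idle={unused_fraction(n, m):.3f}, "
              f"closed form={expected_unused(n, m):.3f}, "
              f"busiest link carries {max(per_link)} flows")
```

With n = m = 32 the closed form comes out at about 0.36, and adding more flows shrinks the share of idle links while the busiest link still ends up carrying more than its fair share.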
Dynamic load balancing
Dynamic or adaptive load balancing (DLB/ALB) is a newer take on ECMP flow-based load balancing that takes next-hop path quality as an input when placing flows. It prefers colder links over hotter links. In other words, it actually takes link load into consideration. Moreover, DLB can re-path flows. To avoid causing out-of-order packets when re-pathing, the packet forwarding engine tracks what are called flowlets: bursts of packets within a flow that arrive without a pause longer than a configured inactivity interval. A flow is moved to a different link only at a flowlet boundary, when the pause has been long enough that packets already in flight on the old path should arrive before the first packet sent on the new one.
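The flowlet idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how a packet forwarding engine implements DLB: the class name, the use of cumulative bytes as the only link-quality signal, and the abstract time units are all hypothetical.

```python
class FlowletBalancer:
    """Toy flowlet-aware load balancer (illustrative only)."""

    def __init__(self, num_links, inactivity_interval):
        # Cumulative bytes per link stand in for real link-quality metrics.
        self.load = [0] * num_links
        self.inactivity_interval = inactivity_interval
        self.flows = {}  # flow_id -> (current link, time of last packet)

    def forward(self, flow_id, now, size):
        """Return the link chosen for a packet of `size` bytes at time `now`."""
        entry = self.flows.get(flow_id)
        if entry is None or now - entry[1] > self.inactivity_interval:
            # New flow, or a pause long enough to start a new flowlet:
            # it is safe to (re)place the flow on the coldest link.
            link = min(range(len(self.load)), key=self.load.__getitem__)
        else:
            # Same flowlet: stay on the current link to preserve ordering.
            link = entry[0]
        self.flows[flow_id] = (link, now)
        self.load[link] += size
        return link

if __name__ == "__main__":
    lb = FlowletBalancer(num_links=2, inactivity_interval=10)
    print(lb.forward("a", now=0, size=100))   # first packet of flow a
    print(lb.forward("a", now=5, size=100))   # same flowlet, same link
    print(lb.forward("b", now=6, size=50))    # new flow takes the colder link
    print(lb.forward("a", now=30, size=100))  # pause > interval, may be re-pathed
```

The design choice worth noticing is that re-pathing decisions happen only when the pause exceeds the inactivity interval, which is what lets the balancer react to load without reordering packets inside a burst.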
It’s also worth noting that DLB reacts faster to link failures. It also creates diversity of hashing across devices. This diversity is important because if hashing were consistent across devices with a consistent number of next-hop paths, then a leaf sending flows to a spine would result in the spine hashing those same flows to a single next-hop path instead of distributing them. This hashing polarization is avoided with DLB.
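Hashing polarization is easy to reproduce in a small Python sketch. The assumptions here are loud ones: flows are plain integers, the hash is SHA-256 over a made-up seed and flow ID, and a per-device seed stands in for "diversity of hashing across devices". DLB itself gets its diversity from load-aware path selection rather than from seeds, so treat this only as an illustration of the failure mode.

```python
import hashlib

def pick(flow, num_paths, seed):
    """Hash a flow ID with a device seed and map it to a next-hop index."""
    digest = hashlib.sha256(f"{seed}:{flow}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_paths

def spine_paths_used(flows, num_paths, leaf_seed, spine_seed):
    """Take the flows a leaf sends to spine 0 and return which of the
    spine's next hops those flows are hashed onto."""
    via_spine0 = [f for f in flows if pick(f, num_paths, leaf_seed) == 0]
    return {pick(f, num_paths, spine_seed) for f in via_spine0}

if __name__ == "__main__":
    flows = range(1000)  # hypothetical flow identifiers
    # Identical hashing on leaf and spine: every flow lands on one next hop.
    print(spine_paths_used(flows, 4, leaf_seed=0, spine_seed=0))
    # Different hashing per device: the same flows spread across next hops.
    print(spine_paths_used(flows, 4, leaf_seed=0, spine_seed=1))
```

When both devices hash identically, every flow that reached spine 0 did so because its hash mapped to index 0, so the spine's identical hash maps all of them to index 0 again and the other next hops stay idle.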
With DLB, congestion is essentially avoided until a flow comes along that cannot fit on any link, and by that time, it’s likely that the fabric is already close to its overall carrying capacity. There remains the possibility that multiple leaf devices converge traffic on a spine that must route to a common destination, but the probability of this occurring is much lower.
Future holistic load-balancing mechanisms, with and without controllers, also aim to deal with that challenge. But if packet spraying with destination reordering becomes feasible for more and more traffic and NICs, then the fabric is kept simpler while getting all the closer to the theoretical maximum capacity. Finally, there are purely NIC-based end-to-end load balancing approaches, like Amazon’s Scalable Reliable Datagram (SRD), that also keep the fabric simple. Some of these approaches offer the opportunity to converge multiple AI cluster fabrics without increasing the risk of congestion. Needless to say, Juniper Networks is exploring all of these approaches.
Automating congestion avoidance
In closing, let’s look at how Apstra can help configure DLB. As in the previous blog on topology design, we show how to automate Apstra with Terraform in our open-source Juniper design example for AI clusters. The Terraform configuration, in this case, adds DLB to the GPU fabric blueprint and the storage fabric blueprint because both are expected to carry relatively few large flows where congestion should be avoided. See this video for a demo of our Apstra with Terraform configuration.
Apstra reference designs don’t yet include native intent-based options for DLB settings, so we apply a configlet that sets the relatively simple Junos configuration and tells Apstra to merge this configlet extension into its golden reference configuration for the selected devices.
In our next blog, we will look at handling congestion with Data Center Quantized Congestion Notification (DCQCN), both in the increasingly unlikely event of congestion on the fabric and when it happens inevitably due to incast. Following that, I will get into the details of observability of AI cluster fabrics. Hint: we will use Apstra Intent-Based Analytics (IBA) probes such as the ECMP imbalance probe, bandwidth utilization, device traffic, fabric headroom, and hot/cold interfaces, and employ some of the new custom telemetry analytics features in the new Apstra 4.2.0 release.