The rapid adoption of Artificial Intelligence (AI) is transforming industries by enhancing automation, improving efficiency, and increasing productivity. Organizations worldwide are leveraging AI as a fundamental component of their strategic initiatives to drive smarter decision-making, optimize operations, and deliver real-time insights.
As AI becomes a cornerstone of strategic transformation, the size and complexity of AI models are growing at an unprecedented pace. Efficiently distributing high-performance workloads across multiple GPUs reduces training time and enhances overall productivity. At the heart of this process are collective operations that enable seamless data exchange and synchronization among GPUs. These operations are crucial for accelerating training and maximizing the effectiveness of AI clusters.
To keep pace with these advancements, the supporting infrastructure must evolve. This evolution necessitates the integration of ultra-fast networking to establish robust backend systems for training massive AI workloads. Such advancements ensure that large-scale AI models can be trained efficiently, deployed effectively, and optimized for performance, all while avoiding infrastructure bottlenecks.
Juniper and Broadcom have collaborated on a comprehensive benchmarking study, using the ROCm Communication Collectives Library (RCCL), to assess how well a modern networking stack can support intensive AI workloads. These tests were performed on a 32-node AMD MI300x GPU cluster interconnected with the Juniper Networks® QFX5240 Switch, delivering high-density 800 GbE connectivity, and Broadcom’s P1400GD 400 GbE Network Interface Cards (NICs). Juniper Apstra® simplified deployments through Juniper Validated Designs (JVDs) and AI-ready blueprints, delivering enhanced speed, stability, and performance. The results highlight how a finely tuned infrastructure can significantly accelerate RCCL operations and AI workload efficiency.
Benchmarking study: putting performance to the test
The benchmarking study was designed to evaluate the performance of collective operations essential for distributed AI training across a large-scale GPU cluster. By focusing on performance and scalability, the study aimed to assess how effectively the network infrastructure supports high-throughput, low-latency data exchange among GPUs.
Using the RCCL framework, the tests measured the impact of the Juniper QFX5240 Switch, Broadcom’s P1400GD NICs, and AMD MI300x GPUs on various communication patterns. The goal was to identify key performance bottlenecks, optimize workload distribution, and ensure that AI models can be trained faster and more efficiently within multi-node environments.
Building blocks of high-speed AI networking
- Juniper QFX5240: High-Performance Ethernet Switch
The Juniper QFX5240 Switch delivers cutting-edge 800 GbE connectivity, addressing the demands of large-scale AI clusters. Its specialized features, tailored for AI and ML workloads, ensure low latency and high throughput, significantly enhancing training efficiency. These features include:
- High-density ports: 800 GbE and 400 GbE options for flexible AI/ML deployment.
- Massive switching capacity: 51.2 Tbps unidirectional throughput for seamless multi-node communication.
- Ultra-low latency measured in nanoseconds for rapid data transfers.
- RoCEv2 support for efficient RDMA-based communication.
- Advanced congestion management: DCQCN-PFC and ECN with auto-tuning to prevent network bottlenecks.
- Adaptive load balancing: Dynamic load balancing (DLB), selective DLB, and reactive path rebalancing are used to optimize traffic flow.
- Configurable hash-bucket sizing: Adjusts to varying traffic patterns.
- Storm prevention: PFC watchdog ensures network stability.
- Global Load Balancing (GLB): Ensures consistent performance across large clusters.
- Broadcom BCM957608-P1400GD NIC: PCIe Gen5 400 Gbps Ethernet Adapter
Broadcom’s 400G Ethernet Adapters, based on the BCM57608 Ethernet controller, are optimized to meet the performance demands of AI applications, minimizing latency and maximizing data throughput. Additional features supported by the Broadcom BCM957608-P1400GD NIC to ensure optimal performance for AI include the following:
- RoCEv2: Enabling direct memory access over the network to achieve maximum performance
- Data Center Quantized Congestion Notification (DCQCN p/d): An end-to-end congestion control mechanism for RoCE that enables the NIC and switch to actively detect and respond to network congestion (a simplified sketch of this rate-control loop follows this list).
- Priority Flow Control (PFC): Utilized by RoCE, it is essential for establishing a lossless network. It prevents packet loss due to switch buffer overflows by pausing specific traffic classes during congestion instead of halting all traffic on a link.
- Fine-grained rate control at the queue-pair level, ensuring that congestion events have a minimal and localized impact.
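As an illustration of the congestion-control behavior described above, the following is a highly simplified C++ model of DCQCN's sender-side reaction to congestion notification packets (CNPs), based on the published DCQCN algorithm (Zhu et al., SIGCOMM 2015). It is conceptual only: the BCM57608 implements this logic in hardware, and the constants, initial values, and recovery step shown here are simplified assumptions rather than the NIC's tuned parameters.

```cpp
// Conceptual sketch of DCQCN's sender-side reaction to congestion
// notification packets (CNPs). Illustrative only -- the NIC implements
// this in hardware/firmware with vendor-tuned parameters.
#include <cstdio>

struct DcqcnFlowState {
    double current_rate_gbps;   // RC: current sending rate
    double target_rate_gbps;    // RT: rate to recover toward
    double alpha;               // congestion estimate in [0, 1]
};

// Called when the sender receives a CNP for this flow (i.e., the switch
// ECN-marked packets and the receiver reflected the mark back).
void on_cnp(DcqcnFlowState &f, double g = 1.0 / 16.0) {
    f.alpha = (1.0 - g) * f.alpha + g;            // raise congestion estimate
    f.target_rate_gbps = f.current_rate_gbps;     // remember pre-cut rate
    f.current_rate_gbps *= (1.0 - f.alpha / 2.0); // multiplicative decrease
}

// Called periodically when no CNPs arrive: decay alpha and recover
// toward the target rate (fast-recovery phase, simplified).
void on_rate_timer(DcqcnFlowState &f, double g = 1.0 / 16.0) {
    f.alpha *= (1.0 - g);
    f.current_rate_gbps = 0.5 * (f.current_rate_gbps + f.target_rate_gbps);
}

int main() {
    DcqcnFlowState flow{400.0, 400.0, 0.0};       // start at 400 Gbps line rate
    on_cnp(flow);                                 // congestion detected
    std::printf("after CNP: %.1f Gbps\n", flow.current_rate_gbps);
    on_rate_timer(flow);                          // congestion clears, recover
    std::printf("after recovery step: %.1f Gbps\n", flow.current_rate_gbps);
}
```

Because the rate adjustment is applied per flow (and, on the BCM957608, at queue-pair granularity), only the traffic actually crossing the congested point slows down while unaffected flows continue at full rate.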
Additionally, Broadcom’s AI Ethernet adapters are the lowest-power 400G NIC solutions available today. They help reduce overall power demands and ease the network’s thermal requirements, improving network reliability and resiliency.
- AMD MI300x: Powering Large-Scale AI
AMD’s MI300 series accelerators are designed to meet the rigorous requirements of large-scale AI and high-performance computing (HPC) workloads. The 32-node cluster used in our tests showcased the MI300x’s ability to handle intensive, distributed AI training while maintaining exceptional efficiency.
Communication patterns evaluated
To assess the network’s efficiency in supporting distributed AI workloads, we used the RCCL communication library, which provides optimized GPU-to-GPU communication. Our assessment focused on four essential collective operations, each replicating key scenarios encountered during large-scale AI training:
- All-to-All: Every GPU communicates with every other GPU, exchanging data simultaneously.
- All-Reduce: Aggregates data from all GPUs and distributes the combined result for synchronized training.
- All-Gather: Ensures that each GPU shares its local data with all others.
- Reduce-Scatter: Combines data across GPUs before distributing segmented results.
To gauge the network’s efficiency in supporting AI workloads, we utilized the RCCL benchmarking test suite to run these operations (all-to-all, all-reduce, all-gather, and reduce-scatter) with message sizes ranging from 16 MB to 4 GB. RoCEv2 was utilized as the transport layer, ensuring efficient RDMA-based communication throughout the evaluation process. The goal was to replicate real-world AI training scenarios that demand high throughput and low latency.
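Because RCCL exposes the same collective API as NCCL, the kind of message-size sweep described above can be sketched at the API level. The following is a minimal, single-node illustration: one process drives all local GPUs through All-Reduce over a range of message sizes and times each size. It is not the rccl-tests harness used in the study, which adds MPI-based multi-node bootstrap, warm-up iterations, and result validation; the header path, the 1 GB buffer cap, and the error-handling style are assumptions that may differ by ROCm version. The other collectives follow the same pattern via ncclAllGather, ncclReduceScatter, and RCCL's all-to-all support (or grouped send/receive calls).

```cpp
// Minimal single-node sketch of a collective sweep: one process drives all
// local GPUs through All-Reduce over message sizes from 16 MB upward.
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>        // header path may vary by ROCm version
#include <chrono>
#include <cstdio>
#include <vector>

#define HIP_CHECK(cmd)  do { hipError_t e = (cmd); if (e != hipSuccess) { \
    std::printf("HIP error %d\n", (int)e); return 1; } } while (0)
#define NCCL_CHECK(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
    std::printf("RCCL error %d\n", (int)r); return 1; } } while (0)

int main() {
    int ngpus = 0;
    HIP_CHECK(hipGetDeviceCount(&ngpus));

    // One RCCL communicator per local GPU, all created by this process.
    std::vector<ncclComm_t> comms(ngpus);
    std::vector<int> devs(ngpus);
    for (int i = 0; i < ngpus; ++i) devs[i] = i;
    NCCL_CHECK(ncclCommInitAll(comms.data(), ngpus, devs.data()));

    std::vector<float*> send(ngpus), recv(ngpus);
    std::vector<hipStream_t> streams(ngpus);
    const size_t max_bytes = 1ull << 30;          // capped at 1 GB per buffer here
    for (int i = 0; i < ngpus; ++i) {
        HIP_CHECK(hipSetDevice(i));
        HIP_CHECK(hipMalloc((void**)&send[i], max_bytes));
        HIP_CHECK(hipMalloc((void**)&recv[i], max_bytes));
        HIP_CHECK(hipStreamCreate(&streams[i]));
    }

    // Sweep message sizes: 16 MB, 32 MB, ... up to the buffer size.
    for (size_t bytes = 16ull << 20; bytes <= max_bytes; bytes <<= 1) {
        size_t count = bytes / sizeof(float);
        auto t0 = std::chrono::steady_clock::now();
        NCCL_CHECK(ncclGroupStart());
        for (int i = 0; i < ngpus; ++i)
            NCCL_CHECK(ncclAllReduce(send[i], recv[i], count, ncclFloat,
                                     ncclSum, comms[i], streams[i]));
        NCCL_CHECK(ncclGroupEnd());
        for (int i = 0; i < ngpus; ++i) {
            HIP_CHECK(hipSetDevice(i));
            HIP_CHECK(hipStreamSynchronize(streams[i]));
        }
        double sec = std::chrono::duration<double>(
                         std::chrono::steady_clock::now() - t0).count();
        std::printf("%zu MB  all_reduce  %.3f ms\n", bytes >> 20, sec * 1e3);
    }

    for (int i = 0; i < ngpus; ++i) ncclCommDestroy(comms[i]);
    return 0;
}
```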
Performance results: breaking down the metrics
All-to-All performance
During an All-to-All communication, each GPU exchanges a unique message with every other GPU in a full mesh point-to-point format. This creates moderate to high entropy and congestion in the network. Good congestion control is a key factor in All-to-All performance.
Since All-to-All sends unique data from each GPU, it does not take advantage of the pipelining performed in a ring- or tree-based algorithm. Therefore, the bus bandwidth will be a function of the average bandwidth between each GPU pair (inter- and intra-node). As the scale increases, this bandwidth should approach the bandwidth of the scale-out network, which in this case is ~50 GB/s point-to-point. The fact that the target bandwidth is exceeded indicates efficiency gains from the collectives. Those gains result from local transfers inside the server and are, therefore, not limited by the scale-out network.
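The ~50 GB/s target follows directly from the NIC line rate (400 Gbps ÷ 8), and the reported bus bandwidth follows the convention documented with the nccl-tests/rccl-tests suite, which scales the per-rank algorithm bandwidth by (n−1)/n for All-to-All. The short C++ sketch below simply encodes that arithmetic; the rank count (assuming 8 GPUs per node), elapsed time, and message size are hypothetical placeholders, not measured values from this study.

```cpp
// Back-of-the-envelope helper for the All-to-All bandwidth target, using the
// bus-bandwidth convention from the nccl-tests/rccl-tests performance notes.
#include <cstdio>

int main() {
    const double nic_gbps = 400.0;                // BCM957608-P1400GD line rate
    const double link_gb_per_s = nic_gbps / 8.0;  // 400 Gbps ~= 50 GB/s point-to-point

    const int nranks = 256;                       // hypothetical: 32 nodes x 8 GPUs
    const double bytes_per_rank = 1.0e9;          // hypothetical 1 GB per rank
    const double elapsed_s = 0.020;               // hypothetical measured time

    const double alg_bw = bytes_per_rank / elapsed_s / 1.0e9;       // GB/s
    const double bus_bw = alg_bw * (nranks - 1) / double(nranks);   // GB/s

    std::printf("scale-out target : %.1f GB/s per GPU\n", link_gb_per_s);
    std::printf("example busBw    : %.1f GB/s (hypothetical inputs)\n", bus_bw);
    return 0;
}
```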
As shown below (Figure 1), the combination of the Juniper QFX5240 and Broadcom’s P1400GD NICs can achieve the targeted bandwidth, indicating a highly performant and optimized network.
Figure 1: All-to-All Performance Chart
All-Reduce, All-Gather, and Reduce-Scatter performance
All-Reduce, All-Gather, and Reduce-Scatter are implemented by passing messages between all GPUs through rings or trees. This enables a heavily pipelined collective to utilize the entire bus bandwidth (~360 GB/s). This workload creates low entropy and requires low latency through the network, and the Juniper QFX5240’s adaptive load-balancing feature excels at maintaining low latency.
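For intuition about why these collectives can approach the full bus bandwidth, here is a CPU-only C++ sketch of the textbook ring All-Reduce schedule. It is a conceptual model rather than RCCL's actual implementation (RCCL selects among ring, tree, and other algorithms at runtime), but it shows the reduce-scatter-then-all-gather pipeline and why each rank transmits roughly 2(n−1)/n of the buffer, which is the normalization factor behind the reported bus bandwidth.

```cpp
// CPU-only illustration of the ring All-Reduce schedule: the buffer is split
// into n chunks, a reduce-scatter phase runs n-1 steps, then an all-gather
// phase runs n-1 steps, so every rank sends 2*(n-1)/n of the buffer per link.
#include <cstdio>
#include <vector>

int main() {
    const int n = 4;                        // ranks in the ring (illustrative)
    const int chunk = 2;                    // elements per chunk
    std::vector<std::vector<float>> buf(n, std::vector<float>(n * chunk));
    for (int r = 0; r < n; ++r)
        for (float &v : buf[r]) v = float(r + 1);   // rank r contributes r+1

    // Reduce-scatter phase: at step s, rank r forwards chunk (r - s) mod n to
    // its ring neighbor, which accumulates it. After n-1 steps, each rank
    // holds one fully reduced chunk.
    for (int s = 0; s < n - 1; ++s)
        for (int r = 0; r < n; ++r) {
            int c = (r - s + n) % n, dst = (r + 1) % n;
            for (int i = 0; i < chunk; ++i)
                buf[dst][c * chunk + i] += buf[r][c * chunk + i];
        }

    // All-gather phase: each rank forwards the fully reduced chunk it most
    // recently received, so after another n-1 steps every rank has the sum.
    for (int s = 0; s < n - 1; ++s)
        for (int r = 0; r < n; ++r) {
            int c = (r + 1 - s + n) % n, dst = (r + 1) % n;
            for (int i = 0; i < chunk; ++i)
                buf[dst][c * chunk + i] = buf[r][c * chunk + i];
        }

    // Every element should now equal 1 + 2 + ... + n.
    const float expected = n * (n + 1) / 2.0f;
    bool ok = true;
    for (int r = 0; r < n; ++r)
        for (float v : buf[r]) ok = ok && (v == expected);
    std::printf("all ranks reduced correctly: %s\n", ok ? "yes" : "no");
    std::printf("traffic per rank: 2*(n-1)/n = %.2f buffer sizes\n",
                2.0 * (n - 1) / n);
    return 0;
}
```

Because every step moves only one chunk per link and all links are busy simultaneously, per-link utilization approaches line rate once the message is large enough to keep the pipeline full, which is consistent with the note below about smaller message sizes.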
The results shown in the figures below clearly demonstrate that the Juniper QFX5240, combined with Broadcom’s P1400GD NICs, achieves the target bus bandwidth for all three collectives, indicating a performance-optimized network.
Note: At smaller message sizes, inefficiencies inherent to the collective’s implementation limit achievable performance.
Figure 2: All-Reduce Performance Chart
Figure 3: All-Gather Performance Chart
Figure 4: Reduce-Scatter Performance Chart
Key takeaways
The following switch and NIC features were instrumental in achieving the performance detailed above. These features are critical to ensuring the network can support the performance demands of an AI workload, and it is equally important that the components work in conjunction to maximize each feature’s potential benefits. These results demonstrate that each feature and component is well tuned to create a performance-optimized network.
Juniper QFX5240 feature impact:
- 800 GbE throughput: Enabled efficient multi-node scaling and improved AI networking efficiency.
- Adaptive load balancing: Optimized AI workload distribution, preventing congestion and improving GPU traffic management.
- Buffer tuning: Minimized packet loss during large-batch AI model training, ensuring uninterrupted training operations.
Broadcom P1400GD feature impact:
- 400 GbE throughput: Full 400 GbE throughput allows saturation of the GPU links and is essential to ensuring the network is not a bottleneck.
- Low latency: Optimized NIC architecture minimizes the latency introduced by the NIC itself.
- RoCEv2: Hardened RoCE solution with an enhanced congestion control mechanism that improves overall network responsiveness, bandwidth, and latency, resulting in better RCCL performance.
AMD MI300x Efficiency:
- Demonstrated high efficiency for demanding, large-scale distributed AI and HPC workloads.
- Improved synchronization during large-scale model training.
Conclusion: scaling AI with confidence
This benchmarking study illustrates that an optimized AI networking stack, featuring Juniper’s QFX5240 Switch, Broadcom’s P1400GD NICs, and AMD’s MI300x GPU, delivers exceptional RCCL performance. This stack plays a pivotal role in ensuring that large-scale AI models are trained efficiently, deployed seamlessly, and optimized for maximum performance without falling prey to potential bottlenecks.
As AI workloads grow in complexity, organizations must rely on high-speed Ethernet solutions to meet the demands of modern data centers. The combination of high throughput, low latency, and advanced congestion management ensures that organizations can train large-scale AI models more efficiently and cost-effectively.
Additional references: