Developing and deploying AI applications requires highly optimized infrastructure, including purpose-built GPU servers, a robust AI/ML software stack, and a lossless, low-latency network fabric. An AI cluster typically has four separate networks: frontend, management, backend accelerator interconnect, and backend storage. The backend (also known as training) networks present the toughest challenges, and this is where a good networking company really earns its stripes.
Juniper is thrilled to share our participation in the latest MLPerf® Training v4.0 benchmark suite. This demonstrates our ongoing hardware and software innovation in AI systems, as highlighted in the recent press release by MLCommons®. We are committed to accelerating AI innovation and collaborating with organizations like MLCommons to make data center infrastructure simpler, faster, and more economical to deploy for AI/ML workloads.
Pushing AI boundaries with new benchmarks for training
The MLPerf Training benchmark suite thoroughly evaluates AI system performance, efficiency, and innovation. These peer-reviewed benchmarks test the entire AI system stack across various real-world workloads. The suite includes comprehensive tests that stress machine learning (ML) models, software, and hardware for a wide range of applications. The v4.0 release features over 205 performance results from 17 leading organizations, including Juniper Networks. This release adds two new benchmarks for the first time: Low-Rank Adaptation (LoRA) fine-tuning of Llama 2 70B and Graph Neural Networks (GNN), highlighting the rapid evolution of large language models (LLMs) and the need for high-performance training.
For MLPerf Training v4.0, Juniper submitted benchmarks for prominent AI models, including Llama 2 70B with LoRA fine-tuning, Bidirectional Encoder Representations from Transformers (BERT), and the Deep Learning Recommendation Model (DLRM). These benchmarks were achieved on a Juniper AI Cluster comprising NVIDIA A100 and H100 GPUs, using Juniper’s AI-optimized Ethernet fabric as the accelerator interconnect. For BERT, we optimized pre-training tasks on a Wikipedia dataset, evaluating performance with masked language modeling (MLM) accuracy.
Low-Rank Adaptation (LoRA) fine-tuning of Llama 2 70B
This benchmark adapts an LLM pre-trained on a general text corpus to specific tasks while requiring far less compute and memory than pre-training or full supervised fine-tuning. LoRA dramatically reduces the number of trainable parameters while maintaining performance comparable to fully fine-tuned models. The benchmark uses the Llama 2 70B model, fine-tuned on the SCROLLS dataset of government documents to generate accurate document summaries, measured with the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric.
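The core idea of LoRA can be sketched in a few lines of NumPy: a frozen weight matrix W is augmented with a trainable low-rank update B·A, so only the small factors are trained. The dimensions, rank, and scaling below are illustrative assumptions, not the benchmark configuration:

```python
import numpy as np

# Minimal LoRA sketch (illustrative; hypothetical sizes, not the MLPerf setup).
# Instead of updating the full weight matrix W (d_out x d_in), LoRA trains
# two small factors A (r x d_in) and B (d_out x r) with rank r << d_in, and
# the adapted layer computes y = (W + (alpha / r) * B @ A) @ x.

d_in, d_out, r, alpha = 4096, 4096, 16, 32

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen pre-trained weights
A = rng.standard_normal((r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                    # zero init: adapter starts as a no-op

x = rng.standard_normal(d_in)
y = W @ x + (alpha / r) * (B @ (A @ x))     # forward pass with the adapter

full_params = d_out * d_in
lora_params = r * d_in + d_out * r
print(f"trainable: {lora_params:,} vs {full_params:,} "
      f"({100 * lora_params / full_params:.2f}% of full fine-tuning)")
```

At these sizes the adapter trains under 1% of the parameters a full fine-tune would touch, which is why LoRA cuts memory and compute so sharply.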
Bidirectional Encoder Representations from Transformers (BERT)
BERT trained with Wikipedia is a leading-edge language model used extensively in natural language processing tasks. Given a text input, language models predict related words and are employed as a building block for translation, search, text understanding, answering questions, and generating text.
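The MLM accuracy metric used to score BERT pre-training is simple to state: mask some tokens, have the model predict them, and count exact matches at the masked positions only. A minimal sketch, with hypothetical token ids:

```python
import numpy as np

# Sketch of masked-language-modeling (MLM) accuracy: the fraction of
# masked positions where the predicted token id equals the true id.
# Token ids below are made up for illustration.

def mlm_accuracy(true_ids: np.ndarray, pred_ids: np.ndarray, mask: np.ndarray) -> float:
    """Accuracy over masked positions only; unmasked tokens are ignored."""
    masked = mask.astype(bool)
    if masked.sum() == 0:
        return 0.0
    return float((true_ids[masked] == pred_ids[masked]).mean())

true_ids = np.array([101, 7592, 2088, 102])
pred_ids = np.array([101, 7592, 2157, 102])
mask     = np.array([0,   1,    1,    0])   # positions 1 and 2 were masked

print(mlm_accuracy(true_ids, pred_ids, mask))  # one of two masked tokens correct -> 0.5
```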
Deep Learning Recommendation Model (DLRM)
DLRM trained with Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) dataset represents a wide variety of commercial applications that touch the lives of nearly every individual on the planet. Common examples include recommendations for online shopping, search results, and social media content ranking. Juniper’s DLRM submission utilized the Criteo dataset and HugeCTR to efficiently handle sparse and dense features, with AUC (area under the ROC curve) as the evaluation metric, achieving exceptional performance.
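The sparse/dense split that DLRM handles can be sketched in NumPy: dense features pass through a bottom MLP, sparse categorical features are looked up in embedding tables, and pairwise dot-product interactions feed a top MLP that outputs a click probability. All sizes and weights below are illustrative assumptions, nothing like the benchmark configuration:

```python
import numpy as np

# Minimal DLRM-style sketch (hypothetical sizes; untrained random weights).
rng = np.random.default_rng(0)
emb_dim = 8

def mlp(x, sizes):
    """Tiny ReLU MLP with random weights, used only to illustrate shapes."""
    for out in sizes:
        W = rng.standard_normal((x.shape[-1], out)) * 0.1
        x = np.maximum(x @ W, 0.0)
    return x

# Two sparse features with small vocabularies, plus 4 dense features.
tables = [rng.standard_normal((100, emb_dim)), rng.standard_normal((50, emb_dim))]
sparse_ids = [7, 42]
dense = rng.standard_normal(4)

bottom = mlp(dense, [16, emb_dim])                       # dense -> emb_dim vector
vectors = [bottom] + [t[i] for t, i in zip(tables, sparse_ids)]

# Pairwise dot-product feature interactions.
inter = np.array([vectors[i] @ vectors[j]
                  for i in range(len(vectors))
                  for j in range(i + 1, len(vectors))])

top_in = np.concatenate([bottom, inter])
logit = mlp(top_in, [16, 1])
ctr = 1.0 / (1.0 + np.exp(-logit))                       # predicted click probability
print(ctr.shape)
```

The embedding lookups are memory-bound and the MLPs are compute-bound, which is why distributed DLRM training stresses both the GPUs and the interconnect between them.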
Crucially, we optimized inter-node communication with RDMA over Converged Ethernet (RoCE v2), enabling the low-latency, high-bandwidth data transfers critical for efficient distributed training workloads. Furthermore, Juniper has demonstrated in our AI Innovation Lab that Ethernet performance is on par with InfiniBand for AI training, highlighting its capability to deliver top-notch networking solutions for AI workloads.

“I’m thrilled by the performance gains we are seeing, especially for generative AI,” said David Kanter, executive director of MLCommons. “Together with our first power results for MLPerf Training, we are increasing capabilities and reducing the environmental footprint—making AI better for everyone. This is a significant undertaking, and we celebrate the hard work and dedication of Juniper Networks.”
Please visit the MLCommons Training benchmark page to review the full MLPerf Training v4.0 results.
Plus, read the blog and follow the video series to learn about network optimization techniques such as Dynamic Load Balancing (DLB), Data Center Quantized Congestion Notification (DCQCN) Profile tuning, Explicit Congestion Notification (ECN), and Priority Flow Control (PFC).
Looking forward
Juniper’s collaboration with MLCommons in MLPerf training represents a major milestone in advancing AI innovation and streamlining data center infrastructure. With its expertise in networking solutions and dedication to customer success, Juniper continues to drive the advancement of AI technology, enabling organizations to unleash the full potential of AI in their operations.
The insights gained from the MLPerf benchmarks will guide our efforts to develop more efficient, powerful, and environmentally sustainable AI solutions. The AI revolution is coming to the networking world, and Juniper is helping lead the charge.
If you want to hear more about Networking for AI and learn about MLPerf Training 4.0, join Juniper’s AI experts, industry thought leaders, and special guests from some of the most innovative companies working with AI today at “Seize the AI Moment” on July 23rd. Join the conversation and learn:
- Best practices for building high-performing AI data centers
- How companies like Meta, Broadcom, Intel, AMD, and others are solving new infrastructure challenges
- Why Ethernet is the gold standard networking technology for AI clusters
Register here.