Enterprise IT organizations are under pressure to “just do something” to take advantage of recent advancements in artificial intelligence (AI). Generative AI has quickly gone from an R&D sideline to a boardroom imperative across industries. But most enterprise networking teams still don’t know where to begin.
While there are all sorts of new technologies and an alphabet soup of new protocols to sift through when building out new data centers for AI workloads, the truth is that the majority of an organization’s current network infrastructure and expertise still applies. Many companies we work closely with are surprised to discover this reality.
At Juniper, we’re helping our customers navigate the unique challenges associated with deploying and operating new AI data center architectures. In fact, in July we launched our AI data center (“Networking for AI”) solution and highlighted the key role that the Juniper™ Apstra data center fabric management and automation software plays. No need for complicated, proprietary AI data center implementations, such as scheduled fabrics. Apstra guides you at every step along the AI data center network life cycle—from Day 0 design to Day 1 deployment to Day 2+ ongoing operations. Get all the details in this new solution brief.
Simplify the deployment and operation of AI data centers with Juniper Apstra templates
Network teams are being tasked with building new AI data center (AI DC) infrastructure to take advantage of recent advancements in AI. AI traffic, dominated by large, long-lived “elephant flows,” presents a unique set of challenges and requirements that demand a responsive, forward-thinking approach to networking and management. This is a completely new world for the typical enterprise data center organization, where maximizing the utilization of expensive graphics processing units (GPUs) and minimizing job completion time (JCT) are paramount to the economics of an AI DC.
Fortunately, the Ethernet infrastructure that virtually all network teams know well is up to the challenge, although AI DCs do require new network architectures and new techniques for tuning the Ethernet fabric for performance. Apstra takes both the new AI DC architectures and the fabric-tuning challenges in stride. AI DC designs are created in a template designer using existing Apstra blueprints, and auto-tuning AI training networks with Apstra saves countless hours and frustration compared with today’s manual, time-intensive approaches. These new capabilities are offered free of charge with existing Apstra licenses.
AI workloads have unique requirements, so designing and configuring AI DC fabrics for them poses challenges that traditional workloads do not. With Apstra’s AI cluster template designer, you can quickly create validated, optimized Apstra templates tailored to your specific resource requirements and workloads. You specify a minimal set of inputs, such as the number of GPUs, servers, and racks, and the tool generates customized rack templates that use resources efficiently and minimize potential performance bottlenecks, so you can scale your network reliably with a few clicks. Don’t know what a “rail-optimized” design is? Apstra shows you how to build it. Continuous validation from Apstra ensures that the operational state of the data center network matches the intent declared in the template.
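In a rail-optimized design, NIC n of every GPU server in a group connects to the same leaf switch (rail n), so GPUs with the same index on different servers are a single leaf hop apart. The Python sketch below shows the kind of sizing arithmetic a template designer performs for such a layout. It is not Apstra’s algorithm: the helper `plan_rail_optimized`, the “stripe” grouping, and the assumptions of one NIC per GPU and one leaf downlink per server are all illustrative.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RailPlan:
    servers: int
    stripes: int
    leaves: int
    leaves_per_stripe: int
    servers_per_stripe: int


def plan_rail_optimized(total_gpus: int, gpus_per_server: int, leaf_downlinks: int) -> RailPlan:
    """Hypothetical sizing helper for a rail-optimized GPU fabric (not an Apstra API)."""
    if min(total_gpus, gpus_per_server, leaf_downlinks) <= 0:
        raise ValueError("all inputs must be positive")
    # Assumption: one NIC per GPU, so a server needs gpus_per_server fabric ports.
    servers = math.ceil(total_gpus / gpus_per_server)
    # One leaf per rail: NIC n of every server in a stripe lands on leaf n.
    leaves_per_stripe = gpus_per_server
    # Each leaf spends one downlink per server, so a stripe holds leaf_downlinks servers.
    servers_per_stripe = leaf_downlinks
    stripes = math.ceil(servers / servers_per_stripe)
    return RailPlan(
        servers=servers,
        stripes=stripes,
        leaves=stripes * leaves_per_stripe,
        leaves_per_stripe=leaves_per_stripe,
        servers_per_stripe=servers_per_stripe,
    )


# Example: 1,024 GPUs in 8-GPU servers, leaves with 32 server-facing ports.
plan = plan_rail_optimized(total_gpus=1024, gpus_per_server=8, leaf_downlinks=32)
print(plan)  # 128 servers in 4 stripes, 32 leaf switches in total
```

Spine sizing, oversubscription, and storage or front-end networks are left out to keep the arithmetic visible.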
The unique design requirements and sheer number of connections make deploying AI training environments complicated and error prone. Apstra blueprints include a cabling map that, combined with rack elevation information, yields a precise list of links and endpoints that data center technicians can use to complete the cabling work accurately.
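A cabling map is essentially a deterministic list of (server port, switch port) pairs. As a rough illustration, assuming a rail-optimized layout in which NIC n of each server is cabled to leaf n of its group (“stripe”) and servers fill leaf ports in order, the hypothetical Python generator below produces such a list. The device names and the Junos-style `et-0/0/N` port labels are placeholders, not output from Apstra.

```python
from typing import List, NamedTuple


class Link(NamedTuple):
    server: str
    server_port: str
    leaf: str
    leaf_port: str


def cabling_map(servers: int, gpus_per_server: int, servers_per_stripe: int) -> List[Link]:
    """Hypothetical rail-optimized cabling list: NIC n of each server -> leaf n of its stripe."""
    links: List[Link] = []
    for s in range(servers):
        # Which stripe the server belongs to, and its slot (leaf port index) within it.
        stripe, slot = divmod(s, servers_per_stripe)
        for rail in range(gpus_per_server):
            links.append(Link(
                server=f"gpu-srv-{s + 1:03d}",          # placeholder server name
                server_port=f"nic{rail}",               # placeholder NIC name
                leaf=f"stripe{stripe + 1}-leaf{rail + 1}",
                leaf_port=f"et-0/0/{slot}",             # placeholder port label
            ))
    return links


links = cabling_map(servers=4, gpus_per_server=2, servers_per_stripe=2)
for link in links:
    print(f"{link.server}:{link.server_port} <-> {link.leaf}:{link.leaf_port}")
```

Because every (leaf, port) pair is used exactly once, the list doubles as a checklist that technicians can tick off and that a validation step can compare against discovered neighbors.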
Network interface cards (NICs) in GPU servers play a prominent role in AI DC designs. Configuring GPU NICs for all-to-all routing is laborious and requires an understanding of the overall network fabric. The Apstra host agent for AI/ML can automatically configure the GPU NICs with the correct IP addresses and routing information derived from the GPU network blueprint.
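To give a sense of what the host agent automates, the sketch below derives per-NIC addressing and Linux policy routing for one GPU server. It assumes, purely for illustration, that each NIC-to-leaf link is a /31 carved from a shared pool, that interfaces are named `eth<rail>`, and that each NIC gets its own routing table so traffic sourced from a NIC’s address leaves through that NIC. The pool, interface names, and table numbers are hypothetical; the Apstra host agent’s actual behavior is driven by the GPU network blueprint. The commands use standard iproute2 syntax and are only generated as strings, not executed.

```python
import ipaddress
from typing import Dict, List


def nic_config(server_index: int, gpus_per_server: int,
               pool: str = "10.200.0.0/16") -> List[Dict]:
    """Hypothetical per-NIC addressing: one /31 per NIC-to-leaf link, one route table per NIC."""
    net = ipaddress.ip_network(pool)
    configs = []
    for rail in range(gpus_per_server):
        # Global link number -> its /31: the leaf takes the even address, the NIC the odd one.
        link_id = server_index * gpus_per_server + rail
        leaf_ip = net.network_address + 2 * link_id
        nic_ip = leaf_ip + 1
        if nic_ip not in net:
            raise ValueError("address pool exhausted")
        dev = f"eth{rail}"      # hypothetical interface naming
        table = 100 + rail      # hypothetical routing-table number per NIC
        configs.append({
            "dev": dev,
            "address": f"{nic_ip}/31",
            "gateway": str(leaf_ip),
            "commands": [
                f"ip addr add {nic_ip}/31 dev {dev}",
                f"ip route add default via {leaf_ip} dev {dev} table {table}",
                f"ip rule add from {nic_ip} table {table}",
            ],
        })
    return configs


cfg = nic_config(server_index=1, gpus_per_server=8)
print("\n".join(cfg[0]["commands"]))
```

Source-based rules are what let eight NICs on one host coexist: without them, the kernel would pick a single default route and send traffic out the wrong rail.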
Pre-stage your AI DC while waiting for delivery of those hard-to-find GPUs
AI accelerators such as GPUs are expensive and scarce. After months-long lead times, many companies finally take delivery of their GPUs only to have them sit idle for weeks while the supporting infrastructure is deployed and configured.
Apstra lets you model and manage network designs digitally before any physical hardware is acquired, streamlining network planning and deployment. Pre-stage your entire AI DC before the infrastructure arrives in the warehouse, then deploy in days rather than weeks.
AI DC fabric auto-tuning
Training AI models over Ethernet requires new techniques for congestion management and flow control, such as data center quantized congestion notification (DCQCN). Tuning AI fabrics to get the best performance from these sophisticated protocols is operationally challenging, and manual tuning is time-consuming, error-prone, and often ineffective.
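To show what DCQCN asks of the endpoints, here is a simplified Python model of the sender-side (“reaction point”) rules from the published DCQCN algorithm. When switches ECN-mark packets, the receiver returns congestion notification packets (CNPs); on each CNP the sending NIC remembers its current rate as a target and cuts its rate by a factor of (1 − α/2), α tracks how persistent congestion has been, and a timer-driven fast-recovery step moves the rate halfway back toward the target. The additive and hyper-increase stages and all timer periods are omitted, and g = 1/256 is only an illustrative gain.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    """Simplified DCQCN reaction point: rate cut, alpha update, and fast recovery only."""
    line_rate_gbps: float
    g: float = 1 / 256   # alpha gain; illustrative value
    rate: float = 0.0    # current rate R_C (set from line rate below)
    target: float = 0.0  # target rate R_T
    alpha: float = 1.0   # congestion estimate

    def __post_init__(self) -> None:
        self.rate = self.target = self.line_rate_gbps

    def on_cnp(self) -> None:
        # CNP received: remember the current rate, then cut multiplicatively.
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self) -> None:
        # No CNP during the alpha-update period: congestion estimate decays.
        self.alpha *= 1 - self.g

    def on_rate_timer(self) -> None:
        # Fast recovery: move halfway back toward the remembered target rate.
        self.rate = (self.rate + self.target) / 2


nic = DcqcnSender(line_rate_gbps=100.0)
nic.on_cnp()         # rate 100 -> 50, target stays at 100
nic.on_rate_timer()  # fast recovery: rate 50 -> 75
print(nic.rate, nic.target)
```

The model makes the tuning problem visible: how often CNPs arrive depends on where switches start marking, which is exactly the switch-side knob a tuning tool adjusts.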
With Apstra, you can auto-tune your fabric in minutes. Tuning a fabric is a delicate balance between maximizing throughput and minimizing packet loss. As a switch buffer fills, the two components of DCQCN, explicit congestion notification (ECN) and priority flow control (PFC), are used to regulate traffic flow. Apstra continuously monitors key RDMA over Converged Ethernet (RoCE) v2 congestion metrics and reconfigures DCQCN on the switches to prevent packet loss. Dynamically optimizing DCQCN parameters based on real-time network analysis keeps the fabric operating efficiently. This application is currently available for download from GitHub, and it is also available free of charge with any Apstra license.
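On the switch side, DCQCN deployments commonly use RED-style ECN marking: no marking below a queue threshold Kmin, a marking probability rising linearly to Pmax as the queue approaches Kmax, and marking of every packet at or above Kmax, with PFC acting as a lossless backstop when buffers get close to full. The sketch below models that marking curve and adds a deliberately naive retuning rule driven by PFC pause and ECN mark counters. The retuning heuristic, the `EcnProfile` type, and all threshold values are hypothetical and do not describe how Apstra’s auto-tuning application works.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EcnProfile:
    kmin_kb: int  # at or below this queue depth, never mark
    kmax_kb: int  # at or above this queue depth, mark every packet
    pmax: float   # marking probability as the queue approaches kmax


def mark_probability(queue_kb: float, p: EcnProfile) -> float:
    """RED-style ECN marking probability for a given queue depth."""
    if queue_kb <= p.kmin_kb:
        return 0.0
    if queue_kb >= p.kmax_kb:
        return 1.0
    return p.pmax * ((queue_kb - p.kmin_kb) / (p.kmax_kb - p.kmin_kb))


def retune(p: EcnProfile, pfc_pauses: int, ecn_marks: int,
           mark_budget: int = 10_000, step_kb: int = 50) -> EcnProfile:
    """Hypothetical retuning heuristic for illustration only (not Apstra's logic)."""
    if pfc_pauses > 0:
        # PFC fired, so queues reached the pause threshold: start ECN marking earlier.
        kmin = max(step_kb, p.kmin_kb - step_kb)
        kmax = max(kmin + step_kb, p.kmax_kb - step_kb)
        return replace(p, kmin_kb=kmin, kmax_kb=kmax)
    if ecn_marks > mark_budget:
        # No pauses but heavy marking: senders may be throttled more than needed.
        return replace(p, kmin_kb=p.kmin_kb + step_kb, kmax_kb=p.kmax_kb + step_kb)
    return p  # Counters within budget: leave the profile unchanged.


profile = EcnProfile(kmin_kb=150, kmax_kb=1500, pmax=0.2)
print(mark_probability(825, profile))
print(retune(profile, pfc_pauses=3, ecn_marks=0))
```

A production tuner would look at many more signals (CNP rates, drops, per-queue occupancy, throughput) and change parameters cautiously, but the basic trade-off is the same: mark earlier to avoid pauses, mark later to keep throughput high.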
Juniper Apstra to minimize time to value for enterprise AI projects
Be sure to check out the Juniper Ops4AI Lab, where customers are testing model performance and the efficacy of data center network designs, slashing their time to AI value. Enterprise network teams don’t need to panic in the face of aggressive AI goals pushed down by senior leadership. Juniper Apstra has you covered.
Read the solution brief to learn more, and take a deeper dive on our AI data center solution page.