In a recent Bob Friday Talks webinar, I talked to Sharada Yeluri, Senior Director of Engineering at Juniper, about Networking for AI and the next generation of data centers.
What’s the difference between AI for Networking and Networking for AI?
AI for Networking is when AI acts as a network expert leveraging cloud AIOps to deploy and operate networks on par with human IT domain experts. This keeps your networks in good shape by finding and fixing problems before you notice them.
Networking for AI is when you design the data center of the future with x86 architecture in the front and GPU clusters in the back to train and run big GenAI Large Language Models (LLMs). It’s like building the optimal track for AI applications to run fast and smoothly, handling huge AI workloads with ease. Yeluri describes it as using the network to run the new generation of large deep learning models at speed and scale more efficiently.
So:
- AI for Networking = Managing the network using AI
- Networking for AI = Making AI more effective using efficient, high-bandwidth, highly scalable data center solutions
How is Networking for AI in data centers different from traditional data centers?
Typical data centers of the past consisted of compute and storage built on x86 architecture to run the cloud SaaS applications (e.g., Netflix, email) that we have all come to love and depend on. With the introduction of ChatGPT in 2022, GenAI and LLMs are now poised to touch all aspects of society, ranging from healthcare to automotive to agriculture.
These new GenAI LLM applications are compute intensive and require a new generation of compute: GPUs optimized for the algorithms used to train and run these LLMs. The difference between training and inference is:
- Training: You’re taking a deep learning model with up to a trillion weights and running it through days of training on trillions of words from the internet so the model can do what you want it to do. Essentially, you’re splitting training across many GPUs and running parallel computations across them all. Once the computations are done, the GPUs exchange the results and start the next round of computation (see the sketch after this list).
- Inference: Once you have trained the model, you need to run it in production to process users’ inputs and prompts.
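To make that compute-then-communicate rhythm concrete, here is a minimal sketch of one data-parallel training step, assuming a PyTorch-style setup with a NCCL backend over the GPU fabric. The `model`, `loss_fn`, and `batch` names are placeholders for illustration, not anything from the interview.

```python
import torch.distributed as dist

def train_step(model, batch, loss_fn, optimizer, world_size):
    # Assumes dist.init_process_group("nccl") has already been called
    # and each rank is pinned to its own GPU.

    # Compute phase: each GPU runs forward/backward on its own shard of data.
    loss = loss_fn(model(batch["inputs"]), batch["targets"])
    loss.backward()

    # Communicate phase: every GPU exchanges its gradients with all the others
    # (an all-reduce over the backend fabric), then applies the same averaged update.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()
    optimizer.zero_grad()
```

In practice, frameworks wrap this pattern in utilities like DistributedDataParallel and overlap communication with computation, but the traffic pattern is the same: every training iteration ends with a burst of gradient exchange across the backend fabric.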
The backend AI workload is quite different from the frontend application workload because there’s a lot of complex math and a need to exchange training results very quickly between GPUs in the AI cluster. The power requirement for a GPU rack running an AI workload can be 10x that of an x86 rack running typical cloud applications.
Ethernet vs InfiniBand
So, if you have an 800Gbps switch moving training data between GPUs, how much difference can the protocol used to move the data make? According to Forbes, “The data centers used to train and operate these models require vast amounts of electricity. GPT-4, for example, required over 50 gigawatt-hours, approximately 0.02% of the electricity California generates in a year, and 50 times the amount it took to train GPT-3, the previous iteration.”
A gigawatt-hour of electricity costs approximately $50K, so we are talking around $2.5M in power for a single training run. So yes, if you can speed up training or reduce the power draw of switches and servers by 10%, we are talking real money.
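The back-of-the-envelope math, using the figures above (the electricity rate is an assumption, roughly $0.05 per kWh):

```python
# Rough cost of the GPT-4 training energy figure cited above.
energy_gwh = 50          # ~50 gigawatt-hours of electricity for the training run
cost_per_gwh = 50_000    # ~$50K per gigawatt-hour, i.e. ~$0.05 per kWh
print(f"${energy_gwh * cost_per_gwh:,}")  # -> $2,500,000
```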
Similar to the front of the data center with its x86 servers, the GPU servers are connected with a high-speed fabric of switches to move training and inference data between servers. The difference is that there is a new protocol on the track to give our beloved Ethernet protocol a run for its money: InfiniBand. While the port speed of InfiniBand switches may not be as fast as their Ethernet counterparts, they are known for low latency.
But team Ethernet is closing this performance gap, and my money is on the old-time favorite to win the Networking for AI race.
Ethernet has a few things going for it compared to InfiniBand. For one, Ethernet switches are everywhere, so there’s a rich vendor ecosystem that keeps prices down, unlike InfiniBand, where a single dominant vendor makes it difficult to keep prices in check.
Ethernet’s vendor ecosystem also encourages innovation. The highest-performing Ethernet switch on the market today has at least twice the bandwidth of the fastest InfiniBand switch, so with Ethernet you only need about half the number of switches to build the same fabric.
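As a rough illustration of that point (the port counts and cluster size below are hypothetical, not vendor specs), here is how per-switch bandwidth translates into switch count for a simple non-blocking two-tier leaf-spine fabric:

```python
import math

def leaf_spine_switch_count(num_gpus, ports_per_switch):
    """Rough two-tier leaf-spine sizing: half of each leaf's ports face GPUs,
    half face the spine (non-blocking). Hypothetical numbers for illustration."""
    gpus_per_leaf = ports_per_switch // 2
    leaves = math.ceil(num_gpus / gpus_per_leaf)
    spines = math.ceil(leaves * gpus_per_leaf / ports_per_switch)
    return leaves + spines

# Doubling the per-switch bandwidth (e.g., 64 -> 128 ports at a given speed)
# roughly halves the number of switches needed for the same GPU cluster.
print(leaf_spine_switch_count(1024, 64))   # -> 48 switches
print(leaf_spine_switch_count(1024, 128))  # -> 24 switches
```

The exact numbers depend on topology and oversubscription, but the scaling is the point: double the bandwidth per switch and the switch count roughly halves.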
One area where InfiniBand has a bit of a lead is efficiency, since Ethernet switches have a slight tendency to become congested. Remember that many GPUs are all exchanging results during every iteration of training. That’s a very high-bandwidth exchange, and they do it over and over again: compute, communicate, compute, communicate. GPUs dominate the cost of the backend data center, so if your network switches aren’t efficient and are causing congestion, you’re not using the GPUs efficiently and you’ll need more GPUs, or more time, to complete the same training.
What is Juniper doing around standards of interoperability for Networking for AI in GPU clusters?
Ethernet vendors like Juniper need to be able to control congestion so that we’re not increasing the job completion time of AI workloads. Juniper is a proud member of the Ultra Ethernet Consortium, a collaboration of companies dedicated to enhancing Ethernet so it can handle high-bandwidth, low-latency communication and manage congestion properly to achieve a truly interoperable system. We want to ensure all the custom silicon we’re building complies with the new standards coming from the Consortium.
For more details on Networking for AI, watch the full interview here and check out the five top trends in data centers for 2024.