This is a joint blog post authored by Kirk Compton (WW Director, Partnerships, SambaNova Systems) and Ben Baker (Sr. Director, Data Center Marketing, Juniper Networks)
Enterprises face mounting pressure to use recent advancements in artificial intelligence (AI) to transform their businesses. Generative AI has quickly transitioned from an R&D project to a boardroom imperative across industries.
However, when it comes to AI, most enterprise IT teams still don’t know where to begin. Many IT departments are currently struggling to attract, train, and retain the right talent. Worse, AI brings all sorts of new technologies and an alphabet soup of new protocols to sift through when building out new data center infrastructure for AI workloads. And this is before you discover the often-excessive lead times for the GPU infrastructure required.
As a result, we’ve seen many projects falter. Several key issues have emerged as common stumbling blocks:
Rush to adoption: The fear of missing out (FOMO) on the technology has prompted many organizations to rush into adopting AI without fully considering its complexities and limitations.
Maturing technology: Generative AI is still evolving, and many tools and features are in their early stages. Organizations may encounter unexpected errors or limitations when implementing generative AI solutions.
Data management and governance: Organizations need to ensure their data access management is under control, as AI adoption typically raises new security and privacy issues. This means adopting processes, policies, and technologies to manage and control data effectively throughout its lifecycle.
AI Infrastructure as a Service
Time to market matters. The rapidly advancing AI market moves much faster than the typical enterprise. Fortunately, new business models offering AI Infrastructure as a Service (IaaS) are delivering smooth cloud experiences to enterprises trying to navigate the complexities of AI. That complexity creates uncertainty and doubt in the minds of IT executives about what to deploy (xPU, network, storage, server) and where to deploy it (existing data center, new data center build, co-location, cloud). Organizations want speed to market and are searching for validation (components and interoperability) and agility (automation, auto-tuning for scale and change), constantly weighing the risk versus reward of different courses of action.
Part of the debate has already been settled: the present and the future of data center infrastructure is hybrid. The vast majority of companies will continue to operate on-premises data centers as well as use public clouds for a variety of workloads, both AI and traditional. To execute AI transformation projects, many enterprises are turning to AI IaaS companies such as SambaNova Systems.
AI Training and Inference
Training AI models is a massively parallel compute problem that relies on dependable network infrastructure. Job completion time is the key metric. Do it right, and all GPUs work together efficiently. Do it wrong, and the network becomes a bottleneck and expensive GPU time is wasted. To train an AI model, massive data sets are chopped into smaller chunks and distributed as flows across a backend data center fabric to dispersed GPUs that perform calculations. Individual results are transported back over the network and merged before another training run begins. Any straggler GPU computation creates a delay known as tail latency. As the training process repeats, tail latencies compound and can add significant delays to the overall goal, which is to move the AI model from training to inference so end users can interact with it. The network is critical to the process.
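To illustrate why a single straggler matters, here is a minimal, hypothetical Python sketch. The cluster size, step times, and straggler probabilities are illustrative assumptions, not SambaNova or Juniper measurements. It models a synchronous training step whose completion time is set by the slowest GPU, then shows how those tail latencies compound across many steps.

```python
import random

def training_step_time(num_gpus: int, mean_ms: float,
                       straggler_prob: float, straggler_ms: float) -> float:
    """Job completion time (ms) for one synchronous training step.

    Each GPU normally finishes near `mean_ms`, but with probability
    `straggler_prob` it is delayed by `straggler_ms` (for example, due to
    congestion on the backend fabric). The step ends only when the slowest
    GPU reports back, so one straggler stalls the whole cluster.
    """
    times = []
    for _ in range(num_gpus):
        t = random.gauss(mean_ms, mean_ms * 0.05)
        if random.random() < straggler_prob:
            t += straggler_ms  # straggler: delayed result exchange
        times.append(t)
    return max(times)  # the step finishes when the last GPU does

random.seed(0)
steps = 10_000
ideal = sum(training_step_time(512, 100.0, 0.0, 0.0) for _ in range(steps))
actual = sum(training_step_time(512, 100.0, 0.01, 50.0) for _ in range(steps))
print(f"Extra time lost to stragglers: {(actual - ideal) / 1000:.0f} seconds "
      f"over {steps} steps")
```

With a 1% per-GPU chance of a 50 ms delay, almost every step of a 512-GPU job hits at least one straggler, and the lost time grows with every additional training step.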
But industry attention is turning toward AI inference and the data center infrastructure required to deliver it. No one knows precisely how much inference capacity the global IT market might consume in the next several years, but the consensus is that it will be a multiple of the installed base required for AI training. Claims that “inference is going to be 100 times bigger than training” may be hyperbole, but inference is the ongoing application of AI in real-world use cases. Model training occurs less often, so it stands to reason that inference will be a much bigger market.
Inference is where end users interact with the AI model. Users query a trained AI model, which delivers a response. Latency is critical here, with tokens per second being a key metric. Again, the network, which connects the necessary compute and storage engines, is critical. If you want your AI interface to feel like a conversation, even a one-second delay is too long.
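As a rough, hypothetical illustration (the token counts, serving speeds, and time-to-first-token value below are assumptions, not vendor benchmarks), here is a short Python sketch of how tokens per second translates into the response latency a user perceives.

```python
def response_latency_s(num_tokens: int, tokens_per_second: float,
                       time_to_first_token_s: float = 0.2) -> float:
    """Rough end-to-end latency estimate for a generated response.

    Latency is the time to the first token plus the time to generate the
    remaining tokens at the serving rate. All values are illustrative.
    """
    return time_to_first_token_s + num_tokens / tokens_per_second

# A ~100-token answer at different serving speeds:
for tps in (50, 200, 1000):
    print(f"{tps:>5} tokens/s -> {response_latency_s(100, tps):.2f} s")
```

At 50 tokens per second, a 100-token answer takes more than two seconds to complete and the interaction stops feeling conversational; at higher serving rates it stays comfortably under one second.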
“Scalable, reliable, and operationally simple networking is required for fast, high fidelity AI training and inference. Juniper’s data center portfolio is enabling SambaNova to deliver AI infrastructure solutions to help our customers fast-track AI initiatives to transform their businesses.”
– Marshall Choy, SVP, Product, SambaNova Systems
Specialized, high-performance AI/ML
SambaNova provides hardware and software solutions that deliver specialized, high-performance AI and machine learning capabilities, with a focus on advanced computing platforms for deploying AI training and inference models.
SambaNova hardware, including Reconfigurable Dataflow Units (RDUs), is designed to deliver high-performance computing that outperforms traditional CPUs and GPUs on certain AI tasks. Juniper works closely with SambaNova architects to ensure the network design is optimized for current and planned RDUs. Future areas of collaboration between the two companies may include the Juniper® Apstra data center fabric management and automation solution, which simplifies AI networking deployment and operation, as well as the Juniper QFX5240, which became the first 800G switch from an OEM to reach the market in early 2024.
SambaNova’s solutions are often used to create dedicated AI clusters for specific tasks like natural language processing (NLP), with full-stack AI platforms that include both hardware and software optimized for deep learning and other AI workloads. One of SambaNova’s strengths is enabling large-scale AI model training, as well as inference, with faster and more efficient hardware.
- Training as a Service: Customers are billed based on the time spent using SambaNova-powered hardware and the volume of data processed.
- Inference Services: After training, AI models need to be deployed for real-time use. CSPs can charge for inference tasks running on SambaNova-powered infrastructure, offering low-latency, high-performance AI inference solutions.
SambaNova’s systems are optimized for dataflow processing, with tools for managing AI data pipelines and optimizing AI models for better performance. As with other AI models hosted in the cloud, SambaNova’s AI models can be hosted, managed, and scaled via cloud services. All of this rests on a foundation of critical infrastructure from Juniper. Every SambaNova customer is running their workloads through Juniper Networks QFX Series Switches.
SambaNova delivers superior value, either directly to enterprises or to cloud service providers who provide managed AI services to their end customers. The AI IaaS model aggregates capacity to ensure full utilization. No risk of stranded on-premises enterprise data center assets. No need for enterprises to upgrade power delivery or cooling systems in their data centers to handle intensive AI workloads. And no need to rely on hard-to-find (and retain) IT talent. SambaNova does it all for you.
Come see us at SC24
SambaNova (booth #2142) and Juniper (booth #2309) will be at SC24 in Atlanta, November 17th – 22nd. We’re excited to meet our customers and partners, engage in meaningful discussions, and explore the art of what’s possible! Stop by and see us.