Artificial intelligence (AI) and machine learning (ML) are evolving so quickly that the possibilities for what they can accomplish seem limitless. Smart factories that continually optimize themselves, real-time video analytics, automated medical diagnostics: it's all on the table. Or at least, it should be. But one looming technical barrier can prevent the most innovative ML applications, along with many other compute-intensive workloads, from reaching their full potential: inflexible networks.
Many ML use cases work best as in-network applications, collecting data from many connected endpoints, analyzing it and acting on it in real time—locally, where the data is generated. The problem: this model creates very different traffic patterns than those for which conventional merchant Ethernet switches were designed. Because most merchant Ethernet switches can’t adapt to meet changing requirements—and because network hardware often stays in place for years—these switches can put a hard limit on the pace of innovation at the edge. At least, that can happen in networks that don’t use Juniper’s Trio 6 silicon.
Juniper Networks leads the industry in high-performance access routing, with hundreds of thousands of MX Universal Routing platforms deployed worldwide. Using our custom-designed Trio ASICs—the only networking silicon built for the multiservice edge—these systems deliver industry-leading programmability, adaptability and scale. Our customers know that when they use Trio, they can meet the needs of tomorrow, not just today, supporting new protocols and features years after their platforms were deployed. As the network becomes more essential to every business and network requirements grow more unpredictable, that flexibility will become even more important.
But how, exactly, does Juniper's approach to networking silicon translate to real-world outcomes? What can actually be done with Trio processors, and how big a difference do they make? Recently, Juniper engineers partnered with Professor Manya Ghobadi and others at the Massachusetts Institute of Technology (MIT) to find out. The joint Juniper-MIT team presented this work at ACM SIGCOMM 2022. The results? Trio more than lived up to its billing.
Growing Challenges at the Edge
Advances in AI/ML, the Internet of Things (IoT), the metaverse and other emerging applications dramatically increase the complexity and bandwidth requirements of edge networks. To capitalize on these innovations, organizations need ever-more performance, scale, intelligence, security and observability across networking equipment. And they need networking silicon that can accommodate massive traffic volumes flowing in unpredictable ways.
Consider an ML use case for an “Industry 4.0” smart factory, where hundreds of connected “things” communicate and interact at the edge. A manufacturer wants to train ML algorithms on the massive volumes of data generated in the factory, with the goal of identifying anomalies and taking automated actions to optimize quality and productivity in real time. To do it, they’ll need to aggregate data from many different endpoints in different ways and accommodate huge spikes of East-West traffic as the algorithm optimizes.
Juniper’s Trio 6, and the MX multiservice edge systems based on it, are designed for exactly these kinds of challenges. Each Trio ASIC provides 160 high-performance packet processing engines optimized specifically for networking workloads, and supports software running in thousands of threads, gigabytes of storage, high-performance compute engines, high-capacity packet filters and embedded MACsec and IPsec crypto engines. Trio pairs that performance with a programmable data plane, so it can continue to support new functionality throughout the life of an MX platform.
This kind of flexible, high-performance networking is becoming essential for emerging ML jobs, many of which have become so large that organizations need many machines working in parallel, often over many days, to complete a training cycle. As a result, researchers are exploring new in-network aggregation strategies to improve learning times. The Trio 6 architecture should be perfect for this task and should outperform systems designed for traditional two-way traffic flows. The MIT research team decided to put that hypothesis to the test.
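To make the idea of in-network aggregation concrete, here is a minimal, hypothetical sketch: instead of every worker exchanging gradients with every peer, a switch-resident aggregator sums the per-worker gradient vectors and returns a single result. The function name and data shapes below are illustrative assumptions, not APIs from the paper or from Juniper software.

```python
# Hypothetical sketch of in-network gradient aggregation: an aggregator in
# the network sums gradient vectors from N workers, so each worker exchanges
# one vector with the network instead of N-1 vectors with its peers.

def in_network_aggregate(gradients):
    """Element-wise sum of per-worker gradient vectors (all the same length)."""
    n = len(gradients[0])
    assert all(len(g) == n for g in gradients), "workers must send equal-size vectors"
    return [sum(vals) for vals in zip(*gradients)]

workers = [
    [0.1, 0.2, 0.3],   # gradients from worker 0
    [0.4, 0.5, 0.6],   # worker 1
    [0.5, 0.3, 0.1],   # worker 2
]
aggregate = in_network_aggregate(workers)  # one vector back to every worker
```

The point of doing this in the switch rather than on a server is that the sum happens at line rate, on data that is already flowing through the device.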
The computer scientists built two testbeds to simulate very heavy ML operations in a large-scale cloud environment. One was driven by Trio and one by a Tofino programmable switch, a platform that already delivers significant learning time improvements compared to conventional switches and servers. The team aimed to compare in-network aggregation performance and determine whether Trio could reduce time-to-accuracy in distributed ML training workloads. Specifically, they wanted to measure Trio’s ability to optimize these workloads even in the presence of “stragglers”: servers in a cluster that perform more slowly than others.
In shared clusters working on multiple jobs, different servers often experience performance jitter due to congestion, load imbalance, resource contention, and other issues. As a result, servers working on the same job must wait for the slowest machine, decreasing overall application performance. This “straggler problem” can have a huge impact on the performance of ML workloads—along with a wide range of other distributed applications, including MapReduce, Spark, database queries and key-value stores. Google cites stragglers as a primary cause of poor performance in data processing jobs, and data from Facebook and Microsoft indicate that they can make ML jobs perform as much as 8x slower.
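The arithmetic behind the straggler problem is simple: in a synchronous training step, every worker must wait for the slowest one, so step time is governed by the maximum, not the average, of the per-worker times. The sketch below is our own illustration of that effect, not a measurement from the paper; the timings are made up.

```python
# Illustrative sketch: synchronous step time equals the slowest worker's
# time, so a single straggler dominates the whole cluster's progress.

def sync_step_time(worker_times_ms):
    """Time for one synchronous step: everyone waits for the slowest worker."""
    return max(worker_times_ms)

healthy = [10, 11, 10, 12, 11, 10, 11, 10]     # per-step times, ms (assumed)
with_straggler = healthy[:-1] + [80]           # one machine running ~8x slow

print(sync_step_time(healthy))         # 12
print(sync_step_time(with_straggler))  # 80
```

Seven of eight machines finish in 10–12 ms, yet the cluster's effective step time jumps from 12 ms to 80 ms, which matches the order of slowdown the Facebook and Microsoft data describe.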
To see how well the Trio architecture could address this issue, the joint MIT-Juniper team applied the ASICs to aggregate training weights in distributed ML training jobs and to detect and mitigate the effects of performance stragglers. The results: while the Tofino switch kept needing more time as it waited for stragglers, the MX system was able to mitigate the issue and maintain close-to-ideal training times. In fact, the team showed that for certain training sets, Trio improved convergence times by up to 1.8x.
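One intuition for how a programmable data plane can mitigate stragglers is to stop waiting for them: aggregate whatever gradient contributions arrive before a per-round deadline and rescale the result, letting late contributions be skipped. The sketch below illustrates that general idea under our own simplifying assumptions; it is not the specific algorithm from the MIT-Juniper paper.

```python
# Hedged sketch of deadline-based partial aggregation: sum only the gradient
# contributions that arrive before a round deadline; straggler contributions
# are skipped, and the sum is rescaled to a mean over on-time workers.

def partial_aggregate(contributions, deadline_ms):
    """contributions: list of (arrival_time_ms, gradient_vector) pairs."""
    on_time = [g for t, g in contributions if t <= deadline_ms]
    if not on_time:
        return None  # nothing arrived in time; caller must retry the round
    dim = len(on_time[0])
    total = [sum(g[i] for g in on_time) for i in range(dim)]
    # Divide by the on-time count so the update magnitude stays stable
    # no matter how many workers made the deadline.
    return [x / len(on_time) for x in total]

arrivals = [
    (3,  [1.0, 2.0]),   # fast worker
    (4,  [3.0, 4.0]),   # fast worker
    (90, [5.0, 6.0]),   # straggler, misses the 10 ms deadline
]
print(partial_aggregate(arrivals, deadline_ms=10))  # [2.0, 3.0]
```

Because the round closes on a deadline rather than on full participation, a slow machine delays nothing: the training step completes at roughly the healthy workers' pace, which is the behavior the testbed results showed for the MX system.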
That kind of acceleration can have a significant impact on what organizations can do with ML and other distributed workloads at the edge. Equally important: the team achieved that level of improvement for an application that did not exist when the Trio chipset was built—a testament to the value of a programmable data plane.
The potential use cases for high-performance networks with programmable data planes are nearly limitless. In just the next few years, we can expect to see more advanced in-network computing applications, more intelligent telemetry, faster anomaly detection and mitigation and more. Juniper is committed to ensuring our customers can take full advantage of this potential.
With the ability to update microcode for Trio 6, we can continually add support for new features, protocols and protocol extensions, all without requiring customers to swap hardware, as other vendors would require. In fact, we’ve already done this repeatedly. For example, when Segment Routing and Bidirectional Forwarding Detection (BFD) emerged, Juniper MX Series customers were able to roll these protocols out on their existing hardware, via software updates, with no performance penalty.
While much of the industry moves in the opposite direction—offering fewer, more generic networking options—Juniper will continue investing in silicon that can continually adapt to meet new needs. We’ll keep evolving Trio ASICs and MX systems and software to enable a higher-capacity, more efficient, higher-performing programmable data plane.
With so much potential for new multiservice edge applications, and so much still unknown, that investment makes sense. And we can’t wait to see where the researchers, partners and customers joining us on this journey take their networks next.
For more details about the difference Trio silicon can make for real-world ML workloads, download the MIT-Juniper research paper.