HPE Networking Chief AI Officer Bob Friday and I recently participated in a podcast with Tech Field Day. The starting premise of the show was, “Data center networking needs AI,” with which we wholeheartedly agree. Bob has spent the last 10 years as a pioneer in bringing AIOps to networking. He kicked off the discussion by articulating HPE Networking’s journey toward the Self-Driving Network™, a vision he explained in his recent 6-part blog series. We then dug deeper into what AI means for data center networks in particular and outlined some of the work we’ve been doing in this area. This is what we’ll cover in this short, two-part blog series.
Is there a problem to be solved?
People are afraid to touch their networks. It sounds ridiculous, but most network engineers are nervous before, during, and after a change, whether that’s provisioning a new service, performing a firmware upgrade, or something else entirely. The process is stressful, and operators worry that if they touch the network they may break it.
The fundamental problem underlying this anxiety in data center networking is complexity: an alphabet soup of dozens of protocols to understand, configurations spanning perhaps thousands of physical and logical devices, multiple infrastructure vendors to manage. The list goes on. Combine this complexity with the endless flood of data operators are subjected to and the inadequate troubleshooting tools on the market today, and data center networking teams are often overwhelmed. They end up drowning in data but starved of insights.
This is precisely the type of situation where clever AI and machine learning algorithms can be useful: mountains of data that are tough for humans to sift through, but for the most part the data is actually fairly well-structured.
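As a simple illustration of what “well-structured” buys you, here’s a hand-rolled sketch (not anything from our products) of flagging a sudden spike in an interface error counter against a rolling baseline. The counter values, window size, and threshold are hypothetical; a real AIOps pipeline applies this kind of statistical reasoning across millions of telemetry points with far more sophisticated models.

```python
from statistics import mean, stdev

# Hypothetical, simplified telemetry: per-interval CRC error counts for one interface.
crc_errors = [2, 1, 3, 2, 0, 2, 1, 148, 2, 1]

def flag_anomalies(samples, window=5, threshold=3.0):
    """Flag samples that deviate sharply from the recent baseline (simple z-score test)."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            anomalies.append((i, samples[i]))
    return anomalies

print(flag_anomalies(crc_errors))  # [(7, 148)] -- the spike stands out from the baseline
```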
It’s not about the technology, it’s about what you need
Too often, technology discussions start in the wrong place: with the technology itself. Many of us have heard that edict from above in our organizations: “We as a company need to use AI or we’ll get left behind.” Or maybe you’re hearing it from a vendor: “You need to use AI in your data center network.” But these are aimless starting points. The starting point for any technology discussion should be your goals. What problems do you have that need to be solved?
Our goal has always been very clear: deliver the best possible user experience, whether that “user” is the data center network operator or the end user who, knowingly or not, relies on the data center for the applications they use. Over the last 10 years, Juniper has developed and applied AI, incredibly successfully, first to Wi-Fi and then to the rest of the campus and branch domain. In the data center we have a second-mover advantage, able to apply all of the lessons learned in the campus.
Data center network operational challenges
If the goal is to optimize the user experience, then understanding the challenges data center practitioners face takes close collaboration with our customers. We group these challenges into three categories: limited insights, insufficient speed, and poor reliability.
- Insights. In a typical scenario, the network lead at an enterprise gets a call from a very unhappy executive complaining that the CRM app, the ERP system, or some other critical app is down. It must be the network, right? Well, maybe not. Often network teams have limited insight into the applications running over their network. But let’s be clear: the whole point of a data center is to host and deliver the applications that end users need, whether it’s something relatively frivolous for a consumer or a mission-critical app for a business.
- Speed. Another problem we see is insufficient speed or agility when it comes to managing IT infrastructure. When an urgent new business requirement arises, such as adding capacity quickly for an unforeseen spike in demand, a company’s standard change management processes and maintenance windows don’t cut it: they’re too slow or too rigid.
- Reliability. Finally, poor reliability and downtime are an ongoing concern. Nearly every enterprise is just one hasty change or manual configuration error away from bringing down the entire network, with huge reputational and financial impacts to the company, not to mention your career. If you don’t have fast, robust automated recovery and rollback (sketched below), it may take hours or days of intensive work to restore services. Data center network operators too often find themselves in reactive mode, firefighting and constantly dealing with emergencies instead of focusing on the proactive, strategic IT initiatives that help the business.
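To see why automated rollback matters so much for recovery time, here’s a toy sketch (illustrative only, not how any particular product does it): keep the last known-good configuration and revert to it automatically when a post-change health check fails. In a real fabric the health check is far richer, but the principle is the same: recovery time is bounded by software, not by hours of manual troubleshooting.

```python
from dataclasses import dataclass, field

@dataclass
class ConfigStore:
    """Toy version-history store: remembers every config that passed its health check."""
    snapshots: list = field(default_factory=list)

    def apply(self, config: dict, health_check_passed: bool) -> dict:
        """Apply a change; if post-change verification fails, roll back to the last good config."""
        if health_check_passed:
            self.snapshots.append(config)
            return config
        return self.snapshots[-1] if self.snapshots else {}

store = ConfigStore()
store.apply({"vlan": 100}, health_check_passed=True)            # change verified and kept
active = store.apply({"vlan": 999}, health_check_passed=False)  # verification failed, reverted
print(active)  # {'vlan': 100}
```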
Before we blindly jump into AI as a panacea, we have to thoroughly understand these problems and ask ourselves: can AI help solve them? The answer is absolutely yes.
AI is necessary, but not sufficient
But AI cannot solve every NetOps problem you have. AI is necessary, but not sufficient. We need systems that use both AI, which is fundamentally probabilistic, and deterministic approaches such as intent-based networking.
Is it OK to be 99% right on a configuration? No, you want to be 100% right, which requires rules-based, deterministic software. But on Day 2, when the data center is operating out in the wild in unpredictable environments, it’s a different calculation. If you have a system that can tell you, with 99% accuracy and based on a myriad of symptoms, the root cause of a problem you’re seeing, then that’s probably better than what you have today. And this is the power of AI: to sift through massive amounts of data and surface the correlations that humans cannot easily find.
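To make the distinction concrete, here’s a minimal, hypothetical sketch (the rules and scoring are ours for illustration, not actual product code). Intent validation is a pass/fail, rules-based check where anything short of 100% is a failure, while AI-assisted root-cause analysis ranks candidate causes by how well they explain the symptoms observed, where a 99%-confident answer is enormously useful.

```python
# Deterministic (intent-based): a rendered config either satisfies the intent or it does not.
def validate_intent(intent: dict, rendered_config: dict) -> bool:
    """Day 0/1: every declared intent must hold exactly; 99% right is a broken fabric."""
    return all(rendered_config.get(key) == value for key, value in intent.items())

# Probabilistic (AI-assisted): score candidate causes by how well they explain the symptoms.
def rank_root_causes(symptoms: set, evidence_by_cause: dict) -> list:
    """Day 2: rank likely root causes given the symptoms actually observed."""
    scores = {
        cause: len(symptoms & evidence) / len(evidence)
        for cause, evidence in evidence_by_cause.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(validate_intent({"mtu": 9100, "vlan": 200}, {"mtu": 9100, "vlan": 200}))  # True
print(rank_root_causes(
    {"crc_errors", "flapping_link"},
    {"bad_optic": {"crc_errors", "flapping_link"}, "mtu_mismatch": {"packet_drops"}},
))  # [('bad_optic', 1.0), ('mtu_mismatch', 0.0)]
```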
Put the two technologies together, AI and intent-based networking, and you can deliver that unbeatable experience for network operators and for the end users of their applications.
We’ll dig into how we’re using AIOps to solve stubborn data center networking challenges in the second and final part of this blog series.