Whiteboard Blog Series: Artificial Intelligence for IT Operations and Why It Matters

Artificial intelligence (AI) is increasingly being used to augment the capabilities of IT operations teams, a practice commonly known as AIOps. AIOps has met significant resistance from some IT teams and been enthusiastically embraced by others. So why AIOps? What’s in it for the organization, and for the individual members of IT teams?

Let’s start off by addressing the enormous elephant in the room. Given the significant disruption that digitization has brought to other segments of the workforce, IT teams are understandably wary that AIOps might mean they will lose their jobs to a computer.

To IT teams everywhere: relax. AIOps does not mean the robots are coming for your job. Quite the opposite, in fact.

AIOps has been around in one form or another for years. It got its current buzzword name a few years ago and has since become part of the portfolios of major tech vendors, including Juniper. Data exists on the how and why of AIOps, and it’s very clear that what drives AIOps adoption is the need to cope with both scale and rapid growth, and emphatically not a desire to reduce headcount.

Human limits

Every person has a limit on how much information they can keep in their working memory, how many variables they can track and how quickly they can respond to new problems. You can compensate for the limits of an individual person by adding more people to a team… but only up to a point.

Past a certain level of system complexity (or rate of growth), adding people to a team doesn’t help. IT requires systems thinking, and systems thinking requires the ability to hold most (if not all) of the variables in your head. We can bypass this to a certain extent by breaking teams up into areas of responsibility, but eventually someone (or a group of someones) has to be able to see the whole picture and understand that if you push this over here, that changes over there.

When things get too complicated, we turn to abstractions. Management interfaces, automation, orchestration, visualizations, analyses and reports are the tools of the modern IT team. The complexities of storage, compute and networking have all been abstracted away behind various levels of virtualization, even to the point that larger IT teams regularly (and increasingly programmatically) create and destroy entire virtual data centers.

The problem with all these layers of abstraction, however, is that they obscure the cascade effects of changes to the system. You can abstract your storage away all you want, but it is absolutely still possible to do something to a SAN that causes it to behave badly, impacting production.

The more mature our technology becomes, the less frequently we run into these problems. As a result, when we do run into them, the problems are either more obscure or very infrequent, and thus more difficult to solve.

Augmenting IT teams

AIOps technologies bridge the knowledge gap that opens up when we depend on the abstractions in our management tools to cope with complexity, growth and/or scale. In one form or another, all AIOps AIs learn what “normal” looks like and raise a flag when things look abnormal. In this, they are much like a classic Security Information and Event Management (SIEM) system.
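To make "learning what normal looks like" concrete, here is a minimal sketch that flags a reading as abnormal when it falls far outside a baseline built from past readings. It is an illustration only, not how Marvis or any other product works: the function name `is_abnormal`, the three-standard-deviation threshold and the latency figures are all assumptions made for the example.

```python
from statistics import mean, stdev


def is_abnormal(history, value, threshold=3.0):
    """Return True if `value` is more than `threshold` standard
    deviations away from the mean of `history` (a simple z-score test)."""
    if len(history) < 2:
        return False  # too little data to say what "normal" is
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Every past reading was identical, so any change is abnormal.
        return value != baseline
    return abs(value - baseline) / spread > threshold


# Hypothetical latency readings (milliseconds) gathered during normal operation.
latency_ms = [20, 22, 19, 21, 20, 23, 21, 20]

print(is_abnormal(latency_ms, 22))  # within the usual range
print(is_abnormal(latency_ms, 95))  # far outside the usual range
```

Real systems track many metrics at once and account for things like time of day, but the principle is the same: build a baseline, then measure how far new data strays from it.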

SIEMs, however, merely throw up alerts when something has gone wrong. AIOps products, including Juniper Networks’ Marvis Virtual Network Assistant, keep track not only of what has gone wrong, but also of how the problem was resolved. They learn that if A behaves like X, then applying solution Y resolves the issue. And they remember this, even if the issue doesn’t recur for years.

IT teams have, of course, been doing this for decades using ticketing systems. Unfortunately, ticketing systems rely on search, which means appropriate metadata, semantic tagging and so forth all become important. Finding the data you need depends entirely on someone having logged the last incident accurately and in detail… and humans are notoriously bad at documentation.

AIOps AIs, however, are fantastic at documentation. If they have access to all the statistics, logs, tickets, help queries and so forth that are related to a given incident, then they can store all that data as being related to that specific type of incident for as long as they exist. Every time something abnormal happens, the AI can review everything it has ever learned, see if the new event looks anything at all like previous events, and quickly provide IT teams with insight into how they might solve it.
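As a rough illustration of "see if the new event looks anything at all like previous events," the sketch below stores each past incident as a set of symptom tags plus the fix that worked, then ranks past incidents by how much their symptoms overlap with a new event. The `Incident` class, the tag names and the choice of Jaccard similarity are assumptions made for this example; a real AIOps product would draw on far richer data and models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    symptoms: frozenset  # tags observed during the incident
    resolution: str      # what fixed it last time


def jaccard(a, b):
    """Overlap between two tag sets: 0.0 (nothing shared) to 1.0 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(history, symptoms):
    """Return (incident, score) for the closest past incident,
    or (None, 0.0) if nothing in the history shares any symptom."""
    best, best_score = None, 0.0
    for incident in history:
        score = jaccard(incident.symptoms, symptoms)
        if score > best_score:
            best, best_score = incident, score
    return best, best_score


# Hypothetical incident history.
history = [
    Incident(frozenset({"dhcp_timeout", "vlan_mismatch"}),
             "correct the VLAN on the switch port"),
    Incident(frozenset({"high_latency", "san_queue_full"}),
             "rebalance the storage paths"),
]

new_event = frozenset({"dhcp_timeout", "vlan_mismatch", "client_roaming"})
match, score = most_similar(history, new_event)
print(match.resolution, round(score, 2))
```

Unlike a ticket search, nothing here depends on a person remembering the right keywords: the match comes from the recorded symptoms themselves.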

If this is starting to sound like it might actually be useful, well then buckle up, because it gets a whole lot better.

AIOps to the future

An AIOps AI learning from your IT team’s events over time does sound like it would get progressively more useful, but it also feels like it would take a long time to get there. It’s common knowledge that AIs need to train on largeish datasets, and if one of the problems AIOps is trying to solve is that humans are awful at documentation, we can’t exactly point the AI at the ticketing system and have magic happen.

But what happens if we don’t limit the AI to only your organization? What happens if the AI learns from ALL organizations? What if the vendor behind that AI is constantly adding knowledge both in the form of hardcoded answers to known problems and by expanding the knowledge base of the AI with the help of participating organizations willing to help make the AI better?

Suddenly, the AI’s capabilities grow far faster than any single organization’s data would allow, and IT teams have access not only to fixes for the obscure infrastructure errors their own organization has encountered in the past, but also to fixes for the obscure infrastructure errors that every participating organization has encountered. And past a certain threshold of confidence in the solution, those AIs can even be set up to apply the fix automatically, without IT intervention.
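The confidence threshold can be pictured as a simple gate. In this sketch, a proposed fix is applied automatically only when the confidence score clears a bar chosen by the IT team; otherwise it is handed to a human as a recommendation. The function name `decide`, the action labels and the 0.95 default are hypothetical and not taken from any product.

```python
def decide(confidence, auto_apply_threshold=0.95):
    """Choose what to do with a proposed fix, given a confidence
    score between 0.0 and 1.0."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence >= auto_apply_threshold:
        return "auto_apply"
    return "recommend_to_operator"


print(decide(0.99))  # clears the bar, so the fix is applied automatically
print(decide(0.60))  # below the bar, so a human makes the call
```

Where the bar sits is a policy decision: a team can start with everything going to a human and lower the threshold as it gains trust in the suggestions.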

Does this involve doing work that IT teams were doing before? Yes. But it is the automation of the parts of the job we all loathe and are demonstrably bad at. AIOps isn’t a buzzword for robots that have come for our jobs; it’s a term for products that free us from yet another piece of tedium that stands in the way of doing our jobs, and that allow organizations to cope with larger, more complex and more rapidly growing networks than they ever could with humans alone.