Whiteboard Blog Series: AI for IT and Anomaly Detection in Networks

Every network is different, yet all are composed of similar components. We all use the same protocols, but the user, application and device fingerprints are unique to each organization. This poses a big challenge for anomaly detection performed by a widely deployed platform: no single model fits all. The fundamental question is then, what is anomalous? A burst in traffic or a spike in WLAN association failures may fall outside of some standard deviation, but is that neutral or bad? Answering this question requires more context and feature engineering. We must also remember that the desired outcome involves identifying the root cause and empowering an entity to move forward with an interactive, automated or autonomous fix.
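To make the "outside of some standard deviation" idea concrete, here is a minimal sketch of a z-score check in Python. The function name, the sample data and the threshold are illustrative assumptions and are not taken from the Mist platform. It also shows why this test alone is not enough: it flags a spike but says nothing about whether the spike is neutral or bad.

```python
from statistics import mean, stdev

def zscore_outliers(samples, threshold=3.0):
    """Return indices of samples more than `threshold` standard deviations from the mean.

    Hypothetical helper for illustration only.
    """
    mu = mean(samples)
    sigma = stdev(samples)  # sample standard deviation
    if sigma == 0:
        return []  # a flat series has no outliers
    return [i for i, x in enumerate(samples) if abs(x - mu) / sigma > threshold]

# Made-up counts of WLAN association failures per minute.
failures_per_minute = [4, 5, 3, 6, 4, 5, 4, 40, 5, 4]
spikes = zscore_outliers(failures_per_minute, threshold=2.0)
print(spikes)
```

Note that a single large spike inflates the standard deviation itself, so the chosen threshold strongly affects what gets flagged. This is one reason a fixed statistical rule does not transfer well from one network to another.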

Figure 2.0 Evolution of methods and tooling for anomaly detection.

Marvis, the AI entity powering the Mist platform, has evolved through multiple iterations of its anomaly detection engine (Figure 2.0). What started out as a basic moving average technique progressed to ARIMA (autoregressive integrated moving average) and then to the current version 3.0, which leverages an LSTM (long short-term memory) RNN (recurrent neural network). In this version, a unique model is built per customer site using deep learning techniques that allow predictions to be made from sequences of time-series data (even when there are cyclical or seasonal gaps). By employing a unique model per customer, the learned baseline is highly accurate, and although training is computationally expensive, the model only needs to be retrained when predictions consistently fall outside of a predetermined deviation. This internal workflow is adaptive and closed-loop, so the platform learns continuously and autonomously. This alone imparts real value to network operators, but we still want to go further and find the root cause of the anomaly.
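The retraining rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a simple moving-average forecaster (the earliest technique named above) stands in for the LSTM model, and the names `moving_average_forecast` and `RetrainMonitor`, along with the `tolerance` and `patience` parameters, are hypothetical rather than part of any Mist API.

```python
from collections import deque

def moving_average_forecast(history, window):
    """Predict the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)

class RetrainMonitor:
    """Signal a retrain when predictions miss by more than `tolerance`
    for `patience` consecutive observations (hypothetical helper)."""

    def __init__(self, tolerance, patience):
        self.tolerance = tolerance
        self.patience = patience
        self._misses = deque(maxlen=patience)  # sliding record of recent misses

    def observe(self, predicted, actual):
        self._misses.append(abs(actual - predicted) > self.tolerance)
        # Retrain only when the window is full and every entry is a miss.
        return len(self._misses) == self.patience and all(self._misses)

# Made-up series whose level shifts from about 11 to about 30.
series = [10, 11, 10, 12, 11, 30, 31, 29, 32]
monitor = RetrainMonitor(tolerance=5.0, patience=3)
flags = []
for t in range(4, len(series)):
    predicted = moving_average_forecast(series[:t], window=4)
    flags.append(monitor.observe(predicted, series[t]))
print(flags)
```

A single miss does not trigger retraining; only a sustained run of misses does. That separates a one-off anomaly, which should be reported, from a lasting change in the baseline, which should be relearned.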

Once seasonal baselines are filtered out, temporal correlation is performed by looking at events across a window of time-series data, with priority placed on ‘what changed’ recently. A method called mutual information, which measures how much knowing one variable reduces uncertainty about another, is then applied. This focuses the search on events with a high conditional probability of having influenced the anomalous event. This scope reduction and analysis allow the platform to suggest automated remedial actions. With the anomaly detected, diagnosed and then remediated through those recommended actions, AIOps and self-driving networks now enter the mainstream and reset industry expectations for how networks are managed and supported.
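As a rough illustration of the mutual information step, the sketch below computes the mutual information (in bits) between two discrete series and uses it to rank candidate events against an anomaly indicator. The event names and data are invented for the example, and real pipelines would likely work with richer features than binary flags per time window.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information I(X; Y) in bits for two equal-length discrete series."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * log2(count * n / (px[x] * py[y]))
    return mi

# 1 marks a time window in which the anomaly (or the event) occurred.
anomaly = [0, 0, 1, 1, 0, 1, 0, 0]
candidates = {
    "config_change":    [0, 0, 1, 1, 0, 1, 0, 0],  # hypothetical event
    "firmware_upgrade": [0, 1, 0, 1, 0, 0, 1, 0],  # hypothetical event
}
ranked = sorted(
    candidates,
    key=lambda name: mutual_information(candidates[name], anomaly),
    reverse=True,
)
print(ranked)
```

Events that share more information with the anomaly indicator rise to the top of the list, which narrows the set of likely causes before any remedial action is suggested. Mutual information alone shows association, not causation, so the emphasis on recent changes described above still matters.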

The AI data science toolbox is deep, wide and ever-evolving, but its real value is unlocked in the delivery of actionable insights and relevant recommendations. Building an AI-driven platform brings many other tools and challenges, including large-scale system and product design, yet at its core there must be trust in the machine learning methods and workflows used. This journey to build trust starts early on with the selection of the right data and primitives and never quite ends. Continuous integration and delivery are the best ways to adapt to an ever-changing environment, and when the platform evolves from automated to autonomous, we can witness a truly AI-driven enterprise and environment unfold.

Introducing the Technical Whiteboard series

Foundational concepts of the AI-driven enterprise lead to a set of core tools. We cover a few of them at a high level to introduce the technical concepts used in our AI. The technical whiteboard series is also paired with demos of our products in action to show how these core tools are applied in practice. Take a look: Whiteboard Technical Series: Artificial Intelligence Overview