On the journey to automated network operations, many organizations begin by aligning with internal software development teams and applying lessons learned in the software space to help make network infrastructure and operations more resilient.
One of the most impactful examples is the idea of automated testing. For years, software engineers have used development methodologies like “test-driven development” to keep software quality high and to discover bugs earlier in the development process, where they can be fixed more easily and cost-effectively. Rather than simply making changes on a whim, developers first commit their operational knowledge of the system into automated tests, which must pass in order for the software to be considered “working”.
Every Network Change Rolls the Dice
In stark contrast, network changes are often made with little enforcement that “what it should be” matches “what it really is”. Usually, a core set of senior engineers keeps a mental picture of what “normal” looks like, held as tribal knowledge. When they create a change plan or review one from a more junior engineer, they compare their knowledge of the existing system with what the change proposes.
This presents several problems:
1. No one knows everything, and even if someone could, everyone has bad days when they overlook a key detail. In short, everyone is human.
2. Having this knowledge in one or a few people’s heads doesn’t benefit anyone outside that group. This makes it tremendously difficult to bring new engineers onto the team effectively.
3. Despite good intentions, tribal knowledge usually backfires: it creates distrust and fosters “blaming the network”. We tend to assume that the things we know least about are the cause of problems.
Imagine if your senior engineers weren’t available to troubleshoot a problem or if they overlooked a detail – after all, we’re all human. Keeping this idea of how the network “should be” in someone’s brain just doesn’t scale.
Moving from Tribal Knowledge to Test-Driven Operations
Instead, we should strive to commit “what it should be” into some sort of executable format. This can be something as simple as a set of Python scripts, or some kind of DSL for an existing tool. For instance, Juniper maintains an open source project called JSNAPy, which helps network operators develop a suite of tests that are meaningful to them.
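As a rough illustration of what “a set of Python scripts” could look like, here is a minimal sketch that writes one piece of expected state down as data and checks an observed state against it. Everything in it is hypothetical: the peer addresses, the EXPECTED_BGP_PEERS name and the check_bgp_peers function are made up for this example, and the observed state is passed in as a plain dictionary rather than collected from a real device.

```python
# Hypothetical "what it should be": BGP peers that must be Established.
# The addresses and names below are illustrative assumptions only.
EXPECTED_BGP_PEERS = {
    "10.0.0.1": "Established",
    "10.0.0.2": "Established",
}


def check_bgp_peers(observed, expected=EXPECTED_BGP_PEERS):
    """Return a list of failure messages; an empty list means the test passed.

    `observed` maps peer address -> session state. How that data is gathered
    (CLI scraping, NETCONF, an API) is deliberately left out of this sketch.
    """
    failures = []
    for peer, want in expected.items():
        got = observed.get(peer)
        if got is None:
            failures.append(f"peer {peer}: missing (expected {want})")
        elif got != want:
            failures.append(f"peer {peer}: {got} (expected {want})")
    return failures


if __name__ == "__main__":
    # Example observed state, hard-coded here in place of real device output.
    observed_state = {"10.0.0.1": "Established", "10.0.0.2": "Idle"}
    for message in check_bgp_peers(observed_state):
        print("FAIL:", message)
```

The point is less the check itself than where the knowledge lives: the expected peers sit in a file that anyone on the team can read, review and extend, rather than in one engineer's memory.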
Test suites like these can be executed at any time. They can be run periodically, so that over the long term operations personnel can be confident that the network is providing the necessary services. They can also be iterated on, added to, and executed before and after every network change. Just like in the software world, these tests are meant to augment operations personnel, making it easier to ensure that the system is behaving as intended. Naturally, the details of the network will change over time, and so should your tests.
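One way to picture the “before and after every network change” idea is a snapshot comparison: capture some state before the change, capture it again afterwards, and flag any difference the change plan did not intend. The sketch below assumes snapshots are plain dictionaries; the function names diff_snapshots and change_is_safe and the interface names are hypothetical, and this is not how any particular tool implements it.

```python
def diff_snapshots(before, after):
    """Compare two {key: value} snapshots and group the differences."""
    changes = {"added": {}, "removed": {}, "changed": {}}
    for key in before.keys() | after.keys():
        if key not in before:
            changes["added"][key] = after[key]
        elif key not in after:
            changes["removed"][key] = before[key]
        elif before[key] != after[key]:
            changes["changed"][key] = (before[key], after[key])
    return changes


def change_is_safe(before, after, allowed_keys):
    """Pass only if every difference is on a key the change meant to touch."""
    diff = diff_snapshots(before, after)
    touched = set(diff["added"]) | set(diff["removed"]) | set(diff["changed"])
    return touched <= set(allowed_keys)


if __name__ == "__main__":
    # Hypothetical interface states captured before and after a change.
    pre = {"ge-0/0/0": "up", "ge-0/0/1": "up"}
    post = {"ge-0/0/0": "up", "ge-0/0/1": "down", "ge-0/0/2": "up"}

    # The change plan only intended to bring up ge-0/0/2.
    print(diff_snapshots(pre, post))
    print("safe:", change_is_safe(pre, post, allowed_keys={"ge-0/0/2"}))
```

In this example the check fails, because ge-0/0/1 went down even though the change was only supposed to add ge-0/0/2. That is exactly the kind of overlooked detail a human reviewer might miss on a bad day.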
Not only will a test-driven approach help ensure ongoing reliability, it also sets a precedent for what we do after outages happen. This isn’t a panacea: just as automated testing didn’t eliminate all bugs in the software world, it won’t make your network 100% reliable. However, it does provide a formal structure for you to learn from past mistakes and avoid repeating them.