One of the best things about the newly launched NRE Labs is that it’s not just focused on configuration management. The ability to automate changes to network configuration is a vital part of any network automation initiative, but far from the complete picture. When constructing the curriculum, we wanted to give equal focus to other aspects of automation, especially those that will end up saving time and effort for the majority of network engineers, many of whom may not change their network very often.
One powerful use case is the ability to make assertions about the network: how it’s configured, what kind of state it’s in, and how it’s changed over a given period of time. The goal is not just to know this information, but also to be able to compare “what it really is” with “what it should be” in a predictable way. This is a critical step in the journey to Network Reliability Engineering as we start to manage our workflows, including these verification workflows, as code.
Testing and Assertions
First, a brief detour to the world of software development. It’s no secret that Network Reliability Engineering, like all other Reliability Engineering disciplines, is heavily inspired by software development processes and the application of those processes to the world of infrastructure operations.
In the old days of waterfall processes, software was built over the course of months, and then tested at the tail end by a team of dedicated testing personnel. However, this model doesn’t keep up with the demands placed on modern software teams. These days, testing happens throughout the software development lifecycle. Tests are defined as code alongside the application code they exercise. When changes are made, each test runs a specific part of the software’s functionality with known inputs and asserts an expected output.
From this assertion, we can tell whether or not the underlying software is doing what it’s supposed to do. We can rely on these tests to help us identify bugs moving forward because they’re part of the codebase and are run after every single change.
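To make the idea concrete, here is a minimal sketch of a test living next to the code it checks. The function `prefix_to_netmask` and its test are hypothetical and written only for illustration; they are not taken from NRE Labs or any particular project.

```python
def prefix_to_netmask(prefix_len: int) -> str:
    """Convert a CIDR prefix length (0-32) to a dotted-decimal netmask."""
    if not 0 <= prefix_len <= 32:
        raise ValueError(f"invalid prefix length: {prefix_len}")
    # Set the top `prefix_len` bits of a 32-bit value.
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    # Split the 32-bit value into four octets, most significant first.
    return ".".join(str((mask >> shift) & 0xFF) for shift in (24, 16, 8, 0))


def test_prefix_to_netmask():
    # Known inputs paired with the outputs we assert they should produce.
    assert prefix_to_netmask(24) == "255.255.255.0"
    assert prefix_to_netmask(30) == "255.255.255.252"
    assert prefix_to_netmask(0) == "0.0.0.0"


if __name__ == "__main__":
    test_prefix_to_netmask()
    print("all assertions passed")
```

Because the test sits in the same codebase, a test runner such as pytest can run it after every change.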
All that said – what does all this software stuff have to do with the day-to-day of a network engineer? It’s important to ponder what the software folks have actually done by building tests alongside their code. What they’ve done is describe – often using an actual programming language – what “should be.” They know how the software should work and know (based on an input) what the output should be in a fairly deterministic way.
The real secret we like to keep from ourselves is that these thought processes and assertions about how our network should be working are already happening on our networks today. The question is: are they structured and repeatable like a software engineer’s test suite? Or are they unstructured, tribal knowledge, applied on the fly as best we can in the moment (usually under the pressure of resolving an outage)?
The knowledge you have about how the network should be at any given time should instead be committed into a version-control repository and maintained over the long term. When an outage happens, you don’t need to log into your network devices and run a bunch of show commands. Instead, run a test suite that has all the assertions you want to make, built right in. This means you can more quickly get up to speed and spend less time on getting situational awareness and more time actually fixing the problem.
JSNAPy is one tool in the network verification space that allows you to start moving from tribal knowledge to tangible test artifacts, stored and managed “as code.” This is an open source project aimed at simplifying this conversion and enabling you to run these tests on your Juniper network devices.
This works by retrieving some information, called a “snapshot,” and then running a set of assertions about what that data should look like. For instance, if we want to make some assertions about the state of BGP on our devices, we might use the “get-bgp-summary-information” RPC and assert from the resulting data that:
- There is exactly one BGP peer group configured
- There are exactly two BGP peers configured
- There are zero BGP peers “down” (meaning all are up and active)
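The three assertions above can be sketched in plain Python. This is not JSNAPy’s own syntax or API, only an illustration of the “snapshot, then assert” idea. The sample XML is a hypothetical, heavily trimmed reply (a real device returns many more fields and typically an XML namespace), and the names `SAMPLE_SNAPSHOT`, `EXPECTED`, and `check_bgp_summary` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed snapshot of a get-bgp-summary-information reply.
SAMPLE_SNAPSHOT = """
<bgp-information>
  <group-count>1</group-count>
  <peer-count>2</peer-count>
  <down-peer-count>0</down-peer-count>
</bgp-information>
"""

# "What it should be": one peer group, two peers, zero peers down.
EXPECTED = {"group-count": 1, "peer-count": 2, "down-peer-count": 0}


def check_bgp_summary(snapshot_xml: str, expected: dict) -> list:
    """Return a list of failure messages; an empty list means every check passed."""
    root = ET.fromstring(snapshot_xml.strip())
    failures = []
    for field, want in expected.items():
        text = root.findtext(field)
        # A missing field is reported as None rather than raising.
        got = int(text) if text is not None else None
        if got != want:
            failures.append(f"{field}: expected {want}, got {got}")
    return failures


if __name__ == "__main__":
    problems = check_bgp_summary(SAMPLE_SNAPSHOT, EXPECTED)
    print("PASS" if not problems else "FAIL: " + "; ".join(problems))
```

The expected values live in a data structure rather than in someone’s head, which is the part that can be committed to version control and reviewed like any other change.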
The screenshot below shows how all of this is done in a simple YAML file that is passed to JSNAPy:
Running these (and any other) tests is done via the command line or via some kind of continuous integration pipeline. Based on the output, we can see that one of our tests failed on vqfx2:
We didn’t have to log into our network devices to find which device had the failure, nor did we have to log into vqfx2 and list the BGP neighbors to see how it compared with our memory of what should be there. This was all committed into our tests and we simply executed them. We now know that there’s a problem with one of the BGP peers on vqfx2 and can go straight out of recon mode into troubleshooting mode.
Try JSNAPy Right Now in Your Browser
We’ve had a lot of fun creating new lessons for folks to get started with automation and JSNAPy was the subject of one of our very first lessons. If you want to get started with JSNAPy before bothering with getting it installed and working on your own system, try it out for free, right now, all in the browser.
We’re pushing new content to NRE Labs all the time. Follow us on Twitter @NRELabs for updates on new content, fixes, and related events.