This blog was originally published to the Apstra website – in 2021, Juniper Networks acquired Apstra. Learn more about the acquisition here.
This is the second part of a blog series on using AOS® to easily operate your data network. Click here to read part one.
Last time, we were talking about how we can use AOS to gracefully drain traffic off network devices in order to perform maintenance with a “Software-First” approach. You were probably scratching your head at the end saying “Now what?” Obviously, getting application traffic off the network is just step one in the operational process. So what happens next? Well, typically you are doing one of the following tasks:
- Troubleshoot device offline
- Replace failed or damaged device
- Upgrade the Device OS
Let’s focus on that last option, which every operator has been challenged by.
Network Device OS Upgrades are a required function in any modern enterprise. The OS can be affected by known or unknown bugs, security vulnerabilities, or more. The operator may also wish to upgrade the OS in order to activate a new feature. Whatever the reason is, with a fixed form factor device with a single CPU (or supervisor), we can expect some sort of outage due to the reload process. So as we have previously described, it typically makes sense to place the equipment into maintenance mode before performing these actions.
Multivendor Fabrics? No Problem.
Like every feature in AOS, the Intent is separated from the method by which we accomplish the stated goal. You shouldn’t have to be an expert in every vendor’s OS in order to manage a mixed hardware network. So the device OS upgrade process does not require any CLI input from the administrator. They simply select the devices to be upgraded, pick the proper OS from a drop-down list, and submit the job. AOS can manage multiple simultaneous upgrades and even multiple active jobs for different OS types.
Positioning and Validating the Images
In advance of scheduling the upgrade, the operator should upload the OS images to the AOS server. When the images are being copied, you have the option to store the hash value provided by the vendor, which lets you confirm that the image is intact and was not corrupted in transit. This information is stored in the AOS Server. In fact, we can use the hash value to regularly check the devices and make sure no corruption has taken place in their file systems.
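The hash check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how AOS implements it: the function names are made up for this example, and SHA-512 is an assumption, so use whichever algorithm your vendor publishes.

```python
import hashlib


def file_digest(path, algorithm="sha512", chunk_size=1 << 20):
    """Hash a file in chunks so a large OS image never sits fully in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_is_valid(path, expected_hex, algorithm="sha512"):
    """Return True when the file's digest matches the vendor-published value."""
    # Normalise case and stray whitespace before comparing hex strings.
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

Running the same check on a schedule against the copy stored on each device is what would catch file system corruption after the initial upload.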
AOS also supports storing the images on a dedicated HTTP server, which allows for an engineering team to centralize all the images even if they are not being used with AOS.
Upgrading, or even Downgrading
Once the job has been kicked off, AOS copies the image to the storage location on each device. When the file copy is complete, the bootfile command is set to the new image and the device is reloaded. If the operator has placed the device into Maintenance Mode, then the device comes back in the same state. If the mode was not set on the device, it will come back up in a fully operational state with the service configuration.
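As a rough mental model of that sequence, here is a toy Python sketch. The `Device` class and `change_os` function are hypothetical in-memory stand-ins invented for this example; they do not represent AOS objects or any vendor API.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Toy in-memory stand-in for a switch; not a real AOS or vendor object."""
    name: str
    running_image: str
    maintenance_mode: bool = False
    stored_images: set = field(default_factory=set)
    boot_image: str = ""

    def reload(self):
        # A reload boots whatever image the boot setting points at.
        # Maintenance mode is not touched, so the device returns in the same state.
        self.running_image = self.boot_image


def change_os(device, image):
    """Copy the image, point the boot setting at it, then reload."""
    device.stored_images.add(image)  # step 1: copy the image to device storage
    device.boot_image = image        # step 2: set the boot file to that image
    device.reload()                  # step 3: reload onto the selected image
    return device
```

Nothing in the steps cares whether the selected image is newer or older than the running one, which is why the same job can serve as a downgrade.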
You can use this workflow to both upgrade and downgrade the OS. Sometimes when we upgrade we encounter new problems and have to go back to the previous version. For this reason, we recommend storing all OS images on the AOS Server or the related HTTP server.
Hey, that wasn’t there before!
Frequently when you perform an upgrade, new default settings appear in the running configuration. This is due to subtle changes in the vendor’s source code. Typically these changes can be anticipated by reading the release notes, but on occasion new commands or even modified default settings will appear after a reload. AOS provides an easy way to identify these changes and resolve them.
In an AOS managed network, every device configuration is checked every 60 seconds for changes. When a device is reloaded and the AOS Agent is activated during boot, the configuration is checked immediately. ANY CHANGE, no matter how subtle, will be identified by AOS and presented to the administrator as a Config Anomaly. At that point, you can manually adjust the new command settings, or have AOS automate the change, or simply accept the new settings as a baseline, or “Golden” config. Once the change has been accepted, AOS expects that command to be there, and you will be alerted with another anomaly if any other change appears.
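To make the idea concrete, here is a minimal Python sketch of comparing a running configuration against an accepted baseline. It uses the standard library's `difflib` and is an assumption about the general technique, not a description of how AOS detects Config Anomalies. The function name and the sample config lines in the usage note are invented for illustration.

```python
import difflib


def config_anomalies(golden, running):
    """List every line removed from or added to the golden config."""
    golden_lines = golden.splitlines()
    running_lines = running.splitlines()
    matcher = difflib.SequenceMatcher(
        a=golden_lines, b=running_lines, autojunk=False
    )
    changes = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # 'replace' yields both removed and added lines; 'delete' and
        # 'insert' yield only one side because the other slice is empty.
        changes += [("removed", line) for line in golden_lines[a1:a2]]
        changes += [("added", line) for line in running_lines[b1:b2]]
    return changes
```

If a reload adds one line such as a hypothetical `new-default-setting enabled`, the function returns a single `("added", ...)` entry. Accepting the change amounts to making the running config the new golden config, after which the same comparison returns an empty list until something else changes.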
Automated OS Version Compliance Checks
AOS uses Intent-Based Analytics (IBA) to check elements of the network at regular intervals for problems. One of the more popular IBA “probes” is the OS Version Check, which looks at the running OS version on all devices from a certain vendor and triggers an anomaly if the version does not match. In AOS 3.0 we augmented this function with the Global SLA feature. This allows a member of the network or security team to set a preferred value once in AOS and refer to that variable in any number of IBA probes. For example, you could set the approved EOS version for Arista devices to “4.21.1.1F”. This is like envvars for your entire network.
The OS Check probe will then create an anomaly for every device running any version that differs from that. If the security team is alerted to a vulnerability in that version, they can simply change the version listed in the SLA and the IBA probes will automatically be updated to look for the new version.
As a result, all devices that were running the old version will now show an anomaly, and the count of these anomalies defines how many OS upgrades you need to do to be in compliance. The security team can watch this number in real time as it decreases with every successful OS upgrade. By using Role Based Access Control (RBAC), any member of the organization can be fully in the loop on the OS remediation process.
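The compliance logic is simple enough to sketch in Python. This is an illustrative model only, not the IBA probe itself: the function name, device names, and the `NEW-VERSION` placeholder are all hypothetical, while "4.21.1.1F" is the approved version from the example above.

```python
def version_anomalies(running_versions, approved_version):
    """Return the devices whose running OS differs from the approved version."""
    return sorted(
        device
        for device, version in running_versions.items()
        if version != approved_version
    )


# Hypothetical fleet, all on the currently approved version.
fleet = {"leaf1": "4.21.1.1F", "leaf2": "4.21.1.1F", "spine1": "4.21.1.1F"}

# The approved value is set once and every check refers to it.
approved = "4.21.1.1F"
compliant_before = version_anomalies(fleet, approved)

# The security team changes the approved version (placeholder string here),
# so every device still on the old version is now flagged.
approved = "NEW-VERSION"
flagged = version_anomalies(fleet, approved)

# Each successful upgrade removes one device from the list.
fleet["leaf1"] = "NEW-VERSION"
remaining = version_anomalies(fleet, approved)
```

The length of the returned list is the number the security team would be watching: zero before the change, the whole fleet right after it, and one fewer after each upgrade.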
9 out of 10 Engineers Agree, Do Your Upgrades!
AOS was designed to improve the lives of operators and increase the efficiency of businesses by rapidly upleveling the capabilities of the people who manage these systems. Prior to using AOS, Bloomberg had a single engineer working on OS upgrades, taking upwards of 8 months to upgrade 174 switches. The same tasks could have been completed with AOS Maintenance Mode and Device OS Upgrade in approximately 87 hours. In fact, with parallel OS upgrade job support, the entire network could have been upgraded in a single day.
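The arithmetic behind those figures is worth spelling out. The short Python sketch below uses only the two numbers quoted above. The parallel estimate rests on a simplifying assumption that is not from the original: every upgrade takes the same average time and batches run back to back.

```python
import math

SWITCHES = 174     # from the example above
TOTAL_HOURS = 87   # from the example above

# 87 hours over 174 switches is half an hour per switch on average.
hours_per_switch = TOTAL_HOURS / SWITCHES


def wall_clock_hours(parallel_jobs):
    """Elapsed time if upgrades run in equal-length batches of `parallel_jobs`."""
    batches = math.ceil(SWITCHES / parallel_jobs)
    return batches * hours_per_switch
```

Under those assumptions, running four upgrades at a time takes 44 batches of 30 minutes, or 22 hours, which is how the whole network fits inside a single day.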
While OS upgrades are typically not the most exciting task, they are absolutely required and need to be completed in relatively short time periods. Don’t believe me? Check out what Andrew Lerner at Gartner had to say about it. With the automated tools built into AOS, we can easily upgrade complex multivendor network topologies with a consistent workflow and validation assistance. This ensures that the upgrades are completed quickly and accurately, allowing you to return to working on more challenging projects.