Failure and Resilience in Scalable Data Center Fabrics

In the aftermath of a major network outage, the natural reaction of most network engineers is to find some way to avoid all future outages, regardless of the cost. The first line of defense against future outages is redundancy, whether in the form of additional parallel links, new routers, new firewalls or other hardware. What matters is that there is more of it, so packets have more paths from source to destination. RFC 1925, rule 9, describes the plight of the network engineer: “For all resources, whatever it is, you need more.”

This instinctive reaction among network engineers is backed by the formula for computing Mean Time Between Failures (MTBF). In the simplest model, adding a second identical parallel path to a network increases the MTBF by about half the MTBF of a single path, and each further path adds a smaller increment. Between the math and the observation that “one is none, two is one,” there seems to be little doubt that adding redundant paths, devices, etc. is the right thing to do.
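The arithmetic behind that rule of thumb can be sketched in a few lines of Python. This is a minimal illustration, not a definitive reliability model: it assumes identical paths that fail independently at a constant rate and are never repaired, in which case the combined MTBF of n parallel paths is the single-path MTBF multiplied by (1 + 1/2 + ... + 1/n). The function name and the 10,000-hour figure are hypothetical.

```python
def parallel_mtbf(unit_mtbf_hours: float, paths: int) -> float:
    """Combined MTBF of `paths` identical parallel paths.

    Assumes independent, constant-rate (exponential) failures and no
    repair, so the result is unit_mtbf * (1 + 1/2 + ... + 1/paths).
    """
    if paths < 1:
        raise ValueError("paths must be at least 1")
    return unit_mtbf_hours * sum(1.0 / k for k in range(1, paths + 1))


if __name__ == "__main__":
    unit = 10_000.0  # hypothetical single-path MTBF, in hours
    for n in range(1, 5):
        print(f"{n} path(s): {parallel_mtbf(unit, n):.0f} hours")
```

Under these assumptions the second path adds half of the single-path MTBF, the third adds a third, and so on, so each additional path buys less than the one before it.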

Redundancy is almost always added to optimize the network at a local choke point, but local optimizations almost always reduce global optimization (the optimal operation of the overall network as a system). Like all things in life, redundancy in the network has tradeoffs, and those tradeoffs can swamp the intended increase in network resilience.

At first glance, this doesn’t make a lot of sense. Adding another parallel device or link increases the MTBF at that point, so it should increase the MTBF of the whole network. Without tradeoffs, this would be true.

And that’s just the trouble—if the tradeoffs haven’t been found, then we haven’t looked hard enough.

Where should engineers look for tradeoffs with redundancy that can reduce the resilience of the entire network? Here’s a hint – MTBF is only one part of resilience. The second part – the part network engineers often forget – is Mean Time to Repair (MTTR). How long does a failure dwell in the network (the dwell time) before being discovered? How long does it take to find and fix the failure? How long does it take to move from a temporary fix to a permanent one?
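One common way to see how MTBF and MTTR interact is the steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below applies it to two designs. The numbers are purely hypothetical, chosen only to illustrate the point, and the function name is an assumption for this example.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical figures: the redundant design fails less often, but its
    # added complexity makes each failure slower to find and fix.
    simple = availability(mtbf_hours=10_000, mttr_hours=2)
    redundant = availability(mtbf_hours=15_000, mttr_hours=4)
    print(f"simple design:    {simple:.6f}")
    print(f"redundant design: {redundant:.6f}")
```

With these illustrative inputs the redundant design ends up with lower availability than the simple one, because its repair time doubled while its MTBF grew by only half.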

Of course, there’s also the Mean Time to Innocence (MTTI) and Mean Time Between Mistakes (MTBM), both of which are also impacted by the increasing complexity of a network with ever-increasing levels of redundancy.

Join Juniper Networks on October 28, 2021 at 9 am PT for a deeper dive on this topic, including increasing workloads, feedback loops and the law of large numbers in relation to the interaction between increasing redundancy and increasing resilience.
