We recently spoke to a group of Juniper Ambassadors (see Part One here and Part Two here) to better understand the dynamics, drivers and concerns they have encountered while designing and running data center networks and fabrics. Our panel discussion focused on what matters most to their organizations and clients, and although no one size fits all, we uncovered some interesting insights.
Observability and Cyclical Tails
Keeping Up with the Joneses
Customers may be at different stages of outsourcing or insourcing workloads. This means their approach to the management of their infrastructure and tiers matters greatly. Irrespective of whether data center workloads generate revenue or play a role in supporting productivity, most organizations have a legacy to contend with. Greenfields may open up wider choice, but this blank canvas comes with its own risk and uncertainty. Business stakeholders, network architects and engineers all watch the market to spot trends and identify which peers have succeeded or failed and, most importantly, why.
Christian points out that customers “talk a lot with each other” and are “looking for the next big thing,” although Daniel has engagements where customers know exactly what they want, with subsequent decision making coming down to the traditional characteristics of “speeds and feeds.” Some of his customers “know they want a fabric” and are happy to roll administration and management into “their existing operational playbooks.” Customers are at different stages, but everyone, large or small, wants simplicity, scalability and a clear path to hassle-free operations.
Another commonality is the desire to be able to increase observability and control outcomes. This requires a better understanding of state in the network and the ability to target workloads to the most efficient and effective locations possible.
Andrew remarks that, “We need to see more state, but to a far bigger degree, it’s not about us seeing it.” His focus is on his customers’ ability to make evidence-based decisions, especially “when you’re moving towards customers who want to be able to have more control.” He highlights that in a “self-serve” scenario, customers need visibility, but the challenge becomes how much state and data to expose to them.
Tom’s organization “has experienced quite a bit of expansion due to acquisition,” and as such, their needs are many. They have a hybrid cloud solution and multiple product lines. As they consolidate data centers, they are also merging the business units and product lines that came with those acquisitions. Selecting one dominant paradigm for scalability means that different footprints must be consolidated and transformed. When confronted with a mix of services and products (both old and new), he notes that some workloads “don’t do too well up at Amazon, they stay local, and then we have to have high-speed connectivity.” For workloads better suited to a public cloud, they selected their central consolidated data center locations based on latency to AWS and Azure. As Tom embarks on tripling capacity in some of these data centers, he must have “the ability to add things on the fly,” and a simple way to manage it. Their previous model and approach wouldn’t scale well, so a uniform approach to building fabrics was required. His takeaway: “That’s where EVPN-VXLAN made a lot of sense for us, because we could scale out well.”
Operations are Forever
Troubleshooting, User Experience and Expectations
Once you provide services to users or customers, you incur the overhead of operations. Irrespective of Service Level Objectives (SLOs), expectations and agreements, operational outcomes impact risk. Risk cannot be outsourced the same way tasks or processes can. It’s a factor of both dependence and complexity. By streamlining and simplifying operations, risk can be lowered and negative impacts can be minimized. This is applicable not just for regular operations but also in security, where complexity is also frowned upon. Some complexity is unavoidable and other aspects of complexity may be thought of as a tax. Still, as the SRE book states, “it’s very important to consider the difference between essential complexity and accidental complexity” in systems we accept operational responsibility for. Further, as Jeff Bezos puts it, “There’s experimental failure — that’s the kind of failure you should be happy with, and there’s operational failure… That’s not good failure.”
Tom highlights that “once you get north of twenty or thirty leaf switches in the data center, just doing maintenance such as adding a VXLAN becomes cumbersome even with all the techniques you can use in Junos OS, so then you have to look at ‘How am I going to automate that?’” One of his bugbears relates to maintenance. He notes, “That’s where your management tool comes in to play, it’s not just the implementation. That’s the easy part. It’s the day-to-day maintenance because Moves Adds and Changes (MACs) across a distributed control plane become a lot more difficult.”
Additionally, he sees challenges with Ethernet Segment Identifier (ESI) Link Aggregation Groups (LAGs) and “if you’re adding those manually via the CLI at scale,” it can be “complex and risky,” especially when dealing with twenty or thirty ESI-LAGs per rack across a data center. As many do, he looks toward a new breed of tools and platforms such as Juniper Apstra Fabric Conductor with the promise of reducing toil and minimizing risk to ensure continued success.
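To make the scale problem concrete, here is a minimal sketch of the kind of repetition Tom describes: rendering the same VLAN-to-VNI change for thirty leaf switches. It is an illustration under stated assumptions, not how any particular management tool works. The function names, the leaf hostnames, and the VLAN name and numbers are all hypothetical, and the exact Junos OS statement hierarchy varies by platform and release, so treat the generated lines as placeholders to check against your own configuration. The sketch only builds text; it does not connect to or change any device.

```python
from typing import Dict, List


def vlan_vni_commands(vlan_name: str, vlan_id: int, vni: int) -> List[str]:
    """Return Junos-style set commands that map one VLAN to one VXLAN VNI.

    The statement paths are assumptions for illustration; verify them
    against the platform and software release actually in use.
    """
    if not 1 <= vlan_id <= 4094:
        raise ValueError(f"VLAN ID out of range: {vlan_id}")
    if not 1 <= vni < 2 ** 24:  # the VNI is a 24-bit field
        raise ValueError(f"VNI out of range: {vni}")
    return [
        f"set vlans {vlan_name} vlan-id {vlan_id}",
        f"set vlans {vlan_name} vxlan vni {vni}",
        f"set protocols evpn extended-vni-list {vni}",
    ]


def render_for_leaves(
    leaves: List[str], vlan_name: str, vlan_id: int, vni: int
) -> Dict[str, List[str]]:
    """Build the same change for every leaf, keyed by (hypothetical) hostname."""
    commands = vlan_vni_commands(vlan_name, vlan_id, vni)
    # Each leaf gets its own copy so later per-leaf edits stay independent.
    return {leaf: list(commands) for leaf in leaves}


# Hypothetical fabric of thirty leaves named leaf01 .. leaf30.
leaves = [f"leaf{n:02d}" for n in range(1, 31)]
plan = render_for_leaves(leaves, "v100", 100, 10100)

for line in plan["leaf01"]:
    print(line)
```

Even this small change becomes ninety lines across thirty devices, before any ESI-LAG definitions are added. Validating the inputs once and generating the per-device text from a single definition is the simplest form of the automation the panel has in mind.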
Conclusion
In an on-premises world, the choice of solution for each layer, including which vendor to use, means that you may forgo certain economies of scale for other considerations due to legacy constraints. Migration of workloads to public clouds often involves re-engineering and development time to truly take advantage of all the touted security or performance benefits. In this regard, greenfields are “easier” when unencumbered by legacy, but it’s still the network that provides all access and the most robust policy enforcement points.
For many, the underlay is expected to serve as an infinitely scalable and ubiquitous plane. All policy, security and associated complexity migrate higher up into the overlay and edge nodes. The need to partition failure domains becomes almost mechanical at the underlay but is still informed by and virtually coupled to an intelligent overlay. One primary theme that emerges time and time again is that new breeds of intelligent and unified management tools are still highly desired for tackling growing toil and accelerating complexity.
Whether for internal users or external customers, simplifying data center networking has many operational benefits. As the volume and velocity of operational work increase, new approaches and systems are required, not just to maintain, but to improve our productivity while benefiting our own personal trajectories.