Large Scale WAN Migration
What makes EDNX subject matter experts in performing a successful WAN migration
At EDNX we specialize in networking in general, but WAN migration is one of our key offerings. We have successfully led multiple very large scale projects. They involved Wide Area Networks spanning many geographical locations and a broad spectrum of technologies. We have vast experience with MPLS, Internet, Point-to-Point, EVPL, ELAN, IWAN and SD-WAN. Why do you need to understand so many different technologies when you are trying to select a partner to help with WAN upgrades? It depends on the size of the project, but a typical large organization will have a subset of those solutions on its existing WAN. The list above deserves an explanation, one item at a time, at least at a high level.
Various legacy Wide Area Network acronyms
- MPLS – The Service Provider builds Layer 3 VPNs for its customers. This means the ISP becomes responsible for routing customer packets between their locations. This kind of solution effectively builds a private network for each customer and keeps customers separated from each other on the ISP backbone.
- Internet – Simply a circuit that delivers raw internet access. You may run either BGP or static routing with the Provider Edge (PE) router. Traffic leaving your organisation normally requires Network Address Translation (NAT), which is often configured on the front-end firewalls. If you need to connect your branches over the internet, it usually requires building a complex VPN to secure that connectivity.
- Point-to-point – A physical wave circuit built by extending the fibre network between two locations. An organisation may use this kind of WAN to connect two branches directly if the business needs high throughput.
- EVPL – Layer 2 Point-to-Point technology where direct connectivity between two locations is achieved by a pseudowire across an MPLS network.
- ELAN – Layer 2 multipoint-to-multipoint solution that provides Layer 2 connectivity between more than two locations.
How about IWAN and SD-WAN?
The previous section covered the key legacy WAN solutions that exist today. Next Generation Wide Area Networks like IWAN or SD-WAN may use a mixture of those circuits to build the underlay network. The main difference is that both IWAN and SD-WAN have the concept of an overlay network, which allows them to achieve transport independence. This means that multiple branches can all logically connect to each other regardless of the underlay. The underlying network still needs to provide any-to-any connectivity, so P2P solutions are not suitable as the underlay on their own but can complement the overall design. The most common underlay choice is a mixture of MPLS and Internet, which provides path diversity and often independent circuit providers.
We have included some key aspects of IWAN & SD-WAN in another blog. Here, based on Cisco recommendations we will assume that the target solution for your Wide Area Network is SD-WAN.
Typical scenarios and things to consider before your WAN upgrade
There are so many different WAN migration cases that it would be impossible to cover them all. We need to concentrate on the common scenarios:
- Migrating legacy WAN based on MPLS & Internet to the next generation SD-WAN
- Implementing some additional P2P connectivity for high throughput
The general approach and the preparation for your WAN upgrade follow the same principles in both cases. Let’s dig into some specifics.
Understanding your existing network is critical before planning the WAN migration
The key documents you need are the Low Level Designs for both the old and the new solution. Many people may question whether the current setup needs a detailed design. At the end of the day, it will completely change soon enough, won’t it? The trouble is that without a full understanding of the existing solution your WAN migration will be at high risk. You simply can’t afford to attempt any changes on the network without knowing how it works today.
The old design will show how your organization is connected to the Wide Area Network. It will explain the various traffic flows and routing policies. It will provide a solid foundation that you can reuse when designing your new WAN. Once you have spent enough time making sure your old WAN is well understood, it will be much easier to define what you expect from the upgrade. At EDNX we call this the network assessment phase and believe it is a critically important milestone of the project.
The new design, on the other hand, will solve business problems like cost, resilience, policy enforcement, scalability and automation. Following a decent audit of your old WAN, it is a lot easier to see what you can improve. All the traffic flows, routing policies and bottlenecks you documented will now feed the new network design documents.
What could the WAN re-design process look like?
This blog is about addressing the complexity of large scale WAN upgrades. Those kinds of projects are impossible to implement overnight with a big bang approach. As a working example, let’s assume that the network consists of 100 branches and 3 main hubs spanning 3 different regions. In the design phase you have captured the following:
- Each branch has dual MPLS and a single INET circuit
- Each hub has dual MPLS and dual INET circuits
- Local Internet is used to access O365 and Salesforce at the branches
- Guest wireless is available at 20% of the branches and uses the INET circuit
- MPLS is used for all internal applications and collaboration
The new design business case has the following requirements:
- Each branch should have a single MPLS circuit to save cost
- Failure of MPLS shouldn’t prevent the branch from communicating with all other branches and internal applications
- INET circuit should be available for internal application traffic as long as the latency meets the requirements
- Adding more circuits in the future doesn’t require big design changes
- All hubs need direct P2P fibre for database replication and high throughput
- The Control Plane needs full isolation from the Data Plane
From those requirements it becomes clear that you need a hybrid solution based on SD-WAN and some additional P2P circuits. Only Software Defined WAN offers an overlay that can achieve your goals. We won’t get into specific topology details like the placement of the SD-WAN control plane or the P2P circuits. The point of this blog is to highlight WAN migration challenges rather than a specific low level solution.
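Without committing to a topology, the latency requirement above can be pictured as a simple eligibility check. The sketch below is illustrative Python, not SD-WAN configuration; the `PathSample` type, the transport names and the 150 ms budget are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathSample:
    """One measurement of a transport at a branch (hypothetical structure)."""
    transport: str      # e.g. "mpls" or "inet"
    up: bool            # is the circuit currently usable?
    latency_ms: float   # measured latency towards the internal application


def eligible_transports(samples, max_latency_ms):
    """Return transports that are up and within budget, lowest latency first."""
    usable = [s for s in samples if s.up and s.latency_ms <= max_latency_ms]
    return [s.transport for s in sorted(usable, key=lambda s: s.latency_ms)]


# Single MPLS circuit has failed; INET is still within the assumed 150 ms budget.
samples = [PathSample("mpls", False, 0.0), PathSample("inet", True, 85.0)]
print(eligible_transports(samples, max_latency_ms=150.0))
```

With the MPLS circuit down, only the INET circuit remains eligible, which is the behaviour the second and third requirements ask for.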
Once you understand your existing and new design you can start your implementation planning. To make things easier we will take away all of the logistics like purchasing new hardware and disconnecting old circuits.
What happens to the traffic when you begin your WAN upgrade
This is the point where you need to figure out the most efficient way to migrate your WAN to the new solution. The implementation stage is the most complicated part of the project. It will very often need to be broken down into multiple changes. Let’s start by asking some fundamental questions:
- How much can you do on the existing network without any disruption to the users?
- Is it better to start at the branch or at the hub?
- At which point is it best to add P2P links between the hubs and what is the impact?
- Are there any branches that need to be grouped in the same window?
- How is the traffic routing between your old and new WAN?
- How can you reduce the downtime to minimum?
- Will the routing be symmetrical and predictable?
- What testing should be done during and after each migration?
- What is the fallback plan if things go wrong?
To answer all those questions you need not only a detailed understanding of your existing network but also of the intricacies of the new solution. You need to be able to predict the outcome of each change on your organization’s WAN routing. Everything you do on the command line will have an impact, so at every point of your migration you need to be able to roll back any changes. Don’t cut your own management connectivity! Let’s break things down a bit more.
Building your next generation SD-WAN one step at a time
The first decision is a strategic one: the basic order of operations. Let’s picture your organisation’s WAN the day after the first cut-over to SD-WAN. If the site you migrated is SD-WAN only and no longer has native MPLS, it has lost connectivity to all your other branches. The situation will get better over time as more branches are migrated, but decommissioning native MPLS at a site while all other branches still communicate over it doesn’t sound like the best approach. Does that mean you need to keep native MPLS everywhere until SD-WAN is configured everywhere and then migrate? How about keeping native MPLS and launching SD-WAN first at the hub sites, which will most likely have the biggest MPLS circuits? Each hub would then have IP reachability to all legacy spokes via native MPLS and to SD-WAN spokes via the new SD-WAN edge routers. In addition, the hub sites would become a transit path between the old and the new design. Any traffic from old branches would route upstream to the hub over native MPLS and downstream to the new branches across the overlay.
This explains the fundamental migration approach. You should start building your SD-WAN at the hubs, or at any other locations that are strategically important and have large capacity MPLS circuits. Those circuits will have to carry the additional traffic routed between both environments. Once all the main sites have SD-WAN enabled you are going to hit at least a few more hurdles, which we explain next.
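The transit role of the hub can be modelled in a few lines. This is a minimal sketch under simplifying assumptions: one hub that sits on both native MPLS and the overlay, legacy branches on native MPLS only, and migrated branches on SD-WAN only. The site names and the `site_path` helper are hypothetical.

```python
def site_path(src, dst, migrated, hub):
    """Return the site-level path between two sites during the migration.

    Assumptions: `hub` is attached to both environments; sites in `migrated`
    are SD-WAN only; every other site is native MPLS only.
    """
    same_side = (src in migrated) == (dst in migrated)
    if hub in (src, dst) or same_side:
        # Both ends share a WAN (native MPLS or the overlay), or one end is the hub.
        return [src, dst]
    # One legacy end and one SD-WAN end: the hub is the only common point.
    return [src, hub, dst]


migrated = {"branch-1"}
print(site_path("branch-2", "branch-1", migrated, hub="hub-eu"))
print(site_path("branch-2", "branch-3", migrated, hub="hub-eu"))
```

The first call shows legacy-to-SD-WAN traffic hairpinning through the hub, while two legacy branches still talk directly over native MPLS.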
Defining the most efficient routing policy between SD-WAN and legacy branches
As we said in the previous section, the hub sites need to be SD-WAN enabled first because they become transit points. This creates a couple of interesting problems when selecting the next site for cut-over. Imagine you have sites in a certain country that require low latency for internal collaboration, and the regional hubs are outside of this country. When you migrate only one of those sites to SD-WAN, traffic between it and its neighbours has to transit a hub abroad, so performance may take a massive hit depending on your routing policy. It could be acceptable to route via the closest regional hub for a short period of time. In practice you need to plan the WAN migration activities for those sites to be very close together in time. You also need to ensure that the regional hub is preferred in the interim.
Because most IP flows are bidirectional, there are again a few pieces to the puzzle:
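One way to keep latency-sensitive sites close together in the schedule is to group them before assigning maintenance windows. The sketch below assumes a hypothetical inventory of `(branch, country)` pairs and a made-up limit on cut-overs per window; it only illustrates the grouping idea.

```python
from collections import defaultdict


def plan_windows(branches, max_per_window):
    """Group branches by country and split each group into consecutive windows.

    `branches` is a list of (name, country) tuples (hypothetical inventory).
    Returns a list of (country, [branch names]) in scheduling order.
    """
    by_country = defaultdict(list)
    for name, country in branches:
        by_country[country].append(name)

    windows = []
    for country in sorted(by_country):
        sites = sorted(by_country[country])
        # Same-country sites land in back-to-back windows.
        for start in range(0, len(sites), max_per_window):
            windows.append((country, sites[start:start + max_per_window]))
    return windows


inventory = [("warsaw", "PL"), ("london", "UK"), ("krakow", "PL"), ("gdansk", "PL")]
for country, sites in plan_windows(inventory, max_per_window=2):
    print(country, sites)
```

A real plan would also weigh change freezes, staffing and circuit delivery dates, none of which are modelled here.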
- Traffic from SD-WAN branch to a legacy branch needs to route via regional hub
- Traffic from legacy branch to SD-WAN branch needs to route via regional hub
- If the regional hub becomes unavailable what is the secondary hub choice
- Consider also any inter-site traffic between two different regions and its latency requirements
Unless you pay attention to the routing policies during your WAN migration, your inter-site connectivity will be subject to sub-optimal or asymmetric routing. This would not only impact the end-to-end user experience but could also be really hard to troubleshoot. The way you enforce an effective routing policy will be different for SD-WAN sites and old branches. On the legacy part you will typically need a combination of filtering, redistribution, route tags or communities to achieve the desired effect. SD-WAN provides routing policy through an independent control plane, enforced on the vSmart servers.
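Asymmetry between the two environments can be spotted on paper by comparing the hub chosen in each direction. The sketch below assumes you have already derived two hypothetical tables from your designs, one per direction; it does not query any device or controller.

```python
def asymmetric_flows(to_legacy_hub, to_sdwan_hub):
    """List site pairs whose forward and return traffic use different hubs.

    to_legacy_hub[(sdwan_site, legacy_site)] -> hub used SD-WAN -> legacy
    to_sdwan_hub[(legacy_site, sdwan_site)]  -> hub used legacy -> SD-WAN
    Both tables are assumed inputs taken from the design documents.
    """
    mismatches = []
    for (sdwan_site, legacy_site), forward_hub in sorted(to_legacy_hub.items()):
        return_hub = to_sdwan_hub.get((legacy_site, sdwan_site))
        if return_hub != forward_hub:
            mismatches.append((sdwan_site, legacy_site, forward_hub, return_hub))
    return mismatches


forward = {("branch-1", "branch-7"): "hub-eu", ("branch-1", "branch-9"): "hub-eu"}
reverse = {("branch-7", "branch-1"): "hub-eu", ("branch-9", "branch-1"): "hub-us"}
print(asymmetric_flows(forward, reverse))
```

Here the pair `branch-1` / `branch-9` is flagged because the return path prefers a different hub than the forward path.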
SD-WAN lets you build the control plane first
Cisco Software Defined WAN is a great solution for many different reasons. One of them is the complete separation of the control and data planes. This allows you to pre-configure a lot of your SD-WAN before you start using it. Cisco states that 80% of their customers host the SD-WAN control plane in the cloud. Hosting the control plane internally has some drawbacks. One of them is the potential battle between your network and server teams, because the control elements are not network forwarding devices; they only need compute resources to run on. If there is a problem with the servers hosting vSmart or other components, it would impact the network routing updates.
Anyway, the main point we are trying to make here is that you can prestage a lot of SD-WAN elements well before you are ready to use them. Once the control plane is set up, it is best to test it by building a test vEdge with some bogus subnets and making sure it works as expected. Adding a second test vEdge in a different location would be even better because it would prove direct spoke-to-spoke connectivity. Testing can be done at a production site like the first hub as long as you don’t advertise real prefixes to begin with. The main challenge here is making sure that your test bed does not impact any production traffic. Completing this stage will give you confidence for the future migrations that need scheduling in a maintenance window.
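A quick sanity check on the bogus subnets is to confirm that none of them overlap a production prefix. The sketch below uses Python’s standard `ipaddress` module; the prefix lists are invented for the example and handle IPv4 only.

```python
import ipaddress


def overlapping(test_prefixes, production_prefixes):
    """Return (test, production) prefix pairs that overlap."""
    test = [ipaddress.ip_network(p) for p in test_prefixes]
    prod = [ipaddress.ip_network(p) for p in production_prefixes]
    return [(str(t), str(p)) for t in test for p in prod if t.overlaps(p)]


# Hypothetical data: 198.18.0.0/15 is reserved for benchmarking, so the first
# test prefix is a safe choice; the second one collides with production.
test_prefixes = ["198.18.10.0/24", "10.20.30.0/24"]
production_prefixes = ["10.20.0.0/16", "172.16.0.0/12"]
print(overlapping(test_prefixes, production_prefixes))
```

Any pair reported here is a test subnet that should be renumbered before the test vEdge is allowed to advertise it.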
Start rolling out SD-WAN to the production environment
Once you have gained confidence in your solution design and initial test environment, you can move on to the first live migration. Although adding only one site to the solution shouldn’t impact any user traffic, it carries the risks associated with any change in the production environment. This is the point where you need to look ahead and prepare any filtering that may be required once you add more sites to your new WAN. Remember the routing policies and path preference from the previous sections. It is critical to make sure that the various traffic flows work as expected:
- routing between old and new environment chooses the most optimal hub
- any asymmetric routing between both environments is reduced to the minimum
- monitor traffic levels at transit points to make sure you don’t saturate the circuits
- perform fail-over testing in the transit points early into the migration to make sure it works as expected
Executing a full WAN migration in a large organization may take many months. Over time the number of spokes on the new solution will increase, and this will reduce the load on the transit points. One important thing to bear in mind is that those transit points (hubs) will typically inject an increasing number of prefixes into the Service Provider MPLS network. This is required to allow native MPLS sites to communicate with SD-WAN sites. Service Providers often impose a limit on the number of prefixes they accept from the customer and may shut down the BGP peering if the agreed value is exceeded. If that happens, it could lead to a catastrophic outage at the hub, isolating it from MPLS.
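The prefix limit risk can be tracked before each migration window by counting what the hub would advertise and comparing it with the agreed maximum. The following sketch uses the standard `ipaddress` module; the prefixes and the limit of 4 are made up, and whether summarisation is acceptable depends on your own routing design.

```python
import ipaddress


def advertised_count(prefixes, summarize=False):
    """Count prefixes the hub would advertise, optionally after aggregation."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    if summarize:
        # collapse_addresses merges adjacent and contained networks.
        networks = list(ipaddress.collapse_addresses(networks))
    return len(networks)


def headroom(prefixes, max_prefixes, summarize=False):
    """Remaining prefixes before the (assumed) provider limit is reached."""
    return max_prefixes - advertised_count(prefixes, summarize)


sdwan_prefixes = ["10.1.0.0/24", "10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24", "10.2.0.0/24"]
print(headroom(sdwan_prefixes, max_prefixes=4))                  # negative: over the limit
print(headroom(sdwan_prefixes, max_prefixes=4, summarize=True))  # aggregated into fewer routes
```

In this example the four contiguous /24s collapse into a single /22, which brings the advertisement back under the assumed limit.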
Final design tweaks and post migration tidy-up
At the end of the mass migration you will still have some challenges to solve. The requirement was to provide additional P2P links between the hubs for DB replication. When you add additional peering points into your WAN, you create one more transit path. At the end of the SD-WAN roll-out each branch has only 1 x MPLS and 1 x INET circuit connecting it to the network. Each hub, on the other hand, has not only dual MPLS and INET but also a full mesh of P2P links. As long as there is no prefix overlap between the SD-WAN and the P2P links, you have no risk of routing loops and your routing is predictable. How about the failure scenario? Which way would database replication traffic go if you lose this high throughput P2P circuit? Perhaps it would re-route via another region automatically, but what if that leads to completely unacceptable delay? Should it route via SD-WAN or native MPLS?
This example shows that even with a well defined design there will be things that are difficult to predict in the early stages. The key point at the end of your WAN migration is to run detailed fail-over testing that can help you uncover cases like this. The results of this testing may lead to design changes or updates to your routing policy. Although many organizations are hesitant to start breaking the network at the end just to uncover design shortcomings, at EDNX we believe it is a fundamental part of the project. It is a lot better to gain confidence in the solution in a controlled manner than to be caught unprepared in the middle of the working day.
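The P2P failure question can also be explored on paper before the fail-over tests. The sketch below runs a standard shortest-path search over a hypothetical three-hub mesh with invented latencies, then repeats it with one link failed. It models only the P2P mesh, not SD-WAN or native MPLS as alternative paths.

```python
import heapq


def lowest_latency(links, src, dst, failed=frozenset()):
    """Dijkstra over undirected links {(a, b): latency_ms}, skipping failed links.

    `failed` holds frozenset({a, b}) entries. Returns (latency_ms, path) or None.
    """
    graph = {}
    for (a, b), ms in links.items():
        if frozenset((a, b)) in failed:
            continue
        graph.setdefault(a, []).append((b, ms))
        graph.setdefault(b, []).append((a, ms))

    best = {src: 0.0}
    queue = [(0.0, src, [src])]
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, ms in graph.get(node, []):
            new_cost = cost + ms
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour, path + [neighbour]))
    return None


# Hypothetical full mesh of P2P links between three hubs (latencies invented).
p2p = {("hub-a", "hub-b"): 12.0, ("hub-a", "hub-c"): 70.0, ("hub-b", "hub-c"): 80.0}
print(lowest_latency(p2p, "hub-a", "hub-b"))
print(lowest_latency(p2p, "hub-a", "hub-b", failed={frozenset(("hub-a", "hub-b"))}))
```

If the replication budget were, say, 40 ms, the second result would show that re-routing via the third hub is automatic but unacceptable, which is exactly the kind of finding fail-over testing is meant to surface.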