Bad things happen. They always do.
That is why we need to exercise caution in many of the things we do, including our deployment process.
Some of the worst outages I’ve seen could have been avoided entirely, or at least had their impact minimized, with a proper safe deployment process.
In this post, we’ll cover some safe deployment practices I’ve seen or used in cloud services to make the deployment process safer.
Before we begin, let’s talk about some general high-availability basics:
- You need more than one instance of your cloud service. This ensures that if one instance goes down for planned or unplanned reasons (deployment, OS upgrades, hardware faults, etc…), there are other instances that can serve the service’s requests while it is down. Also, you will need to take the instance out of the load balancer during deployment, although the platform usually does that automatically when you use PaaS to host your cloud services.
- You will want your cloud service to be deployed to more than one data center. If, for whatever reason, an entire data center goes down, and your services happen to be hosted in that data center, you will be able to reroute the traffic to your services in the other data center. The closer the two data centers are geographically, the closer the request latencies will be after you reroute requests from one data center to the other.
- Once you decide to host your service in multiple geographical regions, in order to give good latencies to requests worldwide, you will want to host your cloud service on pairs of data centers, where each pair is in a single geographical region. The idea is that you should be able to route all the requests from one DC to its paired DC in case of a fault (due to a data center outage, or a bad deployment). Using services like AWS’ “Route 53” or Azure’s “Traffic Manager”, you are also able to have those failovers happen automatically. Those services can perform health checks on your service and, based on the results, decide that the service in a DC is unhealthy and automatically start rerouting the traffic to the paired DC (usually via a DNS change).
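To make the automatic failover idea concrete, here is a minimal sketch of the decision such a health-checking service performs. The DC names and failure threshold are illustrative assumptions, not real Route 53 or Traffic Manager values:

```python
# Hypothetical sketch of health-check-driven failover. Endpoint names and
# the failure threshold are made up for illustration.
FAILURE_THRESHOLD = 3  # consecutive failed health probes before failover

def choose_endpoint(primary_dc, paired_dc, consecutive_failures):
    """Return the DC that should receive this region's traffic."""
    if consecutive_failures.get(primary_dc, 0) >= FAILURE_THRESHOLD:
        return paired_dc  # reroute traffic (in practice, via a DNS change)
    return primary_dc

# A healthy primary keeps the traffic; an unhealthy one fails over.
assert choose_endpoint("asia-dc1", "asia-dc2", {"asia-dc1": 0}) == "asia-dc1"
assert choose_endpoint("asia-dc1", "asia-dc2", {"asia-dc1": 3}) == "asia-dc2"
```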
To get the general idea of safe deployment, let’s analyze two approaches I’ve seen and used in the past, that have many safe deployment elements in them.
Gradual traffic shift
The core idea in this approach is that you have two environments that are exactly the same (deployed in the same DCs, with the same number of Scale Units, etc…) for the cloud service you want to deploy safely. In addition, you will need some mechanism or component that can easily and gradually reroute requests from one environment to the other. One good candidate for such a component would be a Gateway Routing component. Let’s assume we use this component for now.
- Before you begin the deployment of the new version, the assumption is that all of the production traffic is routed to only one of these two environments. That means you have a (currently) active environment and a (currently) passive environment.
- You first deploy the new version, in parallel, to all of the service instances in the “passive” environment.
- Run any tests you want on the “passive” environment (bypassing the Gateway Routing).
- Using your Gateway Routing component, gradually start shifting traffic from the active environment to the passive environment. You can either do the gradual traffic shift automatically, or you can do it in waves (shift 1% of the traffic to the passive env, then 10% then 25% etc…).
- If the traffic shift happens automatically, you will want to have some system to analyze the quality of service of requests in the passive environment, so that the gradual shifting of the traffic can also stop and revert itself if there are faults with the new version.
After the traffic has been shifted, the previously passive environment becomes the “active” environment and vice versa.
- Fine-grained control over how much traffic is exposed during the deployment minimizes the potential impact, compared to a “staging” swap that exposes all of the traffic to an instance with the new version at once.
- This approach screams “Testing in Production” because after deploying the new version to the passive environment, you can run whatever tests you want on the future active production environment before you start shifting real user traffic to it.
- Very quick rollback. If you (or one of your monitors) detect a problem with the new version, you can instantly reconfigure your routing logic to route all requests back to the environment with the stable version.
- Expensive – half of your instances are passive and unutilized. This can be partially mitigated by scaling out your passive environment only just before you start shifting traffic to it.
- You can only have up to two different versions in production at any point in time, and therefore, you can’t use continuous deployment with this approach.
Ring based deployment
I would strongly recommend you read my previous post about “Tenant Pinning” before you continue reading this one – as the Tenant Pinning concept will greatly enhance the efficiency of this approach.
The idea behind this approach is that you deploy your new version in five pre-defined ordered waves (rings). Each ring defines a group of your cloud service instances.
You deploy the new version to all the services in a single ring, wait, and move on to the next ring. Each ring (usually) contains a larger number of service instances than the previous one. This means that a deployment to ring 1 should have the least impact on your production environment, ring 2 will have a little bit more impact, and so forth. After ring 5, all of the service instances have been upgraded to the latest version.
Now, after you read my previous post and know what tenant pinning is :), let’s assume for the next example that you have 2 scale units in your production environment. Each scale unit serves requests from a different set of tenants.
Let’s look at an example for defining the deployment rings:
Let’s analyze the rings in this example to understand the concepts behind the ring based deployment approach. In the example above, the assumption is that the North American region has the largest volume of traffic, Europe has the 2nd largest, and Asia has the least amount of traffic.
Ring 1 includes a single scale unit in a single DC. The DC chosen here is a DC of the region with the least amount of traffic. In the above example, ring 1 exposes half of the Asia region’s traffic (only one DC in the region) for half of the customers (only one Scale Unit).
- Some confidence in the new version after seeing it work in a production environment for a very limited number of customers and a limited amount of traffic.
Ring 2 includes a single scale unit in a single DC. The DC chosen here is a DC of the region with the 2nd smallest amount of traffic. We do this because we want the new version to be exposed to a larger volume of traffic in a specific DC\ScaleUnit and see how it reacts under a little more load. Also, deploying the new version to a different region will probably mean it is going to serve more “kinds” of workloads supported by the service. Your customers in the Asia region might not be using the same workloads the European customers do.
- Confidence that there are no performance regressions after the new build is exposed to a larger volume of traffic in the 2nd smallest (and 2nd largest in this case) region.
- Works with more workloads and usage types.
In ring 3 we complete the deployment to more (or all) scale units in the first two DCs. This way, if there are any service dependencies shared across all scale units in a DC, you verify there are no regressions in the performance of your new build when it uses new API’s\workloads\operations with those dependencies. In this ring, we also deploy the new build to two different DCs at once in a new scale unit – which, again, can potentially expose even more types of workloads and usage, because scale unit 2 serves a different set of customers than scale unit 1.
- Verify your service dependencies can handle larger volumes of traffic of your (potentially) new usage.
- The new version is exposed to even more workloads and usage types.
Now, after ring 3, we’re pretty confident in the new version. There might still be regressions or hidden bugs in the new build, but the chance that they haven’t been caught by ring 4 is low. Here, we deploy the new version to half of the DCs in the regions that weren’t exposed to it yet. After ring 4, half of the DCs are running the new version, but only one DC in each fault domain. This means that if something goes wrong, we are still able to reroute all traffic from every DC with the faulty version to its paired DC in the same fault domain – and instantly mitigate the issue.
- Half of the DCs are using the new version, which gives us almost complete confidence in the new build.
- Up to and including this ring, we’re able to instantly mitigate issues caused by the deployed version.
We complete the deployment to all remaining DCs.
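The five rings above can be expressed as a simple data structure driving the rollout. This is an illustrative sketch only: the DC and scale-unit names are made up to match the example layout (three regions ordered by traffic volume, a pair of DCs per region, two scale units per DC):

```python
# Hypothetical ring definitions matching the example above. All names
# (asia-dc1, su1, ...) are made up for illustration.
RINGS = {
    1: [("asia-dc1", "su1")],                          # smallest region first
    2: [("europe-dc1", "su1")],                        # 2nd smallest region
    3: [("asia-dc1", "su2"), ("europe-dc1", "su2")],   # complete the first DCs
    4: [("na-dc1", "su1"), ("na-dc1", "su2")],         # half of remaining DCs
    5: [("asia-dc2", "su1"), ("asia-dc2", "su2"),      # everything else
        ("europe-dc2", "su1"), ("europe-dc2", "su2"),
        ("na-dc2", "su1"), ("na-dc2", "su2")],
}

def deploy(version, deploy_to, ring_is_healthy):
    """Roll out ring by ring; halt the rollout if a ring looks unhealthy."""
    for ring in sorted(RINGS):
        for dc, su in RINGS[ring]:
            deploy_to(dc, su, version)
        if not ring_is_healthy(ring):
            return ring  # halted; paired DCs still run the old version
    return None          # rollout completed

# A fault detected in ring 4 halts the rollout before ring 5 is touched.
deployed = []
assert deploy("v2", lambda dc, su, v: deployed.append((dc, su)),
              lambda ring: ring < 4) == 4
assert ("na-dc1", "su1") in deployed and ("na-dc2", "su1") not in deployed
```

Note that through ring 4 at most one DC of each pair runs the new version, which is what keeps the reroute-to-paired-DC mitigation available.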
You’ll definitely want a Ring 0 that is deployed before ring 1. Ring 0 will use a special scale unit that does not serve any production traffic directly. In the above example, this special “Canary” scale unit is in every prod DC. You don’t have to host the Canary scale unit in every DC, but doing so has an important benefit – we’ll discuss it later.
The idea here is that you will be able to verify the new version in a semi-production environment (same DCs, same service dependencies, same secrets and client certificates, etc… as your production environment).
There are many things you can do with ring 0 to verify the quality of the version you are about to deploy to production.
- You can run synthetic traffic on the Canary scale unit and verify that the “sanity” workflows are working in a production environment, including all service dependencies.
- You can develop a forking mechanism that takes a percentage of your real production traffic, “copies” it (probably not in real time), and sends it to the Canary environment running the new build. You can then run a manual or automatic analysis to detect differences between the responses from the Canary environment (running the new version) and the responses from the Production environment (running the existing version). If any of the differences reveal a regression in the service’s behavior, you can fix the bug before any real customer is impacted.
- When you want to renew service secrets\certificates, if your canary scale unit is spread across all your production DCs, you will be able to validate any new secrets\certificates in a production environment using the same secrets & certificates you are about to use in your customer-facing scale units.
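The forking mechanism from the second point above can be sketched roughly as follows. The function names are assumptions; a real implementation would record and replay traffic asynchronously rather than call both versions inline:

```python
import random

def fork_and_compare(requests, prod_handler, canary_handler, sample_pct=1.0):
    """Replay a sample of production requests against the canary build and
    return the requests whose responses differ from production's."""
    diffs = []
    for req in requests:
        if random.uniform(0, 100) >= sample_pct:
            continue  # only a small percentage of traffic is forked
        if canary_handler(req) != prod_handler(req):
            diffs.append(req)  # candidate regression to analyze
    return diffs

# Toy example: the canary build mishandles one request.
prod = lambda req: req.upper()
canary = lambda req: "BUG" if req == "b" else req.upper()
assert fork_and_compare(["a", "b", "c"], prod, canary, sample_pct=100) == ["b"]
```

In practice the comparison step usually has to ignore expected differences (timestamps, request IDs) so that only behavioral regressions surface.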
- If you want, you can actually COMBINE the two approaches discussed above (Ring based deployment + Gradual traffic shift). Basically, instead of upgrading the cloud services in-place, you could just shift the traffic of all services in a given ring to the corresponding services in the “passive” environment that has the new build.
- You will want to wait a sufficient amount of time between ring deployments. Your monitors might not be perfect and might not catch all the issues and regressions your customers are experiencing, so you need to give your customers time to report them. A wait of 24 hours between rings should be sufficient, but you can decide what works best for you. For a hotfix, on the other hand, you will probably want very little (if any) waiting time between rings.
- In addition to waiting a sufficient amount of time between rings, you can also define usage goals for each ring. Examples of usage goals are the number of users served by the current ring with the new version, the number of requests that hit the current ring with the new version, or the percentage of scenarios exercised in the current ring with the new version.
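Combining the last two points, a ring’s “bake” gate might look like the following sketch (goal names, the 24-hour default, and the function are all illustrative assumptions):

```python
from datetime import timedelta

def ring_gate_passed(elapsed, observed, goals, min_bake=timedelta(hours=24)):
    """Advance to the next ring only after a minimum soak time AND after
    every usage goal (e.g. users_served, requests) has been met."""
    if elapsed < min_bake:
        return False
    return all(observed.get(name, 0) >= target for name, target in goals.items())

goals = {"users_served": 100, "requests": 5_000}
# Goals met, but not enough soak time yet:
assert not ring_gate_passed(timedelta(hours=2),
                            {"users_served": 500, "requests": 9_000}, goals)
# Enough soak time and all goals met:
assert ring_gate_passed(timedelta(hours=25),
                        {"users_served": 500, "requests": 9_000}, goals)
```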
- One of the best ways to enhance your safe deployment process even further is Feature Flags (Toggles). Feature flags allow you to turn specific functionality in your service on\off independently of your deployment process. This means you can ship a new feature or functionality change in a “disabled” state, and turn it on at the pace of your choice with gradual exposure. More on this topic in the next blog post.
- If you don’t want the added latency of mitigating a problem with the new version by rerouting traffic to a paired DC, you could also create a special additional scale unit in all production DCs. This scale unit would not have any production traffic routed to it by default (maybe only synthetic). Then, in case of a fault caused by the new version, you would be able to route a specific scale unit’s traffic to this special scale unit in the same DC. This solution is only viable in the earlier rings, when the current ring contains only a single scale unit (rings 1-3 in the above example). You do not want to route more than one scale unit’s traffic to this special scale unit – to avoid “Tenant Poisoning” (which means violating the tenant pinning in your system).
- In most cases, you will want to treat configuration changes the same way you treat code changes and use the same safe deployment processes. Some of the worst outages I’ve seen were caused by a configuration change.
- Some services also have another special scale unit that serves “intra-company” usage. That means that employees of the company that maintains the service (or other “friendlies”) are exposed to the new version before any other real customer.
- As usage goes up, and the number of customers that are using your service increases, consider creating more scale units instead of scaling out your existing ones. This will ensure the first few rings expose only a small number of customers to the new version.
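The feature flags mentioned above can be as simple as a rollout percentage per feature. A minimal sketch (all names are assumptions) that also gives each tenant a stable on/off decision as the percentage grows:

```python
import hashlib

# Minimal percentage-based feature flag sketch. Hashing (feature, tenant)
# gives each tenant a stable bucket, so a tenant that is enabled stays
# enabled as the rollout percentage only ever increases.
FLAGS = {"new-ranking": 10}  # feature -> % of tenants exposed

def is_enabled(feature, tenant_id):
    pct = FLAGS.get(feature, 0)
    digest = hashlib.sha256(f"{feature}:{tenant_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < pct

# Gradually exposing the feature is a config change, not a redeployment.
FLAGS["new-ranking"] = 100
assert is_enabled("new-ranking", "tenant-42")      # 100% -> everyone enabled
FLAGS["new-ranking"] = 0
assert not is_enabled("new-ranking", "tenant-42")  # 0% -> feature is off
```

Note this is also a case where the previous point applies: flipping a flag is a configuration change and deserves the same ring-by-ring care as a code change.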
The points and approaches discussed in this post are mostly to give you the general ideas of what you need to think about when planning your deployment process. Only you know best which of these ideas can work well for your service.
Special thanks to Radhika Kashyap for sharing her knowledge on the topic with me, which helped me to better present the ideas in this post.