Zero downtime deployments - implementation
In my previous post I discussed why I think zero downtime deployments are a vital aspect of delivering quality software. This is the follow-up article detailing some of the technical approaches you and your teams can implement to help drive zero downtime deployments into your pipelines.
There are several different approaches you can follow to enable zero downtime deployments, and the route you choose will depend on many factors, including hosting platform topology, cost implications and appetite for risk, alongside the complexity of the system and its usage profile.
Zero downtime deployments are always easier and safer to manage using automated Continuous Integration / Continuous Deployment (CI/CD) pipelines built on systems like Azure DevOps, Jenkins or TeamCity. While it is perfectly possible to manage zero downtime deployments by hand, there are many additional benefits to implementing a CI/CD pipeline to support the process.
Benefits of CI/CD pipelines
Some of the many benefits of implementing Continuous Integration / Continuous Deployment include:
- Deploy faster – automated CI/CD pipelines enable faster, repeatable deployments, using automation to do all the heavy lifting.
- Improved ROI – CI/CD pipelines require a time investment up front to build out, but over the lifespan of a digital product this investment is realised through huge time savings on every automated deployment.
- Continuous feedback – CI/CD provides feedback sooner and more frequently, because integrating smaller changes more often means each change can be tested faster.
- Improved security – CI/CD pipelines can include additional steps that scan code for known security issues as part of your build processes.
- Increased testing coverage – CI/CD pipelines enable unit tests to run as part of your builds, alongside automated browser-based testing as part of your deployments.
In addition to CI/CD pipelines, one thing common across many of the concepts discussed in this article is the need to shape traffic and manage real-time user load between resources. This applies both to public-facing network demand and to internal network traffic for supporting systems like API services.
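Most of the strategies below build on this primitive. As a minimal sketch, assuming a hypothetical router that splits requests between two upstream versions by weight, traffic shaping boils down to weighted selection:

```python
import random

# Hypothetical upstream versions; the weights set the share of traffic
# each one receives (e.g. 90% to the current release, 10% to the new one).
UPSTREAMS = {
    "v1.4.2": 0.9,
    "v1.5.0": 0.1,
}

def choose_upstream(upstreams: dict[str, float]) -> str:
    """Pick an upstream for a request using weighted random selection."""
    versions = list(upstreams)
    weights = list(upstreams.values())
    return random.choices(versions, weights=weights, k=1)[0]

# Over a large sample the split approaches the configured weights.
sample = [choose_upstream(UPSTREAMS) for _ in range(10_000)]
for version in UPSTREAMS:
    print(version, sample.count(version))
```

In practice this job is done by your load balancer, service mesh or CDN; the point is simply that the routing weights must be adjustable on demand.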
With CI/CD pipelines in place and a method for shaping traffic between resources, you have everything at your fingertips to implement zero downtime deployments.
As a general rule, I have always favoured tackling the hardest part of a problem first in day-to-day development. However, when implementing automated zero downtime deployments I prefer to start simple and grow the solution to handle more complex systems as knowledge and understanding increase.
Walking Skeleton
This term was introduced to me last year during an excellent talk at LeadDev London 2023. While the term was new to me, the concept was not. A "Walking Skeleton" is, in essence, a form of end-to-end proof of concept. Typically it focuses on a single piece of functionality: a minimalistic end-to-end implementation that results in working, shippable and testable code flowing via CI/CD pipelines all the way through to production.
The process of creating this end-to-end implementation is the perfect time to also fold in your zero downtime deployment processes.
Proving the concept
It’s vital you and your team can demonstrate that there is zero impact to your services while deploying new features using zero downtime deployment patterns. This is part of the trust building aspect that should never be overlooked. The benefits discussed in my previous article can’t be realised without having the trust from your organisation that you are able to deploy without any impact to consumers of the systems you are updating.
I would recommend designing test strategies that provide the data to evidence that your zero downtime deployments are working as expected.
These tests require a suitable standardised, baselined load profile which can be repeated, alongside various supporting metrics, to help prove your concept does not impact the service during deployments. The tests will vary massively depending on the system under test, its architectural structure and its hosting resource topology. They should be designed in conjunction with your internal testing/QA teams to ensure everything is covered and measured. During this phase you should also exercise failure states, to ensure that if any step of your process fails it cannot cause disruption to the systems.
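A simple baseline probe can be one useful data point here. The sketch below (the URL, duration and interval are placeholders, not a real service) polls an endpoint continuously through a deployment window and reports failures and worst-case latency:

```python
import time
import urllib.error
import urllib.request

# Placeholder endpoint; in practice, point this at a health check or a
# key user-journey URL of the system under test.
TARGET_URL = "https://example.com/health"

def probe(duration_seconds: int = 60, interval: float = 0.5) -> None:
    """Poll the target for the whole deployment window, counting failed
    requests and tracking the worst observed latency."""
    failures = total = 0
    worst_latency = 0.0
    deadline = time.monotonic() + duration_seconds
    while time.monotonic() < deadline:
        total += 1
        started = time.monotonic()
        try:
            with urllib.request.urlopen(TARGET_URL, timeout=5) as response:
                if response.status != 200:
                    failures += 1
        except (urllib.error.URLError, TimeoutError):
            failures += 1  # any error or timeout counts against the deployment
        worst_latency = max(worst_latency, time.monotonic() - started)
        time.sleep(interval)
    print(f"{total} requests, {failures} failures, "
          f"worst latency {worst_latency:.3f}s")

probe()
```

A deployment run during the probe window should report zero failures and no meaningful latency degradation compared with a baseline run taken while no deployment is in progress.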
Zero downtime sounds great, but how can it be achieved?
There are several different approaches you and your teams can implement to enable zero downtime deployments for your services and systems.
The approach you take will depend on many factors: there is no one-size-fits-all solution.
I believe it is vital that your development teams "work backwards" when planning new features, always asking the question, "how will we deploy this feature?"
Some of the approaches described below require careful planning when single data sources are in the mix. Zero downtime deployments for single-source database systems, such as content management systems, take careful consideration to implement. Options including feature flags should be considered alongside defensive coding practices when dealing with single data sources, schema changes and zero downtime deployments.
When managing schema changes, always design any code that depends on the modified schema to cope gracefully with both the original and the modified schema, ensuring deployments don't impact consumers of your services while the system is in a transient state.
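As a minimal illustration, assume a hypothetical migration that splits a single name column into first_name and last_name. While old and new schema versions coexist mid-deployment, reading code should tolerate both shapes:

```python
# Hypothetical record shapes: the old schema stores a single "name"
# field, while the new schema splits it into "first_name"/"last_name".
# During a zero downtime rollout both shapes can coexist, so any code
# reading the data must cope with either.

def display_name(row: dict) -> str:
    """Return a display name whether the row uses the old or new schema."""
    if "first_name" in row:  # new schema
        return f"{row['first_name']} {row['last_name']}".strip()
    return row["name"]  # old schema, still present mid-deployment

print(display_name({"name": "Ada Lovelace"}))                         # old shape
print(display_name({"first_name": "Ada", "last_name": "Lovelace"}))   # new shape
```

The same idea applies on write: populate both old and new columns until every running version of the code understands the new schema, then remove the old column in a later deployment.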
For API-based deployments I strongly advise implementing a good versioning scheme, enabling the clients consuming the API to be switched between different versions of the same API layer on the same resources. This approach enables more flexible release options when running the various deployment strategies described below.
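As a rough sketch of the idea (the endpoint, versions and payloads are all invented for illustration), one process can serve two versions of the same endpoint side by side, so clients can be pointed at either version independently of which infrastructure serves them:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical handlers for two versions of the same endpoint, served
# from one process so clients can be switched between them per route.
def orders_v1() -> dict:
    return {"orders": [], "version": 1}

def orders_v2() -> dict:
    return {"orders": [], "version": 2, "next_page": None}  # new response shape

ROUTES = {
    "/v1/orders": orders_v1,
    "/v2/orders": orders_v2,
}

class VersionedApi(BaseHTTPRequestHandler):
    def do_GET(self):
        handler = ROUTES.get(self.path)
        if handler is None:
            self.send_error(404)
            return
        body = json.dumps(handler()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), VersionedApi).serve_forever()
```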
Rolling Deployments
A rolling deployment approach incrementally replaces previous versions of a service or application with a new version by entirely switching out the environment in which the application is running. This could be, for example, a rolling deployment across a load-balanced, multi-server environment, or containers running a new version of an application taking the place of containers running the previous version.
With rolling deployments, the infrastructure resources in which old and new services run are not isolated from each other. This lack of isolation means a rolling deployment can be accomplished quickly, but changes to supporting systems, including APIs and databases, also need to be considered.
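A minimal sketch of the rolling loop follows; the fleet and the drain/deploy/health-check helpers are stand-ins for your platform's real APIs rather than any particular tool:

```python
import time

# Hypothetical fleet of instances behind a load balancer.
FLEET = ["web-1", "web-2", "web-3", "web-4"]

def drain(instance: str) -> None:
    print(f"removing {instance} from the load balancer")  # stub

def deploy(instance: str, version: str) -> None:
    print(f"deploying {version} to {instance}")  # stub

def healthy(instance: str) -> bool:
    print(f"health-checking {instance}")
    return True  # stub: poll the instance's health endpoint in practice

def restore(instance: str) -> None:
    print(f"returning {instance} to the load balancer")  # stub

def rolling_deploy(version: str, pause: float = 1.0) -> None:
    """Replace the running version one instance at a time, so the rest
    of the fleet keeps serving traffic throughout the rollout."""
    for instance in FLEET:
        drain(instance)
        deploy(instance, version)
        if not healthy(instance):
            raise RuntimeError(f"{instance} failed health checks; halting rollout")
        restore(instance)
        time.sleep(pause)  # let metrics settle before touching the next instance

rolling_deploy("v1.5.0")
```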
A/B Deployments
An A/B deployment approach enables your organisation to test two versions of an application feature set on your users. A/B testing is a common approach with content management systems when gathering insights into the user experience impact of different approaches. Some CMS platforms offer A/B testing as part of their framework, and this deployment approach can help augment it.
The “A” version would be the old version, while the “B” version would contain a new or revised feature. Each version would be released to a subset of users for testing, with metrics gathered against goals and, of course, feedback from your users.
Once the “B” version has been tested and the data has proved it to be an improvement, it can replace the “A” version. A/B deployments are complicated because organisations must calculate a sample size and observe results carefully to draw the correct conclusions about the best version of the application.
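One supporting building block worth sketching is deterministic bucketing, so a given user always lands in the same variant across visits. The 10% share below is purely illustrative:

```python
import hashlib

def variant_for(user_id: str, b_share: float = 0.1) -> str:
    """Deterministically assign a user to the "A" (current) or "B" (new)
    variant so they see a consistent experience on every visit."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # map hash onto [0, 1)
    return "B" if bucket < b_share else "A"

for user in ["alice", "bob", "carol", "dave"]:
    print(user, variant_for(user))
```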
Canary Deployments
This deployment approach is similar in concept to the rolling deployment option described above, but with a focus on reducing risk. A canary deployment uses a phased strategy in which traffic is shaped between different versions of your application or service in small increments. The key benefit here is that the new application code is released to a small group of users so it can be tested. Metrics including performance, goal tracking and stability measure the success of the new feature against the existing version on a small scale.
Much like rolling deployments, canary deployments can present backward-compatibility risks, and downstream supporting API layers and databases must be carefully factored in. These deployments also take more time due to the incremental approach.
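Putting that together, a canary rollout can be sketched as a loop that raises the traffic weight in steps and rolls back the moment a key metric degrades. The weight-setting and metric functions here are stubs standing in for your load balancer and monitoring APIs:

```python
import time

def set_canary_weight(percent: int) -> None:
    print(f"routing {percent}% of traffic to the canary")  # stub: call your router/LB

def canary_error_rate() -> float:
    return 0.001  # stub: query your monitoring system here

def canary_release(steps=(1, 5, 25, 50, 100), max_error_rate: float = 0.01,
                   soak_seconds: float = 1.0) -> None:
    """Shift traffic to the new version in small increments, checking a
    key metric at each step and rolling back if it degrades."""
    for percent in steps:
        set_canary_weight(percent)
        time.sleep(soak_seconds)  # let the new weight soak before judging it
        if canary_error_rate() > max_error_rate:
            set_canary_weight(0)  # roll back: all traffic to the stable version
            raise RuntimeError(f"canary failed at {percent}%; rolled back")
    print("canary promoted to 100% of traffic")

canary_release()
```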
Blue-Green Deployment
Considered by many to be the holy grail of deployment strategies, Blue-Green is also often the most expensive and time-consuming configuration to implement.
The Blue-Green deployment approach completely removes downtime by running two identical production environments, one called Blue and the other called Green. Only one of these identical environments is live at any one time, handling all production traffic.
A fully implemented Blue-Green deployment approach effectively removes risk and enables zero downtime by allowing your organisation to switch between two production environments. This enables complete offline, end-to-end testing of newly deployed code on one production environment while the other processes live user requests. It also offers a great rollback option, as you always have a second production-ready environment at your disposal.
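The cutover itself can be sketched as little more than repointing a live alias once the idle environment passes its tests; the environment names and smoke-test stub below are illustrative rather than any particular platform's API:

```python
# Hypothetical state: a router alias points "live" at one of two
# identical environments, so a deployment ends by repointing it.
environments = {"blue": "v1.4.2", "green": "v1.5.0"}
live = "blue"

def smoke_test(env: str) -> bool:
    print(f"running end-to-end tests against the {env} environment")
    return True  # stub: run your automated test suite here

def switch_live() -> str:
    """Cut traffic over to the idle environment after it passes testing;
    the previous environment stays intact as an instant rollback target."""
    global live
    idle = "green" if live == "blue" else "blue"
    if not smoke_test(idle):
        raise RuntimeError(f"{idle} failed smoke tests; staying on {live}")
    live = idle
    print(f"traffic now served by {live} ({environments[live]})")
    return live

switch_live()
```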
Cost is a downside to Blue-Green deployments, because you need to build and maintain two distinct, production-ready environments with identical infrastructure.