Reducing Software Deployment Risk
The cult classic Galaxy Quest featured the “Omega-13 Device,” which let you jump back in time 13 seconds to correct a mistake. It was essentially a big “undo” button. Imagine that: if you could undo any mistake, you would drive the cost of error down to zero. You would be free to innovate and make mistakes, because the mistakes wouldn't hurt you.
The global server outages of the last few weeks have me thinking more generally about the commercial risks of software updates. What happens if an update goes wrong? Blaming the vendor doesn’t heal your customers. Just ask anyone stuck in an airport while the servers were down.
Mistakes happen, so we need to find a way to drive down the cost of errant software updates. One technique that I’m fond of is “immutable deployment.”
In days of yore, you had a server. You installed software on it. You updated it. You patched it. You maintained it. And if something broke, you faced downtime. It was like a pet: if it became ill, you nursed it back to health. To deploy software updates, you would schedule downtime, say, a low-usage window like Sunday morning. If the update worked, all was well. Otherwise, you wouldn’t see daylight until it was fixed.
The traditional fallback was “disaster recovery,” and that toolbox offers many ways to address a bad update. If you’ve created and tested your backup processes, you can install a fresh server from the most recent backup. More sophisticated approaches include point-in-time recovery, where you roll back to the moment before things went haywire. Those are good tools, but what if we could avoid disaster recovery altogether?
Virtualization, containerization, and the cloud have opened more options. Immutable deployment is my favorite. You never patch or update a server. Instead, you build a new image containing the software updates. You put that image through your automated testing suite. If it passes, you spin up new servers using the new image and then retire the older servers. If a bug escapes and you have to roll back to the previous release, you simply spin up new servers using the previous image and retire the servers running the errant update. All of this happens with the flick of a switch or through automation.
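To make that concrete, here is a minimal sketch of a roll-forward and roll-back, assuming AWS EC2 via the boto3 library. The AMI IDs, the `release` tag, and the instance type are hypothetical placeholders; a real rollout would also health-check the new servers and shift traffic before retiring the old ones.

```python
import boto3

ec2 = boto3.client("ec2")  # assumes AWS credentials are already configured


def running_release_instances(release_tag):
    """Find running instances tagged with a given release (tag name is hypothetical)."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:release", "Values": [release_tag]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]


def deploy(image_id, release_tag, count=2):
    """Spin up fresh servers from a pre-tested image; never patch in place."""
    resp = ec2.run_instances(
        ImageId=image_id,         # the new (or previous) image
        MinCount=count,
        MaxCount=count,
        InstanceType="t3.micro",  # placeholder instance type
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "release", "Value": release_tag}],
        }],
    )
    return [inst["InstanceId"] for inst in resp["Instances"]]


def retire(release_tag):
    """Terminate the servers running an old (or errant) release."""
    ids = running_release_instances(release_tag)
    if ids:
        ec2.terminate_instances(InstanceIds=ids)


# Roll forward: bring up v42, then retire v41 once traffic has shifted.
deploy("ami-0123456789abcdef0", "v42")
retire("v41")

# Roll back: the "Omega-13" move is just the same operation in reverse.
deploy("ami-0fedcba9876543210", "v41")
retire("v42")
```

Notice that rollback is just another deployment from a known-good image, which is why it takes minutes instead of days.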
Contrast this with disaster recovery and restoring from backup. As we saw with recent events, it can take days of manual reconfiguration to bring systems back to their previous state.
Immutable deployment requires a couple of things. You need a virtualized or containerized workload. And you have to separate your compute from your data; entangled compute and data is a key area of technical debt that prevents some workloads from employing the technique. Of course, building those image recipes can be a lot of work, but you have to do some version of that anyway. Your workload needs a recipe. It can’t simply be the accumulation of all updates and changes ever applied to it.
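For a sense of what a recipe can look like, here is a hedged sketch using Docker and the Python Docker SDK. The base image, the `myapp` package, and the tag are made-up examples; the point is that every layer of the image is declared up front, so rebuilding from the recipe always yields the same result.

```python
import io
import docker  # Python Docker SDK; assumes a local Docker daemon is running

# A hypothetical recipe. Every layer is declared here, so the image is
# never the accumulation of manual changes; it can always be rebuilt.
DOCKERFILE = """\
FROM python:3.12-slim
RUN pip install --no-cache-dir myapp==4.2.0
CMD ["python", "-m", "myapp"]
"""

client = docker.from_env()
image, build_log = client.images.build(
    fileobj=io.BytesIO(DOCKERFILE.encode("utf-8")),
    tag="myapp:4.2.0",  # an immutable, versioned tag, never a mutable "latest"
    rm=True,            # discard intermediate containers after the build
)
print(image.id)
```

Check that image into your registry under a versioned tag and it becomes the unit you deploy, test, and roll back to.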
Immutable deployment works especially well when combined with continuous integration/continuous deployment (CI/CD), but CI/CD is not a prerequisite. The trick is that you install your latest software and your latest patches to a reusable image, not the live machine.
People move to the cloud to be more resilient and more nimble. Agility requires reducing the cost of errors and the penalty for choices that don't work out. Otherwise, you're just treating the cloud like colocation space…like someone else's computer.
Creating immutable deployments takes some work. But the peace of mind it brings makes it all worthwhile.
-- Carter
Interesting and timely article, Carter, thanks! I agree it's a good time for some folks to reevaluate the perceived costs and risk mitigation of immutable deployments. Ironically, from my experience on a Windows DevSecOps containerization project, CrowdStrike was the easiest of the handful of third-party security tools to image.