What can DevOps improve?

Storytime.

Once upon a time, there was a company that had a SaaS product. The company still exists and thrives. After all, this is a happy story. The company had a problem: a two-year-long tail of supported versions. Support kept receiving bug reports for issues that had already been fixed for some other customer, and the remedy was to expedite a version upgrade.

An expedited version upgrade meant that the one engineer responsible for upgrading things would check his calendar and book two days: one day for running the test upgrade script and having someone verify the deployment manually, the other for executing the main upgrade script and having someone test it manually. In fact, in this very mature company, the test and main upgrades were performed with the same script; one just had to change a parameter and an Excel sheet in between. The engineer probably dreamed of progress bars and had nightmares about red warnings at the final step. There were 40 versions of the software in the wild, two deployments happened per week in the evenings, and one deployment took 3-9 hours.
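
For flavour, here is a minimal sketch of what such a dual-purpose script could look like. The Upgrade-Service.ps1 name, the -Target parameter and the Inventory.csv file (standing in for the Excel sheet) are my assumptions for illustration, not the company's actual code:

    # Upgrade-Service.ps1 (hypothetical): one script, two deployments.
    param(
        [Parameter(Mandatory)]
        [ValidateSet('Test', 'Main')]
        [string]$Target,

        [Parameter(Mandatory)]
        [string]$Customer
    )

    # Inventory.csv stands in for the Excel sheet of the story.
    $inventory = Import-Csv -Path '.\Inventory.csv'
    $environment = $inventory |
        Where-Object { $_.Customer -eq $Customer -and $_.Role -eq $Target }

    Write-Host "Upgrading $($environment.Host) ($Target) for $Customer..."
    # ... the actual upgrade steps would follow here ...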

T+1 year

Fast forward one year. The service can now be deployed with one command that performs both the test and main upgrades, and nobody needs to verify anything manually anymore. The long tail is down to 130 days or 15 versions. Five deployments happen per night, scheduled during the day. This 12.5x increase is impressive.
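
A sketch of what that one command might chain together, reusing the hypothetical script above; Test-Deployment is an assumed automated smoke-test helper, not a real cmdlet:

    # Hypothetical one-command wrapper around the earlier script.
    function Invoke-Upgrade {
        param([Parameter(Mandatory)][string]$Customer)

        # Test upgrade first; abort the whole run if verification fails.
        & .\Upgrade-Service.ps1 -Target Test -Customer $Customer
        if (-not (Test-Deployment -Customer $Customer -Target Test)) {
            throw "Test upgrade verification failed for $Customer"
        }

        # The main upgrade only proceeds once the automated checks pass.
        & .\Upgrade-Service.ps1 -Target Main -Customer $Customer
        if (-not (Test-Deployment -Customer $Customer -Target Main)) {
            throw "Main upgrade verification failed for $Customer"
        }
    }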

T+1.5 years

Fast forward half a year. The service is deployed in regularly scheduled batches named after the weekday. The long tail is ten weeks or ten versions. The rate is 20 deployments per night, and somebody has to remember to schedule them. The 4x increase on top of the 12.5x is okay.

T+2 years

Fast forward half a year. There is a job that deploys a batch every night. A priority list exists to book a deployment for the next night, and a blacklist holds others back until a specific date. Service owners can request that their customers be put on those lists. The long tail is four versions or six weeks. Around 30 environments are deployed each night. The 1.5x increase on the cumulative 50x so far is nice.
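
A sketch of how such a nightly picker could combine the two lists, assuming simple priority.txt and blacklist.csv files and a hypothetical Get-PendingCustomers helper:

    # Hypothetical nightly picker: honour the priority list, skip blacklisted
    # customers until their hold date, and fill the batch oldest-first.
    $today     = Get-Date
    $priority  = Get-Content '.\priority.txt'
    $blacklist = Import-Csv '.\blacklist.csv' |
        Where-Object { [datetime]$_.HoldUntil -gt $today } |
        Select-Object -ExpandProperty Customer

    $pending = Get-PendingCustomers    # assumed helper, oldest versions first
    $batch   = @($priority) + @($pending | Where-Object { $_ -notin $priority }) |
        Where-Object { $_ -notin $blacklist } |
        Select-Object -First 30

    $batch | ForEach-Object { Invoke-Upgrade -Customer $_ }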

T+2.25 years

Fast forward three months. The deployment job automatically picks what to deploy. Service owners can tell a chatbot what to deploy sooner and what to blacklist until which date. From the customer's perspective, the deployment takes two minutes. There are 50 deployments per night. The 5/3x increase on top of the cumulative 75x is meh.
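
The chatbot end of that could be as mundane as appending to the same two lists; a hedged sketch with hypothetical names:

    # Hypothetical chatbot handler behind "deploy acme sooner" and
    # "hold acme until <date>"; it just feeds the picker's two lists.
    function Add-DeploymentRequest {
        param(
            [Parameter(Mandatory)][ValidateSet('Prioritize', 'Hold')][string]$Action,
            [Parameter(Mandatory)][string]$Customer,
            [datetime]$HoldUntil
        )
        switch ($Action) {
            'Prioritize' { Add-Content '.\priority.txt' -Value $Customer }
            'Hold'       { Add-Content '.\blacklist.csv' -Value "$Customer,$($HoldUntil.ToString('yyyy-MM-dd'))" }
        }
    }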

T+2.5833... years

Fast forward four months. The deployment job can now run all the time. The test upgrades still take some time, but they run all day. The remaining one-minute switch to the newest version happens during the customer's night, based on their main timezone. There are 150 deployments per night across 200 customer environments. The real duration of the main upgrade run is around 160 minutes spread over two nights. The 3x increase over the cumulative 125x is okay and leaves us at 375 times the original pace. Even so, the automated job is idle for most of the week.
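
A sketch of how that timezone-aware switch could be gated, using .NET's TimeZoneInfo; the 01:00-05:00 window, the TimeZone column and the Switch-ToNewVersion cmdlet are assumptions:

    # Hypothetical window check: flip a customer to the new version only
    # during their local night, assumed here to be 01:00-05:00.
    function Test-CustomerNight {
        param([Parameter(Mandatory)][string]$TimeZoneId)
        $tz    = [System.TimeZoneInfo]::FindSystemTimeZoneById($TimeZoneId)
        $local = [System.TimeZoneInfo]::ConvertTimeFromUtc([datetime]::UtcNow, $tz)
        return ($local.Hour -ge 1 -and $local.Hour -lt 5)
    }

    # Test upgrades run all day; the one-minute switch waits for the window.
    foreach ($customer in Import-Csv '.\Inventory.csv') {
        if (Test-CustomerNight -TimeZoneId $customer.TimeZone) {
            Switch-ToNewVersion -Customer $customer.Customer   # assumed cut-over step
        }
    }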

That's just the time perceivable to the customer. They feel that the service just gets a little better every week. For the engineer, the change is even more significant. He used to spend around 1,700 hours a year on upgrading. Now he checks the version distribution graph on the monitor on Thursday mornings. Most of the time, it shows no warnings, 75% of the environments on yesterday's version and the rest on the previous week's release. He spends his work time on more proactive and smarter duties. Let's agree that the glance at the monitor is a long, slow, flirting one and takes about a minute, like in the movies. Against the old 102,000 minutes a year, that weekly minute is a mere 2,000x improvement.

How did it happen?

There was no cloud transformation or microservice architecture makeover. The platform remains the same virtualised Hyper-V Windows platform that it was at the beginning. The original PowerShell scripts were pretty good, good enough that no shiny new configuration management tool was ever really introduced. Sure, they have been rewritten a few times, but still in PowerShell. At one point, a Jenkins appeared, and it is used to initiate things, but it only runs one command, which then does the heavy lifting. The bottleneck has moved to the product itself.

The team didn't grow or shrink significantly around the work. There was a slower period when one member left and another got started, but most of the work was done by a few people at a time, without anyone being allocated to the project full-time.

Is this true?

Real events inspired the story. The timeline is foggy in my memory, and the final increment was left in able hands when I passed the torch. Let's say it is very likely to have happened, as it was the plan.
