When Engineering Problems Disappear

Does your team normally spend time in damage-control mode when releasing software updates? It's normal in many companies, especially those that publish their software live on the web, for the occasional deployment to expose unexpected errors, sometimes to the point of causing a service outage until the update can be rolled back.

This creates a pattern of anxiety around releases, along with the temptation to batch updates into bigger, less frequent releases. This happens even though more frequent releases can be very helpful in refining the deployment process and making errors less likely. In fact, the growing popularity of continuous delivery has led to more and more companies attacking the problem by making deployments so frequent that an unreliable update procedure is not an option.

Facebook is not only at the forefront of continuous delivery technology but also describes its progress in blog posts that others can learn from. A recent post describes its current system, which every few hours collects the tens or hundreds of commits made by its global development team and publishes them to the world.

Engineering the Problem Away

A situation where an occasional errant commit can take down the service once it goes live doesn’t scale to thousands of changes deployed each day. So Facebook is one of many companies that have engineered a way around the problem.

The basic mechanism at work is the developers' knowledge that their commits are going directly into production, which forces careful design to make updates work well with existing code, and also greatly incentivizes reliance on automated tests to catch problems before release.

But defects will still sometimes get through the developer's initial code commit, and Facebook has taken care to engineer a deployment process that carefully rolls out updates to a small pool of servers first, and provides for interruption of the full deployment before bad deployments reach global scale.
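As a rough illustration of the staged rollout idea, here is a minimal Python sketch. It is not Facebook's actual tooling: the function name `staged_rollout`, the `deploy`, `is_healthy`, and `rollback` callbacks, and the 5% canary fraction are all hypothetical choices made for this example.

```python
from typing import Callable, List


def staged_rollout(
    servers: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    canary_fraction: float = 0.05,  # assumed value, not from any real system
) -> bool:
    """Deploy to a small canary pool first; stop and roll back if it misbehaves.

    Returns True if the update reached every server, False if it was halted.
    """
    # Always use at least one canary, even for a tiny fleet.
    canary_count = max(1, int(len(servers) * canary_fraction))
    canaries, remaining = servers[:canary_count], servers[canary_count:]

    for server in canaries:
        deploy(server)

    # Interrupt the rollout before the bad update spreads any further.
    if not all(is_healthy(server) for server in canaries):
        for server in canaries:
            rollback(server)
        return False

    for server in remaining:
        deploy(server)
    return True
```

In a real pipeline the health check would watch error rates and latency over a period of time rather than return a single yes or no, but the shape is the same: a bad update only ever touches the small pool.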

This kind of engineering requires determination and resources. But it doesn’t require a Facebook-scale organization. In fact, doing it at an early stage will enable growth to a much larger scale, by stopping an annoyance before it becomes a major bottleneck. The key principle is to take something scary and stressful and to make it so routine that it's no longer intimidating.

A Bigger Challenge

For another example, think back to the last time a service you provide (or one that you use) went down due to a server failure. Even in an era of growing cloud infrastructure and distributed applications, there are still plenty of businesses that are at the mercy of occasional server outages and have no choice but to scramble when they occur.

But not at Netflix. Despite how completely their business depends on constant availability of online services, there's no one there who loses any sleep over the idea of a server failing. In fact, their servers fail all the time, thanks to processes which are designed to randomly disable servers and other crucial infrastructure on purpose. That's because there are fully automated processes in place to repair infrastructure like failed servers, and continual recovery from deliberate failures is the best way to be confident that the system is working.
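To make the idea concrete, here is a toy Python simulation of deliberate failure paired with automated repair. It is only a sketch and does not reflect Netflix's real tools: the `Cluster` and `Server` classes, the `inject_failure` and `run_chaos_rounds` functions, and the fixed random seed are all assumptions invented for this example.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    desired_size: int
    servers: List[Server] = field(default_factory=list)
    launched: int = 0  # total servers ever started, including replacements

    def repair(self) -> None:
        """Automated recovery: discard failed servers and launch replacements."""
        self.servers = [s for s in self.servers if s.healthy]
        while len(self.servers) < self.desired_size:
            self.launched += 1
            self.servers.append(Server(f"server-{self.launched}"))


def inject_failure(cluster: Cluster, rng: random.Random) -> Server:
    """Deliberately disable one randomly chosen server."""
    victim = rng.choice(cluster.servers)
    victim.healthy = False
    return victim


def run_chaos_rounds(cluster: Cluster, rounds: int, seed: int = 0) -> bool:
    """Repeatedly break a server, then check that repair restores full capacity."""
    rng = random.Random(seed)
    cluster.repair()  # bring the cluster up to its desired size first
    for _ in range(rounds):
        inject_failure(cluster, rng)
        cluster.repair()
        if len(cluster.servers) != cluster.desired_size:
            return False  # recovery did not work; better to learn this now
    return True
```

Calling `run_chaos_rounds(Cluster(desired_size=5), rounds=10)` breaks and replaces ten servers in turn. The point of the exercise is the check inside the loop: every deliberate failure is also a test that recovery still works.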

This so-called chaos engineering is another example of making a problem disappear by making it routine. Imagine how satisfying it is for engineers at Netflix who, instead of frantically reacting to server outages when they happen, can spend that time designing systems that handle the problem automatically.

Investing in Culture

This chance for engineers to be proactive is a big reason to invest in a state-of-the-art process for your software company. Aside from being happier and less stressed, engineers will be doing what you hired them for in the first place: creatively solving new problems, instead of repeating solutions to old ones.

This takes resources which might not always be in plentiful supply. But reinvesting in your process when resources are available is necessary to stay competitive. Software practices continue to advance, and companies are in greater competition all the time to provide an attractive environment for talented engineers.

If you're not sure how your company stacks up against these advanced development practices, start by browsing the engineering blogs of some of the major companies that share their adventures in process design, and get inspired. You might find that a well-placed investment in your process can make some of your most frustrating problems disappear for good.


More articles by Brian Auton
