Centralization of Operations for Small Teams
The IT industry has a long history of swinging between centralization and decentralization. Centralization tends to favor efficiency, while decentralization tends to favor speed and fitness for purpose. The shift to DevOps has led to a focus on decentralizing cloud operations. Read about Flickr, Etsy, Netflix, and other DevOps pioneers and you will see that operations tasks are folded into the development team, with no formal, separate operations team.
Many DevOps teams, however, find the flat, decentralized operational model difficult or impossible to implement. The team may be too small to afford 24x7 coverage, for example. The team may lack operational skills or experience, and may be unable to train or hire people with the right skills quickly enough. In these cases, centralization makes sense if it is done carefully and correctly.
Say, for example, you have 10 small DevOps teams (10-20 people each) in your organization, and each team owns a cloud solution that has 10 incidents per month. Each team must assign 3-4 people per week to cover incidents during normal work hours, off-hours, and weekends. The teams get stressed and burned out, and attrition becomes an issue. Rotating operational coverage can help, but then everyone on the team must learn the incident management tools and must be able to diagnose and resolve a problem with any aspect of the cloud solution.
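To make the scale of this example concrete, here is a minimal sketch of the arithmetic in Python. The function and variable names are illustrative only; the numbers come straight from the example above.

```python
# Figures from the example: 10 teams, 10 incidents per team per month,
# and 3-4 people per team assigned to incident coverage each week.
TEAMS = 10
INCIDENTS_PER_TEAM_PER_MONTH = 10
ON_CALL_PER_TEAM = (3, 4)  # (low, high) people per team per week


def decentralized_load(teams, incidents_per_team, on_call_range):
    """Return organization-wide totals when every team covers its own incidents."""
    low, high = on_call_range
    return {
        "incidents_per_month": teams * incidents_per_team,
        "on_call_people_low": teams * low,
        "on_call_people_high": teams * high,
    }


load = decentralized_load(TEAMS, INCIDENTS_PER_TEAM_PER_MONTH, ON_CALL_PER_TEAM)
print(load)
```

Across the organization, that is 30-40 people tied up in coverage each week to handle 100 incidents per month.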
In this example, you may find that your organization can set up a single team of incident managers that solves relatively simple problems for all of the DevOps teams. A single dedicated team can handle the combined average of 100 incidents per month (10 per solution) with fewer people than it takes to spread the task across every DevOps team. The DevOps teams create runbooks that describe how to detect and resolve the most common causes of operational incidents, and the incident managers follow the runbooks when an incident occurs. The DevOps teams are still on call for exceptional cases, but the incident managers handle simple outages and any outage for which the DevOps team has provided a runbook. If the runbooks are complete, thorough, and regularly updated, a DevOps team may only get called once or twice a month. If the DevOps team improves the architecture and deployment topology of the application to increase availability, the calls may become even less frequent. Stress goes down and productivity goes up. The DevOps team has an incentive to provide quality runbooks, and should have more time to make architectural improvements that increase the availability of the solution.
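The handoff described above can be sketched as a small routing function. This is only an illustration under assumed names: `Incident`, `handle_incident`, the runbook lookup key, and the `page_devops` callback are all hypothetical and do not come from any particular incident management tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Incident:
    solution: str  # which cloud solution is affected
    symptom: str   # what the monitoring detected


# A runbook is modeled as a function that returns True if it resolved the incident.
Runbook = Callable[[Incident], bool]


def handle_incident(
    incident: Incident,
    runbooks: Dict[Tuple[str, str], Runbook],
    page_devops: Callable[[Incident], None],
) -> str:
    """The incident manager tries the matching runbook, then escalates if needed."""
    runbook = runbooks.get((incident.solution, incident.symptom))
    if runbook is not None and runbook(incident):
        return "runbook"
    # No runbook exists, or it did not fix the problem: call the DevOps team.
    page_devops(incident)
    return "escalated"


# Hypothetical usage: one runbook that always succeeds, and a pager that records calls.
paged: List[Incident] = []
runbooks = {("billing", "disk_full"): lambda incident: True}

first = handle_incident(Incident("billing", "disk_full"), runbooks, paged.append)
second = handle_incident(Incident("billing", "unknown_error"), runbooks, paged.append)
print(first, second, len(paged))
```

The design choice worth noting is that escalation is the default path: anything the runbooks do not cover, or fail to fix, still reaches the DevOps team.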
The DevOps teams must avoid certain pitfalls when operational tasks are centralized. In the above example, it is easy for a DevOps team to become disengaged from how well its solution runs in the cloud, because someone else is now responsible for fixing it. Deal with this problem up front by making sure that the DevOps team still owns customer satisfaction and the Service Level Agreement for the solution. The DevOps team must measure and improve the effectiveness of the runbooks, and ensure that the stability and availability of the solution are continuously improved. The end goal is not simply to relieve stress on the DevOps team; it is to improve the availability of the offering to customers while improving the efficiency and effectiveness of the teams. The solution leadership should also rotate people between the DevOps teams and the centralized team every few months, to keep domain knowledge fresh and to reduce burnout.
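One simple way to measure runbook effectiveness is the share of incidents closed by a runbook without calling the DevOps team. The sketch below assumes each incident is logged with an outcome of either "runbook" or "escalated"; those labels, the function name, and the sample month are hypothetical.

```python
from typing import List, Optional


def runbook_resolution_rate(outcomes: List[str]) -> Optional[float]:
    """Fraction of incidents resolved by a runbook; None if there were no incidents."""
    if not outcomes:
        return None
    resolved = sum(1 for outcome in outcomes if outcome == "runbook")
    return resolved / len(outcomes)


# Hypothetical month for one solution: 10 incidents, 2 of which reached the DevOps team.
month = ["runbook"] * 8 + ["escalated"] * 2
rate = runbook_resolution_rate(month)
escalations = month.count("escalated")
print(f"resolution rate: {rate:.0%}, escalations: {escalations}")
```

A rate that falls from month to month suggests the runbooks are going stale, or that new failure modes have appeared that the DevOps team has not yet documented.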
David, this is great insight on the practical effects of executive action. Well-meaning, smart decisions can have these unintended consequences. I've heard many people complain about this cycle, including myself, but this is really the first time I've seen someone describe the impact so clearly. Thanks.