A Checklist to Build DevOps Organization
A more apt title of this article could have been "A Checklist to Build Production Engineering Organization". But I wanted to stress the importance of automation in every aspect of production engineering operations, and, these days, DevOps is a good buzzword to invoke that theme. Without the core value of automation built into the production engineering processes, a newly-built team will find it hard to support the business when the latter is ready to scale up - a time when development groups want to focus on new product features and will be eager to move off their plates the responsibilities that are not essentially part of building and fine-tuning product features.
An automated operations environment helps the business make quick changes with minimum defects and downtime. An earlier attempt to summarize my thoughts on the subject can be found here: DevOps Defined. Typically, DevOps teams will be responsible for the following kinds of tasks in an organization:
- Infrastructure as Code: In virtualized, cloud-based environments, computing resources can be provisioned as needed. When computing resources are elastic and allocated on demand, such environments cannot realistically be built manually. Your team should know how to automate provisioning and integrate those steps with provisioning tools and configuration management systems.
- Platform as Code: Identify software roles in an application stack and automate the steps to build them. Using those roles as building blocks, large-scale application environments can be stood up with the help of configuration management tools. That is the only way you can scale up operations if the consumer app you support, the internal storage service you manage, or the company's newly released SaaS offering becomes an instant hit. If you ask your prospective customers to wait, you will lose them to the competition or lose your credibility as an internal infrastructure service provider, depending on what you have been supporting.
- Deployment Automation: A well-tested feature should be deployed in production with minimum delay. Continuous Integration (CI) is an effective method to both test and deploy code, but in real life, most companies sit somewhere between a manual code push and a fully automated code deployment process.
- Monitoring and Operational Intelligence: Don't settle for using only the out-of-the-box features of your favorite monitoring tool. To monitor application stacks effectively, custom plugins have to be developed, and the team should have the necessary skills to do that. The advent of log aggregation tools such as Logstash and Splunk made it possible to dig errors and insights out of server logs. Again, to make these tools more useful, code has to be written to instrument, mine and present operational data.
- Automation of Routine Tasks: The production engineering team in any company will have a long list of items to carry out periodically that tend to defy classification. Some of the things from this category that I have done or been responsible for in the past are the following:
- Weekly reports containing aggregates from business systems.
- Operational performance and computational usage data.
- Data extracts generated for both internal and external customers.
- Updating of various metadata used by applications.
- Reprocessing of data to correct issues with aggregation done earlier.
- Security audits, both internal and regulatory, such as those needed for SOX compliance.
These chores are normally handed over to the production engineering team along with lengthy procedures. They should be automated as much as possible, to avoid burdening team members with rut work and to avoid the mistakes a bored worker can make while carrying out a routine task.
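As an illustration of how small such automation can start, here is a minimal sketch of a weekly report job. It assumes the business data has already been extracted as (date, value) pairs; the function names and the record shape are hypothetical and not taken from any particular system.

```python
from collections import defaultdict
from datetime import date


def weekly_totals(records):
    """Sum values per ISO (year, week).

    `records` is an iterable of (date, value) pairs. That shape is an
    assumption made for this sketch.
    """
    totals = defaultdict(float)
    for day, value in records:
        iso_year, iso_week, _ = day.isocalendar()
        totals[(iso_year, iso_week)] += value
    return dict(totals)


def format_report(totals):
    """Render the weekly totals as a plain-text table, oldest week first."""
    lines = ["week      total"]
    for (year, week), total in sorted(totals.items()):
        lines.append(f"{year}-W{week:02d}  {total:.2f}")
    return "\n".join(lines)
```

A scheduler such as cron could run a script like this every week and mail the output, so nobody has to assemble the report by hand.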
Staffing
Tools and processes alone will not solve any problem. You will always need talented people on a team to get things done in line with the larger goals of the company, with the use of minimum resources. Complex tools in the hands of incompetent people will only result in creating more chaos. With that general warning, let’s see what we need in this area :-)
I have already stressed the importance of automation a few times, and that normally means writing code. Tool vendors will always argue that you can do everything from the dashboards of the products they peddle. But a production engineering team that has good coding skills can extend third-party tools, build custom tools if the situation demands it, and collaborate well with development teams to add features that improve the operability of the applications.
A traditional operations team of the data center era consisted of system and network admins, database and third-party application admins, and application support engineers. In the last such company where I worked, we had system admins, Oracle and MySQL admins, Tableau and MicroStrategy admins, and application support engineers. Very few new companies can afford such a division of labor, but they might still need people to cover similar job responsibilities. So, the important thing is to find people who are not married to certain technologies and products, but who are open to learning new technologies and comfortable using coding as one of the items in their toolbox to solve problems.
It is important to have both system administration and coding skills available in a production engineering team. If the company can afford only a few people in the production engineering team, then its members need to be versatile. A large team can still afford to have specialists. The exact composition of a team will ultimately depend on the specific requirements of the business, but a team without substantial coding skills as a group will not get much automation done. Failure to automate operational tasks will become a bottleneck as the company grows and scaling up becomes essential, and throwing bodies at an increasing workload is a bad strategy.
Operations Infrastructure
An operations infrastructure must be in place to roll out the related processes in a new organization. Some components will already be there even before a production engineering group is formally set up because those things, such as a ticketing system, are essential to running a high-tech company. Parts of that infrastructure will also be shared with other groups, mainly development, as part of collaboration. Though it is hard to generalize, a production engineering organization would require some form of the tools and applications from the following list.
The infrastructure can be divided into two broad categories. The first is the set of processes that need to be rolled out and owned by the production engineering group. Examples are the release process, incident management, and on-call. The second is the set of tools needed for rolling out automation projects and the production engineering processes.
Tools for Production Engineering
My objective here is only to list the kinds of tools a production engineering team needs to be effective. In each category you can easily find multiple competing products. If I mention any product specifically, that only indicates my familiarity with that product. The important thing is to have some tool, including a home-grown one, available when the need arises. It is also important to avoid using multiple tools in the same category unless there is a compelling reason to do so.
Documentation Platform
A documentation platform that can be used by both development and operations teams is an essential component for collaboration. Wiki-based solutions like MediaWiki are the most popular, but even Google Docs will do the job.
Any documentation solution can degenerate into a dumping ground for assorted documents very quickly. To avoid a free-for-all, it is important to set a structure for organizing the documents right from the beginning. One effective method is to organize documents around applications and to use templates for creating standard documents.
Configuration Management
Configuration management is a very generic term. In DevOps circles it usually refers to a tool like Puppet or Chef that manages the system-level configurations and baseline software installations on a computing node. There are at least three different configuration management needs, and if we include the configuration requirements for automated deployments, the list can grow to four. However, the subject of deployment automation is better discussed in the larger context of Continuous Integration (CI).
Automation System for Access Control and Baseline Software Installation
When a user joins or leaves the company, the system-level and application-level access defined for that user's role should be propagated to the various systems automatically. Tools like Puppet and Chef may be the most popular, but there are plenty of alternatives available.
Just as a user is provisioned on joining or deleted on departure, the baseline software needed for a newly provisioned computing node can be installed and configured using these tools. The system-level configuration and software deployment done on a computing node is based on its "role" in a larger software system.
Configuration Management Database (CMDB)
Managing configurations of application stacks and environments is the next requirement. Implementation of a full-fledged CMDB system may not be warranted, but at least a custom solution will be needed eventually because, without a single source of truth for such configurations, rolling out serious automation projects will be hard.
For example, keeping track of what system settings and software bits go into a software role is immensely useful if we want to stand up application stacks in a totally automated fashion. I still remember the joy of building large object storage farms from a few command lines, which used to be an excruciating, week-long effort of copy-pasting scores of commands and running manual steps.
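To make the idea of a "software role" concrete, here is a minimal sketch of a role catalogue kept as data, with a function that resolves what a node of a given role should receive. The role names, package names and settings are all hypothetical; a real setup would keep this data in a CMDB or in the SCM system and hand the result to a configuration management tool.

```python
# Hypothetical role catalogue. In practice this would live in a CMDB or SCM.
ROLES = {
    "base": {
        "packages": ["ntp", "openssh-server"],
        "settings": {"timezone": "UTC"},
    },
    "web": {
        "extends": "base",
        "packages": ["nginx"],
        "settings": {"http_port": 80},
    },
    "storage": {
        "extends": "base",
        "packages": ["xfsprogs"],
        "settings": {"mount": "/data"},
    },
}


def resolve_role(name, roles=ROLES):
    """Return the merged packages and settings for a role, parent first.

    This sketch does not detect cycles in "extends" chains.
    """
    role = roles[name]
    parent = role.get("extends")
    merged = resolve_role(parent, roles) if parent else {"packages": [], "settings": {}}
    packages = merged["packages"] + [
        p for p in role.get("packages", []) if p not in merged["packages"]
    ]
    # Settings of the child role override those of the parent.
    settings = {**merged["settings"], **role.get("settings", {})}
    return {"packages": packages, "settings": settings}


def build_plan(stack, roles=ROLES):
    """Map each host in a stack description to its resolved role definition."""
    return {host: resolve_role(role, roles) for host, role in stack.items()}
```

With the roles defined once, a whole stack can be described as a short mapping such as `{"web01": "web", "store01": "storage"}` and expanded mechanically.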
Software Configuration Management (SCM)
Traditionally, the use of an SCM tool like Subversion or Git has been limited within production engineering teams. Scripts used for ad hoc automation efforts live in somebody's home directory, and when Brian leaves the company, all hell breaks loose in the application area that he had been supporting smoothly up until then.
An SCM system is not only for application code development; any piece of code or configuration data that is needed to replicate the application environment has to be managed using the SCM system. Code is not only for defining product features; in a highly automated environment, code is also needed for maintaining it.
Make sure that members of the production engineering team are skilled in using the company's SCM system. If there are multiple SCM tools in use, take the lead in standardizing on one. The existence of multiple tools is a clear indication that product development teams work in silos, and that normally creates nightmarish scenarios for the production engineering team: when issues happen, you will come across development teams that are more inclined to cover their bases than to resolve issues.
Ops code should also go through peer review and be part of the release process, so that there is visibility into what is deployed in production. Including ops code in the release process has become more important lately, now that infrastructure and platform as code can actually be implemented, and writing such code is not very different from writing code for product features.
Continuous Integration (CI)
CI automates the code deployment process, beginning with the step where developers check code into the SCM system. On the CI platform, the code changes are built, packaged for deployment, and deployed in a staging environment where the changes are tested.
The developers get immediate feedback on the quality of their code, and that helps to get bugs fixed immediately. The integration of the code is incremental and continuous, and incompatibilities are ironed out early on. There will be no need for a separate round of integration tests.
Jenkins is a popular platform for rolling out a CI process. The CI processes are integrated with the CM, CMDB and SCM systems.
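The build, package, deploy and test sequence described above can be sketched as a tiny stage runner. This is not how Jenkins is implemented or configured; it is only an illustration of the fail-fast behavior that gives developers quick feedback, and the stage names and callables are hypothetical.

```python
def run_pipeline(stages):
    """Run (name, callable) stages in order and stop at the first failure.

    Each callable returns a truthy value on success. A stage that raises
    an exception is treated as failed. Returns a list of (name, status)
    tuples where status is "passed", "failed" or "skipped".
    """
    results = []
    failed = False
    for name, step in stages:
        if failed:
            # An earlier stage failed, so later stages never run.
            results.append((name, "skipped"))
            continue
        try:
            ok = bool(step())
        except Exception:
            ok = False
        results.append((name, "passed" if ok else "failed"))
        failed = not ok
    return results
```

A real pipeline would replace the callables with calls to the build tool, the packaging step, and the deployment scripts kept in the SCM system.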
Bug Tracking
Like an SCM system, a bug tracking system is primarily rolled out for the use of development organizations. It is important that the production engineering team gets visibility into the projects and issues tracked in that system. The team should also have privileges to create its own projects and queues to manage code related to the DevOps areas that we discussed in the beginning.
Bug tracking applications are typically part of generic ticketing systems, and there has been no dearth of either open-source or licensed software in this area. Bugzilla and Jira are some of the well-known products that I have used.
Monitoring
A mature monitoring infrastructure will have checks implemented at different levels.
- Infrastructure, network, system and application monitoring using industry standard tools such as Nagios, Zenoss etc.
- Last mile monitoring: If you have to monitor a consumer web app or SaaS application, that test has to be done from the Internet, outside of your corporate network. There are many service providers in that space, like Apica and Catchpoint.
- Log aggregation: At the very basic level, log aggregation tools gather the system and application logs in one place and index them for search. Looking through the logs for error patterns and setting up alerts on their occurrence can help with catching issues that dedicated monitoring might miss. There are both open-source and licensed products in the market; Loggly, Logstash and Splunk are some of the popular ones.
- Third-party tool dashboards and APIs: Many third-party tools that are used to build the applications come with their own admin tools, and those usually include some monitoring features. While the dashboards can be used out of the box, monitoring-related APIs that provide status on the underlying components can be used to build checks on the main monitoring platform, such as Nagios.
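As a rough sketch of such a custom check, the function below evaluates a status payload and returns an exit code following the Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) along with a one-line message. The payload is assumed to have been fetched from a tool's status API already; the `queue_depth` field and the thresholds are hypothetical.

```python
# Exit codes follow the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_queue_depth(status, warn=100, crit=500):
    """Evaluate a status payload and return (exit_code, message).

    `status` is assumed to be a dict already fetched from a third-party
    tool's status API. The "queue_depth" field is a hypothetical example.
    """
    depth = status.get("queue_depth")
    if not isinstance(depth, (int, float)):
        return UNKNOWN, "UNKNOWN - queue_depth missing from status payload"
    if depth >= crit:
        return CRITICAL, f"CRITICAL - queue depth {depth} >= {crit}"
    if depth >= warn:
        return WARNING, f"WARNING - queue depth {depth} >= {warn}"
    return OK, f"OK - queue depth {depth}"
```

A wrapper script would print the message and exit with the returned code, which is all a Nagios-style scheduler needs to raise or clear an alert.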
Operations Intelligence & Management Reporting
The leadership team will be interested in various summary data such as utilization of computing resources, uptime of applications, and performance indexes such as the percentage of SLAs (Service Level Agreements) met. Core monitoring systems will provide basic information for such reporting, but further aggregation and presentation will be required. Custom batch jobs that collect and aggregate operational data will have to be designed and implemented. The presentation layer can be custom dashboards built with popular frameworks in PHP or Node.js, or standard reporting tools such as Actuate, Tableau or MicroStrategy. Once the operational data of interest has been collected, insights can be drawn from it using any BI tool, and such tools might already be in use by business groups.
Popular log aggregation tools such as Logstash and Splunk provide another set of operational intelligence data by indexing the logs. In addition to mining the standard log files, operational data can be generated on computational nodes and these tools can be used to aggregate and index custom operational metrics for analysis.
There are products available in the market to help with this, but home-grown solutions, supported by reporting applications, tend to be the norm in this category.
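One example of such a home-grown batch job is computing application uptime for a reporting period from a list of recorded outages. The sketch below assumes outages are available as (start, end) timestamps, which is an assumption about how incidents are logged; overlapping outages are merged so that downtime is not counted twice.

```python
from datetime import datetime


def availability_percent(period_start, period_end, outages):
    """Return the percentage of the period during which the service was up.

    `outages` is a list of (start, end) datetimes. Outages are clipped to
    the reporting period, and overlapping outages are merged.
    """
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("period_end must be after period_start")
    clipped = sorted(
        (max(start, period_start), min(end, period_end))
        for start, end in outages
        if end > period_start and start < period_end
    )
    down = 0.0
    current_start = current_end = None
    for start, end in clipped:
        if current_end is None or start > current_end:
            # Close the previous merged outage and start a new one.
            if current_end is not None:
                down += (current_end - current_start).total_seconds()
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        down += (current_end - current_start).total_seconds()
    return 100.0 * (total - down) / total
```

The result can be compared against the uptime target in the SLA and fed into whatever dashboard or reporting tool the leadership team already uses.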
Production Engineering Processes
The tools discussed above help to roll out the standard production engineering processes that are essential to a mature organization. However, when such processes are implemented in a new organization, care must be taken to ensure that each new process adds value and does not slow things down.
Release Process & Change Management
The release process normally refers to code deployment in production, and change management refers to any change that would have an impact on systems in production. By definition, the change management process covers application releases. It also keeps track of changes in infrastructure, OS and third-party software upgrades, database changes, and even one-off jobs that may have an impact on the computing resources.
The main objectives of a change management process should be tracking the changes made in production, and documenting and socializing those changes for better visibility within the company.
It is important that the proposed changes are reviewed and approved by a dedicated team, and that stakeholders and business owners are notified of the changes before and after they are implemented.
Product Documentation & Runbooks
This is something built on top of the documentation platform. In a new company, product documentation is often non-existent, and documentation efforts will be ongoing as the applications are enhanced in every release. It is important to create operational runbooks for the applications. Set up a process to maintain them and tie that to release management. One standard question to ask in a release review meeting would be about the changes needed in the operational runbook.
Document the application errors that will be distributed as alerts by monitoring systems and log aggregation tools. Even though a self-healing production environment is the ideal, some manual intervention will always be needed.
Document routine maintenance tasks. Generating reports for both internal and external customers, meta-data updates, and taking backups and purging -- there could be several application specific chores you may need to do routinely. Though these tasks are typically automated, some manual steps will be needed to deliver the services to the end-customers.
Make sure that runbooks are not excuses for not automating repetitive tasks. There is a tendency on the line management side to throw manpower at maintenance tasks and handle them manually. As indicated earlier, such an expensive strategy will never scale in the long run, and it could drive away staff who do not want to perform rut work. If you have team members who are happy to do routine tasks and resist automating them, you will soon notice that your DevOps efforts get stuck on their inability or lack of motivation to automate.
24x7 On-Call
The applications are expected to be available always. Even if an application has only internal users, its user base may span multiple geographies. The downtime of consumer web or SaaS applications should be minimal, if the business can afford any at all. Outages and other incidents can happen in production in the most unexpected ways, and the response to such incidents should be quick.
To have a smooth on-call process, the following things have to be in place:
- Contact information of members of both development and operations teams.
- A vacation calendar with up-to-date info on who is available in a specific time window.
- An on-call calendar that clearly indicates who is responsible for responding to critical alerts and incidents at a given point in time.
- Escalation procedures specific to applications. Normally, the on-call person has to contact a point-of-contact (PoC) in the development group as the first escalation step.
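The on-call and vacation calendars in the list above can be combined into a simple lookup that answers "who should be paged right now?". The sketch below is only an illustration; the data shapes and the names used in the usage note are hypothetical, and a real team would more likely pull this data from its calendar or paging tool.

```python
from datetime import datetime


def who_is_on_call(shifts, vacations, at):
    """Return the escalation chain for time `at`, skipping people on vacation.

    shifts:    list of (start, end, [primary, secondary, ...]), end exclusive
    vacations: dict mapping a person to a list of (start, end) periods
    """
    def away(person):
        return any(start <= at < end for start, end in vacations.get(person, []))

    for start, end, chain in shifts:
        if start <= at < end:
            return [person for person in chain if not away(person)]
    # No shift covers this time, which is itself worth alerting on.
    return []
```

For example, with a shift `(datetime(2024, 1, 1), datetime(2024, 1, 8), ["asha", "ben"])` and "asha" on vacation on January 4, the lookup for that day returns only "ben". An empty result means the calendar has a gap that should be fixed before an incident finds it.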
Business Continuity Planning (BCP)
A BCP plan essentially addresses the non-availability of the primary production environment. The non-availability could be the result of a natural disaster, hence the popular term Disaster Recovery (DR) planning. BCP and DR are sometimes used interchangeably, but DR planning is part of the larger BCP strategy.
As part of BCP, the following items are addressed:
- Documenting scenarios in which the primary production environment is not available, along with the related mitigation plans.
- Backup and replication strategies to support overall BCP strategy.
- Building a production-quality standby environment or running application environments in multiple geographical regions. The latter configuration makes the application Highly Available (HA).
A detailed discussion of DR planning and BCP is here: Saving Your Business from Disasters
Agile Methodology
I have seen production operations teams being dragged into a company's or an engineering department's drive to roll out an agile process. Though agile has been found to be a useful methodology for product development groups, it can be clumsy and forced in an operations environment, mainly because operations teams don't have full control over their own time. Issues happen in production and priorities change, but keeping the systems up and running remains the primary responsibility. So, getting projects done in a fixed time frame may not always be possible.
However, projects, both small and big, have to be tracked formally, and they have to be completed. If an agile methodology has to be adopted, the production operations team has to be realistic and assertive about its involvement:
- Be part of development Scrum teams. Engineering projects are not just product development. The infrastructure to run the application and its monitoring requirements have to be planned right from the beginning. Embedding an operations engineer in the application development agile teams is a great idea, as opposed to tossing out tasks to the operations team without context.
- Roll out a Kanban-like process within the production engineering team to manage projects. Irrespective of the methodology adopted, the backlog of projects and tasks has to be managed and prioritized.
Incident Management
Issues happen all the time in production environments. But if an incident has a considerable negative impact on the end-user experience or causes loss of revenue, then a quick fix will be needed. That is normally called a hotfix, and the process followed is different from the standard code deployment procedure, with a focus on resolving the issue as early as possible.
The incident management process should also ensure that both users and stakeholders are informed of an ongoing issue. If an end-user ends up escalating a system-wide issue (don't confuse this with the reporting of product bugs), then the company has a serious problem running its business. The production operations group can avoid such embarrassment by alerting on an issue before users notice it and, later, by taking the lead in analyzing the root cause of the production-down issue.
In a new organization, the following processes and info have to be in place to deal with incidents in production that would have some business impact.
- Prepare a comprehensive list of contacts and set up a process to maintain it. The contact info should include operations and development PoCs for products. There could be multiple contacts from operations, with support needed from core infrastructure, network operations, databases and application support. The list should also identify the product owners, normally product managers, who would manage the communication with the end-users if an issue happens.
- Set up the group communication infrastructure. When an incident happens, multiple people could end up triaging the issue. Chat, voice and desktop sharing are the most common modes of communication used during a crisis. The employees should have access to communication tools such as telephone conferencing, IRC, Webex etc.
- Implement a root cause analysis process to review major outages in production. The focus of such reviews must be on resolving the underlying issues so that the same incident does not recur.
Tech Refresh
Software applications run on hardware infrastructure and software platforms that need upgrades. Old hardware has to be replaced or upgraded, the OS has to be upgraded to the latest stable version, and third-party software components will also require upgrades, as older versions could go out of official support if you hang on to them for long.
In environments that are built using open-source products, automatic upgrades are very common. Though such upgrades mostly have no visible impact, changing any component in production without adequate testing is not advisable. The company should have a plan to roll out upgrades in production environments.
In a data center or private cloud environment, the production operations team has to plan for retiring and replacing old hardware. Such efforts are called rewiring, and considerable resources are needed to set up a new computing environment where an application stack will be redeployed so the existing environment can be retired.
Security
Vulnerabilities in the security strategy will put both the business and its customers at risk. New companies rarely recover from a serious security breach, as they lose customer trust and reputation.
The subject of securing cloud-based applications and the platform they run on can be discussed at length. However, the basic precautions listed below have to be taken, whichever way they are rolled out in a specific environment. These efforts are a step in the right direction toward meeting the requirements for ISO/IEC 27001 certification or SOX compliance, which become necessary as the company grows.
- Software that has only been unit-tested is often not password protected, and its communications are often not encrypted. It is important that applications in production run only with such basic protections enabled. That means implementing SSL/TLS and custom or industry-standard authentication protocols such as OAuth.
- Don't allow code with user credentials to be checked into the SCM system. Such info must be externalized from the code and moved to config files that can be set up as part of the deployment process.
- Roll out a process to manage passwords. Such efforts will be useful later for passing security audits such as those for ISO/IEC 27001 certification or SOX compliance.
- Run industry-standard tests such as penetration tests periodically and harden the environments quickly based on the results.
- Automate the process of granting and revoking user access, both OS and application. Generate user creation/addition logs.
- Have a process in place for the production operations team to be informed of the latest security patches etc. by the cloud provider or third-party tool vendors.
- Include security review as part of planning major releases.
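The point above about keeping credentials out of checked-in code can be sketched as follows. Note that this example reads the values from environment variables populated at deployment time, which is a variation on the config-file approach mentioned in the list; the variable names `APP_DB_USER` and `APP_DB_PASSWORD` are hypothetical.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required credential was not provided at deploy time."""


def load_db_credentials(env=None):
    """Read database credentials from the environment instead of the code.

    `env` defaults to os.environ. Passing a plain dict makes the function
    easy to test. The variable names are assumptions for this sketch.
    """
    env = os.environ if env is None else env
    required = ("APP_DB_USER", "APP_DB_PASSWORD")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise MissingSecretError("missing settings: " + ", ".join(missing))
    return {"user": env["APP_DB_USER"], "password": env["APP_DB_PASSWORD"]}


def redact(credentials):
    """Return a copy that is safe to write to logs."""
    return {**credentials, "password": "***"}
```

Failing fast on a missing value surfaces a broken deployment immediately, and redacting before logging keeps the password out of the log aggregation tools discussed earlier.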
Conclusion
It is very tempting to roll out popular tools and implement fancy-sounding processes company-wide as part of setting up a production engineering infrastructure. But tools and processes are only good in the hands of those who know how to use them effectively. So, it is very important to build a versatile and competent team first and then empower it to choose or build the right tools of its trade. A new tool or process should be implemented to solve an existing problem or to improve productivity; if no such need exists, it is better to wait, as real-life requirements can help to define the processes better and to choose the right supporting tools.
useful Article and devops roadmap
Excellent write up! It is helpful not just for the DevOps managers but also for someone overlooking entire technology requirements and process of the organization and help the corporate leadership understand the importance of setting up the DevOps the right way. Thanks very much for such a detailed and simplified explanation.
Excellent write up. I've had some difficulty convincing people to go this route. They like to think they are devops, but didn't value IaaS, agile or even the implementation of a CMDB. Any advice to convince them? (I got reprimanded for saying "if you don't do it, your competitors will")
Good primer. You covered all major components.