Site Reliability Engineers - Efficient crew behind Modern IT infrastructure
Not long ago, IT operations in enterprises was considered mundane and was not a preferred career option for many IT professionals. But Google publishing its internal processes in this area as site reliability engineering, about a decade back, has changed how IT operations is looked at (there are several other changes in the landscape as well, including cloud and DevOps).
Take an engineering approach to operations. Start by taking a holistic picture of the landscape, improve observability and capture every detail possible, leverage visualization and analytics tools to make sense of all this data, and build a thorough understanding of the business processes that are being executed in the systems. Apart from the IT metrics, measure the business metrics and ensure the IT systems are performing as needed to meet the business metrics.
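To make the point about business metrics concrete, here is a minimal sketch that checks IT metrics and business metrics side by side. The metric names, thresholds and the `Target` / `evaluate` helpers are hypothetical and only for illustration; a real setup would pull these values from your monitoring platform.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Target:
    name: str               # hypothetical metric name
    kind: str               # "it" or "business"
    threshold: float
    higher_is_better: bool


def evaluate(targets: List[Target], observed: Dict[str, float]) -> Dict[str, List[str]]:
    """Return the names of breached targets, grouped by kind."""
    breaches: Dict[str, List[str]] = {"it": [], "business": []}
    for t in targets:
        value = observed[t.name]
        ok = value >= t.threshold if t.higher_is_better else value <= t.threshold
        if not ok:
            breaches[t.kind].append(t.name)
    return breaches


# Assumed example values: every IT metric is within target,
# yet the business metric shows that customers are not completing orders.
targets = [
    Target("cpu_utilization_pct", "it", 80.0, higher_is_better=False),
    Target("p95_latency_ms", "it", 500.0, higher_is_better=False),
    Target("orders_per_minute", "business", 100.0, higher_is_better=True),
]
observed = {"cpu_utilization_pct": 45.0, "p95_latency_ms": 320.0, "orders_per_minute": 60.0}

print(evaluate(targets, observed))
```

In this made-up case the dashboards for CPU and latency would look healthy, and only the business metric reveals that something is wrong.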
Recently I was at an SRE meetup; here are a few learnings from there.
Maturity levels
Reactive
- Set the scope and role of the SRE right. SREs cannot suddenly become the resolvers for all issues in the IT landscape
- Need people with the attitude to find ways to solve problems, and who are quality obsessed
- SRE success is linked back to collaboration
- It is all about people at this point in time
Responsive
- Instrumentation across the IT landscape, measure everything
- Introduce platforms & tools for correlation and analysis
- Start automation at a small scale to show incremental business value, and then scale up.
Proactive
- Enhance processes and tools for working in federated structures.
- Improve measurement and establish ownership of every module/ component in the landscape.
- Move upstream and work closely with developers. Embed SREs who focus on deployment, observability, testing, etc. during the development part of the DevOps lifecycle.
- Automation across the value chain of detect and identify, and to a limited extent escalate and fix
Predictive
- Adopt as-a-service across the value chain, bringing in extreme clarity on scope, services, interface standards and service levels
- Build a strong, uniformly capable and participative team across organization levels.
- Leverage AI/ ML to make sense of data, derive conclusions and also predict.
- Handle scale through extreme automation that extends into escalate and fix as well.
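The detect, identify, escalate and fix chain mentioned in the maturity levels above can be sketched roughly as follows. This is only an illustration under assumptions: the alert fields, the `owners` and `runbooks` mappings and the `handle_alert` function are all hypothetical, not any particular tool's API.

```python
from typing import Callable, Dict, Tuple

Alert = Dict[str, str]
Runbook = Callable[[str], str]


def handle_alert(
    alert: Alert,
    owners: Dict[str, str],
    runbooks: Dict[str, Runbook],
    default_owner: str = "sre-on-call",
) -> Tuple[str, str, str]:
    """Identify the owner, then apply an automated fix or escalate."""
    component = alert["component"]
    # Identify: every component should have an owner; fall back if not.
    owner = owners.get(component, default_owner)
    # Fix: run an automated runbook when the symptom is a known one.
    fix = runbooks.get(alert["symptom"])
    if fix is not None:
        return ("fixed", owner, fix(component))
    # Escalate: no automation exists yet, so a human is paged.
    return ("escalated", owner, f"page {owner} about {alert['symptom']} on {component}")


# Hypothetical ownership map and runbook, for illustration only.
owners = {"checkout": "payments-team"}
runbooks = {"disk_full": lambda component: f"rotated logs on {component}"}

print(handle_alert({"component": "checkout", "symptom": "disk_full"}, owners, runbooks))
print(handle_alert({"component": "search", "symptom": "high_latency"}, owners, runbooks))
```

The second alert falls through to escalation and to the default owner, which is exactly the kind of gap that establishing ownership and extending automation is meant to close.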
Mean-time-to-*
- Mean-time-to-detect: The right level of observability across the system landscape can significantly reduce the mean time to detect, but that is just the first step. Correlation across systems can help identify the root causes faster
- Mean-time-to-resolve: Resolving an issue requires people, process and technology to come together. In any enterprise, owing to its complexity, several parties would be involved in a critical situation, and establishing the right ownership drives faster decision making and execution. Automation is very effective in fix roll-out as well.
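As a small worked example, both measures can be computed from incident timestamps. The sketch below assumes both are measured from the moment the incident started (some teams measure time to resolve from detection instead); the `Incident` record and the sample data are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass
class Incident:
    started: datetime
    detected: datetime
    resolved: datetime


def mean_time_to_detect(incidents: List[Incident]) -> timedelta:
    """Average gap between an incident starting and being detected."""
    seconds = mean((i.detected - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mean_time_to_resolve(incidents: List[Incident]) -> timedelta:
    """Average gap between an incident starting and being resolved."""
    seconds = mean((i.resolved - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


# Made-up incidents: detected after 10 and 20 minutes,
# resolved after 60 and 50 minutes respectively.
incidents = [
    Incident(datetime(2024, 1, 7, 10, 0), datetime(2024, 1, 7, 10, 10), datetime(2024, 1, 7, 11, 0)),
    Incident(datetime(2024, 1, 8, 14, 0), datetime(2024, 1, 8, 14, 20), datetime(2024, 1, 8, 14, 50)),
]

print(mean_time_to_detect(incidents))   # 0:15:00
print(mean_time_to_resolve(incidents))  # 0:55:00
```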
Enhanced Engineering processes to make the organization participative
- New feature releases are a significant contributor to reliability issues - With Canary and Blue/ Green deployment styles being more prevalent, also consider request shadowing (Twitter being one of the early adopters, with Diffy) for critical releases of complex systems
- Security controls implementation - There are just too many bad people prowling, and the classical implementation of security controls at the L3/L4 level is just not sufficient. Solutions are needed at the L7 level (decisions need that level of rich data) to ensure that legitimate users are not affected, while evil is blocked.
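The request shadowing idea mentioned above can be illustrated with a minimal sketch: the same request goes to the stable version and to the new version, the user only ever receives the stable response, and any differences are recorded for review. This is a simplified illustration and not how Diffy itself is implemented; the handlers and the `shadow` function are hypothetical.

```python
from typing import Any, Callable, Dict, List

Request = Dict[str, Any]
Response = Dict[str, Any]
Handler = Callable[[Request], Response]


def shadow(request: Request, primary: Handler, candidate: Handler,
           diffs: List[Dict[str, Any]]) -> Response:
    """Serve from primary; replay on candidate and record any differences."""
    response = primary(request)
    try:
        # The candidate gets a copy so it cannot alter the original request.
        shadow_response = candidate(dict(request))
    except Exception as exc:
        # A failure in the candidate must never affect the user.
        diffs.append({"request": request, "error": repr(exc)})
        return response
    changed = {
        key: (response.get(key), shadow_response.get(key))
        for key in sorted(set(response) | set(shadow_response))
        if response.get(key) != shadow_response.get(key)
    }
    if changed:
        diffs.append({"request": request, "changed": changed})
    return response


# Hypothetical handlers: the new version has a deliberate pricing bug.
def stable_version(req: Request) -> Response:
    return {"status": 200, "total": req["price"] * req["qty"]}


def new_version(req: Request) -> Response:
    return {"status": 200, "total": req["price"] * req["qty"] + 1}


diffs: List[Dict[str, Any]] = []
result = shadow({"price": 5, "qty": 2}, stable_version, new_version, diffs)
print(result)
print(diffs)
```

In a real system the shadow call would run asynchronously against a separate deployment, and side effects such as writes or payments would have to be stubbed out on the candidate side.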
Creating an SRE team
- Most common question - Do we take the operations team and train them on programming (automation skills are a basic requirement), or take software engineers and move them into operations? The answer is clear from Google's initial SRE days: it is engineers who are put into operations. They work very smart & hard to ensure that they are not paged at 02:00 AM on a Sunday.
- Empowering the application developers - Be friendly with the developers (it is their work that keeps you awake) and provide them with the right tooling and dashboards so they understand how their applications are working in production (with DevOps, the developers are getting closer to operations) and can build better resilience into their applications. Work as an embedded SRE team in the DevOps team to coach and guide them.
- Career Path for SRE - For an experienced SRE team member, one of my suggestions for progression would be to focus on creating tools & platforms in this space. Of course, you were using these to do your job; now it is your chance to create something even better. (Create an open source centric organization; there is a whole world of VCs willing to invest and make you a millionaire)
Note: I have tried to reduce usage of the term "operations", to drive emphasis on the "engineering". The ultimate goal is to establish a working style that allows the creators of the product to manage the end-to-end lifecycle in delivering services to its consumers.
Excellent read on SRE. Shristi & Stithi done by Dev should better the world for all. Hoping to see this getting built and called out in the support models. It might be practiced to a good extent now, but without being explicitly recognized and planned it may lack the needed level of attention...