Site Reliability Engineers - Efficient crew behind Modern IT infrastructure

Madhan Raj J

Published Mar 31, 2019

Not long ago, IT operations in enterprises was considered mundane, and many IT professionals did not see it as a preferred career option. That changed how IT operations is looked at when Google published its internal processes in this area, under the name site reliability engineering, about a decade back. (There are several other changes in the landscape as well, including cloud and DevOps.)

Take an engineering approach to operations. Start with a holistic picture of the landscape, improve observability and capture every detail possible, leverage visualization and analytics tools to make sense of all this data, and build a thorough understanding of the business processes being executed in the systems. Apart from the IT metrics, measure the business metrics and ensure the IT systems are performing as needed to meet them.

I was recently at an SRE meetup; here are a few learnings from it.

Maturity levels

Reactive

Set the scope and role of the SRE right. SREs cannot suddenly become the resolvers for every issue in the IT landscape.
You need people with the right attitude, who find ways to solve problems and are obsessed with quality.
SRE success is tied to collaboration.
At this stage, it is all about people.

Responsive

Instrument the entire IT landscape and measure everything.
Introduce platforms and tools for correlation and analysis.
Start automation at a small scale to show incremental business value, and then scale up.
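As a rough illustration of "measure everything", here is a minimal Python sketch that wraps a function so that every call records its duration and failures. The `METRICS` and `ERRORS` stores, the `instrumented` decorator and the `checkout` function are all hypothetical names made up for this example; a real setup would send these measurements to a metrics platform rather than keep them in memory.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory stores, keyed by operation name.
# A real setup would export these to a metrics backend instead.
METRICS = defaultdict(list)   # name -> list of call durations in seconds
ERRORS = defaultdict(int)     # name -> number of failed calls


def instrumented(name):
    """Record the duration and failure count of every call under `name`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                ERRORS[name] += 1
                raise
            finally:
                # Runs for both successful and failed calls.
                METRICS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@instrumented("checkout")
def checkout(order_total):
    # Stand-in for real business logic.
    if order_total <= 0:
        raise ValueError("order total must be positive")
    return f"charged {order_total}"


def summary(name):
    """Return call count, error count and mean latency for one operation."""
    durations = METRICS[name]
    mean = sum(durations) / len(durations) if durations else 0.0
    return {"calls": len(durations), "errors": ERRORS[name], "mean_seconds": mean}
```

Starting with one or two critical operations like this, and only then rolling the same pattern out more widely, matches the idea of showing incremental value before scaling.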

Proactive

Enhance processes and tools for working in federated structures.
Improve measurement and establish ownership of every module/component in the landscape.
Move upstream and work closely with developers. Embedded SREs focus on deployment, observability, testing, etc. during the development part of the DevOps lifecycle.
Automate across the value chain of detect and identify and, to a limited extent, escalate and fix.

Predictive

Adopt as-a-service across the value chain, bringing extreme clarity on scope, services, interface standards and service levels.
Build a strong, uniformly capable and participative team across all levels of the organization.
Leverage AI/ML to make sense of data, derive conclusions and make predictions.
Handle scale through extreme automation, extending into escalate and fix as well.

Mean-time-to-*

Mean-time-to-detect: The right level of observability across the system landscape can significantly reduce the mean time to detect, but that is just the first step. Correlation across the landscape can help identify root causes faster.
Mean-time-to-resolve: Resolving an issue requires people, process and technology to come together. In any enterprise, owing to its complexity, several parties are involved in a critical situation, and establishing the right ownership speeds up decision making and execution. Automation is also very effective in rolling out fixes.
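To make the two measures concrete, here is a small Python sketch that computes them from incident timestamps. The `Incident` record and the sample data are made up for illustration, and it assumes time-to-resolve is counted from the start of the incident (some teams count it from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the fault began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def mean_delta(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_detect(incidents):
    return mean_delta([i.detected - i.started for i in incidents])


def mean_time_to_resolve(incidents):
    # Assumption: measured from the start of the incident, not from detection.
    return mean_delta([i.resolved - i.started for i in incidents])


# Made-up sample incidents, purely for illustration.
t0 = datetime(2019, 3, 1, 2, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=70)),
    Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=50)),
]

print(mean_time_to_detect(incidents))   # 0:15:00
print(mean_time_to_resolve(incidents))  # 1:00:00
```

Tracking both numbers separately shows whether the time is being lost in noticing the problem or in getting the right owners to act on it.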

Enhanced Engineering processes to make the organization participative

New feature releases are a significant contributor to reliability issues - Canary and Blue/Green deployment styles are becoming more prevalent; for critical releases of complex systems, also consider request shadowing (Twitter was one of the early adopters, with Diffy).
Security controls implementation - There are just too many bad actors prowling, and the classical implementation of security controls at the L3/L4 level is just not sufficient. You need a solution at the L7 level (it takes that level of rich data to make decisions) to ensure that legitimate users are not affected while evil is blocked.
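The request shadowing idea mentioned for critical releases can be sketched in a few lines of Python. This is only an illustration of the concept, not how Diffy itself works: the `shadow` wrapper and the two pricing handlers are hypothetical, and a production setup would replay traffic asynchronously so the new release adds no latency for users.

```python
def shadow(primary, candidate, mismatches):
    """Wrap `primary` so each request is also replayed against `candidate`.

    The caller always gets the primary's response; differences and candidate
    failures are only recorded in `mismatches`.
    """
    def handler(request):
        expected = primary(request)
        try:
            actual = candidate(request)
        except Exception as exc:  # the shadow must never break live traffic
            mismatches.append((request, expected, repr(exc)))
            return expected
        if actual != expected:
            mismatches.append((request, expected, actual))
        return expected
    return handler


# Hypothetical handlers standing in for the current and the new release.
def price_v1(request):
    return {"total": request["qty"] * request["unit_price"]}


def price_v2(request):
    # New release with a behavior change: a discount from 10 items upwards.
    total = request["qty"] * request["unit_price"]
    return {"total": total * 0.9 if request["qty"] >= 10 else total}


mismatches = []
serve = shadow(price_v1, price_v2, mismatches)
small = serve({"qty": 2, "unit_price": 5})    # both releases agree
bulk = serve({"qty": 10, "unit_price": 5})    # releases differ, recorded only
```

Users keep getting the response of the current release, while `mismatches` collects every request on which the new release would have behaved differently, so the difference can be reviewed before the release takes live traffic.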

Creating an SRE team

Most common question - Do we take the operations team and train them in programming (automation skills are a basic requirement), or take software engineers and move them into operations? The answer is clear from Google's initial SRE days: it is engineers who are put into operations. They work very smart and hard to ensure that they are not paged at 02:00 AM on a Sunday.
Empowering the application developers - Be friendly with the developers (it is their work that keeps you awake). Provide them with the right tooling and dashboards so that they understand how their applications are working in production (with DevOps, developers are getting closer to operations) and can build better resilience into their applications. Work as an embedded SRE team within the DevOps team to coach and guide them.
Career path for SRE - For an experienced SRE team member, one of my suggestions for progression would be to focus on creating tools and platforms in this space. You have been using these to do your job; now it is your chance to create something even better. (Create an open-source-centric organization; there is a whole world of VCs willing to invest and make you a millionaire.)

Note: I have tried to reduce the use of the term "operations", to put the emphasis on "engineering". The ultimate goal is to establish a working style which allows the creators of the product to manage the end-to-end lifecycle of delivering services to its consumers.

Bobby Gadadhar 7y

Excellent read on SRE. Shristi & Stithi done by Dev should better the world for all. Hoping to see this getting built and called out in the support models. It might be practiced to a good extent now, but not being explicitly recognized and planned may lack the needed level of attention...
