Site Reliability Engineering (SRE)

The history of SRE: “Hope is not a strategy”

The phrase “site reliability engineering” is credited to Benjamin Treynor Sloss, vice president of engineering at Google. Sloss joined Google in 2003 and was tasked with building a team to help ensure the health of Google’s production systems at scale—no small task. Sloss himself has defined SRE as “what happens when you ask a software engineer to design an operations function.”

SRE is "What happened when you take operations and make it a software problem. The job is all about automating operations out of job," according to Todd Palino of LinkedIn. You could also think of SRE as Google's approach to DevOps, driven by the need to efficiently manage large, complex systems.

To understand more about SRE, here is some background. Developers (functionality/features) deliver their code to operators (reliability/stability), who are responsible for running it in production. As many of us know, operators often have limited understanding of the code base, and developers are often unaware of what is happening in production. This gap between functionality and reliability often causes friction in the business. You might ask: if we already have DevOps, why do we need another mechanism at all? Well, yes: DevOps is a set of practices and a culture designed to break down the barriers between developers, operators, and other parts of the organization.

"We want to keep the site up all the time"

SRE is often confused with DevOps. The difference is this: DevOps is designed to help an organization’s IT department move in agile and performant ways. It builds a healthy working relationship between the operations staff and the development team, allowing each to see how its work affects the other. By combining knowledge and effort, DevOps should produce a more robust, reliable, agile product.

According to devops.com, both SRE and DevOps are methodologies addressing organizations’ needs for production operation management. But the differences between the two doctrines are quite significant. While DevOps teams raise problems and dispatch them to Dev to solve, SREs find problems and solve some of them themselves. While DevOps teams would usually choose the more conservative approach, leaving the production environment untouched unless absolutely necessary, SREs are more confident in their ability to maintain a stable production environment and push for rapid changes and software updates. Like DevOps teams, SREs thrive on a stable production environment, but one of the SRE team’s goals is also to improve performance and operational efficiency.

So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise,

or

“SREs are Software Engineers who specialize in reliability/stability. SREs apply the principles of computer science and engineering to the design and development of computer systems: generally, large distributed ones.” - Tammy Butow, Dropbox

As a real-life example, think of a typical contemporary web application such as an e-commerce site. It needs to store and retrieve data from a database and communicate with a couple of APIs from providers such as payment gateways and shipping agencies. What happens when a timeout error occurs? We immediately issue a patch so that the problem does not spread to other areas of the application or affect the business, then find the root cause and patch further to close the problem for good.

So the point here is how we identify the root cause and make sure it does not happen again, giving the customer a strong sense of stability, and how SRE can help.

The SRE approach to the above problem: application logs and analytics are collected and closely monitored by AI, using machine-learned models to predict a possible problem in advance. A rich graphical interface to display these predictions would be an added advantage. We need not think only in terms of AI and machine learning, though; writing a small script to proactively avoid or stop a future problem is also an SRE job. So in my opinion SRE is more about next week's challenge than last week's problem: engineers foresee possible issues and take proactive measures to ensure the site's reliability.
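To make the "small script" idea concrete, here is a minimal sketch in Python. It assumes a hypothetical log format of one request per line, `<provider> <status>`, and a hypothetical 5% warning threshold; the provider names and function names are illustrative only and do not come from any real system. It flags providers whose timeout rate is creeping up, so that someone can look before customers notice.

```python
from collections import Counter

# Hypothetical threshold: warn when more than 5% of calls to a provider time out.
TIMEOUT_RATE_THRESHOLD = 0.05


def timeout_rates(log_lines):
    """Return {provider: fraction of calls that timed out}.

    Assumes each line looks like '<provider> <status>', for example
    'payment-gateway TIMEOUT'. Lines in any other shape are skipped.
    """
    totals = Counter()
    timeouts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip malformed lines
        provider, status = parts
        totals[provider] += 1
        if status == "TIMEOUT":
            timeouts[provider] += 1
    return {p: timeouts[p] / totals[p] for p in totals}


def providers_at_risk(log_lines, threshold=TIMEOUT_RATE_THRESHOLD):
    """Return the sorted names of providers whose timeout rate exceeds the threshold."""
    rates = timeout_rates(log_lines)
    return sorted(p for p, rate in rates.items() if rate > threshold)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = (
        ["payment-gateway OK"] * 18
        + ["payment-gateway TIMEOUT"] * 2
        + ["shipping-api OK"] * 20
    )
    for provider in providers_at_risk(sample):
        print(f"WARNING: {provider} timeout rate is above threshold")
```

A script like this could run on a schedule against recent logs. A real setup would more likely rely on an existing monitoring and alerting stack, but the principle is the same: watch a leading indicator and act before it becomes an outage.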




