Are DevOps & SRE dead in Operations?

This topic has been on my mind for a couple of years now, and I thought it was time to put some thoughts down on paper (well, electronic paper). The post covers a lot of ground, but I will try to keep the topics in a logical order that reflects my thought process. It is a big topic, and a controversial one. Depending on who you ask, you will get a different answer on what DevOps and SRE are, so these are just my views.

My background is databases, so throughout this post I will keep looking from a database perspective, but I think many of my thoughts apply to other platforms within typical Operations departments.

Job Titles and Team Names

The first thing to say is that some companies seem to use job titles incorrectly, often mingling and mashing them together. For me, DevOps and SRE are not job titles and not really even department or team names, although in larger enterprises dedicated teams seem inevitable. I am a fan of using the various platform names, e.g. Core Platform, Application Platform, Database Platform, together with plain engineering titles. For teams to work, they will typically have a mix of engineers with different skills, allowing them to adopt parts of the DevOps or SRE philosophies to varying degrees. I am not a fan of putting DevOps or SRE labels on team names or titles.

Why a Database Team?

As a database guy, this question is an important one. I have included it because I think it makes perhaps the strongest case for why DevOps and SRE may no longer be suitable for database services.

Databases are unique in many ways. They are a very different technology from most apps and as such warrant a degree of specialism, typically resulting in a separate data or database team. Here are just a few items that make databases different:

  • Unique HA and DR solutions built into the database product.
  • Different backup and recovery tools built into the product.
  • Different migration methods for databases to the cloud.
  • Mistakes are often costly in terms of infrastructure and/or performance.
  • Their own crash consistency level.
  • They typically don’t sit in containers well (except maybe dev).

With all these differences and the requirement for this specialty knowledge, I think a Database Platform team is a must for an IT team of any reasonable size.

The Wall

Those of us who have operated in a typical operations environment remember the nineties and even the early noughties, when it seemed like a constant battle to catch the apps being thrown over the wall by devs, desperately trying to work out how to support them and hoping we could monitor and scale them. These times caused a lot of discontent and made a lot of operations staff and devs unhappy. Despite operations staff asking to be involved much earlier, before the "throw it over the wall" phase, in many companies it did not happen for a variety of reasons.

Due to the uniqueness of databases, many aspects are neglected and not thought through thoroughly by devs. It isn't their fault; often they just don't have the database specialty knowledge, so they apply general app principles. But because of the quirks of databases, general app principles are sometimes not right for them. Over the years I have seen dev teams create solutions that do not scale, do not perform and require additional headcount, just because of an inefficient design or process.

DevOps Revolution

It was late in the noughties when DevOps started to emerge and the "you build it, you run it" philosophy appeared, but it didn't really take off everywhere. Operations staff often found it hard to adopt development techniques, and developers either didn't want to, or found it difficult to, consider operational aspects when designing and coding. Most companies that follow DevOps still have a "shadow" operations team of some sort, or the lead devs end up doing the operational aspects, which is often a poor use of expertise. Very few companies truly have one team that follows DevOps techniques and builds end to end. It may have been possible back in 2006, but the cloud is just too complex now. I believe it is almost impossible for any single engineer to do this to a high standard today. As the cloud grows in complexity, DevOps teams do less and less infrastructure and include fewer operational aspects, widening the gap of operational areas not being maintained.

The principles are promising and still valid, but I question the concept of a single team following the "you build it, you run it" mantra. I think the principles still have legs, and it is still advantageous for both app dev and ops teams to follow them in their work.

Site Reliability Engineering (SRE)

SRE is often quoted as a class of DevOps and was created as a concept by Google. SRE is fundamentally using administrators with software expertise to substitute automation for human labour. SRE is about converting administrators into engineers to build stuff to increase the reliability of a production system.

This involves post mortems, ticket handling and Service Level Objective (SLO) monitoring, which is why SRE typically starts from Production and works backwards. True SRE means inclusion in architecture decisions to ensure resiliency and self service, but unfortunately many company SRE teams stop at production. Typically 50% of an SRE's time will be automation/system reviews and 50% tickets and troubleshooting (toil).
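To make the SLO monitoring part concrete, here is a minimal sketch of one common way it is done: turning an availability target into an error budget and tracking how much of it has been spent. The class name, fields and numbers are all made up for illustration, and real SRE teams will measure this with their own monitoring stack rather than a script like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """An availability SLO measured over a fixed window (all values hypothetical)."""

    target: float         # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int   # requests served in the window
    failed_requests: int  # requests that violated the SLO

    def error_budget(self) -> float:
        """Number of requests that are allowed to fail in this window."""
        return (1.0 - self.target) * self.total_requests

    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        budget = self.error_budget()
        if budget == 0:
            # A 100% target leaves no budget at all.
            return 0.0 if self.failed_requests == 0 else float("-inf")
        return 1.0 - self.failed_requests / budget


# Hypothetical month: one million requests against a 99.9% target.
window = SloWindow(target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"allowed failures: {window.error_budget():.0f}")
print(f"budget remaining: {window.budget_remaining():.0%}")
```

With these made-up numbers the service may fail roughly 1,000 requests in the window and has about three quarters of its budget left. A shrinking budget is one signal a team might use to decide when to shift time from feature work to reliability work.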

Database Reliability Engineering (DBRE) is a derivative of SRE, but largely follows the same principles as SRE, just from a data perspective. But an important aspect of DBRE is Self Service.

Conclusions

DevOps principles are still valid with the exception of "End to End responsibility". The internet is too complex for this to work now and I believe it must be a shared responsibility with Ops, but that doesn't mean going back to the "Throw It Over The Wall" times.

SRE as a technique is still important. Blameless post mortems, ticket handling, toil removal, dashboard monitoring and automation are still key, but the automation part should not be about automating existing systems after the fact.

The future for Operations is Platform Engineering and Platform as a Product (PaaP). The concept of PaaP is relatively new, but it makes sense to separate the platform from the customer-facing product. The principle is to abstract a huge amount of very specialised platform knowledge away from the app dev build and into the PaaP, thus reducing cognitive load for the developer and reducing the operational burden for SRE teams.

Typically this involves building an Internal Developer Platform (IDP) to allow app dev teams to self service. An IDP will typically be a mix of documentation, examples and APIs that allow developers to include the infrastructure deployment within their application build without the responsibility of working out how to design or deploy it. Developers are no longer required to write Infrastructure as Code (IaC) to build their infrastructure; instead they use an IDP to do it. The IDP will still use IaC behind the scenes to build out the infrastructure, but the developer can only choose from approved options that follow operationally sound standards (golden paths). This concept is one I have been working on for a while, and to assist I have been developing a tool called JustDeploy. It is an IaC tool that can be used to build up deployment units, which are preset, approved units of infrastructure. One use case for JustDeploy is to be one of the tools that form an IDP, but any IaC tool can be used by your IDP.
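As a rough illustration of the golden path idea, the sketch below shows how an IDP might take a very small request from a developer and expand it into a full, approved deployment unit. This is not JustDeploy's interface or any real product's API. Every name, size and default here is hypothetical, and a real IDP would hand the resulting spec to an IaC tool rather than print it.

```python
from dataclasses import dataclass

# Hypothetical golden paths: the only database shapes the platform team supports.
GOLDEN_PATHS = {
    "small": {"instance_class": "db.small", "replicas": 1, "backup_retention_days": 7},
    "medium": {"instance_class": "db.medium", "replicas": 2, "backup_retention_days": 14},
    "large": {"instance_class": "db.large", "replicas": 3, "backup_retention_days": 35},
}
APPROVED_ENGINES = {"postgres", "mysql"}


@dataclass(frozen=True)
class DatabaseRequest:
    """What the app developer supplies: intent, not infrastructure detail."""

    service_name: str
    engine: str
    size: str


def expand_request(request: DatabaseRequest) -> dict:
    """Validate a request and expand it into a full deployment unit spec."""
    if request.engine not in APPROVED_ENGINES:
        raise ValueError(f"engine {request.engine!r} is not on a golden path")
    if request.size not in GOLDEN_PATHS:
        raise ValueError(f"size {request.size!r} is not on a golden path")

    spec = dict(GOLDEN_PATHS[request.size])  # copy so the presets stay untouched
    spec.update(
        name=f"{request.service_name}-db",
        engine=request.engine,
        # Operational defaults the developer cannot switch off.
        monitoring_enabled=True,
        encrypted_at_rest=True,
    )
    return spec


spec = expand_request(DatabaseRequest("orders", "postgres", "medium"))
print(spec)
```

The point of the design is that the developer only states three things (service, engine, size), while the specialist knowledge about replicas, backups and monitoring lives in the platform. Anything off the approved list is rejected up front instead of being discovered in production.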

IDPs should be developed using DevOps principles by a Platform Engineering team in Operations and include all the relevant SRE automation to ensure Application Performance Monitoring (APM), OpenTelemetry and other techniques are embedded into the product.

I hope the next few years will be an exciting time for Operations as the industry moves to a Platform Engineering era, bringing together all these techniques that have appeared over the past years and providing self service to app devs. The SRE discipline is a good part of the way to Platform Engineering, but it needs to take on the Platform as a Product concept.

Yes, everything is cyclical in nature - decentralization is the name of the game in tech, but recentralization is happening on a global basis by way of geopolitical isolation, everything in life depends on everything else, so while companies adopt the latest fad, the question remains, how effective are any of our actions against our intended result? 😎

More articles by Richard Brown
