DevOps & Cloud Native: Who is in charge? Ops, Dev or SRE?
To be honest, I wasn’t one of the Cloud Native innovators; I joined the wave of early adopters in 2016. Before then, I too was struggling to reconcile DevOps with infrastructure operations, and I spent my time convincing my colleagues to adopt an unbalanced DevOps approach where DEV was in uppercase and ops in lowercase. At that time we were trying to deliver the famous DevOps loop with IaaS technologies such as OpenStack or public cloud platforms, with mitigated success. You could argue that in 2014/2015 some public cloud providers had already achieved this but, you know, I’m an Open Source guy and, like my peers, I like to see through the technologies even if this takes more effort at the beginning. In the end we are always rewarded for figuring things out by ourselves and sharing them with our community. Back then one of my teammates, Thibault, shared with me his passion for a new technology, Kubernetes, and how it was solving most of our DevOps struggles through container orchestration. Thibault is a convincing guy and I was quickly on board. My second thought was how we could integrate that kind of technology, Kubernetes and Cloud Native, into our old-school IT. This is where another of my teammates, Jean, stepped in to explain how we could solve this by integrating the best Ops practices into our platform. That is how my Cloud Native infrastructure journey started, and I try to use what I have learnt, and am still learning every day, to support my clients on their own journey.
In most of the projects I have worked on, we have faced difficulties regarding what DevOps implies from a production point of view. Continuous Integration is the easiest part of DevOps: everything is in the Dev team's hands. You manage your code, you test it and you run it. There are plenty of tools to do so, more or less mature or easy to use, but this is not a real challenge, at least on the operations or support side. Add to that the fact that most CI happens outside production constraints and without restrictive SLAs, and things become easier to manage. From a production point of view, Continuous Delivery is the main concern. How can we guarantee security, integrity and maintainability with a fluid delivery process? If I want to merge CI and CD in a fully DevOps manner, I have to ask the question of accountability first.
The question is simple: who is going to be accountable for what? For example, if I give full autonomy to the project, which is one of the Cloud Native promises, with build-your-own DevOps pipelines, that puts the accountability on the Dev teams' shoulders, including resource management. Conversely, if I go for on-demand pipelines that cannot be modified directly by the project but only by Ops, then Ops is going to be the accountable team.
When SRE came onto my radar, I expected Google to help solve this accountability complexity. Google's definition is simple: “SRE is what you get when you treat operations as if it’s a software problem”. As a Kubernetes, former OpenStack, Linux and Open Source guy, my first thought was “This is what we’ve been doing for years!?”. But SRE can't be reduced to an Infrastructure-as-Code approach: Google brings some interesting elements to the table, such as SLOs, SLIs and error budgets, and highlights the importance of observability. It is no surprise that SRE fits perfectly with Kubernetes and its ecosystem, as Kube and SRE both have the same origin. Back to our discussion: what about accountability? SRE brings an approach and some metrics to follow, but it says nothing about where SRE should sit in the organization. SRE is operations, but which one? Is SRE going to solve DevOps pipeline issues? Platform integration? Inner platform issues? Issues between platform and infrastructure components? Are SREs all going to be located in the same team? Or, on the contrary, is this a behavior, an approach that could be implemented in different teams? Let’s see this through a figure:
SRE is not going to change your accountability discussion. SRE is a backbone activity, and its positioning depends on your DevOps and Platform strategy:
SREs are going to be part of the Ops Teams. They are going to focus on onboarding projects onto the Platform, solving minor issues and giving Level 2 (L2) support, mostly on the Apps and DevOps integration part. L1 and L3 would be managed by the Ops Platform Team.
SREs are going to be part of the Dev Teams. They are going to build dedicated services, connect them to the Platform and onboard projects onto them. They are going to manage L2 and L3 support. L1 would be managed by the Ops Platform Team.
SREs are going to be part of a Shared Team, probably Ops. They are going to focus on onboarding projects onto the Platform, depending on which level of service the project has chosen. Mostly they are going to solve minor issues and give L2 support on the Apps and DevOps integration part. L1 and L3 would be managed by the Ops Platform Team.
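One way to make these three options concrete is to write them down as a small lookup table that answers "who owns this support level?". The sketch below is only an illustration: the model keys (`sre_in_ops`, `sre_in_dev`, `sre_shared`), the `SrePositioning` class and the `accountable_team` helper are hypothetical names of mine, and the ownership values simply restate the three descriptions above.

```python
from dataclasses import dataclass

# Hypothetical labels, used for illustration only.
OPS_PLATFORM = "Ops Platform Team"
SRE = "SRE"


@dataclass
class SrePositioning:
    """Where SRE sits, and which team owns each support level."""
    sre_home: str
    support_owner: dict  # support level ("L1", "L2", "L3") -> owning team


# The three positioning options described above.
MODELS = {
    "sre_in_ops": SrePositioning(
        sre_home="Ops Teams",
        support_owner={"L1": OPS_PLATFORM, "L2": SRE, "L3": OPS_PLATFORM},
    ),
    "sre_in_dev": SrePositioning(
        sre_home="Dev Teams",
        support_owner={"L1": OPS_PLATFORM, "L2": SRE, "L3": SRE},
    ),
    "sre_shared": SrePositioning(
        sre_home="Shared Team (probably Ops)",
        support_owner={"L1": OPS_PLATFORM, "L2": SRE, "L3": OPS_PLATFORM},
    ),
}


def accountable_team(model: str, level: str) -> str:
    """Return the team that owns the given support level in the given model."""
    return MODELS[model].support_owner[level]


if __name__ == "__main__":
    for name, positioning in MODELS.items():
        owners = ", ".join(
            f"{lvl}: {team}" for lvl, team in positioning.support_owner.items()
        )
        print(f"{name} (SRE in {positioning.sre_home}) -> {owners}")
```

Writing the table down this way makes it obvious that the first and third options assign support levels identically and differ only in where the SREs sit, which is exactly the organizational question the accountability discussion has to settle.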
To conclude, SRE, DevOps and Cloud Native have brought a lot of good things, but we still have to clarify “Who is going to be in charge?”. First, to support our people: the skills are not the same when you are a Developer, a Platform Admin or an SRE. For sure all of them are moving in the same direction, but you need to help them develop their competences in the right one. Second, to build a proper measurement practice tied to team accountability: teams are not all going to leverage observability, monitoring, telemetry and application performance the same way. Last, to be efficient and avoid duplicating tasks and responsibilities. Cloud Native is about moving fast and failing faster; simplifying your organization must be your first target!
Good question to start every project regardless of the size - who is responsible for what.
Great article Jonathan! You said with mitigated success, but at that time given the context it was a leapfrog
Very interesting.