SRE-Cheat-Sheet

In today's technology-driven world, ensuring systems' reliability, scalability, and performance is critical. Site Reliability Engineering (SRE) is a discipline that combines software engineering and operations to create highly available services that users can trust. Whether you're new to SRE or a seasoned professional looking to brush up on best practices, this cheat sheet provides a quick guide to mastering the essentials.


This cheat sheet is designed for SREs and is primarily inspired by Google's SRE practices. It serves as a quick reference guide for looking up key terms. For a deeper understanding, I recommend reading the Google SRE book, which is available for free: Google SRE Book.


Site Reliability Engineering

"Fundamentally, it's what happens when you ask a software engineer to design an operations function." -- Ben Treynor, VP of Engineering, Google


Reliability

Reliability measures how well a service meets its expected performance standards.

  • What promises should be made, and to whom?
  • Which metrics should be tracked?
  • How much reliability is sufficient?


Principles

Reliability is the key feature of any service. It's users, not monitoring tools, that ultimately define reliability.

Aiming for 100% reliability is unrealistic in most cases. Achieving 99.9% requires a skilled software engineering team. To reach 99.99%, a well-trained operations team focused on automation is essential. For 99.999% reliability, you must prioritize stability over the speed of feature releases.


Error Budget

An error budget represents the amount of downtime you're willing to accept in order to push new features. For example, if your application has 90% uptime, that means you can afford up to 36.5 days of downtime per year, or 72 hours per month. You can choose to spend this downtime on fixing issues or on improving system reliability to allow for more feature releases. The decision is yours.
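
To make the arithmetic concrete, here is a minimal Python sketch that converts an availability target into an allowed downtime budget. The SLO values and time windows below are illustrative, not prescriptions:

    def error_budget(slo: float, window_hours: float) -> float:
        """Allowed downtime in hours for a given availability SLO."""
        return window_hours * (1.0 - slo)

    # Illustrative windows: 30-day month = 720 h, year = 8,760 h.
    for slo in (0.90, 0.999, 0.9999):
        print(f"SLO {slo:.2%}: "
              f"{error_budget(slo, 720):.2f} h/month, "
              f"{error_budget(slo, 8760):.1f} h/year")

Running this reproduces the 90% example above: 72 hours per month, 876 hours (36.5 days) per year.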

The key is to pause feature releases until your error budget is replenished. This approach offers several advantages:

  • Incentive for Stability: Your software engineers will focus on building a more stable application. If the system is unstable, they'll need to allocate their error budget to fix issues, rather than release new features.
  • Freedom to Innovate: With a stable application, you can push new features as long as your error budget permits.
  • Uptime Consistency: Your uptime will align with your SLA. After all, no one wants to risk breaching their SLA or facing legal consequences.


How can you ensure your services are reliable?


Rolling out changes gradually

  • Incremental deployments
  • Feature toggles
  • Canary deployments with easy rollback, affecting only a small percentage of users initially (see the sketch below)
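
As a rough sketch of how canaries and feature toggles often dial a change up gradually: deterministic hash-based bucketing puts each user in a stable cohort, so you can start at 1% and roll back instantly. The function and feature names below are illustrative, not a real API:

    import hashlib

    def in_rollout(user_id: str, feature: str, percent: float) -> bool:
        """Deterministically bucket a user into a gradual rollout.

        Hashing feature:user_id yields a stable bucket in [0, 100), so a
        user keeps the same decision while the cohort grows as `percent`
        is raised.
        """
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
        bucket = (int(digest, 16) % 10_000) / 100.0   # stable value in [0, 100)
        return bucket < percent

    # Start the canary at 1% of users; rolling back is just percent = 0.
    print(in_rollout("user-42", "new-checkout", percent=1.0))

The key design choice is determinism: the same user always gets the same answer, which keeps the canary cohort consistent between requests.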


Remove single points of failure

  • Multi-AZ (Availability Zone) deployments
  • Implement disaster recovery (DR) in a geographically separate region


Reduce TTD (Time-To-Detect)

  • Detect issues more quickly with automated alerts and monitoring
  • Track SLO compliance and monitor error budget consumption (a burn-rate sketch follows below)
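
One common way to monitor error budget consumption is a burn-rate check. This sketch assumes you can already count failed and total requests over some window; the SLO and numbers are illustrative:

    def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
        """How fast the error budget burns relative to what the SLO permits.

        1.0 means errors arrive exactly at the allowed rate; sustained
        values well above 1 are a common paging condition.
        """
        observed = bad_events / total_events
        allowed = 1.0 - slo
        return observed / allowed

    # 99.9% SLO, 500 failed out of 100,000 requests: budget burns 5x too fast.
    print(burn_rate(500, 100_000, slo=0.999))   # ≈ 5.0

Alerting on burn rate rather than raw error counts ties detection directly to the budget, which shortens TTD for the failures that actually threaten the SLO.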


Reduce TTR (Time-To-Resolution)

  • Resolve outages more quickly
  • Share knowledge through playbooks
  • Automate outage mitigation steps, like shifting traffic between regions


Increase TTF / TBF (Time-To-Failure / Time-Between-Failures) - reduce the expected frequency of failures

  • Enhance fault tolerance by deploying services across multiple availability zones (AZs)
  • Automate manual mitigation processes


Enhance Operational Efficiency

  • Conduct post-mortems for outages
  • Standardize infrastructure across the organization
  • Identify regions with poor reliability and prioritize efforts to improve them

Issue occurs |---- TTD (Time-To-Detect) ----|---- TTR (Time-To-Resolution) ----| Resolved

Wheel of Misfortune: The "Wheel of Misfortune" is a role-playing exercise where a past postmortem is re-enacted. Engineers take on specific roles outlined in the original postmortem, simulating the incident to improve understanding, communication, and response strategies.


  • Mean Time to Recover (MTTR): MTTR represents the average time required for a system or device to recover from a failure and return to normal operation.
  • Mean Time Between Failures (MTBF): MTBF is the estimated average time between consecutive failures of a repairable system during normal operation. It is calculated as the arithmetic mean of the time intervals between failures. For non-repairable systems, the equivalent term is Mean Time to Failure (MTTF). Together with MTTR, MTBF determines steady-state availability (see the sketch after this list).
  • Mean Time to Failure (MTTF): MTTF refers to the expected average time until failure for a non-repairable system.
  • Capability Maturity Model (CMM): The Capability Maturity Model (CMM) is a framework developed based on research funded by the U.S. Department of Defense, analyzing data from various organizations. The term "maturity" refers to the progression of processes from informal, ad hoc practices to formalized steps, managed metrics, and ultimately, continuous optimization for improved efficiency and outcomes.
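
MTBF and MTTR combine into the standard steady-state availability approximation, Availability = MTBF / (MTBF + MTTR). A minimal sketch with made-up numbers:

    def availability(mtbf_hours: float, mttr_hours: float) -> float:
        """Steady-state availability: MTBF / (MTBF + MTTR)."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Hypothetical service: one failure every 30 days, one hour to recover.
    print(f"{availability(720, 1):.4%}")   # -> 99.8613%

The formula also shows why the TTD/TTR work above pays off: shrinking MTTR raises availability even when the failure rate stays the same.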


The Four Golden Signals


The four golden signals answer a set of fundamental monitoring questions about your service.


Saturation: Saturation refers to the capacity limits of your service. This could be metrics like CPU utilization or memory usage. Define what saturation means for your service by identifying the point at which it could fail. Measure and monitor metrics that indicate when you're approaching this threshold.
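
As a sketch of a host-level saturation check, assuming the third-party psutil library and purely illustrative thresholds:

    import psutil   # third-party: pip install psutil

    # Illustrative limits; tune them to where YOUR service actually fails.
    CPU_LIMIT_PCT = 80.0
    MEM_LIMIT_PCT = 85.0

    def is_saturated() -> bool:
        """True if host CPU or memory is approaching its capacity limit."""
        cpu = psutil.cpu_percent(interval=1)    # CPU use sampled over 1 second
        mem = psutil.virtual_memory().percent   # share of RAM in use
        return cpu > CPU_LIMIT_PCT or mem > MEM_LIMIT_PCT

    if is_saturated():
        print("Approaching saturation: shed load or scale out")

In practice the thresholds should come from load testing, i.e. from the point where your service degrades, not from generic round numbers.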


Latency: Latency is critical because users today demand fast applications, so monitoring it is essential. At Google, latency is measured using percentiles:

  • P50: The median latency (50th percentile).
  • P90: The 90th percentile.
  • P99: The 99th percentile.

Tip: Avoid using averages for latency metrics, as they can mask outliers and fail to represent user experiences accurately.
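
A quick sketch of computing these percentiles from raw samples, using made-up latency data with a deliberate slow tail:

    import random

    def percentile(samples: list, p: float) -> float:
        """Nearest-rank percentile: value below which ~p% of samples fall."""
        ordered = sorted(samples)
        index = min(len(ordered) - 1, int(len(ordered) * p / 100))
        return ordered[index]

    # Made-up latencies: mostly fast, plus a slow tail an average would hide.
    latencies = [random.gauss(100, 10) for _ in range(990)] + [1500.0] * 10

    for p in (50, 90, 99):
        print(f"P{p}: {percentile(latencies, p):.1f} ms")
    print(f"mean: {sum(latencies) / len(latencies):.1f} ms   # masks the tail")

Here P99 surfaces the 1,500 ms outliers that the mean smooths over, which is exactly why the tip above warns against averages.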


Errors: Errors indicate failures in serving traffic. These are often measured in Errors Per Second (EPS) to track the rate of failures over time.


Traffic: Traffic is typically measured in Requests Per Second (RPS) or Queries Per Second (QPS), reflecting the workload your service is handling.


Valid Monitoring Outputs

Alerts: Alerts are for urgent issues that require immediate human action to prevent system failure or degradation.

Tickets: Tickets indicate issues that need human attention but are not urgent. Unlike alerts, tickets can be addressed with sufficient lead time.

Logs: Logs are diagnostic tools used for postmortems, forensic analysis, and troubleshooting purposes.


Defense in Depth: Failures are inevitable, so design your system to tolerate them. Implement strategies to automatically handle and fix point failures without human intervention. A fault-tolerant design reduces single points of failure, making your system more resilient.
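
One common layer of such automatic handling is retrying transient failures with exponential backoff and jitter. A minimal sketch; the wrapped call at the bottom is hypothetical:

    import random
    import time

    def call_with_retries(fn, attempts: int = 3, base_delay: float = 0.1):
        """Retry a flaky call with exponential backoff plus jitter.

        Transient point failures are absorbed automatically; persistent
        failures still surface to the caller after the final attempt.
        """
        for attempt in range(attempts):
            try:
                return fn()
            except Exception:
                if attempt == attempts - 1:
                    raise
                # Exponential backoff with jitter to avoid thundering herds.
                time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

    # Usage: wrap any idempotent operation, e.g. a downstream HTTP call.
    # call_with_retries(lambda: fetch_user_profile("user-42"))   # hypothetical

Retries are only one layer; the jitter matters because synchronized retries from many clients can themselves become an outage.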


Graceful Degradation: Graceful degradation ensures that your system can handle failures without a complete breakdown. For example:

  • In a slow network, Hangouts reduces video resolution while preserving audio.
  • For Gmail, large attachments might not load, but users can still read emails.

These automated responses maintain high availability and usability, minimizing the need for human intervention.
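
A minimal sketch of the pattern: fall back to a cheaper response when the rich path fails. Both backend functions below are hypothetical stand-ins:

    import random

    def load_full_inbox(user_id: str) -> dict:
        """Hypothetical rich backend: messages plus attachment previews."""
        if random.random() < 0.3:               # simulate a flaky dependency
            raise TimeoutError("attachment store timed out")
        return {"messages": ["hello"], "attachments": ["report.pdf"]}

    def load_message_list(user_id: str) -> list:
        """Hypothetical cheap backend: message metadata only."""
        return ["hello"]

    def render_inbox(user_id: str) -> dict:
        """Serve a degraded-but-usable response when the rich path fails."""
        try:
            return load_full_inbox(user_id)
        except (TimeoutError, ConnectionError):
            # Degrade gracefully: drop attachments, keep mail readable.
            return {"messages": load_message_list(user_id), "attachments": []}

    print(render_inbox("user-42"))

The Gmail example above follows the same shape: the expensive part of the response is sacrificed so the core function keeps working without human intervention.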
