SRE-Cheat-Sheet

In today's technology-driven world, ensuring systems' reliability, scalability, and performance is critical. Site Reliability Engineering (SRE) is a discipline that combines software engineering and operations to create highly available services that users can trust. Whether you're new to SRE or a seasoned professional looking to brush up on best practices, this cheat sheet provides a quick guide to mastering the essentials.


This cheat sheet is designed for SREs and is primarily inspired by Google's SRE practices. It serves as a quick reference guide for looking up key terms. For a deeper understanding, I recommend reading the Google SRE book, which is available for free: Google SRE Book.


Site Reliability Engineering

"Fundamentally, it's what happens when you ask a software engineer to design an operations function." -- Ben Treynor, VP of Engineering, Google


Reliability

Reliability measures how well a service meets its expected performance standards.

  • What promises should be made, and to whom?
  • Which metrics should be tracked?
  • How much reliability is sufficient?


Principles

Reliability is the key feature of any service. It's users, not monitoring tools, that ultimately define reliability.

Aiming for 100% reliability is unrealistic in most cases. Achieving 99.9% requires a skilled software engineering team. To reach 99.99%, a well-trained operations team focused on automation is essential. For 99.999% reliability, you must prioritize stability over the speed of feature releases.


Error Budget

An error budget represents the amount of downtime you're willing to accept in order to push new features. For example, if your application has 90% uptime, that means you can afford up to 36.5 days of downtime per year, or 72 hours per month. You can choose to spend this downtime on fixing issues or on improving system reliability to allow for more feature releases. The decision is yours.
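
To make the arithmetic concrete, here is a minimal Python sketch that converts an availability target into an allowed downtime budget. The SLO values and time windows below are illustrative, not prescriptions:

    def error_budget(slo: float, window_hours: float) -> float:
        """Allowed downtime in hours for a given availability SLO."""
        return window_hours * (1.0 - slo)

    # Illustrative windows: 30-day month = 720 h, year = 8,760 h.
    for slo in (0.90, 0.999, 0.9999):
        print(f"SLO {slo:.2%}: "
              f"{error_budget(slo, 720):.2f} h/month, "
              f"{error_budget(slo, 8760):.1f} h/year")

Running this reproduces the 90% example above: 72 hours per month, 876 hours (36.5 days) per year.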

The key is to pause feature releases until your error budget is replenished. This approach offers several advantages:

  • Incentive for Stability: Your software engineers will focus on building a more stable application. If the system is unstable, they'll need to allocate their error budget to fix issues, rather than release new features.
  • Freedom to Innovate: With a stable application, you can push new features as long as your error budget permits.
  • Uptime Consistency: Your uptime will align with your SLA. After all, no one wants to risk breaching their SLA or facing legal consequences.


How can you ensure your services are reliable?


Rolling out changes gradually

  • Incremental deployments
  • Feature toggles
  • Canary deployments with easy rollback, affecting only a small percentage of users initially (see the sketch below)
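
As a rough sketch of how canaries and feature toggles often dial a change up gradually: deterministic hash-based bucketing puts each user in a stable cohort, so you can start at 1% and roll back instantly. The function and feature names below are illustrative, not a real API:

    import hashlib

    def in_rollout(user_id: str, feature: str, percent: float) -> bool:
        """Deterministically bucket a user into a gradual rollout.

        Hashing feature:user_id yields a stable bucket in [0, 100), so a
        user keeps the same decision while the cohort grows as `percent`
        is raised.
        """
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
        bucket = (int(digest, 16) % 10_000) / 100.0   # stable value in [0, 100)
        return bucket < percent

    # Start the canary at 1% of users; rolling back is just percent = 0.
    print(in_rollout("user-42", "new-checkout", percent=1.0))

The key design choice is determinism: the same user always gets the same answer, which keeps the canary cohort consistent between requests.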


Remove single points of failure

  • Multi-AZ (Availability Zone) deployments
  • Implement disaster recovery (DR) in a geographically separate region


Reduce TTD (Time-To-Detect)

  • Detect issues more quickly with automated alerts and monitoring
  • Track SLO compliance and monitor error budget consumption (a burn-rate sketch follows below)
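
One common way to monitor error budget consumption is a burn-rate check. This sketch assumes you can already count failed and total requests over some window; the SLO and numbers are illustrative:

    def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
        """How fast the error budget burns relative to what the SLO permits.

        1.0 means errors arrive exactly at the allowed rate; sustained
        values well above 1 are a common paging condition.
        """
        observed = bad_events / total_events
        allowed = 1.0 - slo
        return observed / allowed

    # 99.9% SLO, 500 failed out of 100,000 requests: budget burns 5x too fast.
    print(burn_rate(500, 100_000, slo=0.999))   # ≈ 5.0

Alerting on burn rate rather than raw error counts ties detection directly to the budget, which shortens TTD for the failures that actually threaten the SLO.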


Reduce TTR (Time-To-Resolution)

  • Resolve outages more quickly
  • Share knowledge through playbooks
  • Automate outage mitigation steps, like shifting traffic between regions


Increase TTF / TBF (Time-To-Failure / Time-Between-Failures) - reduce the expected frequency of failures

  • Enhance fault tolerance by deploying services across multiple availability zones (AZs)
  • Automate manual mitigation processes


Enhance Operational Efficiency

  • Conduct post-mortems for outages
  • Standardize infrastructure across the organization
  • Identify regions with poor reliability and prioritize efforts to improve them

Issue occurs |---- TTD (Time-To-Detect) ----|---- TTR (Time-To-Resolution) ----| Resolved

Wheel of Misfortune: The "Wheel of Misfortune" is a role-playing exercise where a past postmortem is re-enacted. Engineers take on specific roles outlined in the original postmortem, simulating the incident to improve understanding, communication, and response strategies.


  • Mean Time to Recover (MTTR): MTTR represents the average time required for a system or device to recover from a failure and return to normal operation.
  • Mean Time Between Failures (MTBF): MTBF is the estimated average time between consecutive failures of a repairable system during normal operation. It is calculated as the arithmetic mean of the time intervals between failures. For non-repairable systems, the equivalent term is Mean Time to Failure (MTTF). Together with MTTR, MTBF determines steady-state availability (see the sketch after this list).
  • Mean Time to Failure (MTTF): MTTF refers to the expected average time until failure for a non-repairable system.
  • Capability Maturity Model (CMM): The Capability Maturity Model (CMM) is a framework developed based on research funded by the U.S. Department of Defense, analyzing data from various organizations. The term "maturity" refers to the progression of processes from informal, ad hoc practices to formalized steps, managed metrics, and ultimately, continuous optimization for improved efficiency and outcomes.
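
MTBF and MTTR combine into the standard steady-state availability approximation, Availability = MTBF / (MTBF + MTTR). A minimal sketch with made-up numbers:

    def availability(mtbf_hours: float, mttr_hours: float) -> float:
        """Steady-state availability: MTBF / (MTBF + MTTR)."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Hypothetical service: one failure every 30 days, one hour to recover.
    print(f"{availability(720, 1):.4%}")   # -> 99.8613%

The formula also shows why the TTD/TTR work above pays off: shrinking MTTR raises availability even when the failure rate stays the same.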


The Four Golden Signals


The four golden signals answer a set of fundamental monitoring questions about your service.


Saturation: Saturation refers to the capacity limits of your service. This could be metrics like CPU utilization or memory usage. Define what saturation means for your service by identifying the point at which it could fail. Measure and monitor metrics that indicate when you're approaching this threshold.
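
As a sketch of a host-level saturation check, assuming the third-party psutil library and purely illustrative thresholds:

    import psutil   # third-party: pip install psutil

    # Illustrative limits; tune them to where YOUR service actually fails.
    CPU_LIMIT_PCT = 80.0
    MEM_LIMIT_PCT = 85.0

    def is_saturated() -> bool:
        """True if host CPU or memory is approaching its capacity limit."""
        cpu = psutil.cpu_percent(interval=1)    # CPU use sampled over 1 second
        mem = psutil.virtual_memory().percent   # share of RAM in use
        return cpu > CPU_LIMIT_PCT or mem > MEM_LIMIT_PCT

    if is_saturated():
        print("Approaching saturation: shed load or scale out")

In practice the thresholds should come from load testing, i.e. from the point where your service degrades, not from generic round numbers.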


Latency: Latency is critical because users today demand fast applications, so monitoring it is essential. At Google, latency is measured using percentiles:

  • P50: The median latency (50th percentile).
  • P90: The 90th percentile.
  • P99: The 99th percentile.

Tip: Avoid using averages for latency metrics, as they can mask outliers and fail to represent user experiences accurately.
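
A quick sketch of computing these percentiles from raw samples, using made-up latency data with a deliberate slow tail:

    import random

    def percentile(samples: list, p: float) -> float:
        """Nearest-rank percentile: value below which ~p% of samples fall."""
        ordered = sorted(samples)
        index = min(len(ordered) - 1, int(len(ordered) * p / 100))
        return ordered[index]

    # Made-up latencies: mostly fast, plus a slow tail an average would hide.
    latencies = [random.gauss(100, 10) for _ in range(990)] + [1500.0] * 10

    for p in (50, 90, 99):
        print(f"P{p}: {percentile(latencies, p):.1f} ms")
    print(f"mean: {sum(latencies) / len(latencies):.1f} ms   # masks the tail")

Here P99 surfaces the 1,500 ms outliers that the mean smooths over, which is exactly why the tip above warns against averages.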


Errors: Errors indicate failures in serving traffic. These are often measured in Errors Per Second (EPS) to track the rate of failures over time.


Traffic: Traffic is typically measured in Requests Per Second (RPS) or Queries Per Second (QPS), reflecting the workload your service is handling.


Valid Monitoring Outputs

Alerts: Alerts are for urgent issues that require immediate human action to prevent system failure or degradation.

Tickets: Tickets indicate issues that need human attention but are not urgent. Unlike alerts, tickets can be addressed with sufficient lead time.

Logs: Logs are diagnostic tools used for postmortems, forensic analysis, and troubleshooting purposes.


Defense in Depth: Failures are inevitable, so design your system to tolerate them. Implement strategies to automatically handle and fix point failures without human intervention. A fault-tolerant design reduces single points of failure, making your system more resilient.
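
One common layer of such automatic handling is retrying transient failures with exponential backoff and jitter. A minimal sketch; the wrapped call at the bottom is hypothetical:

    import random
    import time

    def call_with_retries(fn, attempts: int = 3, base_delay: float = 0.1):
        """Retry a flaky call with exponential backoff plus jitter.

        Transient point failures are absorbed automatically; persistent
        failures still surface to the caller after the final attempt.
        """
        for attempt in range(attempts):
            try:
                return fn()
            except Exception:
                if attempt == attempts - 1:
                    raise
                # Exponential backoff with jitter to avoid thundering herds.
                time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

    # Usage: wrap any idempotent operation, e.g. a downstream HTTP call.
    # call_with_retries(lambda: fetch_user_profile("user-42"))   # hypothetical

Retries are only one layer; the jitter matters because synchronized retries from many clients can themselves become an outage.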


Graceful Degradation: Graceful degradation ensures that your system can handle failures without a complete breakdown. For example:

  • In a slow network, Hangouts reduces video resolution while preserving audio.
  • For Gmail, large attachments might not load, but users can still read emails.

These automated responses maintain high availability and usability, minimizing the need for human intervention.
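
A minimal sketch of the pattern: fall back to a cheaper response when the rich path fails. Both backend functions below are hypothetical stand-ins:

    import random

    def load_full_inbox(user_id: str) -> dict:
        """Hypothetical rich backend: messages plus attachment previews."""
        if random.random() < 0.3:               # simulate a flaky dependency
            raise TimeoutError("attachment store timed out")
        return {"messages": ["hello"], "attachments": ["report.pdf"]}

    def load_message_list(user_id: str) -> list:
        """Hypothetical cheap backend: message metadata only."""
        return ["hello"]

    def render_inbox(user_id: str) -> dict:
        """Serve a degraded-but-usable response when the rich path fails."""
        try:
            return load_full_inbox(user_id)
        except (TimeoutError, ConnectionError):
            # Degrade gracefully: drop attachments, keep mail readable.
            return {"messages": load_message_list(user_id), "attachments": []}

    print(render_inbox("user-42"))

The Gmail example above follows the same shape: the expensive part of the response is sacrificed so the core function keeps working without human intervention.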
