SRE-Cheat-Sheet
In today's technology-driven world, ensuring systems' reliability, scalability, and performance is critical. Site Reliability Engineering (SRE) is a discipline that combines software engineering and operations to create highly available services that users can trust. Whether you're new to SRE or a seasoned professional looking to brush up on best practices, this cheat sheet provides a quick guide to mastering the essentials.
This cheat sheet is designed for SREs and is primarily inspired by Google's SRE practices. It serves as a quick reference guide for looking up key terms. For a deeper understanding, I recommend reading the Google SRE book, which is available for free: Google SRE Book.
Site Reliability Engineering
"Fundamentally, it's what happens when you ask a software engineer to design an operations function." -- Ben Treynor, VP of Engineering, Google :)
Reliability
Reliability measures how well a service meets its expected performance standards.
Principles
Reliability is the key feature of any service. It's users, not monitoring tools, that ultimately define reliability.
Aiming for 100% reliability is unrealistic in most cases. Achieving 99.9% requires a skilled software engineering team. To reach 99.99%, a well-trained operations team focused on automation is essential. For 99.999% reliability, you must prioritize stability over the speed of feature releases.
Error Budget
An error budget represents the amount of downtime you're willing to accept in order to push new features. For example, if your application has 90% uptime, that means you can afford up to 36.5 days of downtime per year, or 72 hours per month. You can choose to spend this downtime on fixing issues or on improving system reliability to allow for more feature releases. The decision is yours.
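The arithmetic above can be sketched in a few lines. This is a minimal helper, assuming a 30-day (720-hour) month as in the example; the function name is mine, not from any standard library.

```python
def error_budget_hours(slo_pct, period_hours=720):
    """Allowed downtime (hours) for an availability SLO over a period.

    Defaults to a 30-day (720-hour) month, matching the example above.
    """
    return period_hours * (1 - slo_pct / 100)

# 90% uptime over a 30-day month allows 72 hours of downtime:
print(round(error_budget_hours(90), 1))            # 72.0
# A 99.9% SLO leaves only about 43 minutes per month:
print(round(error_budget_hours(99.9) * 60, 1))     # 43.2
```

Tightening the SLO by one "nine" shrinks the budget by a factor of ten, which is why each extra nine is so much more expensive than the last.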
The key is to pause feature releases until your error budget is replenished. This gives developers and operators a shared, objective signal for when to prioritize reliability work over new features.
How can you ensure your services are reliable?
Rolling out changes gradually
Remove single points of failure
Reduce TTD (Time-To-Detect)
Reduce TTR (Time-To-Resolution)
Increase TTF / TBF (Time-To-Failure / Time-Between-Failures) - the expected time until, or between, failures
Enhance Operational Efficiency
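The first practice above, rolling out changes gradually, is often implemented as a canary release: route a small fraction of traffic to the new version and abort if error rates exceed the budget. A minimal sketch, where `observed_error_rate` is a hypothetical hook into your monitoring system:

```python
def canary_rollout(stages, observed_error_rate, budget=0.01):
    """Shift traffic to a new release in stages, aborting on bad signals.

    `observed_error_rate(pct)` stands in for a monitoring query returning
    the error rate while `pct`% of traffic hits the new version.
    """
    for pct in stages:
        if observed_error_rate(pct) > budget:
            return ("rolled_back", pct)   # stop the rollout at this stage
    return ("released", stages[-1])

# A healthy release passes every stage:
print(canary_rollout([1, 5, 25, 100], lambda pct: 0.001))
# → ('released', 100)

# One that misbehaves once it sees 25% of traffic is rolled back there:
print(canary_rollout([1, 5, 25, 100],
                     lambda pct: 0.2 if pct >= 25 else 0.001))
# → ('rolled_back', 25)
```

Starting at 1% limits the blast radius: a bad release burns only a sliver of the error budget before it is caught.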
Issue occurs → (TTD: Time-To-Detect) → issue detected → (TTR: Time-To-Resolution) → issue resolved
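Given incident timestamps, TTD and TTR fall out directly. A small sketch (note the assumption: TTR is counted here from detection to resolution; some teams count it from the start of the incident):

```python
from datetime import datetime

def ttd_ttr_minutes(started, detected, resolved, fmt="%H:%M"):
    """TTD and TTR in minutes from incident timestamps.

    TTR is measured from detection; adjust if your team counts it
    from the moment the incident began.
    """
    t0, t1, t2 = (datetime.strptime(t, fmt) for t in (started, detected, resolved))
    return (t1 - t0).seconds // 60, (t2 - t1).seconds // 60

# Outage begins at 09:00, the alert fires at 09:12, the fix lands at 09:47:
print(ttd_ttr_minutes("09:00", "09:12", "09:47"))   # (12, 35)
```

Tracking these per incident lets you see whether monitoring improvements (lower TTD) or automation and runbooks (lower TTR) deserve investment next.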
Wheel of Misfortune: The "Wheel of Misfortune" is a role-playing exercise where a past postmortem is re-enacted. Engineers take on specific roles outlined in the original postmortem, simulating the incident to improve understanding, communication, and response strategies.
The Four Golden Signals
Four fundamental signals to monitor for any user-facing service: saturation, latency, errors, and traffic.
Saturation: Saturation refers to the capacity limits of your service. This could be metrics like CPU utilization or memory usage. Define what saturation means for your service by identifying the point at which it could fail. Measure and monitor metrics that indicate when you're approaching this threshold.
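A minimal saturation check might look like the following. The 80% warning threshold is illustrative only; the right value comes from load testing your own service.

```python
def saturation(current, capacity, warn_at=0.8):
    """Utilization as a fraction of capacity, plus a warning flag.

    `warn_at=0.8` is an illustrative threshold, not a universal rule.
    """
    utilization = current / capacity
    return round(utilization, 2), utilization >= warn_at

# 3200 MB used of 4000 MB RAM: at 80% we are nearing the failure point.
print(saturation(3200, 4000))   # (0.8, True)
```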
Latency: Latency is critical because users today demand fast applications, so monitoring it is essential. At Google, latency is measured using percentiles (for example, the 50th, 95th, and 99th) rather than a single average.
Tip: Avoid using averages for latency metrics, as they can mask outliers and fail to represent user experiences accurately.
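The tip above is easy to demonstrate. A nearest-rank percentile sketch (fine for a dashboard; production systems usually compute quantiles from histograms instead of raw samples):

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

latency_ms = [12, 15, 11, 240, 14, 13, 16, 12, 15, 900]
# The mean here is 124.8 ms, which misrepresents everyone:
print(percentile(latency_ms, 50))   # 14  -- the typical user is fine
print(percentile(latency_ms, 99))   # 900 -- tail users suffer badly
```

The average (124.8 ms) describes no actual user: most saw ~14 ms, while the slowest saw 900 ms. Percentiles expose that split.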
Errors: Errors indicate failures in serving traffic. These are often measured in Errors Per Second (EPS) to track the rate of failures over time.
Traffic: Traffic is typically measured in Requests Per Second (RPS) or Queries Per Second (QPS), reflecting the workload your service is handling.
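Both rates derive from the same counters. A minimal sketch, assuming you sample request and error counters over a fixed window:

```python
def qps_and_eps(requests, errors, window_s):
    """Derive Requests-Per-Second and Errors-Per-Second from counters
    sampled over a time window of `window_s` seconds."""
    return requests / window_s, errors / window_s

# 18,000 requests and 90 errors observed in a 60-second window:
qps, eps = qps_and_eps(requests=18_000, errors=90, window_s=60)
print(qps, eps)   # 300.0 1.5
```

Dividing the two also gives the error ratio (here 0.5%), which is usually what an SLO is written against.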
Valid Monitoring Outputs
Alerts: Alerts are for urgent issues that require immediate human action to prevent system failure or degradation.
Tickets: Tickets indicate issues that need human attention but are not urgent. Unlike alerts, tickets can be addressed with sufficient lead time.
Logs: Logs are diagnostic tools used for postmortems, forensic analysis, and troubleshooting purposes.
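The three outputs form a simple decision ladder. A sketch of an illustrative routing policy (the two boolean flags are my own simplification of a real alerting pipeline):

```python
def route(urgent, needs_human):
    """Pick the monitoring output for an event (illustrative policy)."""
    if urgent:
        return "alert"    # page a human now
    if needs_human:
        return "ticket"   # handle within the lead time
    return "log"          # keep for postmortems and forensics

print(route(urgent=True, needs_human=True))     # alert
print(route(urgent=False, needs_human=True))    # ticket
print(route(urgent=False, needs_human=False))   # log
```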
Defense in Depth: Failures are inevitable; accept this and design your system to tolerate them. Implement strategies to automatically handle and recover from point failures without human intervention. A fault-tolerant design reduces single points of failure, making your system more resilient.
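One common layer of such a defense is replica failover: if one node is down, the request automatically tries the next. A minimal sketch, with plain callables standing in for service endpoints:

```python
def call_with_failover(replicas, request):
    """Try each replica in turn so one failed node doesn't fail the request.

    `replicas` is a list of callables standing in for service endpoints.
    """
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as err:
            last_error = err   # tolerate the point failure, try the next
    raise last_error           # every layer of defense failed

def dead(_request):
    raise ConnectionError("replica-1 unreachable")

def healthy(request):
    return "200 OK for " + request

# The first replica is down, but the user never notices:
print(call_with_failover([dead, healthy], "GET /"))   # 200 OK for GET /
```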
Graceful Degradation: Graceful degradation ensures that your system can handle failures without a complete breakdown. For example, a service might serve stale cached content when a backend times out, or temporarily disable non-essential features under heavy load.
These automated responses maintain high availability and usability, minimizing the need for human intervention.
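The cached-content fallback can be sketched in a few lines. This is an illustrative pattern, not a production cache; the names are mine:

```python
def serve(fetch_live, cache):
    """Serve live data, but fall back to the last good response on failure
    instead of returning an error to the user."""
    try:
        fresh = fetch_live()
        cache["last_good"] = fresh   # remember for the next outage
        return fresh, "live"
    except TimeoutError:
        return cache.get("last_good", "temporarily unavailable"), "degraded"

cache = {"last_good": "<cached homepage>"}

def backend_is_down():
    raise TimeoutError

# The backend times out, yet the user still gets a page:
print(serve(backend_is_down, cache))   # ('<cached homepage>', 'degraded')
```

Stale content beats an error page: the user gets something useful, and no one is paged at 3 a.m. for a transient backend hiccup.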