Observability
Historically, IT monitoring focused mainly on tracking server uptime and quickly resolving downtime, either through automation or manual intervention by operations engineers. However, keeping servers running doesn't necessarily mean services are accessible to users.
With advancements in Observability and the adoption of Site Reliability Engineering (SRE) practices, the concept of Multi-Layer Observability has become increasingly vital. This concept underpins next-generation AIOps, which integrates AI and ML models, including sophisticated language models.
When discussing Observability, it's often described merely as Metrics, Logs, and Traces. Similarly, SRE is typically associated with SLI, SLO, SLA, and Error Budgets.
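To make the SRE terms concrete, here is a minimal sketch of how an SLI, an SLO, and an error budget relate. The function name, the 99.9% target, and the request counts are illustrative assumptions, not figures from a real system.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Relate a measured SLI to its SLO and report the remaining error budget.

    All inputs are hypothetical; a real setup would pull them from a metrics store.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # SLI: the fraction of requests that succeeded in the measurement window.
    sli = (total_requests - failed_requests) / total_requests

    # Error budget: the number of failures the SLO tolerates in that window.
    allowed_failures = (1.0 - slo_target) * total_requests
    budget_remaining = allowed_failures - failed_requests

    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "budget_remaining": budget_remaining,
    }


# Assumed example: a 99.9% availability SLO over one million requests.
report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(report)
```

A negative `budget_remaining` would indicate the SLO has been breached for the window, which is typically the signal to prioritize reliability work over new features.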
These concepts only deliver their full value when they are implemented properly, and doing so can transform your operational model. Let's delve into the key components:
Layer 1: Business Layer
Observability extends beyond IT to the business. Unlike IT metrics, standard business metrics aren’t readily available and need to be developed based on industry specifics. Key metrics might include quarterly results like subscription numbers and product revenues, alongside critical business KPIs such as user engagement and order volumes.
Layer 2: Service Chain Layer
For underperforming products, such as a mobile offering, it's essential to examine the service chain to determine the cause of poor performance. This could be broken down into key business processes like Explore to Order / Lead to Order, Order to Activate, Usage to Cash, and Trouble to Resolve, with tailored metrics for each segment to monitor performance effectively.
Explore to Order / Lead to Order: Am I getting orders as expected?
Order to Activate: Are orders being activated on time?
Usage to Cash: Am I making money from it, and which offerings generate the revenue?
Trouble to Resolve: Is my customer facing an issue, or likely to face one? How can I fix it proactively?
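As an illustration of a tailored metric for one segment, the sketch below computes an on-time activation rate for Order to Activate. The `Order` fields, the 24-hour target, and the sample orders are assumptions made for the example; real targets would come from the business.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Order:
    # Hypothetical order record; field names are illustrative.
    order_id: str
    placed_at: datetime
    activated_at: Optional[datetime]  # None while activation is still pending


def on_time_activation_rate(orders: List[Order], target: timedelta) -> float:
    """Fraction of orders activated within `target` of being placed.

    Orders that are not yet activated count as not on time.
    """
    if not orders:
        return 0.0
    on_time = sum(
        1
        for o in orders
        if o.activated_at is not None and o.activated_at - o.placed_at <= target
    )
    return on_time / len(orders)


t0 = datetime(2024, 1, 1, 9, 0)
orders = [
    Order("A1", t0, t0 + timedelta(hours=2)),   # on time
    Order("A2", t0, t0 + timedelta(hours=30)),  # late
    Order("A3", t0, None),                      # still pending
    Order("A4", t0, t0 + timedelta(hours=20)),  # on time
]
rate = on_time_activation_rate(orders, target=timedelta(hours=24))
print(f"On-time activation rate: {rate:.0%}")
```

The same pattern (a record per business event plus a target) could be repeated for the other segments, for example lead conversion rate for Lead to Order or mean time to resolve for Trouble to Resolve.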
Layer 3: Accountability Layer
It's crucial to define accountability when issues arise, such as during the Order to Activate process in the Mobile business line. In real-world operations, with hundreds of systems and thousands of API endpoints, it’s vital to establish feedback loops with Agile teams to address and eliminate root causes, thus improving SLO and SLA attainment.
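One way to make accountability explicit is an ownership registry that maps each API endpoint to a team and an SLO, so that breaches can be routed to the right Agile team. The sketch below assumes hypothetical endpoint paths, team names, and SLO values; none of them come from a real system.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical ownership registry: endpoint -> owning team and SLO target.
OWNERSHIP = {
    "/orders/validate": {"team": "order-capture-squad", "slo": 0.999},
    "/orders/activate": {"team": "activation-squad", "slo": 0.995},
    "/sim/provision": {"team": "provisioning-squad", "slo": 0.99},
}


def breaches_by_team(measured_sli: Dict[str, float]) -> Dict[str, List[str]]:
    """Group endpoints whose measured SLI is below their SLO by owning team.

    Endpoints missing from the registry are reported under "unowned" so that
    gaps in accountability are visible too.
    """
    breaches = defaultdict(list)
    for endpoint, sli in measured_sli.items():
        owner = OWNERSHIP.get(endpoint)
        if owner is None:
            breaches["unowned"].append(endpoint)
        elif sli < owner["slo"]:
            breaches[owner["team"]].append(endpoint)
    return dict(breaches)


# Assumed measurements for one reporting window.
result = breaches_by_team(
    {
        "/orders/validate": 0.9995,  # meets its SLO
        "/orders/activate": 0.991,   # below 0.995
        "/sim/provision": 0.97,      # below 0.99
        "/billing/rate": 0.999,      # not in the registry
    }
)
print(result)
```

Feeding such a report into each team's backlog is one possible form of the feedback loop described above.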
Layer 4: Traceability Layer
This layer aims to quickly pinpoint issues. It includes sub-tracks such as Application Performance Monitoring (APM) to identify bottlenecks, Centralized Log Monitoring (CLM) to analyze logs across the value chain, and Cross-Domain Correlation to swiftly identify causes across applications, infrastructure, and networks.
A. Application Performance Monitoring (APM): to identify where the bottleneck is, whether in the application, the database, or elsewhere.
B. Centralized Log Monitoring (CLM): to analyze logs by traceability ID across the value chain, from the channel through all downstream systems.
C. Cross-Domain Correlation between App, Infra and Network: to quickly correlate events across application, infrastructure, and network and find the cause.
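The log analysis by traceability ID in item B can be sketched as follows: group centralized log records by trace ID, order them by time, and report the first system that logged an error for each trace. The record fields (`trace_id`, `ts`, `system`, `level`) and the system names are assumptions for illustration, not the schema of any particular logging product.

```python
from collections import defaultdict
from typing import Dict, List


def first_failure_by_trace(log_records: List[dict]) -> Dict[str, str]:
    """Return, per trace ID, the earliest system that logged an ERROR.

    Traces without any ERROR record are omitted from the result.
    """
    by_trace = defaultdict(list)
    for rec in log_records:
        by_trace[rec["trace_id"]].append(rec)

    failures = {}
    for trace_id, recs in by_trace.items():
        # Order the hops chronologically so the first error is the likely origin.
        for rec in sorted(recs, key=lambda r: r["ts"]):
            if rec["level"] == "ERROR":
                failures[trace_id] = rec["system"]
                break
    return failures


# Hypothetical records as they might arrive, out of order, from several systems.
logs = [
    {"trace_id": "t-1", "ts": 3, "system": "billing", "level": "ERROR"},
    {"trace_id": "t-1", "ts": 1, "system": "mobile-app", "level": "INFO"},
    {"trace_id": "t-1", "ts": 2, "system": "order-mgmt", "level": "ERROR"},
    {"trace_id": "t-2", "ts": 1, "system": "mobile-app", "level": "INFO"},
]
failures = first_failure_by_trace(logs)
print(failures)
```

In this sample, the billing error on trace `t-1` is downstream of the earlier order-mgmt error, so the sketch points at order-mgmt as the place to start investigating.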
Conclusion
Implementing these layers effectively facilitates the reorganization of operations teams into more streamlined SRE teams aligned with business processes and cross-domain functions. This structure optimizes operations across multiple application and infrastructure components within business processes, particularly in cloud environments. The result can be better operational efficiency, a better customer experience, and lower cost.
Future articles will explore how Generative AI can aid in guided troubleshooting, further enhancing operational efficiency.