Observability

Mohamed Raffi

Published May 11, 2024

Historically, IT monitoring focused mainly on tracking server uptime and resolving downtime quickly, either through automation or through manual intervention by operations engineers. However, keeping servers running doesn't necessarily mean services are accessible to users.

With advancements in Observability and the adoption of Site Reliability Engineering (SRE) practices, the concept of Multi-Layer Observability has become increasingly vital. This concept underpins next-generation AIOps, which integrates AI and ML models, including sophisticated language models.

When discussing Observability, it's often described merely as Metrics, Logs, and Traces. Similarly, SRE is typically associated with SLI, SLO, SLA, and Error Budgets.

These labels only deliver value when the practices behind them are implemented properly, and doing so can reshape your operational model. Let's delve into the key components:

Layer 1: Business Layer

Observability extends beyond IT to business. Unlike IT, standard business metrics aren’t readily available and need to be developed based on industry specifics. Key metrics might include quarterly results like subscription numbers and product revenues, alongside critical Business KPIs such as user engagement and order volumes.

Layer 2: Service Chain Layer

For underperforming products, such as a mobile offering, it's essential to examine the service chain to determine the cause of poor performance. This could be broken down into key business processes like Explore to Order / Lead to Order, Order to Activate, Usage to Cash, and Trouble to Resolve, with tailored metrics for each segment to monitor performance effectively.

Explore to Order / Lead to Order: Am I getting orders as expected?

Order to Activate: Are orders being activated on time?

Usage to Cash: Am I making money from usage, and which offerings is the revenue coming from?

Trouble to Resolve: Is my customer facing an issue, or likely to face one? How can I fix it proactively?
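As an illustration of a tailored metric for one segment, the sketch below computes an on-time activation rate for Order to Activate. The `Order` record, its field names, the sample data, and the 24-hour target are all assumptions made for this example; the article does not prescribe a data model or a threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Order:
    """Hypothetical order record; field names are illustrative only."""
    order_id: str
    ordered_at: datetime
    activated_at: Optional[datetime]  # None if the order is not activated yet


def on_time_activation_rate(orders: List[Order], target: timedelta) -> float:
    """Fraction of orders activated within `target` of being placed."""
    if not orders:
        return 0.0
    on_time = sum(
        1
        for o in orders
        if o.activated_at is not None and o.activated_at - o.ordered_at <= target
    )
    return on_time / len(orders)


# Made-up sample data: two on-time orders, one late, one still pending.
t0 = datetime(2024, 5, 1, 9, 0)
orders = [
    Order("A1", t0, t0 + timedelta(hours=2)),
    Order("A2", t0, t0 + timedelta(hours=30)),
    Order("A3", t0, None),
    Order("A4", t0, t0 + timedelta(hours=20)),
]

rate = on_time_activation_rate(orders, target=timedelta(hours=24))
print(f"Order to Activate on-time rate: {rate:.0%}")
```

Pending orders count against the rate here, which is a design choice: it makes a stuck activation queue visible rather than hiding it until the orders eventually complete.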

Layer 3: Accountability Layer

It's crucial to define who is accountable when issues arise, for example during the Order to Activate process in the Mobile business line. In real-world operations, with hundreds of systems and thousands of API endpoints, it’s vital to establish feedback loops with the Agile teams that own those systems so that root causes are addressed and eliminated, which in turn improves SLO and SLA attainment.
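One way to make accountability concrete is an ownership registry that maps each API endpoint to a business process and an owning team, so that SLO breaches can be routed to the right feedback loop. The sketch below assumes such a registry exists; the endpoint paths, team names, SLI values, and the 99.5% target are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical ownership registry: endpoint -> (business process, owning team).
OWNERS: Dict[str, Tuple[str, str]] = {
    "/orders/submit": ("Explore to Order", "team-sales-channels"),
    "/provisioning/activate": ("Order to Activate", "team-provisioning"),
    "/sim/assign": ("Order to Activate", "team-inventory"),
}


def breaches_by_team(
    sli_by_endpoint: Dict[str, float], slo_target: float
) -> Dict[str, List[Tuple[str, str, float]]]:
    """Group endpoints whose measured SLI is below the SLO target by owning team."""
    result = defaultdict(list)
    for endpoint, sli in sli_by_endpoint.items():
        if sli < slo_target:
            # Endpoints missing from the registry are surfaced as "unowned".
            process, team = OWNERS.get(endpoint, ("unknown", "unowned"))
            result[team].append((process, endpoint, sli))
    return dict(result)


# Made-up availability SLIs (fraction of successful requests).
measured = {
    "/orders/submit": 0.999,
    "/provisioning/activate": 0.981,
    "/sim/assign": 0.9995,
    "/billing/rate": 0.97,
}

report = breaches_by_team(measured, slo_target=0.995)
for team, items in report.items():
    for process, endpoint, sli in items:
        print(f"{team}: {endpoint} ({process}) SLI={sli:.3f}")
```

Reporting unregistered endpoints under an explicit "unowned" bucket keeps gaps in the registry visible, since an endpoint without an owner has no feedback loop at all.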

Layer 4: Traceability Layer

This layer aims to quickly pinpoint issues. It includes sub-tracks such as Application Performance Monitoring (APM) to identify bottlenecks, Centralized Log Monitoring (CLM) to analyze logs across the value chain, and Cross-Domain Correlation to swiftly identify causes across applications, infrastructure, and networks.

A. Application Performance Monitoring (APM): to identify where the bottleneck is, whether in the application, the database, or elsewhere.

B. Centralized Log Monitoring (CLM): to analyze logs by traceability ID across the value chain, from the channel through every downstream system.

C. Cross-Domain Correlation between App, Infra, and Network: to quickly correlate events across applications, infrastructure, and networks and find the cause.
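The idea behind following a traceability ID across systems can be sketched as grouping centralized log lines by that ID and finding the first error in each trace. The JSON log format, field names (`ts`, `system`, `trace_id`, `level`, `msg`), and system names below are assumptions for illustration, not the output of any particular logging product.

```python
import json
from collections import defaultdict
from typing import Dict, List, Optional

# Made-up log lines from several systems, deliberately out of order.
RAW_LOGS = [
    '{"ts": "2024-05-01T09:00:03", "system": "provisioning", "trace_id": "t-1", "level": "ERROR", "msg": "activation timeout"}',
    '{"ts": "2024-05-01T09:00:00", "system": "mobile-app", "trace_id": "t-1", "level": "INFO", "msg": "order submitted"}',
    '{"ts": "2024-05-01T09:00:06", "system": "order-mgmt", "trace_id": "t-2", "level": "INFO", "msg": "order accepted"}',
    '{"ts": "2024-05-01T09:00:01", "system": "order-mgmt", "trace_id": "t-1", "level": "INFO", "msg": "order accepted"}',
    '{"ts": "2024-05-01T09:00:05", "system": "mobile-app", "trace_id": "t-2", "level": "INFO", "msg": "order submitted"}',
]


def group_by_trace(lines: List[str]) -> Dict[str, List[dict]]:
    """Parse JSON log lines and group them by trace ID, ordered by timestamp."""
    traces = defaultdict(list)
    for line in lines:
        event = json.loads(line)
        traces[event["trace_id"]].append(event)
    for events in traces.values():
        # Same-format ISO-8601 timestamps sort chronologically as strings.
        events.sort(key=lambda e: e["ts"])
    return dict(traces)


def first_error(events: List[dict]) -> Optional[dict]:
    """Return the earliest ERROR event in a time-ordered trace, if any."""
    return next((e for e in events if e["level"] == "ERROR"), None)


traces = group_by_trace(RAW_LOGS)
for trace_id, events in sorted(traces.items()):
    path = " -> ".join(e["system"] for e in events)
    err = first_error(events)
    status = f"failed at {err['system']}: {err['msg']}" if err else "ok"
    print(f"{trace_id}: {path} [{status}]")
```

The same grouping step extends to cross-domain correlation if infrastructure and network events carry the trace ID, or a key that can be joined to it, though that depends on how each domain is instrumented.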

Conclusion

Implementing these layers effectively makes it possible to reorganize operations teams into more streamlined SRE teams aligned with business processes and cross-domain functions. This structure optimizes operations across the many application and infrastructure components that make up each business process, particularly in cloud environments. The result can be better operational efficiency, a better customer experience, and lower cost.

Future articles will explore how Generative AI can aid in guided troubleshooting, further enhancing operational efficiency.

