OBSERVABILITY: PROGRAMMING FOR OBSERVABILITY
Imagine this scenario in our fictitious movie theater ticket site: You deployed what you thought was bulletproof code—an Azure Function handling movie ticket purchases. Clean unit tests, perfect logic, solid error handling. Then 2 AM happens.
Users couldn't buy tickets. Logs showed "everything's working fine," but customers were getting 500 errors. You spent hours playing detective in the dark, committing code changes that, curiously, included lots of lines that say logger.debug("Got here"), trying to isolate where the problem was. It turned out the Cosmos DB connection was timing out under load, but there was no way to see it until one of those "here" statements finally stopped appearing.
That’s a common scenario, and one I am guilty of myself (though in my first commit I would use “Got here #1”, #2, and #3, but that’s an aside).
Why Code for Observability from Day One
Most developers think about observability after deployment, but that's like installing seatbelts after a crash. When you're building an Azure Function that processes ticket purchases, every line of code should answer future questions: What happened? Why did it fail? Where's the bottleneck?
The mistake I see repeatedly is developers adding logging after something breaks, desperately trying to figure out what went wrong. Instead, instrument predictively at every critical decision point.
Logging: Your System's Medical Chart
When to Log:
Standard Severity Levels
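Python's logging module already defines the conventional hierarchy. As a sketch, the levels might map to ticket-purchase events like this (the example messages are illustrative):

```python
import logging

logger = logging.getLogger("ticket-purchase")

logger.debug("cache lookup for showtime 1830")     # diagnostic detail, usually off in prod
logger.info("ticket_purchase_started")             # normal business events worth keeping
logger.warning("payment retry 2 of 3")             # degraded but still succeeding
logger.error("payment declined, purchase failed")  # a single operation failed
logger.critical("Cosmos DB unreachable")           # the service cannot take orders at all

# Severity is ordered, so a handler set to WARNING drops DEBUG and INFO noise
assert logging.DEBUG < logging.INFO < logging.WARNING < logging.ERROR < logging.CRITICAL
```

The payoff of sticking to the standard levels is that the log-level knob (see the debug-logging tip below) filters consistently across every library in the process.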
Custom Metrics
What to Track as Metrics:
Example (python)

# host.json: keep normal INFO; use sampling for traces
# local.settings.json / app config: APPLICATIONINSIGHTS_CONNECTION_STRING
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace, metrics
from opentelemetry.instrumentation.requests import RequestsInstrumentor
import logging
import time

configure_azure_monitor()            # traces, metrics, logs exporter
RequestsInstrumentor().instrument()  # outbound HTTP auto-trace

logger = logging.getLogger("ticket-purchase")
meter = metrics.get_meter("ticket-purchase")
tx_count = meter.create_counter("tickets_sold_total")
tx_latency = meter.create_histogram("transaction_duration_ms")

def purchase_ticket(req):
    start = time.time()
    with trace.get_tracer("ticket-purchase").start_as_current_span("purchase"):
        # Attributes for correlation (low cardinality)
        attrs = {"movie": "Top Gun", "theater": "AMC15"}
        logger.info("ticket_purchase_started", extra={"attrs": attrs})
        # ... business logic; SDK calls instrumented automatically ...
        tx_count.add(1, attributes=attrs)
        tx_latency.record((time.time() - start) * 1000, attributes=attrs)
        logger.info("ticket_purchase_completed", extra={"attrs": attrs})
Enable Debug Logging Temporarily: Set AzureFunctionsJobHost__logging__LogLevel__Default to Debug in function settings, then revert to Info post-troubleshooting.
Distributed Tracing: Your Request's GPS
In serverless architectures, one ticket purchase spans your Azure Function, Cosmos DB, payment APIs, and notification services. Without distributed tracing, troubleshooting feels like following breadcrumbs in a forest.
Implementation Strategy:
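configure_azure_monitor() and the SDK instrumentations handle context propagation for you, but the core mechanism is worth understanding: the W3C traceparent header carries one trace ID across every hop. A hand-rolled sketch (the helper and IDs are illustrative, not part of any SDK):

```python
import secrets

def make_traceparent(trace_id=None):
    # W3C trace context format: version-traceid-spanid-flags ("00" = version, "01" = sampled)
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars, shared by the whole request
    span_id = secrets.token_hex(8)                # 16 hex chars, new for every hop
    return f"00-{trace_id}-{span_id}-01"

# The Function receives (or starts) a trace...
incoming = make_traceparent()
trace_id = incoming.split("-")[1]

# ...and every outbound call (payment API, notification service) reuses the trace ID
payment_header = make_traceparent(trace_id)
assert payment_header.split("-")[1] == trace_id  # same journey, new span
```

Because Cosmos DB, HTTP, and queue SDK calls are auto-instrumented, you rarely build these headers yourself; the sketch just shows what stitches one purchase's spans into a single end-to-end trace in Application Insights.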
Blackbox Monitoring: Testing Like Your Customers
While your internal metrics (whitebox monitoring) show how your system works, blackbox monitoring reveals whether it works from the user's perspective.
Whitebox vs Blackbox: Whitebox monitoring reads signals your code emits from the inside (logs, metrics, traces); blackbox monitoring probes the system from the outside, the way a customer would.
Practical Implementation: Create synthetic transactions that mimic real user behavior—automated scripts that attempt ticket purchases for test seats (like ZZ99), validate the complete flow, and alert when the end-to-end experience fails. This catches issues your internal metrics might miss, like authentication problems or UI bugs.
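A minimal sketch of such a probe, with the HTTP call injected so the same logic runs against the real site in production and a stub in tests (the seat convention, latency budget, and result shape are assumptions for illustration):

```python
import time

PROBE_SEAT = "ZZ99"    # dedicated test seat so probes never compete with real customers
LATENCY_SLO_MS = 3000  # illustrative end-to-end budget for the full purchase flow

def synthetic_purchase(purchase_fn):
    """Run one end-to-end probe; purchase_fn wraps the real HTTP purchase call."""
    start = time.time()
    try:
        result = purchase_fn(movie="Top Gun", seat=PROBE_SEAT)
        ok = result.get("confirmed") is True
    except Exception:
        ok = False  # any exception is a failed user experience, whatever the logs say
    latency_ms = (time.time() - start) * 1000
    return {"ok": ok,
            "latency_ms": latency_ms,
            "page": (not ok) or latency_ms > LATENCY_SLO_MS}

# Against a healthy (stubbed) backend, the probe passes and nobody gets paged
report = synthetic_purchase(lambda **kw: {"confirmed": True})
```

Run it on a schedule (a timer-triggered Function works well) and alert on the "page" flag rather than on any single internal metric.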
Planning Observability from Story Definition
Treat observability like security—build it into requirements, not bolt it on later. When writing user stories for ticket purchasing, spell out in the acceptance criteria what the feature must reveal about itself: which events are logged, which metrics are emitted, and how the flow can be traced end to end.
Design observability endpoints and hooks as part of your architecture. Validate all observability paths in unit tests, integration tests, and synthetic monitoring before deployment.
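For instance, a unit test can assert the logging contract directly; Python's unittest.assertLogs makes this cheap. (The handler below is a simplified stand-in for the real Azure Function, used only to show the testing pattern.)

```python
import logging
import unittest

logger = logging.getLogger("ticket-purchase")

def purchase_ticket(seat):
    # Simplified stand-in for the real handler
    logger.info("ticket_purchase_started seat=%s", seat)
    return {"confirmed": True}

class ObservabilityContractTest(unittest.TestCase):
    def test_purchase_emits_start_event(self):
        # The log event is part of the story's acceptance criteria,
        # so its absence fails the build, not a 2 AM investigation
        with self.assertLogs("ticket-purchase", level="INFO") as captured:
            purchase_ticket("A7")
        self.assertTrue(any("ticket_purchase_started" in line
                            for line in captured.output))
```

If a refactor silently drops the log statement, this test fails long before the missing signal matters in production.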
Alert Strategy That Prevents 2 AM Pages
Focus alerts on actionable problems: failed transactions, unusual error rates, database capacity issues, payment gateway failures. Each alert should include trace IDs enabling immediate investigation.
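As a sketch of "actionable, with trace IDs attached," an alert evaluation over a recent request window might look like this (the threshold and record shape are illustrative, not from any particular monitoring product):

```python
ERROR_RATE_THRESHOLD = 0.05  # illustrative: page above 5% failures in the window

def evaluate_error_rate(window):
    """window: recent request records shaped like {'success': bool, 'trace_id': str}."""
    failures = [r for r in window if not r["success"]]
    rate = len(failures) / max(len(window), 1)
    if rate > ERROR_RATE_THRESHOLD:
        # Ship sample trace IDs with the page so investigation starts immediately
        return {"alert": True,
                "error_rate": round(rate, 3),
                "sample_trace_ids": [r["trace_id"] for r in failures[:5]]}
    return {"alert": False}
```

The design choice is the payload, not the math: a page that already contains trace IDs lets the responder jump straight to the failing requests instead of starting with a blank query.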
Document every alert with runbooks explaining what it means and how to resolve it—your future 2 AM self will thank you.
The Bottom Line
The goal isn't perfect visibility from day one. It's building systems that can tell you their own story when things go sideways. Start with structured logging at critical points, add one business metric, implement basic tracing, and evolve from there.
Your production incidents become learning opportunities instead of emergencies when your code is designed to be observable.