OBSERVABILITY: PROGRAMMING FOR OBSERVABILITY
Imagine this scenario in our fictitious movie theater ticket site: You deployed what you thought was bulletproof code—an Azure Function handling movie ticket purchases. Clean unit tests, perfect logic, solid error handling. Then 2 AM happens.
Users couldn't buy tickets. Logs showed "everything's working fine," but customers were getting 500 errors. You spent hours playing detective in the dark, committing code changes that, curiously, included lots of lines that say logger.debug("Got here"), trying to isolate where the problem was. It turned out the Cosmos DB connection was timing out under load, but there was no way to see it until one of those "here" statements finally stopped appearing.
That’s a common scenario, and one I am guilty of myself (though in my first commit I would use “Got here #1”, #2, and #3, but that’s an aside).
Why Code for Observability from Day One
Most developers think about observability after deployment, but that's like installing seatbelts after a crash. When you're building an Azure Function that processes ticket purchases, every line of code should answer future questions: What happened? Why did it fail? Where's the bottleneck?
The mistake I see repeatedly is developers adding logging after something breaks, desperately trying to figure out what went wrong. Instead, instrument predictively at every critical decision point.
Logging: Your System's Medical Chart
When to Log:
Standard Severity Levels
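Python's logging module already defines the conventional hierarchy. As a sketch, the levels might map to ticket-purchase events like this (the example messages are illustrative):

```python
import logging

logger = logging.getLogger("ticket-purchase")

logger.debug("cache lookup for showtime 1830")     # diagnostic detail, usually off in prod
logger.info("ticket_purchase_started")             # normal business events worth keeping
logger.warning("payment retry 2 of 3")             # degraded but still succeeding
logger.error("payment declined, purchase failed")  # a single operation failed
logger.critical("Cosmos DB unreachable")           # the service cannot take orders at all

# Severity is ordered, so a handler set to WARNING drops DEBUG and INFO noise
assert logging.DEBUG < logging.INFO < logging.WARNING < logging.ERROR < logging.CRITICAL
```

The payoff of sticking to the standard levels is that the log-level knob (see the debug-logging tip below) filters consistently across every library in the process.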
Custom Metrics
What to Track as Metrics:
Example (python)

# host.json: keep normal INFO; use sampling for traces
# local.settings.json / app config: APPLICATIONINSIGHTS_CONNECTION_STRING
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace, metrics
from opentelemetry.instrumentation.requests import RequestsInstrumentor
import logging
import time

configure_azure_monitor()            # traces, metrics, logs exporter
RequestsInstrumentor().instrument()  # outbound HTTP auto-trace

logger = logging.getLogger("ticket-purchase")
meter = metrics.get_meter("ticket-purchase")
tx_count = meter.create_counter("tickets_sold_total")
tx_latency = meter.create_histogram("transaction_duration_ms")

def purchase_ticket(req):
    start = time.time()
    with trace.get_tracer("ticket-purchase").start_as_current_span("purchase"):
        # Attributes for correlation (low cardinality)
        attrs = {"movie": "Top Gun", "theater": "AMC15"}
        logger.info("ticket_purchase_started", extra={"attrs": attrs})
        # ... business logic; SDK calls instrumented automatically ...
        tx_count.add(1, attributes=attrs)
        tx_latency.record((time.time() - start) * 1000, attributes=attrs)
        logger.info("ticket_purchase_completed", extra={"attrs": attrs})
Enable Debug Logging Temporarily: Set AzureFunctionsJobHost__logging__LogLevel__Default to Debug in function settings, then revert to Info post-troubleshooting.
Distributed Tracing: Your Request's GPS
In serverless architectures, one ticket purchase spans your Azure Function, Cosmos DB, payment APIs, and notification services. Without distributed tracing, troubleshooting feels like following breadcrumbs in a forest.
Implementation Strategy:
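configure_azure_monitor() and the SDK instrumentations handle context propagation for you, but the core mechanism is worth understanding: the W3C traceparent header carries one trace ID across every hop. A hand-rolled sketch (the helper and IDs are illustrative, not part of any SDK):

```python
import secrets

def make_traceparent(trace_id=None):
    # W3C trace context format: version-traceid-spanid-flags ("00" = version, "01" = sampled)
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars, shared by the whole request
    span_id = secrets.token_hex(8)                # 16 hex chars, new for every hop
    return f"00-{trace_id}-{span_id}-01"

# The Function receives (or starts) a trace...
incoming = make_traceparent()
trace_id = incoming.split("-")[1]

# ...and every outbound call (payment API, notification service) reuses the trace ID
payment_header = make_traceparent(trace_id)
assert payment_header.split("-")[1] == trace_id  # same journey, new span
```

Because Cosmos DB, HTTP, and queue SDK calls are auto-instrumented, you rarely build these headers yourself; the sketch just shows what stitches one purchase's spans into a single end-to-end trace in Application Insights.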
Blackbox Monitoring: Testing Like Your Customers
While your internal metrics (whitebox monitoring) show how your system works, blackbox monitoring reveals whether it works from the user's perspective.
Whitebox vs Blackbox: Whitebox monitoring reads signals your code emits from the inside (logs, metrics, traces); blackbox monitoring probes the system from the outside, the way a customer would.
Practical Implementation: Create synthetic transactions that mimic real user behavior—automated scripts that attempt ticket purchases for test seats (like ZZ99), validate the complete flow, and alert when the end-to-end experience fails. This catches issues your internal metrics might miss, like authentication problems or UI bugs.
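A minimal sketch of such a probe, with the HTTP call injected so the same logic runs against the real site in production and a stub in tests (the seat convention, latency budget, and result shape are assumptions for illustration):

```python
import time

PROBE_SEAT = "ZZ99"    # dedicated test seat so probes never compete with real customers
LATENCY_SLO_MS = 3000  # illustrative end-to-end budget for the full purchase flow

def synthetic_purchase(purchase_fn):
    """Run one end-to-end probe; purchase_fn wraps the real HTTP purchase call."""
    start = time.time()
    try:
        result = purchase_fn(movie="Top Gun", seat=PROBE_SEAT)
        ok = result.get("confirmed") is True
    except Exception:
        ok = False  # any exception is a failed user experience, whatever the logs say
    latency_ms = (time.time() - start) * 1000
    return {"ok": ok,
            "latency_ms": latency_ms,
            "page": (not ok) or latency_ms > LATENCY_SLO_MS}

# Against a healthy (stubbed) backend, the probe passes and nobody gets paged
report = synthetic_purchase(lambda **kw: {"confirmed": True})
```

Run it on a schedule (a timer-triggered Function works well) and alert on the "page" flag rather than on any single internal metric.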
Planning Observability from Story Definition
Treat observability like security—build it into requirements, not bolt it on later. When writing user stories for ticket purchasing, spell out in the acceptance criteria what the feature must reveal about itself: which events are logged, which metrics are emitted, and how the flow can be traced end to end.
Design observability endpoints and hooks as part of your architecture. Validate all observability paths in unit tests, integration tests, and synthetic monitoring before deployment.
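For instance, a unit test can assert the logging contract directly; Python's unittest.assertLogs makes this cheap. (The handler below is a simplified stand-in for the real Azure Function, used only to show the testing pattern.)

```python
import logging
import unittest

logger = logging.getLogger("ticket-purchase")

def purchase_ticket(seat):
    # Simplified stand-in for the real handler
    logger.info("ticket_purchase_started seat=%s", seat)
    return {"confirmed": True}

class ObservabilityContractTest(unittest.TestCase):
    def test_purchase_emits_start_event(self):
        # The log event is part of the story's acceptance criteria,
        # so its absence fails the build, not a 2 AM investigation
        with self.assertLogs("ticket-purchase", level="INFO") as captured:
            purchase_ticket("A7")
        self.assertTrue(any("ticket_purchase_started" in line
                            for line in captured.output))
```

If a refactor silently drops the log statement, this test fails long before the missing signal matters in production.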
Alert Strategy That Prevents 2 AM Pages
Focus alerts on actionable problems: failed transactions, unusual error rates, database capacity issues, payment gateway failures. Each alert should include trace IDs enabling immediate investigation.
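As a sketch of "actionable, with trace IDs attached," an alert evaluation over a recent request window might look like this (the threshold and record shape are illustrative, not from any particular monitoring product):

```python
ERROR_RATE_THRESHOLD = 0.05  # illustrative: page above 5% failures in the window

def evaluate_error_rate(window):
    """window: recent request records shaped like {'success': bool, 'trace_id': str}."""
    failures = [r for r in window if not r["success"]]
    rate = len(failures) / max(len(window), 1)
    if rate > ERROR_RATE_THRESHOLD:
        # Ship sample trace IDs with the page so investigation starts immediately
        return {"alert": True,
                "error_rate": round(rate, 3),
                "sample_trace_ids": [r["trace_id"] for r in failures[:5]]}
    return {"alert": False}
```

The design choice is the payload, not the math: a page that already contains trace IDs lets the responder jump straight to the failing requests instead of starting with a blank query.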
Document every alert with runbooks explaining what it means and how to resolve it—your future 2 AM self will thank you.
The Bottom Line
The goal isn't perfect visibility from day one. It's building systems that can tell you their own story when things go sideways. Start with structured logging at critical points, add one business metric, implement basic tracing, and evolve from there.
Your production incidents become learning opportunities instead of emergencies when your code is designed to be observable.