The Telemetry Brief — Issue: Sampling Without Losing the Plot

How to cut telemetry cost and keep the “one trace that matters” when prod is on fire

⏱️ Read time: ~6–8 min


Why this matters

Most teams start with “collect everything,” then costs climb, pipelines lag, and someone flips on aggressive sampling… and suddenly you can’t debug incidents because the critical traces are missing.

The goal isn’t “less telemetry.” It’s right telemetry — keeping high-signal traces while trimming noise.


TL;DR

If you only remember three things:

  1. Head sampling is cheap but blind.
  2. Tail sampling is smarter but needs buffering and sane limits.
  3. The best approach is usually hybrid: head sampling + tail sampling on errors/latency + always-keep for important routes/tenants.


1) The two sampling types (in plain English)

A) Head sampling (decide at the start)

What it does: chooses whether to keep a trace when the trace begins (usually in the SDK). Pros: simple, low overhead, great for high traffic. Cons: you don’t know if the request will error or be slow yet → you can drop the traces you’ll need most during incidents.

Use it for: baseline coverage (e.g., 5–20%) so you always have some visibility.
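
To make the baseline concrete, here is a minimal head-sampling sketch using the OpenTelemetry Python SDK; the service and route names are illustrative, and 10% is just the baseline used elsewhere in this issue.

# Head sampling: the keep/drop decision is made when the root span starts.
# ParentBasedTraceIdRatio keeps ~10% of new traces and lets downstream
# services inherit the parent's decision.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBasedTraceIdRatio

trace.set_tracer_provider(TracerProvider(sampler=ParentBasedTraceIdRatio(0.10)))

tracer = trace.get_tracer("api-gateway")
with tracer.start_as_current_span("GET /products") as span:
    span.set_attribute("http.route", "/products")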


B) Tail sampling (decide after seeing the whole trace)

What it does: the Collector buffers spans briefly, evaluates the completed trace, and keeps/drops based on what actually happened. Pros: you can keep errors, high latency, specific endpoints, VIP customers, etc. Cons: needs memory and careful limits; if misconfigured, it can drop data under pressure.

Use it for: preserving debuggability, e.g. “keep all 5xx and p99-latency traces.”


2) The “hybrid” sampling strategy that works in real life

Here’s a practical setup I’ve seen hold up across busy production systems:

✅ Baseline + Always Keep + Tail Rules

  • Baseline head sampling: 10% for general coverage
  • Always keep 100% for: critical routes (login, checkout, payments, auth) and Tier-0 tenants
  • Tail sample for: errors (5xx) and high-latency traces
  • Dynamic during incidents: temporarily raise the baseline to 25–50% for affected services and roll back later (see the sketch after this list)

This gives you a stable cost floor and incident-level depth where you need it.
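
For the “dynamic during incidents” step, it helps to make the baseline ratio configuration rather than code. Below is a minimal sketch assuming a hypothetical BASELINE_SAMPLE_RATIO environment variable that on-call can raise from 0.10 to 0.25 and restart the service; SDKs set up via standard auto-configuration can get a similar effect with the spec-defined OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG variables.

# Sketch: read the head-sampling ratio from an env var so it can be raised
# during an incident (e.g., 0.10 -> 0.25) without a code change.
# BASELINE_SAMPLE_RATIO is a hypothetical name, not an OTel-defined variable.
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBasedTraceIdRatio

ratio = float(os.getenv("BASELINE_SAMPLE_RATIO", "0.10"))
trace.set_tracer_provider(TracerProvider(sampler=ParentBasedTraceIdRatio(ratio)))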


3) The most common sampling mistakes (and fixes)

Mistake #1: Sampling only at the SDK and calling it “done”

Why it hurts: you’ll drop the traces you need most: the slow ones and the failing ones. ✅ Fix: add tail sampling rules in the Collector to “always keep” errors and slow traces.


Mistake #2: No “always-keep” guardrails for critical paths

Why it hurts: the “checkout is down” incident happens, and checkout traces are… missing. ✅ Fix: always keep key routes (attribute-based rules).


Mistake #3: Missing high-cardinality discipline

Why it hurts: costs explode and sampling becomes a band-aid. ✅ Fix: be intentional about attributes (a short sketch follows the list):

  • Keep high-cardinality values (like user.id) out of default span attributes
  • Prefer tenant.tier, region, plan, auth.method over unique identifiers
  • Put unique IDs into logs (with correlation) or controlled attributes only on critical spans
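
A short sketch of that hygiene in practice with the OpenTelemetry Python SDK; “order-service” and the handler are illustrative, and the attribute names follow the examples above.

# Low-cardinality attributes go on the span; the unique user ID goes into a
# log line that carries the trace_id, so you can still pivot from span to user.
import logging

from opentelemetry import trace

logger = logging.getLogger(__name__)
tracer = trace.get_tracer("order-service")

def handle_order(user_id: str, tenant_tier: str) -> None:
    with tracer.start_as_current_span("POST /orders") as span:
        # Bounded value sets: safe for sampling rules and aggregation.
        span.set_attribute("http.route", "/orders")
        span.set_attribute("tenant.tier", tenant_tier)
        # High-cardinality value stays out of span attributes; correlate via trace_id.
        trace_id = format(span.get_span_context().trace_id, "032x")
        logger.info("order received user_id=%s trace_id=%s", user_id, trace_id)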


4) Tail sampling rules you can actually start with

Below is an example OpenTelemetry Collector tail sampling policy set that’s production-friendly.

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 2000
    policies:
      # 1) Keep ALL errors
      - name: keep_errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      # 2) Keep slow traces (adjust threshold)
      - name: keep_high_latency
        type: latency
        latency:
          threshold_ms: 2000

      # 3) Keep 100% for critical endpoints (example)
      - name: keep_critical_routes
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/login", "/checkout", "/payment", "/oauth/callback"]
          enabled_regex_matching: false

      # 4) Keep more for Tier-0 tenants (example)
      - name: keep_tier0_tenants
        type: string_attribute
        string_attribute:
          key: tenant.tier
          values: ["tier0", "platinum"]

      # 5) Sample the rest at 10%
      - name: sample_everything_else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
        

Three practical notes:

  • Tail sampling depends on buffering. Tune decision_wait, num_traces, and expected_new_traces_per_sec based on real load.
  • Policies are OR’d: a trace is kept if any policy matches, so the probabilistic policy effectively sets the floor for traffic the other rules don’t catch.
  • If you’re under memory pressure, reduce the scope: tail sample only the “front door” services first.


5) Debuggability boosters (cheap wins)

Even with good sampling, these 3 add huge leverage:

✅ Consistent correlation IDs

  • Ensure you have a stable trace_id in logs via log correlation (sketch after this list)
  • Propagate across services using W3C Trace Context
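
A minimal sketch of log correlation in Python, stamping the active trace_id and span_id onto every log record with a logging filter (the opentelemetry-instrumentation-logging package can do similar injection automatically; the format string here is illustrative).

# Attach the current trace context to every log record so logs and traces
# can be joined on trace_id in your backend.
import logging

from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x")
        record.span_id = format(ctx.span_id, "016x")
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceContextFilter())
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))
logging.getLogger().addHandler(handler)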

✅ Use exemplars (metrics ↔ traces)

  • Exemplars connect a metric spike (p99 latency) to a trace you can open instantly.
  • Great when you can’t store every trace.

✅ Add “decision attributes” to spans

Add a few high-signal attributes that help tail sampling:

  • http.route
  • http.method
  • rpc.service, rpc.method
  • error.type
  • tenant.tier
  • auth.flow (login, token refresh, SSO, etc.)


The Telemetry Pattern of the Week

“Always keep the first failure”

During an outage, the first few failing traces often contain the clearest causal chain (timeouts, DNS, auth failures, dependency issues).

A pattern I like:

  • Always keep the first N errors per service per minute, then revert to normal tail rules.

It’s a great compromise when you don’t want “keep all errors forever.”
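
To my knowledge the Collector doesn’t ship this exact policy out of the box, so here is a rough sketch of the idea in plain Python, e.g. for a custom forwarding hook or as pseudocode for your own pipeline; the class name and limits are illustrative.

# Keep the first N error traces per service per minute, then defer to the
# normal tail-sampling rules. Old windows are never pruned here; a real
# implementation would evict them.
import time
from collections import defaultdict

class FirstNErrorKeeper:
    def __init__(self, n: int = 5, window_seconds: int = 60):
        self.n = n
        self.window = window_seconds
        self.kept = defaultdict(int)  # (service, window_start) -> errors kept

    def should_force_keep(self, service: str, is_error: bool) -> bool:
        if not is_error:
            return False
        window_start = int(time.time()) // self.window
        key = (service, window_start)
        if self.kept[key] < self.n:
            self.kept[key] += 1
            return True   # one of the first N errors this minute: always keep
        return False      # fall through to the normal tail rules

keeper = FirstNErrorKeeper(n=5)
print(keeper.should_force_keep("checkout", is_error=True))  # True for the first 5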


Quick checklist (copy/paste)

  • Baseline head sampling (5–20%)
  • Collector tail sampling: keep errors + slow traces
  • Always keep critical routes (login/checkout/auth)
  • Keep higher rate for Tier-0 tenants
  • Tight attribute hygiene (control high-cardinality)
  • Logs ↔ traces correlation enabled
  • Incident playbook: temporarily raise sampling on impacted services


Question for you (reply in comments)

When you’re debugging an incident, what’s the one trace you wish you always had?

  • The slowest request?
  • The first failure?
  • The “VIP customer” journey?
  • The auth/token flow?
