The Telemetry Brief — Issue: Sampling Without Losing the Plot
How to cut telemetry cost and keep the “one trace that matters” when prod is on fire
⏱️ Read time: ~6–8 min
Why this matters
Most teams start with “collect everything,” then costs climb, pipelines lag, and someone flips on aggressive sampling… and suddenly you can’t debug incidents because the critical traces are missing.
The goal isn’t “less telemetry.” It’s right telemetry — keeping high-signal traces while trimming noise.
TL;DR
If you only remember three things:
- Keep a cheap head-sampled baseline (roughly 5–20%) in the SDK so every service always has some visibility.
- Add tail sampling in the Collector so errors and slow traces are always kept, no matter what the baseline rate is.
- Protect critical routes and top-tier tenants with explicit always-keep rules so the trace you need during an incident actually exists.
1) The two sampling types (in plain English)
A) Head sampling (decide at the start)
What it does: chooses whether to keep a trace at the moment the trace begins (usually in the SDK). Pros: simple, low overhead, great for high traffic. Cons: you don’t yet know whether the request will error or be slow, so you can drop exactly the traces you’ll need most during incidents.
Use it for: baseline coverage (e.g., 5–20%) so you always have some visibility (a quick way to set this follows the list below).
B) Tail sampling (decide after seeing the whole trace)
What it does: the Collector buffers spans briefly, evaluates the completed trace, and keeps/drops based on what actually happened. Pros: you can keep errors, high latency, specific endpoints, VIP customers, etc. Cons: needs memory and careful limits; if misconfigured, it can drop data under pressure.
Use it for: preserving debuggability, e.g. “keep all 5xx and all p99-latency traces.”
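If your services use OpenTelemetry SDKs, the standard sampler environment variables are usually the fastest way to set that head-sampled baseline. A minimal sketch, assuming your SDKs honor the standard OTEL_TRACES_SAMPLER variables (the 10% is just an example rate):

  export OTEL_TRACES_SAMPLER=parentbased_traceidratio
  export OTEL_TRACES_SAMPLER_ARG=0.10

parentbased_traceidratio respects the caller’s decision, so one head decision at the edge propagates consistently to downstream services instead of each service re-rolling the dice.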
2) The “hybrid” sampling strategy that works in real life
Here’s a practical setup I’ve seen hold up across busy production systems:
✅ Baseline + Always Keep + Tail Rules
- Baseline: head-sample everything at a low rate (say 10%) in the SDK, so every service has steady coverage at a predictable cost.
- Always Keep: 100% retention for critical routes (login, checkout, payments) and top-tier tenants, via attribute-based rules.
- Tail Rules: keep every error and every slow trace in the Collector, regardless of the baseline rate.
This gives you a stable cost floor and incident-level depth where you need it.
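On the Collector side, the tail rules only take effect once the tail_sampling processor is wired into a traces pipeline. A minimal wiring sketch, assuming an otlp receiver/exporter and a batch processor are already defined elsewhere in your config (the full tail_sampling block is in section 4):

  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [tail_sampling, batch]   # sampling decision happens before batching/export
        exporters: [otlp]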
3) The most common sampling mistakes (and fixes)
Mistake #1: Sampling only at the SDK and calling it “done”
Why it hurts: you’ll drop the traces you need most, the slow ones and the failing ones. ✅ Fix: add tail sampling rules in the Collector to “always keep” errors and slow traces (policies #1 and #2 in the section 4 config).
Mistake #2: No “always-keep” guardrails for critical paths
Why it hurts: the “checkout is down” incident happens, and checkout traces are… missing. ✅ Fix: always keep key routes via attribute-based rules (policy #3 in the section 4 config).
Mistake #3: Missing high-cardinality discipline
Why it hurts: costs explode and sampling becomes a band-aid. ✅ Fix: be intentional about attributes:
- keep unbounded values (user IDs, UUIDs, raw URLs with query strings) out of the attributes you group or sample on;
- prefer bounded enums (tenant tier, region, status class) for anything used in dashboards or sampling rules;
- review your top attributes by cardinality regularly and drop or hash the offenders (a Collector-side sketch follows).
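If some of that cardinality is already flowing, the Collector’s attributes processor can strip it before export. A small sketch; the attribute names are hypothetical examples of unbounded values, and the processor still needs to be added to your traces pipeline:

  processors:
    attributes/trim_cardinality:
      actions:
        # drop unbounded identifiers so they don't blow up indexes and storage
        - key: user.id
          action: delete
        - key: http.request.header.x-request-id
          action: delete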
4) Tail sampling rules you can actually start with
Below is an example OpenTelemetry Collector tail sampling policy set that’s production-friendly.
  processors:
    tail_sampling:
      decision_wait: 10s
      num_traces: 50000
      expected_new_traces_per_sec: 2000
      policies:
        # 1) Keep ALL errors
        - name: keep_errors
          type: status_code
          status_code:
            status_codes: [ERROR]
        # 2) Keep slow traces (adjust threshold)
        - name: keep_high_latency
          type: latency
          latency:
            threshold_ms: 2000
        # 3) Keep 100% for critical endpoints (example)
        - name: keep_critical_routes
          type: string_attribute
          string_attribute:
            key: http.route
            values: ["/login", "/checkout", "/payment", "/oauth/callback"]
            enabled_regex_matching: false
        # 4) Keep more for Tier-0 tenants (example)
        - name: keep_tier0_tenants
          type: string_attribute
          string_attribute:
            key: tenant.tier
            values: ["tier0", "platinum"]
        # 5) Sample the rest at 10%
        - name: sample_everything_else
          type: probabilistic
          probabilistic:
            sampling_percentage: 10
Two practical notes:
- Policies are evaluated independently, and a trace is kept if any policy matches, so the 10% probabilistic rule is a floor underneath the always-keep rules, not a cap on them. Size decision_wait and num_traces against your real span volume: every undecided trace sits in Collector memory until the wait expires.
- Tail sampling only works if all spans of a trace reach the same Collector instance. With more than one Collector, put a trace-ID-aware routing tier in front of the sampling Collectors (sketch below).
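For the multi-Collector case, a common approach is a thin routing tier in front of the sampling Collectors using the contrib loadbalancing exporter keyed on trace ID. A sketch with placeholder hostnames and TLS settings:

  exporters:
    loadbalancing:
      routing_key: "traceID"
      protocol:
        otlp:
          tls:
            insecure: true   # placeholder; configure real TLS for production
      resolver:
        static:
          hostnames:
            - sampling-collector-1:4317
            - sampling-collector-2:4317

With trace-ID routing, every span of a given trace lands on the same sampling Collector, which is what tail_sampling needs to make a correct keep/drop decision.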
5) Debuggability boosters (cheap wins)
Even with good sampling, these 3 add huge leverage:
✅ Consistent correlation IDs: propagate one request/trace ID through logs, metrics, and traces so you can pivot between them during an incident.
✅ Use exemplars (metrics ↔ traces): exemplars attach sample trace IDs to metric points, so a latency spike on a dashboard links straight to a representative trace.
✅ Add “decision attributes” to spans
Add a few high-signal attributes that help tail sampling: error.type, retry counts, tenant tier, feature-flag state, the downstream dependency that was called, whether a fallback path was taken. A sketch of a policy keyed on one of these follows.
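As an example of how a decision attribute feeds back into sampling, here is a hypothetical policy that keeps any trace whose (made-up) payment.retry_count attribute shows at least one retry; it would sit alongside the policies in section 4:

        # Sketch: keep traces whose (hypothetical) payment.retry_count attribute shows retries
        - name: keep_retried_payments
          type: numeric_attribute
          numeric_attribute:
            key: payment.retry_count
            min_value: 1
            max_value: 1000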
The Telemetry Pattern of the Week
“Always keep the first failure”
During an outage, the first few failing traces often contain the clearest causal chain (timeouts, DNS, auth failures, dependency issues).
A pattern I like: keep error traces at 100% for the first N failures (or the first couple of minutes) of an incident, then rate-limit or probabilistically sample errors after that; a Collector sketch follows below.
It’s a great compromise when you don’t want “keep all errors forever.”
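One way to approximate this with the tail sampling processor is an and policy that matches errors but caps their volume with a rate limit, so the first failures always get through while a sustained error storm doesn’t flood your backend. A sketch (the 50 spans/second cap is an arbitrary example):

        # Sketch: keep error traces, but only up to a rate cap
        - name: keep_first_failures
          type: and
          and:
            and_sub_policy:
              # only error traces are considered...
              - name: errors_only
                type: status_code
                status_code:
                  status_codes: [ERROR]
              # ...and they are kept only up to this span rate
              - name: cap_error_volume
                type: rate_limiting
                rate_limiting:
                  spans_per_second: 50

If you adopt this, it replaces the blanket keep_errors policy from section 4 rather than sitting next to it; otherwise keep_errors keeps everything anyway.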
Quick checklist (copy/paste)
[ ] Head-sampled baseline set in the SDK (5–20%)
[ ] Tail sampling in the Collector keeps all errors
[ ] Tail sampling keeps slow traces (latency threshold picked deliberately)
[ ] Always-keep rules for critical routes and top-tier tenants
[ ] decision_wait / num_traces sized for real span volume
[ ] Trace-ID-aware routing in front of multiple Collectors
[ ] High-cardinality attributes reviewed and trimmed
[ ] Correlation IDs, exemplars, and decision attributes in place
Question for you (reply in comments)
When you’re debugging an incident, what’s the one trace you wish you always had?