Context at Inference Time: The Latency Problem
You built the world's richest context graph. Your agent can't use it in time.
If you remember only one thing from this post: Context that arrives after the decision is made is not context. It's a post-mortem.
The 3.2-Second Failure
A customer messages your support agent. Simple issue. Account question.
The agent begins assembling its context: identity resolution, current state, decision history, contract terms, governance constraints.
Total: 3.2 seconds.
The customer sees a spinner. Then a generic response.
Why generic? Because at 1.8 seconds, the system timed out on context retrieval and fell back to the base prompt. The contract context never arrived. The relationship graph was still loading. The governance constraints came in 400ms after the response was already sent.
The agent responded without half its context.
The architecture was perfect. The plumbing was too slow.
"A context graph that can't serve at inference speed is a knowledge base, not an operating system."
Why Latency Is the Silent Killer
Everyone focuses on context quality. What to capture. How to structure it. Which relationships matter.
Almost nobody focuses on context delivery.
The assumption is that once context exists, serving it is a solved problem. Database query. API call. Done.
This assumption breaks catastrophically at scale.
The Math Nobody Does
A typical agent decision requires context from multiple sources:
| Context Type | Source | Typical Latency |
| --- | --- | --- |
| Entity identity | Identity Hub | 20-80ms |
| Current state | Systems of record | 50-200ms |
| Decision history | Context graph | 100-500ms |
| Relationship graph | Graph database | 80-400ms |
| Governance constraints | Policy engine | 50-150ms |
| Prior interactions | Interaction store | 100-600ms |
| Contract/commercial terms | Document store | 200-800ms |
If these run sequentially: 600ms-2,700ms before inference even starts.
If parallelized perfectly: bounded by the slowest source, typically 400-800ms.
Add inference time (200-1,000ms depending on model and prompt size) and you're looking at 600ms-3,700ms total response time.
For a chat interface, anything above 2 seconds feels broken. For a real-time agent in a workflow, anything above 500ms is unusable. For a multi-agent system where agents call other agents, latency multiplies at every hop.
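To make the difference concrete, here's a minimal sketch of the parallel pattern using asyncio. The source names and latencies are illustrative stand-ins, not measurements from any real system; the point is that wall time is bounded by the slowest source (or the timeout), not the sum.

```python
import asyncio

# Illustrative sources and latencies in seconds; real numbers vary by system.
SOURCES = {
    "entity_identity": 0.05,
    "current_state": 0.12,
    "decision_history": 0.30,
    "governance_constraints": 0.10,
    "contract_terms": 0.50,
}

async def fetch(name: str, latency: float) -> tuple[str, str]:
    await asyncio.sleep(latency)          # stand-in for a network call
    return name, f"<{name} payload>"

async def assemble_context(timeout: float = 0.4) -> dict[str, str]:
    tasks = [asyncio.create_task(fetch(n, lat)) for n, lat in SOURCES.items()]
    # Fan out to every source at once; never wait longer than the timeout.
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:                  # slow sources miss the cut
        task.cancel()
    # Wall time is min(slowest source, timeout), not the sum of latencies.
    return dict(task.result() for task in done)

if __name__ == "__main__":
    context = asyncio.run(assemble_context())
    print(sorted(context))  # contract_terms (500ms) missed the 400ms cut
```

Note that the timeout doubles as the fallback mechanism from the opening story: whatever hasn't arrived by the deadline simply isn't in the prompt.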
"Every layer you add to the context stack adds latency. The richest context architecture in the world is worthless if it can't serve under your latency budget."
The Three Latency Traps
Trap 1: The Retrieval Explosion
Your context graph grows. The flywheel works. More decisions, more outcomes, more patterns, more relationships.
Congratulations — you now have a retrieval problem.
A customer with five years of history has thousands of decision traces. An entity with complex relationships spans hundreds of nodes. A contract portfolio for an enterprise account involves dozens of documents.
The naive approach: retrieve everything relevant and let the model sort it out.
The result: 15,000 tokens of context crammed into every prompt. Latency spikes. Token costs explode. And half the context is irrelevant to the specific decision being made.
The retrieval explosion means your context graph's greatest strength — depth — becomes its operational weakness.
The query isn't "what context exists?" It's "what context matters for this specific decision, right now?"
That's a fundamentally harder problem.
Trap 2: The Freshness-Latency Tradeoff
Fresh context is expensive. Cached context is fast.
You can serve a customer's contract terms in 5ms from cache. But those terms were updated yesterday. The cache hasn't invalidated yet. The agent just made a decision based on the old terms.
You can query the source system for real-time terms. But that takes 400ms — and the source system is under load, so it's actually 1,200ms. The agent is waiting. The customer is waiting.
Every context source forces a choice: fast-and-possibly-stale vs slow-and-definitely-current.
The right answer varies by context type:
| Context Type | Staleness Tolerance | Strategy |
| --- | --- | --- |
| Entity identity | Minutes | Cache with event-driven invalidation |
| Contract terms | Hours | Cache with TTL + on-demand refresh for high-value decisions |
| Decision history | Days | Cache aggressively; history doesn't change |
| Governance constraints | Seconds | Real-time query; policy violations are non-negotiable |
| Current system state | Seconds | Real-time query; stale state causes action errors |
| Relationship graph | Hours | Cache with batch refresh |
There is no universal answer. Each context type needs its own freshness strategy.
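One way to encode that table is a per-type freshness policy, where a TTL of zero means "always hit the source." A minimal sketch, assuming illustrative type names and TTL values rather than any prescribed schema:

```python
import time
from dataclasses import dataclass, field

@dataclass
class FreshnessPolicy:
    ttl_seconds: float                  # 0 means never serve from cache

@dataclass
class ContextCache:
    policies: dict[str, FreshnessPolicy]
    _store: dict[str, tuple[float, object]] = field(default_factory=dict)

    def get(self, ctx_type: str, key: str, fetch_fn):
        policy = self.policies[ctx_type]
        cached = self._store.get(f"{ctx_type}:{key}")
        if cached and time.monotonic() - cached[0] < policy.ttl_seconds:
            return cached[1]            # fresh enough: serve from cache
        value = fetch_fn()              # stale or uncacheable: hit the source
        self._store[f"{ctx_type}:{key}"] = (time.monotonic(), value)
        return value

# Tolerances mirror the table above; the exact TTLs are assumptions.
cache = ContextCache(policies={
    "entity_identity":        FreshnessPolicy(ttl_seconds=300),       # minutes
    "contract_terms":         FreshnessPolicy(ttl_seconds=4 * 3600),  # hours
    "decision_history":       FreshnessPolicy(ttl_seconds=86400),     # days
    "governance_constraints": FreshnessPolicy(ttl_seconds=0),         # real-time
    "current_system_state":   FreshnessPolicy(ttl_seconds=0),         # real-time
    "relationship_graph":     FreshnessPolicy(ttl_seconds=3600),      # hours
})
print(cache.get("contract_terms", "acct:42", lambda: "net-30, 12% discount"))
```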
Trap 3: The Token Budget
Even if you retrieve the right context at the right speed, you have a finite context window.
Modern LLMs accept large contexts. But larger contexts mean higher token costs, slower inference, and model attention diluted across irrelevant information.
The agent doesn't need the customer's entire history. It needs the relevant slice of that history for this specific decision.
Context selection is as important as context retrieval.
An agent handling a billing dispute needs: recent payment history, relevant contract terms, prior billing exceptions, and applicable governance constraints. It does not need: the customer's onboarding sequence, last year's product feedback, or the sales team's internal notes.
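A sketch of that selection step: map each decision category to the context slices it needs and drop everything else before the prompt is built. The categories and slice names below are illustrative, not a canonical taxonomy.

```python
# Map each decision category to the context slices it actually needs.
CONTEXT_PROFILES: dict[str, set[str]] = {
    "billing_dispute": {"recent_payments", "contract_terms",
                        "billing_exceptions", "governance_constraints"},
    "refund":          {"recent_payments", "refund_policy",
                        "governance_constraints"},
    "escalation":      {"interaction_history", "account_health",
                        "governance_constraints"},
}

def select_context(decision_type: str, available: dict[str, str]) -> dict[str, str]:
    """Keep only the slices this decision type needs; drop the rest."""
    needed = CONTEXT_PROFILES[decision_type]
    return {k: v for k, v in available.items() if k in needed}

available = {"recent_payments": "...", "contract_terms": "...",
             "onboarding_sequence": "...", "sales_notes": "..."}
print(select_context("billing_dispute", available).keys())
# onboarding_sequence and sales_notes never reach the prompt,
# so they never cost tokens or attention.
```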
"Stuffing the context window is the inference-time equivalent of giving someone a filing cabinet when they asked for a memo."
The Context Serving Architecture
Solving the latency problem requires a serving layer purpose-built for agent decision-making.
Layer 1: Pre-Computation
Don't compute at query time what you can compute in advance.
Entity context bundles: For high-frequency entities (top customers, active contracts, critical products), pre-assemble the most commonly needed context into ready-to-serve bundles.
The bundle includes: resolved entity, key relationships, active contracts, recent decisions, applicable constraints. Assembled in background. Refreshed on schedule or event. Served in single read.
Decision templates: Most agent decisions fall into a small number of categories: refund, exception, escalation, pricing, fulfillment. Each category needs a predictable context profile. Pre-map which context types each decision category requires.
When a decision request arrives, the system doesn't query the entire graph. It pulls the pre-computed bundle for the entity and the template for the decision type.
Latency impact: 80-150ms for pre-computed bundles vs 600-2,700ms for on-demand assembly.
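A minimal sketch of the bundle pattern, assuming hypothetical field names: a background task assembles and refreshes bundles, and the query-time path collapses to a single read.

```python
import asyncio

# A bundle pre-assembles the commonly needed context for one entity.
# Field names are illustrative; a real bundle mirrors your own schema.
async def build_bundle(entity_id: str) -> dict:
    return {
        "entity": f"resolved:{entity_id}",
        "relationships": ["acct->parent", "acct->contracts"],
        "active_contracts": ["C-001"],
        "recent_decisions": ["refund-approved-2024-11"],
        "constraints": ["no-auto-refund-over-500"],
    }

BUNDLES: dict[str, dict] = {}

async def refresh_loop(entity_ids: list[str], interval: float = 300.0):
    """Background task: rebuild bundles on a schedule (or on events)."""
    while True:
        for eid in entity_ids:
            BUNDLES[eid] = await build_bundle(eid)
        await asyncio.sleep(interval)

def serve(entity_id: str) -> dict | None:
    # Query-time path is a single dictionary read: O(1), no fan-out.
    return BUNDLES.get(entity_id)
```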
Layer 2: Intelligent Retrieval
Not all context retrieval can be pre-computed. For dynamic queries, retrieval must be intelligent, not exhaustive.
Relevance ranking: Not every decision trace is equally relevant. A billing dispute needs billing-related decisions, not product feedback. Rank context by relevance to the decision type before retrieval.
Recency weighting: Recent context is usually more relevant than historical context. Weight retrieval toward the last 90 days, with exceptions for contract terms and precedent-setting decisions that may be older.
Confidence-based depth: If the agent's initial context is sufficient for a high-confidence decision, stop retrieving. Don't fetch the full history for a straightforward refund. Reserve deep retrieval for complex, ambiguous, or high-stakes decisions.
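The three ideas compose into a single scoring function. A sketch, assuming a 90-day half-life for recency decay and an illustrative trace shape ({"tags": set, "when": datetime}); neither is a prescribed format.

```python
from datetime import datetime, timedelta, timezone

def score(trace: dict, decision_type: str, half_life_days: float = 90.0) -> float:
    # Relevance: traces tagged with the decision type score full marks.
    relevance = 1.0 if decision_type in trace["tags"] else 0.2
    age_days = (datetime.now(timezone.utc) - trace["when"]).days
    recency = 0.5 ** (age_days / half_life_days)   # halves every 90 days
    return relevance * recency

def retrieve(traces: list[dict], decision_type: str, confidence: float,
             threshold: float = 0.8, k: int = 5) -> list[dict]:
    # Confidence-based depth: a high-confidence decision skips deep retrieval.
    if confidence >= threshold:
        return []
    ranked = sorted(traces, key=lambda t: score(t, decision_type), reverse=True)
    return ranked[:k]

trace = {"tags": {"billing"}, "when": datetime.now(timezone.utc) - timedelta(days=30)}
print(f"{score(trace, 'billing'):.2f}")   # recent and relevant: ~0.79
```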
Layer 3: Tiered Caching
Not every context query hits the source system.
Tier 1 — Hot cache (in-memory): Entity bundles for top 20% of entities by decision frequency. Governance constraints. Active policies. Latency: < 10ms.
Tier 2 — Warm cache (distributed): Recent decision traces. Relationship graphs. Contract summaries. Latency: 20-80ms.
Tier 3 — Cold storage (source systems): Full interaction history. Archived decisions. Legacy data. Latency: 200-800ms.
The system should satisfy 80%+ of agent queries from Tier 1 and Tier 2 combined.
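A sketch of the read path, with plain dictionaries standing in for the three tiers (in production, Tier 1 might be process memory, Tier 2 a distributed cache like Redis, Tier 3 the source systems):

```python
hot: dict[str, str] = {}                            # Tier 1: < 10ms
warm: dict[str, str] = {}                           # Tier 2: 20-80ms
cold: dict[str, str] = {"acct:42": "full-history"}  # Tier 3: 200-800ms

def lookup(key: str) -> str | None:
    if key in hot:
        return hot[key]
    if key in warm:
        hot[key] = warm[key]        # promote: next read is a Tier 1 hit
        return warm[key]
    value = cold.get(key)           # slow path: source system
    if value is not None:
        warm[key] = value           # populate the warm tier on the way back
    return value

print(lookup("acct:42"))   # first read: cold path, warm tier populated
print(lookup("acct:42"))   # second read: warm hit, promoted to hot
```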
Layer 4: Async Enrichment
Some context is valuable but not time-critical.
The agent can respond with Tier 1 and Tier 2 context immediately. Meanwhile, a background process retrieves deeper context from Tier 3. If the additional context changes the decision, the agent can self-correct or flag for review.
This pattern — respond fast, enrich async, self-correct if needed — maintains low latency without sacrificing depth for complex cases.
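A minimal sketch of the pattern: kick off deep retrieval, respond on shallow context, then check the late-arriving context against the decision. The consistency check here is a placeholder, not a real contradiction detector.

```python
import asyncio

async def fast_context(query: str) -> str:
    return f"tier1+tier2 context for {query}"   # the < 100ms cached path

async def deep_context(query: str) -> str:
    await asyncio.sleep(0.6)                    # Tier 3: slow source system
    return f"tier3 context for {query}"

async def respond(query: str) -> None:
    # Kick off deep retrieval in the background first...
    enrichment = asyncio.create_task(deep_context(query))
    # ...then respond immediately on the shallow (Tier 1/2) context.
    print("sent:", f"answer({await fast_context(query)})")
    deep = await enrichment            # arrives after the response went out
    if "contradicts" in deep:          # placeholder consistency check
        print("self-correct or flag for human review")

asyncio.run(respond("billing dispute, acct 42"))
```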
"The fastest context is the context you already assembled. The smartest retrieval is the retrieval you didn't need to do."
The Latency Budget
Every agent needs a latency budget — a hard ceiling on total response time, allocated across components.
Example budget for a customer-facing support agent:
| Component | Budget | Strategy |
| --- | --- | --- |
| Entity resolution | 20ms | Pre-resolved, cached |
| Context retrieval | 100ms | Pre-computed bundle + template |
| Governance check | 30ms | Hot cache |
| Inference | 300ms | Optimized prompt, right-sized model |
| Response formatting | 50ms | Template-based |
| **Total** | **500ms** | |
The budget forces architectural discipline: every component that exceeds its allocation either degrades the user experience or forces another component to compensate.
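One lightweight way to enforce it is to instrument each component against its allocation. A sketch using the budget table above and a timing context manager; in production the print would be a metric or an alert.

```python
import time
from contextlib import contextmanager

# Illustrative per-component budgets in milliseconds, from the table above.
BUDGET_MS = {
    "entity_resolution": 20, "context_retrieval": 100,
    "governance_check": 30, "inference": 300, "formatting": 50,
}

@contextmanager
def budgeted(component: str):
    start = time.perf_counter()
    yield
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > BUDGET_MS[component]:
        # Budget violation: surface it instead of silently absorbing it.
        print(f"{component} blew its budget: "
              f"{elapsed_ms:.0f}ms > {BUDGET_MS[component]}ms")

with budgeted("context_retrieval"):
    time.sleep(0.12)   # simulated 120ms retrieval: over the 100ms budget
```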
What Happens When You Get It Wrong
The failures are specific and measurable:
Timeout fallback: Agent responds without full context. Decisions are safe but generic. Resolution rate drops. Human escalations increase. The flywheel captures shallow decision traces instead of rich ones.
Stale context: Agent uses cached data that's no longer current. Decisions are confident but based on yesterday's reality. Wrong contract terms applied. Expired policies enforced. The errors are subtle and hard to detect.
Context bloat: Agent receives too much context. Token costs spike. Inference latency increases. Model attention dilutes across irrelevant information. Decision quality degrades despite having more data.
Cascade latency: In multi-agent systems, each agent adds latency. Agent A calls Agent B, which queries the context graph, which checks governance, which resolves the entity. Three hops. Each hop adds its own retrieval and inference time. A 500ms single-agent operation becomes a 2,500ms multi-agent workflow.
"Multi-agent latency doesn't add. It multiplies. Every hop is a retrieval cycle, a governance check, and an inference step."
The Metrics That Matter
Standard API latency metrics aren't enough. You need context-specific observability.
| Metric | Target | What It Reveals |
| --- | --- | --- |
| Context assembly time (p50) | < 100ms | Is pre-computation working? |
| Context assembly time (p99) | < 500ms | Are edge cases blowing up? |
| Cache hit rate | > 80% | Is the hot cache effective? |
| Fallback rate | < 5% | How often do agents respond without full context? |
| Stale context rate | < 2% | How often is cached context outdated? |
| Token utilization | 40-70% | Are you stuffing or starving the context window? |
| End-to-end latency (p50) | < 500ms | Is the system meeting the latency budget? |
| End-to-end latency (p99) | < 2s | Are tail latencies acceptable? |
The fallback rate is the metric most teams miss. If your agents are falling back to base prompts 15% of the time, 15% of your decisions are being made without context. The flywheel is running on partial data. The compounding advantage you built is leaking.
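Making the fallback rate visible takes very little: count every decision and every fallback, and alert on the ratio. A sketch with a plain counter standing in for a real metrics client like Prometheus or StatsD:

```python
from collections import Counter

calls = Counter()

def record_decision(full_context: bool) -> None:
    calls["total"] += 1
    if not full_context:
        calls["fallback"] += 1

def fallback_rate() -> float:
    return calls["fallback"] / calls["total"] if calls["total"] else 0.0

# Alert when more than 5% of decisions are made without full context.
for ok in [True, True, True, False, True]:
    record_decision(full_context=ok)
print(f"fallback rate: {fallback_rate():.0%}")   # 20% here: over the target
```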
Where This Fits
You've built the context stack: resolved entities, the relationship graph, the decision history, the governance layer.
Now the question shifts from "do we have the right context?" to "can we serve it fast enough?"
Context at inference time is the bridge between architecture and operations.
Build the richest context graph in the world. If the agent can't access it in 100ms, it might as well not exist.
"Architecture determines what's possible. Latency determines what's usable. The gap between the two is where agents fail in production."
Post 14 of 26 in the "Why Agents Fail" series.
Previously: The Flywheel — How Context Compounds Into Competitive Advantage
Next: Why RAG Isn't Enough — From Retrieval to Reasoning
#AgenticAI #EnterpriseAI #SystemsArchitecture #Latency #ContextGraphs
Customers don't know what a context graph is. They know when a response is fast and relevant or slow and generic. When our latency spiked, resolution rates dropped 18% — not because the agent got dumber, but because it was responding without context. Customers thought the AI had gotten worse. Actually, it just couldn't access what it knew in time.
Latency budgets changed how my team thinks about architecture. Before, every feature request was 'add more context.' Now the question is: 'Does this fit in the 100ms retrieval budget?' If it doesn't, you either pre-compute it or you prove it's worth blowing the budget. That constraint made our agents faster AND more accurate, because it forced us to think about which context actually matters.