Context at Inference Time: The Latency Problem
You built the world's richest context graph. Your agent can't use it in time.
If you remember only one thing from this post: Context that arrives after the decision is made is not context. It's a post-mortem.
The 3.2-Second Failure
A customer messages your support agent. Simple issue. Account question.
The agent begins assembling its context: identity resolution, current state, decision history, contract terms, governance constraints.
Total: 3.2 seconds.
The customer sees a spinner. Then a generic response.
Why generic? Because at 1.8 seconds, the system timed out on context retrieval and fell back to the base prompt. The contract context never arrived. The relationship graph was still loading. The governance constraints came in 400ms after the response was already sent.
The agent responded without half its context.
The architecture was perfect. The plumbing was too slow.
"A context graph that can't serve at inference speed is a knowledge base, not an operating system."
Why Latency Is the Silent Killer
Everyone focuses on context quality. What to capture. How to structure it. Which relationships matter.
Almost nobody focuses on context delivery.
The assumption is that once context exists, serving it is a solved problem. Database query. API call. Done.
This assumption breaks catastrophically at scale.
The Math Nobody Does
A typical agent decision requires context from multiple sources:
| Context Type | Source | Typical Latency |
| --- | --- | --- |
| Entity identity | Identity Hub | 20-80ms |
| Current state | Systems of record | 50-200ms |
| Decision history | Context graph | 100-500ms |
| Relationship graph | Graph database | 80-400ms |
| Governance constraints | Policy engine | 50-150ms |
| Prior interactions | Interaction store | 100-600ms |
| Contract/commercial terms | Document store | 200-800ms |
If these run sequentially: 600ms-2,700ms before inference even starts.
If parallelized perfectly: bounded by the slowest source, typically 400-800ms.
Add inference time (200-1,000ms depending on model and prompt size) and you're looking at 600ms-3,700ms total response time.
For a chat interface, anything above 2 seconds feels broken. For a real-time agent in a workflow, anything above 500ms is unusable. For a multi-agent system where agents call other agents, latency multiplies at every hop.
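To make the difference concrete, here's a minimal sketch of the parallel pattern using asyncio. The source names and latencies are illustrative stand-ins, not measurements from any real system; the point is that wall time is bounded by the slowest source (or the timeout), not the sum.

```python
import asyncio

# Illustrative sources and latencies in seconds; real numbers vary by system.
SOURCES = {
    "entity_identity": 0.05,
    "current_state": 0.12,
    "decision_history": 0.30,
    "governance_constraints": 0.10,
    "contract_terms": 0.50,
}

async def fetch(name: str, latency: float) -> tuple[str, str]:
    await asyncio.sleep(latency)          # stand-in for a network call
    return name, f"<{name} payload>"

async def assemble_context(timeout: float = 0.4) -> dict[str, str]:
    tasks = [asyncio.create_task(fetch(n, lat)) for n, lat in SOURCES.items()]
    # Fan out to every source at once; never wait longer than the timeout.
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:                  # slow sources miss the cut
        task.cancel()
    # Wall time is min(slowest source, timeout), not the sum of latencies.
    return dict(task.result() for task in done)

if __name__ == "__main__":
    context = asyncio.run(assemble_context())
    print(sorted(context))  # contract_terms (500ms) missed the 400ms cut
```

Note that the timeout doubles as the fallback mechanism from the opening story: whatever hasn't arrived by the deadline simply isn't in the prompt.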
"Every layer you add to the context stack adds latency. The richest context architecture in the world is worthless if it can't serve under your latency budget."
The Three Latency Traps
Trap 1: The Retrieval Explosion
Your context graph grows. The flywheel works. More decisions, more outcomes, more patterns, more relationships.
Congratulations — you now have a retrieval problem.
A customer with five years of history has thousands of decision traces. An entity with complex relationships spans hundreds of nodes. A contract portfolio for an enterprise account involves dozens of documents.
The naive approach: retrieve everything relevant and let the model sort it out.
The result: 15,000 tokens of context crammed into every prompt. Latency spikes. Token costs explode. And half the context is irrelevant to the specific decision being made.
The retrieval explosion means your context graph's greatest strength — depth — becomes its operational weakness.
The query isn't "what context exists?" It's "what context matters for this specific decision, right now?"
That's a fundamentally harder problem.
Trap 2: The Freshness-Latency Tradeoff
Fresh context is expensive. Cached context is fast.
You can serve a customer's contract terms in 5ms from cache. But those terms were updated yesterday. The cache hasn't invalidated yet. The agent just made a decision based on the old terms.
You can query the source system for real-time terms. But that takes 400ms — and the source system is under load, so it's actually 1,200ms. The agent is waiting. The customer is waiting.
Every context source forces a choice: fast-and-possibly-stale vs slow-and-definitely-current.
The right answer varies by context type:
| Context Type | Staleness Tolerance | Strategy |
| --- | --- | --- |
| Entity identity | Minutes | Cache with event-driven invalidation |
| Contract terms | Hours | Cache with TTL + on-demand refresh for high-value decisions |
| Decision history | Days | Cache aggressively; history doesn't change |
| Governance constraints | Seconds | Real-time query; policy violations are non-negotiable |
| Current system state | Seconds | Real-time query; stale state causes action errors |
| Relationship graph | Hours | Cache with batch refresh |
There is no universal answer. Each context type needs its own freshness strategy.
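One way to encode that table is a per-type freshness policy, where a TTL of zero means "always hit the source." A minimal sketch, assuming illustrative type names and TTL values rather than any prescribed schema:

```python
import time
from dataclasses import dataclass, field

@dataclass
class FreshnessPolicy:
    ttl_seconds: float                  # 0 means never serve from cache

@dataclass
class ContextCache:
    policies: dict[str, FreshnessPolicy]
    _store: dict[str, tuple[float, object]] = field(default_factory=dict)

    def get(self, ctx_type: str, key: str, fetch_fn):
        policy = self.policies[ctx_type]
        cached = self._store.get(f"{ctx_type}:{key}")
        if cached and time.monotonic() - cached[0] < policy.ttl_seconds:
            return cached[1]            # fresh enough: serve from cache
        value = fetch_fn()              # stale or uncacheable: hit the source
        self._store[f"{ctx_type}:{key}"] = (time.monotonic(), value)
        return value

# Tolerances mirror the table above; the exact TTLs are assumptions.
cache = ContextCache(policies={
    "entity_identity":        FreshnessPolicy(ttl_seconds=300),       # minutes
    "contract_terms":         FreshnessPolicy(ttl_seconds=4 * 3600),  # hours
    "decision_history":       FreshnessPolicy(ttl_seconds=86400),     # days
    "governance_constraints": FreshnessPolicy(ttl_seconds=0),         # real-time
    "current_system_state":   FreshnessPolicy(ttl_seconds=0),         # real-time
    "relationship_graph":     FreshnessPolicy(ttl_seconds=3600),      # hours
})
print(cache.get("contract_terms", "acct:42", lambda: "net-30, 12% discount"))
```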
Trap 3: The Token Budget
Even if you retrieve the right context at the right speed, you have a finite context window.
Modern LLMs accept large contexts. But larger contexts mean higher token costs, slower inference, and model attention diluted across irrelevant information.
The agent doesn't need the customer's entire history. It needs the relevant slice of that history for this specific decision.
Context selection is as important as context retrieval.
An agent handling a billing dispute needs: recent payment history, relevant contract terms, prior billing exceptions, and applicable governance constraints. It does not need: the customer's onboarding sequence, last year's product feedback, or the sales team's internal notes.
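A sketch of that selection step: map each decision category to the context slices it needs and drop everything else before the prompt is built. The categories and slice names below are illustrative, not a canonical taxonomy.

```python
# Map each decision category to the context slices it actually needs.
CONTEXT_PROFILES: dict[str, set[str]] = {
    "billing_dispute": {"recent_payments", "contract_terms",
                        "billing_exceptions", "governance_constraints"},
    "refund":          {"recent_payments", "refund_policy",
                        "governance_constraints"},
    "escalation":      {"interaction_history", "account_health",
                        "governance_constraints"},
}

def select_context(decision_type: str, available: dict[str, str]) -> dict[str, str]:
    """Keep only the slices this decision type needs; drop the rest."""
    needed = CONTEXT_PROFILES[decision_type]
    return {k: v for k, v in available.items() if k in needed}

available = {"recent_payments": "...", "contract_terms": "...",
             "onboarding_sequence": "...", "sales_notes": "..."}
print(select_context("billing_dispute", available).keys())
# onboarding_sequence and sales_notes never reach the prompt,
# so they never cost tokens or attention.
```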
"Stuffing the context window is the inference-time equivalent of giving someone a filing cabinet when they asked for a memo."
The Context Serving Architecture
Solving the latency problem requires a serving layer purpose-built for agent decision-making.
Layer 1: Pre-Computation
Don't compute at query time what you can compute in advance.
Entity context bundles: For high-frequency entities (top customers, active contracts, critical products), pre-assemble the most commonly needed context into ready-to-serve bundles.
The bundle includes: resolved entity, key relationships, active contracts, recent decisions, applicable constraints. Assembled in background. Refreshed on schedule or event. Served in single read.
Decision templates: Most agent decisions fall into a small number of categories: refund, exception, escalation, pricing, fulfillment. Each category needs a predictable context profile. Pre-map which context types each decision category requires.
When a decision request arrives, the system doesn't query the entire graph. It pulls the pre-computed bundle for the entity and the template for the decision type.
Latency impact: 80-150ms for pre-computed bundles vs 600-2,700ms for on-demand assembly.
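A minimal sketch of the bundle pattern, assuming hypothetical field names: a background task assembles and refreshes bundles, and the query-time path collapses to a single read.

```python
import asyncio

# A bundle pre-assembles the commonly needed context for one entity.
# Field names are illustrative; a real bundle mirrors your own schema.
async def build_bundle(entity_id: str) -> dict:
    return {
        "entity": f"resolved:{entity_id}",
        "relationships": ["acct->parent", "acct->contracts"],
        "active_contracts": ["C-001"],
        "recent_decisions": ["refund-approved-2024-11"],
        "constraints": ["no-auto-refund-over-500"],
    }

BUNDLES: dict[str, dict] = {}

async def refresh_loop(entity_ids: list[str], interval: float = 300.0):
    """Background task: rebuild bundles on a schedule (or on events)."""
    while True:
        for eid in entity_ids:
            BUNDLES[eid] = await build_bundle(eid)
        await asyncio.sleep(interval)

def serve(entity_id: str) -> dict | None:
    # Query-time path is a single dictionary read: O(1), no fan-out.
    return BUNDLES.get(entity_id)
```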
Layer 2: Intelligent Retrieval
Not all context retrieval can be pre-computed. For dynamic queries, retrieval must be intelligent, not exhaustive.
Relevance ranking: Not every decision trace is equally relevant. A billing dispute needs billing-related decisions, not product feedback. Rank context by relevance to the decision type before retrieval.
Recency weighting: Recent context is usually more relevant than historical context. Weight retrieval toward the last 90 days, with exceptions for contract terms and precedent-setting decisions that may be older.
Confidence-based depth: If the agent's initial context is sufficient for a high-confidence decision, stop retrieving. Don't fetch the full history for a straightforward refund. Reserve deep retrieval for complex, ambiguous, or high-stakes decisions.
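The three ideas compose into a single scoring function. A sketch, assuming a 90-day half-life for recency decay and an illustrative trace shape ({"tags": set, "when": datetime}); neither is a prescribed format.

```python
from datetime import datetime, timedelta, timezone

def score(trace: dict, decision_type: str, half_life_days: float = 90.0) -> float:
    # Relevance: traces tagged with the decision type score full marks.
    relevance = 1.0 if decision_type in trace["tags"] else 0.2
    age_days = (datetime.now(timezone.utc) - trace["when"]).days
    recency = 0.5 ** (age_days / half_life_days)   # halves every 90 days
    return relevance * recency

def retrieve(traces: list[dict], decision_type: str, confidence: float,
             threshold: float = 0.8, k: int = 5) -> list[dict]:
    # Confidence-based depth: a high-confidence decision skips deep retrieval.
    if confidence >= threshold:
        return []
    ranked = sorted(traces, key=lambda t: score(t, decision_type), reverse=True)
    return ranked[:k]

trace = {"tags": {"billing"}, "when": datetime.now(timezone.utc) - timedelta(days=30)}
print(f"{score(trace, 'billing'):.2f}")   # recent and relevant: ~0.79
```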
Layer 3: Tiered Caching
Not every context query hits the source system.
Tier 1 — Hot cache (in-memory): Entity bundles for top 20% of entities by decision frequency. Governance constraints. Active policies. Latency: < 10ms.
Tier 2 — Warm cache (distributed): Recent decision traces. Relationship graphs. Contract summaries. Latency: 20-80ms.
Tier 3 — Cold storage (source systems): Full interaction history. Archived decisions. Legacy data. Latency: 200-800ms.
The system should satisfy 80%+ of agent queries from Tier 1 and Tier 2 combined.
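A sketch of the read path, with plain dictionaries standing in for the three tiers (in production, Tier 1 might be process memory, Tier 2 a distributed cache like Redis, Tier 3 the source systems):

```python
hot: dict[str, str] = {}                            # Tier 1: < 10ms
warm: dict[str, str] = {}                           # Tier 2: 20-80ms
cold: dict[str, str] = {"acct:42": "full-history"}  # Tier 3: 200-800ms

def lookup(key: str) -> str | None:
    if key in hot:
        return hot[key]
    if key in warm:
        hot[key] = warm[key]        # promote: next read is a Tier 1 hit
        return warm[key]
    value = cold.get(key)           # slow path: source system
    if value is not None:
        warm[key] = value           # populate the warm tier on the way back
    return value

print(lookup("acct:42"))   # first read: cold path, warm tier populated
print(lookup("acct:42"))   # second read: warm hit, promoted to hot
```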
Layer 4: Async Enrichment
Some context is valuable but not time-critical.
The agent can respond with Tier 1 and Tier 2 context immediately. Meanwhile, a background process retrieves deeper context from Tier 3. If the additional context changes the decision, the agent can self-correct or flag for review.
This pattern — respond fast, enrich async, self-correct if needed — maintains low latency without sacrificing depth for complex cases.
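A minimal sketch of the pattern: kick off deep retrieval, respond on shallow context, then check the late-arriving context against the decision. The consistency check here is a placeholder, not a real contradiction detector.

```python
import asyncio

async def fast_context(query: str) -> str:
    return f"tier1+tier2 context for {query}"   # the < 100ms cached path

async def deep_context(query: str) -> str:
    await asyncio.sleep(0.6)                    # Tier 3: slow source system
    return f"tier3 context for {query}"

async def respond(query: str) -> None:
    # Kick off deep retrieval in the background first...
    enrichment = asyncio.create_task(deep_context(query))
    # ...then respond immediately on the shallow (Tier 1/2) context.
    print("sent:", f"answer({await fast_context(query)})")
    deep = await enrichment            # arrives after the response went out
    if "contradicts" in deep:          # placeholder consistency check
        print("self-correct or flag for human review")

asyncio.run(respond("billing dispute, acct 42"))
```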
"The fastest context is the context you already assembled. The smartest retrieval is the retrieval you didn't need to do."
The Latency Budget
Every agent needs a latency budget — a hard ceiling on total response time, allocated across components.
Example budget for a customer-facing support agent:
| Component | Budget | Strategy |
| --- | --- | --- |
| Entity resolution | 20ms | Pre-resolved, cached |
| Context retrieval | 100ms | Pre-computed bundle + template |
| Governance check | 30ms | Hot cache |
| Inference | 300ms | Optimized prompt, right-sized model |
| Response formatting | 50ms | Template-based |
| **Total** | **500ms** | |
The budget forces architectural discipline: every component that exceeds its allocation either degrades the user experience or forces another component to compensate.
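One lightweight way to enforce it is to instrument each component against its allocation. A sketch using the budget table above and a timing context manager; in production the print would be a metric or an alert.

```python
import time
from contextlib import contextmanager

# Illustrative per-component budgets in milliseconds, from the table above.
BUDGET_MS = {
    "entity_resolution": 20, "context_retrieval": 100,
    "governance_check": 30, "inference": 300, "formatting": 50,
}

@contextmanager
def budgeted(component: str):
    start = time.perf_counter()
    yield
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > BUDGET_MS[component]:
        # Budget violation: surface it instead of silently absorbing it.
        print(f"{component} blew its budget: "
              f"{elapsed_ms:.0f}ms > {BUDGET_MS[component]}ms")

with budgeted("context_retrieval"):
    time.sleep(0.12)   # simulated 120ms retrieval: over the 100ms budget
```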
What Happens When You Get It Wrong
The failures are specific and measurable:
Timeout fallback: Agent responds without full context. Decisions are safe but generic. Resolution rate drops. Human escalations increase. The flywheel captures shallow decision traces instead of rich ones.
Stale context: Agent uses cached data that's no longer current. Decisions are confident but based on yesterday's reality. Wrong contract terms applied. Expired policies enforced. The errors are subtle and hard to detect.
Context bloat: Agent receives too much context. Token costs spike. Inference latency increases. Model attention dilutes across irrelevant information. Decision quality degrades despite having more data.
Cascade latency: In multi-agent systems, each agent adds latency. Agent A calls Agent B, which queries the context graph, which checks governance, which resolves the entity. Three hops. Each hop adds its own retrieval and inference time. A 500ms single-agent operation becomes a 2,500ms multi-agent workflow.
"Multi-agent latency doesn't add. It multiplies. Every hop is a retrieval cycle, a governance check, and an inference step."
The Metrics That Matter
Standard API latency metrics aren't enough. You need context-specific observability.
| Metric | Target | What It Reveals |
| --- | --- | --- |
| Context assembly time (p50) | < 100ms | Is pre-computation working? |
| Context assembly time (p99) | < 500ms | Are edge cases blowing up? |
| Cache hit rate | > 80% | Is the hot cache effective? |
| Fallback rate | < 5% | How often do agents respond without full context? |
| Stale context rate | < 2% | How often is cached context outdated? |
| Token utilization | 40-70% | Are you stuffing or starving the context window? |
| End-to-end latency (p50) | < 500ms | Is the system meeting the latency budget? |
| End-to-end latency (p99) | < 2s | Are tail latencies acceptable? |
The fallback rate is the metric most teams miss. If your agents are falling back to base prompts 15% of the time, 15% of your decisions are being made without context. The flywheel is running on partial data. The compounding advantage you built is leaking.
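Making the fallback rate visible takes very little: count every decision and every fallback, and alert on the ratio. A sketch with a plain counter standing in for a real metrics client like Prometheus or StatsD:

```python
from collections import Counter

calls = Counter()

def record_decision(full_context: bool) -> None:
    calls["total"] += 1
    if not full_context:
        calls["fallback"] += 1

def fallback_rate() -> float:
    return calls["fallback"] / calls["total"] if calls["total"] else 0.0

# Alert when more than 5% of decisions are made without full context.
for ok in [True, True, True, False, True]:
    record_decision(full_context=ok)
print(f"fallback rate: {fallback_rate():.0%}")   # 20% here: over the target
```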
Where This Fits
You've built the context stack: resolved entities, the relationship graph, the decision history, the governance layer.
Now the question shifts from "do we have the right context?" to "can we serve it fast enough?"
Context at inference time is the bridge between architecture and operations.
Build the richest context graph in the world. If the agent can't access it in 100ms, it might as well not exist.
"Architecture determines what's possible. Latency determines what's usable. The gap between the two is where agents fail in production."
Post 14 of 26 in the "Why Agents Fail" series.
Previously: The Flywheel — How Context Compounds Into Competitive Advantage
Next: Why RAG Isn't Enough — From Retrieval to Reasoning
#AgenticAI #EnterpriseAI #SystemsArchitecture #Latency #ContextGraphs
Customers don't know what a context graph is. They know when a response is fast and relevant or slow and generic. When our latency spiked, resolution rates dropped 18% — not because the agent got dumber, but because it was responding without context. Customers thought the AI had gotten worse. Actually, it just couldn't access what it knew in time.
Latency budgets changed how my team thinks about architecture. Before, every feature request was 'add more context.' Now the question is: 'Does this fit in the 100ms retrieval budget?' If it doesn't, you either pre-compute it or you prove it's worth blowing the budget. That constraint made our agents faster AND more accurate, because it forced us to think about which context actually matters.