RAG Failure Modes, Quick Fixes, and Advanced Techniques
If you read my earlier post on the basics of Retrieval-Augmented Generation (RAG), you know the core idea: find relevant document chunks, and feed them to a language model so answers are grounded in real text. This follow-up shows what typically breaks in real projects, how to triage and fix it fast, and which advanced patterns to adopt when you move from prototype to production.
Quick glossary (read first; these terms recur throughout)
- Chunk: a passage split out of a source document for indexing.
- Embedding: a vector representation of text used for similarity search.
- Top-k: the k highest-scoring chunks a retriever returns.
- BM25: a classic lexical (keyword) ranking function; the standard "sparse" retriever.
- ANN: approximate nearest-neighbour search, the fast lookup inside vector databases.
- Cross-encoder: a model that scores a query and a passage jointly; slower but more precise than embedding similarity.
- Re-ranking: re-scoring a small candidate set with a stronger, slower model.
Common failure cases and quick mitigations
Poor recall — the right passage exists but isn’t returned
Symptom: the correct answer is in your data but isn’t in top-k.
Fast checks
- Confirm the document was actually ingested: search the index for a distinctive phrase from it.
- Query using the exact wording of the known answer; if even that fails, suspect chunking or the embedding model.
- Check where the known-good passage ranks for the failing query (a diagnostic sketch follows).
Quick mitigations
- Increase top-k and let a re-ranker (covered below) restore precision.
- Add lexical search alongside vector search; see hybrid retrieval below.
- Re-chunk with overlap so key sentences are not split across boundaries.
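The diagnostic sketch, in Python, assuming sentence-transformers is installed and you can enumerate your indexed chunks; the model name and helper function are illustrative, not part of any specific stack:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

def rank_of_gold(query: str, gold_chunk: str, all_chunks: list[str]) -> int:
    # Embed the query and every indexed chunk, then find where the
    # known-correct chunk lands in the similarity ordering.
    q = model.encode(query, convert_to_tensor=True)
    c = model.encode(all_chunks, convert_to_tensor=True)
    sims = util.cos_sim(q, c)[0]
    order = sims.argsort(descending=True).tolist()
    return order.index(all_chunks.index(gold_chunk))  # 0 = retrieved first

If the gold chunk ranks far below your top-k, the problem is retrieval (embeddings or chunking), not the generator.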
Bad chunking — too small, too large, or no overlap
Symptom: answers miss important facts or are noisy.
Why: chunk boundaries cut important sentences, or chunks are too broad.
Quick mitigations
- Split on sentence or paragraph boundaries rather than fixed character counts.
- Add overlap (for example, 10–20% of the chunk size) so boundary sentences appear in two chunks.
- Keep chunks large enough to stand alone but small enough to stay on one topic; a minimal chunker sketch follows.
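The chunker sketch; the sizes are illustrative starting points, and it splits on whitespace for brevity where production code should split on sentences or tokens:

def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    # Slide a window of `size` words, advancing so that consecutive
    # chunks share `overlap` words across the boundary.
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append(" ".join(words[start:start + size]))
    return chunks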
Query drift — retriever finds “related” text but not what the task needs
Symptom: returned chunks are topically similar but not useful for the specific task (e.g., FAQ vs. policy text).
Fixes
- Rewrite the query to state the task explicitly (see query translation below).
- Filter candidates by metadata such as document type, section, or date before ranking; a sketch follows.
- Re-rank candidates against the task-specific query rather than the raw user text.
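The metadata-filter sketch; the chunk dictionary shape and the "doc_type" key are assumptions about your own schema:

def filter_by_type(chunks: list[dict], wanted: str) -> list[dict]:
    # Keep only candidates whose metadata matches the task's document type.
    return [c for c in chunks if c.get("meta", {}).get("doc_type") == wanted]

candidates = [
    {"text": "Refunds are issued within 14 days.", "meta": {"doc_type": "policy"}},
    {"text": "Q: How do I get a refund?", "meta": {"doc_type": "faq"}},
]
policy_only = filter_by_type(candidates, "policy")  # drops the FAQ chunk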
Outdated / stale index
Symptom: answers cite old or removed content.
Fixes
- Re-index on a schedule, or incrementally whenever a source document changes.
- Delete vectors for removed documents; don't just add new ones.
- Store a last-updated timestamp in chunk metadata so staleness is visible. A change-detection sketch follows.
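The change-detection sketch: hash each document's content and re-embed only on change. The stored_hashes store and reindex hook are placeholders for your own pipeline:

import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sync(docs: dict[str, str], stored_hashes: dict[str, str], reindex) -> None:
    # Re-embed only documents whose content hash changed; delete vectors
    # for documents that disappeared from the source of truth.
    for doc_id, text in docs.items():
        h = content_hash(text)
        if stored_hashes.get(doc_id) != h:
            reindex(doc_id, text)
            stored_hashes[doc_id] = h
    for doc_id in set(stored_hashes) - set(docs):
        reindex(doc_id, None)  # signal deletion
        del stored_hashes[doc_id]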
Hallucinations from weak/noisy context
Symptom: LLM invents facts not supported by retrieved text.
Mitigations
- Constrain the model to the retrieved passages and require citations (copy-paste prompt below).
- Tell the model explicitly to answer "I don't know" when the passages don't support an answer.
- Filter or re-rank noisy candidates before they ever reach the prompt.
Copy-paste grounding prompt
Answer using ONLY the passages below. If none support an answer, reply "I don't know". Cite passage IDs used.
Passages:
[1] ...
[2] ...
Question: ...
Answer:
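For convenience, a small helper that assembles this prompt from retrieved passages; the function name and passage list shape are illustrative:

def grounding_prompt(question: str, passages: list[str]) -> str:
    # Number the passages so the model can cite them by ID.
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the passages below. If none support an answer, "
        'reply "I don\'t know". Cite passage IDs used.\n'
        f"Passages:\n{numbered}\n"
        f"Question: {question}\n"
        "Answer:"
    )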
Techniques and patterns to scale accuracy and reliability
Hybrid retrieval (sparse + dense)
Idea: run BM25 and vector search in parallel, merge results, dedupe, then re-rank.
Why: BM25 catches exact phrases and small lexical cues while vector search catches paraphrases and synonyms. Together they reduce misses.
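One simple way to merge the two result lists is reciprocal rank fusion (RRF); a self-contained sketch, assuming each retriever returns a ranked list of document IDs:

from collections import defaultdict

def rrf_merge(bm25_ids: list[str], vector_ids: list[str], k: int = 60) -> list[str]:
    # Each list contributes 1/(k + rank) per document; summing rewards
    # documents that rank well in either retriever. k=60 is the constant
    # from the original RRF paper.
    scores: dict[str, float] = defaultdict(float)
    for ranking in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

rrf_merge(["a", "b", "c"], ["c", "a", "d"])  # -> ['a', 'c', 'b', 'd']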
Re-ranking with cross-encoders or LLM evaluators
Idea: retrieve a broad candidate set quickly, then re-score candidates with a cross-encoder or small LLM that scores query+passage pairs.
Trade-off: more latency and cost, but much better precision. Re-rank only the candidate set (e.g., the top 100), not your entire index, to keep cost manageable.
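A sketch using sentence-transformers' CrossEncoder; the checkpoint is one commonly used public model, not a requirement:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, passages: list[str], top_n: int = 5) -> list[str]:
    # Score every (query, passage) pair jointly, then keep the best few.
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(scores, passages), key=lambda x: x[0], reverse=True)
    return [p for _, p in ranked[:top_n]]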
HyDE — use the model to help retrieval
Idea: ask the LLM to produce a short, hypothetical answer first; embed that synthetic answer and retrieve documents similar to it.
Why it helps: the synthetic answer expresses intent in the model’s own representation space and often improves recall for complex queries.
Simple flow: 1) ask the LLM for a short hypothetical answer; 2) embed that synthetic answer; 3) retrieve the passages nearest to it; 4) generate the final answer from the retrieved (real) passages.
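In code the flow is only a few lines; llm_complete, embed, and vector_search are placeholders for your own LLM call, embedder, and vector store:

def hyde_retrieve(question: str, llm_complete, embed, vector_search, top_k: int = 10):
    # 1) Draft a short hypothetical answer (it may be wrong; that's fine).
    hypothetical = llm_complete(f"Write a short passage that would answer: {question}")
    # 2) Embed the synthetic answer instead of the raw question.
    query_vector = embed(hypothetical)
    # 3) Retrieve real passages near the synthetic answer's embedding.
    return vector_search(query_vector, top_k=top_k)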
Corrective RAG — iterative retrieval when confidence is low
Pattern: generate → evaluate → if low confidence, expand retrieval (bigger top-k / different filters), refine prompt, and regenerate.
Good for: high-risk answers where you prefer “I don’t know” over a wrong answer.
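A sketch of the loop; retrieve, generate, and confidence are placeholders (the evaluator could be an LLM self-check or a score from your re-ranker), and the thresholds are illustrative:

def corrective_answer(question, retrieve, generate, confidence,
                      top_ks=(5, 20, 50), threshold=0.7):
    # Expand retrieval each round; stop as soon as the evaluator is satisfied.
    for top_k in top_ks:
        passages = retrieve(question, top_k=top_k)
        answer = generate(question, passages)
        if confidence(question, answer, passages) >= threshold:
            return answer
    return "I don't know"  # prefer abstaining to guessing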
Query translation and sub-query rewriting
Query translation: normalize abbreviations, convert domain shorthand, or translate non-English queries into your primary language before embedding.
Sub-query rewriting: decompose a multi-part question into smaller, focused queries (helps multi-hop reasoning).
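A decomposition sketch; llm_complete and retrieve are placeholders, and the one-sub-question-per-line prompt format is just one workable convention:

def multi_hop_retrieve(question: str, llm_complete, retrieve) -> list[str]:
    decomposition = llm_complete(
        f"Split this question into independent sub-questions, one per line:\n{question}"
    )
    sub_queries = [q.strip() for q in decomposition.splitlines() if q.strip()]
    passages, seen = [], set()
    for sub in sub_queries:
        for p in retrieve(sub, top_k=5):
            if p not in seen:  # dedupe passages shared across sub-queries
                seen.add(p)
                passages.append(p)
    return passages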
Contextual embeddings and domain tuning
Idea: fine-tune or select embedding models tuned for your domain (medical, legal, product docs). Contextual or domain embeddings improve retrieval relevance. LangChain and other libraries make swapping embedders easy.
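With sentence-transformers the swap is typically one line; both model names below are examples, and the domain checkpoint is hypothetical:

from sentence_transformers import SentenceTransformer

general = SentenceTransformer("all-MiniLM-L6-v2")
# domain = SentenceTransformer("your-org/legal-embeddings")  # hypothetical checkpoint

vectors = general.encode(["chunk one", "chunk two"], normalize_embeddings=True)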
GraphRAG — add structure for multi-hop reasoning
When to use: queries that require linking entities and relationships (e.g., supply chains, legal citations).
How: build a small knowledge graph of entities and relations, use graph traversal to find candidate documents, then run RAG on those passages.
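A toy traversal sketch with networkx; the entities, relations, and document IDs are invented for illustration, and entity extraction is out of scope:

import networkx as nx

G = nx.Graph()
G.add_edge("AcmeCorp", "SupplierX", docs={"contract_12"})
G.add_edge("SupplierX", "PortOfRotterdam", docs={"shipping_7"})

def candidate_docs(entity: str, hops: int = 2) -> set[str]:
    # Collect documents attached to edges within `hops` of the entity,
    # then run ordinary RAG over just those documents.
    docs: set[str] = set()
    subgraph = nx.ego_graph(G, entity, radius=hops)
    for u, v, data in subgraph.edges(data=True):
        docs |= data.get("docs", set())
    return docs

candidate_docs("AcmeCorp")  # -> {'contract_12', 'shipping_7'}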
Caching and cost control
- Cache embeddings so the same chunk or query text is never embedded twice; a sketch follows.
- Cache final answers for frequent, identical questions.
- Batch embedding and LLM calls to cut per-request overhead.
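The embedding-cache sketch, keyed by content hash; embed is a placeholder for your real embedding call:

import hashlib

_cache: dict[str, list[float]] = {}

def cached_embed(text: str, embed) -> list[float]:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = embed(text)  # pay the compute/API cost only once
    return _cache[key]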
ANN tuning for latency vs. recall
ANN (approximate nearest neighbour) search is how vector databases stay fast. Tune its parameters to balance recall against latency: for prototyping, choose settings biased toward recall; later, tune for latency against production traffic.
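For example, with FAISS's HNSW index the main dials look like this (assumes faiss-cpu and numpy are installed; dimensions and values are illustrative):

import faiss
import numpy as np

dim = 384
index = faiss.IndexHNSWFlat(dim, 32)  # 32 = graph connectivity (M)
index.hnsw.efConstruction = 200       # build-time effort
index.add(np.random.rand(10_000, dim).astype("float32"))

index.hnsw.efSearch = 256             # higher: better recall, slower queries
distances, ids = index.search(np.random.rand(1, dim).astype("float32"), 10)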
Checklist
- Can you reproduce the failure? Check where the gold passage ranks for the failing query.
- Are chunks sized sensibly, with overlap?
- Are you combining lexical and vector retrieval?
- Is there a re-ranking step before generation?
- Does the prompt force grounding and permit "I don't know"?
- Is the index kept fresh, and are embeddings and answers cached?
Final takeaway — practical path forward
Start with the quick fixes: measure recall on the queries that fail, repair chunking, and ground the prompt. Layer in hybrid retrieval and re-ranking once the basics hold, and adopt HyDE, corrective loops, or GraphRAG only where evaluation shows they earn their added latency and cost.