RAG Failure Modes, Quick Fixes, and Advanced Techniques

If you read my earlier post on the basics of Retrieval-Augmented Generation (RAG), you know the core idea: find relevant document chunks and feed them to a language model so answers are grounded in real text. This follow-up shows what typically breaks in real projects, how to triage and fix it fast, and which advanced patterns to adopt as you move from prototype to production.


Quick glossary (read first — these terms come up throughout the post)

  • Retriever — component that finds relevant text passages from your documents.
  • Generator — the LLM that writes the final answer using retrieved passages.
  • Embedding / Vector — a numeric representation of text; similar meaning → nearby vectors.
  • Vector DB — storage optimized for vector search (FAISS, Pinecone, Qdrant, etc.).
  • BM25 (sparse search) — classic keyword search; good for exact phrase matches.
  • Dense / Vector search — semantic search using embeddings.
  • Hybrid retrieval — combine BM25 (sparse) + vector search (dense) to get the benefits of both.
  • ANN (Approximate Nearest Neighbors) — fast method for finding nearby vectors in large datasets (trades some accuracy for speed).
  • Top-k — the top k passages returned by the retriever (e.g., top-5).
  • Re-rank / Cross-encoder — a slower but more accurate model that scores query+passage pairs to choose the best passages.
  • Chunking — splitting long documents into smaller passages.
  • Overlap — repeating a short window between neighboring chunks to prevent losing context at boundaries.
  • HyDE (Hypothetical Document Embeddings) — technique where the LLM first writes a short hypothetical answer, you embed that synthetic answer, and then retrieve documents similar to it.
  • GraphRAG — use a knowledge graph (entities + relationships) to guide retrieval for multi-hop questions.
  • Corrective RAG — iteratively retrieve → generate → evaluate → fetch more if the answer is weak.


Common failure cases and quick mitigations

Poor recall — the right passage exists but isn’t returned

Symptom: the correct answer is in your data but isn’t in top-k.

Fast checks

  • Run a recall test: pick 30 known Q/A pairs and measure whether the gold passage appears in top-10 (a minimal sketch follows).
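
Here is a minimal sketch of that recall test. retrieve() and the gold Q/A pairs are placeholders for your own retriever and labeled data.

def recall_at_k(gold_pairs, retrieve, k=10):
    """Fraction of questions whose gold passage shows up in the top-k results."""
    hits = 0
    for question, gold_passage_id in gold_pairs:
        retrieved_ids = retrieve(question, k=k)  # your retriever; returns passage IDs
        if gold_passage_id in retrieved_ids:
            hits += 1
    return hits / len(gold_pairs)

# Example usage with ~30 hand-labeled pairs:
# gold_pairs = [("How do I reset my password?", "doc_042#chunk_3"), ...]
# print(f"recall@10 = {recall_at_k(gold_pairs, retrieve, k=10):.2f}")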

Quick mitigations

  • Increase top-k (e.g., retrieve 50 then re-rank to top 5).
  • Use hybrid retrieval (BM25 + vector) and merge the candidate lists.
  • Try a different or domain-tuned embedding model.
  • Adjust ANN settings to favor recall (less aggressive approximation).


Bad chunking — chunks too small, too large, or missing overlap

Symptom: answers miss important facts or are noisy.

Why: chunk boundaries cut important sentences, or chunks are too broad.

Quick mitigations

  • Use sentence/token-aware splitters (LangChain, LlamaIndex).
  • Start with 300–500 tokens per chunk and 10–30% overlap (sketched after this list).
  • Merge very small chunks with neighbors.
  • Debug tip: when a query fails, print the top-5 chunks — if the key sentence is split, increase overlap.
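
For illustration, here is a bare-bones sliding-window chunker in plain Python. Whitespace "tokens" stand in for real tokenizer output; in production you would use one of the splitters named above.

def chunk_text(text, chunk_size=400, overlap=80):
    """Split text into windows of chunk_size tokens, repeating `overlap` tokens
    between neighbors so sentences at boundaries appear in both chunks."""
    tokens = text.split()  # crude stand-in for a real tokenizer
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # last window reached the end; avoid emitting a tiny tail chunk
    return chunks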


Query drift — retriever finds “related” text but not what the task needs

Symptom: returned chunks are topically similar but not useful for the specific task (e.g., FAQ vs. policy text).

Fixes

  • Use instruction-aware embedding: prepend a short instruction before embedding, e.g. "Find factual paragraphs that answer: " + user_query (see the sketch after this list).
  • Re-rank with a cross-encoder trained/selected for your task.
  • Use sub-query rewriting: decompose a complex user query into small retrievable questions.
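
A sketch of the instruction-prefix idea using sentence-transformers. The model name and instruction string are examples only; whether a prefix helps depends on the embedder (instruction-tuned families such as E5 or Instructor are designed for it).

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

def embed_query(user_query):
    # Prepend a task instruction so the vector reflects what kind of text we want.
    instruction = "Find factual paragraphs that answer: "
    return model.encode(instruction + user_query)

query_vector = embed_query("What is the refund window for annual plans?")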


Outdated / stale index

Symptom: answers cite old or removed content.

Fixes

  • Implement incremental upserts or event-driven re-indexing when content changes (sketched below).
  • Add last_updated metadata and prefer fresher documents for time-sensitive queries.
  • Monitor “last indexed” timestamp and alert when it’s stale.
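
A sketch of an event-driven upsert hook. The index.upsert() call and the metadata schema are placeholders; adapt them to your vector DB client (Pinecone, Qdrant, etc.).

import time

def reindex_document(index, doc_id, chunks, embed):
    """Re-embed a changed document and upsert it with last_updated metadata."""
    now = int(time.time())
    vectors = [
        {
            "id": f"{doc_id}#chunk_{i}",
            "values": embed(chunk),
            "metadata": {"doc_id": doc_id, "last_updated": now, "text": chunk},
        }
        for i, chunk in enumerate(chunks)
    ]
    index.upsert(vectors)  # wire this to a CMS webhook or change event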


Hallucinations from weak/noisy context

Symptom: LLM invents facts not supported by retrieved text.

Mitigations

  • Strong grounding prompt: instruct the LLM to only use provided passages and say “I don’t know” otherwise.
  • Re-rank candidates to ensure the generator sees high-quality context.
  • Use an LLM-based verification step to check claims against retrieved passages (a sketch follows the prompt below).

Copy-paste grounding prompt

Answer using ONLY the passages below. If none support an answer, reply "I don't know". Cite passage IDs used. 
Passages: 
[1] ... 
[2] ... 
Question: ... 
Answer:        
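
And a sketch of the verification step mentioned above. call_llm is a placeholder for your LLM client; the prompt wording and the SUPPORTED/UNSUPPORTED protocol are illustrative.

VERIFY_PROMPT = """Passages:
{passages}

Claim: {claim}

Is the claim fully supported by the passages? Answer SUPPORTED or UNSUPPORTED."""

def verify_answer(call_llm, answer_claims, passages_text):
    """Keep only the claims the verifier judges supported by the passages."""
    supported = []
    for claim in answer_claims:
        verdict = call_llm(VERIFY_PROMPT.format(passages=passages_text, claim=claim))
        if verdict.strip().upper().startswith("SUPPORTED"):
            supported.append(claim)
    return supported
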

Techniques and patterns to scale accuracy and reliability

Hybrid retrieval (sparse + dense)

Idea: run BM25 and vector search in parallel, merge results, dedupe, then re-rank.

Why: BM25 catches exact phrases and small lexical cues while vector search catches paraphrases and synonyms. Together they reduce misses.
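
One common merge strategy is Reciprocal Rank Fusion (RRF). A minimal sketch, assuming each search returns an ordered list of passage IDs (k=60 is the conventional RRF constant, not a tuned value):

def rrf_merge(bm25_ids, dense_ids, k=60, top_n=20):
    """Merge two rankings; dedup is implicit since each ID accumulates one score."""
    scores = {}
    for ranking in (bm25_ids, dense_ids):
        for rank, passage_id in enumerate(ranking):
            scores[passage_id] = scores.get(passage_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# candidates = rrf_merge(bm25_search(q, 50), vector_search(q, 50))
# ...then hand `candidates` to the re-ranking step described next.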


Re-ranking with cross-encoders or LLM evaluators

Idea: retrieve a broad candidate set quickly, then re-score candidates with a cross-encoder or small LLM that scores query+passage pairs.

Trade-off: more latency/cost, but much better precision. Re-rank only the candidate set (e.g., the top 100), not your entire index, to keep cost manageable.
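
A sketch using the CrossEncoder class from sentence-transformers; the checkpoint name is a common public example, not a recommendation.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, passages, top_n=5):
    """Score (query, passage) pairs jointly and keep the best top_n passages."""
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]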


HyDE — use the model to help retrieval

Idea: ask the LLM to produce a short, hypothetical answer first; embed that synthetic answer and retrieve documents similar to it.

Why it helps: the synthetic answer expresses intent in the model’s own representation space and often improves recall for complex queries.

Simple flow

  • Query → LLM generates hypothetical answer → embed hypothetical → vector search → retrieve best supporting docs → final generation (sketched below).
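
A compact sketch of that flow, where call_llm, embed, and vector_search are placeholders for your own generation, embedding, and retrieval functions:

def hyde_retrieve(query, call_llm, embed, vector_search, k=10):
    # 1. Ask the model for a short hypothetical answer. It may be wrong; we
    #    only use it as a richer search query and never show it to the user.
    hypothetical = call_llm(f"Write a short paragraph that plausibly answers: {query}")
    # 2. Embed the synthetic answer and retrieve real documents near it.
    return vector_search(embed(hypothetical), k=k)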


Corrective RAG — iterative retrieval when confidence is low

Pattern: generate → evaluate → if low confidence, expand retrieval (bigger top-k / different filters), refine prompt, and regenerate.

Good for: high-risk answers where you prefer “I don’t know” over a wrong answer.
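
A minimal version of the loop, assuming a judge_confidence evaluator (often a small LLM) that returns a score between 0 and 1:

def corrective_answer(query, retrieve, generate, judge_confidence,
                      k_schedule=(5, 20, 50), threshold=0.7):
    for k in k_schedule:  # widen retrieval on each attempt
        passages = retrieve(query, k=k)
        answer = generate(query, passages)
        if judge_confidence(query, answer, passages) >= threshold:
            return answer
    return "I don't know"  # prefer abstaining over a confident wrong answer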


Query translation and sub-query rewriting

Query translation: normalize abbreviations, convert domain shorthand, or translate non-English queries into your primary language before embedding.

Sub-query rewriting: decompose a multi-part question into smaller, focused queries (helps multi-hop reasoning).
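
A sketch of sub-query decomposition with a placeholder call_llm client:

def decompose_query(query, call_llm):
    """Ask the LLM to split a multi-part question into focused sub-queries."""
    prompt = ("Break this question into the smallest independent questions "
              f"needed to answer it, one per line:\n{query}")
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]

# sub_queries = decompose_query(
#     "Which supplier makes part X, and what is their lead time?", call_llm)
# Retrieve per sub-query, then merge the candidate sets before generation.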


Contextual embeddings and domain tuning

Idea: fine-tune or select embedding models tuned for your domain (medical, legal, product docs). Contextual or domain embeddings improve retrieval relevance. LangChain and other libraries make swapping embedders easy.


GraphRAG — add structure for multi-hop reasoning

When to use: queries that require linking entities through their relationships (e.g., supply chains, legal citations).

How: build a small knowledge graph of entities and relations; use graph traversal to find candidate documents, then run RAG on those passages (see the sketch below).
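
A toy sketch with networkx; storing doc IDs on graph edges is one illustrative schema choice among many:

import networkx as nx

G = nx.Graph()
G.add_edge("Acme Corp", "Widget-9", relation="manufactures", doc_ids=["doc_12"])
G.add_edge("Widget-9", "Part-X", relation="contains", doc_ids=["doc_47"])

def graph_candidates(graph, entity, hops=2):
    """Collect doc IDs attached to edges within `hops` of the query entity."""
    nearby = nx.single_source_shortest_path_length(graph, entity, cutoff=hops)
    doc_ids = set()
    for _, _, data in graph.edges(nearby, data=True):
        doc_ids.update(data.get("doc_ids", []))
    return doc_ids

# graph_candidates(G, "Acme Corp") -> {"doc_12", "doc_47"}; retrieve those
# passages, then run the normal RAG pipeline over them.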


Caching and cost control

  • Cache retrieval results for common queries (expire when source changes).
  • Cache final answers for truly frequent queries.
  • Invalidate caches when relevant docs update (a minimal sketch follows).
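
A minimal retrieval cache keyed by the normalized query plus an index version number; bumping the version on every re-index implicitly invalidates stale entries:

retrieval_cache = {}
index_version = 0  # increment whenever source documents are re-indexed

def cached_retrieve(query, retrieve, k=10):
    key = (query.strip().lower(), index_version)
    if key not in retrieval_cache:
        retrieval_cache[key] = retrieve(query, k=k)
    return retrieval_cache[key]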


ANN and tuning for latency vs recall

ANN (Approximate Nearest Neighbors) is how vector DBs search quickly. Tune ANN parameters to balance recall and latency: for prototyping, choose settings biased toward recall; later, tune for latency using production traffic.
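
A FAISS HNSW sketch: efSearch is the main recall/latency knob at query time, and the dimensions and parameter values here are illustrative starting points.

import faiss
import numpy as np

d = 384                              # embedding dimension
index = faiss.IndexHNSWFlat(d, 32)   # 32 = graph connectivity (M)
index.hnsw.efConstruction = 200      # build-time quality
index.add(np.random.rand(10_000, d).astype("float32"))  # stand-in vectors

index.hnsw.efSearch = 256            # biased toward recall for prototyping
# index.hnsw.efSearch = 32           # later: lower it to cut production latency
distances, ids = index.search(np.random.rand(1, d).astype("float32"), 10)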


Checklist

  • Chunking w/ overlap + sensible token sizes.
  • Hybrid retrieval + ANN settings tuned for recall.
  • Re-ranking layer (cross-encoder or LLM evaluator).
  • Grounding prompts + verification step.
  • Caching & cache invalidation rules.
  • Monitoring: recall@k, latency p95, index freshness alerts, hallucination/error rates.
  • Human-in-loop paths for high-impact decisions.
  • Audit logs: store passage IDs and the prompt for every answer (vital for debugging).


Final takeaway — practical path forward

  • If you’re troubleshooting: start with hybrid retrieval, increase top-k, inspect top 10 chunks, and add a re-rank pass. Ground the model with a strict prompt and add a simple verification step.
  • When scaling: add HyDE for better recall on complex queries, use caching for hot traffic, and consider GraphRAG for multi-hop needs.
  • Always measure: recall@k, latency, index freshness, and hallucination rate — these drive the right engineering choices.
