RAG Failure Modes, Quick Fixes, and Advanced Techniques
If you read my earlier post on the basics of Retrieval-Augmented Generation (RAG), you know the core idea: find relevant document chunks, and feed them to a language model so answers are grounded in real text. This follow-up shows what typically breaks in real projects, how to triage and fix it fast, and which advanced patterns to adopt when you move from prototype to production.
Quick glossary (read first; these terms recur throughout)
- Chunk: a passage split out of a source document for indexing.
- Embedding: a vector representation of text used for similarity search.
- Top-k: the k highest-scoring chunks a retriever returns.
- BM25: a classic lexical (keyword) ranking function; the standard "sparse" retriever.
- ANN: approximate nearest-neighbour search, the fast lookup inside vector databases.
- Cross-encoder: a model that scores a query and a passage jointly; slower but more precise than embedding similarity.
- Re-ranking: re-scoring a small candidate set with a stronger, slower model.
Common failure cases and quick mitigations
Poor recall — the right passage exists but isn’t returned
Symptom: the correct answer is in your data but isn’t in top-k.
Fast checks
- Confirm the document was actually ingested: search the index for a distinctive phrase from it.
- Query using the exact wording of the known answer; if even that fails, suspect chunking or the embedding model.
- Check where the known-good passage ranks for the failing query (a diagnostic sketch follows).
Quick mitigations
- Increase top-k and let a re-ranker (covered below) restore precision.
- Add lexical search alongside vector search; see hybrid retrieval below.
- Re-chunk with overlap so key sentences are not split across boundaries.
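The diagnostic sketch, in Python, assuming sentence-transformers is installed and you can enumerate your indexed chunks; the model name and helper function are illustrative, not part of any specific stack:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

def rank_of_gold(query: str, gold_chunk: str, all_chunks: list[str]) -> int:
    # Embed the query and every indexed chunk, then find where the
    # known-correct chunk lands in the similarity ordering.
    q = model.encode(query, convert_to_tensor=True)
    c = model.encode(all_chunks, convert_to_tensor=True)
    sims = util.cos_sim(q, c)[0]
    order = sims.argsort(descending=True).tolist()
    return order.index(all_chunks.index(gold_chunk))  # 0 = retrieved first

If the gold chunk ranks far below your top-k, the problem is retrieval (embeddings or chunking), not the generator.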
Bad chunking — too small, too large, or no overlap
Symptom: answers miss important facts or are noisy.
Why: chunk boundaries cut important sentences, or chunks are too broad.
Quick mitigations
- Split on sentence or paragraph boundaries rather than fixed character counts.
- Add overlap (for example, 10–20% of the chunk size) so boundary sentences appear in two chunks.
- Keep chunks large enough to stand alone but small enough to stay on one topic; a minimal chunker sketch follows.
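The chunker sketch; the sizes are illustrative starting points, and it splits on whitespace for brevity where production code should split on sentences or tokens:

def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    # Slide a window of `size` words, advancing so that consecutive
    # chunks share `overlap` words across the boundary.
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append(" ".join(words[start:start + size]))
    return chunks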
Query drift — retriever finds “related” text but not what the task needs
Symptom: returned chunks are topically similar but not useful for the specific task (e.g., FAQ vs. policy text).
Fixes
- Rewrite the query to state the task explicitly (see query translation below).
- Filter candidates by metadata such as document type, section, or date before ranking; a sketch follows.
- Re-rank candidates against the task-specific query rather than the raw user text.
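The metadata-filter sketch; the chunk dictionary shape and the "doc_type" key are assumptions about your own schema:

def filter_by_type(chunks: list[dict], wanted: str) -> list[dict]:
    # Keep only candidates whose metadata matches the task's document type.
    return [c for c in chunks if c.get("meta", {}).get("doc_type") == wanted]

candidates = [
    {"text": "Refunds are issued within 14 days.", "meta": {"doc_type": "policy"}},
    {"text": "Q: How do I get a refund?", "meta": {"doc_type": "faq"}},
]
policy_only = filter_by_type(candidates, "policy")  # drops the FAQ chunk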
Outdated / stale index
Symptom: answers cite old or removed content.
Fixes
- Re-index on a schedule, or incrementally whenever a source document changes.
- Delete vectors for removed documents; don't just add new ones.
- Store a last-updated timestamp in chunk metadata so staleness is visible. A change-detection sketch follows.
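The change-detection sketch: hash each document's content and re-embed only on change. The stored_hashes store and reindex hook are placeholders for your own pipeline:

import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sync(docs: dict[str, str], stored_hashes: dict[str, str], reindex) -> None:
    # Re-embed only documents whose content hash changed; delete vectors
    # for documents that disappeared from the source of truth.
    for doc_id, text in docs.items():
        h = content_hash(text)
        if stored_hashes.get(doc_id) != h:
            reindex(doc_id, text)
            stored_hashes[doc_id] = h
    for doc_id in set(stored_hashes) - set(docs):
        reindex(doc_id, None)  # signal deletion
        del stored_hashes[doc_id]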
Hallucinations from weak/noisy context
Symptom: LLM invents facts not supported by retrieved text.
Mitigations
- Constrain the model to the retrieved passages and require citations (copy-paste prompt below).
- Tell the model explicitly to answer "I don't know" when the passages don't support an answer.
- Filter or re-rank noisy candidates before they ever reach the prompt.
Copy-paste grounding prompt
Answer using ONLY the passages below. If none support an answer, reply "I don't know". Cite passage IDs used.
Passages:
[1] ...
[2] ...
Question: ...
Answer:
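For convenience, a small helper that assembles this prompt from retrieved passages; the function name and passage list shape are illustrative:

def grounding_prompt(question: str, passages: list[str]) -> str:
    # Number the passages so the model can cite them by ID.
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the passages below. If none support an answer, "
        'reply "I don\'t know". Cite passage IDs used.\n'
        f"Passages:\n{numbered}\n"
        f"Question: {question}\n"
        "Answer:"
    )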
Techniques and patterns to scale accuracy and reliability
Hybrid retrieval (sparse + dense)
Idea: run BM25 and vector search in parallel, merge results, dedupe, then re-rank.
Why: BM25 catches exact phrases and small lexical cues while vector search catches paraphrases and synonyms. Together they reduce misses.
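One simple way to merge the two result lists is reciprocal rank fusion (RRF); a self-contained sketch, assuming each retriever returns a ranked list of document IDs:

from collections import defaultdict

def rrf_merge(bm25_ids: list[str], vector_ids: list[str], k: int = 60) -> list[str]:
    # Each list contributes 1/(k + rank) per document; summing rewards
    # documents that rank well in either retriever. k=60 is the constant
    # from the original RRF paper.
    scores: dict[str, float] = defaultdict(float)
    for ranking in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

rrf_merge(["a", "b", "c"], ["c", "a", "d"])  # -> ['a', 'c', 'b', 'd']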
Re-ranking with cross-encoders or LLM evaluators
Idea: retrieve a broad candidate set quickly, then re-score candidates with a cross-encoder or small LLM that scores query+passage pairs.
Trade-off: more latency and cost, but much better precision. Re-rank only the candidate set (e.g., the top 100), not your entire index, to keep cost manageable.
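A sketch using sentence-transformers' CrossEncoder; the checkpoint is one commonly used public model, not a requirement:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, passages: list[str], top_n: int = 5) -> list[str]:
    # Score every (query, passage) pair jointly, then keep the best few.
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(scores, passages), key=lambda x: x[0], reverse=True)
    return [p for _, p in ranked[:top_n]]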
HyDE — use the model to help retrieval
Idea: ask the LLM to produce a short, hypothetical answer first; embed that synthetic answer and retrieve documents similar to it.
Why it helps: the synthetic answer expresses intent in the model’s own representation space and often improves recall for complex queries.
Simple flow: 1) ask the LLM for a short hypothetical answer; 2) embed that synthetic answer; 3) retrieve the passages nearest to it; 4) generate the final answer from the retrieved (real) passages.
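In code the flow is only a few lines; llm_complete, embed, and vector_search are placeholders for your own LLM call, embedder, and vector store:

def hyde_retrieve(question: str, llm_complete, embed, vector_search, top_k: int = 10):
    # 1) Draft a short hypothetical answer (it may be wrong; that's fine).
    hypothetical = llm_complete(f"Write a short passage that would answer: {question}")
    # 2) Embed the synthetic answer instead of the raw question.
    query_vector = embed(hypothetical)
    # 3) Retrieve real passages near the synthetic answer's embedding.
    return vector_search(query_vector, top_k=top_k)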
Corrective RAG — iterative retrieval when confidence is low
Pattern: generate → evaluate → if low confidence, expand retrieval (bigger top-k / different filters), refine prompt, and regenerate.
Good for: high-risk answers where you prefer “I don’t know” over a wrong answer.
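A sketch of the loop; retrieve, generate, and confidence are placeholders (the evaluator could be an LLM self-check or a score from your re-ranker), and the thresholds are illustrative:

def corrective_answer(question, retrieve, generate, confidence,
                      top_ks=(5, 20, 50), threshold=0.7):
    # Expand retrieval each round; stop as soon as the evaluator is satisfied.
    for top_k in top_ks:
        passages = retrieve(question, top_k=top_k)
        answer = generate(question, passages)
        if confidence(question, answer, passages) >= threshold:
            return answer
    return "I don't know"  # prefer abstaining to guessing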
Query translation and sub-query rewriting
Query translation: normalize abbreviations, convert domain shorthand, or translate non-English queries into your primary language before embedding.
Sub-query rewriting: decompose a multi-part question into smaller, focused queries (helps multi-hop reasoning).
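A decomposition sketch; llm_complete and retrieve are placeholders, and the one-sub-question-per-line prompt format is just one workable convention:

def multi_hop_retrieve(question: str, llm_complete, retrieve) -> list[str]:
    decomposition = llm_complete(
        f"Split this question into independent sub-questions, one per line:\n{question}"
    )
    sub_queries = [q.strip() for q in decomposition.splitlines() if q.strip()]
    passages, seen = [], set()
    for sub in sub_queries:
        for p in retrieve(sub, top_k=5):
            if p not in seen:  # dedupe passages shared across sub-queries
                seen.add(p)
                passages.append(p)
    return passages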
Contextual embeddings and domain tuning
Idea: fine-tune or select embedding models tuned for your domain (medical, legal, product docs). Contextual or domain embeddings improve retrieval relevance. LangChain and other libraries make swapping embedders easy.
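With sentence-transformers the swap is typically one line; both model names below are examples, and the domain checkpoint is hypothetical:

from sentence_transformers import SentenceTransformer

general = SentenceTransformer("all-MiniLM-L6-v2")
# domain = SentenceTransformer("your-org/legal-embeddings")  # hypothetical checkpoint

vectors = general.encode(["chunk one", "chunk two"], normalize_embeddings=True)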
GraphRAG — add structure for multi-hop reasoning
When to use: queries that require linking entities and relationships (e.g., supply chains, legal citations).
How: build a small knowledge graph of entities and relations, use graph traversal to find candidate documents, then run RAG on those passages.
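A toy traversal sketch with networkx; the entities, relations, and document IDs are invented for illustration, and entity extraction is out of scope:

import networkx as nx

G = nx.Graph()
G.add_edge("AcmeCorp", "SupplierX", docs={"contract_12"})
G.add_edge("SupplierX", "PortOfRotterdam", docs={"shipping_7"})

def candidate_docs(entity: str, hops: int = 2) -> set[str]:
    # Collect documents attached to edges within `hops` of the entity,
    # then run ordinary RAG over just those documents.
    docs: set[str] = set()
    subgraph = nx.ego_graph(G, entity, radius=hops)
    for u, v, data in subgraph.edges(data=True):
        docs |= data.get("docs", set())
    return docs

candidate_docs("AcmeCorp")  # -> {'contract_12', 'shipping_7'}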
Caching and cost control
- Cache embeddings so the same chunk or query text is never embedded twice; a sketch follows.
- Cache final answers for frequent, identical questions.
- Batch embedding and LLM calls to cut per-request overhead.
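The embedding-cache sketch, keyed by content hash; embed is a placeholder for your real embedding call:

import hashlib

_cache: dict[str, list[float]] = {}

def cached_embed(text: str, embed) -> list[float]:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = embed(text)  # pay the compute/API cost only once
    return _cache[key]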
ANN tuning for latency vs. recall
ANN (approximate nearest neighbour) search is how vector databases stay fast. Tune its parameters to balance recall against latency: for prototyping, choose settings biased toward recall; later, tune for latency against production traffic.
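For example, with FAISS's HNSW index the main dials look like this (assumes faiss-cpu and numpy are installed; dimensions and values are illustrative):

import faiss
import numpy as np

dim = 384
index = faiss.IndexHNSWFlat(dim, 32)  # 32 = graph connectivity (M)
index.hnsw.efConstruction = 200       # build-time effort
index.add(np.random.rand(10_000, dim).astype("float32"))

index.hnsw.efSearch = 256             # higher: better recall, slower queries
distances, ids = index.search(np.random.rand(1, dim).astype("float32"), 10)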
Checklist
- Can you reproduce the failure? Check where the gold passage ranks for the failing query.
- Are chunks sized sensibly, with overlap?
- Are you combining lexical and vector retrieval?
- Is there a re-ranking step before generation?
- Does the prompt force grounding and permit "I don't know"?
- Is the index kept fresh, and are embeddings and answers cached?
Final takeaway — practical path forward
Start with the quick fixes: measure recall on the queries that fail, repair chunking, and ground the prompt. Layer in hybrid retrieval and re-ranking once the basics hold, and adopt HyDE, corrective loops, or GraphRAG only where evaluation shows they earn their added latency and cost.