Retrieval Augmented Generation (RAG)

How retrievers and generators work together to make AI accurate, current, and useful

RAG is a design pattern that first finds the most relevant pieces of your documents and then asks a language model to write an answer using only those pieces. Think search plus explain. That makes answers more accurate, verifiable, and up-to-date.

🔑 Key Terms Explained (used in the article)

  • Retriever – Finds relevant information from a knowledge base (like a librarian fetching books).
  • Generator – The AI model that writes the answer using the retrieved information (like a writer summarizing the books).
  • Citation – A reference or link that shows where the information came from so you can verify it.
  • Indexing – Organizing documents for faster search, similar to an index at the back of a book.
  • Vectorization – Turning text into numbers (vectors) so AI can compare meaning, not just words.


What RAG is, in simple words

  • Retriever: Finds small passages or documents relevant to the question.
  • Generator: Uses those retrieved passages together with the question to compose the final answer. RAG = Retriever + Generator.


Why people use RAG

  • Keep answers current beyond the model’s training data.
  • Use private or domain data (product docs, legal policies, internal wiki).
  • Improve truthfulness because answers are grounded in real documents and can be cited.


How RAG works — the simple flow

Phase 1: Data Preparation (Indexing) - This is where raw documents are processed and made searchable.

Steps (A–D):

  • (A) Collect documents – Source data from PDFs, websites, or free text.
  • (B) Chunking – Break large documents into smaller overlapping chunks to preserve context at boundaries.
  • (C) Generate embeddings – Convert each chunk into high-dimensional vectors using an embedding model.
  • (D) Store in vector database – Save embeddings along with metadata for fast similarity search.

Flow:

Raw Documents → Chunking → Embeddings → Vector DB        
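
If you want to see what this phase looks like in code, here is a minimal sketch in Python. It assumes the sentence-transformers and faiss-cpu packages; the model name and sample chunks are illustrative, and any embedding model or vector store would work the same way.

# Minimal indexing sketch: embed small chunks and store them in a FAISS index.
import faiss
from sentence_transformers import SentenceTransformer

# (A)+(B) Documents already split into small chunks (see the chunking section below)
chunks = [
    "Product X: returns within 30 days with receipt.",
    "Custom orders are non-refundable.",
    "Standard shipping takes 3-5 business days.",
]

# (C) Generate an embedding vector for each chunk
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks, normalize_embeddings=True)

# (D) Store the vectors in a FAISS index
# (inner product on normalized vectors = cosine similarity)
index = faiss.IndexFlatIP(int(embeddings.shape[1]))
index.add(embeddings)
print(f"Indexed {index.ntotal} chunks")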

Phase 2: Retrieval + Generation

Once the index is ready, the system can handle queries.

Steps (1–5):

  • (1) User query – A natural language question is submitted.
  • (2) Embed query – The query is converted into an embedding.
  • (3) Retrieve top-k chunks – The query embedding is compared with stored vectors in the database to fetch the most relevant chunks.
  • (4) Generator (LLM) – The selected chunks are passed to a language model to construct a grounded answer.
  • (5) Final answer with citations – The system outputs a coherent response, often with references back to the original source.

Flow:

User Query → Embed Query → Vector DB → Retrieve Top-k Chunks → Generator → Answer        
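
And here is a sketch of the query side, continuing from the indexing example above (it reuses model, index, and chunks). The OpenAI client call is just one way to invoke a generator; the model name is illustrative and any LLM endpoint can fill step 4.

# Query-time sketch: embed the query, retrieve top-k chunks, generate a grounded answer.
from openai import OpenAI

# (1)+(2) Embed the user query with the same embedding model used for indexing
query = "What is the return policy for Product X?"
query_vec = model.encode([query], normalize_embeddings=True)

# (3) Retrieve the top-k most similar chunks from the vector index
k = 2
scores, ids = index.search(query_vec, k)
retrieved = [chunks[i] for i in ids[0]]

# (4) Pass the retrieved chunks plus the question to the generator
prompt = (
    "Answer using ONLY the context below. Cite the chunk number you used.\n\n"
    "Context:\n" + "\n".join(f"[{i}] {c}" for i, c in enumerate(retrieved)) +
    f"\n\nQuestion: {query}"
)
client = OpenAI()  # requires OPENAI_API_KEY in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

# (5) Final answer, grounded in the retrieved passages
print(response.choices[0].message.content)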



Small, real-world example (customer support)

User: “What is the return policy for Product X?”

  • Retriever returns two chunks from product docs: one stating “returns within 30 days with receipt” and another with exceptions.
  • Generator uses those passages and replies: “Product X can be returned within 30 days with the original receipt. Exceptions: custom orders are non-refundable. Source: ProductX_FAQ#12.”

This is better than a model guessing numbers or inventing policies.


What is indexing

Indexing means preparing your documents so they are fast to search:

  • Split documents into chunks.
  • Compute embeddings (vectors) for each chunk.
  • Store vectors and metadata in a vector store (FAISS, Pinecone, Weaviate, Supabase, Qdrant).

At query time you search this index instead of scanning raw text.


Why we perform vectorization

Vectorization (embeddings) turns text into numeric vectors where semantic similarity maps to geometric closeness. That means:

  • Phrases with the same meaning end up near each other even if they use different words.
  • Vector search finds relevant content by meaning, not exact keywords.
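
A quick way to see this in action: embed a few phrases and compare their cosine similarity. The snippet below is a small sketch using sentence-transformers; the phrases and model name are only examples.

# Two differently-worded phrases with the same meaning land closer together
# than an unrelated phrase.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
phrases = [
    "How do I send back a faulty item?",                      # same meaning, different words
    "What is the process for returning a defective product?",
    "Standard shipping takes 3-5 business days.",             # unrelated
]
vectors = model.encode(phrases)

print(util.cos_sim(vectors[0], vectors[1]))  # high score: same meaning
print(util.cos_sim(vectors[0], vectors[2]))  # low score: different meaning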


Why RAG exists

  • Models have training cutoffs and limited context windows.
  • RAG lets models consult a searchable memory so they can answer from fresh or private documents without needing retraining.


Why we perform chunking

You rarely embed a whole book as one vector. Chunking splits long texts so retrieval returns focused passages rather than noisy, irrelevant sections. Good chunking improves precision.

Typical guideline

  • Start with 200–600 tokens per chunk depending on your model’s context window.


Why overlapping is used in chunking

When a sentence or fact sits at a chunk boundary it can be lost. Overlap ensures that context spanning chunk boundaries appears in at least one chunk. Overlap increases recall at a small cost in storage and embedding requests.

Common setting

  • Overlap: 10–30% of chunk size (for example chunk 500 tokens, overlap 100 tokens).
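
Here is a minimal word-based chunker with overlap to make the idea concrete. It is only a sketch: real pipelines usually count tokens with the embedding model's tokenizer rather than splitting on whitespace, and the default sizes simply mirror the guideline above.

# Split text into chunks of `chunk_size` words, each sharing `overlap` words
# with the previous chunk so boundary-spanning facts appear in at least one chunk.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), step)
    ]

# Example: a 1200-word document becomes 3 chunks of up to 500 words,
# each overlapping the previous one by 100 words.
doc = " ".join(f"word{i}" for i in range(1200))
for i, chunk in enumerate(chunk_text(doc)):
    print(i, len(chunk.split()))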


Interactive copy-paste prompts you can try

Try these in ChatGPT or your agent-enabled environment.

Planner prompt

Goal: "Find 3 recent blog posts about AI fairness and summarize each in one sentence. Then propose a 3-step learning plan." 
First, list the step-by-step plan and tools you'd use. Then simulate running the plan (note simulated results if live web access isn't available).        

RAG simulation prompt

You have documents: 
Doc1: "Product X: returns within 30 days with receipt." 
Doc2: "Custom orders are non-refundable." 
Question: Can I return Product X after 14 days? Use only the documents.        

Practical tips and best practices

  • Use LangChain or similar libraries to handle chunking, embedding, and connecting to vector stores.
  • Tune chunk size and overlap by experiment. Start chunk size 300–500 tokens, overlap 50–150 tokens.
  • Store metadata (source, date, doc id) so answers can include citations (a small sketch follows this list).
  • Use hybrid retrieval (dense + sparse) and re-ranking for higher precision.
  • Add a verification step for high-stakes answers: re-check generated claims against retrieved passages.
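
For the metadata tip, here is a small sketch of what storing source information next to each chunk can look like. The field names (source, doc_id, date) are illustrative; most vector stores accept an arbitrary metadata dictionary per record.

# Keep a metadata dict alongside each chunk so answers can cite their sources.
chunks_with_metadata = [
    {
        "text": "Product X: returns within 30 days with receipt.",
        "metadata": {"source": "ProductX_FAQ", "doc_id": "FAQ#12", "date": "2024-01-15"},
    },
    {
        "text": "Custom orders are non-refundable.",
        "metadata": {"source": "ProductX_FAQ", "doc_id": "FAQ#13", "date": "2024-01-15"},
    },
]

# At answer time, append the metadata so the user can verify each claim.
def format_citation(record: dict) -> str:
    meta = record["metadata"]
    return f'{record["text"]} (Source: {meta["source"]} {meta["doc_id"]})'

for record in chunks_with_metadata:
    print(format_citation(record))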


🎯 Final Thoughts

RAG is changing how we use AI—making it more accurate, transparent, and reliable. Whether you’re a beginner exploring AI or a professional building solutions, understanding RAG is a must.
