How to Improve Retrieval-Augmented Generation Architectures

Explore top LinkedIn content from expert professionals.

Summary

Retrieval-augmented generation (RAG) combines search and generative AI to deliver more accurate, grounded responses by pulling in external information at generation time. Improving these architectures means making the retrieval and generation stages work hand in hand, so the results are trustworthy, relevant, and dependable for real-world users.

  • Refine retrieval queries: Make use of strategies like entity-aware search, hybrid keyword and vector methods, and multi-step filtering to ensure the information fed into your AI is as relevant as possible.
  • Compress and filter context: Use smart compression or agent-based filtering systems to remove unnecessary or noisy information before passing it to your generative model, which can make answers more precise and reduce system strain.
  • Measure and adjust outputs: Track metrics like relevance and faithfulness of answers, and adjust your retrieval and generation process based on these results to consistently improve accuracy and reliability.
Summarized by AI based on LinkedIn member posts
  • Armand Ruiz

    building AI systems @meta

    206,811 followers

    Meta delivered a RAG rethink, and they called it REFRAG.

    Traditional Retrieval-Augmented Generation (RAG) has a scaling problem. Most of the context we feed into LLMs during RAG is irrelevant. Worse, we process it anyway, token by token, blowing up memory and latency for minimal gain.

    The new Superintelligence team at Meta just proposed a fix: REFRAG. It does something deceptively simple and profoundly effective: instead of feeding the full retrieved text, it compresses it into embeddings before decoding. Think of it as skipping the small talk and jumping straight to the point.

    Why it matters:
    1/ Up to 30x faster time-to-first-token than standard RAG pipelines.
    2/ No loss in perplexity (a rarity with this kind of optimization).
    3/ Works across multi-turn conversations, summarization, and standard RAG, all without retraining the base model.

    And perhaps the most interesting part? It uses a lightweight RL policy to learn which chunks need full text and which don’t: dynamic, adaptive compression at inference time.

    This isn’t just a speed hack. It’s a shift in how we architect context for LLMs. More context no longer means slower models. That changes how we design systems and what we expect from them.

    Link to the paper: https://lnkd.in/gwsrS-H8
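
    For intuition, here is a toy sketch of the mixed-granularity context the post describes: most chunks are collapsed to embedding vectors, and only chunks a policy marks as high-value keep their full text. This is a loose illustration of the idea, not Meta's code; embed(), the scores, and the threshold are all stand-ins.

    # Toy sketch of the REFRAG idea (hypothetical names, not Meta's code).
    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Stand-in embedder: any sentence-embedding model would go here.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.standard_normal(768)

    def compress_context(chunks, policy_scores, keep_threshold=0.8):
        # Return a mixed context: raw text for high-scoring chunks,
        # compact embedding vectors for everything else.
        context = []
        for chunk, score in zip(chunks, policy_scores):
            if score >= keep_threshold:      # policy says: keep full text
                context.append(("text", chunk))
            else:                            # policy says: an embedding is enough
                context.append(("embedding", embed(chunk)))
        return context

    chunks = ["intro boilerplate...", "the key fact the answer needs", "footer..."]
    scores = [0.1, 0.95, 0.05]  # in REFRAG these come from a learned RL policy
    print([kind for kind, _ in compress_context(chunks, scores)])
    # ['embedding', 'text', 'embedding']

    The decoder then consumes a handful of dense vectors plus one short text span instead of thousands of tokens, which is where the latency win comes from.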

  • Aishwarya Srinivasan
    628,012 followers

    If you are an AI Engineer building production-grade GenAI systems, RAG should be in your toolkit.

    LLMs are powerful for information generation, but:
    → They hallucinate
    → They don’t know anything post-training
    → They struggle with out-of-distribution queries

    RAG solves this by injecting external knowledge at inference time. But basic RAG (retrieval + generation) isn’t enough for complex use cases. You need advanced techniques to make it reliable in production. Let’s break it down 👇

    🧠 Basic RAG = Retrieval → Generation
    You ask a question.
    → The retriever fetches top-k documents (via vector search, BM25, etc.)
    → The LLM answers based on the query + retrieved context
    But this naive setup fails quickly in the wild. You need to address two hard problems:
    1. Are we retrieving the right documents?
    2. Is the generator actually using them faithfully?
    (A minimal code sketch of this basic loop follows the post.)

    ⚙️ Advanced RAG = Engineering Both Ends
    To improve retrieval, we have techniques like:
    → Chunk size tuning (fixed vs. recursive splitting)
    → Sliding window chunking (for dense docs)
    → Structured data retrieval (tables, graphs, SQL)
    → Metadata-aware search (filtering by author/date/type)
    → Mixed retrieval (hybrid keyword + dense)
    → Embedding fine-tuning (aligning to domain-specific semantics)
    → Question rewriting (to improve recall)
    To improve generation, options include:
    → Compressing retrieved docs (summarization, reranking)
    → Generator fine-tuning (rewarding citation usage and reasoning)
    → Re-ranking outputs (scoring factuality or domain accuracy)
    → Plug-and-play adapters (LoRA, QLoRA, etc.)

    🧪 Beyond Modular: Joint Optimization
    Some of the most promising work goes further:
    → Fine-tuning retriever + generator end-to-end
    → Retrieval training via generation loss (REACT, RETRO-style)
    → Generator-enhanced search (the LLM reformulates the query for better retrieval)
    This is where RAG starts to feel less like a bolt-on patch and more like a full-stack system.

    📏 How Do You Know It's Working?
    Key metrics to track:
    → Context Relevance (Are the right docs retrieved?)
    → Answer Faithfulness (Did the LLM stay grounded?)
    → Negative Rejection (Does it avoid answering when nothing relevant is retrieved?)
    → Tools: RAGAS, FaithfulQA, nDCG, Recall@k

    🛠️ Arvind and I are kicking off a hands-on workshop on RAG
    This first session is designed for beginner to intermediate practitioners who want to move beyond theory and actually build. Here’s what you’ll learn:
    → How RAG enhances LLMs with real-time, contextual data
    → Core concepts: vector DBs, indexing, reranking, fusion
    → Build a working RAG pipeline using LangChain + Pinecone
    → Explore no-code/low-code setups and real-world use cases
    If you're serious about building with LLMs, this is where you start.

    📅 Save your seat and join us live: https://lnkd.in/gS_B7_7d

    Image source: LlamaIndex
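
    As a companion to the "Basic RAG = Retrieval → Generation" breakdown above, here is a minimal, library-agnostic sketch of that loop. It is not the workshop's LangChain + Pinecone code; embed and llm are stand-ins for whatever embedding model and LLM you plug in.

    # Minimal retrieve-then-generate loop (library-agnostic sketch).
    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def retrieve(query_vec, doc_vecs, docs, k=3):
        # Retrieval half: rank documents by similarity, keep the top k.
        order = np.argsort([cosine(query_vec, v) for v in doc_vecs])[::-1]
        return [docs[i] for i in order[:k]]

    def rag_answer(query, docs, doc_vecs, embed, llm, k=3):
        # Generation half: ground the LLM in the retrieved context only.
        context = "\n\n".join(retrieve(embed(query), doc_vecs, docs, k))
        return llm(f"Answer from this context only:\n{context}\n\nQ: {query}")

    Everything in the "Advanced RAG" list above is an upgrade to one of these two functions: better chunking and hybrid search improve retrieve(), while compression, reranking, and fine-tuning improve what rag_answer() feeds the model.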

  • Ravit Jain

    Founder & Host of "The Ravit Show" | Influencer & Creator | LinkedIn Top Voice | Startups Advisor | Gartner Ambassador | Data & AI Community Builder | Influencer Marketing B2B | Marketing & Media | (Mumbai/San Francisco)

    169,182 followers

    RAG just got smarter.

    If you’ve been working with Retrieval-Augmented Generation (RAG), you probably know the basic setup: an LLM retrieves documents based on a query and uses them to generate better, grounded responses. But as use cases get more complex, we need more advanced retrieval strategies, and that’s where these four techniques come in:

    Self-Query Retriever
    Instead of relying on static prompts, the model creates its own structured query based on metadata. Say a user asks: “What are the reviews with a score greater than 7 that say bad things about the movie?” This technique breaks that down into query + filter logic, letting the model interact directly with structured data (in a store like Chroma DB) using the right filters.

    Parent Document Retriever
    Here, retrieval happens in two stages:
    1. Identify the most relevant chunks
    2. Pull in their parent documents for full context
    This ensures you don’t lose meaning just because information was split across small segments. (A short sketch follows this post.)

    Contextual Compression Retriever (Reranker)
    Sometimes the top retrieved documents are close, but not quite right. This retriever pulls the top K (say 4) documents, then uses a transformer-based reranker (like Cohere’s) to compress and re-rank the results based on both query and context, keeping only the most relevant bits.

    Multi-Vector Retrieval Architecture
    Instead of matching a single vector per document, this method breaks both queries and documents into multiple token-level vectors using models like ColBERT. Retrieval happens across all vectors, giving you higher recall and more precise results for dense, knowledge-rich tasks.

    These aren’t just fancy tricks. They solve real-world problems like:
    • “My agent’s answer missed part of the doc.”
    • “Why is the model returning irrelevant data?”
    • “How can I ground this LLM more effectively in enterprise knowledge?”

    As RAG continues to scale, these kinds of techniques are becoming foundational. So if you’re building search-heavy or knowledge-aware AI systems, it’s time to level up beyond basic retrieval.

    Which of these approaches are you most excited to experiment with? #ai #agents #rag #theravitshow
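
    Here is a rough sketch of the parent-document idea from the list above: search small chunks for precision, then hand the generator the parent documents for full context. The helper names (chunker, embed, score) are hypothetical stand-ins, not a specific framework's API.

    # Parent-document retrieval sketch (hypothetical helper functions).
    def build_index(parents, chunker, embed):
        # Index chunk embeddings, remembering which parent each chunk came from.
        index = []  # list of (chunk_vector, parent_id)
        for pid, doc in enumerate(parents):
            for chunk in chunker(doc):
                index.append((embed(chunk), pid))
        return index

    def retrieve_parents(query, index, parents, embed, score, k=4):
        # Stage 1: rank chunks. Stage 2: collapse chunk hits into parent docs.
        qv = embed(query)
        ranked = sorted(index, key=lambda entry: score(qv, entry[0]), reverse=True)
        seen, results = set(), []
        for _, pid in ranked:
            if pid not in seen:          # de-duplicate chunks from the same parent
                seen.add(pid)
                results.append(parents[pid])
            if len(results) == k:
                break
        return results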

  • Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    16,024 followers

    Excited to share a groundbreaking advancement in Retrieval-Augmented Generation (RAG): introducing MAIN-RAG, Multi-Agent Filtering RAG!

    This innovative framework tackles a critical challenge in RAG systems: the quality of retrieved documents. Traditional RAG approaches often struggle with irrelevant or noisy documents that degrade performance and reliability.

    Here's what makes MAIN-RAG special:

    >> Architecture
    MAIN-RAG employs three specialized LLM agents working in concert:
    - Agent-1 (Predictor) infers initial answers from retrieved documents
    - Agent-2 (Judge) evaluates document relevance and assigns confidence scores
    - Agent-3 (Final-Predictor) generates the final response using filtered, high-quality documents

    >> Key Innovations
    - Training-free implementation requiring no additional labeled data or fine-tuning
    - Dynamic filtering mechanism that adapts relevance thresholds based on score distributions
    - Inter-agent consensus approach ensuring robust document selection
    - Significant performance gains: 2-11% improvement in answer accuracy while reducing irrelevant documents

    The research team from Texas A&M University and Visa Research has demonstrated MAIN-RAG's effectiveness across multiple QA benchmarks, showing particular strength in scenarios requiring external knowledge validation.

    This work represents a significant step forward in making RAG systems more reliable and accurate. The training-free nature makes it immediately applicable for real-world applications.
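
    Paraphrasing the architecture above, the agent flow might look roughly like this. It is an illustrative sketch only: llm is a stand-in chat call, the prompts are invented, and the mean-based threshold is a simplification of the paper's score-distribution mechanism.

    # Rough sketch of a MAIN-RAG-style three-agent flow (illustrative only).
    def main_rag(question, documents, llm):
        # Agent-1 (Predictor): draft an answer per retrieved document.
        drafts = [llm(f"Answer '{question}' using only:\n{d}") for d in documents]

        # Agent-2 (Judge): score each document's relevance, given the draft.
        # Assumes the judge replies with a bare number (sketch simplification).
        scores = [float(llm(f"Score 0-1: does this document support the answer?\n"
                            f"Doc: {d}\nDraft: {a}"))
                  for d, a in zip(documents, drafts)]

        # Adaptive threshold from the score distribution (training-free filtering).
        threshold = sum(scores) / len(scores)
        kept = [d for d, s in zip(documents, scores) if s >= threshold]

        # Agent-3 (Final-Predictor): answer from the filtered, high-quality set.
        return llm(f"Answer '{question}' using only:\n" + "\n\n".join(kept))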

  • Cornellius Y.

    Data Scientist & AI Engineer | Data Insight | Helping Orgs Scale with Data

    44,004 followers

    🚀 𝐄𝐧𝐡𝐚𝐧𝐜𝐢𝐧𝐠 𝐒𝐞𝐚𝐫𝐜𝐡 𝐟𝐨𝐫 𝐌𝐨𝐫𝐞 𝐑𝐞𝐥𝐞𝐯𝐚𝐧𝐭 𝐑𝐀𝐆 𝐑𝐞𝐬𝐮𝐥𝐭𝐬

    Retrieval-augmented generation (RAG) systems depend on both retrieval and generation to produce high-quality responses. However, if the retrieval process isn’t effective, even the best LLMs will struggle to generate useful outputs.

    The solution? 𝐄𝐧𝐡𝐚𝐧𝐜𝐞𝐝 𝐑𝐞𝐭𝐫𝐢𝐞𝐯𝐚𝐥 𝐓𝐞𝐜𝐡𝐧𝐢𝐪𝐮𝐞𝐬. Instead of relying on a basic retrieval system, we can refine queries and retrieval strategies to improve accuracy and relevance. Here are four techniques that can enhance retrieval performance:

    📌 𝐄𝐧𝐭𝐢𝐭𝐲-𝐀𝐰𝐚𝐫𝐞 𝐑𝐞𝐭𝐫𝐢𝐞𝐯𝐚𝐥
    Use named entities (e.g., people, locations, organizations) to refine search queries.
    ✅ Benefits: Improves precision by focusing on domain-specific terminology and reducing ambiguity.

    📌 𝐇𝐲𝐛𝐫𝐢𝐝 𝐒𝐩𝐚𝐫𝐬𝐞-𝐃𝐞𝐧𝐬𝐞 𝐑𝐞𝐭𝐫𝐢𝐞𝐯𝐚𝐥
    Combine sparse retrieval (e.g., BM25) with dense vector search (embeddings) for better relevance.
    ✅ Benefits: Balances precision and recall, covering both keyword-based and semantic search.

    📌 𝐌𝐮𝐥𝐭𝐢-𝐒𝐭𝐞𝐩 𝐃𝐨𝐜𝐮𝐦𝐞𝐧𝐭 𝐑𝐞𝐭𝐫𝐢𝐞𝐯𝐚𝐥
    Retrieve documents iteratively, refining queries and filtering results in multiple stages.
    ✅ Benefits: Increases relevance for complex queries and eliminates noisy or duplicate results.

    📌 𝐇𝐲𝐩𝐨𝐭𝐡𝐞𝐭𝐢𝐜𝐚𝐥 𝐃𝐨𝐜𝐮𝐦𝐞𝐧𝐭 𝐄𝐦𝐛𝐞𝐝𝐝𝐢𝐧𝐠 (𝐇𝐲𝐃𝐄)
    Generate a pseudo-document from the query before retrieval, improving search results. (A sketch of this one follows the post.)
    ✅ Benefits: Helps when queries are short, vague, or lack sufficient context.

    🛠 How These Techniques Improve RAG
    1️⃣ They increase recall, ensuring important documents aren’t missed.
    2️⃣ They reduce noise, preventing irrelevant or duplicate context from misleading the generation step.
    3️⃣ They handle complex queries better, allowing for better reasoning and improved search expansion.

    💡 Key Takeaways
    🔑 Better retrieval leads to better generation: fix retrieval first!
    🔑 Simple techniques like entity-aware retrieval can drastically improve RAG results.

    ✍️ Want to dive deeper? Read the full article here: https://lnkd.in/gYv9UWuy
    🔗 RAG-To-Know Repository: https://lnkd.in/gQqqQd2a

    What are your thoughts? Have you used any of these techniques before? Let’s discuss this in the comments! 👇👇👇
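
    A minimal sketch of HyDE as described above. The helpers (llm, embed, vector_search) are hypothetical stand-ins for your model calls and vector store, not a specific library's API.

    # HyDE sketch: search with the embedding of a generated pseudo-answer.
    def hyde_search(query, llm, embed, vector_search, k=5):
        # 1. Ask the LLM to imagine the document that would answer the query.
        hypothetical = llm(f"Write a short passage that answers: {query}")
        # 2. Embed the pseudo-document; it usually lands closer to real answer
        #    documents in embedding space than a short or vague query does.
        return vector_search(embed(hypothetical), k=k)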

  • Vaibhava Lakshmi Ravideshik

    AI for Science @ GRAIL | Research Lead @ Massachusetts Institute of Technology - Kellis Lab | LinkedIn Learning Instructor | Author - “Charting the Cosmos: AI’s expedition beyond Earth” | TSI Astronaut Candidate

    20,077 followers

    We’ve all heard the term RAG - Retrieval-Augmented Generation - tossed around as the secret sauce behind grounded LLMs. In essence, it’s like giving an LLM a library card: before answering your question, it goes and fetches the most relevant documents (or graph facts), then uses them to reason and generate a reply.

    But here’s the catch: the retriever and the generator don’t really talk to each other. The retriever decides what’s relevant. The generator tries to make sense of whatever it’s given. If the retriever grabs noisy or incomplete data, the generator can’t correct it. And if the generator struggles, it can’t tell the retriever how to do better.

    That’s the “broken conversation” D-RAG (Differentiable Retrieval-Augmented Generation) sets out to fix.

    Think of the retriever as a spotlight scanning a huge knowledge graph (like Freebase or Wikidata) for the most useful facts to answer a question. Normally, that spotlight’s movements are controlled by rough heuristics; you can’t teach it through gradients because its decisions are discrete (“select this fact, skip that one”). D-RAG changes that by adding a soft switch. It uses a clever mathematical trick called Gumbel-Softmax, which lets the retriever make selections that are almost discrete but still smooth enough for gradients to flow through. This means the system can now learn end-to-end: the generator’s success or failure directly tunes how the retriever behaves next time.

    The retriever is powered by a Graph Neural Network that encodes not just words but the structure of the knowledge graph: who’s connected to whom, through what relationship. Then, instead of just handing over a list of triples, D-RAG builds a neural prompt, a text-plus-structure hybrid that the LLM can understand while still preserving graph context.

    The result? A pipeline where the retriever and generator evolve together, reducing noise, keeping the reasoning chain intact, and boosting both precision and recall in benchmarks like WebQSP and CWQ.

    This may sound technical, but it points toward something big: models that don’t just retrieve knowledge but learn what kind of knowledge helps reasoning. In a way, D-RAG teaches machines a subtle human skill: learning how to look things up better, based on how well you understood them last time. Imagine RAG systems that self-improve their “research habits,” or question-answering agents that adapt their retrieval strategy depending on how confident they are. That’s the frontier this paper hints at.

    Full length paper: https://lnkd.in/g9VHGGA9

    #ArtificialIntelligence #MachineLearning #DeepLearning #NaturalLanguageProcessing #GenerativeAI #LLMs #RetrievalAugmentedGeneration #RAG #DifferentiableRAG #KnowledgeGraphs #KnowledgeGraphQA #GraphNeuralNetworks #GraphAI #NeuralRetrieval #EndToEndLearning #ReasoningSystems #AIResearch #EMNLP2025 #AIInnovation #FutureOfAI #ComputationalLinguistics #AIEvolution
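
    The Gumbel-Softmax trick itself is easy to demo. The toy PyTorch snippet below shows how a (nearly) hard selection can still pass gradients back to the retriever's relevance scores; it illustrates the general technique, not the D-RAG codebase.

    # Toy demo of differentiable selection via Gumbel-Softmax (PyTorch).
    import torch
    import torch.nn.functional as F

    fact_embeddings = torch.randn(10, 64)          # 10 candidate graph facts
    logits = torch.randn(10, requires_grad=True)   # retriever's relevance scores

    # hard=True gives a one-hot selection in the forward pass, but gradients
    # still flow through the underlying soft probabilities (straight-through).
    selection = F.gumbel_softmax(logits, tau=0.5, hard=True)
    selected_fact = selection @ fact_embeddings    # differentiable "pick one fact"

    loss = selected_fact.sum()                     # stand-in for the generator's loss
    loss.backward()
    print(logits.grad is not None)                 # True: retrieval is now trainable

    This is exactly the "soft switch" the post describes: the forward pass behaves like a discrete choice, while the backward pass lets the generator's loss tune the retriever.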

  • Sohrab Rahimi

    Director, AI/ML Lead @ Google

    23,609 followers

    Many companies have started experimenting with simple RAG systems, probably as their first use case, to test the effectiveness of generative AI in extracting knowledge from unstructured data like PDFs, text files, and PowerPoint files. If you've used basic RAG architectures with tools like LlamaIndex or LangChain, you might have already encountered three key problems:

    𝟭. 𝗜𝗻𝗮𝗱𝗲𝗾𝘂𝗮𝘁𝗲 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻 𝗠𝗲𝘁𝗿𝗶𝗰𝘀: Existing metrics fail to catch subtle errors like unsupported claims or hallucinations, making it hard to accurately assess and enhance system performance.

    𝟮. 𝗗𝗶𝗳𝗳𝗶𝗰𝘂𝗹𝘁𝘆 𝗛𝗮𝗻𝗱𝗹𝗶𝗻𝗴 𝗖𝗼𝗺𝗽𝗹𝗲𝘅 𝗤𝘂𝗲𝘀𝘁𝗶𝗼𝗻𝘀: Standard RAG methods often struggle to find and combine information from multiple sources effectively, leading to slower responses and less relevant results.

    𝟯. 𝗦𝘁𝗿𝘂𝗴𝗴𝗹𝗶𝗻𝗴 𝘁𝗼 𝗨𝗻𝗱𝗲𝗿𝘀𝘁𝗮𝗻𝗱 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗮𝗻𝗱 𝗖𝗼𝗻𝗻𝗲𝗰𝘁𝗶𝗼𝗻𝘀: Basic RAG approaches often miss the deeper relationships between information pieces, resulting in incomplete or inaccurate answers that don't fully meet user needs.

    In this post I will introduce three useful papers that address these gaps:

    𝟭. 𝗥𝗔𝗚𝗖𝗵𝗲𝗰𝗸𝗲𝗿: Introduces a new framework for evaluating RAG systems with a focus on fine-grained, claim-level metrics. It proposes a comprehensive set of metrics: claim-level precision, recall, and F1 score to measure the correctness and completeness of responses; claim recall and context precision to evaluate the effectiveness of the retriever; and faithfulness, noise sensitivity, hallucination rate, self-knowledge reliance, and context utilization to diagnose the generator's performance. Consider using these metrics to help identify errors, enhance accuracy, and reduce hallucinations in generated outputs.

    𝟮. 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝘁𝗥𝗔𝗚: Uses a labeler and filter mechanism to identify and retain only the most relevant parts of retrieved information, reducing the need for repeated large language model calls. This iterative approach refines search queries efficiently, lowering latency and costs while maintaining high accuracy for complex, multi-hop questions.

    𝟯. 𝗚𝗿𝗮𝗽𝗵𝗥𝗔𝗚: By leveraging structured data from knowledge graphs, GraphRAG methods enhance the retrieval process, capturing complex relationships and dependencies between entities that traditional text-based retrieval often misses. This enables more precise and context-aware generation, which is particularly valuable in domains that require a deep understanding of interconnected data, such as scientific research, legal documentation, and complex question answering. In tasks such as query-focused summarization, for example, GraphRAG demonstrates substantial gains by effectively leveraging graph structures to capture local and global relationships within documents.

    It's encouraging to see how quickly gaps are identified and improvements are made in the GenAI world.
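
    To make "claim-level metrics" concrete, here is a back-of-envelope sketch in the spirit of RAGChecker. It is heavily simplified: the actual framework uses an LLM to extract and verify claims, whereas this toy version just compares normalized claim strings.

    # Toy claim-level precision/recall/F1 (simplified, not the RAGChecker code).
    def claim_metrics(response_claims, ground_truth_claims):
        # Precision: fraction of response claims supported by the ground truth.
        # Recall: fraction of ground-truth claims the response actually covers.
        supported = [c for c in response_claims if c in ground_truth_claims]
        covered = [c for c in ground_truth_claims if c in response_claims]
        precision = len(supported) / len(response_claims) if response_claims else 0.0
        recall = len(covered) / len(ground_truth_claims) if ground_truth_claims else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return precision, recall, f1

    resp = {"paris is the capital of france", "the eiffel tower is in berlin"}
    truth = {"paris is the capital of france", "france is in europe"}
    print(claim_metrics(resp, truth))  # (0.5, 0.5, 0.5)

    Even this crude version catches what answer-level scores miss: the response above is half hallucination and half incomplete, which a single "is the answer correct?" check would blur together.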

  • Umair Ahmad

    Senior Data & Technology Leader | Omni-Retail Commerce Architect | Digital Transformation & Growth Strategist | Leading High-Performance Teams, Driving Impact

    11,161 followers

    Most RAG systems look impressive in demos. But many quietly fail when they hit real production workloads.

    Why? Because the problem is rarely the model. It is usually retrieval design mistakes. Before scaling a RAG system, these anti-patterns must be addressed.

    𝐇𝐞𝐫𝐞 𝐚𝐫𝐞 𝐭𝐡𝐞 𝐦𝐨𝐬𝐭 𝐜𝐨𝐦𝐦𝐨𝐧 𝐫𝐞𝐚𝐬𝐨𝐧𝐬 𝐑𝐀𝐆 𝐬𝐲𝐬𝐭𝐞𝐦𝐬 𝐛𝐫𝐞𝐚𝐤.

    → 𝐏𝐨𝐨𝐫 𝐂𝐡𝐮𝐧𝐤𝐢𝐧𝐠 𝐒𝐭𝐫𝐚𝐭𝐞𝐠𝐲
    • Fixed-length chunks break meaning
    • Semantic, section-based chunking works better

    → 𝐖𝐞𝐚𝐤 𝐑𝐞𝐭𝐫𝐢𝐞𝐯𝐚𝐥 𝐒𝐭𝐫𝐚𝐭𝐞𝐠𝐲
    • Dense vectors alone miss exact queries
    • Hybrid retrieval improves coverage (see the sketch after this list)

    → 𝐈𝐧𝐞𝐟𝐟𝐢𝐜𝐢𝐞𝐧𝐭 𝐓𝐨𝐩 𝐊 𝐑𝐞𝐭𝐫𝐢𝐞𝐯𝐚𝐥
    • Large top-k adds noise and cost
    • Adaptive retrieval improves relevance

    → 𝐌𝐢𝐬𝐬𝐢𝐧𝐠 𝐑𝐞𝐫𝐚𝐧𝐤𝐢𝐧𝐠 𝐋𝐚𝐲𝐞𝐫
    • The first retrieval pass is approximate
    • Rerankers improve final relevance

    → 𝐈𝐠𝐧𝐨𝐫𝐢𝐧𝐠 𝐌𝐞𝐭𝐚𝐝𝐚𝐭𝐚
    • Outdated or irrelevant documents appear
    • Metadata filters improve precision

    → 𝐎𝐧𝐞 𝐄𝐦𝐛𝐞𝐝𝐝𝐢𝐧𝐠 𝐌𝐨𝐝𝐞𝐥 𝐟𝐨𝐫 𝐄𝐯𝐞𝐫𝐲𝐭𝐡𝐢𝐧𝐠
    • Different content types behave differently
    • Specialized embeddings improve accuracy

    → 𝐔𝐧𝐢𝐟𝐨𝐫𝐦 𝐂𝐡𝐮𝐧𝐤 𝐒𝐢𝐳𝐞𝐬
    • Different documents need different granularity
    • Content-aware chunk sizing works better

    → 𝐍𝐨 𝐑𝐞𝐭𝐫𝐢𝐞𝐯𝐚𝐥 𝐄𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧
    • Failures stay invisible
    • Retrieval metrics reveal quality issues

    → 𝐓𝐫𝐞𝐚𝐭𝐢𝐧𝐠 𝐑𝐀𝐆 𝐚𝐬 𝐏𝐫𝐨𝐦𝐩𝐭 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠
    • Prompts cannot fix poor retrieval
    • Retrieval architecture matters first

    → 𝐍𝐨 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐋𝐨𝐨𝐩
    • The system never improves after deployment
    • Feedback-driven updates improve accuracy

    → 𝐒𝐭𝐚𝐥𝐞 𝐕𝐞𝐜𝐭𝐨𝐫 𝐈𝐧𝐝𝐞𝐱𝐞𝐬
    • Knowledge slowly becomes outdated
    • Incremental re-embedding keeps systems fresh

    → 𝐋𝐚𝐭𝐞𝐧𝐜𝐲 𝐁𝐥𝐢𝐧𝐝 𝐒𝐩𝐨𝐭𝐬
    • Retrieval pipelines become slow
    • Latency budgets keep systems scalable

    → 𝐒𝐢𝐧𝐠𝐥𝐞 𝐇𝐨𝐩 𝐑𝐞𝐭𝐫𝐢𝐞𝐯𝐚𝐥
    • Complex questions require multi-step retrieval
    • Multi-hop pipelines handle deeper reasoning

    → 𝐂𝐨𝐧𝐭𝐞𝐱𝐭 𝐎𝐯𝐞𝐫𝐥𝐨𝐚𝐝
    • Too many tokens reduce relevance
    • Context compression improves signal

    → 𝐍𝐨 𝐂𝐨𝐬𝐭 𝐌𝐨𝐧𝐢𝐭𝐨𝐫𝐢𝐧𝐠
    • Token and embedding costs escalate
    • Cost visibility enables sustainable scaling

    RAG is not just about connecting a model to a vector database. 𝐈𝐭 𝐢𝐬 𝐚𝐛𝐨𝐮𝐭 𝐝𝐞𝐬𝐢𝐠𝐧𝐢𝐧𝐠 𝐚 𝐫𝐞𝐭𝐫𝐢𝐞𝐯𝐚𝐥 𝐚𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐮𝐫𝐞 𝐭𝐡𝐚𝐭 𝐬𝐜𝐚𝐥𝐞𝐬 𝐰𝐢𝐭𝐡 𝐚𝐜𝐜𝐮𝐫𝐚𝐜𝐲, 𝐬𝐩𝐞𝐞𝐝, 𝐚𝐧𝐝 𝐜𝐨𝐬𝐭 𝐜𝐨𝐧𝐭𝐫𝐨𝐥.

    Follow Umair Ahmad for more insights
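
    To make one of the fixes above concrete: a common way to implement hybrid retrieval is reciprocal rank fusion (RRF) over a keyword ranking and a dense-vector ranking. A minimal sketch follows (illustrative only; many search engines offer this natively).

    # Reciprocal rank fusion of multiple ranked lists of doc ids.
    def rrf(rankings, k=60):
        # Each doc's score is the sum of 1/(k + rank) across all rankings.
        scores = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    bm25_hits = ["doc7", "doc2", "doc9"]    # exact-keyword ranking
    dense_hits = ["doc2", "doc4", "doc7"]   # semantic-vector ranking
    print(rrf([bm25_hits, dense_hits]))     # doc2 and doc7 rise to the top

    RRF needs no score calibration between the two retrievers, which is why it is a popular default for combining keyword and vector search.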

  • Pradeep Sanyal

    AI Leader | Scaling AI from Pilot to Production | Chief AI Officer | Agentic Systems | AI Operating model, Governance, Adoption

    22,232 followers

    RAG is finally moving from prototype to production. This guide shows how to do it well.

    If you’re designing AI systems that need to retrieve facts, ground responses, and work reliably at scale, Mastering RAG is one of the most useful technical guides out there. It goes beyond surface-level diagrams and tackles real architectural decisions.

    What makes it stand out?
    → Breaks down input, context, and fact-level hallucinations
    → Clarifies when to retrieve, when to fine-tune, and when to do both
    → Details evaluation methods that go beyond toy benchmarks
    → Explains how to design feedback loops that actually improve answers
    → Offers patterns for relevance-first retrieval in complex domains
    → Frames data as a first-class design layer, not an afterthought

    Why it’s still relevant in 2025: most enterprises are realizing that reliable LLM behavior is less about bigger models and more about better orchestration. Grounding. Context control. Cost discipline. Retrieval done right.

    This guide doesn’t just explain RAG. It helps you build a retrieval-centric system that’s accurate, auditable, and production-grade. If you’re shipping AI in regulated, domain-specific, or cost-sensitive environments, this is the reference to bookmark.

    𝐀𝐈 𝐝𝐨𝐞𝐬𝐧’𝐭 𝐟𝐚𝐢𝐥 𝐢𝐧 𝐭𝐡𝐞 𝐦𝐨𝐝𝐞𝐥. 𝐈𝐭 𝐟𝐚𝐢𝐥𝐬 𝐢𝐧 𝐭𝐡𝐞 𝐚𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐮𝐫𝐞.

    📌 Save. 🔁 Share. 💬 Discuss.

    𝘍𝘰𝘭𝘭𝘰𝘸 𝘮𝘦 𝘧𝘰𝘳 𝘯𝘰-𝘧𝘭𝘶𝘧𝘧 𝘪𝘯𝘴𝘪𝘨𝘩𝘵𝘴 𝘰𝘯 𝘦𝘯𝘵𝘦𝘳𝘱𝘳𝘪𝘴𝘦 𝘈𝘐, 𝘢𝘨𝘦𝘯𝘵𝘴, 𝘢𝘯𝘥 𝘭𝘦𝘢𝘥𝘦𝘳𝘴𝘩𝘪𝘱.

  • Doug Safreno

    MTS at Anthropic

    3,643 followers

    Retrieval systems are the most common point of failure for Retrieval-Augmented Generation (RAG) systems; they are also incredibly difficult to tune. Here are the top techniques I’ve seen companies use to improve their RAG:

    1. Preprocess embeddings
    ‣ What you embed defines how your data is represented for retrieval. Preprocessing your data is super important for retrieving accurate matches. For example, consider embedding “Product: <product name>, tags: <tags>” rather than just “<product name>” for better results.

    2. Use retrieval as a tool (“Agentic RAG”)
    ‣ Most companies follow two steps: retrieve, then generate. For example, the user might ask “what are the best Thanksgiving mugs you offer?”, which gets directly embedded and sent to the retrieval system. Instead, consider an agentic approach where your retrieval system is a tool. The LLM will then search for something like “Thanksgiving mug”, denoising your query for you, and can do follow-up searches if necessary. (A sketch of this loop follows the post.)

    3. Experiment with Top-K
    ‣ The Top-K parameter determines how many results your system retrieves. Lower K values reduce noise but risk missing the best answer. Conversely, higher K values increase recall but may overwhelm the AI. The right setting depends entirely on your app's use case.

    4. Search mechanism: vector, traditional, or hybrid?
    ‣ The retrieval mechanism shapes how results are surfaced. Vector databases are ideal for semantic searches like product recommendations. Traditional search (keyword matching) works for structured, text-heavy queries. Hybrid systems combine both, making them well suited for apps requiring very specific knowledge.

    What are you doing to tune your retrieval system?
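
    Here is a minimal sketch of technique 2, retrieval as a tool. All names are hypothetical: llm and search_tool stand in for your model call and retrieval backend, and the SEARCH/ANSWER text protocol is just one simple way to structure the loop (production systems typically use native tool-calling APIs instead).

    # "Agentic RAG" sketch: the LLM chooses what to search for, and when.
    def agentic_rag(user_message, llm, search_tool, max_searches=3):
        notes = []
        for _ in range(max_searches):
            decision = llm(
                f"User asked: {user_message}\nNotes so far: {notes}\n"
                "Reply SEARCH:<query> to look something up, "
                "or ANSWER:<text> to finish."
            )
            if decision.startswith("SEARCH:"):
                query = decision[len("SEARCH:"):].strip()  # e.g. "Thanksgiving mug"
                notes.append(search_tool(query))           # denoised, model-chosen query
            else:
                return decision[len("ANSWER:"):].strip()
        # Out of search budget: answer with whatever was gathered.
        return llm(f"Answer '{user_message}' using notes: {notes}")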
