RAG stands for Retrieval-Augmented Generation. It’s a technique that combines the power of LLMs with real-time access to external information sources. Instead of relying solely on what an AI model learned during training (which can quickly become outdated), RAG enables the model to retrieve relevant data from external databases, documents, or APIs, and then use that information to generate more accurate, context-aware responses.

How does RAG work?

𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗲: The system searches for the most relevant documents or data based on your query, using advanced search methods like semantic or vector search.

𝗔𝘂𝗴𝗺𝗲𝗻𝘁: Instead of just using the original question, RAG 𝗮𝘂𝗴𝗺𝗲𝗻𝘁𝘀 (enriches) the prompt by adding the retrieved information directly into the input for the AI model. The model doesn’t just rely on what it “remembers” from training; it now sees your question 𝘱𝘭𝘶𝘴 the latest, domain-specific context.

𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗲: The LLM takes the retrieved information and crafts a well-informed, natural-language response.

𝗪𝗵𝘆 𝗱𝗼𝗲𝘀 𝗥𝗔𝗚 𝗺𝗮𝘁𝘁𝗲𝗿?
Improves accuracy: By referencing up-to-date or proprietary data, RAG reduces outdated or incorrect answers.
Context-aware: Responses are tailored using the latest information, not just what the model “remembers.”
Reduces hallucinations: RAG helps prevent the model from making up facts by grounding answers in real sources.

Example: Imagine asking an AI assistant, “What are the latest trends in renewable energy?” A traditional LLM might give you a general answer based on old data. With RAG, the model first searches for the most recent articles and reports, then synthesizes a response grounded in that up-to-date information.

Illustration by Deepak Bhardwaj
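To make the retrieve, augment, and generate steps concrete, here is a minimal sketch in Python. It assumes the OpenAI chat completions client; the toy keyword retriever, the sample documents, and the model name are illustrative stand-ins, not a production setup, and a real system would use semantic or vector search instead.

```python
# Minimal RAG sketch: retrieve -> augment -> generate.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DOCS = [
    "2024 report: global solar capacity grew by roughly 30% year over year.",
    "Offshore wind projects are expanding rapidly in Europe and East Asia.",
    "Battery storage costs continued to fall, enabling more grid flexibility.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Toy retriever: rank documents by keyword overlap with the query.
    # A production system would use semantic / vector search here.
    q_words = set(query.lower().split())
    scored = sorted(DOCS, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def rag_answer(question: str) -> str:
    context = "\n".join(retrieve(question))              # Retrieve
    prompt = (                                           # Augment the prompt
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(               # Generate
        model="gpt-4o-mini",                             # example model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(rag_answer("What are the latest trends in renewable energy?"))
```

Swapping the toy retriever for a vector store and adding citations to the prompt is what turns this skeleton into the kind of system described above.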
Improving LLM Accuracy with Contextual Data
Summary
Improving LLM accuracy with contextual data means providing AI models with relevant, up-to-date information so their responses are more reliable and less prone to errors or "hallucinations." Contextual data helps large language models (LLMs) understand the specific situation or question, rather than relying only on what they learned during training.
- Ground responses: Make sure your system pulls current, verified information from trusted sources so the LLM can give accurate answers.
- Manage memory: Separate immediate context from long-term stored information so the model uses just what is needed for each task.
- Curate inputs: Regularly review what data and tools are available to the model, and structure them so only relevant details reach the LLM during response generation.
It is easy to criticize LLM hallucinations, but Google researchers just made a major leap toward solving them for statistical data. In the DataGemma paper (Sep ’24), they teach LLMs when to ask an external source instead of guessing. They propose two approaches:

- Retrieval-interleaved generation (RIG): the model injects natural-language queries into its output, triggering fact retrieval from Data Commons.
- Retrieval-augmented generation (RAG): the model pulls full data tables into its context and reasons over them with a long-context LLM.

The results are impressive:
(1) RIG improved statistical accuracy from 5–17% to ~58%.
(2) RAG hit ~99% accuracy on direct citations (with some inference errors still remaining).
(3) Users strongly preferred the new responses over baseline answers.

As LLMs increasingly rely on external tools, teaching them "when to ask" may become as important as "how to answer."

Paper: https://lnkd.in/gaKY_VNE
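For illustration only, here is a tiny sketch of the RIG idea: the model emits an inline retrieval query instead of guessing a number, and a post-processing step resolves it. The `[DC: ...]` marker syntax and the `lookup_statistic` helper are hypothetical stand-ins, not the paper's actual fine-tuned query format or the real Data Commons API.

```python
import re

def lookup_statistic(query: str) -> str:
    """Hypothetical stand-in for a Data Commons lookup; returns a cited value."""
    return "11.2% (Data Commons, 2022)"  # placeholder result for the sketch

def resolve_rig_output(model_output: str) -> str:
    # The model has learned to emit retrieval queries inline, e.g.
    # "[DC: unemployment rate in California 2022]", rather than guessing.
    # Each query is replaced with the retrieved, citable statistic.
    pattern = r"\[DC:\s*([^\]]+)\]"
    return re.sub(pattern, lambda m: lookup_statistic(m.group(1)), model_output)

draft = ("California's unemployment rate was "
         "[DC: unemployment rate in California 2022] last year.")
print(resolve_rig_output(draft))
```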
Stop worshipping prompts. Start engineering the CONTEXT.

If the LLM sounds smart but generates nonsense, that’s not really “hallucination” anymore… That’s due to the incomplete context one feeds it, which is (most of the time) unstructured, stale, or missing the things that mattered. But we need to understand that context isn't just the icing anymore; it's the whole damn CAKE that makes or breaks modern AI apps.

We’re seeing a shift: RAG initially gave models a library card, and now context engineering principles teach them what to pull, when to pull it, and how to best use it without polluting context windows. The most effective systems today are modular, with retrieval, memory, and tool use working together seamlessly.

What a modern context-engineered system looks like:
• Working memory: the last few turns and interim tool results needed right now.
• Long-term memory: user preferences, prior outcomes, and facts stored in vector stores, referenced when useful.
• Dynamic retrieval: query rewriting, reranking, and compression before anything hits the context window.
• Tools as first-class citizens: APIs, search, MCP servers, etc., invoked when necessary.

𝐄𝐱𝐚𝐦𝐩𝐥𝐞: In an AI coding agent, working memory stores the latest compiler errors and recent changes, while long-term memory stores project dependencies and indexed files. The tools fetch API documentation and run web searches when knowledge falls short. The result is faster, more accurate code without hallucinations.

So, if you’re building smart agents today, do this:
• Start with optimizing retrieval quality: query rewriting, rerankers, and context compression before the LLM sees anything.
• Separate memories: working (short-term) vs. long-term; write back only distilled facts (not entire transcripts) to long-term memory.
• Treat tools like sensors: call them when evidence is missing. Never assume the model just “knows” everything.
• Make the context contract explicit: schemas for tools/outputs and lightweight, enforceable system rules.

The good news is that your existing RAG stack isn’t obsolete with the emergence of these new principles: it is the foundation. The difference now is orchestration, curating the smallest, sharpest slice of context the model needs to fulfill its job… no more, no less.

So, if the model’s output is off, don’t just rewrite the prompt. Review and fix that context, and then watch the model act like it finally understands the assignment!
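A rough sketch of how those pieces might be composed for a single model call, assuming a toy keyword-based store in place of a real vector database and a hypothetical `search_docs` tool; the budget, names, and thresholds are illustrative choices only.

```python
class ToyStore:
    """Hypothetical long-term memory; a real system would use a vector DB."""
    def __init__(self, facts: list[str]):
        self.facts = facts

    def search(self, query: str, k: int = 3) -> list[str]:
        # Toy relevance: rank stored facts by keyword overlap with the query.
        q = set(query.lower().split())
        return sorted(self.facts, key=lambda f: -len(q & set(f.lower().split())))[:k]

def build_context(user_msg: str, working_memory: list[str],
                  long_term: ToyStore, tools: dict, budget_chars: int = 4000) -> str:
    parts: list[str] = []
    parts += working_memory[-4:]                 # working memory: only the last few turns
    parts += long_term.search(user_msg)          # long-term memory: distilled, relevant facts
    if not parts and "search_docs" in tools:     # tools as sensors: call only when evidence is missing
        parts.append(tools["search_docs"](user_msg))
    context = "\n".join(parts)
    return context[-budget_chars:] + f"\n\nUser: {user_msg}"  # compress to the budget

store = ToyStore(["User prefers TypeScript.", "Project uses PostgreSQL 16."])
recent = ["Assistant: fixed the build error in the API module."]
print(build_context("Which database does the project use?", recent, store, {}))
```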
Are your LLM apps still hallucinating? Zep used to as well (a lot). Here’s how we worked to solve Zep's hallucinations.

We've spent a lot of cycles diving into why LLMs hallucinate and experimenting with the most effective techniques to prevent it. Some might sound familiar, but it's the combined approach that really moves the needle.

First, why do hallucinations happen? A few core reasons:
🔍 LLMs rely on statistical patterns, not true understanding.
🎲 Responses are based on probabilities, not verified facts.
🤔 No innate ability to differentiate truth from plausible fiction.
📚 Training datasets often include biases, outdated info, or errors.

Put simply: LLMs predict the next likely word; they don’t actually "understand" or verify what's accurate. When prompted beyond their knowledge, they creatively fill gaps with plausible (but incorrect) info. ⚠️ Funny if you’re casually chatting, problematic if you're building enterprise apps.

So, how do you reduce hallucinations effectively? The #1 technique: grounding the LLM in data.
- Use Retrieval-Augmented Generation (RAG) to anchor responses in verified data.
- Use long-term memory systems like Zep to ensure the model is always grounded in personalization data: user context, preferences, traits, etc.
- Fine-tune models on domain-specific datasets to improve response consistency and style, although fine-tuning alone typically doesn't add substantial new factual knowledge.
- Use explicit, clear prompting; avoid ambiguity or unnecessary complexity.
- Encourage models to self-verify conclusions when accuracy is essential.
- Structure complex tasks with chain-of-thought (CoT) prompting to improve outputs, or force "none"/unknown responses when necessary.
- Strategically tweak model parameters (e.g., temperature, top-p) to limit overly creative outputs.
- Add post-processing verification for mission-critical outputs, for example, matching to known business states.

One technique alone rarely solves hallucinations. For maximum ROI, we've found combining RAG with a robust long-term memory solution (like ours at Zep) is the sweet spot. Systems that ground responses in factual, evolving knowledge significantly outperform.

Did I miss any good techniques? What are you doing in your apps?
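As one concrete illustration of combining the prompting and parameter techniques above, here is a hedged sketch using the OpenAI Python client. The model name, temperature and top-p values, and the wording of the grounding instruction are example choices, not Zep's implementation.

```python
from openai import OpenAI

client = OpenAI()

def grounded_answer(question: str, context: str) -> str:
    # Explicit grounding instructions plus conservative sampling parameters.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",          # example model name
        temperature=0.1,              # limit overly creative completions
        top_p=0.9,
        messages=[
            {"role": "system",
             "content": ("Answer strictly from the provided context. "
                         "If the answer is not in the context, reply 'unknown'.")},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```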
𝐂𝐨𝐧𝐭𝐞𝐱𝐭 𝐞𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠 is the process of deliberately designing, structuring, and manipulating the inputs, metadata, memory, and environment surrounding an LLM to produce better, more reliable, and more useful outputs.

𝐇𝐞𝐫𝐞’𝐬 𝐡𝐨𝐰 𝐭𝐨 𝐭𝐡𝐢𝐧𝐤 𝐚𝐛𝐨𝐮𝐭 𝐢𝐭:
- The LLM is the CPU
- The context window is the RAM
- Context engineering is your OS

Just like RAM, the context window has strict limits. What you load into it, and when, defines everything from performance to reliability. Think of it as "𝐏𝐫𝐨𝐦𝐩𝐭 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠" on steroids, with a focus on providing a rich and structured environment for the LLM to work within.

𝐇𝐞𝐫𝐞’𝐬 𝐭𝐡𝐞 𝐟𝐫𝐚𝐦𝐞𝐰𝐨𝐫𝐤 𝐈 𝐤𝐞𝐞𝐩 𝐜𝐨𝐦𝐢𝐧𝐠 𝐛𝐚𝐜𝐤 𝐭𝐨: 𝐓𝐡𝐞 𝟒 𝐂𝐬 𝐨𝐟 𝐂𝐨𝐧𝐭𝐞𝐱𝐭 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠

1. Save Context
Store important information outside the context window so it can be reused later.
- Log task results
- Store conversation state and chat history
- Persist metadata
This is about memory. Offload what the model doesn’t need right now but might need soon.

2. Select Context
Pull relevant information into the context window for the task at hand.
- Use search (RAG)
- Look up memory
- Query prior interactions
Selection quality = output quality. Garbage in, garbage out.

3. Compress Context
When you exceed token limits, you compress.
- Summarize
- Cluster with embeddings
- Trim token by token
Think like a systems engineer. Signal > noise. Token budgets are real.

4. Isolate Context
Sometimes, the best boost in performance comes from narrowing scope.
- Scope to one subtask
- Modularize agents
- Run isolated threads
Less clutter = fewer hallucinations = more deterministic behavior.

𝐖𝐡𝐲 𝐭𝐡𝐢𝐬 𝐦𝐚𝐭𝐭𝐞𝐫𝐬: Most LLM failures aren’t because of weak prompts. They fail because the context window is overloaded, underutilized, or just ignored.

𝐋𝐞𝐭 𝐦𝐞 𝐤𝐧𝐨𝐰 𝐢𝐟 𝐲𝐨𝐮 𝐰𝐚𝐧𝐭 𝐚 𝐫𝐮𝐧𝐝𝐨𝐰𝐧 𝐨𝐟 𝐏𝐫𝐨𝐦𝐩𝐭 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠 𝐯𝐬 𝐂𝐨𝐧𝐭𝐞𝐱𝐭 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠.
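A toy sketch of the four steps gathered into one class; character counts stand in for token budgets and keyword overlap stands in for real retrieval or summarization, so the names and logic are illustrative assumptions rather than a production design.

```python
class ContextManager:
    """Toy sketch of Save / Select / Compress / Isolate for LLM context."""

    def __init__(self, max_chars: int = 2000):
        self.memory: list[str] = []   # saved context living outside the window
        self.max_chars = max_chars    # stand-in for a real token budget

    def save(self, item: str) -> None:
        # 1) Save: log task results, chat state, metadata outside the window.
        self.memory.append(item)

    def select(self, query: str, k: int = 3) -> list[str]:
        # 2) Select: pull only the items relevant to the current task.
        q = set(query.lower().split())
        ranked = sorted(self.memory, key=lambda m: -len(q & set(m.lower().split())))
        return ranked[:k]

    def compress(self, items: list[str]) -> str:
        # 3) Compress: trim to the budget (a real system might summarize instead).
        return "\n".join(items)[: self.max_chars]

    def isolate(self, query: str) -> str:
        # 4) Isolate: build a scoped context for one subtask only.
        return self.compress(self.select(query))

ctx = ContextManager()
ctx.save("Task result: invoice 1042 was flagged for manual review.")
ctx.save("User preference: summaries in bullet points.")
print(ctx.isolate("Why was invoice 1042 flagged?"))
```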
𝗣𝗿𝗮𝗰𝘁𝗶𝗰𝗮𝗹 𝗮𝗱𝘃𝗶𝗰𝗲 𝘁𝗼 𝗺𝗮𝗸𝗲 𝗽𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗥𝗔𝗚 (𝗮𝗻𝗱 𝗺𝗮𝗸𝗲 𝗶𝘁 𝗮𝗰𝗰𝘂𝗿𝗮𝘁𝗲) 🚀

Most RAG demos look great… until you ship them. By default, RAG accuracy is low: the retriever misses, returns near-duplicates, pulls the wrong "almost relevant" chunks, and the LLM confidently answers anyway 😅 Getting to production quality means stacking techniques end-to-end. Think in stages: 𝗿𝗲𝗰𝗮𝗹𝗹 → 𝗽𝗿𝗲𝗰𝗶𝘀𝗶𝗼𝗻 → 𝗮𝗻𝘀𝘄𝗲𝗿𝗮𝗯𝗶𝗹𝗶𝘁𝘆 🎯

Here’s a workflow (matching the diagram) and what each stage buys you:

𝟭) 𝗤𝘂𝗲𝗿𝘆 + 𝗰𝗼𝗻𝘃𝗲𝗿𝘀𝗮𝘁𝗶𝗼𝗻 𝗵𝗶𝘀𝘁𝗼𝗿𝘆 → Query Rewriter (LLM) 🧠
• Normalize intent, resolve pronouns, add constraints from history
• Output: clean search query + metadata constraints (time range, product, region, access scope)

𝟮) 𝗛𝘆𝗗𝗘 (Hypothetical Document Embeddings) 📝
• The LLM drafts a hypothetical "ideal answer passage"
• Embed it to reduce vocabulary mismatch and boost recall

𝟯) 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗲𝗿 + 𝗙𝗶𝗹𝘁𝗲𝗿𝘀 🧰
• Apply metadata filtering before scoring (tenant, permissions/ACL, doc type, recency, language) 🔒
• This is the difference between "smart" and "safe" retrieval

𝟰) 𝗛𝘆𝗯𝗿𝗶𝗱 𝘀𝗲𝗮𝗿𝗰𝗵 (𝗱𝗲𝗻𝘀𝗲 + 𝘀𝗽𝗮𝗿𝘀𝗲) 🔎
• Dense = semantic recall; sparse/BM25 = exact terms, IDs, error codes, names
• Retrieve the Top-N from both, then merge (weighted fusion) → fewer blind spots ⚖️

𝟱) 𝗥𝗲-𝗿𝗮𝗻𝗸𝗲𝗿 (LLM or cross-encoder) 🥇
• Score the Top-N candidates for true relevance to the rewritten query
• Often the biggest quality jump (watch latency/cost) ⏱️💸

𝟲) 𝗗𝗶𝘃𝗲𝗿𝘀𝗶𝘁𝘆 & 𝗱𝗲-𝗱𝘂𝗽: MMR 🧩
• Reduce near-duplicate chunks and improve coverage
• Critical when many docs repeat boilerplate (and your context window gets wasted) 🪟

𝟳) 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗽𝗮𝗰𝗸𝗶𝗻𝗴 → Generator 🏗️
• Tight context: best passages + citations + key metadata
• "Answer from context only", refusal rules, "ask a follow-up if missing"
• Final answer + links/citations 🔗

𝟴) 𝗜𝗻𝗱𝗲𝘅-𝘁𝗶𝗺𝗲 𝘁𝗿𝗶𝗰𝗸𝘀 that make retrieval easier 🗂️
• Chunk with structure (titles/headers), not fixed token counts only
• Deduplicate boilerplate; separate "facts" from long "how-to" sections
• Store rich metadata (owner, ACL, timestamps, source, tags) and keep it queryable 🏷️

𝟵) 𝗢𝗽𝘀 𝗸𝗻𝗼𝗯𝘀 (so it survives real traffic) 🛠️
• Cache embeddings + retrieval; rerank asynchronously when possible; set tight timeouts

𝟭𝟬) 𝗖𝗹𝗼𝘀𝗲 𝘁𝗵𝗲 𝗹𝗼𝗼𝗽 🔁
• Log: query, rewrite, filters, retrieved IDs, fusion scores, rerank scores, final citations
• Evaluate (golden sets, clicks, human review) and tune k, fusion weights, MMR λ, and reranker thresholds 📈
• Monitor "no-answer" and "low-evidence" rates 👀

Production RAG isn’t "LLM + vector DB". It’s an information pipeline with lots of boring knobs, and those knobs are where accuracy comes from 🧪

#RAG #LLM #RetrievalAugmentedGeneration #Search #VectorDatabase #AIEngineering #MLOps
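As a small illustration of stages 4 and 6 in the workflow above, here is a sketch of weighted score fusion followed by MMR de-duplication. The score dictionaries and the word-overlap similarity are toy stand-ins for real retriever scores and embedding similarity; the weights and λ value are arbitrary examples.

```python
def fuse(dense: dict[str, float], sparse: dict[str, float],
         w_dense: float = 0.6) -> dict[str, float]:
    # Weighted fusion of dense (semantic) and sparse (BM25-style) scores per chunk id.
    ids = set(dense) | set(sparse)
    return {i: w_dense * dense.get(i, 0.0) + (1 - w_dense) * sparse.get(i, 0.0)
            for i in ids}

def mmr(scores: dict[str, float], texts: dict[str, str],
        k: int = 3, lam: float = 0.7) -> list[str]:
    # Maximal Marginal Relevance: balance relevance against similarity to chunks
    # already selected, so near-duplicates don't waste the context window.
    def sim(a: str, b: str) -> float:  # toy word-overlap similarity
        wa, wb = set(texts[a].split()), set(texts[b].split())
        return len(wa & wb) / max(1, len(wa | wb))

    selected: list[str] = []
    pool = dict(scores)
    while pool and len(selected) < k:
        best = max(pool, key=lambda i: lam * pool[i]
                   - (1 - lam) * max((sim(i, s) for s in selected), default=0.0))
        selected.append(best)
        pool.pop(best)
    return selected

dense = {"a": 0.9, "b": 0.8, "c": 0.4}
sparse = {"a": 0.7, "c": 0.9}
texts = {"a": "reset your password in settings",
         "b": "reset your password in the settings page",
         "c": "error code 403 when resetting password"}
print(mmr(fuse(dense, sparse), texts, k=2))  # keeps "a" and "c", drops near-duplicate "b"
```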
LLMs process text from left to right: each token can only look back at what came before it, never forward. This means that when you write a long prompt with context at the beginning and a question at the end, the model answers the question having "seen" the context, but the context tokens were encoded without any awareness of what question was coming. This asymmetry is a basic structural property of how these models work.

The paper asks what happens if you just send the prompt twice in a row, so that every part of the input gets a second pass in which it can attend to every other part. The answer is that accuracy goes up across seven different benchmarks and seven different models (from the Gemini, ChatGPT, Claude, and DeepSeek series of LLMs), with no increase in the length of the model's output and no meaningful increase in response time, because processing the input is done in parallel by the hardware anyway. There are no new losses to compute, no finetuning, no clever prompt engineering beyond the repetition itself.

The gap between this technique and doing nothing is sometimes small, sometimes large (one model went from 21% to 97% on a task involving finding a name in a list). If you are thinking about how to get better results from these models without paying for longer outputs or slower responses, that's a fairly concrete and low-effort finding.

Read with AI tutor: https://lnkd.in/ene242cx
Get the PDF: https://lnkd.in/e9tbUTNv
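A minimal sketch of the trick, assuming the OpenAI chat completions client; the model name is only an example, and the exact join format (a blank line between the two copies) is a guess rather than the paper's prescribed template.

```python
from openai import OpenAI

client = OpenAI()

def ask_with_repetition(context: str, question: str) -> str:
    # Send the full prompt twice so every input token gets a second pass in
    # which it can attend to the question (and the rest of the prompt).
    prompt = f"{context}\n\nQuestion: {question}"
    doubled = f"{prompt}\n\n{prompt}"  # the repetition is the whole trick
    resp = client.chat.completions.create(
        model="gpt-4o-mini",           # example model name
        messages=[{"role": "user", "content": doubled}],
    )
    return resp.choices[0].message.content
```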