Prompt formatting can have a dramatic impact on LLM performance, but the effect varies substantially across models. Some pragmatic findings from a recent research paper:

💡 Prompt format significantly affects LLM performance. Different prompt formats (plain text, Markdown, YAML, JSON) can result in performance variations of up to 40%, depending on the task and model. For instance, GPT-3.5-turbo showed a dramatic performance shift between Markdown and JSON in code translation tasks, while GPT-4 exhibited greater stability. This underlines the importance of testing and optimizing prompts for specific tasks and models.

🛠️ Tailor formats to task and model. Prompt formats like JSON, Markdown, YAML, and plain text yield different performance outcomes across tasks. For instance, GPT-3.5-turbo performed 40% better with JSON on code tasks, while GPT-4 preferred Markdown for reasoning tasks. Test multiple formats early in your process to identify which structure maximizes results for your specific task and model.

📋 Keep instructions and context explicit. Include clear task instructions, persona descriptions, and examples in your prompts. For example, specifying a role ("You are a Python coder") and an output style ("Respond in JSON") improves model understanding. Framing the task consistently across different formats minimizes confusion and enhances reliability.

📊 Choose format based on data complexity. For simple tasks, plain text or Markdown often suffices. For structured outputs like programming or translations, formats such as JSON or YAML may perform better. Align the prompt format with the complexity of the expected response to leverage the model's capabilities fully.

🔄 Iterate and validate performance. Run tests with variations in prompt structure to measure impact. Metrics like Coefficient of Mean Deviation (CMD) or Intersection-over-Union (IoU) can help quantify performance differences. Start with benchmarks like MMLU or HumanEval to validate consistency and accuracy before deploying at scale.

🚀 Leverage larger models for stability. If working with sensitive tasks requiring consistent outputs, opt for larger models like GPT-4, which show better robustness to format changes. For instance, GPT-4 maintained higher performance consistency across benchmarks compared to GPT-3.5.

Link to paper in comments.
Improving LLM Accuracy Across Diverse Text Formats
Summary
Improving LLM accuracy across diverse text formats means making large language models (LLMs) better at understanding and responding accurately, whether they're dealing with plain text, code, tables, or structured data like JSON or XML. It involves choosing the right input formats, providing clear instructions, and using tools and methods that help the model give more reliable answers across different tasks and content types.
- Match format to task: Select the input format—like plain text, Markdown, or JSON—that best fits your specific task, and don’t hesitate to try multiple formats early on to see which produces the most reliable responses.
- Make instructions clear: Spell out roles, expected output, and task details in your prompt to help the model understand exactly what you’re asking, especially when working with complex or domain-specific content.
- Test and refine: Build in a feedback loop by checking outputs, comparing them to trustworthy examples, and adjusting your prompts or input data until the model’s answers consistently meet your needs.
-
One of the persistent challenges in using large language models (LLMs) is getting them to follow instructions reliably, especially when the instructions are subtle or domain-specific. DeepMind's latest research introduces Symbol Tuning, a simple yet powerful fine-tuning method that significantly improves an LLM's ability to follow symbolic prompts (e.g., bullet points, XML, Markdown, or code-like instructions) in zero-shot and few-shot settings. https://lnkd.in/gzKDdHQ2

Why this matters:
🔹 Improves instruction following in GPT-class models
🔹 Works with tiny amounts of data (just 100K tokens!)
🔹 Boosts performance in math, code, and reasoning-heavy tasks
🔹 Enhances models' ability to generalize across symbolic formats

This has massive implications for building enterprise agents, RAG pipelines, and developer copilots that need high-precision, structured interaction with users or data. A great reminder: sometimes, small, well-targeted innovations create massive gains.

#LLM #InContextLearning #SymbolTuning #PromptEngineering #DeepMind #GenAI #AIResearch #InstructionFollowing #EnterpriseAI #DeveloperTools
-
It is easy to criticize LLM hallucinations, but Google researchers just made a major leap toward solving them for statistical data. In the DataGemma paper (Sep '24), they teach LLMs when to ask an external source instead of guessing. They propose two approaches:

- Retrieval-interleaved generation (RIG): the model injects natural-language queries into its output, triggering fact retrieval from Data Commons.
- Retrieval-augmented generation (RAG): the model pulls full data tables into its context and reasons over them with a long-context LLM.

The results are impressive:
1. RIG improved statistical accuracy from 5–17% to ~58%.
2. RAG hit ~99% accuracy on direct citations (with some inference errors still remaining).
3. Users strongly preferred the new responses over baseline answers.

As LLMs increasingly rely on external tools, teaching them "when to ask" may become as important as "how to answer."

Paper: https://lnkd.in/gaKY_VNE
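To make the RIG idea concrete, here is a minimal sketch of the interleaving loop: the model emits an inline lookup marker instead of guessing a number, and the application resolves that marker against an external statistics source before returning the answer. The `generate` and `query_data_commons` functions and the `[DC(...)]` marker syntax below are hypothetical placeholders for illustration, not the DataGemma or Data Commons APIs.

```python
import re

def generate(prompt: str) -> str:
    """Hypothetical model call. A RIG-style model is tuned to emit inline
    lookup markers such as [DC(...)] instead of guessing a statistic."""
    return "Kenya had roughly [DC(population of Kenya in 2022)] residents."

def query_data_commons(natural_language_query: str) -> str:
    """Hypothetical stand-in for a statistical lookup against an external source."""
    return "54 million"

def rig_answer(prompt: str) -> str:
    """Retrieval-interleaved generation: replace every lookup marker the
    model emitted with a value fetched from the external source."""
    draft = generate(prompt)
    return re.sub(r"\[DC\((.*?)\)\]",
                  lambda m: query_data_commons(m.group(1)),
                  draft)

print(rig_answer("How many people live in Kenya?"))
```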
-
Exciting New Research: A Library of LLM Intrinsics for Retrieval-Augmented Generation

I just came across a groundbreaking paper from IBM Research that introduces a novel concept for the LLM developer community: a library of LLM intrinsics for Retrieval-Augmented Generation (RAG). In the software world, we've long benefited from reusable libraries with well-defined APIs. However, the LLM ecosystem has lacked this pattern until now. This research proposes "LLM intrinsics": capabilities that can be invoked through stable, well-defined APIs, independent of their implementation details.

>> What are LLM Intrinsics?
The researchers define an LLM intrinsic as "a capability that can be invoked through a well-defined API that is reasonably stable and independent of how the LLM intrinsic itself is implemented." Think of them as compiler intrinsics for LLMs: functions that occur frequently enough to warrant standardization.

>> The RAG Intrinsics Library
The library currently includes five intrinsics:
1. Query Rewrite (QR) - Decontextualizes multi-turn conversation queries into standalone versions, improving retriever performance by 9 percentage points in Recall@20 and 8 points in RAGAS Faithfulness.
2. Uncertainty Quantification (UQ) - Provides calibrated certainty scores (5% to 95%) for answers, with an impressive Expected Calibration Error of just 0.064 across tasks.
3. Hallucination Detection (HD) - Analyzes responses against source documents to assign a hallucination risk score to each sentence, achieving a 72.2% F1 score on the RAGTruth benchmark.
4. Answerability Determination (AD) - Determines whether a query can be answered from the provided documents, achieving a 77.4% weighted F1 score on SQUADRUN Dev and 86.1% on MT-RAG.
5. Citation Generation (CG) - Creates fine-grained citations for each sentence in responses, outperforming Llama-3.1-70B-Instruct with F1 scores of 62.0% to 75.4% across LongBench-Cite datasets.

>> Implementation Details
Each intrinsic is implemented as a LoRA adapter for IBM Granite 3.2 8b Instruct, available on Hugging Face. More importantly, they're accessible through Granite IO Processing, a framework that handles input/output transformations. The researchers also demonstrate how these intrinsics can be composed into powerful workflows. For example, combining Query Rewrite with Answerability Determination yields an 11% improvement in Joint Answerability-Faithfulness Score compared to using neither.

This work represents a significant step toward standardization in the LLM ecosystem, potentially enabling the same level of collaboration and specialization we've seen in traditional software development. All models are released under the Apache 2.0 license for both research and commercial use. Definitely worth exploring if you're working on RAG applications!
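As an illustration of why stable intrinsic APIs matter, here is a rough sketch of composing three of them (query rewrite, answerability determination, citation generation) into a RAG flow. The function names, signatures, and toy bodies are assumptions made for illustration; they are not the Granite IO Processing interface.

```python
from typing import Callable, List

# Toy stand-ins for the intrinsics; the real ones are LoRA adapters served
# behind a stable API. The bodies below are placeholders, not real logic.

def query_rewrite(conversation: List[str]) -> str:
    """QR: turn the last user turn into a standalone query."""
    return conversation[-1]

def is_answerable(query: str, passages: List[str]) -> bool:
    """AD: decide whether the passages can answer the query at all."""
    return any(word in p.lower() for p in passages for word in query.lower().split())

def generate_with_citations(query: str, passages: List[str]) -> str:
    """CG: answer grounded in the passages, citing them per sentence."""
    return f"Answer to '{query}' [1]"

def rag_answer(conversation: List[str], retrieve: Callable[[str], List[str]]) -> str:
    query = query_rewrite(conversation)               # 1. decontextualize
    passages = retrieve(query)                        # 2. retrieve
    if not is_answerable(query, passages):            # 3. refuse unanswerable queries
        return "I can't answer that from the documents provided."
    return generate_with_citations(query, passages)   # 4. grounded, cited answer

# Usage with a toy retriever:
print(rag_answer(["What does the warranty cover?"],
                 lambda q: ["The warranty covers parts and labor for two years."]))
```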
-
Are humans 5X better than AI? This paper is blowing up (not in a good way). The recent study claims LLMs are 5x less accurate than humans at summarizing scientific research. That's a bold claim. But maybe it's not the model that's off; maybe it's the AI strategy, system, prompt, or data. What's your secret sauce for getting the most out of an LLM?

Scientific summarization is dense, domain-specific, and context-heavy, and evaluating accuracy in this space isn't simple either. So just because a general-purpose LLM struggles with a Turing-style test doesn't mean it can't do better. Is it just how they're using it? I think it's short-sighted to drop a complex task into an LLM and expect expert results without expert setup. To get better answers, you need a better AI strategy, system, and deployment. Some tips and tricks we find helpful:

1. Start small and be intentional. Don't just upload a paper and say "summarize this." Define the structure, tone, and scope you want. Try prompts like: "List three key findings in plain language, and include one real-world implication for each." The clearer your expectations, the better the output.

2. Test. Build in a feedback loop from the beginning. Ask the model what might be missing from the summary, or how confident it is in the output. Compare responses to expert-written summaries or benchmark examples. If the model can't handle tasks where the answers are known, it's not ready for tasks where they're not.

3. Tweak. Refine everything: prompts, data, logic. Add retrieval grounding so the model pulls from trusted sources instead of guessing. Fine-tune with domain-specific examples to improve accuracy and reduce noise. Experiment with prompt variations and analyze how the answers change. Tuning isn't just technical; it's iterative alignment between output and expectation. (Spoiler alert: you might be at this stage for a while.)

4. Repeat. Every new domain, dataset, or objective requires a fresh approach. LLMs don't self-correct across contexts, but your workflow can. Build reusable templates. Create consistent evaluation criteria. Track what works, version your changes, and keep refining. Improving LLM performance isn't one and done. It's a cycle.

Finally: if you treat a language model like a magic button, it's going to kill the rabbit in the hat. If you treat it like a system you deploy, test, tweak, and evolve, it can pull magic bunnies out of every hat.

Q: How are you using LLMs to improve workflows? Have you tried domain-specific data? Would love to hear your approaches in the comments.
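A small sketch of tips 1 and 2 together: a structured summarization prompt plus a second pass that asks the model to critique its own draft before you compare it against an expert summary. `call_llm` is a hypothetical placeholder for whatever model client you use, and the prompt wording is illustrative.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder; wire up your actual model client here."""
    return "(model output)"

SUMMARY_PROMPT = """You are summarizing a scientific paper for a general audience.
List three key findings in plain language. For each finding, add one
real-world implication. Keep the paper's hedged language; do not overgeneralize.

Paper:
{paper}
"""

CRITIQUE_PROMPT = """Compare the draft summary against the paper. List anything
the summary overstates or omits, then rate your confidence (low/medium/high).

Paper:
{paper}

Draft summary:
{summary}
"""

def summarize_with_feedback(paper: str) -> dict:
    """Structured summary plus a self-critique pass for the review loop."""
    summary = call_llm(SUMMARY_PROMPT.format(paper=paper))
    critique = call_llm(CRITIQUE_PROMPT.format(paper=paper, summary=summary))
    return {"summary": summary, "critique": critique}
```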
-
LLMs don't just respond to what you ask; they respond to how you ask. If you're still relying on basic prompting, you're leaving a lot of performance on the table. Here's how people are systematically optimizing prompts for higher accuracy, robustness, and efficiency in AI apps:

⭐ Few-Shot Prompting – Improve precision in classification tasks by including example inputs/outputs (e.g., for detecting jailbreak attempts, spam, or misinformation).
⭐ Meta Prompting – Use an LLM to refine its own prompts (e.g., "Given this input/output, how would you rewrite this prompt for better performance?"). This works especially well for text generation and retrieval tasks.
⭐ Gradient Prompt Optimization (GPO) – Treat prompts like trainable parameters, embedding them and optimizing with loss gradients. Think of it as fine-tuning without modifying the model itself, a game-changer for low-resource AI applications.
⭐ Prompt Optimization Libraries – Tools like DSPy automate prompt refinement, evaluating variations systematically. For production AI systems, this makes tuning scalable.

The takeaway? Prompt optimization is a continuous process. Real-world data shifts and new failure modes emerge. Just like model retraining, prompts need continuous iteration. What's your go-to method for improving AI prompts?
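Meta prompting, for example, can be as simple as feeding the current prompt and a handful of failure cases back to the model and asking for a rewrite. Below is a minimal sketch of that loop; `call_llm` is a hypothetical placeholder, and the meta-prompt wording is just one way to phrase it.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for your model client."""
    return "(rewritten prompt)"

META_PROMPT = """You are optimizing a prompt. Below is the current prompt and
input/output pairs where it failed. Rewrite the prompt so the model would
produce the expected outputs. Return only the improved prompt.

Current prompt:
{prompt}

Failure cases:
{failures}
"""

def refine_prompt(prompt: str, failures: list[tuple[str, str, str]]) -> str:
    """One meta-prompting step over (input, got, expected) failure cases."""
    rendered = "\n".join(
        f"- input: {inp}\n  got: {got}\n  expected: {want}"
        for inp, got, want in failures
    )
    return call_llm(META_PROMPT.format(prompt=prompt, failures=rendered))

new_prompt = refine_prompt(
    "Classify the message as spam or not spam.",
    [("WIN A FREE CRUISE!!!", "not spam", "spam")],
)
```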
-
Summary of the paper: Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models (Oct '24)

RAG enhances LLMs by retrieving relevant external knowledge. However, imperfect retrieval and knowledge conflicts hinder RAG's performance.

1. Imperfect retrieval: Retrieved documents may contain irrelevant or contradictory information, resulting in noise that complicates the assessment of data. In addition, essential information may be omitted, leading to incomplete or inaccurate responses.
2. Knowledge conflicts: Retrieved documents may present conflicting information, both among themselves and in relation to the model's internal knowledge, resulting in contradictions. The model may also inadvertently absorb biases inherent in the retrieved data.

The proposed Astute RAG approach addresses these challenges through:

1. Retrieval refinement: Contextualized retrieval uses the query and the initially generated text to filter out irrelevant or conflicting information, prioritizing more pertinent documents. Diversification ensures a varied selection of retrieved documents, which helps mitigate the effects of noise and missing information.
2. Conflict resolution: Conflicts are resolved by weighing the model's internal knowledge against the retrieved information. Constructing a knowledge graph from the retrieved documents helps identify and reconcile contradictions, and contextualized reasoning over the generated text facilitates resolving the remaining conflicts.
3. Bias mitigation: Diverse document selection ensures that retrieved materials reflect a wide array of perspectives, and the model is trained to recognize potential biases in the data and actively mitigate their impact.
4. Active knowledge updating: Updating the model's internal knowledge with verified retrieved information enhances its accuracy and reliability.

#generativeai #llm #ai #ml
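The conflict-resolution idea can be approximated at the prompt level: draft one answer from the model's internal knowledge, one from the retrieved documents only, then ask the model to consolidate the two and flag disagreements. This is a simplified illustration of weighing internal against external knowledge, not the paper's exact algorithm; `call_llm` is a hypothetical placeholder.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for your model client."""
    return "(model output)"

def consolidate_answer(question: str, passages: list[str]) -> str:
    """Draft internal-only and retrieval-only answers, then reconcile them."""
    context = "\n\n".join(passages)
    internal = call_llm(f"Answer from your own knowledge only.\nQuestion: {question}")
    external = call_llm(
        f"Answer using ONLY the documents below.\n\nDocuments:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(
        "Two candidate answers follow and may conflict. Produce one final answer, "
        "prefer claims supported by the documents, and explicitly flag any "
        "conflict you cannot resolve.\n\n"
        f"Documents:\n{context}\n\n"
        f"Answer A (internal knowledge): {internal}\n"
        f"Answer B (documents only): {external}\n\n"
        f"Question: {question}"
    )
```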
-
Practical advice to make production RAG (and make it accurate) 🚀

Most RAG demos look great… until you ship them. By default, RAG accuracy is low: the retriever misses, returns near-duplicates, pulls the wrong "almost relevant" chunks, and the LLM confidently answers anyway 😅 Getting to production quality means stacking techniques end-to-end. Think in stages: recall → precision → answerability 🎯

Here's the workflow and what each stage buys you:

1) Query + conversation history → Query Rewriter (LLM) 🧠
• Normalize intent, resolve pronouns, add constraints from history
• Output: clean search query + metadata constraints (time range, product, region, access scope)

2) HyDE (Hypothetical Document Embeddings) 📝
• LLM drafts a hypothetical "ideal answer passage"
• Embed it to reduce vocabulary mismatch and boost recall

3) Retriever + filters 🧰
• Apply metadata filtering before scoring (tenant, permissions/ACL, doc type, recency, language) 🔒
• This is the difference between "smart" and "safe" retrieval

4) Hybrid search (dense + sparse) 🔎
• Dense = semantic recall; sparse/BM25 = exact terms, IDs, error codes, names
• Retrieve Top-N from both, then merge (weighted fusion) → fewer blind spots ⚖️

5) Re-ranker (LLM or cross-encoder) 🥇
• Score Top-N candidates for true relevance to the rewritten query
• Often the biggest quality jump (watch latency/cost) ⏱️💸

6) Diversity & de-dup: MMR 🧩
• Reduce near-duplicate chunks and improve coverage
• Critical when many docs repeat boilerplate (and your context window gets wasted) 🪟

7) Context packing → Generator 🏗️
• Tight context: best passages + citations + key metadata
• "Answer from context only", refusal rules, "ask follow-up if missing"
• Final answer + links/citations 🔗

8) Index-time tricks that make retrieval easier 🗂️
• Chunk with structure (titles/headers), not fixed tokens only
• Deduplicate boilerplate; separate "facts" from long "how-to" sections
• Store rich metadata (owner, ACL, timestamps, source, tags) and keep it queryable 🏷️

9) Ops knobs (so it survives real traffic) 🛠️
• Cache embeddings + retrieval; async rerank when possible; set tight timeouts

10) Close the loop 🔁
• Log: query, rewrite, filters, retrieved ids, fusion scores, rerank scores, final citations
• Evaluate (golden sets, clicks, human review) and tune k, fusion weights, MMR λ, reranker thresholds 📈
• Monitor "no-answer" + "low-evidence" rates 👀

Production RAG isn't "LLM + vector DB". It's an information pipeline with lots of boring knobs, and those knobs are where accuracy comes from 🧪

#RAG #LLM #RetrievalAugmentedGeneration #Search #VectorDatabase #AIEngineering #MLOps
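Two of the stages above, weighted fusion of dense + sparse hits (stage 4) and MMR de-duplication (stage 6), fit in a few lines of plain Python. The sketch below assumes your retrievers return `(doc_id, score)` pairs and that you supply your own `similarity(doc_a, doc_b)` function; it is not tied to any particular vector database.

```python
def fuse(dense_hits, sparse_hits, w_dense=0.6, w_sparse=0.4):
    """Weighted score fusion of dense and sparse results, keyed by doc id.
    Both inputs are lists of (doc_id, score) pairs."""
    scores = {}
    for doc_id, s in dense_hits:
        scores[doc_id] = scores.get(doc_id, 0.0) + w_dense * s
    for doc_id, s in sparse_hits:
        scores[doc_id] = scores.get(doc_id, 0.0) + w_sparse * s
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def mmr(candidates, similarity, k=8, lam=0.7):
    """Maximal Marginal Relevance: pick k results that balance relevance (lam)
    against redundancy with what was already selected (1 - lam)."""
    selected, pool = [], list(candidates)  # candidates: [(doc_id, rel_score), ...]
    while pool and len(selected) < k:
        def mmr_score(item):
            doc_id, rel = item
            redundancy = max((similarity(doc_id, chosen) for chosen, _ in selected),
                             default=0.0)
            return lam * rel - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

# Usage: fused = fuse(dense, sparse); top = mmr(fused[:50], similarity, k=8)
```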
-
What impact does prompt formatting have on your LLM performance?

There is an interesting debate happening in the community around the impact of both input and output formatting on the performance of LLM applications. In general, we are converging on the conclusion that both matter and should be part of your prompt engineering strategy. A recently released paper specifically evaluates the impact of input formatting. Key takeaways I am bringing from the paper, which AI engineers building AI systems should consider as well:

➡️ Testing different variations of prompt formatting, even with the same instructions, should be part of your prompt engineering process. Consider:
👉 Plain text
👉 Markdown
👉 YAML
👉 JSON
👉 XML
❗️The difference in performance driven by prompt formatting can be as much as 40%; it is clearly worth experimenting with.
➡️ Format efficiency of your prompts is likely not consistent between LLMs, even within the same family (e.g. GPT).
❗️ You should re-evaluate your application's performance when switching underlying models.
➡️ Evaluating and keeping track of your LLM application parameters is critical if you want to bring your applications to production.
✅ In general, I consider this good news: we have more untapped space to improve application performance.
ℹ️ As models keep improving, we should see formatting have a smaller impact on result variability.

Read the full paper here: https://lnkd.in/d-AD-Ptq Kudos to the authors! Looking forward to following research. #AI #LLM #MachineLearning
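A quick way to act on this is a small harness that renders the same task in several formats and scores each against your own metric. The sketch below assumes a hypothetical `call_llm` client and a toy `score` function; the YAML rendering is hand-rolled to keep the example dependency-free (use PyYAML for real work).

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for your model client."""
    return "(model output)"

def score(output: str, expected: str) -> float:
    """Toy metric; swap in exact match, pass@1, etc. for your task."""
    return float(expected.strip().lower() in output.strip().lower())

def as_plain(task):
    return f"{task['role']}. {task['instruction']}\n\n{task['input']}"

def as_markdown(task):
    return f"## Role\n{task['role']}\n\n## Task\n{task['instruction']}\n\n## Input\n{task['input']}"

def as_json(task):
    return json.dumps(task, indent=2)

def as_yaml(task):  # hand-rolled block scalars; use PyYAML in practice
    return "\n".join(f"{k}: |\n  " + str(v).replace("\n", "\n  ") for k, v in task.items())

FORMATS = {"plain": as_plain, "markdown": as_markdown, "json": as_json, "yaml": as_yaml}

def compare_formats(task: dict, expected: str) -> dict:
    """Score the same task rendered in each candidate prompt format."""
    return {name: score(call_llm(render(task)), expected) for name, render in FORMATS.items()}

task = {"role": "You are a Python coder",
        "instruction": "Return only the name of the list method that reverses in place.",
        "input": "items = [3, 1, 2]"}
print(compare_formats(task, "reverse"))
```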
-
Excited to share insights from our recent guest tutorial with Rami Krispin's Data Newsletter on building reliable LLMs through high-quality data, ethical scraping, and robust preprocessing! Here's what we covered:

1. Why data quality matters
• "Garbage in, garbage out": noisy or biased data will cripple LLM performance and introduce undesired outputs.
• Well-curated datasets (like TinyGSM's math problems) can help match or exceed much larger models on benchmarks; this is where companies create value!

2. Defining data standards
• Source from peer-reviewed papers and reputable sites, and cross-verify facts.
• Balance breadth (diverse domains) with depth (domain-specific relevance).
• Filter out toxic, spammy, and low-quality text.
• Make sure to respect privacy (GDPR/CCPA), copyright, and Terms of Service.

3. Ethical web scraping
• Respect robots.txt, rate limits, and ToS.
• Avoid PII and sensitive data.
• Use polite scrapers (Scrapy, Selenium) or public datasets (Common Crawl) to minimize legal risk.

4. Cleaning & structuring data
• Strip HTML, normalize text, segment into token-bounded chunks, and apply keyword filters.
• Tools like BeautifulSoup, LangChain, and custom Python scripts streamline cleaning and chunking.

5. Advanced parsing & schema-driven extraction
• LlamaParse for converting complex PDFs/Word docs into structured Markdown or JSON.
• OpenAI Structured Outputs to enforce JSON schemas, ensuring consistent, machine-readable data.

6. End-to-end pipelines
• Combine Scrapy → LlamaParse → OpenAI Structured Outputs → Hugging Face Datasets for scalable, trusted data workflows.

Read the full tutorial here: https://lnkd.in/eTsA788w
And for hands-on practice, join our "From Beginner to Advanced LLM Developer" course at Towards AI Academy: https://lnkd.in/eP5NTpDK

#AI #MachineLearning #DataEngineering #LLM #DataPrep
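For step 4, the cleaning-and-chunking part often ends up as a short script like the sketch below. It assumes `beautifulsoup4` is installed and uses naive word counts as a proxy for tokens; swap in a real tokenizer (e.g. tiktoken) if you need token-exact chunk bounds.

```python
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def clean_html(html: str) -> str:
    """Strip markup, scripts, and navigation, then normalize whitespace."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str, max_words: int = 300, overlap: int = 30) -> list[str]:
    """Naive overlapping chunks bounded by word count (a rough token proxy)."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += max_words - overlap
    return chunks

pages = ["<html><body><h1>Docs</h1><p>Some text…</p></body></html>"]
corpus = [c for page in pages for c in chunk(clean_html(page))]
```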