LLMs process text from left to right: each token can only attend to what came before it, never to what comes after. So when you write a long prompt with context at the beginning and a question at the end, the model answers the question having "seen" the context, but the context tokens were encoded without any awareness of what question was coming. This asymmetry is a basic structural property of how these models work.

The paper asks what happens if you simply send the prompt twice in a row, so that every part of the input gets a second pass in which it can attend to every other part. The answer is that accuracy goes up across seven benchmarks and seven models (from the Gemini, ChatGPT, Claude, and DeepSeek families), with no increase in the length of the model's output and no meaningful increase in response time, because the input is processed in parallel by the hardware anyway. There are no new losses to compute, no finetuning, no clever prompt engineering beyond the repetition itself. The gap between this technique and doing nothing is sometimes small, sometimes large (one model went from 21% to 97% on a task involving finding a name in a list).

If you are thinking about how to get better results from these models without paying for longer outputs or slower responses, that's a fairly concrete and low-effort finding.

Read with AI tutor: https://lnkd.in/ene242cx
Get the PDF: https://lnkd.in/e9tbUTNv
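A minimal sketch of the trick, assuming an OpenAI-compatible chat client; the function name `ask_twice` and the way the two copies are joined are illustrative assumptions, not taken from the paper:

```python
from openai import OpenAI  # any OpenAI-compatible client works similarly

client = OpenAI()

def ask_twice(prompt: str, model: str = "gpt-4o-mini") -> str:
    # Send the full prompt twice so that, within the second copy, every input
    # token can attend to every other part of the prompt via the first copy.
    doubled = f"{prompt}\n\n{prompt}"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": doubled}],
    )
    return response.choices[0].message.content
```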
Improving Predictive Accuracy
-
Training LLMs for spam classification: I added 14 experiments comparing different approaches: https://lnkd.in/gTNVvGcj
- which token to train
- which layers to train
- different model sizes
- LoRA
- unmasking
- and more!
Any additional experiments you'd like to see? Here are the takeaways for the table shown in the picture:
1. Training the last vs. first output token (row 1 vs. 2): Training the last output token results in substantially better performance than training the first. This is expected because of the causal self-attention mask: only the last token has attended to the entire input. (See the sketch after this list.)
2. Training the last transformer block vs. only the last layer (row 1 vs. 3): Training the entire last transformer block also gives substantially better results than training only the last layer.
3. Training all layers vs. the last transformer block (row 1 vs. 4): Training all layers shows a modest improvement of ~2% over training just the last transformer block, but it takes almost three times as long to train.
4. Using larger pretrained models (row 1 vs. 5, and row 1 vs. 6 and 7): A 3x larger pretrained model leads to worse results. However, a 5x larger model improves performance over the initial model, as anticipated, and the 12x larger model improves predictive performance even further. (The medium model was perhaps not well pretrained, or this particular finetuning configuration does not work as well for it.)
5. Using a model with random weights vs. pretrained weights (row 1 vs. 8): A model with random weights yields results only slightly worse, by 1.3%, than using pretrained weights.
6. Using LoRA (low-rank adaptation) vs. training all layers (row 9 vs. 4): Keeping the model frozen and adding trainable LoRA layers (see Appendix E for details) is a viable alternative to training all model parameters and even improves performance by 1 percentage point. As can be seen from the 1% smaller gap between training and validation accuracy when using LoRA, this is likely due to less overfitting.
7. Padding the input to the full context length vs. the longest training example (row 1 vs. 10): Padding the input to the full supported context length gives significantly worse results.
8. Padding vs. no padding (row 1 vs. 11 and 12): The `--no_padding` option disables padding in the dataset, which requires training the model with a batch size of 1 since the inputs have variable lengths. This gives better test accuracy but takes longer to train. In row 12, we additionally enable gradient accumulation with 8 steps to reach the same effective batch size as in the other experiments.
9. Disabling the causal attention mask (row 1 vs. 13): This disables the causal attention mask used in the multi-head attention module, so all tokens can attend to all other tokens. Model accuracy is slightly better than the GPT model with the causal mask.
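A minimal PyTorch sketch of takeaway 1, assuming a GPT-2 backbone from Hugging Face transformers; the classification head, labels, and example text are illustrative, not the post's actual code:

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
backbone = GPT2Model.from_pretrained("gpt2")
num_classes = 2  # spam / not spam
head = torch.nn.Linear(backbone.config.hidden_size, num_classes)

def classify_logits(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    hidden = backbone(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
    # With a causal mask, only the last position has attended to the whole
    # input, so that is the token whose output the head is trained on.
    last_token = hidden[:, -1, :]
    return head(last_token)                        # (1, num_classes)

print(classify_logits("You won a free prize, click here!"))
```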
-
In the last three months alone, over ten papers outlining novel prompting techniques were published, boosting LLMs’ performance by a substantial margin. Two weeks ago, a groundbreaking paper from Microsoft demonstrated how a well-prompted GPT-4 outperforms Google’s Med-PaLM 2, a specialized medical model, solely through sophisticated prompting techniques. Yet, while our X and LinkedIn feeds buzz with ‘secret prompting tips’, a definitive, research-backed guide aggregating these advanced prompting strategies is hard to come by. This gap prevents LLM developers and everyday users from harnessing these novel frameworks to enhance performance and achieve more accurate results. https://lnkd.in/g7_6eP6y
In this AI Tidbits Deep Dive, I outline six of the best recent prompting methods:
(1) EmotionPrompt - inspired by human psychology, this method uses emotional stimuli in prompts to gain performance enhancements
(2) Optimization by PROmpting (OPRO) - a DeepMind innovation that refines prompts automatically, surpassing human-crafted ones. This paper discovered the “Take a deep breath” instruction that improved LLMs’ performance by 9%.
(3) Chain-of-Verification (CoVe) - Meta's four-step prompting process that drastically reduces hallucinations and improves factual accuracy
(4) System 2 Attention (S2A) - also from Meta, a prompting method that filters out irrelevant details before querying the LLM
(5) Step-Back Prompting - encouraging LLMs to abstract queries for enhanced reasoning
(6) Rephrase and Respond (RaR) - UCLA's method that lets LLMs rephrase queries for better comprehension and response accuracy
Understanding the spectrum of available prompting strategies and how to apply them in your app can mean the difference between a production-ready app and a nascent project with untapped potential. Full blog post: https://lnkd.in/g7_6eP6y
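A hedged sketch of two of the methods above expressed as plain prompt templates; `call_llm` is a stand-in for whatever client you use, and the exact wording of the stimulus and instructions is paraphrased rather than copied from the papers:

```python
def call_llm(prompt: str) -> str:
    # Stand-in for your LLM client (OpenAI, Anthropic, etc.).
    raise NotImplementedError("plug in your LLM client here")

# (1) EmotionPrompt: append an emotional stimulus to the task.
def emotion_prompt(question: str) -> str:
    return call_llm(f"{question}\n\nThis is very important to my career.")

# (6) Rephrase and Respond: have the model restate the question, then answer it.
def rephrase_and_respond(question: str) -> str:
    rephrased = call_llm(
        f"Rephrase and expand the following question without answering it:\n{question}"
    )
    return call_llm(
        f"Original question: {question}\nRephrased question: {rephrased}\n"
        "Now answer the question."
    )
```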
-
I recently demoed 4 FP&A platforms that claim to effectively forecast 13-week cash flows using AI. Three of the companies are dedicated planning tools. One company is a financial reporting tool. Despite them being leaders in the FP&A space, seeing their 13-week cash flow tools left me unconvinced.
-----------
What the FP&A tools got right:
(1) Cash flow forecasts were generated in a flash. It was remarkable to see how quickly these tools can create a direct-format 13-week cash flow. It took seconds. When you need to update a cash flow model, taking days or weeks to refresh a rolling forecast isn't an option.
(2) Cash flow forecasts were traceable. Many company cash flow models are driven by lots of data. Auditing Excel formulas isn't a great use of time for Treasurers or FP&As. These tools make it easy to vouch back to the root data and explore the detail.
(3) Cash flows are good enough for companies that don't have to worry. Some FP&As struggle to accept that top-down forecasts may be good enough for most companies that don't have to worry much about cash flow. That's because they're flush with liquidity, have a line of credit, and aren't laser-focused or hands-on with cash. A decent forecast that isn't remarkably accurate isn't always a liability. It can be an asset, since it's a reasonable-enough snapshot in time.
-----------
What the FP&A tools got wrong:
(4) Forecasts use past data and trends for almost all assumptions about the future. If you're managing cash flow for a business that's seasonal, volatile, or has cash flow issues, relying on past data and trends can be reckless and lazy. When it comes to cash flow management, relying too much on historical trends leads to really poor assumptions, and decisions based on those bad assumptions are bad decisions too.
(5) Forecasts were mostly observational, not prescriptive. Unless you're working with a large corporation, where operations are steady and bank accounts are full, cash flow forecasts should enable thoughtful choices. That means the model should reveal operational drivers, opportunities, and scenarios. These tools don't really allow for these basic features. They're mostly just reports and data extrapolations.
(6) Forecasts didn't capture nuance. In the example I show here, my cash flow model can quickly and easily incorporate actuals plus weekly and monthly forecast periods. I'm able to hold back 20% of accounts payable. I can pay back the A/P at any rate and timing that I want. I can be aggressive with catch-up payments and early-payment discounts. It's what a company needs to be able to see, whether it's doing $20 million or $200 million in revenue.
It's not that AI can't do cash flow forecasting and modeling. It's that it can't do it as well as you'd hope. And that's the problem. Cash flows are full of nuance. AI-driven cash flow forecasts aren't great at understanding nuance. You can learn cash flows with me live: https://lnkd.in/grQVkeyE
-
LLM hallucinations aren't bugs, they're compression artefacts. And we just figured out how to predict them before they happen. 400 stars in one week, the reception has been unreal. Our toolkit is open source and anyone can use it. https://lnkd.in/e4s3X8GK
When your LLM confidently states that "Napoleon won the Battle of Waterloo," it's not broken. It's doing exactly what it was trained to do: compress the entire internet into model weights, then decompress on demand. Sometimes there isn't enough information to perfectly reconstruct rare facts, so it fills the gaps with statistically plausible but wrong content. Think of it like a ZIP file corrupted during compression: the decompression algorithm still runs, but it outputs garbage where data was lost.
The breakthrough: We proved hallucinations occur when information budgets fall below mathematical thresholds. Using our Expectation-level Decompression Law (EDFL), we can calculate exactly how many bits of information are needed to prevent any specific hallucination, before generation even starts. This resolves a fundamental paradox: LLMs achieve near-perfect Bayesian performance on average, yet systematically fail on specific inputs. We proved they're "Bayesian in expectation, not in realisation", optimising average-case compression rather than worst-case reliability.
Why this changes everything: Instead of treating hallucinations as inevitable, we can now:
- Calculate risk scores before generating any text
- Set guaranteed error bounds (e.g. 95%)
- Know precisely when to gather more context vs. abstain
The full preprint is being released on arXiv this week. Until then, read the preprint PDF we uploaded here: https://lnkd.in/eRf_ecu3
The toolkit works with any OpenAI-compatible API. Zero retraining required. Provides mathematical SLA guarantees for compliance. Perfect for healthcare, finance, legal, anywhere errors aren't acceptable. The era of "trust me, bro" AI is ending. Welcome to bounded, predictable AI reliability.
Big thanks to Ahmed K. Maggie C. for all the help putting this + the repo together!
#AI #MachineLearning #ResponsibleAI #OpenSource #LLM #Innovation
-
RAG stands for Retrieval-Augmented Generation. It’s a technique that combines the power of LLMs with real-time access to external information sources. Instead of relying solely on what an AI model learned during training (which can quickly become outdated), RAG enables the model to retrieve relevant data from external databases, documents, or APIs, and then use that information to generate more accurate, context-aware responses.
How does RAG work?
𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗲: The system searches for the most relevant documents or data based on your query, using advanced search methods like semantic or vector search.
𝗔𝘂𝗴𝗺𝗲𝗻𝘁: Instead of just using the original question, RAG 𝗮𝘂𝗴𝗺𝗲𝗻𝘁𝘀 (enriches) the prompt by adding the retrieved information directly into the input for the AI model. This means the model doesn’t just rely on what it “remembers” from training; it now sees your question 𝘱𝘭𝘶𝘴 the latest, domain-specific context.
𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗲: The LLM takes the retrieved information and crafts a well-informed, natural language response.
𝗪𝗵𝘆 𝗱𝗼𝗲𝘀 𝗥𝗔𝗚 𝗺𝗮𝘁𝘁𝗲𝗿?
Improves accuracy: By referencing up-to-date or proprietary data, RAG reduces outdated or incorrect answers.
Context-aware: Responses are tailored using the latest information, not just what the model “remembers.”
Reduces hallucinations: RAG helps prevent AI from making up facts by grounding answers in real sources.
Example: Imagine asking an AI assistant, “What are the latest trends in renewable energy?” A traditional LLM might give you a general answer based on old data. With RAG, the model first searches for the most recent articles and reports, then synthesizes a response grounded in that up-to-date information.
Illustration by Deepak Bhardwaj
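A minimal sketch of the retrieve-augment-generate loop, assuming a hypothetical `vector_search` helper and an OpenAI-compatible client; both are stand-ins for whatever vector store and LLM you actually use:

```python
from openai import OpenAI

client = OpenAI()

def vector_search(query: str, k: int = 5) -> list[str]:
    # Stand-in for your vector store (FAISS, pgvector, Pinecone, ...):
    # return the k most semantically similar document chunks.
    raise NotImplementedError

def rag_answer(question: str) -> str:
    # Retrieve: find the most relevant chunks for the query.
    chunks = vector_search(question)
    # Augment: enrich the prompt with the retrieved context.
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # Generate: let the LLM craft the grounded response.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```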
-
A Few Lessons from Deploying and Using LLMs in Production
Deploying LLMs can feel like hiring a hyperactive genius intern: they dazzle users while potentially draining your API budget. Here are some insights I’ve gathered:
1. “Cheap” is a lie you tell yourself: Cloud costs per call may seem low, but the overall expense of an LLM-based system can skyrocket. Fixes:
- Cache repetitive queries: users ask the same thing at least 100x/day.
- Gatekeep: use cheap classifiers (BERT) to filter “easy” requests. Let LLMs handle only the complex 10% and your current systems handle the remaining 90%. (See the sketch after this list.)
- Quantize your models: shrink LLMs to run on cheaper hardware without massive accuracy drops.
- Asynchronously build your caches: pre-generate common responses before they’re requested, or gracefully fail the first time a query comes in and cache the answer for the next time.
2. Guard against model hallucinations: Sometimes models express answers with such confidence that distinguishing fact from fiction becomes challenging, even for human reviewers. Fixes:
- Use RAG: just a fancy way of saying you provide the model the knowledge it requires in the prompt itself, by querying some database based on semantic matches with the query.
- Guardrails: validate outputs using regex or cross-encoders to establish a clear decision boundary between the query and the LLM’s response.
3. The best LLM is often a discriminative model: You don’t always need a full LLM. Consider knowledge distillation: use a large LLM to label your data, then train a smaller, discriminative model that performs similarly at a much lower cost.
4. It's not about the model, it's about the data it is trained on: A smaller LLM might struggle with specialized domain data; that’s normal. Fine-tune your model on your specific dataset, starting with parameter-efficient methods (like LoRA or Adapters) and using synthetic data generation to bootstrap training.
5. Prompts are the new features: Prompts are the new features in your system. Version them, run A/B tests, and continuously refine them using online experiments. Consider bandit algorithms to automatically promote the best-performing variants.
What do you think? Have I missed anything? I’d love to hear your “I survived LLM prod” stories in the comments!
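A rough sketch of the cache + gatekeeper pattern from point 1; the `is_easy_request` classifier and `legacy_answer` fallback are hypothetical stand-ins for whatever cheap classifier and existing system you already run:

```python
import hashlib
from openai import OpenAI

client = OpenAI()
cache: dict[str, str] = {}  # swap for Redis/memcached in production

def is_easy_request(query: str) -> bool:
    # Stand-in for a cheap classifier (e.g. a fine-tuned BERT) that flags
    # requests the existing system can handle without an LLM call.
    raise NotImplementedError

def legacy_answer(query: str) -> str:
    # Stand-in for your current non-LLM system.
    raise NotImplementedError

def answer(query: str) -> str:
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key in cache:                     # 1. cache repetitive queries
        return cache[key]
    if is_easy_request(query):           # 2. gatekeep the easy ~90%
        return legacy_answer(query)
    response = client.chat.completions.create(  # 3. LLM only for the hard ~10%
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    )
    result = response.choices[0].message.content
    cache[key] = result
    return result
```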
-
Explaining the evaluation method LLM-as-a-Judge (LLMaaJ).
Token-based metrics like BLEU or ROUGE are still useful for structured tasks like translation or summarization. But for open-ended answers, RAG copilots, or complex enterprise prompts, they often miss the bigger picture. That’s where LLMaaJ changes the game.
𝗪𝗵𝗮𝘁 𝗶𝘀 𝗶𝘁? You use a powerful LLM as an evaluator, not a generator. It’s given:
- The original question
- The generated answer
- The retrieved context or gold answer
𝗧𝗵𝗲𝗻 𝗶𝘁 𝗮𝘀𝘀𝗲𝘀𝘀𝗲𝘀:
✅ Faithfulness to the source
✅ Factual accuracy
✅ Semantic alignment, even if phrased differently
𝗪𝗵𝘆 𝘁𝗵𝗶𝘀 𝗺𝗮𝘁𝘁𝗲𝗿𝘀: LLMaaJ captures what traditional metrics can’t. It understands paraphrasing. It flags hallucinations. It mirrors human judgment, which is critical when deploying GenAI systems in the enterprise.
𝗖𝗼𝗺𝗺𝗼𝗻 𝗟𝗟𝗠𝗮𝗮𝗝-𝗯𝗮𝘀𝗲𝗱 𝗺𝗲𝘁𝗿𝗶𝗰𝘀:
- Answer correctness
- Answer faithfulness
- Coherence, tone, and even reasoning quality
📌 If you’re building enterprise-grade copilots or RAG workflows, LLMaaJ is how you scale QA beyond manual reviews.
To put LLMaaJ into practice, check out EvalAssist, a new tool from IBM Research. It offers a web-based UI to streamline LLM evaluations:
- Refine your criteria iteratively using Unitxt
- Generate structured evaluations
- Export as Jupyter notebooks to scale effortlessly
A powerful way to bring LLM-as-a-Judge into your QA stack.
- Get Started guide: https://lnkd.in/g4QP3-Ue
- Demo Site: https://lnkd.in/gUSrV65s
- Github Repo: https://lnkd.in/gPVEQRtv
- Whitepapers: https://lnkd.in/gnHi6SeW
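A bare-bones sketch of an LLMaaJ check, not EvalAssist's API; the rubric wording, model choice, and JSON format are my own illustrative assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Retrieved context: {context}
Candidate answer: {answer}

Rate the candidate answer and reply with JSON only:
{{"faithfulness": 1-5, "factual_accuracy": 1-5, "semantic_alignment": 1-5, "explanation": "..."}}"""

def judge(question: str, context: str, answer: str) -> dict:
    # A strong model acts as the judge, not the generator.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
    )
    return json.loads(response.choices[0].message.content)
```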
-
Apple’s new Superposition Prompting method improves RAG accuracy by 43%.
Suppose your vector search retrieved 5 documents. Instead of processing them as one big unit, this approach lets the LLM consider each doc separately and process them in parallel. So it’s obvious how it improves speed. But how does it improve accuracy? A major problem with LLMs is that irrelevant info in the input context confuses the model. By considering each retrieved doc separately when answering the query, this problem is reduced.
Here’s how they actually do Superposition Prompting: they use a DAG structure in which each query segment is a duplicate of the original query. This allows the query segments to be processed in parallel. The model looks at each query segment and its retrieved docs independently, and uses path pruning to get rid of any irrelevant docs.
To make inference even faster, the paper uses path caching and path parallelisation techniques:
- Path caching precomputes KV embeddings for the docs.
- Path parallelisation computes KV caches and logits for query segments in parallel.
Paper: https://lnkd.in/g_j67TpY
#AI #RAG #LLMs
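A much-simplified sketch of the core idea only: score each retrieved doc against the query independently, prune low-relevance paths, then answer from what survives. This uses ordinary API-level calls rather than the paper's shared-KV-cache implementation, and `call_llm` plus the scoring prompt are illustrative stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    # Stand-in for your LLM client.
    raise NotImplementedError

def score_path(query: str, doc: str) -> float:
    # Each (query, single doc) path is evaluated independently, so irrelevant
    # docs never share context with relevant ones.
    reply = call_llm(
        f"Document:\n{doc}\n\nQuestion: {query}\n"
        "On a scale of 0-10, how useful is this document for answering? Reply with a number."
    )
    return float(reply.strip())

def superposition_style_answer(query: str, docs: list[str], keep: int = 2) -> str:
    with ThreadPoolExecutor() as pool:  # paths are processed in parallel
        scores = list(pool.map(lambda d: score_path(query, d), docs))
    # Path pruning: keep only the highest-scoring docs.
    pruned = [d for _, d in sorted(zip(scores, docs), reverse=True)[:keep]]
    context = "\n\n".join(pruned)
    return call_llm(f"Context:\n{context}\n\nQuestion: {query}")
```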
-
𝗠𝗰𝗞𝗶𝗻𝘀𝗲𝘆 𝗼𝘂𝘁𝗹𝗶𝗻𝗲𝗱 𝟲 𝗮𝗱𝘃𝗮𝗻𝗰𝗲𝗱 𝗙𝗣&𝗔 𝗽𝗿𝗮𝗰𝘁𝗶𝗰𝗲𝘀 𝗳𝗼𝗿 𝗯𝗲𝘁𝘁𝗲𝗿 𝗳𝗼𝗿𝗲𝗰𝗮𝘀𝘁𝗶𝗻𝗴. Most finance teams know them. Few actually implement them consistently. Why? Because doing it right has always been painfully manual.
𝗛𝗲𝗿𝗲'𝘀 𝘄𝗵𝗮𝘁 𝘀𝘁𝗿𝘂𝗰𝗸 𝗺𝗲: AI is changing this. Fast. The six practices McKinsey recommends are now achievable at scale:
• 𝗣𝗿𝗼𝗯𝗮𝗯𝗶𝗹𝗶𝘁𝘆-𝘄𝗲𝗶𝗴𝗵𝘁𝗲𝗱 𝘀𝗰𝗲𝗻𝗮𝗿𝗶𝗼𝘀 – AI can run hundreds of scenarios and assign P values automatically, not just the three you had time to build manually.
• 𝗧𝗿𝘂𝗲 𝗺𝗼𝗺𝗲𝗻𝘁𝘂𝗺 𝗰𝗮𝘀𝗲𝘀 – AI separates baseline trends from management initiatives without the spreadsheet gymnastics.
• 𝗕𝗲𝗮𝗿 𝗰𝗮𝘀𝗲 𝗺𝗼𝗱𝗲𝗹𝗶𝗻𝗴 – AI identifies downside risks and models them before you're blindsided.
• 𝗖𝗼𝗻𝘀𝗶𝘀𝘁𝗲𝗻𝘁 𝗺𝗮𝗰𝗿𝗼 𝗮𝘀𝘀𝘂𝗺𝗽𝘁𝗶𝗼𝗻𝘀 – AI flags when one business unit uses different GDP assumptions than another.
• 𝗗𝗶𝘀𝗮𝗴𝗴𝗿𝗲𝗴𝗮𝘁𝗲𝗱 𝗶𝗻𝗳𝗹𝗮𝘁𝗶𝗼𝗻 – AI tracks the specific components that actually affect your business, not just CPI averages.
• 𝗖𝗼𝗻𝘁𝗶𝗻𝘂𝗼𝘂𝘀 𝗯𝗮𝗰𝗸 𝘁𝗲𝘀𝘁𝗶𝗻𝗴 – AI compares forecasts to actuals weekly and learns from variances automatically.
𝗧𝗵𝗲 𝗯𝗿𝘂𝘁𝗮𝗹 𝘁𝗿𝘂𝘁𝗵: Human bias has always been the weak link in forecasting. Optimism creeps in. Assumptions go unchallenged. P values are applied inconsistently across business units. AI doesn't have a political agenda. It doesn't inflate projections to look good in front of the board. It just processes data. The result? Faster forecasts. More accurate projections. And decisions based on reality, not hope.
𝗠𝘆 𝗮𝗱𝘃𝗶𝗰𝗲?
𝟭. 𝗦𝘁𝗮𝗿𝘁 𝘄𝗶𝘁𝗵 𝗯𝗮𝗰𝗸 𝘁𝗲𝘀𝘁𝗶𝗻𝗴: Use AI to compare your forecasts over the last 12 months with actuals. Find where bias lives in your models.
𝟮. 𝗔𝘂𝘁𝗼𝗺𝗮𝘁𝗲 𝘀𝗰𝗲𝗻𝗮𝗿𝗶𝗼 𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻: Stop building three scenarios manually. Let AI generate probability-weighted ranges based on actual data patterns.
𝟯. 𝗘𝗻𝗳𝗼𝗿𝗰𝗲 𝗮𝘀𝘀𝘂𝗺𝗽𝘁𝗶𝗼𝗻 𝗰𝗼𝗻𝘀𝗶𝘀𝘁𝗲𝗻𝗰𝘆: Use AI to flag when macro assumptions differ across business units. Inconsistency kills forecast accuracy.
Because here's what separates finance teams that drive decisions from those that just report numbers: they use AI to remove bias and deliver forecasts that leadership can actually trust.
𝗦𝗼 𝗯𝗲 𝗵𝗼𝗻𝗲𝘀𝘁: Which of these six practices is your biggest gap right now?
----------
🧑💼 I'm a partner at Business Partnering Institute
🤝 We help increase the influence of your finance team
🔔 To see more content, hit the bell on my profile
📘 Order our new book now: https://bit.ly/4h2P9AA
🧑🎓 Enroll in our LinkedIn course: https://bit.ly/4a5fB9l
📻 #FinanceMaster podcast: https://bit.ly/3NLSt73
📺 Follow us on YouTube: https://bit.ly/4bSBut6
📢 Join our WhatsApp channel: https://bit.ly/3WWGOrc
📄 Check out all our templates and cheat sheets here: https://lnkd.in/eC_zuCU4