One of the biggest constraints on the value of LLMs is that they are equally confident irrespective of underlying uncertainty. A new sampling approach, Entropix, proposes using different strategies for selecting the next token depending on the nature of the model's uncertainty. A great piece by Thariq Shihipar lays out the logic.

The starting point is distinguishing between entropy and varentropy. Entropy measures how concentrated or diffuse the options for the next token are: low entropy means the model has one very high-probability next token, high entropy means probability is spread across a number of similarly plausible next tokens. Varentropy measures how much that uncertainty varies across the candidate tokens: consistent (low) or varied (high).

Each of the four combinations of these possibilities yields a different strategy for improving next-token selection:

⬇️⬇️ Low Entropy, Low Varentropy: Model is very confident → Choose the highest probability option
⬇️⬆️ Low Entropy, High Varentropy: Few strong competing options → Consider branching to explore different paths
⬆️⬇️ High Entropy, Low Varentropy: Model is uncertain → Use "thinking tokens" to prompt more consideration
⬆️⬆️ High Entropy, High Varentropy: Many scattered options → Use random selection or branching

These are still early days in being able to assess model uncertainty and adjust to improve output validity (including reducing hallucinations). However, progress here will greatly improve the value of LLMs.

Another critical aspect of this research is in Humans + AI work. Humans have to make their own assessments of LLM outputs of highly varying quality. Decision quality could improve massively if LLMs could offer valid confidence assessments as input into complex human-first decisions.
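Roughly, the selection logic looks like the minimal sketch below, assuming access to the raw next-token logits for a single position; the thresholds and strategy names are illustrative, not taken from the Entropix codebase.

```python
import torch
import torch.nn.functional as F

def entropy_varentropy(logits: torch.Tensor):
    """Entropy and varentropy (variance of surprisal) of a next-token distribution.
    logits: 1-D tensor over the vocabulary for the current position."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)
    # Varentropy: how much the surprisal (-log p) varies around the entropy.
    varentropy = (probs * (-log_probs - entropy.unsqueeze(-1)) ** 2).sum(dim=-1)
    return entropy, varentropy

def choose_strategy(logits, ent_thresh=2.0, vent_thresh=2.0):
    # Thresholds are illustrative assumptions, not Entropix's actual values.
    ent, vent = entropy_varentropy(logits)
    if ent < ent_thresh and vent < vent_thresh:
        return "argmax"               # confident: take the top token
    if ent < ent_thresh:
        return "branch"               # few strong competitors: explore paths
    if vent < vent_thresh:
        return "insert_think_token"   # uncertain everywhere: prompt more reasoning
    return "sample_or_branch"         # scattered options: sample or branch
```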
Utilizing Software Features
-
Groundbreaking Research Alert: Making LLMs More Efficient with Smart Retrieval

A fascinating paper from NAVER LABS Europe introduces a novel approach to optimize Large Language Models' retrieval mechanisms. The research shows how we can reduce retrieval operations by over 50% while maintaining or even improving performance.

Key Technical Insights:
- The system uses an "I Know" (IK) classifier that achieves 80% accuracy in determining when an LLM needs external knowledge
- Only 32 tokens from the initial response are needed to make this determination
- Training requires just 20,000 samples to achieve optimal performance
- The approach works across multiple model families including Mistral, Llama, Gemma, and SOLAR

Under the hood:
- The system employs an LLM-as-judge architecture for training data generation
- It uses adapters for fine-tuning larger models (7B+)
- The IK score is computed using softmax on Yes/No token logits
- Processing time is remarkably efficient: 3.7ms for IK classification, 8.3ms for generating 32 tokens

Real-world Impact:
- Reduces RAG processing time by up to 80%
- Improves efficiency across various datasets including NQ, ASQA, HotpotQA
- Particularly effective for general knowledge datasets like TriviaQA and SCIQ

This research represents a significant step forward in making LLMs more efficient and practical for real-world applications. The ability to selectively activate retrieval mechanisms could be a game-changer for deployment at scale.
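As a rough sketch of how such an IK gate could sit in front of retrieval: draft 32 tokens, ask the model to judge its own draft, and take the softmax over the Yes/No logits. The model name, judge-prompt wording, and threshold below are assumptions for illustration, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; the paper fine-tunes adapters on several families.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

def model_knows(question: str, threshold: float = 0.5) -> bool:
    """Return True if the model likely 'knows' the answer, so retrieval can be skipped."""
    # 1) Draft only the first 32 tokens of an answer.
    draft_ids = model.generate(**tok(question, return_tensors="pt"), max_new_tokens=32)
    draft = tok.decode(draft_ids[0], skip_special_tokens=True)

    # 2) Ask the model to judge its own draft (prompt wording is an assumption).
    judge = f"{draft}\nIs this answer correct without consulting external documents? Answer Yes or No:"
    logits = model(**tok(judge, return_tensors="pt")).logits[0, -1]

    # 3) IK score = softmax restricted to the Yes/No token logits.
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    ik_score = torch.softmax(logits[[yes_id, no_id]], dim=-1)[0].item()
    return ik_score >= threshold  # skip retrieval when the model says "I know"
```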
-
You're losing a TON of your budget ... because you're using the wrong... data format???

Yep. Allow me to explain TOON vs JSON.

Both are data formats, but the one you choose affects your budget and how your team works. A lot of folks are sending large raw JSON payloads into GPT-5 or Claude 3.5, and your team needs to be trained that this can be really wasteful. Part of your training needs to cover the new TOON (Token-Oriented Object Notation) format for data going into your LLMs. In simple terms, you can use this simpler, condensed format to save money and trouble.

In practice, that means:
- 40%+ faster processing for LLMs (check out the graphic I created below)
- Often 30–60% fewer tokens for uniform lists like logs, search results, product catalogs, or time-series data

So I'm going to urge you to teach two simple rules of thumb:
- Use JSON when your data is deeply nested, heterogeneous, or tightly coupled to tools/function calling.
- Use TOON when 80%+ of your data is tabular or uniform records and you care about squeezing more useful work into the same context window.

If you are building RAG pipelines, analytics, or catalog search, experimenting with a token‑oriented format like TOON can be one of the easiest ways to get more out of the models you already use.
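To make the difference concrete, here is a small comparison of the same uniform records as JSON versus a TOON-style tabular encoding. The TOON syntax shown is approximated from the spec, so treat it as illustrative rather than authoritative; character count is used as a rough stand-in for token count.

```python
import json

# Three uniform records, as you'd find in logs or a product catalog.
rows = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "member"},
    {"id": 3, "name": "Cara", "role": "member"},
]

as_json = json.dumps(rows)

# Hand-written TOON-style encoding (approximate syntax): the header declares
# the fields once, and each row lists only the values.
as_toon = """rows[3]{id,name,role}:
  1,Alice,admin
  2,Bob,member
  3,Cara,member"""

print(len(as_json), "chars as JSON")
print(len(as_toon), "chars as TOON-style")
```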
-
No Power BI? No Problem.

Everyone seems to be building dashboards in Power BI these days. But what if you don’t have access? Maybe your company hasn’t provided a license. Maybe your laptop can’t handle it. Or maybe you’re just not sure where to begin.

Here’s what most people don’t realize: you can still build solid analytics skills using free, accessible tools, and those same skills will carry over when you do start using Power BI. Tools like Tableau Public, Looker Studio, Google Sheets, and Excel Online can teach you how to clean data, build dashboards, apply formulas, and tell compelling stories with data. You don’t need expensive software to start. You just need the right mindset and resources.

I’ve pulled together some of the best tutorials and practice tools to help you get started:
↳ Access the Tableau Free desktop version here: https://lnkd.in/dxXxzR_m
↳ Learn how to install it here: https://lnkd.in/dkYHrQfC
↳ Introduction to Tableau: https://lnkd.in/deXZiDjG
↳ Connecting to Data Sources: https://lnkd.in/dq8ibppR

Core Skills
↳ Calculated Fields: https://lnkd.in/dEdhYjYC
↳ Filters & Parameters: https://lnkd.in/dJPaGJ_i
↳ Tableau Zen Master Tips & Tricks: https://lnkd.in/dXqY3yPs
↳ Top 10 Tableau Dashboard Design Tips: https://lnkd.in/dZcewx7i

Advanced Techniques
↳ Create a Stunning Advanced Dashboard in Tableau:
↳ LOD Expressions: https://lnkd.in/dSfjmuWg
↳ Tableau Prep: https://lnkd.in/dkYHrQfC

Real-World Applications
↳ Tableau Public Portfolio: https://lnkd.in/dxXxzR_m
↳ Case Studies: https://lnkd.in/d_jRSttk

Additional Resources
↳ Practice Datasets: https://lnkd.in/dEwcEiVq
↳ Cheat Sheets: https://shorturl.at/3SHnK
↳ Communities: https://lnkd.in/dqTZySvW

Know someone who needs this? Share it with them. ♻

If you’re serious about leveling up your data career, join my WhatsApp channel for direct insights & updates, or subscribe to my YouTube channel for in-depth tutorials.
↳ My WhatsApp channel: https://lnkd.in/dawGfYjq
↳ My YouTube channel: https://lnkd.in/deiQF4DW
-
Researchers from Meta built a new RAG approach (with 2-4x less token usage + a 16x larger context window).

Most of what we retrieve in RAG setups never actually helps the LLM. In classic RAG, when a query arrives:
- You encode it into a vector.
- Fetch similar chunks from the vector DB.
- Dump the retrieved context into the LLM.

It typically works, but at a huge cost:
- Most chunks contain irrelevant text.
- The LLM has to process far more tokens.
- You pay for compute, latency, and context.

That’s the exact problem Meta AI’s new method REFRAG solves. It fundamentally rethinks retrieval, and the diagram below explains how it works.

Essentially, instead of feeding the LLM every chunk and every token, REFRAG compresses and filters context at a vector level:
- Chunk compression: Each chunk is encoded into a single compressed embedding, rather than hundreds of token embeddings.
- Relevance policy: A lightweight RL-trained policy evaluates the compressed embeddings and keeps only the most relevant chunks.
- Selective expansion: Only the chunks chosen by the RL policy are expanded back into their full embeddings and passed to the LLM.

This way, the model processes just what matters and ignores the rest.

Here's the step-by-step walkthrough:
- Step 1-2) Encode the docs and store them in a vector database.
- Step 3-5) Encode the full user query and find relevant chunks. Also, compute the token-level embeddings for both the query (step 7) and matching chunks.
- Step 6) Use a relevance policy (trained via RL) to select chunks to keep.
- Step 8) Concatenate the token-level representations of the input query with the token-level embeddings of selected chunks and a compressed single-vector representation of the rejected chunks.
- Step 9-10) Send all that to the LLM.

The RL step makes REFRAG a more relevance-aware RAG pipeline.

Based on the research paper, this approach:
- has 30.85x faster time-to-first-token (3.75x better than previous SOTA)
- provides 16x larger context windows
- outperforms LLaMA on 16 RAG benchmarks while using 2–4x fewer decoder tokens
- leads to no accuracy loss across RAG, summarization, and multi-turn conversation tasks

That means you can process 16x more context at 30x the speed, with the same accuracy.

The code has not been released yet by Meta. They intend to do that soon.
____
Find me → Avi Chawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
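Since the code isn't released, here is only a structural sketch of the select-then-expand step in PyTorch. The linear scorer stands in for the RL-trained policy, and all class and variable names are invented for illustration.

```python
import torch
import torch.nn as nn

class RefragStyleSelector(nn.Module):
    """Structural sketch of the REFRAG idea (not Meta's implementation):
    keep one compressed embedding per chunk, expand only the chunks a
    lightweight policy marks as relevant."""

    def __init__(self, d_model: int, keep_k: int = 4):
        super().__init__()
        self.keep_k = keep_k
        # Stand-in for the RL-trained relevance policy: scores each compressed chunk.
        self.policy = nn.Linear(2 * d_model, 1)

    def forward(self, query_vec, chunk_vecs, chunk_token_embs):
        # query_vec: (d,)  chunk_vecs: (n_chunks, d)
        # chunk_token_embs: list of (n_tokens_i, d) token-level embeddings per chunk
        n = chunk_vecs.size(0)
        pairs = torch.cat([chunk_vecs, query_vec.expand(n, -1)], dim=-1)
        scores = self.policy(pairs).squeeze(-1)             # relevance per chunk
        keep = scores.topk(min(self.keep_k, n)).indices     # chunks to expand

        expanded, compressed = [], []
        for i in range(n):
            if i in keep:
                expanded.append(chunk_token_embs[i])        # full token embeddings
            else:
                compressed.append(chunk_vecs[i : i + 1])    # single vector only
        # The LLM then sees: query tokens + expanded chunks + compressed leftovers.
        return torch.cat(expanded + compressed, dim=0)
```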
-
New Google paper challenges how we measure LLM reasoning. Token count is a poor proxy for actual reasoning quality. There might be a better way to measure this.

This work introduces "deep-thinking tokens," a metric that identifies tokens where internal model predictions shift significantly across deeper layers before stabilizing. These tokens capture "genuine reasoning" effort rather than verbose output. Instead of measuring how much a model writes, measure how hard it's actually thinking at each step.

Deep-thinking tokens are identified by tracking prediction instability across transformer layers during inference. The ratio of deep-thinking tokens correlates more reliably with accuracy than token count or confidence metrics across mathematical and scientific benchmarks (AIME 24/25, HMMT 25, GPQA-diamond), tested on DeepSeek-R1, Qwen3, and GPT-OSS.

They also introduce Think@n, a test-time compute strategy that prioritizes samples with high deep-thinking ratios while early-rejecting low-quality partial outputs, reducing cost without sacrificing performance.

Why does it matter? As inference-time scaling becomes a primary lever for improving model performance, we need better signals than token length to understand when a model is actually reasoning versus just rambling.
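As a rough illustration of the general idea (not the paper's exact recipe), a "logit lens"-style pass over intermediate hidden states can flag tokens whose deep-layer predictions keep shifting before settling. The choice of layers, the KL-based instability measure, and the threshold below are all assumptions.

```python
import torch

def deep_thinking_ratio(hidden_states, unembed, last_n_layers: int = 8, shift_thresh: float = 1.0):
    """Sketch only: fraction of tokens whose layer-by-layer predictions are unstable.

    hidden_states: tuple of (seq_len, d) tensors, one per layer
                   (e.g. from a forward pass with output_hidden_states=True).
    unembed: (vocab, d) output projection matrix used as a logit lens.
    """
    # Per-layer next-token log-probabilities for every position, deepest layers only.
    layer_logprobs = torch.stack(
        [torch.log_softmax(h @ unembed.T, dim=-1) for h in hidden_states[-last_n_layers:]]
    )  # (last_n_layers, seq_len, vocab)

    # Instability: KL divergence between consecutive deep layers, summed per token.
    p_prev, p_next = layer_logprobs[:-1], layer_logprobs[1:]
    kl = (p_next.exp() * (p_next - p_prev)).sum(dim=-1).sum(dim=0)  # (seq_len,)

    deep_thinking = kl > shift_thresh            # threshold is an assumption
    return deep_thinking.float().mean().item()   # ratio of deep-thinking tokens
```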
-
ByteDance is on fire with another step in the Long March for LLM Efficiency: UltraMem trades dynamic computation (QKV attention) for efficient table lookups (PKM, or Product Key Memory). By substituting compute with smart lookup, they achieve 6x faster inference and a 4x compute reduction. This is the original PKM proposal from Guillaume Lample (then of Meta FAIR and now of Mistral AI) et al.: https://lnkd.in/gaEz463u

Scale: a 21B parameter system organized into six memory layers. The key design choice: for each layer, selecting only the top 84 most relevant vectors (0.0026%!) per query token from millions of possibilities. Their approach uses Tucker decomposition, a tensor factorization method, to make this extremely sparse selection practical. The trick is finding the right 0.0026% quickly and accurately.

Results: matches model performance while running 6x faster than MoE and using 1/4 the compute. Shows how extremely sparse but precise lookup can dramatically improve LLM efficiency: https://lnkd.in/gdH7ZxYU
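For intuition, here is a minimal product-key memory lookup in the spirit of Lample et al.'s PKM. UltraMem's additions (Tucker-decomposed retrieval, its specific layer layout) are not reproduced here, and the sizes are toy values rather than anything from the paper.

```python
import torch
import torch.nn as nn

class ProductKeyMemory(nn.Module):
    """Minimal PKM sketch: two small sub-key tables implicitly index n_sub_keys**2 slots."""

    def __init__(self, d_model=512, n_sub_keys=256, d_value=512, top_k=32):
        super().__init__()
        self.top_k = top_k
        self.n_sub_keys = n_sub_keys
        self.query_proj = nn.Linear(d_model, 2 * d_model)
        self.sub_keys1 = nn.Parameter(torch.randn(n_sub_keys, d_model))
        self.sub_keys2 = nn.Parameter(torch.randn(n_sub_keys, d_model))
        # 65k value slots here; UltraMem scales this to millions.
        self.values = nn.Embedding(n_sub_keys * n_sub_keys, d_value)

    def forward(self, x):                       # x: (batch, d_model)
        q1, q2 = self.query_proj(x).chunk(2, dim=-1)
        # Score each query half against its sub-key table, keep top-k per half.
        s1, i1 = (q1 @ self.sub_keys1.T).topk(self.top_k, dim=-1)   # (batch, k)
        s2, i2 = (q2 @ self.sub_keys2.T).topk(self.top_k, dim=-1)
        # Combine halves into k*k candidate slots, then keep the global top-k.
        scores = (s1.unsqueeze(-1) + s2.unsqueeze(-2)).flatten(1)   # (batch, k*k)
        idx = (i1.unsqueeze(-1) * self.n_sub_keys + i2.unsqueeze(-2)).flatten(1)
        best_scores, best_pos = scores.topk(self.top_k, dim=-1)
        slots = idx.gather(1, best_pos)                             # (batch, k)
        weights = torch.softmax(best_scores, dim=-1).unsqueeze(-1)
        # Weighted sum of the few retrieved value vectors replaces dense compute.
        return (weights * self.values(slots)).sum(dim=1)            # (batch, d_value)
```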
-
The biggest problem with LLMs isn't their reasoning - it's their inefficiency. Chain of Draft just solved it.

While Chain of Thought (CoT) prompting and reasoning models in general help LLMs tackle complex tasks, they introduce a significant inefficiency: verbose outputs that consume tokens and increase latency. But is such word spew necessary for effective reasoning?

Human cognition suggests otherwise. When solving complex problems, we typically generate concise notes capturing only essential insights - not elaborate explanations of every step. This cognitive efficiency inspired the Chain of Draft methodology.

Chain of Draft distinguishes itself by encouraging LLMs to generate minimalistic yet informative intermediate outputs. The empirical results are compelling: similar or better accuracy while using as little as 7.6% of the tokens required by traditional methods.

This paper provides several compelling evaluation results:
• 𝗔𝗿𝗶𝘁𝗵𝗺𝗲𝘁𝗶𝗰 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 (𝗚𝗦𝗠𝟴𝗸): 91% accuracy with CoD versus 95% with CoT, while reducing tokens by 80% and latency by 76.2%
• 𝗖𝗼𝗺𝗺𝗼𝗻𝘀𝗲𝗻𝘀𝗲 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴: In several cases, CoD actually outperformed CoT in accuracy while using significantly fewer tokens
• 𝗦𝘆𝗺𝗯𝗼𝗹𝗶𝗰 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴: Perfect 100% accuracy with both methods, but CoD used only 14-32% of the tokens that CoT required

Chain of Draft offers a method to maintain reasoning capabilities while dramatically reducing costs and latency. This advancement enables the deployment of reasoning-heavy LLM applications in cost-sensitive and latency-sensitive scenarios that were previously impractical.

Perhaps most significantly, this research demonstrates that effective reasoning doesn't require verbosity - concise, focused thinking can be equally powerful, if not more so. I'm excited for when this prompting strategy makes its way into reasoning-based models in general, giving us faster, cheaper and smarter models!

Read the paper: https://lnkd.in/e3wA549Z
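For a feel of how the prompting differs in practice, here is a paraphrased pair of prompts in the spirit of the paper; the wording is approximate, not a verbatim quote, and the example output is illustrative.

```python
# Paraphrased instruction prompts in the style of CoT vs CoD (not verbatim from the paper).
COT_PROMPT = (
    "Think step by step to answer the following question. "
    "Return the answer at the end of the response after a separator ####."
)

COD_PROMPT = (
    "Think step by step, but only keep a minimum draft for each thinking step, "
    "with 5 words at most. "
    "Return the answer at the end of the response after a separator ####."
)

question = "Jason had 20 lollipops. He gave Denny some. Now Jason has 12. How many did he give Denny?"

# Under CoD the model might draft something like:
#   20 - x = 12; x = 20 - 12 = 8  #### 8
# versus a multi-sentence walkthrough of the same arithmetic under CoT.
```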
-
Percentages with variance arrows. Revenue shown in millions. Hide zero values.

When I build out views or dashboards in Tableau, I ask how stakeholders want their values, as it influences my approach. You could build calculated fields for each request. Or you could use formatting that's already there. Tableau can handle number formats without touching calculations:

1. Native formatting for basic prefixes and suffixes
Right-click your measure → Format → Numbers. This handles most standard requests: currency displays (£, $, €), percentage symbols, thousands separators, decimal precision. Takes seconds and keeps your workbook clean.

2. Custom formatting for unique requests
Format → Numbers → Custom.
Add symbols to positive, negative and zero values: ▲0.0%;▼0.0%;►0.0% → ▲5.2%, ▼2.1%, ►0.0%
Prevent zeros from appearing: #,##0;-#,##0;"" → 1,234 and -567 display as normal, 0 displays as blank

The native formatting does have its limitations, which is when I resort to calculated fields.

3. Calculated fields for dynamic scaling
That's when you can build a calculation to handle your number range: 750 → 750, 5,400 → 5.4K, 125,000 → 125K, 2,350,000 → 2.35M

If your team's been building calculated fields for every formatting request, this might save you some time.
-
Awesome how Tableau Data Blending helps you combine data without heavy modeling.

Not every dataset lives in the same place. And not every team has time to fully model everything before building dashboards. Tableau Data Blending is designed for those moments. It lets you combine data from different sources directly in Tableau, without needing to join everything at the database level. You can relate datasets on the fly and start analyzing immediately.

Why teams still use it:
- It speeds up analysis when data lives across multiple systems.
- It avoids waiting on full data modeling or engineering work.
- It works well for quick comparisons and exploratory analysis.
- It allows analysts to move forward without changing underlying data structures.

It’s especially useful when working with spreadsheets, external data, or combining warehouse data with local files. It is not meant to replace proper data modeling, but it is a powerful option when speed matters more than perfection.