How to Improve AI Performance With New Techniques

Explore top LinkedIn content from expert professionals.

  • View profile for Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    41,883 followers

    In the last three months alone, over ten papers outlining novel prompting techniques were published, boosting LLMs’ performance by a substantial margin. Two weeks ago, a groundbreaking paper from Microsoft demonstrated how a well-prompted GPT-4 outperforms Google’s Med-PaLM 2, a specialized medical model, solely through sophisticated prompting techniques. Yet, while our X and LinkedIn feeds buzz with ‘secret prompting tips’, a definitive, research-backed guide aggregating these advanced prompting strategies is hard to come by. This gap prevents LLM developers and everyday users from harnessing these novel frameworks to enhance performance and achieve more accurate results. https://lnkd.in/g7_6eP6y

    In this AI Tidbits Deep Dive, I outline six of the best and most recent prompting methods:

    (1) EmotionPrompt - inspired by human psychology, this method uses emotional stimuli in prompts to gain performance enhancements
    (2) Optimization by PROmpting (OPRO) - a DeepMind innovation that refines prompts automatically, surpassing human-crafted ones. This paper discovered the “Take a deep breath” instruction that improved LLMs’ performance by 9%.
    (3) Chain-of-Verification (CoVe) - Meta's novel four-step prompting process that drastically reduces hallucinations and improves factual accuracy
    (4) System 2 Attention (S2A) - also from Meta, a prompting method that filters out irrelevant details prior to querying the LLM
    (5) Step-Back Prompting - encouraging LLMs to abstract queries for enhanced reasoning
    (6) Rephrase and Respond (RaR) - UCLA's method that lets LLMs rephrase queries for better comprehension and response accuracy

    Understanding the spectrum of available prompting strategies and how to apply them in your app can mean the difference between a production-ready app and a nascent project with untapped potential. Full blog post: https://lnkd.in/g7_6eP6y
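
    As an illustration of how one of these methods can be wired up, here is a minimal Python sketch of the Chain-of-Verification (CoVe) loop from (3). It assumes a generic call_llm helper as a hypothetical stand-in for whatever model client you use; it is a sketch of the four-step idea, not the paper's reference implementation.

    # Minimal Chain-of-Verification (CoVe) sketch. `call_llm` is a hypothetical
    # stand-in for your model client (OpenAI, Anthropic, a local model, etc.).

    def call_llm(prompt: str) -> str:
        """Placeholder: route this to your LLM provider of choice."""
        raise NotImplementedError("wire up your own model client here")

    def chain_of_verification(question: str) -> str:
        # Step 1: draft an initial answer.
        baseline = call_llm(f"Answer the question:\n{question}")

        # Step 2: plan verification questions that probe the draft's claims.
        plan = call_llm(
            "List short fact-checking questions that would verify the claims "
            f"in this answer, one per line:\n\nQuestion: {question}\nAnswer: {baseline}"
        )

        # Step 3: answer each verification question independently (without the
        # draft in context, which is what reduces copied-over hallucinations).
        checks = []
        for q in filter(None, (line.strip() for line in plan.splitlines())):
            checks.append(f"Q: {q}\nA: {call_llm(q)}")

        # Step 4: produce a final, revised answer conditioned on the checks.
        return call_llm(
            f"Original question: {question}\nDraft answer: {baseline}\n\n"
            "Verification Q&A:\n" + "\n".join(checks) +
            "\n\nRewrite the answer, correcting anything the checks contradict."
        )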

  • View profile for Greg Coquillo

    AI Infrastructure Product Leader | Scaling GPU Clusters for Frontier Models | Microsoft Azure AI & HPC | Former AWS, Amazon | Startup Investor | Linkedin Top Voice | I build the infrastructure that allows AI to scale

    228,993 followers

    I consider prompting techniques some of the lowest-hanging fruit one can use to achieve step-change improvements in model performance. This isn’t to say that “typing better instructions” is that simple. As a matter of fact, it can be quite complex. Prompting has evolved into a full discipline with frameworks, reasoning methods, multimodal techniques, and role-based structures that dramatically change how models think, plan, analyse, and create. This guide breaks down every major prompting category you need to build powerful, reliable, and structured AI workflows:

    1️⃣ Core Prompting Techniques
    The foundational methods include few-shot, zero-shot, one-shot, and style prompts. They teach the model patterns, tone, and structure.

    2️⃣ Reasoning-Enhancing Techniques
    Approaches like Chain-of-Thought, Graph-of-Thought, ReAct, and Deliberate prompting help LLMs reason more clearly, avoid shortcuts, and solve complex tasks step by step.

    3️⃣ Instruction & Role-Based Prompting
    Define the task clearly or assign the model a “role” such as planner, analyst, engineer, or teacher to get more predictable, domain-focused outputs.

    4️⃣ Prompt Composition Techniques
    Methods like prompt chaining, meta-prompting, dynamic variables, and templates help you build multi-step, modular workflows used in real agent systems.

    5️⃣ Tool-Augmented Prompting
    Combine prompts with vector search, retrieval (RAG), planners, executors, or agent-style instructions to turn LLMs into decision-making systems rather than passive responders.

    6️⃣ Optimization & Safety Techniques
    Guardrails, verification prompts, bias checks, and error-correction prompts improve reliability, factual accuracy, and trustworthiness. These are essential for production systems.

    7️⃣ Creativity-Enhancing Techniques
    Analogy prompts, divergent prompts, story prompts, and spatial diagrams unlock creative reasoning, exploration, and alternative problem-solving paths.

    8️⃣ Multimodal Prompting
    Use images, audio, video, transcripts, diagrams, code, or mixed-media prompts (text + JSON + tables) to build richer and more intelligent multimodal workflows.

    Modern prompting has evolved into designing thinking systems. When you combine reasoning techniques, structured instructions, memory, tools, and multimodal inputs, you unlock a level of performance that avoids costly fine-tuning. What best practices have you used when designing prompts for your LLM? #LLM
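
    To make category 4️⃣ concrete, here is a small Python sketch of prompt chaining combined with role-based instructions. The call_llm function and the role strings are hypothetical placeholders rather than part of any specific framework; treat this as a sketch of the pattern, not a production workflow.

    # Prompt-chaining sketch: each step's output feeds the next step's prompt.
    # `call_llm` is a hypothetical stand-in for your model client.

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your provider's chat/completions call")

    PLANNER = "You are a planner. Break the task into 3-5 concrete steps, one per line."
    ANALYST = "You are an analyst. Execute the step using only the given context."
    REVIEWER = "You are a reviewer. Check the result for gaps and contradictions."

    def run_chain(task: str) -> str:
        # Step 1: a role-based planner prompt produces the step list.
        plan = call_llm(f"{PLANNER}\n\nTask: {task}")
        context = ""
        # Step 2: each step runs as its own prompt, carrying forward prior results.
        for step in filter(None, (s.strip() for s in plan.splitlines())):
            result = call_llm(f"{ANALYST}\n\nStep: {step}\nContext so far:\n{context}")
            context += f"\nStep: {step}\n{result}\n"
        # Step 3: a reviewer role closes the chain with a consistency check.
        return call_llm(f"{REVIEWER}\n\nTask: {task}\nDraft:\n{context}")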

  • View profile for Vignesh Kumar

    AI Product & Engineering | Start-up Mentor & Advisor | TEDx & Keynote Speaker | LinkedIn Top Voice ’24 | Building AI Community Pair.AI | Director - Orange Business, Cisco, VMware | Cloud - SaaS & IaaS | kumarvignesh.com

    21,032 followers

    🚀 Ever wondered why your AI model behaves like it’s lighting up a 20-story office building just to light your desk?

    Every time you ask a large language model (LLM) a simple question, it activates billions of neurons, even if only a handful are needed. That’s like turning on every room in your house just to make coffee. Wasteful, right?

    This is where Microsoft’s latest innovation comes in: WINA (Weight-Informed Neuron Activation).

    Let me simplify it for you: WINA teaches AI models to think a bit more like humans. Use only the "brain cells" that matter, and let the rest nap. 🧠💡

    And if you’re wondering how it differs from other concepts, here is a quick comparison:

    🧩 It’s different from Mixture of Experts (MoE): MoE is like hiring a bunch of specialists, grammar geeks, science buffs, and picking the right one each time. Great, but you need to retrain the whole model to do that. Not cheap. Not fast.

    ⚙️ It’s better than earlier training-free methods like TEAL or CATS: those methods shut off neurons based only on how “loud” they are. But some quiet ones punch above their weight! Silencing them blindly kills performance.

    🚀 WINA solves these issues: it checks both how loud a neuron is AND how big a megaphone it's holding. In layman's terms, it multiplies neuron activity by the strength of the weights it connects to, then keeps only the most impactful ones. Simple idea, huge results.

    This simple technique can have a huge impact on customers adopting AI:

    💸 Efficiency without trade-offs: WINA can switch off up to 65% of neurons and still outperform previous methods like TEAL by 2-3 percentage points in accuracy. That’s like beating your last marathon time by 5 minutes.

    ⚡ 60%+ reduction in compute costs: across models like Qwen 2.5, LLaMA 2/3, and Phi-4, WINA slashed FLOPs by over 60%, meaning faster responses and lower GPU bills.

    🛠️ No retraining needed: it’s plug-and-play. Bolt it onto an existing model, tune your sparsity, and go. Great for startups or teams running models in production with no appetite for massive retraining cycles.

    WINA isn't just a tech upgrade, it’s a mindset shift. Instead of making models bigger, let’s make them smarter and leaner. Use just what you need, when you need it.

    I write about #artificialintelligence | #technology | #startups | #mentoring | #leadership | #financialindependence

    PS: All views are personal. Vignesh Kumar
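
    For readers who want to see the core idea in code, below is a toy PyTorch sketch of the selection rule described above: score each neuron by its activation magnitude times the norm of its outgoing weights, keep the top fraction, and zero out the rest. This is my reading of the post, not Microsoft's implementation, and the tensor shapes are purely illustrative.

    # Toy sketch of the WINA idea: |activation| * ||outgoing weight column||,
    # keep only the most impactful neurons before the next matrix multiply.
    import torch

    def wina_mask(hidden: torch.Tensor, w_out: torch.Tensor, keep: float = 0.35) -> torch.Tensor:
        """hidden: (batch, d_hidden); w_out: (d_hidden, d_out); keep: fraction of neurons kept."""
        # "How loud the neuron is" ...
        magnitude = hidden.abs()
        # ... times "how big a megaphone it holds" (norm of its outgoing weights).
        importance = magnitude * w_out.norm(dim=1)            # (batch, d_hidden)

        k = max(1, int(keep * hidden.shape[-1]))
        topk = importance.topk(k, dim=-1).indices
        mask = torch.zeros_like(hidden).scatter_(-1, topk, 1.0)
        return hidden * mask                                   # sparse activations

    # Usage: with keep=0.35, roughly 65% of neurons are silenced for this layer.
    h = torch.randn(2, 1024)
    w = torch.randn(1024, 4096)
    out = wina_mask(h, w) @ w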

  • View profile for Aparna Dhinakaran

    Founder - CPO @ Arize AI ✨ we're hiring ✨

    35,312 followers

    Prompt optimization is becoming foundational for anyone building reliable AI agents. Hardcoding prompts and hoping for the best doesn’t scale. To get consistent outputs from LLMs, prompts need to be tested, evaluated, and improved, just like any other component of your system.

    This visual breakdown covers four practical techniques to help you do just that:

    🔹 Few-Shot Prompting
    Labeled examples embedded directly in the prompt help models generalize, especially for edge cases. It's a fast way to guide outputs without fine-tuning.

    🔹 Meta Prompting
    Prompt the model to improve or rewrite prompts. This self-reflective approach often leads to more robust instructions, especially in chained or agent-based setups.

    🔹 Gradient Prompt Optimization
    Embed prompt variants, calculate loss against expected responses, and backpropagate to refine the prompt. A data-driven way to optimize performance at scale.

    🔹 Prompt Optimization Libraries
    Tools like DSPy, AutoPrompt, PEFT, and PromptWizard automate parts of the loop, from bootstrapping to eval-based refinement.

    Prompts should evolve alongside your agents. These techniques help you build feedback loops that scale, adapt, and close the gap between intention and output.
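
    Here is a minimal Python sketch of the meta-prompting loop mentioned above: the model rewrites a prompt and each candidate is scored against a tiny labeled eval set. The call_llm helper and the example eval set are hypothetical placeholders; a real setup would use a larger eval set and a proper metric.

    # Meta-prompting sketch: ask the model to critique and rewrite a prompt,
    # then keep whichever candidate scores best on a small eval set.

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("wire this to your LLM provider")

    EVAL_SET = [  # toy examples: (user input, substring expected in the reply)
        ("Order #123 arrived broken", "apologize"),
        ("Where is my refund?", "refund"),
    ]

    def score(prompt: str) -> float:
        hits = 0
        for user_msg, expected in EVAL_SET:
            reply = call_llm(f"{prompt}\n\nCustomer: {user_msg}")
            hits += expected.lower() in reply.lower()
        return hits / len(EVAL_SET)

    def improve_prompt(prompt: str, rounds: int = 3) -> str:
        best, best_score = prompt, score(prompt)
        for _ in range(rounds):
            candidate = call_llm(
                "Rewrite the instruction below to be clearer and more robust. "
                f"Return only the rewritten instruction.\n\n{best}"
            )
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
        return best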

  • View profile for Umair Ahmad

    Senior Data & Technology Leader | Omni-Retail Commerce Architect | Digital Transformation & Growth Strategist | Leading High-Performance Teams, Driving Impact

    11,161 followers

    𝗠𝗮𝗸𝗶𝗻𝗴 𝗔𝗜 𝘀𝗺𝗮𝗿𝘁𝗲𝗿, 𝗳𝗮𝘀𝘁𝗲𝗿, 𝗮𝗻𝗱 𝗺𝗼𝗿𝗲 𝗿𝗲𝗹𝗶𝗮𝗯𝗹𝗲, 𝗼𝗻𝗲 𝗰𝗼𝗻𝘁𝗲𝘅𝘁 𝗮𝘁 𝗮 𝘁𝗶𝗺𝗲

    As artificial intelligence systems mature, prompt engineering alone is no longer sufficient. The next stage of advancement is context engineering, which focuses on carefully designing everything an AI model sees before it responds. By controlling the information, structure, and memory available to the model, we enable higher accuracy, deeper reasoning, and more predictable performance.

    In prompt engineering, you provide a single instruction and rely on the model’s internal capabilities. In context engineering, you orchestrate multiple sources of information, selectively retrieve relevant knowledge, and structure the context so the model performs with greater precision. The result is a system that produces smarter, faster, and more consistent outcomes for complex tasks.

    𝗞𝗲𝘆 𝗣𝗿𝗶𝗻𝗰𝗶𝗽𝗹𝗲𝘀 𝗼𝗳 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴

    𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 𝗮𝗻𝗱 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻
    AI systems perform best when provided with the most relevant information at the right time. Retrieval-augmented generation techniques allow dynamic integration of documents, structured data, APIs, and real-time facts. By assembling only what is required, we reduce noise and improve reliability.

    𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗣𝗿𝗼𝗰𝗲𝘀𝘀𝗶𝗻𝗴
    Context engineering enables long-sequence reasoning by handling thousands of tokens efficiently. It also supports structured integration, where models combine tables, knowledge graphs, and stored facts with unstructured inputs. This approach allows the AI to reason more effectively and deliver responses grounded in verified information.

    𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗠𝗮𝗻𝗮𝗴𝗲𝗺𝗲𝗻𝘁
    Advanced systems must balance short-term and long-term memory. Context compression techniques maintain meaning while reducing size, enabling efficient responses without losing depth. Constraint management ensures token limits are respected while optimizing the information provided to the model.

    𝗧𝗲𝗰𝗵𝗻𝗶𝗾𝘂𝗲𝘀 𝗮𝗻𝗱 𝗧𝗼𝗼𝗹𝘀
    Developers can leverage tools and frameworks that enable dynamic context assembly and scalable retrieval, including LangChain, LlamaIndex, Pinecone, and Weaviate. Vector databases manage embeddings and memory hierarchies. Hybrid retrieval strategies combine structured APIs with multi-document grounding. Testing and refinement cycles measure accuracy, optimize relevance, and improve system intelligence over time.

    𝗪𝗵𝘆 𝗜𝘁 𝗠𝗮𝘁𝘁𝗲𝗿𝘀
    Context engineering transforms AI from reactive to adaptive intelligence. Instead of issuing instructions and hoping for accuracy, we design the environment in which the model operates so that correct responses become the natural outcome. This approach powers enterprise-ready AI systems that are more predictable, more reliable, and capable of scaling across complex domains.

    Follow Umair Ahmad for more insights #AI #ContextEngineering #SystemDesign #MachineLearning
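
    A minimal Python sketch of the retrieval-and-assembly step described above: rank candidate chunks by similarity to the query and pack them into a token budget before calling the model. The embed and call_llm helpers are hypothetical placeholders for your embedding model and LLM client, and the word-count token estimate is deliberately crude.

    # Context-assembly sketch: retrieve, rank, and pack chunks under a budget.
    import numpy as np

    def embed(text: str) -> np.ndarray:
        raise NotImplementedError("plug in your embedding model here")

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM client here")

    def assemble_context(query: str, chunks: list[str], budget_tokens: int = 2000) -> str:
        q = embed(query)
        q = q / np.linalg.norm(q)
        scored = []
        for chunk in chunks:
            v = embed(chunk)
            scored.append((float(np.dot(v / np.linalg.norm(v), q)), chunk))
        picked, used = [], 0
        for _, chunk in sorted(scored, reverse=True):          # most similar first
            cost = len(chunk.split())   # crude token estimate; swap in a real tokenizer
            if used + cost > budget_tokens:
                continue
            picked.append(chunk)
            used += cost
        return "\n---\n".join(picked)

    def answer(query: str, chunks: list[str]) -> str:
        context = assemble_context(query, chunks)
        return call_llm(
            "Answer using only the context below; say 'not found' if it is missing.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}"
        )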

  • 🚀 Excited to share our latest research on accelerated generation techniques for large language models (LLMs)! 🧠✨ 🔗 https://lnkd.in/gRPd2MaV

    In our comprehensive survey, we delve into 30+ techniques to speed up text generation, making real-time applications more efficient. Accelerated generation techniques aim to reduce the time and computational resources needed for LLMs to generate text, ensuring faster and more responsive AI systems. Here's a sneak peek:

    - Speculative Decoding: drafts multiple candidate tokens in advance and verifies them in parallel to reduce latency. For example, SpecDec achieves up to a 5x speedup in generation.
    - Early Exiting Mechanisms: terminate the generation process upon confident predictions, saving computational resources. CALM dynamically allocates resources per input, cutting down processing time.
    - Non-Autoregressive Methods: generate tokens in parallel rather than one at a time for faster, coherent output. FlowSeq leverages latent variables to model dependencies while maintaining efficiency.

    This paper, created in collaboration with researchers from the Massachusetts Institute of Technology and Columbia University, is crucial for advancing LLM efficiency and enhancing their real-world applications. Dive into the full details and explore the cutting-edge techniques driving the future of AI!

    ✍🏻 Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, and Aman Chadha
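
    To give a feel for the first technique, here is a toy Python sketch of a greedy speculative-decoding loop: a small draft model proposes a few tokens and the large target model keeps the prefix it agrees with. The draft_next and target_next functions are hypothetical placeholders; real implementations verify all drafted positions in a single forward pass and use probability-ratio acceptance rather than exact greedy matching.

    # Toy greedy speculative-decoding loop over token ids.

    def draft_next(tokens: list[int]) -> int:
        raise NotImplementedError("small/fast model's next-token prediction")

    def target_next(tokens: list[int]) -> int:
        raise NotImplementedError("large/accurate model's next-token prediction")

    def speculative_generate(prompt_tokens: list[int], max_new: int = 64, k: int = 4) -> list[int]:
        out = list(prompt_tokens)
        while len(out) - len(prompt_tokens) < max_new:
            # Draft k candidate tokens cheaply with the small model.
            drafted = []
            for _ in range(k):
                drafted.append(draft_next(out + drafted))
            # Verify: accept drafted tokens while the target model agrees.
            accepted = 0
            for i in range(k):
                if target_next(out + drafted[:i]) == drafted[i]:
                    accepted += 1
                else:
                    break
            out += drafted[:accepted]
            # On disagreement (or full acceptance), emit one token from the target.
            out.append(target_next(out))
        return out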

  • View profile for Ken Huang

    AI Book Author |Speaker |DistributedApps.AI |Humanoid Robot | Physical AI| EC-Council GenAI Security Instructor | CSA Fellow | CSA AI Safety WGs Co-Chair

    26,517 followers

    🚀 New Blog Alert: Exploring Test-Time Compute (TTC) for LLMs 🧠

    Everyone's talking about test-time compute (TTC) as a transformative way to improve LLM performance. But what is TTC really about, and why does it matter now?

    In this blog post, I highlight some key aspects of TTC, including strategies like adaptive distribution updates, self-verification, and Monte Carlo Tree Search (MCTS). These advanced techniques enable LLMs to refine their outputs dynamically at inference time, unlocking better quality and efficiency.

    🔍 Why TTC Matters Now:
    - Performance: on the challenging MATH dataset, TTC improved test accuracy by up to 21.6% without retraining (Snell et al., 2024).
    - Efficiency: compute-optimal strategies have demonstrated over 4x efficiency gains compared to traditional methods.
    - Scalability alternative: smaller models enhanced with TTC outperformed larger models lacking it, showing that size isn't everything.
    - Future AI paradigm: TTC challenges the "bigger is better" model of AI development, pointing toward more adaptable and resource-efficient systems.

    The blog also includes a Python example integrating LLaMA-3 with MCTS for reasoning tasks, perfect for those eager to experiment with TTC in their projects.

    #AI #MachineLearning #LargeLanguageModels #TTC #TechInnovation #AIEngineering
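
    As a small taste of test-time compute, here is a Python sketch of best-of-N sampling with a self-verification scoring pass, one of the simpler TTC strategies. The call_llm helper is a hypothetical placeholder; the MCTS example referenced in the blog is considerably more involved than this.

    # Best-of-N with self-verification: spend extra inference on N candidates
    # and a scoring pass, then return the highest-scoring answer.

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("wire this to your model")

    def best_of_n(question: str, n: int = 8) -> str:
        candidates = [call_llm(f"Solve step by step:\n{question}") for _ in range(n)]
        best, best_score = candidates[0], -1.0
        for answer in candidates:
            verdict = call_llm(
                "Rate how likely this solution is correct on a 0-10 scale. "
                f"Reply with just the number.\n\nProblem: {question}\nSolution: {answer}"
            )
            try:
                score = float(verdict.strip().split()[0])
            except (ValueError, IndexError):
                score = 0.0   # unparseable verdicts score lowest
            if score > best_score:
                best, best_score = answer, score
        return best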

  • View profile for Brij kishore Pandey

    AI Architect & Engineer | AI Strategist

    720,785 followers

    In the world of Generative AI, 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹-𝗔𝘂𝗴𝗺𝗲𝗻𝘁𝗲𝗱 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻 (𝗥𝗔𝗚) is a game-changer. By combining the capabilities of LLMs with domain-specific knowledge retrieval, RAG enables smarter, more relevant AI-driven solutions. But to truly leverage its potential, we must follow some essential 𝗯𝗲𝘀𝘁 𝗽𝗿𝗮𝗰𝘁𝗶𝗰𝗲𝘀:

    1️⃣ 𝗦𝘁𝗮𝗿𝘁 𝘄𝗶𝘁𝗵 𝗮 𝗖𝗹𝗲𝗮𝗿 𝗨𝘀𝗲 𝗖𝗮𝘀𝗲
    Define your problem statement. Whether it’s building intelligent chatbots, document summarization, or customer support systems, clarity on the goal ensures efficient implementation.

    2️⃣ 𝗖𝗵𝗼𝗼𝘀𝗲 𝘁𝗵𝗲 𝗥𝗶𝗴𝗵𝘁 𝗞𝗻𝗼𝘄𝗹𝗲𝗱𝗴𝗲 𝗕𝗮𝘀𝗲
    - Ensure your knowledge base is 𝗵𝗶𝗴𝗵-𝗾𝘂𝗮𝗹𝗶𝘁𝘆, 𝘀𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝗱, 𝗮𝗻𝗱 𝘂𝗽-𝘁𝗼-𝗱𝗮𝘁𝗲.
    - Use vector embeddings (e.g., pgvector in PostgreSQL) to represent your data for efficient similarity search.

    3️⃣ 𝗢𝗽𝘁𝗶𝗺𝗶𝘇𝗲 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹 𝗠𝗲𝗰𝗵𝗮𝗻𝗶𝘀𝗺𝘀
    - Use hybrid search techniques (semantic + keyword search) for better precision.
    - Tools like 𝗽𝗴𝗔𝗜, 𝗪𝗲𝗮𝘃𝗶𝗮𝘁𝗲, or 𝗣𝗶𝗻𝗲𝗰𝗼𝗻𝗲 can enhance retrieval speed and accuracy.

    4️⃣ 𝗙𝗶𝗻𝗲-𝗧𝘂𝗻𝗲 𝗬𝗼𝘂𝗿 𝗟𝗟𝗠 (𝗢𝗽𝘁𝗶𝗼𝗻𝗮𝗹)
    - If your use case demands it, fine-tune the LLM on your domain-specific data for improved contextual understanding.

    5️⃣ 𝗘𝗻𝘀𝘂𝗿𝗲 𝗦𝗰𝗮𝗹𝗮𝗯𝗶𝗹𝗶𝘁𝘆
    - Architect your solution to scale. Use caching, indexing, and distributed architectures to handle growing data and user demands.

    6️⃣ 𝗠𝗼𝗻𝗶𝘁𝗼𝗿 𝗮𝗻𝗱 𝗜𝘁𝗲𝗿𝗮𝘁𝗲
    - Continuously monitor performance using metrics like retrieval accuracy, response time, and user satisfaction.
    - Incorporate feedback loops to refine your knowledge base and model performance.

    7️⃣ 𝗦𝘁𝗮𝘆 𝗦𝗲𝗰𝘂𝗿𝗲 𝗮𝗻𝗱 𝗖𝗼𝗺𝗽𝗹𝗶𝗮𝗻𝘁
    - Handle sensitive data responsibly with encryption and access controls.
    - Ensure compliance with industry standards (e.g., GDPR, HIPAA).

    With the right practices, you can unlock RAG's full potential to build powerful, domain-specific AI applications. What are your top tips or challenges?
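
    To illustrate point 3️⃣, here is a small Python sketch of hybrid retrieval that fuses a semantic ranking with a keyword ranking using reciprocal rank fusion. In production this would sit on top of pgvector, Weaviate, or Pinecone; the embed helper, the in-memory document list, and the simple keyword overlap score are hypothetical placeholders.

    # Hybrid retrieval sketch: semantic similarity + keyword overlap, fused.
    import numpy as np

    def embed(text: str) -> np.ndarray:
        raise NotImplementedError("use your embedding model")

    def keyword_score(query: str, doc: str) -> float:
        q, d = set(query.lower().split()), set(doc.lower().split())
        return len(q & d) / max(len(q), 1)

    def hybrid_search(query: str, docs: list[str], top_k: int = 5) -> list[str]:
        qv = embed(query)
        qv = qv / np.linalg.norm(qv)
        sem = []
        for doc in docs:
            dv = embed(doc)
            dv = dv / np.linalg.norm(dv)
            sem.append(float(np.dot(qv, dv)))          # cosine similarity
        kw = [keyword_score(query, doc) for doc in docs]

        # Reciprocal rank fusion: combine the two rankings without tuning weights.
        sem_rank = {i: r for r, i in enumerate(np.argsort(sem)[::-1])}
        kw_rank = {i: r for r, i in enumerate(np.argsort(kw)[::-1])}
        fused = {i: 1 / (60 + sem_rank[i]) + 1 / (60 + kw_rank[i]) for i in range(len(docs))}
        order = sorted(fused, key=fused.get, reverse=True)
        return [docs[i] for i in order[:top_k]]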

  • View profile for Sharada Yeluri

    Engineering Leader

    21,532 followers

    A lot has changed since my #LLM inference article last January; it’s hard to believe a year has passed! The AI industry has pivoted from focusing solely on scaling model sizes to enhancing reasoning abilities during inference. This shift is driven by the recognition that simply increasing model parameters yields diminishing returns and that improving inference capabilities can lead to more efficient and intelligent AI systems.

    OpenAI's o1 and Google's Gemini 2.0 are examples of models that employ #InferenceTimeCompute. Some techniques include best-of-N sampling, which generates multiple outputs and selects the best one; iterative refinement, which allows the model to improve its initial answers; and speculative decoding. Self-verification lets the model check its own output, while adaptive inference-time computation dynamically allocates extra #GPU resources for challenging prompts. These methods represent a significant step toward more reasoning-driven inference.

    Another exciting trend is #AgenticWorkflows, where an AI agent (a software program running on an inference server) breaks the queried task into multiple small tasks without requiring complex user prompts (prompt engineering may see end of life this year!). It then autonomously plans, executes, and monitors these tasks. In this process, it may run inference multiple times on the model while maintaining context across the runs.

    #TestTimeTraining takes things further by adapting models on the fly. This technique fine-tunes the model for new inputs, enhancing its performance.

    These advancements can complement each other. For example, an AI system may use an agentic workflow to break down a task, apply inference-time compute to generate high-quality outputs at each step, and employ test-time training to learn from unexpected challenges. The result? Systems that are faster, smarter, and more adaptable.

    What does this mean for inference hardware and networking gear? Previously, most open-source models barely needed one GPU server, and inference was often done in front-end networks or by reusing the training networks. However, as the computational complexity of inference increases, more focus will be on building scale-up systems with hundreds of tightly interconnected GPUs or accelerators for inference flows. While Nvidia GPUs continue to dominate, other accelerators, especially from hyperscalers, will likely gain traction.

    Networking remains a critical piece of the puzzle. Can #Ethernet, with enhancements like compressed headers, link retries, and reduced latencies, rise to meet the demands of these scale-up systems? Or will we see a fragmented ecosystem of switches for non-Nvidia scale-up systems? My bet is on Ethernet. Its ubiquity makes it a strong contender for the job...

    Reflecting on the past year, it’s clear that AI progress isn’t just about making things bigger but about making them smarter. The future looks more exciting as we rethink models, hardware, and networking. Here’s to what 2025 will bring!
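
    For a concrete feel of one of the inference-time techniques above, here is a short Python sketch of iterative refinement: the model critiques its own draft and revises it for a fixed number of rounds. The call_llm helper is a hypothetical placeholder for any model client.

    # Iterative-refinement sketch: critique, then revise, for a few rounds.

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("wire this to your model")

    def refine(question: str, rounds: int = 2) -> str:
        answer = call_llm(f"Answer carefully:\n{question}")
        for _ in range(rounds):
            critique = call_llm(
                f"Question: {question}\nAnswer: {answer}\n\n"
                "List any mistakes or missing steps in the answer. Say 'none' if it is correct."
            )
            if critique.strip().lower().startswith("none"):
                break   # the model is satisfied with its own answer
            answer = call_llm(
                f"Question: {question}\nPrevious answer: {answer}\nCritique: {critique}\n\n"
                "Write an improved answer that fixes the critique."
            )
        return answer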

  • View profile for Sarthak Rastogi

    AI engineer | Posts on agents + advanced RAG | Experienced in LLM research, ML engineering, Software Engineering

    25,243 followers

    Meta used 5 interesting techniques, which AI engineers should know, to make Llama 3.1 outperform GPT-4o. Here’s how they work:

    1. Grouped Query Attention (GQA)
    GQA reduces the number of key-value heads while maintaining the number of query heads. This decreases the size of KV caches, which ultimately optimises memory usage and speeds up inference.

    2. Rotary Positional Embedding (RoPE)
    RoPE improves the model's handling of longer context windows by enabling better positional encoding, extending context lengths up to 32,768 tokens.

    3. Advanced Tokenisation
    Llama 3 uses a 128K-token vocabulary that combines standard tokens with additional tokens to better support non-English languages. This helps improve token compression rates and overall performance.

    4. Annealing Data Techniques
    This involves mixing in high-quality data from select domains during training, which helps the LLM learn domain-specific knowledge efficiently.

    5. Model-based Quality Filtering
    They used classifiers like fastText and DistilRoBERTa to filter and select high-quality tokens during pre-training. This ensured the model was trained on superior data, and it shows in the downstream performance.

    Paper: https://lnkd.in/gJ6PpXGt

    #AI #LLMs #LLama3
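
    Here is a toy PyTorch sketch of grouped query attention (technique 1): many query heads share a smaller set of key/value heads, which is what shrinks the KV cache. The dimensions are illustrative and do not reflect Llama 3.1's actual configuration.

    # Grouped-query attention sketch: 8 query heads share 2 key/value heads.
    import torch
    import torch.nn.functional as F

    def grouped_query_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=2):
        B, T, D = x.shape
        head_dim = D // n_q_heads
        group = n_q_heads // n_kv_heads                 # query heads per KV head

        q = (x @ wq).view(B, T, n_q_heads, head_dim).transpose(1, 2)    # (B, Hq, T, d)
        k = (x @ wk).view(B, T, n_kv_heads, head_dim).transpose(1, 2)   # (B, Hkv, T, d)
        v = (x @ wv).view(B, T, n_kv_heads, head_dim).transpose(1, 2)

        # Repeat each KV head so it serves its whole group of query heads.
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)

        att = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
        att = F.softmax(att, dim=-1)
        return (att @ v).transpose(1, 2).reshape(B, T, D)

    # Usage: the K and V projections are 4x smaller than the Q projection here,
    # which is what reduces the KV cache size at inference time.
    D, Hq, Hkv = 512, 8, 2
    x = torch.randn(1, 16, D)
    wq = torch.randn(D, D)
    wk = torch.randn(D, D // (Hq // Hkv))
    wv = torch.randn(D, D // (Hq // Hkv))
    y = grouped_query_attention(x, wq, wk, wv, n_q_heads=Hq, n_kv_heads=Hkv)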
