LLM System Optimization

Explore top LinkedIn content from expert professionals.

  • Zain Hasan

    I build and teach AI | AI/ML @ Together AI | EngSci ℕΨ/PhD @ UofT | Previously: Vector DBs, Data Scientist, Lecturer & Health Tech Founder | 🇺🇸🇨🇦🇵🇰

    You don't need a 2 trillion parameter model to tell you the capital of France is Paris. Be smart and route between a panel of models according to query difficulty and model specialty!

    A new paper proposes a framework to train a router that sends each query to the appropriate LLM to optimize the trade-off between cost and performance.

    Overview: Model inference cost varies significantly. Per one million output tokens: Llama-3-70b ($1) vs. GPT-4-0613 ($60), Haiku ($1.25) vs. Opus ($75).

    The RouteLLM paper proposes a router training framework based on human preference data and augmentation techniques, demonstrating over 2x cost savings on widely used benchmarks. They define the problem as having to choose between two classes of models:
    (1) strong models - produce high-quality responses but at a high cost (GPT-4o, Claude 3.5)
    (2) weak models - relatively lower quality and lower cost (Mixtral 8x7B, Llama3-8b)

    A good router requires a deep understanding of the question's complexity as well as the strengths and weaknesses of the available LLMs. The paper explores different routing approaches:
    - Similarity-weighted (SW) ranking
    - Matrix factorization
    - BERT query classifier
    - Causal LLM query classifier

    Neat ideas to build from:
    - Users can collect a small amount of in-domain data to improve performance for their specific use cases via dataset augmentation.
    - Expand the problem from routing between a strong and a weak LLM to a multiclass routing approach with specialist models (language-vision model, function-calling model, etc.).
    - Larger framework controlled by a router - imagine a system of 15-20 tuned small models and the router as the (n+1)'th model responsible for picking the LLM that will handle a particular query at inference time.
    - MoA architectures: routing to different architectures of a Mixture of Agents would be a cool idea as well. Depending on the query, you decide how many proposers there should be, how many layers in the mixture, what the aggregator models should be, etc.
    - Route-based caching: if you get redundant queries that are slightly different, route the query plus the previous answer to a small model for light rewriting instead of regenerating the answer.
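
    A minimal sketch of the binary routing idea above: a classifier trained on preference data predicts whether the weak model will suffice, and a tunable threshold trades cost against quality. The embedding model, toy training data, and threshold are illustrative assumptions, not the paper's implementation.

    ```python
    # Sketch of a strong/weak LLM router: predict P(weak model suffices) from the
    # query and route on a cost threshold. Illustrative only; RouteLLM's routers
    # (SW ranking, matrix factorization, BERT/causal-LM classifiers) differ.
    from dataclasses import dataclass
    import numpy as np
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    @dataclass
    class Router:
        embedder: SentenceTransformer
        clf: LogisticRegression
        threshold: float = 0.5  # raise to send more traffic to the cheap model

        def route(self, query: str) -> str:
            x = self.embedder.encode([query])
            p_weak_ok = self.clf.predict_proba(x)[0, 1]  # P(weak answer preferred/tied)
            return "weak" if p_weak_ok >= self.threshold else "strong"

    # Train on preference data: label 1 if the weak model's answer was preferred or tied.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    queries = ["What is the capital of France?", "Prove that sqrt(2) is irrational."]
    labels = [1, 0]
    clf = LogisticRegression().fit(embedder.encode(queries), labels)

    router = Router(embedder, clf)
    print(router.route("Summarize this paragraph in one sentence."))
    ```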

  • Aymeric Roucher

    Building Agents, formerly at Hugging Face | Polytechnique - Cambridge

    𝗥𝗼𝘂𝘁𝗲𝗟𝗟𝗠: 𝗥𝗼𝘂𝘁𝗲 𝘆𝗼𝘂𝗿 𝗾𝘂𝗲𝗿𝘆 𝘁𝗼 𝗮 𝘀𝗺𝗮𝗹𝗹𝗲𝗿 𝗟𝗟𝗠 𝘄𝗵𝗲𝗻 𝗽𝗼𝘀𝘀𝗶𝗯𝗹𝗲 ⇒ 𝗰𝘂𝘁 𝟱𝟬% 𝗼𝗳 𝗰𝗼𝘀𝘁 ✂️

    The LMSys team maintains the Chatbot Arena, a great evaluation system based on thousands of matches: when a user submits a query, they receive the answers from two hidden models A and B and vote between the two. This preference data lets the team build an Elo ranking, which is a great indicator of model strength.

    The team has found another great use for this preference data: train a router that sends each user query to the most appropriate model. The main idea is that 𝙢𝙖𝙣𝙮 𝙦𝙪𝙚𝙧𝙞𝙚𝙨 𝙙𝙤 𝙣𝙤𝙩 𝙧𝙚𝙦𝙪𝙞𝙧𝙚 𝙖 𝙨𝙩𝙧𝙤𝙣𝙜 𝙢𝙤𝙙𝙚𝙡: for instance, “summarize this paragraph in 1 sentence” can be solved very well by a small model like Llama-3-8B, which is orders of magnitude cheaper to run than the usual behemoths. If you manage to selectively route all easy queries to the smaller LLM, you can save a lot on costs with minimal performance reduction (a few queries will be poorly answered due to mis-routing).

    So the team set out to train a router that, given a query, chooses the most appropriate LLM to answer it, between a strong/expensive one and a weak/cheap one.

    🛠️ Create a router between GPT-4 (strong model) and Mixtral-8x7B (weak model)
    🔢 Use preference data from 80k labels → Augment this with gold preference data for specific benchmarks → Define custom metrics to measure the performance gain from routing → Test on MT-Bench, GSM8k, and MMLU
    💥 Achieve 95% of GPT-4 quality on MT-Bench for over 2x cost reduction
    ✨ Overhead costs are minimal: even the most expensive routing method introduces an overhead under 0.4% of the GPT-4 generation cost
    🧂 Grain of salt: MT-Bench is really the benchmark where this method performs best, and introducing “gold data” from the benchmark probably biased results upwards. So the “95% perf for 2x cost reduction” will not be as impressive in a real setting.

    𝙍𝙚𝙖𝙙 𝙩𝙝𝙚 𝙥𝙖𝙥𝙚𝙧 𝙝𝙚𝙧𝙚 👉 https://lnkd.in/e4n2baVr
    𝘾𝙤𝙙𝙚 𝙧𝙚𝙥𝙤 𝙞𝙨 𝙝𝙚𝙧𝙚 (already 1.7k stars) 👉 https://lnkd.in/e9zVVTaP
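
    A rough back-of-the-envelope for where the savings come from: blend per-query cost and quality as a function of the share of traffic routed to the cheap model. The prices and quality numbers below are illustrative assumptions, and the linear quality blend assumes random routing; a trained router loses less quality because it sends only the easy queries to the weak model.

    ```python
    # Blended cost and quality as a function of the fraction routed to the weak model.
    # All numbers are illustrative assumptions, not figures from the paper.
    def blended(route_to_weak: float,
                cost_strong: float = 60.0,    # $ / 1M output tokens (e.g. GPT-4-0613)
                cost_weak: float = 0.6,       # $ / 1M output tokens (e.g. Mixtral-8x7B)
                quality_strong: float = 1.0,  # normalized benchmark score
                quality_weak: float = 0.85):
        cost = route_to_weak * cost_weak + (1 - route_to_weak) * cost_strong
        quality = route_to_weak * quality_weak + (1 - route_to_weak) * quality_strong
        return cost, quality

    for frac in (0.0, 0.5, 0.8):
        cost, quality = blended(frac)
        print(f"route {frac:.0%} to weak -> ${cost:.2f}/1M tokens, "
              f"{quality:.0%} of strong-model quality")
    ```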

  • Andrew Reed

    Applied AI @ LangChain

    Can we save inference cost by routing easier questions to cheaper LLMs? 🤔

    📝 New research from Carnegie Mellon University, Google DeepMind, the Indian Institute of Technology Delhi, and the University of Southern California proposes AutoMix - an approach that strategically routes queries to larger LLMs based on the approximate correctness of outputs from a smaller LLM.

    🔎 How it works:
    1️⃣ First, generate an answer with a small, efficient model.
    2️⃣ Then automatically verify the answer using few-shot learning.
    3️⃣ Use a meta-verifier to evaluate the reliability of that verification.
    4️⃣ Only if uncertain, invoke a larger, more accurate (but costly) model.

    💡 Key benefits:
    • Enhances the incremental benefit per cost by up to 89%
    • Builds on and outperforms prior methods like FrugalGPT
    • Works with open-source and black-box LLMs alike

    I'm excited by the potential for novel methods like this to make LLM solutions more cost-effective, and therefore usable at scale 💰

    🧵 Read the paper for all the details! 👉 https://lnkd.in/eW7Nf2rf
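
    A minimal sketch of that cascade: answer with the small model, have it verify its own answer, and escalate to the large model only when the verification is unsure. The call_llm helper, prompts, model names, and threshold are hypothetical placeholders; the paper's few-shot verifier and meta-verifier are more involved.

    ```python
    # AutoMix-style cascade sketch: small-model answer -> self-verification ->
    # escalate to the large model if confidence is low. Placeholders only.
    def call_llm(model: str, prompt: str) -> str:
        """Placeholder for your LLM client (OpenAI, Together, vLLM, ...)."""
        raise NotImplementedError

    def self_verify(model: str, question: str, answer: str, n: int = 5) -> float:
        """Ask the small model to check its own answer n times; return the YES rate."""
        prompt = (f"Question: {question}\nProposed answer: {answer}\n"
                  "Is the proposed answer correct? Reply YES or NO.")
        votes = [call_llm(model, prompt).strip().upper().startswith("YES")
                 for _ in range(n)]
        return sum(votes) / n

    def answer_with_cascade(question: str, threshold: float = 0.8) -> str:
        draft = call_llm("small-model", question)
        confidence = self_verify("small-model", question, draft)
        if confidence >= threshold:
            return draft                           # cheap path: keep the small model's answer
        return call_llm("large-model", question)   # uncertain: pay for the strong model
    ```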

  • Vince Lynch

    12+ year AI veteran | CEO of IV.AI | We’re hiring

    I’m jealous of AI. Because with a model you can measure confidence. Imagine if you could do that as a human - measure how close or far off you are? Here's how to measure it, for technical and non-technical teams.

    For business teams:

    Run a ‘known answers’ test. Give the model questions or tasks where you already know the answer. Think of it like a QA test for logic. If it can't pass here, it's not ready to run wild in your stack.

    Ask for confidence directly. Prompt it: “How sure are you about that answer on a scale of 1-10?” Then: “Why might this be wrong?” You'll surface uncertainty the model won't reveal unless asked.

    Check consistency. Phrase the same request five different ways. Is it giving stable answers? If not, revisit the product strategy for the LLM.

    Force reasoning. Use prompts like “Show step-by-step how you got this result.” This lets you audit the logic, not just the output. Great for strategy, legal, and product decisions.

    For technical teams:

    Use the softmax output to get predicted probabilities. Example: the model says “fraud” with 92% probability.

    Use entropy to spot uncertainty. High entropy = low confidence. (Shannon entropy: H = −∑ p log p)

    For language models, extract token-level log-likelihoods from the model if you have API or model access. These give you the probability of each generated token. Use sequence likelihood to rank alternate responses - common in RAG and search-ranking setups.

    For uncertainty estimates, try:
    - Monte Carlo Dropout: run the same input multiple times with dropout on. Compare outputs. High variance = low confidence.
    - Ensemble models: aggregate predictions from several models to smooth confidence.
    - Calibration testing: use a reliability diagram to check whether predicted probabilities match actual outcomes, with Expected Calibration Error (ECE) as the metric. Good models should show that 80% confident ≈ 80% correct.

    How to improve confidence (and make it trustworthy):
    - Label smoothing during training: prevents overconfident predictions and improves generalization.
    - Temperature tuning (post-hoc): adjusts the softmax sharpness to better align confidence and accuracy. Temperature < 1 → sharper, more confident; temperature > 1 → more cautious, less spiky predictions.
    - Fine-tuning on domain-specific data: shrinks uncertainty and reduces hedging in model output. Especially effective for LLMs that need to be assertive in narrow domains (legal, medicine, strategy).
    - Focal loss for noisy or imbalanced datasets: it down-weights easy examples and forces the model to pay attention to harder cases, which tightens confidence on the edge cases.
    - Reinforcement learning from human feedback (RLHF): aligns the model's reward with correct and confident reasoning.

    Bottom line: a confident model isn't just better - it's safer, cheaper, and easier to debug. If you’re building workflows or products that rely on AI, but you’re not measuring model confidence, you’re guessing.

    #AI #ML #LLM #MachineLearning #AIConfidence #RLHF #ModelCalibration
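
    A minimal sketch of two of the checks above: Shannon entropy of a softmax distribution and Expected Calibration Error computed from predicted confidences versus actual correctness. The bin count and toy data are illustrative assumptions.

    ```python
    import numpy as np

    def shannon_entropy(probs: np.ndarray) -> float:
        """H = -sum(p * log p); higher entropy means the model is less certain."""
        p = np.clip(probs, 1e-12, 1.0)
        return float(-(p * np.log(p)).sum())

    def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
        """Weighted average of |accuracy - confidence| over equal-width confidence bins."""
        confidences, correct = np.asarray(confidences), np.asarray(correct)
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
        return float(ece)

    print(shannon_entropy(np.array([0.92, 0.05, 0.03])))   # confident: low entropy
    print(shannon_entropy(np.array([0.34, 0.33, 0.33])))   # uncertain: high entropy
    print(expected_calibration_error([0.9, 0.8, 0.6, 0.55], [1, 1, 0, 1]))
    ```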

  • Sumit Kumar

    Senior MLE @Meta, Ex-TikTok|Amazon|Samsung

    What if instead of passively observing an LLM's confidence, we could actively teach it to know when to retrieve? The final post of my Adaptive RAG series explores training-based approaches that treat retrieval decisions as a learned skill.

    The previous posts established that naive RAG is costly and often harmful, before exploring lightweight pre-generation methods and confidence-based probing. This final post takes a fundamentally different approach: treating adaptive retrieval as a learned skill. Instead of just inferring when a model needs help, we can explicitly train it to be self-aware.

    We examine three paradigms in increasing order of sophistication:
    🔹 Gatekeeper Models: Lightweight classifiers that act as intelligent routers, deciding whether to invoke retrieval
    🔹 Fine-tuned LLMs: Fine-tuning approaches that teach an LLM to recognize its own knowledge gaps and signal when it needs external information
    🔹 Reasoning Agents: Advanced methods that train LLMs to become autonomous agents, engaging in multi-step reasoning about what they know, what they need, and how to gather missing information iteratively

    The post includes a practical decision framework to help you choose based on API access, training budget, query complexity, and latency requirements. The key takeaway is that the choice depends on your constraints.

    You can read the full post here: https://lnkd.in/gr8C_AAd

    #RAG #AdaptiveRAG #LLM #AI #MachineLearning #DeepLearning #InformationRetrieval
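
    A minimal sketch of the first paradigm, the gatekeeper classifier: a small model looks only at the query and decides whether to invoke retrieval before the LLM answers. The embedding model, toy labels, threshold, and the retrieve/llm stubs are illustrative assumptions, not code from the post.

    ```python
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    def retrieve(query: str) -> str:
        """Placeholder for your vector-store / search call."""
        return "<retrieved passages>"

    def llm(query: str, context: str | None = None) -> str:
        """Placeholder for your LLM call, optionally grounded in retrieved context."""
        return "<model answer>"

    # Label 1 = the base LLM tends to get this wrong from parametric memory alone.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    train_queries = ["Who won the 2023 Ballon d'Or?", "What is 2 + 2?"]
    needs_retrieval = [1, 0]
    gatekeeper = LogisticRegression().fit(embedder.encode(train_queries), needs_retrieval)

    def answer(query: str, threshold: float = 0.5) -> str:
        p_retrieve = gatekeeper.predict_proba(embedder.encode([query]))[0, 1]
        if p_retrieve >= threshold:
            return llm(query, context=retrieve(query))  # likely knowledge gap: retrieve first
        return llm(query)                               # parametric knowledge suffices

    print(answer("What year did the Berlin Wall fall?"))
    ```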

  • Hao Hoang

    Daily AI Interview Questions | Senior AI Researcher & Engineer | ML, LLMs, NLP, DL, CV, ML Systems | 56k+ AI Community

    𝘊𝘩𝘰𝘰𝘴𝘪𝘯𝘨 𝘵𝘩𝘦 𝘳𝘪𝘨𝘩𝘵 𝘓𝘓𝘔 𝘧𝘰𝘳 𝘢 𝘵𝘢𝘴𝘬 𝘪𝘴 𝘢 𝘤𝘰𝘯𝘴𝘵𝘢𝘯𝘵 𝘵𝘶𝘨-𝘰𝘧-𝘸𝘢𝘳 𝘣𝘦𝘵𝘸𝘦𝘦𝘯 𝘱𝘦𝘳𝘧𝘰𝘳𝘮𝘢𝘯𝘤𝘦 𝘢𝘯𝘥 𝘤𝘰𝘴𝘵. 𝘞𝘩𝘢𝘵 𝘪𝘧 𝘢 𝘳𝘰𝘶𝘵𝘦𝘳 𝘤𝘰𝘶𝘭𝘥 𝘭𝘦𝘢𝘳𝘯 𝘵𝘰 𝘮𝘢𝘬𝘦 𝘵𝘩𝘦 𝘰𝘱𝘵𝘪𝘮𝘢𝘭 𝘤𝘩𝘰𝘪𝘤𝘦 𝘰𝘯 𝘵𝘩𝘦 𝘧𝘭𝘺, 𝘶𝘴𝘪𝘯𝘨 𝘰𝘯𝘭𝘺 𝘴𝘪𝘮𝘱𝘭𝘦 𝘶𝘴𝘦𝘳 𝘧𝘦𝘦𝘥𝘣𝘢𝘤𝘬, 𝘸𝘪𝘵𝘩𝘰𝘶𝘵 𝘢 𝘮𝘢𝘴𝘴𝘪𝘷𝘦 𝘱𝘳𝘦-𝘭𝘢𝘣𝘦𝘭𝘦𝘥 𝘥𝘢𝘵𝘢𝘴𝘦𝘵?

    This is critical as companies deploy multi-LLM systems. The cost of running every query through a top-tier model is prohibitive, but creating static, supervised routers is expensive and they fail to adapt to changing user needs.

    A new paper from Fujitsu Research and Microsoft Research, "𝐀𝐝𝐚𝐩𝐭𝐢𝐯𝐞 𝐋𝐋𝐌 𝐑𝐨𝐮𝐭𝐢𝐧𝐠 𝐮𝐧𝐝𝐞𝐫 𝐁𝐮𝐝𝐠𝐞𝐭 𝐂𝐨𝐧𝐬𝐭𝐫𝐚𝐢𝐧𝐭𝐬," tackles this head-on. Instead of treating routing as a supervised learning task, they reframe it as a contextual bandit problem, allowing the system to learn and adapt from limited feedback, much like a recommendation engine learns from clicks.

    Their novel method, PILOT (Preference-prior Informed LinUCB for Adaptive RouTing), learns a shared embedding space for queries and LLMs. This space is first pre-trained on offline human preference data, then continuously refined online using live user feedback (e.g., a simple 👍/👎).

    The results: on the RouterBench benchmark, PILOT achieved 93% of GPT-4's performance at only 25% of its cost. This intelligent routing adds negligible latency to the user experience.

    The takeaway: This research paves the way for truly dynamic, cost-aware AI systems that optimize themselves in real-time. It's a shift from static routing to intelligent, feedback-driven orchestration, making powerful multi-LLM applications more economically viable and responsive than ever before.

    #AI #LLM #MachineLearning #AIEfficiency #Research #Innovation
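
    A minimal LinUCB sketch of the contextual-bandit framing: each candidate LLM is an arm, the query embedding is the context, and thumbs-up/down feedback is the reward. The preference-prior initialization and budget constraint that distinguish PILOT are omitted; the dimension, alpha, and model names are illustrative assumptions.

    ```python
    import numpy as np

    class LinUCBRouter:
        def __init__(self, models: list[str], dim: int, alpha: float = 1.0):
            self.models = models
            self.alpha = alpha
            self.A = {m: np.eye(dim) for m in models}    # per-arm design matrices
            self.b = {m: np.zeros(dim) for m in models}  # per-arm reward vectors

        def choose(self, x: np.ndarray) -> str:
            scores = {}
            for m in self.models:
                A_inv = np.linalg.inv(self.A[m])
                theta = A_inv @ self.b[m]                # ridge-regression estimate
                scores[m] = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)  # UCB score
            return max(scores, key=scores.get)

        def update(self, model: str, x: np.ndarray, reward: float):
            """reward: e.g. 1.0 for a thumbs-up, 0.0 for a thumbs-down."""
            self.A[model] += np.outer(x, x)
            self.b[model] += reward * x

    router = LinUCBRouter(["gpt-4", "mixtral-8x7b"], dim=8)
    x = np.random.randn(8)            # stand-in for a query embedding
    chosen = router.choose(x)
    router.update(chosen, x, reward=1.0)
    ```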

  • Many “LLM routers” reduce to simple classifier heuristics, yet real-world routing demands handling cost, accuracy, and composition tradeoffs - a nuance many repos gloss over. LLMRouter brings structured routing to multi-LLM stacks by formalizing LLM selection as a decision problem over cost, performance, and task characteristics rather than a one-off API choice.

    The repository provides 16+ router implementations, from classical baselines (KNN, SVM, MLP, Elo rating) to graph-based, multi-round, and personalized strategies, and integrates training, inference, and evaluation in a unified CLI with data pipelines from 11 benchmark datasets. Unlike toy classifiers, it embeds router training into an ML workflow with support for pre-trained multi-round routers such as Router-R1 (an RL-trained policy router) and GMTRouter (graph-based personalization), surfacing concrete tradeoffs between simple heuristics and learned decision policies.

    Practically, this elevates routing from hard-coded model selection to a reproducible engineering pattern. You get training-data generation, metrics for performance vs. cost, plugin hooks for custom logic, and API-key-driven inference pipelines; together these reduce the bespoke scripting and brittle ad-hoc logic that many teams build internally.

    The critical constraint remains operational overhead: router training and multi-round strategies add latency, a GPU dependency for training, and complexity in monitoring the cost/accuracy balance. In high-throughput production, this will require observability and failover design comparable to core inference layers.

    For AI architects evaluating multi-model stacks, LLMRouter is a substantive reference implementation showing how routing can be engineered and extended beyond simple task classification.

    GitHub 👩‍💻 https://lnkd.in/eJjFyAP5
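
    A generic sketch of the performance-vs-cost comparison such a framework automates: score a routing policy against always-strong, always-weak, and oracle baselines on a labeled eval set. The record fields and toy numbers are illustrative assumptions, not LLMRouter's actual data format, CLI, or API.

    ```python
    from statistics import mean

    # Each eval record: whether each model answered correctly, and its per-query cost in $.
    evals = [
        {"weak_ok": True,  "strong_ok": True,  "weak_cost": 0.0002, "strong_cost": 0.012},
        {"weak_ok": False, "strong_ok": True,  "weak_cost": 0.0002, "strong_cost": 0.015},
        {"weak_ok": True,  "strong_ok": True,  "weak_cost": 0.0003, "strong_cost": 0.010},
    ]

    def score(policy):
        """policy(record) -> 'weak' or 'strong'; returns (accuracy, total cost)."""
        picks = [policy(r) for r in evals]
        acc = mean(r[f"{p}_ok"] for r, p in zip(evals, picks))
        cost = sum(r[f"{p}_cost"] for r, p in zip(evals, picks))
        return acc, cost

    print("always strong:", score(lambda r: "strong"))
    print("always weak:  ", score(lambda r: "weak"))
    print("oracle router:", score(lambda r: "weak" if r["weak_ok"] else "strong"))
    ```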
