How to Lower Language Model Training Costs
Summary
Lowering language model training costs involves finding ways to make AI systems more affordable and efficient by carefully managing resources, model choices, and infrastructure. The goal is to reduce expenses without sacrificing quality, so businesses can scale their AI projects sustainably.
- Use smaller models: Switch to task-specific, compact language models for routine queries, reserving larger models for complex tasks to cut unnecessary spending.
- Implement smart routing: Set up systems that analyze each request and send it to the model that fits best, ensuring you don’t pay extra for power you don’t need.
- Adopt caching and batching: Store and reuse common responses and batch similar requests together to minimize repeated processing and lower infrastructure costs.
-
After optimizing costs for many AI systems, I've developed a systematic approach that consistently delivers cost reductions of 60-80%. Here's my playbook, in order of least to most effort:

Step 1: Optimizing Inference Throughput
Start here for the biggest wins with the least effort. Enabling caching (LiteLLM (YC W23), Zilliz) and strategic batch processing can cut costs substantially with very little work; I have seen teams halve their costs simply by caching responses and batching requests that don't require real-time results. (A minimal caching sketch follows this post.)

Step 2: Maximizing Token Efficiency
This can give you an additional 50% in savings. Prompt engineering, automated compression (ScaleDown), and structured outputs can cut token usage without sacrificing quality. Small changes in how you craft prompts can lead to massive savings at scale.

Step 3: Model Orchestration
Use routers and cascades to send each prompt to the cheapest model that handles it effectively (OpenRouter, Martian). Why use GPT-4 for simple classification when GPT-3.5 will do? Smart routing ensures you're not overpaying for intelligence you don't need.

Step 4: Self-Hosting
I only suggest self-hosting for teams at scale because of the complexities involved. It requires more technical investment upfront but pays dividends for high-volume applications.

The key is tackling these layers systematically. Most teams jump straight to self-hosting or model switching, but the real savings come from optimizing throughput and token efficiency first. What's your experience with AI cost optimization?
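A minimal sketch of the Step 1 caching idea: an exact-match cache in front of the model call. Here `call_model` is a placeholder for whatever client or gateway (LiteLLM, etc.) you actually use; a production system would persist the cache in Redis or use a semantic cache rather than an in-process dict.

```python
import hashlib
import json

_cache: dict[str, str] = {}  # in-memory; swap for Redis or a semantic cache in production

def call_model(model: str, prompt: str) -> str:
    return f"response from {model}"  # placeholder for the real, billable call

def cached_completion(model: str, prompt: str) -> str:
    # Identical (model, prompt) pairs hit the cache and cost nothing.
    key = hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(model, prompt)
    return _cache[key]
```

Batching works the same way one level up: requests that don't need real-time answers are queued and submitted together, often at discounted batch pricing.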
-
AT&T cut their AI costs by 90%. Not by negotiating better contracts. By rethinking the architecture entirely.

They were processing 8 billion tokens a day through large, general-purpose models. Expensive. Slow. Hard to scale. So their CDO Andy Markus rebuilt the entire orchestration layer around small, purpose-built models, each trained for a specific task, coordinated by a larger model above them.

The result? They now process 27 billion tokens a day, at roughly 10% of the previous cost. And accuracy? Markus says small language models are "just about as accurate, if not as accurate, as a large language model on a given domain area." His conclusion: "I believe the future of agentic AI is many, many, many small language models."

Most business leaders are still debating which big model to buy. The smarter question is: what specific tasks do I actually need AI to do, and can a smaller, cheaper, focused model do them just as well? In most cases, the answer is yes.

Interview with Andy Markus here: https://lnkd.in/e9A6xt6Y
___
I write about AI, business strategy, and leadership for decision-makers. Enjoyed this post? Like 👍, comment 💭, or re-post ♻️ to share with others. Hit the 🔔 on my profile to receive my latest posts.
-
Few Lessons from Deploying and Using LLMs in Production

Deploying LLMs can feel like hiring a hyperactive genius intern: they dazzle users while potentially draining your API budget. Here are some insights I’ve gathered:

1. “Cheap” is a Lie You Tell Yourself: Cloud costs per call may seem low, but the overall expense of an LLM-based system can skyrocket. Fixes:
- Cache repetitive queries: users ask the same thing at least 100x/day.
- Gatekeep: use cheap classifiers (BERT) to filter “easy” requests. Let LLMs handle only the complex 10% and your current systems handle the remaining 90%. (A gatekeeping sketch follows this post.)
- Quantize your models: shrink LLMs to run on cheaper hardware without massive accuracy drops.
- Asynchronously build your caches: pre-generate common responses before they’re requested, or gracefully fail the first time a query comes in and cache it for the next time.

2. Guard Against Model Hallucinations: Sometimes models express answers with such confidence that distinguishing fact from fiction becomes challenging, even for human reviewers. Fixes:
- Use RAG: just a fancy way of saying provide your model the knowledge it requires in the prompt itself, by querying some database based on semantic matches with the query.
- Guardrails: validate outputs using regex or cross-encoders to establish a clear decision boundary between the query and the LLM’s response.

3. The best LLM is often a discriminative model: You don’t always need a full LLM. Consider knowledge distillation: use a large LLM to label your data, then train a smaller discriminative model that performs similarly at a much lower cost.

4. It's not about the model, it's about the data it is trained on: A smaller LLM might struggle with specialized domain data; that’s normal. Fine-tune your model on your specific dataset, starting with parameter-efficient methods (like LoRA or Adapters) and using synthetic data generation to bootstrap training.

5. Prompts are the new Features: Version them, run A/B tests, and continuously refine using online experiments. Consider bandit algorithms to automatically promote the best-performing variants.

What do you think? Have I missed anything? I’d love to hear your “I survived LLM prod” stories in the comments!
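A minimal sketch of the gatekeeping idea from point 1, using an off-the-shelf zero-shot classifier in place of the post's fine-tuned BERT filter. The labels, threshold, and the `answer_with_faq`/`answer_with_llm` handlers are illustrative placeholders, not from the post.

```python
from transformers import pipeline

# Cheap gate: a small classifier decides whether a query is routine
# (handled by existing systems) or complex (escalated to the LLM).
gate = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def answer_with_faq(query: str) -> str:
    return "canned FAQ answer"     # placeholder for your existing system

def answer_with_llm(query: str) -> str:
    return "expensive LLM answer"  # placeholder for the real LLM call

def handle(query: str, threshold: float = 0.8) -> str:
    result = gate(query, candidate_labels=["routine question", "complex request"])
    # Escalate only when the gate is confident the request is complex.
    if result["labels"][0] == "complex request" and result["scores"][0] >= threshold:
        return answer_with_llm(query)
    return answer_with_faq(query)
```

In practice a purpose-trained classifier on your own traffic will be both cheaper and more accurate than a general zero-shot model; the control flow stays the same.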
-
You don't need a 2 trillion parameter model to tell you the capital of France is Paris. Be smart and route between a panel of models according to query difficulty and model specialty!

A new paper proposes a framework to train a router that sends each query to the appropriate LLM to optimize the trade-off between cost and performance.

Overview:
Model inference cost varies significantly. Per one million output tokens: Llama-3-70b ($1) vs. GPT-4-0613 ($60), Haiku ($1.25) vs. Opus ($75).

The RouteLLM paper proposes a router training framework based on human preference data and augmentation techniques, demonstrating over 2x cost savings on widely used benchmarks. They define the problem as having to choose between two classes of models:
(1) strong models: produce high-quality responses but at a high cost (GPT-4o, Claude 3.5)
(2) weak models: relatively lower quality and lower cost (Mixtral 8x7B, Llama3-8b)

A good router requires a deep understanding of the question’s complexity as well as the strengths and weaknesses of the available LLMs. The paper explores different routing approaches:
- Similarity-weighted (SW) ranking
- Matrix factorization
- BERT query classifier
- Causal LLM query classifier

(A threshold-routing sketch follows this post.)

Neat ideas to build from:
- Users can collect a small amount of in-domain data to improve performance for their specific use cases via dataset augmentation.
- Expand from routing between a strong and a weak LLM to multiclass model routing, where we have specialist models (language-vision model, function-calling model, etc.).
- A larger framework controlled by a router: imagine a system of 15-20 tuned small models, with the router as the n+1'th model responsible for picking the LLM that will handle a particular query at inference time.
- MoA architectures: routing to different architectures of a Mixture of Agents would be a cool idea as well. Depending on the query, you decide how many proposers there should be, how many layers in the mixture, what the aggregator models should be, etc.
- Route-based caching: if you get redundant queries that are slightly different, route the query plus the previous answer to a small model for light rewriting instead of regenerating the answer.
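A minimal sketch of the strong/weak routing decision under stated assumptions: `win_rate` stands in for a trained router (such as the paper's BERT query classifier) that estimates the probability the weak model's answer would be acceptable, and the threshold is the knob that trades cost against quality. Model names are illustrative.

```python
# Hypothetical trained router: returns P(weak model suffices) for a query.
def win_rate(query: str) -> float:
    return 0.5  # placeholder; in practice, a trained classifier's output

STRONG = "gpt-4o"     # high quality, high cost
WEAK = "llama-3-8b"   # lower quality, far cheaper per output token

def route(query: str, threshold: float = 0.7) -> str:
    """Send the query to the weak model only when the router is confident
    enough; raising the threshold buys quality at the price of cost."""
    return WEAK if win_rate(query) >= threshold else STRONG
```

Sweeping the threshold over a held-out set is how you draw the paper's cost-vs-performance curve and pick an operating point.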
-
As companies look to scale their GenAI initiatives, a significant hurdle is emerging: the cost of scaling the infrastructure, particularly managing tokens for paid Large Language Models (LLMs) and the surrounding infrastructure.

Here's what companies need to know:
a) Token-based pricing, the standard for most LLM providers, presents a significant cost-management challenge due to the wide cost variations between models. For instance, GPT-4 can be ten times more expensive than GPT-3.5-turbo.
b) Infrastructure costs go beyond just the LLM fees. For every $1 spent on developing a model, companies may need to pay $100 to $1,000 on infrastructure to run it effectively.
c) Run costs typically exceed build costs for GenAI applications, with model usage and labor being the most significant drivers.

Optimizing costs is an ongoing process, and the following best practices can reduce costs significantly:
a) Techniques like preloading embeddings can reduce query costs from a dollar to less than a penny (see the sketch after this post)
b) Optimizing prompts to reduce token usage
c) Using task-specific, smaller models where appropriate
d) Implementing caching and batching of requests
e) Utilizing model quantization and distillation techniques
f) Adopting a flexible API system to avoid vendor lock-in and allow quick adaptation as technology evolves

Investments in GenAI should be tied to ROI. Not all AI interactions need the same level of responsiveness (and cost). Leaders must focus on sustainable, cost-effective scaling strategies as we transition from GenAI's 'honeymoon phase'. The key is to balance innovation and financial prudence, ensuring long-term success in the AI-driven future.

#GenerativeAI #AIScaling #TechLeadership #InnovationCosts #GenAI
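A minimal sketch of practice (a), assuming the sentence-transformers library: embed the corpus once, persist the vectors, and answer later queries against the cached matrix instead of re-embedding (or calling a paid embedding API) every time. The cache path and model choice are illustrative.

```python
import os
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, cheap embedder
CACHE = "corpus_embeddings.npy"                  # illustrative path

def load_corpus_embeddings(corpus: list[str]) -> np.ndarray:
    # Pay the embedding cost once; every later query reuses the cache.
    if os.path.exists(CACHE):
        return np.load(CACHE)
    emb = model.encode(corpus, normalize_embeddings=True)
    np.save(CACHE, emb)
    return emb

def top_k(query: str, corpus: list[str], k: int = 3) -> list[str]:
    emb = load_corpus_embeddings(corpus)
    q = model.encode([query], normalize_embeddings=True)[0]
    idx = np.argsort(emb @ q)[::-1][:k]  # cosine ranking over cached vectors
    return [corpus[i] for i in idx]
```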
-
LLMs are overkill for 80% of business tasks. Enter SLMs:

Most companies are burning cash on GPT-4 when a specialized Small Language Model would do the job better, faster, and cheaper.

Here's the architecture difference:
- Traditional LLMs: a simple linear pipeline that processes everything with maximum resources. Like using a Ferrari for grocery runs.
- Smart SLMs: optimized parallel processing with compact tokenization, task-specific embeddings, and model quantization. Built for edge deployment and real-world efficiency.

Real Cost Comparison:
- GPT-4: $30/million input tokens, $60/million output tokens
- GPT-4.1-nano: $0.10/million input, $0.40/million output (OpenAI's cheapest)
- Llama 3.2 (1B): $0.03-0.05/million tokens
- Custom fine-tuned SLMs can cost even less

Where SLMs Win: customer service (handling 90% of repetitive queries), document classification, sentiment analysis, code completion for specific languages, and IoT/edge device applications.

Where LLMs Still Rule: creative writing, complex reasoning tasks, multi-domain applications, and research assistance.

Real Business Case: switching from GPT-4 to a specialized SLM for invoice processing:
- Latency: 2s → 0.3s
- Cost: over 90% reduction
- Accuracy: improved with domain-specific training

Quick Start Guide:
1. Identify repetitive tasks in your workflow
2. Calculate current LLM costs (a back-of-the-envelope sketch follows this post)
3. Test open-source SLMs (Phi-3, TinyLlama, Llama 3.2)
4. Fine-tune on your specific data
5. Deploy locally or on edge

The future isn't about bigger models. It's about smarter, specialized ones that run anywhere.

Over to you: what task are you overpaying LLMs to handle?
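A back-of-the-envelope sketch of step 2, using the prices quoted above; the monthly token volumes are illustrative assumptions, not from the post.

```python
# Monthly cost = tokens (millions) * price per million, input plus output.
def monthly_cost(in_mtok: float, out_mtok: float,
                 in_price: float, out_price: float) -> float:
    return in_mtok * in_price + out_mtok * out_price

# Illustrative workload: 500M input and 100M output tokens per month.
gpt4 = monthly_cost(500, 100, 30.00, 60.00)  # -> $21,000
nano = monthly_cost(500, 100, 0.10, 0.40)    # -> $90
print(f"GPT-4: ${gpt4:,.0f}/mo  vs  GPT-4.1-nano: ${nano:,.0f}/mo")
```

Even rough numbers like these usually make the routing decision obvious for high-volume, repetitive tasks.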
-
Stop predicting tokens with your LLM. Start predicting vectors.

Every modern LLM, from GPT to DeepSeek, follows the same fundamental process: next-token prediction. The model predicts one token, then the next, then the next, sequentially building text one piece at a time. This works, but it's incredibly inefficient.

A new paper from WeChat AI introduces CALM (Continuous Autoregressive Language Models), which, instead of predicting discrete tokens, predicts continuous vectors, each representing K tokens.

Encoding Tokens as Vectors
You can't just use a standard embedding model, because embedding models are encoders only. Since we will be predicting vectors as well, you need a way to go back from the vector to the tokens. Enter the autoencoder (remember this?):
- Has an encoder and a decoder
- Compresses K tokens into a single vector with minimal information loss
- Creates a robust latent space that's smooth enough for generative modelling

(A toy autoencoder sketch follows this post.)

A Continuous Language Model
Normal LMs are discrete: they predict a single token at a time. Think of it like a multi-class classification model over time. CALM, however, is a continuous predictor: it predicts the compressed vectors, somewhat like a regression model. To account for continuity, several things had to change:
- Cross-entropy loss -> Energy Score loss
- Perplexity -> Brier Score

Results
CALM-M (371M params, K=4) matches Transformer-S (281M params) performance with 44% fewer training FLOPs and 34% fewer inference FLOPs. As chunk size K increases, CALM's compute requirements drop drastically, but accuracy goes down.

This is an extremely interesting development in the field of language models, and I'll be really excited to see what comes of this initial, promising piece of research. I also love the way the paper was written: like a story of how the researchers worked on this project. It presents a problem, then a solution, rather than throwing all the theory at you straight away: https://lnkd.in/e9RjbXZa
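A toy PyTorch sketch of the chunk-autoencoder idea, not the paper's actual architecture: compress the embeddings of K tokens into one latent vector, reconstruct them, and train on a reconstruction loss. All dimensions are illustrative.

```python
import torch
import torch.nn as nn

K, d_model, d_latent, vocab = 4, 256, 128, 32000  # illustrative sizes

class ChunkAutoencoder(nn.Module):
    """Compresses K token embeddings into one latent vector and back."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.encoder = nn.Linear(K * d_model, d_latent)
        self.decoder = nn.Linear(d_latent, K * d_model)
        self.to_logits = nn.Linear(d_model, vocab)

    def forward(self, tokens):             # tokens: (batch, K)
        x = self.embed(tokens).flatten(1)  # (batch, K * d_model)
        z = self.encoder(x)                # the continuous vector a CALM-style model would predict
        y = self.decoder(z).view(-1, K, d_model)
        return self.to_logits(y)           # (batch, K, vocab)

model = ChunkAutoencoder()
tokens = torch.randint(0, vocab, (8, K))
logits = model(tokens)
# Train the autoencoder to reconstruct its own K input tokens.
loss = nn.functional.cross_entropy(logits.view(-1, vocab), tokens.view(-1))
loss.backward()
```

Once such an autoencoder is trained, the language model's job shifts from predicting one token to predicting the next latent vector z, which the decoder then expands back into K tokens.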
-
Nothing changed in the product. But the AI bill doubled overnight.

That’s when most teams learn the hard truth: token usage doesn’t explode because of one big mistake; it creeps in through dozens of small ones.

Here’s a simple breakdown of the core strategies that keep AI systems fast, affordable, and predictable as they scale:

Cost Reduction Focus
‣ Shorten System Prompts: cut the unnecessary instructions. Smaller system prompts mean lower cost on every single call.
‣ Use Structured Prompts: bullets, schemas, and clear formats reduce ambiguity and prevent the model from generating long, wasteful responses.
‣ Trim Conversation History: only include the parts relevant to the current task. Long-running agents often burn tokens without you noticing.
‣ Budget Your Context Window: divide context into strict sections so one part doesn’t overwhelm the whole window.

Latency & Efficiency Focus
‣ Compress Retrieved Content: summaries → key chunks → only then full text. This keeps retrieval grounded without ballooning token usage.
‣ Metadata-First Retrieval: start with summaries or metadata; pull full documents only when required.
‣ Replace Text with IDs: instead of resending repeated text, reference IDs, states, or steps.
‣ Limit Tool Output Size: filter tool returns so agents only receive the data they actually need.

Context & Speed Focus
‣ Use Smaller Models Smartly: not every step needs your biggest model. Route simple tasks to lighter ones.
‣ Stop Over-Explaining: if you don’t ask for long reasoning, the model won’t generate it. Huge hidden token savings.
‣ Cache Stable Responses: if an instruction doesn’t change, don’t regenerate it. Cache it.
‣ Enforce Max Output Tokens: set strict caps so the model never produces more than required.

(A sketch combining two of these controls follows this post.)

Costs rarely spike because AI got more expensive; they spike because your system became less disciplined. Optimizing tokens isn’t optional anymore. It’s how you build AI products that scale without burning your budget.
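A minimal sketch combining history trimming and a hard output cap, assuming the OpenAI Python client; the window size, cap, and model name are illustrative defaults, not recommendations from the post.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def reply(system: str, history: list[dict], user: str,
          max_turns: int = 6, max_output: int = 300) -> str:
    # Trim conversation history: keep only the most recent turns.
    trimmed = history[-max_turns:]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system},
                  *trimmed,
                  {"role": "user", "content": user}],
        max_tokens=max_output,  # strict cap on every call's output
    )
    return resp.choices[0].message.content
```

The same two knobs apply to any provider: bound what goes in, bound what comes out, and the bill stops drifting.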
-
Maximizing LLM Efficiency: Cost Savings and Sustainable AI Performance

Optimizing costs for large language models (LLMs) is essential for scalable, sustainable AI applications. Approaches like FrugalGPT offer frameworks that reduce expenses while maintaining high-quality outputs by intelligently selecting models based on task requirements.

FrugalGPT’s approach to cost optimization includes three key techniques:
1️⃣ Prompt Adaptation: concise, optimized prompts reduce token usage, lowering processing time and cost.
2️⃣ LLM Approximation: by caching common responses and fine-tuning specific models, FrugalGPT decreases the need to repeatedly query more costly, resource-heavy models.
3️⃣ LLM Cascade: dynamically selecting the optimal combination of LLMs based on the input query, ensuring that simpler tasks are handled by less costly models while more complex queries are directed to more powerful LLMs. (A cascade sketch follows this post.)

While FrugalGPT’s primary goal is cost optimization, its strategies inherently support sustainability by minimizing heavy LLM usage when smaller models suffice, optimizing prompts to reduce resource demands, and caching frequent responses. Reducing reliance on high-resource models, where possible, decreases energy demands and aligns with sustainable AI practices.

Several commercial offerings have also adopted and built on similar concepts, introducing tools for enhanced model selection, automated prompt optimization, and scalable caching systems to balance performance, cost, and sustainability effectively.

Every optimization involves trade-offs. FrugalGPT allows users to fine-tune this balance, sometimes sacrificing a small degree of accuracy for a significant cost reduction. Explore FrugalGPT’s methods and trade-off analysis to learn more about achieving quality outcomes cost-effectively while contributing to a more efficient AI ecosystem.

Here is the Google Colab notebook: https://lnkd.in/d2q6XNkM

Do read this very interesting FrugalGPT paper for insights into the experiments and methodologies: https://lnkd.in/dik6JW4B

Additionally, try out Google Illuminate by providing the research paper to generate an engaging audio summary, making complex content more accessible.

#greenai #sustainableai #sustainability
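A minimal sketch of the LLM-cascade idea: try the cheapest model first and escalate only when a scoring function deems the answer unreliable. `generate` and `score` are placeholders; FrugalGPT trains a small scoring model for this role, and the model names here are illustrative.

```python
# Models ordered from cheapest to most expensive (names illustrative).
CASCADE = ["small-model", "mid-model", "large-model"]

def generate(model: str, query: str) -> str:
    return f"{model} answer"  # placeholder for a real completion call

def score(query: str, answer: str) -> float:
    return 0.5                # placeholder; FrugalGPT uses a trained scorer

def cascade(query: str, threshold: float = 0.8) -> str:
    for model in CASCADE:
        answer = generate(model, query)
        if score(query, answer) >= threshold:
            return answer     # confident enough: stop paying here
    return answer             # fall back to the strongest model's answer
```

The threshold is the trade-off dial the post mentions: lower it and more queries stop at cheap models; raise it and quality improves at higher cost.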
-
LLM Cost Optimization Strategies: Achieving Efficient AI Workflows

Large Language Models (LLMs) are transforming industries but come with high computational costs. To make AI solutions more scalable and efficient, it's essential to adopt smart cost optimization strategies.

🔑 Key Strategies:
1️⃣ Input Optimization: Refine prompts and prune unnecessary context.
2️⃣ Model Selection: Choose the right-size models for task-specific needs.
3️⃣ Distributed Processing: Improve performance with distributed inference and load balancing (a round-robin sketch follows this post).
4️⃣ Model Optimization: Implement quantization and pruning techniques to reduce computational requirements.
5️⃣ Caching Strategy: Use response and embedding caching for faster results.
6️⃣ Output Management: Optimize token limits and enable stream processing.
7️⃣ System Architecture: Enhance efficiency with batch processing and request optimization.

By adopting these strategies, organizations can unlock the full potential of LLMs while keeping operational expenses under control. How is your organization managing LLM costs? Let's discuss!
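A minimal sketch of the load-balancing piece of strategy 3️⃣: rotate requests across several inference replicas so no single instance becomes the bottleneck. The endpoint URLs and the `{"prompt": ...}`/`{"text": ...}` payload shape are illustrative assumptions about your serving stack.

```python
import itertools
import requests

# Illustrative replica endpoints all serving the same model.
ENDPOINTS = itertools.cycle([
    "http://inference-1:8000/generate",
    "http://inference-2:8000/generate",
    "http://inference-3:8000/generate",
])

def generate(prompt: str, timeout: float = 30.0) -> str:
    # Round-robin dispatch; production balancers also track health and load.
    url = next(ENDPOINTS)
    resp = requests.post(url, json={"prompt": prompt}, timeout=timeout)
    resp.raise_for_status()
    return resp.json()["text"]  # assumes the server returns {"text": ...}
```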