How to Deploy LLMs for Optimal Performance

Explore top LinkedIn content from expert professionals.

Summary

Deploying large language models (LLMs) for optimal performance means designing systems that run these AI models efficiently, with high accuracy and manageable costs. Instead of just relying on default settings, it involves using smart strategies to manage everything from the underlying architecture and model size to how user requests are handled.

  • Smart resource allocation: Use techniques like model quantization and caching to cut down on hardware expenses and speed up processing for repeated or simple queries.
  • Tailored architecture choices: Separate application components and select serving systems that match your deployment needs, so you can scale and adapt as requirements evolve.
  • Strategic query management: Implement intelligent routing that directs each request to the model best suited for the task, improving accuracy while keeping costs in check.
Summarized by AI based on LinkedIn member posts
  • Rahul Agarwal

    Staff ML Engineer | Meta, Roku, Walmart | 1:1 @ topmate.io/MLwhiz

    45,178 followers

    A Few Lessons from Deploying and Using LLMs in Production

    Deploying LLMs can feel like hiring a hyperactive genius intern—they dazzle users while potentially draining your API budget. Here are some insights I’ve gathered:

    1. “Cheap” is a lie you tell yourself: Cloud costs per call may seem low, but the overall expense of an LLM-based system can skyrocket. Fixes:
       - Cache repetitive queries: users ask the same thing at least 100x/day.
       - Gatekeep: use cheap classifiers (e.g., BERT) to filter “easy” requests. Let LLMs handle only the complex 10% and your current systems handle the remaining 90%.
       - Quantize your models: shrink LLMs to run on cheaper hardware without massive accuracy drops.
       - Build your caches asynchronously: pre-generate common responses before they’re requested, or fail gracefully the first time a query arrives and cache it for the next time.

    2. Guard against model hallucinations: Sometimes models express answers with such confidence that distinguishing fact from fiction becomes challenging, even for human reviewers. Fixes:
       - Use RAG: just a fancy way of saying you provide the model the knowledge it requires in the prompt itself, by querying a database based on semantic matches with the query.
       - Guardrails: validate outputs using regex or cross-encoders to establish a clear decision boundary between the query and the LLM’s response.

    3. The best LLM is often a discriminative model: You don’t always need a full LLM. Consider knowledge distillation: use a large LLM to label your data, then train a smaller, discriminative model that performs similarly at a much lower cost.

    4. It’s not about the model, it’s about the data it is trained on: A smaller LLM might struggle with specialized domain data—that’s normal. Fine-tune your model on your specific dataset, starting with parameter-efficient methods (like LoRA or adapters) and using synthetic data generation to bootstrap training.

    5. Prompts are the new features: Version them, run A/B tests, and continuously refine them with online experiments. Consider bandit algorithms to automatically promote the best-performing variants.

    What do you think? Have I missed anything? I’d love to hear your “I survived LLM prod” stories in the comments!
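
The first two fixes above (caching repetitive queries and gatekeeping with a cheap classifier) can live in one thin routing layer. Below is a minimal Python sketch of that idea; `cheap_classifier`, `small_model`, and `call_llm` are hypothetical placeholders for whatever classifier and model endpoints you actually run, so treat this as an illustration rather than a drop-in implementation.

```python
import hashlib

# Hypothetical backends: swap in your real classifier / model clients.
def cheap_classifier(query: str) -> float:
    """Return an estimated 'difficulty' score in [0, 1] (e.g., a fine-tuned BERT head)."""
    return 0.2 if len(query.split()) < 12 else 0.8

def small_model(query: str) -> str:
    """Cheap path: existing system, template, or distilled model."""
    return f"[small-model answer to: {query}]"

def call_llm(query: str) -> str:
    """Expensive path: the full LLM API call."""
    return f"[LLM answer to: {query}]"

_cache: dict[str, str] = {}

def answer(query: str, difficulty_threshold: float = 0.5) -> str:
    # 1. Cache repetitive queries (normalize, then hash).
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key in _cache:
        return _cache[key]

    # 2. Gatekeep: only the hard minority of queries reaches the LLM.
    if cheap_classifier(query) < difficulty_threshold:
        response = small_model(query)
    else:
        response = call_llm(query)

    # 3. Populate the cache so the next identical query is free.
    _cache[key] = response
    return response

print(answer("What are your opening hours?"))
```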

  • Aishwarya Srinivasan

    LinkedIn Influencer

    627,898 followers

    If you’re an AI engineer trying to optimize your LLMs for inference, here’s a quick guide for you 👇

    Efficient inference isn’t just about faster hardware, it’s a multi-layered design problem. From how you compress prompts to how your memory is managed across GPUs, everything impacts latency, throughput, and cost. Here’s a structured taxonomy of inference-time optimizations for LLMs:

    1. Data-Level Optimization
    Reduce redundant tokens and unnecessary output computation.
    → Input Compression:
     - Prompt Pruning: remove irrelevant history or system tokens
     - Prompt Summarization: use model-generated summaries as input
     - Soft Prompt Compression: encode static context using embeddings
     - RAG: replace long prompts with retrieved documents plus compact queries
    → Output Organization:
     - Pre-structure output to reduce decoding time and minimize sampling steps

    2. Model-Level Optimization
    (a) Efficient Structure Design
    → Efficient FFN Design: use gated or sparsely-activated FFNs (e.g., SwiGLU)
    → Efficient Attention: FlashAttention, linear attention, or sliding window for long context
    → Transformer Alternatives: e.g., Mamba, Reformer for memory-efficient decoding
    → Multi/Group-Query Attention: share keys/values across heads to reduce KV cache size
    → Low-Complexity Attention: replace full softmax with approximations (e.g., Linformer)
    (b) Model Compression
    → Quantization:
     - Post-Training: no retraining needed
     - Quantization-Aware Training: better accuracy, especially below 8-bit
    → Sparsification: weight pruning, sparse attention
    → Structure Optimization: neural architecture search, structure factorization
    → Knowledge Distillation:
     - White-box: student learns from the teacher's internal states and logits
     - Black-box: student learns only from the teacher's outputs
    → Dynamic Inference: adaptive early exits or skipping blocks based on input complexity

    3. System-Level Optimization
    (a) Inference Engine
    → Graph & Operator Optimization: use ONNX, TensorRT, BetterTransformer for op fusion
    → Speculative Decoding: use a smaller model to draft tokens, validate with the full model
    → Memory Management: KV cache reuse, paging strategies (e.g., PagedAttention in vLLM)
    (b) Serving System
    → Batching: group requests with similar lengths for throughput gains
    → Scheduling: token-level preemption (e.g., TGI, vLLM schedulers)
    → Distributed Systems: use tensor, pipeline, or model parallelism to scale across GPUs

    My Two Cents 🫰
    → Always benchmark end-to-end latency, not just token decode speed
    → For production, 8-bit or 4-bit quantized models with MQA and PagedAttention give the best price/performance
    → If using long context (>64k), consider sliding attention plus RAG, not full dense memory
    → Use speculative decoding and batching for chat applications with high concurrency
    → LLM inference is a systems problem. Optimizing it requires thinking holistically, from tokens to tensors to threads.

    Image inspo: "A Survey on Efficient Inference for Large Language Models"

    Follow me (Aishwarya Srinivasan) for more AI insights!
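
As a concrete starting point for the system-level items above (PagedAttention, continuous batching, quantization), here is a minimal vLLM sketch. The model name and quantization settings are assumptions for illustration; check which format (AWQ, GPTQ, FP8) your checkpoint actually ships with.

```python
# Minimal vLLM example: PagedAttention and continuous batching are built in;
# quantization and memory settings are opt-in flags.
# pip install vllm  (assumes a CUDA-capable GPU)
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # assumed AWQ-quantized checkpoint
    quantization="awq",          # 4-bit weights -> much smaller memory footprint
    gpu_memory_utilization=0.90, # leave headroom for KV cache pages
    max_model_len=8192,
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# Submitting prompts together lets the engine batch them for throughput.
prompts = [
    "Summarize the benefits of KV cache paging in two sentences.",
    "Explain speculative decoding to a new engineer.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```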

  • Shalini Goyal

    Executive Director @ JP Morgan | Ex-Amazon || Professor @ Zigurat || Speaker, Author || TechWomen100 Award Finalist

    119,791 followers

    Building a GenAI app? Don’t just plug in a model - design it to scale, adapt, and evolve. Here’s your blueprint for future-ready GenAI systems. 👇

    1. Modular Architecture: Separate UI, orchestration, models, and storage so you can swap parts independently. Use LangChain or LlamaIndex to build pipelines.
    2. Context Engineering: Layer system prompts, memory, and retrieved knowledge to optimize generation. Use chunking and summarization to stay efficient.
    3. Retrieval-Augmented Generation (RAG): Connect vector DBs like Pinecone or Weaviate and use hybrid search (dense + keyword) for domain-specific relevance.
    4. Low-Latency Design: Cut load times and delays with model distillation, quantization, and async I/O.
    5. Agent-Based Systems: Use CrewAI, AutoGen, or LangGraph for task decomposition and tool execution via specialized sub-agents.
    6. Tool & Plugin Integration: Enable LLMs to run code, hit APIs, or use external tools through OpenAI function calling or LangChain routing.
    7. Streaming & Feedback: Improve the experience with real-time streaming via WebSockets and user feedback for continuous refinement.
    8. Memory Management: Support both session and long-term memory using Redis, Postgres, or vector DBs for persistence.
    9. Smart Deployment: Use K8s or serverless runtimes (like AWS Lambda) to deploy GenAI apps with dynamic scaling.
    10. Observability: Track usage, hallucinations, and prompts using tools like LangSmith or WhyLabs for LLM monitoring.

    The takeaway? Good GenAI apps aren’t just about prompts, they’re engineered for performance, adaptability, and scale.
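
Points 2 and 3 in this blueprint (context engineering plus RAG) largely come down to how the prompt is assembled. The sketch below is framework-agnostic Python; `vector_db_search` and `summarize` are hypothetical stand-ins for whatever vector store (Pinecone, Weaviate, etc.) and summarizer you actually use, so treat the interfaces as assumptions.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float

# --- Hypothetical components; replace with your vector DB / LLM clients. ---
def vector_db_search(query: str, top_k: int = 4) -> list[Chunk]:
    """Dense (or hybrid dense + keyword) retrieval from your knowledge base."""
    return [Chunk(text=f"Retrieved passage {i} relevant to '{query}'", score=0.9 - 0.1 * i)
            for i in range(top_k)]

def summarize(history: list[str], max_chars: int = 800) -> str:
    """Compress long chat history so the context window stays small."""
    joined = " ".join(history)
    return joined[-max_chars:]

# --- Context engineering: layer system prompt, memory, and retrieved knowledge. ---
def build_prompt(system_prompt: str, history: list[str], query: str) -> str:
    memory = summarize(history)
    knowledge = "\n".join(f"- {c.text}" for c in vector_db_search(query))
    return (
        f"{system_prompt}\n\n"
        f"Conversation summary:\n{memory}\n\n"
        f"Relevant knowledge:\n{knowledge}\n\n"
        f"User question: {query}\nAnswer using only the knowledge above."
    )

print(build_prompt(
    system_prompt="You are a precise assistant for our internal docs.",
    history=["User asked about deployment.", "Assistant explained blue/green deploys."],
    query="How do we roll back a bad release?",
))
```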

  • Paolo Perrone

    No BS AI/ML Content | ML Engineer with a Plot Twist 🥷100M+ Views 📝

    128,856 followers

    Most teams deploy LLMs with default settings and wonder why inference costs $50K/month. The optimization stack exists. Most engineers don't know the layers. Here's the full inference optimization hierarchy:

    LAYER 1: Serving architecture
    Before touching a single kernel, get your serving right.
    - vLLM (74K ⭐): PagedAttention, continuous batching. https://lnkd.in/eeT_HM2B
    - SGLang (25K ⭐): structured generation + RadixAttention. Faster for constrained outputs. https://lnkd.in/eKK7sxdf

    LAYER 2: Quantization
    Shrink the model without killing accuracy.
    - llama.cpp (92K ⭐): GGUF quantization. Run 70B on consumer hardware. https://lnkd.in/eJrUg_qd
    - Unsloth (50K ⭐): QLoRA fine-tuning at 70% less VRAM. https://lnkd.in/gJZtH4Y4
    This layer alone can cut your GPU bill in half.

    LAYER 3: Attention + caching
    How much are you spending on redundant prefill?
    - Flash Attention (21K ⭐): memory-efficient, IO-aware. Non-negotiable. https://lnkd.in/eYkuRuxC
    - LMCache (1.5K ⭐): KV cache sharing. Eliminates redundant prefill entirely. github.com/LMCache/LMCache

    LAYER 4: Hardware-specific acceleration
    Match your optimization to your silicon.
    - TensorRT-LLM: purpose-built for NVIDIA GPUs. Kernel fusion, in-flight batching. https://lnkd.in/ekuFuDAP
    - MLX: native framework for Apple Silicon. Inference without CUDA. github.com/ml-explore/mlx

    LAYER 5: Custom kernels
    Where the real differentiation lives.
    - LeetCUDA (9K ⭐): 200+ CUDA kernels. Tensor Cores, HGEMM. https://lnkd.in/eUfgpwW6
    - llm.c (28K ⭐): Karpathy's raw C/CUDA. The fundamentals. github.com/karpathy/llm.c
    Engineers who write custom kernels command $200K+ at NVIDIA, Meta, and Google.

    LAYER 6: Distributed inference
    When one node isn't enough.
    - NVIDIA Dynamo: multi-node orchestration. Disaggregated serving. https://lnkd.in/etBGNtjk
    - exo (39K ⭐): distributed inference across consumer devices. github.com/exo-explore/exo

    6 layers. Each one multiplies the savings from the layer above. Most teams stop at Layer 1. The ones running inference profitably reach Layer 5. Which layer is your team stuck at? 👇

    💾 Bookmark this. Your next inference bill will thank you.
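
For Layer 2, the lowest-friction way to try GGUF quantization is llama.cpp's Python bindings. A rough sketch under stated assumptions: the model path is a placeholder for a 4-bit GGUF file you've already downloaded, and GPU offload is optional.

```python
# pip install llama-cpp-python
# Sketch: run a 4-bit GGUF checkpoint locally via llama.cpp bindings.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # assumed local GGUF file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if one is available; 0 = CPU only
)

result = llm(
    "List three ways to reduce LLM inference cost.",
    max_tokens=128,
    temperature=0.7,
)
print(result["choices"][0]["text"])
```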

  • Soham Chatterjee

    Co-Founder & CTO @ ScaleDown | Task-specific SLMs - frontier quality, 10x cheaper and 2x faster

    5,002 followers

    No single LLM, whether open-source or proprietary, outperforms all others across every task or domain. This makes a "one-size-fits-all" model strategy fundamentally suboptimal.

    This is where the "LLM Control Plane" comes in. It is a critical architectural layer that orchestrates how applications interact with a diverse ecosystem of models. A core component of this layer is the LLM Router, an intelligent controller that directs every query to the best model for the job, optimizing for cost, speed, and quality.

    This layer is super interesting because we are now seeing many companies release small AI models specifically to solve the routing problem. The latest is the Arch-Router model from Katanemo, which shows that routing can achieve better performance than a single LLM while reducing latency.

    Arch-Router is a compact 1.5B-parameter model that learns to map queries to human-readable policies you define, like Domain: 'legal' and Action: 'summarize_contract'. It allows you to encode your own definition of "best," aligning routing decisions with real-world needs rather than academic benchmarks.

    Check out their paper here: https://lnkd.in/gRfbhX2g
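
Here is a hedged sketch of what a router layer can look like in practice. The policy table, the `route_with_small_model` stub, and the backend names are illustrative assumptions, not the Arch-Router API; the point is simply that a small, cheap model (or even a heuristic) picks a policy, and the policy maps to a backend.

```python
# Illustrative LLM-router sketch (not the actual Arch-Router API).
POLICIES = {
    # human-readable policy -> backend model (names are placeholders)
    ("legal", "summarize_contract"): "large-reasoning-model",
    ("support", "faq"): "small-cheap-model",
    ("code", "generate"): "code-specialized-model",
}
DEFAULT_BACKEND = "general-purpose-model"

def route_with_small_model(query: str) -> tuple[str, str]:
    """Stand-in for a compact router model (e.g., ~1.5B params) that maps a
    query to a (domain, action) policy. Here: a trivial keyword heuristic."""
    q = query.lower()
    if "contract" in q:
        return ("legal", "summarize_contract")
    if "error" in q or "how do i" in q:
        return ("support", "faq")
    if "function" in q or "python" in q:
        return ("code", "generate")
    return ("general", "chat")

def pick_backend(query: str) -> str:
    policy = route_with_small_model(query)
    backend = POLICIES.get(policy, DEFAULT_BACKEND)
    print(f"query -> policy {policy} -> backend {backend}")
    return backend

pick_backend("Please summarize this vendor contract for risks.")
pick_backend("How do I reset my password?")
```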

  • Jared Quincy Davis

    Founder and CEO, Mithril

    9,932 followers

    We’re not yet at the point where a single LLM call can solve many of the most valuable problems in production. As a consequence, practitioners frequently deploy *compound AI systems* composed of multiple prompts and sub-stages, often with multiple calls per stage. These systems' implementations may also encompass multiple models and providers.

    These *networks-of-networks* (NONs) or "multi-stage pipelines" can be difficult to optimize and tune in a principled manner. There are numerous levels at which they can be tuned, including but not limited to:

    (I) optimizing the prompts in the system (see DSPy: https://lnkd.in/g3vcqw3H)
    (II) optimizing the weights of a verifier or router (see FrugalGPT: https://lnkd.in/g36kfhs9)
    (III) optimizing the architecture of the NON (see Networks of Networks: https://lnkd.in/g5tvASaz and Are More LLM Calls All You Need: https://lnkd.in/gh_v5b2D)
    (IV) optimizing the selection amongst, and composition of, frozen modules in the system (see our new work, LLMSelector: https://lnkd.in/gkt7nj8w)

    In a multi-stage compound system, which LLM should be used for which calls, given the spikes and affinities across models? How much can we push the performance frontier by tuning this? Quite dramatically: in LLMSelector, we demonstrate performance gains of 5-70% above the best mono-model system across myriad tasks, ranging from LiveCodeBench to FEVER.

    One core technical challenge is that the search space for optimizing LLM selection is exponential. We find, though, that optimization is still feasible and tractable because (a) the compound system's aggregate performance is often *monotonic* in the performance of individual modules, allowing for greedy optimization at times, and (b) we can *learn to predict* module performance.

    This is an exciting direction for future research! Great collaboration with Lingjiao Chen, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, and Ion Stoica!

    References:
    LLMSelector: https://lnkd.in/gkt7nj8w
    DSPy: https://lnkd.in/g3vcqw3H
    FrugalGPT: https://lnkd.in/g36kfhs9
    Networks of Networks (NON): https://lnkd.in/g5tvASaz
    Are More LLM Calls All You Need: https://lnkd.in/gh_v5b2D
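
To make the monotonicity/greedy point concrete, here is a rough sketch of per-module model selection via coordinate ascent. It is a simplification of what LLMSelector actually does (which also learns to predict module performance); the module names, model names, and the toy `evaluate_pipeline` scorer are all placeholder assumptions standing in for a real dev-set evaluation of your compound system.

```python
MODULES = ["retrieve_rewrite", "draft_answer", "verify"]   # pipeline stages (illustrative)
CANDIDATE_MODELS = ["model-a", "model-b", "model-c"]        # placeholder model names

# Toy stand-in for dev-set evaluation: pretend each module has a true best
# model and score an assignment by how many modules got theirs.
_TRUE_BEST = {"retrieve_rewrite": "model-b", "draft_answer": "model-c", "verify": "model-a"}

def evaluate_pipeline(assignment: dict[str, str]) -> float:
    return sum(assignment[m] == _TRUE_BEST[m] for m in MODULES) / len(MODULES)

def greedy_select(initial: dict[str, str], max_rounds: int = 3) -> dict[str, str]:
    """Coordinate ascent over module -> model choices. If aggregate quality is
    (roughly) monotonic in per-module quality, this sidesteps the exponential
    search over all |models|^|modules| assignments."""
    assignment = dict(initial)
    best = evaluate_pipeline(assignment)
    for _ in range(max_rounds):
        improved = False
        for module in MODULES:
            for model in CANDIDATE_MODELS:
                trial = {**assignment, module: model}
                score = evaluate_pipeline(trial)
                if score > best:
                    assignment, best, improved = trial, score, True
        if not improved:
            break
    return assignment

start = {m: "model-a" for m in MODULES}
print(greedy_select(start))   # converges to the per-module best under the toy scorer
```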

  • James Noh

    Partner @a16z | AI Infra & Cloud GTM | Stanford 🌲

    5,571 followers

    10 based lessons learned from Baseten's inference book by Philip Kiely:

    1️⃣ Start with shared inference, and move to dedicated later. Use pay-per-token APIs while finding PMF. Switch to dedicated GPUs only when three factors align: scale (volume makes per-GPU cheaper), specialization (custom models or tight latency needs), and orchestration (multi-model pipelines).

    2️⃣ Inference optimization is about constraints, not maximization. The best inference systems are specific, not general. They're tuned for a particular model, product, and traffic pattern. More constraints = higher performance.

    3️⃣ The KV cache is arguably the most important resource in LLM inference. Without it, attention is quadratic and inference would be unbearably slow. Managing where it lives, how much to allocate, and when to reuse it across requests is critical.

    4️⃣ Tensor Parallelism (TP) is the default multi-GPU strategy; Expert Parallelism (EP) improves throughput for MoE. TP splits tensor ops across GPUs for lower per-user latency but needs high-bandwidth NVLink. EP assigns experts to specific GPUs, reducing networking overhead and scaling better across nodes. Many deployments combine both.

    5️⃣ Memory bandwidth, not compute, bottlenecks most LLM serving. Decode is memory-bound because model weights load for every generated token. Any optimization that reduces memory movement (quantization, kernel fusion, PagedAttention) directly improves TPS.

    6️⃣ Inference engines are tools for abstraction. vLLM (broadest support), SGLang (strong MoE & customization), and TensorRT-LLM (highest performance but steepest learning curve) all feature continuous batching, quantization, speculative decoding, prefix caching, parallelism, and disaggregation. Choose one based on your model, hardware, and engineering bandwidth.

    7️⃣ Measure latency in percentiles, not averages, because LLM latency is right-skewed. A P99 that is 5x the median will ruin the user experience even when the average looks fine. Track P50, P90, and P99 for both TTFT & TPS.

    8️⃣ Long cold start times → inaccurate scale-down → compute over-provisioning. Cold starts pile up across GPU procurement, container loading, weight loading, and engine compilation. Quantizing weights, caching compiled engines, and loading from low-latency local storage all chip away at this.

    9️⃣ Image generation models are composable. Every image system is a pipeline of a text encoder, denoising model, and VAE, each individually optimizable. Unlike LLMs, image inference is compute-bound and the quality-speed tradeoff is direct: fewer denoising steps = faster but worse output.

    🔟 Client code is the often-overlooked last mile. TLS handshakes, session management, and protocol choice eat into the latency budget. Reuse HTTP sessions, use WebSockets for real-time audio, and gRPC for structured service-to-service calls. For bulk workloads, async jobs with webhook callbacks unlock throughput that sync requests can't match.

    🔗 Link to Phil's full free book in comments ⬇️
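
Lesson 7 is easy to operationalize: record per-request TTFT and decode speed, then report percentiles instead of means. A small pure-Python sketch; the sample lists below are placeholder data standing in for whatever your client actually measures.

```python
import math
import statistics

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile; good enough for monitoring dashboards."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def report(name: str, samples: list[float]) -> None:
    print(
        f"{name}: p50={percentile(samples, 50):.2f}  "
        f"p90={percentile(samples, 90):.2f}  p99={percentile(samples, 99):.2f}  "
        f"(mean={statistics.mean(samples):.2f})"
    )

# Placeholder samples: in practice, record these per request from your client,
# e.g. time-to-first-token in seconds and decode speed in tokens/second.
ttft_seconds = [0.21, 0.19, 0.25, 0.22, 1.40, 0.20, 0.23, 0.18, 0.95, 0.24]
tokens_per_second = [42.0, 45.1, 40.7, 43.9, 12.3, 44.2, 41.8, 46.0, 15.8, 43.3]

report("TTFT (s)", ttft_seconds)   # right-skewed: the mean hides the slow tail
report("TPS", tokens_per_second)
```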

  • Aman Gupta

    AI leader @ Nubank | Prev AI research @ Amazon, Apple, LinkedIn | LLMs, optimization

    6,951 followers

    🚀 New Paper Alert! Excited to share our latest paper, "Efficient AI in Practice: Training and Deployment of Efficient LLMs for Industry Applications", now available on arXiv: https://lnkd.in/dZrUEGqD

    Large Language Models (LLMs) have unlocked incredible capabilities across AI applications, from search and recommendations to generative tasks. However, their sheer size and computational cost often make them impractical for real-world deployment at scale. In this work, we explore techniques to train and deploy Small Language Models (SLMs) that retain much of the power of their larger counterparts while being significantly more efficient.

    🔍 Key Contributions:
    ✅ Knowledge Distillation: We efficiently transfer knowledge from large models to smaller ones, ensuring strong task performance, and demonstrate the effectiveness of several flavors of distillation: on-policy, supervised, and SeqKD.
    ✅ Model Compression (Pruning & Quantization): We apply structured pruning (OSSCAR) and quantization (GPTQ, QuantEase, FP8) to drastically reduce model size while maintaining accuracy.
    ✅ Real-World Deployment at LinkedIn: We showcase how we deploy SLMs for ranking, recommendation, and reasoning tasks at LinkedIn, achieving 20× model compression with minimal accuracy loss.
    ✅ Serving Optimizations: We detail inference speedups, leveraging techniques like RadixAttention, FlashInfer, and tensor parallelism on NVIDIA H100 GPUs to optimize latency and throughput.

    Key Results:
    📉 20× reduction in model size with minimal accuracy loss
    ⚡ 40% improvement in attention latency through structured pruning
    🚀 Significant serving speedup with FP8 quantization and prefix caching

    This work is a step toward making LLMs more efficient, scalable, and production-ready for industry use cases. We hope it helps others looking to deploy high-performance AI at scale!

    A huge shoutout to my incredible co-authors at LinkedIn for their contributions: Qingquan Song, Kayhan Behdin, Yun Dai, Ata Fatahi, Shao Tang, Hejian Sang, Gregory Dexter, Sirou Z., Jason (Siyu) Zhu, Tejas Dharamsi, Maziar Sanjabi, Vignesh Kothapalli, Hamed Firooz, Zhoutong Fu, Yihan Cao, Pin-Lun (Byron) Hsu, Fedor Borisyuk, Zhipeng (Jason) Wang, PhD, Rahul Mazumder, Natesh Pillai, and Luke Simon.

    Special thanks to our leadership, Zhipeng (Jason) Wang, PhD, Xiaobing Xue, Necip Fazil Ayan, and Deepak Agarwal, for empowering us and helping us push the envelope!

    #AI #MachineLearning #LLMs #Efficiency #AIatScale #DeepLearning #KnowledgeDistillation #Pruning #Quantization #Deployment
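
For readers who want a feel for the distillation piece, below is a generic supervised logit-matching distillation loss in PyTorch. This is the textbook recipe rather than the paper's specific on-policy/SeqKD setups, and the temperature and weighting values are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend a KL term that matches the student's softened distribution to the
    teacher's with the usual cross-entropy on gold labels."""
    # Soft targets from the teacher at temperature T.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature**2

    # Hard-label supervision on the same batch.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Toy shapes: batch of 8 examples, vocab/class size 100.
student = torch.randn(8, 100, requires_grad=True)
teacher = torch.randn(8, 100)
gold = torch.randint(0, 100, (8,))
loss = distillation_loss(student, teacher, gold)
loss.backward()
print(float(loss))
```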
