"Your LLM inference is running out of GPU memory with long conversations. How do you fix it without losing performance?" Instant Thought : "Buy more GPUs" or "truncate context." ❌ The real answer: It's not model weights — it's the KV cache. 👉 KV cache grows linearly with tokens. 👉 A 7B model with 8K context = ~4GB KV cache alone. 👉 Idle users = idle GPU memory = wasted $$$. The secret to scaling isn’t bigger GPUs — it’s tiered cache offloading. GPU → CPU RAM → SSD → Distributed storage (based on access patterns) Reuse cache for 14x faster time-to-first-token (vs recomputing) Handle multi-user sessions without OOM errors 💡 “Keep everything in GPU until OOM.” 💡 “Tiered offloading with LMCache.” Scaling LLMs = 80% memory management, 20% compute. Offload smart. Serve more. #LLM #MachineLearning #Inference #GPU #PerformanceOptimization #AI #MLOps #KVCache #LLMScaling For details: https://lnkd.in/gVghhBYy
"LLM inference memory issue: how to fix it without GPUs"
Your CPUs can't keep up. Too much data. Too much demand. GPUs are the way. Follow along for more on how GPU-accelerated compute offers answers to the biggest problems in AI and Analytics compute infrastructure.
For more than half a century, computing has relied on the Von Neumann or Harvard model. Nearly every modern chip — CPUs, GPUs and even many specialized accelerators — derives from this design. Over time, new architectures like Very Long Instruction Word (VLIW), dataflow processors and GPUs were introduced to address specific performance bottlenecks, but none offered a comprehensive alternative to the paradigm itself.

A new approach called Deterministic Execution challenges this status quo. Instead of dynamically guessing what instructions to run next, it schedules every operation with cycle-level precision, creating a predictable execution timeline. This enables a single processor to unify scalar, vector and matrix compute — handling both general-purpose and AI-intensive workloads without relying on separate accelerators.

https://lnkd.in/ge3sBkMN

#BeyondVonNeumann #TowardAUnified #DeterministicArchitecture
GPUDirect Storage (GDS): what it is, how it works, and when it helps

GPUDirect Storage (GDS) lets storage devices (local NVMe or networked NVMe-oF / certain parallel filesystems) DMA data directly to/from GPU memory. That skips the traditional bounce buffer in CPU RAM, cutting latency, freeing CPU cycles, and raising end-to-end throughput.

✴️ Why you'd want this
If your pipeline already runs on the GPU, moving data GPU↔storage via CPU RAM is wasted work. GDS replaces:
Storage → CPU RAM → memcpy → GPU
with:
Storage → DMA → GPU memory (and back)
Two gains: more bandwidth into GPU memory, and lower CPU utilization/latency during I/O.

✴️ How it works
When your app uses the cuFile library, the GDS stack does three things:
✅ Pins GPU memory & sets up mappings so other devices can DMA to it safely (through the NVIDIA kernel driver + IOMMU).
✅ Opens the file in direct I/O mode (O_DIRECT) so the storage stack can bypass the page cache and target the GPU buffer.
✅ Hands the storage device (or storage NIC) a DMA description that points straight at that GPU buffer.

✴️ Data path (local NVMe example):
[ NVMe SSD ] --- PCIe ---> [ Root Port / Switch ] --- PCIe ---> [ GPU HBM ]
(no copy through CPU RAM)

✴️ Data path (remote storage, NVMe-oF over RoCE/IB):
[ NVMe-oF target ] --- (RDMA over IB/RoCE) ---> [ NIC ] --- PCIe ---> [ GPU HBM ]
The CPU still issues syscalls and sets up the transfer, but the payload bytes do not detour through system memory.

✴️ What you need (prereqs)
✅ NVIDIA GPU + driver + CUDA versions supported by GDS; the nvidia-fs kernel module (part of GDS) must be loaded.
✅ Linux with a compatible kernel and MOFED (for RDMA paths).
✅ Storage that supports direct paths.
✅ An O_DIRECT-friendly path: best results when the filesystem honors O_DIRECT (aligned I/O, DMA-capable).

✴️ What performance looks like
✅ Higher sustained throughput into GPU memory (no CPU copy limits).
✅ Lower CPU utilization (the CPU does control-plane work only).
✅ Lower latency (fewer hops).

✴️ Where it shines
✅ LLM training & inference with massive corpora or KV-cache paging: move tensors/checkpoints directly into HBM.
✅ Data analytics / ETL on GPUs (RAPIDS/cuDF) where I/O throttles compute.
✅ Media & vision pipelines (GPU decode/encode + pre/post-processing) streaming from NVMe arrays.
✅ Scientific I/O (HDF5/Parquet/NetCDF) when staging through CPU memory dominates wall time.

✴️ Bottom line
If your GPUs are compute-bound after the data arrives, GDS won't matter. But if I/O is the long pole, GDS can turn storage into a first-class peer of the GPU, boosting bandwidth per watt and lowering latency by skipping CPU RAM. Start with a single node (local NVMe + one GPU), validate with NVIDIA's samples, then scale to NVMe-oF / a parallel FS once you have clean baselines.
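If you want to try the Storage → DMA → GPU path from Python, the RAPIDS kvikio library wraps the cuFile API. The snippet below is a minimal sketch, assuming kvikio and CuPy are installed, the nvidia-fs module is loaded, and the file sits on a GDS-capable filesystem; the path and transfer size are placeholders, and kvikio can fall back to an ordinary copy path when GDS is unavailable:

```python
# Minimal GPUDirect Storage read via kvikio (RAPIDS Python wrapper around cuFile).
# Assumes kvikio + CuPy installed and a GDS-capable path; file path below is hypothetical.
import cupy as cp
import kvikio

nbytes = 256 * 1024 * 1024                # 256 MiB, arbitrary demo size
buf = cp.empty(nbytes, dtype=cp.uint8)    # destination buffer lives in GPU memory

f = kvikio.CuFile("/mnt/nvme/shard-000.bin", "r")
read = f.read(buf)                        # DMA straight into the GPU buffer, no CPU bounce buffer
f.close()

print(f"read {read} bytes into device memory")
```

This mirrors the prose above: the CPU only orchestrates the call; the payload lands in HBM directly.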
Simplex Micro's CEO has published a new article in VentureBeat, outlining his vision for a unified deterministic RISC-V computing architecture. https://lnkd.in/gjftcSqD

For more than 50 years, processor design has been bound by the Von Neumann model — from CPUs and GPUs to specialized AI accelerators. Even as we added layers of complexity through speculation, prediction, and out-of-order execution, performance often came at the expense of efficiency and predictability.

In this new VentureBeat piece, he introduces a bold new paradigm: Deterministic Execution — a cycle-accurate approach that eliminates speculation and unifies scalar, vector, and matrix compute under a single deterministic scheduler. By orchestrating compute and memory with precise timing, we can achieve higher throughput, lower power, and simpler hardware — forming the foundation for the next generation of AI and general-purpose processors that no longer need separate CPU, GPU, or neural engines. https://lnkd.in/gjftcSqD

#DeterministicExecution #RISCV #AIHardware #Architecture #SimplexMicro #VentureBeat
🚀 Save the GPU Cost Crisis Today!!!

Headaches from LLMs that lock up a whole GPU but leave most of its capacity idle? Frustrated by your GPU cluster's low utilization?

We built kvcached (KV cache daemon), an open-source library that improves GPU cluster utilization when serving LLMs.

🧩 What it does:
kvcached enables elastic GPU sharing for LLM inference by virtualizing the KV cache. With kvcached, each LLM uses only the GPU memory it actually needs, instead of aggressively reserving a large static allocation in advance.

⚙️ Why it matters:
– 🚫 Eliminates static GPU memory reservation, improving resource utilization
– 🧠 Enables multiple workloads to flexibly run on shared GPUs
– ⚡ Allows finer-grained and more rapid autoscaling in serverless LLM serving
– 🚀 Achieves 1.2×–28× faster time-to-first-token in multi-LLM serving

🌐 kvcached is compatible with mainstream LLM inference engines including sgl-project and vLLM. Try it with one command now: https://lnkd.in/dArTKvnr
Read more in our deep-dive blog post: 📄 https://lnkd.in/gxK8N5QT

kvcached represents our first step toward a GPU operating system — a vision where compute and memory are dynamically shared across models, workloads, and even users.

This project is a joint effort led by Berkeley's Sky Computing Lab (University of California, Berkeley) in close collaboration with Rice University and UCLA, and with valuable input from collaborators and colleagues at NVIDIA, Intel Corporation, Stanford University, and the sgl-project and vLLM communities.

👏 Incredibly grateful to our amazing team: Jiarong Xing, Yifan Qiao, Shan Yu, Xingqi Cui, Mingyuan MA, Yangmin Li, Xinyuan Tong, Yang Wang
We especially thank our advisors, Joseph Gonzalez and Ion Stoica, for their guidance and insightful feedback. We thank everyone who shared feedback, ideas, and support throughout the project's development.

We're warmly inviting collaborators from both academia and industry to join us in building the foundations of elastic GPU infrastructure. Let's make GPUs as flexible, efficient, and shared as CPUs. 💪

#LLMServing #KVCache #GPUOS #GPUSharing #GPUVirtualization #SystemsResearch #DeepLearningInfrastructure #OpenSource #Berkeley #SkyComputing #vLLM #SGLang
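To make the "only the memory it actually needs" idea concrete, here is a toy Python illustration of the accounting difference between a static reservation and a virtualized, demand-backed KV cache. This is not kvcached's actual API (that lives behind the links above); the block size and per-token cost are assumptions for the example:

```python
# Toy illustration of virtualized KV-cache accounting (NOT the kvcached API).
# A static reservation charges for max_tokens up front; the virtualized cache
# only charges for blocks that live requests actually touch.
BLOCK_TOKENS = 16
BYTES_PER_TOKEN = 512 * 1024          # ~0.5 MiB/token for a 7B-class model (assumption)

class VirtualKVCache:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens  # virtual reservation: costs no HBM yet
        self.mapped_blocks = 0        # physical blocks actually backed in HBM

    def append_tokens(self, n: int):
        self.mapped_blocks += -(-n // BLOCK_TOKENS)   # ceil-divide, map on demand

    def physical_bytes(self) -> int:
        return self.mapped_blocks * BLOCK_TOKENS * BYTES_PER_TOKEN

llm_a, llm_b = VirtualKVCache(32_768), VirtualKVCache(32_768)
llm_a.append_tokens(1_200)                            # only model A is busy right now

static_gib = 2 * 32_768 * BYTES_PER_TOKEN / 1024**3   # both models reserve max context
ondemand_gib = (llm_a.physical_bytes() + llm_b.physical_bytes()) / 1024**3
print(f"static reservation: {static_gib:.1f} GiB, on-demand: {ondemand_gib:.2f} GiB")
```

The gap between those two numbers is the idle HBM that elastic sharing hands back to co-located models.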
So excited to release kvcached 🔥 GPUs should not sit idle when workloads subside. We built kvcached to make GPUs elastic, efficient, and shareable across LLMs. No more wasting HBM on static and idle KV cache. This is only the beginning of our journey toward a GPU operating system for dynamic and efficient AI infrastructure. Please stay tuned for more updates. I am grateful to work with such an amazing team from Berkeley’s Sky Computing Lab (University of California, Berkeley), Rice University, and UCLA. Special thanks to our advisors Joseph Gonzalez and Ion Stoica for their guidance and support.
⚡ Surviving Beyond GPU Memory: Why Pooled NVMe Beats the VRAM Dream (for Now)

Act 1 — The GPU Fleet Problem
GPUs are compute monsters. FLOPs everywhere. But in enterprise data processing, the real problem isn't FLOPs. It's that we can't manage a fleet of GPUs as smartly as we do CPUs.

With CPUs:
OS schedulers and caches balance compute and memory.
Distributed frameworks (such as Spark, Databricks, and MPI) spill data gracefully and pool resources.
The system remains stable even when workloads exceed RAM capacity.

With GPUs:
Each GPU is a silo — 80–120 GB of HBM, insanely fast but brutally finite.
FLOPs are tied to memory locality. If it doesn't fit, performance collapses.
Multi-GPU adds FLOPs, but doesn't pool memory unless you're in exotic NVSwitch systems.

The result: the HBM cliff. Once you hit it, your workload doesn't slow down — it crashes. For BFSI workloads like tensor decompositions of customer × product × agent × outcome datasets, this isn't rare. It's the norm.

Act 2 — The VRAM Dream vs. the NVMe Reality
There is a dream solution: pooled VRAM. In DGX-class systems with NVSwitch, multiple GPUs can share a single, flat address space for memory. In theory, that solves the HBM silo problem. In practice? It's rare, expensive, and still finite. Once you blow past pooled VRAM (say, 640 GB across 8 GPUs), you're still dead. And outside those premium boxes, you don't get this at all.

So what's the practical bridge? Pooled NVMe over RDMA. NVMe disaggregated into a fabric pool, accessible directly by GPUs via GPUDirect Storage.

HBM: ~3 TB/s, ~100 ns latency.
NVMe over RDMA: 10–40 GB/s, ~5–20 µs latency.
CPU spill: 1–5 GB/s, ms latency.

On paper, NVMe is "100× slower than HBM." But the proper comparison isn't against HBM — it's against CPU bounce buffers or Databricks-style disk spill. There, RDMA-backed NVMe is an order of magnitude faster. And, critically, it keeps the job alive.

👉 HBM = lungs.
👉 NVMe via RDMA = oxygen tank.
👉 Together = finish the race instead of collapsing at the memory cliff.

Act 3 — Why This Matters for BFSI
Financial services workloads don't politely fit into 80 GB chunks:
Retention analytics across millions of policies.
Cross-product risk exposure models.
Real-time drift detection in high-dimensional tensors.

These workloads aren't nice-to-have. They are the frontier of competitive analytics. And they only deliver if the system can survive its own data.

#tPower #ml #ai #GenAI4FS #inc81starch
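To put the "oxygen tank" trade-off in numbers, here is a small sketch that uses the bandwidth figures quoted above to estimate how long the overflow of a working set takes to move through each spill tier. The working-set size and HBM capacity are invented for illustration:

```python
# Rough spill-time comparison using the bandwidths quoted in the post.
# Working-set size and per-GPU HBM capacity below are illustrative assumptions.
TIERS_GBPS = {
    "HBM (in-memory)":        3000,  # ~3 TB/s
    "Pooled NVMe over RDMA":    25,  # 10-40 GB/s, midpoint
    "CPU bounce-buffer spill":   3,  # 1-5 GB/s, midpoint
}

working_set_gb = 500          # e.g. a large tensor-decomposition intermediate
hbm_gb = 80                   # a single GPU's HBM

overflow_gb = max(0, working_set_gb - hbm_gb)
for tier, gbps in TIERS_GBPS.items():
    seconds = overflow_gb / gbps
    print(f"{tier:>24}: {seconds:7.1f} s to move the {overflow_gb} GB overflow once")
```

The point isn't that NVMe approaches HBM; it's that at ~25 GB/s the job finishes in seconds per pass instead of minutes per pass over a CPU spill path, and instead of not finishing at all.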
“By predicting exactly when data will arrive — whether in 10 cycles or 200 — Deterministic Execution can slot dependent instructions into the right future cycle. This turns latency from a hazard into a schedulable event, keeping the execution units fully utilized and avoiding the massive thread and buffer overheads used by GPUs or custom VLIW chips.” #chipdesign #cpus #gpus
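A tiny scheduling sketch of that idea: if every producer's latency is known at compile time (say, 10 cycles for a cache hit, 200 for a DRAM miss), a dependent instruction can simply be placed in the cycle its operands arrive, rather than waiting in issue queues or being hidden behind extra hardware threads. The latencies and instruction stream below are invented for illustration, not taken from any real ISA:

```python
# Toy cycle-accurate scheduler: place each op in the first cycle at which
# all of its operands are guaranteed (not predicted) to be ready.
LATENCY = {"load": 200, "mul": 4, "add": 2}   # assumed compile-time-known latencies

# (result name, op, operand names) in program order
program = [
    ("a", "load", []),
    ("b", "load", []),
    ("c", "mul",  ["a", "b"]),
    ("d", "add",  ["c", "a"]),
]

ready_at = {}  # cycle at which each result becomes available
for name, op, deps in program:
    issue = max((ready_at[d] for d in deps), default=0)   # slot into the arrival cycle
    ready_at[name] = issue + LATENCY[op]
    print(f"{name}: issue @ cycle {issue:4d}, result ready @ cycle {ready_at[name]:4d}")
```

Because arrival cycles are deterministic, the latency of the loads becomes a schedulable offset rather than a hazard the hardware has to speculate around.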
I've always been a big fan of running SLMs on CPU, and as models, hardware, and inference libraries continue to improve, this option becomes more feasible for production workloads. In my latest blog, Julien SIMON and I showcase how you can optimize SLM inference on Intel processors utilizing Intel-optimized inference libraries. Check it out and let us know your thoughts! https://lnkd.in/g5N4NzbH
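The blog's exact setup is behind the link, but as one representative Intel-optimized path, Hugging Face's optimum-intel package can export a small model to OpenVINO and run generation on the CPU. A minimal sketch; the model ID and generation settings are placeholders and not necessarily what the post uses:

```python
# One common way to run a small LM on an Intel CPU with an Intel-optimized runtime
# (optimum-intel + OpenVINO). Model ID and settings are placeholders.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"          # placeholder SLM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id, export=True)  # convert to OpenVINO IR

inputs = tokenizer("Why do SLMs fit CPU serving?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```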
Microsoft has open-sourced bitnet.cpp, a blazing-fast 1-bit LLM inference framework optimized for CPUs — and it's a big deal for local AI compute. This could redefine how we think about running large models without expensive GPUs or cloud dependencies.

Key highlights:
* Up to 6x faster inference with 82% lower energy consumption
* 100B-parameter models running directly on x86 CPUs (via a kernel-throughput demo)
* Ternary weights (-1, 0, +1) + 8-bit activations for huge memory savings

Alongside this, Microsoft also released BitNet b1.58 2B4T, the first open-source model using just 1.58 bits per weight — and it still performs impressively on benchmarks.

If you care about efficient AI at scale, this is absolutely worth a look. The efficiency gains are real, though the "100B on CPU" demo was with dummy parameters (~5–7 t/s). The currently usable model is 2B4T — but the direction is clear. The era of efficient, low-bit AI might be closer than we think.

GitHub: https://lnkd.in/gi6R8ptP
Paper: https://lnkd.in/gzASgUaQ

#AI #LLM #BitNet #OpenSource #EdgeAI #EfficientAI #Microsoft #MachineLearning #DeepLearning #AIResearch #GPU
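For intuition on what "ternary weights (-1, 0, +1)" means, here is a small NumPy sketch of the absmean quantization described in the BitNet b1.58 paper. It is independent of bitnet.cpp's actual kernels and only illustrates the weight representation:

```python
# Absmean ternary quantization as described in the BitNet b1.58 paper:
# scale by the mean absolute weight, round, clip to {-1, 0, +1}.
import numpy as np

def quantize_ternary(w: np.ndarray):
    scale = np.mean(np.abs(w)) + 1e-8          # per-tensor absmean scale
    w_q = np.clip(np.round(w / scale), -1, 1)  # ternary weights in {-1, 0, +1}
    return w_q.astype(np.int8), scale

w = np.random.randn(4, 8).astype(np.float32)
w_q, scale = quantize_ternary(w)
w_hat = w_q * scale                            # dequantized approximation used in matmuls
print("unique weight values:", np.unique(w_q)) # -> [-1  0  1]
print("reconstruction MSE:", float(np.mean((w - w_hat) ** 2)))
```

Because each weight carries only ~1.58 bits of information and matmuls reduce to additions, subtractions, and skips, both memory footprint and energy per token drop sharply, which is where the CPU-friendly numbers above come from.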