Quantization Techniques for Long Context LLMs

Summary

Quantization techniques for long context LLMs are methods for shrinking the memory footprint and speeding up large language models, especially when processing long sequences of text. By compressing model weights and caches, these approaches allow LLMs to run faster, use less hardware, and support bigger workloads without sacrificing quality.

  • Apply precision slicing: Methods that let a single model operate at multiple bit widths make it easier to match performance needs to available hardware without retraining.
  • Preserve similarity scores: Compressing vectors while maintaining unbiased inner products ensures the model continues to understand context and relationships accurately even at low memory usage.
  • Streamline deployment: Switching to efficient quantization algorithms can reduce memory bottlenecks, enabling longer context windows and quicker response times on existing GPUs.
Summarized by AI based on LinkedIn member posts
  • View profile for Zain Hasan

    I build and teach AI | AI/ML @ Together AI | EngSci ℕΨ/PhD @ UofT | Previously: Vector DBs, Data Scientist, Lecturer & Health Tech Founder | 🇺🇸🇨🇦🇵🇰

    19,605 followers

    The researchers at Google DeepMind just introduced "Matryoshka Quantization" (MatQuant), a clever new technique that could make deploying large language models much more efficient. The key insight? Rather than creating separate models for different quantization levels (int8, int4, int2), MatQuant leverages the nested "Matryoshka" structure naturally present in integer data types. Think of it like Russian nesting dolls: the int2 representation is nested within int4, which is nested within int8.

    Here are the major innovations:

    1. Single Model, Multiple Precisions
    >> MatQuant trains one model that can operate at multiple precision levels (int8, int4, int2)
    >> You can extract lower precision models by simply slicing the most significant bits (see the sketch after this post)
    >> No need to maintain separate models for different deployment scenarios

    2. Improved Low-Precision Performance
    >> Int2 models extracted from MatQuant are up to 10% more accurate than standard int2 quantization
    >> This is a huge breakthrough since int2 quantization typically severely degrades model quality
    >> The researchers achieved this through co-training and co-distillation across precision levels

    3. Flexible Deployment
    >> MatQuant enables "Mix'n'Match" - using different precisions for different layers
    >> You can interpolate to intermediate bit-widths like int3 and int6
    >> This allows fine-grained control over the accuracy vs. efficiency trade-off

    The results are impressive. When applied to the FFN parameters of Gemma-2 9B:
    >> Int8 and int4 models perform on par with individually trained baselines
    >> Int2 models show significant improvements (8%+ better on downstream tasks)
    >> Remarkably, an int2 FFN-quantized Gemma-2 9B outperforms an int8 FFN-quantized Gemma-2 2B

    This work represents a major step forward in model quantization, making it easier to deploy LLMs across different hardware constraints while maintaining high performance. The ability to extract multiple precision levels from a single trained model is particularly valuable for real-world applications. Looking forward to seeing how this technique gets adopted by the community and what further improvements it enables in model deployment efficiency! Let me know if you'd like me to elaborate on any aspect of the paper. I'm particularly fascinated by how they managed to improve int2 performance through the co-training approach. https://lnkd.in/g6mdmVjx
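
A minimal sketch of the bit-slicing idea from the post above, assuming unsigned 8-bit weight codes; the helper name `slice_msb` is illustrative, and the paper's actual codebook handling and co-training setup are more involved.

```python
import numpy as np

def slice_msb(codes_int8: np.ndarray, target_bits: int) -> np.ndarray:
    """Keep the `target_bits` most significant bits of each 8-bit weight code,
    yielding the nested lower-precision representation."""
    shift = 8 - target_bits
    return (codes_int8.astype(np.uint8) >> shift).astype(np.uint8)

# Toy example: one row of int8-quantized weight codes.
w8 = np.array([231, 17, 128, 64], dtype=np.uint8)
w4 = slice_msb(w8, 4)  # 231 (0b11100111) -> 14 (0b1110)
w2 = slice_msb(w8, 2)  # 231 -> 3 (0b11)
print(w4, w2)
```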

  • View profile for Kuldeep Singh Sidhu

    Senior Data Scientist @ Walmart | BITS Pilani

    16,024 followers

    Exciting breakthrough in extreme low-bit quantization for Large Language Models! The good folks at Microsoft have developed VPTQ (Vector Post-Training Quantization), a novel approach to LLM compression. They reduced model quantization perplexity by 0.01-0.34 on LLaMA-2, 0.38-0.68 on Mistral-7B, and 4.41-7.34 on LLaMA-3 over the SOTA at 2-bit. On paper, this looks extremely interesting.

    VPTQ (Vector Post-Training Quantization), GPTQ (Generative Pre-trained Transformer Quantization), and AWQ (Activation-Aware Weight Quantization) are all post-training quantization methods for large language models, but they differ in their approaches and performance characteristics. VPTQ uses Second-Order Optimization and Channel-Independent Second-Order Optimization to achieve extreme low-bit quantization (down to 2 bits) while maintaining competitive accuracy and inference speed. It outperforms GPTQ in terms of accuracy and compression ratio, especially at very low bit widths. GPTQ uses a one-shot weight quantization method based on approximate second-order information, achieving good results at 4 bits but struggling at lower precisions. AWQ, on the other hand, focuses on identifying and preserving critical weights during quantization, resulting in faster inference than GPTQ and sometimes better perplexity, though at the cost of slightly higher VRAM usage. Overall, VPTQ appears to offer the best balance of compression, accuracy, and speed, particularly for extreme low-bit scenarios.

    Key Steps for Implementing Vector Post-Training Quantization (VPTQ) for Large Language Models:

    1. Formulate the quantization problem:
    - Use Second-Order Optimization to guide the quantization algorithm design.
    - Employ Channel-Independent Second-Order Optimization for granular vector quantization.

    2. Initialize centroids (a simplified sketch follows this post):
    - Implement Hessian-Weighted Centroid Initialization.
    - Solve it as a Weighted K-means Clustering problem.

    3. Quantize the model weights:
    - Iterate through each layer of the model.
    - For each Linear operator:
      a. If outlier elimination is enabled, quantize outlier weights first.
      b. Initialize centroids for remaining weights.
      c. Apply the VPTQ algorithm to quantize weights.
      d. If residual quantization is enabled, quantize the residual error.

    4. Implement Residual Vector Quantization (optional):
    - Use multiple stages to further compress residual errors.
    - Employ separate lookup tables for each stage.

    5. Apply outlier elimination (optional).

    6. Perform layer-wise fine-tuning:
    - Fine-tune centroids and layer normalization parameters.
    - Use a small calibration dataset (e.g., 128 samples from C4).

    7. Optimize for inference:
    - Implement efficient dequantization by reading centroids from codebooks.
    - Fuse dequantization and matrix multiplication operations where possible.

    VPTQ enables extreme compression of LLMs while maintaining remarkable accuracy, paving the way for more efficient deployment and inference of these powerful models.
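
A simplified sketch of the weighted k-means centroid initialization in step 2 above, assuming a scalar importance weight per sub-vector stands in for the Hessian information; VPTQ's channel-independent second-order formulation, outlier handling, and residual stages are not reproduced here, and all names are illustrative.

```python
import numpy as np

def weighted_kmeans_centroids(vectors, weights, k, iters=20, seed=0):
    """Importance-weighted k-means: each centroid is the weighted mean of the
    sub-vectors assigned to it."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign every sub-vector to its nearest centroid.
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Re-estimate each centroid as the importance-weighted mean of its members.
        for c in range(k):
            members = assign == c
            if members.any():
                w = weights[members][:, None]
                centroids[c] = (w * vectors[members]).sum(axis=0) / w.sum()
    return centroids

# Usage: build a 16-entry codebook (4 bits) over 2-wide weight sub-vectors.
rng = np.random.default_rng(1)
vecs = rng.standard_normal((1024, 2)).astype(np.float32)
importance = np.abs(rng.standard_normal(1024)).astype(np.float32)  # stand-in for Hessian weights
codebook = weighted_kmeans_centroids(vecs, importance, k=16)
codes = np.linalg.norm(vecs[:, None, :] - codebook[None, :, :], axis=-1).argmin(axis=1)
```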

  • View profile for Sohrab Rahimi

    Director, AI/ML Lead @ Google

    23,606 followers

    Google solved one of the most important constraints of language models by reducing how much memory they need to run, which directly translates into freed compute capacity and lower serving cost. The paper “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate” from Google Research focuses on a simple observation. LLMs do not rely on exact vector values during inference. They rely on inner products between vectors. If those similarities are preserved, the model behaves the same.

    The design follows from that. First, vectors are transformed to make compression tractable. A random rotation makes each dimension behave almost independently. That removes the need for complex, data-specific quantization schemes and allows each coordinate to be compressed separately with minimal loss (a rotate-then-quantize sketch follows this post). Second, the method fixes what compression breaks. Standard quantization distorts inner products, which directly affects attention. TurboQuant isolates that distortion and encodes it using a 1-bit correction signal based on a Quantized Johnson–Lindenstrauss transform. This restores unbiased similarity calculations with negligible overhead.

    The key architectural move is separation. One stage compresses efficiently. The other guarantees correctness of interactions between vectors. That is why the system can push compression without degrading performance.

    The empirical result is what matters. The KV cache can be compressed by more than 5x while maintaining the same downstream accuracy on long-context tasks. This brings compression close to the theoretical limit for this problem.

    Now the practical impact. Take a RAG system with 10M documents embedded at 1536 dimensions. Stored in float16, that is roughly 30 GB of vector data. With 5x to 8x compression, that drops to around 4–6 GB. The entire index fits in GPU memory. Retrieval becomes local, faster, and cheaper. On the generation side, the KV cache shrinks by the same factor. A system capped at 32k context due to memory can push toward 150k on the same hardware, or run several times more concurrent requests per GPU. In practice, a deployment that required 8 GPUs to serve long-context RAG queries can drop to 2–3 GPUs for the same throughput, or keep the hardware and scale traffic significantly. No retraining. No change to the model. Just a different way of encoding state during inference.

    This is a significant and impactful breakthrough. Memory is the bottleneck in modern LLM systems. If you compress it without breaking similarity, you unlock longer context, higher throughput, and materially lower cost at the same time.

    Blog: https://lnkd.in/ei3Nb5Vv
    Paper: https://lnkd.in/eyJ4Hf9U
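
A minimal rotate-then-quantize sketch of the first stage described above, assuming a generic orthogonal rotation from a QR decomposition and a single shared scale per vector; TurboQuant's per-coordinate Lloyd-Max quantizer and its 1-bit correction stage are not reproduced here.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random orthogonal matrix via QR (a stand-in for the fast structured
    rotations a production implementation would use)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def quantize_rotated(x, rot, bits=4):
    """Rotate the vector, then round every coordinate to a signed `bits`-bit
    code under one shared scale."""
    z = rot @ x
    half = 2 ** (bits - 1)
    scale = np.abs(z).max() / (half - 0.5) + 1e-12
    codes = np.clip(np.round(z / scale), -half, half - 1).astype(np.int8)
    return codes, scale

def dequantize(codes, scale, rot):
    """Undo the scaling and rotation to approximate the original vector."""
    return rot.T @ (codes.astype(np.float32) * scale)

# Toy check: inner products survive 4-bit compression reasonably well.
dim = 128
rot = random_rotation(dim)
rng = np.random.default_rng(1)
q_vec, k_vec = rng.standard_normal(dim), rng.standard_normal(dim)
codes, scale = quantize_rotated(k_vec, rot)
k_hat = dequantize(codes, scale, rot)
print(float(q_vec @ k_vec), float(q_vec @ k_hat))
```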

  • View profile for Asif Razzaq

    Founder @ Marktechpost (AI Dev News Platform) | 1 Million+ Monthly Readers

    35,056 followers

    Google Introduces TurboQuant: A New Compression Algorithm that Reduces LLM Key-Value Cache Memory by 6x and Delivers Up to 8x Speedup, All with Zero Accuracy Loss

    The biggest bottleneck in scaling LLMs isn't just compute—it’s the KV Cache. As context windows grow, memory communication between HBM and SRAM kills performance. Google’s new TurboQuant changes the game with a near-optimal, data-oblivious vector quantization framework.

    But why is it a breakthrough?
    - Data-Oblivious: No more slow k-means training on your dataset. It works instantly.
    - The Rotation Trick: It applies a random rotation to input vectors, inducing a concentrated Beta distribution on coordinates.
    - Optimal Scaling: It solves a continuous 1D k-means / Max-Lloyd problem per coordinate, achieving MSE distortion within a factor of ≈ 2.7 of the theoretical Shannon Lower Bound.
    - Unbiased Inner Products: By applying a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform to the residual, it eliminates the bias that usually plagues low-bit quantization (see the sketch after this post).

    The Results:
    (1) 4.5x Compression: Quality neutrality at 3.5 bits per channel.
    (2) 104k Context: Matched full-precision performance on "Needle-In-A-Haystack" tests under 4x compression.
    (3) Instant Indexing: Reduced vector database indexing time to virtually zero compared to traditional Product Quantization.

    Read the full analysis here: https://lnkd.in/gDRyFBC8
    Paper: https://lnkd.in/gyybbu3U
    Technical details: https://lnkd.in/gruFq7gD

    Google Google Research Amir Zandieh, PhD Majid Daliri Majid Hadian Vahab Mirrokni
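
A toy sketch of a 1-bit sign-sketch inner-product estimator in the spirit of QJL, using the Gaussian identity E[sign(s·y)(s·q)] = sqrt(2/π)·⟨q, y⟩/‖y‖; the sketch size and the way the paper applies this to the quantization residual are assumptions here, not the authors' exact construction.

```python
import numpy as np

def qjl_encode(y, proj):
    """Keep only the sign bits of Gaussian projections of y, plus its norm."""
    return np.sign(proj @ y), float(np.linalg.norm(y))

def qjl_inner_product(q, signs, norm, proj):
    """Unbiased estimate of <q, y> from the 1-bit sketch, via
    E[sign(s.y) * (s.q)] = sqrt(2/pi) * <q, y> / ||y|| for Gaussian s."""
    m = proj.shape[0]
    return norm * np.sqrt(np.pi / 2) * float(signs @ (proj @ q)) / m

# Toy check (assumption: m = 4096 one-bit measurements per cached vector).
dim, m = 128, 4096
rng = np.random.default_rng(0)
proj = rng.standard_normal((m, dim))
q, y = rng.standard_normal(dim), rng.standard_normal(dim)
signs, norm = qjl_encode(y, proj)
print(float(q @ y), qjl_inner_product(q, signs, norm, proj))
```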

  • View profile for Kartik Mathur

    Applied ML Scientist @ Microsoft | LLMs & AI Agents | Patent Inventor | AAAI Author & Program Committee | Hackathon Winner | MS CS @ University of Southern California | LLM Research & Inference, AI Agents

    3,491 followers

    This week, Google Research achieved nearly lossless compression of the KV Cache in frontier models. The paper is called TurboQuant, and it's being presented at ICLR 2026. Here's what it does and why it matters.

    Large language models keep a "cheat sheet" while they're thinking. It's called the KV Cache, short for Key-Value Cache. Instead of re-reading everything from scratch, the model stores recently used information there so it can retrieve it fast. The catch: this cheat sheet is enormous. It eats up memory, slows things down, and becomes a real bottleneck at scale.

    The obvious fix is compression. But traditional compression methods introduce their own overhead, adding 1 to 2 extra bits per number and partially canceling out the benefit. TurboQuant compresses the KV Cache down to just 3 bits per number, with no retraining required and no measurable loss in model accuracy.

    Two algorithms power this approach:
    • PolarQuant handles the heavy lifting. Instead of storing a vector using standard X/Y/Z coordinates, it converts it into polar coordinates (think: an angle plus a distance). A toy sketch of the idea follows this post.
    • QJL (Quantized Johnson-Lindenstrauss) mops up the residual error using just 1 bit. It acts as a mathematical error-checker that removes bias from the compressed approximation, keeping attention scores accurate.

    The KV Cache bottleneck is one of the main reasons long-context inference is expensive. But vector quantization is also the foundation of large-scale semantic search, the technology that lets search engines find meaning rather than just keywords. This is the kind of foundational infrastructure work that quietly makes everything else faster and cheaper.

    Blog: https://lnkd.in/g5NcAFQ5

    #KVCache #Quantization #LLMInference #AIResearch #GoogleResearch #MachineLearning #ICLR2026
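
A toy illustration of the polar-coordinate idea from the post above, assuming coordinates are grouped into 2D pairs and each pair's radius and angle are stored with small uniform codes; the pairing scheme and bit widths are illustrative guesses, not PolarQuant's actual design.

```python
import numpy as np

def polar_quantize_pairs(x, r_bits=4, theta_bits=4):
    """Group coordinates into 2D pairs and store each pair as a quantized
    radius code plus a quantized angle code."""
    pairs = x.reshape(-1, 2)
    radius = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])  # angle in (-pi, pi]
    r_scale = radius.max() / (2 ** r_bits - 1) + 1e-12
    r_codes = np.round(radius / r_scale).astype(np.uint8)
    t_scale = 2 * np.pi / 2 ** theta_bits
    t_codes = np.round((theta + np.pi) / t_scale).astype(np.uint8) % 2 ** theta_bits
    return r_codes, t_codes, r_scale, t_scale

def polar_dequantize_pairs(r_codes, t_codes, r_scale, t_scale):
    """Reconstruct an approximation of the original vector from the codes."""
    radius = r_codes.astype(np.float32) * r_scale
    theta = t_codes.astype(np.float32) * t_scale - np.pi
    pairs = np.stack([radius * np.cos(theta), radius * np.sin(theta)], axis=1)
    return pairs.reshape(-1)

# Toy check on a 128-dim vector: relative reconstruction error.
x = np.random.default_rng(0).standard_normal(128).astype(np.float32)
r_c, t_c, r_s, t_s = polar_quantize_pairs(x)
x_hat = polar_dequantize_pairs(r_c, t_c, r_s, t_s)
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```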
