The researchers at Google DeepMind just introduced "Matryoshka Quantization" (MatQuant), a clever new technique that could make deploying large language models much more efficient.

The key insight? Rather than creating separate models for different quantization levels (int8, int4, int2), MatQuant leverages the nested "Matryoshka" structure naturally present in integer data types. Think of it like Russian nesting dolls: the int2 representation is nested within int4, which is nested within int8.

Here are the major innovations:

1. Single Model, Multiple Precisions
>> MatQuant trains one model that can operate at multiple precision levels (int8, int4, int2)
>> You can extract lower-precision models by simply taking the most significant bits (see the sketch after this post)
>> No need to maintain separate models for different deployment scenarios

2. Improved Low-Precision Performance
>> Int2 models extracted from MatQuant are up to 10% more accurate than standard int2 quantization
>> This is a big deal, since int2 quantization typically degrades model quality severely
>> The researchers achieved this through co-training and co-distillation across precision levels

3. Flexible Deployment
>> MatQuant enables "Mix'n'Match" - using different precisions for different layers
>> You can interpolate to intermediate bit-widths like int3 and int6
>> This allows fine-grained control over the accuracy vs. efficiency trade-off

The results are impressive. When applied to the FFN parameters of Gemma-2 9B:
>> Int8 and int4 models perform on par with individually trained baselines
>> Int2 models show significant improvements (8%+ better on downstream tasks)
>> Remarkably, an int2 FFN-quantized Gemma-2 9B outperforms an int8 FFN-quantized Gemma-2 2B

This work represents a major step forward in model quantization, making it easier to deploy LLMs across different hardware constraints while maintaining high performance. The ability to extract multiple precision levels from a single trained model is particularly valuable for real-world applications.

Looking forward to seeing how this technique gets adopted by the community and what further improvements it enables in model deployment efficiency! Let me know if you'd like me to elaborate on any aspect of the paper. I'm particularly fascinated by how they managed to improve int2 performance through the co-training approach. https://lnkd.in/g6mdmVjx
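To make the "nesting dolls" idea concrete, here is a minimal numpy sketch of the slicing step only: it quantizes a weight vector to 8-bit codes, then derives int4 and int2 views by keeping the most significant bits. The uint8 codes and mid-point dequantization conventions are my own simplifying assumptions; the paper's actual contribution is co-training so that all the nested slices stay accurate.

```python
# Minimal sketch (not the DeepMind implementation): deriving nested int4/int2
# weights from int8 codes by keeping the most significant bits.
import numpy as np

def slice_msb(q_int8: np.ndarray, bits: int) -> np.ndarray:
    """Keep the top `bits` most significant bits of unsigned int8 codes."""
    assert 2 <= bits <= 8
    return (q_int8.astype(np.uint8) >> (8 - bits)).astype(np.uint8)

def dequantize(q: np.ndarray, bits: int, scale: float) -> np.ndarray:
    """Map b-bit codes back to real values, re-centering around zero (midpoint rule)."""
    zero = (2 ** bits - 1) / 2.0
    return (q.astype(np.float32) - zero) * scale * 2 ** (8 - bits)

# Toy example: one weight row and its nested int8 / int4 / int2 views.
rng = np.random.default_rng(0)
w = rng.normal(size=8).astype(np.float32)
scale = np.abs(w).max() / 127.5
q8 = np.clip(np.round(w / scale + 127.5), 0, 255).astype(np.uint8)

for b in (8, 4, 2):
    qb = slice_msb(q8, b)
    print(b, np.abs(dequantize(qb, b, scale) - w).mean())  # error grows as bits shrink
```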
How Quantization is Transforming Model Performance
Explore top LinkedIn content from expert professionals.
Summary
Quantization is a technique used in artificial intelligence to reduce the precision of numerical values within models, making them smaller and more efficient without sacrificing much accuracy. Recent advancements are showing how quantization enables high-performance deployments of large language models, especially in resource-constrained environments.
- Assess hardware needs: Consider the memory bandwidth and batch size requirements before choosing a quantization strategy, as these factors can impact model latency and throughput.
- Validate performance trade-offs: Test lower precision quantization thoroughly, since compressing models to 4-bit or less may introduce accuracy losses that aren’t always apparent on paper.
- Choose calibration methods wisely: Use advanced calibration techniques like outlier-aware quantization or quantization-aware training to preserve important model weights and maintain accuracy during deployment.
I watched a senior engineer spend three weeks quantizing an LLM to 4-bit. The P99 latency got worse.

The issue wasn't the technique; it was treating quantization as a storage problem instead of a memory-bandwidth problem.

At Twitter, I spent a month debugging why our "optimized" models ran slower than the originals. The models were smaller. The math was correct. Yet latency regressed. The missing piece: the *unpacking tax*.

Here's the reality most benchmarks hide:

Time ≈ Total bytes moved / Memory bandwidth

On paper, moving from FP16 (16-bit) to INT4 (4-bit) means 4× less data moving across the memory bus per token. In a memory-bound regime, that translates to 3–4× higher throughput. (A back-of-the-envelope version of this model is sketched after this post.)

But there's a catch. With weight-only quantization, the GPU doesn't actually multiply in 4-bit: the packed weights are dequantized back to FP16/BF16 in local cache before computation. That dequantization costs clock cycles and creates production surprises:

→ High batch sizes: time saved on memory movement dominates = throughput improves
→ Batch size of 1: unpacking overhead dominates = latency gets worse

Quantization is not a free win. It's a tradeoff. If you're choosing a method, align it with your deployment reality:

→ GPTQ: effective for static weights, but sensitive to outliers
→ AWQ: protects activation-salient weight channels, usually better quality at 4-bit
→ GGUF: excellent for CPU/Metal inference, less relevant for H100/A100 clusters

This is Part 4 of a deep dive into inference optimization. Previous posts:
Memory Wall: https://lnkd.in/gdT26UTV
KV Cache: https://lnkd.in/gKkrqVzf
Paged Attention: https://lnkd.in/gX5JNZhn

Next up: I will break down the closest thing to "cheating physics" in ML - Speculative Decoding.

What's the most expensive quantization mistake you've seen in production - latency, quality, or operability?
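Here is the back-of-the-envelope model referenced above, as a few lines of Python. The bandwidth figure and the per-format "unpacking tax" multipliers are illustrative assumptions, not measurements from any specific GPU.

```python
# Toy latency model for a memory-bound decode step: Time ≈ bytes moved / bandwidth,
# inflated by an assumed dequantization overhead. Numbers are illustrative only.

def decode_step_ms(n_params: float, bits_per_weight: float,
                   mem_bw_gbps: float = 2000.0,      # ~2 TB/s-class GPU (assumption)
                   unpack_overhead: float = 1.0):    # dequantization tax (assumption)
    """Estimate one decode step where all weights are read once per token."""
    bytes_moved = n_params * bits_per_weight / 8
    seconds = bytes_moved / (mem_bw_gbps * 1e9)
    return seconds * 1e3 * unpack_overhead

for label, bits, overhead in [("FP16", 16, 1.0), ("INT8", 8, 1.1), ("INT4", 4, 1.2)]:
    print(label, round(decode_step_ms(70e9, bits, unpack_overhead=overhead), 2), "ms/token")
```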
-
Groundbreaking Research Alert: 4-bit Quantization for RAG Systems

A fascinating new paper from San José State University introduces an innovative approach to optimizing Retrieval-Augmented Generation (RAG) systems through 4-bit quantization of vector embeddings.

>> Technical Deep Dive:
The research tackles a critical challenge in RAG systems - the massive memory requirements for storing high-dimensional embedding vectors. Current top-ranked models on MTEB typically use embedding dimensions between 512 and 4096, consuming substantial memory. Consider this: a standard DBpedia dataset with 1M entries and 1536 dimensions requires 6.1 GB of RAM just for the FP32 embeddings.

The proposed solution? A 4-bit quantization approach that:
- Reduces memory footprint by up to 87.5%
- Maintains search accuracy within 4% of original performance
- Implements group-wise quantization for enhanced precision
- Outperforms the HNSW algorithm in accuracy with group sizes ≤ 128

>> Under the Hood:
The system employs symmetric linear quantization with group-wise processing, where vectors are split into equal-sized groups, each with its own quantization scale (a minimal sketch follows this post). This approach significantly outperforms traditional Product Quantization methods, maintaining correlation coefficients above 0.82 across multiple semantic textual similarity datasets.

>> Impact:
This breakthrough enables RAG deployment in resource-constrained environments while maintaining high accuracy. The research demonstrates that intelligent quantization can dramatically reduce infrastructure costs without compromising performance.
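For intuition, here is a minimal sketch of group-wise symmetric int4 quantization of a single embedding vector, with one scale per group of 128 values. The code layout and scale handling in the paper may differ; this only shows the general idea.

```python
# Sketch: group-wise symmetric 4-bit quantization of an embedding vector,
# assuming signed int4 codes in [-8, 7] and one scale per group. Illustrative only.
import numpy as np

def quantize_groupwise_int4(x: np.ndarray, group_size: int = 128):
    """x: (dim,) float vector, dim divisible by group_size."""
    groups = x.reshape(-1, group_size)
    scales = np.maximum(np.abs(groups).max(axis=1, keepdims=True) / 7.0, 1e-12)  # zero-point = 0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_groupwise_int4(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

emb = np.random.default_rng(1).normal(size=1536).astype(np.float32)
q, s = quantize_groupwise_int4(emb, group_size=128)
recon = dequantize_groupwise_int4(q, s)
print("cosine similarity:", float(emb @ recon / (np.linalg.norm(emb) * np.linalg.norm(recon))))
```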
-
You're in a GenAI Engineer interview at Goldman Sachs, and the interviewer asks: "We need to deploy a 70B parameter LLM for production trading signals. Should we use 4-bit or 8-bit quantization? Justify your choice."

Here's how you can answer:

A. Most candidates fumble here because they only know "quantization reduces model size." Incomplete answer.
B. There are 5 critical factors every GenAI engineer should understand cold.

1. The Precision-Performance Tradeoff - the fundamental difference
8-bit (INT8) maintains near-identical accuracy: performance degradation < 1% on most tasks. It uses linear quantization: Q = round(scale × W) + zero_point (a toy version is sketched after this post).
4-bit (INT4/NF4) trades accuracy for efficiency: performance degradation of 2-5% depending on architecture. It uses non-linear quantization (NormalFloat4) to preserve the weight distribution.
The brutal truth? 8-bit is production-safe. 4-bit requires EXTENSIVE validation.

2. The Memory Footprint - where 90% of engineers go wrong
Most people think "4-bit = half the footprint of 8-bit" and stop at the weights. Wrong move.
FP32: 70B model = 280GB
INT8: 70B model = 70GB (4x compression)
INT4: 70B model = 35GB (8x compression)
But here's the catch - you STILL need headroom for the KV cache and activations. A real-world 70B INT4 deployment needs 48-60GB minimum, not 35GB.

3. The Quantization Method - the hidden production killer
Here's what separates junior from senior GenAI engineers:
Post-Training Quantization (PTQ): fast setup (hours); works reliably for 8-bit; 4-bit quality varies wildly.
GPTQ/AWQ (advanced PTQ): weight-only quantization with calibration; the industry standard for 4-bit LLMs; requires a representative calibration dataset (CRITICAL).
QAT (Quantization-Aware Training): expensive compute (days to weeks); required for mission-critical 4-bit deployments.

4. The Inference Speed Tradeoff - higher throughput, but why?
8-bit: native GPU support (Tensor Cores); fast matrix multiplication; 1.5-2x throughput vs FP16.
4-bit: limited hardware support; requires dequantization to FP16 for computation; memory-bandwidth bound, NOT compute bound.
The counterintuitive reality? 4-bit isn't always faster despite being smaller.

5. The Deployment Reality - the cost nobody talks about
8-bit: an A100 80GB fits a 70B model, batch sizes 8-16 supported.
4-bit: ~35-40GB of weights - one 48GB card or two 24GB consumer GPUs - batch sizes 1-4 maximum.

When 8-bit wins:
✅ Accuracy non-negotiable (finance, healthcare)
✅ Production reliability > cost optimization
✅ Batch inference workloads

When 4-bit wins:
✅ Extreme memory constraints
✅ Cost optimization critical
✅ Acceptable 2-5% quality degradation
✅ Using GPTQ/AWQ

Packt is offering Christmas $9.99 deals: https://lnkd.in/gPNKdiXr
Sonia Chauhan
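A toy version of the linear quantization from point 1, with the scale defined as the multiplier that maps the weight range onto 256 integer levels. This is a generic per-tensor sketch for illustration, not any particular library's implementation.

```python
# Sketch of linear (affine) INT8 quantization: Q = round(scale * W) + zero_point,
# clamped to [0, 255]. Per-tensor, for illustration only.
import numpy as np

def quantize_int8(w: np.ndarray):
    w_min, w_max = float(w.min()), float(w.max())
    scale = 255.0 / max(w_max - w_min, 1e-12)          # maps the weight range onto 256 levels
    zero_point = int(round(-w_min * scale))            # the real value 0.0 maps to this code
    q = np.clip(np.round(scale * w) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) / scale

w = np.random.default_rng(2).normal(scale=0.02, size=4096).astype(np.float32)
q, s, z = quantize_int8(w)
print("max abs error:", float(np.abs(dequantize_int8(q, s, z) - w).max()))
```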
-
You're in an NVIDIA Deep Learning Performance Engineer interview. The question: "We are moving from FP16 to INT8, INT4, and even 1.58-bit (ternary) models. Why does decreasing numerical precision often result in almost zero loss in 'intelligence'?"

You pause - most candidates jump straight to "models are over-parameterized." You reply:
- LLM weights aren't uniformly distributed; they follow a heavy-tailed distribution. Most weights are near zero and contribute little to the final output.
- High precision (FP32/FP16) is necessary during training to capture tiny gradient updates. But during inference, the features the model has learned are robust enough that "close enough" is usually good enough.
- We use outlier-aware quantization: we find the ~0.1% of weights with very large magnitudes and keep them in high precision, while squashing the rest into 4 bits (a minimal sketch follows this post).

The interviewer probes: "What is the actual physical bottleneck we are solving here?" You explain:
- It's the "Memory Wall."
- Moving a 16-bit number from VRAM to the CUDA cores consumes orders of magnitude more energy and time than the multiplication itself.
- By moving to 4-bit, we quadruple the effective memory bandwidth and can fit a 70B parameter model on a single GPU that would otherwise require an enterprise cluster.
- We also use weight-only quantization, dequantizing back to FP16 just for the calculation - saving memory while maintaining math accuracy.

Finally, you add: "Intelligence is about the topology of the high-dimensional space, not the precision of the coordinates. Quantization is just finding a more efficient way to map that."

#AI #LLMs #FineTuning
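A minimal sketch of the outlier-aware idea described above: keep the top ~0.1% of weights by magnitude in FP16 and quantize the rest to symmetric int4. The threshold, storage format, and reconstruction are simplifying assumptions for illustration, not NVIDIA's implementation.

```python
# Sketch: outlier-aware quantization - FP16 for the heavy tail, int4 for the bulk.
import numpy as np

def outlier_aware_quantize(w: np.ndarray, outlier_frac: float = 0.001):
    flat = w.ravel()
    k = max(1, int(len(flat) * outlier_frac))
    outlier_idx = np.argpartition(np.abs(flat), -k)[-k:]        # top-k magnitudes
    outliers_fp16 = flat[outlier_idx].astype(np.float16)

    rest = flat.copy()
    rest[outlier_idx] = 0.0                                     # quantize only the bulk
    scale = max(float(np.abs(rest).max()) / 7.0, 1e-12)         # symmetric int4: [-8, 7]
    q4 = np.clip(np.round(rest / scale), -8, 7).astype(np.int8)
    return q4, scale, outlier_idx, outliers_fp16

def reconstruct(q4, scale, outlier_idx, outliers_fp16, shape):
    flat = q4.astype(np.float32) * scale
    flat[outlier_idx] = outliers_fp16.astype(np.float32)
    return flat.reshape(shape)

w = np.random.default_rng(3).standard_t(df=3, size=(256, 256)).astype(np.float32) * 0.01
q4, s, idx, outl = outlier_aware_quantize(w)
w_hat = reconstruct(q4, s, idx, outl, w.shape)
print("mean abs error:", float(np.abs(w_hat - w).mean()))
```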
-
Google DeepMind Researchers Propose Matryoshka Quantization: A Technique to Enhance Deep Learning Efficiency by Optimizing Multi-Precision Models without Sacrificing Accuracy

Researchers at Google DeepMind introduced Matryoshka Quantization (MatQuant) to create a single model that functions across multiple precision levels. Unlike conventional methods that treat each bit-width separately, MatQuant optimizes a model for int8, int4, and int2 using a shared bit representation. This allows models to be deployed at different precisions without retraining, reducing computational and storage costs. MatQuant extracts lower-bit models from a high-bit model while preserving accuracy by leveraging the hierarchical structure of integer data types. Testing on Gemma-2 2B, Gemma-2 9B, and Mistral 7B models showed that MatQuant improves int2 accuracy by up to 10% over standard quantization techniques like QAT and OmniQuant.

Experimental evaluations demonstrate MatQuant's ability to mitigate accuracy loss from quantization. Researchers tested the method on Transformer-based LLMs, focusing on quantizing Feed-Forward Network (FFN) parameters, a key factor in inference latency. Results show that MatQuant's int8 and int4 models achieve accuracy comparable to independently trained baselines while outperforming them at int2 precision. On the Gemma-2 9B model, MatQuant improved int2 accuracy by 8.01%, while the Mistral 7B model saw a 6.35% improvement over traditional quantization methods. The study also found that MatQuant's right-shifted quantized weight distribution enhances accuracy across all bit-widths, particularly benefiting lower-precision models. MatQuant also enables seamless bit-width interpolation and layer-wise Mix'n'Match configurations, allowing flexible deployment based on hardware constraints (a small sketch of the layer-wise idea follows this post).

Read the full article: https://lnkd.in/gWTcqSCN
Paper: https://lnkd.in/ggAF-sjf

Google DeepMind Pranav Nair PURANJAY DATTA Jeff Dean Prateek Jain Aditya Kusupati
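For intuition on the layer-wise Mix'n'Match idea: because each precision is a prefix of the int8 codes, every layer can be served at its own bit-width by slicing, and the packed footprint follows directly from a per-layer bit plan. The layer names, shapes, and allocation below are made up for illustration; this is a hedged sketch, not the paper's code.

```python
# Sketch: serve each layer at its own bit-width by MSB-slicing a shared int8 model,
# then tally the packed footprint of the chosen mix. All names/shapes are hypothetical.
import numpy as np

def slice_to_bits(q_int8: np.ndarray, bits: int) -> np.ndarray:
    return (q_int8 >> (8 - bits)).astype(np.uint8)   # keep the top `bits` MSBs

int8_model = {f"ffn.{i}": np.random.default_rng(i).integers(0, 256, size=(1024, 4096),
                                                            dtype=np.uint8) for i in range(4)}
bit_plan = {"ffn.0": 8, "ffn.1": 4, "ffn.2": 4, "ffn.3": 2}   # hypothetical allocation

deployed = {name: slice_to_bits(w, bit_plan[name]) for name, w in int8_model.items()}
total_bits = sum(w.size * bit_plan[name] for name, w in int8_model.items())
print("packed size:", round(total_bits / 8 / 2**20, 1), "MiB")
```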
-
Day 13/100: Quantization - Number Formats (FP8, INT8, INT4, NVFP4)

A 70B parameter model in FP32 occupies 280 GB - four H100s just to hold the weights. In BF16, that's 140 GB. INT8: 70 GB. INT4: 35 GB, fitting on a single H100 with room for KV cache. NVFP4 lands around 40 GB at ~4.5 effective bits per weight, with quality closer to FP8 than INT4. Quantization is not just a compression trick; it's what makes large models economically servable. (The footprint arithmetic is sketched after this post.)

The tradeoff is precision loss. FP32 has 23 mantissa bits. FP16 has 10. BF16 has 7, but the same exponent range as FP32, which is why it's preferred for training. INT8 is a pure integer format with no exponent, so its dynamic range is fixed by the quantization scale you choose. Each reduction cuts the representable space roughly in half. The question is how much accuracy you can afford to lose on the downstream task.

FP8 is the current sweet spot for production. Hopper's H100 has native FP8 Tensor Core support: 3.9 petaFLOPS in FP8 vs 1.98 petaFLOPS in FP16 (peak, with sparsity) - a 2× throughput gain if you can tolerate the precision reduction. FP8 comes in two formats: E4M3 (4 exponent bits, 3 mantissa, better for activations) and E5M2 (5 exponent, 2 mantissa, better for gradients). Most inference deployments use E4M3 for both weights and activations.

INT4 is the most aggressive widely-used format. GPTQ and AWQ both target INT4 for weights. The quantization error is visible on challenging benchmarks but acceptable for most chatbot workloads. The memory saving - 4× vs FP16 - is compelling enough that nearly all large-model inference deployments use some form of 4-bit quantization.

NVFP4 is the next step beyond FP8: Blackwell's native 4-bit floating point (E2M1) with micro-scaling, where every block of 16 values shares an FP8 scale factor, recovering dynamic range that a raw 4-bit format loses. Blackwell Tensor Cores run FP4 GEMMs at 2x FP8 throughput. The effective storage cost is ~4.5 bits per weight (4 bits + amortized scale overhead), landing between INT4's aggressive compression and FP8's quality preservation.

The notebook (https://lnkd.in/g7nzjEJ7) implements FP8, INT8, INT4, and NVFP4 quantization from scratch, measures quantization error per format, and models memory footprint across model sizes from 7B to 405B.

#LLM #Inference #Quantization #FP8 #INT8 #INT4 #NVFP4 #Blackwell #DeepLearning #AI #MLEngineering #100DaysOfInference
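The footprint numbers above come from simple arithmetic; here it is as a few lines of Python. Weights only - KV cache and activations are extra - and NVFP4's 4.5 bits already includes the amortized block-scale overhead.

```python
# Weights-only memory footprint per format and model size (illustrative arithmetic).

FORMATS = {"FP32": 32, "BF16": 16, "FP8": 8, "INT8": 8, "INT4": 4, "NVFP4": 4.5}

def weights_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for n in (7e9, 70e9, 405e9):
    row = {name: round(weights_gb(n, bits), 1) for name, bits in FORMATS.items()}
    print(f"{n / 1e9:>5.0f}B", row)
```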
-
Cursor rewrote their entire MoE layer from scratch in pure CUDA and PTX. they got a 3.5x MoE layer speedup and 1.5x end-to-end training speedup on Blackwell. let's break down what they did:

initially, they tried just quantizing to naive FP8 but this gave them no speedup. on Blackwell, quantizing matrices before feeding them to an FP8 matmul consumes roughly 40% of the matmul time. when you include transpose-quantization for backward passes, it jumps to 76%. you get 2x faster matmul but spend nearly the same time just preparing the inputs. MXFP8 training can actually be slower than BF16 if you don't fuse the quantization.

it gets worse on Blackwell specifically. on Hopper, tensor core results accumulate in registers, so you can pipeline dequantization with CUDA cores while the next matmul runs. on Blackwell, results go into a new on-chip memory called TMEM. to do any arithmetic on the accumulator, you transfer from TMEM to registers, process with CUDA cores, write back, and wait. Cursor measured dequantization taking 1.76x the matmul time on Blackwell (vs 1.03x on Hopper). they couldn't even beat Hopper's realistic FP8 throughput with any variation of this approach.

the fix is to not dequantize at all. Blackwell's tcgen05.mma block_scale PTX instruction handles MXFP8 block scaling entirely in hardware, inside the tensor cores. no TMEM-to-register transfers, no CUDA core arithmetic. the scaling factors load into TMEM and get consumed during the matrix multiply itself.

but you still need to quantize the inputs. existing kernels from TransformerEngine and TorchAO run at ~4.5 TB/s and produce scale factors in the wrong memory layout, requiring a separate reshape kernel. Cursor built a quantization kernel sustaining 6.2+ TB/s that writes scales directly in the hardware-expected packed layout. they also fused quantization into SwiGLU's epilogue, so activations get quantized as they flow through the activation function. no BF16 round-trip through HBM.

for grouped GEMM (the actual MoE operation), they beat DeepSeek's DeepGEMM at 0.43ms vs 0.67ms for forward/dgrad. that benchmark excludes DeepGEMM's quantization time, since DeepGEMM doesn't ship optimized quantization kernels. the real-world gap is larger.

Cursor uses MXFP8 with 32-element block scaling (FP8E4M3 elements, E8M0 scale factors); a numpy sketch of those numerics follows this post. DeepSeek V3 used 128-element blocks for the A matrix. finer blocks = better accuracy but more scale factors to manage. Cursor verified 32-block MXFP8 converges nearly identically to BF16.

MoE forward went from 25.96ms (Blackwell BF16) to 9.45ms. backward from 59.17ms to 17.04ms. end-to-end: 24k tokens/GPU vs 16k on Blackwell BF16. the kernel was written by Stuart Sul (ML at Cursor), and the full link is in the comments.
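To illustrate just the numerics of 32-element block scaling (FP8 E4M3 elements with power-of-two E8M0 scales), here is a small numpy simulation. It is only a sketch of the format's rounding behavior under simplifying assumptions (no subnormals or special values) - not Cursor's fused CUDA/PTX kernels.

```python
# numpy simulation of MXFP8-style numerics: 32-element blocks, E4M3 elements,
# one power-of-two (E8M0) scale per block. Rounding sketch only; subnormals/NaN ignored.
import numpy as np

E4M3_MAX = 448.0

def round_to_e4m3(x):
    """Round to the nearest representable E4M3 value (normals only, simplified)."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    mag = np.maximum(np.abs(x), 1e-12)
    exp = np.floor(np.log2(mag))
    quantum = 2.0 ** (exp - 3)                # 3 mantissa bits
    return np.round(x / quantum) * quantum

def mxfp8_quantize(w, block=32):
    blocks = w.reshape(-1, block)
    # E8M0 scale: a pure power of two chosen so the block max lands near E4M3_MAX
    scale = 2.0 ** np.ceil(np.log2(np.abs(blocks).max(axis=1, keepdims=True) / E4M3_MAX + 1e-30))
    q = round_to_e4m3(blocks / scale)
    return q, scale

w = np.random.default_rng(4).normal(scale=0.02, size=(128, 256)).astype(np.float32)
q, s = mxfp8_quantize(w)
w_hat = (q * s).reshape(w.shape)
print("relative error:", float(np.abs(w_hat - w).mean() / np.abs(w).mean()))
```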
-
DeepSeek built a model that rivals GPT-4 for $5.6 million instead of $100 million. Now every AI company is scrambling to copy their approach. One of the key ingredients: quantization.

Here's the problem it solves: if you're running a 70B parameter AI model, it needs 8 expensive GPUs just to load. Sounds normal for enterprise AI, right? But what if the exact same model could run on a single GPU with virtually identical performance?

With traditional AI deployment, companies assume bigger models need bigger hardware. Period. So when you're budgeting for AI infrastructure, you're planning for server farms that cost millions.

Quantization works like intelligent compression, with three core ideas: precision reduction, weight clustering, and performance preservation.

First, precision reduction converts 32-bit numbers into 4-bit representations without losing the important patterns. Then weight clustering groups similar values together, so instead of storing millions of unique numbers, you store a few thousand representative ones (a toy sketch follows this post). Finally, performance preservation ensures the model still understands language and generates accurate responses.

The counterintuitive part: the model often becomes FASTER, not slower, because far less data has to move through memory.

For example, if a startup wants to deploy Llama 70B, traditional deployment requires enterprise GPU clusters. But with INT4 quantization, the same model runs on far cheaper hardware while maintaining conversation quality.

DeepSeek proved this works at scale. Now startups run enterprise models on desktop computers. Edge devices process AI without cloud connections. Companies cut infrastructure costs by 90% overnight. Mobile apps include full language models locally.

The good thing is that AI finally fits where you need it, instead of requiring a data center. The irony is that a quantized model can out-serve its full-precision version because it moves data through your hardware faster.

Every major AI provider - Meta, Google, Microsoft - now releases quantized versions as standard. Because they realized the bottleneck wasn't model intelligence. It was assuming bigger always means better.

Over to you: what model are you overpaying to run at full precision?

Legend: FP = Floating Point, INT = Integer
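As an illustration of the "weight clustering" idea, here is a toy 1-D k-means codebook quantizer: a handful of shared centroids plus 4-bit indices stand in for millions of unique floats. This is a generic sketch, not DeepSeek's actual training or inference recipe.

```python
# Toy codebook (weight clustering) quantizer: 16 centroids + per-weight indices.
import numpy as np

def cluster_weights(w: np.ndarray, n_clusters: int = 16, iters: int = 20):
    flat = w.ravel()
    codebook = np.linspace(flat.min(), flat.max(), n_clusters)   # initial centroids
    for _ in range(iters):
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)   # assign step
        for k in range(n_clusters):                                      # update step
            members = flat[idx == k]
            if members.size:
                codebook[k] = members.mean()
    return codebook, idx.astype(np.uint8)   # 16 clusters -> indices fit in 4 bits

w = np.random.default_rng(5).normal(scale=0.02, size=(512, 512)).astype(np.float32)
codebook, idx = cluster_weights(w)
w_hat = codebook[idx].reshape(w.shape)
print("mean abs error:", float(np.abs(w_hat - w).mean()))
```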