Demystifying Quantization: A Clear Guide to Understanding the Concept and Methods in Large Language Models
Large Language Models (LLMs) have undoubtedly revolutionized the field of artificial intelligence, showcasing remarkable performance across a diverse array of tasks. From language translation to text generation, these models have demonstrated unprecedented capabilities. However, their widespread deployment has been hindered by one significant challenge: the sheer number of their parameters, which demands large memory capacity and high memory bandwidth.
Imagine a model with billions of parameters, each one consuming memory to store and bandwidth to move. That poses a significant obstacle to practical deployment. So, what can be done to reduce this burden?
This is where the concept of quantization comes in.
What Is Quantization, and Why Use It?
Quantization is a method used in AI to reduce the computational and memory costs of running inference. It achieves this by representing the numerical values (weights and activations) within AI models using low-precision data types, such as 8-bit integers (int8), instead of the traditional 32-bit floating-point (float32) data types. The purpose of quantization is to reduce the size of the data and make it easier to manage, store, and process. This is particularly important when working with large datasets, where reducing the precision of the data can result in significant memory savings and faster computation times.
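To make this concrete, below is a minimal sketch of the affine (scale and zero-point) scheme commonly used for int8 quantization. The function names are illustrative, not taken from any particular library:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine quantization: map a float32 tensor onto the int8 grid."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)      # real-valued step per integer level
    zero_point = int(round(qmin - x.min() / scale))  # integer code that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map int8 codes back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(5).astype(np.float32)
q, scale, zp = quantize_int8(weights)
print(weights)                   # original float32 values
print(dequantize(q, scale, zp))  # close, but not identical
```

The dequantized values come out close to, but not exactly, the originals; that rounding gap is the quantization error that every method discussed below tries to keep small.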
How much memory is needed to store and train an LLM?
The traditional data type for LLMs is 32-bit floating point (float32). To train a 1B (one billion) parameter model, we need about 4 bytes per parameter for the model weights, plus roughly 20 extra bytes per parameter for the optimizer states, gradients, activations, and temporary memory.
A single model parameter at full 32-bit precision is represented by 4 bytes, so a 1-billion-parameter model requires 4 GB of GPU RAM just to load at full precision. Training demands even more memory to hold the other values (optimizer states, gradients, and so on): about 24 GB of GPU RAM, six times more than merely storing the model.
Quantization is a popular way to convert your model parameters from 32-bit precision down to 16-bit precision, or even 8-bit or 4-bit. By quantizing your model weights from 32-bit full precision down to 16-bit half precision, you cut your 1-billion-parameter model's memory requirement by 50%, to only 2 GB for loading and 12 GB for training.
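As a back-of-the-envelope check of those numbers, here is a tiny calculator based on the figures above: 4 bytes per float32 weight, and roughly five times the weight memory again in training state. The exact training overhead varies with the optimizer and implementation, so treat that ratio as an assumption:

```python
GB = 1e9  # decimal gigabytes, to match the round numbers above

def memory_gb(n_params: float, bytes_per_weight: float, train_overhead: float = 5.0):
    """Approximate memory to load and to train a model.

    train_overhead is the assumed ratio of extra training state (optimizer,
    gradients, activations, temp memory) to weight memory: ~20 extra bytes
    per 4-byte float32 weight, i.e. 5x.
    """
    load = n_params * bytes_per_weight / GB
    train = load * (1 + train_overhead)
    return load, train

for label, width in [("float32", 4), ("float16/bf16", 2), ("int8", 1)]:
    load, train = memory_gb(1e9, width)
    print(f"{label:>13}: load ~{load:.0f} GB, train ~{train:.0f} GB")
# float32: ~4 GB load, ~24 GB train; float16: ~2 GB load, ~12 GB train
```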
Before diving into the quantization concept, let’s explore the difference between data types.
In the world of computers, numbers are represented using a system known as floating-point representation. This system breaks a number down into three parts: the sign, the exponent, and the mantissa (the short sketch after this list shows how to inspect those bits in Python).
Sign: The sign bit indicates whether the number is positive or negative. ‘0’ represents a positive number, while ‘1’ represents a negative number.
Exponent: The exponent represents the power to which the base (usually 2) is raised.
Mantissa: The mantissa, also known as the fraction or significand, represents the significant digits of the number.
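As a quick aside, here is a minimal standard-library sketch that pulls those three fields out of a float32 bit pattern (the constants and comments assume IEEE 754 single precision):

```python
import struct

def float32_fields(x: float):
    """Decompose a float32 into its sign, exponent, and mantissa bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 bits of fraction
    return sign, exponent, mantissa

s, e, m = float32_fields(3.14159265)
print(f"sign={s} exponent={e} (unbiased {e - 127}) mantissa={m:023b}")
# sign=0 exponent=128 (unbiased 1) mantissa=10010010000111111011011
```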
BF16, which stands for Brain Floating Point Format, is a popular choice for LLMs.
How much can we really save with quantization?
As we said, the traditional data type used is 32-bit floating point (float32). Now let's walk through the conversion from FP32 to other types, starting with FP16.
Consider the conversion of the mathematical constant Pi (3.14…) from FP32 to FP16. It yields a 50% reduction in memory requirements, a significant advantage for resource optimization. However, it comes with a trade-off in precision: FP16's smaller exponent range means it cannot represent numbers of the same magnitude as FP32, potentially introducing numerical instability and risking loss of information during computation.
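You can observe the precision loss directly with NumPy; this is a minimal sketch, and the printed digits are what the conversions of Pi actually produce:

```python
import numpy as np

pi32 = np.float32(np.pi)  # 1 sign + 8 exponent + 23 mantissa bits
pi16 = np.float16(pi32)   # 1 sign + 5 exponent + 10 mantissa bits

print(f"float32: {pi32:.10f}")  # 3.1415927410
print(f"float16: {pi16:.10f}")  # 3.1406250000 -- fewer mantissa bits, coarser value
print(np.finfo(np.float16).max)  # 65504.0 -- far below float32's max of ~3.4e38
```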
The conversion of Pi (3.14…) from FP32 to BF16 is particularly advantageous: it not only halves the memory requirement but also preserves FP32's numeric range. BF16 uses a 16-bit format with 8 bits for the exponent and 7 bits for the mantissa, so it trades some precision in the significant digits for an exponent range that matches FP32. This makes BF16 a preferred choice over FP16 in scenarios where both memory efficiency and numerical stability are crucial, striking a beneficial balance between reduced memory usage and retained range.
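A small comparison makes the trade-off visible. NumPy has no native bfloat16, so this sketch assumes PyTorch is available:

```python
import torch  # assumed installed; NumPy has no native bfloat16

third_bf16 = torch.tensor(1.0 / 3.0, dtype=torch.bfloat16)
third_fp16 = torch.tensor(1.0 / 3.0, dtype=torch.float16)

# BF16 keeps FP32's 8 exponent bits (range) but has only 7 mantissa bits:
print(f"bfloat16 1/3 = {third_bf16.item():.8f}")  # ~0.33398438 (coarser)
print(f"float16  1/3 = {third_fp16.item():.8f}")  # ~0.33325195 (finer mantissa)
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38, essentially float32's range
print(torch.finfo(torch.float16).max)   # 65504.0, a much narrower range
```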
Finally, the change from FP32 to INT8 for Pi (3.14…) saves a lot of memory, about 75%, which is great. But there's a catch: INT8 uses only 8 bits and has no exponent at all, so it can represent just 256 distinct levels. While it saves space, it might not be the best choice if you need a wide range of values and really care about the details.
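A short sketch shows both the problem and the standard workaround, reusing the scale-factor idea from the quantize_int8 sketch earlier:

```python
import numpy as np

pi32 = np.float32(np.pi)
print(np.int8(pi32))                                 # 3: a direct cast just drops the fraction
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127: only 256 representable levels

# With a scale factor, the value survives, but every number in the tensor
# must share those same 256 levels:
scale = pi32 / 127
print(np.int8(np.round(pi32 / scale)) * scale)  # ~3.1415927, recovered via the scale
```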
Quantization Methods (GPTQ, AWQ, GGUF)
Now that we understand the basics of quantization, let’s explore some quantization methods.
GPTQ:
GPTQ (Post-Training Quantization for GPT Models) is a quantization method specifically designed for GPT (Generative Pre-trained Transformer) models. It aims to reduce the memory requirements of these large language models while maintaining similar performance.
This method is designed for GPU hardware: “it can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline.”
The process of GPTQ involves several steps:
· Model Training: Initially, the GPT model is trained using standard techniques, typically on powerful hardware with high memory capacity. This training phase involves running the model on a large dataset to learn the underlying patterns and dependencies in the text.
· Post-Training Quantization: After the model is trained, the quantization process begins. GPTQ converts the model’s weights from a high-precision floating-point format to a lower-bit format, for example 4-bit or even 3-bit precision, while activations are typically left at higher precision.
· Calibration: Crucially, GPTQ requires no retraining. Instead, it runs a small calibration dataset through the network and quantizes the weights one layer at a time, using approximate second-order (Hessian) information to choose quantized values that minimize the error in each layer’s output. This one-shot approach is what lets GPTQ compress very large models in a few GPU hours while keeping accuracy close to the full-precision baseline.
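In practice you rarely implement GPTQ by hand. Here is a minimal sketch using the Hugging Face transformers integration; it assumes the optimum and auto-gptq packages are installed, and the model name is only an example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small example model; any causal LM works similarly
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bits sets the target precision; dataset names the calibration text corpus
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)
model.save_pretrained("opt-125m-gptq-4bit")
```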
AWQ:
AWQ (Activation-aware Weight Quantization) is a quantization method that aims to reduce the memory requirements of large language models while preserving, or even improving, their performance. Like GPTQ it is a post-training approach, but it adds an activation-aware twist: AWQ uses the model’s activation statistics to recognize that not all weights are equally important. The main idea is to identify the small fraction of salient weights, protect them (for example, via per-channel scaling) so they lose little precision, and quantize the rest aggressively. By doing so, AWQ achieves a balance between reducing the memory footprint and maintaining the model’s accuracy.
“We observe that the weights of LLMs are not equally important: there is a small fraction of salient weights that are much more important for LLMs’ performance compared to others. Skipping the quantization of these salient weights can help bridge the performance degradation due to the quantization loss without any training or regression.”
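For completeness, a minimal sketch of quantizing a model with the AutoAWQ library; the config keys shown follow AutoAWQ’s documented examples, but treat them as assumptions to verify against the version you install:

```python
# Minimal AutoAWQ sketch (pip install autoawq); the model path is only an example.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-125m"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# AWQ runs calibration data internally to find and protect the salient
# weight channels before applying 4-bit quantization to the rest.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("opt-125m-awq-4bit")
tokenizer.save_pretrained("opt-125m-awq-4bit")
```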
GGUF:
GGUF (GPT-Generated Unified Format) is the successor to the GGML format. GGUF models can be run on CPUs, making them accessible to a wider range of users, and the format’s binary layout enables fast loading and ease of reading, facilitating efficient storage and retrieval of model parameters.
GGUF is a file format used for storing models for inference, particularly in the context of large language models like GPT. It is designed to address the limitations of GGML, offering better tokenization, support for special tokens, and improved performance. GGUF is also extensible and supports metadata storage, and it is supported by various clients and libraries, making it a valuable format for deploying and utilizing large language models.
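As a minimal sketch, here is how a GGUF file can be run on CPU with the llama-cpp-python bindings; the file path is only an example:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="./llama-2-7b.Q4_K_M.gguf", n_ctx=2048)
out = llm("Q: What is quantization? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```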
Case Study: The Bloke’s Quantized Models
To see quantization in action, let’s take a look at some quantized models by ‘The Bloke’ on Hugging Face. These models showcase the practical application of quantization, demonstrating how it can optimize memory usage without compromising performance.
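For example, one of TheBloke’s GGUF repositories can be fetched with huggingface_hub. The repository and file names below follow TheBloke’s naming convention and should be checked against the actual model card:

```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",
    filename="llama-2-7b.Q4_K_M.gguf",  # a ~4-bit variant, a fraction of the FP16 size
)
print(path)  # local cache path, ready to pass to llama-cpp-python as shown above
```

Comparing the file sizes listed on such a model card across the Q2 through Q8 variants is a quick, hands-on way to see the memory-versus-precision trade-offs discussed throughout this article.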
Thank you for reading
Please don’t hesitate to reach out with any questions. If you share a passion for AI and its transformative potential, I invite you to connect with me on LinkedIn or explore my GitHub profile for further insights.