Demystifying Quantization: A Clear Guide to Understanding the Concept and Methods in Large Language Models
Large Language Models (LLMs) have undoubtedly revolutionized the field of artificial intelligence, showcasing remarkable performance across a diverse array of tasks. From language translation to text generation, these models have demonstrated unprecedented capabilities. However, their widespread deployment has been hindered by one significant challenge: the sheer number of their parameters, which demands large memory capacity and high memory bandwidth.
Imagine a model with billions of parameters, each one consuming memory to store and bandwidth to move. That poses a significant obstacle to practical deployment. So, what can be done to reduce this burden?
This is where the concept of quantization comes in.
What Is Quantization, and Why Use It?
Quantization is a method used in AI to reduce the computational and memory costs of running inference. It achieves this by representing the numerical values (weights and activations) within AI models using low-precision data types, such as 8-bit integers (int8), instead of the traditional 32-bit floating-point (float32) data types. The purpose of quantization is to reduce the size of the data and make it easier to manage, store, and process. This is particularly important when working with large datasets, where reducing the precision of the data can result in significant memory savings and faster computation times.
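To make this concrete, below is a minimal sketch of the affine (scale and zero-point) scheme commonly used for int8 quantization. The function names are illustrative, not taken from any particular library:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine quantization: map a float32 tensor onto the int8 grid."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)      # real-valued step per integer level
    zero_point = int(round(qmin - x.min() / scale))  # integer code that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map int8 codes back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(5).astype(np.float32)
q, scale, zp = quantize_int8(weights)
print(weights)                   # original float32 values
print(dequantize(q, scale, zp))  # close, but not identical
```

The dequantized values come out close to, but not exactly, the originals; that rounding gap is the quantization error that every method discussed below tries to keep small.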
How much memory is needed to store and train an LLM?
The traditional data type for LLMs is 32-bit floating point (float32). To train a 1B (one billion) parameter model, we need about 4 bytes per parameter for the model weights, plus roughly 20 extra bytes per parameter for the optimizer states, gradients, activations, and temporary memory.
A single model parameter at full 32-bit precision is represented by 4 bytes, so a 1-billion-parameter model requires 4 GB of GPU RAM just to load at full precision. Training demands even more memory to hold the other values (optimizer states, gradients, and so on): about 24 GB of GPU RAM, six times more than merely storing the model.
Quantization is a popular way to convert your model parameters from 32-bit precision down to 16-bit precision, or even 8-bit or 4-bit. By quantizing your model weights from 32-bit full precision down to 16-bit half precision, you cut your 1-billion-parameter model's memory requirement by 50%, to only 2 GB for loading and 12 GB for training.
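As a back-of-the-envelope check of those numbers, here is a tiny calculator based on the figures above: 4 bytes per float32 weight, and roughly five times the weight memory again in training state. The exact training overhead varies with the optimizer and implementation, so treat that ratio as an assumption:

```python
GB = 1e9  # decimal gigabytes, to match the round numbers above

def memory_gb(n_params: float, bytes_per_weight: float, train_overhead: float = 5.0):
    """Approximate memory to load and to train a model.

    train_overhead is the assumed ratio of extra training state (optimizer,
    gradients, activations, temp memory) to weight memory: ~20 extra bytes
    per 4-byte float32 weight, i.e. 5x.
    """
    load = n_params * bytes_per_weight / GB
    train = load * (1 + train_overhead)
    return load, train

for label, width in [("float32", 4), ("float16/bf16", 2), ("int8", 1)]:
    load, train = memory_gb(1e9, width)
    print(f"{label:>13}: load ~{load:.0f} GB, train ~{train:.0f} GB")
# float32: ~4 GB load, ~24 GB train; float16: ~2 GB load, ~12 GB train
```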
Before diving into the quantization concept, let’s explore the difference between data types.
In the world of computers, numbers are represented using a system known as floating-point representation. This system breaks a number down into three parts: the sign, the exponent, and the mantissa (the short sketch after this list shows how to inspect those bits in Python).
Sign: The sign bit indicates whether the number is positive or negative. ‘0’ represents a positive number, while ‘1’ represents a negative number.
Exponent: The exponent represents the power to which the base (usually 2) is raised.
Mantissa: The mantissa, also known as the fraction or significand, represents the significant digits of the number.
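As a quick aside, here is a minimal standard-library sketch that pulls those three fields out of a float32 bit pattern (the constants and comments assume IEEE 754 single precision):

```python
import struct

def float32_fields(x: float):
    """Decompose a float32 into its sign, exponent, and mantissa bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 bits of fraction
    return sign, exponent, mantissa

s, e, m = float32_fields(3.14159265)
print(f"sign={s} exponent={e} (unbiased {e - 127}) mantissa={m:023b}")
# sign=0 exponent=128 (unbiased 1) mantissa=10010010000111111011011
```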
BF16, which stands for Brain Floating Point Format, is a popular choice for LLMs.
How much can we really save with quantization?
As we said, the traditional data type used is 32-bit floating point (float32). Now let's walk through the conversion from FP32 to other types, starting with FP16.
Consider the conversion of the mathematical constant Pi (3.14…) from FP32 to FP16. It yields a 50% reduction in memory requirements, a significant advantage for resource optimization. However, it comes with a trade-off in precision: FP16's smaller exponent range means it cannot represent numbers of the same magnitude as FP32, potentially introducing numerical instability and risking loss of information during computation.
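You can observe the precision loss directly with NumPy; this is a minimal sketch, and the printed digits are what the conversions of Pi actually produce:

```python
import numpy as np

pi32 = np.float32(np.pi)  # 1 sign + 8 exponent + 23 mantissa bits
pi16 = np.float16(pi32)   # 1 sign + 5 exponent + 10 mantissa bits

print(f"float32: {pi32:.10f}")  # 3.1415927410
print(f"float16: {pi16:.10f}")  # 3.1406250000 -- fewer mantissa bits, coarser value
print(np.finfo(np.float16).max)  # 65504.0 -- far below float32's max of ~3.4e38
```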
The conversion of Pi (3.14…) from FP32 to BF16 is particularly advantageous: it not only halves the memory requirement but also preserves FP32's numeric range. BF16 uses a 16-bit format with 8 bits for the exponent and 7 bits for the mantissa, so it trades some precision in the significant digits for an exponent range that matches FP32. This makes BF16 a preferred choice over FP16 in scenarios where both memory efficiency and numerical stability are crucial, striking a beneficial balance between reduced memory usage and retained range.
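A small comparison makes the trade-off visible. NumPy has no native bfloat16, so this sketch assumes PyTorch is available:

```python
import torch  # assumed installed; NumPy has no native bfloat16

third_bf16 = torch.tensor(1.0 / 3.0, dtype=torch.bfloat16)
third_fp16 = torch.tensor(1.0 / 3.0, dtype=torch.float16)

# BF16 keeps FP32's 8 exponent bits (range) but has only 7 mantissa bits:
print(f"bfloat16 1/3 = {third_bf16.item():.8f}")  # ~0.33398438 (coarser)
print(f"float16  1/3 = {third_fp16.item():.8f}")  # ~0.33325195 (finer mantissa)
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38, essentially float32's range
print(torch.finfo(torch.float16).max)   # 65504.0, a much narrower range
```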
Finally, the change from FP32 to INT8 for Pi (3.14…) saves a lot of memory, about 75%, which is great. But there's a catch: INT8 uses only 8 bits and has no exponent at all, so it can represent just 256 distinct levels. While it saves space, it might not be the best choice if you need a wide range of values and really care about the details.
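A short sketch shows both the problem and the standard workaround, reusing the scale-factor idea from the quantize_int8 sketch earlier:

```python
import numpy as np

pi32 = np.float32(np.pi)
print(np.int8(pi32))                                 # 3: a direct cast just drops the fraction
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127: only 256 representable levels

# With a scale factor, the value survives, but every number in the tensor
# must share those same 256 levels:
scale = pi32 / 127
print(np.int8(np.round(pi32 / scale)) * scale)  # ~3.1415927, recovered via the scale
```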
Quantization Methods (GPTQ, AWQ, GGUF)
Now that we understand the basics of quantization, let’s explore some quantization methods.
GPTQ:
GPTQ (Post-Training Quantization for GPT Models) is a quantization method specifically designed for GPT (Generative Pre-trained Transformer) models. It aims to reduce the memory requirements of these large language models while maintaining similar performance.
This method is designed for GPU hardware: “it can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline.”
The process of GPTQ involves several steps:
· Model Training: Initially, the GPT model is trained using standard techniques, typically on powerful hardware with high memory capacity. This training phase involves running the model on a large dataset to learn the underlying patterns and dependencies in the text.
· Post-Training Quantization: After the model is trained, the quantization process begins. GPTQ converts the model’s weights from a high-precision floating-point format to a lower-bit format, for example 4-bit or even 3-bit precision, while activations are typically left at higher precision.
· Calibration: Crucially, GPTQ requires no retraining. Instead, it runs a small calibration dataset through the network and quantizes the weights one layer at a time, using approximate second-order (Hessian) information to choose quantized values that minimize the error in each layer’s output. This one-shot approach is what lets GPTQ compress very large models in a few GPU hours while keeping accuracy close to the full-precision baseline.
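In practice you rarely implement GPTQ by hand. Here is a minimal sketch using the Hugging Face transformers integration; it assumes the optimum and auto-gptq packages are installed, and the model name is only an example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small example model; any causal LM works similarly
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bits sets the target precision; dataset names the calibration text corpus
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)
model.save_pretrained("opt-125m-gptq-4bit")
```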
AWQ:
AWQ (Activation-aware Weight Quantization) is a quantization method that aims to reduce the memory requirements of large language models while preserving, or even improving, their performance. Like GPTQ it is a post-training approach, but it adds an activation-aware twist: AWQ uses the model’s activation statistics to recognize that not all weights are equally important. The main idea is to identify the small fraction of salient weights, protect them (for example, via per-channel scaling) so they lose little precision, and quantize the rest aggressively. By doing so, AWQ achieves a balance between reducing the memory footprint and maintaining the model’s accuracy.
“We observe that the weights of LLMs are not equally important: there is a small fraction of salient weights that are much more important for LLMs’ performance compared to others. Skipping the quantization of these salient weights can help bridge the performance degradation due to the quantization loss without any training or regression.”
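For completeness, a minimal sketch of quantizing a model with the AutoAWQ library; the config keys shown follow AutoAWQ’s documented examples, but treat them as assumptions to verify against the version you install:

```python
# Minimal AutoAWQ sketch (pip install autoawq); the model path is only an example.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "facebook/opt-125m"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# AWQ runs calibration data internally to find and protect the salient
# weight channels before applying 4-bit quantization to the rest.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("opt-125m-awq-4bit")
tokenizer.save_pretrained("opt-125m-awq-4bit")
```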
GGUF:
GGUF (GPT-Generated Unified Format) is the successor to the GGML format. GGUF models can be run on CPUs, making them accessible to a wider range of users, and the format’s binary layout enables fast loading and ease of reading, facilitating efficient storage and retrieval of model parameters.
GGUF is a file format used for storing models for inference, particularly in the context of large language models like GPT. It is designed to address the limitations of GGML, offering better tokenization, support for special tokens, and improved performance. GGUF is also extensible and supports metadata storage, and it is supported by various clients and libraries, making it a valuable format for deploying and utilizing large language models.
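As a minimal sketch, here is how a GGUF file can be run on CPU with the llama-cpp-python bindings; the file path is only an example:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(model_path="./llama-2-7b.Q4_K_M.gguf", n_ctx=2048)
out = llm("Q: What is quantization? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```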
Case Study: The Bloke’s Quantized Models
To see quantization in action, let’s take a look at some quantized models by ‘The Bloke’ on Hugging Face. These models showcase the practical application of quantization, demonstrating how it can optimize memory usage without compromising performance.
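For example, one of TheBloke’s GGUF repositories can be fetched with huggingface_hub. The repository and file names below follow TheBloke’s naming convention and should be checked against the actual model card:

```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",
    filename="llama-2-7b.Q4_K_M.gguf",  # a ~4-bit variant, a fraction of the FP16 size
)
print(path)  # local cache path, ready to pass to llama-cpp-python as shown above
```

Comparing the file sizes listed on such a model card across the Q2 through Q8 variants is a quick, hands-on way to see the memory-versus-precision trade-offs discussed throughout this article.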
Thank you for reading
Please don’t hesitate to reach out with any questions. If you share a passion for AI and its transformative potential, I invite you to connect with me on LinkedIn or explore my GitHub profile for further insights.