Compression Techniques for Large Language Models
1. Why Compress LLMs?
Large Language Models (LLMs) such as GPT‑3, LLaMA, and PaLM achieve state‑of‑the‑art performance but demand enormous memory and compute resources. Compressing these models is crucial to enable deployment on edge and consumer hardware, faster and cheaper inference, and lower memory and operational costs.
2. Introduction to Quantization and Pruning
Quantization reduces the numerical precision of weights and activations (for example, from 32‑bit floats to 8‑bit or 4‑bit integers), while pruning removes weights, neurons, or entire attention heads that contribute little to the output. Both can be applied post‑training (straightforward but may degrade accuracy) or during training (quantization‑aware training, gradual pruning) to better preserve performance.
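To make the two ideas concrete, here is a minimal PyTorch sketch on a toy weight matrix; the matrix size and the 50% pruning ratio are arbitrary, chosen only for illustration:

import torch

# Toy weight matrix standing in for one model layer
w = torch.randn(4, 4)

# Post-training quantization: map float32 weights to int8 with a per-tensor scale
scale = w.abs().max() / 127
w_int8 = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
w_dequant = w_int8.float() * scale  # the values the model effectively computes with

# Magnitude pruning: zero out the 50% of weights with the smallest absolute value
threshold = w.abs().flatten().kthvalue(w.numel() // 2).values
w_pruned = torch.where(w.abs() > threshold, w, torch.zeros_like(w))

print("mean quantization error:", (w - w_dequant).abs().mean().item())
print("sparsity after pruning:", (w_pruned == 0).float().mean().item())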
3. Hands‑On Implementation (Conceptual)
4. Challenge: Quantizing an LLM
Naïve quantization can introduce noise that degrades model accuracy, especially in sensitive layers such as embeddings and attention projections. The strategies in the next section address these key pain points.
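One way to see the problem: a handful of large‑magnitude outlier values (which LLM weights and activations are known to contain) inflates the quantization scale, so every other value is represented more coarsely. A self‑contained toy sketch, with made‑up tensor sizes and outlier values:

import torch

def int8_roundtrip_error(w):
    # Symmetric per-tensor int8 quantization followed by dequantization
    scale = w.abs().max() / 127
    w_q = torch.clamp((w / scale).round(), -128, 127)
    return (w - w_q * scale).abs().mean().item()

regular = torch.randn(1024, 768)
with_outliers = regular.clone()
with_outliers[0, :8] = 60.0  # a few artificially large entries

print("error without outliers:", int8_roundtrip_error(regular))
print("error with outliers:   ", int8_roundtrip_error(with_outliers))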
5. Solution Strategies (with Code Examples)
a. Mixed‑Precision Quantization
Use high precision for critical layers and lower bits elsewhere.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize linear layers to 8-bit but keep the embedding matrix (wte) in full precision
bnb_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_skip_modules=["wte"])
model = AutoModelForCausalLM.from_pretrained(
    "gpt2-medium", quantization_config=bnb_config, device_map="auto"
)
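A quick way to confirm the savings is the memory‑footprint helper on the loaded model (a sanity check, assuming the 8‑bit model loaded above):

# The 8-bit model should report roughly half the footprint of an fp16 load
print(f"Model size: {model.get_memory_footprint() / 1e6:.0f} MB")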
b. Quantization‑Aware Training (QAT)
Simulate quantization during fine‑tuning so weights adapt to low‑precision noise.
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments, BitsAndBytesConfig

# Load the base model with 4-bit (NF4) weights so fine-tuning sees low-precision noise.
# In practice the quantized base is kept frozen and trained through adapters (see section d).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("gpt2-medium", quantization_config=bnb_config)

training_args = TrainingArguments(
    output_dir="qat-output",
    per_device_train_batch_size=4,
    optim="paged_adamw_8bit",  # memory-efficient paged 8-bit AdamW from bitsandbytes
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_ds)
trainer.train()
c. Structured Pruning
Remove entire attention heads or neurons for hardware‑friendly sparsity.
import torch
from torch.nn.utils import prune

# Structured pruning: zero out 30% of rows of each attention output projection,
# ranked by L2 norm (n=2). This matches models built on torch.nn.MultiheadAttention;
# Hugging Face GPT-2 uses its own attention classes, so adapt the isinstance check accordingly.
for name, module in model.named_modules():
    if isinstance(module, torch.nn.MultiheadAttention):
        prune.ln_structured(module.out_proj, name="weight", amount=0.3, n=2, dim=0)
        prune.remove(module.out_proj, "weight")  # bake the pruning mask into the weights
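For Hugging Face transformer models specifically, the built‑in head‑pruning utility removes entire attention heads and shrinks the projection matrices accordingly; a short sketch, with the layer and head indices chosen arbitrarily for illustration:

# Remove heads 0 and 2 in layer 0, and head 5 in layer 3
model.prune_heads({0: [0, 2], 3: [5]})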
d. Hybrid QLoRA Approach
Freeze the base model in low‑bit precision and train small adapter matrices in higher precision.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit base model for training (casts layer norms, enables input gradients)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # module names are architecture-specific (e.g. "c_attn" for GPT-2)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# The base weights remain frozen in 4-bit; only the LoRA adapter parameters train in higher precision
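As a quick sanity check, PEFT can report how small the trainable portion actually is:

# Typically well under 1% of total parameters are trainable with r=8 on two projection matrices
model.print_trainable_parameters()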
e. Range Calibration for Quantization
Calibrate activation ranges on a representative dataset before quantization.
import torch

# Register forward hooks on Linear layers to record the min/max activation seen per layer
ranges = {}
def make_hook(name):
    def hook(module, inputs, output):
        prev = ranges.get(name, (output.min().item(), output.max().item()))
        ranges[name] = (min(prev[0], output.min().item()), max(prev[1], output.max().item()))
    return hook
hooks = [m.register_forward_hook(make_hook(n))
         for n, m in model.named_modules() if isinstance(m, torch.nn.Linear)]
with torch.no_grad():
    for batch in calib_loader:
        _ = model(**batch)
for h in hooks: h.remove()
# Use these ranges when applying symmetric or asymmetric quantization
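Once collected, the ranges translate directly into quantization parameters. A minimal sketch of the standard asymmetric 8‑bit formula, applied to the ranges dictionary built above:

def quant_params(min_val, max_val, num_bits=8):
    # Map the observed [min, max] range onto the integer grid [0, 2**num_bits - 1]
    qmax = 2 ** num_bits - 1
    scale = (max_val - min_val) / qmax
    zero_point = int(round(-min_val / scale))
    return scale, zero_point

# One (scale, zero_point) pair per calibrated layer
qparams = {name: quant_params(lo, hi) for name, (lo, hi) in ranges.items()}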
6. Conclusion
By combining mixed‑precision, quantization‑aware training, structured pruning, and hybrid adapter methods, practitioners can compress LLMs dramatically, often to 4–8× smaller footprints, while maintaining near‑original performance. These optimizations unlock deployment on edge devices, accelerate inference, and lower operational costs without sacrificing model quality.