Quantization
Reducing the precision of a model's weights to use less memory and run inference faster
Quantization is the process of converting a model's weights (and sometimes activations) from higher precision (e.g., FP32) to lower precision (e.g., INT8 or INT4), reducing memory usage and speeding up inference, usually at the cost of a small loss in accuracy.
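At its core, integer quantization maps each floating-point weight onto a small grid of integer values via a scale factor. Below is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy; it is illustrative only, and real schemes (such as the blockwise 4-bit NF4 format used by bitsandbytes in the example that follows) are more sophisticated.
import numpy as np
def quantize_int8(weights):
    # One scale for the whole tensor: map floats into the int8 range [-127, 127]
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale
def dequantize_int8(q, scale):
    # Recover an approximation of the original weights
    return q.astype(np.float32) * scale
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print(np.abs(w - w_hat).max())  # small quantization error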
Example
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# Load a model quantized to 4-bit
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto",
)
# 7B model: ~14 GB in FP16 → ~3.5 GB in 4-bit
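To confirm the saving on a loaded model, transformers provides get_memory_footprint(), which reports the bytes occupied by the model's parameters and buffers; expect roughly 3.5-4 GB for a 4-bit Llama-2-7B once quantization metadata is included.
# Report how much memory the quantized model actually occupies
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")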
Common Precision Levels
| Precision | Bits per Weight | 7B Model Size (weights only) |
|---|---|---|
| FP32 | 32 | ~28 GB |
| FP16 / BF16 | 16 | ~14 GB |
| INT8 | 8 | ~7 GB |
| INT4 | 4 | ~3.5 GB |
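The sizes above follow from a simple rule of thumb: weight memory ≈ parameter count × bits per weight / 8 bytes (weights only; activations, the KV cache, and quantization metadata add overhead). A quick back-of-the-envelope check:
# Approximate weight memory for a 7B-parameter model at each precision
params = 7e9
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{params * bits / 8 / 1e9:.1f} GB")
# 32-bit: ~28.0 GB, 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB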