Quantization
Reducing the precision of a model's weights to use less memory and run inference faster
Quantization is the process of converting a model's weights (and sometimes activations) from higher precision (e.g., FP32) to lower precision (e.g., INT8 or INT4), reducing memory usage and speeding up inference, usually at the cost of a small loss in accuracy.
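At its core, integer quantization maps each floating-point weight onto a small grid of integer values via a scale factor. Below is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy; it is illustrative only, and real schemes (such as the blockwise 4-bit NF4 format used by bitsandbytes in the example that follows) are more sophisticated.
import numpy as np
def quantize_int8(weights):
    # One scale for the whole tensor: map floats into the int8 range [-127, 127]
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale
def dequantize_int8(q, scale):
    # Recover an approximation of the original weights
    return q.astype(np.float32) * scale
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print(np.abs(w - w_hat).max())  # small quantization error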
Example
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# Load a model quantized to 4-bit
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto",
)
# 7B model: ~14 GB in FP16 → ~3.5 GB in 4-bit
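To confirm the saving on a loaded model, transformers provides get_memory_footprint(), which reports the bytes occupied by the model's parameters and buffers; expect roughly 3.5-4 GB for a 4-bit Llama-2-7B once quantization metadata is included.
# Report how much memory the quantized model actually occupies
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")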
Common Precision Levels
| Precision | Bits per Weight | 7B Model Size (weights only) |
|---|---|---|
| FP32 | 32 | ~28 GB |
| FP16 / BF16 | 16 | ~14 GB |
| INT8 | 8 | ~7 GB |
| INT4 | 4 | ~3.5 GB |
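The sizes above follow from a simple rule of thumb: weight memory ≈ parameter count × bits per weight / 8 bytes (weights only; activations, the KV cache, and quantization metadata add overhead). A quick back-of-the-envelope check:
# Approximate weight memory for a 7B-parameter model at each precision
params = 7e9
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: ~{params * bits / 8 / 1e9:.1f} GB")
# 32-bit: ~28.0 GB, 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB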