Quantization

Reducing model precision to use less memory and run inference faster

Quantization is the process of converting a model's weights (and sometimes activations) from higher precision (e.g., FP32) to lower precision (e.g., INT8 or INT4) to reduce memory usage and speed up inference.

Example

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the model with weights quantized to 4-bit on the fly
# (requires the bitsandbytes and accelerate packages)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto",
)
# 7B model: ~14 GB in FP16 → ~3.5 GB in 4-bit
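
Under the hood, most integer quantization schemes map floating-point values onto a small integer grid using a scale factor computed per tensor (or per channel). Below is a minimal sketch of symmetric per-tensor INT8 quantization, assuming PyTorch; the helper names are illustrative, not part of any library:

import torch

def quantize_int8(weights):
    # Symmetric scheme: map [-max|w|, +max|w|] onto the INT8 range [-127, 127]
    scale = weights.abs().max() / 127
    q = torch.round(weights / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate FP32 values; the rounding error is permanent
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)      # stand-in for an FP32 weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print((w - w_hat).abs().max())   # round-trip error is at most about scale / 2

The INT8 tensor takes a quarter of the memory of the FP32 original; the cost is the rounding error shown above, which is why lower-bit formats trade some accuracy for a smaller footprint.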

Common Precision Levels

Precision      Bits per Weight    7B Model Size
FP32           32                 ~28 GB
FP16 / BF16    16                 ~14 GB
INT8           8                  ~7 GB
INT4           4                  ~3.5 GB
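
These sizes follow directly from multiplying the parameter count by the bytes per weight. A quick back-of-the-envelope check in Python (real checkpoints run slightly larger, since quantized formats also store scales and other metadata):

params = 7e9  # 7 billion weights

for name, bits in [("FP32", 32), ("FP16 / BF16", 16), ("INT8", 8), ("INT4", 4)]:
    gigabytes = params * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{name}: ~{gigabytes:.1f} GB")
# FP32: ~28.0 GB ... INT4: ~3.5 GB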

See Also