---
title: "Quantization"
canonical: "https://www.thundercompute.com/glossary/inference/quantization"
description: "Reducing model precision to use less memory and compute faster"
sidebarTitle: "Quantization"
icon: "compress"
iconType: "solid"
---

**Quantization** is the process of converting a model's weights (and sometimes activations) from higher precision (e.g., FP32) to lower precision (e.g., INT8 or INT4) to reduce memory usage and speed up inference.
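
At its core, quantization maps each floating-point weight to a small integer plus a scale factor. The snippet below is a minimal sketch of symmetric per-tensor INT8 quantization in plain PyTorch; production libraries (e.g., bitsandbytes, GPTQ, AWQ) use more elaborate schemes such as per-channel or block-wise scales and outlier handling, but the core idea is the same.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor quantization: map the largest |weight| to 127
    scale = w.abs().max() / 127
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original FP32 weights
    return q.float() * scale

w = torch.randn(4096, 4096)                 # one FP32 weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

print(q.element_size() / w.element_size()) # 0.25 -> 4x less weight memory
print((w - w_hat).abs().max().item())      # small per-weight rounding error
```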

## Example

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize the weights to 4-bit as the model loads (requires the bitsandbytes package)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto",
)
# 7B model: ~14 GB in FP16 → ~3.5 GB in 4-bit
```

## Common Precision Levels

| Precision | Bits per Weight | 7B Model Size |
|-----------|----------------|---------------|
| FP32 | 32 | ~28 GB |
| FP16 / BF16 | 16 | ~14 GB |
| INT8 | 8 | ~7 GB |
| INT4 | 4 | ~3.5 GB |
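
These sizes are simply parameter count × bits per weight, ignoring activation memory and the small overhead of quantization metadata such as scale factors. A quick back-of-the-envelope check:

```python
# Weight memory ≈ parameter count × bits per weight / 8 bytes
params = 7e9  # a 7B-parameter model

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name:>10}: ~{gb:.1f} GB")
```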

## See Also

- [Inference](/inference/inference)
- [VRAM](/gpu-hardware/vram)
- [Tensor Cores](/gpu-hardware/tensor-cores)
