---
title: "Throughput vs. Latency"
canonical: "https://www.thundercompute.com/glossary/inference/throughput-vs-latency"
description: "Two key metrics for measuring inference performance"
sidebarTitle: "Throughput vs. Latency"
icon: "chart-line"
iconType: "solid"
---

**Throughput** is the number of requests a system completes per unit of time. **Latency** is how long a single request takes to finish. The two often trade off against each other: batching requests together raises throughput by keeping the hardware busy, but each individual request waits longer.

## Example

```
Scenario: Serving an LLM

Latency-optimized:
  - Batch size 1 → 50ms per request → 20 req/s

Throughput-optimized:
  - Batch size 32 → 200ms per request → 160 req/s
  - Each request is slower, but total work done is 8x higher
```
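
The arithmetic behind these numbers is simply `throughput = batch_size / batch_latency`. A minimal sketch in Python, reusing the illustrative timings from the scenario above (not measurements of any real system):

```python
def throughput(batch_size: int, batch_latency_s: float) -> float:
    """Requests completed per second, assuming every batch is full."""
    return batch_size / batch_latency_s

# Illustrative timings from the example above.
for batch_size, latency_s in [(1, 0.050), (32, 0.200)]:
    print(f"batch={batch_size:2d}  latency={latency_s * 1000:.0f}ms  "
          f"throughput={throughput(batch_size, latency_s):.0f} req/s")
# batch= 1  latency=50ms   throughput=20 req/s
# batch=32  latency=200ms  throughput=160 req/s
```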

## Key Differences

| Metric | Matters Most For | Typical Strategy |
|--------|------------------|------------------|
| **Latency** | Real-time apps, chatbots | Small batches, fast hardware |
| **Throughput** | Batch processing, serving at scale | Large batches, parallelism |
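
One way to act on the strategy column: given a latency budget (SLO), pick the largest batch size whose expected latency still fits. Below is a sketch under an assumed linear cost model; `base_ms` and `per_item_ms` are made-up parameters chosen to roughly match the example above, not properties of any real system:

```python
def estimated_latency_ms(batch_size: int,
                         base_ms: float = 40.0,
                         per_item_ms: float = 5.0) -> float:
    # Toy model: fixed overhead per batch plus a per-request cost.
    return base_ms + per_item_ms * batch_size

def largest_batch_within_slo(slo_ms: float, max_batch: int = 64) -> int:
    # Largest batch size whose estimated latency still meets the budget.
    fitting = [b for b in range(1, max_batch + 1)
               if estimated_latency_ms(b) <= slo_ms]
    return max(fitting) if fitting else 0

for slo_ms in (50, 100, 200):
    b = largest_batch_within_slo(slo_ms)
    if b:
        tput = b / (estimated_latency_ms(b) / 1000)
        print(f"SLO {slo_ms}ms → batch {b}, ~{tput:.0f} req/s")
    else:
        print(f"SLO {slo_ms}ms → no batch size fits")
```

With these toy parameters, a 200 ms budget lands on batch size 32 at about 160 req/s, matching the throughput-optimized scenario above, while a tight 50 ms budget forces small batches and much lower throughput.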

## See Also

- [Inference](/inference/inference)
- [Batch Size](/training/batch-size)
- [Memory Bandwidth](/memory/memory-bandwidth)
