Throughput vs. Latency
Two key metrics for measuring inference performance
Throughput is the number of requests a system completes per unit of time. Latency is how long a single request takes from submission to completion. The two often trade off against each other: the batching that raises throughput also makes each individual request wait longer.
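In practice both numbers fall out of the same timing loop. A minimal sketch of measuring them, assuming a synchronous `run_batch` callable that stands in for whatever executes one batch against the model (the function names and defaults here are illustrative, not from any particular serving library):

```python
import time

def measure(run_batch, batch_size: int, n_batches: int = 10):
    # `run_batch` is a placeholder: it should execute one batch of
    # `batch_size` requests against the model and block until done.
    start = time.perf_counter()
    for _ in range(n_batches):
        run_batch(batch_size)
    elapsed = time.perf_counter() - start
    latency_s = elapsed / n_batches                  # time per batch; per request when batch_size == 1
    throughput = (batch_size * n_batches) / elapsed  # requests completed per second
    return latency_s, throughput
```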
Example
Scenario: Serving an LLM
Latency-optimized:
- Batch size 1 → 50ms per request → 20 req/s
Throughput-optimized:
- Batch size 32 → 200ms per request → 160 req/s
- Each request takes 4x longer (200ms vs. 50ms), but total throughput is 8x higher (see the sketch below)
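The arithmetic behind those figures is just requests per batch divided by time per batch. A quick check (the helper name is ours, for illustration only):

```python
def req_per_s(batch_size: int, batch_latency_s: float) -> float:
    # Throughput = requests completed per batch / time per batch.
    return batch_size / batch_latency_s

print(req_per_s(1, 0.050))   # 20.0  -> latency-optimized
print(req_per_s(32, 0.200))  # 160.0 -> 8x the throughput, at 4x the latency
```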
Key Differences
| Metric | Optimized For | Typical Strategy |
|---|---|---|
| Latency | Real-time apps, chatbots | Small batch, fast hardware |
| Throughput | Batch processing, serving at scale | Large batch, parallelism |
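As a concrete illustration of the throughput row, here is a minimal sketch of dynamic batching, a common way servers assemble large batches out of independent incoming requests. Everything in it is assumed for illustration: `run_model` stands in for the real inference call, and the two constants are not tuned values.

```python
import queue
import time

MAX_BATCH = 32      # cap on requests per forward pass
MAX_WAIT_S = 0.010  # how long to hold a batch open for stragglers

requests: "queue.Queue[str]" = queue.Queue()

def run_model(batch: list[str]) -> None:
    time.sleep(0.2)  # pretend one forward pass takes 200ms regardless of batch size

def serve_forever() -> None:
    while True:
        batch = [requests.get()]  # block until at least one request arrives
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        # Larger batches raise aggregate throughput, but every request in
        # the batch waits for the whole batch: the latency cost of batching.
        run_model(batch)
```

The two constants are exactly the latency/throughput dial: raising `MAX_BATCH` or `MAX_WAIT_S` grows batches and throughput at the cost of per-request wait, while shrinking them approaches the batch-size-1, latency-optimized configuration.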