Tensor Parallelism
Splitting individual layers across multiple GPUs
Tensor parallelism splits individual weight matrices (tensors) across multiple GPUs. Each GPU computes a portion of a layer's output, and the partial results are combined with a collective operation. This reduces memory per GPU without the pipeline bubbles introduced by layer-wise (pipeline) model parallelism.
Example
A single linear layer: y = xW + b
where W is [4096 × 4096].
With 2-way tensor parallelism (splitting W by columns):
GPU 0 holds W[:, :2048] and b[:2048] → computes y₀
GPU 1 holds W[:, 2048:] and b[2048:] → computes y₁
y = concat(y₀, y₁)
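To make the arithmetic concrete, here is a minimal single-device sketch in PyTorch that simulates the 2-way column split above; the batch size and the random weights are illustrative only.

```python
import torch

# Full layer: y = x @ W + b, with W of shape [4096, 4096]
x = torch.randn(8, 4096)          # batch of 8 input activations
W = torch.randn(4096, 4096)
b = torch.randn(4096)

# Column-parallel split: each "GPU" holds half of W's output columns
# (simulated here on a single device).
W0, W1 = W[:, :2048], W[:, 2048:]
b0, b1 = b[:2048], b[2048:]

y0 = x @ W0 + b0                  # would run on GPU 0
y1 = x @ W1 + b1                  # would run on GPU 1
y = torch.cat([y0, y1], dim=-1)   # combine the partial outputs

# Matches the unsplit computation (up to floating-point accumulation order)
assert torch.allclose(y, x @ W + b, rtol=1e-4, atol=1e-4)
```

Because the split is along W's output columns, no partial sums need to be added; the only communication is gathering the two output halves.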
Key Points
- Requires fast inter-GPU interconnect (e.g., NVLink), since every layer ends with a collective communication step (sketched below)
- Each GPU stores only a fraction of each layer's weights
- Common in large language model training and serving (e.g., Megatron-LM)
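The combine step is typically an all-gather across the participating GPUs. Below is a hedged sketch of a column-parallel linear layer using torch.distributed, assuming two GPUs launched with `torchrun --nproc_per_node=2`; the function name `column_parallel_linear` and the shapes are illustrative, not Megatron-LM's actual API.

```python
import torch
import torch.distributed as dist

def column_parallel_linear(x, W_shard, b_shard):
    """Each rank multiplies by its shard of W's columns, then all ranks
    all-gather the partial outputs to reconstruct the full y."""
    y_local = x @ W_shard + b_shard
    gathered = [torch.empty_like(y_local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, y_local)          # inter-GPU communication
    return torch.cat(gathered, dim=-1)          # gathered[i] came from rank i

if __name__ == "__main__":
    dist.init_process_group("nccl")             # env vars set by torchrun
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    torch.manual_seed(rank)                      # each rank holds a different shard
    W_shard = torch.randn(4096, 2048, device="cuda")   # half of [4096, 4096]
    b_shard = torch.randn(2048, device="cuda")

    torch.manual_seed(0)                         # input activation replicated on every rank
    x = torch.randn(8, 4096, device="cuda")

    y = column_parallel_linear(x, W_shard, b_shard)    # full [8, 4096] output on every rank
    dist.destroy_process_group()
```

Each GPU stores only a [4096 × 2048] shard of the weight, which is where the per-GPU memory saving comes from.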