Tensor Parallelism
Splitting individual layers across multiple GPUs
Tensor parallelism splits individual weight matrices (tensors) across multiple GPUs. Each GPU computes a portion of a layer's output, and the partial results are combined with a collective operation. This reduces memory per GPU without the pipeline bubbles introduced by layer-wise (pipeline) model parallelism.
Example
A single linear layer: y = xW + b
where W is [4096 × 4096].
With 2-way tensor parallelism (splitting W by columns):
GPU 0 holds W[:, :2048] and b[:2048] → computes y₀
GPU 1 holds W[:, 2048:] and b[2048:] → computes y₁
y = concat(y₀, y₁)
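To make the arithmetic concrete, here is a minimal single-device sketch in PyTorch that simulates the 2-way column split above; the batch size and the random weights are illustrative only.

```python
import torch

# Full layer: y = x @ W + b, with W of shape [4096, 4096]
x = torch.randn(8, 4096)          # batch of 8 input activations
W = torch.randn(4096, 4096)
b = torch.randn(4096)

# Column-parallel split: each "GPU" holds half of W's output columns
# (simulated here on a single device).
W0, W1 = W[:, :2048], W[:, 2048:]
b0, b1 = b[:2048], b[2048:]

y0 = x @ W0 + b0                  # would run on GPU 0
y1 = x @ W1 + b1                  # would run on GPU 1
y = torch.cat([y0, y1], dim=-1)   # combine the partial outputs

# Matches the unsplit computation (up to floating-point accumulation order)
assert torch.allclose(y, x @ W + b, rtol=1e-4, atol=1e-4)
```

Because the split is along W's output columns, no partial sums need to be added; the only communication is gathering the two output halves.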
Key Points
- Requires fast inter-GPU interconnect (e.g., NVLink), since every layer ends with a collective communication step (sketched below)
- Each GPU stores only a fraction of each layer's weights
- Common in large language model training and serving (e.g., Megatron-LM)
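The combine step is typically an all-gather across the participating GPUs. Below is a hedged sketch of a column-parallel linear layer using torch.distributed, assuming two GPUs launched with `torchrun --nproc_per_node=2`; the function name `column_parallel_linear` and the shapes are illustrative, not Megatron-LM's actual API.

```python
import torch
import torch.distributed as dist

def column_parallel_linear(x, W_shard, b_shard):
    """Each rank multiplies by its shard of W's columns, then all ranks
    all-gather the partial outputs to reconstruct the full y."""
    y_local = x @ W_shard + b_shard
    gathered = [torch.empty_like(y_local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, y_local)          # inter-GPU communication
    return torch.cat(gathered, dim=-1)          # gathered[i] came from rank i

if __name__ == "__main__":
    dist.init_process_group("nccl")             # env vars set by torchrun
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    torch.manual_seed(rank)                      # each rank holds a different shard
    W_shard = torch.randn(4096, 2048, device="cuda")   # half of [4096, 4096]
    b_shard = torch.randn(2048, device="cuda")

    torch.manual_seed(0)                         # input activation replicated on every rank
    x = torch.randn(8, 4096, device="cuda")

    y = column_parallel_linear(x, W_shard, b_shard)    # full [8, 4096] output on every rank
    dist.destroy_process_group()
```

Each GPU stores only a [4096 × 2048] shard of the weight, which is where the per-GPU memory saving comes from.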