Tensor Parallelism

Splitting individual layers across multiple GPUs

Tensor parallelism splits individual weight matrices (tensors) across multiple GPUs. Each GPU computes a portion of a layer's output, and the partial results are then combined. This reduces per-GPU memory without the pipeline bubbles that layer-wise (pipeline) model parallelism introduces.

Example

A single linear layer: y = xW + b
Where W is [4096 × 4096]

With 2-way tensor parallelism (column split):
  GPU 0 holds W[:, :2048] and b[:2048]  →  computes y₀ = xW[:, :2048] + b[:2048]
  GPU 1 holds W[:, 2048:] and b[2048:]  →  computes y₁ = xW[:, 2048:] + b[2048:]
  y = concat(y₀, y₁) along the output dimension
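
The snippet below is a minimal single-device sketch of this column split, written in PyTorch. The shapes and names (x, W, b, y0, y1) follow the example above; in a real deployment each shard would live on its own GPU and the concatenation would be a cross-GPU collective, but keeping both shards on one device is enough to check the arithmetic.

import torch

torch.manual_seed(0)
x = torch.randn(1, 4096)         # input activation
W = torch.randn(4096, 4096)      # full weight matrix
b = torch.randn(4096)            # bias, split the same way as W's columns

# Shard held by "GPU 0" and shard held by "GPU 1" (both on one device here)
W0, W1 = W[:, :2048], W[:, 2048:]
b0, b1 = b[:2048], b[2048:]

y0 = x @ W0 + b0                 # partial output computed by GPU 0
y1 = x @ W1 + b1                 # partial output computed by GPU 1
y = torch.cat([y0, y1], dim=1)   # combine along the output dimension

# Matches the unsplit computation y = xW + b
print(torch.allclose(y, x @ W + b, atol=1e-5))

Because the split is along W's columns, every GPU still needs the full input x; only the output is partitioned, which is why the combine step is a concatenation rather than a sum.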

Key Points

  • Requires fast inter-GPU communication (e.g., NVLink), since combining the partial results is a collective operation (see the sketch after this list)
  • Each GPU stores only a fraction of each layer's weights
  • Common in large language model training and serving (e.g., Megatron-LM)
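
To illustrate the communication step, the following sketch runs the same column split across two processes with torch.distributed, using the gloo backend so it works on CPU; on real GPUs the nccl backend over NVLink would be used instead. The process-group setup, port number, tensor sizes, and the run_worker helper are illustrative, not part of any particular framework.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run_worker(rank, world_size, W_full, x):
    # Minimal process-group setup; the address and port are placeholders.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each rank keeps only its column shard of the weight matrix.
    W_shard = W_full.chunk(world_size, dim=1)[rank]
    y_local = x @ W_shard                          # partial output on this rank

    # Gather every rank's partial output and concatenate along the columns;
    # this all-gather is the inter-GPU communication tensor parallelism needs.
    gathered = [torch.empty_like(y_local) for _ in range(world_size)]
    dist.all_gather(gathered, y_local)
    y = torch.cat(gathered, dim=1)

    if rank == 0:
        print("max error vs. single-device matmul:",
              (y - x @ W_full).abs().max().item())
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    torch.manual_seed(0)
    W_full = torch.randn(8, 8)   # small sizes so the demo runs instantly
    x = torch.randn(4, 8)
    mp.spawn(run_worker, args=(world_size, W_full, x), nprocs=world_size)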

See Also