
Data Parallelism

Distributing training data across multiple GPUs, each with a copy of the model

Data parallelism splits a training batch across multiple GPUs. Each GPU holds a full copy of the model and processes its own portion of the data. Gradients are synchronized (averaged) across GPUs after each backward pass, so every model copy applies the same update and stays identical.

Example

import torch
import torch.nn as nn

# A stand-in model; substitute your own nn.Module
model = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()
model = nn.DataParallel(model)  # wrap for data parallelism

# Each forward pass is automatically split across available GPUs
input_batch = torch.randn(64, 1024)  # one training batch of 64 samples
output = model(input_batch.cuda())
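
nn.DataParallel runs in a single process: it scatters the input along the batch dimension, replicates the model onto each GPU for every forward call, and gathers the outputs back. For multi-GPU or multi-node training, PyTorch generally recommends DistributedDataParallel (one process per GPU), sketched under How It Works below.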

How It Works

  1. Split the batch across N GPUs
  2. Each GPU runs the forward + backward pass on its portion
  3. Gradients are all-reduced (averaged) across GPUs
  4. Each GPU updates its model copy with the same gradients
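
A minimal sketch of these four steps using PyTorch's DistributedDataParallel, which launches one process per GPU and performs the gradient all-reduce automatically during backward. The model, batch shapes, and the 127.0.0.1:29500 rendezvous address are illustrative placeholders; it assumes a single machine with NCCL-capable GPUs.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def train(rank, world_size):
    # One process per GPU; each process holds a full model replica
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # placeholder rendezvous address
    os.environ["MASTER_PORT"] = "29500"      # placeholder port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(1024, 10).cuda(rank)  # stand-in for your model
    model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Step 1: each rank loads its own shard of the batch
    inputs = torch.randn(16, 1024, device=rank)
    targets = torch.randint(0, 10, (16,), device=rank)

    # Steps 2-3: forward + backward; DDP all-reduces (averages) gradients
    loss = nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()

    # Step 4: identical gradients, so every replica applies the same update
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)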
