Data Parallelism
Distributing training data across multiple GPUs, each with a copy of the model
Data parallelism splits a training batch across multiple GPUs. Each GPU holds a full copy of the model and processes its own slice of the data. After each backward pass the gradients are synchronized (averaged) across GPUs, so every model copy applies the same update.
Example
import torch
import torch.nn as nn

model = nn.Linear(512, 10).cuda()   # stand-in for any nn.Module
model = nn.DataParallel(model)      # wrap for data parallelism

# Each forward pass replicates the model, splits the batch across
# available GPUs, and gathers the outputs back on the first GPU
input_batch = torch.randn(64, 512)
output = model(input_batch.cuda())
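Note that DataParallel runs in a single process: on every forward pass it copies the model to each GPU, scatters the batch along its first dimension, and gathers the outputs on the first GPU. For better scaling, and for multi-node training, PyTorch recommends DistributedDataParallel instead; a sketch of it follows the step list below.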
How It Works
- Split the batch across N GPUs
- Each GPU runs the forward + backward pass on its portion
- Gradients are all-reduced (averaged) across GPUs
- Each GPU updates its model copy with the same averaged gradients, so all copies stay identical (see the sketch below)
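Below is a minimal sketch of these four steps using torch.distributed and DistributedDataParallel. The nn.Linear model, the random dataset, and the batch size are placeholders, and it assumes a single node so the process rank doubles as the local GPU index. It is meant to be launched with torchrun, which starts one process per GPU (e.g. torchrun --nproc_per_node=4 train.py).

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")   # one process per GPU, set up by torchrun
rank = dist.get_rank()
torch.cuda.set_device(rank)               # single-node assumption: rank == GPU index

# Step 1: DistributedSampler hands each process a disjoint shard of the data
dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=32, sampler=DistributedSampler(dataset))

# Full model copy on every GPU, wrapped so gradients are synchronized
model = DDP(nn.Linear(512, 10).cuda(rank), device_ids=[rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for x, y in loader:
    # Step 2: each GPU runs forward + backward on its own shard
    loss = loss_fn(model(x.cuda(rank)), y.cuda(rank))
    optimizer.zero_grad()
    loss.backward()        # Step 3: DDP all-reduces (averages) gradients during backward
    optimizer.step()       # Step 4: every copy applies the same averaged gradients

dist.destroy_process_group()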