Data Parallelism
Distributing training data across multiple GPUs, each with a copy of the model
Data parallelism splits a training batch across multiple GPUs. Each GPU holds a full copy of the model and processes its own slice of the data. After each backward pass the gradients are synchronized (averaged) across GPUs, so every model copy applies the same update.
Example
import torch
import torch.nn as nn

model = nn.Linear(512, 10).cuda()   # stand-in for any nn.Module
model = nn.DataParallel(model)      # wrap for data parallelism

# Each forward pass replicates the model, splits the batch across
# available GPUs, and gathers the outputs back on the first GPU
input_batch = torch.randn(64, 512)
output = model(input_batch.cuda())
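Note that DataParallel runs in a single process: on every forward pass it copies the model to each GPU, scatters the batch along its first dimension, and gathers the outputs on the first GPU. For better scaling, and for multi-node training, PyTorch recommends DistributedDataParallel instead; a sketch of it follows the step list below.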
How It Works
- Split the batch across N GPUs
- Each GPU runs the forward + backward pass on its portion
- Gradients are all-reduced (averaged) across GPUs
- Each GPU updates its model copy with the same averaged gradients, so all copies stay identical (see the sketch below)
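Below is a minimal sketch of these four steps using torch.distributed and DistributedDataParallel. The nn.Linear model, the random dataset, and the batch size are placeholders, and it assumes a single node so the process rank doubles as the local GPU index. It is meant to be launched with torchrun, which starts one process per GPU (e.g. torchrun --nproc_per_node=4 train.py).

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")   # one process per GPU, set up by torchrun
rank = dist.get_rank()
torch.cuda.set_device(rank)               # single-node assumption: rank == GPU index

# Step 1: DistributedSampler hands each process a disjoint shard of the data
dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=32, sampler=DistributedSampler(dataset))

# Full model copy on every GPU, wrapped so gradients are synchronized
model = DDP(nn.Linear(512, 10).cuda(rank), device_ids=[rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for x, y in loader:
    # Step 2: each GPU runs forward + backward on its own shard
    loss = loss_fn(model(x.cuda(rank)), y.cuda(rank))
    optimizer.zero_grad()
    loss.backward()        # Step 3: DDP all-reduces (averages) gradients during backward
    optimizer.step()       # Step 4: every copy applies the same averaged gradients

dist.destroy_process_group()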