---
title: "Data Parallelism"
canonical: "https://www.thundercompute.com/glossary/parallelism/data-parallelism"
description: "Distributing training data across multiple GPUs, each with a copy of the model"
sidebarTitle: "Data Parallelism"
icon: "clone"
iconType: "solid"
---

**Data parallelism** splits each training batch across multiple GPUs. Every GPU holds a full replica of the model and runs the forward and backward pass on its own shard of the data. Gradients are then averaged across GPUs before the optimizer step, so all replicas apply the same update.

## Example

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10).cuda()   # stand-in for any nn.Module
model = nn.DataParallel(model)      # wrap for data parallelism

# Each forward pass is automatically split across available GPUs
input_batch = torch.randn(64, 512)
output = model(input_batch.cuda())  # shape: (64, 10)
```
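`nn.DataParallel` is single-process and multi-threaded, and PyTorch's documentation recommends `DistributedDataParallel` (DDP) instead, even on a single machine. A minimal sketch of the DDP equivalent, assuming a `torchrun` launch (the file name `train.py` is illustrative):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(512, 10).to(f"cuda:{local_rank}")
    model = DDP(model, device_ids=[local_rank])

    # Each process sees its own shard of the global batch
    inputs = torch.randn(64, 512, device=f"cuda:{local_rank}")
    loss = model(inputs).sum()
    loss.backward()  # gradients are all-reduced across processes here

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with `torchrun --nproc_per_node=4 train.py`, this runs one process per GPU; `torchrun` supplies the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables each process reads.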

## How It Works

1. Split the batch across N GPUs
2. Each GPU runs the forward + backward pass on its portion
3. Gradients are all-reduced (averaged) across GPUs, which reproduces the gradient of the full batch (see the sketch after this list)
4. Each GPU updates its model copy with the same gradients

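Step 3 is what keeps the replicas consistent: averaging per-shard gradients over equal-sized shards matches the gradient computed on the full batch. A minimal single-device sketch, simulating two "GPUs" by slicing one batch in half (all names are illustrative):

```python
import torch

torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)
x = torch.randn(8, 4)   # full batch of 8 samples
y = torch.randn(8)

def grad_on(shard_x, shard_y):
    # Gradient of a mean-squared-error loss on one shard
    loss = ((shard_x @ w - shard_y) ** 2).mean()
    g, = torch.autograd.grad(loss, w)
    return g

g_full = grad_on(x, y)                                    # full-batch gradient
g_avg = (grad_on(x[:4], y[:4]) + grad_on(x[4:], y[4:])) / 2  # "all-reduce"
print(torch.allclose(g_full, g_avg))  # True
```

Because every replica starts from the same weights and applies this same averaged gradient, the model copies stay identical from step to step.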
## See Also

- [Model Parallelism](/parallelism/model-parallelism)
- [Tensor Parallelism](/parallelism/tensor-parallelism)
- [Batch Size](/training/batch-size)
