To train large models using PyTorch, we need to go parallel. There are three commonly used strategies[1][2][3]:

1. model parallelism,
2. data parallelism,
3. data-model parallelism.

## Model Parallelism

Model parallelism splits the model across different nodes[1][4]. We will focus on data parallelism, but the key idea is shown in the following illustration.
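As a rough sketch of the idea, the snippet below places the two halves of a small model on different devices and moves the activations across the boundary. The class name `TwoStageModel`, the layer sizes, and the use of two GPUs on one machine (rather than separate nodes) are assumptions made for illustration; the code falls back to CPU when two GPUs are not available.

```python
import torch
import torch.nn as nn

# Use two GPUs when available; otherwise fall back to CPU so the sketch runs.
two_gpus = torch.cuda.is_available() and torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0" if two_gpus else "cpu")
dev1 = torch.device("cuda:1" if two_gpus else "cpu")


class TwoStageModel(nn.Module):
    """Hypothetical model whose two halves live on different devices."""

    def __init__(self):
        super().__init__()
        self.stage0 = nn.Linear(8, 16).to(dev0)
        self.stage1 = nn.Linear(16, 2).to(dev1)

    def forward(self, x):
        hidden = torch.relu(self.stage0(x.to(dev0)))
        # Activations are moved between devices at the stage boundary.
        return self.stage1(hidden.to(dev1))


model = TwoStageModel()
out = model(torch.randn(4, 8))
print(out.shape)
```

Each device only holds the parameters of its own stage, which is what lets a model that does not fit on one device be trained at all.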

## Data Parallelism

Data parallelism creates a replica of the model on each device and feeds each replica a different subset of the training data[1][4].

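Data parallelism is based on the additive property of the loss gradient: the gradient of a loss summed over a batch equals the sum of the gradients computed on disjoint subsets of that batch.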
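The additive property can be checked by hand on a toy problem. The following minimal sketch assumes a one-parameter linear model with a summed squared-error loss; the function name `grad_w` and the data are made up for illustration. It compares the gradient on the full batch with the sum of the gradients on two shards.

```python
# Toy linear model y_hat = w * x with a summed squared-error loss.
def grad_w(w, xs, ys):
    """Gradient of sum_i (w * x_i - y_i)^2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Gradient computed on the whole batch at once.
full_grad = grad_w(w, xs, ys)

# Split the batch into two shards, as two devices would see it,
# then add the per-shard gradients.
shard_grads = [grad_w(w, xs[:2], ys[:2]), grad_w(w, xs[2:], ys[2:])]
combined_grad = sum(shard_grads)

print(full_grad, combined_grad)
```

When the loss is a mean rather than a sum, the same argument holds with the per-shard gradients averaged instead of added, provided the shards have equal size.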

There are two ready-to-use data-parallel paradigms in PyTorch: DataParallel and DistributedDataParallel.

### Jargon

To better understand what happens in the strategies, we recommend reading the following.

### DataParallel in PyTorch

DataParallel is a strategy implemented in PyTorch using multi-threading.

The above illustration shows that the master GPU, “GPU 0”, acts as the coordinator and also does more computation than the others. This creates an imbalance in GPU usage.

Because it relies on multi-threading, DataParallel also suffers from contention on Python's global interpreter lock (GIL)[5].
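For reference, wrapping a model in DataParallel takes a single line. The sketch below assumes a small stand-in model with arbitrary layer sizes. With several visible GPUs the wrapper splits each batch along the first dimension and gathers the outputs on the first device; without a GPU it simply calls the wrapped module, so the snippet also runs on CPU.

```python
import torch
import torch.nn as nn

# A small stand-in model; the layer sizes are arbitrary.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

# DataParallel expects the parameters to live on the first GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Single process: the batch is scattered along dim 0 across visible GPUs,
# and the outputs are gathered back on the first device.
dp_model = nn.DataParallel(model)

batch = torch.randn(32, 8, device=device)
out = dp_model(batch)
print(out.shape)
```

The training loop itself stays unchanged, which is the main appeal of this wrapper despite the imbalance described above.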

### DistributedDataParallel in PyTorch

DistributedDataParallel (DDP) is a strategy implemented in PyTorch using multi-processing. DDP is the recommended method in the PyTorch documentation.
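A minimal DDP training step might look like the sketch below. To keep it runnable anywhere, it makes several assumptions that a real job would change: a single process (`world_size=1`), the CPU-friendly `gloo` backend, a temporary file for rendezvous, and random data. The function name `run` and the layer sizes are illustrative only.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank: int, world_size: int, init_file: str) -> float:
    # Every process joins the same group; "gloo" works on CPU.
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )
    try:
        model = DDP(nn.Linear(8, 2))  # CPU module, so no device_ids
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        inputs, targets = torch.randn(16, 8), torch.randn(16, 2)
        loss = nn.functional.mse_loss(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()  # gradients are synchronized across processes here
        optimizer.step()
        return loss.item()
    finally:
        dist.destroy_process_group()


# Single-process demo so the sketch runs without a launcher.
with tempfile.TemporaryDirectory() as tmp:
    loss_value = run(rank=0, world_size=1, init_file=os.path.join(tmp, "ddp_init"))
print(loss_value)
```

In a multi-GPU job, one such process is typically started per GPU (for example with `torchrun`), and each rank reads its own shard of the data, so no single device plays the coordinator role.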

