To train large models using PyTorch, we need to go parallel. There are three commonly used strategies[1][2][3]:

1. model parallelism,
2. data parallelism,
3. data-model parallelism.

## Model Parallelism

Model parallelism splits the model across different nodes[1][4]. We will focus on data parallelism, but the key idea of model parallelism is shown in the following illustration.
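To make the key idea concrete, here is a minimal sketch of model parallelism: a hypothetical two-stage network whose layers live on different devices, with activations moved between them during the forward pass. The `TwoStageNet` name and layer sizes are illustrative; the sketch falls back to CPU for both stages when two GPUs are not available.

```python
import torch
import torch.nn as nn

# Place the two stages on separate GPUs if we have them; otherwise
# run both on CPU so the sketch stays executable anywhere.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    dev0 = dev1 = torch.device("cpu")

class TwoStageNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(16, 32).to(dev0)  # first half of the model on device 0
        self.stage2 = nn.Linear(32, 4).to(dev1)   # second half on device 1

    def forward(self, x):
        x = torch.relu(self.stage1(x.to(dev0)))
        return self.stage2(x.to(dev1))  # activations cross the device boundary here

model = TwoStageNet()
out = model(torch.randn(8, 16))
print(out.shape)  # torch.Size([8, 4])
```

Note that each input still flows through every stage; only the parameters are partitioned, which is what distinguishes this from data parallelism below.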

## Data Parallelism

Data parallelism creates a replica of the model on each device and feeds each replica a different subset of the training data[1][4].

Data parallelism is based on the additive property of the loss gradient: the gradient of a sum of per-example losses equals the sum of the per-example gradients, so each replica can compute gradients on its own shard of data and the partial gradients can simply be summed (or averaged) across replicas.
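The additive property can be checked numerically without any framework. The sketch below uses a hypothetical 1-D least-squares loss, l_i(w) = (w·x_i − y_i)², whose per-example gradient is 2·x_i·(w·x_i − y_i), and splits the batch between two pretend "workers":

```python
# Numeric check of the additive gradient property behind data parallelism:
# for a summed loss L(w) = sum_i l_i(w), dL/dw equals the sum of the
# per-example gradients dl_i/dw.

def per_example_grad(w, x, y):
    # Gradient of l(w) = (w*x - y)**2 with respect to w.
    return 2.0 * x * (w * x - y)

xs = [0.5, 1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0, 5.0]
w = 0.7

# Split the batch across two "workers", each computing a partial gradient.
worker_a = sum(per_example_grad(w, x, y) for x, y in zip(xs[:2], ys[:2]))
worker_b = sum(per_example_grad(w, x, y) for x, y in zip(xs[2:], ys[2:]))

# Full-batch gradient computed in one place.
full = sum(per_example_grad(w, x, y) for x, y in zip(xs, ys))

# Summing the workers' partial gradients recovers the full-batch gradient,
# which is exactly the property data parallelism exploits.
assert abs((worker_a + worker_b) - full) < 1e-9
```

This is why replicas only need to exchange gradients, never raw data: the sum of local gradients is mathematically identical to the gradient of the full batch.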

There are two ready-to-use data-parallel implementations in PyTorch: DataParallel and DistributedDataParallel.

### Jargon

To better understand what happens in these strategies, we recommend reading the following.

### DataParallel in PyTorch

DataParallel is a strategy implemented in PyTorch using multi-threading: a single process drives all the devices, with one thread per GPU.

The above illustration shows that the master GPU, “GPU 0”, coordinates the other devices and also does more work than they do, which creates an imbalance in GPU utilization.

Because it is multi-threaded, DataParallel also suffers from Python's Global Interpreter Lock (GIL)[5].
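A minimal usage sketch: `nn.DataParallel` scatters the input batch across the visible GPUs, runs a model replica on each, and gathers the outputs back on the master GPU. The model and shapes below are illustrative; with no GPUs present, the wrapper simply runs the wrapped module on CPU, so the sketch stays runnable.

```python
import torch
import torch.nn as nn

# Wrap any module; with GPUs available, the batch dimension is split
# across them and outputs are gathered on device 0. Without GPUs the
# wrapper falls back to calling the module directly.
model = nn.DataParallel(nn.Linear(10, 2))
if torch.cuda.is_available():
    model = model.cuda()

x = torch.randn(32, 10)   # one batch; DataParallel handles the scatter
out = model(x)            # replicas run in threads, outputs are gathered
print(out.shape)          # torch.Size([32, 2])
```

Note that all of this happens inside one Python process, which is exactly why the GIL becomes a bottleneck here.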

### DistributedDataParallel in PyTorch

DistributedDataParallel (DDP) is a strategy implemented in PyTorch using multiprocessing: each device is driven by its own process, which sidesteps the GIL. DDP is the recommended method in the PyTorch documentation.
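A minimal DDP sketch is shown below. To keep it runnable without GPUs, it uses a single process with the CPU "gloo" backend; in real training you would launch one such process per device (for example with `torchrun`), and the rank, world size, and master address would come from the launcher rather than being hardcoded as they are here.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group on CPU for illustration only.
# The address/port values are placeholders a launcher would normally set.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# DDP synchronizes gradients across ranks with an all-reduce
# during backward(); with world_size=1 that is a no-op, but the
# code path is the same as in multi-process training.
model = DDP(nn.Linear(10, 2))
x = torch.randn(8, 10)
loss = model(x).sum()
loss.backward()  # gradients are all-reduced across processes here

print(model.module.weight.grad.shape)  # torch.Size([2, 10])
dist.destroy_process_group()
```

Because each rank is a separate process with its own interpreter, DDP avoids both the GIL contention and the master-GPU imbalance described above.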
