ColossalAI/docs/source/en/concepts/paradigms_of_parallelism.md

125 lines
7.2 KiB
Markdown

# Paradigms of Parallelism
Author: Shenggui Li, Siqi Mai
## Introduction
With the development of deep learning, there is an increasing demand for parallel training. This is because that model
and datasets are getting larger and larger and training time becomes a nightmare if we stick to single-GPU training. In
this section, we will provide a brief overview of existing methods to parallelize training. If you wish to add on to this
post, you may create a discussion in the [GitHub forum](https://github.com/hpcaitech/ColossalAI/discussions).
## Data Parallel
Data parallel is the most common form of parallelism due to its simplicity. In data parallel training, the dataset is split
into several shards, each shard is allocated to a device. This is equivalent to parallelize the training process along the
batch dimension. Each device will hold a full copy of the model replica and trains on the dataset shard allocated. After
back-propagation, the gradients of the model will be all-reduced so that the model parameters on different devices can stay
synchronized.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/WSAensMqjwHdOlR.png"/>
<figcaption>Data parallel illustration</figcaption>
</figure>
## Model Parallel
In data parallel training, one prominent feature is that each GPU holds a copy of the whole model weights. This brings
redundancy issue. Another paradigm of parallelism is model parallelism, where model is split and distributed over an array
of devices. There are generally two types of parallelism: tensor parallelism and pipeline parallelism. Tensor parallelism is
to parallelize computation within an operation such as matrix-matrix multiplication. Pipeline parallelism is to parallelize
computation between layers. Thus, from another point of view, tensor parallelism can be seen as intra-layer parallelism and
pipeline parallelism can be seen as inter-layer parallelism.
### Tensor Parallel
Tensor parallel training is to split a tensor into `N` chunks along a specific dimension and each device only holds `1/N`
of the whole tensor while not affecting the correctness of the computation graph. This requires additional communication
to make sure that the result is correct.
Taking a general matrix multiplication as an example, let's say we have C = AB. We can split B along the column dimension
into `[B0 B1 B2 ... Bn]` and each device holds a column. We then multiply `A` with each column in `B` on each device, we
will get `[AB0 AB1 AB2 ... ABn]`. At this moment, each device still holds partial results, e.g. device rank 0 holds `AB0`.
To make sure the result is correct, we need to all-gather the partial result and concatenate the tensor along the column
dimension. In this way, we are able to distribute the tensor over devices while making sure the computation flow remains
correct.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/2ZwyPDvXANW4tMG.png"/>
<figcaption>Tensor parallel illustration</figcaption>
</figure>
In Colossal-AI, we provide an array of tensor parallelism methods, namely 1D, 2D, 2.5D and 3D tensor parallelism. We will
talk about them in detail in `advanced tutorials`.
Related paper:
- [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](https://arxiv.org/abs/2006.16668)
- [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
- [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
- [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
### Pipeline Parallel
Pipeline parallelism is generally easy to understand. If you recall your computer architecture course, this indeed exists
in the CPU design.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/at3eDv7kKBusxbd.png"/>
<figcaption>Pipeline parallel illustration</figcaption>
</figure>
The core idea of pipeline parallelism is that the model is split by layer into several chunks, each chunk is
given to a device. During the forward pass, each device passes the intermediate activation to the next stage. During the backward pass,
each device passes the gradient of the input tensor back to the previous pipeline stage. This allows devices to compute simultaneously,
and increases the training throughput. One drawback of pipeline parallel training is that there will be some bubble time where
some devices are engaged in computation, leading to waste of computational resources.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/sDNq51PS3Gxbw7F.png"/>
<figcaption>Source: <a href="https://arxiv.org/abs/1811.06965">GPipe</a></figcaption>
</figure>
Related paper:
- [PipeDream: Fast and Efficient Pipeline Parallel DNN Training](https://arxiv.org/abs/1806.03377)
- [GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://arxiv.org/abs/1811.06965)
- [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- [Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines](https://arxiv.org/abs/2107.06925)
## Optimizer-Level Parallel
Another paradigm works at the optimizer level, and the current most famous method of this paradigm is ZeRO which stands
for [zero redundancy optimizer](https://arxiv.org/abs/1910.02054). ZeRO works at three levels to remove memory redundancy
(fp16 training is required for ZeRO):
- Level 1: The optimizer states are partitioned across the processes
- Level 2: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process
only stores the gradients corresponding to its partition of the optimizer states.
- Level 3: The 16-bit model parameters are partitioned across the processes
Related paper:
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
## Parallelism on Heterogeneous System
The methods mentioned above generally require a large number of GPU to train a large model. However, it is often neglected
that CPU has a much larger memory compared to GPU. On a typical server, CPU can easily have several hundred GB RAM while each GPU
typically only has 16 or 32 GB RAM. This prompts the community to think why CPU memory is not utilized for distributed training.
Recent advances rely on CPU and even NVMe disk to train large models. The main idea is to offload tensors back to CPU memory
or NVMe disk when they are not used. By using the heterogeneous system architecture, it is possible to accommodate a huge
model on a single machine.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/qLHD5lk97hXQdbv.png"/>
<figcaption>Heterogenous system illustration</figcaption>
</figure>
Related paper:
- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
- [PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management](https://arxiv.org/abs/2108.05818)