Merge pull request #4 from hpcaitech/hotfix/doc

reorder parallelization methods in parallelization documentation
Frank Lee 2021-11-02 14:35:06 +08:00 committed by GitHub
commit ccbc918c11
1 changed file with 49 additions and 49 deletions


@@ -29,6 +29,54 @@ not have to explicitly set them in your configurations. When data parallel size
is larger than 1, Colossal-AI automatically adds the distributed data sampler to the dataloader to shard the dataset.
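To make that automatic step concrete, here is a minimal plain-PyTorch sketch of sharding a dataset with a distributed sampler; the toy dataset, the world size of 4 and the batch size are illustrative assumptions, and this is not Colossal-AI's internal code:
```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset; in practice this would be your training set.
dataset = TensorDataset(torch.arange(1024).float())

# Each rank sees a disjoint shard of the dataset.
# num_replicas/rank are normally inferred from the initialized process group.
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# Re-seed the sampler's shuffling each epoch so shards vary across epochs.
for epoch in range(2):
    sampler.set_epoch(epoch)
    for batch in loader:
        pass  # forward/backward would go here
```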
## 1D, 2D, 2.5D and 3D Parallel
To enable hybrid parallelism, we provide an array of tensor parallelism methods. The papers corresponding to each
tensor parallel method are listed below. These parallel modes need to work with the distributed layers provided by Colossal-AI.
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data,
model weights and layer outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$
devices, where $N$ is the number of tensor chunks in a single dimension (a minimal SUMMA sketch follows the configuration examples below).
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism which further
parallelizes 2D tensor parallelism. A total of $P = N^2 d$ processors are arranged into $d$ layers,
where each layer performs matrix multiplication operations independently with a dimension $N$. For example, $N = 2$ and
$d = 2$ give $P = 8$ devices, matching the `size=8, depth=2` configuration below.
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method achieves
the optimal $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage are evenly distributed
through optimized load balancing of parameters as well as activations.
```python
# 1D parallel
parallel = dict(
    pipeline=dict(size=1),  # number of pipeline stages
    tensor=dict(size=4, mode='1d')
)

# 2D parallel
parallel = dict(
    pipeline=dict(size=1),  # number of pipeline stages
    tensor=dict(size=4, mode='2d')
)

# 2.5D parallel
parallel = dict(
    pipeline=dict(size=1),  # number of pipeline stages
    tensor=dict(size=8, mode='2.5d', depth=2)
)

# 3D parallel
parallel = dict(
    pipeline=dict(size=1),  # number of pipeline stages
    tensor=dict(size=8, mode='3d')
)
```
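As referenced in the 2D bullet above, the following is a minimal single-process sketch of the SUMMA-style partitioning behind 2D tensor parallelism: both operands are split over an $N \times N$ mesh ($P = N^2$ cells), and each cell accumulates its output block over $N$ steps. The matrix sizes and the NumPy implementation are illustrative assumptions, not Colossal-AI code.
```python
import numpy as np

# Single-process sketch of SUMMA-style 2D partitioning.
N = 2                      # mesh side length -> P = N^2 = 4 "devices"
A = np.random.randn(8, 8)  # activations
B = np.random.randn(8, 8)  # weights

# Partition both matrices into N x N blocks, one block per mesh cell.
A_blocks = [np.hsplit(row, N) for row in np.vsplit(A, N)]
B_blocks = [np.hsplit(row, N) for row in np.vsplit(B, N)]
C_blocks = [[np.zeros((8 // N, 8 // N)) for _ in range(N)] for _ in range(N)]

# SUMMA outer loop: at step k, A[i][k] is broadcast along mesh row i and
# B[k][j] along mesh column j; every cell (i, j) accumulates a partial product.
for k in range(N):
    for i in range(N):
        for j in range(N):
            C_blocks[i][j] += A_blocks[i][k] @ B_blocks[k][j]

# Reassembling the blocks reproduces the dense product.
assert np.allclose(np.block(C_blocks), A @ B)
```
In the actual distributed setting each of the $P$ devices owns exactly one block of each operand, the loops over $i$ and $j$ run concurrently across the mesh, and only the row and column broadcasts at each step $k$ require communication.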
## Pipeline Parallel (experimental)
Pipeline parallelism splits the model into several partitions by layer. For example, let's assume we have a simple
@@ -160,57 +208,9 @@ schedule = dict(
)
```
## Sequence Parallel (experimental)
Sequence parallelism supports long-sequence modelling such as document-level text understanding and medical imaging.
This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
This feature is still in development and is only experimental for now.