mirror of https://github.com/hpcaitech/ColossalAI
Merge pull request #4 from hpcaitech/hotfix/doc
reorder parallelization methods in parallelization documentation
commit ccbc918c11
@@ -29,6 +29,54 @@ not have to explicitly set them in your configurations. When data parallel size
adds the distributed data sampler to the dataloader to shard the dataset.
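Conceptually, this sharding is what a distributed sampler does in plain PyTorch. The snippet below is a minimal, illustrative sketch of the idea using `torch.utils.data.distributed.DistributedSampler`; it is not Colossal-AI's own dataloader wrapper, and the dataset, world size, and rank values are placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset; in practice this is your training set.
dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))

# Each data parallel rank sees a disjoint shard of the dataset.
# num_replicas and rank would normally come from the initialized process group.
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle the shards differently every epoch
    for inputs, labels in loader:
        pass  # forward/backward would go here
```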
## 1D, 2D, 2.5D and 3D Parallel
To enable hybrid parallelism, we provide an array of tensor parallelism methods and list the paper that each
tensor parallel method is based on. These parallel modes need to work with the distributed layers provided by Colossal-AI.

- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)

- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2D parallelism relies on the SUMMA matrix multiplication algorithm and splits the input data,
model weights and layer outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$
devices, where $N$ is the number of tensor chunks in a single dimension.

- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallelism introduces a novel tensor parallelism which further
parallelizes 2D tensor parallelism. A total of $P = N^2 \times d$ processors are arranged into $d$ layers,
where each layer performs matrix multiplication operations independently with a dimension of $N$.

- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method achieves
the optimal $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage are evenly distributed
through optimized load balancing of parameters as well as activations.

The configurations below enable each mode; a short sketch after the code block shows how `size` and `depth` relate to the mesh dimension $N$.
```python
# 1D parallel
parallel = dict(
    pipeline=dict(size=1), # number of pipeline stages
    tensor=dict(size=4, mode='1d')
)

# 2D parallel
parallel = dict(
    pipeline=dict(size=1), # number of pipeline stages
    tensor=dict(size=4, mode='2d')
)

# 2.5D parallel
parallel = dict(
    pipeline=dict(size=1), # number of pipeline stages
    tensor=dict(size=8, mode='2.5d', depth=2)
)

# 3D parallel
parallel = dict(
    pipeline=dict(size=1), # number of pipeline stages
    tensor=dict(size=8, mode='3d')
)
```
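The `size` values above are constrained by the process mesh each mode assumes: $P = N^2$ for 2D, $P = N^2 \times d$ for 2.5D, and $P = N^3$ for 3D. The snippet below is only an illustrative check of that arithmetic for the configurations shown; `mesh_shape` is a hypothetical helper and not part of the Colossal-AI API.

```python
import math

def mesh_shape(mode: str, size: int, depth: int = 1):
    """Illustrative only: relate the tensor parallel size to the assumed device mesh shape."""
    if mode == '1d':
        return (size,)                        # a single row of devices
    if mode == '2d':
        n = math.isqrt(size)
        assert n * n == size, "2D parallel requires size = N^2"
        return (n, n)
    if mode == '2.5d':
        n = math.isqrt(size // depth)
        assert n * n * depth == size, "2.5D parallel requires size = N^2 * depth"
        return (depth, n, n)
    if mode == '3d':
        n = round(size ** (1 / 3))
        assert n ** 3 == size, "3D parallel requires size = N^3"
        return (n, n, n)
    raise ValueError(f"unknown tensor parallel mode: {mode}")

print(mesh_shape('1d', 4))            # (4,)      -> tensor=dict(size=4, mode='1d')
print(mesh_shape('2d', 4))            # (2, 2)    -> tensor=dict(size=4, mode='2d')
print(mesh_shape('2.5d', 8, depth=2)) # (2, 2, 2) -> tensor=dict(size=8, mode='2.5d', depth=2)
print(mesh_shape('3d', 8))            # (2, 2, 2) -> tensor=dict(size=8, mode='3d')
```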
## Pipeline Parallel (experimental)
Pipeline parallelism splits the model into several partitions by layer. For example, let's assume we have a simple
@@ -160,57 +208,9 @@ schedule = dict(
)
```
## Sequence Parallel (experimental)
Sequence parallelism is designed to support long-sequence modelling such as document-level text understanding and medical imaging.
This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
This feature is still in development and is only experimental for now.