From 50982c0b7dfd112064b2e831c425014c3345b63c Mon Sep 17 00:00:00 2001
From: ver217
Date: Mon, 1 Nov 2021 14:31:55 +0800
Subject: [PATCH] reorder parallelization methods in parallelization
 documentation

---
 docs/parallelization.md | 98 ++++++++++++++++++++---------------------
 1 file changed, 49 insertions(+), 49 deletions(-)

diff --git a/docs/parallelization.md b/docs/parallelization.md
index 84911d76e..413c84420 100644
--- a/docs/parallelization.md
+++ b/docs/parallelization.md
@@ -29,6 +29,54 @@ not have to explicitly set them in your configurations. When data parallel size
 adds the distributed data sampler to the dataloader to shard the dataset.
 
 
+## 1D, 2D, 2.5D and 3D Parallel
+To enable hybrid parallelism, we provide an array of tensor parallelism methods, along with the paper that introduces
+each of them. These parallel modes need to work with the distributed layers provided by Colossal-AI.
+- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
+
+- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
+2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data,
+model weights and layer outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$
+devices, where $N$ is the number of tensor chunks in a single dimension.
+
+- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
+Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism which further
+parallelizes 2D tensor parallelism. A total of $P = N^2 \times d$ processors are arranged into $d$ layers,
+where each layer performs matrix multiplication operations independently with dimension $N$.
+
+- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
+We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method achieves
+the optimal $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage are evenly distributed
+through optimized load balancing of parameters as well as activations.
+
+
+```python
+# 1D parallel
+parallel = dict(
+    pipeline=dict(size=1), # number of pipeline stages
+    tensor=dict(size=4, mode='1d')
+)
+
+# 2D parallel
+parallel = dict(
+    pipeline=dict(size=1), # number of pipeline stages
+    tensor=dict(size=4, mode='2d')
+)
+
+# 2.5D parallel
+parallel = dict(
+    pipeline=dict(size=1), # number of pipeline stages
+    tensor=dict(size=8, mode='2.5d', depth=2)
+)
+
+# 3D parallel
+parallel = dict(
+    pipeline=dict(size=1), # number of pipeline stages
+    tensor=dict(size=8, mode='3d')
+)
+```
+
 ## Pipeline Parallel (experimental)
 
 Pipeline parallelism is to split the model into several partitions by layer. For example, let's assume we have a simple
@@ -160,57 +208,9 @@ schedule = dict(
 )
 ```
 
-## 1D, 2D, 2.5D and 3D Parallel
-To enable hybrid parallelism, we provide an array of tensor parallelism. We provide the list of papers which match each
-tensor parallel method. These parallel modes need to work with the distributed layers provided by Colossal-AI.
-- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
-
-- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
-2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data,
-model weights and layer outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$
-devices where N is the number of tensor chunks in a single dimension.
-
-- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
-Inspired by the 2.5D matrix multi-plication algorithm, 2.5D parallel introduces a novel tensor parallelism which further
-parallelizes 2D tensor parallelism. An amount of $P = N^2 ∗ d$ processors are arranged into d layers,
-where each layer performs matrix multiplication operations independently with a dimension N.
-
-- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
-We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method achieves
-the optimal, $O(P^{1/3})$ communication overhead on P processors, while both computation and memory usage are evenly distributed
-through optimized load balancing of parameters as well as activations.
-
-
-```python
-# 1D parallel
-parallel = dict(
-    pipeline=dict(size=1), # number of pipeline stages
-    tensor=dict(size=4, mode='1d')
-)
-
-# 2D parallel
-parallel = dict(
-    pipeline=dict(size=1), # number of pipeline stages
-    tensor=dict(size=4, mode='2d')
-)
-
-# 2.5D parallel
-parallel = dict(
-    pipeline=dict(size=1), # number of pipeline stages
-    tensor=dict(size=8, mode='2.5d', depth=2)
-)
-
-# 3D parallel
-parallel = dict(
-    pipeline=dict(size=1), # number of pipeline stages
-    tensor=dict(size=8, mode='3d')
-)
-```
-
 ## Sequence Parallel (experimental)
 
 Sequence parallel is to support long-sequence modelling such as document-level text understanding and medical imaging.
 This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
-This feature is still in development is only experimental for now.
\ No newline at end of file
+This feature is still in development and is only experimental for now.
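
A note on the configs moved by this patch: each tensor parallel `mode` constrains the admissible `tensor.size` ($P = N^2$ for 2D, $P = N^2 \times d$ for 2.5D, and a cube $P = N^3$ for 3D). The snippet below is a minimal, standalone sketch of that arithmetic only; the helper name `check_tensor_parallel_config` is hypothetical and is not part of this patch or of the Colossal-AI API.

```python
import math


def check_tensor_parallel_config(parallel: dict) -> None:
    """Hypothetical helper: verify that tensor.size matches the chosen mode.

    Mirrors the processor counts described in the moved section:
    2D uses P = N^2 devices, 2.5D uses P = N^2 * d, and 3D uses P = N^3.
    """
    tensor = parallel.get('tensor', dict(size=1))
    size, mode = tensor['size'], tensor.get('mode')

    if mode == '2d':
        n = math.isqrt(size)
        assert n * n == size, f"2d expects a square size, got {size}"
    elif mode == '2.5d':
        d = tensor['depth']
        n = math.isqrt(size // d)
        assert n * n * d == size, f"2.5d expects size = N^2 * depth, got {size} with depth={d}"
    elif mode == '3d':
        n = round(size ** (1 / 3))
        assert n ** 3 == size, f"3d expects a cubic size, got {size}"
    # '1d' (and plain data parallelism) place no constraint on size


# The 2.5D example from the diff: 8 = 2^2 * 2 processors arranged in 2 layers.
check_tensor_parallel_config(dict(pipeline=dict(size=1),
                                  tensor=dict(size=8, mode='2.5d', depth=2)))
```

Under these assumptions, the call at the bottom passes silently for the 2.5D example config shown in the diff; changing `size` to a value that does not factor as $N^2 \times d$ would trip the assertion.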