mirror of https://github.com/hpcaitech/ColossalAI
[doc]update moe chinese document. (#3890)
* [doc]update-moe
* [doc]update-moe
* [doc]update-moe
* [doc]update-moe
* [doc]update-moe
pull/3898/head^2
@ -137,3 +137,4 @@ criterion = MoeLoss(
Finally, use the trainer or engine in `colossalai` to run your training.
Otherwise, you need to handle the gradients yourself.
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 integrate_mixture_of_experts_into_your_model.py -->
@ -19,7 +19,7 @@ We aim to make Colossal-AI easy to use and non-intrusive to user code. There is
1. Prepare a configuration file that specifies the features you want to use and your parameters.
2. Initialize distributed backend with `colossalai.launch`
3. Inject the training features into your training components (e.g. model, optimizer) with `colossalai.initialize`.
3. Inject the training features into your training components (e.g. model, optimizer) with `colossalai.booster`.
4. Run training and testing
We will cover the whole workflow in the `basic tutorials` section.
@ -34,3 +34,5 @@ The Colossal-AI system will be expanded to include more training skills, these n
4. expansion of existing parallelism methods
We welcome ideas and contributions from the community, and you can post your ideas for future development in our forum.
<!-- doc-test-command: echo "colossalai_overview.md does not need test" -->
@ -9,44 +9,24 @@
- [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961)
- [Go Wider Instead of Deeper](https://arxiv.org/abs/2107.11817)
## Introduction
Since the advent of Switch Transformer, the AI community has found Mixture of Experts (MoE) a useful technique to enlarge the capacity of deep learning models.
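Since the advent of `Switch Transformer`, the AI community has found Mixture of Experts (MoE) to be a useful technique for enlarging the capacity of deep learning models.
Colossal-AI provides an early-access version of parallelism designed specifically for MoE models. The most prominent advantage of MoE in Colossal-AI is convenience. We aim to help our users easily combine MoE with model parallelism and data parallelism.
Here, we introduce how to use MoE with model parallelism and data parallelism.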
Colossal-AI provides an early access version of parallelism specifically designed for MoE models.
The most prominent advantage of MoE in Colossal-AI is convenience.
We aim to help our users to easily combine MoE with model parallelism and data parallelism.
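## Table of Contents
1. [Set up the MoE running environment](#set-up-moe-running-environment)
2. [Create the MoE layer](#create-moe-layer)
3. [Define the training model](#train-your-model)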
However, the current implementation has two main drawbacks.
The first drawback is its poor efficiency when training with large batch sizes and long sequence lengths.
The second drawback is incompatibility with tensor parallelism.
We are working on system optimization to overcome the training efficiency problem.
The compatibility problem with tensor parallelism requires more adaptation, and we will tackle this issue in the future.
We provide an [example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/widenet); for details, see [ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI-Examples).
This example uses [WideNet](https://arxiv.org/abs/2107.11817) as an example of an MoE-based model.
Here, we will introduce how to use MoE with model parallelism and data parallelism.
## Table of Contents
In this tutorial we will cover:
1. Set up MoE running environment
2. Create MoE layer
3. Train your model
We provide the [example code](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/widenet) for this tutorial in [ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI-Examples).
This example uses [WideNet](https://arxiv.org/abs/2107.11817) as an example of an MoE-based model.
## Set up MoE running environment
In your project folder, create a `config.py`.
This file specifies the features you want to use to train your model.
To enable MoE, add a dict called `parallel` and specify the value of the key `moe`.
You can assign a value to the key `size` of `moe`, which represents the model parallel size of experts (i.e. the number of experts in one group to parallelize training).
For example, if the size is 4, 4 processes will be assigned to 4 consecutive GPUs, and these 4 processes form a MoE model parallel group.
Each process on the 4 GPUs gets only a portion of the experts. Increasing the model parallel size reduces communication cost, but increases the computation cost on each GPU and the memory cost of activations.
The total data parallel size is auto-detected and set as the number of GPUs by default.
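## Set up MoE running environment
In your project folder, create a `config.py` file. In this file you can specify the features you want to use to train your model. To enable MoE, define the `parallel` field in `config.py` and specify the value of `moe`. `moe` denotes the parallel size of one MoE parallel training group. For example, if `moe` is set to 4, 4 processes are assigned to 4 consecutive GPUs, and these 4 processes form one MoE model parallel group. Each process gets only a portion of the experts. Increasing the MoE parallel size reduces communication cost, but increases the computation cost on each GPU and the memory cost of storing activations. The total data parallel size is detected automatically and defaults to the number of GPUs.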
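A minimal sketch of such a `config.py` could look like the following. The nested `moe=dict(size=...)` layout is inferred from the description of the `parallel` dict and the `size` key, and the value 4 is only an illustration; check the example repository for the exact form your version expects.

```python
# config.py (illustrative sketch; the value 4 is an assumption for this example)

# Size of one MoE model parallel group: the experts are split across this many GPUs.
MOE_MODEL_PARALLEL_SIZE = 4

# `parallel` is the dict read by the launcher; the key `moe` holds the MoE settings.
parallel = dict(
    moe=dict(size=MOE_MODEL_PARALLEL_SIZE),
)
```

With 8 GPUs and this setting, two MoE model parallel groups of 4 consecutive GPUs each would be formed.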
@ -55,37 +35,29 @@ parallel = dict(
If `MOE_MODEL_PARALLEL_SIZE = E` and the number of experts is also set to `E`, where `E` is a constant, the forward pass of a transformer encoder in a model parallel group proceeds as shown below.
If `MOE_MODEL_PARALLEL_SIZE = E`, i.e. the total number of experts is set to `E` (where `E` is a constant), the forward pass of a transformer encoder under model parallelism proceeds as shown in the figure below.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/oI59QcxdteKUTks.png"/>
<figcaption>MoE Transformer, image source: <a href="https://arxiv.org/abs/2006.16668">GShard</a></figcaption>
Since the experts are distributed across all GPUs in a model parallel group and each GPU owns only a portion of the experts,
the original data parallel groups are no longer correct for the expert parameters during gradient handling in the backward pass.
So we create a new kind of parallel group called the MoE data parallel group.
The differences among the kinds of parallel groups, when the configuration is set to `WORLD_SIZE=4` and
`MOE_MODEL_PARALLEL_SIZE=2`, are shown here.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/Sn8FpmQPKIiBEq2.png"/>
<figcaption>MoE process group</figcaption>
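The grouping can be sketched in plain Python. This is an illustration under one assumption: consecutive ranks form a MoE model parallel group (as stated for consecutive GPUs above), and ranks at the same position in each model parallel group hold the same experts and therefore form a MoE data parallel group. The function name `moe_process_groups` is hypothetical and not part of Colossal-AI.

```python
def moe_process_groups(world_size: int, moe_model_parallel_size: int):
    """Return (model_parallel_groups, data_parallel_groups) as lists of rank lists."""
    if world_size % moe_model_parallel_size != 0:
        raise ValueError("world_size must be divisible by moe_model_parallel_size")
    num_groups = world_size // moe_model_parallel_size

    # Consecutive ranks share one MoE model parallel group; each rank holds different experts.
    model_groups = [
        list(range(g * moe_model_parallel_size, (g + 1) * moe_model_parallel_size))
        for g in range(num_groups)
    ]
    # Ranks at the same offset in every model parallel group hold the same experts,
    # so the gradients of those experts are reduced within this group.
    data_groups = [
        list(range(offset, world_size, moe_model_parallel_size))
        for offset in range(moe_model_parallel_size)
    ]
    return model_groups, data_groups


model_groups, data_groups = moe_process_groups(world_size=4, moe_model_parallel_size=2)
print(model_groups)  # [[0, 1], [2, 3]]
print(data_groups)   # [[0, 2], [1, 3]]
```

Non-expert parameters are replicated on every rank, so for them the ordinary data parallel group of all 4 ranks still applies.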
As for gradient handling, we provide `MoeGradientHandler` to all-reduce every parameter of the model.
If you use the `colossalai.initialize` function to create your training engine, the MoE gradient handler is added to your engine automatically.
Otherwise, you need to handle the gradients yourself.
All parameters of the MoE running environment are stored in `colossalai.global_variables.moe_env`.
You can access your configuration parameters to check whether your setup is correct.
from colossalai.global_variables import moe_env
## Create MoE layer
You can create a MoE layer from `colossalai.nn.moe`.
But before doing that, you should set up random seeds for all processes like this.
## Create MoE layer
from colossalai.context.random import moe_set_seed
@ -95,10 +67,7 @@ moe_set_seed(42)
model = Widenet(num_experts=4, capacity_factor=1.2)
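`moe_set_seed` sets a different seed for each process in a MoE model parallel group.
This helps initialize the parameters in the experts.
Then create an instance of experts and an instance of a router.
Here is the example from the model zoo.
`moe_set_seed` sets a different seed for each process in a MoE model parallel group (this helps initialize the parameters in the experts). Then create an instance of experts and an instance of a router, as in the following example.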
from colossalai.nn.layer.moe import Experts, MoeLayer, Top2Router, NormalNoiseGenerator
@ -118,16 +87,11 @@ ffn=MoeLayer(dim_model=d_model, num_experts=num_experts,
router=shared_router, experts=shared_experts)
Inside the initialization of `Experts`, the number of local experts on each GPU is calculated automatically. You only need to specify the class of each expert and the parameters used in its initialization. As for routers, we provide a top-1 router and a top-2 router. You can find them in `colossalai.nn.layer.moe`. After the instances of experts and router are created, the only thing initialized in `MoeLayer` is the gate module. More definitions of each class can be found in our API documentation and code.
Inside the initialization of `Experts`, the number of local experts on each GPU is calculated automatically; you only need to specify the class of each expert and the parameters used in its initialization. In addition, we provide `Top1Router` and `Top2Router`, which you can find in `colossalai.nn.layer.moe`. After the instances of experts and router are created, the only thing `MoeLayer` initializes is the `gate` module. For more details on each class, refer to our API documentation and code.
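To make the two ideas in this paragraph concrete, here is a plain-Python sketch of top-2 routing for a single token and of the local expert count. It is not the implementation of `Top2Router` or `Experts` (which also deal with capacity, noise, and dispatch); the helper names `top2_route` and `num_local_experts` are hypothetical, and the even split of experts across the group is a simplifying assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(gate_logits):
    """Pick the two experts with the highest gate probability for one token.

    Returns [(expert_index, weight), (expert_index, weight)], weights summing to 1.
    """
    probs = softmax(gate_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)  # renormalize over the two selected experts
    return [(i, probs[i] / norm) for i in top2]


def num_local_experts(num_experts, moe_model_parallel_size):
    """Experts held by each GPU, assuming they divide evenly across the group."""
    if num_experts % moe_model_parallel_size != 0:
        raise ValueError("num_experts must be divisible by moe_model_parallel_size")
    return num_experts // moe_model_parallel_size


# One token, four experts: experts 1 and 3 have the largest gate logits.
routes = top2_route([0.1, 2.0, -1.0, 1.0])
print([index for index, _ in routes])  # [1, 3]
print(num_local_experts(num_experts=4, moe_model_parallel_size=2))  # 2
```

A top-1 router is the same idea with a single selected expert, so the token is processed by one expert only.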
## Define the training model
## Train Your Model
Do not forget to use the `colossalai.initialize` function in `colossalai` to add the gradient handler to the engine.
We handle the back-propagation of MoE models for you.
In `colossalai.initialize`, we automatically create a `MoeGradientHandler` object to process gradients.
You can find more information about the `MoeGradientHandler` handler in the colossal directory.
The loss criterion should be wrapped by `MoeLoss` to add the auxiliary loss of MoE. An example follows.
Use the `colossalai.initialize` function in `colossalai` to add a gradient handler to the engine, which handles the back-propagation of MoE models. In `colossalai.initialize`, we automatically create a `MoeGradientHandler` object to process gradients. You can find more information about `MoeGradientHandler` in the colossal directory. To add the MoE auxiliary loss, the loss criterion should be wrapped with `MoeLoss`, as in the following example.
criterion = MoeLoss(
@ -135,6 +99,6 @@ criterion = MoeLoss(
Finally, simply use the `trainer` or `engine` in `colossalai` to run your training.
Finally, use the trainer or engine in `colossalai` to run your training.
Otherwise, you need to handle the gradients yourself.
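For intuition about the auxiliary loss that `MoeLoss` adds to the criterion, the sketch below computes the load-balancing term described in the Switch Transformer paper listed in the references: the number of experts times the sum over experts of (fraction of tokens routed to the expert) × (mean router probability for the expert). This is an assumption about the kind of term involved, not the code of `MoeLoss`; the function names and the `aux_weight` value are hypothetical.

```python
def load_balancing_loss(router_probs, num_experts):
    """Switch-Transformer-style auxiliary loss for top-1 routing.

    router_probs: one list of per-expert probabilities for each token.
    """
    num_tokens = len(router_probs)
    token_fraction = [0.0] * num_experts  # f_i: share of tokens whose top-1 expert is i
    mean_prob = [0.0] * num_experts       # P_i: average router probability of expert i
    for probs in router_probs:
        top1 = max(range(num_experts), key=lambda i: probs[i])
        token_fraction[top1] += 1.0 / num_tokens
        for i in range(num_experts):
            mean_prob[i] += probs[i] / num_tokens
    return num_experts * sum(f * p for f, p in zip(token_fraction, mean_prob))


def total_loss(task_loss, router_probs, num_experts, aux_weight=0.01):
    """Task loss plus the weighted auxiliary loss (aux_weight is illustrative)."""
    return task_loss + aux_weight * load_balancing_loss(router_probs, num_experts)


balanced = [[0.9, 0.1], [0.1, 0.9]]  # tokens spread evenly over both experts
skewed = [[0.9, 0.1], [0.8, 0.2]]    # every token prefers expert 0
print(load_balancing_loss(balanced, num_experts=2) < load_balancing_loss(skewed, num_experts=2))  # True
```

The term is smallest when tokens are spread evenly across experts, which is why it is added to the task loss during training.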
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 integrate_mixture_of_experts_into_your_model.py -->
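@ -19,7 +19,7 @@ Colossal-AI is an integrated system that provides users with a comprehensive set of training
1. Prepare a configuration file that specifies the features and parameters you want to use.
2. Initialize the distributed backend with `colossalai.launch`.
3. Inject the training features into your training components (e.g. model, optimizer) with `colossalai.initialize`.
3. Inject the training features into your training components (e.g. model, optimizer) with `colossalai.booster`.
4. Run training and testing.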
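@ -34,3 +34,5 @@ The Colossal-AI system will be further expanded and optimized, including but not limited to:
4. Expansion of existing parallelism methods
**We always welcome suggestions and discussion from the community, and we are more than willing to help if you run into any problems. You can open an [issue](https://github.com/hpcaitech/ColossalAI/issues) on GitHub or create a discussion topic on the [forum](https://github.com/hpcaitech/ColossalAI/discussions).**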
<!-- doc-test-command: echo "colossalai_overview.md does not need test" -->