[doc] update and revise some typos and errs in docs (#4107)

* fix some typos and problems in doc

* fix some typos and problems in doc

* add doc test
pull/4122/head
Jianghai 2023-06-28 19:30:37 +08:00 committed by GitHub
parent 769cddcb2c
commit 711e2b4c00
6 changed files with 99 additions and 62 deletions


@ -1,31 +1,36 @@
# Booster API
Author: [Mingyan Jiang](https://github.com/jiangmingyan), [Jianghai Chen](https://github.com/CjhHa1)
**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
**Example Code**
- [Train with Booster](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/new_api/cifar_resnet/README.md)
## Introduction
In our new design, `colossalai.booster` replaces the role of `colossalai.initialize` to inject features into your training components (e.g. model, optimizer, dataloader) seamlessly. With these new APIs, you can integrate your model with our parallelism features more easily. Calling `colossalai.booster` is also the standard procedure before you enter your training loop. In the sections below, we will cover how `colossalai.booster` works and what we should take note of.
### Plugin
Plugin is an important component that manages parallel configuration (e.g. the Gemini plugin encapsulates the Gemini acceleration solution). The currently supported plugins are as follows:
**_GeminiPlugin:_** This plugin wraps the Gemini acceleration solution, i.e. ZeRO with chunk-based memory management.
**_TorchDDPPlugin:_** This plugin wraps the DDP acceleration solution. It implements data parallelism at the module level and can run across multiple machines.
**_LowLevelZeroPlugin:_** This plugin wraps stages 1 and 2 of the Zero Redundancy Optimizer. Stage 1: shards optimizer states across data parallel workers/GPUs. Stage 2: shards optimizer states + gradients across data parallel workers/GPUs.
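The exact constructor arguments live in the API reference below; as a rough, hedged sketch (import paths and the `plugin` keyword are assumptions, not taken from this page), a plugin is built first and then handed to the `Booster`:
```python
# Hedged sketch: construct a plugin and hand it to the Booster.
# Check the generated API reference below for the exact signature.
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin

plugin = TorchDDPPlugin()          # module-level data parallelism
booster = Booster(plugin=plugin)   # the plugin decides how boost() wraps your objects
```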
### API of booster
{{ autodoc:colossalai.booster.Booster }}
## Usage
In a typical workflow, you should launch the distributed environment at the beginning of the training script and create the objects you need (such as models, optimizers, loss functions, data loaders, etc.) first, then call `colossalai.booster` to inject features into these objects. After that, you can use our booster APIs and the returned objects to run the rest of your training process.
A pseudo-code example is shown below:
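The full pseudo-code sits in the collapsed part of this diff; as a hedged sketch only, with placeholder names (`MyModel`, `build_dataloader`) that are not taken from the original example, the flow could look like:
```python
# Hedged sketch of the boost-then-train workflow described above; names are placeholders.
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin


def train():
    colossalai.launch_from_torch(config={})   # launch the distributed environment first
    model = MyModel()                         # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()
    train_dataloader = build_dataloader()     # placeholder dataloader factory

    # inject features into the created objects
    booster = Booster(plugin=TorchDDPPlugin())
    model, optimizer, criterion, train_dataloader, _ = booster.boost(
        model, optimizer, criterion, train_dataloader)

    for inputs, labels in train_dataloader:
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        booster.backward(loss, optimizer)     # backward goes through the booster
        optimizer.step()
        optimizer.zero_grad()
```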
@ -67,5 +72,4 @@ def train():
[more design details](https://github.com/hpcaitech/ColossalAI/discussions/3046)
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 booster_api.py -->


@ -3,12 +3,13 @@
Author: [Mingyan Jiang](https://github.com/jiangmingyan)
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Training Booster](../basics/booster_api.md)
**Related Paper**
- [Accelerating Scientific Computations with Mixed Precision Algorithms](https://arxiv.org/abs/0808.2794)
## Introduction
@ -19,9 +20,8 @@ In Colossal-AI, we have incorporated different implementations of mixed precisio
2. apex.amp
3. naive amp
| Colossal-AI | support tensor parallel | support pipeline parallel | fp16 extent |
| -------------- | ----------------------- | ------------------------- | ---------------------------------------------------------------------------------------------------- |
| AMP_TYPE.TORCH | ✅ | ❌ | Model parameters, activation, gradients are downcast to fp16 during forward and backward propagation |
| AMP_TYPE.APEX | ❌ | ❌ | More fine-grained, we can choose opt_level O0, O1, O2, O3 |
| AMP_TYPE.NAIVE | ✅ | ✅ | Model parameters, forward and backward operations are all downcast to fp16 |
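For context (not part of this diff), the AMP_TYPE.TORCH row corresponds to PyTorch's native autocast/GradScaler pattern; `model`, `criterion`, `optimizer` and `dataloader` below are placeholders:
```python
# Plain PyTorch AMP, which AMP_TYPE.TORCH wraps: the forward pass runs under
# autocast (fp16 where safe), while GradScaler keeps small gradients from
# underflowing in fp16.
import torch

scaler = torch.cuda.amp.GradScaler()
for inputs, labels in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscale, skip the step if inf/nan appeared
    scaler.update()                 # adjust the scale for the next iteration
```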
@ -64,8 +64,11 @@ However, there are other operations, like reductions, which require the dynamic
We support three AMP training methods and allow the user to train with AMP without changing their code. If you want to train with AMP, just assign `fp16` to `mixed_precision` when you instantiate the `Booster`. Currently the booster supports torch amp; the other two (apex amp, naive amp) are still started via `colossalai.initialize`. If you need them, please refer to [this page](./mixed_precision_training.md). We will support `bf16` and `fp8` next.
### Start with Booster
Instantiate `Booster` with `mixed_precision="fp16"`, then you can train with torch amp.
<!--- doc-test-ignore-start -->
```python
"""
Mapping:
@ -78,9 +81,13 @@ instantiate `Booster` with `mixed_precision="fp16"`, then you can train with tor
from colossalai import Booster
booster = Booster(mixed_precision='fp16',...)
```
<!--- doc-test-ignore-end -->
Or you can create a `FP16TorchMixedPrecision` object, such as:
<!--- doc-test-ignore-start -->
```python
from colossalai.mixed_precision import FP16TorchMixedPrecision
mixed_precision = FP16TorchMixedPrecision(
@ -90,9 +97,10 @@ mixed_precision = FP16TorchMixedPrecision(
growth_interval=2000)
booster = Booster(mixed_precision=mixed_precision,...)
```
<!--- doc-test-ignore-end -->
The same goes for the other AMP types.
### Torch AMP Configuration
@ -121,7 +129,6 @@ The output model is converted to AMP model of smaller memory consumption.
If your input model is already too large to fit in a GPU, please instantiate your model weights in `dtype=torch.float16`.
Otherwise, try smaller models or check out more parallelization training techniques!
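As an illustration of that hint (layer sizes made up, not from this document), weights can be created directly in half precision so the fp32 copy never has to fit on the GPU:
```python
# Instantiate the weights in fp16 from the start instead of converting an
# fp32 model afterwards; the sizes here are purely illustrative.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096, dtype=torch.float16),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096, dtype=torch.float16),
).cuda()
```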
## Hands-on Practice
Now we will introduce the use of AMP with Colossal-AI. In this practice, we will use Torch AMP as an example.
@ -248,4 +255,5 @@ Use the following command to start the training scripts. You can change `--nproc
```shell
colossalai run --nproc_per_node 1 train.py
```
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 mixed_precision_training_with_booster.py -->


@ -7,19 +7,18 @@ can also run on systems with only one GPU. Quick demos showing how to use Coloss
## Single GPU
Colossal-AI can be used to train deep learning models on systems with only one GPU and achieve baseline
performance. We provide an example to [train ResNet on the CIFAR10 dataset](https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/resnet)
with only one GPU. You can find the example in [ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI/tree/main/examples).
Detailed instructions can be found in its `README.md`.
## Multiple GPUs
Colossal-AI can be used to train deep learning models on distributed systems with multiple GPUs and accelerate the
training process drastically by applying efficient parallelization techniques. We have several parallelism techniques for you to try out.
#### 1. data parallel
You can use the same [ResNet example](https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/resnet) as the
single-GPU demo above. By setting `--nproc_per_node` to be the number of GPUs you have on your machine, the example
is turned into a data parallel example.
@ -27,17 +26,19 @@ is turned into a data parallel example.
Hybrid parallel includes data, tensor, and pipeline parallelism. In Colossal-AI, we support different types of tensor
parallelism (i.e. 1D, 2D, 2.5D and 3D). You can switch between different tensor parallelism by simply changing the configuration
in the `config.py`. You can follow the [GPT example](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt).
Detailed instructions can be found in its `README.md`.
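As a hedged sketch only (the exact keys should be taken from the GPT example's own `config.py`), switching tensor parallelism amounts to editing a `parallel` dict such as:
```python
# config.py sketch; the keys and values are illustrative, not copied from the example.
parallel = dict(
    pipeline=2,                      # number of pipeline stages
    tensor=dict(size=4, mode='2d'),  # tensor parallel size and mode: '1d', '2d', '2.5d' or '3d'
)
```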
#### 3. MoE parallel
We provide [an example of ViT-MoE](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/moe) to demonstrate
MoE parallelism. WideNet uses mixture of experts (MoE) to achieve better performance. More details can be found in
[Tutorial: Integrate Mixture-of-Experts Into Your Model](../advanced_tutorials/integrate_mixture_of_experts_into_your_model.md)
#### 4. sequence parallel
Sequence parallelism is designed to tackle memory efficiency and sequence length limit problems in NLP tasks. We provide
[an example of BERT](https://github.com/hpcaitech/ColossalAI/tree/main/examples/tutorial/sequence_parallel) in
[ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI/tree/main/examples). You can follow the `README.md` to execute the code.
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 run_demo.py -->


@ -1,28 +1,37 @@
# Booster Usage
Author: [Mingyan Jiang](https://github.com/jiangmingyan), [Jianghai Chen](https://github.com/CjhHa1)
**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
**Example Code**
<!-- update this url-->
- [Train with Booster](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/new_api/cifar_resnet/README.md)
## Introduction
In our new design, `colossalai.booster` replaces `colossalai.initialize` and injects features seamlessly into your training components (e.g. model, optimizer, dataloader). With the booster API, you can integrate our parallel strategies into the model to be trained more easily. Calling `colossalai.booster` is the basic step before you enter the training loop.
In the sections below, we will introduce how `colossalai.booster` works and the details to pay attention to when using it.
### Booster Plugins
Booster plugins are important components that manage parallel configuration (e.g. the Gemini plugin encapsulates the Gemini acceleration solution). The currently supported plugins are as follows:
**_GeminiPlugin:_** The GeminiPlugin wraps the Gemini acceleration solution, i.e. ZeRO with chunk-based memory management.
**_TorchDDPPlugin:_** The TorchDDPPlugin wraps the DDP acceleration solution. It implements data parallelism at the module level and can run across multiple machines.
**_LowLevelZeroPlugin:_** The LowLevelZeroPlugin wraps stages 1 and 2 of the Zero Redundancy Optimizer. Stage 1: shards optimizer states across data parallel workers/GPUs. Stage 2: shards optimizer states and gradients across data parallel workers/GPUs.
### Booster API
<!--TODO: update autodoc -->
{{ autodoc:colossalai.booster.Booster }}
## Usage and Examples


@ -3,12 +3,13 @@
Author: [Mingyan Jiang](https://github.com/jiangmingyan)
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Booster Usage](../basics/booster_api.md)
**Related Paper**
- [Accelerating Scientific Computations with Mixed Precision Algorithms](https://arxiv.org/abs/0808.2794)
## Introduction
@ -19,9 +20,8 @@ AMP stands for automatic mixed precision training.
2. apex.amp
3. naive amp
| Colossal-AI    | support tensor parallel | support pipeline parallel | fp16 extent                                                                                               |
| -------------- | ----------------------- | ------------------------- | --------------------------------------------------------------------------------------------------------- |
| AMP_TYPE.TORCH | ✅                      | ❌                        | Model parameters, activations and gradients are downcast to fp16 during forward and backward propagation |
| AMP_TYPE.APEX  | ❌                      | ❌                        | More fine-grained; we can choose opt_level O0, O1, O2, O3                                                 |
| AMP_TYPE.NAIVE | ✅                      | ✅                        | Model parameters, forward and backward operations are all downcast to fp16                               |
@ -57,11 +57,14 @@ AMP stands for automatic mixed precision training.
## AMP in Colossal-AI
We support three AMP training methods and allow users to train with AMP without changing their code. The booster supports AMP feature injection: if you want to use mixed precision training, specify the `mixed_precision` parameter when creating the booster instance. We now support torch amp, apex amp and naive amp (torch amp has been ported to the booster; apex amp and naive amp are still started via `colossalai.initialize`; if you need them, please refer to [this page](./mixed_precision_training.md)). Mixed precision training with `bf16` and `fp8` will be supported later.
#### Start with Booster
You can specify `mixed_precision="fp16"` when creating the booster instance to use torch amp.
<!--- doc-test-ignore-start -->
```python
"""
The initialization mapping is as follows:
@ -74,9 +77,13 @@ AMP stands for automatic mixed precision training.
from colossalai import Booster
booster = Booster(mixed_precision='fp16',...)
```
<!--- doc-test-ignore-end -->
Or you can customize a `FP16TorchMixedPrecision` object, such as:
<!--- doc-test-ignore-start -->
```python
from colossalai.mixed_precision import FP16TorchMixedPrecision
mixed_precision = FP16TorchMixedPrecision(
@ -86,7 +93,9 @@ mixed_precision = FP16TorchMixedPrecision(
growth_interval=2000)
booster = Booster(mixed_precision=mixed_precision,...)
```
<!--- doc-test-ignore-end -->
The usage of the other AMP types is the same.
### Torch AMP Configuration
@ -186,6 +195,7 @@ lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=NUM_EPOCHS
```
### Step 4. Insert AMP
Create a MixedPrecision object (if needed) and a TorchDDPPlugin object, then call `colossalai.boost` to convert all training components to FP16 mode.
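A hedged sketch of this step (the actual code is collapsed in this diff; `model`, `optimizer`, `criterion`, `train_dataloader` and `lr_scheduler` are assumed to exist already, and the argument names are assumptions):
```python
# Sketch only: build the mixed-precision object and the DDP plugin, then let
# booster.boost wrap the training components.
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin
from colossalai.mixed_precision import FP16TorchMixedPrecision  # import path as written in this doc

mixed_precision = FP16TorchMixedPrecision(init_scale=2.**16)
booster = Booster(mixed_precision=mixed_precision, plugin=TorchDDPPlugin())
model, optimizer, criterion, train_dataloader, lr_scheduler = booster.boost(
    model, optimizer, criterion, train_dataloader, lr_scheduler)
```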
```python
@ -232,4 +242,5 @@ for epoch in range(NUM_EPOCHS):
```shell
colossalai run --nproc_per_node 1 train.py
```
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 mixed_precision_training_with_booster.py -->


@ -4,8 +4,8 @@ Colossal-AI is an integrated large-scale deep learning system with efficient parallel
## Single GPU
Colossal-AI can be used to train deep learning models on systems with only one GPU and achieve baseline performance. We provide an example of [training ResNet on the CIFAR10 dataset](https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/resnet), which only requires one GPU.
You can find this example in [ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI/tree/main/examples). Detailed instructions can be found in its `README.md`.
## Multiple GPUs
@ -13,16 +13,20 @@ Colossal-AI can be used to train deep learning models on distributed systems with multiple GPUs
#### 1. Data parallel
You can use the same [ResNet example](https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/resnet) as the single-GPU demo above. By setting `--nproc_per_node` to the number of GPUs on your machine, you can turn the example into a data parallel one.
#### 2. Hybrid parallel
Hybrid parallelism includes data, tensor, and pipeline parallelism. In Colossal-AI, we support different types of tensor parallelism (i.e. 1D, 2D, 2.5D and 3D). You can switch between different tensor parallelism types by simply changing the configuration in `config.py`. You can refer to the [GPT example](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt); more details can be found in its `README.md`.
#### 3. MoE parallel
<!-- TODO: implement this example in colossalai -->
We provide a [ViT-MoE example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/moe) to verify MoE parallelism. WideNet uses Mixture of Experts (MoE) to achieve better performance. More details can be found in our tutorial: [Integrate Mixture-of-Experts Into Your Model](../advanced_tutorials/integrate_mixture_of_experts_into_your_model.md).
#### 4. Sequence parallel
Sequence parallelism is designed to tackle memory efficiency and sequence length limit problems in NLP tasks. We provide a [Sequence Parallelism example](https://github.com/hpcaitech/ColossalAI/tree/main/examples/tutorial/sequence_parallel) in [ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI/tree/main/examples). You can follow its `README.md` to execute the code.
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 run_demo.py -->