fix typo docs/

pull/3829/head
digger yu 2023-05-24 09:53:21 +08:00 committed by Frank Lee
parent 34966378e8
commit e90fdb1000
23 changed files with 30 additions and 30 deletions

@ -98,7 +98,7 @@ Lastly, if you want to skip some code, you just need to add the following annota
<!--- doc-test-ignore-end -->
```
If you have any dependency required, please add it to `requriements-doc-test.txt` for pip and `conda-doc-test-deps.yml` for Conda.
If you have any dependency required, please add it to `requirements-doc-test.txt` for pip and `conda-doc-test-deps.yml` for Conda.
### 💉 Auto Documentation

@ -1,6 +1,6 @@
# References
The Colossal-AI project aims to provide a wide array of parallelism techniques for the machine learning community in the big-model era. This project is inspired by quite a few reserach works, some are conducted by some of our developers and the others are research projects open-sourced by other organizations. We would like to credit these amazing projects below in the IEEE citation format.
The Colossal-AI project aims to provide a wide array of parallelism techniques for the machine learning community in the big-model era. This project is inspired by quite a few research works, some are conducted by some of our developers and the others are research projects open-sourced by other organizations. We would like to credit these amazing projects below in the IEEE citation format.
## By Our Team

@ -56,7 +56,7 @@ follow the steps below to create a new distributed initialization.
world_size: int,
config: Config,
data_parallel_size: int,
pipeline_parlalel_size: int,
pipeline_parallel_size: int,
tensor_parallel_size: int,
arg1,
arg2):

@ -121,7 +121,7 @@ Inside the initialization of Experts, the local expert number of each GPU will b
## Train Your Model
Do not to forget to use `colossalai.initialize` function in `colosalai` to add gradient handler for the engine.
Do not to forget to use `colossalai.initialize` function in `colossalai` to add gradient handler for the engine.
We handle the back-propagation of MoE models for you.
In `colossalai.initialize`, we will automatically create a `MoeGradientHandler` object to process gradients.
You can find more information about the handler `MoeGradientHandler` in colossal directory.

@ -53,7 +53,7 @@ export CHECKPOINT_DIR="your_opt_checkpoint_path"
# the ${CONFIG_DIR} must contain a server.sh file as the entry of service
export CONFIG_DIR="config_file_path"
docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config --ipc=host energonai:lastest
docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config --ipc=host energonai:latest
```
Then open `https://[IP-ADDRESS]:8020/docs#` in your browser to try out!

@ -69,7 +69,7 @@ After the forward operation of the embedding module, each word in all sequences
<figcaption>The embedding module</figcaption>
</figure>
Each transformer layer contains two blocks. The self-attention operation is called in the first block and a two-layer percepton is located in the second block.
Each transformer layer contains two blocks. The self-attention operation is called in the first block and a two-layer perception is located in the second block.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/08/17/LAVzDlpRcj4dYeb.png"/>

@ -195,7 +195,7 @@ def build_cifar(batch_size):
## Training ViT using pipeline
You can set the size of pipeline parallel and number of microbatches in config. `NUM_CHUNKS` is useful when using interleved-pipeline (for more details see [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) ). The original batch will be split into `num_microbatches`, and each stage will load a micro batch each time. Then we will generate an approriate schedule for you to execute the pipeline training. If you don't need the output and label of model, you can set `return_output_label` to `False` when calling `trainer.fit()` which can further reduce GPU memory usage.
You can set the size of pipeline parallel and number of microbatches in config. `NUM_CHUNKS` is useful when using interleaved-pipeline (for more details see [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) ). The original batch will be split into `num_microbatches`, and each stage will load a micro batch each time. Then we will generate an appropriate schedule for you to execute the pipeline training. If you don't need the output and label of model, you can set `return_output_label` to `False` when calling `trainer.fit()` which can further reduce GPU memory usage.
You should `export DATA=/path/to/cifar`.
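As a rough illustration of the micro-batch split described in this hunk, the sketch below divides a batch into `num_microbatches` equal slices using plain Python sequences. The helper `split_into_microbatches` is hypothetical and is not a Colossal-AI API; the library's pipeline schedule performs its own splitting internally.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_microbatches(batch: Sequence[T], num_microbatches: int) -> List[Sequence[T]]:
    """Split `batch` into `num_microbatches` equally sized slices.

    Hypothetical helper for illustration only; not a Colossal-AI API.
    """
    if num_microbatches <= 0:
        raise ValueError("num_microbatches must be positive")
    if len(batch) % num_microbatches != 0:
        raise ValueError("batch size must be divisible by num_microbatches")
    size = len(batch) // num_microbatches
    # Each pipeline stage would load one of these slices at a time.
    return [batch[i * size:(i + 1) * size] for i in range(num_microbatches)]


# A batch of 8 samples split into 4 micro batches of 2 samples each.
microbatches = split_into_microbatches(list(range(8)), num_microbatches=4)
print(microbatches)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

The sketch requires the batch size to be divisible by the number of micro batches, which is an assumption of this example rather than a documented constraint.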

@ -16,14 +16,14 @@ In this example for ViT model, Colossal-AI provides three different parallelism
We will show you how to train ViT on CIFAR-10 dataset with these parallelism techniques. To run this example, you will need 2-4 GPUs.
## Tabel of Contents
## Table of Contents
1. Colossal-AI installation
2. Steps to train ViT with data parallelism
3. Steps to train ViT with pipeline parallelism
4. Steps to train ViT with tensor parallelism or hybrid parallelism
## Colossal-AI Installation
You can install Colossal-AI pacakage and its dependencies with PyPI.
You can install Colossal-AI package and its dependencies with PyPI.
```bash
pip install colossalai
```
@ -31,7 +31,7 @@ pip install colossalai
## Data Parallelism
Data parallism is one basic way to accelerate model training process. You can apply data parallelism to training by only two steps:
Data parallelism is one basic way to accelerate model training process. You can apply data parallelism to training by only two steps:
1. Define a configuration file
2. Change a few lines of code in train script
@ -94,7 +94,7 @@ from torchvision import transforms
from torchvision.datasets import CIFAR10
```
#### Lauch Colossal-AI
#### Launch Colossal-AI
In train script, you need to initialize the distributed environment for Colossal-AI after your config file is prepared. We call this process `launch`. In Colossal-AI, we provided several launch methods to initialize the distributed backend. In most cases, you can use `colossalai.launch` and `colossalai.get_default_parser` to pass the parameters via command line. Besides, Colossal-AI can utilize the existing launch tool provided by PyTorch as many users are familiar with by using `colossalai.launch_from_torch`. For more details, you can view the related [documents](https://www.colossalai.org/docs/basics/launch_colossalai).
@ -613,7 +613,7 @@ NUM_MICRO_BATCHES = parallel['pipeline']
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LENGTH, HIDDEN_SIZE)
```
Ohter configs:
Other configs:
```python
# hyper parameters
# BATCH_SIZE is as per GPU

@ -14,9 +14,9 @@ In our new design, `colossalai.booster` replaces the role of `colossalai.initial
### Plugin
Plugin is an important component that manages parallel configuration (eg: The gemini plugin encapsulates the gemini acceleration solution). Currently supported plugins are as follows:
***GeminiPlugin:*** This plugin wrapps the Gemini acceleration solution, that ZeRO with chunk-based memory management.
***GeminiPlugin:*** This plugin wraps the Gemini acceleration solution, that ZeRO with chunk-based memory management.
***TorchDDPPlugin:*** This plugin wrapps the DDP acceleration solution, it implements data parallelism at the module level which can run across multiple machines.
***TorchDDPPlugin:*** This plugin wraps the DDP acceleration solution, it implements data parallelism at the module level which can run across multiple machines.
***LowLevelZeroPlugin:*** This plugin wraps the 1/2 stage of Zero Redundancy Optimizer. Stage 1 : Shards optimizer states across data parallel workers/GPUs. Stage 2 : Shards optimizer states + gradients across data parallel workers/GPUs.
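To make the difference between the two stages concrete, the following sketch estimates model-state memory per GPU. It assumes the commonly used mixed-precision Adam accounting (2 bytes of fp16 parameters, 2 bytes of fp16 gradients and 12 bytes of optimizer state per parameter); these constants and the function `zero_bytes_per_gpu` are assumptions for illustration, not values taken from the plugin.

```python
def zero_bytes_per_gpu(num_params: int, world_size: int, stage: int) -> float:
    """Approximate model-state memory per GPU, in bytes.

    Illustrative only: the byte counts below are assumptions
    (mixed-precision Adam), not read from Colossal-AI.
    """
    if stage not in (0, 1, 2):
        raise ValueError("stage must be 0, 1 or 2")
    if world_size <= 0:
        raise ValueError("world_size must be positive")

    params = 2 * num_params   # fp16 parameters (assumed)
    grads = 2 * num_params    # fp16 gradients (assumed)
    optim = 12 * num_params   # fp32 master weights + Adam moments (assumed)

    if stage >= 1:
        optim /= world_size   # Stage 1: shard optimizer states
    if stage >= 2:
        grads /= world_size   # Stage 2: additionally shard gradients
    return params + grads + optim


# 1B parameters on 8 data parallel workers.
for stage in (0, 1, 2):
    gib = zero_bytes_per_gpu(10**9, 8, stage) / 2**30
    print(f"stage {stage}: {gib:.2f} GiB")
```

Stage 0 here stands for plain data parallelism with nothing sharded, included only as a baseline.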

@ -52,7 +52,7 @@ An instance of class [ComputeSpec](https://colossalai.readthedocs.io/en/latest/c
## Example
Let's see an example. A ColoTensor is initialized and sharded on 8 GPUs using tp_degree=4, dp_dgree=2. And then the tensor is sharded along the last dim among the TP process groups. Finally, we reshard it along the first dim (0 dim) among the TP process groups. We encourage users to run the code and observe the shape of each tensor.
Let's see an example. A ColoTensor is initialized and sharded on 8 GPUs using tp_degree=4, dp_degree=2. And then the tensor is sharded along the last dim among the TP process groups. Finally, we reshard it along the first dim (0 dim) among the TP process groups. We encourage users to run the code and observe the shape of each tensor.
```python

@ -67,7 +67,7 @@ Given $P=q \times q \times q$ processors, we present the theoretical computation
## Usage
To enable 3D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallism setting as below.
To enable 3D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
data=1,

@ -75,7 +75,7 @@ Build your model, optimizer, loss function, lr scheduler and dataloaders. Note t
NUM_EPOCHS = 200
BATCH_SIZE = 128
GRADIENT_CLIPPING = 0.1
# build resnetå
# build resnet
model = resnet34(num_classes=10)
# build dataloaders
train_dataset = CIFAR10(root=Path(os.environ.get('DATA', './data')),

@ -53,7 +53,7 @@ It's compatible with all parallel methods in ColossalAI.
> ⚠ It only offloads optimizer states on CPU. This means it only affects CPU training or Zero/Gemini with offloading.
## Exampls
## Examples
Let's start from two simple examples -- training GPT with different methods. These examples relies on `transformers`.

@ -156,4 +156,4 @@ trainer.fit(train_dataloader=train_dataloader,
display_progress=True)
```
We use `2` pipeline stages and the batch will be splitted into `4` micro batches.
We use `2` pipeline stages and the batch will be split into `4` micro batches.

@ -72,7 +72,7 @@ chunk_manager = init_chunk_manager(model=module,
gemini_manager = GeminiManager(placement_policy, chunk_manager)
```
`hidden_dim` is the hidden dimension of DNN. Users can provide this argument to speed up searching. If users do not know this argument before training, it is ok. We will use a default value 1024. `min_chunk_size_mb` is the the minimum chunk size in MegaByte. If the aggregate size of parameters is still samller than the minimum chunk size, all parameters will be compacted into one small chunk.
`hidden_dim` is the hidden dimension of DNN. Users can provide this argument to speed up searching. If users do not know this argument before training, it is ok. We will use a default value 1024. `min_chunk_size_mb` is the the minimum chunk size in MegaByte. If the aggregate size of parameters is still smaller than the minimum chunk size, all parameters will be compacted into one small chunk.
Initialization of the optimizer.
```python

@ -48,7 +48,7 @@ Colossal-AI 为用户提供了一个全局 context使他们能够轻松地管
world_size: int,
config: Config,
data_parallel_size: int,
pipeline_parlalel_size: int,
pipeline_parallel_size: int,
tensor_parallel_size: int,
arg1,
arg2):

@ -122,7 +122,7 @@ Inside the initialization of Experts, the local expert number of each GPU will b
## Train Your Model
Do not to forget to use `colossalai.initialize` function in `colosalai` to add gradient handler for the engine.
Do not to forget to use `colossalai.initialize` function in `colossalai` to add gradient handler for the engine.
We handle the back-propagation of MoE models for you.
In `colossalai.initialize`, we will automatically create a `MoeGradientHandler` object to process gradients.
You can find more information about the handler `MoeGradientHandler` in colossal directory.

@ -48,7 +48,7 @@ zero = dict(
</figure>
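ColossalAI designed Gemini, which, like the twin stars it is named after, manages the memory space of both CPU and GPU. It lets tensors be dynamically distributed across CPU and GPU memory during training, so that model training can break through the GPU memory wall. The memory manager consists of two parts: MemStatsCollector(MSC) and StatefuleTensorMgr(STM).
ColossalAI designed Gemini, which, like the twin stars it is named after, manages the memory space of both CPU and GPU. It lets tensors be dynamically distributed across CPU and GPU memory during training, so that model training can break through the GPU memory wall. The memory manager consists of two parts: MemStatsCollector(MSC) and StatefulTensorMgr(STM).
We take advantage of the iterative nature of deep learning training. We divide the iterations into two stages, warmup and non-warmup: the first one or several iterations belong to the warmup stage, and the remaining iterations belong to the non-warmup stage. In the warmup stage we collect information for the MSC, and in the non-warmup stage the STM takes the information collected by the MSC to move tensors, so as to minimize the volume of CPU-GPU data movement.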
@ -75,7 +75,7 @@ STM管理所有model data tensor的信息。在模型的构造过程中Coloss
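We trigger a memory sampling operation when an operator starts and finishes computing. We call this point in time the **sampling moment**, and the time between two sampling moments a **period**. The computation is a black box: because temporary buffers may be allocated, memory usage is complicated. However, we can obtain the maximum system memory usage of a period fairly accurately. Non-model data usage can be obtained as the maximum system memory usage between two sampling moments minus the model memory usage.
How do we design the sampling moment? We place it before the model data layout adjust of preOp, as shown in the figure below. At each sampling moment we obtain the system memory used of the previous period and the model data memoy used of the next period. Parallel strategies make the MSC's work harder. As shown in the figure, with ZeRO or Tensor Parallel, for example, model data must be gathered before the Op is computed, which brings extra memory requirements. We therefore require that system memory be sampled before the model data changes, so that within a period the MSC captures the model memory change caused by preOp. For example, in period 2-3 we take into account the memory changes caused by tensor gather and shard.
How do we design the sampling moment? We place it before the model data layout adjust of preOp, as shown in the figure below. At each sampling moment we obtain the system memory used of the previous period and the model data memory used of the next period. Parallel strategies make the MSC's work harder. As shown in the figure, with ZeRO or Tensor Parallel, for example, model data must be gathered before the Op is computed, which brings extra memory requirements. We therefore require that system memory be sampled before the model data changes, so that within a period the MSC captures the model memory change caused by preOp. For example, in period 2-3 we take into account the memory changes caused by tensor gather and shard.
Although the sampling moment could be placed elsewhere, for example so as to exclude the change of the gather buffer, this would cause trouble. Op implementations differ between parallel methods: for a Linear Op, for example, the gather buffer is allocated inside the Op under Tensor Parallel, whereas under ZeRO it is allocated in PreOp. Sampling at the beginning of PreOp helps unify the two cases.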
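The subtraction described in this hunk (non-model data usage equals the peak system memory of a period minus the model data memory) can be sketched as below. The function name `non_model_data_usage` and the sample numbers are made up for illustration and do not come from the MemStatsCollector implementation.

```python
from typing import List


def non_model_data_usage(max_system_mem: List[int], model_data_mem: List[int]) -> List[int]:
    """Per-period non-model memory = peak system memory - model data memory.

    Hypothetical helper; both lists hold one value per period, in the same unit.
    """
    if len(max_system_mem) != len(model_data_mem):
        raise ValueError("both lists must cover the same periods")
    return [system - model for system, model in zip(max_system_mem, model_data_mem)]


# Made-up measurements for three periods, in MB.
peak_system = [10, 14, 12]
model_data = [6, 9, 8]
print(non_model_data_usage(peak_system, model_data))  # [4, 5, 4]
```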

@ -52,7 +52,7 @@ export CHECKPOINT_DIR="your_opt_checkpoint_path"
# the ${CONFIG_DIR} must contain a server.sh file as the entry of service
export CONFIG_DIR="config_file_path"
docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config --ipc=host energonai:lastest
docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config --ipc=host energonai:latest
```
Then you can open `https://[IP-ADDRESS]:8020/docs#` in your browser to try it out.

@ -477,7 +477,7 @@ def build_cifar(batch_size):
return train_dataloader, test_dataloader
# craete dataloaders
# create dataloaders
train_dataloader , test_dataloader = build_cifar()
# create loss function
criterion = CrossEntropyLoss(label_smoothing=0.1)
@ -492,7 +492,7 @@ lr_scheduler = CosineAnnealingWarmupLR(optimizer=optimizer,
#### Start the Colossal-AI engine
```python
# intiailize
# initialize
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model=model,
optimizer=optimizer,
criterion=criterion,

@ -53,7 +53,7 @@ ColoTensor 包含额外的属性[ColoTensorSpec](https://colossalai.readthedocs.
## Example
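Let's look at an example. A ColoTensor is initialized and sharded on 8 GPUs using tp_degree=4, dp_dgree=2. The tensor is then sharded along the last dimension among the TP process groups. Finally, we reshard it along the first dimension (dim 0) among the TP process groups. We encourage users to run the code and observe the shape of each tensor.
Let's look at an example. A ColoTensor is initialized and sharded on 8 GPUs using tp_degree=4, dp_degree=2. The tensor is then sharded along the last dimension among the TP process groups. Finally, we reshard it along the first dimension (dim 0) among the TP process groups. We encourage users to run the code and observe the shape of each tensor.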
```python

@ -203,7 +203,7 @@ Naive AMP 的默认参数:
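- initial_scale(int): the initial value of the gradient scaler
- growth_factor(int): the growth rate of the loss scale
- backoff_factor(float): the decrease rate of the loss scale
- hysterisis(int): the delay shift in dynamic loss scaling
- hysteresis(int): the delay shift in dynamic loss scaling
- max_scale(int): the maximum allowed value of the loss scale
- verbose(bool): if set to `True`, debug information will be printed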
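To show how these parameters could interact, here is a minimal sketch of one common dynamic loss scaling scheme. It is not the Naive AMP implementation: the class `DynamicLossScaler`, its default values, and the `growth_interval` parameter (which does not appear in the lines shown above) are assumptions made for this example.

```python
class DynamicLossScaler:
    """Illustrative dynamic loss scaler; not Colossal-AI's implementation."""

    def __init__(self, initial_scale: float = 2.0**16, growth_factor: float = 2.0,
                 backoff_factor: float = 0.5, growth_interval: int = 1000,
                 hysteresis: int = 2, max_scale: float = 2.0**32) -> None:
        self.scale = float(initial_scale)
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval  # assumed parameter
        self.hysteresis = hysteresis
        self.max_scale = max_scale
        self._good_steps = 0                # consecutive steps without overflow
        self._overflows_left = hysteresis   # overflows tolerated before backing off

    def update(self, overflow: bool) -> None:
        """Adjust the scale after one training step."""
        if overflow:
            self._good_steps = 0
            self._overflows_left -= 1
            # Shrink the scale only once the hysteresis budget is used up.
            if self._overflows_left <= 0:
                self.scale *= self.backoff_factor
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                # Enough clean steps: grow the scale, capped at max_scale.
                self._good_steps = 0
                self._overflows_left = self.hysteresis
                self.scale = min(self.scale * self.growth_factor, self.max_scale)


scaler = DynamicLossScaler(initial_scale=1024, growth_interval=2, hysteresis=2, max_scale=2048)
for overflow in (True, True, False, False):
    scaler.update(overflow)
print(scaler.scale)  # 1024.0
```

In this sketch the first overflow is absorbed by the hysteresis budget, the second halves the scale to 512, and two clean steps then double it back to 1024.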

@ -53,7 +53,7 @@ optimizer = HybridAdam(model.parameters(), lr=1e-3, nvme_offload_fraction=1.0, n
> ⚠ It only offloads optimizer states on CPU. This means it only affects CPU training or Zero/Gemini with offloading.
## Exampls
## Examples
Let's start from two simple examples -- training GPT with different methods. These examples relies on `transformers`.
First, let's start with two simple examples -- training GPT with different methods. These examples rely on `transformers`.