Merge pull request #10 from hpcaitech/feature/zhdoc

added Chinese documents and fixed some typos in English documents
Frank Lee 3 years ago committed by GitHub
commit ccb44882e1

@ -1,12 +1,12 @@
# Add Your Own Parallelism
# Add your own parallelism
## Overview
To enable researchers and engineers to extend our framework to other novel large-scale distributed training algorithms
with less effort, we have decoupled the various components in the training lifecycle. You can implement your own
with less effort, we have decoupled various components in the training lifecycle. You can implement your own
parallelism by simply inheriting from the base class.
The main components are
The main components are:
1. `ProcessGroupInitializer`
2. `GradientHandler`
@ -14,14 +14,13 @@ The main components are
## Process Group Initializer
Parallelism is often managed by process groups where processes involved in parallel computing are placed in the same
Parallelism is often managed by process groups where processes involved in the same parallel algorithm are placed in the same
process group. For different parallel algorithms, different process groups need to be created. ColossalAI provides a
global context for the user to easily manage their process groups. If you wish to add new process group, you can easily
global context for users to easily manage their process groups. If you wish to add a new process group, you can easily
define a new class and set it in your configuration file. To define your own way of creating process groups, you can
follow the steps below to create new distributed initialization.
1. Add your parallel mode in `colossalai.context.parallel_mode.ParallelMode`
follow the steps below to create a new distributed initialization.
1. Add your parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
```python
class ParallelMode(Enum):
GLOBAL = 'global'
@ -34,11 +33,10 @@ follow the steps below to create new distributed initialization.
NEW_MODE = 'new_mode' # define your mode here
```
2. Create a `ProcessGroupInitializer`. You can refer to examples given in `colossal.context.dist_group_initializer`. The
2. Create a `ProcessGroupInitializer`. You can refer to examples given in `colossalai.context.dist_group_initializer`. The
first six arguments are fixed. `ParallelContext` will pass in these arguments for you. If you need to set other
arguments, you can add them after these six, like `arg1, arg2` in the example below. Lastly, register your initializer to the
registry by adding the decorator `@DIST_GROUP_INITIALIZER.register_module`.
```python
# sample initializer class
@DIST_GROUP_INITIALIZER.register_module
@ -84,18 +82,16 @@ follow the steps below to create new distributed initialization.
## Gradient Handler
Gradient handlers are objects which execute the all-reduce operations on parameters' gradients. As different all-reduce
strategies may be executed for different kinds of parallelism, the user can
inherit `colossal.engine.gradient_handler.BaseGradientHandler` to implement their strategies. Currently, the library
strategies may be executed for different kinds of parallelism, users can
inherit `colossalai.engine.gradient_handler.BaseGradientHandler` to implement their strategies. Currently, the library
uses the normal data parallel gradient handler which all-reduces the gradients across data parallel ranks. The data
parallel gradient handler is added to the engine automatically if data parallelism is detected. You can add your own
gradient handler like below:
```python
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine import BaseGradientHandler
@GRADIENT_HANDLER.register_module
class YourGradientHandler(BaseGradientHandler):
@ -116,5 +112,5 @@ dist_initializer = [
A schedule defines how to execute the forward and backward passes. Currently, ColossalAI provides pipeline and non-pipeline
schedules. If you want to modify how the forward and backward passes are executed, you can
inherit `colossalai.engine.BaseSchedule` and implement your idea. You can add your schedule to the engine before
inherit `colossalai.engine.BaseSchedule` and implement your idea. You can also add your schedule to the engine before
training.
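As a hedged sketch of what such a subclass could look like (the overridden method name and its signature below are assumptions for illustration only; check `colossalai.engine.BaseSchedule` for the actual interface to implement):
```python
from colossalai.engine import BaseSchedule


class MySchedule(BaseSchedule):
    # a sketch of a custom schedule; the method below is hypothetical and only
    # illustrates where a single forward and backward pass would be implemented
    def forward_backward_step(self, data_iter, model, criterion, optimizer, forward_only=False):
        data, label = next(data_iter)
        output = model(data)
        loss = criterion(output, label)
        if not forward_only:
            loss.backward()
        return output, label, loss
```
An instance of such a schedule could then be passed to the engine as the `schedule` argument, in the same way the pre-defined schedules are passed in the getting started training script.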

@ -0,0 +1,103 @@
# Add new parallelism techniques
To make it easier for researchers and engineers to extend our framework to new large-scale distributed training algorithms, we have decoupled several components in the training lifecycle. You can implement
new parallelism techniques simply by inheriting from the base classes.
The main components are:
1. `ProcessGroupInitializer`
2. `GradientHandler`
3. `Schedule`
## Process group initializer
Parallelism is usually managed by process groups: processes that belong to the same parallel algorithm are placed in the same process group. If several different parallelism techniques are used in the
system, multiple process groups need to be created. ColossalAI provides a global context for users to manage their process groups conveniently. If you wish to add a new process group, you can define a new
class and set it in your configuration file. The steps below describe how to add your new parallelism technique to the system and how to initialize it.
1. Add your new parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
```python
class ParallelMode(Enum):
GLOBAL = 'global'
DATA = 'data'
PIPELINE = 'pipe'
PIPELINE_PREV = 'pipe_prev'
PIPELINE_NEXT = 'pipe_next'
...
NEW_MODE = 'new_mode' # define your mode here
```
2. Create a subclass of `ProcessGroupInitializer`. You can refer to the examples given in `colossalai.context.dist_group_initializer`. The first six arguments are determined by `ParallelContext`.
If you need to set new arguments, you can replace `arg1` and `arg2` in the example below with your own arguments. Finally, register your initializer to our registry with the
`@DIST_GROUP_INITIALIZER.register_module` decorator.
```python
# sample initializer class
@DIST_GROUP_INITIALIZER.register_module
class MyParallelInitializer(ProcessGroupInitializer):
def __init__(self,
rank: int,
world_size: int,
config: Config,
data_parallel_size: int,
pipeline_parallel_size: int,
tensor_parallel_size: int,
arg1,
arg2):
super().__init__(rank, world_size, config)
self.arg1 = arg1
self.arg2 = arg2
# ... your variable init
def init_parallel_groups(self):
# initialize your process groups
pass
```
After that, you can insert your initializer into the current mode-to-initializer mapping `colossalai.constants.INITIALIZER_MAPPING`. You can also modify this file to dynamically change the mapping
between names and parallel modes.
```python
colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
```
3. Set your initializer in the configuration file. If your initializer requires extra arguments, you can pass them in here as well. The code below lets `ParallelContext` create your initializer and initialize the process groups you need.
```python
parallel = dict(
pipeline=dict(size=1),
tensor=dict(size=x, mode='new_mode') # this is where you enable your new parallel mode
)
```
## Gradient handler
Gradient handlers perform all-reduce operations on the gradients of model parameters. Since different parallelism techniques may require different all-reduce strategies, users can inherit
`colossalai.engine.gradient_handler.BaseGradientHandler` to implement their own strategies. Currently, ColossalAI uses the normal data parallel gradient handler, which all-reduces the gradients across all
data parallel ranks. This handler is created automatically when ColossalAI detects that data parallelism is in use. You can add your own gradient handler with the code shown below:
```python
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine import BaseGradientHandler
@GRADIENT_HANDLER.register_module
class YourGradientHandler(BaseGradientHandler):
def handle_gradient(self):
do_something()
```
After that, you can specify the gradient handler you want to use in the configuration file.
```python
dist_initializer = [
dict(type='YourGradientHandler'),
]
```
## Schedule
The schedule specifies which operations are executed in the forward and backward passes. ColossalAI provides both pipeline and non-pipeline schedules. If you want to change how the forward and backward
passes are executed, you can inherit `colossalai.engine.BaseSchedule` and implement your own. You can also add your schedule to our engine before training the model.

@ -1,24 +1,24 @@
# Mixed Precision Training
# Mixed precision training
In Colossal-AI, we have integrated different implementations of mixed precision training:
In ColossalAI, we have incorporated different implementations of mixed precision training:
1. torch.cuda.amp
2. apex.amp
3. tensor-parallel amp
The first two rely on the original implementation of [PyTorch](https://pytorch.org/docs/stable/amp.html)
(version 1.6 and above) and [Nvidia Apex](https://github.com/NVIDIA/apex). However, these two methods are not compatible
with tensor parallelism. This is because that tensors are split across devices in tensor parallelism, thus, it is needed
to communicate among different processes to check if inf or nan occurs throughout the whole model weights. For the mixed
precision training with tensor parallel, we adapted this feature from [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
with tensor parallelism. This is because tensors are split across devices in tensor parallelism, so communication
among different processes is required to check whether `inf` or `nan` occurs anywhere in the model weights. For the mixed
precision training with tensor parallelism, we adapted this feature from [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
To use mixed precision training, you can easily specify the `fp16` field in the configuration file. Currently, torch and
apex amp cannot be guaranteed to work with tensor and pipeline parallelism, thus, only the last one is recommended if you
To use mixed precision training, you can simply set the `fp16` field in the config file. Currently, PyTorch and
Apex AMP cannot be guaranteed to work with tensor and pipeline parallelism, so only the last one is recommended if you
are using hybrid parallelism.
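For hybrid parallelism this usually means selecting the tensor-parallel mode in the `fp16` field. Below is a minimal, hedged sketch; the scale value is an illustrative assumption, and the full option lists are given in the sections that follow.
```python
from colossalai.engine import AMP_TYPE

# minimal sketch: pick the tensor-parallel AMP mode recommended for hybrid parallelism
fp16 = dict(
    mode=AMP_TYPE.PARALLEL,
    initial_scale=2 ** 8  # illustrative value only
)
```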
## Torch AMP
## PyTorch AMP
PyTorch provides mixed precision training in version 1.6 and above. It provides an easy way to cast data to fp16 format
while keeping some operations such as reductions in fp32. You can configure the gradient scaler in the configuration.
PyTorch provides mixed precision training in version 1.6 and above. It provides an easy way to cast data to `fp16` format
while keeping some operations such as reductions in `fp32`. You can configure the gradient scaler in the config file.
```python
from colossalai.engine import AMP_TYPE
@ -34,13 +34,14 @@ fp16=dict(
)
```
## Apex AMP
For this mode, we rely on the [Apex](https://nvidia.github.io/apex/) implementation for mixed precision training. We supported this plugin because it allows
for finer control on the granularity of mixed precision. For example, `O2` level (optimization level 2) will keep batch normalization in fp32.
For this mode, we rely on the [Apex](https://nvidia.github.io/apex/) implementation for mixed precision training. We support
this plugin because it allows for finer control on the granularity of mixed precision. For example, `O2` level (optimization level 2)
will keep batch normalization in `fp32`.
The following code block shows a config file for Apex AMP.
The configuration is like below.
```python
from colossalai.engine import AMP_TYPE
@ -64,8 +65,10 @@ fp16 = dict(
## Tensor Parallel AMP
We leveraged the Megatron-LM implementation to achieve mixed precision training while maintaining compatibility with
complex tensor and pipeline parallel.
We leveraged the Megatron-LM implementation to achieve mixed precision training while maintaining compatibility with complex tensor
and pipeline parallelism.
The following code block shows a config file for this mode.
```python
from colossalai.engine import AMP_TYPE

@ -0,0 +1,79 @@
# Mixed precision training
ColossalAI supports the following three different ways of mixed precision training:
1. torch.cuda.amp
2. apex.amp
3. tensor-parallel AMP
The first two rely on the native implementation of [PyTorch](https://pytorch.org/docs/stable/amp.html) (version 1.6 or above) and
[Nvidia Apex](https://github.com/NVIDIA/apex). However, these two methods are not compatible with tensor parallelism, because in tensor parallelism tensors are split and stored on different devices.
Mixed precision training that is compatible with tensor parallelism therefore requires communication among different processes to check whether `inf` or `nan` occurs in the model parameters, so we adopted
the implementation from [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
You can simply set the `fp16` field in the configuration file to True to use mixed precision training. Currently, PyTorch and Apex AMP cannot be guaranteed to work with tensor and pipeline parallelism,
so we recommend the last mixed precision training method.
## PyTorch AMP
PyTorch provides mixed precision training in version 1.6 and above. It casts data to `fp16` format while keeping the precision of some operations in `fp32`. You can configure it in the configuration file.
```python
from colossalai.engine import AMP_TYPE
fp16=dict(
mode=AMP_TYPE.TORCH,
# below are default values for grad scaler
init_scale=2.**16,
growth_factor=2.0,
backoff_factor=0.5,
growth_interval=2000,
enabled=True
)
```
## Apex AMP
For this mode, we use the mixed precision training from [Apex](https://nvidia.github.io/apex/), because it provides fine-grained control over mixed precision. For example, the `O2` level
(optimization level 2) keeps batch normalization in `fp32`. The code block below shows a config file using Apex AMP.
```python
from colossalai.engine import AMP_TYPE
fp16 = dict(
mode=AMP_TYPE.APEX,
# below are the default values
enabled=True,
opt_level='O1',
cast_model_type=None,
patch_torch_functions=None,
keep_batchnorm_fp32=None,
master_weights=None,
loss_scale=None,
cast_model_outputs=None,
num_losses=1,
verbosity=1,
min_loss_scale=None,
max_loss_scale=16777216.0
)
```
## Tensor-parallel AMP
We adapted the mixed precision training implementation from Megatron-LM, which is compatible with both tensor parallelism and pipeline parallelism. The code block below shows a config file using tensor-parallel AMP.
```python
from colossalai.engine import AMP_TYPE
fp16 = dict(
mode=AMP_TYPE.PARALLEL,
# below are the default values
clip_grad=0,
log_num_zeros_in_grad=False,
initial_scale=2 ** 32,
min_scale=1,
growth_factor=2,
backoff_factor=0.5,
growth_interval=1000,
hysteresis=2
)
```

@ -1,6 +1,6 @@
# Config file
Here is an example config file of training ViT on cifar:
Here is a config file example showing how to train a ViT model on the CIFAR10 dataset using ColossalAI:
```python
# build train_dataset and train_dataloader from this dictionary

@ -0,0 +1,187 @@
# Config file
The example in the code block below shows how to train a ViT model on the CIFAR10 dataset using ColossalAI.
```python
# build train_dataset and train_dataloader from this dictionary
# It is not compulsory in the config file; instead, you can pass this dictionary as an argument to colossalai.initialize()
train_data = dict(
# dictionary for building Dataset
dataset=dict(
# the type CIFAR10Dataset has to be registered
type='CIFAR10Dataset',
root='/path/to/data',
# transform pipeline
transform_pipeline=[
dict(type='Resize', size=IMG_SIZE),
dict(type='RandomCrop', size=IMG_SIZE, padding=4),
dict(type='RandomHorizontalFlip'),
dict(type='ToTensor'),
dict(type='Normalize',
mean=[0.4914, 0.4822, 0.4465],
std=[0.2023, 0.1994, 0.2010]),
]
),
# dictionary for building Dataloader
dataloader=dict(
batch_size=BATCH_SIZE,
pin_memory=True,
# num_workers=1,
shuffle=True,
)
)
# build test_dataset and test_dataloader from this dictionary
test_data = dict(
dataset=dict(
type='CIFAR10Dataset',
root='/path/to/data',
train=False,
transform_pipeline=[
dict(type='Resize', size=IMG_SIZE),
dict(type='ToTensor'),
dict(type='Normalize',
mean=[0.4914, 0.4822, 0.4465],
std=[0.2023, 0.1994, 0.2010]
),
]
),
dataloader=dict(
batch_size=BATCH_SIZE,
pin_memory=True,
# num_workers=1,
)
)
# compulsory
# build optimizer from this dictionary
optimizer = dict(
# Available types: 'ZeroRedundancyOptimizer_Level_1', 'ZeroRedundancyOptimizer_Level_2', 'ZeroRedundancyOptimizer_Level_3'
# 'Adam', 'Lamb', 'SGD', 'FusedLAMB', 'FusedAdam', 'FusedSGD', 'FP16Optimizer'
type='Adam',
lr=0.001,
weight_decay=0
)
# compulsory
# build loss function from this dictionary
loss = dict(
# Available types:
# 'CrossEntropyLoss2D', 'CrossEntropyLoss2p5D', 'CrossEntropyLoss3D'
type='CrossEntropyLoss2D',
)
# compulsory
# build model from this dictionary
model = dict(
# available types: 'PretrainBERT', 'VanillaResNet', 'VisionTransformerFromConfig'
type='VisionTransformerFromConfig',
# each of the key-value pairs below refers to a layer
# input data passes through these layers recursively
tensor_splitting_cfg=dict(
type='ViTInputSplitter2D',
),
embedding_cfg=dict(
type='ViTPatchEmbedding2D',
img_size=IMG_SIZE,
patch_size=PATCH_SIZE,
embed_dim=DIM,
),
token_fusion_cfg=dict(
type='ViTTokenFuser2D',
img_size=IMG_SIZE,
patch_size=PATCH_SIZE,
embed_dim=DIM,
drop_rate=0.1
),
norm_cfg=dict(
type='LayerNorm2D',
normalized_shape=DIM,
eps=1e-6,
),
block_cfg=dict(
# ViTBlock is a submodule
type='ViTBlock',
attention_cfg=dict(
type='ViTSelfAttention2D',
hidden_size=DIM,
num_attention_heads=NUM_ATTENTION_HEADS,
attention_dropout_prob=0.,
hidden_dropout_prob=0.1,
checkpoint=True
),
droppath_cfg=dict(
type='VanillaViTDropPath',
),
mlp_cfg=dict(
type='ViTMLP2D',
in_features=DIM,
dropout_prob=0.1,
mlp_ratio=4,
checkpoint=True
),
norm_cfg=dict(
type='LayerNorm2D',
normalized_shape=DIM,
eps=1e-6,
),
),
head_cfg=dict(
type='ViTHead2D',
hidden_size=DIM,
num_classes=NUM_CLASSES,
),
embed_dim=DIM,
depth=DEPTH,
drop_path_rate=0.,
)
# hooks are built when initializing trainer
# possible hooks: 'BaseHook', 'MetricHook','LoadCheckpointHook'
# 'SaveCheckpointHook','LossHook', 'AccuracyHook', 'Accuracy2DHook'
# 'LogMetricByEpochHook', 'TensorboardHook','LogTimingByEpochHook', 'LogMemoryByEpochHook'
hooks = [
dict(type='LogMetricByEpochHook'),
dict(type='LogTimingByEpochHook'),
dict(type='LogMemoryByEpochHook'),
dict(type='Accuracy2DHook'),
dict(type='LossHook'),
# dict(type='TensorboardHook', log_dir='./tfb_logs'),
# dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
# dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
]
# three keys: pipeline, tensor, data
# if data=dict(size=1), which means no data parallelization, then there is no need to define it
parallel = dict(
pipeline=dict(size=1),
tensor=dict(size=4, mode='2d'),
)
# not compulsory
# pipeline or no pipeline schedule
fp16 = dict(
mode=AMP_TYPE.PARALLEL,
initial_scale=2 ** 8
)
# not compulsory
# build learning rate scheduler
lr_scheduler = dict(
type='LinearWarmupLR',
warmup_epochs=5
)
schedule = dict(
num_microbatches=8
)
# training stopping criterion
# you can give num_steps or num_epochs
num_epochs = 60
# config logging path
logging = dict(
root_path='./logs'
)
```

@ -3,27 +3,27 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
ColossalAI documentation
ColossalAI development documentation
======================================
.. toctree::
:maxdepth: 1
:caption: GETTING STARTED
:caption: Getting started guide
installation.md
run_demo.md
installation_zh.md
run_demo_zh.md
.. toctree::
:maxdepth: 1
:caption: CUSTOMIZE YOUR TRAINING
parallelization.md
model.md
trainer_engine.md
amp.md
zero.md
add_your_parallel.md
config.md
:caption: Customize your training
parallelization_zh.md
model_zh.md
trainer_engine_zh.md
amp_zh.md
zero_zh.md
add_your_parallel_zh.md
config_zh.md

@ -0,0 +1,40 @@
.. ColossalAI documentation master file, created by
sphinx-quickstart on Mon Oct 11 17:05:05 2021.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
ColossalAI documentation
======================================
.. toctree::
:maxdepth: 1
:caption: GETTING STARTED
installation.md
run_demo.md
.. toctree::
:maxdepth: 1
:caption: CUSTOMIZE YOUR TRAINING
parallelization.md
model.md
trainer_engine.md
amp.md
zero.md
add_your_parallel.md
config.md
.. toctree::
:maxdepth: 2
:caption: API REFERENCE
colossalai/colossalai
Indices and tables
==================
* :ref:`genindex`

@ -0,0 +1,25 @@
# Quick installation
## Install with pip
```bash
pip install colossalai
```
## Install from source
```shell
git clone git@github.com:hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
# install colossalai
pip install .
```
Install with fused kernel support (the command below is required when you use fused optimizers):
```bash
pip install -v --no-cache-dir --global-option="--cuda_ext" .
```
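As a quick, hedged sanity check after either installation route (this assumes nothing beyond the `colossalai` package having been installed by one of the commands above), you can verify that the package is importable:
```python
# minimal post-installation check: confirm the package can be imported
import colossalai

print('colossalai imported from', colossalai.__file__)
```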

@ -1,15 +1,17 @@
# Define your own parallel model
## Write a Simple 2D Parallel Model
Let's say that you have a huge MLP model with billions of parameters and its extremely large hidden layer size makes it
impossible to fit into a single GPU. Don't worry, ColossalAI is here to help you sort things out. With the help of ColossalAI,
you can write your model in the familiar way in which you used to write models for a single GPU, while ColossalAI automatically
splits your model weights and fit them perfectly into a set of GPUs. We give a simple example showing how to write a simple
2D parallel model in the ColossalAI context.
Let's say we have a huge MLP model and its very large hidden size makes it difficult to fit into a single GPU. We can
then distribute the model weights across GPUs in a 2D mesh while still writing the model in the familiar way.
## Write a simple 2D parallel model
```python
from colossalai.nn import Linear2D
import torch.nn as nn
class MLP_2D(nn.Module):
def __init__(self):
@ -21,8 +23,9 @@ class MLP_2D(nn.Module):
x = self.linear_1(x)
x = self.linear_2(x)
return x
```
## Use pre-defined model
Our Model Zoo supports *BERT*, *VIT*, *MLP-Mixer* of different sizes.
For your convenience, our Model Zoo provides some prevalent models such as *BERT*, *VIT*,
and *MLP-Mixer*. Feel free to customize them into different sizes to fit your specific needs.
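One hedged way to pick a Model Zoo model is to reference its registered type in the `model` section of your config file, as the ViT example in the config documentation does; the type name below comes from that example, while the hyper-parameter values are illustrative assumptions only.
```python
# select a pre-defined model by its registered type; the values below are placeholders
model = dict(
    type='VisionTransformerFromConfig',  # registered type taken from the config example
    embed_dim=512,                       # hypothetical hidden dimension
    depth=6,                             # hypothetical number of transformer blocks
    drop_path_rate=0.,
)
```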

@ -0,0 +1,28 @@
# Define a parallel model that fits your needs
If you are training a huge MLP model with billions of parameters, the model certainly cannot be trained on a single GPU. Don't worry, ColossalAI can solve this problem for you. You can still write your
model the way you would for a single GPU, and ColossalAI will automatically split the model parameters according to your parallel settings and distribute them evenly across a set of GPUs. Below is a simple
example showing how to write a 2D tensor parallel model in the ColossalAI environment.
## A simple 2D tensor parallel model
```python
from colossalai.nn import Linear2D
import torch.nn as nn
class MLP_2D(nn.Module):
def __init__(self):
super().__init__()
self.linear_1 = Linear2D(in_features=1024, out_features=16384)
self.linear_2 = Linear2D(in_features=16384, out_features=1024)
def forward(self, x):
x = self.linear_1(x)
x = self.linear_2(x)
return x
```
## Use a pre-defined model
For your convenience, we provide some prevalent pre-defined models in our Model Zoo, such as *BERT*, *VIT* and *MLP-Mixer*. You can customize the sizes of these models to fit your own needs.

@ -12,45 +12,44 @@ tensor parallelism.
```python
parallel = dict(
pipeline=dict["size": int],
tensor=dict["size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any]
pipeline=dict("size": int),
tensor=dict("size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any)
)
```
The name of the dictionary variable should be **parallel**. All the arguments, even **parallel** itself, are optional, and the data,
pipeline and tensor parallel sizes default to 1. The value of data, pipeline and tensor can be an int
representing the size of the specific parallel dimension, or a dictionary with a key called "size". The key "mode"
represents the way of model parallelism.
represents the way of tensor parallelism.
## Data Parallel
Data parallelism is the most common way to distribute your training task by splitting data into several shards and training
on a single shard on each device. The configuration for data parallelism is detected automatically and set for you. You do
not have to explicitly set them in your configurations. When data parallel size is larger than 1, Colossal-AI automatically
not have to explicitly set it in your configuration. When the data parallel size is larger than 1, ColossalAI automatically
adds the distributed data sampler to the dataloader to shard the dataset.
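As a concrete, hedged illustration of how the data parallel size falls out of this configuration (the total of 8 GPUs is an assumption made for illustration, not part of the original text):
```python
# with a world size of 8 GPUs, pipeline size 2 and tensor size 2 imply a
# data parallel size of 8 / (2 * 2) = 2; the numbers are illustrative only
parallel = dict(
    pipeline=dict(size=2),
    tensor=dict(size=2, mode='1d'),
)
```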
## 1D, 2D, 2.5D and 3D Parallel
To enable hybrid parallelism, we provide an array of tensor parallelism methods, together with the paper that introduces each
tensor parallel method. These parallel modes need to work with the distributed layers provided by Colossal-AI.
tensor parallel method. These parallel modes need to work with the distributed layers provided by ColossalAI.
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data,
model weights and layer outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$
devices where N is the number of tensor chunks in a single dimension.
devices where $N$ is the number of tensor chunks in a single dimension.
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
Inspired by the 2.5D matrix multi-plication algorithm, 2.5D parallel introduces a novel tensor parallelism which further
parallelizes 2D tensor parallelism. An amount of $P = N^2 d$ processors are arranged into d layers,
where each layer performs matrix multiplication operations independently with a dimension N.
Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism which further
parallelizes 2D tensor parallelism. An amount of $P = N^2 d$ processors are arranged into $d$ layers,
where each layer performs matrix multiplication operations independently with a dimension $N$.
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method achieves
the optimal, $O(P^{1/3})$ communication overhead on P processors, while both computation and memory usage are evenly distributed
the optimal, $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage are evenly distributed
through optimized load balancing of parameters as well as activations.
```python
# 1D parallel
parallel = dict(
@ -85,8 +84,8 @@ and the second layer to the second GPU. This example of course wastes the comput
the idea of pipeline parallelism.
As PyTorch is based on a dynamic computation graph, the computation flow is not known until execution. To support pipeline
parallelism in PyTorch, you may need to add one more attribute in your model class which tells Colossal-AI the sequence
of execution. One example you can refer is `colossalai.nn.VanillaResNet`.
parallelism in PyTorch, you may need to add one more attribute, `layers_cfg`, in your model class to tell ColossalAI
the sequence of execution. One example you can refer to is `colossalai.nn.model.VanillaResNet`.
```python
from colossalai.nn import BaseModel
@ -193,7 +192,7 @@ class VanillaResNet(BaseModel):
]
```
You can set the number of pipeline stages in your configuration file. When pipeline size is larger than 1, Colossal-AI
You can set the number of pipeline stages in your configuration file. When pipeline size is larger than 1, ColossalAI
will automatically create the pipeline schedule, which defines the forward and backward steps. You can specify how many microbatches
to run in each step in the `schedule` configuration.
@ -207,10 +206,10 @@ schedule = dict(
num_microbatches = 4 # set the number of microbatches per step
)
```
This feature is still in development and is only experimental for now.
## Sequence Parallel (experimental)
Sequence parallelism supports long-sequence modelling such as document-level text understanding and medical imaging.
This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
This feature is still in development is only experimental for now.
This feature is still in development and is only experimental for now.

@ -0,0 +1,202 @@
# Parallelism techniques
## Configure the combination of parallelism techniques
ColossalAI supports multiple parallelism techniques, including data parallelism, tensor parallelism (1D, 2D, 2.5D, 3D), pipeline parallelism and sequence parallelism. You can initialize the process groups
of the distributed system by changing the `parallel` dictionary in the configuration file, which must follow the format below. The data parallel size is derived from the pipeline parallel size and the
tensor parallel size given in `parallel`.
```python
parallel = dict(
pipeline=dict("size": int),
tensor=dict("size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any)
)
```
Note that the name of this dictionary variable must be **parallel**. All of its arguments, including `parallel` itself, are optional. If this variable is not provided in your code, all parallel sizes
default to 1, i.e., no parallelism is used. The values of data, pipeline and tensor in `parallel` represent the sizes of data parallelism, pipeline parallelism and tensor parallelism respectively, while
the value of mode represents the mode of tensor parallelism.
## Data parallelism
Data parallelism is the most common parallelism technique: the data is split into several shards, and each shard is trained on one device. ColossalAI detects the data parallel setting automatically and
sets up the environment for you, so you do not need to set it explicitly in your configuration. When the data parallel size is larger than 1, ColossalAI automatically adds a distributed data sampler to
the dataloader to shard the dataset.
## 1D, 2D, 2.5D and 3D tensor parallelism
To enable hybrid parallelism, we provide an array of tensor parallelism techniques. The paper corresponding to each technique is listed below. These tensor parallelism techniques need to work with the distributed layers provided by ColossalAI.
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2D tensor parallelism relies on the SUMMA matrix multiplication algorithm and splits the input data along two different dimensions. The split tensors are distributed over a 2D mesh, using $P = N^2$
devices in total, where $N$ is the number of split tensors along one dimension.
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallelism further splits the result of 2D tensor parallelism: $P = N^2 d$ processors are arranged over $d$ layers, and accordingly the matrix
multiplication is split into $d$ parts that are performed on different layers.
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
We also introduce a 3D tensor parallelism technique that parallelizes neural network parameters over a 3D processor cube. With $P$ processors, this technique achieves the optimal performance with a
communication overhead of $O(P^{1/3})$, while both computation and memory usage are evenly distributed across the $P$ processors.
Example `parallel` configurations for the above tensor parallelism techniques are shown in the code below.
```python
# 1D parallel
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=4, mode='1d')
)
# 2D parallel
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=4, mode='2d')
)
# 2.5D parallel
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=8, mode='2.5d', depth=2)
)
# 3D parallel
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=8, mode='3d')
)
```
## Pipeline parallelism (under development)
Pipeline parallelism splits a deep learning model into several parts by layer. For example, for a simple model consisting of two linear layers and two available GPUs, we can assign the work of the first
linear layer to one GPU and that of the second linear layer to the other. Of course, this example is only meant to illustrate how pipeline parallelism works and has no practical value.
Since PyTorch computation is based on a dynamic computation graph, the computation flow cannot be determined before execution. To support pipeline parallelism in PyTorch, you need to add an extra
attribute, `layers_cfg`, to your model class so that ColossalAI knows the exact computation flow. `colossalai.nn.VanillaResNet` gives an example you can refer to.
```python
from colossalai.nn import BaseModel
import torch
class VanillaResNet(BaseModel):
def __init__(
self,
num_cls: int,
block_type: str,
layers: List[int],
norm_layer_type: str = 'BatchNorm2d',
in_channels: int = 3,
groups: int = 1,
width_per_group: int = 64,
zero_init_residual: bool = False,
replace_stride_with_dilation: Optional[List[bool]] = None,
dilations=(1, 1, 1, 1)
) -> None:
super().__init__()
... # some model params
self.layers_cfg = [
# conv1
dict(type='Conv2d',
in_channels=in_channels,
out_channels=self.inplanes,
kernel_size=7,
stride=2,
padding=3,
bias=False),
# bn1
dict(
type=norm_layer_type,
num_features=self.inplanes
),
# relu
dict(
type='ReLU',
inplace=True
),
# maxpool
dict(
type='MaxPool2d',
kernel_size=3,
stride=2,
padding=1
),
# layer 1
dict(
inplanes=self.inplanes,
planes=64,
blocks=self.blocks[0],
dilation=self.dilations[0],
**self.reslayer_common_cfg
),
# layer 2
dict(
inplanes=64 * self.block_expansion,
planes=128,
blocks=self.blocks[1],
stride=2,
dilate=replace_stride_with_dilation[0],
dilation=self.dilations[1],
**self.reslayer_common_cfg
),
# layer 3
dict(
inplanes=128 * self.block_expansion,
planes=256,
blocks=layers[2],
stride=2,
dilate=replace_stride_with_dilation[1],
dilation=self.dilations[2],
**self.reslayer_common_cfg
),
# layer 4
dict(
inplanes=256 * self.block_expansion,
planes=512,
blocks=layers[3], stride=2,
dilate=replace_stride_with_dilation[2],
dilation=self.dilations[3],
**self.reslayer_common_cfg
),
# avg pool
dict(
type='AdaptiveAvgPool2d',
output_size=(1, 1)
),
# flatten
dict(
type='LambdaWrapper',
func=lambda mod, x: torch.flatten(x, 1)
),
# linear
dict(
type='Linear',
in_features=512 * self.block_expansion,
out_features=num_cls
)
]
```
You can manually set the number of pipeline stages in the configuration file. When the number of pipeline stages is larger than 1, ColossalAI automatically creates the pipeline schedule, which defines the
forward and backward passes. You can also use the `schedule` dictionary in the configuration file to define the number of microbatches trained in each step. The code below gives an example of configuring pipeline parallelism.
```python
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=1, mode=None)
)
schedule = dict(
num_microbatches = 4 # set the number of microbatches per step
)
```
This technique is still in experimental development.
## Sequence parallelism (under development)
Sequence parallelism is designed to support the modelling of long-sequence data, such as document-level text understanding and medical image analysis. This technique is proposed in the paper
[Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
This technique is still in experimental development.

@ -19,23 +19,27 @@ where `HOST` is the IP address of your system. Note that we use
the [Slurm](https://slurm.schedmd.com/documentation.html) job scheduling system here.
```bash
HOST=xxx.xxx.xxx.xxx srun ./scripts/slurm_dist_train.sh ./example/train_vit_2d.py ./configs/vit/vit_2d.py
HOST=xxx.xxx.xxx.xxx srun ./scripts/slurm_dist_train.sh ./examples/run_trainer.py ./configs/vit/vit_2d.py
```
`./configs/vit/vit_2d.py` is a config file, which is introduced in the [Config file](config.md) section below. These
config files are used by ColossalAI to define all kinds of training arguments, such as the model, dataset and training
method (optimizer, lr_scheduler, epoch, etc.). Config files are highly customizable and can be modified so as to train
different models.
`./example/run_trainer.py` contains a standard training script and is presented below, it reads the config file and
`./examples/run_trainer.py` contains a standard training script and is presented below. It reads the config file and
carries out the training process.
```python
import colossalai
from colossalai.core import global_context as gpc
from colossalai.engine import Engine
from colossalai.logging import get_global_dist_logger
from colossalai.trainer import Trainer
from colossalai.core import global_context as gpc
def run_trainer():
model, train_dataloader, test_dataloader, criterion, optimizer, schedule, lr_scheduler = colossalai.initialize()
logger = get_global_dist_logger()
schedule.data_sync = False
engine = Engine(
model=model,
criterion=criterion,
@ -43,17 +47,24 @@ engine = Engine(
lr_scheduler=lr_scheduler,
schedule=schedule
)
logger.info("engine is built", ranks=[0])
trainer = Trainer(engine=engine,
hooks_cfg=gpc.config.hooks,
verbose=True)
logger.info("trainer is built", ranks=[0])
logger.info("start training", ranks=[0])
trainer.fit(
train_dataloader=train_dataloader,
test_dataloader=test_dataloader,
max_epochs=gpc.config.num_epochs,
display_progress=True,
test_interval=5
test_interval=2
)
if __name__ == '__main__':
run_trainer()
```
Alternatively, the `model` variable can be substituted with a self-defined model or a pre-defined model in our Model
@ -62,7 +73,7 @@ Zoo. The detailed substitution process is elaborated [here](model.md).
## Features
ColossalAI provides a collection of parallel training components for you. We aim to support you with your development of
distributed deep learning models just like how you write single-GPU deeo learning models. We provide friendly tools to
distributed deep learning models just like how you write single-GPU deep learning models. We provide friendly tools to
kickstart distributed training in a few lines.
- [Data Parallelism](parallelization.md)

@ -0,0 +1,75 @@
# Quick start
ColossalAI is a large-scale deep learning framework that integrates efficient parallelism techniques. It can use these techniques to effectively accelerate model training on distributed systems with
multiple GPUs, and it can also run on non-distributed systems with a GPU. Below is the quick start guide for ColossalAI.
## Single-GPU system
When training a model on a non-distributed system with a GPU, ColossalAI achieves the current baseline efficiency.
[Here](https://colab.research.google.com/drive/1fJnqqFzPuzZ_kn1lwCpG2nh3l2ths0KE?usp=sharing#scrollTo=cQ_y7lBG09LS) we give a Google Colab
example showing how to use ColossalAI with the CIFAR10 dataset to train a LeNet model on a non-distributed system.
## Multi-GPU system
When training deep learning models on a distributed system with multiple GPUs, ColossalAI can use efficient parallelism techniques to significantly accelerate training. These techniques are detailed in the [Parallelism](parallelization.md) section below.
The code below trains a ViT model on a distributed system with four GPUs, where `HOST` is the IP address of your distributed system. Note that the code below uses the
[Slurm](https://slurm.schedmd.com/documentation.html) job scheduling system.
```bash
HOST=xxx.xxx.xxx.xxx srun ./scripts/slurm_dist_train.sh ./examples/run_trainer.py ./configs/vit/vit_2d.py
```
`./configs/vit/vit_2d.py` is a [config file](config.md). ColossalAI uses config files to define the arguments needed during training, such as the model type, the dataset, the optimizer, the learning rate scheduler, and so on.
You can train different models by writing config files. `./examples/run_trainer.py` is a standard training script; its code is attached below. The script reads the training arguments from the config file and trains the model.
```python
import colossalai
from colossalai.core import global_context as gpc
from colossalai.engine import Engine
from colossalai.logging import get_global_dist_logger
from colossalai.trainer import Trainer
def run_trainer():
model, train_dataloader, test_dataloader, criterion, optimizer, schedule, lr_scheduler = colossalai.initialize()
logger = get_global_dist_logger()
schedule.data_sync = False
engine = Engine(
model=model,
criterion=criterion,
optimizer=optimizer,
lr_scheduler=lr_scheduler,
schedule=schedule
)
logger.info("engine is built", ranks=[0])
trainer = Trainer(engine=engine,
hooks_cfg=gpc.config.hooks,
verbose=True)
logger.info("trainer is built", ranks=[0])
logger.info("start training", ranks=[0])
trainer.fit(
train_dataloader=train_dataloader,
test_dataloader=test_dataloader,
max_epochs=gpc.config.num_epochs,
display_progress=True,
test_interval=2
)
if __name__ == '__main__':
run_trainer()
```
The `model` variable in the code above can be replaced with a self-defined model or a pre-defined model from our `Model Zoo` in order to train different models. [Here](model.md) is a detailed description of how to make such a replacement.
## Features
ColossalAI provides a collection of parallel components to accelerate your model training. We introduce these components in the sections below. Our goal is to make the development of your distributed deep learning models as convenient as developing single-GPU deep learning models.
- [Data parallelism](parallelization.md)
- [1D, 2D, 2.5D, 3D tensor parallelism and sequence parallelism](parallelization.md)
- [Pipeline parallelism](parallelization.md)
- [Trainer and engine](trainer_engine.md)
- [Customize your parallel mode](add_your_parallel.md)
- [Mixed precision training](amp.md)
- [ZeRO optimizer](zero.md)

@ -2,7 +2,9 @@
## Build your engine
To better understand the function of `Engine` class, you should know the conception of the process function in common engines. The process function usually controls the behavior over a batch of a dataset, `Engine` class just controls the process function. For example, common process function looks like this:
To better understand how `Engine` class works, let's start from the conception of the process function in common engines. The process function
usually controls the behavior over a batch of a dataset, `Engine` class just controls the process function. Here we give a standard process
function in the following code block.
```python
def process_function(dataloader, model, criterion, optim):
@ -14,9 +16,13 @@ def process_function(dataloader, model, criterion, optim):
optim.step()
```
In `ignite.engine` or `keras.engine`, the process function is always provided by users. However, it is hard for users to write their own functions for pipeline parallelism. Aiming at accessible hybrid parallelism for users, we provide powerful `Engine` class. It enables pipeline parallelism and offers 1F1B non-interleaving strategy. Also, you can use pre-defined learning rate scheduler in your `Engine` to adjust learning rate during training.
In `ignite.engine` or `keras.engine`, the process function is always provided by users. However, it is tricky for users to write their own process
functions for pipeline parallelism. Aiming at offering accessible hybrid parallelism for users, we provide the powerful `Engine` class. This class
enables pipeline parallelism and offers a one-forward-one-backward non-interleaving strategy. Also, you can use a pre-defined learning rate scheduler
in the `Engine` class to adjust the learning rate during training.
In order to build your engine, just set model, criterion, optimizer, learning rate scheduler and schedule. Consider the following code as an example.
In order to build your engine, just set variables `model`, `criterion`, `optimizer`, `lr_scheduler` and `schedule`. The following code block provides
an example.
```python
import torch
@ -24,7 +30,6 @@ import torch.nn as nn
import torchvision.models as models
import colossalai
model = models.resnet18()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model)
@ -40,23 +45,27 @@ MyEngine = Engine(
)
```
More information is in API reference.
More information regarding the class can be found in the API references.
## Customize your trainer
### Overview
Before starting to learn how to customize a trainer meeting your need, you should have a basic understanding about the function of `Trainer`. We recommend you to read *Get Started* section and *Build your engine* first.
To learn how to customize a trainer which meets your needs, let's first take a look at the `Trainer` class. We highly recommend that you read *Get Started*
section and *Build your engine* first.
Trainer class tends to enable researchers and engineers to use our framework more conveniently, instead of writing their own scripts, we provide `Trainer` class and you can simply construct it with your own `Engine` by calling `MyTrainer = Trainer(MyEngine)`. Then use method `fit` to train or evaluate your model. In order to make our `Trainer` class more powerful, we add some useful features to it, such as monitor or record running states and metrics which indicate model's performance, or save after a training epoch.
The `Trainer` class enables researchers and engineers to use our framework more conveniently. Instead of having to write your own scripts, you can simply
construct your own trainer by calling the `Trainer` class, just like what we did in the following code block.
To accomplish that, specific actions must be added to the training or evaluation. `BaseHook` class allow you to add desired actions in specific time points. We have already created practical hooks for those useful features. What you need to do is just picking the hooks you want.
More detailed class descriptions can be found in API reference.
```python
MyTrainer = Trainer(MyEngine)
```
### Example
After that, you can use the `fit` method to train or evaluate your model. In order to make our `Trainer` class even more powerful, we incorporate a set of
handy tools to the class. For example, you can monitor or record the running states and metrics which indicate the current performance of the model. These
functions are realized by hooks. The `BaseHook` class allows you to execute your hook functions at specified points in time. We have already created some practical
hooks for you, as listed below. What you need to do is just pick the right ones which suit your needs. Detailed descriptions of the class can be found
in the API references.
```python
hooks = [
@ -65,26 +74,24 @@ hooks = [
dict(type='LogMemoryByEpochHook'),
dict(type='AccuracyHook'),
dict(type='LossHook'),
# dict(type='TensorboardHook', log_dir='./tfb_logs'),
# dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
# dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
dict(type='TensorboardHook', log_dir='./tfb_logs'),
dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
]
```
Above hooks will record metrics, used time and memory usage to log every epoch. Also it prints loss and accuracy to let users monitor the performance of the model.
These hook functions will record metrics, elapsed time and memory usage and write them to log after each epoch. Besides, they print the current loss and
accuracy to let users monitor the performance of the model.
### Hook
You can extend our `BaseHook` class. Hooks can be called at twelve time points. More detailed information can be found in API reference.
Or extend from `MetricHook` to write a metric collector. You should also use the decorator `@HOOKS.register_module` for your own hook class, and import it in your main python script.
For `after_train_iter()`, it receives the output of engine per iteration, which is a list including output, label and loss.
Note that you can define the priority to arrange the execution order of all hooks.
If you have your specific needs, feel free to extend our `BaseHook` class to add your own functions, or our `MetricHook` class to write a metric collector.
These hook functions can be called at twelve points in time. Besides, you can define the priorities of all hooks to arrange their execution order.
More information can be found in the API references.
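As a hedged sketch of such an extension (the import path of `BaseHook` and the exact signature of `after_train_iter` are assumptions based on the description above, not a verified API):
```python
from colossalai.registry import HOOKS
from colossalai.trainer.hooks import BaseHook  # hypothetical import path


@HOOKS.register_module
class PrintLossHook(BaseHook):
    # a sketch of a custom hook that prints the loss after every training iteration;
    # per the text above, the per-iteration engine output includes output, label and loss
    def after_train_iter(self, output, label, loss):
        print(f'iteration loss: {loss.item():.4f}')
```
It could then be enabled by adding `dict(type='PrintLossHook')` to the `hooks` list shown earlier.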
### Metric
You can write your own metric by extending `Metric` class. It is always used with `MetricHook`. If you write your own metric hooks, please set the priority carefully and make sure is called before other hooks which may use the results of metrics.
You can write your own metrics by extending our `Metric` class. It should be used with the `MetricHook` class. When you write your own metric hooks, please set
the priority carefully and make sure the hook is called before other hooks which might require the results of the metric hook.
We've already provided some metric hooks. We store metric objects in `runner.states['metrics']`. It is a dictionary and you can use the name of the metric to access it.
We've already provided some metric hooks and we store metric objects in `runner.states['metrics']`. It is a dictionary and metrics can be accessed by their names.
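For instance, a hedged sketch of reading a stored metric from inside a hook (the metric name `'Loss'` is a hypothetical key; only the `runner.states['metrics']` dictionary itself is described above):
```python
# inside a custom hook method, assuming `runner` is available as described above
metrics = runner.states['metrics']   # dictionary of metric objects keyed by name
loss_metric = metrics.get('Loss')    # 'Loss' is a hypothetical metric name
if loss_metric is not None:
    print(loss_metric)
```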

@ -0,0 +1,85 @@
# Engine and trainer
## Engine
To better understand how our `Engine` class works, we first need to understand the concept of the process function in common engines. The process function controls the behavior over one batch of a
dataset, and the `Engine` class controls exactly this process function. A standard process function example is given in the code block below.
```python
def process_function(dataloader, model, criterion, optim):
optim.zero_grad()
data, label = next(dataloader)
output = model(data)
loss = criterion(output, label)
loss.backward()
optim.step()
```
In `ignite.engine` and `keras.engine`, the process function is always provided by the user. However, it is hard for users to write a process function for pipeline parallelism. To offer users accessible
hybrid parallelism, we provide the powerful `Engine` class, which supports pipeline parallelism and offers a one-forward-one-backward non-interleaving strategy. The `Engine` class can also use a
pre-defined learning rate scheduler to adjust the learning rate during training.
To build your engine, you only need to define the variables `model`, `criterion`, `optimizer`, `lr_scheduler` and `schedule`. The code block below gives an example.
```python
import torch
import torch.nn as nn
import torchvision.models as models
import colossalai
model = models.resnet18()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model)
lr_scheduler = colossalai.nn.lr_scheduler.CosineAnnealingLR(optimizer, 1000)
schedule = colossalai.engine.schedule.NoPipelineSchedule()
MyEngine = Engine(
model=model,
criterion=criterion,
optimizer=optimizer,
lr_scheduler=lr_scheduler,
schedule=schedule
)
```
More information about this class can be found in the API references.
## Trainer
To learn how to customize a trainer that meets your needs, you first need to understand our `Trainer` class.
The `Trainer` class aims to make our framework more convenient for researchers and engineers to use. Instead of writing your own scripts, you can simply call the `Trainer` class to construct your trainer, as done in the code block below.
```python
MyTrainer = Trainer(MyEngine)
```
After that, you can use the `fit` method to train or evaluate your model. To make our `Trainer` class even more powerful, we have added a set of handy tools. For example, you can continuously monitor and
record the running states and performance of the model during training. These features are realized through hook functions. The `BaseHook` class we provide lets you run the hooks we provide, or hooks you
define yourself, at the points in time you specify. We have already defined some practical hooks, listed in the code block below; all you need to do is pick the ones that suit your needs. More information about this class can be found in the API references.
```python
hooks = [
dict(type='LogMetricByEpochHook'),
dict(type='LogTimingByEpochHook'),
dict(type='LogMemoryByEpochHook'),
dict(type='AccuracyHook'),
dict(type='LossHook'),
dict(type='TensorboardHook', log_dir='./tfb_logs'),
dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
]
```
The hooks above record metrics, the elapsed training time, memory usage and so on, and write them to the log after each epoch. Besides, they print the current loss and accuracy in real time so that users can monitor the performance of the model.
### Hooks
If you have customized needs, you can inherit our `BaseHook` class and add your own functions, or inherit our `MetricHook` class to write the metric collectors you need. These hook functions can be
executed at twelve different points in time. More information about these classes can be found in the API references.
### Metrics
You can provide the metrics you need by inheriting our `Metric` class, which should be used together with the `MetricHook` class. When you write your own metric hooks, please set the priority carefully to
make sure your hook is called before other hooks that require the metric results.
We have already defined some metric hooks for you; the metric objects are stored in `runner.states['metrics']` for your reference.

@ -1,18 +1,25 @@
# Zero Redundancy Optimizer and Zero Offload
# Zero Redundancy Optimizer and ZeRO offload
The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency.
The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning three
model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them.
By doing so, memory efficiency is boosted drastically compared to classic data parallelism while the computational granularity
and communication efficiency are retained.
1. **ZeRO Level 1**: The optimizer states (e.g., for [Adam optimizer](https://arxiv.org/abs/1412.6980), 32-bit weights, and the first, and second moment estimates) are partitioned across the processes, so that each process updates only its partition.
2. **ZeRO Level 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states.
3. **ZeRO Level 3**: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes.
1. **ZeRO Level 1**: The optimizer states (e.g., for [Adam optimizer](https://arxiv.org/abs/1412.6980), 32-bit weights, and the
first and second moment estimates) are partitioned across the processes, so that each process updates only its partition.
2. **ZeRO Level 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process
only stores the gradients corresponding to its partition of the optimizer states.
3. **ZeRO Level 3**: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and
partition them during the forward and backward passes.
## Getting Started
## Getting Started with ZeRO
Once you are training with ColossalAI, enabling ZeRO-3 offload is as simple as enabling it in your ColossalAI configuration! Below are a few examples of ZeRO-3 configurations.
If you are training models with ColossalAI, enabling ZeRO-3 offload is as simple as enabling it in your ColossalAI configuration!
Below are a few examples of ZeRO-3 configurations.
### Example ZeRO-3 Configurations
### Example of ZeRO-3 Configurations
Here we use ``Adam`` as the initial optimizer.
Here we use `Adam` as the initial optimizer.
1. Use ZeRO to partition the optimizer states (level 1), gradients (level 2), and parameters (level 3).
```python
@ -74,8 +81,8 @@ Here we use ``Adam`` as the initial optimizer.
)
```
Note that ``fp16`` is automatically enabled when using ZeRO.
Note that `fp16` is automatically enabled when using ZeRO.
### Training
Once you complete your configuration, just use `colossalai.initialize()` to initialize your training. All you need to do is to write your configuration.
Once you have completed your configuration, just use `colossalai.initialize()` to initialize your training.
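As a brief illustration of that last step, reusing the call shown in the getting started training script (nothing beyond that script is assumed here):
```python
import colossalai

# colossalai.initialize() reads the config file (including the zero section above) and
# returns the components used for training, as in the getting started script
model, train_dataloader, test_dataloader, criterion, optimizer, schedule, lr_scheduler = colossalai.initialize()
```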

@ -0,0 +1,84 @@
# ZeRO optimizer and offload
The ZeRO optimizer partitions the three model states (optimizer states, gradients and parameters) and stores them across different processes, thereby reducing the memory redundancy of data parallelism
(classic data parallelism replicates all three states on every process). Compared with the classic approach, the ZeRO optimizer greatly improves memory efficiency while keeping good communication efficiency.
1. **ZeRO Level 1**: The optimizer states (e.g., for the [Adam optimizer](https://arxiv.org/abs/1412.6980), the 32-bit parameters and the estimates of the first and second moments) are partitioned across
different processes, so that each process only needs to update its own part of the parameters.
2. **ZeRO Level 2**: The 32-bit gradients used to update the model parameters are partitioned across different processes at this level. The partition of the gradients corresponds one-to-one to the
partition of the model parameters in level 1: the gradients stored on each process are exactly those used to update the model parameters kept on that process.
3. **ZeRO Level 3**: The 16-bit model parameters are partitioned across different processes at this level. ZeRO-3 can automatically collect and partition these parameters during the forward and backward passes.
## Using the ZeRO optimizer
To enable the ZeRO optimizer in ColossalAI, you only need to configure it in the configuration file. Below are some example configuration files using ZeRO-3.
### Using the ZeRO optimizer and offload
Here we use `Adam` as the initial optimizer.
1. Use ZeRO to partition the optimizer states (level 1), gradients (level 2) and model parameters (level 3):
```python
optimizer = dict(
type='Adam',
lr=0.001,
weight_decay=0
)
zero = dict(
type='ZeroRedundancyOptimizer_Level_3',
dynamic_loss_scale=True,
clip_grad=1.0
)
```
2. Offload the optimizer states and computation to the CPU:
```python
zero = dict(
offload_optimizer_config=dict(
device='cpu',
pin_memory=True,
fast_init=True
),
...
)
```
3. Offload the model parameters to the CPU to save GPU memory:
```python
zero = dict(
offload_optimizer_config=dict(
device='cpu',
pin_memory=True,
fast_init=True
),
offload_param_config=dict(
device='cpu',
pin_memory=True,
max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU
),
...
)
```
4. Offload the parameters to NVMe to save even more GPU memory (if NVMe is installed on your system):
```python
zero = dict(
offload_optimizer_config=dict(
device='nvme',
pin_memory=True,
fast_init=True,
nvme_path='/nvme_data'
),
offload_param_config=dict(
device='nvme',
pin_memory=True,
max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU,
nvme_path='/nvme_data'
),
...
)
```
请注意使用ZeRO时`fp16`将会被自动激活。
### 使用ZeRO优化器进行训练
如果您完成了上述配置,可以运行`colossalai.initialize()`来开始您的训练。