removed tutorial markdown and refreshed rst files for consistency

pull/165/head
Frank Lee 2022-01-19 16:06:53 +08:00 committed by ver217
parent ca4ae52d6b
commit be85a0f366
102 changed files with 456 additions and 1970 deletions

View File

@ -1,5 +1,7 @@
# Colossal-AI
![logo](./docs/images/Colossal-AI_logo.png)
<div align="center">
<h3> <a href="https://arxiv.org/abs/2110.14883"> Paper </a> | <a href="https://www.colossalai.org/"> Documentation </a> | <a href="https://github.com/hpcaitech/ColossalAI/discussions"> Forum </a> | <a href="https://medium.com/@hpcaitech"> Blog </a></h3>
</div>
@ -33,7 +35,6 @@ Install and enable CUDA kernel fusion (compulsory installation when using fused
pip install -v --no-cache-dir --global-option="--cuda_ext" .
```
## Use Docker
Run the following command to build a docker image from the Dockerfile provided.
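For instance, assuming the Dockerfile sits at the repository root (adjust the path if it lives in a subdirectory), a typical invocation would be:

```shell
docker build -t colossalai .
```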
@ -71,18 +72,18 @@ colossalai.launch(
)
# build your model
model = ...

# build your dataset, the dataloader will have distributed data
# sampler by default
train_dataset = ...
train_dataloader = get_dataloader(dataset=train_dataset,
                                  shuffle=True
                                  )

# build your optimizer
optimizer = ...

# build your loss function
criterion = ...
@ -137,13 +138,15 @@ Colossal-AI provides a collection of parallel training components for you. We ai
distributed deep learning models just like how you write your single-GPU model. We provide friendly tools to kickstart
distributed training in a few lines.
- [Data Parallelism](./docs/parallelization.md)
- [Pipeline Parallelism](./docs/parallelization.md)
- [1D, 2D, 2.5D, 3D and sequence parallelism](./docs/parallelization.md)
- [Friendly trainer and engine](./docs/trainer_engine.md)
- [Extensible for new parallelism](./docs/add_your_parallel.md)
- [Mixed Precision Training](./docs/amp.md)
- [Zero Redundancy Optimizer (ZeRO)](./docs/zero.md)
- Data Parallelism
- Pipeline Parallelism
- 1D, 2D, 2.5D, 3D and sequence parallelism
- Friendly trainer and engine
- Extensible for new parallelism
- Mixed Precision Training
- Zero Redundancy Optimizer (ZeRO)
Please visit our [documentation and tutorials](https://www.colossalai.org/) for more details.
## Cite Us

View File

@ -1,113 +0,0 @@
# Add your own parallelism
## Overview
To enable researchers and engineers to extend our system to novel large-scale distributed training algorithms
with less effort, we have decoupled the various components of the training lifecycle. You can implement your own
parallelism simply by inheriting from the base class.
The main components are:
1. `ProcessGroupInitializer`
2. `GradientHandler`
3. `Schedule`
## Process Group Initializer
Parallelism is often managed by process groups, where processes involved in the same parallel algorithm are placed in the same
process group. Different parallel algorithms require different process groups to be created. Colossal-AI provides a
global context for users to easily manage their process groups. If you wish to add a new process group, you can easily
define a new class and set it in your configuration file. To define your own way of creating process groups, you can
follow the steps below to create a new distributed initialization.
1. Add your parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
```python
class ParallelMode(Enum):
    GLOBAL = 'global'
    DATA = 'data'
    PIPELINE = 'pipe'
    ...
    NEW_MODE = 'new_mode'  # define your mode here
```
2. Create a `ProcessGroupInitializer`. You can refer to the examples given in `colossalai.context.dist_group_initializer`. The
first six arguments are fixed; `ParallelContext` will pass these arguments in for you. If you need to set other
arguments, you can append them, like `arg1` and `arg2` in the example below. Lastly, register your initializer to the
registry by adding the decorator `@DIST_GROUP_INITIALIZER.register_module`.
```python
# sample initializer class
@DIST_GROUP_INITIALIZER.register_module
class MyParallelInitializer(ProcessGroupInitializer):

    def __init__(self,
                 rank: int,
                 world_size: int,
                 config: Config,
                 data_parallel_size: int,
                 pipeline_parallel_size: int,
                 tensor_parallel_size: int,
                 arg1,
                 arg2):
        super().__init__(rank, world_size, config)
        self.arg1 = arg1
        self.arg2 = arg2
        # ... your variable init

    def init_parallel_groups(self):
        # initialize your process groups
        pass
```
Then, you can insert your new initializer into the current mode-to-initializer mapping
in `colossalai.constants.INITIALIZER_MAPPING`. You can modify the file or insert a new key-value pair dynamically.
```python
colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
```
3. Set your initializer in your config file. You can pass in your own arguments, if any. This allows
the `ParallelContext` to create your initializer and initialize your desired process groups.
```python
parallel = dict(
pipeline=dict(size=1),
tensor=dict(size=x, mode='new_mode') # this is where you enable your new parallel mode
)
```
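Once the process groups are created, you can query them through the global context. The snippet below is a minimal sketch, assuming the new mode was registered as `ParallelMode.NEW_MODE` in step 1:

```python
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc

# inspect the process group created by MyParallelInitializer
local_rank = gpc.get_local_rank(ParallelMode.NEW_MODE)
world_size = gpc.get_world_size(ParallelMode.NEW_MODE)
```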
## Gradient Handler
Gradient handlers are objects that execute all-reduce operations on parameters' gradients. As different all-reduce
strategies may be executed for different kinds of parallelism, users can
inherit `colossalai.engine.gradient_handler.BaseGradientHandler` to implement their strategies. Currently, the library
uses the plain data parallel gradient handler, which all-reduces the gradients across data parallel ranks. The data
parallel gradient handler is added to the engine automatically if data parallelism is detected. You can add your own
gradient handler as below:
```python
import torch.distributed as dist
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine import BaseGradientHandler

@GRADIENT_HANDLER.register_module
class YourGradientHandler(BaseGradientHandler):
    def handle_gradient(self):
        # e.g. all-reduce the gradients of the wrapped model across all ranks
        for param in self._model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad)
```
Afterwards, you can specify the gradient handler you want to use in your configuration file.
```python
gradient_handlers = [
dict(type='YourGradientHandler'),
]
```
## Schedule
A schedule entails how to execute the forward and backward passes. Currently, Colossal-AI provides pipeline and non-pipeline
schedules. If you want to modify how the forward and backward passes are executed, you can
inherit `colossalai.engine.schedule.BaseSchedule` and implement the `forward_backward_step` function, as sketched below.
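Below is a minimal sketch of a custom schedule, assuming `BaseSchedule` provides a `load_batch` helper and that the engine exposes `criterion` and `backward` as in the trainer examples; the exact signature of `forward_backward_step` may differ between versions.

```python
from colossalai.engine.schedule import BaseSchedule

class MySchedule(BaseSchedule):

    def forward_backward_step(self, engine, data_iter, forward_only=False, return_loss=True):
        # fetch one batch (load_batch is assumed to be provided by BaseSchedule)
        data, label = self.load_batch(data_iter)

        # forward pass through the engine-wrapped model
        output = engine(data)
        loss = engine.criterion(output, label)

        # backward pass, skipped during evaluation
        if not forward_only:
            engine.backward(loss)

        return output, label, loss
```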

View File

@ -1,92 +0,0 @@
# Add your own parallelism
To make it easier for researchers and engineers to extend our system to new large-scale distributed training algorithms, we have decoupled the components of the training lifecycle. You can implement your own parallelism simply by inheriting from the base classes.
The main components are:
1. `ProcessGroupInitializer`
2. `GradientHandler`
3. `Schedule`
## Process Group Initializer
Parallelism is generally managed by process groups: processes participating in the same parallel algorithm are placed in the same process group, and if several different parallelism techniques coexist in a system, multiple process groups need to be created. Colossal-AI provides a global context for users to manage their process groups conveniently. If you wish to add a new process group, you can define a new class and set it in your configuration file. The steps below show how to add a new parallelism technique to the system and initialize it.
1. Add your new parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
```python
class ParallelMode(Enum):
    GLOBAL = 'global'
    DATA = 'data'
    PIPELINE = 'pipe'
    ...
    NEW_MODE = 'new_mode'  # define your mode here
```
2. Create a subclass of `ProcessGroupInitializer`; you can refer to the examples given in `colossalai.context.dist_group_initializer`. The first six arguments are determined by `ParallelContext`. If you need to set new arguments, replace `arg1` and `arg2` in the example below with your own. Lastly, register your initializer to our registry with the `@DIST_GROUP_INITIALIZER.register_module` decorator.
```python
# sample initializer class
@DIST_GROUP_INITIALIZER.register_module
class MyParallelInitializer(ProcessGroupInitializer):

    def __init__(self,
                 rank: int,
                 world_size: int,
                 config: Config,
                 data_parallel_size: int,
                 pipeline_parallel_size: int,
                 tensor_parallel_size: int,
                 arg1,
                 arg2):
        super().__init__(rank, world_size, config)
        self.arg1 = arg1
        self.arg2 = arg2
        # ... your variable init

    def init_parallel_groups(self):
        # initialize your process groups
        pass
```
After that, you can insert your new initializer into the current mode-to-initializer mapping `colossalai.constants.INITIALIZER_MAPPING`; you can also modify that file to change the mapping between names and parallel modes dynamically.
```python
colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
```
3. Set your initializer in the configuration file. If your initializer requires arguments, you can pass them in yourself. The code below lets `ParallelContext` create your initializer and set up the process groups you need.
```python
parallel = dict(
pipeline=dict(size=1),
tensor=dict(size=x, mode='new_mode') # this is where you enable your new parallel mode
)
```
## Gradient Handler
A gradient handler performs all-reduce operations on the gradients of model parameters. Since different parallelism techniques may require different all-reduce strategies, users can inherit `colossalai.engine.gradient_handler.BaseGradientHandler` to implement their own. Currently, Colossal-AI uses the plain data parallel gradient handler, which performs all-reduce across all data parallel ranks; it is created automatically when Colossal-AI detects that data parallelism is in use. You can add your custom gradient handler with the code below:
```python
import torch.distributed as dist
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine import BaseGradientHandler

@GRADIENT_HANDLER.register_module
class YourGradientHandler(BaseGradientHandler):
    def handle_gradient(self):
        # e.g. all-reduce the gradients of the wrapped model across all ranks
        for param in self._model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad)
```
After that, you can specify the gradient handler you want to use in the configuration file.
```python
gradient_handlers = [
dict(type='YourGradientHandler'),
]
```
## Schedule
The schedule specifies which operations to execute during the forward and backward passes. Colossal-AI provides pipeline and non-pipeline schedules. If you want to modify how the forward and backward passes are executed, you can inherit `colossalai.engine.schedule.BaseSchedule` and implement the operations you want. You can also add your schedule to our engine before training the model.

View File

@ -1,102 +0,0 @@
# Mixed precision training
In Colossal-AI, we have incorporated different implementations of mixed precision training:
1. torch.cuda.amp
2. apex.amp
3. naive amp
The first two rely on the original implementation of [PyTorch](https://pytorch.org/docs/stable/amp.html)
(version 1.6 and above) and [Nvidia Apex](https://github.com/NVIDIA/apex). The last method is similar to Apex O2 level.
Among these methods, apex.amp is not compatible with tensor parallelism. This is because tensors are split across devices
in tensor parallelism, so communication among different processes is required to check whether `inf` or `nan` occurs in the
whole model's weights. **We modified the torch amp implementation so that it is now compatible with tensor parallelism.**
To use mixed precision training, you can simply set the `fp16` field in the config file. Currently, PyTorch and
Apex amp cannot be guaranteed to work with tensor and pipeline parallelism. We recommend you use torch amp as it generally
gives better accuracy than naive amp.
The AMP module is designed to be completely modular and can be used independently of other colossalai modules.
If you wish to use only amp in your code base without `colossalai.initialize`, you can use `colossalai.amp.convert_to_amp`.
```python
from colossalai.amp import AMP_TYPE
# example of using torch amp
model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
optimizer,
criterion,
AMP_TYPE.TORCH)
```
## PyTorch AMP
PyTorch provides mixed precision training in version 1.6 and above. It provides an easy way to cast data to `fp16` format
while keeping some operations such as reductions in `fp32`. You can configure the gradient scaler in the config file.
```python
from colossalai.amp import AMP_TYPE
fp16=dict(
mode=AMP_TYPE.TORCH,
# below are default values for grad scaler
init_scale=2.**16,
growth_factor=2.0,
backoff_factor=0.5,
growth_interval=2000,
enabled=True
)
```
## Apex AMP
For this mode, we rely on the [Apex](https://nvidia.github.io/apex/) implementation for mixed precision training. We support
this plugin because it allows finer-grained control over mixed precision. For example, the `O2` level (optimization level 2)
will keep batch normalization in `fp32`.
The following code block shows a config file for Apex AMP.
```python
from colossalai.amp import AMP_TYPE
fp16 = dict(
mode=AMP_TYPE.APEX,
# below are the default values
enabled=True,
opt_level='O1',
cast_model_type=None,
patch_torch_functions=None,
keep_batchnorm_fp32=None,
master_weights=None,
loss_scale=None,
cast_model_outputs=None,
num_losses=1,
verbosity=1,
min_loss_scale=None,
max_loss_scale=16777216.0
)
```
## Naive AMP
We leveraged the Megatron-LM implementation to achieve mixed precision training while maintaining compatibility with complex tensor
and pipeline parallelism. This AMP mode will cast all operations into fp16.
The following code block shows a config file for this mode.
```python
from colossalai.amp import AMP_TYPE
fp16 = dict(
mode=AMP_TYPE.NAIVE,
# below are the default values
clip_grad=0,
log_num_zeros_in_grad=False,
initial_scale=2 ** 32,
min_scale=1,
growth_factor=2,
backoff_factor=0.5,
growth_interval=1000,
hysteresis=2
)
```

View File

@ -1,74 +0,0 @@
# Mixed precision training
Colossal-AI supports the following three ways of mixed precision training:
1. torch.cuda.amp
2. apex.amp
3. tensor-parallel AMP
The first two rely on the native implementation of [PyTorch](https://pytorch.org/docs/stable/amp.html) (version 1.6 and above) and [Nvidia Apex](https://github.com/NVIDIA/apex). However, these two methods are not compatible with tensor parallelism, because in tensor parallelism tensors are split and stored on different devices; making mixed precision training compatible with tensor parallelism therefore requires communication among different processes to check whether `inf` or `nan` appears in the model parameters. For this reason, we adopt the implementation from [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
You can enable mixed precision training simply by setting the `fp16` field in the configuration file. Currently, PyTorch and Apex amp are not guaranteed to be compatible with tensor and pipeline parallelism, so we recommend the last of the three ways of mixed precision training.
## PyTorch AMP
PyTorch provides mixed precision training in version 1.6 and above. It converts data into the `fp16` format while keeping some operations in `fp32`. You can configure it in the configuration file.
```python
from colossalai.engine import AMP_TYPE
fp16=dict(
mode=AMP_TYPE.TORCH,
# below are default values for grad scaler
init_scale=2.**16,
growth_factor=2.0,
backoff_factor=0.5,
growth_interval=2000,
enabled=True
)
```
## Apex AMP
We use the mixed precision training from [Apex](https://nvidia.github.io/apex/) because this mode provides fine-grained control over mixed precision. For example, the `O2` level (optimization level 2) keeps batch normalization in `fp32`. The code block below shows a configuration file for Apex AMP.
```python
from colossalai.engine import AMP_TYPE
fp16 = dict(
mode=AMP_TYPE.APEX,
# below are the default values
enabled=True,
opt_level='O1',
cast_model_type=None,
patch_torch_functions=None,
keep_batchnorm_fp32=None,
master_weights=None,
loss_scale=None,
cast_model_outputs=None,
num_losses=1,
verbosity=1,
min_loss_scale=None,
max_loss_scale=16777216.0
)
```
## Tensor-parallel AMP
We borrowed the mixed precision training implementation from Megatron-LM; it is compatible with tensor and pipeline parallelism. The code block below shows a configuration file for tensor-parallel AMP.
```python
from colossalai.engine import AMP_TYPE
fp16 = dict(
mode=AMP_TYPE.PARALLEL,
# below are the default values
clip_grad=0,
log_num_zeros_in_grad=False,
initial_scale=2 ** 32,
min_scale=1,
growth_factor=2,
backoff_factor=0.5,
growth_interval=1000,
hysteresis=2
)
```

View File

@ -0,0 +1,5 @@
colossalai.amp.amp\_type
========================
.. automodule:: colossalai.amp.amp_type
:members:

View File

@ -0,0 +1,5 @@
colossalai.amp.apex\_amp.apex\_amp
==================================
.. automodule:: colossalai.amp.apex_amp.apex_amp
:members:

View File

@ -1,5 +1,11 @@
colossalai.amp.apex\_amp
==========================
========================
.. automodule:: colossalai.amp.apex_amp
:members:
.. toctree::
:maxdepth: 2
colossalai.amp.apex_amp.apex_amp

View File

@ -0,0 +1,5 @@
colossalai.amp.naive\_amp.naive\_amp
====================================
.. automodule:: colossalai.amp.naive_amp.naive_amp
:members:

View File

@ -1,5 +1,11 @@
colossalai.amp.naive\_amp
==========================
=========================
.. automodule:: colossalai.amp.naive_amp
:members:
.. toctree::
:maxdepth: 2
colossalai.amp.naive_amp.naive_amp

View File

@ -1,13 +1,18 @@
colossalai.amp
==================
==============
.. automodule:: colossalai.amp
:members:
.. toctree::
:maxdepth: 2
colossalai.amp.torch_amp
colossalai.amp.apex_amp
colossalai.amp.naive_amp
colossalai.amp.torch_amp
.. automodule:: colossalai.amp
:members:
.. toctree::
:maxdepth: 2
colossalai.amp.amp_type

View File

@ -1,5 +1,11 @@
colossalai.amp.torch\_amp
==========================
=========================
.. automodule:: colossalai.amp.torch_amp
:members:
:members:
.. toctree::
:maxdepth: 2
colossalai.amp.torch_amp.torch_amp

View File

@ -0,0 +1,5 @@
colossalai.amp.torch\_amp.torch\_amp
====================================
.. automodule:: colossalai.amp.torch_amp.torch_amp
:members:

View File

@ -1,12 +1,12 @@
colossalai.builder
==================
.. automodule:: colossalai.builder
:members:
.. toctree::
:maxdepth: 2
colossalai.builder.builder
colossalai.builder.pipeline
.. automodule:: colossalai.builder
:members:

View File

@ -1,6 +1,10 @@
colossalai.communication
========================
.. automodule:: colossalai.communication
:members:
.. toctree::
:maxdepth: 2
@ -8,7 +12,3 @@ colossalai.communication
colossalai.communication.p2p
colossalai.communication.ring
colossalai.communication.utils
.. automodule:: colossalai.communication
:members:

View File

@ -1,5 +0,0 @@
colossalai.constants
====================
.. automodule:: colossalai.constants
:members:

View File

@ -0,0 +1,5 @@
colossalai.context.process\_group\_initializer.initializer\_model
=================================================================
.. automodule:: colossalai.context.process_group_initializer.initializer_model
:members:

View File

@ -0,0 +1,5 @@
colossalai.context.process\_group\_initializer.initializer\_moe
===============================================================
.. automodule:: colossalai.context.process_group_initializer.initializer_moe
:members:

View File

@ -13,6 +13,8 @@ colossalai.context.process\_group\_initializer
colossalai.context.process_group_initializer.initializer_2p5d
colossalai.context.process_group_initializer.initializer_3d
colossalai.context.process_group_initializer.initializer_data
colossalai.context.process_group_initializer.initializer_model
colossalai.context.process_group_initializer.initializer_moe
colossalai.context.process_group_initializer.initializer_pipeline
colossalai.context.process_group_initializer.initializer_sequence
colossalai.context.process_group_initializer.initializer_tensor

View File

@ -1,11 +1,11 @@
colossalai.context.random
=========================
.. automodule:: colossalai.context.random
:members:
.. toctree::
:maxdepth: 2
colossalai.context.random.seed_manager
.. automodule:: colossalai.context.random
:members:

View File

@ -1,6 +1,9 @@
colossalai.context
==================
.. automodule:: colossalai.context
:members:
.. toctree::
:maxdepth: 2
@ -14,7 +17,3 @@ colossalai.context
colossalai.context.config
colossalai.context.parallel_context
colossalai.context.parallel_mode
.. automodule:: colossalai.context
:members:

View File

@ -1,5 +0,0 @@
colossalai.core
===============
.. automodule:: colossalai.core
:members:

View File

@ -1,12 +1,11 @@
colossalai.engine
=================
.. automodule:: colossalai.engine
:members:
.. toctree::
:maxdepth: 2
colossalai.engine.gradient_handler
colossalai.engine.schedule
.. automodule:: colossalai.engine
:members:

View File

@ -0,0 +1,5 @@
colossalai.kernel.cuda\_native.layer\_norm
==========================================
.. automodule:: colossalai.kernel.cuda_native.layer_norm
:members:

View File

@ -0,0 +1,5 @@
colossalai.kernel.cuda\_native.multihead\_attention
===================================================
.. automodule:: colossalai.kernel.cuda_native.multihead_attention
:members:

View File

@ -0,0 +1,13 @@
colossalai.kernel.cuda\_native
==============================
.. automodule:: colossalai.kernel.cuda_native
:members:
.. toctree::
:maxdepth: 2
colossalai.kernel.cuda_native.layer_norm
colossalai.kernel.cuda_native.multihead_attention
colossalai.kernel.cuda_native.scaled_softmax

View File

@ -0,0 +1,5 @@
colossalai.kernel.cuda\_native.scaled\_softmax
==============================================
.. automodule:: colossalai.kernel.cuda_native.scaled_softmax
:members:

View File

@ -0,0 +1,5 @@
colossalai.kernel.jit.bias\_dropout\_add
========================================
.. automodule:: colossalai.kernel.jit.bias_dropout_add
:members:

View File

@ -0,0 +1,5 @@
colossalai.kernel.jit.bias\_gelu
================================
.. automodule:: colossalai.kernel.jit.bias_gelu
:members:

View File

@ -0,0 +1,5 @@
colossalai.kernel.jit.option
============================
.. automodule:: colossalai.kernel.jit.option
:members:

View File

@ -0,0 +1,13 @@
colossalai.kernel.jit
=====================
.. automodule:: colossalai.kernel.jit
:members:
.. toctree::
:maxdepth: 2
colossalai.kernel.jit.bias_dropout_add
colossalai.kernel.jit.bias_gelu
colossalai.kernel.jit.option

View File

@ -0,0 +1,11 @@
colossalai.kernel
=================
.. automodule:: colossalai.kernel
:members:
.. toctree::
:maxdepth: 2
colossalai.kernel.cuda_native
colossalai.kernel.jit

View File

@ -1,11 +1,11 @@
colossalai.logging
==================
.. automodule:: colossalai.logging
:members:
.. toctree::
:maxdepth: 2
colossalai.logging.logging
.. automodule:: colossalai.logging
:members:

View File

@ -0,0 +1,5 @@
colossalai.nn.init
==================
.. automodule:: colossalai.nn.init
:members:

View File

@ -0,0 +1,5 @@
colossalai.nn.layer.colossalai\_layer.dropout
=============================================
.. automodule:: colossalai.nn.layer.colossalai_layer.dropout
:members:

View File

@ -0,0 +1,5 @@
colossalai.nn.layer.colossalai\_layer.embedding
===============================================
.. automodule:: colossalai.nn.layer.colossalai_layer.embedding
:members:

View File

@ -0,0 +1,5 @@
colossalai.nn.layer.colossalai\_layer.linear
============================================
.. automodule:: colossalai.nn.layer.colossalai_layer.linear
:members:

View File

@ -0,0 +1,5 @@
colossalai.nn.layer.colossalai\_layer.normalization
===================================================
.. automodule:: colossalai.nn.layer.colossalai_layer.normalization
:members:

View File

@ -0,0 +1,14 @@
colossalai.nn.layer.colossalai\_layer
=====================================
.. automodule:: colossalai.nn.layer.colossalai_layer
:members:
.. toctree::
:maxdepth: 2
colossalai.nn.layer.colossalai_layer.dropout
colossalai.nn.layer.colossalai_layer.embedding
colossalai.nn.layer.colossalai_layer.linear
colossalai.nn.layer.colossalai_layer.normalization

View File

@ -0,0 +1,5 @@
colossalai.nn.layer.moe.layers
==============================
.. automodule:: colossalai.nn.layer.moe.layers
:members:

View File

@ -0,0 +1,11 @@
colossalai.nn.layer.moe
=======================
.. automodule:: colossalai.nn.layer.moe
:members:
.. toctree::
:maxdepth: 2
colossalai.nn.layer.moe.layers

View File

@ -1,5 +0,0 @@
colossalai.nn.layer.non\_parallel\_layers
======================================
.. automodule:: colossalai.nn.layer.non_parallel_layers
:members:

View File

@ -1,11 +1,11 @@
colossalai.nn.layer.parallel\_1d
================================
.. automodule:: colossalai.nn.layer.parallel_1d
:members:
.. toctree::
:maxdepth: 2
colossalai.nn.layer.parallel_1d.layers
.. automodule:: colossalai.nn.layer.parallel_1d
:members:

View File

@ -1,11 +1,11 @@
colossalai.nn.layer.parallel\_2d
================================
.. automodule:: colossalai.nn.layer.parallel_2d
:members:
.. toctree::
:maxdepth: 2
colossalai.nn.layer.parallel_2d.layers
.. automodule:: colossalai.nn.layer.parallel_2d
:members:

View File

@ -1,11 +1,11 @@
colossalai.nn.layer.parallel\_2p5d
==================================
.. automodule:: colossalai.nn.layer.parallel_2p5d
:members:
.. toctree::
:maxdepth: 2
colossalai.nn.layer.parallel_2p5d.layers
.. automodule:: colossalai.nn.layer.parallel_2p5d
:members:

View File

@ -1,11 +1,11 @@
colossalai.nn.layer.parallel\_3d
================================
.. automodule:: colossalai.nn.layer.parallel_3d
:members:
.. toctree::
:maxdepth: 2
colossalai.nn.layer.parallel_3d.layers
.. automodule:: colossalai.nn.layer.parallel_3d
:members:

View File

@ -1,11 +1,11 @@
colossalai.nn.layer.parallel\_sequence
======================================
.. automodule:: colossalai.nn.layer.parallel_sequence
:members:
.. toctree::
:maxdepth: 2
colossalai.nn.layer.parallel_sequence.layers
.. automodule:: colossalai.nn.layer.parallel_sequence
:members:

View File

@ -1,18 +1,25 @@
colossalai.nn.layer
===================
.. automodule:: colossalai.nn.layer
:members:
.. toctree::
:maxdepth: 2
colossalai.nn.layer.colossalai_layer
colossalai.nn.layer.moe
colossalai.nn.layer.parallel_1d
colossalai.nn.layer.parallel_2d
colossalai.nn.layer.parallel_2p5d
colossalai.nn.layer.parallel_3d
colossalai.nn.layer.parallel_sequence
colossalai.nn.layer.non_parallel_layers
colossalai.nn.layer.utils
colossalai.nn.layer.vanilla
colossalai.nn.layer.wrapper
.. toctree::
:maxdepth: 2
colossalai.nn.layer.base_layer
.. automodule:: colossalai.nn.layer
:members:

View File

@ -0,0 +1,5 @@
colossalai.nn.layer.utils.common
================================
.. automodule:: colossalai.nn.layer.utils.common
:members:

View File

@ -0,0 +1,11 @@
colossalai.nn.layer.utils
=========================
.. automodule:: colossalai.nn.layer.utils
:members:
.. toctree::
:maxdepth: 2
colossalai.nn.layer.utils.common

View File

@ -0,0 +1,5 @@
colossalai.nn.layer.vanilla.layers
==================================
.. automodule:: colossalai.nn.layer.vanilla.layers
:members:

View File

@ -0,0 +1,11 @@
colossalai.nn.layer.vanilla
===========================
.. automodule:: colossalai.nn.layer.vanilla
:members:
.. toctree::
:maxdepth: 2
colossalai.nn.layer.vanilla.layers

View File

@ -0,0 +1,5 @@
colossalai.nn.layer.wrapper.pipeline\_wrapper
=============================================
.. automodule:: colossalai.nn.layer.wrapper.pipeline_wrapper
:members:

View File

@ -9,3 +9,4 @@ colossalai.nn.layer.wrapper
:maxdepth: 2
colossalai.nn.layer.wrapper.lambda_wrapper
colossalai.nn.layer.wrapper.pipeline_wrapper

View File

@ -1,5 +0,0 @@
colossalai.nn.loss.cross\_entropy\_2d
=====================================
.. automodule:: colossalai.nn.loss.cross_entropy_2d
:members:

View File

@ -1,5 +0,0 @@
colossalai.nn.loss.cross\_entropy\_2p5d
=======================================
.. automodule:: colossalai.nn.loss.cross_entropy_2p5d
:members:

View File

@ -1,5 +0,0 @@
colossalai.nn.loss.cross\_entropy\_3d
=====================================
.. automodule:: colossalai.nn.loss.cross_entropy_3d
:members:

View File

@ -0,0 +1,5 @@
colossalai.nn.loss.loss\_2d
===========================
.. automodule:: colossalai.nn.loss.loss_2d
:members:

View File

@ -0,0 +1,5 @@
colossalai.nn.loss.loss\_2p5d
=============================
.. automodule:: colossalai.nn.loss.loss_2p5d
:members:

View File

@ -0,0 +1,5 @@
colossalai.nn.loss.loss\_3d
===========================
.. automodule:: colossalai.nn.loss.loss_3d
:members:

View File

@ -0,0 +1,5 @@
colossalai.nn.loss.loss\_moe
============================
.. automodule:: colossalai.nn.loss.loss_moe
:members:

View File

@ -1,13 +1,14 @@
colossalai.nn.loss
==================
.. automodule:: colossalai.nn.loss
:members:
.. toctree::
:maxdepth: 2
colossalai.nn.loss.cross_entropy_2d
colossalai.nn.loss.cross_entropy_2p5d
colossalai.nn.loss.cross_entropy_3d
.. automodule:: colossalai.nn.loss
:members:
colossalai.nn.loss.loss_2d
colossalai.nn.loss.loss_2p5d
colossalai.nn.loss.loss_3d
colossalai.nn.loss.loss_moe

View File

@ -1,6 +1,10 @@
colossalai.nn.lr\_scheduler
===========================
.. automodule:: colossalai.nn.lr_scheduler
:members:
.. toctree::
:maxdepth: 2
@ -11,7 +15,3 @@ colossalai.nn.lr\_scheduler
colossalai.nn.lr_scheduler.onecycle
colossalai.nn.lr_scheduler.poly
colossalai.nn.lr_scheduler.torch
.. automodule:: colossalai.nn.lr_scheduler
:members:

View File

@ -0,0 +1,5 @@
colossalai.nn.metric.accuracy\_2d
=================================
.. automodule:: colossalai.nn.metric.accuracy_2d
:members:

View File

@ -0,0 +1,5 @@
colossalai.nn.metric.accuracy\_2p5d
===================================
.. automodule:: colossalai.nn.metric.accuracy_2p5d
:members:

View File

@ -0,0 +1,5 @@
colossalai.nn.metric.accuracy\_3d
=================================
.. automodule:: colossalai.nn.metric.accuracy_3d
:members:

View File

@ -0,0 +1,13 @@
colossalai.nn.metric
====================
.. automodule:: colossalai.nn.metric
:members:
.. toctree::
:maxdepth: 2
colossalai.nn.metric.accuracy_2d
colossalai.nn.metric.accuracy_2p5d
colossalai.nn.metric.accuracy_3d

View File

@ -1,5 +1,5 @@
colossalai.nn.model.model\_from\_config
===============================
=======================================
.. automodule:: colossalai.nn.model.model_from_config
:members:

View File

@ -1,6 +1,10 @@
colossalai.nn.model
===================
.. automodule:: colossalai.nn.model
:members:
.. toctree::
:maxdepth: 2

View File

@ -0,0 +1,5 @@
colossalai.nn.optimizer.colossalai\_optimizer
=============================================
.. automodule:: colossalai.nn.optimizer.colossalai_optimizer
:members:

View File

@ -1,15 +1,16 @@
colossalai.nn.optimizer
=======================
.. automodule:: colossalai.nn.optimizer
:members:
.. toctree::
:maxdepth: 2
colossalai.nn.optimizer.colossalai_optimizer
colossalai.nn.optimizer.fused_adam
colossalai.nn.optimizer.fused_lamb
colossalai.nn.optimizer.fused_sgd
colossalai.nn.optimizer.lamb
colossalai.nn.optimizer.lars
.. automodule:: colossalai.nn.optimizer
:members:

View File

@ -1,15 +1,21 @@
colossalai.nn
=============
.. automodule:: colossalai.nn
:members:
.. toctree::
:maxdepth: 2
colossalai.nn.layer
colossalai.nn.loss
colossalai.nn.lr_scheduler
colossalai.nn.metric
colossalai.nn.model
colossalai.nn.optimizer
.. automodule:: colossalai.nn
:members:
.. toctree::
:maxdepth: 2
colossalai.nn.init

View File

@ -1,11 +1,11 @@
colossalai.registry
===================
.. automodule:: colossalai.registry
:members:
.. toctree::
:maxdepth: 2
colossalai.registry.registry
.. automodule:: colossalai.registry
:members:

View File

@ -1,13 +1,8 @@
colossalai
==========
.. toctree::
:maxdepth: 2
colossalai.constants
colossalai.core
colossalai.initialize
.. automodule:: colossalai
:members:
.. toctree::
:maxdepth: 2
@ -17,6 +12,7 @@ colossalai
colossalai.communication
colossalai.context
colossalai.engine
colossalai.kernel
colossalai.logging
colossalai.nn
colossalai.registry
@ -24,5 +20,8 @@ colossalai
colossalai.utils
colossalai.zero
.. automodule:: colossalai
:members:
.. toctree::
:maxdepth: 2
colossalai.initialize

View File

@ -1,5 +0,0 @@
colossalai.trainer.metric
=========================
.. automodule:: colossalai.trainer.metric
:members:

View File

@ -1,17 +1,10 @@
colossalai.trainer
==================
.. automodule:: colossalai.trainer
:members:
.. toctree::
:maxdepth: 2
colossalai.trainer.hooks
.. toctree::
:maxdepth: 2
colossalai.trainer.metric
.. automodule:: colossalai.trainer
:members:

View File

@ -0,0 +1,5 @@
colossalai.utils.data\_sampler.base\_sampler
============================================
.. automodule:: colossalai.utils.data_sampler.base_sampler
:members:

View File

@ -0,0 +1,5 @@
colossalai.utils.data\_sampler.data\_parallel\_sampler
======================================================
.. automodule:: colossalai.utils.data_sampler.data_parallel_sampler
:members:

View File

@ -1,5 +1,12 @@
colossalai.utils.data\_sampler
=======================================
==============================
.. automodule:: colossalai.utils.data_sampler
:members:
.. toctree::
:maxdepth: 2
colossalai.utils.data_sampler.base_sampler
colossalai.utils.data_sampler.data_parallel_sampler

View File

@ -0,0 +1,5 @@
colossalai.utils.multi\_tensor\_apply.multi\_tensor\_apply
==========================================================
.. automodule:: colossalai.utils.multi_tensor_apply.multi_tensor_apply
:members:

View File

@ -1,8 +1,11 @@
colossalai.nn.multi\_tensor\_apply
==================================
colossalai.utils.multi\_tensor\_apply
=====================================
.. automodule:: colossalai.utils.multi_tensor_apply.multi_tensor_apply
.. automodule:: colossalai.utils.multi_tensor_apply
:members:
.. toctree::
:maxdepth: 2
colossalai.utils.multi_tensor_apply.multi_tensor_apply

View File

@ -1,6 +1,17 @@
colossalai.utils
================
.. automodule:: colossalai.utils
:members:
.. toctree::
:maxdepth: 2
colossalai.utils.data_sampler
colossalai.utils.gradient_accumulation
colossalai.utils.multi_tensor_apply
.. toctree::
:maxdepth: 2
@ -8,12 +19,5 @@ colossalai.utils
colossalai.utils.checkpointing
colossalai.utils.common
colossalai.utils.cuda
colossalai.utils.data_sampler
colossalai.utils.gradient_accumulation
colossalai.utils.memory
colossalai.utils.multi_tensor_apply
colossalai.utils.timer
.. automodule:: colossalai.utils
:members:

View File

@ -0,0 +1,5 @@
colossalai.zero.loss\_scaler
============================
.. automodule:: colossalai.zero.loss_scaler
:members:

View File

@ -1,5 +1,13 @@
colossalai.zero
================
===============
.. automodule:: colossalai.zero
:members:
.. toctree::
:maxdepth: 2
colossalai.zero.loss_scaler
colossalai.zero.zero_redundancy_optimizer_level_2
colossalai.zero.zero_redundancy_optimizer_level_3

View File

@ -0,0 +1,5 @@
colossalai.zero.zero\_redundancy\_optimizer\_level\_2
=====================================================
.. automodule:: colossalai.zero.zero_redundancy_optimizer_level_2
:members:

View File

@ -0,0 +1,5 @@
colossalai.zero.zero\_redundancy\_optimizer\_level\_3
=====================================================
.. automodule:: colossalai.zero.zero_redundancy_optimizer_level_3
:members:

View File

@ -1,54 +0,0 @@
# Config file
Here is a config file example showing how to train a ViT model on the CIFAR10 dataset using Colossal-AI:
```python
# optional
# two keys: pipeline, tensor
# data parallel size is inferred
parallel = dict(
pipeline=dict(size=1),
tensor=dict(size=4, mode='2d'),
)
# optional
# configuration for mixed precision training
fp16 = dict(
mode=AMP_TYPE.NAIVE,
initial_scale=2 ** 8
)
# optional
# configuration for zero
# you can refer to the Zero Redundancy optimizer and zero offload section for details
# https://www.colossalai.org/zero.html
zero = dict(
level=<int>,
...
)
# optional
# if you are using complex gradient handling
# otherwise, you do not need this in your config file
# default gradient_handlers = None
gradient_handlers = [dict(type='MyHandler', arg1=1, arg2=2), ...]
# optional
# specify gradient accumulation size
# if your batch size is not large enough
gradient_accumulation = <int>
# optional
# add gradient clipping to your engine
# this config is not compatible with zero and AMP_TYPE.NAIVE
# but works with AMP_TYPE.TORCH and AMP_TYPE.APEX
# default clip_grad_norm = 0.0
clip_grad_norm = <float>
# optional
# cudnn setting
# defaults are shown below
cudnn_benchmark = False
cudnn_deterministic = True
```
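A config file like the one above is consumed when launching the distributed environment. The snippet below is a minimal sketch of how it could be used; the config path is a placeholder:

```python
import colossalai
from colossalai.core import global_context as gpc

parser = colossalai.get_default_parser()
args = parser.parse_args()

# launch with the config file shown above
colossalai.launch(config='./config.py',
                  rank=args.rank,
                  world_size=args.world_size,
                  host=args.host,
                  port=args.port)

# values defined in the file are then available on the global context
max_norm = gpc.config.clip_grad_norm
```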

View File

@ -1,187 +0,0 @@
# Config file
The example in the code block below shows how to train a ViT model on the CIFAR10 dataset using Colossal-AI.
```python
# build train_dataset and train_dataloader from this dictionary
# it is not compulsory in the config file; you can instead pass this dictionary as an argument to colossalai.initialize()
train_data = dict(
# dictionary for building Dataset
dataset=dict(
# the type CIFAR10Dataset has to be registered
type='CIFAR10Dataset',
root='/path/to/data',
# transform pipeline
transform_pipeline=[
dict(type='Resize', size=IMG_SIZE),
dict(type='RandomCrop', size=IMG_SIZE, padding=4),
dict(type='RandomHorizontalFlip'),
dict(type='ToTensor'),
dict(type='Normalize',
mean=[0.4914, 0.4822, 0.4465],
std=[0.2023, 0.1994, 0.2010]),
]
),
# dictionary for building Dataloader
dataloader=dict(
batch_size=BATCH_SIZE,
pin_memory=True,
# num_workers=1,
shuffle=True,
)
)
# build test_dataset and test_dataloader from this dictionary
test_data = dict(
dataset=dict(
type='CIFAR10Dataset',
root='/path/to/data',
train=False,
transform_pipeline=[
dict(type='Resize', size=IMG_SIZE),
dict(type='ToTensor'),
dict(type='Normalize',
mean=[0.4914, 0.4822, 0.4465],
std=[0.2023, 0.1994, 0.2010]
),
]
),
dataloader=dict(
batch_size=BATCH_SIZE,
pin_memory=True,
# num_workers=1,
)
)
# compulsory
# build optimizer from this dictionary
optimizer = dict(
# Available types: 'ZeroRedundancyOptimizer_Level_1', 'ZeroRedundancyOptimizer_Level_2', 'ZeroRedundancyOptimizer_Level_3'
# 'Adam', 'Lamb', 'SGD', 'FusedLAMB', 'FusedAdam', 'FusedSGD', 'FP16Optimizer'
type='Adam',
lr=0.001,
weight_decay=0
)
# compulsory
# build loss function from this dictionary
loss = dict(
# Available types:
# 'CrossEntropyLoss2D', 'CrossEntropyLoss2p5D', 'CrossEntropyLoss3D'
type='CrossEntropyLoss2D',
)
# compulsory
# build model from this dictionary
model = dict(
# available types: 'PretrainBERT', 'VanillaResNet', 'VisionTransformerFromConfig'
type='VisionTransformerFromConfig',
# each key-value pair below refers to a layer
# input data passes through these layers recursively
tensor_splitting_cfg=dict(
type='ViTInputSplitter2D',
),
embedding_cfg=dict(
type='ViTPatchEmbedding2D',
img_size=IMG_SIZE,
patch_size=PATCH_SIZE,
embed_dim=DIM,
),
token_fusion_cfg=dict(
type='ViTTokenFuser2D',
img_size=IMG_SIZE,
patch_size=PATCH_SIZE,
embed_dim=DIM,
drop_rate=0.1
),
norm_cfg=dict(
type='LayerNorm2D',
normalized_shape=DIM,
eps=1e-6,
),
block_cfg=dict(
# ViTBlock is a submodule
type='ViTBlock',
attention_cfg=dict(
type='ViTSelfAttention2D',
hidden_size=DIM,
num_attention_heads=NUM_ATTENTION_HEADS,
attention_dropout_prob=0.,
hidden_dropout_prob=0.1,
checkpoint=True
),
droppath_cfg=dict(
type='VanillaViTDropPath',
),
mlp_cfg=dict(
type='ViTMLP2D',
in_features=DIM,
dropout_prob=0.1,
mlp_ratio=4,
checkpoint=True
),
norm_cfg=dict(
type='LayerNorm2D',
normalized_shape=DIM,
eps=1e-6,
),
),
head_cfg=dict(
type='ViTHead2D',
hidden_size=DIM,
num_classes=NUM_CLASSES,
),
embed_dim=DIM,
depth=DEPTH,
drop_path_rate=0.,
)
# hooks are built when initializing trainer
# possible hooks: 'BaseHook', 'MetricHook','LoadCheckpointHook'
# 'SaveCheckpointHook','LossHook', 'AccuracyHook', 'Accuracy2DHook'
# 'LogMetricByEpochHook', 'TensorboardHook','LogTimingByEpochHook', 'LogMemoryByEpochHook'
hooks = [
dict(type='LogMetricByEpochHook'),
dict(type='LogTimingByEpochHook'),
dict(type='LogMemoryByEpochHook'),
dict(type='Accuracy2DHook'),
dict(type='LossHook'),
# dict(type='TensorboardHook', log_dir='./tfb_logs'),
# dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
# dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
]
# three keys: pipeline, tensor, data
# if data=dict(size=1), which means no data parallelization, then there is no need to define it
parallel = dict(
pipeline=dict(size=1),
tensor=dict(size=4, mode='2d'),
)
# not compulsory
# configuration for mixed precision training
fp16 = dict(
mode=AMP_TYPE.PARALLEL,
initial_scale=2 ** 8
)
# not compulsory
# build learning rate scheduler
lr_scheduler = dict(
type='LinearWarmupLR',
warmup_epochs=5
)
schedule = dict(
num_microbatches=8
)
# training stopping criterion
# you can give num_steps or num_epochs
num_epochs = 60
# config logging path
logging = dict(
root_path='./logs'
)
```

View File

@ -3,30 +3,8 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Colossal-AI documentation
Colossal-AI API documentation
======================================
.. toctree::
:maxdepth: 1
:caption: GETTING STARTED
installation.md
run_demo.md
.. toctree::
:maxdepth: 1
:caption: CUSTOMIZE YOUR TRAINING
parallelization.md
model.md
trainer_engine.md
amp.md
zero.md
add_your_parallel.md
config.md
.. toctree::
:maxdepth: 2
:caption: API REFERENCE

View File

@ -1,40 +0,0 @@
.. Colossal-AI documentation master file, created by
sphinx-quickstart on Mon Oct 11 17:05:05 2021.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Colossal-AI (Kuafu) Documentation
======================================
.. toctree::
:maxdepth: 1
:caption: GETTING STARTED
installation_zh.md
run_demo_zh.md
.. toctree::
:maxdepth: 1
:caption: CUSTOMIZE YOUR TRAINING
parallelization_zh.md
model_zh.md
trainer_engine_zh.md
amp_zh.md
zero_zh.md
add_your_parallel_zh.md
config_zh.md
.. toctree::
:maxdepth: 2
:caption: API REFERENCE
colossalai/colossalai
Indices and tables
==================
* :ref:`genindex`

View File

@ -1,27 +0,0 @@
# Setup
### PyPI
```bash
pip install colossalai
```
### Install From Source (Recommended)
> We **recommend** you install from source, as Colossal-AI is updated frequently in these early versions. The documentation will be in line with the main branch of the repository. Feel free to raise an issue if you encounter any problems. :)
```shell
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
# install colossalai
pip install .
```
Install and enable CUDA kernel fusion (compulsory installation when using fused optimizer)
```shell
pip install -v --no-cache-dir --global-option="--cuda_ext" .
```

View File

@ -1,25 +0,0 @@
# Quick installation
## Install with pip
```bash
pip install colossalai
```
## Install from source
```shell
git clone git@github.com:hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
# install colossalai
pip install .
```
Install and enable CUDA kernel fusion (compulsory when using fused optimizers):
```shell
pip install -v --no-cache-dir --global-option="--cuda_ext" .
```

View File

@ -1,31 +0,0 @@
# Define your own parallel model
Let's say that you have a huge MLP model with billions of parameters whose extremely large hidden layer size makes it
impossible to fit into a single GPU directly. Don't worry, Colossal-AI is here to help you sort things out. With the help of Colossal-AI,
you can write your model in the familiar way in which you used to write models for a single GPU, while Colossal-AI automatically
splits your model weights and fits them perfectly into a set of GPUs. We give a simple example showing how to write a
2D parallel model in the Colossal-AI context.
## Write a simple 2D parallel model
```python
from colossalai.nn import Linear2D
import torch.nn as nn

class MLP_2D(nn.Module):

    def __init__(self):
        super().__init__()
        self.linear_1 = Linear2D(in_features=1024, out_features=16384)
        self.linear_2 = Linear2D(in_features=16384, out_features=1024)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.linear_2(x)
        return x
```
## Use pre-defined model
For your convenience, we provide some prevalent models in our Model Zoo, such as *BERT*, *ViT*,
and *MLP-Mixer*. Feel free to customize them into different sizes to fit your specific needs.

View File

@ -1,26 +0,0 @@
# Define your own parallel model
Say you are training a huge MLP model with hundreds of millions of parameters; the model is so large that it cannot be trained directly on a single GPU. Don't worry, Colossal-AI can solve this problem for you. You can still write your model the way you would write a single-GPU model, and Colossal-AI will automatically split the model parameters according to your parallel settings and store them evenly across a set of GPUs. Below is a simple example showing how to write a 2D tensor parallel model in the Colossal-AI context.
## A simple 2D tensor parallel model
```python
from colossalai.nn import Linear2D
import torch.nn as nn

class MLP_2D(nn.Module):

    def __init__(self):
        super().__init__()
        self.linear_1 = Linear2D(in_features=1024, out_features=16384)
        self.linear_2 = Linear2D(in_features=16384, out_features=1024)

    def forward(self, x):
        x = self.linear_1(x)
        x = self.linear_2(x)
        return x
```
## Use pre-defined models
For your convenience, we have pre-defined some popular models in our Model Zoo, such as *BERT*, *ViT*, and *MLP-Mixer*; you can customize the scale of these models according to your needs.

View File

@ -1,240 +0,0 @@
# Parallelization
## Configure the Combination of Parallelization
We support multiple parallelization strategies in our library.
Hybrid parallelism in our codebase refers to the combination of data parallelism, pipeline parallelism
and tensor parallelism (1D, 2D, 2.5D, 3D). Each parallelism requires a different network topology and thus
a different initializer for its distributed process group. You can initialize the corresponding process group by
setting `parallel` in our config. The parallel configuration can be easily specified by a dictionary in the
configuration file. The configuration dictionary must obey the following format. The data parallel size will be
inferred automatically based on your inputs for pipeline parallelism and tensor parallelism. The distributed
environment will be set up by `colossalai.launch`.
```python
# sampler format
parallel = dict(
pipeline=dict("size": int),
tensor=dict("size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any)
)
# this is ok
parallel = dict(
pipeline=dict(size=2),
tensor=dict(size=4, mode='2d')
)
# this is ok
parallel = dict(
pipeline=2,
tensor=dict(size=4, mode='2d')
)
# this is not ok
# as you need to specify the mode for tensor parallelism
parallel = dict(
pipeline=2,
tensor=4
)
# this is ok as well as tensor will be default to size 1
# and mode None
parallel = dict(
pipeline=2
)
# this is ok as well as pipeline will default to size 1
parallel = dict(
tensor=dict(size=4, mode='2d')
)
```
The name of the dictionary variable should be **parallel**. All the arguments, even **parallel** itself, are optional, and the
data, pipeline, and tensor parallel sizes will be set to the default value of 1. The value of data, pipeline and tensor can be an
int representing the size of the specific parallel dimension, or a dictionary with a key called "size". The key "mode"
represents the mode of tensor parallelism.
**You can choose not to have 'parallel' in your configuration, in which case both pipeline and tensor will default to size 1.**
## Data Parallel
Data parallelism is the most common way to distribute your training task by splitting the data into several shards and training on
a single shard on each device. The configuration for data parallelism is detected automatically and set for you; you do not
have to set it explicitly in your configuration. There are two ways to handle the all-reduce in data parallelism in Colossal-AI.
1. If you specify gradient handlers, gradients will be all-reduced according to the gradient handlers
2. Otherwise, PyTorch DistributedDataParallel will be used
In most cases, you will be using the second mode unless you have complex handling of the gradients; the first mode is configured as sketched below.
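As a minimal sketch, the first route is enabled by listing handlers in your config. The handler name below mirrors the built-in data parallel handler; treat it as an assumption and check the registry of your installed version:

```python
# opt into explicit gradient handling instead of PyTorch DDP
gradient_handlers = [
    dict(type='DataParallelGradientHandler'),
]
```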
## 1D, 2D, 2.5D and 3D Parallel
To enable hybrid parallelism, we provide an array of tensor parallel methods, along with the papers on which each of
them is based. These parallel modes need to work with the distributed layers provided by Colossal-AI.
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data, model weights and layer
outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$ devices where
$N$ is the number of tensor chunks in a single dimension.
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism which
further parallelizes 2D tensor parallelism. An amount of $P = N^2 d$ processors are arranged into $d$ layers, where
each layer performs matrix multiplication operations independently with a dimension $N$.
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method
achieves the optimal, $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage
are evenly distributed through optimized load balancing of parameters as well as activations.
```python
# 1D parallel
parallel = dict(
tensor=dict(size=4, mode='1d')
)
# 2D parallel
parallel = dict(
tensor=dict(size=4, mode='2d')
)
# 2.5D parallel
parallel = dict(
tensor=dict(size=8, mode='2.5d', depth=2)
)
# 3D parallel
parallel = dict(
tensor=dict(size=8, mode='3d')
)
```
Once you specify the tensor parallel mode in your configuration, you can proceed to use its corresponding distributed
operator. For example, if your mode is '2d', you can use `colossalai.nn.Linear2D` in your model construction.
## Pipeline Parallel (experimental)
Pipeline parallelism splits the model into several partitions by layer. For example, let's assume we have a simple
model which consists of two linear layers. We have two GPUs, and we can allocate the first linear layer to the first GPU
and the second layer to the second GPU.
You can set the number of pipeline stages in your configuration file. When the pipeline size is larger than 1, Colossal-AI
will automatically create the pipeline schedule which defines the forward and backward steps.
```python
parallel = dict(
pipeline=dict(size=4), # number of pipeline stages
)
```
As PyTorch is based on a dynamic computation graph, the computation flow is not known until execution. To support pipeline parallelism, you have the following two ways to split your model:
1. Split your model directly. Below is an example of ResNet split into two pipeline stages.
```python
import torch
import torch.nn as nn
from torchvision.models import resnet18
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc

model = resnet18(num_classes=10)

if gpc.get_local_rank(ParallelMode.PIPELINE) == 0:
    model = nn.Sequential(
        model.conv1,
        model.bn1,
        model.relu,
        model.maxpool,
        model.layer1,
        model.layer2
    )
elif gpc.get_local_rank(ParallelMode.PIPELINE) == 1:
    class Flatten(nn.Module):
        def forward(self, x):
            return torch.flatten(x, 1)

    model = nn.Sequential(
        model.layer3,
        model.layer4,
        model.avgpool,
        Flatten(),
        model.fc
    )
```
2. Make sure your model inherits `colossalai.nn.model.ModelFromConfig` and is registered in the
`MODELS` registry. Define the `self.layers_cfg` attribute.
Pass in a dict/Config object which specifies the parameters of your model.
Use `colossalai.builder.pipeline.build_pipeline_model_from_cfg` to partition the layers.
```python
from colossalai.builder import build_pipeline_model_from_cfg
from colossalai.nn.model import ModelFromConfig
from colossalai.registry import MODELS

@MODELS.register_module
class MyModel(ModelFromConfig):

    def __init__(self, arg1, arg2, ...):
        ...
        self.layers_cfg = [
            dict(type='Linear', in_features=3, out_features=512),
            dict(type='Linear', in_features=512, out_features=512),
            ...
        ]

model_cfg = dict(
    type='MyModel',
    arg1=1,
    arg2=2
    ...
)

# from config
model = build_pipeline_model_from_cfg(model_cfg, num_chunks=1)

# from torch.nn.Sequential
# model = build_pipeline_model(sequential_model, num_model_chunks)
```
When your model is split into partitions, you can use PipelineSchedule to execute training.
```python
import colossalai
from colossalai.engine.schedule import PipelineSchedule

engine, train_dataloader, _, _ = colossalai.initialize(model, optimizer, criterion, train_dataloader)

schedule = PipelineSchedule(num_microbatches=4)
# interleaved pipeline
# schedule = InterleavedPipelineSchedule(num_microbatches=4, num_model_chunks=2)

# execute a training epoch
data_iter = iter(train_dataloader)
for i in range(len(train_dataloader)):
    output, label, loss = schedule.forward_backward_step(engine,
                                                         data_iter,
                                                         forward_only=False)
```
This feature is still in development and is only experimental for now.
## Sequence Parallel (experimental)
Sequence parallelism supports long-sequence modelling such as document-level text understanding and medical imaging.
This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
This feature is still in development and is only experimental for now.

View File

@ -1,189 +0,0 @@
# Parallelization
## Configure the Combination of Parallelization Techniques
Colossal-AI supports multiple parallelization techniques, including data parallelism, tensor parallelism (1D, 2D, 2.5D, 3D), pipeline parallelism, and sequence parallelism. You can initialize the process groups of the distributed system by changing the `parallel` dictionary in the configuration file; the dictionary must follow the format below. The data parallel size is computed from the pipeline parallel size and the tensor parallel size given in `parallel`.
```python
parallel = dict(
pipeline=dict("size": int),
tensor=dict("size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any)
)
```
注意该字典变量的名称必须为**parallel**。该变量中所有的参数,包括`parallel`本身都是非必需的如果您的代码中没有提供该变量则所有并行规模都将被设定为默认值1即不使用任何并行技术的情况。`parallel`中data、pipeline以及tensor的值分别代表了数据并行、流水线并行、以及张量并行的规模而`mode`的值代表了张量并行的模式。
## 数据并行
数据并行是一种最常见的并行技术可以将数据分成几个不同的部分并对每一个部分在一台设备上进行训练。Colossal-AI可以自动检测数据并行设置并为您设置好环境您不需要在您的环境配置中显式地设置。当数据并行规模大于1时Colossal-AI会自动为数据读取器增加分布式数据采样器以此来达到切分数据集的目的。
## 1D、2D、2.5D与3D张量并行
为了方便混合并行技术我们提供了一系列的张量并行技术同时下面罗列了每一种张量并行技术对应的论文这些张量并行技术需要Colossal-AI提供的分布式层结构的支持。
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2D tensor parallelism relies on the SUMMA matrix multiplication algorithm and splits the input data along two different dimensions. The tensor chunks are distributed over a 2D mesh using a total of $P = N^2$ devices, where $N$ is the number of tensor chunks along a single dimension.
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
Inspired by 2.5D matrix multiplication, 2.5D parallelism further splits the result of 2D tensor parallelism: $P = N^2 d$ processors are arranged into $d$ layers, and the matrix multiplication is accordingly split into $d$ parts carried out on different layers.
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
We also introduce a 3D tensor parallelism technique that parallelizes the neural network parameters over a 3D processor cube. With $P$ processors, this technique achieves the optimal performance at a communication cost of $O(P^{1/3})$, while both computation and memory usage are evenly distributed over the $P$ processors.
Examples of the `parallel` dictionary for these tensor parallel modes are given below.
```python
# 1D parallel
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=4, mode='1d')
)
# 2D parallel
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=4, mode='2d')
)
# 2.5D parallel
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=8, mode='2.5d', depth=2)
)
# 3D parallel
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=8, mode='3d')
)
```
## Pipeline Parallel (under development)
Pipeline parallelism splits a deep learning model into several parts by layer. For example, for a simple model consisting of two linear layers, we can use two GPUs and assign the work of the first linear layer to one GPU and the work of the second linear layer to the other. Of course, this example is only meant to illustrate how pipeline parallelism works and has no practical value.
Since PyTorch computation is based on a dynamic computation graph, the computation flow cannot be determined before execution. To support pipeline parallelism in PyTorch, you need to add an extra attribute, `layers_cfg`, to your model class so that Colossal-AI knows the exact computation flow; `colossalai.nn.VanillaResNet` provides an example you can refer to.
```python
from typing import List, Optional

import torch

from colossalai.nn import BaseModel

class VanillaResNet(BaseModel):

    def __init__(
            self,
            num_cls: int,
            block_type: str,
            layers: List[int],
            norm_layer_type: str = 'BatchNorm2d',
            in_channels: int = 3,
            groups: int = 1,
            width_per_group: int = 64,
            zero_init_residual: bool = False,
            replace_stride_with_dilation: Optional[List[bool]] = None,
            dilations=(1, 1, 1, 1)
    ) -> None:
        super().__init__()

        ...  # some model params

        self.layers_cfg = [
            # conv1
            dict(type='Conv2d',
                 in_channels=in_channels,
                 out_channels=self.inplanes,
                 kernel_size=7,
                 stride=2,
                 padding=3,
                 bias=False),
            # bn1
            dict(
                type=norm_layer_type,
                num_features=self.inplanes
            ),
            # relu
            dict(
                type='ReLU',
                inplace=True
            ),
            # maxpool
            dict(
                type='MaxPool2d',
                kernel_size=3,
                stride=2,
                padding=1
            ),
            # layer 1
            dict(
                inplanes=self.inplanes,
                planes=64,
                blocks=self.blocks[0],
                dilation=self.dilations[0],
                **self.reslayer_common_cfg
            ),
            # layer 2
            dict(
                inplanes=64 * self.block_expansion,
                planes=128,
                blocks=self.blocks[1],
                stride=2,
                dilate=replace_stride_with_dilation[0],
                dilation=self.dilations[1],
                **self.reslayer_common_cfg
            ),
            # layer 3
            dict(
                inplanes=128 * self.block_expansion,
                planes=256,
                blocks=layers[2],
                stride=2,
                dilate=replace_stride_with_dilation[1],
                dilation=self.dilations[2],
                **self.reslayer_common_cfg
            ),
            # layer 4
            dict(
                inplanes=256 * self.block_expansion,
                planes=512,
                blocks=layers[3], stride=2,
                dilate=replace_stride_with_dilation[2],
                dilation=self.dilations[3],
                **self.reslayer_common_cfg
            ),
            # avg pool
            dict(
                type='AdaptiveAvgPool2d',
                output_size=(1, 1)
            ),
            # flatten
            dict(
                type='LambdaWrapper',
                func=lambda mod, x: torch.flatten(x, 1)
            ),
            # linear
            dict(
                type='Linear',
                in_features=512 * self.block_expansion,
                out_features=num_cls
            )
        ]
```
You can set the number of pipeline stages manually in the configuration file. When the pipeline parallel size is larger than 1, Colossal-AI automatically creates the pipeline schedule that defines the forward and backward passes. You can also use the `schedule` dictionary in the configuration file to define the number of microbatches trained in each step. The code below gives an example of configuring pipeline parallelism.
```python
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=1, mode=None)
)
schedule = dict(
num_microbatches = 4 # set the number of microbatches per step
)
```
This technique is still under experimental development.
## Sequence Parallel (under development)
Sequence parallelism supports modelling long-sequence data, such as document-level text understanding and medical image analysis. It was proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120). This technique is still under experimental development.

View File

@ -1,120 +0,0 @@
# Quick demo
Colossal-AI is an integrated large-scale deep learning system with efficient parallelization techniques. The system can
accelerate model training on distributed systems with multiple GPUs by applying parallelization techniques. The system
can also run on systems with only one GPU. Quick demos showing how to use Colossal-AI are given below.
## Single GPU
Colossal-AI can be used to train deep learning models on systems with only one GPU and achieve baseline
performance. We provide an example of training ResNet on CIFAR10 data with only one GPU. You can find this example in
`examples/resnet_cifar10_data_parallel` in the repository. Detailed instructions can be found in its `README.md`.
## Multiple GPUs
Colossal-AI can be used to train deep learning models on distributed systems with multiple GPUs and accelerate the
training process drastically by applying efficient parallelization techniques, which will be elaborated in
the [Parallelization](parallelization.md) section below.
You can turn the ResNet example mentioned above into a multi-GPU run by setting `--nproc_per_node` to the number of
GPUs you have on your system, as sketched below. We also provide an example of Vision Transformer which relies on
training with more GPUs. You can visit this example in `examples/vit_b16_imagenet_data_parallel`; it also comes with a
detailed instructional `README.md`.
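For example, a single-node run on 4 GPUs with PyTorch's launcher might look like the following; the script name is illustrative:

```bash
python -m torch.distributed.launch --nproc_per_node 4 train_resnet.py
```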
## Sample Training Script
Below is a typical way to train your model using the trainer:
```python
import colossalai
from colossalai.amp import AMP_TYPE
from colossalai.logging import get_dist_logger
from colossalai.trainer import Trainer, hooks
from colossalai.utils import get_dataloader
CONFIG = dict(
parallel=dict(
pipeline=1,
tensor=1, mode=None
),
fp16 = dict(
mode=AMP_TYPE.TORCH
),
gradient_accumulation=4,
clip_grad_norm=1.0
)
def run_trainer():
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch(config=CONFIG,
rank=args.rank,
world_size=args.world_size,
host=args.host,
port=args.port,
backend=args.backend)
logger = get_dist_logger()
    # instantiate your components
    model = MyModel()
    optimizer = MyOptimizer(model.parameters(), ...)
    criterion = MyCriterion()  # loss function passed to colossalai.initialize below
train_dataset = TrainDataset()
test_dataset = TestDataset()
train_dataloader = get_dataloader(train_dataset, ...)
test_dataloader = get_dataloader(test_dataset, ...)
lr_scheduler = MyScheduler()
logger.info("components are built")
engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model,
optimizer,
criterion,
train_dataloader,
test_dataloader,
lr_scheduler)
trainer = Trainer(engine=engine,
verbose=True)
hook_list = [
hooks.LossHook(),
hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False),
hooks.AccuracyHook(),
hooks.TensorboardHook(log_dir='./tb_logs', ranks=[0]),
hooks.LogMetricByEpochHook(logger),
hooks.LogMemoryByEpochHook(logger),
hooks.SaveCheckpointHook(checkpoint_dir='./ckpt')
]
trainer.fit(
train_dataloader=train_dataloader,
test_dataloader=test_dataloader,
epochs=NUM_EPOCH,
hooks=hook_list,
display_progress=True,
test_interval=2
)
if __name__ == '__main__':
run_trainer()
```
Alternatively, the `model` variable can be substituted with a self-defined model or a pre-defined model in our Model
Zoo. The detailed substitution process is elaborated [here](model.md).
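For instance, swapping in a standard torchvision model only changes that one line. torchvision is used here purely as an illustration, since the exact Model Zoo import path depends on your checkout.
```python
import torchvision.models as models

# drop-in substitute for the self-defined MyModel above; any nn.Module
# works here, including pre-defined models from the Model Zoo
model = models.resnet50(num_classes=10)
```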
## Features
Colossal-AI provides a collection of parallel training components for you. We aim to make developing distributed deep learning models as straightforward as writing single-GPU models, and we provide friendly tools to kickstart distributed training in a few lines.
- [Data Parallelism](parallelization.md)
- [Pipeline Parallelism](parallelization.md)
- [1D, 2D, 2.5D, 3D and sequence parallelism](parallelization.md)
- [Friendly trainer and engine](trainer_engine.md)
- [Extensible for new parallelism](add_your_parallel.md)
- [Mixed Precision Training](amp.md)
- [Zero Redundancy Optimizer (ZeRO)](zero.md)

View File

@ -1,67 +0,0 @@
# Quick Start
Colossal-AI is a large-scale deep learning system that incorporates efficient parallelization techniques. It can use these techniques to accelerate model training on distributed systems with multiple GPUs, and it can also run on non-distributed systems with a single GPU. Below is a quick-start guide for Colossal-AI.
## Single-GPU Systems
When training a model on a non-distributed system with a GPU, Colossal-AI achieves baseline efficiency. [Here](https://colab.research.google.com/drive/1fJnqqFzPuzZ_kn1lwCpG2nh3l2ths0KE?usp=sharing#scrollTo=cQ_y7lBG09LS) we give a Google Colab example showing how to use Colossal-AI to train a LeNet model on the CIFAR10 dataset on a non-distributed system.
## Multi-GPU Systems
On a distributed system with multiple GPUs, Colossal-AI can use efficient parallelization techniques to significantly accelerate training; these techniques are detailed in the [Parallelization](parallelization.md) section. The command below trains a ViT model on a distributed system with four GPUs, where the `HOST` variable is the IP address of your distributed system. Note that it uses the [Slurm](https://slurm.schedmd.com/documentation.html) job scheduler.
```bash
HOST=xxx.xxx.xxx.xxx srun ./scripts/slurm_dist_train.sh ./examples/run_trainer.py ./configs/vit/vit_2d.py
```
`./configs/vit/vit_2d.py` is a [configuration file](config.md). Colossal-AI uses configuration files to define the parameters needed during training, such as the model type, the dataset, the optimizer, and the learning-rate scheduler. You can train different models by writing different configuration files. `./examples/run_trainer.py` is a standard training script whose code is attached below; it reads the training parameters from the configuration file and trains the model.
```python
import colossalai
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.trainer import Trainer
def run_trainer():
engine, train_dataloader, test_dataloader = colossalai.initialize()
logger = get_dist_logger()
logger.info("engine is built", ranks=[0])
trainer = Trainer(engine=engine,
verbose=True)
logger.info("trainer is built", ranks=[0])
logger.info("start training", ranks=[0])
trainer.fit(
train_dataloader=train_dataloader,
test_dataloader=test_dataloader,
epochs=gpc.config.num_epochs,
hooks_cfg=gpc.config.hooks,
display_progress=True,
test_interval=2
)
if __name__ == '__main__':
run_trainer()
```
The `model` variable in the code above can be replaced with a custom model or a pre-defined model in our Model Zoo in order to train different models; [this document](model.md) explains in detail how to make such a replacement.
## Features
Colossal-AI provides a series of parallel components to accelerate your model training; they are introduced in the sections below. Our goal is to make distributed deep learning model development as convenient as single-GPU model development.
- [Data Parallelism](parallelization.md)
- [1D, 2D, 2.5D, 3D tensor parallelism and sequence parallelism](parallelization.md)
- [Pipeline Parallelism](parallelization.md)
- [Trainer and engine](trainer_engine.md)
- [Customize your parallel mode](add_your_parallel.md)
- [Mixed Precision Training](amp.md)
- [Zero Redundancy Optimizer (ZeRO)](zero.md)

View File

@ -1,132 +0,0 @@
# Colossal-AI Engine & Customize Your Trainer
## Colossal-AI engine
To better understand how the `Engine` class works, let's start with the concept of the process function used by common engines. The process function usually controls the behavior over one batch of a dataset, and the `Engine` class simply controls this process function. A standard process function is given in the following code block.
```python
def process_function(data_iter, model, criterion, optim):
    optim.zero_grad()
    data, label = next(data_iter)  # fetch the next batch from the data iterator
    output = model(data)
    loss = criterion(output, label)
    loss.backward()
    optim.step()
```
The `Engine` class is a high-level wrapper around these frequently used functions; it preserves the PyTorch-like function signature while integrating with our features.
```python
import torch
import torch.nn as nn
import torchvision.models as models
import colossalai
from colossalai.engine import Engine
from torchvision.datasets import CIFAR10
model = models.resnet18()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
dataset = CIFAR10(...)
dataloader = colossalai.utils.get_dataloader(dataset)
engine, dataloader, _, _ = colossalai.initialize(model, optimizer, criterion, dataloader)
# example of a training iteration
for img, label in dataloader:
engine.zero_grad()
output = engine(img)
loss = engine.criterion(output, label)
engine.backward(loss)
engine.step()
```
More information regarding the class can be found in the API references.
## Customize your trainer
### Overview
To learn how to customize a trainer that meets your needs, let's first take a look at the `Trainer` class. We highly recommend that you read the *Get Started* section and the *Colossal-AI engine* section above first.
The `Trainer` class enables researchers and engineers to use our system more conveniently. Instead of having to write
your own scripts, you can simply construct your own trainer by calling the `Trainer` class, just like what we did in the
following code block.
```python
trainer = Trainer(engine)
```
After that, you can use the `fit` method to train or evaluate your model. In order to make our `Trainer` class even more powerful, we incorporate a set of handy tools into the class. For example, you can monitor or record the running states and metrics that indicate the current performance of the model. These functions are realized by hooks. The `BaseHook` class allows you to execute your hook functions at specified times. We have already created some practical hooks for you, as listed below; all you need to do is pick the ones that suit your needs. Detailed descriptions of the class can be found in the API references.
These hook functions record metrics, elapsed time, and memory usage, and write them to a log after each epoch. Besides, they print the current loss and accuracy so that users can monitor the performance of the model.
```python
import colossalai
from colossalai.trainer import hooks, Trainer
from colossalai.utils import MultiTimer
from colossalai.logging import get_dist_logger
engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(...)
timer = MultiTimer()
logger = get_dist_logger()
# if you want to save log to file
logger.log_to_file('./logs/')
trainer = Trainer(
engine=engine,
timer=timer,
logger=logger
)
hook_list = [
hooks.LossHook(),
hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False),
hooks.AccuracyHook(),
hooks.TensorboardHook(log_dir='./tb_logs', ranks=[0]),
hooks.LogMetricByEpochHook(logger),
hooks.LogMemoryByEpochHook(logger),
hooks.LogTimingByEpochHook(timer, logger),
hooks.SaveCheckpointHook(checkpoint_dir='./ckpt')
]
trainer.fit(
train_dataloader=train_dataloader,
epochs=NUM_EPOCHS,
test_dataloader=test_dataloader,
test_interval=1,
hooks=hook_list,
display_progress=True
)
```
### Hook
If you have specific needs, feel free to extend our `BaseHook` class to add your own functions, or our `MetricHook` class to write a metric collector. These hook functions can be called at different stages in the trainer's life cycle. Besides, you can define the priorities of all hooks to arrange their execution order. More information can be found in the API references.
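As an illustration, a custom hook might look like the sketch below. The `priority` constructor argument, the `after_train_epoch` method name, and the `trainer.cur_epoch` attribute are assumptions based on the description above, so check the API references for the precise signatures.
```python
from colossalai.logging import get_dist_logger
from colossalai.trainer import hooks


class EpochLoggerHook(hooks.BaseHook):
    """A sketch of a custom hook; the method name, `priority` argument,
    and `trainer.cur_epoch` attribute are assumptions, not confirmed API."""

    def __init__(self, priority: int = 10) -> None:
        super().__init__(priority=priority)
        self._logger = get_dist_logger()

    def after_train_epoch(self, trainer):
        # runs once an epoch finishes; log the epoch that just completed
        self._logger.info(f'finished epoch {trainer.cur_epoch}')
```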
### Metric
You can write your own metrics by extending our `Metric` class; it should be used with the `MetricHook` class. When you write your own metric hooks, please set the priority carefully and make sure the hook is called before other hooks that might require the results of the metric hook.
We have already provided some metric hooks, and we store metric objects in `runner.states['metrics']`. It is a dictionary, and metrics can be accessed by their names.
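As a sketch, a downstream hook could read a metric object by name as shown below; the `'Accuracy'` key, the `after_test_epoch` method name, and the `get_accumulated_value` call are assumptions for illustration (the text above refers to the trainer as `runner`).
```python
from colossalai.trainer import hooks


class BestAccuracyHook(hooks.BaseHook):
    """A sketch of consuming a metric by name; the 'Accuracy' key and
    the get_accumulated_value call are assumptions, not confirmed API."""

    def __init__(self, priority: int = 20) -> None:
        super().__init__(priority=priority)
        self.best_acc = 0.0

    def after_test_epoch(self, trainer):
        # metric objects are stored under their names, per the text above
        metric = trainer.states['metrics'].get('Accuracy')
        if metric is not None:
            self.best_acc = max(self.best_acc, float(metric.get_accumulated_value()))
```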

View File

@ -1,84 +0,0 @@
# Engine & Trainer
## Engine
To better understand how our `Engine` class works, we first need to understand the concept of the process function used by common engines. The process function controls the behavior over one batch of a dataset, and it is exactly this process function that the `Engine` class controls. A standard example of a process function is given in the code block below.
```python
def process_function(data_iter, model, criterion, optim):
    optim.zero_grad()
    data, label = next(data_iter)  # fetch the next batch from the data iterator
    output = model(data)
    loss = criterion(output, label)
    loss.backward()
    optim.step()
```
In `ignite.engine` and `keras.engine`, the process function needs to be provided by the user. However, it is difficult for users to write a process function for pipeline parallelism. To offer users convenient hybrid parallelism, we provide the powerful `Engine` class, which supports pipeline parallelism and provides a strategy in which the forward and backward passes are not interleaved. You can also use a pre-defined learning-rate scheduler in the `Engine` class to adjust the learning rate during training.
To construct the engine, you only need to define variables such as `model`, `criterion`, `optimizer`, `lr_scheduler`, and `schedule`; the code block below gives an example.
**If you use `colossalai.initialize`, the engine is built automatically from the config file.**
```python
import torch
import torch.nn as nn
import torchvision.models as models
import colossalai
from colossalai.engine import Engine
model = models.resnet18()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
lr_scheduler = colossalai.nn.lr_scheduler.CosineAnnealingLR(optimizer, 1000)
schedule = colossalai.engine.NonPipelineSchedule()
MyEngine = Engine(
model=model,
criterion=criterion,
optimizer=optimizer,
step_schedule=schedule
)
```
More information about this class can be found in the API references.
## Trainer
To learn how to customize a trainer that fits your needs, you first need to understand our `Trainer` class.
The `Trainer` class is designed to make it easier for researchers and engineers to use our system. You do not need to write your own script; simply call the `Trainer` class to construct your trainer, as is done in the code block below.
```python
MyTrainer = Trainer(engine=my_engine)
```
After that, you can use the `fit` method to train or evaluate your model. In addition, to make our `Trainer` class even more powerful, we have added a series of handy tools. For example, you can continuously monitor and record the running state and performance of the model during training; these functions are implemented through hooks. The `BaseHook` class lets you execute your hook functions at specified times. As shown in the code block below, we have pre-defined some practical hook functions for you; all you need to do is find the ones that fit your needs. More information about this class can be found in the API references.
```python
hooks = [
dict(type='LogMetricByEpochHook'),
dict(type='LogTimingByEpochHook'),
dict(type='LogMemoryByEpochHook'),
dict(type='AccuracyHook'),
dict(type='LossHook'),
dict(type='TensorboardHook', log_dir='./tfb_logs'),
dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
]
```
The hooks above record metrics, training time, memory usage, and other information, and write it to a log after each epoch. Besides, they print the current loss and accuracy in real time so that users can monitor the performance of the model.
### Hooks
If you have specific needs, you can inherit from our `BaseHook` class and add your own hook functions, or inherit from our `MetricHook` class to write the metrics you need. These hook functions can be executed at 12 points in the `Trainer` life cycle. More information about this class can be found in the API references.
### Metrics
You can provide the metrics you need by inheriting from our `Metric` class, which must be used together with the `MetricHook` class. When you write your own metric hook functions, please set the priority carefully to make sure your hook runs before other hooks that require the metric results.
We have already defined some metric hook functions for you, and the metric objects are stored in `runner.states['metrics']` for your reference.

Some files were not shown because too many files have changed in this diff.