Merge pull request #10 from hpcaitech/feature/zhdoc

added Chinese documents and fixed some typos in English documents
Frank Lee 3 years ago committed by GitHub
commit ccb44882e1

@ -1,12 +1,12 @@
# Add Your Own Parallelism
# Add your own parallelism
## Overview
To enable researchers and engineers to extend our framework to other novel large-scale distributed training algorithms
with less effort, we have decoupled the various components in the training lifecycle. You can implement your own
with less effort, we have decoupled various components in the training lifecycle. You can implement your own
parallelism by simply inheriting from the base class.
The main components are
The main components are:
1. `ProcessGroupInitializer`
2. `GradientHandler`
@ -14,14 +14,13 @@ The main components are
## Process Group Initializer
Parallelism is often managed by process groups where processes involved in parallel computing are placed in the same
Parallelism is often managed by process groups where processes involved in the same parallel algorithm are placed in the same
process group. For different parallel algorithms, different process groups need to be created. ColossalAI provides a
global context for the user to easily manage their process groups. If you wish to add new process group, you can easily
global context for users to easily manage their process groups. If you wish to add a new process group, you can easily
define a new class and set it in your configuration file. To define your own way of creating process groups, you can
follow the steps below to create new distributed initialization.
1. Add your parallel mode in `colossalai.context.parallel_mode.ParallelMode`
follow the steps below to create a new distributed initialization.
1. Add your parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
```python
class ParallelMode(Enum):
GLOBAL = 'global'
@ -34,11 +33,10 @@ follow the steps below to create new distributed initialization.
NEW_MODE = 'new_mode' # define your mode here
```
2. Create a `ProcessGroupInitializer`. You can refer to examples given in `colossal.context.dist_group_initializer`. The
2. Create a `ProcessGroupInitializer`. You can refer to examples given in `colossalai.context.dist_group_initializer`. The
first six arguments are fixed. `ParallelContext` will pass in these arguments for you. If you need to set other
arguments, you can add them behind, like the `arg1, arg2` in the example below. Lastly, register your initializer to the
registry by adding the decorator `@DIST_GROUP_INITIALIZER.register_module`.
```python
# sample initializer class
@DIST_GROUP_INITIALIZER.register_module
@ -84,18 +82,16 @@ follow the steps below to create new distributed initialization.
## Gradient Handler
Gradient handlers are objects which execute the all-reduce operations on parameters' gradients. As different all-reduce
strategies may be executed for different kinds of parallelism, the user can
inherit `colossal.engine.gradient_handler.BaseGradientHandler` to implement their strategies. Currently, the library
strategies may be executed for different kinds of parallelism, users can
inherit `colossalai.engine.gradient_handler.BaseGradientHandler` to implement their strategies. Currently, the library
uses the normal data parallel gradient handler which all-reduces the gradients across data parallel ranks. The data
parallel gradient handler is added to the engine automatically if data parallel is detected. You can add your own
gradient handler like below:
```python
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine import BaseGradientHandler
@GRADIENT_HANDLER.register_module
class YourGradientHandler(BaseGradientHandler):
    def handle_gradient(self):
        do_something()
```
@ -116,5 +112,5 @@ dist_initializer = [
Schedule entails how to execute a forward and backward pass. Currently, ColossalAI provides pipeline and non-pipeline
schedules. If you want to modify how the forward and backward passes are executed, you can
inherit `colossalai.engine.BaseSchedule` and implement your idea. You can add your schedule to the engine before
inherit `colossalai.engine.BaseSchedule` and implement your idea. You can also add your schedule to the engine before
training.
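As a rough sketch, a custom schedule could look like the code below. This is only an illustration: the method name `forward_backward_step` and its signature are assumptions and may not match the actual abstract interface of `BaseSchedule`, so consult the API references before implementing one.
```python
from colossalai.engine import BaseSchedule


class MySchedule(BaseSchedule):
    # hypothetical hook point: run one training step over a batch
    def forward_backward_step(self, data_iter, model, criterion, optimizer):
        data, label = next(data_iter)
        output = model(data)
        loss = criterion(output, label)
        loss.backward()
        return output, label, loss
```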

@ -0,0 +1,103 @@
# Add your own parallelism
To enable researchers and engineers to extend our framework to novel large-scale distributed training algorithms with less effort, we have decoupled several components in the training lifecycle. You can implement your own parallelism by inheriting from the base classes.
The main components are:
1. `ProcessGroupInitializer`
2. `GradientHandler`
3. `Schedule`
## Process group initializer
Parallelism is often managed by process groups: processes that belong to the same parallel algorithm are placed in the same process group. If multiple parallelization techniques exist in the system, multiple process groups need to be created. ColossalAI provides users with a global context variable to manage their process groups conveniently. If you wish to add a new process group, you can define a new class and set it in your configuration file. The code blocks below show how to add your new parallelization technique to the system and how to initialize it.
1. Add a new parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
```python
class ParallelMode(Enum):
GLOBAL = 'global'
DATA = 'data'
PIPELINE = 'pipe'
PIPELINE_PREV = 'pipe_prev'
PIPELINE_NEXT = 'pipe_next'
...
NEW_MODE = 'new_mode' # define your mode here
```
2. Create a subclass of `ProcessGroupInitializer`. You can refer to the examples given in `colossalai.context.dist_group_initializer`. The first six arguments are determined by `ParallelContext`. If you need to set new arguments, you can replace `arg1` and `arg2` in the example below with your own arguments. Lastly, register your initializer to our registry with the decorator `@DIST_GROUP_INITIALIZER.register_module`.
```python
# sample initializer class
@DIST_GROUP_INITIALIZER.register_module
class MyParallelInitializer(ProcessGroupInitializer):
def __init__(self,
rank: int,
world_size: int,
config: Config,
data_parallel_size: int,
pipeline_parallel_size: int,
tensor_parallel_size: int,
arg1,
arg2):
super().__init__(rank, world_size, config)
self.arg1 = arg1
self.arg2 = arg2
# ... your variable init
def init_parallel_groups(self):
# initialize your process groups
pass
```
After that, you can insert your initializer into the current mode-to-initializer mapping `colossalai.constants.INITIALIZER_MAPPING`. You can also change the mapping between names and parallel modes dynamically by modifying this file.
```python
colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
```
3. Set your initializer in the config file. If your initializer requires arguments, you can pass them in yourself. The code below lets `ParallelContext` create your initializer and initialize the process groups you need.
```python
parallel = dict(
pipeline=dict(size=1),
tensor=dict(size=x, mode='new_mode') # this is where you enable your new parallel mode
)
```
## Gradient handler
A gradient handler performs all-reduce operations on the gradients of model parameters. As different parallelization techniques may require different all-reduce operations, users can inherit `colossalai.engine.gradient_handler.BaseGradientHandler` to perform their customized operations. Currently, ColossalAI uses the normal data parallel gradient handler, which all-reduces the gradients over all data parallel ranks; this handler is created automatically once ColossalAI detects that data parallelism is in use. You can add your customized gradient handler with the code in the code block below:
```python
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine import BaseGradientHandler
@GRADIENT_HANDLER.register_module
class YourGradientHandler(BaseGradientHandler):
def handle_gradient(self):
do_something()
```
After that, you can specify the gradient handler you want to use in the configuration file.
```python
dist_initializer = [
dict(type='YourGradientHandler'),
]
```
## Schedule
A schedule specifies which operations to execute during the forward and backward passes. ColossalAI provides schedules that support pipelining and schedules that do not. If you want to modify how the forward and backward passes are executed, you can inherit `colossalai.engine.BaseSchedule` and implement the behavior you want. You can also add your schedule to our engine before training the model.

@ -1,24 +1,24 @@
# Mixed Precision Training
# Mixed precision training
In Colossal-AI, we have integrated different implementations of mixed precision training:
In ColossalAI, we have incorporated different implementations of mixed precision training:
1. torch.cuda.amp
2. apex.amp
3. tensor-parallel amp
The first two rely on the original implementation of [PyTorch](https://pytorch.org/docs/stable/amp.html)
(version 1.6 and above) and [Nvidia Apex](https://github.com/NVIDIA/apex). However, these two methods are not compatible
with tensor parallelism. This is because that tensors are split across devices in tensor parallelism, thus, it is needed
to communicate among different processes to check if inf or nan occurs throughout the whole model weights. For the mixed
precision training with tensor parallel, we adapted this feature from [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
with tensor parallelism. This is because tensors are split across devices in tensor parallelism; thus, it is required
to communicate among different processes to check if `inf` or `nan` occurs in the whole model weights. For the mixed
precision training with tensor parallelism, we adapted this feature from [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
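Conceptually, the cross-process check works like the sketch below; this is an illustration of the idea only, not the library's actual implementation.
```python
import torch
import torch.distributed as dist


def grads_contain_inf_or_nan(local_grads) -> bool:
    # each rank inspects only its own shard of the gradients
    local_flag = any(not torch.isfinite(g).all() for g in local_grads)
    flag = torch.tensor([float(local_flag)])
    # combine the flags across ranks: if any rank saw inf/nan, all ranks learn it
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())
```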
To use mixed precision training, you can easily specify the `fp16` field in the configuration file. Currently, torch and
apex amp cannot be guaranteed to work with tensor and pipeline parallelism, thus, only the last one is recommended if you
To use mixed precision training, you can easily set the `fp16` field in the config file. Currently, PyTorch and
Apex amp cannot be guaranteed to work with tensor and pipeline parallelism, thus, only the last one is recommended if you
are using hybrid parallelism.
## Torch AMP
## PyTorch AMP
PyTorch provides mixed precision training in version 1.6 and above. It provides an easy way to cast data to fp16 format
while keeping some operations such as reductions in fp32. You can configure the gradient scaler in the configuration.
PyTorch provides mixed precision training in version 1.6 and above. It provides an easy way to cast data to `fp16` format
while keeping some operations such as reductions in `fp32`. You can configure the gradient scaler in the config file.
```python
from colossalai.engine import AMP_TYPE
@ -34,13 +34,14 @@ fp16=dict(
)
```
## Apex AMP
For this mode, we rely on the [Apex](https://nvidia.github.io/apex/) implementation for mixed precision training. We supported this plugin because it allows
for finer control on the granularity of mixed precision. For example, `O2` level (optimization level 2) will keep batch normalization in fp32.
For this mode, we rely on the [Apex](https://nvidia.github.io/apex/) implementation for mixed precision training. We support
this plugin because it allows for finer control on the granularity of mixed precision. For example, `O2` level (optimization level 2)
will keep batch normalization in `fp32`.
The following code block shows a config file for Apex AMP.
The configuration is like below.
```python
from colossalai.engine import AMP_TYPE
@ -64,8 +65,10 @@ fp16 = dict(
## Tensor Parallel AMP
We leveraged the Megatron-LM implementation to achieve mixed precision training while maintaining compatibility with
complex tensor and pipeline parallel.
We leveraged the Megatron-LM implementation to achieve mixed precision training while maintaining compatibility with complex tensor
and pipeline parallelism.
The following code block shows a config file for this mode.
```python
from colossalai.engine import AMP_TYPE

@ -0,0 +1,79 @@
# Mixed precision training
ColossalAI provides the following three ways of mixed precision training:
1. torch.cuda.amp
2. apex.amp
3. tensor-parallel AMP
The first two rely on the native implementation of [PyTorch](https://pytorch.org/docs/stable/amp.html) (version 1.6 or above) and
[Nvidia Apex](https://github.com/NVIDIA/apex), but these two methods are not compatible with tensor parallelism. In tensor parallelism
the tensors are split and stored on different devices, so mixed precision training compatible with tensor parallelism requires
communication among different processes to determine whether `inf` or `nan` appears in the model parameters; for this reason we
adopted the implementation of [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
You can simply set the `fp16` field in the config file to use mixed precision training. Currently, the amp of PyTorch and Apex cannot
be guaranteed to be compatible with tensor and pipeline parallelism, so we recommend that you use the last way of mixed precision training.
## PyTorch AMP
PyTorch provides mixed precision training in version 1.6 and above. It converts data into `fp16` format while keeping some operations such as reductions in `fp32`. You can configure it in the config file.
```python
from colossalai.engine import AMP_TYPE
fp16=dict(
mode=AMP_TYPE.TORCH,
# below are default values for grad scaler
init_scale=2.**16,
growth_factor=2.0,
backoff_factor=0.5,
growth_interval=2000,
enabled=True
)
```
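For reference, the grad-scaler fields above correspond to the constructor arguments of PyTorch's native `torch.cuda.amp.GradScaler`; the snippet below only illustrates this correspondence.
```python
import torch

# the config fields map one-to-one onto GradScaler's constructor
scaler = torch.cuda.amp.GradScaler(init_scale=2.**16,
                                   growth_factor=2.0,
                                   backoff_factor=0.5,
                                   growth_interval=2000,
                                   enabled=True)
```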
## Apex AMP
For this mode we use the mixed precision training of [Apex](https://nvidia.github.io/apex/), as it provides fine-grained control over the granularity of mixed precision. For example, the `O2` level (optimization level 2) keeps batch normalization in `fp32`. The code block below shows a config file that uses Apex AMP.
```python
from colossalai.engine import AMP_TYPE
fp16 = dict(
mode=AMP_TYPE.APEX,
# below are the default values
enabled=True,
opt_level='O1',
cast_model_type=None,
patch_torch_functions=None,
keep_batchnorm_fp32=None,
master_weights=None,
loss_scale=None,
cast_model_outputs=None,
num_losses=1,
verbosity=1,
min_loss_scale=None,
max_loss_scale=16777216.0
)
```
## Tensor-parallel AMP
We borrowed the mixed precision training implementation of Megatron-LM; it is compatible with tensor parallelism and pipeline parallelism. The code block below shows a config file that uses tensor-parallel AMP.
```python
from colossalai.engine import AMP_TYPE
fp16 = dict(
mode=AMP_TYPE.PARALLEL,
# below are the default values
clip_grad=0,
log_num_zeros_in_grad=False,
initial_scale=2 ** 32,
min_scale=1,
growth_factor=2,
backoff_factor=0.5,
growth_interval=1000,
hysteresis=2
)
```

@ -1,6 +1,6 @@
# Config file
Here is an example config file of training ViT on cifar:
Here is a config file example showing how to train a ViT model on the CIFAR10 dataset using ColossalAI:
```python
# build train_dataset and train_dataloader from this dictionary

@ -0,0 +1,187 @@
# Config file
The example in the code block below shows how to train a ViT model on the CIFAR10 dataset using ColossalAI.
```python
# build train_dataset and train_dataloader from this dictionary
# It is not compulsory in the config file; instead, you can pass this dictionary as an argument into colossalai.initialize()
train_data = dict(
# dictionary for building Dataset
dataset=dict(
# the type CIFAR10Dataset has to be registered
type='CIFAR10Dataset',
root='/path/to/data',
# transform pipeline
transform_pipeline=[
dict(type='Resize', size=IMG_SIZE),
dict(type='RandomCrop', size=IMG_SIZE, padding=4),
dict(type='RandomHorizontalFlip'),
dict(type='ToTensor'),
dict(type='Normalize',
mean=[0.4914, 0.4822, 0.4465],
std=[0.2023, 0.1994, 0.2010]),
]
),
# dictionary for building Dataloader
dataloader=dict(
batch_size=BATCH_SIZE,
pin_memory=True,
# num_workers=1,
shuffle=True,
)
)
# build test_dataset and test_dataloader from this dictionary
test_data = dict(
dataset=dict(
type='CIFAR10Dataset',
root='/path/to/data',
train=False,
transform_pipeline=[
dict(type='Resize', size=IMG_SIZE),
dict(type='ToTensor'),
dict(type='Normalize',
mean=[0.4914, 0.4822, 0.4465],
std=[0.2023, 0.1994, 0.2010]
),
]
),
dataloader=dict(
batch_size=BATCH_SIZE,
pin_memory=True,
# num_workers=1,
)
)
# compulsory
# build optimizer from this dictionary
optimizer = dict(
# Available types: 'ZeroRedundancyOptimizer_Level_1', 'ZeroRedundancyOptimizer_Level_2', 'ZeroRedundancyOptimizer_Level_3'
# 'Adam', 'Lamb', 'SGD', 'FusedLAMB', 'FusedAdam', 'FusedSGD', 'FP16Optimizer'
type='Adam',
lr=0.001,
weight_decay=0
)
# compulsory
# build loss function from this dictionary
loss = dict(
# Available types:
# 'CrossEntropyLoss2D', 'CrossEntropyLoss2p5D', 'CrossEntropyLoss3D'
type='CrossEntropyLoss2D',
)
# compulsory
# build model from this dictionary
model = dict(
# available types: 'PretrainBERT', 'VanillaResNet', 'VisionTransformerFromConfig'
type='VisionTransformerFromConfig',
# each key-value pair below refers to a layer
# input data passes through these layers recursively
tensor_splitting_cfg=dict(
type='ViTInputSplitter2D',
),
embedding_cfg=dict(
type='ViTPatchEmbedding2D',
img_size=IMG_SIZE,
patch_size=PATCH_SIZE,
embed_dim=DIM,
),
token_fusion_cfg=dict(
type='ViTTokenFuser2D',
img_size=IMG_SIZE,
patch_size=PATCH_SIZE,
embed_dim=DIM,
drop_rate=0.1
),
norm_cfg=dict(
type='LayerNorm2D',
normalized_shape=DIM,
eps=1e-6,
),
block_cfg=dict(
# ViTBlock is a submodule
type='ViTBlock',
attention_cfg=dict(
type='ViTSelfAttention2D',
hidden_size=DIM,
num_attention_heads=NUM_ATTENTION_HEADS,
attention_dropout_prob=0.,
hidden_dropout_prob=0.1,
checkpoint=True
),
droppath_cfg=dict(
type='VanillaViTDropPath',
),
mlp_cfg=dict(
type='ViTMLP2D',
in_features=DIM,
dropout_prob=0.1,
mlp_ratio=4,
checkpoint=True
),
norm_cfg=dict(
type='LayerNorm2D',
normalized_shape=DIM,
eps=1e-6,
),
),
head_cfg=dict(
type='ViTHead2D',
hidden_size=DIM,
num_classes=NUM_CLASSES,
),
embed_dim=DIM,
depth=DEPTH,
drop_path_rate=0.,
)
# hooks are built when initializing trainer
# possible hooks: 'BaseHook', 'MetricHook','LoadCheckpointHook'
# 'SaveCheckpointHook','LossHook', 'AccuracyHook', 'Accuracy2DHook'
# 'LogMetricByEpochHook', 'TensorboardHook','LogTimingByEpochHook', 'LogMemoryByEpochHook'
hooks = [
dict(type='LogMetricByEpochHook'),
dict(type='LogTimingByEpochHook'),
dict(type='LogMemoryByEpochHook'),
dict(type='Accuracy2DHook'),
dict(type='LossHook'),
# dict(type='TensorboardHook', log_dir='./tfb_logs'),
# dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
# dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
]
# three keys: pipeline, tensor, data
# if data=dict(size=1), i.e. no data parallelism, there is no need to define it
parallel = dict(
pipeline=dict(size=1),
tensor=dict(size=4, mode='2d'),
)
# not compulsory
# pipeline or no pipeline schedule
fp16 = dict(
mode=AMP_TYPE.PARALLEL,
initial_scale=2 ** 8
)
# not compulsory
# build learning rate scheduler
lr_scheduler = dict(
type='LinearWarmupLR',
warmup_epochs=5
)
schedule = dict(
num_microbatches=8
)
# training stopping criterion
# you can give num_steps or num_epochs
num_epochs = 60
# config logging path
logging = dict(
root_path='./logs'
)
```
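A config file like the one above is consumed at start-up. As in the quick demo script, the entry point reads the config and builds every component defined in it:
```python
import colossalai

# initialize() parses the config file and builds the model, dataloaders,
# loss, optimizer, schedule and lr_scheduler defined above
model, train_dataloader, test_dataloader, criterion, optimizer, schedule, lr_scheduler = colossalai.initialize()
```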

@ -3,27 +3,27 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
ColossalAI documentation
ColossalAI development documentation
======================================
.. toctree::
:maxdepth: 1
:caption: GETTING STARTED
:caption: GETTING STARTED
installation.md
run_demo.md
installation_zh.md
run_demo_zh.md
.. toctree::
:maxdepth: 1
:caption: CUSTOMIZE YOUR TRAINING
parallelization.md
model.md
trainer_engine.md
amp.md
zero.md
add_your_parallel.md
config.md
:caption: CUSTOMIZE YOUR TRAINING
parallelization_zh.md
model_zh.md
trainer_engine_zh.md
amp_zh.md
zero_zh.md
add_your_parallel_zh.md
config_zh.md

@ -0,0 +1,40 @@
.. ColossalAI documentation master file, created by
sphinx-quickstart on Mon Oct 11 17:05:05 2021.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
ColossalAI documentation
======================================
.. toctree::
:maxdepth: 1
:caption: GETTING STARTED
installation.md
run_demo.md
.. toctree::
:maxdepth: 1
:caption: CUSTOMIZE YOUR TRAINING
parallelization.md
model.md
trainer_engine.md
amp.md
zero.md
add_your_parallel.md
config.md
.. toctree::
:maxdepth: 2
:caption: API REFERENCE
colossalai/colossalai
Indices and tables
==================
* :ref:`genindex`

@ -0,0 +1,25 @@
# Quick installation
## Install with pip
```bash
pip install colossalai
```
## Install from source
```shell
git clone git@github.com:hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
# install colossalai
pip install .
```
Install with kernel fusion support (the command below is compulsory when using fused optimizers):
```
pip install -v --no-cache-dir --global-option="--cuda_ext" .
```
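To verify the installation, you can try importing the package in a Python shell (assuming the package exposes a `__version__` attribute, as most packages do):
```python
import colossalai

print(colossalai.__version__)
```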

@ -1,15 +1,17 @@
# Define your own parallel model
## Write a Simple 2D Parallel Model
Let's say that you have a huge MLP model with billions of parameters and its extremely large hidden layer size makes it
impossible to fit into a single GPU. Don't worry, ColossalAI is here to help you sort things out. With the help of ColossalAI,
you can write your model in the familiar way in which you used to write models for a single GPU, while ColossalAI automatically
splits your model weights and fit them perfectly into a set of GPUs. We give a simple example showing how to write a simple
2D parallel model in the ColossalAI context.
Let's say we have a huge MLP model and its very large hidden size makes it difficult to fit into a single GPU. We can
then distribute the model weights across GPUs in a 2D mesh while still writing the model in a familiar way.
## Write a simple 2D parallel model
```python
from colossalai.nn import Linear2D
import torch.nn as nn
class MLP_2D(nn.Module):
def __init__(self):
@ -21,8 +23,9 @@ class MLP_2D(nn.Module):
x = self.linear_1(x)
x = self.linear_2(x)
return x
```
## Use pre-defined model
Our Model Zoo supports *BERT*, *VIT*, *MLP-Mixer* of different sizes.
For your convenience, our Model Zoo provides some prevalent models such as *BERT*, *ViT*,
and *MLP-Mixer*. Feel free to customize them into different sizes to fit your specific needs.

@ -0,0 +1,28 @@
# Define your own parallel model
Suppose you are training a huge MLP model with billions of parameters; the model is surely unable to fit on a single GPU. Don't worry, ColossalAI can help you solve this problem. You can still write your model the way you write a single-GPU model: ColossalAI automatically splits the model parameters according to your parallel settings and distributes them evenly across a set of GPUs. Below is a simple example that shows how to write a 2D tensor parallel model in the ColossalAI context.
## A simple 2D tensor parallel model
```python
from colossalai.nn import Linear2D
import torch.nn as nn
class MLP_2D(nn.Module):
def __init__(self):
super().__init__()
self.linear_1 = Linear2D(in_features=1024, out_features=16384)
self.linear_2 = Linear2D(in_features=16384, out_features=1024)
def forward(self, x):
x = self.linear_1(x)
x = self.linear_2(x)
return x
```
## Use pre-defined models
For your convenience, we have pre-defined some popular models such as *BERT*, *ViT* and *MLP-Mixer* in our Model Zoo; you can customize their scale according to your needs.

@ -5,52 +5,51 @@
We support multiple parallelization strategies in our library.
Hybrid parallelism in our codebase includes data parallelism, pipeline parallelism and tensor parallelism (
1D,2D, 2.5D, 3D). You can initialize the corresponding process group by setting `parallel` in our config. The parallel
1D, 2D, 2.5D, 3D). You can initialize the corresponding process group by setting `parallel` in our config. The parallel
configuration can be easily deployed by a dictionary in the configuration file. The configuration dictionary must obey the
following format. Data parallel size will be inferred automatically based on your inputs to pipeline parallelism and
tensor parallelism.
```python
parallel = dict(
pipeline=dict["size": int],
tensor=dict["size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any]
)
pipeline=dict(size=int),
tensor=dict(size=int, mode='1d' or '2d' or '2.5d' or '3d', kwargs=Any)
)
```
The name of the dictionary variable should be **parallel**. All the arguments, even **parallel** itself, are optional, and the data,
pipeline and tensor parallel sizes will be set to the default value 1. The value of data, pipeline and tensor can be an int
representing the size of the specific parallel dimension or a dictionary with a key called "size". The key "mode"
represents the way of model parallelism.
represents the way of tensor parallelism.
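For instance, the two value forms can be mixed freely; the sizes below are arbitrary:
```python
parallel = dict(
    pipeline=2,                      # plain int: just the size
    tensor=dict(size=4, mode='2d')   # dict form: size plus extra keys
)
```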
## Data Parallel
Data parallel is the most common way to distribute your training task by splitting data into several shards and training
on a single shard on each device. The configuration for data parallel is detected automatically and set for you. You do
not have to explicitly set them in your configurations. When data parallel size is larger than 1, Colossal-AI automatically
not have to explicitly set them in your configurations. When data parallel size is larger than 1, ColossalAI automatically
adds the distributed data sampler to the dataloader to shard the dataset.
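For example, launching the configuration below on 8 GPUs leaves a data parallel size of 8 / (2 × 2) = 2, so two model replicas are trained on different shards of the data; the numbers are illustrative:
```python
parallel = dict(
    pipeline=dict(size=2),
    tensor=dict(size=2, mode='1d')
)
# on 8 GPUs, data parallel size is inferred as 8 / (2 * 2) = 2
```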
## 1D, 2D, 2.5D and 3D Parallel
To enable hybrid parallelism, we provide an array of tensor parallelism methods. We provide the list of papers which match each
tensor parallel method. These parallel modes need to work with the distributed layers provided by Colossal-AI.
tensor parallel method. These parallel modes need to work with the distributed layers provided by ColossalAI.
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data,
model weights and layer outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$
devices where N is the number of tensor chunks in a single dimension.
devices where $N$ is the number of tensor chunks in a single dimension.
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
Inspired by the 2.5D matrix multi-plication algorithm, 2.5D parallel introduces a novel tensor parallelism which further
parallelizes 2D tensor parallelism. An amount of $P = N^2 d$ processors are arranged into d layers,
where each layer performs matrix multiplication operations independently with a dimension N.
Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism which further
parallelizes 2D tensor parallelism. An amount of $P = N^2 d$ processors are arranged into $d$ layers,
where each layer performs matrix multiplication operations independently with a dimension $N$.
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method achieves
the optimal, $O(P^{1/3})$ communication overhead on P processors, while both computation and memory usage are evenly distributed
the optimal, $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage are evenly distributed
through optimized load balancing of parameters as well as activations.
```python
# 1D parallel
parallel = dict(
@ -85,8 +84,8 @@ and the second layer to the second GPU. This example of course wastes the comput
the idea of pipeline parallelism.
As PyTorch is based on dynamic computation graph, the computation flow is not known until execution. To support pipeline
parallelism in PyTorch, you may need to add one more attribute in your model class which tells Colossal-AI the sequence
of execution. One example you can refer is `colossalai.nn.VanillaResNet`.
parallelism in PyTorch, you may need to add one more attribute, `layers_cfg` in your model class which tells ColossalAI
the sequence of execution. One example you can refer to is `colossalai.nn.model.VanillaResNet`.
```python
from colossalai.nn import BaseModel
@ -193,7 +192,7 @@ class VanillaResNet(BaseModel):
]
```
You can set the number of pipeline stages in your configuration file. When pipeline size is larger than 1, Colossal-AI
You can set the number of pipeline stages in your configuration file. When pipeline size is larger than 1, ColossalAI
will automatically create the pipeline schedule which defines the forward and backward step. You can specify how many microbatches
to run in each step in the `schedule` configuration.
@ -207,10 +206,10 @@ schedule = dict(
num_microbatches = 4 # set the number of microbatches per step
)
```
This feature is still in development and is only experimental for now.
## Sequence Parallel (experimental)
Sequence parallel is to support long-sequence modelling such as document-level text understanding and medical imaging.
This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
This feature is still in development is only experimental for now.
This feature is still in development and is only experimental for now.

@ -0,0 +1,202 @@
# Parallelization
## Configure the combination of parallelization techniques
ColossalAI supports multiple parallelization techniques, including data parallelism, tensor parallelism (1D, 2D, 2.5D, 3D), pipeline parallelism and sequence parallelism. You can initialize the process groups of the distributed system by setting the `parallel` dictionary variable in the config file; the `parallel` dictionary must follow the format below. The data parallel size is inferred automatically from the pipeline parallel size and the tensor parallel size in `parallel`.
```python
parallel = dict(
pipeline=dict(size=int),
tensor=dict(size=int, mode='1d' or '2d' or '2.5d' or '3d', kwargs=Any)
)
```
Note that the name of this dictionary variable must be **parallel**. All the arguments, including `parallel` itself, are optional; if your code does not provide this variable, every parallel size is set to the default value 1, i.e., no parallelization is used. The values of data, pipeline and tensor in `parallel` represent the sizes of data parallelism, pipeline parallelism and tensor parallelism respectively, while the value of mode represents the mode of tensor parallelism.
## Data parallel
Data parallelism is the most common parallelization technique: the data is split into several parts and each part is trained on a single device. ColossalAI detects the data parallel settings automatically and sets up the environment for you, so you do not need to set anything explicitly in your configuration. When the data parallel size is larger than 1, ColossalAI automatically adds a distributed data sampler to the dataloader so as to shard the dataset.
## 1D, 2D, 2.5D and 3D tensor parallel
To facilitate hybrid parallelism, we provide an array of tensor parallelism techniques; the paper corresponding to each of them is listed below. These tensor parallelism techniques need to work with the distributed layers provided by ColossalAI.
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2D tensor parallelism relies on the SUMMA matrix multiplication algorithm and splits the input data along two different dimensions. The split tensors are distributed over a 2D mesh, using $P = N^2$ devices in total, where $N$ is the number of split tensors along a single dimension.
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallelism further splits the partitioning of 2D tensor parallelism: it arranges $P = N^2 d$ processors into $d$ layers, and the matrix multiplication is accordingly split into $d$ parts performed on different layers.
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
We also introduce a 3D tensor parallelism technique, which parallelizes the neural network parameters over a 3D processor cube. With $P$ processors, this technique achieves the optimal performance at a communication overhead of $O(P^{1/3})$, while both the computation and the memory usage are evenly distributed over the $P$ processors.
Examples of the `parallel` dictionary for the above tensor parallel modes are given in the code below.
```python
# 1D parallel
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=4, mode='1d')
)
# 2D parallel
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=4, mode='2d')
)
# 2.5D parallel
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=8, mode='2.5d', depth=2)
)
# 3D parallel
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=8, mode='3d')
)
```
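As a quick sanity check on the sizes used above (numbers illustrative): in the 2.5D example, $P = N^2 d = 8$ with $d = 2$ gives $N = 2$; in the 3D example, $P = 8 = 2^3$ processors form a $2 \times 2 \times 2$ cube.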
## Pipeline parallel (under development)
Pipeline parallelism splits a deep learning model into several parts by layer. For example, for a simple model consisting of two linear layers and two available GPUs, we can assign the work of the first linear layer to one GPU and the work of the second linear layer to the other. Of course, this example is only meant to explain how pipeline parallelism works and has no practical value.
As PyTorch computation is based on the dynamic computation graph, the computation flow cannot be determined before execution. To support pipeline parallelism in PyTorch, you need to add an extra attribute, `layers_cfg`, to your model class so that ColossalAI knows the exact computation flow. `colossalai.nn.VanillaResNet` gives an example that you can refer to.
```python
from colossalai.nn import BaseModel
import torch
class VanillaResNet(BaseModel):
def __init__(
self,
num_cls: int,
block_type: str,
layers: List[int],
norm_layer_type: str = 'BatchNorm2d',
in_channels: int = 3,
groups: int = 1,
width_per_group: int = 64,
zero_init_residual: bool = False,
replace_stride_with_dilation: Optional[List[bool]] = None,
dilations=(1, 1, 1, 1)
) -> None:
super().__init__()
... # some model params
self.layers_cfg = [
# conv1
dict(type='Conv2d',
in_channels=in_channels,
out_channels=self.inplanes,
kernel_size=7,
stride=2,
padding=3,
bias=False),
# bn1
dict(
type=norm_layer_type,
num_features=self.inplanes
),
# relu
dict(
type='ReLU',
inplace=True
),
# maxpool
dict(
type='MaxPool2d',
kernel_size=3,
stride=2,
padding=1
),
# layer 1
dict(
inplanes=self.inplanes,
planes=64,
blocks=self.blocks[0],
dilation=self.dilations[0],
**self.reslayer_common_cfg
),
# layer 2
dict(
inplanes=64 * self.block_expansion,
planes=128,
blocks=self.blocks[1],
stride=2,
dilate=replace_stride_with_dilation[0],
dilation=self.dilations[1],
**self.reslayer_common_cfg
),
# layer 3
dict(
inplanes=128 * self.block_expansion,
planes=256,
blocks=layers[2],
stride=2,
dilate=replace_stride_with_dilation[1],
dilation=self.dilations[2],
**self.reslayer_common_cfg
),
# layer 4
dict(
inplanes=256 * self.block_expansion,
planes=512,
blocks=layers[3], stride=2,
dilate=replace_stride_with_dilation[2],
dilation=self.dilations[3],
**self.reslayer_common_cfg
),
# avg pool
dict(
type='AdaptiveAvgPool2d',
output_size=(1, 1)
),
# flatten
dict(
type='LambdaWrapper',
func=lambda mod, x: torch.flatten(x, 1)
),
# linear
dict(
type='Linear',
in_features=512 * self.block_expansion,
out_features=num_cls
)
]
```
You can manually set the number of pipeline stages in the config file. When the number of pipeline stages is larger than 1, ColossalAI automatically creates the pipeline schedule that defines the forward and backward passes. You can also use the `schedule` dictionary variable in the config file to define the number of microbatches trained in each step. The code below gives an example of configuring pipeline parallelism.
```python
parallel = dict(
pipeline=dict(size=1), # number of pipeline stages
tensor=dict(size=1, mode=None)
)
schedule = dict(
num_microbatches = 4 # set the number of microbatches per step
)
```
This parallelization technique is still under experimental development.
## Sequence parallel (under development)
Sequence parallelism supports modelling long-sequence data such as document-level text understanding and medical image analysis. This technique is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
This parallelization technique is still under experimental development.

@ -19,41 +19,52 @@ where `HOST` is the IP address of your system. Note that we use
the [Slurm](https://slurm.schedmd.com/documentation.html) job scheduling system here.
```bash
HOST=xxx.xxx.xxx.xxx srun ./scripts/slurm_dist_train.sh ./example/train_vit_2d.py ./configs/vit/vit_2d.py
HOST=xxx.xxx.xxx.xxx srun ./scripts/slurm_dist_train.sh ./examples/run_trainer.py ./configs/vit/vit_2d.py
```
`./configs/vit/vit_2d.py` is a config file, which is introduced in the [Config file](config.md) section below. These
config files are used by ColossalAI to define all kinds of training arguments, such as the model, dataset and training
method (optimizer, lr_scheduler, epoch, etc.). Config files are highly customizable and can be modified so as to train
different models.
`./example/run_trainer.py` contains a standard training script and is presented below, it reads the config file and
`./examples/run_trainer.py` contains a standard training script and is presented below; it reads the config file and
realizes the training process.
```python
import colossalai
from colossalai.core import global_context as gpc
from colossalai.engine import Engine
from colossalai.logging import get_global_dist_logger
from colossalai.trainer import Trainer
from colossalai.core import global_context as gpc
model, train_dataloader, test_dataloader, criterion, optimizer, schedule, lr_scheduler = colossalai.initialize()
engine = Engine(
model=model,
criterion=criterion,
optimizer=optimizer,
lr_scheduler=lr_scheduler,
schedule=schedule
)
trainer = Trainer(engine=engine,
hooks_cfg=gpc.config.hooks,
verbose=True)
trainer.fit(
train_dataloader=train_dataloader,
test_dataloader=test_dataloader,
max_epochs=gpc.config.num_epochs,
display_progress=True,
test_interval=5
)
def run_trainer():
model, train_dataloader, test_dataloader, criterion, optimizer, schedule, lr_scheduler = colossalai.initialize()
logger = get_global_dist_logger()
schedule.data_sync = False
engine = Engine(
model=model,
criterion=criterion,
optimizer=optimizer,
lr_scheduler=lr_scheduler,
schedule=schedule
)
logger.info("engine is built", ranks=[0])
trainer = Trainer(engine=engine,
hooks_cfg=gpc.config.hooks,
verbose=True)
logger.info("trainer is built", ranks=[0])
logger.info("start training", ranks=[0])
trainer.fit(
train_dataloader=train_dataloader,
test_dataloader=test_dataloader,
max_epochs=gpc.config.num_epochs,
display_progress=True,
test_interval=2
)
if __name__ == '__main__':
run_trainer()
```
Alternatively, the `model` variable can be substituted with a self-defined model or a pre-defined model in our Model
@ -62,7 +73,7 @@ Zoo. The detailed substitution process is elaborated [here](model.md).
## Features
ColossalAI provides a collection of parallel training components for you. We aim to support you with your development of
distributed deep learning models just like how you write single-GPU deeo learning models. We provide friendly tools to
distributed deep learning models just like how you write single-GPU deep learning models. We provide friendly tools to
kickstart distributed training in a few lines.
- [Data Parallelism](parallelization.md)

@ -0,0 +1,75 @@
# Quick start
ColossalAI is a large-scale deep learning framework that incorporates efficient parallelization techniques. It can use these techniques to accelerate model training on distributed systems with multiple GPUs, and it also runs on non-distributed systems with a GPU. Below is the quick start guide for ColossalAI.
## Single GPU system
When training a model on a non-distributed system with a GPU, ColossalAI reaches the current baseline efficiency. [Here](https://colab.research.google.com/drive/1fJnqqFzPuzZ_kn1lwCpG2nh3l2ths0KE?usp=sharing#scrollTo=cQ_y7lBG09LS) we give a Google Colab example that shows how to use ColossalAI to train a LeNet model on the CIFAR10 dataset on a non-distributed system.
## Multi-GPU system
When training deep learning models on a distributed system with multiple GPUs, ColossalAI can use its efficient parallelization techniques to significantly accelerate the training process; these techniques are detailed in the [Parallelization](parallelization.md) section below. The code below trains a ViT model on a distributed system with four GPUs, where `HOST` is the IP address of your distributed system. Note that the code below uses the [Slurm](https://slurm.schedmd.com/documentation.html) job scheduling system.
```bash
HOST=xxx.xxx.xxx.xxx srun ./scripts/slurm_dist_train.sh ./examples/run_trainer.py ./configs/vit/vit_2d.py
```
`./configs/vit/vit_2d.py` is a [config file](config.md). ColossalAI uses config files to define the arguments needed during training, such as the model type, the dataset, the optimizer and the learning rate scheduler. You can train different models by writing config files. `./examples/run_trainer.py` is a standard training script; its full code is attached below. The script reads the training arguments from the config file and trains the model.
```python
import colossalai
from colossalai.core import global_context as gpc
from colossalai.engine import Engine
from colossalai.logging import get_global_dist_logger
from colossalai.trainer import Trainer
def run_trainer():
model, train_dataloader, test_dataloader, criterion, optimizer, schedule, lr_scheduler = colossalai.initialize()
logger = get_global_dist_logger()
schedule.data_sync = False
engine = Engine(
model=model,
criterion=criterion,
optimizer=optimizer,
lr_scheduler=lr_scheduler,
schedule=schedule
)
logger.info("engine is built", ranks=[0])
trainer = Trainer(engine=engine,
hooks_cfg=gpc.config.hooks,
verbose=True)
logger.info("trainer is built", ranks=[0])
logger.info("start training", ranks=[0])
trainer.fit(
train_dataloader=train_dataloader,
test_dataloader=test_dataloader,
max_epochs=gpc.config.num_epochs,
display_progress=True,
test_interval=2
)
if __name__ == '__main__':
run_trainer()
```
The `model` variable in the code above can be substituted with a self-defined model or a pre-defined model in our `Model Zoo` so as to train different models; [here](model.md) explains how to perform such a substitution in detail.
## Features
ColossalAI provides a collection of parallel components to accelerate your model training; they are introduced in the sections below. Our goal is to make the development of distributed deep learning models as convenient as the development of single-GPU models.
- [Data parallelism](parallelization.md)
- [1D, 2D, 2.5D, 3D tensor parallelism and sequence parallelism](parallelization.md)
- [Pipeline parallelism](parallelization.md)
- [Trainer and engine](trainer_engine.md)
- [Customize your parallel mode](add_your_parallel.md)
- [Mixed precision training](amp.md)
- [ZeRO optimizer](zero.md)

@ -2,7 +2,9 @@
## Build your engine
To better understand the function of `Engine` class, you should know the conception of the process function in common engines. The process function usually controls the behavior over a batch of a dataset, `Engine` class just controls the process function. For example, common process function looks like this:
To better understand how the `Engine` class works, let's start from the concept of the process function in common engines. The process function
usually controls the behavior over a batch of a dataset, and the `Engine` class just controls the process function. Here we give a standard process
function in the following code block.
```python
def process_function(dataloader, model, criterion, optim):
@ -14,9 +16,13 @@ def process_function(dataloader, model, criterion, optim):
optim.step()
```
In `ignite.engine` or `keras.engine`, the process function is always provided by users. However, it is hard for users to write their own functions for pipeline parallelism. Aiming at accessible hybrid parallelism for users, we provide powerful `Engine` class. It enables pipeline parallelism and offers 1F1B non-interleaving strategy. Also, you can use pre-defined learning rate scheduler in your `Engine` to adjust learning rate during training.
In `ignite.engine` or `keras.engine`, the process function is always provided by users. However, it is tricky for users to write their own process
functions for pipeline parallelism. Aiming at offering accessible hybrid parallelism for users, we provide the powerful `Engine` class. This class
enables pipeline parallelism and offers the one-forward-one-backward non-interleaving strategy. Also, you can use a pre-defined learning rate scheduler
in the `Engine` class to adjust the learning rate during training.
In order to build your engine, just set model, criterion, optimizer, learning rate scheduler and schedule. Consider the following code as an example.
In order to build your engine, just set variables `model`, `criterion`, `optimizer`, `lr_scheduler` and `schedule`. The following code block provides
an example.
```python
import torch
@ -24,7 +30,6 @@ import torch.nn as nn
import torchvision.models as models
import colossalai
model = models.resnet18()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model)
@ -40,23 +45,27 @@ MyEngine = Engine(
)
```
More information is in API reference.
More information regarding the class can be found in the API references.
## Customize your trainer
### Overview
Before starting to learn how to customize a trainer meeting your need, you should have a basic understanding about the function of `Trainer`. We recommend you to read *Get Started* section and *Build your engine* first.
To learn how to customize a trainer which meets your needs, let's first take a look at the `Trainer` class. We highly recommend that you read the *Get Started*
section and *Build your engine* first.
Trainer class tends to enable researchers and engineers to use our framework more conveniently, instead of writing their own scripts, we provide `Trainer` class and you can simply construct it with your own `Engine` by calling `MyTrainer = Trainer(MyEngine)`. Then use method `fit` to train or evaluate your model. In order to make our `Trainer` class more powerful, we add some useful features to it, such as monitor or record running states and metrics which indicate model's performance, or save after a training epoch.
The `Trainer` class enables researchers and engineers to use our framework more conveniently. Instead of having to write your own scripts, you can simply
construct your own trainer by calling the `Trainer` class, just like what we did in the following code block.
To accomplish that, specific actions must be added to the training or evaluation. `BaseHook` class allow you to add desired actions in specific time points. We have already created practical hooks for those useful features. What you need to do is just picking the hooks you want.
More detailed class descriptions can be found in API reference.
```python
MyTrainer = Trainer(MyEngine)
```
### Example
After that, you can use the `fit` method to train or evaluate your model. In order to make our `Trainer` class even more powerful, we have incorporated a set of
handy tools into the class. For example, you can monitor or record the running states and metrics which indicate the current performance of the model. These
functions are realized by hooks. The `BaseHook` class allows you to execute your hook functions at specified time points. We have already created some practical
hooks for you, as listed below. What you need to do is just pick the right ones which suit your needs. Detailed descriptions of the class can be found
in the API references.
```python
hooks = [
@ -65,26 +74,24 @@ hooks = [
dict(type='LogMemoryByEpochHook'),
dict(type='AccuracyHook'),
dict(type='LossHook'),
# dict(type='TensorboardHook', log_dir='./tfb_logs'),
# dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
# dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
dict(type='TensorboardHook', log_dir='./tfb_logs'),
dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
]
```
Above hooks will record metrics, used time and memory usage to log every epoch. Also it prints loss and accuracy to let users monitor the performance of the model.
These hook functions will record metrics, the elapsed time and the memory usage, and write them to a log after each epoch. Besides, they print the current loss and
accuracy to let users monitor the performance of the model.
### Hook
You can extend our `BaseHook` class. Hooks can be called at twelve time points. More detailed information can be found in API reference.
Or extend from `MetricHook` to write a metric collector. You should also use the decorator `@HOOKS.register_module` for your own hook class, and import it in your main python script.
For `after_train_iter()`, it receives the output of engine per iteration, which is a list including output, label and loss.
Note that you can define the priority to arrange the execution order of all hooks.
If you have specific needs, feel free to extend our `BaseHook` class to add your own functions, or our `MetricHook` class to write a metric collector.
These hook functions can be called at twelve points in time. Besides, you can define the priorities of all hooks to arrange their execution order.
More information can be found in the API references.
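Putting the pieces above together, a custom hook might look like the sketch below. The decorator and the `after_train_iter()` hook point follow the description above, but the import path of `BaseHook` and the exact signature are assumptions; check the API references for the real interface.
```python
from colossalai.registry import HOOKS
from colossalai.trainer import BaseHook  # import path assumed


@HOOKS.register_module
class LossPrinterHook(BaseHook):
    def after_train_iter(self, output, label, loss):
        # the engine hands each iteration's output, label and loss to the hook
        print(f'iteration loss: {float(loss):.4f}')
```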
### Metric
You can write your own metric by extending `Metric` class. It is always used with `MetricHook`. If you write your own metric hooks, please set the priority carefully and make sure is called before other hooks which may use the results of metrics.
You can write your own metrics by extending our `Metric` class. It should be used with the `MetricHook` class. When you write your own metric hooks, please set
the priority carefully and make sure the hook is called before other hooks which might require the results of the metric hook.
We've already provided some metric hooks. We store metric objects in `runner.states['metrics']`. It is a dictionary and you can use the name of the metric to access it.
We've already provided some metric hooks and we store metric objects in `runner.states['metrics']`. It is a dictionary and metrics can be accessed by their names.
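For example, a metric registered under the name `'Loss'` (the name is only an illustration) can be read back like this:
```python
# metrics are stored by name in a dictionary on the runner's state
loss_metric = runner.states['metrics']['Loss']
```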

@ -0,0 +1,85 @@
# Engine and trainer
## Engine
To better understand how our `Engine` class works, we first need to understand the concept of the process function in common engines. The process function controls the behavior over one batch of a dataset, and the `Engine` class controls exactly this process function. We give an example of a standard process function in the code block below.
```python
def process_function(dataloader, model, criterion, optim):
optim.zero_grad()
data, label = next(dataloader)
output = model(data)
loss = criterion(output, label)
loss.backward()
optim.step()
```
In `ignite.engine` and `keras.engine`, the process function needs to be provided by the user. However, it is hard for users to write a process function for pipeline parallelism. To offer users convenient hybrid parallelism, we provide the powerful `Engine` class, which supports pipeline parallelism and offers the one-forward-one-backward non-interleaving strategy. Meanwhile, our `Engine` class can use a pre-defined learning rate scheduler to adjust the learning rate during training.
To build an engine, you only need to define the variables `model`, `criterion`, `optimizer`, `lr_scheduler` and `schedule`; the code block below gives such an example.
```python
import torch
import torch.nn as nn
import torchvision.models as models
import colossalai
model = models.resnet18()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model)
lr_scheduler = colossalai.nn.lr_scheduler.CosineAnnealingLR(optimizer, 1000)
schedule = colossalai.engine.schedule.NoPipelineSchedule()
MyEngine = Engine(
model=model,
criterion=criterion,
optimizer=optimizer,
lr_scheduler=lr_scheduler,
schedule=schedule
)
```
More information about this class can be found in the API references.
## Trainer
To learn how to customize a trainer that meets your needs, you first need to understand our `Trainer` class.
The `Trainer` class is designed to let researchers and engineers use our framework more conveniently. Instead of writing your own scripts, you can simply call the `Trainer` class to construct your trainer, as is done in the code block below.
```python
MyTrainer = Trainer(MyEngine)
```
After that, you can use the `fit` method to train or evaluate your model. Besides, in order to make our `Trainer` class even more powerful, we have added a series of handy tools. For example, you can continuously monitor and record the running states and the performance of the model during training. These functions are realized through hook functions. The `BaseHook` class lets you run the hook functions we provide, or the hook functions you define yourself, at the time points you specify. We have defined some practical hook functions for you, listed in the code block below; all you need to do is pick the hook functions that meet your needs. More information about this class can be found in the API references.
```python
hooks = [
dict(type='LogMetricByEpochHook'),
dict(type='LogTimingByEpochHook'),
dict(type='LogMemoryByEpochHook'),
dict(type='AccuracyHook'),
dict(type='LossHook'),
dict(type='TensorboardHook', log_dir='./tfb_logs'),
dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
]
```
The hook functions above can record the metrics, the training time and the memory usage, and write them to a log after each epoch. Besides, these hook functions print the current loss and accuracy in real time so that users can monitor the performance of the model.
### Hook
If you have specific needs, you can inherit our `BaseHook` class and add your own functions, or inherit our `MetricHook` to write the metrics you need. These hook functions can be executed at twelve different points in time. More information about this class can be found in the API references.
### Metric
You can provide the metrics you need by inheriting our `Metric` class, which needs to be used together with the `MetricHook` class. When you write your own metric hook functions, please set the priority carefully to make sure your hook is called before those hook functions which require the metric results.
We have already defined some metric hook functions for you; the metric objects are stored in `runner.states['metrics']` for your reference.

@ -1,18 +1,25 @@
# Zero Redundancy Optimizer and Zero Offload
# Zero Redundancy Optimizer and ZeRO offload
The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency.
The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning three
model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them.
By doing so, memory efficiency is boosted drastically compared to classic data parallelism while the computational granularity
and communication efficiency are retained.
1. **ZeRO Level 1**: The optimizer states (e.g., for [Adam optimizer](https://arxiv.org/abs/1412.6980), 32-bit weights, and the first, and second moment estimates) are partitioned across the processes, so that each process updates only its partition.
2. **ZeRO Level 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states.
3. **ZeRO Level 3**: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes.
1. **ZeRO Level 1**: The optimizer states (e.g., for [Adam optimizer](https://arxiv.org/abs/1412.6980), 32-bit weights, and the
first and second moment estimates) are partitioned across the processes, so that each process updates only its partition.
2. **ZeRO Level 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process
only stores the gradients corresponding to its partition of the optimizer states.
3. **ZeRO Level 3**: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and
partition them during the forward and backward passes.
## Getting Started
## Getting Started with ZeRO
Once you are training with ColossalAI, enabling ZeRO-3 offload is as simple as enabling it in your ColossalAI configuration! Below are a few examples of ZeRO-3 configurations.
If you are training models with ColossalAI, enabling ZeRO-3 offload is as simple as enabling it in your ColossalAI configuration!
Below are a few examples of ZeRO-3 configurations.
### Example ZeRO-3 Configurations
### Example of ZeRO-3 Configurations
Here we use ``Adam`` as the initial optimizer.
Here we use `Adam` as the initial optimizer.
1. Use ZeRO to partition the optimizer states (level 1), gradients (level 2), and parameters (level 3).
```python
@ -74,8 +81,8 @@ Here we use ``Adam`` as the initial optimizer.
)
```
Note that ``fp16`` is automatically enabled when using ZeRO.
Note that `fp16` is automatically enabled when using ZeRO.
### Training
Once you complete your configuration, just use `colossalai.initialize()` to initialize your training. All you need to do is to write your configuration.
Once you have completed your configuration, just use `colossalai.initialize()` to initialize your training.

@ -0,0 +1,84 @@
# ZeRO optimizer and offload
The ZeRO optimizer partitions the three model states (optimizer states, gradients and parameters) and stores them in different processes, thereby removing the memory redundancies of data parallelism, where classic data parallelism keeps a copy of all three states in every process. Compared with the classic approach, the ZeRO optimizer improves memory efficiency drastically while retaining good communication efficiency.
1. **ZeRO Level 1**: The optimizer states (e.g., for the [Adam optimizer](https://arxiv.org/abs/1412.6980), the 32-bit parameters and the estimates of the first and second moments) are partitioned and stored in different processes, so that each process only needs to update its own partition of the parameters.
2. **ZeRO Level 2**: The 32-bit gradients used for updating the model parameters are partitioned and stored in different processes at this level. The partition of the gradients corresponds one-to-one with the partition of the model parameters in level 1: the gradients on each process are exactly those used to update the model parameters stored on that process.
3. **ZeRO Level 3**: The 16-bit model parameters are partitioned and stored in different processes at this level; ZeRO-3 automatically collects and partitions these parameters during the forward and backward passes.
## Use the ZeRO optimizer
To enable the ZeRO optimizer in ColossalAI, you only need to configure it in the config file. Below are some config file examples that use ZeRO-3.
### Use the ZeRO optimizer and offload
Here we use `Adam` as the initial optimizer.
1. Use ZeRO to partition the optimizer states (level 1), the gradients (level 2) and the model parameters (level 3):
```python
optimizer = dict(
type='Adam',
lr=0.001,
weight_decay=0
)
zero = dict(
type='ZeroRedundancyOptimizer_Level_3',
dynamic_loss_scale=True,
clip_grad=1.0
)
```
2. Offload the optimizer states and the computation to the CPU:
```python
zero = dict(
offload_optimizer_config=dict(
device='cpu',
pin_memory=True,
fast_init=True
),
...
)
```
3. Offload the model parameters to the CPU to save GPU memory:
```python
zero = dict(
offload_optimizer_config=dict(
device='cpu',
pin_memory=True,
fast_init=True
),
offload_param_config=dict(
device='cpu',
pin_memory=True,
max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU
),
...
)
```
4. Offload the parameters to NVMe to save even more memory (if NVMe is available on your system):
```python
zero = dict(
offload_optimizer_config=dict(
device='nvme',
pin_memory=True,
fast_init=True,
nvme_path='/nvme_data'
),
offload_param_config=dict(
device='nvme',
pin_memory=True,
max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU,
nvme_path='/nvme_data'
),
...
)
```
Note that `fp16` is automatically enabled when using ZeRO.
### Train with the ZeRO optimizer
Once you have completed the configuration above, you can run `colossalai.initialize()` to start your training.