diff --git a/README.md b/README.md
index f04f7776b..8b4b81fcb 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,7 @@
# Colossal-AI
+

+
@@ -33,7 +35,6 @@ Install and enable CUDA kernel fusion (compulsory installation when using fused
pip install -v --no-cache-dir --global-option="--cuda_ext" .
```
-
## Use Docker
Run the following command to build a docker image from the Dockerfile provided.
@@ -71,18 +72,18 @@ colossalai.launch(
)
# build your model
-model = ...
+model = ...
-# build you dataset, the dataloader will have distributed data
+# build your dataset, the dataloader will have distributed data
# sampler by default
-train_dataset = ...
+train_dataset = ...
train_dataloader = get_dataloader(dataset=train_dataset,
shuffle=True
)
-# build your
-optimizer = ...
+# build your optimizer
+optimizer = ...
# build your loss function
criterion = ...
@@ -137,13 +138,15 @@ Colossal-AI provides a collection of parallel training components for you. We ai
distributed deep learning models just like how you write your single-GPU model. We provide friendly tools to kickstart
distributed training in a few lines.
-- [Data Parallelism](./docs/parallelization.md)
-- [Pipeline Parallelism](./docs/parallelization.md)
-- [1D, 2D, 2.5D, 3D and sequence parallelism](./docs/parallelization.md)
-- [Friendly trainer and engine](./docs/trainer_engine.md)
-- [Extensible for new parallelism](./docs/add_your_parallel.md)
-- [Mixed Precision Training](./docs/amp.md)
-- [Zero Redundancy Optimizer (ZeRO)](./docs/zero.md)
+- Data Parallelism
+- Pipeline Parallelism
+- 1D, 2D, 2.5D, 3D and sequence parallelism
+- Friendly trainer and engine
+- Extensible for new parallelism
+- Mixed Precision Training
+- Zero Redundancy Optimizer (ZeRO)
+
+Please visit our [documentation and tutorials](https://www.colossalai.org/) for more details.
## Cite Us
diff --git a/docs/add_your_parallel.md b/docs/add_your_parallel.md
deleted file mode 100644
index 6a8fe1ed7..000000000
--- a/docs/add_your_parallel.md
+++ /dev/null
@@ -1,113 +0,0 @@
-# Add your own parallelism
-
-## Overview
-
-To enable researchers and engineers to extend our system to other novel large-scale distributed training algorithms
-with less effort, we have decoupled the various components of the training lifecycle. You can implement your own
-parallelism by simply inheriting from the base class.
-
-The main components are:
-
-1. `ProcessGroupInitializer`
-2. `GradientHandler`
-3. `Schedule`
-
-## Process Group Initializer
-
-Parallelism is often managed by process groups, where processes involved in the same parallel algorithm are placed in the same
-process group. Different parallel algorithms require different process groups. Colossal-AI provides a
-global context for users to easily manage their process groups. If you wish to add a new process group, you can
-define a new class and set it in your configuration file. To define your own way of creating process groups, you can
-follow the steps below to create a new distributed initialization.
-
-1. Add your parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
- ```python
- class ParallelMode(Enum):
- GLOBAL = 'global'
- DATA = 'data'
- PIPELINE = 'pipe'
- ...
-
- NEW_MODE = 'new_mode' # define your mode here
- ```
-
-2. Create a `ProcessGroupInitializer`. You can refer to the examples given in `colossalai.context.dist_group_initializer`. The
-   first six arguments are fixed; `ParallelContext` will pass them in for you. If you need to set other
-   arguments, you can append them like the `arg1, arg2` in the example below. Lastly, register your initializer to the
-   registry by adding the decorator `@DIST_GROUP_INITIALIZER.register_module`.
- ```python
- # sample initializer class
- @DIST_GROUP_INITIALIZER.register_module
- class MyParallelInitializer(ProcessGroupInitializer):
-
- def __init__(self,
- rank: int,
- world_size: int,
- config: Config,
- data_parallel_size: int,
-                 pipeline_parallel_size: int,
- tensor_parallel_size: int,
- arg1,
- arg2):
- super().__init__(rank, world_size, config)
- self.arg1 = arg1
- self.arg2 = arg2
- # ... your variable init
-
- def init_parallel_groups(self):
- # initialize your process groups
- pass
-
- ```
-
-   Then, you can insert your new initializer into the current mode-to-initializer mapping
-   in `colossalai.constants.INITIALIZER_MAPPING`. You can modify the file or insert a new key-value pair dynamically.
-
- ```python
- colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
- ```
-
-3. Set your initializer in your config file. You can pass in your own arguments if there are any. This allows
-   the `ParallelContext` to create your initializer and initialize your desired process groups.
-
- ```python
- parallel = dict(
- pipeline=dict(size=1),
- tensor=dict(size=x, mode='new_mode') # this is where you enable your new parallel mode
- )
- ```
-
-## Gradient Handler
-
-Gradient handlers are objects that execute all-reduce operations on parameters' gradients. As different all-reduce
-strategies may be required for different kinds of parallelism, users can
-inherit `colossalai.engine.gradient_handler.BaseGradientHandler` to implement their own strategies. Currently, the library
-uses the plain data parallel gradient handler, which all-reduces the gradients across data parallel ranks. The data
-parallel gradient handler is added to the engine automatically if data parallelism is detected. You can add your own
-gradient handler as below:
-
-```python
-import torch.distributed as dist
-
-from colossalai.context import ParallelMode
-from colossalai.core import global_context as gpc
-from colossalai.engine import BaseGradientHandler
-from colossalai.registry import GRADIENT_HANDLER
-
-@GRADIENT_HANDLER.register_module
-class YourGradientHandler(BaseGradientHandler):
-
-    def handle_gradient(self):
-        # run your all-reduce strategy here; as a sketch, all-reduce the
-        # gradients across the data parallel group (the base handler keeps
-        # the wrapped model as self._model)
-        for param in self._model.parameters():
-            if param.grad is not None:
-                dist.all_reduce(param.grad, group=gpc.get_group(ParallelMode.DATA))
-```
-
-Afterwards, you can specify the gradient handler you want to use in your configuration file.
-
-```python
-gradient_handlers = [
- dict(type='YourGradientHandler'),
-]
-```
-
-## Schedule
-
-A schedule defines how the forward and backward passes are executed. Colossal-AI currently provides pipeline and non-pipeline
-schedules. If you want to modify how the forward and backward passes are executed, you can
-inherit `colossalai.engine.schedule.BaseSchedule` and implement the `forward_backward_step` function.
\ No newline at end of file
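-
-A minimal sketch of such a subclass is shown below, assuming a simple non-pipeline setting; the exact
-signature of `forward_backward_step` and the `load_batch` helper may differ between versions, so treat
-this as illustrative rather than the definitive API.
-
-```python
-from colossalai.engine.schedule import BaseSchedule
-
-
-class MySchedule(BaseSchedule):
-
-    def forward_backward_step(self, engine, data_iter, forward_only=False, return_loss=True):
-        # fetch one batch (load_batch is assumed to be inherited from BaseSchedule)
-        data, label = self.load_batch(data_iter)
-        # plain forward pass through the engine-wrapped model
-        output = engine(data)
-        loss = engine.criterion(output, label) if return_loss else None
-        if not forward_only:
-            # let the engine run backward so AMP/ZeRO hooks still apply
-            engine.backward(loss)
-        return output, label, loss
-```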
diff --git a/docs/add_your_parallel_zh.md b/docs/add_your_parallel_zh.md
deleted file mode 100644
index b4625e465..000000000
--- a/docs/add_your_parallel_zh.md
+++ /dev/null
@@ -1,92 +0,0 @@
-# Adding a new parallelism technique
-
-To make it easier for researchers and engineers to extend our system to new large-scale distributed training algorithms, we have decoupled the components of the training process; you can implement a new parallelism technique by inheriting from the base classes.
-
-The main components are:
-
-1. `ProcessGroupInitializer`
-2. `GradientHandler`
-3. `Schedule`
-
-## Process Group Initializer
-
-Parallelism is generally managed through process groups: processes that belong to the same parallel algorithm are placed in the same process group, and if several different parallelism techniques coexist in the system, multiple process groups need to be created. Colossal-AI provides a global context variable for users to manage their process groups conveniently. If you want to add a new process group, you can define a new class and set it in your configuration file. The code blocks below show how to add a new parallelism technique to the system and how to initialize it.
-
-1. Add a new parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
-```python
-class ParallelMode(Enum):
- GLOBAL = 'global'
- DATA = 'data'
- PIPELINE = 'pipe'
- ...
-
- NEW_MODE = 'new_mode' # define your mode here
-```
-
-2. Create a subclass of `ProcessGroupInitializer`; you can refer to the examples given in `colossalai.context.dist_group_initializer`. The first six arguments are determined by `ParallelContext`. If you need to set new arguments, replace `arg1` and `arg2` in the example below with your own. Finally, register your initializer in our registry with the `@DIST_GROUP_INITIALIZER.register_module` decorator.
-```python
-# sample initializer class
-@DIST_GROUP_INITIALIZER.register_module
-class MyParallelInitializer(ProcessGroupInitializer):
-
- def __init__(self,
- rank: int,
- world_size: int,
- config: Config,
- data_parallel_size: int,
-                 pipeline_parallel_size: int,
- tensor_parallel_size: int,
- arg1,
- arg2):
- super().__init__(rank, world_size, config)
- self.arg1 = arg1
- self.arg2 = arg2
- # ... your variable init
-
- def init_parallel_groups(self):
- # initialize your process groups
- pass
-```
-
-After that, you can insert your initializer into the current mode-to-initializer mapping `colossalai.constants.INITIALIZER_MAPPING`; you can also modify that file to change the mapping between names and parallel modes dynamically.
-
-```python
-colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
-```
-
-3. Set your initializer in the configuration file. If your initializer requires arguments, you can pass them in yourself. The code below allows `ParallelContext` to create your initializer and initialize the process groups you need.
-
-```python
-parallel = dict(
- pipeline=dict(size=1),
- tensor=dict(size=x, mode='new_mode') # this is where you enable your new parallel mode
-)
-```
-
-## Gradient Handler
-
-A gradient handler performs all-reduce operations on the gradients of model parameters. Since different parallelism techniques may require different all-reduce operations, users can inherit `colossalai.engine.gradient_handler.BaseGradientHandler` to implement their own behavior. Currently, Colossal-AI uses the plain data parallel gradient handler, which performs all-reduce over all data parallel ranks; it is created automatically when Colossal-AI detects that data parallelism is in use. You can add your own gradient handler with the code below:
-
-```python
-import torch.distributed as dist
-
-from colossalai.context import ParallelMode
-from colossalai.core import global_context as gpc
-from colossalai.engine import BaseGradientHandler
-from colossalai.registry import GRADIENT_HANDLER
-
-@GRADIENT_HANDLER.register_module
-class YourGradientHandler(BaseGradientHandler):
-
-    def handle_gradient(self):
-        # run your all-reduce strategy here; as a sketch, all-reduce the
-        # gradients across the data parallel group (the base handler keeps
-        # the wrapped model as self._model)
-        for param in self._model.parameters():
-            if param.grad is not None:
-                dist.all_reduce(param.grad, group=gpc.get_group(ParallelMode.DATA))
-```
-
-After that, you can specify the gradient handler you want to use in the configuration file.
-
-```python
-gradient_handlers = [
- dict(type='YourGradientHandler'),
-]
-```
-
-## Schedule
-
-A schedule specifies which operations are executed during the forward and backward passes. Colossal-AI provides pipeline and non-pipeline schedules. If you want to modify how the forward and backward passes are executed, you can inherit `colossalai.engine.BaseSchedule` and implement the operations you want. You can also add your schedule to our engine before training the model.
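-
-For example, the pipeline schedule is selected through the configuration file; below is a minimal sketch,
-assuming pipeline parallelism is enabled, where `num_microbatches` sets the number of micro-batches per step:
-
-```python
-# select the pipeline schedule by declaring micro-batches in the config file
-schedule = dict(
-    num_microbatches=8
-)
-```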
diff --git a/docs/amp.md b/docs/amp.md
deleted file mode 100644
index 40892f750..000000000
--- a/docs/amp.md
+++ /dev/null
@@ -1,102 +0,0 @@
-# Mixed precision training
-
-In Colossal-AI, we have incorporated different implementations of mixed precision training:
-1. torch.cuda.amp
-2. apex.amp
-3. naive amp
-
-The first two rely on the original implementation of [PyTorch](https://pytorch.org/docs/stable/amp.html)
-(version 1.6 and above) and [Nvidia Apex](https://github.com/NVIDIA/apex). The last method is similar to Apex's O2 level.
-
-Among these methods, apex.amp is not compatible with tensor parallelism. This is because tensors are split across devices
-in tensor parallelism; thus, communication among different processes is required to check whether `inf` or `nan` occurs in the
-whole model's weights. **We modified the torch amp implementation so that it is now compatible with tensor parallelism.**
-
-To use mixed precision training, you can simply set the `fp16` field in the config file. Currently, PyTorch and
-Apex amp cannot be guaranteed to work with tensor and pipeline parallelism. We recommend you use torch amp, as it generally
-gives better accuracy than naive amp.
-
-The AMP module is designed to be completely modular and can be used independently of other colossalai modules.
-If you wish to use only amp in your code base without `colossalai.initialize`, you can use `colossalai.amp.convert_to_amp`.
-
-```python
-from colossalai.amp import AMP_TYPE
-
-# example of using torch amp
-model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
- optimizer,
- criterion,
- AMP_TYPE.TORCH)
-```
-
-## PyTorch AMP
-
-PyTorch provides mixed precision training in version 1.6 and above. It offers an easy way to cast data to `fp16` format
-while keeping some operations such as reductions in `fp32`. You can configure the gradient scaler in the config file.
-
-```python
-from colossalai.amp import AMP_TYPE
-
-fp16=dict(
- mode=AMP_TYPE.TORCH,
- # below are default values for grad scaler
- init_scale=2.**16,
- growth_factor=2.0,
- backoff_factor=0.5,
- growth_interval=2000,
- enabled=True
-)
-```
-
-## Apex AMP
-
-For this mode, we rely on the [Apex](https://nvidia.github.io/apex/) implementation for mixed precision training. We support
-this plugin because it allows for finer control over the granularity of mixed precision. For example, the `O2` level (optimization level 2)
-will keep batch normalization in `fp32`.
-
-The following code block shows a config file for Apex AMP.
-
-```python
-from colossalai.amp import AMP_TYPE
-
-fp16 = dict(
- mode=AMP_TYPE.APEX,
- # below are the default values
- enabled=True,
- opt_level='O1',
- cast_model_type=None,
- patch_torch_functions=None,
- keep_batchnorm_fp32=None,
- master_weights=None,
- loss_scale=None,
- cast_model_outputs=None,
- num_losses=1,
- verbosity=1,
- min_loss_scale=None,
- max_loss_scale=16777216.0
-)
-```
-
-## Naive AMP
-
-We leveraged the Megatron-LM implementation to achieve mixed precision training while maintaining compatibility with complex tensor
-and pipeline parallelism. This AMP mode casts all operations to `fp16`.
-
-The following code block shows a config file for this mode.
-
-```python
-from colossalai.amp import AMP_TYPE
-
-fp16 = dict(
- mode=AMP_TYPE.NAIVE,
- # below are the default values
- clip_grad=0,
- log_num_zeros_in_grad=False,
- initial_scale=2 ** 32,
- min_scale=1,
- growth_factor=2,
- backoff_factor=0.5,
- growth_interval=1000,
- hysteresis=2
-)
-```
\ No newline at end of file
diff --git a/docs/amp_zh.md b/docs/amp_zh.md
deleted file mode 100644
index 9ec331048..000000000
--- a/docs/amp_zh.md
+++ /dev/null
@@ -1,74 +0,0 @@
-# Mixed precision training
-
-Colossal-AI supports the following three ways of mixed precision training:
-1. torch.cuda.amp
-2. apex.amp
-3. tensor parallel AMP
-
-The first two rely on the native implementations of [PyTorch](https://pytorch.org/docs/stable/amp.html) (version 1.6 or above) and [Nvidia Apex](https://github.com/NVIDIA/apex), but these two methods are not compatible with tensor parallelism: in tensor parallelism, tensors are split and stored on different devices, so mixed precision training that is compatible with tensor parallelism requires constant communication between processes to check whether `inf` or `nan` appears in the model parameters. We therefore adopted the implementation of [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
-
-You can simply set the `fp16` field in the configuration file to enable mixed precision training. Currently, PyTorch and Apex amp cannot be guaranteed to be compatible with tensor and pipeline parallelism, so we recommend the last of the three methods.
-
-## PyTorch AMP
-
-PyTorch provides mixed precision training in version 1.6 and above. It can convert data to `fp16` format while keeping some operations in `fp32`. You can configure it in the configuration file.
-
-```python
-from colossalai.engine import AMP_TYPE
-
-fp16=dict(
- mode=AMP_TYPE.TORCH,
- # below are default values for grad scaler
- init_scale=2.**16,
- growth_factor=2.0,
- backoff_factor=0.5,
- growth_interval=2000,
- enabled=True
-)
-```
-
-## Apex AMP
-
-We use the mixed precision training from [Apex](https://nvidia.github.io/apex/) because this mode provides fine-grained control over the mixed precision, for example, the `O2` level (optimization level 2) keeps batch normalization in `fp32`. The code block below shows a configuration file using Apex AMP.
-
-```python
-from colossalai.engine import AMP_TYPE
-
-fp16 = dict(
- mode=AMP_TYPE.APEX,
- # below are the default values
- enabled=True,
- opt_level='O1',
- cast_model_type=None,
- patch_torch_functions=None,
- keep_batchnorm_fp32=None,
- master_weights=None,
- loss_scale=None,
- cast_model_outputs=None,
- num_losses=1,
- verbosity=1,
- min_loss_scale=None,
- max_loss_scale=16777216.0
-)
-```
-
-## Tensor parallel AMP
-
-We borrowed the mixed precision training implementation from Megatron-LM, which is compatible with tensor and pipeline parallelism. The code block below shows a configuration file using tensor parallel AMP.
-
-```python
-from colossalai.engine import AMP_TYPE
-
-fp16 = dict(
- mode=AMP_TYPE.PARALLEL,
- # below are the default values
- clip_grad=0,
- log_num_zeros_in_grad=False,
- initial_scale=2 ** 32,
- min_scale=1,
- growth_factor=2,
- backoff_factor=0.5,
- growth_interval=1000,
- hysteresis=2
-)
-```
\ No newline at end of file
diff --git a/docs/colossalai/colossalai.amp.amp_type.rst b/docs/colossalai/colossalai.amp.amp_type.rst
new file mode 100644
index 000000000..067af7d8c
--- /dev/null
+++ b/docs/colossalai/colossalai.amp.amp_type.rst
@@ -0,0 +1,5 @@
+colossalai.amp.amp\_type
+========================
+
+.. automodule:: colossalai.amp.amp_type
+ :members:
diff --git a/docs/colossalai/colossalai.amp.apex_amp.apex_amp.rst b/docs/colossalai/colossalai.amp.apex_amp.apex_amp.rst
new file mode 100644
index 000000000..cba7e0062
--- /dev/null
+++ b/docs/colossalai/colossalai.amp.apex_amp.apex_amp.rst
@@ -0,0 +1,5 @@
+colossalai.amp.apex\_amp.apex\_amp
+==================================
+
+.. automodule:: colossalai.amp.apex_amp.apex_amp
+ :members:
diff --git a/docs/colossalai/colossalai.amp.apex_amp.rst b/docs/colossalai/colossalai.amp.apex_amp.rst
index c3ed5420c..7116a538b 100644
--- a/docs/colossalai/colossalai.amp.apex_amp.rst
+++ b/docs/colossalai/colossalai.amp.apex_amp.rst
@@ -1,5 +1,11 @@
colossalai.amp.apex\_amp
-==========================
+========================
.. automodule:: colossalai.amp.apex_amp
:members:
+
+
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.amp.apex_amp.apex_amp
diff --git a/docs/colossalai/colossalai.amp.naive_amp.naive_amp.rst b/docs/colossalai/colossalai.amp.naive_amp.naive_amp.rst
new file mode 100644
index 000000000..e20f22b2e
--- /dev/null
+++ b/docs/colossalai/colossalai.amp.naive_amp.naive_amp.rst
@@ -0,0 +1,5 @@
+colossalai.amp.naive\_amp.naive\_amp
+====================================
+
+.. automodule:: colossalai.amp.naive_amp.naive_amp
+ :members:
diff --git a/docs/colossalai/colossalai.amp.naive_amp.rst b/docs/colossalai/colossalai.amp.naive_amp.rst
index 0bf2795bf..15917e174 100644
--- a/docs/colossalai/colossalai.amp.naive_amp.rst
+++ b/docs/colossalai/colossalai.amp.naive_amp.rst
@@ -1,5 +1,11 @@
colossalai.amp.naive\_amp
-==========================
+=========================
.. automodule:: colossalai.amp.naive_amp
:members:
+
+
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.amp.naive_amp.naive_amp
diff --git a/docs/colossalai/colossalai.amp.rst b/docs/colossalai/colossalai.amp.rst
index 0c7f22d6c..5ef4f36c1 100644
--- a/docs/colossalai/colossalai.amp.rst
+++ b/docs/colossalai/colossalai.amp.rst
@@ -1,13 +1,18 @@
colossalai.amp
-==================
+==============
+
+.. automodule:: colossalai.amp
+ :members:
.. toctree::
:maxdepth: 2
- colossalai.amp.torch_amp
colossalai.amp.apex_amp
colossalai.amp.naive_amp
+ colossalai.amp.torch_amp
-.. automodule:: colossalai.amp
- :members:
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.amp.amp_type
diff --git a/docs/colossalai/colossalai.amp.torch_amp.rst b/docs/colossalai/colossalai.amp.torch_amp.rst
index d71ff3c0d..f10095f13 100644
--- a/docs/colossalai/colossalai.amp.torch_amp.rst
+++ b/docs/colossalai/colossalai.amp.torch_amp.rst
@@ -1,5 +1,11 @@
colossalai.amp.torch\_amp
-==========================
+=========================
.. automodule:: colossalai.amp.torch_amp
- :members:
+ :members:
+
+
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.amp.torch_amp.torch_amp
diff --git a/docs/colossalai/colossalai.amp.torch_amp.torch_amp.rst b/docs/colossalai/colossalai.amp.torch_amp.torch_amp.rst
new file mode 100644
index 000000000..5f1549cb8
--- /dev/null
+++ b/docs/colossalai/colossalai.amp.torch_amp.torch_amp.rst
@@ -0,0 +1,5 @@
+colossalai.amp.torch\_amp.torch\_amp
+====================================
+
+.. automodule:: colossalai.amp.torch_amp.torch_amp
+ :members:
diff --git a/docs/colossalai/colossalai.builder.rst b/docs/colossalai/colossalai.builder.rst
index d2b96604c..60b8501c8 100644
--- a/docs/colossalai/colossalai.builder.rst
+++ b/docs/colossalai/colossalai.builder.rst
@@ -1,12 +1,12 @@
colossalai.builder
==================
+.. automodule:: colossalai.builder
+ :members:
+
+
.. toctree::
:maxdepth: 2
colossalai.builder.builder
colossalai.builder.pipeline
-
-
-.. automodule:: colossalai.builder
- :members:
diff --git a/docs/colossalai/colossalai.communication.rst b/docs/colossalai/colossalai.communication.rst
index 05ad0d4d7..5086fa663 100644
--- a/docs/colossalai/colossalai.communication.rst
+++ b/docs/colossalai/colossalai.communication.rst
@@ -1,6 +1,10 @@
colossalai.communication
========================
+.. automodule:: colossalai.communication
+ :members:
+
+
.. toctree::
:maxdepth: 2
@@ -8,7 +12,3 @@ colossalai.communication
colossalai.communication.p2p
colossalai.communication.ring
colossalai.communication.utils
-
-
-.. automodule:: colossalai.communication
- :members:
diff --git a/docs/colossalai/colossalai.constants.rst b/docs/colossalai/colossalai.constants.rst
deleted file mode 100644
index 330b3e866..000000000
--- a/docs/colossalai/colossalai.constants.rst
+++ /dev/null
@@ -1,5 +0,0 @@
-colossalai.constants
-====================
-
-.. automodule:: colossalai.constants
- :members:
diff --git a/docs/colossalai/colossalai.context.process_group_initializer.initializer_model.rst b/docs/colossalai/colossalai.context.process_group_initializer.initializer_model.rst
new file mode 100644
index 000000000..8f2d79369
--- /dev/null
+++ b/docs/colossalai/colossalai.context.process_group_initializer.initializer_model.rst
@@ -0,0 +1,5 @@
+colossalai.context.process\_group\_initializer.initializer\_model
+=================================================================
+
+.. automodule:: colossalai.context.process_group_initializer.initializer_model
+ :members:
diff --git a/docs/colossalai/colossalai.context.process_group_initializer.initializer_moe.rst b/docs/colossalai/colossalai.context.process_group_initializer.initializer_moe.rst
new file mode 100644
index 000000000..be2314629
--- /dev/null
+++ b/docs/colossalai/colossalai.context.process_group_initializer.initializer_moe.rst
@@ -0,0 +1,5 @@
+colossalai.context.process\_group\_initializer.initializer\_moe
+===============================================================
+
+.. automodule:: colossalai.context.process_group_initializer.initializer_moe
+ :members:
diff --git a/docs/colossalai/colossalai.context.process_group_initializer.rst b/docs/colossalai/colossalai.context.process_group_initializer.rst
index 68aedaaa3..b5e261195 100644
--- a/docs/colossalai/colossalai.context.process_group_initializer.rst
+++ b/docs/colossalai/colossalai.context.process_group_initializer.rst
@@ -13,6 +13,8 @@ colossalai.context.process\_group\_initializer
colossalai.context.process_group_initializer.initializer_2p5d
colossalai.context.process_group_initializer.initializer_3d
colossalai.context.process_group_initializer.initializer_data
+ colossalai.context.process_group_initializer.initializer_model
+ colossalai.context.process_group_initializer.initializer_moe
colossalai.context.process_group_initializer.initializer_pipeline
colossalai.context.process_group_initializer.initializer_sequence
colossalai.context.process_group_initializer.initializer_tensor
diff --git a/docs/colossalai/colossalai.context.random.rst b/docs/colossalai/colossalai.context.random.rst
index 58ed5b269..8d4b9c56a 100644
--- a/docs/colossalai/colossalai.context.random.rst
+++ b/docs/colossalai/colossalai.context.random.rst
@@ -1,11 +1,11 @@
colossalai.context.random
=========================
+.. automodule:: colossalai.context.random
+ :members:
+
+
.. toctree::
:maxdepth: 2
colossalai.context.random.seed_manager
-
-
-.. automodule:: colossalai.context.random
- :members:
diff --git a/docs/colossalai/colossalai.context.rst b/docs/colossalai/colossalai.context.rst
index 4ff29ce3d..babab5099 100644
--- a/docs/colossalai/colossalai.context.rst
+++ b/docs/colossalai/colossalai.context.rst
@@ -1,6 +1,9 @@
colossalai.context
==================
+.. automodule:: colossalai.context
+ :members:
+
.. toctree::
:maxdepth: 2
@@ -14,7 +17,3 @@ colossalai.context
colossalai.context.config
colossalai.context.parallel_context
colossalai.context.parallel_mode
-
-
-.. automodule:: colossalai.context
- :members:
diff --git a/docs/colossalai/colossalai.core.rst b/docs/colossalai/colossalai.core.rst
deleted file mode 100644
index d9ddb76ed..000000000
--- a/docs/colossalai/colossalai.core.rst
+++ /dev/null
@@ -1,5 +0,0 @@
-colossalai.core
-===============
-
-.. automodule:: colossalai.core
- :members:
diff --git a/docs/colossalai/colossalai.engine.rst b/docs/colossalai/colossalai.engine.rst
index 5b37fb842..f41c21e67 100644
--- a/docs/colossalai/colossalai.engine.rst
+++ b/docs/colossalai/colossalai.engine.rst
@@ -1,12 +1,11 @@
colossalai.engine
=================
+.. automodule:: colossalai.engine
+ :members:
+
.. toctree::
:maxdepth: 2
colossalai.engine.gradient_handler
colossalai.engine.schedule
-
-
-.. automodule:: colossalai.engine
- :members:
diff --git a/docs/colossalai/colossalai.kernel.cuda_native.layer_norm.rst b/docs/colossalai/colossalai.kernel.cuda_native.layer_norm.rst
new file mode 100644
index 000000000..b8bff51be
--- /dev/null
+++ b/docs/colossalai/colossalai.kernel.cuda_native.layer_norm.rst
@@ -0,0 +1,5 @@
+colossalai.kernel.cuda\_native.layer\_norm
+==========================================
+
+.. automodule:: colossalai.kernel.cuda_native.layer_norm
+ :members:
diff --git a/docs/colossalai/colossalai.kernel.cuda_native.multihead_attention.rst b/docs/colossalai/colossalai.kernel.cuda_native.multihead_attention.rst
new file mode 100644
index 000000000..de7577d19
--- /dev/null
+++ b/docs/colossalai/colossalai.kernel.cuda_native.multihead_attention.rst
@@ -0,0 +1,5 @@
+colossalai.kernel.cuda\_native.multihead\_attention
+===================================================
+
+.. automodule:: colossalai.kernel.cuda_native.multihead_attention
+ :members:
diff --git a/docs/colossalai/colossalai.kernel.cuda_native.rst b/docs/colossalai/colossalai.kernel.cuda_native.rst
new file mode 100644
index 000000000..d88e4cfdb
--- /dev/null
+++ b/docs/colossalai/colossalai.kernel.cuda_native.rst
@@ -0,0 +1,13 @@
+colossalai.kernel.cuda\_native
+==============================
+
+.. automodule:: colossalai.kernel.cuda_native
+ :members:
+
+
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.kernel.cuda_native.layer_norm
+ colossalai.kernel.cuda_native.multihead_attention
+ colossalai.kernel.cuda_native.scaled_softmax
diff --git a/docs/colossalai/colossalai.kernel.cuda_native.scaled_softmax.rst b/docs/colossalai/colossalai.kernel.cuda_native.scaled_softmax.rst
new file mode 100644
index 000000000..474fcd334
--- /dev/null
+++ b/docs/colossalai/colossalai.kernel.cuda_native.scaled_softmax.rst
@@ -0,0 +1,5 @@
+colossalai.kernel.cuda\_native.scaled\_softmax
+==============================================
+
+.. automodule:: colossalai.kernel.cuda_native.scaled_softmax
+ :members:
diff --git a/docs/colossalai/colossalai.kernel.jit.bias_dropout_add.rst b/docs/colossalai/colossalai.kernel.jit.bias_dropout_add.rst
new file mode 100644
index 000000000..d61550928
--- /dev/null
+++ b/docs/colossalai/colossalai.kernel.jit.bias_dropout_add.rst
@@ -0,0 +1,5 @@
+colossalai.kernel.jit.bias\_dropout\_add
+========================================
+
+.. automodule:: colossalai.kernel.jit.bias_dropout_add
+ :members:
diff --git a/docs/colossalai/colossalai.kernel.jit.bias_gelu.rst b/docs/colossalai/colossalai.kernel.jit.bias_gelu.rst
new file mode 100644
index 000000000..7db184b4c
--- /dev/null
+++ b/docs/colossalai/colossalai.kernel.jit.bias_gelu.rst
@@ -0,0 +1,5 @@
+colossalai.kernel.jit.bias\_gelu
+================================
+
+.. automodule:: colossalai.kernel.jit.bias_gelu
+ :members:
diff --git a/docs/colossalai/colossalai.kernel.jit.option.rst b/docs/colossalai/colossalai.kernel.jit.option.rst
new file mode 100644
index 000000000..15ebfc83a
--- /dev/null
+++ b/docs/colossalai/colossalai.kernel.jit.option.rst
@@ -0,0 +1,5 @@
+colossalai.kernel.jit.option
+============================
+
+.. automodule:: colossalai.kernel.jit.option
+ :members:
diff --git a/docs/colossalai/colossalai.kernel.jit.rst b/docs/colossalai/colossalai.kernel.jit.rst
new file mode 100644
index 000000000..8b2f728d3
--- /dev/null
+++ b/docs/colossalai/colossalai.kernel.jit.rst
@@ -0,0 +1,13 @@
+colossalai.kernel.jit
+=====================
+
+.. automodule:: colossalai.kernel.jit
+ :members:
+
+
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.kernel.jit.bias_dropout_add
+ colossalai.kernel.jit.bias_gelu
+ colossalai.kernel.jit.option
diff --git a/docs/colossalai/colossalai.kernel.rst b/docs/colossalai/colossalai.kernel.rst
new file mode 100644
index 000000000..dcbac8c1d
--- /dev/null
+++ b/docs/colossalai/colossalai.kernel.rst
@@ -0,0 +1,11 @@
+colossalai.kernel
+=================
+
+.. automodule:: colossalai.kernel
+ :members:
+
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.kernel.cuda_native
+ colossalai.kernel.jit
diff --git a/docs/colossalai/colossalai.logging.rst b/docs/colossalai/colossalai.logging.rst
index 71bcdd16b..a7a5cec72 100644
--- a/docs/colossalai/colossalai.logging.rst
+++ b/docs/colossalai/colossalai.logging.rst
@@ -1,11 +1,11 @@
colossalai.logging
==================
+.. automodule:: colossalai.logging
+ :members:
+
+
.. toctree::
:maxdepth: 2
colossalai.logging.logging
-
-
-.. automodule:: colossalai.logging
- :members:
diff --git a/docs/colossalai/colossalai.nn.init.rst b/docs/colossalai/colossalai.nn.init.rst
new file mode 100644
index 000000000..d0ab99312
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.init.rst
@@ -0,0 +1,5 @@
+colossalai.nn.init
+==================
+
+.. automodule:: colossalai.nn.init
+ :members:
diff --git a/docs/colossalai/colossalai.nn.layer.colossalai_layer.dropout.rst b/docs/colossalai/colossalai.nn.layer.colossalai_layer.dropout.rst
new file mode 100644
index 000000000..ec1dfd395
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.layer.colossalai_layer.dropout.rst
@@ -0,0 +1,5 @@
+colossalai.nn.layer.colossalai\_layer.dropout
+=============================================
+
+.. automodule:: colossalai.nn.layer.colossalai_layer.dropout
+ :members:
diff --git a/docs/colossalai/colossalai.nn.layer.colossalai_layer.embedding.rst b/docs/colossalai/colossalai.nn.layer.colossalai_layer.embedding.rst
new file mode 100644
index 000000000..8438b3a07
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.layer.colossalai_layer.embedding.rst
@@ -0,0 +1,5 @@
+colossalai.nn.layer.colossalai\_layer.embedding
+===============================================
+
+.. automodule:: colossalai.nn.layer.colossalai_layer.embedding
+ :members:
diff --git a/docs/colossalai/colossalai.nn.layer.colossalai_layer.linear.rst b/docs/colossalai/colossalai.nn.layer.colossalai_layer.linear.rst
new file mode 100644
index 000000000..321328254
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.layer.colossalai_layer.linear.rst
@@ -0,0 +1,5 @@
+colossalai.nn.layer.colossalai\_layer.linear
+============================================
+
+.. automodule:: colossalai.nn.layer.colossalai_layer.linear
+ :members:
diff --git a/docs/colossalai/colossalai.nn.layer.colossalai_layer.normalization.rst b/docs/colossalai/colossalai.nn.layer.colossalai_layer.normalization.rst
new file mode 100644
index 000000000..f94dd27b8
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.layer.colossalai_layer.normalization.rst
@@ -0,0 +1,5 @@
+colossalai.nn.layer.colossalai\_layer.normalization
+===================================================
+
+.. automodule:: colossalai.nn.layer.colossalai_layer.normalization
+ :members:
diff --git a/docs/colossalai/colossalai.nn.layer.colossalai_layer.rst b/docs/colossalai/colossalai.nn.layer.colossalai_layer.rst
new file mode 100644
index 000000000..0f685e6c2
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.layer.colossalai_layer.rst
@@ -0,0 +1,14 @@
+colossalai.nn.layer.colossalai\_layer
+=====================================
+
+.. automodule:: colossalai.nn.layer.colossalai_layer
+ :members:
+
+
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.nn.layer.colossalai_layer.dropout
+ colossalai.nn.layer.colossalai_layer.embedding
+ colossalai.nn.layer.colossalai_layer.linear
+ colossalai.nn.layer.colossalai_layer.normalization
diff --git a/docs/colossalai/colossalai.nn.layer.moe.layers.rst b/docs/colossalai/colossalai.nn.layer.moe.layers.rst
new file mode 100644
index 000000000..d109d47b8
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.layer.moe.layers.rst
@@ -0,0 +1,5 @@
+colossalai.nn.layer.moe.layers
+==============================
+
+.. automodule:: colossalai.nn.layer.moe.layers
+ :members:
diff --git a/docs/colossalai/colossalai.nn.layer.moe.rst b/docs/colossalai/colossalai.nn.layer.moe.rst
new file mode 100644
index 000000000..403d39817
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.layer.moe.rst
@@ -0,0 +1,11 @@
+colossalai.nn.layer.moe
+=======================
+
+.. automodule:: colossalai.nn.layer.moe
+ :members:
+
+
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.nn.layer.moe.layers
diff --git a/docs/colossalai/colossalai.nn.layer.non_parallel_layers.rst b/docs/colossalai/colossalai.nn.layer.non_parallel_layers.rst
deleted file mode 100644
index 8103d92b8..000000000
--- a/docs/colossalai/colossalai.nn.layer.non_parallel_layers.rst
+++ /dev/null
@@ -1,5 +0,0 @@
-colossalai.nn.layer.non\_parallel\_layers
-======================================
-
-.. automodule:: colossalai.nn.layer.non_parallel_layers
- :members:
diff --git a/docs/colossalai/colossalai.nn.layer.parallel_1d.rst b/docs/colossalai/colossalai.nn.layer.parallel_1d.rst
index a765b04ad..3a8ed6206 100644
--- a/docs/colossalai/colossalai.nn.layer.parallel_1d.rst
+++ b/docs/colossalai/colossalai.nn.layer.parallel_1d.rst
@@ -1,11 +1,11 @@
colossalai.nn.layer.parallel\_1d
================================
+.. automodule:: colossalai.nn.layer.parallel_1d
+ :members:
+
+
.. toctree::
:maxdepth: 2
colossalai.nn.layer.parallel_1d.layers
-
-
-.. automodule:: colossalai.nn.layer.parallel_1d
- :members:
diff --git a/docs/colossalai/colossalai.nn.layer.parallel_2d.rst b/docs/colossalai/colossalai.nn.layer.parallel_2d.rst
index d72fef9a9..f5ad41a1b 100644
--- a/docs/colossalai/colossalai.nn.layer.parallel_2d.rst
+++ b/docs/colossalai/colossalai.nn.layer.parallel_2d.rst
@@ -1,11 +1,11 @@
colossalai.nn.layer.parallel\_2d
================================
+.. automodule:: colossalai.nn.layer.parallel_2d
+ :members:
+
+
.. toctree::
:maxdepth: 2
colossalai.nn.layer.parallel_2d.layers
-
-
-.. automodule:: colossalai.nn.layer.parallel_2d
- :members:
diff --git a/docs/colossalai/colossalai.nn.layer.parallel_2p5d.rst b/docs/colossalai/colossalai.nn.layer.parallel_2p5d.rst
index 4ba8b1348..5869bdee9 100644
--- a/docs/colossalai/colossalai.nn.layer.parallel_2p5d.rst
+++ b/docs/colossalai/colossalai.nn.layer.parallel_2p5d.rst
@@ -1,11 +1,11 @@
colossalai.nn.layer.parallel\_2p5d
==================================
+.. automodule:: colossalai.nn.layer.parallel_2p5d
+ :members:
+
+
.. toctree::
:maxdepth: 2
colossalai.nn.layer.parallel_2p5d.layers
-
-
-.. automodule:: colossalai.nn.layer.parallel_2p5d
- :members:
diff --git a/docs/colossalai/colossalai.nn.layer.parallel_3d.rst b/docs/colossalai/colossalai.nn.layer.parallel_3d.rst
index d0e82c838..bb55a63e5 100644
--- a/docs/colossalai/colossalai.nn.layer.parallel_3d.rst
+++ b/docs/colossalai/colossalai.nn.layer.parallel_3d.rst
@@ -1,11 +1,11 @@
colossalai.nn.layer.parallel\_3d
================================
+.. automodule:: colossalai.nn.layer.parallel_3d
+ :members:
+
+
.. toctree::
:maxdepth: 2
colossalai.nn.layer.parallel_3d.layers
-
-
-.. automodule:: colossalai.nn.layer.parallel_3d
- :members:
diff --git a/docs/colossalai/colossalai.nn.layer.parallel_sequence.rst b/docs/colossalai/colossalai.nn.layer.parallel_sequence.rst
index dfea23ab3..24e8941d4 100644
--- a/docs/colossalai/colossalai.nn.layer.parallel_sequence.rst
+++ b/docs/colossalai/colossalai.nn.layer.parallel_sequence.rst
@@ -1,11 +1,11 @@
colossalai.nn.layer.parallel\_sequence
======================================
+.. automodule:: colossalai.nn.layer.parallel_sequence
+ :members:
+
+
.. toctree::
:maxdepth: 2
colossalai.nn.layer.parallel_sequence.layers
-
-
-.. automodule:: colossalai.nn.layer.parallel_sequence
- :members:
diff --git a/docs/colossalai/colossalai.nn.layer.rst b/docs/colossalai/colossalai.nn.layer.rst
index 1538c3f02..32a93128f 100644
--- a/docs/colossalai/colossalai.nn.layer.rst
+++ b/docs/colossalai/colossalai.nn.layer.rst
@@ -1,18 +1,25 @@
colossalai.nn.layer
===================
+.. automodule:: colossalai.nn.layer
+ :members:
+
.. toctree::
:maxdepth: 2
+ colossalai.nn.layer.colossalai_layer
+ colossalai.nn.layer.moe
colossalai.nn.layer.parallel_1d
colossalai.nn.layer.parallel_2d
colossalai.nn.layer.parallel_2p5d
colossalai.nn.layer.parallel_3d
colossalai.nn.layer.parallel_sequence
- colossalai.nn.layer.non_parallel_layers
+ colossalai.nn.layer.utils
+ colossalai.nn.layer.vanilla
colossalai.nn.layer.wrapper
+
+
+.. toctree::
+ :maxdepth: 2
+
colossalai.nn.layer.base_layer
-
-
-.. automodule:: colossalai.nn.layer
- :members:
diff --git a/docs/colossalai/colossalai.nn.layer.utils.common.rst b/docs/colossalai/colossalai.nn.layer.utils.common.rst
new file mode 100644
index 000000000..6a552830f
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.layer.utils.common.rst
@@ -0,0 +1,5 @@
+colossalai.nn.layer.utils.common
+================================
+
+.. automodule:: colossalai.nn.layer.utils.common
+ :members:
diff --git a/docs/colossalai/colossalai.nn.layer.utils.rst b/docs/colossalai/colossalai.nn.layer.utils.rst
new file mode 100644
index 000000000..16c3d7182
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.layer.utils.rst
@@ -0,0 +1,11 @@
+colossalai.nn.layer.utils
+=========================
+
+.. automodule:: colossalai.nn.layer.utils
+ :members:
+
+
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.nn.layer.utils.common
diff --git a/docs/colossalai/colossalai.nn.layer.vanilla.layers.rst b/docs/colossalai/colossalai.nn.layer.vanilla.layers.rst
new file mode 100644
index 000000000..f993b1f50
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.layer.vanilla.layers.rst
@@ -0,0 +1,5 @@
+colossalai.nn.layer.vanilla.layers
+==================================
+
+.. automodule:: colossalai.nn.layer.vanilla.layers
+ :members:
diff --git a/docs/colossalai/colossalai.nn.layer.vanilla.rst b/docs/colossalai/colossalai.nn.layer.vanilla.rst
new file mode 100644
index 000000000..fe1ea5c6c
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.layer.vanilla.rst
@@ -0,0 +1,11 @@
+colossalai.nn.layer.vanilla
+===========================
+
+.. automodule:: colossalai.nn.layer.vanilla
+ :members:
+
+
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.nn.layer.vanilla.layers
diff --git a/docs/colossalai/colossalai.nn.layer.wrapper.pipeline_wrapper.rst b/docs/colossalai/colossalai.nn.layer.wrapper.pipeline_wrapper.rst
new file mode 100644
index 000000000..e5648873d
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.layer.wrapper.pipeline_wrapper.rst
@@ -0,0 +1,5 @@
+colossalai.nn.layer.wrapper.pipeline\_wrapper
+=============================================
+
+.. automodule:: colossalai.nn.layer.wrapper.pipeline_wrapper
+ :members:
diff --git a/docs/colossalai/colossalai.nn.layer.wrapper.rst b/docs/colossalai/colossalai.nn.layer.wrapper.rst
index 40ed618cb..4e66651dc 100644
--- a/docs/colossalai/colossalai.nn.layer.wrapper.rst
+++ b/docs/colossalai/colossalai.nn.layer.wrapper.rst
@@ -9,3 +9,4 @@ colossalai.nn.layer.wrapper
:maxdepth: 2
colossalai.nn.layer.wrapper.lambda_wrapper
+ colossalai.nn.layer.wrapper.pipeline_wrapper
diff --git a/docs/colossalai/colossalai.nn.loss.cross_entropy_2d.rst b/docs/colossalai/colossalai.nn.loss.cross_entropy_2d.rst
deleted file mode 100644
index 780a66557..000000000
--- a/docs/colossalai/colossalai.nn.loss.cross_entropy_2d.rst
+++ /dev/null
@@ -1,5 +0,0 @@
-colossalai.nn.loss.cross\_entropy\_2d
-=====================================
-
-.. automodule:: colossalai.nn.loss.cross_entropy_2d
- :members:
diff --git a/docs/colossalai/colossalai.nn.loss.cross_entropy_2p5d.rst b/docs/colossalai/colossalai.nn.loss.cross_entropy_2p5d.rst
deleted file mode 100644
index dd136dca2..000000000
--- a/docs/colossalai/colossalai.nn.loss.cross_entropy_2p5d.rst
+++ /dev/null
@@ -1,5 +0,0 @@
-colossalai.nn.loss.cross\_entropy\_2p5d
-=======================================
-
-.. automodule:: colossalai.nn.loss.cross_entropy_2p5d
- :members:
diff --git a/docs/colossalai/colossalai.nn.loss.cross_entropy_3d.rst b/docs/colossalai/colossalai.nn.loss.cross_entropy_3d.rst
deleted file mode 100644
index 9b8610f31..000000000
--- a/docs/colossalai/colossalai.nn.loss.cross_entropy_3d.rst
+++ /dev/null
@@ -1,5 +0,0 @@
-colossalai.nn.loss.cross\_entropy\_3d
-=====================================
-
-.. automodule:: colossalai.nn.loss.cross_entropy_3d
- :members:
diff --git a/docs/colossalai/colossalai.nn.loss.loss_2d.rst b/docs/colossalai/colossalai.nn.loss.loss_2d.rst
new file mode 100644
index 000000000..14d1585e3
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.loss.loss_2d.rst
@@ -0,0 +1,5 @@
+colossalai.nn.loss.loss\_2d
+===========================
+
+.. automodule:: colossalai.nn.loss.loss_2d
+ :members:
diff --git a/docs/colossalai/colossalai.nn.loss.loss_2p5d.rst b/docs/colossalai/colossalai.nn.loss.loss_2p5d.rst
new file mode 100644
index 000000000..fc3714da3
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.loss.loss_2p5d.rst
@@ -0,0 +1,5 @@
+colossalai.nn.loss.loss\_2p5d
+=============================
+
+.. automodule:: colossalai.nn.loss.loss_2p5d
+ :members:
diff --git a/docs/colossalai/colossalai.nn.loss.loss_3d.rst b/docs/colossalai/colossalai.nn.loss.loss_3d.rst
new file mode 100644
index 000000000..a593324fb
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.loss.loss_3d.rst
@@ -0,0 +1,5 @@
+colossalai.nn.loss.loss\_3d
+===========================
+
+.. automodule:: colossalai.nn.loss.loss_3d
+ :members:
diff --git a/docs/colossalai/colossalai.nn.loss.loss_moe.rst b/docs/colossalai/colossalai.nn.loss.loss_moe.rst
new file mode 100644
index 000000000..ef2851ace
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.loss.loss_moe.rst
@@ -0,0 +1,5 @@
+colossalai.nn.loss.loss\_moe
+============================
+
+.. automodule:: colossalai.nn.loss.loss_moe
+ :members:
diff --git a/docs/colossalai/colossalai.nn.loss.rst b/docs/colossalai/colossalai.nn.loss.rst
index 8efd847d6..5677b7448 100644
--- a/docs/colossalai/colossalai.nn.loss.rst
+++ b/docs/colossalai/colossalai.nn.loss.rst
@@ -1,13 +1,14 @@
colossalai.nn.loss
==================
+.. automodule:: colossalai.nn.loss
+ :members:
+
+
.. toctree::
:maxdepth: 2
- colossalai.nn.loss.cross_entropy_2d
- colossalai.nn.loss.cross_entropy_2p5d
- colossalai.nn.loss.cross_entropy_3d
-
-
-.. automodule:: colossalai.nn.loss
- :members:
+ colossalai.nn.loss.loss_2d
+ colossalai.nn.loss.loss_2p5d
+ colossalai.nn.loss.loss_3d
+ colossalai.nn.loss.loss_moe
diff --git a/docs/colossalai/colossalai.nn.lr_scheduler.rst b/docs/colossalai/colossalai.nn.lr_scheduler.rst
index f32eb3be4..427a3ee45 100644
--- a/docs/colossalai/colossalai.nn.lr_scheduler.rst
+++ b/docs/colossalai/colossalai.nn.lr_scheduler.rst
@@ -1,6 +1,10 @@
colossalai.nn.lr\_scheduler
===========================
+.. automodule:: colossalai.nn.lr_scheduler
+ :members:
+
+
.. toctree::
:maxdepth: 2
@@ -11,7 +15,3 @@ colossalai.nn.lr\_scheduler
colossalai.nn.lr_scheduler.onecycle
colossalai.nn.lr_scheduler.poly
colossalai.nn.lr_scheduler.torch
-
-
-.. automodule:: colossalai.nn.lr_scheduler
- :members:
diff --git a/docs/colossalai/colossalai.nn.metric.accuracy_2d.rst b/docs/colossalai/colossalai.nn.metric.accuracy_2d.rst
new file mode 100644
index 000000000..63bcb8349
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.metric.accuracy_2d.rst
@@ -0,0 +1,5 @@
+colossalai.nn.metric.accuracy\_2d
+=================================
+
+.. automodule:: colossalai.nn.metric.accuracy_2d
+ :members:
diff --git a/docs/colossalai/colossalai.nn.metric.accuracy_2p5d.rst b/docs/colossalai/colossalai.nn.metric.accuracy_2p5d.rst
new file mode 100644
index 000000000..dd4358fbf
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.metric.accuracy_2p5d.rst
@@ -0,0 +1,5 @@
+colossalai.nn.metric.accuracy\_2p5d
+===================================
+
+.. automodule:: colossalai.nn.metric.accuracy_2p5d
+ :members:
diff --git a/docs/colossalai/colossalai.nn.metric.accuracy_3d.rst b/docs/colossalai/colossalai.nn.metric.accuracy_3d.rst
new file mode 100644
index 000000000..95143444b
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.metric.accuracy_3d.rst
@@ -0,0 +1,5 @@
+colossalai.nn.metric.accuracy\_3d
+=================================
+
+.. automodule:: colossalai.nn.metric.accuracy_3d
+ :members:
diff --git a/docs/colossalai/colossalai.nn.metric.rst b/docs/colossalai/colossalai.nn.metric.rst
new file mode 100644
index 000000000..28f5568eb
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.metric.rst
@@ -0,0 +1,13 @@
+colossalai.nn.metric
+====================
+
+.. automodule:: colossalai.nn.metric
+ :members:
+
+
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.nn.metric.accuracy_2d
+ colossalai.nn.metric.accuracy_2p5d
+ colossalai.nn.metric.accuracy_3d
diff --git a/docs/colossalai/colossalai.nn.model.model_from_config.rst b/docs/colossalai/colossalai.nn.model.model_from_config.rst
index cea8ff4f4..fadb5fd0f 100644
--- a/docs/colossalai/colossalai.nn.model.model_from_config.rst
+++ b/docs/colossalai/colossalai.nn.model.model_from_config.rst
@@ -1,5 +1,5 @@
colossalai.nn.model.model\_from\_config
-===============================
+=======================================
.. automodule:: colossalai.nn.model.model_from_config
:members:
diff --git a/docs/colossalai/colossalai.nn.model.rst b/docs/colossalai/colossalai.nn.model.rst
index 88fc55e06..5756e11cd 100644
--- a/docs/colossalai/colossalai.nn.model.rst
+++ b/docs/colossalai/colossalai.nn.model.rst
@@ -1,6 +1,10 @@
colossalai.nn.model
===================
+.. automodule:: colossalai.nn.model
+ :members:
+
+
.. toctree::
:maxdepth: 2
diff --git a/docs/colossalai/colossalai.nn.optimizer.colossalai_optimizer.rst b/docs/colossalai/colossalai.nn.optimizer.colossalai_optimizer.rst
new file mode 100644
index 000000000..35515c374
--- /dev/null
+++ b/docs/colossalai/colossalai.nn.optimizer.colossalai_optimizer.rst
@@ -0,0 +1,5 @@
+colossalai.nn.optimizer.colossalai\_optimizer
+=============================================
+
+.. automodule:: colossalai.nn.optimizer.colossalai_optimizer
+ :members:
diff --git a/docs/colossalai/colossalai.nn.optimizer.rst b/docs/colossalai/colossalai.nn.optimizer.rst
index f865b91f4..7fbd81406 100644
--- a/docs/colossalai/colossalai.nn.optimizer.rst
+++ b/docs/colossalai/colossalai.nn.optimizer.rst
@@ -1,15 +1,16 @@
colossalai.nn.optimizer
=======================
+.. automodule:: colossalai.nn.optimizer
+ :members:
+
+
.. toctree::
:maxdepth: 2
+ colossalai.nn.optimizer.colossalai_optimizer
colossalai.nn.optimizer.fused_adam
colossalai.nn.optimizer.fused_lamb
colossalai.nn.optimizer.fused_sgd
colossalai.nn.optimizer.lamb
colossalai.nn.optimizer.lars
-
-
-.. automodule:: colossalai.nn.optimizer
- :members:
diff --git a/docs/colossalai/colossalai.nn.rst b/docs/colossalai/colossalai.nn.rst
index bf83f33f4..32e5eae2f 100644
--- a/docs/colossalai/colossalai.nn.rst
+++ b/docs/colossalai/colossalai.nn.rst
@@ -1,15 +1,21 @@
colossalai.nn
=============
+.. automodule:: colossalai.nn
+ :members:
+
.. toctree::
:maxdepth: 2
colossalai.nn.layer
colossalai.nn.loss
colossalai.nn.lr_scheduler
+ colossalai.nn.metric
colossalai.nn.model
colossalai.nn.optimizer
-.. automodule:: colossalai.nn
- :members:
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.nn.init
diff --git a/docs/colossalai/colossalai.registry.rst b/docs/colossalai/colossalai.registry.rst
index 2991f04b1..0f294f6d1 100644
--- a/docs/colossalai/colossalai.registry.rst
+++ b/docs/colossalai/colossalai.registry.rst
@@ -1,11 +1,11 @@
colossalai.registry
===================
+.. automodule:: colossalai.registry
+ :members:
+
+
.. toctree::
:maxdepth: 2
colossalai.registry.registry
-
-
-.. automodule:: colossalai.registry
- :members:
diff --git a/docs/colossalai/colossalai.rst b/docs/colossalai/colossalai.rst
index eace3075b..eca3e273a 100644
--- a/docs/colossalai/colossalai.rst
+++ b/docs/colossalai/colossalai.rst
@@ -1,13 +1,8 @@
colossalai
==========
-.. toctree::
- :maxdepth: 2
-
- colossalai.constants
- colossalai.core
- colossalai.initialize
-
+.. automodule:: colossalai
+ :members:
.. toctree::
:maxdepth: 2
@@ -17,6 +12,7 @@ colossalai
colossalai.communication
colossalai.context
colossalai.engine
+ colossalai.kernel
colossalai.logging
colossalai.nn
colossalai.registry
@@ -24,5 +20,8 @@ colossalai
colossalai.utils
colossalai.zero
-.. automodule:: colossalai
- :members:
+
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.initialize
diff --git a/docs/colossalai/colossalai.trainer.metric.rst b/docs/colossalai/colossalai.trainer.metric.rst
deleted file mode 100644
index b6b06555d..000000000
--- a/docs/colossalai/colossalai.trainer.metric.rst
+++ /dev/null
@@ -1,5 +0,0 @@
-colossalai.trainer.metric
-=========================
-
-.. automodule:: colossalai.trainer.metric
- :members:
diff --git a/docs/colossalai/colossalai.trainer.rst b/docs/colossalai/colossalai.trainer.rst
index 44bdc06cf..abc636e62 100644
--- a/docs/colossalai/colossalai.trainer.rst
+++ b/docs/colossalai/colossalai.trainer.rst
@@ -1,17 +1,10 @@
colossalai.trainer
==================
+.. automodule:: colossalai.trainer
+ :members:
+
.. toctree::
:maxdepth: 2
colossalai.trainer.hooks
-
-
-.. toctree::
- :maxdepth: 2
-
- colossalai.trainer.metric
-
-
-.. automodule:: colossalai.trainer
- :members:
diff --git a/docs/colossalai/colossalai.utils.data_sampler.base_sampler.rst b/docs/colossalai/colossalai.utils.data_sampler.base_sampler.rst
new file mode 100644
index 000000000..199e8fcf8
--- /dev/null
+++ b/docs/colossalai/colossalai.utils.data_sampler.base_sampler.rst
@@ -0,0 +1,5 @@
+colossalai.utils.data\_sampler.base\_sampler
+============================================
+
+.. automodule:: colossalai.utils.data_sampler.base_sampler
+ :members:
diff --git a/docs/colossalai/colossalai.utils.data_sampler.data_parallel_sampler.rst b/docs/colossalai/colossalai.utils.data_sampler.data_parallel_sampler.rst
new file mode 100644
index 000000000..85e1b121c
--- /dev/null
+++ b/docs/colossalai/colossalai.utils.data_sampler.data_parallel_sampler.rst
@@ -0,0 +1,5 @@
+colossalai.utils.data\_sampler.data\_parallel\_sampler
+======================================================
+
+.. automodule:: colossalai.utils.data_sampler.data_parallel_sampler
+ :members:
diff --git a/docs/colossalai/colossalai.utils.data_sampler.rst b/docs/colossalai/colossalai.utils.data_sampler.rst
index 96eac582c..61dde070b 100644
--- a/docs/colossalai/colossalai.utils.data_sampler.rst
+++ b/docs/colossalai/colossalai.utils.data_sampler.rst
@@ -1,5 +1,12 @@
colossalai.utils.data\_sampler
-=======================================
+==============================
.. automodule:: colossalai.utils.data_sampler
:members:
+
+
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.utils.data_sampler.base_sampler
+ colossalai.utils.data_sampler.data_parallel_sampler
diff --git a/docs/colossalai/colossalai.utils.multi_tensor_apply.multi_tensor_apply.rst b/docs/colossalai/colossalai.utils.multi_tensor_apply.multi_tensor_apply.rst
new file mode 100644
index 000000000..493b9530e
--- /dev/null
+++ b/docs/colossalai/colossalai.utils.multi_tensor_apply.multi_tensor_apply.rst
@@ -0,0 +1,5 @@
+colossalai.utils.multi\_tensor\_apply.multi\_tensor\_apply
+==========================================================
+
+.. automodule:: colossalai.utils.multi_tensor_apply.multi_tensor_apply
+ :members:
diff --git a/docs/colossalai/colossalai.utils.multi_tensor_apply.rst b/docs/colossalai/colossalai.utils.multi_tensor_apply.rst
index 495b4fa6a..d5749cfa8 100644
--- a/docs/colossalai/colossalai.utils.multi_tensor_apply.rst
+++ b/docs/colossalai/colossalai.utils.multi_tensor_apply.rst
@@ -1,8 +1,11 @@
-colossalai.nn.multi\_tensor\_apply
-==================================
+colossalai.utils.multi\_tensor\_apply
+=====================================
-.. automodule:: colossalai.utils.multi_tensor_apply.multi_tensor_apply
+.. automodule:: colossalai.utils.multi_tensor_apply
:members:
+.. toctree::
+ :maxdepth: 2
+ colossalai.utils.multi_tensor_apply.multi_tensor_apply
diff --git a/docs/colossalai/colossalai.utils.rst b/docs/colossalai/colossalai.utils.rst
index 998c31bbb..5a7d2ea5c 100644
--- a/docs/colossalai/colossalai.utils.rst
+++ b/docs/colossalai/colossalai.utils.rst
@@ -1,6 +1,17 @@
colossalai.utils
================
+.. automodule:: colossalai.utils
+ :members:
+
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.utils.data_sampler
+ colossalai.utils.gradient_accumulation
+ colossalai.utils.multi_tensor_apply
+
+
.. toctree::
:maxdepth: 2
@@ -8,12 +19,5 @@ colossalai.utils
colossalai.utils.checkpointing
colossalai.utils.common
colossalai.utils.cuda
- colossalai.utils.data_sampler
- colossalai.utils.gradient_accumulation
colossalai.utils.memory
- colossalai.utils.multi_tensor_apply
colossalai.utils.timer
-
-
-.. automodule:: colossalai.utils
- :members:
diff --git a/docs/colossalai/colossalai.zero.loss_scaler.rst b/docs/colossalai/colossalai.zero.loss_scaler.rst
new file mode 100644
index 000000000..71c4d4446
--- /dev/null
+++ b/docs/colossalai/colossalai.zero.loss_scaler.rst
@@ -0,0 +1,5 @@
+colossalai.zero.loss\_scaler
+============================
+
+.. automodule:: colossalai.zero.loss_scaler
+ :members:
diff --git a/docs/colossalai/colossalai.zero.rst b/docs/colossalai/colossalai.zero.rst
index bbd085d51..136c3c51e 100644
--- a/docs/colossalai/colossalai.zero.rst
+++ b/docs/colossalai/colossalai.zero.rst
@@ -1,5 +1,13 @@
colossalai.zero
-================
+===============
.. automodule:: colossalai.zero
:members:
+
+
+.. toctree::
+ :maxdepth: 2
+
+ colossalai.zero.loss_scaler
+ colossalai.zero.zero_redundancy_optimizer_level_2
+ colossalai.zero.zero_redundancy_optimizer_level_3
diff --git a/docs/colossalai/colossalai.zero.zero_redundancy_optimizer_level_2.rst b/docs/colossalai/colossalai.zero.zero_redundancy_optimizer_level_2.rst
new file mode 100644
index 000000000..5929d5c12
--- /dev/null
+++ b/docs/colossalai/colossalai.zero.zero_redundancy_optimizer_level_2.rst
@@ -0,0 +1,5 @@
+colossalai.zero.zero\_redundancy\_optimizer\_level\_2
+=====================================================
+
+.. automodule:: colossalai.zero.zero_redundancy_optimizer_level_2
+ :members:
diff --git a/docs/colossalai/colossalai.zero.zero_redundancy_optimizer_level_3.rst b/docs/colossalai/colossalai.zero.zero_redundancy_optimizer_level_3.rst
new file mode 100644
index 000000000..063dba60b
--- /dev/null
+++ b/docs/colossalai/colossalai.zero.zero_redundancy_optimizer_level_3.rst
@@ -0,0 +1,5 @@
+colossalai.zero.zero\_redundancy\_optimizer\_level\_3
+=====================================================
+
+.. automodule:: colossalai.zero.zero_redundancy_optimizer_level_3
+ :members:
diff --git a/docs/config.md b/docs/config.md
deleted file mode 100644
index 72c508d63..000000000
--- a/docs/config.md
+++ /dev/null
@@ -1,54 +0,0 @@
-# Config file
-
-Here is a config file example showing how to train a ViT model on the CIFAR10 dataset using Colossal-AI:
-
-```python
-# optional
-# two explicit keys: pipeline, tensor
-# the data parallel size is inferred
-parallel = dict(
- pipeline=dict(size=1),
- tensor=dict(size=4, mode='2d'),
-)
-
-# optional
-# mixed precision (AMP) configuration
-fp16 = dict(
- mode=AMP_TYPE.NAIVE,
- initial_scale=2 ** 8
-)
-
-# optional
-# configuration for zero
-# you can refer to the Zero Redundancy optimizer and zero offload section for details
-# https://www.colossalai.org/zero.html
-zero = dict(
- level=,
- ...
-)
-
-# optional
-# if you are using complex gradient handling
-# otherwise, you do not need this in your config file
-# default gradient_handlers = None
-gradient_handlers = [dict(type='MyHandler', arg1=1, arg2=2), ...]
-
-# optional
-# specify the gradient accumulation size
-# if your batch size is not large enough
-gradient_accumulation =
-
-# optional
-# add gradient clipping to your engine
-# this config is not compatible with zero and AMP_TYPE.NAIVE
-# but works with AMP_TYPE.TORCH and AMP_TYPE.APEX
-# default clip_grad_norm = 0.0
-clip_grad_norm =
-
-# optional
-# cudnn setting
-# defaults are as below
-cudnn_benchmark = False
-cudnn_deterministic = True
-
-```
\ No newline at end of file
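-
-A sketch of how such a config file is consumed, assuming it is saved as `config.py` and the job is started
-with a `torchrun`-style launcher; the tiny model, optimizer and criterion here are placeholders:
-
-```python
-import colossalai
-import torch
-import torch.nn as nn
-
-# read config.py and set up the distributed environment
-colossalai.launch_from_torch(config='config.py')
-
-# placeholder components; use your real model/optimizer/criterion
-model = nn.Linear(16, 2)
-optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
-criterion = nn.CrossEntropyLoss()
-
-# colossalai.initialize wraps them into an engine that applies the
-# fp16 / zero / gradient settings declared in config.py
-engine, *_ = colossalai.initialize(model, optimizer, criterion)
-```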
diff --git a/docs/config_zh.md b/docs/config_zh.md
deleted file mode 100644
index 055ba91b2..000000000
--- a/docs/config_zh.md
+++ /dev/null
@@ -1,187 +0,0 @@
-# Config file
-
-The example in the code block below shows how to train a ViT model on the CIFAR10 dataset with Colossal-AI.
-
-```python
-# build train_dataset and train_dataloader from this dictionary
-# It is not compulsory in the config file; instead, you can pass this dictionary as an argument to colossalai.initialize()
-train_data = dict(
- # dictionary for building Dataset
- dataset=dict(
- # the type CIFAR10Dataset has to be registered
- type='CIFAR10Dataset',
- root='/path/to/data',
- # transform pipeline
- transform_pipeline=[
- dict(type='Resize', size=IMG_SIZE),
- dict(type='RandomCrop', size=IMG_SIZE, padding=4),
- dict(type='RandomHorizontalFlip'),
- dict(type='ToTensor'),
- dict(type='Normalize',
- mean=[0.4914, 0.4822, 0.4465],
- std=[0.2023, 0.1994, 0.2010]),
- ]
- ),
- # dictionary for building Dataloader
- dataloader=dict(
- batch_size=BATCH_SIZE,
- pin_memory=True,
- # num_workers=1,
- shuffle=True,
- )
-)
-
-# build test_dataset and test_dataloader from this dictionary
-test_data = dict(
- dataset=dict(
- type='CIFAR10Dataset',
- root='/path/to/data',
- train=False,
- transform_pipeline=[
- dict(type='Resize', size=IMG_SIZE),
- dict(type='ToTensor'),
- dict(type='Normalize',
- mean=[0.4914, 0.4822, 0.4465],
- std=[0.2023, 0.1994, 0.2010]
- ),
- ]
- ),
- dataloader=dict(
- batch_size=BATCH_SIZE,
- pin_memory=True,
- # num_workers=1,
- )
-)
-
-# compulsory
-# build optimizer from this dictionary
-optimizer = dict(
-    # Available types: 'ZeroRedundancyOptimizer_Level_1', 'ZeroRedundancyOptimizer_Level_2', 'ZeroRedundancyOptimizer_Level_3'
- # 'Adam', 'Lamb', 'SGD', 'FusedLAMB', 'FusedAdam', 'FusedSGD', 'FP16Optimizer'
- type='Adam',
- lr=0.001,
- weight_decay=0
-)
-
-# compulsory
-# build loss function from this dictionary
-loss = dict(
-    # Available types:
- # 'CrossEntropyLoss2D', 'CrossEntropyLoss2p5D', 'CrossEntropyLoss3D'
- type='CrossEntropyLoss2D',
-)
-
-# compulsory
-# build model from this dictionary
-model = dict(
-    # available types: 'PretrainBERT', 'VanillaResNet', 'VisionTransformerFromConfig'
- type='VisionTransformerFromConfig',
-    # each key-value pair below refers to a layer
-    # input data passes through these layers recursively
- tensor_splitting_cfg=dict(
- type='ViTInputSplitter2D',
- ),
- embedding_cfg=dict(
- type='ViTPatchEmbedding2D',
- img_size=IMG_SIZE,
- patch_size=PATCH_SIZE,
- embed_dim=DIM,
- ),
- token_fusion_cfg=dict(
- type='ViTTokenFuser2D',
- img_size=IMG_SIZE,
- patch_size=PATCH_SIZE,
- embed_dim=DIM,
- drop_rate=0.1
- ),
- norm_cfg=dict(
- type='LayerNorm2D',
- normalized_shape=DIM,
- eps=1e-6,
- ),
- block_cfg=dict(
- # ViTBlock is a submodule
- type='ViTBlock',
- attention_cfg=dict(
- type='ViTSelfAttention2D',
- hidden_size=DIM,
- num_attention_heads=NUM_ATTENTION_HEADS,
- attention_dropout_prob=0.,
- hidden_dropout_prob=0.1,
- checkpoint=True
- ),
- droppath_cfg=dict(
- type='VanillaViTDropPath',
- ),
- mlp_cfg=dict(
- type='ViTMLP2D',
- in_features=DIM,
- dropout_prob=0.1,
- mlp_ratio=4,
- checkpoint=True
- ),
- norm_cfg=dict(
- type='LayerNorm2D',
- normalized_shape=DIM,
- eps=1e-6,
- ),
- ),
- head_cfg=dict(
- type='ViTHead2D',
- hidden_size=DIM,
- num_classes=NUM_CLASSES,
- ),
- embed_dim=DIM,
- depth=DEPTH,
- drop_path_rate=0.,
-)
-
-# hooks are built when initializing trainer
-# possible hooks: 'BaseHook', 'MetricHook','LoadCheckpointHook'
-# 'SaveCheckpointHook','LossHook', 'AccuracyHook', 'Accuracy2DHook'
-# 'LogMetricByEpochHook', 'TensorboardHook','LogTimingByEpochHook', 'LogMemoryByEpochHook'
-hooks = [
- dict(type='LogMetricByEpochHook'),
- dict(type='LogTimingByEpochHook'),
- dict(type='LogMemoryByEpochHook'),
- dict(type='Accuracy2DHook'),
- dict(type='LossHook'),
- # dict(type='TensorboardHook', log_dir='./tfb_logs'),
- # dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
- # dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
-]
-
-# three keys: pipeline, tensor, data
-# if data=dict(size=1), i.e. no data parallelism, it does not need to be defined
-parallel = dict(
- pipeline=dict(size=1),
- tensor=dict(size=4, mode='2d'),
-)
-
-# not compulsory
-# configuration for mixed precision (AMP) training
-fp16 = dict(
- mode=AMP_TYPE.PARALLEL,
- initial_scale=2 ** 8
-)
-
-# not compulsory
-# build learning rate scheduler
-lr_scheduler = dict(
- type='LinearWarmupLR',
- warmup_epochs=5
-)
-
-schedule = dict(
- num_microbatches=8
-)
-
-# training stopping criterion
-# you can give num_steps or num_epochs
-num_epochs = 60
-
-# config logging path
-logging = dict(
- root_path='./logs'
-)
-```
\ No newline at end of file
diff --git a/docs/index.rst b/docs/index.rst
index 16141b5ea..b29450f58 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -3,30 +3,8 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
-Colossal-AI documentation
+Colossal-AI API documentation
======================================
-.. toctree::
- :maxdepth: 1
- :caption: GETTING STARTED
-
- installation.md
- run_demo.md
-
-
-.. toctree::
- :maxdepth: 1
- :caption: CUSTOMIZE YOUR TRAINING
-
- parallelization.md
- model.md
- trainer_engine.md
- amp.md
- zero.md
- add_your_parallel.md
- config.md
-
-
-
.. toctree::
:maxdepth: 2
:caption: API REFERENCE
diff --git a/docs/index_zh.rst b/docs/index_zh.rst
deleted file mode 100644
index f9a6ce444..000000000
--- a/docs/index_zh.rst
+++ /dev/null
@@ -1,40 +0,0 @@
-.. Colossal-AI documentation master file, created by
- sphinx-quickstart on Mon Oct 11 17:05:05 2021.
- You can adapt this file completely to your liking, but it should at least
- contain the root `toctree` directive.
-
-Colossal-AI Documentation
-======================================
-.. toctree::
- :maxdepth: 1
-   :caption: GETTING STARTED
-
- installation_zh.md
- run_demo_zh.md
-
-
-.. toctree::
- :maxdepth: 1
-   :caption: CUSTOMIZE YOUR TRAINING
-
- parallelization_zh.md
- model_zh.md
- trainer_engine_zh.md
- amp_zh.md
- zero_zh.md
- add_your_parallel_zh.md
- config_zh.md
-
-
-
-.. toctree::
- :maxdepth: 2
- :caption: API REFERENCE
-
- colossalai/colossalai
-
-
-Indices and tables
-==================
-
-* :ref:`genindex`
\ No newline at end of file
diff --git a/docs/installation.md b/docs/installation.md
deleted file mode 100644
index 50858d05c..000000000
--- a/docs/installation.md
+++ /dev/null
@@ -1,27 +0,0 @@
-# Setup
-
-### PyPI
-
-```bash
-pip install colossalai
-```
-
-### Install From Source (Recommended)
-
-> We **recommend** installing from source, as Colossal-AI is updated frequently in its early versions. The documentation will be kept in line with the main branch of the repository. Feel free to raise an issue if you encounter any problems. :)
-
-```shell
-git clone https://github.com/hpcaitech/ColossalAI.git
-cd ColossalAI
-# install dependency
-pip install -r requirements/requirements.txt
-
-# install colossalai
-pip install .
-```
-
-Install and enable CUDA kernel fusion (compulsory when using the fused optimizer):
-
-```shell
-pip install -v --no-cache-dir --global-option="--cuda_ext" .
-```
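-
-A quick, hedged sanity check after installation (assuming the package exposes
-a `__version__` attribute):
-
-```python
-import colossalai
-
-# confirm the package is importable and print its version
-print(colossalai.__version__)
-```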
diff --git a/docs/installation_zh.md b/docs/installation_zh.md
deleted file mode 100644
index f47dd9d38..000000000
--- a/docs/installation_zh.md
+++ /dev/null
@@ -1,25 +0,0 @@
-# Quick Installation
-
-## Install with pip
-
-```bash
-pip install colossalai
-```
-
-## Install from source
-
-```shell
-git clone git@github.com:hpcaitech/ColossalAI.git
-cd ColossalAI
-# install dependency
-pip install -r requirements/requirements.txt
-
-# install colossalai
-pip install .
-```
-
-Install and enable CUDA kernel fusion (compulsory when using the fused optimizer):
-
-```shell
-pip install -v --no-cache-dir --global-option="--cuda_ext" .
-```
diff --git a/docs/model.md b/docs/model.md
deleted file mode 100644
index dc912cd42..000000000
--- a/docs/model.md
+++ /dev/null
@@ -1,31 +0,0 @@
-# Define your own parallel model
-
-Let's say that you have a huge MLP model with billions of parameters whose extremely large hidden layer size makes it
-impossible to fit into a single GPU directly. Don't worry, Colossal-AI is here to help you sort things out. With the help of Colossal-AI,
-you can write your model in the same familiar way in which you write models for a single GPU, while Colossal-AI automatically
-splits your model weights and fits them perfectly into a set of GPUs. Below is a simple example showing how to write a
-2D parallel model in the Colossal-AI context.
-
-## Write a simple 2D parallel model
-
-```python
-from colossalai.nn import Linear2D
-import torch.nn as nn
-
-class MLP_2D(nn.Module):
-
- def __init__(self):
- super().__init__()
- self.linear_1 = Linear2D(in_features=1024, out_features=16384)
- self.linear_2 = Linear2D(in_features=16384, out_features=1024)
-
- def forward(self, x):
- x = self.linear_1(x)
- x = self.linear_2(x)
- return x
-```
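-
-A hedged usage sketch (it assumes the '2d' tensor parallel mode has been
-initialized, e.g. via `parallel = dict(tensor=dict(size=4, mode='2d'))`; the
-input sizes are placeholders, and in a real run each rank only holds its
-shard of the tensors):
-
-```python
-import torch
-
-model = MLP_2D()
-x = torch.randn(16, 1024)  # (batch, in_features) placeholder input
-out = model(x)             # forward pass through the distributed layers
-```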
-
-## Use pre-defined model
-
-For your convenience, our Model Zoo provides some prevalent models such as *BERT*, *ViT*,
-and *MLP-Mixer*. Feel free to customize them into different sizes to fit your specific needs.
diff --git a/docs/model_zh.md b/docs/model_zh.md
deleted file mode 100644
index e11ff7ae8..000000000
--- a/docs/model_zh.md
+++ /dev/null
@@ -1,26 +0,0 @@
-# Define a parallel model that fits your needs
-
-If you are training a huge MLP model with hundreds of millions of parameters, it will certainly be impossible to train the model directly on a single GPU. Don't worry: Colossal-AI solves this problem for you. You can still write your model the way you would for a single GPU; Colossal-AI automatically partitions the model parameters according to your parallel settings and stores them evenly across a set of GPUs. Below is a simple example showing how to write a 2D tensor parallel model in the Colossal-AI context.
-
-## A simple 2D tensor parallel model
-
-```python
-from colossalai.nn import Linear2D
-import torch.nn as nn
-
-class MLP_2D(nn.Module):
-
- def __init__(self):
- super().__init__()
- self.linear_1 = Linear2D(in_features=1024, out_features=16384)
- self.linear_2 = Linear2D(in_features=16384, out_features=1024)
-
- def forward(self, x):
- x = self.linear_1(x)
- x = self.linear_2(x)
- return x
-```
-
-## Using pre-defined models
-
-For your convenience, we provide some popular models, such as *BERT*, *ViT* and *MLP-Mixer*, in our Model Zoo. You can customize them to different sizes according to your needs.
diff --git a/docs/parallelization.md b/docs/parallelization.md
deleted file mode 100644
index 595925957..000000000
--- a/docs/parallelization.md
+++ /dev/null
@@ -1,240 +0,0 @@
-# Parallelization
-
-## Configure the Combination of Parallelization
-
-We support multiple parallelization strategies in our library.
-
-Hybrid parallelism in our codebase refers to the combination of data parallelism, pipeline parallelism
-and tensor parallelism (1D, 2D, 2.5D, 3D). Each parallelism strategy requires a different network topology and thus
-a different initializer for its distributed process group. You can initialize the corresponding process group by
-setting `parallel` in our config. The parallel configuration can be easily deployed as a dictionary in the
-configuration file. The configuration dictionary must obey the following format. The data parallel size will be
-inferred automatically from your settings for pipeline parallelism and tensor parallelism. The distributed
-environment will be set up by `colossalai.launch`.
-
-```python
-# general format (a schema, not runnable code)
-parallel = dict(
-    pipeline=dict(size=int),
-    tensor=dict(size=int, mode='1d' or '2d' or '2.5d' or '3d', kwargs=Any)
-)
-
-# this is ok
-parallel = dict(
- pipeline=dict(size=2),
- tensor=dict(size=4, mode='2d')
-)
-
-# this is ok
-parallel = dict(
- pipeline=2,
- tensor=dict(size=4, mode='2d')
-)
-
-# this is not ok
-# as you need to specify the mode for tensor parallelism
-parallel = dict(
- pipeline=2,
- tensor=4
-)
-
-# this is also ok, as tensor will default to size 1
-# and mode None
-parallel = dict(
- pipeline=2
-)
-
-# this is also ok, as pipeline will default to size 1
-parallel = dict(
- tensor=dict(size=4, mode='2d')
-)
-
-```
-
-The name of the dictionary variable should be **parallel**. All the arguments, even **parallel** itself, are optional,
-and the data, pipeline and tensor parallel sizes default to 1. The value of data, pipeline and tensor can be an
-int representing the size of the corresponding parallel dimension, or a dictionary with a key called "size". The key "mode"
-specifies the tensor parallelism method.
-
-**You can choose not to have 'parallel' in your configuration, in which case both pipeline and tensor will default to size 1.**
-
-
-## Data Parallel
-
-Data parallelism is the most common way to distribute your training task: the data is split into several shards, and each
-device trains on a single shard. The configuration for data parallelism is detected and set automatically for you; you do not
-have to set it explicitly in your configuration. There are two ways to handle the all-reduce in data parallel in Colossal-AI.
-
-1. If you specify gradient handlers, gradients will be all-reduced according to those handlers.
-2. Otherwise, PyTorch DistributedDataParallel will be used.
-
-In most cases, you will use the second mode, unless you have complex gradient handling; a hedged sketch of the first mode follows.
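-
-A minimal sketch of the first mode (`MyHandler` is a placeholder name for a
-gradient handler you have registered, mirroring the `gradient_handlers` entry
-from the config file documentation):
-
-```python
-# list gradient handlers in your config file; when this key is absent
-# (the default), PyTorch DistributedDataParallel performs the all-reduce
-gradient_handlers = [
-    dict(type='MyHandler', arg1=1, arg2=2),
-]
-```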
-
-## 1D, 2D, 2.5D and 3D Parallel
-
-To enable hybrid parallelism, we provide an array of tensor parallelism methods, and list the paper behind each
-method below. These parallel modes need to work with the distributed layers provided by Colossal-AI.
-
-- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
-
-- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
- 2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data, model weights and layer
- outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of $P = N^2$ devices where
- $N$ is the number of tensor chunks in a single dimension.
-
-- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
-  Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism which
-  further parallelizes 2D tensor parallelism. A total of $P = N^2 ∗ d$ processors are arranged into $d$ layers, where
-  each layer performs matrix multiplication operations independently with a dimension $N$.
-
-- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
- We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method
- achieves the optimal, $O(P^{1/3})$ communication overhead on $P$ processors, while both computation and memory usage
- are evenly distributed through optimized load balancing of parameters as well as activations.
-
-```python
-# 1D parallel
-parallel = dict(
- tensor=dict(size=4, mode='1d')
-)
-
-# 2D parallel
-parallel = dict(
- tensor=dict(size=4, mode='2d')
-)
-
-# 2.5D parallel
-parallel = dict(
- tensor=dict(size=8, mode='2.5d', depth=2)
-)
-
-# 3D parallel
-parallel = dict(
- tensor=dict(size=8, mode='3d')
-)
-```
-
-Once you specify the tensor parallel mode in your configuration, you can proceed to use its corresponding distributed
-operator. For example, if your mode is '2d', you can use `colossalai.nn.Linear2D` in your model construction.
-
-
-## Pipeline Parallel (experimental)
-
-Pipeline parallelism splits the model into several partitions by layer. For example, assume we have a simple
-model consisting of two linear layers. Given two GPUs, we can allocate the first linear layer to the first GPU
-and the second layer to the second GPU.
-
-You can set the number of pipeline stages in your configuration file. When the pipeline size is larger than 1, Colossal-AI
-will automatically create the pipeline schedule which defines the forward and backward steps.
-
-```python
-parallel = dict(
- pipeline=dict(size=4), # number of pipeline stages
-)
-```
-
-As PyTorch is based on a dynamic computation graph, the computation flow is not known until execution. To support pipeline parallelism, you have the following two ways to split your model:
-
-1. Split your model directly. Below is an example of ResNet18 split into two pipeline stages.
-```python
-import torch
-import torch.nn as nn
-from torchvision.models import resnet18
-from colossalai.context import ParallelMode
-from colossalai.core import global_context as gpc
-
-model = resnet18(num_classes=10)
-
-if gpc.get_local_rank(ParallelMode.PIPELINE) == 0:
- model = nn.Sequential(
- model.conv1,
- model.bn1,
- model.relu,
- model.maxpool,
- model.layer1,
- model.layer2
- )
-elif gpc.get_local_rank(ParallelMode.PIPELINE) == 1:
-
- class Flatten(nn.Module):
-
- def forward(self, x):
- return torch.flatten(x, 1)
-
- model = nn.Sequential(
- model.layer3,
- model.layer4,
- model.avgpool,
- Flatten(),
- model.fc
- )
-```
-
-
-2. Make sure your model inherits from `colossalai.nn.model.ModelFromConfig` and is registered in the
-`MODELS` registry. Define the `self.layers_cfg` attribute.
-Pass in a dict/Config object which specifies the parameters of your model.
-Use `colossalai.builder.pipeline.build_pipeline_model_from_cfg` to partition the layers.
-
-```python
-from colossalai.builder import build_pipeline_model_from_cfg
-from colossalai.nn.model import ModelFromConfig
-from colossalai.registry import MODELS
-
-
-@MODELS.register_module
-class MyModel(ModelFromConfig):
-
- def __init__(self, arg1, arg2, ...):
- ...
- self.layers_cfg = [
- dict(type='Linear', in_features=3, out_features=512),
- dict(type='Linear', in_features=512, out_features=512),
- ...
- ]
-
-
-model_cfg = dict(
- type='MyModel',
- arg1=1,
-    arg2=2,
- ...
-)
-
-# from config
-model = build_pipeline_model_from_cfg(model_cfg, num_chunks=1)
-
-# from torch.nn.Sequential
-# model = build_pipeline_model(sequential_model, num_model_chunks)
-
-```
-
-When your model is split into partitions, you can use PipelineSchedule to execute training.
-
-```python
-import colossalai
-from colossalai.engine.schedule import PipelineSchedule
-
-engine, train_dataloader, _, _ = colossalai.initialize(model, optimizer, criterion, train_dataloader)
-
-schedule = PipelineSchedule(num_microbatches=4)
-
-# interleaved pipeline
-# schedule = InterleavedPipelineSchedule(num_microbatches=4, num_model_chunks=2)
-
-# execute a training epoch
-data_iter = iter(train_dataloader)
-
-for i in range(len(train_dataloader)):
- output, label, loss = schedule.forward_backward_step(engine,
- data_iter,
- forward_only=False,
- )
-
-```
-
-This feature is still in development and is only experimental for now.
-
-## Sequence Parallel (experimental)
-
-Sequence parallelism supports long-sequence modelling such as document-level text understanding and medical imaging.
-This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
-This feature is still in development and is only experimental for now; a hedged configuration sketch follows.
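-
-A hedged configuration sketch, assuming sequence parallelism is selected like
-the other tensor parallel methods (the mode string 'sequence' is an assumption
-and is not confirmed by this document):
-
-```python
-parallel = dict(
-    tensor=dict(size=4, mode='sequence')
-)
-```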
diff --git a/docs/parallelization_zh.md b/docs/parallelization_zh.md
deleted file mode 100644
index 5154f464c..000000000
--- a/docs/parallelization_zh.md
+++ /dev/null
@@ -1,189 +0,0 @@
-# Parallelization
-
-## Configure the Combination of Parallelization Techniques
-
-Colossal-AI supports multiple parallelization techniques, including data parallelism, tensor parallelism (1D, 2D, 2.5D, 3D), pipeline parallelism and sequence parallelism. You can initialize the process groups of the distributed system by changing the `parallel` dictionary in the configuration file; this dictionary must follow the format below. The data parallel size is inferred from the pipeline parallel size and the tensor parallel size given in `parallel`.
-
-```python
-parallel = dict(
- pipeline=dict("size": int),
- tensor=dict("size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any)
-)
-```
-
-Note that the name of this dictionary variable must be **parallel**. All of its arguments, including `parallel` itself, are optional; if your code does not provide the variable, every parallel size defaults to 1, i.e. no parallelization is used. The values of data, pipeline and tensor in `parallel` are the sizes of data parallelism, pipeline parallelism and tensor parallelism respectively, and the value of `mode` is the tensor parallel method.
-
-## Data Parallelism
-
-Data parallelism is the most common parallelization technique: the data is split into several parts, and each part is trained on one device. Colossal-AI automatically detects the data parallel settings and sets up the environment for you, so no explicit configuration is needed. When the data parallel size is larger than 1, Colossal-AI automatically adds a distributed data sampler to the dataloader in order to shard the dataset.
-
-## 1D, 2D, 2.5D and 3D Tensor Parallelism
-
-To enable hybrid parallelism, we provide a series of tensor parallel methods; the paper corresponding to each method is listed below. These methods require the distributed layer structures provided by Colossal-AI.
-- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
-
-- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
-2D tensor parallelism relies on the SUMMA matrix multiplication algorithm and splits the input data along two different dimensions. The resulting tensor chunks are distributed over a 2D mesh using a total of $P = N^2$ devices, where $N$ is the number of tensor chunks along one dimension.
-
-- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
-Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallelism further splits the result of 2D tensor parallelism, arranging $P = N^2 ∗ d$ processors over $d$ layers; correspondingly, the matrix multiplication is also split into $d$ parts carried out on the different layers.
-
-- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
-We also introduce 3D tensor parallelism, which parallelizes the parameters of a neural network over a 3D processor cube. With $P$ processors, this technique achieves the optimal performance at an $O(P^{1/3})$ communication cost, while both compute and memory usage are distributed evenly over the $P$ processors.
-
-Examples of the `parallel` dictionary for these tensor parallel methods are given in the code below.
-
-```python
-# 1D parallel
-parallel = dict(
- pipeline=dict(size=1), # number of pipeline stages
- tensor=dict(size=4, mode='1d')
-)
-
-# 2D parallel
-parallel = dict(
- pipeline=dict(size=1), # number of pipeline stages
- tensor=dict(size=4, mode='2d')
-)
-
-# 2.5D parallel
-parallel = dict(
- pipeline=dict(size=1), # number of pipeline stages
- tensor=dict(size=8, mode='2.5d', depth=2)
-)
-
-# 3D parallel
-parallel = dict(
- pipeline=dict(size=1), # number of pipeline stages
- tensor=dict(size=8, mode='3d')
-)
-```
-
-## Pipeline Parallelism (under development)
-
-Pipeline parallelism splits a deep learning model into several parts by layer. For example, for a simple model consisting of two linear layers, we can use two GPUs and assign the first linear layer to one GPU and the second linear layer to the other. Of course, this example is only meant to illustrate how pipeline parallelism works and has no practical value.
-
-Since PyTorch computation is based on a dynamic computation graph, the computation flow cannot be determined before execution. To support pipeline parallelism in PyTorch, you need to add an extra attribute, `layers_cfg`, to your model class so that Colossal-AI understands the exact computation flow; `colossalai.nn.VanillaResNet` gives an example you can refer to.
-
-```python
-from colossalai.nn import BaseModel
-import torch
-
-class VanillaResNet(BaseModel):
-
- def __init__(
- self,
- num_cls: int,
- block_type: str,
- layers: List[int],
- norm_layer_type: str = 'BatchNorm2d',
- in_channels: int = 3,
- groups: int = 1,
- width_per_group: int = 64,
- zero_init_residual: bool = False,
- replace_stride_with_dilation: Optional[List[bool]] = None,
- dilations=(1, 1, 1, 1)
- ) -> None:
- super().__init__()
-
- ... # some model params
-
- self.layers_cfg = [
- # conv1
- dict(type='Conv2d',
- in_channels=in_channels,
- out_channels=self.inplanes,
- kernel_size=7,
- stride=2,
- padding=3,
- bias=False),
- # bn1
- dict(
- type=norm_layer_type,
- num_features=self.inplanes
- ),
- # relu
- dict(
- type='ReLU',
- inplace=True
- ),
- # maxpool
- dict(
- type='MaxPool2d',
- kernel_size=3,
- stride=2,
- padding=1
- ),
- # layer 1
- dict(
- inplanes=self.inplanes,
- planes=64,
- blocks=self.blocks[0],
- dilation=self.dilations[0],
- **self.reslayer_common_cfg
- ),
- # layer 2
- dict(
- inplanes=64 * self.block_expansion,
- planes=128,
- blocks=self.blocks[1],
- stride=2,
- dilate=replace_stride_with_dilation[0],
- dilation=self.dilations[1],
- **self.reslayer_common_cfg
- ),
- # layer 3
- dict(
- inplanes=128 * self.block_expansion,
- planes=256,
- blocks=layers[2],
- stride=2,
- dilate=replace_stride_with_dilation[1],
- dilation=self.dilations[2],
- **self.reslayer_common_cfg
- ),
- # layer 4
- dict(
- inplanes=256 * self.block_expansion,
- planes=512,
- blocks=layers[3], stride=2,
- dilate=replace_stride_with_dilation[2],
- dilation=self.dilations[3],
- **self.reslayer_common_cfg
- ),
- # avg pool
- dict(
- type='AdaptiveAvgPool2d',
- output_size=(1, 1)
- ),
- # flatten
- dict(
- type='LambdaWrapper',
- func=lambda mod, x: torch.flatten(x, 1)
- ),
- # linear
- dict(
- type='Linear',
- in_features=512 * self.block_expansion,
- out_features=num_cls
- )
- ]
-```
-
-You can manually set the number of pipeline stages in the configuration file. When the pipeline parallel size is larger than 1, Colossal-AI automatically creates the pipeline schedule that defines the forward and backward passes. In addition, you can use the `schedule` dictionary in the configuration file to define the number of microbatches trained in each step. The code below gives an example of configuring pipeline parallelism.
-
-```python
-parallel = dict(
- pipeline=dict(size=1), # number of pipeline stages
- tensor=dict(size=1, mode=None)
-)
-
-schedule = dict(
- num_microbatches = 4 # set the number of microbatches per step
-)
-```
-This technique is still experimental and under development.
-
-## Sequence Parallelism (under development)
-
-Sequence parallelism supports modelling long sequences of data, such as document-level text understanding and medical image analysis; it was proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120). This technique is still experimental and under development.
diff --git a/docs/run_demo.md b/docs/run_demo.md
deleted file mode 100644
index 60d7eebf5..000000000
--- a/docs/run_demo.md
+++ /dev/null
@@ -1,120 +0,0 @@
-# Quick demo
-
-Colossal-AI is an integrated large-scale deep learning system with efficient parallelization techniques. The system can
-accelerate model training on distributed systems with multiple GPUs by applying parallelization techniques. The system
-can also run on systems with only one GPU. Quick demos showing how to use Colossal-AI are given below.
-
-## Single GPU
-
-Colossal-AI can be used to train deep learning models on systems with only one GPU and achieve baseline
-performance. We provide an example of training ResNet on CIFAR10 with only one GPU. You can find this example in
-`examples/resnet_cifar10_data_parallel` in the repository. Detailed instructions can be found in its `README.md`.
-
-## Multiple GPUs
-
-Colossal-AI can be used to train deep learning models on distributed systems with multiple GPUs and drastically accelerate the
-training process by applying efficient parallelization techniques, which are elaborated in
-the [Parallelization](parallelization.md) section below.
-
-You can turn the ResNet example mentioned above into multi-GPU training by setting `--nproc_per_node` to the number of
-GPUs on your system. We also provide a Vision Transformer example which relies on
-training with more GPUs. You can find it in `examples/vit_b16_imagenet_data_parallel`; it also comes with a detailed
-`README.md`.
-
-
-## Sample Training Script
-
-Below is a typical way to train your model using Colossal-AI:
-
-```python
-import colossalai
-from colossalai.amp import AMP_TYPE
-from colossalai.logging import get_dist_logger
-from colossalai.trainer import Trainer, hooks
-from colossalai.utils import get_dataloader
-
-
-CONFIG = dict(
-    parallel=dict(
-        pipeline=1,
-        tensor=dict(size=1, mode=None)
-    ),
-    fp16=dict(
-        mode=AMP_TYPE.TORCH
-    ),
-    gradient_accumulation=4,
-    clip_grad_norm=1.0
-)
-
-def run_trainer():
- parser = colossalai.get_default_parser()
- args = parser.parse_args()
- colossalai.launch(config=CONFIG,
- rank=args.rank,
- world_size=args.world_size,
- host=args.host,
- port=args.port,
- backend=args.backend)
-
- logger = get_dist_logger()
-
-    # instantiate your components
-    model = MyModel()
-    optimizer = MyOptimizer(model.parameters(), ...)
-    criterion = MyCriterion()  # placeholder loss function, used by initialize below
-    train_dataset = TrainDataset()
-    test_dataset = TestDataset()
-    train_dataloader = get_dataloader(train_dataset, ...)
-    test_dataloader = get_dataloader(test_dataset, ...)
-    lr_scheduler = MyScheduler()
- logger.info("components are built")
-
- engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model,
- optimizer,
- criterion,
- train_dataloader,
- test_dataloader,
- lr_scheduler)
-
- trainer = Trainer(engine=engine,
- verbose=True)
-
- hook_list = [
- hooks.LossHook(),
- hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False),
- hooks.AccuracyHook(),
- hooks.TensorboardHook(log_dir='./tb_logs', ranks=[0]),
- hooks.LogMetricByEpochHook(logger),
- hooks.LogMemoryByEpochHook(logger),
- hooks.SaveCheckpointHook(checkpoint_dir='./ckpt')
- ]
-
- trainer.fit(
- train_dataloader=train_dataloader,
- test_dataloader=test_dataloader,
- epochs=NUM_EPOCH,
- hooks=hook_list,
- display_progress=True,
- test_interval=2
- )
-
-
-if __name__ == '__main__':
- run_trainer()
-```
-
-Alternatively, the `model` variable can be substituted with a self-defined model or a pre-defined model in our Model
-Zoo. The detailed substitution process is elaborated [here](model.md).
-
-## Features
-
-Colossal-AI provides a collection of parallel training components for you. We aim to support you with your development
-of distributed deep learning models just like how you write single-GPU deep learning models. We provide friendly tools
-to kickstart distributed training in a few lines.
-
-- [Data Parallelism](parallelization.md)
-- [Pipeline Parallelism](parallelization.md)
-- [1D, 2D, 2.5D, 3D and sequence parallelism](parallelization.md)
-- [Friendly trainer and engine](trainer_engine.md)
-- [Extensible for new parallelism](add_your_parallel.md)
-- [Mixed Precision Training](amp.md)
-- [Zero Redundancy Optimizer (ZeRO)](zero.md)
diff --git a/docs/run_demo_zh.md b/docs/run_demo_zh.md
deleted file mode 100644
index 5eadef6f2..000000000
--- a/docs/run_demo_zh.md
+++ /dev/null
@@ -1,67 +0,0 @@
-# Quick Demo
-
-Colossal-AI is a large-scale deep learning system that includes efficient parallelization techniques. It can use these techniques to effectively accelerate model training on distributed systems with multiple GPUs, and it also runs on non-distributed systems with a GPU. Below is a quick-start guide for Colossal-AI.
-
-## Single-GPU systems
-
-Colossal-AI achieves the current baseline efficiency when training models on a non-distributed system with a single GPU. [Here](https://colab.research.google.com/drive/1fJnqqFzPuzZ_kn1lwCpG2nh3l2ths0KE?usp=sharing#scrollTo=cQ_y7lBG09LS) we give a Google Colab example showing how to use Colossal-AI and the CIFAR10 dataset to train a LeNet model on a non-distributed system.
-
-## Multi-GPU systems
-
-When training deep learning models on distributed systems with multiple GPUs, Colossal-AI can use efficient parallelization techniques to significantly accelerate the training process; these techniques are detailed in the [Parallelization](parallelization.md) section below. The command below trains a ViT model on a distributed system with four GPUs, where `HOST` is the IP address of your distributed system. Note that it uses the [Slurm](https://slurm.schedmd.com/documentation.html) job scheduler.
-
-```bash
-HOST=xxx.xxx.xxx.xxx srun ./scripts/slurm_dist_train.sh ./examples/run_trainer.py ./configs/vit/vit_2d.py
-```
-
-`./configs/vit/vit_2d.py` is a [configuration file](config.md); Colossal-AI uses configuration files to define the parameters needed during training, such as the model type, the dataset, the optimizer and the learning rate scheduler. You can train different models by writing different configuration files. `./examples/run_trainer.py` is a standard training script whose full code is attached below; it reads the training parameters from the configuration file and trains the model.
-
-```python
-import colossalai
-from colossalai.core import global_context as gpc
-from colossalai.logging import get_dist_logger
-from colossalai.trainer import Trainer
-
-
-def run_trainer():
- engine, train_dataloader, test_dataloader = colossalai.initialize()
- logger = get_dist_logger()
- logger.info("engine is built", ranks=[0])
-
- trainer = Trainer(engine=engine,
- verbose=True)
- logger.info("trainer is built", ranks=[0])
-
- logger.info("start training", ranks=[0])
- trainer.fit(
- train_dataloader=train_dataloader,
- test_dataloader=test_dataloader,
- epochs=gpc.config.num_epochs,
- hooks_cfg=gpc.config.hooks,
- display_progress=True,
- test_interval=2
- )
-
-
-if __name__ == '__main__':
- run_trainer()
-```
-
-The `model` variable in the code above can be replaced with a custom model or one of the pre-defined models in our `Model Zoo` in order to train different models; [here](model.md) details how to perform such a replacement.
-
-## Features
-
-Colossal-AI provides a series of parallel components to accelerate your model training; we introduce them in the sections below. Our goal is to make developing distributed deep learning models as convenient as developing single-GPU models.
-
-- [Data Parallelism](parallelization.md)
-- [1D, 2D, 2.5D, 3D tensor parallelism and sequence parallelism](parallelization.md)
-- [Pipeline Parallelism](parallelization.md)
-- [Friendly trainer and engine](trainer_engine.md)
-- [Extensible for new parallelism](add_your_parallel.md)
-- [Mixed Precision Training](amp.md)
-- [Zero Redundancy Optimizer (ZeRO)](zero.md)
diff --git a/docs/trainer_engine.md b/docs/trainer_engine.md
deleted file mode 100644
index fbca20028..000000000
--- a/docs/trainer_engine.md
+++ /dev/null
@@ -1,132 +0,0 @@
-# Colossal-AI Engine & Customize Your Trainer
-
-## Colossal-AI engine
-
-To better understand how the `Engine` class works, let's start from the concept of the process function used by common
-engines. The process function usually controls the behaviour over one batch of a dataset, and the `Engine` class simply controls the
-process function. A standard process function is given in the following code block.
-
-```python
-def process_function(dataloader, model, criterion, optim):
- optim.zero_grad()
- data, label = next(dataloader)
- output = model(data)
- loss = criterion(output, label)
- loss.backward()
-    optim.step()
-```
-
-The `Engine` class is a high-level wrapper around these frequently-used functions; it preserves the PyTorch-like function signature while integrating our features.
-
-```python
-import torch
-import torch.nn as nn
-import torchvision.models as models
-import colossalai
-from colossalai.engine import Engine
-from torchvision.datasets import CIFAR10
-
-model = models.resnet18()
-criterion = nn.CrossEntropyLoss()
-optimizer = torch.optim.Adam(model.parameters())
-
-dataset = CIFAR10(...)
-dataloader = colossalai.utils.get_dataloader(dataset)
-
-engine, dataloader, _, _ = colossalai.initialize(model, optimizer, criterion, dataloader)
-
-# example of a training iteration
-for img, label in dataloader:
- engine.zero_grad()
- output = engine(img)
- loss = engine.criterion(output, label)
- engine.backward(loss)
- engine.step()
-
-```
-
-More information regarding the class can be found in the API references.
-
-## Customize your trainer
-
-### Overview
-
-To learn how to customize a trainer that meets your needs, let's first take a look at the `Trainer` class. We highly
-recommend that you read the *Get Started*
-section and *Colossal-AI engine* first.
-
-The `Trainer` class enables researchers and engineers to use our system more conveniently. Instead of having to write
-your own scripts, you can simply construct your own trainer with the `Trainer` class, just as we do in the
-following code block.
-
-```python
-trainer = Trainer(engine)
-```
-
-After that, you can use the `fit` method to train or evaluate your model. In order to make our `Trainer` class even more
-powerful, we incorporate a set of handy tools into the class. For example, you can monitor or record the running states
-and metrics which indicate the current performance of the model. These functions are realized by hooks. The `BasicHook`
-class allows you to execute your hook functions at specified times. We have already created some practical hooks for you,
-as listed below; all you need to do is pick the ones that suit your needs. Detailed descriptions of the
-class can be found in the API references.
-
-These hook functions record metrics, elapsed time and memory usage, and write them to the log after each epoch. Besides,
-they print the current loss and accuracy so that users can monitor the performance of the model.
-
-```python
-import colossalai
-from colossalai.trainer import hooks, Trainer
-from colossalai.utils import MultiTimer
-from colossalai.logging import get_dist_logger
-
-engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(...)
-
-timer = MultiTimer()
-logger = get_dist_logger()
-
-# if you want to save log to file
-logger.log_to_file('./logs/')
-
-trainer = Trainer(
- engine=engine,
- timer=timer,
- logger=logger
-)
-
-hook_list = [
- hooks.LossHook(),
- hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False),
- hooks.AccuracyHook(),
- hooks.TensorboardHook(log_dir='./tb_logs', ranks=[0]),
- hooks.LogMetricByEpochHook(logger),
- hooks.LogMemoryByEpochHook(logger),
- hooks.LogTimingByEpochHook(timer, logger),
- hooks.SaveCheckpointHook(checkpoint_dir='./ckpt')
-]
-
-trainer.fit(
- train_dataloader=train_dataloader,
- epochs=NUM_EPOCHS,
- test_dataloader=test_dataloader,
- test_interval=1,
- hooks=hook_list,
- display_progress=True
-)
-
-```
-
-### Hook
-
-If you have specific needs, feel free to extend our `BaseHook` class to add your own functions, or our `MetricHook`
-class to write a metric collector. These hook functions can be called at different stages in the trainer's life cycle.
-Besides, you can define the priorities of all hooks to arrange their execution order. More information can be
-found in the API references; a hedged sketch of a custom hook follows.
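-
-A hedged sketch of a custom hook (the constructor signature and the callback
-name `after_train_epoch` are assumptions inferred from the built-in hooks,
-not confirmed by this document):
-
-```python
-from colossalai.trainer import hooks
-
-class EpochPrinterHook(hooks.BaseHook):
-    """A hypothetical hook that prints collected metrics after each epoch."""
-
-    def __init__(self, priority=10):
-        super().__init__(priority=priority)
-
-    def after_train_epoch(self, trainer):
-        # metric objects are stored by name (see the Metric section below,
-        # which refers to this container as `runner.states['metrics']`)
-        print(trainer.states['metrics'])
-```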
-
-### Metric
-
-You can write your own metrics by extending our `Metric` class; it should be used together with the `MetricHook` class. When you
-write your own metric hooks, please set the priority carefully and make sure the hook is called before any other hooks which
-might require the results of the metric hook.
-
-We have already provided some metric hooks, and we store metric objects in `runner.states['metrics']`. It is a dictionary,
-and metrics can be accessed by their names.
diff --git a/docs/trainer_engine_zh.md b/docs/trainer_engine_zh.md
deleted file mode 100644
index 5729a0599..000000000
--- a/docs/trainer_engine_zh.md
+++ /dev/null
@@ -1,84 +0,0 @@
-# Engine & Trainer
-
-## Engine
-
-To better understand how our `Engine` class works, we first need to understand the concept of the process function used by common engines. The process function controls the behaviour over one batch of the dataset, and it is exactly this process function that the `Engine` class controls. A standard example of a process function is given in the code block below.
-
-```python
-def process_function(dataloader, model, criterion, optim):
- optim.zero_grad()
- data, label = next(dataloader)
- output = model(data)
- loss = criterion(output, label)
- loss.backward()
-    optim.step()
-```
-
-In `ignite.engine` and `keras.engine`, the process function has to be provided by the user. However, it is hard for users to write a process function for pipeline parallelism. To offer users convenient hybrid parallelism, we provide a powerful `Engine` class which supports pipeline parallelism and a strategy in which the forward and backward passes do not interleave. You can also use a pre-defined learning rate scheduler in the `Engine` class to adjust the learning rate during training.
-
-To construct an engine, you only need to define variables such as `model`, `criterion`, `optimizer`, `lr_scheduler` and `schedule`; the code block below gives an example.
-**If you use `colossalai.initialize`, the engine is built automatically from the config file.**
-
-```python
-import torch
-import torch.nn as nn
-import torchvision.models as models
-import colossalai
-from colossalai.engine import Engine
-
-model = models.resnet18()
-criterion = nn.CrossEntropyLoss()
-optimizer = torch.optim.Adam(model.parameters())
-lr_scheduler = colossalai.nn.lr_scheduler.CosineAnnealingLR(optimizer, 1000)
-schedule = colossalai.engine.NonPipelineSchedule()
-
-MyEngine = Engine(
- model=model,
- criterion=criterion,
- optimizer=optimizer,
- step_schedule=schedule
-)
-```
-
-More information about this class can be found in the API references.
-
-## Trainer
-
-To learn how to customize a trainer that fits your needs, you first need to understand our `Trainer` class.
-
-The `Trainer` class is designed to let researchers and engineers use our system more conveniently. Instead of writing your own scripts, you can simply call the `Trainer` class to construct your trainer, as done in the code block below.
-
-```python
-MyTrainer = Trainer(my_engine)
-```
-
-After that, you can use the `fit` method to train or evaluate your model. In addition, to make our `Trainer` class more powerful, we include a series of handy tools. For example, you can continuously monitor and record the running states and current performance of the model during training; these functions are all realized through hooks. The `BasicHook` class lets you execute your hook functions at specified times. As shown in the code block below, we have pre-defined some practical hooks for you; all you need to do is find the ones that meet your needs. More information about this class can be found in the API references.
-
-```python
-hooks = [
- dict(type='LogMetricByEpochHook'),
- dict(type='LogTimingByEpochHook'),
- dict(type='LogMemoryByEpochHook'),
- dict(type='AccuracyHook'),
- dict(type='LossHook'),
- dict(type='TensorboardHook', log_dir='./tfb_logs'),
- dict(type='SaveCheckpointHook', interval=5, checkpoint_dir='./ckpt'),
- dict(type='LoadCheckpointHook', epoch=20, checkpoint_dir='./ckpt')
-]
-```
-
-The hook functions above can record model performance metrics, training time, memory usage and other information, and write them to the log after each epoch. Besides, they print the current loss and accuracy in real time, so that users can monitor the performance of the model.
-
-### Hooks
-
-If you have specific needs, you can inherit from our `BaseHook` class to add your own hook functions, or from our `MetricHook` class to write the metrics you need. These hook functions can be executed at 12 points in the `Trainer` life cycle. More information about this class can be found in the API references.
-
-### Metrics
-
-You can provide the metrics you need by inheriting from our `Metric` class, which should be used together with the `MetricHook` class. When writing your own metric hooks, please set the priority carefully to make sure your hook runs before the hooks that need the metric results.
-
-We have already defined some metric hooks for your reference; the metric objects are stored in `runner.states['metrics']`.
diff --git a/docs/zero.md b/docs/zero.md
deleted file mode 100644
index d2a6c1658..000000000
--- a/docs/zero.md
+++ /dev/null
@@ -1,96 +0,0 @@
-# Zero Redundancy Optimizer and ZeRO offload
-
-The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning three
-model states (optimizer states, gradients, and parameters) instead of replicating them.
-By doing so, memory efficiency is boosted drastically compared to classic data parallelism while the computational granularity
-and communication efficiency are retained.
-
-1. **ZeRO Level 1**: The optimizer states (e.g., for [Adam optimizer](https://arxiv.org/abs/1412.6980), 32-bit weights, and the
-first and second momentum estimates) are partitioned across the processes, so that each process updates only its partition.
-2. **ZeRO Level 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process
-only stores the gradients corresponding to its partition of the optimizer states.
-3. **ZeRO Level 3**: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and
-partition them during the forward and backward passes.
-
-## Getting Started with ZeRO
-
-If you are training models with Colossal-AI, enabling ZeRO DP and offloading is easy: just add several lines to your configuration file. We support configuration for levels 2 and 3; for a level-1 optimizer, use the [PyTorch native implementation](https://pytorch.org/tutorials/recipes/zero_redundancy_optimizer.html).
-Below are a minimal level-2 sketch and a few examples of ZeRO-3 configurations.
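-
-A hedged ZeRO-2 sketch, written by analogy with the level-3 form shown below
-(the exact keys accepted at level 2 are an assumption):
-
-```python
-zero = dict(
-    level=2,
-    dynamic_loss_scale=True,
-    clip_grad=1.0
-)
-```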
-
-### Example of ZeRO-3 Configurations
-
-You can refer to the [DeepSpeed configuration](https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training) for details.
-Here we use `Adam` as the initial optimizer.
-
-1. Use ZeRO to partition the optimizer states (level 1), gradients (level 2), and parameters (level 3).
- ```python
- zero = dict(
- level=3,
- dynamic_loss_scale=True,
- clip_grad=1.0
- )
- ```
-
-2. Additionally offload the optimizer states and computations to the CPU.
- ```python
- zero = dict(
- level=3,
- offload_optimizer_config=dict(
- device='cpu',
- pin_memory=True,
- fast_init=True
- ),
- ...
- )
- ```
-3. Save even more memory by offloading parameters to the CPU memory.
- ```python
- zero = dict(
- level=3,
- offload_optimizer_config=dict(
- device='cpu',
- pin_memory=True,
- fast_init=True
- ),
- offload_param_config=dict(
- device='cpu',
- pin_memory=True,
-        max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU
- ),
- ...
- )
- ```
-4. Save even MORE memory by offloading to NVMe (if available on your system):
- ```python
- zero = dict(
- level=3,
- offload_optimizer_config=dict(
- device='nvme',
- pin_memory=True,
- fast_init=True,
- nvme_path='/nvme_data'
- ),
- offload_param_config=dict(
- device='nvme',
- pin_memory=True,
- max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU,
- nvme_path='/nvme_data'
- ),
- ...
- )
- ```
-
-Note that `fp16` is automatically enabled when using ZeRO. This relies on `AMP_TYPE.NAIVE` in Colossal-AI AMP module.
-
-### Training
-
-Note that if your model is too large to fit within the memory when using ZeRO-3, you should use `colossalai.zero.zero3_model_context` to construct your model:
-
-```python
-from colossalai.zero import zero3_model_context
-
-with zero3_model_context():
- model = Model()
-```
-
-Once you have completed your configuration, just use `colossalai.initialize()` to initialize your training.
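-
-A minimal end-to-end sketch (model, optimizer, criterion and dataloader are
-placeholders; the `initialize` signature follows the other examples in these
-docs):
-
-```python
-import colossalai
-from colossalai.zero import zero3_model_context
-
-# build the model inside the ZeRO-3 context so parameters are partitioned
-with zero3_model_context():
-    model = Model()
-
-engine, train_dataloader, _, _ = colossalai.initialize(model,
-                                                       optimizer,
-                                                       criterion,
-                                                       train_dataloader)
-```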
diff --git a/docs/zero_zh.md b/docs/zero_zh.md
deleted file mode 100644
index df170dcb0..000000000
--- a/docs/zero_zh.md
+++ /dev/null
@@ -1,90 +0,0 @@
-# ZeRO Optimizer and Offload
-
-The ZeRO optimizer partitions the three kinds of model states (optimizer states, gradients and parameters) and stores them across different processes, thereby removing the memory redundancy of data parallelism, which traditionally keeps a full replica of all three states in every process. Compared with the traditional approach, the ZeRO optimizer greatly improves memory efficiency while maintaining good communication efficiency.
-
-1. **ZeRO Level 1**: The optimizer states (e.g., for the [Adam optimizer](https://arxiv.org/abs/1412.6980), the 32-bit parameters and the estimates of the first and second moments) are partitioned across the processes, so that each process only updates the part of the parameters corresponding to it.
-2. **ZeRO Level 2**: The 32-bit gradients used to update the model parameters are also partitioned across the processes at this level. The gradient partitioning corresponds one-to-one with the parameter partitioning of level 1, so the gradients held by each process are exactly those used to update the model parameters stored on that process.
-3. **ZeRO Level 3**: The 16-bit model parameters are partitioned across the processes at this level; ZeRO-3 automatically gathers and partitions these parameters during the forward and backward passes.
-
-## Using the ZeRO optimizer
-
-Enabling the ZeRO optimizer in Colossal-AI only requires configuring it in your configuration file; below are some example configuration files using ZeRO-3.
-
-### Using the ZeRO optimizer with offload
-
-Here we use `Adam` as the initial optimizer.
-
-1. Use ZeRO to partition the optimizer states (level 1), gradients (level 2) and model parameters (level 3):
- ```python
- optimizer = dict(
- type='Adam',
- lr=0.001,
- weight_decay=0
- )
-
- zero = dict(
- level=3,
- dynamic_loss_scale=True,
- clip_grad=1.0
- )
- ```
-2. Additionally offload the optimizer states and computations to the CPU:
- ```python
- zero = dict(
- offload_optimizer_config=dict(
- device='cpu',
- pin_memory=True,
- fast_init=True
- ),
- ...
- )
- ```
-3. Offload the model parameters to the CPU to save GPU memory:
- ```python
- zero = dict(
- offload_optimizer_config=dict(
- device='cpu',
- pin_memory=True,
- fast_init=True
- ),
- offload_param_config=dict(
- device='cpu',
- pin_memory=True,
-        max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU
- ),
- ...
- )
- ```
-4. Offload the parameters to NVMe to save even more GPU memory (if NVMe is available on your system):
- ```python
- zero = dict(
- offload_optimizer_config=dict(
- device='nvme',
- pin_memory=True,
- fast_init=True,
- nvme_path='/nvme_data'
- ),
- offload_param_config=dict(
- device='nvme',
- pin_memory=True,
- max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU,
- nvme_path='/nvme_data'
- ),
- ...
- )
- ```
-
-Note that `fp16` is automatically enabled when using ZeRO.
-
-### Training with the ZeRO optimizer
-
-Note that when using ZeRO-3, if your model is too large to fit within the memory, you should use `colossalai.zero.zero3_model_context` to construct your model:
-
-```python
-from colossalai.zero import zero3_model_context
-
-with zero3_model_context():
- model = Model()
-```
-
-Once you have completed the configuration above, run `colossalai.initialize()` to start your training.