[doc] migrate the markdown files (#2652)

pull/2656/head
Frank Lee 2023-02-09 14:21:38 +08:00 committed by GitHub
parent a020eecc70
commit 85b2303b55
84 changed files with 9729 additions and 0 deletions

.github/workflows/check_doc_on_pr.yml vendored Normal file

@ -0,0 +1,23 @@
name: Check Documentation on PR
on:
pull_request:
paths:
- 'docs/**'
jobs:
check-i18n:
name: Check docs in diff languages
if: |
github.event.pull_request.draft == false &&
github.base_ref == 'main' &&
github.event.pull_request.base.repo.full_name == 'hpcaitech/ColossalAI'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
with:
python-version: '3.8.14'
- run: python .github/workflows/scripts/check_doc_i18n.py -d docs/source


@ -0,0 +1,67 @@
import argparse
import os
def compare_dirs(dir1, dir2):
# First, we need to check if the two directories exist
if not os.path.exists(dir1) or not os.path.exists(dir2):
return False
# Now, we compare the list of items in each directory
items1 = os.listdir(dir1)
items2 = os.listdir(dir2)
# If the number of items in each directory is different, the directories are different
if len(items1) != len(items2):
return False
# For each item in the first directory, we check if there is a corresponding item in the second directory
for item in items1:
item_path1 = os.path.join(dir1, item)
item_path2 = os.path.join(dir2, item)
# If the corresponding item doesn't exist in the second directory, the directories are different
if not os.path.exists(item_path2):
print(f'Found mismatch: {item_path1}, {item_path2}')
return False
# If the corresponding item is a directory, we compare the two directories recursively
if os.path.isdir(item_path1) and os.path.isdir(item_path2):
if not compare_dirs(item_path1, item_path2):
print(f'Found mismatch: {item_path1}, {item_path2}')
return False
# both are files
elif os.path.isfile(item_path1) and os.path.isfile(item_path2):
continue
# If the corresponding item is not a file or a directory, the directories are different
else:
print(f'Found mismatch: {item_path1}, {item_path2}')
return False
# If all items are the same, the directories are the same
return True
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('-d', '--directory', help="The directory where the multi-language source files are kept.")
args = parser.parse_args()
i18n_folders = os.listdir(args.directory)
i18n_folders = [os.path.join(args.directory, val) for val in i18n_folders]
if len(i18n_folders) > 1:
for i in range(1, len(i18n_folders)):
dir1 = i18n_folders[0]
dir2 = i18n_folders[i]
print(f'comparing {dir1} vs {dir2}')
match = compare_dirs(i18n_folders[0], i18n_folders[i])
if not match:
print(
f"{dir1} and {dir2} don't match, please ensure that your documentation is available in different languages"
)
else:
print(f"{dir1} and {dir2} match")


@ -0,0 +1,27 @@
# Setup
## Announcement
Our auto-parallel feature is an alpha version and still under development. We will keep updating it to make it more stable. If you encounter any problems, please feel free to raise an issue.
## Requirements
We need some extra dependencies to support auto-parallel. Please install them before using auto-parallel.
### Install PyTorch
Only PyTorch 1.12 is supported at the moment; other versions are not tested. More versions will be supported in the future.
```bash
#conda
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch
#pip
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113
```
### Install pulp and coin-or-cbc
```bash
pip install pulp
conda install -c conda-forge coin-or-cbc
```
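After installation, you can optionally run a quick sanity check (just a suggestion, not part of the official setup) to confirm that pulp can see the CBC solver:
```python
# Optional sanity check: list the solvers pulp can currently use.
# If coin-or-cbc is installed correctly, a CBC-based solver should appear.
import pulp

print(pulp.listSolvers(onlyAvailable=True))
```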


@ -0,0 +1,47 @@
# Introduction
In recent years, the deployment of large-scale machine learning models has become increasingly important. However, distributed training systems often require **manual parallelization plans**, which can be complex and require expert knowledge in system engineering and configuration. This can be a challenge for most AI developers without the necessary skills. The need for manual parallelization can make deploying large-scale machine learning models difficult and expensive.
**Colossal-Auto** simplifies the process of deploying large-scale machine learning models for AI developers. Compared to other solutions that require manual configuration of complex parallel policies and model modification, Colossal-Auto only requires one line of code from the user, along with cluster information and model configurations, to enable distributed training. Technically, it seamlessly **integrates with popular AI model frameworks like Hugging Face and Timm.**
## Overview
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/auto_parallel/auto_parallel.png"/>
</figure>
## Usage
```python
# wrap the model using auto_engine
model = autoparallelize(model, meta_input_samples)
# normal training loop
...
```
## Graph Tracing
Colossal-Auto is **the first auto-parallelism system** that uses static graph analysis based on the PyTorch framework. Obtaining a static execution plan for PyTorch, a dynamic graph framework, has long been an area of research in the field of machine learning systems. Colossal-Auto uses ColoTracer, a forked version of the torch.FX Tracer, to guide the search for an optimal parallelization strategy. The meta-information of each tensor, such as tensor shape, dims, dtype, etc., is computed and recorded during the tracing process. This approach has the advantage of better generalization, as it is not tied to specific models or configurations.
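To give a feel for the kind of per-tensor meta-information recorded during tracing, here is a minimal sketch using vanilla `torch.fx` and its `ShapeProp` pass (ColoTracer itself is not shown; the toy model and sample input are illustrative assumptions):
```python
import torch
import torch.nn as nn
from torch.fx import symbolic_trace
from torch.fx.passes.shape_prop import ShapeProp

# A toy model standing in for a real network.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))

# Trace the model into a static graph, then propagate a sample input
# so that every node carries tensor meta-information (shape, dtype, ...).
gm = symbolic_trace(model)
ShapeProp(gm).propagate(torch.randn(4, 32))

for node in gm.graph.nodes:
    meta = node.meta.get('tensor_meta')
    if meta is not None:
        print(f'{node.op:15s} {node.name:12s} shape={tuple(meta.shape)} dtype={meta.dtype}')
```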
## Fine-grained Parallelism Search
Colossal-AI's auto-parallelism searches for strategies with respect to each operand, with the goal of achieving the fastest runtime while meeting memory budget constraints. It ultimately determines the actual training-time strategy, including the tensor split strategy for each tensor, the type of communication operators to be inserted between different computing nodes, whether to replace operators, etc. The tensor, data, and hybrid parallelism (such as the column and row splits used by NVIDIA in Megatron-LM) found in other parallelism systems are all subsets of the strategies that can be searched by Colossal-AI. In addition to these parallelisms that can be manually specified, Colossal-AI can choose a unique parallelism method for each operation, potentially finding a better parallelism strategy than what human experts could provide.
## Distributed Tensor and Shape-Consistency System
The Colossal-AI system uses a device-mesh, similar to PyTorch's latest DTensor release, to manage its cluster. Colossal-AI uses a sharding-spec to annotate the storage status of each tensor and facilitate their distribution across the cluster. The system also employs a shape-consistency manager to automatically transform tensors between different sharding-specs, allowing for seamless slicing and dicing of tensors, while the shape-consistency manager ensures that the output of upstream operands is consistently stored in the cluster, regardless of how the input of downstream operands is stored. This makes Colossal-AI highly versatile and easy to use without users worrying about the storage status of tensors when performing operations on them.
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/auto_parallel/shape_consistency.png"/>
</figure>
Here are some key advantages of Colossal-AI compared to PyTorch DTensor:
- Colossal-AI's device-mesh uses cluster performance metrics and profiling results to estimate the time consumption of different communication operators. This helps Colossal-AI optimize communication between nodes and improve overall system efficiency.
- Colossal-AI's shape-consistency manager uses a greedy search algorithm to find relatively efficient ways to transform tensors between different sharding-specs, rather than simply transforming dimensions one by one. This can lead to more efficient and effective transformations.
- The integration of all-to-all operations in Colossal-AI increases the scalability of the system by enabling more efficient communication between nodes. This is especially useful for large-scale machine learning tasks that require the transfer of large amounts of data between nodes.


@ -0,0 +1,17 @@
# Quick Demo
Colossal-Auto simplifies the process of deploying large-scale machine learning models for AI developers. Compared to other solutions that require manual configuration of complex parallel policies and model modification, Colossal-Auto only requires one line of code from the user, along with cluster information and model configurations, to enable distributed training. Quick demos showing how to use Colossal-Auto are given below.
### 1. Basic usage
Colossal-Auto can be used to find a hybrid SPMD parallel strategy that includes data and tensor parallelism (i.e., 1D, 2D, sequential) for each operation. You can follow the [GPT example](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt/experiments/auto_parallel).
Detailed instructions can be found in its `README.md`.
### 2. Integration with activation checkpoint
Colossal-Auto's automatic search function for activation checkpointing finds the most efficient checkpoint within a given memory budget, rather than just aiming for maximum memory compression. To avoid a lengthy search process for an optimal activation checkpoint, Colossal-Auto has implemented a two-stage search process. This allows the system to find a feasible distributed training solution in a reasonable amount of time while still benefiting from activation checkpointing for memory management. The integration of activation checkpointing in Colossal-AI improves the efficiency and effectiveness of large model training. You can follow the [Resnet example](TBA).
Detailed instructions can be found in its `README.md`.
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/auto_parallel/auto_ckpt.jpg"/>
</figure>


@ -0,0 +1,124 @@
# Add Your Own Parallel Mode
Author: Shenggui Li, Yongbin Li
**Prerequisite:**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
## Introduction
To enable researchers and engineers to extend our system to other novel large-scale distributed training algorithms
with less effort, we have decoupled various components in the training lifecycle. You can implement your own
parallelism by simply inheriting from the base class.
The main components are:
1. `ProcessGroupInitializer`
2. `GradientHandler`
3. `Schedule`
**This currently requires some changes to the source code, so we recommend that you install from source with the `-e` flag.
The `-e` flag makes the installation editable, so your code changes will be reflected in your Python runtime.
We will work on this to avoid changes to the source code in future releases.**
## Process Group Initializer
Parallelism is often managed by process groups where processes involved in the same parallel algorithm are placed in the same
process group. For different parallel algorithms, different process groups need to be created. Colossal-AI provides a
global context for users to easily manage their process groups. If you wish to add a new process group, you can easily
define a new class and set it in your configuration file. To define your own way of creating process groups, you can
follow the steps below to create a new distributed initialization.
1. Add your parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
```python
class ParallelMode(Enum):
GLOBAL = 'global'
DATA = 'data'
PIPELINE = 'pipe'
...
NEW_MODE = 'new_mode' # define your mode here
```
2. Create a `ProcessGroupInitializer`. You can refer to examples given in `colossalai.context.dist_group_initializer`. The
first six arguments are fixed. `ParallelContext` will pass in these arguments for you. If you need to set other
arguments, you can add them after the fixed ones, like `arg1, arg2` in the example below. Lastly, register your initializer to the
registry by adding the decorator `@DIST_GROUP_INITIALIZER.register_module`.
```python
# sample initializer class
@DIST_GROUP_INITIALIZER.register_module
class MyParallelInitializer(ProcessGroupInitializer):
def __init__(self,
rank: int,
world_size: int,
config: Config,
data_parallel_size: int,
                 pipeline_parallel_size: int,
tensor_parallel_size: int,
arg1,
arg2):
super().__init__(rank, world_size, config)
self.arg1 = arg1
self.arg2 = arg2
# ... your variable init
def init_parallel_groups(self):
# initialize your process groups
pass
```
Then, you can insert your new initializer into the current mode-to-initializer mapping
in `colossalai.constants.INITIALIZER_MAPPING`. You can modify the file or insert a new key-value pair dynamically.
```python
colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
```
3. Set your initializer in your config file. You can pass in your own arguments if there are any. This allows
the `ParallelContext` to create your initializer and initialize your desired process groups.
```python
parallel = dict(
pipeline=dict(size=1),
tensor=dict(size=x, mode='new_mode') # this is where you enable your new parallel mode
)
```
## Gradient Handler
Gradient handlers are objects which execute the all-reduce operations on parameters' gradients. As different all-reduce
strategies may be executed for different kinds of parallelism, users can
inherit `colossalai.engine.gradient_handler.BaseGradientHandler` to implement their strategies. Currently, the library
uses the normal data parallel gradient handler which all-reduces the gradients across data parallel ranks. The data
parallel gradient handler is added to the engine automatically if data parallel is detected. You can add your own
gradient handler like below:
```python
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine import BaseGradientHandler
@GRADIENT_HANDLER.register_module
class YourGradientHandler(BaseGradientHandler):
def handle_gradient(self):
do_something()
```
Afterwards, you can specify the gradient handler you want to use in your configuration file.
```python
gradient_handlers = [
dict(type='YourGradientHandler'),
]
```
## Schedule
The schedule specifies how to execute the forward and backward passes. Currently, Colossal-AI provides pipeline and non-pipeline
schedules. If you want to modify how the forward and backward passes are executed, you can
inherit `colossalai.engine.schedule.BaseSchedule` and implement the `forward_backward_step` function.
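As a rough sketch (the argument list of `forward_backward_step` may differ across Colossal-AI versions, so treat the signature below as an assumption and check the `BaseSchedule` source in your installed version), a custom schedule would look like this:
```python
from colossalai.engine.schedule import BaseSchedule


class MyCustomSchedule(BaseSchedule):
    # Assumed signature, modelled on the built-in non-pipeline schedule.
    def forward_backward_step(self, engine, data_iter, forward_only=False,
                              return_loss=True, return_output_label=True):
        # Fetch a batch, run the forward pass through engine(...), call
        # engine.backward(loss) when forward_only is False, and return the
        # output, label and loss as the built-in schedules do.
        raise NotImplementedError
```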


@ -0,0 +1,36 @@
# Define your own parallel model
Author: Zhengda Bian, Yongbin Li
> ⚠️ We are working on this documentation to make it more detailed. We will introduce the mechanism of different parallelism
> and how to use them to write a model.
Let's say that you have a huge MLP model with billions of parameters and its extremely large hidden layer size makes it
impossible to fit into a single GPU directly. Don't worry, Colossal-AI is here to help you sort things out. With the help of Colossal-AI,
you can write your model in the familiar way in which you used to write models for a single GPU, while Colossal-AI automatically
splits your model weights and fits them across a set of GPUs. Below is a simple example showing how to write a
2D parallel model in the Colossal-AI context.
## Write a simple 2D parallel model
```python
from colossalai.nn import Linear2D
import torch.nn as nn
class MLP_2D(nn.Module):
def __init__(self):
super().__init__()
self.linear_1 = Linear2D(in_features=1024, out_features=16384)
self.linear_2 = Linear2D(in_features=16384, out_features=1024)
def forward(self, x):
x = self.linear_1(x)
x = self.linear_2(x)
return x
```
## Use pre-defined model
For your convenience, our Model Zoo provides some prevalent models such as *BERT*, *ViT*, *MoE*,
and *GPT*. Feel free to customize them into different sizes to fit your specific needs.


@ -0,0 +1,139 @@
# Integrate Mixture-of-Experts Into Your Model
Author: Haichen Huang
**Example Code**
- [ColossalAI-Examples WideNet](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/widenet)
**Related Paper**
- [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961)
- [Go Wider Instead of Deeper](https://arxiv.org/abs/2107.11817)
## Introduction
Since the advent of Switch Transformer, the AI community has found Mixture of Experts (MoE) a useful technique to enlarge the capacity of deep learning models.
Colossal-AI provides an early access version of parallelism specifically designed for MoE models.
The most prominent advantage of MoE in Colossal-AI is convenience.
We aim to help our users to easily combine MoE with model parallelism and data parallelism.
However, the current implementation has two main drawbacks.
The first is poor efficiency when training with large batch sizes and long sequence lengths.
The second is incompatibility with tensor parallelism.
We are working on system optimization to overcome the training efficiency problem.
The compatibility problem with tensor parallelism requires more adaptation, and we will tackle this issue in the future.
Here, we will introduce how to use MoE with model parallelism and data parallelism.
## Table of Contents
In this tutorial we will cover:
1. Set up MoE running environment
2. Create MoE layer
3. Train your model
We provided the [example code](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/widenet) for this tutorial in [ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI-Examples).
This example uses [WideNet](https://arxiv.org/abs/2107.11817) as an example of an MoE-based model.
## Set up MoE running environment
In your project folder, create a `config.py`.
This file specifies the features you want to use to train your model.
In order to enable MoE, you need to add a dict called `parallel` and specify the value of the key `moe`.
You can assign a value to the key `size` of `moe`, which represents the model parallel size of experts (i.e. the number of experts in one group to parallelize training).
For example, if the size is 4, 4 processes will be assigned to 4 consecutive GPUs and these 4 processes form a moe model parallel group.
Each process on the 4 GPUs will only get a portion of experts. Increasing the model parallel size will reduce communication cost, but increase computation cost in each GPU and activation cost in memory.
The total data parallel size is auto-detected and set as the number of GPUs by default.
```python
MOE_MODEL_PARALLEL_SIZE = ...
parallel = dict(
moe=dict(size=MOE_MODEL_PARALLEL_SIZE)
)
```
If you set `MOE_MODEL_PARALLEL_SIZE = E` and the number of experts to `E` (where `E` is a constant), the process flow of the forward pass of a transformer encoder in a model parallel group is shown below.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/oI59QcxdteKUTks.png"/>
<figcaption>MoE Transformer, image source: <a href="https://arxiv.org/abs/2006.16668">GShard</a></figcaption>
</figure>
Since all experts are distributed across the GPUs in a model parallel group and each GPU only owns a portion of the experts,
the original data parallel groups are no longer correct for the experts' parameters during gradient handling in the backward pass.
So we create a new kind of parallel group called the MoE data parallel group.
The difference among the different kinds of parallel groups, when the configuration is set as `WORLD_SIZE=4` and
`MOE_MODEL_PARALLEL_SIZE=2`, is shown here.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/Sn8FpmQPKIiBEq2.png"/>
<figcaption>MoE process group</figcaption>
</figure>
As for gradient handling, we provide `MoeGradientHandler` to all-reduce every parameter of the model.
If you use the `colossalai.initialize` function to create your training engine, the MoE gradient handler will be added to your engine automatically.
Otherwise, you should take care of gradients yourself.
All parameters of the MoE running environment are stored in `colossalai.global_variables.moe_env`.
You can access your configuration parameters to check whether your setup is correct.
```python
from colossalai.global_variables import moe_env
```
## Create MoE layer
You can create a MoE layer from `colossalai.nn.moe`.
But before doing that, you should set up random seeds for all processes like this.
```python
from colossalai.context.random import moe_set_seed
from model_zoo.moe.models import Widenet
moe_set_seed(42)
model = Widenet(num_experts=4, capacity_factor=1.2)
```
`moe_set_seed` will set a different seed for each process in an MoE model parallel group.
This helps initialize the parameters in experts.
Then create an instance of experts and an instance of a router.
Here is an example from the model zoo.
```python
from colossalai.nn.layer.moe import Experts, MoeLayer, Top2Router, NormalNoiseGenerator
noisy_func = NormalNoiseGenerator(num_experts)
shared_router = Top2Router(capacity_factor,
noisy_func=noisy_func)
shared_experts = Experts(expert=VanillaFFN,
num_experts=num_experts,
**moe_mlp_args(
d_model=d_model,
d_ff=d_ff,
drop_rate=drop_rate
))
ffn=MoeLayer(dim_model=d_model, num_experts=num_experts,
router=shared_router, experts=shared_experts)
```
Inside the initialization of `Experts`, the number of local experts on each GPU is calculated automatically. You just need to specify the class of each expert and the parameters used in its initialization. As for routers, we provide a top-1 router and a top-2 router, which you can find in `colossalai.nn.layer.moe`. After creating the instances of experts and router, the only thing left to initialize in `MoeLayer` is the gate module. More details about each class can be found in our API documentation and code.
## Train Your Model
Do not forget to use the `colossalai.initialize` function in `colossalai` to add the gradient handler to the engine.
We handle the back-propagation of MoE models for you.
In `colossalai.initialize`, we automatically create a `MoeGradientHandler` object to process gradients.
You can find more information about `MoeGradientHandler` in the ColossalAI source code.
The loss criterion should be wrapped by `MoeLoss` to add the auxiliary loss of MoE. Here is an example.
```python
criterion = MoeLoss(
aux_weight=0.01,
loss_fn=nn.CrossEntropyLoss,
label_smoothing=0.1
)
```
Finally, just use the trainer or engine in `colossalai` to do your training.
If you do not use them, you should take care of the gradients yourself.
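For reference, a minimal training step driven by the engine looks roughly like the sketch below (the model, optimizer, dataloader and `MoeLoss`-wrapped criterion are assumed to be built beforehand; this mirrors the generic engine usage rather than anything MoE-specific):
```python
import colossalai

# `model`, `optimizer`, `criterion` (wrapped by MoeLoss) and `train_dataloader`
# are assumed to be constructed already.
engine, train_dataloader, _, _ = colossalai.initialize(model,
                                                       optimizer,
                                                       criterion,
                                                       train_dataloader=train_dataloader)
engine.train()
for data, label in train_dataloader:
    data, label = data.cuda(), label.cuda()
    engine.zero_grad()
    output = engine(data)
    loss = engine.criterion(output, label)
    engine.backward(loss)
    engine.step()
```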


@ -0,0 +1,88 @@
# Meet Gemini: The Heterogeneous Memory Manager of Colossal-AI
Author: [Jiarui Fang](https://github.com/feifeibear), Yang You
## Brief
When you only have a few GPUs for large model training tasks, **heterogeneous training** is the most effective approach. By accommodating model data in CPU and GPU memory and moving the data to the computing device when necessary, it can break through the GPU memory wall by using GPU and CPU memory (composed of CPU DRAM or NVMe SSD memory) together at the same time. Moreover, the model scale can be further improved by combining heterogeneous training with other parallel approaches, such as data parallelism, tensor parallelism and pipeline parallelism. We now describe the design details of **Gemini**, the heterogeneous memory space manager of Colossal-AI. Its idea comes from [PatrickStar](https://arxiv.org/abs/2108.05818), which has been adapted to Colossal-AI.
## Usage
At present, Gemini is used together with the ZeRO parallel mode, and it is really simple to use. Just set the attribute `tensor_placement_policy` of the zero `model_config` to `'auto'`.
```
zero = dict(
model_config=dict(
tensor_placement_policy='auto',
shard_strategy=BucketTensorShardStrategy()
),
optimizer_config=dict(
...)
)
```
Note that Gemini should be decoupled from parallel strategies such as tensor parallelism, data parallelism, pipeline parallelism and ZeRO. However, Colossal-AI currently requires users to use Gemini together with ZeRO; although they are not necessarily coupled, we will improve this in the near future.
## Concepts
**OP** (**OP**erator): an operation of a neural network layer, such as linear, LayerNorm, etc. An operator can be a forward-propagation calculation or a back-propagation calculation.
Neural networks must manage two types of data during training:
**Model data**: consists of parameters, gradients and optimizer states; its scale is related to the definition of the model structure.
**Non-model data**: mainly composed of the intermediate tensors generated by operators and the temporary variables of operators. Non-model data changes dynamically according to the configuration of the training task, such as the batch size. Model data and non-model data compete with each other for GPU memory.
## Design Details
In some solutions, such as the [ZeRO-Offload](https://arxiv.org/abs/2101.06840) adopted by DeepSpeed, model data is statically divided between CPU and GPU memory, and the memory layout is constant across different training configurations. As shown on the left of the figure below, when the GPU memory is insufficient to meet its corresponding model data requirements, the system will crash even if there is still available memory on the CPU at that time. Colossal-AI, in contrast, can complete the training by moving part of the model data to the CPU.
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/tutorial/gemini/deepspeed_compare.png"/>
<figcaption>Comparison of the memory management of Zero-Offload and Gemini</figcaption>
</figure>
Colossal-AI designed Gemini, named after the twin stars, to manage the memory space of CPU and GPU efficiently. It dynamically distributes tensors across CPU and GPU storage during training, so that model training can break through the GPU memory wall. The memory manager consists of two parts: the **MemStatsCollector (MSC)** and the **StatefulTensorMgr (STM)**.
We take advantage of the iterative characteristics of the deep learning network training process. We divide iterations into two stages: warmup and non-warmup. One or several iterative steps at the beginning belong to the warmup stage, and the other iterative steps belong to the non-warmup stage. In the warmup stage, we collect information for the MSC, while in the non-warmup stage, STM gets the information collected by the MSC to move the tensor, so as to minimize the CPU-GPU data movement volume.
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/tutorial/gemini/gemini_workflow.png"/>
<figcaption>The workflow of Gemini during warmup and non-warmup phase</figcaption>
</figure>
### StatefulTensorMgr
STM manages the information of all model data tensors. In the process of model construction, Colossal-AI registers all model data tensors with STM. The memory manager marks each tensor with state information. The state set includes three types: HOLD, COMPUTE and FREE. The functions of STM are as follows:
**Query memory usage:** by traversing the locations of all tensors in the heterogeneous space, obtain the memory occupied on CPU and GPU by model data.
**Transition tensor state:** the manager marks each model data tensor as COMPUTE before it participates in an operator calculation, and as HOLD after the calculation. A tensor is marked FREE if it is no longer in use.
**Adjust tensor position:** the tensor manager ensures that tensors in the COMPUTE state are placed on the computing device. If the storage space of the computing device is insufficient, some tensors in the HOLD state need to be moved to other devices for storage. The tensor eviction strategy requires information from the MSC, which will be introduced later.
### MemStatsCollector
In the warmup stage, the MSC monitors the memory usage of model data and non-model data on CPU and GPU for reference in the non-warmup stage. We can obtain the memory usage of model data at a certain time by querying the STM. However, the memory usage of non-model data is difficult to obtain: since the life cycle of non-model data is not managed by users, existing deep learning frameworks do not expose a tracking interface for it. The MSC obtains the CPU and GPU memory usage of non-model data in the warmup stage through sampling. The specific method is as follows:
We trigger a memory sampling operation at the beginning and end of each operator. We call this time point a **sampling moment**, and the time between two sampling moments a **period**. The calculation process is a black box: due to the possible allocation of temporary buffers, the memory usage within a period is very complex. However, we can accurately obtain the maximum memory usage of the system during the period. The non-model data usage is then the maximum system memory usage between two sampling moments minus the model data memory usage.
How do we choose the sampling moment? We sample right before the model data layout adjustment of the preOp, as shown in the figure below. In this way, we sample the system memory usage of the previous period and the model data memory usage of the next period. Parallel strategies can interfere with the work of the MSC. For example, for ZeRO or tensor parallelism, model data needs to be gathered before the OP calculation, which brings additional memory requirements. Therefore, we sample the system memory before the model data changes, so that the MSC captures the model data memory change of the preOp within one period. For example, in period 2-3, we account for the memory changes brought by tensor gather and shard.
Although the sampling moment could be placed elsewhere, for example to exclude the memory change of the gather buffer, it would cause trouble, because the implementation of an OP differs across parallel modes. For example, for a Linear OP, the gather buffer in tensor parallelism is allocated inside the OP, while for ZeRO the gather buffer is allocated in the preOp. Sampling at the beginning of the preOp helps to unify the two situations.
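As a small worked example of the bookkeeping described above (the names below are illustrative, not Colossal-AI APIs):
```python
def non_model_data_usage(peak_system_mem_in_period: int,
                         model_data_mem: int) -> int:
    """Non-model data usage in a period = peak system memory observed during
    the period - model data memory usage (queried from the STM)."""
    return peak_system_mem_in_period - model_data_mem

GB = 1024 ** 3
# e.g. an 18 GB system peak with 12 GB of model data implies ~6 GB of
# activations / temporary buffers (non-model data) in that period.
print(non_model_data_usage(18 * GB, 12 * GB) / GB)  # -> 6.0
```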
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/tutorial/gemini/gemini_mem_curve.png"/>
<figcaption>workflow</figcaption>
</figure>
### Tensor Eviction Strategy
An important duty of the MSC is to adjust the tensor layout. For example, at S2 in the figure above, we reduce the amount of model data on the device to meet the peak memory requirement calculated for period 2-3.
In the warmup stage, since we haven't finished a complete iteration yet, we don't know the actual memory usage. At this time, we limit the upper bound of the memory usage of model data; for example, only 30% of the GPU memory can be used. This ensures that the warmup stage can complete successfully.
In the non-warmup stage, we need to use the non-model data memory information collected in the warmup stage to reserve the peak memory required by the computing device for the next period, which requires us to move out some model tensors. To avoid frequently moving the same tensor in and out between CPU and GPU, which causes a phenomenon similar to [cache thrashing](https://en.wikipedia.org/wiki/Thrashing_(computer_science)), we exploit the iterative nature of DNN training and design an OPT-like (optimal page replacement) swap-out strategy. Specifically, in the warmup stage, we record the sampling moment at which each tensor is required on its computing device. If we need to evict some HOLD tensors, we choose the tensor that will be needed latest on this device as the victim.
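A toy sketch of this OPT-like victim selection (purely illustrative, not the actual Gemini implementation):
```python
def choose_victim(hold_tensors, next_needed_step):
    """hold_tensors: ids of tensors in HOLD state on the computing device.
    next_needed_step: tensor id -> step at which it is next needed on this
    device, as recorded during the warmup stage.
    The victim is the tensor whose next use lies furthest in the future."""
    return max(hold_tensors, key=lambda t: next_needed_step.get(t, float('inf')))

print(choose_victim(['p1', 'p2', 'act3'], {'p1': 12, 'p2': 45, 'act3': 20}))  # -> 'p2'
```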


@ -0,0 +1,81 @@
# Build an online OPT service using Colossal-AI in 5 minutes
## Introduction
This tutorial shows how to build your own service with OPT with the help of [Colossal-AI](https://github.com/hpcaitech/ColossalAI).
## Colossal-AI Inference Overview
Colossal-AI provides an inference subsystem [Energon-AI](https://github.com/hpcaitech/EnergonAI), a serving system built upon Colossal-AI, which has the following characteristics:
- **Parallelism for Large-scale Models:** With the help of tensor parallel operations and pipeline parallel strategies from Colossal-AI, Colossal-AI inference enables efficient parallel inference for large-scale models.
- **Pre-built large models:** There are pre-built implementations for popular models, such as OPT. It supports a caching technique for the generation task and checkpoints loading.
- **Engine encapsulation:** There is an abstraction layer called an engine. It encapsulates the single instance multiple devices (SIMD) execution with remote procedure calls, making it act like single instance single device (SISD) execution.
- **An online service system:** Based on FastAPI, users can launch a web service of a distributed inference quickly. The online service makes special optimizations for the generation task. It adopts both left padding and bucket batching techniques to improve efficiency.
## Basic Usage:
1. Download OPT model
To launch the distributed inference service quickly, you can download the OPT-125M from [here](https://huggingface.co/patrickvonplaten/opt_metaseq_125m/blob/main/model/restored.pt). You can get details for loading other sizes of models [here](https://github.com/hpcaitech/EnergonAI/tree/main/examples/opt/script).
2. Prepare a prebuilt service image
Pull a docker image from Docker Hub with Colossal-AI inference installed.
```bash
docker pull hpcaitech/energon-ai:latest
```
3. Launch an HTTP service
To launch a service, we need to provide python scripts to describe the model type and related configurations, and settings for the HTTP service.
We have provided a set of [examples](https://github.com/hpcaitech/EnergonAI/tree/main/examples). We will use the [OPT example](https://github.com/hpcaitech/EnergonAI/tree/main/examples/opt) in this tutorial.
The entry point of the service is a bash script, `server.sh`.
The config of the service is in `opt_config.py`, which defines the model type, the checkpoint file path, the parallel strategy, and the HTTP settings. You can adapt it for your own case.
For example, set the model class to `opt_125M` and set the correct checkpoint path as follows.
```bash
model_class = opt_125M
checkpoint = 'your_file_path'
```
Set the tensor parallelism degree to the number of your GPUs.
```bash
tp_init_size = #gpu
```
Now, we can launch the service using docker. You can map the checkpoint path and the directory containing the configs to the container paths `/model_checkpoint` and `/config`.
```bash
export CHECKPOINT_DIR="your_opt_checkpoint_path"
# the ${CONFIG_DIR} must contain a server.sh file as the entry of service
export CONFIG_DIR="config_file_path"
docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config --ipc=host energonai:latest
```
Then open `https://[IP-ADDRESS]:8020/docs#` in your browser to try it out!
## Advanced Features Usage:
1. Batching Optimization
To use our advanced batching technique to collect and serve multiple queries in batches, you can set `executor_max_batch_size` as the max batch size. Note that only decoding tasks with the same `top_k`, `top_p` and `temperature` can be batched together.
```
executor_max_batch_size = 16
```
All queries are submitted to a FIFO queue. All consecutive queries whose number of decoding steps is less than or equal to that of the head of the queue can be batched together. Left padding is applied to ensure correctness. `executor_max_batch_size` should not be too large, so that batching won't increase latency. For opt-30b, `executor_max_batch_size=16` may be a good choice, while for opt-175b, `executor_max_batch_size=4` may be better.
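A toy illustration of this batching rule (not EnergonAI code; the field names are made up for the sketch):
```python
from collections import deque

def take_batch(queue: deque, max_batch_size: int):
    """Starting from the head of the FIFO queue, greedily take consecutive
    queries whose number of decoding steps does not exceed that of the head,
    up to max_batch_size; shorter requests are left-padded by the server."""
    if not queue:
        return []
    head_steps = queue[0]['decode_steps']
    batch = []
    while queue and len(batch) < max_batch_size and queue[0]['decode_steps'] <= head_steps:
        batch.append(queue.popleft())
    return batch

q = deque([{'id': 1, 'decode_steps': 32},
           {'id': 2, 'decode_steps': 16},
           {'id': 3, 'decode_steps': 64}])
print([r['id'] for r in take_batch(q, 16)])  # -> [1, 2]; query 3 waits for the next batch
```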
2. Cache Optimization.
You can cache several recently served query results for each independent serving process. Set `cache_size` and `cache_list_size` in `config.py`. `cache_size` is the number of queries cached, and `cache_list_size` is the number of results stored for each query; a random cached result will be returned. When the cache is full, LRU is applied to evict cached queries. `cache_size=0` means no cache is applied.
```
cache_size = 50
cache_list_size = 2
```


@ -0,0 +1,192 @@
# Parallelize Your Training like Megatron-LM via ColoTensor
Author: [Haichen Huang](https://github.com/1SAA) and [Jiarui Fang](https://github.com/feifeibear)
**Prerequisite:**
- [ColoTensor Concepts](../basics/colotensor_concept.md)
## Introduction
Thanks to the convenience given by ColoTensor, users can apply parallelism with minimal changes to their serial code.
In this tutorial, we will illustrate how to modify the training model to automatically adapt the code to parallel training like Megatron-LM.
We take the GPT-2 model offered by HuggingFace as an example and provide a way for you to pre-train the GPT-2 model on a single GPU.
Megatron-LM provided a profound paradigm to parallelize large transformer language models.
However, in order to train large transformer language models at scale, users have to build their models with those modules provided by Megatron.
This imposes several difficult tasks on users, such as loading the weights from pre-trained models and constructing the parallelized models.
To mitigate users' trouble, we offer ColoTensor to enable the tensor model parallelism automatically.
## Definitions of the model and the loss function
First, we define the GPT model and the loss function, built on top of the `GPT2LMHeadModel` from the HuggingFace library.
```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel
class GPTLMModel(nn.Module):
def __init__(self, hidden_size=768, num_layers=12, num_attention_heads=12, max_seq_len=1024, vocab_size=50257, checkpoint=False):
super().__init__()
self.checkpoint = checkpoint
self.model = GPT2LMHeadModel(GPT2Config(n_embd=hidden_size, n_layer=num_layers,
n_head=num_attention_heads, n_positions=max_seq_len, n_ctx=max_seq_len, vocab_size=vocab_size))
if checkpoint:
self.model.gradient_checkpointing_enable()
def forward(self, input_ids, attention_mask):
# Only return lm_logits
return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
class GPTLMLoss(nn.Module):
def __init__(self):
super().__init__()
self.loss_fn = nn.CrossEntropyLoss()
def forward(self, logits, labels):
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
return self.loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
```
## Brief Review of GPT-2
Now, we recall the structure of each GPT-2 model.
Every GPT-2 model can be represented as a DAG.
As shown in the pictures below, each circle represents an operator and each square represents a weight.
An arrow indicates the flow of the input data, and the notation alongside the arrow demonstrates the shape of the input data.
Then, let's take a closer look at the GPT-2 model. It consists of three parts.
They are the **embedding module**, **transformer layers**, and the **classification head**.
The embedding module contains two weights, token embedding weight and position embedding weight.
After the forward operation of the embedding module, each word in all sequences of the raw input data will be embedded into a hidden state.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/08/17/omfkIEN6ui5jcL3.png"/>
<figcaption>The embedding module</figcaption>
</figure>
Each transformer layer contains two blocks. The self-attention operation is called in the first block and a two-layer perceptron is located in the second block.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/08/17/LAVzDlpRcj4dYeb.png"/>
<figcaption>The transformer layer</figcaption>
</figure>
In the end, the classification head is just a linear module without bias, which only has a weight inside.
## Applied with ColoTensor
Two steps adapt your serial code to the Megatron-LM tensor parallel style:
1. Initialize the model in the context of ColoInitContext.
2. Set the ColoTensorSpec for each parameter.
### Initialize with ColoInitContext
We should build the model within the ColoInitContext.
In this context, any initialized parameter will be transformed into a ColoParameter and moved to the corresponding device automatically.
```python
from colossalai.utils.model.colo_init_context import ColoInitContext
with ColoInitContext(device=torch.device('cpu')):
model = GPTLMModel()
```
### Setting ColoTensorSpec for each parameter
After creating the model, we establish the distributed environment through ProcessGroup.
Here, we set the degree of tensor parallelism to the total number of GPUs, which means the degree of data parallelism is 1.
```python
import torch.distributed as dist
from colossalai.tensor import ProcessGroup
pg = ProcessGroup(tp_degree=dist.get_world_size())
```
Now, some auxiliary functions are necessary for the next step. We define two functions to split a parameter.
Megatron-LM-like tensor parallelism requires splitting a parameter tensor along its first dimension or its last dimension.
```python
from colossalai.tensor import ShardSpec, ComputeSpec, ComputePattern, ColoParameter, ProcessGroup
def split_param_single_dim_tp1d(dim: int, param: ColoParameter, pg: ProcessGroup):
spec = (ShardSpec([dim], [pg.tp_world_size()]), ComputeSpec(ComputePattern.TP1D))
if param.process_group.tp_world_size() == 1:
param.set_process_group(pg)
param.set_tensor_spec(*spec)
def split_param_row_tp1d(param: ColoParameter, pg: ProcessGroup):
split_param_single_dim_tp1d(0, param, pg)
def split_param_col_tp1d(param: ColoParameter, pg: ProcessGroup):
split_param_single_dim_tp1d(-1, param, pg)
```
Then we adapt the model to the tensor parallelism.
Following the tensor parallelism applied in Megatron, weights are sharded along their last dimension, including the token embedding and position embedding weights, all linear weights and biases in the self-attention blocks, and the weight and bias of the first linear layer in each MLP.
The weight of the second linear layer in each MLP is sharded along its first dimension.
```python
for mn, module in model.named_modules():
for pn, param in module.named_parameters(recurse=False):
# set process group for all parameters
param.set_process_group(pg)
if 'mlp.c_fc' in mn:
if 'weight' in pn or 'bias' in pn:
                split_param_col_tp1d(param, pg) # column slice
# keep the shape of the output from c_fc
param.compute_spec.set_output_replicate(False)
elif 'mlp.c_proj' in mn:
if 'weight' in pn:
split_param_row_tp1d(param, pg) # row slice
elif 'wte' in mn or 'wpe' in mn:
            split_param_col_tp1d(param, pg) # column slice
elif 'c_attn' in mn or 'c_proj' in mn:
            split_param_col_tp1d(param, pg) # column slice
```
The modified model is illustrated below.
The embedding module:
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/08/17/Yu2xzXEabHV7pwe.png"/>
<figcaption>The modified embedding module</figcaption>
</figure>
The transformer layers:
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/08/17/4HWsA2xz51IhPFO.png"/>
<figcaption>The modified transformer layer</figcaption>
</figure>
Once users have specified the distributed pattern of each parameter, ColoTensor is capable of inferring the computation patterns of all operators, including matrix multiplication, the linear function, other elementwise functions in torch.nn.functional, etc.
In this way, users can train their models as usual.
In our latest example, a Gemini + ZeRO DDP model is also defined to reduce overhead and improve efficiency. For the details of this part, please refer to [ZeRO](../features/zero_with_chunk.md). You can combine these two parts to understand our entire training process:
```python
def gemini_zero_dpp(model: torch.nn.Module, pg: ProcessGroup, placement_policy: str = "auto"):
from colossalai.nn.parallel import GeminiDDP
model = GeminiDDP(model,
device=get_current_device(),
                      placement_policy=placement_policy,
pin_memory=True,
search_range_mb=32)
return model
```
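A hypothetical call site for this helper would be (the full GPT example script also builds the optimizer and dataloader, which are omitted here):
```python
# Wrap the ColoInitContext-built model with Gemini + ZeRO DDP.
model = gemini_zero_dpp(model, pg, placement_policy="auto")
```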
## Pretrain GPT-2 On Single GPU
The above optimizations allow us to pretrain the GPT-2 model on a single GPU. We only need to set `GPUNUM=1` in `run.sh`, and then we can complete the model training on a single GPU when running the file.
The GPT-2 example is accessible at [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt).


@ -0,0 +1,270 @@
# Train GPT Using Hybrid Parallelism
Author: Hongxin Liu, Yongbin Li
**Example Code**
- [ColossalAI-Examples GPT2](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/gpt_2)
- [ColossalAI-Examples GPT3](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/gpt_3)
**Related Paper**
- [Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training](https://arxiv.org/abs/2110.14883)
- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473)
## Introduction
In the previous tutorial, we introduced how to train ViT with pipeline parallelism. In this tutorial, you will learn about a more complex scenario: training GPT with hybrid parallelism. In this case, GPT-3 is so large that even CPU memory cannot hold it. Therefore, you must split the model yourself.
## Table of Contents
In this tutorial we will cover:
1. The definition of GPT model, based on colossalai/model_zoo
2. Processing the dataset
3. Training GPT using hybrid parallelism
## Import libraries
```python
import json
import os
from typing import Callable
import colossalai
import colossalai.utils as utils
import model_zoo.gpt.gpt as col_gpt
import torch
import torch.nn as nn
from colossalai import nn as col_nn
from colossalai.amp import AMP_TYPE
from colossalai.builder.pipeline import partition_uniform
from colossalai.context.parallel_mode import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.engine.schedule import (InterleavedPipelineSchedule,
PipelineSchedule)
from colossalai.logging import disable_existing_loggers, get_dist_logger
from colossalai.nn.layer.wrapper import PipelineSharedModuleWrapper
from colossalai.trainer import Trainer, hooks
from colossalai.utils.timer import MultiTimer
from model_zoo.gpt import GPTLMLoss
from torch.nn import functional as F
from torch.utils.data import Dataset
from transformers import GPT2Tokenizer
```
## Define GPT model
In the previous tutorial, we introduced 3 ways to build a pipelined model. But for huge models like GPT-3, you can't even build the whole model on the CPU. In this case, you must split the model yourself.
GPT dataloader returns `input_ids` and `attention_mask`, so we use two keyword arguments in `forward()` to get them. Note that for stages except the first stage, the first positional argument of `forward()` is the output tensor from the previous stage. So the `hidden_states` is from the previous stage, and for the first stage it's `None`.
For GPT, the *word embedding layer* shares the weights with the *output head*. We provide `PipelineSharedModuleWrapper` to share parameters among pipeline stages. It takes a `list` of `int` as argument, which means those ranks share the parameters. You can use `register_module()` or `register_parameter()` to register a module or a parameter as the shared module or parameter. If you have multiple sets of shared modules / parameters, you should have multiple `PipelineSharedModuleWrapper` instances. If a parameter is shared within **one** stage, you should not use `PipelineSharedModuleWrapper`; just use the same module / parameter instance. In this example, the *word embedding layer* is at the first stage, and the *output head* is at the last stage. Thus, they are shared among ranks `[0, pipeline_size - 1]`.
The first stage maintains the embedding layer and some transformer blocks. The last stage maintains some transformer blocks and the output head. Other stages just maintain some transformer blocks. `partition_uniform(num_layers, pipeline_size, num_chunks)` returns the parts of all ranks, and each part is a `tuple` of `(start, end)`, where `end` is exclusive. `start == 0` means that it's the first stage, and `end == num_layers` means it's the last stage.
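To make the `(start, end)` semantics concrete, here is a toy re-implementation of a uniform split (illustrative only; the real `partition_uniform` also handles `num_chunks` and returns the parts of all ranks):
```python
def uniform_parts(num_layers: int, pipeline_size: int):
    """Split num_layers into pipeline_size contiguous (start, end) ranges,
    with end exclusive, giving any remainder to the earlier ranks."""
    base, rem = divmod(num_layers, pipeline_size)
    parts, start = [], 0
    for rank in range(pipeline_size):
        end = start + base + (1 if rank < rem else 0)
        parts.append((start, end))
        start = end
    return parts

print(uniform_parts(48, 4))  # -> [(0, 12), (12, 24), (24, 36), (36, 48)]
```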
```python
class PipelineGPTHybrid(nn.Module):
def __init__(self,
num_layers: int = 12,
hidden_size: int = 768,
num_attention_heads: int = 12,
vocab_size: int = 50304,
embed_drop_rate: float = 0.,
act_func: Callable = F.gelu,
mlp_ratio: int = 4,
attn_drop_rate: float = 0.,
drop_rate: float = 0.,
dtype: torch.dtype = torch.float,
checkpoint: bool = False,
max_position_embeddings: int = 1024,
layer_norm_epsilon: float = 1e-5,
first: bool = False,
last: bool = False):
super().__init__()
self.embedding = None
self.norm = None
self.head = None
if first:
self.embedding = col_gpt.GPTEmbedding(
hidden_size, vocab_size, max_position_embeddings, dropout=embed_drop_rate, dtype=dtype)
self.blocks = nn.ModuleList([
col_gpt.GPTBlock(hidden_size, num_attention_heads, mlp_ratio=mlp_ratio, attention_dropout=attn_drop_rate,
dropout=drop_rate, dtype=dtype, checkpoint=checkpoint, activation=act_func)
for _ in range(num_layers)
])
if last:
self.norm = col_nn.LayerNorm(hidden_size, eps=layer_norm_epsilon)
self.head = col_gpt.GPTLMHead(vocab_size=vocab_size,
dim=hidden_size,
dtype=dtype,
bias=False)
def forward(self, hidden_states=None, input_ids=None, attention_mask=None):
if self.embedding is not None:
hidden_states = self.embedding(input_ids=input_ids)
batch_size = hidden_states.shape[0]
attention_mask = attention_mask.view(batch_size, -1)
attention_mask = attention_mask[:, None, None, :]
attention_mask = attention_mask.to(dtype=hidden_states.dtype) # fp16 compatibility
attention_mask = (1.0 - attention_mask) * -10000.0
for block in self.blocks:
hidden_states, attention_mask = block(hidden_states, attention_mask)
if self.norm is not None:
hidden_states = self.head(self.norm(hidden_states))
return hidden_states
def build_gpt_pipeline(num_layers, num_chunks, device=torch.device('cuda'), **kwargs):
logger = get_dist_logger()
pipeline_size = gpc.get_world_size(ParallelMode.PIPELINE)
pipeline_rank = gpc.get_local_rank(ParallelMode.PIPELINE)
rank = gpc.get_global_rank()
wrapper = PipelineSharedModuleWrapper([0, pipeline_size - 1])
parts = partition_uniform(num_layers, pipeline_size, num_chunks)[pipeline_rank]
models = []
for start, end in parts:
kwargs['num_layers'] = end - start
kwargs['first'] = start == 0
kwargs['last'] = end == num_layers
logger.info(f'Rank{rank} build layer {start}-{end}, {end-start}/{num_layers} layers')
chunk = PipelineGPTHybrid(**kwargs).to(device)
if start == 0:
wrapper.register_module(chunk.embedding.word_embeddings)
elif end == num_layers:
wrapper.register_module(chunk.head)
models.append(chunk)
if len(models) == 1:
model = models[0]
else:
model = nn.ModuleList(models)
return model
def GPT2_exlarge_pipeline_hybrid(num_chunks=1, checkpoint=False, dtype=torch.float):
cfg = dict(hidden_size=1600, num_attention_heads=32, checkpoint=checkpoint, dtype=dtype)
return build_gpt_pipeline(48, num_chunks, **cfg)
def GPT3_pipeline_hybrid(num_chunks=1, checkpoint=False, dtype=torch.float):
cfg = dict(hidden_size=12288, num_attention_heads=96,
checkpoint=checkpoint, max_position_embeddings=2048, dtype=dtype)
return build_gpt_pipeline(96, num_chunks, **cfg)
```
## Process the dataset
We provide a small GPT web-text dataset here. The original format is loose JSON, and we will save the processed dataset.
```python
class WebtextDataset(Dataset):
def __init__(self, path, seq_len=1024) -> None:
super().__init__()
root = os.path.dirname(path)
encoded_data_cache_path = os.path.join(root, f'gpt_webtext_{seq_len}.pt')
if os.path.isfile(encoded_data_cache_path):
seq_len_, data, attention_mask = torch.load(
encoded_data_cache_path)
if seq_len_ == seq_len:
self.data = data
self.attention_mask = attention_mask
return
raw_data = []
with open(path) as f:
for line in f.readlines():
raw_data.append(json.loads(line)['text'])
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.unk_token
encoded_data = tokenizer(
raw_data, padding=True, truncation=True, max_length=seq_len, return_tensors='pt')
self.data = encoded_data['input_ids']
self.attention_mask = encoded_data['attention_mask']
torch.save((seq_len, self.data, self.attention_mask),
encoded_data_cache_path)
def __len__(self):
return len(self.data)
def __getitem__(self, index):
return {
'input_ids': self.data[index],
'attention_mask': self.attention_mask[index]
}, self.data[index]
```
## Training GPT using hybrid parallelism
In the previous tutorial, we explained the meanings of some pipeline arguments. In this case, we can determine the shape of each output tensor which is exchanged among pipeline stages. For GPT, the shape is `(MICRO BATCH SIZE, SEQUENCE LEN, HIDDEN SIZE)`. By setting this, we can avoid exchanging the tensor shape of each stage. When you are not sure of the tensor shape, you can just leave it `None`, and the shape is inferred automatically. Make sure that the `dtype` of your model is correct. When you use `fp16`, the `dtype` of your model must be `torch.half`. Otherwise, the `dtype` must be `torch.float`. For pipeline parallelism, only `AMP_TYPE.NAIVE` is supported.
You can easily use tensor parallel by setting `parallel` in `CONFIG`. The data parallelism size is automatically set based on the number of GPUs.
```python
NUM_EPOCHS = 60
SEQ_LEN = 1024
BATCH_SIZE = 192
NUM_CHUNKS = None
TENSOR_SHAPE = (1, 1024, 1600)
# only pipeline parallel
# CONFIG = dict(parallel=dict(pipeline=2), fp16=dict(mode=AMP_TYPE.NAIVE))
# pipeline + 1D model parallel
CONFIG = dict(NUM_MICRO_BATCHES = 192, parallel=dict(pipeline=2, tensor=dict(mode='1d', size=2)), fp16=dict(mode=AMP_TYPE.NAIVE))
def train():
disable_existing_loggers()
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch_from_torch(config=CONFIG, backend=args.backend)
logger = get_dist_logger()
train_ds = WebtextDataset(os.environ['DATA'], seq_len=SEQ_LEN)
train_dataloader = utils.get_dataloader(train_ds,
seed=42,
batch_size=BATCH_SIZE,
pin_memory=True,
shuffle=True,
drop_last=True)
use_interleaved = NUM_CHUNKS is not None
num_chunks = 1 if not use_interleaved else NUM_CHUNKS
model = GPT2_exlarge_pipeline_hybrid(num_chunks=num_chunks, checkpoint=True, dtype=torch.half)
# model = GPT3_pipeline_hybrid(num_chunks=num_chunks, checkpoint=True, dtype=torch.half)
if use_interleaved and not isinstance(model, nn.ModuleList):
model = nn.ModuleList([model])
criterion = GPTLMLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.00015, weight_decay=1e-2,)
engine, train_dataloader, _, _ = colossalai.initialize(model,
optimizer,
criterion,
train_dataloader=train_dataloader)
global_batch_size = BATCH_SIZE * \
gpc.get_world_size(ParallelMode.DATA) * getattr(gpc.config, "gradient_accumulation", 1)
logger.info(f'Init done, global batch size = {global_batch_size}', ranks=[0])
timer = MultiTimer()
trainer = Trainer(
engine=engine,
logger=logger,
timer=timer
)
hook_list = [
hooks.LossHook(),
hooks.LogMetricByEpochHook(logger),
hooks.ThroughputHook(),
hooks.LogMetricByStepHook(),
]
trainer.fit(
train_dataloader=train_dataloader,
epochs=NUM_EPOCHS,
test_interval=1,
hooks=hook_list,
display_progress=True,
return_output_label=False,
)
```


@ -0,0 +1,247 @@
# Train ViT Using Pipeline Parallelism
Author: Hongxin Liu, Yongbin Li
**Example Code**
- [ColossalAI-Examples Pipeline Parallel ViT](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/pipeline_parallel)
**Related Paper**
- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473)
## Introduction
In this tutorial, you will learn how to train a Vision Transformer for image classification from scratch, using pipeline parallelism.
Pipeline parallelism is a kind of model parallelism, which is useful when your GPU memory cannot fit your model.
By using it, we split the original model into multiple stages, and each stage maintains a part of the original model.
We assume that your GPU memory cannot fit ViT/L-16, but your CPU memory can.
## Table of contents
In this tutorial we will cover:
1. The definition of ViT model, based on [TIMM](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py)
2. Processing the dataset
3. Training ViT using pipeline
## Import libraries
```python
import os
from collections import OrderedDict
from functools import partial
import colossalai
import colossalai.nn as col_nn
import torch
import torch.nn as nn
from colossalai.builder import build_pipeline_model
from colossalai.engine.schedule import (InterleavedPipelineSchedule,
PipelineSchedule)
from colossalai.logging import disable_existing_loggers, get_dist_logger
from colossalai.trainer import Trainer, hooks
from colossalai.utils import MultiTimer, get_dataloader
from timm.models import vision_transformer as vit
from torchvision import transforms
from torchvision.datasets import CIFAR10
```
## Define Vision Transformer model
Generally, we provide 3 ways to build a pipelined model:
1. `colossalai.builder.build_pipeline_model_from_cfg`
2. `colossalai.builder.build_pipeline_model`
3. Split the model by stages by yourself
When your memory can fit the model, you can use the first two methods to build your model; otherwise you must split the model by yourself. The first two methods first build the whole model on CPU, then split the model, and finally you can just move the corresponding part of the model to GPU.
`colossalai.builder.build_pipeline_model_from_cfg()` receives a model config file, and it can split the model either uniformly (by layer) or in a balanced way (by parameter size).
If you are familiar with `PyTorch`, you can use `colossalai.builder.build_pipeline_model()`, which receives a `torch.nn.Sequential` model and splits it uniformly by layer.
In this tutorial, we will modify [TIMM/ViT](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py) to `torch.nn.Sequential` and then use `colossalai.builder.build_pipeline_model()` to build the pipelined model.
When the data is **one** `Tensor`, you can use the positional argument in `forward()` of your model to get the data tensor. For the first stage of pipeline, the first positional argument of `forward()` is the data tensor loaded from data loader. For other stages, the first positional argument of `forward()` is the output tensor from the previous stage. Note that if the stage is not the last stage, the return of `forward()` must be a `Tensor`.
When the data is a `dict` of `Tensor`, you can use named keyword arguments in `forward()` of your model to get the data `dict`.
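Before modifying ViT, here is a minimal, hypothetical sketch of the single-`Tensor` convention described above: a toy stage module whose `forward()` takes one positional tensor, split with `colossalai.builder.build_pipeline_model`. The `ToyStage` class and its sizes are made up for illustration, and the snippet assumes the distributed environment (including the pipeline parallel group) has already been launched.
```python
import torch.nn as nn
from colossalai.builder import build_pipeline_model

class ToyStage(nn.Module):
    """A toy block whose forward() takes one positional tensor.
    On the first stage, x is the batch from the dataloader;
    on later stages, x is the output sent from the previous stage."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # must return a Tensor if this is not the last stage
        return self.proj(x)

# build the whole model as nn.Sequential, then split it uniformly by layer
sequential_model = nn.Sequential(*[ToyStage() for _ in range(4)])
pipeline_model = build_pipeline_model(sequential_model, num_chunks=1, verbose=True)
```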
```python
class ViTEmbedding(nn.Module):
def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, embed_layer=vit.PatchEmbed, drop_rate=0., distilled=False):
super().__init__()
self.embed_dim = embed_dim # num_features for consistency with other models
self.num_tokens = 2 if distilled else 1
self.patch_embed = embed_layer(
img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim)
num_patches = self.patch_embed.num_patches
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim)) if distilled else None
self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + self.num_tokens, embed_dim))
self.pos_drop = nn.Dropout(p=drop_rate)
self.init_weights()
def forward(self, x):
x = self.patch_embed(x)
cls_token = self.cls_token.expand(x.shape[0], -1, -1) # stole cls_tokens impl from Phil Wang, thanks
if self.dist_token is None:
x = torch.cat((cls_token, x), dim=1)
else:
x = torch.cat((cls_token, self.dist_token.expand(x.shape[0], -1, -1), x), dim=1)
x = self.pos_drop(x + self.pos_embed)
return x
def init_weights(self):
vit.trunc_normal_(self.pos_embed, std=.02)
if self.dist_token is not None:
vit.trunc_normal_(self.dist_token, std=.02)
vit.trunc_normal_(self.cls_token, std=.02)
self.apply(vit._init_vit_weights)
class ViTHead(nn.Module):
def __init__(self, embed_dim=768, num_classes=1000, norm_layer=None, distilled=False, representation_size=None):
super().__init__()
norm_layer = norm_layer or partial(nn.LayerNorm, eps=1e-6)
self.norm = norm_layer(embed_dim)
self.num_classes = num_classes
self.distilled = distilled
self.num_features = embed_dim
# Representation layer
if representation_size and not distilled:
self.num_features = representation_size
self.pre_logits = nn.Sequential(OrderedDict([
('fc', nn.Linear(embed_dim, representation_size)),
('act', nn.Tanh())
]))
else:
self.pre_logits = nn.Identity()
# Classifier head(s)
self.head = nn.Linear(self.num_features, num_classes) if num_classes > 0 else nn.Identity()
self.head_dist = None
if distilled:
self.head_dist = nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()
self.init_weights()
def forward(self, x):
x = self.norm(x)
if self.distilled:
x, x_dist = self.head(x[:, 0]), self.head_dist(x[:, 1])
if self.training and not torch.jit.is_scripting():
# during inference, return the average of both classifier predictions
return x, x_dist
else:
return (x + x_dist) / 2
else:
x = self.pre_logits(x[:, 0])
x = self.head(x)
return x
def init_weights(self):
self.apply(vit._init_vit_weights)
def sequential_vit(img_size=224, patch_size=16, in_chans=3, num_classes=1000, embed_dim=768, depth=12,
num_heads=12, mlp_ratio=4., qkv_bias=True, representation_size=None, distilled=False,
drop_rate=0., attn_drop_rate=0., drop_path_rate=0., embed_layer=vit.PatchEmbed, norm_layer=None,
act_layer=None):
norm_layer = norm_layer or partial(nn.LayerNorm, eps=1e-6)
act_layer = act_layer or nn.GELU
embedding = ViTEmbedding(img_size=img_size, patch_size=patch_size, in_chans=in_chans,
embed_dim=embed_dim, embed_layer=embed_layer, drop_rate=drop_rate, distilled=distilled)
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # stochastic depth decay rule
blocks = [vit.Block(
dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, drop=drop_rate,
attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer, act_layer=act_layer)
for i in range(depth)]
for block in blocks:
block.apply(vit._init_vit_weights)
head = ViTHead(embed_dim=embed_dim, num_classes=num_classes, norm_layer=norm_layer,
distilled=distilled, representation_size=representation_size)
return nn.Sequential(embedding, *blocks, head)
def vit_large_patch16_224(**kwargs):
model_kwargs = dict(embed_dim=1024, depth=24, num_heads=16, **kwargs)
return sequential_vit(**model_kwargs)
```
## Process the dataset
Generally, we train ViT on a large dataset like ImageNet. For simplicity, we just use CIFAR-10 here, since this tutorial only demonstrates pipeline training.
```python
def build_cifar(batch_size):
transform_train = transforms.Compose([
transforms.RandomCrop(224, pad_if_needed=True),
transforms.AutoAugment(policy=transforms.AutoAugmentPolicy.CIFAR10),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
transform_test = transforms.Compose([
transforms.Resize(224),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
train_dataset = CIFAR10(root=os.environ['DATA'], train=True, download=True, transform=transform_train)
test_dataset = CIFAR10(root=os.environ['DATA'], train=False, transform=transform_test)
train_dataloader = get_dataloader(dataset=train_dataset, shuffle=True, batch_size=batch_size, pin_memory=True)
test_dataloader = get_dataloader(dataset=test_dataset, batch_size=batch_size, pin_memory=True)
return train_dataloader, test_dataloader
```
## Training ViT using pipeline
You can set the size of pipeline parallelism and the number of micro-batches in the config. `NUM_CHUNKS` is useful when using an interleaved pipeline (for more details see [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473)). The original batch will be split into `num_microbatches`, and each stage will load one micro-batch at a time. An appropriate schedule will then be generated for you to execute the pipeline training. If you don't need the output and label of the model, you can set `return_output_label` to `False` when calling `trainer.fit()`, which can further reduce GPU memory usage.
You should `export DATA=/path/to/cifar`.
```python
BATCH_SIZE = 16
NUM_EPOCHS = 60
NUM_CHUNKS = 1
CONFIG = dict(NUM_MICRO_BATCHES=4, parallel=dict(pipeline=2))
def train():
disable_existing_loggers()
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch_from_torch(backend=args.backend, config=CONFIG)
logger = get_dist_logger()
# build model
model = vit_large_patch16_224()
model = build_pipeline_model(model, num_chunks=NUM_CHUNKS, verbose=True)
# build criterion
criterion = nn.CrossEntropyLoss()
# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0)
# build dataloader
train_dataloader, test_dataloader = build_cifar(BATCH_SIZE)
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model, optimizer, criterion,
train_dataloader, test_dataloader)
timer = MultiTimer()
trainer = Trainer(engine=engine, timer=timer, logger=logger)
hook_list = [
hooks.LossHook(),
hooks.AccuracyHook(col_nn.metric.Accuracy()),
hooks.LogMetricByEpochHook(logger),
]
trainer.fit(train_dataloader=train_dataloader,
epochs=NUM_EPOCHS,
test_dataloader=test_dataloader,
test_interval=1,
hooks=hook_list,
display_progress=True)
```

View File

@ -0,0 +1,646 @@
# Step By Step: Accelerate ViT Training With Colossal-AI (From Data Parallel to Hybrid Parallel)
Author: Yuxuan Lou
**Example Code**
- [Colossal-AI Examples ViT on Cifar10](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer)
**Related Paper**
- [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929.pdf)
## Introduction
In this example for the ViT model, Colossal-AI provides three different parallelism techniques to accelerate model training: data parallelism, pipeline parallelism and tensor parallelism.
We will show you how to train ViT on the CIFAR-10 dataset with these parallelism techniques. To run this example, you will need 2-4 GPUs.
## Table of Contents
1. Colossal-AI installation
2. Steps to train ViT with data parallelism
3. Steps to train ViT with pipeline parallelism
4. Steps to train ViT with tensor parallelism or hybrid parallelism
## Colossal-AI Installation
You can install the Colossal-AI package and its dependencies from PyPI.
```bash
pip install colossalai
```
## Data Parallelism
Data parallelism is a basic way to accelerate model training. You can apply data parallelism to your training in only two steps:
1. Define a configuration file
2. Change a few lines of code in your training script
### Define your configuration file (`data_parallel/config.py`)
To use Colossal-AI, the first step is to define a configuration file. There are two kinds of variables here:
1. **Colossal-AI feature specification**
There is an array of features Colossal-AI provides to speed up training (parallel mode, mixed precision, ZeRO, etc.). Each feature is defined by a corresponding field in the config file. If we apply data parallelism only, we do not need to specify the parallel mode. In this example, we use the mixed precision training natively provided by PyTorch by defining the mixed precision configuration `fp16 = dict(mode=AMP_TYPE.TORCH)`.
2. **Global hyper-parameters**
Global hyper-parameters include model-specific hyper-parameters, training settings, dataset information, etc.
```python
from colossalai.amp import AMP_TYPE
# ViT Base
BATCH_SIZE = 256
DROP_RATE = 0.1
NUM_EPOCHS = 300
# mixed precision
fp16 = dict(
mode=AMP_TYPE.TORCH,
)
gradient_accumulation = 16
clip_grad_norm = 1.0
dali = dict(
gpu_aug=True,
mixup_alpha=0.2
)
```
### Modify train script (`/data_parallel/train_with_cifar10.py`)
#### Import modules
- Colossal-AI related modules
```python
import colossalai
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.logging import disable_existing_loggers, get_dist_logger
from colossalai.nn.lr_scheduler import LinearWarmupLR
from colossalai.nn.metric import Accuracy
from colossalai.trainer import Trainer, hooks
```
- Other modules
```python
import os
import torch
from timm.models import vit_base_patch16_224
from torchvision import transforms
from torchvision.datasets import CIFAR10
```
#### Launch Colossal-AI
In the training script, you need to initialize the distributed environment for Colossal-AI after your config file is prepared. We call this process `launch`. Colossal-AI provides several launch methods to initialize the distributed backend. In most cases, you can use `colossalai.launch` together with `colossalai.get_default_parser` to pass the parameters via the command line. Besides, Colossal-AI can use the existing launch tool provided by PyTorch, which many users are familiar with, through `colossalai.launch_from_torch`. For more details, you can view the related [documents](https://www.colossalai.org/docs/basics/launch_colossalai).
```python
# initialize distributed setting
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch_from_torch(config=args.config)
disable_existing_loggers()
logger = get_dist_logger()
```
After initialization, you can access the variables in the config file by using `colossalai.core.global_context`.
```python
#access parameters
print(gpc.config.BATCH_SIZE)
```
#### Build Model
If only data parallelism is required, you do not need to make any changes to your model. Here, we use `vit_base_patch16_224` from `timm`.
```python
# build model
model = vit_base_patch16_224(drop_rate=0.1, num_classes=gpc.config.NUM_CLASSES)
```
#### Build CIFAR-10 Dataloader
`colossalai.utils.get_dataloader` can help you build dataloader easily.
```python
def build_cifar(batch_size):
transform_train = transforms.Compose([
transforms.RandomCrop(224, pad_if_needed=True),
transforms.AutoAugment(policy=transforms.AutoAugmentPolicy.CIFAR10),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
transform_test = transforms.Compose([
transforms.Resize(224),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
train_dataset = CIFAR10(root=os.environ['DATA'], train=True, download=True, transform=transform_train)
test_dataset = CIFAR10(root=os.environ['DATA'], train=False, transform=transform_test)
train_dataloader = get_dataloader(dataset=train_dataset, shuffle=True, batch_size=batch_size, pin_memory=True)
test_dataloader = get_dataloader(dataset=test_dataset, batch_size=batch_size, pin_memory=True)
return train_dataloader, test_dataloader
# build dataloader
train_dataloader, test_dataloader = build_cifar(gpc.config.BATCH_SIZE)
```
#### Define optimizer, loss function and LR scheduler
Colossal-AI provides its own optimizer, loss function and LR scheduler. Those from PyTorch are also compatible.
```python
# build optimizer
optimizer = colossalai.nn.Lamb(model.parameters(), lr=1.8e-2, weight_decay=0.1)
# build loss
criterion = torch.nn.CrossEntropyLoss()
# lr_scheduler
lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=gpc.config.NUM_EPOCHS)
```
#### Start Colossal-AI engine
Engine is essentially a wrapper class for model, optimizer and loss function. When we call `colossalai.initialize`, an engine object will be returned, and it has already been equipped with functionalities such as gradient clipping, gradient accumulation and zero optimizer as specified in your configuration file. Further model training is based on Colossal-AI engine.
```python
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(
model, optimizer, criterion, train_dataloader, test_dataloader
)
```
#### Train: Trainer API
The trainer is a more high-level wrapper for the user to execute training with fewer lines of code. It is easy to create a trainer object by passing the engine object.
Besides, in the trainer, the user can customize some hooks and attach them to the trainer object. A hook object will execute life-cycle methods periodically based on the training scheme. For example, the `LRSchedulerHook` will execute `lr_scheduler.step()` to update the learning rate of the model during either the `after_train_iter` or `after_train_epoch` stage.
```python
# build trainer
trainer = Trainer(engine=engine, logger=logger)
# build hooks
hook_list = [
hooks.LossHook(),
hooks.AccuracyHook(accuracy_func=MixupAccuracy()),
hooks.LogMetricByEpochHook(logger),
hooks.LRSchedulerHook(lr_scheduler, by_epoch=True),
# comment if you do not need to use the hooks below
hooks.SaveCheckpointHook(interval=1, checkpoint_dir='./ckpt'),
hooks.TensorboardHook(log_dir='./tb_logs', ranks=[0]),
]
```
Use `trainer.fit` for training:
```python
# start training
trainer.fit(
train_dataloader=train_dataloader,
test_dataloader=test_dataloader,
epochs=gpc.config.NUM_EPOCHS,
hooks=hook_list,
display_progress=True,
test_interval=1
)
```
### Start training
`DATA` is the path where the CIFAR-10 dataset will be automatically downloaded and stored.
`<NUM_GPUs>` is the number of GPUs you want to use to train ViT on CIFAR-10 with data parallelism.
```bash
export DATA=<path_to_data>
# If your torch >= 1.10.0
torchrun --standalone --nproc_per_node <NUM_GPUs> train_dp.py --config ./configs/config_data_parallel.py
# If your torch >= 1.9.0
# python -m torch.distributed.run --standalone --nproc_per_node=<NUM_GPUs> train_dp.py --config ./configs/config_data_parallel.py
# Otherwise
# python -m torch.distributed.launch --nproc_per_node <NUM_GPUs> --master_addr <node_name> --master_port 29500 train_dp.py --config ./configs/config.py
```
## Pipeline Parallelism
Aside from data parallelism, Colossal-AI also supports pipeline parallelism. Specifically, Colossal-AI uses the 1F1B pipeline introduced by NVIDIA. For more details, you can view the related [documents](https://www.colossalai.org/tutorials/features/pipeline_parallel).
### Define your configuration file (`hybrid_parallel/configs/vit_pipeline.py`)
To apply pipeline parallelism on top of data parallelism, you only need to add a **parallel dict**:
```python
from colossalai.amp import AMP_TYPE
parallel = dict(
pipeline=2
)
# pipeline config
NUM_MICRO_BATCHES = parallel['pipeline']
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LENGTH, HIDDEN_SIZE)
fp16 = dict(mode=AMP_TYPE.NAIVE)
clip_grad_norm = 1.0
```
Other configs
```python
# hyperparameters
# BATCH_SIZE is as per GPU
# global batch size = BATCH_SIZE x data parallel size
BATCH_SIZE = 256
LEARNING_RATE = 3e-3
WEIGHT_DECAY = 0.3
NUM_EPOCHS = 300
WARMUP_EPOCHS = 32
# model config
IMG_SIZE = 224
PATCH_SIZE = 16
HIDDEN_SIZE = 768
DEPTH = 12
NUM_HEADS = 12
MLP_RATIO = 4
NUM_CLASSES = 10
CHECKPOINT = True
SEQ_LENGTH = (IMG_SIZE // PATCH_SIZE) ** 2 + 1 # add 1 for cls token
```
### Build pipeline model (`/hybrid_parallel/model/vit.py`)
Colossal-AI provides two methods to build a pipeline model from an existing model:
- `colossalai.builder.build_pipeline_model_from_cfg`
- `colossalai.builder.build_pipeline_model`
Besides, you can also build a pipeline model from scratch with Colossal-AI.
```python
import math
from typing import Callable
import inspect
import torch
from colossalai import nn as col_nn
from colossalai.registry import LAYERS, MODELS
from colossalai.logging import get_dist_logger
from colossalai.core import global_context as gpc
from colossalai.context import ParallelMode
from colossalai.builder.pipeline import partition_uniform
from torch import dtype, nn
from model_zoo.vit.vit import ViTBlock, ViTEmbedding, ViTHead
@MODELS.register_module
class PipelineVisionTransformer(nn.Module):
def __init__(self,
img_size: int = 224,
patch_size: int = 16,
in_chans: int = 3,
num_classes: int = 1000,
depth: int = 12,
num_heads: int = 12,
dim: int = 768,
mlp_ratio: int = 4,
attention_dropout: float = 0.,
dropout: float = 0.1,
drop_path: float = 0.,
layernorm_epsilon: float = 1e-6,
activation: Callable = nn.functional.gelu,
representation_size: int = None,
dtype: dtype = None,
bias: bool = True,
checkpoint: bool = False,
init_method: str = 'torch',
first_stage=True,
last_stage=True,
start_idx=None,
end_idx=None,):
super().__init__()
layers = []
if first_stage:
embed = ViTEmbedding(img_size=img_size,
patch_size=patch_size,
in_chans=in_chans,
embedding_dim=dim,
dropout=dropout,
dtype=dtype,
init_method=init_method)
layers.append(embed)
# stochastic depth decay rule
dpr = [x.item() for x in torch.linspace(0, drop_path, depth)]
if start_idx is None and end_idx is None:
start_idx = 0
end_idx = depth
blocks = [
ViTBlock(
dim=dim,
num_heads=num_heads,
mlp_ratio=mlp_ratio,
attention_dropout=attention_dropout,
dropout=dropout,
drop_path=dpr[i],
activation=activation,
dtype=dtype,
bias=bias,
checkpoint=checkpoint,
init_method=init_method,
) for i in range(start_idx, end_idx)
]
layers.extend(blocks)
if last_stage:
norm = col_nn.LayerNorm(normalized_shape=dim, eps=layernorm_epsilon, dtype=dtype)
head = ViTHead(dim=dim,
num_classes=num_classes,
representation_size=representation_size,
dtype=dtype,
bias=bias,
init_method=init_method)
layers.extend([norm, head])
self.layers = nn.Sequential(
*layers
)
def forward(self, x):
x = self.layers(x)
return x
def _filter_kwargs(func, kwargs):
sig = inspect.signature(func)
return {k: v for k, v in kwargs.items() if k in sig.parameters}
def _build_pipeline_vit(module_cls, num_layers, num_chunks, device=torch.device('cuda'), **kwargs):
logger = get_dist_logger()
if gpc.is_initialized(ParallelMode.PIPELINE):
pipeline_size = gpc.get_world_size(ParallelMode.PIPELINE)
pipeline_rank = gpc.get_local_rank(ParallelMode.PIPELINE)
else:
pipeline_size = 1
pipeline_rank = 0
rank = gpc.get_global_rank()
parts = partition_uniform(num_layers, pipeline_size, num_chunks)[pipeline_rank]
models = []
for start, end in parts:
kwargs['first_stage'] = start == 0
kwargs['last_stage'] = end == num_layers
kwargs['start_idx'] = start
kwargs['end_idx'] = end
logger.info(f'Rank{rank} build layer {start}-{end}, {end-start}/{num_layers} layers')
chunk = module_cls(**_filter_kwargs(module_cls.__init__, kwargs)).to(device)
models.append(chunk)
if len(models) == 1:
model = models[0]
else:
model = nn.ModuleList(models)
return model
def build_pipeline_vit(num_layers, num_chunks, device=torch.device('cuda'), **kwargs):
return _build_pipeline_vit(PipelineVisionTransformer, num_layers, num_chunks, device, **kwargs)
```
### Modify train script (`/hybrid_parallel/train_with_cifar10.py`)
#### Import modules
```python
from colossalai.engine.schedule import (InterleavedPipelineSchedule,
PipelineSchedule)
from colossalai.utils import MultiTimer
import os
import colossalai
import torch
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.nn import CrossEntropyLoss
from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR
from colossalai.utils import is_using_pp, get_dataloader
from model.vit import build_pipeline_vit
from model_zoo.vit.vit import _create_vit_model
from tqdm import tqdm
from torchvision import transforms
from torchvision.datasets import CIFAR10
```
#### Launch Colossal-AI
`colossalai.utils.is_using_pp` can help check whether pipeline parallelism is required in the config file.
```python
# initialize distributed setting
parser = colossalai.get_default_parser()
args = parser.parse_args()
# launch from torch
colossalai.launch_from_torch(config=args.config)
# get logger
logger = get_dist_logger()
logger.info("initialized distributed environment", ranks=[0])
if hasattr(gpc.config, 'LOG_PATH'):
if gpc.get_global_rank() == 0:
log_path = gpc.config.LOG_PATH
if not os.path.exists(log_path):
os.mkdir(log_path)
logger.log_to_file(log_path)
use_pipeline = is_using_pp()
```
#### Define model
```python
# create model
model_kwargs = dict(img_size=gpc.config.IMG_SIZE,
patch_size=gpc.config.PATCH_SIZE,
dim=gpc.config.HIDDEN_SIZE,
depth=gpc.config.DEPTH,
num_heads=gpc.config.NUM_HEADS,
mlp_ratio=gpc.config.MLP_RATIO,
num_classes=gpc.config.NUM_CLASSES,
init_method='jax',
checkpoint=gpc.config.CHECKPOINT)
if use_pipeline:
model = build_pipeline_vit(num_layers=model_kwargs['depth'], num_chunks=1, **model_kwargs)
else:
model = _create_vit_model(**model_kwargs)
```
#### Count number of parameters
You can count model parameters on different pipeline stages easily.
```python
# count number of parameters
total_numel = 0
for p in model.parameters():
total_numel += p.numel()
if not gpc.is_initialized(ParallelMode.PIPELINE):
pipeline_stage = 0
else:
pipeline_stage = gpc.get_local_rank(ParallelMode.PIPELINE)
logger.info(f"number of parameters: {total_numel} on pipeline stage {pipeline_stage}")
```
#### Build dataloader, optimizer, etc.
```python
def build_cifar(batch_size):
transform_train = transforms.Compose([
transforms.RandomCrop(224, pad_if_needed=True),
transforms.AutoAugment(policy=transforms.AutoAugmentPolicy.CIFAR10),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
transform_test = transforms.Compose([
transforms.Resize(224),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
train_dataset = CIFAR10(root=os.environ['DATA'], train=True, download=True, transform=transform_train)
test_dataset = CIFAR10(root=os.environ['DATA'], train=False, transform=transform_test)
train_dataloader = get_dataloader(dataset=train_dataset, shuffle=True, batch_size=batch_size, pin_memory=True)
test_dataloader = get_dataloader(dataset=test_dataset, batch_size=batch_size, pin_memory=True)
return train_dataloader, test_dataloader
# create dataloaders
train_dataloader, test_dataloader = build_cifar(gpc.config.BATCH_SIZE)
# create loss function
criterion = CrossEntropyLoss(label_smoothing=0.1)
# create optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=gpc.config.LEARNING_RATE, weight_decay=gpc.config.WEIGHT_DECAY)
# create lr scheduler
lr_scheduler = CosineAnnealingWarmupLR(optimizer=optimizer,
total_steps=gpc.config.NUM_EPOCHS,
warmup_steps=gpc.config.WARMUP_EPOCHS)
```
#### Start Colossal-AI engine
```python
# initialize
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model=model,
optimizer=optimizer,
criterion=criterion,
train_dataloader=train_dataloader,
test_dataloader=test_dataloader)
logger.info("Engine is built", ranks=[0])
```
#### Train: based on engine
In the data parallelism example, we showed how to train a model with the Trainer API. We can also train a model directly with the engine. In this way, you can customize your training with more features.
```python
data_iter = iter(train_dataloader)
for epoch in range(gpc.config.NUM_EPOCHS):
# training
engine.train()
if gpc.get_global_rank() == 0:
description = 'Epoch {} / {}'.format(
epoch,
gpc.config.NUM_EPOCHS
)
progress = tqdm(range(len(train_dataloader)), desc=description)
else:
progress = range(len(train_dataloader))
for _ in progress:
engine.zero_grad()
engine.execute_schedule(data_iter, return_output_label=False)
engine.step()
lr_scheduler.step()
```
### Start training
```bash
export DATA=<path_to_dataset>
# If your torch >= 1.10.0
torchrun --standalone --nproc_per_node <NUM_GPUs> train_hybrid.py --config ./configs/config_pipeline_parallel.py
# If your torch >= 1.9.0
# python -m torch.distributed.run --standalone --nproc_per_node=<NUM_GPUs> train_hybrid.py --config ./configs/config_pipeline_parallel.py
```
## Tensor Parallelism and Hybrid Parallelism
Tensor parallelism partitions each weight parameter across multiple devices in order to reduce memory load. Colossal-AI supports 1D, 2D, 2.5D and 3D tensor parallelism. Besides, you can combine tensor parallelism with pipeline parallelism and data parallelism to reach hybrid parallelism. Colossal-AI also provides an easy way to apply tensor parallelism and hybrid parallelism: on top of pipeline parallelism, changing a few lines in the config file is all you need.
### Define your configuration file (`/hybrid_parallel/configs/vit_1d_tp2_pp2.py`)
To use tensor parallelism, you only need to add the related information to the **parallel dict**. To be specific, `TENSOR_PARALLEL_MODE` can be '1d', '2d', '2.5d' or '3d', and the parallel sizes must satisfy `#GPUs = pipeline parallel size x tensor parallel size x data parallel size`. The data parallel size is computed automatically once you specify the number of GPUs, the pipeline parallel size and the tensor parallel size; for example, on 8 GPUs with pipeline parallel size 2 and tensor parallel size 2, the data parallel size is 2.
```python
from colossalai.amp import AMP_TYPE
# parallel setting
TENSOR_PARALLEL_SIZE = 2
TENSOR_PARALLEL_MODE = '1d'
parallel = dict(
pipeline=2,
tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE)
)
fp16 = dict(mode=AMP_TYPE.NAIVE)
clip_grad_norm = 1.0
# pipeline config
NUM_MICRO_BATCHES = parallel['pipeline']
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LENGTH, HIDDEN_SIZE)
```
Other configs:
```python
# hyperparameters
# BATCH_SIZE is as per GPU
# global batch size = BATCH_SIZE x data parallel size
BATCH_SIZE = 256
LEARNING_RATE = 3e-3
WEIGHT_DECAY = 0.3
NUM_EPOCHS = 300
WARMUP_EPOCHS = 32
# model config
IMG_SIZE = 224
PATCH_SIZE = 16
HIDDEN_SIZE = 768
DEPTH = 12
NUM_HEADS = 12
MLP_RATIO = 4
NUM_CLASSES = 10
CHECKPOINT = True
SEQ_LENGTH = (IMG_SIZE // PATCH_SIZE) ** 2 + 1 # add 1 for cls token
```
### Start training
```bash
export DATA=<path_to_dataset>
# If your torch >= 1.10.0
torchrun --standalone --nproc_per_node <NUM_GPUs> train_hybrid.py --config ./configs/config_hybrid_parallel.py
# If your torch >= 1.9.0
# python -m torch.distributed.run --standalone --nproc_per_node=<NUM_GPUs> train_hybrid.py --config ./configs/config_hybrid_parallel.py
```

View File

@ -0,0 +1,97 @@
# ColoTensor Concepts
Author: [Jiarui Fang](https://github.com/feifeibear), [Hongxin Liu](https://github.com/ver217) and [Haichen Huang](https://github.com/1SAA)
**Prerequisite:**
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
- [Distributed Training](../concepts/distributed_training.md)
- [Paradigms of Parallelism](../concepts/paradigms_of_parallelism.md)
## Introduction
After ColossalAI version 0.1.8, [ColoTensor](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.html#colossalai.tensor.ColoTensor) becomes the basic data structure for tensors in ColossalAI. It is a subclass of torch.Tensor and can be used as a PyTorch Tensor. Additionally, some unique features make it possible to represent a global tensor whose payload is distributed across multiple GPU devices. With the help of ColoTensor, users can write a distributed DNN training program in much the same way as a serial one.
ColoTensor contains extra attributes encapsulated in a [ColoTensorSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.tensor_spec.html#colossalai.tensor.tensor_spec.ColoTensorSpec) instance to describe the tensor's payload distribution and computing pattern:
- ProcessGroup: how processes are organized as communication groups.
- Distributed Spec: how tensor is distributed among process groups.
- Compute Spec: how the tensor is used during computation.
We elaborate on them one by one.
## ProcessGroup
An instance of the class [ProcessGroup](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.html#colossalai.tensor.ProcessGroup) describes how processes are organized in process groups. Processes in a process group can participate in the same collective communication operations, such as allgather, allreduce, etc. The way the process group is organized is dominated by the tensor's parallelism strategy. For example, if the user defines the tensor parallel (TP) and data parallel (DP) modes of a tensor, then the process organization of the process group will be deduced automatically. The process group settings can vary among different tensors, which enables us to support more complicated hybrid parallelism. The pipeline parallel (PP) definition is not part of the ProcessGroup; it needs another set of mechanisms. We will supplement the related content of ColoTensor applied to PP in the future.
Currently, a process group of ColoTensor is defined by two configurations, i.e. tp_degree and dp_degree. In the case of DP+TP hybrid parallelism, the devices can be viewed as a 2D mesh. We place TP communication groups on the leading low dimension of the device mesh and then place the data parallel groups along the high dimension of the device mesh. The reason is that tensor parallelism has a larger communication overhead than data parallelism, so neighboring devices are placed inside a TP process group and are often placed in the same node.
Consider 8 processes configured with tp_degree=4 and dp_degree=2; the layout is shown below. Process group tp0 contains GPUs 0, 1, 2 and 3. Process group dp1 contains GPUs 1 and 5.
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/ColoTensor_layout_demo.PNG"/>
<figcaption>Process Group using tp_degree=4, dp_degree=2</figcaption>
</figure>
## Distributed Spec
An instance of [Distributed Spec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.distspec.html) describes how a ColoTensor is distributed among the ProcessGroup.
How tensors are distributed among DP process groups is derived automatically and does not need to be specified by the user. If the tensor is a model parameter, it is replicated within the DP process group. If it is an activation tensor, it is split along the highest dimension and its payload is evenly distributed among the processes in the DP process group.
Therefore, when using Distributed Spec, we only need to describe the way that the tensor is distributed among TP process groups. There are currently two ways to distribute a tensor among a TP process group, i.e. [ShardSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.distspec.html#colossalai.tensor.distspec.ShardSpec) and [ReplicaSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.distspec.html#colossalai.tensor.distspec.ReplicaSpec). ShardSpec needs to specify the dimension index dim of the partition and the number of partitions num_partitions. Currently, we only support splitting on a single dim. Different dist specs on the TP process groups can be converted to each other through the set_dist_spec() interface. The spec conversions are recorded by the autograd mechanism and will trigger the corresponding reverse operations during backward propagation.
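As a small sketch of the two dist specs (it assumes `pg` is an existing `ProcessGroup`; the no-argument `ReplicaSpec()` constructor is our assumption and may differ across versions):
```python
from colossalai.tensor import ReplicaSpec, ShardSpec

# shard the tensor along dim 0 into tp_world_size() partitions within the TP group
shard_spec = ShardSpec(dims=[0], num_partitions=[pg.tp_world_size()])
# keep a full replica of the tensor on every process in the TP group
replica_spec = ReplicaSpec()
```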
## Compute Spec
An instance of the class [ComputeSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.compute_spec.html#colossalai.tensor.compute_spec.ComputeSpec) describes how a ColoTensor is used in DNN training. Currently, we set the correct compute pattern for ColoTensors that are the parameters of a module. The specific application scenarios will be shown in the next document.
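Putting the three parts together, a `ColoTensorSpec` can be assembled as sketched below (the degrees and the 1D TP pattern are arbitrary choices for illustration, and a distributed environment is assumed to have been launched already):
```python
from colossalai.tensor import (ColoTensorSpec, ComputePattern, ComputeSpec,
                               ProcessGroup, ShardSpec)

pg = ProcessGroup(tp_degree=2, dp_degree=2)                            # how processes are grouped
dist_spec = ShardSpec(dims=[-1], num_partitions=[pg.tp_world_size()])  # how the payload is sharded
comp_spec = ComputeSpec(ComputePattern.TP1D)                           # how the tensor is used in compute
spec = ColoTensorSpec(pg, dist_spec, comp_spec)
```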
## ColoParameter
[ColoParameter](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.colo_parameter.html#colossalai.tensor.colo_parameter.ColoParameter) is a subclass of ColoTensor used to define a global parameter tensor. Its relationship with ColoTensor mirrors that between torch.Tensor and torch.nn.Parameter: the latter allows the tensor to appear in the return values of the module's parameters() and named_parameters() methods.
## Example
Let's see an example. A ColoTensor is initialized and sharded on 4 GPUs using tp_degree=2 and dp_degree=2. The tensor is first sharded along the last dim within the TP process group, then replicated, and finally resharded along the first dim (dim 0) within the TP process group. We encourage users to run the code and observe the shape of each tensor.
```python
import torch
import torch.multiprocessing as mp
from functools import partial
import colossalai
from colossalai.tensor import ProcessGroup, ColoTensor, ColoTensorSpec, ShardSpec, ComputeSpec, ComputePattern
from colossalai.utils import free_port, print_rank_0
def run_dist_tests(rank, world_size, port):
colossalai.launch(config={}, rank=rank, world_size=world_size, host='localhost', port=port, backend='nccl')
pg = ProcessGroup(tp_degree=2, dp_degree=2)
torch.manual_seed(0)
local_tensor = torch.randn(2, 3, 1).cuda()
print_rank_0(f"shape {local_tensor.shape}, {local_tensor.data}")
spec = ColoTensorSpec(pg, ShardSpec(dims=[-1], num_partitions=[pg.tp_world_size()]), ComputeSpec(ComputePattern.TP1D))
t1 = ColoTensor.from_torch_tensor(local_tensor, spec)
t1 = t1.to_replicate()
print_rank_0(f"shape {t1.shape}, {t1.data}")
spec2 = ShardSpec([0], [pg.tp_world_size()])
t1.set_dist_spec(spec2)
print_rank_0(f"shape {t1.shape}, {t1.data}")
def test_dist_cases(world_size):
run_func = partial(run_dist_tests, world_size=world_size, port=free_port())
mp.spawn(run_func, nprocs=world_size)
if __name__ == '__main__':
test_dist_cases(4)
```
:::caution
The ColoTensor is an experimental feature and may be updated.
:::

View File

@ -0,0 +1,53 @@
# Command Line Tool
Author: Shenggui Li
**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
## Introduction
Colossal-AI provides command-line utilities for the user.
The current command line tools support the following features.
- verify Colossal-AI build
- launch distributed jobs
- tensor parallel micro-benchmarking
## Check Installation
To verify whether your Colossal-AI is built correctly, you can use the command `colossalai check -i`.
This command will give you information regarding version compatibility and the CUDA extension.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/05/04/KJmcVknyPHpBofa.png"/>
<figcaption>Check Installation Demo</figcaption>
</figure>
## Launcher
To launch distributed jobs on single or multiple nodes, the command `colossalai run` can be used for process launching.
You may refer to [Launch Colossal-AI](./launch_colossalai.md) for more details.
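For instance, a single-node job on 4 GPUs can be started roughly as follows (`train.py` and its config argument are placeholders; check `colossalai run --help` for the exact flags available in your version):
```bash
# launch train.py on 4 GPUs of the current node
colossalai run --nproc_per_node 4 train.py --config config.py
```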
## Tensor Parallel Micro-Benchmarking
As Colossal-AI provides an array of tensor parallelism methods, it is not intuitive to choose one for your hardware and
model. Therefore, we provide a simple benchmark to evaluate the performance of the various tensor parallel modes on your system.
This benchmarking is run on a simple MLP model where the input data is of the shape `(batch_size, seq_length, hidden_size)`.
Based on the number of GPUs, the CLI will look for all possible tensor parallel configurations and display the benchmarking results.
You can customize the benchmarking configurations by checking out `colossalai benchmark --help`.
```shell
# run on 4 GPUs
colossalai benchmark --gpus 4
# run on 8 GPUs
colossalai benchmark --gpus 8
```
:::caution
Only single-node benchmarking is supported currently.
:::

View File

@ -0,0 +1,156 @@
# Configure Parallelization
Author: Shenggui Li, Siqi Mai
**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Paradigms of Parallelism](../concepts/paradigms_of_parallelism.md)
- [Define Your Configuration](./define_your_config.md)
## Introduction
We support multiple parallelization methods in Colossal-AI. Hybrid parallelism in our codebase refers to the combination
of data parallelism, pipeline parallelism and tensor parallelism (1D, 2D, 2.5D, 3D).
Each parallelism method requires a different network topology and thus initializes different process groups.
You can initialize the corresponding process group by setting `parallel` in the config file.
The configuration for `parallel` must obey the following format. Data parallel size will be
inferred automatically based on your inputs to pipeline parallelism and tensor parallelism.
`colossalai.launch` will initialize these distributed process groups automatically based on your configuration.
Some sample configurations are shown below:
```python
# sample format
parallel = dict(
pipeline=dict("size": int),
tensor=dict("size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any)
)
# this is ok
parallel = dict(
pipeline=dict(size=2),
tensor=dict(size=4, mode='2d')
)
# this is ok
parallel = dict(
pipeline=2,
tensor=dict(size=4, mode='2d')
)
# this is not ok
# as you need to specify the mode for tensor parallelism
parallel = dict(
pipeline=2,
tensor=4
)
# this is also ok; tensor will default to size 1
# and mode None
parallel = dict(
pipeline=2
)
# this is also ok; pipeline will default to size 1
parallel = dict(
tensor=dict(size=4, mode='2d')
)
```
The key name `size` refers to the parallel size of the parallelism dimension. For example, a pipeline size of 2 means there
will be 2 pipeline stages. The key name `mode` in the tensor parallel config specifies which tensor parallelism method
will be initialized.
**You can choose to not have 'parallel' in your configuration and both pipeline and tensor will default to size 1.**
**Total number of GPUs must be equal to `data parallel size * tensor parallel size * pipeline parallel size`**
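For example, on 8 GPUs, a config with 2 pipeline stages and a tensor parallel size of 2 leaves a data parallel size of 2 (a minimal illustrative config):
```python
# 8 GPUs = pipeline(2) x tensor(2) x data(2); the data parallel size is inferred
parallel = dict(
    pipeline=2,
    tensor=dict(size=2, mode='1d')
)
```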
## Data Parallel
Data parallelism is the most common way to distribute your training task: the data is split into several shards and each device trains on a single shard. The configuration for data parallelism is detected and set automatically; you do not
have to explicitly set it in your configuration. There are two ways to handle the all-reduce in data parallelism in Colossal-AI:
1. If you specify gradient handlers, gradients will be all-reduced according to the gradient handlers
2. Otherwise, PyTorch DistributedDataParallel will be used
In most cases, you will be using the second mode unless you have complex handling of the gradients.
## 1D, 2D, 2.5D and 3D Parallel
To enable hybrid parallelism, we provide an array of tensor parallelism methods; the papers matching each
tensor parallel method are listed below. These parallel modes need to work with the distributed layers provided by Colossal-AI.
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data, model weights and layer
outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of `P = N^2` devices where
`N` is the number of tensor chunks in a single dimension.
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism which
further parallelizes 2D tensor parallelism. An amount of `P = N^2 d` processors are arranged into `d` layers, where
each layer performs matrix multiplication operations independently with a dimension `N`.
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method
achieves the optimal `O(P^{1/3})` communication overhead on `P` processors, while both computation and memory usage
are evenly distributed through optimized load balancing of parameters as well as activations.
```python
# 1D parallel
parallel = dict(
tensor=dict(size=4, mode='1d')
)
# 2D parallel
parallel = dict(
tensor=dict(size=4, mode='2d')
)
# 2.5D parallel
parallel = dict(
tensor=dict(size=8, mode='2.5d', depth=2)
)
# 3D parallel
parallel = dict(
tensor=dict(size=8, mode='3d')
)
```
Once you specify the tensor parallel mode in your configuration, you can proceed to use its corresponding distributed
operator. For example, if your mode is '2d', you can use `colossalai.nn.Linear2D` in your model construction.
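A rough sketch of such a layer (the feature sizes are arbitrary, and it assumes `colossalai.launch` has been called with a '2d' tensor parallel config such as the one above):
```python
import colossalai.nn as col_nn

# a linear layer whose weight is partitioned over the 2D device mesh
# (usable only after launching with a '2d' tensor parallel configuration)
layer = col_nn.Linear2D(in_features=1024, out_features=4096)
```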
## Pipeline Parallel
Pipeline parallelism splits the model into several partitions by layer. For example, let's assume we have a simple
model which consists of two linear layers. We have two GPUs, and we can allocate the first linear layer to the first GPU
and the second layer to the second GPU (see the conceptual sketch after the config below).
You can set the number of pipeline stages in your configuration file. When the pipeline size is larger than 1, Colossal-AI
will automatically create the pipeline schedule which defines the forward and backward steps.
```python
parallel = dict(
pipeline=dict(size=4), # number of pipeline stages
)
```
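As a purely conceptual sketch in plain PyTorch (not the Colossal-AI API), the two-linear-layer example above would be split like this:
```python
import torch.nn as nn

full_model = nn.Sequential(
    nn.Linear(1024, 4096),  # stage 0, placed on GPU 0
    nn.Linear(4096, 1024),  # stage 1, placed on GPU 1
)
stage_0 = full_model[:1]  # the partition held by the first pipeline stage
stage_1 = full_model[1:]  # the partition held by the second pipeline stage
```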
## Sequence Parallel
Sequence parallelism supports long-sequence modeling such as document-level text understanding and medical imaging.
This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120).
You can specify the mode as `sequence` to initialize its process group.
```python
parallel = dict(
tensor=dict(size=4, mode='sequence')
)
```

View File

@ -0,0 +1,82 @@
# Define Your Configuration
Author: Guangyang Lu, Shenggui Li, Siqi Mai
**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
## Introduction
In Colossal-AI, a configuration file is required to specify the features the system will inject into the training process.
In this tutorial, we will introduce how to construct your configuration file and how this config file will be used.
Using a configuration file has several advantages:
1. You can store your feature configuration and training hyper-parameters in different configuration files
2. New features released in the future can be specified in the configuration without code change in the training script
In this tutorial, we will cover how to define your configuration file.
## Configuration Definition
In a configuration file, there are two types of variables. One serves as feature specification and the other serves
as hyper-parameters. All feature-related variables are reserved keywords. For example, if you want to use mixed precision
training, you need to use the variable name `fp16` in the config file and follow a pre-defined format.
### Feature Specification
There is an array of features Colossal-AI provides to speed up training. Each feature is defined by a corresponding field
in the config file. In this tutorial, we are not giving the config details for all the features, but rather we are providing
an illustration of how to specify a feature. **The details of each feature can be found in its respective tutorial.**
To illustrate the use of config file, we use mixed precision training as an example here. In order to do so, you need to
follow the steps below.
1. create a configuration file (e.g. `config.py`, the file name can be anything)
2. define the mixed precision configuration in the config file. For example, in order to use mixed precision training
natively provided by PyTorch, you can just write these lines of code below into your config file.
```python
from colossalai.amp import AMP_TYPE
fp16 = dict(
mode=AMP_TYPE.TORCH
)
```
3. Tell Colossal-AI where your config file is when launching the distributed environment. For example, assume the config file is in
the current directory.
```python
import colossalai
colossalai.launch(config='./config.py', ...)
```
In this way, Colossal-AI knows what features you want to use and will inject this feature during `colossalai.initialize`.
### Global Hyper-parameters
Besides feature specification, the config file can also serve as a place to define your training hyper-parameters. This
comes in handy when you want to perform multiple experiments; the details of each experiment can be put into its own config file
to avoid confusion. These parameters will be stored in the global parallel context and can be accessed in the training script.
For example, you can specify the batch size in your config file.
```python
BATCH_SIZE = 32
```
After launch, you are able to access your hyper-parameters through global parallel context.
```python
import colossalai
from colossalai.core import global_context as gpc
colossalai.launch(config='./config.py', ...)
# access your parameter
print(gpc.config.BATCH_SIZE)
```

View File

@ -0,0 +1,387 @@
# Use Engine and Trainer in Training
Author: Shenggui Li, Siqi Mai
**Prerequisite:**
- [Initialize Features](./initialize_features.md)
## Introduction
In this tutorial, you will learn how to use the engine and trainer provided in Colossal-AI to train your model.
Before we delve into the details, we would like to first explain the concept of engine and trainer.
### Engine
Engine is essentially a wrapper class for model, optimizer and loss function.
When we call `colossalai.initialize`, an engine object will be returned, and it has already been equipped with
functionalities such as gradient clipping, gradient accumulation and zero optimizer as specified in your configuration file.
An engine object will use similar APIs to those of PyTorch training components such that the user has minimum change
to their code.
Below is a table which shows the commonly used APIs for the engine object.
| Component | Function | PyTorch | Colossal-AI |
| ------------------------------------- | --------------------------------------------- | ------------------------------- | -------------------------------------- |
| optimizer | Set all gradients to zero before an iteration | optimizer.zero_grad() | engine.zero_grad() |
| optimizer | Update the parameters | optimizer.step() | engine.step() |
| model | Run a forward pass | outputs = model(inputs) | outputs = engine(inputs) |
| criterion | Calculate the loss value | loss = criterion(output, label) | loss = engine.criterion(output, label) |
| criterion | Execute back-propagation on the model | loss.backward() | engine.backward(loss) |
The reason why we need such an engine class is that we can add more functionalities while hiding the implementations in
the `colossalai.initialize` function.
Imagine we are going to add a new feature; we can manipulate the model, optimizer, dataloader and loss function in the
`colossalai.initialize` function and only expose an engine object to the user.
The user only needs to modify their code to the minimum extent by adapting the normal PyTorch APIs to the Colossal-AI
engine APIs. In this way, they can enjoy more features for efficient training.
A normal training iteration using engine can be:
```python
import colossalai
# build your model, optimizer, criterion, dataloaders
...
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
optimizer,
criterion,
train_dataloader,
test_dataloader)
for img, label in train_dataloader:
engine.zero_grad()
output = engine(img)
loss = engine.criterion(output, label)
engine.backward(loss)
engine.step()
```
### Trainer
Trainer is a more high-level wrapper for the user to execute training with fewer lines of code. However, in pursuit of more abstraction, it loses some flexibility compared to engine. The trainer is designed to execute a forward and backward step to perform model weight update. It is easy to create a trainer object by passing the engine object. The trainer has a default value `None` for the argument `schedule`. In most cases, we leave this value to `None` unless we want to use pipeline parallelism. If you wish to explore more about this parameter, you can go to the tutorial on pipeline parallelism.
```python
from colossalai.logging import get_dist_logger
from colossalai.trainer import Trainer, hooks
# build components and initialize with colossalai.initialize
...
# create a logger so that trainer can log on the console
logger = get_dist_logger()
# create a trainer object
trainer = Trainer(
engine=engine,
logger=logger
)
```
In the trainer, the user can customize some hooks and attach these hooks to the trainer object. A hook object will execute life-cycle methods periodically based on the training scheme. For example, the `LRSchedulerHook` will execute `lr_scheduler.step()` to update the learning rate of the model during either the `after_train_iter` or `after_train_epoch` stage, depending on whether the user wants to update the learning rate after each training iteration or only after the entire training epoch. You can store the hook objects in a list and pass it to the `trainer.fit` method. The `trainer.fit` method will execute training and testing based on your parameters. If `display_progress` is True, a progress bar will be displayed on your console to show the training process.
```python
# define the hooks to attach to the trainer
hook_list = [
hooks.LossHook(),
hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True),
hooks.AccuracyHook(accuracy_func=Accuracy()),
hooks.LogMetricByEpochHook(logger),
]
# start training
trainer.fit(
train_dataloader=train_dataloader,
epochs=NUM_EPOCHS,
test_dataloader=test_dataloader,
test_interval=1,
hooks=hook_list,
display_progress=True
)
```
If you want to customize your own hook class, you can inherit `hooks.BaseHook` and override the life-cycle methods of your interest. A dummy example to demonstrate how to create a simple log message hook is provided below for your reference.
```python
from colossalai.logging import get_dist_logger
from colossalai.trainer import hooks
class LogMessageHook(hooks.BaseHook):
def __init__(self, priority=10):
self._logger = get_dist_logger()
def before_train(self, trainer):
self._logger.info('training starts')
def after_train(self, trainer):
self._logger.info('training finished')
...
# then in your training script
hook_list.append(LogMessageHook())
```
In the sections below, I will guide you through the steps required to train a ResNet model with both engine and trainer.
## Explain with ResNet
### Overview
In this section we will cover:
1. Use an engine object to train a ResNet34 model on CIFAR10 dataset
2. Use a trainer object to train a ResNet34 model on CIFAR10 dataset
The project structure will be like:
```bash
-- config.py
-- run_resnet_cifar10_with_engine.py
-- run_resnet_cifar10_with_trainer.py
```
Steps 1-4 below are commonly used regardless of using engine or trainer. Thus, steps 1-4 + step 5 will be your `run_resnet_cifar10_with_engine.py` and steps 1-4 + step 6 will form `run_resnet_cifar10_with_trainer.py`.
### Hands-on Practice
#### Step 1. Create a Config File
In your project folder, create a `config.py`. This file is to specify some features you may want to use to train your model. A sample config file is as below:
```python
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 128
NUM_EPOCHS = 200
fp16=dict(
mode=AMP_TYPE.TORCH
)
```
In this config file, we specify that we want to use batch size 128 per GPU and run for 200 epochs. These two parameters are exposed by `gpc.config`. For example, you can use `gpc.config.BATCH_SIZE` to access the value you store in your config file. The `fp16` configuration tells `colossalai.initialize` to use mixed precision training provided by PyTorch to train the model with better speed and lower memory consumption.
#### Step 2. Initialize Distributed Environment
We need to initialize the distributed training environment. This has been introduced in the tutorial on how to
[launch Colossal-AI](./launch_colossalai.md). For this demonstration, we use `launch_from_torch` and the PyTorch launch utility.
```python
import colossalai
# ./config.py refers to the config file we just created in step 1
colossalai.launch_from_torch(config='./config.py')
```
#### Step 3. Create all the training components
In this step, we can create all the components used for training. These components include:
1. Model
2. Optimizer
3. Criterion/loss function
4. Training/Testing dataloaders
5. Learning rate Scheduler
6. Logger
To build these components, you need to import the following modules:
```python
from pathlib import Path
from colossalai.logging import get_dist_logger
import torch
import os
from colossalai.core import global_context as gpc
from colossalai.utils import get_dataloader
from torchvision import transforms
from colossalai.nn.lr_scheduler import CosineAnnealingLR
from torchvision.datasets import CIFAR10
from torchvision.models import resnet34
```
Then build your components in the same way as you normally would in your PyTorch scripts. In the script below, we set the root path of the CIFAR-10 dataset to `'./data'`. If you prefer, you can keep the dataset elsewhere and point to it with an environment variable, for example by changing `root='./data'` to `root=Path(os.environ['DATA'])`.
```python
# build logger
logger = get_dist_logger()
# build resnet
model = resnet34(num_classes=10)
# build datasets
train_dataset = CIFAR10(
root='./data',
download=True,
transform=transforms.Compose(
[
transforms.RandomCrop(size=32, padding=4),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[
0.2023, 0.1994, 0.2010]),
]
)
)
test_dataset = CIFAR10(
root='./data',
train=False,
transform=transforms.Compose(
[
transforms.ToTensor(),
transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[
0.2023, 0.1994, 0.2010]),
]
)
)
# build dataloaders
train_dataloader = get_dataloader(dataset=train_dataset,
shuffle=True,
batch_size=gpc.config.BATCH_SIZE,
num_workers=1,
pin_memory=True,
)
test_dataloader = get_dataloader(dataset=test_dataset,
add_sampler=False,
batch_size=gpc.config.BATCH_SIZE,
num_workers=1,
pin_memory=True,
)
# build criterion
criterion = torch.nn.CrossEntropyLoss()
# optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# lr_scheduler
lr_scheduler = CosineAnnealingLR(optimizer, total_steps=gpc.config.NUM_EPOCHS)
```
#### Step 4. Initialize with Colossal-AI
Next, the essential step is to obtain an engine object by calling `colossalai.initialize`. As stated in `config.py`, we will be using mixed precision training to train the ResNet34 model. `colossalai.initialize` will automatically check your config file and assign the relevant features to your training components. In this way, our engine object is already able to train with mixed precision, and you do not have to take care of it explicitly.
```python
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
optimizer,
criterion,
train_dataloader,
test_dataloader,
)
```
#### Step 5. Train with engine
With all the training components ready, we can train ResNet34 just as we would in ordinary PyTorch training.
```python
for epoch in range(gpc.config.NUM_EPOCHS):
# execute a training iteration
engine.train()
for img, label in train_dataloader:
img = img.cuda()
label = label.cuda()
# set gradients to zero
engine.zero_grad()
# run forward pass
output = engine(img)
# compute loss value and run backward pass
train_loss = engine.criterion(output, label)
engine.backward(train_loss)
# update parameters
engine.step()
# update learning rate
lr_scheduler.step()
# execute a testing iteration
engine.eval()
correct = 0
total = 0
for img, label in test_dataloader:
img = img.cuda()
label = label.cuda()
# run prediction without back-propagation
with torch.no_grad():
output = engine(img)
test_loss = engine.criterion(output, label)
# compute the number of correct prediction
pred = torch.argmax(output, dim=-1)
correct += torch.sum(pred == label)
total += img.size(0)
logger.info(
f"Epoch {epoch} - train loss: {train_loss:.5}, test loss: {test_loss:.5}, acc: {correct / total:.5}, lr: {lr_scheduler.get_last_lr()[0]:.5g}", ranks=[0])
```
#### Step 6. Train with trainer
If you wish to train with a trainer object, you can follow the code snippet below:
```python
from colossalai.nn.metric import Accuracy
from colossalai.trainer import Trainer, hooks
# create a trainer object
trainer = Trainer(
engine=engine,
logger=logger
)
# define the hooks to attach to the trainer
hook_list = [
hooks.LossHook(),
hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True),
hooks.AccuracyHook(accuracy_func=Accuracy()),
hooks.LogMetricByEpochHook(logger),
hooks.LogMemoryByEpochHook(logger)
]
# start training
# run testing every 1 epoch
trainer.fit(
train_dataloader=train_dataloader,
epochs=gpc.config.NUM_EPOCHS,
test_dataloader=test_dataloader,
test_interval=1,
hooks=hook_list,
display_progress=True
)
```
#### Step 7. Start Distributed Training
Lastly, we can invoke the scripts using the distributed launcher provided by PyTorch, since we used `launch_from_torch` in Step 2. You need to replace `<num_gpus>` with the number of GPUs available on your machine. This number can be 1 if you only want to use 1 GPU. If you wish to use other launchers, you can refer to the tutorial on how to launch Colossal-AI.
```bash
# with engine
python -m torch.distributed.launch --nproc_per_node <num_gpus> --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py
# with trainer
python -m torch.distributed.launch --nproc_per_node <num_gpus> --master_addr localhost --master_port 29500 run_resnet_cifar10_with_trainer.py
```

View File

@ -0,0 +1,49 @@
# Initialize Features
Author: Shenggui Li, Siqi Mai
**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
## Introduction
In this tutorial, we will cover the use of `colossalai.initialize` which injects features into your training components
(e.g. model, optimizer, dataloader) seamlessly. Calling `colossalai.initialize` is the standard procedure before you run
into your training loops.
In the sections below, we will cover how `colossalai.initialize` works and what you should take note of.
## Usage
In a typical workflow, we launch the distributed environment at the beginning of the training script.
Afterwards, we instantiate our objects such as the model, optimizer, loss function and dataloaders. At this point, `colossalai.initialize`
can come in to inject features into these objects. A pseudo-code example is shown below:
```python
import colossalai
import torch
...
# launch distributed environment
colossalai.launch(config='./config.py', ...)
# create your objects
model = MyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()
train_dataloader = MyTrainDataloader()
test_dataloader = MyTestDataloader()
# initialize features
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
optimizer,
criterion,
train_dataloader,
test_dataloader)
```
The `colossalai.initialize` function will return an `Engine` object. The engine object is a wrapper
for model, optimizer and loss function. **The engine object will run with features specified in the config file.**
More details about the engine can be found in the [Use Engine and Trainer in Training](./engine_trainer.md).

View File

@ -0,0 +1,232 @@
# Launch Colossal-AI
Author: Chuanrui Wang, Shenggui Li, Siqi Mai
**Prerequisite:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
## Introduction
As mentioned in the prerequisite tutorials, you need to initialize the distributed environment
for Colossal-AI after your config file is prepared.
We call this process `launch`.
In this tutorial, you will learn how to launch Colossal-AI on your server, whether it is a small or a large one.
In Colossal-AI, we provide several launch methods to initialize the distributed backend.
In most cases, you can use `colossalai.launch` and `colossalai.get_default_parser` to pass the
parameters via the command line.
If you use launchers such as SLURM, OpenMPI or the PyTorch launch utility,
we also provide several launching helper methods that read the rank and world size directly from the environment variables
set by these launchers for your convenience.
In this tutorial we will cover how to launch Colossal-AI to initialize the distributed backends:
- Launch with `colossalai.launch`
- Launch with Colossal-AI CLI
- Launch with SLURM
- Launch with OpenMPI
## Launch Distributed Environment
In order to launch Colossal-AI, we need two types of arguments:
1. config file
2. distributed settings
The config file is always required regardless of the launch method, while the distributed settings can vary. The config file
can be a path to the configuration file or a Python dictionary. The distributed settings can be passed via the command line
or by multi-process launchers.
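For instance, the same configuration can be supplied in either form; a minimal sketch (the launch functions themselves are covered below):
```python
from colossalai.amp import AMP_TYPE

# option 1: a path to a configuration file
config = './config.py'

# option 2: an equivalent in-memory Python dictionary
config = dict(
    BATCH_SIZE=128,
    NUM_EPOCHS=200,
    fp16=dict(mode=AMP_TYPE.TORCH),
)

# either form can be passed as the `config` argument of the launch functions below
```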
### Command Line Parser
Before we jump to `launch`, we first need to understand what parameters are needed for initialization.
As stated in the `Basic Concepts in Distributed Training` section of [Distributed Training](../concepts/distributed_training.md),
the important parameters are:
1. host
2. port
3. rank
4. world_size
5. backend
In Colossal-AI, we provide a command line parser which already includes these arguments. You can get this parser by calling
`colossalai.get_default_parser()`. This parser is usually used together with `colossalai.launch`.
```python
# add these lines in your train.py
import colossalai
# get default parser
parser = colossalai.get_default_parser()
# if you want to add your own arguments
parser.add_argument(...)
# parse arguments
args = parser.parse_args()
```
Then in your terminal, you can pass in these arguments:
```shell
python train.py --host <host> --rank <rank> --world_size <world_size> --port <port> --backend <backend>
```
`backend` is optional and the default value is `nccl`.
### Native Launch
To initialize the distributed environment, we provide a general `colossalai.launch` API. The `colossalai.launch` function takes in the parameters
listed above and creates a default process group in the communication network. This function is often used together with the default
parser for convenience.
```python
import colossalai
# parse arguments
args = colossalai.get_default_parser().parse_args()
# launch distributed environment
colossalai.launch(config=<CONFIG>,
rank=args.rank,
world_size=args.world_size,
host=args.host,
port=args.port,
backend=args.backend
)
```
### Launch with Colossal-AI CLI
To enable easy launching on both single and multiple nodes, we have implemented a launcher for Colossal-AI. This launcher
wraps the torch distributed launch utility and adds the capability of launching multi-node jobs easily.
First, we need to set the launch method in our code. As this is a wrapper of the torch distributed launch utility, we will
use `colossalai.launch_from_torch`. The arguments required for the distributed environment, such as rank, world size, host and port, are all set by the PyTorch
launcher and can be read directly from the environment variables.
```python
import colossalai
colossalai.launch_from_torch(
config=<CONFIG>,
)
```
Next, we can easily start multiple processes with `colossalai run` in your terminal. Below is an example of running the code
on a single node with 4 GPUs. You can change the number of GPUs with `--nproc_per_node` and the default port with `--master_port`.
```shell
# run on the local node with 4 GPUs (default port: 29500)
colossalai run --nproc_per_node 4 train.py
# run on the local node with 4 GPUs with a different port
colossalai run --nproc_per_node 4 --master_port 29505 test.py
```
If you are in a cluster and want to launch multi-node training, the CLI can help you start processes on different nodes
with one simple command. There are two ways you can launch multi-node jobs.
- Run with `--hosts`
This is suitable when you only have a few nodes. Let's say we have two nodes, namely `host1` and `host2`; we can start
multi-node training with the following command. Compared to single-node training, you must specify the `master_addr`
option, which is automatically set to localhost when running on a single node only.
:::caution
`master_addr` cannot be localhost when running on multiple nodes, it should be the hostname or IP address of a node.
:::
```shell
# run on these two nodes
colossalai run --nproc_per_node 4 --host host1,host2 --master_addr host1 test.py
```
- Run with `--hostfile`
This method is suitable when you have a lot of nodes. The host file is a simple text file listing the available nodes.
The list of nodes is commonly provided by cluster managers such as SLURM and PBS Pro. For example, you can get the list
of nodes allocated to you via the environment variable `SLURM_NODELIST` in SLURM and `PBS_NODEFILE` in PBS Pro.
Just run `echo $SLURM_NODELIST` or `cat $PBS_NODEFILE` to check. If you do not have such a cluster manager, you can
create a host file manually for your own use.
The host file given to Colossal-AI launcher must be in the following format where each line is the host name of a node.
```text
host1
host2
```
With the host file ready, we can launch multi-node jobs with the following commands. Just like using `--host`, you also
need to specify the `master_addr` option. Some extra options are provided for `--hostfile` as listed below:
- `--include`: specify the hosts to include for multi-node jobs. For example, if your host file has 8 nodes, but you
only want to run on 6 of them, you can add `--include host1,host2,host3,...,host6` so that the job will only
be launched on those 6 nodes.
- `--exclude`: specify the hosts to exclude for multi-node jobs. This is useful when some nodes are faulty. For example,
if the GPUs on host1 have problems and you do not wish to run on host1 but on all other nodes, you can add `--exclude host1` so that
the job will only be launched on the remaining nodes.
```shell
# run with a hostfile
colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1 test.py
# only include certain hosts to execute commands
# this is used to manually select nodes to run
colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1 --include host1 test.py
# exclude certain hosts to execute commands
# this can be used when certain nodes are faulty
colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1 --exclude host2 test.py
```
### Launch with SLURM
If you are on a system managed by the SLURM scheduler, you can also rely on the `srun` launcher to kickstart your Colossal-AI scripts.
We provide the helper function `launch_from_slurm` for compatibility with the SLURM scheduler.
`launch_from_slurm` will automatically read the rank and world size from the environment variables `SLURM_PROCID` and `SLURM_NPROCS` respectively
and use them to start the distributed backend.
Do this in your training script:
```python
import colossalai
colossalai.launch_from_slurm(
config=<CONFIG>,
host=args.host,
port=args.port
)
```
You can then initialize the distributed environment by using this command in the terminal.
```bash
srun python train.py --host <master_node> --port 29500
```
### Launch with OpenMPI
If you are more familiar with OpenMPI, you can use `launch_from_openmpi` instead.
`launch_from_openmpi` will automatically read the local rank, global rank and world size from the environment variables
`OMPI_COMM_WORLD_LOCAL_RANK`, `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_SIZE` respectively and
use them to start the distributed backend.
Do this in your train.py:
```python
import colossalai

colossalai.launch_from_openmpi(
config=<CONFIG>,
host=args.host,
port=args.port
)
```
A sample command to launch multiple processes with OpenMPI would be:
```bash
mpirun --hostfile <my_hostfile> -np <num_process> python train.py --host <node name or ip> --port 29500
```
- `--hostfile`: use this option to specify a list of hosts on which to run
- `--np`: set the total number of processes (GPUs) to launch. For example, with `--np 4`, 4 Python processes will be initialized to run `train.py`.

View File

@ -0,0 +1,61 @@
# Model Checkpoint
Author: Guangyang Lu
**Prerequisite:**
- [Launch Colossal-AI](./launch_colossalai.md)
- [Initialize Colossal-AI](./initialize_features.md)
**Example Code:**
- [ColossalAI-Examples Model Checkpoint](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/utils/checkpoint)
**This function is experimental.**
## Introduction
In this tutorial, you will learn how to save and load model checkpoints.
To leverage the power of parallel strategies in Colossal-AI, models and tensors need to be modified, so you cannot directly use `torch.save` or `torch.load` to save or load model checkpoints. Therefore, we provide an API that achieves the same thing.
Moreover, when loading, you are not required to use the same parallel strategy as when saving.
## How to use
### Save
There are two ways to train a model in Colossal-AI, by engine or by trainer.
**Be aware that we only save the `state_dict`.** Therefore, when loading the checkpoints, you need to define the model first.
#### Save when using engine
```python
from colossalai.utils import save_checkpoint
model = ...
engine, _, _, _ = colossalai.initialize(model=model, ...)
for epoch in range(num_epochs):
... # do some training
save_checkpoint('xxx.pt', epoch, model)
```
#### Save when using trainer
```python
from colossalai.trainer import Trainer, hooks
model = ...
engine, _, _, _ = colossalai.initialize(model=model, ...)
trainer = Trainer(engine, ...)
hook_list = [
hooks.SaveCheckpointHook(1, 'xxx.pt', model)
...]
trainer.fit(...
            hooks=hook_list)
```
### Load
```python
from colossalai.utils import load_checkpoint
model = ...
load_checkpoint('xxx.pt', model)
... # train or test
```

View File

@ -0,0 +1,36 @@
# Colossal-AI Overview
Author: Shenggui Li, Siqi Mai
## About Colossal-AI
As deep learning models grow in size, it is important to shift to a new training paradigm. The traditional training method with no parallelism and optimization is becoming a thing of the past, and new training methods are the key to making the training of large-scale models efficient and cost-effective.
Colossal-AI is designed to be a unified system that provides an integrated set of training skills and utilities to the user. You can find common training utilities such as mixed precision training and gradient accumulation. Besides, we provide an array of parallelism techniques including data, tensor and pipeline parallelism. We optimize tensor parallelism with different multi-dimensional distributed matrix-matrix multiplication algorithms. We also provide different pipeline parallelism methods to allow the user to scale their model across nodes efficiently. More advanced features such as offloading are also covered in detail in this tutorial documentation.
## General Usage
We aim to make Colossal-AI easy to use and non-intrusive to user code. There is a simple general workflow if you want to use Colossal-AI.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/ZK7ICWzbMsVuJof.png"/>
<figcaption>Workflow</figcaption>
</figure>
1. Prepare a configuration file that specifies the features you want to use and your parameters.
2. Initialize the distributed backend with `colossalai.launch`.
3. Inject the training features into your training components (e.g. model, optimizer) with `colossalai.initialize`.
4. Run training and testing.
We will cover the whole workflow in the `basic tutorials` section.
## Future Development
The Colossal-AI system will be expanded to include more training skills. These new developments may include but are not limited to:
1. optimization of distributed operations
2. optimization of training on heterogeneous systems
3. implementation of training utilities to reduce model size and speed up training while preserving model performance
4. expansion of existing parallelism methods
We welcome ideas and contributions from the community, and you can post your ideas for future development in our forum.

View File

@ -0,0 +1,120 @@
# Distributed Training
Author: Shenggui Li, Siqi Mai
## What is a distributed system?
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/sE5daHf2ohIy9wX.png"/>
<figcaption>Image source: <a href="https://towardsdatascience.com/distributed-training-in-the-cloud-cloud-machine-learning-engine-9e264ddde27f">Towards Data Science</a></figcaption>
</figure>
A distributed system consists of multiple software components which run on multiple machines. For example, a traditional
database runs on a single machine. As the amount of data gets incredibly large, a single machine can no longer deliver desirable
performance to the business, especially in situations such as Black Friday where network traffic can be unexpectedly high.
To handle such pressure, modern high-performance databases are designed to run on multiple machines that work together to provide
high throughput and low latency to the user.
One important evaluation metric for a distributed system is scalability. For example, when we run an application on 4 machines,
we naturally expect it to run 4 times faster. However, due to communication overhead and differences in
hardware performance, it is difficult to achieve linear speedup. Thus, it is important to consider how to make the application
faster when we implement it. Well-designed algorithms and system optimizations can help to deliver good performance. Sometimes,
it is even possible to achieve linear or super-linear speedup.
## Why do we need distributed training for machine learning?
Back in 2012, [AlexNet](https://arxiv.org/abs/1404.5997) won the ImageNet competition, and it was trained
on two GTX 580 3GB GPUs.
Today, most models that appear in the top AI conferences are trained on multiple GPUs. Distributed training is definitely
a common practice when researchers and engineers develop AI models. There are several reasons behind this trend.
1. Model size increases rapidly. [ResNet50](https://arxiv.org/abs/1512.03385) has 20 million parameters in 2015,
[BERT-Large](https://arxiv.org/abs/1810.04805) has 345 million parameters in 2018,
[GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
has 1.5 billion parameters in 2018, and [GPT-3](https://arxiv.org/abs/2005.14165) has 175 billion parameters in 2020.
It is obvious that model size grows exponentially over time. The current largest models have exceeded 1000
billion parameters. Super large models generally deliver superior performance compared to their smaller counterparts.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/sCyreJ9PF1EdZYf.jpg"/>
<figcaption>Image source: <a href="https://huggingface.co/blog/large-language-models">HuggingFace</a></figcaption>
</figure>
2. Dataset size increases rapidly. For most machine learning developers, MNIST and CIFAR10 datasets are often the first few
datasets on which they train their models. However, these datasets are very small compared to the well-known ImageNet dataset.
Google even has its own (unpublished) JFT-300M dataset which has around 300 million images, and this is close to 300 times
larger than the ImageNet-1k dataset.
3. Computing power gets stronger. With the advancement in the semiconductor industry, graphics cards become more and more
powerful. Due to its larger number of cores, GPU is the most common compute platform for deep learning.
From the K10 GPU in 2012 to the A100 GPU in 2020, computing power has increased several hundred times. This allows us to perform
compute-intensive tasks faster, and deep learning is exactly such a task.
Nowadays, the model can be too large to fit into a single GPU, and the dataset can be large enough to train for a hundred
days on a single GPU. Only by training our models on multiple GPUs with different parallelization techniques are we able
to speed up the training process and obtain results in a reasonable amount of time.
## Basic Concepts in Distributed Training
Distributed training requires multiple machines/GPUs. During training, there will be communication among these devices.
To understand distributed training better, there are several important terms to be made clear.
- host: host is the main device in the communication network. It is often required as an argument when initializing the
distributed environment.
- port: port here mainly refers to master port on the host for communication.
- rank: the unique ID given to a device in the network.
- world size: the number of devices in the network.
- process group: a process group is a communication network which includes a subset of the devices. There is always a default
process group which contains all the devices. A subset of devices can form a process group so that they only communicate among
the devices within the group.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/qnNBKh8AjzgM5sY.png"/>
<figcaption>A distributed system example</figcaption>
</figure>
To illustrate these concepts, let's assume we have 2 machines (also called nodes), and each machine has 4 GPUs. When we
initialize distributed environment over these two machines, we essentially launch 8 processes (4 processes on each machine)
and each process is bound to a GPU.
Before initializing the distributed environment, we need to specify the host (master address) and port (master port). In
this example, we can let host be node 0 and port be a number such as 29500. All the 8 processes will then look for the
address and port and connect to one another.
The default process group will then be created. The default process group has a world size of 8 and details are as follows:
| process ID | rank | Node index | GPU index |
| ---------- | ---- | ---------- | --------- |
| 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 |
| 2 | 2 | 0 | 2 |
| 3 | 3 | 0 | 3 |
| 4 | 4 | 1 | 0 |
| 5 | 5 | 1 | 1 |
| 6 | 6 | 1 | 2 |
| 7 | 7 | 1 | 3 |
We can also create a new process group. This new process group can contain any subset of the processes.
For example, we can create one containing only even-number processes, and the details of this new group will be:
| process ID | rank | Node index | GPU index |
| ---------- | ---- | ---------- | --------- |
| 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 2 |
| 4 | 2 | 1 | 0 |
| 6 | 3 | 1 | 2 |
**Please note that rank is relative to the process group and one process can have a different rank in different process
groups. The max rank is always `world size of the process group - 1`.**
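To make this concrete, here is a minimal sketch in plain PyTorch (`torch.distributed`, which Colossal-AI builds on) that creates the even-rank subgroup from the table above; the resulting group handle can then be passed to collective calls so that they only involve those processes:
```python
import torch.distributed as dist

# assume the default process group of world size 8 has already been initialized,
# e.g. by colossalai.launch or dist.init_process_group(backend='nccl')
even_ranks = [0, 2, 4, 6]
even_group = dist.new_group(ranks=even_ranks)  # must be called by every process

# ranks are relative to a group: global rank 4 has rank 2 inside `even_group`
if dist.get_rank() in even_ranks:
    print(dist.get_rank(group=even_group))
```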
In the process group, the processes can communicate in two ways:
1. peer-to-peer: one process sends data to another process
2. collective: a group of processes performs operations such as scatter, gather, all-reduce and broadcast together (a minimal sketch of both modes is given after the figure below)
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/zTmlxgc3oeAdn97.png"/>
<figcaption>Collective communication, source: <a href="https://pytorch.org/tutorials/intermediate/dist_tuto.html">PyTorch distributed tutorial</a></figcaption>
</figure>
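As a minimal sketch of the two communication modes (assuming the distributed environment has been initialized and each process is bound to its own GPU):
```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
tensor = torch.ones(1).cuda() * rank

# peer-to-peer: rank 0 sends its tensor to rank 1
if rank == 0:
    dist.send(tensor, dst=1)
elif rank == 1:
    dist.recv(tensor, src=0)

# collective: every process contributes its tensor and receives the global sum
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
```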

View File

@ -0,0 +1,123 @@
# Paradigms of Parallelism
Author: Shenggui Li, Siqi Mai
## Introduction
With the development of deep learning, there is an increasing demand for parallel training. This is because models
and datasets are getting larger and larger, and training time becomes a nightmare if we stick to single-GPU training. In
this section, we will provide a brief overview of existing methods to parallelize training. If you wish to add on to this
post, you may create a discussion in the [GitHub forum](https://github.com/hpcaitech/ColossalAI/discussions).
## Data Parallel
Data parallelism is the most common form of parallelism due to its simplicity. In data parallel training, the dataset is split
into several shards, and each shard is allocated to a device. This is equivalent to parallelizing the training process along the
batch dimension. Each device holds a full copy of the model and trains on its allocated dataset shard. After
back-propagation, the gradients of the model are all-reduced so that the model parameters on different devices stay
synchronized (a minimal sketch of this gradient all-reduce is given after the figure below).
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/WSAensMqjwHdOlR.png"/>
<figcaption>Data parallel illustration</figcaption>
</figure>
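The gradient all-reduce described above can be sketched with plain PyTorch as follows; in practice, a data-parallel wrapper such as `torch.nn.parallel.DistributedDataParallel` typically does this for you:
```python
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module):
    """Average gradients across all data-parallel ranks, to be called after backward()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
```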
## Model Parallel
In data parallel training, one prominent feature is that each GPU holds a copy of the whole model weights. This brings
a redundancy issue. Another paradigm of parallelism is model parallelism, where the model is split and distributed over an array
of devices. There are generally two types of model parallelism: tensor parallelism and pipeline parallelism. Tensor parallelism
parallelizes computation within an operation such as matrix-matrix multiplication, while pipeline parallelism parallelizes
computation between layers. Thus, from another point of view, tensor parallelism can be seen as intra-layer parallelism and
pipeline parallelism can be seen as inter-layer parallelism.
### Tensor Parallel
In tensor parallel training, a tensor is split into `N` chunks along a specific dimension and each device only holds `1/N`
of the whole tensor, without affecting the correctness of the computation graph. This requires additional communication
to make sure that the result is correct.
Taking a general matrix multiplication as an example, let's say we have C = AB. We can split B along the column dimension
into `[B0 B1 B2 ... Bn]` so that each device holds one column block. We then multiply `A` with each column block of `B` on each device, which
gives us `[AB0 AB1 AB2 ... ABn]`. At this moment, each device still holds a partial result, e.g. device rank 0 holds `AB0`.
To make sure the result is correct, we need to all-gather the partial results and concatenate the tensor along the column
dimension. In this way, we are able to distribute the tensor over devices while making sure the computation flow remains
correct (a small sketch of this block algebra is given after the figure below).
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/2ZwyPDvXANW4tMG.png"/>
<figcaption>Tensor parallel illustration</figcaption>
</figure>
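The column-splitting example above can be checked with a small single-process sketch (no communication, just the block algebra):
```python
import torch

A = torch.randn(4, 6)
B = torch.randn(6, 8)

# split B along the column dimension; each "device" would hold one chunk
B0, B1 = torch.chunk(B, 2, dim=1)

# each device computes its partial result A @ Bi
partials = [A @ B0, A @ B1]

# all-gather + concatenation along the column dimension recovers C = AB
C = torch.cat(partials, dim=1)
assert torch.allclose(C, A @ B, atol=1e-5)
```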
In Colossal-AI, we provide an array of tensor parallelism methods, namely 1D, 2D, 2.5D and 3D tensor parallelism. We will
talk about them in detail in `advanced tutorials`.
Related paper:
- [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](https://arxiv.org/abs/2006.16668)
- [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
- [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
- [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
### Pipeline Parallel
Pipeline parallelism is generally easy to understand. If you recall your computer architecture course, this indeed exists
in the CPU design.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/at3eDv7kKBusxbd.png"/>
<figcaption>Pipeline parallel illustration</figcaption>
</figure>
The core idea of pipeline parallelism is that the model is split by layer into several chunks, and each chunk is
given to a device. During the forward pass, each device passes the intermediate activations to the next stage. During the backward pass,
each device passes the gradient of the input tensor back to the previous pipeline stage. This allows devices to compute simultaneously
and increases training throughput. One drawback of pipeline parallel training is that there is some bubble time during which
some devices sit idle waiting for others, leading to a waste of computational resources.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/sDNq51PS3Gxbw7F.png"/>
<figcaption>Source: <a href="https://arxiv.org/abs/1811.06965">GPipe</a></figcaption>
</figure>
Related paper:
- [PipeDream: Fast and Efficient Pipeline Parallel DNN Training](https://arxiv.org/abs/1806.03377)
- [GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://arxiv.org/abs/1811.06965)
- [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- [Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines](https://arxiv.org/abs/2107.06925)
## Optimizer-Level Parallel
Another paradigm works at the optimizer level, and the current most famous method of this paradigm is ZeRO which stands
for [zero redundancy optimizer](https://arxiv.org/abs/1910.02054). ZeRO works at three levels to remove memory redundancy
(fp16 training is required for ZeRO):
- Level 1: The optimizer states are partitioned across the processes
- Level 2: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process
only stores the gradients corresponding to its partition of the optimizer states.
- Level 3: The 16-bit model parameters are partitioned across the processes
Related paper:
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
## Parallelism on Heterogeneous System
The methods mentioned above generally require a large number of GPUs to train a large model. However, it is often neglected
that the CPU has a much larger memory than the GPU. On a typical server, the CPU can easily have several hundred GB of RAM while each GPU
typically only has 16 or 32 GB of RAM. This prompts the community to ask why CPU memory is not utilized for distributed training.
Recent advances rely on CPU and even NVMe disk to train large models. The main idea is to offload tensors back to CPU memory
or NVMe disk when they are not used. By using the heterogeneous system architecture, it is possible to accommodate a huge
model on a single machine.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/qLHD5lk97hXQdbv.png"/>
<figcaption>Heterogeneous system illustration</figcaption>
</figure>
Related paper:
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
- [PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management](https://arxiv.org/abs/2108.05818)

View File

@ -0,0 +1,111 @@
# 1D Tensor Parallelism
Author: Zhengda Bian, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
**Example Code**
- [ColossalAI-Examples 1D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_1d.py)
**Related Paper**
- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf)
## Introduction
Tensor parallelism partitions model weights across multiple devices in order to reduce memory load.
An efficient 1D tensor parallelism implementation was introduced by [Megatron-LM](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf).
Let's take a linear layer as an example, which consists of a GEMM $Y = XA$. Given 2 processors, we split the columns of $A$ into $[A_1 ~ A_2]$, and calculate $Y_i = XA_i$ on each processor, which then forms $[Y_1 ~ Y_2] = [XA_1 ~ XA_2]$. This is called a column-parallel fashion.
When a second linear layer $Z=YB$ follows the column-parallel one, we split $B$ into $\left[\begin{matrix} B_1 \\ B_2 \end{matrix} \right]$,
which is called a row-parallel fashion.
To calculate $Z = [Y_1 ~ Y_2] \left[\begin{matrix} B_1 \\ B_2 \end{matrix} \right]$, we first calculate $Y_iB_i$ on each processor, then use an all-reduce to aggregate the results as $Z=Y_1B_1+Y_2B_2$.
We also need to note that in the backward pass, the column-parallel linear layer needs to aggregate the gradients of the input tensor $X$, because on each processor $i$ we only have $\dot{X_i}=\dot{Y_i}A_i^T$.
Thus, we apply an all-reduce across the processors to get $\dot{X}=\dot{Y}A^T=\dot{Y_1}A_1^T+\dot{Y_2}A_2^T$.
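The two-layer composition above (column-parallel followed by row-parallel, with a final all-reduce) can be verified with a small single-process sketch:
```python
import torch

X = torch.randn(16, 256, dtype=torch.float64)
A = torch.randn(256, 1024, dtype=torch.float64)   # first layer, split by column
B = torch.randn(1024, 256, dtype=torch.float64)   # second layer, split by row

A1, A2 = torch.chunk(A, 2, dim=1)   # column-parallel: A = [A1 A2]
B1, B2 = torch.chunk(B, 2, dim=0)   # row-parallel:    B = [B1; B2]

# each "processor" computes Y_i = X A_i, then Y_i B_i
Z = X @ A1 @ B1 + X @ A2 @ B2       # the all-reduce sums the partial results

assert torch.allclose(Z, X @ A @ B)
```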
## Efficiency
Given $P$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 1D tensor parallelism.
| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/P)$ | $O(1/P)$ | $O(1)$ | $O(2(P-1)/P)$ | $O(2(P-1))$ |
## Usage
To enable 1D tensor parallelism for our model, e.g. on 2 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
data=1,
pipeline=1,
tensor=dict(size=2, mode='1d'),
))
```
Then Colossal-AI will automatically apply 1D parallelism to all the layers from `colossalai.nn`.
Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0
class MLP(torch.nn.Module):
def __init__(self, dim: int = 256):
super().__init__()
intermediate_dim = dim * 4
self.dense_1 = col_nn.Linear(dim, intermediate_dim)
print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.transpose(0, 1).shape}')
self.activation = torch.nn.GELU()
self.dense_2 = col_nn.Linear(intermediate_dim, dim)
print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.transpose(0, 1).shape}')
self.dropout = col_nn.Dropout(0.1)
def forward(self, x):
x = self.dense_1(x)
print_rank_0(f'Output of the first linear layer: {x.shape}')
x = self.activation(x)
x = self.dense_2(x)
print_rank_0(f'Output of the second linear layer: {x.shape}')
x = self.dropout(x)
return x
```
Launch Colossal-AI on 2 GPUs and build the model.
```python
parser = colossalai.get_default_parser()
colossalai.launch(config=CONFIG,
rank=args.rank,
world_size=args.world_size,
local_rank=args.local_rank,
host=args.host,
port=args.port)
m = MLP()
```
We will see the shapes of the partitioned parameters (e.g. weights) in the MLP model.
```shell
Weight of the first linear layer: torch.Size([256, 512])
Weight of the second linear layer: torch.Size([512, 256])
```
The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the column-parallel partitioning, it becomes `[256, 512]`.
Similarly, the second row-parallel layer partitions the weight `[1024, 256]` into `[512, 256]`.
We can run the model with some random inputs.
```python
from colossalai.utils import get_current_device
x = torch.randn((16, 256), device=get_current_device())
torch.distributed.broadcast(x, src=0) # synchronize input
x = m(x)
```
Then we can see the shapes of activation results.
```shell
Output of the first linear layer: torch.Size([16, 512])
Output of the second linear layer: torch.Size([16, 256])
```
The output of the first linear layer is split into 2 partitions (each has the shape `[16, 512]`), while the second layer has identical outputs across the GPUs.

View File

@ -0,0 +1,142 @@
# 2D Tensor Parallelism
Author: Zhengda Bian, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
- [1D Tensor Parallelism](./1D_tensor_parallel.md)
**Example Code**
- [ColossalAI-Examples - 2D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_2d.py)
**Related Paper**
- [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/pdf/2104.05343.pdf)
## Introduction
1D tensor parallelism does not partition activations, which can also consume a great amount of memory for large-scale models.
To evenly distribute the computation and memory load, [an efficient 2D tensor parallelism algorithm](https://arxiv.org/pdf/2104.05343.pdf) was introduced based on SUMMA (Scalable Universal Matrix Multiplication Algorithm).
Let's still take a linear layer $Y = XA$ as an example.
Given $P=q\times q$ processors (necessary condition), e.g. $q=2$, we split both the input $X$ and weight $A$ into
$$
\left[\begin{matrix} X_{10} & X_{11} \\ X_{00} & X_{01} \end{matrix} \right]
\text{~and~}
\left[\begin{matrix} A_{10} & A_{11} \\ A_{00} & A_{01} \end{matrix} \right].
$$
The calculation includes $q$ steps. When $t=1$, $X_{i0}$ is broadcasted in its row, and $A_{0j}$ is broadcasted in its column. So, we have
$$
\left[\begin{matrix} X_{10},A_{00} & X_{10},A_{01} \\ X_{00},A_{00} & X_{00},A_{01} \end{matrix} \right].
$$
Then we multiply $X_{i0}$ and $A_{0j}$ on each processor $(i, j)$ as
$$
\left[\begin{matrix} X_{10}A_{00} & X_{10}A_{01} \\ X_{00}A_{00} & X_{00}A_{01} \end{matrix} \right] (1).
$$
Similarly, when $t=2$, $X_{i1}$ is broadcasted in its row, $A_{1j}$ is broadcasted in its column, and we multiply them as
$$
\left[\begin{matrix} X_{11}A_{10} & X_{11}A_{11} \\ X_{01}A_{10} & X_{01}A_{11} \end{matrix} \right] (2).
$$
By adding $(1)$ and $(2)$ up, we have
$$
Y = XA = \left[\begin{matrix} X_{10}A_{00}+X_{11}A_{10} & X_{10}A_{01}+X_{11}A_{11} \\ X_{00}A_{00}+X_{01}A_{10} & X_{00}A_{01}+X_{01}A_{11} \end{matrix} \right].
$$
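The $q$-step accumulation above can be checked with a small single-process sketch using $2 \times 2$ block matrices (the broadcasts are replaced by plain indexing and standard block numbering):
```python
import torch

q = 2
X = torch.randn(8, 8, dtype=torch.float64)
A = torch.randn(8, 8, dtype=torch.float64)

# split X and A into q x q blocks
Xb = [list(row.chunk(q, dim=1)) for row in X.chunk(q, dim=0)]
Ab = [list(row.chunk(q, dim=1)) for row in A.chunk(q, dim=0)]

# Y_{ij} = sum_t X_{it} A_{tj}: in step t, X_{it} is broadcast in row i and A_{tj} in column j
Yb = [[sum(Xb[i][t] @ Ab[t][j] for t in range(q)) for j in range(q)] for i in range(q)]

Y = torch.cat([torch.cat(row, dim=1) for row in Yb], dim=0)
assert torch.allclose(Y, X @ A)
```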
## Efficiency
Given $P=q\times q$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 2D tensor parallelism.
| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/q^2)$ | $O(1/q^2)$ | $O(1/q^2)$ | $O(6(q-1)/q)$ | $O(6(q-1))$ |
## Usage
To enable 2D tensor parallelism for our model, e.g. on 4 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
data=1,
pipeline=1,
tensor=dict(size=4, mode='2d'),
))
```
Then Colossal-AI will automatically apply 2D parallelism to all the layers from `colossalai.nn`.
Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0
class MLP(torch.nn.Module):
def __init__(self, dim: int = 256):
super().__init__()
intermediate_dim = dim * 4
self.dense_1 = col_nn.Linear(dim, intermediate_dim)
print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
self.activation = torch.nn.GELU()
self.dense_2 = col_nn.Linear(intermediate_dim, dim)
print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
self.dropout = col_nn.Dropout(0.1)
def forward(self, x):
x = self.dense_1(x)
print_rank_0(f'Output of the first linear layer: {x.shape}')
x = self.activation(x)
x = self.dense_2(x)
print_rank_0(f'Output of the second linear layer: {x.shape}')
x = self.dropout(x)
return x
```
Launch Colossal-AI on 4 GPUs and build the model
```python
parser = colossalai.get_default_parser()
colossalai.launch(config=CONFIG,
rank=args.rank,
world_size=args.world_size,
local_rank=args.local_rank,
host=args.host,
port=args.port)
m = MLP()
```
We will see the shapes of the partitioned parameters (e.g. weights) in the MLP model.
```shell
Weight of the first linear layer: torch.Size([128, 512])
Weight of the second linear layer: torch.Size([512, 128])
```
The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the partitioning of 2D parallelism, it becomes `[128, 512]` on each GPU.
Similarly, the second layer partitions the weight `[1024, 256]` into `[512, 128]`.
We can run the model with some random inputs.
```python
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.utils import get_current_device
x = torch.randn((16, 256), device=get_current_device())
# partition input
torch.distributed.broadcast(x, src=0)
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL)]
x = torch.chunk(x, 2, dim=-1)[gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW)]
print_rank_0(f'Input: {x.shape}')
x = m(x)
```
Then we can see the shapes of activation results.
```shell
Input: torch.Size([8, 128])
Output of the first linear layer: torch.Size([8, 512])
Output of the second linear layer: torch.Size([8, 128])
```
The activation tensors in 2D parallelism are all split along both rows and columns.
E.g. the output of the first linear layer has the shape `[8, 512]`, while the output of the second layer is `[8, 128]`.

View File

@ -0,0 +1,142 @@
# 2.5D Tensor Parallelism
Author: Zhengda Bian, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
- [1D Tensor Parallelism](./1D_tensor_parallel.md)
- [2D Tensor Parallelism](./2D_tensor_parallel.md)
**Example Code**
- [ColossalAI-Examples - 2.5D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_2p5d.py)
**Related Paper**
- [2.5-dimensional distributed model training](https://arxiv.org/pdf/2105.14500.pdf)
## Introduction
Compared with 1D tensor parallelism, 2D parallelism reduces the memory cost, but may introduce more communication.
Therefore, a [2.5D tensor parallelism algorithm](https://arxiv.org/pdf/2105.14500.pdf) was proposed based on 2.5D SUMMA to reduce communication by using more devices.
Let's still take a linear layer $Y = XA$ as an example.
Given $P=q \times q \times d$ processors (necessary condition), e.g. $q=d=2$, we split the input $X$ into $d\times q$ rows and $q$ columns as
$$
\left[\begin{matrix} X_{30} & X_{31} \\ X_{20} & X_{21} \\ X_{10} & X_{11} \\ X_{00} & X_{01}\end{matrix} \right],
$$
which can be reshaped into $d$ layers as
$$
\left[\begin{matrix} X_{10} & X_{11} \\ X_{00} & X_{01} \end{matrix} \right] \text{~and~}\left[\begin{matrix} X_{30} & X_{31} \\ X_{20} & X_{21} \end{matrix} \right].
$$
Also, the weight $A$ is split into
$$
\left[\begin{matrix} A_{10} & A_{11} \\ A_{00} & A_{01} \end{matrix} \right].
$$
For each layer of $X$, we use the SUMMA algorithm to multiply $X$ and $A$.
Then, we have the output
$$
\left[\begin{matrix} Y_{10}=X_{10}A_{00}+X_{11}A_{10} & Y_{11}=X_{10}A_{01}+X_{11}A_{11} \\ Y_{00}=X_{00}A_{00}+X_{01}A_{10} & Y_{01}=X_{00}A_{01}+X_{01}A_{11} \end{matrix} \right]
\text{~and~}
$$
$$
\left[\begin{matrix} Y_{30}=X_{30}A_{00}+X_{31}A_{10} & Y_{31}=X_{30}A_{01}+X_{31}A_{11} \\ Y_{20}=X_{20}A_{00}+X_{21}A_{10} & Y_{21}=X_{20}A_{01}+X_{21}A_{11} \end{matrix} \right].
$$
## Efficiency
Given $P=q \times q \times d$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 2.5D tensor parallelism.
| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/dq^2)$ | $O(1/q^2)$ | $O(1/dq^2)$ | $\small O(3(q-1)(d+1)/dq)$ | $O(6(q-1))$ |
## Usage
To enable 2.5D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
data=1,
pipeline=1,
tensor=dict(size=8, mode='2.5d', depth=2),
))
```
Then Colossal-AI will automatically apply 2.5D parallelism to all the layers from `colossalai.nn`.
Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0
class MLP(torch.nn.Module):
def __init__(self, dim: int = 256):
super().__init__()
intermediate_dim = dim * 4
self.dense_1 = col_nn.Linear(dim, intermediate_dim)
print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
self.activation = torch.nn.GELU()
self.dense_2 = col_nn.Linear(intermediate_dim, dim)
print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
self.dropout = col_nn.Dropout(0.1)
def forward(self, x):
x = self.dense_1(x)
print_rank_0(f'Output of the first linear layer: {x.shape}')
x = self.activation(x)
x = self.dense_2(x)
print_rank_0(f'Output of the second linear layer: {x.shape}')
x = self.dropout(x)
return x
```
Launch Colossal-AI on 8 GPUs and build the model
```python
parser = colossalai.get_default_parser()
colossalai.launch(config=CONFIG,
rank=args.rank,
world_size=args.world_size,
local_rank=args.local_rank,
host=args.host,
port=args.port)
m = MLP()
```
We will see the shapes of the partitioned parameters (e.g. weights) in the MLP model.
```shell
Weight of the first linear layer: torch.Size([128, 512])
Weight of the second linear layer: torch.Size([512, 128])
```
The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the partitioning of 2.5D parallelism, it becomes `[128, 512]` on each GPU.
Similarly, the second layer partitions the weight `[1024, 256]` into `[512, 128]`.
We can run the model with some random inputs.
```python
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.utils import get_current_device
x = torch.randn((16, 256), device=get_current_device())
# partition input
torch.distributed.broadcast(x, src=0)
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP)]
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL)]
x = torch.chunk(x, 2, dim=-1)[gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW)]
print_rank_0(f'Input: {x.shape}')
x = m(x)
```
Then we can see the shapes of activation results.
```shell
Input: torch.Size([4, 128])
Output of the first linear layer: torch.Size([4, 512])
Output of the second linear layer: torch.Size([4, 128])
```
The activation tensors in 2.5D parallelism are all split by $d \times q$ in the row and $q$ in the column.
E.g. the output of the first linear layer has the shape `[4, 512]`, while the output of the second layer is `[4, 128]`.
Note that 2.5D parallelism uses the same partition method as 2D parallelism for weights; the difference lies in the partitioning of the input.

View File

@ -0,0 +1,151 @@
# 3D Tensor Parallelism
Author: Zhengda Bian, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
- [1D Tensor Parallelism](./1D_tensor_parallel.md)
- [2D Tensor Parallelism](./2D_tensor_parallel.md)
**Example Code**
- [ColossalAI-Examples - 3D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_3d.py)
**Related Paper**
- [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/pdf/2105.14450.pdf)
## Introduction
The [3D tensor parallelism](https://arxiv.org/pdf/2105.14450.pdf) is an approach to parallelize the computation of neural models, aiming to achieve the optimal communication cost.
Let's still take a linear layer $Y = XA$ as an example.
Given $P=q \times q \times q$ processors (necessary condition), e.g. $q=2$, we split the input $X$ and weight $A$ into
$$
\left[\begin{matrix}
X_{000} & X_{001} \\
X_{010} & X_{011} \\
X_{100} & X_{101} \\
X_{110} & X_{111} \end{matrix}
\right]
\text{~and~}
\left[\begin{matrix}
A_{000} & A_{001} & A_{010} & A_{011} \\
A_{100} & A_{101} & A_{110} & A_{111} \end{matrix}
\right]
\text{~respectively,}$$
where each $X_{ijl}$ and $A_{lji}$ are stored at processor $(i,j,l)$, as shown in the figure below.
<center>
<img src="https://s2.loli.net/2022/02/17/JevO6SED5z4PFdp.png" width = "200" height = "250" />
<img src="https://s2.loli.net/2022/02/17/qvtwjdfNXMAb4nF.png" width = "200" height = "250" />
<img src="https://s2.loli.net/2022/02/17/WFzm2N4IwKf1jXZ.png" width = "200" height = "250" />
<img src="https://s2.loli.net/2022/02/17/r2dZQ4hKxwTuIv6.png" width = "200" height = "250" />
</center>
Then we all-gather $X_{ijl}$ across $(i, 0...q,l)$, as well as $A_{lji}$ across $(0...q, j, l)$.
So, we have $X_{il}$ and $A_{lj}$ on each processor $(i,j,l)$ to get $X_{il}A_{lj}$.
Finally, we reduce-scatter the results across $(i, j, 0...q)$ to get $Y_{ijl}$, which forms
$$
Y=
\left[\begin{matrix}
Y_{000} & Y_{001} \\
Y_{010} & Y_{011} \\
Y_{100} & Y_{101} \\
Y_{110} & Y_{111} \end{matrix}
\right].
$$
We also need to note that in the backward pass, we need to all-gather the gradient $\dot{Y_{ijl}}$, and then reduce-scatter the gradient $\dot{X_{il}}=\dot{Y_{ij}}A_{lj}^T$ and $\dot{A_{lj}}=X_{il}^T\dot{Y_{ij}}$.
## Efficiency
Given $P=q \times q \times q$ processors, we present the theoretical computation and memory cost, as well as the communication cost based on the ring algorithm in both the forward and backward pass of 3D tensor parallelism.
| Computation | Memory (parameters) | Memory (activations) | Communication (bandwidth) | Communication (latency) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/q^3)$ | $O(1/q^3)$ | $O(1/q^3)$ | $O(6(q-1)/q^3)$ | $O(6(q-1))$ |
## Usage
To enable 3D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
data=1,
pipeline=1,
tensor=dict(size=8, mode='3d'),
))
```
Then Colossal-AI will automatically apply 3D parallelism to all the layers from `colossalai.nn`.
Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0
class MLP(torch.nn.Module):
def __init__(self, dim: int = 256):
super().__init__()
intermediate_dim = dim * 4
self.dense_1 = col_nn.Linear(dim, intermediate_dim)
print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
self.activation = torch.nn.GELU()
self.dense_2 = col_nn.Linear(intermediate_dim, dim)
print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
self.dropout = col_nn.Dropout(0.1)
def forward(self, x):
x = self.dense_1(x)
print_rank_0(f'Output of the first linear layer: {x.shape}')
x = self.activation(x)
x = self.dense_2(x)
print_rank_0(f'Output of the second linear layer: {x.shape}')
x = self.dropout(x)
return x
```
Launch Colossal-AI on 8 GPUs and build the model
```python
parser = colossalai.get_default_parser()
colossalai.launch(config=CONFIG,
rank=args.rank,
world_size=args.world_size,
local_rank=args.local_rank,
host=args.host,
port=args.port)
m = MLP()
```
We will see the shapes of the partitioned parameters (e.g. weights) in the MLP model.
```shell
Weight of the first linear layer: torch.Size([128, 256])
Weight of the second linear layer: torch.Size([512, 64])
```
The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the partitioning of 3D parallelism, it becomes `[128, 256]` on each GPU.
Similarly, the second layer partitions the weight `[1024, 256]` into `[512, 64]`.
We can run the model with some random inputs.
```python
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.utils import get_current_device
x = torch.randn((16, 256), device=get_current_device())
# partition input
torch.distributed.broadcast(x, src=0)
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_3D_WEIGHT)]
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_3D_INPUT)]
x = torch.chunk(x, 2, dim=-1)[gpc.get_local_rank(ParallelMode.PARALLEL_3D_OUTPUT)]
print_rank_0(f'Input: {x.shape}')
x = m(x)
```
Then we can see the shapes of activation results.
```shell
Input: torch.Size([4, 128])
Output of the first linear layer: torch.Size([4, 512])
Output of the second linear layer: torch.Size([4, 128])
```
The activation tensors in 3D parallelism are all split by $q^2$ in the row and $q$ in the column.
E.g. the output of the first linear layer has the shape `[4, 512]`, while the output of the second layer is `[4, 128]`.
Note that although the results of 3D parallelism have the same shapes as those of 2.5D parallelism for the weights here, the content of each partition is different.

View File

@ -0,0 +1,45 @@
# Gradient Accumulation
Author: Shenggui Li, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
**Example Code**
- [ColossalAI-Examples Gradient Accumulation](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
## Introduction
Gradient accumulation is a common way to enlarge your batch size for training.
When training large-scale models, memory can easily become the bottleneck and the batch size can be very small (e.g. 2),
leading to unsatisfactory convergence. Gradient accumulation works by adding up the gradients calculated over multiple iterations
and only updating the parameters at the preset iteration interval, as the sketch below illustrates.
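Conceptually, gradient accumulation in plain PyTorch looks like the following sketch (the stand-in model, optimizer and data are hypothetical placeholders); the next section shows how to get the same effect in Colossal-AI with a single config entry:
```python
import torch

# tiny stand-ins so that the sketch is runnable; replace them with your real components
model = torch.nn.Linear(10, 2)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = [(torch.randn(2, 10), torch.randint(0, 2, (2,))) for _ in range(8)]

accumulation_steps = 4
optimizer.zero_grad()
for step, (img, label) in enumerate(data):
    loss = criterion(model(img), label) / accumulation_steps  # scale so the accumulated gradient matches a large batch
    loss.backward()                                           # gradients add up in param.grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                                      # parameters are updated only every `accumulation_steps` iterations
        optimizer.zero_grad()
```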
## Usage
It is simple to use gradient accumulation in Colossal-AI. Just add this following configuration into your config file.
The integer represents the number of iterations to accumulate gradients.
```python
gradient_accumulation = <int>
```
## Hands-on Practice
We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
to demonstrate gradient accumulation. In this example, we set the gradient accumulation size to be 4. You can run the script using this command:
```shell
python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py
```
You will see output similar to the text below. This shows that the gradients are indeed accumulated, as the parameters are not updated
in the first 3 steps but only in the last step.
```text
iteration 0, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 1, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 2, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 3, first 10 elements of param: tensor([-0.0141, 0.0464, 0.0507, 0.0321, 0.0356, -0.0150, 0.0172, -0.0118, 0.0222, 0.0473], device='cuda:0', grad_fn=<SliceBackward0>)
```

View File

@ -0,0 +1,62 @@
# Gradient Clipping
Author: Boxiang Wang, Haichen Huang, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
**Example Code**
- [ColossalAI-Examples Gradient Clipping](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_clipping)
**Related Paper**
- [On the difficulty of training Recurrent Neural Networks](https://arxiv.org/abs/1211.5063)
## Introduction
To speed up training and reach a better optimum, more and more learning rate schedulers have been proposed to control the descent pace during training. However, well-behaved descent also requires the gradient vector to stay within a reasonable range at every step. Gradient clipping, a technique that rescales the gradient vector so that its norm does not exceed a preset threshold, has therefore become indispensable for those who want better performance from their models.
You do not have to worry about implementing gradient clipping when using Colossal-AI; we support gradient
clipping in a powerful and convenient way. All you need is an additional line in your configuration file.
## Why you should use gradient clipping provided by Colossal-AI
The reason why we do not recommend users to implement gradient clipping by themselves is that naive gradient clipping
may fail when tensor parallelism, pipeline parallelism or MoE is applied.
According to the illustration below, each GPU only owns a portion of the parameters of the weight in a linear layer.
To get the correct norm of the gradient vector of the weight of the linear layer, the norms of the gradient shards on every GPU
should be summed together.
To make things more complicated, the bias is distributed differently from the weight,
so a different communication group is needed for its sum operation.
(PS: this situation corresponds to an old version of 2D parallelism and the implementation in the code is not the same,
but it is a good example of how difficult it is to unify all communication in gradient clipping.)
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/KXiJPHt3Dum82cA.png"/>
<figcaption>Layout of parameters</figcaption>
</figure>
Do not worry about it, since Colossal-AI has handled it for you.
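For intuition only, the sketch below shows the extra step that a naive implementation misses when a weight is sharded across a tensor-parallel group: the squared local norms must be all-reduced over that group before the clipping coefficient can be computed. The process-group handle `tp_group` is an assumption for illustration, and this is not how Colossal-AI implements it internally; in practice different parameters (e.g. the bias above) may even need different communication groups.
```python
import torch
import torch.distributed as dist


def clip_grad_norm_tensor_parallel(local_params, max_norm, tp_group):
    # squared L2 norm of the *local* gradient shards only
    device = local_params[0].grad.device
    local_sq_norm = torch.zeros(1, device=device)
    for p in local_params:
        local_sq_norm += p.grad.detach().float().norm() ** 2
    # sum the squared norms over the tensor-parallel ranks that together
    # hold the full weight, so every rank sees the true global norm
    dist.all_reduce(local_sq_norm, op=dist.ReduceOp.SUM, group=tp_group)
    total_norm = local_sq_norm.sqrt().item()
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1.0:
        for p in local_params:
            p.grad.detach().mul_(clip_coef)
    return total_norm
```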
### Usage
To use gradient clipping, you can simply add the gradient clipping norm to your configuration file.
```python
clip_grad_norm = 1.0
```
### Hands-On Practice
We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_clipping)
to demonstrate gradient clipping. In this example, we set the gradient clipping vector norm to be 1.0. You can run the script using this command:
```shell
python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 train_with_engine.py
```

View File

@ -0,0 +1,63 @@
# Gradient Handler
Author: Shenggui Li, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
**Example Code**
- [ColossalAI-Examples Gradient Handler](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)
## Introduction
In distributed training, gradient synchronization is required at the end of each iteration. This is important because we
need to make sure the parameters are updated with the same gradients in different machines so that the resulting parameters
are the same. This is often seen in data parallel as the model is replicated across data parallel ranks.
In Colossal-AI, we provide an interface for users to customize how they want to handle the synchronization. This brings
flexibility in cases such as implementing a new parallelism method.
When gradient handlers are used, PyTorch `DistributedDataParallel` will not be used as it will synchronize automatically.
## Customize Your Gradient Handlers
To implement a customized gradient handler, you need to follow these steps.
1. inherit `BaseGradientHandler` in Colossal-AI.
2. register the gradient handler into the `GRADIENT_HANDLER` registry.
3. implement the `handle_gradient` method.
```python
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine.gradient_handler import BaseGradientHandler
@GRADIENT_HANDLER.register_module
class MyGradientHandler(BaseGradientHandler):

    def handle_gradient(self):
        # synchronize gradients here, e.g. all-reduce them across
        # the process groups required by your parallelism method
        do_something()
```
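As a concrete illustration (not the built-in `DataParallelGradientHandler`, and assuming the handler can access the wrapped model as `self._model`, as the built-in handlers do), a handler that simply averages gradients over all ranks might look like this:
```python
import torch.distributed as dist

from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine.gradient_handler import BaseGradientHandler


@GRADIENT_HANDLER.register_module
class AllReduceGradientHandler(BaseGradientHandler):

    def handle_gradient(self):
        # sum every gradient over the default process group,
        # then divide to get the average across ranks
        world_size = dist.get_world_size()
        for param in self._model.parameters():   # self._model is assumed
            if param.grad is not None:
                dist.all_reduce(param.grad)
                param.grad.div_(world_size)
```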
## Usage
To use a gradient handler, you need to specify your gradient handler in the config file. The gradient handler
will be automatically built and attached to the engine.
```python
gradient_handler = [dict(type='MyGradientHandler')]
```
### Hands-On Practice
We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)
to demonstrate the use of gradient handler. In this example, we used `DataParallelGradientHandler` instead of PyTorch
`DistributedDataParallel` for data parallel training.
```shell
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py
```

View File

@ -0,0 +1,367 @@
# Auto Mixed Precision Training
Author: Chuanrui Wang, Shenggui Li, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
**Example Code**
- [ColossalAI-Examples AMP](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp)
**Related Paper**
- [Accelerating Scientific Computations with Mixed Precision Algorithms](https://arxiv.org/abs/0808.2794)
## Introduction
AMP stands for automatic mixed precision training.
In Colossal-AI, we have incorporated different implementations of mixed precision training:
1. torch.cuda.amp
2. apex.amp
3. naive amp
| Colossal-AI | support tensor parallel | support pipeline parallel | fp16 extent |
| ----------- | ----------------------- | ------------------------- | ----------- |
| AMP_TYPE.TORCH | ✅ | ❌ | Model parameters, activation, gradients are downcast to fp16 during forward and backward propagation |
| AMP_TYPE.APEX | ❌ | ❌ | More fine-grained, we can choose opt_level O0, O1, O2, O3 |
| AMP_TYPE.NAIVE | ✅ | ✅ | Model parameters, forward and backward operations are all downcast to fp16 |
The first two rely on the original implementation of PyTorch (version 1.6 and above) and NVIDIA Apex.
The last method is similar to Apex O2 level.
Among these methods, Apex AMP is not compatible with tensor parallelism.
This is because tensors are split across devices in tensor parallelism, so communication among different processes is required to check if inf or nan occurs in the whole model's weights.
We modified the torch AMP implementation so that it is now compatible with tensor parallelism.
> ❌️ fp16 and zero configuration are not compatible
>
> ⚠️ Pipeline only support naive AMP currently
We recommend using torch AMP as it generally gives better accuracy than naive AMP if pipeline parallelism is not used.
## Table of Contents
In this tutorial we will cover:
1. AMP introduction
2. AMP in Colossal-AI
3. Hands-on Practice
## AMP Introduction
Automatic Mixed Precision training is a mixture of FP16 and FP32 training.
Half-precision floating point format (FP16) has lower arithmetic complexity and higher compute efficiency.
Besides, fp16 requires half of the storage needed by fp32 and saves memory & network bandwidth, which makes more memory
available for larger batch sizes and model sizes.
However, there are other operations, like reductions, which require the dynamic range of fp32 to avoid numeric overflow/underflow. That's the reason why we introduce automatic mixed precision, attempting to match each operation to its appropriate data type, which can reduce the memory footprint and augment training efficiency.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/URzLJ3MPeDQbtck.png"/>
<figcaption>Illustration of an ordinary AMP (figure from <a href="https://arxiv.org/abs/2108.05818">PatrickStar paper</a>)</figcaption>
</figure>
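To make the idea concrete, the snippet below is roughly what mixed precision looks like in plain PyTorch with `torch.cuda.amp` (shown only for illustration; `model`, `optimizer`, `criterion` and `dataloader` are assumed to exist). The next section shows how Colossal-AI lets you enable the same behaviour purely through configuration.
```python
import torch

scaler = torch.cuda.amp.GradScaler()

for data, label in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # run ops in fp16 where it is safe
        output = model(data)
        loss = criterion(output, label)
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)                # unscale gradients, then update
    scaler.update()                       # adjust the loss scale for the next step
```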
## AMP in Colossal-AI
We support three AMP training methods and allow users to train with AMP without changing their code. You can simply add the `fp16`
configuration to your configuration file to use AMP.
```python
from colossalai.amp import AMP_TYPE
# use Torch AMP
fp16=dict(
mode = AMP_TYPE.TORCH
)
# use naive AMP
fp16=dict(
mode = AMP_TYPE.NAIVE
)
# use NVIDIA Apex AMP
fp16=dict(
mode = AMP_TYPE.APEX
)
```
> These are the minimum configurations; the full configurations are described in the sections below.
### AMP Modularity
AMP module is designed to be completely modular and can be used independently.
If you wish to only use AMP in your code base without `colossalai.initialize`,
you can use `colossalai.amp.convert_to_amp`.
```python
from colossalai.amp import AMP_TYPE
# example of using torch amp
model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
optimizer,
criterion,
AMP_TYPE.TORCH)
```
### Torch AMP Configuration
```python
from colossalai.amp import AMP_TYPE
fp16=dict(
mode=AMP_TYPE.TORCH,
# below are default values for grad scaler
init_scale=2.**16,
growth_factor=2.0,
backoff_factor=0.5,
growth_interval=2000,
enabled=True
)
```
With optional arguments:
- init_scale(float, optional, default=2.**16): Initial scale factor
- growth_factor(float, optional, default=2.0): Factor by which the scale is multiplied during `update` if no inf/NaN gradients occur for ``growth_interval`` consecutive iterations.
- backoff_factor(float, optional, default=0.5): Factor by which the scale is multiplied during `update` if inf/NaN gradients occur in an iteration.
- growth_interval(int, optional, default=2000): Number of consecutive iterations without inf/NaN gradients that must occur for the scale to be multiplied by ``growth_factor``.
- enabled(bool, optional, default=True): If ``False``, disables gradient scaling. `step` simply invokes the underlying ``optimizer.step()``, and other methods become no-ops.
### Apex AMP Configuration
For this mode, we rely on the Apex implementation for mixed precision training.
We support this plugin because it allows for finer control on the granularity of mixed precision.
For example, O2 level (optimization level 2) will keep batch normalization in fp32.
If you look for more details, please refer to [Apex Documentation](https://nvidia.github.io/apex/).
```python
from colossalai.amp import AMP_TYPE
fp16 = dict(
mode=AMP_TYPE.APEX,
# below are the default values
enabled=True,
opt_level='O1',
cast_model_type=None,
patch_torch_functions=None,
keep_batchnorm_fp32=None,
master_weights=None,
loss_scale=None,
cast_model_outputs=None,
num_losses=1,
verbosity=1,
min_loss_scale=None,
max_loss_scale=16777216.0
)
```
Parameters:
- enabled(bool, optional, default=True): If False, renders all AMP calls no-ops, so your script should run as if Amp were not present.
- opt_level(str, optional, default="O1"): Pure or mixed precision optimization level.
Accepted values are “O0”, “O1”, “O2”, and “O3”, explained in detail in the Apex AMP documentation.
- num_losses(int, optional, default=1): Option to tell AMP in advance how many losses/backward passes you plan to use.
When used in conjunction with the loss_id argument to `amp.scale_loss`, enables Amp to use a different loss scale per
loss/backward pass, which can improve stability. If num_losses is left to 1, Amp will still support multiple
losses/backward passes, but use a single global loss scale for all of them.
- verbosity(int, default=1): Set to 0 to suppress Amp-related output.
- min_loss_scale(float, default=None): Sets a floor for the loss scale values that can be chosen by dynamic loss scaling.
The default value of None means that no floor is imposed. If dynamic loss scaling is not used, min_loss_scale is ignored.
- max_loss_scale(float, default=2.**24 ): Sets a ceiling for the loss scale values that can be chosen by dynamic loss
scaling. If dynamic loss scaling is not used, max_loss_scale is ignored.
Currently, the under-the-hood properties that govern pure or mixed precision training are the following:
cast_model_type, patch_torch_functions, keep_batchnorm_fp32, master_weights, loss_scale.
They are optional properties that can override the defaults once opt_level is determined:
- cast_model_type: Casts your model's parameters and buffers to the desired type.
- patch_torch_functions: Patch all Torch functions and Tensor methods to perform Tensor Core-friendly ops like GEMMs and convolutions in FP16, and any ops that benefit from FP32 precision in FP32.
- keep_batchnorm_fp32: To enhance precision and enable cuDNN batchnorm (which improves performance), it's often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.
- master_weights: Maintain FP32 master weights to accompany any FP16 model weights. FP32 master weights are stepped by the optimizer to enhance precision and capture small gradients.
- loss_scale: If loss_scale is a float value, use this value as the static (fixed) loss scale. If loss_scale is the string "dynamic", adaptively adjust the loss scale over time. Dynamic loss scale adjustments are performed by Amp automatically.
### Naive AMP Configuration
In Naive AMP mode, we achieved mixed precision training while maintaining compatibility with complex tensor and pipeline parallelism.
This AMP mode will cast all operations into fp16.
The following code block shows the `config.py` file for this mode.
```python
from colossalai.amp import AMP_TYPE
fp16 = dict(
mode=AMP_TYPE.NAIVE,
# below are the default values
log_num_zeros_in_grad=False,
initial_scale=2 ** 32,
min_scale=1,
growth_factor=2,
backoff_factor=0.5,
growth_interval=1000,
hysteresis=2
)
```
The default parameters of Naive AMP:
- log_num_zeros_in_grad(bool): whether to log the number of zeros in the gradients.
- initial_scale(int): initial scale of gradient scaler
- growth_factor(int): the growth rate of loss scale
- backoff_factor(float): the decrease rate of loss scale
- hysteresis(int): delay shift in dynamic loss scaling
- max_scale(int): maximum loss scale allowed
- verbose(bool): if set to `True`, will print debug info
When using `colossalai.initialize`, you are required to first instantiate a model, an optimizer and a criterion.
The output model is converted to an AMP model with smaller memory consumption.
If your input model is already too large to fit in a GPU, please instantiate your model weights with `dtype=torch.float16`.
Otherwise, try smaller models or check out more parallelization training techniques!
## Hands-on Practice
We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp) which demonstrates
the use of AMP with Colossal-AI. In this practice, we will use Torch AMP as an example, but do note that config files are provided for all AMP modes.
### Step 1. Create a config file
Create a `config.py` and add the `fp16` configuration.
```python
# in config.py
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 128
DROP_RATE = 0.1
NUM_EPOCHS = 300
fp16 = dict(
mode=AMP_TYPE.TORCH,
)
clip_grad_norm = 1.0
```
### Step 2. Import libraries in train_with_engine.py
Create a `train_with_engine.py` and import the necessary dependencies. Remember to install `scipy` and `timm` by running
`pip install timm scipy`.
```python
import os
import colossalai
import torch
from pathlib import Path
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.utils import get_dataloader
from colossalai.trainer import Trainer, hooks
from colossalai.nn.lr_scheduler import LinearWarmupLR
from timm.models import vit_base_patch16_224
from torchvision import datasets, transforms
```
### Step 3. Initialize Distributed Environment
We then need to initialize the distributed environment. For demo purposes, we use `launch_from_torch`. You can refer to [Launch Colossal-AI](../basics/launch_colossalai.md)
for other initialization methods.
```python
# initialize distributed setting
parser = colossalai.get_default_parser()
args = parser.parse_args()
# launch from torch
colossalai.launch_from_torch(config=args.config)
```
### Step 4. Create training components
Build your model, optimizer, loss function, lr scheduler and dataloaders. Note that the root path of the dataset is
obtained from the environment variable `DATA`. You may `export DATA=/path/to/data` or change `Path(os.environ['DATA'])`
to a path on your machine. Data will be automatically downloaded to the root path.
```python
# build model
model = vit_base_patch16_224(drop_rate=0.1)
# build dataloader
train_dataset = datasets.Caltech101(
root=Path(os.environ['DATA']),
download=True,
transform=transforms.Compose([
transforms.Resize(256),
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
Gray2RGB(),
transforms.Normalize([0.5, 0.5, 0.5],
[0.5, 0.5, 0.5])
]))
train_dataloader = get_dataloader(dataset=train_dataset,
shuffle=True,
batch_size=gpc.config.BATCH_SIZE,
num_workers=1,
pin_memory=True,
)
# build optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=0.1)
# build loss
criterion = torch.nn.CrossEntropyLoss()
# lr_scheduler
lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=gpc.config.NUM_EPOCHS)
```
### Step 5. Inject AMP Feature
Call `colossalai.initialize` to convert the training components to be running with FP16.
```python
engine, train_dataloader, _, _ = colossalai.initialize(
model, optimizer, criterion, train_dataloader,
)
```
### Step 6. Train with Engine
Use the engine in a normal training loop.
```python
engine.train()
for epoch in range(gpc.config.NUM_EPOCHS):
    for img, label in train_dataloader:
img = img.cuda()
label = label.cuda()
engine.zero_grad()
output = engine(img)
loss = engine.criterion(output, label)
engine.backward(loss)
engine.step()
lr_scheduler.step()
```
### Step 7. Invoke Training Scripts
Use the following command to start the training scripts. You can change `--nproc_per_node` to use a different number of GPUs.
```shell
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py --config config/config_AMP_torch.py
```

View File

@ -0,0 +1,42 @@
# NVMe offload
Author: Hongxin Liu
**Prerequisite:**
- [Zero Redundancy Optimizer with chunk-based memory management](../features/zero_with_chunk.md)
## Introduction
If a model has `N` parameters, when using Adam, it has `8N` optimizer states. For billion-scale models, optimizer states take at least 32 GB of memory. GPU memory limits the model scale we can train, which is called the GPU memory wall. If we offload optimizer states to the disk, we can break through the GPU memory wall.
We implement a user-friendly and efficient asynchronous Tensor I/O library: [TensorNVMe](https://github.com/hpcaitech/TensorNVMe). With this library, we can simply implement NVMe offload.
> This library is compatible with all kinds of disks (HDD, SATA SSD, and NVMe SSD). As the I/O bandwidth of HDD or SATA SSD is low, it's recommended to use this library only on NVMe disks.
When optimizing a parameter, we can divide the optimization process into three stages: read, compute and offload. We perform the optimization process in a pipelined fashion, which can overlap computation and I/O.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/08/16/CvRnowrsNyB4hza.jpg"/>
<figcaption>Optimization process</figcaption>
</figure>
## Usage
First, please make sure you installed [TensorNVMe](https://github.com/hpcaitech/TensorNVMe):
```shell
pip install packaging
pip install tensornvme
```
We implement NVMe offload of optimizer states for Adam ([CPUAdam](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.nn.optimizer.cpu_adam.html) and [HybridAdam](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.nn.optimizer.hybrid_adam.html)).
```python
from colossalai.nn.optimizer import CPUAdam, HybridAdam
optimizer = HybridAdam(model.parameters(), lr=1e-3, nvme_offload_fraction=1.0, nvme_offload_dir='./')
```
`nvme_offload_fraction` is the fraction of optimizer states to be offloaded to NVMe. `nvme_offload_dir` is the directory to save NVMe offload files. If `nvme_offload_dir` is `None`, a random temporary directory will be used.
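For example, to offload only half of the optimizer states and keep the offload files in a dedicated directory (the values below are illustrative):
```python
from colossalai.nn.optimizer import HybridAdam

# offload 50% of the optimizer states to files under ./offload,
# keeping the other half in CPU memory
optimizer = HybridAdam(model.parameters(),
                       lr=1e-3,
                       nvme_offload_fraction=0.5,
                       nvme_offload_dir='./offload')
```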
It's compatible with all parallel methods in ColossalAI.

View File

@ -0,0 +1,159 @@
# Pipeline Parallel
Author: Guangyang Lu, Hongxin Liu, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
**Example Code**
- [ColossalAI-Examples ResNet with pipeline](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/pipeline_parallel)
**Related Paper**
- [Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training](https://arxiv.org/abs/2110.14883)
- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473)
- [GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://arxiv.org/abs/1811.06965)
## Quick introduction
In this tutorial, you will learn how to use pipeline parallel. In Colossal-AI, we use the 1F1B pipeline, introduced by NVIDIA. Since ViT and ImageNet are too large to use here, we use ResNet and CIFAR-10 as an example.
## Table of Contents
In this tutorial we will cover:
1. Introduction of 1F1B pipeline.
2. Usage of non-interleaved and interleaved schedule.
3. Training ResNet with pipeline.
## Introduction of 1F1B pipeline
First of all, we will introduce GPipe to help you understand 1F1B better.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/OAucPF6mWYynUtV.png"/>
<figcaption>Figure1: GPipe. This figure is from <a href="https://arxiv.org/pdf/2104.04473.pdf">Megatron-LM</a> paper.</figcaption>
</figure>
As you can see, for GPipe, the backward passes are executed only after the forward passes of all microbatches in a batch have finished.
In general, 1F1B (one forward pass followed by one backward pass) is more efficient than GPipe (in memory, or in both memory and time). There are two schedules of the 1F1B pipeline, the non-interleaved and the interleaved. The figures are shown below.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/iJrVkp2HLcahjsT.png"/>
<figcaption>Figure2: This figure is from <a href="https://arxiv.org/pdf/2104.04473.pdf">Megatron-LM</a> paper. The top part shows the default non-interleaved schedule. And the bottom part shows the interleaved schedule.</figcaption>
</figure>
### Non-interleaved Schedule
The non-interleaved schedule can be divided into three stages. The first stage is the warm-up stage, where workers perform differing numbers of forward passes. At the following stage, workers perform one forward pass followed by one backward pass. Workers will finish backward passes at the last stage.
This mode is more memory-efficient than GPipe. However, it takes about the same amount of time as GPipe to complete a round of passes.
### Interleaved Schedule
This schedule requires **the number of microbatches to be an integer multiple of the number of pipeline stages**.
In this schedule, each device can perform computation for multiple subsets of layers (called model chunks) instead of a single contiguous set of layers. E.g., before, device 1 had layers 1-4 and device 2 had layers 5-8, and so on; now device 1 has layers 1, 2, 9, 10 and device 2 has layers 3, 4, 11, 12, and so on. With this scheme, each device in the pipeline is assigned multiple pipeline stages and each pipeline stage has less computation.
This mode is both memory-efficient and time-efficient.
## Usage of non-interleaved and interleaved schedule
In Colossal-AI, we provide both the non-interleaved (as `PipelineSchedule`) and the interleaved schedule (as `InterleavedPipelineSchedule`).
You just need to set `NUM_MICRO_BATCHES` in the config file, and additionally set `NUM_CHUNKS` if you want to use the interleaved pipeline schedule. If you know for sure the shape of each pipeline stage's output tensor and the shapes are all the same, you can set `TENSOR_SHAPE` in the config file to further reduce communication. Otherwise, you can just ignore `tensor_shape`, and the shape will be exchanged over pipeline stages automatically. An appropriate schedule will then be generated for you, as shown in the sketch below.
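As a sketch only (the concrete numbers below are illustrative), a config enabling the interleaved schedule on 2 pipeline stages could look like:
```python
# in config.py
NUM_MICRO_BATCHES = 4
NUM_CHUNKS = 2    # NUM_CHUNKS > 1 selects InterleavedPipelineSchedule

# optional: set TENSOR_SHAPE only if every stage outputs the same known shape,
# e.g. TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, HIDDEN_SIZE)

parallel = dict(
    pipeline=2,
)
```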
## Training ResNet with pipeline
Let's build the `ResNet` model first with Colossal PipelinableContext:
```python
import os
from typing import Callable, List, Optional, Type, Union
import torch
import torch.nn as nn
import colossalai
import colossalai.nn as col_nn
from colossalai.core import global_context as gpc
from colossalai.logging import disable_existing_loggers, get_dist_logger
from colossalai.trainer import Trainer, hooks
from colossalai.utils import MultiTimer, get_dataloader
from colossalai.context import ParallelMode
from colossalai.pipeline.pipelinable import PipelinableContext
from titans.dataloader.cifar10 import build_cifar
from torchvision.models import resnet50
from torchvision.models.resnet import BasicBlock, Bottleneck, conv1x1
# Define some config
BATCH_SIZE = 64
NUM_EPOCHS = 2
NUM_CHUNKS = 1
CONFIG = dict(NUM_MICRO_BATCHES=4, parallel=dict(pipeline=2))
# Train
disable_existing_loggers()
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch_from_torch(backend=args.backend, config=CONFIG)
logger = get_dist_logger()
pipelinable = PipelinableContext()
# build model
with pipelinable:
model = resnet50()
```
Define an execution sequence.
```python
exec_seq = [
'conv1', 'bn1', 'relu', 'maxpool', 'layer1', 'layer2', 'layer3', 'layer4', 'avgpool',
(lambda x: torch.flatten(x, 1), "behind"), 'fc'
]
pipelinable.to_layer_list(exec_seq)
```
Partition the model into pipeline.
```python
model = pipelinable.partition(NUM_CHUNKS, gpc.pipeline_parallel_size, gpc.get_local_rank(ParallelMode.PIPELINE))
```
In this tutorial, we use `Trainer` to train `ResNet`:
```python
# build criterion
criterion = nn.CrossEntropyLoss()
# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# build dataloader
root = os.environ.get('DATA', './data')
train_dataloader, test_dataloader = build_cifar(BATCH_SIZE, root, padding=4, crop=32, resize=32)
lr_scheduler = col_nn.lr_scheduler.LinearWarmupLR(optimizer, NUM_EPOCHS, warmup_steps=1)
engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model, optimizer, criterion,
train_dataloader, test_dataloader,
lr_scheduler)
timer = MultiTimer()
trainer = Trainer(engine=engine, timer=timer, logger=logger)
hook_list = [
hooks.LossHook(),
hooks.AccuracyHook(col_nn.metric.Accuracy()),
hooks.LogMetricByEpochHook(logger),
hooks.LRSchedulerHook(lr_scheduler, by_epoch=True)
]
trainer.fit(train_dataloader=train_dataloader,
epochs=NUM_EPOCHS,
test_dataloader=test_dataloader,
test_interval=1,
hooks=hook_list,
display_progress=True)
```
We use `2` pipeline stages and the batch will be split into `4` micro-batches.

View File

@ -0,0 +1,262 @@
# Zero Redundancy Optimizer with chunk-based memory management
Author: [Hongxiu Liu](https://github.com/ver217), [Jiarui Fang](https://github.com/feifeibear), [Zijian Ye](https://github.com/ZijianYY)
**Prerequisite:**
- [Define Your Configuration](../basics/define_your_config.md)
**Example Code**
- [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt)
**Related Paper**
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
- [PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management](https://arxiv.org/abs/2108.05818)
## Introduction
The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning three
model states (optimizer states, gradients, and parameters) instead of replicating them.
By doing so, memory efficiency is boosted drastically compared to classic data parallelism, while the computational granularity
and communication efficiency is retained.
1. **Shard Optimizer States**: The optimizer states (e.g., for [Adam optimizer](https://arxiv.org/abs/1412.6980), 32-bit weights,
and the first and second momentum estimates) are partitioned across the processes, so that each process updates only its partition.
2. **Shard Gradient**: After reduction inside the data parallel process group, gradient tensors are also partitioned such that each process only stores the gradients corresponding to its partition of the optimizer states. Note that Colossal-AI converts gradients into fp32 format to participate in parameter updating.
3. **Shard Parameter**: The 16-bit model parameters are partitioned across the processes of a data parallel group.
4. **[Gemini](../advanced_tutorials/meet_gemini.md)**: Dynamic heterogeneous memory space manager for parameters, gradients and optimizer states.
Besides, this article will introduce the Zero Redundancy Optimizer with chunk-based memory management.
When using ZeRO, we distribute the model by sharding the parameters. The advantage of this method is that the memory of each node is load balanced. But this approach has two significant disadvantages. First, during communication, a temporary memory buffer needs to be allocated and released afterwards, which leads to memory fragmentation. Second, using the tensor as the granularity of communication leaves the network bandwidth underutilized. Generally, the longer the transmitted message, the higher the bandwidth utilization.
Using the Chunk mechanism introduced in ColossalAI v0.1.8, we can improve the efficiency of ZeRO. We store a continuous set of parameters in initialization order into a Chunk (a chunk is a continuous memory space), and each Chunk has the same size. Organizing memory in chunks can lead to efficient use of network bandwidth between PCI-e and GPU-GPU, reduce the number of communications, and avoid potential memory fragmentation.
Before v0.1.8, ZeRO had a high communication cost for parameter communication. If a parameter was used multiple times in several consecutive operators, there would be repeated communication operations, which severely hurt efficiency. This situation is very common when using the gradient checkpoint technique, where the forward propagation of the parameter is recomputed during backward propagation.
Taking GPT as an example, its Checkpoint will be applied to each GPT Block, and each GPT Block contains a Self-Attention layer and an MLP layer. During the backward pass, the forward of the Self-Attention layer and the MLP layer will be computed in turn, and then the backward of the MLP layer and the Self-Attention layer will be computed in turn.
In addition, due to the communication and memory movement of small tensors, the bandwidth of NVLink and PCI-E cannot be fully utilized, and each communication and memory movement carries kernel launch overhead. After using chunks, multiple small tensor communications and memory movements can be merged into one large tensor communication and memory movement, which not only improves bandwidth utilization but also reduces kernel launch overhead.
We also provide a lightweight chunk search mechanism to help users automatically find the chunk size with the smallest memory fragmentation.
## Usage
### GeminiDDP
We will use `GeminiDDP` to enable ZeRO with chunk-based memory management. This is our new torch.Module wrapper which uses ZeRO-DP and Gemini. ZeRO is for parallelism and Gemini is for memory management.
Also, make sure that your model is initialized within the context of ColoInitContext.
```python
with ColoInitContext(device='cpu', default_dist_spec=default_dist_spec, default_pg=default_pg):
model = gpt2_medium(checkpoint=True)
```
Define the model parameters as follows:
```python
chunk_manager = init_chunk_manager(model=module,
init_device=device,
hidden_dim=hidden_dim,
search_range_mb=search_range_mb,
min_chunk_size_mb=min_chunk_size_mb)
gemini_manager = GeminiManager(placement_policy, chunk_manager)
```
`hidden_dim` is the hidden dimension of the DNN. Users can provide this argument to speed up searching. If users do not know this argument before training, it is OK and a default value of 1024 will be used. `min_chunk_size_mb` is the minimum chunk size in megabytes. If the aggregate size of parameters is still smaller than the minimum chunk size, all parameters will be compacted into one small chunk.
Initialization of the optimizer.
```python
optimizer = GeminiAdamOptimizer(model, lr=1e-3, initial_scale=2**5)
```
Training
```python
optimizer.zero_grad()
outputs = model(input_ids, attn_mask)
loss = criterion(outputs, input_ids)
optimizer.backward(loss)
optimizer.step()
```
> ⚠️ Note: Please do not use `loss.backward()`, the standard way of writing is `optimizer.backward(loss)`.
### Train GPT
In this example, we use Hugging Face `transformers`. You have to install `transformers` before running this example. We will take `GPT2 Medium` as an example here.
For simplicity, we just use randomly generated data here.
First, we only need to import `GPT2LMHeadModel` from Hugging Face `transformers` to define our model; users do not need to define or modify the model themselves, so they can use it more conveniently.
```python
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel


class GPTLMModel(nn.Module):
def __init__(self,
hidden_size=768,
num_layers=12,
num_attention_heads=12,
max_seq_len=1024,
vocab_size=50257,
checkpoint=False):
super().__init__()
self.checkpoint = checkpoint
self.model = GPT2LMHeadModel(
GPT2Config(n_embd=hidden_size,
n_layer=num_layers,
n_head=num_attention_heads,
n_positions=max_seq_len,
n_ctx=max_seq_len,
vocab_size=vocab_size))
if checkpoint:
self.model.gradient_checkpointing_enable()
def forward(self, input_ids, attention_mask):
return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
def gpt2_medium(checkpoint=False):
return GPTLMModel(hidden_size=1024, num_layers=24, num_attention_heads=16, checkpoint=checkpoint)
```
Define our loss function:
```python
class GPTLMLoss(nn.Module):
def __init__(self):
super().__init__()
self.loss_fn = nn.CrossEntropyLoss()
def forward(self, logits, labels):
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
return self.loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
```
Define tensor parallel and parameter sharding strategies for tensor parallelism:
```python
def tensor_parallelize(model: torch.nn.Module, pg: ProcessGroup):
for mn, module in model.named_modules():
for pn, param in module.named_parameters(recurse=False):
if hasattr(param, 'visited'):
continue
param.set_dist_spec(ReplicaSpec())
if 'mlp.c_fc' in mn:
if 'weight' in pn or 'bias' in pn:
split_param_col_tp1d(param, pg)
param.compute_spec.set_output_replicate(False)
else:
param.set_dist_spec(ReplicaSpec())
elif 'mlp.c_proj' in mn:
if 'weight' in pn:
split_param_row_tp1d(param, pg)
else:
param.set_dist_spec(ReplicaSpec())
elif 'wte' in mn or 'wpe' in mn:
split_param_col_tp1d(param, pg)
elif 'c_attn' in mn or 'c_proj' in mn:
split_param_col_tp1d(param, pg)
else:
param.set_dist_spec(ReplicaSpec())
param.visited = True
def split_param_single_dim_tp1d(dim: int, param: ColoParameter, pg: ProcessGroup):
spec = (ShardSpec([dim], [pg.tp_world_size()]), ComputeSpec(ComputePattern.TP1D))
param.set_tensor_spec(*spec)
def split_param_row_tp1d(param: ColoParameter, pg: ProcessGroup):
split_param_single_dim_tp1d(0, param, pg)
def split_param_col_tp1d(param: ColoParameter, pg: ProcessGroup):
split_param_single_dim_tp1d(-1, param, pg)
```
Define a model which uses Gemini + ZeRO DDP:
```python
def gemini_zero_dpp(model: torch.nn.Module, pg: ProcessGroup, placement_policy: str = "auto"):
    cai_version = colossalai.__version__
    if version.parse(cai_version) > version.parse("0.1.10"):
        from colossalai.nn.parallel import GeminiDDP
        model = GeminiDDP(model,
                          device=get_current_device(),
                          placement_policy=placement_policy,
                          pin_memory=True,
                          search_range_mb=32)
    elif version.parse(cai_version) <= version.parse("0.1.10") and version.parse(cai_version) >= version.parse("0.1.9"):
        from colossalai.gemini import ChunkManager, GeminiManager
        chunk_size = ChunkManager.search_chunk_size(model, 64 * 1024**2, 32)
        # the chunk manager must be created before the Gemini manager that uses it
        chunk_manager = ChunkManager(chunk_size,
                                     pg,
                                     enable_distributed_storage=True,
                                     init_device=GeminiManager.get_default_device(placement_policy))
        gemini_manager = GeminiManager(placement_policy, chunk_manager)
        model = ZeroDDP(model, gemini_manager)
    else:
        raise NotImplementedError(f"CAI version {cai_version} is not supported")
    return model
```
As we pre-train GPT in this example, we just use a simple language model loss.
Write a function to get random inputs:
```python
def get_data(batch_size, seq_len, vocab_size):
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len), device=torch.cuda.current_device())
attention_mask = torch.ones_like(input_ids)
return input_ids, attention_mask
```
Finally, we can define our training loop:
```python
def main():
args = parse_args()
BATCH_SIZE = 8
SEQ_LEN = 1024
VOCAB_SIZE = 50257
NUM_STEPS = 10
colossalai.launch_from_torch(config={})
# build criterion
criterion = GPTLMLoss()
torch.manual_seed(123)
default_pg = ProcessGroup(tp_degree=args.tp_degree)
default_dist_spec = ShardSpec([-1], [args.tp_degree]) if args.shardinit else None
# build GPT model
with ColoInitContext(device='cpu', default_dist_spec=default_dist_spec, default_pg=default_pg):
model = gpt2_medium(checkpoint=True)
pg = default_pg
# Tensor Parallelism (TP)
tensor_parallelize(model, pg)
# Gemini + ZeRO DP, Note it must be used after TP
model = gemini_zero_dpp(model, pg, args.placement)
# build optimizer
optimizer = GeminiAdamOptimizer(model, lr=1e-3, initial_scale=2**5)
numel = sum([p.numel() for p in model.parameters()])
get_tflops_func = partial(get_tflops, numel, BATCH_SIZE, SEQ_LEN)
torch.cuda.synchronize()
model.train()
for n in range(NUM_STEPS):
# we just use randomly generated data here
input_ids, attn_mask = get_data(BATCH_SIZE, SEQ_LEN, VOCAB_SIZE)
optimizer.zero_grad()
outputs = model(input_ids, attn_mask)
loss = criterion(outputs, input_ids)
optimizer.backward(loss)
optimizer.step()
torch.cuda.synchronize()
```
> ⚠️ Note: If you want to use the Gemini module, please do not use the [Gradient Accumulation](../features/gradient_accumulation.md) we mentioned before.
The complete example can be found on [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt).

View File

@ -0,0 +1,37 @@
# Setup
## Download From PyPI
You can install Colossal-AI with
```shell
pip install colossalai
```
If you want to build PyTorch extensions during installation, you can use the command below. Otherwise, the PyTorch extensions will be built during runtime.
```shell
CUDA_EXT=1 pip install colossalai
```
## Download From Source
> The version of Colossal-AI will be in line with the main branch of the repository. Feel free to raise an issue if you encounter any problem. :)
```shell
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
# install colossalai
pip install .
```
If you want to install and enable CUDA kernel fusion (compulsory installation when using fused optimizer):
```shell
CUDA_EXT=1 pip install .
```

View File

@ -0,0 +1,19 @@
# Reading Roadmap
Colossal-AI provides a collection of parallel training components for you. We aim to support you with your development
of distributed deep learning models just like how you write single-GPU deep learning models. ColossalAI provides easy-to-use
APIs to help you kickstart your training process. To better understand how Colossal-AI works, we recommend you read this documentation
in the following order.
- If you are not familiar with distributed system or have never used Colossal-AI, you should first jump into the `Concepts`
section to get a sense of what we are trying to achieve. This section can provide you with some background knowledge on
distributed training as well.
- Next, you can follow the `basics` tutorials. This section will cover the details about how to use Colossal-AI.
- Afterwards, you can try out the features provided in Colossal-AI by reading the `features` section. We will provide a codebase for each tutorial. These tutorials will cover the
basic usage of Colossal-AI to realize simple functions such as data parallel and mixed precision training.
- Lastly, if you wish to apply more complicated techniques such as how to run hybrid parallel on GPT-3, the
`advanced tutorials` section is the place to go!
**We always welcome suggestions and discussions from the community, and we would be more than willing to help you if you
encounter any issue. You can raise an [issue](https://github.com/hpcaitech/ColossalAI/issues) here or create a discussion
topic in the [forum](https://github.com/hpcaitech/ColossalAI/discussions).**

View File

@ -0,0 +1,43 @@
# Quick Demo
Colossal-AI is an integrated large-scale deep learning system with efficient parallelization techniques. The system can
accelerate model training on distributed systems with multiple GPUs by applying parallelization techniques. The system
can also run on systems with only one GPU. Quick demos showing how to use Colossal-AI are given below.
## Single GPU
Colossal-AI can be used to train deep learning models on systems with only one GPU and achieve baseline
performances. We provided an example to [train ResNet on CIFAR10 dataset](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/resnet)
with only one GPU. You can find the example in [ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI-Examples).
Detailed instructions can be found in its `README.md`.
## Multiple GPUs
Colossal-AI can be used to train deep learning models on distributed systems with multiple GPUs and accelerate the
training process drastically by applying efficient parallelization techniques. We have several kinds of parallelism for you
to try out.
#### 1. data parallel
You can use the same [ResNet example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/resnet) as the
single-GPU demo above. By setting `--nproc_per_node` to be the number of GPUs you have on your machine, the example
is turned into a data parallel example.
#### 2. hybrid parallel
Hybrid parallel includes data, tensor, and pipeline parallelism. In Colossal-AI, we support different types of tensor
parallelism (i.e. 1D, 2D, 2.5D and 3D). You can switch between different tensor parallelism by simply changing the configuration
in the `config.py`. You can follow the [GPT example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/gpt).
Detailed instructions can be found in its `README.md`.
#### 3. MoE parallel
We provided [an example of WideNet](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/widenet) to demonstrate
MoE parallelism. WideNet uses mixture of experts (MoE) to achieve better performance. More details can be found in
[Tutorial: Integrate Mixture-of-Experts Into Your Model](../advanced_tutorials/integrate_mixture_of_experts_into_your_model.md)
#### 4. sequence parallel
Sequence parallel is designed to tackle memory efficiency and sequence length limit problems in NLP tasks. We provided
[an example of BERT](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/bert/sequene_parallel) in
[ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI-Examples). You can follow the `README.md` to execute the code.

View File

@ -0,0 +1,28 @@
# Setup
## Announcement
Our auto-parallel feature is an alpha version and is still under rapid development and iteration. We will keep improving its compatibility and stability. If you encounter any problem, please feel free to raise an issue.
## Requirements
We need some extra dependencies to support the auto-parallel feature. Please install them before using auto-parallel.
### Install PyTorch
We only support PyTorch 1.12 for now; other versions are not tested. We will support more versions in the future.
```bash
#conda
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch
#pip
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113
```
### Install pulp and coin-or-cbc
```bash
pip install pulp
conda install -c conda-forge coin-or-cbc
```

View File

@ -0,0 +1,43 @@
# Introduction
In recent years, the deployment of large-scale machine learning models has drawn more and more attention. However, existing distributed training solutions for large models all rely on users **manually and repeatedly trying configurations** and on the experience of system experts for deployment. This is very unfriendly to the vast majority of AI developers, who do not want to spend their time and energy studying distributed systems and trial-and-error configurations.
Colossal-AI's **Colossal-Auto** helps AI developers simplify the deployment of large-scale machine learning models. Compared with other existing solutions that require manually configuring complex parallel strategies and modifying models, Colossal-Auto only needs one extra line of code: given the cluster information and a single-machine training model, it provides distributed training capability, and it **natively supports popular AI model libraries such as Hugging Face and Timm**.
## Overview
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/auto_parallel/auto_parallel.png"/>
</figure>
## Usage
```python
# wrap the model using auto_engine
model = autoparallelize(model, meta_input_samples)
# normal training loop
...
```
## Graph Tracing
Colossal-Auto is **the first auto-parallel system based on the PyTorch framework that uses static graph analysis**. Since PyTorch is a dynamic-graph framework, obtaining its static execution plan has long been a research topic in the field of machine learning systems. Colossal-Auto uses ColoTracer, built on top of the torch.FX Tracer, to search for the optimal parallel strategy. The meta information of each tensor, such as tensor shape, dims and dtype, is derived and recorded during tracing. Therefore, Colossal-AI generalizes better across models, instead of relying on model names or manual modification to adapt to parallel strategies.
## Fine-grained Distributed Training Strategy Search
Colossal-AI's auto-parallel system searches for a strategy for each op with the goal of the fastest runtime under the constraint of the memory budget, and finally obtains the strategy for real training, including the sharding strategy of each tensor, the type of communication operators to be inserted between different computing nodes, whether to replace operators, etc. The tensor parallelism, data parallelism, and hybrid parallelism such as the column and row sharding used by NVIDIA in parallel systems like Megatron-LM are all subsets of the strategies that auto-parallel can search for. Besides these parallel modes that can be specified manually, Colossal-AI can assign a unique parallel mode to each op, so it is possible to find a better parallel strategy than manual sharding based on expert experience and trial-and-error.
## Distributed Tensor and Shape-Consistency System
Similar to PyTorch's recently released DTensor, Colossal-AI also uses a device mesh to abstract and manage the cluster. Specifically, Colossal-AI uses a sharding spec to annotate the distributed storage state of each tensor and uses a shape consistency manager to automatically convert the same tensor between different sharding specs. This greatly improves the generality and usability of Colossal-AI: with the shape consistency manager, tensors can be sharded freely without worrying that the output of an upstream op is stored in the cluster differently from the input of the downstream op.
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/auto_parallel/shape_consistency.png"/>
</figure>
Compared with PyTorch DTensor, Colossal-AI has the following advantages:
+ Colossal-AI's device mesh can profile the performance metrics of the cluster and estimate the time cost of different communication operators.
+ Colossal-AI's shape consistency manager searches greedily for conversion paths between sharding specs, instead of naively converting dimension by dimension. This finds more efficient conversion paths and thus reduces the communication overhead of sharding-spec conversion.
+ The all_to_all operation is included, which makes Colossal-AI more scalable and shows a great advantage when training on large-scale clusters.

View File

@ -0,0 +1,16 @@
# Quick Start
Colossal-AI provides an efficient and easy-to-use auto-parallel system that the industry urgently needs. Compared with other existing solutions that require manually configuring complex parallel strategies and modifying models, Colossal-AI only needs one extra line of code: given the cluster information and a single-machine training model, it provides distributed training capability. Quick-start examples of Colossal-Auto are shown below.
### 1. Basic usage
Colossal-Auto can be used to find a hybrid SPMD parallel strategy that includes data parallelism and tensor parallelism (such as 1D, 2D, sequence parallelism) for each operation. You can refer to the [GPT example](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt/experiments/auto_parallel).
Detailed instructions can be found in its `README.md`.
### 2. Combining with activation checkpoint
As an indispensable memory compression technique in large model training, Colossal-AI also provides automatic search for activation checkpointing. Compared with most technical solutions that target maximum memory compression, Colossal-AI's search goal is to find the fastest activation checkpoint scheme within the memory budget. Meanwhile, to avoid the explosion of search time caused by modeling the activation checkpoint search together in the SPMD solver, Colossal-AI adopts a two-stage search design, so an effective and feasible distributed training scheme can be found within a reasonable amount of time. You can refer to the [ResNet example](TBA).
Detailed instructions can be found in its `README.md`.
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/auto_parallel/auto_ckpt.jpg"/>
</figure>

View File

@ -0,0 +1,112 @@
# Add Your Own Parallel Mode
Author: Shenggui Li, Yongbin Li
**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Configure Parallelization](../basics/configure_parallelization.md)
## Introduction
To enable researchers and engineers to extend our system to other novel large-scale distributed training algorithms with less effort, we have decoupled the various components in the training lifecycle. You can implement your own parallel mode by simply inheriting the base classes.
The main components are:
1. `ProcessGroupInitializer`
2. `GradientHandler`
3. `Schedule`
**Currently this requires some changes to the source code, so we recommend that you install from source with the `-e` flag. The `-e` flag makes the installation editable, so your code changes will be reflected in your Python runtime. We will work on this to avoid changing the source code in future releases.**
## Process Group Initializer
Parallelism is usually managed by process groups, and processes participating in the same parallel algorithm are placed in the same process group. Different process groups need to be created for different parallel algorithms.
Colossal-AI provides a global context for users to manage their process groups easily. If you want to add a new process group, you can easily define a new class and set it in your configuration file. To define your own way of creating process groups, you can follow the steps below to create a new distributed initializer.
1. Add your own parallel mode in `colossalai.context.parallel_mode.ParallelMode`.
```python
class ParallelMode(Enum):
GLOBAL = 'global'
DATA = 'data'
PIPELINE = 'pipe'
...
NEW_MODE = 'new_mode' # define your mode here
```
2. Create a `ProcessGroupInitializer`. You can refer to the examples given in `colossalai.context.dist_group_initializer`. The first six arguments are fixed.
`ParallelContext` will pass in these arguments for you. If you need to set other arguments, you can add them after the fixed ones, like `arg1, arg2` in the example below.
Finally, register your initializer to the registry by adding the decorator `@DIST_GROUP_INITIALIZER.register_module`.
```python
# sample initializer class
@DIST_GROUP_INITIALIZER.register_module
class MyParallelInitializer(ProcessGroupInitializer):
def __init__(self,
rank: int,
world_size: int,
config: Config,
data_parallel_size: int,
pipeline_parlalel_size: int,
tensor_parallel_size: int,
arg1,
arg2):
super().__init__(rank, world_size, config)
self.arg1 = arg1
self.arg2 = arg2
# ... your variable init
def init_parallel_groups(self):
# initialize your process groups
pass
```
Then, you can insert your new initializer into the current mode-to-initializer mapping `colossalai.constants.INITIALIZER_MAPPING`. You can modify the file or insert a new key-value pair dynamically.
```python
colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
```
3. Set your initializer in your configuration file. You can pass in your custom arguments. This allows
the `ParallelContext` to create your initializer and initialize the process groups you expect.
```python
parallel = dict(
pipeline=dict(size=1),
tensor=dict(size=x, mode='new_mode') # this is where you enable your new parallel mode
)
```
## Gradient Handler
Gradient handlers are objects that execute the all-reduce operation on parameters' gradients. As different all-reduce strategies may be used in different kinds of parallelism, users can inherit
`colossalai.engine.gradient_handler.BaseGradientHandler` to implement their strategies. Currently, Colossal-AI uses the normal data parallel gradient handler, which all-reduces gradients across data parallel ranks.
The data parallel gradient handler is added to the engine automatically if data parallelism is detected.
You can add your own gradient handler as shown below:
```python
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine import BaseGradientHandler
@GRADIENT_HANDLER.register_module
class YourGradientHandler(BaseGradientHandler):
def handle_gradient(self):
do_something()
```
Afterwards, you can specify the gradient handler you want to use in the configuration file.
```python
gradient_handlers = [
dict(type='YourGradientHandler'),
]
```
## Schedule
Schedules define how the forward and backward passes are executed. Currently, Colossal-AI provides pipeline and non-pipeline schedules.
If you want to modify how the forward and backward passes are executed, you can inherit `colossalai.engine.schedule.BaseSchedule` and implement the `forward_back_step` function, as sketched below.
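As a rough skeleton only (the exact argument list of `forward_back_step` is an assumption here and may differ between versions, so check `colossalai.engine.schedule.BaseSchedule` in your installation):
```python
from colossalai.engine.schedule import BaseSchedule


class MySchedule(BaseSchedule):

    def forward_back_step(self, engine, data_iter, forward_only=False, return_loss=True):
        # the signature above is illustrative; in a real schedule you would
        # 1. load a batch from data_iter
        # 2. run the forward pass through the engine
        # 3. if not forward_only, compute the loss and call engine.backward(loss)
        raise NotImplementedError
```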

View File

@ -0,0 +1,31 @@
# Define Your Own Parallel Model
Author: Zhengda Bian, Yongbin Li
> ⚠️ We are working on this documentation to make it more detailed. We will introduce the mechanism of different kinds of parallelism and how to use them to write a model.
Let's say that you have a huge MLP model with billions of parameters, whose extremely large hidden layer size makes it impossible to fit into a single GPU directly. Don't worry, Colossal-AI is here to help you sort things out.
With the help of Colossal-AI, you can write your large model in the same familiar way you write models for a single GPU, while Colossal-AI automatically splits your model weights and distributes them perfectly across a set of GPUs. We give a simple example showing how to write a simple 2D parallel model in Colossal-AI.
## Write a Simple 2D Parallel Model
```python
from colossalai.nn import Linear2D
import torch.nn as nn
class MLP_2D(nn.Module):
def __init__(self):
super().__init__()
self.linear_1 = Linear2D(in_features=1024, out_features=16384)
self.linear_2 = Linear2D(in_features=16384, out_features=1024)
def forward(self, x):
x = self.linear_1(x)
x = self.linear_2(x)
return x
```
## Use Pre-defined Models
For your convenience, we provide some popular models such as *BERT*, *ViT*, *MoE* and *GPT* in Colossal-AI's model zoo. Feel free to customize them to different sizes to meet your special needs.

View File

@ -0,0 +1,140 @@
# Integrate Mixture-of-Experts Into Your Model
Author: Haichen Huang, Yongbin Li
**Prerequisite**
- [ColossalAI-Examples WideNet](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/widenet)
**Related Paper**
- [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961)
- [Go Wider Instead of Deeper](https://arxiv.org/abs/2107.11817)
(A Chinese version of this tutorial will be available soon.)
## Introduction
Since the advent of Switch Transformer, the AI community has found Mixture of Experts (MoE) a useful technique to enlarge the capacity of deep learning models.
Colossal-AI provides an early access version of parallelism specifically designed for MoE models.
The most prominent advantage of MoE in Colossal-AI is convenience.
We aim to help our users to easily combine MoE with model parallelism and data parallelism.
However, the current implementation has two main drawbacks.
The first drawback is its poor efficiency in large batch size and long sequence length training.
The second drawback is incompatibility with tensor parallelism.
We are working on system optimization to overcome the training efficiency problem.
The compatibility problem with tensor parallelism requires more adaptation, and we will tackle this issue in the future.
Here, we will introduce how to use MoE with model parallelism and data parallelism.
## Table of Contents
In this tutorial we will cover:
1. Set up MoE running environment
2. Create MoE layer
3. Train your model
We provided the [example code](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/widenet) for this tutorial in [ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI-Examples).
This example uses [WideNet](https://arxiv.org/abs/2107.11817) as an example of MoE-based model.
## Set up MoE running environment
In your project folder, create a `config.py`.
This file is to specify some features you may want to use to train your model.
In order to enable MoE, you need to add a dict called parallel and specify the value of key moe.
You can assign a value for the key size of moe, which represents the model parallel size of experts (i.e. the number of experts in one group to parallelize training).
For example, if the size is 4, 4 processes will be assigned to 4 consecutive GPUs and these 4 processes form a moe model parallel group.
Each process on the 4 GPUs will only get a portion of experts. Increasing the model parallel size will reduce communication cost, but increase computation cost in each GPU and activation cost in memory.
The total data parallel size is auto-detected and set as the number of GPUs by default.
```python
MOE_MODEL_PARALLEL_SIZE = ...
parallel = dict(
moe=dict(size=MOE_MODEL_PARALLEL_SIZE)
)
```
If `MOE_MODEL_PARALLEL_SIZE = E` and the number of experts is set to `E`, where `E` is a constant, the process flow of the forward pass of a transformer encoder in a model parallel group is shown below.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/oI59QcxdteKUTks.png"/>
<figcaption>MoE Transformer, image source: <a href="https://arxiv.org/abs/2006.16668">GShard</a></figcaption>
</figure>
Since all experts are allocated to all GPUs in a model parallel group and a GPU only owns a portion of experts,
the original data parallel groups are no longer correct for the parameters of experts during gradient handling in the backward pass.
So we create a new kind of parallel group called moe data parallel group.
The difference among different kinds of parallel group, when the configuration is set as `WORLD_SIZE=4`,
`MOE_MODEL_PARALLEL_SIZE=2`, is shown here.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/Sn8FpmQPKIiBEq2.png"/>
<figcaption>MoE process group</figcaption>
</figure>
As for gradient handling, we provide MoeGradientHandler to all-reduce every parameter of the model.
If you use `colossalai.initialize` function to create your training engine, the MoE gradient handler will be added to your engine automatically.
Otherwise, you should take care of gradient by yourself.
All parameters of MoE running environment are stored in colossalai.global_variables.moe_env.
You can access your configuration parameters to check whether your setup is correct.
```python
from colossalai.global_variables import moe_env
```
## Create MoE layer
You can create a MoE layer from `colossalai.nn.moe`.
But before doing that, you should set up random seeds for all processes like this.
```python
from colossalai.context.random import moe_set_seed
from model_zoo.moe.models import Widenet
moe_set_seed(42)
model = Widenet(num_experts=4, capacity_factor=1.2)
```
`moe_set_seed` will set different seed for different processes in a moe model parallel group.
This helps initialize parameters in experts.
Then create an instance of experts and an instance of router.
Here is the example in model zoo.
```python
from colossalai.nn.layer.moe import Experts, MoeLayer, Top2Router, NormalNoiseGenerator
noisy_func = NormalNoiseGenerator(num_experts)
shared_router = Top2Router(capacity_factor,
noisy_func=noisy_func)
shared_experts = Experts(expert=VanillaFFN,
num_experts=num_experts,
**moe_mlp_args(
d_model=d_model,
d_ff=d_ff,
drop_rate=drop_rate
))
ffn=MoeLayer(dim_model=d_model, num_experts=num_experts,
router=shared_router, experts=shared_experts)
```
Inside the initialization of Experts, the local expert number of each GPU will be calculated automatically. You just need to specify the class of each expert and its parameters used in its initialization. As for routers, we have provided a top1 router and a top2 router. You can find them in colossalai.nn.layer.moe. After creating the instances of experts and router, the only thing initialized in MoeLayer is the gate module. More definitions of each class can be found in our API document and code.
## Train Your Model
Do not forget to use the `colossalai.initialize` function in `colossalai` to add the gradient handler to the engine.
We handle the back-propagation of MoE models for you.
In `colossalai.initialize`, we will automatically create a `MoeGradientHandler` object to process gradients.
You can find more information about the handler `MoeGradientHandler` in the colossal directory.
The loss criterion should be wrapped by `MoeLoss` to add the auxiliary loss of MoE. An example is shown below.
```python
criterion = MoeLoss(
aux_weight=0.01,
loss_fn=nn.CrossEntropyLoss,
label_smoothing=0.1
)
```
Finally, just use the trainer or engine in `colossalai` to do your training; a minimal sketch is given below.
Otherwise, you should take care of the gradients yourself.
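For reference, a minimal engine-based training loop might look like the sketch below. The model, optimizer, `MoeLoss` criterion and dataloader are assumed to have been built beforehand, and the dataloader is assumed to yield `(data, label)` pairs; this is only an illustration rather than a complete script.
```python
import colossalai

# Minimal sketch: model, optimizer, criterion (wrapped by MoeLoss) and
# train_dataloader are assumed to have been constructed beforehand.
engine, train_dataloader, _, _ = colossalai.initialize(model,
                                                       optimizer,
                                                       criterion,
                                                       train_dataloader=train_dataloader)

for data, label in train_dataloader:
    engine.zero_grad()
    output = engine(data)
    # MoeLoss adds the MoE auxiliary loss on top of the wrapped loss function
    loss = engine.criterion(output, label)
    engine.backward(loss)
    engine.step()
```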

View File

@ -0,0 +1,96 @@
# 认识GeminiColossalAI的异构内存空间管理器
作者: [Jiarui Fang](https://github.com/feifeibear)
## 简介
在GPU数量不足情况下想要增加模型规模异构训练是最有效的手段。它通过在 CPU 和 GPU 中容纳模型数据,并仅在必要时将数据移动到当前设备,可以同时利用 GPU 内存、CPU 内存(由 CPU DRAM 或 NVMe SSD内存组成来突破单GPU内存墙的限制。同时在大规模训练下数据并行、模型并行、流水线并行等其他并行方案都可以在异构训练基础上进一步扩展GPU规模。这篇文章描述ColossalAI的异构内存空间管理模块Gemini的设计细节它的思想来源于[PatrickStar](https://arxiv.org/abs/2108.05818)ColossalAI根据自身情况进行了重新实现。
## 用法
目前Gemini支持和ZeRO并行方式兼容它的使用方法很简单在训练策略的配置文件里设置zero的model_config属性tensor_placement_policy='auto'
```python
zero = dict(
model_config=dict(
reduce_scatter_bucket_size_mb=25,
fp32_reduce_scatter=False,
gradient_predivide_factor=1.0,
tensor_placement_policy="auto",
shard_strategy=TensorShardStrategy(),
...
),
optimizer_config=dict(
...
)
)
```
注意Gemini和并行策略如Tensor ParallelismData ParallelismPipeline ParallelismZeRO是解耦合的。对TPPP的支持还在开发中。
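作为参考,下面给出一个把上述 zero 配置写入配置文件后启动训练的极简示意。其中模型、优化器、损失函数和 dataloader 假定已在别处构建,仅用于说明配置的生效流程:
```python
import colossalai

# 示意:上述 zero 配置保存在 ./config.py 中model、optimizer、criterion
# 和 train_dataloader 假定已在别处构建
colossalai.launch_from_torch(config='./config.py')
engine, train_dataloader, _, _ = colossalai.initialize(model,
                                                       optimizer,
                                                       criterion,
                                                       train_dataloader=train_dataloader)
```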
## 术语
**算子**(**OP**erator)一个神经网络层的计算操作比如LinearLayerNorm等。算子可以是正向传播的计算也可以是反向传播的计算。
神经网络在训练期间必须管理的两种类型的训练数据。
**模型数据(model data)**: 由参数、梯度和优化器状态组成,其规模与模型结构定义相关
**非模型数据(non-model data)**: 主要由算子生成的中间张量和算子的临时变量组成。非模型数据根据训练任务的配置动态变化,例如批量大小。模型数据和非模型数据相互竞争 GPU 内存。
## 设计
目前的一些解决方案DeepSpeed采用的[Zero-offload](https://arxiv.org/abs/2101.06840)在CPU和GPU内存之间静态划分模型数据并且它们的内存布局对于不同的训练配置是恒定的。如下图左边所示当 GPU 内存不足以满足其相应的模型数据要求时即使当时CPU上仍有可用内存系统也会崩溃。而ColossalAI可以通过将一部分模型数据换出到CPU上来完成训练。
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/tutorial/gemini/deepspeed_compare.png"/>
<figcaption>比较Zero-Offload和Gemini的内存管理方案</figcaption>
</figure>
ColossalAI设计了Gemini就像双子星一样它管理CPU和GPU二者内存空间。它可以让张量在训练过程中动态分布在CPU-GPU的存储空间内从而让模型训练突破GPU的内存墙。内存管理器由两部分组成分别是MemStatsCollector(MSC)和StatefuleTensorMgr(STM)。
我们利用了深度学习网络训练过程的迭代特性。我们将迭代分为warmup和non-warmup两个阶段开始时的一个或若干迭代步属于预热阶段其余的迭代步属于正式阶段。在warmup阶段我们为MSC收集信息而在non-warmup阶段STM利用MSC收集的信息来移动tensor以达到最小化CPU-GPU数据移动volume的目的。
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/tutorial/gemini/gemini_workflow.png"/>
<figcaption>Gemini在不同训练阶段的运行流程</figcaption>
</figure>
### StatefulTensorMgr
STM管理所有model data tensor的信息。在模型的构造过程中ColossalAI把所有model data张量注册给STM。内存管理器给每个张量标记一个状态信息。状态集合包括HOLDCOMPUTEFREE三种状态。STM的功能如下
**查询内存使用:**通过遍历所有tensor的在异构空间的位置获取模型数据对CPU和GPU的内存占用。
**转换张量状态:**它在每个模型数据张量参与算子计算之前将张量标记为COMPUTE状态在计算之后标记为HOLD状态。如果张量不再使用则标记为FREE状态。
**调整张量位置:**张量管理器保证COMPUTE状态的张量被放置在计算设备上如果计算设备的存储空间不足则需要移动出一些HOLD状态的张量到其他设备上存储。Tensor eviction strategy需要MSC的信息我们将在后面介绍。
### MemStatsCollector
在预热阶段内存信息统计器监测CPU和GPU中模型数据和非模型数据的内存使用情况供正式训练阶段参考。我们通过查询STM可以获得模型数据在某个时刻的内存使用。但是非模型的内存使用却难以获取。因为非模型数据的生存周期并不归用户管理现有的深度学习框架没有暴露非模型数据的追踪接口给用户。MSC通过采样方式在预热阶段获得非模型对CPU和GPU内存的使用情况。具体方法如下
我们在算子的开始和结束计算时,触发内存采样操作,我们称这个时间点为**采样时刻(sampling moment)**,两个采样时刻之间的时间我们称为**period**。计算过程是一个黑盒由于可能分配临时buffer内存使用情况很复杂。但是我们可以较准确地获取period内的系统最大内存使用。非模型数据的使用可以通过两个采样时刻之间的系统最大内存使用减去模型内存使用获得。
我们如何设计采样时刻呢我们选择preOp的model data layout adjust之前如下图所示。我们采样获得上一个period的system memory used和下一个period的model data memory used。并行策略会给MSC的工作造成障碍。如图所示比如对于ZeRO或者Tensor Parallel由于Op计算前需要gather模型数据会带来额外的内存需求。因此我们要求在模型数据变化前采样系统内存这样MSC就能在一个period内捕捉到preOp带来的模型数据内存变化。比如在period 2-3内我们考虑的是tensor gather和shard带来的内存变化。
尽管可以将采样时刻放在其他位置比如排除gather buffer的变动带来的新信息但是会造成麻烦。不同并行方式下Op的实现有差异比如对于Linear OpTensor Parallel中gather buffer的分配在Op中而对于ZeROgather buffer的分配在PreOp中。将采样时刻放在PreOp开始时有利于将两种情况统一。
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/tutorial/gemini/gemini_mem_curve.png"/>
<figcaption>Sampling based MemStatsCollector</figcaption>
</figure>
### Tensor Eviction Strategy
MSC的重要职责是调整tensor layout的位置。比如在上图S2时刻我们减少设备上的model data使得Period 2-3计算所需的峰值内存得到满足。
在warmup阶段由于还没有执行完一个完整的迭代我们对内存的真实使用情况尚一无所知。此时我们限制模型数据的内存使用上限比如只使用30%的GPU内存以保证可以顺利完成预热阶段。
在non-warmup阶段我们需要利用预热阶段采集的非模型数据内存信息预留出下一个Period在计算设备上需要的峰值内存这需要我们移动出一些模型张量。
为了避免频繁在CPU-GPU之间换入换出相同的tensor引起类似[cache thrashing](https://en.wikipedia.org/wiki/Thrashing_(computer_science))的现象我们利用DNN训练的迭代特性设计了OPT cache换出策略。具体来说在warmup阶段我们记录每个tensor被计算设备需要的采样时刻。如果我们需要驱逐一些HOLD状态的tensor那么我们选择在本设备上最晚才会被需要的tensor作为受害者。
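下面用一段极简的 Python 伪实现来示意这一 OPT 换出思想。其中的数据结构与函数名均为假设,仅用于说明策略本身,并非 ColossalAI 的真实接口:
```python
def choose_victims(hold_tensors, required_bytes):
    """示意性的 OPT 换出策略:优先驱逐在本设备上最晚才会再被用到的 HOLD 张量。

    hold_tensors: [(tensor_id, size_in_bytes, next_use_moment), ...]
      其中 next_use_moment 来自 warmup 阶段记录的采样时刻None 表示之后不再使用。
    本函数及其参数均为假设,仅用于说明思想,并非 ColossalAI 的真实接口。
    """
    # 之后不再使用的张量视为“最晚被使用”,排在最前面
    ranked = sorted(hold_tensors,
                    key=lambda t: float('inf') if t[2] is None else t[2],
                    reverse=True)
    victims, freed = [], 0
    for tensor_id, size, _ in ranked:
        if freed >= required_bytes:
            break
        victims.append(tensor_id)
        freed += size
    return victims
```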

View File

@ -0,0 +1,79 @@
# Colossal-AI使用指南5分钟搭建在线OPT服务
## 介绍
本指导手册将说明如何利用[Colossal-AI](https://github.com/hpcaitech/ColossalAI)搭建您自己的OPT服务。
## Colossal-AI 推理概述
Colossal-AI 提供了一个推理子系统 [Energon-AI](https://github.com/hpcaitech/EnergonAI) 这是一个基于Colossal-AI的服务系统拥有以下特性
- **大模型并行:** 在Colossal-AI的张量并行和流水线并行策略的帮助下Colossal-AI的推理可实现大模型的高效并行推理。
- **预构建大模型:** Colossal-AI提供热门模型的预构建部署例如OPT。其支持用于生成任务和加载检查点的缓存技术。
- **引擎封装:** Colossal-AI中有一个抽象层被称作引擎。其将单实例多设备(SIMD) 执行与远程过程调用封装在一起。
- **在线服务系统:** 基于FastAPI用户可以快速启动分布式推理的网络服务。 在线服务对生成任务进行了特殊优化。它采用left padding和bucket batching两种技术来提高效率。
## 基本用法
1. 下载OPT模型
想要快速发布分布式推理服务,您可以从[此处](https://huggingface.co/patrickvonplaten/opt_metaseq_125m/blob/main/model/restored.pt)下载OPT-125M。有关加载其他规模模型的详细方法您可访问[此处](https://github.com/hpcaitech/EnergonAI/tree/main/examples/opt/script)。
2. 准备提前构建的服务镜像
从dockerhub拉取一个已经安装Colossal-AI推理的docker镜像。
```bash
docker pull hpcaitech/energon-ai:latest
```
3. 发布HTTP服务
若想发布服务我们需要准备python脚本来描述模型的类型和相关的部署以及HTTP服务的设置。 我们为您提供了一组[示例](https://github.com/hpcaitech/EnergonAI/tree/main/examples)。 我们将在本指导手册中使用[OPT 示例](https://github.com/hpcaitech/EnergonAI/tree/main/examples/opt)。
服务的入口是一个bash脚本 server.sh。
本服务的配置文件参考 opt_config.py该文件定义了模型的类型、 检查点文件路径、并行策略和http设置。您能按照您的需求来修改这些设置。
例如将模型的大小设置为opt_125M将正确的检查点路径按照如下设置
```bash
model_class = opt_125M
checkpoint = 'your_file_path'
```
将张量并行度设置为您的gpu数量。
```bash
tp_init_size = #gpu
```
现在我们就能利用docker发布一个服务。您能在`/model_checkpoint` 和 `/config`路径下找到检查点文件和配置文件。
```bash
export CHECKPOINT_DIR="your_opt_checkpoint_path"
# the ${CONFIG_DIR} must contain a server.sh file as the entry of service
export CONFIG_DIR="config_file_path"
docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config --ipc=host hpcaitech/energon-ai:latest
```
接下来,您就可以在您的浏览器中打开 `https://[IP-ADDRESS]:8020/docs#` 进行测试。
## 高级特性用法
1. 批处理优化
若想使用我们的高级批处理技术来批量收集多个查询您可以将executor_max_batch_size设置为最大批处理大小。 请注意,只有具有相同 top_k、top_p 和 temperature 的解码任务才能一起批处理。
```python
executor_max_batch_size = 16
```
所有的查询将进入FIFO队列。解码步数小于或等于队列头部解码步数的所有连续查询可以一起批处理。 应用左填充以确保正确性。 executor_max_batch_size 不应该过大,从而确保批处理不会增加延迟。 以opt-30b为例 `executor_max_batch_size=16` 合适但对于opt-175b而言 `executor_max_batch_size=4` 更合适。
2. 缓存优化
对于每一个独立的服务过程您能将最近的多个查询结果缓存在一起。在config.py中设置 cache_size 和 cache_list_size。缓存的大小应为缓存的查询数目。cache_list_size 应为每次查询存储的结果数。一个随机缓存的结果将会被返回。当缓存已满LRU策略被用于清理缓存过的查询。cache_size=0意味着不缓存。
```python
cache_size = 50
cache_list_size = 2
```

View File

@ -0,0 +1,176 @@
# 使用ColoTensor让串行程序像Megatron-LM一样并行
Author: [Haichen Huang](https://github.com/1SAA) and [Jiarui Fang](https://github.com/feifeibear)
**Prerequisite:**
- [ColoTensor Concepts](../basics/colotensor_concept.md)
## 介绍
在新版本中我们引入了ColoTensor。ColoTensor为用户使用并行训练提供了极大的便利使得用户可以在原本的串行代码上通过较小的修改将训练改为并行。在本教程中我们将说明如何修改训练模型以自动使代码采取像 Megatron-LM 一样的方式并行训练。我们以 HuggingFace 提供的 GPT-2 模型为例并提供一种方式让你可以在单个GPU上预训练GPT-2模型。
Megatron-LM 提供了一个具有影响力的并行化范式这个范式主要应用于Transformer大模型的训练。然而为了大规模训练 Transformer 语言大模型用户必须使用Megatron-LM提供的特殊模块来构建他们的模型。这给用户带来了一些困难的工作例如从预先训练的模型中加载权重或是构建自己的并行训练模型。为了减轻用户的麻烦我们提供 ColoTensor 类,以完成自动启用张量模型并行。
## 定义模型和损失函数
首先,我们直接调用 HuggingFace 库中的 GPTModel 和 GPTLoss。
```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel
class GPTLMModel(nn.Module):
def __init__(self, hidden_size=768, num_layers=12, num_attention_heads=12, max_seq_len=1024, vocab_size=50257, checkpoint=False):
super().__init__()
self.checkpoint = checkpoint
self.model = GPT2LMHeadModel(GPT2Config(n_embd=hidden_size, n_layer=num_layers,
n_head=num_attention_heads, n_positions=max_seq_len, n_ctx=max_seq_len, vocab_size=vocab_size))
if checkpoint:
self.model.gradient_checkpointing_enable()
def forward(self, input_ids, attention_mask):
# Only return lm_logits
return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
class GPTLMLoss(nn.Module):
def __init__(self):
super().__init__()
self.loss_fn = nn.CrossEntropyLoss()
def forward(self, logits, labels):
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
return self.loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
```
## 对GPT-2的简短回顾
现在,我们回顾一下 GPT-2 模型的结构。每个 GPT-2 模型都可以表示为一个 DAG。如下图所示每个圆圈代表一个算子每个方块代表一个权重。每个箭头表示输入数据的流向而箭头旁边的符号表示输入数据的形状。
然后,让我们深入了解一下这个 GPT-2 模型。它由三部分组成,分别是**嵌入模块**、**转换器层**和**分类头**。
嵌入模块包含两个权重,符号嵌入权重和位置嵌入权重。在嵌入模块的前向操作之后,原始输入数据的所有序列中的每个单词都会被嵌入到隐藏状态。
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/08/17/omfkIEN6ui5jcL3.png"/>
<figcaption>嵌入模块</figcaption>
</figure>
每个转换器层包含两个块。自注意操作在第一个块中调用,同时一个双层感知器位于第二个块中。
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/08/17/LAVzDlpRcj4dYeb.png"/>
<figcaption>转换器层</figcaption>
</figure>
最后,分类头只是一个不加偏差的线性模块,里面只有一个线性权重。
## 应用ColoTensor
两个步骤使您的串行代码采取 Megatron-LM 张量并行风格。
1. 在ColoInitContext的上下文中初始化模型。
2. 为每个参数设置 ColoTensorSpec。
### 使用 ColoInitContext 初始化
我们应该在 ColoInitContext 中构建模型。在该种上下文中,任何初始化的参数都将转换为 ColoParameter 并自动移动到相应的设备上。
```python
from colossalai.utils.model.colo_init_context import ColoInitContext
with ColoInitContext(device=torch.device('cpu')):
model = GPTLMModel()
```
### 为每个参数设置 ColoTensorSpec
模型创建完成后我们通过ProcessGroup建立分布式环境。这里我们将张量并行度指定为所有GPU的数量即数据并行度为一。
```python
import torch.distributed as dist
from colossalai.tensor import ProcessGroup
pg = ProcessGroup(tp_degree=dist.get_world_size())
```
现在我们需要一些辅助函数为下一步做准备。我们定义了两个函数来切分参数。Megatron-LM张量并行需要沿参数的第一维或最后一维切分参数张量。
```python
from colossalai.tensor import ShardSpec, ComputeSpec, ComputePattern, ColoParameter, ProcessGroup
def split_param_single_dim_tp1d(dim: int, param: ColoParameter, pg: ProcessGroup):
spec = (ShardSpec([dim], [pg.tp_world_size()]), ComputeSpec(ComputePattern.TP1D))
if param.process_group.tp_world_size() == 1:
param.set_process_group(pg)
param.set_tensor_spec(*spec)
def split_param_row_tp1d(param: ColoParameter, pg: ProcessGroup):
split_param_single_dim_tp1d(0, param, pg)
def split_param_col_tp1d(param: ColoParameter, pg: ProcessGroup):
split_param_single_dim_tp1d(-1, param, pg)
```
然后我们使模型采用张量并行。根据 Megatron 中使用的张量并行,应该沿着张量的最后一个维度进行切片,包括符号嵌入的权重,位置嵌入的权重,自注意力块中的所有线性权重和偏差,以及每个双层感知器中的第一个线性权重和偏差。且需要沿第一个维度切分双层感知器中的第二个线性权重。
```python
for mn, module in model.named_modules():
    for pn, param in module.named_parameters(recurse=False):
        # set process group for all parameters
        param.set_process_group(pg)
        if 'mlp.c_fc' in mn:
            if 'weight' in pn or 'bias' in pn:
                split_param_col_tp1d(param, pg)  # column slice
                # keep the shape of the output from c_fc
                param.compute_spec.set_output_replicate(False)
        elif 'mlp.c_proj' in mn:
            if 'weight' in pn:
                split_param_row_tp1d(param, pg)  # row slice
        elif 'wte' in mn or 'wpe' in mn:
            split_param_col_tp1d(param, pg)  # column slice
        elif 'c_attn' in mn or 'c_proj' in mn:
            split_param_col_tp1d(param, pg)  # column slice
```
修改后的模型如下图所示。
嵌入模块:
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/08/17/Yu2xzXEabHV7pwe.png"/>
<figcaption>修改后的嵌入模块</figcaption>
</figure>
转换器层:
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/08/17/4HWsA2xz51IhPFO.png"/>
<figcaption>修改后的转换器层</figcaption>
</figure>
一旦用户指定了每个参数的在并行中的分布模式ColoTensor 就能够推断出所有算子的计算模式包括矩阵乘法、线性函数、torch.nn.functional 中的其他逐元素函数,以及其他的一些常用函数。这样,用户可以像往常一样训练他们的模型。
在我们最新示例中还定义了一个Gemini + ZeRO DDP的模型以减小开销、提升效率。这一部分的详细内容可以参考[ZeRO](../features/zero_with_chunk.md),你可以将这两部分内容结合起来看,从而理解我们整个训练流程:
```python
def gemini_zero_dpp(model: torch.nn.Module, pg: ProcessGroup, placement_policy: str = "auto"):
    from colossalai.nn.parallel import GeminiDDP
    from colossalai.utils import get_current_device
    model = GeminiDDP(model,
                      device=get_current_device(),
                      placement_policy=placement_policy,
                      pin_memory=True,
                      search_range_mb=32)
    return model
```
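下面是调用上述函数的一个极简用法示意,其中 `model` 与 `pg` 来自前面的步骤:
```python
# 用法示意:在完成张量并行切分之后,用 Gemini + ZeRO DDP 包装模型
model = gemini_zero_dpp(model, pg, placement_policy="auto")
```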
## 在单个GPU上预训练GPT-2
我们做的上述优化让我们可以在单GPU上训练GPT-2模型。只需要将`run.sh`中的参数`GPUNUM`设置为1在运行文件时就可以在单个GPU上完成模型的训练。
GPT-2 示例可在[Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt)获得。

View File

@ -0,0 +1,275 @@
# 使用混合并行训练 GPT
作者: Hongxin Liu, Yongbin Li
**示例代码**
- [ColossalAI-Examples GPT2](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/gpt_2)
- [ColossalAI-Examples GPT3](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/gpt_3)
**相关论文**
- [Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training](https://arxiv.org/abs/2110.14883)
- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473)
## 引言
在上一篇教程中,我们介绍了如何用流水并行训练 ViT。在本教程中你将学习一个更复杂的场景--用混合并行方式训练GPT。在这种情况下由于GPT-3过大即使CPU内存也无法容纳它。因此你必须自己分割模型。
## 目录
在本教程中,我们将介绍:
1. 基于 colossalai/model_zoo 定义 GPT 模型
2. 处理数据集
3. 使用混合并行训练 GPT
## 导入依赖库
```python
import json
import os
from typing import Callable
import colossalai
import colossalai.utils as utils
import model_zoo.gpt.gpt as col_gpt
import torch
import torch.nn as nn
from colossalai import nn as col_nn
from colossalai.amp import AMP_TYPE
from colossalai.builder.pipeline import partition_uniform
from colossalai.context.parallel_mode import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.engine.schedule import (InterleavedPipelineSchedule,
PipelineSchedule)
from colossalai.logging import disable_existing_loggers, get_dist_logger
from colossalai.nn.layer.wrapper import PipelineSharedModuleWrapper
from colossalai.trainer import Trainer, hooks
from colossalai.utils.timer import MultiTimer
from model_zoo.gpt import GPTLMLoss
from torch.nn import functional as F
from torch.utils.data import Dataset
from transformers import GPT2Tokenizer
```
## 定义 GPT 模型
在前面的教程中我们介绍了3种建立流水并行模型的方法但对于像 GPT-3 这样的巨大模型,你甚至不能在 CPU 中建立模型。在这种情况下,你必须自己分割模型。
GPT 数据加载器返回 `input_ids``attention_mask`, 因此我们在 `forward()` 中使用两个关键字参数来获得它们。请注意,对于除第一阶段以外的其他阶段, `forward()` 的第一个位置参数是上一阶段的输出张量。所以 `hidden_states` 来自前一阶段,并且对于第一阶段来说,它是 `None`
对于 GPT, *word embedding layer**output head* 共享权重。我们提供 `PipelineSharedModuleWrapper` 在流水阶段间共享参数。它需要一个 `int` 型的 `list` 作为参数, 这意味着 rank 们共享这些参数。你可以使用 `register_module()`
`register_parameter()` 来注册一个模块或一个参数作为共享模块或参数。如果你有多组共享模块/参数,你应该有多个 `PipelineSharedModuleWrapper` 实例。 如果参数在**一个**阶段内共享, 你不应该使用
`PipelineSharedModuleWrapper`, 而只是使用同一个模块/参数实例。在这个例子中,*word embedding layer* 在第一阶段, 而 *output head* 在最后一个阶段。因此,他们在 rank `[0, pipeline_size - 1]` 之间共享参数。
对于第一阶段,它维护 embedding layer 和一些 transformer blocks。对于最后一个阶段它维护一些 transformer blocks 和 output head layer。对于其他阶段他们只维护一些 transformer blocks。
`partition_uniform(num_layers, pipeline_size, num_chunks)` 返回所有 rank 的 parts, part 是一个 `(start, end)` (不包括end) 的 `tuple`。`start == 0` 表示这是第一阶段, 而 `end == num_layers` 表示这是最后一个阶段。
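为了更直观地理解返回值,下面给出一个极简示意,其中的层数与阶段数只是假设的例子:
```python
from colossalai.builder.pipeline import partition_uniform

# 示意假设共 48 层、4 个流水线阶段、每个阶段 1 个 chunk
# 每个 rank 得到形如 [(start, end)] 的区间列表,例如 rank 0 -> [(0, 12)]
parts = partition_uniform(48, 4, 1)[0]
print(parts)
```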
```python
class PipelineGPTHybrid(nn.Module):
def __init__(self,
num_layers: int = 12,
hidden_size: int = 768,
num_attention_heads: int = 12,
vocab_size: int = 50304,
embed_drop_rate: float = 0.,
act_func: Callable = F.gelu,
mlp_ratio: int = 4,
attn_drop_rate: float = 0.,
drop_rate: float = 0.,
dtype: torch.dtype = torch.float,
checkpoint: bool = False,
max_position_embeddings: int = 1024,
layer_norm_epsilon: float = 1e-5,
first: bool = False,
last: bool = False):
super().__init__()
self.embedding = None
self.norm = None
self.head = None
if first:
self.embedding = col_gpt.GPTEmbedding(
hidden_size, vocab_size, max_position_embeddings, dropout=embed_drop_rate, dtype=dtype)
self.blocks = nn.ModuleList([
col_gpt.GPTBlock(hidden_size, num_attention_heads, mlp_ratio=mlp_ratio, attention_dropout=attn_drop_rate,
dropout=drop_rate, dtype=dtype, checkpoint=checkpoint, activation=act_func)
for _ in range(num_layers)
])
if last:
self.norm = col_nn.LayerNorm(hidden_size, eps=layer_norm_epsilon)
self.head = col_gpt.GPTLMHead(vocab_size=vocab_size,
dim=hidden_size,
dtype=dtype,
bias=False)
def forward(self, hidden_states=None, input_ids=None, attention_mask=None):
if self.embedding is not None:
hidden_states = self.embedding(input_ids=input_ids)
batch_size = hidden_states.shape[0]
attention_mask = attention_mask.view(batch_size, -1)
attention_mask = attention_mask[:, None, None, :]
attention_mask = attention_mask.to(dtype=hidden_states.dtype) # fp16 compatibility
attention_mask = (1.0 - attention_mask) * -10000.0
for block in self.blocks:
hidden_states, attention_mask = block(hidden_states, attention_mask)
if self.norm is not None:
hidden_states = self.head(self.norm(hidden_states))
return hidden_states
def build_gpt_pipeline(num_layers, num_chunks, device=torch.device('cuda'), **kwargs):
logger = get_dist_logger()
pipeline_size = gpc.get_world_size(ParallelMode.PIPELINE)
pipeline_rank = gpc.get_local_rank(ParallelMode.PIPELINE)
rank = gpc.get_global_rank()
wrapper = PipelineSharedModuleWrapper([0, pipeline_size - 1])
parts = partition_uniform(num_layers, pipeline_size, num_chunks)[pipeline_rank]
models = []
for start, end in parts:
kwargs['num_layers'] = end - start
kwargs['first'] = start == 0
kwargs['last'] = end == num_layers
logger.info(f'Rank{rank} build layer {start}-{end}, {end-start}/{num_layers} layers')
chunk = PipelineGPTHybrid(**kwargs).to(device)
if start == 0:
wrapper.register_module(chunk.embedding.word_embeddings)
elif end == num_layers:
wrapper.register_module(chunk.head)
models.append(chunk)
if len(models) == 1:
model = models[0]
else:
model = nn.ModuleList(models)
return model
def GPT2_exlarge_pipeline_hybrid(num_chunks=1, checkpoint=False, dtype=torch.float):
cfg = dict(hidden_size=1600, num_attention_heads=32, checkpoint=checkpoint, dtype=dtype)
return build_gpt_pipeline(48, num_chunks, **cfg)
def GPT3_pipeline_hybrid(num_chunks=1, checkpoint=False, dtype=torch.float):
cfg = dict(hidden_size=12288, num_attention_heads=96,
checkpoint=checkpoint, max_position_embeddings=2048, dtype=dtype)
return build_gpt_pipeline(96, num_chunks, **cfg)
```
## 处理数据集
我们在这里提供了一个小型 GPT web-text 数据集。 原始格式是 loose JSON, 我们将保存处理后的数据集。
```python
class WebtextDataset(Dataset):
def __init__(self, path, seq_len=1024) -> None:
super().__init__()
root = os.path.dirname(path)
encoded_data_cache_path = os.path.join(root, f'gpt_webtext_{seq_len}.pt')
if os.path.isfile(encoded_data_cache_path):
seq_len_, data, attention_mask = torch.load(
encoded_data_cache_path)
if seq_len_ == seq_len:
self.data = data
self.attention_mask = attention_mask
return
raw_data = []
with open(path) as f:
for line in f.readlines():
raw_data.append(json.loads(line)['text'])
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.unk_token
encoded_data = tokenizer(
raw_data, padding=True, truncation=True, max_length=seq_len, return_tensors='pt')
self.data = encoded_data['input_ids']
self.attention_mask = encoded_data['attention_mask']
torch.save((seq_len, self.data, self.attention_mask),
encoded_data_cache_path)
def __len__(self):
return len(self.data)
def __getitem__(self, index):
return {
'input_ids': self.data[index],
'attention_mask': self.attention_mask[index]
}, self.data[index]
```
## 使用混合并行训练 GPT
在上一个教程中,我们解释了一些流水并行的参数含义。在本例中,我们可以确定在流水阶段之间交换的每个输出张量的形状。对于 GPT该形状为
`(MICRO BATCH SIZE, SEQUENCE LEN, HIDDEN SIZE)`。通过设置该参数,我们可以避免交换每个阶段的张量形状。当你不确定张量的形状时,你可以把它保留为
`None`, 形状会被自动推测。请确保你的模型的 `dtype` 是正确的:当你使用 `fp16`,模型的 `dtype` 必须是 `torch.half`;否则,`dtype` 必须是 `torch.float`。对于流水并行,仅支持 `AMP_TYPE.NAIVE`
你可以通过在 `CONFIG` 里使用 `parallel` 来轻松使用张量并行。数据并行的大小是根据 GPU 的数量自动设置的。
```python
NUM_EPOCHS = 60
SEQ_LEN = 1024
BATCH_SIZE = 192
NUM_CHUNKS = None
TENSOR_SHAPE = (1, 1024, 1600)
# only pipeline parallel
# CONFIG = dict(NUM_MICRO_BATCHES = 192, parallel=dict(pipeline=2), fp16=dict(mode=AMP_TYPE.NAIVE))
# pipeline + 1D model parallel
CONFIG = dict(NUM_MICRO_BATCHES = 192, parallel=dict(pipeline=2, tensor=dict(mode='1d', size=2)), fp16=dict(mode=AMP_TYPE.NAIVE))
def train():
disable_existing_loggers()
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch_from_torch(config=CONFIG, backend=args.backend)
logger = get_dist_logger()
train_ds = WebtextDataset(os.environ['DATA'], seq_len=SEQ_LEN)
train_dataloader = utils.get_dataloader(train_ds,
seed=42,
batch_size=BATCH_SIZE,
pin_memory=True,
shuffle=True,
drop_last=True)
use_interleaved = NUM_CHUNKS is not None
num_chunks = 1 if not use_interleaved else NUM_CHUNKS
model = GPT2_exlarge_pipeline_hybrid(num_chunks=num_chunks, checkpoint=True, dtype=torch.half)
# model = GPT3_pipeline_hybrid(num_chunks=num_chunks, checkpoint=True, dtype=torch.half)
if use_interleaved and not isinstance(model, nn.ModuleList):
model = nn.ModuleList([model])
criterion = GPTLMLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.00015, weight_decay=1e-2,)
engine, train_dataloader, _, _ = colossalai.initialize(model,
optimizer,
criterion,
train_dataloader=train_dataloader)
global_batch_size = BATCH_SIZE * \
gpc.get_world_size(ParallelMode.DATA) * getattr(gpc.config, "gradient_accumulation", 1)
logger.info(f'Init done, global batch size = {global_batch_size}', ranks=[0])
timer = MultiTimer()
trainer = Trainer(
engine=engine,
logger=logger,
timer=timer
)
hook_list = [
hooks.LossHook(),
hooks.LogMetricByEpochHook(logger),
hooks.ThroughputHook(),
hooks.LogMetricByStepHook(),
]
trainer.fit(
train_dataloader=train_dataloader,
epochs=NUM_EPOCHS,
test_interval=1,
hooks=hook_list,
display_progress=True,
return_output_label=False,
)
```

View File

@ -0,0 +1,246 @@
# 使用流水并行训练 ViT
作者: Hongxin Liu, Yongbin Li
**示例代码**
- [ColossalAI-Examples Pipeline Parallel ViT](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/pipeline_parallel)
**相关论文**
- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473)
## 引言
在本教程中,你将学习如何使用流水并行从头开始训练用于图像分类的 Vision Transformer (ViT)。流水并行是一种模型并行,主要针对 GPU 内存不能满足模型容量的情况。
通过使用流水并行,我们将原始模型分割成多个阶段,每个阶段保留原始模型的一部分。我们假设你的 GPU 内存不能容纳 ViT/L-16而你的 CPU 内存可以容纳这个模型。
## 目录
在本教程中,我们将介绍:
1. 基于 [TIMM](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py) 定义 ViT 模型
2. 处理数据集
3. 使用流水并行训练 ViT
## 导入依赖库
```python
import os
from collections import OrderedDict
from functools import partial
import colossalai
import colossalai.nn as col_nn
import torch
import torch.nn as nn
from colossalai.builder import build_pipeline_model
from colossalai.engine.schedule import (InterleavedPipelineSchedule,
PipelineSchedule)
from colossalai.logging import disable_existing_loggers, get_dist_logger
from colossalai.trainer import Trainer, hooks
from colossalai.utils import MultiTimer, get_dataloader
from timm.models import vision_transformer as vit
from torchvision import transforms
from torchvision.datasets import CIFAR10
```
## 定义 Vision Transformer 模型
总的来说, 我们提供3种方法来建立一个流水并行的模型:
1. `colossalai.builder.build_pipeline_model_from_cfg`
2. `colossalai.builder.build_pipeline_model`
3. 自己按阶段拆分模型
当你的内存能够容纳模型时,你可以使用前两种方法来建立你的模型,否则你必须自己分割模型。前两种方法首先在 CPU 上建立整个模型,然后分割模型,最后你可以直接把模型的相应部分移到 GPU 上。
`colossalai.builder.build_pipeline_model_from_cfg()` 接收一个模型的配置文件,它可以均匀地(按层)或平衡地(按参数大小)分割模型。
如果你熟悉 `PyTorch`, 你可以使用 `colossalai.builder.build_pipeline_model()` 它接收一个 `torch.nn.Sequential` 模型并按层均匀分割。
在本教程中,我们将把 [TIMM/ViT](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py) 修改为 `torch.nn.Sequential`,然后使用 `colossalai.builder.build_pipeline_model()` 来建立流水线模型。
当数据是 **一个** `Tensor`, 你可以使用你的模型 `forward()` 中的位置参数来获得数据张量。对于流水线的第一阶段,`forward()` 的第一个位置参数是从数据加载器加载的数据张量。对于其他阶段,`forward()` 的第一个位置参数是上一阶段的输出张量。注意,如果该阶段不是最后一个阶段,则 `forward()` 的返回必须是一个 `Tensor`
当数据是一个 `Tensor``dict`, 你可以使用你模型 `forward()` 的命名关键字参数来获得数据的 `dict`
```python
class ViTEmbedding(nn.Module):
def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, embed_layer=vit.PatchEmbed, drop_rate=0., distilled=False):
super().__init__()
self.embed_dim = embed_dim # num_features for consistency with other models
self.num_tokens = 2 if distilled else 1
self.patch_embed = embed_layer(
img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim)
num_patches = self.patch_embed.num_patches
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim)) if distilled else None
self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + self.num_tokens, embed_dim))
self.pos_drop = nn.Dropout(p=drop_rate)
self.init_weights()
def forward(self, x):
x = self.patch_embed(x)
cls_token = self.cls_token.expand(x.shape[0], -1, -1) # stole cls_tokens impl from Phil Wang, thanks
if self.dist_token is None:
x = torch.cat((cls_token, x), dim=1)
else:
x = torch.cat((cls_token, self.dist_token.expand(x.shape[0], -1, -1), x), dim=1)
x = self.pos_drop(x + self.pos_embed)
return x
def init_weights(self):
vit.trunc_normal_(self.pos_embed, std=.02)
if self.dist_token is not None:
vit.trunc_normal_(self.dist_token, std=.02)
vit.trunc_normal_(self.cls_token, std=.02)
self.apply(vit._init_vit_weights)
class ViTHead(nn.Module):
def __init__(self, embed_dim=768, num_classes=1000, norm_layer=None, distilled=False, representation_size=None):
super().__init__()
norm_layer = norm_layer or partial(nn.LayerNorm, eps=1e-6)
self.norm = norm_layer(embed_dim)
self.num_classes = num_classes
self.distilled = distilled
self.num_features = embed_dim
# Representation layer
if representation_size and not distilled:
self.num_features = representation_size
self.pre_logits = nn.Sequential(OrderedDict([
('fc', nn.Linear(embed_dim, representation_size)),
('act', nn.Tanh())
]))
else:
self.pre_logits = nn.Identity()
# Classifier head(s)
self.head = nn.Linear(self.num_features, num_classes) if num_classes > 0 else nn.Identity()
self.head_dist = None
if distilled:
self.head_dist = nn.Linear(embed_dim, num_classes) if num_classes > 0 else nn.Identity()
self.init_weights()
def forward(self, x):
x = self.norm(x)
if self.distilled:
x, x_dist = self.head(x[:, 0]), self.head_dist(x[:, 1])
if self.training and not torch.jit.is_scripting():
# during inference, return the average of both classifier predictions
return x, x_dist
else:
return (x + x_dist) / 2
else:
x = self.pre_logits(x[:, 0])
x = self.head(x)
return x
def init_weights(self):
self.apply(vit._init_vit_weights)
def sequential_vit(img_size=224, patch_size=16, in_chans=3, num_classes=1000, embed_dim=768, depth=12,
num_heads=12, mlp_ratio=4., qkv_bias=True, representation_size=None, distilled=False,
drop_rate=0., attn_drop_rate=0., drop_path_rate=0., embed_layer=vit.PatchEmbed, norm_layer=None,
act_layer=None):
norm_layer = norm_layer or partial(nn.LayerNorm, eps=1e-6)
act_layer = act_layer or nn.GELU
embedding = ViTEmbedding(img_size=img_size, patch_size=patch_size, in_chans=in_chans,
embed_dim=embed_dim, embed_layer=embed_layer, drop_rate=drop_rate, distilled=distilled)
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # stochastic depth decay rule
blocks = [vit.Block(
dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, drop=drop_rate,
attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer, act_layer=act_layer)
for i in range(depth)]
for block in blocks:
block.apply(vit._init_vit_weights)
head = ViTHead(embed_dim=embed_dim, num_classes=num_classes, norm_layer=norm_layer,
distilled=distilled, representation_size=representation_size)
return nn.Sequential(embedding, *blocks, head)
def vit_large_patch16_224(**kwargs):
model_kwargs = dict(embed_dim=1024, depth=24, num_heads=16, **kwargs)
return sequential_vit(**model_kwargs)
```
## 处理数据集
一般来说,我们在大型数据集如 ImageNet 上训练 ViT。为了简单起见我们在这里只使用 CIFAR-10因为本教程只是用于流水并行训练。
```python
def build_cifar(batch_size):
transform_train = transforms.Compose([
transforms.RandomCrop(224, pad_if_needed=True),
transforms.AutoAugment(policy=transforms.AutoAugmentPolicy.CIFAR10),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
transform_test = transforms.Compose([
transforms.Resize(224),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
train_dataset = CIFAR10(root=os.environ['DATA'], train=True, download=True, transform=transform_train)
test_dataset = CIFAR10(root=os.environ['DATA'], train=False, transform=transform_test)
train_dataloader = get_dataloader(dataset=train_dataset, shuffle=True, batch_size=batch_size, pin_memory=True)
test_dataloader = get_dataloader(dataset=test_dataset, batch_size=batch_size, pin_memory=True)
return train_dataloader, test_dataloader
```
## 使用流水并行训练 ViT
你可以在配置文件中设置流水并行的大小。`NUM_CHUNKS` 在使用交错流水线时很有用 (更多细节见 [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) )。
原始 batch 将会被分割为 `num_microbatches`, 每个阶段每次将加载一个 micro batch。如果你确定性地知道每个阶段输出张量的形状你可以在配置文件中设置 `tensor_shape` 来减少通信。
我们的仓库会自动为用户生成合适的schedule来支持流水并行训练。如果你不需要模型的输出和标签你可以在调用 `trainer.fit()` 时,将 `return_output_label` 设置为 `False`,这样能进一步减少 GPU 显存使用。
你应当使用 `export DATA=/path/to/cifar`
```python
BATCH_SIZE = 16
NUM_EPOCHS = 60
NUM_CHUNKS = 1
CONFIG = dict(NUM_MICRO_BATCHES=4, parallel=dict(pipeline=2))
def train():
disable_existing_loggers()
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch_from_torch(backend=args.backend, config=CONFIG)
logger = get_dist_logger()
# build model
model = vit_large_patch16_224()
model = build_pipeline_model(model, num_chunks=NUM_CHUNKS, verbose=True)
# build criterion
criterion = nn.CrossEntropyLoss()
# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0)
# build dataloader
train_dataloader, test_dataloader = build_cifar(BATCH_SIZE)
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model, optimizer, criterion,
train_dataloader, test_dataloader)
timer = MultiTimer()
trainer = Trainer(engine=engine, timer=timer, logger=logger)
hook_list = [
hooks.LossHook(),
hooks.AccuracyHook(col_nn.metric.Accuracy()),
hooks.LogMetricByEpochHook(logger),
]
trainer.fit(train_dataloader=train_dataloader,
epochs=NUM_EPOCHS,
test_dataloader=test_dataloader,
test_interval=1,
hooks=hook_list,
display_progress=True)
```

View File

@ -0,0 +1,591 @@
# 使用 Colossal-AI (从数据并行到异构并行)加速 ViT 训练详解
作者Yuxuan Lou
**示例代码**
- [Colossal-AI Examples ViT on Cifar10](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer)
**相关文献**
- [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929.pdf)
## 引言
在这个ViT模型的样例中Colossal-AI 提供了三种不同的并行技术来加速模型训练:数据并行,流水线并行和张量并行。我们将展示如何使用这三种并行技术在 CIFAR-10 数据集上训练 ViT。为了运行项目需要2-4个 GPU。
## 目录
1. Colossal-AI 安装方法
2. 使用数据并行训练 ViT 步骤
3. 使用数据流水线并行训练 ViT 步骤
4. 使用张量并行或异构并行训练 ViT 步骤
## Colossal-AI 安装
可以通过 Python 的官方索引来安装 Colossal-AI 软件包。
```bash
pip install colossalai
```
## 数据并行
数据并行是实现加速模型训练的基本方法。通过两步可以实现训练的数据并行:
1. 构建一个配置文件
2. 在训练脚本中修改很少的几行代码
### 构建配置文件 (`data_parallel/config.py`)
为了使用 Colossal-AI第一步是构建配置文件。并且在这里有两种变量
1. **Colossal-AI 功能配置**
Colossal-AI 提供了一系列的功能来加快训练速度(包括模型并行,混合精度,零冗余优化器等)。每个功能都是由配置文件中的相应字段定义的。如果我们只用到数据并行,那么我们只需要具体说明并行模式。在本例中,我们使用 PyTorch 最初提出的混合精度训练,只需要定义混合精度配置 `fp16 = dict(mode=AMP_TYPE.TORCH)`
2. **全局超参数**
全局超参数包括特定于模型的超参数、训练设置、数据集信息等。
```python
from colossalai.amp import AMP_TYPE
# ViT Base
BATCH_SIZE = 256
DROP_RATE = 0.1
NUM_EPOCHS = 300
# mix precision
fp16 = dict(
mode=AMP_TYPE.TORCH,
)
gradient_accumulation = 16
clip_grad_norm = 1.0
dali = dict(
gpu_aug=True,
mixup_alpha=0.2
)
```
### 修改训练脚本 (`/data_parallel/train_with_cifar10.py`)
#### 导入模块
- Colossal-AI 相关模块
```python
import colossalai
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.logging import disable_existing_loggers, get_dist_logger
from colossalai.nn.lr_scheduler import LinearWarmupLR
from colossalai.nn.metric import Accuracy
from colossalai.trainer import Trainer, hooks
```
- 其他模块
```python
import os
import torch
from timm.models import vit_base_patch16_224
from torchvision import transforms
from torchvision.datasets import CIFAR10
```
#### 启动 Colossal-AI
在训练脚本中,构建好配置文件后,需要为 Colossal-AI 初始化分布式环境。我们将此过程称为 `launch`。在 Colossal-AI 中,我们提供了几种启动方法来初始化分布式后端。在大多数情况下,您可以使用 `colossalai.launch` 和 `colossalai.get_default_parser` 来通过命令行传递参数。此外Colossal-AI 还可以利用 PyTorch 提供的现有启动工具,例如许多用户熟知的 `colossalai.launch_from_torch`。更多详细信息,您可以查看相关[文档](https://www.colossalai.org/docs/basics/launch_colossalai)。
```python
# initialize distributed setting
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch_from_torch(config=args.config)
disable_existing_loggers()
logger = get_dist_logger()
```
初始化后,您可以使用 `colossalai.core.global_context` 访问配置文件中的变量。
```python
#access parameters
print(gpc.config.BATCH_SIZE)
```
#### 构建模型
如果只需要数据并行性,则无需对模型代码进行任何更改。这里,我们使用 `timm` 中的 `vit_base_patch16_224`
```python
# build model
model = vit_base_patch16_224(drop_rate=0.1, num_classes=gpc.config.NUM_CLASSES)
```
#### 构建 CIFAR-10 数据加载器
`colossalai.utils.get_dataloader` 可以帮助您轻松构建数据加载器。
```python
def build_cifar(batch_size):
transform_train = transforms.Compose([
transforms.RandomCrop(224, pad_if_needed=True),
transforms.AutoAugment(policy=transforms.AutoAugmentPolicy.CIFAR10),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
transform_test = transforms.Compose([
transforms.Resize(224),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
train_dataset = CIFAR10(root=os.environ['DATA'], train=True, download=True, transform=transform_train)
test_dataset = CIFAR10(root=os.environ['DATA'], train=False, transform=transform_test)
train_dataloader = get_dataloader(dataset=train_dataset, shuffle=True, batch_size=batch_size, pin_memory=True)
test_dataloader = get_dataloader(dataset=test_dataset, batch_size=batch_size, pin_memory=True)
return train_dataloader, test_dataloader
# build dataloader
train_dataloader, test_dataloader = build_cifar(gpc.config.BATCH_SIZE)
```
#### 定义优化器,损失函数和学习率调度器
Colossal-AI 提供了自己的优化器、损失函数和学习率调度器。PyTorch 的这些组件与Colossal-AI也兼容。
```python
# build optimizer
optimizer = colossalai.nn.Lamb(model.parameters(), lr=1.8e-2, weight_decay=0.1)
# build loss
criterion = torch.nn.CrossEntropyLoss()
# lr_scheduler
lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=gpc.config.NUM_EPOCHS)
```
#### 启动用于训练的 Colossal-AI 引擎
Engine 本质上是对模型、优化器和损失函数的封装类。当我们使用 `colossalai.initialize` ,将返回一个 engine 对象,并且它已经按照配置文件中的指定内容,配置了梯度剪裁、梯度累积和零冗余优化器等功能。之后,基于 Colossal-AI 的 engine 我们可以进行模型训练。
```python
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(
model, optimizer, criterion, train_dataloader, test_dataloader
)
```
#### 训练Trainer 应用程序编程接口
Trainer 是一个更高级的封装类,用户可以用更少的代码就可以实现训练。通过传递 engine 对象很容易创建 trainer 对象。
此外,在 trainer 中,用户可以自定义一些钩子,并将这些钩子连接到 trainer 对象。钩子对象将根据训练方案定期执行生命周期方法。例如,`LRSchedulerHook` 将在 `after_train_iter` 或 `after_train_epoch` 阶段执行 `lr_scheduler.step()` 来更新模型的学习率。
```python
# build trainer
trainer = Trainer(engine=engine, logger=logger)
# build hooks
hook_list = [
hooks.LossHook(),
hooks.AccuracyHook(accuracy_func=MixupAccuracy()),
hooks.LogMetricByEpochHook(logger),
hooks.LRSchedulerHook(lr_scheduler, by_epoch=True),
# comment if you do not need to use the hooks below
hooks.SaveCheckpointHook(interval=1, checkpoint_dir='./ckpt'),
hooks.TensorboardHook(log_dir='./tb_logs', ranks=[0]),
]
```
使用 `trainer.fit` 进行训练:
```python
# start training
trainer.fit(
train_dataloader=train_dataloader,
test_dataloader=test_dataloader,
epochs=gpc.config.NUM_EPOCHS,
hooks=hook_list,
display_progress=True,
test_interval=1
)
```
### 开始训练
`DATA` 是自动下载和存储 CIFAR-10 数据集的文件路径。
`<NUM_GPUs>` 是以数据并行方式、使用 CIFAR-10 数据集训练 ViT 时所要使用的 GPU 数。
```bash
export DATA=<path_to_data>
# If your torch >= 1.10.0
torchrun --standalone --nproc_per_node <NUM_GPUs> train_dp.py --config ./configs/config_data_parallel.py
# If your torch >= 1.9.0
# python -m torch.distributed.run --standalone --nproc_per_node=<NUM_GPUs> train_dp.py --config ./configs/config_data_parallel.py
# Otherwise
# python -m torch.distributed.launch --nproc_per_node <NUM_GPUs> --master_addr <node_name> --master_port 29500 train_dp.py --config ./configs/config.py
```
## 流水线并行
除了数据并行性Colossal-AI 还支持流水线并行。具体而言Colossal-AI 使用 NVIDIA 引入的 1F1B 流水线。更多详细信息,您可以查看相关[文档](https://www.colossalai.org/tutorials/features/pipeline_parallel)。
### 构建配置文件(`hybrid_parallel/configs/vit_pipeline.py`)
要在数据并行的基础上应用流水线并行,只需添加一个 **parallel dict**
```python
from colossalai.amp import AMP_TYPE
parallel = dict(
pipeline=2
)
# pipeline config
NUM_MICRO_BATCHES = parallel['pipeline']
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LENGTH, HIDDEN_SIZE)
fp16 = dict(mode=AMP_TYPE.NAIVE)
clip_grad_norm = 1.0
```
其他配置:
```python
# hyperparameters
# BATCH_SIZE is as per GPU
# global batch size = BATCH_SIZE x data parallel size
BATCH_SIZE = 256
LEARNING_RATE = 3e-3
WEIGHT_DECAY = 0.3
NUM_EPOCHS = 300
WARMUP_EPOCHS = 32
# model config
IMG_SIZE = 224
PATCH_SIZE = 16
HIDDEN_SIZE = 768
DEPTH = 12
NUM_HEADS = 12
MLP_RATIO = 4
NUM_CLASSES = 10
CHECKPOINT = True
SEQ_LENGTH = (IMG_SIZE // PATCH_SIZE) ** 2 + 1 # add 1 for cls token
```
### 构建流水线模型 (`/hybrid_parallel/model/vit.py`)
Colossal-AI 提供了两种从现有模型构建流水线模型的方法。
- `colossalai.builder.build_pipeline_model_from_cfg`
- `colossalai.builder.build_pipeline_model`
此外,您还可以使用 Colossal-AI 从头开始构建流水线模型。
```python
import math
from typing import Callable
import inspect
import torch
from colossalai import nn as col_nn
from colossalai.registry import LAYERS, MODELS
from colossalai.logging import get_dist_logger
from colossalai.core import global_context as gpc
from colossalai.context import ParallelMode
from colossalai.builder.pipeline import partition_uniform
from torch import dtype, nn
from model_zoo.vit.vit import ViTBlock, ViTEmbedding, ViTHead
@MODELS.register_module
class PipelineVisionTransformer(nn.Module):
def __init__(self,
img_size: int = 224,
patch_size: int = 16,
in_chans: int = 3,
num_classes: int = 1000,
depth: int = 12,
num_heads: int = 12,
dim: int = 768,
mlp_ratio: int = 4,
attention_dropout: float = 0.,
dropout: float = 0.1,
drop_path: float = 0.,
layernorm_epsilon: float = 1e-6,
activation: Callable = nn.functional.gelu,
representation_size: int = None,
dtype: dtype = None,
bias: bool = True,
checkpoint: bool = False,
init_method: str = 'torch',
first_stage=True,
last_stage=True,
start_idx=None,
end_idx=None,):
super().__init__()
layers = []
if first_stage:
embed = ViTEmbedding(img_size=img_size,
patch_size=patch_size,
in_chans=in_chans,
embedding_dim=dim,
dropout=dropout,
dtype=dtype,
init_method=init_method)
layers.append(embed)
# stochastic depth decay rule
dpr = [x.item() for x in torch.linspace(0, drop_path, depth)]
if start_idx is None and end_idx is None:
start_idx = 0
end_idx = depth
blocks = [
ViTBlock(
dim=dim,
num_heads=num_heads,
mlp_ratio=mlp_ratio,
attention_dropout=attention_dropout,
dropout=dropout,
drop_path=dpr[i],
activation=activation,
dtype=dtype,
bias=bias,
checkpoint=checkpoint,
init_method=init_method,
) for i in range(start_idx, end_idx)
]
layers.extend(blocks)
if last_stage:
norm = col_nn.LayerNorm(normalized_shape=dim, eps=layernorm_epsilon, dtype=dtype)
head = ViTHead(dim=dim,
num_classes=num_classes,
representation_size=representation_size,
dtype=dtype,
bias=bias,
init_method=init_method)
layers.extend([norm, head])
self.layers = nn.Sequential(
*layers
)
def forward(self, x):
x = self.layers(x)
return x
def _filter_kwargs(func, kwargs):
sig = inspect.signature(func)
return {k: v for k, v in kwargs.items() if k in sig.parameters}
def _build_pipeline_vit(module_cls, num_layers, num_chunks, device=torch.device('cuda'), **kwargs):
logger = get_dist_logger()
if gpc.is_initialized(ParallelMode.PIPELINE):
pipeline_size = gpc.get_world_size(ParallelMode.PIPELINE)
pipeline_rank = gpc.get_local_rank(ParallelMode.PIPELINE)
else:
pipeline_size = 1
pipeline_rank = 0
rank = gpc.get_global_rank()
parts = partition_uniform(num_layers, pipeline_size, num_chunks)[pipeline_rank]
models = []
for start, end in parts:
kwargs['first_stage'] = start == 0
kwargs['last_stage'] = end == num_layers
kwargs['start_idx'] = start
kwargs['end_idx'] = end
logger.info(f'Rank{rank} build layer {start}-{end}, {end-start}/{num_layers} layers')
chunk = module_cls(**_filter_kwargs(module_cls.__init__, kwargs)).to(device)
models.append(chunk)
if len(models) == 1:
model = models[0]
else:
model = nn.ModuleList(models)
return model
def build_pipeline_vit(num_layers, num_chunks, device=torch.device('cuda'), **kwargs):
return _build_pipeline_vit(PipelineVisionTransformer, num_layers, num_chunks, device, **kwargs)
```
### 修改训练脚本 (`/hybrid_parallel/train_with_cifar10.py`)
#### 导入模块
```python
from colossalai.engine.schedule import (InterleavedPipelineSchedule,
PipelineSchedule)
from colossalai.utils import MultiTimer
import os
import colossalai
import torch
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.nn import CrossEntropyLoss
from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR
from colossalai.utils import is_using_pp, get_dataloader
from model.vit import build_pipeline_vit
from model_zoo.vit.vit import _create_vit_model
from tqdm import tqdm
from torchvision import transforms
from torchvision.datasets import CIFAR10
```
#### 启动 Colossal-AI
`colossalai.utils.is_using_pp` 可以帮您检查配置文件是否满足流水线并行的要求。
```python
# initialize distributed setting
parser = colossalai.get_default_parser()
args = parser.parse_args()
# launch from torch
colossalai.launch_from_torch(config=args.config)
# get logger
logger = get_dist_logger()
logger.info("initialized distributed environment", ranks=[0])
if hasattr(gpc.config, 'LOG_PATH'):
if gpc.get_global_rank() == 0:
log_path = gpc.config.LOG_PATH
if not os.path.exists(log_path):
os.mkdir(log_path)
logger.log_to_file(log_path)
use_pipeline = is_using_pp()
```
#### 定义模型
```python
# create model
model_kwargs = dict(img_size=gpc.config.IMG_SIZE,
patch_size=gpc.config.PATCH_SIZE,
dim=gpc.config.HIDDEN_SIZE,
depth=gpc.config.DEPTH,
num_heads=gpc.config.NUM_HEADS,
mlp_ratio=gpc.config.MLP_RATIO,
num_classes=gpc.config.NUM_CLASSES,
init_method='jax',
checkpoint=gpc.config.CHECKPOINT)
if use_pipeline:
model = build_pipeline_vit(num_layers=model_kwargs['depth'], num_chunks=1, **model_kwargs)
else:
model = _create_vit_model(**model_kwargs)
```
#### 计算参数个数
您可以轻松计算不同流水线阶段上的模型参数个数。
```
# count number of parameters
total_numel = 0
for p in model.parameters():
total_numel += p.numel()
if not gpc.is_initialized(ParallelMode.PIPELINE):
pipeline_stage = 0
else:
pipeline_stage = gpc.get_local_rank(ParallelMode.PIPELINE)
logger.info(f"number of parameters: {total_numel} on pipeline stage {pipeline_stage}")
```
#### 构建数据加载器,优化器等组件
```python
def build_cifar(batch_size):
transform_train = transforms.Compose([
transforms.RandomCrop(224, pad_if_needed=True),
transforms.AutoAugment(policy=transforms.AutoAugmentPolicy.CIFAR10),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
transform_test = transforms.Compose([
transforms.Resize(224),
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
train_dataset = CIFAR10(root=os.environ['DATA'], train=True, download=True, transform=transform_train)
test_dataset = CIFAR10(root=os.environ['DATA'], train=False, transform=transform_test)
train_dataloader = get_dataloader(dataset=train_dataset, shuffle=True, batch_size=batch_size, pin_memory=True)
test_dataloader = get_dataloader(dataset=test_dataset, batch_size=batch_size, pin_memory=True)
return train_dataloader, test_dataloader
# create dataloaders
train_dataloader, test_dataloader = build_cifar(gpc.config.BATCH_SIZE)
# create loss function
criterion = CrossEntropyLoss(label_smoothing=0.1)
# create optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=gpc.config.LEARNING_RATE, weight_decay=gpc.config.WEIGHT_DECAY)
# create lr scheduler
lr_scheduler = CosineAnnealingWarmupLR(optimizer=optimizer,
total_steps=gpc.config.NUM_EPOCHS,
warmup_steps=gpc.config.WARMUP_EPOCHS)
```
#### 启动 Colossal-AI 引擎
```python
# initialize
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model=model,
optimizer=optimizer,
criterion=criterion,
train_dataloader=train_dataloader,
test_dataloader=test_dataloader)
logger.info("Engine is built", ranks=[0])
```
#### 基于 engine 的训练
在数据并行示例中,我们展示了如何使用 Trainer API 训练模型。我们也可以直接基于 engine 训练模型。通过这种方式,您可以使用更多功能来自定义训练方法。
```python
data_iter = iter(train_dataloader)
for epoch in range(gpc.config.NUM_EPOCHS):
# training
engine.train()
if gpc.get_global_rank() == 0:
description = 'Epoch {} / {}'.format(
epoch,
gpc.config.NUM_EPOCHS
)
progress = tqdm(range(len(train_dataloader)), desc=description)
else:
progress = range(len(train_dataloader))
for _ in progress:
engine.zero_grad()
engine.execute_schedule(data_iter, return_output_label=False)
engine.step()
lr_scheduler.step()
```
### 开始训练
```bash
export DATA=<path_to_dataset>
# If your torch >= 1.10.0
torchrun --standalone --nproc_per_node <NUM_GPUs> train_hybrid.py --config ./configs/config_pipeline_parallel.py
# If your torch >= 1.9.0
# python -m torch.distributed.run --standalone --nproc_per_node=<NUM_GPUs> train_hybrid.py --config ./configs/config_pipeline_parallel.py
```
## 张量并行和异构并行
张量并行将每个权重参数跨多个设备进行分区以减少内存负载。Colossal-AI 支持 1D、2D、2.5D 和 3D 张量并行。此外还可以将张量并行、流水线并行和数据并行结合起来实现混合并行。Colossal-AI 还提供了一种简单的方法来应用张量并行和混合并行。只需在配置文件中更改几行代码即可实现流水线并行。
### 构造您的配置文件 (`/hybrid_parallel/configs/vit_1d_tp2_pp2.py`)
使用张量并行,只需将相关信息添加到 **parallel dict**。具体而言,`TENSOR_PARALLEL_MODE` 可以是“1d”、“2d”、“2.5d”、“3d”。不同并行度的大小应满足`#GPUs = pipeline parallel size x tensor parallel size x data parallel size`。在指定 GPU 数量、流水线并行大小和张量并行大小后 `data parallel size` 会自动计算。
```python
from colossalai.amp import AMP_TYPE
# parallel setting
TENSOR_PARALLEL_SIZE = 2
TENSOR_PARALLEL_MODE = '1d'
parallel = dict(
pipeline=2,
tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE)
)
fp16 = dict(mode=AMP_TYPE.NAIVE)
clip_grad_norm = 1.0
# pipeline config
NUM_MICRO_BATCHES = parallel['pipeline']
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LENGTH, HIDDEN_SIZE)
```
其他配置:
```python
# hyperparameters
# BATCH_SIZE is as per GPU
# global batch size = BATCH_SIZE x data parallel size
BATCH_SIZE = 256
LEARNING_RATE = 3e-3
WEIGHT_DECAY = 0.3
NUM_EPOCHS = 300
WARMUP_EPOCHS = 32
# model config
IMG_SIZE = 224
PATCH_SIZE = 16
HIDDEN_SIZE = 768
DEPTH = 12
NUM_HEADS = 12
MLP_RATIO = 4
NUM_CLASSES = 10
CHECKPOINT = True
SEQ_LENGTH = (IMG_SIZE // PATCH_SIZE) ** 2 + 1 # add 1 for cls token
```
### 开始训练
```bash
export DATA=<path_to_dataset>
# If your torch >= 1.10.0
torchrun --standalone --nproc_per_node <NUM_GPUs> train_hybrid.py --config ./configs/config_hybrid_parallel.py
# If your torch >= 1.9.0
# python -m torch.distributed.run --standalone --nproc_per_node=<NUM_GPUs> train_hybrid.py --config ./configs/config_hybrid_parallel.py
```

View File

@ -0,0 +1,98 @@
# ColoTensor Concepts
Author: [Jiarui Fang](https://github.com/feifeibear), [Hongxin Liu](https://github.com/ver217) and [Haichen Huang](https://github.com/1SAA)
**Prerequisite:**
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
- [Distributed Training](../concepts/distributed_training.md)
- [Paradigms of Parallelism](../concepts/paradigms_of_parallelism.md)
## Introduction
在ColossalAI 0.1.8 版本之后,[ColoTensor](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.html#colossalai.tensor.ColoTensor) 成为 ColossalAI 中张量的基本数据结构。 它是 torch.Tensor 的子类,可以当做 PyTorch Tensor 使用。 此外一些独特的功能使其能够表示一个payload分布在多个 GPU 设备上的 Global Tensor并提供一系列方式来操作这个 Global Tensor。 在 ColoTensor 的帮助下,用户可以以类似编写串行程序的方式,编写分布式 DNN 训练程序。
ColoTensor 包含额外的属性[ColoTensorSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.tensor_spec.html#colossalai.tensor.tensor_spec.ColoTensorSpec)
来描述张量的payload分布和计算模式。
- ProcessGroup如何将进程组织为通信组。
- Distributed Spec张量如何在进程组之间分布。
- Compute Spec计算过程中如何使用张量。
我们一一详述。
## ProcessGroup
[ProcessGroup](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.html#colossalai.tensor.ProcessGroup) 类的一个实例描述了如何在进程组中组织进程。进程组内的进程可以一起参与同一个集合通信比如allgather, allreduce等。进程组组织方式被张量的并行策略支配。比如如果用户定义了Tensor的张量并行TP数据并行DP方式那么进程组的进程组织方式将被自动推导出来。 进程组设置可能因不同的张量而异。 因此,它使我们能够支持更复杂的混合并行。流水线并行(PP)定义不在ProcessGroup中描述它需要另一套机制我们将在未来补充ColoTensor应用于PP的相关内容。
目前ColoTensor 的一个进程组由 tp_degree 和 dp_degree 两种配置定义。 在 DP+TP 混合并行的情况下,可以将设备视为 2D 网格。 我们将 TP 通信组放置在设备网格的前导低维上,然后将数据并行组放置在设备网格的高维上。 原因是张量并行比数据并行具有更大的通信开销。 相邻设备放置在一个 TP 进程组内,并且通常放置在同一个节点中。
考虑8个进程配置为tp_degree=4dp_degree=2的情况其布局如下图所示进程组的构造方式可参考图后的代码示意。 进程组 tp0 包含 gpu 0123。 进程组 dp1 包含 gpu 1 和 5。
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/ColoTensor_layout_demo.PNG"/>
<figcaption>Process Group using tp_degree=4, dp_degree=2</figcaption>
</figure>
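对应的进程组可以按如下方式构造(示意;需要在 8 个进程的分布式环境中运行colossalai.launch 等初始化步骤从略):
```python
from colossalai.tensor import ProcessGroup

# 示意:在 8 个进程的分布式环境中构造 tp_degree=4、dp_degree=2 的进程组
# colossalai.launch 等分布式初始化步骤从略)
pg = ProcessGroup(tp_degree=4, dp_degree=2)
print(pg.tp_world_size())  # 期望输出 4
```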
## Distributed Spec
[Distributed Spec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.distspec.html)描述了 ColoTensor 如何在 ProcessGroup 中分布。
张量在 DP 进程组之间的分布方式是自动导出的,不需要用户手动指定。 如果这个张量是一个模型参数,它会在 DP 进程组中被复制。 如果是activation张量则沿tensor最高维度在DP进程组中进行平均分割。
因此,在使用 Distributed Spec 时,我们只需要描述张量在 TP 进程组之间的分布方式即可。 TP 进程组目前有两种分布式规范,即 [ShardSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.distspec.html#colossalai.tensor.distspec.ShardSpec)和[ReplicaSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.distspec.html#colossalai.tensor.distspec.ReplicaSpec)。 ShardSpec 需要指定分区的维度索引 dim 和分区个数 num_partitions。 目前我们仅支持在单个dim上进行拆分。 TP进程组上不同的dist spec可以通过set_dist_spec()接口相互转换。这些转化操作可以被记录在PyTorch的自动求导机制中并在反向传播时候触发对应的反向操作。
## Compute Spec
[ComputeSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.compute_spec.html#colossalai.tensor.compute_spec.ComputeSpec)类描述Tensor如何参与计算。目前我们会为作为module parameter的ColoTensor设置正确的Compute Pattern从而触发正确的计算模式。具体应用方式我们会在接下来的文档中展示。
## ColoParameter
[ColoParameter](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.colo_parameter.html#colossalai.tensor.colo_parameter.ColoParameter)是ColoTensor的子类用来声明Parameter。它与ColoTensor的关系和torch.nn.Parameter与torch.Tensor的关系一致后者可以让tensor出现在module的parameters()和named_parameters()的返回值中。
## Example
让我们看一个例子。下面的代码使用 tp_degree=2, dp_degree=2 在 4 个 GPU 上初始化并Shard一个ColoTensor。 然后tensor被沿着 TP 进程组中的最后一个维度进行分片。 最后,我们沿着 TP 进程组中的第一个维度dim 0对其进行重新Shard。 我们鼓励用户运行代码并观察每个张量的形状。
```python
import torch
import torch.multiprocessing as mp
from colossalai.utils import free_port, print_rank_0
from functools import partial
import colossalai
from colossalai.tensor import ProcessGroup, ColoTensor, ColoTensorSpec, ShardSpec, ComputeSpec, ComputePattern
from colossalai.utils import free_port
import torch
def run_dist_tests(rank, world_size, port):
colossalai.launch(config={}, rank=rank, world_size=world_size, host='localhost', port=port, backend='nccl')
pg = ProcessGroup(tp_degree=2, dp_degree=2)
torch.manual_seed(0)
local_tensor = torch.randn(2, 3, 1).cuda()
print_rank_0(f"shape {local_tensor.shape}, {local_tensor.data}")
spec = ColoTensorSpec(pg, ShardSpec(dims=[-1], num_partitions=[pg.tp_world_size()]), ComputeSpec(ComputePattern.TP1D))
t1 = ColoTensor.from_torch_tensor(local_tensor, spec)
t1 = t1.to_replicate()
print_rank_0(f"shape {t1.shape}, {t1.data}")
spec2 = ShardSpec([0], [pg.tp_world_size()])
t1.set_dist_spec(spec2)
print_rank_0(f"shape {t1.shape}, {t1.data}")
def test_dist_cases(world_size):
run_func = partial(run_dist_tests, world_size=world_size, port=free_port())
mp.spawn(run_func, nprocs=world_size)
if __name__ == '__main__':
test_dist_cases(4)
```
:::caution
The ColoTensor is an experimental feature and may be updated.
:::

View File

@ -0,0 +1,47 @@
# 命令行工具
作者: Shenggui Li
**预备知识:**
- [Distributed Training](../concepts/distributed_training.md)
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
## 简介
Colossal-AI给用户提供了命令行工具目前命令行工具可以用来支持以下功能。
- 检查Colossal-AI是否安装正确
- 启动分布式训练
- 张量并行基准测试
## 安装检查
用户可以使用`colossalai check -i`这个命令来检查目前环境里的版本兼容性以及CUDA Extension的状态。
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/05/04/KJmcVknyPHpBofa.png"/>
<figcaption>Check Installation Demo</figcaption>
</figure>
## 启动分布式训练
在分布式训练时,我们可以使用`colossalai run`来启动单节点或者多节点的多进程,详细的内容可以参考[启动 Colossal-AI](./launch_colossalai.md)。
## 张量并行基准测试
Colossal-AI提供了多种张量并行想要充分理解这些方法需要一定的学习成本对于新手来说很难靠经验选择一个并行方式。
所以我们提供了一个简单的基准测试能够让用户在自己的机器上测试不同张量并行的性能。这个基准测试跑一个并行的MLP模型
输入数据的维度为`(批大小,序列长度,隐藏层维度)`。通过指定GPU的数量Colossal-AI会搜索所有可行的并行配置。用户可以通过查看`colossalai benchmark --help`来自定义相关的测试参数。
```shell
# 使用4个GPU
colossalai benchmark --gpus 4
# 使用8个GPU
colossalai benchmark --gpus 8
```
:::caution
目前仅支持单节点的基准测试。
:::

View File

@ -0,0 +1,136 @@
# 并行配置
作者: Shenggui Li, Siqi Mai
**预备知识:**
- [分布式训练](../concepts/distributed_training.md)
- [并行技术](../concepts/paradigms_of_parallelism.md)
- [构建配置文件](./define_your_config.md)
## 简介
我们在 Colossal-AI 中支持多种并行技术。代码库中的混合并行是指您可以轻松地结合数据并行、流水线并行和张量并行1D、2D、2.5D、3D的优势共同来进行并行训练。
每种并行方式需要不同的网络拓扑结构,因此要初始化不同的进程组。您可以通过在配置文件中设置 `parallel` 来初始化相应的进程组。 `parallel` 的配置必须遵从以下格式。数据并行度的大小将被根据您对流水线并行和张量并行的输入自动推断。`colossalai.launch` 将根据您的配置自动初始化这些分布式进程组。
我们为您提供了一些配置的例子以供参考。
```python
# sampler format
parallel = dict(
pipeline=dict("size": int),
tensor=dict("size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any)
)
# this is ok
parallel = dict(
pipeline=dict(size=2),
tensor=dict(size=4, mode='2d')
)
# this is ok
parallel = dict(
pipeline=2,
tensor=dict(size=4, mode='2d')
)
# this is not ok
# as you need to specify the mode for tensor parallelism
parallel = dict(
pipeline=2,
tensor=4
)
# this is ok as well as tensor will be default to size 1
# and mode None
parallel = dict(
pipeline=2
)
# this is ok as well as pipeline will default to size 1
parallel = dict(
tensor=dict(size=4, mode='2d')
)
```
关键字 `size` 指的是并行维度的并行大小。例如流水线大小为2意味着将有2个流水线阶段。张量并行配置中的关键字 `mode` 意味着相应的张量并行技术将被初始化如1D、2D、2.5D、3D。
**您也可以选择不在您的配置中使用 `parallel`此时流水线和张量的并行度都将默认为大小1。**
**GPU的总数量必须等于` 数据并行大小 x 张量并行大小 x 流水线并行大小` 。**
## 数据并行
数据并行是最常见的分布式训练方式。它将数据分割成几个碎片分别在每个设备上进行训练。数据并行的配置会自动检测并为您设置。您不需要在您的配置中明确地设置它们。在Colossal-AI 中,有两种方法来处理数据并行的 all-reduce。
1. 如果您设置了梯度handler梯度handler将会all-reduce梯度。
2. 若没有指定相应的配置Colossal-AI 将会使用 PyTorch 的 DistributedDataParallel。
在大多数情况下,若您对梯度没有复杂的处理的需求,您将会使用第二种模式。
## 1D, 2D, 2.5D 和 3D 并行
为了实现混合并行,我们提供了一系列张量并行方法。您可以阅读相应的学术论文进行深入的了解。这些并行模式需要和 Colossal-AI 提供的分布式层一同工作。
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
2D 并行基于 SUMMA 矩阵乘法,它将输入数据、模型权重和层输出切分成两个不同的维度。 这些张量块分布在 `P = N^2` 设备的二维网格上,其中 `N` 是单一维度上张量块的数量。
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
在 2.5D 矩阵乘法的启发下2.5D 并行引入了一种新的张量并行进一步将2D张量并行化。其中`P = N^2 d` 个处理器被分配到 `d` 层, 每层独立进行矩阵乘法运算,维度为 `N`
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
我们还介绍了一种 3D 张量并行方法,在三维处理器立方体上并行化神经网络。这种方法在数量为 `P` 的处理器上实现了最佳的 `O(P^{1/3})` 通信开销,同时通过优化参数和 activations 的负载平衡,使计算和内存的使用都均匀分布。
```python
# 1D parallel
parallel = dict(
tensor=dict(size=4, mode='1d')
)
# 2D parallel
parallel = dict(
tensor=dict(size=4, mode='2d')
)
# 2.5D parallel
parallel = dict(
tensor=dict(size=8, mode='2.5d', depth=2)
)
# 3D parallel
parallel = dict(
tensor=dict(size=8, mode='3d')
)
```
当您在配置中指定了张量并行模式,您就可以使用其相应的分布式算子。例如,若您设置模式为 `2d`,那么在模型构建中就能使用 `colossalai.nn.Linear2D` 了。
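例如,下面是一个使用 `colossalai.nn.Linear2D` 的极简示意。它假设分布式环境已经按 `mode='2d'`、`size=4` 初始化,层的输入输出维度只是示例值:
```python
from colossalai import nn as col_nn

# 示意:在已按 mode='2d'、size=4 初始化的分布式环境中,
# 可以直接用分布式线性层替换 torch.nn.Linear这里的维度只是示例值
layer = col_nn.Linear2D(in_features=1024, out_features=4096)
```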
## 流水线并行
流水线并行是将模型按层分成几个部分。例如,假设我们有一个简单的模型,它由两个线性层组成。我们有两个 GPU我们可以将第一个线性层分配给第一个 GPU 而第二层则分配给第二个 GPU。
您可以在您的配置文件中设置流水线并行度的大小。当流水线并行度大于1Colossal-AI 将会自动地创建流水线并行的 schedule这将会为您定义好模型训练的 `forward``backward`
```python
parallel = dict(
pipeline=dict(size=4), # number of pipeline stages
)
```
## 序列并行
针对处理大图片、视频、长文本、长时间医疗监控等数据的需要Colossal-AI 还提供了序列并行的方法。该方法是在论文[Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120)中提出的。您可以指定模式为 `sequence` 来初始化进程组。
```python
parallel = dict(
tensor=dict(size=4, mode='sequence')
)
```

View File

@ -0,0 +1,71 @@
# 构建配置文件
作者: Guangyang Lu, Shenggui Li, Siqi Mai
**预备知识:**
- [分布式训练](../concepts/distributed_training.md)
- [Colossal-AI 总览](../concepts/colossalai_overview.md)
## 简介
在 Colossal-AI 中,我们需要一个配置文件来指定系统在训练过程中要注入的特征。在本教程中,我们将向您介绍如何构建您的配置文件以及如何使用这个配置文件。使用配置文件有以下一些好处:
1. 您可以在不同的配置文件中存储您的特征配置和训练超参数。
2. 对于我们未来发布的新功能,您亦可以在配置中指定,而无需改变训练脚本的代码。
在本教程中,我们将向您介绍如何构建您的配置文件。
## 配置定义
在一个配置文件中,有两种类型的变量。一种是作为特征说明,另一种是作为超参数。所有与特征相关的变量都是保留关键字。例如,如果您想使用混合精度训练,需要在 config 文件中使用变量名`fp16`,并遵循预先定义的格式。
### 功能配置
Colossal-AI 提供了一系列的功能来加快训练速度。每个功能都是由配置文件中的相应字段定义的。在本教程中,我们不会给出所有功能的配置细节,而是提供一个如何指定一个功能的说明。**每个功能的细节可以在其各自的教程中找到。**
为了说明配置文件的使用,我们在这里使用混合精度训练作为例子。您需要遵循以下步骤。
1. 创建一个配置文件(例如 `config.py`,您可以指定任意的文件名)。
2. 在配置文件中定义混合精度的配置。例如,为了使用 PyTorch 提供的原始混合精度训练,您只需将下面这几行代码写入您的配置文件中。
```python
from colossalai.amp import AMP_TYPE
fp16 = dict(
mode=AMP_TYPE.TORCH
)
```
3. 当启动分布式环境时,向 Colossal-AI 指定您的配置文件的位置。比如下面的例子是配置文件在当前目录下。
```python
import colossalai
colossalai.launch(config='./config.py', ...)
```
这样Colossal-AI 便知道您想使用什么功能,并会在 `colossalai.initialize` 期间注入您所需要的功能。
### 全局超参数
除了功能的配置,您还可以在配置文件中定义训练的超参数。当您想进行多个实验时,这将会变得非常方便。每个实验的细节都可以放在独立的配置文件中,以避免混乱。这些参数将被存储在全局并行环境中,可以在训练脚本中访问。
例如,您可以在配置文件中指定批量大小。
```python
BATCH_SIZE = 32
```
启动后,您能够通过全局并行上下文访问您的超参数。
```python
import colossalai
from colossalai.core import global_context as gpc
colossalai.launch(config='./config.py', ...)
# access your parameter
print(gpc.config.BATCH_SIZE)
```


@ -0,0 +1,384 @@
# 如何在训练中使用 Engine 和 Trainer
作者: Shenggui Li, Siqi Mai
**预备知识:**
- [初始化功能](./initialize_features.md)
## 简介
在本教程中,您将学习如何使用 Colossal-AI 中提供的 Engine 和 Trainer 来训练您的模型。在深入研究细节之前,我们想先解释一下 Engine 和 Trainer 的概念。
### Engine
Engine 本质上是一个模型、优化器和损失函数的封装类。当我们调用 `colossalai.initialize` 时,一个 Engine 对象将被返回,并且配备了在您的配置文件中指定的梯度剪裁、梯度累计和 ZeRO 优化器等功能。
Engine 将使用与 PyTorch 训练组件类似的 API因此您只需对代码进行微小的修改即可。
下表展示了Engine的常用API。
| 组件 | 功能 | PyTorch | Colossal-AI |
| ------------------------------------- | --------------------------------------------- | ------------------------------- | -------------------------------------- |
| optimizer | 迭代前将所有梯度设置为零 | optimizer.zero_grad() | engine.zero_grad() |
| optimizer | 更新参数 | optimizer.step() | engine.step() |
| model | 进行一次前向计算 | outputs = model(inputs) | outputs = engine(inputs) |
| criterion | 计算loss值 | loss = criterion(output, label) | loss = engine.criterion(output, label) |
| criterion | 反向计算 | loss.backward() | engine.backward(loss) |
我们需要这样一个 Engine 类的原因是,我们可以在添加更多功能的同时,把具体实现隐藏在 `colossalai.initialize` 函数中。
假如我们要添加一个新的功能,我们可以在 `colossalai.initialize` 函数中完成对模型、优化器、数据加载器和损失函数的功能注入。不管中间的过程有多复杂,最终呈现给用户、用户需要使用的都只有一个 Engine 类,这将十分便捷。
用户只需要在最小范围内修改他们的代码,将普通的 PyTorch APIs 调整为 Colossal-AI
Engine 的 API。通过这种方式他们可以享受更多的功能来进行有效的训练。
以下是一个简单的例子:
```python
import colossalai
# build your model, optimizer, criterion, dataloaders
...
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
optimizer,
criterion,
train_dataloader,
test_dataloader)
for img, label in train_dataloader:
engine.zero_grad()
output = engine(img)
loss = engine.criterion(output, label)
engine.backward(loss)
engine.step()
```
### Trainer
Trainer 是一个更高级的封装器,用户可以用更少的代码行来执行训练。 由于 Trainer 的使用会更加简单,相较于 Engine它会缺少一点灵活性。 Trainer 被设计为进行前向和反向计算来进行模型权重的更新。通过传递 Engine 对象,我们可以很容易地创建一个 Trainer。
Trainer 的参数 `schedule` 默认值是 `None` 。在大多数情况下,除非我们想使用流水线并行,否则我们把这个值设为 `None`。如果您想探索更多关于这个参数的内容,您可以前往流水线并行的相关教程。
```python
from colossalai.logging import get_dist_logger
from colossalai.trainer import Trainer, hooks
# build components and initialize with colossalai.initialize
...
# create a logger so that trainer can log on the console
logger = get_dist_logger()
# create a trainer object
trainer = Trainer(
engine=engine,
logger=logger
)
```
在 Trainer 中,用户可以定制一些 hooks并将这些 hooks 附加到 Trainer 上。hook 将根据训练方案定期地执行生命周期函数。例如,基于用户是想在每次训练迭代后还是只在整个训练周期后更新学习率,
`LRSchedulerHook` 将会在 `after_train_iter` 或 `after_train_epoch` 阶段执行 `lr_scheduler.step()` 去为用户更新学习率。您可以将 hook 存储在一个列表中并将其传递给 `trainer.fit` 方法。`trainer.fit` 方法将根据您的参数执行训练和测试。如果 `display_progress` 为 True将在您的控制台显示一个进度条以显示训练的过程。
```python
# define the hooks to attach to the trainer
hook_list = [
hooks.LossHook(),
hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True),
hooks.AccuracyHook(accuracy_func=Accuracy()),
hooks.LogMetricByEpochHook(logger),
]
# start training
trainer.fit(
train_dataloader=train_dataloader,
epochs=NUM_EPOCHS,
test_dataloader=test_dataloader,
test_interval=1,
hooks=hook_list,
display_progress=True
)
```
如果您想定制您的 hook 类,您可以继承 `hooks.BaseHook` 并重写您想要的生命周期方法。下面提供了一个例子来演示如何创建一个简单的关于日志信息的 hook以供您参考。
```python
from colossalai.logging import get_dist_logger
from colossalai.trainer import hooks
class LogMessageHook(hooks.BaseHook):
def __init__(self, priority=10):
self._logger = get_dist_logger()
def before_train(self, trainer):
self._logger.info('training starts')
def after_train(self, trainer):
self._logger.info('training finished')
...
# then in your training script
hook_list.append(LogMessageHook())
```
在下面的章节中,您将会详细地了解到如何用 Engine 和 Trainer 来训练 ResNet 模型。
## ResNet
### 总览
在本节中,我们将介绍:
1. 使用一个 Engine 在 CIFAR10 数据集上训练 ResNet34 模型
2. 使用一个 Trainer 在 CIFAR10 数据集上训练 ResNet34 模型
项目结构如下:
```bash
-- config.py
-- run_resnet_cifar10_with_engine.py
-- run_resnet_cifar10_with_trainer.py
```
对于使用 Engine 或 Trainer步骤 1-4 是通用的。因此,步骤 1-4 加步骤 5 对应 `run_resnet_cifar10_with_engine.py`,而步骤 1-4 加步骤 6 则对应 `run_resnet_cifar10_with_trainer.py`。
### 牛刀小试
#### 步骤 1. 创建配置文件
在你的项目文件夹中,创建一个 `config.py`。这个文件是用来指定一些您可能想用来训练您的模型的特征。下面是一个配置文件的例子。
```python
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 128
NUM_EPOCHS = 200
fp16=dict(
mode=AMP_TYPE.TORCH
)
```
在这个配置文件中,我们指定要在每个 GPU 上使用批大小为128并运行200个 epoch。这两个参数是在 `gpc.config` 中体现的。例如,您可以使用 `gpc.config.BATCH_SIZE` 来访问您存储在配置文件中的批大小值。而 `fp16` 配置则会告诉 `colossalai.initialize` 使用 PyTorch 提供的混合精度训练,以更好的速度和更低的内存消耗来训练模型。
#### 步骤 2. 初始化分布式环境
我们需要初始化分布式训练环境。这在 [启动 Colossal-AI](./launch_colossalai.md) 中有相应的教程。在当前的演示中,我们使用 `launch_from_torch` 和 PyTorch 启用工具。
```python
import colossalai
# ./config.py refers to the config file we just created in step 1
colossalai.launch_from_torch(config='./config.py')
```
#### 步骤 3. 创建所有的训练组件
这时,我们可以创建用于训练的所有组件,包括:
1. 模型
2. 优化器
3. 损失函数
4. 训练/测试数据加载器
5. 学习率调度器
6. 日志记录器
为了构建这些组件,您需要导入以下模块。
```python
from pathlib import Path
from colossalai.logging import get_dist_logger
import torch
import os
from colossalai.core import global_context as gpc
from colossalai.utils import get_dataloader
from torchvision import transforms
from colossalai.nn.lr_scheduler import CosineAnnealingLR
from torchvision.datasets import CIFAR10
from torchvision.models import resnet34
```
然后按照通常在 PyTorch 脚本中构建组件的方式来构建组件。在下面的脚本中,我们将 CIFAR10 数据集的根路径设置为 `./data`。您也可以把 `root='./data'` 改为 `root=Path(os.environ['DATA'])`,通过环境变量 `DATA` 来指定数据集所在的路径。
```python
# build logger
logger = get_dist_logger()
# build resnet
model = resnet34(num_classes=10)
# build datasets
train_dataset = CIFAR10(
root='./data',
download=True,
transform=transforms.Compose(
[
transforms.RandomCrop(size=32, padding=4),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[
0.2023, 0.1994, 0.2010]),
]
)
)
test_dataset = CIFAR10(
root='./data',
train=False,
transform=transforms.Compose(
[
transforms.ToTensor(),
transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[
0.2023, 0.1994, 0.2010]),
]
)
)
# build dataloaders
train_dataloader = get_dataloader(dataset=train_dataset,
shuffle=True,
batch_size=gpc.config.BATCH_SIZE,
num_workers=1,
pin_memory=True,
)
test_dataloader = get_dataloader(dataset=test_dataset,
add_sampler=False,
batch_size=gpc.config.BATCH_SIZE,
num_workers=1,
pin_memory=True,
)
# build criterion
criterion = torch.nn.CrossEntropyLoss()
# optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# lr_scheduler
lr_scheduler = CosineAnnealingLR(optimizer, total_steps=gpc.config.NUM_EPOCHS)
```
#### 步骤 4. 用 Colossal-AI 进行初始化
接下来,重要的一步是通过调用 `colossalai.initialize` 获得 Engine。正如 `config.py` 中所述,我们将使用混合精度训练来训练 ResNet34 模型。`colossalai.initialize` 将自动检查您的配置文件,并将相关特征分配给您的训练组件。这样一来,我们的 Engine 已经能够进行混合精度训练,而您不需要进行额外的处理。
```python
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
optimizer,
criterion,
train_dataloader,
test_dataloader,
)
```
#### 步骤 5. 用 Engine 进行训练
当所有的训练组件都准备好后,我们就可以像使用 PyTorch 一样训练 ResNet34 了。
```python
for epoch in range(gpc.config.NUM_EPOCHS):
# execute a training iteration
engine.train()
for img, label in train_dataloader:
img = img.cuda()
label = label.cuda()
# set gradients to zero
engine.zero_grad()
# run forward pass
output = engine(img)
# compute loss value and run backward pass
train_loss = engine.criterion(output, label)
engine.backward(train_loss)
# update parameters
engine.step()
# update learning rate
lr_scheduler.step()
# execute a testing iteration
engine.eval()
correct = 0
total = 0
for img, label in test_dataloader:
img = img.cuda()
label = label.cuda()
# run prediction without back-propagation
with torch.no_grad():
output = engine(img)
test_loss = engine.criterion(output, label)
# compute the number of correct prediction
pred = torch.argmax(output, dim=-1)
correct += torch.sum(pred == label)
total += img.size(0)
logger.info(
f"Epoch {epoch} - train loss: {train_loss:.5}, test loss: {test_loss:.5}, acc: {correct / total:.5}, lr: {lr_scheduler.get_last_lr()[0]:.5g}", ranks=[0])
```
#### 步骤 6. 用 Trainer 进行训练
如果您想用 Trainer 进行训练,您可以参考下面的代码进行您的实验。
```python
from colossalai.nn.metric import Accuracy
from colossalai.trainer import Trainer, hooks
# create a trainer object
trainer = Trainer(
engine=engine,
logger=logger
)
# define the hooks to attach to the trainer
hook_list = [
hooks.LossHook(),
hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True),
hooks.AccuracyHook(accuracy_func=Accuracy()),
hooks.LogMetricByEpochHook(logger),
hooks.LogMemoryByEpochHook(logger)
]
# start training
# run testing every 1 epoch
trainer.fit(
train_dataloader=train_dataloader,
epochs=gpc.config.NUM_EPOCHS,
test_dataloader=test_dataloader,
test_interval=1,
hooks=hook_list,
display_progress=True
)
```
#### 步骤 7. 开始分布式训练
最后,我们可以使用 PyTorch 提供的分布式启动器来调用脚本因为我们在步骤2中使用了 `launch_from_torch`。您需要把`<num_gpus>` 替换成您机器上可用的GPU数量。如果您只想使用一个 GPU您可以把这个数字设为1。如果您想使用其他的启动器请您参考如何启动 Colossal-AI 的教程。
```bash
# with engine
python -m torch.distributed.launch --nproc_per_node <num_gpus> --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py
# with trainer
python -m torch.distributed.launch --nproc_per_node <num_gpus> --master_addr localhost --master_port 29500 run_resnet_cifar10_with_trainer.py
```


@ -0,0 +1,46 @@
# 初始化功能
作者: Shenggui Li, Siqi Mai
**预备知识:**
- [分布式训练](../concepts/distributed_training.md)
- [Colossal-AI 总览](../concepts/colossalai_overview.md)
## 简介
在本教程中,我们将介绍 `colossalai.initialize` 的使用。 它包含了如何将特征(例如,模型、优化器、数据加载器)无缝注入您的训练组件中。 调用 `colossalai.initialize` 是您进入训练循环前的基本操作。
在下面一节中,我们将介绍 `colossalai.initialize` 是如何工作的以及使用中我们要注意的细节。
## 使用
在一个典型的工作流程中,我们将在训练脚本的开始启动分布式环境。
之后,我们将实例化我们的对象,如模型、优化器、损失函数、数据加载器等。此时,我们可以使用 `colossalai.initialize` 便捷地为这些对象注入特征。
具体细节请看以下的伪代码例子。
```python
import colossalai
import torch
...
# launch distributed environment
colossalai.launch(config='./config.py', ...)
# create your objects
model = MyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()
train_dataloader = MyTrainDataloader()
test_dataloader = MyTrainDataloader()
# initialize features
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
optimizer,
criterion,
train_dataloader,
test_dataloader)
```
`colossalai.initialize` 将返回一个 `Engine` 对象。 该对象把模型、优化器和损失函数封装起来。 **`Engine` 对象会以配置文件中指定的特征运行。**
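得到 `Engine` 对象之后,训练循环的写法与普通 PyTorch 十分接近,下面是一个最小的示意(沿用上面伪代码中的对象):
```python
for data, label in train_dataloader:
    engine.zero_grad()                       # 清空梯度
    output = engine(data)                    # 前向计算
    loss = engine.criterion(output, label)   # 计算损失
    engine.backward(loss)                    # 反向传播
    engine.step()                            # 更新参数
```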
关于 `Engine` 的更多使用细节可以在 [在训练中使用Engine和Trainer](./engine_trainer.md) 中获取。


@ -0,0 +1,212 @@
# 启动 Colossal-AI
作者: Chuanrui Wang, Shenggui Li, Siqi Mai
**预备知识:**
- [分布式训练](../concepts/distributed_training.md)
- [Colossal-AI 总览](../concepts/colossalai_overview.md)
## 简介
正如我们在前面的教程中所提到的,在您的配置文件准备好后,您需要为 Colossal-AI 初始化分布式环境。我们把这个过程称为 `launch`。在本教程中,您将学习如何在您的服务器上启动 Colossal-AI不管是小型的还是大型的。
在 Colossal-AI 中,我们提供了几种启动方法来初始化分布式后端。
在大多数情况下,您可以使用 `colossalai.launch``colossalai.get_default_parser` 来通过命令行传递参数。如果您想使用 SLURM、OpenMPI 和 PyTorch 等启动工具,我们也提供了几个启动的辅助方法以便您的使用。您可以直接从这些启动工具设置的环境变量中访问 rank 和 world size 大小。
在本教程中,我们将介绍如何启动 Colossal-AI 来初始化分布式后端:
- 用 colossalai.launch 启动
- 用 Colossal-AI命令行 启动
- 用 SLURM 启动
- 用 OpenMPI 启动
## 启动分布式环境
为了启动 Colossal-AI我们需要两类参数:
1. 配置文件
2. 分布式设置
无论我们使用何种启动方式,配置文件是必须要求的,而分布式设置有可能依情况而定。配置文件可以是配置文件的路径或 Python dictionary 的形式。分布式设置可以通过命令行或多进程启动器传递。
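下面给出直接以 Python dictionary 形式传入配置的示意(配置内容仅作举例):
```python
import colossalai
from colossalai.amp import AMP_TYPE

# 直接传入 dict等价于把同样内容写进 config.py 后传入其路径
config = dict(
    BATCH_SIZE=128,
    fp16=dict(mode=AMP_TYPE.TORCH),
)
colossalai.launch_from_torch(config=config)
```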
### 命令行解析器
在使用 `launch` 之前, 我们首先需要了解我们需要哪些参数来进行初始化。
如[分布式训练](../concepts/distributed_training.md) 中 `基本概念` 一节所述 ,涉及的重要参数是:
1. host
2. port
3. rank
4. world_size
5. backend
在 Colossal-AI 中,我们提供了一个命令行解析器,它已经提前添加了这些参数。您可以通过调用 `colossalai.get_default_parser()` 来获得这个解析器。这个解析器通常与 `colossalai.launch` 一起使用。
```python
# add these lines in your train.py
import colossalai
# get default parser
parser = colossalai.get_default_parser()
# if you want to add your own arguments
parser.add_argument(...)
# parse arguments
args = parser.parse_args()
```
您可以在您的终端传入以下这些参数。
```shell
python train.py --host <host> --rank <rank> --world_size <world_size> --port <port> --backend <backend>
```
`backend` 是用户可选的,默认值是 nccl。
### 本地启动
为了初始化分布式环境,我们提供了一个通用的 `colossalai.launch` API。`colossalai.launch` 函数接收上面列出的参数,并在通信网络中创建一个默认的进程组。方便起见,这个函数通常与默认解析器一起使用。
```python
import colossalai
# parse arguments
args = colossalai.get_default_parser().parse_args()
# launch distributed environment
colossalai.launch(config=<CONFIG>,
rank=args.rank,
world_size=args.world_size,
host=args.host,
port=args.port,
backend=args.backend
)
```
### 用 Colossal-AI命令行工具 启动
为了更好地支持单节点以及多节点的训练我们通过封装PyTorch的启动器实现了一个更加方便的启动器。
PyTorch自带的启动器需要在每个节点上都启动命令才能启动多节点训练而我们的启动器只需要一次调用即可启动训练。
首先我们需要在代码里指定我们的启动方式。由于这个启动器是PyTorch启动器的封装那么我们自然而然应该使用`colossalai.launch_from_torch`。
分布式环境所需的参数,如 rank, world size, host 和 port 都是由 PyTorch 启动器设置的,可以直接从环境变量中读取。
```python
import colossalai
colossalai.launch_from_torch(
config=<CONFIG>,
)
```
接下来,我们可以轻松地在终端使用`colossalai run`来启动训练。下面的命令可以在当前机器上启动一个4卡的训练任务。
你可以通过设置`nproc_per_node`来调整使用的GPU的数量也可以改变`master_port`的参数来选择通信的端口。
```shell
# 在当前节点上启动4卡训练 默认使用29500端口
colossalai run --nproc_per_node 4 train.py
# 在当前节点上启动4卡训练并使用一个不同的端口
colossalai run --nproc_per_node 4 --master_port 29505 test.py
```
如果你在使用一个集群并且想进行多节点的训练你需要使用Colossal-AI的命令行工具进行一键启动。我们提供了两种方式来启动多节点任务
- 通过`--host`来启动
这个方式适合节点数不多的情况。假设我们有两个节点,分别为`host1`和`host2`。我们可以用以下命令进行多节点训练。
比起单节点训练,多节点训练需要手动设置`--master_addr` (在单节点训练中`master_addr`默认为`127.0.0.1`)。
:::caution
多节点训练时,`master_addr`不能为`localhost`或者`127.0.0.1`它应该是一个节点的名字或者IP地址。
:::
```shell
# 在两个节点上训练
colossalai run --nproc_per_node 4 --host host1,host2 --master_addr host1 test.py
```
- 通过`--hostfile`来启动
这个方式适用于节点数很大的情况。host file是一个简单的文本文件这个文件里列出了可以使用的节点的名字。
在一个集群中可用节点的列表一般由SLURM或者PBS Pro这样的集群资源管理器来提供。比如在SLURM中
你可以从`SLURM_NODELIST`这个环境变量中获取到当前分配列表。在PBS Pro中这个环境变量为`PBS_NODEFILE`。
可以通过`echo $SLURM_NODELIST` 或者 `cat $PBS_NODEFILE` 来尝试一下。如果你没有这样的集群管理器,
那么你可以自己手动写一个这样的文本文件即可。
提供给Colossal-AI的host file需要遵循以下格式每一行都是一个节点的名字。
```text
host1
host2
```
如果host file准备好了那么我们就可以用以下命令开始多节点训练了。和使用`--host`一样,你也需要指定一个`master_addr`。
当使用host file时我们可以使用一些额外的参数
- `--include`: 设置你想要启动训练的节点。比如你的host file里有8个节点但是你只想用其中的6个节点进行训练
你可以添加`--include host1,host2,host3,...,host6`这样训练任务只会在这6个节点上启动。
- `--exclude`: 设置你想排除在训练之外的节点。当你的某一些节点坏掉时这个参数会比较有用。比如假如host1的GPU有一些问题无法正常使用
那么你就可以使用`--exclude host1`来将其排除在外,这样训练任务就只会在剩余的节点上启动。
```shell
# 使用hostfile启动
colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1 test.py
# 只使用部分节点进行训练
colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1 --include host1 test.py
# 不使用某些节点进行训练
colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1 --exclude host2 test.py
```
### 用 SLURM 启动
如果您是在一个由 SLURM 调度器管理的系统上, 您也可以使用 `srun` 启动器来启动您的 Colossal-AI 脚本。我们提供了辅助函数 `launch_from_slurm` 来与 SLURM 调度器兼容。
`launch_from_slurm` 会自动从环境变量 `SLURM_PROCID``SLURM_NPROCS` 中分别读取 rank 和 world size ,并使用它们来启动分布式后端。
您可以在您的训练脚本中尝试以下操作。
```python
import colossalai
colossalai.launch_from_slurm(
config=<CONFIG>,
host=args.host,
port=args.port
)
```
您可以通过在终端使用这个命令来初始化分布式环境。
```bash
srun python train.py --host <master_node> --port 29500
```
### 用 OpenMPI 启动
如果您对OpenMPI比较熟悉您也可以使用 `launch_from_openmpi`
`launch_from_openmpi` 会自动从环境变量
`OMPI_COMM_WORLD_LOCAL_RANK` `MPI_COMM_WORLD_RANK``OMPI_COMM_WORLD_SIZE` 中分别读取local rank、global rank 和 world size并利用它们来启动分布式后端。
您可以在您的训练脚本中尝试以下操作。
```python
colossalai.launch_from_openmpi(
config=<CONFIG>,
host=args.host,
port=args.port
)
```
以下是用 OpenMPI 启动多个进程的示例命令。
```bash
mpirun --hostfile <my_hostfile> -np <num_process> python train.py --host <node name or ip> --port 29500
```
- --hostfile: 指定一个要运行的主机列表。
- --np: 设置总共要启动的进程GPU的数量。例如如果 --np 44个 python 进程将被初始化以运行 train.py。


@ -0,0 +1,61 @@
# 模型检查点
作者 : Guangyang Lu
**预备知识:**
- [Launch Colossal-AI](./launch_colossalai.md)
- [Initialize Colossal-AI](./initialize_features.md)
**示例代码:**
- [ColossalAI-Examples Model Checkpoint](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/utils/checkpoint)
**此函数为实验性功能。**
## 简介
本教程将介绍如何保存和加载模型检查点。
为了充分利用 Colossal-AI 强大的并行策略,我们需要对模型和张量做一些修改,因此无法直接使用 `torch.save` 或者 `torch.load` 来保存或加载模型检查点。在 Colossal-AI 中,我们提供了相应的应用程序接口来实现同样的效果。
但是,在加载时,你不需要使用与存储相同的保存策略。
## 使用方法
### 保存
有两种方法可以使用Colossal-AI训练模型即使用engine或使用trainer。
**注意我们只保存 `state_dict`.** 因此,在加载检查点时,需要首先定义模型。
#### 同 engine 保存
```python
from colossalai.utils import save_checkpoint
model = ...
engine, _, _, _ = colossalai.initialize(model=model, ...)
for epoch in range(num_epochs):
... # do some training
save_checkpoint('xxx.pt', epoch, model)
```
#### 用 trainer 保存
```python
from colossalai.trainer import Trainer, hooks
model = ...
engine, _, _, _ = colossalai.initialize(model=model, ...)
trainer = Trainer(engine, ...)
hook_list = [
hooks.SaveCheckpointHook(1, 'xxx.pt', model)
...]
trainer.fit(...
            hooks=hook_list)
```
### 加载
```python
from colossalai.utils import load_checkpoint
model = ...
load_checkpoint('xxx.pt', model)
... # train or test
```


@ -0,0 +1,36 @@
# Colossal-AI 总览
作者: Shenggui Li, Siqi Mai
## 关于 Colossal-AI
随着深度学习模型规模的发展,向新的训练模式转变是非常重要的。没有并行和优化的传统训练方法将成为过去,新的训练方法是使训练大规模模型高效和节省成本的关键。
Colossal-AI 是一个集成的系统,为用户提供一套综合的训练方法。您可以找到常见的训练方法,如混合精度训练和梯度累积。此外,我们提供了一系列的并行技术,包括数据并行、张量并行和流水线并行。我们通过不同的多维分布式矩阵乘法算法来优化张量并行。我们还提供了不同的流水线并行方法,使用户能够有效地跨节点扩展他们的模型。更多的高级功能,如卸载,也可以在这个教程文档中找到详细的内容。
## Colossal-AI 的使用
我们的目标是使 Colossal-AI 易于使用并且对用户的代码不产生干扰。如果您想使用Colossal-AI这里有一个简单的一般工作流程。
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/ZK7ICWzbMsVuJof.png"/>
<figcaption>Workflow</figcaption>
</figure>
1. 准备一个配置文件,指定您要使用的功能和参数。
2. 用 `colossalai.launch` 初始化分布式后端。
3. 用 `colossalai.initialize` 将训练特征注入您的训练组件(如模型、优化器)中。
4. 进行训练和测试.
我们将在`基本教程`部分介绍整个工作流程。
## 未来计划
Colossal-AI 系统将会进一步拓展和优化,包括但不限于:
1. 分布式操作的优化
2. 异构系统训练的优化
3. 从模型大小的维度切入,提升训练速度并维持精度
4. 拓展现有的并行方法
**我们始终欢迎社区的建议和讨论如果您遇到任何问题我们将非常愿意帮助您。您可以在GitHub 提 [issue](https://github.com/hpcaitech/ColossalAI/issues) ,或在[论坛](https://github.com/hpcaitech/ColossalAI/discussions)上创建一个讨论主题。**


@ -0,0 +1,88 @@
# 分布式训练
作者: Shenggui Li, Siqi Mai
## 什么是分布式系统?
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/sE5daHf2ohIy9wX.png"/>
<figcaption>图片来源: <a href="https://towardsdatascience.com/distributed-training-in-the-cloud-cloud-machine-learning-engine-9e264ddde27f">Towards Data Science</a></figcaption>
</figure>
分布式系统由多个软件组件组成,在多台机器上运行。例如,传统的数据库运行在一台机器上。随着数据量的爆发式增长,单台机器已经不能为企业提供理想的性能。特别是在双十一这样的网络狂欢节,网络流量会出乎意料的大。为了应对这种压力,现代高性能数据库被设计成在多台机器上运行,它们共同为用户提供高吞吐量和低延迟。
分布式系统的一个重要评价指标是可扩展性。例如当我们在4台机器上运行一个应用程序时我们自然希望该应用程序的运行速度能提高4倍。然而由于通信开销和硬件性能的差异很难实现线性提速。因此当我们实现应用程序时必须考虑如何使其更快。良好的设计和系统优化的算法可以帮助我们提供良好的性能。有时甚至有可能实现线性和超线性提速。
## 为什么我们需要机器学习的分布式训练?
早在2012年[AlexNet](https://arxiv.org/abs/1404.5997) 就赢得了ImageNet比赛的冠军而它是在两张 GTX 580 3GB GPU 上训练的。今天大多数出现在顶级人工智能会议上的模型都是在多个GPU上训练的。当研究人员和工程师开发人工智能模型时分布式训练无疑是一种常见的做法。这一趋势背后有几个原因。
1. 模型规模迅速增加。2015年的 [ResNet50](https://arxiv.org/abs/1512.03385) 有2000万的参数
2018年的 [BERT-Large](https://arxiv.org/abs/1810.04805)有3.45亿的参数2018年的
[GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
有15亿的参数而2020年的 [GPT-3](https://arxiv.org/abs/2005.14165) 有1750亿个参数。很明显模型规模随着时间的推移呈指数级增长。目前最大的模型已经超过了1000多亿个参数。而与较小的模型相比超大型模型通常能提供更优越的性能。
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/sCyreJ9PF1EdZYf.jpg"/>
<figcaption>图片来源: <a href="https://huggingface.co/blog/large-language-models">HuggingFace</a></figcaption>
</figure>
2. 数据集规模迅速增加。对于大多数机器学习开发者来说MNIST 和 CIFAR10 数据集往往是他们训练模型的前几个数据集。然而,与著名的 ImageNet 数据集相比这些数据集非常小。谷歌甚至有自己的未公布的JFT-300M 数据集它有大约3亿张图片这比 ImageNet-1k 数据集大了近300倍。
3. 计算能力越来越强。随着半导体行业的进步显卡变得越来越强大。由于核的数量增多GPU是深度学习最常见的算力资源。从2012年的 K10 GPU 到2020年的 A100 GPU计算能力已经增加了几百倍。这使我们能够更快地执行计算密集型任务而深度学习正是这样一项任务。
如今我们接触到的模型可能太大以致于无法装入一个GPU而数据集也可能大到足以在一个GPU上训练一百天。这时只有用不同的并行化技术在多个GPU上训练我们的模型我们才能完成并加快模型训练以追求在合理的时间内获得想要的结果。
## 分布式训练的基本概念
分布式训练需要多台机器/GPU。在训练期间这些设备之间会有通信。为了更好地理解分布式训练有几个重要的术语需要我们了解清楚。
- host: 主机(host)是通信网络中的主要设备。在初始化分布式环境时,经常需要它作为一个参数。
- port: 这里的端口(port)主要是指主机上用于通信的主端口。
- rank: 在网络中赋予设备的唯一ID。
- world size: 网络中设备的数量。
- process group: 进程组(process group)是一个通信网络,包括设备的一个子集。总是有一个默认的进程组,它包含所有的设备。一个子集的设备可以形成一个进程组,以便它们只在组内的设备之间进行通信。
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/qnNBKh8AjzgM5sY.png"/>
<figcaption>一个分布式系统的例子</figcaption>
</figure>
为了说明这些概念让我们假设我们有2台机器也称为节点每台机器有4个 GPU。当我们在这两台机器上初始化分布式环境时我们基本上启动了8个进程每台机器上有4个进程每个进程被绑定到一个 GPU 上。
在初始化分布式环境之前我们需要指定主机主地址和端口主端口。在这个例子中我们可以让主机为节点0端口为一个数字如29500。所有的8个进程将寻找地址和端口并相互连接默认的进程组将被创建。默认进程组的 world size 为8细节如下。
| process ID | rank | Node index | GPU index |
| ---------- | ---- | ---------- | --------- |
| 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 |
| 2 | 2 | 0 | 2 |
| 3 | 3 | 0 | 3 |
| 4 | 4 | 1 | 0 |
| 5 | 5 | 1 | 1 |
| 6 | 6 | 1 | 2 |
| 7 | 7 | 1 | 3 |
我们还可以创建一个新的进程组。这个新的进程组可以包含任何进程的子集。例如,我们可以创建一个只包含偶数进程的组:
| process ID | rank | Node index | GPU index |
| ---------- | ---- | ---------- | --------- |
| 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 2 |
| 4 | 2 | 1 | 0 |
| 6 | 3 | 1 | 2 |
**请注意rank 是相对于进程组而言的,一个进程在不同的进程组中可以有不同的 rank。最大的 rank 始终是 `world size of the process group - 1`。**
在进程组中,各进程可以通过两种方式进行通信。
1. peer-to-peer: 一个进程向另一个进程发送数据。
2. collective: 一组进程一起执行分散、聚集、all-reduce、广播等操作。
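下面是一段基于 PyTorch 的简单示意:在默认进程组之外创建一个只包含偶数 rank 的进程组,并在组内做一次 all-reduce假设上文描述的 8 进程分布式环境已经初始化):
```python
import torch
import torch.distributed as dist

# new_group 需要被所有进程调用,即使某些进程并不在这个组里
even_group = dist.new_group(ranks=[0, 2, 4, 6])

x = torch.ones(1).cuda()
if dist.get_rank() in (0, 2, 4, 6):
    # collective 通信:只在偶数 rank 组成的进程组内对 x 求和
    dist.all_reduce(x, op=dist.ReduceOp.SUM, group=even_group)
```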
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/zTmlxgc3oeAdn97.png"/>
<figcaption>Collective communication 来源: <a href="https://pytorch.org/tutorials/intermediate/dist_tuto.html">PyTorch distributed tutorial</a></figcaption>
</figure>


@ -0,0 +1,91 @@
# 并行技术
作者: Shenggui Li, Siqi Mai
## 简介
随着深度学习的发展,对并行训练的需求越来越大。这是因为模型和数据集越来越大,如果我们坚持使用单 GPU 训练,训练过程的等待将会成为一场噩梦。在本节中,我们将对现有的并行训练方法进行简要介绍。如果您想对这篇文章进行补充,欢迎在[GitHub论坛](https://github.com/hpcaitech/ColossalAI/discussions)上进行讨论。
## 数据并行
数据并行是最常见的并行形式,因为它很简单。在数据并行训练中,数据集被分割成几个碎片,每个碎片被分配到一个设备上。这相当于沿批次维度对训练过程进行并行化。每个设备将持有一个完整的模型副本,并在分配的数据集碎片上进行训练。在反向传播之后,模型的梯度将被全部减少,以便在不同设备上的模型参数能够保持同步。
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/WSAensMqjwHdOlR.png"/>
<figcaption>数据并行</figcaption>
</figure>
## 模型并行
在数据并行训练中,一个明显的特点是每个 GPU 持有整个模型权重的副本。这就带来了冗余问题。另一种并行模式是模型并行,即模型被分割并分布在一个设备阵列上。通常有两种类型的并行:张量并行和流水线并行。张量并行是在一个操作中进行并行计算,如矩阵-矩阵乘法。流水线并行是在各层之间进行并行计算。因此,从另一个角度来看,张量并行可以被看作是层内并行,流水线并行可以被看作是层间并行。
### 张量并行
张量并行训练是将一个张量沿特定维度分成 `N` 块,每个设备只持有整个张量的 `1/N`,同时不影响计算图的正确性。这需要额外的通信来确保结果的正确性。
以一般的矩阵乘法为例,假设我们有 `C = AB`。我们可以将B沿着列分割成 `[B0 B1 B2 ... Bn]`,每个设备持有一列。然后我们将 `A` 与每个设备上 `B` 中的每一列相乘,我们将得到 `[AB0 AB1 AB2 ... ABn]` 。此刻,每个设备仍然持有一部分的结果,例如,设备(rank=0)持有 `AB0`。为了确保结果的正确性,我们需要收集全部的结果,并沿列维串联张量。通过这种方式,我们能够将张量分布在设备上,同时确保计算流程保持正确。
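下面用一小段 PyTorch 代码在单机上模拟这一过程(仅作数值示意,不涉及真正的跨设备通信):
```python
import torch

A = torch.randn(4, 6)
B = torch.randn(6, 8)

# 把 B 沿列切成两块,分别计算 A @ B_i相当于每个设备各算一部分
B0, B1 = B.chunk(2, dim=-1)
Y0, Y1 = A @ B0, A @ B1

# 沿列维拼接各部分结果,得到与完整矩阵乘法一致的输出
Y = torch.cat([Y0, Y1], dim=-1)
assert torch.allclose(Y, A @ B, atol=1e-5)
```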
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/2ZwyPDvXANW4tMG.png"/>
<figcaption>张量并行</figcaption>
</figure>
在 Colossal-AI 中,我们提供了一系列的张量并行方法,即 1D、2D、2.5D 和 3D 张量并行。我们将在`高级教程`中详细讨论它们。
相关文章:
- [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](https://arxiv.org/abs/2006.16668)
- [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
- [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
- [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
### 流水线并行
流水线并行一般来说很容易理解。请您回忆一下您的计算机结构课程,这确实存在于 CPU 设计中。
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/at3eDv7kKBusxbd.png"/>
<figcaption>流水线并行</figcaption>
</figure>
流水线并行的核心思想是,模型按层分割成若干块,每块都交给一个设备。在前向传递过程中,每个设备将中间的激活传递给下一个阶段。在后向传递过程中,每个设备将输入张量的梯度传回给前一个流水线阶段。这允许设备同时进行计算,并增加了训练的吞吐量。流水线并行训练的一个缺点是,会有一些设备参与计算的冒泡时间,导致计算资源的浪费。
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/sDNq51PS3Gxbw7F.png"/>
<figcaption>Source: <a href="https://arxiv.org/abs/1811.06965">GPipe</a></figcaption>
</figure>
相关文章:
- [PipeDream: Fast and Efficient Pipeline Parallel DNN Training](https://arxiv.org/abs/1806.03377)
- [GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://arxiv.org/abs/1811.06965)
- [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
- [Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines](https://arxiv.org/abs/2107.06925)
## 优化器相关的并行
另一种并行方法和优化器相关,目前这种并行最流行的方法是 `ZeRO`,即[零冗余优化器](https://arxiv.org/abs/1910.02054)。 ZeRO 在三个层面上工作以消除内存冗余ZeRO需要进行fp16训练
- Level 1: 优化器状态在各进程中被划分。
- Level 2: 用于更新模型权重的32位梯度也被划分因此每个进程只存储与其优化器状态划分相对应的梯度。
- Level 3: 16位模型参数在各进程中被划分。
相关文章:
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
## 异构系统的并行
上述方法通常需要大量的 GPU 来训练一个大型模型。然而,人们常常忽略的是,与 GPU 相比CPU 的内存要大得多。在一个典型的服务器上CPU 可以轻松拥有几百GB的内存而每个 GPU 通常只有16或32GB的内存。这促使人们思考为什么 CPU 内存没有被用于分布式训练。
最近的进展是依靠 CPU 甚至是 NVMe 磁盘来训练大型模型。主要的想法是,在不使用张量时,将其卸载回 CPU 内存或 NVMe 磁盘。通过使用异构系统架构,有可能在一台机器上容纳一个巨大的模型。
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/qLHD5lk97hXQdbv.png"/>
<figcaption>异构系统</figcaption>
</figure>
相关文章:
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
- [PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management](https://arxiv.org/abs/2108.05818)


@ -0,0 +1,111 @@
# 1D 张量并行
作者: Zhengda Bian, Yongbin Li
**前置教程**
- [定义配置文件](../basics/define_your_config.md)
- [并行配置](../basics/configure_parallelization.md)
**示例代码**
- [ColossalAI-Examples 1D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_1d.py)
**相关论文**
- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf)
## 引言
张量并行将模型参数划分到多个设备上,以减少内存负荷。
[Megatron-LM](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf) 介绍了一种高效的一维张量并行化实现。
让我们以一个线性层为例,它包括一个 GEMM $Y = XA$。给定 2 个处理器,我们把 $A$ 沿列划分为 $[A_1 ~ A_2]$,并在每个处理器上计算 $Y_i = XA_i$,从而得到 $[Y_1 ~ Y_2] = [XA_1 ~ XA_2]$,这被称为列并行方式。
当第二个线性层 $Z=YB$ 跟随上述列并行层的时候,我们把 $B$ 沿行划分为 $\left[\begin{matrix} B_1 \\ B_2 \end{matrix} \right]$,
这就是所谓的行并行方式。
为了计算 $Z = [Y_1 ~ Y_2] \left[\begin{matrix} B_1 \\ B_2 \end{matrix} \right]$, 我们首先在每个处理器上计算 $Y_iB_i$ 然后使用一个all-reduce操作将结果汇总为 $Z=Y_1B_1+Y_2B_2$。
我们还需要注意,在后向计算中,列并行线性层需要聚合输入张量 $X$, 因为在每个处理器 $i$ 上,我们只有 $\dot{X_i}=\dot{Y_i}A_i^T$因此我们在各处理器之间进行all-reduce得到 $\dot{X}=\dot{Y}A^T=\dot{Y_1}A_1^T+\dot{Y_2}A_2^T$。
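下面用 PyTorch 在单机上对"列并行 + 行并行"这两步做一个数值验证的示意(用切分后的两块张量模拟两个处理器):
```python
import torch

X = torch.randn(4, 8)
A = torch.randn(8, 16)   # 第一个线性层的权重,按列切分
B = torch.randn(16, 8)   # 第二个线性层的权重,按行切分

A1, A2 = A.chunk(2, dim=-1)
B1, B2 = B.chunk(2, dim=0)

# 列并行:每个处理器各自计算 Y_i = X A_i
Y1, Y2 = X @ A1, X @ A2

# 行并行:各自计算 Y_i B_i再求和模拟 all-reduce得到 Z
Z = Y1 @ B1 + Y2 @ B2
assert torch.allclose(Z, X @ A @ B, atol=1e-4)
```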
## 效率
给定 $P$ 个处理器, 我们展现理论上的计算和内存成本以及基于环形算法的1D张量并行的前向和后向的通信成本。
| 计算 | 内存 (参数) | 内存 (activations) | 通信 (带宽) | 通信 (时延) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/P)$ | $O(1/P)$ | $O(1)$ | $O(2(P-1)/P)$ | $O(2(P-1))$ |
## 使用
为了使模型能够实现一维张量并行, 如在2个 GPU 上, 我们需要配置如下的并行设置。
```python
CONFIG = dict(parallel=dict(
data=1,
pipeline=1,
tensor=dict(size=2, mode='1d'),
))
```
然后 Colossal-AI 会自动对所有来自 `colossalai.nn` 的层应用1D张量并行。
让我们定义一个由两层多层感知器 (MLP) 组成的模型,如下所示。
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0
class MLP(torch.nn.Module):
def __init__(self, dim: int = 256):
super().__init__()
intermediate_dim = dim * 4
self.dense_1 = col_nn.Linear(dim, intermediate_dim)
print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.transpose(0, 1).shape}')
self.activation = torch.nn.GELU()
self.dense_2 = col_nn.Linear(intermediate_dim, dim)
print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.transpose(0, 1).shape}')
self.dropout = col_nn.Dropout(0.1)
def forward(self, x):
x = self.dense_1(x)
print_rank_0(f'Output of the first linear layer: {x.shape}')
x = self.activation(x)
x = self.dense_2(x)
print_rank_0(f'Output of the second linear layer: {x.shape}')
x = self.dropout(x)
return x
```
在2个 GPU 上启动 Colossal-AI 并建立模型。
```python
parser = colossalai.get_default_parser()
colossalai.launch(config=CONFIG,
rank=args.rank,
world_size=args.world_size,
local_rank=args.local_rank,
host=args.host,
port=args.port)
m = MLP()
```
我们将会看到 MLP 模型中被划分的参数(如权重)的形状。
```shell
Weight of the first linear layer: torch.Size([256, 512])
Weight of the second linear layer: torch.Size([512, 256])
```
第一个线性层的完整权重形状应该为 `[256, 1024]`. 经过列-并行分割,它变成了 `[256, 512]`
同样地,第二个行并行层将权重 `[1024, 256]` 划分为 `[512, 256]`
我们可以用一些随机输入来运行这个模型。
```python
from colossalai.utils import get_current_device
x = torch.randn((16, 256), device=get_current_device())
torch.distributed.broadcast(x, src=0) # synchronize input
x = m(x)
```
然后我们可以看到 activation 结果的形状。
```shell
Output of the first linear layer: torch.Size([16, 512])
Output of the second linear layer: torch.Size([16, 256])
```
第一个线性层的输出被划分成2块 (每个形状为 `[16, 512]`), 而第二层在整个 GPU 上的输出是相同的。


@ -0,0 +1,141 @@
# 2D 张量并行
作者: Zhengda Bian, Yongbin Li
**前置教程**
- [定义配置文件](../basics/define_your_config.md)
- [并行配置](../basics/configure_parallelization.md)
- [1D 张量并行](./1D_tensor_parallel.md)
**示例代码**
- [ColossalAI-Examples - 2D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_2d.py)
**相关论文**
- [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/pdf/2104.05343.pdf)
## 引言
1D张量并行没有对 activations 进行划分,就大规模模型而言,这也会消耗大量的内存。
为了平均分配计算和内存负荷,在 SUMMA可扩展的通用矩阵乘法算法的基础上 [2D张量并行](https://arxiv.org/pdf/2104.05343.pdf) 被引入。
我们还是以线性层 $Y = XA$ 为例。
给定 $P=q\times q$ 个处理器(必要条件), 如 $q=2$, 我们把输入 $X$ 和权重 $A$ 都划分为
$$
\left[\begin{matrix} X_{10} & X_{11} \\ X_{00} & X_{01} \end{matrix} \right]
\text{~and~}
\left[\begin{matrix} A_{10} & A_{11} \\ A_{00} & A_{01} \end{matrix} \right]。
$$
该计算包括 $q$ 步。 当 $t=1$ 时, $X_{i0}$ 在其行中被广播, 而 $A_{0j}$ 在其列中被广播。因此,我们有
$$
\left[\begin{matrix} X_{10},A_{00} & X_{10},A_{01} \\ X_{00},A_{00} & X_{00},A_{01} \end{matrix} \right]。
$$
然后我们在每个处理器 $(i, j)$ 上将 $X_{i0}$ 和 $A_{0j}$ 相乘为
$$
\left[\begin{matrix} X_{10}A_{00} & X_{10}A_{01} \\ X_{00}A_{00} & X_{00}A_{01} \end{matrix} \right] (1)。
$$
同样,当 $t=2$ 时, $X_{i1}$ 在其行中被广播, $A_{1j}$ 在其列中被广播, 我们将它们相乘为
$$
\left[\begin{matrix} X_{11}A_{10} & X_{11}A_{11} \\ X_{01}A_{10} & X_{01}A_{11} \end{matrix} \right] (2)。
$$
通过将 $(1)$ 和 $(2)$ 相加,我们有
$$
Y = XA = \left[\begin{matrix} X_{10}A_{00}+X_{11}A_{10} & X_{10}A_{01}+X_{11}A_{11} \\ X_{00}A_{00}+X_{01}A_{10} & X_{00}A_{01}+X_{01}A_{11} \end{matrix} \right]。
$$
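同样地,可以用 PyTorch 在单机上对上述按块广播、相乘并累加的过程做一个数值示意(以 $q=2$ 为例,用张量切块模拟 $2\times 2$ 的处理器网格):
```python
import torch

q = 2
X = torch.randn(4, 6)
A = torch.randn(6, 8)

# 把 X、A 都切成 q x q 个块,模拟 2D 处理器网格上的划分
X_blocks = [list(row.chunk(q, dim=-1)) for row in X.chunk(q, dim=0)]
A_blocks = [list(row.chunk(q, dim=-1)) for row in A.chunk(q, dim=0)]

# 每一步 t 中X_{i,t} 在行内广播A_{t,j} 在列内广播,本地相乘后累加
Y_blocks = [[sum(X_blocks[i][t] @ A_blocks[t][j] for t in range(q))
             for j in range(q)] for i in range(q)]

Y = torch.cat([torch.cat(row, dim=-1) for row in Y_blocks], dim=0)
assert torch.allclose(Y, X @ A, atol=1e-4)
```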
## 效率
给定 $P=q\times q$ 个处理器, 我们展现理论上的计算和内存成本以及基于环形算法的2D张量并行的前向和后向的通信成本。
| 计算 | 内存 (参数) | 内存 (activations) | 通信 (带宽) | 通信 (时延) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/q^2)$ | $O(1/q^2)$ | $O(1/q^2)$ | $O(6(q-1)/q)$ | $O(6(q-1))$ |
## 使用
为了使我们的模型能够实现二维张量并行例如在4个 GPU 上,我们需要配置如下的并行设置。
```python
CONFIG = dict(parallel=dict(
data=1,
pipeline=1,
tensor=dict(size=4, mode='2d'),
))
```
然后 Colossal-AI 会自动对所有来自 `colossalai.nn` 的层应用2D张量并行。
让我们定义一个由两层多层感知器 (MLP) 组成的模型,如下所示。
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0
class MLP(torch.nn.Module):
def __init__(self, dim: int = 256):
super().__init__()
intermediate_dim = dim * 4
self.dense_1 = col_nn.Linear(dim, intermediate_dim)
print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
self.activation = torch.nn.GELU()
self.dense_2 = col_nn.Linear(intermediate_dim, dim)
print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
self.dropout = col_nn.Dropout(0.1)
def forward(self, x):
x = self.dense_1(x)
print_rank_0(f'Output of the first linear layer: {x.shape}')
x = self.activation(x)
x = self.dense_2(x)
print_rank_0(f'Output of the second linear layer: {x.shape}')
x = self.dropout(x)
return x
```
在4个 GPU 上启动 Colossal-AI 并建立模型。
```python
parser = colossalai.get_default_parser()
colossalai.launch(config=CONFIG,
rank=args.rank,
world_size=args.world_size,
local_rank=args.local_rank,
host=args.host,
port=args.port)
m = MLP()
```
我们将会看到 MLP 模型中被划分的参数(如权重)的形状。
```shell
Weight of the first linear layer: torch.Size([128, 512])
Weight of the second linear layer: torch.Size([512, 128])
```
第一个线性层的完整权重形状应该为 `[256, 1024]`. 经过2D并行划分后它在每个 GPU 上变成了 `[128, 512]`
同样地,第二层将权重 `[1024, 256]` 划分为 `[512, 128]`.
我们可以用一些随机输入来运行这个模型。
```python
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.utils import get_current_device
x = torch.randn((16, 256), device=get_current_device())
# partition input
torch.distributed.broadcast(x, src=0)
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL)]
x = torch.chunk(x, 2, dim=-1)[gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW)]
print_rank_0(f'Input: {x.shape}')
x = m(x)
```
然后我们可以看到 activation 结果的形状。
```shell
Input: torch.Size([8, 128])
Output of the first linear layer: torch.Size([8, 512])
Output of the second linear layer: torch.Size([8, 128])
```
2D并行中的 activation 张量都是同时在行和列分割的。例如,第一个线性层的输出是 `[8, 512]`, 而第二层的输出为 `[8, 128]`


@ -0,0 +1,145 @@
# 2.5D 张量并行
作者: Zhengda Bian, Yongbin Li
**前置教程**
- [定义配置文件](../basics/define_your_config.md)
- [并行配置](../basics/configure_parallelization.md)
- [1D 张量并行](./1D_tensor_parallel.md)
- [2D 张量并行](./2D_tensor_parallel.md)
**示例代码**
- [ColossalAI-Examples - 2.5D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_2p5d.py)
**相关论文**
- [2.5-dimensional distributed model training](https://arxiv.org/pdf/2105.14500.pdf)
## 引言
与一维张量并行相比,二维并行降低了内存成本,但可能引入更多的通信。因此,[2.5D张量并行](https://arxiv.org/pdf/2105.14500.pdf) 在 2.5D SUMMA 的基础上被提出,它通过使用更多的设备来减少通信。
我们还是以线性层 $Y = XA$ 为例。
给定 $P=q \times q \times d$ 个处理器(必要条件), 如 $q=d=2$, 我们把输入 $X$ 划分为 $d\times q$ 行和 $q$ 列
$$
\left[\begin{matrix} X_{30} & X_{31} \\ X_{20} & X_{21} \\ X_{10} & X_{11} \\ X_{00} & X_{01}\end{matrix} \right],
$$
它可以被重塑为 $d$ 层
$$
\left[\begin{matrix} X_{10} & X_{11} \\ X_{00} & X_{01} \end{matrix} \right] \text{~and~}\left[\begin{matrix} X_{30} & X_{31} \\ X_{20} & X_{21} \end{matrix} \right].
$$
另外,权重 $A$ 被分割为
$$
\left[\begin{matrix} A_{10} & A_{11} \\ A_{00} & A_{01} \end{matrix} \right].
$$
对于 $X$ 相关的每一层, 我们使用SUMMA算法将 $X$ 与 $A$ 相乘。
然后,我们得到输出
$$
\left[\begin{matrix} Y_{10}=X_{10}A_{00}+X_{11}A_{10} & Y_{11}=X_{10}A_{01}+X_{11}A_{11} \\ Y_{00}=X_{00}A_{00}+X_{01}A_{10} & Y_{01}=X_{00}A_{01}+X_{01}A_{11} \end{matrix} \right]
\text{~and~}
$$
$$
\left[\begin{matrix} Y_{30}=X_{30}A_{00}+X_{31}A_{10} & Y_{31}=X_{30}A_{01}+X_{31}A_{11} \\ Y_{20}=X_{20}A_{00}+X_{21}A_{10} & Y_{21}=X_{20}A_{01}+X_{21}A_{11} \end{matrix} \right].
$$
## 效率
给定 $P=q \times q \times d$ 个处理器, 我们展现理论上的计算和内存成本以及基于环形算法的2.5D张量并行的前向和后向的通信成本。
| 计算 | 内存 (参数) | 内存 (activations) | 通信 (带宽) | 通信 (时延) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/dq^2)$ | $O(1/q^2)$ | $O(1/dq^2)$ | $\small O(3(q-1)(d+1)/dq)$ | $O(6(q-1))$ |
## 使用
为了使我们的模型能够实现2.5D张量并行例如在8个 GPU 上,我们需要配置如下的并行设置。
```python
CONFIG = dict(parallel=dict(
data=1,
pipeline=1,
tensor=dict(size=8, mode='2.5d', depth=2),
))
```
然后 Colossal-AI 会自动对所有来自 `colossalai.nn` 的层应用2.5D张量并行。
让我们定义一个由两层多层感知器 (MLP) 组成的模型,如下所示。
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0
class MLP(torch.nn.Module):
def __init__(self, dim: int = 256):
super().__init__()
intermediate_dim = dim * 4
self.dense_1 = col_nn.Linear(dim, intermediate_dim)
print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
self.activation = torch.nn.GELU()
self.dense_2 = col_nn.Linear(intermediate_dim, dim)
print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
self.dropout = col_nn.Dropout(0.1)
def forward(self, x):
x = self.dense_1(x)
print_rank_0(f'Output of the first linear layer: {x.shape}')
x = self.activation(x)
x = self.dense_2(x)
print_rank_0(f'Output of the second linear layer: {x.shape}')
x = self.dropout(x)
return x
```
在8个 GPU 上启动 Colossal-AI 并建立模型。
```python
parser = colossalai.get_default_parser()
colossalai.launch(config=CONFIG,
rank=args.rank,
world_size=args.world_size,
local_rank=args.local_rank,
host=args.host,
port=args.port)
m = MLP()
```
我们将会看到 MLP 模型中被划分的参数(如权重)的形状。
```shell
Weight of the first linear layer: torch.Size([128, 512])
Weight of the second linear layer: torch.Size([512, 128])
```
第一个线性层的完整权重形状应该为 `[256, 1024]`. 经过2.5D并行划分后,它在每个 GPU 上变成了 `[128, 512]`
同样地,第二层将权重 `[1024, 256]` 划分为 `[512, 128]`.
我们可以用一些随机输入来运行这个模型。
```python
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.utils import get_current_device
x = torch.randn((16, 256), device=get_current_device())
# partition input
torch.distributed.broadcast(x, src=0)
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP)]
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL)]
x = torch.chunk(x, 2, dim=-1)[gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW)]
print_rank_0(f'Input: {x.shape}')
x = m(x)
```
然后我们可以看到 activation 结果的形状。
```shell
Input: torch.Size([4, 128])
Output of the first linear layer: torch.Size([4, 512])
Output of the second linear layer: torch.Size([4, 128])
```
2.5D并行中的 activation 张量都是同时在$d \times q$行和$q$列分割的。例如,第一个线性层的输出是 `[4, 512]`, 而第二层的输出为 `[4, 128]`
注意2.5D并行使用与2D并行相同的划分方法来处理权重区别在于对输入的划分。


@ -0,0 +1,154 @@
# 3D 张量并行
作者: Zhengda Bian, Yongbin Li
**前置教程**
- [定义配置文件](../basics/define_your_config.md)
- [并行配置](../basics/configure_parallelization.md)
- [1D 张量并行](./1D_tensor_parallel.md)
- [2D 张量并行](./2D_tensor_parallel.md)
**示例代码**
- [ColossalAI-Examples - 3D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel/tensor_parallel_3d.py)
**相关论文**
- [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/pdf/2105.14450.pdf)
## 引言
[3D 张量并行](https://arxiv.org/pdf/2105.14450.pdf) 是一种将神经网络模型的计算并行化,以期望获得最佳通信成本优化的方法。
我们还是以线性层 $Y = XA$ 为例。
给定 $P=q \times q \times q$ 个处理器(必要条件), 如 $q=2$, 我们把输入 $X$ 和权重 $A$ 划分为
$$
\left[\begin{matrix}
X_{000} & X_{001} \\
X_{010} & X_{011} \\
X_{100} & X_{101} \\
X_{110} & X_{111} \end{matrix}
\right]
\text{~and~}
\left[\begin{matrix}
A_{000} & A_{001} & A_{010} & A_{011} \\
A_{100} & A_{101} & A_{110} & A_{111} \end{matrix}
\right]
\text{~respectively,}$$
其中每个 $X_{ijl}$ 和 $A_{lji}$ 都被存储在处理器 $(i,j,l)$ 上, 如下图所示。
<center>
<img src="https://s2.loli.net/2022/02/17/JevO6SED5z4PFdp.png" width = "200" height = "250" />
<img src="https://s2.loli.net/2022/02/17/qvtwjdfNXMAb4nF.png" width = "200" height = "250" />
<img src="https://s2.loli.net/2022/02/17/WFzm2N4IwKf1jXZ.png" width = "200" height = "250" />
<img src="https://s2.loli.net/2022/02/17/r2dZQ4hKxwTuIv6.png" width = "200" height = "250" />
</center>
然后我们在 $(i, 0...q,l)$ 上收集 $X_{ijl}$, 以及在$(0...q, j, l)$ 上收集 $A_{lji}$。
因此,我们在每个处理器 $(i,j,l)$ 上都有 $X_{il}$ 和 $A_{lj}$ 以获得 $X_{il}A_{lj}$。
最后,我们在 $(i, j, 0...q)$ 对结果进行 reduce-scatter 得到 $Y_{ijl}$, 形成
$$
Y=
\left[\begin{matrix}
Y_{000} & Y_{001} \\
Y_{010} & Y_{011} \\
Y_{100} & Y_{101} \\
Y_{110} & Y_{111} \end{matrix}
\right].
$$
我们还需要注意,在后向传播中,我们需要先 all-gather 梯度 $\dot{Y_{ijl}}$,然后 reduce-scatter 梯度 $\dot{X_{il}}=\dot{Y_{ij}}A_{lj}^T$ 和 $\dot{A_{lj}}=X_{il}^T\dot{Y_{ij}}$。
## 效率
给定 $P=q \times q \times q$ 个处理器, 我们展现理论上的计算和内存成本以及基于环形算法的3D张量并行的前向和后向的通信成本。
| 计算 | 内存 (参数) | 内存 (activations) | 通信 (带宽) | 通信 (时延) |
| :-: | :-: | :-: | :-: | :-: |
| $O(1/q^3)$ | $O(1/q^3)$ | $O(1/q^3)$ | $O(6(q-1)/q^3)$ | $O(6(q-1))$ |
## 使用
为了使我们的模型能够实现3D张量并行例如在8个 GPU 上,我们需要配置如下的并行设置。
```python
CONFIG = dict(parallel=dict(
data=1,
pipeline=1,
tensor=dict(size=8, mode='3d'),
))
```
然后 Colossal-AI 会自动对所有来自 `colossalai.nn` 的层应用3D张量并行。
让我们定义一个由两层多层感知器 (MLP) 组成的模型,如下所示。
```python
import colossalai
import colossalai.nn as col_nn
import torch
from colossalai.utils import print_rank_0
class MLP(torch.nn.Module):
def __init__(self, dim: int = 256):
super().__init__()
intermediate_dim = dim * 4
self.dense_1 = col_nn.Linear(dim, intermediate_dim)
print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
self.activation = torch.nn.GELU()
self.dense_2 = col_nn.Linear(intermediate_dim, dim)
print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
self.dropout = col_nn.Dropout(0.1)
def forward(self, x):
x = self.dense_1(x)
print_rank_0(f'Output of the first linear layer: {x.shape}')
x = self.activation(x)
x = self.dense_2(x)
print_rank_0(f'Output of the second linear layer: {x.shape}')
x = self.dropout(x)
return x
```
在8个 GPU 上启动 Colossal-AI 并建立模型。
```python
parser = colossalai.get_default_parser()
colossalai.launch(config=CONFIG,
rank=args.rank,
world_size=args.world_size,
local_rank=args.local_rank,
host=args.host,
port=args.port)
m = MLP()
```
我们将会看到 MLP 模型中被划分的参数(如权重)的形状。
```shell
Weight of the first linear layer: torch.Size([128, 256])
Weight of the second linear layer: torch.Size([512, 64])
```
第一个线性层的完整权重形状应该为 `[256, 1024]`. 经过3D并行划分后它在每个 GPU 上变成了 `[128, 256]`
同样地,第二层将权重 `[1024, 256]` 划分为 `[512, 64]`.
我们可以用一些随机输入来运行这个模型。
```python
from colossalai.context import ParallelMode
from colossalai.core import global_context as gpc
from colossalai.utils import get_current_device
x = torch.randn((16, 256), device=get_current_device())
# partition input
torch.distributed.broadcast(x, src=0)
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_3D_WEIGHT)]
x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_3D_INPUT)]
x = torch.chunk(x, 2, dim=-1)[gpc.get_local_rank(ParallelMode.PARALLEL_3D_OUTPUT)]
print_rank_0(f'Input: {x.shape}')
x = m(x)
```
然后我们可以看到 activation 结果的形状。
```shell
Input: torch.Size([4, 128])
Output of the first linear layer: torch.Size([4, 512])
Output of the second linear layer: torch.Size([4, 128])
```
3D并行中的 activation 张量都是同时在$q^2$行和$q$列分割的。例如,第一个线性层的输出是 `[4, 512]`, 而第二层的输出为 `[4, 128]`
注意虽然这里3D并行的结果与2.5D并行的结果形状相同,但每个划分的内容是不同的。


@ -0,0 +1,40 @@
# 梯度累积
作者: Shenggui Li, Yongbin Li
**前置教程**
- [定义配置文件](../basics/define_your_config.md)
- [在训练中使用Engine和Trainer](../basics/engine_trainer.md)
**示例代码**
- [ColossalAI-Examples Gradient Accumulation](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
## 引言
梯度累积是一种常见的增大训练 batch size 的方式。 在训练大模型时,内存经常会成为瓶颈,并且 batch size 通常会很小如2这导致收敛性无法保证。梯度累积将多次迭代的梯度累加并仅在达到预设迭代次数时更新参数。
## 使用
在 Colossal-AI 中使用梯度累积非常简单,仅需将下列配置添加进 config 文件。其中,整数值代表期望梯度累积的次数。
```python
gradient_accumulation = <int>
```
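在语义上它与下面这段手写梯度累积的朴素 PyTorch 写法类似仅作示意假设model、optimizer、criterion 和 train_dataloader 已按常规方式定义;真正的实现由 Colossal-AI 在 engine 内部完成):
```python
accum_steps = 4  # 对应配置中的 gradient_accumulation
optimizer.zero_grad()
for step, (data, label) in enumerate(train_dataloader):
    loss = criterion(model(data), label) / accum_steps  # 把损失平均到每次累积
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # 每累积 accum_steps 次才更新一次参数
        optimizer.zero_grad()
```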
## 实例
我们提供了一个 [运行实例](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
来展现梯度累积。在这个例子中梯度累积次数被设置为4你可以通过以下命令启动脚本
```shell
python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py
```
你将会看到类似下方的文本输出。这展现了梯度虽然在前3个迭代中被计算但直到最后一次迭代参数才被更新。
```text
iteration 0, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 1, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 2, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 3, first 10 elements of param: tensor([-0.0141, 0.0464, 0.0507, 0.0321, 0.0356, -0.0150, 0.0172, -0.0118, 0.0222, 0.0473], device='cuda:0', grad_fn=<SliceBackward0>)
```


@ -0,0 +1,51 @@
# 梯度裁剪
作者: Boxiang Wang, Haichen Huang, Yongbin Li
**前置教程**
- [定义配置文件](../basics/define_your_config.md)
- [在训练中使用Engine和Trainer](../basics/engine_trainer.md)
**示例代码**
- [ColossalAI-Examples Gradient Clipping](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_clipping)
**相关论文**
- [On the difficulty of training Recurrent Neural Networks](https://arxiv.org/abs/1211.5063)
## 引言
为了加快训练过程和寻求全局最优以获得更好的性能,越来越多的学习率调度器被提出。人们通过控制学习率来调整训练中的下降速度。这使得梯度向量在每一步都能更好地统一。在这种情况下,下降速度可以按预期被控制。
因此,梯度裁剪,一种可以将梯度向量归一化,以将其限制在统一长度的技术,对于那些希望模型性能更好的人来说是不可或缺的。
在使用 Colossal-AI 时,你不必担心实现梯度剪裁,我们以一种有效而方便的方式支持梯度剪裁。你所需要的只是在你的配置文件中增加一个命令。
## 为什么应该使用 Colossal-AI 中的梯度裁剪
我们不建议用户自己编写梯度剪裁因为朴素的梯度剪裁在应用张量并行、流水线并行、MoE 等功能时可能会失败。
根据下图,每个 GPU 只拥有线性层中权重的一部分参数。为了得到线性层权重的梯度向量的正确范数,每个 GPU 中的每个梯度向量的范数应该相加。更复杂的是,偏置的分布不同于权重的分布。通信组在求和运算中有所不同。
(注: 这种情况是旧版本的 2D 并行,在代码中的实现是不一样的。但这是一个很好的例子,能够说明在梯度剪裁中统一所有通信的困难。)
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/KXiJPHt3Dum82cA.png"/>
<figcaption>参数分布</figcaption>
</figure>
不用担心它,因为 Colossal-AI 已经为你处理好。
### 使用
要使用梯度裁剪,只需在配置文件中添加梯度裁剪范数即可。
```python
clip_grad_norm = 1.0
```
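在单卡、不使用张量并行等功能的情况下,这个配置的效果大致相当于在反向传播之后、参数更新之前调用 PyTorch 自带的裁剪函数仅作示意假设loss、model、optimizer 已定义;在并行场景下请使用上面的配置,由 Colossal-AI 处理跨设备的范数计算):
```python
import torch

# 朴素写法:把全部参数的梯度范数限制在 1.0 以内
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```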
### 实例
我们提供了一个展现梯度裁剪的[运行实例](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_clipping)
。在本例中我们将梯度裁剪范数设置为1.0,你可以使用以下命令运行脚本:
```shell
python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 train_with_engine.py
```


@ -0,0 +1,59 @@
# 梯度 Handler
作者: Shenggui Li, Yongbin Li
**前置教程**
- [定义配置文件](../basics/define_your_config.md)
- [在训练中使用Engine和Trainer](../basics/engine_trainer.md)
**示例代码**
- [ColossalAI-Examples Gradient Handler](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)
## 引言
在分布式训练中,每次迭代结束时都需要梯度同步。这很重要,因为我们需要确保在不同的机器中使用相同的梯度更新参数,以便生成的参数都一样。这通常在数据并行中看到,因为在数据并行中的模型是直接复制的。
在 Colossal-AI 中,我们为用户提供了一个接口来定制他们想要如何处理同步。这为实现新的并行方法等情况带来了灵活性。
当梯度 Handler 被使用时, PyTorch 的 `DistributedDataParallel` 将不再被使用,因为它会自动同步梯度.
## 定制你的梯度 Handler
要实现定制的梯度Handler需要遵循以下步骤。
1. 继承Colossal-AI中的 `BaseGradientHandler`
2. 将梯度Handler注册进 `GRADIENT_HANDLER`
3. 实现 `handle_gradient`
```python
from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine.gradient_handler import BaseGradientHandler
@GRADIENT_HANDLER.register_module
class MyGradientHandler(BaseGradientHandler):
def handle_gradient(self):
do_something()
```
## 使用
要使用梯度 Handler需要在配置文件中指定梯度 Handler。梯度 Handler 将自动构建并连接到 Engine。
```python
gradient_handler = [dict(type='MyGradientHandler')]
```
### 实例
我们提供了一个 [运行实例](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)
展现梯度 Handler 的使用. 在这个例子中,我们使用 `DataParallelGradientHandler` 而不是 PyTorch 的
`DistributedDataParallel` 实现数据并行.
```shell
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py
```


@ -0,0 +1,344 @@
# 自动混合精度训练 (AMP)
作者: Chuanrui Wang, Shenggui Li, Yongbin Li
**前置教程**
- [定义配置文件](../basics/define_your_config.md)
- [在训练中使用Engine和Trainer](../basics/engine_trainer.md)
**示例代码**
- [ColossalAI-Examples AMP](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp)
**相关论文**
- [Accelerating Scientific Computations with Mixed Precision Algorithms](https://arxiv.org/abs/0808.2794)
## 引言
AMP 代表自动混合精度训练。
在 Colossal-AI 中, 我们结合了混合精度训练的不同实现:
1. torch.cuda.amp
2. apex.amp
3. naive amp
| Colossal-AI | 支持张量并行 | 支持流水并行 | fp16范围 |
| ----------- | ----------------------- | ------------------------- | ----------- |
| AMP_TYPE.TORCH | ✅ | ❌ | 在前向和反向传播期间模型参数、激活和梯度向下转换至fp16 |
| AMP_TYPE.APEX | ❌ | ❌ | 更细粒度,我们可以选择 opt_level O0, O1, O2, O3 |
| AMP_TYPE.NAIVE | ✅ | ✅ | 模型参数、前向和反向操作全都向下转换至fp16 |
前两个依赖于 PyTorch (1.6及以上) 和 NVIDIA Apex 的原始实现。最后一种方法类似 Apex O2。在这些方法中Apex-AMP 与张量并行不兼容。这是因为张量是以张量并行的方式在设备之间拆分的因此需要在不同的进程之间进行通信以检查整个模型权重中是否出现inf或nan。我们修改了torch amp实现使其现在与张量并行兼容。
> ❌️ fp16与ZeRO配置不兼容
>
> ⚠️ 流水并行目前仅支持naive amp
我们建议使用 torch AMP因为在不使用流水并行时它通常比 NVIDIA AMP 提供更好的准确性。
## 目录
在本教程中,我们将介绍:
1. AMP 介绍
2. Colossal-AI 中的 AMP
3. 练习实例
## AMP 介绍
自动混合精度训练是混合 FP16 和 FP32 训练。
半精度浮点格式FP16具有较低的算法复杂度和较高的计算效率。此外FP16 仅需要 FP32 所需的一半存储空间,并节省了内存和网络带宽,从而为大 batch size 和大模型提供了更多内存。
然而,还有其他操作,如缩减,需要 FP32 的动态范围,以避免数值溢出/下溢。因此,我们引入自动混合精度,尝试将每个操作与其相应的数据类型相匹配,这可以减少内存占用并提高训练效率。
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/URzLJ3MPeDQbtck.png"/>
<figcaption>AMP 示意图 (图片来自 <a href="https://arxiv.org/abs/2108.05818">PatrickStar 论文</a>)</figcaption>
</figure>
## Colossal-AI 中的 AMP
我们支持三种 AMP 训练方法,并允许用户在没有改变代码的情况下使用 AMP 进行训练。只需在配置文件中添加'fp16'配置即可使用 AMP。
```python
from colossalai.amp import AMP_TYPE
# 使用 Torch AMP
fp16=dict(
mode = AMP_TYPE.TORCH
)
# 使用 naive AMP
fp16=dict(
mode = AMP_TYPE.NAIVE
)
# 使用 Nvidia Apex AMP
fp16=dict(
mode = AMP_TYPE.APEX
)
```
> 这些是最低配置,完整配置将在后面的部分中说明
### AMP 模块化
AMP 模块设计为完全模块化,可以独立使用。如果你想在你的代码库中只使用 AMP 而不使用`colossalai.initialize`,你可以导入`colossalai.amp.convert_to_amp`。
```python
from colossalai.amp import AMP_TYPE
# 使用torch amp的例子
model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
optimizer,
criterion,
AMP_TYPE.TORCH)
```
### Torch AMP 配置
```python
from colossalai.amp import AMP_TYPE
fp16=dict(
mode=AMP_TYPE.TORCH,
# 下列是grad scaler的默认值
init_scale=2.**16,
growth_factor=2.0,
backoff_factor=0.5,
growth_interval=2000,
enabled=True
)
```
可选参数:
- init_scale(float, optional, default=2.**16): 初始缩放因子;
- growth_factor(float, optional, default=2.0): 如果在``growth_interval``连续迭代过程中没有出现 inf/NaN 梯度,则在`update`中乘以比例系数;
- backoff_factor(float, optional, default=0.5): 如果在迭代中出现 inf/NaN 梯度,则在`update`中乘以比例系数;
- growth_interval(int, optional, default=2000): 在指定次数的连续迭代中,若没有出现 inf/NaN 梯度,则乘以``growth_factor``.
- enabled(bool, optional, default=True): ``False``则使梯度缩放无效,`step` 仅调用底层的 ``optimizer.step()``, 其他方法成为空操作。
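这些字段对应 PyTorch 中 `torch.cuda.amp.GradScaler` 的同名构造参数;如果想直观了解它们的含义,可以参考下面这段纯 PyTorch 的示意:
```python
import torch

# 与上面 fp16 配置中的字段相对应的 GradScaler 参数
scaler = torch.cuda.amp.GradScaler(
    init_scale=2.**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
    enabled=True,
)
```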
### Apex AMP 配置
对于这种模式,我们依靠 Apex 实现混合精度训练。我们支持这个插件,因为它允许对混合精度的粒度进行更精细的控制。
例如, O2 水平 (优化器水平2) 将保持 batch normalization 为 FP32。
如果你想了解更多细节,请参考 [Apex Documentation](https://nvidia.github.io/apex/)。
```python
from colossalai.amp import AMP_TYPE
fp16 = dict(
mode=AMP_TYPE.APEX,
# 下列是默认值
enabled=True,
opt_level='O1',
cast_model_type=None,
patch_torch_functions=None,
keep_batchnorm_fp32=None,
master_weights=None,
loss_scale=None,
cast_model_outputs=None,
num_losses=1,
verbosity=1,
min_loss_scale=None,
max_loss_scale=16777216.0
)
```
参数:
- enabled(bool, optional, default=True): False 会使所有 AMP 调用成为空操作, 程序将会像没有使用 AMP 一样运行。
- opt_level(str, optional, default="O1" ): 纯精度或混合精度优化水平。可选值 “O0”, “O1”, “O2”, and “O3”, 详细解释见上方 Apex AMP 文档。
- num_losses(int, optional, default=1): 选择提前告知 AMP 您计划使用多少次损失/反向计算。
当`amp.scale_loss`与 loss_id 参数一起使用时,使 AMP 在每次损失/反向计算时使用不同的损失比例,这可以提高稳定性。如果 num_losses 被设置为1AMP 仍支持多次损失/反向计算,但对他们都使用同一个全局损失比例。
- verbosity(int, default=1): 设置为0抑制 AMP 相关输出。
- min_loss_scale(float, default=None): 为可通过动态损失比例选择的损失比例值设置下限。默认值 “None” 意味着不设置任何下限。如果不使用动态损失比例,则忽略 min_loss_scale。
- max_loss_scale(float, default=2.**24): 为可通过动态损失比例选择的损失比例值设置上限。如果不使用动态损失比例,则忽略 max_loss_scale。
目前,管理纯精度或混合精度训练的幕后属性有以下几种:
cast_model_type, patch_torch_functions, keep_batchnorm_fp32, master_weights, loss_scale.
一旦 opt_level 被确定,它们是可选的可覆盖属性
- cast_model_type: 将模型的参数和缓冲区强制转换为所需的类型。
- patch_torch_functions: 补全所有的 Torch 函数和张量方法以便在FP16中执行张量核心友好的操作如 GEMMs 和卷积,以及在 FP32 中执行任何受益于 FP32 精度的操作。
- keep_batchnorm_fp32: 为了提高精度并启用 cudnn batchnorm (这会提高性能),在 FP32 中保留 batchnorm 权重通常是有益的,即使模型的其余部分是 FP16。
- master_weights: 保持 FP32 主权重以配合任何 FP16 模型权重。 FP32 主权重由优化器分级,以提高精度和捕捉小梯度。
- loss_scale: 如果 loss_scale 是一个浮点数,则使用这个值作为静态(固定)的损失比例。如果 loss_scale 是字符串 "dynamic",则随着时间的推移自适应地调整损失比例。动态损失比例调整由 AMP 自动执行。
### Naive AMP 配置
在 Naive AMP 模式中, 我们实现了混合精度训练,同时保持了与复杂张量和流水并行的兼容性。该 AMP 模式将所有操作转为 FP16 。下列代码块展示了该模式的`config.py`。
```python
from colossalai.amp import AMP_TYPE
fp16 = dict(
mode=AMP_TYPE.NAIVE,
# below are the default values
log_num_zeros_in_grad=False,
initial_scale=2 ** 32,
min_scale=1,
growth_factor=2,
backoff_factor=0.5,
growth_interval=1000,
hysteresis=2
)
```
Naive AMP 的默认参数:
- log_num_zeros_in_grad(bool): 返回0值梯度的个数.
- initial_scale(int): gradient scaler 的初始值
- growth_factor(int): loss scale 的增长率
- backoff_factor(float): loss scale 的下降率
- hysterisis(int): 动态 loss scaling 的延迟偏移
- max_scale(int): loss scale 的最大允许值
- verbose(bool): 如果被设为`True`,将打印调试信息
当使用`colossalai.initialize`时, 首先需要实例化一个模型、一个优化器和一个标准。将输出模型转换为内存消耗较小的 AMP 模型。如果您的输入模型已经太大,无法放置在 GPU 中,请使用`dtype=torch.float16`实例化你的模型。或者请尝试更小的模型,或尝试更多的并行化训练技术!
## 实例
我们提供了一个 [运行实例](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp)
展现如何在 Colossal-AI 使用 AMP。在该例程中我们使用 Torch AMP, 但提供的配置文件也适用于所有 AMP 模式.
### 步骤 1. 创建配置文件
创建一个`config.py`文件并添加`fp16`配置.
```python
# in config.py
from colossalai.amp import AMP_TYPE
BATCH_SIZE = 128
DROP_RATE = 0.1
NUM_EPOCHS = 300
fp16 = dict(
mode=AMP_TYPE.TORCH,
)
clip_grad_norm = 1.0
```
### 步骤 2. 在 train_with_engine.py 导入相关库
创建`train_with_engine.py`并导入必要依赖. 请记得通过命令`pip install timm scipy`安装`scipy`和`timm`。
```python
import os
import colossalai
import torch
from pathlib import Path
from colossalai.core import global_context as gpc
from colossalai.logging import get_dist_logger
from colossalai.utils import get_dataloader
from colossalai.trainer import Trainer, hooks
from colossalai.nn.lr_scheduler import LinearWarmupLR
from timm.models import vit_base_patch16_224
from torchvision import datasets, transforms
```
### 步骤 3. 初始化分布式环境
我们需要初始化分布式环境。为了快速演示,我们使用`launch_from_torch`。你可以参考 [Launch Colossal-AI](../basics/launch_colossalai.md)
使用其他初始化方法。
```python
# 初始化分布式设置
parser = colossalai.get_default_parser()
args = parser.parse_args()
# launch from torch
colossalai.launch_from_torch(config=args.config)
```
### 步骤 4. 创建训练组件
构建你的模型、优化器、损失函数、学习率调整器和数据加载器。注意数据集的路径从环境变量 `DATA` 获得。你可以在你的机器上通过 `export DATA=/path/to/data` 设置该环境变量,或者直接把代码中的 `Path(os.environ['DATA'])` 改为你想要的路径。数据将会被自动下载到该路径。
```python
# build model
model = vit_base_patch16_224(drop_rate=0.1)
# build dataloader
train_dataset = datasets.Caltech101(
root=Path(os.environ['DATA']),
download=True,
transform=transforms.Compose([
transforms.Resize(256),
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
Gray2RGB(),
transforms.Normalize([0.5, 0.5, 0.5],
[0.5, 0.5, 0.5])
]))
train_dataloader = get_dataloader(dataset=train_dataset,
shuffle=True,
batch_size=gpc.config.BATCH_SIZE,
num_workers=1,
pin_memory=True,
)
# build optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=0.1)
# build loss
criterion = torch.nn.CrossEntropyLoss()
# lr_scheduler
lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=gpc.config.NUM_EPOCHS)
```
### 步骤 5. 插入 AMP
调用 `colossalai.initialize` 将所有训练组件转为为FP16模式.
```python
engine, train_dataloader, _, _ = colossalai.initialize(
model, optimizer, criterion, train_dataloader,
)
```
### 步骤 6. 使用 Engine 训练
使用Engine构建一个普通的训练循环
```python
engine.train()
for epoch in range(gpc.config.NUM_EPOCHS):
    for img, label in train_dataloader:
img = img.cuda()
label = label.cuda()
engine.zero_grad()
output = engine(img)
loss = engine.criterion(output, label)
engine.backward(loss)
engine.step()
lr_scheduler.step()
```
### 步骤 7. 启动训练脚本
使用下列命令启动训练脚本,你可以改变 `--nproc_per_node` 以使用不同数量的 GPU。
```shell
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py --config config/config_AMP_torch.py
```


@ -0,0 +1,43 @@
# NVMe offload
作者: Hongxin Liu
**前置教程:**
- [基于Chunk内存管理的零冗余优化器 (ZeRO)](../features/zero_with_chunk.md)
## 引言
如果模型具有`N`个参数,在使用 Adam 时,优化器状态具有`8N`个参数。对于十亿规模的模型,优化器状态至少需要 32 GB 内存。GPU 显存限制了我们可以训练的模型规模,这称为 GPU 显存墙。如果我们将优化器状态 offload 到磁盘,就可以突破 GPU 显存墙。
我们实现了一个用户友好且高效的异步 Tensor I/O 库:[TensorNVMe](https://github.com/hpcaitech/TensorNVMe)。有了这个库,我们可以简单地实现 NVMe offload。
> 该库与各种磁盘HDD、SATA SSD 和 NVMe SSD兼容。由于 HDD 或 SATA SSD 的 I/O 带宽较低,建议仅在 NVMe 磁盘上使用此库。
在优化参数时,我们可以将优化过程分为三个阶段:读取、计算和 offload。我们以流水线的方式执行优化过程这可以重叠计算和 I/O。
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/08/16/CvRnowrsNyB4hza.jpg"/>
<figcaption>优化过程</figcaption>
</figure>
## 使用
首先,请确保您安装了 [TensorNVMe](https://github.com/hpcaitech/TensorNVMe):
```shell
pip install packaging
pip install tensornvme
```
我们为 Adam ([CPUAdam](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.nn.optimizer.cpu_adam.html) 和 [HybridAdam](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.nn.optimizer.hybrid_adam.html)) 实现了优化器状态的 NVMe offload。
```python
from colossalai.nn.optimizer import CPUAdam, HybridAdam
optimizer = HybridAdam(model.parameters(), lr=1e-3, nvme_offload_fraction=1.0, nvme_offload_dir='./')
```
`nvme_offload_fraction` 是要 offload 到 NVMe 的优化器状态的比例。 `nvme_offload_dir` 是保存 NVMe offload 文件的目录。如果 `nvme_offload_dir``None`,将使用随机临时目录。
它与 ColossalAI 中的所有并行方法兼容。


@ -0,0 +1,158 @@
# 流水并行
作者: Guangyang Lu, Hongxin Liu, Yongbin Li
**前置教程**
- [定义配置文件](../basics/define_your_config.md)
- [在训练中使用Engine和Trainer](../basics/engine_trainer.md)
- [并行配置](../basics/configure_parallelization.md)
**示例代码**
- [ColossalAI-Examples ResNet with pipeline](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/pipeline_parallel)
**相关论文**
- [Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training](https://arxiv.org/abs/2110.14883)
- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473)
- [GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://arxiv.org/abs/1811.06965)
## Quick Introduction
In this tutorial, you will learn how to use pipeline parallel. In Colossal-AI, we use the 1F1B pipeline introduced by NVIDIA. Since ViT and ImageNet would be too large for this case, we use ResNet and CIFAR as an example.
## Table of Contents
In this tutorial we will cover:
1. Introduction of the 1F1B pipeline;
2. Usage of non-interleaved and interleaved schedules;
3. Training ResNet with pipeline.
## Introduction to the 1F1B Pipeline
First of all, we will introduce GPipe to you for a better understanding.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/OAucPF6mWYynUtV.png"/>
<figcaption>Figure 1: GPipe. This figure is from the <a href="https://arxiv.org/pdf/2104.04473.pdf">Megatron-LM</a> paper.</figcaption>
</figure>
As you can see, for GPipe, the backward computation of a batch starts only after the forward computation of all its microbatches has finished.
In general, 1F1B (one forward pass followed by one backward pass) is more efficient than GPipe in terms of memory, or in terms of both memory and time. There are two schedules for the 1F1B pipeline, the non-interleaved and the interleaved, illustrated in the figure below.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/iJrVkp2HLcahjsT.png"/>
<figcaption>Figure 2: this figure is from the <a href="https://arxiv.org/pdf/2104.04473.pdf">Megatron-LM</a> paper. The top shows the default non-interleaved schedule, and the bottom shows the interleaved schedule.</figcaption>
</figure>
### Non-interleaved Schedule
The non-interleaved schedule can be divided into three stages. The first is the warm-up stage, where workers perform differing numbers of forward passes. In the following stage, workers perform one forward pass followed by one backward pass. Workers finish the remaining backward passes in the last stage.
This mode is more memory-efficient than GPipe. However, it takes the same amount of time to finish a round of computation as GPipe.
### Interleaved Schedule
This schedule requires **the number of microbatches to be an integer multiple of the number of pipeline stages**.
In this schedule, each device can perform computation for multiple subsets of layers (called model chunks) instead of a single contiguous set of layers. Concretely, device 1 used to hold layers 1-4 and device 2 layers 5-8, and so on; now device 1 holds layers 1, 2, 9, 10, device 2 holds layers 3, 4, 11, 12, and so on.
With this scheme, each device in the pipeline is assigned multiple pipeline stages, and each pipeline stage involves less computation.
This mode is both memory-efficient and time-efficient.
## Using the Schedule
In Colossal-AI, we provide the non-interleaved (`PipelineSchedule`) and the interleaved (`InterleavedPipelineSchedule`) schedules.
You just need to set `NUM_MICRO_BATCHES` in your config file, and also set `NUM_CHUNKS` if you want to use the interleaved schedule. If you know the shape of each pipeline stage's output tensor in advance and the shapes are all the same, you can set `tensor_shape` to further reduce communication. Otherwise, you can ignore `tensor_shape`, and the shape will be exchanged between pipeline stages automatically. We will generate a suitable schedule based on the config you provide to support pipeline parallel training.
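As a minimal sketch, such a config file could contain the fields below; the concrete values are illustrative assumptions rather than recommendations.
```python
# config.py (illustrative values)
NUM_MICRO_BATCHES = 4          # number of micro batches per batch
NUM_CHUNKS = 2                 # number of model chunks per device; only needed for the interleaved schedule
parallel = dict(pipeline=2)    # two pipeline stages
```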
## Train ResNet with Pipeline
Let's first build the `ResNet` model with Colossal's `PipelinableContext`:
```python
import os
from typing import Callable, List, Optional, Type, Union
import torch
import torch.nn as nn
import colossalai
import colossalai.nn as col_nn
from colossalai.core import global_context as gpc
from colossalai.logging import disable_existing_loggers, get_dist_logger
from colossalai.trainer import Trainer, hooks
from colossalai.utils import MultiTimer, get_dataloader
from colossalai.context import ParallelMode
from colossalai.pipeline.pipelinable import PipelinableContext
from titans.dataloader.cifar10 import build_cifar
from torchvision.models import resnet50
from torchvision.models.resnet import BasicBlock, Bottleneck, conv1x1
# Define some config
BATCH_SIZE = 64
NUM_EPOCHS = 2
NUM_CHUNKS = 1
CONFIG = dict(NUM_MICRO_BATCHES=4, parallel=dict(pipeline=2))
# Train
disable_existing_loggers()
parser = colossalai.get_default_parser()
args = parser.parse_args()
colossalai.launch_from_torch(backend=args.backend, config=CONFIG)
logger = get_dist_logger()
pipelinable = PipelinableContext()
# build model
with pipelinable:
model = resnet50()
```
Provide the execution sequence for partitioning: modules are given directly by their names, while some functions need to be added manually.
```python
exec_seq = [
'conv1', 'bn1', 'relu', 'maxpool', 'layer1', 'layer2', 'layer3', 'layer4', 'avgpool',
(lambda x: torch.flatten(x, 1), "behind"), 'fc'
]
pipelinable.to_layer_list(exec_seq)
```
Partition the model into pipeline stages.
```python
model = pipelinable.partition(NUM_CHUNKS, gpc.pipeline_parallel_size, gpc.get_local_rank(ParallelMode.PIPELINE))
```
We use `Trainer` to train `ResNet`:
```python
# build criterion
criterion = nn.CrossEntropyLoss()
# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# build dataloader
root = os.environ.get('DATA', './data')
train_dataloader, test_dataloader = build_cifar(BATCH_SIZE, root, padding=4, crop=32, resize=32)
lr_scheduler = col_nn.lr_scheduler.LinearWarmupLR(optimizer, NUM_EPOCHS, warmup_steps=1)
engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model, optimizer, criterion,
train_dataloader, test_dataloader,
lr_scheduler)
timer = MultiTimer()
trainer = Trainer(engine=engine, timer=timer, logger=logger)
hook_list = [
hooks.LossHook(),
hooks.AccuracyHook(col_nn.metric.Accuracy()),
hooks.LogMetricByEpochHook(logger),
hooks.LRSchedulerHook(lr_scheduler, by_epoch=True)
]
trainer.fit(train_dataloader=train_dataloader,
epochs=NUM_EPOCHS,
test_dataloader=test_dataloader,
test_interval=1,
hooks=hook_list,
display_progress=True)
```
We use `2` pipeline stages, and the batch will be split into `4` micro batches.
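Assuming the code above is saved as `train_resnet_pipeline.py` (the file name is an assumption), a launch on 2 GPUs could look roughly like this:
```shell
python -m torch.distributed.launch --nproc_per_node 2 --master_addr localhost --master_port 29500 train_resnet_pipeline.py
```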

@ -0,0 +1,261 @@
# Zero Redundancy Optimizer with Chunk-based Memory Management (ZeRO)
Author: [Hongxiu Liu](https://github.com/ver217), [Jiarui Fang](https://github.com/feifeibear), [Zijian Ye](https://github.com/ZijianYY)
**Prerequisite:**
- [Define Your Configuration](../basics/define_your_config.md)
**Example Code**
- [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt)
**Related Paper**
- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
- [PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management](https://arxiv.org/abs/2108.05818)
## Introduction
The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing so, memory efficiency is greatly improved compared to classic data parallelism, while the computational granularity and communication efficiency are retained.
1. **Shard the optimizer states**: the optimizer states (e.g., for the [Adam optimizer](https://arxiv.org/abs/1412.6980), the 32-bit weights and the first and second moment estimates) are partitioned across the processes so that each process only updates its own partition.
2. **Shard the gradients**: after the gradients are reduced within the data-parallel process group, the gradient tensors are also partitioned so that each process only stores the gradients corresponding to its partition of the optimizer states. Note that Colossal-AI converts the gradients to FP32 format to participate in the parameter update.
3. **Shard the parameters**: the 16-bit model parameters are partitioned across the processes of a data-parallel group.
4. **[Gemini](../advanced_tutorials/meet_gemini.md)**: a dynamic heterogeneous memory space manager for parameters, gradients, and optimizer states.
In addition, we will introduce the Zero Redundancy Optimizer with chunk-based memory management.
When using ZeRO, we distribute the model by sharding the parameters. The advantage of this approach is that the memory load of every node is perfectly balanced, but it also has several drawbacks. First, a temporary memory buffer has to be allocated for communication and released afterwards, which causes memory fragmentation. Second, communicating at the granularity of individual tensors leads to poor utilization of the network bandwidth; generally speaking, the longer the transmitted message, the higher the bandwidth utilization.
With the Chunk mechanism introduced in ColossalAI v0.1.8, we can improve the efficiency of ZeRO. We store a set of parameters that are consecutive in the order of computation into a chunk (a chunk is a contiguous memory space), and all chunks have the same size. Organizing memory in chunks ensures efficient utilization of the network bandwidth between PCI-e and GPU-GPU, reduces the number of communications, and avoids potential memory fragmentation.
Before v0.1.8, ZeRO had a high communication cost when gathering parameters: if a parameter was used multiple times in several consecutive computations, it would be communicated multiple times, which is inefficient. This situation is very common when using checkpointing, where the forward pass of the checkpointed layers is recomputed during the backward pass. In this case, ZeRO is not efficient.
Take GPT as an example: its checkpointing is applied to every GPT block, and each GPT block contains a self-attention layer and an MLP layer. During the backward pass, the forward of the self-attention layer and the MLP layer is computed in turn, followed by the backward of the MLP layer and the self-attention layer. If we use the Chunk mechanism and put the self-attention layer and the MLP layer in the same chunk, no extra communication is needed during the backward of each GPT block.
Besides, the communication and memory movement of small tensors cannot fully utilize the NVLINK and PCIE bandwidth, and every communication or memory movement comes with a kernel launch overhead. With chunks, multiple small-tensor communications and memory movements are turned into a single large-tensor communication and memory movement, which improves bandwidth utilization and reduces the kernel launch overhead.
We also provide a lightweight chunk search mechanism to help users automatically find the chunk size with minimal memory fragmentation.
## Usage
### GeminiDDP
We will use `GeminiDDP` to run ZeRO with chunk-based memory management. This is our newly wrapped torch.Module that uses ZeRO-DP and Gemini, where ZeRO is used for parallelism and Gemini is used for memory management.
Also make sure that your model is initialized within the context of `ColoInitContext`.
```python
with ColoInitContext(device='cpu', default_dist_spec=default_dist_spec, default_pg=default_pg):
model = gpt2_medium(checkpoint=True)
```
Define the model parameters as follows:
```python
chunk_manager = init_chunk_manager(model=module,
init_device=device,
hidden_dim=hidden_dim,
search_range_mb=search_range_mb,
min_chunk_size_mb=min_chunk_size_mb)
gemini_manager = GeminiManager(placement_policy, chunk_manager)
model = ZeroDDP(model, gemini_manager)
```
`hidden_dim` is the hidden dimension of the DNN. Users can provide this argument to speed up the search. If you don't know this argument before training, that is also fine; we will use a default value of 1024. `min_chunk_size_mb` is the minimum chunk size in megabytes. If the total size of the parameters is still smaller than the minimum chunk size, all parameters will be compacted into a single small chunk.
Initialize the optimizer.
```python
optimizer = GeminiAdamOptimizer(model, lr=1e-3, initial_scale=2**5)
```
Train:
```python
optimizer.zero_grad()
outputs = model(input_ids, attn_mask)
loss = criterion(outputs, input_ids)
optimizer.backward(loss)
optimizer.step()
```
> ⚠️ Note: Please do not use `loss.backward()`; the standard way to write it is `optimizer.backward(loss)`.
### Train GPT
In this example, we use `Hugging Face Transformers` and take `GPT2 Medium` as an example. You must install `transformers` before running this example.
For simplicity, we just use randomly generated data here.
We only need to import `GPT2LMHeadModel` from `Huggingface transformers` to define our model, so users do not need to define or modify the model themselves, which makes it easier to use.
```python
class GPTLMModel(nn.Module):
def __init__(self,
hidden_size=768,
num_layers=12,
num_attention_heads=12,
max_seq_len=1024,
vocab_size=50257,
checkpoint=False):
super().__init__()
self.checkpoint = checkpoint
self.model = GPT2LMHeadModel(
GPT2Config(n_embd=hidden_size,
n_layer=num_layers,
n_head=num_attention_heads,
n_positions=max_seq_len,
n_ctx=max_seq_len,
vocab_size=vocab_size))
if checkpoint:
self.model.gradient_checkpointing_enable()
def forward(self, input_ids, attention_mask):
return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
def gpt2_medium(checkpoint=False):
return GPTLMModel(hidden_size=1024, num_layers=24, num_attention_heads=16, checkpoint=checkpoint)
```
Define the loss function:
```python
class GPTLMLoss(nn.Module):
def __init__(self):
super().__init__()
self.loss_fn = nn.CrossEntropyLoss()
def forward(self, logits, labels):
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
return self.loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
```
Define tensor parallelism and the parameter sharding strategy:
```python
def tensor_parallelize(model: torch.nn.Module, pg: ProcessGroup):
for mn, module in model.named_modules():
for pn, param in module.named_parameters(recurse=False):
if hasattr(param, 'visited'):
continue
param.set_dist_spec(ReplicaSpec())
if 'mlp.c_fc' in mn:
if 'weight' in pn or 'bias' in pn:
split_param_col_tp1d(param, pg)
param.compute_spec.set_output_replicate(False)
else:
param.set_dist_spec(ReplicaSpec())
elif 'mlp.c_proj' in mn:
if 'weight' in pn:
split_param_row_tp1d(param, pg)
else:
param.set_dist_spec(ReplicaSpec())
elif 'wte' in mn or 'wpe' in mn:
split_param_col_tp1d(param, pg)
elif 'c_attn' in mn or 'c_proj' in mn:
split_param_col_tp1d(param, pg)
else:
param.set_dist_spec(ReplicaSpec())
param.visited = True
def split_param_single_dim_tp1d(dim: int, param: ColoParameter, pg: ProcessGroup):
spec = (ShardSpec([dim], [pg.tp_world_size()]), ComputeSpec(ComputePattern.TP1D))
param.set_tensor_spec(*spec)
def split_param_row_tp1d(param: ColoParameter, pg: ProcessGroup):
split_param_single_dim_tp1d(0, param, pg)
def split_param_col_tp1d(param: ColoParameter, pg: ProcessGroup):
split_param_single_dim_tp1d(-1, param, pg)
```
Define a model which uses Gemini + ZeRO DDP:
```python
def gemini_zero_dpp(model: torch.nn.Module, pg: ProcessGroup, placement_policy: str = "auto"):
    cai_version = colossalai.__version__
    if version.parse(cai_version) > version.parse("0.1.10"):
        from colossalai.nn.parallel import GeminiDDP
        model = GeminiDDP(model,
                          device=get_current_device(),
                          placement_policy=placement_policy,
                          pin_memory=True,
                          search_range_mb=32)
    elif version.parse(cai_version) <= version.parse("0.1.10") and version.parse(cai_version) >= version.parse("0.1.9"):
        from colossalai.gemini import ChunkManager, GeminiManager
        chunk_size = ChunkManager.search_chunk_size(model, 64 * 1024**2, 32)
        chunk_manager = ChunkManager(chunk_size,
                                     pg,
                                     enable_distributed_storage=True,
                                     init_device=GeminiManager.get_default_device(placement_policy))
        gemini_manager = GeminiManager(placement_policy, chunk_manager)
        model = ZeroDDP(model, gemini_manager)
    else:
        raise NotImplementedError(f"CAI version {cai_version} is not supported")
    return model
```
Since we pre-train GPT in this example, we only use a simple language model loss.
Write a function to get random inputs:
```python
def get_data(batch_size, seq_len, vocab_size):
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len), device=torch.cuda.current_device())
attention_mask = torch.ones_like(input_ids)
return input_ids, attention_mask
```
Finally, we can define our training loop:
```python
def main():
args = parse_args()
BATCH_SIZE = 8
SEQ_LEN = 1024
VOCAB_SIZE = 50257
NUM_STEPS = 10
colossalai.launch_from_torch(config={})
# build criterion
criterion = GPTLMLoss()
torch.manual_seed(123)
default_pg = ProcessGroup(tp_degree=args.tp_degree)
default_dist_spec = ShardSpec([-1], [args.tp_degree]) if args.shardinit else None
# build GPT model
with ColoInitContext(device='cpu', default_dist_spec=default_dist_spec, default_pg=default_pg):
model = gpt2_medium(checkpoint=True)
pg = default_pg
# Tensor Parallelism (TP)
tensor_parallelize(model, pg)
# Gemini + ZeRO DP, Note it must be used after TP
model = gemini_zero_dpp(model, pg, args.placement)
# build optimizer
optimizer = GeminiAdamOptimizer(model, lr=1e-3, initial_scale=2**5)
numel = sum([p.numel() for p in model.parameters()])
get_tflops_func = partial(get_tflops, numel, BATCH_SIZE, SEQ_LEN)
torch.cuda.synchronize()
model.train()
for n in range(NUM_STEPS):
# we just use randomly generated data here
input_ids, attn_mask = get_data(BATCH_SIZE, SEQ_LEN, VOCAB_SIZE)
optimizer.zero_grad()
outputs = model(input_ids, attn_mask)
loss = criterion(outputs, input_ids)
optimizer.backward(loss)
optimizer.step()
torch.cuda.synchronize()
```
> ⚠️ 注意如果你使用Gemini模块的话请不要使用我们之前提到过的[梯度累加](../features/gradient_accumulation.md)。
完整的例子代码可以在 [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt). 获得。

@ -0,0 +1,36 @@
# Installation
## Install from PyPI
You can install Colossal-AI directly from PyPI with the following command.
```shell
pip install colossalai
```
If you want to install and enable the PyTorch extensions at the same time, you can add `CUDA_EXT=1`. If not, the PyTorch extensions will be built automatically at runtime.
```shell
CUDA_EXT=1 pip install colossalai
```
## Install from Source
> This document is kept in line with the main branch of the repository. If you encounter any problem, feel free to raise an issue :)
```shell
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
# install colossalai
pip install .
```
If you don't want to install and enable CUDA kernel fusion (compulsory installation when using fused optimizers):
```shell
NO_CUDA_EXT=1 pip install .
```

@ -0,0 +1,10 @@
# Reading Roadmap
Colossal-AI provides you with a collection of parallel training components. We aim to make developing distributed deep learning models as easy as writing a single-GPU model. ColossalAI provides easy-to-use APIs to help you kick off your training process. To better understand how ColossalAI works, we recommend you read this documentation in the following order.
- If you are not familiar with distributed systems or have never used Colossal-AI, you can first browse the `Concepts` section to learn what we are trying to achieve and pick up some background knowledge on distributed training.
- Next, you can follow the `Basics` tutorials. This section covers the details of how to use Colossal-AI.
- Then it's time to try it out yourself! The `Features` section helps you use Colossal-AI to accelerate your model training. We provide a codebase for each tutorial. These tutorials cover the basic usage of Colossal-AI to realize simple functions such as data parallelism and mixed precision training.
- Finally, if you wish to apply more sophisticated techniques, such as how to run hybrid parallelism on GPT-3, come to the `Advanced Tutorials` section to learn how to build your own model!
**We always welcome suggestions and discussions from the community. If you run into any problem, we will be more than happy to help you. You can raise an [issue](https://github.com/hpcaitech/ColossalAI/issues) on GitHub or open a discussion topic in the [forum](https://github.com/hpcaitech/ColossalAI/discussions).**

@ -0,0 +1,28 @@
# Quick Demo
Colossal-AI is an integrated large-scale deep learning system with efficient parallelization techniques. It can accelerate model training on distributed systems with multiple GPUs by applying parallelization techniques, and it can also run on systems with only one GPU. Below are quick demos showing how to use Colossal-AI.
## Single GPU
Colossal-AI can be used to train deep learning models on systems with only one GPU and achieve baseline performance. We provide an example of [training ResNet on the CIFAR10 dataset](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/resnet) which requires only one GPU.
You can find this example in [ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI-Examples). Detailed instructions can be found in its `README.md`.
## Multiple GPUs
Colossal-AI can be used to train deep learning models on distributed systems with multiple GPUs and accelerate the training process drastically by applying efficient parallelization techniques. We provide several parallelization techniques for you to try out.
#### 1. Data Parallel
You can use the same [ResNet example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/resnet) as in the single-GPU demo above. By setting `--nproc_per_node` to the number of GPUs on your machine, the example becomes data parallel.
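A data-parallel launch might look like the following sketch; the script name and config path are assumptions, so check the example's `README.md` for the exact invocation.
```shell
# Replace 4 with the number of GPUs on your machine.
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train.py --config config.py
```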
#### 2. Hybrid Parallel
Hybrid parallel includes data, tensor, and pipeline parallelism. In Colossal-AI, we support different types of tensor parallelism (i.e. 1D, 2D, 2.5D and 3D). You can switch between the different tensor-parallel modes by simply changing the configuration in `config.py`. You can follow the [GPT example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/gpt); more details can be found in its `README.md`.
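As a rough sketch of such a switch, the `parallel` field in the config selects the tensor-parallel mode; the sizes below are illustrative assumptions, so refer to the example's own config files for working values.
```python
# config.py (illustrative): 2-way pipeline parallelism combined with
# 4-way tensor parallelism in 2D mode; set mode to '1d', '2.5d' or '3d' to switch.
parallel = dict(
    pipeline=2,
    tensor=dict(size=4, mode='2d'),
)
```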
#### 3. MoE Parallel
We provide a [WideNet example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/widenet) to verify MoE parallelism. WideNet uses Mixture of Experts (MoE) to achieve better performance. More details can be found in our tutorial: [Integrate Mixture of Experts Into Your Model](../advanced_tutorials/integrate_mixture_of_experts_into_your_model.md).
#### 4. Sequence Parallel
Sequence parallelism is designed to tackle memory efficiency and the sequence length limit problems in NLP tasks. We provide a [BERT example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/bert/sequene_parallel) in [ColossalAI-Examples](https://github.com/hpcaitech/ColossalAI-Examples). You can follow its `README.md` to execute the code.