mirror of https://github.com/hpcaitech/ColossalAI

[doc] polish shardformer doc (#4735)

* arrange position of chapters
* fix typos in seq parallel doc

parent 73eb3e8862
commit 451c3465fb
Within a few lines of code, users can turn a model into a state ready for distributed training.

Also, Shardformer contains various optimization tools for acceleration and memory saving during the forward/backward pass.
## Usage

### Shardformer Configuration

The configuration of Shardformer is controlled by class `ShardConfig`:

{{ autodoc:colossalai.shardformer.ShardConfig }}

If you want to enable Apex Fused Layernorm, please install `apex`.
If you want to enable the usage of flash attention, please install `flash_attn`.
In addition, xFormers's `cutlass_op` can serve as a backup for flash attention.
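
Below is a minimal sketch of how such a configuration might be constructed. The flag names mirror the fields documented in the autodoc entry above but may differ across ColossalAI versions, so please verify them against the generated API reference before use.

```python
# A minimal sketch of building a ShardConfig. A distributed environment
# (e.g. via `colossalai.launch_from_torch`) is assumed to be initialized
# already when tensor parallelism is enabled.
from colossalai.shardformer import ShardConfig

shard_config = ShardConfig(
    enable_tensor_parallelism=True,    # shard linear/embedding layers across the tensor parallel group
    enable_fused_normalization=True,   # uses Apex Fused Layernorm, requires `apex`
    enable_flash_attention=True,       # requires `flash_attn`; xFormers' `cutlass_op` acts as a fallback
)
```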

### Enabling Shardformer

#### 1. Enabling Shardformer Through Booster (Recommended)

Enabling `Shardformer` through a `Booster` initialized with `HybridParallelPlugin` is the recommended way to awaken the power of Shardformer.
The main reason is that pipeline parallelism cannot work without calling the `execute_pipeline` method of `Booster`. Besides, `HybridParallelPlugin` provides the capacity to combine the features of `Shardformer` with other useful features, such as mixed precision training or Zero.

More details about this usage can be found in the chapters [Booster API](../basics/booster_api.md) and [Booster Plugins](../basics/booster_plugins.md).

[Here](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/bert) is an example of how to trigger `Shardformer` through `HybridParallelPlugin`. Please be aware that the forward and backward passes are performed differently depending on whether pipeline parallelism is used or not.
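
The following is a condensed, hypothetical sketch of this workflow, loosely modeled on the BERT example linked above. The plugin arguments (`tp_size`, `pp_size`, `num_microbatches`, `precision`) and the `execute_pipeline` call are assumptions to verify against your ColossalAI version; the script is meant to be launched with `torchrun` on enough GPUs for the chosen parallel sizes.

```python
# Sketch only: tp_size=2 and pp_size=2 require 4 GPUs.
import colossalai
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification

from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

colossalai.launch_from_torch(config={})  # initialize the distributed environment

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = AdamW(model.parameters(), lr=2e-5)

# toy batches; in practice build a DataLoader over your tokenized dataset
batch = {
    "input_ids": torch.randint(0, 30522, (8, 128)),
    "attention_mask": torch.ones(8, 128, dtype=torch.long),
    "labels": torch.randint(0, 2, (8,)),
}
train_dataloader = DataLoader([batch] * 16, batch_size=None)

def criterion(outputs, inputs):
    # with pipeline parallelism, the loss is computed from the model outputs
    return outputs.loss

plugin = HybridParallelPlugin(tp_size=2, pp_size=2, num_microbatches=4, precision="fp16")
booster = Booster(plugin=plugin)
model, optimizer, criterion, train_dataloader, _ = booster.boost(
    model, optimizer, criterion=criterion, dataloader=train_dataloader
)

# with pipeline parallelism enabled, let the booster drive forward/backward
outputs = booster.execute_pipeline(iter(train_dataloader), model, criterion, optimizer, return_loss=True)
optimizer.step()
optimizer.zero_grad()
```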

#### 2. Enabling Shardformer Through Shardformer APIs (Not Recommended)

You can also use Shardformer by manually calling the Shardformer APIs. However, this usage is not recommended, since pipeline parallelism can't run without `Booster`.

[Here](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/examples/convergence_benchmark.py)
is an example of how to trigger `Shardformer` by calling the Shardformer APIs.
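
A rough sketch of this manual path is given below; the `ShardFormer` class and the return values of `optimize` are assumptions based on the example linked above, so please treat that example as the authoritative reference.

```python
import colossalai
from transformers import BertForSequenceClassification

from colossalai.shardformer import ShardConfig, ShardFormer

colossalai.launch_from_torch(config={})  # the manual path still needs a distributed environment

model = BertForSequenceClassification.from_pretrained("bert-base-uncased").cuda()

shard_config = ShardConfig(
    enable_tensor_parallelism=True,   # shard the model over the default tensor parallel group
    enable_fused_normalization=True,  # optional, requires apex
)
shard_former = ShardFormer(shard_config=shard_config)
sharded_model, shared_params = shard_former.optimize(model)

# train `sharded_model` with your own loop; pipeline parallelism is NOT
# available through this path -- use the Booster for that.
```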

### Precautions

1. When enabling pipeline parallelism, please don't run the forward/backward pass in the conventional way (`model(input)`, `loss.backward()`), as this will cause unexpected errors. Rather, run the forward/backward pass by calling the `booster.execute_pipeline` method.

2. When you use Shardformer to process classification models such as `GPT2ForSequenceClassification` or `ViTForImageClassification`, please ensure that the total number of labels is an integer multiple of the tensor parallel size, otherwise Shardformer can't process the classifier layer correctly. A simple fix could be appending dummy labels in the transformers config (a short sketch of this workaround follows the list). This bug will be fixed in a future version of Shardformer.

3. The case of training ChatGLM-2 6B is a little special: since Huggingface Transformers doesn't officially support ChatGLM at present, please import the configuration/model classes through
```python
from colossalai.shardformer.modeling.chatglm2_6b.configuration_chatglm import ChatGLMConfig
from colossalai.shardformer.modeling.chatglm2_6b.modeling_chatglm import ChatGLMForConditionalGeneration, ChatGLMModel
```
when training ChatGLM-2 with Shardformer, and initialize your model with these imported classes.
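
As a minimal sketch of the dummy-label workaround mentioned in precaution 2: the idea is to pad the label count to a multiple of the tensor parallel size. The padding rule and the `id2label`/`label2id` handling below are illustrative assumptions, not an official recipe.

```python
from transformers import GPT2Config

tp_size = 4
num_real_labels = 10
num_padded_labels = -(-num_real_labels // tp_size) * tp_size  # ceil to a multiple of tp_size -> 12

config = GPT2Config()
config.id2label = {
    i: (f"LABEL_{i}" if i < num_real_labels else f"DUMMY_{i}") for i in range(num_padded_labels)
}
config.label2id = {v: k for k, v in config.id2label.items()}
print(config.num_labels)  # 12, now divisible by the tensor parallel size
```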

### Sequence Parallelism

Sequence parallelism in `Shardformer` is a little different from [this one](https://colossalai.org/docs/basics/configure_parallelization/#sequence-parallel), which focuses on ring attention. In `Shardformer`, sequence parallelism is only used along with 1D tensor parallelism to further reduce the memory occupation of activation tensors during computation.

1. In normal [1D tensor parallel](https://colossalai.org/docs/features/1D_tensor_parallel), there are 2 communication operations, $g$ and $\vec{g}$: $g$ performs one All-Reduce in the backward pass to gather the gradients from all devices, and $\vec{g}$ performs one All-Reduce in the forward pass to gather the whole outputs from all devices.

2. When using sequence parallelism, $g$ needs to do All-Gather to gather the inputs along the sequence dimension during the forward pass, and Reduce-Scatter to split the gradient during the backward pass. $\vec{g}$ needs to do Reduce-Scatter to split the output of the `Row Linear` layer of tensor parallelism to all devices along the sequence dimension, and All-Gather to get the whole gradient during the backward pass.

3. NCCL's implementation of All-Reduce adopts the `Ring All-Reduce` approach, which consists of a Reduce-Scatter operation and an All-Gather operation with equal costs. Therefore, compared with sequence parallelism and tensor parallelism, it does not introduce additional communication overhead (a quick cost comparison follows this list).

4. One important thing to note is that when using sequence parallelism along with the `Column Linear` module of tensor parallelism, the complete input needs to be obtained during the backward computation of gradients. During the forward pass, only the portion of the input that is split along the sequence dimension is retained, in the shape of $(batch, sequence\_len/k, hidden\_states)$. Therefore, an additional All-Gather operation is required to obtain the complete input for gradient computation. However, in the implementation it is possible to overlap the gradient computation with the All-Gather communication operation, which would not introduce additional communication overhead (corresponding to the `enable_sequence_overlap` parameter in `Shardformer`).
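
As a rough sanity check of point 3, assume the standard ring-collective cost model in which each of the $k$ devices communicates about $\frac{k-1}{k}$ of a tensor of size $M$ per phase. Then the two collectives introduced by sequence parallelism together move exactly as much data as the single All-Reduce they replace:

$$
\underbrace{\frac{k-1}{k}M}_{\text{All-Gather}} \;+\; \underbrace{\frac{k-1}{k}M}_{\text{Reduce-Scatter}} \;=\; \frac{2(k-1)}{k}M \;=\; \text{per-device volume of one ring All-Reduce.}
$$

Under this model, swapping the All-Reduce of plain tensor parallelism for the All-Gather/Reduce-Scatter pair of sequence parallelism changes where activations live, not how much data is communicated.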

## How Shardformer Works

Generally, Shardformer works through the following four kinds of *replacements*:

1. Replacing original PyTorch modules (e.g. `nn.Linear`, `nn.Embedding`) with crafted distributed modules.
The distributed module keeps the same attributes as the original module but replaces the original parameters with distributed parameters.
Also, new `forward` methods will replace the original ones so as to execute distributed computation, such as the split/gather operations performed by linear layers under tensor parallelism.
Each distributed module implements its `from_native_module` static method to convert the PyTorch module to its corresponding distributed module (a simplified toy sketch of such a module follows this list).

2. Replacing attributes of original Huggingface Transformers layers with appropriate attributes for distributed training.
For example, when training LlaMa-2 with tensor parallel size 2, the attribute `num_heads` of `LlamaDecoderLayer` (the number of attention heads in each layer) should be replaced with `model.config.num_attention_heads // 2`.
3. Replacing the `forward` methods implemented by the original Huggingface Transformers library with our customized `forward` methods.
This replacement is essential for pipeline parallelism, where a customized function is needed to pass intermediate hidden states between different pipeline stages.
Also, optimization methods such as flash attention or sequence parallelism can be injected into the `forward` process through our customized `forward` method.

4. Replacing the whole copy of model parameters and optimizer states with incomplete ones controlled by the current device (this is why it's called Shardformer).
By executing the `ModelSharder.shard` method, the current device will only keep the part of the model parameters it's supposed to take care of.
To be specific, they should be the assigned parameter shards when using tensor parallelism, or the parameters belonging to the current pipeline stage when using pipeline parallelism, or both of them.
All other parameters are released so as to liberate memory usage.
As a result, the optimizer will only compute the states corresponding to these retained parameters, which further reduces memory usage.
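
To make replacement 1 concrete, here is a self-contained toy sketch; it is not ColossalAI's actual distributed linear layer, and the class and argument names are invented for illustration. It only shows the idea: keep the module's interface, swap the full weight for a parameter shard, and expose a `from_native_module` constructor.

```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyColumnParallelLinear(nn.Module):
    def __init__(self, weight_shard: torch.Tensor, bias_shard: Optional[torch.Tensor]):
        super().__init__()
        # each rank holds only [out_features / tp_size, in_features] of the weight
        self.weight = nn.Parameter(weight_shard)
        self.bias = nn.Parameter(bias_shard) if bias_shard is not None else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # each rank produces its own slice of the output features; the real
        # implementation also inserts the necessary gather/split communication
        return F.linear(x, self.weight, self.bias)

    @staticmethod
    def from_native_module(module: nn.Linear, tp_size: int, tp_rank: int) -> "ToyColumnParallelLinear":
        # split the output dimension into tp_size equal shards and keep only ours
        w_shard = torch.chunk(module.weight.detach(), tp_size, dim=0)[tp_rank].clone()
        b_shard = (
            torch.chunk(module.bias.detach(), tp_size, dim=0)[tp_rank].clone()
            if module.bias is not None
            else None
        )
        return ToyColumnParallelLinear(w_shard, b_shard)


# usage: on each rank, replace the native layer with its sharded counterpart
native = nn.Linear(16, 8)
sharded = ToyColumnParallelLinear.from_native_module(native, tp_size=2, tp_rank=0)
print(sharded.weight.shape)  # torch.Size([4, 16])
```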

All of these replacements are implemented with manually written policies and forward functions.
If you want to delve deeper into the design of Shardformer or customize your own Shardformer policies, please refer to our [Shardformer development document](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/README.md) and [pipeline parallelism design](https://github.com/hpcaitech/ColossalAI/discussions/4050) for more details.

## Supporting Information

Model/Feature Compatibility Matrix:

List of model families we plan to support in the near future:

The support matrix will grow larger as more models and optimization tools emerge in the future. If you have any suggestions on the models/optimizations we should support, please feel free to mention it in the [Issues](https://github.com/hpcaitech/ColossalAI/issues) section of our project.
<!-- doc-test-command: echo -->
Out of this motivation, the ColossalAI team developed **Shardformer**, a feature that automatically wraps mainstream Transformer models from HuggingFace for training with tensor parallelism and pipeline parallelism. In this way, users who are not very familiar with distributed systems can easily run parallel training on transformers models: with just a few lines of code, the model can be turned into a state ready for parallel training. In addition, Shardformer also includes various optimization tools for acceleration and memory saving during the forward/backward pass.

## Usage

### Shardformer Configuration

The configuration of Shardformer is controlled by the parameters of the `ShardConfig` class:

{{ autodoc:colossalai.shardformer.ShardConfig }}

If you want to enable Apex Fused Layernorm, please install `apex`. If you want to enable flash attention, please install `flash_attn`. In addition, xFormers' `cutlass_op` can serve as a complementary optimization for flash attention.

### Enabling Shardformer

#### 1. Enabling Shardformer Through Booster (Recommended)

Enabling `Shardformer` through a `Booster` initialized with `HybridParallelPlugin` is the most recommended usage. The main reason is that pipeline parallelism cannot work properly without calling the `execute_pipeline` method of `Booster`. In addition, `HybridParallelPlugin` provides the ability to combine the features of `Shardformer` with other features, such as mixed precision training or Zero.

More details about this usage can be found in the [Booster API](../basics/booster_api.md) and [Booster Plugins](../basics/booster_plugins.md) documents. [Here](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/bert) is an example of enabling `Shardformer` through `HybridParallelPlugin`.

#### 2. Enabling Shardformer Through Shardformer APIs (Not Recommended)

You can also enable Shardformer by manually calling the Shardformer APIs. However, we do not recommend this usage, since pipeline parallelism cannot run properly without `Booster`.

[Here](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/examples/convergence_benchmark.py)
is an example of enabling `Shardformer` by calling the Shardformer APIs.

### Precautions

1. When pipeline parallelism is enabled, please don't do the forward/backward pass in the conventional way (`model(input)`, `loss.backward()`), which will cause unknown errors. In this case, please do the forward/backward pass by calling the `booster.execute_pipeline` method.

2. When using Shardformer to process classification models such as `GPT2ForSequenceClassification` and `ViTForImageClassification`, please make sure that the total number of labels is an integer multiple of the tensor parallel size; otherwise Shardformer cannot process the classifier layer correctly. A simple fix is to add dummy labels in the transformers config. This bug will be fixed in a future version of Shardformer.

3. The case of training ChatGLM-2 6B is a little special: since Huggingface Transformers does not officially support ChatGLM at present, when training ChatGLM-2 with Shardformer please import the config/model classes as follows:
```python
from colossalai.shardformer.modeling.chatglm2_6b.configuration_chatglm import ChatGLMConfig
from colossalai.shardformer.modeling.chatglm2_6b.modeling_chatglm import ChatGLMForConditionalGeneration, ChatGLMModel
```
and initialize your model with these imported classes.

### Sequence Parallelism

Sequence parallelism in `Shardformer` is slightly different from [this one](https://colossalai.org/docs/basics/configure_parallelization/#sequence-parallel), which focuses on ring attention. In `Shardformer`, sequence parallelism is only used together with 1D tensor parallelism to further reduce the memory occupied by activations during computation.

1. In normal [1D tensor parallel](https://colossalai.org/docs/features/1D_tensor_parallel), there are two communication operations, $g$ and $\vec{g}$: $g$ performs one All-Reduce in the backward pass to obtain the gradients from all devices, and $\vec{g}$ performs one All-Reduce in the forward pass to obtain the outputs from all devices.

2. When using sequence parallelism, $g$ needs to perform All-Gather in the forward pass to obtain the input along the sequence dimension, and Reduce-Scatter in the backward pass to split the gradient. $\vec{g}$ needs to perform Reduce-Scatter to split the output of the row linear layer across all devices along the sequence dimension, and All-Gather to obtain the complete gradient.

3. NCCL's All-Reduce implementation adopts the `Ring All-Reduce` approach, which consists of one Reduce-Scatter and one All-Gather of equal cost. Therefore, compared with sequence parallelism plus tensor parallelism, it does not introduce additional communication overhead.

4. One thing to note is that when sequence parallelism is used with the `Column Linear` layer of tensor parallelism, the complete input needs to be obtained during the backward computation of gradients. During the forward pass, only the part of the input split along the sequence dimension is kept, with a shape like $(batch, sequence\_len/k, hidden\_states)$. Therefore, an additional All-Gather operation is required to obtain the complete input for gradient computation. However, in the implementation this gradient computation can be overlapped with the All-Gather communication, so it does not introduce additional communication overhead (corresponding to the `enable_sequence_overlap` parameter in `Shardformer`).

## How Shardformer Works

Generally, Shardformer works through the following four kinds of "replacements":

1. Replacing the original PyTorch modules (e.g. `nn.Linear`, `nn.Embedding`) with our crafted distributed modules.
The distributed module keeps the same attributes as the original module, but replaces the original parameters with new distributed parameters. A new forward function replaces the original one to execute distributed computation, such as the split/gather operations of linear layers under tensor parallelism. Each distributed module implements its `from_native_module` static method to convert a PyTorch module into its corresponding distributed module.

2. Replacing the attributes of the original Huggingface Transformers layers with attributes suitable for parallel training. For example, when training LlaMa-2 with tensor parallel size 2, the attribute `num_heads` of `LlamaDecoderLayer` (the number of attention heads in each layer) should be replaced with `model.config.num_attention_heads // 2`.

3. Replacing the forward functions implemented by the original Huggingface transformers library with our customized forward functions. This replacement is essential for pipeline parallelism, which needs special forward functions to pass intermediate hidden states between different pipeline stages. Besides, optimization methods such as `flash attention` or sequence parallelism can be injected into the forward process through our customized forward functions.

4. Replacing the complete model parameters and optimizer states with the partial model parameters and optimizer states controlled by the current device. By executing the `ModelSharder.shard` method, the current device only keeps the part of the model parameters it is supposed to handle. Specifically, these can be the parameter shards assigned to the current device when using tensor parallelism, or the model parameters of the current pipeline stage when using pipeline parallelism, or both. All other parameters are released to save memory.
In this way, the optimizer only computes the states corresponding to the retained parameters, which further saves memory usage.

All these replacements are implemented with manually written policies and forward functions. If you want to delve deeper into the design of Shardformer, or customize your own Shardformer policies, please refer to the [Shardformer development document](https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/README.md) and [pipeline parallelism design](https://github.com/hpcaitech/ColossalAI/discussions/4050) for more details.

## Supporting Information

Model/Feature Compatibility Matrix:

As more models and optimization tools emerge in the future, the list of models/optimizations we support will keep growing. If you have any suggestions about the models/optimization tools we should support, please feel free to join the discussion in the [Issues](https://github.com/hpcaitech/ColossalAI/issues) section of our project.
<!-- doc-test-command: echo -->