Hotfix/auto parallel zh doc (#2820)

* [hotfix] fix autoparallel zh docs

* polish

* polish
pull/2826/head
YuliangLiu0306 2023-02-19 15:57:14 +08:00 committed by GitHub
parent 2059fdd6b0
commit cf6409dd40
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 3 additions and 9 deletions

@ -30,7 +30,7 @@ Colossal-Auto is **the first auto-parallelism system** that uses static graph an
## Fine-grained Parallelism Search
Colossal-AI's auto-parallelism searches for a strategy for each operator, with the goal of achieving the fastest runtime while meeting memory budget constraints. It ultimately determines the strategy used in actual training, including the split strategy for each tensor, the type of communication operators to be inserted between different computing nodes, whether to replace operators, etc. Tensor parallelism, data parallelism, and hybrid parallelism such as the column and row splits used by NVIDIA in Megatron-LM and other parallelism systems are all subsets of the strategies that Colossal-AI can search. In addition to these manually specifiable forms of parallelism, Colossal-AI can specify a unique parallelism method for each operation, potentially finding a better parallelism strategy than human experts could provide.
We surveyed a number of existing automatic parallel systems (<a href="https://arxiv.org/abs/1807.08887"> Tofu </a>, <a href="https://arxiv.org/abs/1807.05358"> Flexflow </a>, <a href="https://arxiv.org/abs/2201.12023"> Alpa </a>) and several automatic activation checkpoint algorithms (<a href="https://hal.inria.fr/hal-02352969"> Rotor </a>, <a href="https://arxiv.org/abs/1604.06174"> Sublinear </a>). Inspired by these systems, we built Colossal-Auto, an automatic parallel system based on the PyTorch framework. Colossal-Auto searches for a strategy for each operator, with the goal of achieving the fastest runtime while meeting memory budget constraints. It ultimately determines the strategy used in actual training, including the split strategy for each tensor, the type of communication operators to be inserted between different computing nodes, whether to replace operators, etc. Tensor parallelism, data parallelism, and hybrid parallelism such as the column and row splits used by NVIDIA in Megatron-LM and other parallelism systems are all subsets of the strategies that Colossal-AI can search. In addition to these manually specifiable forms of parallelism, Colossal-AI can specify a unique parallelism method for each operation, potentially finding a better parallelism strategy than human experts could provide.
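
To make the search objective in the paragraph above concrete, here is a minimal sketch of picking one strategy per operator so that total runtime is lowest while total memory stays within a budget. It is a toy brute-force illustration, not Colossal-Auto's solver: the operator names, strategy names, and cost numbers in `CANDIDATES` are all hypothetical, and communication costs between operators are ignored.

```python
from itertools import product

# Hypothetical candidates: op name -> list of (strategy name, time cost, memory cost).
CANDIDATES = {
    "linear_1": [("replicate", 4.0, 8), ("column_split", 2.5, 4), ("row_split", 2.8, 4)],
    "linear_2": [("replicate", 4.0, 8), ("column_split", 2.6, 4), ("row_split", 2.4, 5)],
    "layernorm": [("replicate", 1.0, 2), ("batch_split", 0.6, 1)],
}


def search(candidates, memory_budget):
    """Return (total_time, {op: strategy}) with the lowest time within the budget, or None."""
    ops = list(candidates)
    best = None
    # Enumerate every combination of one strategy per op (exponential; toy sizes only).
    for combo in product(*(candidates[op] for op in ops)):
        total_time = sum(choice[1] for choice in combo)
        total_mem = sum(choice[2] for choice in combo)
        if total_mem > memory_budget:
            continue  # violates the memory budget constraint
        if best is None or total_time < best[0]:
            best = (total_time, {op: choice[0] for op, choice in zip(ops, combo)})
    return best


if __name__ == "__main__":
    print(search(CANDIDATES, memory_budget=10))
```

Tightening the budget changes the chosen plan: with these made-up numbers, a budget of 10 lets `linear_2` use the faster but larger `row_split`, while a budget of 9 forces it back to `column_split`.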

@ -25,7 +25,8 @@ Colossal-Auto is **the first automatic parallel
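## Fine-grained Distributed Training Strategy Search
Colossal-AI's auto-parallelism searches for a strategy for each op, aiming for the fastest runtime within the memory budget, and ultimately produces the strategy used in actual training, including the sharding strategy for each tensor, the type of communication operators to insert between different computing nodes, and whether to replace operators. The tensor parallelism and data parallelism of existing systems, as well as hybrid parallelism such as the column and row splits used by NVIDIA in Megatron-LM and other parallel systems, are all subsets of the strategies that auto-parallelism can find. Beyond these manually specifiable forms of parallelism, Colossal-AI can assign a unique parallelism method to each op, so it may find a better strategy than manual sharding that relies on expert experience and trial-and-error configuration.
We surveyed many existing automatic parallel systems (<a href="https://arxiv.org/abs/1807.08887"> Tofu </a>, <a href="https://arxiv.org/abs/1807.05358"> Flexflow </a>, <a href="https://arxiv.org/abs/2201.12023"> Alpa </a>) as well as automatic activation checkpoint algorithms (<a href="https://hal.inria.fr/hal-02352969"> Rotor </a>, <a href="https://arxiv.org/abs/1604.06174"> Sublinear </a>). Inspired by them, we developed Colossal-Auto, an automatic parallel system based on the PyTorch framework. Colossal-Auto searches for a strategy for each op, aiming for the fastest runtime within the memory budget, and ultimately produces the strategy used in actual training, including the sharding strategy for each tensor, the type of communication operators to insert between different computing nodes, and whether to replace operators. The tensor parallelism and data parallelism of existing systems, as well as hybrid parallelism such as the column and row splits used by NVIDIA in Megatron-LM and other parallel systems, are all subsets of the strategies that auto-parallelism can find. Beyond these manually specifiable forms of parallelism, Colossal-AI can assign a unique parallelism method to each op, so it may find a better strategy than manual sharding that relies on expert experience and trial-and-error configuration.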
@ -33,9 +34,6 @@ Colossal-AI's auto-parallelism, within the memory budget constraint, with the fast
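Like the DTensor recently released by PyTorch, Colossal-AI uses a device mesh to abstract and manage the cluster. Specifically, Colossal-AI uses a sharding spec to annotate the distributed storage state of a tensor, and a shape consistency manager to automatically convert the same tensor between different sharding specs. This greatly improves the generality and usability of Colossal-AI: with the shape consistency manager, tensors can be sharded freely without worrying that the output of an upstream op and the input of a downstream op are stored differently across the cluster.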
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/auto_parallel/shape_consistency.png"/>
</figure>
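Compared with PyTorch DTensor, Colossal-AI has the following advantages:
+ Colossal-AI's device mesh can profile cluster performance metrics and estimate the time cost of different communication operators.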
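
As a rough illustration of what a sharding spec over a device mesh describes, the sketch below computes the shape of the shard that a single device holds. The `dim_partition` dictionary (tensor dimension mapped to the mesh axes it is split along) is a hypothetical layout chosen for this example and is not presented as Colossal-AI's actual sharding spec class; dimensions missing from the dictionary are assumed to be replicated.

```python
from math import prod


def local_shard_shape(global_shape, mesh_shape, dim_partition):
    """Shape of the shard held by one device.

    dim_partition maps a tensor dimension to the list of mesh axes it is
    split along; dimensions that are absent are treated as replicated.
    """
    local = []
    for dim, size in enumerate(global_shape):
        # Number of pieces this dimension is cut into (1 if replicated).
        parts = prod(mesh_shape[axis] for axis in dim_partition.get(dim, []))
        if size % parts != 0:
            raise ValueError(f"dim {dim} of size {size} is not divisible by {parts}")
        local.append(size // parts)
    return tuple(local)


if __name__ == "__main__":
    mesh = (2, 3)  # hypothetical 2 x 3 device mesh
    print(local_shard_shape((8, 12), mesh, {0: [0]}))          # rows split over mesh axis 0
    print(local_shard_shape((8, 12), mesh, {0: [0], 1: [1]}))  # both dims split
```

Converting a tensor between two such layouts is the job the text assigns to the shape consistency manager; this sketch only shows how the per-device shape differs between layouts, not how the conversion is performed.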

@ -10,7 +10,3 @@ Colossal-Auto can be used to find, for every operation, one that includes data, tensor (
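Activation checkpointing is an indispensable memory compression technique in large model training, and Colossal-AI also provides automatic search for activation checkpoint. Unlike most approaches, which aim for maximum memory compression, Colossal-AI's search objective is to find the fastest activation checkpoint scheme within the memory budget. In addition, to keep search time from blowing up when the activation checkpoint search is modeled inside the SPMD solver, Colossal-AI uses a 2-stage search design, so it can find an effective and feasible distributed training scheme in a reasonable time. You can refer to the [Resnet example](https://github.com/hpcaitech/ColossalAI/tree/main/examples/tutorial/auto_parallel).
For detailed instructions, see its `README.md`.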
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/auto_parallel/auto_ckpt.jpg"/>
</figure>
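
The objective described above (the fastest activation checkpoint scheme that fits the memory budget, rather than the smallest possible memory) can be sketched with a toy brute-force search. Everything here is an assumption for illustration: the block names, the cost numbers, and the simplification that memory savings from checkpointed blocks add up linearly. It is not the algorithm Colossal-AI uses.

```python
from itertools import combinations

# Hypothetical per-block costs: (activation memory saved if checkpointed, extra recompute time).
BLOCKS = {"block_0": (6, 1.0), "block_1": (4, 0.5), "block_2": (3, 0.7)}
BASE_MEMORY = 20  # hypothetical peak memory with no checkpointing


def fastest_checkpoint_plan(blocks, base_memory, budget):
    """Return (extra_time, checkpointed block names) with the least recompute time within budget."""
    names = sorted(blocks)
    best = None
    # Try every subset of blocks to checkpoint (exponential; toy sizes only).
    for k in range(len(names) + 1):
        for chosen in combinations(names, k):
            memory = base_memory - sum(blocks[name][0] for name in chosen)
            extra = sum(blocks[name][1] for name in chosen)
            if memory <= budget and (best is None or extra < best[0]):
                best = (extra, chosen)
    return best  # None if no subset fits the budget


if __name__ == "__main__":
    print(fastest_checkpoint_plan(BLOCKS, BASE_MEMORY, budget=15))
```

With a loose budget nothing is checkpointed and no time is lost; as the budget tightens, the search adds only as much recomputation as the budget requires.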