[doc] explain suitable use case for each plugin

pull/4757/head
Pengtai Xu 1 year ago
parent 079bf3cb26
commit 10513f203c

@@ -1,6 +1,6 @@
# Booster Plugins
Author: [Hongxin Liu](https://github.com/ver217), [Baizhou Zhang](https://github.com/Fridge003), [Pengtai Xu](https://github.com/ppt0011)

**Prerequisite:**
- [Booster API](./booster_api.md)
@@ -11,16 +11,43 @@ As mentioned in [Booster API](./booster_api.md), we can use booster plugins to c
We currently provide the following plugins:
- [Torch DDP Plugin](#torch-ddp-plugin): It is a wrapper of `torch.nn.parallel.DistributedDataParallel` and can be used to train models with data parallelism.
- [Torch FSDP Plugin](#torch-fsdp-plugin): It is a wrapper of `torch.distributed.fsdp.FullyShardedDataParallel` and can be used to train models with zero-dp.
- [Low Level Zero Plugin](#low-level-zero-plugin): It wraps the `colossalai.zero.low_level.LowLevelZeroOptimizer` and can be used to train models with zero-dp. It only supports zero stage-1 and stage-2.
- [Gemini Plugin](#gemini-plugin): It wraps the [Gemini](../features/zero_with_chunk.md) which implements Zero-3 with chunk-based and heterogeneous memory management.
- [Hybrid Parallel Plugin](#hybrid-parallel-plugin): It provides a tidy interface that integrates the power of Shardformer, pipeline manager, mixed precision training, TorchDDP and the Zero stage 1/2 features. With this plugin, transformer models can be easily and efficiently trained with any combination of tensor parallelism, pipeline parallelism and data parallelism (DDP/Zero), along with various optimization tools for acceleration and memory saving. Detailed information about the supported parallel strategies and optimization tools is explained in the section below.

More plugins are coming soon.
## Choosing Your Plugin
Generally, only one plugin is used to train a model. Our recommended use cases for each plugin are as follows (a minimal usage sketch is given after this list).
- [Torch DDP Plugin](#torch-ddp-plugin): It is suitable for models with fewer than 2 billion parameters.
- [Torch FSDP Plugin](#torch-fsdp-plugin) / [Low Level Zero Plugin](#low-level-zero-plugin): They are suitable for models with fewer than 10 billion parameters.
- [Gemini Plugin](#gemini-plugin): It is suitable for models with more than 10 billion parameters and is ideal for scenarios with high cross-node bandwidth and small- to medium-scale clusters (fewer than a thousand GPUs).
- [Hybrid Parallel Plugin](#hybrid-parallel-plugin): It is suitable for models with more than 60 billion parameters, exceptionally long sequences, or very large vocabularies, and is best suited for scenarios with low cross-node bandwidth and large-scale clusters (a thousand GPUs or more).
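
Whichever plugin you choose, it is used in the same way: construct it and hand it to the `Booster`. Below is a minimal sketch of that workflow, assuming training is launched with `torchrun`; the tiny model, optimizer and criterion are toy placeholders, some plugins require constructor arguments of their own, and the exact launch/`boost` signatures should be checked against [Booster API](./booster_api.md) and your installed version.

```python
# Minimal sketch: pick a plugin and boost the training objects with it.
# The model / optimizer / criterion here are toy placeholders.
import torch
import torch.nn as nn

import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin

colossalai.launch_from_torch(config={})  # initialize the distributed environment

model = nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Swap in TorchFSDPPlugin / LowLevelZeroPlugin / GeminiPlugin / HybridParallelPlugin as needed.
plugin = TorchDDPPlugin()
booster = Booster(plugin=plugin)
model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion=criterion)
```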
## Plugins
### Torch DDP Plugin
More details can be found in [Pytorch Docs](https://pytorch.org/docs/main/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel).
{{ autodoc:colossalai.booster.plugin.TorchDDPPlugin }}
### Torch FSDP Plugin
> ⚠ This plugin is not available when the torch version is lower than 1.12.0.
> ⚠ This plugin does not support saving/loading sharded model checkpoints yet.
> ⚠ This plugin does not support optimizers that use multiple parameter groups (see the sketch below).
More details can be found in [Pytorch Docs](https://pytorch.org/docs/main/fsdp.html).
{{ autodoc:colossalai.booster.plugin.TorchFSDPPlugin }}
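
To make the last warning above concrete, the sketch below contrasts a single-parameter-group optimizer, which works with this plugin, with a multi-parameter-group optimizer, which is currently not supported; the two-layer model is just an illustrative placeholder.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 4))

# Single parameter group: compatible with TorchFSDPPlugin.
ok_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Multiple parameter groups (e.g. per-group learning rates): currently NOT supported by this plugin.
multi_group_optimizer = torch.optim.AdamW(
    [
        {"params": model[0].parameters(), "lr": 1e-3},
        {"params": model[1].parameters(), "lr": 1e-4},
    ]
)
```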
### Low Level Zero Plugin
This plugin implements Zero-1 and Zero-2 (with or without CPU offload), using `reduce` and `gather` to synchronize gradients and weights.
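
As a rough sketch, stage 1 or 2 is selected when constructing the plugin; the keyword names used here (`stage`, `precision`) are assumptions to verify against the plugin's API reference for your installed version.

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import LowLevelZeroPlugin

# ZeRO stage 1 or 2; keyword names are assumptions, check them against your version.
plugin = LowLevelZeroPlugin(stage=2, precision="fp16")
booster = Booster(plugin=plugin)
```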
@@ -50,24 +77,6 @@ This plugin implements Zero-3 with chunk-based and heterogeneous memory manageme
{{ autodoc:colossalai.booster.plugin.GeminiPlugin }}
### Hybrid Parallel Plugin
@@ -87,5 +96,4 @@ This plugin implements the combination of various parallel training strategies a
{{ autodoc:colossalai.booster.plugin.HybridParallelPlugin }}
<!-- doc-test-command: echo -->

@@ -11,16 +11,41 @@
We currently provide the following plugins:
- [Torch DDP Plugin](#torch-ddp-插件): It wraps `torch.nn.parallel.DistributedDataParallel` and can be used to train models with data parallelism.
- [Torch FSDP Plugin](#torch-fsdp-插件): It wraps `torch.distributed.fsdp.FullyShardedDataParallel` and can be used to train models with zero-dp.
- [Low Level Zero Plugin](#low-level-zero-插件): It wraps `colossalai.zero.low_level.LowLevelZeroOptimizer` and can be used to train models with zero-dp. It only supports Zero stage 1 and stage 2.
- [Gemini Plugin](#gemini-插件): It wraps [Gemini](../features/zero_with_chunk.md), which implements Zero-3 with chunk-based and heterogeneous memory management.
- [Hybrid Parallel Plugin](#hybrid-parallel-插件): It provides a unified and concise interface for Shardformer, the pipeline manager, mixed precision training, TorchDDP and the Zero-1/Zero-2 features. With this plugin, transformer models can be easily and efficiently trained with any combination of tensor parallelism, pipeline parallelism and data parallelism (DDP, Zero), together with various optimization tools for training speed and memory. Detailed information about these strategies and tools is explained in the next chapter.

More plugins are coming soon.
## Choosing Your Plugin
- [Torch DDP Plugin](#torch-ddp-插件): Suitable for models with fewer than 2 billion parameters.
- [Torch FSDP Plugin](#torch-fsdp-插件) / [Low Level Zero Plugin](#low-level-zero-插件): Suitable for models with fewer than 10 billion parameters.
- [Gemini Plugin](#gemini-插件): Suitable for models with more than 10 billion parameters, in scenarios with high cross-node bandwidth and small- to medium-scale clusters (fewer than a thousand GPUs).
- [Hybrid Parallel Plugin](#hybrid-parallel-插件): Suitable for models with more than 60 billion parameters, as well as special cases such as exceptionally long sequences and very large vocabularies, in scenarios with low cross-node bandwidth and large-scale clusters (a thousand GPUs or more).

## Plugins
### Torch DDP Plugin
More details can be found in the [Pytorch Docs](https://pytorch.org/docs/main/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel).
{{ autodoc:colossalai.booster.plugin.TorchDDPPlugin }}
### Torch FSDP Plugin
> ⚠ This plugin is not available when the torch version is lower than 1.12.0.
> ⚠ This plugin does not support saving/loading sharded model checkpoints yet.
> ⚠ This plugin does not support optimizers that use multiple parameter groups yet.

More details can be found in the [Pytorch Docs](https://pytorch.org/docs/main/fsdp.html).
{{ autodoc:colossalai.booster.plugin.TorchFSDPPlugin }}
### Low Level Zero Plugin
This plugin implements Zero-1 and Zero-2 (with or without CPU offload), using `reduce` and `gather` to synchronize gradients and weights.
@@ -50,26 +75,6 @@ Zero-2 does not support local gradient accumulation. If you insist on using it, although you can accumulate
{{ autodoc:colossalai.booster.plugin.GeminiPlugin }}
### Hybrid Parallel Plugin
This plugin implements the combination of various parallel training strategies and optimization tools. The features supported by the Hybrid Parallel plugin can be roughly divided into the following four parts:
