diff --git a/docs/sidebars.json b/docs/sidebars.json
index 123211db5..754600627 100644
--- a/docs/sidebars.json
+++ b/docs/sidebars.json
@@ -56,6 +56,7 @@
 "features/pipeline_parallel",
 "features/nvme_offload",
 "features/lazy_init",
+"features/distributed_optimizers",
 "features/cluster_utils"
 ]
 },
diff --git a/docs/source/en/features/distributed_optimizers.md b/docs/source/en/features/distributed_optimizers.md
index bdd7a3f40..f95b23304 100644
--- a/docs/source/en/features/distributed_optimizers.md
+++ b/docs/source/en/features/distributed_optimizers.md
@@ -4,9 +4,9 @@ Author: [Wenxuan Tan](https://github.com/Edenzzzz), [Junwen Duan](https://github
 
 **Related Paper**
 - [Adafactor: Adaptive Learning Rates with Sublinear Memory Cost](https://arxiv.org/abs/1804.04235)
-- [CAME: Confidence-guided Adaptive Memory Efficient Optimization] (https://arxiv.org/abs/2307.02047)
-- [GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection] (https://arxiv.org/abs/2403.03507)
-- [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes] (https://arxiv.org/pdf/1904.00962)
+- [CAME: Confidence-guided Adaptive Memory Efficient Optimization](https://arxiv.org/abs/2307.02047)
+- [GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection](https://arxiv.org/abs/2403.03507)
+- [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes](https://arxiv.org/pdf/1904.00962)
 
 ## Introduction
 Apart from the widely adopted Adam and SGD, many modern optimizers require layer-wise statistics to update parameters, and thus aren't directly applicable to settings where model layers are sharded across multiple devices. We provide optimized distributed implementations with minimal extra communications, and seamless integrations with Tensor Parallel, DDP and ZeRO plugins, which automatically uses distributed optimizers with 0 code change.
@@ -14,12 +14,6 @@ Apart from the widely adopted Adam and SGD, many modern optimizers require layer
 ## Optimizers
 Adafactor is a first-order Adam variant using Non-negative Matrix Factorization(NMF) to reduce memory footprint. CAME improves by introducting a confidence matrix to correct NMF. GaLore further reduces memory by projecting gradients into a low-rank space and 8-bit block-wise quantization. Lamb allows huge batch sizes without lossing accuracy via layer-wise adaptive update bounded by the inverse of its Lipschiz constant.
 
-## API Reference
-
-{{ autodoc:colossalai.nn.optimizer.distributed_adafactor.DistributedAdaFactor }}
-{{ autodoc:colossalai.nn.optimizer.distributed_lamb.DistributedLamb }}
-{{ autodoc:colossalai.nn.optimizer.distributed_galore.DistGaloreAwamW }}
-{{ autodoc:colossalai.nn.optimizer.distributed_came.DistributedCAME }}
 
 ## Hands-On Practice
 We now demonstrate how to use Distributed Adafactor with booster API combining Tensor Parallel and ZeRO 2 with 4 GPUs. **Note that even if you're not aware of distributed optimizers, the plugins automatically casts yours to the distributed version for convenience.**
@@ -140,3 +134,10 @@ optim = DistGaloreAwamW(
+
+## API Reference
+
+{{ autodoc:colossalai.nn.optimizer.distributed_adafactor.DistributedAdaFactor }}
+{{ autodoc:colossalai.nn.optimizer.distributed_lamb.DistributedLamb }}
+{{ autodoc:colossalai.nn.optimizer.distributed_galore.DistGaloreAwamW }}
+{{ autodoc:colossalai.nn.optimizer.distributed_came.DistributedCAME }}
diff --git a/docs/source/zh-Hans/features/distributed_optimizers.md b/docs/source/zh-Hans/features/distributed_optimizers.md
index 36dbdf948..7a7068077 100644
--- a/docs/source/zh-Hans/features/distributed_optimizers.md
+++ b/docs/source/zh-Hans/features/distributed_optimizers.md
@@ -4,21 +4,15 @@ Author: Wenxuan Tan, Junwen Duan, Renjie Mao
 
 **相关论文**
 - [Adafactor: Adaptive Learning Rates with Sublinear Memory Cost](https://arxiv.org/abs/1804.04235)
-- [CAME: Confidence-guided Adaptive Memory Efficient Optimization] (https://arxiv.org/abs/2307.02047)
-- [GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection] (https://arxiv.org/abs/2403.03507)
-- [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes] (https://arxiv.org/pdf/1904.00962)
+- [CAME: Confidence-guided Adaptive Memory Efficient Optimization](https://arxiv.org/abs/2307.02047)
+- [GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection](https://arxiv.org/abs/2403.03507)
+- [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes](https://arxiv.org/pdf/1904.00962)
 
 ## 介绍
 除了广泛采用的Adam和SGD外,许多现代优化器需要逐层统计信息以有效更新参数,因此无法直接应用于模型层在多个设备上分片的并行设置。我们以提供了优化的分布式实现,,并且通过plugin与Tensor Parallel、DDP和ZeRO无缝集成。
 
 ## 优化器
 Adafactor 是一种首次采用非负矩阵分解(NMF)的 Adam 变体,用于减少内存占用。CAME 通过引入一个置信度矩阵来改进 NMF 的效果。GaLore 通过将梯度投影到低秩空间,并使用 8 位块状量化进一步减少内存占用。Lamb 允许使用巨大的批量大小而不失准确性,通过按其 Lipschitz 常数的倒数界定的逐层自适应更新实现
 
-## API 参考
-
-{{ autodoc:colossalai.nn.optimizer.distributed_adafactor.DistributedAdaFactor }}
-{{ autodoc:colossalai.nn.optimizer.distributed_lamb.DistributedLamb }}
-{{ autodoc:colossalai.nn.optimizer.distributed_galore.DistGaloreAwamW }}
-{{ autodoc:colossalai.nn.optimizer.distributed_came.DistributedCAME }}
 
 ## 使用
 现在我们展示如何使用分布式 Adafactor 与 booster API 结合 Tensor Parallel 和 ZeRO 2。即使您不使用distributed optimizer,plugin 也会自动将optimizer转换为分布式版本以方便使用。
@@ -137,3 +131,10 @@ optim = DistGaloreAwamW(
+
+## API 参考
+
+{{ autodoc:colossalai.nn.optimizer.distributed_adafactor.DistributedAdaFactor }}
+{{ autodoc:colossalai.nn.optimizer.distributed_lamb.DistributedLamb }}
+{{ autodoc:colossalai.nn.optimizer.distributed_galore.DistGaloreAwamW }}
+{{ autodoc:colossalai.nn.optimizer.distributed_came.DistributedCAME }}
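
Note on the "Optimizers" paragraph touched by this patch: it says Adafactor uses a factorization to reduce memory, and the introduction says such optimizers need layer-wise statistics. The sketch below is an illustration of that idea only, not ColossalAI's `DistributedAdaFactor` code. It keeps just the row and column sums of the squared gradient (n + m numbers instead of n * m) and rebuilds a rank-1 estimate from them. The function names are hypothetical, and the exponential moving average, update clipping, and step-size schedule of the real optimizer are left out.

```python
def factor_second_moment(grad, eps=1e-30):
    """Return (row_sums, col_sums) of the element-wise squared gradient.

    `grad` is a list of equal-length rows. Only n + m numbers are kept,
    which is where the memory saving comes from.
    """
    squared = [[g * g + eps for g in row] for row in grad]
    row_sums = [sum(row) for row in squared]
    col_sums = [sum(col) for col in zip(*squared)]
    return row_sums, col_sums


def reconstruct_second_moment(row_sums, col_sums):
    """Rebuild a rank-1 estimate: v[i][j] = row_sums[i] * col_sums[j] / total."""
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


# A rank-1 gradient (outer product of [1, 3] and [1, 2]) is recovered exactly.
grad = [[1.0, 2.0], [3.0, 6.0]]
rows, cols = factor_second_moment(grad, eps=0.0)
estimate = reconstruct_second_moment(rows, cols)
assert estimate == [[1.0, 4.0], [9.0, 36.0]]
```

Both sums run over a whole weight matrix, so when that matrix is sharded across devices each rank sees only part of every row or column. That is presumably the layer-wise statistic that the distributed variants have to reduce across ranks.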