Booster Plugins
Author: Hongxin Liu
Prerequisite:
- Booster API
Introduction
As mentioned in Booster API, we can use booster plugins to customize the parallel training. In this tutorial, we will introduce how to use booster plugins.
We currently provide the following plugins:
- Low Level Zero Plugin: It wraps `colossalai.zero.low_level.LowLevelZeroOptimizer` and can be used to train models with zero-dp. It only supports Zero stage-1 and stage-2.
- Gemini Plugin: It wraps Gemini, which implements Zero-3 with chunk-based and heterogeneous memory management.
- Torch DDP Plugin: It is a wrapper of `torch.nn.parallel.DistributedDataParallel` and can be used to train models with data parallelism.
- Torch FSDP Plugin: It is a wrapper of `torch.distributed.fsdp.FullyShardedDataParallel` and can be used to train models with zero-dp.
More plugins are coming soon.
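Before looking at each plugin in detail, here is a minimal end-to-end sketch of the pattern they all share: create a plugin, pass it to `Booster`, and boost the model and optimizer before the training loop. The toy model, the dummy data, and the choice of `TorchDDPPlugin` are only illustrative assumptions; any plugin described below can be substituted.

```python
import torch
import torch.nn as nn

import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin

# Initialize the distributed environment (e.g. when launched with torchrun).
# Older releases expect a config dict; newer ones may not need it.
colossalai.launch_from_torch(config={})

# Any plugin from this tutorial can be dropped in here.
plugin = TorchDDPPlugin()
booster = Booster(plugin=plugin)

# A toy model and optimizer, just for illustration.
model = nn.Linear(16, 2).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Wrap them according to the plugin's parallel strategy.
model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)

# A standard training step; note booster.backward instead of loss.backward.
inputs = torch.randn(8, 16).cuda()
labels = torch.randint(0, 2, (8,)).cuda()
loss = criterion(model(inputs), labels)
booster.backward(loss, optimizer)
optimizer.step()
optimizer.zero_grad()
```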
Plugins
Low Level Zero Plugin
This plugin implements Zero-1 and Zero-2 (with or without CPU offload), using `reduce` and `gather` to synchronize gradients and weights.
Zero-1 can be regarded as a better substitute for Torch DDP: it is more memory-efficient and faster, and it can be easily used in hybrid parallelism.
Zero-2 does not support local gradient accumulation. You can still accumulate gradients if you insist, but doing so does not reduce communication cost, so it is not a good idea to use Zero-2 with pipeline parallelism.
{{ autodoc:colossalai.booster.plugin.LowLevelZeroPlugin }}
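For example, a hedged sketch of constructing this plugin; the `stage` and `precision` keyword names are assumptions, so check the signature above:

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import LowLevelZeroPlugin

# stage=1 selects Zero-1, stage=2 selects Zero-2 (assumed keyword names).
plugin = LowLevelZeroPlugin(stage=2, precision='fp16')
booster = Booster(plugin=plugin)
```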
We've tested compatibility with some well-known models; the following models may not be supported:
- `timm.models.convit_base`
- dlrm and deepfm models in `torchrec`
- `diffusers.VQModel`
- `transformers.AlbertModel`
- `transformers.AlbertForPreTraining`
- `transformers.BertModel`
- `transformers.BertForPreTraining`
- `transformers.GPT2DoubleHeadsModel`
Compatibility problems will be fixed in the future.
⚠ This plugin can currently only load optimizer checkpoints saved by itself with the same number of processes. This will be fixed in the future.
Gemini Plugin
This plugin implements Zero-3 with chunk-based and heterogeneous memory management. It can train large models without much loss in speed. It also does not support local gradient accumulation. More details can be found in Gemini Doc.
{{ autodoc:colossalai.booster.plugin.GeminiPlugin }}
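A hedged construction sketch follows; the `placement_policy` and `precision` keyword names are assumptions, so check the signature above:

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

# Chunk-based Zero-3 with heterogeneous (CPU/GPU) memory management.
plugin = GeminiPlugin(placement_policy='auto', precision='fp16')
booster = Booster(plugin=plugin)
```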
⚠ This plugin can currently only load optimizer checkpoints saved by itself with the same number of processes. This will be fixed in the future.
Torch DDP Plugin
More details can be found in the PyTorch docs.
{{ autodoc:colossalai.booster.plugin.TorchDDPPlugin }}
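Constructor keyword arguments are presumably forwarded to `DistributedDataParallel`; a hedged sketch:

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin

# Keyword arguments are assumed to be passed through to torch DDP.
plugin = TorchDDPPlugin(broadcast_buffers=True, find_unused_parameters=False)
booster = Booster(plugin=plugin)
```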
Torch FSDP Plugin
⚠ This plugin is not available when the torch version is lower than 1.12.0.
⚠ This plugin does not support saving/loading sharded model checkpoints yet.
⚠ This plugin does not support optimizers that use multiple parameter groups.
More details can be found in the PyTorch docs.
{{ autodoc:colossalai.booster.plugin.TorchFSDPPlugin }}
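As with the DDP plugin, construction is a thin wrapper around the underlying PyTorch class; a hedged sketch, with keyword arguments assumed to be forwarded to `FullyShardedDataParallel`:

```python
from torch.distributed.fsdp import CPUOffload

from colossalai.booster import Booster
from colossalai.booster.plugin import TorchFSDPPlugin

# Keyword arguments are assumed to be passed through to torch FSDP.
plugin = TorchFSDPPlugin(cpu_offload=CPUOffload(offload_params=False))
booster = Booster(plugin=plugin)
```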