# Quick demo

Colossal-AI is an integrated large-scale deep learning system with efficient parallelization techniques. It can
accelerate model training on distributed systems with multiple GPUs by applying parallelization techniques, and it can
also run on systems with only one GPU. Quick demos showing how to use Colossal-AI are given below.

## Single GPU

Colossal-AI can be used to train deep learning models on systems with only one GPU and achieve baseline
performance. We provide an example that trains ResNet on the CIFAR10 dataset with only one GPU. You can find this
example in `examples/resnet_cifar10_data_parallel` in the repository. Detailed instructions can be found in its `README.md`.
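
To give a feel for the kind of components that example assembles, here is a minimal sketch. It assumes `torchvision`
is installed; the choice of ResNet-18, the transforms, batch size, and data path are illustrative assumptions, and the
example's own `README.md` and source code remain the reference.

```python
from colossalai.utils import get_dataloader
from torchvision import transforms
from torchvision.datasets import CIFAR10
from torchvision.models import resnet18

# illustrative values; the official example may use different settings
BATCH_SIZE = 128
DATA_ROOT = './data'

# ResNet-18 with a 10-class head for CIFAR10
model = resnet18(num_classes=10)

transform = transforms.Compose([transforms.ToTensor()])
train_dataset = CIFAR10(root=DATA_ROOT, train=True, download=True, transform=transform)
test_dataset = CIFAR10(root=DATA_ROOT, train=False, download=True, transform=transform)

# get_dataloader is Colossal-AI's helper around torch.utils.data.DataLoader;
# extra keyword arguments such as batch_size are forwarded to the DataLoader
train_dataloader = get_dataloader(train_dataset, shuffle=True, batch_size=BATCH_SIZE)
test_dataloader = get_dataloader(test_dataset, batch_size=BATCH_SIZE)
```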

## Multiple GPUs

Colossal-AI can be used to train deep learning models on distributed systems with multiple GPUs and drastically
accelerate the training process by applying efficient parallelization techniques, which are elaborated in
the [Parallelization](parallelization.md) section below.

You can turn the ResNet example mentioned above into multi-GPU training by setting `--nproc_per_node` (a flag of the
PyTorch distributed launcher) to the number of GPUs on your system. We also provide a Vision Transformer example which
requires more GPUs for training. You can find it in `examples/vit_b16_imagenet_data_parallel`, and it also comes with a
detailed `README.md`.

## Sample Training Script

Below is a typical way to train your model with Colossal-AI:

```python
import colossalai
from colossalai.amp import AMP_TYPE
from colossalai.logging import get_dist_logger
from colossalai.trainer import Trainer, hooks
from colossalai.utils import get_dataloader


CONFIG = dict(
    parallel=dict(
        pipeline=1,
        tensor=dict(size=1, mode=None)
    ),
    fp16=dict(
        mode=AMP_TYPE.TORCH
    ),
    gradient_accumulation=4,
    clip_grad_norm=1.0
)


def run_trainer():
    parser = colossalai.get_default_parser()
    args = parser.parse_args()
    colossalai.launch(config=CONFIG,
                      rank=args.rank,
                      world_size=args.world_size,
                      host=args.host,
                      port=args.port,
                      backend=args.backend)

    logger = get_dist_logger()

    # instantiate your components
    model = MyModel()
    optimizer = MyOptimizer(model.parameters(), ...)
    criterion = MyCriterion()
    train_dataset = TrainDataset()
    test_dataset = TestDataset()
    train_dataloader = get_dataloader(train_dataset, ...)
    test_dataloader = get_dataloader(test_dataset, ...)
    lr_scheduler = MyScheduler()
    logger.info("components are built")

    engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(
        model,
        optimizer,
        criterion,
        train_dataloader,
        test_dataloader,
        lr_scheduler
    )

    trainer = Trainer(engine=engine,
                      verbose=True)

    hook_list = [
        hooks.LossHook(),
        hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False),
        hooks.AccuracyHook(),
        hooks.TensorboardHook(log_dir='./tb_logs', ranks=[0]),
        hooks.LogMetricByEpochHook(logger),
        hooks.LogMemoryByEpochHook(logger),
        hooks.SaveCheckpointHook(checkpoint_dir='./ckpt')
    ]

    trainer.fit(
        train_dataloader=train_dataloader,
        test_dataloader=test_dataloader,
        epochs=NUM_EPOCH,
        hooks=hook_list,
        display_progress=True,
        test_interval=2
    )


if __name__ == '__main__':
    run_trainer()
```

Alternatively, the `model` variable can be substituted with a self-defined model or a pre-defined model in our Model
Zoo. The detailed substitution process is elaborated [here](model.md).
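
As a concrete illustration of the self-defined case, below is a minimal sketch; `SimpleNet` is a hypothetical
placeholder for your own `torch.nn.Module`, and a Model Zoo model can be dropped in the same way following the page
linked above.

```python
import torch.nn as nn


# a hypothetical self-defined model; any torch.nn.Module can take its place
class SimpleNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 32 * 32, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        return self.net(x)


# use this in the training script above in place of MyModel()
model = SimpleNet(num_classes=10)
```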

## Features

Colossal-AI provides a collection of parallel training components for you. We aim to support you in developing
distributed deep learning models just as you would write single-GPU models, and we provide friendly tools to kickstart
distributed training in a few lines.

- [Data Parallelism](parallelization.md)
- [Pipeline Parallelism](parallelization.md)
- [1D, 2D, 2.5D, 3D and sequence parallelism](parallelization.md)
- [Friendly trainer and engine](trainer_engine.md)
- [Extensible for new parallelism](add_your_parallel.md)
- [Mixed Precision Training](amp.md)
- [Zero Redundancy Optimizer (ZeRO)](zero.md)