# Quick demo
Colossal-AI is an integrated large-scale deep learning system with efficient parallelization techniques. It accelerates
model training on distributed systems with multiple GPUs by applying these techniques, and it can also run on systems
with only one GPU. Quick demos showing how to use Colossal-AI are given below.
## Single GPU
Colossal-AI can be used to train deep learning models on systems with only one GPU and achieve baseline
performance. We provide an example of training ResNet on the CIFAR10 dataset with a single GPU. You can find this example in
`examples/resnet_cifar10_data_parallel` in the repository; detailed instructions can be found in its `README.md`.
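
If you only have one GPU, the launch call reduces to a single process. The snippet below is a minimal sketch (not taken
from that example) of such a launch; the host, port and backend values are placeholders you would adapt to your machine,
and `CONFIG` stands for your own configuration dict.

```python
import colossalai

# single-process setup: rank 0 in a world of size 1
colossalai.launch(config=CONFIG,       # your configuration dict
                  rank=0,
                  world_size=1,
                  host='127.0.0.1',    # placeholder address
                  port=29500,          # any free port
                  backend='nccl')
```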
## Multiple GPUs
Colossal-AI can be used to train deep learning models on distributed systems with multiple GPUs and accelerate the
training process drastically by applying efficient parallelization techniques, which are elaborated in
the [Parallelization](parallelization.md) section below.
You can turn the ResNet example mentioned above into a multi-GPU training run by setting `--nproc_per_node` to the number of
GPUs on your system. We also provide an example of Vision Transformer which relies on training with more GPUs. You can find
this example in `examples/vit_b16_imagenet_data_parallel`; it also comes with a detailed `README.md`.
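
For reference, the rank and world size do not need to be hard-coded: `torch.distributed.launch` exports them as
environment variables for every process it spawns. A hedged sketch of forwarding them to `colossalai.launch` is shown
below; `CONFIG` is your own configuration dict, and the sample script in the next section achieves the same thing
through `colossalai.get_default_parser()` instead.

```python
import os
import colossalai

# the environment variables below are set by torch.distributed.launch
colossalai.launch(config=CONFIG,
                  rank=int(os.environ['RANK']),
                  world_size=int(os.environ['WORLD_SIZE']),  # equals --nproc_per_node on a single node
                  host=os.environ['MASTER_ADDR'],
                  port=int(os.environ['MASTER_PORT']),
                  backend='nccl')
```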
## Sample Training Script
Below is a typical script for training a model with Colossal-AI's trainer API.
```python
import colossalai
from colossalai.amp import AMP_TYPE
from colossalai.logging import get_dist_logger
from colossalai.trainer import Trainer, hooks
from colossalai.utils import get_dataloader

CONFIG = dict(
    # no pipeline or tensor parallelism in this demo
    parallel=dict(
        pipeline=1,
        tensor=dict(size=1, mode=None)
    ),
    # mixed-precision training with native PyTorch AMP
    fp16=dict(
        mode=AMP_TYPE.TORCH
    ),
    # accumulate gradients over 4 steps before each optimizer update
    gradient_accumulation=4,
    # clip gradients to a maximum L2 norm of 1.0
    clip_grad_norm=1.0
)

def run_trainer():
    parser = colossalai.get_default_parser()
    args = parser.parse_args()

    colossalai.launch(config=CONFIG,
                      rank=args.rank,
                      world_size=args.world_size,
                      host=args.host,
                      port=args.port,
                      backend=args.backend)
    logger = get_dist_logger()

    # instantiate your components
    model = MyModel()
    optimizer = MyOptimizer(model.parameters(), ...)
    criterion = MyCriterion()  # loss function, required by colossalai.initialize
    train_dataset = TrainDataset()
    test_dataset = TestDataset()
    train_dataloader = get_dataloader(train_dataset, ...)
    test_dataloader = get_dataloader(test_dataset, ...)
    lr_scheduler = MyScheduler()
    logger.info("components are built")

    # wrap the components into an engine that applies AMP, gradient
    # accumulation and gradient clipping according to CONFIG
    engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(model,
                                                                                    optimizer,
                                                                                    criterion,
                                                                                    train_dataloader,
                                                                                    test_dataloader,
                                                                                    lr_scheduler)

    trainer = Trainer(engine=engine, verbose=True)

    # hooks add logging, evaluation, LR scheduling and checkpointing to the training loop
    hook_list = [
        hooks.LossHook(),
        hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False),
        hooks.AccuracyHook(),
        hooks.TensorboardHook(log_dir='./tb_logs', ranks=[0]),
        hooks.LogMetricByEpochHook(logger),
        hooks.LogMemoryByEpochHook(logger),
        hooks.SaveCheckpointHook(checkpoint_dir='./ckpt')
    ]

    trainer.fit(
        train_dataloader=train_dataloader,
        test_dataloader=test_dataloader,
        epochs=NUM_EPOCH,  # NUM_EPOCH is a constant you define
        hooks=hook_list,
        display_progress=True,
        test_interval=2
    )


if __name__ == '__main__':
    run_trainer()
```
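
The names `MyModel`, `MyOptimizer`, `MyCriterion`, `TrainDataset`, `TestDataset`, `MyScheduler` and `NUM_EPOCH` in the
script are placeholders. Purely as an illustration (not part of the official example), they could be filled in with
standard PyTorch and torchvision components along the following lines; the hyper-parameter values and the keyword
arguments passed to `get_dataloader` are assumptions for this sketch.

```python
import torch
import torchvision
import torchvision.transforms as transforms
from colossalai.utils import get_dataloader

NUM_EPOCH = 10  # illustrative value

model = torchvision.models.resnet18(num_classes=10)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=NUM_EPOCH)

transform = transforms.Compose([transforms.ToTensor()])
train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_dataset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

# extra keyword arguments such as batch_size are forwarded to the underlying DataLoader
train_dataloader = get_dataloader(train_dataset, shuffle=True, batch_size=128)
test_dataloader = get_dataloader(test_dataset, batch_size=128)
```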
Alternatively, the `model` variable can be substituted with a model you define yourself or with a pre-defined model from our Model
Zoo. The substitution process is explained in detail [here](model.md).
## Features
Colossal-AI provides a collection of parallel training components. We aim to let you develop distributed deep learning
models just as you would write single-GPU models, and we provide friendly tools to kick-start distributed training in a
few lines.
- [Data Parallelism](parallelization.md)
- [Pipeline Parallelism](parallelization.md)
- [1D, 2D, 2.5D, 3D and sequence parallelism](parallelization.md)
- [Friendly trainer and engine](trainer_engine.md)
- [Extensible for new parallelism](add_your_parallel.md)
- [Mixed Precision Training](amp.md)
- [Zero Redundancy Optimizer (ZeRO)](zero.md)