ColossalAI/examples/tutorial/new_api/torch_ddp
Frank Lee 73d3e4d309
[booster] implemented the torch ddd + resnet example (#3232)
2023-03-27 10:24:14 +08:00
.gitignore
README.md
eval.py
train.py

README.md

Distributed Data Parallel

🚀 Quick Start

This example provides a training script and an evaluation script. The training script shows how to train ResNet on the CIFAR10 dataset from scratch.

  • Training Arguments

    • -r, --resume: resume from checkpoint file path
    • -c, --checkpoint: the folder to save checkpoints
    • -i, --interval: epoch interval to save checkpoints
    • -f, --fp16: use fp16
  • Eval Arguments

    • -e, --epoch: select the epoch to evaluate
    • -c, --checkpoint: the folder where checkpoints are found
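The training flags above can be wired up with argparse; the sketch below mirrors them as a minimal parser. The defaults and argument types are assumptions for illustration, not taken from the actual train.py.

```python
# Hypothetical sketch of the argument parsing behind the training flags
# listed above; defaults and types are assumptions, not the real train.py.
import argparse

def build_train_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Train ResNet on CIFAR10 with torch DDP")
    parser.add_argument("-r", "--resume", type=str, default=None,
                        help="resume from checkpoint file path")
    parser.add_argument("-c", "--checkpoint", type=str, default="./checkpoint",
                        help="the folder to save checkpoints")
    parser.add_argument("-i", "--interval", type=int, default=5,
                        help="epoch interval to save checkpoints")
    parser.add_argument("-f", "--fp16", action="store_true",
                        help="use fp16 mixed precision")
    return parser

# Example: the same flags as the fp16 training command below.
args = build_train_parser().parse_args(["-c", "./ckpt-fp16", "--fp16"])
print(args.checkpoint, args.fp16)  # -> ./ckpt-fp16 True
```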

Train

# train with torch DDP with fp32
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp32

# train with torch DDP with mixed precision training
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp16 --fp16

Eval

# evaluate fp32 training
python eval.py -c ./ckpt-fp32 -e 80

# evaluate fp16 mixed precision training
python eval.py -c ./ckpt-fp16 -e 80
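Since eval.py takes a checkpoint folder (-c) and an epoch (-e), it has to resolve the two into a concrete checkpoint file. The helper below sketches one way to do that; the "model_<epoch>" file name is an assumption for illustration, not the naming scheme actually used by this example.

```python
# Hypothetical helper resolving (-c folder, -e epoch) to a checkpoint file.
# The "model_<epoch>" naming is an assumption, not the example's real scheme.
import os

def checkpoint_path(folder: str, epoch: int) -> str:
    return os.path.join(folder, f"model_{epoch}")

# Example matching the eval command above:
print(checkpoint_path("./ckpt-fp32", 80))  # -> ./ckpt-fp32/model_80
```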

Expected accuracy:

Model      Single-GPU Baseline FP32   Booster DDP with FP32   Booster DDP with FP16
ResNet-18  85.85%                     85.03%                  85.12%

Note: the baseline is adapted from this script to use torchvision.models.resnet18