Distributed Data Parallel

🚀 Quick Start

This example provides a training script (train.py) and an evaluation script (eval.py). The training script trains ResNet-18 on the CIFAR10 dataset from scratch.

Training Arguments
- -r, --resume: path of the checkpoint file to resume training from
- -c, --checkpoint: folder in which checkpoints are saved
- -i, --interval: number of epochs between checkpoint saves
- -f, --fp16: enable fp16 mixed precision training

Eval Arguments
- -e, --epoch: epoch whose checkpoint is evaluated
- -c, --checkpoint: folder in which checkpoints are found
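The training flags listed above could be declared with argparse roughly as follows. This is only a sketch: the flag names come from this README, but the types, the default values, and the helpers build_train_parser and should_save are hypothetical, and the actual train.py may differ.

```python
import argparse


def build_train_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the training flags in this README."""
    parser = argparse.ArgumentParser(description="ResNet on CIFAR10 with torch DDP")
    # Types and defaults below are assumptions, not taken from train.py.
    parser.add_argument("-r", "--resume", type=str, default=None,
                        help="path of the checkpoint file to resume from")
    parser.add_argument("-c", "--checkpoint", type=str, default="./checkpoint",
                        help="folder in which checkpoints are saved")
    parser.add_argument("-i", "--interval", type=int, default=5,
                        help="number of epochs between checkpoint saves")
    parser.add_argument("-f", "--fp16", action="store_true",
                        help="enable fp16 mixed precision training")
    return parser


def should_save(epoch: int, interval: int) -> bool:
    """Assumed rule: save after every `interval`-th epoch (epochs are 0-based)."""
    return (epoch + 1) % interval == 0


# Parse the same flags as the mixed precision command in the Train section.
args = build_train_parser().parse_args(["-c", "./ckpt-fp16", "--fp16"])
print(args.checkpoint, args.fp16, args.resume)
```

Passing an explicit list to parse_args keeps the sketch independent of the real command line, which makes it easy to try in an interactive session.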

Train

# train with torch DDP with fp32
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp32

# train with torch DDP with mixed precision training
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp16 --fp16

Eval

# evaluate fp32 training
python eval.py -c ./ckpt-fp32 -e 80

# evaluate fp16 mixed precision training
python eval.py -c ./ckpt-fp16 -e 80

The expected accuracy is:

Model	Single-GPU Baseline FP32	Booster DDP with FP32	Booster DDP with FP16
ResNet-18	85.85%	85.03%	85.12%

Note: the single-GPU baseline is adapted from the script to use torchvision.models.resnet18.
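To put the table in perspective, the gap between each Booster DDP run and the single-GPU baseline can be computed directly from the reported figures. The snippet below is a small worked example using only the numbers in the table; the variable and function names are illustrative.

```python
# Accuracy figures copied from the table above, in percent.
baseline_fp32 = 85.85
booster_ddp_fp32 = 85.03
booster_ddp_fp16 = 85.12


def gap_to_baseline(baseline: float, value: float) -> float:
    """Difference to the baseline in percentage points, rounded to 2 decimals."""
    return round(baseline - value, 2)


fp32_gap = gap_to_baseline(baseline_fp32, booster_ddp_fp32)
fp16_gap = gap_to_baseline(baseline_fp32, booster_ddp_fp16)
print(f"DDP FP32 gap: {fp32_gap} points")
print(f"DDP FP16 gap: {fp16_gap} points")
```

Both DDP runs land within one percentage point of the baseline.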