mirror of https://github.com/hpcaitech/ColossalAI
Distributed Data Parallel
🚀 Quick Start
This example provides a training script (train.py) and an evaluation script (eval.py). The training script trains ResNet on the CIFAR10 dataset from scratch.
- Training Arguments
  - `-r`, `--resume`: resume from checkpoint file path
  - `-c`, `--checkpoint`: the folder to save checkpoints
  - `-i`, `--interval`: epoch interval to save checkpoints
  - `-f`, `--fp16`: use fp16
- Eval Arguments
  - `-e`, `--epoch`: select the epoch to evaluate
  - `-c`, `--checkpoint`: the folder where checkpoints are found
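The flags listed above map naturally onto `argparse`. The following is a minimal sketch, not a copy of the scripts: the defaults, the `required` settings, and the `model_{epoch}.pth` file name used by the hypothetical `checkpoint_path` helper are all assumptions, so check train.py and eval.py for the real values.

```python
import argparse
from pathlib import Path


def build_train_parser() -> argparse.ArgumentParser:
    """Parser mirroring the training flags listed in this README."""
    parser = argparse.ArgumentParser(description="Train ResNet on CIFAR10")
    # Defaults below are assumptions, not taken from train.py.
    parser.add_argument("-r", "--resume", type=str, default=None,
                        help="resume from checkpoint file path")
    parser.add_argument("-c", "--checkpoint", type=str, default="./checkpoint",
                        help="the folder to save checkpoints")
    parser.add_argument("-i", "--interval", type=int, default=5,
                        help="epoch interval to save checkpoints")
    parser.add_argument("-f", "--fp16", action="store_true", help="use fp16")
    return parser


def build_eval_parser() -> argparse.ArgumentParser:
    """Parser mirroring the evaluation flags listed in this README."""
    parser = argparse.ArgumentParser(description="Evaluate a saved checkpoint")
    parser.add_argument("-e", "--epoch", type=int, required=True,
                        help="select the epoch to evaluate")
    parser.add_argument("-c", "--checkpoint", type=str, required=True,
                        help="the folder where checkpoints are found")
    return parser


def checkpoint_path(folder: str, epoch: int) -> Path:
    """Hypothetical helper: the file naming scheme is an assumption."""
    return Path(folder) / f"model_{epoch}.pth"


# Parse explicit argument lists so the sketch runs without a command line.
train_args = build_train_parser().parse_args(["-c", "./ckpt-fp16", "--fp16"])
eval_args = build_eval_parser().parse_args(["-c", "./ckpt-fp16", "-e", "80"])
print(train_args.fp16, checkpoint_path(eval_args.checkpoint, eval_args.epoch))
```

Both scripts share `-c`, so the folder written during training is the one passed to evaluation.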
Train
# train with torch DDP with fp32
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp32
# train with torch DDP with mixed precision training
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp16 --fp16
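The two commands above differ only in the `--fp16` flag. As a rough sketch of how such a flag could select the precision mode with ColossalAI's Booster API and the torch DDP plugin: the helper names `precision_setting` and `build_booster` are hypothetical, and the real train.py may wire this up differently.

```python
from typing import Optional


def precision_setting(fp16: bool) -> Optional[str]:
    """Map the --fp16 flag to a mixed-precision setting (None keeps fp32)."""
    return "fp16" if fp16 else None


def build_booster(fp16: bool):
    """Create a Booster backed by torch DDP.

    Assumes colossalai is installed and the distributed environment has
    already been initialized, for example when started via `colossalai run`.
    """
    # Imported lazily so precision_setting() is usable without colossalai.
    from colossalai.booster import Booster
    from colossalai.booster.plugin import TorchDDPPlugin

    return Booster(mixed_precision=precision_setting(fp16), plugin=TorchDDPPlugin())


print(precision_setting(True), precision_setting(False))
```

In a training script, the model, optimizer, criterion, dataloader, and learning-rate scheduler would then typically be passed through `booster.boost(...)` before the training loop starts.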
Eval
# evaluate fp32 training
python eval.py -c ./ckpt-fp32 -e 80
# evaluate fp16 mixed precision training
python eval.py -c ./ckpt-fp16 -e 80
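Assuming the evaluation reports top-1 classification accuracy, the metric can be sketched in plain Python as below. The function name `top1_accuracy` and the toy predictions are illustrative only and are not taken from eval.py.

```python
from typing import Sequence


def top1_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Return the percentage of predictions that match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Toy example: three of the four predicted class indices are correct.
acc = top1_accuracy([3, 5, 1, 1], [3, 5, 2, 1])
print(f"{acc:.2f}%")
```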
The expected accuracy is:
Model | Single-GPU Baseline FP32 | Booster DDP with FP32 | Booster DDP with FP16 |
---|---|---|---|
ResNet-18 | 85.85% | 85.03% | 85.12% |
Note: the baseline is adapted from the script to use torchvision.models.resnet18
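To read the table as gaps against the single-GPU baseline, the differences can be computed directly from the reported numbers. This short sketch uses `Decimal` to avoid floating-point rounding noise; the variable names are illustrative.

```python
from decimal import Decimal

# Accuracy values copied from the table above (percent).
baseline = Decimal("85.85")
results = {
    "Booster DDP with FP32": Decimal("85.03"),
    "Booster DDP with FP16": Decimal("85.12"),
}

# Gap in percentage points between the baseline and each DDP run.
gaps = {name: baseline - acc for name, acc in results.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap} points below baseline")
```

Both DDP runs land within one percentage point of the baseline (0.82 and 0.73 points respectively).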