mirror of https://github.com/hpcaitech/ColossalAI
Train ResNet on CIFAR-10 from scratch
🚀 Quick Start
This example provides a training script (train.py) and an evaluation script (eval.py). The training script trains ResNet on the CIFAR-10 dataset from scratch.
- Training Arguments
  - `-p`, `--plugin`: Plugin to use. Choices: `torch_ddp`, `torch_ddp_fp16`, `low_level_zero`. Defaults to `torch_ddp`.
  - `-r`, `--resume`: Resume from checkpoint file path. Defaults to `-1`, which means not resuming.
  - `-c`, `--checkpoint`: The folder to save checkpoints. Defaults to `./checkpoint`.
  - `-i`, `--interval`: Epoch interval to save checkpoints. Defaults to `5`. If set to `0`, no checkpoint will be saved.
  - `--target_acc`: Target accuracy. Raise exception if not reached. Defaults to `None`.
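The training arguments listed above can be sketched with `argparse`. This is a minimal illustration based only on the flags and defaults documented here, not the actual contents of train.py. The helper names `build_train_parser`, `should_save_checkpoint`, and `check_target_acc` are hypothetical, and so are the 1-based epoch numbering and the choice of `RuntimeError`.

```python
import argparse
from typing import Optional

PLUGIN_CHOICES = ["torch_ddp", "torch_ddp_fp16", "low_level_zero"]


def build_train_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented training flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Train ResNet on CIFAR-10 (sketch)")
    parser.add_argument("-p", "--plugin", type=str, default="torch_ddp",
                        choices=PLUGIN_CHOICES, help="plugin to use")
    # The README gives -1 as the "not resuming" default; the value type used by
    # the real script is not stated, so no type conversion is applied here.
    parser.add_argument("-r", "--resume", default=-1,
                        help="checkpoint to resume from; -1 means not resuming")
    parser.add_argument("-c", "--checkpoint", type=str, default="./checkpoint",
                        help="folder to save checkpoints")
    parser.add_argument("-i", "--interval", type=int, default=5,
                        help="epoch interval to save checkpoints; 0 disables saving")
    parser.add_argument("--target_acc", type=float, default=None,
                        help="target accuracy; an exception is raised if not reached")
    return parser


def should_save_checkpoint(epoch: int, interval: int) -> bool:
    """Return True when a checkpoint is due. Assumes epochs are counted from 1."""
    return interval > 0 and epoch % interval == 0


def check_target_acc(accuracy: float, target_acc: Optional[float]) -> None:
    """Raise if a target accuracy was requested and not reached."""
    if target_acc is not None and accuracy < target_acc:
        raise RuntimeError(f"accuracy {accuracy:.4f} is below target {target_acc:.4f}")


# Parse an explicit argument list so the sketch does not depend on sys.argv.
example_args = build_train_parser().parse_args(["-p", "low_level_zero", "-i", "0"])
print(example_args.plugin, example_args.interval, example_args.checkpoint)
```

With `-i 0`, `should_save_checkpoint` returns `False` for every epoch, which matches the documented "no checkpoint will be saved" behavior.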
- Eval Arguments
  - `-e`, `--epoch`: Select the epoch to evaluate.
  - `-c`, `--checkpoint`: The folder where checkpoints are found.
Install requirements
pip install -r requirements.txt
Train
The checkpoint folders will be created automatically.
# train with torch DDP with fp32
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp32
# train with torch DDP with mixed precision training
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp16 -p torch_ddp_fp16
# train with low level zero
colossalai run --nproc_per_node 2 train.py -c ./ckpt-low_level_zero -p low_level_zero
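The three commands above differ only in the `-p` value. A rough sketch of how such a value might be turned into a ColossalAI `Booster` follows. The mapping in `plugin_settings` and the helper `build_booster` are assumptions made for illustration and are not taken from train.py. In particular, treating `torch_ddp_fp16` as `TorchDDPPlugin` plus `mixed_precision="fp16"` is a guess based on the plugin name, and the plugins are constructed with their default arguments.

```python
from typing import Optional, TypedDict


class PluginSettings(TypedDict):
    plugin_cls: str                  # class name inside colossalai.booster.plugin
    mixed_precision: Optional[str]   # value passed to Booster(mixed_precision=...)


def plugin_settings(name: str) -> PluginSettings:
    """Map a --plugin choice to a plugin class name and precision mode (assumed mapping)."""
    if name == "torch_ddp":
        return {"plugin_cls": "TorchDDPPlugin", "mixed_precision": None}
    if name == "torch_ddp_fp16":
        return {"plugin_cls": "TorchDDPPlugin", "mixed_precision": "fp16"}
    if name == "low_level_zero":
        # This plugin manages its own precision, so none is passed to Booster.
        return {"plugin_cls": "LowLevelZeroPlugin", "mixed_precision": None}
    raise ValueError(f"unknown plugin: {name!r}")


def build_booster(name: str):
    """Create a Booster for the given choice (hypothetical helper).

    ColossalAI is imported lazily so the mapping above can be inspected without
    it installed. This is expected to work only inside a process started by a
    distributed launcher such as `colossalai run`, after the distributed
    environment has been initialized.
    """
    from colossalai.booster import Booster
    import colossalai.booster.plugin as plugins

    settings = plugin_settings(name)
    plugin = getattr(plugins, settings["plugin_cls"])()
    return Booster(plugin=plugin, mixed_precision=settings["mixed_precision"])


for choice in ("torch_ddp", "torch_ddp_fp16", "low_level_zero"):
    print(choice, plugin_settings(choice))
```

Keeping the name-to-settings mapping separate from the object construction makes the choice logic easy to check on a machine without GPUs.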
Eval
# evaluate fp32 training
python eval.py -c ./ckpt-fp32 -e 80
# evaluate fp16 mixed precision training
python eval.py -c ./ckpt-fp16 -e 80
# evaluate low level zero training
python eval.py -c ./ckpt-low_level_zero -e 80
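Assuming the accuracy reported for this example is ordinary top-1 classification accuracy, the metric can be sketched in plain Python as below. The function `top1_accuracy` is a hypothetical illustration and is not taken from eval.py, which would more likely operate on batched tensors.

```python
from typing import Sequence


def top1_accuracy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of samples whose highest-scoring class equals the label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one sample is required")
    correct = 0
    for scores, label in zip(logits, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Three toy samples with two classes each; the first two are classified correctly.
toy_logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
toy_labels = [1, 0, 0]
print(f"top-1 accuracy: {top1_accuracy(toy_logits, toy_labels):.4f}")
```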
Expected accuracy:
Model | Single-GPU Baseline FP32 | Booster DDP with FP32 | Booster DDP with FP16 | Booster Low Level Zero | Booster Gemini |
---|---|---|---|---|---|
ResNet-18 | 85.85% | 84.91% | 85.46% | 84.50% | 84.60% |
Note: the baseline is adapted from the script to use torchvision.models.resnet18
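To read the table at a glance, the short script below computes each Booster configuration's gap to the single-GPU baseline in percentage points. The accuracy values are copied from the table above. The helper `gaps_to_baseline` is only an illustration.

```python
from typing import Dict

BASELINE = "Single-GPU Baseline FP32"

# ResNet-18 accuracy in percent, copied from the table above.
ACCURACY: Dict[str, float] = {
    "Single-GPU Baseline FP32": 85.85,
    "Booster DDP with FP32": 84.91,
    "Booster DDP with FP16": 85.46,
    "Booster Low Level Zero": 84.50,
    "Booster Gemini": 84.60,
}


def gaps_to_baseline(accuracy: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Return each configuration's accuracy minus the baseline, rounded to 2 decimals."""
    reference = accuracy[baseline]
    return {
        name: round(value - reference, 2)
        for name, value in accuracy.items()
        if name != baseline
    }


GAPS = gaps_to_baseline(ACCURACY, BASELINE)
for name, gap in GAPS.items():
    print(f"{name}: {gap:+.2f} points")
```

All four Booster configurations land within about 1.4 percentage points of the baseline, with DDP FP16 the closest at 0.39 points below.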