mirror of https://github.com/hpcaitech/ColossalAI
Train ViT on CIFAR-10 from scratch
🚀 Quick Start
This example provides a script that trains ViT on the CIFAR-10 dataset from scratch.
- Training Arguments
  - `-p`, `--plugin`: Plugin to use. Choices: `torch_ddp`, `torch_ddp_fp16`, `low_level_zero`. Defaults to `torch_ddp`.
  - `-r`, `--resume`: Resume from checkpoint file path. Defaults to `-1`, which means not resuming.
  - `-c`, `--checkpoint`: The folder to save checkpoints. Defaults to `./checkpoint`.
  - `-i`, `--interval`: Epoch interval to save checkpoints. Defaults to `5`. If set to `0`, no checkpoint will be saved.
  - `--target_acc`: Target accuracy. Raise exception if not reached. Defaults to `None`.
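The arguments above could be declared roughly as follows. This is a minimal sketch based only on the flags and defaults listed in this README, not a copy of `train.py`. The helper names `build_parser`, `should_save_checkpoint`, and `check_target_acc` are hypothetical, and the type chosen for `--resume` and the zero-based epoch counter are assumptions.

```python
import argparse
from typing import Optional

PLUGIN_CHOICES = ["torch_ddp", "torch_ddp_fp16", "low_level_zero"]


def build_parser() -> argparse.ArgumentParser:
    """Declare the training arguments documented in this README (sketch)."""
    parser = argparse.ArgumentParser(description="Train ViT on CIFAR-10 (argument sketch)")
    parser.add_argument("-p", "--plugin", type=str, default="torch_ddp",
                        choices=PLUGIN_CHOICES, help="Plugin to use")
    # Assumption: kept as a string so that both the documented default (-1)
    # and a checkpoint path fit; the real script may use a different type.
    parser.add_argument("-r", "--resume", type=str, default="-1",
                        help="Checkpoint to resume from; -1 means not resuming")
    parser.add_argument("-c", "--checkpoint", type=str, default="./checkpoint",
                        help="Folder to save checkpoints")
    parser.add_argument("-i", "--interval", type=int, default=5,
                        help="Epoch interval to save checkpoints; 0 disables saving")
    parser.add_argument("--target_acc", type=float, default=None,
                        help="Target accuracy; an exception is raised if not reached")
    return parser


def should_save_checkpoint(epoch: int, interval: int) -> bool:
    """Return True when a checkpoint is due. `epoch` is zero-based (assumption)."""
    return interval > 0 and (epoch + 1) % interval == 0


def check_target_acc(accuracy: float, target_acc: Optional[float]) -> None:
    """Raise if a target accuracy was requested and was not reached."""
    if target_acc is not None and accuracy < target_acc:
        raise RuntimeError(f"Accuracy {accuracy:.4f} is below target {target_acc:.4f}")


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With the default interval of `5`, this sketch would save after epochs 5, 10, 15, and so on (counting from 1), and never when the interval is `0`.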
Install requirements
```bash
pip install -r requirements.txt
```
Train
```bash
# train with torch DDP with fp32
colossalai run --nproc_per_node 4 train.py -c ./ckpt-fp32

# train with torch DDP with mixed precision training
colossalai run --nproc_per_node 4 train.py -c ./ckpt-fp16 -p torch_ddp_fp16

# train with low level zero
colossalai run --nproc_per_node 4 train.py -c ./ckpt-low_level_zero -p low_level_zero
```
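The three commands differ only in the `-p` value. The sketch below shows one way a script might translate that value into a plugin family and a precision setting. It deliberately avoids importing ColossalAI: the function `resolve_plugin` and the returned dictionary layout are hypothetical. In the real script the result would presumably feed ColossalAI's booster and plugin classes, so consult `train.py` for the actual wiring.

```python
from typing import Dict, Tuple


def resolve_plugin(name: str) -> Tuple[str, Dict[str, str]]:
    """Map a --plugin value to (plugin family, extra booster settings).

    Hypothetical helper: the names returned here are illustrative labels,
    not ColossalAI objects.
    """
    booster_kwargs: Dict[str, str] = {}
    # torch_ddp_fp16 is described in this README as DDP with mixed precision.
    if name == "torch_ddp_fp16":
        booster_kwargs["mixed_precision"] = "fp16"

    if name.startswith("torch_ddp"):
        family = "torch_ddp"
    elif name == "low_level_zero":
        family = "low_level_zero"
    else:
        raise ValueError(f"Unknown plugin: {name!r}")
    return family, booster_kwargs


if __name__ == "__main__":
    for plugin_name in ("torch_ddp", "torch_ddp_fp16", "low_level_zero"):
        print(plugin_name, resolve_plugin(plugin_name))
```

Treating `torch_ddp` and `torch_ddp_fp16` as the same family with a different precision setting keeps the two DDP variants on one code path, which is one reasonable design; the actual script may structure this differently.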
The expected accuracy is:
Model | Single-GPU Baseline FP32 | Booster DDP with FP32 | Booster DDP with FP16 | Booster Low Level Zero |
---|---|---|---|---|
ViT | 83.00% | 84.03% | 84.00% | 84.43% |