ColossalAI/examples/tutorial/new_api/torch_ddp
Frank Lee 73d3e4d309
[booster] implemented the torch ddd + resnet example (#3232)
2023-03-27 10:24:14 +08:00
.gitignore
README.md
eval.py
train.py

README.md

Distributed Data Parallel

🚀 Quick Start

This example provides a training script and an evaluation script. The training script shows how to train ResNet on the CIFAR10 dataset from scratch.

  • Training Arguments

    • -r, --resume: resume from checkpoint file path
    • -c, --checkpoint: the folder to save checkpoints
    • -i, --interval: epoch interval to save checkpoints
    • -f, --fp16: use fp16
  • Eval Arguments

    • -e, --epoch: select the epoch to evaluate
    • -c, --checkpoint: the folder where checkpoints are found
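The training flags above can be wired up with argparse; the sketch below mirrors them as a minimal parser. The defaults and argument types are assumptions for illustration, not taken from the actual train.py.

```python
# Hypothetical sketch of the argument parsing behind the training flags
# listed above; defaults and types are assumptions, not the real train.py.
import argparse

def build_train_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Train ResNet on CIFAR10 with torch DDP")
    parser.add_argument("-r", "--resume", type=str, default=None,
                        help="resume from checkpoint file path")
    parser.add_argument("-c", "--checkpoint", type=str, default="./checkpoint",
                        help="the folder to save checkpoints")
    parser.add_argument("-i", "--interval", type=int, default=5,
                        help="epoch interval to save checkpoints")
    parser.add_argument("-f", "--fp16", action="store_true",
                        help="use fp16 mixed precision")
    return parser

# Example: the same flags as the fp16 training command below.
args = build_train_parser().parse_args(["-c", "./ckpt-fp16", "--fp16"])
print(args.checkpoint, args.fp16)  # -> ./ckpt-fp16 True
```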

Train

# train with torch DDP with fp32
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp32

# train with torch DDP with mixed precision training
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp16 --fp16

Eval

# evaluate fp32 training
python eval.py -c ./ckpt-fp32 -e 80

# evaluate fp16 mixed precision training
python eval.py -c ./ckpt-fp16 -e 80
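Since eval.py takes a checkpoint folder (-c) and an epoch (-e), it has to resolve the two into a concrete checkpoint file. The helper below sketches one way to do that; the "model_<epoch>" file name is an assumption for illustration, not the naming scheme actually used by this example.

```python
# Hypothetical helper resolving (-c folder, -e epoch) to a checkpoint file.
# The "model_<epoch>" naming is an assumption, not the example's real scheme.
import os

def checkpoint_path(folder: str, epoch: int) -> str:
    return os.path.join(folder, f"model_{epoch}")

# Example matching the eval command above:
print(checkpoint_path("./ckpt-fp32", 80))  # -> ./ckpt-fp32/model_80
```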

Expected accuracy:

Model      Single-GPU Baseline FP32   Booster DDP with FP32   Booster DDP with FP16
ResNet-18  85.85%                     85.03%                  85.12%

Note: the baseline is adapted from this script to use torchvision.models.resnet18