Distributed Data Parallel

🚀 Quick Start

This example provides a training script and an evaluation script. The training script trains ResNet on the CIFAR10 dataset from scratch.

  • Training Arguments

    • -r, --resume: path of the checkpoint file to resume from
    • -c, --checkpoint: the folder in which checkpoints are saved
    • -i, --interval: the interval, in epochs, at which checkpoints are saved
    • -f, --fp16: use fp16 mixed precision training
  • Eval Arguments

    • -e, --epoch: select the epoch to evaluate
    • -c, --checkpoint: the folder where checkpoints are found

Train

# train with torch DDP with fp32
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp32

# train with torch DDP with mixed precision training
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp16 --fp16

Eval

# evaluate fp32 training
python eval.py -c ./ckpt-fp32 -e 80

# evaluate fp16 mixed precision training
python eval.py -c ./ckpt-fp16 -e 80
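The `-c` and `-e` arguments together identify a single checkpoint file. The sketch below shows one way that lookup could work. The `model_<epoch>.pth` naming scheme and both helper functions are assumptions for illustration; check `train.py` for the file names it actually writes.

```python
import tempfile
from pathlib import Path
from typing import List

# Assumed file name prefix; the real scripts may use a different scheme.
PREFIX = "model_"


def checkpoint_path(folder: str, epoch: int) -> Path:
    """Build the (assumed) checkpoint file path for a folder and an epoch."""
    return Path(folder) / f"{PREFIX}{epoch}.pth"


def available_epochs(folder: str) -> List[int]:
    """List the epochs that have a checkpoint file in `folder`, sorted ascending."""
    epochs = []
    for path in Path(folder).glob(f"{PREFIX}*.pth"):
        suffix = path.stem[len(PREFIX):]
        if suffix.isdigit():
            epochs.append(int(suffix))
    return sorted(epochs)


if __name__ == "__main__":
    # Demonstrate on a throwaway folder holding empty placeholder files.
    with tempfile.TemporaryDirectory() as folder:
        for epoch in (5, 10, 80):
            checkpoint_path(folder, epoch).touch()
        print(available_epochs(folder))
        print(checkpoint_path(folder, 80).name)
```

Listing the available epochs before evaluating is a cheap way to catch a mismatch between the `-e` value and the save interval used during training.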

The expected accuracy is:

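| Model     | Single-GPU Baseline FP32 | Booster DDP with FP32 | Booster DDP with FP16 |
| --------- | ------------------------ | --------------------- | --------------------- |
| ResNet-18 | 85.85%                   | 85.03%                | 85.12%                |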

Note: the baseline is adapted from the script to use torchvision.models.resnet18
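To put the reported numbers in perspective, the short sketch below computes how far each Booster DDP run is from the single-GPU baseline. The accuracies are copied from the results above; the `gap_to_baseline` helper is only for illustration.

```python
# Accuracies in percent, copied from the reported results.
accuracy = {
    "Single-GPU Baseline FP32": 85.85,
    "Booster DDP with FP32": 85.03,
    "Booster DDP with FP16": 85.12,
}
baseline = accuracy["Single-GPU Baseline FP32"]


def gap_to_baseline(name: str) -> float:
    """Return the baseline accuracy minus the named run's accuracy, in percentage points."""
    return round(baseline - accuracy[name], 2)


for name in accuracy:
    if name != "Single-GPU Baseline FP32":
        print(f"{name}: {gap_to_baseline(name)} points below baseline")
```

Both DDP runs land within one percentage point of the baseline.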