# Distributed Data Parallel
## 🚀 Quick Start
This example provides a training script and an evaluation script. The training script shows how to train ResNet-18 on the CIFAR-10 dataset from scratch. Both scripts accept the arguments listed below (a minimal parsing sketch follows the list).
- Training Arguments
  - `-r`, `--resume`: path of the checkpoint file to resume from
  - `-c`, `--checkpoint`: folder to save checkpoints to
  - `-i`, `--interval`: epoch interval between checkpoint saves
  - `-f`, `--fp16`: use fp16 mixed precision training
- Eval Arguments
  - `-e`, `--epoch`: the epoch to evaluate
  - `-c`, `--checkpoint`: folder where checkpoints are found
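Below is a minimal sketch of how these flags could be declared with `argparse`; the exact option names, types, and defaults in `train.py`/`eval.py` may differ.

```python
import argparse

# Hypothetical sketch of the training-script argument parser;
# the real train.py may declare these flags differently.
parser = argparse.ArgumentParser(description="Train ResNet-18 on CIFAR-10 with DDP")
parser.add_argument("-r", "--resume", type=str, default="",
                    help="path of the checkpoint file to resume from")
parser.add_argument("-c", "--checkpoint", type=str, default="./checkpoint",
                    help="folder to save checkpoints to")
parser.add_argument("-i", "--interval", type=int, default=5,
                    help="epoch interval between checkpoint saves")
parser.add_argument("-f", "--fp16", action="store_true",
                    help="use fp16 mixed precision training")
args = parser.parse_args()
```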
### Train
```bash
# train with torch DDP with fp32
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp32
# train with torch DDP with mixed precision training
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp16 --fp16
```
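Inside `train.py`, the `--fp16` flag presumably switches the Booster to fp16 mixed precision on top of the Torch DDP plugin. The following is a rough sketch of that wiring, assuming Colossal-AI's `Booster`/`TorchDDPPlugin` API; the actual script may structure this differently.

```python
import colossalai
import torch
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin
from torchvision.models import resnet18

# Sketch only: initialize the distributed environment from torchrun-style
# environment variables set by `colossalai run`.
colossalai.launch_from_torch(config={})

model = resnet18(num_classes=10)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# `mixed_precision="fp16"` corresponds to the `--fp16` flag; omit it for fp32.
booster = Booster(plugin=TorchDDPPlugin(), mixed_precision="fp16")
model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)

# Training step: use booster.backward so loss scaling is handled in fp16 mode.
# (dataloader construction omitted for brevity)
def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    booster.backward(loss, optimizer)
    optimizer.step()
    return loss
```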
### Eval
```bash
# evaluate fp32 training
python eval.py -c ./ckpt-fp32 -e 80
# evaluate fp16 mixed precision training
python eval.py -c ./ckpt-fp16 -e 80
```
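Evaluation only needs a single process: rebuild the model, load the checkpoint selected by `--epoch` from the `--checkpoint` folder, and measure top-1 accuracy on the CIFAR-10 test set. A rough sketch is shown below; the checkpoint file name (`model_<epoch>.pth`) is an assumption and may not match the real script.

```python
import os
import torch
from torchvision import datasets, transforms
from torchvision.models import resnet18

# Sketch only: the checkpoint naming scheme is assumed, not taken from eval.py.
epoch, ckpt_dir = 80, "./ckpt-fp32"
model = resnet18(num_classes=10)
state = torch.load(os.path.join(ckpt_dir, f"model_{epoch}.pth"), map_location="cpu")
model.load_state_dict(state)
model.eval()

# Commonly used CIFAR-10 normalization statistics.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
loader = torch.utils.data.DataLoader(test_set, batch_size=256)

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
print(f"top-1 accuracy: {correct / total:.2%}")
```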
The expected accuracy is:
| Model | Single-GPU Baseline FP32 | Booster DDP with FP32 | Booster DDP with FP16 |
| --------- | ------------------------ | --------------------- | --------------------- |
| ResNet-18 | 85.85% | 85.03% | 85.12% |
**Note: the baseline is adapted from [this script](https://pytorch-tutorial.readthedocs.io/en/latest/tutorial/chapter03_intermediate/3_2_2_cnn_resnet_cifar10/) to use `torchvision.models.resnet18`.**