This example provides a training script and an evaluation script. The training script shows how to train ResNet on the CIFAR10 dataset from scratch.
- Training Arguments
  - `-p`, `--plugin`: Plugin to use. Choices: `torch_ddp`, `torch_ddp_fp16`, `low_level_zero`. Defaults to `torch_ddp`.
  - `-r`, `--resume`: Resume from checkpoint file path. Defaults to `-1`, which means not resuming.
  - `-c`, `--checkpoint`: The folder to save checkpoints. Defaults to `./checkpoint`.
  - `-i`, `--interval`: Epoch interval to save checkpoints. Defaults to `5`. If set to `0`, no checkpoint will be saved.
  - `--target_acc`: Target accuracy. An exception is raised if it is not reached. Defaults to `None`.
- Eval Arguments
  - `-e`, `--epoch`: The epoch to evaluate.
  - `-c`, `--checkpoint`: The folder where checkpoints are found.
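
The training arguments above could be declared roughly as follows. This is a minimal sketch built with the standard-library `argparse`, not the actual contents of `train.py`. The function names `build_train_parser` and `should_save_checkpoint` are hypothetical. Treating `--resume` as an integer is an assumption based on its default of `-1`, and epochs are assumed to be counted from zero.

```python
import argparse


def build_train_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented training arguments."""
    parser = argparse.ArgumentParser(description="Train ResNet on CIFAR10 (sketch)")
    parser.add_argument(
        "-p", "--plugin", type=str, default="torch_ddp",
        choices=["torch_ddp", "torch_ddp_fp16", "low_level_zero"],
        help="plugin to use",
    )
    # Assumption: an integer, because the documented default is -1 (no resuming).
    parser.add_argument("-r", "--resume", type=int, default=-1,
                        help="checkpoint to resume from; -1 means not resuming")
    parser.add_argument("-c", "--checkpoint", type=str, default="./checkpoint",
                        help="folder to save checkpoints")
    parser.add_argument("-i", "--interval", type=int, default=5,
                        help="epoch interval to save checkpoints; 0 disables saving")
    parser.add_argument("--target_acc", type=float, default=None,
                        help="target accuracy; an exception is raised if not reached")
    return parser


def should_save_checkpoint(epoch: int, interval: int) -> bool:
    """Return True when a checkpoint is due after the given zero-based epoch."""
    # An interval of 0 means no checkpoint is ever saved.
    return interval > 0 and (epoch + 1) % interval == 0


if __name__ == "__main__":
    args = build_train_parser().parse_args()
    print(args)
```

With the defaults, `should_save_checkpoint` is true after every fifth epoch, and passing `-i 0` turns saving off entirely.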
### Install requirements
```bash
pip install -r requirements.txt
```
### Train
The checkpoint folders will be created automatically.
```bash
# train with torch DDP with fp32
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp32
# train with torch DDP with mixed precision training
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp16 -p torch_ddp_fp16
# train with low level zero
colossalai run --nproc_per_node 2 train.py -c ./ckpt-low_level_zero -p low_level_zero
```

**Note: the baseline is adapted from this [script](https://pytorch-tutorial.readthedocs.io/en/latest/tutorial/chapter03_intermediate/3_2_2_cnn_resnet_cifar10/) to use `torchvision.models.resnet18`.**