# Train ResNet on CIFAR-10 from scratch

## 🚀 Quick Start

This example provides a training script (`train.py`) and an evaluation script (`eval.py`). The training script trains ResNet on the CIFAR-10 dataset from scratch.

- Training Arguments
  - `-p`, `--plugin`: Plugin to use. Choices: `torch_ddp`, `torch_ddp_fp16`, `low_level_zero`. Defaults to `torch_ddp`.
  - `-r`, `--resume`: Resume from checkpoint file path. Defaults to `-1`, which means not resuming.
  - `-c`, `--checkpoint`: The folder to save checkpoints. Defaults to `./checkpoint`.
  - `-i`, `--interval`: Epoch interval to save checkpoints. Defaults to `5`. If set to `0`, no checkpoint will be saved.
  - `--target_acc`: Target accuracy. An exception is raised if it is not reached. Defaults to `None`.

- Eval Arguments
  - `-e`, `--epoch`: The epoch to evaluate.
  - `-c`, `--checkpoint`: The folder where checkpoints are found.
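
To make the training arguments concrete, here is a minimal sketch of how they could be declared and interpreted with `argparse`. It is not the actual `train.py`. The helper names `build_train_parser`, `should_save`, and `check_target` are hypothetical, the integer type for `--resume` is an assumption based on its `-1` default, and epochs are assumed to be counted from 1.

```python
import argparse
from typing import Optional


def build_train_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented training arguments."""
    parser = argparse.ArgumentParser(description="Train ResNet on CIFAR-10")
    parser.add_argument(
        "-p", "--plugin", type=str, default="torch_ddp",
        choices=["torch_ddp", "torch_ddp_fp16", "low_level_zero"],
        help="plugin to use",
    )
    # Assumption: an int value, since the documented default is -1 (no resume).
    parser.add_argument("-r", "--resume", type=int, default=-1,
                        help="checkpoint to resume from; -1 disables resuming")
    parser.add_argument("-c", "--checkpoint", type=str, default="./checkpoint",
                        help="folder to save checkpoints")
    parser.add_argument("-i", "--interval", type=int, default=5,
                        help="epoch interval to save checkpoints; 0 disables saving")
    parser.add_argument("--target_acc", type=float, default=None,
                        help="target accuracy; an exception is raised if not reached")
    return parser


def should_save(epoch: int, interval: int) -> bool:
    """Return True if a checkpoint is due after `epoch` (assumed 1-based)."""
    return interval > 0 and epoch % interval == 0


def check_target(accuracy: float, target_acc: Optional[float]) -> None:
    """Raise if a target accuracy was requested and not reached."""
    if target_acc is not None and accuracy < target_acc:
        raise RuntimeError(
            f"accuracy {accuracy:.4f} is below the target {target_acc:.4f}"
        )


if __name__ == "__main__":
    args = build_train_parser().parse_args()
    print(args)
```

With the default interval of `5`, `should_save` is true for epochs 5, 10, 15, and so on; with `-i 0` it is never true, which matches the documented "no checkpoint will be saved" behavior.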

### Install requirements

```bash
pip install -r requirements.txt
```

### Train

```bash
# train with torch DDP with fp32
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp32

# train with torch DDP with mixed precision training
colossalai run --nproc_per_node 2 train.py -c ./ckpt-fp16 -p torch_ddp_fp16

# train with low level zero
colossalai run --nproc_per_node 2 train.py -c ./ckpt-low_level_zero -p low_level_zero
```

### Eval

```bash
# evaluate fp32 training
python eval.py -c ./ckpt-fp32 -e 80

# evaluate fp16 mixed precision training
python eval.py -c ./ckpt-fp16 -e 80

# evaluate low level zero training
python eval.py -c ./ckpt-low_level_zero -e 80
```

Expected accuracy:

| Model     | Single-GPU Baseline FP32 | Booster DDP with FP32 | Booster DDP with FP16 | Booster Low Level Zero |
| --------- | ------------------------ | --------------------- | --------------------- | ---------------------- |
| ResNet-18 | 85.85%                   | 84.91%                | 85.46%                | 84.50%                 |

**Note: the baseline is adapted from this [script](https://pytorch-tutorial.readthedocs.io/en/latest/tutorial/chapter03_intermediate/3_2_2_cnn_resnet_cifar10/) to use `torchvision.models.resnet18`.**
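
As a quick way to read the accuracy table, the sketch below computes each Booster configuration's gap to the single-GPU baseline in percentage points. The numbers are copied from the table; the names `BASELINE`, `RESULTS`, and `gap_to_baseline` are illustrative only.

```python
# Accuracies in percent, copied from the table above.
BASELINE = 85.85
RESULTS = {
    "Booster DDP with FP32": 84.91,
    "Booster DDP with FP16": 85.46,
    "Booster Low Level Zero": 84.50,
}


def gap_to_baseline(accuracy: float, baseline: float = BASELINE) -> float:
    """Difference to the baseline in percentage points, rounded to 2 decimals."""
    return round(accuracy - baseline, 2)


if __name__ == "__main__":
    for name, accuracy in RESULTS.items():
        print(f"{name}: {gap_to_baseline(accuracy):+.2f} pp")
```

All three configurations land within about 1.4 percentage points of the baseline.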