# Train ResNet34 on CIFAR10
## Prepare Dataset
In the script, we use the CIFAR10 dataset provided by the `torchvision` library. The code snippet is shown below:
```python
import os
from pathlib import Path

from torchvision import transforms
from torchvision.datasets import CIFAR10

train_dataset = CIFAR10(
    root=Path(os.environ['DATA']),
    download=True,
    transform=transforms.Compose(
        [
            transforms.RandomCrop(size=32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize(
                mean=[0.4914, 0.4822, 0.4465],
                std=[0.2023, 0.1994, 0.2010],
            ),
        ]
    )
)
```
First, specify where you want to store the CIFAR10 dataset by setting the environment variable `DATA`.
```bash
export DATA=/path/to/data
# example
# this will store the data in the current directory
export DATA=$PWD/data
```
The `torchvision` module will download the data automatically for you into the specified directory.
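If `DATA` is not set, `os.environ['DATA']` raises a `KeyError` with little context. A minimal sketch of a guard you could add before constructing the dataset (the helper name `get_data_root` is our own, not part of the example scripts):

```python
import os
from pathlib import Path


def get_data_root() -> Path:
    """Resolve the dataset root from the DATA environment variable.

    Fails fast with a clear message if DATA is unset, and creates
    the directory so torchvision can download into it.
    """
    data = os.environ.get('DATA')
    if data is None:
        raise RuntimeError(
            "Please set the DATA environment variable, "
            "e.g. `export DATA=$PWD/data`"
        )
    root = Path(data)
    root.mkdir(parents=True, exist_ok=True)
    return root
```

You would then pass `root=get_data_root()` to the `CIFAR10` constructor instead of reading the variable inline.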
## Run training
We provide two examples of training ResNet34 on the CIFAR10 dataset: one uses the engine and the other uses
the trainer. You can invoke the training script with the following commands. The batch size and learning rate
in the scripts are configured for a single GPU, so `nproc_per_node` is set to 1 below, meaning only one process
is launched. If you increase `nproc_per_node`, the global batch size grows proportionally, and you will have to
scale the learning rate accordingly.
```bash
# with engine
python -m torch.distributed.launch --nproc_per_node 1 run_resnet_cifar10_with_engine.py
# with trainer
python -m torch.distributed.launch --nproc_per_node 1 run_resnet_cifar10_with_trainer.py
```
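A common convention for adjusting the learning rate when the global batch size changes is the linear scaling rule: multiply the single-GPU learning rate by the number of processes. This is a sketch of that rule, not something the example scripts do automatically:

```python
def scale_lr(base_lr: float, nproc_per_node: int) -> float:
    """Linear scaling rule: the global batch size grows by a factor of
    nproc_per_node (one process per GPU, same per-GPU batch size), so the
    learning rate is scaled by the same factor."""
    return base_lr * nproc_per_node


# e.g. a base learning rate of 0.1 on 1 GPU becomes 0.4 on 4 GPUs
```

Whether linear scaling is appropriate depends on the optimizer and schedule; treat it as a starting point and validate against single-GPU results.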