# Train ResNet34 on CIFAR10
## Prepare Dataset
In the script, we use the CIFAR10 dataset provided by the torchvision library. The code snippet is shown below:

```python
import os
from pathlib import Path

from torchvision import transforms
from torchvision.datasets import CIFAR10

train_dataset = CIFAR10(
    root=Path(os.environ['DATA']),
    download=True,
    transform=transforms.Compose(
        [
            transforms.RandomCrop(size=32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                                 std=[0.2023, 0.1994, 0.2010]),
        ]
    )
)
```
First, you need to specify where you want to store your CIFAR10 dataset by setting the environment variable `DATA`.
```bash
export DATA=/path/to/data

# example
# this will store the data in the current directory
export DATA=$PWD/data
```
The `torchvision` module will download the data automatically for you into the specified directory.
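Once created, the dataset can be wrapped in a dataloader for training. The snippet below is a minimal sketch that continues from the snippet above and uses a plain PyTorch `DataLoader`; the batch size and worker count are illustrative assumptions, and the example scripts may construct their loaders differently (e.g. with a distributed sampler).

```python
from torch.utils.data import DataLoader

# A minimal sketch, not necessarily how the example scripts build their
# loaders. The batch size and num_workers are assumptions for illustration.
train_dataloader = DataLoader(
    train_dataset,
    batch_size=128,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
)
```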
## Run Training
We provide two examples of training ResNet34 on the CIFAR10 dataset: one with the engine and the other with the trainer. You can invoke the training script with the commands below. The batch size and learning rate are for a single GPU, so `nproc_per_node` is set to 1, meaning only one process is invoked. If you change `nproc_per_node`, you will have to adjust the learning rate accordingly, because the global batch size has changed (see the sketch after the commands below).
```bash
# with engine
python -m torch.distributed.launch --nproc_per_node 1 run_resnet_cifar10_with_engine.py

# with trainer
python -m torch.distributed.launch --nproc_per_node 1 run_resnet_cifar10_with_trainer.py
```
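As a hedged illustration of the adjustment mentioned above, the commonly used linear scaling rule grows the learning rate in proportion to the number of processes, since the global batch size grows by the same factor. The base values below are assumptions for illustration, not taken from `config.py`.

```python
# Hypothetical illustration of the linear scaling rule: with a fixed
# per-GPU batch size, the global batch size grows with the number of
# processes, so the learning rate is scaled by the same factor.
BASE_LR = 0.1          # assumed single-GPU learning rate
BASE_BATCH_SIZE = 128  # assumed per-GPU batch size

nproc_per_node = 4     # e.g. launching with --nproc_per_node 4

global_batch_size = BASE_BATCH_SIZE * nproc_per_node
scaled_lr = BASE_LR * nproc_per_node  # linear scaling rule

print(f"global batch size: {global_batch_size}, learning rate: {scaled_lr}")
```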