# Train ResNet34 on CIFAR10

## Prepare Dataset

In the script, we use the CIFAR10 dataset provided by the torchvision library. The code snippet is shown below:

```python
import os
from pathlib import Path

from torchvision import transforms
from torchvision.datasets import CIFAR10

train_dataset = CIFAR10(
    root=Path(os.environ['DATA']),
    download=True,
    transform=transforms.Compose(
        [
            transforms.RandomCrop(size=32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                                 std=[0.2023, 0.1994, 0.2010]),
        ]
    )
)
```
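For reference, the dataset can be wrapped in a standard PyTorch `DataLoader` before training. The sketch below uses illustrative values (the example scripts take their actual batch size from `config.py`); note that for multi-GPU training the loader would also need a distributed sampler so each process sees a distinct shard of the data.

```python
from torch.utils.data import DataLoader

# Minimal sketch: wrap the CIFAR10 dataset in a PyTorch DataLoader.
# batch_size=128 is illustrative, not the value fixed by the example scripts.
train_dataloader = DataLoader(
    train_dataset,
    batch_size=128,
    shuffle=True,
    num_workers=2,
    pin_memory=True,
)
```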

First, specify where you want to store the CIFAR10 dataset by setting the environment variable `DATA`.

```bash
export DATA=/path/to/data

# example
# this will store the data in the current directory
export DATA=$PWD/data
```

The torchvision module will automatically download the data into the specified directory for you.

## Run Training

We provide two examples of training ResNet34 on the CIFAR10 dataset: one uses the engine and the other uses the trainer. You can invoke the training scripts with the commands below. The batch size and learning rate are set for a single GPU, so `nproc_per_node` is 1 in the following commands, meaning only one process is invoked. If you increase `nproc_per_node`, you will have to adjust the learning rate accordingly, since the global batch size changes (see the sketch after the commands).

```bash
# with engine
python -m torch.distributed.launch --nproc_per_node 1 run_resnet_cifar10_with_engine.py

# with trainer
python -m torch.distributed.launch --nproc_per_node 1 run_resnet_cifar10_with_trainer.py
```
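A common rule of thumb for the adjustment mentioned above is to scale the learning rate linearly with the number of processes, since the global batch size is the per-GPU batch size times the world size. The snippet below is a hedged sketch of that heuristic, not part of the example scripts; `BASE_LR` is an illustrative value, not one taken from `config.py`.

```python
import torch.distributed as dist

# Hypothetical linear learning-rate scaling: global batch size grows
# with the number of processes, so the learning rate is commonly
# scaled by the same factor.
BASE_LR = 0.1  # illustrative learning rate tuned for a single GPU

world_size = dist.get_world_size() if dist.is_initialized() else 1
scaled_lr = BASE_LR * world_size
```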