Multi-dimensional Parallelism with Colossal-AI

🚀Quick Start

  1. Install our model zoo.
pip install titans
  2. Run with synthetic data of a similar shape to CIFAR10 using the -s flag.
colossalai run --nproc_per_node 4 train.py --config config.py -s
  3. Modify the config file to try different types of tensor parallelism, e.g. change the tensor parallel size to 4 and the mode to 2d, then run on 8 GPUs (see the config sketch below).
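
As a rough illustration of step 3, the parallel setting in config.py could be changed along these lines to use 2D tensor parallelism on 8 GPUs (pipeline parallel size 2 x tensor parallel size 4). This is a minimal sketch assuming the usual ColossalAI config layout; the exact fields and values in the shipped config.py may differ.

# Illustrative sketch only: switch to 2D tensor parallelism.
# 8 GPUs = pipeline parallel size 2 x tensor parallel size 4.
TENSOR_PARALLEL_SIZE = 4
TENSOR_PARALLEL_MODE = '2d'

parallel = dict(
    pipeline=2,
    tensor=dict(size=TENSOR_PARALLEL_SIZE, mode=TENSOR_PARALLEL_MODE),
)

With this setting, the launch command would use --nproc_per_node 8 instead of 4.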

Install Titans Model Zoo

pip install titans
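
If you want to verify the installation before launching training, a quick import check is enough; the only assumption here is that the installed package is named titans.

# Optional sanity check: confirm the titans model zoo is importable
# in the current Python environment.
import titans
print('titans installed at:', titans.__file__)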

Prepare Dataset

We use the CIFAR10 dataset in this example. You should invoke the download_cifar10.py script in the tutorial root directory, or directly run auto_parallel_with_resnet.py. The dataset will be downloaded to colossalai/examples/tutorials/data by default. If you wish to use a customized directory for the dataset, you can set the environment variable DATA via the following command.

export DATA=/path/to/data
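
For reference, a training script typically resolves this variable along the lines of the sketch below; the fallback path is an illustrative assumption, not necessarily what train.py does.

import os

# Resolve the dataset root from the DATA environment variable,
# falling back to a local ./data directory when it is unset.
DATA_ROOT = os.environ.get('DATA', './data')
print(f'CIFAR10 will be read from or downloaded to: {DATA_ROOT}')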

Run on 2*2 device mesh

The current configuration in config.py is TP=2, PP=2, i.e. tensor parallel size 2 and pipeline parallel size 2, which requires 4 GPUs.
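
This corresponds roughly to the parallel setting sketched below, assuming the standard ColossalAI config layout; check the shipped config.py for the exact fields and values.

# Sketch of the TP=2, PP=2 setting: 4 GPUs = pipeline size 2 x tensor size 2.
# The tensor parallel mode is assumed to be '1d' here.
parallel = dict(
    pipeline=2,
    tensor=dict(size=2, mode='1d'),
)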

# train with cifar10
colossalai run --nproc_per_node 4 train.py --config config.py

# train with synthetic data
colossalai run --nproc_per_node 4 train.py --config config.py -s
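
In the second command, the -s flag selects synthetic data. Such a flag could be handled roughly as in the hypothetical sketch below; the argument name and tensor shapes are assumptions for illustration, not a copy of train.py.

import argparse

import torch

# Hypothetical sketch: parse a -s/--synthetic flag and, when set, build one
# batch of random CIFAR10-like tensors (3x32x32 images, 10 classes) instead
# of loading the real dataset.
parser = argparse.ArgumentParser()
parser.add_argument('-s', '--synthetic', action='store_true',
                    help='use synthetic data instead of downloading CIFAR10')
args = parser.parse_args()

if args.synthetic:
    images = torch.randn(128, 3, 32, 32)
    labels = torch.randint(0, 10, (128,))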