Launch Colossal-AI

Author: Chuanrui Wang, Shenggui Li, Siqi Mai

Prerequisite:

Introduction

As mentioned in the prerequisite tutorials, you need to initialize the distributed environment for Colossal-AI once your config file is prepared. We call this process launch. In this tutorial, you will learn how to launch Colossal-AI on your server, whether it is a small one or a large one.

Colossal-AI provides several launch methods to initialize the distributed backend. In most cases, you can use colossalai.launch together with colossalai.get_default_parser to pass the parameters via the command line. If you use a launcher such as SLURM, OpenMPI, or the PyTorch launch utility, we also provide helper methods that read the rank and world size directly from the environment variables set by these launchers.

This tutorial covers the following ways to launch Colossal-AI and initialize the distributed backend:

Launch with colossalai.launch
Launch with Colossal-AI CLI
Launch with SLURM
Launch with OpenMPI

Launch Distributed Environment

In order to launch Colossal-AI, we need two types of arguments:

config file
distributed settings

The config file is always required regardless of the launch method, while the distributed settings vary with it. The config can be given as a path to a configuration file or as a Python dictionary. The distributed settings can be passed via the command line or supplied by a multi-process launcher.

Command Line Parser

Before we launch, we first need to understand which parameters are required for initialization. As stated in the Basic Concepts in Distributed Training section of Distributed Training, the important parameters are:

host
port
rank
world_size
backend

Colossal-AI provides a command line parser with these arguments already added. You can get this parser by calling colossalai.get_default_parser(). It is usually used together with colossalai.launch.

# add these lines in your train.py
import colossalai

# get default parser
parser = colossalai.get_default_parser()

# if you want to add your own arguments
parser.add_argument(...)

# parse arguments
args = parser.parse_args()

Then in your terminal, you can pass in these arguments:


python train.py --host <host> --rank <rank> --world_size <world_size> --port <port> --backend <backend>

backend is optional and the default value is nccl.
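To make the parser's role concrete, here is a minimal sketch of a parser that exposes the same five arguments using only the standard argparse module. The function name build_launch_parser is hypothetical and the argument types are assumptions; this is a stand-in for illustration, not the code behind colossalai.get_default_parser().

```python
import argparse


def build_launch_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the default parser (not the library's code)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", type=str, help="address of the master node")
    parser.add_argument("--port", type=int, help="port on the master node")
    parser.add_argument("--rank", type=int, help="global rank of this process")
    parser.add_argument("--world_size", type=int, help="total number of processes")
    # backend is optional and falls back to nccl, as described above
    parser.add_argument("--backend", type=str, default="nccl")
    return parser


# Simulates: python train.py --host 10.0.0.1 --port 29500 --rank 0 --world_size 4
args = build_launch_parser().parse_args(
    ["--host", "10.0.0.1", "--port", "29500", "--rank", "0", "--world_size", "4"]
)
print(args.host, args.port, args.rank, args.world_size, args.backend)
```

Because --backend is omitted in the simulated command, args.backend takes the default value nccl.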

Native Launch

To initialize the distributed environment, we provide a general colossalai.launch API. The colossalai.launch function takes the parameters listed above and creates a default process group in the communication network. It is often used with the default parser for convenience.

import colossalai

# parse arguments
args = colossalai.get_default_parser().parse_args()

# launch distributed environment
colossalai.launch(config=<CONFIG>,
                  rank=args.rank,
                  world_size=args.world_size,
                  host=args.host,
                  port=args.port,
                  backend=args.backend
)
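Since every process passes its own rank, a mistyped value can leave the job hanging while processes wait for each other. The sketch below shows one way to sanity-check the parsed values before calling colossalai.launch. The helper check_launch_args is hypothetical and is not part of Colossal-AI; the checks themselves are assumptions about what a consistent setup looks like.

```python
def check_launch_args(rank: int, world_size: int, host: str, port: int) -> None:
    """Hypothetical pre-flight check; raises ValueError on inconsistent settings."""
    if world_size < 1:
        raise ValueError(f"world_size must be at least 1, got {world_size}")
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size}), got {rank}")
    if not host:
        raise ValueError("host must be a non-empty hostname or IP address")
    if not 0 < port < 65536:
        raise ValueError(f"port must be in 1..65535, got {port}")


# A consistent setting passes silently.
check_launch_args(rank=0, world_size=4, host="10.0.0.1", port=29500)

# A rank outside the world size is rejected.
try:
    check_launch_args(rank=4, world_size=4, host="10.0.0.1", port=29500)
except ValueError as err:
    print("rejected:", err)
```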

Launch with Colossal-AI CLI

To make launching easy on both single and multiple nodes, we have implemented a launcher for Colossal-AI. This launcher wraps the torch distributed launch utility and adds the capability to launch multi-node jobs easily.

First, we need to set the launch method in our code. Because the CLI wraps the torch distributed launch utility, we use colossalai.launch_from_torch. The arguments required for the distributed environment, such as rank, world size, host and port, are all set by the PyTorch launcher and can be read directly from environment variables.

import colossalai

colossalai.launch_from_torch(
    config=<CONFIG>,
)
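For illustration, the sketch below reads the environment variables that the PyTorch launcher exports for each process (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT). The names TorchLaunchEnv and read_torch_env are hypothetical; this shows the idea behind reading settings from the environment and is not the implementation of colossalai.launch_from_torch.

```python
import os
from typing import Mapping, NamedTuple


class TorchLaunchEnv(NamedTuple):
    rank: int
    local_rank: int
    world_size: int
    host: str
    port: int


def read_torch_env(env: Mapping[str, str] = os.environ) -> TorchLaunchEnv:
    """Collect the distributed settings exported by the PyTorch launcher."""
    try:
        return TorchLaunchEnv(
            rank=int(env["RANK"]),
            local_rank=int(env["LOCAL_RANK"]),
            world_size=int(env["WORLD_SIZE"]),
            host=env["MASTER_ADDR"],
            port=int(env["MASTER_PORT"]),
        )
    except KeyError as missing:
        raise RuntimeError(
            f"{missing.args[0]} is not set; was the script started by the launcher?"
        ) from missing


# Simulated environment for the second of four processes on a single node.
fake_env = {
    "RANK": "1",
    "LOCAL_RANK": "1",
    "WORLD_SIZE": "4",
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
}
settings = read_torch_env(fake_env)
print(settings)
```

Passing a plain dictionary instead of os.environ makes the function easy to try out without actually starting a distributed job.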

Next, you can start multiple processes with colossalai run in your terminal. Below is an example that runs the code on a single node with 4 GPUs. You can change the number of GPUs with nproc_per_node and the default port with master_port.

# run on the local node with 4 GPUs (default port: 29500)
colossalai run --nproc_per_node 4 train.py

# run on the local node with 4 GPUs with a different port
colossalai run --nproc_per_node 4 --master_port 29505 test.py

If you are in a cluster and want to launch multi-node training, the CLI can help you start processes on different nodes with one simple command. There are two ways you can launch multi-node jobs.

Run with --host

This is suitable when you only have a few nodes. Suppose you have two nodes, namely host1 and host2. You can start multi-node training with the following command. Unlike single-node training, where master_addr is automatically set to localhost, multi-node training requires you to specify the master_addr option.

:::caution

master_addr cannot be localhost when running on multiple nodes; it should be the hostname or IP address of one of the nodes.

:::

# run on these two nodes
colossalai run --nproc_per_node 4 --host host1,host2 --master_addr host1 test.py
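The master_addr rule above can be expressed as a small check. The sketch below is a hypothetical helper, not the launcher's own code: it falls back to localhost for a single node and rejects a missing or loopback address when several hosts are given. Treating 127.0.0.1 the same as localhost is an assumption made for this example.

```python
from typing import List, Optional

# Assumption: both spellings of the loopback address are treated alike.
LOOPBACK_NAMES = {"localhost", "127.0.0.1"}


def resolve_master_addr(hosts: List[str], master_addr: Optional[str] = None) -> str:
    """Hypothetical helper that applies the master_addr rule described above."""
    if len(hosts) <= 1:
        # Single-node run: localhost is a valid default.
        return master_addr or "localhost"
    if master_addr is None:
        raise ValueError("master_addr is required when running on multiple nodes")
    if master_addr in LOOPBACK_NAMES:
        raise ValueError("master_addr must be a hostname or IP reachable from all nodes")
    return master_addr


print(resolve_master_addr(["host1"]))
print(resolve_master_addr(["host1", "host2"], master_addr="host1"))
```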

Run with --hostfile

This method is suitable when you have a lot of nodes. The host file is a simple text file listing the available nodes. The list of nodes is commonly provided by cluster managers such as SLURM and PBS Pro. For example, you can get the list of nodes allocated to you via the environment variable SLURM_NODELIST in SLURM and PBS_NODEFILE in PBS Pro. Just run echo $SLURM_NODELIST or cat $PBS_NODEFILE to check it out. If you do not have such a cluster manager, you can create the host file manually.

The host file given to the Colossal-AI launcher must list one host name per line, as in the following example.

host1
host2

With the host file ready, we can launch multi-node jobs with the commands shown further below. Just like with --host, you also need to specify the master_addr option. Some extra options are available with --hostfile:

--include: specify the hosts to include for multi-node jobs. For example, if your host file has 8 nodes but you only want to run on 6 of them, you can add --include host1,host2,host3,...,host6 so that the job will only be launched on those 6 nodes.
--exclude: specify the hosts to exclude for multi-node jobs. This is useful when some nodes are faulty. For example, if a GPU on host1 has problems and you want to run on every node except host1, you can add --exclude host1 so that the job will only be launched on the remaining nodes.

# run with a hostfile
colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1  test.py

# only include certain hosts to execute commands
# this is used to manually select nodes to run
colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1  --include host1 test.py

# exclude certain hosts to execute commands
# this can be used when certain nodes are faulty
colossalai run --nproc_per_node 4 --hostfile ./hostfile --master_addr host1  --exclude host2 test.py
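The host file format and the --include and --exclude options can be sketched in a few lines of Python. The functions parse_hostfile and select_hosts are hypothetical and only illustrate the selection logic. Skipping blank lines and duplicates, rejecting unknown host names, and refusing to combine both options are assumptions of this sketch rather than documented launcher behavior.

```python
from typing import List, Optional


def parse_hostfile(text: str) -> List[str]:
    """Return the host names in a host file that has one host per line."""
    hosts: List[str] = []
    for line in text.splitlines():
        name = line.strip()
        if name and name not in hosts:  # assumption: skip blanks and duplicates
            hosts.append(name)
    return hosts


def select_hosts(
    hosts: List[str],
    include: Optional[List[str]] = None,
    exclude: Optional[List[str]] = None,
) -> List[str]:
    """Keep only included hosts, or drop excluded ones, preserving file order."""
    if include and exclude:
        raise ValueError("use either include or exclude, not both")
    unknown = set(include or exclude or []) - set(hosts)
    if unknown:
        raise ValueError(f"hosts not in host file: {sorted(unknown)}")
    if include:
        return [h for h in hosts if h in include]
    if exclude:
        return [h for h in hosts if h not in exclude]
    return list(hosts)


hosts = parse_hostfile("host1\nhost2\n\nhost3\n")
print(select_hosts(hosts, include=["host1"]))
print(select_hosts(hosts, exclude=["host2"]))
```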

Launch with SLURM

If you are on a system managed by the SLURM scheduler, you can also rely on the srun launcher to kickstart your Colossal-AI scripts. We provide the helper function launch_from_slurm for compatibility with the SLURM scheduler. launch_from_slurm automatically reads the rank and world size from the environment variables SLURM_PROCID and SLURM_NPROCS respectively and uses them to start the distributed backend. Do this in your training script:

import colossalai

# host and port still come from the command line
args = colossalai.get_default_parser().parse_args()

colossalai.launch_from_slurm(
    config=<CONFIG>,
    host=args.host,
    port=args.port
)

You can then initialize the distributed environment with this command in the terminal.

srun python train.py --host <master_node> --port 29500
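As a rough illustration of where the rank and world size come from under srun, the sketch below reads SLURM_PROCID and SLURM_NPROCS and fails with a clear message when they are absent. The function read_slurm_rank is hypothetical and is not the code inside launch_from_slurm.

```python
import os
from typing import Mapping, Tuple


def read_slurm_rank(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Return (rank, world_size) from the variables srun sets for each task."""
    missing = [name for name in ("SLURM_PROCID", "SLURM_NPROCS") if name not in env]
    if missing:
        raise RuntimeError(f"not running under srun? missing: {', '.join(missing)}")
    return int(env["SLURM_PROCID"]), int(env["SLURM_NPROCS"])


# Simulated environment for the third of eight tasks.
rank, world_size = read_slurm_rank({"SLURM_PROCID": "2", "SLURM_NPROCS": "8"})
print(rank, world_size)
```

The host and port are not part of this environment, which is why the script above still takes them from the command line.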

Launch with OpenMPI

If you are more familiar with OpenMPI, you can use launch_from_openmpi instead. launch_from_openmpi automatically reads the local rank, global rank and world size from the environment variables OMPI_COMM_WORLD_LOCAL_RANK, OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE respectively and uses them to start the distributed backend.

Do this in your train.py:

import colossalai

# host and port still come from the command line
args = colossalai.get_default_parser().parse_args()

colossalai.launch_from_openmpi(
    config=<CONFIG>,
    host=args.host,
    port=args.port
)

A sample command to launch multiple processes with OpenMPI would be:

mpirun --hostfile <my_hostfile> -np <num_process> python train.py --host <node name or ip> --port 29500

--hostfile: use this option to specify a list of hosts on which to run
-np: set the total number of processes (GPUs) to launch. For example, with -np 4, 4 python processes will be started to run train.py.
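To show how the three OpenMPI variables relate, here is a minimal sketch that collects them into a dictionary. The names OMPI_VARIABLES and read_openmpi_env are hypothetical, and the simulated values assume 4 processes placed two per node; this is not the code inside launch_from_openmpi.

```python
import os
from typing import Dict, Mapping

# Environment variables that mpirun exports for every process it starts.
OMPI_VARIABLES = {
    "local_rank": "OMPI_COMM_WORLD_LOCAL_RANK",
    "rank": "OMPI_COMM_WORLD_RANK",
    "world_size": "OMPI_COMM_WORLD_SIZE",
}


def read_openmpi_env(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Return the local rank, global rank and world size as integers."""
    return {field: int(env[name]) for field, name in OMPI_VARIABLES.items()}


# Simulated values for the last of 4 processes, assuming two processes per node:
# its global rank is 3, while its rank within its own node is 1.
fake_env = {
    "OMPI_COMM_WORLD_LOCAL_RANK": "1",
    "OMPI_COMM_WORLD_RANK": "3",
    "OMPI_COMM_WORLD_SIZE": "4",
}
print(read_openmpi_env(fake_env))
```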
