# Sequence Parallelism
## Table of contents
- [Sequence Parallelism](#sequence-parallelism)
- [Table of contents](#table-of-contents)
- [📚 Overview](#-overview)
- [🚀 Quick Start](#-quick-start)
- [🏎 How to Train with Sequence Parallelism](#-how-to-train-with-sequence-parallelism)
- [Step 1. Configure your parameters](#step-1-configure-your-parameters)
- [Step 2. Invoke parallel training](#step-2-invoke-parallel-training)
## 📚 Overview
In this tutorial, we implement BERT with sequence parallelism. Sequence parallelism splits the input tensor and intermediate
activations along the sequence dimension. This method achieves better memory efficiency and allows us to train with larger batch sizes and longer sequence lengths.
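As a rough illustration (a sketch in plain PyTorch, not code from this example), splitting an activation of shape `(batch, seq_len, hidden)` across 4 sequence-parallel ranks looks like this:

```python
# Illustrative sketch only: how an activation is partitioned along the
# sequence dimension; the real implementation distributes the chunks across
# ranks rather than chunking locally.
import torch

batch, seq_len, hidden = 8, 1024, 768
world_size = 4                                 # number of sequence-parallel ranks

x = torch.randn(batch, seq_len, hidden)        # full activation
chunks = torch.chunk(x, world_size, dim=1)     # split along the sequence dimension

# each rank holds only its local chunk of the sequence
print(chunks[0].shape)                         # torch.Size([8, 256, 768])
```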
Paper: [Sequence Parallelism: Long Sequence Training from System Perspective](https://arxiv.org/abs/2105.13120)
## 🚀 Quick Start
1. Install PyTorch.
2. Install the dependencies.
```bash
pip install -r requirements.txt
```
3. Run the example with the following command:
```bash
export PYTHONPATH=$PWD
# run with synthetic dataset
colossalai run --nproc_per_node 4 train.py
```
> The default config uses sequence parallel size = 2 and pipeline size = 1. Try changing the pipeline size to 2 and running it again.
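> In the sketch below (the `parallel` field names are assumptions based on a common ColossalAI config convention, so check the provided `config.py` for the authoritative names), that change would look roughly like this:

```python
# Hypothetical excerpt from config.py -- not copied from this example's file.
parallel = dict(
    pipeline=2,                           # change pipeline size from 1 to 2
    tensor=dict(size=2, mode='sequence'), # keep sequence parallel size = 2
)
# With pipeline size 2 and sequence parallel size 2, the launch command above
# still matches: 2 x 2 = 4 processes (--nproc_per_node 4).
```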
## 🏎 How to Train with Sequence Parallelism
We provide `train.py` for you to execute training. Before invoking the script, there are several
steps to perform.
### Step 1. Configure your parameters
The provided `config.py` defines a set of parameters, including the training scheme, model configuration, etc.
You can also modify the ColossalAI settings. For example, if you wish to parallelize over the
sequence dimension on 8 GPUs, you can change `size=4` to `size=8`. If you wish to use pipeline parallelism, you can set `pipeline=<num_of_pipeline_stages>`.
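For reference, a minimal sketch of what such a config might look like is shown below; the field names follow a common ColossalAI config convention and are assumptions rather than the exact contents of the provided `config.py`:

```python
# Illustrative config.py sketch -- parameter names and values are assumptions;
# consult the provided config.py for the authoritative settings.

# training scheme (placeholder values)
BATCH_SIZE = 64
SEQ_LENGTH = 512
NUM_EPOCHS = 10

# parallelize over the sequence dimension on 8 GPUs, with no pipeline stages
parallel = dict(
    pipeline=1,                           # number of pipeline stages
    tensor=dict(size=8, mode='sequence'), # sequence parallel size
)
```

Whatever values you choose, the number of processes you launch must be divisible by the product of the pipeline size and the sequence parallel size.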
### Step 2. Invoke parallel training
Lastly, you can start training with sequence parallelism. How you invoke `train.py` depends on your
machine setup.
- If you are using a single machine with multiple GPUs, the `colossalai run` launch utility can easily
start your script for you. A sample command is shown below:
```bash
colossalai run --nproc_per_node <num_gpus_on_this_machine> --master_addr localhost --master_port 29500 train.py
```
- If you are using multiple machines with multiple GPUs, we suggest you use `colossalai.launch_from_slurm`
or `colossalai.launch_from_openmpi`, as it is easier to rely on SLURM and OpenMPI
to start multiple processes over multiple nodes (see the sketch below). If you have your own launcher, you can fall back
to the default `colossalai.launch` function.
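As a rough illustration (not code from this example's `train.py`, and the exact function signature may vary across ColossalAI versions), initializing from a SLURM allocation typically looks like this:

```python
# Hypothetical initialization snippet -- the provided train.py may already
# handle this; the host and port values below are placeholders.
import colossalai

colossalai.launch_from_slurm(
    config='./config.py',        # the config file described in Step 1
    host='<hostname_of_rank0>',  # placeholder: address of the master node
    port=29500,                  # placeholder: a free TCP port on the master
)
# After launch, the distributed process groups are created according to the
# `parallel` setting in the config, and training can proceed as usual.
```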