Sequence Parallelism

Table of contents

  • 📚 Overview
  • 🚀 Quick Start
  • 🏎 How to Train with Sequence Parallelism

📚 Overview

In this tutorial, we implement BERT with sequence parallelism. Sequence parallelism splits the input tensor and the intermediate activations along the sequence dimension. This improves memory efficiency and allows training with a larger batch size and a longer sequence length.

Paper: Sequence Parallelism: Long Sequence Training from System Perspective
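
To make the idea concrete, here is a toy PyTorch sketch (not taken from this tutorial's code) of what keeping only a local slice of the sequence dimension looks like on one rank:

    import torch

    # Toy illustration: with 2-way sequence parallelism, each rank holds only
    # half of the sequence dimension of the activation instead of the full tensor.
    batch_size, seq_len, hidden = 4, 512, 1024
    world_size, rank = 2, 0                               # pretend we are rank 0 of a 2-way group

    x = torch.randn(batch_size, seq_len, hidden)          # full activation: (B, S, H)
    x_local = torch.chunk(x, world_size, dim=1)[rank]     # local shard: (B, S / world_size, H)
    print(x_local.shape)                                  # torch.Size([4, 256, 1024])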

🚀 Quick Start

  1. Install PyTorch

  2. Install the dependencies.

pip install -r requirements.txt
  3. Run with the following command
export PYTHONPATH=$PWD

# run with synthetic dataset
colossalai run --nproc_per_node 4 train.py

The default config uses sequence parallel size = 2 and pipeline size = 1. Try changing the pipeline size to 2 and running again; a sketch of the edit is shown below.
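
The pipeline size lives in config.py; a minimal sketch of the change, assuming the legacy ColossalAI parallel-dict format (check config.py for the actual field names):

    # Illustrative excerpt of config.py; the real field names may differ.
    parallel = dict(
        pipeline=2,                            # was 1 by default
        tensor=dict(size=2, mode="sequence"),  # sequence parallel size stays at 2
    )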

🏎 How to Train with Sequence Parallelism

We provide train.py for you to run training. Before invoking the script, there are a few steps to complete.

Step 1. Configure your parameters

The provided config.py defines a set of parameters covering the training scheme, the model, and so on. You can also modify the ColossalAI setting. For example, if you wish to parallelize over the sequence dimension on 8 GPUs, you can change size=4 to size=8. If you wish to use pipeline parallelism, you can set pipeline=<num_of_pipeline_stages>.
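
For orientation, here is a hedged sketch of what config.py might contain (the names below are illustrative assumptions, not guaranteed to match the shipped file):

    # Hypothetical sketch of config.py; compare with the actual file in this directory.

    # training scheme (assumed names)
    TRAIN_ITERS = 10
    GLOBAL_BATCH_SIZE = 32

    # model (assumed names)
    SEQ_LENGTH = 512
    HIDDEN_SIZE = 768

    # ColossalAI parallel setting: 8-way sequence parallelism, no pipeline parallelism
    parallel = dict(
        pipeline=1,                            # set pipeline=<num_of_pipeline_stages> to enable pipeline parallelism
        tensor=dict(size=8, mode="sequence"),  # change size=4 to size=8 to use 8 GPUs
    )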

Step 2. Invoke parallel training

Lastly, you can start training with sequence parallelism. How you invoke train.py depends on your machine setting.

  • If you are using a single machine with multiple GPUs, the PyTorch launch utility (wrapped by colossalai run) can easily start your script. A sample command is shown below:

      colossalai run --nproc_per_node <num_gpus_on_this_machine> --master_addr localhost --master_port 29500 train.py
    
  • If you are using multiple machines with multiple GPUs, we suggest that you use colossalai.launch_from_slurm or colossalai.launch_from_openmpi, since SLURM and OpenMPI make it easier to start processes across multiple nodes. If you have your own launcher, you can fall back to the generic colossalai.launch function. A sketch of these launchers follows below.
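
For reference, a hedged sketch of how these launchers might be called inside a training script (argument names follow the legacy colossalai API this example targets; verify them against your installed version):

    import colossalai

    # On a SLURM cluster (inside an allocation): rank and world size are read
    # from the SLURM environment variables. "node001" and the port are placeholders.
    colossalai.launch_from_slurm(config="./config.py", host="node001", port=29500)

    # With OpenMPI instead, process information comes from the MPI environment:
    # colossalai.launch_from_openmpi(config="./config.py", host="node001", port=29500)

    # With a custom launcher, pass the process information explicitly:
    # colossalai.launch(config="./config.py", rank=rank, world_size=world_size,
    #                   host="node001", port=29500, backend="nccl")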