# Sequence Parallelism with BERT

In this example, we implement BERT with sequence parallelism. Sequence parallelism splits the input tensor and intermediate
activations along the sequence dimension. This improves memory efficiency and allows us to train with larger batch sizes and longer sequences.

Paper: [Sequence Parallelism: Long Sequence Training from System Perspective](https://arxiv.org/abs/2105.13120)
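
To make the idea of splitting along the sequence dimension concrete, here is a minimal sketch in plain Python. It is not the implementation used in this example: the helper name `split_sequence` and the requirement that the sequence length be divisible by the number of devices are assumptions made for illustration.

```python
def split_sequence(token_ids, world_size):
    """Split one sequence into equal contiguous chunks, one per rank."""
    # Assumption of this sketch: the length divides evenly across ranks.
    if len(token_ids) % world_size != 0:
        raise ValueError("sequence length must be divisible by world_size")
    chunk = len(token_ids) // world_size
    return [token_ids[r * chunk:(r + 1) * chunk] for r in range(world_size)]


# A toy "sequence" of 8 token positions split across 4 devices.
sequence = list(range(8))
chunks = split_sequence(sequence, world_size=4)
for rank, sub_sequence in enumerate(chunks):
    print(f"rank {rank} holds positions {sub_sequence}")
```

Each rank only stores its own sub-sequence, so the per-device activation size along the sequence dimension shrinks by a factor of `world_size`.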

## 🚀Quick Start
1. Run with the following command
```bash
export PYTHONPATH=$PWD
colossalai run --nproc_per_node 4 train.py -s
```
2. The default config uses a sequence parallel size of 2 and a pipeline size of 1. Change the pipeline size to 2 and run it again.


## How to Prepare the Wikipedia Dataset

First, let's prepare the Wikipedia dataset from scratch. To generate a preprocessed dataset, we need four items:
1. the raw Wikipedia dataset
2. a Wikipedia extractor (extracts data from the raw dataset)
3. a vocabulary file
4. preprocessing scripts (generate the final data from the extracted data)

We thank Megatron-LM for providing the preprocessing script used to generate the final dataset.

```bash
# download raw data
mkdir data && cd ./data
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# install wiki extractor
git clone https://github.com/FrankLeeeee/wikiextractor.git
pip install ./wikiextractor

# extract text from the raw dump
wikiextractor --json enwiki-latest-pages-articles.xml.bz2
cat text/*/* > ./corpus.json
cd ..

# download vocab file
mkdir vocab && cd ./vocab
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
cd ..

# preprocess some data
git clone https://github.com/NVIDIA/Megatron-LM.git
cd ./Megatron-LM
python tools/preprocess_data.py \
    --input ../data/corpus.json \
    --output-prefix my-bert \
    --vocab ../vocab/bert-large-uncased-vocab.txt \
    --dataset-impl mmap \
    --tokenizer-type BertWordPieceLowerCase \
    --split-sentences \
    --workers 24
```

After running the preprocessing scripts, you will obtain two files:
1. my-bert_text_sentence.bin
2. my-bert_text_sentence.idx

If you encounter an `index out of range` error when running Megatron's script,
it is probably because a sentence starts with a punctuation mark and cannot be tokenized. A workaround is to update the `Encoder.encode` method with the code below:

```python
import json
import string  # needed for string.punctuation in encode()


class Encoder(object):
    def __init__(self, args):
        ...

    def initializer(self):
        ...

    def encode(self, json_line):
        data = json.loads(json_line)
        ids = {}
        for key in self.args.json_keys:
            text = data[key]
            doc_ids = []

            # lsg: avoid sentences which start with a punctuation
            # as it cannot be tokenized by splitter
            if len(text) > 0 and text[0] in string.punctuation:
                text = text[1:]

            for sentence in Encoder.splitter.tokenize(text):
                sentence_ids = Encoder.tokenizer.tokenize(sentence)
                if len(sentence_ids) > 0:
                    doc_ids.append(sentence_ids)
            if len(doc_ids) > 0 and self.args.append_eod:
                doc_ids[-1].append(Encoder.tokenizer.eod)
            ids[key] = doc_ids
        return ids, len(json_line)
```
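
The core of the workaround can be tried in isolation. The sketch below pulls the punctuation check into a standalone function; the name `strip_leading_punctuation` is hypothetical and is not part of Megatron-LM.

```python
import string


def strip_leading_punctuation(text):
    """Drop the first character when it is ASCII punctuation."""
    if len(text) > 0 and text[0] in string.punctuation:
        return text[1:]
    return text


print(strip_leading_punctuation('"Hello," she said.'))
print(strip_leading_punctuation("Plain text."))
print(repr(strip_leading_punctuation("")))
```

Note that, like the patch above, this removes only a single leading character, so text that begins with several punctuation marks still starts with punctuation afterwards.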

## How to Train with Sequence Parallelism

We provide `train.py` to run training. Before invoking the script, complete the
following steps.

### Step 1. Set data path and vocab path

At the top of `config.py`, you can see two global variables `DATA_PATH` and `VOCAB_FILE_PATH`.

```python
DATA_PATH = <data-path>
VOCAB_FILE_PATH = <vocab-path>
```

`DATA_PATH` refers to the path to the data file generated by Megatron's script. For example, in the section above, you should get two data files (`my-bert_text_sentence.bin` and `my-bert_text_sentence.idx`). You just need to set `DATA_PATH` to the path to the bin file without the file extension.

For example, if your `my-bert_text_sentence.bin` is at `/home/Megatron-LM/my-bert_text_sentence.bin`, then you should set

```python
DATA_PATH = '/home/Megatron-LM/my-bert_text_sentence'
```

`VOCAB_FILE_PATH` refers to the path to the vocabulary file downloaded when you prepared the dataset
(e.g. bert-large-uncased-vocab.txt).
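
Before starting training, it can help to confirm that both paths point at real files. The following is a minimal sketch; the helper `check_paths` and the vocabulary location used in the usage lines are hypothetical and should be replaced with your own values.

```python
import os


def check_paths(data_path, vocab_file_path):
    """Return the expected files that are missing on disk."""
    # DATA_PATH has no extension, so the .bin and .idx files are derived from it.
    expected = [data_path + ".bin", data_path + ".idx", vocab_file_path]
    return [path for path in expected if not os.path.isfile(path)]


# Example values; the vocab location is a placeholder for illustration.
missing = check_paths(
    "/home/Megatron-LM/my-bert_text_sentence",
    "/home/vocab/bert-large-uncased-vocab.txt",
)
if missing:
    print("Missing files:", missing)
else:
    print("All dataset files found.")
```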

### Step 2. Make Dataset Helper

Build the BERT dataset helper. The requirements are `CUDA`, `g++`, `pybind11` and `make`.

```bash
cd ./data/datasets
make
```

### Step 3. Configure your parameters

The provided `config.py` defines a set of parameters, including the training scheme, the model, etc.
You can also modify the ColossalAI settings. For example, if you wish to parallelize over the
sequence dimension on 8 GPUs, you can change `size=4` to `size=8`. If you wish to use pipeline parallelism, you can set `pipeline=<num_of_pipeline_stages>`.
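
As a rough illustration of how these two settings interact with the number of GPUs, consider the sketch below. The layout of the `parallel` dict (in particular the `tensor` key and `mode="sequence"`) is an assumption and should be checked against the actual `config.py`. The helper `data_parallel_size` is hypothetical and assumes that the total number of GPUs equals pipeline size × sequence parallel size × data parallel size.

```python
# Hypothetical excerpt of config.py; verify the exact layout in the real file.
parallel = dict(pipeline=1, tensor=dict(size=4, mode="sequence"))


def data_parallel_size(world_size, parallel_cfg):
    """GPUs left for data parallelism after pipeline and sequence splits."""
    model_parallel = parallel_cfg["pipeline"] * parallel_cfg["tensor"]["size"]
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"pipeline * sequence size = {model_parallel}"
        )
    return world_size // model_parallel


# 8 GPUs with pipeline=1 and sequence size=4 leave 2 data-parallel replicas.
print(data_parallel_size(8, parallel))
```

Under the same assumption, the Quick Start setting (4 processes, sequence parallel size 2, pipeline size 1) leaves a data parallel size of 2, and raising the pipeline size to 2 reduces it to 1.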

### Step 4. Invoke parallel training

Lastly, you can start training with sequence parallelism. How you invoke `train.py` depends on your
machine setup.

- If you are using a single machine with multiple GPUs, you can start the script with the
  `colossalai run` launcher. A sample command is shown below:

  ```bash
    colossalai run --nproc_per_node <num_gpus_on_this_machine> --master_addr localhost --master_port 29500 train.py
  ```

- If you are using multiple machines with multiple GPUs, we suggest that you refer to
  `colossalai.launch_from_slurm` or `colossalai.launch_from_openmpi`, as it is easier to use SLURM or OpenMPI
  to start multiple processes over multiple nodes. If you have your own launcher, you can fall back
  to the default `colossalai.launch` function.
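
If you prefer to start the single-machine run from a Python script, the sample command can be assembled as an argument list. This is only a sketch: the helper `build_launch_command` is hypothetical, and it uses only the flags shown in the sample command above.

```python
import shlex


def build_launch_command(num_gpus, script="train.py",
                         master_addr="localhost", master_port=29500):
    """Build the single-machine `colossalai run` command as an argument list."""
    return [
        "colossalai", "run",
        "--nproc_per_node", str(num_gpus),
        "--master_addr", master_addr,
        "--master_port", str(master_port),
        script,
    ]


command = build_launch_command(4)
# Print the command in a shell-safe form for inspection.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the printed command looks correct.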