Run GPT With Colossal-AI
Overview
In Colossal-AI, there are many ways to run GPT in a distributed manner. The train_gpt.py script runs training with the configuration files in gpt2_configs/, each of which corresponds to a different parallelism strategy for GPT-2. We provide several example GPT-2 configuration files, and you can modify them to fit your own use case.
How to Prepare Webtext Dataset
We do not host any datasets for GPT or BERT training; however, we provide a detailed guide on how to prepare the dataset so that our results can be reproduced.
Overview
We use the publicly available OpenWebText library by jcpeterson, together with eukaryote31's work, to download URLs of different web pages. We then filter, clean, and deduplicate all downloaded content according to the procedure described in the following sections.
Install necessary packages
Note: LSH requires an older version of GCC. We have verified that GCC 9.3.0 works, but 10.3.0 does not.
pip install ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 newspaper3k htmlmin tldextract cached-path
git clone https://github.com/mattilyra/LSH.git
cd LSH
python setup.py install
If the installation fails, you can try replacing the cMinhash.cpp in LSH/lsh with ours, which is provided in tools/lsh/cMinhash.cpp.
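To quickly check that the package built correctly, you can run a minimal import test like the sketch below. This is only an assumption-based example: the module names follow the mattilyra/LSH project layout and may differ in other versions.

```python
# Minimal smoke test for the LSH installation.
# The module names below assume the mattilyra/LSH package layout.
from lsh import cache, minhash

print("LSH modules loaded:", minhash.__name__, cache.__name__)
```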
Download Data
- Download the deduplicated URLs from jcpeterson.
- Unzip the zip file and you will get a folder URLs, which consists of many txt files of URLs.
- Remove blacklisted URLs. We appreciate Megatron-LM for making its data preprocessing code public. We have forked Megatron-LM and fixed some bugs; for your convenience, the needed files are collated in tools/Megatron. See the Megatron-LM repository for the original source code.

  cd path/to/tools
  python Megatron/blacklist_urls.py <path/to/URLs> <path/to/clean_urls.txt>
- Download the content from the clean URLs and merge the contents into one loose JSON file, with one JSON object per line in the format {'text': text, 'url': unique_url} (a short reading sketch follows after this list). We have forked and modified openwebtext because it contains some bugs; for your convenience, our modified version is provided in tools/download.

  python download/download.py <path/to/clean_urls.txt> --n_procs 50 --output <path/to/raw.json>
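For reference, here is a minimal sketch of how the resulting loose-JSON file can be read back in Python. The file name and the helper function are purely illustrative; they are not part of the provided tools.

```python
import json

def iter_documents(path):
    """Yield one document dict per line of a loose-JSON file,
    where each line looks like {"text": ..., "url": ...}."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example: print the URL of the first downloaded document.
if __name__ == "__main__":
    docs = iter_documents("raw.json")  # the file produced by download.py
    print(next(docs)["url"])
```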
Prepare Data for GPT Training
- Perform ftfy, English detection, and remove documents with fewer than 128 tokens. This step can be sharded and run on shards.

  python Megatron/cleanup_dataset.py <path/to/raw.json> <path/to/clean.json>

  Additional cleanup (e.g. removing documents with fewer than 512 characters, or dataset-specific cleaning for the stories and realnews datasets) can be done using cleanup_fix_dataset.py. More details can be found by running python cleanup_fix_dataset.py --help.
- Using LSH, find possible duplicates and store them in a file for later processing. The code supports saving and loading fingerprints for recurrent deduplication, and is also multithreaded for faster processing. More details can be found by running python find_duplicates.py --help.

  python Megatron/find_duplicates.py --inputs <path/to/clean.json> url --output <path/to/process_stage_one.json>
- Based on the similarity measure defined inside the function is_similar (default threshold: 0.9), group URLs that are similar. For each group, keep only one URL and remove the rest.

  python Megatron/group_duplicate_url.py <path/to/process_stage_one.json> <path/to/process_stage_two.json>
- Remove the similar documents detected in the last step. dedup.json is the data after deduplication.

  python Megatron/remove_group_duplicates.py <path/to/process_stage_two.json> <path/to/clean.json> <path/to/dedup.json>
- Shuffle the dataset.

  shuf <path/to/dedup.json> -o <path/to/train_data.json>
How to Prepare Yuan Dataset
Overview
The Yuan dataset is a large-scale dataset with 1 TB of high-quality Chinese text, released by Inspur. You can apply at https://air.inspur.com/home to get access to the dataset. The following sections describe how to download and load the data.
Download
The dataset can be downloaded from the website once your application is approved.
You also need to download the vocab file from https://github.com/Shawn-Inspur/Yuan-1.0/blob/main/src/vocab.txt
The final data directory should be organized as follows:
|--dataset
| |--001.txt
| |--002.txt
| |--...
|--vocab.txt
Process & Load
Before you run the code, you should replace line 44 in train_gpt.py with:
from dataset.yuan import YuanDataset
train_ds = YuanDataset(os.environ['DATA'], vocab_path='/path/to/data/vocab.txt', seq_len=gpc.config.SEQ_LEN)
Then you can run the model following the Usage section. The dataset will be processed and cached the first time you run it; after that, the data is loaded automatically.
Usage
#!/usr/bin/env sh
export DATA=/path/to/train_data.json
colossalai run --nproc_per_node=<num_gpus> train_gpt.py --config=gpt2_configs/<config_file>
You can copy it and save it as run.sh, then run bash ./run.sh in your terminal.
Please replace DATA, num_gpus and config_file with the path to your dataset, the number of GPUs, and the config file path, respectively.
If you are going to train GPT-3, just replace gpt2_configs with gpt3_configs.
GPT-2
Here are the default parameters of the GPT-2 configs:
config | scale | GPUs* | batch size | memory per GPU (MiB) | TP | PP | DP |
---|---|---|---|---|---|---|---|
gpt2-vanilla | small | 1 | 1 | 6071 | 1 | 1 | 1 |
gpt2-vanilla | small | 2 | 1 | 6449*2 | 1 | 1 | 2 |
gpt2-1d | small | 2 | 1 | 5287*2 | 2 | 1 | 1 |
gpt2-2d | small | 4 | 1 | 4590*4 | 4 | 1 | 1 |
gpt2-2.5d | small | 8 | 1 | 4815*8 | 8 | 1 | 1 |
gpt2-3d | small | 8 | 1 | 4901*8 | 8 | 1 | 1 |
gpt2-pp | small | 2 | 1 | 5877*2 | 1 | 2 | 1 |
gpt2-zero2 | small | 1 | 1 | 5459 | 1 | 1 | 1 |
gpt2-zero3 | small | 1 | 1 | 6577 | 1 | 1 | 1 |
gpt2-nvme | small | 1 | 1 | 5067 | 1 | 1 | 1 |
gpt2-pp1d | small | 8 | 8 | 5411*8 | 2 | 2 | 2 |
*Note: For GPUs, we use Nvidia A100 80G.
*Note: Results of ZeRO are outdated; we will update them soon.
We set TENSOR_PARALLEL, PIPELINE_PARALLEL, and DATA_PARALLEL as small as possible so that every demo can run with the fewest GPUs.
Modify the config file
General
There are some general rules when modifying the config files.
TP denotes Tensor Parallel
PP denotes Pipeline Parallel
DP denotes Data Parallel
GPUS = TP * PP * DP, where DP is set automatically
You can set the batch size and the number of epochs by changing the values of BATCH_SIZE and NUM_EPOCHS, respectively. Next, we introduce the config file of each mode.
Please note that gpt2_zero3.py exposes only BATCH_SIZE and NUM_EPOCHS for modification.
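As a concrete illustration of these rules, a minimal config containing only the general options might look like the sketch below. This is an assumption-based example, not one of the shipped config files; the variable names are the ones used throughout this README.

```python
# Hypothetical minimal config in the style of gpt2_configs/*.py.
BATCH_SIZE = 1    # batch size per training step
NUM_EPOCHS = 60   # number of training epochs

# With no tensor or pipeline parallelism configured, TP = PP = 1, so
# DP = (number of GPUs) / (TP * PP) is set automatically by Colossal-AI.
```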
Vanilla & Data Parallel
Vanilla is the basic mode of GPT-2 with no parallelism at all. However, if you use more than one GPU and TP * PP is less than the number of GPUs, Colossal-AI will set DP for you automatically.
1D, 2D, 2.5D, 3D
In the files gpt2_1d.py, gpt2_2d.py, gpt2_2p5d.py, and gpt2_3d.py, there is a line:
TENSOR_PARALLEL = 2
You can modify it to use a higher degree of tensor parallelism, as long as the general rules above are satisfied. In particular, TENSOR_PARALLEL should be a square number for 2D and a cubic number for 3D, and TENSOR_PARALLEL / DEPTH should be a square number for 2.5D.
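For example, the following values satisfy the constraints above. This is an illustrative sketch only; the shipped configs use the defaults listed in the table.

```python
# Example TENSOR_PARALLEL values that satisfy the constraints above.

# 1D: any positive integer works.
# TENSOR_PARALLEL = 2

# 2D: must be a square number, e.g. 4 = 2 * 2.
# TENSOR_PARALLEL = 4

# 2.5D: TENSOR_PARALLEL / DEPTH must be a square number,
# e.g. TENSOR_PARALLEL = 8 with DEPTH = 2 gives 8 / 2 = 4 = 2 * 2.
# TENSOR_PARALLEL = 8
# DEPTH = 2

# 3D: must be a cubic number, e.g. 8 = 2 * 2 * 2.
TENSOR_PARALLEL = 8
```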
Pipeline Parallel
To use pipeline parallel training, you should install colossalai from the latest main branch.
In gpt2_pp.py, there are the following lines:
# BATCH_SIZE / NUM_MICRO_BATCHES should be an integer
NUM_MICRO_BATCHES = 1
PIPELINE = 2
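If you increase the batch size, keep it divisible by NUM_MICRO_BATCHES. A hypothetical combination (not a shipped default) could be:

```python
# Hypothetical pipeline settings: the batch is split into 2 micro-batches
# that flow through 2 pipeline stages.
BATCH_SIZE = 4
NUM_MICRO_BATCHES = 2   # BATCH_SIZE / NUM_MICRO_BATCHES = 2, an integer
PIPELINE = 2            # with TP = DP = 1 this runs on 2 GPUs (GPUS = TP * PP * DP)
```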
Pipeline + 1D + Data Parallel
In gpt2_pp1d.py, we have:
BATCH_SIZE = 8
NUM_EPOCHS = 60
NUM_MICRO_BATCHES = 1
HIDDEN_SIZE = 768
PIPELINE = 2
TENSOR_PARALLEL = 2
MODE = '1d'
TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LEN, HIDDEN_SIZE)
We have introduced BATCH_SIZE, NUM_EPOCHS, NUM_MICRO_BATCHES, PIPELINE, and TENSOR_PARALLEL above.
HIDDEN_SIZE refers to the hidden dimension of the model; for example, it is 768 for gpt2_small.
You can choose None, '1d', '2d', '2.5d', or '3d' for MODE.
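As a quick sanity check against the general rule GPUS = TP * PP * DP, the settings above reproduce the gpt2-pp1d row of the table (8 GPUs, TP = 2, PP = 2, DP = 2). The snippet below is purely illustrative:

```python
GPUS = 8              # the gpt2-pp1d demo uses 8 GPUs
TENSOR_PARALLEL = 2
PIPELINE = 2

# DP is set automatically as the remaining factor.
DATA_PARALLEL = GPUS // (TENSOR_PARALLEL * PIPELINE)
assert GPUS == TENSOR_PARALLEL * PIPELINE * DATA_PARALLEL
print(DATA_PARALLEL)  # prints 2, matching the DP column for gpt2-pp1d
```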
GPT-3
GPT-3 is a huge model, and it is not feasible to train it with only a few GPUs. Therefore, we chose some common sets of parameters instead of the smallest possible ones.
Here are the default parameters of the GPT-3 configs:
config | GPUs* | batch size | TP | PP | DP |
---|---|---|---|---|---|
gpt3_pp1d_min | 96 | 192 | 4 | 24 | 1 |
gpt3_pp1d | 128 | 192 | 4 | 32 | 1 |
gpt3_pp2d | 96 | 2*48 | 4 | 24 | 1 |
gpt3_pp2p5d | 96 | 2*48 | 4 | 24 | 1 |
gpt3_zero3_min | 64 | 3 | 1 | 1 | 64 |
gpt3_zero3 | 96 | 2 | 1 | 1 | 96 |
*Note: We use Nvidia A100 40G GPUs.
*Note: Results of ZeRO are outdated; we will update them soon.
In the table above, the suffix _min means that the set of hyper-parameters requires the least number of GPUs for the given mode.
The GPT-3 configs use the same set of hyper-parameters as the GPT-2 configs.