ColossalAI/examples/language/gpt/README.md

# Train GPT with Colossal-AI

This example shows how to use [Colossal-AI](https://github.com/hpcaitech/ColossalAI) to run huggingface GPT training in distributed manners.

## GPT

We use the [GPT-2](https://huggingface.co/gpt2) model from huggingface transformers. The key learning goal of GPT-2 is to use unsupervised pre-training models to do supervised tasks.GPT-2 has an amazing performance in text generation, and the generated text exceeds people's expectations in terms of contextual coherence and emotional expression.

## Requirements

Before you can launch training, you need to install the following requirements.

### Install PyTorch

```bash
#conda
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch
#pip
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113
```

### Install [Colossal-AI v0.1.12](https://colossalai.org/download/) From Official Website

```bash
pip install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org
```

### Install requirements

```bash
pip install -r requirements.txt
```

This is just an example that we download PyTorch=1.12.0, CUDA=11.6 and colossalai=0.1.12+torch1.12cu11.3. You can download another version of PyTorch and its corresponding ColossalAI version. Just make sure that the version of ColossalAI is at least 0.1.10, PyTorch is at least 1.8.1 and transformers is at least 4.231.
If you want to test ZeRO1 and ZeRO2 in Colossal-AI, you need to ensure Colossal-AI>=0.1.12.

## Dataset

For simplicity, the input data is randonly generated here.

## Training
We provide two solutions. One utilizes the hybrid parallel strategies of Gemini, DDP/ZeRO, and Tensor Parallelism.
The other one uses Pipeline Parallelism Only.
In the future, we are going merge them together and they can be used orthogonally to each other.

### GeminiDPP/ZeRO + Tensor Parallelism
```bash
bash run_gemini.sh
```

The `train_gpt_demo.py` provides three distributed plans, you can choose the plan you want in `run_gemini.sh`. The Colossal-AI leverages Tensor Parallel and Gemini + ZeRO DDP.

- Colossal-AI
- ZeRO1 (Colossal-AI)
- ZeRO2 (Colossal-AI)
- Pytorch DDP
- Pytorch ZeRO


## Performance

Testbed: a cluster of 8xA100 (80GB) and 1xAMD EPYC 7543 32-Core Processor (512 GB). GPUs are connected via PCI-e.
ColossalAI version 0.1.13.

How dose Batch Size affect the efficency.

| model | #GPU | policy | TP | batch per DP | Tflops |
| ---------- | --------- |--------- |--------- |--------- |--------- |
| gpt2_10b |  2  | cpu | 1 | 32 | 122.046 |
| gpt2_10b |  2  | cpu | 1 | 16 | 82.649 |
| gpt2_10b |  2  | cpu | 1 | 8 | 61.354 |


How dose the Placement Policy affect the efficency.

| model | #GPU | policy | TP | batch per DP | Tflops |
| ---------- | --------- |--------- |--------- |--------- |--------- |
| gpt2_10b |  4  | auto | 1 | 8 | 88.657 |
| gpt2_10b |  4  | cuda | 1 | 8 | OOM |
| gpt2_10b |  4  | cpu | 1 | 8 | 61.354 |
| gpt2_10b |  4  | const | 1 | 8 | 82.137 |

How dose the Tensor Parallel Degree affect the efficency.

| model | #GPU | policy | TP | batch per DP | Tflops |
| ---------- | --------- |--------- |--------- |--------- |--------- |
| gpt2_10b |  4  | auto | 1 | 8 | 88.657 |
| gpt2_10b |  4  | auto | 2 | 8 | 56.687 |
| gpt2_10b |  4  | auto | 4 | 8 | 29.019 |
| gpt2_10b |  4  | auto | 4 | 64 | 50.411 |
| gpt2_20b |  1  | cpu | 1 | 8 | 43.102 |
| gpt2_20b |  4  | cpu | 4 | 8 | 28.491 |


Touch the bar of model scale and batch size.

1. `cpu` is the most stable policy for large model and large batch size. One 8 GPU with TP=2, largest batch size of `auto`, `const`
 `cpu` is 64, 32 and 16, respectively.

2. Tensor parallel is necessary for 20B model to reduce model data memory requirement on each GPU.

| model | #GPU | policy | TP | batch per DP | Tflops |
| ---------- | --------- |--------- |--------- |--------- |--------- |
| gpt2_20b |  4  | cpu | 1 | 64 | CUDA OOM |
| gpt2_20b |  4  | auto | 1/2 | 64 | CUDA OOM |
| gpt2_20b |  4  | cpu | 2 | 8 | 43.102 |
| gpt2_20b |  4  | cpu | 2 | 64 | 121.394 |
| gpt2_20b |  8  | auto | 2 | 16 | 99.871 |
| gpt2_20b |  8  | cpu | 2 | 64 | 125.170 |
| gpt2_20b |  8  | const | 2 | 32 | 105.415 |


| model | #GPU | policy | TP | batch per DP | Tflops |
| ---------- | --------- |--------- |--------- |--------- |--------- |
| gpt2_20b |  8  | cpu | 2 | 8 | 46.895 |


### Experimental Features

#### [Pipeline Parallel](./experiments/pipeline_parallel/)
#### [Auto Parallel](./experiments/auto_parallel_with_gpt/)
[example] update GPT README (#2095) 2 years ago			`# Train GPT with Colossal-AI`

			`This example shows how to use [Colossal-AI](https://github.com/hpcaitech/ColossalAI) to run huggingface GPT training in distributed manners.`
[example] add GPT 2 years ago
[example] simplify the GPT2 huggingface example (#1826) 2 years ago			`## GPT`
[example] add GPT 2 years ago
[example] update GPT README (#2095) 2 years ago			`We use the [GPT-2](https://huggingface.co/gpt2) model from huggingface transformers. The key learning goal of GPT-2 is to use unsupervised pre-training models to do supervised tasks.GPT-2 has an amazing performance in text generation, and the generated text exceeds people's expectations in terms of contextual coherence and emotional expression.`

			`## Requirements`

			`Before you can launch training, you need to install the following requirements.`

			`### Install PyTorch`

			```bash
			`#conda`
			`conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch`
			`#pip`
			`pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113`
			```

[example] add zero1, zero2 example in GPT examples (#2146) * [example] add zero1 and zero2 for GPT * update readme in gpt example * polish code * change init value * update readme 2 years ago			`### Install [Colossal-AI v0.1.12](https://colossalai.org/download/) From Official Website`
[example] update GPT README (#2095) 2 years ago
			```bash
[example] add zero1, zero2 example in GPT examples (#2146) * [example] add zero1 and zero2 for GPT * update readme in gpt example * polish code * change init value * update readme 2 years ago			`pip install colossalai==0.1.12+torch1.12cu11.3 -f https://release.colossalai.org`
[example] update GPT README (#2095) 2 years ago			```

[example] GPT polish readme (#2274) 2 years ago			`### Install requirements`
[example] update GPT README (#2095) 2 years ago
			```bash
[example] GPT polish readme (#2274) 2 years ago			`pip install -r requirements.txt`
[example] update GPT README (#2095) 2 years ago			```

[example] add zero1, zero2 example in GPT examples (#2146) * [example] add zero1 and zero2 for GPT * update readme in gpt example * polish code * change init value * update readme 2 years ago			`This is just an example that we download PyTorch=1.12.0, CUDA=11.6 and colossalai=0.1.12+torch1.12cu11.3. You can download another version of PyTorch and its corresponding ColossalAI version. Just make sure that the version of ColossalAI is at least 0.1.10, PyTorch is at least 1.8.1 and transformers is at least 4.231.`
			`If you want to test ZeRO1 and ZeRO2 in Colossal-AI, you need to ensure Colossal-AI>=0.1.12.`
[example] update GPT README (#2095) 2 years ago
			`## Dataset`

			`For simplicity, the input data is randonly generated here.`

			`## Training`
[example] GPT polish readme (#2274) 2 years ago			`We provide two solutions. One utilizes the hybrid parallel strategies of Gemini, DDP/ZeRO, and Tensor Parallelism.`
			`The other one uses Pipeline Parallelism Only.`
			`In the future, we are going merge them together and they can be used orthogonally to each other.`
[example] add GPT 2 years ago
[example] GPT polish readme (#2274) 2 years ago			`### GeminiDPP/ZeRO + Tensor Parallelism`
[example] add GPT 2 years ago			```bash
[example] GPT polish readme (#2274) 2 years ago			`bash run_gemini.sh`
[example] add GPT 2 years ago			```
[example] update GPT README (#2095) 2 years ago
[example] GPT polish readme (#2274) 2 years ago			The `train_gpt_demo.py` provides three distributed plans, you can choose the plan you want in `run_gemini.sh`. The Colossal-AI leverages Tensor Parallel and Gemini + ZeRO DDP.
[example] update GPT README (#2095) 2 years ago
			`- Colossal-AI`
[example] add zero1, zero2 example in GPT examples (#2146) * [example] add zero1 and zero2 for GPT * update readme in gpt example * polish code * change init value * update readme 2 years ago			`- ZeRO1 (Colossal-AI)`
			`- ZeRO2 (Colossal-AI)`
			`- Pytorch DDP`
			`- Pytorch ZeRO`
[example] update gpt readme with performance (#2206) 2 years ago

			`## Performance`

			`Testbed: a cluster of 8xA100 (80GB) and 1xAMD EPYC 7543 32-Core Processor (512 GB). GPUs are connected via PCI-e.`
			`ColossalAI version 0.1.13.`

			`How dose Batch Size affect the efficency.`

[example] update gpt example for larger model scale (#2211) 2 years ago			`\| model \| #GPU \| policy \| TP \| batch per DP \| Tflops \|`
[example] update gpt readme with performance (#2206) 2 years ago			`\| ---------- \| --------- \|--------- \|--------- \|--------- \|--------- \|`
			`\| gpt2_10b \| 2 \| cpu \| 1 \| 32 \| 122.046 \|`
			`\| gpt2_10b \| 2 \| cpu \| 1 \| 16 \| 82.649 \|`
			`\| gpt2_10b \| 2 \| cpu \| 1 \| 8 \| 61.354 \|`


			`How dose the Placement Policy affect the efficency.`

[example] update gpt example for larger model scale (#2211) 2 years ago			`\| model \| #GPU \| policy \| TP \| batch per DP \| Tflops \|`
[example] update gpt readme with performance (#2206) 2 years ago			`\| ---------- \| --------- \|--------- \|--------- \|--------- \|--------- \|`
			`\| gpt2_10b \| 4 \| auto \| 1 \| 8 \| 88.657 \|`
			`\| gpt2_10b \| 4 \| cuda \| 1 \| 8 \| OOM \|`
			`\| gpt2_10b \| 4 \| cpu \| 1 \| 8 \| 61.354 \|`
			`\| gpt2_10b \| 4 \| const \| 1 \| 8 \| 82.137 \|`

			`How dose the Tensor Parallel Degree affect the efficency.`

[example] update gpt example for larger model scale (#2211) 2 years ago			`\| model \| #GPU \| policy \| TP \| batch per DP \| Tflops \|`
[example] update gpt readme with performance (#2206) 2 years ago			`\| ---------- \| --------- \|--------- \|--------- \|--------- \|--------- \|`
			`\| gpt2_10b \| 4 \| auto \| 1 \| 8 \| 88.657 \|`
			`\| gpt2_10b \| 4 \| auto \| 2 \| 8 \| 56.687 \|`
			`\| gpt2_10b \| 4 \| auto \| 4 \| 8 \| 29.019 \|`
			`\| gpt2_10b \| 4 \| auto \| 4 \| 64 \| 50.411 \|`
[example] update gpt example for larger model scale (#2211) 2 years ago			`\| gpt2_20b \| 1 \| cpu \| 1 \| 8 \| 43.102 \|`
			`\| gpt2_20b \| 4 \| cpu \| 4 \| 8 \| 28.491 \|`


			`Touch the bar of model scale and batch size.`

[example] update GPT example benchmark results (#2212) 2 years ago			1. `cpu` is the most stable policy for large model and large batch size. One 8 GPU with TP=2, largest batch size of `auto`, `const`
			`cpu` is 64, 32 and 16, respectively.

			`2. Tensor parallel is necessary for 20B model to reduce model data memory requirement on each GPU.`

[example] update gpt example for larger model scale (#2211) 2 years ago			`\| model \| #GPU \| policy \| TP \| batch per DP \| Tflops \|`
			`\| ---------- \| --------- \|--------- \|--------- \|--------- \|--------- \|`
			`\| gpt2_20b \| 4 \| cpu \| 1 \| 64 \| CUDA OOM \|`
			`\| gpt2_20b \| 4 \| auto \| 1/2 \| 64 \| CUDA OOM \|`
			`\| gpt2_20b \| 4 \| cpu \| 2 \| 8 \| 43.102 \|`
[example] update GPT example benchmark results (#2212) 2 years ago			`\| gpt2_20b \| 4 \| cpu \| 2 \| 64 \| 121.394 \|`
			`\| gpt2_20b \| 8 \| auto \| 2 \| 16 \| 99.871 \|`
[example] update gpt example for larger model scale (#2211) 2 years ago			`\| gpt2_20b \| 8 \| cpu \| 2 \| 64 \| 125.170 \|`
[example] update GPT example benchmark results (#2212) 2 years ago			`\| gpt2_20b \| 8 \| const \| 2 \| 32 \| 105.415 \|`
[builder] raise Error when CUDA_HOME is not set (#2213) 2 years ago

			`\| model \| #GPU \| policy \| TP \| batch per DP \| Tflops \|`
			`\| ---------- \| --------- \|--------- \|--------- \|--------- \|--------- \|`
			`\| gpt2_20b \| 8 \| cpu \| 2 \| 8 \| 46.895 \|`
[example] make gpt example directory more clear (#2353) 2 years ago

			`### Experimental Features`

			`#### [Pipeline Parallel](./experiments/pipeline_parallel/)`
			`#### [Auto Parallel](./experiments/auto_parallel_with_gpt/)`