# Train GPT with Colossal-AI

This example shows how to use [Colossal-AI](https://github.com/hpcaitech/ColossalAI) to run Hugging Face GPT training in a distributed manner.

## GPT

We use the [GPT-2](https://huggingface.co/gpt2) model from Hugging Face Transformers. The key idea behind GPT-2 is to use unsupervised pre-training to obtain a model that can then perform supervised tasks. GPT-2 performs strongly in text generation, and its generated text is notable for its contextual coherence and emotional expression.

## Requirements

Before you can launch training, you need to install the following requirements.

### Install PyTorch

```bash
# Option 1: conda
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch
# Option 2: pip
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113
```

### [Install Colossal-AI](https://github.com/hpcaitech/ColossalAI#installation)


### Install requirements

```bash
pip install -r requirements.txt
```

The commands above are just one example: they install PyTorch 1.12.0 with CUDA 11.3, plus Colossal-AI. You can install another PyTorch version and the corresponding Colossal-AI version. Just make sure that Colossal-AI is at least 0.1.10, PyTorch is at least 1.8.1, and transformers is at least 4.23.1.
If you want to test ZeRO1 and ZeRO2 in Colossal-AI, you need Colossal-AI >= 0.1.12.
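
The version floors above can be checked before launching a run. The following is a minimal sketch using only the Python standard library. The helper names (`parse_release`, `meets_minimum`, `check_environment`) are illustrative and not part of Colossal-AI, and the transformers floor is read here as 4.23.1, which is an assumption. The parsing is deliberately simplified: it compares only the leading numeric release segments and ignores local tags such as `+cu113`.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the text above (transformers floor is an assumption).
MINIMUMS = {"colossalai": "0.1.10", "torch": "1.8.1", "transformers": "4.23.1"}


def parse_release(version_string):
    """Turn '1.12.0+cu113' into (1, 12, 0); stops at the first non-numeric piece."""
    public = version_string.split("+", 1)[0]  # drop local tags such as '+cu113'
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed release is at least the minimum release."""
    return parse_release(installed) >= parse_release(minimum)


def check_environment():
    """Map each package to (installed_version_or_None, requirement_satisfied)."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, meets_minimum(installed, minimum))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_environment().items():
        status = "ok" if ok else "missing or too old"
        print(f"{name}: {installed} ({status})")
```

For stricter comparisons (pre-releases, post-releases), a dedicated version parser would be a more robust choice than this tuple comparison.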

## Dataset

For simplicity, the input data is randomly generated here.
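
As a rough illustration of what randomly generated input can look like for a GPT-style language model, the sketch below draws token IDs uniformly from the vocabulary and pairs them with an all-ones attention mask. The function name `get_random_batch` and the batch and sequence sizes are assumptions for illustration, not necessarily what the example scripts use; 50257 is the GPT-2 vocabulary size.

```python
import torch


def get_random_batch(batch_size, seq_len, vocab_size, device="cpu"):
    """Create a synthetic batch: random token IDs plus a mask that attends to every token."""
    # Token IDs are sampled uniformly from [0, vocab_size).
    input_ids = torch.randint(0, vocab_size, (batch_size, seq_len), device=device)
    # No padding is present in synthetic data, so every position is marked as valid.
    attention_mask = torch.ones_like(input_ids)
    return input_ids, attention_mask


if __name__ == "__main__":
    input_ids, attention_mask = get_random_batch(batch_size=4, seq_len=16, vocab_size=50257)
    print(input_ids.shape, attention_mask.shape)
```

Synthetic data like this is enough to measure throughput and memory use, but the resulting loss values carry no meaning.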

## Training
We provide two stable solutions.
The first uses Gemini to implement hybrid parallel strategies that combine Gemini, DDP/ZeRO, and Tensor Parallelism for a Hugging Face GPT model.
The second uses [Titans](https://github.com/hpcaitech/Titans), a model zoo of distributed models maintained by Colossal-AI, to implement a hybrid parallel strategy of TP + ZeRO + PP.

We recommend using Gemini to quickly run your model in a distributed manner.
It doesn't require significant changes to the model structure, so you can easily apply it to a new model.
Use Titans as a more advanced option when you want to push performance further.
Titans already includes some typical models, such as ViT and GPT.
However, it takes some effort to get started with a new model structure.

### GeminiDDP/ZeRO + Tensor Parallelism
```bash
bash run_gemini.sh
```

`train_gpt_demo.py` provides three distributed plans in addition to the ones already provided by PyTorch; you can choose the plan you want in `run_gemini.sh`. CAI_Gemini combines Tensor Parallelism with Gemini + ZeRO DDP. For the differences between the plans, see [this issue comment](https://github.com/hpcaitech/ColossalAI/issues/2590#issuecomment-1418766581). The available plans are:

- ZeRO1 (CAI_ZeRO1)
- ZeRO2 (CAI_ZeRO2)
- Gemini + ZeRO DDP (CAI_Gemini)
- PyTorch DDP (Pytorch_DDP)
- PyTorch ZeRO (Pytorch_ZeRO)
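
One way a training script could expose these plan names is as a validated command-line choice. The sketch below is an assumption about how such a selection might be wired up, not a copy of `train_gpt_demo.py`: the flag name `--distplan`, the default value, and the helper `uses_colossalai` are all hypothetical. Only the five plan identifiers come from the list above.

```python
import argparse

# Plan identifiers from the list above, mapped to their human-readable descriptions.
DIST_PLANS = {
    "CAI_ZeRO1": "ZeRO1",
    "CAI_ZeRO2": "ZeRO2",
    "CAI_Gemini": "Gemini + ZeRO DDP",
    "Pytorch_DDP": "PyTorch DDP",
    "Pytorch_ZeRO": "PyTorch ZeRO",
}


def build_parser():
    """Build a parser that rejects any plan name not in DIST_PLANS."""
    parser = argparse.ArgumentParser(description="Select a distributed plan (illustrative).")
    parser.add_argument(
        "--distplan",  # hypothetical flag name
        choices=sorted(DIST_PLANS),
        default="CAI_Gemini",  # hypothetical default
        help="which distributed plan to run",
    )
    return parser


def uses_colossalai(plan):
    """Plans prefixed with 'CAI_' are the Colossal-AI ones; the rest are PyTorch baselines."""
    return plan.startswith("CAI_")


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--distplan", "CAI_ZeRO2"])
print(f"{args.distplan}: {DIST_PLANS[args.distplan]} (Colossal-AI: {uses_colossalai(args.distplan)})")
```

Restricting the argument to a fixed set of choices means a mistyped plan name fails immediately rather than partway through setup.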

### Titans (Tensor Parallelism) + ZeRO + Pipeline Parallelism

Titans provides a customized GPT model, which uses distributed operators as building blocks.
[./titans/README.md](./titans/README.md) describes a hybrid parallel setup that combines ZeRO, TP, and PP.
You can switch parallel strategies using a config file.
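
To make the interaction between these dimensions concrete: for a fixed number of GPUs, the tensor-parallel and pipeline-parallel degrees determine how many data-parallel replicas remain, and ZeRO generally partitions state across those data-parallel ranks. The sketch below works through that arithmetic. The `parallel` dictionary follows the dict style used in Colossal-AI config files, but the specific values and the helper `data_parallel_size` are assumptions for illustration rather than the contents of the Titans config.

```python
# Hypothetical config fragment: 2 pipeline stages, 1D tensor parallelism across 2 GPUs.
parallel = dict(
    pipeline=2,
    tensor=dict(size=2, mode="1d"),
)


def data_parallel_size(world_size, parallel_cfg):
    """Return how many data-parallel replicas fit once TP and PP degrees are fixed."""
    pipeline_size = parallel_cfg.get("pipeline", 1)
    tensor_size = parallel_cfg.get("tensor", {}).get("size", 1)
    model_parallel_size = pipeline_size * tensor_size
    if world_size % model_parallel_size != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by TP x PP = {model_parallel_size}"
        )
    return world_size // model_parallel_size


# On an 8-GPU node, TP=2 and PP=2 leave 2 data-parallel replicas.
print(data_parallel_size(8, parallel))  # 2
```

Changing the `pipeline` or `tensor` entries trades model-parallel width against the number of data-parallel replicas, which is the kind of switch a config file makes easy.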

## Performance

Testbed: a cluster of 8x A100 (80 GB) and 1x AMD EPYC 7543 32-Core Processor (512 GB). GPUs are connected via PCIe.
Colossal-AI version: 0.1.13.

[Benchmark results on Google Docs](https://docs.google.com/spreadsheets/d/15A2j3RwyHh-UobAPv_hJgT4W_d7CnlPm5Fp4yEzH5K4/edit#gid=0)

[Benchmark results on Tencent Docs (for China)](https://docs.qq.com/sheet/DUVpqeVdxS3RKRldk?tab=BB08J2)

### Experimental Features

#### [Pipeline Parallel](./experiments/pipeline_parallel/)
#### [Auto Parallel](./experiments/auto_parallel_with_gpt/)