# RLHF - Colossal-AI

An implementation of RLHF (Reinforcement Learning from Human Feedback) powered by Colossal-AI. It supports distributed training and offloading, which makes it possible to fit extremely large models. More details can be found in the [blog](https://www.hpc-ai.tech/blog/colossal-ai-chatgpt).

<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chatgpt/chatgpt.png" width=700/>
</p>

## Training process (step 3)
<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chatgpt/experience.jpg" width=500/>
</p>
<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chatgpt/train.jpg" width=500/>
</p>


## Install
```shell
pip install .
```


## Usage

The main entrypoint is `Trainer`. Only the PPO trainer is supported for now. Several training strategies are available:

- NaiveStrategy: the simplest strategy. Trains on a single GPU.
- DDPStrategy: uses `torch.nn.parallel.DistributedDataParallel`. Trains on multiple GPUs.
- ColossalAIStrategy: uses Gemini and ZeRO from ColossalAI. It eliminates model duplication across GPUs and supports offloading, which is very useful when training large models on multiple GPUs.
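
As a rough rule of thumb drawn from the list above, the choice between strategies can be sketched as follows. `pick_strategy` is a hypothetical helper and not part of this package; it only returns the name of the strategy class, and the "fits on one GPU" flag is an assumption you would have to estimate yourself.

```python
def pick_strategy(num_gpus: int, fits_on_one_gpu: bool) -> str:
    """Return the name of a strategy class suggested by the list above.

    Hypothetical helper for illustration only; it is not part of the package.
    """
    if num_gpus < 1:
        raise ValueError("at least one GPU is required")
    if not fits_on_one_gpu:
        # Sharding and offloading are what ColossalAIStrategy is described as providing.
        return "ColossalAIStrategy"
    if num_gpus == 1:
        return "NaiveStrategy"
    # The model fits on each GPU, so plain data parallelism is enough.
    return "DDPStrategy"


print(pick_strategy(num_gpus=8, fits_on_one_gpu=False))
```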

Simplest usage:

```python
from chatgpt.trainer import PPOTrainer
from chatgpt.trainer.strategies import ColossalAIStrategy
from chatgpt.nn import GPTActor, GPTCritic, RewardModel
from copy import deepcopy
from colossalai.nn.optimizer import HybridAdam

strategy = ColossalAIStrategy()

with strategy.model_init_context():
    # init your models here
    # load pretrained gpt2
    actor = GPTActor(pretrained='gpt2')
    critic = GPTCritic()
    initial_model = deepcopy(actor).cuda()
    reward_model = RewardModel(deepcopy(critic.model), deepcopy(critic.value_head)).cuda()

actor_optim = HybridAdam(actor.parameters(), lr=5e-6)
critic_optim = HybridAdam(critic.parameters(), lr=5e-6)

# prepare models and optimizers
(actor, actor_optim), (critic, critic_optim), reward_model, initial_model = strategy.prepare(
        (actor, actor_optim), (critic, critic_optim), reward_model, initial_model)

# if resuming, load the saved model checkpoint (must happen after preparing)
strategy.load_model(actor, 'actor_checkpoint.pt', strict=False)
# if resuming, load the saved optimizer checkpoint (must happen after preparing)
strategy.load_optimizer(actor_optim, 'actor_optim_checkpoint.pt')

trainer = PPOTrainer(strategy,
                     actor,
                     critic,
                     reward_model,
                     initial_model,
                     actor_optim,
                     critic_optim,
                     ...)

trainer.fit(dataset, ...)

# after fitting, save the model checkpoint on rank 0 only
strategy.save_model(actor, 'actor_checkpoint.pt', only_rank0=True)
# save the optimizer checkpoint on all ranks
strategy.save_optimizer(actor_optim, 'actor_optim_checkpoint.pt', only_rank0=False)
```

For more details, see `examples/`.

Training a reward model on real-world data is also supported. See `examples/train_reward_model.py`.

## FAQ

### How to save/load checkpoints

To load a pretrained model, you can simply use Hugging Face pretrained models:

```python
from chatgpt.nn import OPTActor  # assumed to be exported alongside GPTActor

# load the pretrained OPT-350m model
actor = OPTActor(pretrained='facebook/opt-350m')
```

To save a model checkpoint:

```python
# save model checkpoint on only rank0
strategy.save_model(actor, 'actor_checkpoint.pt', only_rank0=True)
```

This function must be called after `strategy.prepare()`.

With the DDP strategy, model weights are replicated on all ranks. With the ColossalAI strategy, model weights may be sharded, but an all-gather is applied before the state dict is returned. In both cases you can set `only_rank0=True` so that only rank 0 writes the checkpoint, which saves disk space. The checkpoint is stored in float32.
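
The effect of `only_rank0` can be illustrated with a minimal sketch. This is not how `strategy.save_model` is implemented: `save_state` is a hypothetical function, the rank is passed in explicitly, and `pickle` stands in for the real serializer.

```python
import pickle
from pathlib import Path


def save_state(state: dict, path: str, rank: int, only_rank0: bool = True) -> bool:
    """Write `state` to `path`; return True if this rank actually wrote a file.

    Hypothetical stand-in used only to illustrate the `only_rank0` flag.
    """
    if only_rank0 and rank != 0:
        # Every rank holds the same (or already gathered) weights,
        # so a single copy on disk is enough.
        return False
    with Path(path).open("wb") as f:
        pickle.dump(state, f)
    return True
```

With `only_rank0=True`, a call on rank 1 returns `False` without touching the disk, while the same call on rank 0 writes the file.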

To save an optimizer checkpoint:

```python
# save optimizer checkpoint on all ranks
strategy.save_optimizer(actor_optim, 'actor_optim_checkpoint.pt', only_rank0=False)
```

With the DDP strategy, optimizer states are replicated on all ranks, so you can set `only_rank0=True`. With the ColossalAI strategy, however, optimizer states are sharded across all ranks and no all-gather is applied, so you must set `only_rank0=False`. In other words, each rank saves its own checkpoint, and when loading, each rank should load its corresponding part.

Note that different strategies may produce optimizer checkpoints of different shapes.
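
When every rank saves its own optimizer shard, the ranks need distinct file names if they write to the same directory. One way to derive them is sketched below; `per_rank_path` is a hypothetical helper and not part of this package, and the `_rank{N}` naming scheme is only an assumption.

```python
from pathlib import Path


def per_rank_path(base: str, rank: int) -> str:
    """Insert the rank before the file extension, e.g. 'opt.pt' -> 'opt_rank3.pt'.

    Hypothetical helper; the naming scheme is an illustrative assumption.
    """
    p = Path(base)
    return str(p.with_name(f"{p.stem}_rank{rank}{p.suffix}"))


print(per_rank_path("actor_optim_checkpoint.pt", 0))
```

Each rank would pass its own path to `strategy.save_optimizer(..., only_rank0=False)` and later pass the same path to `strategy.load_optimizer`, with the rank taken from, for example, `torch.distributed.get_rank()`.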

To load a model checkpoint:

```python
# load saved model checkpoint after preparing
strategy.load_model(actor, 'actor_checkpoint.pt', strict=False)
```

To load an optimizer checkpoint:

```python
# load saved optimizer checkpoint after preparing
strategy.load_optimizer(actor_optim, 'actor_optim_checkpoint.pt')
```

## Todo

- [x] implement PPO fine-tuning
- [x] implement training reward model
- [x] support LoRA
- [ ] implement PPO-ptx fine-tuning
- [ ] integrate with Ray
- [ ] support more RL paradigms, like Implicit Language Q-Learning (ILQL)

## Invitation to open-source contribution
Following the successful efforts of [BLOOM](https://bigscience.huggingface.co/) and [Stable Diffusion](https://en.wikipedia.org/wiki/Stable_Diffusion), all developers and partners with computing power, datasets, or models are welcome to join and build an ecosystem with Colossal-AI, working towards the era of big AI models, starting with replicating ChatGPT!

You may contact us or participate in the following ways:
1. Post an [issue](https://github.com/hpcaitech/ColossalAI/issues/new/choose) or submit a [PR](https://github.com/hpcaitech/ColossalAI/pulls) on GitHub.
2. Join the Colossal-AI community on [Slack](https://join.slack.com/t/colossalaiworkspace/shared_invite/zt-z7b26eeb-CBp7jouvu~r0~lcFzX832w) and [WeChat](https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/WeChat.png "qrcode") to share your ideas.
3. Check out and fill in the [cooperation proposal](https://www.hpc-ai.tech/partners).
4. Send your proposal by email to contact@hpcaitech.com.

Thanks so much to all of our amazing contributors!

## Quick Preview
<p id="ChatGPT_scaling" align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chatgpt/ChatGPT%20scaling.png" width=800/>
</p>

- Up to 7.73 times faster for single-server training and 1.42 times faster for single-GPU inference

<p id="ChatGPT-1GPU" align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chatgpt/ChatGPT-1GPU.jpg" width=450/>
</p>

- Up to 10.3x growth in model capacity on one GPU
- A mini demo training process requires only 1.62GB of GPU memory (any consumer-grade GPU)

<p id="inference" align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chatgpt/LoRA%20data.jpg" width=600/>
</p>

- Increases the capacity of the fine-tuning model by up to 3.7 times on a single GPU
- Maintains a sufficiently high running speed

## Citations

```bibtex
@article{Hu2021LoRALA,
    title   = {LoRA: Low-Rank Adaptation of Large Language Models},
    author  = {Edward J. Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Weizhu Chen},
    journal = {ArXiv},
    year    = {2021},
    volume  = {abs/2106.09685}
}

@article{ouyang2022training,
    title   = {Training language models to follow instructions with human feedback},
    author  = {Ouyang, Long and Wu, Jeff and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and others},
    journal = {arXiv preprint arXiv:2203.02155},
    year    = {2022}
}
```