# Examples
## Table of Contents
- [Examples](#examples)
  - [Table of Contents](#table-of-contents)
  - [Install requirements](#install-requirements)
  - [Supervised datasets collection](#supervised-datasets-collection)
    - [Conversation dataset generation](#conversation-dataset-generation)
  - [Stage1 - Supervised instruction tuning](#stage1---supervised-instruction-tuning)
    - [Arg List](#arg-list)
  - [Stage2 - Training reward model](#stage2---training-reward-model)
    - [Features and tricks in RM training](#features-and-tricks-in-rm-training)
    - [Experiment result](#experiment-result)
    - [Arg List](#arg-list-1)
  - [Stage3 - Training model using prompts with RL](#stage3---training-model-using-prompts-with-rl)
    - [Arg List](#arg-list-2)
  - [Inference example - After Stage3](#inference-example---after-stage3)
  - [Attention](#attention)
    - [data](#data)
  - [Support Model](#support-model)
    - [GPT](#gpt)
    - [BLOOM](#bloom)
    - [OPT](#opt)
    - [LLaMA](#llama)
  - [Add your own models](#add-your-own-models)
    - [Actor model](#actor-model)
    - [Reward model](#reward-model)
    - [Critic model](#critic-model)
---
## Install requirements
```shell
pip install -r requirements.txt
```
## Supervised datasets collection
We collected 104K bilingual (Chinese and English) instruction samples. You can find the datasets in the [InstructionWild](https://github.com/XueFuzhao/InstructionWild) repository and in this [file](https://github.com/XueFuzhao/InstructionWild/blob/main/data/README.md).
Here is how we collected the data:
<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/data-collect.png" width=500/>
</p>
### Conversation dataset generation
To further improve the model's ability to handle multi-turn conversations, we need to include multi-turn conversation samples in the dataset. However, the samples in the InstructWild and Alpaca datasets currently consist of only single-turn conversations, and their organization is not suitable for storing multi-turn conversations. After converting these datasets, we also need to include multi-turn conversation datasets such as ShareGPT and transform them into the training format supported by ColossalChat.

A sample of the conversation dataset should have the following fields:
- `type` (str, optional): The type of the data sample.
- `language` (str, optional): The language of the data sample.
- `dataset` (str, optional): The dataset the data sample originates from.
- `conversations` (str, compulsory): Conversation content of the data sample.
- `id` (int, optional): The ID of the data sample.
A simple example:
```json
{
"type": "instruction",
"language": "English",
"dataset": "Alpaca",
"conversations": [
{
"from": "human",
"value": "Give three tips for staying healthy."
},
{
"from": "gpt",
"value": "1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule."
}
],
"id": 1
}
```
> **NOTE:** Only the key `conversations` is compulsory for training; the other keys serve as metadata. The length of `conversations` varies.
You can run `examples/generate_conversation_dataset.py` to generate a conversation dataset supported by ColossalChat.
Use the following command to generate the conversation dataset.
```bash
python generate_conversation_dataset.py \
    --dataset "All" \
    --save_path "/path/to/dataset"
```
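
For reference, converting a single-turn Alpaca-style record into the conversation format above can be sketched as follows. This is only a minimal illustration with placeholder file paths, not the exact logic of `generate_conversation_dataset.py`:

```python
import json

# Placeholder paths for illustration only.
alpaca_path = "alpaca_data.json"
save_path = "conversation_dataset.json"

with open(alpaca_path, encoding="utf-8") as f:
    samples = json.load(f)

conversation_dataset = []
for idx, sample in enumerate(samples):
    # Merge the instruction and the optional input field into one human turn.
    query = sample["instruction"]
    if sample.get("input"):
        query += "\n" + sample["input"]
    conversation_dataset.append({
        "type": "instruction",
        "language": "English",
        "dataset": "Alpaca",
        "conversations": [
            {"from": "human", "value": query},
            {"from": "gpt", "value": sample["output"]},
        ],
        "id": idx,
    })

with open(save_path, "w", encoding="utf-8") as f:
    json.dump(conversation_dataset, f, ensure_ascii=False, indent=2)
```
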
## Stage1 - Supervised instruction tuning

Stage1 is supervised instruction fine-tuning, which uses the datasets mentioned earlier to fine-tune the model.
[[Stage1 tutorial video]](https://www.youtube.com/watch?v=-qFBZFmOJfg)
You can run `examples/train_sft.sh` to start supervised instruction fine-tuning.
You can also use the following command to start fine-tuning with your own settings.
```bash
torchrun --standalone --nproc_per_node=4 train_sft.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --save_path /path/to/Coati-7B \
    --dataset /path/to/data.json \
    --batch_size 4 \
    --accumulation_steps 8 \
    --lr 2e-5 \
    --max_datasets_size 512 \
    --max_epochs 1 \
    --grad_checkpoint
```
**Note**: the supervised dataset follows the format below:
```json
[
{
"instruction": "Provide a list of the top 10 most popular mobile games in Asia",
"input": "",
"output": "The top 10 most popular mobile games in Asia are:\n1) PUBG Mobile\n2) Pokemon Go\n3) Candy Crush Saga\n4) Free Fire\n5) Clash of Clans\n6) Mario Kart Tour\n7) Arena of Valor\n8) Fantasy Westward Journey\n9) Subway Surfers\n10) ARK Survival Evolved",
"id": 0
},
...
]
```
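
For intuition about how such a record is typically turned into a single training example, the sketch below applies an Alpaca-style prompt template. This is an assumption for illustration; it is not necessarily the exact template used by `train_sft.py`.

```python
import json

# Placeholder path; point this at your own supervised dataset.
with open("/path/to/data.json", encoding="utf-8") as f:
    dataset = json.load(f)

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_example(sample: dict) -> tuple:
    """Return (prompt, target) for one supervised sample."""
    template = PROMPT_WITH_INPUT if sample.get("input") else PROMPT_NO_INPUT
    return template.format(**sample), sample["output"]


prompt, target = build_example(dataset[0])
print(prompt + target)
```
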
### Arg List
- `--strategy`: the strategy used for training, choices=['ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- `--model`: model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- `--pretrain`: pretrained model, type=str, default=None
- `--max_datasets_size`: the maximum size of the dataset, type=int, default=None
- `--save_path`: path to save the model, type=str, default='output'
- `--need_optim_ckpt`: whether to save the optimizer checkpoint, type=bool, default=False
- `--max_epochs`: maximum number of training epochs, type=int, default=3
- `--batch_size`: batch size for training, type=int, default=4
- `--lora_rank`: rank of the low-rank adaptation (LoRA) matrices, type=int, default=0
- `--grad_checkpoint`: enable gradient checkpointing, type=bool, default=False
## Stage2 - Training reward model

In Stage 2 we train a reward model. Different outputs for the same prompt are manually ranked, and these rankings supervise the training of the reward model, which learns to assign corresponding scores.
[[Stage2 tutorial video]](https://www.youtube.com/watch?v=gMx2CApKhuo)
You can run `examples/train_rm.sh` to start reward model training.
You can also use the following command to start training a reward model.
```bash
torchrun --standalone --nproc_per_node=4 train_reward_model.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --loss_fn 'log_exp' \
    --save_path 'rmstatic.pt'
```
### Features and tricks in RM training
- We support the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) and [rm-static](https://huggingface.co/datasets/Dahoas/rm-static) datasets.
- We support two kinds of loss functions: `log_sig` (used by OpenAI) and `log_exp` (used by Anthropic); see the sketch after this list.
- Instead of the raw loss, we monitor `valid_acc` and `pair_dist` to track progress during training.
- We add a special token to the end of the sequence to get better results.
- We use a cosine learning-rate scheduler for RM training.
- We set `value_head` as a single linear layer and initialize its weights from the N(0, 1/(d_model + 1)) distribution.
- We train a BLOOM-560m reward model for 1 epoch and find that its test accuracy reaches the performance reported in [Anthropic's paper](https://arxiv.org/abs/2204.05862).
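
As a reference for the two loss options above, here is a minimal sketch of the pairwise ranking losses on a batch of chosen/rejected reward scores. It illustrates the formulas only and is not the exact implementation in Coati:

```python
import torch
import torch.nn.functional as F


def log_sig_loss(chosen_reward: torch.Tensor, rejected_reward: torch.Tensor) -> torch.Tensor:
    # OpenAI-style pairwise loss: -log(sigmoid(r_chosen - r_rejected)).
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()


def log_exp_loss(chosen_reward: torch.Tensor, rejected_reward: torch.Tensor) -> torch.Tensor:
    # Anthropic-style pairwise loss: log(1 + exp(r_rejected - r_chosen)),
    # mathematically equivalent to the formulation above.
    return torch.log(1 + torch.exp(rejected_reward - chosen_reward)).mean()


# Example usage with dummy reward scores for a batch of 4 prompt pairs.
chosen = torch.tensor([1.2, 0.3, 0.8, 2.0])
rejected = torch.tensor([0.5, 0.1, 1.1, 0.0])
print(log_sig_loss(chosen, rejected), log_exp_loss(chosen, rejected))
```
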
### Experiment result
Model performance in [Anthropic's paper](https://arxiv.org/abs/2204.05862):
<div align=middle> <img width="512" alt="image" src="https://user-images.githubusercontent.com/70618399/225263321-8d64c3a8-6877-4cc8-9b61-0e1c52d3d94f.png">

<div align=left> Our training & test results for BLOOM-560m after 1 epoch:

<div align=middle> <img width="512" alt="image" src="https://user-images.githubusercontent.com/70618399/225262950-a7f0a686-25de-44ec-98f2-11b83ea86674.png">

<div align=left> We also trained a reward model based on LLaMA-7B, which reaches an accuracy of 72.06% after 1 epoch, performing almost the same as Anthropic's best RM.
### Arg List
- `--strategy`: the strategy used for training, choices=['ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- `--model`: model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- `--pretrain`: pretrained model, type=str, default=None
- `--model_path`: the path of the reward model (if continuing training from a checkpoint), type=str, default=None
- `--save_path`: path to save the model, type=str, default='output'
- `--need_optim_ckpt`: whether to save the optimizer checkpoint, type=bool, default=False
- `--max_epochs`: maximum number of training epochs, type=int, default=3
- `--dataset`: dataset name, type=str, choices=['Anthropic/hh-rlhf', 'Dahoas/rm-static']
- `--subset`: subset of the dataset, type=str, default=None
- `--batch_size`: batch size for training, type=int, default=4
- `--lora_rank`: rank of the low-rank adaptation (LoRA) matrices, type=int, default=0
- `--loss_fn`: the loss function to use, choices=['log_sig', 'log_exp']
- `--max_len`: max sentence length for generation, type=int, default=512
## Stage3 - Training model using prompts with RL
Stage3 uses a reinforcement learning algorithm, which is the most complex part of the training process, as shown below:
<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/stage-3.jpeg" width=800/>
</p>
You can run `examples/train_prompts.sh` to start PPO training.
You can also use the following command to start PPO training.
[[Stage3 tutorial video]](https://www.youtube.com/watch?v=Z8wwSHxPL9g)
```bash
torchrun --standalone --nproc_per_node=4 train_prompts.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --prompt_dataset /path/to/your/prompt_dataset \
    --pretrain_dataset /path/to/your/pretrain_dataset \
    --rm_pretrain /your/pretrain/rm/definition \
    --rm_path /your/rm/model/path
```
Prompt dataset: the instruction dataset mentioned in the figure above, which contains the instructions. For example, you can use the [script](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples/generate_prompt_dataset.py) that samples `instinwild_en.json` or `instinwild_ch.json` from [InstructionWild](https://github.com/XueFuzhao/InstructionWild/tree/main/data#instructwild-data) to generate the prompt dataset.
Pretrain dataset: the pretraining dataset containing the instructions and corresponding responses. For example, you can use the [InstructWild Data](https://github.com/XueFuzhao/InstructionWild/tree/main/data) from Stage 1 supervised instruction tuning.
**Note**: the required datasets follow the format below:
- `pretrain dataset`
```json
[
{
"instruction": "Provide a list of the top 10 most popular mobile games in Asia",
"input": "",
"output": "The top 10 most popular mobile games in Asia are:\n1) PUBG Mobile\n2) Pokemon Go\n3) Candy Crush Saga\n4) Free Fire\n5) Clash of Clans\n6) Mario Kart Tour\n7) Arena of Valor\n8) Fantasy Westward Journey\n9) Subway Surfers\n10) ARK Survival Evolved",
"id": 0
},
...
]
```
- `prompt dataset`
```json
[
{
"instruction": "Edit this paragraph to make it more concise: \"Yesterday, I went to the store and bought some things. Then, I came home and put them away. After that, I went for a walk and met some friends.\"",
"id": 0
},
{
"instruction": "Write a descriptive paragraph about a memorable vacation you went on",
"id": 1
},
...
]
```
### Arg List
- `--strategy`: the strategy used for training, choices=['ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- `--model`: model type of the actor, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- `--pretrain`: pretrained model, type=str, default=None
- `--rm_model`: reward model type, type=str, choices=['gpt2', 'bloom', 'opt', 'llama'], default=None
- `--rm_pretrain`: pretrained model for the reward model, type=str, default=None
- `--rm_path`: the path of the reward model, type=str, default=None
- `--save_path`: path to save the model, type=str, default='output'
- `--prompt_dataset`: path of the prompt dataset, type=str, default=None
- `--pretrain_dataset`: path of the ptx dataset, type=str, default=None
- `--need_optim_ckpt`: whether to save the optimizer checkpoint, type=bool, default=False
- `--num_episodes`: number of episodes for training, type=int, default=10
- `--num_update_steps`: number of steps to update the policy per episode, type=int
- `--num_collect_steps`: number of steps to collect experience per episode, type=int
- `--train_batch_size`: batch size for training, type=int, default=8
- `--ptx_batch_size`: batch size for computing the ptx loss, type=int, default=1
- `--experience_batch_size`: batch size for making experience, type=int, default=8
- `--lora_rank`: rank of the low-rank adaptation (LoRA) matrices, type=int, default=0
- `--kl_coef`: KL coefficient used for computing the reward, type=float, default=0.1 (see the sketch after this list)
- `--ptx_coef`: ptx coefficient used for computing the policy loss, type=float, default=0.9 (see the sketch after this list)
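
For intuition about `--kl_coef` and `--ptx_coef`, the sketch below shows how such coefficients typically enter PPO-style RLHF: the reward is penalized by an approximate KL term against the initial (SFT) policy, and the pretraining (ptx) language-modeling loss is mixed into the policy loss. This is a simplified illustration under those assumptions, not the exact computation in `train_prompts.py`.

```python
import torch


def shaped_reward(rm_reward: torch.Tensor,
                  log_probs: torch.Tensor,
                  ref_log_probs: torch.Tensor,
                  kl_coef: float = 0.1) -> torch.Tensor:
    # rm_reward: (batch,) scores from the reward model.
    # log_probs / ref_log_probs: (batch, seq_len) token log-probs from the actor
    # and from the frozen initial (SFT) model.
    approx_kl = (log_probs - ref_log_probs).sum(dim=-1)  # per-sequence KL estimate
    return rm_reward - kl_coef * approx_kl


def mixed_policy_loss(ppo_loss: torch.Tensor,
                      ptx_loss: torch.Tensor,
                      ptx_coef: float = 0.9) -> torch.Tensor:
    # Blend the pretraining (ptx) language-modeling loss with the PPO actor loss.
    return ptx_coef * ptx_loss + (1 - ptx_coef) * ppo_loss
```
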
## Inference example - After Stage3
We support different inference options, including int8 and int4 quantization.
For details, see [`inference/`](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/inference).
## Attention
The examples are demos of the whole training process. You need to adjust the hyper-parameters to achieve good performance.
#### data
- [x] [rm-static](https://huggingface.co/datasets/Dahoas/rm-static)
- [x] [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- [ ] [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
- [ ] [openai/webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons)
- [ ] [Dahoas/instruct-synthetic-prompt-responses](https://huggingface.co/datasets/Dahoas/instruct-synthetic-prompt-responses)
## Support Model
### GPT
- [x] GPT2-S (s)
- [x] GPT2-M (m)
- [x] GPT2-L (l)
- [x] GPT2-XL (xl)
- [x] GPT2-4B (4b)
- [ ] GPT2-6B (6b)
### BLOOM
- [x] [BLOOM-560m](https://huggingface.co/bigscience/bloom-560m)
- [x] [BLOOM-1b1](https://huggingface.co/bigscience/bloom-1b1)
- [x] [BLOOM-3b](https://huggingface.co/bigscience/bloom-3b)
- [x] [BLOOM-7b](https://huggingface.co/bigscience/bloom-7b1)
- [ ] [BLOOM-175b](https://huggingface.co/bigscience/bloom)
### OPT
- [x] [OPT-125M](https://huggingface.co/facebook/opt-125m)
- [x] [OPT-350M](https://huggingface.co/facebook/opt-350m)
- [x] [OPT-1.3B](https://huggingface.co/facebook/opt-1.3b)
- [x] [OPT-2.7B](https://huggingface.co/facebook/opt-2.7b)
- [x] [OPT-6.7B](https://huggingface.co/facebook/opt-6.7b)
- [ ] [OPT-13B](https://huggingface.co/facebook/opt-13b)
- [ ] [OPT-30B](https://huggingface.co/facebook/opt-30b)
### [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)
- [x] LLaMA-7B
- [x] LLaMA-13B
- [ ] LLaMA-33B
- [ ] LLaMA-65B
## Add your own models
If you want to support your own model in Coati, please refer to the pull request that added RoBERTa support as an example: [[chatgpt] add pre-trained model RoBERTa for RLHF stage 2 & 3](https://github.com/hpcaitech/ColossalAI/pull/3223), and submit a PR to us.

You should complete the implementation of four model classes: the Reward model, the Critic model, the LM model, and the Actor model. Below is some example code for a new model named `Coati`.
If the model is supported in Hugging Face [transformers](https://github.com/huggingface/transformers), you can load it with `from_pretrained`, or you can build your own model yourself.
### Actor model
```python
from typing import Optional

from ..base import Actor
from transformers.models.coati import CoatiModel


class CoatiActor(Actor):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()  # load your own model if it is not supported in transformers
        super().__init__(model, lora_rank, lora_train_bias)
```
### Reward model
```python
from typing import Optional

import torch.nn as nn

from ..base import RewardModel
from transformers.models.coati import CoatiModel


class CoatiRM(RewardModel):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()  # load your own model if it is not supported in transformers
        value_head = nn.Linear(model.config.n_embd, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.n_embd + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
```
### Critic model
```python
from typing import Optional

import torch.nn as nn

from ..base import Critic
from transformers.models.coati import CoatiModel


class CoatiCritic(Critic):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()  # load your own model if it is not supported in transformers
        value_head = nn.Linear(model.config.n_embd, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.n_embd + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
```