ColossalAI/applications/Chat/examples/README.md

# Examples

## Table of Contents

- [Examples](#examples)
  - [Table of Contents](#table-of-contents)
  - [Install requirements](#install-requirements)
  - [Supervised datasets collection](#supervised-datasets-collection)
  - [Stage1 - Supervised instructs tuning](#stage1---supervised-instructs-tuning)
    - [Arg List](#arg-list)
  - [Stage2 - Training reward model](#stage2---training-reward-model)
    - [Features and tricks in RM training](#features-and-tricks-in-rm-training)
    - [Experiment result](#experiment-result)
    - [Arg List](#arg-list-1)
  - [Stage3 - Training model using prompts with RL](#stage3---training-model-using-prompts-with-rl)
    - [Arg List](#arg-list-2)
  - [Inference example - After Stage3](#inference-example---after-stage3)
  - [Attention](#attention)
      - [data](#data)
  - [Support Model](#support-model)
    - [GPT](#gpt)
    - [BLOOM](#bloom)
    - [OPT](#opt)
    - [LLaMA](#llama)
  - [Add your own models](#add-your-own-models)
    - [Actor model](#actor-model)
    - [Reward model](#reward-model)
    - [Critic model](#critic-model)


---
## Install requirements

```shell
pip install -r requirements.txt
```

## Supervised datasets collection

We collected a 104K bilingual dataset in Chinese and English. You can find it in the
[InstructionWild](https://github.com/XueFuzhao/InstructionWild) repository.

The following figure shows how we collected the data.
<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/data-collect.png" width=500/>
</p>

## Stage1 - Supervised instructs tuning

Stage 1 is supervised instruction fine-tuning, which fine-tunes the model on the datasets described above.
[[Stage1 tutorial video]](https://www.youtube.com/watch?v=-qFBZFmOJfg)

Run `examples/train_sft.sh` to start supervised instruction fine-tuning.

You can also start supervised instruction fine-tuning with your own settings using the following command.
```
torchrun --standalone --nproc_per_node=4 train_sft.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --log_interval 10 \
    --save_path  /path/to/Coati-7B \
    --dataset /path/to/data.json \
    --batch_size 4 \
    --accumulation_steps 8 \
    --lr 2e-5 \
    --max_datasets_size 512 \
    --max_epochs 1 \
    --grad_checkpoint
```
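
As a rough guide to the command above, the number of samples that contribute to each optimizer step can be estimated as follows. This is a back-of-the-envelope sketch, not part of Coati: it assumes that `--batch_size` is the per-process batch size and that every process started by `torchrun` is a data-parallel replica. The helper name `effective_batch_size` is hypothetical.

```python
def effective_batch_size(nproc_per_node: int, batch_size: int, accumulation_steps: int) -> int:
    """Samples consumed per optimizer step under plain data parallelism (assumed)."""
    return nproc_per_node * batch_size * accumulation_steps


# Values taken from the example command above.
samples_per_step = effective_batch_size(nproc_per_node=4, batch_size=4, accumulation_steps=8)
print(samples_per_step)

# With --max_datasets_size 512, one epoch would then take this many optimizer steps.
steps_per_epoch = 512 // samples_per_step
print(steps_per_epoch)
```

Under these assumptions, lowering `--batch_size` to save memory while raising `--accumulation_steps` by the same factor keeps the number of samples per step unchanged.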
### Arg List
- --strategy:          the strategy used for training, choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- --model:             model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- --pretrain:          pretrain model, type=str, default=None
- --max_datasets_size: the max size of dataset, type=int, default=None
- --save_path:         path to save the model, type=str, default='output'
- --need_optim_ckpt:   whether to save optim ckpt, type=bool, default=False
- --max_epochs:        max epochs for training, type=int, default=3
- --batch_size:        batch size while training, type=int, default=4
- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
- --log_interval:      how many steps to log, type=int, default=100
- --grad_checkpoint:   enable gradient checkpointing, type=bool, default=False

## Stage2 - Training reward model

In Stage 2 we train a reward model. Different outputs for the same prompt are ranked manually, and the resulting scores are used to supervise the training of the reward model.
[[Stage2 tutorial video]](https://www.youtube.com/watch?v=gMx2CApKhuo)
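
One common way to turn a manual ranking into training data for a reward model is to expand it into (chosen, rejected) pairs. The sketch below illustrates that idea only; the function name `ranking_to_pairs` and the field names `prompt`, `chosen` and `rejected` are assumptions for illustration and are not taken from the Coati code.

```python
from itertools import combinations
from typing import Dict, List


def ranking_to_pairs(prompt: str, ranked_outputs: List[str]) -> List[Dict[str, str]]:
    """Expand a best-to-worst ranking into pairwise comparisons."""
    pairs = []
    # combinations() keeps the input order, so `better` always ranks above `worse`.
    for better, worse in combinations(ranked_outputs, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs


# Three outputs ranked from best to worst yield three comparisons.
pairs = ranking_to_pairs("Explain RLHF briefly.", ["answer A", "answer B", "answer C"])
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

A ranking of K outputs produces K * (K - 1) / 2 pairs, so the number of comparisons grows quadratically with the number of ranked outputs.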

Run `examples/train_rm.sh` to start reward model training.

You can also start training a reward model with the following command.
```
torchrun --standalone --nproc_per_node=4 train_reward_model.py \
    --pretrain "/path/to/LLaMa-7B/" \
    --model 'llama' \
    --strategy colossalai_zero2 \
    --loss_fn 'log_exp' \
    --save_path 'rmstatic.pt'
```
### Features and tricks in RM training
- We support the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) and [rm-static](https://huggingface.co/datasets/Dahoas/rm-static) datasets.
- We support 2 kinds of loss function: 'log_sig' (used by OpenAI) and 'log_exp' (used by Anthropic).
- We report valid_acc and pair_dist instead of the loss to monitor progress during training.
- We add a special token to the end of the sequence to get better results.
- We use a cosine-annealing lr-scheduler for RM training.
- We set value_head as a single linear layer and initialize its weight from an N(0, 1/(d_model + 1)) distribution.
- We trained a Bloom-560m reward model for 1 epoch and found that its test accuracy matches the performance reported in [Anthropic's paper](https://arxiv.org/abs/2204.05862).
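
To make the loss and monitoring terms in the list above concrete, here is a plain-Python sketch of the usual pairwise ranking losses and of two simple progress metrics. It is an illustration under assumptions, not the Coati implementation: the function names are hypothetical, and the definitions of accuracy (fraction of pairs where the chosen output scores higher) and pair distance (mean score gap) are plausible readings of `valid_acc` and `pair_dist`.

```python
import math
from typing import List


def log_sig_loss(chosen: List[float], rejected: List[float]) -> float:
    """Mean of -log(sigmoid(chosen - rejected))."""
    total = 0.0
    for c, r in zip(chosen, rejected):
        total += -math.log(1.0 / (1.0 + math.exp(-(c - r))))
    return total / len(chosen)


def log_exp_loss(chosen: List[float], rejected: List[float]) -> float:
    """Mean of log(1 + exp(rejected - chosen))."""
    total = 0.0
    for c, r in zip(chosen, rejected):
        total += math.log(1.0 + math.exp(r - c))
    return total / len(chosen)


def pair_accuracy(chosen: List[float], rejected: List[float]) -> float:
    """Fraction of pairs in which the chosen output gets the higher score."""
    return sum(c > r for c, r in zip(chosen, rejected)) / len(chosen)


def pair_distance(chosen: List[float], rejected: List[float]) -> float:
    """Mean score gap between chosen and rejected outputs."""
    return sum(c - r for c, r in zip(chosen, rejected)) / len(chosen)


# Made-up reward scores for three (chosen, rejected) pairs.
chosen_scores = [1.2, 0.3, -0.5]
rejected_scores = [0.4, 0.9, -1.0]

print(log_sig_loss(chosen_scores, rejected_scores))
print(log_exp_loss(chosen_scores, rejected_scores))
print(pair_accuracy(chosen_scores, rejected_scores))
print(pair_distance(chosen_scores, rejected_scores))
```

Written this way, the two losses are algebraically the same quantity, since -log(sigmoid(d)) = log(1 + exp(-d)); they differ only in how the expression is evaluated.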

### Experiment result
Model performance in [Anthropic's paper](https://arxiv.org/abs/2204.05862):

<p align="center">
<img width="512" alt="image" src="https://user-images.githubusercontent.com/70618399/225263321-8d64c3a8-6877-4cc8-9b61-0e1c52d3d94f.png">
</p>

Our training and test results for bloom-560m after 1 epoch:

<p align="center">
<img width="512" alt="image" src="https://user-images.githubusercontent.com/70618399/225262950-a7f0a686-25de-44ec-98f2-11b83ea86674.png">
</p>

We also trained a reward model based on LLaMA-7B, which reaches an accuracy of 72.06% after 1 epoch, almost the same as Anthropic's best RM.

### Arg List
- --strategy:          the strategy used for training, choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- --model:             model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- --pretrain:          pretrain model, type=str, default=None
- --model_path:        the path of the rm model (to continue training), type=str, default=None
- --save_path:         path to save the model, type=str, default='output'
- --need_optim_ckpt:   whether to save optim ckpt, type=bool, default=False
- --max_epochs:        max epochs for training, type=int, default=3
- --dataset:           dataset name, type=str, choices=['Anthropic/hh-rlhf', 'Dahoas/rm-static']
- --subset:            subset of the dataset, type=str, default=None
- --batch_size:        batch size while training, type=int, default=4
- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
- --loss_fn:           which kind of loss function, choices=['log_sig', 'log_exp']
- --max_len:           max sentence length for generation, type=int, default=512
- --test:              whether to run in test-only mode; if true, only a small dataset is used

## Stage3 - Training model using prompts with RL

Stage 3 uses a reinforcement learning algorithm and is the most complex part of the training process, as shown below:

<p align="center">
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/stage-3.jpeg" width=800/>
</p>

Run `examples/train_prompts.sh` to start PPO training.
You can also start PPO training with the following command.
[[Stage3 tutorial video]](https://www.youtube.com/watch?v=Z8wwSHxPL9g)

```
torchrun --standalone --nproc_per_node=4 train_prompts.py \
         --pretrain "/path/to/LLaMa-7B/" \
         --model 'llama' \
         --strategy colossalai_zero2 \
         --prompt_dataset /path/to/your/prompt_dataset \
         --pretrain_dataset /path/to/your/pretrain_dataset \
         --rm_pretrain /your/pretrain/rm/definition \
         --rm_path /your/rm/model/path
```

- Prompt dataset: the instruction dataset shown in the figure above, which contains the instructions. For example, you can use the [script](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples/generate_prompt_dataset.py) to sample `instinwild_en.json` or `instinwild_ch.json` from [InstructionWild](https://github.com/XueFuzhao/InstructionWild/tree/main/data#instructwild-data) and generate the prompt dataset.
- Pretrain dataset: the pretraining dataset, which contains instructions and their corresponding responses. For example, you can use the [InstructWild Data](https://github.com/XueFuzhao/InstructionWild/tree/main/data) from Stage 1 supervised instruction tuning.
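
The idea of sampling instructions into a prompt dataset can be sketched as follows. This is a minimal illustration, not the linked script: the function name `sample_prompts`, the record fields `instruction` and `output`, and the output layout are all assumptions, and the records are made-up stand-ins for the contents of a file such as `instinwild_en.json`.

```python
import json
import random
from typing import Dict, List


def sample_prompts(records: List[Dict], sample_size: int, seed: int = 0) -> List[Dict]:
    """Randomly pick `sample_size` records and keep only the instruction text."""
    rng = random.Random(seed)    # fixed seed so the sample is reproducible
    picked = rng.sample(records, sample_size)
    return [{"id": idx, "instruction": rec["instruction"]} for idx, rec in enumerate(picked)]


# Made-up stand-in for a loaded instruction/response dataset.
records = [
    {"instruction": "Summarize the plot of Hamlet.", "output": "(response text)"},
    {"instruction": "Write a haiku about autumn.", "output": "(response text)"},
    {"instruction": "Explain what a hash table is.", "output": "(response text)"},
]

prompts = sample_prompts(records, sample_size=2)
print(json.dumps(prompts, indent=2))
```

The responses are dropped on purpose: in this sketch the prompt dataset carries instructions only, while the pretrain dataset keeps both instruction and response.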

### Arg List
- --strategy:          the strategy used for training, choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- --model:             model type of actor, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- --pretrain:          pretrain model, type=str, default=None
- --rm_model:          reward model type, type=str, choices=['gpt2', 'bloom', 'opt', 'llama'], default=None
- --rm_pretrain:       pretrain model for reward model, type=str, default=None
- --rm_path:           the path of rm model, type=str, default=None
- --save_path:         path to save the model, type=str, default='output'
- --prompt_dataset:    path of the prompt dataset, type=str, default=None
- --pretrain_dataset:  path of the ptx dataset, type=str, default=None
- --need_optim_ckpt:   whether to save optim ckpt, type=bool, default=False
- --num_episodes:      num of episodes for training, type=int, default=10
- --max_epochs:        max epochs for training in one episode, type=int, default=5
- --max_timesteps:     max episodes in one batch, type=int, default=10
- --update_timesteps:  timesteps to update, type=int, default=10
- --train_batch_size:  batch size while training, type=int, default=8
- --ptx_batch_size:    batch size to compute ptx loss, type=int, default=1
- --experience_batch_size: batch size to make experience, type=int, default=8
- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
- --kl_coef:           kl_coef used for computing reward, type=float, default=0.1
- --ptx_coef:          ptx_coef used for computing policy loss, type=float, default=0.9
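
The roles of `--kl_coef` and `--ptx_coef` can be illustrated with a small numeric sketch. It follows a common RLHF formulation (a KL-penalized reward and a weighted mix of the PPO loss with a pretraining loss) and is an assumption about how these coefficients are typically used; the exact formulas in `train_prompts.py`, including the KL estimator, may differ. Both function names are hypothetical.

```python
from typing import List


def kl_penalized_reward(rm_score: float,
                        actor_logprobs: List[float],
                        ref_logprobs: List[float],
                        kl_coef: float = 0.1) -> float:
    """Reward-model score minus a KL penalty estimated from per-token log-probs."""
    # Simple estimator: mean log-ratio between the actor and the reference model.
    approx_kl = sum(a - r for a, r in zip(actor_logprobs, ref_logprobs)) / len(actor_logprobs)
    return rm_score - kl_coef * approx_kl


def mixed_policy_loss(ppo_loss: float, ptx_loss: float, ptx_coef: float = 0.9) -> float:
    """Weighted combination of the PPO policy loss and the pretraining (ptx) loss."""
    return ptx_coef * ptx_loss + (1.0 - ptx_coef) * ppo_loss


# Made-up numbers: the actor assigns higher log-probs than the reference model.
reward = kl_penalized_reward(rm_score=1.0,
                             actor_logprobs=[-1.0, -2.0],
                             ref_logprobs=[-1.5, -2.5])
loss = mixed_policy_loss(ppo_loss=0.2, ptx_loss=2.0)
print(reward, loss)
```

In this formulation a larger `kl_coef` penalizes drift from the reference model more strongly, and a larger `ptx_coef` gives more weight to the pretraining loss relative to the PPO loss.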

## Inference example - After Stage3
We support different inference options, including int8 and int4 quantization.
For details, see [`inference/`](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/inference).


## Attention
The examples are demos of the whole training process. You need to tune the hyper-parameters to reach good performance.

#### data
- [x] [rm-static](https://huggingface.co/datasets/Dahoas/rm-static)
- [x] [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- [ ] [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
- [ ] [openai/webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons)
- [ ] [Dahoas/instruct-synthetic-prompt-responses](https://huggingface.co/datasets/Dahoas/instruct-synthetic-prompt-responses)

## Support Model

### GPT
- [x]  GPT2-S (s)
- [x]  GPT2-M (m)
- [x]  GPT2-L (l)
- [x]  GPT2-XL (xl)
- [x]  GPT2-4B (4b)
- [ ]  GPT2-6B (6b)

### BLOOM
- [x] [BLOOM-560m](https://huggingface.co/bigscience/bloom-560m)
- [x] [BLOOM-1b1](https://huggingface.co/bigscience/bloom-1b1)
- [x] [BLOOM-3b](https://huggingface.co/bigscience/bloom-3b)
- [x] [BLOOM-7b](https://huggingface.co/bigscience/bloom-7b1)
- [ ] [BLOOM-175b](https://huggingface.co/bigscience/bloom)

### OPT
- [x] [OPT-125M](https://huggingface.co/facebook/opt-125m)
- [x] [OPT-350M](https://huggingface.co/facebook/opt-350m)
- [x] [OPT-1.3B](https://huggingface.co/facebook/opt-1.3b)
- [x] [OPT-2.7B](https://huggingface.co/facebook/opt-2.7b)
- [x] [OPT-6.7B](https://huggingface.co/facebook/opt-6.7b)
- [ ] [OPT-13B](https://huggingface.co/facebook/opt-13b)
- [ ] [OPT-30B](https://huggingface.co/facebook/opt-30b)

### [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)
- [x]  LLaMA-7B
- [x]  LLaMA-13B
- [ ]  LLaMA-33B
- [ ]  LLaMA-65B

## Add your own models

If you want to support your own model in Coati, please refer to the pull request for RoBERTa support as an example, [[chatgpt] add pre-trained model RoBERTa for RLHF stage 2 & 3](https://github.com/hpcaitech/ColossalAI/pull/3223), and submit a PR to us.

You should complete the implementation of four model classes: the Reward model, Critic model, LM model and Actor model.

Here is some example code for a new model named `Coati`.
If it is supported in Hugging Face [transformers](https://github.com/huggingface/transformers), you can load it with `from_pretrained`; otherwise, you can build the model yourself.

### Actor model
```python
from typing import Optional

from transformers.models.coati import CoatiModel    # placeholder import for the new model

from ..base import Actor


class CoatiActor(Actor):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()    # build your own model here if it is not supported in transformers
        if checkpoint:
            # Assumes a Hugging Face model; trades extra compute for lower memory use.
            model.gradient_checkpointing_enable()

        super().__init__(model, lora_rank, lora_train_bias)
```

### Reward model
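```python
from typing import Optional

import torch.nn as nn
from transformers.models.coati import CoatiModel    # placeholder import for the new model

from ..base import RewardModel


class CoatiRM(RewardModel):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()    # build your own model here if it is not supported in transformers
        if checkpoint:
            # Assumes a Hugging Face model; trades extra compute for lower memory use.
            model.gradient_checkpointing_enable()

        # A single linear layer maps the hidden state to a scalar score.
        # `n_embd` is the hidden size; the attribute name depends on the model config.
        value_head = nn.Linear(model.config.n_embd, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.n_embd + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
```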

### Critic model

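```python
from typing import Optional

import torch.nn as nn
from transformers.models.coati import CoatiModel    # placeholder import for the new model

from ..base import Critic


class CoatiCritic(Critic):

    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            model = build_model()    # build your own model here if it is not supported in transformers
        if checkpoint:
            # Assumes a Hugging Face model; trades extra compute for lower memory use.
            model.gradient_checkpointing_enable()

        # A single linear layer maps the hidden state to a scalar value estimate.
        # `n_embd` is the hidden size; the attribute name depends on the model config.
        value_head = nn.Linear(model.config.n_embd, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.n_embd + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
```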
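
The reward and critic examples pass `1 / (n_embd + 1)` as the `std` argument when initializing the value head. The dependency-free sketch below shows what that scale looks like for a hidden size of 1024; the hidden size, the helper name `init_value_head_weights` and the use of Python's `random` module instead of PyTorch are assumptions made for illustration.

```python
import random
import statistics
from typing import List, Tuple


def init_value_head_weights(d_model: int, seed: int = 0) -> Tuple[List[float], float]:
    """Draw d_model weights from a normal distribution with std = 1 / (d_model + 1)."""
    std = 1.0 / (d_model + 1)
    rng = random.Random(seed)    # fixed seed so the draw is reproducible
    weights = [rng.gauss(0.0, std) for _ in range(d_model)]
    return weights, std


weights, std = init_value_head_weights(d_model=1024)
print(f"target std    = {std:.6f}")
print(f"empirical std = {statistics.pstdev(weights):.6f}")
```

The target standard deviation shrinks as the hidden size grows, so the initial outputs of the value head stay close to zero.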
-												[Coati] first commit (#3283)


											
										
										
											2023-03-28 12:25:36 +00:00
+								# Examples
-												[coati] add costom model suppor tguide (#3579)


											
										
										
											2023-04-17 07:40:41 +00:00
+								## Table of Contents
 								- [Examples](#examples)
 								  - [Table of Contents](#table-of-contents)
 								  - [Install requirements](#install-requirements)
 								  - [Supervised datasets collection](#supervised-datasets-collection)
 								  - [Stage1 - Supervised instructs tuning](#stage1---supervised-instructs-tuning)
 								    - [Arg List](#arg-list)
 								  - [Stage2 - Training reward model](#stage2---training-reward-model)
 								    - [Features and tricks in RM training](#features-and-tricks-in-rm-training)
 								    - [Experiment result](#experiment-result)
 								    - [Arg List](#arg-list-1)
 								  - [Stage3 - Training model using prompts with RL](#stage3---training-model-using-prompts-with-rl)
 								    - [Arg List](#arg-list-2)
 								  - [Inference example - After Stage3](#inference-example---after-stage3)
 								  - [Attention](#attention)
 								      - [data](#data)
 								  - [Support Model](#support-model)
 								    - [GPT](#gpt)
 								    - [BLOOM](#bloom)
 								    - [OPT](#opt)
 								    - [LLaMA](#llama)
 								  - [Add your own models](#add-your-own-models)
 								    - [Actor model](#actor-model)
 								    - [Reward model](#reward-model)
 								    - [Critic model](#critic-model)
 								---
-												[Coati] first commit (#3283)


											
										
										
											2023-03-28 12:25:36 +00:00
+								## Install requirements
 								```shell
 								pip install -r requirements.txt
 								```
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
+								## Supervised datasets collection
-												[chat] polish code note typo (#3612)


											
										
										
											2023-04-20 09:22:15 +00:00
+								We collected 104K bilingual dataset of Chinese and English, and you can find the datasets in this repo
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
+								[InstructionWild](https://github.com/XueFuzhao/InstructionWild).
 								The following pic shows how we collected the data.
 								<p align="center">
 								<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/data-collect.png" width=500/>
 								</p>
 								## Stage1 - Supervised instructs tuning
 								Stage1 is supervised instructs fine-tuning, which uses the datasets mentioned earlier to fine-tune the model.
-												[chat] add performance and tutorial (#3786)


											
										
										
											2023-05-19 10:03:56 +00:00
+								[[Stage1 tutorial video]](https://www.youtube.com/watch?v=-qFBZFmOJfg)
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
 								You can run the `examples/train_sft.sh` to start a supervised instructs fine-tuning.
 								You can also use the following cmd to start a supervised instructs fine-tuning with your own settings.
-												[Coati] first commit (#3283)


											
										
										
											2023-03-28 12:25:36 +00:00
+								```
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
+								torchrun --standalone --nproc_per_node=4 train_sft.py \
 								    --pretrain "/path/to/LLaMa-7B/" \
 								    --model 'llama' \
 								    --strategy colossalai_zero2 \
 								    --log_interval 10 \
 								    --save_path  /path/to/Coati-7B \
 								    --dataset /path/to/data.json \
 								    --batch_size 4 \
-												[chat] typo accimulation_steps -> accumulation_steps (#3662)


											
										
										
											2023-04-28 07:42:57 +00:00
+								    --accumulation_steps 8 \
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
+								    --lr 2e-5 \
 								    --max_datasets_size 512 \
 								    --max_epochs 1 \
-												[chat] refactor model save/load logic (#3654)

* [chat] strategy refactor unwrap model

* [chat] strategy refactor save model

* [chat] add docstr

* [chat] refactor trainer save model

* [chat] fix strategy typing

* [chat] refactor trainer save model

* [chat] update readme

* [chat] fix unit test
											
										
										
											2023-04-27 10:41:49 +00:00
+								    --grad_checkpoint
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
+								```
 								### Arg List
-												[chat] set default zero2 strategy (#3667)

* [chat] set default gemini strategy

* [chat] set default zero2 strategy

* [chat] set default zero2 strategy
											
										
										
											2023-04-28 05:56:50 +00:00
+								- --strategy:          the strategy using for training, choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
+								- --model:             model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
 								- --pretrain:          pretrain model, type=str, default=None
 								- --max_datasets_size: the max size of dataset, type=int, default=None
 								- --save_path:         path to save the model, type=str, default='output'
 								- --need_optim_ckpt:   whether to save optim ckpt, type=bool, default=False
 								- --max_epochs:        max epochs for training, type=int, default=3
 								- --batch_size:        batch size while training, type=int, default=4
 								- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
 								- --log_interval:      how many steps to log, type=int, default=100
-												[chat] refactor model save/load logic (#3654)

* [chat] strategy refactor unwrap model

* [chat] strategy refactor save model

* [chat] add docstr

* [chat] refactor trainer save model

* [chat] fix strategy typing

* [chat] refactor trainer save model

* [chat] update readme

* [chat] fix unit test
											
										
										
											2023-04-27 10:41:49 +00:00
+								- --grad_checkpoint:   enable gradient checkpointing, type=bool, default=False
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
 								## Stage2 - Training reward model
 								We train a reward model in stage 2, which obtains corresponding scores by manually ranking different outputs for the same prompt and supervises the training of the reward model.
-												[chat] add performance and tutorial (#3786)


											
										
										
											2023-05-19 10:03:56 +00:00
+								[[Stage2 tutorial video]](https://www.youtube.com/watch?v=gMx2CApKhuo)
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
 								You can run the `examples/train_rm.sh` to start a reward model training.
-												[Coati] first commit (#3283)


											
										
										
											2023-03-28 12:25:36 +00:00
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
+								You can also use the following cmd to start training a reward model.
 								```
-												[chat]fix readme (#3429)

* fix stage 2

fix stage 2

* add torch
											
										
										
											2023-04-06 02:58:53 +00:00
+								torchrun --standalone --nproc_per_node=4 train_reward_model.py \
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
+								    --pretrain "/path/to/LLaMa-7B/" \
 								    --model 'llama' \
 								    --strategy colossalai_zero2 \
 								    --loss_fn 'log_exp'\
 								    --save_path 'rmstatic.pt' \
 								```
-												[Coati] first commit (#3283)


											
										
										
											2023-03-28 12:25:36 +00:00
+								### Features and tricks in RM training
 								- We support [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)and[rm-static](https://huggingface.co/datasets/Dahoas/rm-static) datasets.
 								- We support 2 kinds of loss_function named 'log_sig'(used by OpenAI) and 'log_exp'(used by Anthropic).
 								- We change the loss to valid_acc and pair_dist to monitor progress during training.
 								- We add special token to the end of the sequence to get better result.
 								- We use cosine-reducing lr-scheduler for RM training.
 								- We set value_head as 1 liner layer and initialize the weight of value_head using N(0，1/(d_model + 1)) distribution.
 								- We train a Bloom-560m reward model for 1 epoch and find the test acc of the model achieve the performance mentions in [Anthropics paper](https://arxiv.org/abs/2204.05862).
 								### Experiment result
 								Model performance in [Anthropics paper](https://arxiv.org/abs/2204.05862):
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
+								<div align=middle> <img width="512" alt="image" src="https://user-images.githubusercontent.com/70618399/225263321-8d64c3a8-6877-4cc8-9b61-0e1c52d3d94f.png">
-												[Coati] first commit (#3283)


											
										
										
											2023-03-28 12:25:36 +00:00
 								<div align=left>Our training & test result of bloom-560m for 1 epoch:
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
+								<div align=middle> <img width="512" alt="image" src="https://user-images.githubusercontent.com/70618399/225262950-a7f0a686-25de-44ec-98f2-11b83ea86674.png">
-												[Coati] first commit (#3283)


											
										
										
											2023-03-28 12:25:36 +00:00
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
+								<div align=left>We also train the reward model based on LLaMA-7B, which reaches the ACC of 72.06% after 1 epoch, performing almost the same as Anthropic's best RM.
-												[Coati] first commit (#3283)


											
										
										
											2023-03-28 12:25:36 +00:00
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
+								### Arg List
-												[chat] set default zero2 strategy (#3667)

* [chat] set default gemini strategy

* [chat] set default zero2 strategy

* [chat] set default zero2 strategy
											
										
										
											2023-04-28 05:56:50 +00:00
+								- --strategy:          the strategy using for training, choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
+								- --model:             model type, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
 								- --pretrain:          pretrain model, type=str, default=None
 								- --model_path:        the path of rm model(if continue to train), type=str, default=None
 								- --save_path:         path to save the model, type=str, default='output'
 								- --need_optim_ckpt:   whether to save optim ckpt, type=bool, default=False
 								- --max_epochs:        max epochs for training, type=int, default=3
 								- --dataset:           dataset name, type=str, choices=['Anthropic/hh-rlhf', 'Dahoas/rm-static']
 								- --subset:            subset of the dataset, type=str, default=None
 								- --batch_size:        batch size while training, type=int, default=4
 								- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
 								- --loss_func:         which kind of loss function, choices=['log_sig', 'log_exp']
 								- --max_len:           max sentence length for generation, type=int, default=512
-												[chat] polish code note typo (#3612)


											
										
										
											2023-04-20 09:22:15 +00:00
+								- --test:              whether is only testing, if it's true, the dataset will be small
-												[Coati] first commit (#3283)


											
										
										
											2023-03-28 12:25:36 +00:00
-												[format] applied code formatting on changed files in pull request 3296 (#3298)

Co-authored-by: github-actions <github-actions@github.com>
											
										
										
											2023-03-28 18:35:40 +00:00
+								## Stage3 - Training model using prompts with RL
-												[Coati] first commit (#3283)


											
										
										
											2023-03-28 12:25:36 +00:00
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
+								Stage3 uses reinforcement learning algorithm, which is the most complex part of the training process, as shown below:
-												[Coati] first commit (#3283)


											
										
										
											2023-03-28 12:25:36 +00:00
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
+								<p align="center">
 								<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/applications/chat/stage-3.jpeg" width=800/>
 								</p>
-												[Coati] first commit (#3283)


											
										
										
											2023-03-28 12:25:36 +00:00
-												[chat]Update Readme (#3296)

* Update README.md

* Update README.md

* Update README.md

* update example readme
											
										
										
											2023-03-28 18:32:17 +00:00
+								You can run the `examples/train_prompts.sh` to start PPO training.
You can also use the following command to start PPO training.
[[Stage3 tutorial video]](https://www.youtube.com/watch?v=Z8wwSHxPL9g)

```
torchrun --standalone --nproc_per_node=4 train_prompts.py \
         --pretrain "/path/to/LLaMa-7B/" \
         --model 'llama' \
         --strategy colossalai_zero2 \
         --prompt_dataset /path/to/your/prompt_dataset \
         --pretrain_dataset /path/to/your/pretrain_dataset \
         --rm_pretrain /your/pretrain/rm/definition \
         --rm_path /your/rm/model/path
```
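
If you prefer to launch the run from Python (for example to sweep over settings), the same command can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_ppo_command` is a hypothetical helper, and the paths are the same placeholders used in the command above.

```python
import shlex


def build_ppo_command(nproc_per_node, script_args):
    """Return the torchrun invocation as a list of argv tokens.

    Hypothetical helper: it only mirrors the flags shown in the README command.
    """
    cmd = [
        "torchrun",
        "--standalone",
        f"--nproc_per_node={nproc_per_node}",
        "train_prompts.py",
    ]
    for flag, value in script_args.items():
        cmd += [f"--{flag}", str(value)]
    return cmd


# Placeholder paths copied from the README command; replace them with your own.
script_args = {
    "pretrain": "/path/to/LLaMa-7B/",
    "model": "llama",
    "strategy": "colossalai_zero2",
    "prompt_dataset": "/path/to/your/prompt_dataset",
    "pretrain_dataset": "/path/to/your/pretrain_dataset",
    "rm_pretrain": "/your/pretrain/rm/definition",
    "rm_path": "/your/rm/model/path",
}

command = build_ppo_command(4, script_args)
print(shlex.join(command))  # shell-quoted form, handy for logging
# To actually launch: subprocess.run(command, check=True)
```

Passing the command as a list of tokens avoids shell-quoting problems when a path contains spaces.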

Prompt dataset: the instruction dataset shown in the figure above, which contains only the instructions. For example, you can use the [script](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/examples/generate_prompt_dataset.py), which samples `instinwild_en.json` or `instinwild_ch.json` from [InstructionWild](https://github.com/XueFuzhao/InstructionWild/tree/main/data#instructwild-data), to generate the prompt dataset.

Pretrain dataset: a dataset containing instructions and their corresponding responses. For example, you can use the [InstructWild Data](https://github.com/XueFuzhao/InstructionWild/tree/main/data) from Stage 1 supervised instruction tuning.
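
The sampling step for the prompt dataset can be sketched as follows. This is an illustration under assumptions, not the repository's `generate_prompt_dataset.py`: it assumes each record is a JSON object with an `instruction` field, and the output layout (`id` plus `instruction`) and the function name `sample_prompts` are hypothetical.

```python
import json
import random


def sample_prompts(records, num_samples, seed=42):
    """Randomly pick records and keep only their instruction text.

    Assumes each record is a dict with an "instruction" key (an assumption
    about the data layout, not a documented schema).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    picked = rng.sample(records, min(num_samples, len(records)))
    return [
        {"id": idx, "instruction": record["instruction"]}
        for idx, record in enumerate(picked)
    ]


# Tiny in-memory stand-in for a file such as instinwild_en.json.
records = [
    {"instruction": "Translate 'hello' into French.", "output": "Bonjour."},
    {"instruction": "Name three primary colors.", "output": "Red, yellow, blue."},
    {"instruction": "Summarize the plot of Hamlet.", "output": "..."},
]

prompts = sample_prompts(records, num_samples=2)
print(json.dumps(prompts, indent=2))
```

With a real file, load the records with `json.load` first and write the result back out with `json.dump`.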

### Arg List

- --strategy:          the strategy used for training, choices=['naive', 'ddp', 'colossalai_gemini', 'colossalai_zero2'], default='colossalai_zero2'
- --model:             model type of actor, choices=['gpt2', 'bloom', 'opt', 'llama'], default='bloom'
- --pretrain:          pretrain model, type=str, default=None
- --rm_model:          reward model type, type=str, choices=['gpt2', 'bloom', 'opt', 'llama'], default=None
- --rm_pretrain:       pretrain model for reward model, type=str, default=None
- --rm_path:           the path of rm model, type=str, default=None
- --save_path:         path to save the model, type=str, default='output'
- --prompt_dataset:    path of the prompt dataset, type=str, default=None
- --pretrain_dataset:  path of the ptx dataset, type=str, default=None
- --need_optim_ckpt:   whether to save optim ckpt, type=bool, default=False
- --num_episodes:      num of episodes for training, type=int, default=10
- --max_epochs:        max epochs for training in one episode, type=int, default=5
- --max_timesteps:     max episodes in one batch, type=int, default=10
- --update_timesteps:  timesteps to update, type=int, default=10
- --train_batch_size:  batch size while training, type=int, default=8
- --ptx_batch_size:    batch size to compute ptx loss, type=int, default=1
- --experience_batch_size: batch size to make experience, type=int, default=8
- --lora_rank:         low-rank adaptation matrices rank, type=int, default=0
- --kl_coef:           kl_coef used for computing reward, type=float, default=0.1
- --ptx_coef:          ptx_coef used for computing policy loss, type=float, default=0.9

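
For reference, the argument list above maps onto an `argparse` parser roughly as follows. This is a sketch reconstructed from the list, not a copy of `train_prompts.py`; the function name `build_parser` is hypothetical, and the actual script may declare its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented Stage 3 arguments."""
    model_choices = ["gpt2", "bloom", "opt", "llama"]
    parser = argparse.ArgumentParser(description="Sketch of the Stage 3 arguments")
    parser.add_argument(
        "--strategy",
        choices=["naive", "ddp", "colossalai_gemini", "colossalai_zero2"],
        default="colossalai_zero2",
    )
    parser.add_argument("--model", choices=model_choices, default="bloom")
    parser.add_argument("--pretrain", type=str, default=None)
    parser.add_argument("--rm_model", type=str, choices=model_choices, default=None)
    parser.add_argument("--rm_pretrain", type=str, default=None)
    parser.add_argument("--rm_path", type=str, default=None)
    parser.add_argument("--save_path", type=str, default="output")
    parser.add_argument("--prompt_dataset", type=str, default=None)
    parser.add_argument("--pretrain_dataset", type=str, default=None)
    # With type=bool, argparse calls bool() on the raw string, so any
    # non-empty value (including "False") is parsed as True.
    parser.add_argument("--need_optim_ckpt", type=bool, default=False)
    parser.add_argument("--num_episodes", type=int, default=10)
    parser.add_argument("--max_epochs", type=int, default=5)
    parser.add_argument("--max_timesteps", type=int, default=10)
    parser.add_argument("--update_timesteps", type=int, default=10)
    parser.add_argument("--train_batch_size", type=int, default=8)
    parser.add_argument("--ptx_batch_size", type=int, default=1)
    parser.add_argument("--experience_batch_size", type=int, default=8)
    parser.add_argument("--lora_rank", type=int, default=0)
    parser.add_argument("--kl_coef", type=float, default=0.1)
    parser.add_argument("--ptx_coef", type=float, default=0.9)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--model", "llama", "--kl_coef", "0.2"])
print(args.model, args.strategy, args.kl_coef, args.ptx_coef)
```
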
## Inference example - After Stage3

We support different inference options, including int8 and int4 quantization.
For details, see [`inference/`](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat/inference).

## Attention

The examples are demos of the whole training process. You need to tune the hyper-parameters to reach good performance.

#### data

- [x] [rm-static](https://huggingface.co/datasets/Dahoas/rm-static)
- [x] [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- [ ] [openai/summarize_from_feedback](https://huggingface.co/datasets/openai/summarize_from_feedback)
- [ ] [openai/webgpt_comparisons](https://huggingface.co/datasets/openai/webgpt_comparisons)
- [ ] [Dahoas/instruct-synthetic-prompt-responses](https://huggingface.co/datasets/Dahoas/instruct-synthetic-prompt-responses)

## Support Model

### GPT

- [x]  GPT2-S (s)
- [x]  GPT2-M (m)
- [x]  GPT2-L (l)
- [x]  GPT2-XL (xl)
- [x]  GPT2-4B (4b)
- [ ]  GPT2-6B (6b)

### BLOOM

- [x] [BLOOM-560m](https://huggingface.co/bigscience/bloom-560m)
- [x] [BLOOM-1b1](https://huggingface.co/bigscience/bloom-1b1)
- [x] [BLOOM-3b](https://huggingface.co/bigscience/bloom-3b)
- [x] [BLOOM-7b](https://huggingface.co/bigscience/bloom-7b1)
- [ ] [BLOOM-175b](https://huggingface.co/bigscience/bloom)

### OPT

- [x] [OPT-125M](https://huggingface.co/facebook/opt-125m)
- [x] [OPT-350M](https://huggingface.co/facebook/opt-350m)
- [x] [OPT-1.3B](https://huggingface.co/facebook/opt-1.3b)
- [x] [OPT-2.7B](https://huggingface.co/facebook/opt-2.7b)
- [x] [OPT-6.7B](https://huggingface.co/facebook/opt-6.7b)
- [ ] [OPT-13B](https://huggingface.co/facebook/opt-13b)
- [ ] [OPT-30B](https://huggingface.co/facebook/opt-30b)

### [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md)

- [x]  LLaMA-7B
- [x]  LLaMA-13B
- [ ]  LLaMA-33B
- [ ]  LLaMA-65B

## Add your own models

If you want to support your own model in Coati, please refer to the pull request for RoBERTa support as an example: [[chatgpt] add pre-trained model RoBERTa for RLHF stage 2 & 3](https://github.com/hpcaitech/ColossalAI/pull/3223), and submit a PR to us.

You should complete the implementation of four model classes: the Reward model, Critic model, LM model, and Actor model.

Here is some example code for a new model named `Coati`.
If the model is supported in huggingface [transformers](https://github.com/huggingface/transformers), you can load it with `from_pretrained`; otherwise, you can build the model yourself.

### Actor model

```
from typing import Optional

from ..base import Actor
from transformers.models.coati import CoatiModel

class CoatiActor(Actor):
    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            # build_model() is a placeholder: load your own model if it is not supported in transformers
            model = build_model()
        super().__init__(model, lora_rank, lora_train_bias)
```

### Reward model

```
from typing import Optional

import torch.nn as nn

from ..base import RewardModel
from transformers.models.coati import CoatiModel

class CoatiRM(RewardModel):
    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            # build_model() is a placeholder: load your own model if it is not supported in transformers
            model = build_model()
        # Linear head mapping each hidden state to a scalar reward
        value_head = nn.Linear(model.config.n_embd, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.n_embd + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
```

### Critic model

```
from typing import Optional

import torch.nn as nn

from ..base import Critic
from transformers.models.coati import CoatiModel

class CoatiCritic(Critic):
    def __init__(self,
                 pretrained: Optional[str] = None,
                 checkpoint: bool = False,
                 lora_rank: int = 0,
                 lora_train_bias: str = 'none') -> None:
        if pretrained is not None:
            model = CoatiModel.from_pretrained(pretrained)
        else:
            # build_model() is a placeholder: load your own model if it is not supported in transformers
            model = build_model()
        # Linear head mapping each hidden state to a scalar value estimate
        value_head = nn.Linear(model.config.n_embd, 1)
        value_head.weight.data.normal_(mean=0.0, std=1 / (model.config.n_embd + 1))
        super().__init__(model, value_head, lora_rank, lora_train_bias)
```
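
The reward and critic examples both attach a linear value head that maps a hidden vector of size `n_embd` to one scalar, with weights drawn from a normal distribution with standard deviation `1 / (n_embd + 1)`. The arithmetic can be illustrated without any dependencies. This sketch uses plain Python lists in place of torch tensors; the names `init_value_head` and `apply_value_head` are hypothetical, and the zero bias is a simplification (the examples above leave the bias of `nn.Linear` at its default initialization).

```python
import random


def init_value_head(n_embd, seed=0):
    """Draw head weights from N(0, std^2) with std = 1 / (n_embd + 1)."""
    rng = random.Random(seed)
    std = 1 / (n_embd + 1)
    weight = [rng.gauss(0.0, std) for _ in range(n_embd)]
    bias = 0.0  # simplification for this sketch only
    return weight, bias


def apply_value_head(hidden, weight, bias):
    """Compute the scalar output of a linear head: dot(hidden, weight) + bias."""
    if len(hidden) != len(weight):
        raise ValueError("hidden state and weight must have the same length")
    return sum(h * w for h, w in zip(hidden, weight)) + bias


weight, bias = init_value_head(n_embd=8)
hidden_state = [0.5] * 8  # stand-in for one token's hidden state
print(apply_value_head(hidden_state, weight, bias))
```

The small standard deviation keeps the initial scalar outputs close to zero.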