ColossalAI/applications/ChatGPT/examples
  • README.md - [chatgpt] Reward Model Training Process update (#3133), 2023-03-20
  • inference.py - change nn to models (#3032), 2023-03-07
  • requirements.txt - [app] add chatgpt application (#2698), 2023-02-14
  • test_ci.sh - [chatgpt] Reward Model Training Process update (#3133), 2023-03-20
  • train_dummy.py - [chatgpt] update ci (#3087), 2023-03-14
  • train_dummy.sh - [chatgpt] support colossalai strategy to train rm (#2742), 2023-02-16
  • train_prompts.py - [chatgpt] fix ppo training hanging problem with gemini (#3162), 2023-03-17
  • train_prompts.sh - [chatgpt] support colossalai strategy to train rm (#2742), 2023-02-16
  • train_reward_model.py - [chatgpt] Reward Model Training Process update (#3133), 2023-03-20
  • train_rm.sh - [chatgpt] Reward Model Training Process update (#3133), 2023-03-20

README.md

Examples

Install requirements

pip install -r requirements.txt

Train the reward model (Stage 2)

Use the following commands to train your reward model.

# Take naive reward model training with opt-350m as an example
python train_reward_model.py --pretrain "facebook/opt-350m" --model 'opt' --strategy naive
# Use the colossalai_zero2 strategy on 2 GPUs
torchrun --standalone --nproc_per_node=2 train_reward_model.py --pretrain "facebook/opt-350m" --model 'opt' --strategy colossalai_zero2
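
Reward model training consumes pairwise preference data, i.e. a preferred ("chosen") and a rejected response for the same prompt. Below is a minimal sketch of inspecting the Anthropic/hh-rlhf dataset; it assumes the Hugging Face datasets library and is only an illustration, not part of train_reward_model.py.

# Hypothetical sketch: peek at the preference data used for RM training
from datasets import load_dataset

data = load_dataset('Anthropic/hh-rlhf')
sample = data['train'][0]
print(sample['chosen'][:200])    # the preferred (higher-reward) conversation
print(sample['rejected'][:200])  # the rejected (lower-reward) conversation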

Features and tricks in RM training

  • We support the Anthropic/hh-rlhf and rm-static datasets.
  • We support two kinds of loss functions: 'log_sig' (used by OpenAI) and 'log_exp' (used by Anthropic); see the sketch after this list.
  • We report valid_acc and pair_dist instead of the raw loss to monitor training progress.
  • We add a special token to the end of each sequence to get better results.
  • We use a cosine-decay learning-rate scheduler for RM training.
  • We set value_head to a single linear layer and initialize its weights from the N(0, 1/(d_model + 1)) distribution.
  • We trained a Bloom-560m reward model for 1 epoch and found that its test accuracy matches the performance reported in Anthropic's paper.
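
The two loss functions and the monitored metrics can be summarized in a short sketch. This is only an illustration of the ideas above; the function and variable names are assumptions, not the repository's exact code.

# Hypothetical sketch of the two pairwise ranking losses and the monitored metrics
import torch
import torch.nn.functional as F

def log_sig_loss(chosen_reward, reject_reward):
    # OpenAI-style: -log(sigmoid(r_chosen - r_rejected))
    return -F.logsigmoid(chosen_reward - reject_reward).mean()

def log_exp_loss(chosen_reward, reject_reward):
    # Anthropic-style: log(1 + exp(r_rejected - r_chosen)); equivalent to log_sig_loss
    return torch.log(1 + torch.exp(reject_reward - chosen_reward)).mean()

def rm_metrics(chosen_reward, reject_reward):
    # valid_acc: how often the chosen response outscores the rejected one
    # pair_dist: average reward gap between chosen and rejected responses
    acc = (chosen_reward > reject_reward).float().mean()
    dist = (chosen_reward - reject_reward).mean()
    return acc, dist

# value_head as a single linear layer, weights drawn from N(0, 1/(d_model + 1))
d_model = 1024  # assumed hidden size, for illustration only
value_head = torch.nn.Linear(d_model, 1)
torch.nn.init.normal_(value_head.weight, std=1 / (d_model + 1))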

Experiment results

Model performance reported in Anthropic's paper:

(figure: reward model accuracy from Anthropic's paper)

Our training & test results for bloom-560m after 1 epoch:

(figure: bloom-560m training & test accuracy after 1 epoch)

Train with dummy prompt data (Stage 3)

This script supports 4 kinds of strategies (a sketch of how they are selected follows the list):

  • naive
  • ddp
  • colossalai_zero2
  • colossalai_gemini
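
A rough sketch of how the --strategy flag maps onto strategy objects is shown below; the class names come from the chatgpt.trainer.strategies module, but treat the constructor arguments as assumptions and check train_dummy.py for the exact values.

# Hypothetical sketch: map the --strategy argument to a strategy object
from chatgpt.trainer.strategies import ColossalAIStrategy, DDPStrategy, NaiveStrategy

def build_strategy(name: str):
    if name == 'naive':
        return NaiveStrategy()              # single process, single GPU
    if name == 'ddp':
        return DDPStrategy()                # PyTorch DistributedDataParallel
    if name == 'colossalai_zero2':
        return ColossalAIStrategy(stage=2)  # ZeRO stage 2: shard optimizer states and gradients
    if name == 'colossalai_gemini':
        return ColossalAIStrategy(stage=3, placement_policy='cuda')  # ZeRO stage 3 + Gemini
    raise ValueError(f'Unsupported strategy: {name}')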

It uses randomly generated prompt data.
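
A minimal sketch of what such dummy data can look like (illustrative only; the tokenizer name and shapes are assumptions):

# Hypothetical sketch: random token ids decoded into placeholder prompts
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
random_ids = torch.randint(tokenizer.vocab_size, (16, 64))  # 16 prompts, 64 tokens each
dummy_prompts = tokenizer.batch_decode(random_ids, skip_special_tokens=True)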

The naive strategy only supports single-GPU training:

python train_dummy.py --strategy naive
# display cli help
python train_dummy.py -h

The DDP and ColossalAI strategies support multi-GPU training:

# run DDP on 2 GPUs
torchrun --standalone --nproc_per_node=2 train_dummy.py --strategy ddp
# run ColossalAI on 2 GPUs
torchrun --standalone --nproc_per_node=2 train_dummy.py --strategy colossalai_zero2

Train with real prompt data (Stage 3)

We use awesome-chatgpt-prompts as the example dataset. It is a small dataset containing a few hundred prompts.

You should download prompts.csv first.
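
Once downloaded, the file can be inspected with a couple of lines. This is just a sketch assuming pandas and a 'prompt' column (which is what awesome-chatgpt-prompts uses), not part of train_prompts.py.

# Hypothetical sketch: peek at the prompt data
import pandas as pd

prompts = pd.read_csv('prompts.csv')['prompt'].tolist()
print(f'{len(prompts)} prompts loaded, e.g.: {prompts[0][:80]}')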

This script also supports 4 strategies.

# display cli help
python train_prompts.py -h
# run naive on 1 GPU
python train_prompts.py prompts.csv --strategy naive
# run DDP on 2 GPUs
torchrun --standalone --nproc_per_node=2 train_prompts.py prompts.csv --strategy ddp
# run ColossalAI on 2 GPUs
torchrun --standalone --nproc_per_node=2 train_prompts.py prompts.csv --strategy colossalai_zero2

Inference example (after Stage 3)

We support a naive inference demo after training.

# inference, using pretrain path to configure model
python inference.py --model_path <your actor model path> --model <your model type> --pretrain <your pretrain model name/path>
# example
python inference.py --model_path ./actor_checkpoint_prompts.pt --pretrain bigscience/bloom-560m --model bloom
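
Conceptually, the demo loads the pretrained backbone, restores the fine-tuned actor weights, and calls generate. The sketch below illustrates that idea with Hugging Face Transformers; it assumes the checkpoint holds weights compatible with the backbone, and inference.py itself may load the model differently.

# Hypothetical sketch of naive inference; not inference.py itself
import torch
from transformers import AutoTokenizer, BloomForCausalLM

tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-560m')
model = BloomForCausalLM.from_pretrained('bigscience/bloom-560m')
state_dict = torch.load('./actor_checkpoint_prompts.pt', map_location='cpu')
model.load_state_dict(state_dict, strict=False)  # restore actor weights from Stage 3

inputs = tokenizer('Explain reinforcement learning in one sentence.', return_tensors='pt')
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))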

Attention

These examples are just demos for testing our progress on RM and PPO training.


Supported Models

GPT

  • GPT2-S (s)
  • GPT2-M (m)
  • GPT2-L (l)
  • GPT2-XL (xl)
  • GPT2-4B (4b)
  • GPT2-6B (6b)
  • GPT2-8B (8b)
  • GPT2-10B (10b)
  • GPT2-12B (12b)
  • GPT2-15B (15b)
  • GPT2-18B (18b)
  • GPT2-20B (20b)
  • GPT2-24B (24b)
  • GPT2-28B (28b)
  • GPT2-32B (32b)
  • GPT2-36B (36b)
  • GPT2-40B (40b)
  • GPT3 (175b)

BLOOM

OPT