ColossalAI/applications/ChatGPT/examples
  • README.md - [chatgpt] Reward Model Training Process update (#3133), 2023-03-20
  • inference.py - change nn to models (#3032), 2023-03-07
  • requirements.txt - [app] add chatgpt application (#2698), 2023-02-14
  • test_ci.sh - [chatgpt] Reward Model Training Process update (#3133), 2023-03-20
  • train_dummy.py - [chatgpt] update ci (#3087), 2023-03-14
  • train_dummy.sh - [chatgpt] support colossalai strategy to train rm (#2742), 2023-02-16
  • train_prompts.py - [chatgpt] fix ppo training hanging problem with gemini (#3162), 2023-03-17
  • train_prompts.sh - [chatgpt] support colossalai strategy to train rm (#2742), 2023-02-16
  • train_reward_model.py - [chatgpt] Reward Model Training Process update (#3133), 2023-03-20
  • train_rm.sh - [chatgpt] Reward Model Training Process update (#3133), 2023-03-20

README.md

Examples

Install requirements

pip install -r requirements.txt

Train the reward model (Stage 2)

Use the following commands to train your reward model.

# Take naive reward model training with opt-350m as an example
python train_reward_model.py --pretrain "facebook/opt-350m" --model 'opt' --strategy naive
# Use the colossalai_zero2 strategy on 2 GPUs
torchrun --standalone --nproc_per_node=2 train_reward_model.py --pretrain "facebook/opt-350m" --model 'opt' --strategy colossalai_zero2
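
Reward model training consumes pairwise preference data, i.e. a preferred ("chosen") and a rejected response for the same prompt. Below is a minimal sketch of inspecting the Anthropic/hh-rlhf dataset; it assumes the Hugging Face datasets library and is only an illustration, not part of train_reward_model.py.

# Hypothetical sketch: peek at the preference data used for RM training
from datasets import load_dataset

data = load_dataset('Anthropic/hh-rlhf')
sample = data['train'][0]
print(sample['chosen'][:200])    # the preferred (higher-reward) conversation
print(sample['rejected'][:200])  # the rejected (lower-reward) conversation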

Features and tricks in RM training

  • We support the Anthropic/hh-rlhf and rm-static datasets.
  • We support two kinds of loss functions: 'log_sig' (used by OpenAI) and 'log_exp' (used by Anthropic); see the sketch after this list.
  • We report valid_acc and pair_dist instead of the raw loss to monitor training progress.
  • We add a special token to the end of each sequence to get better results.
  • We use a cosine-decay learning-rate scheduler for RM training.
  • We set value_head to a single linear layer and initialize its weights from the N(0, 1/(d_model + 1)) distribution.
  • We trained a Bloom-560m reward model for 1 epoch and found that its test accuracy matches the performance reported in Anthropic's paper.
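
The two loss functions and the monitored metrics can be summarized in a short sketch. This is only an illustration of the ideas above; the function and variable names are assumptions, not the repository's exact code.

# Hypothetical sketch of the two pairwise ranking losses and the monitored metrics
import torch
import torch.nn.functional as F

def log_sig_loss(chosen_reward, reject_reward):
    # OpenAI-style: -log(sigmoid(r_chosen - r_rejected))
    return -F.logsigmoid(chosen_reward - reject_reward).mean()

def log_exp_loss(chosen_reward, reject_reward):
    # Anthropic-style: log(1 + exp(r_rejected - r_chosen)); equivalent to log_sig_loss
    return torch.log(1 + torch.exp(reject_reward - chosen_reward)).mean()

def rm_metrics(chosen_reward, reject_reward):
    # valid_acc: how often the chosen response outscores the rejected one
    # pair_dist: average reward gap between chosen and rejected responses
    acc = (chosen_reward > reject_reward).float().mean()
    dist = (chosen_reward - reject_reward).mean()
    return acc, dist

# value_head as a single linear layer, weights drawn from N(0, 1/(d_model + 1))
d_model = 1024  # assumed hidden size, for illustration only
value_head = torch.nn.Linear(d_model, 1)
torch.nn.init.normal_(value_head.weight, std=1 / (d_model + 1))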

Experiment results

Model performance reported in Anthropic's paper:

(figure: reward model accuracy from Anthropic's paper)

Our training & test results for bloom-560m after 1 epoch:

(figure: bloom-560m training & test accuracy after 1 epoch)

Train with dummy prompt data (Stage 3)

This script supports 4 kinds of strategies (a sketch of how they are selected follows the list):

  • naive
  • ddp
  • colossalai_zero2
  • colossalai_gemini
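
A rough sketch of how the --strategy flag maps onto strategy objects is shown below; the class names come from the chatgpt.trainer.strategies module, but treat the constructor arguments as assumptions and check train_dummy.py for the exact values.

# Hypothetical sketch: map the --strategy argument to a strategy object
from chatgpt.trainer.strategies import ColossalAIStrategy, DDPStrategy, NaiveStrategy

def build_strategy(name: str):
    if name == 'naive':
        return NaiveStrategy()              # single process, single GPU
    if name == 'ddp':
        return DDPStrategy()                # PyTorch DistributedDataParallel
    if name == 'colossalai_zero2':
        return ColossalAIStrategy(stage=2)  # ZeRO stage 2: shard optimizer states and gradients
    if name == 'colossalai_gemini':
        return ColossalAIStrategy(stage=3, placement_policy='cuda')  # ZeRO stage 3 + Gemini
    raise ValueError(f'Unsupported strategy: {name}')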

It uses randomly generated prompt data.
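
A minimal sketch of what such dummy data can look like (illustrative only; the tokenizer name and shapes are assumptions):

# Hypothetical sketch: random token ids decoded into placeholder prompts
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
random_ids = torch.randint(tokenizer.vocab_size, (16, 64))  # 16 prompts, 64 tokens each
dummy_prompts = tokenizer.batch_decode(random_ids, skip_special_tokens=True)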

The naive strategy only supports single-GPU training:

python train_dummy.py --strategy naive
# display cli help
python train_dummy.py -h

The DDP and ColossalAI strategies support multi-GPU training:

# run DDP on 2 GPUs
torchrun --standalone --nproc_per_node=2 train_dummy.py --strategy ddp
# run ColossalAI on 2 GPUs
torchrun --standalone --nproc_per_node=2 train_dummy.py --strategy colossalai_zero2

Train with real prompt data (Stage 3)

We use awesome-chatgpt-prompts as the example dataset. It is a small dataset containing a few hundred prompts.

You should download prompts.csv first.
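
Once downloaded, the file can be inspected with a couple of lines. This is just a sketch assuming pandas and a 'prompt' column (which is what awesome-chatgpt-prompts uses), not part of train_prompts.py.

# Hypothetical sketch: peek at the prompt data
import pandas as pd

prompts = pd.read_csv('prompts.csv')['prompt'].tolist()
print(f'{len(prompts)} prompts loaded, e.g.: {prompts[0][:80]}')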

This script also supports 4 strategies.

# display cli help
python train_prompts.py -h
# run naive on 1 GPU
python train_prompts.py prompts.csv --strategy naive
# run DDP on 2 GPUs
torchrun --standalone --nproc_per_node=2 train_prompts.py prompts.csv --strategy ddp
# run ColossalAI on 2 GPUs
torchrun --standalone --nproc_per_node=2 train_prompts.py prompts.csv --strategy colossalai_zero2

Inference example (after Stage 3)

We support a naive inference demo after training.

# inference, using pretrain path to configure model
python inference.py --model_path <your actor model path> --model <your model type> --pretrain <your pretrain model name/path>
# example
python inference.py --model_path ./actor_checkpoint_prompts.pt --pretrain bigscience/bloom-560m --model bloom
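
Conceptually, the demo loads the pretrained backbone, restores the fine-tuned actor weights, and calls generate. The sketch below illustrates that idea with Hugging Face Transformers; it assumes the checkpoint holds weights compatible with the backbone, and inference.py itself may load the model differently.

# Hypothetical sketch of naive inference; not inference.py itself
import torch
from transformers import AutoTokenizer, BloomForCausalLM

tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-560m')
model = BloomForCausalLM.from_pretrained('bigscience/bloom-560m')
state_dict = torch.load('./actor_checkpoint_prompts.pt', map_location='cpu')
model.load_state_dict(state_dict, strict=False)  # restore actor weights from Stage 3

inputs = tokenizer('Explain reinforcement learning in one sentence.', return_tensors='pt')
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))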

Attention

These examples are just demos for testing our progress on RM and PPO training.


Supported Models

GPT

  • GPT2-S (s)
  • GPT2-M (m)
  • GPT2-L (l)
  • GPT2-XL (xl)
  • GPT2-4B (4b)
  • GPT2-6B (6b)
  • GPT2-8B (8b)
  • GPT2-10B (10b)
  • GPT2-12B (12b)
  • GPT2-15B (15b)
  • GPT2-18B (18b)
  • GPT2-20B (20b)
  • GPT2-24B (24b)
  • GPT2-28B (28b)
  • GPT2-32B (32b)
  • GPT2-36B (36b)
  • GPT2-40B (40b)
  • GPT3 (175b)

BLOOM

OPT