docs: internlm2-reward readme (#767)

new_model_release
RangiLyu 2024-07-19 17:53:02 +08:00 committed by GitHub
parent 7af5da56b9
commit feef5023b3
3 changed files with 183 additions and 0 deletions


@@ -46,6 +46,8 @@ InternLM2.5 series are released with the following features:
## News
\[2024.07.19\] We release the InternLM2-Reward series of reward models in 1.8B, 7B and 20B sizes. See [model zoo below](#model-zoo) for download or [model cards](./model_cards/internlm2_reward.md) for more details.
\[2024.07.03\] We release InternLM2.5-7B, InternLM2.5-7B-Chat and InternLM2.5-7B-Chat-1M. See [model zoo below](#model-zoo) for download or [model cards](./model_cards/) for more details.
\[2024.03.26\] We release the InternLM2 technical report. See [arXiv](https://arxiv.org/abs/2403.17297) for details.
@@ -82,6 +84,16 @@ The release of InternLM2.5 series contains 7B model size for now and we are goin
**Supplements:** `HF` refers to the format used by HuggingFace in [transformers](https://github.com/huggingface/transformers), whereas `Origin` denotes the format adopted by the InternLM team in [InternEvo](https://github.com/InternLM/InternEvo).
### InternLM2-Reward
InternLM2-Reward is a series of reward models, trained on 2.4 million preference samples, available in 1.8B, 7B, and 20B sizes. These models were applied to the PPO training process of our chat models. See [model cards](./model_cards/internlm2_reward.md) for more details.
| Model | RewardBench Score | Transformers(HF) | ModelScope(HF) | OpenXLab(HF) | Release Date |
| ------------------------- | ----------------- | -------------------------------------------------- | ------------------------------------------------- | ----------------------------------------------- | ------------ |
| **InternLM2-1.8B-Reward** | 80.6 | [🤗internlm2-1_8b-reward](https://huggingface.co/internlm/internlm2-1_8b-reward) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-1_8b-reward](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-1_8b-reward/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-1_8b-reward) | 2024-07-19 |
| **InternLM2-7B-Reward** | 86.6 | [🤗internlm2-7b-reward](https://huggingface.co/internlm/internlm2-7b-reward) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-7b-reward](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-7b-reward/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-7b-reward) | 2024-07-19 |
| **InternLM2-20B-Reward** | 89.5 | [🤗internlm2-20b-reward](https://huggingface.co/internlm/internlm2-20b-reward) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-20b-reward](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-20b-reward/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-20b-reward) | 2024-07-19 |
### InternLM2
<details>


@@ -44,6 +44,8 @@ The InternLM2.5 series models are officially released in this repository, with the following features:
## News
\[2024.07.19\] We release the InternLM2-Reward series of reward models in 1.8B, 7B and 20B sizes. See the [model zoo below](#model-zoo) for download, or [model cards](./model_cards/internlm2_reward.md) for more details.
\[2024.06.30\] We release InternLM2.5-7B, InternLM2.5-7B-Chat and InternLM2.5-7B-Chat-1M. See the [model zoo below](#model-zoo) for download, or [model cards](./model_cards/) for more details.
\[2024.03.26\] We release the InternLM2 technical report. See [arXiv](https://arxiv.org/abs/2403.17297) for more details.
@@ -80,6 +82,16 @@ The InternLM2.5 series models are officially released in this repository, with the following features:
**Supplements:** `HF` in the table above refers to models in the HuggingFace [transformers](https://github.com/huggingface/transformers) format; `Origin` refers to models in the InternLM team's [InternEvo](https://github.com/InternLM/InternEvo) format.
### InternLM2-Reward
InternLM2-Reward is a series of reward models trained on 2.4 million preference samples, available in 1.8B, 7B, and 20B sizes. These models were used in the PPO training process of the InternLM chat models. See [model cards](./model_cards/internlm2_reward.md) for more details.
| Model | RewardBench Score | Transformers(HF) | ModelScope(HF) | OpenXLab(HF) | Release Date |
| ------------------------- | ----------------- | -------------------------------------------------- | ------------------------------------------------- | ----------------------------------------------- | ------------ |
| **InternLM2-1.8B-Reward** | 80.6 | [🤗internlm2-1_8b-reward](https://huggingface.co/internlm/internlm2-1_8b-reward) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-1_8b-reward](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-1_8b-reward/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-1_8b-reward) | 2024-07-19 |
| **InternLM2-7B-Reward** | 86.6 | [🤗internlm2-7b-reward](https://huggingface.co/internlm/internlm2-7b-reward) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-7b-reward](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-7b-reward/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-7b-reward) | 2024-07-19 |
| **InternLM2-20B-Reward** | 89.5 | [🤗internlm2-20b-reward](https://huggingface.co/internlm/internlm2-20b-reward) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-20b-reward](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-20b-reward/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-20b-reward) | 2024-07-19 |
### InternLM2
<details>


@@ -0,0 +1,159 @@
# InternLM2-Reward Model Card
## Introduction
**InternLM2-Reward** is a series of reward models trained on the foundation of InternLM2-Chat-SFT. These models have been trained on over 2.4 million preference samples, both human-annotated and AI-synthesized, achieving outstanding performance while ensuring a balance between helpfulness and harmlessness.
### Key Features:
- **Variety of Sizes Available**: Our open-sourced reward models are available in sizes of **1.8B, 7B, and 20B**, each demonstrating exceptional performance across various metrics. We aim for these different-sized models to facilitate research on the scaling laws of reward models, providing valuable insights to the community.
- **Comprehensive Coverage of Preference**: Trained on **2.4 million** preference pairs derived from both human annotations and AI synthesis, covering diverse areas such as dialogue, writing, poetry, summarization, coding, and mathematics. The data also maintains a balance between helpfulness and harmlessness.
- **Multilingual Support**: InternLM2-Reward was trained on high-quality **English and Chinese** preference data, delivering robust performance in both languages.
These models were applied to the RLHF training process of InternLM2-Chat. The reward model training techniques from the [InternLM2 Technical Report](https://arxiv.org/abs/2403.17297) have been open-sourced in XTuner; try them out [here](https://github.com/InternLM/xtuner)!
## Download links
| Model | RewardBench Score | Transformers(HF) | ModelScope(HF) | OpenXLab(HF) | Release Date |
| ------------------------- | ----------------- | -------------------------------------------------- | ------------------------------------------------- | ----------------------------------------------- | ------------ |
| **InternLM2-1.8B-Reward** | 80.6 | [🤗internlm2-1_8b-reward](https://huggingface.co/internlm/internlm2-1_8b-reward) | [<img src="../assets/modelscope_logo.png" width="20px" /> internlm2-1_8b-reward](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-1_8b-reward/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-1_8b-reward) | 2024-07-19 |
| **InternLM2-7B-Reward** | 86.6 | [🤗internlm2-7b-reward](https://huggingface.co/internlm/internlm2-7b-reward) | [<img src="../assets/modelscope_logo.png" width="20px" /> internlm2-7b-reward](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-7b-reward/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-7b-reward) | 2024-07-19 |
| **InternLM2-20B-Reward** | 89.5 | [🤗internlm2-20b-reward](https://huggingface.co/internlm/internlm2-20b-reward) | [<img src="../assets/modelscope_logo.png" width="20px" /> internlm2-20b-reward](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-20b-reward/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-20b-reward) | 2024-07-19 |
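If you want to pre-download the weights rather than fetch them on first use, here is a minimal sketch using the `huggingface_hub` client (assuming the package is installed); the repo IDs are the HuggingFace ones listed in the table above.

```python
from huggingface_hub import snapshot_download

# Pre-download the 7B reward model weights to the local HuggingFace cache;
# any repo ID from the table above (1.8B, 7B, or 20B) works the same way.
local_path = snapshot_download(repo_id="internlm/internlm2-7b-reward")
print(local_path)
```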
## Performance Evaluation on RewardBench
| Models | Score | Chat | Chat Hard | Safety | Reasoning |
| --------------------- | ----- | ---- | --------- | ------ | --------- |
| InternLM2-20B-Reward | 89.5 | 98.6 | 74.1 | 89.4 | 95.7 |
| InternLM2-7B-Reward | 86.6 | 98.6 | 66.7 | 88.3 | 92.8 |
| InternLM2-1.8B-Reward | 80.6 | 95.0 | 58.1 | 81.8 | 87.4 |
- The evaluation is conducted on the [RewardBench](https://github.com/allenai/reward-bench) dataset.
- For a fair comparison, conditional system prompts proposed in our technical report were not included during testing.
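- For reference, the overall Score column is consistent with the unweighted mean of the four category scores, e.g. (98.6 + 66.7 + 88.3 + 92.8) / 4 = 86.6 for InternLM2-7B-Reward.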
## Demo Code
### Basic Usage
We provide user-friendly APIs for working with the model. The example below shows how to get the reward score of a chat, compare two chats, and rank multiple chats.
```python
import torch
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained(
"internlm/internlm2-20b-reward",
device_map="cuda",
torch_dtype=torch.float16,
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-20b-reward", trust_remote_code=True)
chat_1 = [
{"role": "user", "content": "Hello! What's your name?"},
{"role": "assistant", "content": "My name is InternLM2! A helpful AI assistant. What can I do for you?"}
]
chat_2 = [
{"role": "user", "content": "Hello! What's your name?"},
{"role": "assistant", "content": "I have no idea."}
]
# get reward score for a single chat
score1 = model.get_score(tokenizer, chat_1)
score2 = model.get_score(tokenizer, chat_2)
print("score1: ", score1)
print("score2: ", score2)
# >>> score1: 0.767578125
# >>> score2: -2.22265625
# batch inference, get multiple scores at once
scores = model.get_scores(tokenizer, [chat_1, chat_2])
print("scores: ", scores)
# >>> scores: [0.767578125, -2.22265625]
# compare whether chat_1 is better than chat_2
compare_res = model.compare(tokenizer, chat_1, chat_2)
print("compare_res: ", compare_res)
# >>> compare_res: True
# rank multiple chats; this returns the ranking index of each chat
# the chat with the highest score will have ranking index 0
rank_res = model.rank(tokenizer, [chat_1, chat_2])
print("rank_res: ", rank_res) # lower index means higher score
# >>> rank_res: [0, 1]
```
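The returned values are raw scalar rewards rather than probabilities, so they are most meaningful when comparing responses to the same prompt. As an illustrative reading (an assumption on our part, not part of the model API), the Bradley-Terry formulation commonly used for preference training maps the score difference to a preference probability through a sigmoid:

```python
import math

# Scores from the example above (raw scalar rewards, not probabilities).
score1, score2 = 0.767578125, -2.22265625

# Illustrative Bradley-Terry reading: the sigmoid of the score difference is
# the modeled probability that chat_1 is preferred over chat_2.
p_chat1_preferred = 1 / (1 + math.exp(-(score1 - score2)))
print(f"P(chat_1 preferred over chat_2) ~= {p_chat1_preferred:.3f}")  # about 0.952
```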
### Best-of-N Sampling
Here is an example of using the reward model for best-of-N sampling: the code below generates several candidate responses with a language model and then selects the best one according to the reward model.
```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
# prepare the llm model and tokenizer
llm = AutoModelForCausalLM.from_pretrained(
"internlm/internlm2-chat-7b",
device_map="cuda",
torch_dtype=torch.float16,
trust_remote_code=True,
)
llm_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)
# prepare the reward model and tokenizer
reward = AutoModel.from_pretrained(
"internlm/internlm2-20b-reward",
device_map="cuda",
torch_dtype=torch.float16,
trust_remote_code=True,
)
reward_tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-20b-reward", trust_remote_code=True)
# prepare the chat prompt
prompt = "Write an article about the artificial intelligence revolution."
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt}
]
text = llm_tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = llm_tokenizer([text], return_tensors="pt").to("cuda")
# generate best of N candidates
num_candidates = 10 # N=10
candidates = []
outputs = llm.generate(
**model_inputs,
max_new_tokens=512,
num_return_sequences=num_candidates,
pad_token_id=llm_tokenizer.eos_token_id,
do_sample=True,
top_k=50,
top_p=0.95,
temperature=0.8,
)
outputs = outputs[:, model_inputs["input_ids"].shape[1]:]
for i in range(num_candidates):
candidate = llm_tokenizer.decode(outputs[i], skip_special_tokens=True)
candidates.append(messages + [{"role": "assistant", "content": candidate}])
rank_indices = reward.rank(reward_tokenizer, candidates)
sorted_candidates = sorted(zip(rank_indices, candidates), key=lambda x: x[0])
## print the ranked candidates
# for i, (rank_index, candidate) in enumerate(sorted_candidates):
# print(f"------------Rank {i}------------: \n{candidate[-1]['content']}")
# print the best response
best_response = sorted_candidates[0][1][-1]['content']
print(best_response)
```
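Continuing from the block above, you can also inspect the absolute reward of the selected response with the batch `get_scores` API from the basic usage example (this sketch assumes `rank_indices` is a plain Python list, as in that example). Larger N improves the expected reward of the chosen response at the cost of proportionally more generation.

```python
# Continuation of the best-of-N example above: look up the reward of the
# top-ranked candidate (ranking index 0) via the batch scoring API.
scores = reward.get_scores(reward_tokenizer, candidates)
best_idx = rank_indices.index(0)
print(f"best candidate reward: {scores[best_idx]}")
```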