[evaluation] add automatic evaluation pipeline (#3821)

* add functions for gpt evaluation

* add automatic eval

Update eval.py

* using jload and modify the type of answers1 and answers2

* Update eval.py

Update eval.py

* Update evaluator.py

* support gpt evaluation

* update readme.md

update README.md

update READNE.md

modify readme.md

* add Chinese example for config, battle prompt and evaluation prompt file

* remove GPT-4 config

* remove sample folder

---------

Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com>
pull/3916/head
Yuanchen 2 years ago committed by GitHub
parent 05b8a8de58
commit 34966378e8
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -1,173 +1,225 @@
# Evaluation # Evaluation
In this directory, we introduce how you can evaluate your model with GPT-4. In this directory, we introduce how you can evaluate your model with our pipeline. This pipeline is available for model
evaluation of Chinese capability and the one for English capability is under preparation.
## Evaluation Pipeline
The whole evaluation process undergoes the following three steps:
1. Prepare the questions following the internal data structure in the data format section (described below).
2. Generate answers from different models:
* Generate answers using GPT-3.5: [`generate_gpt35_answers.py`](generate_gpt35_answers.py).
* Generate answers using your own models: [`generate_answers.py`](generate_answers.py).
3. Evaluate models using GPT-4: [`evaluate.py`](evaluate.py).
### Generate Answers
#### Generate Answers Using GPT-3.5
You can provide your own OpenAI key to generate answers from GPT-3.5 using [`generate_gpt35_answers.py`](./generate_gpt35_answers.py).
An example script is provided as follows:
```shell
python generate_gpt35_answers.py \
--dataset "path to the question dataset" \
--answer_path "path to answer folder" \
--num_workers 4 \
--openai_key "your openai key" \
--max_tokens 512 \
```
#### Generate Answers Using our Own Model
You can also generate answers using your own models. The generation process is divided into two stages:
1. Generate answers using multiple GPUs (optional) with batch processing: [`generate_answers.py`](./generate_answers.py).
2. Merge multiple shards and output a single file: [`merge.py`](./merge.py).
An example script is given as follows:
```shell
device_number=number of your devices
model_name="name of your model"
model_path="path to your model"
dataset="path to the question dataset"
answer_path="path to save the model answers"
torchrun --standalone --nproc_per_node=$device_number generate_answers.py \
--model 'llama' \
--strategy ddp \
--model_path $model_path \
--model_name $model_name \
--dataset $dataset \
--batch_size 8 \
--max_datasets_size 80 \
--answer_path $answer_path \
--max_length 512
python merge.py \
--model_name $model_name \
--shards $device_number \
--answer_path $answer_path \
for (( i=0; i<device_number; i++ )) do
rm -rf "${answer_path}/${model_name}_answers_rank${i}.json"
done
```
### Evaluate Answers
In [`evaluate.py`](./evaluate.py), GPT-4 helps to review and score answers of two different models. Here `Model 1` refers to the first model you specify in the `--answer_file_list` and `Model 2` refers to the second model. The script shows several metrics and output the corresponding JSON files.
The metrics include:
- `Invalid Count`: The number of reviews where the program fail to parse the score pair.
- `Better Count`: The number of reviews where Model 2 receives a higher score.
- `Worse Count`: The number of reviews where Model 2 receives a lower score.
- `Tie Count`: The number of reviews where two models play to a tie.
- `Win Rate of Model 2`: Win rate of Model 2.
- `Model 1 Average Score`: Average score of Model 1.
- `Model 2 Average Score`: Average score of Model 2.
Other than the `review` and `result` file which include all reviews, the output files also include `invalid`, `better`, `worse` and `tie` JSON file which only include the corresponding reviews.
## Installation
To start model evaluation, you need to install required packages which listed in `requirements.txt` under `evaluate` folder.
```shell ```shell
python evaluate.py \ pip install -r requirements.txt
--answer_file_list "path to answers of model 1" "path to answers of model 2" \
--prompt_file "path to prompt file" \
--reviewer_file "path to reviewer file" \
--output_folder "path to output folder" \
--openai_key "your openai key" \
--model "the gpt model" \
--num_workers 8 \
--max_tokens 512 \
``` ```
## Results ## Evaluation Pipeline
We compare our model with alpaca and vicuna. The results is shown below. Please note that the better cases don't add to 80 because there are reviews the program can't successfully parse to get the score pair. Our Coati-7B model performs better than Alpaca-7B. The Coati-7B model we evaluate is an old version we trained a few weeks ago and the new version is around the corner.
| Model Pair | Alpaca-7B ⚔ Coati-7B | Coati-7B ⚔ Alpaca-7B |
| :-----------: | :------------------: | :------------------: |
| Better Cases | 38 ⚔ **41** | **45** ⚔ 33 |
| Win Rate | 48% ⚔ **52%** | **58%** ⚔ 42% |
| Average Score | 7.06 ⚔ **7.13** | **7.31** ⚔ 6.82 |
We would like to mention that the evaluation of model answers using the GPT-3.5 model is not reliable. GPT-3.5 tends to give a higher score to the second answer (`{answer2}` in the prompt). In our evaluation which uses GPT-4, we still swap the two model answers. As can be seen from the table, GPT-4 can generate consistent results and it is more unbiased than GPT-3.5.
## Data Format The whole evaluation pipeline consists of two methods:
1. `GPT Evaluation`: evaluates model predictions using the GPT-3.5.
* Compare the performance of two different models (battle).
* Rate model according to pre-defined metrics using prompting design.
2. `Automatic Evaluation`: evaluates model predictions using automatic metrics.
### Evaluation Category
The model capability is seperated into 10 evaluation categories, which refers to the user case mentioned in InstructGPT.
Following table introduces each category:
| Evaluation Category | Description |
|:-------------------:|:-----------------------------------------------------------------------------------------------------------------------------------------|
| Roleplay | Given certain characteristic, the capability of chatting as the character |
| Chat | Conduct multiple rounds of dialogue, the capability of understanding and memorization of previous rounds of dialogue |
| Open QA | Given an open question, the capability of answering questions in opened-ended way |
| Closed QA | Given a closed question, the capability of answering questions with limited scope (such as single/multiple choice question) |
| Brainstorming | Given a question requiring divergent answers, the capability of divergent answering and listing in points |
| Generation | Given generation task, the capability of generating in high quality and human-written way (such as writing an email) |
| Rewriting | Given rewriting task, the capability of rewriting sentences to meet task requirements (such as active and passive switches, translation) |
| Classification | Given classification task, the capability of accurate classification |
| Extraction | Given extraction task, the capability of extracting required information |
| Summarization | Given a paragraph or passage, the capability of summarization |
To better understand each evaluation category, here are some prompt examples provided.
| Evaluation Category | Chinese Example | English Example |
|:-------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Roleplay | **Example 1**<br/>我想让你担任Android开发工程师面试官。我将成为候选人您将向我询问Android开发工程师职位的面试问题。我希望你只作为面试官回答。不要一次写出所有的问题。我希望你只对我进行采访。问我问题等待我的回答。不要写解释。像面试官一样一个一个问我等我回答。我的第一句话是“面试官你好”。 <br/><br/>**Example 2**<br/>我想让你扮演讲故事的角色。你会想出引人入胜、富有想象力和吸引观众的有趣故事。它可以是童话故事、教育故事或任何其他类型的有潜力的故事以吸引人们的注意力和想象力。根据目标受众,您可以为您的讲故事环节选择特定的主题或主题,例如,如果是儿童,那么您可以谈论动物;如果是成人,那么基于历史的故事可能会更好地吸引他们等。我的第一个请求是我需要一个关于毅力的有趣故事。 | **Example 1**<br/>I want you to act as a journalist. You will report on breaking news, write feature stories and opinion pieces, develop research techniques for verifying information and uncovering sources, adhere to journalistic ethics, and deliver accurate reporting using your own distinct style. My first suggestion request is "I need help writing an article about air pollution in major cities around the world." <br/><br/>**Example 2:**<br/> I want you to act as a debater. I will provide you with some topics related to current events and your task is to research both sides of the debates, present valid arguments for each side, refute opposing points of view, and draw persuasive conclusions based on evidence. Your goal is to help people come away from discussion with increased knowledge and insight into the topic at hand. My first request is "I want an opinion piece about Deno." |
| Chat | **Example 1**<br/>基于以下角色信息完成一段对话。小王20岁是一名大学生小张22岁是一名大学生。<br/> 小王:小张,我们一起去看电影吧? 小张:好啊,你想看什么电影呢? 小王:我听说最近有一部很好看的科幻电影。你觉得怎么样?<br/>小张:<br/><br/>**Example 2:**<br/>基于以下角色信息完成一段对话。小明是一名医生,一位老年病患者想要停药,但他对病情有所忽视并有担忧;王叔叔是老年病患者的儿子,希望能够听取医生的建议。<br/>小明:你好,王叔叔,我了解你想要让你父亲停药。<br/>王叔叔:是的,我父亲已经吃了那么久的药,我担心药物对他的身体会有副作用。<br/>小明: | **Example 1**<br/>Human: Who is the highest try scorer in the current Sydney Roosters team?<br/>AI: As an AI language model, I don't have access to current statistics. However, you can check the current Sydney Roosters team's official website or relevant sports news websites to find the answer.<br/>Human: Thanks for the suggestion, but can you recommend a reliable sports news website that I can check?<br/>AI:<br/><br/>**Example 2:**<br/>Complete a dialogue based on the following role information. <br/>A: Elementary student B: Teacher<br/>B: Good morning, Student A. Today we're going to learn about addition and subtraction.<br/>A: Teacher, I already know this very well. Why do I need to learn it again?<br/>B: |
| Open QA | **Example 1**<br/>请问万有引力定律由谁提出的?<br/><br/>**Example 2**<br/>哪些国家参与了第一次世界大战? | **Example 1**<br/>Who are the indigenous people of New Zealand?<br/><br/>**Example 2**<br/>How do you take the derivative of the sin function? |
| Closed QA | **Example 1**<br/>请从以下选项中选择正确答案。以下哪个是世界上最高山峰? <br/>A. 长城 <br/>B. 泰山 <br/>C. 珠穆朗玛峰 <br/>D. 黄山<br/><br/>**Example 2**<br/>请从以下选项中选择一个最佳答案回答下面的问题。问题:非洲最高的山是哪座山?<br/> 选项: <br/>A. 麦金利山 <br/>B. 喜马拉雅山 <br/>C. 乞力马扎罗山 | **Example 1**<br/>Answer the following question:<br/>What shape is the Earth?<br/>A) A circle<br/>B) A sphere<br/>C) An ellipse<br/>D) A plane<br/><br/>**Example 2**<br/>Choose the correct classification for the following question:<br/>"What type of restaurant is 'Burger King'"?<br/>fast food<br/>family style<br/>formal dining<br/>buffet<br/> |
| Brainstorming | **Example 1**<br/>请介绍一下人工智能的多个领域。<br/><br/>**Example 2**<br/>请给出管理家庭财务的3个小技巧。<br/> | **Example 1**<br/>What are 10 science fiction books I should read next?<br/><br/>**Example 2**<br/>List five ideas for how to regain enthusiasm for my career. |
| Generation | **Example 1**<br/>请撰写一篇文章,介绍如何通过改善生活习惯来预防疾病和延长寿命。<br/><br/>**Example 2**<br/>请根据以下情节撰写一篇短篇小说:一名年轻人被困在一个荒岛上,他必须想办法生存下去直到被救援。但他很快发现自己并不孤单。 | **Example 1**<br/>Can you help me write a formal email to a potential business partner proposing a joint venture?<br/><br/>**Example 2**<br/>Please use the appropriate format to write a formal letter of recommendation for a student applying to a prestigious computer science graduate program at a university. |
| Rewriting | **Example 1**<br/>将以下句子改为被动语态:<br/>"他们正在洗车"<br/><br/>**Example 2**<br/>将以下文本翻译成英语:<br/>“这个周末我要去海边玩” | **Example 1**<br/>Translate the following text into English: <br/>"我最喜欢的季节是春天,因为我可以看到美丽的花朵。"<br/><br/>**Example 2**<br/>Please correct the following sentences and give them the correct sentence.<br/>"Their going to the party there." |
| Classification | **Example 1**<br/>新闻标题:今日立夏,有一上联,立夏万物并秀,下联怎么对?<br/>请根据以上新闻标题判断新闻所属的分类,你需要从文化,娱乐,体育,财经,房产,教育,科技,旅游,游戏,军事这十类中选择一个答案。<br/><br/> **Example 2**<br/>新闻标题:赵丽颖很久没有登上微博热搜了,但你们别急,她只是在憋大招而已。<br/>请根据新闻标题判断新闻所属的分类,你需要从文化,娱乐,体育,财经,房产,教育,科技,旅游,游戏,军事这十类中选择一个答案。 | **Example 1**<br/>Classify the given email as spam or non-spam.<br/>"Hello, this is an email reminding you to pay your property fees"<br/><br/>**Example 2**<br/>Classify the following text as news, ads or forum posts<br/>"The latest iPhone 13 is now available, shop now!" |
| Extraction | **Example 1**<br/>根据以下新闻文本提取新闻报道时间例如回答时按照格式“新闻报道时间2007年8月10日”<br/>新闻文本如下2007-4-7中新网4月7日电据中国消防在线消息4月4日晚上7时30分左右湖南长潭高速公路上发生一起6车连环相撞失火事故。长株潭三地消防部门共出动消防车21台警力100余人。经过消防官兵近2个小时奋力扑救大火被成功扑灭。据初步调查有1人在此次事故中死亡。<br/><br/>**Example 2**<br/>根据以下新闻文本提取新闻报道时间例如回答时按照格式“新闻报道时间2007年8月10日”<br/>新闻文本如下2014年1月15日据外媒《俄罗斯报》报道称位于北半球的澳大利亚现在正处于炎热的夏季而近日也到了高温酷暑的时候当地时间1月14日晚澳大利亚南部一夜间发生至少250起火灾。受炎热天气及雷雨天气影响澳大利亚南部一夜间发生至少250起火灾灾情多集中在维多利亚州。火灾发生后救援人员立即展开救灾行动。目前大部分起火点火势已被控制。 | **Example 1**<br/>Extract all phenotypes of the following text:<br/>"The 55-year-old patient has fever and hypertension."<br/><br/>**Example 2**<br/>Extract the location mentioned in the following text:<br/>"The student graduated from Harvard university, which is located in Boston" |
| Summarization | **Example 1**<br/>请简要总结概括以下段落材料。<br/>新华社快讯斯里兰卡政府部门21日说首都科伦坡包括教堂、酒店等多个地点当天发生的爆炸已导致至少70人死亡另有260多人受伤。<br/><br/> **Example 2**<br/>请简要总结概括以下段落材料。<br/>近期参与京雄高铁站站房建设的中铁十二局因在施工过程中存在环境违法行为被雄安新区公开通报。通报发出后引起社会广泛关注。近日人民网记者从雄安新区相关部门及中铁十二局获悉新区有关部门已经集中约谈了中铁十二局等24个参与雄安建设的项目单位。对于约谈内容和结果中铁十二局有关宣传负责人回应“具体内容不清楚最好找雄安新区相关部门了解情况。”新区有关部门负责人表示此前涉及的环境违法行为中铁十二局已基本整改到位但约谈内容和结果暂不公开接下来将按部就班推进环境治理工作。原题为《雄安新区中铁十二局涉环境违法已基本整改到位》 | **Example 1**<br/>Please provide a summary based on the following news<br/>"China plans to launch its first space station core module in 2022, an important development in the country's space program. The space station, called Tianhe, will include three modules: a core module, an experiment module and an astronomy module. The first launch of the core module will be used to test and verify the basic functions of the station, as well as to conduct related scientific research and technology experiments. "<br/><br/>**Example 2**<br/>What information is provided in the table below? Summarize the core information in it<br/>"Ranking, Player Name, Team, Position, Salary (in millions of dollars)<br/>1, LeBron James, Los Angeles Lakers, SF, 45.0<br/>2, Stephen Curry, Golden State Warriors, PG, 43.5" |
### Evaluation Metrics
#### GPT Evaluation
Use GPT-3.5 to evaluate the prediction of different models, and pre-define evaluation metrics for different categories. There are 10 pre-defined evaluation metrics and you can refer to the table below:
| Evaluation Metric | Prompt Words | CoT |
|:-----------------------:|:-------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Language organization | 语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。 | 1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。<br/> 2.检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说<br/> 3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。<br/> 4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。<br/> 5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。<br/> 6. 根据以上因素综合评估答案的语言组织并给出一个1到5的分数其中5表示语言组织非常好而1表示语言组织非常差。 |
| Relevance | 切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。 | 1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。<br/> 2. 阅读答案,确认答案是否直接回答了题目所问的问题。<br/> 3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。<br/> 4. 根据以上因素综合评估答案的切题程度并给出一个1到5的分数其中5表示答案非常切题而1表示答案完全没有切题。 |
| Creativity | 创意性(1-5):某些头脑风暴问题可能需要答案具有创意,提出新的思路。 | 1. 仔细阅读所提供的头脑风暴问题,确保你理解问题的要点和背景。<br/> 2. 根据你的知识和经验,判断所提供的答案是否可行。如果答案不可行,则创意性评分可能会受到影响。<br/> 3. 考虑答案中是否包含新颖的想法或独特的思路。答案可能与已知的解决方案有所重叠,但仍然可以被认为是有创意的,只要它提供了新的角度或方法来解决问题。<br/> 4. 根据答案的创意性给出一个1到5的评分。如果答案缺乏创意则应给出一个较低的评分。如果答案具有创意并提供了新的思路应给出一个较高的评分。 |
| Practicality | 实用性(1-5):某些头脑风暴问题可能需要答案提出实用的建议或解决方法。 | 1. 仔细阅读所提供的头脑风暴问题,确保你理解问题的要点和背景。<br/> 2. 根据你的知识和经验,判断所提供的答案是否可行。如果答案不可行,则实用性评分可能会受到影响。<br/> 3. 考虑答案中提出的建议或解决方法是否实用并可行。答案可能看起来很好,但如果无法实现或应用,则实用性评分可能会受到影响。<br/> 4. 根据答案的实用性给出一个1到5的评分。如果答案缺乏实用性则应给出一个较低的评分。如果答案提出了实用的建议或解决方法并且可以很好地解决问题则应给出一个较高的评分。 |
| Correctness | 正确性(1-5):答案应该符合常识、生活实际等等 | 1. 仔细阅读所提供的头脑风暴问题,确保你理解问题的要点和背景。<br/> 2. 根据你的知识和经验,判断所提供的答案是否可行。如果答案不可行,则正确性评分可能会受到影响。<br/> 3. 考虑答案中所提供的信息是否正确、符合常识、生活实际等等。如果答案中存在明显的错误或不合理之处,则正确性评分可能会受到影响。<br/> 4. 根据答案的正确性给出一个1到5的评分。如果答案存在明显的错误或不合理之处则应给出一个较低的评分。如果答案正确、符合常识、生活实际等等则应给出一个较高的评分。 |
| Naturalness | 自然(1-5):答案是否自然,并且符合问题给定的身份。 | 1. 阅读题目,确定题目提供的身份信息。<br/> 2. 检查答案内容是否符合题目给定的身份。<br/> 3. 根据以上因素对该回答的自然性进行打分分数从1到5其中1表示不自然5表示非常自然并符合问题给定的身份。 |
| Engagingness | 参与感(1-5):答案是否对前面的对话内容做出了恰当的反应,是否理解对话的语境和背景。 | 1. 阅读题目,确定对话的语境和背景。<br/> 2. 检查答案是否充分理解对话的语境和背景,能否自然地融入到对话中而不显得突兀。<br/> 3. 根据以上因素对该回答的参与感进行打分分数从1到5其中1表示没有参与感5表示非常有参与感并且恰当地理解了对话的语境和背景。 |
| Reasonableness | 合理性(1-5):答案是否能够与前面的对话内容形成逻辑上的衔接,是否符合常理,能否在这个上下文中合理存在。 | 1. 阅读题目,确定对话的主题以及问题期望的回答方向。<br/> 2. 判断答案是否能够与前面的对话内容形成逻辑上的衔接,是否符合常理,能否在这个上下文中合理存在。<br/> 3. 根据以上因素对该回答的合理性进行打分分数从1到5其中1表示不合理5表示非常合理并且能够与前面的对话内容形成逻辑上的衔接并符合常理。 |
| Diversity | 多样性(1-5):答案使用语言是否优美,具有有一定的创造性和想象力。然而,回答也应该保持合理和适度,不要过于夸张或离题。 | 1. 仔细阅读整个回答,确保完全理解回答所表达的内容和主题。<br/> 2. 在阅读回答的同时,注意语言的质量,例如措辞是否正确,语言是否生动等。<br/> 3. 检查回答的创造性和想象力,看看回答是否能够吸引人阅读下去。<br/> 4. 检查回答的合理性和适度看看回答是否夸张或离题。5. 将多样性的评分打分在1到5之间5分表示回答的质量很好能够吸引人阅读1分表示回答的内容生硬或者有离题的问题。 |
| Fidelity | 保真度(1-5):答案是否能够严格遵守角色的设定回答给定的请求。 | 1. 仔细阅读问题,了解角色在问题中的设定和表现,包括职业、背景、观点、性格等方面。<br/> 阅读题目的请求,确认回答请求时需要注意的细节。<br/> 3. 对比提供的回答与该角色的设定,评估回答是否能够严格遵守角色的设定。<br/> 4. 结合以上评估结果给出保真度的评分范围从1到5分其中1分表示回答与角色设定完全不符5分表示回答完全符合角色设定且满足给定请求。 |
| Conciseness | 简明扼要(1-5):答案是否简明扼要,没有冗余内容。 | 1. 阅读题目,提取出材料的重点。<br/> 2. 阅读该总结,并注意其中的主要观点和信息。<br/> 3. 评估总结的长度。一个简明扼要的总结通常应该在几句话或几段文字内传达关键信息,而不是冗长的段落或文章。<br/> 4. 检查总结是否包含与主要观点无关的信息或冗余信息。<br/> 5. 确定总结涵盖了材料中的关键信息,并且没有忽略任何重要细节。<br/> 6. 给总结打出1-5的分数其中5表示总结简明扼要没有冗余内容而1表示总结冗长或包含不必要的信息难以理解或记忆。根据您的判断打出适当的得分。 |
GPT-3.5 evaluates the quality of model predictions based on the given prompt words and gives a score between 1-5.
#### Automatic Evaluation
Automated metrics evaluate the capability of a model by comparing model predictions with reference answers.
There are two ways to obtain reference answers:
* For instruction coming from human-designed problems, the reference answers are generated by GPT-3.5, such as roleplay, chat.
* For instruction related with classic NLP problems, the reference answers are collected from open-sourced dataset with target answers, such as classification, extraction, summarization.
There are 5 types of automatic evaluation metrics listed in the table below:
| Automatic Evaluation Metric | Description |
|:-----------------------------------:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| BLEU-n | Measure the accuracy between prediction and reference.<br/> BLEU-1 (Unigram) evaluates accuracy in word level<br/> BLEU-n (n-gram) evaluate the fluency in sentence level. |
| ROUGE | ROUGE-N measures the number of matching n-grams between prediction and reference. <br/> ROUGE-L measures the number of matching longest common subsequence (LCS) between prediction and reference. |
| Distinct | Measure the diversity of generation text by counting the unique n-grams. |
| BERTScore | Measure the semantic similarity between tokens of predictions and references with BERT. |
| Precision<br/> Recall<br/> F1 Score | Measure the number of overlaps between prediction and reference (design for classification and extraction categories) |
## Evaluation Process
### Data Format
#### Target Answers / Predictions
A JSON file contains one list. Each element in the list is a target answer / prediction record for one instruction / question.
An element should have the following fields:
### Questions * `category` (str, compulsory): The category of the instruction / question.
The file [`questions.json`](./sample/questions.json) shows the example questions used to evaluate the performance of the model. Each question record has the following field:
* `id` (id, compulsory): The ID of the instruction / question.
* `instruction` (str, compulsory): The instruction / question for the LLM. * `instruction` (str, compulsory): The instruction / question for the LLM.
* `input` (str, optional): The additional context of the instruction / question. * `input` (str, optional): The additional context of the instruction / question.
* `output` (str, optional): The sample output of the instruction / question. * `output` (str, optional): The sample output of the instruction (default: GPT-3.5).
* `category` (str, compulsory): The category of the instruction / question. * `target` (str, optional): The target answer for the instruction.
* `id` (int, compulsory): The ID of the instruction / question.
If the `input` has a target answer, the `output` can be empty. Otherwise, we generate answers from GPT-3.5 as the `output`, and the `target` field is empty.
Example: Example:
``` ```
{ [
"id": 0, {
"instruction": "Help me summarize the following short story?", "category": "brainstorming",
"input": "{story}", "instruction": "请介绍一下人工智能的多个领域。",
"output": "{summarized story}", "input": "",
"category": "closed qa" "output": "{GPT-3.5 Answers}",
} "target": "",
"id": 1
},
{
"category": "classification",
"instruction": "新闻标题:为什么电影《倩女幽魂》中燕赤霞一个道士却拿着金刚经?请根据新闻标题判断新闻所属的分类,你需要从文化,娱乐,体育,财经,房产,教育,科技,旅游,游戏,军事这十类中选择一个答案。",
"input": "",
"output": "",
"target": "{target answer}",
"id": 2
}
]
``` ```
### Answers #### Model Answers / Predictions
We store model answers in `{model_name}_answers.json`. The JSON file contains one list. Each element in the list is an answer record to one question. A JSON file contains one list. Each element in the list is a model answer / prediction record for one instruction / question.
An answer record has the following field: An element should have the following fields:
* `category` (str, compulsory): The category of the instruction / question. * `category` (str, compulsory): The category of the instruction / question.
* `instruction` (str, compulsory): The instruction / question for the LLM. * `instruction` (str, compulsory): The instruction / question for the LLM.
* `input` (str, optional): The additional context of the instruction / question. * `input` (str, optional): The additional context of the instruction / question.
* `output` (str, compulsory): The output from the LLM. * `output` (str, compulsory): The output from the LLM.
* `target` (str, optional): The target answer for the instruction.
* `id` (int, compulsory): The ID of the instruction / question. * `id` (int, compulsory): The ID of the instruction / question.
### Results Example:
```
We store evaluation results in `results.json`. The JSON file contains one dictionary. The key in the dictionary is formatted as `{model 1}_vs_{model 2}` and the value is also a dictionary contains metrics about the evaluation. [
{
The value has the following field: "category": "brainstorming",
"instruction": "请介绍一下人工智能的多个领域。",
* `model` (list, compulsory): The names of the two models. "input": "",
* `better` (int, compulsory): The number of reviews where Model 2 receives a higher score. "output": "{Model Answers / Predictions}",
* `worse` (int, compulsory): The number of reviews where Model 2 receives a lower score. "target": "",
* `tie` (int, compulsory): The number of reviews where two models play to a tie. "id": 1
* `win_rate` (float, compulsory): Win rate of Model 2. },
* `score` (list, compulsory): Average score of the two models. {
"category": "classification",
### Better, Worse, Tie, Invalid, Review "instruction": "新闻标题:为什么电影《倩女幽魂》中燕赤霞一个道士却拿着金刚经?请根据新闻标题判断新闻所属的分类,你需要从文化,娱乐,体育,财经,房产,教育,科技,旅游,游戏,军事这十类中选择一个答案。",
"input": "",
To help better compare the model answers, we store JSON files whose name ends with `_better`, `_worse`, `_tie`, `_invalid` or `_review`. Each JSON file contains one list. Each element in the list is a record of better, worse, tie, invalid or all cases. "output": "{Model Answers / Predictions}",
"target": "{target answer}",
A record has the following field: "id": 2
}
* `review_id` (str, optional): Random UUID, not in use. ]
* `id` (int, compulsory): The ID of the instruction / question. ```
* `reviewer_id` (int, compulsory): A unique ID for a reviewer. Different reviewer id use different prompts.
* `metadata` (dict, optional): It is empty.
* `review` (str, optional): GPT-4's review.
* `score` (list, compulsory): The scores of two models.
### Prompts ### Evaluation
#### Configuration
The configuration file `config_cn.json` can control how evaluate the performance of the model.
The following is an example showing the config structure:
```
{
"language": "cn",
"category": {
"brainstorming": {
"GPT-3.5": ["relevance", "creativity", "practicality", "correctness"],
"Metrics": ["Distinct"]
},
"chat": {
"GPT-3.5": [ "relevance", "naturalness", "engagingness", "reasonableness"],
"Metrics": ["Distinct"]
}
}
}
```
`"language"`: evaluate the model capability in which language, we only support Chinese `"cn"` for now.
`"category"`: evaluate the model capability in which category/categories.
`"GPT-3.5"`: config metrics for GPT-3.5 evaluation.
`"Metrics"`: config metrics for automatic metrics evaluation.
You can create your config file based on available settings listed in following table.
| "category" | "GPT-3.5" | "Metrics" |
|:----------------:|:-----------------------:|:-----------:|
| "brainstorming" | "language organization" | "BLEU" |
| "chat" | "relevance" | "ROUGE" |
| "classification" | "creativity" | "Distinct" |
| "closed_qa" | "practicality" | "BERTScore" |
| "extraction" | "correctness" | "Precision" |
| "generation" | "naturalness" | "Recall" |
| "open_qa" | "engagingness" | "F1 score" |
| "rewriting" | "reasonableness" |
| "roleplay" | "diversity" |
| "summarization" | "fidelity" |
| | "conciseness" |
#### Evaluate
After setting the configuration file, you can evaluate the model using `eval.py`.
The data format is the same with [`FastChat's`](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl) prompts.
### Reviewer An example script is provided as follows:
```shell
python eval.py \
--config_file "path to the config file" \
--battle_prompt_file "path to the prompt file for battle" \
--gpt_evaluation_prompt_file "path to the prompt file for gpt evaluation" \
--target_file "path to the target answer file" \
--answer_file_list "path to the answer files of at most 2 models" \
--model_name_list "the names of at most 2 models" \
--save_path "path to save results" \
--openai_key "your openai key" \
```
The data format is the same with [`FastChat's`](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl) reviewers. ## To Do
- [ ] Add evaluation for English capability
- [ ] Support UniEval
- [ ] Support GPT-4 evaluation
## Citations ## Citations
@ -179,4 +231,22 @@ The data format is the same with [`FastChat's`](https://github.com/lm-sys/FastCh
month = {March}, month = {March},
year = {2023} year = {2023}
} }
@misc{ouyang2022training,
title={Training language models to follow instructions with human feedback},
author={Long Ouyang and Jeff Wu and Xu Jiang and Diogo Almeida and Carroll L. Wainwright and Pamela Mishkin and Chong Zhang and Sandhini Agarwal and Katarina Slama and Alex Ray and John Schulman and Jacob Hilton and Fraser Kelton and Luke Miller and Maddie Simens and Amanda Askell and Peter Welinder and Paul Christiano and Jan Leike and Ryan Lowe},
year={2022},
eprint={2203.02155},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{liu2023geval,
title={G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment},
author={Yang Liu and Dan Iter and Yichong Xu and Shuohang Wang and Ruochen Xu and Chenguang Zhu},
year={2023},
eprint={2303.16634},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
``` ```

@ -0,0 +1,123 @@
{
"language": "cn",
"category": {
"brainstorming": {
"GPT-3.5": [
"language organization",
"relevance",
"creativity",
"practicality",
"correctness"
],
"Metrics": [
"Distinct"
]
},
"chat": {
"GPT-3.5": [
"language organization",
"relevance",
"naturalness",
"engagingness",
"reasonableness"
],
"Metrics": [
"Distinct"
]
},
"classification": {
"GPT-3.5": [
"language organization",
"relevance",
"correctness"
],
"Metrics": [
"Precision",
"Recall",
"F1 score"
]
},
"closed_qa": {
"GPT-3.5": [
"language organization",
"relevance",
"correctness"
],
"Metrics": [
"BLEU",
"ROUGE",
"BERTScore"
]
},
"extraction": {
"GPT-3.5": [
"language organization",
"relevance",
"correctness"
],
"Metrics": [
"Precision",
"Recall",
"F1 score"
]
},
"generation": {
"GPT-3.5": [
"language organization",
"relevance",
"diversity"
],
"Metrics": [
"BLEU",
"ROUGE",
"BERTScore"
]
},
"open_qa": {
"GPT-3.5": [
"language organization",
"relevance",
"correctness"
],
"Metrics": [
"Distinct"
]
},
"rewriting": {
"GPT-3.5": [
"language organization",
"relevance",
"correctness"
],
"Metrics": [
"BLEU",
"ROUGE",
"BERTScore"
]
},
"roleplay": {
"GPT-3.5": [
"language organization",
"relevance",
"fidelity",
"creativity"
],
"Metrics": [
"Distinct"
]
},
"summarization": {
"GPT-3.5": [
"language organization",
"relevance",
"correctness",
"conciseness"
],
"Metrics": [
"BLEU",
"ROUGE",
"BERTScore"
]
}
}
}

@ -0,0 +1,98 @@
import argparse
import json
import os
import openai
from evaluator import Evaluator
from utils import jload
def main(args):
assert len(args.answer_file_list) == len(
args.model_name_list), "The number of answer files and model names should be equal!"
# load config
config = jload(args.config_file)
if config["language"] == "cn":
# get metric settings for all categories
metrics_per_category = {}
for category in config["category"].keys():
metrics_all = {}
for metric_type, metrics in config["category"][category].items():
metrics_all[metric_type] = metrics
metrics_per_category[category] = metrics_all
battle_prompt = None
if args.battle_prompt_file:
battle_prompt = jload(args.battle_prompt_file)
gpt_evaluation_prompt = None
if args.gpt_evaluation_prompt_file:
gpt_evaluation_prompt = jload(args.gpt_evaluation_prompt_file)
if len(args.model_name_list) == 2 and not battle_prompt:
raise Exception("No prompt file for battle provided. Please specify the prompt file for battle!")
if len(args.model_name_list) == 1 and not gpt_evaluation_prompt:
raise Exception(
"No prompt file for gpt evaluation provided. Please specify the prompt file for gpt evaluation!")
# initialize evaluator
evaluator = Evaluator(metrics_per_category, battle_prompt, gpt_evaluation_prompt)
if len(args.model_name_list) == 2:
answers1 = jload(args.answer_file_list[0])
answers2 = jload(args.answer_file_list[1])
assert len(answers1) == len(answers2), "The number of answers for two models should be equal!"
evaluator.battle(answers1=answers1, answers2=answers2)
evaluator.save(args.save_path, args.model_name_list)
elif len(args.model_name_list) == 1:
targets = jload(args.target_file)
answers = jload(args.answer_file_list[0])
assert len(targets) == len(answers), "The number of target answers and model answers should be equal!"
evaluator.evaluate(answers=answers, targets=targets)
evaluator.save(args.save_path, args.model_name_list)
else:
raise ValueError("Unsupported number of answer files and model names!")
else:
raise ValueError(f'Unsupported language {config["language"]}!')
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='ColossalAI LLM evaluation pipeline.')
parser.add_argument('--config_file',
type=str,
default=None,
required=True,
help='path to the file of target results')
parser.add_argument('--battle_prompt_file', type=str, default=None, help='path to the prompt file for battle')
parser.add_argument('--gpt_evaluation_prompt_file',
type=str,
default=None,
help='path to the prompt file for gpt evaluation')
parser.add_argument('--target_file', type=str, default=None, help='path to the target answer (ground truth) file')
parser.add_argument('--answer_file_list',
type=str,
nargs='+',
default=[],
required=True,
help='path to the answer files of at most 2 models')
parser.add_argument('--model_name_list',
type=str,
nargs='+',
default=[],
required=True,
help='the names of at most 2 models')
parser.add_argument('--save_path', type=str, default="results", help='path to save evaluation results')
parser.add_argument('--openai_key', type=str, default=None, required=True, help='Your openai key')
args = parser.parse_args()
if args.openai_key is not None:
os.environ["OPENAI_API_KEY"] = args.openai_key
openai.api_key = os.getenv("OPENAI_API_KEY")
main(args)

@ -0,0 +1,9 @@
python eval.py \
--config_file "path to the config file" \
--battle_prompt_file "path to the prompt file for battle" \
--gpt_evaluation_prompt_file "path to the prompt file for gpt evaluation" \
--target_file "path to the target answer file" \
--answer_file_list "path to the answer files of at most 2 models" \
--model_name_list "the names of at most 2 models" \
--save_path "path to save results" \
--openai_key "your openai key" \

@ -1,256 +0,0 @@
# Adapted form https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/eval_gpt_review.py
# Copyright 2023 LM-SYS@FastChat
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import json
import os
import time
import re
import concurrent.futures
import openai
import tqdm
import shortuuid
import logging
from utils import jload, jdump, get_json_list
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
MAX_API_RETRY = 3
def get_eval(sys_prompt, user_prompt: str, answer_id: int, max_tokens: int, model: str):
logging.basicConfig(level=logging.INFO)
for _ in range(MAX_API_RETRY):
try:
response = openai.ChatCompletion.create(
model=model,
messages=[{
'role': 'system',
'content': sys_prompt
}, {
'role': 'user',
'content': user_prompt,
}],
temperature=0.2,
max_tokens=max_tokens,
)
review = response['choices'][0]['message']['content']
return {"review": review, 'id': answer_id}
except Exception as e:
logger.error(e)
time.sleep(1)
logger.error(f' Review {answer_id} failed after {MAX_API_RETRY} retries.')
return 'error'
def parse_score(review):
try:
pattern = re.compile('([0-9]|10) out of 10')
sp = re.findall(pattern, review)
if len(re.findall(pattern, review)) == 2:
return [float(sp[0]), float(sp[1])]
pattern = re.compile('a score of ([0-9]|10)')
sp = re.findall(pattern, review)
if len(re.findall(pattern, review)) == 2:
return [float(sp[0]), float(sp[1])]
pattern = re.compile('([0-9]|10)/10')
sp = re.findall(pattern, review)
if len(re.findall(pattern, review)) == 2:
return [float(sp[0]), float(sp[1])]
score_pair = review.split('\n')[0]
score_pair = score_pair.replace(',', ' ')
sp = score_pair.split(' ')
if len(sp) == 2:
return [float(sp[0]), float(sp[1])]
else:
raise Exception('Invalid score pair.')
except Exception as e:
return [-1, -1]
def gen_prompt(reviewer_jsons, prompt_jsons, cat, ques, ans1, ans2):
reviewer_idx = 0
for idx, reviewer in enumerate(reviewer_jsons):
if reviewer['category'] == cat:
reviewer_idx = idx
break
prompt_id = reviewer_jsons[reviewer_idx]['prompt_id']
prompt_json = prompt_jsons[prompt_id-1]
assert prompt_json['prompt_id'] == prompt_id
sys_prompt = prompt_json['system_prompt']
prompt_template = prompt_json['prompt_template']
defaults = prompt_json['defaults']
prompt = prompt_template.format(
question=ques, answer_1=ans1, answer_2=ans2, **defaults)
return sys_prompt, prompt, reviewer_idx+1
def evaluate(args):
answer1_jsons = jload(args.answer_file_list[0])
answer2_jsons = jload(args.answer_file_list[1])
reviewer_jsons = get_json_list(args.reviewer_file)
prompt_jsons = get_json_list(args.prompt_file)
assert len(answer1_jsons) == len(answer2_jsons)
handles = []
review_jsons = []
total_len = len(answer1_jsons)
question_idx_list = list(range(total_len))
logger.info(
f' Total number of answers: {len(answer2_jsons)}.')
reviews = []
with concurrent.futures.ThreadPoolExecutor(max_workers=args.num_workers) as executor:
futures = []
for i in question_idx_list:
assert answer1_jsons[i]['id'] == answer2_jsons[i]['id']
answer_id = answer1_jsons[i]['id']
ques = answer1_jsons[i]['instruction'] if answer1_jsons[i]['input'] == "" else answer1_jsons[i]['instruction'] + \
" " + answer1_jsons[i]['input']
cat = answer1_jsons[i]['category']
ans1 = answer1_jsons[i]['output']
ans2 = answer2_jsons[i]['output']
sys_prompt, prompt, reviewer_id = gen_prompt(
reviewer_jsons, prompt_jsons, cat, ques, ans1, ans2)
review_id = shortuuid.uuid()
review_jsons.append({
'review_id': review_id,
'id': answer_id,
'reviewer_id': reviewer_id,
'metadata': {}
})
future = executor.submit(
get_eval, sys_prompt, prompt, answer_id, args.max_tokens, args.model)
futures.append(future)
for future in tqdm.tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
reviews.append(future.result())
reviews.sort(key=lambda x: x['id'])
review_jsons.sort(key=lambda x: x['id'])
ans1_score = 0
ans2_score = 0
better_count = 0
worse_count = 0
tie_count = 0
invalid_count = 0
better_file = []
worse_file = []
tie_file = []
invalid_file = []
output_review_file = []
for idx, review in enumerate(reviews):
scores = parse_score(review['review'])
review_jsons[idx]['review'] = review['review']
review_jsons[idx]['score'] = scores
if scores[0] == -1 and scores[1] == -1:
invalid_count += 1
invalid_file.append(review_jsons[idx])
logger.info(f' Invalid score pair: {review_jsons[idx]["id"]}.')
else:
if scores[0] > scores[1]:
worse_count += 1
worse_file.append(review_jsons[idx])
elif scores[0] < scores[1]:
better_count += 1
better_file.append(review_jsons[idx])
else:
tie_count += 1
tie_file.append(review_jsons[idx])
ans1_score += scores[0]
ans2_score += scores[1]
output_review_file.append(review_jsons[idx])
better_file.sort(key=lambda x: x['id'])
worse_file.sort(key=lambda x: x['id'])
tie_file.sort(key=lambda x: x['id'])
invalid_file.sort(key=lambda x: x['id'])
output_review_file.sort(key=lambda x: x['id'])
name1 = os.path.basename(args.answer_file_list[0]).split("_answers")[0]
name2 = os.path.basename(args.answer_file_list[1]).split("_answers")[0]
prefix = f"{name1}_vs_{name2}"
jdump(better_file, os.path.join(
args.output_folder, prefix, f"{prefix}_better.json"))
jdump(worse_file, os.path.join(
args.output_folder, prefix, f"{prefix}_worse.json"))
jdump(tie_file, os.path.join(
args.output_folder, prefix, f"{prefix}_tie.json"))
jdump(invalid_file, os.path.join(
args.output_folder, prefix, f"{prefix}_invalid.json"))
jdump(output_review_file, os.path.join(
args.output_folder, prefix, f"{prefix}_review.json"))
if os.path.exists(os.path.join(args.output_folder, "results.json")):
results = jload(os.path.join(args.output_folder, "results.json"))
else:
results = {}
results[prefix] = {'model': [name1, name2], 'better': better_count, 'worse': worse_count, 'tie': tie_count, 'win_rate': better_count /
(len(reviews)-invalid_count), 'score': [ans1_score/(len(reviews)-invalid_count), ans2_score/(len(reviews)-invalid_count)]}
jdump(results, os.path.join(args.output_folder, "results.json"))
logger.info(f' Total {invalid_count} invalid score pair(s).')
logger.info(f' Model {name2} has {better_count} better answer(s).')
logger.info(f' Model {name2} has {worse_count} worse answer(s).')
logger.info(f' {tie_count} answer(s) play(s) to a tie.')
logger.info(
f' Win rate of model {name2}: {better_count/(len(reviews)-invalid_count):.2f}')
logger.info(
f' Model {name1} average score: {ans1_score/(len(reviews)-invalid_count):.2f}')
logger.info(
f' Model {name2} average score: {ans2_score/(len(reviews)-invalid_count):.2f}')
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description='Model evaluation.')
parser.add_argument('--answer_file_list', nargs='+', default=[])
parser.add_argument('--prompt_file')
parser.add_argument('--reviewer_file')
parser.add_argument('--output_folder', type=str, default="./output")
parser.add_argument('--openai_key', type=str, default=None)
parser.add_argument('--model', type=str, default="gpt-4")
parser.add_argument('--num_workers', type=int, default=8)
parser.add_argument('--max_tokens', type=int, default=512,
help='maximum number of tokens produced in the output')
args = parser.parse_args()
if args.openai_key is not None:
os.environ["OPENAI_API_KEY"] = args.openai_key
openai.api_key = os.getenv("OPENAI_API_KEY")
evaluate(args)

@ -1,9 +0,0 @@
python evaluate.py \
--answer_file_list "path to answers of model 1" "path to answers of model 2" \
--prompt_file "path to prompt file" \
--reviewer_file "path to reviewer file" \
--output_folder "path to output folder" \
--openai_key "your openai key" \
--model "gpt-4" \
--num_workers 8 \
--max_tokens 512 \

@ -0,0 +1,130 @@
import os
from typing import Any, Dict, List
import gpt_evaluate
import metrics
import pandas as pd
from utils import get_data_per_category, jdump
class Evaluator(object):
"""
A class named Evaluator includes GPT-3.5/GPT-4 evaluation
and automatic evaluation
"""
def __init__(self, params: Dict[str, Any], battle_prompt: Dict[str, Any], gpt_evaluation_prompt: Dict[str,
Any]) -> None:
self.params = params
self.battle_prompt = battle_prompt
self.gpt_evaluation_prompt = gpt_evaluation_prompt
self.automatic_metric_stats = dict()
self.gpt35_evaluation_results = dict()
self.battle_results = []
def battle(self, answers1: List[Dict], answers2: List[Dict]) -> None:
"""
Comparison between two models using GPT-4 as the reviewer.
"""
self.battle_results = gpt_evaluate.battle(answers1, answers2, self.battle_prompt)
def evaluate(self, answers: List[Dict], targets: List[Dict]) -> None:
"""
A comprehensive evaluation of the answers from the model.
The function evaluates the model's performance from different perspectives
using GPT-3.5, GPT-4, and off-the-shelf evaluation metrics.
The metrics will be decided by the config file.
"""
def switch(metric):
if metric == "BLEU":
return metrics.bleu_score(preds=predicts_list, targets=targets_list)
elif metric == "ROUGE":
return metrics.rouge_cn_score(preds=predicts_list, targets=targets_list)
elif (metric == "Distinct"):
return metrics.distinct_score(preds=predicts_list)
elif (metric == "BERTScore"):
return metrics.bert_score(preds=predicts_list, targets=targets_list)
elif (metric == "Precision"):
return metrics.precision(preds=predicts_list, targets=targets_list)
elif (metric == "Recall"):
return metrics.recall(preds=predicts_list, targets=targets_list)
elif (metric == "F1 score"):
return metrics.F1_score(preds=predicts_list, targets=targets_list)
else:
raise ValueError(f"Unexpected metric")
answers_per_category = get_data_per_category(answers, list(self.params.keys()))
targets_per_category = get_data_per_category(targets, list(self.params.keys()))
# automatic evaluation
for category in self.params:
category_metrics = self.params[category]["Metrics"]
self.automatic_metric_stats[category] = {}
targets_list = [
target["target"] if target["target"] else target["output"] for target in targets_per_category[category]
]
predicts_list = [answer["output"] for answer in answers_per_category[category]]
for metric in category_metrics:
self.automatic_metric_stats[category].update(switch(metric=metric))
# gpt35 evaluation
for category in self.params:
category_metrics = self.params[category]["GPT-3.5"]
prompt = self.gpt_evaluation_prompt.get(category, None)
if prompt is None:
print(f"No prompt for category {category}! Use prompt for category general now.")
prompt = self.gpt_evaluation_prompt["general"]
self.gpt35_evaluation_results[category] = gpt_evaluate.gpt35_evaluate(answers_per_category[category],
prompt, category_metrics, category)
def save(self, path: str, model_name_list: List[str]) -> None:
"""
Save evaluation results of GPT-3.5, GPT-4, and off-the-shelf evaluation metrics.
"""
if len(model_name_list) == 2:
save_path = os.path.join(path, "gpt_evaluate", "battle_results")
gpt_evaluate.save_battle_results(self.battle_results, model_name_list[0], model_name_list[1], save_path)
else:
# save evaluation results for automatic metrics
automatic_df = pd.DataFrame(self.automatic_metric_stats)
automatic_results_save_path = os.path.join(path, "automatic_results")
if not os.path.exists(automatic_results_save_path):
os.makedirs(automatic_results_save_path)
automatic_df.to_csv(os.path.join(automatic_results_save_path, f"{model_name_list[0]}.csv"), index=True)
# Save evaluation results for GPT-3.5 evaluation metrics.
all_evaluations = []
base_save_path = os.path.join(path, "gpt_evaluate", "gpt35_evaluate_results")
evaluation_results_save_path = os.path.join(base_save_path, "evaluation_results")
for category, evaluations in self.gpt35_evaluation_results.items():
jdump(
evaluations,
os.path.join(evaluation_results_save_path, model_name_list[0],
f"{category}_evaluation_results.json"))
all_evaluations.extend(evaluations)
jdump(all_evaluations,
os.path.join(evaluation_results_save_path, f"{model_name_list[0]}_evaluation_results.json"))
# Start to calculate scores and save statictics.
evaluation_statistics_save_path = os.path.join(base_save_path, "evaluation_statistics")
gpt_evaluate.save_gpt35_evaluation_statistics(model_name_list[0], all_evaluations,
evaluation_statistics_save_path)
# Save charts and csv.
evaluation_analyses_save_path = os.path.join(base_save_path, "evaluation_analyses")
gpt_evaluate.analyze_gpt35_evaluation_statistics(evaluation_statistics_save_path,
evaluation_analyses_save_path)

@ -1,173 +0,0 @@
import argparse
import os
import random
import copy
import math
from tqdm import tqdm
import torch
import torch.distributed as dist
import transformers
from coati.models.bloom import BLOOMActor
from coati.models.gpt import GPTActor
from coati.models.opt import OPTActor
from coati.models.roberta import RoBERTaActor
from coati.models.llama import LlamaActor
from coati.trainer.strategies import ColossalAIStrategy, DDPStrategy, NaiveStrategy
from transformers import AutoTokenizer, RobertaTokenizer
from transformers.models.gpt2.tokenization_gpt2 import GPT2Tokenizer
from colossalai.logging import get_dist_logger
from utils import jload, jdump, is_rank_0
logger = get_dist_logger()
PROMPT_DICT = {
"prompt_input":
("Below is an instruction that describes a task, paired with an input that provides further context. "
"Write a response that appropriately completes the request.\n\n"
"### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"),
"prompt_no_input": ("Below is an instruction that describes a task. "
"Write a response that appropriately completes the request.\n\n"
"### Instruction:\n{instruction}\n\n### Response:"),
}
def generate(args):
# torch.cuda.set_per_process_memory_fraction(0.4)
if args.strategy == 'naive':
strategy = NaiveStrategy()
elif args.strategy == 'ddp':
strategy = DDPStrategy()
elif args.strategy == 'colossalai_gemini':
strategy = ColossalAIStrategy(stage=3, placement_policy='cuda')
elif args.strategy == 'colossalai_zero2':
strategy = ColossalAIStrategy(stage=2, placement_policy='cuda')
elif args.strategy == 'colossalai_zero2_cpu':
strategy = ColossalAIStrategy(stage=2, placement_policy='cpu')
else:
raise ValueError(f'Unsupported strategy "{args.strategy}"')
world_size = dist.get_world_size()
rank = dist.get_rank()
with strategy.model_init_context():
if args.model == 'gpt2':
actor = GPTActor(pretrained=args.model_path).to(
torch.cuda.current_device())
elif args.model == 'bloom':
actor = BLOOMActor(pretrained=args.model_path).to(
torch.cuda.current_device())
elif args.model == 'opt':
actor = OPTActor(pretrained=args.model_path).to(
torch.cuda.current_device())
elif args.model == 'roberta':
actor = RoBERTaActor(pretrained=args.model_path).to(
torch.cuda.current_device())
elif args.model == 'llama':
actor = LlamaActor(pretrained=args.model_path).to(
torch.float16).to(torch.cuda.current_device())
else:
raise ValueError(f'Unsupported model "{args.model}"')
if args.model == 'gpt2':
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
elif args.model == 'bloom':
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-560m')
tokenizer.pad_token = tokenizer.eos_token
elif args.model == 'opt':
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-350m')
elif args.model == 'roberta':
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
elif args.model == 'llama':
tokenizer = AutoTokenizer.from_pretrained(args.model_path,
padding_side="right",
use_fast=False,
)
tokenizer.eos_token = '<\s>'
else:
raise ValueError(f'Unsupported model "{args.model}"')
questions = []
if args.max_datasets_size is not None:
questions = random.sample(jload(args.dataset), args.max_datasets_size)
if is_rank_0():
logger.info(
f"Limiting dataset to {args.max_datasets_size} examples.")
questions = questions[rank:args.max_datasets_size:world_size]
answers = copy.deepcopy(questions)
prompt_input, prompt_no_input = PROMPT_DICT["prompt_input"], PROMPT_DICT["prompt_no_input"]
sources = [
prompt_input.format_map(example) if example.get(
"input", "") != "" else prompt_no_input.format_map(example)
for example in questions
]
if is_rank_0():
logger.info("Tokenizing inputs... This may take some time...")
input_ids_list = []
for string in sources:
input_ids = tokenizer.encode(string, return_tensors='pt').squeeze(0)
input_ids_list.append(input_ids)
bar = tqdm(range(math.ceil(len(input_ids_list)/args.batch_size)),
desc=f'steps', disable=not is_rank_0())
actor.eval()
with torch.no_grad():
for i in range(0, len(input_ids_list), args.batch_size):
batch = input_ids_list[i:i+args.batch_size]
batch = [i.flip(dims=[0]) for i in batch]
batch = torch.nn.utils.rnn.pad_sequence(batch,
batch_first=True,
padding_value=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0).to(torch.cuda.current_device())
batch = batch.flip(dims=[1])
attention_mask = batch.ne(tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0)
outputs = actor.model.generate(batch, attention_mask=attention_mask,
max_length=args.max_length,
do_sample=True,
top_k=50,
top_p=0.95,
num_return_sequences=1)
outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for j in range(batch.size(0)):
answers[i +
j]['output'] = outputs[j].split("### Response:")[1].strip()
bar.update()
jdump(answers, os.path.join(args.answer_path,
f'{args.model_name}_answers_rank{rank}.json'))
if is_rank_0():
logger.info(
f'Peak CUDA mem: {torch.cuda.max_memory_allocated()/1024**3:.3f} GB')
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--strategy',
choices=['naive', 'ddp', 'colossalai_gemini',
'colossalai_zero2', 'colossalai_zero2_cpu'],
default='naive')
parser.add_argument('--model', default='gpt2',
choices=['gpt2', 'bloom', 'opt', 'roberta', 'llama'])
parser.add_argument('--model_path', type=str, default=None)
parser.add_argument('--model_name', type=str, default='model')
parser.add_argument('--dataset', type=str, default=None)
parser.add_argument('--batch_size', type=int, default=1)
parser.add_argument('--max_datasets_size', type=int, default=None)
parser.add_argument('--answer_path', type=str, default="answer")
parser.add_argument('--max_length', type=int, default=1024)
args = parser.parse_args()
generate(args)

@ -1,25 +0,0 @@
device_number=number of your devices
model_name="name of your model"
model_path="path to your model"
dataset="path to the question dataset"
answer_path="path to save the model answers"
torchrun --standalone --nproc_per_node=$device_number generate_answers.py \
--model 'llama' \
--strategy ddp \
--model_path $model_path \
--model_name $model_name \
--dataset $dataset \
--batch_size 8 \
--max_datasets_size 80 \
--answer_path $answer_path \
--max_length 512
python merge.py \
--model_name $model_name \
--shards $device_number \
--answer_path $answer_path \
for (( i=0; i<device_number; i++ )) do
rm -rf "${answer_path}/${model_name}_answers_rank${i}.json"
done

@ -1,98 +0,0 @@
# Adapted form https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/qa_baseline_gpt35.py
# Copyright 2023 LM-SYS@FastChat
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import json
import os
import time
import concurrent.futures
import openai
import tqdm
import shortuuid
import logging
from utils import jload, jdump
MODEL = 'gpt-3.5-turbo'
MAX_API_RETRY = 3
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def get_answer(question: str, max_tokens: int):
answer = question
prompt = question['instruction'] if question['input'] == "" else question['instruction'] + \
" " + question['input']
for _ in range(MAX_API_RETRY):
try:
response = openai.ChatCompletion.create(
model='gpt-3.5-turbo',
messages=[{
'role': 'system',
'content': 'You are a helpful assistant.'
}, {
'role': 'user',
'content': prompt,
}],
max_tokens=max_tokens,
)
answer['output'] = response['choices'][0]['message']['content']
return answer
except Exception as e:
logger.error(e)
time.sleep(1)
logger.error(f' Answer {question["id"]} failed after {MAX_API_RETRY} retries.')
return answer
def evaluate_gpt35(args):
questions=jload(args.dataset)
logger.info(
f' Total number of answers: {len(questions)}.')
logger.info(
f' Waiting for {args.request_time_gap} seconds before sending the next request.')
answers = []
with concurrent.futures.ThreadPoolExecutor(max_workers=args.num_workers) as executor:
futures = []
for question in questions:
future = executor.submit(get_answer, question, args.max_tokens)
futures.append(future)
for future in tqdm.tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
answers.append(future.result())
answers.sort(key=lambda x: x['id'])
jdump(answers, os.path.join(args.answer_path,
f'gpt35_answers.json'))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Evaluate GPT 3.5.')
parser.add_argument('--dataset', type=str, default="questions.json")
parser.add_argument('--answer_path', type=str, default="answer")
parser.add_argument('--num_workers', type=int, default=4)
parser.add_argument('--openai_key', type=str, default=None)
parser.add_argument('--max_tokens', type=int, default=1024)
args = parser.parse_args()
if args.openai_key is not None:
os.environ["OPENAI_API_KEY"] = args.openai_key
openai.api_key = os.getenv("OPENAI_API_KEY")
evaluate_gpt35(args)

@ -1,6 +0,0 @@
python generate_gpt35_answers.py \
--dataset "path to the question dataset" \
--answer_path "path to answer folder" \
--num_workers 4 \
--openai_key "your openai key" \
--max_tokens 512 \

@ -0,0 +1,496 @@
import concurrent.futures
import os
import re
import time
from copy import deepcopy
from typing import Any, Dict, List
import matplotlib.pyplot as plt
import numpy as np
import openai
import pandas as pd
import seaborn as sns
import tqdm
from utils import jdump, jload
def get_battle_result(sys_prompt: str, user_prompt: str, id: int, max_tokens: int = 2048) -> Dict[str, Any]:
"""
Get evaluation from GPT-4.
Args:
sys_prompt: prompt for the system.
user_prompt: prompt for the user.
id: id of the answers for comparison.
max_tokens: the maximum number of tokens to generate in the chat completion.
Returns:
An evaluation of one comparison.
"""
MAX_API_RETRY = 3
for _ in range(MAX_API_RETRY):
try:
response = openai.ChatCompletion.create(
model="gpt-4",
messages=[
{
"role": "system",
"content": sys_prompt
},
{
"role": "user",
"content": user_prompt,
},
],
temperature=0.2,
max_tokens=max_tokens,
)
evaluation = response["choices"][0]["message"]["content"]
return {"evaluation": evaluation, "id": id}
except Exception as e:
print(e)
time.sleep(1)
print(f" Evaluation {id} failed after {MAX_API_RETRY} retries.")
return {"evaluation": "", "id": id}
def parse_battle_score(evaluation: str) -> List[float]:
"""
Parse evaluation from GPT-4 and get the scores of model 1 and 2.
Args:
evaluation: evaluation from GPT-4.
Returns:
A score pair of two different model answers.
"""
try:
pattern = re.compile("([0-9]|10) out of 10")
sp = re.findall(pattern, evaluation)
if len(re.findall(pattern, evaluation)) == 2:
return [float(sp[0]), float(sp[1])]
pattern = re.compile("a score of ([0-9]|10)")
sp = re.findall(pattern, evaluation)
if len(re.findall(pattern, evaluation)) == 2:
return [float(sp[0]), float(sp[1])]
pattern = re.compile("([0-9]|10)/10")
sp = re.findall(pattern, evaluation)
if len(re.findall(pattern, evaluation)) == 2:
return [float(sp[0]), float(sp[1])]
score_pair = evaluation.split("\n")[0]
score_pair = score_pair.replace(",", " ")
sp = score_pair.split(" ")
if len(sp) == 2:
return [float(sp[0]), float(sp[1])]
else:
raise Exception(f"Invalid score pair. Got {evaluation}.")
except Exception as e:
return [-1, -1]
def battle(answer1: List[Dict], answer2: List[Dict], prompt_dict: Dict[str, Any]) -> List[Dict]:
"""
Use GPT-4 to compare answers of two different models.
Args:
answer1: answers of model 1.
answer2: answers of model 2.
prompt_dict: prompt for battle.
Returns:
Evaluations of all comparison pairs.
"""
assert len(answer1) == len(answer2)
handles = []
evaluation_file = []
total_len = len(answer1)
question_idx_list = list(range(total_len))
print(f" Total number of answers: {len(answer1)}.")
evaluations = []
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
futures = []
for i in question_idx_list:
assert answer1[i]["id"] == answer2[i]["id"]
answer_id = answer1[i]["id"]
ques = answer1[i]["instruction"] if answer1[i][
"input"] == "" else answer1[i]["instruction"] + " " + answer1[i]["input"]
cat = answer1[i]["category"]
ans1 = answer1[i]["output"]
ans2 = answer2[i]["output"]
sys_prompt = prompt_dict["system_prompt"]
prompt_template = prompt_dict["prompt_template"]
prompt = prompt_template.format(
question=ques,
answer_1=ans1,
answer_2=ans2,
prompt=prompt_dict["prompt"],
)
future = executor.submit(get_battle_result, sys_prompt, prompt, answer_id, 2048)
futures.append(future)
for future in tqdm.tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
evaluations.append(future.result())
evaluations.sort(key=lambda x: x["id"])
return evaluations
def save_battle_results(evaluations: List[Dict], name1: str, name2: str, save_path: str) -> None:
"""
Save evaluation results (model 1 vs model 2) from GPT-4.
Args:
evaluations: evaluation results from GPT-4.
name1: model 1 's name.
name2: model 2 's name.
save_path: path to save battle results.
"""
evaluation_file = deepcopy(evaluations)
ans1_score = 0
ans2_score = 0
better_count = 0
worse_count = 0
tie_count = 0
invalid_count = 0
better_file = []
worse_file = []
tie_file = []
invalid_file = []
for idx, evaluation in enumerate(evaluations):
scores = parse_battle_score(evaluation["evaluation"])
evaluation_file[idx]["score"] = scores
if scores[0] == -1 and scores[1] == -1:
invalid_count += 1
invalid_file.append(evaluation_file[idx])
print(f'Invalid score pair: {evaluation_file[idx]["id"]}.')
else:
if scores[0] > scores[1]:
worse_count += 1
worse_file.append(evaluation_file[idx])
elif scores[0] < scores[1]:
better_count += 1
better_file.append(evaluation_file[idx])
else:
tie_count += 1
tie_file.append(evaluation_file[idx])
ans1_score += scores[0]
ans2_score += scores[1]
prefix = f"{name1}_vs_{name2}"
if not os.path.exists(save_path):
os.makedirs(save_path)
jdump(better_file, os.path.join(save_path, prefix, f"{name2}_better.json"))
jdump(worse_file, os.path.join(save_path, prefix, f"{name2}_worse.json"))
jdump(tie_file, os.path.join(save_path, prefix, f"{prefix}_tie.json"))
jdump(invalid_file, os.path.join(save_path, prefix, f"{prefix}_invalid.json"))
jdump(evaluation_file, os.path.join(save_path, prefix, f"{prefix}_evaluations.json"))
if os.path.exists(os.path.join(save_path, "battle_results.json")):
results = jload(os.path.join(save_path, "battle_results.json"))
else:
results = {}
results[prefix] = {
"model": [name1, name2],
"better": better_count,
"worse": worse_count,
"tie": tie_count,
"win_rate": better_count / (len(evaluations) - invalid_count),
"score": [
ans1_score / (len(evaluations) - invalid_count),
ans2_score / (len(evaluations) - invalid_count),
],
}
jdump(results, os.path.join(save_path, "battle_results.json"))
print(f"Total {invalid_count} invalid score pair(s).")
print(f"Model {name2} has {better_count} better answer(s).")
print(f"Model {name2} has {worse_count} worse answer(s).")
print(f"{tie_count} answer(s) play(s) to a tie.")
print(f"Win rate of model {name2}: {better_count/(len(evaluations)-invalid_count):.2f}")
print(f"Model {name1} average score: {ans1_score/(len(evaluations)-invalid_count):.2f}")
print(f"Model {name2} average score: {ans2_score/(len(evaluations)-invalid_count):.2f}")
def get_gpt35_evaluation(prompt: Dict[str, Any],
inst: Dict[str, Any],
metrics: List[str],
max_tokens: int = 2048) -> Dict[str, Any]:
"""
Use GPT-3.5 to evaluate one model answer.
Args:
prompt: a dictionary including prompt template, CoT and metrics.
inst: the instruction that is needed to be evaluated.
metrics: the metrics for evaluation.
max_tokens: the maximum number of tokens to generate in the completion.
Returns:
An evaluation of one answer.
"""
MAX_API_RETRY = 3
question = (inst["instruction"] if inst["input"] == "" else inst["instruction"] + " " + inst["input"])
answer = inst["output"]
inst["evaluation"] = {}
for metric in metrics:
if prompt["metrics"].get(metric, None) is None:
raise Exception(
f"Unsupported metric {metric} for category {inst['category']}! You should add this metric in the prompt file!"
)
for i in range(MAX_API_RETRY):
try:
response = openai.Completion.create(
model="text-davinci-003",
prompt=prompt["prompt"].format(
question=question,
answer=answer,
metric=prompt["metrics"][metric],
steps=prompt["CoT"][metric],
),
logprobs=5,
temperature=0,
max_tokens=max_tokens,
)
inst["evaluation"][metric] = {
"response": response["choices"][0]["text"],
"logprobs": response["choices"][0]["logprobs"]["top_logprobs"],
}
break
except Exception as e:
print(e)
time.sleep(1)
return inst
def gpt35_evaluate(
answers: List[Dict],
prompt: Dict[str, Any],
metrics: List[str],
category: str,
) -> List[Dict]:
"""
Use GPT-3.5 to evaluate model answers and save evaluation results.
Args:
answers: model answers.
prompt: prompt for GPT-3.5 evaluation.
metrics: metrics for GPT-3.5 evaluation.
category: the category of the model answers for evaluation.
Returns:
Evaluations of the given answers.
"""
print(f"The number of instances of category {category}'s is {len(answers)}.")
evaluations = []
metrics_str = ", ".join(x for x in metrics)
print(f"Category {category}'s metrics are {metrics_str}.")
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
futures = []
for inst in answers:
future = executor.submit(get_gpt35_evaluation, prompt, inst, metrics, 1)
futures.append(future)
for future in tqdm.tqdm(
concurrent.futures.as_completed(futures),
desc=f"{category}: ",
total=len(futures),
):
evaluations.append(future.result())
evaluations.sort(key=lambda x: x["id"])
print(f"{category} done.")
return evaluations
def calculate_scores_form_logprobs(logprobs: Dict[str, Any]) -> float:
"""
Calculate score from log probabilities returned by text-davinci-003.
Only openai.Completion can return logprobs.
Calculation formula:
score = sum(score_i * exp(value)) where score_i is the score which corresponds to the key(predicted token) and value is its log probability.
Ref: https://arxiv.org/abs/2303.16634
This paper proposes NLG evaluation methods using GPT-3.5(logprobs returned by openai api) and GPT-4(logprobs obtained by sampling).
Args:
logprobs: logprobs returned by openai.Completion.
Returns:
Score of one answer.
"""
# GPT-3.5 only returns score of 1 to 5.
prob = np.zeros(5)
for key, value in logprobs.items():
# Sometimes the key will be one byte of a unicode character which takes the form of "bytes:\\xe7".
# It is meaningless and thus we don't calculate probability.
if "bytes" in key:
continue
# results[0] is the score which corresponds to the key(predicted token).
# For example, key "5" corresponds to score 5.
results = re.findall(r"\d", key)
if len(results) == 1:
prob[int(results[0]) - 1] = prob[int(results[0]) - 1] + np.exp(value)
score = np.dot(np.arange(1, 6), prob)
return score
def save_gpt35_evaluation_statistics(model_name: str, evaluations: List[Dict], save_path: str) -> None:
"""
Generate statistics for one model.
Args:
model_name: name of the model for saving statistics.
evaluations: evaluations for all of the model answers.
save_path: path to save GPT-3.5 evaluation statistics.
"""
if not os.path.exists(save_path):
os.makedirs(save_path)
data_per_category = {}
for evaluation in evaluations:
category = evaluation["category"]
if evaluation["category"] in data_per_category.keys():
data_per_category[category].append(evaluation)
else:
data_per_category[category] = [evaluation]
all_statistics = {}
for category, data in data_per_category.items():
metrics = data[0]["evaluation"].keys()
scores = {metric: [] for metric in metrics}
for evaluation in data:
for metric in metrics:
scores[metric].append(calculate_scores_form_logprobs(evaluation["evaluation"][metric]["logprobs"][0]))
statistics = {}
for metric in metrics:
arg_sort = np.argsort(scores[metric])
statistics[metric] = {}
statistics[metric]["avg_score"] = sum(scores[metric]) / len(data)
statistics[metric]["best_3"] = {data[i]["id"]: scores[metric][i] for i in arg_sort[-3:][::-1]}
statistics[metric]["worst_3"] = {data[i]["id"]: scores[metric][i] for i in arg_sort[:3]}
all_statistics[category] = statistics
jdump(
all_statistics,
os.path.join(save_path, f"{model_name}_evaluation_statistics.json"),
)
def analyze_gpt35_evaluation_statistics(statistics_path: str, save_path: str) -> None:
"""
Analyze and visualize all GPT-3.5 evaluation statistics in the given directory.
Args:
statistics_path: path to all the models' statistics.
save_path: path to save table and visualization results.
"""
if not os.path.exists(statistics_path):
raise Exception(f'The given directory "{statistics_path}" doesn\'t exist! No statistics found!')
all_statistics = {}
for file_name in os.listdir(statistics_path):
if file_name.endswith("_evaluation_statistics.json"):
model_name = file_name.split("_evaluation_statistics.json")[0]
all_statistics[model_name] = jload(os.path.join(statistics_path, file_name))
if len(list(all_statistics.keys())) == 0:
raise Exception(f'There are no statistics in the given directory "{statistics_path}"!')
frame_all = {
"model": [],
"category": [],
"metric": [],
"avg_score": [],
"best_3": [],
"worst_3": [],
}
frame_per_category = {}
for model_name, model_statistics in all_statistics.items():
for category, category_statistics in model_statistics.items():
if frame_per_category.get(category) is None:
frame_per_category[category] = {
"model": [],
"metric": [],
"avg_score": [],
"best_3": [],
"worst_3": [],
}
for metric, metric_statistics in category_statistics.items():
frame_all["model"].append(model_name)
frame_all["category"].append(category)
frame_all["metric"].append(metric)
frame_all["avg_score"].append(metric_statistics["avg_score"])
frame_all["best_3"].append(metric_statistics["best_3"])
frame_all["worst_3"].append(metric_statistics["worst_3"])
frame_per_category[category]["model"].append(model_name)
frame_per_category[category]["metric"].append(metric)
frame_per_category[category]["avg_score"].append(metric_statistics["avg_score"])
frame_per_category[category]["best_3"].append(metric_statistics["best_3"])
frame_per_category[category]["worst_3"].append(metric_statistics["worst_3"])
if not os.path.exists(save_path):
os.makedirs(save_path)
frame_all = pd.DataFrame(frame_all)
frame_all.to_csv(os.path.join(save_path, "gpt35_evaluation_statistics.csv"))
for category in tqdm.tqdm(
frame_per_category.keys(),
desc=f"category: ",
total=len(frame_per_category.keys()),
):
data = pd.DataFrame(frame_per_category[category])
sns.set()
fig = plt.figure(figsize=(16, 10))
plt.ylim((0, 5))
fig = sns.barplot(x="metric", y="avg_score", hue="model", data=data, dodge=True)
fig.set_title(f"Comparison between Different Models for Category {category.title()}")
plt.xlabel("Evaluation Metric")
plt.ylabel("Average Score")
figure = fig.get_figure()
figure.savefig(os.path.join(save_path, f"{category}.png"), dpi=400)

@ -1,25 +0,0 @@
import argparse
import os
from utils import jload, jdump
def generate(args):
dataset = []
for i in range(args.shards):
shard = jload(os.path.join(args.answer_path,
f'{args.model_name}_answers_rank{i}.json'))
dataset.extend(shard)
dataset.sort(key=lambda x: x['id'])
jdump(dataset, os.path.join(args.answer_path,
f'{args.model_name}_answers.json'))
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--model_name', type=str, default='model')
parser.add_argument('--shards', type=int, default=4)
parser.add_argument('--answer_path', type=str, default="answer")
args = parser.parse_args()
generate(args)

@ -0,0 +1,169 @@
import statistics
import jieba
from bert_score import score
from nltk.translate.bleu_score import sentence_bleu
from rouge_chinese import Rouge as Rouge_cn
from sklearn.metrics import f1_score, precision_score, recall_score
def bleu_score(preds: list, targets: list) -> dict:
"""Calculate BLEU Score Metric
The calculation includes BLEU-1 for unigram, BLEU-2 for bigram,
BLEU-3 for trigram and BLEU-4 for 4-gram. Unigram evaluates the
accuracy in word level, other n-gram evaluate the fluency in
sentence level.
"""
bleu_scores = {"bleu1": 0, "bleu2": 0, "bleu3": 0, "bleu4": 0}
cumulative_bleu = [0] * 4
weights = [(1. / 1., 0., 0., 0.), (1. / 2., 1. / 2., 0., 0.), (1. / 3., 1. / 3., 1. / 3., 0.),
(1. / 4., 1. / 4., 1. / 4., 1. / 4.)]
for pred, target in zip(preds, targets):
pred_list = (' '.join(jieba.cut(pred))).split()
target_list = [(' '.join(jieba.cut(target))).split()]
bleu = sentence_bleu(target_list, pred_list, weights=weights)
cumulative_bleu = [a + b for a, b in zip(cumulative_bleu, bleu)]
for i in range(len(cumulative_bleu)):
bleu_scores[f"bleu{i+1}"] = cumulative_bleu[i] / len(preds)
return bleu_scores
def rouge_cn_score(preds: list, targets: list) -> dict:
"""Calculate Chinese ROUGE Score Metric
The calculation includes ROUGE-1 for unigram, ROUGE-2 for bigram
and ROUGE-L. ROUGE-N evaluates the number of matching n-grams between
the preds and targets. ROUGE-L measures the number of matching
longest common subsequence (LCS) between preds and targets.
"""
rouge_scores = {"rouge1": {}, "rouge2": {}, "rougeL": {}}
all_preds = []
all_targets = []
for pred, target in zip(preds, targets):
pred_list = ' '.join(jieba.cut(pred))
target_list = ' '.join(jieba.cut(target))
all_preds.append(pred_list)
all_targets.append(target_list)
rouge_cn = Rouge_cn()
rouge_avg = rouge_cn.get_scores(all_preds, all_targets, avg=True)
rouge_scores["rouge1"] = rouge_avg["rouge-1"]["f"]
rouge_scores["rouge2"] = rouge_avg["rouge-2"]["f"]
rouge_scores["rougeL"] = rouge_avg["rouge-l"]["f"]
return rouge_scores
def distinct_score(preds: list) -> dict:
"""Calculate Distinct Score Metric
This metric refers to https://arxiv.org/abs/1510.03055.
It evaluates the diversity of generation text by counting
the unique n-grams.
"""
distinct_score = {"distinct": 0}
cumulative_distinct = []
for pred in preds:
pred_seg_list = list(' '.join(jieba.cut(pred)))
count_segs = len(pred_seg_list)
unique_segs = set(pred_seg_list)
count_unique_chars = len(unique_segs)
cumulative_distinct.append(count_unique_chars / count_segs)
distinct_score["distinct"] = statistics.mean(cumulative_distinct)
return distinct_score
def bert_score(preds: list, targets: list) -> dict:
"""Calculate BERTScore Metric
The BERTScore evaluates the semantic similarity between
tokens of preds and targets with BERT.
"""
bert_score = {"bert_score": 0}
pred_list = []
target_list = []
for pred, target in zip(preds, targets):
pred_list.append(' '.join(jieba.cut(pred)))
target_list.append(' '.join(jieba.cut(target)))
_, _, F = score(pred_list, target_list, lang="zh", verbose=True)
bert_score["bert_score"] = F.mean().item()
return bert_score
def calculate_precision_recall_f1(preds: list, targets: list) -> dict:
"""Precision, Recall and F1-Score Calculation
The calculation of precision, recall and f1-score is realized by counting
the number f overlaps between the preds and target. The comparison length
limited by the shorter one of preds and targets. This design is mainly
considered for classifiction and extraction categories.
"""
precision_recall_f1 = {"precision": 0, "recall": 0, "f1_score": 0}
precision_scores = []
recall_scores = []
f1_scores = []
for pred, target in zip(preds, targets):
pred_list = [char for char in pred]
target_list = [char for char in target]
target_labels = [1] * min(len(target_list), len(pred_list))
pred_labels = [int(pred_list[i] == target_list[i]) for i in range(0, min(len(target_list), len(pred_list)))]
precision_scores.append(precision_score(target_labels, pred_labels, zero_division=0))
recall_scores.append(recall_score(target_labels, pred_labels, zero_division=0))
f1_scores.append(f1_score(target_labels, pred_labels, zero_division=0))
precision_recall_f1["precision"] = statistics.mean(precision_scores)
precision_recall_f1["recall"] = statistics.mean(recall_scores)
precision_recall_f1["f1_score"] = statistics.mean(f1_scores)
return precision_recall_f1
def precision(preds: list, targets: list) -> dict:
"""Calculate Precision Metric
(design for classifiction and extraction categories)
Calculating precision by counting the number of overlaps between the preds and target.
"""
precision = {"precision": 0}
precision["precision"] = calculate_precision_recall_f1(preds, targets)["precision"]
return precision
def recall(preds: list, targets: list) -> dict:
"""Calculate Recall Metric
(design for classifiction and extraction categories)
Calculating recall by counting the number of overlaps between the preds and target.
"""
recall = {"recall": 0}
recall["recall"] = calculate_precision_recall_f1(preds, targets)["recall"]
return recall
def F1_score(preds: list, targets: list) -> dict:
"""Calculate F1-score Metric
(design for classifiction and extraction categories)
Calculating f1-score by counting the number of overlaps between the preds and target.
"""
f1 = {"f1_score": 0}
f1["f1_score"] = calculate_precision_recall_f1(preds, targets)["f1_score"]
return f1

@ -0,0 +1,6 @@
{
"id": 1,
"system_prompt": "你是一个检查回答质量的好助手。",
"prompt_template": "[问题]\n{question}\n\n[1号AI助手的答案]\n{answer_1}\n\n[1号AI助手答案终止]\n\n[2号AI助手的答案]\n{answer_2}\n\n[2号AI助手答案终止]\n\n[要求]\n{prompt}\n\n",
"prompt": "我们需要你评价这两个AI助手回答的性能。\n请对他们的回答的有用性、相关性、准确性、详细程度进行评分。每个AI助手都会得到一个1到10分的总分分数越高表示整体表现越好。\n请首先输出一行该行只包含两个数值分别表示1号和2号AI助手的分数。这两个分数之间要有一个空格。在随后的一行中请对你的评价作出全面的解释避免任何潜在的偏见并确保AI助手回答的顺序不会影响您的判断。"
}

@ -0,0 +1,179 @@
[
{
"id": 1,
"category": "brainstorming",
"metrics": {
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
"creativity": "创意性(1-5):某些头脑风暴问题可能需要答案具有创意,提出新的思路。",
"practicality": "实用性(1-5):某些头脑风暴问题可能需要答案提出实用的建议或解决方法。",
"correctness": "正确性(1-5):答案应该符合常识、生活实际等等。"
},
"CoT": {
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织并给出一个1到5的分数其中5表示语言组织非常好而1表示语言组织非常差。\n\n语言组织",
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度并给出一个1到5的分数其中5表示答案非常切题而1表示答案完全没有切题。\n\n切题",
"creativity": "1. 仔细阅读所提供的头脑风暴问题,确保你理解问题的要点和背景。\n2. 根据你的知识和经验,判断所提供的答案是否可行。如果答案不可行,则创意性评分可能会受到影响。\n3. 考虑答案中是否包含新颖的想法或独特的思路。答案可能与已知的解决方案有所重叠,但仍然可以被认为是有创意的,只要它提供了新的角度或方法来解决问题。\n4. 根据答案的创意性给出一个1到5的评分。如果答案缺乏创意则应给出一个较低的评分。如果答案具有创意并提供了新的思路应给出一个较高的评分。\n\n创意性",
"practicality": "1. 仔细阅读所提供的头脑风暴问题,确保你理解问题的要点和背景。\n2. 根据你的知识和经验,判断所提供的答案是否可行。如果答案不可行,则实用性评分可能会受到影响。\n3. 考虑答案中提出的建议或解决方法是否实用并可行。答案可能看起来很好,但如果无法实现或应用,则实用性评分可能会受到影响。\n4. 根据答案的实用性给出一个1到5的评分。如果答案缺乏实用性则应给出一个较低的评分。如果答案提出了实用的建议或解决方法并且可以很好地解决问题则应给出一个较高的评分。\n\n实用性",
"correctness": "1. 仔细阅读所提供的头脑风暴问题,确保你理解问题的要点和背景。\n2. 根据你的知识和经验,判断所提供的答案是否可行。如果答案不可行,则正确性评分可能会受到影响。\n3. 考虑答案中所提供的信息是否正确、符合常识、生活实际等等。如果答案中存在明显的错误或不合理之处,则正确性评分可能会受到影响。\n4. 根据答案的正确性给出一个1到5的评分。如果答案存在明显的错误或不合理之处则应给出一个较低的评分。如果答案正确、符合常识、生活实际等等则应给出一个较高的评分。\n\n正确性"
},
"prompt": "你是一个好助手。请你为下面“头脑风暴”问题的答案打分。\n\n问题如下\n\n{question}\n\n答案如下\n\n{answer}\n\n评分的指标如下\n\n{metric}\n\n请你遵照以下的评分步骤\n\n{steps}"
},
{
"id": 2,
"category": "chat",
"metrics": {
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
"naturalness": "自然(1-5):答案是否自然,并且符合问题给定的身份。",
"engagingness": "参与感(1-5):答案是否对前面的对话内容做出了恰当的反应,是否理解对话的语境和背景。",
"reasonableness": "合理性(1-5):答案是否能够与前面的对话内容形成逻辑上的衔接,是否符合常理,能否在这个上下文中合理存在。"
},
"CoT": {
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织并给出一个1到5的分数其中5表示语言组织非常好而1表示语言组织非常差。\n\n语言组织",
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度并给出一个1到5的分数其中5表示答案非常切题而1表示答案完全没有切题。\n\n切题",
"naturalness": "1. 阅读题目,确定题目提供的身份信息。\n2. 检查答案内容是否符合题目给定的身份。\n3. 根据以上因素对该回答的自然性进行打分分数从1到5其中1表示不自然5表示非常自然并符合问题给定的身份。\n\n自然",
"engagingness": "1. 阅读题目,确定对话的语境和背景。\n2. 检查答案是否充分理解对话的语境和背景,能否自然地融入到对话中而不显得突兀。\n3. 根据以上因素对该回答的参与感进行打分分数从1到5其中1表示没有参与感5表示非常有参与感并且恰当地理解了对话的语境和背景。\n\n参与感",
"reasonableness": "1. 阅读题目,确定对话的主题以及问题期望的回答方向。\n2. 判断答案是否能够与前面的对话内容形成逻辑上的衔接,是否符合常理,能否在这个上下文中合理存在。\n3. 根据以上因素对该回答的合理性进行打分分数从1到5其中1表示不合理5表示非常合理并且能够与前面的对话内容形成逻辑上的衔接并符合常理。\n\n合理性"
},
"prompt": "你是一个好助手。请你为下面的“补全对话”问题的答案打分。\n\n问题如下\n\n{question}\n\n答案如下\n\n{answer}\n\n评分的指标如下\n\n{metric}\n\n请你遵照以下的评分步骤\n\n{steps}"
},
{
"id": 3,
"category": "classification",
"metrics": {
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
"correctness": "正确性(1-5):答案是否正确。"
},
"CoT": {
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织并给出一个1到5的分数其中5表示语言组织非常好而1表示语言组织非常差。\n\n语言组织",
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度并给出一个1到5的分数其中5表示答案非常切题而1表示答案完全没有切题。\n\n切题",
"correctness": "1. 仔细阅读题目,尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的则可以将正确性得分为5分。如果答案是部分正确的则可以给予适当的得分例如2分、3分或4分。如果答案完全不正确则只得1分。\n\n正确性"
},
"prompt": "你是一个好助手。请你为下面的“分类“问题的答案打分。\n\n问题如下\n\n{question}\n\n答案如下\n\n{answer}\n\n评分的指标如下\n\n{metric}\n\n请你遵照以下的评分步骤\n\n{steps}"
},
{
"id": 4,
"category": "closed_qa",
"metrics": {
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
"correctness": "正确性(1-5):答案是否正确。"
},
"CoT": {
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织并给出一个1到5的分数其中5表示语言组织非常好而1表示语言组织非常差。\n\n语言组织",
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度并给出一个1到5的分数其中5表示答案非常切题而1表示答案完全没有切题。\n\n切题",
"correctness": "1. 仔细阅读题目,尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的则可以将正确性得分为5分。如果答案是部分正确的则可以给予适当的得分例如2分、3分或4分。如果答案完全不正确则只得1分。\n\n正确性"
},
"prompt": "你是一个好助手。请你为下面问题的答案打分。\n\n问题如下\n\n{question}\n\n需要你评分的答案如下\n\n{answer}\n\n评分的指标如下\n\n{metric}\n\n请你遵照以下的评分步骤\n\n{steps}"
},
{
"id": 5,
"category": "extraction",
"metrics": {
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
"correctness": "准确性(1-5):回答应该准确无误地提取出所需信息,不应该包含任何错误或误导性信息。"
},
"CoT": {
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织并给出一个1到5的分数其中5表示语言组织非常好而1表示语言组织非常差。\n\n语言组织",
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度并给出一个1到5的分数其中5表示答案非常切题而1表示答案完全没有切题。\n\n切题",
"correctness": "1. 仔细阅读问题并确定需要从材料中提取的信息。\n2. 仔细阅读回答并确保它涵盖了所有需要提取的信息。\n3. 使用所提供的材料来验证回答的准确性。如果回答不准确或包含错误或误导性信息,则无法给出高分。\n4. 检查回答是否包含所有要求提取的信息,不要漏掉任何重要细节。\n5. 根据回答的准确性和完整性给出一个介于1和5之间的分数5分表示回答非常准确且完整1分表示回答几乎没有提取出所需信息。\n\n准确性"
},
"prompt": "你是一个好助手。请你为下面的“提取”问题的答案打分。\n\n问题如下\n\n{question}\n\n答案如下\n\n{answer}\n\n评分的指标如下\n\n{metric}\n\n请你遵照以下的评分步骤\n\n{steps}"
},
{
"id": 6,
"category": "generation",
"metrics": {
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
"diversity": "多样性(1-5):答案使用语言是否优美,具有有一定的创造性和想象力。然而,回答也应该保持合理和适度,不要过于夸张或离题。"
},
"CoT": {
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织并给出一个1到5的分数其中5表示语言组织非常好而1表示语言组织非常差。\n\n语言组织",
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度并给出一个1到5的分数其中5表示答案非常切题而1表示答案完全没有切题。\n\n切题",
"diversity": "1. 仔细阅读整个回答,确保完全理解回答所表达的内容和主题。\n2. 在阅读回答的同时,注意语言的质量,例如措辞是否正确,语言是否生动等。\n3. 检查回答的创造性和想象力,看看回答是否能够吸引人阅读下去。\n4. 检查回答的合理性和适度,看看回答是否夸张或离题。\n5. 将多样性的评分打分在1到5之间5分表示回答的质量很好能够吸引人阅读1分表示回答的内容生硬或者有离题的问题。\n\n多样性"
},
"prompt": "你是一个好助手。请你为下面的“生成”问题的答案打分。\n\n问题如下\n\n{question}\n\n答案如下\n\n{answer}\n\n评分的指标如下\n\n{metric}\n\n请你遵照以下的评分步骤\n\n{steps}"
},
{
"id": 7,
"category": "open_qa",
"metrics": {
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
"correctness": "正确性(1-5):答案是否正确。"
},
"CoT": {
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织并给出一个1到5的分数其中5表示语言组织非常好而1表示语言组织非常差。\n\n语言组织",
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度并给出一个1到5的分数其中5表示答案非常切题而1表示答案完全没有切题。\n\n切题",
"correctness": "1. 仔细阅读题目,尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的则可以将正确性得分为5分。如果答案是部分正确的则可以给予适当的得分例如2分、3分或4分。如果答案完全不正确则只得1分。\n\n正确性"
},
"prompt": "你是一个好助手。请你为下面的问题的答案打分。\n\n问题如下\n\n{question}\n\n答案如下\n\n{answer}\n\n评分的指标如下\n\n{metric}\n\n请你遵照以下的评分步骤\n\n{steps}"
},
{
"id": 8,
"category": "rewriting",
"metrics": {
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
"correctness": "正确性(1-5):答案是否正确。"
},
"CoT": {
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织并给出一个1到5的分数其中5表示语言组织非常好而1表示语言组织非常差。\n\n语言组织",
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度并给出一个1到5的分数其中5表示答案非常切题而1表示答案完全没有切题。\n\n切题",
"correctness": "1. 仔细阅读题目,尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的则可以将正确性得分为5分。如果答案是部分正确的则可以给予适当的得分例如2分、3分或4分。如果答案完全不正确则只得1分。\n\n正确性"
},
"prompt": "你是一个好助手。请你为下面的问题的答案打分。\n\n问题如下\n\n{question}\n\n答案如下\n\n{answer}\n\n评分的指标如下\n\n{metric}\n\n请你遵照以下的评分步骤\n\n{steps}"
},
{
"id": 9,
"category": "roleplay",
"metrics": {
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
"fidelity": "保真度(1-5):答案是否能够严格遵守角色的设定回答给定的请求。",
"creativity": "创意性(1-5):角色扮演问题的回答需要具有一定创意,但同时需要遵守角色的设定。"
},
"CoT": {
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织并给出一个1到5的分数其中5表示语言组织非常好而1表示语言组织非常差。\n\n语言组织",
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度并给出一个1到5的分数其中5表示答案非常切题而1表示答案完全没有切题。\n\n切题",
"fidelity": "1. 仔细阅读问题,了解角色在问题中的设定和表现,包括职业、背景、观点、性格等方面。\n2. 阅读题目的请求,确认回答请求时需要注意的细节。\n3. 对比提供的回答与该角色的设定,评估回答是否能够严格遵守角色的设定。\n4. 结合以上评估结果给出保真度的评分范围从1到5分其中1分表示回答与角色设定完全不符5分表示回答完全符合角色设定且满足给定请求。\n\n保真度",
"creativity": "1. 仔细阅读问题,了解角色在问题中的设定和表现,包括职业、背景、观点、性格等方面。\n2. 评估回答是否具有独特的思路和建议,是否能够给提问者带来新的想法和启示。\n3. 对比回答中的创意和该角色的设定,评估回答是否遵守了该角色的设定和基本特征。\n4. 对回答的质量进行总体评估并结合以上评估结果给出创意性的评分范围从1到5分其中1分表示回答缺乏创意5分表示回答具有独特的思路和建议并且能够遵守该角色的设定。\n\n创意性"
},
"prompt": "你是一个好助手。请你为下面的“角色扮演”问题的答案打分。\n\n问题如下\n\n{question}\n\n答案如下\n\n{answer}\n\n评分的指标如下\n\n{metric}\n\n请你遵照以下的评分步骤\n\n{steps}"
},
{
"id": 10,
"category": "summarization",
"metrics": {
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
"correctness": "准确性(1-5):回答应该准确无误地总结出材料的重点。",
"conciseness": "简明扼要(1-5):答案是否简明扼要,没有冗余内容。"
},
"CoT": {
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织并给出一个1到5的分数其中5表示语言组织非常好而1表示语言组织非常差。\n\n语言组织",
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度并给出一个1到5的分数其中5表示答案非常切题而1表示答案完全没有切题。\n\n切题",
"correctness": "1. 仔细阅读问题给的材料,理解其内容和要点。\n2. 评估回答是否准确地总结出原始材料的重点。\n3. 评估回答是否包含原始材料中的所有关键信息。\n4. 根据以上步骤给出一个1-5的分数其中1表示回答不能准确地总结出材料的重点5表示回答完全准确地总结出材料的重点。\n\n准确性",
"conciseness": "1. 阅读题目,提取出材料的重点。\n2. 阅读该总结,并注意其中的主要观点和信息。\n3. 评估总结的长度。一个简明扼要的总结通常应该在几句话或几段文字内传达关键信息,而不是冗长的段落或文章。\n4. 检查总结是否包含与主要观点无关的信息或冗余信息。\n5.确定总结涵盖了材料中的关键信息,并且没有忽略任何重要细节。\n6.给总结打出1-5的分数其中5表示总结简明扼要没有冗余内容而1表示总结冗长或包含不必要的信息难以理解或记忆。根据您的判断打出适当的得分。\n\n简明扼要"
},
"prompt": "你是一个好助手。请你为下面的“总结”问题的答案打分。\n\n问题如下\n\n{question}\n\n答案如下\n\n{answer}\n\n评分的指标如下\n\n{metric}\n\n请你遵照以下的评分步骤\n\n{steps}"
},
{
"id": 11,
"category": "general",
"metrics": {
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
"correctness": "正确性(1-5):答案是否正确。"
},
"CoT": {
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织并给出一个1到5的分数其中5表示语言组织非常好而1表示语言组织非常差。\n\n语言组织",
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度并给出一个1到5的分数其中5表示答案非常切题而1表示答案完全没有切题。\n\n切题",
"correctness": "1. 仔细阅读题目,尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的则可以将正确性得分为5分。如果答案是部分正确的则可以给予适当的得分例如2分、3分或4分。如果答案完全不正确则只得1分。\n\n正确性"
},
"prompt": "你是一个好助手。请你为下面问题的答案打分。\n\n问题如下\n\n{question}\n\n需要你评分的答案如下\n\n{answer}\n\n评分的指标如下\n\n{metric}\n\n请你遵照以下的评分步骤\n\n{steps}"
}
]

@ -0,0 +1,10 @@
jieba
bert-score
rouge_chinese
scikit-metrics
nltk
openai
seaborn
pandas
matplotlib
numpy

@ -1,9 +0,0 @@
[
{
"id": 0,
"instruction": "Help me summarize the following news?",
"input": "National Commercial Bank (NCB), Saudi Arabia's largest lender by assets, agreed to buy rival Samba Financial Group for $15 billion in the biggest banking takeover this year.NCB will pay 28.45 riyals ($7.58) for each Samba share, according to a statement on Sunday, valuing it at about 55.7 billion riyals. NCB will offer 0.739 new shares for each Samba share, at the lower end of the 0.736-0.787 ratio the banks set when they signed an initial framework agreement in June.The offer is a 3.5% premium to Samba's Oct. 8 closing price of 27.50 riyals and about 24% higher than the level the shares traded at before the talks were made public. Bloomberg News first reported the merger discussions.The new bank will have total assets of more than $220 billion, creating the Gulf region's third-largest lender. The entity's $46 billion market capitalization nearly matches that of Qatar National Bank QPSC, which is still the Middle East's biggest lender with about $268 billion of assets.",
"output": "NCB to pay 28.45 riyals for each Samba share. Deal will create Gulf region's third-largest lender",
"category": "closed qa"
}
]

@ -2,10 +2,6 @@ import io
import json import json
import os import os
import torch.distributed as dist
def is_rank_0() -> bool:
return not dist.is_initialized() or dist.get_rank() == 0
def _make_w_io_base(f, mode: str): def _make_w_io_base(f, mode: str):
if not isinstance(f, io.IOBase): if not isinstance(f, io.IOBase):
@ -15,11 +11,13 @@ def _make_w_io_base(f, mode: str):
f = open(f, mode=mode) f = open(f, mode=mode)
return f return f
def _make_r_io_base(f, mode: str): def _make_r_io_base(f, mode: str):
if not isinstance(f, io.IOBase): if not isinstance(f, io.IOBase):
f = open(f, mode=mode) f = open(f, mode=mode)
return f return f
def jdump(obj, f, mode="w", indent=4, default=str): def jdump(obj, f, mode="w", indent=4, default=str):
"""Dump a str or dictionary to a file in json format. """Dump a str or dictionary to a file in json format.
Args: Args:
@ -38,6 +36,7 @@ def jdump(obj, f, mode="w", indent=4, default=str):
raise ValueError(f"Unexpected type: {type(obj)}") raise ValueError(f"Unexpected type: {type(obj)}")
f.close() f.close()
def jload(f, mode="r"): def jload(f, mode="r"):
"""Load a .json file into a dictionary.""" """Load a .json file into a dictionary."""
f = _make_r_io_base(f, mode) f = _make_r_io_base(f, mode)
@ -45,9 +44,19 @@ def jload(f, mode="r"):
f.close() f.close()
return jdict return jdict
def get_json_list(file_path): def get_json_list(file_path):
with open(file_path, 'r') as f: with open(file_path, 'r') as f:
json_list = [] json_list = []
for line in f: for line in f:
json_list.append(json.loads(line)) json_list.append(json.loads(line))
return json_list return json_list
def get_data_per_category(data, categories):
data_per_category = {category: [] for category in categories}
for item in data:
category = item["category"]
data_per_category[category].append(item)
return data_per_category

Loading…
Cancel
Save