diff --git a/applications/Chat/evaluate/README.md b/applications/Chat/evaluate/README.md
index 6113dbbb1..7ace4bfe6 100644
--- a/applications/Chat/evaluate/README.md
+++ b/applications/Chat/evaluate/README.md
@@ -1,26 +1,36 @@
 # Evaluation
 
-In this directory we will introduce how you can evaluate your model with GPT-4.
+In this directory, we introduce how you can evaluate your model with GPT-4.
 
 ## Evaluation Pipeline
 
-The whole evaluation process undergoes two steps.
-
-1. Generate answers from different models: Use `generate_gpt35_answers.py` to generate answers of GPT 3.5 and use `generate_answers.py` to generate answers of your own models.
-2. Evaluate models using GPT 4: Use `evaluate.py` to evaluate model answers with GPT-4.
+The whole evaluation process consists of the following three steps:
+1. Prepare the questions following the internal data structure described in the Data Format section below.
+2. Generate answers from different models:
+   * Generate answers using GPT-3.5: [`generate_gpt35_answers.py`](generate_gpt35_answers.py).
+   * Generate answers using your own models: [`generate_answers.py`](generate_answers.py).
+3. Evaluate models using GPT-4: [`evaluate.py`](evaluate.py).
 
 ### Generate Answers
+#### Generate Answers Using GPT-3.5
+You can provide your own OpenAI key to generate answers from GPT-3.5 using [`generate_gpt35_answers.py`](./generate_gpt35_answers.py).
 
-To generate answers, you should first format [FastChat's]([FastChat/question.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl)) `question.jsonl` file. We do this formatting because we would like to add more questions later and the pipeline for generating new questions may follow that of Self-Instruct and Stanford Alpaca. An example script is given as follows.
-
+An example script is provided as follows:
 ```shell
-python format_questions.py \
-    --questions_path "path to FastChat's question.jsonl" \
-    --save_path "path to the formatted file" \
+python generate_gpt35_answers.py \
+    --dataset "path to the question dataset" \
+    --answer_path "path to answer folder" \
+    --num_workers 4 \
+    --openai_key "your openai key" \
+    --max_tokens 512
+```
 
-```
+#### Generate Answers Using Your Own Model
+You can also generate answers using your own models. The generation process is divided into two stages:
+1. Generate answers in batches, optionally using multiple GPUs: [`generate_answers.py`](./generate_answers.py).
+2. Merge the answer shards into a single file: [`merge.py`](./merge.py).
 
-In `generate_answers.py`, the model will generate answers in a batch way and different GPU processes will do inference on different shards of the given questions. Once all GPU process generate its answers, `merge.py` will merge different shards of answers and output a single answer file. Finally, the script will also remove the answer shards. An example script is given as follows.
+An example script is given as follows:
 
 ```shell
 device_number=number of your devices
@@ -51,21 +61,9 @@ done
 ```
 
-`generate_gpt35_answers.py` will generate answers of GPT-3.5 An example script is given as follows.
-
-```shell
-python generate_gpt35_answers.py \
-    --dataset "path to the question dataset" \
-    --answer_path "path to answer folder" \
-    --num_workers 4 \
-    --openai_key "your openai key" \
-    --max_tokens 512 \
-
-```
-
 ### Evaluate Answers
 
-In `evaluate.py`, GPT-4 will help review and score answers of two different models. Here `Model 1` refers to the first model you specify in the `--answer_file_list` and `Model 2` refers to the second model. The script will finally print several metrics and output corresponding JSON files.
+In [`evaluate.py`](./evaluate.py), GPT-4 helps review and score the answers of two different models. Here `Model 1` refers to the first model you specify in `--answer_file_list` and `Model 2` refers to the second model. The script prints several metrics and outputs the corresponding JSON files.
 
 The metrics include:
 
@@ -107,16 +105,23 @@ We would like to mention that the evaluation of model answers using the GPT-3.5
 
 ## Data Format
 
 ### Questions
 
+The file [`questions.json`](./sample/questions.json) shows example questions used to evaluate the performance of the model. Each question record has the following fields:
+* `id` (int, compulsory): The ID of the instruction / question.
+* `instruction` (str, compulsory): The instruction / question for the LLM.
+* `input` (str, optional): The additional context of the instruction / question.
+* `output` (str, optional): The sample output of the instruction / question.
+* `category` (str, compulsory): The category of the instruction / question.
-We store questions in `questions.json`. The JSON file contains one list. Each element in the list is a question record.
-
-A question record has the following field:
-
-* `category` (str): The category of the question.
-* `instruction` (str): The question.
-* `input` (str): This is empty if you only use [FastChat's]([FastChat/question.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl)) questions.
-* `output` (str): This is empty.
-* `id` (int): The question id.
+Example:
+```
+{
+    "id": 0,
+    "instruction": "Help me summarize the following short story?",
+    "input": "{story}",
+    "output": "{summarized story}",
+    "category": "closed qa"
+}
+```
 
 ### Answers
 
@@ -124,11 +129,11 @@ We store model answers in `{model_name}_answers.json`. The JSON file contains on
 An answer record has the following field:
 
-* `category` (str): The category of the question.
-* `instruction` (str): The question.
-* `input` (str): This is empty if you only use [FastChat's]([FastChat/question.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl)) questions.
-* `output` (str): The answer to the question.
-* `id` (int): The question id.
+* `category` (str, compulsory): The category of the instruction / question.
+* `instruction` (str, compulsory): The instruction / question for the LLM.
+* `input` (str, optional): The additional context of the instruction / question.
+* `output` (str, compulsory): The output from the LLM.
+* `id` (int, compulsory): The ID of the instruction / question.
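+
+For reference, a hypothetical answer record might look as follows (the values below are placeholders rather than output from any real model):
+```
+{
+    "category": "closed qa",
+    "instruction": "Help me summarize the following short story?",
+    "input": "{story}",
+    "output": "{model-generated summary}",
+    "id": 0
+}
+```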
 
 ### Results
 
@@ -136,12 +141,12 @@ We store evaluation results in `results.json`. The JSON file contains one dictio
 
 The value has the following field:
 
-* `model` (list): The names of the two models.
-* `better` (int): The number of reviews where Model 2 receives a higher score.
-* `worse` (int): The number of reviews where Model 2 receives a lower score.
-* `tie` (int): The number of reviews where two models play to a tie.
-* `win_rate` (float): Win rate of Model 2.
-* `score` (list): Average score of the two models.
+* `model` (list, compulsory): The names of the two models.
+* `better` (int, compulsory): The number of reviews where Model 2 receives a higher score.
+* `worse` (int, compulsory): The number of reviews where Model 2 receives a lower score.
+* `tie` (int, compulsory): The number of reviews where the two models tie.
+* `win_rate` (float, compulsory): The win rate of Model 2.
+* `score` (list, compulsory): The average scores of the two models.
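+
+As a purely illustrative sketch (the model names and numbers below are made up rather than taken from a real evaluation), the value of one entry might look like:
+```
+{
+    "model": ["{model 1}", "{model 2}"],
+    "better": 6,
+    "worse": 3,
+    "tie": 1,
+    "win_rate": 0.6,
+    "score": [7.5, 7.9]
+}
+```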
 
 ### Better, Worse, Tie, Invalid, Review
 
@@ -149,24 +154,20 @@ To help better compare the model answers, we store JSON files whose name ends wi
 
 A record has the following field:
 
-* `review_id` (str): Random UUID, not in use.
-* `id` (int): The question id.
-* `reviewer_id` (int): A unique ID for a reviewer. Different reviewer id use different prompts.
-* `metadata` (dict): It is empty.
-* `review` (str): GPT-4 's review.
-* `score` (list): The scores of two models.
+* `review_id` (str, optional): Random UUID, not in use.
+* `id` (int, compulsory): The ID of the instruction / question.
+* `reviewer_id` (int, compulsory): A unique ID for a reviewer. Different reviewer IDs use different prompts.
+* `metadata` (dict, optional): It is empty.
+* `review` (str, optional): GPT-4's review.
+* `score` (list, compulsory): The scores of two models.
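+
+A hypothetical record might look as follows (placeholder values are used wherever the content depends on the actual run):
+```
+{
+    "review_id": "{random UUID}",
+    "id": 0,
+    "reviewer_id": 1,
+    "metadata": {},
+    "review": "{GPT-4's review text}",
+    "score": [7.5, 8.0]
+}
+```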
 
 ### Prompts
 
-The data format is the same with [FastChat's]([FastChat/prompt.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl)) prompts.
+The data format is the same as [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl) prompts.
 
 ### Reviewer
 
-The data format is the same with [FastChat's]([FastChat/reviewer.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl)) reviewers.
-
-## Plan
-
-- [ ] Extend the questions
+The data format is the same as [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl) reviewers.
 
 ## Citations
 
diff --git a/applications/Chat/evaluate/format_questions.py b/applications/Chat/evaluate/format_questions.py
deleted file mode 100644
index 9b47907c3..000000000
--- a/applications/Chat/evaluate/format_questions.py
+++ /dev/null
@@ -1,31 +0,0 @@
-import argparse
-import os
-import json
-import copy
-
-from utils import jdump, get_json_list
-
-
-def format_questions(args):
-    questions = get_json_list(args.questions_path)
-    keys=questions[0].keys()
-
-    formatted_questions=copy.deepcopy(questions)
-    for i in range(len(formatted_questions)):
-        formatted_questions[i]['instruction']=questions[i]['text']
-        formatted_questions[i]['input']=""
-        formatted_questions[i]['output']=""
-        formatted_questions[i]['id']=questions[i]['question_id']
-        for key in keys:
-            if key=="category":
-                continue
-            del formatted_questions[i][key]
-
-    jdump(formatted_questions, args.save_path)
-
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--questions_path', type=str, default='table/question.jsonl')
-    parser.add_argument('--save_path', type=str, default="table/questions.json")
-    args = parser.parse_args()
-    format_questions(args)
\ No newline at end of file
diff --git a/applications/Chat/evaluate/format_questions.sh b/applications/Chat/evaluate/format_questions.sh
deleted file mode 100755
index a7568da36..000000000
--- a/applications/Chat/evaluate/format_questions.sh
+++ /dev/null
@@ -1,3 +0,0 @@
-python format_questions.py \
-    --questions_path "path to FastChat's question.jsonl" \
-    --save_path "path to the formatted file" \
diff --git a/applications/Chat/evaluate/sample/questions.json b/applications/Chat/evaluate/sample/questions.json
new file mode 100644
index 000000000..e9ef9f8b1
--- /dev/null
+++ b/applications/Chat/evaluate/sample/questions.json
@@ -0,0 +1,9 @@
+[
+    {
+        "id": 0,
+        "instruction": "Help me summarize the following news?",
+        "input": "National Commercial Bank (NCB), Saudi Arabia's largest lender by assets, agreed to buy rival Samba Financial Group for $15 billion in the biggest banking takeover this year.NCB will pay 28.45 riyals ($7.58) for each Samba share, according to a statement on Sunday, valuing it at about 55.7 billion riyals. NCB will offer 0.739 new shares for each Samba share, at the lower end of the 0.736-0.787 ratio the banks set when they signed an initial framework agreement in June.The offer is a 3.5% premium to Samba's Oct. 8 closing price of 27.50 riyals and about 24% higher than the level the shares traded at before the talks were made public. Bloomberg News first reported the merger discussions.The new bank will have total assets of more than $220 billion, creating the Gulf region's third-largest lender. The entity's $46 billion market capitalization nearly matches that of Qatar National Bank QPSC, which is still the Middle East's biggest lender with about $268 billion of assets.",
+        "output": "NCB to pay 28.45 riyals for each Samba share. Deal will create Gulf region's third-largest lender",
+        "category": "closed qa"
+    }
+]
\ No newline at end of file