update documentation

pull/3656/head
Tong Li 2023-04-28 11:49:21 +08:00
parent c419117329
commit ed3eaa6922
1 changed file with 43 additions and 35 deletions


# Evaluation
In this directory, we introduce how you can evaluate your model with GPT-4.
## Evaluation Pipeline
The whole evaluation process consists of the following three steps:
1. Prepare the questions following the internal data structure in the data format section (described below).
2. Generate answers from different models:
   * Generate answers using GPT-3.5: [generate_gpt35_answers.py](generate_gpt35_answers.py).
   * Generate answers using your own models: [generate_answers.py](generate_answers.py).
3. Evaluate models using GPT-4: [evaluate.py](evaluate.py).
### Generate Answers
#### Generate Answers Using GPT-3.5
You can provide your own OpenAI key to generate answers from GPT-3.5 using [generate_gpt35_answers.py](./generate_gpt35_answers.py).
An example script is provided as follows:
```shell
python generate_gpt35_answers.py \
    --dataset "path to the question dataset" \
    --answer_path "path to answer folder" \
    --num_workers 4 \
    --openai_key "your openai key" \
    --max_tokens 512
```
#### Generate Answers Using Your Own Model
You can also generate answers using your own models. The generation process is divided into two stages:
1. Generate answers with batch processing, optionally on multiple GPUs, where each GPU process runs inference on a different shard of the questions: [generate_answers.py](./generate_answers.py).
2. Merge the answer shards from all processes into a single answer file and remove the shards: [merge.py](./merge.py).
An example script is given as follows:
```shell
device_number="number of your devices"  # number of GPUs to use for sharded generation
# ... (the rest of the example script is omitted in this diff)
done
```
### Evaluate Answers
In [evaluate.py](./evaluate.py), GPT-4 helps review and score the answers of two different models. Here `Model 1` refers to the first model you specify in `--answer_file_list` and `Model 2` refers to the second model. The script prints several metrics and outputs the corresponding JSON files.
The metrics include:

We store model answers in `{model_name}_answers.json`. The JSON file contains one …
An answer record has the following fields (an illustrative example follows the list):
* `category` (str, compulsory): The category of the instruction / question.
* `instruction` (str, compulsory): The instruction / question for the LLM.
* `input` (str, optional): The additional context of the instruction / question. It is empty if you only use [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl) questions.
* `output` (str, compulsory): The output from the LLM.
* `id` (int, compulsory): The ID of the instruction / question.
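
For concreteness, here is a minimal sketch of one answer record; all values are invented for illustration, and the category name is only an assumption about what your question set might use:

```json
{
    "category": "generic",
    "instruction": "What is the capital of France?",
    "input": "",
    "output": "The capital of France is Paris.",
    "id": 1
}
```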
### Results

We store evaluation results in `results.json`. The JSON file contains one dictionary …
The value has the following fields (an illustrative example follows the list):
* `model` (list, compulsory): The names of the two models.
* `better` (int, compulsory): The number of reviews where Model 2 receives a higher score.
* `worse` (int, compulsory): The number of reviews where Model 2 receives a lower score.
* `tie` (int, compulsory): The number of reviews where the two models tie.
* `win_rate` (float, compulsory): Win rate of Model 2.
* `score` (list, compulsory): Average score of the two models.
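
As an illustration, the value stored for one pair of models might look like the sketch below; the model names and numbers are made up, and the exact way `win_rate` is derived from the counts is whatever `evaluate.py` implements:

```json
{
    "model": ["model_1", "model_2"],
    "better": 30,
    "worse": 20,
    "tie": 10,
    "win_rate": 0.5,
    "score": [7.6, 8.2]
}
```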
### Better, Worse, Tie, Invalid, Review

To help better compare the model answers, we store JSON files whose names end with …
A record has the following fields (an illustrative example follows the list):
* `review_id` (str, optional): Random UUID, not in use.
* `id` (int, compulsory): The ID of the instruction / question.
* `reviewer_id` (int, compulsory): A unique ID for a reviewer. Different reviewer IDs use different prompts.
* `metadata` (dict, optional): It is empty.
* `review` (str, optional): GPT-4's review.
* `score` (list, compulsory): The scores of two models.
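
A minimal sketch of one review record, with the UUID, review text, and scores all invented for illustration:

```json
{
    "review_id": "2c1d8e4f-3b6a-4a9c-8d7e-5f0a1b2c3d4e",
    "id": 1,
    "reviewer_id": 1,
    "metadata": {},
    "review": "Model 1 answers correctly but briefly, while Model 2 gives the same fact with more helpful context.",
    "score": [7.0, 8.5]
}
```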
### Prompts