In this directory, we introduce how you can evaluate your model with GPT-4.
The whole evaluation process consists of the following three steps:
1. Prepare the questions following the internal data structure in the data format section (described below).
2. Generate answers from different models:
    * Generate answers using GPT-3.5: [`generate_gpt35_answers.py`](generate_gpt35_answers.py).
    * Generate answers using your own models: [`generate_answers.py`](generate_answers.py).
3. Evaluate models using GPT-4: [`evaluate.py`](evaluate.py).
### Generate Answers

#### Generate Answers Using GPT-3.5

You can provide your own OpenAI key to generate answers from GPT-3.5 using [`generate_gpt35_answers.py`](./generate_gpt35_answers.py).
An example script is provided as follows; the flag names shown are illustrative rather than the script's confirmed interface, so check its argument parser for the exact options:

```shell
# Only the entry point is confirmed; the flag names below are illustrative.
# Check the argument parser in generate_gpt35_answers.py for the exact options.
python generate_gpt35_answers.py \
    --dataset "path to the question dataset" \
    --answer "path to the answer file" \
    --openai_key "your OpenAI key"
```
#### Generate Answers Using Your Own Model

You can also generate answers using your own models. The generation process is divided into two stages:
1. Generate answers using multiple GPUs (optional) with batch processing: [`generate_answers.py`](./generate_answers.py).
2. Merge multiple shards and output a single file: [`merge.py`](./merge.py).

An example script is given as follows:
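The sketch below keeps only what this README fixes (the two script names and a per-GPU loop); the flag names are assumptions, so consult each script's argument parser for the real options.

```shell
# Stage 1: shard the questions across GPUs and generate answers in parallel.
# (Flag names are assumptions; only the script names come from this README.)
for rank in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=$rank python generate_answers.py \
        --model "path to your model" \
        --dataset "path to the question dataset" \
        --answer "answers/shard_${rank}.json" &
done
wait

# Stage 2: merge the per-GPU shards into a single answer file.
python merge.py \
    --answer_dir "answers" \
    --output "answers/answers.json"
```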
### Evaluate Answers

In [`evaluate.py`](./evaluate.py), GPT-4 helps review and score the answers of two different models. Here `Model 1` refers to the first model you specify in `--answer_file_list` and `Model 2` refers to the second model. The script shows several metrics and outputs the corresponding JSON files.
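For reference, an invocation could look like the sketch below; `--answer_file_list` is the only flag named in this README, so the remaining options are assumptions.

```shell
# --answer_file_list is described above; the other flag names are assumed.
python evaluate.py \
    --answer_file_list "model1_answers.json" "model2_answers.json" \
    --openai_key "your OpenAI key"
```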
We would like to mention that the evaluation of model answers using the GPT-3.5 model is not as accurate as evaluation using GPT-4.
## Data Format

### Questions

The file [`questions.json`](./sample/questions.json) shows the example questions used to evaluate the performance of the model. Each question record has the following fields:

* `id` (id, compulsory): The ID of the instruction / question.
* `instruction` (str, compulsory): The instruction / question for the LLM.
* `input` (str, optional): The additional context of the instruction / question.
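For illustration, a question record with the fields above might look like the following (the values are invented):

```json
{
    "id": 1,
    "instruction": "Summarize the paragraph below in one sentence.",
    "input": "Large language models are trained on massive text corpora..."
}
```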
### Prompts

The data format is the same as [FastChat's prompts](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl).

### Reviewer

The data format is the same as [FastChat's reviewers](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl).

## Citations
