update documentation

pull/3656/head
Tong Li 2023-04-28 11:49:21 +08:00
parent c419117329
commit ed3eaa6922
1 changed file with 43 additions and 35 deletions


# Evaluation
In this directory, we introduce how you can evaluate your model with GPT-4.
## Evaluation Pipeline
The whole evaluation process consists of the following three steps:
1. Prepare the questions following the internal data structure in the data format section (described below).
2. Generate answers from different models:
   * Generate answers using GPT-3.5: [generate_gpt35_answers.py](generate_gpt35_answers.py).
   * Generate answers using your own models: [generate_answers.py](generate_answers.py).
3. Evaluate models using GPT-4: [evaluate.py](evaluate.py).
### Generate Answers
#### Generate Answers Using GPT-3.5
You can provide your own OpenAI key to generate answers from GPT-3.5 using [generate_gpt35_answers.py](./generate_gpt35_answers.py).
An example script is provided as follows:
```shell
python generate_gpt35_answers.py \
    --dataset "path to the question dataset" \
    --answer_path "path to answer folder" \
    --num_workers 4 \
    --openai_key "your openai key" \
    --max_tokens 512
```
#### Generate Answers Using Your Own Model
You can also generate answers using your own models. The generation process is divided into two stages:
1. Generate answers with batch processing, optionally on multiple GPUs, where each GPU process runs inference on a different shard of the questions: [generate_answers.py](./generate_answers.py).
2. Merge the answer shards from all processes into a single answer file and remove the shards: [merge.py](./merge.py).
An example script is given as follows:
```shell
device_number="number of your devices"  # number of GPUs to use for sharded generation
# ... (the rest of the example script is omitted in this diff)
done
```
### Evaluate Answers
In [evaluate.py](./evaluate.py), GPT-4 helps review and score the answers of two different models. Here `Model 1` refers to the first model you specify in `--answer_file_list` and `Model 2` refers to the second model. The script prints several metrics and outputs the corresponding JSON files.
The metrics include:

We store model answers in `{model_name}_answers.json`. The JSON file contains one …
An answer record has the following fields (an illustrative example follows the list):
* `category` (str, compulsory): The category of the instruction / question.
* `instruction` (str, compulsory): The instruction / question for the LLM.
* `input` (str, optional): The additional context of the instruction / question. It is empty if you only use [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl) questions.
* `output` (str, compulsory): The output from the LLM.
* `id` (int, compulsory): The ID of the instruction / question.
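
For concreteness, here is a minimal sketch of one answer record; all values are invented for illustration, and the category name is only an assumption about what your question set might use:

```json
{
    "category": "generic",
    "instruction": "What is the capital of France?",
    "input": "",
    "output": "The capital of France is Paris.",
    "id": 1
}
```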
### Results

We store evaluation results in `results.json`. The JSON file contains one dictionary …
The value has the following fields (an illustrative example follows the list):
* `model` (list, compulsory): The names of the two models.
* `better` (int, compulsory): The number of reviews where Model 2 receives a higher score.
* `worse` (int, compulsory): The number of reviews where Model 2 receives a lower score.
* `tie` (int, compulsory): The number of reviews where the two models tie.
* `win_rate` (float, compulsory): Win rate of Model 2.
* `score` (list, compulsory): Average score of the two models.
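
As an illustration, the value stored for one pair of models might look like the sketch below; the model names and numbers are made up, and the exact way `win_rate` is derived from the counts is whatever `evaluate.py` implements:

```json
{
    "model": ["model_1", "model_2"],
    "better": 30,
    "worse": 20,
    "tie": 10,
    "win_rate": 0.5,
    "score": [7.6, 8.2]
}
```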
### Better, Worse, Tie, Invalid, Review

To help better compare the model answers, we store JSON files whose names end with …
A record has the following fields (an illustrative example follows the list):
* `review_id` (str, optional): Random UUID, not in use.
* `id` (int, compulsory): The ID of the instruction / question.
* `reviewer_id` (int, compulsory): A unique ID for a reviewer. Different reviewer IDs use different prompts.
* `metadata` (dict, optional): It is empty.
* `review` (str, optional): GPT-4's review.
* `score` (list, compulsory): The scores of two models.
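
A minimal sketch of one review record, with the UUID, review text, and scores all invented for illustration:

```json
{
    "review_id": "2c1d8e4f-3b6a-4a9c-8d7e-5f0a1b2c3d4e",
    "id": 1,
    "reviewer_id": 1,
    "metadata": {},
    "review": "Model 1 answers correctly but briefly, while Model 2 gives the same fact with more helpful context.",
    "score": [7.0, 8.5]
}
```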
### Prompts