From aa77ddae334343322e959f2be72e0b48400ca70a Mon Sep 17 00:00:00 2001
From: Tong Li <tong.li352711588@gmail.com>
Date: Thu, 27 Apr 2023 18:51:58 +0800
Subject: [PATCH 1/4] remove unnecessary step and update readme

---
 applications/Chat/evaluate/README.md          |  51 +-
 .../Chat/evaluate/format_questions.py         |  31 -
 .../Chat/evaluate/format_questions.sh         |   3 -
 .../Chat/evaluate/sample/questions.json       | 562 ++++++++++++++++++
 4 files changed, 584 insertions(+), 63 deletions(-)
 delete mode 100644 applications/Chat/evaluate/format_questions.py
 delete mode 100755 applications/Chat/evaluate/format_questions.sh
 create mode 100644 applications/Chat/evaluate/sample/questions.json
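
Reviewer note: the README now documents the question record format directly, with `id`, `instruction` and `category` compulsory and `input` and `output` optional. A minimal sketch of how a record could be checked against that description is given below. It is not part of this patch; the names `validate_question` and `load_and_validate` are hypothetical, and treating `id` as an integer is an assumption based on the README example (`"id": 0`).

```python
import json

# Field names follow the "Questions" section of the README in this patch.
# The Python types are assumptions inferred from the README example.
COMPULSORY = {"id": int, "instruction": str, "category": str}
OPTIONAL = {"input": str, "output": str}


def _has_type(value, expected):
    # bool is a subclass of int in Python, so reject it explicitly for int fields.
    if expected is int and isinstance(value, bool):
        return False
    return isinstance(value, expected)


def validate_question(record):
    """Return a list of problems found in one question record (empty if none)."""
    problems = []
    for key, expected in COMPULSORY.items():
        if key not in record:
            problems.append(f"missing compulsory field '{key}'")
        elif not _has_type(record[key], expected):
            problems.append(f"field '{key}' should be {expected.__name__}")
    for key, expected in OPTIONAL.items():
        if key in record and not _has_type(record[key], expected):
            problems.append(f"field '{key}' should be {expected.__name__}")
    return problems


def load_and_validate(path):
    """Load a JSON list of question records and map each bad index to its problems."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    report = {}
    for index, record in enumerate(records):
        problems = validate_question(record)
        if problems:
            report[index] = problems
    return report
```

Calling `load_and_validate("sample/questions.json")` from the `evaluate` directory would return an empty dict when every record matches the documented fields.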

diff --git a/applications/Chat/evaluate/README.md b/applications/Chat/evaluate/README.md
index 6113dbbb1..f44311f4b 100644
--- a/applications/Chat/evaluate/README.md
+++ b/applications/Chat/evaluate/README.md
@@ -5,21 +5,11 @@ In this directory we will introduce how you can evaluate your model with GPT-4.
 ## Evaluation Pipeline
 
 The whole evaluation process undergoes two steps. 
-
-1. Generate answers from different models: Use `generate_gpt35_answers.py` to generate answers of GPT 3.5 and use `generate_answers.py` to generate answers of your own models.
-2. Evaluate models using GPT 4: Use `evaluate.py` to evaluate model answers with GPT-4.
+1. Prepare the questions in the format described in the Data Format section below.
+2. Generate answers from different models: Use `generate_gpt35_answers.py` to generate answers from GPT-3.5 and use `generate_answers.py` to generate answers from your own models.
+3. Evaluate models using GPT-4: Use `evaluate.py` to evaluate model answers with GPT-4.
 
 ### Generate Answers
-
-To generate answers, you should first format [FastChat's]([FastChat/question.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl)) `question.jsonl` file. We do this formatting because we would like to add more questions later and the pipeline for generating new questions may follow that of Self-Instruct and Stanford Alpaca. An example script is given as follows.
-
-```shell
-python format_questions.py \
-    --questions_path "path to FastChat's question.jsonl" \
-    --save_path "path to the formatted file" \
-
-```
-
 In `generate_answers.py`, the model will generate answers in a batch way and different GPU processes will do inference on different shards of the given questions. Once all GPU process generate its answers, `merge.py` will merge different shards of answers and output a single answer file. Finally, the script will also remove the answer shards. An example script is given as follows.
 
 ```shell
@@ -107,16 +97,23 @@ We would like to mention that the evaluation of model answers using the GPT-3.5
 ## Data Format
 
 ### Questions
+The file [questions.json](./sample/questions.json) shows the example questions used to evaluate the performance of the model. The current sample questions are collected from [FastChat](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl). Each question record has the following field:
+* `id` (id, compulsory): The ID of the instruction / question.
+* `instruction` (str, compulsory): The instruction / question for the LLM.
+* `input` (str, optional): The additional context of the instruction / question.
+* `output` (str, optional): The sample output of the instruction / question.
+* `category` (str, compulsory): The category of the instruction / question.
 
-We store questions in `questions.json`. The JSON file contains one list. Each element in the list is a question record.
-
-A question record has the following field:
-
-* `category` (str): The category of the question.
-* `instruction` (str): The question.
-* `input` (str): This is empty if you only use [FastChat's]([FastChat/question.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl)) questions.
-* `output` (str): This is empty.
-* `id` (int): The question id.
+Example:
+```json
+{
+    "id": 0,
+    "instruction": "Help me summarize the following short story?",
+    "input": "{story}",
+    "output": "{summarized story}",
+    "category": "closed qa"
+}
+```
 
 ### Answers
 
@@ -126,7 +123,7 @@ An answer record has the following field:
 
 * `category` (str): The category of the question.
 * `instruction` (str): The question.
-* `input` (str): This is empty if you only use [FastChat's]([FastChat/question.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl)) questions.
+* `input` (str): This is empty if you only use [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl) questions.
 * `output` (str): The answer to the question.
 * `id` (int): The question id.
 
@@ -158,15 +155,11 @@ A record has the following field:
 
 ### Prompts
 
-The data format is the same with [FastChat's]([FastChat/prompt.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl)) prompts.
+The data format is the same as that of [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl) prompts.
 
 ### Reviewer
 
-The data format is the same with [FastChat's]([FastChat/reviewer.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl)) reviewers.
-
-## Plan
-
-- [ ] Extend the questions
+The data format is the same as that of [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl) reviewers.
 
 ## Citations
 
diff --git a/applications/Chat/evaluate/format_questions.py b/applications/Chat/evaluate/format_questions.py
deleted file mode 100644
index 9b47907c3..000000000
--- a/applications/Chat/evaluate/format_questions.py
+++ /dev/null
@@ -1,31 +0,0 @@
-import argparse
-import os
-import json
-import copy
-
-from utils import jdump, get_json_list
-
-
-def format_questions(args):
-    questions = get_json_list(args.questions_path)
-    keys=questions[0].keys()
-    
-    formatted_questions=copy.deepcopy(questions)
-    for i in range(len(formatted_questions)):
-        formatted_questions[i]['instruction']=questions[i]['text']
-        formatted_questions[i]['input']=""
-        formatted_questions[i]['output']=""
-        formatted_questions[i]['id']=questions[i]['question_id']
-        for key in keys:
-            if key=="category":
-                continue
-            del formatted_questions[i][key]
-    
-    jdump(formatted_questions, args.save_path)
-
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser()
-    parser.add_argument('--questions_path', type=str, default='table/question.jsonl')
-    parser.add_argument('--save_path', type=str, default="table/questions.json")
-    args = parser.parse_args()
-    format_questions(args)
\ No newline at end of file
diff --git a/applications/Chat/evaluate/format_questions.sh b/applications/Chat/evaluate/format_questions.sh
deleted file mode 100755
index a7568da36..000000000
--- a/applications/Chat/evaluate/format_questions.sh
+++ /dev/null
@@ -1,3 +0,0 @@
-python format_questions.py \
-    --questions_path "path to FastChat's question.jsonl" \
-    --save_path "path to the formatted file" \
diff --git a/applications/Chat/evaluate/sample/questions.json b/applications/Chat/evaluate/sample/questions.json
new file mode 100644
index 000000000..cbda9c086
--- /dev/null
+++ b/applications/Chat/evaluate/sample/questions.json
@@ -0,0 +1,562 @@
+[
+    {
+        "category": "generic",
+        "instruction": "How can I improve my time management skills?",
+        "input": "",
+        "output": "",
+        "id": 1
+    },
+    {
+        "category": "generic",
+        "instruction": "What are the most effective ways to deal with stress?",
+        "input": "",
+        "output": "",
+        "id": 2
+    },
+    {
+        "category": "generic",
+        "instruction": "What are the main differences between Python and JavaScript programming languages?",
+        "input": "",
+        "output": "",
+        "id": 3
+    },
+    {
+        "category": "generic",
+        "instruction": "How can I increase my productivity while working from home?",
+        "input": "",
+        "output": "",
+        "id": 4
+    },
+    {
+        "category": "generic",
+        "instruction": "Can you explain the basics of quantum computing?",
+        "input": "",
+        "output": "",
+        "id": 5
+    },
+    {
+        "category": "generic",
+        "instruction": "What are the differences between plant-based and animal-based protein sources?",
+        "input": "",
+        "output": "",
+        "id": 6
+    },
+    {
+        "category": "generic",
+        "instruction": "How can I develop my critical thinking skills?",
+        "input": "",
+        "output": "",
+        "id": 7
+    },
+    {
+        "category": "generic",
+        "instruction": "What are the major challenges faced by the education sector today?",
+        "input": "",
+        "output": "",
+        "id": 8
+    },
+    {
+        "category": "generic",
+        "instruction": "What are the primary factors that influence consumer behavior?",
+        "input": "",
+        "output": "",
+        "id": 9
+    },
+    {
+        "category": "generic",
+        "instruction": "What are the most effective strategies for conflict resolution in the workplace?",
+        "input": "",
+        "output": "",
+        "id": 10
+    },
+    {
+        "category": "knowledge",
+        "instruction": "What are some potential implications of using a single-use plastic bottle versus a reusable bottle on both the environment and human health?",
+        "input": "",
+        "output": "",
+        "id": 11
+    },
+    {
+        "category": "knowledge",
+        "instruction": "What factors would you consider when designing an inclusive and accessible public transportation system?",
+        "input": "",
+        "output": "",
+        "id": 12
+    },
+    {
+        "category": "knowledge",
+        "instruction": "How can governments utilize fiscal and monetary policies to combat economic recessions?",
+        "input": "",
+        "output": "",
+        "id": 13
+    },
+    {
+        "category": "knowledge",
+        "instruction": "How do language and cultural barriers affect the way people communicate and form relationships in multicultural societies?",
+        "input": "",
+        "output": "",
+        "id": 14
+    },
+    {
+        "category": "knowledge",
+        "instruction": "Describe a scenario where artificial intelligence could be used to improve the quality and efficiency of healthcare delivery.",
+        "input": "",
+        "output": "",
+        "id": 15
+    },
+    {
+        "category": "knowledge",
+        "instruction": "Explain the process of gene editing using CRISPR-Cas9 technology, and discuss its potential applications and ethical implications.",
+        "input": "",
+        "output": "",
+        "id": 16
+    },
+    {
+        "category": "knowledge",
+        "instruction": "How do vaccinations work to protect individuals and communities from infectious diseases, and what is herd immunity?",
+        "input": "",
+        "output": "",
+        "id": 17
+    },
+    {
+        "category": "knowledge",
+        "instruction": "How do social media platforms influence the way people consume and share news, and what are the potential implications for the spread of misinformation?",
+        "input": "",
+        "output": "",
+        "id": 18
+    },
+    {
+        "category": "knowledge",
+        "instruction": "How do cultural, social, and economic factors influence people's food choices, and how can this knowledge be used to promote healthier diets?",
+        "input": "",
+        "output": "",
+        "id": 19
+    },
+    {
+        "category": "knowledge",
+        "instruction": "Explain the process of natural selection and how it contributes to the evolution and adaptation of species.",
+        "input": "",
+        "output": "",
+        "id": 20
+    },
+    {
+        "category": "roleplay",
+        "instruction": "How would you introduce yourself as a medieval knight at a royal banquet?",
+        "input": "",
+        "output": "",
+        "id": 21
+    },
+    {
+        "category": "roleplay",
+        "instruction": "As a pirate captain, what would you say to your crew to motivate them to search for hidden treasure?",
+        "input": "",
+        "output": "",
+        "id": 22
+    },
+    {
+        "category": "roleplay",
+        "instruction": "If you were a Shakespearean character, how would you declare your love for someone in a soliloquy?",
+        "input": "",
+        "output": "",
+        "id": 23
+    },
+    {
+        "category": "roleplay",
+        "instruction": "As a superhero, how would you explain your origin story to a curious child?",
+        "input": "",
+        "output": "",
+        "id": 24
+    },
+    {
+        "category": "roleplay",
+        "instruction": "Imagine you are a time traveler from the year 3000. What technological advancements would you tell people about?",
+        "input": "",
+        "output": "",
+        "id": 25
+    },
+    {
+        "category": "roleplay",
+        "instruction": "As a sports commentator, describe the winning play in the final seconds of a championship game.",
+        "input": "",
+        "output": "",
+        "id": 26
+    },
+    {
+        "category": "roleplay",
+        "instruction": "Pretend to be a world-famous chef. How would you describe your signature dish to a panel of judges?",
+        "input": "",
+        "output": "",
+        "id": 27
+    },
+    {
+        "category": "roleplay",
+        "instruction": "You are a mountain climber reaching the summit of Mount Everest. Describe your emotions and the view from the top.",
+        "input": "",
+        "output": "",
+        "id": 28
+    },
+    {
+        "category": "roleplay",
+        "instruction": "As a space colonist on Mars, describe your daily life and the challenges you face living on another planet.",
+        "input": "",
+        "output": "",
+        "id": 29
+    },
+    {
+        "category": "roleplay",
+        "instruction": "Pretend to be a character in a post-apocalyptic world. Describe how you survive and the allies you encounter.",
+        "input": "",
+        "output": "",
+        "id": 30
+    },
+    {
+        "category": "common-sense",
+        "instruction": "How can you determine if a restaurant is popular among locals or mainly attracts tourists, and why might this information be useful?",
+        "input": "",
+        "output": "",
+        "id": 31
+    },
+    {
+        "category": "common-sense",
+        "instruction": "What are some subtle clues that suggest someone is pretending to understand a topic or conversation when they are actually confused or uninformed?",
+        "input": "",
+        "output": "",
+        "id": 32
+    },
+    {
+        "category": "common-sense",
+        "instruction": "Why might someone choose to use a paper map or ask for directions instead of relying on a GPS device or smartphone app?",
+        "input": "",
+        "output": "",
+        "id": 33
+    },
+    {
+        "category": "common-sense",
+        "instruction": "How can you determine if a person is genuinely interested in a conversation or simply being polite?",
+        "input": "",
+        "output": "",
+        "id": 34
+    },
+    {
+        "category": "common-sense",
+        "instruction": "Why might someone prefer to shop at a small, locally-owned business instead of a large chain store, even if the prices are higher?",
+        "input": "",
+        "output": "",
+        "id": 35
+    },
+    {
+        "category": "common-sense",
+        "instruction": "How can you assess the credibility of a source of information, such as a news article or blog post, without relying solely on the reputation of the author or publisher?",
+        "input": "",
+        "output": "",
+        "id": 36
+    },
+    {
+        "category": "common-sense",
+        "instruction": "Why do some people enjoy the sensation of being scared, such as by watching horror movies or going on roller coasters, while others avoid these experiences?",
+        "input": "",
+        "output": "",
+        "id": 37
+    },
+    {
+        "category": "common-sense",
+        "instruction": "How can observing the behavior of other people in a social situation provide clues about cultural norms and expectations?",
+        "input": "",
+        "output": "",
+        "id": 38
+    },
+    {
+        "category": "common-sense",
+        "instruction": "Do we have a moral obligation to explore space, or should we focus on solving Earth's problems first?",
+        "input": "",
+        "output": "",
+        "id": 39
+    },
+    {
+        "category": "common-sense",
+        "instruction": "In a world where automation is becoming increasingly prevalent, is it more important to prioritize job creation or technological progress?",
+        "input": "",
+        "output": "",
+        "id": 40
+    },
+    {
+        "category": "fermi",
+        "instruction": "How many times does the average human blink in a lifetime? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
+        "input": "",
+        "output": "",
+        "id": 41
+    },
+    {
+        "category": "fermi",
+        "instruction": "How many atoms are in a grain of salt? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
+        "input": "",
+        "output": "",
+        "id": 42
+    },
+    {
+        "category": "fermi",
+        "instruction": "How many lightning strikes occur on Earth each day? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
+        "input": "",
+        "output": "",
+        "id": 43
+    },
+    {
+        "category": "fermi",
+        "instruction": "How many balloons would it take to lift a house like in the movie \"Up\"? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
+        "input": "",
+        "output": "",
+        "id": 44
+    },
+    {
+        "category": "fermi",
+        "instruction": "How many text messages are sent globally in a minute? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
+        "input": "",
+        "output": "",
+        "id": 45
+    },
+    {
+        "category": "fermi",
+        "instruction": "How many words are spoken daily on Earth? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
+        "input": "",
+        "output": "",
+        "id": 46
+    },
+    {
+        "category": "fermi",
+        "instruction": "How many snowflakes fall during a typical winter? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
+        "input": "",
+        "output": "",
+        "id": 47
+    },
+    {
+        "category": "fermi",
+        "instruction": "How many pages are in all the books ever written? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
+        "input": "",
+        "output": "",
+        "id": 48
+    },
+    {
+        "category": "fermi",
+        "instruction": "How many times has the Earth orbited the Sun since the beginning of life? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
+        "input": "",
+        "output": "",
+        "id": 49
+    },
+    {
+        "category": "fermi",
+        "instruction": "How many songs have been recorded throughout history? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
+        "input": "",
+        "output": "",
+        "id": 50
+    },
+    {
+        "category": "counterfactual",
+        "instruction": "What if the Internet had been invented during the Renaissance period?",
+        "input": "",
+        "output": "",
+        "id": 51
+    },
+    {
+        "category": "counterfactual",
+        "instruction": "What if the Aztecs had successfully repelled the Spanish conquistadors?",
+        "input": "",
+        "output": "",
+        "id": 52
+    },
+    {
+        "category": "counterfactual",
+        "instruction": "What if the Black Death had not occurred in the 14th century?",
+        "input": "",
+        "output": "",
+        "id": 53
+    },
+    {
+        "category": "counterfactual",
+        "instruction": "What if Isaac Newton had focused on biology instead of physics?",
+        "input": "",
+        "output": "",
+        "id": 54
+    },
+    {
+        "category": "counterfactual",
+        "instruction": "What if the Beatles had never formed as a band?",
+        "input": "",
+        "output": "",
+        "id": 55
+    },
+    {
+        "category": "counterfactual",
+        "instruction": "What if Alan Turing had not cracked the Enigma code during World War II?",
+        "input": "",
+        "output": "",
+        "id": 56
+    },
+    {
+        "category": "counterfactual",
+        "instruction": "What if the Suez Canal had never been constructed?",
+        "input": "",
+        "output": "",
+        "id": 57
+    },
+    {
+        "category": "counterfactual",
+        "instruction": "What if the Maya civilization had never mysteriously collapsed?",
+        "input": "",
+        "output": "",
+        "id": 58
+    },
+    {
+        "category": "counterfactual",
+        "instruction": "What if Christopher Columbus had not discovered the Americas?",
+        "input": "",
+        "output": "",
+        "id": 59
+    },
+    {
+        "category": "counterfactual",
+        "instruction": "What if Vincent van Gogh had been a successful artist during his lifetime?",
+        "input": "",
+        "output": "",
+        "id": 60
+    },
+    {
+        "category": "coding",
+        "instruction": "Develop a C++ program that reads a text file line by line and counts the number of occurrences of a specific word in the file.",
+        "input": "",
+        "output": "",
+        "id": 61
+    },
+    {
+        "category": "coding",
+        "instruction": "Implement a Python function to find the longest common subsequence of two input strings using dynamic programming.",
+        "input": "",
+        "output": "",
+        "id": 62
+    },
+    {
+        "category": "coding",
+        "instruction": "Implement a regular expression in Python to validate an email address.",
+        "input": "",
+        "output": "",
+        "id": 63
+    },
+    {
+        "category": "coding",
+        "instruction": "Write a program to find the nth Fibonacci number using dynamic programming.",
+        "input": "",
+        "output": "",
+        "id": 64
+    },
+    {
+        "category": "coding",
+        "instruction": "Implement a binary search algorithm to find a specific element in a sorted array.",
+        "input": "",
+        "output": "",
+        "id": 65
+    },
+    {
+        "category": "coding",
+        "instruction": "Implement a queue data structure using two stacks in Python.",
+        "input": "",
+        "output": "",
+        "id": 66
+    },
+    {
+        "category": "coding",
+        "instruction": "Implement a program to find the common elements in two arrays without using any extra data structures.",
+        "input": "",
+        "output": "",
+        "id": 67
+    },
+    {
+        "category": "math",
+        "instruction": "Given that f(x) = 5x^3 - 2x + 3, find the value of f(2).",
+        "input": "",
+        "output": "",
+        "id": 68
+    },
+    {
+        "category": "math",
+        "instruction": "Solve for x in the equation 3x + 10 = 5(x - 2).",
+        "input": "",
+        "output": "",
+        "id": 69
+    },
+    {
+        "category": "math",
+        "instruction": "If the endpoints of a line segment are (2, -2) and (10, 4), what is the length of the segment?",
+        "input": "",
+        "output": "",
+        "id": 70
+    },
+    {
+        "category": "writing",
+        "instruction": "Can you help me write a formal email to a potential business partner proposing a joint venture?",
+        "input": "",
+        "output": "",
+        "id": 71
+    },
+    {
+        "category": "writing",
+        "instruction": "Can you help me write a resignation letter to my current employer, while leaving on good terms and expressing gratitude for the opportunities provided?",
+        "input": "",
+        "output": "",
+        "id": 72
+    },
+    {
+        "category": "writing",
+        "instruction": "Use an appropriate format to structure a formal letter of recommendation for a student applying to a prestigious graduate program in computer science.",
+        "input": "",
+        "output": "",
+        "id": 73
+    },
+    {
+        "category": "writing",
+        "instruction": "Write a compelling product launch announcement email to inform our customers of our new software solution.",
+        "input": "",
+        "output": "",
+        "id": 74
+    },
+    {
+        "category": "writing",
+        "instruction": "Draft an apology email to a customer who experienced a delay in their order, and provide reassurance that the issue has been resolved.",
+        "input": "",
+        "output": "",
+        "id": 75
+    },
+    {
+        "category": "writing",
+        "instruction": "Write a script for a YouTube video exploring the history and cultural significance of jazz.",
+        "input": "",
+        "output": "",
+        "id": 76
+    },
+    {
+        "category": "writing",
+        "instruction": "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.",
+        "input": "",
+        "output": "",
+        "id": 77
+    },
+    {
+        "category": "writing",
+        "instruction": "Write a captivating movie review for a recently released science fiction film, discussing its plot, characters, and special effects.",
+        "input": "",
+        "output": "",
+        "id": 78
+    },
+    {
+        "category": "writing",
+        "instruction": "Structure a podcast script for an episode discussing the influence of streaming platforms on the music industry.",
+        "input": "",
+        "output": "",
+        "id": 79
+    },
+    {
+        "category": "writing",
+        "instruction": "Write a symphony concert review, discussing the orchestra's performance and overall audience experience.",
+        "input": "",
+        "output": "",
+        "id": 80
+    }
+]
\ No newline at end of file

From c419117329c7f7701b1119c1047d999b05390533 Mon Sep 17 00:00:00 2001
From: Tong Li <tong.li352711588@gmail.com>
Date: Thu, 27 Apr 2023 19:04:26 +0800
Subject: [PATCH 2/4] update questions and readme

---
 applications/Chat/evaluate/README.md          |   2 +-
 .../Chat/evaluate/sample/questions.json       | 563 +-----------------
 2 files changed, 6 insertions(+), 559 deletions(-)

diff --git a/applications/Chat/evaluate/README.md b/applications/Chat/evaluate/README.md
index f44311f4b..d6611abf7 100644
--- a/applications/Chat/evaluate/README.md
+++ b/applications/Chat/evaluate/README.md
@@ -97,7 +97,7 @@ We would like to mention that the evaluation of model answers using the GPT-3.5
 ## Data Format
 
 ### Questions
-The file [questions.json](./sample/questions.json) shows the example questions used to evaluate the performance of the model. The current sample questions are collected from [FastChat](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl). Each question record has the following field:
+The file [questions.json](./sample/questions.json) contains example questions for evaluating model performance. Each question record has the following fields:
 * `id` (id, compulsory): The ID of the instruction / question.
 * `instruction` (str, compulsory): The instruction / question for the LLM.
 * `input` (str, optional): The additional context of the instruction / question.
diff --git a/applications/Chat/evaluate/sample/questions.json b/applications/Chat/evaluate/sample/questions.json
index cbda9c086..e9ef9f8b1 100644
--- a/applications/Chat/evaluate/sample/questions.json
+++ b/applications/Chat/evaluate/sample/questions.json
@@ -1,562 +1,9 @@
 [
     {
-        "category": "generic",
-        "instruction": "How can I improve my time management skills?",
-        "input": "",
-        "output": "",
-        "id": 1
-    },
-    {
-        "category": "generic",
-        "instruction": "What are the most effective ways to deal with stress?",
-        "input": "",
-        "output": "",
-        "id": 2
-    },
-    {
-        "category": "generic",
-        "instruction": "What are the main differences between Python and JavaScript programming languages?",
-        "input": "",
-        "output": "",
-        "id": 3
-    },
-    {
-        "category": "generic",
-        "instruction": "How can I increase my productivity while working from home?",
-        "input": "",
-        "output": "",
-        "id": 4
-    },
-    {
-        "category": "generic",
-        "instruction": "Can you explain the basics of quantum computing?",
-        "input": "",
-        "output": "",
-        "id": 5
-    },
-    {
-        "category": "generic",
-        "instruction": "What are the differences between plant-based and animal-based protein sources?",
-        "input": "",
-        "output": "",
-        "id": 6
-    },
-    {
-        "category": "generic",
-        "instruction": "How can I develop my critical thinking skills?",
-        "input": "",
-        "output": "",
-        "id": 7
-    },
-    {
-        "category": "generic",
-        "instruction": "What are the major challenges faced by the education sector today?",
-        "input": "",
-        "output": "",
-        "id": 8
-    },
-    {
-        "category": "generic",
-        "instruction": "What are the primary factors that influence consumer behavior?",
-        "input": "",
-        "output": "",
-        "id": 9
-    },
-    {
-        "category": "generic",
-        "instruction": "What are the most effective strategies for conflict resolution in the workplace?",
-        "input": "",
-        "output": "",
-        "id": 10
-    },
-    {
-        "category": "knowledge",
-        "instruction": "What are some potential implications of using a single-use plastic bottle versus a reusable bottle on both the environment and human health?",
-        "input": "",
-        "output": "",
-        "id": 11
-    },
-    {
-        "category": "knowledge",
-        "instruction": "What factors would you consider when designing an inclusive and accessible public transportation system?",
-        "input": "",
-        "output": "",
-        "id": 12
-    },
-    {
-        "category": "knowledge",
-        "instruction": "How can governments utilize fiscal and monetary policies to combat economic recessions?",
-        "input": "",
-        "output": "",
-        "id": 13
-    },
-    {
-        "category": "knowledge",
-        "instruction": "How do language and cultural barriers affect the way people communicate and form relationships in multicultural societies?",
-        "input": "",
-        "output": "",
-        "id": 14
-    },
-    {
-        "category": "knowledge",
-        "instruction": "Describe a scenario where artificial intelligence could be used to improve the quality and efficiency of healthcare delivery.",
-        "input": "",
-        "output": "",
-        "id": 15
-    },
-    {
-        "category": "knowledge",
-        "instruction": "Explain the process of gene editing using CRISPR-Cas9 technology, and discuss its potential applications and ethical implications.",
-        "input": "",
-        "output": "",
-        "id": 16
-    },
-    {
-        "category": "knowledge",
-        "instruction": "How do vaccinations work to protect individuals and communities from infectious diseases, and what is herd immunity?",
-        "input": "",
-        "output": "",
-        "id": 17
-    },
-    {
-        "category": "knowledge",
-        "instruction": "How do social media platforms influence the way people consume and share news, and what are the potential implications for the spread of misinformation?",
-        "input": "",
-        "output": "",
-        "id": 18
-    },
-    {
-        "category": "knowledge",
-        "instruction": "How do cultural, social, and economic factors influence people's food choices, and how can this knowledge be used to promote healthier diets?",
-        "input": "",
-        "output": "",
-        "id": 19
-    },
-    {
-        "category": "knowledge",
-        "instruction": "Explain the process of natural selection and how it contributes to the evolution and adaptation of species.",
-        "input": "",
-        "output": "",
-        "id": 20
-    },
-    {
-        "category": "roleplay",
-        "instruction": "How would you introduce yourself as a medieval knight at a royal banquet?",
-        "input": "",
-        "output": "",
-        "id": 21
-    },
-    {
-        "category": "roleplay",
-        "instruction": "As a pirate captain, what would you say to your crew to motivate them to search for hidden treasure?",
-        "input": "",
-        "output": "",
-        "id": 22
-    },
-    {
-        "category": "roleplay",
-        "instruction": "If you were a Shakespearean character, how would you declare your love for someone in a soliloquy?",
-        "input": "",
-        "output": "",
-        "id": 23
-    },
-    {
-        "category": "roleplay",
-        "instruction": "As a superhero, how would you explain your origin story to a curious child?",
-        "input": "",
-        "output": "",
-        "id": 24
-    },
-    {
-        "category": "roleplay",
-        "instruction": "Imagine you are a time traveler from the year 3000. What technological advancements would you tell people about?",
-        "input": "",
-        "output": "",
-        "id": 25
-    },
-    {
-        "category": "roleplay",
-        "instruction": "As a sports commentator, describe the winning play in the final seconds of a championship game.",
-        "input": "",
-        "output": "",
-        "id": 26
-    },
-    {
-        "category": "roleplay",
-        "instruction": "Pretend to be a world-famous chef. How would you describe your signature dish to a panel of judges?",
-        "input": "",
-        "output": "",
-        "id": 27
-    },
-    {
-        "category": "roleplay",
-        "instruction": "You are a mountain climber reaching the summit of Mount Everest. Describe your emotions and the view from the top.",
-        "input": "",
-        "output": "",
-        "id": 28
-    },
-    {
-        "category": "roleplay",
-        "instruction": "As a space colonist on Mars, describe your daily life and the challenges you face living on another planet.",
-        "input": "",
-        "output": "",
-        "id": 29
-    },
-    {
-        "category": "roleplay",
-        "instruction": "Pretend to be a character in a post-apocalyptic world. Describe how you survive and the allies you encounter.",
-        "input": "",
-        "output": "",
-        "id": 30
-    },
-    {
-        "category": "common-sense",
-        "instruction": "How can you determine if a restaurant is popular among locals or mainly attracts tourists, and why might this information be useful?",
-        "input": "",
-        "output": "",
-        "id": 31
-    },
-    {
-        "category": "common-sense",
-        "instruction": "What are some subtle clues that suggest someone is pretending to understand a topic or conversation when they are actually confused or uninformed?",
-        "input": "",
-        "output": "",
-        "id": 32
-    },
-    {
-        "category": "common-sense",
-        "instruction": "Why might someone choose to use a paper map or ask for directions instead of relying on a GPS device or smartphone app?",
-        "input": "",
-        "output": "",
-        "id": 33
-    },
-    {
-        "category": "common-sense",
-        "instruction": "How can you determine if a person is genuinely interested in a conversation or simply being polite?",
-        "input": "",
-        "output": "",
-        "id": 34
-    },
-    {
-        "category": "common-sense",
-        "instruction": "Why might someone prefer to shop at a small, locally-owned business instead of a large chain store, even if the prices are higher?",
-        "input": "",
-        "output": "",
-        "id": 35
-    },
-    {
-        "category": "common-sense",
-        "instruction": "How can you assess the credibility of a source of information, such as a news article or blog post, without relying solely on the reputation of the author or publisher?",
-        "input": "",
-        "output": "",
-        "id": 36
-    },
-    {
-        "category": "common-sense",
-        "instruction": "Why do some people enjoy the sensation of being scared, such as by watching horror movies or going on roller coasters, while others avoid these experiences?",
-        "input": "",
-        "output": "",
-        "id": 37
-    },
-    {
-        "category": "common-sense",
-        "instruction": "How can observing the behavior of other people in a social situation provide clues about cultural norms and expectations?",
-        "input": "",
-        "output": "",
-        "id": 38
-    },
-    {
-        "category": "common-sense",
-        "instruction": "Do we have a moral obligation to explore space, or should we focus on solving Earth's problems first?",
-        "input": "",
-        "output": "",
-        "id": 39
-    },
-    {
-        "category": "common-sense",
-        "instruction": "In a world where automation is becoming increasingly prevalent, is it more important to prioritize job creation or technological progress?",
-        "input": "",
-        "output": "",
-        "id": 40
-    },
-    {
-        "category": "fermi",
-        "instruction": "How many times does the average human blink in a lifetime? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
-        "input": "",
-        "output": "",
-        "id": 41
-    },
-    {
-        "category": "fermi",
-        "instruction": "How many atoms are in a grain of salt? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
-        "input": "",
-        "output": "",
-        "id": 42
-    },
-    {
-        "category": "fermi",
-        "instruction": "How many lightning strikes occur on Earth each day? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
-        "input": "",
-        "output": "",
-        "id": 43
-    },
-    {
-        "category": "fermi",
-        "instruction": "How many balloons would it take to lift a house like in the movie \"Up\"? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
-        "input": "",
-        "output": "",
-        "id": 44
-    },
-    {
-        "category": "fermi",
-        "instruction": "How many text messages are sent globally in a minute? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
-        "input": "",
-        "output": "",
-        "id": 45
-    },
-    {
-        "category": "fermi",
-        "instruction": "How many words are spoken daily on Earth? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
-        "input": "",
-        "output": "",
-        "id": 46
-    },
-    {
-        "category": "fermi",
-        "instruction": "How many snowflakes fall during a typical winter? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
-        "input": "",
-        "output": "",
-        "id": 47
-    },
-    {
-        "category": "fermi",
-        "instruction": "How many pages are in all the books ever written? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
-        "input": "",
-        "output": "",
-        "id": 48
-    },
-    {
-        "category": "fermi",
-        "instruction": "How many times has the Earth orbited the Sun since the beginning of life? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
-        "input": "",
-        "output": "",
-        "id": 49
-    },
-    {
-        "category": "fermi",
-        "instruction": "How many songs have been recorded throughout history? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.",
-        "input": "",
-        "output": "",
-        "id": 50
-    },
-    {
-        "category": "counterfactual",
-        "instruction": "What if the Internet had been invented during the Renaissance period?",
-        "input": "",
-        "output": "",
-        "id": 51
-    },
-    {
-        "category": "counterfactual",
-        "instruction": "What if the Aztecs had successfully repelled the Spanish conquistadors?",
-        "input": "",
-        "output": "",
-        "id": 52
-    },
-    {
-        "category": "counterfactual",
-        "instruction": "What if the Black Death had not occurred in the 14th century?",
-        "input": "",
-        "output": "",
-        "id": 53
-    },
-    {
-        "category": "counterfactual",
-        "instruction": "What if Isaac Newton had focused on biology instead of physics?",
-        "input": "",
-        "output": "",
-        "id": 54
-    },
-    {
-        "category": "counterfactual",
-        "instruction": "What if the Beatles had never formed as a band?",
-        "input": "",
-        "output": "",
-        "id": 55
-    },
-    {
-        "category": "counterfactual",
-        "instruction": "What if Alan Turing had not cracked the Enigma code during World War II?",
-        "input": "",
-        "output": "",
-        "id": 56
-    },
-    {
-        "category": "counterfactual",
-        "instruction": "What if the Suez Canal had never been constructed?",
-        "input": "",
-        "output": "",
-        "id": 57
-    },
-    {
-        "category": "counterfactual",
-        "instruction": "What if the Maya civilization had never mysteriously collapsed?",
-        "input": "",
-        "output": "",
-        "id": 58
-    },
-    {
-        "category": "counterfactual",
-        "instruction": "What if Christopher Columbus had not discovered the Americas?",
-        "input": "",
-        "output": "",
-        "id": 59
-    },
-    {
-        "category": "counterfactual",
-        "instruction": "What if Vincent van Gogh had been a successful artist during his lifetime?",
-        "input": "",
-        "output": "",
-        "id": 60
-    },
-    {
-        "category": "coding",
-        "instruction": "Develop a C++ program that reads a text file line by line and counts the number of occurrences of a specific word in the file.",
-        "input": "",
-        "output": "",
-        "id": 61
-    },
-    {
-        "category": "coding",
-        "instruction": "Implement a Python function to find the longest common subsequence of two input strings using dynamic programming.",
-        "input": "",
-        "output": "",
-        "id": 62
-    },
-    {
-        "category": "coding",
-        "instruction": "Implement a regular expression in Python to validate an email address.",
-        "input": "",
-        "output": "",
-        "id": 63
-    },
-    {
-        "category": "coding",
-        "instruction": "Write a program to find the nth Fibonacci number using dynamic programming.",
-        "input": "",
-        "output": "",
-        "id": 64
-    },
-    {
-        "category": "coding",
-        "instruction": "Implement a binary search algorithm to find a specific element in a sorted array.",
-        "input": "",
-        "output": "",
-        "id": 65
-    },
-    {
-        "category": "coding",
-        "instruction": "Implement a queue data structure using two stacks in Python.",
-        "input": "",
-        "output": "",
-        "id": 66
-    },
-    {
-        "category": "coding",
-        "instruction": "Implement a program to find the common elements in two arrays without using any extra data structures.",
-        "input": "",
-        "output": "",
-        "id": 67
-    },
-    {
-        "category": "math",
-        "instruction": "Given that f(x) = 5x^3 - 2x + 3, find the value of f(2).",
-        "input": "",
-        "output": "",
-        "id": 68
-    },
-    {
-        "category": "math",
-        "instruction": "Solve for x in the equation 3x + 10 = 5(x - 2).",
-        "input": "",
-        "output": "",
-        "id": 69
-    },
-    {
-        "category": "math",
-        "instruction": "If the endpoints of a line segment are (2, -2) and (10, 4), what is the length of the segment?",
-        "input": "",
-        "output": "",
-        "id": 70
-    },
-    {
-        "category": "writing",
-        "instruction": "Can you help me write a formal email to a potential business partner proposing a joint venture?",
-        "input": "",
-        "output": "",
-        "id": 71
-    },
-    {
-        "category": "writing",
-        "instruction": "Can you help me write a resignation letter to my current employer, while leaving on good terms and expressing gratitude for the opportunities provided?",
-        "input": "",
-        "output": "",
-        "id": 72
-    },
-    {
-        "category": "writing",
-        "instruction": "Use an appropriate format to structure a formal letter of recommendation for a student applying to a prestigious graduate program in computer science.",
-        "input": "",
-        "output": "",
-        "id": 73
-    },
-    {
-        "category": "writing",
-        "instruction": "Write a compelling product launch announcement email to inform our customers of our new software solution.",
-        "input": "",
-        "output": "",
-        "id": 74
-    },
-    {
-        "category": "writing",
-        "instruction": "Draft an apology email to a customer who experienced a delay in their order, and provide reassurance that the issue has been resolved.",
-        "input": "",
-        "output": "",
-        "id": 75
-    },
-    {
-        "category": "writing",
-        "instruction": "Write a script for a YouTube video exploring the history and cultural significance of jazz.",
-        "input": "",
-        "output": "",
-        "id": 76
-    },
-    {
-        "category": "writing",
-        "instruction": "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.",
-        "input": "",
-        "output": "",
-        "id": 77
-    },
-    {
-        "category": "writing",
-        "instruction": "Write a captivating movie review for a recently released science fiction film, discussing its plot, characters, and special effects.",
-        "input": "",
-        "output": "",
-        "id": 78
-    },
-    {
-        "category": "writing",
-        "instruction": "Structure a podcast script for an episode discussing the influence of streaming platforms on the music industry.",
-        "input": "",
-        "output": "",
-        "id": 79
-    },
-    {
-        "category": "writing",
-        "instruction": "Write a symphony concert review, discussing the orchestra's performance and overall audience experience.",
-        "input": "",
-        "output": "",
-        "id": 80
+        "id": 0,
+        "instruction": "Help me summarize the following news.",
+        "input": "National Commercial Bank (NCB), Saudi Arabia's largest lender by assets, agreed to buy rival Samba Financial Group for $15 billion in the biggest banking takeover this year.NCB will pay 28.45 riyals ($7.58) for each Samba share, according to a statement on Sunday, valuing it at about 55.7 billion riyals. NCB will offer 0.739 new shares for each Samba share, at the lower end of the 0.736-0.787 ratio the banks set when they signed an initial framework agreement in June.The offer is a 3.5% premium to Samba's Oct. 8 closing price of 27.50 riyals and about 24% higher than the level the shares traded at before the talks were made public. Bloomberg News first reported the merger discussions.The new bank will have total assets of more than $220 billion, creating the Gulf region's third-largest lender. The entity's $46 billion market capitalization nearly matches that of Qatar National Bank QPSC, which is still the Middle East's biggest lender with about $268 billion of assets.",
+        "output": "NCB to pay 28.45 riyals for each Samba share. Deal will create Gulf region's third-largest lender",
+        "category": "closed qa"
     }
 ]
\ No newline at end of file

From ed3eaa6922ce95cff77744bf1975092de6fb57bd Mon Sep 17 00:00:00 2001
From: Tong Li <tong.li352711588@gmail.com>
Date: Fri, 28 Apr 2023 11:49:21 +0800
Subject: [PATCH 3/4] update documentation

---
 applications/Chat/evaluate/README.md | 78 +++++++++++++++-------------
 1 file changed, 43 insertions(+), 35 deletions(-)
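
Note for reviewers: the field lists in this patch now mark each key as
compulsory or optional. A minimal sketch of what that implies for one answer
record is given here. The helper name `validate_answer_record` is hypothetical
and is not part of the repository; only the field names and types are taken
from the README.

```python
# Hypothetical helper, not part of applications/Chat/evaluate.
# Field names and types follow the README's answer-record list.
COMPULSORY = {"category": str, "instruction": str, "output": str, "id": int}
OPTIONAL = {"input": str}


def validate_answer_record(record: dict) -> list:
    """Return a list of problems found in one answer record (empty if none)."""
    problems = []
    for key, expected in COMPULSORY.items():
        if key not in record:
            problems.append(f"missing compulsory field: {key}")
        elif not isinstance(record[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    # Optional fields may be absent, but must have the right type if present.
    for key, expected in OPTIONAL.items():
        if key in record and not isinstance(record[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems
```

A record without `input` passes this check, while a record without `id` is
reported as a problem.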

diff --git a/applications/Chat/evaluate/README.md b/applications/Chat/evaluate/README.md
index d6611abf7..d776a3e1f 100644
--- a/applications/Chat/evaluate/README.md
+++ b/applications/Chat/evaluate/README.md
@@ -1,16 +1,36 @@
 # Evaluation
 
-In this directory we will introduce how you can evaluate your model with GPT-4. 
+In this directory, we introduce how you can evaluate your model with GPT-4. 
 
 ## Evaluation Pipeline
 
-The whole evaluation process undergoes two steps. 
+The whole evaluation process undergoes the following three steps: 
 1. Prepare the questions following the internal data structure in the data format section (described below).
-2. Generate answers from different models: Use `generate_gpt35_answers.py` to generate answers of GPT 3.5 and use `generate_answers.py` to generate answers of your own models.
-3. Evaluate models using GPT 4: Use `evaluate.py` to evaluate model answers with GPT-4.
+2. Generate answers from different models: 
+    * Generate answers using GPT-3.5: [generate_gpt35_answers.py](generate_gpt35_answers.py).
+    * Generate answers using your own models: [generate_answers.py](generate_answers.py).
+3. Evaluate models using GPT-4: [evaluate.py](evaluate.py).
 
 ### Generate Answers
-In `generate_answers.py`, the model will generate answers in a batch way and different GPU processes will do inference on different shards of the given questions. Once all GPU process generate its answers, `merge.py` will merge different shards of answers and output a single answer file. Finally, the script will also remove the answer shards. An example script is given as follows.
+#### Generate Answers Using GPT-3.5
+You can provide your own OpenAI key to generate answers from GPT-3.5 using [generate_gpt35_answers.py](./generate_gpt35_answers.py).
+
+An example script is provided as follows:
+```shell
+python generate_gpt35_answers.py \
+    --dataset "path to the question dataset" \
+    --answer_path "path to answer folder" \
+    --num_workers 4 \
+    --openai_key "your openai key" \
+    --max_tokens 512
+```
+
+#### Generate Answers Using our Own Model
+You can also generate answers using your own models. The generation process is divided into two stages:
+1. Generate answers using multiple GPUs (optional) with batch processing: [generate_answers.py](./generate_answers.py).
+2. Merge multiple shards and output a single file: [merge.py](./merge.py).
+
+An example script is given as follows:
 
 ```shell
 device_number=number of your devices
@@ -41,21 +61,9 @@ done
 
 ```
 
-`generate_gpt35_answers.py` will generate answers of GPT-3.5 An example script is given as follows.
-
-```shell
-python generate_gpt35_answers.py \
-    --dataset "path to the question dataset" \
-    --answer_path "path to answer folder" \
-    --num_workers 4 \
-    --openai_key "your openai key" \
-    --max_tokens 512 \
-
-```
-
 ### Evaluate Answers
 
-In `evaluate.py`, GPT-4 will help review and score answers of two different models. Here `Model 1` refers to the first model you specify in the `--answer_file_list` and `Model 2` refers to the second model. The script will finally print several metrics and output corresponding JSON files.
+In [evaluate.py](./evaluate.py), GPT-4 helps to review and score answers of two different models. Here `Model 1` refers to the first model you specify in the `--answer_file_list` and `Model 2` refers to the second model. The script shows several metrics and output the corresponding JSON files.
 
 The metrics include:
 
@@ -121,11 +129,11 @@ We store model answers in `{model_name}_answers.json`. The JSON file contains on
 
 An answer record has the following field:
 
-* `category` (str): The category of the question.
-* `instruction` (str): The question.
-* `input` (str): This is empty if you only use [FastChat's]((https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl)) questions.
-* `output` (str): The answer to the question.
-* `id` (int): The question id.
+* `category` (str, compulsory): The category of the instruction / question.
+* `instruction` (str, compulsory): The instruction / question for the LLM.
+* `input` (str, optional): The additional context of the instruction / question.
+* `output` (str, compulsory): The output from the LLM.
+* `id` (int, compulsory): The ID of the instruction / question.
 
 ### Results
 
@@ -133,12 +141,12 @@ We store evaluation results in `results.json`. The JSON file contains one dictio
 
 The value has the following field:
 
-* `model` (list): The names of the two models.
-* `better` (int): The number of reviews where Model 2 receives a higher score.
-* `worse` (int): The number of reviews where Model 2 receives a lower score.
-* `tie` (int): The number of reviews where two models play to a tie.
-* `win_rate` (float): Win rate of Model 2.
-* `score` (list): Average score of the two models.
+* `model` (list, compulsory): The names of the two models.
+* `better` (int, compulsory): The number of reviews where Model 2 receives a higher score.
+* `worse` (int, compulsory): The number of reviews where Model 2 receives a lower score.
+* `tie` (int, compulsory): The number of reviews where the two models tie.
+* `win_rate` (float, compulsory): Win rate of Model 2.
+* `score` (list, compulsory): Average score of the two models.
 
 ### Better, Worse, Tie, Invalid, Review
 
@@ -146,12 +154,12 @@ To help better compare the model answers, we store JSON files whose name ends wi
 
 A record has the following field:
 
-* `review_id` (str): Random UUID, not in use.
-* `id` (int): The question id.
-* `reviewer_id` (int): A unique ID for a reviewer. Different reviewer id use different prompts.
-* `metadata` (dict): It is empty.
-* `review` (str): GPT-4 's review.
-* `score` (list): The scores of two models.
+* `review_id` (str, optional): Random UUID, not in use.
+* `id` (int, compulsory): The ID of the instruction / question.
+* `reviewer_id` (int, compulsory): A unique ID for a reviewer. Different reviewer IDs use different prompts.
+* `metadata` (dict, optional): Currently empty.
+* `review` (str, optional): GPT-4's review.
+* `score` (list, compulsory): The scores of the two models.
 
 ### Prompts
 

From c1a355940ea4c5ec203bc9295e057fd3c8ca5efb Mon Sep 17 00:00:00 2001
From: Tong Li <tong.li352711588@gmail.com>
Date: Fri, 28 Apr 2023 11:56:35 +0800
Subject: [PATCH 4/4] update readme

---
 applications/Chat/evaluate/README.md | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)
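
Note for reviewers: the Evaluate Answers paragraph touched here defines
`Model 1` and `Model 2` by their position in `--answer_file_list`. A rough
sketch of how per-review scores could be tallied into the `better`, `worse`
and `tie` counts is given here. It assumes each review's `score` list is
ordered `[Model 1, Model 2]`; the function name `tally_reviews` is
hypothetical, and this is not the logic of `evaluate.py`.

```python
def tally_reviews(reviews):
    """Count reviews where Model 2 scores higher, lower, or equal to Model 1."""
    counts = {"better": 0, "worse": 0, "tie": 0}
    for review in reviews:
        # Assumed order: first entry is Model 1, second entry is Model 2.
        model1_score, model2_score = review["score"]
        if model2_score > model1_score:
            counts["better"] += 1
        elif model2_score < model1_score:
            counts["worse"] += 1
        else:
            counts["tie"] += 1
    return counts
```

The counts are always taken from the point of view of Model 2, matching the
field descriptions in the Results section.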

diff --git a/applications/Chat/evaluate/README.md b/applications/Chat/evaluate/README.md
index d776a3e1f..7ace4bfe6 100644
--- a/applications/Chat/evaluate/README.md
+++ b/applications/Chat/evaluate/README.md
@@ -7,13 +7,13 @@ In this directory, we introduce how you can evaluate your model with GPT-4.
 The whole evaluation process undergoes the following three steps: 
 1. Prepare the questions following the internal data structure in the data format section (described below).
 2. Generate answers from different models: 
-    * Generate answers using GPT-3.5: [generate_gpt35_answers.py](generate_gpt35_answers.py).
-    * Generate answers using your own models: [generate_answers.py](generate_answers.py).
-3. Evaluate models using GPT-4: [evaluate.py](evaluate.py).
+    * Generate answers using GPT-3.5: [`generate_gpt35_answers.py`](generate_gpt35_answers.py).
+    * Generate answers using your own models: [`generate_answers.py`](generate_answers.py).
+3. Evaluate models using GPT-4: [`evaluate.py`](evaluate.py).
 
 ### Generate Answers
 #### Generate Answers Using GPT-3.5
-You can provide your own OpenAI key to generate answers from GPT-3.5 using [generate_gpt35_answers.py](./generate_gpt35_answers.py).
+You can provide your own OpenAI key to generate answers from GPT-3.5 using [`generate_gpt35_answers.py`](./generate_gpt35_answers.py).
 
 An example script is provided as follows:
 ```shell
@@ -27,8 +27,8 @@ python generate_gpt35_answers.py \
 
 #### Generate Answers Using our Own Model
 You can also generate answers using your own models. The generation process is divided into two stages:
-1. Generate answers using multiple GPUs (optional) with batch processing: [generate_answers.py](./generate_answers.py).
-2. Merge multiple shards and output a single file: [merge.py](./merge.py).
+1. Generate answers using multiple GPUs (optional) with batch processing: [`generate_answers.py`](./generate_answers.py).
+2. Merge multiple shards and output a single file: [`merge.py`](./merge.py).
 
 An example script is given as follows:
 
@@ -63,7 +63,7 @@ done
 
 ### Evaluate Answers
 
-In [evaluate.py](./evaluate.py), GPT-4 helps to review and score answers of two different models. Here `Model 1` refers to the first model you specify in the `--answer_file_list` and `Model 2` refers to the second model. The script shows several metrics and output the corresponding JSON files.
+In [`evaluate.py`](./evaluate.py), GPT-4 helps to review and score the answers of two different models. Here `Model 1` refers to the first model you specify in `--answer_file_list` and `Model 2` refers to the second model. The script shows several metrics and outputs the corresponding JSON files.
 
 The metrics include:
 
@@ -105,7 +105,7 @@ We would like to mention that the evaluation of model answers using the GPT-3.5
 ## Data Format
 
 ### Questions
-The file [questions.json](./sample/questions.json) shows the example questions used to evaluate the performance of the model. Each question record has the following field:
+The file [`questions.json`](./sample/questions.json) contains example questions used to evaluate the performance of the model. Each question record has the following fields:
 * `id` (id, compulsory): The ID of the instruction / question.
 * `instruction` (str, compulsory): The instruction / question for the LLM.
 * `input` (str, optional): The additional context of the instruction / question.
@@ -163,11 +163,11 @@ A record has the following field:
 
 ### Prompts
 
-The data format is the same with [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl) prompts.
+The data format is the same as that of FastChat's [`prompt.jsonl`](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl) prompts.
 
 ### Reviewer
 
-The data format is the same with [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl) reviewers.
+The data format is the same as that of FastChat's [`reviewer.jsonl`](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl) reviewers.
 
 ## Citations