From aa77ddae334343322e959f2be72e0b48400ca70a Mon Sep 17 00:00:00 2001 From: Tong Li Date: Thu, 27 Apr 2023 18:51:58 +0800 Subject: [PATCH 1/4] remove unnecessary step and update readme --- applications/Chat/evaluate/README.md | 51 +- .../Chat/evaluate/format_questions.py | 31 - .../Chat/evaluate/format_questions.sh | 3 - .../Chat/evaluate/sample/questions.json | 562 ++++++++++++++++++ 4 files changed, 584 insertions(+), 63 deletions(-) delete mode 100644 applications/Chat/evaluate/format_questions.py delete mode 100755 applications/Chat/evaluate/format_questions.sh create mode 100644 applications/Chat/evaluate/sample/questions.json diff --git a/applications/Chat/evaluate/README.md b/applications/Chat/evaluate/README.md index 6113dbbb1..f44311f4b 100644 --- a/applications/Chat/evaluate/README.md +++ b/applications/Chat/evaluate/README.md @@ -5,21 +5,11 @@ In this directory we will introduce how you can evaluate your model with GPT-4. ## Evaluation Pipeline The whole evaluation process undergoes two steps. - -1. Generate answers from different models: Use `generate_gpt35_answers.py` to generate answers of GPT 3.5 and use `generate_answers.py` to generate answers of your own models. -2. Evaluate models using GPT 4: Use `evaluate.py` to evaluate model answers with GPT-4. +1. Prepare the questions following the internal data structure in the data format section (described below). +2. Generate answers from different models: Use `generate_gpt35_answers.py` to generate answers of GPT 3.5 and use `generate_answers.py` to generate answers of your own models. +3. Evaluate models using GPT 4: Use `evaluate.py` to evaluate model answers with GPT-4. ### Generate Answers - -To generate answers, you should first format [FastChat's]([FastChat/question.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl)) `question.jsonl` file. 
We do this formatting because we would like to add more questions later and the pipeline for generating new questions may follow that of Self-Instruct and Stanford Alpaca. An example script is given as follows. - -```shell -python format_questions.py \ - --questions_path "path to FastChat's question.jsonl" \ - --save_path "path to the formatted file" \ - -``` - In `generate_answers.py`, the model will generate answers in a batch way and different GPU processes will do inference on different shards of the given questions. Once all GPU process generate its answers, `merge.py` will merge different shards of answers and output a single answer file. Finally, the script will also remove the answer shards. An example script is given as follows. ```shell @@ -107,16 +97,23 @@ We would like to mention that the evaluation of model answers using the GPT-3.5 ## Data Format ### Questions +The file [questions.json](./sample/questions.json) shows the example questions used to evaluate the performance of the model. The current sample questions are collected from [FastChat](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl). Each question record has the following field: +* `id` (id, compulsory): The ID of the instruction / question. +* `instruction` (str, compulsory): The instruction / question for the LLM. +* `input` (str, optional): The additional context of the instruction / question. +* `output` (str, optional): The sample output of the instruction / question. +* `category` (str, compulsory): The category of the instruction / question. -We store questions in `questions.json`. The JSON file contains one list. Each element in the list is a question record. - -A question record has the following field: - -* `category` (str): The category of the question. -* `instruction` (str): The question. 
-* `input` (str): This is empty if you only use [FastChat's]([FastChat/question.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl)) questions. -* `output` (str): This is empty. -* `id` (int): The question id. +Example: +``` +{ + "id": 0, + "instruction": "Help me summarize the following short story?", + "input": "{story}", + "output": "{summarized story}", + "category": "closed qa" +} +``` ### Answers @@ -126,7 +123,7 @@ An answer record has the following field: * `category` (str): The category of the question. * `instruction` (str): The question. -* `input` (str): This is empty if you only use [FastChat's]([FastChat/question.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl)) questions. +* `input` (str): This is empty if you only use [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl) questions. * `output` (str): The answer to the question. * `id` (int): The question id. @@ -158,15 +155,11 @@ A record has the following field: ### Prompts -The data format is the same with [FastChat's]([FastChat/prompt.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl)) prompts. +The data format is the same as [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl) prompts. ### Reviewer -The data format is the same with [FastChat's]([FastChat/reviewer.jsonl at main · lm-sys/FastChat (github.com)](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl)) reviewers. - -## Plan - -- [ ] Extend the questions +The data format is the same as [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl) reviewers. 
## Citations diff --git a/applications/Chat/evaluate/format_questions.py b/applications/Chat/evaluate/format_questions.py deleted file mode 100644 index 9b47907c3..000000000 --- a/applications/Chat/evaluate/format_questions.py +++ /dev/null @@ -1,31 +0,0 @@ -import argparse -import os -import json -import copy - -from utils import jdump, get_json_list - - -def format_questions(args): - questions = get_json_list(args.questions_path) - keys=questions[0].keys() - - formatted_questions=copy.deepcopy(questions) - for i in range(len(formatted_questions)): - formatted_questions[i]['instruction']=questions[i]['text'] - formatted_questions[i]['input']="" - formatted_questions[i]['output']="" - formatted_questions[i]['id']=questions[i]['question_id'] - for key in keys: - if key=="category": - continue - del formatted_questions[i][key] - - jdump(formatted_questions, args.save_path) - -if __name__ == '__main__': - parser = argparse.ArgumentParser() - parser.add_argument('--questions_path', type=str, default='table/question.jsonl') - parser.add_argument('--save_path', type=str, default="table/questions.json") - args = parser.parse_args() - format_questions(args) \ No newline at end of file diff --git a/applications/Chat/evaluate/format_questions.sh b/applications/Chat/evaluate/format_questions.sh deleted file mode 100755 index a7568da36..000000000 --- a/applications/Chat/evaluate/format_questions.sh +++ /dev/null @@ -1,3 +0,0 @@ -python format_questions.py \ - --questions_path "path to FastChat's question.jsonl" \ - --save_path "path to the formatted file" \ diff --git a/applications/Chat/evaluate/sample/questions.json b/applications/Chat/evaluate/sample/questions.json new file mode 100644 index 000000000..cbda9c086 --- /dev/null +++ b/applications/Chat/evaluate/sample/questions.json @@ -0,0 +1,562 @@ +[ + { + "category": "generic", + "instruction": "How can I improve my time management skills?", + "input": "", + "output": "", + "id": 1 + }, + { + "category": "generic", + 
"instruction": "What are the most effective ways to deal with stress?", + "input": "", + "output": "", + "id": 2 + }, + { + "category": "generic", + "instruction": "What are the main differences between Python and JavaScript programming languages?", + "input": "", + "output": "", + "id": 3 + }, + { + "category": "generic", + "instruction": "How can I increase my productivity while working from home?", + "input": "", + "output": "", + "id": 4 + }, + { + "category": "generic", + "instruction": "Can you explain the basics of quantum computing?", + "input": "", + "output": "", + "id": 5 + }, + { + "category": "generic", + "instruction": "What are the differences between plant-based and animal-based protein sources?", + "input": "", + "output": "", + "id": 6 + }, + { + "category": "generic", + "instruction": "How can I develop my critical thinking skills?", + "input": "", + "output": "", + "id": 7 + }, + { + "category": "generic", + "instruction": "What are the major challenges faced by the education sector today?", + "input": "", + "output": "", + "id": 8 + }, + { + "category": "generic", + "instruction": "What are the primary factors that influence consumer behavior?", + "input": "", + "output": "", + "id": 9 + }, + { + "category": "generic", + "instruction": "What are the most effective strategies for conflict resolution in the workplace?", + "input": "", + "output": "", + "id": 10 + }, + { + "category": "knowledge", + "instruction": "What are some potential implications of using a single-use plastic bottle versus a reusable bottle on both the environment and human health?", + "input": "", + "output": "", + "id": 11 + }, + { + "category": "knowledge", + "instruction": "What factors would you consider when designing an inclusive and accessible public transportation system?", + "input": "", + "output": "", + "id": 12 + }, + { + "category": "knowledge", + "instruction": "How can governments utilize fiscal and monetary policies to combat economic recessions?", + "input": 
"", + "output": "", + "id": 13 + }, + { + "category": "knowledge", + "instruction": "How do language and cultural barriers affect the way people communicate and form relationships in multicultural societies?", + "input": "", + "output": "", + "id": 14 + }, + { + "category": "knowledge", + "instruction": "Describe a scenario where artificial intelligence could be used to improve the quality and efficiency of healthcare delivery.", + "input": "", + "output": "", + "id": 15 + }, + { + "category": "knowledge", + "instruction": "Explain the process of gene editing using CRISPR-Cas9 technology, and discuss its potential applications and ethical implications.", + "input": "", + "output": "", + "id": 16 + }, + { + "category": "knowledge", + "instruction": "How do vaccinations work to protect individuals and communities from infectious diseases, and what is herd immunity?", + "input": "", + "output": "", + "id": 17 + }, + { + "category": "knowledge", + "instruction": "How do social media platforms influence the way people consume and share news, and what are the potential implications for the spread of misinformation?", + "input": "", + "output": "", + "id": 18 + }, + { + "category": "knowledge", + "instruction": "How do cultural, social, and economic factors influence people's food choices, and how can this knowledge be used to promote healthier diets?", + "input": "", + "output": "", + "id": 19 + }, + { + "category": "knowledge", + "instruction": "Explain the process of natural selection and how it contributes to the evolution and adaptation of species.", + "input": "", + "output": "", + "id": 20 + }, + { + "category": "roleplay", + "instruction": "How would you introduce yourself as a medieval knight at a royal banquet?", + "input": "", + "output": "", + "id": 21 + }, + { + "category": "roleplay", + "instruction": "As a pirate captain, what would you say to your crew to motivate them to search for hidden treasure?", + "input": "", + "output": "", + "id": 22 + }, + { + 
"category": "roleplay", + "instruction": "If you were a Shakespearean character, how would you declare your love for someone in a soliloquy?", + "input": "", + "output": "", + "id": 23 + }, + { + "category": "roleplay", + "instruction": "As a superhero, how would you explain your origin story to a curious child?", + "input": "", + "output": "", + "id": 24 + }, + { + "category": "roleplay", + "instruction": "Imagine you are a time traveler from the year 3000. What technological advancements would you tell people about?", + "input": "", + "output": "", + "id": 25 + }, + { + "category": "roleplay", + "instruction": "As a sports commentator, describe the winning play in the final seconds of a championship game.", + "input": "", + "output": "", + "id": 26 + }, + { + "category": "roleplay", + "instruction": "Pretend to be a world-famous chef. How would you describe your signature dish to a panel of judges?", + "input": "", + "output": "", + "id": 27 + }, + { + "category": "roleplay", + "instruction": "You are a mountain climber reaching the summit of Mount Everest. Describe your emotions and the view from the top.", + "input": "", + "output": "", + "id": 28 + }, + { + "category": "roleplay", + "instruction": "As a space colonist on Mars, describe your daily life and the challenges you face living on another planet.", + "input": "", + "output": "", + "id": 29 + }, + { + "category": "roleplay", + "instruction": "Pretend to be a character in a post-apocalyptic world. 
Describe how you survive and the allies you encounter.", + "input": "", + "output": "", + "id": 30 + }, + { + "category": "common-sense", + "instruction": "How can you determine if a restaurant is popular among locals or mainly attracts tourists, and why might this information be useful?", + "input": "", + "output": "", + "id": 31 + }, + { + "category": "common-sense", + "instruction": "What are some subtle clues that suggest someone is pretending to understand a topic or conversation when they are actually confused or uninformed?", + "input": "", + "output": "", + "id": 32 + }, + { + "category": "common-sense", + "instruction": "Why might someone choose to use a paper map or ask for directions instead of relying on a GPS device or smartphone app?", + "input": "", + "output": "", + "id": 33 + }, + { + "category": "common-sense", + "instruction": "How can you determine if a person is genuinely interested in a conversation or simply being polite?", + "input": "", + "output": "", + "id": 34 + }, + { + "category": "common-sense", + "instruction": "Why might someone prefer to shop at a small, locally-owned business instead of a large chain store, even if the prices are higher?", + "input": "", + "output": "", + "id": 35 + }, + { + "category": "common-sense", + "instruction": "How can you assess the credibility of a source of information, such as a news article or blog post, without relying solely on the reputation of the author or publisher?", + "input": "", + "output": "", + "id": 36 + }, + { + "category": "common-sense", + "instruction": "Why do some people enjoy the sensation of being scared, such as by watching horror movies or going on roller coasters, while others avoid these experiences?", + "input": "", + "output": "", + "id": 37 + }, + { + "category": "common-sense", + "instruction": "How can observing the behavior of other people in a social situation provide clues about cultural norms and expectations?", + "input": "", + "output": "", + "id": 38 + }, + { + 
"category": "common-sense", + "instruction": "Do we have a moral obligation to explore space, or should we focus on solving Earth's problems first?", + "input": "", + "output": "", + "id": 39 + }, + { + "category": "common-sense", + "instruction": "In a world where automation is becoming increasingly prevalent, is it more important to prioritize job creation or technological progress?", + "input": "", + "output": "", + "id": 40 + }, + { + "category": "fermi", + "instruction": "How many times does the average human blink in a lifetime? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.", + "input": "", + "output": "", + "id": 41 + }, + { + "category": "fermi", + "instruction": "How many atoms are in a grain of salt? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.", + "input": "", + "output": "", + "id": 42 + }, + { + "category": "fermi", + "instruction": "How many lightning strikes occur on Earth each day? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.", + "input": "", + "output": "", + "id": 43 + }, + { + "category": "fermi", + "instruction": "How many balloons would it take to lift a house like in the movie \"Up\"? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.", + "input": "", + "output": "", + "id": 44 + }, + { + "category": "fermi", + "instruction": "How many text messages are sent globally in a minute? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.", + "input": "", + "output": "", + "id": 45 + }, + { + "category": "fermi", + "instruction": "How many words are spoken daily on Earth? Try to explain your answer. 
Your explanation should take the reader through your reasoning step-by-step.", + "input": "", + "output": "", + "id": 46 + }, + { + "category": "fermi", + "instruction": "How many snowflakes fall during a typical winter? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.", + "input": "", + "output": "", + "id": 47 + }, + { + "category": "fermi", + "instruction": "How many pages are in all the books ever written? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.", + "input": "", + "output": "", + "id": 48 + }, + { + "category": "fermi", + "instruction": "How many times has the Earth orbited the Sun since the beginning of life? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.", + "input": "", + "output": "", + "id": 49 + }, + { + "category": "fermi", + "instruction": "How many songs have been recorded throughout history? Try to explain your answer. 
Your explanation should take the reader through your reasoning step-by-step.", + "input": "", + "output": "", + "id": 50 + }, + { + "category": "counterfactual", + "instruction": "What if the Internet had been invented during the Renaissance period?", + "input": "", + "output": "", + "id": 51 + }, + { + "category": "counterfactual", + "instruction": "What if the Aztecs had successfully repelled the Spanish conquistadors?", + "input": "", + "output": "", + "id": 52 + }, + { + "category": "counterfactual", + "instruction": "What if the Black Death had not occurred in the 14th century?", + "input": "", + "output": "", + "id": 53 + }, + { + "category": "counterfactual", + "instruction": "What if Isaac Newton had focused on biology instead of physics?", + "input": "", + "output": "", + "id": 54 + }, + { + "category": "counterfactual", + "instruction": "What if the Beatles had never formed as a band?", + "input": "", + "output": "", + "id": 55 + }, + { + "category": "counterfactual", + "instruction": "What if Alan Turing had not cracked the Enigma code during World War II?", + "input": "", + "output": "", + "id": 56 + }, + { + "category": "counterfactual", + "instruction": "What if the Suez Canal had never been constructed?", + "input": "", + "output": "", + "id": 57 + }, + { + "category": "counterfactual", + "instruction": "What if the Maya civilization had never mysteriously collapsed?", + "input": "", + "output": "", + "id": 58 + }, + { + "category": "counterfactual", + "instruction": "What if Christopher Columbus had not discovered the Americas?", + "input": "", + "output": "", + "id": 59 + }, + { + "category": "counterfactual", + "instruction": "What if Vincent van Gogh had been a successful artist during his lifetime?", + "input": "", + "output": "", + "id": 60 + }, + { + "category": "coding", + "instruction": "Develop a C++ program that reads a text file line by line and counts the number of occurrences of a specific word in the file.", + "input": "", + "output": 
"", + "id": 61 + }, + { + "category": "coding", + "instruction": "Implement a Python function to find the longest common subsequence of two input strings using dynamic programming.", + "input": "", + "output": "", + "id": 62 + }, + { + "category": "coding", + "instruction": "Implement a regular expression in Python to validate an email address.", + "input": "", + "output": "", + "id": 63 + }, + { + "category": "coding", + "instruction": "Write a program to find the nth Fibonacci number using dynamic programming.", + "input": "", + "output": "", + "id": 64 + }, + { + "category": "coding", + "instruction": "Implement a binary search algorithm to find a specific element in a sorted array.", + "input": "", + "output": "", + "id": 65 + }, + { + "category": "coding", + "instruction": "Implement a queue data structure using two stacks in Python.", + "input": "", + "output": "", + "id": 66 + }, + { + "category": "coding", + "instruction": "Implement a program to find the common elements in two arrays without using any extra data structures.", + "input": "", + "output": "", + "id": 67 + }, + { + "category": "math", + "instruction": "Given that f(x) = 5x^3 - 2x + 3, find the value of f(2).", + "input": "", + "output": "", + "id": 68 + }, + { + "category": "math", + "instruction": "Solve for x in the equation 3x + 10 = 5(x - 2).", + "input": "", + "output": "", + "id": 69 + }, + { + "category": "math", + "instruction": "If the endpoints of a line segment are (2, -2) and (10, 4), what is the length of the segment?", + "input": "", + "output": "", + "id": 70 + }, + { + "category": "writing", + "instruction": "Can you help me write a formal email to a potential business partner proposing a joint venture?", + "input": "", + "output": "", + "id": 71 + }, + { + "category": "writing", + "instruction": "Can you help me write a resignation letter to my current employer, while leaving on good terms and expressing gratitude for the opportunities provided?", + "input": "", + "output": 
"", + "id": 72 + }, + { + "category": "writing", + "instruction": "Use an appropriate format to structure a formal letter of recommendation for a student applying to a prestigious graduate program in computer science.", + "input": "", + "output": "", + "id": 73 + }, + { + "category": "writing", + "instruction": "Write a compelling product launch announcement email to inform our customers of our new software solution.", + "input": "", + "output": "", + "id": 74 + }, + { + "category": "writing", + "instruction": "Draft an apology email to a customer who experienced a delay in their order, and provide reassurance that the issue has been resolved.", + "input": "", + "output": "", + "id": 75 + }, + { + "category": "writing", + "instruction": "Write a script for a YouTube video exploring the history and cultural significance of jazz.", + "input": "", + "output": "", + "id": 76 + }, + { + "category": "writing", + "instruction": "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.", + "input": "", + "output": "", + "id": 77 + }, + { + "category": "writing", + "instruction": "Write a captivating movie review for a recently released science fiction film, discussing its plot, characters, and special effects.", + "input": "", + "output": "", + "id": 78 + }, + { + "category": "writing", + "instruction": "Structure a podcast script for an episode discussing the influence of streaming platforms on the music industry.", + "input": "", + "output": "", + "id": 79 + }, + { + "category": "writing", + "instruction": "Write a symphony concert review, discussing the orchestra's performance and overall audience experience.", + "input": "", + "output": "", + "id": 80 + } +] \ No newline at end of file From c419117329c7f7701b1119c1047d999b05390533 Mon Sep 17 00:00:00 2001 From: Tong Li Date: Thu, 27 Apr 2023 19:04:26 +0800 Subject: [PATCH 2/4] update questions and readme --- applications/Chat/evaluate/README.md | 2 
+- .../Chat/evaluate/sample/questions.json | 563 +----------------- 2 files changed, 6 insertions(+), 559 deletions(-) diff --git a/applications/Chat/evaluate/README.md b/applications/Chat/evaluate/README.md index f44311f4b..d6611abf7 100644 --- a/applications/Chat/evaluate/README.md +++ b/applications/Chat/evaluate/README.md @@ -97,7 +97,7 @@ We would like to mention that the evaluation of model answers using the GPT-3.5 ## Data Format ### Questions -The file [questions.json](./sample/questions.json) shows the example questions used to evaluate the performance of the model. The current sample questions are collected from [FastChat](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl). Each question record has the following field: +The file [questions.json](./sample/questions.json) shows the example questions used to evaluate the performance of the model. Each question record has the following field: * `id` (id, compulsory): The ID of the instruction / question. * `instruction` (str, compulsory): The instruction / question for the LLM. * `input` (str, optional): The additional context of the instruction / question. 
diff --git a/applications/Chat/evaluate/sample/questions.json b/applications/Chat/evaluate/sample/questions.json index cbda9c086..e9ef9f8b1 100644 --- a/applications/Chat/evaluate/sample/questions.json +++ b/applications/Chat/evaluate/sample/questions.json @@ -1,562 +1,9 @@ [ { - "category": "generic", - "instruction": "How can I improve my time management skills?", - "input": "", - "output": "", - "id": 1 - }, - { - "category": "generic", - "instruction": "What are the most effective ways to deal with stress?", - "input": "", - "output": "", - "id": 2 - }, - { - "category": "generic", - "instruction": "What are the main differences between Python and JavaScript programming languages?", - "input": "", - "output": "", - "id": 3 - }, - { - "category": "generic", - "instruction": "How can I increase my productivity while working from home?", - "input": "", - "output": "", - "id": 4 - }, - { - "category": "generic", - "instruction": "Can you explain the basics of quantum computing?", - "input": "", - "output": "", - "id": 5 - }, - { - "category": "generic", - "instruction": "What are the differences between plant-based and animal-based protein sources?", - "input": "", - "output": "", - "id": 6 - }, - { - "category": "generic", - "instruction": "How can I develop my critical thinking skills?", - "input": "", - "output": "", - "id": 7 - }, - { - "category": "generic", - "instruction": "What are the major challenges faced by the education sector today?", - "input": "", - "output": "", - "id": 8 - }, - { - "category": "generic", - "instruction": "What are the primary factors that influence consumer behavior?", - "input": "", - "output": "", - "id": 9 - }, - { - "category": "generic", - "instruction": "What are the most effective strategies for conflict resolution in the workplace?", - "input": "", - "output": "", - "id": 10 - }, - { - "category": "knowledge", - "instruction": "What are some potential implications of using a single-use plastic bottle versus a reusable 
bottle on both the environment and human health?", - "input": "", - "output": "", - "id": 11 - }, - { - "category": "knowledge", - "instruction": "What factors would you consider when designing an inclusive and accessible public transportation system?", - "input": "", - "output": "", - "id": 12 - }, - { - "category": "knowledge", - "instruction": "How can governments utilize fiscal and monetary policies to combat economic recessions?", - "input": "", - "output": "", - "id": 13 - }, - { - "category": "knowledge", - "instruction": "How do language and cultural barriers affect the way people communicate and form relationships in multicultural societies?", - "input": "", - "output": "", - "id": 14 - }, - { - "category": "knowledge", - "instruction": "Describe a scenario where artificial intelligence could be used to improve the quality and efficiency of healthcare delivery.", - "input": "", - "output": "", - "id": 15 - }, - { - "category": "knowledge", - "instruction": "Explain the process of gene editing using CRISPR-Cas9 technology, and discuss its potential applications and ethical implications.", - "input": "", - "output": "", - "id": 16 - }, - { - "category": "knowledge", - "instruction": "How do vaccinations work to protect individuals and communities from infectious diseases, and what is herd immunity?", - "input": "", - "output": "", - "id": 17 - }, - { - "category": "knowledge", - "instruction": "How do social media platforms influence the way people consume and share news, and what are the potential implications for the spread of misinformation?", - "input": "", - "output": "", - "id": 18 - }, - { - "category": "knowledge", - "instruction": "How do cultural, social, and economic factors influence people's food choices, and how can this knowledge be used to promote healthier diets?", - "input": "", - "output": "", - "id": 19 - }, - { - "category": "knowledge", - "instruction": "Explain the process of natural selection and how it contributes to the evolution 
and adaptation of species.", - "input": "", - "output": "", - "id": 20 - }, - { - "category": "roleplay", - "instruction": "How would you introduce yourself as a medieval knight at a royal banquet?", - "input": "", - "output": "", - "id": 21 - }, - { - "category": "roleplay", - "instruction": "As a pirate captain, what would you say to your crew to motivate them to search for hidden treasure?", - "input": "", - "output": "", - "id": 22 - }, - { - "category": "roleplay", - "instruction": "If you were a Shakespearean character, how would you declare your love for someone in a soliloquy?", - "input": "", - "output": "", - "id": 23 - }, - { - "category": "roleplay", - "instruction": "As a superhero, how would you explain your origin story to a curious child?", - "input": "", - "output": "", - "id": 24 - }, - { - "category": "roleplay", - "instruction": "Imagine you are a time traveler from the year 3000. What technological advancements would you tell people about?", - "input": "", - "output": "", - "id": 25 - }, - { - "category": "roleplay", - "instruction": "As a sports commentator, describe the winning play in the final seconds of a championship game.", - "input": "", - "output": "", - "id": 26 - }, - { - "category": "roleplay", - "instruction": "Pretend to be a world-famous chef. How would you describe your signature dish to a panel of judges?", - "input": "", - "output": "", - "id": 27 - }, - { - "category": "roleplay", - "instruction": "You are a mountain climber reaching the summit of Mount Everest. Describe your emotions and the view from the top.", - "input": "", - "output": "", - "id": 28 - }, - { - "category": "roleplay", - "instruction": "As a space colonist on Mars, describe your daily life and the challenges you face living on another planet.", - "input": "", - "output": "", - "id": 29 - }, - { - "category": "roleplay", - "instruction": "Pretend to be a character in a post-apocalyptic world. 
Describe how you survive and the allies you encounter.", - "input": "", - "output": "", - "id": 30 - }, - { - "category": "common-sense", - "instruction": "How can you determine if a restaurant is popular among locals or mainly attracts tourists, and why might this information be useful?", - "input": "", - "output": "", - "id": 31 - }, - { - "category": "common-sense", - "instruction": "What are some subtle clues that suggest someone is pretending to understand a topic or conversation when they are actually confused or uninformed?", - "input": "", - "output": "", - "id": 32 - }, - { - "category": "common-sense", - "instruction": "Why might someone choose to use a paper map or ask for directions instead of relying on a GPS device or smartphone app?", - "input": "", - "output": "", - "id": 33 - }, - { - "category": "common-sense", - "instruction": "How can you determine if a person is genuinely interested in a conversation or simply being polite?", - "input": "", - "output": "", - "id": 34 - }, - { - "category": "common-sense", - "instruction": "Why might someone prefer to shop at a small, locally-owned business instead of a large chain store, even if the prices are higher?", - "input": "", - "output": "", - "id": 35 - }, - { - "category": "common-sense", - "instruction": "How can you assess the credibility of a source of information, such as a news article or blog post, without relying solely on the reputation of the author or publisher?", - "input": "", - "output": "", - "id": 36 - }, - { - "category": "common-sense", - "instruction": "Why do some people enjoy the sensation of being scared, such as by watching horror movies or going on roller coasters, while others avoid these experiences?", - "input": "", - "output": "", - "id": 37 - }, - { - "category": "common-sense", - "instruction": "How can observing the behavior of other people in a social situation provide clues about cultural norms and expectations?", - "input": "", - "output": "", - "id": 38 - }, - { - 
"category": "common-sense", - "instruction": "Do we have a moral obligation to explore space, or should we focus on solving Earth's problems first?", - "input": "", - "output": "", - "id": 39 - }, - { - "category": "common-sense", - "instruction": "In a world where automation is becoming increasingly prevalent, is it more important to prioritize job creation or technological progress?", - "input": "", - "output": "", - "id": 40 - }, - { - "category": "fermi", - "instruction": "How many times does the average human blink in a lifetime? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.", - "input": "", - "output": "", - "id": 41 - }, - { - "category": "fermi", - "instruction": "How many atoms are in a grain of salt? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.", - "input": "", - "output": "", - "id": 42 - }, - { - "category": "fermi", - "instruction": "How many lightning strikes occur on Earth each day? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.", - "input": "", - "output": "", - "id": 43 - }, - { - "category": "fermi", - "instruction": "How many balloons would it take to lift a house like in the movie \"Up\"? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.", - "input": "", - "output": "", - "id": 44 - }, - { - "category": "fermi", - "instruction": "How many text messages are sent globally in a minute? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.", - "input": "", - "output": "", - "id": 45 - }, - { - "category": "fermi", - "instruction": "How many words are spoken daily on Earth? Try to explain your answer. 
Your explanation should take the reader through your reasoning step-by-step.", - "input": "", - "output": "", - "id": 46 - }, - { - "category": "fermi", - "instruction": "How many snowflakes fall during a typical winter? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.", - "input": "", - "output": "", - "id": 47 - }, - { - "category": "fermi", - "instruction": "How many pages are in all the books ever written? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.", - "input": "", - "output": "", - "id": 48 - }, - { - "category": "fermi", - "instruction": "How many times has the Earth orbited the Sun since the beginning of life? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.", - "input": "", - "output": "", - "id": 49 - }, - { - "category": "fermi", - "instruction": "How many songs have been recorded throughout history? Try to explain your answer. 
Your explanation should take the reader through your reasoning step-by-step.", - "input": "", - "output": "", - "id": 50 - }, - { - "category": "counterfactual", - "instruction": "What if the Internet had been invented during the Renaissance period?", - "input": "", - "output": "", - "id": 51 - }, - { - "category": "counterfactual", - "instruction": "What if the Aztecs had successfully repelled the Spanish conquistadors?", - "input": "", - "output": "", - "id": 52 - }, - { - "category": "counterfactual", - "instruction": "What if the Black Death had not occurred in the 14th century?", - "input": "", - "output": "", - "id": 53 - }, - { - "category": "counterfactual", - "instruction": "What if Isaac Newton had focused on biology instead of physics?", - "input": "", - "output": "", - "id": 54 - }, - { - "category": "counterfactual", - "instruction": "What if the Beatles had never formed as a band?", - "input": "", - "output": "", - "id": 55 - }, - { - "category": "counterfactual", - "instruction": "What if Alan Turing had not cracked the Enigma code during World War II?", - "input": "", - "output": "", - "id": 56 - }, - { - "category": "counterfactual", - "instruction": "What if the Suez Canal had never been constructed?", - "input": "", - "output": "", - "id": 57 - }, - { - "category": "counterfactual", - "instruction": "What if the Maya civilization had never mysteriously collapsed?", - "input": "", - "output": "", - "id": 58 - }, - { - "category": "counterfactual", - "instruction": "What if Christopher Columbus had not discovered the Americas?", - "input": "", - "output": "", - "id": 59 - }, - { - "category": "counterfactual", - "instruction": "What if Vincent van Gogh had been a successful artist during his lifetime?", - "input": "", - "output": "", - "id": 60 - }, - { - "category": "coding", - "instruction": "Develop a C++ program that reads a text file line by line and counts the number of occurrences of a specific word in the file.", - "input": "", - "output": 
"", - "id": 61 - }, - { - "category": "coding", - "instruction": "Implement a Python function to find the longest common subsequence of two input strings using dynamic programming.", - "input": "", - "output": "", - "id": 62 - }, - { - "category": "coding", - "instruction": "Implement a regular expression in Python to validate an email address.", - "input": "", - "output": "", - "id": 63 - }, - { - "category": "coding", - "instruction": "Write a program to find the nth Fibonacci number using dynamic programming.", - "input": "", - "output": "", - "id": 64 - }, - { - "category": "coding", - "instruction": "Implement a binary search algorithm to find a specific element in a sorted array.", - "input": "", - "output": "", - "id": 65 - }, - { - "category": "coding", - "instruction": "Implement a queue data structure using two stacks in Python.", - "input": "", - "output": "", - "id": 66 - }, - { - "category": "coding", - "instruction": "Implement a program to find the common elements in two arrays without using any extra data structures.", - "input": "", - "output": "", - "id": 67 - }, - { - "category": "math", - "instruction": "Given that f(x) = 5x^3 - 2x + 3, find the value of f(2).", - "input": "", - "output": "", - "id": 68 - }, - { - "category": "math", - "instruction": "Solve for x in the equation 3x + 10 = 5(x - 2).", - "input": "", - "output": "", - "id": 69 - }, - { - "category": "math", - "instruction": "If the endpoints of a line segment are (2, -2) and (10, 4), what is the length of the segment?", - "input": "", - "output": "", - "id": 70 - }, - { - "category": "writing", - "instruction": "Can you help me write a formal email to a potential business partner proposing a joint venture?", - "input": "", - "output": "", - "id": 71 - }, - { - "category": "writing", - "instruction": "Can you help me write a resignation letter to my current employer, while leaving on good terms and expressing gratitude for the opportunities provided?", - "input": "", - "output": 
"", - "id": 72 - }, - { - "category": "writing", - "instruction": "Use an appropriate format to structure a formal letter of recommendation for a student applying to a prestigious graduate program in computer science.", - "input": "", - "output": "", - "id": 73 - }, - { - "category": "writing", - "instruction": "Write a compelling product launch announcement email to inform our customers of our new software solution.", - "input": "", - "output": "", - "id": 74 - }, - { - "category": "writing", - "instruction": "Draft an apology email to a customer who experienced a delay in their order, and provide reassurance that the issue has been resolved.", - "input": "", - "output": "", - "id": 75 - }, - { - "category": "writing", - "instruction": "Write a script for a YouTube video exploring the history and cultural significance of jazz.", - "input": "", - "output": "", - "id": 76 - }, - { - "category": "writing", - "instruction": "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.", - "input": "", - "output": "", - "id": 77 - }, - { - "category": "writing", - "instruction": "Write a captivating movie review for a recently released science fiction film, discussing its plot, characters, and special effects.", - "input": "", - "output": "", - "id": 78 - }, - { - "category": "writing", - "instruction": "Structure a podcast script for an episode discussing the influence of streaming platforms on the music industry.", - "input": "", - "output": "", - "id": 79 - }, - { - "category": "writing", - "instruction": "Write a symphony concert review, discussing the orchestra's performance and overall audience experience.", - "input": "", - "output": "", - "id": 80 + "id": 0, + "instruction": "Help me summarize the following news?", + "input": "National Commercial Bank (NCB), Saudi Arabia's largest lender by assets, agreed to buy rival Samba Financial Group for $15 billion in the biggest banking takeover this 
year.NCB will pay 28.45 riyals ($7.58) for each Samba share, according to a statement on Sunday, valuing it at about 55.7 billion riyals. NCB will offer 0.739 new shares for each Samba share, at the lower end of the 0.736-0.787 ratio the banks set when they signed an initial framework agreement in June.The offer is a 3.5% premium to Samba's Oct. 8 closing price of 27.50 riyals and about 24% higher than the level the shares traded at before the talks were made public. Bloomberg News first reported the merger discussions.The new bank will have total assets of more than $220 billion, creating the Gulf region's third-largest lender. The entity's $46 billion market capitalization nearly matches that of Qatar National Bank QPSC, which is still the Middle East's biggest lender with about $268 billion of assets.", + "output": "NCB to pay 28.45 riyals for each Samba share. Deal will create Gulf region's third-largest lender", + "category": "closed qa" } ] \ No newline at end of file From ed3eaa6922ce95cff77744bf1975092de6fb57bd Mon Sep 17 00:00:00 2001 From: Tong Li Date: Fri, 28 Apr 2023 11:49:21 +0800 Subject: [PATCH 3/4] update documentation --- applications/Chat/evaluate/README.md | 78 +++++++++++++++------------- 1 file changed, 43 insertions(+), 35 deletions(-) diff --git a/applications/Chat/evaluate/README.md b/applications/Chat/evaluate/README.md index d6611abf7..d776a3e1f 100644 --- a/applications/Chat/evaluate/README.md +++ b/applications/Chat/evaluate/README.md @@ -1,16 +1,36 @@ # Evaluation -In this directory we will introduce how you can evaluate your model with GPT-4. +In this directory, we introduce how you can evaluate your model with GPT-4. ## Evaluation Pipeline -The whole evaluation process undergoes two steps. +The whole evaluation process undergoes the following three steps: 1. Prepare the questions following the internal data structure in the data format section (described below). -2. 
Generate answers from different models: Use `generate_gpt35_answers.py` to generate answers of GPT 3.5 and use `generate_answers.py` to generate answers of your own models. -3. Evaluate models using GPT 4: Use `evaluate.py` to evaluate model answers with GPT-4. +2. Generate answers from different models: + * Generate answers using GPT-3.5: [generate_gpt35_answers.py](generate_gpt35_answers.py). + * Generate answers using your own models: [generate_answers.py](generate_answers.py). +3. Evaluate models using GPT-4: [evaluate.py](evaluate.py). ### Generate Answers -In `generate_answers.py`, the model will generate answers in a batch way and different GPU processes will do inference on different shards of the given questions. Once all GPU process generate its answers, `merge.py` will merge different shards of answers and output a single answer file. Finally, the script will also remove the answer shards. An example script is given as follows. +#### Generate Answers Using GPT-3.5 +You can provide your own OpenAI key to generate answers from GPT-3.5 using [generate_gpt35_answers.py](./generate_gpt35_answers.py). + +An example script is provided as follows: +```shell +python generate_gpt35_answers.py \ + --dataset "path to the question dataset" \ + --answer_path "path to answer folder" \ + --num_workers 4 \ + --openai_key "your openai key" \ + --max_tokens 512 \ +``` + +#### Generate Answers Using our Own Model +You can also generate answers using your own models. The generation process is divided into two stages: +1. Generate answers using multiple GPUs (optional) with batch processing: [generate_answers.py](./generate_answers.py). +2. Merge multiple shards and output a single file: [merge.py](./merge.py). + +An example script is given as follows: ```shell device_number=number of your devices @@ -41,21 +61,9 @@ done ``` -`generate_gpt35_answers.py` will generate answers of GPT-3.5 An example script is given as follows. 
- -```shell -python generate_gpt35_answers.py \ - --dataset "path to the question dataset" \ - --answer_path "path to answer folder" \ - --num_workers 4 \ - --openai_key "your openai key" \ - --max_tokens 512 \ - -``` - ### Evaluate Answers -In `evaluate.py`, GPT-4 will help review and score answers of two different models. Here `Model 1` refers to the first model you specify in the `--answer_file_list` and `Model 2` refers to the second model. The script will finally print several metrics and output corresponding JSON files. +In [evaluate.py](./evaluate.py), GPT-4 helps to review and score answers of two different models. Here `Model 1` refers to the first model you specify in the `--answer_file_list` and `Model 2` refers to the second model. The script shows several metrics and output the corresponding JSON files. The metrics include: @@ -121,11 +129,11 @@ We store model answers in `{model_name}_answers.json`. The JSON file contains on An answer record has the following field: -* `category` (str): The category of the question. -* `instruction` (str): The question. -* `input` (str): This is empty if you only use [FastChat's]((https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/question.jsonl)) questions. -* `output` (str): The answer to the question. -* `id` (int): The question id. +* `category` (str, compulsory): The category of the instruction / question. +* `instruction` (str, compulsory): The instruction / question for the LLM. +* `input` (str, optional): The additional context of the instruction / question. +* `output` (str, compulsory): The output from the LLM. +* `id` (int, compulsory): The ID of the instruction / question. ### Results @@ -133,12 +141,12 @@ We store evaluation results in `results.json`. The JSON file contains one dictio The value has the following field: -* `model` (list): The names of the two models. -* `better` (int): The number of reviews where Model 2 receives a higher score. 
-* `worse` (int): The number of reviews where Model 2 receives a lower score. -* `tie` (int): The number of reviews where two models play to a tie. -* `win_rate` (float): Win rate of Model 2. -* `score` (list): Average score of the two models. +* `model` (list, compulsory): The names of the two models. +* `better` (int, compulsory): The number of reviews where Model 2 receives a higher score. +* `worse` (int, compulsory): The number of reviews where Model 2 receives a lower score. +* `tie` (int, compulsory): The number of reviews where two models play to a tie. +* `win_rate` (float, compulsory): Win rate of Model 2. +* `score` (list, compulsory): Average score of the two models. ### Better, Worse, Tie, Invalid, Review @@ -146,12 +154,12 @@ To help better compare the model answers, we store JSON files whose name ends wi A record has the following field: -* `review_id` (str): Random UUID, not in use. -* `id` (int): The question id. -* `reviewer_id` (int): A unique ID for a reviewer. Different reviewer id use different prompts. -* `metadata` (dict): It is empty. -* `review` (str): GPT-4 's review. -* `score` (list): The scores of two models. +* `review_id` (str, optional): Random UUID, not in use. +* `id` (int, compulsory): The ID of the instruction / question. +* `reviewer_id` (int, compulsory): A unique ID for a reviewer. Different reviewer id use different prompts. +* `metadata` (dict, optional): It is empty. +* `review` (str, optional): GPT-4's review. +* `score` (list, compulsory): The scores of two models. 
### Prompts From c1a355940ea4c5ec203bc9295e057fd3c8ca5efb Mon Sep 17 00:00:00 2001 From: Tong Li Date: Fri, 28 Apr 2023 11:56:35 +0800 Subject: [PATCH 4/4] update readme --- applications/Chat/evaluate/README.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/applications/Chat/evaluate/README.md b/applications/Chat/evaluate/README.md index d776a3e1f..7ace4bfe6 100644 --- a/applications/Chat/evaluate/README.md +++ b/applications/Chat/evaluate/README.md @@ -7,13 +7,13 @@ In this directory, we introduce how you can evaluate your model with GPT-4. The whole evaluation process undergoes the following three steps: 1. Prepare the questions following the internal data structure in the data format section (described below). 2. Generate answers from different models: - * Generate answers using GPT-3.5: [generate_gpt35_answers.py](generate_gpt35_answers.py). - * Generate answers using your own models: [generate_answers.py](generate_answers.py). -3. Evaluate models using GPT-4: [evaluate.py](evaluate.py). + * Generate answers using GPT-3.5: [`generate_gpt35_answers.py`](generate_gpt35_answers.py). + * Generate answers using your own models: [`generate_answers.py`](generate_answers.py). +3. Evaluate models using GPT-4: [`evaluate.py`](evaluate.py). ### Generate Answers #### Generate Answers Using GPT-3.5 -You can provide your own OpenAI key to generate answers from GPT-3.5 using [generate_gpt35_answers.py](./generate_gpt35_answers.py). +You can provide your own OpenAI key to generate answers from GPT-3.5 using [`generate_gpt35_answers.py`](./generate_gpt35_answers.py). An example script is provided as follows: ```shell @@ -27,8 +27,8 @@ python generate_gpt35_answers.py \ #### Generate Answers Using our Own Model You can also generate answers using your own models. The generation process is divided into two stages: -1. Generate answers using multiple GPUs (optional) with batch processing: [generate_answers.py](./generate_answers.py). -2. 
Merge multiple shards and output a single file: [merge.py](./merge.py). +1. Generate answers using multiple GPUs (optional) with batch processing: [`generate_answers.py`](./generate_answers.py). +2. Merge multiple shards and output a single file: [`merge.py`](./merge.py). An example script is given as follows: @@ -63,7 +63,7 @@ done ### Evaluate Answers -In [evaluate.py](./evaluate.py), GPT-4 helps to review and score answers of two different models. Here `Model 1` refers to the first model you specify in the `--answer_file_list` and `Model 2` refers to the second model. The script shows several metrics and output the corresponding JSON files. +In [`evaluate.py`](./evaluate.py), GPT-4 helps to review and score answers of two different models. Here `Model 1` refers to the first model you specify in the `--answer_file_list` and `Model 2` refers to the second model. The script shows several metrics and outputs the corresponding JSON files. The metrics include: @@ -105,7 +105,7 @@ We would like to mention that the evaluation of model answers using the GPT-3.5 ## Data Format ### Questions -The file [questions.json](./sample/questions.json) shows the example questions used to evaluate the performance of the model. Each question record has the following field: +The file [`questions.json`](./sample/questions.json) shows the example questions used to evaluate the performance of the model. Each question record has the following fields: * `id` (id, compulsory): The ID of the instruction / question. * `instruction` (str, compulsory): The instruction / question for the LLM. * `input` (str, optional): The additional context of the instruction / question. @@ -163,11 +163,11 @@ A record has the following field: ### Prompts -The data format is the same with [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl) prompts.
+The data format is the same as [`FastChat's`](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/prompt.jsonl) prompts. ### Reviewer -The data format is the same with [FastChat's](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl) reviewers. +The data format is the same as [`FastChat's`](https://github.com/lm-sys/FastChat/blob/main/fastchat/eval/table/reviewer.jsonl) reviewers. ## Citations
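Since this series removes `format_questions.py` and asks users to prepare `questions.json` by hand, it may help to check a question file against the documented record fields before generating answers. The following is a minimal sketch, not part of the patch: the helper names `validate_question` and `validate_questions_file` are hypothetical, and treating `id` as an `int` is an assumption based on the sample records rather than something the README states for question records.

```python
import json

# Field names and their compulsory/optional status follow the README's
# "Questions" section. The int type for `id` is an assumption taken from
# the sample records ("id": 0), not from the README text.
COMPULSORY_FIELDS = {"id": int, "instruction": str, "category": str}
OPTIONAL_FIELDS = {"input": str, "output": str}


def validate_question(record):
    """Return a list of problems found in one question record (empty if none)."""
    problems = []
    for name, expected in COMPULSORY_FIELDS.items():
        if name not in record:
            problems.append(f"missing compulsory field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    for name, expected in OPTIONAL_FIELDS.items():
        # Optional fields are only type-checked when they are present.
        if name in record and not isinstance(record[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    return problems


def validate_questions_file(path):
    """Map the list index of each invalid record to its list of problems."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)  # the file is expected to hold one JSON list
    report = {}
    for index, record in enumerate(records):
        problems = validate_question(record)
        if problems:
            report[index] = problems
    return report
```

Calling `validate_questions_file("sample/questions.json")` (path relative to the `evaluate` directory) would return an empty dictionary when every record has the compulsory fields with the assumed types.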