[feature] ColossalEval: Evaluation Pipeline for LLMs (#4786)

* Add ColossalEval
* Delete evaluate in Chat

Co-authored-by: Xu Yuanchen <yuanchen.xu00@gmail.com>
Co-authored-by: Tong Li <tong.li352711588@gmail.com>

@@ -1,396 +0,0 @@
# Evaluation

In this directory, we introduce how you can evaluate your model with our pipeline. The pipeline currently supports evaluating both Chinese and English capabilities.

## Installation

To start model evaluation, you need to install the required packages listed in `requirements.txt` under the `evaluate` folder.

```shell
pip install -r requirements.txt
```

## Evaluation Pipeline

The whole evaluation pipeline consists of three methods:

1. `GPT Evaluation`: evaluates model predictions using GPT models.
   - Compare the performance of two different models (battle).
   - Rate the model according to pre-defined metrics using prompt design.
   - Rate the model according to pre-defined metrics with an additional reference answer using prompt design.
2. `Automatic Evaluation`: evaluates model predictions using automatic metrics.
3. `UniEval`: evaluates model predictions using UniEval models (English only).

### Evaluation Category

Our evaluation pipeline examines the model's capability using 10 categories of questions. The following table introduces each category:

| Evaluation Category | Description |
| :-----------------: | :---------- |
| Brainstorming | Models are asked to generate a range of creative and diverse ideas according to the question. The capability of creativity is required. |
| Chat | Models are asked to continue a multi-round dialogue given the roles involved. The capability of understanding, memorizing previous rounds of the dialogue and answering according to the persona provided is required. |
| Classification | Models are asked to do classification tasks. The capability of accurate classification is required. |
| Closed QA | Models are asked to answer a closed QA question. The capability of answering questions with limited scope (such as single/multiple choice questions) is required. |
| Extraction | Models are asked to extract information from a given material. The capability of extracting the required information is required. |
| Generation | Models are asked to generate an email, letter, article, etc. The capability of generating high-quality, human-like text is required. |
| Open QA | Models are asked to answer an open QA question (without context provided). The capability of answering questions with the model's own knowledge base is required. |
| Roleplay | Models are asked to play the role provided. The capability of engaging in the scenario and effectively interacting with the user is required. |
| Rewriting | Models are asked to do rewriting tasks such as translation and grammar correction. The capability of rewriting according to different instructions is required. |
| Summarization | Models are asked to summarize the given paragraph or passage. The capability of summarization is required. |

To better understand each evaluation category, here are some example questions.

| Evaluation Category | Chinese Example | English Example |
| :-----------------: | :-------------- | :-------------- |
| Brainstorming | **Example 1:**<br/>请介绍一下人工智能的多个领域。<br/><br/>**Example 2:**<br/>请给出管理家庭财务的 3 个小技巧。<br/> | **Example 1:**<br/>How can I improve my memory? Any useful techniques you can suggest?<br/><br/>**Example 2:**<br/>What are some ways to increase productivity while working from home? |
| Chat | **Example 1:**<br/>基于以下角色信息完成一段对话。小张是一名新手爱好者,对养鸡有浓厚的兴趣。老李是一名有丰富经验的养鸡大师。<br/>小张:您好,老李,我最近开始对养鸡感兴趣了,想请教您一些问题。 <br/>老李:你好,小张,我很乐意帮助你。你想问些什么? <br/>小张:我想知道如何确定鸡的品种和性别? <br/>老李:确切的品种可以通过鸡的外貌特征来确定,而性别一般是通过鸡卵的大小和形状来判断。还有什么问题吗?<br/> 小张:<br/><br/>**Example 2:**<br/>基于以下角色信息完成一段对话。小明是一名医生,一位老年病患者想要停药,但他对病情有所忽视并有担忧;王叔叔是老年病患者的儿子,希望能够听取医生的建议。<br/>小明:你好,王叔叔,我了解你想要让你父亲停药。<br/>王叔叔:是的,我父亲已经吃了那么久的药,我担心药物对他的身体会有副作用。<br/>小明: | **Example 1:**<br/>Complete a conversation based on the following character information. Amy is a 30-year-old chef who runs her own restaurant. Jack is a food blogger who specializes in reviewing local restaurants.<br/>Amy: Hi Jack, I heard that you're a food blogger. Nice to meet you. <br/>Jack: Hi Amy, yes I am. Your restaurant has been receiving a lot of good reviews lately. <br/>Amy: Yes, we use only fresh and quality ingredients, and every dish is carefully crafted. <br/>Jack: <br/><br/>**Example 2:**<br/>Complete a dialogue based on the following role information. A: Elementary student B: Teacher<br/>B: Good morning, Student A. Today we're going to learn about addition and subtraction.<br/>A: Teacher, I already know this very well. Why do I need to learn it again?<br/>B: |
| Classification | **Example 1:**<br/>新闻标题:今日立夏,有一上联,立夏万物并秀,下联怎么对?<br/>请根据以上新闻标题判断新闻所属的分类,你需要从文化,娱乐,体育,财经,房产,教育,科技,旅游,游戏,军事这十类中选择一个答案。<br/><br/> **Example 2:**<br/>新闻标题:赵丽颖很久没有登上微博热搜了,但你们别急,她只是在憋大招而已。<br/>请根据新闻标题判断新闻所属的分类,你需要从文化,娱乐,体育,财经,房产,教育,科技,旅游,游戏,军事这十类中选择一个答案。 | **Example 1:**<br/>Title: Fighting for Love (2020) <br/>Description: Jasmine got obsessed with a man and now he's obsessed with her. Steamy nights, kisses and rules being broken awaits them. She turned his whole world upside down and now he's doing it to hers. In this free fall, can they survive each others love?\"<br/>Based on the above information, determine which genre the work of art belongs to. You can only choose one from \"sport\", \"horror\", \"drama\", \"history\", \"romance\", \"biography\", \"science fiction\", \"comedy\", \"animation\", \"documentary\", \"music\" and \"news\".<br/><br/>**Example2:** <br/>Title: Summer Breeze: The Isley Brothers Greatest Hits Live (2005)<br/>Description: Filmed in the US in 2005 and captured in excellent form led by Ron Isley's vocals and Ernie Isley's hard edged guitar. Virtually every track is a hit including Shout, Who's That Lady, Twist And Shout, Summer Breeze and Harvest For The World.<br/>Based on the above information, determine which genre the work of art belongs to. You can only choose one from \"sport\", \"horror\", \"drama\", \"history\", \"romance\", \"biography\", \"science fiction\", \"comedy\", \"animation\", \"documentary\", \"music\" and \"news\"." |
| Closed QA | **Example 1:**<br/>请从以下选项中选择正确答案。以下哪个是世界上最高山峰? <br/>A. 长城 <br/>B. 泰山 <br/>C. 珠穆朗玛峰 <br/>D. 黄山<br/><br/>**Example 2:**<br/>请从以下选项中选择一个最佳答案回答下面的问题。问题:非洲最高的山是哪座山?<br/> 选项: <br/>A. 麦金利山 <br/>B. 喜马拉雅山 <br/>C. 乞力马扎罗山 | **Example 1:**<br/>Which of the following options is NOT a primary color?<br/>(a) yellow<br/>(b) blue<br/>(c) orange<br/>(d) red<br/><br/>**Example 2:**<br/>Choose the correct option to complete the following sentence: \"Harry Potter and the Chamber of Secrets\" is the **\_\_\_\_** book in the Harry Potter series.<br/>(A) first<br/>(B) second<br/>(C) third<br/>(D) fourth |
| Extraction | **Example 1:**<br/>根据以下新闻文本,提取新闻报道时间,例如回答时按照格式“新闻报道时间:2007 年 8 月 10 日”<br/>新闻文本如下:2007-4-7 中新网 4 月 7 日电据中国消防在线消息,4 月 4 日晚上 7 时 30 分左右,湖南长潭高速公路上发生一起 6 车连环相撞失火事故。长株潭三地消防部门共出动消防车 21 台,警力 100 余人。经过消防官兵近 2 个小时奋力扑救,大火被成功扑灭。据初步调查,有 1 人在此次事故中死亡。<br/><br/>**Example 2:**<br/>根据以下新闻文本,提取新闻报道时间,例如回答时按照格式“新闻报道时间:2007 年 8 月 10 日”<br/>新闻文本如下:2014 年 1 月 15 日,据外媒《俄罗斯报》报道称,位于北半球的澳大利亚现在正处于炎热的夏季,而近日也到了高温酷暑的时候,当地时间 1 月 14 日晚,澳大利亚南部一夜间发生至少 250 起火灾。受炎热天气及雷雨天气影响,澳大利亚南部一夜间发生至少 250 起火灾,灾情多集中在维多利亚州。火灾发生后,救援人员立即展开救灾行动。目前,大部分起火点火势已被控制。 | **Example 1:**<br/>Ernest Hemingway, an American literary giant known for his spare and direct writing style, has penned timeless works such as 'The Old Man and the Sea', 'For Whom the Bell Tolls', and 'A Farewell to Arms', which have made a profound impact on the literary world and continue to be widely read and admired today.<br/>Extract the name of the author mentioned above.<br/><br/>**Example 2:**<br/>In the epic fantasy series 'A Song of Ice and Fire', George R.R. Martin weaves a complex web of political intrigue, war, and magic across the fictional continents of Westeros and Essos. Martin's richly developed characters and intricate plotlines have captivated readers worldwide, much like his other acclaimed works such as 'A Clash of Kings' and 'A Storm of Swords'.<br/>Extract the name of the author in the above material. |
| Generation | **Example 1:**<br/>请撰写一篇文章,介绍如何通过改善生活习惯来预防疾病和延长寿命。<br/><br/>**Example 2:**<br/>请根据以下情节撰写一篇短篇小说:一名年轻人被困在一个荒岛上,他必须想办法生存下去直到被救援。但他很快发现自己并不孤单。 | **Example 1:**<br/>Write a descriptive paragraph about an island to relax and unwind, including details about the location and atmosphere.<br/><br/>**Example 2:**<br/>Can you help me write a persuasive email to my colleagues encouraging them to participate in a charitable fundraising event? |
| Open QA | **Example 1:**<br/>请问万有引力定律由谁提出的?<br/><br/>**Example 2:**<br/>哪些国家参与了第一次世界大战? | **Example 1:**<br/>What are the four basic tastes of the human palate?<br/><br/>**Example 2:**<br/>Who painted the The Scream? |
| Rewriting | **Example 1:**<br/>请将以下句子改为正确的语序。 <br/>生日快乐你祝他了吗?<br/><br/>**Example 2:**<br/>将以下文本翻译成英语:<br/>“这个周末我要去海边玩” | **Example 1:**<br/>Please translate the following sentences, which are a mixture of Chinese and English, into full English. <br/>我需要买一些 healthy snacks,比如 nuts 和 dried fruits,作为我的 office 的午餐.<br/><br/>**Example 2:**<br/>Please rewrite the sentence using an inverted sentence structure.<br/>We won't begin our journey until the sun sets. |
| Roleplay | **Example 1:**<br/>我想让你担任 Android 开发工程师面试官。我将成为候选人,您将向我询问 Android 开发工程师职位的面试问题。我希望你只作为面试官回答。不要一次写出所有的问题。我希望你只对我进行采访。问我问题,等待我的回答。不要写解释。像面试官一样一个一个问我,等我回答。我的第一句话是“面试官你好”。 <br/><br/>**Example 2:**<br/>我想让你扮演讲故事的角色。你会想出引人入胜、富有想象力和吸引观众的有趣故事。它可以是童话故事、教育故事或任何其他类型的有潜力的故事以吸引人们的注意力和想象力。根据目标受众,您可以为您的讲故事环节选择特定的主题或主题,例如,如果是儿童,那么您可以谈论动物;如果是成人,那么基于历史的故事可能会更好地吸引他们等。我的第一个请求是我需要一个关于毅力的有趣故事。 | **Example 1:**<br/>Assume the role of a marriage counselor. Develop a series of communication exercises for a couple who are experiencing difficulties in their relationship. These exercises should promote active listening, empathy, and effective expression of emotions. Your first assignment is to provide a set of three exercises that focus on resolving conflicts and rebuilding trust. <br/><br/>**Example 2:**<br/>I want you to act as a travel agent. I will tell you my desired destination, travel dates, and budget, and it will be your job to suggest the best travel itinerary for me. Your recommendations should include the best transportation options, hotel accommodations, and any popular tourist attractions nearby. My first request is "I want to plan a trip to Tokyo for a week, with a budget of $2000. I want to explore the culture and food of the city." |
| Summarization | **Example 1:**<br/>请简要总结概括以下段落材料。<br/>当地时间 29 日,泰国卫生部通报,新增 143 名新冠肺炎确诊病例和 1 名死亡病例。截止到当地时间 29 日上午,泰国累计确诊病例 1388 例,其中泰国籍 1172 例,非泰国籍 216 例。死亡病例累计 7 例。(原题为《泰国新增 143 例新冠肺炎确诊病例累计确诊 1388 例》)<br/><br/> **Example 2:**<br/>请简要总结概括以下段落材料。<br/>近期,参与京雄高铁站站房建设的中铁十二局,因在施工过程中存在环境违法行为被雄安新区公开通报。通报发出后,引起社会广泛关注。近日,人民网记者从雄安新区相关部门及中铁十二局获悉,新区有关部门已经集中约谈了中铁十二局等 24 个参与雄安建设的项目单位。对于约谈内容和结果,中铁十二局有关宣传负责人回应:“具体内容不清楚,最好找雄安新区相关部门了解情况。”新区有关部门负责人表示,此前涉及的环境违法行为,中铁十二局已基本整改到位,但约谈内容和结果暂不公开,接下来,将按部就班推进环境治理工作。(原题为《雄安新区:中铁十二局涉环境违法已基本整改到位》) | **Example 1:**<br/>The 21 year-old-woman was treated by paramedics after the kitchen fire in Botfield Road in Shifnal, Shropshire. West Mercia Police said it is treating Wednesday morning's incident as arson and are appealing for any witnesses to contact them.The 50-year-old man has been arrested on suspicion of arson with intent to endanger life. For more on this and other stories from Shropshire.<br/>Please briefly summarize the above material within 20 words.<br/><br/>**Example 2:**<br/>South Wales Police were called to a property in Heolgerrig, Merthyr Tydfil, at about 13:40 BST on Sunday. The child was airlifted to Prince Charles Hospital but died shortly afterwards. Police are investigating the circumstances surrounding the incident and have appealed for witnesses. The girl's family are being supported by specially trained officers.<br/>Please briefly summarize the above material within 20 words. |

### Evaluation Metrics

#### GPT Evaluation

GPT evaluation uses GPT models to evaluate the predictions of different models, and different pre-defined evaluation metrics are applied to different categories. The following table shows the 11 pre-defined evaluation metrics in both Chinese and English:

| Evaluation Metric | Prompt Words | CoT (Chain-of-Thought) |
| :---------------: | :----------- | :-------------------- |
| 语言组织<br/>(Language organization) | 语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。</br></br>Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc. | 1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。<br/> 2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说<br/> 3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。<br/> 4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。<br/> 5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。<br/> 6. 根据以上因素综合评估答案的语言组织,并给出一个 1 到 5 的分数,其中 5 表示语言组织非常好,而 1 表示语言组织非常差。</br></br>1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.<br>2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.<br>3. Determine if the answer is relevant to the question or topic and conveys a clear message.<br>4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.<br>5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.<br>6. Evaluate the linguistic organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good linguistic organization and 1 indicates very poor linguistic organization. |
| 切题<br/>(Relevance) | 切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。</br></br>Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic. | 1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。<br/> 2. 阅读答案,确认答案是否直接回答了题目所问的问题。<br/> 3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。<br/> 4. 根据以上因素综合评估答案的切题程度,并给出一个 1 到 5 的分数,其中 5 表示答案非常切题,而 1 表示答案完全没有切题。</br></br>1. Read the question to determine what the question asks and what aspects of the question need to be answered.<br>2. Read the answers to make sure that they directly answer the question asked.<br>3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.<br>4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all. |
| 创意性<br/>(Creativity) | 创意性(1-5):某些头脑风暴问题可能需要答案具有创意,提出新的思路。</br></br>Creativity (1-5): Some brainstorming questions may require answers that are creative and suggest new ideas. | 1. 仔细阅读所提供的头脑风暴问题,确保你理解问题的要点和背景。<br/> 2. 根据你的知识和经验,判断所提供的答案是否可行。如果答案不可行,则创意性评分可能会受到影响。<br/> 3. 考虑答案中是否包含新颖的想法或独特的思路。答案可能与已知的解决方案有所重叠,但仍然可以被认为是有创意的,只要它提供了新的角度或方法来解决问题。<br/> 4. 根据答案的创意性,给出一个 1 到 5 的评分。如果答案缺乏创意,则应给出一个较低的评分。如果答案具有创意并提供了新的思路,应给出一个较高的评分。</br></br>1. Read the provided brainstorming questions carefully to make sure you understand the gist and context of the questions.<br>2. Based on your knowledge and experience, determine if the answers provided are feasible. If the answer is not feasible, the creativity score may be affected.<br>3. Consider whether the answer contains novel ideas or unique thoughts. An answer may overlap with a known solution and still be considered creative, as long as it offers a new perspective or approach to the problem.<br>4. Give a score of 1 to 5 depending on the creativity of the answer. If the answer lacks creativity, a lower score should be given. If the answer is creative and provides a new idea, a higher score should be given. |
| 实用性<br/>(Practicality) | 实用性(1-5):某些头脑风暴问题可能需要答案提出实用的建议或解决方法。</br></br>Practicality (1-5): Some brainstorming questions may require answers to suggest practical suggestions or solutions. | 1. 仔细阅读所提供的头脑风暴问题,确保你理解问题的要点和背景。<br/> 2. 根据你的知识和经验,判断所提供的答案是否可行。如果答案不可行,则实用性评分可能会受到影响。<br/> 3. 考虑答案中提出的建议或解决方法是否实用并可行。答案可能看起来很好,但如果无法实现或应用,则实用性评分可能会受到影响。<br/> 4. 根据答案的实用性,给出一个 1 到 5 的评分。如果答案缺乏实用性,则应给出一个较低的评分。如果答案提出了实用的建议或解决方法,并且可以很好地解决问题,则应给出一个较高的评分。</br></br>1. Read the provided brainstorming questions carefully to make sure you understand the gist and context of the questions.<br>2. Based on your knowledge and experience, determine if the answers provided are feasible. If the answer is not feasible, the practicality score may be affected.<br>3. Consider whether the suggestions or solutions presented in the answer are practical and workable. The answer may look good, but if it cannot be implemented or applied, the practicality score may be affected.<br>4. Give a score of 1 to 5 depending on the practicality of the answer. If the answer lacks practicality, a lower score should be given. If the answer makes a practical suggestion or solution and solves the problem well, a higher score should be given. |
| 正确性<br/>(Correctness) | 正确性(1-5):答案是否正确。</br></br> Correctness (1-5): whether the answer is correct or not. | 1. 仔细阅读题目,尝试自己回答该问题。<br/>2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的,则可以将正确性得分为 5 分。如果答案是部分正确的,则可以给予适当的得分,例如 2 分、3 分或 4 分。如果答案完全不正确,则只得 1 分。<br/><br/>1. Read the question carefully and try to answer the question yourself. <br/>2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be given. If the answer is completely incorrect, only 1 point is awarded. |
| 自然<br/>(Naturalness) | 自然(1-5):答案是否自然,并且符合问题给定的身份。</br></br>Naturalness (1-5): whether the answer is natural and fits the identity given by the question. | 1. 阅读题目,确定题目提供的身份信息。<br/> 2. 检查答案内容是否符合题目给定的身份。<br/> 3. 根据以上因素,对该回答的自然性进行打分,分数从 1 到 5,其中 1 表示不自然,5 表示非常自然,并符合问题给定的身份。</br></br>1. Read the question and determine the identity information provided in the question.<br>2. Check whether the content of the answer matches the identity given in the question.<br>3. Based on the above factors, score the naturalness of the response on a scale from 1 to 5, where 1 means unnatural and 5 means very natural and in accordance with the identity given in the question. |
| 参与感<br/>(Engagingness) | 参与感(1-5):答案是否对前面的对话内容做出了恰当的反应,是否理解对话的语境和背景。</br></br>Engagingness (1-5): whether the answer responds appropriately to the content of the preceding conversation and whether it understands the context and background of the conversation. | 1. 阅读题目,确定对话的语境和背景。<br/> 2. 检查答案是否充分理解对话的语境和背景,能否自然地融入到对话中而不显得突兀。<br/> 3. 根据以上因素,对该回答的参与感进行打分,分数从 1 到 5,其中 1 表示没有参与感,5 表示非常有参与感,并且恰当地理解了对话的语境和背景。</br></br>1. Read the questions to determine the context and background of the dialogue.<br>2. Check that the answer fully understands the context and background of the conversation and that it fits naturally into the conversation without seeming abrupt.<br>3. Based on the above factors, rate the response's engagement on a scale from 1 to 5, where 1 means not engaged and 5 means very engaged and appropriately understands the context and background of the conversation. |
| 合理性<br/>(Reasonableness) | 合理性(1-5):答案是否能够与前面的对话内容形成逻辑上的衔接,是否符合常理,能否在这个上下文中合理存在。</br></br>Reasonableness (1-5): Whether the answer can form a logical connection with the content of the previous dialogue, whether it is consistent with common sense, and whether it can reasonably exist in this context. | 1. 阅读题目,确定对话的主题以及问题期望的回答方向。<br/> 2. 判断答案是否能够与前面的对话内容形成逻辑上的衔接,是否符合常理,能否在这个上下文中合理存在。<br/> 3. 根据以上因素,对该回答的合理性进行打分,分数从 1 到 5,其中 1 表示不合理,5 表示非常合理,并且能够与前面的对话内容形成逻辑上的衔接,并符合常理。</br></br>1. Read the question and determine the topic of the conversation and the direction the question expects the answer to go.<br>2. Determine whether the answer can be logically connected to the preceding conversation, whether it makes common sense, and whether it can reasonably exist in this context.<br>3. Based on the above factors, rate the reasonableness of the answer on a scale from 1 to 5, where 1 means unreasonable and 5 means very reasonable and able to form a logical connection with the preceding dialogue content and consistent with common sense. |
| 多样性<br/>(Diversity) | 多样性(1-5):答案使用语言是否优美,具有有一定的创造性和想象力。然而,回答也应该保持合理和适度,不要过于夸张或离题。</br></br>Diversity (1-5): Whether the answers use beautiful language and have some creativity and imagination. However, answers should also be kept reasonable and moderate, not overly exaggerated or off-topic. | 1. 仔细阅读整个回答,确保完全理解回答所表达的内容和主题。<br/> 2. 在阅读回答的同时,注意语言的质量,例如措辞是否正确,语言是否生动等。<br/> 3. 检查回答的创造性和想象力,看看回答是否能够吸引人阅读下去。<br/> 4. 检查回答的合理性和适度,看看回答是否夸张或离题。5. 将多样性的评分打分在 1 到 5 之间,5 分表示回答的质量很好,能够吸引人阅读,1 分表示回答的内容生硬或者有离题的问题。</br></br>1. Read the entire response carefully to ensure that you fully understand the content and theme expressed in the response.<br>2. While reading the response, pay attention to the quality of the language, such as whether the wording is correct and the language is vivid.<br>3. Check the creativity and imagination of the response to see if the response is engaging to read on.<br>4. Check the reasonableness and appropriateness of the responses to see if the responses are exaggerated or off-topic.<br>5. Rate the diversity on a scale of 1 to 5, with a 5 indicating a good quality response that is engaging to read and a 1 indicating a raw response or a question that is off-topic. |
| 保真度<br/>(Fidelity) | 保真度(1-5):答案是否能够严格遵守角色的设定回答给定的请求。</br></br>Fidelity (1-5): whether the answer is able to answer the given request in strict compliance with the role setting. | 1. 仔细阅读问题,了解角色在问题中的设定和表现,包括职业、背景、观点、性格等方面。<br/> 阅读题目的请求,确认回答请求时需要注意的细节。<br/> 3. 对比提供的回答与该角色的设定,评估回答是否能够严格遵守角色的设定。<br/> 4. 结合以上评估结果给出保真度的评分,范围从 1 到 5 分,其中 1 分表示回答与角色设定完全不符,5 分表示回答完全符合角色设定且满足给定请求。</br></br>1. Read the question carefully to understand how the character is set up and represented in the question, including aspects such as occupation, background, point of view, and personality.<br>2. Read the question's request and confirm the details that need to be taken into account when answering the request.<br>3. Compare the provided answer with the setting of the role and assess whether the answer can strictly adhere to the setting of the role.<br>4. Combine the results of the above assessment to give a fidelity score ranging from 1 to 5, where a score of 1 means that the response does not match the persona at all, and a score of 5 means that the response fully complies with the persona and satisfies the given request. |
| 简明扼要<br/>(Conciseness) | 简明扼要(1-5):答案是否简明扼要,没有冗余内容。</br></br>Conciseness (1-5): answers should be concise and without redundant content. | 1. 阅读题目,提取出材料的重点。<br/> 2. 阅读该总结,并注意其中的主要观点和信息。<br/> 3. 评估总结的长度。一个简明扼要的总结通常应该在几句话或几段文字内传达关键信息,而不是冗长的段落或文章。<br/> 4. 检查总结是否包含与主要观点无关的信息或冗余信息。<br/> 5. 确定总结涵盖了材料中的关键信息,并且没有忽略任何重要细节。<br/> 6. 给总结打出 1-5 的分数,其中 5 表示总结简明扼要,没有冗余内容,而 1 表示总结冗长或包含不必要的信息,难以理解或记忆。根据您的判断,打出适当的得分。</br></br>1. Read the title and extract the main points of the material.<br>2. Read the summary and note the main ideas and messages in it.<br>3. Assess the length of the summary. A concise summary should usually convey key information within a few sentences or paragraphs, rather than lengthy paragraphs or essays.<br>4. Check that the summary does not contain information that is not relevant to the main ideas or that is redundant.<br>5. Make sure that the summary covers the key information in the material and that no important details have been omitted.<br>6. Rate the summary on a scale of 1-5, where 5 means the summary is concise and free of redundancy, and 1 means the summary is lengthy or contains unnecessary information that is difficult to understand or remember. Based on your judgment, assign the appropriate score. |

GPT models evaluate the quality of model predictions based on the given prompt words and give a score between 1 and 5.

> **NOTE 1:** Even for the same metric, the details of its prompt words and CoT (Chain-of-Thought) can differ depending on which category you want to evaluate. For example, the prompt words for the metric `correctness` shown here are "Whether the answer is correct or not." (for the category `classification`), but for the category `extraction` the prompt words can be "Answers should extract the required information accurately and should not contain any incorrect or misleading information." You can find all the prompt words and CoT (Chain-of-Thought) in `prompt/evaluation_prompt`.

> **NOTE 2:** To add customized metrics, you can refer to [FAQ](#faq).

#### Automatic Evaluation

Automatic metrics evaluate the capability of a model by comparing model predictions with reference answers.
There are two ways to obtain reference answers:

- For instructions coming from human-designed problems, such as roleplay and chat, the reference answers are generated by GPT-3.5.
- For instructions related to classic NLP problems, such as classification, extraction and summarization, the reference answers are collected from open-source datasets with target answers.

There are 6 types of automatic evaluation metrics listed in the table below:

| Automatic Evaluation Metric | Description |
| :-------------------------: | :---------- |
| BLEU-n | Measures n-gram precision between prediction and reference.<br/> BLEU-1 (unigram) evaluates accuracy at the word level.<br/> BLEU-n (n-gram) evaluates fluency at the sentence level. |
| ROUGE | ROUGE-N measures the number of matching n-grams between prediction and reference.<br/> ROUGE-L measures the longest common subsequence (LCS) between prediction and reference. |
| Distinct | Measures the diversity of the generated text by counting unique n-grams. |
| BERTScore | Measures the semantic similarity between tokens of predictions and references with BERT. |
| Precision<br/> Recall<br/> F1 Score | Measure the overlap between prediction and reference (designed for the classification and extraction categories). |
| CHRF | Measures the similarity of character n-grams between prediction and reference. |

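To make these definitions concrete, here is a minimal sketch of how a Distinct-n style diversity score can be computed. It is only an illustration: the pipeline's own `metrics.distinct_score` may tokenize differently (for example, character-level for Chinese).

```python
from typing import List


def distinct_n(predictions: List[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams over all generated texts."""
    total_ngrams = 0
    unique_ngrams = set()
    for text in predictions:
        tokens = text.split()  # the real pipeline may tokenize differently, e.g. per character for Chinese
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total_ngrams += len(ngrams)
        unique_ngrams.update(ngrams)
    return len(unique_ngrams) / total_ngrams if total_ngrams > 0 else 0.0


print(distinct_n(["the cat sat on the mat", "the dog sat on the mat"], n=2))  # 0.7
```
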
#### UniEval Evaluation

UniEval converts all evaluation tasks of different dimensions (metrics) into Boolean QA problems and utilizes the model to answer with "Yes" or "No". Compared with similarity-based metrics such as ROUGE and BLEU, UniEval can achieve a more comprehensive evaluation. In addition, UniEval also demonstrates its ability to transfer to unseen dimensions and tasks.

In our evaluation pipeline, two pre-trained UniEval evaluators are used. One is [unieval-sum](https://huggingface.co/MingZhong/unieval-sum) and the other is [unieval-dialog](https://huggingface.co/MingZhong/unieval-dialog). The two models can be used for 3 tasks, `summarization`, `dialogue` and `data2text`. Each task has different evaluation dimensions.

| UniEval Model | Task | Dimension (Metric) |
| :------------: | :------------ | :---------------- |
| unieval-sum | summarization | coherence: whether the summary is coherent<br/>consistency: whether the claim is consistent with the given document<br/>fluency: whether the paragraph is fluent<br/>relevance: whether the summary is relevant to the reference |
| unieval-sum | data2text | naturalness: whether the utterance is fluent<br/>informativeness: whether the utterance is informative according to the reference |
| unieval-dialog | dialogue | naturalness: whether the response is natural in the dialogue<br/>coherence: whether the response is coherent in the dialogue history<br/>understandability: whether the response is understandable in the dialogue |

> **NOTE 1:** The task "data2text" uses the same model as the task "summarization".

> **NOTE 2:** In the UniEval paper, the `unieval-sum` model demonstrates the best transfer ability, so you can evaluate your customized metric with this model. Details of adding customized metrics can be found in [FAQ](#faq).

> **NOTE 3:** We do not include all metrics provided in UniEval in our pipeline because the data structure and content of the instructions we want to evaluate are not suitable for directly applying some UniEval metrics.

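To give a feel for how such a Boolean QA evaluator is queried, here is a rough sketch of scoring a single dimension. This is not the pipeline's code: the exact question wording for each dimension lives in the function `add_question` in `unieval/utils.py`, and the sketch assumes the published checkpoints follow the standard seq2seq (T5-style) interface of `transformers`.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Either the Hugging Face hub id or the local path you put under "path_for_UniEval".
model_path = "MingZhong/unieval-dialog"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Each dimension is phrased as a yes/no question about the model output
# (the wording below is illustrative; see `add_question` in `unieval/utils.py`).
text = ("question: Is this a natural response in the dialogue </s> "
        "response: Sure, Tokyo in spring is a great choice for food and culture.")
inputs = tokenizer(text, return_tensors="pt")

# The score is derived from how strongly the model prefers answering "Yes" over "No"
# at the first decoding step.
decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
with torch.no_grad():
    first_step_logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, 0]
probs = first_step_logits.softmax(dim=-1)
yes_id = tokenizer("Yes").input_ids[0]
no_id = tokenizer("No").input_ids[0]
score = float(probs[yes_id] / (probs[yes_id] + probs[no_id]))
print(f"dialogue-naturalness ≈ {score:.3f}")
```
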
## Evaluation Process

### Data Format

#### Target Answers / Predictions

A JSON file contains one list. Each element in the list is a target answer / prediction record for one instruction / question.
An element should have the following fields:

- `category` (str, compulsory): The category of the instruction / question.
- `instruction` (str, compulsory): The instruction / question for the LLM.
- `input` (str, optional): The additional context of the instruction / question.
- `output` (str, optional): The sample output of the instruction (generated by GPT-3.5 by default).
- `target` (str, optional): The target answer for the instruction.
- `id` (int, compulsory): The ID of the instruction / question.

If the instruction / question has a target answer, the `output` can be empty. Otherwise, we generate an answer from GPT-3.5 as the `output`, and the `target` field is empty.

Example:

```json
[
    {
        "category": "brainstorming",
        "instruction": "请介绍一下人工智能的多个领域。",
        "input": "",
        "output": "{GPT-3.5 Answers}",
        "target": "",
        "id": 1
    },
    {
        "category": "classification",
        "instruction": "新闻标题:为什么电影《倩女幽魂》中燕赤霞一个道士却拿着金刚经?请根据新闻标题判断新闻所属的分类,你需要从文化,娱乐,体育,财经,房产,教育,科技,旅游,游戏,军事这十类中选择一个答案。",
        "input": "",
        "output": "",
        "target": "{target answer}",
        "id": 2
    }
]
```

#### Model Answers / Predictions

A JSON file contains one list. Each element in the list is a model answer / prediction record for one instruction / question.

An element should have the following fields:

- `category` (str, compulsory): The category of the instruction / question.
- `instruction` (str, compulsory): The instruction / question for the LLM.
- `input` (str, optional): The additional context of the instruction / question.
- `output` (str, compulsory): The output from the LLM.
- `target` (str, optional): The target answer for the instruction.
- `id` (int, compulsory): The ID of the instruction / question.

Example:

```json
[
    {
        "category": "brainstorming",
        "instruction": "请介绍一下人工智能的多个领域。",
        "input": "",
        "output": "{Model Answers / Predictions}",
        "target": "",
        "id": 1
    },
    {
        "category": "classification",
        "instruction": "新闻标题:为什么电影《倩女幽魂》中燕赤霞一个道士却拿着金刚经?请根据新闻标题判断新闻所属的分类,你需要从文化,娱乐,体育,财经,房产,教育,科技,旅游,游戏,军事这十类中选择一个答案。",
        "input": "",
        "output": "{Model Answers / Predictions}",
        "target": "{target answer}",
        "id": 2
    }
]
```

### Prompt

#### Battle Prompt

The following is the Chinese battle prompt. In the battle prompt, the question and answers from two different models are fed into the prompt template. You can find example battle prompt files for Chinese and English in `prompt/battle_prompt`.

```json
{
    "id": 1,
    "system_prompt": "你是一个检查回答质量的好助手。",
    "prompt_template": "[问题]\n{question}\n\n[1号AI助手的答案]\n{answer_1}\n\n[1号AI助手答案终止]\n\n[2号AI助手的答案]\n{answer_2}\n\n[2号AI助手答案终止]\n\n[要求]\n{prompt}\n\n",
    "prompt": "我们需要你评价这两个AI助手回答的性能。\n请对他们的回答的有用性、相关性、准确性、详细程度进行评分。每个AI助手都会得到一个1到10分的总分,分数越高表示整体表现越好。\n请首先输出一行,该行只包含两个数值,分别表示1号和2号AI助手的分数。这两个分数之间要有一个空格。在随后的一行中,请对你的评价作出全面的解释,避免任何潜在的偏见,并确保AI助手回答的顺序不会影响您的判断。"
}
```

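For illustration, the placeholders in `prompt_template` are filled with the question, the two models' answers and the requirement text before the request is sent to the GPT reviewer. The sketch below shows that substitution; the file name is hypothetical, and `system_prompt` is presumably sent as the system message.

```python
import json

# Hypothetical file name; the example battle prompt files live in `prompt/battle_prompt`.
with open("prompt/battle_prompt/prompt_cn.json", encoding="utf-8") as f:
    battle_prompt = json.load(f)

question = "请介绍一下人工智能的多个领域。"
answer_1 = "answer from the first model"
answer_2 = "answer from the second model"

# Fill the template; the result is the user message of the request,
# and battle_prompt["system_prompt"] is presumably used as the system message.
user_message = battle_prompt["prompt_template"].format(
    question=question, answer_1=answer_1, answer_2=answer_2, prompt=battle_prompt["prompt"]
)
print(user_message)
```
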
#### Evaluation Prompt

The following is an example of a Chinese GPT evaluation prompt. In an evaluation prompt, you should define your metrics in `metrics` and provide the CoT (Chain-of-Thought) in `CoT`. You can find example evaluation prompt files for Chinese and English in `prompt/evaluation_prompt`.

```json
{
    "brainstorming": {
        "id": 1,
        "category": "brainstorming",
        "metrics": {
            "language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。"
        },
        "CoT": {
            "language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:"
        },
        "prompt": "你是一个好助手。请你为下面“头脑风暴”问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
    }
}
```

`"metrics"`: the metrics that can be used in GPT evaluation. This field determines which metrics can be added to your config file.

`"CoT"`: the evaluation steps you prompt GPT models with for each metric defined in `"metrics"`.

### Evaluation

#### Configuration

The following is an example of an English config file. The configuration file controls how the pipeline evaluates the model. You need to specify the GPT evaluation metrics, automatic metrics and UniEval metrics (English only) under the keys `GPT`, `Metrics` and `UniEval` respectively. You can find example Chinese and English config files in `config`.

```json
{
    "language": "en",
    "path_for_UniEval": {
        "summarization": "path to unieval-sum model",
        "dialogue": "path to unieval-dialog model",
        "data2text": "path to unieval-sum model"
    },
    "category": {
        "brainstorming": {
            "GPT": ["relevance", "creativity", "practicality", "reasonableness"],
            "Metrics": ["Distinct"],
            "UniEval": [
                "summarization-fluency",
                "data2text-naturalness",
                "data2text-informativeness"
            ]
        },
        "chat": {
            "GPT": ["relevance", "naturalness", "engagingness", "reasonableness"],
            "Metrics": ["Distinct"],
            "UniEval": [
                "dialogue-naturalness",
                "dialogue-coherence",
                "dialogue-understandability"
            ]
        }
    }
}
```

`"language"`: the language used to evaluate the model capability, `"cn"` for Chinese or `"en"` for English.

`"path_for_UniEval"`: paths to the UniEval models (English only).

`"category"`: the category/categories needed to evaluate the model capability.

`"GPT"`: the metrics you want to use for GPT evaluation.

`"Metrics"`: the metrics you want to use for automatic metrics evaluation.

`"UniEval"`: the metrics you want to use for UniEval metrics evaluation. Each metric has to be in the `"{task}-{metric}"` format because different tasks can have metrics with the same name, such as naturalness and coherence.

You can remove a key such as `"Metrics"` to skip evaluating answers with the corresponding evaluation method.

You can create your config file based on the available settings listed in the following table.

| "category" | "GPT" | "Metrics" | "UniEval" |
| :--------------: | :---------------------: | :---------: | :--------------------------: |
| "brainstorming" | "language organization" | "BLEU" | "dialogue-naturalness" |
| "chat" | "relevance" | "ROUGE" | "dialogue-coherence" |
| "classification" | "creativity" | "Distinct" | "dialogue-understandability" |
| "closed_qa" | "practicality" | "BERTScore" | "data2text-naturalness" |
| "extraction" | "correctness" | "Precision" | "data2text-informativeness" |
| "generation" | "naturalness" | "Recall" | "summarization-coherence" |
| "open_qa" | "engagingness" | "F1 score" | "summarization-consistency" |
| "rewriting" | "reasonableness" | "CHRF" | "summarization-fluency" |
| "roleplay" | "diversity" | | "summarization-relevance" |
| "summarization" | "fidelity" | | |
| | "conciseness" | | |

> **NOTE:** For categories which don't have standard answers, such as `brainstorming`, you should avoid using automatic metrics such as `BLEU` and `ROUGE`, which are based on similarity measures, and use `Distinct` instead in your config file.

#### Evaluate

After setting the configuration file, you can evaluate the model using `eval.py`. If you want to compare the answers of two different models, you should specify two answer files in the argument `answer_file_list` and two model names in the argument `model_name_list`. If you want to evaluate one answer file, the length of both `answer_file_list` and `model_name_list` should be 1, and the program will perform evaluation using automatic metrics and GPT models.

An example script is provided as follows:

```shell
python eval.py \
    --config_file "path to the config file" \
    --battle_prompt_file "path to the prompt file for battle" \
    --gpt_evaluation_prompt_file "path to the prompt file for gpt evaluation" \
    --target_file "path to the target answer file" \
    --answer_file_list "path to the answer files of at most 2 models" \
    --model_name_list "the names of at most 2 models" \
    --gpt_model "which GPT model to use for evaluation" \
    --save_path "path to save results" \
    --openai_key "your openai key"
```

If you want GPT evaluation with reference, you can add the argument `--gpt_with_reference`.

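For example, the same invocation with reference-based GPT evaluation enabled would look as follows (all other arguments stay unchanged):

```shell
python eval.py \
    --config_file "path to the config file" \
    --battle_prompt_file "path to the prompt file for battle" \
    --gpt_evaluation_prompt_file "path to the prompt file for gpt evaluation" \
    --target_file "path to the target answer file" \
    --answer_file_list "path to the answer file of the model" \
    --model_name_list "the name of the model" \
    --gpt_model "which GPT model to use for evaluation" \
    --save_path "path to save results" \
    --openai_key "your openai key" \
    --gpt_with_reference
```
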
## FAQ

<details><summary><b>How can I add a new GPT evaluation metric?</b></summary>

For example, if you want to add a new metric `persuasiveness` to the category `brainstorming`, you should add the metric definition and its corresponding CoT (Chain-of-Thought) to the evaluation prompt file in `prompt/evaluation_prompt`. The CoT can be generated using ChatGPT: you can prompt ChatGPT to generate evaluation steps for the new metric.

```json
{
    "brainstorming": {
        "id": 1,
        "category": "brainstorming",
        "metrics": {
            "persuasiveness": "persuasiveness(1-5):a short description for persuasiveness"
        },
        "CoT": {
            "persuasiveness": "CoT for persuasiveness\n\npersuasiveness:"
        },
        "prompt": "You are a good assistant. Please rate the given answer to the \"brainstorming\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
    }
}
```

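To actually use the new metric, it should also appear in the `"GPT"` list of the corresponding category in your config file. A minimal sketch of such a config entry, using the hypothetical metric name from above:

```json
{
    "language": "en",
    "category": {
        "brainstorming": {
            "GPT": ["language organization", "persuasiveness"],
            "Metrics": ["Distinct"]
        }
    }
}
```
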
</details>

<details><summary><b>How can I add a new UniEval evaluation metric?</b></summary>

For example, if you want to add a new metric `persuasiveness` to the task `data2text`, you should add a Boolean QA question about the metric in the function `add_question` in `unieval/utils.py`. Note that how well the model evaluates this new metric is unknown, and you may need some experiments to test whether the model is capable of evaluating it.

```python
if task == 'data2text':
    if dimension == 'persuasiveness':
        cur_input = 'question: Is this a persuasive utterance </s> utterance: ' + output[i]
```

</details>

## To Do

- [x] Add evaluation for English capability
- [x] Support UniEval
- [x] Support GPT-4 evaluation
- [x] Support GPT evaluation with reference

## Citations

```bibtex
@misc{vicuna2023,
    title = {Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\%* ChatGPT Quality},
    url = {https://vicuna.lmsys.org},
    author = {Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P.},
    month = {March},
    year = {2023}
}

@misc{liu2023geval,
    title={G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment},
    author={Yang Liu and Dan Iter and Yichong Xu and Shuohang Wang and Ruochen Xu and Chenguang Zhu},
    year={2023},
    eprint={2303.16634},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@misc{zhong2022unified,
    title={Towards a Unified Multi-Dimensional Evaluator for Text Generation},
    author={Ming Zhong and Yang Liu and Da Yin and Yuning Mao and Yizhu Jiao and Pengfei Liu and Chenguang Zhu and Heng Ji and Jiawei Han},
    year={2022},
    eprint={2210.07197},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

@@ -1,204 +0,0 @@
{
    "language": "cn",
    "category": {
        "brainstorming": {
            "GPT": ["language organization", "relevance", "creativity", "practicality", "reasonableness"],
            "Metrics": ["Distinct"]
        },
        "chat": {
            "GPT": ["language organization", "naturalness", "engagingness", "fidelity"],
            "Metrics": ["Distinct"]
        },
        "classification": {
            "GPT": ["relevance", "correctness"],
            "Metrics": ["Precision", "Recall", "F1 score", "CHRF"]
        },
        "closed_qa": {
            "GPT": ["relevance", "correctness"],
            "Metrics": ["BLEU", "ROUGE", "BERTScore", "CHRF"]
        },
        "extraction": {
            "GPT": ["relevance", "correctness"],
            "Metrics": ["Precision", "Recall", "F1 score", "CHRF"]
        },
        "generation": {
            "GPT": ["language organization", "relevance", "diversity"],
            "Metrics": ["BLEU", "ROUGE", "BERTScore"]
        },
        "logical_reasoning": {
            "GPT": ["correctness", "relevance", "reasonableness"],
            "Metrics": ["BLEU", "ROUGE", "BERTScore", "CHRF"]
        },
        "open_qa": {
            "GPT": ["language organization", "relevance", "correctness"],
            "Metrics": ["Distinct"]
        },
        "rewriting": {
            "GPT": ["language organization", "relevance", "correctness"],
            "Metrics": ["BLEU", "ROUGE", "BERTScore"]
        },
        "roleplay": {
            "GPT": ["language organization", "relevance", "fidelity", "creativity"],
            "Metrics": ["Distinct"]
        },
        "summarization": {
            "GPT": ["language organization", "relevance", "correctness", "conciseness"],
            "Metrics": []
        },
        "Finance": {
            "GPT": ["relevance", "correctness"],
            "Metrics": []
        },
        "Law": {
            "GPT": ["relevance", "correctness"],
            "Metrics": []
        },
        "Education": {
            "GPT": ["relevance", "correctness"],
            "Metrics": []
        },
        "Medical": {
            "GPT": ["relevance", "correctness"],
            "Metrics": []
        },
        "STEM": {
            "GPT": ["relevance", "correctness"],
            "Metrics": []
        },
        "SocialScience": {
            "GPT": ["relevance", "correctness"],
            "Metrics": []
        },
        "Humanity": {
            "GPT": ["relevance", "correctness"],
            "Metrics": []
        },
        "Other": {
            "GPT": ["relevance", "correctness"],
            "Metrics": []
        },
        "ethics": {
            "GPT": ["relevance", "correctness"],
            "Metrics": []
        }
    }
}
@@ -1,283 +0,0 @@
{
    "language": "en",
    "path_for_UniEval": {
        "summarization": "path to unieval-sum",
        "dialogue": "path to unieval-dialog",
        "data2text": "path to unieval-sum"
    },
    "category": {
        "brainstorming": {
            "GPT": ["language organization", "relevance", "creativity", "practicality", "reasonableness"],
            "Metrics": ["Distinct"],
            "UniEval": ["summarization-fluency", "data2text-naturalness", "data2text-informativeness"]
        },
        "chat": {
            "GPT": ["language organization", "naturalness", "engagingness", "fidelity"],
            "Metrics": ["Distinct"],
            "UniEval": ["summarization-fluency", "dialogue-naturalness", "dialogue-coherence", "dialogue-understandability", "data2text-naturalness", "data2text-informativeness"]
        },
        "classification": {
            "GPT": ["relevance", "correctness"],
            "Metrics": ["Precision", "Recall", "F1 score", "CHRF"],
            "UniEval": ["summarization-fluency", "data2text-naturalness", "data2text-informativeness"]
        },
        "closed_qa": {
            "GPT": ["relevance", "correctness"],
            "Metrics": ["BLEU", "ROUGE", "BERTScore", "CHRF"],
            "UniEval": ["summarization-fluency", "data2text-naturalness", "data2text-informativeness"]
        },
        "extraction": {
            "GPT": ["relevance", "correctness"],
            "Metrics": ["Precision", "Recall", "F1 score", "CHRF"],
            "UniEval": ["summarization-fluency", "data2text-naturalness", "data2text-informativeness"]
        },
        "generation": {
            "GPT": ["language organization", "relevance", "diversity"],
            "Metrics": ["BLEU", "ROUGE", "BERTScore"],
            "UniEval": ["summarization-fluency", "data2text-naturalness", "data2text-informativeness"]
        },
        "logical_reasoning": {
            "GPT": ["correctness", "relevance", "reasonableness"],
            "Metrics": ["BLEU", "ROUGE", "BERTScore", "CHRF"],
            "UniEval": []
        },
        "open_qa": {
            "GPT": ["language organization", "relevance", "correctness"],
            "Metrics": ["Distinct"],
            "UniEval": ["summarization-fluency", "data2text-naturalness", "data2text-informativeness"]
        },
        "rewriting": {
            "GPT": ["language organization", "relevance", "correctness"],
            "Metrics": ["BLEU", "ROUGE", "BERTScore"],
            "UniEval": ["summarization-fluency", "data2text-naturalness", "data2text-informativeness"]
        },
        "roleplay": {
            "GPT": ["language organization", "relevance", "fidelity", "creativity"],
            "Metrics": ["Distinct"],
            "UniEval": ["summarization-fluency", "data2text-naturalness", "data2text-informativeness"]
        },
        "summarization": {
            "GPT": ["language organization", "relevance", "correctness", "conciseness"],
            "Metrics": ["BLEU", "ROUGE", "BERTScore", "CHRF"],
            "UniEval": []
        },
        "Finance": {
            "GPT": ["relevance", "correctness"],
            "Metrics": [],
            "UniEval": []
        },
        "Law": {
            "GPT": ["relevance", "correctness"],
            "Metrics": [],
            "UniEval": []
        },
        "Education": {
            "GPT": ["relevance", "correctness"],
            "Metrics": [],
            "UniEval": []
        },
        "Medical": {
            "GPT": ["relevance", "correctness"],
            "Metrics": [],
            "UniEval": []
        },
        "STEM": {
            "GPT": ["relevance", "correctness"],
            "Metrics": [],
            "UniEval": []
        },
        "SocialScience": {
            "GPT": ["relevance", "correctness"],
            "Metrics": [],
            "UniEval": []
        },
        "Humanity": {
            "GPT": ["relevance", "correctness"],
            "Metrics": [],
            "UniEval": []
        },
        "Other": {
            "GPT": ["relevance", "correctness"],
            "Metrics": [],
            "UniEval": []
        },
        "ethics": {
            "GPT": ["relevance", "correctness"],
            "Metrics": [],
            "UniEval": []
        }
    }
}
@@ -1,229 +0,0 @@
|
|||
import os
|
||||
from typing import Any, Dict, List
|
||||
|
||||
import gpt_evaluate
|
||||
import metrics
|
||||
import unieval
|
||||
from utils import analyze_automatic_results, get_data_per_category, save_automatic_results
|
||||
|
||||
|
||||
class Evaluator(object):
|
||||
"""
|
||||
A class named Evaluator includes GPT-3.5/GPT-4 evaluation
|
||||
and automatic evaluation
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
params: Dict[str, Any],
|
||||
battle_prompt: Dict[str, Any],
|
||||
gpt_evaluation_prompt: Dict[str, Any],
|
||||
gpt_model: str,
|
||||
language: str,
|
||||
path_for_UniEval: Dict[str, str],
|
||||
gpt_with_reference: bool,
|
||||
) -> None:
|
||||
self.params = params
|
||||
self.battle_prompt = battle_prompt
|
||||
self.gpt_evaluation_prompt = gpt_evaluation_prompt
|
||||
self.gpt_model = gpt_model
|
||||
self.language = language
|
||||
self.path_for_UniEval = path_for_UniEval
|
||||
self.gpt_with_reference = gpt_with_reference
|
||||
self.automatic_metric_stats = dict()
|
||||
self.unieval_metric_stats = dict()
|
||||
self.gpt_evaluation_results = dict()
|
||||
self.battle_results = []
|
||||
|
||||
def battle(self, answers1: List[Dict], answers2: List[Dict]) -> None:
|
||||
"""
|
||||
Comparison between two models using GPT-4 as the reviewer.
|
||||
"""
|
||||
|
||||
self.battle_results = gpt_evaluate.battle(answers1, answers2, self.battle_prompt)
|
||||
|
||||
def evaluate(self, answers: List[Dict], targets: List[Dict]) -> None:
|
||||
"""
|
||||
A comprehensive evaluation of the answers from the model.
|
||||
The function evaluates the model's performance from different perspectives
|
||||
using GPT-3.5, GPT-4, and off-the-shelf evaluation metrics.
|
||||
|
||||
The metrics will be decided by the config file.
|
||||
|
||||
"""
|
||||
|
||||
def switch(metric, language):
|
||||
if metric == "BLEU":
|
||||
return metrics.bleu_score(preds=predicts_list, targets=targets_list, language=language)
|
||||
elif metric == "ROUGE":
|
||||
return metrics.rouge_score(preds=predicts_list, targets=targets_list, language=language)
|
||||
elif metric == "Distinct":
|
||||
return metrics.distinct_score(preds=predicts_list, language=language)
|
||||
elif metric == "BERTScore":
|
||||
return metrics.bert_score(preds=predicts_list, targets=targets_list, language=language)
|
||||
elif metric == "Precision":
|
||||
return metrics.precision(preds=predicts_list, targets=targets_list, language=language)
|
||||
elif metric == "Recall":
|
||||
return metrics.recall(preds=predicts_list, targets=targets_list, language=language)
|
||||
elif metric == "F1 score":
|
||||
return metrics.F1_score(preds=predicts_list, targets=targets_list, language=language)
|
||||
elif metric == "CHRF":
|
||||
return metrics.chrf_score(preds=predicts_list, targets=targets_list, language=language)
|
||||
else:
|
||||
raise ValueError(f"Unexpected metric {metric}!")
|
||||
|
||||
answers_per_category = get_data_per_category(answers, list(self.params.keys()))
|
||||
targets_per_category = get_data_per_category(targets, list(self.params.keys()))
|
||||
|
||||
# automatic evaluation
|
||||
for category in self.params:
|
||||
if len(answers_per_category[category]) == 0:
|
||||
print(f"Category {category} specified in your config doesn't have corresponding answers!")
|
||||
continue
|
||||
|
||||
if self.params[category].get("Metrics", None) is None:
|
||||
continue
|
||||
|
||||
category_metrics = self.params[category]["Metrics"]
|
||||
self.automatic_metric_stats[category] = {}
|
||||
|
||||
targets_list = [
|
||||
target["target"] if target["target"] else target["output"] for target in targets_per_category[category]
|
||||
]
|
||||
predicts_list = [answer["output"] for answer in answers_per_category[category]]
|
||||
|
||||
for metric in category_metrics:
|
||||
self.automatic_metric_stats[category].update(switch(metric=metric, language=self.language))
|
||||
|
||||
# UniEval evaluation
|
||||
# The keys of self.unieval_metric_stats are tasks instead of categories.
|
||||
# Iterating over tasks first avoids loading the same model repeatedly, because each task corresponds to one UniEval model.
|
||||
# If the keys were categories, the same model could be loaded multiple times across categories, since the user may require several tasks (models) to evaluate one category.
|
||||
for category in self.params:
|
||||
if len(answers_per_category[category]) == 0:
|
||||
print(f"Category {category} specified in your config doesn't have corresponding answers!")
|
||||
continue
|
||||
|
||||
if self.params[category].get("UniEval", None) is None:
|
||||
continue
|
||||
|
||||
if self.params[category]["UniEval"] and self.language == "cn":
|
||||
raise Exception(
|
||||
"UniEval doesn't support Chinese! Please remove UniEval config in your Chinese config file."
|
||||
)
|
||||
|
||||
category_metrics = self.params[category]["UniEval"]
|
||||
|
||||
for task, metric in [tuple(category_metric.split("-")) for category_metric in category_metrics]:
|
||||
if self.unieval_metric_stats.get(task, None) is None:
|
||||
self.unieval_metric_stats[task] = {category: {metric: 0}}
|
||||
elif self.unieval_metric_stats[task].get(category, None) is None:
|
||||
self.unieval_metric_stats[task][category] = {metric: 0}
|
||||
else:
|
||||
self.unieval_metric_stats[task][category][metric] = 0
|
||||
|
||||
for task in self.unieval_metric_stats:
|
||||
if self.path_for_UniEval is None:
|
||||
raise Exception("Please specify the path of the UniEval models in the config file!")
|
||||
|
||||
if self.path_for_UniEval.get(task, None) is None:
|
||||
raise Exception(f"Please specify the model path for task {task} in the config file!")
|
||||
|
||||
print(f"Load UniEval model for task {task}.")
|
||||
|
||||
uni_evaluator = unieval.get_evaluator(task, model_name_or_path=self.path_for_UniEval[task])
|
||||
for category in self.unieval_metric_stats[task]:
|
||||
targets_list = [
|
||||
target["target"] if target["target"] else target["output"]
|
||||
for target in targets_per_category[category]
|
||||
]
|
||||
predicts_list = [answer["output"] for answer in answers_per_category[category]]
|
||||
sources_list = [answer["instruction"] + answer["input"] for answer in answers_per_category[category]]
|
||||
|
||||
data = unieval.convert_data_to_unieval_format(predicts_list, sources_list, targets_list)
|
||||
scores = uni_evaluator.evaluate(
|
||||
data, category, dims=list(self.unieval_metric_stats[task][category].keys()), overall=False
|
||||
)
|
||||
avg_scores = unieval.calculate_average_score(scores)
|
||||
|
||||
self.unieval_metric_stats[task][category].update(avg_scores)
|
||||
|
||||
# gpt evaluation
|
||||
for category in self.params:
|
||||
if len(answers_per_category[category]) == 0:
|
||||
print(f"Category {category} specified in your config doesn't have corresponding answers!")
|
||||
continue
|
||||
|
||||
if self.params[category].get("GPT", None) is None:
|
||||
continue
|
||||
|
||||
category_metrics = self.params[category]["GPT"]
|
||||
|
||||
prompt = self.gpt_evaluation_prompt.get(category, None)
|
||||
if prompt is None:
|
||||
print(f"No prompt for category {category}! Using the prompt for category 'general' instead.")
|
||||
prompt = self.gpt_evaluation_prompt["general"]
|
||||
|
||||
self.gpt_evaluation_results[category] = gpt_evaluate.evaluate(
|
||||
answers_per_category[category],
|
||||
prompt,
|
||||
category_metrics,
|
||||
category,
|
||||
self.gpt_model,
|
||||
self.language,
|
||||
references=targets_per_category[category] if self.gpt_with_reference else None,
|
||||
)
|
||||
|
||||
def save(self, path: str, model_name_list: List[str]) -> None:
|
||||
"""
|
||||
Save evaluation results of GPT-3.5, GPT-4, and off-the-shelf evaluation metrics.
|
||||
|
||||
"""
|
||||
|
||||
if len(model_name_list) == 2:
|
||||
save_path = os.path.join(path, "gpt_evaluate", "battle_results")
|
||||
gpt_evaluate.save_battle_results(self.battle_results, model_name_list[0], model_name_list[1], save_path)
|
||||
else:
|
||||
if self.automatic_metric_stats:
|
||||
# Save evaluation results for automatic metrics
|
||||
automatic_base_save_path = os.path.join(path, "automatic_results")
|
||||
automatic_results_save_path = os.path.join(automatic_base_save_path, "evaluation_results")
|
||||
|
||||
save_automatic_results(model_name_list[0], self.automatic_metric_stats, automatic_results_save_path)
|
||||
|
||||
# Save charts and csv.
|
||||
automatic_analyses_save_path = os.path.join(automatic_base_save_path, "evaluation_analyses")
|
||||
analyze_automatic_results(automatic_results_save_path, automatic_analyses_save_path)
|
||||
|
||||
if self.unieval_metric_stats:
|
||||
# Save evaluation results for UniEval metrics
|
||||
unieval_base_save_path = os.path.join(path, "unieval_results")
|
||||
unieval_results_save_path = os.path.join(unieval_base_save_path, "evaluation_results")
|
||||
|
||||
unieval.save_unieval_results(model_name_list[0], self.unieval_metric_stats, unieval_results_save_path)
|
||||
|
||||
# Save charts and csv.
|
||||
unieval_analyses_save_path = os.path.join(unieval_base_save_path, "evaluation_analyses")
|
||||
unieval.analyze_unieval_results(unieval_results_save_path, unieval_analyses_save_path)
|
||||
|
||||
if self.gpt_evaluation_results:
|
||||
# Save evaluation results for GPT evaluation metrics.
|
||||
gpt_base_save_path = os.path.join(path, "gpt_evaluate", "gpt_evaluate_results")
|
||||
gpt_evaluation_results_save_path = os.path.join(gpt_base_save_path, "evaluation_results")
|
||||
|
||||
all_evaluations = gpt_evaluate.save_gpt_evaluation_results(
|
||||
model_name_list[0], self.gpt_evaluation_results, gpt_evaluation_results_save_path
|
||||
)
|
||||
|
||||
# Start to calculate scores and save statistics.
|
||||
gpt_evaluation_statistics_save_path = os.path.join(gpt_base_save_path, "evaluation_statistics")
|
||||
gpt_evaluate.save_gpt_evaluation_statistics(
|
||||
model_name_list[0], all_evaluations, gpt_evaluation_statistics_save_path
|
||||
)
|
||||
|
||||
# Save charts and csv.
|
||||
gpt_evaluation_analyses_save_path = os.path.join(gpt_base_save_path, "evaluation_analyses")
|
||||
gpt_evaluate.analyze_gpt_evaluation_statistics(
|
||||
gpt_evaluation_statistics_save_path, gpt_evaluation_analyses_save_path
|
||||
)
|
|
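# ----------------------------------------------------------------------------
# Editor's note: a minimal usage sketch, not part of the original module. It
# assumes a config that only enables automatic metrics, so no OpenAI key or
# UniEval checkpoint is required; the answers and targets below are illustrative.
# ----------------------------------------------------------------------------
if __name__ == "__main__":
    example_params = {"summarization": {"Metrics": ["BLEU", "ROUGE"]}}
    example_answers = [
        {
            "category": "summarization",
            "instruction": "Summarize the passage.",
            "input": "The cat sat quietly on the mat all afternoon.",
            "output": "The cat sat on the mat.",
        }
    ]
    example_targets = [{"category": "summarization", "target": "A cat sat on a mat.", "output": ""}]

    sketch_evaluator = Evaluator(
        params=example_params,
        battle_prompt={},
        gpt_evaluation_prompt={},
        gpt_model="gpt-3.5-turbo",
        language="en",
        path_for_UniEval=None,
        gpt_with_reference=False,
    )
    sketch_evaluator.evaluate(answers=example_answers, targets=example_targets)
    # Per-category automatic scores, e.g. {"summarization": {"bleu1": ..., "rouge1": ...}}
    print(sketch_evaluator.automatic_metric_stats)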
@@ -1,254 +0,0 @@
|
|||
import statistics
|
||||
from typing import Dict, List
|
||||
|
||||
import jieba
|
||||
from bert_score import score
|
||||
from nltk.translate.bleu_score import sentence_bleu
|
||||
from nltk.translate.chrf_score import sentence_chrf
|
||||
from rouge_chinese import Rouge as Rouge_cn
|
||||
from rouge_score import rouge_scorer as Rouge_en
|
||||
from sklearn.metrics import f1_score, precision_score, recall_score
|
||||
from utils import preprocessing_text, remove_redundant_space
|
||||
|
||||
|
||||
def bleu_score(preds: List[str], targets: List[str], language: str) -> Dict[str, float]:
|
||||
"""Calculate BLEU Score Metric
|
||||
|
||||
The calculation includes BLEU-1 for unigram, BLEU-2 for bigram,
|
||||
BLEU-3 for trigram and BLEU-4 for 4-gram. BLEU-1 evaluates
|
||||
word-level accuracy, while the higher-order n-grams evaluate
|
||||
sentence-level fluency.
|
||||
"""
|
||||
bleu_scores = {"bleu1": 0, "bleu2": 0, "bleu3": 0, "bleu4": 0}
|
||||
cumulative_bleu = [0] * 4
|
||||
weights = [
|
||||
(1.0 / 1.0, 0.0, 0.0, 0.0),
|
||||
(1.0 / 2.0, 1.0 / 2.0, 0.0, 0.0),
|
||||
(1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0, 0.0),
|
||||
(1.0 / 4.0, 1.0 / 4.0, 1.0 / 4.0, 1.0 / 4.0),
|
||||
]
|
||||
|
||||
for pred, target in zip(preds, targets):
|
||||
if language == "cn":
|
||||
pred_list = " ".join(jieba.cut(preprocessing_text(pred))).split()
|
||||
target_list = [(" ".join(jieba.cut(preprocessing_text(target)))).split()]
|
||||
elif language == "en":
|
||||
pred_list = preprocessing_text(pred).split()
|
||||
target_list = [preprocessing_text(target).split()]
|
||||
|
||||
bleu = sentence_bleu(target_list, pred_list, weights=weights)
|
||||
cumulative_bleu = [a + b for a, b in zip(cumulative_bleu, bleu)]
|
||||
|
||||
for i in range(len(cumulative_bleu)):
|
||||
bleu_scores[f"bleu{i+1}"] = cumulative_bleu[i] / len(preds)
|
||||
|
||||
return bleu_scores
|
||||
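# Editor's illustration, not part of the original module: a tiny worked example
# of how bleu_score is meant to be called; the sentences are made up.
def _bleu_score_example() -> Dict[str, float]:
    preds = ["the cat sat on the mat"]
    targets = ["the cat is sitting on the mat"]
    # Returns the cumulative BLEU-1..4 scores averaged over all prediction/target pairs.
    return bleu_score(preds=preds, targets=targets, language="en")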
|
||||
|
||||
def chrf_score(preds: List[str], targets: List[str], language: str) -> Dict[str, float]:
|
||||
"""Calculate CHRF Score Metric in sentence level."""
|
||||
chrf_score = {"chrf": 0}
|
||||
cumulative_chrf = []
|
||||
|
||||
for pred, target in zip(preds, targets):
|
||||
if language == "cn":
|
||||
pred_list = " ".join(jieba.cut(preprocessing_text(pred))).split()
|
||||
target_list = " ".join(jieba.cut(preprocessing_text(target))).split()
|
||||
elif language == "en":
|
||||
pred_list = preprocessing_text(pred).split()
|
||||
target_list = preprocessing_text(target).split()
|
||||
|
||||
cumulative_chrf.append(sentence_chrf(target_list, pred_list))
|
||||
|
||||
chrf_score["chrf"] = statistics.mean(cumulative_chrf)
|
||||
|
||||
return chrf_score
|
||||
|
||||
|
||||
def rouge_cn_score(preds: List[str], targets: List[str]) -> Dict[str, float]:
|
||||
"""Calculate Chinese ROUGE Score Metric
|
||||
|
||||
The calculation includes ROUGE-1 for unigram, ROUGE-2 for bigram
|
||||
and ROUGE-L. ROUGE-N evaluates the number of matching n-grams between
|
||||
the preds and targets. ROUGE-L measures the
|
||||
longest common subsequence (LCS) between preds and targets.
|
||||
"""
|
||||
rouge_scores = {"rouge1": 0, "rouge2": 0, "rougeL": 0}
|
||||
all_preds = []
|
||||
all_targets = []
|
||||
|
||||
for pred, target in zip(preds, targets):
|
||||
pred_list = remove_redundant_space(" ".join(jieba.cut(preprocessing_text(pred))))
|
||||
target_list = remove_redundant_space(" ".join(jieba.cut(preprocessing_text(target))))
|
||||
all_preds.append(pred_list)
|
||||
all_targets.append(target_list)
|
||||
|
||||
rouge_cn = Rouge_cn()
|
||||
rouge_avg = rouge_cn.get_scores(all_preds, all_targets, avg=True)
|
||||
|
||||
rouge_scores["rouge1"] = rouge_avg["rouge-1"]["f"]
|
||||
rouge_scores["rouge2"] = rouge_avg["rouge-2"]["f"]
|
||||
rouge_scores["rougeL"] = rouge_avg["rouge-l"]["f"]
|
||||
|
||||
return rouge_scores
|
||||
|
||||
|
||||
def rouge_en_score(preds: List[str], targets: List[str]) -> Dict[str, float]:
|
||||
"""Calculate English ROUGE Score Metric
|
||||
|
||||
The calculation includes ROUGE-1 for unigram, ROUGE-2 for bigram
|
||||
and ROUGE-L. ROUGE-N evaluates the number of matching n-grams between
|
||||
the preds and targets. ROUGE-L measures the
|
||||
longest common subsequence (LCS) between preds and targets.
|
||||
"""
|
||||
rouge_scores = {"rouge1": 0, "rouge2": 0, "rougeL": 0}
|
||||
|
||||
rouge_en = Rouge_en.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
|
||||
|
||||
for pred, target in zip(preds, targets):
|
||||
score = rouge_en.score(preprocessing_text(pred), preprocessing_text(target))
|
||||
rouge_scores["rouge1"] += score["rouge1"].fmeasure
|
||||
rouge_scores["rouge2"] += score["rouge2"].fmeasure
|
||||
rouge_scores["rougeL"] += score["rougeL"].fmeasure
|
||||
|
||||
rouge_scores["rouge1"] = rouge_scores["rouge1"] / len(preds)
|
||||
rouge_scores["rouge2"] = rouge_scores["rouge2"] / len(preds)
|
||||
rouge_scores["rougeL"] = rouge_scores["rougeL"] / len(preds)
|
||||
|
||||
return rouge_scores
|
||||
|
||||
|
||||
def rouge_score(preds: List[str], targets: List[str], language: str) -> Dict[str, float]:
|
||||
"""Calculate ROUGE Score Metric"""
|
||||
if language == "cn":
|
||||
return rouge_cn_score(preds, targets)
|
||||
elif language == "en":
|
||||
return rouge_en_score(preds, targets)
|
||||
|
||||
|
||||
def distinct_score(preds: List[str], language: str) -> Dict[str, float]:
|
||||
"""Calculate Distinct Score Metric
|
||||
|
||||
This metric refers to https://arxiv.org/abs/1510.03055.
|
||||
It evaluates the diversity of the generated text by counting
|
||||
the unique n-grams.
|
||||
"""
|
||||
distinct_score = {"distinct": 0}
|
||||
cumulative_distinct = []
|
||||
|
||||
for pred in preds:
|
||||
if language == "cn":
|
||||
pred_seg_list = " ".join(jieba.cut(pred)).split()
|
||||
count_segs = len(pred_seg_list)
|
||||
unique_segs = set(pred_seg_list)
|
||||
count_unique_chars = len(unique_segs)
|
||||
# prevent denominator from being 0
|
||||
cumulative_distinct.append(count_unique_chars / (count_segs + 1e-6))
|
||||
elif language == "en":
|
||||
# calculate distinct 1-gram, 2-gram, 3-gram
|
||||
unique_ngram = [set() for _ in range(0, 3)]
|
||||
all_ngram_count = [0 for _ in range(0, 3)]
|
||||
|
||||
split_pred = preprocessing_text(pred).split()
|
||||
for n in range(0, 3):
|
||||
for i in range(0, len(split_pred) - n):
|
||||
ngram = " ".join(split_pred[i : i + n + 1])
|
||||
unique_ngram[n].add(ngram)
|
||||
all_ngram_count[n] += 1
|
||||
|
||||
# Sometimes the answer may contain only one word. For 2-gram and 3-gram, the gram count(denominator) may be zero.
|
||||
avg_distinct = [len(a) / (b + 1e-6) for a, b in zip(unique_ngram, all_ngram_count)]
|
||||
|
||||
cumulative_distinct.append(statistics.mean(avg_distinct))
|
||||
|
||||
distinct_score["distinct"] = statistics.mean(cumulative_distinct)
|
||||
|
||||
return distinct_score
|
||||
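# Editor's illustration, not part of the original module: for the English prediction
# "the cat the cat", the per-sample score is the mean of 2/4 unique unigrams,
# 2/3 unique bigrams and 2/2 unique trigrams (about 0.72).
def _distinct_score_example() -> Dict[str, float]:
    return distinct_score(preds=["the cat the cat"], language="en")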
|
||||
|
||||
def bert_score(preds: List[str], targets: List[str], language: str) -> Dict[str, float]:
|
||||
"""Calculate BERTScore Metric
|
||||
|
||||
The BERTScore evaluates the semantic similarity between
|
||||
tokens of preds and targets with BERT.
|
||||
"""
|
||||
bert_score = {"bert_score": 0}
|
||||
pred_list = []
|
||||
target_list = []
|
||||
|
||||
for pred, target in zip(preds, targets):
|
||||
pred_list.append(pred)
|
||||
target_list.append(target)
|
||||
|
||||
if language == "cn":
|
||||
_, _, F = score(pred_list, target_list, lang="zh", verbose=True)
|
||||
elif language == "en":
|
||||
_, _, F = score(pred_list, target_list, lang="en", verbose=True)
|
||||
|
||||
bert_score["bert_score"] = F.mean().item()
|
||||
|
||||
return bert_score
|
||||
|
||||
|
||||
def calculate_precision_recall_f1(preds: List[str], targets: List[str], language: str) -> Dict[str, float]:
|
||||
"""Precision, Recall and F1-Score Calculation
|
||||
|
||||
The calculation of precision, recall and f1-score is realized by counting
|
||||
the number of overlaps between the preds and targets. The comparison length
|
||||
is limited by the shorter one of the preds and targets.
|
||||
"""
|
||||
precision_recall_f1 = {"precision": 0, "recall": 0, "f1_score": 0}
|
||||
precision_scores = []
|
||||
recall_scores = []
|
||||
f1_scores = []
|
||||
|
||||
for pred, target in zip(preds, targets):
|
||||
if language == "cn":
|
||||
pred_list = [char for char in " ".join(jieba.cut(preprocessing_text(pred))).split()]
|
||||
target_list = [char for char in " ".join(jieba.cut(preprocessing_text(target))).split()]
|
||||
elif language == "en":
|
||||
pred_list = [char for char in preprocessing_text(pred).split()]
|
||||
target_list = [char for char in preprocessing_text(target).split()]
|
||||
|
||||
target_labels = [1] * min(len(target_list), len(pred_list))
|
||||
pred_labels = [int(pred_list[i] == target_list[i]) for i in range(0, min(len(target_list), len(pred_list)))]
|
||||
|
||||
precision_scores.append(precision_score(target_labels, pred_labels, zero_division=0))
|
||||
recall_scores.append(recall_score(target_labels, pred_labels, zero_division=0))
|
||||
f1_scores.append(f1_score(target_labels, pred_labels, zero_division=0))
|
||||
|
||||
precision_recall_f1["precision"] = statistics.mean(precision_scores)
|
||||
precision_recall_f1["recall"] = statistics.mean(recall_scores)
|
||||
precision_recall_f1["f1_score"] = statistics.mean(f1_scores)
|
||||
|
||||
return precision_recall_f1
|
||||
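# Editor's illustration, not part of the original module: with pred "a b c" and
# target "a x c", positions 1 and 3 match, so the per-pair labels are [1, 0, 1]
# against [1, 1, 1], giving precision 1.0, recall 2/3 and F1 0.8.
def _precision_recall_f1_example() -> Dict[str, float]:
    return calculate_precision_recall_f1(preds=["a b c"], targets=["a x c"], language="en")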
|
||||
|
||||
def precision(preds: List[str], targets: List[str], language: str) -> Dict[str, float]:
|
||||
"""Calculate Precision Metric
|
||||
|
||||
Calculating precision by counting the number of overlaps between the preds and target.
|
||||
"""
|
||||
precision = {"precision": 0}
|
||||
precision["precision"] = calculate_precision_recall_f1(preds, targets, language)["precision"]
|
||||
return precision
|
||||
|
||||
|
||||
def recall(preds: List[str], targets: List[str], language: str) -> Dict[str, float]:
|
||||
"""Calculate Recall Metric
|
||||
|
||||
Calculating recall by counting the number of overlaps between the preds and target.
|
||||
"""
|
||||
recall = {"recall": 0}
|
||||
recall["recall"] = calculate_precision_recall_f1(preds, targets, language)["recall"]
|
||||
return recall
|
||||
|
||||
|
||||
def F1_score(preds: List[str], targets: List[str], language: str) -> Dict[str, float]:
|
||||
"""Calculate F1-score Metric
|
||||
|
||||
Calculating f1-score by counting the number of overlaps between the preds and target.
|
||||
"""
|
||||
f1 = {"f1_score": 0}
|
||||
f1["f1_score"] = calculate_precision_recall_f1(preds, targets, language)["f1_score"]
|
||||
return f1
|
|
@@ -1,12 +0,0 @@
|
|||
jieba
|
||||
bert-score
|
||||
rouge_chinese
|
||||
scikit-learn
|
||||
nltk
|
||||
openai
|
||||
seaborn
|
||||
pandas
|
||||
matplotlib
|
||||
numpy
|
||||
zhon
|
||||
rouge_score
|
|
@@ -1,15 +0,0 @@
|
|||
from .evaluator import get_evaluator
|
||||
from .utils import (
|
||||
analyze_unieval_results,
|
||||
calculate_average_score,
|
||||
convert_data_to_unieval_format,
|
||||
save_unieval_results,
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
"get_evaluator",
|
||||
"convert_data_to_unieval_format",
|
||||
"calculate_average_score",
|
||||
"save_unieval_results",
|
||||
"analyze_unieval_results",
|
||||
]
|
|
@@ -1,329 +0,0 @@
|
|||
# MIT License
|
||||
|
||||
# Copyright (c) 2022 Ming Zhong
|
||||
|
||||
# Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
# of this software and associated documentation files (the "Software"), to deal
|
||||
# in the Software without restriction, including without limitation the rights
|
||||
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
# copies of the Software, and to permit persons to whom the Software is
|
||||
# furnished to do so, subject to the following conditions:
|
||||
|
||||
# The above copyright notice and this permission notice shall be included in all
|
||||
# copies or substantial portions of the Software.
|
||||
|
||||
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
# SOFTWARE.
|
||||
|
||||
import numpy as np
|
||||
from nltk import sent_tokenize
|
||||
|
||||
from .scorer import UniEvaluator
|
||||
from .utils import add_question
|
||||
|
||||
|
||||
class SumEvaluator:
|
||||
def __init__(self, model_name_or_path, max_length=1024, device="cuda:0", cache_dir=None):
|
||||
"""Set up evaluator for text summarization"""
|
||||
self.scorer = UniEvaluator(
|
||||
model_name_or_path="MingZhong/unieval-sum" if model_name_or_path == "" else model_name_or_path,
|
||||
max_length=max_length,
|
||||
device=device,
|
||||
cache_dir=cache_dir,
|
||||
)
|
||||
self.task = "summarization"
|
||||
self.dimensions = ["coherence", "consistency", "fluency", "relevance"]
|
||||
|
||||
def evaluate(self, data, category, dims=None, overall=True):
|
||||
"""
|
||||
Get the scores of all the given dimensions
|
||||
|
||||
category: The category to be evaluated.
|
||||
|
||||
dims: A list of dimensions to be evaluated. If dims is None, SumEvaluator will evaluate
|
||||
four dimensions: coherence, consistency, fluency, relevance.
|
||||
|
||||
overall: indicates whether the overall score is to be calculated.
|
||||
Overall score can be customized to a combination of scores based on different
|
||||
dimensions. The default here is the average score of all the given dimensions.
|
||||
"""
|
||||
n_data = len(data)
|
||||
eval_scores = [{} for _ in range(n_data)]
|
||||
|
||||
if dims is None:
|
||||
eval_dims = self.dimensions
|
||||
else:
|
||||
assert isinstance(dims, list)
|
||||
eval_dims = dims
|
||||
|
||||
for dim in eval_dims:
|
||||
# Calculate average sentence-level scores for 'consistency' and 'fluency'
|
||||
if dim == "consistency" or dim == "fluency":
|
||||
src_list, output_list = [], []
|
||||
n_sents = [] # the number of sentences in each generated summary
|
||||
for i in range(n_data):
|
||||
source = data[i]["source"]
|
||||
system_outputs = sent_tokenize(data[i]["system_output"])
|
||||
n_sents.append(len(system_outputs))
|
||||
for j in range(len(system_outputs)):
|
||||
src_list.append(source)
|
||||
output_list.append(system_outputs[j])
|
||||
input_list = add_question(dimension=dim, output=output_list, src=src_list, task=self.task)
|
||||
sent_score = self.scorer.score(input_list, self.task, category, dim)
|
||||
|
||||
# Get average score for each sample
|
||||
start_idx = 0
|
||||
score = []
|
||||
for cur_n_sent in n_sents:
|
||||
# prevent denominator from being 0
|
||||
score.append(sum(sent_score[start_idx : start_idx + cur_n_sent]) / (cur_n_sent + 1e-6))
|
||||
start_idx += cur_n_sent
|
||||
|
||||
# Calculate summary-level score for 'coherence' and 'relevance'
|
||||
elif dim == "coherence" or dim == "relevance":
|
||||
src_list, output_list, ref_list = [], [], []
|
||||
for i in range(n_data):
|
||||
src_list.append(data[i]["source"])
|
||||
output_list.append(data[i]["system_output"])
|
||||
if dim == "relevance":
|
||||
ref_list.append(data[i]["reference"])
|
||||
input_list = add_question(dimension=dim, output=output_list, src=src_list, ref=ref_list, task=self.task)
|
||||
score = self.scorer.score(input_list, self.task, category, dim)
|
||||
|
||||
# Please customize other dimensions here for summarization
|
||||
else:
|
||||
raise NotImplementedError(
|
||||
"The input format for this dimension is still undefined. \
|
||||
Please customize it first."
|
||||
)
|
||||
|
||||
for i in range(n_data):
|
||||
eval_scores[i][dim] = score[i]
|
||||
|
||||
# Customize your overall score here.
|
||||
if overall:
|
||||
for i in range(n_data):
|
||||
eval_scores[i]["overall"] = np.mean(list(eval_scores[i].values()))
|
||||
|
||||
return eval_scores
|
||||
|
||||
|
||||
class DialogEvaluator:
|
||||
def __init__(self, model_name_or_path, max_length=1024, device="cuda:0", cache_dir=None):
|
||||
"""Set up evaluator for dialogues"""
|
||||
self.scorer = UniEvaluator(
|
||||
model_name_or_path="MingZhong/unieval-dialog" if model_name_or_path == "" else model_name_or_path,
|
||||
max_length=max_length,
|
||||
device=device,
|
||||
cache_dir=cache_dir,
|
||||
)
|
||||
self.task = "dialogue"
|
||||
self.dimensions = ["naturalness", "coherence", "engagingness", "groundedness", "understandability"]
|
||||
|
||||
def evaluate(self, data, category, dims=None, overall=True):
|
||||
"""
|
||||
Get the scores of all the given dimensions
|
||||
|
||||
category: The category to be evaluated.
|
||||
|
||||
dims: A list of dimensions to be evaluated. If dims is None, DialogEvaluator will evaluate
|
||||
five dimensions: naturalness, coherence, engagingness, groundedness and understandability.
|
||||
|
||||
overall: indicates whether the overall score is to be calculated.
|
||||
Overall score can be customized to a combination of scores based on different
|
||||
dimensions. The default here is the average score of all the given dimensions.
|
||||
"""
|
||||
n_data = len(data)
|
||||
eval_scores = [{} for _ in range(n_data)]
|
||||
|
||||
if dims is None:
|
||||
eval_dims = self.dimensions
|
||||
else:
|
||||
assert isinstance(dims, list)
|
||||
eval_dims = dims
|
||||
|
||||
for dim in eval_dims:
|
||||
# Calculate summation score for 'engagingness'
|
||||
if dim == "engagingness":
|
||||
src_list, output_list, context_list = [], [], []
|
||||
n_sents = [] # the number of sentences in each generated response
|
||||
for i in range(n_data):
|
||||
source = data[i]["source"]
|
||||
context = data[i]["context"]
|
||||
system_outputs = sent_tokenize(data[i]["system_output"])
|
||||
n_sents.append(len(system_outputs))
|
||||
for j in range(len(system_outputs)):
|
||||
src_list.append(source)
|
||||
context_list.append(context)
|
||||
output_list.append(system_outputs[j])
|
||||
input_list = add_question(
|
||||
dimension=dim, output=output_list, src=src_list, context=context_list, task=self.task
|
||||
)
|
||||
sent_score = self.scorer.score(input_list, self.task, category, dim)
|
||||
|
||||
# Get the summation score for each sample
|
||||
start_idx = 0
|
||||
score = []
|
||||
for cur_n_sent in n_sents:
|
||||
score.append(sum(sent_score[start_idx : start_idx + cur_n_sent]))
|
||||
start_idx += cur_n_sent
|
||||
|
||||
# Calculate turn-level score for other dimensions
|
||||
elif dim in ["naturalness", "coherence", "groundedness", "understandability"]:
|
||||
src_list, output_list, context_list = [], [], []
|
||||
for i in range(n_data):
|
||||
src_list.append(data[i]["source"])
|
||||
output_list.append(data[i]["system_output"])
|
||||
context_list.append(data[i]["context"])
|
||||
input_list = add_question(
|
||||
dimension=dim, output=output_list, src=src_list, context=context_list, task=self.task
|
||||
)
|
||||
score = self.scorer.score(input_list, self.task, category, dim)
|
||||
|
||||
# Please customize other dimensions here for dialogues
|
||||
else:
|
||||
raise NotImplementedError(
|
||||
"The input format for this dimension is still undefined. \
|
||||
Please customize it first."
|
||||
)
|
||||
|
||||
for i in range(n_data):
|
||||
eval_scores[i][dim] = score[i]
|
||||
|
||||
# Customize your overall score here.
|
||||
if overall:
|
||||
for i in range(n_data):
|
||||
eval_scores[i]["overall"] = np.mean(list(eval_scores[i].values()))
|
||||
|
||||
return eval_scores
|
||||
|
||||
|
||||
class D2tEvaluator:
|
||||
def __init__(self, model_name_or_path, max_length=1024, device="cuda:0", cache_dir=None):
|
||||
"""Set up evaluator for data-to-text"""
|
||||
self.scorer = UniEvaluator(
|
||||
model_name_or_path="MingZhong/unieval-sum" if model_name_or_path == "" else model_name_or_path,
|
||||
max_length=max_length,
|
||||
device=device,
|
||||
cache_dir=cache_dir,
|
||||
)
|
||||
self.task = "data2text"
|
||||
self.dimensions = ["naturalness", "informativeness"]
|
||||
|
||||
def evaluate(self, data, category, dims=None, overall=True):
|
||||
"""
|
||||
Get the scores of all the given dimensions
|
||||
|
||||
category: The category to be evaluated.
|
||||
|
||||
dims: A list of dimensions to be evaluated. If dims is None, D2tEvaluator will evaluate
|
||||
two dimensions: naturalness and informativeness.
|
||||
|
||||
overall: indicates whether the overall score is to be calculated.
|
||||
Overall score can be customized to a combination of scores based on different
|
||||
dimensions. The default here is the average score of all the given dimensions.
|
||||
"""
|
||||
n_data = len(data)
|
||||
eval_scores = [{} for _ in range(n_data)]
|
||||
|
||||
if dims is None:
|
||||
eval_dims = self.dimensions
|
||||
else:
|
||||
assert isinstance(dims, list)
|
||||
eval_dims = dims
|
||||
|
||||
for dim in eval_dims:
|
||||
output_list, ref_list = [], []
|
||||
for i in range(n_data):
|
||||
output_list.append(data[i]["system_output"])
|
||||
ref_list.append(data[i]["reference"])
|
||||
|
||||
input_list = add_question(dimension=dim, output=output_list, ref=ref_list, task=self.task)
|
||||
score = self.scorer.score(input_list, self.task, category, dim)
|
||||
|
||||
for i in range(n_data):
|
||||
eval_scores[i][dim] = score[i]
|
||||
|
||||
# Customize your overall score here.
|
||||
if overall:
|
||||
for i in range(n_data):
|
||||
eval_scores[i]["overall"] = np.mean(list(eval_scores[i].values()))
|
||||
|
||||
return eval_scores
|
||||
|
||||
|
||||
class FactEvaluator:
|
||||
def __init__(self, model_name_or_path, max_length=1024, device="cuda:0", cache_dir=None):
|
||||
"""Set up evaluator for factual consistency detection"""
|
||||
self.scorer = UniEvaluator(
|
||||
model_name_or_path="MingZhong/unieval-fact" if model_name_or_path == "" else model_name_or_path,
|
||||
max_length=max_length,
|
||||
device=device,
|
||||
cache_dir=cache_dir,
|
||||
)
|
||||
self.task = "fact"
|
||||
self.dim = "consistency"
|
||||
|
||||
def evaluate(self, data, category):
|
||||
"""
|
||||
Get the factual consistency score (only 1 dimension for this task)
|
||||
|
||||
category: The category to be evaluated.
|
||||
"""
|
||||
n_data = len(data)
|
||||
eval_scores = [{} for _ in range(n_data)]
|
||||
|
||||
# Calculate average sentence-level scores for factual consistency
|
||||
src_list, output_list = [], []
|
||||
n_sents = [] # the number of sentences in the claim
|
||||
for i in range(n_data):
|
||||
source = data[i]["source"]
|
||||
system_outputs = sent_tokenize(data[i]["system_output"])
|
||||
n_sents.append(len(system_outputs))
|
||||
for j in range(len(system_outputs)):
|
||||
src_list.append(source)
|
||||
output_list.append(system_outputs[j])
|
||||
input_list = add_question(dimension=self.dim, output=output_list, src=src_list, task=self.task)
|
||||
sent_score = self.scorer.score(input_list, self.task, category, self.dim)
|
||||
|
||||
# Get average score for each sample
|
||||
start_idx = 0
|
||||
score = []
|
||||
for cur_n_sent in n_sents:
|
||||
score.append(sum(sent_score[start_idx : start_idx + cur_n_sent]) / (cur_n_sent + 1e-6))  # prevent denominator from being 0
|
||||
start_idx += cur_n_sent
|
||||
|
||||
for i in range(n_data):
|
||||
eval_scores[i][self.dim] = score[i]
|
||||
|
||||
return eval_scores
|
||||
|
||||
|
||||
def get_evaluator(task, model_name_or_path="", max_length=1024, device="cuda:0", cache_dir=None):
|
||||
assert task in ["summarization", "dialogue", "data2text", "fact"]
|
||||
if task == "summarization":
|
||||
return SumEvaluator(
|
||||
model_name_or_path=model_name_or_path, max_length=max_length, device=device, cache_dir=cache_dir
|
||||
)
|
||||
elif task == "dialogue":
|
||||
return DialogEvaluator(
|
||||
model_name_or_path=model_name_or_path, max_length=max_length, device=device, cache_dir=cache_dir
|
||||
)
|
||||
elif task == "data2text":
|
||||
return D2tEvaluator(
|
||||
model_name_or_path=model_name_or_path, max_length=max_length, device=device, cache_dir=cache_dir
|
||||
)
|
||||
elif task == "fact":
|
||||
return FactEvaluator(
|
||||
model_name_or_path=model_name_or_path, max_length=max_length, device=device, cache_dir=cache_dir
|
||||
)
|
||||
else:
|
||||
raise NotImplementedError(
|
||||
"Other tasks are not implemented, \
|
||||
please customize specific tasks here."
|
||||
)
|
|
@@ -1,96 +0,0 @@
|
|||
# MIT License
|
||||
|
||||
# Copyright (c) 2022 Ming Zhong
|
||||
|
||||
# Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
# of this software and associated documentation files (the "Software"), to deal
|
||||
# in the Software without restriction, including without limitation the rights
|
||||
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
# copies of the Software, and to permit persons to whom the Software is
|
||||
# furnished to do so, subject to the following conditions:
|
||||
|
||||
# The above copyright notice and this permission notice shall be included in all
|
||||
# copies or substantial portions of the Software.
|
||||
|
||||
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
# SOFTWARE.
|
||||
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
from tqdm import tqdm
|
||||
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer
|
||||
|
||||
|
||||
class UniEvaluator:
|
||||
def __init__(self, model_name_or_path, max_length=1024, device="cuda:0", cache_dir=None):
|
||||
"""Set up model"""
|
||||
self.device = device
|
||||
self.max_length = max_length
|
||||
|
||||
self.config = AutoConfig.from_pretrained(model_name_or_path, cache_dir=cache_dir)
|
||||
self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, cache_dir=cache_dir)
|
||||
self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path, config=self.config, cache_dir=cache_dir)
|
||||
|
||||
self.model.eval()
|
||||
self.model.to(device)
|
||||
|
||||
self.softmax = nn.Softmax(dim=1)
|
||||
|
||||
self.pos_id = self.tokenizer("Yes")["input_ids"][0]
|
||||
self.neg_id = self.tokenizer("No")["input_ids"][0]
|
||||
|
||||
def score(self, inputs, task, category, dim, batch_size=8):
|
||||
"""
|
||||
Get scores for the given samples.
|
||||
final_score = positive_score / (positive_score + negative_score)
|
||||
"""
|
||||
|
||||
# The implementation of "forward" in T5 still requires decoder_input_ids.
|
||||
# Therefore, we construct a random one-word target sequence.
|
||||
# The content of the target has no effect on the final scores.
|
||||
tgts = ["No" for _ in range(len(inputs))]
|
||||
|
||||
pos_score_list, neg_score_list = [], []
|
||||
for i in tqdm(range(0, len(inputs), batch_size), desc=f"{category}-({dim}-{task}): "):
|
||||
src_list = inputs[i : i + batch_size]
|
||||
tgt_list = tgts[i : i + batch_size]
|
||||
try:
|
||||
with torch.no_grad():
|
||||
encoded_src = self.tokenizer(
|
||||
src_list, max_length=self.max_length, truncation=True, padding=True, return_tensors="pt"
|
||||
)
|
||||
encoded_tgt = self.tokenizer(
|
||||
tgt_list, max_length=self.max_length, truncation=True, padding=True, return_tensors="pt"
|
||||
)
|
||||
|
||||
src_tokens = encoded_src["input_ids"].to(self.device)
|
||||
src_mask = encoded_src["attention_mask"].to(self.device)
|
||||
|
||||
tgt_tokens = encoded_tgt["input_ids"].to(self.device)[:, 0].unsqueeze(-1)
|
||||
|
||||
output = self.model(input_ids=src_tokens, attention_mask=src_mask, labels=tgt_tokens)
|
||||
logits = output.logits.view(-1, self.model.config.vocab_size)
|
||||
|
||||
pos_score = self.softmax(logits)[:, self.pos_id] # Yes
|
||||
neg_score = self.softmax(logits)[:, self.neg_id] # No
|
||||
|
||||
cur_pos_score = [x.item() for x in pos_score]
|
||||
cur_neg_score = [x.item() for x in neg_score]
|
||||
pos_score_list += cur_pos_score
|
||||
neg_score_list += cur_neg_score
|
||||
|
||||
except RuntimeError:
|
||||
print(f"source: {src_list}")
|
||||
print(f"target: {tgt_list}")
|
||||
exit(1)
|
||||
|
||||
score_list = []
|
||||
for i in range(len(pos_score_list)):
|
||||
score_list.append(pos_score_list[i] / (pos_score_list[i] + neg_score_list[i]))
|
||||
|
||||
return score_list
|
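# Editor's illustration, not part of the original module: the normalization used in
# UniEvaluator.score, shown with made-up softmax masses for the "Yes"/"No" tokens.
def _score_normalization_example() -> float:
    pos_score, neg_score = 0.8, 0.1  # assumed probabilities of "Yes" and "No"
    return pos_score / (pos_score + neg_score)  # ~0.889, the per-sample UniEval score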
|
@@ -1,285 +0,0 @@
|
|||
# MIT License
|
||||
|
||||
# Copyright (c) 2022 Ming Zhong
|
||||
|
||||
# Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
# of this software and associated documentation files (the "Software"), to deal
|
||||
# in the Software without restriction, including without limitation the rights
|
||||
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
# copies of the Software, and to permit persons to whom the Software is
|
||||
# furnished to do so, subject to the following conditions:
|
||||
|
||||
# The above copyright notice and this permission notice shall be included in all
|
||||
# copies or substantial portions of the Software.
|
||||
|
||||
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
# SOFTWARE.
|
||||
|
||||
import os
|
||||
from typing import Dict
|
||||
|
||||
import matplotlib.pyplot as plt
|
||||
import pandas as pd
|
||||
import seaborn as sns
|
||||
import tqdm
|
||||
|
||||
|
||||
def add_question(dimension, output, src=None, ref=None, context=None, task=None):
|
||||
"""
|
||||
Add questions to generate input in Bool-QA format for UniEval.
|
||||
|
||||
dimension: specific dimension to be evaluated
|
||||
src: source input for different NLG tasks. For example, source document for summarization
|
||||
and dialogue history for dialogue response generation.
|
||||
output: output text generated by the models
|
||||
ref: human-annotated groundtruth
|
||||
context: the context needed to evaluate certain dimensions. For example,
|
||||
additional factual information when evaluating engagingness and groundedness in dialogues.
|
||||
"""
|
||||
|
||||
input_with_question = []
|
||||
for i in range(len(output)):
|
||||
# For summarization
|
||||
if task == "summarization":
|
||||
if dimension == "fluency":
|
||||
cur_input = "question: Is this a fluent paragraph? </s> paragraph: " + output[i]
|
||||
elif dimension == "coherence":
|
||||
cur_input = (
|
||||
"question: Is this a coherent summary to the document? </s> summary: "
|
||||
+ output[i]
|
||||
+ " </s> document: "
|
||||
+ src[i]
|
||||
)
|
||||
elif dimension == "consistency":
|
||||
cur_input = (
|
||||
"question: Is this claim consistent with the document? </s> claim: "
|
||||
+ output[i]
|
||||
+ " </s> document: "
|
||||
+ src[i]
|
||||
)
|
||||
elif dimension == "relevance":
|
||||
cur_input = (
|
||||
"question: Is this summary relevant to the reference? </s> summary: "
|
||||
+ output[i]
|
||||
+ " </s> reference: "
|
||||
+ ref[i]
|
||||
)
|
||||
else:
|
||||
raise NotImplementedError(
|
||||
"The input format for this dimension is still undefined. Please customize it first."
|
||||
)
|
||||
# For dialogues
|
||||
elif task == "dialogue":
|
||||
if dimension == "naturalness":
|
||||
cur_input = "question: Is this a natural response in the dialogue? </s> response: " + output[i]
|
||||
elif dimension == "coherence":
|
||||
cur_input = (
|
||||
"question: Is this a coherent response given the dialogue history? </s> response: "
|
||||
+ output[i]
|
||||
+ " </s> dialogue history: "
|
||||
+ src[i]
|
||||
)
|
||||
elif dimension == "engagingness":
|
||||
cur_input = (
|
||||
"question: Is this an engaging and informative response according to the dialogue history and fact? </s> response: "
|
||||
+ output[i]
|
||||
+ " </s> dialogue history: "
|
||||
+ src[i]
|
||||
+ " </s> fact: "
|
||||
+ context[i]
|
||||
)
|
||||
elif dimension == "groundedness":
|
||||
cur_input = (
|
||||
"question: Is this response consistent with knowledge in the fact? </s> response: "
|
||||
+ output[i]
|
||||
+ " </s> fact: "
|
||||
+ context[i]
|
||||
)
|
||||
elif dimension == "understandability":
|
||||
cur_input = "question: Is this an understandable response in the dialogue? </s> response: " + output[i]
|
||||
else:
|
||||
raise NotImplementedError(
|
||||
"The input format for this dimension is still undefined. Please customize it first."
|
||||
)
|
||||
# For data-to-text
|
||||
elif task == "data2text":
|
||||
if dimension == "naturalness":
|
||||
cur_input = "question: Is this a fluent utterance? </s> utterance: " + output[i]
|
||||
elif dimension == "informativeness":
|
||||
cur_input = (
|
||||
"question: Is this sentence informative according to the reference? </s> sentence: "
|
||||
+ output[i]
|
||||
+ " </s> reference: "
|
||||
+ ref[i]
|
||||
)
|
||||
else:
|
||||
raise NotImplementedError(
|
||||
"The input format for this dimension is still undefined. Please customize it first."
|
||||
)
|
||||
# For factual consistency detection
|
||||
elif task == "fact":
|
||||
if dimension == "consistency":
|
||||
cur_input = (
|
||||
"question: Is this claim consistent with the document? </s> claim: "
|
||||
+ output[i]
|
||||
+ " </s> document: "
|
||||
+ src[i]
|
||||
)
|
||||
else:
|
||||
raise NotImplementedError("No other dimensions for the factual consistency detection task.")
|
||||
# For new customized tasks
|
||||
else:
|
||||
raise NotImplementedError("Other tasks are not implemented, please customize specific tasks here.")
|
||||
input_with_question.append(cur_input)
|
||||
return input_with_question
|
||||
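# Editor's illustration, not part of the original module: the Bool-QA style input
# built for the "fluency" dimension of summarization; the summary text is made up.
def _add_question_example() -> list:
    # Produces ["question: Is this a fluent paragraph? </s> paragraph: The report says sales grew in 2022."]
    return add_question(
        dimension="fluency",
        output=["The report says sales grew in 2022."],
        task="summarization",
    )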
|
||||
|
||||
def convert_data_to_unieval_format(output_list, src_list=None, ref_list=None):
|
||||
"""
|
||||
Convert the data into UniEval's format.
|
||||
|
||||
output_list: a list of model output
|
||||
|
||||
src_list: source input for different NLG tasks. For example, source document for summarization
|
||||
and dialogue history for dialogue response generation
|
||||
ref_list: human-annotated groundtruth
|
||||
"""
|
||||
json_data = []
|
||||
for i in range(len(output_list)):
|
||||
cur = {}
|
||||
cur["system_output"] = output_list[i]
|
||||
if src_list is not None:
|
||||
cur["source"] = src_list[i]
|
||||
if ref_list is not None:
|
||||
cur["reference"] = ref_list[i]
|
||||
cur["context"] = ""
|
||||
json_data.append(cur)
|
||||
return json_data
|
||||
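# Editor's illustration, not part of the original module: converting one model answer,
# its source and its reference into the format expected by the UniEval evaluators.
def _convert_data_example() -> list:
    # Produces [{"system_output": "...", "source": "...", "reference": "...", "context": ""}]
    return convert_data_to_unieval_format(
        output_list=["The cat sat on the mat."],
        src_list=["Summarize the passage about the cat."],
        ref_list=["A cat sat on a mat."],
    )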
|
||||
|
||||
def calculate_average_score(scores):
|
||||
"""
|
||||
Calculate average scores for different metrics
|
||||
|
||||
scores: a list of scores for different metrics for each answer
|
||||
|
||||
"""
|
||||
metrics = {metric: 0 for metric in scores[0]}
|
||||
|
||||
for score in scores:
|
||||
for metric in score:
|
||||
metrics[metric] += score[metric]
|
||||
|
||||
for metric in metrics:
|
||||
metrics[metric] /= len(scores)
|
||||
|
||||
return metrics
|
||||
|
||||
|
||||
def save_unieval_results(model_name: str, unieval_metric_stats: Dict[str, Dict], save_path: str) -> None:
|
||||
"""
|
||||
Save UniEval evaluation results of different categories for one model.
|
||||
|
||||
"""
|
||||
|
||||
if not os.path.exists(save_path):
|
||||
os.makedirs(save_path)
|
||||
|
||||
unieval_metric_stats_per_category = {}
|
||||
for task, category_stat in unieval_metric_stats.items():
|
||||
for category, metric_stat in category_stat.items():
|
||||
if unieval_metric_stats_per_category.get(category, None) is None:
|
||||
unieval_metric_stats_per_category[category] = {}
|
||||
for metric, score in metric_stat.items():
|
||||
unieval_metric_stats_per_category[category][f"{metric}-{task}"] = score
|
||||
|
||||
automatic_df = pd.DataFrame(unieval_metric_stats_per_category)
|
||||
automatic_df.to_csv(os.path.join(save_path, f"{model_name}_results.csv"), index=True)
|
||||
|
||||
|
||||
def read_unieval_results(results_path: str, file_name: str) -> Dict[str, Dict]:
|
||||
"""
|
||||
Read a csv file and return a dictionary which stores scores per metric.
|
||||
|
||||
"""
|
||||
|
||||
results = pd.read_csv(os.path.join(results_path, file_name), index_col=0)
|
||||
|
||||
results_dict = {metric: {} for metric in list(results.index)}
|
||||
for i, metric in enumerate(results_dict.keys()):
|
||||
for j, category in enumerate(list(results.columns)):
|
||||
if pd.isnull(results.iloc[i][j]):
|
||||
continue
|
||||
results_dict[metric][category] = results.iloc[i][j]
|
||||
|
||||
return results_dict
|
||||
|
||||
|
||||
def analyze_unieval_results(results_path: str, save_path: str) -> None:
|
||||
"""
|
||||
Analyze and visualize all csv files in the given folder.
|
||||
|
||||
"""
|
||||
|
||||
if not os.path.exists(results_path):
|
||||
raise Exception(f'The given directory "{results_path}" doesn\'t exist! No results found!')
|
||||
|
||||
all_statistics = {}
|
||||
|
||||
for file_name in os.listdir(results_path):
|
||||
if file_name.endswith("_results.csv"):
|
||||
model_name = file_name.split("_results.csv")[0]
|
||||
all_statistics[model_name] = read_unieval_results(results_path, file_name)
|
||||
|
||||
if len(list(all_statistics.keys())) == 0:
|
||||
raise Exception(f'There are no csv files in the given directory "{results_path}"!')
|
||||
|
||||
frame_all = {"model": [], "category": [], "metric": [], "score": []}
|
||||
frame_per_metric = {}
|
||||
for model_name, model_statistics in all_statistics.items():
|
||||
for metric, metric_statistics in model_statistics.items():
|
||||
if frame_per_metric.get(metric) is None:
|
||||
frame_per_metric[metric] = {"model": [], "category": [], "score": []}
|
||||
|
||||
for category, category_score in metric_statistics.items():
|
||||
frame_all["model"].append(model_name)
|
||||
frame_all["category"].append(category)
|
||||
frame_all["metric"].append(metric)
|
||||
frame_all["score"].append(category_score)
|
||||
|
||||
frame_per_metric[metric]["model"].append(model_name)
|
||||
frame_per_metric[metric]["category"].append(category)
|
||||
frame_per_metric[metric]["score"].append(category_score)
|
||||
|
||||
if not os.path.exists(save_path):
|
||||
os.makedirs(save_path)
|
||||
|
||||
frame_all = pd.DataFrame(frame_all)
|
||||
frame_all.to_csv(os.path.join(save_path, "unieval_statistics.csv"))
|
||||
|
||||
for metric in tqdm.tqdm(
|
||||
frame_per_metric.keys(),
|
||||
desc="UniEval metrics: ",
|
||||
total=len(frame_per_metric.keys()),
|
||||
):
|
||||
data = pd.DataFrame(frame_per_metric[metric])
|
||||
|
||||
sns.set()
|
||||
fig = plt.figure(figsize=(16, 10))
|
||||
|
||||
fig = sns.barplot(x="category", y="score", hue="model", data=data, dodge=True)
|
||||
fig.set_title(
|
||||
f"Comparison between Different Models for Metric {metric.split('-')[0].title()} in Task {metric.split('-')[1].title()}"
|
||||
)
|
||||
plt.xlabel("Evaluation Category")
|
||||
plt.ylabel("Score")
|
||||
|
||||
figure = fig.get_figure()
|
||||
figure.savefig(os.path.join(save_path, f"{metric}.png"), dpi=400)
|
||||
|
||||
plt.close()
|
|
@@ -1,206 +0,0 @@
|
|||
import io
|
||||
import json
|
||||
import os
|
||||
import string
|
||||
from typing import Dict
|
||||
|
||||
import matplotlib.pyplot as plt
|
||||
import pandas as pd
|
||||
import seaborn as sns
|
||||
import tqdm
|
||||
from zhon import hanzi
|
||||
|
||||
|
||||
def _make_w_io_base(f, mode: str):
|
||||
if not isinstance(f, io.IOBase):
|
||||
f_dirname = os.path.dirname(f)
|
||||
if f_dirname != "":
|
||||
os.makedirs(f_dirname, exist_ok=True)
|
||||
f = open(f, mode=mode)
|
||||
return f
|
||||
|
||||
|
||||
def _make_r_io_base(f, mode: str):
|
||||
if not isinstance(f, io.IOBase):
|
||||
f = open(f, mode=mode)
|
||||
return f
|
||||
|
||||
|
||||
def jdump(obj, f, mode="w", indent=4, default=str):
|
||||
"""Dump a str or dictionary to a file in json format.
|
||||
Args:
|
||||
obj: An object to be written.
|
||||
f: A string path to the location on disk.
|
||||
mode: Mode for opening the file.
|
||||
indent: Indent for storing json dictionaries.
|
||||
default: A function to handle non-serializable entries; defaults to `str`.
|
||||
"""
|
||||
f = _make_w_io_base(f, mode)
|
||||
if isinstance(obj, (dict, list)):
|
||||
json.dump(obj, f, indent=indent, default=default, ensure_ascii=False)
|
||||
elif isinstance(obj, str):
|
||||
f.write(obj)
|
||||
else:
|
||||
raise ValueError(f"Unexpected type: {type(obj)}")
|
||||
f.close()
|
||||
|
||||
|
||||
def jload(f, mode="r"):
|
||||
"""Load a .json file into a dictionary."""
|
||||
f = _make_r_io_base(f, mode)
|
||||
jdict = json.load(f)
|
||||
f.close()
|
||||
return jdict
|
||||
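# Editor's illustration, not part of the original module: a jdump/jload round trip.
# The path below is hypothetical; any writable location works.
def _json_io_example() -> dict:
    jdump({"hello": "world"}, "example_output/demo.json")
    return jload("example_output/demo.json")  # -> {"hello": "world"}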
|
||||
|
||||
def get_json_list(file_path):
|
||||
with open(file_path, "r") as f:
|
||||
json_list = []
|
||||
for line in f:
|
||||
json_list.append(json.loads(line))
|
||||
return json_list
|
||||
|
||||
|
||||
def get_data_per_category(data, categories):
|
||||
data_per_category = {category: [] for category in categories}
|
||||
for item in data:
|
||||
category = item["category"]
|
||||
if category in categories:
|
||||
data_per_category[category].append(item)
|
||||
|
||||
return data_per_category
|
||||
|
||||
|
||||
def remove_punctuations(text: str) -> str:
|
||||
"""
|
||||
Remove punctuation from the given text.
|
||||
It is used in evaluation of automatic metrics.
|
||||
|
||||
"""
|
||||
|
||||
punctuation = string.punctuation + hanzi.punctuation
|
||||
punctuation = set([char for char in punctuation])
|
||||
punctuation.difference_update(set("!@#$%&()<>?|,.\"'"))
|
||||
|
||||
out = []
|
||||
for char in text:
|
||||
if char in punctuation:
|
||||
continue
|
||||
else:
|
||||
out.append(char)
|
||||
|
||||
return "".join(out)
|
||||
|
||||
|
||||
def remove_redundant_space(text: str) -> str:
|
||||
"""
|
||||
Remove redundant spaces in the given text.
|
||||
It is used in evaluation of automatic metrics.
|
||||
|
||||
"""
|
||||
|
||||
return " ".join(text.split())
|
||||
|
||||
|
||||
def preprocessing_text(text: str) -> str:
|
||||
"""
|
||||
Preprocess the given text.
|
||||
It is used in evaluation of automatic metrics.
|
||||
|
||||
"""
|
||||
|
||||
return remove_redundant_space(remove_punctuations(text.lower()))
|
||||
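# Editor's illustration, not part of the original module: lower-casing, punctuation
# stripping and space collapsing applied together.
def _preprocessing_example() -> str:
    # "Hello -- World;   OK." -> "hello world ok."  (dashes and semicolons are removed,
    # "." is kept by remove_punctuations, and the extra spaces are collapsed)
    return preprocessing_text("Hello -- World;   OK.")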
|
||||
|
||||
def save_automatic_results(model_name: str, automatic_metric_stats: Dict[str, Dict], save_path: str) -> None:
|
||||
"""
|
||||
Save automatic evaluation results of different categories for one model.
|
||||
|
||||
"""
|
||||
|
||||
if not os.path.exists(save_path):
|
||||
os.makedirs(save_path)
|
||||
|
||||
automatic_df = pd.DataFrame(automatic_metric_stats)
|
||||
automatic_df.to_csv(os.path.join(save_path, f"{model_name}_results.csv"), index=True)
|
||||
|
||||
|
||||
def read_automatic_results(results_path: str, file_name: str) -> Dict[str, Dict]:
|
||||
"""
|
||||
Read a csv file and return a dictionary which stores scores per metric.
|
||||
|
||||
"""
|
||||
|
||||
results = pd.read_csv(os.path.join(results_path, file_name), index_col=0)
|
||||
|
||||
results_dict = {metric: {} for metric in list(results.index)}
|
||||
for i, metric in enumerate(results_dict.keys()):
|
||||
for j, category in enumerate(list(results.columns)):
|
||||
if pd.isnull(results.iloc[i][j]):
|
||||
continue
|
||||
results_dict[metric][category] = results.iloc[i][j]
|
||||
|
||||
return results_dict
|
||||
|
||||
|
||||
def analyze_automatic_results(results_path: str, save_path: str) -> None:
|
||||
"""
|
||||
Analyze and visualize all csv files in the given folder.
|
||||
|
||||
"""
|
||||
|
||||
if not os.path.exists(results_path):
|
||||
raise Exception(f'The given directory "{results_path}" doesn\'t exist! No results found!')
|
||||
|
||||
all_statistics = {}
|
||||
|
||||
for file_name in os.listdir(results_path):
|
||||
if file_name.endswith("_results.csv"):
|
||||
model_name = file_name.split("_results.csv")[0]
|
||||
all_statistics[model_name] = read_automatic_results(results_path, file_name)
|
||||
|
||||
if len(list(all_statistics.keys())) == 0:
|
||||
raise Exception(f'There are no csv files in the given directory "{results_path}"!')
|
||||
|
||||
frame_all = {"model": [], "category": [], "metric": [], "score": []}
|
||||
frame_per_metric = {}
|
||||
for model_name, model_statistics in all_statistics.items():
|
||||
for metric, metric_statistics in model_statistics.items():
|
||||
if frame_per_metric.get(metric) is None:
|
||||
frame_per_metric[metric] = {"model": [], "category": [], "score": []}
|
||||
|
||||
for category, category_score in metric_statistics.items():
|
||||
frame_all["model"].append(model_name)
|
||||
frame_all["category"].append(category)
|
||||
frame_all["metric"].append(metric)
|
||||
frame_all["score"].append(category_score)
|
||||
|
||||
frame_per_metric[metric]["model"].append(model_name)
|
||||
frame_per_metric[metric]["category"].append(category)
|
||||
frame_per_metric[metric]["score"].append(category_score)
|
||||
|
||||
if not os.path.exists(save_path):
|
||||
os.makedirs(save_path)
|
||||
|
||||
frame_all = pd.DataFrame(frame_all)
|
||||
frame_all.to_csv(os.path.join(save_path, "automatic_evaluation_statistics.csv"))
|
||||
|
||||
for metric in tqdm.tqdm(
|
||||
frame_per_metric.keys(),
|
||||
desc="automatic metrics: ",
|
||||
total=len(frame_per_metric.keys()),
|
||||
):
|
||||
data = pd.DataFrame(frame_per_metric[metric])
|
||||
|
||||
sns.set()
|
||||
fig = plt.figure(figsize=(16, 10))
|
||||
|
||||
fig = sns.barplot(x="category", y="score", hue="model", data=data, dodge=True)
|
||||
fig.set_title(f"Comparison between Different Models for Metric {metric.title()}")
|
||||
plt.xlabel("Evaluation Category")
|
||||
plt.ylabel("Score")
|
||||
|
||||
figure = fig.get_figure()
|
||||
figure.savefig(os.path.join(save_path, f"{metric}.png"), dpi=400)
|
||||
|
||||
plt.close()
|
|
@@ -0,0 +1,554 @@
|
|||
# ColossalEval
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Overview](#overview)
|
||||
- [Leaderboard](#leaderboard)
|
||||
- [Install](#install)
|
||||
- [Evaluation Process](#evaluation-process)
|
||||
- [Inference](#inference)
|
||||
- [Dataset Preparation](#dataset-preparation)
|
||||
- [Configuration](#configuration)
|
||||
- [How to Use](#how-to-use)
|
||||
- [Evaluation](#evaluation)
|
||||
- [Dataset Evaluation](#dataset-evaluation)
|
||||
- [Configuration](#dataset-evaluation)
|
||||
- [How to Use](#dataset-evaluation)
|
||||
- [GPT Evaluation](#gpt-evaluation)
|
||||
- [Configuration](#gpt-evaluation)
|
||||
- [How to Use](#gpt-evaluation)
|
||||
- [More Details](#more-details)
|
||||
- [Inference Details](#inference-details)
|
||||
- [Evaluation Details](#evaluation-details)
|
||||
- [Metrics](#metrics)
|
||||
- [Examples](#examples)
|
||||
- [Dataset Evaluation Example](#dataset-evaluation-example)
|
||||
- [GPT Evaluation Example](#gpt-evaluation-example)
|
||||
- [To Do](#to-do)
|
||||
- [FAQ](#faq)
|
||||
- [How to Add a New Metric?](#how-to-add-a-new-metric)
|
||||
- [How to Add a New Dataset?](#how-to-add-a-new-dataset)
|
||||
- [How to Add a New Model?](#how-to-add-a-new-model)
|
||||
- [Citations](#citations)
|
||||
|
||||
## Overview
|
||||
[ColossalEval](https://github.com/hpcaitech/ColossalAI/tree/main/applications/ColossalEval) is a project that provides a uniform pipeline for evaluating language models on public datasets or your own dataset, using both classic metrics and GPT-assisted evaluation. More details can be found in the following sections.
|
||||
|
||||
## Leaderboard
|
||||
|
||||
We conducted a comprehensive evaluation on 4 datasets and compared our Colossal-Llama-2-7b-base model with various other models.
|
||||
|
||||
- We use 5-shot for MMLU and calculate scores based on the logits of first predicted token.
|
||||
- We use 5-shot for CMMLU and calculate scores based on the logits of first predicted token.
|
||||
- We use 5-shot for AGIEval and only calculate scores for 4-choice questions, using a combined metric of exact match and the logits of the first predicted token. If either the exact match or the first predicted token's logits is correct, the model gets the score.
|
||||
- We use 0-shot for GAOKAO-Bench and only calculate scores for 4-choice questions based on the logits of the first predicted token.
|
||||
- The generation config for all datasets is greedy search.
|
||||
- We also provide CEval scores from its latest leaderboard or from the official repository of the model.
|
||||
|
||||
More details about metrics can be found in [Metrics](#metrics).
|
||||
|
||||
| | Backbone | Tokens Consumed | | MMLU | CMMLU | AGIEval | GAOKAO | CEval |
|
||||
| :----------------------------: | :--------: | :-------------: | :------------------: | :-----------: | :-----: | :----: | :----: | :----------------------------: |
|
||||
| | - | - | | 5-shot | 5-shot | 5-shot | 0-shot | 5-shot |
|
||||
| Baichuan-7B | - | 1.2T | | 42.32 (42.30) | 44.53 (44.02) | 38.72 | 36.74 | 42.80 |
|
||||
| Baichuan-13B-Base | - | 1.4T | | 50.51 (51.60) | 55.73 (55.30) | 47.20 | 51.41 | 53.60 |
|
||||
| Baichuan2-7B-Base | - | 2.6T | | 46.97 (54.16) | 57.67 (57.07) | 45.76 | 52.60 | 54.00 |
|
||||
| Baichuan2-13B-Base | - | 2.6T | | 54.84 (59.17) | 62.62 (61.97) | 52.08 | 58.25 | 58.10 |
|
||||
| ChatGLM-6B | - | 1.0T | | 39.67 (40.63) | 41.17 (-) | 40.10 | 36.53 | 38.90 |
|
||||
| ChatGLM2-6B | - | 1.4T | | 44.74 (45.46) | 49.40 (-) | 46.36 | 45.49 | 51.70 |
|
||||
| InternLM-7B | - | - | | 46.70 (51.00) | 52.00 (-) | 44.77 | 61.64 | 52.80 |
|
||||
| Qwen-7B | - | 2.2T | | 54.29 (56.70) | 56.03 (58.80) | 52.47 | 56.42 | 59.60 |
|
||||
| | | | | | | | | |
|
||||
| Llama-2-7B | - | 2.0T | | 44.47 (45.30) | 32.97 (-) | 32.60 | 25.46 | - |
|
||||
| Linly-AI/Chinese-LLaMA-2-7B-hf | Llama-2-7B | 1.0T | | 37.43 | 29.92 | 32.00 | 27.57 | - |
|
||||
| wenge-research/yayi-7b-llama2 | Llama-2-7B | - | | 38.56 | 31.52 | 30.99 | 25.95 | - |
|
||||
| ziqingyang/chinese-llama-2-7b | Llama-2-7B | - | | 33.86 | 34.69 | 34.52 | 25.18 | 34.2 |
|
||||
| TigerResearch/tigerbot-7b-base | Llama-2-7B | 0.3T | | 43.73 | 42.04 | 37.64 | 30.61 | - |
|
||||
| LinkSoul/Chinese-Llama-2-7b | Llama-2-7B | - | | 48.41 | 38.31 | 38.45 | 27.72 | - |
|
||||
| FlagAlpha/Atom-7B | Llama-2-7B | 0.1T | | 49.96 | 41.10 | 39.83 | 33.00 | - |
|
||||
| IDEA-CCNL/Ziya-LLaMA-13B-v1.1 | Llama-13B | 0.11T | | 50.25 | 40.99 | 40.04 | 30.54 | - |
|
||||
| | | | | | | | | |
|
||||
| **Colossal-LLaMA-2-7b-base** | Llama-2-7B | **0.0085T** | | 53.06 | 49.89 | 51.48 | 58.82 | 50.20 |
|
||||
|
||||
> The scores in parentheses correspond to the scores in the official repository of the model.
|
||||
>
|
||||
> We use zero-shot for ChatGLM models.
|
||||
>
|
||||
> Qwen-7B is now inaccessible on Hugging Face; we use the latest version available before it was made inaccessible. Only for the MMLU dataset, the prompt is "xxx Answer:" (with the space after ":" removed), and we calculate the logits over " A", " B", " C" and " D" for Qwen-7B. Qwen-7B tends to be much more deterministic than other models; for example, the logit over " A" can be `-inf`, and its softmax probability is then exactly `0`.
|
||||
>
|
||||
> For other models and other datasets, we calculate logits over "A", "B", "C" and "D".
|
||||
|
||||
Our model achieves much better scores than all other Llama-1 or Llama-2 based models and also stands out among popular open-source LLMs.
|
||||
|
||||
## Install
|
||||
You need to install `ColossalEval` before using it; `colossal_eval` is the name of the installed package.
|
||||
```bash
|
||||
git clone https://github.com/hpcaitech/ColossalAI.git
|
||||
cd ColossalAI/applications/ColossalEval
|
||||
pip install .
|
||||
```
|
||||
If you want to add customized datasets or models, use `pip install -e .` instead to ensure that any changes you make to the source code will immediately affect the package you install.
|
||||
|
||||
## Evaluation Process
|
||||
The evaluation process involves 2 steps, `inference` and `evaluation`, and you need to set a config for each step.
|
||||
|
||||
### Inference
|
||||
|
||||
The inference process consists of two parts.
|
||||
1. Preprocess and convert the original dataset.
|
||||
2. Configure your tokenizer and model arguments to perform zero-shot or few-shot prompting.
|
||||
|
||||
#### Dataset Preparation
|
||||
|
||||
In this step, the original dataset (either in `csv` or `jsonl` format) will be loaded and converted into a `dict`. During the conversion, we carefully parse each subcategory and assign specific inference arguments to it.
|
||||
|
||||
Inference arguments are stored in a `dict`. The following is an example.
|
||||
|
||||
```python
|
||||
inference_kwargs = {
|
||||
"calculate_loss": True,
|
||||
"all_classes": ["A", "B", "C", "D"],
|
||||
"language": "Chinese",
|
||||
"pretrain": False,
|
||||
"max_new_tokens": 32
|
||||
}
|
||||
```
|
||||
The `inference_kwargs` currently contains 5 fields:
|
||||
|
||||
- `calculate_loss` (bool, compulsory): Whether the loss on target tokens will be calculated.
|
||||
- `all_classes` (Optional[list], compulsory): Whether the subcategory consists of single-choice questions. Specify all available options in a list, or set it to None otherwise.
|
||||
- `language` (str, compulsory): The language for the subcategory.
|
||||
- `pretrain` (bool, compulsory): Whether the dataset is a pretrain dataset or not. It is usually used to calculate perplexity when you want to evaluate a model with an extended context length.
|
||||
- `max_new_tokens` (int, compulsory): The number of new tokens to generate during inference.
|
||||
|
||||
For example, each subcategory of the MMLU dataset consists of single-choice questions with options A, B, C and D by default, so we can assign the value `["A", "B", "C", "D"]` to the key `all_classes`. For the C-Eval dataset, target answers aren't provided in the test split, so `calculate_loss` should be set to False. However, other datasets such as GAOKAO-Bench contain different question formats and lack keys or metadata that reveal whether a question is single-choice or multi-choice. Before assigning inference arguments, we therefore first parse the dataset to decide which type of questions a subcategory contains and set the inference arguments accordingly; a minimal sketch of this assignment follows.
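The following is a minimal sketch (not the package's actual code) of how such per-subcategory arguments could be assigned after parsing; the helper name and its parameters are illustrative assumptions.

```python
from copy import deepcopy

# Illustrative only: assign inference arguments after inspecting a parsed subcategory.
default_inference_kwargs = {
    "calculate_loss": True,
    "all_classes": None,
    "language": "Chinese",
    "pretrain": False,
    "max_new_tokens": 32,
}


def build_inference_kwargs(options=None, has_targets=True):
    kwargs = deepcopy(default_inference_kwargs)
    # Single-choice subcategories expose their options, e.g. ["A", "B", "C", "D"] for MMLU.
    kwargs["all_classes"] = options
    # C-Eval's test split provides no target answers, so loss cannot be calculated there.
    kwargs["calculate_loss"] = has_targets
    return kwargs


mmlu_kwargs = build_inference_kwargs(options=["A", "B", "C", "D"], has_targets=True)
ceval_test_kwargs = build_inference_kwargs(options=["A", "B", "C", "D"], has_targets=False)
```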
|
||||
|
||||
In addition to `inference_kwargs`, `data` is a list containing the questions of the same subcategory. The following shows the structure of a converted dataset.
|
||||
|
||||
```json
|
||||
{
|
||||
"dev": {
|
||||
"category 1": {"data": [], "inference_kwargs": {}},
|
||||
"category 2": {"data": [], "inference_kwargs": {}}
|
||||
},
|
||||
"test": {
|
||||
"category 1": {"data": [], "inference_kwargs": {}},
|
||||
"category 2": {"data": [], "inference_kwargs": {}}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
A data sample basically follows the Alpaca format. It should contain the following keys:
|
||||
|
||||
* `dataset` (str, compulsory): The name of the dataset.
|
||||
* `split` (str, compulsory): The split of the instruction.
|
||||
* `category` (str, compulsory): The category of the instruction.
|
||||
* `instruction` (str, compulsory): The instruction for the LLM.
|
||||
* `input` (str, optional): The additional context of the instruction.
|
||||
* `output` (str, optional): The model output of the instruction.
|
||||
* `target` (str, optional): The target answer for the instruction.
|
||||
|
||||
Example:
|
||||
|
||||
```json
|
||||
{
|
||||
"dev": {
|
||||
"Abstract Algebra": [
|
||||
{
|
||||
"dataset": "mmlu",
|
||||
"split": "dev",
|
||||
"category": "Abstract Algebra",
|
||||
"instruction": "The following is a single-choice question on Abstract Algebra. Answer the question by replying A, B, C or D.",
|
||||
"input": "Question: Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.\nA. 0\nB. 1\nC. 2\nD. 3\nAnswer: ",
|
||||
"output": "",
|
||||
"target": "B"
|
||||
}
|
||||
]
|
||||
},
|
||||
"test": {
|
||||
"Abstract Algebra": [
|
||||
{
|
||||
"dataset": "mmlu",
|
||||
"split": "test",
|
||||
"category": "Abstract Algebra",
|
||||
"instruction": "The following is a single-choice question on Abstract Algebra. Answer the question by replying A, B, C or D.",
|
||||
"input": "Question: Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.\nA. 0\nB. 4\nC. 2\nD. 6\nAnswer: ",
|
||||
"output": "",
|
||||
"target": "B"
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### Configuration
|
||||
In this step, you will configure your tokenizer and model arguments to infer on the given datasets.
|
||||
|
||||
A config file consists of two parts.
|
||||
1. Model config. In the model config, you need to specify the model name, model path, model class, tokenizer arguments and model arguments.
|
||||
2. Dataset config. In the dataset config, you need to specify the dataset name, path and dataset class.
|
||||
|
||||
Once you have all configs ready, the program will run inference on all the given datasets with all the given models.
|
||||
|
||||
An example config using model class `HuggingFaceCausalLM` and dataset class `CMMLUDataset` can be:
|
||||
```json
|
||||
{
|
||||
"model": [
|
||||
{
|
||||
"name": "model name",
|
||||
"model_class": "HuggingFaceCausalLM",
|
||||
"parameters": {
|
||||
"path": "path to model",
|
||||
"model_max_length": 2048,
|
||||
"tokenizer_path": "path to tokenizer",
|
||||
"tokenizer_kwargs": {
|
||||
"use_fast": false,
|
||||
"trust_remote_code": true
|
||||
},
|
||||
"peft_path": null,
|
||||
"model_kwargs": {
|
||||
"trust_remote_code": true
|
||||
},
|
||||
"prompt_template": "plain",
|
||||
"batch_size": 4
|
||||
}
|
||||
}
|
||||
],
|
||||
"dataset": [
|
||||
{
|
||||
"name": "dataset name",
|
||||
"dataset_class": "CMMLUDataset",
|
||||
"debug": false,
|
||||
"few_shot": true,
|
||||
"path": "path to original dataset",
|
||||
"save_path": "path to save converted dataset"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Currently, we support Hugging Face models. `tokenizer_kwargs` contains the arguments passed to `AutoTokenizer.from_pretrained()`, and `model_kwargs` contains the arguments passed to `AutoModel.from_pretrained()` or `AutoModelForCausalLM.from_pretrained()`. Set `few_shot` to true if you want to enable few-shot prompting for the dataset, and set `debug` to true if you want to verify whether your prompts are constructed correctly.
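As a rough illustration (a sketch, not the package's internal code), the tokenizer and model fields above are typically consumed like this; the paths are placeholders from the example config.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Values below mirror the example config; paths are placeholders.
tokenizer_kwargs = {"use_fast": False, "trust_remote_code": True}
model_kwargs = {"trust_remote_code": True}

tokenizer = AutoTokenizer.from_pretrained("path to tokenizer", **tokenizer_kwargs)
model = AutoModelForCausalLM.from_pretrained("path to model", **model_kwargs)
model.eval()
```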
|
||||
|
||||
#### How to Use
|
||||
An example script is shown below; `configs/dataset_evaluation/inference.py` is the same in all the examples provided.
|
||||
|
||||
```shell
|
||||
torchrun --nproc_per_node=1 inference.py \
|
||||
--config "path to config file" \
|
||||
--load_dataset \
|
||||
--inference_save_path "path to save inference results"
|
||||
```
|
||||
|
||||
You should specify the path to the config file in `config`. If you have already saved the converted dataset, you can run the script without `load_dataset`; otherwise, set it so that the script first loads the original dataset and saves the converted one. You should specify the path for saving inference results in `inference_save_path`.
|
||||
|
||||
### Evaluation
|
||||
|
||||
In the evaluation process, you only need to configure your evaluation parameters. You can evaluate using either public datasets or the help of GPTs. We will introduce the configuration for both dataset evaluation and GPT evaluation.
|
||||
|
||||
#### Dataset Evaluation
|
||||
|
||||
In dataset evaluation, we calculate different metrics on the given inference results and public dataset.
|
||||
|
||||
##### Configuration
|
||||
|
||||
A config file for dataset evaluation consists of two parts.
|
||||
1. Model config. In the model config, you need to specify the model name. If you want to evaluate perplexity over a pretrain dataset and calculate per-byte perplexity, you also have to add your tokenizer config and model max length.
|
||||
2. Dataset config. In dataset config, you need to specify the evaluation arguments for the dataset.
|
||||
|
||||
Once you have all configs ready, the program will run evaluation on the inference results for all given models and datasets.
|
||||
|
||||
An example config can be:
|
||||
```json
|
||||
{
|
||||
"model": [
|
||||
{
|
||||
"name": "model name"
|
||||
}
|
||||
],
|
||||
"dataset": [
|
||||
{
|
||||
"name": "dataset name",
|
||||
"metrics": ["first_token_accuracy"]
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
The above config specifies that the program will evaluate the inference results using the `first_token_accuracy` metric.
|
||||
|
||||
##### How to Use
|
||||
|
||||
An example script can be the following.
|
||||
|
||||
```shell
|
||||
python eval_dataset.py \
|
||||
--config "path to config file" \
|
||||
--inference_results_path "path to inference results" \
|
||||
--evaluation_results_save_path "path to save evaluation results"
|
||||
```
|
||||
|
||||
You should specify the path to the config file in `config`, the path to the inference results in `inference_results_path`, and the path for saving evaluation results in `evaluation_results_save_path`.
|
||||
|
||||
#### GPT Evaluation
|
||||
|
||||
In GPT evaluation, we provide a prompt template that fits different pre-defined metrics with Chain-of-Thought prompting. In the following sections, we only introduce how you can evaluate model answers using GPTs. More details can be found in `colossal_eval/evaluate/GPT Evaluation.md`.
|
||||
|
||||
##### Configuration
|
||||
|
||||
The following is an example of an English config file. The configuration file controls how the pipeline evaluates the model, and you need to specify the GPT evaluation metrics in it. You can find an example English config file in `configs/gpt_evaluation`.
|
||||
|
||||
```json
|
||||
{
|
||||
"language": "en",
|
||||
"category": {
|
||||
"brainstorming": {
|
||||
"GPT": [
|
||||
"language organization",
|
||||
"relevance",
|
||||
"creativity",
|
||||
"practicality",
|
||||
"reasonableness"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
##### How to Use
|
||||
After setting the config file, you can evaluate the model using `examples/gpt_evaluation/eval.py`. If you want to compare the answers of two different models, you should specify two answer files in the argument `answer_file_list` and two model names in the argument `model_name_list` (details can be found in `colossal_eval/evaluate/GPT Evaluation.md`). If you want to evaluate a single answer file, the length of both `answer_file_list` and `model_name_list` should be 1, and the program will perform evaluation using GPTs.
|
||||
|
||||
An example script is provided as follows:
|
||||
|
||||
```shell
|
||||
python eval.py \
|
||||
--config_file "path to the config file" \
|
||||
--battle_prompt_file "path to the prompt file for battle" \
|
||||
--gpt_evaluation_prompt_file "path to the prompt file for gpt evaluation" \
|
||||
--target_file "path to the target answer file" \
|
||||
--answer_file_list "path to the answer file" \
|
||||
--model_name_list "the names of the model" \
|
||||
--gpt_model "which GPT model to use for evaluation" \
|
||||
--save_path "path to save results" \
|
||||
--openai_key "your openai key" \
|
||||
```
|
||||
|
||||
## More Details
|
||||
|
||||
### Inference Details
|
||||
|
||||
In the inference process, according to the inference arguments, we perform generation, calculate the loss over target tokens, count the number of target tokens, and compute the softmax over the given options (for example, "A", "B", "C" and "D").
|
||||
|
||||
For tokenization, we adopt the tokenization strategy from [LongBench](https://github.com/THUDM/LongBench/blob/main/pred.py#L55) to preserve crucial instructions on both the left and right sides and to keep all target tokens.
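A simplified sketch of this middle-truncation idea (assumed behavior, not the exact implementation): keep the head and the tail of an over-long prompt so that instructions on both sides survive.

```python
def truncate_middle(tokenizer, prompt: str, max_length: int) -> str:
    """Keep the head and tail of an over-long prompt, dropping tokens from the middle."""
    input_ids = tokenizer(prompt, truncation=False).input_ids
    if len(input_ids) <= max_length:
        return prompt
    half = max_length // 2
    kept_ids = input_ids[:half] + input_ids[-half:]  # drop tokens from the middle
    return tokenizer.decode(kept_ids, skip_special_tokens=True)
```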
|
||||
|
||||
For labeling target tokens, we adopt the method from [FastChat](https://github.com/lm-sys/FastChat/blob/main/fastchat/train/train.py#L137), but it doesn't always hold due to different tokenizer behaviors. We plan to insert special tokens to correctly label the target tokens.
|
||||
|
||||
For calculating loss, we return a per-sample loss instead of the per-batch loss that you get by directly using `model(batch).loss` from Hugging Face.
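A minimal sketch of computing per-sample loss, assuming a causal LM and labels with `-100` on positions to ignore (this is not the package's exact code):

```python
import torch
import torch.nn.functional as F


def per_sample_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Return one averaged loss per sample instead of a single batch-averaged loss."""
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    token_loss = F.cross_entropy(
        shift_logits.transpose(1, 2), shift_labels, ignore_index=-100, reduction="none"
    )  # shape: (batch_size, seq_len - 1)
    valid = (shift_labels != -100).float()
    return (token_loss * valid).sum(dim=1) / valid.sum(dim=1).clamp(min=1)
```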
|
||||
|
||||
### Evaluation Details
|
||||
|
||||
To make it easier to set the config, you only need to specify all the metrics you want to use under the key `metrics`. However, the program will only apply a suitable subset of the given metrics to each subcategory, since applying all metrics to all subcategories is obviously unsuitable. The suggested metrics for specific categories are defined in `colossal_eval/evaluate/dataset_evaluator/metrics.py`.
|
||||
|
||||
#### Metrics
|
||||
|
||||
- `combined_single_choice_accuracy`: A combination of `first_token_logit` and `single_choice_accuracy`. If either of them is correct, the model gets the score. It can be used in all datasets that contain single-choice questions.
|
||||
- `first_token_logit`: Calculate the score based on the softmax score over the given choices. If the argmax of the softmax equals the reference, the model gets the score. If there is a `NaN` in the softmax score, the score is calculated using exact match instead. It can be used in all datasets that contain single-choice questions.
|
||||
- `single_choice_accuracy`: Calculate the score using exact match. It only takes the first uppercase letter such as A, B, C or D that is not surrounded by lowercase letters. If that uppercase letter equals the reference, the model gets the score. It can be used in all datasets that contain single-choice questions.
|
||||
- `multi_choice_accuracy`: Calculate the score for multi-choice questions. It takes the set of all uppercase letters such as A, B, C or D that are not surrounded by lowercase letters. If the prediction contains uppercase letters that are not in the reference, the model gets a score of 0; for each uppercase letter in the prediction that is in the reference, the model gets a score of `1/len(reference)`. It is used in AGIEval and GAOKAO-Bench.
|
||||
- `math_equivalence`: Code from [hendrycks](https://github.com/hendrycks/math/blob/main/modeling/math_equivalence.py). Compute the score by checking whether the predicted math formula is equivalent to the reference math formula. It is used in AGIEval and GAOKAO-Bench.
|
||||
- `f1_score`: Calculate the English F1 score between prediction and reference. It is used in LongBench.
|
||||
- `f1_zh_score`: Calculate the Chinese F1 score between prediction and reference. It is used in LongBench.
|
||||
- `rouge_score`: Calculate the English ROUGE score between prediction and reference. It is used in GAOKAO-Bench and LongBench.
|
||||
- `rouge_zh_score`: Calculate the Chinese ROUGE score between prediction and reference. It is used in GAOKAO-Bench and LongBench.
|
||||
- `retrieval_score`: Calculate the English retrieval score between prediction and reference. It determines whether the output (which paragraph) corresponds to the given abstract. It is used in LongBench.
|
||||
- `retrieval_zh_score`: Calculate the Chinese retrieval score between prediction and reference. It determines whether the output (which paragraph) corresponds to the given abstract. It is used in LongBench.
|
||||
- `classification_score`: Calculate the classification score between prediction and reference. It determines whether the output (a class) is equal to the reference. It is used in LongBench.
|
||||
- `code_sim_score`: Calculate the similarity score between prediction and reference. It is used in LongBench.
|
||||
- `count_score`: Calculate the count score between prediction and reference. It determines whether the output (the number of given passages) is equal to the reference. It is used in LongBench.
|
||||
- `perplexity`: Calculate perplexity. The formula is $ perplexity = \frac{1}{n} \sum_i e^{loss_i} $ where $n$ is the number of samples and $ loss_i $ is the average loss for sample $ i $. It can be used in all datasets (a minimal code sketch of these perplexity-style scores is given below).
|
||||
- `ppl_score`: Calculate the perplexity score. The formula is $ ppl\_score = \frac{1}{n} \sum_i e^{-loss_i} $ where $n$ is the number of samples and $ loss_i $ is the average loss for sample $ i $. It can be used in all datasets.
|
||||
- `ppl_score_over_choices`: Calculate the perplexity score over choices. The formula is $ ppl\_score\_over\_choices = \frac{1}{n} \sum_i e^{-loss\_over\_choices_i} $ where $n$ is the number of samples and $ loss\_over\_choices_i $ is the loss on the first predicted token for sample $ i $. It can be used in all datasets that contain single-choice questions.
|
||||
- `per_byte_perplexity`: Calculate the per-byte perplexity. The formula is $ \frac{1}{n} \sum_i e^{\frac{loss_i}{byte_i}} $ where $n$ is the number of samples, $ loss_i $ is the total loss for sample $ i $ and $ byte_i $ is the number of bytes sample $ i $ occupies. It can be used in all datasets.
|
||||
- `per_byte_ppl_score`: Calculate the per-byte perplexity score. The formula is $ \frac{1}{n} \sum_i e^{-\frac{loss_i}{byte_i}} $ where $n$ is the number of samples, $ loss_i $ is the total loss for sample $ i $ and $ byte_i $ is the number of bytes sample $ i $ occupies. It can be used in all datasets.
|
||||
|
||||
We use `combined_single_choice_accuracy` and `first_token_logit` in the leaderboard.
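The perplexity-style scores above reduce to a few lines. The sketch below assumes `losses` holds the average loss per sample and `total_losses`/`byte_counts` hold each sample's total loss and byte length; the names are illustrative, not the package's API.

```python
import math


def perplexity(losses):
    # perplexity = (1/n) * sum_i exp(loss_i), with loss_i the average loss of sample i
    return sum(math.exp(loss) for loss in losses) / len(losses)


def ppl_score(losses):
    # ppl_score = (1/n) * sum_i exp(-loss_i)
    return sum(math.exp(-loss) for loss in losses) / len(losses)


def per_byte_perplexity(total_losses, byte_counts):
    # (1/n) * sum_i exp(total_loss_i / bytes_i)
    return sum(math.exp(loss / b) for loss, b in zip(total_losses, byte_counts)) / len(total_losses)


def per_byte_ppl_score(total_losses, byte_counts):
    # (1/n) * sum_i exp(-total_loss_i / bytes_i)
    return sum(math.exp(-loss / b) for loss, b in zip(total_losses, byte_counts)) / len(total_losses)
```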
|
||||
|
||||
### Examples
|
||||
|
||||
We provide 2 examples for you to explore our `colossal_eval` package.
|
||||
|
||||
#### Dataset Evaluation Example
|
||||
|
||||
This example is in folder `examples/dataset_evaluation`.
|
||||
|
||||
1. `cd examples/dataset_evaluation`
|
||||
2. Fill in your inference config file in `config/inference/config.json`. Set the model and dataset parameters.
|
||||
3. Run `inference.sh` to get inference results.
|
||||
4. Fill in your evaluation config file in `config/evaluation/config.json`. Set the model and dataset parameters.
|
||||
5. Run `eval_dataset.sh` to get evaluation results.
|
||||
|
||||
#### GPT Evaluation Example
|
||||
|
||||
This example is in the folder `examples/gpt_evaluation`.
|
||||
|
||||
1. `cd examples/gpt_evaluation`
|
||||
2. Fill in your inference config file in `config/inference/config.json`. Set the model and dataset parameters. If you want to use the example dataset we provide, the dataset class is `ColossalDataset`.
|
||||
3. Run `inference.sh` to get inference results.
|
||||
4. Fill in your evaluation config file in `config/evaluation/config.json`.
|
||||
5. Run `eval.sh` to get evaluation results.
|
||||
|
||||
## FAQ
|
||||
|
||||
### How to Add a New Metric?
|
||||
|
||||
If you want to add a customized metric, we recommend using `pip install -e .` to ensure that any changes you make to the source code will immediately affect the package you install.
|
||||
|
||||
To add a new metric, you can follow the example of `multi_choice_accuracy` in line 339 of `colossal_eval/evaluate/dataset_evaluator/metric.py`. The method takes one data sample's prediction and reference as input and returns a score ranging from 0 to 1.
|
||||
|
||||
A skeleton of code is the following.
|
||||
|
||||
```python
|
||||
|
||||
def CustomizedMetric(prediction: str, reference: str):
|
||||
score = xxx
|
||||
return score
|
||||
```
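As a concrete illustration, a simple exact-match style metric could look like the following (a hypothetical example, not part of the package):

```python
def exact_match_score(prediction: str, reference: str) -> float:
    """Hypothetical example metric: 1.0 if the normalized strings match, otherwise 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())
```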
|
||||
|
||||
Once you have successfully added your own metric, you should specify it both in `colossal_eval/evaluate/dataset_evaluator/metric.py` (to suggest which subcategories the metric should be applied to) and in your evaluation config.
|
||||
|
||||
### How to Add a New Dataset?
|
||||
|
||||
If you want to add a customized dataset, we recommend using `pip install -e .` to ensure that any changes you make to the source code will immediately affect the package you install.
|
||||
|
||||
To add a new dataset, you can follow the example of `colossal_eval/dataset/mmlu.py`. You need to make sure that the format of the questions within one subcategory is the same. For example, all questions should have target answers, or all questions should be single-choice questions.
|
||||
|
||||
A skeleton of code is the following.
|
||||
|
||||
```python
|
||||
|
||||
class CustomizedDataset(BaseDataset):
|
||||
@staticmethod
|
||||
def load():
|
||||
# 1. Load and convert the original dataset format.
|
||||
# 2. Assign inference arguments for each subcategory.
|
||||
# 3. Return the converted dataset.
|
||||
pass
|
||||
```
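As a hypothetical illustration (the file layout, column names and class name below are assumptions, not part of the package), a loader for a single CSV of single-choice questions could look like:

```python
import csv

from colossal_eval.dataset import BaseDataset


class CustomizedCSVDataset(BaseDataset):
    """Hypothetical example: one CSV file with columns question, A, B, C, D, answer."""

    @staticmethod
    def load(path, logger, few_shot):
        inference_kwargs = {
            "calculate_loss": True,
            "all_classes": ["A", "B", "C", "D"],
            "language": "English",
            "pretrain": False,
            "max_new_tokens": 32,
        }
        dataset = {"test": {"my_subject": {"data": [], "inference_kwargs": inference_kwargs}}}
        with open(path, encoding="utf-8") as f:
            for row in csv.DictReader(f):
                choices = "\n".join(f"{c}. {row[c]}" for c in "ABCD")
                dataset["test"]["my_subject"]["data"].append(
                    {
                        "dataset": "customized",
                        "split": "test",
                        "category": "my_subject",
                        "instruction": "Answer the single-choice question by replying A, B, C or D.",
                        "input": f"Question: {row['question']}\n{choices}\nAnswer: ",
                        "output": "",
                        "target": row["answer"],
                    }
                )
        return dataset
```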
|
||||
|
||||
Once you have successfully added your own dataset, you can specify your dataset class in your inference config.
|
||||
|
||||
### How to Add a New Model?
|
||||
|
||||
If you want to add customized models, we recommend using `pip install -e .` to ensure that any changes you make to the source code will immediately affect the package you install.
|
||||
|
||||
To add a new model, you can follow the example of `colossal_eval/models/huggingface.py`. You need to provide a way to load the model and tokenizer, calculate the loss, and generate.
|
||||
|
||||
A skeleton of code is the following.
|
||||
|
||||
```python
|
||||
|
||||
class CustomizedModel(BaseModel):
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self._load_tokenizer()
|
||||
self._load_model()
|
||||
|
||||
def _load_tokenizer():
|
||||
pass
|
||||
|
||||
def _load_model():
|
||||
pass
|
||||
|
||||
def _calculate_loss():
|
||||
pass
|
||||
|
||||
def get_loss():
|
||||
self._calculate_loss()
|
||||
|
||||
def inference(samples):
|
||||
# 1. Load samples from the same subcategory.
|
||||
# 2. Infer in a batch way according to inference arguments.
|
||||
# 3. Return results.
|
||||
batch_samples = xxx
|
||||
self.get_loss(batch_samples)
|
||||
self.generate(batch_samples)
|
||||
|
||||
return inference_results
|
||||
|
||||
def generate():
|
||||
pass
|
||||
```
|
||||
|
||||
Once you have successfully added your own model, you can specify your model class in your inference config.
|
||||
|
||||
## To Do
|
||||
|
||||
- [ ] Add visualization code for evaluation results on public datasets
|
||||
- [ ] Improve the way to label target tokens
|
||||
|
||||
## Citations
|
||||
|
||||
```bibtex
|
||||
@misc{zhong2023agieval,
|
||||
title={AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models},
|
||||
author={Wanjun Zhong and Ruixiang Cui and Yiduo Guo and Yaobo Liang and Shuai Lu and Yanlin Wang and Amin Saied and Weizhu Chen and Nan Duan},
|
||||
year={2023},
|
||||
eprint={2304.06364},
|
||||
archivePrefix={arXiv},
|
||||
primaryClass={cs.CL}
|
||||
}
|
||||
|
||||
@article{huang2023ceval,
|
||||
title={C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models},
|
||||
author={Huang, Yuzhen and Bai, Yuzhuo and Zhu, Zhihao and Zhang, Junlei and Zhang, Jinghan and Su, Tangjun and Liu, Junteng and Lv, Chuancheng and Zhang, Yikai and Lei, Jiayi and Fu, Yao and Sun, Maosong and He, Junxian},
|
||||
journal={arXiv preprint arXiv:2305.08322},
|
||||
year={2023}
|
||||
}
|
||||
|
||||
@misc{li2023cmmlu,
|
||||
title={CMMLU: Measuring massive multitask language understanding in Chinese},
|
||||
author={Haonan Li and Yixuan Zhang and Fajri Koto and Yifei Yang and Hai Zhao and Yeyun Gong and Nan Duan and Timothy Baldwin},
|
||||
year={2023},
|
||||
eprint={2306.09212},
|
||||
archivePrefix={arXiv},
|
||||
primaryClass={cs.CL}
|
||||
}
|
||||
|
||||
@inproceedings{Zhang2023EvaluatingTP,
|
||||
title={Evaluating the Performance of Large Language Models on GAOKAO Benchmark},
|
||||
author={Xiaotian Zhang and Chunyang Li and Yi Zong and Zhengyu Ying and Liang He and Xipeng Qiu},
|
||||
year={2023}
|
||||
}
|
||||
|
||||
@misc{bai2023longbench,
|
||||
title={LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding},
|
||||
author={Yushi Bai and Xin Lv and Jiajie Zhang and Hongchang Lyu and Jiankai Tang and Zhidian Huang and Zhengxiao Du and Xiao Liu and Aohan Zeng and Lei Hou and Yuxiao Dong and Jie Tang and Juanzi Li},
|
||||
year={2023},
|
||||
eprint={2308.14508},
|
||||
archivePrefix={arXiv},
|
||||
primaryClass={cs.CL}
|
||||
}
|
||||
|
||||
@article{hendryckstest2021,
|
||||
title={Measuring Massive Multitask Language Understanding},
|
||||
author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
|
||||
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
|
||||
year={2021}
|
||||
}
|
||||
|
||||
@article{hendrycks2021ethics,
|
||||
title={Aligning AI With Shared Human Values},
|
||||
author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
|
||||
journal={Proceedings of the International Conference on Learning Representations (ICLR)},
|
||||
year={2021}
|
||||
}
|
||||
|
||||
@misc{zheng2023judging,
|
||||
title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
|
||||
author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric. P Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
|
||||
year={2023},
|
||||
eprint={2306.05685},
|
||||
archivePrefix={arXiv},
|
||||
primaryClass={cs.CL}
|
||||
}
|
||||
|
||||
```
|
|
@ -0,0 +1,19 @@
|
|||
from .agieval import AGIEvalDataset
|
||||
from .base import BaseDataset
|
||||
from .ceval import CEvalDataset
|
||||
from .cmmlu import CMMLUDataset
|
||||
from .colossalai import ColossalDataset
|
||||
from .gaokaobench import GaoKaoBenchDataset
|
||||
from .longbench import LongBenchDataset
|
||||
from .mmlu import MMLUDataset
|
||||
|
||||
__all__ = [
|
||||
"AGIEvalDataset",
|
||||
"BaseDataset",
|
||||
"CEvalDataset",
|
||||
"CMMLUDataset",
|
||||
"GaoKaoBenchDataset",
|
||||
"LongBenchDataset",
|
||||
"MMLUDataset",
|
||||
"ColossalDataset",
|
||||
]
|
|
@ -0,0 +1,247 @@
|
|||
# Adapted from https://github.com/ruixiangcui/AGIEval/blob/main/src/dataset_loader.py.
|
||||
|
||||
import ast
|
||||
import glob
|
||||
import os
|
||||
from copy import deepcopy
|
||||
from typing import Dict, List
|
||||
|
||||
import pandas as pd
|
||||
from colossal_eval.utils import get_json_list
|
||||
|
||||
from colossalai.logging import DistributedLogger
|
||||
|
||||
from .base import BaseDataset
|
||||
|
||||
# define the datasets
|
||||
english_qa_datasets = [
|
||||
"lsat-ar",
|
||||
"lsat-lr",
|
||||
"lsat-rc",
|
||||
"logiqa-en",
|
||||
"sat-math",
|
||||
"sat-en",
|
||||
"aqua-rat",
|
||||
"sat-en-without-passage",
|
||||
"gaokao-english",
|
||||
]
|
||||
chinese_qa_datasets = [
|
||||
"logiqa-zh",
|
||||
"jec-qa-kd",
|
||||
"jec-qa-ca",
|
||||
"gaokao-chinese",
|
||||
"gaokao-geography",
|
||||
"gaokao-history",
|
||||
"gaokao-biology",
|
||||
"gaokao-chemistry",
|
||||
"gaokao-physics",
|
||||
"gaokao-mathqa",
|
||||
]
|
||||
english_cloze_datasets = ["math"]
|
||||
chinese_cloze_datasets = ["gaokao-mathcloze"]
|
||||
|
||||
multi_choice_datasets = ["jec-qa-kd", "jec-qa-ca", "gaokao-physics", "gaokao-mathqa"]
|
||||
math_output_datasets = {"gaokao-mathcloze", "math"}
|
||||
|
||||
default_inference_kwargs = {
|
||||
"calculate_loss": True,
|
||||
"all_classes": None,
|
||||
"language": "Chinese",
|
||||
"pretrain": False,
|
||||
"max_new_tokens": 32,
|
||||
}
|
||||
|
||||
|
||||
def get_prompt(line: Dict, dataset_name: str, logger: DistributedLogger) -> Dict:
|
||||
"""Modified from https://github.com/microsoft/AGIEval/blob/main/src/dataset_loader.py#L190"""
|
||||
try:
|
||||
all_classes = None
|
||||
passage = line["passage"] if line["passage"] is not None else ""
|
||||
|
||||
if dataset_name in english_qa_datasets:
|
||||
option_string = "ABCDEFG"
|
||||
count = len(line["options"])
|
||||
|
||||
input = (
|
||||
"Question: "
|
||||
+ line["question"]
|
||||
+ " "
|
||||
+ "Choose from the following options: "
|
||||
+ " ".join(line["options"])
|
||||
+ "\n"
|
||||
+ "Answer: "
|
||||
)
|
||||
|
||||
all_classes = list(option_string[0:count])
|
||||
|
||||
elif dataset_name in chinese_qa_datasets:
|
||||
option_string = "ABCDEFG"
|
||||
count = len(line["options"])
|
||||
|
||||
input = "问题:" + line["question"] + " " + "从以下选项中选择:" + " ".join(line["options"]) + "\n" + "答案:"
|
||||
|
||||
all_classes = list(option_string[0:count])
|
||||
|
||||
elif dataset_name in english_cloze_datasets:
|
||||
input = "Question: " + line["question"] + "\n" + "Answer: "
|
||||
|
||||
elif dataset_name in chinese_cloze_datasets:
|
||||
input = "问题:" + line["question"] + "\n" + "答案:"
|
||||
|
||||
return {
|
||||
"instruction": input if not passage else passage + "\n\n" + input,
|
||||
"target": line["label"] if line["label"] else line["answer"],
|
||||
}, all_classes
|
||||
|
||||
except NameError:
|
||||
logger.info("Dataset not defined.")
|
||||
|
||||
|
||||
# process few-shot raw_prompts
|
||||
def combine_prompt(prompt_path, dataset_name, load_explanation=True, chat_mode=False):
|
||||
skip_passage = False
|
||||
if dataset_name == "sat-en-without-passage":
|
||||
skip_passage = True
|
||||
dataset_name = "sat-en"
|
||||
demostrations = []
|
||||
# read the prompts by context and explanation
|
||||
context_row = [0, 1, 3, 5, 7, 9]
|
||||
explanation_row = [0, 2, 4, 6, 8, 10]
|
||||
raw_prompts_context = pd.read_csv(
|
||||
prompt_path, header=0, skiprows=lambda x: x not in context_row, keep_default_na=False
|
||||
)
|
||||
raw_prompts_explanation = pd.read_csv(
|
||||
prompt_path, header=0, skiprows=lambda x: x not in explanation_row, keep_default_na=False
|
||||
).replace(r"\n\n", "\n", regex=True)
|
||||
contexts = []
|
||||
for line in list(raw_prompts_context[dataset_name]):
|
||||
if line:
|
||||
# print(line)
|
||||
contexts.append(ast.literal_eval(line))
|
||||
explanations = [exp for exp in raw_prompts_explanation[dataset_name] if exp]
|
||||
|
||||
for idx, (con, exp) in enumerate(zip(contexts, explanations)):
|
||||
passage = con["passage"] if con["passage"] is not None and not skip_passage else ""
|
||||
question = con["question"]
|
||||
options = con["options"] if con["options"] is not None else ""
|
||||
label = con["label"] if con["label"] is not None else ""
|
||||
answer = con["answer"] if "answer" in con and con["answer"] is not None else ""
|
||||
|
||||
if dataset_name in english_qa_datasets:
|
||||
question_input = (
|
||||
"Question: "
|
||||
+ passage
|
||||
+ " "
|
||||
+ question
|
||||
+ "\n"
|
||||
+ "Choose from the following options: "
|
||||
+ " ".join(options)
|
||||
+ "\n"
|
||||
+ "Answer: {}".format(label)
|
||||
)
|
||||
elif dataset_name in chinese_qa_datasets:
|
||||
question_input = (
|
||||
"问题:" + passage + " " + question + "\n" + "从以下选项中选择:" + " ".join(options) + "\n" + "答案:{}".format(label)
|
||||
)
|
||||
elif dataset_name in english_cloze_datasets:
|
||||
question_input = "Question: ".format(idx + 1) + question + "\n" + "Answer: {}".format(answer)
|
||||
elif dataset_name in chinese_cloze_datasets:
|
||||
question_input = "问题:" + question + "\n" + "答案:{}".format(answer)
|
||||
else:
|
||||
raise ValueError(f"During loading few-sot examples, found unknown dataset: {dataset_name}")
|
||||
|
||||
if chat_mode:
|
||||
demostrations.append((question_input,))
|
||||
else:
|
||||
demostrations.append(question_input + "\n")
|
||||
|
||||
return demostrations
|
||||
|
||||
|
||||
class AGIEvalDataset(BaseDataset):
|
||||
"""
|
||||
Dataset wrapper for AGIEval dataset.
|
||||
Data source: https://github.com/microsoft/AGIEval
|
||||
This dataset class will convert the original dataset into the inference dataset.
|
||||
|
||||
A few dirty data samples needed to be manually corrected in the original dataset:
|
||||
Issue link: https://github.com/microsoft/AGIEval/issues/16
|
||||
1. Invalid options in line 190 in gaokao-chemistry.jsonl.
|
||||
2. Option D (They may increase in value as those same resources become rare on Earth.) missing in line 17 in sat-en-without-passage.jsonl.
|
||||
3. Option D (They may increase in value as those same resources become rare on Earth.) missing in line 17 in sat-en.jsonl.
|
||||
4. Option D (No, because the data do not indicate whether the honeybees had been infected with mites.) missing in line 57 in sat-en-without-passage.jsonl.
|
||||
5. Option D (No, because the data do not indicate whether the honeybees had been infected with mites.) missing in line 57 in sat-en.jsonl.
|
||||
6. Option D (Published theories of scientists who developed earlier models of the Venus flytrap) missing in line 98 in sat-en-without-passage.jsonl.
|
||||
7. Option D (Published theories of scientists who developed earlier models of the Venus flytrap) missing in line 98 in sat-en.jsonl.
|
||||
8. Label is empty in line 212 in jec-qa-kd.jsonl. Content is also dirty.
|
||||
9. Actually, gaokao-mathqa.jsonl is also a multi-choice dataset. See lines 149, 286 and 287.
|
||||
"""
|
||||
|
||||
@staticmethod
|
||||
def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
|
||||
dataset = {"test": {}}
|
||||
|
||||
files = glob.glob(os.path.join(path, "*.jsonl"))
|
||||
files.sort()
|
||||
|
||||
if few_shot:
|
||||
prompt_path = os.path.join(path, "few_shot_prompts.csv")
|
||||
|
||||
for file in files:
|
||||
dataset_name = os.path.basename(file)[0 : -len(".jsonl")]
|
||||
|
||||
few_shot_data = []
|
||||
if few_shot:
|
||||
# process demo once if it is few-shot-CoT
|
||||
few_shot_data = combine_prompt(prompt_path, dataset_name, load_explanation=False, chat_mode=False)
|
||||
|
||||
dataset["test"][dataset_name] = {"data": []}
|
||||
|
||||
file_dir = os.path.join(path, file)
|
||||
|
||||
loaded_jsonl = get_json_list(file_dir)
|
||||
|
||||
# It's been tested that each data sample in one subcategory has the same inference arguments.
|
||||
_, all_classes = get_prompt(loaded_jsonl[0], dataset_name, logger)
|
||||
inference_kwargs = deepcopy(default_inference_kwargs)
|
||||
if all_classes is not None and dataset_name not in multi_choice_datasets:
|
||||
inference_kwargs["all_classes"] = all_classes
|
||||
|
||||
if dataset_name in english_qa_datasets:
|
||||
inference_kwargs["language"] = "English"
|
||||
if dataset_name in chinese_qa_datasets:
|
||||
inference_kwargs["language"] = "Chinese"
|
||||
inference_kwargs["few_shot_data"] = few_shot_data
|
||||
|
||||
dataset["test"][dataset_name]["inference_kwargs"] = inference_kwargs
|
||||
|
||||
for line in loaded_jsonl:
|
||||
info, all_classes = get_prompt(line, dataset_name, logger)
|
||||
|
||||
# Convert multi-choice answers to a single string.
|
||||
# We will convert it back when evaluating.
|
||||
# We do this because if target is a list, it should be only used for multiple target answers.
|
||||
if dataset_name in multi_choice_datasets:
|
||||
if isinstance(info["target"], str) and len(info["target"]) > 1:
|
||||
# "gaokao-mathqa" actually contain multi-choice questions.
|
||||
# This if clause is specially used for it.
|
||||
info["target"] = "".join(info["target"].split())
|
||||
else:
|
||||
info["target"] = "".join(info["target"])
|
||||
|
||||
if isinstance(info["target"], list) and len(info["target"]) == 1:
|
||||
info["target"] = info["target"][0]
|
||||
|
||||
data_sample = {
|
||||
"dataset": "agieval",
|
||||
"split": "test",
|
||||
"category": dataset_name,
|
||||
"instruction": info["instruction"],
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": info["target"],
|
||||
}
|
||||
|
||||
dataset["test"][dataset_name]["data"].append(data_sample)
|
||||
|
||||
return dataset
|
|
@ -0,0 +1,24 @@
|
|||
from abc import abstractstaticmethod
|
||||
|
||||
from colossal_eval.utils import jdump
|
||||
|
||||
|
||||
class BaseDataset:
|
||||
"""
|
||||
Base class for dataset wrapper.
|
||||
|
||||
Args:
|
||||
path: The path to the original dataset.
|
||||
logger: Logger for the dataset.
|
||||
"""
|
||||
|
||||
def __init__(self, path, logger, few_shot):
|
||||
self.dataset = self.load(path, logger, few_shot)
|
||||
|
||||
def save(self, save_path):
|
||||
"""Save the converted dataset"""
|
||||
jdump(self.dataset, save_path)
|
||||
|
||||
@abstractstaticmethod
|
||||
def load(path, logger):
|
||||
"""Load the original dataset and convert it into the inference dataset"""
|
|
@ -0,0 +1,132 @@
|
|||
import copy
|
||||
import csv
|
||||
import os
|
||||
from typing import Dict, List
|
||||
|
||||
from colossalai.logging import DistributedLogger
|
||||
|
||||
from .base import BaseDataset
|
||||
|
||||
ceval_subject_mapping = {
|
||||
"computer_network": ["Computer Network", "计算机网络", "STEM"],
|
||||
"operating_system": ["Operating System", "操作系统", "STEM"],
|
||||
"computer_architecture": ["Computer Architecture", "计算机组成", "STEM"],
|
||||
"college_programming": ["College Programming", "大学编程", "STEM"],
|
||||
"college_physics": ["College Physics", "大学物理", "STEM"],
|
||||
"college_chemistry": ["College Chemistry", "大学化学", "STEM"],
|
||||
"advanced_mathematics": ["Advanced Mathematics", "高等数学", "STEM"],
|
||||
"probability_and_statistics": ["Probability and Statistics", "概率统计", "STEM"],
|
||||
"discrete_mathematics": ["Discrete Mathematics", "离散数学", "STEM"],
|
||||
"electrical_engineer": ["Electrical Engineer", "注册电气工程师", "STEM"],
|
||||
"metrology_engineer": ["Metrology Engineer", "注册计量师", "STEM"],
|
||||
"high_school_mathematics": ["High School Mathematics", "高中数学", "STEM"],
|
||||
"high_school_physics": ["High School Physics", "高中物理", "STEM"],
|
||||
"high_school_chemistry": ["High School Chemistry", "高中化学", "STEM"],
|
||||
"high_school_biology": ["High School Biology", "高中生物", "STEM"],
|
||||
"middle_school_mathematics": ["Middle School Mathematics", "初中数学", "STEM"],
|
||||
"middle_school_biology": ["Middle School Biology", "初中生物", "STEM"],
|
||||
"middle_school_physics": ["Middle School Physics", "初中物理", "STEM"],
|
||||
"middle_school_chemistry": ["Middle School Chemistry", "初中化学", "STEM"],
|
||||
"veterinary_medicine": ["Veterinary Medicine", "兽医学", "STEM"],
|
||||
"college_economics": ["College Economics", "大学经济学", "Social Science"],
|
||||
"business_administration": ["Business Administration", "工商管理", "Social Science"],
|
||||
"marxism": ["Marxism", "马克思主义基本原理", "Social Science"],
|
||||
"mao_zedong_thought": ["Mao Zedong Thought", "毛泽东思想和中国特色社会主义理论体系概论", "Social Science"],
|
||||
"education_science": ["Education Science", "教育学", "Social Science"],
|
||||
"teacher_qualification": ["Teacher Qualification", "教师资格", "Social Science"],
|
||||
"high_school_politics": ["High School Politics", "高中政治", "Social Science"],
|
||||
"high_school_geography": ["High School Geography", "高中地理", "Social Science"],
|
||||
"middle_school_politics": ["Middle School Politics", "初中政治", "Social Science"],
|
||||
"middle_school_geography": ["Middle School Geography", "初中地理", "Social Science"],
|
||||
"modern_chinese_history": ["Modern Chinese History", "近代史纲要", "Humanities"],
|
||||
"ideological_and_moral_cultivation": ["Ideological and Moral Cultivation", "思想道德修养与法律基础", "Humanities"],
|
||||
"logic": ["Logic", "逻辑学", "Humanities"],
|
||||
"law": ["Law", "法学", "Humanities"],
|
||||
"chinese_language_and_literature": ["Chinese Language and Literature", "中国语言文学", "Humanities"],
|
||||
"art_studies": ["Art Studies", "艺术学", "Humanities"],
|
||||
"professional_tour_guide": ["Professional Tour Guide", "导游资格", "Humanities"],
|
||||
"legal_professional": ["Legal Professional", "法律职业资格", "Humanities"],
|
||||
"high_school_chinese": ["High School Chinese", "高中语文", "Humanities"],
|
||||
"high_school_history": ["High School History", "高中历史", "Humanities"],
|
||||
"middle_school_history": ["Middle School History", "初中历史", "Humanities"],
|
||||
"civil_servant": ["Civil Servant", "公务员", "Other"],
|
||||
"sports_science": ["Sports Science", "体育学", "Other"],
|
||||
"plant_protection": ["Plant Protection", "植物保护", "Other"],
|
||||
"basic_medicine": ["Basic Medicine", "基础医学", "Other"],
|
||||
"clinical_medicine": ["Clinical Medicine", "临床医学", "Other"],
|
||||
"urban_and_rural_planner": ["Urban and Rural Planner", "注册城乡规划师", "Other"],
|
||||
"accountant": ["Accountant", "注册会计师", "Other"],
|
||||
"fire_engineer": ["Fire Engineer", "注册消防工程师", "Other"],
|
||||
"environmental_impact_assessment_engineer": ["Environmental Impact Assessment Engineer", "环境影响评价工程师", "Other"],
|
||||
"tax_accountant": ["Tax Accountant", "税务师", "Other"],
|
||||
"physician": ["Physician", "医师资格", "Other"],
|
||||
}
|
||||
|
||||
default_inference_kwargs = {
|
||||
"calculate_loss": False,
|
||||
"all_classes": ["A", "B", "C", "D"],
|
||||
"language": "Chinese",
|
||||
"pretrain": False,
|
||||
"max_new_tokens": 32,
|
||||
}
|
||||
|
||||
|
||||
def get_few_shot_data(data: List[Dict]):
|
||||
few_shot_data = []
|
||||
for i in data:
|
||||
few_shot_data.append(i["input"] + i["target"])
|
||||
return few_shot_data
|
||||
|
||||
|
||||
class CEvalDataset(BaseDataset):
|
||||
"""
|
||||
Dataset class for CEval dataset.
|
||||
Data source: https://huggingface.co/datasets/ceval/ceval-exam
|
||||
This dataset class will convert the original dataset into the inference dataset.
|
||||
"""
|
||||
|
||||
@staticmethod
|
||||
def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
|
||||
dataset = {"dev": {}, "test": {}}
|
||||
for split in ["dev", "test"]:
|
||||
files = os.listdir(os.path.join(path, split))
|
||||
files.sort()
|
||||
|
||||
for file in files:
|
||||
subject = file[0 : -len(f"_{split}.csv")]
|
||||
subject = ceval_subject_mapping[subject][1]
|
||||
|
||||
file_dir = os.path.join(path, split, file)
|
||||
|
||||
dataset[split][subject] = {"data": []}
|
||||
|
||||
# It's been tested that each data sample in one subcategory has the same inference arguments.
|
||||
dataset[split][subject]["inference_kwargs"] = copy.deepcopy(default_inference_kwargs)
|
||||
|
||||
if split == "test" and few_shot:
|
||||
dataset[split][subject]["inference_kwargs"]["few_shot_data"] = get_few_shot_data(
|
||||
dataset["dev"][subject]["data"]
|
||||
)
|
||||
|
||||
with open(file_dir, encoding="utf-8") as f:
|
||||
reader = csv.reader(f)
|
||||
_ = next(reader)
|
||||
for row in reader:
|
||||
# The dev split has answer and explanation, so len(row) is 8,
|
||||
# but the test split doesn't contain answer and explanation, so len(row) is 6.
|
||||
assert len(row) >= 6
|
||||
choices = f"A. {row[2]}\nB. {row[3]}\nC. {row[4]}\nD. {row[5]}"
|
||||
data_sample = {
|
||||
"dataset": "ceval",
|
||||
"split": split,
|
||||
"category": subject,
|
||||
"instruction": f"以下是中国关于{subject}考试的单项选择题,请选出其中的正确答案。",
|
||||
"input": f"题目:{row[1]}\n{choices}\n答案:",
|
||||
"output": "",
|
||||
"target": row[6] if split == "dev" else "",
|
||||
"id": int(row[0]),
|
||||
}
|
||||
|
||||
dataset[split][subject]["data"].append(data_sample)
|
||||
|
||||
return dataset
|
|
@ -0,0 +1,144 @@
|
|||
import copy
|
||||
import csv
|
||||
import os
|
||||
from typing import Dict, List
|
||||
|
||||
from colossalai.logging import DistributedLogger
|
||||
|
||||
from .base import BaseDataset
|
||||
|
||||
cmmlu_subject_mapping = {
|
||||
"agronomy": "农学",
|
||||
"anatomy": "解剖学",
|
||||
"ancient_chinese": "古汉语",
|
||||
"arts": "艺术学",
|
||||
"astronomy": "天文学",
|
||||
"business_ethics": "商业伦理",
|
||||
"chinese_civil_service_exam": "中国公务员考试",
|
||||
"chinese_driving_rule": "中国驾驶规则",
|
||||
"chinese_food_culture": "中国饮食文化",
|
||||
"chinese_foreign_policy": "中国外交政策",
|
||||
"chinese_history": "中国历史",
|
||||
"chinese_literature": "中国文学",
|
||||
"chinese_teacher_qualification": "中国教师资格",
|
||||
"clinical_knowledge": "临床知识",
|
||||
"college_actuarial_science": "大学精算学",
|
||||
"college_education": "大学教育学",
|
||||
"college_engineering_hydrology": "大学工程水文学",
|
||||
"college_law": "大学法律",
|
||||
"college_mathematics": "大学数学",
|
||||
"college_medical_statistics": "大学医学统计",
|
||||
"college_medicine": "大学医学",
|
||||
"computer_science": "计算机科学",
|
||||
"computer_security": "计算机安全",
|
||||
"conceptual_physics": "概念物理学",
|
||||
"construction_project_management": "建设工程管理",
|
||||
"economics": "经济学",
|
||||
"education": "教育学",
|
||||
"electrical_engineering": "电气工程",
|
||||
"elementary_chinese": "小学语文",
|
||||
"elementary_commonsense": "小学常识",
|
||||
"elementary_information_and_technology": "小学信息技术",
|
||||
"elementary_mathematics": "初等数学",
|
||||
"ethnology": "民族学",
|
||||
"food_science": "食品科学",
|
||||
"genetics": "遗传学",
|
||||
"global_facts": "全球事实",
|
||||
"high_school_biology": "高中生物",
|
||||
"high_school_chemistry": "高中化学",
|
||||
"high_school_geography": "高中地理",
|
||||
"high_school_mathematics": "高中数学",
|
||||
"high_school_physics": "高中物理学",
|
||||
"high_school_politics": "高中政治",
|
||||
"human_sexuality": "人类性行为",
|
||||
"international_law": "国际法学",
|
||||
"journalism": "新闻学",
|
||||
"jurisprudence": "法理学",
|
||||
"legal_and_moral_basis": "法律与道德基础",
|
||||
"logical": "逻辑学",
|
||||
"machine_learning": "机器学习",
|
||||
"management": "管理学",
|
||||
"marketing": "市场营销",
|
||||
"marxist_theory": "马克思主义理论",
|
||||
"modern_chinese": "现代汉语",
|
||||
"nutrition": "营养学",
|
||||
"philosophy": "哲学",
|
||||
"professional_accounting": "专业会计",
|
||||
"professional_law": "专业法学",
|
||||
"professional_medicine": "专业医学",
|
||||
"professional_psychology": "专业心理学",
|
||||
"public_relations": "公共关系",
|
||||
"security_study": "安全研究",
|
||||
"sociology": "社会学",
|
||||
"sports_science": "体育学",
|
||||
"traditional_chinese_medicine": "中医中药",
|
||||
"virology": "病毒学",
|
||||
"world_history": "世界历史",
|
||||
"world_religions": "世界宗教",
|
||||
}
|
||||
|
||||
default_inference_kwargs = {
|
||||
"calculate_loss": True,
|
||||
"all_classes": ["A", "B", "C", "D"],
|
||||
"language": "Chinese",
|
||||
"pretrain": False,
|
||||
"max_new_tokens": 32,
|
||||
}
|
||||
|
||||
|
||||
def get_few_shot_data(data: List[Dict]):
|
||||
few_shot_data = []
|
||||
for i in data:
|
||||
few_shot_data.append(i["input"] + i["target"])
|
||||
return few_shot_data
|
||||
|
||||
|
||||
class CMMLUDataset(BaseDataset):
|
||||
"""
|
||||
Dataset class for CMMLU dataset.
|
||||
Data source: https://github.com/haonan-li/CMMLU/tree/master/data
|
||||
This dataset class will convert the original dataset into the inference dataset.
|
||||
"""
|
||||
|
||||
@staticmethod
|
||||
def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
|
||||
dataset = {"dev": {}, "test": {}}
|
||||
for split in ["dev", "test"]:
|
||||
files = os.listdir(os.path.join(path, split))
|
||||
files.sort()
|
||||
|
||||
for file in files:
|
||||
subject = file[0 : -len(".csv")]
|
||||
subject = cmmlu_subject_mapping[subject]
|
||||
|
||||
file_dir = os.path.join(path, split, file)
|
||||
|
||||
dataset[split][subject] = {"data": []}
|
||||
|
||||
# It's been tested that each data sample in one subcategory has the same inference arguments.
|
||||
dataset[split][subject]["inference_kwargs"] = copy.deepcopy(default_inference_kwargs)
|
||||
|
||||
if split == "test" and few_shot:
|
||||
dataset[split][subject]["inference_kwargs"]["few_shot_data"] = get_few_shot_data(
|
||||
dataset["dev"][subject]["data"]
|
||||
)
|
||||
|
||||
with open(file_dir, encoding="utf-8") as f:
|
||||
reader = csv.reader(f)
|
||||
_ = next(reader)
|
||||
for row in reader:
|
||||
assert len(row) == 7
|
||||
choices = f"A. {row[2]}\nB. {row[3]}\nC. {row[4]}\nD. {row[5]}"
|
||||
data_sample = {
|
||||
"dataset": "cmmlu",
|
||||
"split": split,
|
||||
"category": subject,
|
||||
"instruction": f"以下是关于{subject}的单项选择题,请直接给出正确答案的选项。",
|
||||
"input": f"题目:{row[1]}\n{choices}\n答案:",
|
||||
"output": "",
|
||||
"target": row[6],
|
||||
}
|
||||
|
||||
dataset[split][subject]["data"].append(data_sample)
|
||||
|
||||
return dataset
|
|
@ -0,0 +1,70 @@
|
|||
from collections import defaultdict
|
||||
from copy import deepcopy
|
||||
from typing import Dict, List
|
||||
|
||||
from colossal_eval.utils import jload
|
||||
|
||||
from colossalai.logging import DistributedLogger
|
||||
|
||||
from .base import BaseDataset
|
||||
|
||||
default_inference_kwargs = {
|
||||
"calculate_loss": False,
|
||||
"all_classes": None,
|
||||
"language": "Chinese",
|
||||
"pretrain": False,
|
||||
"max_new_tokens": 256,
|
||||
}
|
||||
|
||||
# You can add your own subcategory questions and specify whether it is a single-choice question or has target answers and need to calculate loss.
|
||||
single_choice_question = set()
|
||||
calculate_loss = set()
|
||||
|
||||
|
||||
def get_data_per_category(data):
|
||||
data_per_category = defaultdict(list)
|
||||
for item in data:
|
||||
category = item["category"]
|
||||
data_per_category[category].append(item)
|
||||
|
||||
return data_per_category
|
||||
|
||||
|
||||
class ColossalDataset(BaseDataset):
|
||||
"""
|
||||
Dataset class for Colossal dataset.
|
||||
This dataset class will convert the original dataset into the inference dataset.
|
||||
"""
|
||||
|
||||
@staticmethod
|
||||
def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
|
||||
dataset = {"test": {}}
|
||||
data = jload(path)
|
||||
data_per_category = get_data_per_category(data)
|
||||
categories = list(data_per_category.keys())
|
||||
|
||||
for category in categories:
|
||||
dataset["test"][category] = {"data": []}
|
||||
category_data = data_per_category[category]
|
||||
|
||||
dataset["test"][category]["inference_kwargs"] = deepcopy(default_inference_kwargs)
|
||||
|
||||
if category in calculate_loss:
|
||||
dataset["test"][category]["inference_kwargs"]["calculate_loss"] = True
|
||||
if category in single_choice_question:
|
||||
dataset["test"][category]["inference_kwargs"]["all_classes"] = ["A", "B", "C", "D"]
|
||||
|
||||
for item in category_data:
|
||||
data_sample = {
|
||||
"dataset": "colossal",
|
||||
"split": "test",
|
||||
"category": category,
|
||||
"instruction": item["instruction"],
|
||||
"input": item["input"],
|
||||
"output": "",
|
||||
"target": item["target"],
|
||||
"id": item["id"],
|
||||
}
|
||||
dataset["test"][category]["data"].append(data_sample)
|
||||
|
||||
return dataset
|
|
@ -0,0 +1,122 @@
|
|||
import json
|
||||
import os
|
||||
import re
|
||||
from copy import deepcopy
|
||||
from typing import Dict, List
|
||||
|
||||
from colossalai.logging import DistributedLogger
|
||||
|
||||
from .base import BaseDataset
|
||||
|
||||
multi_choice_datasets = [
|
||||
"Chinese Lang and Usage MCQs",
|
||||
"Chinese Modern Lit",
|
||||
"English Fill in Blanks",
|
||||
"English Reading Comp",
|
||||
"Geography MCQs",
|
||||
"Physics MCQs",
|
||||
"English Cloze Test",
|
||||
]
|
||||
|
||||
chinese_qa_datasets = [
|
||||
"Biology MCQs",
|
||||
"Chemistry MCQs",
|
||||
"Chinese Lang and Usage MCQs",
|
||||
"Chinese Modern Lit",
|
||||
"Geography MCQs",
|
||||
"History MCQs",
|
||||
"Math I MCQs",
|
||||
"Math II MCQs",
|
||||
"Physics MCQs",
|
||||
"Political Science MCQs",
|
||||
]
|
||||
english_qa_datasets = ["English MCQs", "English Fill in Blanks", "English Reading Comp", "English Cloze Test"]
|
||||
|
||||
default_inference_kwargs = {
|
||||
"calculate_loss": True,
|
||||
"all_classes": None,
|
||||
"language": "Chinese",
|
||||
"pretrain": False,
|
||||
"max_new_tokens": 32,
|
||||
}
|
||||
|
||||
|
||||
def get_all_classes(instruction: str):
|
||||
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
|
||||
pattern = r"([A-Z]\. |[A-Z].|[A-Z]\.)"
|
||||
options = sorted(list(set(re.findall(pattern, instruction))))
|
||||
options = sorted(list(set([string[0] for string in options])))
|
||||
|
||||
for i in range(len(options)):
|
||||
if options[i] == letters[i]:
|
||||
continue
|
||||
else:
|
||||
return options[0:i]
|
||||
return options
|
||||
|
||||
|
||||
class GaoKaoBenchDataset(BaseDataset):
|
||||
"""
|
||||
Dataset class for GAOKAO-Bench dataset.
|
||||
Data source: https://github.com/OpenLMLab/GAOKAO-Bench/tree/main/data
|
||||
This dataset class will convert the original dataset into the inference dataset.
|
||||
|
||||
A few typos needed to be manually corrected in the original dataset; some of the following have been fixed.
|
||||
Issue link: https://github.com/OpenLMLab/GAOKAO-Bench/issues/20
|
||||
1. Option C missing in index 111 in 2010-2022_Chemistry_MCQs.json
|
||||
2. Option B missing "." after it in index 16 in 2012-2022_English_Cloze_Test.json
|
||||
3. Option G missing "." after it in index 23 in 2012-2022_English_Cloze_Test.json
|
||||
"""
|
||||
|
||||
@staticmethod
|
||||
def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
|
||||
dataset = {"test": {}}
|
||||
for category in ["Fill-in-the-blank_Questions", "Multiple-choice_Questions", "Open-ended_Questions"]:
|
||||
files = os.listdir(os.path.join(path, "data", category))
|
||||
files.sort()
|
||||
|
||||
for file in files:
|
||||
subject = file[10:-5].split("_")
|
||||
subject = " ".join(subject)
|
||||
dataset["test"][subject] = {"data": []}
|
||||
|
||||
file_dir = os.path.join(path, "data", category, file)
|
||||
|
||||
with open(file_dir, encoding="utf-8") as f:
|
||||
data = json.load(f)
|
||||
|
||||
# It's been tested that each data sample in one subcategory has the same inference arguments.
|
||||
inference_kwargs = deepcopy(default_inference_kwargs)
|
||||
if category == "Multiple-choice_Questions" and subject not in multi_choice_datasets:
|
||||
all_classes = get_all_classes(data["example"][0]["question"])
|
||||
inference_kwargs["all_classes"] = all_classes
|
||||
if subject in english_qa_datasets:
|
||||
inference_kwargs["language"] = "English"
|
||||
if subject in chinese_qa_datasets:
|
||||
inference_kwargs["language"] = "Chinese"
|
||||
|
||||
dataset["test"][subject]["inference_kwargs"] = inference_kwargs
|
||||
|
||||
for sample in data["example"]:
|
||||
# Convert multi-choice answers to a single string.
|
||||
# We will convert it back when evaluating.
|
||||
# We do this because if target is a list, it should be only used for multiple target answers.
|
||||
if subject in multi_choice_datasets:
|
||||
sample["answer"] = "".join(sample["answer"])
|
||||
|
||||
if isinstance(sample["answer"], list) and len(sample["answer"]) == 1:
|
||||
sample["answer"] = sample["answer"][0]
|
||||
|
||||
data_sample = {
|
||||
"dataset": "gaokaobench",
|
||||
"split": "test",
|
||||
"category": f"{category[:-10]}-{subject}",
|
||||
"instruction": sample["question"].strip() + "\n答案:",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": sample["answer"],
|
||||
}
|
||||
|
||||
dataset["test"][subject]["data"].append(data_sample)
|
||||
|
||||
return dataset
|
|
@ -0,0 +1,120 @@
|
|||
import os
|
||||
from copy import deepcopy
|
||||
from typing import Dict, List
|
||||
|
||||
from colossal_eval.utils import get_json_list
|
||||
|
||||
from colossalai.logging import DistributedLogger
|
||||
|
||||
from .base import BaseDataset
|
||||
|
||||
dataset2prompt = {
|
||||
"narrativeqa": "You are given a story, which can be either a novel or a movie script, and a question. Answer the question asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nStory: {context}\n\nNow, answer the question based on the story asconcisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:",
|
||||
"qasper": 'You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nArticle: {context}\n\n Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:',
|
||||
"multifieldqa_en": "Read the following text and answer briefly.\n\n{context}\n\nNow, answer the following question based on the above text, only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
|
||||
"multifieldqa_zh": "阅读以下文字并用中文简短回答:\n\n{context}\n\n现在请基于上面的文章回答下面的问题,只告诉我答案,不要输出任何其他字词。\n\n问题:{input}\n回答:",
|
||||
"hotpotqa": "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{context}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
|
||||
"2wikimqa": "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{context}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
|
||||
"musique": "Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{context}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\n\nQuestion: {input}\nAnswer:",
|
||||
"dureader": "请基于给定的文章回答下述问题。\n\n文章:{context}\n\n请基于上述文章回答下面的问题。\n\n问题:{input}\n回答:",
|
||||
"gov_report": "You are given a report by a government agency. Write a one-page summary of the report.\n\nReport:\n{context}\n\nNow, write a one-page summary of the report.\n\nSummary:",
|
||||
"qmsum": "You are given a meeting transcript and a query containing a question or instruction. Answer the query in one or more sentences.\n\nTranscript:\n{context}\n\nNow, answer the query based on the above meeting transcript in one or more sentences.\n\nQuery: {input}\nAnswer:",
|
||||
"multi_news": "You are given several news passages. Write a one-page summary of all news. \n\nNews:\n{context}\n\nNow, write a one-page summary of all the news.\n\nSummary:",
|
||||
"vcsum": "下面有一段会议记录,请你阅读后,写一段总结,总结会议的内容。\n会议记录:\n{context}\n\n会议总结:",
|
||||
"trec": "Please determine the type of the question below. Here are some examples of questions.\n\n{context}\n{input}",
|
||||
"triviaqa": "Answer the question based on the given passage. Only give me the answer and do not output any other words. The following are some examples.\n\n{context}\n\n{input}",
|
||||
"samsum": "Summarize the dialogue into a few short sentences. The following are some examples.\n\n{context}\n\n{input}",
|
||||
"lsht": "请判断给定新闻的类别,下面是一些例子。\n\n{context}\n{input}",
|
||||
"passage_count": "There are some paragraphs below sourced from Wikipedia. Some of them may be duplicates. Please carefully read these paragraphs and determine how many unique paragraphs there are after removing duplicates. In other words, how many non-repeating paragraphs are there in total?\n\n{context}\n\nPlease enter the final count of unique paragraphs after removing duplicates. The output format should only contain the number, such as 1, 2, 3, and so on.\n\nThe final answer is: ",
|
||||
"passage_retrieval_en": 'Here are 30 paragraphs from Wikipedia, along with an abstract. Please determine which paragraph the abstract is from.\n\n{context}\n\nThe following is an abstract.\n\n{input}\n\nPlease enter the number of the paragraph that the abstract is from. The answer format must be like "Paragraph 1", "Paragraph 2", etc.\n\nThe answer is: ',
|
||||
"passage_retrieval_zh": '以下是若干段落文字,以及其中一个段落的摘要。请确定给定的摘要出自哪一段。\n\n{context}\n\n下面是一个摘要\n\n{input}\n\n请输入摘要所属段落的编号。答案格式必须是"段落1","段落2"等格式\n\n答案是:',
|
||||
"lcc": "Please complete the code given below. \n{context}Next line of code:\n",
|
||||
"repobench-p": "Please complete the code given below. \n{context}{input}Next line of code:\n",
|
||||
}
|
||||
|
||||
dataset2maxlen = {
|
||||
"narrativeqa": 128,
|
||||
"qasper": 128,
|
||||
"multifieldqa_en": 64,
|
||||
"multifieldqa_zh": 64,
|
||||
"hotpotqa": 32,
|
||||
"2wikimqa": 32,
|
||||
"musique": 32,
|
||||
"dureader": 128,
|
||||
"gov_report": 512,
|
||||
"qmsum": 512,
|
||||
"multi_news": 512,
|
||||
"vcsum": 512,
|
||||
"trec": 64,
|
||||
"triviaqa": 32,
|
||||
"samsum": 128,
|
||||
"lsht": 64,
|
||||
"passage_count": 32,
|
||||
"passage_retrieval_en": 32,
|
||||
"passage_retrieval_zh": 32,
|
||||
"lcc": 64,
|
||||
"repobench-p": 64,
|
||||
}
|
||||
|
||||
default_inference_kwargs = {
|
||||
"calculate_loss": True,
|
||||
"all_classes": None,
|
||||
"language": "Chinese",
|
||||
"pretrain": False,
|
||||
"max_new_tokens": 32,
|
||||
}
|
||||
|
||||
|
||||
class LongBenchDataset(BaseDataset):
|
||||
"""
|
||||
Dataset class for LongBench dataset.
|
||||
Data source: https://huggingface.co/datasets/THUDM/LongBench
|
||||
This dataset class will convert the original dataset into the inference dataset.
|
||||
|
||||
Issue link: https://github.com/THUDM/LongBench/issues/15 (fixed)
|
||||
There are duplicate target answers in `nq.jsonl`, but this doesn't affect evaluation results.
|
||||
It also doesn't affect the perplexity calculation (the program only needs to select the minimum loss).
|
||||
"""
|
||||
|
||||
@staticmethod
|
||||
def load(path: str, logger: DistributedLogger) -> List[Dict]:
|
||||
dataset = {"test": {}}
|
||||
|
||||
files = os.listdir(path)
|
||||
files.sort()
|
||||
|
||||
for file in files:
|
||||
category = file[0:-6]
|
||||
|
||||
if category.endswith("_e"):
|
||||
continue
|
||||
|
||||
dataset["test"][category] = {"data": []}
|
||||
|
||||
file_dir = os.path.join(path, file)
|
||||
|
||||
loaded_jsonl = get_json_list(file_dir)
|
||||
|
||||
# It's been tested that each data sample in one subcategory has the same inference arguments.
|
||||
inference_kwargs = deepcopy(default_inference_kwargs)
|
||||
if loaded_jsonl[0]["all_classes"] is not None:
|
||||
inference_kwargs["all_classes"] = loaded_jsonl[0]["all_classes"]
|
||||
inference_kwargs["max_new_tokens"] = dataset2maxlen[category]
|
||||
dataset["test"][category]["inference_kwargs"] = inference_kwargs
|
||||
|
||||
for sample in loaded_jsonl:
|
||||
prompt = dataset2prompt[category].format(**sample)
|
||||
|
||||
data_sample = {
|
||||
"dataset": "longbench",
|
||||
"split": "test",
|
||||
"category": category,
|
||||
"instruction": prompt,
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": sample["answers"],
|
||||
}
|
||||
|
||||
dataset["test"][category]["data"].append(data_sample)
|
||||
|
||||
return dataset
|
|
@ -0,0 +1,73 @@
|
|||
import copy
|
||||
import csv
|
||||
import os
|
||||
from typing import Dict, List
|
||||
|
||||
from colossalai.logging import DistributedLogger
|
||||
|
||||
from .base import BaseDataset
|
||||
|
||||
default_inference_kwargs = {
|
||||
"calculate_loss": True,
|
||||
"all_classes": ["A", "B", "C", "D"],
|
||||
"language": "English",
|
||||
"pretrain": False,
|
||||
"max_new_tokens": 32,
|
||||
}
|
||||
|
||||
|
||||
def get_few_shot_data(data: List[Dict]):
|
||||
few_shot_data = []
|
||||
for i in data:
|
||||
few_shot_data.append(i["input"] + i["target"])
|
||||
return few_shot_data
|
||||
|
||||
|
||||
class MMLUDataset(BaseDataset):
|
||||
"""
|
||||
Dataset class for MMLU dataset.
|
||||
Data source: https://github.com/hendrycks/test
|
||||
This dataset class will convert the original dataset into the inference dataset.
|
||||
"""
|
||||
|
||||
@staticmethod
|
||||
def load(path: str, logger: DistributedLogger, few_shot: bool) -> List[Dict]:
|
||||
dataset = {"dev": {}, "test": {}}
|
||||
for split in ["dev", "test"]:
|
||||
files = os.listdir(os.path.join(path, split))
|
||||
files.sort()
|
||||
|
||||
for file in files:
|
||||
subject = file[0 : -len(f"_{split}.csv")].split("_")
|
||||
subject = " ".join([word.title() if word != "us" else "US" for word in subject])
|
||||
|
||||
file_dir = os.path.join(path, split, file)
|
||||
|
||||
dataset[split][subject] = {"data": [], "inference_kwargs": {}}
|
||||
|
||||
# It's been tested that each data sample in one subcategory has the same inference arguments.
|
||||
dataset[split][subject]["inference_kwargs"] = copy.deepcopy(default_inference_kwargs)
|
||||
|
||||
if split == "test" and few_shot:
|
||||
dataset[split][subject]["inference_kwargs"]["few_shot_data"] = get_few_shot_data(
|
||||
dataset["dev"][subject]["data"]
|
||||
)
|
||||
|
||||
with open(file_dir, encoding="utf-8") as f:
|
||||
reader = csv.reader(f)
|
||||
for row in reader:
|
||||
assert len(row) == 6
|
||||
choices = f"A. {row[1]}\nB. {row[2]}\nC. {row[3]}\nD. {row[4]}"
|
||||
data_sample = {
|
||||
"dataset": "mmlu",
|
||||
"split": split,
|
||||
"category": subject,
|
||||
"instruction": f"The following is a single-choice question on {subject}. Answer the question by replying A, B, C or D.",
|
||||
"input": f"Question: {row[0]}\n{choices}\nAnswer: ",
|
||||
"output": "",
|
||||
"target": row[5],
|
||||
}
|
||||
|
||||
dataset[split][subject]["data"].append(data_sample)
|
||||
|
||||
return dataset
|
|
@ -0,0 +1,248 @@
|
|||
# GPT Evaluation
|
||||
## Table of Contents
|
||||
- [Overview](#overview)
|
||||
- [GPT Evaluation](#gpt-evaluation)
|
||||
- [Evaluation Category](#evaluation-category)
|
||||
- [Evaluation Category Examples](#evaluation-category-examples)
|
||||
- [Evaluation Metrics](#evaluation-metrics)
|
||||
- [Evaluation Process](#evaluation-process)
|
||||
- [Data Format](#data-format)
|
||||
- [Prompt](#prompt)
|
||||
- [Battle Prompt](#battle-prompt)
|
||||
- [Evaluation Prompt](#evaluation-prompt)
|
||||
- [Evaluation](#evaluation)
|
||||
- [Configuration](#configuration)
|
||||
- [Evaluate](#evaluate)
|
||||
- [FAQ](#faq)
|
||||
- [Citations](#citations)
|
||||
|
||||
|
||||
## Overview
|
||||
|
||||
In this directory, we introduce how you can evaluate your model using GPT models. Evaluation is available for both Chinese and English capability, and we provide the following functions:
|
||||
|
||||
* Compare the performance of two different models (battle).
|
||||
* Rate the model according to pre-defined metrics using prompting design.
|
||||
* Rate the model according to pre-defined metrics with additional reference answer using prompting design.
|
||||
|
||||
## GPT Evaluation
|
||||
|
||||
### Evaluation Category
|
||||
|
||||
Our evaluation pipeline can examine the model's capability using different categories of questions. The following table includes some example categories. You can add your own questions.
|
||||
|
||||
| Evaluation Category | Description |
|
||||
| :-----------------: | :----------------------------------------------------------- |
|
||||
| Brainstorming | Models are asked to generate a range of creative and diverse ideas according to the question. The capability of creativity is required. |
|
||||
| Chat | Models are asked to continue a multi-round dialogue given the roles involved. The capability of understanding, memorizing previous rounds of the dialogue and answering according to the persona provided is required. |
|
||||
| Generation | Models are asked to generate an email, letter, article, etc. The capability of generating texts in a high quality and human-written way is required. |
|
||||
| Open QA | Models are asked to answer an open QA question(without context provided). The capability of answering questions with the models' own knowledge base is required. |
|
||||
| Roleplay | Models are asked to play the role provided. The capability of engaging in the scenario and effectively interacting with the user is required. |
|
||||
|
||||
|
||||
### Evaluation Category Examples
|
||||
To better understand each evaluation category, some example questions are provided below. The example questions can be found in the `configs/gpt_evaluation/data` folder.
|
||||
|
||||
|
||||
| Evaluation Category | Chinese Example | English Example |
|
||||
| :-----------------: | :----------------------------------------------------------- | :----------------------------------------------------------- |
|
||||
| Brainstorming | 列举一些可以促进头发生长的食物。 | How do you properly chop an onion without crying? |
|
||||
| Chat | 基于以下角色信息完成一段对话。小张是一名新手爱好者,对养鸡有浓厚的兴趣。老李是一名有丰富经验的养鸡大师。<br/>小张:您好,老李,我最近开始对养鸡感兴趣了,想请教您一些问题。 <br/>老李:你好,小张,我很乐意帮助你。你想问些什么? <br/>小张:我想知道如何确定鸡的品种和性别? <br/>老李:确切的品种可以通过鸡的外貌特征来确定,而性别一般是通过鸡卵的大小和形状来判断。还有什么问题吗?<br/> 小张:<br/> | Complete a dialogue based on the following character information. Alex: A novice writer who is struggling to find inspiration and develop his writing skills. Emma: A successful author with many published works, providing guidance and advice to Alex.<br/>Alex: Hi Emma, I have been writing for a while now but can't seem to make any progress. Can you give me any advice? <br/>Emma: Hi Alex, sure. What kind of writing are you doing?<br/>Alex: I'm trying to write a novel, but I just can't seem to find any inspiration.<br/>Emma: <br/> |
|
||||
| Generation | 请为一家咖啡店编写一篇简短的广告语,吸引更多的顾客。 | Write a set of guidelines for first-time pet owners on how to properly care for a new puppy. |
|
||||
| Open QA | 解释什么是RNA病毒和DNA病毒。 | Explain the process of osmosis in biological systems. |
|
||||
| Roleplay | 我要你把我写的句子翻译成表情符号。我会写句子,你会用表情符号表达它。我只是想让你用表情符号来表达它。除了表情符号,我不希望你回复任何内容。当我需要用中文告诉你一些事情时,我会用 {} 这样的大括号括起来。我的第一句话是“{我的职业是消防员。}” | I want you to act as a rapper. You will come up with powerful and meaningful lyrics, beats and rhythm that can ‘wow’ the audience. Your lyrics should have an intriguing meaning and message which people can relate too. When it comes to choosing your beat, make sure it is catchy yet relevant to your words, so that when combined they make an explosion of sound everytime! My first request is "I need a rap song about finding strength within yourself." |
|
||||
|
||||
### Evaluation Metrics
|
||||
|
||||
GPT evaluation uses GPT models to evaluate the predictions of different models, and different pre-defined evaluation metrics are applied to different categories. The following table shows the 10 pre-defined evaluation metrics in both Chinese and English:
|
||||
|
||||
| Evaluation Metric | Prompt Words | CoT(Chain-of-Thought) |
|
||||
| :-------------------: | :----------------------------------------------------------- | :----------------------------------------------------------- |
|
||||
| 语言组织<br/>(Language organization) | 语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。</br></br>Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc. | 1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。<br/> 2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说<br/> 3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。<br/> 4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。<br/> 5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。<br/> 6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。</br></br>1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.<br>2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.<br>3. Determine if the answer is relevant to the question or topic and conveys a clear message.<br>4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.<br>5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.<br>6. Evaluate the linguistic organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good linguistic organization and 1 indicates very poor linguistic organization. |
|
||||
| 切题<br/>(Relevance) | 切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。</br></br>Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic. | 1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。<br/> 2. 阅读答案,确认答案是否直接回答了题目所问的问题。<br/> 3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。<br/> 4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。</br></br>1. Read the question to determine what the question asks and what aspects of the question need to be answered.<br>2. Read the answers to make sure that they directly answer the question asked.<br>3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.<br>4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all. |
|
||||
| 创意性<br/>(Creativity) | 创意性(1-5):某些头脑风暴问题可能需要答案具有创意,提出新的思路。</br></br>Creativity (1-5): Some brainstorming questions may require answers that are creative and suggest new ideas. | 1. 仔细阅读所提供的头脑风暴问题,确保你理解问题的要点和背景。<br/> 2. 根据你的知识和经验,判断所提供的答案是否可行。如果答案不可行,则创意性评分可能会受到影响。<br/> 3. 考虑答案中是否包含新颖的想法或独特的思路。答案可能与已知的解决方案有所重叠,但仍然可以被认为是有创意的,只要它提供了新的角度或方法来解决问题。<br/> 4. 根据答案的创意性,给出一个1到5的评分。如果答案缺乏创意,则应给出一个较低的评分。如果答案具有创意并提供了新的思路,应给出一个较高的评分。</br></br>1. Read the provided brainstorming questions carefully to make sure you understand the gist and context of the questions.<br>2. Based on your knowledge and experience, determine if the answers provided are feasible. If the answer is not feasible, the creativity score may be affected.<br>3. Consider whether the answer contains novel ideas or unique thoughts. An answer may overlap with a known solution and still be considered creative, as long as it offers a new perspective or approach to the problem.<br>4. Give a score of 1 to 5 depending on the creativity of the answer. If the answer lacks creativity, a lower score should be given. If the answer is creative and provides a new idea, a higher score should be given. |
|
||||
| 实用性<br/>(Practicality) | 实用性(1-5):某些头脑风暴问题可能需要答案提出实用的建议或解决方法。</br></br>Practicality (1-5): Some brainstorming questions may require answers to suggest practical suggestions or solutions. | 1. 仔细阅读所提供的头脑风暴问题,确保你理解问题的要点和背景。<br/> 2. 根据你的知识和经验,判断所提供的答案是否可行。如果答案不可行,则实用性评分可能会受到影响。<br/> 3. 考虑答案中提出的建议或解决方法是否实用并可行。答案可能看起来很好,但如果无法实现或应用,则实用性评分可能会受到影响。<br/> 4. 根据答案的实用性,给出一个1到5的评分。如果答案缺乏实用性,则应给出一个较低的评分。如果答案提出了实用的建议或解决方法,并且可以很好地解决问题,则应给出一个较高的评分。</br></br>1. Read the provided brainstorming questions carefully to make sure you understand the gist and context of the questions.<br>2. Based on your knowledge and experience, determine if the answers provided are feasible. If the answer is not feasible, the practicality score may be affected.<br>3. Consider whether the suggestions or solutions presented in the answer are practical and workable. The answer may look good, but if it cannot be implemented or applied, the practicality score may be affected.<br>4. Give a score of 1 to 5 depending on the practicality of the answer. If the answer lacks practicality, a lower score should be given. If the answer makes a practical suggestion or solution and solves the problem well, a higher score should be given. |
|
||||
| 正确性<br/>(Correctness) | 正确性(1-5):正确性(1-5):答案是否正确。</br></br> Correctness (1-5): whether the answer is correct or not. | 1. 仔细阅读题目,尝试自己回答该问题。<br/>2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的,则可以将正确性得分为5分。如果答案是部分正确的,则可以给予适当的得分,例如2分、3分或4分。如果答案完全不正确,则只得1分。<br/><br/>1. Read the question carefully and try to answer the question yourself. <br/>2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be given. If the answer is completely incorrect, only 1 point is awarded. |
|
||||
| 自然<br/>(Naturalness) | 自然(1-5):答案是否自然,并且符合问题给定的身份。</br></br>Naturalness (1-5): whether the answer is natural and fits the identity given by the question. | 1. 阅读题目,确定题目提供的身份信息。<br/> 2. 检查答案内容是否符合题目给定的身份。<br/> 3. 根据以上因素,对该回答的自然性进行打分,分数从1到5,其中1表示不自然,5表示非常自然,并符合问题给定的身份。</br></br>1. Read the question and determine the identity information provided in the question.<br>2. Check whether the content of the answer matches the identity given in the question.<br>3. Based on the above factors, score the naturalness of the response on a scale from 1 to 5, where 1 means unnatural and 5 means very natural and in accordance with the identity given in the question. |
|
||||
| 参与感<br/>(Engagingness) | 参与感(1-5):答案是否对前面的对话内容做出了恰当的反应,是否理解对话的语境和背景。</br></br>Engagingness (1-5): whether the answer responds appropriately to the content of the preceding conversation and whether it understands the context and background of the conversation. | 1. 阅读题目,确定对话的语境和背景。<br/> 2. 检查答案是否充分理解对话的语境和背景,能否自然地融入到对话中而不显得突兀。<br/> 3. 根据以上因素,对该回答的参与感进行打分,分数从1到5,其中1表示没有参与感,5表示非常有参与感,并且恰当地理解了对话的语境和背景。</br></br>1. Read the questions to determine the context and background of the dialogue.<br>2. Check that the answer fully understands the context and background of the conversation and that it fits naturally into the conversation without seeming abrupt.<br>3. Based on the above factors, rate the response's engagement on a scale from 1 to 5, where 1 means not engaged and 5 means very engaged and appropriately understands the context and background of the conversation. |
|
||||
| 合理性<br/>(Reasonableness) | 合理性(1-5):答案是否能够与前面的对话内容形成逻辑上的衔接,是否符合常理,能否在这个上下文中合理存在。</br></br>Reasonableness (1-5): Whether the answer can form a logical connection with the content of the previous dialogue, whether it is consistent with common sense, and whether it can reasonably exist in this context. | 1. 阅读题目,确定对话的主题以及问题期望的回答方向。<br/> 2. 判断答案是否能够与前面的对话内容形成逻辑上的衔接,是否符合常理,能否在这个上下文中合理存在。<br/> 3. 根据以上因素,对该回答的合理性进行打分,分数从1到5,其中1表示不合理,5表示非常合理,并且能够与前面的对话内容形成逻辑上的衔接,并符合常理。</br></br>1. Read the question and determine the topic of the conversation and the direction the question expects the answer to go.<br>2. Determine whether the answer can be logically connected to the preceding conversation, whether it makes common sense, and whether it can reasonably exist in this context.<br>3. Based on the above factors, rate the reasonableness of the answer on a scale from 1 to 5, where 1 means unreasonable and 5 means very reasonable and able to form a logical connection with the preceding dialogue content and consistent with common sense. |
|
||||
| 多样性<br/>(Diversity) | 多样性(1-5):答案使用语言是否优美,具有有一定的创造性和想象力。然而,回答也应该保持合理和适度,不要过于夸张或离题。</br></br>Diversity (1-5): Whether the answers use beautiful language and have some creativity and imagination. However, answers should also be kept reasonable and moderate, not overly exaggerated or off-topic. | 1. 仔细阅读整个回答,确保完全理解回答所表达的内容和主题。<br/> 2. 在阅读回答的同时,注意语言的质量,例如措辞是否正确,语言是否生动等。<br/> 3. 检查回答的创造性和想象力,看看回答是否能够吸引人阅读下去。<br/> 4. 检查回答的合理性和适度,看看回答是否夸张或离题。5. 将多样性的评分打分在1到5之间,5分表示回答的质量很好,能够吸引人阅读,1分表示回答的内容生硬或者有离题的问题。</br></br>1. Read the entire response carefully to ensure that you fully understand the content and theme expressed in the response.<br>2. While reading the response, pay attention to the quality of the language, such as whether the wording is correct and the language is vivid.<br>3. Check the creativity and imagination of the response to see if the response is engaging to read on.<br>4. Check the reasonableness and appropriateness of the responses to see if the responses are exaggerated or off-topic.<br>5. Rate the diversity on a scale of 1 to 5, with a 5 indicating a good quality response that is engaging to read and a 1 indicating a raw response or a question that is off-topic. |
|
||||
| 保真度<br/>(Fidelity) | 保真度(1-5):答案是否能够严格遵守角色的设定回答给定的请求。</br></br>Fidelity (1-5): whether the answer is able to answer the given request in strict compliance with the role setting. | 1. 仔细阅读问题,了解角色在问题中的设定和表现,包括职业、背景、观点、性格等方面。<br/> 阅读题目的请求,确认回答请求时需要注意的细节。<br/> 3. 对比提供的回答与该角色的设定,评估回答是否能够严格遵守角色的设定。<br/> 4. 结合以上评估结果给出保真度的评分,范围从1到5分,其中1分表示回答与角色设定完全不符,5分表示回答完全符合角色设定且满足给定请求。</br></br>1. Read the question carefully to understand how the character is set up and represented in the question, including aspects such as occupation, background, point of view, and personality.<br>2. Read the question's request and confirm the details that need to be taken into account when answering the request.<br>3. Compare the provided answer with the setting of the role and assess whether the answer can strictly adhere to the setting of the role.<br>4. Combine the results of the above assessment to give a fidelity score ranging from 1 to 5, where a score of 1 means that the response does not match the persona at all, and a score of 5 means that the response fully complies with the persona and satisfies the given request. |
|
||||
|
||||
GPT models evaluate the quality of model predictions based on the given prompt words and give a score between 1 and 5.
|
||||
|
||||
> **NOTE 1:** You can find all the prompt words and CoT(Chain-of-Thought) in `configs/gpt_evaluation/prompt/evaluation_prompt`.
|
||||
|
||||
> **NOTE 2:** To add customized metrics, you can refer to [FAQ](#faq).
|
||||
|
||||
## Evaluation Process
|
||||
|
||||
### Data Format
|
||||
|
||||
A JSON file contains one list. Each element in the list is a target answer / prediction record for one instruction / question.
|
||||
An element should have the following fields:
|
||||
|
||||
* `category` (str, compulsory): The category of the instruction / question.
|
||||
* `instruction` (str, compulsory): The instruction / question for the LLM.
|
||||
* `input` (str, optional): The additional context of the instruction / question.
|
||||
* `output` (str, optional): The model output for the instruction; models will fill in this field during inference.
|
||||
* `target` (str, optional): The target answer for the instruction.
|
||||
* `id` (int, compulsory): The ID of the instruction / question.
|
||||
|
||||
Example:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"category": "brainstorming",
|
||||
"instruction": "请问如何制作一份美味的西红柿炒鸡蛋?",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 1
|
||||
},
|
||||
{
|
||||
"category": "chat",
|
||||
"instruction": "基于以下角色信息完成一段对话。小张是一名新手爱好者,对养鸡有浓厚的兴趣。老李是一名有丰富经验的养鸡大师。",
|
||||
"input": "小张:您好,老李,我最近开始对养鸡感兴趣了,想请教您一些问题。 老李:你好,小张,我很乐意帮助你。你想问些什么? 小张:我想知道如何确定鸡的品种和性别? 老李:确切的品种可以通过鸡的外貌特征来确定,而性别一般是通过鸡卵的大小和形状来判断。还有什么问题吗? 小张:",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 2
|
||||
}
|
||||
]
|
||||
```
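
As a quick sanity check before running the pipeline, the following minimal sketch loads such a file and verifies the compulsory fields. The helper and the file path are illustrative placeholders, not part of ColossalEval.

```python
import json

# Compulsory fields described above; the optional ones may be empty strings.
REQUIRED_FIELDS = {"category", "instruction", "id"}


def load_answer_file(path: str) -> list:
    """Load a target/answer JSON file and check every record for the compulsory fields."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    assert isinstance(records, list), "The file must contain one list of records."
    for record in records:
        missing = REQUIRED_FIELDS - record.keys()
        assert not missing, f"Record {record.get('id')} is missing fields: {missing}"
    return records


# Usage (the path is a placeholder):
# records = load_answer_file("target_answers_cn.json")
```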
|
||||
|
||||
### Prompt
|
||||
|
||||
#### Battle Prompt
|
||||
|
||||
The following is the Chinese battle prompt. In the battle prompt, the question and answers from two different models are fed into the prompt template. You can find example battle prompt files for Chinese and English in `configs/gpt_evaluation/prompt/battle_prompt`.
|
||||
|
||||
```json
|
||||
{
|
||||
"id": 1,
|
||||
"system_prompt": "你是一个检查回答质量的好助手。",
|
||||
"prompt_template": "[问题]\n{question}\n\n[1号AI助手的答案]\n{answer_1}\n\n[1号AI助手答案终止]\n\n[2号AI助手的答 案]\n{answer_2}\n\n[2号AI助手答案终止]\n\n[要求]\n{prompt}\n\n",
|
||||
"prompt": "我们需要你评价这两个AI助手回答的性能。\n请对他们的回答的有用性、相关性、准确性、详细程度进行评分。每个AI助手都会得到一个1到10分的总分,分数越高表示整体表现越好。\n请首先输出一行,该行只包含两个数值,分别表示1号和2号AI助手的分数。这两个分数之间要有一个空格。在随后的一行中,请对你的评价作出全面的解释,避免任何潜在的偏见,并确保AI助手回答的顺序不会影响您的判断。"
|
||||
}
|
||||
```
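
To illustrate how these fields are used, the sketch below fills the template with one question and two model answers. The file name and the exact wiring are assumptions for illustration; the real logic lives in the evaluation scripts.

```python
import json

# Placeholder path to a battle prompt file like the one shown above.
with open("battle_prompt_cn.json", encoding="utf-8") as f:
    battle_prompt = json.load(f)

question = "列举一些可以促进头发生长的食物。"
answer_1 = "鸡蛋、三文鱼、菠菜等富含蛋白质和维生素的食物。"  # prediction of model 1
answer_2 = "多喝水即可。"  # prediction of model 2

# The system prompt is sent as the system message; the filled template as the user message.
user_message = battle_prompt["prompt_template"].format(
    question=question, answer_1=answer_1, answer_2=answer_2, prompt=battle_prompt["prompt"]
)
```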
|
||||
|
||||
#### Evaluation Prompt
|
||||
|
||||
The following is an example of a Chinese GPT evaluation prompt. In an evaluation prompt, you should define your metrics in `metrics` and provide CoT(Chain-of-Thought) in `CoT`. You can find example evaluation prompt files for Chinese and English in `configs/gpt_evaluation/prompt/evaluation_prompt`.
|
||||
|
||||
```json
|
||||
{
|
||||
"brainstorming": {
|
||||
"id": 1,
|
||||
"category": "brainstorming",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面“头脑风暴”问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
`"metrics"`: the metrics that can be used in GPT evaluation. This field determines which metrics can be added to your config file.
|
||||
|
||||
`"CoT"`: evaluation steps you prompt to GPT models for each metric defined in `"metrics"`.
|
||||
|
||||
### Evaluation
|
||||
|
||||
#### Configuration
|
||||
|
||||
The following is an example of a Chinese config file. The configuration file can control how the pipeline evaluates the model. You need to specify GPT evaluation metrics in key `GPT`. You can find an example English config file in `configs/gpt_evaluation/config/config_en.json`.
|
||||
|
||||
```json
|
||||
{
|
||||
"language": "cn",
|
||||
"category": {
|
||||
"brainstorming": {
|
||||
"GPT": [
|
||||
"language organization",
|
||||
"relevance",
|
||||
"creativity",
|
||||
"practicality",
|
||||
"reasonableness"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
`"language"`: the language used to evaluate the model capability. We only support Chinese `"cn"` for now.
|
||||
|
||||
`"category"`: the category/categories needed to evaluate the model capability.
|
||||
|
||||
`"GPT"`: the metrics you want to use for GPT evaluation.
|
||||
|
||||
|
||||
#### Evaluate
|
||||
|
||||
After setting the configuration file, you can evaluate the model using `examples/gpt_evaluation/eval.py`. If you want to make comparisons between answers of two different models, you should specify two answer files in the argument `answer_file_list` and two model names in the argument `model_name_list`. If you want to evaluate one answer file, the length of both `answer_file_list` and `model_name_list` should be 1 and the program will perform evaluation using automatic metrics and GPT models.
|
||||
|
||||
An example script is provided as follows:
|
||||
|
||||
```shell
|
||||
python eval.py \
|
||||
--config_file "path to the config file" \
|
||||
--battle_prompt_file "path to the prompt file for battle" \
|
||||
--gpt_evaluation_prompt_file "path to the prompt file for gpt evaluation" \
|
||||
--target_file "path to the target answer file" \
|
||||
--answer_file_list "path to the answer files of at most 2 models" \
|
||||
--model_name_list "the names of at most 2 models" \
|
||||
--gpt_model "which GPT model to use for evaluation" \
|
||||
--save_path "path to save results" \
|
||||
--openai_key "your openai key" \
|
||||
```
|
||||
|
||||
If you want GPT evaluation with reference, you can add the argument `--gpt_with_reference`, but make sure the reference file has target answers.
|
||||
|
||||
## FAQ
|
||||
|
||||
<details><summary><b>How can I add a new GPT evaluation metric?</b></summary>
|
||||
|
||||
For example, if you want to add a new metric `persuasiveness` into category `brainstorming`, you should add the metric definition and its corresponding CoT (Chain-of-Thought) in the evaluation prompt file in `prompt/evaluation_prompt`. The CoT can be generated using ChatGPT. You can prompt ChatGPT to generate evaluation steps for the new metric.
|
||||
|
||||
```json
|
||||
{
|
||||
"brainstorming": {
|
||||
"id": 1,
|
||||
"category": "brainstorming",
|
||||
"metrics": {
|
||||
"persuasiveness": "persuasiveness(1-5):a short description for persuasiveness"
|
||||
},
|
||||
"CoT": {
|
||||
"persuasiveness": "CoT for persuasiveness\n\npersuasiveness:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the \"brainstorming\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
</details>
|
||||
|
||||
## Citations
|
||||
|
||||
```bibtex
|
||||
@misc{vicuna2023,
|
||||
title = {Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\%* ChatGPT Quality},
|
||||
url = {https://vicuna.lmsys.org},
|
||||
author = {Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P.},
|
||||
month = {March},
|
||||
year = {2023}
|
||||
}
|
||||
|
||||
@misc{liu2023geval,
|
||||
title={G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment},
|
||||
author={Yang Liu and Dan Iter and Yichong Xu and Shuohang Wang and Ruochen Xu and Chenguang Zhu},
|
||||
year={2023},
|
||||
eprint={2303.16634},
|
||||
archivePrefix={arXiv},
|
||||
primaryClass={cs.CL}
|
||||
}
|
||||
```
|
|
@ -0,0 +1,3 @@
|
|||
from .dataset_evaluator import DatasetEvaluator
|
||||
|
||||
__all__ = ["DatasetEvaluator"]
|
|
@ -0,0 +1,269 @@
|
|||
from typing import Dict, List
|
||||
|
||||
import colossal_eval.evaluate.dataset_evaluator.metrics as metric_helper
|
||||
import numpy as np
|
||||
import tqdm
|
||||
|
||||
LabelBasedMetrics = ["first_token_accuracy", "matthews_correlation"]
|
||||
LossBasedMetrics = ["perplexity", "ppl_score", "ppl_score_over_choices", "per_byte_perplexity", "per_byte_ppl_score"]
|
||||
CombinedMetrics = ["combined_single_choice_accuracy"]
|
||||
OtherMetrics = [
|
||||
"f1_score",
|
||||
"f1_zh_score",
|
||||
"rouge_score",
|
||||
"rouge_zh_score",
|
||||
"retrieval_score",
|
||||
"retrieval_zh_score",
|
||||
"classification_score",
|
||||
"code_sim_score",
|
||||
"count_score",
|
||||
"multi_choice_accuracy",
|
||||
"math_equivalence",
|
||||
"single_choice_accuracy",
|
||||
]
|
||||
|
||||
|
||||
class DatasetEvaluator(object):
|
||||
"""
|
||||
Dataset evaluator.
|
||||
|
||||
"""
|
||||
|
||||
def __init__(self):
|
||||
pass
|
||||
|
||||
def _calculate_label_metrics(self, metric: str, category: str):
|
||||
"""Calculate label-based metrics."""
|
||||
weight = len(self.data[category]["data"]) / self.metric_total_length[metric]
|
||||
|
||||
str_label_map = {
|
||||
choice: idx for idx, choice in enumerate(self.data[category]["inference_kwargs"]["all_classes"])
|
||||
}
|
||||
|
||||
references = [str_label_map[sample["target"]] for sample in self.data[category]["data"]]
|
||||
[sample["output"] for sample in self.data[category]["data"]]
|
||||
|
||||
flag = False
|
||||
softmaxs = []
|
||||
for i, sample in enumerate(self.data[category]["data"]):
|
||||
if np.any(np.isnan(np.array(list(sample["softmax_over_choices"].values())))):
|
||||
if not flag:
|
||||
print(
|
||||
f"NaN in the softmax, switch to exact match for category {category} in dataset {self.dataset_name} in model {self.model_name}."
|
||||
)
|
||||
flag = True
|
||||
score = 0
|
||||
for ref in sample["target"]:
|
||||
score = max(
|
||||
score,
|
||||
metric_helper.single_choice_accuracy(
|
||||
sample["output"], ref, all_classes=self.data[category]["inference_kwargs"]["all_classes"]
|
||||
),
|
||||
)
|
||||
softmaxs.append(references[i] if score == 1 else -1)
|
||||
else:
|
||||
softmaxs.append(np.argmax(np.array(list(sample["softmax_over_choices"].values()))))
|
||||
|
||||
references = np.array(references)
|
||||
softmaxs = np.array(softmaxs)
|
||||
scores = np.sum(references == softmaxs) / len(self.data[category]["data"]) * 100
|
||||
|
||||
self.evaluation_results[metric][category] = (scores, len(self.data[category]["data"]))
|
||||
self.evaluation_results[metric]["ALL"] += scores * weight
|
||||
|
||||
def _calculate_combined_metrics(self, metric: str, category: str):
|
||||
"""Calculate combined metrics."""
|
||||
weight = len(self.data[category]["data"]) / self.metric_total_length[metric]
|
||||
|
||||
references = [sample["target"] for sample in self.data[category]["data"]]
|
||||
predictions = [sample["output"] for sample in self.data[category]["data"]]
|
||||
|
||||
str_label_map = {
|
||||
choice: idx for idx, choice in enumerate(self.data[category]["inference_kwargs"]["all_classes"])
|
||||
}
|
||||
|
||||
references_labels = [str_label_map[sample["target"][0]] for sample in self.data[category]["data"]]
|
||||
|
||||
|
||||
flag = False
|
||||
softmaxs = []
|
||||
for i, sample in enumerate(self.data[category]["data"]):
|
||||
if np.any(np.isnan(np.array(list(sample["softmax_over_choices"].values())))):
|
||||
if not flag:
|
||||
print(
|
||||
f"NaN in the softmax, switch to exact match for category {category} in dataset {self.dataset_name} in model {self.model_name}."
|
||||
)
|
||||
flag = True
|
||||
score = 0
|
||||
for ref in sample["target"]:
|
||||
score = max(
|
||||
score,
|
||||
metric_helper.single_choice_accuracy(
|
||||
sample["output"], ref, all_classes=self.data[category]["inference_kwargs"]["all_classes"]
|
||||
),
|
||||
)
|
||||
softmaxs.append(references[i] if score == 1 else -1)
|
||||
else:
|
||||
softmaxs.append(np.argmax(np.array(list(sample["softmax_over_choices"].values()))))
|
||||
|
||||
metric_method = getattr(metric_helper, metric)
|
||||
|
||||
total_score = 0.0
|
||||
for prediction, reference, references_label, softmax in zip(
|
||||
predictions, references, references_labels, softmaxs
|
||||
):
|
||||
score = 0.0
|
||||
|
||||
for ref in reference:
|
||||
score = max(
|
||||
score,
|
||||
metric_method(prediction, ref, all_classes=self.data[category]["inference_kwargs"]["all_classes"]),
|
||||
)
|
||||
if references_label == softmax:
|
||||
score = 1
|
||||
|
||||
total_score += score
|
||||
total_score = total_score * 100 / len(self.data[category]["data"])
|
||||
|
||||
self.evaluation_results[metric][category] = (total_score, len(self.data[category]["data"]))
|
||||
self.evaluation_results[metric]["ALL"] += total_score * weight
|
||||
|
||||
def _calculate_other_metrics(self, metric: str, category: str):
|
||||
"""Calculate other metrics."""
|
||||
weight = len(self.data[category]["data"]) / self.metric_total_length[metric]
|
||||
|
||||
references = [sample["target"] for sample in self.data[category]["data"]]
|
||||
predictions = [sample["output"] for sample in self.data[category]["data"]]
|
||||
|
||||
metric_method = getattr(metric_helper, metric)
|
||||
|
||||
total_score = 0.0
|
||||
for prediction, reference in zip(predictions, references):
|
||||
score = 0.0
|
||||
for ref in reference:
|
||||
score = max(
|
||||
score,
|
||||
metric_method(prediction, ref, all_classes=self.data[category]["inference_kwargs"]["all_classes"]),
|
||||
)
|
||||
total_score += score
|
||||
total_score = total_score * 100 / len(predictions)
|
||||
|
||||
self.evaluation_results[metric][category] = (total_score, len(self.data[category]["data"]))
|
||||
self.evaluation_results[metric]["ALL"] += total_score * weight
|
||||
|
||||
def _calculate_loss_metrics(self, metric: str, category: str):
|
||||
"""Calculate perplexity."""
|
||||
if metric == "perplexity":
|
||||
weight = len(self.data[category]["data"]) / self.metric_total_length[metric]
|
||||
losses = [min(sample["loss"]) for sample in self.data[category]["data"]]
|
||||
perplexity = np.mean(np.exp(np.array(losses)))
|
||||
|
||||
self.evaluation_results["perplexity"][category] = (perplexity, len(self.data[category]["data"]))
|
||||
self.evaluation_results["perplexity"]["ALL"] += perplexity * weight
|
||||
elif metric == "ppl_score":
|
||||
weight = len(self.data[category]["data"]) / self.metric_total_length[metric]
|
||||
losses = [min(sample["loss"]) for sample in self.data[category]["data"]]
|
||||
perplexity_score = np.mean(np.exp(-np.array(losses))) * 100
|
||||
|
||||
self.evaluation_results["ppl_score"][category] = (perplexity_score, len(self.data[category]["data"]))
|
||||
self.evaluation_results["ppl_score"]["ALL"] += perplexity_score * weight
|
||||
elif metric == "ppl_score_over_choices" and self.data[category]["inference_kwargs"]["all_classes"] is not None:
|
||||
weight = len(self.data[category]["data"]) / self.metric_total_length[metric]
|
||||
loss_over_choices = [sample["loss_over_choices"] for sample in self.data[category]["data"]]
|
||||
perplexity_score_over_choices = np.mean(np.exp(-np.array(loss_over_choices))) * 100
|
||||
|
||||
self.evaluation_results["ppl_score_over_choices"][category] = (
|
||||
perplexity_score_over_choices,
|
||||
len(self.data[category]["data"]),
|
||||
)
|
||||
self.evaluation_results["ppl_score_over_choices"]["ALL"] += perplexity_score_over_choices * weight
|
||||
elif metric == "per_byte_perplexity":
|
||||
weight = len(self.data[category]["data"]) / self.metric_total_length[metric]
|
||||
losses = [min(sample["loss_sum"]) for sample in self.data[category]["data"]]
|
||||
perplexity = np.mean(np.exp(np.array(losses) / np.array(self.N_bytes[category])))
|
||||
|
||||
self.evaluation_results["per_byte_perplexity"][category] = perplexity
|
||||
self.evaluation_results["per_byte_perplexity"]["ALL"] += perplexity * weight
|
||||
elif metric == "per_byte_ppl_score":
|
||||
weight = len(self.data[category]["data"]) / self.metric_total_length[metric]
|
||||
losses = [min(sample["loss_sum"]) for sample in self.data[category]["data"]]
|
||||
perplexity_score = np.mean(np.exp(-np.array(losses) / np.array(self.N_bytes[category]))) * 100
|
||||
|
||||
self.evaluation_results["per_byte_ppl_score"][category] = perplexity_score
|
||||
self.evaluation_results["per_byte_ppl_score"]["ALL"] += perplexity_score * weight
|
||||
|
||||
def _evaluate(self):
|
||||
"""Calculate and return evaluation results"""
|
||||
|
||||
for metric in self.metrics:
|
||||
pbar = tqdm.tqdm(
|
||||
desc=f"{self.dataset_name}-{metric}-{self.model_name}", total=len(self.suggested_categories[metric])
|
||||
)
|
||||
|
||||
if metric in LabelBasedMetrics:
|
||||
for category in self.suggested_categories[metric]:
|
||||
self._calculate_label_metrics(metric, category)
|
||||
pbar.update(1)
|
||||
elif metric in LossBasedMetrics:
|
||||
for category in self.suggested_categories[metric]:
|
||||
self._calculate_loss_metrics(metric, category)
|
||||
pbar.update(1)
|
||||
elif metric in CombinedMetrics:
|
||||
for category in self.suggested_categories[metric]:
|
||||
self._calculate_combined_metrics(metric, category)
|
||||
pbar.update(1)
|
||||
elif metric in OtherMetrics:
|
||||
for category in self.suggested_categories[metric]:
|
||||
self._calculate_other_metrics(metric, category)
|
||||
pbar.update(1)
|
||||
|
||||
return self.evaluation_results
|
||||
|
||||
def get_evaluation_results(self, data: List[Dict], dataset_name: str, model_name: str, metrics: List[str]):
|
||||
"""
|
||||
Evaluate inference data on the given metrics.
|
||||
|
||||
Args:
|
||||
data: Data to be evaluated.
|
||||
dataset_name: Name of the dataset
|
||||
model_name: Name of the model
|
||||
metrics: Metrics used to evaluate.
|
||||
|
||||
"""
|
||||
self.data = data
|
||||
self.dataset_name = dataset_name
|
||||
self.model_name = model_name
|
||||
self.categories = list(data.keys())
|
||||
self.metrics = metrics
|
||||
|
||||
self.evaluation_results = {
|
||||
metric: {category: 0 for category in (["ALL"] + self.categories)} for metric in self.metrics
|
||||
}
|
||||
|
||||
self.total_length = 0
|
||||
self.total_single_choices = 0
|
||||
for value in self.data.values():
|
||||
self.total_length += len(value["data"])
|
||||
if value["inference_kwargs"]["all_classes"] is not None:
|
||||
self.total_single_choices += len(value["data"])
|
||||
|
||||
self.metric_total_length = {metric: 0 for metric in self.metrics}
|
||||
self.suggested_categories = {metric: [] for metric in self.metrics}
|
||||
|
||||
for metric in self.metrics:
|
||||
self.suggested_categories[metric] = metric_helper.metrics4subcategory[self.dataset_name][metric]
|
||||
if "ALL" in self.suggested_categories[metric]:
|
||||
self.suggested_categories[metric] = self.categories
|
||||
self.metric_total_length[metric] = self.total_length
|
||||
continue
|
||||
for category in self.suggested_categories[metric]:
|
||||
self.metric_total_length[metric] += len(self.data[category]["data"])
|
||||
|
||||
if "per_byte_perplexity" in self.metrics or "per_byte_ppl_score" in self.metrics:
|
||||
self.N_bytes = {category: [] for category in self.categories}
|
||||
for category in self.categories:
|
||||
samples = self.data[category]["data"]
|
||||
for sample in samples:
|
||||
self.N_bytes[category].append(sample["byte_num"][0])
|
||||
|
||||
return self._evaluate()
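# Illustrative usage sketch (an assumption, not part of the original module): drive the
# evaluator with one hand-built MMLU-style sample. The field names ("target", "output",
# "softmax_over_choices") follow the inference format used by the dataset classes; the
# category and model names below are placeholders.
if __name__ == "__main__":
    example_data = {
        "Astronomy": {
            "inference_kwargs": {
                "calculate_loss": True,
                "all_classes": ["A", "B", "C", "D"],
                "language": "English",
                "pretrain": False,
                "max_new_tokens": 32,
            },
            "data": [
                {
                    "dataset": "mmlu",
                    "split": "test",
                    "category": "Astronomy",
                    "instruction": "Answer the question by replying A, B, C or D.",
                    "input": "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer: ",
                    "output": "A",
                    "target": "A",
                    "softmax_over_choices": {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1},
                }
            ],
        }
    }
    results = DatasetEvaluator().get_evaluation_results(
        example_data, dataset_name="mmlu", model_name="example-model", metrics=["first_token_accuracy"]
    )
    # Prints the per-category and overall ("ALL") first-token accuracy (100.0 for this single sample).
    print(results)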
|
|
@ -0,0 +1,623 @@
|
|||
# Code adapted from https://github.com/THUDM/LongBench/blob/main/metrics.py
|
||||
# Code adapted from https://github.com/hendrycks/math/blob/main/modeling/math_equivalence.py
|
||||
# Code adapted from https://github.com/ruixiangcui/AGIEval/blob/main/src/evaluation.py
|
||||
|
||||
import difflib
|
||||
import re
|
||||
import string
|
||||
from collections import Counter
|
||||
|
||||
import jieba
|
||||
from fuzzywuzzy import fuzz
|
||||
from rouge import Rouge
|
||||
|
||||
metrics4subcategory = {
|
||||
"pretrain": {
|
||||
"perplexity": ["ALL"],
|
||||
"ppl_score": ["ALL"],
|
||||
"per_byte_perplexity": ["ALL"],
|
||||
"per_byte_ppl_score": ["ALL"],
|
||||
},
|
||||
# The commented-out entries are non-4-choice questions.
|
||||
"agieval": {
|
||||
"combined_single_choice_accuracy": [
|
||||
# "lsat-ar",
|
||||
# "lsat-lr",
|
||||
# "lsat-rc",
|
||||
"logiqa-en",
|
||||
"sat-math",
|
||||
"sat-en",
|
||||
# "aqua-rat",
|
||||
"sat-en-without-passage",
|
||||
"gaokao-english",
|
||||
"logiqa-zh",
|
||||
"gaokao-chinese",
|
||||
"gaokao-geography",
|
||||
"gaokao-history",
|
||||
"gaokao-biology",
|
||||
"gaokao-chemistry",
|
||||
],
|
||||
"first_token_accuracy": [
|
||||
# "lsat-ar",
|
||||
# "lsat-lr",
|
||||
# "lsat-rc",
|
||||
"logiqa-en",
|
||||
"sat-math",
|
||||
"sat-en",
|
||||
# "aqua-rat",
|
||||
"sat-en-without-passage",
|
||||
"gaokao-english",
|
||||
"logiqa-zh",
|
||||
"gaokao-chinese",
|
||||
"gaokao-geography",
|
||||
"gaokao-history",
|
||||
"gaokao-biology",
|
||||
"gaokao-chemistry",
|
||||
],
|
||||
"single_choice_accuracy": [
|
||||
# "lsat-ar",
|
||||
# "lsat-lr",
|
||||
# "lsat-rc",
|
||||
"logiqa-en",
|
||||
"sat-math",
|
||||
"sat-en",
|
||||
# "aqua-rat",
|
||||
"sat-en-without-passage",
|
||||
"gaokao-english",
|
||||
"logiqa-zh",
|
||||
"gaokao-chinese",
|
||||
"gaokao-geography",
|
||||
"gaokao-history",
|
||||
"gaokao-biology",
|
||||
"gaokao-chemistry",
|
||||
],
|
||||
"multi_choice_accuracy": ["jec-qa-kd", "jec-qa-ca", "gaokao-physics", "gaokao-mathqa"],
|
||||
"math_equivalence": ["gaokao-mathcloze", "math"],
|
||||
"perplexity": ["ALL"],
|
||||
"ppl_score_over_choices": [
|
||||
"lsat-ar",
|
||||
"lsat-lr",
|
||||
"lsat-rc",
|
||||
"logiqa-en",
|
||||
"sat-math",
|
||||
"sat-en",
|
||||
"aqua-rat",
|
||||
"sat-en-without-passage",
|
||||
"gaokao-english",
|
||||
"logiqa-zh",
|
||||
"jec-qa-kd",
|
||||
"jec-qa-ca",
|
||||
"gaokao-chinese",
|
||||
"gaokao-geography",
|
||||
"gaokao-history",
|
||||
"gaokao-biology",
|
||||
"gaokao-chemistry",
|
||||
"gaokao-physics",
|
||||
"gaokao-mathqa",
|
||||
],
|
||||
"ppl_score": ["ALL"],
|
||||
},
|
||||
"cmmlu": {
|
||||
"first_token_accuracy": ["ALL"],
|
||||
"single_choice_accuracy": ["ALL"],
|
||||
"perplexity": ["ALL"],
|
||||
"ppl_score_over_choices": ["ALL"],
|
||||
"ppl_score": ["ALL"],
|
||||
},
|
||||
"gaokaobench": {
|
||||
"combined_single_choice_accuracy": [
|
||||
"English MCQs",
|
||||
"Biology MCQs",
|
||||
"Chemistry MCQs",
|
||||
"History MCQs",
|
||||
"Math I MCQs",
|
||||
"Math II MCQs",
|
||||
"Political Science MCQs",
|
||||
],
|
||||
"first_token_accuracy": [
|
||||
"English MCQs",
|
||||
"Biology MCQs",
|
||||
"Chemistry MCQs",
|
||||
"History MCQs",
|
||||
"Math I MCQs",
|
||||
"Math II MCQs",
|
||||
"Political Science MCQs",
|
||||
],
|
||||
"single_choice_accuracy": [
|
||||
"English MCQs",
|
||||
"Biology MCQs",
|
||||
"Chemistry MCQs",
|
||||
"History MCQs",
|
||||
"Math I MCQs",
|
||||
"Math II MCQs",
|
||||
"Political Science MCQs",
|
||||
],
|
||||
"multi_choice_accuracy": [
|
||||
"Chinese Lang and Usage MCQs",
|
||||
"Chinese Modern Lit",
|
||||
"English Fill in Blanks",
|
||||
"English Reading Comp",
|
||||
"Geography MCQs",
|
||||
"Physics MCQs",
|
||||
"English Cloze Test",
|
||||
],
|
||||
"math_equivalence": ["Math I Fill-in-the-Blank", "Math II Fill-in-the-Blank"],
|
||||
"rouge_score": ["English Language Cloze Passage"],
|
||||
"rouge_zh_score": [
|
||||
"Chinese Language Famous Passages and Sentences Dictation",
|
||||
"Chemistry Open-ended Questions",
|
||||
"History Open-ended Questions",
|
||||
"Biology Open-ended Questions",
|
||||
"Political Science Open-ended Questions",
|
||||
"English Language Error Correction",
|
||||
"Chinese Language Language and Writing Skills Open-ended Questions",
|
||||
"Math II Open-ended Questions",
|
||||
"Chinese Language Literary Text Reading",
|
||||
"Chinese Language Ancient Poetry Reading",
|
||||
"Chinese Language Classical Chinese Reading",
|
||||
"Physics Open-ended Questions",
|
||||
"Math I Open-ended Questions",
|
||||
"Geography Open-ended Questions",
|
||||
"Chinese Language Practical Text Reading",
|
||||
],
|
||||
"perplexity": ["ALL"],
|
||||
"ppl_score_over_choices": ["ALL"],
|
||||
"ppl_score": ["ALL"],
|
||||
},
|
||||
"longbench": {
|
||||
"f1_score": ["hotpotqa", "2wikimqa", "musique", "narrativeqa", "qasper", "multifieldqa_en", "triviaqa"],
|
||||
"f1_zh_score": ["multifieldqa_zh"],
|
||||
"rouge_score": ["gov_report", "qmsum", "multi_news", "samsum"],
|
||||
"rouge_zh_score": ["dureader", "vcsum"],
|
||||
"retrieval_score": ["passage_retrieval_en"],
|
||||
"retrieval_zh_score": ["passage_retrieval_zh"],
|
||||
"classification_score": ["trec", "lsht"],
|
||||
"code_sim_score": ["lcc", "repobench-p"],
|
||||
"count_score": ["passage_count"],
|
||||
"perplexity": ["ALL"],
|
||||
"ppl_score": ["ALL"],
|
||||
},
|
||||
"mmlu": {
|
||||
"first_token_accuracy": ["ALL"],
|
||||
"single_choice_accuracy": ["ALL"],
|
||||
"accuracy": ["ALL"],
|
||||
"perplexity": ["ALL"],
|
||||
"ppl_score_over_choices": ["ALL"],
|
||||
"ppl_score": ["ALL"],
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def _fix_fracs(string):
|
||||
substrs = string.split("\\frac")
|
||||
new_str = substrs[0]
|
||||
if len(substrs) > 1:
|
||||
substrs = substrs[1:]
|
||||
for substr in substrs:
|
||||
new_str += "\\frac"
|
||||
if substr[0] == "{":
|
||||
new_str += substr
|
||||
else:
|
||||
try:
|
||||
assert len(substr) >= 2
|
||||
except:
|
||||
return string
|
||||
a = substr[0]
|
||||
b = substr[1]
|
||||
if b != "{":
|
||||
if len(substr) > 2:
|
||||
post_substr = substr[2:]
|
||||
new_str += "{" + a + "}{" + b + "}" + post_substr
|
||||
else:
|
||||
new_str += "{" + a + "}{" + b + "}"
|
||||
else:
|
||||
if len(substr) > 2:
|
||||
post_substr = substr[2:]
|
||||
new_str += "{" + a + "}" + b + post_substr
|
||||
else:
|
||||
new_str += "{" + a + "}" + b
|
||||
string = new_str
|
||||
return string
|
||||
|
||||
|
||||
def _fix_a_slash_b(string):
|
||||
if len(string.split("/")) != 2:
|
||||
return string
|
||||
a = string.split("/")[0]
|
||||
b = string.split("/")[1]
|
||||
try:
|
||||
a = int(a)
|
||||
b = int(b)
|
||||
assert string == "{}/{}".format(a, b)
|
||||
new_string = "\\frac{" + str(a) + "}{" + str(b) + "}"
|
||||
return new_string
|
||||
except:
|
||||
return string
|
||||
|
||||
|
||||
def _remove_right_units(string):
|
||||
# "\\text{ " only ever occurs (at least in the val set) when describing units
|
||||
if "\\text{ " in string:
|
||||
splits = string.split("\\text{ ")
|
||||
assert len(splits) == 2
|
||||
return splits[0]
|
||||
else:
|
||||
return string
|
||||
|
||||
|
||||
def _fix_sqrt(string):
|
||||
if "\\sqrt" not in string:
|
||||
return string
|
||||
splits = string.split("\\sqrt")
|
||||
new_string = splits[0]
|
||||
for split in splits[1:]:
|
||||
if split[0] != "{":
|
||||
a = split[0]
|
||||
new_substr = "\\sqrt{" + a + "}" + split[1:]
|
||||
else:
|
||||
new_substr = "\\sqrt" + split
|
||||
new_string += new_substr
|
||||
return new_string
|
||||
|
||||
|
||||
def _strip_string(string):
|
||||
# linebreaks
|
||||
string = string.replace("\n", "")
|
||||
# print(string)
|
||||
|
||||
# remove inverse spaces
|
||||
string = string.replace("\\!", "")
|
||||
# print(string)
|
||||
|
||||
# replace \\ with \
|
||||
string = string.replace("\\\\", "\\")
|
||||
# print(string)
|
||||
|
||||
# replace tfrac and dfrac with frac
|
||||
string = string.replace("tfrac", "frac")
|
||||
string = string.replace("dfrac", "frac")
|
||||
# print(string)
|
||||
|
||||
# remove \left and \right
|
||||
string = string.replace("\\left", "")
|
||||
string = string.replace("\\right", "")
|
||||
# print(string)
|
||||
|
||||
# Remove circ (degrees)
|
||||
string = string.replace("^{\\circ}", "")
|
||||
string = string.replace("^\\circ", "")
|
||||
|
||||
# remove dollar signs
|
||||
string = string.replace("\\$", "")
|
||||
|
||||
# remove units (on the right)
|
||||
string = _remove_right_units(string)
|
||||
|
||||
# remove percentage
|
||||
string = string.replace("\\%", "")
|
||||
string = string.replace("\%", "")
|
||||
|
||||
# " 0." equivalent to " ." and "{0." equivalent to "{." Alternatively, add "0" if "." is the start of the string
|
||||
string = string.replace(" .", " 0.")
|
||||
string = string.replace("{.", "{0.")
|
||||
# if empty, return empty string
|
||||
if len(string) == 0:
|
||||
return string
|
||||
if string[0] == ".":
|
||||
string = "0" + string
|
||||
|
||||
# to consider: get rid of e.g. "k = " or "q = " at beginning
|
||||
if len(string.split("=")) == 2:
|
||||
if len(string.split("=")[0]) <= 2:
|
||||
string = string.split("=")[1]
|
||||
|
||||
# fix sqrt3 --> sqrt{3}
|
||||
string = _fix_sqrt(string)
|
||||
|
||||
# remove spaces
|
||||
string = string.replace(" ", "")
|
||||
|
||||
# \frac1b or \frac12 --> \frac{1}{b} and \frac{1}{2}, etc. Even works with \frac1{72} (but not \frac{72}1). Also does a/b --> \\frac{a}{b}
|
||||
string = _fix_fracs(string)
|
||||
|
||||
# manually change 0.5 --> \frac{1}{2}
|
||||
if string == "0.5":
|
||||
string = "\\frac{1}{2}"
|
||||
|
||||
# NOTE: X/Y changed to \frac{X}{Y} in dataset, but in simple cases fix in case the model output is X/Y
|
||||
string = _fix_a_slash_b(string)
|
||||
|
||||
return string
|
||||
|
||||
|
||||
def parse_math_answer(raw_string):
|
||||
def remove_boxed(s):
|
||||
left = "\\boxed{"
|
||||
try:
|
||||
assert s[: len(left)] == left
|
||||
assert s[-1] == "}"
|
||||
answer = s[len(left) : -1]
|
||||
if "=" in answer:
|
||||
answer = answer.split("=")[-1].lstrip(" ")
|
||||
return answer
|
||||
except:
|
||||
return None
|
||||
|
||||
def last_boxed_only_string(string):
|
||||
idx = string.rfind("\\boxed")
|
||||
if idx < 0:
|
||||
idx = string.rfind("\\fbox")
|
||||
if idx < 0:
|
||||
return None
|
||||
i = idx
|
||||
right_brace_idx = None
|
||||
num_left_braces_open = 0
|
||||
while i < len(string):
|
||||
if string[i] == "{":
|
||||
num_left_braces_open += 1
|
||||
if string[i] == "}":
|
||||
num_left_braces_open -= 1
|
||||
if num_left_braces_open == 0:
|
||||
right_brace_idx = i
|
||||
break
|
||||
i += 1
|
||||
|
||||
if right_brace_idx == None:
|
||||
retval = None
|
||||
else:
|
||||
retval = string[idx : right_brace_idx + 1]
|
||||
|
||||
return retval
|
||||
|
||||
def get_answer_with_dollar_sign(s):
|
||||
first_pattern = "\$(.*)\$"
|
||||
last_match = None
|
||||
matches = re.findall(first_pattern, s)
|
||||
if matches:
|
||||
last_match = matches[-1]
|
||||
if "=" in last_match:
|
||||
last_match = last_match.split("=")[-1].lstrip(" ")
|
||||
return last_match
|
||||
|
||||
def get_answer_without_dollar_sign(s):
|
||||
last_match = None
|
||||
if "=" in s:
|
||||
last_match = s.split("=")[-1].lstrip(" ").rstrip(".")
|
||||
if "\\n" in last_match:
|
||||
last_match = last_match.split("\\n")[0]
|
||||
else:
|
||||
pattern = "(?:\\$)?\d+(?:\.\d+)?(?![\w\d])"
|
||||
matches = re.findall(pattern, s)
|
||||
if matches:
|
||||
last_match = matches[-1]
|
||||
return last_match
|
||||
|
||||
if "\\boxed" in raw_string:
|
||||
answer = remove_boxed(last_boxed_only_string(raw_string))
|
||||
else:
|
||||
answer = get_answer_with_dollar_sign(raw_string)
|
||||
if not answer:
|
||||
answer = get_answer_without_dollar_sign(raw_string)
|
||||
return answer
|
||||
|
||||
|
||||
def math_equivalence(prediction, reference, **kwargs):
    prediction = parse_math_answer(prediction)

    if prediction is None and reference is None:
        print("WARNING: Both None")
        return False

    if prediction is None or reference is None:
        return False

    try:
        ss1 = _strip_string(prediction)
        ss2 = _strip_string(reference)
        return ss1 == ss2
    except:
        return prediction == reference
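
# A minimal usage sketch (illustrative, not from the original file): both sides are
# normalized with the helpers above before comparison, e.g.
#     math_equivalence("The final answer is $\\boxed{\\frac{1}{2}}$", "0.5")  ->  True
#     math_equivalence("The final answer is $\\boxed{3/4}$", "0.5")           ->  False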
|
||||
|
||||
|
||||
def multi_choice_accuracy(prediction, reference, **kwargs):
    # Only find uppercase letters not surrounded by lowercase letters
    all_classes = kwargs.get("all_classes", None)
    if all_classes:
        pattern = f"(?<![a-z])[{all_classes[0]}-{all_classes[-1]}](?![a-z])"
    else:
        pattern = "(?<![a-z])[A-F](?![a-z])"

    prediction = re.findall(pattern, prediction)
    reference = re.findall(pattern, reference)

    prediction_set = set(prediction)
    reference_set = set(reference)

    score = 0.0
    for p in prediction_set:
        if p not in reference_set:
            return 0.0
        else:
            score += 1 / len(reference_set)

    return score
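
# Illustrative behaviour (not from the original file): every predicted choice must
# appear in the reference; correct but incomplete selections earn partial credit.
#     multi_choice_accuracy("The correct options are A and C.", "AC")  ->  1.0
#     multi_choice_accuracy("The correct option is A.", "AC")          ->  0.5
#     multi_choice_accuracy("A and B are correct.", "AC")              ->  0.0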
|
||||
|
||||
|
||||
def combined_single_choice_accuracy(prediction, reference, **kwargs):
|
||||
return single_choice_accuracy(prediction, reference, **kwargs)
|
||||
|
||||
|
||||
def single_choice_accuracy(prediction, reference, **kwargs):
    # Only find uppercase letters not surrounded by lowercase letters
    all_classes = kwargs.get("all_classes", None)
    if all_classes:
        pattern = f"(?<![a-z])[{all_classes[0]}-{all_classes[-1]}](?![a-z])"
    else:
        pattern = "(?<![a-z])[A-F](?![a-z])"

    prediction = re.findall(pattern, prediction)[0:1]
    reference = re.findall(pattern, reference)

    assert len(reference) == 1

    prediction_set = set(prediction)
    reference_set = set(reference)

    if prediction_set == reference_set:
        return 1.0

    return 0.0
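
# Illustrative behaviour (not from the original file): only the first matched choice
# letter in the prediction is kept, e.g.
#     single_choice_accuracy("The answer is B. Because ...", "B")  ->  1.0
#     single_choice_accuracy("I would pick C over B.", "B")        ->  0.0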
|
||||
|
||||
|
||||
def normalize_answer(s):
|
||||
"""Lower text and remove punctuation, articles and extra whitespace."""
|
||||
|
||||
def remove_articles(text):
|
||||
return re.sub(r"\b(a|an|the)\b", " ", text)
|
||||
|
||||
def white_space_fix(text):
|
||||
return " ".join(text.split())
|
||||
|
||||
def remove_punc(text):
|
||||
exclude = set(string.punctuation)
|
||||
return "".join(ch for ch in text if ch not in exclude)
|
||||
|
||||
def lower(text):
|
||||
return text.lower()
|
||||
|
||||
return white_space_fix(remove_articles(remove_punc(lower(s))))
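
# Illustrative behaviour (not from the original file):
#     normalize_answer("The Quick, Brown Fox!")  ->  "quick brown fox"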
|
||||
|
||||
|
||||
def normalize_zh_answer(s):
|
||||
"""Lower text and remove punctuation, extra whitespace."""
|
||||
|
||||
def white_space_fix(text):
|
||||
return "".join(text.split())
|
||||
|
||||
def remove_punc(text):
|
||||
cn_punctuation = "!?。。"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏."
|
||||
all_punctuation = set(string.punctuation + cn_punctuation)
|
||||
return "".join(ch for ch in text if ch not in all_punctuation)
|
||||
|
||||
def lower(text):
|
||||
return text.lower()
|
||||
|
||||
return white_space_fix(remove_punc(lower(s)))
|
||||
|
||||
|
||||
def count_score(prediction, reference, **kwargs):
    numbers = re.findall(r"\d+", prediction)
    right_num = 0
    for number in numbers:
        if str(number) == str(reference):
            right_num += 1
    final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers)
    return float(final_score)
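
# Illustrative behaviour (not from the original file): the score is the fraction of
# numbers mentioned in the prediction that equal the reference count, e.g.
#     count_score("There are 7 paragraphs, yes, 7 in total.", 7)  ->  1.0
#     count_score("It is either 6 or 7.", 7)                      ->  0.5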
|
||||
|
||||
|
||||
def retrieval_score(prediction, reference, **kwargs):
|
||||
pattern = r"Paragraph (\d+)"
|
||||
matches = re.findall(pattern, reference)
|
||||
ground_truth_id = matches[0]
|
||||
numbers = re.findall(r"\d+", prediction)
|
||||
right_num = 0
|
||||
for number in numbers:
|
||||
if str(number) == str(ground_truth_id):
|
||||
right_num += 1
|
||||
final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers)
|
||||
return float(final_score)
|
||||
|
||||
|
||||
def retrieval_zh_score(prediction, reference, **kwargs):
|
||||
pattern = r"段落(\d+)"
|
||||
matches = re.findall(pattern, reference)
|
||||
ground_truth_id = matches[0]
|
||||
numbers = re.findall(r"\d+", prediction)
|
||||
right_num = 0
|
||||
for number in numbers:
|
||||
if str(number) == str(ground_truth_id):
|
||||
right_num += 1
|
||||
final_score = 0.0 if len(numbers) == 0 else right_num / len(numbers)
|
||||
return float(final_score)
|
||||
|
||||
|
||||
def code_sim_score(prediction, reference, **kwargs):
|
||||
all_lines = prediction.lstrip("\n").split("\n")
|
||||
prediction = ""
|
||||
for line in all_lines:
|
||||
if ("`" not in line) and ("#" not in line) and ("//" not in line):
|
||||
prediction = line
|
||||
break
|
||||
return fuzz.ratio(prediction, reference) / 100
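
# Illustrative behaviour (not from the original file): the first line of the prediction
# that is not a comment or a code fence is fuzzy-matched against the reference.
#     code_sim_score("```python\nreturn a + b\n```", "return a + b")  ->  1.0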
|
||||
|
||||
|
||||
def classification_score(prediction, reference, **kwargs):
    em_match_list = []
    all_classes = kwargs["all_classes"]
    for class_name in all_classes:
        if class_name in prediction:
            em_match_list.append(class_name)
    # Iterate over a copy so that removing items does not skip elements.
    for match_term in list(em_match_list):
        if match_term in reference and match_term != reference:
            em_match_list.remove(match_term)
    # If any class name appears verbatim in the prediction, score by exact matches;
    # otherwise fall back to fuzzy matching over all classes.
    if len(em_match_list) != 0:
        if reference in em_match_list:
            score = 1.0 / len(em_match_list)
        else:
            score = 0.0
    else:
        best_match = None
        highest_similarity = 0
        for candidate in all_classes:
            similarity = difflib.SequenceMatcher(None, candidate, prediction).ratio()
            if similarity > highest_similarity:
                highest_similarity = similarity
                best_match = candidate
        score = float(best_match == reference)
    return score
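
# Illustrative behaviour (not from the original file):
#     classification_score("This reads like sports news.", "sports",
#                          all_classes=["sports", "finance", "technology"])  ->  1.0
# A prediction containing no class name verbatim is scored by the fuzzy-matching fallback.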
|
||||
|
||||
|
||||
def rouge_score(prediction, reference, **kwargs):
|
||||
rouge = Rouge()
|
||||
try:
|
||||
scores = rouge.get_scores([prediction], [reference], avg=True)
|
||||
except:
|
||||
return 0.0
|
||||
return scores["rouge-l"]["f"]
|
||||
|
||||
|
||||
def rouge_zh_score(prediction, reference, **kwargs):
|
||||
prediction = " ".join(list(jieba.cut(prediction, cut_all=False)))
|
||||
reference = " ".join(list(jieba.cut(reference, cut_all=False)))
|
||||
score = rouge_score(prediction, reference)
|
||||
return score
|
||||
|
||||
|
||||
def _f1_score(prediction, reference, **kwargs):
|
||||
common = Counter(prediction) & Counter(reference)
|
||||
num_same = sum(common.values())
|
||||
if num_same == 0:
|
||||
return 0
|
||||
precision = 1.0 * num_same / len(prediction)
|
||||
recall = 1.0 * num_same / len(reference)
|
||||
f1 = (2 * precision * recall) / (precision + recall)
|
||||
return f1
|
||||
|
||||
|
||||
def f1_score(prediction, reference, **kwargs):
|
||||
normalized_prediction = normalize_answer(prediction)
|
||||
normalized_ground_truth = normalize_answer(reference)
|
||||
|
||||
prediction_tokens = normalized_prediction.split()
|
||||
ground_truth_tokens = normalized_ground_truth.split()
|
||||
return _f1_score(prediction_tokens, ground_truth_tokens)
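
# Illustrative behaviour (not from the original file): F1 is computed over the bag of
# normalized tokens shared by prediction and reference.
#     f1_score("The cat sat on the mat.", "A cat sat on a mat")  ->  1.0
#     f1_score("The cat slept.", "A cat sat on a mat")           ->  1/3 (precision 1/2, recall 1/4)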
|
||||
|
||||
|
||||
def f1_zh_score(prediction, reference, **kwargs):
|
||||
prediction_tokens = list(jieba.cut(prediction, cut_all=False))
|
||||
ground_truth_tokens = list(jieba.cut(reference, cut_all=False))
|
||||
prediction_tokens = [normalize_zh_answer(token) for token in prediction_tokens]
|
||||
ground_truth_tokens = [normalize_zh_answer(token) for token in ground_truth_tokens]
|
||||
prediction_tokens = [token for token in prediction_tokens if len(token) > 0]
|
||||
ground_truth_tokens = [token for token in ground_truth_tokens if len(token) > 0]
|
||||
return _f1_score(prediction_tokens, ground_truth_tokens)
|
|
@ -0,0 +1,110 @@
|
|||
import os
|
||||
from typing import Any, Dict, List
|
||||
|
||||
import colossal_eval.evaluate.gpt_evaluate as gpt_evaluate
|
||||
|
||||
from .utils import get_data_per_category
|
||||
|
||||
|
||||
class Evaluator(object):
|
||||
"""
|
||||
A class named Evaluator includes GPT-3.5/GPT-4 evaluation
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
params: Dict[str, Any],
|
||||
battle_prompt: Dict[str, Any],
|
||||
gpt_evaluation_prompt: Dict[str, Any],
|
||||
gpt_model: str,
|
||||
language: str,
|
||||
gpt_with_reference: bool,
|
||||
) -> None:
|
||||
self.params = params
|
||||
self.battle_prompt = battle_prompt
|
||||
self.gpt_evaluation_prompt = gpt_evaluation_prompt
|
||||
self.gpt_model = gpt_model
|
||||
self.language = language
|
||||
self.gpt_with_reference = gpt_with_reference
|
||||
self.gpt_evaluation_results = dict()
|
||||
self.battle_results = []
|
||||
|
||||
def battle(self, answers1: List[Dict], answers2: List[Dict]) -> None:
|
||||
"""
|
||||
Comparison between two models using GPT-4 as the reviewer.
|
||||
"""
|
||||
|
||||
self.battle_results = gpt_evaluate.battle(answers1, answers2, self.battle_prompt)
|
||||
|
||||
def evaluate(self, answers: List[Dict], targets: List[Dict], save_path: str, model_name: str) -> None:
|
||||
"""
|
||||
A comprehensive evaluation of the answers from the model.
|
||||
The function evaluates the model's performance from different perspectives
|
||||
using GPT-3.5, GPT-4, and off-the-shelf evaluation metrics.
|
||||
|
||||
The metrics will be decided by the config file.
|
||||
|
||||
"""
|
||||
|
||||
answers_per_category = get_data_per_category(answers, list(self.params.keys()))
|
||||
targets_per_category = get_data_per_category(targets, list(self.params.keys()))
|
||||
|
||||
# gpt evaluation
|
||||
for category in self.params:
|
||||
if len(answers_per_category[category]) == 0:
|
||||
print(f"Category {category} specified in your config doesn't have corresponding answers!")
|
||||
continue
|
||||
|
||||
if self.params[category].get("GPT", None) is None:
|
||||
continue
|
||||
|
||||
category_metrics = self.params[category]["GPT"]
|
||||
|
||||
prompt = self.gpt_evaluation_prompt.get(category, None)
|
||||
if prompt is None:
|
||||
print(f"No prompt for category {category}! Use prompt for category general now.")
|
||||
prompt = self.gpt_evaluation_prompt["general"]
|
||||
|
||||
self.gpt_evaluation_results[category] = gpt_evaluate.evaluate(
|
||||
answers_per_category[category],
|
||||
prompt,
|
||||
category_metrics,
|
||||
category,
|
||||
save_path,
|
||||
model_name,
|
||||
self.gpt_model,
|
||||
self.language,
|
||||
references=targets_per_category[category] if self.gpt_with_reference else None,
|
||||
)
|
||||
|
||||
def save(self, path: str, model_name_list: List[str]) -> None:
|
||||
"""
|
||||
Save evaluation results of GPT-3.5, GPT-4, and off-the-shelf evaluation metrics.
|
||||
|
||||
"""
|
||||
|
||||
if len(model_name_list) == 2:
|
||||
save_path = os.path.join(path, "gpt_evaluate", "battle_results")
|
||||
gpt_evaluate.save_battle_results(self.battle_results, model_name_list[0], model_name_list[1], save_path)
|
||||
else:
|
||||
if self.gpt_evaluation_results:
|
||||
# Save evaluation results for GPT evaluation metrics.
|
||||
gpt_base_save_path = os.path.join(path, "gpt_evaluate", "gpt_evaluate_results")
|
||||
gpt_evaluation_results_save_path = os.path.join(gpt_base_save_path, "evaluation_results")
|
||||
|
||||
all_evaluations = gpt_evaluate.save_gpt_evaluation_results(
|
||||
model_name_list[0], self.gpt_evaluation_results, gpt_evaluation_results_save_path
|
||||
)
|
||||
|
||||
# Start to calculate scores and save statistics.
|
||||
gpt_evaluation_statistics_save_path = os.path.join(gpt_base_save_path, "evaluation_statistics")
|
||||
gpt_evaluate.save_gpt_evaluation_statistics(
|
||||
model_name_list[0], all_evaluations, gpt_evaluation_statistics_save_path
|
||||
)
|
||||
|
||||
# Save charts and csv.
|
||||
gpt_evaluation_analyses_save_path = os.path.join(gpt_base_save_path, "evaluation_analyses")
|
||||
gpt_evaluate.analyze_gpt_evaluation_statistics(
|
||||
gpt_evaluation_statistics_save_path, gpt_evaluation_analyses_save_path
|
||||
)
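
# A minimal usage sketch (illustrative only; the config, prompt and data values below
# are assumptions, not taken from this repository):
#
#     evaluator = Evaluator(params, battle_prompt, gpt_evaluation_prompt,
#                           gpt_model="gpt-3.5-turbo", language="en",
#                           gpt_with_reference=False)
#     evaluator.evaluate(answers, targets, save_path="./results", model_name="my-model")
#     evaluator.save("./results", ["my-model"])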
|
|
@ -11,7 +11,7 @@ import openai
|
|||
import pandas as pd
|
||||
import seaborn as sns
|
||||
import tqdm
|
||||
from utils import jdump, jload
|
||||
from colossal_eval.utils import jdump, jload
|
||||
|
||||
ref_step_template = {
|
||||
"en": "Now please compare the answer with the {adjective} answer, determine whether the answer is able to achieve the same level of {metric}.\n\n",
|
||||
|
@ -364,7 +364,7 @@ def get_gpt_evaluation_without_logprobs(
|
|||
"""
|
||||
Use chat models(gpt-3.5-turbo or gpt-4) to evaluate one model answer.
|
||||
|
||||
Temperature is set to 0 to make the model more deterministic.
|
||||
Temprature is set to 0 to make the model more deterministic.
|
||||
|
||||
Args:
|
||||
prompt: a dictionary including prompt template, CoT and metrics.
|
||||
|
@ -401,7 +401,7 @@ def get_gpt_evaluation_without_logprobs(
|
|||
steps=prompt["CoT"][metric],
|
||||
)
|
||||
|
||||
if prompt_reference:
|
||||
if prompt_reference and (reference["target"] or reference["output"]):
|
||||
# Do a 2-round conversation
|
||||
response = multiturn_chat_completion(
|
||||
[prompt_1st_round, prompt_reference], model, max_tokens=max_tokens, turns=2
|
||||
|
@ -436,7 +436,7 @@ def get_gpt_evaluation_with_logprobs(
|
|||
Use completion model(text-davinci-003) to evaluate one model answer.
|
||||
Only completion models can return log probabilities.
|
||||
|
||||
Temperature is set to 0 to make the model more deterministic.
|
||||
Temprature is set to 0 to make the model more deterministic.
|
||||
|
||||
Args:
|
||||
prompt: a dictionary including prompt template, CoT and metrics.
|
||||
|
@ -498,6 +498,8 @@ def evaluate(
|
|||
prompt: Dict[str, Any],
|
||||
metrics: List[str],
|
||||
category: str,
|
||||
save_path: str,
|
||||
model_name: str,
|
||||
model: str,
|
||||
language: str,
|
||||
references: List[Dict] = None,
|
||||
|
@ -525,6 +527,72 @@ def evaluate(
|
|||
metrics_str = ", ".join(x for x in metrics)
|
||||
print(f"Category {category}'s metrics are {metrics_str}.")
|
||||
|
||||
gpt_base_save_path = os.path.join(save_path, "gpt_evaluate", "gpt_evaluate_results")
|
||||
gpt_evaluation_results_save_path = os.path.join(gpt_base_save_path, "evaluation_results")
|
||||
category_file = os.path.join(gpt_evaluation_results_save_path, model_name, f"{category}_evaluation_results.json")
|
||||
|
||||
if os.path.exists(category_file):
|
||||
print(f"Evaluation results for category {category}, model {model_name} already exists.")
|
||||
print("Skip evaluating.")
|
||||
|
||||
evaluations = jload(category_file)
|
||||
|
||||
retry = []
|
||||
evaluations_copy = deepcopy(evaluations)
|
||||
|
||||
success = []
|
||||
for idx, e in enumerate(evaluations_copy):
|
||||
keys = list(e["evaluation"].keys())
|
||||
for key in keys:
|
||||
if e["evaluation"][key] == {}:
|
||||
retry.append(e["id"])
|
||||
print(f"Re-evaluate id {e['id']} now.")
|
||||
break
|
||||
if e["id"] not in retry:
|
||||
success.append(e)
|
||||
|
||||
if len(retry) == 0:
|
||||
evaluations.sort(key=lambda x: x["id"])
|
||||
print(f"{category} done.")
|
||||
return evaluations
|
||||
|
||||
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
|
||||
futures = []
|
||||
for idx, inst in enumerate(answers):
|
||||
if not inst["id"] in retry:
|
||||
continue
|
||||
# Completion models can return log probabilities.
|
||||
if model == "text-davinci-003":
|
||||
future = executor.submit(get_gpt_evaluation_with_logprobs, prompt, inst, metrics, 1)
|
||||
else:
|
||||
future = executor.submit(
|
||||
get_gpt_evaluation_without_logprobs,
|
||||
prompt,
|
||||
inst,
|
||||
metrics,
|
||||
language,
|
||||
reference=None if references is None else references[idx],
|
||||
model=model,
|
||||
max_tokens=1,
|
||||
)
|
||||
|
||||
futures.append(future)
|
||||
|
||||
for future in tqdm.tqdm(
|
||||
concurrent.futures.as_completed(futures),
|
||||
desc=f"{category}: ",
|
||||
total=len(futures),
|
||||
):
|
||||
success.append(future.result())
|
||||
|
||||
success.sort(key=lambda x: x["id"])
|
||||
|
||||
print(f"Saving evaluation results for category {category}, model {model_name}.")
|
||||
|
||||
jdump(success, category_file)
|
||||
|
||||
return success
|
||||
|
||||
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
|
||||
futures = []
|
||||
for idx, inst in enumerate(answers):
|
||||
|
@ -556,6 +624,10 @@ def evaluate(
|
|||
|
||||
print(f"{category} done.")
|
||||
|
||||
print(f"Saving evaluation results for category {category}, model {model_name}.")
|
||||
|
||||
jdump(evaluations, category_file)
|
||||
|
||||
return evaluations
|
||||
|
||||
|
||||
|
@ -581,7 +653,7 @@ def calculate_scores_form_logprobs(logprobs: Dict[str, Any]) -> float:
|
|||
|
||||
for key, value in logprobs.items():
|
||||
# Sometimes the key will be one byte of a unicode character which takes the form of "bytes:\\xe7".
|
||||
# It is meaningless, and thus we don't calculate probability.
|
||||
# It is meaningless and thus we don't calculate probability.
|
||||
if "bytes" in key:
|
||||
continue
|
||||
# results[0] is the score which corresponds to the key(predicted token).
|
||||
|
@ -598,7 +670,7 @@ def calculate_scores_form_logprobs(logprobs: Dict[str, Any]) -> float:
|
|||
def calculate_scores_form_response(response: str, evaluation: Dict[str, Any]) -> int:
|
||||
"""
|
||||
Calculate the score from the response returned by gpt-3.5-turbo or gpt-4.
|
||||
Different from text-davinci-003, this function directly calculates the score according to the plain response returned by gpt-3.5-turbo or gpt-4.
|
||||
Different from text-davinci-003, this fuction directly calculates the score according to the plain response returned by gpt-3.5-turbo or gpt-4.
|
||||
Although text-davinci-003 can return log probabilities, it costs ten times as much as gpt-3.5-turbo.
|
||||
|
||||
Args:
|
||||
|
@ -627,7 +699,7 @@ def save_gpt_evaluation_results(
|
|||
|
||||
Args:
|
||||
model_name: name of the model for saving evaluation results.
|
||||
gpt_evaluation_results: evaluations results for all the model answers.
|
||||
gpt_evaluation_results: evaluations results for all of the model answers.
|
||||
save_path: path to save GPT evaluation statistics.
|
||||
"""
|
||||
|
||||
|
@ -647,7 +719,7 @@ def save_gpt_evaluation_statistics(model_name: str, evaluations: List[Dict], sav
|
|||
|
||||
Args:
|
||||
model_name: name of the model for saving statistics.
|
||||
evaluations: evaluations for all the model answers.
|
||||
evaluations: evaluations for all of the model answers.
|
||||
save_path: path to save GPT evaluation statistics.
|
||||
"""
|
||||
|
||||
|
@ -669,7 +741,7 @@ def save_gpt_evaluation_statistics(model_name: str, evaluations: List[Dict], sav
|
|||
for evaluation in data:
|
||||
for metric in metrics:
|
||||
if evaluation["evaluation"][metric] == {}:
|
||||
# This means after 3 retries, the server still returns an error, and we set the score to 0.
|
||||
# This means after 3 retries, the server still returns an error and we set the score to 0.
|
||||
scores[metric].append(0)
|
||||
elif evaluation["evaluation"][metric]["logprobs"] is not None:
|
||||
scores[metric].append(
|
|
@ -0,0 +1,8 @@
|
|||
def get_data_per_category(data, categories):
    data_per_category = {category: [] for category in categories}
    for item in data:
        category = item["category"]
        if category in categories:
            data_per_category[category].append(item)

    return data_per_category
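
# Illustrative behaviour (not from the original file): items whose category is not
# requested are dropped, e.g.
#     get_data_per_category(
#         [{"category": "brainstorming", "instruction": "..."},
#          {"category": "chat", "instruction": "..."}],
#         ["brainstorming"],
#     )
#     ->  {"brainstorming": [{"category": "brainstorming", "instruction": "..."}]}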
|
|
@ -0,0 +1,5 @@
|
|||
from .base import BaseModel
from .chatglm import ChatGLM2Model, ChatGLMModel
from .huggingface import HuggingFaceCausalLM, HuggingFaceModel

__all__ = ["BaseModel", "HuggingFaceModel", "HuggingFaceCausalLM", "ChatGLMModel", "ChatGLM2Model"]
|
|
@ -0,0 +1,78 @@
|
|||
from abc import abstractclassmethod
|
||||
from typing import Dict, List
|
||||
|
||||
from colossal_eval.utils import Conversation, prompt_templates
|
||||
|
||||
from colossalai.logging import DistributedLogger
|
||||
|
||||
|
||||
class BaseModel:
|
||||
"""
|
||||
Base class for model wrapper.
|
||||
|
||||
Args:
|
||||
path: The path to the model.
|
||||
model_max_length: The maximum sequence length of the model.
|
||||
prompt_template: The model's prompt template.
|
||||
batch_size: Batch size for inference.
|
||||
logger: Logger for the model.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
path: str,
|
||||
model_max_length: int = 2048,
|
||||
prompt_template: Conversation = None,
|
||||
batch_size: int = 1,
|
||||
logger: DistributedLogger = None,
|
||||
):
|
||||
self.path = path
|
||||
self.model_max_length = model_max_length
|
||||
|
||||
if prompt_template:
|
||||
self.prompt_template = prompt_template
|
||||
else:
|
||||
self.prompt_template = prompt_templates["plain"]
|
||||
|
||||
self.batch_size = batch_size
|
||||
self.logger = logger
|
||||
|
||||
@abstractclassmethod
|
||||
def inference(self, data: List[Dict]) -> None:
|
||||
"""
|
||||
Infer the given data.
|
||||
This function will call self.generate() to get model outputs and also self.model(input) to get logits.
|
||||
|
||||
Args:
|
||||
data: The data for inference.
|
||||
"""
|
||||
|
||||
@abstractclassmethod
|
||||
def generate(self, inputs: List[str], max_new_tokens: int) -> List[str]:
|
||||
"""
|
||||
Generate results given a list of inputs.
|
||||
|
||||
Args:
|
||||
inputs: A list of strings.
|
||||
max_new_tokens: The maximum length of the output.
|
||||
|
||||
Returns:
|
||||
A list of generated strings.
|
||||
"""
|
||||
|
||||
@abstractclassmethod
|
||||
def get_loss(self, batch: List[str], batch_target: List[str]) -> List[float]:
|
||||
"""
|
||||
Get loss given batch and batch with target.
|
||||
Use their length difference after tokenization to mask the loss and only compute loss at target tokens.
|
||||
|
||||
Args:
|
||||
batch: batch prompt without target answer.
|
||||
batch_target: batch prompt with target answer.
|
||||
|
||||
Returns:
|
||||
A list of loss.
|
||||
"""
|
||||
|
||||
def to(self, device):
|
||||
self.model.to(device)
|
|
@ -0,0 +1,303 @@
|
|||
import copy
|
||||
from typing import List
|
||||
|
||||
import torch
|
||||
|
||||
from .huggingface import HuggingFaceModel
|
||||
|
||||
IGNORE_INDEX = -100
|
||||
|
||||
|
||||
class ChatGLMModel(HuggingFaceModel):
|
||||
def _get_truncated_prompts(self, inputs: List[str], max_new_tokens: int) -> List[str]:
|
||||
truncated_inputs = copy.deepcopy(inputs)
|
||||
# Adapted from https://github.com/THUDM/ChatGLM-6B/blob/main/ptuning/main.py#L187
|
||||
for i, input in enumerate(inputs):
|
||||
a_ids = self.tokenizer.encode(text=input, truncation=False, add_special_tokens=False)
|
||||
|
||||
if len(a_ids) > self.model_max_length - max_new_tokens:
|
||||
half = (self.model_max_length - max_new_tokens) // 2
|
||||
prompt = self.tokenizer.decode(a_ids[:half], skip_special_tokens=True) + self.tokenizer.decode(
|
||||
a_ids[-half:], skip_special_tokens=True
|
||||
)
|
||||
truncated_inputs[i] = prompt
|
||||
|
||||
return truncated_inputs
|
||||
|
||||
@torch.no_grad()
|
||||
def get_loss(
|
||||
self, batch_prompt: List[str], batch_target: List[List[str]], pretrain: bool = False
|
||||
) -> List[List[float]]:
|
||||
"""
|
||||
Calculate loss only on target tokens.
|
||||
|
||||
Args:
|
||||
batch: A batch of prompt without target answer.
|
||||
batch_target: A batch of target answer. Sometimes one question can have multiple target answers.
|
||||
|
||||
Returns:
|
||||
Loss.
|
||||
|
||||
"""
|
||||
|
||||
# We set max_new_tokens in self._get_truncated_prompts to 0 because we only need logits to calculate loss.
|
||||
# We don't need to generate new tokens.
|
||||
# Target answer's length is usually << model_max_length, but we still call it in case.
|
||||
# We don't call self._get_truncated_prompts for batch_prompt because we need target answer's length first to reserve some space for target answer's tokens.
|
||||
batch_target = [self._get_truncated_prompts(prompt_target, 0) for prompt_target in batch_target]
|
||||
|
||||
# Get the number of target answers for different questions
|
||||
batch_target_nums = [len(prompt_target) for prompt_target in batch_target]
|
||||
|
||||
labels_list = []
|
||||
input_ids_list = []
|
||||
|
||||
for input, targets in zip(batch_prompt, batch_target):
|
||||
for target in targets:
|
||||
# Adapted from https://github.com/THUDM/ChatGLM-6B/blob/main/ptuning/main.py#L187
|
||||
# If there is no history, the prompt is just the query.
|
||||
# We don't need to override self.generate() in ChatGLM-6B but need to override it in ChatGLM2-6B.
|
||||
# See https://huggingface.co/THUDM/chatglm-6b/blob/main/modeling_chatglm.py#L1276
|
||||
target_tokenized = self.tokenizer.encode(text=target, add_special_tokens=False)
|
||||
|
||||
# Get prompt with length model_max_length - len(target_tokenized).
|
||||
# Reserve some space for target answer tokens using max_new_tokens.
|
||||
# This will generate the correct start_idx and end_idx.
|
||||
max_new_tokens = len(target_tokenized)
|
||||
|
||||
# Here 3 tokens are reserved for [gmask_id, bos_token, eos_id]. So we reserve max_new_tokens + 3 tokens.
|
||||
# See https://huggingface.co/THUDM/chatglm-6b/blob/main/tokenization_chatglm.py#L323
|
||||
prompt_with_correct_length = self._get_truncated_prompts([input], max_new_tokens + 3)[0]
|
||||
input_tokenized = self.tokenizer.encode(prompt_with_correct_length, add_special_tokens=False)
|
||||
|
||||
input_ids = self.tokenizer.build_inputs_with_special_tokens(input_tokenized, target_tokenized)
|
||||
|
||||
                context_length = input_ids.index(self.tokenizer.bos_token_id)
|
||||
|
||||
target_ids = [IGNORE_INDEX] * len(input_ids)
|
||||
|
||||
# -1 is for eos_token, we don't want to calculate loss on eos token.
|
||||
target_ids[-max_new_tokens - 1 : -1] = input_ids[-max_new_tokens - 1 : -1]
|
||||
|
||||
input_ids_list.append(torch.LongTensor(input_ids))
|
||||
labels_list.append(torch.LongTensor(target_ids))
|
||||
|
||||
# Because of multiple target answers, the final batch size may be greater than self.batch_size.
|
||||
# We will generate new batches.
|
||||
losses = []
|
||||
target_token_nums = []
|
||||
|
||||
batched_input_ids = [
|
||||
input_ids_list[i : i + self.batch_size] for i in range(0, len(input_ids_list), self.batch_size)
|
||||
]
|
||||
batched_labels = [labels_list[i : i + self.batch_size] for i in range(0, len(labels_list), self.batch_size)]
|
||||
|
||||
for batch_input_ids, batch_labels in zip(batched_input_ids, batched_labels):
|
||||
losses_per_batch, target_token_num_per_batch = self._calculate_loss(batch_input_ids, batch_labels)
|
||||
losses.extend(losses_per_batch)
|
||||
target_token_nums.extend(target_token_num_per_batch)
|
||||
|
||||
start_indice = 0
|
||||
losses_per_sample = []
|
||||
|
||||
target_token_nums_per_sample = []
|
||||
for length in batch_target_nums:
|
||||
losses_per_sample.append(losses[start_indice : start_indice + length])
|
||||
target_token_nums_per_sample.append(target_token_nums[start_indice : start_indice + length])
|
||||
start_indice += length
|
||||
|
||||
return losses_per_sample, target_token_nums_per_sample, None
|
||||
|
||||
def _calculate_loss(self, input_ids_list: List[torch.LongTensor], labels: List[torch.LongTensor]) -> List[float]:
|
||||
"""
|
||||
Calculate loss only on target tokens.
|
||||
Hugging Face generate() function can't return per sample loss.
|
||||
It will only return the mean of the loss in a batch.
|
||||
In torch.nn.CrossEntropyLoss(), reduction should be specified as "none" to get per sample loss.
|
||||
|
||||
Args:
|
||||
input_ids_list: A batch of input token ids.
|
||||
labels: A batch of labels.
|
||||
|
||||
Returns:
|
||||
A list of loss.
|
||||
|
||||
"""
|
||||
input_ids = torch.nn.utils.rnn.pad_sequence(
|
||||
input_ids_list, batch_first=True, padding_value=self.tokenizer.pad_token_id
|
||||
).to(torch.cuda.current_device())
|
||||
labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX).to(
|
||||
torch.cuda.current_device()
|
||||
)
|
||||
|
||||
outputs = self.model(input_ids)[0]
|
||||
|
||||
shift_logits = outputs[..., :-1, :].contiguous()
|
||||
shift_labels = labels[..., 1:].contiguous()
|
||||
|
||||
loss_fct = torch.nn.CrossEntropyLoss(reduction="none", ignore_index=IGNORE_INDEX)
|
||||
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)).view(shift_labels.size())
|
||||
|
||||
lens = (labels != IGNORE_INDEX).sum(-1).cpu().numpy()
|
||||
|
||||
loss_sum = loss.sum(-1).to(torch.float32).cpu().detach().numpy()
|
||||
return loss_sum.tolist(), lens.tolist()
|
||||
|
||||
|
||||
class ChatGLM2Model(ChatGLMModel):
|
||||
def _get_truncated_prompts(self, inputs: List[str], max_new_tokens: int) -> List[str]:
|
||||
truncated_inputs = copy.deepcopy(inputs)
|
||||
# Adapted from https://github.com/THUDM/ChatGLM2-6B/blob/main/ptuning/main.py#L180
|
||||
for i, input in enumerate(inputs):
|
||||
a_ids = self.tokenizer.encode(text=input, add_special_tokens=True, truncation=False)
|
||||
|
||||
if len(a_ids) > self.model_max_length - max_new_tokens:
|
||||
half = (self.model_max_length - max_new_tokens) // 2
|
||||
prompt = self.tokenizer.decode(a_ids[:half], skip_special_tokens=True) + self.tokenizer.decode(
|
||||
a_ids[-half:], skip_special_tokens=True
|
||||
)
|
||||
truncated_inputs[i] = prompt
|
||||
|
||||
return truncated_inputs
|
||||
|
||||
@torch.no_grad()
|
||||
def generate(self, inputs: List[str], max_new_tokens: int, **kwargs) -> List[str]:
|
||||
"""Generate results given a list of inputs and get logits of the first new token over choices.
|
||||
|
||||
Args:
|
||||
inputs: A list of strings.
|
||||
max_new_tokens: Max new tokens for generation.
|
||||
kwargs: Key arguments for generation
|
||||
|
||||
Returns:
|
||||
A list of generated strings and logits over choices.
|
||||
|
||||
Note:
|
||||
Currently the function only returns the logits of the first new token.
|
||||
It is used for single choice question.
|
||||
For multiple choices question, please avoid using the loss over choices.
|
||||
You should set argument choices as None in self.inference().
|
||||
|
||||
"""
|
||||
# Follow the process of model.chat() method in modeling_chatglm2.py
|
||||
# See https://huggingface.co/THUDM/chatglm2-6b/blob/main/modeling_chatglm.py#L1020
|
||||
# See https://huggingface.co/THUDM/chatglm2-6b/blob/main/modeling_chatglm.py#L1001
|
||||
|
||||
query = []
|
||||
for input in inputs:
|
||||
prompt = self.tokenizer.build_prompt(input, None)
|
||||
query.append(prompt)
|
||||
|
||||
truncated_query = self._get_truncated_prompts(query, max_new_tokens)
|
||||
|
||||
encoded_inputs = self.tokenizer(
|
||||
truncated_query,
|
||||
padding=True,
|
||||
truncation=True,
|
||||
return_tensors="pt",
|
||||
max_length=self.model_max_length - max_new_tokens,
|
||||
).to(torch.cuda.current_device())
|
||||
|
||||
# Set output_scores=True to get prediction scores.
|
||||
outputs = self.model.generate(
|
||||
**encoded_inputs, max_new_tokens=max_new_tokens, return_dict_in_generate=True, output_scores=True, **kwargs
|
||||
)
|
||||
|
||||
# We only need to decode predicted tokens.
|
||||
sequences = outputs.sequences[:, encoded_inputs["input_ids"].shape[1] :]
|
||||
|
||||
scores = []
|
||||
if self.indices_for_choices:
|
||||
# If the question is a single-choice question, we will return the scores of specific indices for first predicted token.
|
||||
# The indices are the tokenization results of the options for the single-choice question.
|
||||
# For example, if the options of the question are A, B, C and D, we only returns scores at indices of A, B, C and D.
|
||||
for option_indices in self.indices_for_choices:
|
||||
scores.append(outputs.scores[0][:, option_indices].detach().cpu())
|
||||
|
||||
scores = torch.max(torch.stack(scores), dim=0)[0]
|
||||
|
||||
decoded_sequences = self.tokenizer.batch_decode(sequences, skip_special_tokens=True)
|
||||
|
||||
return decoded_sequences, scores
|
||||
|
||||
@torch.no_grad()
|
||||
def get_loss(
|
||||
self, batch_prompt: List[str], batch_target: List[List[str]], pretrain: bool = False
|
||||
) -> List[List[float]]:
|
||||
"""
|
||||
Calculate loss only on target tokens.
|
||||
|
||||
Args:
|
||||
batch: A batch of prompt without target answer.
|
||||
batch_target: A batch of target answer. Sometimes one question can have multiple target answers.
|
||||
|
||||
Returns:
|
||||
Loss.
|
||||
|
||||
"""
|
||||
|
||||
# We set max_new_tokens in self._get_truncated_prompts to 0 because we only need logits to calculate loss.
|
||||
# We don't need to generate new tokens.
|
||||
# Target answer's length is usually << model_max_length, but we still call it in case.
|
||||
# We don't call self._get_truncated_prompts for batch_prompt because we need target answer's length first to reserve some space for target answer's tokens.
|
||||
batch_target = [self._get_truncated_prompts(prompt_target, 0) for prompt_target in batch_target]
|
||||
|
||||
# Get the number of target answers for different questions
|
||||
batch_target_nums = [len(prompt_target) for prompt_target in batch_target]
|
||||
|
||||
labels_list = []
|
||||
input_ids_list = []
|
||||
|
||||
for input, targets in zip(batch_prompt, batch_target):
|
||||
for target in targets:
|
||||
# Adapted from https://github.com/THUDM/ChatGLM2-6B/blob/main/ptuning/main.py#L180
|
||||
prompt = self.tokenizer.build_prompt(input, None)
|
||||
|
||||
target_tokenized = self.tokenizer.encode(
|
||||
text=target, add_special_tokens=False, truncation=True, max_length=self.model_max_length
|
||||
)
|
||||
|
||||
max_new_tokens = len(target_tokenized)
|
||||
prompt_with_correct_length = self._get_truncated_prompts([prompt], max_new_tokens)[0]
|
||||
input_tokenized = self.tokenizer.encode(
|
||||
prompt_with_correct_length,
|
||||
add_special_tokens=True,
|
||||
truncation=True,
|
||||
max_length=self.model_max_length,
|
||||
)
|
||||
|
||||
input_ids = input_tokenized + target_tokenized + [self.tokenizer.eos_token_id]
|
||||
target_ids = [IGNORE_INDEX] * len(input_ids)
|
||||
|
||||
# -1 is for "eos"
|
||||
target_ids[-max_new_tokens - 1 : -1] = input_ids[-max_new_tokens - 1 : -1]
|
||||
|
||||
input_ids_list.append(torch.LongTensor(input_ids))
|
||||
labels_list.append(torch.LongTensor(target_ids))
|
||||
|
||||
# Because of multiple target answers, the final batch size may be greater than self.batch_size.
|
||||
# We will generate new batches.
|
||||
losses = []
|
||||
target_token_nums = []
|
||||
|
||||
batched_input_ids = [
|
||||
input_ids_list[i : i + self.batch_size] for i in range(0, len(input_ids_list), self.batch_size)
|
||||
]
|
||||
batched_labels = [labels_list[i : i + self.batch_size] for i in range(0, len(labels_list), self.batch_size)]
|
||||
|
||||
for batch_input_ids, batch_labels in zip(batched_input_ids, batched_labels):
|
||||
losses_per_batch, target_token_num_per_batch = self._calculate_loss(batch_input_ids, batch_labels)
|
||||
losses.extend(losses_per_batch)
|
||||
target_token_nums.extend(target_token_num_per_batch)
|
||||
|
||||
start_indice = 0
|
||||
losses_per_sample = []
|
||||
|
||||
target_token_nums_per_sample = []
|
||||
for length in batch_target_nums:
|
||||
losses_per_sample.append(losses[start_indice : start_indice + length])
|
||||
target_token_nums_per_sample.append(target_token_nums[start_indice : start_indice + length])
|
||||
start_indice += length
|
||||
|
||||
return losses_per_sample, target_token_nums_per_sample, None
|
|
@ -0,0 +1,561 @@
|
|||
import copy
|
||||
import math
|
||||
from typing import Any, Dict, List, Optional, Tuple
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
from colossal_eval.utils import Conversation, get_batch_prompt, is_rank_0
|
||||
from peft import PeftModel
|
||||
from tqdm import tqdm
|
||||
from transformers import AutoConfig, AutoModel, AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
from colossalai.logging import DistributedLogger
|
||||
|
||||
from .base import BaseModel
|
||||
|
||||
IGNORE_INDEX = -100
|
||||
|
||||
|
||||
class HuggingFaceModel(BaseModel):
|
||||
"""
|
||||
Model wrapper around HuggingFace AutoModel models.
|
||||
|
||||
Args:
|
||||
path: The path to a HuggingFace model.
|
||||
model_max_length: The maximum sequence length of the model.
|
||||
tokenizer_path: The path to the tokenizer.
|
||||
tokenizer_kwargs: Keyword arguments for the tokenizer.
|
||||
peft_path: The name or path to the HuggingFace's PEFT model.
|
||||
model_kwargs: Keyword arguments for the model.
|
||||
prompt_template: The model's prompt template.
|
||||
batch_size: Batch size for inference.
|
||||
logger: Logger for the model.
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
path: str,
|
||||
model_max_length: int = 2048,
|
||||
tokenizer_path: Optional[str] = None,
|
||||
tokenizer_kwargs: dict = dict(),
|
||||
peft_path: Optional[str] = None,
|
||||
model_kwargs: Dict = None,
|
||||
prompt_template: Conversation = None,
|
||||
batch_size: int = 1,
|
||||
logger: DistributedLogger = None,
|
||||
):
|
||||
super().__init__(
|
||||
path=path,
|
||||
model_max_length=model_max_length,
|
||||
prompt_template=prompt_template,
|
||||
batch_size=batch_size,
|
||||
logger=logger,
|
||||
)
|
||||
self._load_tokenizer(path=path, tokenizer_path=tokenizer_path, tokenizer_kwargs=tokenizer_kwargs)
|
||||
|
||||
self._load_model(path=path, model_kwargs=model_kwargs, peft_path=peft_path)
|
||||
|
||||
def _get_choices_indices(self, language: str):
|
||||
"""
|
||||
Get indices for each choice
|
||||
|
||||
        Some tokenizers, such as Llama-2's, will insert a BOS token unless add_special_tokens=False is specified.
        The indices for the choices may also differ with the surrounding context. For example, with the Llama-2 tokenizer, in a Chinese context like "答案:{choice}" the indices for choices A, B, C and D are 29909, 29933, 29907 and 29928, while in an English context like "Answer: {choice}" they are 319, 350, 315 and 360.
        Run print(self.tokenizer("答案:A")) and print(self.tokenizer("Answer: A")) to inspect them.
|
||||
|
||||
"""
|
||||
|
||||
        # A trick to get all the token ids related to the given choices.
|
||||
self.indices_for_choices = [[] for _ in range(2)]
|
||||
for choice in self.choices:
|
||||
self.indices_for_choices[0].append(
|
||||
self.tokenizer(f"Answer: {choice}", add_special_tokens=False).input_ids[-1]
|
||||
)
|
||||
self.indices_for_choices[1].append(self.tokenizer(f"答案:{choice}", add_special_tokens=False).input_ids[-1])
|
||||
|
||||
def _load_tokenizer(self, path: str, tokenizer_path: Optional[str], tokenizer_kwargs: dict):
|
||||
"""
|
||||
Load tokenizer.
|
||||
|
||||
Args:
|
||||
path: The path to the model. Usually it also serves as the path to the tokenizer.
|
||||
            tokenizer_path: The path to the tokenizer.
|
||||
tokenizer_kwargs: Keyword arguments for the tokenizer.
|
||||
|
||||
"""
|
||||
|
||||
if self.batch_size > 1:
|
||||
tokenizer_kwargs.update({"padding_side": "left"})
|
||||
tokenizer_kwargs.update({"truncation_side": "left"})
|
||||
|
||||
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path if tokenizer_path else path, **tokenizer_kwargs)
|
||||
|
||||
if self.tokenizer.pad_token_id is None:
|
||||
self.logger.warning("pad_token_id is not set for the tokenizer. " "Using eos_token_id as pad_token_id.")
|
||||
if self.tokenizer.eos_token:
|
||||
self.tokenizer.pad_token = self.tokenizer.eos_token
|
||||
elif self.tokenizer.eod_id:
|
||||
# Qwen has an eod token "<|endoftext|>".
|
||||
self.tokenizer.pad_token_id = self.tokenizer.eod_id
|
||||
|
||||
def _load_model(self, path: str, model_kwargs: dict, peft_path: Optional[str] = None):
|
||||
"""
|
||||
Load model.
|
||||
|
||||
Args:
|
||||
path: The path to the model.
|
||||
model_kwargs: Keyword arguments for the model.
|
||||
peft_path: The path to the peft model.
|
||||
|
||||
"""
|
||||
|
||||
if "torch_dtype" in model_kwargs:
|
||||
model_kwargs["torch_dtype"] = eval(model_kwargs["torch_dtype"])
|
||||
|
||||
model_kwargs.setdefault("torch_dtype", torch.float16)
|
||||
|
||||
self.model = AutoModel.from_pretrained(path, **model_kwargs).to(torch.cuda.current_device())
|
||||
if peft_path is not None:
|
||||
self.model = PeftModel.from_pretrained(self.model, peft_path, is_trainable=False)
|
||||
self.model.eval()
|
||||
|
||||
def _calculate_loss(self, input_ids_list: List[torch.LongTensor], labels: List[torch.LongTensor]) -> Tuple[List]:
|
||||
"""
|
||||
Calculate loss only on target tokens.
|
||||
Hugging Face generate() function can't return per sample loss.
|
||||
It will only return the mean of the loss in a batch.
|
||||
In torch.nn.CrossEntropyLoss(), reduction should be specified as "none" to get per sample loss.
|
||||
|
||||
Args:
|
||||
input_ids_list: A batch of input token ids.
|
||||
labels: A batch of labels.
|
||||
|
||||
Returns:
|
||||
A list of loss.
|
||||
|
||||
"""
|
||||
input_ids = torch.nn.utils.rnn.pad_sequence(
|
||||
input_ids_list, batch_first=True, padding_value=self.tokenizer.pad_token_id
|
||||
).to(torch.cuda.current_device())
|
||||
labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX).to(
|
||||
torch.cuda.current_device()
|
||||
)
|
||||
attention_mask = input_ids.ne(self.tokenizer.pad_token_id).to(torch.cuda.current_device())
|
||||
|
||||
outputs = self.model(input_ids, attention_mask=attention_mask)[0]
|
||||
|
||||
shift_logits = outputs[..., :-1, :].contiguous()
|
||||
shift_labels = labels[..., 1:].contiguous()
|
||||
|
||||
loss_fct = torch.nn.CrossEntropyLoss(reduction="none", ignore_index=IGNORE_INDEX)
|
||||
loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)).view(shift_labels.size())
|
||||
|
||||
lens = (labels != IGNORE_INDEX).sum(-1).cpu().numpy()
|
||||
|
||||
loss_sum = loss.sum(-1).to(torch.float32).cpu().detach().numpy()
|
||||
return loss_sum.tolist(), lens.tolist()
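
        # Illustrative sketch (shapes assumed, not from the original file): with
        # reduction="none" the per-token loss keeps shape (batch, seq_len - 1), and
        # positions whose shifted label equals IGNORE_INDEX contribute 0, so
        # loss.sum(-1) is each sample's total loss over its target tokens while
        # (labels != IGNORE_INDEX).sum(-1) counts those target tokens, e.g.
        #     labels = [[-100, -100, 52, 17], [-100, 9, -100, -100]]  ->  lens = [2, 1]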
|
||||
|
||||
def _get_truncated_prompts(self, inputs: List[str], max_new_tokens: int) -> List[str]:
|
||||
"""
|
||||
Truncate the input sequence to fit model_max_length (we suggest truncate in the middle, since the left and right side may contain crucial instructions)
|
||||
https://github.com/THUDM/LongBench/blob/main/pred.py#L16
|
||||
|
||||
Args:
|
||||
inputs: A batch of input prompts.
|
||||
max_new_tokens: Max new tokens for model to generate.
|
||||
|
||||
Returns:
|
||||
Truncated prompts.
|
||||
|
||||
"""
|
||||
|
||||
truncated_inputs = copy.deepcopy(inputs)
|
||||
for i, input in enumerate(inputs):
|
||||
tokenized_prompt = self.tokenizer(input, truncation=False, return_tensors="pt").input_ids[0]
|
||||
if len(tokenized_prompt) > self.model_max_length - max_new_tokens:
|
||||
half = (self.model_max_length - max_new_tokens) // 2
|
||||
prompt = self.tokenizer.decode(
|
||||
tokenized_prompt[:half], skip_special_tokens=True
|
||||
) + self.tokenizer.decode(tokenized_prompt[-half:], skip_special_tokens=True)
|
||||
truncated_inputs[i] = prompt
|
||||
|
||||
return truncated_inputs
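
        # Illustrative sketch (numbers assumed, not from the original file): with
        # model_max_length = 2048 and max_new_tokens = 256, a prompt that tokenizes to
        # 3000 tokens is cut to its first 896 and last 896 tokens (half = 1792 // 2),
        # so instructions at both ends of the prompt survive truncation.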
|
||||
|
||||
def _get_input_ids_and_labels_pretrain(self, batch_prompt: List[str]) -> Tuple[List[torch.LongTensor]]:
|
||||
"""
|
||||
Get input_ids and labels for pretrain data.
|
||||
        We only need batch_prompt because for a pretrain dataset we don't need to predict new tokens.
|
||||
|
||||
Args:
|
||||
batch_prompt: A batch of prompt.
|
||||
|
||||
Returns:
|
||||
Input_ids and labels for the given batch.
|
||||
|
||||
"""
|
||||
input_ids_list = []
|
||||
labels_list = []
|
||||
bytes_list = []
|
||||
|
||||
for input in batch_prompt:
|
||||
# Pretrain data tends to be very long, sometimes much larger than the model_max_length, we only tokenize 1/ratio of the data first to accelerate the tokenization process.
|
||||
# Once the length of the result is greater or equal to model_max_length, we stop iterating on ratios and use the result as input_ids and labels.
|
||||
# After all, the rest of the original string doesn't need to be tokenized at the first place.
|
||||
ratio = [16, 8, 4, 2, 1]
|
||||
tokenized = None
|
||||
for r in ratio:
|
||||
tokenized = self.tokenizer(
|
||||
[input[0 : len(input) // r]], truncation=True, max_length=self.model_max_length, return_tensors="pt"
|
||||
)
|
||||
if tokenized.input_ids.size(1) >= self.model_max_length:
|
||||
break
|
||||
|
||||
input_ids = copy.deepcopy(tokenized["input_ids"])[0]
|
||||
target_ids = copy.deepcopy(input_ids)
|
||||
|
||||
string = self.tokenizer.decode(tokenized.input_ids[0], skip_special_tokens=True)
|
||||
|
||||
bytes_list.append(len(string.encode("utf-8")))
|
||||
|
||||
input_ids_list.append(input_ids)
|
||||
labels_list.append(target_ids)
|
||||
|
||||
return input_ids_list, labels_list, bytes_list
|
||||
|
||||
def _get_input_ids_and_labels(
|
||||
self, batch_prompt: List[str], batch_target: List[List[str]], pretrain: bool
|
||||
) -> Tuple[List[torch.LongTensor]]:
|
||||
"""
|
||||
Get input_ids and labels for the given data.
|
||||
|
||||
Args:
|
||||
batch_prompt: A batch of prompt.
|
||||
batch_target: A batch of target.
|
||||
|
||||
Returns:
|
||||
Input_ids and labels for the given batch.
|
||||
|
||||
"""
|
||||
if pretrain:
|
||||
return self._get_input_ids_and_labels_pretrain(batch_prompt)
|
||||
|
||||
input_ids_list = []
|
||||
labels_list = []
|
||||
|
||||
for input, targets in zip(batch_prompt, batch_target):
|
||||
for target in targets:
|
||||
# TODO: Improve the labeling process. Should annotate the border by adding special tokens.
|
||||
target_tokenized = self.tokenizer(
|
||||
[target], truncation=True, max_length=self.model_max_length, return_tensors="pt"
|
||||
)
|
||||
|
||||
# Get prompt with length model_max_length - len(target_tokenized).
|
||||
# Reserve some space for target answer tokens using max_new_tokens.
|
||||
# This will generate the correct start_idx and end_idx.
|
||||
max_new_tokens = target_tokenized["input_ids"][0].size(0)
|
||||
prompt_with_correct_length = self._get_truncated_prompts([input], max_new_tokens)[0]
|
||||
input_tokenized = self.tokenizer(
|
||||
[prompt_with_correct_length],
|
||||
truncation=True,
|
||||
max_length=self.model_max_length - max_new_tokens,
|
||||
return_tensors="pt",
|
||||
)
|
||||
|
||||
target_tokenized = self.tokenizer(
|
||||
[prompt_with_correct_length + target],
|
||||
truncation=True,
|
||||
max_length=self.model_max_length,
|
||||
return_tensors="pt",
|
||||
)
|
||||
|
||||
start_idx = input_tokenized["input_ids"][0].size(0)
|
||||
end_idx = target_tokenized["input_ids"][0].size(0)
|
||||
|
||||
# Sometimes if the target is only an option such as A, B, C and D, the length of input_tokenized is equal to the length of target_tokenized, so we need -1.
|
||||
# This is caused by the different behavior of tokenizers.
|
||||
# For example, the tokenizer for Baichuan and Llama will cause such problem in a plain prompt setting.
|
||||
# The length of the tokenized sequences for prompt "Answer: " and "Answer: A" is the same.
|
||||
# Baichuan: [29394, 31143, 31106] [29394, 31143, 703]
|
||||
# Llama: [673, 29901, 29871] [673, 29901, 319]
|
||||
# The length for sequence "prompt" and "prompt + A" is equal.
|
||||
# For ChatGLM, the length of the tokenized sequences is different.
|
||||
# ChatGLM: [16583, 12] [16583, 12, 167]
|
||||
|
||||
if start_idx == end_idx:
|
||||
start_idx -= 1
|
||||
|
||||
input_ids = copy.deepcopy(target_tokenized["input_ids"])[0]
|
||||
target_ids = copy.deepcopy(input_ids)
|
||||
|
||||
mask = torch.zeros_like(target_ids, dtype=torch.bool)
|
||||
mask[start_idx:end_idx] = True
|
||||
|
||||
target_ids[~mask] = IGNORE_INDEX
|
||||
|
||||
input_ids_list.append(input_ids)
|
||||
labels_list.append(target_ids)
|
||||
|
||||
return input_ids_list, labels_list, None
|
||||
|
||||
def inference(self, data: List[Dict], inference_kwargs: Dict[str, Any], debug: bool = False) -> List[Dict]:
|
||||
"""
|
||||
Infer the given data.
|
||||
This function will call self.generate() to get model outputs and also self.model() to get logits.
|
||||
|
||||
Args:
|
||||
data: The data for inference.
|
||||
inference_kwargs: Arguments for inference.
|
||||
debug: Whether to display generated prompt for debugging.
|
||||
|
||||
Returns:
|
||||
Inference results.
|
||||
|
||||
"""
|
||||
calculate_loss = inference_kwargs["calculate_loss"]
|
||||
classes = inference_kwargs["all_classes"]
|
||||
language = inference_kwargs["language"]
|
||||
pretrain = inference_kwargs["pretrain"]
|
||||
max_new_tokens = inference_kwargs["max_new_tokens"]
|
||||
few_shot_data = inference_kwargs.get("few_shot_data", None)
|
||||
|
||||
        # Some classification questions' options are full texts rather than single letters such as A, B, C and D.
|
||||
# If the text length is greater than 1, we won't calculate loss over choices.
|
||||
if classes is not None and any(len(c) > 1 for c in classes):
|
||||
classes = None
|
||||
|
||||
self.choices = classes
|
||||
self.indices_for_choices = None
|
||||
if self.choices:
|
||||
# Get indices for each choice
|
||||
self._get_choices_indices(language)
|
||||
|
||||
self.str_label_map = {choice: idx for idx, choice in enumerate(self.choices)}
|
||||
|
||||
bar = tqdm(
|
||||
range(math.ceil(len(data) / self.batch_size)),
|
||||
desc=f"{data[0]['dataset']}-{data[0]['category']} Inference steps",
|
||||
disable=not is_rank_0(),
|
||||
)
|
||||
loss_fct = torch.nn.CrossEntropyLoss(reduction="none")
|
||||
|
||||
answers = copy.deepcopy(data)
|
||||
for i in range(0, len(data), self.batch_size):
|
||||
batch = data[i : i + self.batch_size]
|
||||
batch_prompt, batch_target = get_batch_prompt(
|
||||
self.prompt_template, batch, few_shot_data, self.tokenizer, language, self.model_max_length
|
||||
)
|
||||
|
||||
if is_rank_0() and debug and i == 0:
|
||||
self.logger.info(
|
||||
f"Inference arguments for dataset {data[0]['dataset']} category {data[0]['category']} is:\n{inference_kwargs}"
|
||||
)
|
||||
self.logger.info("-" * 120)
|
||||
self.logger.info("An example prompt and prompt with target is:")
|
||||
self.logger.info("-" * 120)
|
||||
self.logger.info(batch_prompt[0])
|
||||
self.logger.info("-" * 120)
|
||||
self.logger.info(batch_prompt[0] + batch_target[0][0])
|
||||
|
||||
if not pretrain:
|
||||
batch_decodes, scores = self.generate(batch_prompt, max_new_tokens)
|
||||
|
||||
if calculate_loss:
|
||||
batch_losses, batch_target_token_nums, batch_bytes_nums = self.get_loss(
|
||||
batch_prompt, batch_target, pretrain
|
||||
)
|
||||
|
||||
probs = []
|
||||
if self.indices_for_choices:
|
||||
scores = scores.to(torch.float32)
|
||||
# If we have indices_for_choices(must be single-choice question), there will be only one target answer for one data sample.
|
||||
# Otherwise this will violate the single-choice setting.
|
||||
|
||||
if calculate_loss:
|
||||
labels = [self.str_label_map[answers[i + j]["target"]] for j in range(len(batch_decodes))]
|
||||
|
||||
loss_over_choices = loss_fct(scores, torch.tensor(labels, dtype=torch.long)).numpy().tolist()
|
||||
|
||||
probs = torch.nn.functional.softmax(scores, dim=-1).numpy().tolist()
|
||||
probs = [
|
||||
{choice: probs[i][self.str_label_map[choice]] for choice in self.choices} for i in range(len(probs))
|
||||
]
|
||||
|
||||
for j in range(len(batch_prompt)):
|
||||
if not pretrain:
|
||||
answers[i + j]["output"] = batch_decodes[j].strip()
|
||||
|
||||
if isinstance(scores, torch.Tensor):
|
||||
answers[i + j]["softmax_over_choices"] = probs[j]
|
||||
|
||||
if calculate_loss:
|
||||
answers[i + j]["loss_over_choices"] = loss_over_choices[j]
|
||||
|
||||
if calculate_loss:
|
||||
answers[i + j]["loss"] = (np.array(batch_losses[j]) / np.array(batch_target_token_nums[j])).tolist()
|
||||
|
||||
                    # loss_sum is specially used for the pretrain dataset to calculate per-byte perplexity.
|
||||
# However, loss (which is per sample loss) suffices for most cases.
|
||||
answers[i + j]["loss_sum"] = batch_losses[j]
|
||||
answers[i + j]["token_num"] = batch_target_token_nums[j]
|
||||
|
||||
if batch_bytes_nums:
|
||||
answers[i + j]["byte_num"] = batch_bytes_nums[j]
|
||||
|
||||
bar.update()
|
||||
|
||||
return answers
|
||||
|
||||
@torch.no_grad()
|
||||
def generate(self, inputs: List[str], max_new_tokens: int, **kwargs) -> List[str]:
|
||||
"""Generate results given a list of inputs and get logits of the first new token over choices.
|
||||
|
||||
Args:
|
||||
inputs: A list of strings.
|
||||
max_new_tokens: Max new tokens for generation.
|
||||
kwargs: Key arguments for generation
|
||||
|
||||
Returns:
|
||||
A list of generated strings and logits over choices.
|
||||
|
||||
Note:
|
||||
Currently the function only returns the logits of the first new token.
|
||||
It is used for single choice question.
|
||||
For multiple choices question, please avoid using the loss over choices.
|
||||
You should set argument choices as None in self.inference().
|
||||
|
||||
"""
|
||||
truncated_inputs = self._get_truncated_prompts(inputs, max_new_tokens)
|
||||
|
||||
encoded_inputs = self.tokenizer(
|
||||
truncated_inputs,
|
||||
padding=True,
|
||||
truncation=True,
|
||||
return_tensors="pt",
|
||||
return_token_type_ids=False,
|
||||
max_length=self.model_max_length - max_new_tokens,
|
||||
).to(torch.cuda.current_device())
|
||||
|
||||
# Set output_scores=True to get prediction scores.
|
||||
outputs = self.model.generate(
|
||||
**encoded_inputs, max_new_tokens=max_new_tokens, return_dict_in_generate=True, output_scores=True, **kwargs
|
||||
)
|
||||
|
||||
# We only need to decode predicted tokens.
|
||||
sequences = outputs.sequences[:, encoded_inputs["input_ids"].shape[1] :]
|
||||
|
||||
scores = []
|
||||
if self.indices_for_choices:
|
||||
# If the question is a single-choice question, we will return the scores of specific indices for first predicted token.
|
||||
# The indices are the tokenization results of the options for the single-choice question.
|
||||
# For example, if the options of the question are A, B, C and D, we only returns scores at indices of A, B, C and D.
|
||||
for option_indices in self.indices_for_choices:
|
||||
scores.append(outputs.scores[0][:, option_indices].detach().cpu())
|
||||
|
||||
scores = torch.max(torch.stack(scores), dim=0)[0]
|
||||
|
||||
decoded_sequences = self.tokenizer.batch_decode(sequences, skip_special_tokens=True)
|
||||
|
||||
return decoded_sequences, scores
|
||||
|
||||
    @torch.no_grad()
    def get_loss(self, batch_prompt: List[str], batch_target: List[List[str]], pretrain: bool) -> List[List[float]]:
        """
        Calculate loss only on target tokens.

        Args:
            batch_prompt: A batch of prompts without target answers.
            batch_target: A batch of target answers. Sometimes one question can have multiple target answers.

        Returns:
            Loss.

        """

        # We set max_new_tokens in self._get_truncated_prompts to 0 because we only need logits to calculate loss.
        # We don't need to generate new tokens.
        # The target answer's length is usually much shorter than model_max_length, but we still call it just in case.
        # We don't call self._get_truncated_prompts for batch_prompt because we need the target answer's length first to reserve some space for the target answer's tokens.
        if not pretrain:
            batch_target = [self._get_truncated_prompts(prompt_target, 0) for prompt_target in batch_target]

        # Get the number of target answers for different questions
        batch_target_nums = [len(prompt_target) for prompt_target in batch_target]

        input_ids_list, labels_list, bytes_list = self._get_input_ids_and_labels(batch_prompt, batch_target, pretrain)

        # Because of multiple target answers, the final batch size may be greater than self.batch_size.
        # We will generate new batches.
        losses = []
        target_token_nums = []

        batched_input_ids = [
            input_ids_list[i : i + self.batch_size] for i in range(0, len(input_ids_list), self.batch_size)
        ]
        batched_labels = [labels_list[i : i + self.batch_size] for i in range(0, len(labels_list), self.batch_size)]

        for batch_input_ids, batch_labels in zip(batched_input_ids, batched_labels):
            losses_per_batch, target_token_num_per_batch = self._calculate_loss(batch_input_ids, batch_labels)
            losses.extend(losses_per_batch)
            target_token_nums.extend(target_token_num_per_batch)

        start_indice = 0
        losses_per_sample = []

        target_token_nums_per_sample = []
        bytes_nums_per_sample = []
        for length in batch_target_nums:
            losses_per_sample.append(losses[start_indice : start_indice + length])
            target_token_nums_per_sample.append(target_token_nums[start_indice : start_indice + length])

            if bytes_list:
                bytes_nums_per_sample.append(bytes_list[start_indice : start_indice + length])

            start_indice += length

        if bytes_list:
            return losses_per_sample, target_token_nums_per_sample, bytes_nums_per_sample

        return losses_per_sample, target_token_nums_per_sample, None

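As an illustration of the comments above (not part of the original code): one way the per-target losses, token counts, and byte counts returned by `get_loss` could be turned into the per-token loss and a per-byte perplexity for pretrain-style data. The numeric values are made up.

```python
import math

import numpy as np

# Assumed per-target values for a single sample, mirroring "loss_sum", "token_num" and "byte_num".
losses = [12.4, 15.1]   # summed token loss per target answer
token_nums = [6, 7]     # number of target tokens per answer
byte_nums = [24, 30]    # number of bytes per answer (pretrain datasets only)

# Per-sample loss, as stored in answers[i + j]["loss"]:
loss_per_token = (np.array(losses) / np.array(token_nums)).tolist()

# One common definition of per-byte perplexity: exp(total loss / number of bytes).
ppl_per_byte = [math.exp(l / b) for l, b in zip(losses, byte_nums)]
print(loss_per_token, ppl_per_byte)
```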
class HuggingFaceCausalLM(HuggingFaceModel):
    """
    Model wrapper around HuggingFace AutoModelForCausalLM models.

    Args:
        path: The path to a HuggingFace model.
        model_max_length: The maximum sequence length of the model.
        tokenizer_path: The path to the tokenizer.
        tokenizer_kwargs: Keyword arguments for the tokenizer.
        peft_path: The name or path to the HuggingFace's PEFT model.
        model_kwargs: Keyword arguments for the model.
        prompt_template: The model's prompt template.
        batch_size: Batch size for inference.
        logger: Logger for the model.

    """

    def _load_model(self, path: str, model_kwargs: dict, peft_path: Optional[str] = None):
        """
        Load model.

        Args:
            path: The path to the model.
            model_kwargs: Keyword arguments for the model.
            peft_path: The path to the peft model.

        """

        if "torch_dtype" in model_kwargs:
            model_kwargs["torch_dtype"] = eval(model_kwargs["torch_dtype"])

        if "config" in model_kwargs:
            model_kwargs["config"] = AutoConfig.from_pretrained(model_kwargs["config"])

        model_kwargs.setdefault("torch_dtype", torch.float16)
        self.model = AutoModelForCausalLM.from_pretrained(path, **model_kwargs).to(torch.cuda.current_device())
        if peft_path is not None:
            self.model = PeftModel.from_pretrained(self.model, peft_path, is_trainable=False)
        self.model.eval()

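A minimal standalone sketch (not part of the diff) of the `torch_dtype` handling in `_load_model`: configuration files typically carry the dtype as a string such as `"torch.bfloat16"`, which is turned into the real dtype object with `eval`, with `torch.float16` as the fallback. This keeps configs plain JSON, at the cost of assuming they come from a trusted source.

```python
import torch

model_kwargs = {"torch_dtype": "torch.bfloat16"}  # as it would appear in a JSON config

if "torch_dtype" in model_kwargs:
    model_kwargs["torch_dtype"] = eval(model_kwargs["torch_dtype"])

model_kwargs.setdefault("torch_dtype", torch.float16)
print(model_kwargs["torch_dtype"])  # torch.bfloat16
```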
@ -0,0 +1,4 @@
from .conversation import Conversation, get_batch_prompt, prompt_templates
from .utilities import get_json_list, is_rank_0, jdump, jload

__all__ = ["Conversation", "prompt_templates", "get_batch_prompt", "is_rank_0", "jload", "jdump", "get_json_list"]

@ -0,0 +1,231 @@
import dataclasses
from enum import Enum, auto
from typing import Dict, List, Optional, Tuple

from transformers import AutoTokenizer


class SeparatorStyle(Enum):
    ADD_BOS_EOS_TOKEN = auto()
    ALPACA = auto()
    PLAIN = auto()


@dataclasses.dataclass
class Conversation:
    system: str
    roles: List[str]
    messages: List[List[str]]
    offset: int
    sep_style: SeparatorStyle = SeparatorStyle.ADD_BOS_EOS_TOKEN
    sep: str = "</s>"

    def clear(self):
        self.messages = []

    def get_prompt(self):
        if self.sep_style == SeparatorStyle.ADD_BOS_EOS_TOKEN:
            ret = self.system
            for role, message in self.messages:
                if message:
                    ret += role + ": " + "<s>" + message + self.sep
                else:
                    ret += role + ": " + "<s>"
            return ret
        elif self.sep_style == SeparatorStyle.ALPACA:
            ret = self.system + self.sep
            for role, message in self.messages:
                if message:
                    ret += role + ":\n" + message + self.sep
                else:
                    ret += role + ":"
            return ret
        elif self.sep_style == SeparatorStyle.PLAIN:
            ret = self.system
            for role, message in self.messages:
                if message:
                    ret += message
                else:
                    ret += ""
            return ret
        else:
            raise ValueError(f"Invalid style: {self.sep_style}")

    def get_prompt_with_target(self, target):
        prompt = self.get_prompt()
        prompt_with_target = []

        # Some datasets provide multiple target answers.
        # This makes it difficult to calculate loss.
        # We convert target into List[str] first if the question only has one target answer.
        target_answers = []
        if isinstance(target, str):
            target_answers = [target]
        else:
            target_answers = target

        for target_answer in target_answers:
            if self.sep_style == SeparatorStyle.ADD_BOS_EOS_TOKEN:
                prompt_with_target.append(prompt + target_answer)
            elif self.sep_style == SeparatorStyle.ALPACA:
                prompt_with_target.append(prompt + target_answer)
            elif self.sep_style == SeparatorStyle.PLAIN:
                prompt_with_target.append(prompt + target_answer)
            else:
                raise ValueError(f"Invalid style: {self.sep_style}")

        return prompt_with_target

    def save_prompt(self):
        if self.sep_style == SeparatorStyle.ADD_BOS_EOS_TOKEN:
            ret = self.system
            for role, message in self.messages:
                if message:
                    ret += role + ": " + "<s>" + message + "</s>\n"
                else:
                    ret += role + ": " + "<s>"
            return ret
        else:
            raise ValueError(f"Invalid style: {self.sep_style}")

    def append_message(self, role, message):
        self.messages.append([role, message])

    def copy(self):
        return Conversation(
            system=self.system,
            roles=self.roles,
            messages=[[x, y] for x, y in self.messages],
            offset=self.offset,
            sep_style=self.sep_style,
            sep=self.sep,
        )

    def dict(self):
        return {
            "system": self.system,
            "roles": self.roles,
            "messages": self.messages,
            "offset": self.offset,
            "sep_style": self.sep_style,
            "sep": self.sep,
        }

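A small usage sketch (not part of the file): building a prompt with a `Conversation` in the `ADD_BOS_EOS_TOKEN` style, the same style as the `conv_coati` template defined later in this file.

```python
conv = Conversation(
    system="A chat between a curious human and an artificial intelligence assistant.\n\n",
    roles=("Human", "Assistant"),
    messages=[],
    offset=0,
    sep_style=SeparatorStyle.ADD_BOS_EOS_TOKEN,
    sep="</s>",
)
conv.append_message(conv.roles[0], "What is the capital of France?")
conv.append_message(conv.roles[1], None)

print(conv.get_prompt())
# ...Human: <s>What is the capital of France?</s>Assistant: <s>
print(conv.get_prompt_with_target("Paris."))
# the same prompt with the target answer appended, returned as a one-element list
```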
def get_few_shot_prefix(
    conv: Conversation, few_shot_data: List[str], tokenizer: Optional[AutoTokenizer], language: str, max_tokens: int
) -> str:
    """
    Get few-shot prefix.

    Args:
        conv: Conversation template.
        few_shot_data: Few-shot examples used to generate the few-shot prompt prefix.

    Returns:
        Few-shot prompt prefix.
    """

    if language == "English":
        few_shot_prefix = "The following are answers for questions in an exam.\n\n"
    elif language == "Chinese":
        few_shot_prefix = "以下是考试中各个问题的答案。\n\n"

    output = None
    for i in range(len(few_shot_data)):
        few_shot_prefix = few_shot_prefix + few_shot_data[i] + "\n\n"

        if len(tokenizer([few_shot_prefix]).input_ids[0]) <= max_tokens:
            output = few_shot_prefix
        else:
            break

    return output if output is not None else few_shot_prefix

def get_batch_prompt(
    conv: Conversation,
    batch: List[Dict],
    few_shot_data: List[str],
    tokenizer: Optional[AutoTokenizer],
    language: Optional[str],
    model_max_length: Optional[int],
) -> Tuple[List[Dict], List[Dict]]:
    """
    Get batch prompt and target.

    Args:
        conv: Conversation template.
        batch: Batch data to generate prompts from.
        few_shot_data: Few-shot data used to generate the few-shot prompt prefix.

    Returns:
        Tuple containing batch prompt and target.

    """

    batch_prompt = []
    batch_target = []

    if isinstance(batch[0], dict):
        for b in batch:
            few_shot_prefix = ""
            if few_shot_data is not None:
                # For few-shot, we only need the input. Otherwise, use the instruction (in AGIEval).
                query_text = b["input"] if b.get("input", "") != "" else b["instruction"]

                if isinstance(b["target"], str):
                    zero_shot_prompt = query_text + b["target"]
                    max_tokens = model_max_length - len(tokenizer([zero_shot_prompt]).input_ids[0])
                else:
                    raise Exception("When using few-shot, the target answer should be a string.")

                few_shot_prefix = get_few_shot_prefix(conv, few_shot_data, tokenizer, language, max_tokens)
            else:
                query_text = b["instruction"] + "\n\n" + b["input"] if b.get("input", "") != "" else b["instruction"]

            conv.append_message(conv.roles[0], few_shot_prefix + query_text)
            conv.append_message(conv.roles[1], None)

            batch_prompt.append(conv.get_prompt())

            target = b["target"]
            if isinstance(b["target"], str):
                target = [target]

            batch_target.append(target)

            conv.clear()

    return batch_prompt, batch_target

conv_coati = Conversation(
    system="A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
    roles=("Human", "Assistant"),
    messages=[],
    offset=0,
    sep_style=SeparatorStyle.ADD_BOS_EOS_TOKEN,
    sep="</s>",
)

conv_alpaca = Conversation(
    system="Below is an instruction that describes a task. Write a response that appropriately completes the request.",
    roles=("### Instruction", "### Response"),
    messages=[],
    offset=0,
    sep_style=SeparatorStyle.ALPACA,
    sep="\n\n",
)

conv_plain = Conversation(
    system="",
    roles=("", ""),
    messages=[],
    offset=0,
    sep_style=SeparatorStyle.PLAIN,
    sep="",
)

prompt_templates = {"coati": conv_coati, "alpaca": conv_alpaca, "plain": conv_plain}

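A usage sketch (not part of the file): assembling a zero-shot batch with `get_batch_prompt` and the `plain` template. The `gpt2` tokenizer and the sample record are just stand-ins for the sketch.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

batch = [
    {"instruction": "What is Bitcoin?", "input": "", "target": "A decentralized digital currency."},
]

batch_prompt, batch_target = get_batch_prompt(
    conv=prompt_templates["plain"],
    batch=batch,
    few_shot_data=None,  # zero-shot: no few-shot prefix is built
    tokenizer=tokenizer,
    language="English",
    model_max_length=1024,
)
print(batch_prompt)  # ['What is Bitcoin?']
print(batch_target)  # [['A decentralized digital currency.']]
```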
@ -0,0 +1,62 @@
import io
import json
import os

import torch.distributed as dist


def is_rank_0() -> bool:
    return not dist.is_initialized() or dist.get_rank() == 0


def _make_w_io_base(f, mode: str):
    if not isinstance(f, io.IOBase):
        f_dirname = os.path.dirname(f)
        if f_dirname != "":
            os.makedirs(f_dirname, exist_ok=True)
        f = open(f, mode=mode, encoding="utf-8")
    return f


def _make_r_io_base(f, mode: str):
    if not isinstance(f, io.IOBase):
        f = open(f, mode=mode, encoding="utf-8")
    return f


def jdump(obj, f, mode="w", indent=4, default=str):
    """
    Dump a str or dictionary to a file in json format.

    Args:
        obj: An object to be written.
        f: A string path to the location on disk.
        mode: Mode for opening the file.
        indent: Indent for storing json dictionaries.
        default: A function to handle non-serializable entries; defaults to `str`.

    """
    f = _make_w_io_base(f, mode)
    if isinstance(obj, (dict, list)):
        json.dump(obj, f, indent=indent, default=default, ensure_ascii=False)
    elif isinstance(obj, str):
        f.write(obj)
    else:
        raise ValueError(f"Unexpected type: {type(obj)}")
    f.close()


def jload(f, mode="r"):
    """Load a .json file into a dictionary."""
    f = _make_r_io_base(f, mode)
    jdict = json.load(f)
    f.close()
    return jdict


def get_json_list(file_path):
    with open(file_path, "r") as f:
        json_list = []
        for line in f:
            json_list.append(json.loads(line if line != "null" else line))
        return json_list

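A quick sketch (not part of the file): round-tripping a record through `jdump` and `jload`. Note that `jdump` writes with `ensure_ascii=False`, so CJK text in the datasets is stored readably.

```python
record = [{"category": "open_qa", "instruction": "什么是比特币?", "id": 18}]

jdump(record, "tmp_sample.json")
assert jload("tmp_sample.json") == record
```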
@ -0,0 +1,44 @@
{
    "language": "cn",
    "category": {
        "brainstorming": {
            "GPT": [
                "language organization",
                "relevance",
                "creativity",
                "practicality",
                "reasonableness"
            ]
        },
        "chat": {
            "GPT": [
                "language organization",
                "naturalness",
                "engagingness",
                "fidelity"
            ]
        },
        "generation": {
            "GPT": [
                "language organization",
                "relevance",
                "diversity"
            ]
        },
        "open_qa": {
            "GPT": [
                "language organization",
                "relevance",
                "correctness"
            ]
        },
        "roleplay": {
            "GPT": [
                "language organization",
                "relevance",
                "fidelity",
                "creativity"
            ]
        }
    }
}

@ -0,0 +1,44 @@
{
    "language": "en",
    "category": {
        "brainstorming": {
            "GPT": [
                "language organization",
                "relevance",
                "creativity",
                "practicality",
                "reasonableness"
            ]
        },
        "chat": {
            "GPT": [
                "language organization",
                "naturalness",
                "engagingness",
                "fidelity"
            ]
        },
        "generation": {
            "GPT": [
                "language organization",
                "relevance",
                "diversity"
            ]
        },
        "open_qa": {
            "GPT": [
                "language organization",
                "relevance",
                "correctness"
            ]
        },
        "roleplay": {
            "GPT": [
                "language organization",
                "relevance",
                "fidelity",
                "creativity"
            ]
        }
    }
}

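The two JSON files above configure which GPT metrics are computed for each category, for Chinese ("cn") and English ("en") respectively. A brief sketch of reading one of them with the `jload` helper; the local file path is hypothetical.

```python
config = jload("config_en.json")  # hypothetical path to the "en" config shown above

for category, settings in config["category"].items():
    print(category, "->", settings["GPT"])
# brainstorming -> ['language organization', 'relevance', 'creativity', 'practicality', 'reasonableness']
# ...
```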
@ -0,0 +1,202 @@
|
|||
[
|
||||
{
|
||||
"category": "brainstorming",
|
||||
"instruction": "列举一些可以促进头发生长的食物。",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 1
|
||||
},
|
||||
{
|
||||
"category": "brainstorming",
|
||||
"instruction": "中年夫妻如何提升夫妻感情,请给出三个实用的的方法,并举例说明。",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 2
|
||||
},
|
||||
{
|
||||
"category": "brainstorming",
|
||||
"instruction": "请列举4种日常的环保行为。",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 3
|
||||
},
|
||||
{
|
||||
"category": "brainstorming",
|
||||
"instruction": "请给出5个可以随时随地锻炼身体的小动作。",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 4
|
||||
},
|
||||
{
|
||||
"category": "brainstorming",
|
||||
"instruction": "请问如何制作一份美味的西红柿炒鸡蛋?",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 5
|
||||
},
|
||||
{
|
||||
"category": "chat",
|
||||
"instruction": "基于以下角色信息完成一段对话。小张是一名新手爱好者,对养鸡有浓厚的兴趣。老李是一名有丰富经验的养鸡大师。",
|
||||
"input": "小张:您好,老李,我最近开始对养鸡感兴趣了,想请教您一些问题。 老李:你好,小张,我很乐意帮助你。你想问些什么? 小张:我想知道如何确定鸡的品种和性别? 老李:确切的品种可以通过鸡的外貌特征来确定,而性别一般是通过鸡卵的大小和形状来判断。还有什么问题吗? 小张:",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 6
|
||||
},
|
||||
{
|
||||
"category": "chat",
|
||||
"instruction": "基于以下角色信息完成一段对话。李华是一名参加了期末考试的学生,他已经很担心自己的考试成绩。老师Lucy正在帮助他度过这个紧张的时刻。",
|
||||
"input": "李华:Lucy老师,我很担心自己的考试成绩,我不知道我是否能够通过这次考试。 Lucy:放松,李华,你已经做好了充分的准备。相信你自己,你会做得很好的。 李华:我很怕考试时会忘记自己所学的知识。 Lucy:你可以预留一些时间,过一遍自己所学的知识点或笔记,这样你会更有信心和准确地回答考题。 李华:如果我还是失败了,该怎么办? Lucy:",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 7
|
||||
},
|
||||
{
|
||||
"category": "chat",
|
||||
"instruction": "基于以下角色信息完成一段对话。张先生是一名企业家,正在考虑是否开拓海外市场;李女士是一名跨境电商专家,擅长国际商务和电子商务。",
|
||||
"input": "张先生:你好,李女士,我正在考虑将我们的产品销售扩大至海外市场,您有什么建议吗? 李女士:您好,张先生,我们需要考虑到海外市场对于产品的需求是否与国内市场一致,需要进行市场调研和定位。然后再进行各种软性、硬性的创新。 张先生:听起来很专业,您能具体解释一下吗? 李女士:",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 8
|
||||
},
|
||||
{
|
||||
"category": "chat",
|
||||
"instruction": "基于以下角色信息完成一段对话。小明是一名医生。一名病患想要提前停药。小王是病患的儿子,希望父亲能够听取医生的建议。",
|
||||
"input": "小明:你好,小王,我了解你想要让你父亲停药。小王:是的,我父亲已经吃了那么久的药,我担心药物对他的身体会有副作用。小明:",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 9
|
||||
},
|
||||
{
|
||||
"category": "chat",
|
||||
"instruction": "基于以下角色信息完成一段对话。张三是一位语文老师,对学生认真负责;李四是张三的学生,对语文兴趣不是很高。",
|
||||
"input": "张三:同学们,今天要讲的是一篇古文《岳阳楼记》。这篇文章非常精彩,希望同学们能够认真听课,理解其中的含义。 李四:怎么又是古文? 张三:",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 10
|
||||
},
|
||||
{
|
||||
"category": "generation",
|
||||
"instruction": "根据主题写一封邮件。",
|
||||
"input": "主题: \"加入我们,共创未来\"",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 11
|
||||
},
|
||||
{
|
||||
"category": "generation",
|
||||
"instruction": "为公司编写一份职场行为准则,包括明确的行为规范和道德准则。",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 12
|
||||
},
|
||||
{
|
||||
"category": "generation",
|
||||
"instruction": "请撰写一篇文章,介绍如何通过改善生活习惯来预防疾病和延长寿命。",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 13
|
||||
},
|
||||
{
|
||||
"category": "generation",
|
||||
"instruction": "请为一家咖啡店编写一篇简短的广告语,吸引更多的顾客。",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 14
|
||||
},
|
||||
{
|
||||
"category": "generation",
|
||||
"instruction": "根据以下故事提示写一篇故事:",
|
||||
"input": "故事提示:```在一个废弃的古堡中,一个小女孩遇到了一只会说话的黑猫,他们一起揭开了一个古老的谜题。```",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 15
|
||||
},
|
||||
{
|
||||
"category": "open_qa",
|
||||
"instruction": "请介绍一下《红楼梦》这部经典小说的故事情节。",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 16
|
||||
},
|
||||
{
|
||||
"category": "open_qa",
|
||||
"instruction": "解释什么是RNA病毒和DNA病毒。",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 17
|
||||
},
|
||||
{
|
||||
"category": "open_qa",
|
||||
"instruction": "什么是比特币?",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 18
|
||||
},
|
||||
{
|
||||
"category": "open_qa",
|
||||
"instruction": "在计算机中,什么是RAM?与ROM有什么区别?",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 19
|
||||
},
|
||||
{
|
||||
"category": "open_qa",
|
||||
"instruction": "请简单介绍一下世界上最长的河流途经的国家。",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 20
|
||||
},
|
||||
{
|
||||
"category": "roleplay",
|
||||
"instruction": "我要你把我写的句子翻译成表情符号。我会写句子,你会用表情符号表达它。我只是想让你用表情符号来表达它。除了表情符号,我不希望你回复任何内容。当我需要用中文告诉你一些事情时,我会用 {} 这样的大括号括起来。我的第一句话是“{我的职业是消防员。}”\n",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 21
|
||||
},
|
||||
{
|
||||
"category": "roleplay",
|
||||
"instruction": "我希望你假定自己是雅思写作考官,根据雅思评判标准,按我给你的雅思考题和对应答案给我评分,并且按照雅思写作评分细则给出打分依据。此外,请给我详细的修改意见并写出满分范文。第一个问题是:It is sometimes argued that too many students go to university, while others claim that a university education should be a universal right. Discuss both sides of the argument and give your own opinion.对于这个问题,我的答案是:In some advanced countries, it is not unusual for more than 50% of young adults to attend college or university. Critics, however, claim that many university courses are worthless and young people would be better off gaining skills in the workplace. In this essay, I will examine both sides of this argument and try to reach a conclusion.There are several reasons why young people today believe they have the right to a university education. First, growing prosperity in many parts of the world has increased the number of families with money to invest in their children’s future. At the same time, falling birthrates mean that one- or two-child families have become common, increasing the level of investment in each child. It is hardly surprising, therefore, that young people are willing to let their families support them until the age of 21 or 22. Furthermore, millions of new jobs have been created in knowledge industries, and these jobs are typically open only to university graduates.However, it often appears that graduates end up in occupations unrelated to their university studies. It is not uncommon for an English literature major to end up working in sales, or an engineering graduate to retrain as a teacher, for example. Some critics have suggested that young people are just delaying their entry into the workplace, rather than developing professional skills.请依次给到我以下内容:具体分数及其评分依据、文章修改意见、满分范文。\n",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 22
|
||||
},
|
||||
{
|
||||
"category": "roleplay",
|
||||
"instruction": "我想让你充当 Linux 终端。我将输入命令,您将回复终端应显示的内容。我希望您只在一个唯一的代码块内回复终端输出,而不是其他任何内容。不要写解释。除非我指示您这样做,否则不要键入命令。当我需要用英语告诉你一些事情时,我会把文字放在中括号内[就像这样]。我的第一个命令是 pwd\n",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 23
|
||||
},
|
||||
{
|
||||
"category": "roleplay",
|
||||
"instruction": "我希望你充当宠物行为主义者。我将为您提供一只宠物和它们的主人,您的目标是帮助主人了解为什么他们的宠物表现出某些行为,并提出帮助宠物做出相应调整的策略。您应该利用您的动物心理学知识和行为矫正技术来制定一个有效的计划,双方的主人都可以遵循,以取得积极的成果。我的第一个请求是“我有一只好斗的德国牧羊犬,它需要帮助来控制它的攻击性。”\n",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 24
|
||||
},
|
||||
{
|
||||
"category": "roleplay",
|
||||
"instruction": "我希望你充当正则表达式生成器。您的角色是生成匹配文本中特定模式的正则表达式。您应该以一种可以轻松复制并粘贴到支持正则表达式的文本编辑器或编程语言中的格式提供正则表达式。不要写正则表达式如何工作的解释或例子;只需提供正则表达式本身。我的第一个提示是生成一个匹配电子邮件地址的正则表达式。\n",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 25
|
||||
}
|
||||
]
|
|
@ -0,0 +1,202 @@
|
|||
[
|
||||
{
|
||||
"category": "brainstorming",
|
||||
"instruction": "Which are some popular fiction books that I should read?",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 1
|
||||
},
|
||||
{
|
||||
"category": "brainstorming",
|
||||
"instruction": "How do I properly store fruits and vegetables to keep them fresh for longer?",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 2
|
||||
},
|
||||
{
|
||||
"category": "brainstorming",
|
||||
"instruction": "How do you properly chop an onion without crying?",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 3
|
||||
},
|
||||
{
|
||||
"category": "brainstorming",
|
||||
"instruction": "How to make an international transfer? Please provide 3 techniques.",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 4
|
||||
},
|
||||
{
|
||||
"category": "brainstorming",
|
||||
"instruction": "Name five leadership qualities that you consider most important.",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 5
|
||||
},
|
||||
{
|
||||
"category": "chat",
|
||||
"instruction": "Complete a dialogue based on the following character information. Alex: A novice writer who is struggling to find inspiration and develop his writing skills. Emma: A successful author with many published works, providing guidance and advice to Alex.",
|
||||
"input": "Alex: Hi Emma, I have been writing for a while now but can't seem to make any progress. Can you give me any advice? Emma: Hi Alex, sure. What kind of writing are you doing? Alex: I'm trying to write a novel, but I just can't seem to find any inspiration. Emma: ",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 6
|
||||
},
|
||||
{
|
||||
"category": "chat",
|
||||
"instruction": "Complete a dialogue based on the following character information. John: An experienced software engineer with a passion for coding. Karen: A recent college graduate who is interested in learning more about software development.",
|
||||
"input": "Karen: Hi John, I noticed that you have a lot of experience in the software industry. Can you tell me what you think is the most important skill for a software engineer? John: ",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 7
|
||||
},
|
||||
{
|
||||
"category": "chat",
|
||||
"instruction": "Complete a dialogue based on the following character information. Sarah is a new employee who is nervous about her first presentation; Tom is her boss who has given her coaching and preparation materials.",
|
||||
"input": "Sarah: Tom, I'm feeling really nervous about my presentation tomorrow. Tom: I know how you feel, Sarah. However, I believe in you and your abilities. Just stick to the preparation materials that I have given you, and you'll do great. Sarah: Thank you, Tom. What if I forget something important during the presentation? Tom: ",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 8
|
||||
},
|
||||
{
|
||||
"category": "chat",
|
||||
"instruction": "Complete a dialogue based on the following character information. Sarah: a young artist who is full of creative ideas and always eager to try new things. Jack: a seasoned artist who has achieved great success in the art world and is more traditional in his approach to art.",
|
||||
"input": "Sarah: Hi Jack, I'm really excited to meet you. I'm a big fan of your work. Jack: Hi Sarah, nice to meet you too. So, what kind of art do you do? Sarah: I am passionate about abstract art, especially combining different materials and colors. I think it can really give people a new perspective on things. Jack: That's interesting, but I am more focused on realistic paintings. I believe the most important thing is to master the basic skills first. Sarah: ",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 9
|
||||
},
|
||||
{
|
||||
"category": "chat",
|
||||
"instruction": "Complete a conversation based on the following persona information. Sarah is a college student who is interested in joining a volunteer organization. John is the leader of the volunteer organization and is eager to welcome new members.",
|
||||
"input": "Sarah: Hi, I'm Sarah, and I'm interested in joining your volunteer organization. John: Hi Sarah, welcome! We're always looking for new members who are passionate about volunteering. What areas would you like to focus on? Sarah: I'm interested in community outreach and working with children. John: ",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 10
|
||||
},
|
||||
{
|
||||
"category": "generation",
|
||||
"instruction": "Write an email based on the subject:",
|
||||
"input": "Subject: \"Invitation to an Exclusive Webinar\"",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 11
|
||||
},
|
||||
{
|
||||
"category": "generation",
|
||||
"instruction": "Write a set of guidelines for first-time pet owners on how to properly care for a new puppy.",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 12
|
||||
},
|
||||
{
|
||||
"category": "generation",
|
||||
"instruction": "Can you help me write a persuasive speech on why we should recycle more and take better care of the environment?",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 13
|
||||
},
|
||||
{
|
||||
"category": "generation",
|
||||
"instruction": "Write a pitch for a brand-new mobile app that helps people organize their daily tasks more efficiently.",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 14
|
||||
},
|
||||
{
|
||||
"category": "generation",
|
||||
"instruction": "Write a social media post promoting an upcoming concert featuring a local band.",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 15
|
||||
},
|
||||
{
|
||||
"category": "open_qa",
|
||||
"instruction": "Describe the significance of the Renaissance period in European history.",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 16
|
||||
},
|
||||
{
|
||||
"category": "open_qa",
|
||||
"instruction": "What is the term for the surgical removal of the appendix?",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 17
|
||||
},
|
||||
{
|
||||
"category": "open_qa",
|
||||
"instruction": "Explain the process of osmosis in biological systems.",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 18
|
||||
},
|
||||
{
|
||||
"category": "open_qa",
|
||||
"instruction": "Who were the members of the Beatles band?",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 19
|
||||
},
|
||||
{
|
||||
"category": "open_qa",
|
||||
"instruction": "Who painted the The Scream?",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 20
|
||||
},
|
||||
{
|
||||
"category": "roleplay",
|
||||
"instruction": "I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 21
|
||||
},
|
||||
{
|
||||
"category": "roleplay",
|
||||
"instruction": "I want you to act as a travel guide. I will write you my location and you will suggest a place to visit near my location. In some cases, I will also give you the type of places I will visit. You will also suggest me places of similar type that are close to my first location. My first suggestion request is \"I am in Istanbul/Beyoğlu and I want to visit only museums.\"",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 22
|
||||
},
|
||||
{
|
||||
"category": "roleplay",
|
||||
"instruction": "I want you to act as an advertiser. You will create a campaign to promote a product or service of your choice. You will choose a target audience, develop key messages and slogans, select the media channels for promotion, and decide on any additional activities needed to reach your goals. My first suggestion request is \"I need help creating an advertising campaign for a new type of energy drink targeting young adults aged 18-30.\"",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 23
|
||||
},
|
||||
{
|
||||
"category": "roleplay",
|
||||
"instruction": "I want you to act as a storyteller. You will come up with entertaining stories that are engaging, imaginative and captivating for the audience. It can be fairy tales, educational stories or any other type of stories which has the potential to capture people's attention and imagination. Depending on the target audience, you may choose specific themes or topics for your storytelling session e.g., if it’s children then you can talk about animals; If it’s adults then history-based tales might engage them better etc. My first request is \"I need an interesting story on perseverance.\"",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 24
|
||||
},
|
||||
{
|
||||
"category": "roleplay",
|
||||
"instruction": "I want you to act as a rapper. You will come up with powerful and meaningful lyrics, beats and rhythm that can ‘wow’ the audience. Your lyrics should have an intriguing meaning and message which people can relate too. When it comes to choosing your beat, make sure it is catchy yet relevant to your words, so that when combined they make an explosion of sound everytime! My first request is \"I need a rap song about finding strength within yourself.\"",
|
||||
"input": "",
|
||||
"output": "",
|
||||
"target": "",
|
||||
"id": 25
|
||||
}
|
||||
]
|
|
@ -39,53 +39,8 @@
|
|||
},
|
||||
"prompt": "你是一个好助手。请你为下面的“补全对话”问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"classification": {
|
||||
"id": 3,
|
||||
"category": "classification",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"correctness": "正确性(1-5):答案是否正确。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"correctness": "1. 仔细阅读题目,尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的,则可以将正确性得分为5分。如果答案是部分正确的,则可以给予适当的得分,例如2分、3分或4分。如果答案完全不正确,则只得1分。\n\n正确性:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面的“分类“问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"closed_qa": {
|
||||
"id": 4,
|
||||
"category": "closed_qa",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"correctness": "正确性(1-5):答案是否正确。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"correctness": "1. 仔细阅读题目,尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的,则可以将正确性得分为5分。如果答案是部分正确的,则可以给予适当的得分,例如2分、3分或4分。如果答案完全不正确,则只得1分。\n\n正确性:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面问题的答案打分。\n\n问题如下:\n\n{question}\n\n需要你评分的答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"extraction": {
|
||||
"id": 5,
|
||||
"category": "extraction",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"correctness": "准确性(1-5):回答应该准确无误地提取出所需信息,不应该包含任何错误或误导性信息。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"correctness": "1. 仔细阅读问题并确定需要从材料中提取的信息。\n2. 仔细阅读回答并确保它涵盖了所有需要提取的信息。\n3. 使用所提供的材料来验证回答的准确性。如果回答不准确或包含错误或误导性信息,则无法给出高分。\n4. 检查回答是否包含所有要求提取的信息,不要漏掉任何重要细节。\n5. 根据回答的准确性和完整性,给出一个介于1和5之间的分数,5分表示回答非常准确且完整,1分表示回答几乎没有提取出所需信息。\n\n准确性:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面的“提取”问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"generation": {
|
||||
"id": 6,
|
||||
"id": 3,
|
||||
"category": "generation",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
|
@ -100,7 +55,7 @@
|
|||
"prompt": "你是一个好助手。请你为下面的“生成”问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"open_qa": {
|
||||
"id": 7,
|
||||
"id": 4,
|
||||
"category": "open_qa",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
|
@ -114,23 +69,8 @@
|
|||
},
|
||||
"prompt": "你是一个好助手。请你为下面的问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"rewriting": {
|
||||
"id": 8,
|
||||
"category": "rewriting",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"correctness": "正确性(1-5):答案是否正确。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"correctness": "1. 仔细阅读题目,尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的,则可以将正确性得分为5分。如果答案是部分正确的,则可以给予适当的得分,例如2分、3分或4分。如果答案完全不正确,则只得1分。\n\n正确性:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面的问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"roleplay": {
|
||||
"id": 9,
|
||||
"id": 5,
|
||||
"category": "roleplay",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
|
@ -146,33 +86,14 @@
|
|||
},
|
||||
"prompt": "你是一个好助手。请你为下面的“角色扮演”问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"summarization": {
|
||||
"id": 10,
|
||||
"category": "summarization",
|
||||
"Other": {
|
||||
"id": 6,
|
||||
"category": "Other",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"correctness": "准确性(1-5):回答应该准确无误地总结出材料的重点。",
|
||||
"conciseness": "简明扼要(1-5):答案是否简明扼要,没有冗余内容。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"correctness": "1. 仔细阅读问题给的材料,理解其内容和要点。\n2. 评估回答是否准确地总结出原始材料的重点。\n3. 评估回答是否包含原始材料中的所有关键信息。\n4. 根据以上步骤,给出一个1-5的分数,其中1表示回答不能准确地总结出材料的重点,5表示回答完全准确地总结出材料的重点。\n\n准确性:",
|
||||
"conciseness": "1. 阅读题目,提取出材料的重点。\n2. 阅读该总结,并注意其中的主要观点和信息。\n3. 评估总结的长度。一个简明扼要的总结通常应该在几句话或几段文字内传达关键信息,而不是冗长的段落或文章。\n4. 检查总结是否包含与主要观点无关的信息或冗余信息。\n5.确定总结涵盖了材料中的关键信息,并且没有忽略任何重要细节。\n6.给总结打出1-5的分数,其中5表示总结简明扼要,没有冗余内容,而1表示总结冗长或包含不必要的信息,难以理解或记忆。根据您的判断,打出适当的得分。\n\n简明扼要:"
|
||||
},
|
||||
"prompt": "你是一个好助手。请你为下面的“总结”问题的答案打分。\n\n问题如下:\n\n{question}\n\n答案如下:\n\n{answer}\n\n评分的指标如下:\n\n{metric}\n\n请你遵照以下的评分步骤:\n\n{steps}"
|
||||
},
|
||||
"general": {
|
||||
"id": 11,
|
||||
"category": "general",
|
||||
"metrics": {
|
||||
"language organization": "语言组织(1-5):答案语言是否流畅、连贯,使用正确的语法,具有一定逻辑性,使用恰当的连接词、过渡词等等。",
|
||||
"relevance": "切题(1-5):答案内容是否切题,不答非所问,并且严格遵照题目要求。",
|
||||
"correctness": "正确性(1-5):答案是否正确。"
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. 阅读答案,并检查是否有语法错误、用词不当或其他显著的错误。\n2. 检查答案是否具有逻辑性,能够按照合理的顺序传达信息并且能够自圆其说。\n3. 确定答案是否与问题或主题相关,并且能够传达清晰的信息。\n4. 检查答案是否连贯,是否使用适当的转换和过渡来保持句子和段落之间的连贯性。\n5. 检查答案是否具有明确的结构和组织方式,使得读者可以轻松理解信息的层次和结构。\n6. 根据以上因素综合评估答案的语言组织,并给出一个1到5的分数,其中5表示语言组织非常好,而1表示语言组织非常差。\n\n语言组织:",
|
||||
"relevance": "1. 阅读题目,确定题目所问的问题是什么,以及需要回答哪些方面的问题。\n2. 阅读答案,确认答案是否直接回答了题目所问的问题。\n3. 检查答案是否严格遵照了题目的要求,包括答题方式、答题长度、答题格式等等。\n4. 根据以上因素综合评估答案的切题程度,并给出一个1到5的分数,其中5表示答案非常切题,而1表示答案完全没有切题。\n\n切题:",
|
||||
"correctness": "1. 仔细阅读题目,尝试自己回答该问题。\n2. 检查答案的准确性。您可以使用已知的事实或研究来验证答案是否正确。如果答案是正确的,则可以将正确性得分为5分。如果答案是部分正确的,则可以给予适当的得分,例如2分、3分或4分。如果答案完全不正确,则只得1分。\n\n正确性:"
|
||||
},
|
|
@ -39,53 +39,8 @@
|
|||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the \"chat\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"classification": {
|
||||
"id": 3,
|
||||
"category": "classification",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"correctness": "Correctness (1-5): whether the answer is correct or not."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"correctness": "1. Read the question carefully and try to answer the question yourself.\n2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be given. If the answer is completely incorrect, only 1 point is awarded.\n\nCorrectness:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the \"classification\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"closed_qa": {
|
||||
"id": 4,
|
||||
"category": "closed_qa",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"correctness": "Correctness (1-5): whether the answer is correct or not."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"correctness": "1. Read the question carefully and try to answer the question by yourself.\n2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be assigned. If the answer is completely incorrect, only 1 point is awarded.\n\nCorrectness:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the \"closed qa\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"extraction": {
|
||||
"id": 5,
|
||||
"category": "extraction",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"correctness": "correctness (1-5): Answers should extract the required information accurately and should not contain any incorrect or misleading information."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"correctness": "1. Read the questions carefully and identify the information that needs to be extracted from the material.\n2. Read the answer carefully and make sure it covers all the information that needs to be extracted.\n3. Use the material provided to verify the correctness of the response. If the response is inaccurate or contains incorrect or misleading information, a high score cannot be given.\n4. Check that the answer contains all the information required to be extracted and do not leave out any important details.\n5. Give a score between 1 and 5 based on the correctness and completeness of the response, with a score of 5 indicating a very accurate and complete response and a score of 1 indicating that the response barely extracts the required information.\n\nCorrectness:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the \"extraction\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"generation": {
|
||||
"id": 6,
|
||||
"id": 3,
|
||||
"category": "generation",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
|
@ -100,7 +55,7 @@
|
|||
"prompt": "You are a good assistant. Please rate the given answer to the \"generation\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"open_qa": {
|
||||
"id": 7,
|
||||
"id": 4,
|
||||
"category": "open_qa",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
|
@ -114,23 +69,8 @@
|
|||
},
|
||||
"prompt": "You are a good assistant. Please rate the answers to the \"open qa\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"rewriting": {
|
||||
"id": 8,
|
||||
"category": "rewriting",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"correctness": "Correctness (1-5): whether the answer is correct or not."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"correctness": "1. Read the question carefully and try to answer the question yourself.\n2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be assigned. If the answer is completely incorrect, only 1 point is awarded.\n\nCorrectness:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the answers to the \"rewriting\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"roleplay": {
|
||||
"id": 9,
|
||||
"id": 5,
|
||||
"category": "roleplay",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
|
@ -146,35 +86,17 @@
|
|||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the \"role-play\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"summarization": {
|
||||
"id": 10,
|
||||
"category": "summarization",
|
||||
"Other": {
|
||||
"id": 6,
|
||||
"category": "Other",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"correctness": "Correctness (1-5): answers should summarize the main points of the material accurately and unambiguously.",
|
||||
"conciseness": "Conciseness (1-5): answers should be concise and without redundant content."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"correctness": "1. Read the material given in the question carefully to understand its content and main points.\n2. Assess whether the answer accurately summarizes the key points of the source material.\n3. assess whether the response contains all the key information in the source material.\n4. Based on the above steps, give a score of 1-5, where 1 means that the response does not accurately summarize the main points of the material and 5 means that the response completely accurately summarizes the main points of the material.\n\nCorrectness:",
|
||||
"conciseness": "1. Read the title and extract the main points of the material.\n2. Read the summary and note the main ideas and messages in it.\n3. Assess the length of the summary. A concise summary should usually convey key information within a few sentences or paragraphs, rather than lengthy paragraphs or essays.\n4. Check that the summary does not contain information that is not relevant to the main ideas or that is redundant.\n5. Make sure that the summary covers the key information in the material and that no important details have been omitted.\n6. Rate the summary on a scale of 1-5, where 5 means the summary is concise and free of redundancy, and 1 means the summary is lengthy or contains unnecessary information that is difficult to understand or remember. Based on your judgment, assign the appropriate score.\n\nConciseness:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the \"summarization\" question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
},
|
||||
"general": {
|
||||
"id": 11,
|
||||
"category": "general",
|
||||
"metrics": {
|
||||
"language organization": "Language organization (1-5): whether the answer language is fluent and coherent, uses correct grammar, has a certain logic, uses appropriate connecting words, transition words, etc.",
|
||||
"relevance": "Relevance (1-5): whether the content of the answer is relevant to the topic, does not answer the wrong question, and strictly follows the requirements of the topic.",
|
||||
"correctness": "Correctness (1-5): whether the answer is correct or not."
|
||||
},
|
||||
"CoT": {
|
||||
"language organization": "1. Read the answers and check for grammatical errors, poor word choice, or other significant mistakes.\n2. Check that the answer is logical, conveys the information in a logical order, and is self-explanatory.\n3. Determine if the answer is relevant to the question or topic and conveys a clear message.\n4. Check that the answer is coherent and that appropriate transitions and switches are used to maintain coherence between sentences and paragraphs.\n5. Check that the answer is clearly structured and organized in such a way that the reader can easily understand the hierarchy and structure of the information.\n6. Evaluate the language organization of the answer based on a combination of the above factors and give a score of 1 to 5, where 5 indicates very good language organization and 1 indicates very poor language organization.\n\nLanguage organization:",
|
||||
"relevance": "1. Read the question to determine what the question asks and what aspects of the question need to be answered.\n2. Read the answers to make sure that they directly answer the question asked.\n3. Check that the answer follows the requirements of the question, including the way it is answered, the length of the answer, the format of the answer, etc.\n4. Evaluate how relevant the answer is based on the above factors and give a score of 1 to 5, where 5 means the answer is very relevant and 1 means the answer is not relevant at all.\n\nRelevance:",
|
||||
"correctness": "1. Read the question carefully and try to answer the question yourself.\n2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be assigned. If the answer is completely incorrect, only 1 point is awarded.\n\nCorrectness:"
|
||||
"correctness": "1. Read the question carefully and try to answer the question by yourself.\n2. Check the correctness of the answer. You can use known facts or research to verify that the answer is correct. If the answer is correct, you can give a score of 5 for correctness. If the answer is partially correct, an appropriate score, such as 2, 3, or 4, may be assigned. If the answer is completely incorrect, only 1 point is awarded.\n\nCorrectness:"
|
||||
},
|
||||
"prompt": "You are a good assistant. Please rate the given answer to the question below.\n\nThe question is as follows:\n\n{question}\n\nThe answer is as follows:\n\n{answer}\n\nThe metric for evaluation is as follows:\n\n{metric}\n\nYou should follow the following evaluation steps:\n\n{steps}"
|
||||
}
|
|
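For illustration, the `prompt` entries above are format strings with `{question}`, `{answer}`, `{metric}` and `{steps}` placeholders. A minimal sketch of filling one in (the template and metric/step text are copied from the config above; the question and answer values are purely illustrative):

```python
# Sketch only: fill a GPT-evaluation prompt template from the config above.
# The template string is the "general" category prompt; question/answer are made-up examples.
prompt_template = (
    "You are a good assistant. Please rate the given answer to the question below.\n\n"
    "The question is as follows:\n\n{question}\n\n"
    "The answer is as follows:\n\n{answer}\n\n"
    "The metric for evaluation is as follows:\n\n{metric}\n\n"
    "You should follow the following evaluation steps:\n\n{steps}"
)

filled_prompt = prompt_template.format(
    question="What is the capital of France?",  # illustrative
    answer="The capital of France is Paris.",  # illustrative
    metric="Correctness (1-5): whether the answer is correct or not.",
    steps="1. Read the question carefully and try to answer the question yourself. ...",
)
print(filled_prompt)
```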
@ -0,0 +1,58 @@
{
    "model": [
        {
            "name": "model1"
        },
        {
            "name": "model2"
        }
    ],
    "dataset": [
        {
            "name": "mmlu",
            "metrics": [
                "first_token_accuracy",
                "single_choice_accuracy",
                "perplexity",
                "ppl_score",
                "ppl_score_over_choices"
            ]
        },
        {
            "name": "cmmlu",
            "metrics": [
                "first_token_accuracy",
                "single_choice_accuracy",
                "perplexity",
                "ppl_score",
                "ppl_score_over_choices"
            ]
        },
        {
            "name": "agieval",
            "metrics": [
                "first_token_accuracy",
                "single_choice_accuracy",
                "multi_choice_accuracy",
                "math_equivalence",
                "perplexity",
                "ppl_score_over_choices",
                "ppl_score"
            ]
        },
        {
            "name": "gaokaobench",
            "metrics": [
                "first_token_accuracy",
                "single_choice_accuracy",
                "multi_choice_accuracy",
                "math_equivalence",
                "rouge_score",
                "rouge_zh_score",
                "perplexity",
                "ppl_score_over_choices",
                "ppl_score"
            ]
        }
    ]
}
@ -0,0 +1,84 @@
{
    "model": [
        {
            "name": "model name",
            "model_class": "HuggingFaceCausalLM",
            "parameters": {
                "path": "path to model",
                "model_max_length": 4096,
                "tokenizer_path": "",
                "tokenizer_kwargs": {
                    "trust_remote_code": true
                },
                "peft_path": null,
                "model_kwargs": {
                    "torch_dtype": "torch.float32",
                    "trust_remote_code": true
                },
                "prompt_template": "plain",
                "batch_size": 4
            }
        },
        {
            "name": "model2 name",
            "model_class": "HuggingFaceCausalLM",
            "parameters": {
                "path": "path to model2",
                "model_max_length": 4096,
                "tokenizer_path": "",
                "tokenizer_kwargs": {
                    "trust_remote_code": true
                },
                "peft_path": null,
                "model_kwargs": {
                    "torch_dtype": "torch.float32",
                    "trust_remote_code": true
                },
                "prompt_template": "plain",
                "batch_size": 4
            }
        }
    ],
    "dataset": [
        {
            "name": "agieval",
            "dataset_class": "AGIEvalDataset",
            "debug": false,
            "few_shot": false,
            "path": "path to original dataset (folder)",
            "save_path": "path to save converted dataset (e.g. inference_data/agieval.json)"
        },
        {
            "name": "ceval",
            "dataset_class": "CEvalDataset",
            "debug": false,
            "few_shot": true,
            "path": "path to original dataset (folder)",
            "save_path": "path to save converted dataset (e.g. inference_data/ceval.json)"
        },
        {
            "name": "cmmlu",
            "dataset_class": "CMMLUDataset",
            "debug": false,
            "few_shot": true,
            "path": "path to original dataset (folder)",
            "save_path": "path to save converted dataset (e.g. inference_data/cmmlu.json)"
        },
        {
            "name": "gaokaobench",
            "dataset_class": "GaoKaoBenchDataset",
            "debug": false,
            "few_shot": false,
            "path": "path to original dataset (folder)",
            "save_path": "path to save converted dataset (e.g. inference_data/gaokaobench.json)"
        },
        {
            "name": "mmlu",
            "dataset_class": "MMLUDataset",
            "debug": false,
            "few_shot": true,
            "path": "path to original dataset (folder)",
            "save_path": "path to save converted dataset (e.g. inference_data/mmlu.json)"
        }
    ]
}
@ -0,0 +1,73 @@
import argparse
import os

import tabulate
from colossal_eval.evaluate.dataset_evaluator import DatasetEvaluator
from colossal_eval.utils import jdump, jload


def main(args):
    config = jload(args.config)

    evaluation_results = {dataset["name"]: {} for dataset in config["dataset"]}
    evaluation_results_table = {dataset["name"]: {} for dataset in config["dataset"]}
    evaluator = DatasetEvaluator()

    for dataset_parameter in config["dataset"]:
        dataset_name = dataset_parameter["name"]
        metrics = dataset_parameter["metrics"]
        results_metric_model = {metric: {model["name"]: None for model in config["model"]} for metric in metrics}
        for model in config["model"]:
            model_name = model["name"]

            data = jload(
                os.path.join(args.inference_results_path, model_name, f"{dataset_name}_inference_results.json")
            )
            results = evaluator.get_evaluation_results(data, dataset_name, model_name, metrics)

            for metric, score in results.items():
                results_metric_model[metric][model_name] = score["ALL"]

            evaluation_results[dataset_name][model_name] = results

        evaluation_results_table[dataset_name] = results_metric_model

    table = []
    header = ["dataset", "metric"] + [model["name"] for model in config["model"]]
    table.append(header)

    for dataset_parameter in config["dataset"]:
        dataset_name = dataset_parameter["name"]
        metrics = dataset_parameter["metrics"]

        for metric, model_results in evaluation_results_table[dataset_name].items():
            row = [dataset_name]
            for model, score in model_results.items():
                if len(row) == 1:
                    row.extend([metric, "{:.02f}".format(score)])
                else:
                    row.append("{:.02f}".format(score))

            table.append(row)

    table = tabulate.tabulate(table, headers="firstrow")
    print(table)

    os.makedirs(args.evaluation_results_save_path, exist_ok=True)

    with open(os.path.join(args.evaluation_results_save_path, "evaluation_results_table.txt"), "w") as file:
        file.write(table)

    jdump(evaluation_results, os.path.join(args.evaluation_results_save_path, "evaluation_results.json"))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="ColossalEval evaluation process.")
    parser.add_argument("--config", type=str, default=None, required=True, help="path to config file")
    parser.add_argument("--inference_results_path", type=str, default=None, help="path to inference results")
    parser.add_argument(
        "--evaluation_results_save_path", type=str, default=None, help="path to save evaluation results"
    )
    args = parser.parse_args()

    main(args)
@ -0,0 +1,4 @@
python eval_dataset.py \
    --config "path to config file" \
    --inference_results_path "path to inference results" \
    --evaluation_results_save_path "path to save evaluation results"
@ -0,0 +1,171 @@
import argparse
import copy
import os
from typing import Dict, List

import torch
import torch.distributed as dist
from colossal_eval import dataset, models, utils

import colossalai
from colossalai.logging import get_dist_logger

logger = get_dist_logger()


def rm_and_merge(world_size: int, save_path: str, model_names: List[str], dataset_names: Dict[str, List]) -> None:
    """
    Remove inference result per rank and merge them into one file.

    Args:
        world_size: Number of processes for inference.
        save_path: The folder for storing inference results.
        model_names: Names of models for inference.
        dataset_names: Names of dataset for inference.

    """

    for model_name in model_names:
        for dataset_name, categories in dataset_names.items():
            all_answers = {}
            for category in categories:
                all_answers[category] = {"data": []}
                answers = {"data": []}

                for r in range(world_size):
                    directory = os.path.join(
                        save_path, model_name, f"{dataset_name}_{category}_inference_results_rank{r}.json"
                    )
                    if not os.path.exists(directory):
                        raise Exception(
                            f"Directory {directory} not found. There may be an error during inference time."
                        )
                    else:
                        rank_answers = utils.jload(directory)
                        answers["data"].extend(rank_answers["data"])
                        answers["inference_kwargs"] = rank_answers["inference_kwargs"]

                for r in range(world_size):
                    try:
                        directory = os.path.join(
                            save_path, model_name, f"{dataset_name}_{category}_inference_results_rank{r}.json"
                        )
                        os.remove(directory)
                    except Exception as e:
                        print(e)

                all_answers[category] = answers

            logger.info(f"Save inference results of model {model_name} on dataset {dataset_name}.")
            utils.jdump(all_answers, os.path.join(save_path, model_name, f"{dataset_name}_inference_results.json"))

        logger.info(f"Save inference results of model {model_name} for all dataset.")
    logger.info(f"Save inference results of all models for all dataset.")


def main(args):
    colossalai.launch_from_torch(config={}, seed=42)
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    inference_data = {}
    debug_args = {}
    few_shot_args = {}

    config = utils.jload(args.config)

    model_parameters = config["model"]
    dataset_parameters = config["dataset"]

    for dataset_parameter in dataset_parameters:
        path = dataset_parameter["path"]
        save_path = dataset_parameter["save_path"]
        dataset_name = dataset_parameter["name"]
        debug_args[dataset_name] = dataset_parameter["debug"]
        few_shot_args[dataset_name] = dataset_parameter["few_shot"]

        if not args.load_dataset:
            if os.path.exists(save_path):
                dataset_ = utils.jload(save_path)
                inference_data[dataset_name] = dataset_["test"]
            else:
                raise Exception(
                    "Can't find the converted dataset. You may set load_dataset True to store the dataset first."
                )

            continue

        dataset_class = eval(f"dataset.{dataset_parameter['dataset_class']}")
        if not issubclass(dataset_class, dataset.BaseDataset):
            raise ValueError(f"Dataset class {dataset_parameter['dataset_class']} is not a subclass of BaseDataset.")

        dataset_ = dataset_class(path, logger, dataset_parameter["few_shot"])

        dataset_.save(save_path)
        inference_data[dataset_name] = dataset_.dataset["test"]

    for model_parameter in model_parameters:
        model_name = model_parameter["name"]
        model_class = eval(f"models.{model_parameter['model_class']}")
        paramerters = model_parameter["parameters"]
        paramerters.update({"logger": logger})
        paramerters.update({"prompt_template": utils.prompt_templates[paramerters["prompt_template"]]})

        model_ = model_class(**paramerters)
        if not issubclass(model_class, models.BaseModel):
            raise ValueError(f"Model class {model_parameter['model_class']} is not a subclass of BaseModel.")

        for dataset_name, split_data in inference_data.items():
            start = 0
            for category, category_data in split_data.items():
                if few_shot_args[dataset_name] and category_data["inference_kwargs"].get("few_shot_data", None) is None:
                    raise Exception(f"Dataset {dataset_name} doesn't have few-shot data for category {category}!")

                answers_to_dump = copy.deepcopy(category_data)
                partition_size = len(category_data["data"]) // world_size
                redundant = len(category_data["data"]) % world_size

                # Ensure that the amount of data for inference is as consistent as possible across different processes.
                lengths = [partition_size for _ in range(world_size)]
                for j in range(redundant):
                    lengths[(j + start) % world_size] += 1

                start = (start + redundant) % world_size

                questions = category_data["data"][sum(lengths[0:rank]) : sum(lengths[0:rank]) + lengths[rank]]

                answers_per_rank = model_.inference(
                    questions, inference_kwargs=category_data["inference_kwargs"], debug=debug_args[dataset_name]
                )

                answers_to_dump["data"] = answers_per_rank

                utils.jdump(
                    answers_to_dump,
                    os.path.join(
                        args.inference_save_path,
                        model_name,
                        f"{dataset_name}_{category}_inference_results_rank{rank}.json",
                    ),
                )

            logger.info(f"Rank {rank} peak CUDA mem: {torch.cuda.max_memory_allocated()/1024**3:.3f} GB")

        del model_
        torch.cuda.empty_cache()

    dist.barrier()
    if rank == 0:
        model_names = [model_parameter["name"] for model_parameter in model_parameters]
        dataset_names = {key: list(inference_data[key].keys()) for key in inference_data}
        rm_and_merge(world_size, args.inference_save_path, model_names, dataset_names)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="ColossalEval inference process.")
    parser.add_argument("--config", type=str, default=None, required=True, help="path to config file")
    parser.add_argument("--load_dataset", default=False, action="store_true")
    parser.add_argument("--inference_save_path", type=str, default=None, help="path to save inference results")
    args = parser.parse_args()

    main(args)
@ -0,0 +1,4 @@
torchrun --nproc_per_node=1 inference.py \
    --config "path to config file" \
    --load_dataset \
    --inference_save_path "path to save inference results"
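The partition logic in `inference.py` above splits each category's data as evenly as possible across ranks and rotates where the leftover samples land. A minimal standalone sketch of that calculation (the function name and example numbers here are illustrative, not part of the source):

```python
# Sketch of the rank-partitioning used in inference.py above (names are illustrative).
def partition_lengths(num_samples: int, world_size: int, start: int = 0):
    partition_size, redundant = divmod(num_samples, world_size)
    lengths = [partition_size] * world_size
    # Rotate the leftover samples across ranks so no single rank always gets the extra work.
    for j in range(redundant):
        lengths[(j + start) % world_size] += 1
    return lengths, (start + redundant) % world_size

lengths, next_start = partition_lengths(10, 4)  # -> [3, 3, 2, 2], next_start == 2
rank = 1
questions_slice = slice(sum(lengths[:rank]), sum(lengths[:rank]) + lengths[rank])
# In this example, rank 1 would process items 3..5 of the 10 samples.
```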
@ -0,0 +1,44 @@
{
    "language": "en",
    "category": {
        "brainstorming": {
            "GPT": [
                "language organization",
                "relevance",
                "creativity",
                "practicality",
                "reasonableness"
            ]
        },
        "chat": {
            "GPT": [
                "language organization",
                "naturalness",
                "engagingness",
                "fidelity"
            ]
        },
        "generation": {
            "GPT": [
                "language organization",
                "relevance",
                "diversity"
            ]
        },
        "open_qa": {
            "GPT": [
                "language organization",
                "relevance",
                "correctness"
            ]
        },
        "roleplay": {
            "GPT": [
                "language organization",
                "relevance",
                "fidelity",
                "creativity"
            ]
        }
    }
}
@ -0,0 +1,33 @@
{
    "model": [
        {
            "name": "model name",
            "model_class": "HuggingFaceCausalLM",
            "parameters": {
                "path": "path to model",
                "model_max_length": 4096,
                "tokenizer_path": "",
                "tokenizer_kwargs": {
                    "trust_remote_code": true
                },
                "peft_path": null,
                "model_kwargs": {
                    "torch_dtype": "torch.float32",
                    "trust_remote_code": true
                },
                "prompt_template": "plain",
                "batch_size": 4
            }
        }
    ],
    "dataset": [
        {
            "name": "colossal",
            "dataset_class": "ColossalDataset",
            "debug": false,
            "few_shot": false,
            "path": "../../configs/gpt_evaluation/data/eval_en_examples.json",
            "save_path": "path to save converted dataset (inference_data/colossal.json)"
        }
    ]
}
@ -2,8 +2,8 @@ import argparse
import os

import openai
from evaluator import Evaluator
from utils import jload
from colossal_eval.evaluate.evaluator import Evaluator
from colossal_eval.utils import jload


def main(args):

@ -51,12 +51,19 @@ def main(args):
        gpt_evaluation_prompt,
        args.gpt_model,
        config["language"],
        config.get("path_for_UniEval", None),
        args.gpt_with_reference,
    )
    if len(args.model_name_list) == 2:
        answers1 = jload(args.answer_file_list[0])
        answers2 = jload(args.answer_file_list[1])
        answers_1 = jload(args.answer_file_list[0])
        answers_2 = jload(args.answer_file_list[1])

        answers1 = []
        for category, value in answers_1.items():
            answers1.extend(value["data"])

        answers2 = []
        for category, value in answers_2.items():
            answers2.extend(value["data"])

        assert len(answers1) == len(answers2), "The number of answers for two models should be equal!"

@ -66,9 +73,21 @@ def main(args):
        targets = jload(args.target_file)
        answers = jload(args.answer_file_list[0])

        assert len(targets) == len(answers), "The number of target answers and model answers should be equal!"
        references = []
        for category, value in targets["test"].items():
            references.extend(value["data"])

        evaluator.evaluate(answers=answers, targets=targets)
        predictions = []
        for category, value in answers.items():
            predictions.extend(value["data"])

        assert len(references) == len(
            predictions
        ), "The number of target answers and model answers should be equal!"

        evaluator.evaluate(
            answers=predictions, targets=references, save_path=args.save_path, model_name=args.model_name_list[0]
        )
        evaluator.save(args.save_path, args.model_name_list)
    else:
        raise ValueError("Unsupported number of answer files and model names!")

@ -99,8 +118,8 @@ if __name__ == "__main__":
    )
    parser.add_argument(
        "--gpt_model",
        default="gpt-3.5-turbo",
        choices=["text-davinci-003", "gpt-3.5-turbo", "gpt-4"],
        default="gpt-3.5-turbo-16k",
        choices=["text-davinci-003", "gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-4"],
        help="which GPT model to use for evaluation",
    )
    parser.add_argument(
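The hunks above change `eval.py` to flatten the category-keyed answer and target files into flat lists before comparison. A minimal sketch of that reshaping (the input dict shape follows the diff above; the helper name and sample records are illustrative):

```python
# Sketch: flatten {"<category>": {"data": [...]}} into one list, as the updated eval.py does.
def flatten(category_dict):
    flat = []
    for _category, value in category_dict.items():
        flat.extend(value["data"])
    return flat

answers1 = flatten({"open_qa": {"data": [{"id": 1}]}, "roleplay": {"data": [{"id": 2}]}})
answers2 = flatten({"open_qa": {"data": [{"id": 3}]}, "roleplay": {"data": [{"id": 4}]}})
assert len(answers1) == len(answers2), "The number of answers for two models should be equal!"
```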
applications/Chat/evaluate/eval.sh → applications/ColossalEval/examples/gpt_evaluation/eval.sh (Executable file → Normal file, 0 changed lines)
@ -0,0 +1,12 @@
transformers>=4.32.0
colossalai>=0.3.1
peft
tabulate
jieba
fuzzywuzzy
rouge
openai
matplotlib
pandas
seaborn
scikit-learn
@ -0,0 +1,31 @@
from setuptools import find_packages, setup


def fetch_requirements(path):
    with open(path, "r") as fd:
        return [r.strip() for r in fd.readlines()]


def fetch_readme():
    with open("README.md", encoding="utf-8") as f:
        return f.read()


setup(
    name="colossal_eval",
    version="0.0.1",
    packages=find_packages(exclude=["examples", "*.egg-info"]),
    description="Colossal-AI LLM-Evaluation Framework",
    long_description=fetch_readme(),
    long_description_content_type="text/markdown",
    license="Apache Software License 2.0",
    url="https://github.com/hpcaitech/LLM-Evaluation",
    install_requires=fetch_requirements("requirements.txt"),
    python_requires=">=3.6",
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: Apache Software License",
        "Environment :: GPU :: NVIDIA CUDA",
        "Topic :: Scientific/Engineering :: Artificial Intelligence",
    ],
)
@ -5,6 +5,7 @@ This directory contains the applications that are powered by Colossal-AI.
The list of applications include:

- [X] [Colossal-LLaMA-2](./Colossal-LLaMA-2/): Continual Pre-training of LLaMA-2.
- [X] [ColossalEval](./ColossalEval): Evaluation Pipeline for LLMs.
- [X] [Chatbot](./Chat/README.md): Replication of ChatGPT with RLHF.
- [X] [FastFold](https://github.com/hpcaitech/FastFold): Optimizing AlphaFold (Biomedicine) Training and Inference on GPU Clusters.