update 7b evaluation results

2023-12-12 16:39:13 +08:00 · 430e559364
parent d4a81fad5d
commit 430e559364
3 changed files with 37 additions and 36 deletions
--- a/README-ja-JP.md
+++ b/README-ja-JP.md
@@ -49,18 +49,19 @@ InternLM は、70 億のパラメータを持つベースモデルと、実用

 オープンソースの評価ツール [OpenCompass](https://github.com/internLM/OpenCompass/) を用いて、InternLM の総合的な評価を行った。この評価では、分野別能力、言語能力、知識能力、推論能力、理解能力の 5 つの次元をカバーしました。以下は評価結果の一部であり、その他の評価結果については [OpenCompass leaderboard](https://opencompass.org.cn/rank) をご覧ください。

-| データセット\モデル | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
-| ---------------- | --------------------- | -------- | ----------- | ----------- | --------- | --------- |
-| C-Eval(Val)      | 53.4                  | 24.2     | 42.7        | 50.9        | 28.9      | 31.2      |
-| MMLU             | 51.0                  | 35.2*    | 41.5        | 46.0        | 39.7      | 47.3      |
-| AGIEval          | 37.6                  | 20.8     | 24.6        | 39.0        | 24.1      | 26.4      |
-| CommonSenseQA    | 59.5                  | 65.0     | 58.8        | 60.0        | 68.7      | 66.7      |
-| BUSTM            | 50.6                  | 48.5     | 51.3        | 55.0        | 48.8      | 62.5      |
-| CLUEWSC          | 59.1                  | 50.3     | 52.8        | 59.8        | 50.3      | 52.2      |
-| MATH             | 7.1                   | 2.8      | 3.0         | 6.6         | 2.2       | 2.8       |
-| GSM8K            | 31.2                  | 10.1     | 9.7         | 29.2        | 6.0       | 15.3      |
-| HumanEval        | 10.4                  | 14.0     | 9.2         | 9.2         | 9.2       | 11.0      |
-| RACE(High)       | 57.4                  | 46.9*    | 28.1        | 66.3        | 40.7      | 54.0      |
+
+| データセット\モデル | **InternLM-Chat-7B** | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
+| --------------- | -------------------------- | --------------------- | -------- | ----------- | ----------- | --------- | --------- |
+| C-Eval(Val)     | 52.0                       | 53.4                  | 24.2     | 42.7        | 50.9        | 28.9      | 31.2      |
+| MMLU            | 52.6                       | 51.0                  | 35.2*    | 41.5        | 46.0        | 39.7      | 47.3      |
+| AGIEval         | 46.4                       | 37.6                  | 20.8     | 24.6        | 39.0        | 24.1      | 26.4      |
+| CommonSenseQA   | 80.8                       | 59.5                  | 65.0     | 58.8        | 60.0        | 68.7      | 66.7      |
+| BUSTM           | 80.6                       | 50.6                  | 48.5     | 51.3        | 55.0        | 48.8      | 62.5      |
+| CLUEWSC         | 81.8                       | 59.1                  | 50.3     | 52.8        | 59.8        | 50.3      | 52.2      |
+| MATH            | 5.0                        | 7.1                   | 2.8      | 3.0         | 6.6         | 2.2       | 2.8       |
+| GSM8K           | 36.2                       | 31.2                  | 10.1     | 9.7         | 29.2        | 6.0       | 15.3      |
+| HumanEval       | 15.9                       | 10.4                  | 14.0     | 9.2         | 9.2         | 9.2       | 11.0      |
+| RACE(High)      | 80.3                       | 57.4                  | 46.9*    | 28.1        | 66.3        | 40.7      | 54.0      |

 - 評価結果は [OpenCompass 20230706](https://github.com/internLM/OpenCompass/) (*印のあるデータは原著論文からの引用を意味する)から取得したもので、評価設定は [OpenCompass](https://github.com/internLM/OpenCompass/) が提供する設定ファイルに記載されています。
 - 評価データは、[OpenCompass](https://github.com/internLM/OpenCompass/) のバージョンアップにより数値的な差異が生じる可能性がありますので、[OpenCompass](https://github.com/internLM/OpenCompass/) の最新の評価結果をご参照ください。
--- a/README-zh-Hans.md
+++ b/README-zh-Hans.md
@@ -131,18 +131,18 @@ InternLM-7B 包含了一个拥有70亿参数的基础模型和一个为实际场

 我们使用开源评测工具 [OpenCompass](https://github.com/internLM/OpenCompass/) 从学科综合能力、语言能力、知识能力、推理能力、理解能力五大能力维度对InternLM开展全面评测，部分评测结果如下表所示，欢迎访问[OpenCompass 榜单](https://opencompass.org.cn/rank)获取更多的评测结果。

-| 数据集\模型           |  **InternLM-7B**  |  LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
-| -------------------- | ---------------- | --------- |  --------- | ------------ | --------- | ---------- |
-| C-Eval(Val)          |        53.4       | 24.2      | 42.7       |  50.9       |  28.9     | 31.2     |
-| MMLU                 |       51.0        | 35.2*     |  41.5      |  46.0       |  39.7     | 47.3     |
-| AGIEval              |       37.6        | 20.8      | 24.6       |  39.0       | 24.1      | 26.4     |
-| CommonSenseQA        |      59.5         | 65.0      | 58.8       | 60.0        | 68.7      | 66.7     |
-| BUSTM                |       50.6        | 48.5      | 51.3        | 55.0        | 48.8      | 62.5     |
-| CLUEWSC              |      59.1         |  50.3     |  52.8     |  59.8     |   50.3    |  52.2     |
-| MATH                 |         7.1        |  2.8       | 3.0       | 6.6       |  2.2      | 2.8       |
-| GSM8K                |        31.2        | 10.1       | 9.7       | 29.2      |  6.0      | 15.3  |
-|  HumanEval           |        10.4        |   14.0     | 9.2       | 9.2       | 9.2       | 11.0  |
-| RACE(High)           |        57.4        | 46.9*      | 28.1      | 66.3      | 40.7      | 54.0  |
+| 数据集\模型 | **InternLM-Chat-7B** | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
+| --------------- | -------------------------- | --------------------- | -------- | ----------- | ----------- | --------- | --------- |
+| C-Eval(Val)     | 52.0                       | 53.4                  | 24.2     | 42.7        | 50.9        | 28.9      | 31.2      |
+| MMLU            | 52.6                       | 51.0                  | 35.2*    | 41.5        | 46.0        | 39.7      | 47.3      |
+| AGIEval         | 46.4                       | 37.6                  | 20.8     | 24.6        | 39.0        | 24.1      | 26.4      |
+| CommonSenseQA   | 80.8                       | 59.5                  | 65.0     | 58.8        | 60.0        | 68.7      | 66.7      |
+| BUSTM           | 80.6                       | 50.6                  | 48.5     | 51.3        | 55.0        | 48.8      | 62.5      |
+| CLUEWSC         | 81.8                       | 59.1                  | 50.3     | 52.8        | 59.8        | 50.3      | 52.2      |
+| MATH            | 5.0                        | 7.1                   | 2.8      | 3.0         | 6.6         | 2.2       | 2.8       |
+| GSM8K           | 36.2                       | 31.2                  | 10.1     | 9.7         | 29.2        | 6.0       | 15.3      |
+| HumanEval       | 15.9                       | 10.4                  | 14.0     | 9.2         | 9.2         | 9.2       | 11.0      |
+| RACE(High)      | 80.3                       | 57.4                  | 46.9*    | 28.1        | 66.3        | 40.7      | 54.0      |

 - 以上评测结果基于 [OpenCompass 20230706](https://github.com/internLM/OpenCompass/) 获得（部分数据标注`*`代表数据来自原始论文），具体测试细节可参见 [OpenCompass](https://github.com/internLM/OpenCompass/) 中提供的配置文件。
 - 评测数据会因 [OpenCompass](https://github.com/internLM/OpenCompass/) 的版本迭代而存在数值差异，请以 [OpenCompass](https://github.com/internLM/OpenCompass/) 最新版的评测结果为主。
--- a/README.md
+++ b/README.md
@@ -130,18 +130,18 @@ InternLM-7B contains a 7 billion parameter base model and a chat model tailored

 We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/). The evaluation covered five dimensions of capabilities: disciplinary competence, language competence, knowledge competence, inference competence, and comprehension competence. Here are some of the evaluation results, and you can visit the [OpenCompass leaderboard](https://opencompass.org.cn/rank) for more evaluation results.

-| Datasets\Models | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
-| --------------- | --------------------- | -------- | ----------- | ----------- | --------- | --------- |
-| C-Eval(Val)     | 53.4                  | 24.2     | 42.7        | 50.9        | 28.9      | 31.2      |
-| MMLU            | 51.0                  | 35.2*    | 41.5        | 46.0        | 39.7      | 47.3      |
-| AGIEval         | 37.6                  | 20.8     | 24.6        | 39.0        | 24.1      | 26.4      |
-| CommonSenseQA   | 59.5                  | 65.0     | 58.8        | 60.0        | 68.7      | 66.7      |
-| BUSTM           | 50.6                  | 48.5     | 51.3        | 55.0        | 48.8      | 62.5      |
-| CLUEWSC         | 59.1                  | 50.3     | 52.8        | 59.8        | 50.3      | 52.2      |
-| MATH            | 7.1                   | 2.8      | 3.0         | 6.6         | 2.2       | 2.8       |
-| GSM8K           | 31.2                  | 10.1     | 9.7         | 29.2        | 6.0       | 15.3      |
-| HumanEval       | 10.4                  | 14.0     | 9.2         | 9.2         | 9.2       | 11.0      |
-| RACE(High)      | 57.4                  | 46.9*    | 28.1        | 66.3        | 40.7      | 54.0      |
+| Datasets\Models | **InternLM-Chat-7B** | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
+| --------------- | -------------------------- | --------------------- | -------- | ----------- | ----------- | --------- | --------- |
+| C-Eval(Val)     | 52.0                       | 53.4                  | 24.2     | 42.7        | 50.9        | 28.9      | 31.2      |
+| MMLU            | 52.6                       | 51.0                  | 35.2*    | 41.5        | 46.0        | 39.7      | 47.3      |
+| AGIEval         | 46.4                       | 37.6                  | 20.8     | 24.6        | 39.0        | 24.1      | 26.4      |
+| CommonSenseQA   | 80.8                       | 59.5                  | 65.0     | 58.8        | 60.0        | 68.7      | 66.7      |
+| BUSTM           | 80.6                       | 50.6                  | 48.5     | 51.3        | 55.0        | 48.8      | 62.5      |
+| CLUEWSC         | 81.8                       | 59.1                  | 50.3     | 52.8        | 59.8        | 50.3      | 52.2      |
+| MATH            | 5.0                        | 7.1                   | 2.8      | 3.0         | 6.6         | 2.2       | 2.8       |
+| GSM8K           | 36.2                       | 31.2                  | 10.1     | 9.7         | 29.2        | 6.0       | 15.3      |
+| HumanEval       | 15.9                       | 10.4                  | 14.0     | 9.2         | 9.2         | 9.2       | 11.0      |
+| RACE(High)      | 80.3                       | 57.4                  | 46.9*    | 28.1        | 66.3        | 40.7      | 54.0      |

 - The evaluation results were obtained from [OpenCompass 20230706](https://github.com/internLM/OpenCompass/) (some data marked with *, which means come from the original papers), and evaluation configuration can be found in the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/).
 - The evaluation data may have numerical differences due to the version iteration of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results of [OpenCompass](https://github.com/internLM/OpenCompass/).