mirror of https://github.com/InternLM/InternLM

Update performance

parent 41e13609d8
commit 7767629116

README.md (52 lines changed)
@@ -77,6 +77,58 @@ The release of InternLM2 series contains two model sizes: 7B and 20B. 7B models
**Limitations:** Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
## Performance

### Objective Evaluation
| Dataset          | Baichuan2-7B-Chat | Mistral-7B-Instruct-v0.2 | Qwen-7B-Chat | InternLM2-Chat-7B | ChatGLM3-6B | Baichuan2-13B-Chat | Mixtral-8x7B-Instruct-v0.1 | Qwen-14B-Chat | InternLM2-Chat-20B |
| ---------------- | ----------------- | ------------------------ | ------------ | ----------------- | ----------- | ------------------ | -------------------------- | ------------- | ------------------ |
| MMLU             | 50.1 | 59.2 | 57.1 | 63.7 | 58.0 | 56.6 | 70.3 | 66.7 | 65.1 |
| CMMLU            | 53.4 | 42.0 | 57.9 | 63.0 | 57.8 | 54.8 | 50.6 | 68.1 | 65.1 |
| AGIEval          | 35.3 | 34.5 | 39.7 | 47.2 | 44.2 | 40.0 | 41.7 | 46.5 | 50.3 |
| C-Eval           | 53.9 | 42.4 | 59.8 | 60.8 | 59.1 | 56.3 | 54.0 | 71.5 | 63.0 |
| TriviaQA         | 37.6 | 35.0 | 46.1 | 50.8 | 38.1 | 40.3 | 57.7 | 54.5 | 53.9 |
| NaturalQuestions | 12.8 | 8.1  | 18.6 | 24.1 | 14.0 | 12.7 | 22.5 | 22.9 | 25.9 |
| C3               | 78.5 | 66.9 | 84.4 | 91.5 | 79.3 | 84.4 | 82.1 | 91.5 | 93.5 |
| CMRC             | 8.1  | 5.6  | 14.6 | 63.8 | 43.2 | 27.8 | 5.3  | 13.0 | 50.4 |
| WinoGrande       | 49.9 | 50.8 | 54.2 | 65.8 | 61.7 | 50.9 | 60.9 | 55.7 | 74.8 |
| BBH              | 35.9 | 46.5 | 45.5 | 61.2 | 56.0 | 42.5 | 57.3 | 55.8 | 68.3 |
| GSM-8K           | 32.4 | 48.3 | 44.1 | 70.7 | 53.8 | 56.0 | 71.7 | 57.7 | 79.6 |
| MATH             | 5.7  | 8.6  | 12.0 | 23.0 | 20.4 | 4.3  | 22.5 | 27.6 | 31.9 |
| HumanEval        | 17.7 | 35.4 | 36.0 | 59.8 | 52.4 | 19.5 | 37.8 | 40.9 | 67.1 |
| MBPP             | 37.7 | 25.7 | 33.9 | 51.4 | 55.6 | 40.9 | 40.9 | 30.0 | 65.8 |
- MBPP performance is reported on the MBPP (Sanitized) dataset.
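For context on how numbers like these are typically produced, the sketch below scores a multiple-choice item by comparing the log-likelihood the model assigns to each candidate answer. This is a generic illustration of benchmark-style scoring, not the exact evaluation pipeline behind the table; the model id and the toy question are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "internlm/internlm2-chat-7b"  # placeholder checkpoint for illustration
tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto"
).eval()

def option_loglik(prompt: str, option: str) -> float:
    """Total log-probability the model assigns to `option` following `prompt`.

    Assumes the tokenizer splits cleanly at the prompt/option boundary,
    which is close enough for a sketch like this.
    """
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full = tok(prompt + option, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logprobs = model(full).logits.log_softmax(-1)
    # Logits at position i-1 predict the token at position i.
    option_ids = full[0, n_prompt:]
    picked = logprobs[0, n_prompt - 1 : -1].gather(-1, option_ids[:, None])
    return picked.sum().item()

question = "Q: What is the capital of France?\nA:"
options = [" Paris", " London", " Berlin", " Madrid"]
print(max(options, key=lambda o: option_loglik(question, o)))
# Benchmark accuracy is the fraction of items whose argmax matches the gold answer.
```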
### Alignment Evaluation
- We evaluated our model on [AlpacaEval 2.0](https://tatsu-lab.github.io/alpaca_eval/); InternLM2-Chat-20B surpasses Claude 2, GPT-4 (0613), and Gemini Pro.
| Model Name         | Win Rate | Length |
| ------------------ | -------- | ------ |
| GPT-4 Turbo        | 50.00%   | 2049   |
| GPT-4              | 23.58%   | 1365   |
| GPT-4 0314         | 22.07%   | 1371   |
| Mistral Medium     | 21.86%   | 1500   |
| XwinLM 70b V0.1    | 21.81%   | 1775   |
| InternLM2 Chat 20B | 21.75%   | 2373   |
| Mixtral 8x7B v0.1  | 18.26%   | 1465   |
| Claude 2           | 17.19%   | 1069   |
| Gemini Pro         | 16.85%   | 1315   |
| GPT-4 0613         | 15.76%   | 1140   |
| Claude 2.1         | 15.73%   | 1096   |
* According to results released on 2024-01-17.
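AlpacaEval scores a model by collecting its responses to a fixed instruction set and having a GPT-4-based annotator compare them against a reference model. A minimal sketch of the response-collection step follows; the dataset id `tatsu-lab/alpaca_eval`, the record fields, and InternLM2's `chat` helper are assumptions drawn from the respective projects' public documentation, not from this README:

```python
import json
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "internlm/internlm2-chat-7b"  # placeholder: any chat model under test
tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto"
).eval()

# The AlpacaEval instruction set (dataset id assumed from the AlpacaEval docs).
eval_set = load_dataset("tatsu-lab/alpaca_eval", "alpaca_eval")["eval"]

records = []
for example in eval_set:
    # InternLM2 checkpoints ship a convenience `chat` method via remote code.
    response, _ = model.chat(tok, example["instruction"], history=[])
    records.append({
        "instruction": example["instruction"],
        "output": response,
        "generator": "internlm2-chat-7b",
    })

with open("model_outputs.json", "w") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
# The resulting file is then scored by the alpaca_eval annotator to obtain
# a win rate like the ones in the table above.
```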
### Data Contamination
| Method                      | GSM-8K | English Knowledge | Chinese Knowledge | Coding |
| --------------------------- | ------ | ----------------- | ----------------- | ------ |
| Average of Open-source LLMs | -0.02  | -0.13             | -0.20             | -0.07  |
| InternLM2-Base-7B           | 0.09   | -0.13             | -0.16             | 0.03   |
| InternLM2-7B                | 0.02   | -0.12             | -0.16             | 0.05   |
| InternLM2-Base-20B          | 0.08   | -0.13             | -0.17             | -0.02  |
| InternLM2-20B               | 0.04   | -0.13             | -0.19             | -0.02  |
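This excerpt does not spell out the metric behind these values. A common family of contamination probes compares a model's per-token loss on a benchmark's test items against comparable text the model cannot have seen; the sketch below illustrates that generic idea only, with placeholder texts, and is not necessarily the method used for the table:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "internlm/internlm2-7b"  # placeholder checkpoint for illustration
tok = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto"
).eval()

def avg_token_loss(texts):
    """Length-weighted mean next-token cross-entropy over `texts`."""
    total_loss, total_tokens = 0.0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            loss = model(ids, labels=ids).loss  # mean loss over the shifted tokens
        n = ids.shape[1] - 1
        total_loss += loss.item() * n
        total_tokens += n
    return total_loss / total_tokens

benchmark_items = ["Natalia sold clips to 48 of her friends ..."]   # placeholder test items
fresh_items = ["A newly written word problem of similar style ..."]  # placeholder references

# A strongly negative gap (benchmark far "easier" than fresh text) hints at contamination.
print(avg_token_loss(benchmark_items) - avg_token_loss(fresh_items))
```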
## Usage
We briefly show how to use InternLM2 with [Transformers](#import-from-transformers), [ModelScope](#import-from-modelscope), and [Web demos](#dialogue).
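As a quick preview, loading an InternLM2 chat model through Transformers looks roughly like this (a minimal sketch assuming the `internlm/internlm2-chat-7b` Hugging Face checkpoint and a CUDA device):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is required because InternLM2 ships custom modeling code.
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm2-chat-7b", torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()

response, history = model.chat(tokenizer, "hello", history=[])
print(response)
response, history = model.chat(
    tokenizer, "please provide three suggestions about time management", history=history
)
print(response)
```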
@@ -75,6 +75,59 @@ The InternLM2 series of models is officially released in this repository, with the following features:
**Limitations:** Although we have made great efforts during training to ensure the safety of the model and to encourage it to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm; for example, responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. This project assumes no responsibility for any consequences arising from the dissemination of harmful information.
## Performance

### Objective Evaluation
| Dataset          | Baichuan2-7B-Chat | Mistral-7B-Instruct-v0.2 | Qwen-7B-Chat | InternLM2-Chat-7B | ChatGLM3-6B | Baichuan2-13B-Chat | Mixtral-8x7B-Instruct-v0.1 | Qwen-14B-Chat | InternLM2-Chat-20B |
| ---------------- | ----------------- | ------------------------ | ------------ | ----------------- | ----------- | ------------------ | -------------------------- | ------------- | ------------------ |
| MMLU             | 50.1 | 59.2 | 57.1 | 63.7 | 58.0 | 56.6 | 70.3 | 66.7 | 65.1 |
| CMMLU            | 53.4 | 42.0 | 57.9 | 63.0 | 57.8 | 54.8 | 50.6 | 68.1 | 65.1 |
| AGIEval          | 35.3 | 34.5 | 39.7 | 47.2 | 44.2 | 40.0 | 41.7 | 46.5 | 50.3 |
| C-Eval           | 53.9 | 42.4 | 59.8 | 60.8 | 59.1 | 56.3 | 54.0 | 71.5 | 63.0 |
| TriviaQA         | 37.6 | 35.0 | 46.1 | 50.8 | 38.1 | 40.3 | 57.7 | 54.5 | 53.9 |
| NaturalQuestions | 12.8 | 8.1  | 18.6 | 24.1 | 14.0 | 12.7 | 22.5 | 22.9 | 25.9 |
| C3               | 78.5 | 66.9 | 84.4 | 91.5 | 79.3 | 84.4 | 82.1 | 91.5 | 93.5 |
| CMRC             | 8.1  | 5.6  | 14.6 | 63.8 | 43.2 | 27.8 | 5.3  | 13.0 | 50.4 |
| WinoGrande       | 49.9 | 50.8 | 54.2 | 65.8 | 61.7 | 50.9 | 60.9 | 55.7 | 74.8 |
| BBH              | 35.9 | 46.5 | 45.5 | 61.2 | 56.0 | 42.5 | 57.3 | 55.8 | 68.3 |
| GSM-8K           | 32.4 | 48.3 | 44.1 | 70.7 | 53.8 | 56.0 | 71.7 | 57.7 | 79.6 |
| MATH             | 5.7  | 8.6  | 12.0 | 23.0 | 20.4 | 4.3  | 22.5 | 27.6 | 31.9 |
| HumanEval        | 17.7 | 35.4 | 36.0 | 59.8 | 52.4 | 19.5 | 37.8 | 40.9 | 67.1 |
| MBPP             | 37.7 | 25.7 | 33.9 | 51.4 | 55.6 | 40.9 | 40.9 | 30.0 | 65.8 |
- MBPP performance is reported on the MBPP (Sanitized) dataset.
### Subjective Evaluation
- We evaluated InternLM2-Chat on [AlpacaEval 2.0](https://tatsu-lab.github.io/alpaca_eval/); the results show that InternLM2-Chat has surpassed Claude 2, GPT-4 (0613), and Gemini Pro on AlpacaEval.
| Model Name         | Win Rate | Length |
| ------------------ | -------- | ------ |
| GPT-4 Turbo        | 50.00%   | 2049   |
| GPT-4              | 23.58%   | 1365   |
| GPT-4 0314         | 22.07%   | 1371   |
| Mistral Medium     | 21.86%   | 1500   |
| XwinLM 70b V0.1    | 21.81%   | 1775   |
| InternLM2 Chat 20B | 21.75%   | 2373   |
| Mixtral 8x7B v0.1  | 18.26%   | 1465   |
| Claude 2           | 17.19%   | 1069   |
| Gemini Pro         | 16.85%   | 1315   |
| GPT-4 0613         | 15.76%   | 1140   |
| Claude 2.1         | 15.73%   | 1096   |
* Results as of 2024-01-17.
### Data Contamination Detection
| Method                               | Math Reasoning | English Knowledge | Chinese Knowledge | Coding |
| ------------------------------------ | -------------- | ----------------- | ----------------- | ------ |
| Average of mainstream domestic models | -0.02         | -0.13             | -0.20             | -0.07  |
| InternLM2-Base-7B                    | 0.09           | -0.13             | -0.16             | 0.03   |
| InternLM2-7B                         | 0.02           | -0.12             | -0.16             | 0.05   |
| InternLM2-Base-20B                   | 0.08           | -0.13             | -0.17             | -0.02  |
| InternLM2-20B                        | 0.04           | -0.13             | -0.19             | -0.02  |
## Usage Examples
Next, we show how to run inference with [Transformers](#import-from-transformers), [ModelScope](#import-from-modelscope), and the [Web demo](#dialogue).