From ab9ae22031a280816e3d852b2fe5f5b2d394d308 Mon Sep 17 00:00:00 2001
From: Shuo Zhang
Date: Wed, 21 Feb 2024 17:54:50 +0800
Subject: [PATCH] fix Performance Evaluation of internlm2-1.8b

---
 README.md                     |  2 +-
 model_cards/internlm2_1.8b.md | 14 +++++++-------
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 8078482..a3647e3 100644
--- a/README.md
+++ b/README.md
@@ -47,7 +47,7 @@ InternLM2 series are released with the following features:
 
 ## News
 
-\[2024.01.31\] We release InternLM2-1.8B, along with the associated chat model. This model provides a cheaper deployment option while maintaining leading performance.
+\[2024.01.31\] We release InternLM2-1.8B, along with the associated chat model. They provide a cheaper deployment option while maintaining leading performance.
 
 \[2024.01.23\] We release InternLM2-Math-7B and InternLM2-Math-20B with pretraining and SFT checkpoints. They surpass ChatGPT with small sizes. See [InternLM-Math](https://github.com/InternLM/internlm-math) for details and download.
 
diff --git a/model_cards/internlm2_1.8b.md b/model_cards/internlm2_1.8b.md
index 98eafa1..6945024 100644
--- a/model_cards/internlm2_1.8b.md
+++ b/model_cards/internlm2_1.8b.md
@@ -27,13 +27,13 @@ We have evaluated InternLM2 on several important benchmarks using the open-sourc
 
 | Dataset\Models | InternLM2-1.8B | InternLM2-Chat-1.8B-SFT | InternLM2-Chat-1.8B | InternLM2-7B | InternLM2-Chat-7B |
 | :---: | :---: | :---: | :---: | :---: | :---: |
-| MMLU | 46.9 | 47.1 | 47.1 | 65.8 | 63.7 |
-| AGIEval | 33.4 | 38.8 | 38.7 | 49.9 | 47.2 |
-| BBH | 37.5 | 35.2 | 36.1 | 65.0 | 61.2 |
-| GSM8K | 31.2 | 39.7 | 40.9 | 70.8 | 70.7 |
-| MATH | 5.6 | 11.8 | 12.1 | 20.2 | 23.0 |
-| HumanEval | 25.0 | 32.9 | 34.2 | 43.3 | 59.8 |
-| MBPP(Sanitized) | 22.2 | 23.2 | 26.6 | 51.8 | 51.4 |
+| MMLU | 46.9 | 47.1 | 44.1 | 65.8 | 63.7 |
+| AGIEval | 33.4 | 38.8 | 34.6 | 49.9 | 47.2 |
+| BBH | 37.5 | 35.2 | 34.3 | 65.0 | 61.2 |
+| GSM8K | 31.2 | 39.7 | 34.3 | 70.8 | 70.7 |
+| MATH | 5.6 | 11.8 | 10.7 | 20.2 | 23.0 |
+| HumanEval | 25.0 | 32.9 | 29.3 | 43.3 | 59.8 |
+| MBPP(Sanitized) | 22.2 | 23.2 | 27.0 | 51.8 | 51.4 |
 
 - The evaluation results were obtained from [OpenCompass](https://github.com/open-compass/opencompass) , and evaluation configuration can be found in the configuration files provided by [OpenCompass](https://github.com/open-compass/opencompass).