diff --git a/README.md b/README.md
index b75b352..ac9d6e6 100644
--- a/README.md
+++ b/README.md
@@ -19,7 +19,7 @@
 
 [📘Chat](./chat) |
 [🛠️Agent](./agent) |
-[📊Evaluation](./evaluation) |
+[📊Evaluation](#evaluation) |
 [👀Model](./model_cards) |
 [🤗HuggingFace](https://huggingface.co/spaces/internlm/internlm2-Chat-7B) |
 [🆕Update News](#news) |
@@ -154,6 +154,32 @@ Please refer to [finetune docs](./finetune/) for fine-tuning with InternLM.
 
 **Note:** We have migrated the whole training functionality in this project to [InternEvo](https://github.com/InternLM/InternEvo) for easier user experience, which provides efficient pre-training and fine-tuning infra for training InternLM.
 
+
+## Evaluation
+
+We utilize [OpenCompass](https://github.com/open-compass/opencompass) for model evaluation. In InternLM-2, we primarily focus on standard objective evaluation, long-context evaluation (needle in a haystack), data contamination assessment, agent evaluation, and subjective evaluation.
+
+### Objective Evaluation
+
+To evaluate the InternLM model, please follow the guidelines in the [OpenCompass tutorial](https://github.com/open-compass/opencompass). Typically, we use `ppl` for multiple-choice questions on the **Base** model and `gen` for all questions on the **Chat** model.
+
+### Long-Context Evaluation (Needle in a Haystack)
+
+For the `Needle in a Haystack` evaluation, refer to the tutorial provided in the [documentation](https://github.com/open-compass/opencompass/blob/main/docs/en/advanced_guides/needleinahaystack_eval.md). Feel free to try it out.
+
+### Data Contamination Assessment
+
+To learn more about data contamination assessment, please check the [contamination eval](https://opencompass.readthedocs.io/en/latest/advanced_guides/contamination_eval.html).
+
+### Agent Evaluation
+
+- To evaluate tool utilization, please refer to [T-Eval](https://github.com/open-compass/T-Eval).
+- For code interpreter evaluation, use the [gsm-8k-agent](https://github.com/open-compass/opencompass/blob/main/configs/datasets/gsm8k/gsm8k_agent_gen_be1606.py) provided in the repository. Additionally, you need to install [Lagent](https://github.com/InternLM/lagent).
+
+### Subjective Evaluation
+
+- Please follow the [tutorial](https://opencompass.readthedocs.io/en/latest/advanced_guides/subjective_evaluation.html) for subjective evaluation.
+
 ## Contribution
 
 We appreciate all the contributors for their efforts to improve and enhance InternLM. Community users are highly encouraged to participate in the project. Please refer to the contribution guidelines for instructions on how to contribute to the project.
diff --git a/README_zh-CN.md b/README_zh-CN.md
index 2d7d051..a414272 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -20,7 +20,7 @@
 
 [📘对话教程](./chat) |
 [🛠️智能体教程](./agent) |
-[📊评测](./evaluation) |
+[📊评测](#评测) |
 [👀模型库](./model_cards) |
 [🤗HuggingFace](https://huggingface.co/spaces/internlm/internlm2-Chat-7B) |
 [🆕Update News](#news) |
@@ -148,6 +148,32 @@ print(response)
 
 **注意:**本项目中的全量训练功能已经迁移到了[InternEvo](https://github.com/InternLM/InternEvo)以便捷用户的使用。InternEvo 提供了高效的预训练和微调基建用于训练 InternLM 系列模型。
 
+
+## 评测
+
+我们使用 [OpenCompass](https://github.com/open-compass/opencompass) 进行模型评估。在 InternLM-2 中,我们主要关注标准客观评估、长文评估(大海捞针)、数据污染评估、智能体评估和主观评估。
+
+### 标准客观评测
+
+请按照 [OpenCompass 教程](https://github.com/open-compass/opencompass) 进行客观评测。我们通常在 **Base** 模型上使用 `ppl` 进行多项选择题,在 **Chat** 模型上使用 `gen` 进行所有问题。
+
+### 长文评估(大海捞针)
+
+有关 `大海捞针` 评估,请参阅 [文档](https://github.com/open-compass/opencompass/blob/main/docs/en/advanced_guides/needleinahaystack_eval.md) 中的教程。
+
+### 数据污染评估
+
+要了解更多关于数据污染评估的信息,请查看 [污染评估](https://opencompass.readthedocs.io/en/latest/advanced_guides/contamination_eval.html)。
+
+### 智能体评估
+
+- 要评估大模型的工具利用能力,请使用 [T-Eval](https://github.com/open-compass/T-Eval) 进行评测。
+- 对于代码解释器评估,请使用 [gsm-8k-agent](https://github.com/open-compass/opencompass/blob/main/configs/datasets/gsm8k/gsm8k_agent_gen_be1606.py) 提供的配置进行评估。此外,您还需要安装 [Lagent](https://github.com/InternLM/lagent)。
+
+### 主观评估
+
+- 请按照 [教程](https://opencompass.readthedocs.io/en/latest/advanced_guides/subjective_evaluation.html) 进行主观评估。
+
 ## 贡献
 
 我们感谢所有的贡献者为改进和提升 InternLM 所作出的努力。非常欢迎社区用户能参与进项目中来。请参考贡献指南来了解参与项目贡献的相关指引。
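
Note on the `ppl` / `gen` distinction introduced in the Objective Evaluation section: the sketch below is not OpenCompass code and does not call its API. It assumes that `ppl` mode scores each candidate answer by the perplexity the model assigns to it and selects the lowest, while `gen` mode lets the model produce free text from which an answer is then extracted. The names `perplexity`, `pick_by_ppl`, `extract_choice`, and the log-probabilities in `toy_scores` are hypothetical and exist only for illustration.

```python
import math
import re


def perplexity(token_logprobs):
    """Perplexity of a sequence given its per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_by_ppl(option_logprobs):
    """ppl-style selection: return the option label with the lowest perplexity."""
    return min(option_logprobs, key=lambda label: perplexity(option_logprobs[label]))


def extract_choice(generated_text):
    """gen-style selection: crude extraction of the first standalone letter A-D.

    Returns None when no such letter is present. A real post-processor
    would need to be considerably more careful than this.
    """
    match = re.search(r"\b([A-D])\b", generated_text)
    return match.group(1) if match else None


# Hypothetical log-probabilities a Base model might assign to the tokens
# of each candidate answer (made-up numbers, not measured results).
toy_scores = {
    "A": [-2.3, -1.9, -2.8],
    "B": [-0.4, -0.7, -0.5],
    "C": [-1.5, -2.2, -1.1],
}

ppl_choice = pick_by_ppl(toy_scores)
gen_choice = extract_choice("The answer is B.")
```

The split mirrors the README's recommendation: a Base model has no instruction-following format to rely on, so comparing likelihoods of fixed candidates is the more natural fit, whereas a Chat model is expected to answer in text that can be parsed.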