pull/23/head
x54-729 2023-07-10 17:14:46 +08:00
commit cc02611e59
19 changed files with 486 additions and 123 deletions

.github/ISSUE_TEMPLATE/1_bug-report.yml
View File

@ -0,0 +1,57 @@
name: 🐞 Bug report
description: Create a report to help us improve
labels: ["bug"]
title: "[Bug] "

body:
  - type: markdown
    attributes:
      value: |
        If you have already identified the reason, we strongly appreciate you creating a new PR!
        If you need our help, please fill in the following form to help us identify the bug.
  - type: textarea
    id: describe
    validations:
      required: true
    attributes:
      label: Describe the bug
      description: |
        Please provide a clear and concise description of what the bug is.
        Preferably, include a simple and minimal code snippet that reproduces the error when run.
      placeholder: |
        A clear and concise description of what the bug is.

        ```python
        # Sample code to reproduce the problem
        ```

        ```shell
        The command or script you run.
        ```

        ```
        The error message or logs you got, with the full traceback.
        ```
  - type: textarea
    id: environment
    validations:
      required: true
    attributes:
      label: Environment
      description: |
        Please check the torch and CUDA versions and fill in the information below.
      placeholder: |
        ```python
        # The output of the above command
        ```
  - type: textarea
    id: other
    attributes:
      label: Other information
      description: |
        Tell us anything else you think we should know.
        1. Did you make any modifications to the code or config?
        2. What do you think might be the reason?

View File

@ -0,0 +1,29 @@
name: 🚀 Feature request
description: Suggest an idea for this project
labels: ["enhancement"]
title: "[Feature] "

body:
  - type: markdown
    attributes:
      value: |
        If you have already implemented the feature, we strongly appreciate you creating a new PR!
  - type: textarea
    id: describe
    validations:
      required: true
    attributes:
      label: Describe the feature
      description: |
        What kind of feature would you like this repository to add? If there is an official code release or a third-party implementation, please also provide that information here; it would be very helpful.
      placeholder: |
        A clear and concise description of the motivation for the feature.
        Ex1. It is inconvenient when \[....\].
        Ex2. There is a recent paper \[....\], which is very helpful for \[....\].
  - type: checkboxes
    id: pr
    attributes:
      label: Will you implement it?
      options:
        - label: I would like to implement this feature and create a PR!

View File

@ -0,0 +1,58 @@
name: 🐞 Report a Bug
description: Report unexpected behavior you encountered
labels: ["bug"]
title: "[Bug] "

body:
  - type: markdown
    attributes:
      value: |
        We recommend using the English template (Bug report) so that your issue can help more people.
        If you already have a solution, we warmly welcome you to create a new PR directly to fix it.
        If you need our help, please fill in the following form to help us locate the bug.
  - type: textarea
    id: describe
    validations:
      required: true
    attributes:
      label: Describe the bug
      description: |
        Please briefly describe the error you encountered. If possible, provide a short code snippet that helps us reproduce it.
      placeholder: |
        A brief description of the problem.

        ```python
        # Code snippet that reproduces the error
        ```

        ```shell
        # The command you ran when the error occurred
        ```

        ```
        Error messages and logs; please show the full error log and traceback.
        ```
  - type: textarea
    id: environment
    validations:
      required: true
    attributes:
      label: Environment
      description: |
        Please check your pytorch/CUDA environment information and paste it below.
      placeholder: |
        ```python
        # The output of the above command
        ```
  - type: textarea
    id: other
    attributes:
      label: Other information
      description: |
        Tell us anything else you think is valuable.
        1. Did you make any changes to the code or config files?
        2. What do you think might be the cause?

View File

@ -0,0 +1,31 @@
name: 🚀 Feature suggestion
description: Suggest a new feature
labels: ["enhancement"]
title: "[Feature] "

body:
  - type: markdown
    attributes:
      value: |
        We recommend using the English template (Feature request) so that your issue can help more people.
        If you have already implemented the feature, we warmly welcome you to create a new PR directly. The PR workflow is described in the [documentation](https://opencompass.readthedocs.io/zh_CN/master/community/CONTRIBUTING.html).
  - type: textarea
    id: describe
    validations:
      required: true
    attributes:
      label: Describe the feature
      description: |
        What feature would you like this repository to add? If there is a related paper, an official implementation, or a third-party implementation, please also post the link; this would be very helpful.
      placeholder: |
        A brief description of the feature and why it is needed.
        Ex1. It is currently inconvenient to do xxx.
        Ex2. A recent paper proposes xx, which would be very helpful.
  - type: checkboxes
    id: pr
    attributes:
      label: Will you implement it yourself?
      options:
        - label: I would like to implement this feature myself and contribute code to InternLM!

.github/ISSUE_TEMPLATE/config.yml
View File

@ -0,0 +1,12 @@
blank_issues_enabled: false
contact_links:
  - name: 📚 InternLM Documentation (official docs)
    url: https://internlm.readthedocs.io/en/latest/
    about: Check if your question is answered in the docs
  - name: 💬 General questions (ask for help)
    url: https://github.com/InternLM/InternLM/discussions
    about: Ask general usage questions and discuss with other InternLM community members
  - name: 🌐 Explore InternLM (official website)
    url: https://internlm.org/
    about: Learn more about InternLM

.github/pull_request_template.md
View File

@ -0,0 +1,32 @@
Thanks for your contribution, and we appreciate it a lot. The following instructions will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry; just make the pull request and seek help from the maintainers.
## Motivation
Please describe the motivation for this PR and the goal you want to achieve with it.
## Modification
Please briefly describe the modifications made in this PR.
## BC-breaking (Optional)
Does the modification introduce changes that break the backward compatibility of downstream repositories?
If so, please describe how it breaks compatibility and how downstream projects should modify their code to remain compatible with this PR.
## Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here and update the documentation.
## Checklist
**Before PR**:
- [ ] Pre-commit or other linting tools are used to fix potential lint issues.
- [ ] Bug fixes are fully covered by unit tests, and the case that caused the bug is added to the unit tests.
- [ ] The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
- [ ] The documentation has been modified accordingly, like docstring or example tutorials.
**After PR**:
- [ ] If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
- [ ] CLA has been signed and all committers have signed the CLA in this PR.

View File

@ -15,6 +15,7 @@
</div>
[![license](./doc/imgs/license.svg)](https://github.com/open-mmlab/mmdetection/blob/main/LICENSE)
[![evaluation](./doc/imgs/compass_support.svg)](https://github.com/internLM/OpenCompass/)
[📘Usage Documentation](./doc/usage.md) |
[🛠️Installation Tutorial](./doc/install.md) |
@ -23,15 +24,15 @@
[🆕Update News](./CHANGE_LOG.md) |
[🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new)
[English](./README.md) |
[简体中文](./README-zh-Hans.md)
</div>
## Introduction

InternLM, i.e. the Shusheng·Puyu (书生·浦语) large model, includes InternLM-7B, a 7-billion-parameter base model and a chat model tailored for practical scenarios. The model has the following features:

- It is trained on trillions of high-quality tokens to establish a powerful knowledge base;
- It supports an 8k context window length, enabling longer inputs and stronger reasoning;
- It provides general tool-calling capabilities, allowing users to flexibly build their own workflows;
@ -42,7 +43,7 @@ InternLM, i.e. the Shusheng·Puyu large model, includes 7-billion-parameter
### Performance Evaluation

We used the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/) to evaluate InternLM comprehensively across five capability dimensions: disciplinary competence, language competence, knowledge competence, inference competence, and comprehension competence. Some of the results are shown in the table below; visit the [ OpenCompass leaderboard ](https://opencompass.org.cn/rank) for more.
We used the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/) to evaluate InternLM comprehensively across five capability dimensions: disciplinary competence, language competence, knowledge competence, inference competence, and comprehension competence. Some of the results are shown in the table below; visit the [OpenCompass leaderboard](https://opencompass.org.cn/rank) for more.

| Datasets\Models | **InternLM-Chat-7B** | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
| -------------------- | --------------------- | ---------------- | --------- | --------- | ------------ | --------- | ---------- |
@ -61,6 +62,7 @@ InternLM, i.e. the Shusheng·Puyu large model, includes 7-billion-parameter
- Evaluation results may differ numerically across version iterations of [OpenCompass](https://github.com/internLM/OpenCompass/); please rely on the evaluation results of the latest version of [OpenCompass](https://github.com/internLM/OpenCompass/).
### Model Zoo
InternLM 7B and InternLM 7B Chat, trained with InternLM, have been open-sourced. We provide two formats of model weights. Besides loading the models in the Transformers format, you can also load the weights in the formats below directly with InternLM for further pre-training or human preference alignment training:

| Model | InternLM Format Weight Download Link | Transformers Format Weight Download Link |
@ -69,11 +71,12 @@ InternLM, i.e. the Shusheng·Puyu large model, includes 7-billion-parameter
| **InternLM Chat 7B** | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b) | [🤗internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b)
| **InternLM Chat 7B 8k** | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b-8k) | [🤗internlm/internlm-chat-7b-8k](https://huggingface.co/internlm/internlm-chat-7b-8k)
**Limitations:** Although we have paid great attention to the safety of the model during training and have tried to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and its probabilistic generation paradigm. For example, responses may contain bias, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.

### Import from Transformers

Load the InternLM 7B Chat model with the following code:
```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True)
@ -91,12 +94,15 @@ InternLM, i.e. the Shusheng·Puyu large model, includes 7-billion-parameter
```
### Dialogue via Web Frontend

You can start a web frontend and interact with the InternLM Chat 7B model by running the following:
```bash
pip install streamlit==1.24.0
pip install transformers==4.30.2
streamlit run web_demo.py
```
The effect is as follows:

![demo](https://github.com/InternLM/InternLM/assets/9102141/11b60ee0-47e4-42c0-8278-3051b2f17fe4)
@ -128,36 +134,40 @@ streamlit run web_demo.py
## Fine-tuning & Training

### Pre-training and Fine-tuning Tutorial

Please refer to the [usage tutorial](./doc/usage.md) to get started with InternLM installation, data processing, pre-training, and fine-tuning.

### Convert to Transformers Format

Models trained with InternLM can easily be converted to the HuggingFace Transformers format, enabling seamless integration with the community's various open-source projects. With `tools/convert2hf.py`, weights saved during training can be converted to the transformers format with one command:
```bash
python convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer tokenizes/tokenizer.model
```
After conversion, the model can be loaded with transformers using the following code:
```python
>>> from transformers import AutoTokenizer, AutoModel
>>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
```
## Training System

### System Architecture

Please refer to the [system architecture document](./doc/structure.md) for further details.

### Training Performance

InternLM deeply integrates high-performance operators such as Flash-Attention and Apex to improve training efficiency. The Hybrid Zero technique achieves efficient overlap of computation and communication, greatly reducing cross-node communication traffic during training. InternLM supports scaling the 7B model from 8 GPUs up to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of more than 3600 tokens processed per GPU per second. The following table shows InternLM's scalability test data under different configurations:
| Number of GPUs | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
| TKS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
TKS represents the average number of tokens processed per GPU per second. For more performance test data, see the [training performance document](./doc/train_performance.md).
| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
TGS represents the average number of tokens processed per GPU per second. For more performance test data, see the [training performance document](./doc/train_performance.md).
## Contribution
@ -169,4 +179,15 @@ The InternLM codebase is developed by Shanghai AI Laboratory and researchers from different universities
## Open Source License

The code in this repository is open-sourced under the Apache-2.0 license. InternLM weights are fully open for academic research and also permit commercial use once official written permission has been obtained. For commercial licenses and collaborations, contact internlm@pjlab.org.cn.
The code in this repository is open-sourced under the Apache-2.0 license. The model weights are fully open for academic research, and a free commercial-use license can also be applied for ([application form](https://wj.qq.com/s2/12725412/f7c1/)). For other questions and collaborations, please contact <internlm@pjlab.org.cn>.
## Citation
```
@misc{2023internlm,
title={InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities},
author={InternLM Team},
howpublished = {\url{https://github.com/InternLM/InternLM}},
year={2023}
}
```

View File

@ -3,7 +3,7 @@
<div align="center">
<img src="./doc/imgs/logo.svg" width="200"/>
<div>&nbsp;</div>
<div> </div>
<div align="center">
<b><font size="5">InternLM</font></b>
<sup>
@ -11,7 +11,7 @@
<i><font size="4">HOT</font></i>
</a>
</sup>
<div>&nbsp;</div>
<div> </div>
</div>
[![license](./doc/imgs/license.svg)](./LICENSE)
@ -24,16 +24,15 @@
[🆕Update News](./CHANGE_LOG.md) |
[🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new)
[English](./README.md) |
[简体中文](./README-zh-Hans.md)
</div>
## Introduction
InternLM has open-sourced a 7 billion parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:
- It leverages trillions of high-quality tokens for training to establish a powerful knowledge base.
- It supports an 8k context window length, enabling longer input sequences and stronger reasoning capabilities.
- It provides a versatile toolset for users to flexibly build their own workflows.
@ -47,7 +46,7 @@ Additionally, a lightweight training framework is offered to support model pre-t
We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/). The evaluation covered five dimensions of capabilities: disciplinary competence, language competence, knowledge competence, inference competence, and comprehension competence. Here are some of the evaluation results, and you can visit the [OpenCompass leaderboard](https://opencompass.org.cn/rank) for more evaluation results.
| Datasets\Models | **InternLM-Chat-7B** | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
| -------------------- | --------------------- | ---------------- | --------- | --------- | ------------ | --------- | ---------- |
| --------------- | -------------------------- | --------------------- | -------- | ----------- | ----------- | --------- | --------- |
| C-Eval(Val) | 53.2 | 53.4 | 24.2 | 42.7 | 50.9 | 28.9 | 31.2 |
| MMLU | 50.8 | 51.0 | 35.2* | 41.5 | 46.0 | 39.7 | 47.3 |
| AGIEval | 42.5 | 37.6 | 20.8 | 24.6 | 39.0 | 24.1 | 26.4 |
@ -63,19 +62,21 @@ We conducted a comprehensive evaluation of InternLM using the open-source evalua
- The evaluation data may have numerical differences due to the version iteration of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results of [OpenCompass](https://github.com/internLM/OpenCompass/).
### Model Zoo
InternLM 7B and InternLM 7B Chat, trained using InternLM, have been open-sourced. We provide two formats of model weights for use. In addition to loading the models using the Transformers format, you can also load the weights directly using InternLM for further pre-training or human preference alignment training.
| Model | InternLM Format Weight Download Link | Transformers Format Weight Download Link |
| -------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------ |
| ----------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------- |
| **InternLM 7B** | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-7b) | [🤗internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) |
| **InternLM Chat 7B** | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b) | [🤗internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) |
| **InternLM Chat 7B 8k** | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b-8k) | [🤗internlm/internlm-chat-7b-8k](https://huggingface.co/internlm/internlm-chat-7b-8k)
| **InternLM Chat 7B 8k** | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b-8k) | [🤗internlm/internlm-chat-7b-8k](https://huggingface.co/internlm/internlm-chat-7b-8k) |
**Limitations:** Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
### Import from Transformers
To load the InternLM 7B Chat model using Transformers, use the following code:
```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True)
@ -96,75 +97,81 @@ Remember, good time management skills take practice and patience. Start with sma
```
### Dialogue
You can interact with the InternLM Chat 7B model through a frontend interface by running the following code:
```bash
pip install streamlit==1.24.0
pip install transformers==4.30.2
streamlit run web_demo.py
```
The effect is as follows
![demo](https://github.com/InternLM/InternLM/assets/9102141/11b60ee0-47e4-42c0-8278-3051b2f17fe4)
### Deployment
We use [LMDeploy](https://github.com/InternLM/LMDeploy) to complete the one-click deployment of InternLM.
1. First, install LMDeploy:

```
python3 -m pip install lmdeploy
```

2. Use the following command for quick deployment:

```
python3 -m lmdeploy.serve.turbomind.deploy InternLM-7B /path/to/internlm-7b/model hf
```

3. After exporting the model, you can start a server and have a conversation with the deployed model using the following command:

```
python3 -m lmdeploy.serve.client {server_ip_address}:33337
```
[LMDeploy](https://github.com/InternLM/LMDeploy) provides a complete workflow for deploying InternLM. Please refer to the [deployment tutorial](https://github.com/InternLM/LMDeploy) for more details on deploying InternLM.
## Fine-tuning & Training
### Pre-training and Fine-tuning Tutorial
Please refer to the [Usage Tutorial](./doc/en/usage.md) to get started with InternLM installation, data processing, pre-training, and fine-tuning.
### Convert to Transformers Format
Models trained with InternLM can be easily converted to the HuggingFace Transformers format, enabling seamless integration with various open-source projects in the community. With the help of `tools/convert2hf.py`, the weights saved during training can be converted into the transformers format with one command:
```bash
python convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer tokenizes/tokenizer.model
```
After conversion, the model can be loaded with transformers using the following code:
```python
>>> from transformers import AutoTokenizer, AutoModel
>>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
```
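If you want a quick sanity check of the converted checkpoint, the hedged sketch below runs a short greedy generation with it. It assumes `hf_ckpt/` also contains the tokenizer files written by the conversion step, and the prompt is only illustrative:

```python
>>> # Hypothetical smoke test for the converted weights; not part of the official tooling.
>>> tokenizer = AutoTokenizer.from_pretrained("hf_ckpt/", trust_remote_code=True)
>>> inputs = tokenizer("A sequoia is a", return_tensors="pt").to("cuda")
>>> output = model.generate(**inputs, max_new_tokens=32)
>>> print(tokenizer.decode(output[0], skip_special_tokens=True))
```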
## Training System
### System Architecture
Please refer to the [System Architecture document](./doc/en/structure.md) for further details.
### Training Performance
InternLM deeply integrates Flash-Attention, Apex and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternLM supports expanding the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternLM's scalability test data at different configurations:
| Number of GPUs | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| -------------- | ------ | ------- | ------- | ------- | -------- | -------- | -------- | --------- |
| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
TGS represents the average number of tokens processed per GPU per second. For more performance test data, please refer to the [Training Performance document](./doc/en/train_performance.md) for further details.
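To make the relationship between these numbers concrete, here is a small illustrative calculation using the figures from the table above (an informal check, not an official script):

```python
# Tokens per GPU per second at 8 and 1024 GPUs, from the scalability table above.
tgs_8, tgs_1024 = 4078, 3625
gpus = 1024

cluster_throughput = tgs_1024 * gpus    # ~3.71M tokens/s for the whole cluster
scaling_efficiency = tgs_1024 / tgs_8   # ~0.889, in line with the ~90% quoted above

print(f"{cluster_throughput:,} tokens/s, {scaling_efficiency:.1%} efficiency")
```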
## Contribution
We appreciate all the contributors for their efforts to improve and enhance InternLM. Community users are highly encouraged to participate in the project. Please refer to the contribution guidelines for instructions on how to contribute to the project.
@ -173,6 +180,17 @@ We appreciate all the contributors for their efforts to improve and enhance Inte
InternLM codebase is an open-source project contributed by Shanghai AI Laboratory and researchers from different universities and companies. We would like to thank all the contributors for their support in adding new features to the project and the users for providing valuable feedback. We hope that this toolkit and benchmark can provide the community with flexible and efficient code tools for fine-tuning InternLM and developing their own models, thus continuously contributing to the open-source community. Special thanks to the two open-source projects, [flash-attention](https://github.com/HazyResearch/flash-attention) and [ColossalAI](https://github.com/hpcaitech/ColossalAI).
## Open Source License
## License
The code in this repository is open-source under the Apache-2.0 license. The InternLM weights are fully open for academic research and also allow commercial use with written permission from the official team. For inquiries about commercial licenses and collaborations, please contact internlm@pjlab.org.cn.
The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow **free** commercial usage. To apply for a commercial license, please fill in the [application form (English)](https://wj.qq.com/s2/12727483/5dba/)/[申请表(中文)](https://wj.qq.com/s2/12725412/f7c1/). For other questions or collaborations, please contact <internlm@pjlab.org.cn>.
## Citation
```
@misc{2023internlm,
title={InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities},
author={InternLM Team},
howpublished = {\url{https://github.com/InternLM/InternLM}},
year={2023}
}
```

View File

@ -6,7 +6,7 @@ InternLM deeply integrates Flash-Attention, Apex, and other high-performance mod
| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
| TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
We tested the performance of training the 7B model in InternLM using various parallel configurations on a GPU cluster. In each test group, the number of tokens processed per GPU in a single iteration remained consistent. The hardware and parameter configurations used in the tests are shown in the table below:
@ -46,19 +46,46 @@ Throughput is defined as TGS, the average number of tokens processed per GPU per
### FLOPS Testing
The computational workload of model training is based on the FLOPS calculation method described in the [Megatron](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf) paper. To ensure constant FLOPS during training, the test configuration had `pack_sample_into_one=True`. The training used the following configuration:
The computational workload of model training is based on the FLOPS calculation method described in the [Megatron](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf) paper. To ensure constant FLOPS during training, the test configuration had `pack_sample_into_one=True`, `dtype=torch.bfloat16`.
| Activation Checkpointing | tp | zero-1 | seq_len | micro_num | micro_bsz |
| --- | --- | ------ | ------- | --------- | --------- |
| Disabled | 1 | 8 | 2048 | 4 | 2 |
| Enabled | 1 | 8 | 2048 | 1 | 8 |
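For reference, the Megatron-style estimate is easy to check numerically. The sketch below reproduces the reported TFLOPS to within rounding, under an assumed 7B model shape (32 layers, hidden size 4096, vocabulary of ~103k); these shape values are assumptions, not figures stated in this document:

```python
# Hedged sketch of the Megatron-paper model-FLOPS estimate.
# Assumed InternLM-7B shape; adjust if the actual config differs.
l, h, V, s = 32, 4096, 103168, 2048  # layers, hidden size, vocab size, seq len

def flops_per_token(activation_ckpt: bool) -> float:
    # Forward+backward costs ~72*l*h^2*(...) FLOPs per token; activation
    # checkpointing reruns the forward pass, raising the factor from 72 to 96.
    factor = 96 if activation_ckpt else 72
    return factor * l * h**2 * (1 + s / (6 * h) + V / (16 * l * h))

# 1024 GPUs with activation checkpointing: TGS = 3163 in the tables below.
print(flops_per_token(True) * 3163 / 1e12)   # ~184.6 -> table reports 184 TFLOPS
# 1024 GPUs without activation checkpointing: TGS = 3625.
print(flops_per_token(False) * 3625 / 1e12)  # ~158.7 -> table reports 160 TFLOPS
```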
The test results are shown in the table below. InternLM can achieve `>180 TFLOPS` for the 7B model at thousand-GPU scale.
When `Activation Ckpt` is enabled, the test results are shown in the table below. InternLM can achieve `>180 TFLOPS` for 7B model training with 1024 GPUs.
- TGS: Tokens per GPU per Second
- Global Bsz: The total number of tokens processed by all GPUs in one step
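For example, taking the first row of the table below: Global Bsz = GPU Num × Seq Len × Micro Bsz × Micro Num = 8 × 2048 × 8 × 1 = 131072 tokens ≈ 0.125M.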
| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
|-|-|-|-|-|-|-|-|-|-|-|
| 1 | 8 | TRUE | TRUE | 8 | 2048 | 8 | 1 | 0.125M | 3314 | 193 |
| 1 | 8 | TRUE | TRUE | 16 | 2048 | 8 | 1 | 0.25M | 3268 | 191 |
| 1 | 8 | TRUE | TRUE | 32 | 2048 | 8 | 1 | 0.5M | 3323 | 188 |
| 1 | 8 | TRUE | TRUE | 64 | 2048 | 8 | 1 | 1M | 3217 | 188 |
| 1 | 8 | TRUE | TRUE | 128 | 2048 | 8 | 1 | 2M | 3260 | 187 |
| 1 | 8 | TRUE | TRUE | 256 | 2048 | 8 | 1 | 4M | 3215 | 187 |
| 1 | 8 | TRUE | TRUE | 512 | 2048 | 8 | 1 | 8M | 3199 | 186 |
| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 8 | 1 | 16M | 3163 | 184 |
| 1 | 8 | TRUE | TRUE | 512 | 2048 | 4 | 1 | 4M | 2963 | 173 |
| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 2 | 1 | 4M | 2341 | 136 |
| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 4 | 1 | 8M | 2796 | 160 |
When `Activation Ckpt` is turned off, the test results are as shown in the table below:
| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
|-|-|-|-|-|-|-|-|-|-|-|
| 1 | 8 | TRUE | FALSE | 8 | 2048 | 2 | 4 | 0.125M | 4103 | 183 |
| 1 | 8 | TRUE | FALSE | 16 | 2048 | 2 | 4 | 0.25M | 3939 | 177 |
| 1 | 8 | TRUE | FALSE | 32 | 2048 | 2 | 4 | 0.5M | 3919 | 176 |
| 1 | 8 | TRUE | FALSE | 64 | 2048 | 2 | 4 | 1M | 3944 | 174 |
| 1 | 8 | TRUE | FALSE | 128 | 2048 | 2 | 4 | 2M | 3928 | 173 |
| 1 | 8 | TRUE | FALSE | 256 | 2048 | 2 | 4 | 4M | 3920 | 173 |
| 1 | 8 | TRUE | FALSE | 512 | 2048 | 2 | 4 | 8M | 3900 | 173 |
| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 4 | 16M | 3625 | 160 |
| 1 | 8 | TRUE | FALSE | 512 | 2048 | 2 | 2 | 4M | 3084 | 139 |
| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 1 | 4M | 2346 | 105 |
| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 2 | 8M | 2817 | 124 |
| Activation Checkpoint | 8 GPUs | 16 GPUs | 32 GPUs | 64 GPUs | 128 GPUs | 256 GPUs | 512 GPUs | 1024 GPUs |
| --------------------- | ------ | ------- | ------- | ------- | -------- | -------- | -------- | --------- |
| Disabled | 183 | 177 | 176 | 174 | 173 | 173 | 173 | 160 |
| Enabled | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
<div align="left">
<img src="../imgs/flops.png" width="580"/>

View File

@ -14,7 +14,7 @@ You can generate the `bin` and `meta` files for your raw data by running the fol
```bash
$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'text' or 'json' or 'jsonl' --bin your_output_bin_path
$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'txt' or 'json' or 'jsonl' --bin your_output_bin_path
```
Here is an example of data processing (only the data processing example for the `txt` format is provided here, the data processing process for `json` and `jsonl` is exactly the same as for `txt`):
@ -192,7 +192,7 @@ $ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python trai
If you want to start distributed training on torch with 8 GPUs on a single node, use the following command:
```bash
$ torchrun --nnodes=1 --nproc-per-node=8 train.py --config ./configs/7B_sft.py
$ torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_sft.py
```
### Training Results
@ -200,10 +200,21 @@ $ torchrun --nnodes=1 --nproc-per-node=8 train.py --config ./configs/7B_sft.py
Taking the configuration of the demo training on a single machine with 8 GPUs on slurm as an example, the training result log is shown below:
```bash
2023-07-04 21:40:14,148 INFO train.py:318 in record_current_batch_training_metrics -- step=17,loss=9.810295104980469,tgs (tokens per gpu per second)=4399.93,lr=3.8e-06,loss_scale=65536.0,grad_norm=4.177205427229359,micro_num=4,num_consumed_tokens=2359296,inf_nan_skip_batches=0,num_samples_in_batch=60,largest_length=1300,largest_batch=18,smallest_batch=13,adam_beta2=0.95,fwd_bwd_time=3.57
2023-07-04 21:40:17,825 INFO train.py:318 in record_current_batch_training_metrics -- step=18,loss=9.715232849121094,tgs (tokens per gpu per second)=4457.7,lr=4.000000000000001e-06,loss_scale=65536.0,grad_norm=5.018154183978863,micro_num=4,num_consumed_tokens=2490368,inf_nan_skip_batches=0,num_samples_in_batch=68,largest_length=1153,largest_batch=19,smallest_batch=16,adam_beta2=0.95,fwd_bwd_time=3.52
2023-07-04 21:40:21,526 INFO train.py:318 in record_current_batch_training_metrics -- step=19,loss=9.76744556427002,tgs (tokens per gpu per second)=4429.13,lr=4.2000000000000004e-06,loss_scale=65536.0,grad_norm=5.245329823265071,micro_num=4,num_consumed_tokens=2621440,inf_nan_skip_batches=0,num_samples_in_batch=70,largest_length=706,largest_batch=18,smallest_batch=17,adam_beta2=0.95,fwd_bwd_time=3.54
2023-07-04 21:40:25,227 INFO train.py:318 in record_current_batch_training_metrics -- step=20,loss=9.628969192504883,tgs (tokens per gpu per second)=4427.46,lr=4.4e-06,loss_scale=65536.0,grad_norm=5.503176552110271,micro_num=4,num_consumed_tokens=2752512,inf_nan_skip_batches=0,num_samples_in_batch=69,largest_length=915,largest_batch=20,smallest_batch=15,adam_beta2=0.95,fwd_bwd_time=3.55
2023-07-04 21:40:28,899 INFO train.py:318 in record_current_batch_training_metrics -- step=21,loss=9.690847396850586,tgs (tokens per gpu per second)=4464.18,lr=4.6e-06,loss_scale=65536.0,grad_norm=5.5336643273197526,micro_num=4,num_consumed_tokens=2883584,inf_nan_skip_batches=0,num_samples_in_batch=66,largest_length=870,largest_batch=17,smallest_batch=16,adam_beta2=0.95,fwd_bwd_time=3.52
2023-07-04 21:40:32,629 INFO train.py:318 in record_current_batch_training_metrics -- step=22,loss=9.61986255645752,tgs (tokens per gpu per second)=4393.28,lr=4.800000000000001e-06,loss_scale=65536.0,grad_norm=9.01168869536059,micro_num=4,num_consumed_tokens=3014656,inf_nan_skip_batches=0,num_samples_in_batch=65,largest_length=1151,largest_batch=20,smallest_batch=14,adam_beta2=0.95,fwd_bwd_time=3.57
2023-07-07 12:26:58,293 INFO launch.py:228 in launch -- Distributed environment is initialized, data parallel size: 8, pipeline parallel size: 1, tensor parallel size: 1
2023-07-07 12:26:58,293 INFO parallel_context.py:535 in set_seed -- initialized seed on rank 2, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
2023-07-07 12:26:58,295 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=0===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=5===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=1===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=6===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=7===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=2===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=4===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=3===========
2023-07-07 12:28:27,826 INFO hybrid_zero_optim.py:295 in _partition_param_list -- Number of elements on ranks: [907415552, 907411456, 910163968, 910163968, 921698304, 921698304, 921698304, 921698304], rank:0
2023-07-07 12:28:57,802 INFO train.py:323 in record_current_batch_training_metrics -- tflops=63.27010355651958,step=0,loss=11.634403228759766,tgs (tokens/gpu/second)=1424.64,lr=4.0000000000000003e-07,loss_scale=65536.0,grad_norm=63.672620777841004,micro_num=4,num_consumed_tokens=131072,inf_nan_skip_batches=0,num_samples_in_batch=19,largest_length=2048,largest_batch=5,smallest_batch=4,adam_beta2=0.95,fwd_bwd_time=6.48
2023-07-07 12:29:01,636 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.83371103277346,step=1,loss=11.613704681396484,tgs (tokens/gpu/second)=4274.45,lr=6.000000000000001e-07,loss_scale=65536.0,grad_norm=65.150786641452,micro_num=4,num_consumed_tokens=262144,inf_nan_skip_batches=0,num_samples_in_batch=16,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.67
2023-07-07 12:29:05,451 INFO train.py:323 in record_current_batch_training_metrics -- tflops=190.99928472960033,step=2,loss=11.490386962890625,tgs (tokens/gpu/second)=4300.69,lr=8.000000000000001e-07,loss_scale=65536.0,grad_norm=61.57798028719357,micro_num=4,num_consumed_tokens=393216,inf_nan_skip_batches=0,num_samples_in_batch=14,largest_length=2048,largest_batch=4,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.66
2023-07-07 12:29:09,307 INFO train.py:323 in record_current_batch_training_metrics -- tflops=188.8613541410694,step=3,loss=11.099515914916992,tgs (tokens/gpu/second)=4252.55,lr=1.0000000000000002e-06,loss_scale=65536.0,grad_norm=63.5478796484391,micro_num=4,num_consumed_tokens=524288,inf_nan_skip_batches=0,num_samples_in_batch=16,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.7
2023-07-07 12:29:13,147 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.65918563194305,step=4,loss=10.149517059326172,tgs (tokens/gpu/second)=4270.52,lr=1.2000000000000002e-06,loss_scale=65536.0,grad_norm=51.582841631508145,micro_num=4,num_consumed_tokens=655360,inf_nan_skip_batches=0,num_samples_in_batch=19,largest_length=2048,largest_batch=6,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.68
2023-07-07 12:29:16,994 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.3109313713174,step=5,loss=9.822169303894043,tgs (tokens/gpu/second)=4262.67,lr=1.4000000000000001e-06,loss_scale=65536.0,grad_norm=47.10386835560855,micro_num=4,num_consumed_tokens=786432,inf_nan_skip_batches=0,num_samples_in_batch=17,largest_length=2048,largest_batch=6,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.69
```
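As a quick sanity check on the `_partition_param_list` line above, the eight per-rank element counts sum to 907415552 + 907411456 + 2 × 910163968 + 4 × 921698304 ≈ 7.32e9, which is consistent with the 7B model's parameters being partitioned across the eight ZeRO ranks.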

View File

@ -2,10 +2,10 @@
InternLM deeply integrates high-performance operators such as Flash-Attention and Apex to improve training efficiency. The Hybrid Zero technique achieves efficient overlap of computation and communication, greatly reducing cross-node communication traffic during training. InternLM supports scaling the 7B model from 8 GPUs up to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of more than 3600 tokens processed per GPU per second. The following table shows InternLM's scalability test data under different configurations:
| InternLM | 8 GPUs | 16 GPUs | 32 GPUs | 64 GPUs | 128 GPUs | 256 GPUs | 512 GPUs | 1024 GPUs |
| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
| TKS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
| TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
We tested the performance of training the 7B model with InternLM under various parallel configurations on a GPU cluster. In each test group, the number of tokens processed per GPU in a single iteration was kept the same. The hardware and parameter configurations used in the tests are shown in the table below:
@ -29,7 +29,7 @@ The `zero1` configuration in InternLM determines the partitioning scope of optimizer states.
### Throughput Measurement
Throughput is defined as TGS, the average number of tokens processed per GPU per second. In this test's training configuration, `pack_sample_into_one=False` and `checkpoint=False`. The results are shown in the table below: with `zero1=8, tp=1`, InternLM's 7B-model training scales to thousand-GPU runs with an acceleration efficiency of up to `88%`.
Throughput is defined as TGS, the average number of tokens processed per GPU per second. In this test's training configuration, `pack_sample_into_one=False`, `checkpoint=False`, `dtype=torch.bfloat16`. The results are shown in the table below: with `zero1=8, tp=1`, InternLM's 7B-model training scales to thousand-GPU runs with an acceleration efficiency of up to `88%`.
| Parallel Configuration | 8 GPUs | 16 GPUs | 32 GPUs | 64 GPUs | 128 GPUs | 256 GPUs | 512 GPUs | 1024 GPUs |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
@ -44,19 +44,47 @@ The `zero1` configuration in InternLM determines the partitioning scope of optimizer states.
</div>
### FLOPS Testing
The computational workload of model training follows the FLOPS calculation method described in the [Megatron](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf) paper. To keep FLOPS constant during training, this test's configuration set `pack_sample_into_one=True`, with the remaining hyperparameters as follows:
The computational workload of model training follows the FLOPS calculation method described in the [Megatron](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf) paper. To keep FLOPS constant during training, this test's configuration set `pack_sample_into_one=True` and `dtype=torch.bfloat16`.
| Activation Checkpointing | tp | zero-1 | seq_len | micro_num | micro_bsz |
| --- | --- | ---- | ---- | ---- | ---- |
| Disabled | 1 | 8 | 2048 | 4 | 2 |
| Enabled | 1 | 8 | 2048 | 1 | 8 |
The test results are shown in the table below; InternLM can reach `>180 TFLOPS` for 7B model training at thousand-GPU scale:

| Activation Checkpointing | 8 GPUs | 16 GPUs | 32 GPUs | 64 GPUs | 128 GPUs | 256 GPUs | 512 GPUs | 1024 GPUs |
| --------------- | --- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
| Disabled | 183 | 177 | 176 | 174 | 173 | 173 | 173 | 160 |
| Enabled | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
With `Activation Ckpt` enabled, the test results are shown in the table below; InternLM can reach `>180 TFLOPS` for 7B model training on 1024 GPUs:
- TGS: Tokens per GPU per Second
- Global Bsz: the total number of tokens processed by all GPUs in one step
| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
|-|-|-|-|-|-|-|-|-|-|-|
| 1 | 8 | TRUE | TRUE | 8 | 2048 | 8 | 1 | 0.125M | 3314 | 193 |
| 1 | 8 | TRUE | TRUE | 16 | 2048 | 8 | 1 | 0.25M | 3268 | 191 |
| 1 | 8 | TRUE | TRUE | 32 | 2048 | 8 | 1 | 0.5M | 3323 | 188 |
| 1 | 8 | TRUE | TRUE | 64 | 2048 | 8 | 1 | 1M | 3217 | 188 |
| 1 | 8 | TRUE | TRUE | 128 | 2048 | 8 | 1 | 2M | 3260 | 187 |
| 1 | 8 | TRUE | TRUE | 256 | 2048 | 8 | 1 | 4M | 3215 | 187 |
| 1 | 8 | TRUE | TRUE | 512 | 2048 | 8 | 1 | 8M | 3199 | 186 |
| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 8 | 1 | 16M | 3163 | 184 |
| 1 | 8 | TRUE | TRUE | 512 | 2048 | 4 | 1 | 4M | 2963 | 173 |
| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 2 | 1 | 4M | 2341 | 136 |
| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 4 | 1 | 8M | 2796 | 160 |
With `Activation Ckpt` disabled, the test results are shown in the table below:
| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
|-|-|-|-|-|-|-|-|-|-|-|
| 1 | 8 | TRUE | FALSE | 8 | 2048 | 2 | 4 | 0.125M | 4103 | 183 |
| 1 | 8 | TRUE | FALSE | 16 | 2048 | 2 | 4 | 0.25M | 3939 | 177 |
| 1 | 8 | TRUE | FALSE | 32 | 2048 | 2 | 4 | 0.5M | 3919 | 176 |
| 1 | 8 | TRUE | FALSE | 64 | 2048 | 2 | 4 | 1M | 3944 | 174 |
| 1 | 8 | TRUE | FALSE | 128 | 2048 | 2 | 4 | 2M | 3928 | 173 |
| 1 | 8 | TRUE | FALSE | 256 | 2048 | 2 | 4 | 4M | 3920 | 173 |
| 1 | 8 | TRUE | FALSE | 512 | 2048 | 2 | 4 | 8M | 3900 | 173 |
| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 4 | 16M | 3625 | 160 |
| 1 | 8 | TRUE | FALSE | 512 | 2048 | 2 | 2 | 4M | 3084 | 139 |
| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 1 | 4M | 2346 | 105 |
| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 2 | 8M | 2817 | 124 |
<div align="left">
<img src="../doc/imgs/flops.png" width="580"/>
</div>

View File

@ -11,7 +11,7 @@ The dataset for an InternLM training task consists of a series of `bin` and `meta` files. Use `
You can run the following command to generate the `bin` and `meta` files for your raw data, where `raw_data_name` is the file name of the raw dataset, `input_file_type` is the format of the raw dataset (currently `txt`, `json`, and `jsonl` are supported), and `bin` is the save path for the generated `bin` file.
```bash
$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'text' or 'json' or 'jsonl' --bin your_output_bin_path
$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'txt' or 'json' or 'jsonl' --bin your_output_bin_path
```
Below is an example of data processing (only the `txt` example is given here; the processing flow for `json` and `jsonl` is exactly the same as for `txt`):
@ -175,17 +175,28 @@ $ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python trai
To launch a distributed run on torch with 8 GPUs on a single node, use the following command:
```bash
$ torchrun --nnodes=1 --nproc-per-node=8 train.py --config ./configs/7B_sft.py
$ torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_sft.py
```
### Training Results

Taking the demo training configuration on slurm with 8 GPUs on a single machine as an example, the training log is shown below:
```bash
2023-07-04 21:40:14,148 INFO train.py:318 in record_current_batch_training_metrics -- step=17,loss=9.810295104980469,tgs (tokens per gpu per second)=4399.93,lr=3.8e-06,loss_scale=65536.0,grad_norm=4.177205427229359,micro_num=4,num_consumed_tokens=2359296,inf_nan_skip_batches=0,num_samples_in_batch=60,largest_length=1300,largest_batch=18,smallest_batch=13,adam_beta2=0.95,fwd_bwd_time=3.57
2023-07-04 21:40:17,825 INFO train.py:318 in record_current_batch_training_metrics -- step=18,loss=9.715232849121094,tgs (tokens per gpu per second)=4457.7,lr=4.000000000000001e-06,loss_scale=65536.0,grad_norm=5.018154183978863,micro_num=4,num_consumed_tokens=2490368,inf_nan_skip_batches=0,num_samples_in_batch=68,largest_length=1153,largest_batch=19,smallest_batch=16,adam_beta2=0.95,fwd_bwd_time=3.52
2023-07-04 21:40:21,526 INFO train.py:318 in record_current_batch_training_metrics -- step=19,loss=9.76744556427002,tgs (tokens per gpu per second)=4429.13,lr=4.2000000000000004e-06,loss_scale=65536.0,grad_norm=5.245329823265071,micro_num=4,num_consumed_tokens=2621440,inf_nan_skip_batches=0,num_samples_in_batch=70,largest_length=706,largest_batch=18,smallest_batch=17,adam_beta2=0.95,fwd_bwd_time=3.54
2023-07-04 21:40:25,227 INFO train.py:318 in record_current_batch_training_metrics -- step=20,loss=9.628969192504883,tgs (tokens per gpu per second)=4427.46,lr=4.4e-06,loss_scale=65536.0,grad_norm=5.503176552110271,micro_num=4,num_consumed_tokens=2752512,inf_nan_skip_batches=0,num_samples_in_batch=69,largest_length=915,largest_batch=20,smallest_batch=15,adam_beta2=0.95,fwd_bwd_time=3.55
2023-07-04 21:40:28,899 INFO train.py:318 in record_current_batch_training_metrics -- step=21,loss=9.690847396850586,tgs (tokens per gpu per second)=4464.18,lr=4.6e-06,loss_scale=65536.0,grad_norm=5.5336643273197526,micro_num=4,num_consumed_tokens=2883584,inf_nan_skip_batches=0,num_samples_in_batch=66,largest_length=870,largest_batch=17,smallest_batch=16,adam_beta2=0.95,fwd_bwd_time=3.52
2023-07-04 21:40:32,629 INFO train.py:318 in record_current_batch_training_metrics -- step=22,loss=9.61986255645752,tgs (tokens per gpu per second)=4393.28,lr=4.800000000000001e-06,loss_scale=65536.0,grad_norm=9.01168869536059,micro_num=4,num_consumed_tokens=3014656,inf_nan_skip_batches=0,num_samples_in_batch=65,largest_length=1151,largest_batch=20,smallest_batch=14,adam_beta2=0.95,fwd_bwd_time=3.57
2023-07-07 12:26:58,293 INFO launch.py:228 in launch -- Distributed environment is initialized, data parallel size: 8, pipeline parallel size: 1, tensor parallel size: 1
2023-07-07 12:26:58,293 INFO parallel_context.py:535 in set_seed -- initialized seed on rank 2, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
2023-07-07 12:26:58,295 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=0===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=5===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=1===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=6===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=7===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=2===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=4===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=3===========
2023-07-07 12:28:27,826 INFO hybrid_zero_optim.py:295 in _partition_param_list -- Number of elements on ranks: [907415552, 907411456, 910163968, 910163968, 921698304, 921698304, 921698304, 921698304], rank:0
2023-07-07 12:28:57,802 INFO train.py:323 in record_current_batch_training_metrics -- tflops=63.27010355651958,step=0,loss=11.634403228759766,tgs (tokens/gpu/second)=1424.64,lr=4.0000000000000003e-07,loss_scale=65536.0,grad_norm=63.672620777841004,micro_num=4,num_consumed_tokens=131072,inf_nan_skip_batches=0,num_samples_in_batch=19,largest_length=2048,largest_batch=5,smallest_batch=4,adam_beta2=0.95,fwd_bwd_time=6.48
2023-07-07 12:29:01,636 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.83371103277346,step=1,loss=11.613704681396484,tgs (tokens/gpu/second)=4274.45,lr=6.000000000000001e-07,loss_scale=65536.0,grad_norm=65.150786641452,micro_num=4,num_consumed_tokens=262144,inf_nan_skip_batches=0,num_samples_in_batch=16,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.67
2023-07-07 12:29:05,451 INFO train.py:323 in record_current_batch_training_metrics -- tflops=190.99928472960033,step=2,loss=11.490386962890625,tgs (tokens/gpu/second)=4300.69,lr=8.000000000000001e-07,loss_scale=65536.0,grad_norm=61.57798028719357,micro_num=4,num_consumed_tokens=393216,inf_nan_skip_batches=0,num_samples_in_batch=14,largest_length=2048,largest_batch=4,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.66
2023-07-07 12:29:09,307 INFO train.py:323 in record_current_batch_training_metrics -- tflops=188.8613541410694,step=3,loss=11.099515914916992,tgs (tokens/gpu/second)=4252.55,lr=1.0000000000000002e-06,loss_scale=65536.0,grad_norm=63.5478796484391,micro_num=4,num_consumed_tokens=524288,inf_nan_skip_batches=0,num_samples_in_batch=16,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.7
2023-07-07 12:29:13,147 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.65918563194305,step=4,loss=10.149517059326172,tgs (tokens/gpu/second)=4270.52,lr=1.2000000000000002e-06,loss_scale=65536.0,grad_norm=51.582841631508145,micro_num=4,num_consumed_tokens=655360,inf_nan_skip_batches=0,num_samples_in_batch=19,largest_length=2048,largest_batch=6,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.68
2023-07-07 12:29:16,994 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.3109313713174,step=5,loss=9.822169303894043,tgs (tokens/gpu/second)=4262.67,lr=1.4000000000000001e-06,loss_scale=65536.0,grad_norm=47.10386835560855,micro_num=4,num_consumed_tokens=786432,inf_nan_skip_batches=0,num_samples_in_batch=17,largest_length=2048,largest_batch=6,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.69
```

View File

@ -269,7 +269,7 @@ class NonPipelineScheduler(BaseScheduler):
if return_loss:
    loss += _loss
if return_output_label:
    outputs.append(_output)
    labels.append(_label)

View File

@ -89,7 +89,7 @@ def args_sanity_check():
data._add_item("valid_folder", None)
if gpc.is_rank_for_log():
logger.info("+++++++++++++++++++++++++++++++ Data Info +++++++++++++++++++++++++++++++")
logger.info("+" * 15 + " Data Info " + "+" * 15) # pylint: disable=W1201
logger.info(f"seq_len: {data.seq_len}")
logger.info(f"micro_num: {data.micro_num}")
logger.info(f"micro_bsz: {data.micro_bsz}")
@ -122,7 +122,7 @@ def args_sanity_check():
)
if gpc.is_rank_for_log():
logger.info("+++++++++++++++++++++++++++++++ Ckpt Info +++++++++++++++++++++++++++++++")
logger.info("+" * 15 + " Ckpt Info " + "+" * 15) # pylint: disable=W1201
logger.info(f"is enable save ckpt: {gpc.config.ckpt.enable_ckpt}")
logger.info(f"save_ckpt_folder: {gpc.config.ckpt.save_ckpt_folder}")
logger.info(f"checkpoint_every: {gpc.config.ckpt.checkpoint_every}")
@ -133,7 +133,7 @@ def args_sanity_check():
clip_grad_norm = gpc.config.hybrid_zero_optimizer.get("clip_grad_norm", 0.0)
if gpc.is_rank_for_log():
logger.info("+++++++++++++++++++++++++++++++ other Info +++++++++++++++++++++++++++++++")
logger.info("+" * 15 + " Other Info " + "+" * 15) # pylint: disable=W1201
logger.info(f"cudnn.benchmark: {torch.backends.cudnn.benchmark }")
logger.info(f"cudnn.deterministic: {torch.backends.cudnn.deterministic }")
logger.info(f"clip_grad_norm: {clip_grad_norm}")
@ -150,21 +150,20 @@ def args_sanity_check():
assert gpc.config.model.dtype in ["torch.float16", "torch.half", "torch.bfloat16"]
if gpc.is_rank_for_log():
logger.info("+++++++++++++++++++++++++++++++ Model Info +++++++++++++++++++++++++++++++")
logger.info("+" * 15 + " Model Info " + "+" * 15) # pylint: disable=W1201
logger.info(f"Model: {gpc.config.model}")
logger.info("+++++++++++++++++++++++++++++++ grad_scaler Info +++++++++++++++++++++++++++++++")
logger.info("+" * 15 + " grad_scaler Info " + "+" * 15) # pylint: disable=W1201
logger.info(f"grad_scaler: {gpc.config.grad_scaler}")
logger.info("+++++++++++++++++++++++++++++++ hybrid_zero_optimizer Info +++++++++++++++++++++++++++++++")
logger.info("+" * 15 + " hybrid_zero_optimizer Info " + "+" * 15) # pylint: disable=W1201
logger.info(f"hybrid_zero_optimizer: {gpc.config.hybrid_zero_optimizer}")
logger.info("+++++++++++++++++++++++++++++++ adam Info +++++++++++++++++++++++++++++++")
logger.info("+" * 15 + " adam Info " + "+" * 15) # pylint: disable=W1201
logger.info(f"adam: {gpc.config.adam}")
logger.info("+++++++++++++++++++++++++++++++ beta2_scheduler Info +++++++++++++++++++++++++++++++")
logger.info("+" * 15 + " beta2_scheduler Info " + "+" * 15) # pylint: disable=W1201
logger.info(f"beta2_scheduler: {gpc.config.beta2_scheduler}")
logger.info("+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
def launch(

View File

@ -1,4 +1,5 @@
transformers>=4.25.1
sentencepiece
numpy
tqdm
psutil

View File

@ -3,8 +3,8 @@
├── transformers # tools for adapting Hugging Face's transformers
│ ├── configuration_internlm.py # config adaptation tool
│ ├── modeling_internlm.py # model adaptation tool
│ └── tokenization_internlm.py # tokenizer adaptation tool
├── convert2hf.py # tool for adapting the model to Hugging Face
│ ├── tokenization_internlm.py # tokenizer adaptation tool
│ └── convert2hf.py # tool for adapting the model to Hugging Face
└── tokenizer.py # tool for converting raw data into bin and meta files
```

View File

@ -4,7 +4,7 @@ This directory provide some tools for model training with the following file str
│ ├── configuration_internlm.py # tools for adapting config
│ ├── modeling_internlm.py # tools for adapting model
│ └── tokenization_internlm.py # tools for adapting tokenizer
├── convert2hf.py # tools for adapting models to Hugging Face's format
│ └── convert2hf.py # tools for adapting models to Hugging Face's format
└── tokenizer.py # tools for generating `bin` and `meta` file for raw data
```

View File

@ -0,0 +1,26 @@
# InternLM Transformers
[English](./README.md) |
[简体中文](./README-zh-Hans.md)
This folder contains the `InternLM` model in transformers format.
## Weight Conversion

`convert2hf.py` can convert weights saved during training into the transformers format with one command. Run the following command in the repository root:
```bash
python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ../V7_sft.model
```
Then the model can be loaded with the `from_pretrained` interface:
```python
from modeling_internlm import InternLMForCausalLM
model = InternLMForCausalLM.from_pretrained("hf_ckpt/")
```
`intern_moss_example.py` shows an example of using LoRA to fine-tune on the `fnlp/moss-moon-002-sft` dataset.

View File

@ -1,16 +1,19 @@
# InternLM Transformers
This folder contains the `InternLM` model in transformers format.
[English](./README.md) |
[简体中文](./README-zh-Hans.md)
## Weight Conversion
This folder contains the `InternLM` model in transformers format.
`convert2hf.py` can convert weights saved during training into the transformers format with one command. Run it in the root directory:
## Weight Conversion
`convert2hf.py` can convert saved training weights into the transformers format with a single command. Execute the command in the root directory of the repository:
```bash
python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer tokenizes/tokenizer.model
python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ../V7_sft.model
```
Then it can be loaded using the `from_pretrained` interface:
Then, you can load it using the `from_pretrained` interface:
```python
from modeling_internlm import InternLMForCausalLM
@ -18,5 +21,4 @@ from modeling_internlm import InternLMForCausalLM
model = InternLMForCausalLM.from_pretrained("hf_ckpt/")
```
`moss_example.py` shows an example of using LoRA to fine-tune on the `fnlp/moss-moon-002-sft` dataset.
`intern_moss_example.py` demonstrates an example of how to use LoRA for fine-tuning on the `fnlp/moss-moon-002-sft` dataset.