mirror of https://github.com/InternLM/InternLM
[Develop] Pull Main Branch (#121)
* fix/fix_submodule_err (#61)
* fix issue templates (#65)
* fix(tokenizer): refactor tokenizer and update usage in readme (#51); update tokenizer example
* fix(readme, requirements): fix a typo in the Chinese readme and select a lower version of transformers so that InternLMTokenizer can be found (#73)
* [Doc] Add WeChat and Discord links in readme (#78)
* [Docs] Add Japanese README (#43)
* add repetition_penalty to GenerationConfig in web_demo.py (#48)
* use fp16 in instruction (#80)
* [Enhancement] add more options for issue template (#77); update question icon; fix link
* Use tempfile for convert2hf.py (#23); fixes https://github.com/InternLM/InternLM/issues/50
* delete torch_dtype from the README's example code (#100)
* set the value of repetition_penalty to 1.0 to avoid random outputs (#99)
* Update web_demo.py (#97): remove a meaningless log
* [Fix] Fix wrong string cutoff in the script for SFT text tokenizing (#106)

Co-authored-by: ChenQiaoling00 <qiaoling_chen@u.nus.edu>, Kai Chen <chenkaidev@gmail.com>, Yang Gao <Gary1546308416AL@gmail.com>, Changjiang GOU <gouchangjiang@gmail.com>, gouhchangjiang <gouhchangjiang@gmail.com>, vansin <msnode@163.com>, Ikko Eltociear Ashimine <eltociear@gmail.com>, YWMditto <46778265+YWMditto@users.noreply.github.com>, YWMditto <862779238@qq.com>, WRH <12756472+wangruohui@users.noreply.github.com>, liukuikun <24622904+Harold-lkk@users.noreply.github.com>, x54-729 <45304952+x54-729@users.noreply.github.com>, Shuo Zhang <zhangshuolove@live.com>, Miao Zheng <76149310+MeowZheng@users.noreply.github.com>
parent 0d3d27cdf4
commit e0d6a3f84f
@ -3,6 +3,13 @@ description: Create a report to help us improve
|
|||
labels: ["bug"]
|
||||
title: "[Bug] "
|
||||
body:
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
## Note
|
||||
For general usage questions or idea discussions, please post them in our [**Forum**](https://github.com/InternLM/InternLM/discussions).
|
||||
Please fill in as **much** of the following form as you're able to. **The clearer the description, the less time it will take to solve it.**
|
||||
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
|
@ -55,3 +62,11 @@ body:
|
|||
|
||||
1. Did you make any modifications on the code or config?
|
||||
2. What do you think might be the reason?
|
||||
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
## Acknowledgement
|
||||
Thanks for taking the time to fill out this report.
|
||||
|
||||
If you have already identified the reason, we strongly appreciate you creating a new PR to fix it [**here**](https://github.com/InternLM/InternLM/pulls)!
|
||||
|
|
|
@ -0,0 +1,24 @@
|
|||
name: "❔ Common Questions"
|
||||
description: Ask a question about usage or start an idea discussion.
|
||||
labels: [ "question" ]
|
||||
title: "[QA] "
|
||||
|
||||
|
||||
body:
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
## Note
|
||||
For general usage questions or idea discussions, please post them in our [**Forum**](https://github.com/InternLM/InternLM/discussions).
|
||||
Please fill in as **much** of the following form as you're able to. **The clearer the description, the less time it will take to solve it.**
|
||||
|
||||
- type: textarea
|
||||
id: describe
|
||||
validations:
|
||||
required: true
|
||||
attributes:
|
||||
label: Describe the question.
|
||||
description: |
|
||||
What is your question? Please provide a clear and concise description of it.
|
||||
placeholder: |
|
||||
A clear and concise description of the question.
|
|
@ -3,6 +3,13 @@ description: Suggest an idea for this project
|
|||
labels: ["enhancement"]
|
||||
title: "[Feature] "
|
||||
body:
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
## Note
|
||||
For general usage questions or idea discussions, please post them in our [**Forum**](https://github.com/InternLM/InternLM/discussions).
|
||||
Please fill in as **much** of the following form as you're able to. **The clearer the description, the less time it will take to solve it.**
|
||||
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
|
@ -27,3 +34,11 @@ body:
|
|||
label: Will you implement it?
|
||||
options:
|
||||
- label: I would like to implement this feature and create a PR!
|
||||
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
## Acknowledgement
|
||||
Thanks for taking the time to fill out this report.
|
||||
|
||||
If you have already identified the reason, we strongly appreciate you creating a new PR to implement it [**here**](https://github.com/InternLM/InternLM/pulls)!
|
|
@ -0,0 +1,35 @@
|
|||
name: 📚 Documentation
|
||||
description: Report an issue related to the documentation.
|
||||
labels: "documentation"
|
||||
title: "[Docs] "
|
||||
|
||||
body:
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
## Note
|
||||
For general usage questions or idea discussions, please post them in our [**Forum**](https://github.com/InternLM/InternLM/discussions).
|
||||
Please fill in as **much** of the following form as you're able to. **The clearer the description, the less time it will take to solve it.**
|
||||
|
||||
|
||||
- type: textarea
|
||||
attributes:
|
||||
label: 📚 The doc issue
|
||||
description: >
|
||||
A clear and concise description of the issue.
|
||||
validations:
|
||||
required: true
|
||||
|
||||
- type: textarea
|
||||
attributes:
|
||||
label: Suggest a potential alternative/fix
|
||||
description: >
|
||||
Tell us how we could improve the documentation in this regard.
|
||||
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
## Acknowledgement
|
||||
Thanks for taking the time to fill out this report.
|
||||
|
||||
If you have already identified the reason, we strongly appreciate you creating a new PR to fix it [**here**](https://github.com/InternLM/InternLM/pulls)!
|
|
@ -3,6 +3,13 @@ description: 报告你在使用中遇到的不合预期的情况
|
|||
labels: ["bug"]
|
||||
title: "[Bug] "
|
||||
body:
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
## 注意
|
||||
对于一般的使用问题或者想法讨论,请到我们的[**论坛**](https://github.com/InternLM/InternLM/discussions)发帖讨论。
|
||||
请尽可能填写以下表格。**描述得越清楚,解决问题的时间就越短。**
|
||||
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
|
@ -56,3 +63,11 @@ body:
|
|||
|
||||
1. 你是否对代码或配置文件做了任何改动?
|
||||
2. 你认为可能的原因是什么?
|
||||
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
## 致谢
|
||||
感谢你的反馈!
|
||||
|
||||
如果你已经找到了错误的原因,我们非常欢迎你直接创建一个新的 PR 来解决这个问题 [**here**](https://github.com/InternLM/InternLM/pulls)!
|
|
@ -0,0 +1,25 @@
|
|||
name: "❔ 常见问题"
|
||||
description: 提出一个关于使用或者想法讨论的问题。
|
||||
labels: [ "question" ]
|
||||
title: "[QA] "
|
||||
|
||||
|
||||
body:
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
## 注意
|
||||
对于一般的使用问题或者想法讨论,请到我们的[**论坛**](https://github.com/InternLM/InternLM/discussions)发帖讨论。
|
||||
请尽可能填写以下表格。**描述得越清楚,解决问题的时间就越短。**
|
||||
|
||||
- type: textarea
|
||||
id: describe
|
||||
validations:
|
||||
required: true
|
||||
attributes:
|
||||
label: 描述问题
|
||||
description: |
|
||||
What is your question? Please provide a clear and concise description of it.
|
||||
你的问题是什么?请提供一个清晰简明的问题描述。
|
||||
placeholder: |
|
||||
一个清晰简明的问题描述。
|
|
@ -3,6 +3,12 @@ description: 建议一项新的功能
|
|||
labels: ["enhancement"]
|
||||
title: "[Feature] "
|
||||
body:
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
## 注意
|
||||
对于一般的使用问题或者想法讨论,请到我们的[**论坛**](https://github.com/InternLM/InternLM/discussions)发帖讨论。
|
||||
请尽可能填写以下表格。**描述得越清楚,解决问题的时间就越短。**
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
|
@ -29,3 +35,11 @@ body:
|
|||
label: 是否希望自己实现该功能?
|
||||
options:
|
||||
- label: 我希望自己来实现这一功能,并向 InternLM 贡献代码!
|
||||
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
## 致谢
|
||||
感谢你的反馈!
|
||||
|
||||
如果您自己可以实现该功能,我们非常欢迎你直接创建一个新的 PR [**here**](https://github.com/InternLM/InternLM/pulls)!
|
|
@ -0,0 +1,35 @@
|
|||
name: 📚 文档
|
||||
description: 提出文档中的错误或者提出文档改进的建议
|
||||
labels: "documentation"
|
||||
title: "[Docs] "
|
||||
|
||||
body:
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
## 注意
|
||||
对于一般的使用问题或者想法讨论,请到我们的[**论坛**](https://github.com/InternLM/InternLM/discussions)发帖讨论。
|
||||
请尽可能填写以下表格。**描述得越清楚,解决问题的时间就越短。**
|
||||
|
||||
|
||||
- type: textarea
|
||||
attributes:
|
||||
label: 📚 描述文档中的错误
|
||||
description: >
|
||||
请简要说明你在文档中发现的错误
|
||||
validations:
|
||||
required: true
|
||||
|
||||
- type: textarea
|
||||
attributes:
|
||||
label: 📚 描述文档中的改进建议
|
||||
description: >
|
||||
请简要说明你对文档的改进建议
|
||||
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
## 致谢
|
||||
感谢你的反馈!
|
||||
|
||||
如果你已经找到了错误的原因,我们非常欢迎你直接创建一个新的 PR 来解决这个问题 [**here**](https://github.com/InternLM/InternLM/pulls)!
|
|
@ -1,12 +1,9 @@
|
|||
blank_issues_enabled: false
|
||||
|
||||
contact_links:
|
||||
- name: 📚 InternLM Documentation (官方文档)
|
||||
url: https://internlm.readthedocs.io/en/latest/
|
||||
about: Check if your question is answered in docs
|
||||
- name: 💬 General questions (寻求帮助)
|
||||
url: https://github.com/InternLM/InternLM/discussions
|
||||
about: Ask general usage questions and discuss with other InternLM community members
|
||||
- name: 🌐 Explore InternLM (官网)
|
||||
url: https://https://internlm.org/
|
||||
url: https://internlm.org/
|
||||
about: Get to know more about InternLM
|
||||
|
|
|
@ -0,0 +1,197 @@
|
|||
# InternLM
|
||||
|
||||
<div align="center">
|
||||
|
||||
<img src="./doc/imgs/logo.svg" width="200"/>
|
||||
<div> </div>
|
||||
<div align="center">
|
||||
<b><font size="5">InternLM</font></b>
|
||||
<sup>
|
||||
<a href="https://internlm.intern-ai.org.cn/">
|
||||
<i><font size="4">HOT</font></i>
|
||||
</a>
|
||||
</sup>
|
||||
<div> </div>
|
||||
</div>
|
||||
|
||||
[license](./LICENSE)
|
||||
[evaluation](https://github.com/internLM/OpenCompass/)
|
||||
|
||||
[📘使用法](./doc/en/usage.md) |
|
||||
[🛠️インストール](./doc/en/install.md) |
|
||||
[📊トレーニングパフォーマンス](./doc/en/train_performance.md) |
|
||||
[👀モデル](#model-zoo) |
|
||||
[🆕更新ニュース](./CHANGE_LOG.md) |
|
||||
[🤔Issues 報告](https://github.com/InternLM/InternLM/issues/new)
|
||||
|
||||
[English](./README.md) |
|
||||
[简体中文](./README-zh-Hans.md) |
|
||||
[日本語](./README-ja-JP.md)
|
||||
|
||||
</div>
|
||||
|
||||
## はじめに
|
||||
|
||||
InternLM は、70 億のパラメータを持つベースモデルと、実用的なシナリオに合わせたチャットモデルをオープンソース化しています。このモデルには以下の特徴があります:
|
||||
|
||||
- 何兆もの高品質なトークンをトレーニングに活用し、強力な知識ベースを確立します。
|
||||
- 8k のコンテキストウィンドウ長をサポートし、より長い入力シーケンスと強力な推論機能を可能にする。
|
||||
- ユーザが独自のワークフローを柔軟に構築できるよう、汎用性の高いツールセットを提供します。
|
||||
|
||||
さらに、大規模な依存関係を必要とせずにモデルの事前学習をサポートする軽量な学習フレームワークが提供されます。単一のコードベースで、数千の GPU を持つ大規模クラスタでの事前学習と、単一の GPU での微調整をサポートし、顕著な性能最適化を達成します。InternLM は、1024GPU でのトレーニングにおいて 90% 近いアクセラレーション効率を達成しています。
|
||||
|
||||
## InternLM-7B
|
||||
|
||||
### パフォーマンス評価
|
||||
|
||||
オープンソースの評価ツール [OpenCompass](https://github.com/internLM/OpenCompass/) を用いて、InternLM の総合的な評価を行った。この評価では、分野別能力、言語能力、知識能力、推論能力、理解能力の 5 つの次元をカバーしました。以下は評価結果の一部であり、その他の評価結果については [OpenCompass leaderboard](https://opencompass.org.cn/rank) をご覧ください。
|
||||
|
||||
| データセット\モデル | **InternLM-Chat-7B** | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
|
||||
| ---------------- | -------------------------- | --------------------- | -------- | ----------- | ----------- | --------- | --------- |
|
||||
| C-Eval(Val) | 53.2 | 53.4 | 24.2 | 42.7 | 50.9 | 28.9 | 31.2 |
|
||||
| MMLU | 50.8 | 51.0 | 35.2* | 41.5 | 46.0 | 39.7 | 47.3 |
|
||||
| AGIEval | 42.5 | 37.6 | 20.8 | 24.6 | 39.0 | 24.1 | 26.4 |
|
||||
| CommonSenseQA | 75.2 | 59.5 | 65.0 | 58.8 | 60.0 | 68.7 | 66.7 |
|
||||
| BUSTM | 74.3 | 50.6 | 48.5 | 51.3 | 55.0 | 48.8 | 62.5 |
|
||||
| CLUEWSC | 78.6 | 59.1 | 50.3 | 52.8 | 59.8 | 50.3 | 52.2 |
|
||||
| MATH | 6.4 | 7.1 | 2.8 | 3.0 | 6.6 | 2.2 | 2.8 |
|
||||
| GSM8K | 34.5 | 31.2 | 10.1 | 9.7 | 29.2 | 6.0 | 15.3 |
|
||||
| HumanEval | 14.0 | 10.4 | 14.0 | 9.2 | 9.2 | 9.2 | 11.0 |
|
||||
| RACE(High) | 76.3 | 57.4 | 46.9* | 28.1 | 66.3 | 40.7 | 54.0 |
|
||||
|
||||
- 評価結果は [OpenCompass 20230706](https://github.com/internLM/OpenCompass/) (*印のあるデータは原著論文からの引用を意味する)から取得したもので、評価設定は [OpenCompass](https://github.com/internLM/OpenCompass/) が提供する設定ファイルに記載されています。
|
||||
- 評価データは、[OpenCompass](https://github.com/internLM/OpenCompass/) のバージョンアップにより数値的な差異が生じる可能性がありますので、[OpenCompass](https://github.com/internLM/OpenCompass/) の最新の評価結果をご参照ください。
|
||||
|
||||
### Model Zoo
|
||||
|
||||
InternLM 7B と InternLM 7B チャットは、InternLM を使って訓練され、オープンソース化されています。モデルの重みは 2 つのフォーマットで提供されています。Transformers フォーマットを使ってモデルをロードするだけでなく、InternLM を使って直接重みをロードして、さらに事前トレーニングや人間の好みアライメントトレーニングを行うこともできます。
|
||||
|
||||
| モデル | InternLM フォーマット Weight ダウンロードリンク | Transformers フォーマット Weight ダウンロードリンク |
|
||||
| ----------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------- |
|
||||
| **InternLM 7B** | [Open in OpenXLab](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-7b) | [🤗internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) |
|
||||
| **InternLM Chat 7B** | [Open in OpenXLab](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b) | [🤗internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) |
|
||||
| **InternLM Chat 7B 8k** | [Open in OpenXLab](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b-8k) | [🤗internlm/internlm-chat-7b-8k](https://huggingface.co/internlm/internlm-chat-7b-8k) |
|
||||
|
||||
**制限事項:** 学習過程におけるモデルの安全性を確保し、倫理的・法的要件に準拠したテキストを生成するようモデルに促す努力を行ってきたが、モデルのサイズと確率的生成パラダイムのため、モデルは依然として予期せぬ出力を生成する可能性がある。例えば、生成された回答には偏見や差別、その他の有害な内容が含まれている可能性があります。そのような内容を伝播しないでください。有害な情報の伝播によって生じるいかなる結果に対しても、私たちは責任を負いません。
|
||||
|
||||
### Transformers からのインポート
|
||||
|
||||
Transformers を使用して InternLM 7B チャットモデルをロードするには、以下のコードを使用します:
|
||||
|
||||
```python
|
||||
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True)
|
||||
>>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True).cuda()
|
||||
>>> model = model.eval()
|
||||
>>> response, history = model.chat(tokenizer, "こんにちは", history=[])
|
||||
>>> print(response)
|
||||
こんにちは!どのようにお手伝いできますか?
|
||||
>>> response, history = model.chat(tokenizer, "時間管理について3つの提案をお願いします", history=history)
|
||||
>>> print(response)
|
||||
もちろんです!以下に簡潔な形で時間管理に関する3つの提案を示します。
|
||||
|
||||
1. To-Doリストを作成し、優先順位を付ける: タスクを明確にリストアップし、それぞれの優先度を判断しましょう。重要で緊急なタスクから順に取り組むことで、効率的に作業を進めることができます。
|
||||
2. 時間のブロック化を実践する: 作業を特定の時間枠に集中させるため、時間をブロック化しましょう。例えば、朝の2時間をメール対応に割り当て、午後の3時間をプロジェクトに集中するなど、タスクごとに時間を確保することが効果的です。
|
||||
3. ディストラクションを排除する: 集中力を保つために、ディストラクションを最小限に抑えましょう。通知をオフにし、SNSやメールに気を取られないようにすることで、作業効率を向上させることができます。
|
||||
|
||||
これらの提案を実践することで、時間管理のスキルを向上させ、効果的に日々のタスクをこなしていくことができます。
|
||||
```
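上記の `chat()` ヘルパーのほかに、標準の transformers `generate()` API でもモデルを動かせます。以下は標準的な transformers の呼び出しだけを使った最小限のスケッチです。サンプリングパラメータは説明用の仮の値であり、プロジェクトの推奨値ではありません:

```python
>>> inputs = tokenizer("こんにちは", return_tensors="pt").to(model.device)
>>> outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.8, top_p=0.8)
>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```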
|
||||
|
||||
### 対話
|
||||
|
||||
以下のコードを実行することで、フロントエンドインターフェースを通して InternLM Chat 7B モデルと対話することができます:
|
||||
|
||||
```bash
|
||||
pip install streamlit==1.24.0
|
||||
pip install transformers==4.30.2
|
||||
streamlit run web_demo.py
|
||||
```
|
||||
|
||||
その効果は以下の通り
|
||||
|
||||

|
||||
|
||||
### デプロイ
|
||||
|
||||
[LMDeploy](https://github.com/InternLM/LMDeploy) を使って、InternLM をワンクリックでデプロイする。
|
||||
|
||||
1. まず、LMDeploy をインストールする:
|
||||
|
||||
```
|
||||
python3 -m pip install lmdeploy
|
||||
```
|
||||
|
||||
2. クイックデプロイには以下のコマンドを使用します:
|
||||
|
||||
```
|
||||
python3 -m lmdeploy.serve.turbomind.deploy InternLM-7B /path/to/internlm-7b/model hf
|
||||
```
|
||||
|
||||
3. モデルをエクスポートした後、以下のコマンドを使ってサーバーを起動し、デプロイされたモデルと会話することができます:
|
||||
|
||||
```
|
||||
python3 -m lmdeploy.serve.client {server_ip_address}:33337
|
||||
```
|
||||
|
||||
[LMDeploy](https://github.com/InternLM/LMDeploy) は、InternLM をデプロイするための完全なワークフローを提供します。InternLM のデプロイの詳細については、[デプロイチュートリアル](https://github.com/InternLM/LMDeploy)を参照してください。
|
||||
|
||||
## ファインチューニングとトレーニング
|
||||
|
||||
### プリトレーニングとファインチューニングのチュートリアル
|
||||
|
||||
InternLMのインストール、データ処理、プレトレーニング、ファインチューニングを始めるには、[使用法チュートリアル](./doc/ja/usage.md)を参照してください。
|
||||
|
||||
### Transformers フォーマットへの変換
|
||||
|
||||
InternLM によって学習されたモデルは、コミュニティの様々なオープンソースプロジェクトとシームレスにドッキングするのに便利な Hugging Face Transformers 形式に簡単に変換することができます。`tools/transformers/convert2hf.py` の助けを借りて、トレーニング中に保存された weights は 1 つのコマンドで transformers 形式に変換することができます
|
||||
|
||||
```bash
|
||||
python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ./tools/V7_sft.model
|
||||
```
|
||||
|
||||
変換後、以下のコードで transformers として読み込むことができます
|
||||
|
||||
```python
|
||||
>>> from transformers import AutoTokenizer, AutoModel
|
||||
>>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
|
||||
```
|
||||
|
||||
## トレーニングシステム
|
||||
|
||||
### システムアーキテクチャ
|
||||
|
||||
詳細については、[システムアーキテクチャドキュメント](./doc/ja/structure.md) を参照してください。
|
||||
|
||||
### トレーニングパフォーマンス
|
||||
|
||||
InternLM は、Flash-Attention、Apex その他の高性能モデルオペレータを深く統合し、トレーニング効率を向上させます。Hybrid Zero 技術を構築することで、計算と通信の効率的なオーバーラップを実現し、トレーニング中のノード間の通信トラフィックを大幅に削減します。InternLM は 7B モデルを 8GPU から 1024GPU まで拡張することをサポートし、1000GPU スケールで最大 90% のアクセラレーション効率、180TFLOPS 以上のトレーニングスループット、GPU あたり平均 3600 トークン/秒以上を実現します。次の表は、異なる構成における InternLM のスケーラビリティテストデータです:
|
||||
|
||||
| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
|
||||
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
|
||||
| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
|
||||
| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
|
||||
|
||||
TGSは、GPUあたり1秒間に処理されるトークンの平均数を表します。パフォーマンステストデータの詳細については、[トレーニングパフォーマンスドキュメント](./doc/ja/train_performance.md)を参照してください。
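これらのスループット値のおおまかな検算として、1 トークンあたりのモデル FLOPs を「6 × パラメータ数」とする一般的な近似で、TGS から TFLOPS を見積もることができます。これは attention の FLOPs や activation recomputation を無視した概算のスケッチであり、表の値をわずかに過小評価します:

```python
# Back-of-envelope estimate: model FLOPs per token ~ 6 * N_params (forward + backward),
# ignoring attention terms and activation recomputation.
n_params = 7e9   # InternLM-7B
tgs = 4078       # tokens/GPU/second at 8 GPUs, from the table above
approx_tflops = 6 * n_params * tgs / 1e12
print(f"~{approx_tflops:.0f} TFLOPS per GPU")  # ~171, versus the 193 reported above
```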
|
||||
|
||||
## コントリビュート
|
||||
|
||||
我々は、InternLM を改善し、向上させるために尽力してくれたすべての貢献者に感謝している。コミュニティ・ユーザーのプロジェクトへの参加が強く推奨されます。プロジェクトへの貢献方法については、貢献ガイドラインを参照してください。
|
||||
|
||||
## 謝辞
|
||||
|
||||
InternLM コードベースは、上海 AI 研究所と様々な大学や企業の研究者によって貢献されたオープンソースプロジェクトです。プロジェクトに新機能を追加してくれたすべての貢献者と、貴重なフィードバックを提供してくれたユーザーに感謝したい。私たちは、このツールキットとベンチマークが、InternLM をファインチューニングし、独自のモデルを開発するための柔軟で効率的なコードツールをコミュニティに提供し、オープンソースコミュニティに継続的に貢献できることを願っています。2 つのオープンソースプロジェクト、[flash-attention](https://github.com/HazyResearch/flash-attention) と [ColossalAI](https://github.com/hpcaitech/ColossalAI) に感謝します。
|
||||
|
||||
## ライセンス
|
||||
|
||||
コードは Apache-2.0 でライセンスされており、モデルの重みは学術研究のために完全にオープンで、**無料** の商用利用も許可されています。商用ライセンスの申請は、[申請フォーム(英語)](https://wj.qq.com/s2/12727483/5dba/)/[申请表(中文)](https://wj.qq.com/s2/12725412/f7c1/)にご記入ください。その他のご質問やコラボレーションについては、<internlm@pjlab.org.cn> までご連絡ください。
|
||||
|
||||
## 引用
|
||||
|
||||
```
|
||||
@misc{2023internlm,
|
||||
title={InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities},
|
||||
author={InternLM Team},
|
||||
howpublished = {\url{https://github.com/InternLM/InternLM}},
|
||||
year={2023}
|
||||
}
|
||||
```
|
|
@ -26,15 +26,20 @@
|
|||
[🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new)
|
||||
|
||||
[English](./README.md) |
|
||||
[简体中文](./README-zh-Hans.md)
|
||||
[简体中文](./README-zh-Hans.md) |
|
||||
[日本語](./README-ja-JP.md)
|
||||
|
||||
</div>
|
||||
|
||||
<p align="center">
|
||||
👋 加入我们的 <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> 和 <a href="https://github.com/InternLM/InternLM/assets/25839884/a6aad896-7232-4220-ac84-9e070c2633ce" target="_blank">微信社区</a>
|
||||
</p>
|
||||
|
||||
## 简介
|
||||
|
||||
InternLM ,即书生·浦语大模型,包含面向实用场景的70亿参数基础模型与对话模型 (InternLM-7B)。模型具有以下特点:
|
||||
|
||||
- 使用上万亿高质量预料,建立模型超强知识体系;
|
||||
- 使用上万亿高质量语料,建立模型超强知识体系;
|
||||
- 支持8k语境窗口长度,实现更长输入与更强推理体验;
|
||||
- 通用工具调用能力,支持用户灵活自助搭建流程;
|
||||
|
||||
|
@ -47,7 +52,7 @@ InternLM ,即书生·浦语大模型,包含面向实用场景的70亿参数
|
|||
我们使用开源评测工具 [OpenCompass](https://github.com/internLM/OpenCompass/) 从学科综合能力、语言能力、知识能力、推理能力、理解能力五大能力维度对InternLM开展全面评测,部分评测结果如下表所示,欢迎访问[OpenCompass 榜单](https://opencompass.org.cn/rank)获取更多的评测结果。
|
||||
|
||||
| 数据集\模型 | **InternLM-Chat-7B** | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
|
||||
| -------------------- | --------------------- | ---------------- | --------- | --------- | ------------ | --------- | ---------- |
|
||||
| -------------------- | --------------------- | ---------------- | --------- | --------- | ------------ | --------- | ---------- |
|
||||
| C-Eval(Val) | 53.2 | 53.4 | 24.2 | 42.7 | 50.9 | 28.9 | 31.2 |
|
||||
| MMLU | 50.8 | 51.0 | 35.2* | 41.5 | 46.0 | 39.7 | 47.3 |
|
||||
| AGIEval | 42.5 | 37.6 | 20.8 | 24.6 | 39.0 | 24.1 | 26.4 |
|
||||
|
@ -140,10 +145,10 @@ streamlit run web_demo.py
|
|||
|
||||
### 转换为 Transformers 格式使用
|
||||
|
||||
通过 InternLM 进行训练的模型可以很轻松地转换为 HuggingFace Transformers 格式,方便与社区各种开源项目无缝对接。借助 `tools/convert2hf.py` 可以将训练保存的权重一键转换为 transformers 格式
|
||||
通过 InternLM 进行训练的模型可以很轻松地转换为 HuggingFace Transformers 格式,方便与社区各种开源项目无缝对接。借助 `tools/transformers/convert2hf.py` 可以将训练保存的权重一键转换为 transformers 格式
|
||||
|
||||
```bash
|
||||
python convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer tokenizes/tokenizer.model
|
||||
python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ./tools/V7_sft.model
|
||||
```
|
||||
|
||||
转换之后可以通过以下的代码加载为 transformers
|
||||
|
|
README.md
|
@ -26,10 +26,19 @@
|
|||
[🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new)
|
||||
|
||||
[English](./README.md) |
|
||||
[简体中文](./README-zh-Hans.md)
|
||||
[简体中文](./README-zh-Hans.md) |
|
||||
[日本語](./README-ja-JP.md)
|
||||
|
||||
</div>
|
||||
|
||||
<p align="center">
|
||||
👋 join us on <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://github.com/InternLM/InternLM/assets/25839884/a6aad896-7232-4220-ac84-9e070c2633ce" target="_blank">WeChat</a>
|
||||
</p>
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
## Introduction
|
||||
|
||||
InternLM has open-sourced a 7 billion parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:
|
||||
|
@ -143,10 +152,10 @@ Please refer to [Usage Tutorial](./doc/en/usage.md) to start InternLM installati
|
|||
|
||||
### Convert to Transformers Format
|
||||
|
||||
The model trained by InternLM can be easily converted to HuggingFace Transformers format, which is convenient for seamless docking with various open source projects in the community. With the help of `tools/convert2hf.py`, the weights saved during training can be converted into transformers format with one command
|
||||
The model trained by InternLM can be easily converted to the HuggingFace Transformers format, which makes it convenient to integrate seamlessly with various open-source projects in the community. With the help of `tools/transformers/convert2hf.py`, the weights saved during training can be converted into the transformers format with one command:
|
||||
|
||||
```bash
|
||||
python convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer tokenizes/tokenizer.model
|
||||
python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ./tools/V7_sft.model
|
||||
```
|
||||
|
||||
After conversion, it can be loaded as transformers by the following code
|
||||
|
|
|
@ -8,7 +8,8 @@ The required packages and corresponding version are shown as follows:
|
|||
- CUDA == 11.7
|
||||
- Pytorch == 1.13.1+cu117
|
||||
- Transformers >= 4.25.1
|
||||
- Flash-Attention == 23.05
|
||||
- Flash-Attention == v1.0.5
|
||||
- Apex == 23.05
|
||||
- GPU with Ampere or Hopper architecture (such as H100, A100)
|
||||
- Linux OS
|
||||
|
||||
|
|
|
@ -8,16 +8,16 @@ Please refer to the [installation guide](./install.md) for instructions on how t
|
|||
|
||||
### Dataset Preparation (Pre-training)
|
||||
|
||||
The dataset for InternLM training consists of a series of `bin` and `meta` files. To generate the training dataset, you need to use the `tokenizer` tool to tokenize the raw text data. The tokenizer model can be imported by specifying the model path in the `tools/tokenizer.py` script. The current provided model is `V7.model`. If you want to use a different model, you can modify the model path directly in the `tokenizer.py` script.
|
||||
The dataset for the InternLM training task includes a series of `bin` and `meta` files. A `tokenizer` is used to generate the training dataset from the original text files. The tokenizer model is imported by specifying the model parameter path in `tools/tokenizer.py`. Currently, `V7_sft.model` is provided to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.
|
||||
|
||||
You can generate the `bin` and `meta` files for your raw data by running the following command, where the `raw_data_name` parameter represents the name of your raw data file, `input_file_type` represents the format of your raw data file (currently supports `txt`, `json`, and `jsonl`), and `bin` represents the path to save the generated `bin` files.
|
||||
You can run the following command to generate `bin` and `meta` files corresponding to the original data. The parameter `text_input_path` represents the path of the original text data, currently supporting `txt`, `json`, and `jsonl` formats, while `bin_output_path` represents the save path of the generated `bin` files.
|
||||
|
||||
|
||||
```bash
|
||||
$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'txt' or 'json' or 'jsonl' --bin your_output_bin_path
|
||||
$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
|
||||
```
|
||||
|
||||
Here is an example of data processing (only the data processing example for the `txt` format is provided here, the data processing process for `json` and `jsonl` is exactly the same as for `txt`):
|
||||
Here is an example of data processing:
|
||||
|
||||
Given a file `raw_data.txt` containing the raw dataset, the raw dataset is shown below:
|
||||
|
||||
|
@ -30,7 +30,7 @@ Learn to be tolerant and understanding to establish truly harmonious interperson
|
|||
You can generate the `bin` and `meta` files by running the following command:
|
||||
|
||||
```bash
|
||||
$ python tools/tokenizer.py --raw_data_name raw_data --input_file_type 'text' --bin cn/output.bin
|
||||
$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
|
||||
```
|
||||
|
||||
It should be noted that the generated `bin` files need to be saved in one of the following directories: `cn`, `en`, `code`, `ja`, `ar`, or `kaoshi`, depending on the type of dataset.
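For reference, judging from the tokenizer scripts under `tools/` shown elsewhere in this change, each line of a generated `bin` file is a JSON object holding the token ids of one sample, and the companion `.meta` file is a saved NumPy array of `(byte_offset, token_count)` pairs, one row per line. The sketch below inspects such a pair under those assumptions; treat it as illustrative rather than a stable format specification:

```python
import json
import numpy as np

bin_path = "cn/output.bin"
meta = np.load(bin_path + ".meta")        # assumed layout: one (byte_offset, token_count) row per sample
with open(bin_path, "rb") as f:
    for offset, n_tokens in meta[:3]:     # peek at the first three samples
        f.seek(int(offset))
        sample = json.loads(f.readline()) # e.g. {"tokens": [...]}
        print(n_tokens, len(sample["tokens"]), sample["tokens"][:8])
```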
|
||||
|
@ -192,7 +192,7 @@ $ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python trai
|
|||
If you want to start distributed training on torch with 8 GPUs on a single node, use the following command:
|
||||
|
||||
```bash
|
||||
$ torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_sft.py
|
||||
$ torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_sft.py --launcher "torch"
|
||||
```
|
||||
|
||||
### Training Results
|
||||
|
@ -217,4 +217,4 @@ Taking the configuration of the demo training on a single machine with 8 GPUs on
|
|||
2023-07-07 12:29:09,307 INFO train.py:323 in record_current_batch_training_metrics -- tflops=188.8613541410694,step=3,loss=11.099515914916992,tgs (tokens/gpu/second)=4252.55,lr=1.0000000000000002e-06,loss_scale=65536.0,grad_norm=63.5478796484391,micro_num=4,num_consumed_tokens=524288,inf_nan_skip_batches=0,num_samples_in_batch=16,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.7
|
||||
2023-07-07 12:29:13,147 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.65918563194305,step=4,loss=10.149517059326172,tgs (tokens/gpu/second)=4270.52,lr=1.2000000000000002e-06,loss_scale=65536.0,grad_norm=51.582841631508145,micro_num=4,num_consumed_tokens=655360,inf_nan_skip_batches=0,num_samples_in_batch=19,largest_length=2048,largest_batch=6,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.68
|
||||
2023-07-07 12:29:16,994 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.3109313713174,step=5,loss=9.822169303894043,tgs (tokens/gpu/second)=4262.67,lr=1.4000000000000001e-06,loss_scale=65536.0,grad_norm=47.10386835560855,micro_num=4,num_consumed_tokens=786432,inf_nan_skip_batches=0,num_samples_in_batch=17,largest_length=2048,largest_batch=6,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.69
|
||||
```
|
||||
```
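If you want to track these metrics programmatically (for example to plot `loss` or `tgs` curves), the comma-separated `key=value` pairs in each line can be pulled out with a few lines of Python. This is a sketch based on the sample log above; the exact log format may change between versions:

```python
import re

def parse_metrics(line: str) -> dict:
    # Extract "key=value" pairs such as tflops=188.8, step=3, loss=11.09 from one log line.
    pairs = re.findall(r"([\w\s()/]+)=([-+\w.e]+)", line)
    return {k.strip(): float(v) if re.match(r"^-?\d", v) else v for k, v in pairs}

sample = "tflops=188.86,step=3,loss=11.099,tgs (tokens/gpu/second)=4252.55,lr=1e-06"
print(parse_metrics(sample)["loss"])  # 11.099
```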
|
||||
|
|
|
@ -8,7 +8,8 @@
|
|||
- CUDA == 11.7
|
||||
- Pytorch == 1.13.1+cu117
|
||||
- Transformers >= 4.25.1
|
||||
- Flash-Attention == 23.05
|
||||
- Flash-Attention == v1.0.5
|
||||
- Apex == 23.05
|
||||
- Ampere或者Hopper架构的GPU (例如H100, A100)
|
||||
- Linux OS
|
||||
|
||||
|
|
doc/usage.md
|
@ -7,14 +7,14 @@
|
|||
|
||||
### 数据准备 (预训练)
|
||||
|
||||
InternLM训练任务的数据集包括一系列的`bin`和`meta`文件。使用`tokenizer`从原始文本文件生成训练用数据集。通过在`tools/tokenizer.py`中指定模型参数路径的方式来导入tokenizer模型。目前提供`V7.model`来生成tokens。若想使用不同的模型,可直接修改`tokernizer.py`中的模型参数路径。
|
||||
InternLM训练任务的数据集包括一系列的`bin`和`meta`文件。使用`tokenizer`从原始文本文件生成训练用数据集。通过在`tools/tokenizer.py`中指定模型参数路径的方式来导入tokenizer模型。目前提供`V7_sft.model`来生成tokens。若想使用不同的模型,可直接修改`tokenizer.py`中的模型参数路径。
|
||||
|
||||
可以运行以下命令生成原始数据对应的`bin`和`meta`文件,其中参数`raw_data_name`表示原始数据集的文件名称,`input_file_type`表示原始数据集的文件格式,目前支持`txt`、`json`和`jsonl`这三种格式,`bin`表示生成的`bin`文件的保存路径。
|
||||
可以运行以下命令生成原始数据对应的`bin`和`meta`文件,其中参数`text_input_path`表示原始文本数据路径,目前支持`txt`、`json`和`jsonl`三种输入格式,`bin_output_path`表示生成的`bin`文件的保存路径。
|
||||
```bash
|
||||
$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'txt' or 'json' or 'jsonl' --bin your_output_bin_path
|
||||
$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
|
||||
```
|
||||
|
||||
下面是一个数据处理的例子(这里只给出了`txt`格式的数据处理例子,`json`和`jsonl`的数据处理流程和`txt`的完全一致):
|
||||
下面是一个数据处理的例子:
|
||||
|
||||
给定一个包含原始数据集的文件`raw_data.txt`,原始数据集如下所示:
|
||||
```bash
|
||||
|
@ -25,7 +25,7 @@ $ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suff
|
|||
|
||||
可以通过运行以下命令来生成`bin`和`meta`文件:
|
||||
```bash
|
||||
$ python tools/tokenizer.py --raw_data_name raw_data --input_file_type 'text' --bin cn/output.bin
|
||||
$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
|
||||
```
|
||||
|
||||
需要注意的是,生成的`bin`文件需要保存在`cn`或者`en`或者`code`或者`ja`或者`ar`或者`kaoshi`这六个目录下,以区分数据集的类型。
|
||||
|
@ -175,7 +175,7 @@ $ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python trai
|
|||
|
||||
若在 torch 上启动分布式运行环境,单节点 8 卡的运行命令如下所示:
|
||||
```bash
|
||||
$ torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_sft.py
|
||||
$ torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_sft.py --launcher "torch"
|
||||
```
|
||||
|
||||
### 运行结果
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
transformers>=4.25.1
|
||||
transformers<4.30.0
|
||||
sentencepiece
|
||||
numpy
|
||||
tqdm
|
||||
|
|
|
@ -1 +1 @@
|
|||
Subproject commit 8ffc901e50bbf740fdb6d5bccb17f66a6ec8604e
|
||||
Subproject commit 0da3ffb92ee6fbe5336602f0e3989db1cd16f880
|
|
@ -1 +1 @@
|
|||
Subproject commit d2f4324f4c56e017fbf22dc421943793a8ca6c3b
|
||||
Subproject commit eff9fe6b8076df59d64d7a3f464696738a3c7c24
|
|
@ -9,14 +9,14 @@
|
|||
```
|
||||
|
||||
# tokenizer.py
|
||||
生成原始数据的`bin`和`meta`文件需要使用`tokenizer`,我们通过在`tools/tokenizer.py`中指定模型参数路径的方式来导入tokenizer模型。目前我们提供了`V7.model`来生成tokens。若想使用不同的模型,可直接修改`tokernizer.py`中的模型参数路径。
|
||||
生成原始数据的`bin`和`meta`文件需要使用`tokenizer`,我们通过在`tools/tokenizer.py`中指定模型参数路径的方式来导入tokenizer模型。目前我们提供了`V7_sft.model`来生成tokens。若想使用不同的模型,可直接修改`tokenizer.py`中的模型参数路径。
|
||||
|
||||
我们可以运行以下命令生成原始数据对应的`bin`和`meta`文件,其中参数`raw_data_name`表示原始数据集的文件名称,`input_file_type`表示原始数据集的文件格式,我们目前支持`txt`、`json`和`jsonl`这三种格式,`bin`表示生成的`bin`文件的保存路径。
|
||||
可以运行以下命令生成原始数据对应的`bin`和`meta`文件,其中参数`text_input_path`表示原始文本数据路径,目前支持`txt`、`json`和`jsonl`三种输入格式,`bin_output_path`表示生成的`bin`文件的保存路径。
|
||||
```bash
|
||||
$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'text' or 'json' or 'jsonl' --bin your_output_bin_path
|
||||
$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
|
||||
```
|
||||
|
||||
下面是一个数据处理的例子(这里只给出了`txt`格式的数据处理例子,`json`和`jsonl`的数据处理流程和`txt`的完全一致):
|
||||
下面是一个数据处理的例子:
|
||||
|
||||
给定一个包含原始数据集的文件`raw_data.txt`,原始数据集如下所示:
|
||||
```bash
|
||||
|
@ -25,9 +25,9 @@ $ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suff
|
|||
学会宽容和理解,才能建立真正和谐的人际关系。
|
||||
```
|
||||
|
||||
接下来,我们可以通过运行以下命令来生成`bin`和`meta`文件:
|
||||
可以通过运行以下命令来生成`bin`和`meta`文件:
|
||||
```bash
|
||||
$ python tools/tokenizer.py --raw_data_name raw_data --input_file_type 'text' --bin cn/output.bin
|
||||
$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
|
||||
```
|
||||
|
||||
需要注意的是,生成的`bin`文件需要保存在`cn`或者`en`或者`code`或者`ja`或者`ar`或者`kaoshi`这六个目录下,以区分数据集的类型。
|
||||
|
|
|
@ -11,12 +11,12 @@ This directory provide some tools for model training with the following file str
|
|||
# tokenizer.py
|
||||
We need to use a `tokenizer` to generate `bin` and `meta` files for raw data. We import the tokenizer model by specifying the model weight path in `tools/tokenizer.py`. Currently, we provide `V7_sft.model` to generate tokens. If you want to use a different model, you can modify the model weight path in `tokenizer.py` directly.
|
||||
|
||||
We can run the following command to generate `bin` and `meta` files for raw data, where the parameter `raw_data_name` indicates the file name of raw data, `input_file_type` denotes the raw data format, which should be `txt`, `json` and `jsonl`, and `bin` indicates the path to save the generated `bin` file.
|
||||
We can run the following command to generate `bin` and `meta` files corresponding to the original data. The parameter `text_input_path` represents the path of the original text data, currently supporting `txt`, `json`, and `jsonl` formats, while `bin_output_path` represents the save path of the generated `bin` files.
|
||||
```bash
|
||||
$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'text' or 'json' or 'jsonl' --bin your_output_bin_path
|
||||
$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
|
||||
```
|
||||
|
||||
An example of data processing in `txt` format is given here (the data processing for `json` and `jsonl` is identical to that for `txt`).
|
||||
An example of data processing in `txt` format is given here:
|
||||
|
||||
Given a file `raw_data.txt` containing raw data with the following content.
|
||||
```bash
|
||||
|
@ -26,7 +26,7 @@ Learn to be tolerant and understanding to establish truly harmonious interperson
|
|||
```
|
||||
Next, we can run the following command to generate `bin` and `meta` files for raw data.
|
||||
```bash
|
||||
$ python tools/tokenizer.py --raw_data_name raw_data --input_file_type 'text' --bin cn/output.bin
|
||||
$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
|
||||
```
|
||||
|
||||
It should be noted that the generated `bin` files should be placed in one of the following directories to clarify the data type: `cn`(Chinese), `en`(English), `code`(code data), `ja`(Japanese), `ar`(Arabic) and `kaoshi`(kaoshi data).
|
||||
|
|
|
@ -1,10 +1,11 @@
|
|||
import argparse
|
||||
import json
|
||||
import sentencepiece as spm
|
||||
from tqdm import tqdm
|
||||
import os.path as osp
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
import sentencepiece as spm
|
||||
from tqdm import tqdm
|
||||
|
||||
|
||||
def process(dataset_path, sp_model):
|
||||
|
@ -33,15 +34,15 @@ def get_chat_format_data(ori_data):
|
|||
Returns:
|
||||
dict: data sample with chat format.
|
||||
"""
|
||||
input_str = ori_data['input']
|
||||
instruction_str = ori_data['instruction']
|
||||
output_str = ori_data['output']
|
||||
input_str = ori_data["input"]
|
||||
instruction_str = ori_data["instruction"]
|
||||
output_str = ori_data["output"]
|
||||
data = dict()
|
||||
if input_str != "":
|
||||
data['user'] = f'<|User|>:{instruction_str}\n{input_str}'
|
||||
data["user"] = f"<|User|>:{instruction_str}\n{input_str}"
|
||||
else:
|
||||
data['user'] = f'<|User|>:{instruction_str}'
|
||||
data['bot'] = f'<|Bot|>:{output_str}'
|
||||
data["user"] = f"<|User|>:{instruction_str}"
|
||||
data["bot"] = f"<|Bot|>:{output_str}"
|
||||
return data
|
||||
|
||||
|
||||
|
@ -55,27 +56,27 @@ def tokenize(sample, sp_model):
|
|||
Returns:
|
||||
tuple: dumped processed data sample and length of tokens.
|
||||
"""
|
||||
special_tokens_map = {'<eoh>': 103167, '<eoa>': 103166, 'nl_id': 13}
|
||||
special_tokens_map = {"<eoh>": 103167, "<eoa>": 103166, "nl_id": 13}
|
||||
token_ids = [sp_model.bos_id()]
|
||||
human_s = sample['user']
|
||||
ass_s = sample['bot']
|
||||
human_s = sample["user"]
|
||||
ass_s = sample["bot"]
|
||||
|
||||
human_ids = sp_model.encode(human_s) + [
|
||||
special_tokens_map["<eoh>"], special_tokens_map['nl_id']
|
||||
]
|
||||
human_ids = sp_model.encode(human_s) + [special_tokens_map["<eoh>"], special_tokens_map["nl_id"]]
|
||||
human_ids_ignore = [-token_id for token_id in human_ids]
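# Negating the ids here appears to mark the user-prompt tokens so that downstream
# data loading can mask them out of the loss (only the bot response is learned).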
|
||||
|
||||
ass_template_ids = sp_model.encode('<|Assistant|>:')
|
||||
ass_template_ids = sp_model.encode("<|Bot|>:")
|
||||
ass_template_ids_ignore = [-token_ids for token_ids in ass_template_ids]
|
||||
ass_ids = ass_template_ids_ignore + sp_model.encode(ass_s[14:]) + [
|
||||
special_tokens_map["<eoa>"], special_tokens_map['nl_id']
|
||||
]
|
||||
ass_ids = (
|
||||
ass_template_ids_ignore
|
||||
+ sp_model.encode(ass_s[8:])
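# ass_s is built as f"<|Bot|>:{output_str}", so its first 8 characters are the "<|Bot|>:"
# prefix whose template ids were already added above; slice it off before encoding.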
|
||||
+ [special_tokens_map["<eoa>"], special_tokens_map["nl_id"]]
|
||||
)
|
||||
|
||||
token_ids += human_ids_ignore + ass_ids
|
||||
if len(token_ids) > 2047:
|
||||
token_ids = token_ids[:2047]
|
||||
token_ids += [sp_model.eos_id()]
|
||||
line = str.encode(json.dumps({'tokens': token_ids}) + '\n')
|
||||
line = str.encode(json.dumps({"tokens": token_ids}) + "\n")
|
||||
return line, len(token_ids)
|
||||
|
||||
|
||||
|
@ -93,14 +94,14 @@ def dump_bin_meta_bin(samples, path, split_ratio=0.1):
|
|||
number of train/valid samples of processed dataset.
|
||||
"""
|
||||
|
||||
train_path = osp.join(path, 'train/en/')
|
||||
valid_path = osp.join(path, 'valid/en/')
|
||||
train_path = osp.join(path, "train/en/")
|
||||
valid_path = osp.join(path, "valid/en/")
|
||||
train_dir = Path(train_path)
|
||||
valid_dir = Path(valid_path)
|
||||
train_dir.mkdir(exist_ok=True, parents=True)
|
||||
valid_dir.mkdir(exist_ok=True, parents=True)
|
||||
train_f = open(train_dir.joinpath('dataset.bin'), 'wb')
|
||||
valid_f = open(valid_dir.joinpath('dataset.bin'), 'wb')
|
||||
train_f = open(train_dir.joinpath("dataset.bin"), "wb")
|
||||
valid_f = open(valid_dir.joinpath("dataset.bin"), "wb")
|
||||
|
||||
train_tokens = 0
|
||||
valid_tokens = 0
|
||||
|
@ -113,8 +114,7 @@ def dump_bin_meta_bin(samples, path, split_ratio=0.1):
|
|||
|
||||
sample_length = len(samples)
|
||||
np.random.seed(0)
|
||||
valid_indices = np.random.choice(
|
||||
range(sample_length), int(sample_length * split_ratio)).tolist()
|
||||
valid_indices = np.random.choice(range(sample_length), int(sample_length * split_ratio)).tolist()
|
||||
|
||||
count = -1
|
||||
for line, token_num in samples:
|
||||
|
@ -134,25 +134,19 @@ def dump_bin_meta_bin(samples, path, split_ratio=0.1):
|
|||
|
||||
train_f.close()
|
||||
valid_f.close()
|
||||
np.save(open(train_dir.joinpath('dataset.bin.meta'), 'wb'), train_meta)
|
||||
np.save(open(valid_dir.joinpath('dataset.bin.meta'), "wb"), valid_meta)
|
||||
np.save(open(train_dir.joinpath("dataset.bin.meta"), "wb"), train_meta)
|
||||
np.save(open(valid_dir.joinpath("dataset.bin.meta"), "wb"), valid_meta)
|
||||
|
||||
return train_tokens, valid_tokens, train_samples, valid_samples
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
if __name__ == "__main__":
|
||||
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument(
|
||||
'dataset_path', type=str, help='path of dataset json file')
|
||||
parser.add_argument(
|
||||
'output_path', type=str, help='path of processed dataset')
|
||||
parser.add_argument('tokenizer_path', type=str, help='path of tokenizer')
|
||||
parser.add_argument(
|
||||
'--split_ratio',
|
||||
type=float,
|
||||
default=0.1,
|
||||
help='ratio for validation dataset splitting')
|
||||
parser.add_argument("dataset_path", type=str, help="path of dataset json file")
|
||||
parser.add_argument("output_path", type=str, help="path of processed dataset")
|
||||
parser.add_argument("tokenizer_path", type=str, help="path of tokenizer")
|
||||
parser.add_argument("--split_ratio", type=float, default=0.1, help="ratio for validation dataset splitting")
|
||||
|
||||
args = parser.parse_args()
|
||||
sp_model = spm.SentencePieceProcessor(model_file=args.tokenizer_path)
|
||||
|
@ -163,9 +157,8 @@ if __name__ == '__main__':
|
|||
for sample in tqdm(dataset):
|
||||
samples.append(sample)
|
||||
|
||||
train_tokens, valid_tokens, train_samples, valid_samples = \
|
||||
dump_bin_meta_bin(samples, args.output_path, args.split_ratio)
|
||||
print(f'number of train dataset: {train_samples}, '
|
||||
'number of train dataset token: {train_tokens}')
|
||||
print(f'number of validation dataset: {valid_samples}, '
|
||||
'number of validation dataset token: {valid_tokens}')
|
||||
train_tokens, valid_tokens, train_samples, valid_samples = dump_bin_meta_bin(
|
||||
samples, args.output_path, args.split_ratio
|
||||
)
|
||||
print(f"number of train dataset: {train_samples}, " "number of train dataset token: {train_tokens}")
|
||||
print(f"number of validation dataset: {valid_samples}, " "number of validation dataset token: {valid_tokens}")
|
||||
|
|
|
@ -1,24 +1,25 @@
|
|||
import argparse
|
||||
import json
|
||||
import os
|
||||
import warnings
|
||||
import sys
|
||||
|
||||
import numpy as np
|
||||
from sentencepiece import SentencePieceProcessor
|
||||
from termcolor import colored
|
||||
|
||||
current_dir = os.path.dirname(os.path.abspath(__file__))
|
||||
model_path = os.path.join(current_dir, "V7.model")
|
||||
tokenizer = SentencePieceProcessor(model_file=model_path)
|
||||
model_path = os.path.join(current_dir, "V7_sft.model")
|
||||
sys.path.append(os.path.join(current_dir, "transformers"))
|
||||
from tokenization_internlm import InternLMTokenizer
|
||||
|
||||
tokenizer = InternLMTokenizer(vocab_file=model_path)
|
||||
|
||||
|
||||
def write_bin(context: str, path: str) -> None:
|
||||
def write_bin(context: str, bin_file) -> None:
|
||||
"""
|
||||
Write bin file.
|
||||
Write bin file based on the context.
|
||||
|
||||
Args:
|
||||
context (str): the context of raw file.
|
||||
path (str): the path for output bin file.
|
||||
bin_file (file handler): the opened bin file.
|
||||
|
||||
Example:
|
||||
>>> with open("out.bin", "ab") as bin_file:
...     write_bin("今天天气晴朗适合出门散步", bin_file)
|
||||
|
@ -33,21 +34,20 @@ def write_bin(context: str, path: str) -> None:
|
|||
# encode the data into bytes to save
|
||||
saved_bin = str.encode(json.dumps(data) + "\n")
|
||||
|
||||
# write bytes into bin path
|
||||
with open(path, "ab") as f:
|
||||
f.write(saved_bin)
|
||||
# write bytes into bin_file
|
||||
bin_file.write(saved_bin)
|
||||
|
||||
|
||||
def prepare_meta(bin_file_path: str):
|
||||
def prepare_meta(bin_output_path: str):
|
||||
"""
|
||||
Prepare metadata for the given bin file.
|
||||
|
||||
Args:
|
||||
bin_file_path (str): the bin file path.
|
||||
bin_output_path (str): Output bin file path.
|
||||
"""
|
||||
meta = []
|
||||
cur = 0
|
||||
with open(bin_file_path, "rb") as f:
|
||||
with open(bin_output_path, "rb") as f:
|
||||
while True:
|
||||
# read lines
|
||||
line = f.readline()
|
||||
|
@ -62,109 +62,66 @@ def prepare_meta(bin_file_path: str):
|
|||
meta.append((cur, length))
|
||||
# update the cur to generate the meta information of next line
|
||||
cur += len(line)
|
||||
print(meta)
|
||||
|
||||
# define path of the generated meta file
|
||||
meta_fp = bin_file_path + ".meta"
|
||||
meta_fp = bin_output_path + ".meta"
|
||||
# save the generated meta information
|
||||
with open(meta_fp, "wb") as f:
|
||||
meta = np.array(meta, dtype=np.int32)
|
||||
np.save(f, meta)
|
||||
|
||||
|
||||
def txt2bin(txt_file_path: str, bin_file_path: str):
|
||||
def text2bin(text_input_path: str, bin_output_path: str):
|
||||
"""
|
||||
Read content from txt file and write to bin file
|
||||
Read content from the input file and write to bin file.
|
||||
Currently support 3 input formats: 'txt', 'json' and 'jsonl'.
|
||||
|
||||
Args:
|
||||
txt_file_path (str): txt file path.
|
||||
bin_file_path (str): output bin file path.
|
||||
text_input_path (str): txt file path.
|
||||
bin_output_path (str): output bin file path.
|
||||
"""
|
||||
# Check if the txt file exists
|
||||
if not os.path.isfile(txt_file_path):
|
||||
warnings.warn(colored(f"{txt_file_path} does not exist.", "red"))
|
||||
return
|
||||
if not os.path.isfile(text_input_path):
|
||||
raise FileNotFoundError(f"{text_input_path} does not exist.")
|
||||
|
||||
try:
|
||||
# Open the text file
|
||||
with open(txt_file_path, "r") as txt_file:
|
||||
for line in txt_file:
|
||||
file_format = text_input_path.split(".")[-1]
|
||||
assert file_format in ["txt", "json", "jsonl"], print(
|
||||
"Invalid input file type. Currently support `txt`, `json` and `jsonl`."
|
||||
)
|
||||
|
||||
with open(text_input_path, "r") as text_file, open(bin_output_path, "ab") as bin_file:
|
||||
if file_format == "txt":
|
||||
for line in text_file:
|
||||
# Strip any leading/trailing whitespace
|
||||
stripped_line = line.strip()
|
||||
if stripped_line:
|
||||
# Pass each line to the write_bin function
|
||||
write_bin(stripped_line, bin_file_path)
|
||||
write_bin(stripped_line, bin_file)
|
||||
|
||||
print(colored(f"Successfully converted {txt_file_path} to {bin_file_path}", "green"))
|
||||
|
||||
except Exception as e:
|
||||
print(colored(f"Error while converting {txt_file_path} to {bin_file_path}: {str(e)}", "red"))
|
||||
|
||||
|
||||
def json2bin(json_file_path: str, bin_file_path: str):
|
||||
"""
|
||||
Read content from json file and write to bin file
|
||||
|
||||
Args:
|
||||
json_file_path (str): json file path.
|
||||
bin_file_path (str): output bin file path.
|
||||
"""
|
||||
|
||||
if not os.path.isfile(json_file_path):
|
||||
warnings.warn(colored(f"{json_file_path} does not exist.", "red"))
|
||||
return
|
||||
|
||||
try:
|
||||
# load json file
|
||||
with open(json_file_path, "r") as json_file:
|
||||
data = json.load(json_file)
|
||||
# assuming data is a list of dictionaries
|
||||
for record in data:
|
||||
# the type of record is dict, transfer the dict into str
|
||||
context = json.dumps(record)
|
||||
# encode the str and write into bin
|
||||
write_bin(context, bin_file_path)
|
||||
|
||||
print(colored(f"Successfully converted {json_file_path} to {bin_file_path}", "green"))
|
||||
|
||||
except Exception as e:
|
||||
print(colored(f"Error while converting {json_file_path} to {bin_file_path}: {str(e)}", "red"))
|
||||
|
||||
|
||||
def jsonl2bin(jsonl_file_path: str, bin_file_path: str):
|
||||
"""
|
||||
Read content from jsonl file and write to bin file
|
||||
|
||||
Args:
|
||||
jsonl_file_path: jsonl file path.
|
||||
bin_file_path: bin file path.
|
||||
"""
|
||||
|
||||
if not os.path.isfile(jsonl_file_path):
|
||||
warnings.warn(colored(f"{jsonl_file_path} does not exist.", "red"))
|
||||
return
|
||||
|
||||
try:
|
||||
with open(jsonl_file_path, "r") as jsonl_file:
|
||||
for line in jsonl_file:
|
||||
elif file_format == "json":
|
||||
data = json.load(text_file)
|
||||
# assuming data is a list of dictionaries
|
||||
for record in data:
|
||||
# the type of record is dict, transfer the dict into str
|
||||
context = json.dumps(record)
|
||||
# encode the str and write into bin
|
||||
write_bin(line, bin_file_path)
|
||||
write_bin(context, bin_file)
|
||||
|
||||
print(colored(f"Successfully converted {jsonl_file_path} to {bin_file_path}", "green"))
|
||||
|
||||
except Exception as e:
|
||||
print(colored(f"Error while converting {jsonl_file_path} to {bin_file_path}: {str(e)}", "red"))
|
||||
elif file_format == "jsonl":
|
||||
for line in text_file:
|
||||
# encode the str and write into bin
|
||||
write_bin(line, bin_file)
|
||||
|
||||
|
||||
def parse_args():
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument("--raw_data_name", required=True, help="Input file name")
|
||||
parser.add_argument(
|
||||
"--input_file_type",
|
||||
choices=["txt", "json", "jsonl"],
|
||||
"--text_input_path",
|
||||
type=str,
|
||||
required=True,
|
||||
help="Input file format (either txt, json or jsonl)",
|
||||
help="Path to the input text file.",
|
||||
)
|
||||
parser.add_argument("--bin", required=True, help="Path to the output bin file")
|
||||
parser.add_argument("--bin_output_path", type=str, required=True, help="Path to the output bin file.")
|
||||
|
||||
return parser.parse_args()
|
||||
|
||||
|
@ -173,21 +130,12 @@ def main():
|
|||
# parse arguments
|
||||
args = parse_args()
|
||||
|
||||
# obtain the raw data path
|
||||
input_file_path = f"{args.raw_data_name}.{args.input_file_type}"
|
||||
|
||||
# different methods for different raw data type, we only support "txt", "json" and "jsonl" data type.
|
||||
if args.input_file_type == "txt":
|
||||
txt2bin(input_file_path, args.bin)
|
||||
elif args.input_file_type == "json":
|
||||
json2bin(input_file_path, args.bin)
|
||||
elif args.input_file_type == "jsonl":
|
||||
jsonl2bin(input_file_path, args.bin)
|
||||
else:
|
||||
print(colored("Invalid input file type. Use --help for more information.", "red"))
|
||||
text2bin(args.text_input_path, args.bin_output_path)
|
||||
print(f"Successfully converted {args.text_input_path} to {args.bin_output_path}")
|
||||
|
||||
# To avoid potential read/write errors, the metadata preparation follows after creating the .bin file.
|
||||
prepare_meta(args.bin)
|
||||
prepare_meta(args.bin_output_path)
|
||||
print(f"Successfully generated {args.bin_output_path}.meta")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
|
|
@ -8,18 +8,17 @@
|
|||
|
||||
## 权重转换
|
||||
|
||||
`convert2hf.py` 可以将训练保存的权重一键转换为 transformers 格式。
|
||||
`convert2hf.py` 可以将训练保存的权重一键转换为 transformers 格式。在仓库根目录运行以下命令:
|
||||
|
||||
```bash
|
||||
python convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ../v7_sft.model
|
||||
python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ./tools/V7_sft.model
|
||||
```
|
||||
|
||||
然后可以使用 `from_pretrained` 接口加载:
|
||||
|
||||
```python
|
||||
from modeling_internlm import InternLMForCausalLM
|
||||
|
||||
model = InternForCausalLM.from_pretrained("hf_ckpt/")
|
||||
>>> from transformers import AutoTokenizer, AutoModel
|
||||
>>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
|
||||
```
|
||||
|
||||
|
||||
|
|
|
@ -7,18 +7,17 @@ This folder contains the `InternLM` model in transformers format.
|
|||
|
||||
## Weight Conversion
|
||||
|
||||
`convert2hf.py` can convert saved training weights into the transformers format with a single command.
|
||||
`convert2hf.py` can convert saved training weights into the transformers format with a single command. Execute the command in the root directory of the repository:
|
||||
|
||||
```bash
|
||||
python convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ../v7_sft.model
|
||||
python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ./tools/V7_sft.model
|
||||
```
|
||||
|
||||
Then, you can load it using the `from_pretrained` interface:
|
||||
|
||||
```python
|
||||
from modeling_internlm import InternLMForCausalLM
|
||||
|
||||
model = InternForCausalLM.from_pretrained("hf_ckpt/")
|
||||
>>> from transformers import AutoTokenizer, AutoModel
|
||||
>>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
|
||||
```
|
||||
|
||||
`intern_moss_example.py` demonstrates an example of how to use LoRA for fine-tuning on the `fnlp/moss-moon-002-sft` dataset.
|
||||
`intern_moss_example.py` demonstrates an example of how to use LoRA for fine-tuning on the `fnlp/moss-moon-002-sft` dataset.
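The actual training loop lives in `intern_moss_example.py`. As a rough orientation only (not that file's contents), attaching a LoRA adapter to the converted model with the `peft` library typically looks like the sketch below; the rank, alpha and target projection names are assumptions based on InternLM's LLaMA-style attention layers:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("hf_ckpt/", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("hf_ckpt/", trust_remote_code=True)

# Illustrative LoRA settings; not taken from intern_moss_example.py.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```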
|
||||
|
|
|
@ -1,9 +1,9 @@
|
|||
import argparse
|
||||
import math
|
||||
import json
|
||||
import os
|
||||
import random
|
||||
import re
|
||||
import shutil
|
||||
import tempfile
|
||||
|
||||
import torch
|
||||
from modeling_internlm import InternLMConfig, InternLMForCausalLM
|
||||
|
@ -15,10 +15,8 @@ NUM_SHARDS = {
|
|||
|
||||
|
||||
def convert2hf(model_config, states_tp_pps):
|
||||
folder = f"/dev/shm/wait_to_upload_weight_tmp_{random.random()}/"
|
||||
os.makedirs(folder, exist_ok=True)
|
||||
|
||||
try:
|
||||
with tempfile.TemporaryDirectory() as folder:
|
||||
states = merge_pp(states_tp_pps)[0]
|
||||
|
||||
if "embedding.word_embeddings.weight" in states:
|
||||
|
@ -88,12 +86,9 @@ def convert2hf(model_config, states_tp_pps):
|
|||
config.save_pretrained(folder)
|
||||
torch.save(current_states, os.path.join(folder, "pytorch_model.bin"))
|
||||
|
||||
model = InternLMForCausalLM.from_pretrained(folder, torch_dtype=torch.float16, low_cpu_mem_usage=True)
|
||||
model = InternLMForCausalLM.from_pretrained(folder, torch_dtype=torch.float16)
|
||||
del model.config._name_or_path
|
||||
|
||||
finally:
|
||||
shutil.rmtree(folder)
|
||||
|
||||
return config, model
|
||||
|
||||
|
||||
|
@ -169,6 +164,12 @@ if __name__ == "__main__":
|
|||
|
||||
os.makedirs(target_folder, exist_ok=True)
|
||||
model.save_pretrained(target_folder, max_shard_size="20GB")
|
||||
# TODO There should be a better way to add this.
|
||||
with open(os.path.join(target_folder, "config.json")) as fp:
|
||||
config_dict = json.load(fp)
|
||||
config_dict["auto_map"]["AutoModel"] = "modeling_internlm.InternLMModel"
|
||||
with open(os.path.join(target_folder, "config.json"), "w") as fp:
|
||||
json.dump(config_dict, fp, indent=2)
|
||||
|
||||
tokenizer = InternLMTokenizer(args.tokenizer)
|
||||
tokenizer.save_pretrained(target_folder)
|
||||
|
|
|
@ -31,7 +31,7 @@ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutpu
|
|||
from transformers.modeling_utils import PreTrainedModel
|
||||
from transformers.generation.streamers import BaseStreamer
|
||||
from transformers.utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
|
||||
from .configuration_internlm import InternLMConfig
|
||||
from configuration_internlm import InternLMConfig
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
|
|
@ -147,7 +147,8 @@ class GenerationConfig:
|
|||
top_p: Optional[float] = None
|
||||
temperature: Optional[float] = None
|
||||
do_sample: Optional[bool] = True
|
||||
|
||||
repetition_penalty: Optional[float] = 1.0
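# A value of 1.0 leaves the repetition penalty effectively disabled; larger values discourage
# repeated tokens but were reported to make this demo's outputs erratic (see #99).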
|
||||
|
||||
|
||||
@st.cache_resource
|
||||
def load_model():
|
||||
|
@ -228,15 +229,12 @@ def main():
|
|||
# Add user message to chat history
|
||||
st.session_state.messages.append({"role": "user", "content": prompt, "avatar": user_avator})
|
||||
|
||||
print(f"cur real input:\n{real_prompt}\n")
|
||||
|
||||
with st.chat_message("robot", avatar=robot_avator):
|
||||
message_placeholder = st.empty()
|
||||
for cur_response in generate_interactive(model=model, tokenizer=tokenizer, prompt=real_prompt, additional_eos_token_id=103028, **asdict(generation_config)):
|
||||
# Display robot response in chat message container
|
||||
message_placeholder.markdown(cur_response + "▌")
|
||||
message_placeholder.markdown(cur_response)
|
||||
print(f"cur total response:\n{cur_response}\n")
|
||||
# Add robot response to chat history
|
||||
st.session_state.messages.append({"role": "robot", "content": cur_response, "avatar": robot_avator})
|
||||
|
||||
|
|