Update main branch and docs (#585)

* [Refactor]: refactor to pure documentation and examples

* update model information

* update model information

* Check-in lmdeploy user guide

* Update chat format doc

* update cn doc

* clean doc
Wenwei Zhang 2024-01-17 09:46:11 +08:00 committed by GitHub
parent aaaf4d7b0e
commit dbec726c62
190 changed files with 530 additions and 26517 deletions

.gitmodules

@@ -1,6 +0,0 @@
[submodule "third_party/flash-attention"]
path = third_party/flash-attention
url = https://github.com/HazyResearch/flash-attention.git
[submodule "third_party/apex"]
path = third_party/apex
url = https://github.com/NVIDIA/apex

README-ja-JP.md

@@ -1,203 +0,0 @@
# InternLM
<div align="center">
<img src="./doc/imgs/logo.svg" width="200"/>
<div> </div>
<div align="center">
<b><font size="5">InternLM</font></b>
<sup>
<a href="https://internlm.intern-ai.org.cn/">
<i><font size="4">HOT</font></i>
</a>
</sup>
<div> </div>
</div>
[![license](./doc/imgs/license.svg)](./LICENSE)
[![evaluation](./doc/imgs/compass_support.svg)](https://github.com/internLM/OpenCompass/)
[![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest)
[📘Usage](./doc/en/usage.md) |
[🛠️Installation](./doc/en/install.md) |
[📊Train Performance](./doc/en/train_performance.md) |
[👀Model](#model-zoo) |
[🆕Update News](./CHANGE_LOG.md) |
[🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new)
[English](./README.md) |
[简体中文](./README-zh-Hans.md) |
[日本語](./README-ja-JP.md)
</div>
## Introduction
InternLM has open-sourced a 7-billion-parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:
- It leverages trillions of high-quality tokens for training to establish a powerful knowledge base.
- It supports an 8k context window length, enabling longer input sequences and stronger reasoning capabilities.
- It provides a versatile toolset for users to flexibly build their own workflows.
Additionally, a lightweight training framework is provided to support model pre-training without the need for extensive dependencies. With a single codebase, it supports pre-training on large-scale clusters with thousands of GPUs and fine-tuning on a single GPU while achieving remarkable performance optimizations. InternLM achieves nearly 90% acceleration efficiency during training on 1024 GPUs.
## News
[20231213] The InternLM-7B-Chat and InternLM-20B-Chat model weights have been updated. The new chat models can generate higher-quality responses with greater stylistic diversity.
[20230920] InternLM-20B is released, including base and chat versions.
## InternLM-7B
### Performance Evaluation
We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/). The evaluation covered five dimensions of capabilities: disciplinary competence, language competence, knowledge competence, inference competence, and comprehension competence. Here are some of the evaluation results, and you can visit the [OpenCompass leaderboard](https://opencompass.org.cn/rank) for more evaluation results.
| Datasets\Models | **InternLM-Chat-7B** | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
| --------------- | -------------------------- | --------------------- | -------- | ----------- | ----------- | --------- | --------- |
| C-Eval(Val) | 52.0 | 53.4 | 24.2 | 42.7 | 50.9 | 28.9 | 31.2 |
| MMLU | 52.6 | 51.0 | 35.2* | 41.5 | 46.0 | 39.7 | 47.3 |
| AGIEval | 46.4 | 37.6 | 20.8 | 24.6 | 39.0 | 24.1 | 26.4 |
| CommonSenseQA | 80.8 | 59.5 | 65.0 | 58.8 | 60.0 | 68.7 | 66.7 |
| BUSTM | 80.6 | 50.6 | 48.5 | 51.3 | 55.0 | 48.8 | 62.5 |
| CLUEWSC | 81.8 | 59.1 | 50.3 | 52.8 | 59.8 | 50.3 | 52.2 |
| MATH | 5.0 | 7.1 | 2.8 | 3.0 | 6.6 | 2.2 | 2.8 |
| GSM8K | 36.2 | 31.2 | 10.1 | 9.7 | 29.2 | 6.0 | 15.3 |
| HumanEval | 15.9 | 10.4 | 14.0 | 9.2 | 9.2 | 9.2 | 11.0 |
| RACE(High) | 80.3 | 57.4 | 46.9* | 28.1 | 66.3 | 40.7 | 54.0 |
- The evaluation results were obtained from [OpenCompass 20230706](https://github.com/internLM/OpenCompass/) (data marked with * are cited from the original papers), and the evaluation configuration can be found in the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/).
- The evaluation data may have numerical differences due to the version iteration of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results of [OpenCompass](https://github.com/internLM/OpenCompass/).
### Model Zoo
InternLM 7B and InternLM 7B Chat, trained with InternLM, have been open-sourced. The model weights are provided in two formats: in addition to loading the models with Transformers, you can load the weights directly with InternLM for further pre-training or human preference alignment training.
| Model | InternLM format weight download link | Transformers format weight download link |
| ----------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------- |
| **InternLM 7B** | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-7b) | [🤗internlm/intern-7b](https://huggingface.co/internlm/internlm-7b) |
| **InternLM Chat 7B** | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b) | [🤗internlm/intern-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) |
**Limitations:** Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
### Import from Transformers
To load the InternLM 7B Chat model using Transformers, use the following code:
```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True).cuda()
>>> model = model.eval()
>>> response, history = model.chat(tokenizer, "Hello", history=[])
>>> print(response)
Hello! How can I help you today?
>>> response, history = model.chat(tokenizer, "Please give me three suggestions for time management", history=history)
>>> print(response)
Of course! Here are three concise suggestions for time management:
1. Make a to-do list and prioritize: List your tasks clearly and decide the priority of each one. Working through the important and urgent tasks first lets you proceed efficiently.
2. Practice time blocking: Block out time so you can focus on one kind of work within a given slot. For example, dedicating two hours in the morning to email and three hours in the afternoon to a project helps you secure time for each task.
3. Eliminate distractions: Keep distractions to a minimum to stay focused. Turning off notifications and staying away from social media and email will improve your working efficiency.
Putting these suggestions into practice will improve your time-management skills and help you handle your daily tasks effectively.
```
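If GPU memory is tight, the same model can also be loaded in half precision via the standard `torch_dtype` argument of `from_pretrained`. This is a minimal sketch under that assumption (the model name and precision here are illustrative; bfloat16 is an alternative on recent GPUs):
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True)
# Load the weights in float16 to roughly halve GPU memory usage.
model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm-chat-7b",
    torch_dtype=torch.float16,
    trust_remote_code=True,
).cuda()
model = model.eval()

response, _ = model.chat(tokenizer, "Hello", history=[])
print(response)
```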
### Dialogue
You can interact with the InternLM Chat 7B model through a frontend interface by running the following code:
```bash
pip install streamlit==1.24.0
pip install transformers==4.30.2
streamlit run web_demo.py
```
The effect is as follows
![demo](https://github.com/InternLM/InternLM/assets/9102141/11b60ee0-47e4-42c0-8278-3051b2f17fe4)
### Deployment
We use [LMDeploy](https://github.com/InternLM/LMDeploy) to complete the one-click deployment of InternLM.
1. First, install LMDeploy:
```shell
python3 -m pip install lmdeploy
```
2. Use the following command for quick deployment:
```shell
lmdeploy chat turbomind InternLM/internlm-chat-7b --model-name internlm-chat-7b
```
3. After exporting the model, you can start a server and chat with the deployed model using the following command:
```shell
lmdeploy serve api_server InternLM/internlm-chat-7b --model-name internlm-chat-7b
```
[LMDeploy](https://github.com/InternLM/LMDeploy) provides a complete workflow for deploying InternLM. Please refer to the [deployment tutorial](https://github.com/InternLM/LMDeploy) for more details on deploying InternLM.
## Fine-tuning & Training
### Pre-training and Fine-tuning Tutorial
Please refer to the [Usage Tutorial](./doc/ja/usage.md) to get started with InternLM installation, data processing, pre-training, and fine-tuning.
### Convert to Transformers Format
The model trained by InternLM can be easily converted to the Hugging Face Transformers format, which allows seamless integration with the various open-source projects in the community. With the help of `tools/convert2hf.py`, the weights saved during training can be converted into the Transformers format with a single command:
```bash
python convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer tokenizes/tokenizer.model
```
After conversion, it can be loaded as transformers with the following code:
```python
>>> from transformers import AutoTokenizer, AutoModel
>>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
```
## Training System
### System Architecture
Please refer to the [System Architecture document](./doc/ja/structure.md) for further details.
### Training Performance
InternLM deeply integrates Flash-Attention, Apex, and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternLM supports expanding the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternLM's scalability test data at different configurations:
| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
TGS represents the average number of tokens processed per GPU per second. For more performance test data, please refer to the [Training Performance document](./doc/ja/train_performance.md) for further details.
## Contribution
We appreciate all the contributors for their efforts to improve and enhance InternLM. Community users are highly encouraged to participate in the project. Please refer to the contribution guidelines for instructions on how to contribute to the project.
## Acknowledgements
InternLM codebase is an open-source project contributed by Shanghai AI Laboratory and researchers from different universities and companies. We would like to thank all the contributors for their support in adding new features to the project and the users for providing valuable feedback. We hope that this toolkit and benchmark can provide the community with flexible and efficient code tools for fine-tuning InternLM and developing their own models, thus continuously contributing to the open-source community. Special thanks to the two open-source projects, [flash-attention](https://github.com/HazyResearch/flash-attention) and [ColossalAI](https://github.com/hpcaitech/ColossalAI).
## License
The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow **free** commercial usage. To apply for a commercial license, please fill in the [application form (English)](https://wj.qq.com/s2/12727483/5dba/)/[申请表(中文)](https://wj.qq.com/s2/12725412/f7c1/). For other questions or collaborations, please contact <internlm@pjlab.org.cn>.
## Citation
```
@misc{2023internlm,
title={InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities},
author={InternLM Team},
howpublished = {\url{https://github.com/InternLM/InternLM}},
year={2023}
}
```

README-zh-Hans.md

@@ -1,292 +0,0 @@
# InternLM
<div align="center">
<img src="./doc/imgs/logo.svg" width="200"/>
<div>&nbsp;</div>
<div align="center">
<b><font size="5">书生·浦语 官网</font></b>
<sup>
<a href="https://internlm.intern-ai.org.cn/">
<i><font size="4">HOT</font></i>
</a>
</sup>
<div>&nbsp;</div>
</div>
[![license](./doc/imgs/license.svg)](https://github.com/open-mmlab/mmdetection/blob/main/LICENSE)
[![evaluation](./doc/imgs/compass_support.svg)](https://github.com/internLM/OpenCompass/)
[![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest)
[📘Usage](./doc/usage.md) |
[🛠️Installation](./doc/install.md) |
[📊Train Performance](./doc/train_performance.md) |
[👀Model Zoo](#model-zoo) |
[🤗HuggingFace](https://huggingface.co/spaces/internlm/InternLM-Chat-7B) |
[🆕Update News](./CHANGE_LOG.md) |
[🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new)
[English](./README.md) |
[简体中文](./README-zh-Hans.md) |
[日本語](./README-ja-JP.md)
</div>
<p align="center">
👋 Join us on <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://github.com/InternLM/InternLM/assets/25839884/a6aad896-7232-4220-ac84-9e070c2633ce" target="_blank">WeChat</a>
</p>
## Introduction
InternLM is an open-source lightweight training framework designed to support large-model training without the need for extensive dependencies. With a single codebase, it supports pre-training on large-scale clusters with thousands of GPUs, and fine-tuning on a single GPU while achieving remarkable performance optimizations. InternLM achieves nearly 90% acceleration efficiency during training on 1024 GPUs.
Based on the InternLM training framework, we have released two open-source pretrained models: InternLM-7B and InternLM-20B.
## News
[20231213] The InternLM-7B-Chat and InternLM-20B-Chat model weights have been updated. With improved finetuning data and training strategy, the new chat models can generate higher-quality responses with greater stylistic diversity.
[20230920] InternLM-20B is released, including base and chat versions.
## Model Zoo
Our models are released on three platforms: Transformers, ModelScope, and OpenXLab.
| Model | Transformers | ModelScope | OpenXLab | Release Date |
|---------------------------|------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|
| **InternLM Chat 20B** | [🤗internlm/internlm-chat-20b](https://huggingface.co/internlm/internlm-20b-chat) | [<img src="./doc/imgs/modelscope_logo.png" width="20px" /> Shanghai_AI_Laboratory/internlm-chat-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-20b-chat/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-20b) | 2023-12-12 |
| **InternLM 20B** | [🤗internlm/internlm-20b](https://huggingface.co/internlm/internlm-20b) | [<img src="./doc/imgs/modelscope_logo.png" width="20px" /> Shanghai_AI_Laboratory/internlm-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-20b) | 2023-09-20 |
| **InternLM Chat 7B** | [🤗internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) | [<img src="./doc/imgs/modelscope_logo.png" width="20px" /> Shanghai_AI_Laboratory/internlm-chat-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b) | 2023-12-12 |
| **InternLM 7B** | [🤗internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) | [<img src="./doc/imgs/modelscope_logo.png" width="20px" /> Shanghai_AI_Laboratory/internlm-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-7b) | 2023-07-06 |
<details>
<summary> InternLM-20B </summary>
#### Introduction
InternLM-20B was pre-trained on over **2.3T** tokens containing high-quality English, Chinese, and code data. Additionally, the Chat version has undergone SFT and RLHF training, enabling it to better and more safely meet users' needs.
In terms of model structure, InternLM-20B opted for a deeper architecture, with a depth of 60 layers. This surpasses the conventional 7B and 13B models that use 32 or 40 layers. When parameters are limited, increasing the number of layers can enhance the model's overall capability. Furthermore, compared to InternLM-7B, the pre-training data used for InternLM-20B underwent higher-quality cleansing and was supplemented with data rich in knowledge and designed to reinforce understanding and reasoning capabilities. As a result, it exhibits significant improvements in understanding, reasoning, mathematical, and programming abilities, all of which test the technical proficiency of language models. Overall, InternLM-20B features the following characteristics:
- Outstanding overall performance
- Strong utility invocation capability
- Supports a 16k context length (through inference extrapolation)
- Better value alignment
#### Performance Comparison
On the 5 capability dimensions proposed by OpenCompass, InternLM-20B has achieved excellent results (bold scores represent the best performance within the 13B-33B parameter range).
| Capability | Llama-13B | Llama2-13B | Baichuan2-13B | InternLM-20B | Llama-33B | Llama-65B | Llama2-70B |
|----------|-----------|------------|---------------|--------------|-----------|-----------|------------|
| Language | 42.5 | 47 | 47.5 | **55** | 44.6 | 47.1 | 51.6 |
| Knowledge | 58.2 | 58.3 | 48.9 | 60.1 | **64** | 66 | 67.7 |
| Understanding | 45.5 | 50.9 | 58.1 | **67.3** | 50.6 | 54.2 | 60.8 |
| Reasoning | 42.7 | 43.6 | 44.2 | **54.9** | 46.4 | 49.8 | 55 |
| Examination | 37.3 | 45.2 | 51.8 | **62.5** | 47.4 | 49.7 | 57.3 |
| Overall | 43.8 | 47.3 | 49.4 | **59.2** | 48.9 | 51.9 | 57.4 |
The table below compares the performance of mainstream open-source models on some influential and typical datasets.
| | Benchmarks | Llama-13B | Llama2-13B | Baichuan2-13B | InternLM-20B | Llama-33B | Llama-65B | Llama2-70B |
|------|------------------|-----------|------------|---------------|--------------|-----------|-----------|------------|
| Examination | MMLU | 47.73 | 54.99 | 59.55 | **62.05** | 58.73 | 63.71 | 69.75 |
| | C-Eval (val) | 31.83 | 41.4 | **59.01** | 58.8 | 37.47 | 40.36 | 50.13 |
| | AGI-Eval | 22.03 | 30.93 | 37.37 | **44.58** | 33.53 | 33.92 | 40.02 |
| Knowledge | BoolQ | 78.75 | 82.42 | 67 | **87.46** | 84.43 | 86.61 | 87.74 |
| | TriviaQA | 52.47 | 59.36 | 46.61 | 57.26 | **66.24** | 69.79 | 70.71 |
| | NaturalQuestions | 20.17 | 24.85 | 16.32 | 25.15 | **30.89** | 33.41 | 34.16 |
| Understanding | CMRC | 9.26 | 31.59 | 29.85 | **68.78** | 14.17 | 34.73 | 43.74 |
| | CSL | 55 | 58.75 | 63.12 | **65.62** | 57.5 | 59.38 | 60 |
| | RACE (middle) | 53.41 | 63.02 | 68.94 | **86.35** | 64.55 | 72.35 | 81.55 |
| | RACE (high) | 47.63 | 58.86 | 67.18 | **83.28** | 62.61 | 68.01 | 79.93 |
| | XSum | 20.37 | 23.37 | 25.23 | **35.54** | 20.55 | 19.91 | 25.38 |
| Reasoning | WinoGrande | 64.64 | 64.01 | 67.32 | **69.38** | 66.85 | 69.38 | 69.77 |
| | BBH | 37.93 | 45.62 | 48.98 | **52.51** | 49.98 | 58.38 | 64.91 |
| | GSM8K | 20.32 | 29.57 | **52.62** | **52.62** | 42.3 | 54.44 | 63.31 |
| | PIQA | 79.71 | 79.76 | 78.07 | 80.25 | **81.34** | 82.15 | 82.54 |
| Programming | HumanEval | 14.02 | 18.9 | 17.07 | **25.61** | 17.68 | 18.9 | 26.22 |
| | MBPP | 20.6 | 26.8 | 30.8 | **35.6** | 28.4 | 33.6 | 39.6 |
Overall, InternLM-20B comprehensively outperforms open-source models in the 13B parameter range in terms of overall capabilities, and on reasoning evaluation sets it approaches or even surpasses the performance of Llama-65B.
- The evaluation results were obtained from [OpenCompass 20230920](https://github.com/internLM/OpenCompass/).
- The evaluation data may have numerical differences due to the version iteration of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results of [OpenCompass](https://github.com/internLM/OpenCompass/).
</details>
<details>
<summary> InternLM-7B </summary>
#### Model Updates
#### Introduction
InternLM-7B contains a 7-billion-parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:
- It leverages trillions of high-quality tokens for training to establish a powerful knowledge base.
- It supports an 8k context window length, enabling longer input sequences and stronger reasoning capabilities.
- It provides a versatile toolset for users to flexibly build their own workflows.
#### Performance Comparison
We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/), covering five capability dimensions: disciplinary competence, language competence, knowledge competence, reasoning competence, and comprehension competence. Some of the evaluation results are shown in the table below; you are welcome to visit the [OpenCompass leaderboard](https://opencompass.org.cn/rank) for more evaluation results.
| Datasets\Models | **InternLM-Chat-7B** | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
| --------------- | -------------------------- | --------------------- | -------- | ----------- | ----------- | --------- | --------- |
| C-Eval(Val) | 52.0 | 53.4 | 24.2 | 42.7 | 50.9 | 28.9 | 31.2 |
| MMLU | 52.6 | 51.0 | 35.2* | 41.5 | 46.0 | 39.7 | 47.3 |
| AGIEval | 46.4 | 37.6 | 20.8 | 24.6 | 39.0 | 24.1 | 26.4 |
| CommonSenseQA | 80.8 | 59.5 | 65.0 | 58.8 | 60.0 | 68.7 | 66.7 |
| BUSTM | 80.6 | 50.6 | 48.5 | 51.3 | 55.0 | 48.8 | 62.5 |
| CLUEWSC | 81.8 | 59.1 | 50.3 | 52.8 | 59.8 | 50.3 | 52.2 |
| MATH | 5.0 | 7.1 | 2.8 | 3.0 | 6.6 | 2.2 | 2.8 |
| GSM8K | 36.2 | 31.2 | 10.1 | 9.7 | 29.2 | 6.0 | 15.3 |
| HumanEval | 15.9 | 10.4 | 14.0 | 9.2 | 9.2 | 9.2 | 11.0 |
| RACE(High) | 80.3 | 57.4 | 46.9* | 28.1 | 66.3 | 40.7 | 54.0 |
- The above evaluation results were obtained from [OpenCompass 20230706](https://github.com/internLM/OpenCompass/) (data marked with `*` are cited from the original papers); detailed testing configurations can be found in the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/).
- The evaluation data may have numerical differences due to the version iteration of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results of [OpenCompass](https://github.com/internLM/OpenCompass/).
**Limitations:** Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
</details>
## Usage Examples
### Import from Transformers
To load the InternLM model using Transformers, use the following code (the model name can be replaced with any other model):
```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True).cuda()
>>> model = model.eval()
>>> response, history = model.chat(tokenizer, "Hello", history=[])
>>> print(response)
Hello! Is there anything I can help you with?
>>> response, history = model.chat(tokenizer, "Please provide three suggestions about time management.", history=history)
>>> print(response)
Sure! Here are three suggestions for managing your time:
1. Make a plan: Draw up a detailed plan covering the tasks and activities to complete each day. This helps you organize your time better and ensures you finish tasks on schedule.
2. Prioritize: Sort tasks by priority and finish the most important ones first. This ensures you complete the most important work in the shortest time and saves time overall.
3. Stay focused: Avoid distractions and concentrate on the task at hand. Turning off social media and email notifications helps you finish faster and reduces the chance of mistakes.
```
### Import from ModelScope
To load the InternLM model using ModelScope, use the following code (the model name can be replaced with any other model):
```python
from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
import torch
model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm-chat-7b', revision='v1.0.0')
tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", trust_remote_code=True,torch_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_dir,device_map="auto", trust_remote_code=True,torch_dtype=torch.float16)
model = model.eval()
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
print(response)
```
### Dialogue via Web Frontend
You can interact with the InternLM Chat 7B model through a frontend interface by running the following code:
```bash
pip install streamlit==1.24.0
pip install transformers==4.30.2
streamlit run web_demo.py
```
The effect is as follows
![demo](https://github.com/InternLM/InternLM/assets/9102141/11b60ee0-47e4-42c0-8278-3051b2f17fe4)
### High-performance Deployment Based on InternLM
We use [LMDeploy](https://github.com/InternLM/LMDeploy) to complete the one-click deployment of InternLM.
1. First, install LMDeploy:
```shell
python3 -m pip install lmdeploy
```
2. Chat with InternLM interactively from the command line on your local machine:
```shell
lmdeploy chat turbomind InternLM/internlm-chat-7b --model-name internlm-chat-7b
```
3. You can also start an inference server with the following command:
```shell
lmdeploy serve api_server InternLM/internlm-chat-7b --model-name internlm-chat-7b
```
Please refer to [this guide](https://github.com/InternLM/lmdeploy/blob/main/docs/en/restful_api.md) for detailed information about the api_server RESTful API. More deployment tutorials can be found [here](https://github.com/InternLM/LMDeploy).
## Fine-tuning & Training
### Pre-training and Fine-tuning Tutorial
Please refer to the [Usage Tutorial](./doc/usage.md) to get started with InternLM installation, data processing, pre-training, and fine-tuning.
### Convert to Transformers Format
The model trained by InternLM can be easily converted to the HuggingFace Transformers format, which allows seamless integration with the various open-source projects in the community. With the help of `tools/transformers/convert2hf.py`, the weights saved during training can be converted into the Transformers format with one command:
```bash
python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ./tools/V7_sft.model
```
After conversion, it can be loaded as transformers with the following code:
```python
>>> from transformers import AutoTokenizer, AutoModel
>>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
```
## Training System
### System Architecture
Please refer to the [System Architecture document](./doc/structure.md) for further details.
### Training Performance
InternLM deeply integrates Flash-Attention, Apex, and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternLM supports expanding the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternLM's scalability test data at different configurations:
| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
TGS represents the average number of tokens processed per GPU per second. For more performance test data, please refer to the [Training Performance document](./doc/train_performance.md) for further details.
## Contribution
We appreciate all the contributors for their efforts to improve and enhance InternLM. Community users are highly encouraged to participate in the project. Please refer to the contribution guidelines for instructions on how to contribute to the project.
## Acknowledgements
InternLM codebase is an open-source project contributed by Shanghai AI Laboratory and researchers from different universities and companies. We would like to thank all the contributors for their support in adding new features to the project and the users for providing valuable feedback. We hope that this toolkit and benchmark can provide the community with flexible and efficient code tools for fine-tuning InternLM and developing their own models, thus continuously contributing to the open-source community. Special thanks to the two open-source projects, [flash-attention](https://github.com/HazyResearch/flash-attention) and [ColossalAI](https://github.com/hpcaitech/ColossalAI).
## License
The code in this repository is open-sourced under the Apache-2.0 license. Model weights are fully open for academic research, and free commercial usage is also allowed after filling in the [application form](https://wj.qq.com/s2/12725412/f7c1/). For other questions or collaborations, please contact <internlm@pjlab.org.cn>.
## Citation
```
@misc{2023internlm,
title={InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities},
author={InternLM Team},
howpublished = {\url{https://github.com/InternLM/InternLM}},
year={2023}
}
```

README.md

@@ -2,7 +2,7 @@
<div align="center">
<img src="./doc/imgs/logo.svg" width="200"/>
<img src="./assets/logo.svg" width="200"/>
<div> </div>
<div align="center">
<b><font size="5">InternLM</font></b>
@@ -14,21 +14,19 @@
<div> </div>
</div>
[![license](./doc/imgs/license.svg)](./LICENSE)
[![evaluation](./doc/imgs/compass_support.svg)](https://github.com/internLM/OpenCompass/)
[![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest)
[📘Usage](./doc/en/usage.md) |
[🛠Installation](./doc/en/install.md) |
[📊Train Performance](./doc/en/train_performance.md) |
[👀Model](#model-zoo) |
[🤗HuggingFace](https://huggingface.co/spaces/internlm/InternLM-Chat-7B) |
[🆕Update News](./CHANGE_LOG.md) |
[![license](./assets/license.svg)](./LICENSE)
[![evaluation](./assets/compass_support.svg)](https://github.com/internLM/OpenCompass/)
<!-- [![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest) -->
[📘Chat](./chat) |
[🛠Agent](./agent) |
[📊Evaluation](./evaluation) |
[👀Model](./model_cards) |
[🤗HuggingFace](https://huggingface.co/spaces/internlm/internlm2-Chat-7B) |
[🆕Update News](#news) |
[🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new)
[English](./README.md) |
[简体中文](./README-zh-Hans.md) |
[日本語](./README-ja-JP.md)
[简体中文](./README_zh-CN.md) |
</div>
@@ -37,142 +35,64 @@
</p>
## Introduction
InternLM is an open-sourced lightweight training framework that aims to support model pre-training without the need for extensive dependencies. With a single codebase, it supports pre-training on large-scale clusters with thousands of GPUs, and fine-tuning on a single GPU while achieving remarkable performance optimizations. InternLM achieves nearly 90% acceleration efficiency during training on 1024 GPUs.
Based on the InternLM training framework, we have released two open-sourced pretrained models: InternLM-7B and InternLM-20B.
The InternLM2 series is released with the following features:
- **200K Context window**: Nearly perfect at finding needles in the haystack with 200K-long context, with leading performance on long-context tasks like LongBench and L-Eval. Try it with [LMDeploy](./inference/) for 200K-context inference.
- **Outstanding comprehensive performance**: Significantly better than the last generation in all dimensions, especially in reasoning, math, code, chat experience, instruction following, and creative writing, with leading performance among open-source models in similar sizes. In some evaluations, InternLM2-Chat-20B may match or even surpass ChatGPT (GPT-3.5).
- **Code interpreter & Data analysis**: With a code interpreter, InternLM2-Chat-20B achieves performance comparable to GPT-4 on GSM8K and MATH. InternLM2-Chat also provides data analysis capability.
- **Stronger tool use**: Based on better tool utilization-related capabilities in instruction following, tool selection and reflection, InternLM2 can support more kinds of agents and multi-step tool calling for complex tasks. See [examples](./agent/).
## News
[20231213] InternLM-7B-Chat and InternLM-20B-Chat checkpoints are updated. With an improved finetuning strategy, the new chat models can generate higher quality responses with greater stylistic diversity.
[20230920] InternLM-20B is released with base and chat versions.
[2024.01.17] We release InternLM2-7B and InternLM2-20B and their corresponding chat models with stronger capabilities in all dimensions. See [model zoo below](#model-zoo) for download or [model cards](./model_cards/) for more details.
[2023.12.13] InternLM-7B-Chat and InternLM-20B-Chat checkpoints are updated. With an improved finetuning strategy, the new chat models can generate higher quality responses with greater stylistic diversity.
[2023.09.20] InternLM-20B is released with base and chat versions.
## Model Zoo
Our models are released on three platforms: Transformers, ModelScope, and OpenXLab.
- There are two kinds of model weights:
  1. HuggingFace type (marked as HF)
  2. Original model weights (marked as Original), provided in OpenXLab, which can be loaded by InternLM and finetuned directly.
| Model | Transformers(HF) | ModelScope(HF) | OpenXLab(HF) | Release Date |
|---------------------------|------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|
| **InternLM2 Chat 20B** | [🤗internlm/internlm-chat-20b](https://huggingface.co/internlm/internlm2-chat-20b) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-chat-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-chat-20b) | 2024-01-17 |
| **InternLM2 20B** | [🤗internlm/internlm2-20b](https://huggingface.co/internlm/internlm2-20b) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-20b) | 2024-01-17 |
| **InternLM2 Chat 20B SFT** | [🤗internlm/internlm-chat-20b-sft](https://huggingface.co/internlm/internlm2-chat-20b-sft) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-chat-20b-sft](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-20b-sft/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-chat-20b-sft) | 2024-01-17 |
| **InternLM2 Base 20B** | [🤗internlm/internlm2-base-20b](https://huggingface.co/internlm/internlm2-base-20b) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-base-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-base-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-base-20b) | 2024-01-17 |
| **InternLM2 Chat 7B** | [🤗internlm/internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-chat-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-chat-7b) | 2024-01-17 |
| **InternLM2 7B** | [🤗internlm/internlm2-7b](https://huggingface.co/internlm/internlm2-7b) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-7b) | 2024-01-17 |
| **InternLM2 Chat 7B SFT** | [🤗internlm/internlm2-chat-7b-sft](https://huggingface.co/internlm/internlm2-chat-7b-sft) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-chat-7b-sft](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-7b-sft/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-chat-7b-sft) | 2024-01-17 |
| **InternLM2 Base 7B** | [🤗internlm/internlm2-base-7b](https://huggingface.co/internlm/internlm2-base-7b) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-base-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-base-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-base-7b) | 2024-01-17 |
| Model | Transformers(HF) | ModelScope(HF) | OpenXLab(HF) | OpenXLab(Original) | Release Date |
|---------------------------|------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|
| **InternLM Chat 20B** | [🤗internlm/internlm-chat-20b](https://huggingface.co/internlm/internlm-20b-chat) | [<img src="./doc/imgs/modelscope_logo.png" width="20px" /> Shanghai_AI_Laboratory/internlm-chat-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-20b-chat/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-20b) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-20b-original) | 2023-12-12 |
| **InternLM 20B** | [🤗internlm/internlm-20b](https://huggingface.co/internlm/internlm-20b) | [<img src="./doc/imgs/modelscope_logo.png" width="20px" /> Shanghai_AI_Laboratory/internlm-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-20b) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-20b-original) | 2023-09-20 |
| **InternLM Chat 7B** | [🤗internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) | [<img src="./doc/imgs/modelscope_logo.png" width="20px" /> Shanghai_AI_Laboratory/internlm-chat-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b-original) | 2023-12-12 |
| **InternLM 7B** | [🤗internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) | [<img src="./doc/imgs/modelscope_logo.png" width="20px" /> Shanghai_AI_Laboratory/internlm-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-7b) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-7b-original) | 2023-07-06 |
**Note:**
#### Introduction
InternLM-20B was pre-trained on over **2.3T** Tokens containing high-quality English, Chinese, and code data. Additionally, the Chat version has undergone SFT and RLHF training, enabling it to better and more securely meet users' needs.
In terms of model structure, InternLM-20B opted for a deeper architecture, with a depth set at 60 layers. This surpasses the conventional 7B and 13B models that utilize 32 or 40 layers. When parameters are limited, increasing the number of layers can enhance the model's overall capability. Furthermore, compared to InternLM-7B, the pre-training data used for InternLM-20B underwent higher quality cleansing and was supplemented with data rich in knowledge and designed for reinforcing understanding and reasoning capabilities. As a result, it exhibits significant improvements in understanding, reasoning, mathematical, and programming abilities—all of which test the technical proficiency of language models. Overall, InternLM-20B features the following characteristics:
- Outstanding overall performance
- Strong utility invocation capability
- Supports a 16k context length (Through inference extrapolation)
- Better value alignment.
#### Performance Evaluation
On the 5 capability dimensions proposed by OpenCompass, InternLM-20B has achieved excellent results (the bolded scores represent the best performances within the 13B-33B parameter range).
| Capability | Llama-13B | Llama2-13B | Baichuan2-13B | InternLM-20B | Llama-33B | Llama-65B | Llama2-70B |
|----------|-----------|------------|---------------|--------------|-----------|-----------|------------|
| Language | 42.5 | 47 | 47.5 | **55** | 44.6 | 47.1 | 51.6 |
| Knowledge | 58.2 | 58.3 | 48.9 | 60.1 | **64** | 66 | 67.7 |
| Understanding | 45.5 | 50.9 | 58.1 | **67.3** | 50.6 | 54.2 | 60.8 |
| Reasoning | 42.7 | 43.6 | 44.2 | **54.9** | 46.4 | 49.8 | 55 |
| Examination | 37.3 | 45.2 | 51.8 | **62.5** | 47.4 | 49.7 | 57.3 |
| Overall | 43.8 | 47.3 | 49.4 | **59.2** | 48.9 | 51.9 | 57.4 |
The table below compares the performance of mainstream open-source models on some influential and typical datasets.
| | Benchmarks | Llama-13B | Llama2-13B | Baichuan2-13B | InternLM-20B | Llama-33B | Llama-65B | Llama2-70B |
|------|------------------|-----------|------------|---------------|--------------|-----------|-----------|------------|
| Examination | MMLU | 47.73 | 54.99 | 59.55 | **62.05** | 58.73 | 63.71 | 69.75 |
| | C-Eval (val) | 31.83 | 41.4 | **59.01** | 58.8 | 37.47 | 40.36 | 50.13 |
| | AGI-Eval | 22.03 | 30.93 | 37.37 | **44.58** | 33.53 | 33.92 | 40.02 |
| Knowledge | BoolQ | 78.75 | 82.42 | 67 | **87.46** | 84.43 | 86.61 | 87.74 |
| | TriviaQA | 52.47 | 59.36 | 46.61 | 57.26 | **66.24** | 69.79 | 70.71 |
| | NaturalQuestions | 20.17 | 24.85 | 16.32 | 25.15 | **30.89** | 33.41 | 34.16 |
| Understanding | CMRC | 9.26 | 31.59 | 29.85 | **68.78** | 14.17 | 34.73 | 43.74 |
| | CSL | 55 | 58.75 | 63.12 | **65.62** | 57.5 | 59.38 | 60 |
| | RACE (middle) | 53.41 | 63.02 | 68.94 | **86.35** | 64.55 | 72.35 | 81.55 |
| | RACE (high) | 47.63 | 58.86 | 67.18 | **83.28** | 62.61 | 68.01 | 79.93 |
| | XSum | 20.37 | 23.37 | 25.23 | **35.54** | 20.55 | 19.91 | 25.38 |
| Reasoning | WinoGrande | 64.64 | 64.01 | 67.32 | **69.38** | 66.85 | 69.38 | 69.77 |
| | BBH | 37.93 | 45.62 | 48.98 | **52.51** | 49.98 | 58.38 | 64.91 |
| | GSM8K | 20.32 | 29.57 | **52.62** | **52.62** | 42.3 | 54.44 | 63.31 |
| | PIQA | 79.71 | 79.76 | 78.07 | 80.25 | **81.34** | 82.15 | 82.54 |
| Programming | HumanEval | 14.02 | 18.9 | 17.07 | **25.61** | 17.68 | 18.9 | 26.22 |
| | MBPP | 20.6 | 26.8 | 30.8 | **35.6** | 28.4 | 33.6 | 39.6 |
Overall, InternLM-20B comprehensively outperforms open-source models in the 13B parameter range in terms of overall capabilities, and on inference evaluation sets, it approaches or even surpasses the performance of Llama-65B.
- The evaluation results were obtained from [OpenCompass 20230920](https://github.com/internLM/OpenCompass/).
- The evaluation data may have numerical differences due to the version iteration of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results of [OpenCompass](https://github.com/internLM/OpenCompass/).
</details>
<details>
<summary> InternLM-7B </summary>
#### News
#### Introduction
InternLM-7B contains a 7 billion parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:
- It leverages trillions of high-quality tokens for training to establish a powerful knowledge base.
- It supports an 8k context window length, enabling longer input sequences and stronger reasoning capabilities.
- It provides a versatile toolset for users to flexibly build their own workflows.
#### Performance Evaluation
We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/). The evaluation covered five dimensions of capabilities: disciplinary competence, language competence, knowledge competence, inference competence, and comprehension competence. Here are some of the evaluation results, and you can visit the [OpenCompass leaderboard](https://opencompass.org.cn/rank) for more evaluation results.
| Datasets\Models | **InternLM-Chat-7B** | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
| --------------- | -------------------------- | --------------------- | -------- | ----------- | ----------- | --------- | --------- |
| C-Eval(Val) | 52.0 | 53.4 | 24.2 | 42.7 | 50.9 | 28.9 | 31.2 |
| MMLU | 52.6 | 51.0 | 35.2* | 41.5 | 46.0 | 39.7 | 47.3 |
| AGIEval | 46.4 | 37.6 | 20.8 | 24.6 | 39.0 | 24.1 | 26.4 |
| CommonSenseQA | 80.8 | 59.5 | 65.0 | 58.8 | 60.0 | 68.7 | 66.7 |
| BUSTM | 80.6 | 50.6 | 48.5 | 51.3 | 55.0 | 48.8 | 62.5 |
| CLUEWSC | 81.8 | 59.1 | 50.3 | 52.8 | 59.8 | 50.3 | 52.2 |
| MATH | 5.0 | 7.1 | 2.8 | 3.0 | 6.6 | 2.2 | 2.8 |
| GSM8K | 36.2 | 31.2 | 10.1 | 9.7 | 29.2 | 6.0 | 15.3 |
| HumanEval | 15.9 | 10.4 | 14.0 | 9.2 | 9.2 | 9.2 | 11.0 |
| RACE(High) | 80.3 | 57.4 | 46.9* | 28.1 | 66.3 | 40.7 | 54.0 |
- The evaluation results were obtained from [OpenCompass 20230706](https://github.com/internLM/OpenCompass/) (some data marked with *, which means come from the original papers), and evaluation configuration can be found in the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/).
- The evaluation data may have numerical differences due to the version iteration of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results of [OpenCompass](https://github.com/internLM/OpenCompass/).
</details>
1. For chat models, InternLM2 Chat 7/20B have gone through online RLHF for better alignment and are recommended for downstream applications. We also released InternLM2 Chat 7/20B SFT, the intermediate checkpoints that have only gone through SFT and were used in RLHF to obtain InternLM2 Chat 7/20B. InternLM2 Chat 7/20B are trained from InternLM2 Base 7/20B.
2. For base models, InternLM2 7/20B are further trained from InternLM2 Base 7/20B and are recommended for fast adaptation to downstream applications.
**Limitations:** Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
## Usage Examples
## Usages
We briefly show the usages with [Transformers](#import-from-transformers), [ModelScope](#import-from-modelscope), and [Web demos](#dialogue).
The chat models adopt [chatml format](./chat/chat_format.md) to support both chat and agent applications.
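For readers who want to see what that chatml layout looks like concretely, below is a minimal, illustrative sketch that assembles a prompt string by hand; the exact special tokens, default system prompt, and decoding settings used by InternLM2-Chat are defined in [chat_format.md](./chat/chat_format.md), which should be treated as the source of truth.
```python
# Illustrative only: a chatml-style prompt builder. Check ./chat/chat_format.md
# for the authoritative token names and system prompt used by InternLM2-Chat.
def build_chatml_prompt(messages):
    """messages: list of (role, content) pairs, e.g. [("user", "hello")]."""
    parts = [f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in messages]
    parts.append("<|im_start|>assistant\n")  # leave the assistant turn open for generation
    return "".join(parts)

print(build_chatml_prompt([
    ("system", "You are a helpful assistant."),
    ("user", "please provide three suggestions about time management"),
]))
```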
### Import from Transformers
To load the InternLM 7B Chat model using Transformers, use the following code:
To load the InternLM2 7B Chat model using Transformers, use the following code:
```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True).cuda()
>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True).cuda()
>>> model = model.eval()
>>> response, history = model.chat(tokenizer, "hello", history=[])
>>> print(response)
Hello! How can I help you today?
>>> response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
>>> print(response)
Sure, here are three tips for effective time management:
1. Prioritize tasks based on importance and urgency: Make a list of all your tasks and categorize them into "important and urgent," "important but not urgent," and "not important but urgent." Focus on completing the tasks in the first category before moving on to the others.
2. Use a calendar or planner: Write down deadlines and appointments in a calendar or planner so you don't forget them. This will also help you schedule your time more effectively and avoid overbooking yourself.
3. Minimize distractions: Try to eliminate any potential distractions when working on important tasks. Turn off notifications on your phone, close unnecessary tabs on your computer, and find a quiet place to work if possible.
Remember, good time management skills take practice and patience. Start with small steps and gradually incorporate these habits into your daily routine.
```
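Since `chat` returns the updated `history`, a simple multi-turn loop can be built by feeding that history back in. This is a minimal sketch reusing the `model` and `tokenizer` objects from the snippet above; the prompt handling is illustrative rather than part of any official API.
```python
# Minimal multi-turn loop on top of the chat() helper shown above.
history = []
while True:
    query = input("user> ").strip()
    if query.lower() in {"exit", "quit", ""}:
        break
    response, history = model.chat(tokenizer, query, history=history)
    print(f"assistant> {response}")
```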
### Import from ModelScope
@@ -182,7 +102,7 @@ To load the InternLM model using ModelScope, use the following code:
```python
from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
import torch
model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm-chat-7b', revision='v1.0.0')
model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2-chat-7b')
tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", trust_remote_code=True,torch_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_dir,device_map="auto", trust_remote_code=True,torch_dtype=torch.float16)
model = model.eval()
@@ -199,82 +119,40 @@ You can interact with the InternLM Chat 7B model through a frontend interface by
```bash
pip install streamlit==1.24.0
pip install transformers==4.30.2
streamlit run web_demo.py
streamlit run ./chat/web_demo.py
```
The effect is as follows
The effect is similar to the following:
![demo](https://github.com/InternLM/InternLM/assets/9102141/11b60ee0-47e4-42c0-8278-3051b2f17fe4)
### Deployment
We use [LMDeploy](https://github.com/InternLM/LMDeploy) to complete the one-click deployment of InternLM.
1. First, install LMDeploy:
We use [LMDeploy](https://github.com/InternLM/LMDeploy) for fast deployment of InternLM.
```shell
# install LMDeploy
python3 -m pip install lmdeploy
# chat with internlm2
lmdeploy chat turbomind InternLM/internlm2-chat-7b --model-name internlm2-chat-7b
```
2. Use the following command for interactive communication with the `internlm-chat-7b` model on localhost:
Please refer to the [guidance](./chat/lmdeploy.md) for more details about model deployment. For additional deployment tutorials, feel free to explore [here](https://github.com/InternLM/LMDeploy).
```shell
lmdeploy chat turbomind InternLM/internlm-chat-7b --model-name internlm-chat-7b
```
## Agent
3. Besides chatting via command line, you can start lmdeploy `api_server` as below:
InternLM2-Chat models have excellent tool utilization capabilities and can work with function calls in a zero-shot manner. See more examples in the [agent section](./agent/).
```shell
lmdeploy serve api_server InternLM/internlm-chat-7b --model-name internlm-chat-7b
```
For a comprehensive understanding of the `api_server` RESTful API, kindly consult [this](https://github.com/InternLM/lmdeploy/blob/main/docs/en/restful_api.md) guide. For additional deployment tutorials, feel free to explore [here](https://github.com/InternLM/LMDeploy).
## Fine-tuning
## Fine-tuning & Training
Please refer to [finetune docs](./finetune/) for fine-tuning with InternLM.
### Pre-training and Fine-tuning Tutorial
Please refer to [Usage Tutorial](./doc/en/usage.md) to start InternLM installation, data processing, pre-training and fine-tuning.
### Convert to Transformers Format
The model trained by InternLM can be easily converted to the HuggingFace Transformers format, which allows seamless integration with the various open-source projects in the community. With the help of `tools/transformers/convert2hf.py`, the weights saved during training can be converted into the Transformers format with one command:
```bash
python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ./tools/V7_sft.model
```
After conversion, it can be loaded as transformers by the following code
```python
>>> from transformers import AutoTokenizer, AutoModel
>>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
```
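To sanity-check a converted checkpoint end to end, you can also load a tokenizer and run a short generation with the standard Transformers API. The sketch below assumes the tokenizer files sit alongside the converted weights in `hf_ckpt/`; if they do not, point `AutoTokenizer.from_pretrained` at the original tokenizer path instead.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("hf_ckpt/", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
model = model.eval()

# Greedy decoding of a short continuation as a quick smoke test.
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```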
## Training System
### System Architecture
Please refer to the [System Architecture document](./doc/en/structure.md) for further details.
### Training Performance
InternLM deeply integrates Flash-Attention, Apex and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternLM supports expanding the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternLM's scalability test data at different configurations:
| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
TGS represents the average number of tokens processed per GPU per second. For more performance test data, please refer to the [Training Performance document](./doc/en/train_performance.md) for further details.
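The acceleration efficiency quoted above can be checked directly against the TGS column of this table by comparing per-GPU throughput at 1024 GPUs with the 8-GPU baseline; a quick arithmetic sketch:
```python
# Scaling efficiency from the TGS column above (tokens per GPU per second).
tgs_8, tgs_1024 = 4078, 3625
efficiency = tgs_1024 / tgs_8                  # per-GPU throughput retained at scale
cluster_tokens_per_sec = tgs_1024 * 1024       # aggregate cluster throughput
print(f"acceleration efficiency at 1024 GPUs: {efficiency:.1%}")        # ~88.9%
print(f"aggregate throughput: {cluster_tokens_per_sec / 1e6:.2f}M tokens/s")
```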
**Note:** We have migrated the whole training functionality in this project to [InternEvo](https://github.com/InternLM/InternEvo) for easier user experience, which provides efficient pre-training and fine-tuning infra for training InternLM.
## Contribution
We appreciate all the contributors for their efforts to improve and enhance InternLM. Community users are highly encouraged to participate in the project. Please refer to the contribution guidelines for instructions on how to contribute to the project.
## Acknowledgements
InternLM codebase is an open-source project contributed by Shanghai AI Laboratory and researchers from different universities and companies. We would like to thank all the contributors for their support in adding new features to the project and the users for providing valuable feedback. We hope that this toolkit and benchmark can provide the community with flexible and efficient code tools for fine-tuning InternLM and developing their own models, thus continuously contributing to the open-source community. Special thanks to the two open-source projects, [flash-attention](https://github.com/HazyResearch/flash-attention) and [ColossalAI](https://github.com/hpcaitech/ColossalAI).
## License
The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow **free** commercial usage. To apply for a commercial license, please fill in the [application form (English)](https://wj.qq.com/s2/12727483/5dba/)/[申请表(中文)](https://wj.qq.com/s2/12725412/f7c1/). For other questions or collaborations, please contact <internlm@pjlab.org.cn>.

README_zh-CN.md

@@ -0,0 +1,158 @@
# InternLM
<div align="center">
<img src="./assets//logo.svg" width="200"/>
<div>&nbsp;</div>
<div align="center">
<b><font size="5">书生·浦语 官网</font></b>
<sup>
<a href="https://internlm.intern-ai.org.cn/">
<i><font size="4">HOT</font></i>
</a>
</sup>
<div>&nbsp;</div>
</div>
[![license](./assets//license.svg)](https://github.com/open-mmlab/mmdetection/blob/main/LICENSE)
[![evaluation](./assets//compass_support.svg)](https://github.com/internLM/OpenCompass/)
<!-- [![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest) -->
[📘Chat Tutorial](./chat) |
[🛠️Agent Tutorial](./agent) |
[📊Evaluation](./evaluation) |
[👀Model Zoo](./model_cards) |
[🤗HuggingFace](https://huggingface.co/spaces/internlm/internlm2-Chat-7B) |
[🆕Update News](#news) |
[🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new)
[English](./README.md) |
[简体中文](./README_zh-CN.md) |
</div>
<p align="center">
👋 Join us on <a href="https://discord.gg/xa29JuW87d" target="_blank">Discord</a> and <a href="https://github.com/InternLM/InternLM/assets/25839884/a6aad896-7232-4220-ac84-9e070c2633ce" target="_blank">WeChat</a>
</p>
## Introduction
The InternLM2 series models are officially released in this repository, with the following features:
- **200K context window**: The models effectively support an ultra-long 200K-character context, achieving near-perfect "needle in a haystack" retrieval over 200K-character inputs and reaching a leading level among open-source models on long-text tasks such as LongBench and L-Eval. You can try 200K-context inference with [LMDeploy](./inference/).
- **Comprehensive performance improvements**: All capability dimensions have improved over the previous generation, with especially notable gains in reasoning, math, code, chat experience, instruction following, and creative writing. Overall performance reaches a leading level among open-source models of a similar size, and on key capability evaluations InternLM2-Chat-20B can match or even surpass ChatGPT (GPT-3.5).
- **Code interpreter & data analysis**: When paired with a code interpreter, InternLM2-Chat-20B can reach a level comparable to GPT-4 on GSM8K and MATH. Building on its strong foundation in math and tool use, InternLM2-Chat also provides practical data analysis capabilities.
- **Upgraded tool calling**: Based on stronger and more generalizable instruction understanding, tool selection, and result reflection, the new models can more reliably support the construction of complex agents and can make effective multi-turn tool calls to complete relatively complex tasks. See more [examples](./agent/).
## News
[2024.01.17] We have released InternLM2-7B and InternLM2-20B along with their corresponding chat models. InternLM2 has made substantial progress in math, code, chat, and creative writing, with overall performance reaching a leading level among open-source models. See the [model zoo below](#model-zoo) for downloads or the [model cards](./model_cards/) for more details.
[2023.12.13] We have updated the InternLM-7B-Chat and InternLM-20B-Chat model weights. With improved finetuning data and training strategy, the new chat models can generate higher-quality responses with greater stylistic diversity.
[2023.09.20] InternLM-20B is released, including base and chat versions.
## Model Zoo
| Model | Transformers(HF) | ModelScope(HF) | OpenXLab(HF) | Release Date |
|---------------------------|------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|
| **InternLM2 Chat 20B** | [🤗internlm/internlm-chat-20b](https://huggingface.co/internlm/internlm2-chat-20b) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-chat-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-chat-20b) | 2024-01-17 |
| **InternLM2 20B** | [🤗internlm/internlm2-20b](https://huggingface.co/internlm/internlm2-20b) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-20b) | 2024-01-17 |
| **InternLM2 Chat 20B SFT** | [🤗internlm/internlm-chat-20b-sft](https://huggingface.co/internlm/internlm2-chat-20b-sft) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-chat-20b-sft](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-20b-sft/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-chat-20b-sft) | 2024-01-17 |
| **InternLM2 Base 20B** | [🤗internlm/internlm2-base-20b](https://huggingface.co/internlm/internlm2-base-20b) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-base-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-base-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-base-20b) | 2024-01-17 |
| **InternLM2 Chat 7B** | [🤗internlm/internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-chat-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-chat-7b) | 2024-01-17 |
| **InternLM2 7B** | [🤗internlm/internlm2-7b](https://huggingface.co/internlm/internlm2-7b) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-7b) | 2024-01-17 |
| **InternLM2 Chat 7B SFT** | [🤗internlm/internlm2-chat-7b-sft](https://huggingface.co/internlm/internlm2-chat-7b-sft) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-chat-7b-sft](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-7b-sft/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-chat-7b-sft) | 2024-01-17 |
| **InternLM2 Base 7B** | [🤗internlm/internlm2-base-7b](https://huggingface.co/internlm/internlm2-base-7b) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm2-base-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-base-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-base-7b) | 2024-01-17 |
## Usage Examples
Next, we show how to conduct inference with [Transformers](#import-from-transformers), [ModelScope](#import-from-modelscope), and the [Web demo](#dialogue).
The chat models adopt the [chatml format](./chat/chat_format.md) to support both general chat and agent applications.
### Import from Transformers
To load the InternLM model using Transformers, use the following code (the model name can be replaced with any other model):
```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True).cuda()
>>> model = model.eval()
>>> response, history = model.chat(tokenizer, "Hello", history=[])
>>> print(response)
Hello! Is there anything I can help you with?
>>> response, history = model.chat(tokenizer, "Please provide three suggestions about time management.", history=history)
>>> print(response)
```
### Import from ModelScope
To load the InternLM model using ModelScope, use the following code (the model name can be replaced with any other model):
```python
from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
import torch
model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2-chat-7b')
tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", trust_remote_code=True,torch_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_dir,device_map="auto", trust_remote_code=True,torch_dtype=torch.float16)
model = model.eval()
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
print(response)
```
### Dialogue
You can interact with the InternLM Chat 7B model through a web-based front end by running the following commands:
```bash
pip install streamlit==1.24.0
pip install transformers==4.30.2
streamlit run ./chat/web_demo.py
```
The demo looks like this:
![demo](https://github.com/InternLM/InternLM/assets/9102141/11b60ee0-47e4-42c0-8278-3051b2f17fe4)
### High-performance Deployment
We use [LMDeploy](https://github.com/InternLM/LMDeploy) to deploy InternLM with a single command.
```shell
python3 -m pip install lmdeploy
lmdeploy chat turbomind internlm/internlm2-chat-7b --model-name internlm2-chat-7b
```
Please refer to the [deployment guide](./chat/lmdeploy.md) for more usage examples; further deployment tutorials can be found [here](https://github.com/InternLM/LMDeploy).
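Besides the interactive CLI above, LMDeploy can also serve the model behind an OpenAI-compatible HTTP API. The exact command and flags depend on your LMDeploy version, so check `lmdeploy serve api_server --help` in your installation; the snippet below is only a sketch of querying such a server from Python under that assumption:
```python
# Sketch only: assumes an OpenAI-compatible server was started locally, e.g.
#   lmdeploy serve api_server internlm/internlm2-chat-7b --server-port 23333
# (flag names may differ across LMDeploy versions).
import requests

resp = requests.post(
    "http://127.0.0.1:23333/v1/chat/completions",
    json={
        "model": "internlm2-chat-7b",
        "messages": [{"role": "user", "content": "你好"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```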
## Fine-tuning & Training
Please refer to the [fine-tuning tutorials](./finetune/) to continue pre-training or fine-tune InternLM2.
**Note:** The full training functionality of this project has been migrated to [InternEvo](https://github.com/InternLM/InternEvo) for easier use. InternEvo provides efficient pre-training and fine-tuning infrastructure for training the InternLM series of models.
## Contribution
We appreciate all the contributors for their efforts to improve and enhance InternLM. Community users are highly encouraged to participate in the project. Please refer to the contribution guidelines for instructions on how to contribute.
## Acknowledgements
The InternLM codebase is an open-source project jointly developed by Shanghai AI Laboratory and researchers from various universities and companies. We thank all the contributors who have added new features to the project, as well as the users who have provided valuable feedback. We hope this toolbox and benchmark give the community flexible and efficient tools to fine-tune InternLM and develop their own models, thereby continually contributing to the open-source community. Special thanks to the two open-source projects [flash-attention](https://github.com/HazyResearch/flash-attention) and [ColossalAI](https://github.com/hpcaitech/ColossalAI).
## License
The code in this repository is open-sourced under the Apache-2.0 license. The model weights are fully open for academic research, and free commercial use can also be granted by filling in the [application form](https://wj.qq.com/s2/12725412/f7c1/). For other questions or collaborations, please contact <internlm@pjlab.org.cn>.
## Citation
```
@misc{2023internlm,
title={InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities},
author={InternLM Team},
howpublished = {\url{https://github.com/InternLM/InternLM}},
year={2023}
}
```


chat/README.md Normal file
View File

@ -0,0 +1,61 @@
# Chat
English | [简体中文](./README_zh-CN.md)
This document briefly shows how to use [Transformers](#import-from-transformers), [ModelScope](#import-from-modelscope), and [Web demos](#dialogue) to conduct inference with InternLM2-Chat.
You can also learn more about the [chatml format](./chat_format.md) and how to use [LMDeploy for inference and model serving](./lmdeploy.md).
## Import from Transformers
To load the InternLM2 7B Chat model using Transformers, use the following code:
```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True).cuda()
>>> model = model.eval()
>>> response, history = model.chat(tokenizer, "hello", history=[])
>>> print(response)
Hello! How can I help you today?
>>> response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
>>> print(response)
Sure, here are three tips for effective time management:
1. Prioritize tasks based on importance and urgency: Make a list of all your tasks and categorize them into "important and urgent," "important but not urgent," and "not important but urgent." Focus on completing the tasks in the first category before moving on to the others.
2. Use a calendar or planner: Write down deadlines and appointments in a calendar or planner so you don't forget them. This will also help you schedule your time more effectively and avoid overbooking yourself.
3. Minimize distractions: Try to eliminate any potential distractions when working on important tasks. Turn off notifications on your phone, close unnecessary tabs on your computer, and find a quiet place to work if possible.
Remember, good time management skills take practice and patience. Start with small steps and gradually incorporate these habits into your daily routine.
```
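For chat UIs you may want token-by-token streaming instead of waiting for the full reply. Recent revisions of the InternLM chat remote code also expose a `stream_chat` generator alongside `chat`; the sketch below assumes that API is available in the revision you download (check the model card if unsure):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm2-chat-7b", torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()

# stream_chat yields the accumulated response text after each generation step.
printed = ""
for response, history in model.stream_chat(tokenizer, "hello", history=[]):
    print(response[len(printed):], end="", flush=True)  # emit only the newly generated part
    printed = response
print()
```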
## Import from ModelScope
To load the InternLM model using ModelScope, use the following code:
```python
from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
import torch
model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2-chat-7b')
tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", trust_remote_code=True,torch_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_dir,device_map="auto", trust_remote_code=True,torch_dtype=torch.float16)
model = model.eval()
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
print(response)
```
## Dialogue
You can interact with the InternLM Chat 7B model through a frontend interface by running the following code:
```bash
pip install streamlit==1.24.0
pip install transformers==4.30.2
streamlit run ./chat/web_demo.py
```
The result looks like the following:
![demo](https://github.com/InternLM/InternLM/assets/9102141/11b60ee0-47e4-42c0-8278-3051b2f17fe4)

chat/README_zh-CN.md Normal file
View File

@ -0,0 +1,51 @@
# Chat
[English](./README.md) | 简体中文
This document shows how to run inference with InternLM2-Chat using [Transformers](#import-from-transformers), [ModelScope](#import-from-modelscope), and [Web demos](#dialogue).
You can also learn more about the [chat format](./chat_format_zh-CN.md) used by InternLM2-Chat, how to [use LMDeploy for inference and serving](./lmdeploy_zh-CN.md), or try chatting with multiple models via [OpenAOE](./openaoe.md).
## Import from Transformers
To load the InternLM model with Transformers, use the following code (change the model name to switch to a different model):
```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True).cuda()
>>> model = model.eval()
>>> response, history = model.chat(tokenizer, "你好", history=[])
>>> print(response)
你好!有什么我可以帮助你的吗?
>>> response, history = model.chat(tokenizer, "请提供三个管理时间的建议。", history=history)
>>> print(response)
```
## Import from ModelScope
To load the InternLM2-Chat model from ModelScope, use the following code (change the model name to switch to a different model):
```python
from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
import torch
model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2-chat-7b', revision='v1.0.0')
tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", trust_remote_code=True,torch_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_dir,device_map="auto", trust_remote_code=True,torch_dtype=torch.float16)
model = model.eval()
response, history = model.chat(tokenizer, "hello", history=[])
print(response)
response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
print(response)
```
## Dialogue
You can interact with the InternLM2 Chat 7B model through a web-based front end by running the following commands:
```bash
pip install streamlit==1.24.0
pip install transformers==4.30.2
streamlit run ./web_demo.py
```

chat/chat_format_zh-CN.md Normal file
View File

@ -0,0 +1,99 @@
# Chat Format
[English](chat_format.md) | 简体中文
InternLM2-Chat adopts a new conversation format to flexibly support a wider range of applications, such as tool invocation, while guarding against attacks embedded in user input. The new format is similar to [ChatML](https://github.com/openai/openai-python/blob/release-v0.28.0/chatml.md), but introduces an `environment` role in addition to `system`, `user`, and `assistant` to support general agent applications.
## Basic Structure
A regular conversation involves the three roles `system`, `user`, and `assistant`, and a multi-turn dialogue is laid out in the following format:
```
[UNUSED_TOKEN_146]system
你是书生浦语2一个无害的人工智能助手[UNUSED_TOKEN_145]
[UNUSED_TOKEN_146]user
你好呀[UNUSED_TOKEN_145]
[UNUSED_TOKEN_146]assistant
你好,我是书生浦语,请问有什么可以帮助你的吗[UNUSED_TOKEN_145]
```
Here `[UNUSED_TOKEN_146]` acts as the start token of every turn and `[UNUSED_TOKEN_145]` as the end token of the current turn. Each turn normally starts with `[UNUSED_TOKEN_146]role` and ends with `[UNUSED_TOKEN_145]` produced by the model, where role is one of `system`, `user`, `assistant`, and `environment`. The vocabulary of the InternLM2-Chat models currently also maintains the following mapping (a minimal prompt-construction sketch follows the list):
- `[UNUSED_TOKEN_146]`: start token of each role's turn
- `[UNUSED_TOKEN_145]`: end token of each role's turn
- `[UNUSED_TOKEN_144]`: start token of an external plugin call made by the model
- `[UNUSED_TOKEN_143]`: end token of an external plugin call made by the model
- `[UNUSED_TOKEN_142]`: code interpreter
- `[UNUSED_TOKEN_141]`: external plugins, i.e. regular tools
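For illustration only, the sketch below assembles a conversation string by hand from the delimiters listed above. It is not the official tokenizer or template code, and the helper name is made up for this example; production code should rely on the model's own chat template.
```python
# Hand-rolled illustration of the layout described above; not the official template code.
TURN_START = "[UNUSED_TOKEN_146]"  # opens a role's turn
TURN_END = "[UNUSED_TOKEN_145]"    # closes the current turn

def build_prompt(messages):
    """messages: list of {"role": ..., "content": ...} dicts, e.g. system/user turns."""
    parts = []
    for message in messages:
        parts.append(f"{TURN_START}{message['role']}\n{message['content']}{TURN_END}\n")
    parts.append(f"{TURN_START}assistant\n")  # leave the assistant turn open for generation
    return "".join(parts)

print(build_prompt([
    {"role": "system", "content": "你是书生·浦语2,一个无害的人工智能助手"},
    {"role": "user", "content": "你好呀"},
]))
```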
## Complete Structure
On top of the basic structure, the complete conversation format of InternLM2-Chat also includes designs for general agents. Its core idea is to use a streaming format so that the same format stays compatible with ordinary chat while supporting all kinds of plugin extensions and agent environments. A typical agent conversation looks like this:
```
[UNUSED_TOKEN_146]system
你是书生浦语2一个无害的人工智能助手[UNUSED_TOKEN_145]
[UNUSED_TOKEN_146]system name=[UNUSED_TOKEN_141]
[
{
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
"unit": {"type": "string"},
},
"required": ["location"],
},
}
]
[UNUSED_TOKEN_145]
[UNUSED_TOKEN_146]user
请帮我对该数据集进行数据处理并可视化。
[UNUSED_TOKEN_145]
[UNUSED_TOKEN_146]user name=file
{"path": "data.csv"}[UNUSED_TOKEN_145]
[UNUSED_TOKEN_146]assistant
我已经帮您处理了数据并进行了可视化。
[UNUSED_TOKEN_144][UNUSED_TOKEN_142]
```python
import plotly.express as px
import pandas as pd
# Load the data into a pandas dataframe
df = pd.read_csv('data.csv')
# Create a scatter plot of rainfall vs wind direction
fig = px.scatter(df, x='WindDir9am', y='Rainfall', color='WindDir3pm',
labels={'WindDir9am': 'Wind Direction 9am', 'Rainfall': 'Rainfall', 'WindDir3pm': 'Wind Direction 3pm'},
title='Rainfall vs Wind Direction',
template='plotly_dark',
width=600, height=500)
# Add a hover effect to show the date
fig.update_traces(hovertemplate='<b>Date: %{text}</b><br>Wind Direction 9am: %{x}<br>Rainfall: %{y}<br>Wind Direction 3pm: %{marker.color}')
# Show the plot
fig.show()
```[UNUSED_TOKEN_143]
[UNUSED_TOKEN_145]
[UNUSED_TOKEN_146]environment name=[UNUSED_TOKEN_142]
[UNUSED_TOKEN_145]
[UNUSED_TOKEN_146]assistant
由于代码输出没有包含在这里我无法提供具体的图表内容。但是如果代码执行无误它会显示一个极坐标图其中包含两个填充区域一个代表上午9点的风向与降雨量的关系另一个代表下午3点的风向与降雨量的关系。每个方向上的值是根据提供的数据集计算得到的平均降雨量。图表应该有一个清晰的标题、图例并且能够直观地比较不同风向下的降雨量。由于使用了暗色主题图表的整体外观应该是深色背景上的亮色线条和填充区域。[UNUSED_TOKEN_145]
[UNUSED_TOKEN_146]user
你能解释一下代码吗[UNUSED_TOKEN_145]
[UNUSED_TOKEN_146]assistant
我使用的是Python的Plotly库来创建一个极坐标图以展示风向与降雨量之间的关系。代码中首先创建了一个Plotly的Figure对象。接着我添加了两个`Scatterpolar`追踪。第一个追踪展示的是上午9点的风向与降雨量的平均值的关系第二个追踪展示的是下午3点的风向与降雨量的平均值的关系。在这两个追踪中`r`参数代表降雨量的平均值,`theta`参数代表风向。`fill='toself'`表示图形以自身为边界进行填充。然后我更新了图表的布局,隐藏了径向轴,添加了图例,设置了标题和标题字体,选择了暗色主题,并设定了图表的背景色、宽度和高度。最后,使用`fig.show()`展示了图表。
[UNUSED_TOKEN_145]
[UNUSED_TOKEN_146]user
我想了解今天上海的天气[UNUSED_TOKEN_145]
[UNUSED_TOKEN_144][UNUSED_TOKEN_141]
{"name": "get_current_weather", "parameters": {"location": "上海"}}[UNUSED_TOKEN_143]
[UNUSED_TOKEN_145]
```
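When the model decides to call a plugin, as in the last turn of the example above, the call is wrapped between the plugin start and end tokens and carried as a JSON object. Below is a small parsing sketch based on that layout; the marker strings come from the mapping listed earlier, while the helper itself is hypothetical and not part of this repository.
```python
import json
import re

ACTION_START = "[UNUSED_TOKEN_144]"  # model starts an external call
ACTION_END = "[UNUSED_TOKEN_143]"    # model ends the external call
PLUGIN_FLAG = "[UNUSED_TOKEN_141]"   # regular tool / plugin call

def extract_tool_call(model_output: str):
    """Return the parsed plugin call as a dict, or None if the output contains no call."""
    pattern = re.escape(ACTION_START) + re.escape(PLUGIN_FLAG) + r"\s*(.*?)" + re.escape(ACTION_END)
    match = re.search(pattern, model_output, re.DOTALL)
    return json.loads(match.group(1)) if match else None

sample_output = (
    "[UNUSED_TOKEN_144][UNUSED_TOKEN_141]\n"
    '{"name": "get_current_weather", "parameters": {"location": "上海"}}[UNUSED_TOKEN_143]\n'
)
print(extract_tool_call(sample_output))
# -> {'name': 'get_current_weather', 'parameters': {'location': '上海'}}
```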

View File

@@ -73,8 +73,8 @@ def main():
     model, tokenizer = load_model()
     print("load model end.")
-    user_avator = "doc/imgs/user.png"
-    robot_avator = "doc/imgs/robot.png"
+    user_avator = "docs/imgs/user.png"
+    robot_avator = "docs/imgs/robot.png"
     st.title("InternLM-Chat-7B")

View File

@ -1,18 +0,0 @@
#!/bin/bash
#######################################
# Calculate the number of files in a directory.
# Call this function like this: num_files "${file_path}".
# Globals:
# None
# Arguments:
# $1: the directory path
# Returns:
# the number of files in the directory
#######################################
num_files() {
[[ $# -eq 1 ]] || return 1
local file_num
file_num=$(ls -l $1 | grep '^-' | wc -l)
echo $file_num
}

View File

@ -1,29 +0,0 @@
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
def merge_dicts(dict_a: dict, dict_b: dict):
for key in dict_b.keys():
if isinstance(dict_b[key], dict):
dict_b[key] = {**dict_a[key], **dict_b[key]}
merge_dicts(dict_a[key], dict_b[key])
dict_c = {**dict_a, **dict_b}
return dict_c
def format_dict_to_py_string(data: dict, indent=0, is_nested=False):
result = ""
for key, value in data.items():
if isinstance(value, dict):
result += f"{' ' * indent}{key} = dict(\n"
result += format_dict_to_py_string(value, indent + 4, is_nested=True)
result += f"{' ' * indent})"
else:
result += f"{' ' * indent}{key} = {repr(value)}"
if is_nested:
result += ","
result += "\n"
result = f"""\
{result}
"""
return result

View File

@ -1,21 +0,0 @@
#!/bin/bash
set -x
retry_times=3
for ((i=1;i<=$retry_times;i++));do
jobid=$(squeue -o "%A %j" -u $USER | grep ${GITHUB_RUN_ID}-${GITHUB_JOB} | awk '{print $1}')
if [[ -n "$jobid" ]];then
echo "The job $jobid will be canceled."
scancel $jobid
sleep 0.5
else
echo "There are no more jobs that need to be canceled."
break
fi
done
if [[ $i -gt $retry_times ]];then
echo "There have been tried $retry_times times. Please contact user $USER to confirm the job status."
fi
exit 0

View File

@ -1,4 +0,0 @@
#!/bin/bash
readonly DATA_VOLUME=$(echo $GITHUB_WORKSPACE | cut -d '/' -f 1-4)/data
readonly CLEAN_PATH=$(echo $GITHUB_WORKSPACE | cut -d '/' -f 1-4)/ci_clean_bak

View File

@ -1,51 +0,0 @@
#!/bin/bash
set -x
source ./ci_scripts/common/variables.sh
[[ -n ${DATA_VOLUME} ]] || { echo "should set DATA_VOLUME first before ci, exit."; exit 1; }
[[ -n ${CLEAN_PATH} ]] || { echo "should set CLEAN_PATH first before ci, exit."; exit 1; }
readonly SRC_DATASET_META=${DATA_VOLUME}/lm_data/alpaca_data/alpaca_data.json
readonly RESULTS=${DATA_VOLUME}/lm_data/alpaca_data/result
readonly TRAIN_DATASET=${RESULTS}/train/en/dataset.bin
readonly TRAIN_DATASET_META=${RESULTS}/train/en/dataset.bin.meta
readonly VALID_DATASET=${RESULTS}/valid/en/dataset.bin
readonly VALID_DATASET_META=${RESULTS}/valid/en/dataset.bin.meta
split_ratio=0.1
exit_code=0
source ./ci_scripts/common/basic_func.sh
echo "start to test alpaca_tokenizer.py."
if [[ -d ${RESULTS} ]]; then
if ! rsync -av --remove-source-files ${RESULTS} ${CLEAN_PATH}; then
echo "cleaning test data in ${RESULTS} failed, exit."
exit 1
fi
fi
if [[ ! -f ${SRC_DATASET_META} ]]; then
echo "${SRC_DATASET_META} should be exist, exit."
exit 1
fi
python tools/alpaca_tokenizer.py ${SRC_DATASET_META} ${RESULTS} tools/V7_sft.model --split_ratio ${split_ratio}
[[ $? -ne 0 ]] && { echo "test alpaca_tokenizer.py failed."; exit_code=$(($exit_code + 1)); }
file_list=(${TRAIN_DATASET} ${TRAIN_DATASET_META} ${VALID_DATASET} ${VALID_DATASET_META})
for file in ${file_list[@]}; do
if [[ ! -f ${file} ]]; then
echo "expect: ${file} exists, actual: not exist."
exit_code=$(($exit_code + 1))
fi
done
# move the test files.
if ! rsync -av --remove-source-files ${RESULTS} ${CLEAN_PATH}; then
echo "cleaning test data in ${RESULTS} failed."
exit_code=$(($exit_code + 1))
fi
exit $exit_code

View File

@ -1,43 +0,0 @@
#!/bin/bash
set -x
source ./ci_scripts/common/variables.sh
[[ -n ${DATA_VOLUME} ]] || { echo "should set DATA_VOLUME first before ci, exit."; exit 1; }
[[ -n ${CLEAN_PATH} ]] || { echo "should set CLEAN_PATH first before ci, exit."; exit 1; }
readonly DATA=${DATA_VOLUME}/lm_data/cn_data/raw_data.txt
readonly RESULT=${DATA_VOLUME}/lm_data/cn_data/result.bin
readonly RESULT_META=${DATA_VOLUME}/lm_data/cn_data/result.bin.meta
readonly RESULTS=${DATA_VOLUME}/lm_data/cn_data/result.*
exit_code=0
source ./ci_scripts/common/basic_func.sh
echo "start to test tokenizer.py."
num=$(num_files "${RESULTS}")
if [[ ${num} -gt 0 ]]; then
if ! rsync -av --remove-source-files ${RESULTS} ${CLEAN_PATH}; then
echo "cleaning test data ${RESULTS} failed, exit."
exit 1
fi
fi
srun -p ${SLURM_PARTITION} --quotatype=spot --job-name=$1 --gpus-per-task=1 python tools/tokenizer.py --text_input_path ${DATA} --bin_output_path ${RESULT}
[[ $? -ne 0 ]] && { echo "test tokenizer.py failed."; exit_code=$(($exit_code + 1)); }
file_list=($RESULT $RESULT_META)
for file in ${file_list[@]}; do
if [[ ! -f ${file} ]]; then
echo "expect: ${file} exists, actual: not exist."
exit_code=$(($exit_code + 1))
fi
done
# move the test files.
if ! rsync -av --remove-source-files ${RESULTS} ${CLEAN_PATH}; then
echo "cleaning cached file in ${RESULTS} failed."
exit_code=$(($exit_code + 1))
fi
exit $exit_code

View File

@ -1,48 +0,0 @@
#!/bin/bash
set -x
source ./ci_scripts/common/variables.sh
[[ -n ${DATA_VOLUME} ]] || { echo "should set DATA_VOLUME first before ci, exit."; exit 1; }
[[ -n ${GITHUB_WORKSPACE} ]] || { echo "should set GITHUB_WORKSPACE first before ci, exit."; exit 1; }
[[ -n ${CLEAN_PATH} ]] || { echo "should set CLEAN_PATH first before ci, exit."; exit 1; }
readonly CKPTS_INPUT="${DATA_VOLUME}/lm_data/alpaca_data/llm_ckpts/20"
readonly CKPTS_OUTPUT="${GITHUB_WORKSPACE}/hf_ckpt"
readonly TOKENIZER="${GITHUB_WORKSPACE}/hf_ckpt/tokenizer.model"
readonly CONFIG="${GITHUB_WORKSPACE}/hf_ckpt/config.json"
readonly INERNLM="${GITHUB_WORKSPACE}/hf_ckpt/modeling_internlm.py"
exit_code=0
expected_num=9
source ./ci_scripts/common/basic_func.sh
echo "start to test convert2hf.py."
if [[ -d ${CKPTS_OUTPUT} ]]; then
if ! rsync -av --remove-source-files ${CKPTS_OUTPUT}/* ${CLEAN_PATH}; then
echo "cleaning cached file in ${CKPTS_OUTPUT} failed, exit."
exit 1
fi
fi
python ./tools/transformers/convert2hf.py --src_folder ${CKPTS_INPUT} --tgt_folder ${CKPTS_OUTPUT} --tokenizer ./tools/V7_sft.model
[[ $? -ne 0 ]] && { echo "test convert2hf.py failed."; exit_code=$(($exit_code + 1)); }
#assert exists model
file_list=($TOKENIZER $CONFIG $INERNLM)
for file in ${file_list[@]}; do
if [[ ! -f ${file} ]];then
echo "file ${file} does not exist."
exit_code=$(($exit_code + 1))
fi
done
num=$(num_files "${CKPTS_OUTPUT}")
if [[ ${num} -ne ${expected_num} ]]; then
echo "expect: ${expected_num} files, actual: ${num} files."
exit_code=$(($exit_code + 1))
fi
# NOTICE: should not remove the cached files, because the cached files will be used in the next test case.
exit $exit_code

View File

@ -1,13 +0,0 @@
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True).cuda()
model = model.eval()
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
assert len(response) != 0
response, history = model.chat(tokenizer, "请提供三个管理时间的建议。", history=history)
print(response)
assert len(response) != 0

View File

@ -1,9 +0,0 @@
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
from transformers import AutoModel
model = AutoModel.from_pretrained("../hf_ckpt/", trust_remote_code=True).cuda()
print(model)
assert model.config.hidden_size == 2048
assert model.config.num_attention_heads == 16
assert model.config.num_hidden_layers == 16

View File

@ -1,131 +0,0 @@
JOB_NAME = "7b_train"
SEQ_LEN = 1024
HIDDEN_SIZE = 2048
NUM_ATTENTION_HEAD = 16
MLP_RATIO = 8 / 3
NUM_LAYER = 16
VOCAB_SIZE = 103168
# Ckpt folder format:
# fs: 'local:/mnt/nfs/XXX'
# oss: 'boto3:s3://model_weights/XXX'
# MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
# SAVE_CKPT_FOLDER = "local:llm_ckpts"
SAVE_CKPT_FOLDER = "local:llm_ckpts"
# LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
ckpt = dict(
enable_save_ckpt=True,
# Path to save training ckpt.
save_ckpt_folder=SAVE_CKPT_FOLDER,
# Path to continue training ckpt (load model weights and scheduler/context states).
# load_ckpt_folder=LOAD_CKPT_FOLDER,
# Path to initialize with given model weights.
# load_model_only_folder=MODEL_ONLY_FOLDER,
checkpoint_every=20,
# Wheter to load optimizer states when continuing training.
load_optimizer=True,
)
TRAIN_FOLDER = "local:../lm_data/alpaca_data/train/en"
data = dict(
seq_len=SEQ_LEN,
# micro_num means the number of micro_batch contained in one gradient update
micro_num=4,
# packed_length = micro_bsz * SEQ_LEN
micro_bsz=2,
pack_sample_into_one=False,
total_steps=20,
skip_batches="",
rampup_batch_size="",
# Datasets with less than 50 rows will be discarded
min_length=50,
# train_folder=TRAIN_FOLDER,
)
grad_scaler = dict(
fp16=dict(
# the initial loss scale, defaults to 2**16
initial_scale=2**16,
# the minimum loss scale, defaults to None
min_scale=1,
# the number of steps to increase loss scale when no overflow occurs
growth_interval=1000,
),
# the multiplication factor for increasing loss scale, defaults to 2
growth_factor=2,
# the multiplication factor for decreasing loss scale, defaults to 0.5
backoff_factor=0.5,
# the maximum loss scale, defaults to None
max_scale=2**24,
# the number of overflows before decreasing loss scale, defaults to 2
hysteresis=2,
)
hybrid_zero_optimizer = dict(
# Enable low_level_optimzer overlap_communication
zero_overlap_communication=True,
# bucket size for nccl communication params
reduce_bucket_size=512 * 1024 * 1024,
# grad clipping
clip_grad_norm=1.0,
)
loss = dict(
label_smoothing=0,
)
adam = dict(
lr=1e-4,
adam_beta1=0.9,
adam_beta2=0.95,
adam_beta2_c=0,
adam_eps=1e-8,
weight_decay=0.01,
)
lr_scheduler = dict(
total_steps=data["total_steps"],
init_steps=0, # optimizer_warmup_step
warmup_ratio=0.01,
eta_min=1e-5,
last_epoch=-1,
)
beta2_scheduler = dict(
init_beta2=adam["adam_beta2"],
c=adam["adam_beta2_c"],
cur_iter=-1,
)
model = dict(
checkpoint=False,
num_attention_heads=NUM_ATTENTION_HEAD,
embed_split_hidden=True,
vocab_size=VOCAB_SIZE,
embed_grad_scale=1,
parallel_output=True,
hidden_size=HIDDEN_SIZE,
num_layers=NUM_LAYER,
mlp_ratio=MLP_RATIO,
apply_post_layer_norm=False,
dtype="torch.bfloat16",
norm_type="rmsnorm",
layer_norm_epsilon=1e-5,
)
"""
zero1 parallel:
1. if zero1 <= 0, The size of the zero process group is equal to the size of the dp process group,
so parameters will be divided within the range of dp.
2. if zero1 == 1, zero is not used, and all dp groups retain the full amount of model parameters.
3. zero1 > 1 and zero1 <= dp world size, the world size of zero is a subset of dp world size.
For smaller models, it is usually a better choice to split the parameters within nodes with a setting <= 8.
pipeline parallel: pipeline parallel size, only 1 is accepted currently.
tensor parallel: tensor parallel size, usually the number of GPUs per node, only 1 is accepted currently.
"""
parallel = dict(
zero1=8,
)
cudnn_deterministic = False
cudnn_benchmark = False

View File

@ -1,49 +0,0 @@
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
import argparse
import json
import os
from ci_scripts.common import com_func
from internlm.core.context import Config
def generate_new_config(config_py_file, test_config_json, case_name):
# generate path of the new config py
config_path = os.path.split(config_py_file)
new_config_py_file = os.path.join(config_path[0], case_name + ".py")
# merge dict
origin_config = Config.from_file(config_py_file)
with open(test_config_json) as f:
test_config = json.load(f)
if test_config:
if case_name not in test_config.keys():
raise KeyError(f"the {case_name} doesn't exist.Please check {test_config} again!")
new_config = com_func.merge_dicts(origin_config, test_config[case_name])
print(f"new config is:\n{new_config}")
# write new config to py file
file_content = com_func.format_dict_to_py_string(new_config)
with open(new_config_py_file, "w") as f:
f.write(file_content)
print(f"The new test train config file is {new_config_py_file}")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--origin_config",
type=str,
default="./ci_scripts/train/ci_7B_sft.py",
help="path to the origin train config file",
)
parser.add_argument(
"--test_config",
type=str,
default="./ci_scripts/train/test_config.json",
help="path to the test train config file",
)
parser.add_argument("--case_name", type=str, help="name of the case which will be runned ")
args = parser.parse_args()
generate_new_config(args.origin_config, args.test_config, args.case_name)

View File

@ -1,43 +0,0 @@
#!/bin/bash
set -x
source ./ci_scripts/common/variables.sh
[[ -n ${GITHUB_WORKSPACE} ]] || { echo "should set GITHUB_WORKSPACE first before ci, exit."; exit 1; }
[[ -n ${CLEAN_PATH} ]] || { echo "should set CLEAN_PATH first before ci, exit."; exit 1; }
readonly CKPTS_PATH="$GITHUB_WORKSPACE/llm_ckpts"
readonly CKPTS40_PATH="$GITHUB_WORKSPACE/llm_ckpts/40"
readonly CKPTS40_OUTPUT="${CKPTS40_PATH}/*.pt"
expected_num=22
exit_code=0
source ./ci_scripts/common/basic_func.sh
echo "start to test slurm training with loading checkpoint."
python ./ci_scripts/train/generate_config.py --case_name $1
file="./ci_scripts/train/$1.py"
if [[ ! -f ${file} ]]; then
echo "expect: ${file} exists, actual: not exist."
exit_code=$(($exit_code + 1))
fi
srun -p ${SLURM_PARTITION} --exclusive --quotatype=spot --job-name=$2 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ${file}
[[ $? -ne 0 ]] && { echo "test slurm training failed."; exit_code=$(($exit_code + 1)); }
num=$(num_files "${CKPTS40_OUTPUT}")
if [[ ${num} -ne ${expected_num} ]]; then
echo "expect: ${expected_num} files, actual: ${num} files."
exit_code=$(($exit_code + 1))
fi
# move the test files.
if [[ -d ${CKPTS_PATH} ]]; then
if ! rsync -av --remove-source-files ${CKPTS_PATH} ${CLEAN_PATH}; then
echo "cleaning cached file in ${CKPTS_PATH} failed."
exit_code=$(($exit_code + 1))
fi
fi
exit $exit_code

View File

@ -1,34 +0,0 @@
#!/bin/bash
set -x
source ./ci_scripts/common/variables.sh
[[ -n ${GITHUB_WORKSPACE} ]] || { echo "should set GITHUB_WORKSPACE first before ci, exit."; exit 1; }
[[ -n ${CLEAN_PATH} ]] || { echo "should set CLEAN_PATH first before ci, exit."; exit 1; }
readonly CKPTS_PATH="$GITHUB_WORKSPACE/llm_ckpts"
readonly CKPTS20_PATH="$GITHUB_WORKSPACE/llm_ckpts/20"
readonly CKPTS20_OUTPUT="${CKPTS20_PATH}/*.pt"
expected_num=22
exit_code=0
source ./ci_scripts/common/basic_func.sh
echo "start to test slurm training."
if [[ -d ${CKPTS20_PATH} ]]; then
if ! rsync -av --remove-source-files ${CKPTS20_PATH} ${CLEAN_PATH}; then
echo "cleaning cached file in ${CKPTS20_PATH} failed, exit."
exit 1
fi
fi
srun -p ${SLURM_PARTITION} --exclusive --quotatype=spot --job-name=$1 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./ci_scripts/train/ci_7B_sft.py
[[ $? -ne 0 ]] && { echo "test slurm training failed."; exit_code=$(($exit_code + 1)); }
num=$(num_files "${CKPTS20_OUTPUT}")
if [[ ${num} -ne ${expected_num} ]]; then
echo "expect: ${expected_num} files, actual: ${num} files."
exit_code=$(($exit_code + 1))
fi
exit $exit_code

View File

@ -1,45 +0,0 @@
{
"7B_basic_train": {
"SEQ_LEN": 1024,
"HIDDEN_SIZE": 2048,
"NUM_ATTENTION_HEAD": 16,
"NUM_LAYER": 16,
"TRAIN_FOLDER":"local:../lm_data/alpaca_data/train/en",
"ckpt": {
"checkpoint_every": 20
},
"data": {
"total_steps": 20
}
},
"7B_load_new_ckpt": {
"SEQ_LEN": 1024,
"HIDDEN_SIZE": 2048,
"NUM_ATTENTION_HEAD": 16,
"NUM_LAYER": 16,
"TRAIN_FOLDER":"local:../lm_data/alpaca_data/train/en",
"LOAD_CKPT_FOLDER": "local:llm_ckpts/20",
"ckpt": {
"load_ckpt_folder": "local:llm_ckpts/20",
"checkpoint_every": 20
},
"data": {
"total_steps": 40
}
},
"7B_load_preset_ckpt": {
"SEQ_LEN": 1024,
"HIDDEN_SIZE": 2048,
"NUM_ATTENTION_HEAD": 16,
"NUM_LAYER": 16,
"TRAIN_FOLDER":"local:../lm_data/alpaca_data/train/en",
"LOAD_CKPT_FOLDER": "local:../lm_data/alpaca_data/llm_ckpts/20",
"ckpt": {
"load_ckpt_folder": "local:../lm_data/alpaca_data/llm_ckpts/20",
"checkpoint_every": 20
},
"data": {
"total_steps": 40
}
}
}

View File

@ -1,40 +0,0 @@
#!/bin/bash
set -x
source ./ci_scripts/common/variables.sh
[[ -n ${GITHUB_WORKSPACE} ]] || { echo "should set GITHUB_WORKSPACE first before ci, exit."; exit 1; }
[[ -n ${CLEAN_PATH} ]] || { echo "should set CLEAN_PATH first before ci, exit."; exit 1; }
readonly CKPTS_PATH="$GITHUB_WORKSPACE/llm_ckpts"
readonly CKPTS20_PATH="$GITHUB_WORKSPACE/llm_ckpts/20"
readonly CKPTS_OUTPUT="${CKPTS20_PATH}/*.pt"
expected_num=22
exit_code=0
source ./ci_scripts/common/basic_func.sh
echo "start to test torch training."
if [[ -d ${CKPTS20_PATH} ]]; then
if ! rsync -av --remove-source-files ${CKPTS20_PATH} ${CLEAN_PATH}; then
echo "cleaning cached file in ${CKPTS20_PATH} failed, exit."
exit 1
fi
fi
srun -p ${SLURM_PARTITION} --exclusive --quotatype=spot --job-name=$1 -N 1 torchrun --nnodes=1 --nproc_per_node=8 --master_port=29501 train.py --config ./ci_scripts/train/ci_7B_sft.py --launcher torch
[[ $? -ne 0 ]] && { echo "test torch training failed."; exit_code=$(($exit_code + 1)); }
num=$(num_files "${CKPTS_OUTPUT}")
if [[ ${num} -ne ${expected_num} ]]; then
echo "expect: ${expected_num} files, actual: ${num} files."
exit_code=$(($exit_code + 1))
fi
# move the test files.
if ! rsync -av --remove-source-files ${CKPTS_PATH}/* ${CLEAN_PATH}; then
echo "cleaning cached file in ${CKPTS_PATH} failed."
exit_code=$(($exit_code + 1))
fi
exit $exit_code

View File

@ -1,164 +0,0 @@
JOB_NAME = "7b_train"
DO_ALERT = False
SEQ_LEN = 2048
HIDDEN_SIZE = 4096
NUM_ATTENTION_HEAD = 32
MLP_RATIO = 8 / 3
NUM_LAYER = 32
VOCAB_SIZE = 103168
MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
# Ckpt folder format:
# fs: 'local:/mnt/nfs/XXX'
SAVE_CKPT_FOLDER = "local:llm_ckpts"
LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
# boto3 Ckpt folder format:
# import os
# BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint
# SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm"
# LOAD_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm/snapshot/1/"
CHECKPOINT_EVERY = 50
ckpt = dict(
enable_save_ckpt=False, # enable ckpt save.
save_ckpt_folder=SAVE_CKPT_FOLDER, # Path to save training ckpt.
# load_ckpt_folder= dict(path=MODEL_ONLY_FOLDER, content=["model"], ckpt_type="normal"),
load_ckpt_folder="local:llm_ckpts/",
# 'load_ckpt_info' setting guide:
# 1. the 'path' indicate ckpt path,
# 2. the 'content means what states will be loaded, support: "model", "sampler", "optimizer", "scheduler", "all"
# 3. the ckpt_type means the type of checkpoint to be loaded, now only 'normal' type is supported.
load_ckpt_info=dict(path=MODEL_ONLY_FOLDER, content=("model",), ckpt_type="internlm"),
checkpoint_every=CHECKPOINT_EVERY,
async_upload=True, # async ckpt upload. (only work for boto3 ckpt)
async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/", # path for temporarily files during asynchronous upload.
oss_snapshot_freq=int(CHECKPOINT_EVERY / 2), # snapshot ckpt save frequency.
)
TRAIN_FOLDER = "/path/to/dataset"
VALID_FOLDER = "/path/to/dataset"
data = dict(
seq_len=SEQ_LEN,
# micro_num means the number of micro_batch contained in one gradient update
micro_num=4,
# packed_length = micro_bsz * SEQ_LEN
micro_bsz=2,
# defaults to the value of micro_num
valid_micro_num=4,
# defaults to 0, means disable evaluate
valid_every=50,
pack_sample_into_one=False,
total_steps=50000,
skip_batches="",
rampup_batch_size="",
# Datasets with less than 50 rows will be discarded
min_length=50,
# train_folder=TRAIN_FOLDER,
# valid_folder=VALID_FOLDER,
empty_cache_and_diag_interval=10,
diag_outlier_ratio=1.1,
)
grad_scaler = dict(
fp16=dict(
# the initial loss scale, defaults to 2**16
initial_scale=2**16,
# the minimum loss scale, defaults to None
min_scale=1,
# the number of steps to increase loss scale when no overflow occurs
growth_interval=1000,
),
# the multiplication factor for increasing loss scale, defaults to 2
growth_factor=2,
# the multiplication factor for decreasing loss scale, defaults to 0.5
backoff_factor=0.5,
# the maximum loss scale, defaults to None
max_scale=2**24,
# the number of overflows before decreasing loss scale, defaults to 2
hysteresis=2,
)
hybrid_zero_optimizer = dict(
# Enable low_level_optimzer overlap_communication
overlap_sync_grad=True,
overlap_sync_param=True,
# bucket size for nccl communication params
reduce_bucket_size=512 * 1024 * 1024,
# grad clipping
clip_grad_norm=1.0,
)
loss = dict(
label_smoothing=0,
)
adam = dict(
lr=1e-4,
adam_beta1=0.9,
adam_beta2=0.95,
adam_beta2_c=0,
adam_eps=1e-8,
weight_decay=0.01,
)
lr_scheduler = dict(
total_steps=data["total_steps"],
init_steps=0, # optimizer_warmup_step
warmup_ratio=0.01,
eta_min=1e-5,
last_epoch=-1,
)
beta2_scheduler = dict(
init_beta2=adam["adam_beta2"],
c=adam["adam_beta2_c"],
cur_iter=-1,
)
model = dict(
checkpoint=False, # The proportion of layers for activation aheckpointing, the optional value are True/False/[0-1]
num_attention_heads=NUM_ATTENTION_HEAD,
embed_split_hidden=True,
vocab_size=VOCAB_SIZE,
embed_grad_scale=1,
parallel_output=True,
hidden_size=HIDDEN_SIZE,
num_layers=NUM_LAYER,
mlp_ratio=MLP_RATIO,
apply_post_layer_norm=False,
dtype="torch.bfloat16", # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
norm_type="rmsnorm",
layer_norm_epsilon=1e-5,
use_flash_attn=True,
num_chunks=1, # if num_chunks > 1, interleaved pipeline scheduler is used.
)
"""
zero1 parallel:
1. if zero1 <= 0, The size of the zero process group is equal to the size of the dp process group,
so parameters will be divided within the range of dp.
2. if zero1 == 1, zero is not used, and all dp groups retain the full amount of model parameters.
3. zero1 > 1 and zero1 <= dp world size, the world size of zero is a subset of dp world size.
For smaller models, it is usually a better choice to split the parameters within nodes with a setting <= 8.
pipeline parallel (dict):
1. size: int, the size of pipeline parallel.
2. interleaved_overlap: bool, enable/disable communication overlap when using interleaved pipeline scheduler.
tensor parallel: tensor parallel size, usually the number of GPUs per node.
"""
parallel = dict(
zero1=8,
pipeline=dict(size=1, interleaved_overlap=True),
sequence_parallel=False,
)
cudnn_deterministic = False
cudnn_benchmark = False
monitor = dict(
# feishu alert configs
alert=dict(
enable_feishu_alert=DO_ALERT,
feishu_alert_address=None, # feishu webhook to send alert message
light_monitor_address=None, # light_monitor address to send heartbeat
),
)

View File

@ -1,20 +0,0 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

View File

@ -1,105 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2023, InternLM Team
# This file is distributed under the same license as the InternLM package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2023.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-13 17:07+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
"Language-Team: en <LL@li.org>\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../source/checkpoint.rst:2
msgid "模型保存"
msgstr "Model Checkpointing"
#: ../../source/checkpoint.rst:4
msgid ""
"InternLM 使用 ``internlm.utils.model_checkpoint.CheckpointManager`` "
"来管理模型保存。其中,可以使用 ``CheckpointManager.try_save_checkpoint(train_state)`` "
"来保存指定 step 的模型状态。"
msgstr ""
"InternLM uses ``internlm.utils.model_checkpoint.CheckpointManager`` to "
"manage model checkpointing. In the implementation, we use "
"``CheckpointManager.try_save_checkpoint(train_state)`` to checkpoint "
"training states at specific steps. "
#: ../../source/checkpoint.rst:6
msgid "InternLM支持启动时自动加载最新的模型备份并在接收信号退出训练时自动进行模型备份。"
msgstr "InternLM supports automatic loading of latest ckpt at startup and automatic model checkpointing at signal quit. "
#: ../../source/checkpoint.rst:9
msgid "Checkpointing"
msgstr ""
#: internlm.utils.model_checkpoint.CheckpointManager:1 of
msgid "StorageManagerContext"
msgstr ""
#: internlm.utils.model_checkpoint.CheckpointManager.quit_signal_handler:1 of
msgid ""
"Exit signal detection function, if we write the exit step in the "
"'QUIT_FILE_PATH' file, all ranks will save ckpt and exit. Negative "
"integer step means save ckpt. Positive integer step means save ckpt and "
"quit."
msgstr ""
#: internlm.utils.model_checkpoint.CheckpointManager.quit_signal_handler of
msgid "参数"
msgstr ""
#: internlm.utils.model_checkpoint.CheckpointManager.quit_signal_handler of
msgid "返回"
msgstr ""
#: internlm.utils.model_checkpoint.CheckpointManager.quit_signal_handler:9 of
msgid "whether to quit."
msgstr ""
#: internlm.utils.model_checkpoint.CheckpointManager.quit_signal_handler of
msgid "返回类型"
msgstr ""
#: internlm.utils.model_checkpoint.CheckpointManager.wait_async_upload_finish:1
#: of
msgid "wait for all checkpoint uploads to be completed"
msgstr ""
#: internlm.utils.model_checkpoint.CheckpointManager.query_latest_snapshot_step_boto3:1
#: of
msgid ""
"Returns: Tuple(str, int): path of latest ckpt and ckpt step, if not "
"found, None will return."
msgstr ""
#: internlm.utils.model_checkpoint.CheckpointManager.save_checkpoint:1 of
msgid "Save checkpoint to the given folder path."
msgstr ""
#~ msgid "Attempt to restore the training state of the last ckpt."
#~ msgstr ""
#~ msgid "lr_scheduler object."
#~ msgstr ""
#~ msgid "optimizer object."
#~ msgstr ""
#~ msgid "learning rate."
#~ msgstr ""
#~ msgid "traing states."
#~ msgstr ""
#~ msgid "traning dataloader object"
#~ msgstr ""

View File

@ -1,49 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2023, InternLM Team
# This file is distributed under the same license as the InternLM package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2023.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-07 10:56+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
"Language-Team: en <LL@li.org>\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../source/example/30B_demo.rst:2 242d1f89ae2045f1bf1f31bf82f07846
msgid "30B Demo"
msgstr ""
#: ../../source/example/30B_demo.rst:5 c2415bfa6978414a939dcc395fdfb544
msgid "训练配置"
msgstr "Training Config"
#: ../../source/example/30B_demo.rst:7 75f568d1ca5546228f88958c12c2dd65
msgid "30B demo 训练配置文件样例如下:"
msgstr "30B demo config file example:"
#: ../../source/example/30B_demo.rst:164 533cb04f94314eeb8381e45f06d03108
msgid "启动训练"
msgstr "Start Training"
#: ../../source/example/30B_demo.rst:166 24974384d5ab42e68266aeb67ae222ce
msgid "完成以上训练配置后,可启动模型训练,以在 ``slurm`` 平台上为例,启动两节点 16GPU 的训练命令如下所示:"
msgstr "After completing the data preparation and relevant training configurations, you can start the demo training. "
"The following example shows how to start distributed training in ``slurm`` environments with 16 GPUs."
#: ../../source/example/30B_demo.rst:173 948ac71ed53848f9bad07f69d956c4bb
msgid "训练结果"
msgstr "Training Results"
#: ../../source/example/30B_demo.rst:175 615a3481b0aa49729b7219b1365519aa
msgid "基于以上训练配置和启动命令,两节点 16GPU 下的模型训练部分日志展示如下:"
msgstr "Taking the configuration of the demo training on two nodes with 16 GPUs on slurm as an example, the training result log is shown below:"

View File

@ -1,49 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2023, InternLM Team
# This file is distributed under the same license as the InternLM package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2023.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-07 10:56+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
"Language-Team: en <LL@li.org>\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../source/example/7B_demo.rst:2 8576f969040249bb93e7c347ef210990
msgid "7B Demo"
msgstr ""
#: ../../source/example/7B_demo.rst:5 5429ceea12424825991744bece744f60
msgid "训练配置"
msgstr "Training Config"
#: ../../source/example/7B_demo.rst:7 c9a47faf5deb40b68ad2bc950fdf2b14
msgid "7B demo 的训练配置文件样例如下:"
msgstr "7B demo config file example:"
#: ../../source/example/7B_demo.rst:162 eb93a6ca05c8421eb87a2470f9f31fc2
msgid "启动训练"
msgstr "Start Training"
#: ../../source/example/7B_demo.rst:164 9e7a864ae2e14d05b0681f16792e5278
msgid "完成以上训练配置后,可启动模型训练,以在 ``slurm`` 平台上为例,启动单节点 8GPU 的训练命令如下所示:"
msgstr "After completing the data preparation and relevant training configurations, you can start the demo training. "
"The following example shows how to start distributed training in ``slurm`` environments with 8 GPUs."
#: ../../source/example/7B_demo.rst:171 fdd053efb1854d46aabf6c0f279fe7fc
msgid "训练结果"
msgstr "Training Results"
#: ../../source/example/7B_demo.rst:173 33ec81f34e3c4340beacdb5254069d08
msgid "基于以上训练配置和启动命令,单节点 8GPU 下的模型训练部分日志展示如下:"
msgstr "Taking the configuration of the demo training on a single machine with 8 GPUs on slurm as an example, the training result log is shown below:"

View File

@ -1,32 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2023, InternLM Team
# This file is distributed under the same license as the InternLM package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2023.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-07 10:56+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
"Language-Team: en <LL@li.org>\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../source/example/index.rst:2 de54695e8bde40ffb8878043072197e6
msgid "训练样例"
msgstr "Training Example"
#: ../../source/example/index.rst:5 da388b3209ff4bd39fd0700a7fba413a
msgid "7B Demo"
msgstr ""
#: ../../source/example/index.rst:13 b095e27dfc924a7a943b7cba5361700a
msgid "30B Demo"
msgstr ""

View File

@ -1,80 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2023, InternLM Team
# This file is distributed under the same license as the InternLM package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2023.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-07 10:56+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
"Language-Team: en <LL@li.org>\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../source/index.rst:8 11e029810acf410180311a3c63eb01f4
msgid "InternLM"
msgstr "InternLM"
#: ../../source/index.rst:11 e6fd7d058e4b43bb81157ac79867e3d3
msgid "环境构建"
msgstr "Environment Setup"
#: ../../source/index.rst:19 f323ede90c0f434d8b627eded1d8fc10
msgid "快速上手"
msgstr "Quickstart Guide"
#: ../../source/index.rst:27 3c504b4b92264e9182abb0fa81fe80c3
msgid "训练构建"
msgstr "Model Setup"
#: ../../source/index.rst:35 5cc5c831399a40b089d27b777a776b16
msgid "训练 API"
msgstr "Training API"
#: ../../source/index.rst:43 21a7473eabb441f8bfe28d2a0e306889
msgid "并行训练"
msgstr "Parallel Training"
#: ../../source/index.rst:51 9234725f3c464731993d73607608c874
msgid "模型备份"
msgstr "Model Checkpointing"
#: ../../source/index.rst:59 8e4ce037017f4510b2892a66003877fa
msgid "性能分析"
msgstr "Profiler"
#: ../../source/index.rst:67 a36e02819ecd4b448a8cb4ebbecb6600
msgid "训练监控"
msgstr "Monitor"
#: ../../source/index.rst:75 b912e292486f455c8b5cdd75962e8ac2
msgid "训练样例"
msgstr "Example"
#: ../../source/index.rst:83 ea9e9281720941a1830e5df7a2badf7a
msgid "常见问题"
msgstr "Q&A"
#: ../../source/index.rst:91 e08edc5aa1c74965b10084b393b88fae
msgid "索引和表格"
msgstr "Indices and tables"
#: ../../source/index.rst:93 f3fdca059caa49dcad09aa44be7f02d6
msgid ":ref:`genindex`"
msgstr ""
#: ../../source/index.rst:94 b3791e811315435097bb507edc3f4b9b
msgid ":ref:`modindex`"
msgstr ""
#: ../../source/index.rst:95 a164b772960f4ab8b18c7e8820f69f55
msgid ":ref:`search`"
msgstr ""

View File

@ -1,247 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2023, InternLM Team
# This file is distributed under the same license as the InternLM package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2023.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-14 12:23+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
"Language-Team: zh_CN <LL@li.org>\n"
"Plural-Forms: nplurals=1; plural=0;\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../source/initialize.rst:2
msgid "训练构建"
msgstr "Training Setup"
#: ../../source/initialize.rst:4
msgid "InternLM 的训练流程可以归纳为两个步骤:"
msgstr "The training process of InternLM can be summarized into two steps: "
#: ../../source/initialize.rst:6
msgid "初始化"
msgstr "Initialization"
#: ../../source/initialize.rst:8
msgid "初始化模型、优化器、数据加载器、Trainer生成不同种类的进程组为混合并行的迭代训练做准备。"
msgstr ""
"Initialize model, optimizer, dataloader, trainer, and create different "
"types of process groups to prepare for iterative steps of hybrid parallel training. "
#: ../../source/initialize.rst:9
msgid "初始化Logger、Checkpoint管理器、Monitor管理器、Profiler对迭代训练的过程观察、预警、记录。"
msgstr ""
"Initialize logger, checkpoint manager, monitor manager, and profiler to "
"watch, alert, and record the iterative training steps. "
#: ../../source/initialize.rst:11
msgid "迭代训练"
msgstr "Iterative training steps"
#: ../../source/initialize.rst:13
msgid "根据配置文件定义的张量并行、流水线并行、数据并行的大小,加载训练引擎和调度器进行混合并行训练。"
msgstr ""
"Load the training engine and scheduler for hybrid parallel training "
"according to the configuration such as tensor parallel size, pipeline "
"parallel size, and data parallel size. "
#: ../../source/initialize.rst:14
msgid "在迭代训练中,调用 Trainer API 进行梯度置零,前向传播计算损失并反向传播,参数更新。"
msgstr ""
"In iterative training steps, the Trainer API is called to perform zero "
"gradients, forward-loss-backward, and parameter update."
#: ../../source/initialize.rst:20
msgid "InternLM训练流程图"
msgstr "InternLM training process"
#: ../../source/initialize.rst:25
msgid "命令行参数解析"
msgstr "Argument Parsing"
#: ../../source/initialize.rst:27
msgid ""
"InternLM 使用 `argparse <https://docs.python.org/3/library/argparse.html>`_"
" 库来向InternLM运行时提供命令行参数配置。"
msgstr ""
"InternLM uses the `argparse "
"<https://docs.python.org/3/library/argparse.html>`_ library to supply "
"commandline configuration to the InternLM runtime. "
#: ../../source/initialize.rst:29
msgid ""
"用户可使用 ``internlm.initialize.get_default_parser()`` 来获取 InternLM "
"的默认解析器,其中包含一些内置参数,用户可以向此解析器添加自定义参数。"
msgstr ""
"Use ``internlm.initialize.get_default_parser()`` to get InternLM's "
"default parser with some builtin arguments, users can add custom "
"parameters to this parser."
#: internlm.initialize.launch.get_default_parser:1 of
msgid ""
"Reads user command line and uses an argument parser to parse the input "
"arguments. Input arguments include configuration, host, port, world size,"
" local rank, backend for torch.distributed."
msgstr ""
#: internlm.initialize.initialize_trainer.initialize_trainer
#: internlm.initialize.launch.get_default_parser
#: internlm.train.training_internlm.get_train_data_loader
#: internlm.train.training_internlm.initialize_model
#: internlm.train.training_internlm.initialize_optimizer of
msgid "返回"
msgstr ""
#: internlm.initialize.launch.get_default_parser:4 of
msgid ""
"Returns the parser with the default arguments, the user may add "
"customized arguments into this parser."
msgstr ""
#: internlm.initialize.initialize_trainer.initialize_trainer
#: internlm.initialize.launch.get_default_parser
#: internlm.train.training_internlm.initialize_model of
msgid "返回类型"
msgstr ""
#: ../../source/initialize.rst:45
msgid "模型初始化"
msgstr "Model Initialization"
#: internlm.train.training_internlm.initialize_model:1 of
msgid "Initialize model with Automatic Mixed Precision."
msgstr ""
#: internlm.train.training_internlm.initialize_model:3 of
msgid "The neural network model to be trained or evaluated."
msgstr ""
#: ../../source/initialize.rst:49
msgid "InternLM 在配置文件中使用字段 ``model_type`` 和 ``model`` 来控制模型初始化过程。示例模型初始化配置定义如下:"
msgstr ""
"InternLM uses the field ``model_type`` and ``model`` in the config file "
"to control model initialization process. An example model initialization "
"configuratio"
#: ../../source/initialize.rst:77
msgid "字段 ``model_type`` 指明了要初始化的模型类型"
msgstr ""
"The field ``model_type`` specifics the model type has been registered and"
" to be initialized."
#: ../../source/initialize.rst:78
msgid "字段 ``model`` 中的参数指定了在模型初始化过程中的参数设置"
msgstr ""
"The parameters in field ``model`` specific the configuration settings "
"during model initialization."
#: ../../source/initialize.rst:80
msgid ""
"值得注意的是,用户可以定义新的模型类型,并使用装饰器 ``@MODEL_INITIALIZER.register_module`` "
"注册模型的初始化函数,其中 ``MODEL_INITIALIZER`` 是类 "
"``internlm.util.registry.Registry`` 的一个实例化对象,示例如下所示:"
msgstr ""
"It is worth noting that, users can define new model type, and register "
"model's initialization function by decorater "
"``@MODEL_INITIALIZER.register_module``, which ``MODEL_INITIALIZER`` is an"
" instantiated object of class ``internlm.util.registry.Registry``, the "
"example is shown as follows."
#: ../../source/initialize.rst:92
msgid "优化器初始化"
msgstr "Optimizer Initialization"
#: internlm.train.training_internlm.initialize_optimizer:1 of
msgid "Initialize optimizer."
msgstr ""
#: internlm.initialize.initialize_trainer.initialize_trainer
#: internlm.train.training_internlm.get_train_data_loader
#: internlm.train.training_internlm.initialize_optimizer of
msgid "参数"
msgstr ""
#: internlm.train.training_internlm.initialize_optimizer:3 of
msgid "Your model instance to be trained or evaluated."
msgstr ""
#: internlm.train.training_internlm.initialize_optimizer:6 of
msgid "A tuple of (optimizer, beta2_scheduler, lr_scheduler)."
msgstr ""
#: ../../source/initialize.rst:99
msgid "数据加载器初始化"
msgstr "Dataloader Initialization"
#: internlm.train.training_internlm.get_train_data_loader:1 of
msgid "Generate and return the training data loader."
msgstr ""
#: internlm.train.training_internlm.get_train_data_loader:3 of
msgid "number of subprocesses used for dataloader."
msgstr ""
#: internlm.train.training_internlm.get_train_data_loader:5 of
msgid "generate function for dataset."
msgstr ""
#: internlm.train.training_internlm.get_train_data_loader:7 of
msgid "dataset sampler for training dataloader."
msgstr ""
#: internlm.train.training_internlm.get_train_data_loader:9 of
msgid "collate function for training dataloader."
msgstr ""
#: internlm.train.training_internlm.get_train_data_loader:12 of
msgid "A tuple of (train_dl, dataset_types)."
msgstr ""
#: ../../source/initialize.rst:106
msgid "Trainer 初始化"
msgstr "Trainer Initialization"
#: internlm.initialize.initialize_trainer.initialize_trainer:1 of
msgid ""
"Core function to wrap the essential training components with our "
"functionality based on the config which is loaded into gpc.config."
msgstr ""
#: internlm.initialize.initialize_trainer.initialize_trainer:4 of
msgid "Your model instance or a function to build the model."
msgstr ""
#: internlm.initialize.initialize_trainer.initialize_trainer:6 of
msgid "Your optimizer for training."
msgstr ""
#: internlm.initialize.initialize_trainer.initialize_trainer:8 of
msgid "Your criterion instance."
msgstr ""
#: internlm.initialize.initialize_trainer.initialize_trainer:10 of
msgid "Dataloader for training."
msgstr ""
#: internlm.initialize.initialize_trainer.initialize_trainer:12 of
msgid "Dataloader for testing."
msgstr ""
#: internlm.initialize.initialize_trainer.initialize_trainer:14 of
msgid "Your lr scheduler instance, optional."
msgstr ""
#: internlm.initialize.initialize_trainer.initialize_trainer:17 of
msgid ""
"A tuple of ``(trainer, train_dataloader, test_dataloader, lr_scheduler)``"
" where only ``trainer`` could not be None."
msgstr ""

View File

@ -1,139 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2023, InternLM Team
# This file is distributed under the same license as the InternLM package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2023.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-07 10:56+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
"Language-Team: en <LL@li.org>\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../../install.md:2 ../../../install.md:28
#: c237a7328df9440eb54f36c5e6ceef46 e55787faf3f74d5996f251b28422cf15
msgid "环境安装"
msgstr "Installation"
#: ../../../install.md:4 d5cd61481eb04f55a9b1636e47e2bc49
msgid "环境准备"
msgstr "Environment Preparation"
#: ../../../install.md:5 418763cd4acb4ff3afba059ae7066739
msgid "首先,需要安装的依赖包及对应版本列表如下:"
msgstr "The required packages and corresponding version are shown as follows:"
#: ../../../install.md:6 dcb95218036f4452a92a5a9c2fdbe337
msgid "Python == 3.10"
msgstr ""
#: ../../../install.md:7 79e3d9ff5df7455fa596ba63ce3089b7
msgid "GCC == 10.2.0"
msgstr ""
#: ../../../install.md:8 d14840f7b64d4a32a0be5762027e9c32
msgid "MPFR == 4.1.0"
msgstr ""
#: ../../../install.md:9 851e3e5c874a4d0f8fd37a4f85ec8f2f
msgid "CUDA >= 11.7"
msgstr ""
#: ../../../install.md:10 dbf2012c72e1479ba6647baa047ecc04
msgid "Pytorch >= 1.13.1"
msgstr ""
#: ../../../install.md:11 b191e289a079455ea906694a75439b3e
msgid "Transformers >= 4.28.0"
msgstr ""
#: ../../../install.md:12 17accf19fe184e3cb704274d8a66e87e
msgid "Flash-Attention >= v1.0.5"
msgstr ""
#: ../../../install.md:13 8063cdce4bb94947a07dbaedd97e1013
msgid "Apex == 23.05"
msgstr ""
#: ../../../install.md:14 7d6d2682ed214d0cba0048903c128bce
msgid "Ampere或者Hopper架构的GPU (例如H100, A100)"
msgstr "GPU with Ampere or Hopper architecture (such as H100, A100)"
#: ../../../install.md:15 91039fb42b94421586c558a2afcbed71
msgid "Linux OS"
msgstr ""
#: ../../../install.md:17 694b95a146d54878a4a5d57e0c1e8c6c
msgid "以上依赖包安装完成后,需要更新配置系统环境变量:"
msgstr "After installing the above dependencies, some system environment variables need to be updated:"
#: ../../../install.md:29 d0ebf84438dc43708ea517c7eff92e79
msgid "将项目`internlm`及其依赖子模块,从 github 仓库中 clone 下来,命令如下:"
msgstr "Clone the project `internlm` and its dependent submodules from the github repository, as follows:"
#: ../../../install.md:34 c278177fc1974f3fac9b33688d0591fd
msgid "推荐使用 conda 构建一个 Python-3.10 的虚拟环境, 并基于`requirements/`文件安装项目所需的依赖包:"
msgstr "It is recommended to build a Python-3.10 virtual environment using conda and install the required dependencies based on the `requirements/` files:"
#: ../../../install.md:43 6a152c8e332f47b0ba35a9bcec2ed32d
msgid "安装 flash-attention (version v1.0.5)"
msgstr "Install flash-attention (version v1.0.5):"
#: ../../../install.md:55 d7b2116e6ca745ceb48a792fae371283
msgid "安装 Apex (version 23.05)"
msgstr "Install Apex (version 23.05):"
#: ../../../install.md:62 8bcbfb9f74de4a2796212a339feb8283
msgid "环境镜像"
msgstr "Environment Image"
#: ../../../install.md:63 6cbb97568d704cf19e7dabab20ce1d5b
msgid ""
"用户可以使用提供的 dockerfile 结合 docker.Makefile 来构建自己的镜像,或者也可以从 "
"https://hub.docker.com/r/internlm/internlm 获取安装了 InternLM 运行环境的镜像。"
msgstr "Users can use the provided dockerfile combined with docker.Makefile to build their own images, or obtain images with InternLM runtime environment installed from https://hub.docker.com/r/internlm/internlm."
#: ../../../install.md:65 9c29ae2ac9984a8094daf52751f5c7b9
msgid "镜像配置及构造"
msgstr "Image Configuration and Build"
#: ../../../install.md:66 12bd6b0729464cb5af663a384dadd0ec
msgid ""
"dockerfile 的配置以及构造均通过 docker.Makefile 文件实现,在 InternLM 根目录下执行如下命令即可 build "
"镜像:"
msgstr "The configuration and build of the Dockerfile are implemented through the docker.Makefile. To build the image, execute the following command in the root directory of InternLM:"
#: ../../../install.md:70 b5f42dbca3e340c4bb80de1f502e0700
msgid ""
"在 docker.Makefile 中可自定义基础镜像,环境版本等内容,对应参数可直接通过命令行传递。对于 BASE_OS 分别支持 "
"ubuntu20.04 和 centos7。"
msgstr "In docker.Makefile, you can customize the basic image, environment version, etc., and the corresponding parameters can be passed directly through the command line. For BASE_OS, ubuntu20.04 and centos7 are respectively supported."
#: ../../../install.md:72 4abb47ce9cf64b3c9b8dc23ace37a826
msgid "镜像拉取"
msgstr "Pull Standard Image"
#: ../../../install.md:73 1b6e61b2e0cb4da98f5d70d67ac638f9
msgid "基于 ubuntu 和 centos 的标准镜像已经 build 完成也可直接拉取使用:"
msgstr "The standard image based on ubuntu and centos has been built and can be directly pulled:"
#: ../../../install.md:82 2bd75cc4b74848c19775e2b1c83726c1
msgid "容器启动"
msgstr "Run Container"
#: ../../../install.md:83 4bb2dd4bba904255a204776a50721159
msgid "对于使用 dockerfile 构建或拉取的本地标准镜像,使用如下命令启动并进入容器:"
msgstr "For the local standard image built with dockerfile or pulled, use the following command to run and enter the container:"
#: ../../../install.md:87 66613606256e4094a6be5ab2af1269ae
msgid "容器内默认目录即 `/InternLM`,根据[使用文档](./usage.md)即可启动训练。"
msgstr "The default directory in the container is `/InternLM`, please start training according to the [Usage](./usage.md)."

View File

@ -1,197 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2023, InternLM Team
# This file is distributed under the same license as the InternLM package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2023.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-07 10:56+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
"Language-Team: en <LL@li.org>\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../source/monitor.rst:2 f95ef3bff8574c77a28ca2f6212cc4b8
msgid "监控和告警"
msgstr "Monitor and Alert"
#: ../../source/monitor.rst:5 959bd4a6061f4483875c7950ab4546cf
msgid "监控"
msgstr "Monitoring"
#: ../../source/monitor.rst:7 6071bc878d894865b73380cb887847c1
msgid ""
"InternLM 使用 ``internlm.monitor.monitor.initialize_monitor_manager()`` "
"来初始化上下文监控管理。其中,一个实例化的单例对象 ``internlm.monitor.monitor.MonitorManager`` "
"将管理监控线程并使用 ``internlm.monitor.monitor.MonitorTracker`` 来跟踪模型训练生命周期和训练状态。"
msgstr ""
"InternLM uses ``internlm.monitor.monitor.initialize_monitor_manager()`` to initialize context monitor. During this time, "
"a singleton ``internlm.monitor.monitor.MonitorManager`` will manage monitoring thread and track training status "
"with ``internlm.monitor.monitor.MonitorTracker``."
#: 9256a063b6dd449786f29e03ce085176
#: internlm.monitor.monitor.initialize_monitor_manager:1 of
msgid ""
"Initialize monitor manager for monitoring training lifetime and alerting "
"exception info to Feishu."
msgstr ""
#: 138340fca72a4226be901f7f16c8a590 904b7938fdea46bf81c1ef738aa7bfae
#: 9ed2a7b4af2243b289e72b2751aec902 aa0dd0dc6bee4a5bb15cc9705f7c13ee
#: internlm.monitor.alert.send_feishu_msg_with_webhook
#: internlm.monitor.monitor.MonitorManager.start_monitor
#: internlm.monitor.monitor.MonitorTracker
#: internlm.monitor.monitor.initialize_monitor_manager of
msgid "参数"
msgstr ""
#: 3b302339e1d143b6b1d782ff59c9396d 6a06f053828b4c80aef56970750e2085
#: internlm.monitor.monitor.MonitorManager.start_monitor:3
#: internlm.monitor.monitor.initialize_monitor_manager:3 of
msgid "The training job name."
msgstr ""
#: 3330d06145ee4d35b0b3632e799a35b3 c105473f2f6a4f838a9f0d098762d698
#: internlm.monitor.monitor.MonitorManager.start_monitor:5
#: internlm.monitor.monitor.initialize_monitor_manager:5 of
msgid "The Feishu webhook address for sending alert messages."
msgstr ""
#: 774c6ff82a2e452295a1a7dcabaded3d internlm.monitor.monitor.MonitorManager:1
#: of
msgid ""
"Monitor Manager for managing monitor thread and monitoring training "
"status."
msgstr ""
#: 72e696c0ce8f41ea8c7947d35cf322f0
#: internlm.monitor.monitor.MonitorManager.monitor_loss_spike:1 of
msgid "Check loss value, if loss spike occurs, send alert message to Feishu."
msgstr ""
#: 2b668b057fa84e8b92c65bfd49bfb3e9
#: internlm.monitor.monitor.MonitorManager.monitor_exception:1 of
msgid "Catch and format exception information, send alert message to Feishu."
msgstr ""
#: 9852b7143026476d89e1a175223e6d79
#: internlm.monitor.monitor.MonitorManager.handle_sigterm:1 of
msgid "Catch SIGTERM signal, and send alert message to Feishu."
msgstr ""
#: 2e3827bad7b1445fb0d9a7c5a28def5d
#: internlm.monitor.monitor.MonitorManager.start_monitor:1 of
msgid ""
"Initialize and start monitor thread for checking training job status, "
"loss spike and so on."
msgstr ""
#: 271cc3e1b0834a7ba6a1ba4d5cce0ef1
#: internlm.monitor.monitor.MonitorManager.start_monitor:7 of
msgid "The time of monitor interval in seconds, defaults to 300."
msgstr ""
#: e4a06091fce8401b83e31ce26c8075a0
#: internlm.monitor.monitor.MonitorManager.start_monitor:9 of
msgid ""
"The limit multiple of current loss to previous loss value, which means "
"loss spike may be occurs, defaults to 1.5."
msgstr ""
#: 28bde748477e41f39fa6ca3e1855923d
#: internlm.monitor.monitor.MonitorManager.stop_monitor:1 of
msgid "Stop the monitor and alert thread."
msgstr ""
#: ffb3dda227664748bdb326b6630bc827 internlm.monitor.monitor.MonitorTracker:1
#: of
msgid "Track job status and alert to Feishu during job training."
msgstr ""
#: a1e93683cbb04d8ab825e2776e76efa7 internlm.monitor.monitor.MonitorTracker:3
#: of
msgid "The Feishu webhook address for sending alerting messages."
msgstr ""
#: 7913eeecc0904c128046e80cec1553f2 internlm.monitor.monitor.MonitorTracker:5
#: of
msgid "The interval in seconds for monitoring checks. Defaults to 300."
msgstr ""
#: 8d1abc3067584866983139dd3d85c59c internlm.monitor.monitor.MonitorTracker:7
#: of
msgid "The threshold for detecting loss value spikes. Defaults to 1.5."
msgstr ""
#: a0416fd68700450793daa2167f776618
#: internlm.monitor.monitor.MonitorTracker.run:1 of
msgid "start the monitor tracker."
msgstr ""
#: f55eb990c07b4e8f9388236dd60f0017
#: internlm.monitor.monitor.MonitorTracker.stop:1 of
msgid "Stop the monitor tracker."
msgstr ""
#: ../../source/monitor.rst:18 2202bc091aab417097a1b0268dfe6785
msgid "告警"
msgstr "Alerting"
#: ../../source/monitor.rst:20 69334f83e644455aa619dde70b8ed1f2
msgid ""
"InternLM 监控线程会周期性地检查模型训练过程中是否出现 loss spike、潜在的 training stuck、运行时异常等并捕获 "
"SIGTERM 异常信号。当出现上述情况时,将触发警报,并通过调用 "
"``internlm.monitor.alert.send_feishu_msg_with_webhook()`` 向飞书的 Webhook "
"地址发送报警消息。"
msgstr ""
"InternLM monitor thread periodically tracks loss spike, potential stuck condition, runtime exception, and SIGTERM signal. "
"When above situation occurs, an alert will be triggered and a message will be sent to the Feishu webhook address by calling "
"``internlm.monitor.alert.send_feishu_msg_with_webhook()``."
#: 15980526c2fa4ed8befa1604f271a3f1
#: internlm.monitor.alert.send_feishu_msg_with_webhook:1 of
msgid "Use Feishu robot to send messages with the given webhook."
msgstr ""
#: 38e5738c2b914c8096e1a0f345e6c0b4
#: internlm.monitor.alert.send_feishu_msg_with_webhook:3 of
msgid "The webhook to be used to send message."
msgstr ""
#: 4984f1a3bb0d46b48b2aad4fba8b43d9
#: internlm.monitor.alert.send_feishu_msg_with_webhook:5 of
msgid "The message title."
msgstr ""
#: a9822a4cf30d4947b12f70a0efe62a5e
#: internlm.monitor.alert.send_feishu_msg_with_webhook:7 of
msgid "The message body."
msgstr ""
#: 57d9ab65fe9f45c28351839fecf2f31e
#: internlm.monitor.alert.send_feishu_msg_with_webhook of
msgid "返回"
msgstr ""
#: 2b6ac97fd152498183a8624a9087812b
#: internlm.monitor.alert.send_feishu_msg_with_webhook:10 of
msgid "The response from the request. Or catch the exception and return None."
msgstr ""
#: ec45dedf976046eb909f5b7f79a7d44c
#: internlm.monitor.alert.send_feishu_msg_with_webhook of
msgid "抛出"
msgstr ""
#: 4c6aeec19a6041cfbfa577b1c5a85ac1
#: internlm.monitor.alert.send_feishu_msg_with_webhook:12 of
msgid "An exception rasied by the HTTP post request."
msgstr ""

View File

@ -1,456 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2023, InternLM Team
# This file is distributed under the same license as the InternLM package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2023.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-07 10:56+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
"Language-Team: en <LL@li.org>\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../source/parallel.rst:2 28d82a05db464e35aa3ec83e36597214
msgid "并行训练"
msgstr "Parallel Training"
#: ../../source/parallel.rst:6 f5c2eef4812640fca0aeaef62a2d85d4
msgid ""
"InternLM 支持张量并行、流水线并行、序列并行、数据并行和 ZeRO1.5 "
"等并行化训练策略。在初始化分布式环境时,我们需要指定张量并行大小、流水线并行大小、数据并行大小以及 ZeRO1.5 策略。"
msgstr ""
"InternLM supports tensor parallel, pipeline parallel, sequence parallel, data parallel, and ZeRO1.5 "
"to parallelize the training pipeline. When initializing the distributed environment, we need to specify "
"tensor parallel size, pipeline parallel size, data parallel size, and ZeRO1.5 strategy."
#: ../../source/parallel.rst:8 649c52696a734a0c86d3d5377193aba5
msgid ""
"InternLM 的并行设置由配置文件中的 ``parallel`` 字段指定,用户可以通过修改配置文件 `config file "
"<https://github.com/InternLM/InternLM/blob/main/configs/7B_sft.py>`_ "
"来更改并行配置。以下是一个并行训练配置示例:"
msgstr ""
"The parallel setting of InternLM is fully config-driven, and you can change the parallelism by modifying "
"`config file <https://github.com/InternLM/InternLM/blob/main/configs/7B_sft.py>`_. An exmaple parallel "
"training configuration can be defined as follows:"
#: ../../source/parallel.rst:19 a06ae11e51ea479b9501ada103c9d071
msgid "zero1zero 并行策略,分如下三种情况,默认值为 -1"
msgstr "zero1: zero parallel strategy, divided into the following three cases, the default value is -1"
#: ../../source/parallel.rst:21 08005d5cdde84057b870495d9683c7be
msgid "当 ``zero1 <= 0``,则 zero1 进程组的大小等于数据并行进程组的大小,因此优化器状态参数将在数据并行范围内分配"
msgstr "When ``zero1 <= 0``, the size of the zero1 process group is equal to the size of the data parallel process group, so the optimizer state parameters will be split within the data parallel range."
#: ../../source/parallel.rst:22 fe30803c0aec4b70847ac40b68641e05
msgid "当 ``zero1 == 1``,则不使用 zero1 ,所有数据并行组保留完整的优化器状态参数"
msgstr "When ``zero1 == 1``, zero1 is not used, and all data parallel groups retain the complete optimizer state parameters."
#: ../../source/parallel.rst:23 e0acea7d80094e018fab75404ec25163
msgid ""
"当 ``zero1 > 1`` 且 ``zero1 <= data_parallel_world_size``,则 zero1 "
"进程组是数据并行进程组的子集"
msgstr "When ``zero1 > 1`` and ``zero1 <= data_parallel_world_size``, the zero1 process group is a subset of the data parallel process group."
#: ../../source/parallel.rst:25 17bba79e2e884993a602df9cf20d2489
msgid "tensor张量并行大小通常是每个节点的 GPU 数量,默认值为 1"
msgstr "tensor: tensor parallel size, usually the number of GPUs per node, the default value is 1"
#: ../../source/parallel.rst:26 3bda721a03a144f28f33d360a87cbf83
msgid "pipeline流水线并行策略"
msgstr "pipeline: pipeline parallel strategy"
#: ../../source/parallel.rst:28 2b10f2b57ef64fcc872d036a7ad82b03
msgid "size流水线并行大小默认值为 1"
msgstr "size: pipeline parallel size, the default value is 1"
#: ../../source/parallel.rst:29 49c8a409e60244c49514a27780ae39a3
msgid "interleaved_overlapbool 类型,交错式调度时,开启或关闭通信优化,默认值为 False"
msgstr "interleaved_overlap: bool type, when interleaved scheduling, enable or disable communication optimization, the default value is False"
#: ../../source/parallel.rst:31 e4ff81960c434b78847174787f0423e2
msgid "sequence_parallel是否开启序列化并行默认值为 False"
msgstr "sequence_parallel: whether to enable sequence parallelism, the default value is False"
#: ../../source/parallel.rst:33 a24f4bc81fea48619ae2720e0cb6a392
msgid "注意:数据并行大小 = 总的 GPU 数目 / 流水线并行大小 / 张量并行大小"
msgstr "Note: `Data parallel size = Total number of GPUs / Pipeline parallel size / Tensor parallel size`"
#: ../../source/parallel.rst:36 a93fc45f855c4ca7901ccbe23bf14edc
msgid "张量并行"
msgstr "Tensor Parallel"
#: ../../source/parallel.rst:38 cce9e8f3c8f14c1c96c63273baceb164
msgid ""
"InternLM 的张量并行实现方案基于 `flash attention <https://github.com/Dao-AILab"
"/flash-attention>`_, 主要对 `attention "
"<https://github.com/InternLM/InternLM/blob/main/internlm/model/multi_head_attention.py>`_"
" 和 `linear "
"<https://github.com/InternLM/InternLM/blob/main/internlm/model/linear.py>`_"
" 这两个模块进行张量并行操作。"
msgstr ""
"The implementation of tensor parallel for InternLM is based on `flash attention <https://github.com/Dao-AILab/flash-attention>`_, "
"which has tensor parallel extensions to parallelize `attention <https://github.com/InternLM/InternLM/blob/main/internlm/model/multi_head_attention.py>`_ "
"and `linear <https://github.com/InternLM/InternLM/blob/main/internlm/model/linear.py>`_ blocks in InternLM model. "
#: ../../source/parallel.rst:41 f98a4b36ffdf4381a03899b605346be6
msgid "用户可通过配置文件中的 ``parallel.tensor`` 字段来设置张量并行大小。"
msgstr "To use tensor parallel, you need to set the value of tensor parallel size ``parallel.tensor`` in the config file, which is usually the number of GPUs per node."
#: ../../source/parallel.rst:47 956804e7cde441989212f7eb505e8815
msgid "张量并行,采用自 `flash-attention <https://arxiv.org/pdf/2205.14135.pdf>`_"
msgstr "Tensor parallel, adopted from `flash-attention <https://arxiv.org/pdf/2205.14135.pdf>`_"
#: ../../source/parallel.rst:50 a6424fd0ff0246fcadf56436260fadb6
msgid "流水线并行"
msgstr "Pipeline Parallel"
#: ../../source/parallel.rst:52 f2c163418fed432a8f3f59f1a5229e88
msgid ""
"InternLM 在流水线并行中使用 `1F1B <https://arxiv.org/pdf/2104.04473.pdf>`_ "
"1F1B一次前向传递后跟一次反向传递策略。对于 1F1B 策略,有两种实现方式:"
msgstr "InternLM uses `1F1B <https://arxiv.org/pdf/2104.04473.pdf>`_ (one forward pass followed by one backward pass) for pipeline parallel. For 1F1B strategy, there are two implementations:"
#: ../../source/parallel.rst:54 43f3b988e2924fe9968b9d049b46ffa0
msgid "非交错调度器,内存高效。"
msgstr "non-interleaved scheduler, which is memory-efficient"
#: ../../source/parallel.rst:55 7a45446082c441d48d49b6be661ea8d2
msgid "交错调度器内存高效且时间高效GPU空泡较少。"
msgstr "interleaved scheduler, which is both memory-efficient and time-efficient."
#: ../../source/parallel.rst:61 92f2a168d7794811b56f9bb3bc170982
msgid "1F1B 流水线并行调度器,采用自 `Megatron-LM <https://arxiv.org/pdf/2104.04473.pdf>`_"
msgstr "Non-interleaved and interleaved scheduler for 1F1B pipeline parallelism, adopted from `Megatron-LM <https://arxiv.org/pdf/2104.04473.pdf>`_"
#: ../../source/parallel.rst:64 a6d3df0b74b14b158a04ddda3e904004
msgid "非交错式流水线调度"
msgstr "scheduler for non-interleaved 1F1B strategy"
#: ../../source/parallel.rst:65 1fa48743f39a44a29d78fb7f9eed5a52
msgid "如果要使用非交错式调度, 需要设置 ``model.num_chunks = 1``。"
msgstr "To use non-interleaved pipeline scheduler, users need to set ``model.num_chunks = 1`` in the config file."
#: 57206dc0bc734686841c363c88839708
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler:1 of
msgid ""
"A helper schedule class for pipeline parallelism running environment. It "
"uses non-interleaved 1F1B strategy. Other properties are similar as "
":class:`NonPipelineSchedule`."
msgstr ""
#: 6475fee6f3cd462ba1073a641b322e12 7060a021efb0459598f49f74e8e7185b
#: 9218fee47e5542cab88ac65ff0054068 d1be8d5479fb48f59be379548ee24bd9
#: d41da940b4a84cd0822c3f94c2eaf344 f5654fe6eacc49dba5baa1d058df5d29
#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.pre_processing
#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step
#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.zero_grad of
msgid "参数"
msgstr ""
#: 567e2a87a45245469af9f8709e020a20
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler:5 of
msgid "The number of microbatches."
msgstr ""
#: 6d3b2256ea9c4897bf72f551f8b4696b
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler:7 of
msgid "Type of data. torch.float by default."
msgstr ""
#: 6e36198f5ed344f7ad02f56aec9a333c
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler:9 of
msgid ""
"The post processing function which receives a micro batch of data, and it"
" will be executed in `load_micro_batch`."
msgstr ""
#: ffae9611bd854615af1ced927f72c556
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler:12 of
msgid "Specified shape in pipeline communication."
msgstr ""
#: 31d45af550334cb8a94142da335b9724
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler:14 of
msgid ""
"If set to `True`, communication will be reduced over pipeline when using "
"1D tensor parallelization."
msgstr ""
#: 5c852dc7866f4e50ab87c15b86d338f2
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler:16 of
msgid "List of scheduler hooks."
msgstr ""
#: 4ebec38a972b4c31a59f1fc824d51f62
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.pre_processing:1
#: of
msgid "To perform actions before running the schedule."
msgstr ""
#: d491d0dfa1bf41708150cc57567ac0f0
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.pre_processing:3
#: of
msgid "InternLM engine for training and inference."
msgstr ""
#: bc5dc62440b94825b192ad2e28641976
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step:1
#: of
msgid ""
"Runs non-interleaved 1F1B schedule, with communication between pipeline "
"stages. Returns a tuple with losses if the last stage, an empty tuple "
"otherwise."
msgstr ""
#: 765809e448b644678a9fb822f6427a94 99c948f562e343aabdecac2d43650f59
#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:4
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step:4
#: of
msgid "Colossalai engine for training and inference."
msgstr ""
#: 31af7a46c5a645628bea05ad35757dcf 4ea88ec52c5b4df79a57ab2d217de697
#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:6
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step:6
#: of
msgid ""
"Dataloader as the form of an iterator, obtained by calling "
"iter(dataloader)."
msgstr ""
#: 2deff747718449fabc5b47a1de0be52e e0d2e154ac134da28470924aa65342a1
#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:8
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step:8
#: of
msgid ""
"Whether run forward step only. Default is false. If true, no backward "
"will be run."
msgstr ""
#: 71aa2b45248c4af28525dbc1ba4a1aff d3b3c1e350334dd2a16cbb2e8c8d339a
#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:10
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step:10
#: of
msgid "Whether returns the loss value. Default is true."
msgstr ""
#: 2021eaca687148539b03f6b0b1c118c8 5c138015fb254eccae2f0df2dab45629
#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:12
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step:12
#: of
msgid "If False, the output and label won't be returned."
msgstr ""
#: 57a86115b88541b1a7220d9535058607 5dabcd12b6d844aab8039b022ad0cf1c
#: b8ccfee837a242a3abbdf9e15eaa53d8
#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step
#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step of
msgid "返回"
msgstr ""
#: 7dc47f5518e64d1095a6051184985f17 fe678c953e8149a5ade387e95d10d3b2
#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:17
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step:15
#: of
msgid "A tuple of (output, label, loss), loss and label could be None."
msgstr ""
#: a50c7c3d40e14ba8a5af06aa0cb031cb ea3574b76d604402a41fcd3874d05c9a
#: fa12b183c7534a20b61445eb9f2a2a7a
#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step
#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step of
msgid "返回类型"
msgstr ""
#: 82936eed6da5408c9361732f8fd5cb93 c46a28c21ca149d98ff625b7fdad4c03
#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:19
#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step:16
#: of
msgid "Tuple[:class:`torch.Tensor`]"
msgstr ""
#: ../../source/parallel.rst:71 d2bfdbbd9a7641c38e6957a72ac6bc97
msgid "交错式流水线调度"
msgstr "scheduler for interleaved 1F1B strategy"
#: ../../source/parallel.rst:72 395c484fef984a65a284147dc3056241
msgid "如果要使用交错式调度, 需要设置 ``model.num_chunks > 1``。"
msgstr "To use interleaved pipeline scheduler, users need to set ``model.num_chunks > 1`` in the config file."
#: 036fffe3aacc4400af38ce5252840a50
#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler:1 of
msgid "Interleaved Pipeline Scheduler."
msgstr ""
#: 1b6e63b4004e44999e3ad38382b4e308
#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:1
#: of
msgid ""
"Run interleaved 1F1B schedule (model split into model chunks), with "
"communication between pipeline stages as needed."
msgstr ""
#: 6ece1dfcdb5e408db4870d6c0f524787
#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:15
#: of
msgid ""
"A tuple of (output, label, loss), loss and label could be None. The "
"loss would be returned only in the last stage."
msgstr ""
#: ed7e5a4826f84e9eb2840e494761437f
#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:18
#: of
msgid "The loss would be returned only in the last stage."
msgstr ""
#: ../../source/parallel.rst:77 1b771fea1d434f0b8b118f1b5344dde4
msgid "值得注意的是,在使用交错式流水线调度器时可启用通信优化功能,即在 1F1B 阶段启用异步通信,以充分利用上行/下行带宽并实现通信与计算重叠。"
msgstr "Asynchronous communication will be enabled in 1F1B stage to make full use of uplink/downlink bandwidth and achieve communication overlap. "
#: ../../source/parallel.rst:79 27430e179b454d48a052b9fe6e11ecae
msgid ""
"用户需要在配置文件中设置 ``parallel.pipeline.interleaved_overlap = "
"True``。该功能启用后,将调用函数 "
"``InterleavedPipelineScheduler._run_1f1b_loop_with_overlap``,并创建 "
"``internlm.core.communication.AsynCommunicator`` 以管理异步通信。"
msgstr ""
"When ``parallel.pipeline.interleaved_overlap = True``, function ``InterleavedPipelineScheduler._run_1f1b_loop_with_overlap`` will be called and "
"``internlm.core.communication.AsynCommunicator`` will be created for managing async communication."
#: ../../source/parallel.rst:81 4e0b6269ca48430098ed4619d0f0f22f
msgid "``1F1B-without-overlap`` 和 ``1F1B-with-overlap`` 的区别如下所示:"
msgstr "The difference between 1F1B stage without overlap and 1F1B stage with overlap is shown as follows:"
#: ../../source/parallel.rst:102 8412b1f6f51c479d9cbb281763215327
msgid "序列并行"
msgstr "Sequence Parallel"
#: ../../source/parallel.rst:104 45aea8164dd244e5a730881c693eeecf
msgid ""
"序列并行是一种在不引入额外计算、通信和内存开销的情况下,减少层 ``layer_norm`` 和 ``dropout`` "
"操作中的激活值内存。InternLM 中的序列并行实现基于 `flash attention <https://github.com/Dao-"
"AILab/flash-attention>`_。这个并行策略有助于降低模型的内存消耗提高了模型在资源受限环境中的可扩展性。"
msgstr ""
"Sequence parallel is a technique to reduce activation memory in layer norm and dropout without additional computation, "
"communication or memory overhead. The implementation of sequence parallel for InternLM is based on `flash attention <https://github.com/Dao-AILab/flash-attention>`_. "
#: ../../source/parallel.rst:106 29836b441ee84df6a6dbe877930ba911
msgid "如果要启用序列并行, 用户需要设置 ``parallel.sequence_parallel = True``。"
msgstr "To enable sequence parallel, you need to set ``parallel.sequence_parallel = True`` in the config file."
#: ../../source/parallel.rst:112 eadcd6e77c2547998b4e132939a15856
msgid "序列并行, 采用自 flash-attention"
msgstr "Sequence parallel, adopted from flash-attention"
#: ../../source/parallel.rst:115 47a0ac84251949fab0d9d8d34efb8751
msgid "数据并行"
msgstr "Data Parallel"
#: ../../source/parallel.rst:117 938ad5a1cbc846bab36e8d2f4804a685
msgid "InternLM 支持数据并行。数据并行大小为:"
msgstr "InternLM supports data parallel. For data parallel:"
#: ../../source/parallel.rst:119 1e8691a5ff4a4b40ae24815c681f7306
msgid ""
"`Data parallel size = Total number of GPUs / Pipeline parallel size / "
"Tensor parallel size`"
msgstr ""
#: ../../source/parallel.rst:122 c417e2af4e8e45ca8ca18ad39e96dadd
msgid "ZeRO1.5"
msgstr ""
#: ../../source/parallel.rst:124 9c05b4baf8a04e4b8a0f204c4e30cc9c
msgid ""
"ZeRO1.5 的实现使用了分层分片的概念,通过配置值 ``parallel.zero1`` "
"启用了本地节点内的分片。这个方法有助于有效管理和分配模型参数和梯度,以减少内存使用并提高训练效率。"
msgstr "The implementation of ZeRO1.5 uses the concept of hierarchical sharding via config value ``parallel.zero1``, which enables sharding within local nodes."
#: ../../source/parallel.rst:126 48c994fe37d54c35bbf81f4be070e151
msgid "当 ``parallel.zero1 <= 0``,则 zero1 进程组的大小等于数据并行进程组的大小,因此优化器状态参数将在数据并行范围内分配"
msgstr "If ``parallel.zero1 <= 0``, the size of the zero process group is equal to the size of the dp process group, so parameters will be divided within the range of dp."
#: ../../source/parallel.rst:127 3d31193758e24a08b1e90eae21259f71
msgid "当 ``parallel.zero1 == 1``,则不使用 zero1 ,所有数据并行组保留完整的优化器状态参数"
msgstr "If ``parallel.zero1 == 1``, zero is not used, and all dp groups retain the full amount of model parameters."
#: ../../source/parallel.rst:128 fb5c43d2ac75423cabc12ba1512df25e
msgid ""
"当 ``parallel.zero1 > 1`` 且 ``parallel.zero1 <= "
"data_parallel_world_size``,则 zero1 进程组是数据并行进程组的子集"
msgstr "If ``parallel.zero1 > 1`` and ``parallel.zero1 <= dp world size``, the world size of zero is a subset of dp world size. For smaller models, it is usually a better choice to split the parameters within nodes with a setting ``parallel.zero1 <= 8``."
#: ../../source/parallel.rst:130 47f03cea956a4477854591363359cdb3
msgid ""
"此外,用户可以在配置文件中通过 ``hybrid_zero_optimizer`` "
"字段启用优化器的通信优化功能,设置桶大小,以及梯度剪裁等参数。这些设置有助于优化训练过程中的通信和计算效率,以及梯度的处理方式。"
msgstr "Furthermore, you can enable communication-computation overlap, set bucket reduce size, gradient clipping parameters in the config file."
#: ../../source/parallel.rst:144 dfc63103d4e341ccb7df8ef031e29f4e
msgid "这里有两个值得关注的通信优化点:"
msgstr "There are two communication optimizations worth paying attention to here:"
#: ../../source/parallel.rst:146 e4815f887d8f48368be01339b5e64d18
msgid ""
"overlap_sync_grad: 如果设置为 ``True``,则将训练的 ``backward pass`` 与梯度的 ``all-"
"reduce`` 通信重叠"
msgstr "overlap_sync_grad: If set True, overlapping training backward pass with gradients' all-reduce communication."
#: ../../source/parallel.rst:147 bcb1aedd8a89441488b211cd81d4f80c
msgid ""
"overlap_sync_param: 如果设置为 ``True``,则将参数的 ``broadcast`` 通信与下一步的 ``forward "
"pass`` 进行重叠"
msgstr "overlap_sync_param: If set True, overlapping parameters' broadcast communication with next step's forward pass."
#: ../../source/parallel.rst:149 3ba64e4762084e93ba62a70c909e7d82
msgid "这些优化可以加速训练过程,提高训练效率。"
msgstr "These optimizations can speed up the training process and improve training efficiency."
#: 757dad6b9916403c83042b49eaa35ae5
#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer:1 of
msgid "Hybrid Zero Optimizer."
msgstr ""
#: 83bcd49c056446f6806a55e6138579f2
#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.zero_grad:1
#: of
msgid ""
"Set parameter gradients to zero. If set_to_none = True, gradient will be "
"set to None to save memory."
msgstr ""
#: 2d3da89d360c458f80844f9caed6c316
#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.zero_grad:4
#: of
msgid "Whether set the gradient to None. Default value is True."
msgstr ""
#: 4164523156dc460cbbeaa17feed3c689
#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step:1 of
msgid "Performs a single optimization step."
msgstr ""
#: 5c68dace1ec649bfa849b6652051daac
#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step:3 of
msgid "A closure that reevaluates the model and returns the loss."
msgstr ""
#: 91e366d604ce48afa6b92666ece87b85
#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step:7 of
msgid "Whether the gradient is success updated, and the gradient."
msgstr ""

View File

@ -1,174 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2023, InternLM Team
# This file is distributed under the same license as the InternLM package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2023.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-14 11:05+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
"Language-Team: en <LL@li.org>\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../source/profiler.rst:2
msgid "性能分析"
msgstr "Profiler"
#: ../../source/profiler.rst:7
msgid "Torch Profiler"
msgstr ""
#: ../../source/profiler.rst:9
msgid ""
"InternLM 使用 ``internlm.train.initialize_llm_profile()`` "
"来收集和分析模型训练或推理期间的性能数据,如 CPU/CUDA/memory 等性能数据。这个实现基于 `torch.profiler "
"<https://pytorch.org/docs/stable/profiler.html>`_ ,输出的性能分析 trace 文件可以使用 "
"`tensorboard <https://www.tensorflow.org/tensorboard?hl=en>`_ 进行可视化。"
msgstr ""
"InternLM uses ``internlm.train.initialize_llm_profile()`` to profile "
"performance data, execution time duration and breakdown analysis of step "
"time. The implementation is based on `torch.profiler "
"<https://pytorch.org/docs/stable/profiler.html>`_ and output tracing "
"files can be visualized with `tensorboard <https://www.tensorflow.org/tensorboard?hl=en>`_."
#: ../../source/profiler.rst:11
msgid ""
"用户如果想使用这个 torch 性能分析工具,需要在启动训练时传递 ``--profiling`` 参数以启用性能分析。完成 torch "
"性能分析后,用户可以在 ``{JOB_NAME}/{start_time}/traces/rank{}_dp{}_tp{}_pp{}`` "
"文件夹中看到性能分析结果。"
msgstr ""
"To use this torch profiler tool, you need to enable profiling by passing "
"the ``--profiling`` flag when starting training. After torch profiling is"
" completed, you can find the profiling results in the "
"``{JOB_NAME}/{start_time}/traces/rank{}_dp{}_tp{}_pp{}`` folder."
#: ../../source/profiler.rst:13
msgid "实际运行生成的 ``Torch Profiler`` 目录结构如下:"
msgstr ""
"The directory structure of ``Torch Profiler`` generated files is as "
"follows:"
#: ../../source/profiler.rst:22
msgid "其中, ``traces`` 可以通过 ``TensorBoard`` 可视化,运行命令"
msgstr ""
"Among them, ``traces`` can be visualized through ``TensorBoard`` and run "
"with the command"
#: ../../source/profiler.rst:29
msgid ""
"在打开的 ``TensorBoard -> PyTorch Profiler -> Views -> Trace`` "
"页面可以看到Operator和GPU Kernel的性能分析时间线如下更多的功能请参考 `torch profiler with "
"tensorboard "
"<https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html"
"#pytorch-profiler-with-tensorboard>`_"
msgstr ""
"In the opened ``TensorBoard -> PyTorch Profiler -> Views -> Trace`` page,"
" you can see the timeline of profiled operators and GPU kernels. For more"
" usage, please refer to `torch profiler with tensorboard "
"<https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html"
"#pytorch-profiler-with-tensorboard>`_"
#: internlm.train.training_internlm.initialize_llm_profile:1 of
msgid "Initialize and return the profiler context manager instance."
msgstr ""
#: ../../source/profiler.rst:38
msgid "Memory Profiler"
msgstr ""
#: ../../source/profiler.rst:40
msgid ""
"InternLM 提供了一个实用的内存分析工具 "
"``internlm.utils.simple_memory_profiler.SimpleMemoryProfiler`` 来监控实际的 GPU"
" 内存使用情况。在实现中,会对模型数据(包括模型参数、模型梯度和优化器状态)和非模型数据(包括激活值)分别进行详细的统计。"
msgstr ""
"InternLM provides a practical solution "
"``internlm.utils.simple_memory_profiler.SimpleMemoryProfiler`` to monitor"
" actual GPU memory usage. In the implmentation, model data (including "
"model parameters, model gradients, and optimizer states) and non-model "
"data (including activations) are calculated."
#: ../../source/profiler.rst:42
msgid ""
"要使用这个内存分析工具,用户需要在启动训练时传递 ``--profiling`` 参数以启用内存分析。完成内存分析后,用户可以在 "
"``memory_trace/rank{}_dp{}_tp{}`` 文件夹中找到特定 rank "
"对应的内存分析结果(包括不同时间点的内存使用日志和显示总体内存使用情况的太阳图表)。"
msgstr ""
"To use this memory profiler tool, you need to enable profiling by passing"
" the ``--profiling`` flag when starting training. After memory profiling "
"is completed, you can find the profiling results (including logs of "
"memory usage at different time point and sunburst charts showing overall "
"memory usage) for a specific rank device in the "
"``memory_trace/rank{}_dp{}_tp{}`` folder."
#: ../../source/profiler.rst:44
msgid "实际运行生成的 ``memory_trace`` 目录结构如下:"
msgstr "The directory structure of ``memory_trace`` generated files is as follows:"
#: ../../source/profiler.rst:107
msgid "其中, ``memory.log`` 的内容示例如下:"
msgstr "An example of ``memory.log`` is as follows:"
#: ../../source/profiler.rst:157
msgid "模型参数的太阳图示例如下:"
msgstr "An example of model parameters sunburst chart is as follows:"
#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler:1 of
msgid "A memory profiler for a llm model."
msgstr ""
#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler
#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.point of
msgid "参数"
msgstr ""
#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler:3 of
msgid "The model to profile."
msgstr ""
#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler:5 of
msgid "The optimizer used for training the model."
msgstr ""
#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler:7 of
msgid "The file to write the memory state information to."
msgstr ""
#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler:9 of
msgid "number of steps to trace."
msgstr ""
#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.point:1 of
msgid "Record the memory state."
msgstr ""
#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.point:3 of
msgid "The options to include in the memory state. Defaults to \"\"."
msgstr ""
#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.point:5 of
msgid "Whether to create a new memory record file. Defaults to False."
msgstr ""
#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.point
#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.step of
msgid "返回"
msgstr ""
#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.point:8
#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.step:3 of
msgid "None"
msgstr ""
#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.step:1 of
msgid "Update the memory state of the optimizer state."
msgstr ""

View File

@ -1,24 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2023, InternLM Team
# This file is distributed under the same license as the InternLM package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2023.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-07 10:56+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
"Language-Team: en <LL@li.org>\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../source/qa.rst:2 e3b22a39640a40cfb527068a7f4bbfc9
msgid "问&答"
msgstr "Q&A"

View File

@ -1,161 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2023, InternLM Team
# This file is distributed under the same license as the InternLM package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2023.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-14 12:23+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
"Language-Team: en <LL@li.org>\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../source/training.rst:2
msgid "训练 API"
msgstr "Training API"
#: ../../source/training.rst:4
msgid ""
"InternLM 的训练 API 由 ``internlm.core.trainer.Trainer`` "
"管理。在定义了训练引擎和调度器之后,我们可以调用 Trainer API 来执行模型训练、评估、梯度清零和参数更新等。"
msgstr ""
"InternLM training API is managed in ``internlm.core.trainer.Trainer``. "
"After defining the training engine and runtime scheduler, we can call "
"training API to perform training, evaluation, zero gradients and "
"parameter update steps."
#: ../../source/training.rst:6
msgid "有关详细用法,请参阅 Trainer API 文档和示例。"
msgstr ""
"For detailed usage, please refer to Trainer API documentation and "
"examples."
#: internlm.core.trainer.Trainer:1 of
msgid ""
"This is a class tending for easy deployments of users' training and "
"evaluation instead of writing their own scripts."
msgstr ""
#: internlm.core.trainer.Trainer internlm.core.trainer.Trainer.execute_schedule
#: of
msgid "参数"
msgstr ""
#: internlm.core.trainer.Trainer:4 of
msgid "Engine responsible for the process function."
msgstr ""
#: internlm.core.trainer.Trainer:6 of
msgid "Runtime schedule. Defaults to None."
msgstr ""
#: internlm.core.trainer.Trainer.engine:1 of
msgid ""
"Returns the engine that responsible for managing the training and "
"evaluation process."
msgstr ""
#: internlm.core.trainer.Trainer.schedule:1 of
msgid "Returns the runtime scheduler."
msgstr ""
#: internlm.core.trainer.Trainer.uses_pipeline:1 of
msgid "Returns whether the pipeline parallel is used or not."
msgstr ""
#: internlm.core.trainer.Trainer.train:1 of
msgid "Sets the model to training mode."
msgstr ""
#: internlm.core.trainer.Trainer.eval:1 of
msgid "Sets the model to evaluation mode."
msgstr ""
#: internlm.core.trainer.Trainer.zero_grad:1 of
msgid "Sets the gradient of all parameters in the model to zero."
msgstr ""
#: internlm.core.trainer.Trainer.step:1 of
msgid "Executes the parameter update step."
msgstr ""
#: internlm.core.trainer.Trainer.execute_schedule:1 of
msgid ""
"Runs the forward, loss computation, and backward for the model. Returns a"
" tuple of (output, label, loss)."
msgstr ""
#: internlm.core.trainer.Trainer.execute_schedule:4 of
msgid "The data iterator."
msgstr ""
#: internlm.core.trainer.Trainer.execute_schedule:6 of
msgid "Additional keyword arguments."
msgstr ""
#: internlm.core.trainer.Trainer.execute_schedule of
msgid "返回"
msgstr ""
#: internlm.core.trainer.Trainer.execute_schedule:8 of
msgid "A tuple of (output, label, loss)."
msgstr ""
#: internlm.core.trainer.Trainer.execute_schedule of
msgid "返回类型"
msgstr ""
#: internlm.core.trainer.Trainer.execute_schedule:9 of
msgid "Tuple[:class:`torch.Tensor`]"
msgstr ""
#~ msgid "InternLM 的训练流程可以归纳为两个步骤:"
#~ msgstr "The training process of InternLM can be summarized into two steps: "
#~ msgid "初始化"
#~ msgstr "Initialization"
#~ msgid "初始化模型、优化器、数据加载器、Trainer生成不同种类的进程组为混合并行的迭代训练做准备。"
#~ msgstr ""
#~ "Initialize model, optimizer, dataloader, "
#~ "trainer, and create different types of"
#~ " process groups to prepare for "
#~ "iterative steps of hybrid parallel "
#~ "training. "
#~ msgid "初始化Logger、Checkpoint管理器、Monitor管理器、Profiler对迭代训练的过程观察、预警、记录。"
#~ msgstr ""
#~ "Initialize logger, checkpoint manager, monitor"
#~ " manager, and profiler to watch, "
#~ "alert, and record the iterative training"
#~ " steps. "
#~ msgid "迭代训练"
#~ msgstr "Iterative training steps"
#~ msgid "根据配置文件定义的张量并行、流水线并行、数据并行的大小,加载训练引擎和调度器进行混合并行训练。"
#~ msgstr ""
#~ "Load the training engine and scheduler"
#~ " for hybrid parallel training according "
#~ "to the configuration such as tensor "
#~ "parallel size, pipeline parallel size, "
#~ "and data parallel size. "
#~ msgid "在迭代训练中,调用 Trainer API 进行梯度置零,前向传播计算损失并反向传播,参数更新。"
#~ msgstr ""
#~ "In iterative training steps, the Trainer"
#~ " API is called to perform zero "
#~ "gradients, forward-loss-backward, and "
#~ "parameter update."
#~ msgid "InternLM训练流程图"
#~ msgstr "InternLM training process"

View File

@ -1,366 +0,0 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2023, InternLM Team
# This file is distributed under the same license as the InternLM package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2023.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-11 14:25+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
"Language-Team: en <LL@li.org>\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../../usage.md:2
msgid "使用教程"
msgstr "Quickstart Guide"
#: ../../../usage.md:4
msgid ""
"启动一个 Demo "
"模型训练,需要进行三项准备,**安装****数据集准备**和**模型训练配置**。接下来,首先会介绍数据准备相关的操作,再简要描述模型训练配置相关的内容。"
msgstr ""
"To start a demo model training, you need to prepare three things: "
"**installation**, **dataset preparation**, and **model training "
"configuration**. In this guide, we will first cover the steps for dataset"
" preparation and then briefly describe the model training configuration."
#: ../../../usage.md:6
msgid "安装"
msgstr "Installation"
#: ../../../usage.md:7
msgid "请参考[安装文档](./install.md)进行安装。"
msgstr ""
"Please refer to the [installation guide](./install.md) for instructions "
"on how to install the necessary dependencies."
#: ../../../usage.md:9
msgid "数据准备 (预训练)"
msgstr "Dataset Preparation (Pre-training)"
#: ../../../usage.md:11
msgid "InternLM训练任务的数据集包括一系列的`bin`和`meta`文件。使用`tokenizer`从原始文本文件生成训练用数据集。通过在`tools/tokenizer.py`中指定模型参数路径的方式来导入tokenizer模型。目前提供`V7_sft.model`来生成tokens。若想使用不同的模型可直接修改`tokernizer.py`中的模型参数路径。"
msgstr ""
"The dataset for the InternLM training task includes a series of `bin` and"
" `meta` files. A `tokenizer` is used to generate the training dataset "
"from the original text files. The tokenizer model is imported by "
"specifying the model parameter path in `tools/tokenizer.py`. Currently, "
"`V7_sft.model` is provided to generate tokens. If you want to use a "
"different model, you can directly modify the model parameter path in "
"`tokenizer.py`."
#: ../../../usage.md:13
msgid "可以运行以下命令生成原始数据对应的`bin`和`meta`文件,其中参数`text_input_path`表示原始文本数据路径,目前支持`txt`、`json`和`jsonl`三种输入格式,`bin_output_path`表示生成的`bin`文件的保存路径。"
msgstr ""
"You can run the following command to generate `bin` and `meta` files "
"corresponding to the original data. The parameter `text_input_path` "
"represents the path of the original text data, currently supporting "
"`txt`, `json`, and `jsonl` formats, while `bin_output_path` represents "
"the save path of the generated `bin` files."
#: ../../../usage.md:18
msgid "下面是一个数据处理的例子:"
msgstr "Here is an example of data processing:"
#: ../../../usage.md:20
msgid "给定一个包含原始数据集的文件`raw_data.txt`,原始数据集如下所示:"
msgstr ""
"Given a file `raw_data.txt` containing the raw dataset, the raw dataset "
"is shown below:"
#: ../../../usage.md:27
msgid "可以通过运行以下命令来生成`bin`和`meta`文件:"
msgstr ""
"You can generate the `bin` and `meta` files by running the following "
"command:"
#: ../../../usage.md:32
msgid "需要注意的是,生成的`bin`文件需要保存在`cn`或者`en`或者`code`或者`ja`或者`ar`或者`kaoshi`这六个目录下,以区分数据集的类型。"
msgstr ""
"It should be noted that the generated `bin` files need to be saved in one"
" of the following directories: `cn`, `en`, `code`, `ja`, `ar`, or "
"`kaoshi`, depending on the type of dataset."
#: ../../../usage.md:34
msgid "其中,`cn`表示中文数据集;`en`表示英文数据集;`code`表示代码数据集;`ja`表示日语数据集;`ar`表示阿拉伯语数据集;`kaoshi`表示考试数据集。"
msgstr ""
"Here, `cn` represents the Chinese dataset, `en` represents the English "
"dataset, `code` represents the code dataset, `ja` represents the Japanese"
" dataset, `ar` represents the Arabic dataset, and `kaoshi` represents the"
" exam dataset."
#: ../../../usage.md:36
msgid "生成的bin文件的格式如下"
msgstr "The format of the generated `bin` files is as follows:"
#: ../../../usage.md:42
msgid "`bin`文件中的每一行均对应原始数据集中的每一个句子,表示每个句子的`token`下文将用sequence指定。"
msgstr ""
"Each line in the `bin` file corresponds to each sentence in the original "
"dataset, representing the tokens of each sentence (referred to as "
"sequence below)."
#: ../../../usage.md:44
msgid "生成的`meta`文件的格式如下:"
msgstr "The format of the generated `meta` file is as follows:"
#: ../../../usage.md:48
msgid ""
"在`meta`文件中,每个元组对应着`bin`文件中每一个`sequence`的元信息。其中,元组的第一个元素表示每个`sequence`在所有`sequence`中的`starting"
" index`,第二个元素表示每个`sequence`中有多少个`tokens`。"
msgstr ""
"Each tuple in the `meta` file represents the meta information of each "
"`sequence`, where the first element in the tuple indicates the `starting "
"index` of each `sequence` among all `sequences`, and the second element "
"indicates the number of `tokens` for each `sequence`."
#: ../../../usage.md:50
msgid ""
"例如,对于第一个`sequence``starting index`为 0有 11 "
"个`tokens`;对于第二个`sequence`,由于第一个`sequence`转换为`string`后的长度为`89`,因此它的`starting"
" index`为 90有 15 个`tokens`。"
msgstr ""
"For example, the first `sequence` starts at index 0 and has 16 `tokens`. "
"The second `sequence` starts at index 110 and has 24 `tokens`."
#: ../../../usage.md:52
msgid "`json`和`jsonl`类型的文件的`bin`和`meta`文件格式和`txt`一致,此处不再赘叙。"
msgstr ""
"The `bin` and `meta` file formats for `json` and `jsonl` type files are "
"the same as for `txt`, so we won't go over them here."
#: ../../../usage.md:54
msgid "数据准备 (微调)"
msgstr "Data Preparation (Fine-tuning)"
#: ../../../usage.md:56
msgid ""
"微调任务的数据集格式与预训练任务保持一致,生成的数据格式为一系列的`bin`和`meta`文件。以下以 Alpaca "
"数据集为例,介绍微调的数据准备流程。"
msgstr ""
"The data format for fine-tuning tasks is the same as for pre-training "
"tasks, which consists of a series of `bin` and `meta` files. Let's take "
"the Alpaca dataset as an example to explain the data preparation process "
"for fine-tuning."
#: ../../../usage.md:58
msgid ""
"下载 [Alpaca 数据集](https://github.com/tatsu-"
"lab/stanford_alpaca/blob/main/alpaca_data.json)"
msgstr ""
"Download the [Alpaca dataset](https://github.com/tatsu-"
"lab/stanford_alpaca/blob/main/alpaca_data.json)."
#: ../../../usage.md:60
msgid "对 Alpaca 数据进行 tokenize使用以下命令"
msgstr "Tokenize the Alpaca dataset using the following command:"
#: ../../../usage.md:66
msgid "建议用户参考 alpaca_tokenizer.py 编写新的脚本对自己的数据集进行 tokenize"
msgstr ""
"It is recommended that users refer to alpaca_tokenizer.py to write new "
"scripts to tokenize their own datasets"
#: ../../../usage.md:68
msgid "训练配置"
msgstr "Training Configuration"
#: ../../../usage.md:70
#, fuzzy
msgid "以 7B Demo 的配置文件`configs/7B_sft.py`为例:"
msgstr ""
"Taking the configuration file `configs/7B_sft.py` for the 7B demo as an "
"example,"
#: ../../../usage.md:237
msgid "接下来将详细介绍启动一个模型训练所需要进行的数据、模型、并行和监控等相关的配置。"
msgstr ""
"let's discuss the data, model, parallel and monitoring configurations "
"required to start a model training."
#: ../../../usage.md:239
msgid "数据配置"
msgstr "Data Configuration"
#: ../../../usage.md:240
msgid "数据相关的关键参数配置及释义如下所示:"
msgstr "Here are the key parameters and their explanations for data configuration:"
#: ../../../usage.md:255
msgid "![pack_into_one](./imgs/pack_into_one.png)"
msgstr ""
#: ../../../usage.md:255
msgid "pack_into_one"
msgstr ""
#: ../../../usage.md:258
msgid "目前支持传入数据集文件路径`train_folder`,且要求文件格式如下:"
msgstr ""
"Currently, it supports passing the dataset file path `train_folder`, and "
"the file format is required to be as follows:"
#: ../../../usage.md:265
msgid "数据集的详细内容可参考``数据准备``模块相关的介绍。"
msgstr ""
"For detailed information about the dataset, please refer to the \"Data "
"Preparation\" section."
#: ../../../usage.md:267
msgid "模型配置"
msgstr "Model Configuration"
#: ../../../usage.md:269
msgid "如果在启动训练时要加载模型 `checkpoint`,可进行如下相关配置:"
msgstr ""
"If you want to load a model checkpoint when starting the training, you "
"can configure it as follows:"
#: ../../../usage.md:282
msgid "注意:"
msgstr "Note:"
#: ../../../usage.md:283
msgid "路径若以 `local:` 为前缀,则存储在本地文件系统;若以 `boto3:` 为前缀,则存储在远程 oss 上"
msgstr ""
"If the path starts with `local:`, it means the file is stored in the "
"local file system. If it starts with `boto3:`, it means the file is "
"stored in the remote OSS."
#: ../../../usage.md:285
msgid "模型相关关键参数配置如下所示:"
msgstr "The configuration for the model is as follows:"
#: ../../../usage.md:309
msgid "注意:用户可自定义模型类型名和模型结构,并配置相对应的模型参数。通过`utils/registry.py`下的`MODEL_INITIALIZER`对象进行模型初始化函数接口注册,在训练主函数`train.py`中初始化模型时,可通过`model_type`配置获取指定的模型初始化接口函数。"
msgstr ""
"Note: Users can customize the model type name and model structure, and "
"configure the corresponding model parameters. The model initialization "
"function interface can be registered through the `MODEL_INITIALIZER` "
"object in `utils/registry.py`. When initializing the model in the "
"training main function `train.py`, the specified model initialization "
"interface function can be obtained through the `model_type` "
"configuration."
#: ../../../usage.md:311
msgid ""
"*如果基于 InternLM 7B继续训练可以参考 "
"[ModelZoo](https://github.com/InternLM/InternLM/tree/main#model-zoo) 中 "
"OpenXLab 链接下载权重*"
msgstr ""
"*If you want to start training based on InternLM 7B, you can refer to "
"OpenXLab [ModelZoo](https://github.com/InternLM/InternLM/tree/main#model-"
"zoo) to download weights*."
#: ../../../usage.md:313
msgid "并行配置"
msgstr "Parallel Configuration"
#: ../../../usage.md:315
msgid "训练并行配置样例如下:"
msgstr "Training parallel configuration example:"
#: ../../../usage.md:324
msgid "zero1zero 并行策略,分如下三种情况,默认值为 -1"
msgstr ""
"zero1: zero parallel strategy, divided into the following three cases, "
"default value is -1"
#: ../../../usage.md:325
msgid "当`zero1 <= 0`,则 zero1 进程组的大小等于数据并行进程组的大小,因此优化器状态参数将在数据并行范围内分配"
msgstr ""
"When `zero1 <= 0`, the size of the zero1 process group is equal to the "
"size of the data parallel process group, so the optimizer state "
"parameters will be split within the data parallel range."
#: ../../../usage.md:326
msgid "当`zero1 == 1`,则不使用 zero1 ,所有数据并行组保留完整的优化器状态参数"
msgstr ""
"When `zero1 == 1`, zero1 is not used, and all data parallel groups retain"
" the complete optimizer state parameters."
#: ../../../usage.md:327
msgid "当`zero1 > 1`且`zero1 <= data_parallel_world_size`,则 zero1 进程组是数据并行进程组的子集"
msgstr ""
"When `zero1 > 1` and `zero1 <= data_parallel_world_size`, the zero1 "
"process group is a subset of the data parallel process group."
#: ../../../usage.md:328
msgid "tensor张量并行大小通常是每个节点的 GPU 数量,默认值为 1"
msgstr ""
"tensor: tensor parallel size, usually the number of GPUs per node, "
"default is 1"
#: ../../../usage.md:329
msgid "pipeline流水线并行策略"
msgstr "pipeline: pipeline parallel strategy"
#: ../../../usage.md:330
msgid "size流水线并行大小默认值为 1"
msgstr "size: pipeline parallel size, the default value is 1"
#: ../../../usage.md:331
msgid "interleaved_overlapbool 类型,交错式调度时,开启或关闭通信优化,默认值为关闭"
msgstr ""
"interleaved_overlap: bool type, when interleaved scheduling, enable or "
"disable communication optimization, the default value is False"
#: ../../../usage.md:332
msgid "sequence_parallel是否开启序列化并行默认值为 False"
msgstr ""
"sequence_parallel: Whether to enable sequence parallelism, the default "
"value is False"
#: ../../../usage.md:334
msgid "注意:`数据并行大小 = 总的 GPU 数目 / 流水线并行大小 / 张量并行大小`"
msgstr ""
"Note: `Data parallel size = Total number of GPUs / Pipeline parallel size"
" / Tensor parallel size`"
#: ../../../usage.md:336
msgid "启动训练"
msgstr "Start Training"
#: ../../../usage.md:338
msgid "完成了以上数据集准备和相关训练配置后,可启动 Demo 训练。接下来分别以 slurm 和 torch 环境为例,介绍训练启动方式。"
msgstr ""
"After completing the data preparation and relevant training "
"configurations mentioned above, you can start the demo training. The "
"following examples demonstrate how to start the training in both slurm "
"and torch environments."
#: ../../../usage.md:340
msgid "若在 slurm 上启动分布式运行环境,多节点 16 卡的运行命令如下所示:"
msgstr ""
"If you want to start distributed training on slurm with 16 GPUs across "
"multiple nodes, use the following command:"
#: ../../../usage.md:345
msgid "若在 torch 上启动分布式运行环境,单节点 8 卡的运行命令如下所示:"
msgstr ""
"If you want to start distributed training on torch with 8 GPUs on a "
"single node, use the following command:"
#: ../../../usage.md:350
msgid "运行结果"
msgstr "Training Results"
#: ../../../usage.md:352
msgid "以 slurm 上单机 8 卡的 Demo 训练配置为例,训练结果日志展示如下:"
msgstr ""
"Taking the configuration of the demo training on a single machine with 8 "
"GPUs on slurm as an example, the training result log is shown below:"
#~ msgid "`load_model_only_folder`与`load_ckpt_folder`不能同时设置"
#~ msgstr ""
#~ "`load_model_only_folder` and `load_ckpt_folder` "
#~ "cannot be set at the same time."

View File

@ -1,35 +0,0 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
if "%1" == "" goto help
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd

View File

@ -1,11 +0,0 @@
Sphinx
sphinx-autobuild
sphinx_rtd_theme
sphinx_markdown_tables
autodoc_pydantic==1.9
enum_tools
numpy
torch
tqdm
pyecharts
myst-parser

View File

@ -1,12 +0,0 @@
Model Checkpointing
===================
InternLM uses ``internlm.utils.model_checkpoint.CheckpointManager`` to manage model checkpointing. ``CheckpointManager.try_save_checkpoint(train_state)`` can be used to save the model state at a given step.
InternLM also supports automatically loading the latest checkpoint at startup and automatically saving a checkpoint when a termination signal is received during training.
Checkpointing
-------------
.. autoclass:: internlm.utils.model_checkpoint.CheckpointManager
:members:
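A minimal sketch of the pattern described above follows; only ``try_save_checkpoint(train_state)`` is taken from the text, while the constructor arguments and the surrounding training-loop objects (``gpc.config.ckpt``, ``train_state``) are assumptions used for illustration.
.. code-block:: python

    from internlm.utils.model_checkpoint import CheckpointManager

    ckpt_manager = CheckpointManager(
        ckpt_config=gpc.config.ckpt,  # assumed: the `ckpt` dict from the training config
        model=model,
        optimizer=optimizer,
        lr_scheduler=lr_scheduler,
    )

    # Inside the training loop, after each optimizer step:
    ckpt_manager.try_save_checkpoint(train_state)  # saves when the configured step is reached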

View File

@ -1,103 +0,0 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
import os
import sys
project = "InternLM"
copyright = "2023, InternLM Team"
author = "InternLM Team"
with open("../../../version.txt", "r") as f:
release = f.readline().rstrip()
master_doc = "index"
autodoc_member_order = "bysource"
# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
extensions = [
"sphinx_rtd_theme",
"sphinx.ext.viewcode",
"sphinx.ext.autodoc",
"sphinxcontrib.autodoc_pydantic",
"sphinx.ext.autosectionlabel",
"sphinx.ext.napoleon",
"myst_parser",
]
pygments_style = "sphinx"
# autodoc_pydantic config
autodoc_pydantic_model_show_field_summary = False
autodoc_pydantic_field_signature_prefix = " "
autodoc_pydantic_model_signature_prefix = "class"
autodoc_pydantic_model_show_json = False
autodoc_pydantic_model_show_config_summary = False
autodoc_pydantic_model_show_config_member = False
autodoc_pydantic_model_show_validator_summary = False
autodoc_pydantic_model_show_validator_members = False
autodoc_pydantic_model_summary_list_order = "bysource"
autodoc_pydantic_model_member_order = "bysource"
autodoc_pydantic_field_list_validators = False
# Napoleon settings
napoleon_google_docstring = True
napoleon_numpy_docstring = True
napoleon_include_init_with_doc = False
napoleon_include_private_with_doc = False
napoleon_include_special_with_doc = True
napoleon_use_admonition_for_examples = False
napoleon_use_admonition_for_notes = False
napoleon_use_admonition_for_references = False
napoleon_use_ivar = False
napoleon_use_param = True
napoleon_use_rtype = True
napoleon_preprocess_types = False
napoleon_type_aliases = None
napoleon_attr_annotations = True
templates_path = ["_templates"]
exclude_patterns = []
# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
html_theme = "sphinx_rtd_theme"
html_static_path = []
# GitHub integration
html_context = {
"display_github": True,
"github_user": "InternLM",
"github_repo": "InternLM",
"github_version": "main",
"conf_py_path": "/doc/code-docs/source/",
}
sys.path.insert(0, os.path.abspath("../../../"))
# Prepend module names to class descriptions
add_module_names = True
autoclass_content = "class"
autodoc_mock_imports = [
"apex",
"torch",
"numpy",
]
# support multi-language docs
language = "zh_CN"
locale_dirs = ["../locales/"] # path is example but recommended.
gettext_compact = False # optional.
gettext_uuid = False # optional.

View File

@ -1,202 +0,0 @@
30B Demo
================
训练配置
----------------
30B demo 训练配置文件样例如下:
.. code-block:: python
JOB_NAME = "30b_train"
SEQ_LEN = 2048
HIDDEN_SIZE = 6144
NUM_ATTENTION_HEAD = 48
MLP_RATIO = 8 / 3
NUM_LAYER = 60
VOCAB_SIZE = 103168
MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
# Ckpt folder format:
# fs: 'local:/mnt/nfs/XXX'
SAVE_CKPT_FOLDER = "local:llm_ckpts"
LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
# boto3 Ckpt folder format:
# import os
# BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint
# SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm"
# LOAD_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm/snapshot/1/"
CHECKPOINT_EVERY = 50
ckpt = dict(
enable_save_ckpt=False, # enable ckpt save.
save_ckpt_folder=SAVE_CKPT_FOLDER, # Path to save training ckpt.
# load_ckpt_folder=LOAD_CKPT_FOLDER, # Ckpt path to resume training(load weights and scheduler/context states).
# load_model_only_folder=MODEL_ONLY_FOLDER, # Path to initialize with given model weights.
load_optimizer=True, # Whether to load optimizer states when continuing training.
checkpoint_every=CHECKPOINT_EVERY,
async_upload=True, # async ckpt upload. (only works for boto3 ckpt)
async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/", # path for temporary files during asynchronous upload.
snapshot_ckpt_folder="/".join([SAVE_CKPT_FOLDER, "snapshot"]), # directory for snapshot ckpt storage path.
oss_snapshot_freq=int(CHECKPOINT_EVERY / 2), # snapshot ckpt save frequency.
)
TRAIN_FOLDER = "/path/to/dataset"
VALID_FOLDER = "/path/to/dataset"
data = dict(
seq_len=SEQ_LEN,
# micro_num means the number of micro_batch contained in one gradient update
micro_num=4,
# packed_length = micro_bsz * SEQ_LEN
micro_bsz=2,
# defaults to the value of micro_num
valid_micro_num=4,
# defaults to 0, means disable evaluate
valid_every=50,
pack_sample_into_one=False,
total_steps=50000,
skip_batches="",
rampup_batch_size="",
# Datasets with less than 50 rows will be discarded
min_length=50,
# train_folder=TRAIN_FOLDER,
# valid_folder=VALID_FOLDER,
)
grad_scaler = dict(
fp16=dict(
# the initial loss scale, defaults to 2**16
initial_scale=2**16,
# the minimum loss scale, defaults to None
min_scale=1,
# the number of steps to increase loss scale when no overflow occurs
growth_interval=1000,
),
# the multiplication factor for increasing loss scale, defaults to 2
growth_factor=2,
# the multiplication factor for decreasing loss scale, defaults to 0.5
backoff_factor=0.5,
# the maximum loss scale, defaults to None
max_scale=2**24,
# the number of overflows before decreasing loss scale, defaults to 2
hysteresis=2,
)
hybrid_zero_optimizer = dict(
# Enable low_level_optimizer overlap_communication
overlap_sync_grad=True,
overlap_sync_param=True,
# bucket size for nccl communication params
reduce_bucket_size=512 * 1024 * 1024,
# grad clipping
clip_grad_norm=1.0,
)
loss = dict(
label_smoothing=0,
)
adam = dict(
lr=1e-4,
adam_beta1=0.9,
adam_beta2=0.95,
adam_beta2_c=0,
adam_eps=1e-8,
weight_decay=0.01,
)
lr_scheduler = dict(
total_steps=data["total_steps"],
init_steps=0, # optimizer_warmup_step
warmup_ratio=0.01,
eta_min=1e-5,
last_epoch=-1,
)
beta2_scheduler = dict(
init_beta2=adam["adam_beta2"],
c=adam["adam_beta2_c"],
cur_iter=-1,
)
model = dict(
checkpoint=False, # The proportion of layers for activation checkpointing; the optional values are True/False/[0-1]
num_attention_heads=NUM_ATTENTION_HEAD,
embed_split_hidden=True,
vocab_size=VOCAB_SIZE,
embed_grad_scale=1,
parallel_output=True,
hidden_size=HIDDEN_SIZE,
num_layers=NUM_LAYER,
mlp_ratio=MLP_RATIO,
apply_post_layer_norm=False,
dtype="torch.float16", # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
norm_type="rmsnorm",
layer_norm_epsilon=1e-5,
use_flash_attn=True,
num_chunks=1, # if num_chunks > 1, interleaved pipeline scheduler is used.
)
"""
zero1 parallel:
1. if zero1 <= 0, The size of the zero process group is equal to the size of the dp process group,
so parameters will be divided within the range of dp.
2. if zero1 == 1, zero is not used, and all dp groups retain the full amount of model parameters.
3. zero1 > 1 and zero1 <= dp world size, the world size of zero is a subset of dp world size.
For smaller models, it is usually a better choice to split the parameters within nodes with a setting <= 8.
pipeline parallel (dict):
1. size: int, the size of pipeline parallel.
2. interleaved_overlap: bool, enable/disable communication overlap when using interleaved pipeline scheduler.
tensor parallel: tensor parallel size, usually the number of GPUs per node.
"""
parallel = dict(
zero1=-1,
tensor=4,
pipeline=dict(size=1, interleaved_overlap=True),
sequence_parallel=False,
)
cudnn_deterministic = False
cudnn_benchmark = False
Start Training
----------------
With the training configuration above in place, model training can be started. Taking the ``slurm`` platform as an example, the command to launch training on two nodes with 16 GPUs is as follows:
.. code-block:: bash
srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./configs/30B_sft.py
Training Results
----------------
Based on the above configuration and launch command, part of the training log on two nodes with 16 GPUs is shown below:
.. code-block:: bash
2023-09-06 10:29:26,629 INFO parallel_context.py:508 in set_device -- process rank 10 is bound to host:HOST-10-140-66-20 device: 2
2023-09-06 10:29:26,632 INFO parallel_context.py:508 in set_device -- process rank 11 is bound to host:HOST-10-140-66-20 device: 3
2023-09-06 10:29:26,634 INFO parallel_context.py:508 in set_device -- process rank 12 is bound to host:HOST-10-140-66-20 device: 4
2023-09-06 10:29:26,636 INFO parallel_context.py:508 in set_device -- process rank 9 is bound to host:HOST-10-140-66-20 device: 1
2023-09-06 10:29:26,640 INFO parallel_context.py:508 in set_device -- process rank 15 is bound to host:HOST-10-140-66-20 device: 7
2023-09-06 10:29:26,639 INFO parallel_context.py:508 in set_device -- process rank 0 is bound to host:HOST-10-140-66-9 device: 0
2023-09-06 10:29:26,641 INFO parallel_context.py:508 in set_device -- process rank 2 is bound to host:HOST-10-140-66-9 device: 2
2023-09-06 10:29:26,643 INFO parallel_context.py:508 in set_device -- process rank 5 is bound to host:HOST-10-140-66-9 device: 5
2023-09-06 10:29:26,645 INFO parallel_context.py:508 in set_device -- process rank 6 is bound to host:HOST-10-140-66-9 device: 6
2023-09-06 10:29:26,661 INFO parallel_context.py:508 in set_device -- process rank 13 is bound to host:HOST-10-140-66-20 device: 5
2023-09-06 10:29:26,707 INFO parallel_context.py:508 in set_device -- process rank 1 is bound to host:HOST-10-140-66-9 device: 1
2023-09-06 10:29:26,826 INFO parallel_context.py:508 in set_device -- process rank 4 is bound to host:HOST-10-140-66-9 device: 4
2023-09-06 10:29:26,871 INFO parallel_context.py:508 in set_device -- process rank 7 is bound to host:HOST-10-140-66-9 device: 7
2023-09-06 10:29:26,932 INFO parallel_context.py:508 in set_device -- process rank 3 is bound to host:HOST-10-140-66-9 device: 3
2023-09-06 10:29:27,156 INFO parallel_context.py:508 in set_device -- process rank 14 is bound to host:HOST-10-140-66-20 device: 6
2023-09-06 10:29:27,271 INFO parallel_context.py:508 in set_device -- process rank 8 is bound to host:HOST-10-140-66-20 device: 0
2023-09-06 10:29:32,060 INFO launch.py:329 in launch -- Distributed environment is initialized, data parallel size: 4, pipeline parallel size: 1, tensor parallel size: 4
2023-09-06 10:30:06,141 INFO hybrid_zero_optim.py:291 in _partition_param_list -- Number of elements on ranks: [1782007296, 1812307968, 1812307968, 1706469888], rank:0
2023-09-06T10:30:38.216+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=15224 : tflops=40.00268401421643 step=0 loss=11.548227310180664 tgs (tokens/gpu/second)=227.37 lr=9.779754323328192e-05 loss_scale=65536.0 grad_norm={'0_default': 61.5836932112004} micro_num=4 num_consumed_tokens=65536 inf_nan_skip_batches=0 num_samples_in_batch=18 largest_length=2048 largest_batch=6 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=12.51 acc=0.0 perplexity=104121.5547 acc/en=0.0 acc/cn=0.0 acc/code=0.0 tokens/en=60571 tokens/cn=0 tokens/code=0 loss_from_metric=11.5533 loss/en=11.5533 loss/cn=nan loss/code=nan
2023-09-06T10:30:46.343+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=15224 : tflops=89.00005814543725 step=1 loss=6.05580997467041 tgs (tokens/gpu/second)=505.86 lr=9.140576474687264e-05 loss_scale=65536.0 grad_norm={'0_default': 27.397946290506887} micro_num=4 num_consumed_tokens=131072 inf_nan_skip_batches=0 num_samples_in_batch=19 largest_length=2048 largest_batch=6 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=7.91 acc=0.0885 perplexity=405.4076 acc/en=0.0885 acc/cn=0.0 acc/code=0.0 tokens/en=60265 tokens/cn=0 tokens/code=0 loss_from_metric=6.0049 loss/en=6.0049 loss/cn=nan loss/code=nan
2023-09-06T10:30:51.443+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=15224 : tflops=142.5138940898651 step=2 loss=5.054169654846191 tgs (tokens/gpu/second)=810.03 lr=8.14503363531613e-05 loss_scale=65536.0 grad_norm={'0_default': 10.438111430093606} micro_num=4 num_consumed_tokens=196608 inf_nan_skip_batches=0 num_samples_in_batch=17 largest_length=2048 largest_batch=5 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=4.87 acc=0.0715 perplexity=184.2986 acc/en=0.0715 acc/cn=0.0 acc/code=0.0 tokens/en=60244 tokens/cn=0 tokens/code=0 loss_from_metric=5.2166 loss/en=5.2166 loss/cn=nan loss/code=nan
2023-09-06T10:30:56.509+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=15224 : tflops=143.56131674769466 step=3 loss=4.662276268005371 tgs (tokens/gpu/second)=815.98 lr=6.890576474687264e-05 loss_scale=65536.0 grad_norm={'0_default': 9.15959986316653} micro_num=4 num_consumed_tokens=262144 inf_nan_skip_batches=0 num_samples_in_batch=17 largest_length=2048 largest_batch=5 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=4.83 acc=0.0775 perplexity=102.6568 acc/en=0.0775 acc/cn=0.0 acc/code=0.0 tokens/en=60328 tokens/cn=0 tokens/code=0 loss_from_metric=4.6314 loss/en=4.6314 loss/cn=nan loss/code=nan
2023-09-06T10:31:01.552+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=15224 : tflops=143.85087291011183 step=4 loss=4.020431041717529 tgs (tokens/gpu/second)=817.63 lr=5.500000000000001e-05 loss_scale=65536.0 grad_norm={'0_default': 6.873464794412589} micro_num=4 num_consumed_tokens=327680 inf_nan_skip_batches=0 num_samples_in_batch=22 largest_length=1893 largest_batch=8 smallest_batch=4 adam_beta2=0.95 fwd_bwd_time=4.82 acc=0.0701 perplexity=69.1167 acc/en=0.0701 acc/cn=0.0 acc/code=0.0 tokens/en=61028 tokens/cn=0 tokens/code=0 loss_from_metric=4.2358 loss/en=4.2358 loss/cn=nan loss/code=nan
2023-09-06T10:31:06.830+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=15224 : tflops=142.8966468353613 step=5 loss=3.733311891555786 tgs (tokens/gpu/second)=812.2 lr=4.109423525312737e-05 loss_scale=65536.0 grad_norm={'0_default': 5.811005102730085} micro_num=4 num_consumed_tokens=393216 inf_nan_skip_batches=0 num_samples_in_batch=13 largest_length=2048 largest_batch=4 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=4.85 acc=0.0688 perplexity=46.298 acc/en=0.0688 acc/cn=0.0 acc/code=0.0 tokens/en=61004 tokens/cn=0 tokens/code=0 loss_from_metric=3.8351 loss/en=3.8351 loss/cn=nan loss/code=nan

View File

@ -1,192 +0,0 @@
7B Demo
================
Training Configuration
----------------------
A sample training configuration file for the 7B demo is shown below:
.. code-block:: python
JOB_NAME = "7b_train"
SEQ_LEN = 2048
HIDDEN_SIZE = 4096
NUM_ATTENTION_HEAD = 32
MLP_RATIO = 8 / 3
NUM_LAYER = 32
VOCAB_SIZE = 103168
MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
# Ckpt folder format:
# fs: 'local:/mnt/nfs/XXX'
SAVE_CKPT_FOLDER = "local:llm_ckpts"
LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
# boto3 Ckpt folder format:
# import os
# BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint
# SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm"
# LOAD_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm/snapshot/1/"
CHECKPOINT_EVERY = 50
ckpt = dict(
enable_save_ckpt=False, # enable ckpt save.
save_ckpt_folder=SAVE_CKPT_FOLDER, # Path to save training ckpt.
# load_ckpt_folder=LOAD_CKPT_FOLDER, # Ckpt path to resume training(load weights and scheduler/context states).
# load_model_only_folder=MODEL_ONLY_FOLDER, # Path to initialize with given model weights.
load_optimizer=True, # Whether to load optimizer states when continuing training.
checkpoint_every=CHECKPOINT_EVERY,
async_upload=True, # async ckpt upload. (only works for boto3 ckpt)
async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/", # path for temporary files during asynchronous upload.
snapshot_ckpt_folder="/".join([SAVE_CKPT_FOLDER, "snapshot"]), # directory for snapshot ckpt storage path.
oss_snapshot_freq=int(CHECKPOINT_EVERY / 2), # snapshot ckpt save frequency.
)
TRAIN_FOLDER = "/path/to/dataset"
VALID_FOLDER = "/path/to/dataset"
data = dict(
seq_len=SEQ_LEN,
# micro_num means the number of micro_batch contained in one gradient update
micro_num=4,
# packed_length = micro_bsz * SEQ_LEN
micro_bsz=2,
# defaults to the value of micro_num
valid_micro_num=4,
# defaults to 0, means disable evaluate
valid_every=50,
pack_sample_into_one=False,
total_steps=50000,
skip_batches="",
rampup_batch_size="",
# Datasets with less than 50 rows will be discarded
min_length=50,
# train_folder=TRAIN_FOLDER,
# valid_folder=VALID_FOLDER,
)
grad_scaler = dict(
fp16=dict(
# the initial loss scale, defaults to 2**16
initial_scale=2**16,
# the minimum loss scale, defaults to None
min_scale=1,
# the number of steps to increase loss scale when no overflow occurs
growth_interval=1000,
),
# the multiplication factor for increasing loss scale, defaults to 2
growth_factor=2,
# the multiplication factor for decreasing loss scale, defaults to 0.5
backoff_factor=0.5,
# the maximum loss scale, defaults to None
max_scale=2**24,
# the number of overflows before decreasing loss scale, defaults to 2
hysteresis=2,
)
hybrid_zero_optimizer = dict(
# Enable low_level_optimizer overlap_communication
overlap_sync_grad=True,
overlap_sync_param=True,
# bucket size for nccl communication params
reduce_bucket_size=512 * 1024 * 1024,
# grad clipping
clip_grad_norm=1.0,
)
loss = dict(
label_smoothing=0,
)
adam = dict(
lr=1e-4,
adam_beta1=0.9,
adam_beta2=0.95,
adam_beta2_c=0,
adam_eps=1e-8,
weight_decay=0.01,
)
lr_scheduler = dict(
total_steps=data["total_steps"],
init_steps=0, # optimizer_warmup_step
warmup_ratio=0.01,
eta_min=1e-5,
last_epoch=-1,
)
beta2_scheduler = dict(
init_beta2=adam["adam_beta2"],
c=adam["adam_beta2_c"],
cur_iter=-1,
)
model = dict(
checkpoint=False, # The proportion of layers for activation checkpointing; the optional values are True/False/[0-1]
num_attention_heads=NUM_ATTENTION_HEAD,
embed_split_hidden=True,
vocab_size=VOCAB_SIZE,
embed_grad_scale=1,
parallel_output=True,
hidden_size=HIDDEN_SIZE,
num_layers=NUM_LAYER,
mlp_ratio=MLP_RATIO,
apply_post_layer_norm=False,
dtype="torch.float16", # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
norm_type="rmsnorm",
layer_norm_epsilon=1e-5,
use_flash_attn=True,
num_chunks=1, # if num_chunks > 1, interleaved pipeline scheduler is used.
)
"""
zero1 parallel:
1. if zero1 <= 0, The size of the zero process group is equal to the size of the dp process group,
so parameters will be divided within the range of dp.
2. if zero1 == 1, zero is not used, and all dp groups retain the full amount of model parameters.
3. zero1 > 1 and zero1 <= dp world size, the world size of zero is a subset of dp world size.
For smaller models, it is usually a better choice to split the parameters within nodes with a setting <= 8.
pipeline parallel (dict):
1. size: int, the size of pipeline parallel.
2. interleaved_overlap: bool, enable/disable communication overlap when using interleaved pipeline scheduler.
tensor parallel: tensor parallel size, usually the number of GPUs per node.
"""
parallel = dict(
zero1=8,
pipeline=dict(size=1, interleaved_overlap=True),
sequence_parallel=False,
)
cudnn_deterministic = False
cudnn_benchmark = False
Start Training
----------------
With the training configuration above in place, model training can be started. Taking the ``slurm`` platform as an example, the command to launch training on a single node with 8 GPUs is as follows:
.. code-block:: bash
srun -p internllm -N 1 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./configs/7B_sft.py
Training Results
----------------
Based on the above configuration and launch command, part of the training log on a single node with 8 GPUs is shown below:
.. code-block:: bash
2023-09-05 11:47:44,649 INFO parallel_context.py:508 in set_device -- process rank 4 is bound to host:SH-IDC1-10-140-1-110 device: 4
2023-09-05 11:47:44,650 INFO parallel_context.py:508 in set_device -- process rank 3 is bound to host:SH-IDC1-10-140-1-110 device: 3
2023-09-05 11:47:44,651 INFO parallel_context.py:508 in set_device -- process rank 6 is bound to host:SH-IDC1-10-140-1-110 device: 6
2023-09-05 11:47:44,652 INFO parallel_context.py:508 in set_device -- process rank 7 is bound to host:SH-IDC1-10-140-1-110 device: 7
2023-09-05 11:47:44,652 INFO parallel_context.py:508 in set_device -- process rank 5 is bound to host:SH-IDC1-10-140-1-110 device: 5
2023-09-05 11:47:44,652 INFO parallel_context.py:508 in set_device -- process rank 1 is bound to host:SH-IDC1-10-140-1-110 device: 1
2023-09-05 11:47:44,652 INFO parallel_context.py:508 in set_device -- process rank 2 is bound to host:SH-IDC1-10-140-1-110 device: 2
2023-09-05 11:47:44,652 INFO parallel_context.py:508 in set_device -- process rank 0 is bound to host:SH-IDC1-10-140-1-110 device: 0
2023-09-05 11:47:51,006 INFO launch.py:354 in launch -- Distributed environment is initialized, data parallel size: 8, pipeline parallel size: 1, tensor parallel size: 1
2023-09-05 11:49:09,855 INFO hybrid_zero_optim.py:294 in _partition_param_list -- Number of elements on ranks: [894509056, 944865280, 966909952, 966909952, 966909952, 944865280, 966909952, 670068736], rank:0
2023-09-05T11:49:58.225+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=6794 : tflops=63.283263603947816 step=0 loss=11.641494750976562 tgs (tokens/gpu/second)=1424.93 lr=4.0000000000000003e-07 loss_scale=65536.0 grad_norm={'0_default': 66.51907327507652} micro_num=4 num_consumed_tokens=131072 inf_nan_skip_batches=0 num_samples_in_batch=19 largest_length=2048 largest_batch=6 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=6.87 acc=0.0 perplexity=112181.7188 acc/en=0.0 acc/cn=0.0 acc/code=0.0 tokens/en=120836 tokens/cn=0 tokens/code=0 loss_from_metric=11.6279 loss/en=11.6279 loss/cn=nan loss/code=nan
2023-09-05T11:50:02.553+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=6794 : tflops=171.92140761933035 step=1 loss=11.546792984008789 tgs (tokens/gpu/second)=3871.11 lr=6.000000000000001e-07 loss_scale=65536.0 grad_norm={'0_default': 64.47430144542088} micro_num=4 num_consumed_tokens=262144 inf_nan_skip_batches=0 num_samples_in_batch=16 largest_length=2048 largest_batch=5 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=4.14 acc=0.0 perplexity=103779.1406 acc/en=0.0 acc/cn=0.0 acc/code=0.0 tokens/en=120572 tokens/cn=0 tokens/code=0 loss_from_metric=11.55 loss/en=11.55 loss/cn=nan loss/code=nan
2023-09-05T11:50:06.504+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=6794 : tflops=186.0565203348341 step=2 loss=11.106071472167969 tgs (tokens/gpu/second)=4189.39 lr=8.000000000000001e-07 loss_scale=65536.0 grad_norm={'0_default': 62.520055376005146} micro_num=4 num_consumed_tokens=393216 inf_nan_skip_batches=0 num_samples_in_batch=16 largest_length=2048 largest_batch=6 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=3.82 acc=0.0001 perplexity=71139.6797 acc/en=0.0001 acc/cn=0.0 acc/code=0.0 tokens/en=122032 tokens/cn=0 tokens/code=0 loss_from_metric=11.1724 loss/en=11.1724 loss/cn=nan loss/code=nan
2023-09-05T11:50:10.487+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=6794 : tflops=185.48897918112567 step=3 loss=10.444510459899902 tgs (tokens/gpu/second)=4176.61 lr=1.0000000000000002e-06 loss_scale=65536.0 grad_norm={'0_default': 57.91057980979166} micro_num=4 num_consumed_tokens=524288 inf_nan_skip_batches=0 num_samples_in_batch=18 largest_length=2048 largest_batch=6 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=3.83 acc=0.0705 perplexity=39851.1289 acc/en=0.0705 acc/cn=0.0 acc/code=0.0 tokens/en=121125 tokens/cn=0 tokens/code=0 loss_from_metric=10.5929 loss/en=10.5929 loss/cn=nan loss/code=nan
2023-09-05T11:50:14.476+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=6794 : tflops=185.8751803758398 step=4 loss=9.798665046691895 tgs (tokens/gpu/second)=4185.31 lr=1.2000000000000002e-06 loss_scale=65536.0 grad_norm={'0_default': 48.1136933755285} micro_num=4 num_consumed_tokens=655360 inf_nan_skip_batches=0 num_samples_in_batch=14 largest_length=2048 largest_batch=4 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=3.82 acc=0.076 perplexity=18045.6699 acc/en=0.076 acc/cn=0.0 acc/code=0.0 tokens/en=121365 tokens/cn=0 tokens/code=0 loss_from_metric=9.8007 loss/en=9.8007 loss/cn=nan loss/code=nan
2023-09-05T11:50:18.442+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=6794 : tflops=185.6236609556878 step=5 loss=9.215429306030273 tgs (tokens/gpu/second)=4179.64 lr=1.4000000000000001e-06 loss_scale=65536.0 grad_norm={'0_default': 36.95489557069029} micro_num=4 num_consumed_tokens=786432 inf_nan_skip_batches=0 num_samples_in_batch=14 largest_length=2048 largest_batch=4 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=3.82 acc=0.0767 perplexity=8999.0869 acc/en=0.0767 acc/cn=0.0 acc/code=0.0 tokens/en=121223 tokens/cn=0 tokens/code=0 loss_from_metric=9.1049 loss/en=9.1049 loss/cn=nan loss/code=nan

View File

@ -1,18 +0,0 @@
Training Examples
==================
7B Demo
------------
.. toctree::
:maxdepth: 2
7B_demo
30B Demo
------------
.. toctree::
:maxdepth: 2
30B_demo

View File

@ -1,95 +0,0 @@
.. InternLM documentation master file, created by
sphinx-quickstart on Mon Aug 28 17:33:28 2023.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
InternLM
========
Environment Setup
-------------------
.. toctree::
:maxdepth: 2
install
Quickstart
-------------------
.. toctree::
:maxdepth: 2
usage
Training Setup
-------------------
.. toctree::
:maxdepth: 2
initialize
Training API
-------------------
.. toctree::
:maxdepth: 2
training
Parallel Training
-------------------
.. toctree::
:maxdepth: 2
parallel
Model Checkpointing
--------------------
.. toctree::
:maxdepth: 2
checkpoint
Profiling
-------------------
.. toctree::
:maxdepth: 2
profiler
Monitoring
-------------------
.. toctree::
:maxdepth: 2
monitor
Training Examples
-------------------
.. toctree::
:maxdepth: 2
example/index
FAQ
-------------------
.. toctree::
:maxdepth: 2
qa
Indices and Tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

View File

@ -1,108 +0,0 @@
Training Setup
==============
The InternLM training workflow can be summarized in two steps:
1. Initialization
* Initialize the model, optimizer, data loader, and Trainer, and create the different kinds of process groups needed for hybrid-parallel iterative training.
* Initialize the Logger, Checkpoint manager, Monitor manager, and Profiler to observe, alert on, and record the iterative training process.
2. Iterative training
* Load the training engine and scheduler for hybrid-parallel training according to the tensor, pipeline, and data parallel sizes defined in the configuration file.
* In each training iteration, call the Trainer API to zero the gradients, run the forward pass to compute the loss, run the backward pass, and update the parameters.
.. figure:: ../../imgs/hybrid_parallel_training.png
:scale: 45%
:class: with-border
InternLM training workflow
.. _InternLM-args:
Command-Line Argument Parsing
-----------------------------
InternLM uses the `argparse <https://docs.python.org/3/library/argparse.html>`_ library to supply command-line arguments to the InternLM runtime.
Users can call ``internlm.initialize.get_default_parser()`` to obtain InternLM's default parser, which contains some built-in arguments; custom arguments can be added to this parser.
.. code-block:: python
# Get InternLM default parser
parser = internlm.initialize.get_default_parser()
# Add new argument
parser.add_argument("--user_arg", type=int, default=-1, help="arguments add by user.")
cmd_args = parser.parse_args()
.. autofunction:: internlm.initialize.get_default_parser
.. _InternLM-model-init:
Model Initialization
-------------------------
.. autofunction:: internlm.train.initialize_model
InternLM uses the ``model_type`` and ``model`` fields in the configuration file to control model initialization. An example model initialization configuration is defined as follows:
.. code-block:: python
model_type = "INTERNLM" # default is "INTERNLM", used to register classes and modules for model initialization
NUM_ATTENTION_HEAD = 32
VOCAB_SIZE = 103168
HIDDEN_SIZE = 4096
NUM_LAYER = 32
MLP_RATIO = 8 / 3
model = dict(
checkpoint=False, # The proportion of layers for activation checkpointing; the optional values are True/False/[0-1]
num_attention_heads=NUM_ATTENTION_HEAD,
embed_split_hidden=True,
vocab_size=VOCAB_SIZE,
embed_grad_scale=1,
parallel_output=True,
hidden_size=HIDDEN_SIZE,
num_layers=NUM_LAYER,
mlp_ratio=MLP_RATIO,
apply_post_layer_norm=False,
dtype="torch.bfloat16", # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
norm_type="rmsnorm",
layer_norm_epsilon=1e-5,
use_flash_attn=True,
num_chunks=1, # if num_chunks > 1, interleaved pipeline scheduler is used.
)
- The ``model_type`` field specifies the type of model to initialize.
- The parameters in the ``model`` field specify the settings used during model initialization.
Notably, users can define new model types and register the corresponding model initialization function with the decorator ``@MODEL_INITIALIZER.register_module``, where ``MODEL_INITIALIZER`` is an instance of the class ``internlm.util.registry.Registry``. An example is shown below (the function body is a placeholder):
.. code-block:: python
MODEL_TYPE = "NEW_MODEL"
@MODEL_INITIALIZER.register_module(module_name=MODEL_TYPE)
def build_new_model_with_cfg(*args, **kwargs):
    ...  # build and return the new model instance here
.. _InternLM-optim-init:
Optimizer Initialization
-------------------------
.. autofunction:: internlm.train.initialize_optimizer
.. _InternLM-dl-init:
Data Loader Initialization
--------------------------
.. autofunction:: internlm.train.get_train_data_loader
.. _InternLM-trainer-init:
Trainer Initialization
-------------------------
.. autofunction:: internlm.initialize.initialize_trainer

View File

@ -1,2 +0,0 @@
```{include} ../../install.md
```

View File

@ -1,22 +0,0 @@
Monitoring and Alerting
=======================
Monitoring
-----------------
InternLM uses ``internlm.monitor.monitor.initialize_monitor_manager()`` to initialize the monitoring context. A singleton instance of ``internlm.monitor.monitor.MonitorManager`` manages the monitoring thread and uses ``internlm.monitor.monitor.MonitorTracker`` to track the model training lifecycle and training status.
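A minimal sketch of how the monitoring context might wrap the training entry point is shown below; treating ``initialize_monitor_manager`` as a context manager and the ``job_name``/``alert_address`` arguments are assumptions for illustration only, not the verified API.
.. code-block:: python
# Hypothetical sketch: the context-manager usage and argument names are assumptions.
from internlm.monitor.monitor import initialize_monitor_manager
with initialize_monitor_manager(job_name="7b_train", alert_address=None):
    main(args)  # placeholder for the actual training entry point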
.. autofunction:: internlm.monitor.monitor.initialize_monitor_manager
.. autoclass:: internlm.monitor.monitor.MonitorManager
:members:
.. autoclass:: internlm.monitor.monitor.MonitorTracker
:members:
Alerting
-----------------
The InternLM monitoring thread periodically checks for loss spikes, potential training stalls, runtime exceptions, and similar issues during training, and catches the SIGTERM signal. When any of these occur, an alert is triggered and a message is sent to a Feishu webhook address by calling ``internlm.monitor.alert.send_feishu_msg_with_webhook()``.
.. autofunction:: internlm.monitor.alert.send_feishu_msg_with_webhook

View File

@ -1,152 +0,0 @@
Parallel Training
==================
.. Brief introduction to training parallelism, and how-to guide about config setting
InternLM supports tensor parallelism, pipeline parallelism, sequence parallelism, data parallelism, and ZeRO1.5 as parallel training strategies. When initializing the distributed environment, we need to specify the tensor parallel size, pipeline parallel size, data parallel size, and the ZeRO1.5 strategy.
InternLM's parallel settings are specified by the ``parallel`` field in the configuration file, and users can change the parallel configuration by modifying the `config file <https://github.com/InternLM/InternLM/blob/main/configs/7B_sft.py>`_. An example parallel training configuration is shown below:
.. code-block:: python
parallel = dict(
zero1=8,
tensor=1,
pipeline=dict(size=1, interleaved_overlap=True),
sequence_parallel=False,
)
- zero1: the ZeRO parallel strategy, with the following three cases; the default value is -1
- if ``zero1 <= 0``, the size of the zero1 process group equals the size of the data parallel process group, so optimizer state parameters are partitioned across the data parallel range
- if ``zero1 == 1``, zero1 is not used and every data parallel group keeps the full optimizer state parameters
- if ``zero1 > 1`` and ``zero1 <= data_parallel_world_size``, the zero1 process group is a subset of the data parallel process group
- tensor: the tensor parallel size, usually the number of GPUs per node; the default value is 1
- pipeline: the pipeline parallel strategy
- size: the pipeline parallel size; the default value is 1
- interleaved_overlap: bool, enables or disables communication overlap when using the interleaved scheduler; the default value is False
- sequence_parallel: whether to enable sequence parallelism; the default value is False
Note: data parallel size = total number of GPUs / pipeline parallel size / tensor parallel size; a short worked example follows.
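The numbers below are illustrative only:
.. code-block:: python
# Illustrative only: 16 GPUs with pipeline size 1 and tensor size 4
# give a data parallel size of 16 / 1 / 4 = 4.
total_gpus = 16
pipeline_parallel_size = 1
tensor_parallel_size = 4
data_parallel_size = total_gpus // pipeline_parallel_size // tensor_parallel_size
assert data_parallel_size == 4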
Tensor Parallelism
------------------
InternLM's tensor parallel implementation is based on `flash attention <https://github.com/Dao-AILab/flash-attention>`_ and mainly applies tensor parallelism to the `attention <https://github.com/InternLM/InternLM/blob/main/internlm/model/multi_head_attention.py>`_ and
`linear <https://github.com/InternLM/InternLM/blob/main/internlm/model/linear.py>`_ modules.
Users can set the tensor parallel size via the ``parallel.tensor`` field in the configuration file.
.. figure:: ../../imgs/tensor_parallel.png
:scale: 50%
:class: with-border
Tensor parallelism, adapted from `flash-attention <https://arxiv.org/pdf/2205.14135.pdf>`_
Pipeline Parallelism
--------------------
InternLM uses the `1F1B <https://arxiv.org/pdf/2104.04473.pdf>`_ strategy (one forward pass followed by one backward pass) for pipeline parallelism. There are two implementations of the 1F1B strategy:
1. The non-interleaved scheduler, which is memory-efficient.
2. The interleaved scheduler, which is both memory-efficient and time-efficient (fewer GPU bubbles).
.. figure:: ../../imgs/pipeline_schedule.png
:scale: 45%
:class: with-border
1F1B pipeline parallel schedulers, adapted from `Megatron-LM <https://arxiv.org/pdf/2104.04473.pdf>`_
Non-interleaved Pipeline Scheduling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To use the non-interleaved scheduler, set ``model.num_chunks = 1``.
.. autoclass:: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler
:members:
Interleaved Pipeline Scheduling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To use the interleaved scheduler, set ``model.num_chunks > 1``.
.. autoclass:: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler
:members:
Notably, the communication optimization feature can be enabled when using the interleaved pipeline scheduler, i.e., asynchronous communication is enabled during the 1F1B stages to fully utilize uplink/downlink bandwidth and overlap communication with computation.
Users need to set ``parallel.pipeline.interleaved_overlap = True`` in the configuration file. When this feature is enabled, the function ``InterleavedPipelineScheduler._run_1f1b_loop_with_overlap`` is called and an ``internlm.core.communication.AsynCommunicator`` is created to manage asynchronous communication.
The difference between ``1F1B-without-overlap`` and ``1F1B-with-overlap`` is shown below, followed by a short configuration sketch:
.. code-block:: bash
# The 1F1B stage without overlap consists of the following steps:
1. Perform the forward pass.
2. Perform the backward pass.
3. Send the forward output of this iteration to the next stage, and send the backward output of this iteration to the previous stage, and receive the forward and backward inputs for the next iteration.
.. code-block:: bash
# The 1F1B stage with overlap consists of the following steps:
1. Perform the forward pass.
2. Check if the backward input is ready.
3. Send the forward output and receive the forward input for the next iteration.
4. Perform the backward pass.
5. Check if the forward input is ready.
6. Send the backward output and receive the backward input for the next iteration.
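Putting these settings together, a minimal configuration sketch for enabling the interleaved scheduler with communication overlap (the sizes are illustrative) is:
.. code-block:: python
# Illustrative values: num_chunks > 1 selects the interleaved scheduler,
# and interleaved_overlap=True enables the 1F1B communication overlap.
model = dict(
    num_chunks=2,  # > 1: use the interleaved pipeline scheduler
)
parallel = dict(
    pipeline=dict(size=2, interleaved_overlap=True),
)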
Sequence Parallelism
--------------------
Sequence parallelism reduces the activation memory of the ``layer_norm`` and ``dropout`` operations without introducing extra computation, communication, or memory overhead. The sequence parallel implementation in InternLM is based on `flash attention <https://github.com/Dao-AILab/flash-attention>`_. This parallel strategy helps reduce the model's memory consumption and improves its scalability in resource-constrained environments.
To enable sequence parallelism, users need to set ``parallel.sequence_parallel = True``.
.. figure:: ../../imgs/sequence_parallel.png
:scale: 50%
:class: with-border
Sequence parallelism, adapted from flash-attention
Data Parallelism
-----------------
InternLM supports data parallelism. The data parallel size is:
`Data parallel size = Total number of GPUs / Pipeline parallel size / Tensor parallel size`
ZeRO1.5
-----------------
The ZeRO1.5 implementation uses the concept of hierarchical sharding, and sharding within the local node is enabled via the configuration value ``parallel.zero1``. This approach helps manage and distribute model parameters and gradients effectively, reducing memory usage and improving training efficiency. The three cases are listed below, followed by a small illustration.
1. If ``parallel.zero1 <= 0``, the size of the zero1 process group equals the size of the data parallel process group, so optimizer state parameters are partitioned across the data parallel range.
2. If ``parallel.zero1 == 1``, zero1 is not used and every data parallel group keeps the full optimizer state parameters.
3. If ``parallel.zero1 > 1`` and ``parallel.zero1 <= data_parallel_world_size``, the zero1 process group is a subset of the data parallel process group.
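The following helper is only an illustration of these three rules, not the framework's actual implementation:
.. code-block:: python
def zero1_group_size(zero1: int, dp_world_size: int) -> int:
    # Illustration only: map ``parallel.zero1`` to the number of ranks across
    # which the optimizer states are sharded.
    if zero1 <= 0:
        return dp_world_size  # shard across the whole data parallel group
    if zero1 == 1:
        return 1  # no sharding: every rank keeps the full optimizer states
    assert zero1 <= dp_world_size  # the zero1 group is a subset of the dp group
    return zero1
# e.g. with a data parallel world size of 8:
assert zero1_group_size(-1, 8) == 8
assert zero1_group_size(1, 8) == 1
assert zero1_group_size(8, 8) == 8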
In addition, users can enable the optimizer's communication optimization features, set the bucket size, and configure gradient clipping and other parameters via the ``hybrid_zero_optimizer`` field in the configuration file. These settings help optimize communication and computation efficiency during training, as well as how gradients are handled.
.. code-block:: python
hybrid_zero_optimizer = dict(
# Enable low_level_optimizer overlap_communication
overlap_sync_grad=True,
overlap_sync_param=True,
# bucket size for nccl communication params
reduce_bucket_size=512 * 1024 * 1024,
# grad clipping
clip_grad_norm=1.0,
)
There are two communication optimizations worth noting here:
- overlap_sync_grad: if set to ``True``, the training ``backward pass`` is overlapped with the ``all-reduce`` communication of gradients
- overlap_sync_param: if set to ``True``, the ``broadcast`` communication of parameters is overlapped with the ``forward pass`` of the next step
These optimizations can speed up the training process and improve training efficiency.
.. autoclass:: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer
:members:

View File

@ -1,164 +0,0 @@
Profiling
=========
.. Mainly about the usage of torch profiler and memory profiler
Torch Profiler
-----------------
InternLM uses ``internlm.train.initialize_llm_profile()`` to collect and analyze performance data such as CPU/CUDA/memory statistics during model training or inference. The implementation is based on `torch.profiler <https://pytorch.org/docs/stable/profiler.html>`_, and the generated profiling trace files can be visualized with `tensorboard <https://www.tensorflow.org/tensorboard?hl=en>`_.
To use this torch profiling tool, users need to pass the ``--profiling`` argument when launching training. After torch profiling completes, the profiling results can be found in the ``{JOB_NAME}/{start_time}/traces/rank{}_dp{}_tp{}_pp{}`` folder.
The ``Torch Profiler`` directory structure generated by an actual run is as follows:
.. code-block:: bash
# tree ./7b_train/Sep08_11-00-51/traces -L 2
./7b_train/Sep08_11-00-51/traces/
└── rank0_dp0_tp0_pp0
└── SH-IDC1-10-140-1-78_238619.1694142354680.pt.trace.json
The ``traces`` can be visualized with ``TensorBoard`` by running the following command:
.. code-block:: bash
# visualize traces with tensorboard and custom port
tensorboard --logdir rank0_dp0_tp0_pp0 --port 10088
On the opened ``TensorBoard -> PyTorch Profiler -> Views -> Trace`` page, you can see the profiling timeline of operators and GPU kernels, as shown below. For more features, please refer to `torch profiler with tensorboard <https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html#pytorch-profiler-with-tensorboard>`_.
.. figure:: ../../imgs/torch_profiler_trace.png
:scale: 45%
:class: with-border
.. autofunction:: internlm.train.initialize_llm_profile
Memory Profiler
-----------------
InternLM provides a practical memory profiling tool, ``internlm.utils.simple_memory_profiler.SimpleMemoryProfiler``, to monitor actual GPU memory usage. In the implementation, detailed statistics are collected separately for model data (including model parameters, model gradients, and optimizer states) and non-model data (including activations).
To use this memory profiling tool, users need to pass the ``--profiling`` argument when launching training. After memory profiling completes, the memory profiling results for a specific rank can be found in the ``memory_trace/rank{}_dp{}_tp{}`` folder (including memory usage logs at different points in time and sunburst charts showing the overall memory usage).
The ``memory_trace`` directory structure generated by an actual run is as follows:
.. code-block:: bash
# tree ./memory_trace -L 2
./memory_trace
├── rank0_dp0_tp0 # Profiling results for a specific rank device
│   ├── activation_memory_sunburst.html # Sunburst chart showing activation memory usage
│   ├── grads_memory_sunburst.html # Sunburst chart showing gradient memory usage
│   ├── memory.log # Log of GPU memory usage at different time points
│   ├── os_memory_sunburst.html # Sunburst chart showing optimizer state memory usage
│   ├── params_memory_sunburst.html # Sunburst chart showing parameter memory usage
│   └── summary_sunburst.html # Sunburst chart showing overall memory usage
├── rank1_dp1_tp0
│   ├── activation_memory_sunburst.html
│   ├── grads_memory_sunburst.html
│   ├── memory.log
│   ├── os_memory_sunburst.html
│   ├── params_memory_sunburst.html
│   └── summary_sunburst.html
├── rank2_dp2_tp0
│   ├── activation_memory_sunburst.html
│   ├── grads_memory_sunburst.html
│   ├── memory.log
│   ├── os_memory_sunburst.html
│   ├── params_memory_sunburst.html
│   └── summary_sunburst.html
├── rank3_dp3_tp0
│   ├── activation_memory_sunburst.html
│   ├── grads_memory_sunburst.html
│   ├── memory.log
│   ├── os_memory_sunburst.html
│   ├── params_memory_sunburst.html
│   └── summary_sunburst.html
├── rank4_dp4_tp0
│   ├── activation_memory_sunburst.html
│   ├── grads_memory_sunburst.html
│   ├── memory.log
│   ├── os_memory_sunburst.html
│   ├── params_memory_sunburst.html
│   └── summary_sunburst.html
├── rank5_dp5_tp0
│   ├── activation_memory_sunburst.html
│   ├── grads_memory_sunburst.html
│   ├── memory.log
│   ├── os_memory_sunburst.html
│   ├── params_memory_sunburst.html
│   └── summary_sunburst.html
├── rank6_dp6_tp0
│   ├── activation_memory_sunburst.html
│   ├── grads_memory_sunburst.html
│   ├── memory.log
│   ├── os_memory_sunburst.html
│   ├── params_memory_sunburst.html
│   └── summary_sunburst.html
└── rank7_dp7_tp0
├── activation_memory_sunburst.html
├── grads_memory_sunburst.html
├── memory.log
├── os_memory_sunburst.html
├── params_memory_sunburst.html
└── summary_sunburst.html
An example of the content of ``memory.log`` is shown below:
.. code-block:: bash
Memory State:
time: 37.56313228607178
---summary---
total_memory: 55953.56 MB
params_memory: 13965.51 MB, grads_memory: 13965.51 MB, os_params_memory: 3461.52 MB, os_state_memory: 6923.03 MB, activation_memory: 17638.00 MB
Memory State:
time: 38.46969723701477
---summary---
total_memory: 38315.56 MB
params_memory: 13965.51 MB, grads_memory: 13965.51 MB, os_params_memory: 3461.52 MB, os_state_memory: 6923.03 MB, activation_memory: 0.00 MB
---Layout---
params_layout:
layer: param_mem, layer_mem: 0.00 MB, total_mem: 13965.51 MB
layer: param_mem.embedding, layer_mem: 0.00 MB, total_mem: 806.00 MB
layer: param_mem.embedding.weight, layer_mem: 806.00 MB, total_mem: 806.00 MB
layer: param_mem.blocks, layer_mem: 0.00 MB, total_mem: 12353.50 MB
layer: param_mem.blocks.0, layer_mem: 0.00 MB, total_mem: 386.05 MB
layer: param_mem.blocks.0.mixer, layer_mem: 0.00 MB, total_mem: 128.03 MB
layer: param_mem.blocks.0.mixer.Wqkv, layer_mem: 0.00 MB, total_mem: 96.02 MB
layer: param_mem.blocks.0.mixer.Wqkv.weight, layer_mem: 96.00 MB, total_mem: 96.00 MB
layer: param_mem.blocks.0.mixer.Wqkv.bias, layer_mem: 0.02 MB, total_mem: 0.02 MB
layer: param_mem.blocks.0.mixer.out_proj, layer_mem: 0.00 MB, total_mem: 32.01 MB
layer: param_mem.blocks.0.mixer.out_proj.weight, layer_mem: 32.00 MB, total_mem: 32.00 MB
layer: param_mem.blocks.0.mixer.out_proj.bias, layer_mem: 0.01 MB, total_mem: 0.01 MB
layer: param_mem.blocks.0.norm1, layer_mem: 0.00 MB, total_mem: 0.01 MB
layer: param_mem.blocks.0.norm1.weight, layer_mem: 0.01 MB, total_mem: 0.01 MB
layer: param_mem.blocks.0.norm2, layer_mem: 0.00 MB, total_mem: 0.01 MB
layer: param_mem.blocks.0.norm2.weight, layer_mem: 0.01 MB, total_mem: 0.01 MB
layer: param_mem.blocks.0.mlp, layer_mem: 0.00 MB, total_mem: 258.00 MB
layer: param_mem.blocks.0.mlp.w1, layer_mem: 0.00 MB, total_mem: 86.00 MB
layer: param_mem.blocks.0.mlp.w1.weight, layer_mem: 86.00 MB, total_mem: 86.00 MB
layer: param_mem.blocks.0.mlp.w2, layer_mem: 0.00 MB, total_mem: 86.00 MB
layer: param_mem.blocks.0.mlp.w2.weight, layer_mem: 86.00 MB, total_mem: 86.00 MB
layer: param_mem.blocks.0.mlp.w3, layer_mem: 0.00 MB, total_mem: 86.00 MB
layer: param_mem.blocks.0.mlp.w3.weight, layer_mem: 86.00 MB, total_mem: 86.00 MB
......
grads_layout:
......
os_params_layout:
......
os_state_layout:
......
activation_base_layout:
......
An example sunburst chart of the model parameters is shown below:
.. figure:: ../../imgs/params_memory_sunburst.png
:scale: 50%
:class: with-border
.. autoclass:: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler
:members:

View File

@ -1,2 +0,0 @@
Q&A
=====

View File

@ -1,9 +0,0 @@
Training API
============
InternLM's training API is managed by ``internlm.core.trainer.Trainer``. After the training engine and scheduler are defined, we can call the Trainer API to perform model training, evaluation, gradient zeroing, parameter updates, and so on.
For detailed usage, please refer to the Trainer API documentation and the examples, such as the sketch below.
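The following minimal sketch illustrates one such training iteration; the method names (``train``, ``zero_grad``, ``execute_schedule``, ``step``) are assumptions based on the description above and may not match the actual API exactly.
.. code-block:: python
# Hypothetical sketch of a training loop; method names are assumptions.
trainer.train()  # switch the engine to training mode
for _ in range(total_steps):
    batch = next(train_iter)  # fetch the next batch from the data loader
    trainer.zero_grad()  # clear gradients
    _, _, loss = trainer.execute_schedule(batch, forward_only=False, return_loss=True)
    trainer.step()  # update parameters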
.. autoclass:: internlm.core.trainer.Trainer
:members:

View File

@ -1,4 +0,0 @@
```{include} ../../usage.md
:relative-docs: docs/
:relative-images:
```

View File

@ -1,86 +0,0 @@
## Installation
### Environment Preparation
The required packages and corresponding version are shown as follows:
- Python == 3.10
- GCC == 10.2.0
- MPFR == 4.1.0
- CUDA >= 11.7
- Pytorch >= 1.13.1
- Transformers >= 4.28.0
- Flash-Attention >= v1.0.5
- Apex == 23.05
- GPU with Ampere or Hopper architecture (such as H100, A100)
- Linux OS
After installing the above dependencies, some system environment variables need to be updated:
```bash
export CUDA_PATH={path_of_cuda_11.7}
export GCC_HOME={path_of_gcc_10.2.0}
export MPFR_HOME={path_of_mpfr_4.1.0}
export LD_LIBRARY_PATH=${GCC_HOME}/lib64:${MPFR_HOME}/lib:${CUDA_PATH}/lib64:$LD_LIBRARY_PATH
export PATH=${GCC_HOME}/bin:${CUDA_PATH}/bin:$PATH
export CC=${GCC_HOME}/bin/gcc
export CXX=${GCC_HOME}/bin/c++
```
### Environment Installation
Clone the project `internlm` and its dependent submodules from the github repository, as follows:
```bash
git clone git@github.com:InternLM/InternLM.git --recurse-submodules
```
It is recommended to build a Python-3.10 virtual environment using conda and install the required dependencies based on the `requirements/` files:
```bash
conda create --name internlm-env python=3.10 -y
conda activate internlm-env
cd internlm
pip install -r requirements/torch.txt
pip install -r requirements/runtime.txt
```
Install flash-attention (version v1.0.5):
```bash
cd ./third_party/flash-attention
python setup.py install
cd ./csrc
cd fused_dense_lib && pip install -v .
cd ../xentropy && pip install -v .
cd ../rotary && pip install -v .
cd ../layer_norm && pip install -v .
cd ../../../../
```
Install Apex (version 23.05):
```bash
cd ./third_party/apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ../../
```
### Environment Image
Users can build their own images with the provided dockerfile and docker.Makefile, or pull images with the InternLM runtime environment pre-installed from https://hub.docker.com/r/internlm/internlm.
#### Image Configuration and Build
The configuration and build of the Dockerfile are implemented through the docker.Makefile. To build the image, execute the following command in the root directory of InternLM:
``` bash
make -f docker.Makefile BASE_OS=centos7
```
In docker.Makefile, you can customize the base image, environment versions, etc., and the corresponding parameters can be passed directly on the command line. For BASE_OS, both ubuntu20.04 and centos7 are supported.
#### Pull Standard Image
The standard images based on ubuntu and centos have been built and can be pulled directly:
```bash
# ubuntu20.04
docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-ubuntu20.04
# centos7
docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7
```
#### Run Container
For the local standard image built with dockerfile or pulled, use the following command to run and enter the container:
```bash
docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name myinternlm internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7 bash
```
The default directory in the container is `/InternLM`, please start training according to the [Usage](./usage.md).

View File

@ -1,28 +0,0 @@
## InternLM System Structure
The system code file structure is shown below:
```bash
├── configs # Configuration module, managing model and training-related parameters
│ └── 7B_sft.py # 7B_sft.py is a sample configuration file for the system demo
├── internlm # Main directory of the system code
│ ├── apis # Interface module, containing some interface functions related to inference, etc.
│ ├── core # Core module, managing parallel context and training scheduling engine for training and inference
│ │ ├── communication # Communication module, responsible for p2p communication in pipeline parallel scheduling
│ │ ├── context # Context module, mainly responsible for initializing parallel process groups and managing parallel context
│ │ │ ├── parallel_context.py
│ │ │ └── process_group_initializer.py
│ │ ├── scheduler # Scheduling module, which manages schedulers for parallel training, including non-pipeline and pipeline parallel schedulers
│ │ │ ├── no_pipeline_scheduler.py
│ │ │ └── pipeline_scheduler.py
│ │ ├── engine.py # Responsible for managing the training and evaluation process of the model
│ │ └── trainer.py # Responsible for managing the training engine and scheduler
│ ├── data # Data module, responsible for managing dataset generation and processing
│ ├── initialize # Initialization module, responsible for managing distributed environment startup and trainer initialization
│ ├── model # Model module, responsible for managing model structure definition and implementation
│ ├── solver # Responsible for managing the implementation of optimizer and lr_scheduler, etc.
│ └── utils # Auxiliary module, responsible for managing logs, storage, model registration, etc.
├── train.py # Main function entry file for model training
├── requirements # List of dependent packages for system running
├── third_party # Third-party modules on which the system depends, including apex and flash-attention, etc.
├── tools # Some script tools for processing and converting raw datasets, model checkpoint conversion, etc.
└── version.txt # System version number
```

View File

@ -1,92 +0,0 @@
## Training Performance
InternLM deeply integrates Flash-Attention, Apex, and other high-performance model operators to improve training efficiency. Through its Hybrid Zero technique, it achieves efficient overlap of computation and communication and significantly reduces cross-node communication traffic during training. InternLM supports scaling the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternLM's scalability test data under different configurations:
| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
| TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
We tested the performance of training the 7B model in InternLM using various parallel configurations on a GPU cluster. In each test group, the number of tokens processed per GPU in a single iteration remained consistent. The hardware and parameter configurations used in the tests are shown in the table below:
| Hardware | Model |
| ----------------------- | ----------------------------- |
| GPU | nvidia_a100-sxm4-80gb |
| Memory | 2TB |
| Inter-machine bandwidth | 4 * 100Gb RoCE |
| CPU | 128 core Intel(R) Xeon(R) CPU |
| Hyperparameters | tp=1 | tp=2 |
| --------------- | ---- | ---- |
| micro_num | 4 | 4 |
| micro_bsz | 2 | 4 |
| seq_len | 2048 | 2048 |
The configuration of `zero1` in InternLM determines the allocation range of optimizer states.
- `zero1=-1` indicates that optimizer states are distributed across all data-parallel nodes (equivalent to Deepspeed Zero-1).
- In the case of `zero1=8, tp=1`, optimizer states are distributed within 8 GPUs in a single node, and the optimizer states remain consistent across different nodes.
### Throughput Measurement
Throughput is defined as TGS, the average number of tokens processed per GPU per second. In this test, the training configuration had `pack_sample_into_one=False` and `checkpoint=False`. The test results are shown in the following table. When using `zero1=8, tp=1`, InternLM achieves an acceleration efficiency of `88%` for training the 7B model on 1024 GPUs.
| Parallel Configuration | 8 GPUs | 16 GPUs | 32 GPUs | 64 GPUs | 128 GPUs | 256 GPUs | 512 GPUs | 1024 GPUs |
| ---------------------- | ------ | ------- | ------- | ------- | -------- | -------- | -------- | --------- |
| (tp=1, zero1=-1) | 4062 | 3842 | 3752 | 3690 | 3571 | 3209 | 2861 | 2271 |
| (tp=1, zero1=8) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
| (tp=2, zero1=-1) | 3822 | 3595 | 3475 | 3438 | 3308 | 3094 | 2992 | 2785 |
| (tp=2, zero1=4) | 3761 | 3658 | 3655 | 3650 | 3651 | 3653 | 3589 | 3486 |
<div align="left">
<img src="../imgs/train_performance.png" width="580"/>
</div>
### FLOPS Testing
The computational workload of model training is based on the FLOPS calculation method described in the [Megatron](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf) paper. To ensure constant FLOPS during training, the test configuration had `pack_sample_into_one=True`, `dtype=torch.bfloat16`.
When `Activation Ckpt` is enabled, the test results are shown in the table below. InternLM can achieve `>180 TFLOPS` for 7B model training with 1024 GPUs.
- TGS: Tokens per GPU per Second
- Global Bsz: The total number of processed tokens with all GPUs in a step
| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
|-|-|-|-|-|-|-|-|-|-|-|
| 1 | 8 | TRUE | TRUE | 8 | 2048 | 8 | 1 | 0.125M | 3314 | 193 |
| 1 | 8 | TRUE | TRUE | 16 | 2048 | 8 | 1 | 0.25M | 3268 | 191 |
| 1 | 8 | TRUE | TRUE | 32 | 2048 | 8 | 1 | 0.5M | 3323 | 188 |
| 1 | 8 | TRUE | TRUE | 64 | 2048 | 8 | 1 | 1M | 3217 | 188 |
| 1 | 8 | TRUE | TRUE | 128 | 2048 | 8 | 1 | 2M | 3260 | 187 |
| 1 | 8 | TRUE | TRUE | 256 | 2048 | 8 | 1 | 4M | 3215 | 187 |
| 1 | 8 | TRUE | TRUE | 512 | 2048 | 8 | 1 | 8M | 3199 | 186 |
| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 8 | 1 | 16M | 3163 | 184 |
| 1 | 8 | TRUE | TRUE | 512 | 2048 | 4 | 1 | 4M | 2963 | 173 |
| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 2 | 1 | 4M | 2341 | 136 |
| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 4 | 1 | 8M | 2796 | 160 |
When `Activation Ckpt` is turned off, the test results are as shown in the table below:
| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
|-|-|-|-|-|-|-|-|-|-|-|
| 1 | 8 | TRUE | FALSE | 8 | 2048 | 2 | 4 | 0.125M | 4103 | 183 |
| 1 | 8 | TRUE | FALSE | 16 | 2048 | 2 | 4 | 0.25M | 3939 | 177 |
| 1 | 8 | TRUE | FALSE | 32 | 2048 | 2 | 4 | 0.5M | 3919 | 176 |
| 1 | 8 | TRUE | FALSE | 64 | 2048 | 2 | 4 | 1M | 3944 | 174 |
| 1 | 8 | TRUE | FALSE | 128 | 2048 | 2 | 4 | 2M | 3928 | 173 |
| 1 | 8 | TRUE | FALSE | 256 | 2048 | 2 | 4 | 4M | 3920 | 173 |
| 1 | 8 | TRUE | FALSE | 512 | 2048 | 2 | 4 | 8M | 3900 | 173 |
| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 4 | 16M | 3625 | 160 |
| 1 | 8 | TRUE | FALSE | 512 | 2048 | 2 | 2 | 4M | 3084 | 139 |
| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 1 | 4M | 2346 | 105 |
| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 2 | 8M | 2817 | 124 |
<div align="left">
<img src="../imgs/flops.png" width="580"/>
</div>

View File

@ -1,387 +0,0 @@
## Quickstart Guide for Pre-training and Fine-tuning
To start a demo model training, you need to prepare three things: **installation**, **dataset preparation**, and **model training configuration**. In this guide, we will first cover the steps for dataset preparation and then briefly describe the model training configuration.
### Installation
Please refer to the [installation guide](./install.md) for instructions on how to install the necessary dependencies.
### Dataset Preparation (Pre-training)
The dataset for the InternLM training task includes a series of `bin` and `meta` files. A `tokenizer` is used to generate the training dataset from the original text files. The tokenizer model is imported by specifying the model parameter path in `tools/tokenizer.py`. Currently, `V7_sft.model` is provided to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.
You can run the following command to generate `bin` and `meta` files corresponding to the original data. The parameter `text_input_path` represents the path of the original text data, currently supporting `txt`, `json`, and `jsonl` formats, while `bin_output_path` represents the save path of the generated `bin` files.
```bash
$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
```
Here is an example of data processing:
Given a file `raw_data.txt` containing the raw dataset, the raw dataset is shown below:
```bash
Appreciate every detail in life to truly taste the flavor of happiness.
Dreams are the source of lifes motivation. Pursue them diligently to achieve your goals.
Learn to be tolerant and understanding to establish truly harmonious interpersonal relationships.
```
You can generate the `bin` and `meta` files by running the following command:
```bash
$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
```
It should be noted that the generated `bin` files need to be saved in one of the following directories: `cn`, `en`, `code`, `ja`, `ar`, or `kaoshi`, depending on the type of dataset.
Here, `cn` represents the Chinese dataset, `en` represents the English dataset, `code` represents the code dataset, `ja` represents the Japanese dataset, `ar` represents the Arabic dataset, and `kaoshi` represents the exam dataset.
The format of the generated `bin` files is as follows:
```python
{"tokens": [98655, 2317, 2922, 6649, 1595, 7856, 435, 2424, 442, 9556, 12807, 410, 17313, 446, 23331, 95746]}
{"tokens": [98655, 302, 1383, 269, 657, 410, 2687, 446, 2424, 98667, 269, 25220, 281, 523, 1874, 492, 1248, 38127, 4563, 442, 11227, 829, 8980, 95746]}
{"tokens": [98655, 24190, 442, 517, 15013, 649, 454, 8793, 442, 5849, 9556, 17917, 1369, 1084, 29890, 12021, 95746]}
```
Each line in the `bin` file corresponds to each sentence in the original dataset, representing the tokens of each sentence (referred to as sequence below).
The format of the generated `meta` file is as follows:
```bash
(0, 16), (110, 24), (262, 17)
```
Each tuple in the `meta` file represents the meta information of each `sequence`, where the first element in the tuple indicates the `starting index` of each `sequence` among all `sequences`, and the second element indicates the number of `tokens` for each `sequence`.
For example, the first `sequence` starts at index 0 and has 16 `tokens`. The second `sequence` starts at index 110 and has 24 `tokens`.
The `bin` and `meta` file formats for `json` and `jsonl` type files are the same as for `txt`, so we won't go over them here.
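As an illustration of how the two files relate, the sketch below uses a `meta` entry to seek into the `bin` file. It assumes the meta entries are already available as Python tuples (the on-disk serialization of the `meta` file is not described here) and interprets the first element as a byte offset into the `bin` file, which is consistent with the example above.
```python
# Illustrative sketch only; the way the .meta file is serialized on disk is not
# covered here, so the entries are written out as literal tuples.
import json

meta = [(0, 16), (110, 24), (262, 17)]  # (start of the sequence, number of tokens)

with open("cn/output.bin", "rb") as f:
    for start, n_tokens in meta:
        f.seek(start)                      # jump to the start of this sequence
        sample = json.loads(f.readline())  # each line is {"tokens": [...]}
        assert len(sample["tokens"]) == n_tokens
```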
### Data Preparation (Fine-tuning)
The data format for fine-tuning tasks is the same as for pre-training tasks, which consists of a series of `bin` and `meta` files. Let's take the Alpaca dataset as an example to explain the data preparation process for fine-tuning.
1. Download the [Alpaca dataset](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json).
2. Tokenize the Alpaca dataset using the following command:
```shell
python tools/alpaca_tokenizer.py /path/to/alpaca_dataset /path/to/output_dataset /path/to/tokenizer --split_ratio 0.1
```
It is recommended that users refer to `alpaca_tokenizer.py` when writing new scripts to tokenize their own datasets.
### Training Configuration
Taking the configuration file `configs/7B_sft.py` for the 7B demo as an example, let's discuss the data, model, parallel, and monitoring configurations required to start model training.
```python
JOB_NAME = "7b_train"
DO_ALERT = False
SEQ_LEN = 2048
HIDDEN_SIZE = 4096
NUM_ATTENTION_HEAD = 32
MLP_RATIO = 8 / 3
NUM_LAYER = 32
VOCAB_SIZE = 103168
MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
# Ckpt folder format:
# fs: 'local:/mnt/nfs/XXX'
SAVE_CKPT_FOLDER = "local:llm_ckpts"
LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
# boto3 Ckpt folder format:
# import os
# BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint
# SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm"
# LOAD_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm/snapshot/1/"
CHECKPOINT_EVERY = 50
ckpt = dict(
enable_save_ckpt=False, # enable ckpt save.
save_ckpt_folder=SAVE_CKPT_FOLDER, # Path to save training ckpt.
# load_ckpt_folder= dict(path=MODEL_ONLY_FOLDER, content=["model"], ckpt_type="normal"),
load_ckpt_folder="local:llm_ckpts/",
# 'load_ckpt_info' setting guide:
# 1. the 'path' indicates the ckpt path,
# 2. the 'content' means which states will be loaded, support: "model", "sampler", "optimizer", "scheduler", "all"
# 3. the 'ckpt_type' means the type of checkpoint to be loaded, currently only the 'internlm' type is supported.
load_ckpt_info=dict(path=MODEL_ONLY_FOLDER, content=("model",), ckpt_type="internlm"),
checkpoint_every=CHECKPOINT_EVERY,
async_upload=True, # async ckpt upload. (only works for boto3 ckpt)
async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/", # path for temporary files during asynchronous upload.
oss_snapshot_freq=int(CHECKPOINT_EVERY / 2), # snapshot ckpt save frequency.
)
TRAIN_FOLDER = "/path/to/dataset"
VALID_FOLDER = "/path/to/dataset"
data = dict(
seq_len=SEQ_LEN,
# micro_num means the number of micro_batch contained in one gradient update
micro_num=4,
# packed_length = micro_bsz * SEQ_LEN
micro_bsz=2,
# defaults to the value of micro_num
valid_micro_num=4,
# defaults to 0, means disable evaluate
valid_every=50,
pack_sample_into_one=False,
total_steps=50000,
skip_batches="",
rampup_batch_size="",
# Datasets with less than 50 rows will be discarded
min_length=50,
# train_folder=TRAIN_FOLDER,
# valid_folder=VALID_FOLDER,
empty_cache_and_diag_interval=10,
diag_outlier_ratio=1.1,
)
grad_scaler = dict(
fp16=dict(
# the initial loss scale, defaults to 2**16
initial_scale=2**16,
# the minimum loss scale, defaults to None
min_scale=1,
# the number of steps to increase loss scale when no overflow occurs
growth_interval=1000,
),
# the multiplication factor for increasing loss scale, defaults to 2
growth_factor=2,
# the multiplication factor for decreasing loss scale, defaults to 0.5
backoff_factor=0.5,
# the maximum loss scale, defaults to None
max_scale=2**24,
# the number of overflows before decreasing loss scale, defaults to 2
hysteresis=2,
)
hybrid_zero_optimizer = dict(
# Enable low_level_optimizer overlap_communication
overlap_sync_grad=True,
overlap_sync_param=True,
# bucket size for nccl communication params
reduce_bucket_size=512 * 1024 * 1024,
# grad clipping
clip_grad_norm=1.0,
)
loss = dict(
label_smoothing=0,
)
adam = dict(
lr=1e-4,
adam_beta1=0.9,
adam_beta2=0.95,
adam_beta2_c=0,
adam_eps=1e-8,
weight_decay=0.01,
)
lr_scheduler = dict(
total_steps=data["total_steps"],
init_steps=0, # optimizer_warmup_step
warmup_ratio=0.01,
eta_min=1e-5,
last_epoch=-1,
)
beta2_scheduler = dict(
init_beta2=adam["adam_beta2"],
c=adam["adam_beta2_c"],
cur_iter=-1,
)
model = dict(
checkpoint=False, # The proportion of layers for activation checkpointing; the optional values are True/False/[0-1]
num_attention_heads=NUM_ATTENTION_HEAD,
embed_split_hidden=True,
vocab_size=VOCAB_SIZE,
embed_grad_scale=1,
parallel_output=True,
hidden_size=HIDDEN_SIZE,
num_layers=NUM_LAYER,
mlp_ratio=MLP_RATIO,
apply_post_layer_norm=False,
dtype="torch.float16", # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
norm_type="rmsnorm",
layer_norm_epsilon=1e-5,
use_flash_attn=True,
num_chunks=1, # if num_chunks > 1, interleaved pipeline scheduler is used.
)
"""
zero1 parallel:
1. if zero1 <= 0, The size of the zero process group is equal to the size of the dp process group,
so parameters will be divided within the range of dp.
2. if zero1 == 1, zero is not used, and all dp groups retain the full amount of model parameters.
3. zero1 > 1 and zero1 <= dp world size, the world size of zero is a subset of dp world size.
For smaller models, it is usually a better choice to split the parameters within nodes with a setting <= 8.
pipeline parallel (dict):
1. size: int, the size of pipeline parallel.
2. interleaved_overlap: bool, enable/disable communication overlap when using interleaved pipeline scheduler.
tensor parallel: tensor parallel size, usually the number of GPUs per node.
"""
parallel = dict(
zero1=8,
pipeline=dict(size=1, interleaved_overlap=True),
sequence_parallel=False,
)
cudnn_deterministic = False
cudnn_benchmark = False
monitor = dict(
# feishu alert configs
alert=dict(
enable_feishu_alert=DO_ALERT,
feishu_alert_address=None, # feishu webhook to send alert message
light_monitor_address=None, # light_monitor address to send heartbeat
),
)
```
#### Data Configuration
Here are the key parameters and their explanations for data configuration:
```python
TRAIN_FOLDER = "/path/to/dataset"
SEQ_LEN = 2048
data = dict(
seq_len=SEQ_LEN, # Length of the data samples, default value is 2048
micro_num=1, # Number of micro_batches processed in one model parameter update, default value is 1
micro_bsz=1, # Packed_length = micro_bsz * SEQ_LEN, the size of data processed in one micro_batch, default value is 1
total_steps=50000, # Total number of steps to be executed, default value is 50000
min_length=50, # If the number of lines in the dataset file is less than 50, it will be discarded
train_folder=TRAIN_FOLDER, # Dataset file path, default value is None; if train_folder is empty, training will be done using randomly generated datasets
pack_sample_into_one=False, # Logic for data arrangement, determines whether to calculate attention based on the seq_len dimension or the actual length of the sequence
)
```
![pack_into_one](../imgs/pack_into_one.png)
Currently, it supports passing the dataset file path `train_folder`, and the file format is required to be as follows:
```bash
- folder
- code
train_000.bin
train_000.bin.meta
```
For detailed information about the dataset, please refer to the "Data Preparation" section.
#### Model Configuration
If you want to load a model checkpoint when starting the training, you can configure it as follows:
```python
SAVE_CKPT_FOLDER = "local:/path/to/save/ckpt"
LOAD_CKPT_FOLDER = "local:/path/to/load/resume/ckpt"
ckpt = dict(
save_ckpt_folder=SAVE_CKPT_FOLDER, # Path to save the model and optimizer checkpoints
checkpoint_every=float("inf"), # Save a checkpoint every specified number of steps, default value is inf
# When resuming training from a breakpoint,:
# (1) 'path' is the path of the loaded checkpoint.
# (2) 'content' indicates which state will be loaded, support: "model", "sampler", "optimizer", "scheduler", "all"
# (3) 'ckpt_type' indicates which type ckpt will be loaded, currently supported: "internlm"
load_ckpt_info=dict(path=MODEL_ONLY_FOLDER, content=("model",), ckpt_type="internlm"),
)
```
Note:
- If the path starts with `local:`, it means the file is stored in the local file system. If it starts with `boto3:`, it means the file is stored in the remote OSS.
The configuration for the model is as follows:
```python
model_type = "INTERNLM" # Model type, default value is "INTERNLM", corresponding to the model structure initialization interface function
NUM_ATTENTION_HEAD = 32
VOCAB_SIZE = 103168
HIDDEN_SIZE = 4096
NUM_LAYER = 32
MLP_RATIO = 8 / 3
model = dict(
checkpoint=False, # The proportion of layers for activation aheckpointing, the optional value are True/False/[0-1]
num_attention_heads=NUM_ATTENTION_HEAD,
embed_split_hidden=True,
vocab_size=VOCAB_SIZE,
embed_grad_scale=1,
parallel_output=True,
hidden_size=HIDDEN_SIZE,
num_layers=NUM_LAYER,
mlp_ratio=MLP_RATIO,
apply_post_layer_norm=False,
dtype="torch.bfloat16",
norm_type="rmsnorm",
layer_norm_epsilon=1e-5,
)
```
Note: Users can customize the model type name and model structure, and configure the corresponding model parameters. The model initialization function interface can be registered through the `MODEL_INITIALIZER` object in `utils/registry.py`. When initializing the model in the training main function `train.py`, the specified model initialization interface function can be obtained through the `model_type` configuration.
#### Parallel Configuration
Training parallel configuration example:
```python
parallel = dict(
zero1=8,
tensor=1,
pipeline=dict(size=1, interleaved_overlap=True),
sequence_parallel=False,
)
```
- zero1: zero parallel strategy, divided into the following three cases, default value is -1
- When `zero1 <= 0`, the size of the zero1 process group is equal to the size of the data parallel process group, so the optimizer state parameters will be split within the data parallel range.
- When `zero1 == 1`, zero1 is not used, and all data parallel groups retain the complete optimizer state parameters.
- When `zero1 > 1` and `zero1 <= data_parallel_world_size`, the zero1 process group is a subset of the data parallel process group.
- tensor: tensor parallel size, usually the number of GPUs per node, default is 1
- pipeline: pipeline parallel strategy
- size: pipeline parallel size, the default value is 1
- interleaved_overlap: bool type, when interleaved scheduling, enable or disable communication optimization, the default value is False
- sequence_parallel: Whether to enable sequence parallelism, the default value is False
Note: `Data parallel size = Total number of GPUs / Pipeline parallel size / Tensor parallel size`
### Start Training
After completing the data preparation and relevant training configurations mentioned above, you can start the demo training. The following examples demonstrate how to start the training in both slurm and torch environments.
If you want to start distributed training on slurm with 16 GPUs across multiple nodes, use the following command:
```bash
$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./configs/7B_sft.py
```
If you want to start distributed training on torch with 8 GPUs on a single node, use the following command:
```bash
$ torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_sft.py --launcher "torch"
```
### Training Results
Taking the configuration of the demo training on a single machine with 8 GPUs on slurm as an example, the training result log is shown below:
```bash
2023-07-07 12:26:58,293 INFO launch.py:228 in launch -- Distributed environment is initialized, data parallel size: 8, pipeline parallel size: 1, tensor parallel size: 1
2023-07-07 12:26:58,293 INFO parallel_context.py:535 in set_seed -- initialized seed on rank 2, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
2023-07-07 12:26:58,295 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=0===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=5===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=1===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=6===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=7===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=2===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=4===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=3===========
2023-07-07 12:28:27,826 INFO hybrid_zero_optim.py:295 in _partition_param_list -- Number of elements on ranks: [907415552, 907411456, 910163968, 910163968, 921698304, 921698304, 921698304, 921698304], rank:0
2023-07-07 12:28:57,802 INFO train.py:323 in record_current_batch_training_metrics -- tflops=63.27010355651958,step=0,loss=11.634403228759766,tgs (tokens/gpu/second)=1424.64,lr=4.0000000000000003e-07,loss_scale=65536.0,grad_norm=63.672620777841004,micro_num=4,num_consumed_tokens=131072,inf_nan_skip_batches=0,num_samples_in_batch=19,largest_length=2048,largest_batch=5,smallest_batch=4,adam_beta2=0.95,fwd_bwd_time=6.48
2023-07-07 12:29:01,636 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.83371103277346,step=1,loss=11.613704681396484,tgs (tokens/gpu/second)=4274.45,lr=6.000000000000001e-07,loss_scale=65536.0,grad_norm=65.150786641452,micro_num=4,num_consumed_tokens=262144,inf_nan_skip_batches=0,num_samples_in_batch=16,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.67
2023-07-07 12:29:05,451 INFO train.py:323 in record_current_batch_training_metrics -- tflops=190.99928472960033,step=2,loss=11.490386962890625,tgs (tokens/gpu/second)=4300.69,lr=8.000000000000001e-07,loss_scale=65536.0,grad_norm=61.57798028719357,micro_num=4,num_consumed_tokens=393216,inf_nan_skip_batches=0,num_samples_in_batch=14,largest_length=2048,largest_batch=4,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.66
2023-07-07 12:29:09,307 INFO train.py:323 in record_current_batch_training_metrics -- tflops=188.8613541410694,step=3,loss=11.099515914916992,tgs (tokens/gpu/second)=4252.55,lr=1.0000000000000002e-06,loss_scale=65536.0,grad_norm=63.5478796484391,micro_num=4,num_consumed_tokens=524288,inf_nan_skip_batches=0,num_samples_in_batch=16,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.7
2023-07-07 12:29:13,147 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.65918563194305,step=4,loss=10.149517059326172,tgs (tokens/gpu/second)=4270.52,lr=1.2000000000000002e-06,loss_scale=65536.0,grad_norm=51.582841631508145,micro_num=4,num_consumed_tokens=655360,inf_nan_skip_batches=0,num_samples_in_batch=19,largest_length=2048,largest_batch=6,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.68
2023-07-07 12:29:16,994 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.3109313713174,step=5,loss=9.822169303894043,tgs (tokens/gpu/second)=4262.67,lr=1.4000000000000001e-06,loss_scale=65536.0,grad_norm=47.10386835560855,micro_num=4,num_consumed_tokens=786432,inf_nan_skip_batches=0,num_samples_in_batch=17,largest_length=2048,largest_batch=6,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.69
```

Binary file not shown.

Before

Width:  |  Height:  |  Size: 198 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 208 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 279 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 477 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 252 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 170 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 129 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 287 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 391 KiB

View File

@ -1,86 +0,0 @@
## 环境安装
### 环境准备
首先,需要安装的依赖包及对应版本列表如下:
- Python == 3.10
- GCC == 10.2.0
- MPFR == 4.1.0
- CUDA >= 11.7
- Pytorch >= 1.13.1
- Transformers >= 4.28.0
- Flash-Attention >= v1.0.5
- Apex == 23.05
- Ampere或者Hopper架构的GPU (例如H100, A100)
- Linux OS
以上依赖包安装完成后,需要更新配置系统环境变量:
```bash
export CUDA_PATH={path_of_cuda_11.7}
export GCC_HOME={path_of_gcc_10.2.0}
export MPFR_HOME={path_of_mpfr_4.1.0}
export LD_LIBRARY_PATH=${GCC_HOME}/lib64:${MPFR_HOME}/lib:${CUDA_PATH}/lib64:$LD_LIBRARY_PATH
export PATH=${GCC_HOME}/bin:${CUDA_PATH}/bin:$PATH
export CC=${GCC_HOME}/bin/gcc
export CXX=${GCC_HOME}/bin/c++
```
### 环境安装
将项目`internlm`及其依赖子模块,从 github 仓库中 clone 下来,命令如下:
```bash
git clone git@github.com:InternLM/InternLM.git --recurse-submodules
```
推荐使用 conda 构建一个 Python-3.10 的虚拟环境, 并基于`requirements/`文件安装项目所需的依赖包:
```bash
conda create --name internlm-env python=3.10 -y
conda activate internlm-env
cd internlm
pip install -r requirements/torch.txt
pip install -r requirements/runtime.txt
```
安装 flash-attention (version v1.0.5)
```bash
cd ./third_party/flash-attention
python setup.py install
cd ./csrc
cd fused_dense_lib && pip install -v .
cd ../xentropy && pip install -v .
cd ../rotary && pip install -v .
cd ../layer_norm && pip install -v .
cd ../../../../
```
安装 Apex (version 23.05)
```bash
cd ./third_party/apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ../../
```
### 环境镜像
用户可以使用提供的 dockerfile 结合 docker.Makefile 来构建自己的镜像,或者也可以从 https://hub.docker.com/r/internlm/internlm 获取安装了 InternLM 运行环境的镜像。
#### 镜像配置及构造
dockerfile 的配置以及构造均通过 docker.Makefile 文件实现,在 InternLM 根目录下执行如下命令即可 build 镜像:
``` bash
make -f docker.Makefile BASE_OS=centos7
```
在 docker.Makefile 中可自定义基础镜像,环境版本等内容,对应参数可直接通过命令行传递。对于 BASE_OS 分别支持 ubuntu20.04 和 centos7。
#### 镜像拉取
基于 ubuntu 和 centos 的标准镜像已经 build 完成也可直接拉取使用:
```bash
# ubuntu20.04
docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-ubuntu20.04
# centos7
docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7
```
#### 容器启动
对于使用 dockerfile 构建或拉取的本地标准镜像,使用如下命令启动并进入容器:
```bash
docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name myinternlm internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7 bash
```
容器内默认目录即 `/InternLM`,根据[使用文档](./usage.md)即可启动训练。

View File

@ -1,28 +0,0 @@
## InternLM系统结构
本项目系统代码文件结构如下所示:
```bash
├── configs # 配置模块,管理模型和训练相关参数
│ └── 7B_sft.py # 7B_sft.py 是系统 demo 的配置文件样例
├── internlm # 系统代码的主目录
│ ├── apis # 接口模块,包含一些关于推理等的接口函数
│ ├── core # 核心模块,管理用于训练和推理的 parallel context 和训练调度引擎
│ │ ├── communication # 通信模块负责流水线并行调度中的p2p通信
│ │ ├── context # context 模块,主要负责初始化并行进程组,并管理 parallel context
│ │ │ ├── parallel_context.py
│ │ │ └── process_group_initializer.py
│ │ ├── scheduler # 调度模块,管理并行训练的调度器,包括非流水线并行调度器和流水线并行调度器
│ │ │ ├── no_pipeline_scheduler.py
│ │ │ └── pipeline_scheduler.py
│ │ ├── engine.py # 负责管理模型的训练和评估过程
│ │ └── trainer.py # 负责管理训练引擎和调度器
│ ├── data # 数据模块,负责管理数据集生成和处理
│ ├── initialize # 初始化模块,负责管理分布式环境启动和训练器初始化
│ ├── model # 模型模块,负责管理模型结构定义和实现
│ ├── solver # 负责管理 optimizer 和 lr_scheduler 等的实现
│ └── utils # 辅助模块,负责管理日志、存储、模型注册等
├── train.py # 模型训练的主函数入口文件
├── requirements # 系统运行的依赖包列表
├── third_party # 系统所依赖的第三方模块,包括 apex 和 flash-attention 等
├── tools # 一些脚本工具,用于原始数据集处理和转换,模型 checkpoint 转换等
└── version.txt # 系统版本号
```

View File

@ -1,89 +0,0 @@
## 训练性能
InternLM 深度整合了 Flash-Attention, Apex 等高性能模型算子,提高了训练效率。通过构建 Hybrid Zero 技术实现计算和通信的高效重叠大幅降低了训练过程中的跨节点通信流量。InternLM 支持 7B 模型从 8 卡扩展到 1024 卡,千卡规模下加速效率可高达 90%,训练吞吐超过 180TFLOPS平均单卡每秒处理的 token 数量超过3600。下表为 InternLM 在不同配置下的扩展性测试数据:
| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
| TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
我们在GPU集群上测试了多种并行配置下InternLM训练7B模型的性能。在每组测试中每张GPU在单次迭代中处理的token数量一致。测试使用的硬件和参数配置如下表所示
| 硬件 | 硬件型号 |
| ----------------------- | ----------------------------- |
| GPU | nvidia_a100-sxm4-80gb |
| Memory | 2TB |
| Inter-machine bandwidth | 4 * 100Gb RoCE |
| CPU | 128 core Intel(R) Xeon(R) CPU |
| 超参 | tp=1 | tp=2 |
| --------- | ---- | ---- |
| micro_num | 4 | 4 |
| micro_bsz | 2 | 4 |
| seq_len | 2048 | 2048 |
InternLM中`zero1`的配置决定了优化器状态的分配范围。
- `zero1=-1`表明优化器状态分布在全部数据并行节点等同于Deepspeed Zero-1的效果
- `zero1=8tp=1`的情况下优化器状态分布在单节点8张GPU内并且不同节点上的优化器状态保持一致。
### 吞吐量测量
吞吐量定义为TGS平均每GPU每秒处理的token的数量Tokens per GPU per Second。在该项测试的训练配置中`pack_sample_into_one=False``checkpoint=False`, `dtype=torch.bfloat16`。测试结果如下表所示。采用`zero1=8tp=1`InternLM针对7B模型训练的扩展性在千卡训练的加速效率可以达到`88%`。
| 并行配置 | 8卡 | 16卡 | 32卡 | 64卡 | 128卡 | 256卡 | 512卡 | 1024卡 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
| (tp=1, zero1=-1) | 4062 | 3842 | 3752 | 3690 | 3571 | 3209 | 2861 | 2271 |
| (tp=1, zero1=8) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
| (tp=2, zero1=-1) | 3822 | 3595 | 3475 | 3438 | 3308 | 3094 | 2992 | 2785 |
| (tp=2, zero1=4) | 3761 | 3658 | 3655 | 3650 | 3651 | 3653 | 3589 | 3486 |
<div align="left">
<img src="../doc/imgs/train_performance.png" width="580"/>
</div>
### FLOPS测试
模型训练的计算量参考 [Megatron](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf) 论文中FLOPS计算方式。为了保证训练过程中的FLOPS恒定在该项测试的训练配置中`pack_sample_into_one=True``dtype=torch.bfloat16`。
当开启 Activation Ckpt后测试结果如下表所示InternLM针对7B模型的千卡训练可以达到 `>180 TFLOPS`
- TGS: Tokens per GPU per Second
- Global Bsz: 一个step中所有GPU处理的token数量
| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
|-|-|-|-|-|-|-|-|-|-|-|
| 1 | 8 | TRUE | TRUE | 8 | 2048 | 8 | 1 | 0.125M | 3314 | 193 |
| 1 | 8 | TRUE | TRUE | 16 | 2048 | 8 | 1 | 0.25M | 3268 | 191 |
| 1 | 8 | TRUE | TRUE | 32 | 2048 | 8 | 1 | 0.5M | 3323 | 188 |
| 1 | 8 | TRUE | TRUE | 64 | 2048 | 8 | 1 | 1M | 3217 | 188 |
| 1 | 8 | TRUE | TRUE | 128 | 2048 | 8 | 1 | 2M | 3260 | 187 |
| 1 | 8 | TRUE | TRUE | 256 | 2048 | 8 | 1 | 4M | 3215 | 187 |
| 1 | 8 | TRUE | TRUE | 512 | 2048 | 8 | 1 | 8M | 3199 | 186 |
| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 8 | 1 | 16M | 3163 | 184 |
| 1 | 8 | TRUE | TRUE | 512 | 2048 | 4 | 1 | 4M | 2963 | 173 |
| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 2 | 1 | 4M | 2341 | 136 |
| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 4 | 1 | 8M | 2796 | 160 |
当关闭 Activation Ckpt后测试结果如下表所示
| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
|-|-|-|-|-|-|-|-|-|-|-|
| 1 | 8 | TRUE | FALSE | 8 | 2048 | 2 | 4 | 0.125M | 4103 | 183 |
| 1 | 8 | TRUE | FALSE | 16 | 2048 | 2 | 4 | 0.25M | 3939 | 177 |
| 1 | 8 | TRUE | FALSE | 32 | 2048 | 2 | 4 | 0.5M | 3919 | 176 |
| 1 | 8 | TRUE | FALSE | 64 | 2048 | 2 | 4 | 1M | 3944 | 174 |
| 1 | 8 | TRUE | FALSE | 128 | 2048 | 2 | 4 | 2M | 3928 | 173 |
| 1 | 8 | TRUE | FALSE | 256 | 2048 | 2 | 4 | 4M | 3920 | 173 |
| 1 | 8 | TRUE | FALSE | 512 | 2048 | 2 | 4 | 8M | 3900 | 173 |
| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 4 | 16M | 3625 | 160 |
| 1 | 8 | TRUE | FALSE | 512 | 2048 | 2 | 2 | 4M | 3084 | 139 |
| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 1 | 4M | 2346 | 105 |
| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 2 | 8M | 2817 | 124 |
<div align="left">
<img src="../doc/imgs/flops.png" width="580"/>
</div>

View File

@ -1,370 +0,0 @@
## 使用教程
启动一个 Demo 模型训练,需要进行三项准备,**安装****数据集准备**和**模型训练配置**。接下来,首先会介绍数据准备相关的操作,再简要描述模型训练配置相关的内容。
### 安装
请参考[安装文档](./install.md)进行安装。
### 数据准备 (预训练)
InternLM训练任务的数据集包括一系列的`bin`和`meta`文件。使用`tokenizer`从原始文本文件生成训练用数据集。通过在`tools/tokenizer.py`中指定模型参数路径的方式来导入tokenizer模型。目前提供`V7_sft.model`来生成tokens。若想使用不同的模型可直接修改`tokernizer.py`中的模型参数路径。
可以运行以下命令生成原始数据对应的`bin`和`meta`文件,其中参数`text_input_path`表示原始文本数据路径,目前支持`txt`、`json`和`jsonl`三种输入格式,`bin_output_path`表示生成的`bin`文件的保存路径。
```bash
$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
```
下面是一个数据处理的例子:
给定一个包含原始数据集的文件`raw_data.txt`,原始数据集如下所示:
```bash
感恩生活中的每一个细节,才能真正体会到幸福的滋味。
梦想是人生的动力源泉,努力追逐,才能实现自己的目标。
学会宽容和理解,才能建立真正和谐的人际关系。
```
可以通过运行以下命令来生成`bin`和`meta`文件:
```bash
$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
```
需要注意的是,生成的`bin`文件需要保存在`cn`或者`en`或者`code`或者`ja`或者`ar`或者`kaoshi`这六个目录下,以区分数据集的类型。
其中,`cn`表示中文数据集;`en`表示英文数据集;`code`表示代码数据集;`ja`表示日语数据集;`ar`表示阿拉伯语数据集;`kaoshi`表示考试数据集。
生成的bin文件的格式如下
```python
{"tokens": [73075, 75302, 69522, 69022, 98899, 67713, 68015, 81269, 74637, 75445, 99157]}
{"tokens": [69469, 60355, 73026, 68524, 60846, 61844, 98899, 67775, 79241, 98899, 67713, 67800, 67453, 67838, 99157]}
{"tokens": [68057, 79017, 60378, 68014, 98899, 67713, 67990, 68015, 70381, 67428, 61003, 67622, 99157]}
```
`bin`文件中的每一行均对应原始数据集中的每一个句子,表示每个句子的`token`下文将用sequence指定
生成的`meta`文件的格式如下:
```bash
(0, 11), (90, 15), (208, 13)
```
在`meta`文件中,每个元组对应着`bin`文件中每一个`sequence`的元信息。其中,元组的第一个元素表示每个`sequence`在所有`sequence`中的`starting index`,第二个元素表示每个`sequence`中有多少个`tokens`。
例如,对于第一个`sequence``starting index`为 0有 11 个`tokens`;对于第二个`sequence`,由于第一个`sequence`转换为`string`后的长度为`89`,因此它的`starting index`为 90有 15 个`tokens`。
`json`和`jsonl`类型的文件的`bin`和`meta`文件格式和`txt`一致,此处不再赘叙。
### 数据准备 (微调)
微调任务的数据集格式与预训练任务保持一致,生成的数据格式为一系列的`bin`和`meta`文件。以下以 Alpaca 数据集为例,介绍微调的数据准备流程。
1. 下载 [Alpaca 数据集](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json)
2. 对 Alpaca 数据进行 tokenize使用以下命令
```shell
python tools/alpaca_tokenizer.py /path/to/alpaca_dataset /path/to/output_dataset /path/to/tokenizer --split_ratio 0.1
```
建议用户参考 alpaca_tokenizer.py 编写新的脚本对自己的数据集进行 tokenize
### 训练配置
以 7B Demo 的配置文件`configs/7B_sft.py`为例:
```python
JOB_NAME = "7b_train"
DO_ALERT = False
SEQ_LEN = 2048
HIDDEN_SIZE = 4096
NUM_ATTENTION_HEAD = 32
MLP_RATIO = 8 / 3
NUM_LAYER = 32
VOCAB_SIZE = 103168
MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
# Ckpt folder format:
# fs: 'local:/mnt/nfs/XXX'
SAVE_CKPT_FOLDER = "local:llm_ckpts"
LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
# boto3 Ckpt folder format:
# import os
# BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint
# SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm"
# LOAD_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm/snapshot/1/"
CHECKPOINT_EVERY = 50
ckpt = dict(
enable_save_ckpt=False, # enable ckpt save.
save_ckpt_folder=SAVE_CKPT_FOLDER, # Path to save training ckpt.
# load_ckpt_folder= dict(path=MODEL_ONLY_FOLDER, content=["model"], ckpt_type="normal"),
load_ckpt_folder="local:llm_ckpts/",
# 'load_ckpt_info' setting guide:
# 1. the 'path' indicate ckpt path,
# 2. the 'content means what states will be loaded, support: "model", "sampler", "optimizer", "scheduler", "all"
# 3. the ckpt_type means the type of checkpoint to be loaded, now only 'normal' type is supported.
load_ckpt_info=dict(path=MODEL_ONLY_FOLDER, content=("model",), ckpt_type="internlm"),
checkpoint_every=CHECKPOINT_EVERY,
async_upload=True, # async ckpt upload. (only work for boto3 ckpt)
async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/", # path for temporarily files during asynchronous upload.
oss_snapshot_freq=int(CHECKPOINT_EVERY / 2), # snapshot ckpt save frequency.
)
TRAIN_FOLDER = "/path/to/dataset"
VALID_FOLDER = "/path/to/dataset"
data = dict(
seq_len=SEQ_LEN,
# micro_num means the number of micro_batch contained in one gradient update
micro_num=4,
# packed_length = micro_bsz * SEQ_LEN
micro_bsz=2,
# defaults to the value of micro_num
valid_micro_num=4,
# defaults to 0, means disable evaluate
valid_every=50,
pack_sample_into_one=False,
total_steps=50000,
skip_batches="",
rampup_batch_size="",
# Datasets with less than 50 rows will be discarded
min_length=50,
# train_folder=TRAIN_FOLDER,
# valid_folder=VALID_FOLDER,
empty_cache_and_diag_interval=10,
diag_outlier_ratio=1.1,
)
grad_scaler = dict(
fp16=dict(
# the initial loss scale, defaults to 2**16
initial_scale=2**16,
# the minimum loss scale, defaults to None
min_scale=1,
# the number of steps to increase loss scale when no overflow occurs
growth_interval=1000,
),
# the multiplication factor for increasing loss scale, defaults to 2
growth_factor=2,
# the multiplication factor for decreasing loss scale, defaults to 0.5
backoff_factor=0.5,
# the maximum loss scale, defaults to None
max_scale=2**24,
# the number of overflows before decreasing loss scale, defaults to 2
hysteresis=2,
)
hybrid_zero_optimizer = dict(
# Enable low_level_optimzer overlap_communication
overlap_sync_grad=True,
overlap_sync_param=True,
# bucket size for nccl communication params
reduce_bucket_size=512 * 1024 * 1024,
# grad clipping
clip_grad_norm=1.0,
)
loss = dict(
label_smoothing=0,
)
adam = dict(
lr=1e-4,
adam_beta1=0.9,
adam_beta2=0.95,
adam_beta2_c=0,
adam_eps=1e-8,
weight_decay=0.01,
)
lr_scheduler = dict(
total_steps=data["total_steps"],
init_steps=0, # optimizer_warmup_step
warmup_ratio=0.01,
eta_min=1e-5,
last_epoch=-1,
)
beta2_scheduler = dict(
init_beta2=adam["adam_beta2"],
c=adam["adam_beta2_c"],
cur_iter=-1,
)
model = dict(
checkpoint=False, # The proportion of layers for activation aheckpointing, the optional value are True/False/[0-1]
num_attention_heads=NUM_ATTENTION_HEAD,
embed_split_hidden=True,
vocab_size=VOCAB_SIZE,
embed_grad_scale=1,
parallel_output=True,
hidden_size=HIDDEN_SIZE,
num_layers=NUM_LAYER,
mlp_ratio=MLP_RATIO,
apply_post_layer_norm=False,
dtype="torch.float16", # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
norm_type="rmsnorm",
layer_norm_epsilon=1e-5,
use_flash_attn=True,
num_chunks=1, # if num_chunks > 1, interleaved pipeline scheduler is used.
)
"""
zero1 parallel:
1. if zero1 <= 0, The size of the zero process group is equal to the size of the dp process group,
so parameters will be divided within the range of dp.
2. if zero1 == 1, zero is not used, and all dp groups retain the full amount of model parameters.
3. zero1 > 1 and zero1 <= dp world size, the world size of zero is a subset of dp world size.
For smaller models, it is usually a better choice to split the parameters within nodes with a setting <= 8.
pipeline parallel (dict):
1. size: int, the size of pipeline parallel.
2. interleaved_overlap: bool, enable/disable communication overlap when using interleaved pipeline scheduler.
tensor parallel: tensor parallel size, usually the number of GPUs per node.
"""
parallel = dict(
zero1=8,
pipeline=dict(size=1, interleaved_overlap=True),
sequence_parallel=False,
)
cudnn_deterministic = False
cudnn_benchmark = False
monitor = dict(
# feishu alert configs
alert=dict(
enable_feishu_alert=DO_ALERT,
feishu_alert_address=None, # feishu webhook to send alert message
light_monitor_address=None, # light_monitor address to send heartbeat
),
)
```
接下来将详细介绍启动一个模型训练所需要进行的数据、模型、并行和监控等相关的配置。
#### 数据配置
数据相关的关键参数配置及释义如下所示:
```python
TRAIN_FOLDER = "/path/to/dataset"
SEQ_LEN = 2048
data = dict(
seq_len=SEQ_LEN, # 数据样本长度,默认值为 2048
micro_num=1, # micro_num 是指在一次模型参数更新中会处理的 micro_batch 的数目,默认值为 1
micro_bsz=1, # packed_length = micro_bsz * SEQ_LEN为一次处理的 micro_batch 的数据大小,默认值为 1
total_steps=50000, # 总的所需执行的 step 的数目,默认值为 50000
min_length=50, # 若数据集文件中数据行数少于50将会被废弃
train_folder=TRAIN_FOLDER, # 数据集文件路径,默认值为 None若 train_folder 为空,则以自动生成的随机数据集进行训练测试
pack_sample_into_one=False, # 数据整理的逻辑,决定是按照 seq_len 维度或者是 sequence 的真实长度来进行attention计算
)
```
![pack_into_one](./imgs/pack_into_one.png)
目前支持传入数据集文件路径`train_folder`,且要求文件格式如下:
```bash
- folder
- code
train_000.bin
train_000.bin.meta
```
数据集的详细内容可参考``数据准备``模块相关的介绍。
#### 模型配置
如果在启动训练时要加载模型 `checkpoint`,可进行如下相关配置:
```python
SAVE_CKPT_FOLDER = "local:/path/to/save/ckpt"
LOAD_CKPT_FOLDER = "local:/path/to/load/resume/ckpt"
ckpt = dict(
save_ckpt_folder=SAVE_CKPT_FOLDER, # 存储模型和优化器 checkpoint 的路径
checkpoint_every=float("inf"), # 每多少个 step 存储一次 checkpoint默认值为 inf
# 断点续训时,加载模型和优化器等权重的路径,将从指定的 step 恢复训练
# content 表示哪些状态会被加载,支持: "model", "sampler", "optimizer", "scheduler", "all"
# ckpt_type 表示加载的模型类型,目前支持: "internlm"
load_ckpt_info=dict(path=MODEL_ONLY_FOLDER, content=("model",), ckpt_type="internlm"),
)
```
注意:
- 路径若以 `local:` 为前缀,则存储在本地文件系统;若以 `boto3:` 为前缀,则存储在远程 oss 上
模型相关关键参数配置如下所示:
```python
model_type = "INTERNLM" # 模型类型,默认值为 "INTERNLM",对应模型结构初始化接口函数
NUM_ATTENTION_HEAD = 32
VOCAB_SIZE = 103168
HIDDEN_SIZE = 4096
NUM_LAYER = 32
MLP_RATIO = 8 / 3
model = dict(
checkpoint=False, # 进行重计算的模型层数比例,可选值为 True/False/[0-1]
num_attention_heads=NUM_ATTENTION_HEAD,
embed_split_hidden=True,
vocab_size=VOCAB_SIZE,
embed_grad_scale=1,
parallel_output=True,
hidden_size=HIDDEN_SIZE,
num_layers=NUM_LAYER,
mlp_ratio=MLP_RATIO,
apply_post_layer_norm=False,
dtype="torch.bfloat16",
norm_type="rmsnorm",
layer_norm_epsilon=1e-5,
)
```
注意:用户可自定义模型类型名和模型结构,并配置相对应的模型参数。通过`utils/registry.py`下的`MODEL_INITIALIZER`对象进行模型初始化函数接口注册,在训练主函数`train.py`中初始化模型时,可通过`model_type`配置获取指定的模型初始化接口函数。
*如果基于 InternLM 7B继续训练可以参考 [ModelZoo](https://github.com/InternLM/InternLM/tree/main#model-zoo) 中 OpenXLab 链接下载权重*
#### 并行配置
训练并行配置样例如下:
```python
parallel = dict(
zero1=8,
tensor=1,
pipeline=dict(size=1, interleaved_overlap=True),
sequence_parallel=False,
)
```
- zero1zero 并行策略,分如下三种情况,默认值为 -1
- 当`zero1 <= 0`,则 zero1 进程组的大小等于数据并行进程组的大小,因此优化器状态参数将在数据并行范围内分配
- 当`zero1 == 1`,则不使用 zero1 ,所有数据并行组保留完整的优化器状态参数
- 当`zero1 > 1`且`zero1 <= data_parallel_world_size`,则 zero1 进程组是数据并行进程组的子集
- tensor张量并行大小通常是每个节点的 GPU 数量,默认值为 1
- pipeline流水线并行策略
- size流水线并行大小默认值为 1
- interleaved_overlapbool 类型,交错式调度时,开启或关闭通信优化,默认值为关闭
- sequence_parallel是否开启序列化并行默认值为 False
注意:`数据并行大小 = 总的 GPU 数目 / 流水线并行大小 / 张量并行大小`
### 启动训练
完成了以上数据集准备和相关训练配置后,可启动 Demo 训练。接下来分别以 slurm 和 torch 环境为例,介绍训练启动方式。
若在 slurm 上启动分布式运行环境,多节点 16 卡的运行命令如下所示:
```bash
$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./configs/7B_sft.py
```
若在 torch 上启动分布式运行环境,单节点 8 卡的运行命令如下所示:
```bash
$ torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_sft.py --launcher "torch"
```
### 运行结果
以 slurm 上单机 8 卡的 Demo 训练配置为例,训练结果日志展示如下:
```bash
2023-07-07 12:26:58,293 INFO launch.py:228 in launch -- Distributed environment is initialized, data parallel size: 8, pipeline parallel size: 1, tensor parallel size: 1
2023-07-07 12:26:58,293 INFO parallel_context.py:535 in set_seed -- initialized seed on rank 2, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
2023-07-07 12:26:58,295 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=0===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=5===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=1===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=6===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=7===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=2===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=4===========
2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=3===========
2023-07-07 12:28:27,826 INFO hybrid_zero_optim.py:295 in _partition_param_list -- Number of elements on ranks: [907415552, 907411456, 910163968, 910163968, 921698304, 921698304, 921698304, 921698304], rank:0
2023-07-07 12:28:57,802 INFO train.py:323 in record_current_batch_training_metrics -- tflops=63.27010355651958,step=0,loss=11.634403228759766,tgs (tokens/gpu/second)=1424.64,lr=4.0000000000000003e-07,loss_scale=65536.0,grad_norm=63.672620777841004,micro_num=4,num_consumed_tokens=131072,inf_nan_skip_batches=0,num_samples_in_batch=19,largest_length=2048,largest_batch=5,smallest_batch=4,adam_beta2=0.95,fwd_bwd_time=6.48
2023-07-07 12:29:01,636 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.83371103277346,step=1,loss=11.613704681396484,tgs (tokens/gpu/second)=4274.45,lr=6.000000000000001e-07,loss_scale=65536.0,grad_norm=65.150786641452,micro_num=4,num_consumed_tokens=262144,inf_nan_skip_batches=0,num_samples_in_batch=16,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.67
2023-07-07 12:29:05,451 INFO train.py:323 in record_current_batch_training_metrics -- tflops=190.99928472960033,step=2,loss=11.490386962890625,tgs (tokens/gpu/second)=4300.69,lr=8.000000000000001e-07,loss_scale=65536.0,grad_norm=61.57798028719357,micro_num=4,num_consumed_tokens=393216,inf_nan_skip_batches=0,num_samples_in_batch=14,largest_length=2048,largest_batch=4,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.66
2023-07-07 12:29:09,307 INFO train.py:323 in record_current_batch_training_metrics -- tflops=188.8613541410694,step=3,loss=11.099515914916992,tgs (tokens/gpu/second)=4252.55,lr=1.0000000000000002e-06,loss_scale=65536.0,grad_norm=63.5478796484391,micro_num=4,num_consumed_tokens=524288,inf_nan_skip_batches=0,num_samples_in_batch=16,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.7
2023-07-07 12:29:13,147 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.65918563194305,step=4,loss=10.149517059326172,tgs (tokens/gpu/second)=4270.52,lr=1.2000000000000002e-06,loss_scale=65536.0,grad_norm=51.582841631508145,micro_num=4,num_consumed_tokens=655360,inf_nan_skip_batches=0,num_samples_in_batch=19,largest_length=2048,largest_batch=6,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.68
2023-07-07 12:29:16,994 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.3109313713174,step=5,loss=9.822169303894043,tgs (tokens/gpu/second)=4262.67,lr=1.4000000000000001e-06,loss_scale=65536.0,grad_norm=47.10386835560855,micro_num=4,num_consumed_tokens=786432,inf_nan_skip_batches=0,num_samples_in_batch=17,largest_length=2048,largest_batch=6,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.69
```

View File

@ -1,107 +0,0 @@
DOCKER_REGISTRY ?= docker.io
DOCKER_ORG ?= my
DOCKER_IMAGE ?= internlm
DOCKER_FULL_NAME = $(DOCKER_REGISTRY)/$(DOCKER_ORG)/$(DOCKER_IMAGE)
CUDA_VERSION = 11.7.1
GCC_VERSION = 10.2.0
CUDNN_VERSION = 8
BASE_RUNTIME =
# ubuntu20.04 centos7
BASE_OS = centos7
BASE_DEVEL = nvidia/cuda:$(CUDA_VERSION)-cudnn$(CUDNN_VERSION)-devel-${BASE_OS}
# The conda channel to use to install cudatoolkit
CUDA_CHANNEL = nvidia
# The conda channel to use to install pytorch / torchvision
INSTALL_CHANNEL ?= pytorch
PYTHON_VERSION ?= 3.10
PYTORCH_VERSION ?= 1.13.1
TORCHVISION_VERSION ?= 0.14.1
TORCHAUDIO_VERSION ?= 0.13.1
BUILD_PROGRESS ?= auto
TRITON_VERSION ?=
GMP_VERSION ?= 6.2.1
MPFR_VERSION ?= 4.1.0
MPC_VERSION ?= 1.2.1
GCC_VERSION ?= 10.2.0
HTTPS_PROXY_I ?=
HTTP_PROXY_I ?=
FLASH_ATTEN_VERSION ?= 1.0.5
FLASH_ATTEN_TAG ?= v${FLASH_ATTEN_VERSION}
BUILD_ARGS = --build-arg BASE_IMAGE=$(BASE_IMAGE) \
--build-arg PYTHON_VERSION=$(PYTHON_VERSION) \
--build-arg CUDA_VERSION=$(CUDA_VERSION) \
--build-arg CUDA_CHANNEL=$(CUDA_CHANNEL) \
--build-arg PYTORCH_VERSION=$(PYTORCH_VERSION) \
--build-arg TORCHVISION_VERSION=$(TORCHVISION_VERSION) \
--build-arg TORCHAUDIO_VERSION=$(TORCHAUDIO_VERSION) \
--build-arg INSTALL_CHANNEL=$(INSTALL_CHANNEL) \
--build-arg TRITON_VERSION=$(TRITON_VERSION) \
--build-arg GMP_VERSION=$(GMP_VERSION) \
--build-arg MPFR_VERSION=$(MPFR_VERSION) \
--build-arg MPC_VERSION=$(MPC_VERSION) \
--build-arg GCC_VERSION=$(GCC_VERSION) \
--build-arg https_proxy=$(HTTPS_PROXY_I) \
--build-arg http_proxy=$(HTTP_PROXY_I) \
--build-arg FLASH_ATTEN_TAG=$(FLASH_ATTEN_TAG)
EXTRA_DOCKER_BUILD_FLAGS ?=
BUILD ?= build
# Intentionally left blank
PLATFORMS_FLAG ?=
PUSH_FLAG ?=
USE_BUILDX ?=1
BUILD_PLATFORMS ?=
WITH_PUSH ?= false
BUILD_TYPE ?= intrenlm-dev
# Setup buildx flags
ifneq ("$(USE_BUILDX)","")
BUILD = buildx build
ifneq ("$(BUILD_PLATFORMS)","")
PLATFORMS_FLAG = --platform="$(BUILD_PLATFORMS)"
endif
endif
# endif
# # Only set platforms flags if using buildx
# ifeq ("$(WITH_PUSH)","true")
# PUSH_FLAG = --push
# endif
# endif
ifeq ($(findstring centos,$(BASE_OS)),centos)
DOCKERFILE_PATH ?= ./docker/Dockerfile-centos
else
DOCKERFILE_PATH ?= ./docker/Dockerfile-ubuntu
endif
#use -f to specify dockerfile
DOCKER_BUILD = DOCKER_BUILDKIT=1 \
docker $(BUILD) \
--progress=$(BUILD_PROGRESS) \
$(EXTRA_DOCKER_BUILD_FLAGS) \
$(PLATFORMS_FLAG) \
$(PUSH_FLAG) \
-f $(DOCKERFILE_PATH) \
-t $(DOCKER_FULL_NAME):$(DOCKER_TAG) \
$(BUILD_ARGS) .
# --target $(BUILD_TYPE)
.PHONY: all
all: devel-image
.PHONY: devel-image
devel-image: BASE_IMAGE := $(BASE_DEVEL)
devel-image: DOCKER_TAG := torch${PYTORCH_VERSION}-cuda${CUDA_VERSION}-flashatten${FLASH_ATTEN_VERSION}-${BASE_OS}
devel-image:
$(DOCKER_BUILD)
.PHONY: clean
clean:
-docker rmi -f $(shell docker images -q $(DOCKER_FULL_NAME))

View File

@ -1,131 +0,0 @@
ARG BASE_IMAGE
ARG https_proxy
ARG http_proxy
##############################################################################
# Install the basic environment on centos
##############################################################################
FROM ${BASE_IMAGE} as base
ARG https_proxy
ARG http_proxy
RUN yum install deltarpm -y && yum update -y \
&& yum install -y \
ca-certificates \
cmake \
curl \
git \
wget \
tar \
m4 \
bzip2 \
gcc \
gcc-c++ \
file \
texinfo \
which
##############################################################################
# Install the conda environment
##############################################################################
FROM base as conda
ARG PYTHON_VERSION=3.10
ARG TARGETPLATFORM
ARG https_proxy
ARG http_proxy
RUN case ${TARGETPLATFORM} in \
"linux/arm64") MINICONDA_ARCH=aarch64 ;; \
*) MINICONDA_ARCH=x86_64 ;; \
esac && \
curl -fsSL -v -o ~/miniconda.sh -O "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-${MINICONDA_ARCH}.sh"
RUN chmod +x ~/miniconda.sh && \
bash ~/miniconda.sh -b -p /opt/conda && \
rm ~/miniconda.sh && \
/opt/conda/bin/conda install -y python=${PYTHON_VERSION} cmake conda-build pyyaml numpy ipython && \
/opt/conda/bin/conda clean -ya
##############################################################################
# Install environment dependencies
##############################################################################
FROM conda as dep
WORKDIR /dep
ARG https_proxy
ARG http_proxy
ARG GMP_VERSION
ARG MPFR_VERSION
ARG MPC_VERSION
RUN wget https://ftp.gnu.org/gnu/gmp/gmp-${GMP_VERSION}.tar.bz2 \
&& tar -vxf gmp-${GMP_VERSION}.tar.bz2 \
&& cd gmp-${GMP_VERSION}/ \
&& ./configure --prefix=/usr/local/gmp-${GMP_VERSION} \
&& make -j64 && make install \
&& cd .. \
&& wget https://ftp.gnu.org/gnu/mpfr/mpfr-${MPFR_VERSION}.tar.gz \
&& tar -vxf mpfr-${MPFR_VERSION}.tar.gz \
&& cd mpfr-${MPFR_VERSION}/ \
&& ./configure --prefix=/usr/local/mpfr-${MPFR_VERSION} --with-gmp=/usr/local/gmp-${GMP_VERSION} \
&& make -j64 && make install \
&& cd .. \
&& wget http://www.multiprecision.org/downloads/mpc-${MPC_VERSION}.tar.gz \
&& tar -vxf mpc-${MPC_VERSION}.tar.gz \
&& cd mpc-${MPC_VERSION}/ \
&& ./configure --prefix=/usr/local/mpc-${MPC_VERSION} --with-gmp=/usr/local/gmp-${GMP_VERSION} --with-mpfr=/usr/local/mpfr-${MPFR_VERSION} \
&& make -j64 && make install \
&& cd .. \
&& git clone https://github.com/ninja-build/ninja.git \
&& cd ninja \
&& git checkout release \
&& ./configure.py --bootstrap \
&& mv ./ninja /usr/bin \
&& cd ..
ENV MPFR_HOME=/usr/local/mpfr-${MPFR_VERSION}
ENV LD_LIBRARY_PATH=${MPFR_HOME}/lib:$LD_LIBRARY_PATH
ARG https_proxy
ARG http_proxy
ARG GCC_VERSION
ARG GMP_VERSION
ARG MPFR_VERSION
ARG MPC_VERSION
RUN wget https://ftp.gnu.org/gnu/gcc/gcc-${GCC_VERSION}/gcc-${GCC_VERSION}.tar.xz \
&& tar -vxf gcc-${GCC_VERSION}.tar.xz \
&& mkdir build \
&& cd build/ \
&& ../gcc-${GCC_VERSION}/configure --prefix=/usr/local/gcc-${GCC_VERSION}/ --enable-threads=posix --disable-checking --enable-languages=c,c++ --disable-multilib \
--with-gmp=/usr/local/gmp-${GMP_VERSION} --with-mpfr=/usr/local/mpfr-${MPFR_VERSION} --with-mpc=/usr/local/mpc-${MPC_VERSION} \
&& make -j64 && make install
ENV GCC_HOME=/usr/local/gcc-${GCC_VERSION}
ENV LD_LIBRARY_PATH=${GCC_HOME}/lib64:${CUDA_PATH}/lib64:$LD_LIBRARY_PATH
ENV PATH=${GCC_HOME}/bin:${CUDA_PATH}/bin:$PATH
ENV CC=${GCC_HOME}/bin/gcc
ENV CXX=${GCC_HOME}/bin/c++
##############################################################################
# Install InternLM development environment, including flash-attention and apex
##############################################################################
FROM dep as intrenlm-dev
COPY . /InternLM
WORKDIR /InternLM
ARG https_proxy
ARG http_proxy
ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
RUN git submodule update --init --recursive \
&& /opt/conda/bin/pip --no-cache-dir install -r requirements/torch.txt \
&& /opt/conda/bin/pip --no-cache-dir install -r requirements/runtime.txt \
&& cd /InternLM/third_party/flash-attention \
&& /opt/conda/bin/python setup.py install \
&& cd ./csrc \
&& cd fused_dense_lib && /opt/conda/bin/pip install -v . \
&& cd ../xentropy && /opt/conda/bin/pip install -v . \
&& cd ../rotary && /opt/conda/bin/pip install -v . \
&& cd ../layer_norm && /opt/conda/bin/pip install -v . \
&& cd ../../../../ \
&& cd ./third_party/apex \
&& /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
&& /opt/conda/bin/pip cache purge \
&& rm -rf ~/.cache/pip

View File

@ -1,112 +0,0 @@
ARG BASE_IMAGE
ARG https_proxy
ARG http_proxy
##############################################################################
# Install the basic environment on ubuntu
##############################################################################
FROM ${BASE_IMAGE} as base
ARG https_proxy
ARG http_proxy
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
build-essential \
ca-certificates \
cmake \
curl \
git \
wget \
tar \
m4 \
ninja-build
##############################################################################
# Install the conda environment
##############################################################################
FROM base as conda
ARG PYTHON_VERSION=3.10
ARG TARGETPLATFORM
ARG https_proxy
ARG http_proxy
RUN case ${TARGETPLATFORM} in \
"linux/arm64") MINICONDA_ARCH=aarch64 ;; \
*) MINICONDA_ARCH=x86_64 ;; \
esac && \
curl -fsSL -v -o ~/miniconda.sh -O "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-${MINICONDA_ARCH}.sh"
RUN chmod +x ~/miniconda.sh && \
bash ~/miniconda.sh -b -p /opt/conda && \
rm ~/miniconda.sh && \
/opt/conda/bin/conda install -y python=${PYTHON_VERSION} cmake conda-build pyyaml numpy ipython && \
/opt/conda/bin/conda clean -ya
##############################################################################
# Install environment dependencies
##############################################################################
FROM conda as dep
WORKDIR /dep
ARG https_proxy
ARG http_proxy
ARG GCC_VERSION
ARG GMP_VERSION
ARG MPFR_VERSION
ARG MPC_VERSION
RUN wget https://ftp.gnu.org/gnu/gmp/gmp-${GMP_VERSION}.tar.bz2 \
&& tar -vxf gmp-${GMP_VERSION}.tar.bz2 \
&& cd gmp-${GMP_VERSION}/ \
&& ./configure --prefix=/usr/local/gmp-${GMP_VERSION} \
&& make -j64 && make install \
&& cd .. \
&& wget https://ftp.gnu.org/gnu/mpfr/mpfr-${MPFR_VERSION}.tar.gz \
&& tar -vxf mpfr-${MPFR_VERSION}.tar.gz \
&& cd mpfr-${MPFR_VERSION}/ \
&& ./configure --prefix=/usr/local/mpfr-${MPFR_VERSION} --with-gmp=/usr/local/gmp-${GMP_VERSION} \
&& make -j64 && make install \
&& cd .. \
&& wget http://www.multiprecision.org/downloads/mpc-${MPC_VERSION}.tar.gz \
&& tar -vxf mpc-${MPC_VERSION}.tar.gz \
&& cd mpc-${MPC_VERSION}/ \
&& ./configure --prefix=/usr/local/mpc-${MPC_VERSION} --with-gmp=/usr/local/gmp-${GMP_VERSION} --with-mpfr=/usr/local/mpfr-${MPFR_VERSION} \
&& make -j64 && make install \
&& cd .. \
&& wget https://ftp.gnu.org/gnu/gcc/gcc-${GCC_VERSION}/gcc-${GCC_VERSION}.tar.xz \
&& tar -vxJf gcc-${GCC_VERSION}.tar.xz \
&& mkdir build \
&& cd build/ \
&& ../gcc-${GCC_VERSION}/configure --prefix=/usr/local/gcc-${GCC_VERSION}/ --enable-checking=release --enable-languages=c,c++ --disable-multilib \
--with-gmp=/usr/local/gmp-${GMP_VERSION} --with-mpfr=/usr/local/mpfr-${MPFR_VERSION} --with-mpc=/usr/local/mpc-${MPC_VERSION} \
&& make -j64 && make install
ENV GCC_HOME=/usr/local/gcc-${GCC_VERSION}
ENV MPFR_HOME=/usr/local/mpfr-${MPFR_VERSION}
ENV LD_LIBRARY_PATH=${GCC_HOME}/lib64:${MPFR_HOME}/lib:${CUDA_PATH}/lib64:$LD_LIBRARY_PATH
ENV PATH=${GCC_HOME}/bin:${CUDA_PATH}/bin:$PATH
ENV CC=${GCC_HOME}/bin/gcc
ENV CXX=${GCC_HOME}/bin/c++
##############################################################################
# Install InternLM development environment, including flash-attention and apex
##############################################################################
FROM dep as intrenlm-dev
COPY . /InternLM
WORKDIR /InternLM
ARG https_proxy
ARG http_proxy
ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
RUN git submodule update --init --recursive \
&& /opt/conda/bin/pip --no-cache-dir install -r requirements/torch.txt \
&& /opt/conda/bin/pip --no-cache-dir install -r requirements/runtime.txt \
&& cd /InternLM/third_party/flash-attention \
&& /opt/conda/bin/python setup.py install \
&& cd ./csrc \
&& cd fused_dense_lib && /opt/conda/bin/pip install -v . \
&& cd ../xentropy && /opt/conda/bin/pip install -v . \
&& cd ../rotary && /opt/conda/bin/pip install -v . \
&& cd ../layer_norm && /opt/conda/bin/pip install -v . \
&& cd ../../../../ \
&& cd ./third_party/apex \
&& /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
&& /opt/conda/bin/pip cache purge \
&& rm -rf ~/.cache/pip

View File

@ -1,161 +0,0 @@
ARG BASE_IMAGE
ARG https_proxy
ARG http_proxy
##############################################################################
# Install the basic environment on centos
##############################################################################
FROM ${BASE_IMAGE} as base
ARG https_proxy
ARG http_proxy
RUN yum install deltarpm -y && yum update -y \
&& yum install -y \
ca-certificates \
cmake \
curl \
git \
wget \
tar \
m4 \
bzip2 \
gcc \
gcc-c++ \
file \
texinfo \
which
##############################################################################
# Install the conda environment
##############################################################################
FROM base as conda
ARG PYTHON_VERSION=3.10
ARG TARGETPLATFORM
ARG https_proxy
ARG http_proxy
RUN case ${TARGETPLATFORM} in \
"linux/arm64") MINICONDA_ARCH=aarch64 ;; \
*) MINICONDA_ARCH=x86_64 ;; \
esac && \
curl -fsSL -v -o ~/miniconda.sh -O "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-${MINICONDA_ARCH}.sh"
RUN chmod +x ~/miniconda.sh && \
bash ~/miniconda.sh -b -p /opt/conda && \
rm ~/miniconda.sh && \
/opt/conda/bin/conda install -y python=${PYTHON_VERSION} cmake conda-build pyyaml numpy ipython && \
/opt/conda/bin/conda clean -ya
##############################################################################
# Install environment dependencies
##############################################################################
FROM conda as dep
WORKDIR /dep
ARG https_proxy
ARG http_proxy
ARG GMP_VERSION
ARG MPFR_VERSION
ARG MPC_VERSION
RUN wget https://ftp.gnu.org/gnu/gmp/gmp-${GMP_VERSION}.tar.bz2 \
&& tar -vxf gmp-${GMP_VERSION}.tar.bz2 \
&& cd gmp-${GMP_VERSION}/ \
&& ./configure --prefix=/usr/local/gmp-${GMP_VERSION} \
&& make -j64 && make install \
&& cd .. \
&& wget https://ftp.gnu.org/gnu/mpfr/mpfr-${MPFR_VERSION}.tar.gz \
&& tar -vxf mpfr-${MPFR_VERSION}.tar.gz \
&& cd mpfr-${MPFR_VERSION}/ \
&& ./configure --prefix=/usr/local/mpfr-${MPFR_VERSION} --with-gmp=/usr/local/gmp-${GMP_VERSION} \
&& make -j64 && make install \
&& cd .. \
&& wget http://www.multiprecision.org/downloads/mpc-${MPC_VERSION}.tar.gz \
&& tar -vxf mpc-${MPC_VERSION}.tar.gz \
&& cd mpc-${MPC_VERSION}/ \
&& ./configure --prefix=/usr/local/mpc-${MPC_VERSION} --with-gmp=/usr/local/gmp-${GMP_VERSION} --with-mpfr=/usr/local/mpfr-${MPFR_VERSION} \
&& make -j64 && make install \
&& cd .. \
&& git clone https://github.com/ninja-build/ninja.git \
&& cd ninja \
&& git checkout release \
&& ./configure.py --bootstrap \
&& mv ./ninja /usr/bin \
&& cd ..
ENV MPFR_HOME=/usr/local/mpfr-${MPFR_VERSION}
ENV LD_LIBRARY_PATH=${MPFR_HOME}/lib:$LD_LIBRARY_PATH
ARG https_proxy
ARG http_proxy
ARG GCC_VERSION
ARG GMP_VERSION
ARG MPFR_VERSION
ARG MPC_VERSION
RUN wget https://ftp.gnu.org/gnu/gcc/gcc-${GCC_VERSION}/gcc-${GCC_VERSION}.tar.xz \
&& tar -vxf gcc-${GCC_VERSION}.tar.xz \
&& mkdir build \
&& cd build/ \
&& ../gcc-${GCC_VERSION}/configure --prefix=/usr/local/gcc-${GCC_VERSION}/ --enable-threads=posix --disable-checking --enable-languages=c,c++ --disable-multilib \
--with-gmp=/usr/local/gmp-${GMP_VERSION} --with-mpfr=/usr/local/mpfr-${MPFR_VERSION} --with-mpc=/usr/local/mpc-${MPC_VERSION} \
&& make -j64 && make install
ENV GCC_HOME=/usr/local/gcc-${GCC_VERSION}
ENV LD_LIBRARY_PATH=${GCC_HOME}/lib64:${CUDA_PATH}/lib64:$LD_LIBRARY_PATH
ENV PATH=${GCC_HOME}/bin:${CUDA_PATH}/bin:$PATH
ENV CC=${GCC_HOME}/bin/gcc
ENV CXX=${GCC_HOME}/bin/c++
##############################################################################
# Install InternLM development environment, including flash-attention and apex
##############################################################################
FROM dep as intrenlm-dev
COPY . /InternLM
WORKDIR /InternLM
ARG https_proxy
ARG http_proxy
ARG PYTORCH_VERSION
ARG TORCHVISION_VERSION
ARG TORCHAUDIO_VERSION
RUN /opt/conda/bin/pip --no-cache-dir install \
transformers==4.29.2 \
sentencepiece \
numpy \
tqdm \
psutil \
packaging \
pre-commit \
ninja \
gputil \
pytest \
packaging \
boto3 \
botocore \
torch-scatter \
pyecharts \
-f https://data.pyg.org/whl/torch-${PYTORCH_VERSION}+cu117.html \
&& /opt/conda/bin/pip --no-cache-dir install \
--extra-index-url https://download.pytorch.org/whl/cu117 \
torch==${PYTORCH_VERSION}+cu117 \
torchvision==${TORCHVISION_VERSION}+cu117 \
torchaudio==${TORCHAUDIO_VERSION}
ARG https_proxy
ARG http_proxy
ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
ARG FLASH_ATTEN_TAG
RUN git submodule update --init --recursive \
&& cd /InternLM/third_party/flash-attention \
&& git checkout ${FLASH_ATTEN_TAG} \
&& /opt/conda/bin/python setup.py install \
&& cd ./csrc \
&& cd fused_dense_lib && /opt/conda/bin/pip install -v . \
&& cd ../xentropy && /opt/conda/bin/pip install -v . \
&& cd ../rotary && /opt/conda/bin/pip install -v . \
&& cd ../layer_norm && /opt/conda/bin/pip install -v . \
&& cd ../../../../ \
&& cd ./third_party/apex \
&& /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
&& /opt/conda/bin/pip cache purge \
&& rm -rf ~/.cache/pip

View File

@ -1,142 +0,0 @@
ARG BASE_IMAGE
ARG https_proxy
ARG http_proxy
##############################################################################
# Install the basic environment on ubuntu
##############################################################################
FROM ${BASE_IMAGE} as base
ARG https_proxy
ARG http_proxy
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
build-essential \
ca-certificates \
cmake \
curl \
git \
wget \
tar \
m4 \
ninja-build
##############################################################################
# Install the conda environment
##############################################################################
FROM base as conda
ARG PYTHON_VERSION=3.10
ARG TARGETPLATFORM
ARG https_proxy
ARG http_proxy
RUN case ${TARGETPLATFORM} in \
"linux/arm64") MINICONDA_ARCH=aarch64 ;; \
*) MINICONDA_ARCH=x86_64 ;; \
esac && \
curl -fsSL -v -o ~/miniconda.sh -O "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-${MINICONDA_ARCH}.sh"
RUN chmod +x ~/miniconda.sh && \
bash ~/miniconda.sh -b -p /opt/conda && \
rm ~/miniconda.sh && \
/opt/conda/bin/conda install -y python=${PYTHON_VERSION} cmake conda-build pyyaml numpy ipython && \
/opt/conda/bin/conda clean -ya
##############################################################################
# Install environment dependencies
##############################################################################
FROM conda as dep
WORKDIR /dep
ARG https_proxy
ARG http_proxy
ARG GCC_VERSION
ARG GMP_VERSION
ARG MPFR_VERSION
ARG MPC_VERSION
RUN wget https://ftp.gnu.org/gnu/gmp/gmp-${GMP_VERSION}.tar.bz2 \
&& tar -vxf gmp-${GMP_VERSION}.tar.bz2 \
&& cd gmp-${GMP_VERSION}/ \
&& ./configure --prefix=/usr/local/gmp-${GMP_VERSION} \
&& make -j64 && make install \
&& cd .. \
&& wget https://ftp.gnu.org/gnu/mpfr/mpfr-${MPFR_VERSION}.tar.gz \
&& tar -vxf mpfr-${MPFR_VERSION}.tar.gz \
&& cd mpfr-${MPFR_VERSION}/ \
&& ./configure --prefix=/usr/local/mpfr-${MPFR_VERSION} --with-gmp=/usr/local/gmp-${GMP_VERSION} \
&& make -j64 && make install \
&& cd .. \
&& wget http://www.multiprecision.org/downloads/mpc-${MPC_VERSION}.tar.gz \
&& tar -vxf mpc-${MPC_VERSION}.tar.gz \
&& cd mpc-${MPC_VERSION}/ \
&& ./configure --prefix=/usr/local/mpc-${MPC_VERSION} --with-gmp=/usr/local/gmp-${GMP_VERSION} --with-mpfr=/usr/local/mpfr-${MPFR_VERSION} \
&& make -j64 && make install \
&& cd .. \
&& wget https://ftp.gnu.org/gnu/gcc/gcc-${GCC_VERSION}/gcc-${GCC_VERSION}.tar.xz \
&& tar -vxJf gcc-${GCC_VERSION}.tar.xz \
&& mkdir build \
&& cd build/ \
&& ../gcc-${GCC_VERSION}/configure --prefix=/usr/local/gcc-${GCC_VERSION}/ --enable-checking=release --enable-languages=c,c++ --disable-multilib \
--with-gmp=/usr/local/gmp-${GMP_VERSION} --with-mpfr=/usr/local/mpfr-${MPFR_VERSION} --with-mpc=/usr/local/mpc-${MPC_VERSION} \
&& make -j64 && make install
ENV GCC_HOME=/usr/local/gcc-${GCC_VERSION}
ENV MPFR_HOME=/usr/local/mpfr-${MPFR_VERSION}
ENV LD_LIBRARY_PATH=${GCC_HOME}/lib64:${MPFR_HOME}/lib:${CUDA_PATH}/lib64:$LD_LIBRARY_PATH
ENV PATH=${GCC_HOME}/bin:${CUDA_PATH}/bin:$PATH
ENV CC=${GCC_HOME}/bin/gcc
ENV CXX=${GCC_HOME}/bin/c++
##############################################################################
# Install InternLM development environment, including flash-attention and apex
##############################################################################
FROM dep as intrenlm-dev
COPY . /InternLM
WORKDIR /InternLM
ARG https_proxy
ARG http_proxy
ARG PYTORCH_VERSION
ARG TORCHVISION_VERSION
ARG TORCHAUDIO_VERSION
RUN /opt/conda/bin/pip --no-cache-dir install \
transformers==4.29.2 \
sentencepiece \
numpy \
tqdm \
psutil \
packaging \
pre-commit \
ninja \
gputil \
pytest \
packaging \
boto3 \
botocore \
torch-scatter \
pyecharts \
-f https://data.pyg.org/whl/torch-${PYTORCH_VERSION}+cu117.html \
&& /opt/conda/bin/pip --no-cache-dir install \
--extra-index-url https://download.pytorch.org/whl/cu117 \
torch==${PYTORCH_VERSION}+cu117 \
torchvision==${TORCHVISION_VERSION}+cu117 \
torchaudio==${TORCHAUDIO_VERSION}
ARG https_proxy
ARG http_proxy
ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
ARG FLASH_ATTEN_TAG
RUN git submodule update --init --recursive \
&& cd /InternLM/third_party/flash-attention \
&& git checkout ${FLASH_ATTEN_TAG} \
&& /opt/conda/bin/python setup.py install \
&& cd ./csrc \
&& cd fused_dense_lib && /opt/conda/bin/pip install -v . \
&& cd ../xentropy && /opt/conda/bin/pip install -v . \
&& cd ../rotary && /opt/conda/bin/pip install -v . \
&& cd ../layer_norm && /opt/conda/bin/pip install -v . \
&& cd ../../../../ \
&& cd ./third_party/apex \
&& /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
&& /opt/conda/bin/pip cache purge \
&& rm -rf ~/.cache/pip

View File

@ -1,25 +0,0 @@
## 实验性环境镜像
本模块用于测试新版本环境,默认测试新环境 torch=2.0.1flash-attention=2.1.0。新环境可能具有不稳定性,标准环境安装请参考:[安装文档](../doc/install.md)
### 镜像构建及拉取
构建镜像时请于 InternLM 根目录下执行 docker.Makefile该文件与标准环境镜像共用所使用的 Dockerfile 位于 experiment 目录下。也可直接从 https://hub.docker.com/r/internlm/internlm 拉取镜像,命令如下:
```bash
# 构建镜像
# ubuntu20.04
make -f docker.Makefile BASE_OS=ubuntu20.04 DOCKERFILE_PATH=./experiment/Dockerfile-ubuntu PYTORCH_VERSION=2.0.1 TORCHVISION_VERSION=0.15.2 TORCHAUDIO_VERSION=2.0.2 FLASH_ATTEN_VERSION=2.1.0
# centos7
make -f docker.Makefile BASE_OS=centos7 DOCKERFILE_PATH=./experiment/Dockerfile-centos PYTORCH_VERSION=2.0.1 TORCHVISION_VERSION=0.15.2 TORCHAUDIO_VERSION=2.0.2 FLASH_ATTEN_VERSION=2.1.0
# 拉取镜像
# ubuntu20.04
docker pull internlm/internlm:experiment-torch2.0.1-flashatten2.1.0-ubuntu20.04
# centos7
docker pull internlm/internlm:experiment-torch2.0.1-flashatten2.1.0-centos7
```
### 容器启动
对于使用 dockerfile 构建或拉取的本地标准镜像,使用如下命令启动并进入容器:
```bash
docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name myinternlm internlm/internlm:experiment-torch2.0.1-flashatten2.1.0-centos7 bash
```
容器内默认目录即 `/InternLM`,根据[使用文档](../doc/usage.md)即可启动训练。

View File

@ -1,25 +0,0 @@
## Environment Image for experiment
This module is used to test the new version environment, the default test new environment is torch=2.0.1, flash-attention=2.1.0. The new environment may be unstable, for the standard environment installation please refer to: [installation guide](../doc/en/install.md)
### Build and Pull Image
To build the image, run docker.Makefile from the InternLM root directory. This Makefile is shared with the standard environment image, while the Dockerfile it uses is located in the experiment directory. You can also pull the image directly from https://hub.docker.com/r/internlm/internlm; the commands are as follows:
```bash
# Build Image
# ubuntu20.04
make -f docker.Makefile BASE_OS=ubuntu20.04 DOCKERFILE_PATH=./experiment/Dockerfile-ubuntu PYTORCH_VERSION=2.0.1 TORCHVISION_VERSION=0.15.2 TORCHAUDIO_VERSION=2.0.2 FLASH_ATTEN_VERSION=2.1.0
# centos7
make -f docker.Makefile BASE_OS=centos7 DOCKERFILE_PATH=./experiment/Dockerfile-centos PYTORCH_VERSION=2.0.1 TORCHVISION_VERSION=0.15.2 TORCHAUDIO_VERSION=2.0.2 FLASH_ATTEN_VERSION=2.1.0
# Pull Image
# ubuntu20.04
docker pull internlm/internlm:experiment-torch2.0.1-flashatten2.1.0-ubuntu20.04
# centos7
docker pull internlm/internlm:experiment-torch2.0.1-flashatten2.1.0-centos7
```
### Run Container
For a local image built with the Dockerfile or pulled as above, use the following command to start and enter the container:
```bash
docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name myinternlm internlm/internlm:experiment-torch2.0.1-flashatten2.1.0-centos7 bash
```
The default working directory in the container is `/InternLM`; start training according to the [usage guide](../doc/en/usage.md).

6
finetune/README.md Normal file
View File

@ -0,0 +1,6 @@
# Fine-tuning with InternLM
We recommend two projects to fine-tune InternLM.
1. [Xtuner](): brief introduction
2. [InternLM-Train](): brief introduction

View File

@ -1,9 +0,0 @@
from .initialize.initialize_trainer import initialize_trainer
from .initialize.launch import get_default_parser, launch_from_slurm, launch_from_torch
__all__ = [
"get_default_parser",
"initialize_trainer",
"launch_from_slurm",
"launch_from_torch",
]

View File

@ -1,848 +0,0 @@
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
import torch
import torch.nn.functional as F
from torch import nn
__all__ = ["SequenceGenerator"]
class InferenceParams:
"""
Intermediate cache objects for inference
"""
def __init__(
self,
max_sequence_len,
max_batch_size,
sequence_len_offset=0,
batch_size_offset=0,
key_value_memory_dict: dict = None,
lengths_per_sample=None,
attention_mask=None,
) -> None:
self.max_sequence_len: int = max_sequence_len
self.max_batch_size: int = max_batch_size
self.sequence_len_offset: int = sequence_len_offset
self.batch_size_offset: int = batch_size_offset
if key_value_memory_dict is None:
key_value_memory_dict = {}
self.key_value_memory_dict: dict = key_value_memory_dict
self.fused_ft_kernel: bool = False
self.lengths_per_sample = lengths_per_sample
self.attention_mask = attention_mask
def reorder_state(self, indices):
if self.lengths_per_sample is not None:
self.lengths_per_sample = self.lengths_per_sample.index_select(index=indices, dim=0)
for key, value in list(self.key_value_memory_dict.items()):
value = value.index_select(index=indices, dim=0)
self.key_value_memory_dict[key] = value
def _get_model_device(model):
"""
obtain the device of an nn.Module model.
Args:
model: nn.Module
Return: torch.device. Returns None if the model has no parameters.
"""
assert isinstance(model, nn.Module)
parameters = list(model.parameters())
if len(parameters) == 0:
return None
else:
return parameters[0].device
class SequenceGenerator:
"""
Sequence Generator.
"""
def __init__(self, decoder, eos_token_id, pad_token_id, bos_token_id):
self.decoder = decoder
self.eos_token_id = eos_token_id
self.pad_token_id = pad_token_id
self.bos_token_id = bos_token_id
@torch.no_grad()
def generate(
self,
tokens: "torch.LongTensor" = None,
num_return_sequences=1,
max_length: int = 20,
num_beams: int = 1,
do_sample: bool = True,
temperature: float = 1.0,
top_k: int = 50,
top_p: float = 1.0,
repetition_penalty: float = 1,
length_penalty: float = 1.0,
):
"""
Args:
tokens: the prompt tokens whose shape is [bsz, length]. If tokens is None, the default ``bos_token`` will be
used to start generation.
num_return_sequences: number of returned sequences.
max_length: the max length of the generated sequence.
num_beams: the size of beam search.
do_sample: whether to use sampling.
temperature: only meaningful when do_sample is True.
top_k: sampling from the top_k tokens.
top_p: sampling from the smallest token set whose cumulative probability exceeds top_p (nucleus sampling).
Return:
the token sequence whose shape is [bsz, num_return_sequences, max_length]. If eos_token_id is not None,
the ending of each sequence must be eos_token_id.
"""
assert num_return_sequences <= num_beams, f"num_return_sequences ({num_return_sequences}) must not exceed num_beams ({num_beams})."
if do_sample:
return sample_generate(
self.decoder,
tokens=tokens,
max_length=max_length,
num_beams=num_beams,
num_return_sequences=num_return_sequences,
temperature=temperature,
top_k=top_k,
top_p=top_p,
eos_token_id=self.eos_token_id, # the ending token id
pad_token_id=self.pad_token_id,
repetition_penalty=repetition_penalty, # the penalty degree for repetition tokens
length_penalty=length_penalty, # the length penalty: if it is > 1, longer sequences are encouraged;
# otherwise, shorter sequences are encouraged.
bos_token_id=self.bos_token_id,
)
else:
return greedy_generate(
self.decoder,
tokens=tokens,
max_length=max_length,
num_beams=num_beams,
num_return_sequences=num_return_sequences,
eos_token_id=self.eos_token_id,
pad_token_id=self.pad_token_id,
repetition_penalty=repetition_penalty,
length_penalty=length_penalty,
bos_token_id=self.bos_token_id,
)
@torch.no_grad()
def greedy_generate(
decoder,
tokens=None,
max_length=20,
num_beams=1,
num_return_sequences=1,
eos_token_id=None,
pad_token_id=0,
repetition_penalty=1,
length_penalty=1.0,
bos_token_id=1,
feat_mask=None,
ffn_mask=None,
layer_mask=None,
):
"""
Search sequence greedily.
Args:
decoder: the Decoder object.
tokens: the shape is [batch size, length]. If tokens is None, generation begins with bos_token_id.
max_length: the max length for generated sequence.
num_beams: the size of beam to decode.
eos_token_id: the ending token id. If None, the decode length is max_length.
pad_token_id: the token id of pad.
repetition_penalty: the penalty degree for repetition tokens
length_penalty: the penalty for length.
"""
if num_beams == 1:
token_ids = _no_beam_search_generate(
decoder,
tokens=tokens,
max_length=max_length,
temperature=1,
top_k=50,
top_p=1,
eos_token_id=eos_token_id,
do_sample=False,
repetition_penalty=repetition_penalty,
length_penalty=length_penalty,
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
feat_mask=feat_mask,
ffn_mask=ffn_mask,
layer_mask=layer_mask,
)
else:
token_ids = _beam_search_generate(
decoder,
tokens=tokens,
max_length=max_length,
num_beams=num_beams,
num_return_sequences=num_return_sequences,
temperature=1,
top_k=50,
top_p=1,
eos_token_id=eos_token_id,
do_sample=False,
repetition_penalty=repetition_penalty,
length_penalty=length_penalty,
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
feat_mask=feat_mask,
ffn_mask=ffn_mask,
layer_mask=layer_mask,
)
return token_ids
@torch.no_grad()
def sample_generate(
decoder,
tokens,
max_length=20,
num_beams=1,
num_return_sequences=1,
temperature=1.0,
top_k=50,
top_p=1.0,
eos_token_id=None,
pad_token_id=0,
repetition_penalty=1.0,
length_penalty=1.0,
bos_token_id=1,
):
"""
Generate sequences by sampling.
Args:
decoder: the Decoder object.
tokens: the shape is [batch size, length]. If tokens is None, generation begins with bos_token_id.
max_length: the max length for generated sequence.
num_beams: the size of beam to decode.
num_return_sequences: number of returned sequence.
temperature: annealing magnitude during sampling.
top_k: sampling from top_k. (Default: 50)
top_p: sampling from top_p tokens(nucleus sampling). (Default: 1.0)
eos_token_id: the ending token id. If None, the decode length is max_length.
pad_token_id: the token id of pad.
repetition_penalty: the penalty degree for repetition tokens
length_penalty: the penalty for length.
"""
if num_beams == 1:
token_ids = _no_beam_search_generate(
decoder,
tokens=tokens,
max_length=max_length,
temperature=temperature,
top_k=top_k,
top_p=top_p,
eos_token_id=eos_token_id,
do_sample=True,
repetition_penalty=repetition_penalty,
length_penalty=length_penalty,
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
)
else:
token_ids = _beam_search_generate(
decoder,
tokens=tokens,
max_length=max_length,
num_beams=num_beams,
num_return_sequences=num_return_sequences,
temperature=temperature,
top_k=top_k,
top_p=top_p,
eos_token_id=eos_token_id,
do_sample=True,
repetition_penalty=repetition_penalty,
length_penalty=length_penalty,
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
)
return token_ids
@torch.no_grad()
def _no_beam_search_generate(
decoder,
tokens,
inference_params=None,
max_length=20,
temperature=1.0,
top_k=50,
top_p=1.0,
eos_token_id=None,
do_sample=True,
repetition_penalty=1.0,
length_penalty=1.0,
pad_token_id=0,
bos_token_id=1,
feat_mask=None,
ffn_mask=None,
layer_mask=None,
):
# delete num_return_sequences=1 for lint check;
batch_size = tokens.size(0)
if eos_token_id is None:
_eos_token_id = -1
else:
_eos_token_id = eos_token_id
has_bos = torch.all(tokens[:, 0].eq(bos_token_id))
if has_bos:
bos_pos = torch.where(tokens.eq(bos_token_id), 1, 0)
bos_sum = bos_pos.cumsum(dim=-1)
bos_pos = torch.where(bos_sum.eq(bos_sum[:, -1:]), 0, 1)
to_atten_x = bos_pos[:, :, None]
to_atten_y = bos_pos[:, None, :]
# attention_mask = torch.einsum('bno,bom->bnm', to_atten_x, to_atten_y).eq(1)
else:
bos_pos = torch.where(tokens.eq(bos_token_id), 1, 0)
to_atten_x = bos_pos[:, :, None]
to_atten_y = bos_pos[:, None, :]
# attention_mask = torch.einsum('bno,bom->bnm', to_atten_x, to_atten_y).eq(1)
attention_mask = torch.logical_or(to_atten_x, to_atten_y).eq(1)
if inference_params is None:
inference_params = InferenceParams(
max_sequence_len=max_length,
max_batch_size=tokens.size(0),
sequence_len_offset=0,
batch_size_offset=0,
key_value_memory_dict=None,
lengths_per_sample=None,
attention_mask=attention_mask,
)
if layer_mask is None:
if feat_mask is None and ffn_mask is None:
scores = decoder(**{"input_ids": tokens, "inference_params": inference_params})
else:
scores = decoder(
**{
"input_ids": tokens,
"inference_params": inference_params,
"feat_mask": feat_mask,
"ffn_mask": ffn_mask,
}
)
else:
scores = decoder(
**{
"input_ids": tokens,
"inference_params": inference_params,
"feat_mask": feat_mask,
"ffn_mask": ffn_mask,
"layer_mask": layer_mask,
}
)
if isinstance(scores, (list, tuple)):
scores = scores[0]
scores = scores[:, -1].float()
inference_params.sequence_len_offset += tokens.size(1)
if _eos_token_id != -1:
scores[:, _eos_token_id] = -1e12
next_tokens = scores.argmax(dim=-1, keepdim=True)
token_ids = torch.cat([tokens, next_tokens], dim=1)
cur_len = token_ids.size(1)
dones = token_ids.new_zeros(batch_size).eq(1)
# tokens = tokens[:, -1:]
real_max_length = max_length
max_lengths = tokens.new_full((tokens.size(0),), fill_value=max_length, dtype=torch.long)
while cur_len < real_max_length:
# batch_size x vocab_size
if has_bos:
bos_pos = torch.where(token_ids.eq(bos_token_id), 1, 0)
bos_sum = bos_pos.cumsum(dim=-1)
bos_pos = torch.where(bos_sum.eq(bos_sum[:, -1:]), 0, 1)
to_atten_x = bos_pos[:, :, None]
to_atten_y = bos_pos[:, None, :]
# attention_mask = torch.einsum('bno,bom->bnm', to_atten_x, to_atten_y).eq(1)
else:
bos_pos = torch.where(token_ids.eq(bos_token_id), 1, 0)
to_atten_x = bos_pos[:, :, None]
to_atten_y = bos_pos[:, None, :]
# attention_mask = torch.einsum('bno,bom->bnm', to_atten_x, to_atten_y).eq(1)
attention_mask = torch.logical_or(to_atten_x, to_atten_y).eq(1)
inference_params.attention_mask = attention_mask
if layer_mask is None:
if feat_mask is None and ffn_mask is None:
scores = decoder(**{"input_ids": token_ids[:, -1:], "inference_params": inference_params})
else:
scores = decoder(
**{
"input_ids": token_ids[:, -1:],
"inference_params": inference_params,
"feat_mask": feat_mask,
"ffn_mask": ffn_mask,
}
)
else:
scores = decoder(
**{
"input_ids": token_ids[:, -1:],
"inference_params": inference_params,
"feat_mask": feat_mask,
"ffn_mask": ffn_mask,
"layer_mask": layer_mask,
}
)
if isinstance(scores, (list, tuple)):
scores = scores[0]
scores = scores[:, -1].float()
inference_params.sequence_len_offset += 1
if repetition_penalty != 1.0:
token_scores = scores.gather(dim=1, index=token_ids)
lt_zero_mask = token_scores.lt(0).float()
ge_zero_mask = lt_zero_mask.eq(0).float()
token_scores = (
lt_zero_mask * repetition_penalty * token_scores + ge_zero_mask / repetition_penalty * token_scores
)
scores.scatter_(dim=1, index=token_ids, src=token_scores)
if eos_token_id is not None and length_penalty != 1.0:
# batch_size x vocab_size
token_scores = scores / cur_len**length_penalty
eos_mask = scores.new_ones(scores.size(1))
eos_mask[eos_token_id] = 0
eos_mask = eos_mask.unsqueeze(0).eq(1)
scores = scores.masked_scatter(eos_mask, token_scores)
if do_sample:
if temperature > 0 and temperature != 1:
scores = scores / temperature
scores = top_k_top_p_filtering(scores, top_k, top_p, min_tokens_to_keep=2)
# add 1e-12 to avoid https://github.com/pytorch/pytorch/pull/27523
probs = F.softmax(scores, dim=-1) + 1e-12
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1) # batch_size
else:
next_tokens = torch.argmax(scores, dim=-1) # batch_size
if _eos_token_id != -1:
next_tokens = next_tokens.masked_fill(max_lengths.eq(cur_len + 1), _eos_token_id)
next_tokens = next_tokens.masked_fill(dones, pad_token_id)
tokens = next_tokens.unsqueeze(1)
token_ids = torch.cat([token_ids, tokens], dim=-1) # batch_size x max_len
end_mask = next_tokens.eq(_eos_token_id)
dones = dones.__or__(end_mask)
cur_len += 1
if dones.min() == 1:
break
# if eos_token_id is not None:
# # setting the eos at the maximum length position
# tokens.scatter(index=max_lengths[:, None], dim=1, value=eos_token_id)
# if cur_len == max_length:
# # If eos is not reached by the maximum length, forcibly replace the last word with eos
# token_ids[:, -1].masked_fill_(~dones, eos_token_id)
# TODO Here we are simply adding an extra dimension for interface compatibility, but in the future it will need to
# be able to return multiple real results
return token_ids[:, None]
@torch.no_grad()
def _beam_search_generate(
decoder,
tokens,
inference_params=None,
max_length=20,
num_beams=4,
num_return_sequences=1,
temperature=1.0,
top_k=50,
top_p=1.0,
eos_token_id=None,
do_sample=True,
repetition_penalty=1.0,
length_penalty=1.0,
pad_token_id=0,
bos_token_id=1,
feat_mask=None,
ffn_mask=None,
layer_mask=None,
) -> torch.LongTensor:
device = _get_model_device(decoder)
batch_size = tokens.size(0)
if eos_token_id is None:
_eos_token_id = -1
else:
_eos_token_id = eos_token_id
has_bos = torch.all(tokens[:, 0].eq(bos_token_id))
if has_bos:
bos_pos = torch.where(tokens.eq(bos_token_id), 1, 0)
bos_sum = bos_pos.cumsum(dim=-1)
bos_pos = torch.where(bos_sum.eq(bos_sum[:, -1:]), 0, 1)
to_atten_x = bos_pos[:, :, None]
to_atten_y = bos_pos[:, None, :]
# attention_mask = torch.einsum('bno,bom->bnm', to_atten_x, to_atten_y).eq(1)
else:
bos_pos = torch.where(tokens.eq(bos_token_id), 1, 0)
to_atten_x = bos_pos[:, :, None]
to_atten_y = bos_pos[:, None, :]
# attention_mask = torch.einsum('bno,bom->bnm', to_atten_x, to_atten_y).eq(1)
attention_mask = torch.logical_or(to_atten_x, to_atten_y).eq(1)
if inference_params is None:
inference_params = InferenceParams(
max_sequence_len=max_length,
max_batch_size=tokens.size(0),
sequence_len_offset=0,
batch_size_offset=0,
key_value_memory_dict=None,
lengths_per_sample=None,
attention_mask=attention_mask,
)
if layer_mask is None:
if feat_mask is None and ffn_mask is None:
scores = decoder(**{"input_ids": tokens, "inference_params": inference_params})
else:
scores = decoder(
**{
"input_ids": tokens,
"inference_params": inference_params,
"feat_mask": feat_mask,
"ffn_mask": ffn_mask,
}
)
else:
scores = decoder(
**{
"input_ids": tokens,
"inference_params": inference_params,
"feat_mask": feat_mask,
"ffn_mask": ffn_mask,
"layer_mask": layer_mask,
}
)
if isinstance(scores, (list, tuple)):
scores = scores[0]
scores = scores[:, -1].float()
inference_params.sequence_len_offset += tokens.size(1)
if _eos_token_id != -1:
scores[:, _eos_token_id] = -1e12
vocab_size = scores.size(1)
assert vocab_size >= num_beams, "num_beams should not be larger than the vocabulary size."
if do_sample:
probs = F.softmax(scores, dim=-1) + 1e-12
# (batch_size, num_beams)
next_tokens = torch.multinomial(probs, num_samples=num_beams)
logits = probs.log()
# (batch_size, num_beams)
next_scores = logits.gather(dim=1, index=next_tokens)
else:
scores = F.log_softmax(scores, dim=-1) # (batch_size, vocab_size)
# obtain (batch_size, num_beams), (batch_size, num_beams)
next_scores, next_tokens = torch.topk(scores, num_beams, dim=1, largest=True, sorted=True)
indices = torch.arange(batch_size, dtype=torch.long).to(device)
indices = indices.repeat_interleave(num_beams)
inference_params.reorder_state(indices)
# batch_size * num_beams x length
tokens = tokens.index_select(dim=0, index=indices)
# generated tokens (batch_size', cur_len)
token_ids = torch.cat([tokens, next_tokens.view(-1, 1)], dim=-1)
dones = [False] * batch_size
beam_scores = next_scores.view(-1) # batch_size * num_beams
cur_len = token_ids.size(1)
real_max_length = max_length
max_lengths = tokens.new_full((tokens.size(0),), fill_value=max_length, dtype=torch.long)
hypos = [
BeamHypotheses(num_beams, real_max_length, length_penalty, early_stopping=False) for _ in range(batch_size)
]
# 0, num_beams, 2*num_beams, ...
batch_inds_with_numbeams_interval = (torch.arange(batch_size) * num_beams).view(-1, 1).to(token_ids)
while cur_len < real_max_length:
if has_bos:
bos_pos = torch.where(token_ids.eq(bos_token_id), 1, 0)
bos_sum = bos_pos.cumsum(dim=-1)
bos_pos = torch.where(bos_sum.eq(bos_sum[:, -1:]), 0, 1)
to_atten_x = bos_pos[:, :, None]
to_atten_y = bos_pos[:, None, :]
# attention_mask = torch.einsum('bno,bom->bnm', to_atten_x, to_atten_y).eq(1)
else:
bos_pos = torch.where(token_ids.eq(bos_token_id), 1, 0)
to_atten_x = bos_pos[:, :, None]
to_atten_y = bos_pos[:, None, :]
# attention_mask = torch.einsum('bno,bom->bnm', to_atten_x, to_atten_y).eq(1)
attention_mask = torch.logical_or(to_atten_x, to_atten_y).eq(1)
inference_params.attention_mask = attention_mask
# (bsz x num_beams, vocab_size)
if layer_mask is None:
if feat_mask is None and ffn_mask is None:
scores = decoder(**{"input_ids": token_ids[:, -1:], "inference_params": inference_params})
else:
scores = decoder(
**{
"input_ids": token_ids[:, -1:],
"inference_params": inference_params,
"feat_mask": feat_mask,
"ffn_mask": ffn_mask,
}
)
else:
scores = decoder(
**{
"input_ids": token_ids[:, -1:],
"inference_params": inference_params,
"feat_mask": feat_mask,
"ffn_mask": ffn_mask,
"layer_mask": layer_mask,
}
)
if isinstance(scores, (list, tuple)):
scores = scores[0]
scores = scores[:, -1].float()
inference_params.sequence_len_offset += 1
if repetition_penalty != 1.0:
token_scores = scores.gather(dim=1, index=token_ids)
lt_zero_mask = token_scores.lt(0).float()
ge_zero_mask = lt_zero_mask.eq(0).float()
token_scores = (
lt_zero_mask * repetition_penalty * token_scores + ge_zero_mask / repetition_penalty * token_scores
)
scores.scatter_(dim=1, index=token_ids, src=token_scores)
if _eos_token_id != -1:
max_len_eos_mask = max_lengths.eq(cur_len + 1)
eos_scores = scores[:, _eos_token_id]
scores[:, _eos_token_id] = torch.where(max_len_eos_mask, eos_scores + 1e32, eos_scores)
if do_sample:
if temperature > 0 and temperature != 1:
scores = scores / temperature
scores = top_k_top_p_filtering(scores, top_k, top_p, min_tokens_to_keep=num_beams + 1)
# add 1e-12 to avoid https://github.com/pytorch/pytorch/pull/27523
probs = F.softmax(scores, dim=-1) + 1e-12
# batch_size' x (num_beams+1)
_tokens = torch.multinomial(probs, num_samples=num_beams + 1)
logits = probs.log()
# batch_size' x (num_beams+1)
_scores = logits.gather(dim=1, index=_tokens)
# batch_size' x (num_beams+1)
_scores = _scores + beam_scores[:, None]
_scores = _scores.view(batch_size, num_beams * (num_beams + 1))
next_scores, ids = _scores.topk(2 * num_beams, dim=1, largest=True, sorted=True)
_tokens = _tokens.view(batch_size, num_beams * (num_beams + 1))
# (batch_size, 2*num_beams)
next_tokens = _tokens.gather(dim=1, index=ids)
# (batch_size, 2*num_beams)
from_which_beam = torch.floor(ids.float() / (num_beams + 1)).long()
else:
# (batch_size * num_beams, vocab_size)
scores = F.log_softmax(scores, dim=-1)
# (batch_size * num_beams, vocab_size)
_scores = scores + beam_scores[:, None]
# (batch_size, num_beams*vocab_size)
_scores = _scores.view(batch_size, -1)
# (bsz, 2*num_beams)
next_scores, ids = torch.topk(_scores, 2 * num_beams, dim=1, largest=True, sorted=True)
# (batch_size, 2*num_beams)
from_which_beam = torch.floor(ids.float() / vocab_size).long()
next_tokens = ids % vocab_size # (batch_size, 2*num_beams)
# next_scores, sorted_inds = next_scores.sort(dim=-1, descending=True)
# next_tokens = next_tokens.gather(dim=1, index=sorted_inds)
# from_which_beam = from_which_beam.gather(dim=1, index=sorted_inds)
not_eos_mask = next_tokens.ne(_eos_token_id)
keep_mask = not_eos_mask.cumsum(dim=1).le(num_beams)
keep_mask = not_eos_mask.__and__(keep_mask)
_next_tokens = next_tokens.masked_select(keep_mask).view(-1, 1)
_from_which_beam = from_which_beam.masked_select(keep_mask).view(batch_size, num_beams)
_next_scores = next_scores.masked_select(keep_mask).view(batch_size, num_beams)
beam_scores = _next_scores.view(-1)
flag = True
if cur_len + 1 == real_max_length:
eos_batch_idx = torch.arange(batch_size).to(next_tokens).repeat_interleave(repeats=num_beams, dim=0)
eos_beam_ind = torch.arange(num_beams).to(token_ids).repeat(batch_size)
eos_beam_idx = from_which_beam[:, :num_beams].reshape(-1)
else:
effective_eos_mask = next_tokens[:, :num_beams].eq(_eos_token_id) # batch_size x num_beams
if effective_eos_mask.sum().gt(0):
eos_batch_idx, eos_beam_ind = effective_eos_mask.nonzero(as_tuple=True)
eos_beam_idx = eos_batch_idx * num_beams * 2 + eos_beam_ind
eos_beam_idx = from_which_beam.view(-1)[eos_beam_idx]
else:
flag = False
if flag:
_token_ids = torch.cat([token_ids, _next_tokens], dim=-1)
for batch_idx, beam_ind, beam_idx in zip(
eos_batch_idx.tolist(), eos_beam_ind.tolist(), eos_beam_idx.tolist()
):
if not dones[batch_idx]:
score = next_scores[batch_idx, beam_ind].item()
if _eos_token_id != -1:
hypos[batch_idx].add(_token_ids[batch_idx * num_beams + beam_idx, :cur_len].clone(), score)
else:
hypos[batch_idx].add(_token_ids[batch_idx * num_beams + beam_idx].clone(), score)
reorder_inds = (batch_inds_with_numbeams_interval + _from_which_beam).view(-1)
inference_params.reorder_state(reorder_inds)
token_ids = torch.cat([token_ids.index_select(index=reorder_inds, dim=0), _next_tokens], dim=-1)
for batch_idx in range(batch_size):
dones[batch_idx] = (
dones[batch_idx]
or hypos[batch_idx].is_done(next_scores[batch_idx, 0].item())
or max_lengths[batch_idx * num_beams] == cur_len + 1
)
cur_len += 1
if all(dones):
break
# select the best hypotheses
tgt_len = token_ids.new_zeros(batch_size, num_return_sequences)
best = []
for i, hypotheses in enumerate(hypos):
# best_hyp = max(hypotheses.hyp, key=lambda x: x[0])[1]
sorted_hyp = list(sorted(hypotheses.hyp, key=lambda x: x[0], reverse=True))
_best = []
for j, hyp in zip(range(num_return_sequences), sorted_hyp):
hyp = hyp[1]
if _eos_token_id != -1:
hyp = torch.cat([hyp, token_ids.new_ones(1) * _eos_token_id])
tgt_len[i, j] = len(hyp)
_best.append(hyp)
best.append(_best)
# generate target batch
decoded = token_ids.new_zeros(batch_size, num_return_sequences, tgt_len.max().item()).fill_(pad_token_id)
for i, hypo in enumerate(best):
for j, _hypo in enumerate(hypo):
decoded[i, j, : tgt_len[i, j]] = _hypo
return decoded
class BeamHypotheses(object):
"""
BeamHypotheses
"""
def __init__(self, num_beams, max_length, length_penalty, early_stopping):
"""Initialize n-best list of hypotheses."""
self.max_length = max_length - 1 # ignoring bos_token
self.length_penalty = length_penalty
self.early_stopping = early_stopping
self.num_beams = num_beams
self.hyp = []
self.worst_score = 1e9
def __len__(self):
"""Number of hypotheses in the list."""
return len(self.hyp)
def add(self, hyp, sum_logprobs):
"""Add a new hypothesis to the list."""
score = sum_logprobs / len(hyp) ** self.length_penalty
if len(self) < self.num_beams or score > self.worst_score:
self.hyp.append((score, hyp))
if len(self) > self.num_beams:
sorted_scores = sorted([(s, idx) for idx, (s, _) in enumerate(self.hyp)])
del self.hyp[sorted_scores[0][1]]
self.worst_score = sorted_scores[1][0]
else:
self.worst_score = min(score, self.worst_score)
def is_done(self, best_sum_logprobs):
"""If there are enough hypotheses and that none of the hypotheses being
generated can become better than the worst one in the heap, then we are
done with this sentence."""
if len(self) < self.num_beams:
return False
elif self.early_stopping:
return True
else:
return self.worst_score >= best_sum_logprobs / self.max_length**self.length_penalty
def top_k_top_p_filtering(logits, top_k=0, top_p=1.0, filter_value=-float("Inf"), min_tokens_to_keep=1):
"""
Based on the values of top_k and top_p, set the values that do not meet the criteria to the filter_value.
Args:
logits: logit values, shape is [bsz, vocab_size].
top_k: if greater than 0, only the top_k highest-probability tokens are kept; the remaining
positions are set to filter_value.
top_p: nucleus sampling threshold, according to http://arxiv.org/abs/1904.09751.
filter_value: the value assigned to filtered-out positions.
min_tokens_to_keep: each sample's returned distribution keeps at least this many tokens.
"""
if top_k > 0:
# Safety check
top_k = min(max(top_k, min_tokens_to_keep), logits.size(-1))
# Remove all tokens with a probability less than the last token of
# the top-k
indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
logits[indices_to_remove] = filter_value
if top_p < 1.0:
sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
# Remove tokens with cumulative probability above the threshold
# (token with 0 are kept)
sorted_indices_to_remove = cumulative_probs > top_p
if min_tokens_to_keep > 1:
# Keep at least min_tokens_to_keep
# (set to min_tokens_to_keep-1 because we add the first one below)
sorted_indices_to_remove[..., :min_tokens_to_keep] = 0
# Shift the indices to the right to keep also the first token
# above the threshold
sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
sorted_indices_to_remove[..., 0] = 0
# scatter sorted tensors to original indexing
indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
logits[indices_to_remove] = filter_value
return logits
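For orientation, here is a minimal usage sketch of the generator above. `ToyDecoder` is a hypothetical stand-in, not part of the repository: the generator only requires a module that accepts `input_ids` and `inference_params` keyword arguments and returns logits of shape `[batch, seq_len, vocab_size]`.
```python
# Hypothetical usage sketch, assuming the SequenceGenerator defined above is in scope.
import torch
from torch import nn


class ToyDecoder(nn.Module):
    def __init__(self, vocab_size=32, hidden=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids, inference_params=None, **kwargs):
        # This toy model ignores the KV cache and simply recomputes logits from scratch.
        return self.head(self.embed(input_ids))


generator = SequenceGenerator(ToyDecoder(), eos_token_id=2, pad_token_id=0, bos_token_id=1)
prompt = torch.tensor([[1, 5, 9]])  # [bsz=1, length=3], starting with the bos token
out = generator.generate(prompt, max_length=16, num_beams=1, do_sample=True, top_k=10)
print(out.shape)  # [bsz, num_return_sequences, generated_length], e.g. torch.Size([1, 1, 16])
```
With `do_sample=False` and `num_beams > 1`, the same `generate` call dispatches to the beam-search path instead, and `top_k_top_p_filtering` is not applied.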

View File

@ -1,9 +0,0 @@
from .engine import Engine
from .naive_amp import NaiveAMPModel
from .trainer import Trainer
__all__ = [
"NaiveAMPModel",
"Engine",
"Trainer",
]

View File

@ -1,32 +0,0 @@
from .p2p import (
AsynCommunicator,
recv_backward,
recv_forward,
send_backward,
send_backward_and_recv_next_backward_async,
send_backward_recv_backward,
send_backward_recv_forward,
send_forward,
send_forward_and_recv_next_forward_async,
send_forward_backward_recv_forward_backward,
send_forward_recv_backward,
send_forward_recv_forward,
)
from .utils import recv_obj_meta, send_obj_meta
__all__ = [
"send_forward",
"send_forward_recv_forward",
"send_forward_backward_recv_forward_backward",
"send_backward",
"send_backward_recv_backward",
"send_backward_recv_forward",
"send_forward_recv_backward",
"recv_backward",
"recv_forward",
"send_obj_meta",
"recv_obj_meta",
"send_backward_and_recv_next_backward_async",
"send_forward_and_recv_next_forward_async",
"AsynCommunicator",
]

View File

@ -1,582 +0,0 @@
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# adapted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/communication
import operator
from functools import reduce
from typing import List, Tuple, Union
import torch
import torch.distributed as dist
from internlm.core.context import ParallelMode
from internlm.core.context import global_context as gpc
from internlm.utils.common import get_current_device
from .utils import gather_split_1d_tensor, split_tensor_into_1d_equal_chunks
TensorShape = Union[torch.Size, List[int], Tuple[int]]
def _get_tensor_shape(tensor_shape: TensorShape, chunk_tensor: bool = False) -> Tuple[TensorShape, bool]:
"""get the exact tensor shape when communicating and return whether the tensor is a chunk
Args:
tensor_shape (:class:`torch.Size`): shape of tensor
chunk_tensor (bool, optional): whether to chunk tensor, defaults to False
Returns:
Tuple[Union[:class:`torch.Size`, List[int], Tuple[int]], bool]: exact tensor shape, whether to chunk tensor
"""
if chunk_tensor:
tensor_chunk_shape = reduce(operator.mul, tensor_shape, 1)
tensor_parallel_world_size = gpc.get_world_size(ParallelMode.TENSOR)
if tensor_chunk_shape % tensor_parallel_world_size == 0:
tensor_chunk_shape = tensor_chunk_shape // tensor_parallel_world_size
else:
tensor_chunk_shape = tensor_shape
chunk_tensor = False
else:
tensor_chunk_shape = tensor_shape
return tensor_chunk_shape, chunk_tensor
def create_recv_buffer_with_shapes(recv_shapes, dtype, scatter_gather_tensors):
if isinstance(recv_shapes, torch.Size):
recv_chunk_shape, recv_split = _get_tensor_shape(recv_shapes, scatter_gather_tensors)
buffer_recv = torch.empty(recv_chunk_shape, requires_grad=True, device=get_current_device(), dtype=dtype)
return buffer_recv, recv_split
buffer_recv = []
for recv_shape in recv_shapes:
recv_chunk_shape, recv_split = _get_tensor_shape(recv_shape, scatter_gather_tensors)
tensor_recv = torch.empty(recv_chunk_shape, requires_grad=True, device=get_current_device(), dtype=dtype)
buffer_recv.append(tensor_recv)
return buffer_recv, recv_split
def process_object_to_send(object_send, scatter_gather_tensors):
if isinstance(object_send, torch.Tensor):
send_split = _get_tensor_shape(object_send.shape, scatter_gather_tensors)[1]
if send_split:
object_send = split_tensor_into_1d_equal_chunks(object_send)
return object_send
object_send_list = []
for tensor_send in object_send:
send_split = _get_tensor_shape(tensor_send.shape, scatter_gather_tensors)[1]
if send_split:
object_send_list.append(split_tensor_into_1d_equal_chunks(tensor_send))
else:
object_send_list.append(tensor_send)
object_send = tuple(object_send_list)
return object_send
def filling_ops_queue(obj, comm_op, comm_rank, ops_queue):
if isinstance(obj, torch.Tensor):
op_to_add = dist.P2POp(comm_op, obj, comm_rank)
ops_queue.append(op_to_add)
else:
for tensor_to_comm in obj:
op_to_add = dist.P2POp(comm_op, tensor_to_comm, comm_rank)
ops_queue.append(op_to_add)
def _communicate(
object_send_next: Union[torch.Tensor, List[torch.Tensor]] = None,
object_send_prev: Union[torch.Tensor, List[torch.Tensor]] = None,
recv_prev: bool = False,
recv_next: bool = False,
recv_prev_shape: Union[torch.Size, List[torch.Size]] = None,
recv_next_shape: Union[torch.Size, List[torch.Size]] = None,
prev_rank: int = None,
next_rank: int = None,
dtype: torch.dtype = None,
scatter_gather_tensors: bool = False,
) -> Tuple[Union[torch.Tensor, List[torch.Tensor]]]:
"""
Adapted from megatron.p2p_communication.
Communicate tensors between stages. Used as helper method in other
communication methods that are used in pipeline schedule.
Takes the following arguments:
object_send_next (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): tensor to send to next rank
(no tensor sent if set to None).
object_send_prev (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): tensor to send to prev rank
(no tensor sent if set to None).
recv_prev (bool): boolean for whether tensor should be received from
previous rank.
recv_next (bool): boolean for whether tensor should be received from
next rank.
recv_prev_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): shape of the tensor to be received
from the previous stage, defaults to None.
recv_next_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): shape of the tensor to be received
from the next stage, defaults to None.
prev_rank (int): the rank of the previous pipeline stage, defaults to None.
next_rank (int): the rank of the next pipeline stage, defaults to None.
dtype (torch.dtype): data type of intermediate buffers, defaults to None
scatter_gather_tensors (bool): whether to scatter and gather tensor between pipeline stages, defaults to False
Returns:
Tuple[Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]]: returns tensor_recv_prev, tensor_recv_next
"""
# Create placeholder tensors for receive in forward and backward directions
# if needed.
tensor_recv_prev = None
tensor_recv_next = None
if recv_prev:
assert recv_prev_shape is not None
tensor_recv_prev, recv_prev_split = create_recv_buffer_with_shapes(
recv_prev_shape, dtype, scatter_gather_tensors
)
if recv_next:
assert recv_next_shape is not None
tensor_recv_next, recv_next_split = create_recv_buffer_with_shapes(
recv_next_shape, dtype, scatter_gather_tensors
)
if object_send_prev is not None or recv_prev:
if prev_rank is None:
prev_rank = gpc.get_prev_global_rank(ParallelMode.PIPELINE)
if object_send_next is not None or recv_next:
if next_rank is None:
next_rank = gpc.get_next_global_rank(ParallelMode.PIPELINE)
if object_send_prev is not None:
object_send_prev = process_object_to_send(object_send_prev, scatter_gather_tensors)
if object_send_next is not None:
object_send_next = process_object_to_send(object_send_next, scatter_gather_tensors)
ops = []
if object_send_prev is not None:
filling_ops_queue(object_send_prev, dist.isend, prev_rank, ops)
if tensor_recv_prev is not None:
filling_ops_queue(tensor_recv_prev, dist.irecv, prev_rank, ops)
if tensor_recv_next is not None:
filling_ops_queue(tensor_recv_next, dist.irecv, next_rank, ops)
if object_send_next is not None:
filling_ops_queue(object_send_next, dist.isend, next_rank, ops)
if len(ops) > 0:
reqs = dist.batch_isend_irecv(ops)
for req in reqs:
req.wait()
# To protect against race condition when using batch_isend_irecv().
torch.cuda.synchronize()
if recv_prev and recv_prev_split:
if isinstance(tensor_recv_prev, torch.Tensor):
tensor_recv_prev = gather_split_1d_tensor(tensor_recv_prev).view(recv_prev_shape).requires_grad_()
else:
for index in range(len(tensor_recv_prev)):
tensor_recv_prev[index] = (
gather_split_1d_tensor(tensor_recv_prev[index]).view(recv_prev_shape[index]).requires_grad_()
)
if recv_next and recv_next_split:
if isinstance(tensor_recv_next, torch.Tensor):
tensor_recv_next = gather_split_1d_tensor(tensor_recv_next).view(recv_next_shape).requires_grad_()
else:
for index in range(len(tensor_recv_next)):
tensor_recv_next[index] = (
gather_split_1d_tensor(tensor_recv_next[index]).view(recv_next_shape[index]).requires_grad_()
)
return tensor_recv_prev, tensor_recv_next
def recv_forward(
input_tensor_shape, prev_rank=None, dtype=torch.float, scatter_gather_tensors=False
) -> Union[torch.Tensor, List[torch.Tensor]]:
"""Copy the forward output from the previous stage in pipeline as the input tensor of this stage.
Args:
input_tensor_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): The shape of the tensor
to be received.
prev_rank (int, optional): The rank of the source of the tensor.
Returns:
Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]: The input tensor or input tensor list.
"""
input_tensor, _ = _communicate(
recv_prev=True,
recv_prev_shape=input_tensor_shape,
prev_rank=prev_rank,
dtype=dtype,
scatter_gather_tensors=scatter_gather_tensors,
)
return input_tensor
def recv_backward(
output_grad_shape, next_rank=None, dtype=torch.float, scatter_gather_tensors=False
) -> Union[torch.Tensor, List[torch.Tensor]]:
"""Copy the gradient tensor from the next stage in pipeline as the input gradient of this stage.
Args:
output_grad_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): The shape of the tensor
to be received.
next_rank (int, optional): The rank of the source of the tensor.
Returns:
Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]: The input gradient tensor or gradient tensor list.
"""
_, output_tensor_grad = _communicate(
recv_next=True,
recv_next_shape=output_grad_shape,
next_rank=next_rank,
dtype=dtype,
scatter_gather_tensors=scatter_gather_tensors,
)
return output_tensor_grad
def send_forward(output_tensor, next_rank=None, scatter_gather_tensors=False) -> None:
"""Sends the input tensor to the next stage in pipeline.
Args:
output_tensor (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Tensor to be sent.
next_rank (int, optional): The rank of the recipient of the tensor.
"""
_communicate(object_send_next=output_tensor, next_rank=next_rank, scatter_gather_tensors=scatter_gather_tensors)
def send_backward(input_tensor_grad, prev_rank=None, scatter_gather_tensors=False) -> None:
"""Sends the gradient tensor to the previous stage in pipeline.
Args:
input_tensor_grad (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Tensor to be sent
prev_rank (int, optional): The rank of the recipient of the tensor
"""
_communicate(object_send_prev=input_tensor_grad, prev_rank=prev_rank, scatter_gather_tensors=scatter_gather_tensors)
def send_forward_recv_backward(
output_tensor, output_grad_shape, next_rank=None, dtype=torch.float, scatter_gather_tensors=False
) -> Union[torch.Tensor, List[torch.Tensor]]:
"""Batched communication operation. Sends the input tensor to the
next stage in pipeline, while receives the gradient tensor from the
next stage in pipeline as the input gradient tensor of this stage.
Args:
output_tensor (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Tensor to be sent.
output_grad_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): The shape of the tensor
to be received.
Returns:
Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]: The input gradient tensor.
"""
_, output_tensor_grad = _communicate(
object_send_next=output_tensor,
recv_next=output_grad_shape is not None,
recv_next_shape=output_grad_shape,
next_rank=next_rank,
dtype=dtype,
scatter_gather_tensors=scatter_gather_tensors,
)
return output_tensor_grad
def send_backward_recv_forward(
input_tensor_grad,
input_tensor_shape,
prev_rank=None,
dtype=torch.float,
scatter_gather_tensors=False,
) -> Union[torch.Tensor, List[torch.Tensor]]:
"""Batched communication operation. Sends the gradient tensor to the
previous stage in pipeline, while receives the output tensor from the
previous stage in pipeline as the input of this stage.
Args:
input_tensor_grad (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Tensor to be sent.
input_tensor_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): The shape of the tensor
to be received.
Returns:
Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]: The input tensor.
"""
input_tensor, _ = _communicate(
object_send_prev=input_tensor_grad,
recv_prev=input_tensor_shape is not None,
recv_prev_shape=input_tensor_shape,
prev_rank=prev_rank,
dtype=dtype,
scatter_gather_tensors=scatter_gather_tensors,
)
return input_tensor
def send_forward_recv_forward(
output_tensor,
input_tensor_shape,
prev_rank=None,
next_rank=None,
dtype=torch.float,
scatter_gather_tensors=False,
) -> Union[torch.Tensor, List[torch.Tensor]]:
"""Batched communication operation. Sends the input tensor to the
next stage in pipeline, while receives the output tensor from the
previous stage in pipeline as the input of this stage.
Args:
output_tensor (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Tensor to be sent.
input_tensor_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): The shape of the tensor
to be received.
Returns:
Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]: The input tensor.
"""
input_tensor, _ = _communicate(
object_send_next=output_tensor,
recv_prev=input_tensor_shape is not None,
recv_prev_shape=input_tensor_shape,
prev_rank=prev_rank,
next_rank=next_rank,
dtype=dtype,
scatter_gather_tensors=scatter_gather_tensors,
)
return input_tensor
def send_backward_recv_backward(
input_tensor_grad,
output_grad_shape,
prev_rank=None,
next_rank=None,
dtype=torch.float,
scatter_gather_tensors=False,
) -> Union[torch.Tensor, List[torch.Tensor]]:
"""Batched communication operation. Sends the gradient tensor to the
previous stage in pipeline, while receives the gradient tensor from the
next stage in pipeline as the input of this stage.
Args:
input_tensor_grad (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Tensor to be sent.
output_grad_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): The shape of the tensor
to be received.
Returns:
Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]: The input gradient tensor.
"""
_, output_tensor_grad = _communicate(
object_send_prev=input_tensor_grad,
recv_next=output_grad_shape is not None,
recv_next_shape=output_grad_shape,
prev_rank=prev_rank,
next_rank=next_rank,
dtype=dtype,
scatter_gather_tensors=scatter_gather_tensors,
)
return output_tensor_grad
def send_forward_backward_recv_forward_backward(
output_tensor,
input_tensor_grad,
input_tensor_shape,
output_grad_shape,
prev_rank=None,
next_rank=None,
dtype=torch.float,
scatter_gather_tensors=False,
) -> Tuple[Union[torch.Tensor, List[torch.Tensor]]]:
"""Batched communication operation. Sends the input tensor to the next stage in pipeline and
the gradient tensor to the previous stage, while receives the input gradient tensor from the
next stage and the input tensor from the previous stage.
Args:
output_tensor (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Tensor sent to the next.
input_tensor_grad (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Tensor sent to the previous.
input_tensor_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): The shape of the tensor received
from the previous.
output_grad_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): The shape of the tensor received
from the next.
Returns:
Tuple(Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]], Union[:class:`torch.Tensor`,
List[:class:`torch.Tensor`]]): (the input tensor, the input gradient tensor)
"""
input_tensor, output_tensor_grad = _communicate(
object_send_next=output_tensor,
object_send_prev=input_tensor_grad,
recv_prev=input_tensor_shape is not None,
recv_next=output_grad_shape is not None,
recv_prev_shape=input_tensor_shape,
recv_next_shape=output_grad_shape,
prev_rank=prev_rank,
next_rank=next_rank,
dtype=dtype,
scatter_gather_tensors=scatter_gather_tensors,
)
return input_tensor, output_tensor_grad
def send_forward_and_recv_next_forward_async(
output_tensor,
recv_prev_shape: Union[torch.Size, List[torch.Size]] = None,
dtype: torch.dtype = None,
scatter_gather_tensors=False,
):
"""send forward output to next rank and recv forward input from prev rank"""
reqs = []
tensor_recv_prev = None
# prepare send operations
if output_tensor is not None:
next_rank = gpc.get_next_global_rank(ParallelMode.PIPELINE)
output_tensor = process_object_to_send(output_tensor, scatter_gather_tensors)
if isinstance(output_tensor, torch.Tensor):
reqs.append(dist.P2POp(dist.isend, output_tensor, next_rank))
else:
for tensor_to_comm in output_tensor:
reqs.append(dist.P2POp(dist.isend, tensor_to_comm, next_rank))
# prepare receive operations
if recv_prev_shape is not None:
prev_rank = gpc.get_prev_global_rank(ParallelMode.PIPELINE)
# create receive buffer
tensor_recv_prev, recv_prev_split = create_recv_buffer_with_shapes(
recv_prev_shape, dtype, scatter_gather_tensors
)
# generate async receive operations
if isinstance(tensor_recv_prev, torch.Tensor):
reqs.append(dist.P2POp(dist.irecv, tensor_recv_prev, prev_rank))
else:
for tensor_to_comm in tensor_recv_prev:
reqs.append(dist.P2POp(dist.irecv, tensor_to_comm, prev_rank))
if len(reqs) > 0:
reqs = dist.batch_isend_irecv(reqs)
# return and do other things
yield
# check communication completed
for req in reqs:
req.wait()
# To protect against race condition when using batch_isend_irecv()
torch.cuda.synchronize()
# Process received data
if recv_prev_shape is not None and recv_prev_split:
if isinstance(tensor_recv_prev, torch.Tensor):
tensor_recv_prev = gather_split_1d_tensor(tensor_recv_prev).view(recv_prev_shape).requires_grad_()
else:
for index in range(len(tensor_recv_prev)):
tensor_recv_prev[index] = (
gather_split_1d_tensor(tensor_recv_prev[index]).view(recv_prev_shape[index]).requires_grad_()
)
yield tensor_recv_prev
def send_backward_and_recv_next_backward_async(
input_tensor,
recv_next_shape: Union[torch.Size, List[torch.Size]] = None,
dtype: torch.dtype = None,
scatter_gather_tensors=False,
):
reqs = []
tensor_recv_next = None
# prepare send operations
if input_tensor is not None:
prev_rank = gpc.get_prev_global_rank(ParallelMode.PIPELINE)
input_tensor = process_object_to_send(input_tensor, scatter_gather_tensors)
if isinstance(input_tensor, torch.Tensor):
reqs.append(dist.P2POp(dist.isend, input_tensor, prev_rank))
else:
for tensor_to_comm in input_tensor:
reqs.append(dist.P2POp(dist.isend, tensor_to_comm, prev_rank))
# prepare receive operations
if recv_next_shape is not None:
next_rank = gpc.get_next_global_rank(ParallelMode.PIPELINE)
# create receive buffer
tensor_recv_next, recv_next_split = create_recv_buffer_with_shapes(
recv_next_shape, dtype, scatter_gather_tensors
)
# generate async receive operations
if isinstance(tensor_recv_next, torch.Tensor):
reqs.append(dist.P2POp(dist.irecv, tensor_recv_next, next_rank))
else:
for tensor_to_comm in tensor_recv_next:
reqs.append(dist.P2POp(dist.irecv, tensor_to_comm, next_rank))
if len(reqs) > 0:
reqs = dist.batch_isend_irecv(reqs)
# return and do other things
yield
# check communication completed
for req in reqs:
req.wait()
# To protect against race condition when using batch_isend_irecv()
torch.cuda.synchronize()
# Process received data
if recv_next_shape is not None and recv_next_split:
if isinstance(tensor_recv_next, torch.Tensor):
tensor_recv_next = gather_split_1d_tensor(tensor_recv_next).view(recv_next_shape).requires_grad_()
else:
for index in range(len(tensor_recv_next)):
tensor_recv_next[index] = (
gather_split_1d_tensor(tensor_recv_next[index]).view(recv_next_shape[index]).requires_grad_()
)
yield tensor_recv_next
class AsynCommunicator:
"""AsynCommunicator for managing async communication."""
def __init__(
self,
tensor_to_send: Union[torch.Tensor, List[torch.Tensor]],
recv_shape: Union[torch.Size, List[torch.Size]],
dtype: torch.dtype = None,
scatter_gather_tensors=False,
forward: bool = True,
) -> None:
self._need_receive = recv_shape is not None
if forward:
self._coroutine = send_forward_and_recv_next_forward_async(
tensor_to_send, recv_shape, dtype, scatter_gather_tensors
)
else:
self._coroutine = send_backward_and_recv_next_backward_async(
tensor_to_send, recv_shape, dtype, scatter_gather_tensors
)
@property
def need_receive(self) -> bool:
return self._need_receive
def start(self) -> None:
next(self._coroutine)
def wait_and_receive(self) -> Union[torch.Tensor, List[torch.Tensor]]:
received = next(self._coroutine)
self._coroutine.close()
return received

View File

@ -1,125 +0,0 @@
# adapted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/communication
from typing import List, Tuple, Union
import torch
import torch.distributed as dist
from internlm.core.context import ParallelMode
from internlm.core.context import global_context as gpc
from internlm.utils.common import get_current_device
TensorShape = Union[torch.Size, List[int], Tuple[int]]
def send_meta_helper(obj, next_rank, tensor_kwargs):
send_shape = torch.tensor(obj.size(), **tensor_kwargs)
send_ndims = torch.tensor(len(obj.size()), **tensor_kwargs)
dist.send(send_ndims, next_rank)
dist.send(send_shape, next_rank)
def send_obj_meta(obj, next_rank=None):
"""Sends obj meta information before sending a specific obj.
Since the recipient must know the shape of the obj in p2p communications,
meta information of the obj should be sent before communications. This function
synchronizes with :func:`recv_obj_meta`.
Args:
obj (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): obj to be sent.
next_rank (int): The rank of the next member in pipeline parallel group.
"""
if next_rank is None:
next_rank = gpc.get_next_global_rank(ParallelMode.PIPELINE)
tensor_kwargs = {"dtype": torch.long, "device": get_current_device()}
if isinstance(obj, torch.Tensor):
send_obj_nums = torch.tensor(1, **tensor_kwargs)
dist.send(send_obj_nums, next_rank)
send_meta_helper(obj, next_rank, tensor_kwargs)
else:
send_obj_nums = torch.tensor(len(obj), **tensor_kwargs)
dist.send(send_obj_nums, next_rank)
for tensor_to_send in obj:
send_meta_helper(tensor_to_send, next_rank, tensor_kwargs)
def recv_meta_helper(prev_rank, tensor_kwargs):
recv_ndims = torch.empty((), **tensor_kwargs)
dist.recv(recv_ndims, prev_rank)
recv_shape = torch.empty(recv_ndims, **tensor_kwargs)
dist.recv(recv_shape, prev_rank)
return recv_shape
def recv_obj_meta(prev_rank=None) -> torch.Size:
"""Receives obj meta information before receiving a specific obj.
Since the recipient must know the shape of the obj in p2p communications,
meta information of the obj should be received before communications. This function
synchronizes with :func:`send_obj_meta`.
Args:
prev_rank (int): The rank of the source of the obj.
Returns:
Union[:class:`torch.Size`, List[:class:`torch.Size`]]: The shape of the obj to be received.
"""
if prev_rank is None:
prev_rank = gpc.get_prev_global_rank(ParallelMode.PIPELINE)
tensor_kwargs = {"dtype": torch.long, "device": get_current_device()}
recv_obj_nums = torch.empty((), **tensor_kwargs)
dist.recv(recv_obj_nums, prev_rank)
if recv_obj_nums.item() == 1:
recv_shape = recv_meta_helper(prev_rank, tensor_kwargs)
obj_shape = torch.Size(recv_shape)
else:
obj_shape = []
for _ in range(recv_obj_nums.item()):
recv_shape = recv_meta_helper(prev_rank, tensor_kwargs)
obj_shape.append(torch.Size(recv_shape))
return obj_shape
def split_tensor_into_1d_equal_chunks(tensor: torch.Tensor, new_buffer=False) -> torch.Tensor:
"""Break a tensor into equal 1D chunks.
Args:
tensor (:class:`torch.Tensor`): Tensor to be split before communication.
new_buffer (bool, optional): Whether to use a new buffer to store sliced tensor.
Returns:
:class:`torch.Tensor`: The split tensor
"""
partition_size = torch.numel(tensor) // gpc.get_world_size(ParallelMode.TENSOR)
start_index = partition_size * gpc.get_local_rank(ParallelMode.TENSOR)
end_index = start_index + partition_size
if new_buffer:
data = torch.empty(partition_size, dtype=tensor.dtype, device=torch.cuda.current_device(), requires_grad=False)
data.copy_(tensor.view(-1)[start_index:end_index])
else:
data = tensor.view(-1)[start_index:end_index]
return data
def gather_split_1d_tensor(tensor: torch.Tensor) -> torch.Tensor:
"""Opposite of above function, gather values from model parallel ranks.
Args:
tensor (:class:`torch.Tensor`): Tensor to be gathered after communication.
Returns:
:class:`torch.Tensor`: The gathered tensor.
"""
world_size = gpc.get_world_size(ParallelMode.TENSOR)
numel = torch.numel(tensor)
numel_gathered = world_size * numel
gathered = torch.empty(numel_gathered, dtype=tensor.dtype, device=torch.cuda.current_device(), requires_grad=False)
chunks = [gathered[i * numel : (i + 1) * numel] for i in range(world_size)]
dist.all_gather(chunks, tensor, group=gpc.get_group(ParallelMode.TENSOR))
return gathered
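To make the intent of these two helpers concrete, the following self-contained sketch simulates the same round trip locally with a tensor-parallel world size of 4, using plain `torch.chunk` in place of the gpc-backed process group (illustrative only, not how the functions are called in training):
```python
# Conceptual sketch: what split_tensor_into_1d_equal_chunks / gather_split_1d_tensor
# do across a tensor-parallel group of size 4, simulated in a single process.
import torch

world_size = 4
tensor = torch.arange(24, dtype=torch.float32).view(2, 3, 4)  # numel divisible by world_size

# Each rank keeps one contiguous 1-D slice of the flattened tensor ...
chunks = [c.clone() for c in tensor.view(-1).chunk(world_size)]

# ... and an all-gather over those slices reconstructs the original tensor.
gathered = torch.cat(chunks).view_as(tensor)
assert torch.equal(gathered, tensor)
```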

View File

@ -1,49 +0,0 @@
from .parallel_context import (
IS_TENSOR_PARALLEL,
Config,
ParallelContext,
global_context,
)
from .process_group_initializer import (
Initializer_Data,
Initializer_Model,
Initializer_Nettest,
Initializer_Pipeline,
Initializer_Tensor,
Initializer_Zero1,
ParallelMode,
ProcessGroupInitializer,
)
from .random import (
add_seed,
get_current_mode,
get_seeds,
get_states,
seed,
set_mode,
set_seed_states,
sync_states,
)
__all__ = [
"Config",
"IS_TENSOR_PARALLEL",
"global_context",
"ParallelContext",
"ParallelMode",
"Initializer_Tensor",
"Initializer_Pipeline",
"Initializer_Data",
"Initializer_Zero1",
"Initializer_Nettest",
"ProcessGroupInitializer",
"Initializer_Model",
"seed",
"set_mode",
"add_seed",
"get_seeds",
"get_states",
"get_current_mode",
"set_seed_states",
"sync_states",
]

View File

@ -1,569 +0,0 @@
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# adapted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/context
import inspect
import random
import socket
import sys
from collections import Counter
from importlib.machinery import SourceFileLoader
from pathlib import Path
from typing import Union
import numpy as np
import torch
import torch.distributed as dist
from internlm.utils.common import SingletonMeta
from internlm.utils.logger import get_logger
from internlm.utils.timeout import LLM_NCCL_TIMEOUT
from . import process_group_initializer as pgroup_initializer
from .process_group_initializer import ParallelMode
from .random import add_seed, get_seeds, set_mode
IS_TENSOR_PARALLEL = "is_tensor_parallel"
logger = get_logger(__file__)
class Config(dict):
"""This is a wrapper class for dict objects so that values of which can be
accessed as attributes.
Args:
config (dict): The dict object to be wrapped.
"""
def __init__(self, config: dict = None): # pylint: disable=W0231
if config is not None:
for k, v in config.items():
self._add_item(k, v)
def __missing__(self, key):
raise KeyError(key)
def __getattr__(self, key):
try:
value = super().__getitem__(key)
return value
except KeyError:
raise AttributeError(key)
def __setattr__(self, key, value):
super().__setitem__(key, value)
def _add_item(self, key, value):
if isinstance(value, dict):
self.__setattr__(key, Config(value))
else:
self.__setattr__(key, value)
def update(self, config):
assert isinstance(config, (Config, dict)), "can only update dictionary or Config objects."
for k, v in config.items():
self._add_item(k, v)
return self
@staticmethod
def from_file(filename: str):
"""Reads a python file and constructs a corresponding :class:`Config` object.
Args:
filename (str): Name of the file to construct the return object.
Returns:
:class:`Config`: A :class:`Config` object constructed with information in the file.
Raises:
AssertionError: Raises an AssertionError if the file does not exist, or the file is not a .py file
"""
# check config path
if isinstance(filename, str):
filepath = Path(filename).absolute()
elif isinstance(filename, Path):
filepath = filename.absolute()
assert filepath.exists(), f"{filename} is not found, please check your configuration path"
# check extension
extension = filepath.suffix
assert extension == ".py", "only .py files are supported"
# import the config as module
remove_path = False
if str(filepath.parent) not in sys.path:
sys.path.insert(0, str(filepath.parent))
remove_path = True
module_name = filepath.stem
source_file = SourceFileLoader(fullname=str(module_name), path=str(filepath))
module = source_file.load_module() # pylint: disable=W4902,E1120,W1505
# load into config
config = Config()
for k, v in module.__dict__.items():
if k.startswith("__") or inspect.ismodule(v) or inspect.isclass(v):
continue
else:
config._add_item(k, v)
# remove module
del sys.modules[module_name]
if remove_path:
sys.path.pop(0)
return config
class ParallelContext(metaclass=SingletonMeta):
"""This class provides interface functions for users to get the parallel context,
such as the global rank, the local rank, the world size, etc. of each device.
"""
def __init__(self):
# distributed settings
self._global_ranks = dict()
self._local_ranks = dict()
self._world_sizes = dict()
self._groups = dict()
self._cpu_groups = dict()
self._ranks_in_group = dict()
# load config from file
self._config = None
# default parallel args, will be overwritten during process group initialization
self.world_size = 1
self.data_parallel_size = 1
self.pipeline_parallel_size = 1
self.tensor_parallel_size = 1
self.zero1_parallel_size = -1
self.nettest_parallel_size = 1
self.num_processes_on_current_node = -1
self.virtual_pipeline_parallel_size = None
self.virtual_pipeline_parallel_rank = None
@property
def config(self):
return self._config
def load_config(self, config: Union[dict, str]):
"""Loads the configuration from either a dict or a file.
Args:
config (dict or str): Either a dict containing the configuration information or the filename
of a file containing the configuration information.
Raises:
TypeError: Raises a TypeError if `config` is neither a dict nor a str.
"""
if isinstance(config, str):
self._config = Config.from_file(config)
elif isinstance(config, dict):
self._config = Config(config)
else:
raise TypeError("Invalid type for config, only dictionary or string is supported")
def detect_num_processes_on_current_node(self):
hostname = socket.gethostname()
hostname_list = [None for _ in range(self.get_world_size(ParallelMode.GLOBAL))]
dist.all_gather_object(hostname_list, hostname, group=self.get_group(ParallelMode.GLOBAL))
counter = Counter(hostname_list)
self.num_processes_on_current_node = counter[hostname]
@staticmethod
def _check_parallel_mode(parallel_mode: ParallelMode):
assert isinstance(
parallel_mode, ParallelMode
), f"expected the argument parallel_mode to be of enum ParallelMode, but got {type(parallel_mode)}"
def get_global_rank(self):
"""Returns the global rank of the current device.
Returns:
int: The global rank of the current device
"""
return self._global_ranks[ParallelMode.GLOBAL]
def get_local_rank(self, parallel_mode: ParallelMode):
"""Returns the local rank of the current device.
Args:
parallel_mode: The parallel mode for the rank.
Returns:
int: The local rank of the current device for `parallel_mode`.
"""
self._check_parallel_mode(parallel_mode)
return self._local_ranks.get(parallel_mode, 0)
def get_next_global_rank(self, parallel_mode: ParallelMode):
"""Returns the global rank of the next device.
Args:
parallel_mode: The parallel mode for the rank.
Returns:
int: The global rank of the next device for `parallel_mode`.
"""
self._check_parallel_mode(parallel_mode)
# get rank and world size
local_rank = self.get_local_rank(parallel_mode)
world_size = self.get_world_size(parallel_mode)
ranks_in_group = self.get_ranks_in_group(parallel_mode)
return ranks_in_group[(local_rank + 1) % world_size]
def get_prev_global_rank(self, parallel_mode: ParallelMode):
"""Returns the global rank of the previous device.
Args:
parallel_mode: The chosen parallel mode.
Returns:
int: The global rank of the previous device for `parallel_mode`.
"""
self._check_parallel_mode(parallel_mode)
# get rank and world size
local_rank = self.get_local_rank(parallel_mode)
world_size = self.get_world_size(parallel_mode)
ranks_in_group = self.get_ranks_in_group(parallel_mode)
return ranks_in_group[(local_rank - 1) % world_size]
def is_using_dp(self):
"""Returns a boolean value indicating whether the current device is initilized with
ParallelMode.DATA and its world_size is greater than 1.
"""
return self.is_initialized(ParallelMode.DATA) and self.get_world_size(ParallelMode.DATA) > 1
def is_using_tp(self):
"""Returns a boolean value indicating whether the current device is initilized with
ParallelMode.TENSOR and its world_size is greater than 1.
"""
return self.is_initialized(ParallelMode.TENSOR) and self.get_world_size(ParallelMode.TENSOR) > 1
def is_using_pp(self):
"""Returns a boolean value indicating whether the current device is initilized with
ParallelMode.PIPELINE and its world_size is greater than 1.
"""
return self.is_initialized(ParallelMode.PIPELINE) and self.get_world_size(ParallelMode.PIPELINE) > 1
def is_using_sequence(self):
"""Returns a boolean value indicating whether the current device is initilized with
ParallelMode.SEQUENCE and its world_size is greater than 1.
"""
return False
# return gpc.is_initialized(ParallelMode.SEQUENCE) and gpc.get_world_size(ParallelMode.SEQUENCE) > 1
def is_first_rank(self, parallel_mode: ParallelMode):
"""Returns a boolean value indicating whether the current device is the first one
among its group for `parallel_mode`.
Args:
parallel_mode: The chosen parallel mode.
Returns:
bool: a boolean value indicating whether the current device is the first one
among its group for `parallel_mode`.
"""
rank = 0
if self.is_initialized(parallel_mode):
rank = self.get_local_rank(parallel_mode)
return rank == 0
def is_rank_for_log(self):
"""Returns a boolean value indicating whether the current device should print log."""
is_log_rank = (
self.is_first_rank(ParallelMode.DATA)
and self.is_first_rank(ParallelMode.TENSOR)
and self.is_last_rank(ParallelMode.PIPELINE)
)
return is_log_rank
def is_last_rank(self, parallel_mode: ParallelMode):
"""Returns a boolean value indicating whether the current device is the last one
among its group for `parallel_mode`.
Args:
parallel_mode: The chosen parallel mode.
Returns:
bool: a boolean value indicating whether the current device is the last one
among its group for `parallel_mode`.
"""
rank = 0
world_size = 1
if self.is_initialized(parallel_mode):
rank = self.get_local_rank(parallel_mode)
world_size = self.get_world_size(parallel_mode)
return rank == world_size - 1
def is_pipeline_first_stage(self, ignore_virtual=False):
if not ignore_virtual:
if self.virtual_pipeline_parallel_size is not None and self.virtual_pipeline_parallel_rank != 0:
return False
return self.is_first_rank(ParallelMode.PIPELINE)
def is_pipeline_last_stage(self, ignore_virtual=False):
if not ignore_virtual:
if (
self.virtual_pipeline_parallel_size is not None
and self.virtual_pipeline_parallel_rank != self.virtual_pipeline_parallel_size - 1
):
return False
return self.is_last_rank(ParallelMode.PIPELINE)
def get_world_size(self, parallel_mode: ParallelMode):
"""Returns the world size for `parallel_mode`.
Args:
parallel_mode: The chosen parallel mode.
Returns:
int: The world size for `parallel_mode`.
"""
self._check_parallel_mode(parallel_mode)
return self._world_sizes.get(parallel_mode, 1)
def get_group(self, parallel_mode: ParallelMode):
"""Returns the group of the current device for `parallel_mode`.
Args:
parallel_mode: The chosen parallel mode.
Returns:
torch.distributed.ProcessGroup: The group of the current device for `parallel_mode`.
"""
self._check_parallel_mode(parallel_mode)
return self._groups[parallel_mode]
def get_ranks_in_group(self, parallel_mode: ParallelMode):
"""Returns the rank of the current device for `parallel_mode` in the group.
Args:
parallel_mode: The chosen parallel mode.
Returns:
list: The global ranks of the current device's group for `parallel_mode`.
"""
self._check_parallel_mode(parallel_mode)
return self._ranks_in_group[parallel_mode]
def get_cpu_group(self, parallel_mode: ParallelMode):
self._check_parallel_mode(parallel_mode)
return self._cpu_groups[parallel_mode]
def init_global_dist(self, rank: int, world_size: int, backend: str, host: str, port: int, use_cpu: bool = False):
"""Initializes the global distributed environment
Args:
rank (int): rank for the default process group.
world_size (int): world size of the default process group.
backend (str): backend for ``torch.distributed``
host (str): the master address for distributed training.
port (int): the master port for distributed training.
use_cpu (bool): whether to set up cpu process group.
"""
# initialize the default process group
init_method = f"tcp://[{host}]:{port}"
dist.init_process_group(
rank=rank,
world_size=world_size,
backend=backend,
init_method=init_method,
timeout=LLM_NCCL_TIMEOUT,
)
# None will give the default global process group for pytorch dist operations
ranks = list(range(world_size))
if use_cpu:
cpu_group = (
dist.new_group(ranks, backend="gloo", timeout=LLM_NCCL_TIMEOUT)
if dist.get_backend() != "gloo"
else None
)
else:
cpu_group = None
self._register_dist(rank, world_size, dist.GroupMember.WORLD, cpu_group, ranks, ParallelMode.GLOBAL)
self._global_ranks[ParallelMode.GLOBAL] = rank
def _register_dist(self, local_rank, world_size, process_group, cpu_group, ranks_in_group, mode):
self._check_parallel_mode(mode)
self._local_ranks[mode] = local_rank
self._world_sizes[mode] = world_size
self._groups[mode] = process_group
self._cpu_groups[mode] = cpu_group
self._ranks_in_group[mode] = ranks_in_group
def check_sanity(self):
"""Checks sanity of the parallel context.
Raises:
AssertionError: Raises an AssertionError if the world size does not equal the product
of data parallel size, pipeline parallel size and tensor parallel size.
"""
dps = self.data_parallel_size
pps = self.pipeline_parallel_size
tps = self.tensor_parallel_size
ws = self.world_size
assert ws == dps * pps * tps, (
f"Expected the world size {ws} to be equal to data"
f" parallel size ({dps}) * pipeline parallel size "
f"({pps}) * tensor parallel size ({tps})"
)
assert self.zero1_parallel_size > 0
assert self.data_parallel_size % self.zero1_parallel_size == 0
def _set_parallel_size_from_config(self, config: dict, key: str, attr_name: str):
if key in config:
ele = config[key]
if isinstance(ele, int):
setattr(self, attr_name, ele)
elif isinstance(ele, dict):
setattr(self, attr_name, ele["size"])
else:
raise NotImplementedError(
"Parallel configuration does not support this kind of argument, please use int or dict"
)
def init_parallel_groups(self):
"""Initializes the parallel groups."""
# get rank and world size
rank = self.get_global_rank()
world_size = self.get_world_size(ParallelMode.GLOBAL)
self.world_size = world_size
# set parallel size as attributes for global context
parallel_config = self.config.get("parallel", None)
if parallel_config is not None:
self._set_parallel_size_from_config(parallel_config, "pipeline", "pipeline_parallel_size")
self._set_parallel_size_from_config(parallel_config, "tensor", "tensor_parallel_size")
self._set_parallel_size_from_config(parallel_config, "zero1", "zero1_parallel_size")
# the user should not set the data parallel size manually
# instead, it should be calculated based on other parallel config
self.data_parallel_size = self.world_size // (self.pipeline_parallel_size * self.tensor_parallel_size)
# the recommended nettest_parallel_size is 32 GPUs
self.nettest_parallel_size = 32
if self.zero1_parallel_size <= 0:
self.zero1_parallel_size = self.data_parallel_size
self.check_sanity()
initializer_args = [
rank,
world_size,
self.data_parallel_size,
self.pipeline_parallel_size,
self.tensor_parallel_size,
self.zero1_parallel_size,
self.nettest_parallel_size,
]
# run initialization of different process groups
initializers = []
initializers.append(pgroup_initializer.Initializer_Data(*initializer_args))
initializers.append(pgroup_initializer.Initializer_Model(*initializer_args))
initializers.append(pgroup_initializer.Initializer_Tensor(*initializer_args))
initializers.append(pgroup_initializer.Initializer_Zero1(*initializer_args))
initializers.append(pgroup_initializer.Initializer_Nettest(*initializer_args))
if self.pipeline_parallel_size > 1:
initializers.append(pgroup_initializer.Initializer_Pipeline(*initializer_args))
for initializer in initializers:
parallel_setting = initializer.init_dist_group()
if isinstance(parallel_setting, list):
for args in parallel_setting:
self._register_dist(*args)
else:
self._register_dist(*parallel_setting)
def is_initialized(self, parallel_mode: ParallelMode):
"""Returns a boolean value indicating whether `parallel_mode` is initialized
in the current system.
"""
return parallel_mode in self._groups
def destroy(self):
"""Destroys the current distributed parallel environment."""
for mode, group in self._groups.items():
if mode is not ParallelMode.GLOBAL:
dist.destroy_process_group(group)
# destroy global process group
dist.destroy_process_group()
self._groups.clear()
def set_device(self, device_ordinal: int = None):
"""Sets distributed processes to be bound to devices.
Args:
device_ordinal (int, optional): the device id to be bound to
"""
global_rank = self.get_global_rank()
if device_ordinal is None:
devices_per_node = torch.cuda.device_count()
device_ordinal = global_rank % devices_per_node
torch.cuda.set_device(device_ordinal)
logger.info(f"process rank {global_rank} is bound to host:{socket.gethostname()} device: {device_ordinal}")
def set_seed(self, seed: int, dpseed_with_tpoffset: bool = False):
"""Sets seeds for all random libraries.
Args:
seed (int): seed for random states
"""
pipeline_offset = self._local_ranks.get(ParallelMode.PIPELINE, 0)
global_rank = self.get_global_rank()
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
assert torch.cuda.is_available()
# data parallel seed are kept the same in the same pipeline stage
dp_seed = seed
if dpseed_with_tpoffset:
dp_seed = seed + pipeline_offset * 1024
add_seed(ParallelMode.DATA, dp_seed)
add_seed(ParallelMode.DUMMY, dp_seed)
# model parallel seeds are different across ranks
if self.is_initialized(ParallelMode.TENSOR):
tp_rank = self.get_local_rank(ParallelMode.TENSOR)
tp_seed = seed + tp_rank + pipeline_offset * 1024
add_seed(ParallelMode.TENSOR, tp_seed)
# we do not set the random state mode to ParallelMode.DATA until model is built (instead, we use a dummy mode
# during model construction), this is because the random state will be different in different tensor parallel
# device of the same data parallel group. The underlying reason is that the device of tp_rank = 0 will perform
# additional random operations during the RowParallelLinear module building process.
set_mode(ParallelMode.DUMMY)
seeds = get_seeds()
seed_str = ", ".join([f"{k}: {v}" for k, v in seeds.items()])
logger.info(
f"initialized seed on rank {global_rank}, "
f"numpy: {seed}, python random: {seed}, {seed_str},"
f"the default parallel seed is {ParallelMode.DATA}."
)
def set_virtual_pipeline_parallel_size(self, size):
self.virtual_pipeline_parallel_size = size
def set_virtual_pipeline_parallel_rank(self, rank):
self.virtual_pipeline_parallel_rank = rank
global_context = ParallelContext()
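# Bring-up sketch (illustrative only): the rough order in which the singleton
# context is usually prepared. The dict below is a toy config; the commented
# steps assume rank/world_size/master address are provided by the launcher.
gpc = global_context
gpc.load_config({"parallel": {"zero1": -1, "tensor": 1, "pipeline": 1}})
# gpc.init_global_dist(rank, world_size, backend="nccl", host=master_addr, port=master_port)
# gpc.init_parallel_groups()   # derives data_parallel_size = world_size // (pipeline * tensor)
# gpc.set_device()             # binds this process to global_rank % gpus_per_node
# gpc.set_seed(1024)           # seeds python/numpy/torch plus per-mode CUDA RNG states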

@@ -1,418 +0,0 @@
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# adopted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/context
import math
from abc import ABC, abstractmethod
from enum import Enum
import torch.distributed as dist
from internlm.utils.timeout import LLM_NCCL_TIMEOUT
# parallel modes
class ParallelMode(Enum):
"""This is an enumeration class containing all possible parallel modes."""
GLOBAL = "global"
# common parallel
DATA = "data"
# model parallel - containing tensor and pipeline parallel groups
# this is added to facilitate amp and grad clipping in hybrid parallel
MODEL = "model"
# pipeline parallel
PIPELINE = "pipe"
# containing all ranks in tensor parallel
TENSOR = "tensor"
# zero1 parallel
ZERO1 = "zero1"
# runtime network test
NETTEST = "nettest"
# dummy mode, only used during model construction
DUMMY = "dummy"
class ProcessGroupInitializer(ABC):
"""An object, knowing the parallelism configuration, that initializes parallel groups.
Args:
rank (int): The rank of current process.
world_size (int): Size of whole communication world.
data_parallel_size (int): Size of data parallel.
pipeline_parallel_size (int): Size of pipeline parallel.
tensor_parallel_size (int): Size of tensor parallel.
zero1_parallel_size (int): Size of zero1 parallel.
"""
def __init__(
self,
rank: int,
world_size: int,
data_parallel_size: int,
pipeline_parallel_size: int,
tensor_parallel_size: int,
zero1_parallel_size: int,
nettest_parallel_size: int,
):
self.rank = rank
self.world_size = world_size
self.data_parallel_size = data_parallel_size
self.pipeline_parallel_size = pipeline_parallel_size
self.tensor_parallel_size = tensor_parallel_size
self.zero1_parallel_size = zero1_parallel_size
self.nettest_parallel_size = nettest_parallel_size
super().__init__()
@abstractmethod
def init_dist_group(self, use_cpu: bool = False):
pass
class Initializer_Data(ProcessGroupInitializer):
"""A ProcessGroupInitializer for data parallelism.
Args:
rank (int): The rank of current process.
world_size (int): Size of whole communication world.
data_parallel_size (int): Size of data parallel.
pipeline_parallel_size (int): Size of pipeline parallel.
tensor_parallel_size (int): Size of tensor parallel.
zero1_parallel_size (int): Size of zero1 parallel.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.rank_num_per_dp_group = self.world_size // self.data_parallel_size
assert self.world_size % self.data_parallel_size == 0
def init_dist_group(self, use_cpu: bool = False):
"""Initialize data parallel groups, and assign local_ranks and groups to each gpu.
Returns:
Tuple (local_rank, group_world_size, process_group, ranks_in_group, mode):
A Data parallelism's information tuple.
"""
local_rank = None
ranks_in_group = None
process_group = None
cpu_group = None
group_world_size = None
mode = ParallelMode.DATA
for i in range(self.rank_num_per_dp_group):
ranks = [i + j * self.rank_num_per_dp_group for j in range(self.data_parallel_size)]
group = dist.new_group(ranks, timeout=LLM_NCCL_TIMEOUT)
if use_cpu:
group_cpu = (
dist.new_group(ranks, backend="gloo", timeout=LLM_NCCL_TIMEOUT)
if dist.get_backend() != "gloo"
else group
)
else:
group_cpu = None
if self.rank in ranks:
local_rank = ranks.index(self.rank)
group_world_size = len(ranks)
process_group = group
cpu_group = group_cpu
ranks_in_group = ranks
return local_rank, group_world_size, process_group, cpu_group, ranks_in_group, mode
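# Grouping sketch (illustrative only, hypothetical sizes): the loop above builds one
# data-parallel group per position inside a model replica. For world_size=8 and
# data_parallel_size=4 the resulting groups are:
world_size, data_parallel_size = 8, 4
rank_num_per_dp_group = world_size // data_parallel_size
groups = [
    [i + j * rank_num_per_dp_group for j in range(data_parallel_size)]
    for i in range(rank_num_per_dp_group)
]
assert groups == [[0, 2, 4, 6], [1, 3, 5, 7]]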
class Initializer_Model(ProcessGroupInitializer):
"""A ProcessGroupInitializer for model parallelism (model parallel group contains pipeline and tensor parallel
groups).
Args:
rank (int): The rank of current process.
world_size (int): Size of whole communication world.
data_parallel_size (int): Size of data parallel.
pipeline_parallel_size (int): Size of pipeline parallel.
tensor_parallel_size (int): Size of tensor parallel.
zero1_parallel_size (int): Size of zero1 parallel.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.rank_num_per_group = self.tensor_parallel_size * self.pipeline_parallel_size
self.num_group = self.world_size // self.rank_num_per_group
assert self.world_size % self.rank_num_per_group == 0
def init_dist_group(self, use_cpu: bool = False):
"""Initialize model parallel groups, and assign local_ranks and groups to each gpu.
Returns:
Tuple (local_rank, group_world_size, process_group, ranks_in_group, mode):
A Model parallelism's information tuple.
"""
local_rank = None
ranks_in_group = None
process_group = None
cpu_group = None
group_world_size = None
mode = ParallelMode.MODEL
for i in range(self.num_group):
ranks = [i * self.rank_num_per_group + j for j in range(self.rank_num_per_group)]
group = dist.new_group(ranks, timeout=LLM_NCCL_TIMEOUT)
if use_cpu:
group_cpu = (
dist.new_group(ranks, backend="gloo", timeout=LLM_NCCL_TIMEOUT)
if dist.get_backend() != "gloo"
else group
)
else:
group_cpu = None
if self.rank in ranks:
local_rank = ranks.index(self.rank)
group_world_size = len(ranks)
process_group = group
cpu_group = group_cpu
ranks_in_group = ranks
return local_rank, group_world_size, process_group, cpu_group, ranks_in_group, mode
class Initializer_Pipeline(ProcessGroupInitializer):
"""A ProcessGroupInitializer for pipeline parallelism.
Args:
rank (int): The rank of current process
world_size (int): Size of whole communication world
data_parallel_size (int): Size of data parallel
pipeline_parallel_size (int): Size of pipeline parallel
tensor_parallel_size (int): Size of tensor parallel
zero1_parallel_size (int): Size of zero1 parallel.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.rank_num_per_dp_group = self.world_size // self.data_parallel_size
self.pipeline_stage_size = self.rank_num_per_dp_group // self.pipeline_parallel_size
assert self.world_size % self.data_parallel_size == 0
assert self.rank_num_per_dp_group % self.pipeline_parallel_size == 0
def init_dist_group(self, use_cpu: bool = False):
"""Initialize pipeline parallel groups, and assign local_ranks and groups to each gpu.
Returns:
List[Tuple (local_rank, group_world_size, process_group, ranks_in_group, mode)]:
A Pipeline parallelism's information in list of tuples.
"""
local_rank = None
ranks_in_group = None
process_group = None
cpu_group = None
group_world_size = None
mode = ParallelMode.PIPELINE
for i in range(self.data_parallel_size):
for j in range(self.pipeline_stage_size):
ranks = list(
range(
i * self.rank_num_per_dp_group + j,
(i + 1) * self.rank_num_per_dp_group,
self.pipeline_stage_size,
)
)
pipe_group_size = len(ranks)
pipe_group = dist.new_group(ranks, timeout=LLM_NCCL_TIMEOUT)
if use_cpu:
group_cpu = (
dist.new_group(ranks, backend="gloo", timeout=LLM_NCCL_TIMEOUT)
if dist.get_backend() != "gloo"
else pipe_group
)
else:
group_cpu = None
if self.rank in ranks:
local_rank = ranks.index(self.rank)
group_world_size = pipe_group_size
process_group = pipe_group
cpu_group = group_cpu
ranks_in_group = ranks
return local_rank, group_world_size, process_group, cpu_group, ranks_in_group, mode
class Initializer_Tensor(ProcessGroupInitializer):
"""A ProcessGroupInitializer for tensor parallelism.
Args:
rank (int): The rank of current process.
world_size (int): Size of whole communication world.
data_parallel_size (int): Size of data parallel.
pipeline_parallel_size (int): Size of pipeline parallel.
tensor_parallel_size (int): Size of tensor parallel.
zero1_parallel_size (int): Size of zero1 parallel.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.num_tensor_parallel_group = self.world_size // self.tensor_parallel_size
assert self.world_size % self.tensor_parallel_size == 0
def init_dist_group(self, use_cpu: bool = False):
"""Initialize tensor parallel groups, and assign local_ranks and groups to each gpu.
Returns:
Tuple (local_rank, group_world_size, process_group, ranks_in_group, mode):
A Tensor parallelism's information tuple.
"""
local_rank = None
ranks_in_group = None
process_group = None
cpu_group = None
group_world_size = None
mode = ParallelMode.TENSOR
for i in range(self.num_tensor_parallel_group):
ranks = [i * self.tensor_parallel_size + j for j in range(self.tensor_parallel_size)]
group = dist.new_group(ranks, timeout=LLM_NCCL_TIMEOUT)
if use_cpu:
group_cpu = (
dist.new_group(ranks, backend="gloo", timeout=LLM_NCCL_TIMEOUT)
if dist.get_backend() != "gloo"
else group
)
else:
group_cpu = None
if self.rank in ranks:
local_rank = ranks.index(self.rank)
group_world_size = len(ranks)
process_group = group
cpu_group = group_cpu
ranks_in_group = ranks
return local_rank, group_world_size, process_group, cpu_group, ranks_in_group, mode
class Initializer_Zero1(ProcessGroupInitializer):
"""A ProcessGroupInitializer for zero-1 parallelism.
Args:
rank (int): The rank of current process.
world_size (int): Size of whole communication world.
data_parallel_size (int): Size of data parallel.
pipeline_parallel_size (int): Size of pipeline parallel.
tensor_parallel_size (int): Size of tensor parallel.
zero1_parallel_size (int): Size of zero-1 parallel.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.rank_num_per_dp_group = self.world_size // self.data_parallel_size
self.num_zero1_parallel_group = self.data_parallel_size // self.zero1_parallel_size
assert self.world_size % self.data_parallel_size == 0
assert self.world_size % self.zero1_parallel_size == 0
def init_dist_group(self, use_cpu: bool = False):
"""Initialize zero1 parallel groups, and assign local_ranks and groups to each gpu.
Returns:
Tuple (local_rank, group_world_size, process_group, ranks_in_group, mode):
A zero1 parallelism's information tuple.
"""
local_rank = None
ranks_in_group = None
process_group = None
cpu_group = None
group_world_size = None
mode = ParallelMode.ZERO1
for i in range(self.rank_num_per_dp_group):
for j in range(self.num_zero1_parallel_group):
ranks = [
i + (j * self.zero1_parallel_size + k) * self.rank_num_per_dp_group
for k in range(self.zero1_parallel_size)
]
group = dist.new_group(ranks, timeout=LLM_NCCL_TIMEOUT)
if use_cpu:
group_cpu = (
dist.new_group(ranks, backend="gloo", timeout=LLM_NCCL_TIMEOUT)
if dist.get_backend() != "gloo"
else group
)
else:
group_cpu = None
if self.rank in ranks:
local_rank = ranks.index(self.rank)
group_world_size = len(ranks)
process_group = group
cpu_group = group_cpu
ranks_in_group = ranks
return local_rank, group_world_size, process_group, cpu_group, ranks_in_group, mode
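# Grouping sketch (illustrative only, hypothetical sizes): the nested loop above splits
# every data-parallel group into ZeRO-1 subgroups. For world_size=8,
# data_parallel_size=4 and zero1_parallel_size=2 the resulting groups are:
world_size, data_parallel_size, zero1_parallel_size = 8, 4, 2
rank_num_per_dp_group = world_size // data_parallel_size
num_zero1_parallel_group = data_parallel_size // zero1_parallel_size
groups = [
    [i + (j * zero1_parallel_size + k) * rank_num_per_dp_group for k in range(zero1_parallel_size)]
    for i in range(rank_num_per_dp_group)
    for j in range(num_zero1_parallel_group)
]
assert groups == [[0, 2], [4, 6], [1, 3], [5, 7]]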
class Initializer_Nettest(ProcessGroupInitializer):
"""A ProcessGroupInitializer for network test, especailly for NCCL.
Args:
rank (int): The rank of current process.
world_size (int): Size of whole communication world.
nettest_parallel_size (int): Size of a network test group.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.num_nettest_group = math.ceil(self.world_size / self.nettest_parallel_size)
def init_dist_group(self, use_cpu: bool = False):
"""Initialize tensor parallel groups, and assign local_ranks and groups to each gpu.
Returns:
Tuple (local_rank, group_world_size, process_group, ranks_in_group, mode):
A Tensor parallelism's information tuple.
"""
local_rank = None
ranks_in_group = None
process_group = None
cpu_group = None
group_world_size = None
mode = ParallelMode.NETTEST
for i in range(self.num_nettest_group):
ranks = []
for j in range(self.nettest_parallel_size):
rank = i * self.nettest_parallel_size + j
if rank < self.world_size:
ranks.append(rank)
group = dist.new_group(ranks, timeout=LLM_NCCL_TIMEOUT)
if use_cpu:
group_cpu = (
dist.new_group(ranks, backend="gloo", timeout=LLM_NCCL_TIMEOUT)
if dist.get_backend() != "gloo"
else group
)
else:
group_cpu = None
if self.rank in ranks:
local_rank = ranks.index(self.rank)
group_world_size = len(ranks)
process_group = group
cpu_group = group_cpu
ranks_in_group = ranks
return local_rank, group_world_size, process_group, cpu_group, ranks_in_group, mode

@@ -1,131 +0,0 @@
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# adopted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/context
from contextlib import contextmanager
import torch
import torch.cuda
from torch import Tensor
from .process_group_initializer import ParallelMode
class SeedManager:
"""This class is a manager of all random seeds involved in the system."""
def __init__(self):
self._current_mode = None
self._seeds = {}
self._seed_states = {}
@property
def current_mode(self):
return self._current_mode
@property
def seeds(self):
return self._seeds
@property
def seed_states(self):
return self._seed_states
def set_state(self, parallel_mode: ParallelMode, state: Tensor):
"""Sets the state of the seed manager for `parallel_mode`."""
assert parallel_mode in self._seed_states, f"{parallel_mode} not found in seed manager"
self._seed_states[parallel_mode] = state
def set_mode(self, parallel_mode: ParallelMode):
"""Sets the current mode of the seed manager."""
if self.current_mode:
# save state for current mode
self._seed_states[self._current_mode] = torch.cuda.get_rng_state()
# set new state for new mode
self._current_mode = parallel_mode
torch.cuda.set_rng_state(self._seed_states[parallel_mode])
def add_seed(self, parallel_mode: ParallelMode, seed: int, overwrite: bool = False):
"""Adds a seed to the seed manager for `parallel_mode`."""
assert isinstance(parallel_mode, ParallelMode), "Invalid ParallelMode"
if not overwrite:
assert parallel_mode not in self._seed_states, f"Seed for {parallel_mode} exists"
elif parallel_mode in self._seed_states:
print(f"Warning: {parallel_mode} seed overwritten.", flush=True)
current_state = torch.cuda.get_rng_state()
torch.cuda.manual_seed(seed)
self._seed_states[parallel_mode] = torch.cuda.get_rng_state()
self._seeds[parallel_mode] = seed
torch.cuda.set_rng_state(current_state)
def reset(self):
self._current_mode = None
self._seeds = {}
self._seed_states = {}
_SEED_MANAGER = SeedManager()
def get_seeds():
"""Returns the seeds of the seed manager.
Returns:
dict: The seeds of the seed manager.
"""
return _SEED_MANAGER.seeds
def get_states(copy=False):
"""Returns the seed states of the seed manager.
Returns:
dict: The seed states of the seed manager.
"""
states = _SEED_MANAGER.seed_states
if copy:
new_states = dict()
for parallel_mode, state in states.items():
new_states[parallel_mode] = state.clone()
return new_states
else:
return _SEED_MANAGER.seed_states
def get_current_mode():
"""Returns the current mode of the seed manager.
Returns:
:class:`ParallelMode`: The current parallel mode of the seed manager.
"""
return _SEED_MANAGER.current_mode
def add_seed(parallel_mode: ParallelMode, seed: int, overwrite: bool = False):
"""Adds a seed to the seed manager for `parallel_mode`."""
_SEED_MANAGER.add_seed(parallel_mode, seed, overwrite)
def set_mode(parallel_mode: ParallelMode):
"""Sets the current mode of the seed manager."""
_SEED_MANAGER.set_mode(parallel_mode)
def set_seed_states(parallel_mode: ParallelMode, state: Tensor):
"""Sets the state of the seed manager for `parallel_mode`."""
_SEED_MANAGER.set_state(parallel_mode, state)
def sync_states():
current_mode = get_current_mode()
current_states = torch.cuda.get_rng_state()
set_seed_states(current_mode, current_states)
@contextmanager
def seed(parallel_mode: ParallelMode):
"""A context for seed switch"""
current_mode = _SEED_MANAGER.current_mode
try:
yield _SEED_MANAGER.set_mode(parallel_mode)
finally:
_SEED_MANAGER.set_mode(current_mode)
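# Usage sketch (illustrative only; requires a CUDA device, since all RNG state
# handling here goes through torch.cuda). The seed values are arbitrary.
add_seed(ParallelMode.DATA, 1024)
add_seed(ParallelMode.TENSOR, 1024 + 1)     # e.g. offset by the tensor-parallel rank
set_mode(ParallelMode.DATA)
with seed(ParallelMode.TENSOR):
    pass  # dropout / weight init here would consume the TENSOR RNG state
# after the context exits, the DATA RNG state is active again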

@@ -1,190 +0,0 @@
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# adopted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/engine
from typing import List, Optional
import torch
from torch.nn import Module
from torch.nn.modules.loss import _Loss
from torch.optim.lr_scheduler import _LRScheduler
from internlm.core.gradient_handler import BaseGradientHandler
from internlm.solver.beta2_scheduler import Beta2Scheduler
from internlm.solver.optimizer.hybrid_zero_optim import BaseOptimizer
from internlm.utils.common import get_batch_size, move_to_device
class Engine:
"""
The Engine class is responsible for managing the training and evaluation process of a neural network model.
It handles the forward and backward passes, parameter updates, gradient handling, and mode switching between
training and evaluation.
Args:
model (torch.nn.Module): The neural network model to be trained or evaluated.
optimizer (BaseOptimizer): The optimizer used for updating the parameters of the model.
lr_scheduler (torch.optim.lr_scheduler._LRScheduler, optional): The learning rate scheduler for the optimizer.
Default is None.
beta2_scheduler (internlm.solver.beta2_scheduler.Beta2Scheduler, optional): The beta2 scheduler for the
optimizer. Default is None.
criterion (torch.nn.modules.loss._Loss, optional): The loss function used for calculating the loss during
training. Default is None.
gradient_handlers (List[BaseGradientHandler], optional): A list of gradient handlers used in the backward pass.
Default is None.
clip_grad_norm (float, optional): The norm value for gradient clipping. Default is 0.0.
Examples:
>>> # define model, criterion, optimizer, lr_scheduler, train_dataloader for your training
>>> model = ...
>>> criterion = ...
>>> optimizer = ...
>>> train_dataloader = ...
>>> engine, _, _, _ = internlm.initialize_engine(model, optimizer, criterion)
>>> engine.train()
>>> for inputs, labels in train_dataloader:
>>> # set gradients to zero
>>> engine.zero_grad()
>>> # run forward pass
>>> outputs = engine(inputs)
>>> # compute loss value and run backward pass
>>> loss = engine.criterion(outputs, labels)
>>> engine.backward(loss)
>>> # update parameters
>>> engine.step()
"""
def __init__(
self,
model: Module,
optimizer: BaseOptimizer,
lr_scheduler: Optional[_LRScheduler] = None,
beta2_scheduler: Optional[Beta2Scheduler] = None,
criterion: Optional[_Loss] = None,
gradient_handlers: Optional[List[BaseGradientHandler]] = None,
clip_grad_norm: float = 0.0,
):
self._model = model
self._optimizer = optimizer
self._lr_scheduler = lr_scheduler
self._beta2_scheduler = beta2_scheduler
self._criterion = criterion
self._clip_grad_norm = clip_grad_norm
# state
self.training = True # default
# build gradient handler
self._gradient_handlers = gradient_handlers if gradient_handlers else []
@property
def model(self):
"""Returns the model attached to the engine."""
return self._model
@property
def optimizer(self):
"""Returns the optimizer attached to the engine."""
return self._optimizer
@property
def criterion(self):
"""Returns the criterion (loss function) attached to the engine."""
return self._criterion
def _all_reduce_gradients(self):
"""Handles all-reduce operations of gradients across different parallel groups."""
for handler in self._gradient_handlers:
handler.handle_gradient()
def zero_grad(self):
"""Sets the gradient of all parameters in the model to zero."""
self.optimizer.zero_grad()
def step(self):
"""
Executes the parameter update step. This includes all-reduce operations of gradients, gradient clipping,
and parameter update. If successful, it also steps the learning rate scheduler and beta2 scheduler
if they exist.
Returns:
success (bool): Whether the parameter update was successful.
grad_norm (float): The norm of the gradient after clipping.
"""
self._all_reduce_gradients()
self.optimizer.clip_grad_norm(self.model, self._clip_grad_norm)
success, grad_norm = self.optimizer.step()
if success and self._lr_scheduler is not None:
self._lr_scheduler.step()
if success and self._beta2_scheduler is not None:
self._beta2_scheduler.step()
return success, grad_norm
def train(self):
"""Sets the model to training mode."""
self.training = True
self._model.train()
def eval(self):
"""Sets the model to evaluation mode."""
self.training = False
self._model.eval()
def backward(self, loss: torch.Tensor):
"""
Starts the backward propagation given the loss value computed by a loss function.
Args:
loss (torch.Tensor): The loss value computed by a loss function.
"""
return self.optimizer.backward(loss)
def backward_by_grad(self, tensor, grad):
"""
Starts the backward propagation given the gradient of the output tensor.
Args:
tensor (torch.Tensor): The output tensor.
grad (torch.Tensor): The gradient passed back to the output tensor.
"""
return self.optimizer.backward_by_grad(tensor, grad)
def __call__(self, *args, **kwargs):
"""
Runs the forward step for the model.
Returns:
torch.Tensor: The output of the model.
"""
return self.model(*args, **kwargs)
def load_batch(self, data_iter, to_gpu=True):
"""
Loads a batch from the data iterator and, when `to_gpu` is True, moves it to
the same device as the model.
Args:
data_iter (Iterable): The data iterator from which to get a batch of data, obtained by calling
iter(dataloader).
to_gpu (bool, optional): Whether the data should be moved to the GPU. Default is True.
Returns:
Tuple: A tuple of (batch_data, batch_size).
"""
if data_iter is None:
raise RuntimeError("Dataloader is not defined.")
try:
batch_data = next(data_iter)
except TypeError:
batch_data = data_iter
if to_gpu:
batch_data = move_to_device(batch_data)
batch_size = get_batch_size(batch_data)
return batch_data, batch_size
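# Evaluation sketch (illustrative only): counterpart to the training example in the
# class docstring. `engine` and `val_dataloader` are hypothetical objects assumed to
# have been built elsewhere.
# engine.eval()
# data_iter = iter(val_dataloader)
# with torch.no_grad():
#     for _ in range(len(val_dataloader)):
#         batch, batch_size = engine.load_batch(data_iter)
#         outputs = engine(*batch) if isinstance(batch, (tuple, list)) else engine(batch)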

@@ -1,76 +0,0 @@
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
from abc import ABC, abstractmethod
from collections import defaultdict
import torch
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
from internlm.core.context import global_context as gpc
class BaseGradientHandler(ABC):
"""A basic helper class to handle all-reduce operations of gradients across different parallel groups
before optimization.
Args:
model (Module): Model where the gradients accumulate.
optimizer (Optimizer): Optimizer for updating the parameters.
"""
def __init__(self, model, optimizer):
self._model = model
self._optimizer = optimizer
@abstractmethod
def handle_gradient(self):
"""A method to accumulate gradients across different parallel groups. Users should
write their own functions or just use the functions in pre-defined subclasses.
"""
pass
class PipelineSharedModuleGradientHandler(BaseGradientHandler):
"""A helper class to handle all-reduce operations in sub parallel groups.
An all-reduce collective communication is performed in
:func:`handle_gradient` among all sub pipeline parallel groups.
For better performance, it bucketizes the gradients of all parameters of the
same type to improve the efficiency of communication.
Args:
model (Module): Model where the gradients accumulate.
optimizer (Optimizer): Optimizer for updating the parameters.
"""
def handle_gradient(self):
"""A method running a all-reduce operation in sub pipeline parallel groups."""
if gpc.pipeline_parallel_size > 1:
# bucketize and all-reduce
buckets = defaultdict(lambda: defaultdict(list))
# Pack the buckets.
for param in self._model.parameters():
group = getattr(param, "pipeline_shared_module_pg", None)
if (
param.requires_grad
and group is not None
and (
(hasattr(param, "colo_attr") and not param.colo_attr.saved_grad.is_null())
or param.grad is not None
)
):
tp = param.data.type()
buckets[group][tp].append(param)
# For each bucket, all-reduce and copy all-reduced grads.
for group, group_buckets in buckets.items():
for tp, bucket in group_buckets.items():
grads = [
param.colo_attr.grad_payload if hasattr(param, "colo_attr") else param.grad.data
for param in bucket
]
coalesced = _flatten_dense_tensors(grads).to(torch.cuda.current_device())
dist.all_reduce(coalesced, op=dist.ReduceOp.SUM, group=group)
for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)):
buf.copy_(synced)
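# Bucketing sketch (illustrative only): the handler above coalesces same-dtype
# gradients into one flat buffer before the collective and scatters the result back.
# The same flatten/unflatten round trip can be reproduced on CPU without a process
# group (mul_(2) stands in for dist.all_reduce):
grads = [torch.ones(2, 3), torch.arange(4, dtype=torch.float32)]
coalesced = _flatten_dense_tensors(grads)
coalesced.mul_(2)
for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)):
    buf.copy_(synced)
assert torch.equal(grads[0], torch.full((2, 3), 2.0))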
