diff --git a/.gitmodules b/.gitmodules
deleted file mode 100644
index 0e8d65a..0000000
--- a/.gitmodules
+++ /dev/null
@@ -1,6 +0,0 @@
-[submodule "third_party/flash-attention"]
- path = third_party/flash-attention
- url = https://github.com/HazyResearch/flash-attention.git
-[submodule "third_party/apex"]
- path = third_party/apex
- url = https://github.com/NVIDIA/apex
diff --git a/README-ja-JP.md b/README-ja-JP.md
deleted file mode 100644
index 71e1d4d..0000000
--- a/README-ja-JP.md
+++ /dev/null
@@ -1,203 +0,0 @@
-# InternLM
-
-
-
-
-
-
-
-[![license](./doc/imgs/license.svg)](./LICENSE)
-[![evaluation](./doc/imgs/compass_support.svg)](https://github.com/internLM/OpenCompass/)
-[![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest)
-
-[📘使用法](./doc/en/usage.md) |
-[🛠️インストール](./doc/en/install.md) |
-[📊トレーニングパフォーマンス](./doc/en/train_performance.md) |
-[👀モデル](#model-zoo) |
-[🆕更新ニュース](./CHANGE_LOG.md) |
-[🤔Issues 報告](https://github.com/InternLM/InternLM/issues/new)
-
-[English](./README.md) |
-[简体中文](./README-zh-Hans.md) |
-[日本語](./README-ja-JP.md)
-
-
-
-## はじめに
-
-InternLM は、70 億のパラメータを持つベースモデルと、実用的なシナリオに合わせたチャットモデルをオープンソース化しています。このモデルには以下の特徴があります:
-
-- 何兆もの高品質なトークンをトレーニングに活用し、強力な知識ベースを確立します。
-- 8k のコンテキストウィンドウ長をサポートし、より長い入力シーケンスと強力な推論機能を可能にする。
-- ユーザが独自のワークフローを柔軟に構築できるよう、汎用性の高いツールセットを提供します。
-
-さらに、大規模な依存関係を必要とせずにモデルの事前学習をサポートする軽量な学習フレームワークが提供されます。単一のコードベースで、数千の GPU を持つ大規模クラスタでの事前学習と、単一の GPU での微調整をサポートし、顕著な性能最適化を達成します。InternLM は、1024GPU でのトレーニングにおいて 90% 近いアクセラレーション効率を達成しています。
-
-## 新闻
-
-[20231213] InternLM-7B-Chat および InternLM-20B-Chat のモデルの重みを更新しました。 新しいバージョンの会話モデルでは、より高品質でより多様な言語スタイルの応答を生成できます。
-[20230920] 基本版と会話版を含むInternLM-20Bをリリースしました。
-
-## InternLM-7B
-
-### パフォーマンス評価
-
-オープンソースの評価ツール [OpenCompass](https://github.com/internLM/OpenCompass/) を用いて、InternLM の総合的な評価を行った。この評価では、分野別能力、言語能力、知識能力、推論能力、理解能力の 5 つの次元をカバーしました。以下は評価結果の一部であり、その他の評価結果については [OpenCompass leaderboard](https://opencompass.org.cn/rank) をご覧ください。
-
-
-| データセット\モデル | **InternLM-Chat-7B** | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
-| --------------- | -------------------------- | --------------------- | -------- | ----------- | ----------- | --------- | --------- |
-| C-Eval(Val) | 52.0 | 53.4 | 24.2 | 42.7 | 50.9 | 28.9 | 31.2 |
-| MMLU | 52.6 | 51.0 | 35.2* | 41.5 | 46.0 | 39.7 | 47.3 |
-| AGIEval | 46.4 | 37.6 | 20.8 | 24.6 | 39.0 | 24.1 | 26.4 |
-| CommonSenseQA | 80.8 | 59.5 | 65.0 | 58.8 | 60.0 | 68.7 | 66.7 |
-| BUSTM | 80.6 | 50.6 | 48.5 | 51.3 | 55.0 | 48.8 | 62.5 |
-| CLUEWSC | 81.8 | 59.1 | 50.3 | 52.8 | 59.8 | 50.3 | 52.2 |
-| MATH | 5.0 | 7.1 | 2.8 | 3.0 | 6.6 | 2.2 | 2.8 |
-| GSM8K | 36.2 | 31.2 | 10.1 | 9.7 | 29.2 | 6.0 | 15.3 |
-| HumanEval | 15.9 | 10.4 | 14.0 | 9.2 | 9.2 | 9.2 | 11.0 |
-| RACE(High) | 80.3 | 57.4 | 46.9* | 28.1 | 66.3 | 40.7 | 54.0 |
-
-- 評価結果は [OpenCompass 20230706](https://github.com/internLM/OpenCompass/) (*印のあるデータは原著論文からの引用を意味する)から取得したもので、評価設定は [OpenCompass](https://github.com/internLM/OpenCompass/) が提供する設定ファイルに記載されています。
-- 評価データは、[OpenCompass](https://github.com/internLM/OpenCompass/) のバージョンアップにより数値的な差異が生じる可能性がありますので、[OpenCompass](https://github.com/internLM/OpenCompass/) の最新の評価結果をご参照ください。
-
-### Model Zoo
-
-InternLM 7B と InternLM 7B チャットは、InternLM を使って訓練され、オープンソース化されています。モデルの重みは 2 つのフォーマットで提供されています。Transformers フォーマットを使ってモデルをロードするだけでなく、InternLM を使って直接重みをロードして、さらに事前トレーニングや人間の好みアライメントトレーニングを行うこともできます。
-
-| モデル | InternLM フォーマット Weight ダウンロードリンク | Transformers フォーマット Weight ダウンロードリンク |
-| ----------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------- |
-| **InternLM 7B** | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-7b) | [🤗internlm/intern-7b](https://huggingface.co/internlm/internlm-7b) |
-| **InternLM Chat 7B** | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b) | [🤗internlm/intern-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) |
-
-**制限事項:** 学習過程におけるモデルの安全性を確保し、倫理的・法的要件に準拠したテキストを生成するようモデルに促す努力を行ってきたが、モデルのサイズと確率的生成パラダイムのため、モデルは依然として予期せぬ出力を生成する可能性がある。例えば、生成された回答には偏見や差別、その他の有害な内容が含まれている可能性があります。そのような内容を伝播しないでください。有害な情報の伝播によって生じるいかなる結果に対しても、私たちは責任を負いません。
-
-### Transformers からのインポート
-
-Transformers を使用して InternLM 7B チャットモデルをロードするには、以下のコードを使用します:
-
-```python
->>> from transformers import AutoTokenizer, AutoModelForCausalLM
->>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True)
->>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True).cuda()
->>> model = model.eval()
->>> response, history = model.chat(tokenizer, "こんにちは", history=[])
->>> print(response)
-こんにちは!どのようにお手伝いできますか?
->>> response, history = model.chat(tokenizer, "時間管理について3つの提案をお願いします", history=history)
->>> print(response)
-もちろんです!以下に簡潔な形で時間管理に関する3つの提案を示します。
-
-1. To-Doリストを作成し、優先順位を付ける: タスクを明確にリストアップし、それぞれの優先度を判断しましょう。重要で緊急なタスクから順に取り組むことで、効率的に作業を進めることができます。
-2. 時間のブロック化を実践する: 作業を特定の時間枠に集中させるため、時間をブロック化しましょう。例えば、朝の2時間をメール対応に割り当て、午後の3時間をプロジェクトに集中するなど、タスクごとに時間を確保することが効果的です。
-3. ディストラクションを排除する: 集中力を保つために、ディストラクションを最小限に抑えましょう。通知をオフにし、SNSやメールに気を取られないようにすることで、作業効率を向上させることができます。
-
-これらの提案を実践することで、時間管理のスキルを向上させ、効果的に日々のタスクをこなしていくことができます。
-```
-
-### 対話
-
-以下のコードを実行することで、フロントエンドインターフェースを通して InternLM Chat 7B モデルと対話することができます:
-
-```bash
-pip install streamlit==1.24.0
-pip install transformers==4.30.2
-streamlit run web_demo.py
-```
-
-その効果は以下の通り
-
-![demo](https://github.com/InternLM/InternLM/assets/9102141/11b60ee0-47e4-42c0-8278-3051b2f17fe4)
-
-### デプロイ
-
-[LMDeploy](https://github.com/InternLM/LMDeploy) を使って、InternLM をワンクリックでデプロイする。
-
-1. まず、LMDeploy をインストールする:
-
-```shell
-python3 -m pip install lmdeploy
-```
-
-2. クイックデプロイには以下のコマンドを使用します:
-
-```shell
-lmdeploy chat turbomind InternLM/internlm-chat-7b --model-name internlm-chat-7b
-```
-
-3. モデルをエクスポートした後、以下のコマンドを使ってサーバーを起動し、デプロイされたモデルと会話することができます:
-
-```shell
-lmdeploy serve api_server InternLM/internlm-chat-7b --model-name internlm-chat-7b
-```
-
-[LMDeploy](https://github.com/InternLM/LMDeploy) は、InternLM をデプロイするための完全なワークフローを提供します。InternLM のデプロイの詳細については、[デプロイチュートリアル](https://github.com/InternLM/LMDeploy)を参照してください。
-
-## ファインチューニングとトレーニング
-
-### プリトレーニングとファインチューニングのチュートリアル
-
-InternLMのインストール、データ処理、プレトレーニング、ファインチューニングを始めるには、[使用法チュートリアル](./doc/ja/usage.md)を参照してください。
-
-### Transformers フォーマットへの変換
-
-InternLM によって学習されたモデルは、コミュニティの様々なオープンソースプロジェクトとシームレスにドッキングするのに便利な Hugging Face Transformers 形式に簡単に変換することができます。`tools/convert2hf.py` の助けを借りて、トレーニング中に保存された weights は 1 つのコマンドで transformers 形式に変換することができます
-
-```bash
-python convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer tokenizes/tokenizer.model
-```
-
-変換後、以下のコードで transformers として読み込むことができます
-
-```python
->>> from transformers import AutoTokenizer, AutoModel
->>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
-```
-
-## トレーニングシステム
-
-### システムアーキテクチャ
-
-詳細については、[システムアーキテクチャドキュメント](./doc/ja/structure.md) を参照してください。
-
-### トレーニングパフォーマンス
-
-InternLM は、Flash-Attention、Apex その他の高性能モデルオペレータを深く統合し、トレーニング効率を向上させます。Hybrid Zero 技術を構築することで、計算と通信の効率的なオーバーラップを実現し、トレーニング中のノード間の通信トラフィックを大幅に削減します。InternLM は 7B モデルを 8GPU から 1024GPU まで拡張することをサポートし、1000GPU スケールで最大 90% のアクセラレーション効率、180TFLOPS 以上のトレーニングスループット、GPU あたり平均 3600 トークン/秒以上を実現します。次の表は、異なる構成における InternLM のスケーラビリティテストデータです:
-
-| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
-| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
-| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
-
-TGSは、GPUあたり1秒間に処理されるトークンの平均数を表します。パフォーマンステストデータの詳細については、[トレーニングパフォーマンスドキュメント](./doc/ja/train_performance.md)を参照してください。
-
-## コントリビュート
-
-我々は、InternLM を改善し、向上させるために尽力してくれたすべての貢献者に感謝している。コミュニティ・ユーザーのプロジェクトへの参加が強く推奨されます。プロジェクトへの貢献方法については、貢献ガイドラインを参照してください。
-
-## 謝辞
-
-InternLM コードベースは、上海 AI 研究所と様々な大学や企業の研究者によって貢献されたオープンソースプロジェクトです。プロジェクトに新機能を追加してくれたすべての貢献者と、貴重なフィードバックを提供してくれたユーザーに感謝したい。私たちは、このツールキットとベンチマークが、InternLM をファインチューニングし、独自のモデルを開発するための柔軟で効率的なコードツールをコミュニティに提供し、オープンソースコミュニティに継続的に貢献できることを願っています。2 つのオープンソースプロジェクト、[flash-attention](https://github.com/HazyResearch/flash-attention) と [ColossalAI](https://github.com/hpcaitech/ColossalAI) に感謝します。
-
-## ライセンス
-
-コードは Apache-2.0 でライセンスされており、モデルの重さは学術研究のために完全にオープンで、**無料** の商用利用も許可されています。商用ライセンスの申請は、[申請フォーム(英語)](https://wj.qq.com/s2/12727483/5dba/)/[申请表(中文)](https://wj.qq.com/s2/12725412/f7c1/)にご記入ください。その他のご質問やコラボレーションについては、 までご連絡ください。
-
-## 引用
-
-```
-@misc{2023internlm,
- title={InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities},
- author={InternLM Team},
- howpublished = {\url{https://github.com/InternLM/InternLM}},
- year={2023}
-}
-```
diff --git a/README-zh-Hans.md b/README-zh-Hans.md
deleted file mode 100644
index 7ac8e75..0000000
--- a/README-zh-Hans.md
+++ /dev/null
@@ -1,292 +0,0 @@
-# InternLM
-
-
-
-
-
-
-
-[![license](./doc/imgs/license.svg)](https://github.com/open-mmlab/mmdetection/blob/main/LICENSE)
-[![evaluation](./doc/imgs/compass_support.svg)](https://github.com/internLM/OpenCompass/)
-[![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest)
-
-[📘使用文档](./doc/usage.md) |
-[🛠️安装教程](./doc/install.md) |
-[📊训练性能](./doc/train_performance.md) |
-[👀模型库](#model-zoo) |
-[🤗HuggingFace](https://huggingface.co/spaces/internlm/InternLM-Chat-7B) |
-[🆕Update News](./CHANGE_LOG.md) |
-[🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new)
-
-[English](./README.md) |
-[简体中文](./README-zh-Hans.md) |
-[日本語](./README-ja-JP.md)
-
-
-
-
- 👋 加入我们的 Discord 和 微信社区
-
-
-## 简介
-
-InternLM 是一个开源的轻量级训练框架,旨在支持大模型训练而无需大量的依赖。通过单一的代码库,它支持在拥有数千个 GPU 的大型集群上进行预训练,并在单个 GPU 上进行微调,同时实现了卓越的性能优化。在1024个 GPU 上训练时,InternLM 可以实现近90%的加速效率。
-
-基于InternLM训练框架,我们已经发布了两个开源的预训练模型:InternLM-7B 和 InternLM-20B。
-
-## 更新
-
-[20231213] 我们更新了 InternLM-7B-Chat 和 InternLM-20B-Chat 模型权重。通过改进微调数据和训练策略,新版对话模型生成的回复质量更高、语言风格更加多元。
-[20230920] InternLM-20B 已发布,包括基础版和对话版。
-
-
-## Model Zoo
-
-我们的模型在三个平台上发布:Transformers、ModelScope 和 OpenXLab。
-
-| Model | Transformers | ModelScope | OpenXLab |发布日期 |
-|---------------------------|------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|
-| **InternLM Chat 20B** | [🤗internlm/internlm-chat-20b](https://huggingface.co/internlm/internlm-20b-chat) | [ Shanghai_AI_Laboratory/internlm-chat-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-20b-chat/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-20b) | 2023-12-12 |
-| **InternLM 20B** | [🤗internlm/internlm-20b](https://huggingface.co/internlm/internlm-20b) | [ Shanghai_AI_Laboratory/internlm-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-20b) | 2023-09-20 |
-| **InternLM Chat 7B** | [🤗internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) | [ Shanghai_AI_Laboratory/internlm-chat-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b) | 2023-12-12 |
-| **InternLM 7B** | [🤗internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) | [ Shanghai_AI_Laboratory/internlm-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-7b) | 2023-07-06 |
-
-
-
- InternLM-20B
-
-#### 简介
-InternLM-20B 在超过 **2.3T** Tokens 包含高质量英文、中文和代码的数据上进行预训练,其中 Chat 版本还经过了 SFT 和 RLHF 训练,使其能够更好、更安全地满足用户的需求。
-
-InternLM 20B 在模型结构上选择了深结构,InternLM-20B 的层数设定为60层,超过常规7B和13B模型所使用的32层或者40层。在参数受限的情况下,提高层数有利于提高模型的综合能力。此外,相较于InternLM-7B,InternLM-20B使用的预训练数据经过了更高质量的清洗,并补充了高知识密度和用于强化理解和推理能力的训练数据。因此,它在理解能力、推理能力、数学能力、编程能力等考验语言模型技术水平的方面都得到了显著提升。总体而言,InternLM-20B具有以下的特点:
-- 优异的综合性能
-- 很强的工具调用功能
-- 支持16k语境长度(通过推理时外推)
-- 更好的价值对齐
-
-#### 性能对比
-
-在OpenCompass提出的5个能力维度上,InternLM-20B都取得很好的效果(粗体为13B-33B这个量级范围内,各项最佳成绩)
-
-| 能力维度 | Llama-13B | Llama2-13B | Baichuan2-13B | InternLM-20B | Llama-33B | Llama-65B | Llama2-70B |
-|----------|-----------|------------|---------------|--------------|-----------|-----------|------------|
-| 语言 | 42.5 | 47 | 47.5 | **55** | 44.6 | 47.1 | 51.6 |
-| 知识 | 58.2 | 58.3 | 48.9 | 60.1 | **64** | 66 | 67.7 |
-| 理解 | 45.5 | 50.9 | 58.1 | **67.3** | 50.6 | 54.2 | 60.8 |
-| 推理 | 42.7 | 43.6 | 44.2 | **54.9** | 46.4 | 49.8 | 55 |
-| 学科 | 37.3 | 45.2 | 51.8 | **62.5** | 47.4 | 49.7 | 57.3 |
-| 总平均 | 43.8 | 47.3 | 49.4 | **59.2** | 48.9 | 51.9 | 57.4 |
-
-下表在一些有重要影响力的典型数据集上比较了主流开源模型的表现
-
-| | 评测集 | Llama-13B | Llama2-13B | Baichuan2-13B | InternLM-20B | Llama-33B | Llama-65B | Llama2-70B |
-|------|------------------|-----------|------------|---------------|--------------|-----------|-----------|------------|
-| 学科 | MMLU | 47.73 | 54.99 | 59.55 | **62.05** | 58.73 | 63.71 | 69.75 |
-| | C-Eval (val) | 31.83 | 41.4 | **59.01** | 58.8 | 37.47 | 40.36 | 50.13 |
-| | AGI-Eval | 22.03 | 30.93 | 37.37 | **44.58** | 33.53 | 33.92 | 40.02 |
-| 知识 | BoolQ | 78.75 | 82.42 | 67 | **87.46** | 84.43 | 86.61 | 87.74 |
-| | TriviaQA | 52.47 | 59.36 | 46.61 | 57.26 | **66.24** | 69.79 | 70.71 |
-| | NaturalQuestions | 20.17 | 24.85 | 16.32 | 25.15 | **30.89** | 33.41 | 34.16 |
-| 理解 | CMRC | 9.26 | 31.59 | 29.85 | **68.78** | 14.17 | 34.73 | 43.74 |
-| | CSL | 55 | 58.75 | 63.12 | **65.62** | 57.5 | 59.38 | 60 |
-| | RACE (middle) | 53.41 | 63.02 | 68.94 | **86.35** | 64.55 | 72.35 | 81.55 |
-| | RACE (high) | 47.63 | 58.86 | 67.18 | **83.28** | 62.61 | 68.01 | 79.93 |
-| | XSum | 20.37 | 23.37 | 25.23 | **35.54** | 20.55 | 19.91 | 25.38 |
-| 推理 | WinoGrande | 64.64 | 64.01 | 67.32 | **69.38** | 66.85 | 69.38 | 69.77 |
-| | BBH | 37.93 | 45.62 | 48.98 | **52.51** | 49.98 | 58.38 | 64.91 |
-| | GSM8K | 20.32 | 29.57 | **52.62** | **52.62** | 42.3 | 54.44 | 63.31 |
-| | PIQA | 79.71 | 79.76 | 78.07 | 80.25 | **81.34** | 82.15 | 82.54 |
-| 编程 | HumanEval | 14.02 | 18.9 | 17.07 | **25.61** | 17.68 | 18.9 | 26.22 |
-| | MBPP | 20.6 | 26.8 | 30.8 | **35.6** | 28.4 | 33.6 | 39.6 |
-
-总体而言,InternLM-20B 在综合能力上全面领先于13B量级的开源模型,同时在推理评测集上接近甚至超越Llama-65B的性能。
-
-- 评估结果来自 [OpenCompass 20230920](https://github.com/internLM/OpenCompass/)。
-- 由于 [OpenCompass](https://github.com/internLM/OpenCompass/) 的版本迭代,评估数据可能存在数值上的差异,所以请参考 [OpenCompass](https://github.com/internLM/OpenCompass/) 的最新评估结果。
-
-
-
-
-
- InternLM-7B
-
-#### 模型更新
-
-#### 简介
-InternLM-7B 包含了一个拥有70亿参数的基础模型和一个为实际场景量身定制的对话模型。该模型具有以下特点:
-
-- 它利用数万亿的高质量令牌进行训练,建立了一个强大的知识库。
-- 它支持8k的上下文窗口长度,使得输入序列更长并增强了推理能力。
-- 它为用户提供了一个多功能的工具集,使用户能够灵活地构建自己的工作流程。
-
-#### 性能对比
-
-我们使用开源评测工具 [OpenCompass](https://github.com/internLM/OpenCompass/) 从学科综合能力、语言能力、知识能力、推理能力、理解能力五大能力维度对InternLM开展全面评测,部分评测结果如下表所示,欢迎访问[OpenCompass 榜单](https://opencompass.org.cn/rank)获取更多的评测结果。
-
-| 数据集\模型 | **InternLM-Chat-7B** | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
-| --------------- | -------------------------- | --------------------- | -------- | ----------- | ----------- | --------- | --------- |
-| C-Eval(Val) | 52.0 | 53.4 | 24.2 | 42.7 | 50.9 | 28.9 | 31.2 |
-| MMLU | 52.6 | 51.0 | 35.2* | 41.5 | 46.0 | 39.7 | 47.3 |
-| AGIEval | 46.4 | 37.6 | 20.8 | 24.6 | 39.0 | 24.1 | 26.4 |
-| CommonSenseQA | 80.8 | 59.5 | 65.0 | 58.8 | 60.0 | 68.7 | 66.7 |
-| BUSTM | 80.6 | 50.6 | 48.5 | 51.3 | 55.0 | 48.8 | 62.5 |
-| CLUEWSC | 81.8 | 59.1 | 50.3 | 52.8 | 59.8 | 50.3 | 52.2 |
-| MATH | 5.0 | 7.1 | 2.8 | 3.0 | 6.6 | 2.2 | 2.8 |
-| GSM8K | 36.2 | 31.2 | 10.1 | 9.7 | 29.2 | 6.0 | 15.3 |
-| HumanEval | 15.9 | 10.4 | 14.0 | 9.2 | 9.2 | 9.2 | 11.0 |
-| RACE(High) | 80.3 | 57.4 | 46.9* | 28.1 | 66.3 | 40.7 | 54.0 |
-
-- 以上评测结果基于 [OpenCompass 20230706](https://github.com/internLM/OpenCompass/) 获得(部分数据标注`*`代表数据来自原始论文),具体测试细节可参见 [OpenCompass](https://github.com/internLM/OpenCompass/) 中提供的配置文件。
-- 评测数据会因 [OpenCompass](https://github.com/internLM/OpenCompass/) 的版本迭代而存在数值差异,请以 [OpenCompass](https://github.com/internLM/OpenCompass/) 最新版的评测结果为主。
-
-
-
-**局限性:** 尽管在训练过程中我们非常注重模型的安全性,尽力促使模型输出符合伦理和法律要求的文本,但受限于模型大小以及概率生成范式,模型可能会产生各种不符合预期的输出,例如回复内容包含偏见、歧视等有害内容,请勿传播这些内容。由于传播不良信息导致的任何后果,本项目不承担责任。
-
-
-
-## 使用案例
-
-### 通过 Transformers 加载
-
-通过以下的代码从 Transformers 加载 InternLM 模型 (可修改模型名称替换不同的模型)
-
-```python
->>> from transformers import AutoTokenizer, AutoModelForCausalLM
->>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True)
->>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True).cuda()
->>> model = model.eval()
->>> response, history = model.chat(tokenizer, "你好", history=[])
->>> print(response)
-你好!有什么我可以帮助你的吗?
->>> response, history = model.chat(tokenizer, "请提供三个管理时间的建议。", history=history)
->>> print(response)
-当然可以!以下是三个管理时间的建议:
-1. 制定计划:制定一个详细的计划,包括每天要完成的任务和活动。这将有助于您更好地组织时间,并确保您能够按时完成任务。
-2. 优先级:将任务按照优先级排序,先完成最重要的任务。这将确保您能够在最短的时间内完成最重要的任务,从而节省时间。
-3. 集中注意力:避免分心,集中注意力完成任务。关闭社交媒体和电子邮件通知,专注于任务,这将帮助您更快地完成任务,并减少错误的可能性。
-```
-
-### 通过 ModelScope 加载
-
-通过以下的代码从 ModelScope 加载 InternLM 模型 (可修改模型名称替换不同的模型)
-
-```python
-from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
-import torch
-model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm-chat-7b', revision='v1.0.0')
-tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", trust_remote_code=True,torch_dtype=torch.float16)
-model = AutoModelForCausalLM.from_pretrained(model_dir,device_map="auto", trust_remote_code=True,torch_dtype=torch.float16)
-model = model.eval()
-response, history = model.chat(tokenizer, "hello", history=[])
-print(response)
-response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
-print(response)
-```
-
-
-### 通过前端网页对话
-
-可以通过以下代码启动一个前端的界面来与 InternLM Chat 7B 模型进行交互
-
-```bash
-pip install streamlit==1.24.0
-pip install transformers==4.30.2
-streamlit run web_demo.py
-```
-
-效果如下
-
-![效果](https://github.com/InternLM/InternLM/assets/9102141/11b60ee0-47e4-42c0-8278-3051b2f17fe4)
-
-### 基于InternLM高性能部署
-
-我们使用 [LMDeploy](https://github.com/InternLM/LMDeploy) 完成 InternLM 的一键部署。
-
-1. 首先安装 LMDeploy:
-
- ```shell
- python3 -m pip install lmdeploy
- ```
-2. 直接在本地,通过命令行,交互式和 InternLM 对话:
-
- ```shell
- lmdeploy chat turbomind InternLM/internlm-chat-7b --model-name internlm-chat-7b
- ```
-
-1. 也可以使用如下命令启动推理服务:
-
- ```shell
- lmdeploy serve api_server InternLM/internlm-chat-7b --model-name internlm-chat-7b
- ```
-请参考[此指南](https://github.com/InternLM/lmdeploy/blob/main/docs/en/restful_api.md)获取详细的api_server RESTful API信息,更多部署教程则可在[这里](https://github.com/InternLM/LMDeploy)找到。
-
-
-## 微调&训练
-
-### 预训练与微调使用教程
-
-请参考[使用教程](./doc/usage.md)开始InternLM的安装、数据处理、预训练与微调。
-
-### 转换为 Transformers 格式使用
-
-通过 InternLM 进行训练的模型可以很轻松地转换为 HuggingFace Transformers 格式,方便与社区各种开源项目无缝对接。借助 `tools/transformers/convert2hf.py` 可以将训练保存的权重一键转换为 transformers 格式
-
-```bash
-python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ./tools/V7_sft.model
-```
-
-转换之后可以通过以下的代码加载为 transformers
-
-```python
->>> from transformers import AutoTokenizer, AutoModel
->>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
-```
-
-## 训练系统
-
-### 系统结构
-
-请参考[系统结构文档](./doc/structure.md)进一步了解。
-
-### 训练性能
-
-InternLM 深度整合了 Flash-Attention, Apex 等高性能模型算子,提高了训练效率。通过构建 Hybrid Zero 技术,实现计算和通信的高效重叠,大幅降低了训练过程中的跨节点通信流量。InternLM 支持 7B 模型从 8 卡扩展到 1024 卡,千卡规模下加速效率可高达 90%,训练吞吐超过 180TFLOPS,平均单卡每秒处理的 token 数量超过3600。下表为 InternLM 在不同配置下的扩展性测试数据:
-
-| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
-| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
-| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
-
-TGS 代表平均每GPU每秒可以处理的 Token 数量。更多的性能测试数据可参考[训练性能文档](./doc/train_performance.md)进一步了解。
-
-## 贡献
-
-我们感谢所有的贡献者为改进和提升 InternLM 所作出的努力。非常欢迎社区用户能参与进项目中来。请参考贡献指南来了解参与项目贡献的相关指引。
-
-## 致谢
-
-InternLM 代码库是一款由上海人工智能实验室和来自不同高校、企业的研发人员共同参与贡献的开源项目。我们感谢所有为项目提供新功能支持的贡献者,以及提供宝贵反馈的用户。 我们希望这个工具箱和基准测试可以为社区提供灵活高效的代码工具,供用户微调 InternLM 并开发自己的新模型,从而不断为开源社区提供贡献。特别鸣谢[flash-attention](https://github.com/HazyResearch/flash-attention) 与 [ColossalAI](https://github.com/hpcaitech/ColossalAI) 两项开源项目。
-
-## 开源许可证
-
-本仓库的代码依照 Apache-2.0 协议开源。模型权重对学术研究完全开放,也可申请免费的商业使用授权([申请表](https://wj.qq.com/s2/12725412/f7c1/))。其他问题与合作请联系 。
-
-## 引用
-
-```
-@misc{2023internlm,
- title={InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities},
- author={InternLM Team},
- howpublished = {\url{https://github.com/InternLM/InternLM}},
- year={2023}
-}
-```
diff --git a/README.md b/README.md
index bd58a76..22a610c 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@
-
+
InternLM
@@ -14,21 +14,19 @@
-[![license](./doc/imgs/license.svg)](./LICENSE)
-[![evaluation](./doc/imgs/compass_support.svg)](https://github.com/internLM/OpenCompass/)
-[![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest)
-
-[📘Usage](./doc/en/usage.md) |
-[🛠️Installation](./doc/en/install.md) |
-[📊Train Performance](./doc/en/train_performance.md) |
-[👀Model](#model-zoo) |
-[🤗HuggingFace](https://huggingface.co/spaces/internlm/InternLM-Chat-7B) |
-[🆕Update News](./CHANGE_LOG.md) |
+[![license](./assets/license.svg)](./LICENSE)
+[![evaluation](./assets/compass_support.svg)](https://github.com/internLM/OpenCompass/)
+
+[📘Chat](./chat) |
+[🛠️Agent](./agent) |
+[📊Evaluation](./evaluation) |
+[👀Model](./model_cards) |
+[🤗HuggingFace](https://huggingface.co/spaces/internlm/internlm2-Chat-7B) |
+[🆕Update News](#news) |
[🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new)
[English](./README.md) |
-[简体中文](./README-zh-Hans.md) |
-[日本語](./README-ja-JP.md)
+[简体中文](./README_zh-CN.md) |
@@ -37,142 +35,64 @@
## Introduction
-InternLM is an open-sourced lightweight training framework aims to support model pre-training without the need for extensive dependencies. With a single codebase, it supports pre-training on large-scale clusters with thousands of GPUs, and fine-tuning on a single GPU while achieving remarkable performance optimizations. InternLM achieves nearly 90% acceleration efficiency during training on 1024 GPUs.
-Based on the InternLM training framework, we have released two open-sourced pretrained model InternLM-7B and InternLM-20B.
+InternLM2 series are released with the following features:
+- **200K context window**: Nearly perfect needle-in-a-haystack retrieval over 200K-token contexts, with leading performance on long-context tasks such as LongBench and L-Eval. Try it with [LMDeploy](./inference/) for 200K-context inference.
+
+- **Outstanding comprehensive performance**: Significantly better than the previous generation in all dimensions, especially in reasoning, math, code, chat experience, instruction following, and creative writing, with leading performance among open-source models of similar size. In some evaluations, InternLM2-Chat-20B may match or even surpass ChatGPT (GPT-3.5).
+
+- **Code interpreter & data analysis**: With a code interpreter, InternLM2-Chat-20B achieves performance comparable to GPT-4 on GSM8K and MATH. InternLM2-Chat also provides data analysis capabilities.
+
+- **Stronger tool use**: Building on stronger instruction following, tool selection, and reflection capabilities, InternLM2 supports more kinds of agents and multi-step tool calling for complex tasks. See [examples](./agent/).
## News
-[20231213] InternLM-7B-Chat and InternLM-20B-Chat checkpoints are updated. With an improved finetuning strategy, the new chat models can generate higher quality responses with greater stylistic diversity.
-[20230920] InternLM-20B is released with base and chat versions.
+[2024.01.17] We release InternLM2-7B and InternLM2-20B and their corresponding chat models with stronger capabilities in all dimensions. See [model zoo below](#model-zoo) for download or [model cards](./model_cards/) for more details.
+[2023.12.13] InternLM-7B-Chat and InternLM-20B-Chat checkpoints are updated. With an improved finetuning strategy, the new chat models can generate higher quality responses with greater stylistic diversity.
+
+[2023.09.20] InternLM-20B is released with base and chat versions.
## Model Zoo
-Our models are released in three platforms: Transformers, ModelScope and OpenXLab.
-- There are two kinds of model weights:
- 1. huggingface type(marked as HF)
- 2. original model weight(marked as Original), providing in OpenXLab, which can be loaded by InternLM and finetuned directly.
+| Model | Transformers(HF) | ModelScope(HF) | OpenXLab(HF) | Release Date |
+|---------------------------|------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|
+| **InternLM2 Chat 20B** | [🤗internlm/internlm2-chat-20b](https://huggingface.co/internlm/internlm2-chat-20b) | [ internlm2-chat-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-chat-20b) | 2024-01-17 |
+| **InternLM2 20B** | [🤗internlm/internlm2-20b](https://huggingface.co/internlm/internlm2-20b) | [ internlm2-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-20b) | 2024-01-17 |
+| **InternLM2 Chat 20B SFT** | [🤗internlm/internlm2-chat-20b-sft](https://huggingface.co/internlm/internlm2-chat-20b-sft) | [ internlm2-chat-20b-sft](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-20b-sft/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-chat-20b-sft) | 2024-01-17 |
+| **InternLM2 Base 20B** | [🤗internlm/internlm2-base-20b](https://huggingface.co/internlm/internlm2-base-20b) | [ internlm2-base-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-base-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-base-20b) | 2024-01-17 |
+| **InternLM2 Chat 7B** | [🤗internlm/internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) | [ internlm2-chat-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-chat-7b) | 2024-01-17 |
+| **InternLM2 7B** | [🤗internlm/internlm2-7b](https://huggingface.co/internlm/internlm2-7b) | [ internlm2-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-7b) | 2024-01-17 |
+| **InternLM2 Chat 7B SFT** | [🤗internlm/internlm2-chat-7b-sft](https://huggingface.co/internlm/internlm2-chat-7b-sft) | [ internlm2-chat-7b-sft](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-7b-sft/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-chat-7b-sft) | 2024-01-17 |
+| **InternLM2 Base 7B** | [🤗internlm/internlm2-base-7b](https://huggingface.co/internlm/internlm2-base-7b) | [ internlm2-base-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-base-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-base-7b) | 2024-01-17 |
-| Model | Transformers(HF) | ModelScope(HF) | OpenXLab(HF) | OpenXLab(Original) | Release Date |
-|---------------------------|------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|
-| **InternLM Chat 20B** | [🤗internlm/internlm-chat-20b](https://huggingface.co/internlm/internlm-20b-chat) | [ Shanghai_AI_Laboratory/internlm-chat-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-20b-chat/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-20b) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-20b-original) | 2023-12-12 |
-| **InternLM 20B** | [🤗internlm/internlm-20b](https://huggingface.co/internlm/internlm-20b) | [ Shanghai_AI_Laboratory/internlm-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-20b) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-20b-original) | 2023-09-20 |
-| **InternLM Chat 7B** | [🤗internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) | [ Shanghai_AI_Laboratory/internlm-chat-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b-original) | 2023-12-12 |
-| **InternLM 7B** | [🤗internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) | [ Shanghai_AI_Laboratory/internlm-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-7b) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-7b-original) | 2023-07-06 |
+**Note:**
-#### Introduction
-InternLM-20B was pre-trained on over **2.3T** Tokens containing high-quality English, Chinese, and code data. Additionally, the Chat version has undergone SFT and RLHF training, enabling it to better and more securely meet users' needs.
-
-In terms of model structure, InternLM-20B opted for a deeper architecture, with a depth set at 60 layers. This surpasses the conventional 7B and 13B models that utilize 32 or 40 layers. When parameters are limited, increasing the number of layers can enhance the model's overall capability. Furthermore, compared to InternLM-7B, the pre-training data used for InternLM-20B underwent higher quality cleansing and was supplemented with data rich in knowledge and designed for reinforcing understanding and reasoning capabilities. As a result, it exhibits significant improvements in understanding, reasoning, mathematical, and programming abilities—all of which test the technical proficiency of language models. Overall, InternLM-20B features the following characteristics:
-- Outstanding overall performance
-- Strong utility invocation capability
-- Supports a 16k context length (Through inference extrapolation)
-- Better value alignment.
-
-#### Performance Evaluation
-
-On the 5 capability dimensions proposed by OpenCompass, InternLM-20B has achieved excellent results (the bolded scores represent the best performances within the 13B-33B parameter range).
-
-| Capability | Llama-13B | Llama2-13B | Baichuan2-13B | InternLM-20B | Llama-33B | Llama-65B | Llama2-70B |
-|----------|-----------|------------|---------------|--------------|-----------|-----------|------------|
-| Language | 42.5 | 47 | 47.5 | **55** | 44.6 | 47.1 | 51.6 |
-| Knowledge | 58.2 | 58.3 | 48.9 | 60.1 | **64** | 66 | 67.7 |
-| Understanding | 45.5 | 50.9 | 58.1 | **67.3** | 50.6 | 54.2 | 60.8 |
-| Reasoning | 42.7 | 43.6 | 44.2 | **54.9** | 46.4 | 49.8 | 55 |
-| Examination | 37.3 | 45.2 | 51.8 | **62.5** | 47.4 | 49.7 | 57.3 |
-| Overall | 43.8 | 47.3 | 49.4 | **59.2** | 48.9 | 51.9 | 57.4 |
-
-The table below compares the performance of mainstream open-source models on some influential and typical datasets.
-
-| | Benchmarks | Llama-13B | Llama2-13B | Baichuan2-13B | InternLM-20B | Llama-33B | Llama-65B | Llama2-70B |
-|------|------------------|-----------|------------|---------------|--------------|-----------|-----------|------------|
-| Examination | MMLU | 47.73 | 54.99 | 59.55 | **62.05** | 58.73 | 63.71 | 69.75 |
-| | C-Eval (val) | 31.83 | 41.4 | **59.01** | 58.8 | 37.47 | 40.36 | 50.13 |
-| | AGI-Eval | 22.03 | 30.93 | 37.37 | **44.58** | 33.53 | 33.92 | 40.02 |
-| Knowledge | BoolQ | 78.75 | 82.42 | 67 | **87.46** | 84.43 | 86.61 | 87.74 |
-| | TriviaQA | 52.47 | 59.36 | 46.61 | 57.26 | **66.24** | 69.79 | 70.71 |
-| | NaturalQuestions | 20.17 | 24.85 | 16.32 | 25.15 | **30.89** | 33.41 | 34.16 |
-| Understanding | CMRC | 9.26 | 31.59 | 29.85 | **68.78** | 14.17 | 34.73 | 43.74 |
-| | CSL | 55 | 58.75 | 63.12 | **65.62** | 57.5 | 59.38 | 60 |
-| | RACE (middle) | 53.41 | 63.02 | 68.94 | **86.35** | 64.55 | 72.35 | 81.55 |
-| | RACE (high) | 47.63 | 58.86 | 67.18 | **83.28** | 62.61 | 68.01 | 79.93 |
-| | XSum | 20.37 | 23.37 | 25.23 | **35.54** | 20.55 | 19.91 | 25.38 |
-| Reasoning | WinoGrande | 64.64 | 64.01 | 67.32 | **69.38** | 66.85 | 69.38 | 69.77 |
-| | BBH | 37.93 | 45.62 | 48.98 | **52.51** | 49.98 | 58.38 | 64.91 |
-| | GSM8K | 20.32 | 29.57 | **52.62** | **52.62** | 42.3 | 54.44 | 63.31 |
-| | PIQA | 79.71 | 79.76 | 78.07 | 80.25 | **81.34** | 82.15 | 82.54 |
-| Programming | HumanEval | 14.02 | 18.9 | 17.07 | **25.61** | 17.68 | 18.9 | 26.22 |
-| | MBPP | 20.6 | 26.8 | 30.8 | **35.6** | 28.4 | 33.6 | 39.6 |
-
-Overall, InternLM-20B comprehensively outperforms open-source models in the 13B parameter range in terms of overall capabilities, and on inference evaluation sets, it approaches or even surpasses the performance of Llama-65B.
-
-- The evaluation results were obtained from [OpenCompass 20230920](https://github.com/internLM/OpenCompass/).
-- The evaluation data may have numerical differences due to the version iteration of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results of [OpenCompass](https://github.com/internLM/OpenCompass/).
-
-
-
-
-
- InternLM-7B
-
-#### News
-
-#### Introduction
-InternLM-7B contains a 7 billion parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:
-
-- It leverages trillions of high-quality tokens for training to establish a powerful knowledge base.
-- It supports an 8k context window length, enabling longer input sequences and stronger reasoning capabilities.
-- It provides a versatile toolset for users to flexibly build their own workflows.
-
-#### Performance Evaluation
-
-We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/). The evaluation covered five dimensions of capabilities: disciplinary competence, language competence, knowledge competence, inference competence, and comprehension competence. Here are some of the evaluation results, and you can visit the [OpenCompass leaderboard](https://opencompass.org.cn/rank) for more evaluation results.
-
-| Datasets\Models | **InternLM-Chat-7B** | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
-| --------------- | -------------------------- | --------------------- | -------- | ----------- | ----------- | --------- | --------- |
-| C-Eval(Val) | 52.0 | 53.4 | 24.2 | 42.7 | 50.9 | 28.9 | 31.2 |
-| MMLU | 52.6 | 51.0 | 35.2* | 41.5 | 46.0 | 39.7 | 47.3 |
-| AGIEval | 46.4 | 37.6 | 20.8 | 24.6 | 39.0 | 24.1 | 26.4 |
-| CommonSenseQA | 80.8 | 59.5 | 65.0 | 58.8 | 60.0 | 68.7 | 66.7 |
-| BUSTM | 80.6 | 50.6 | 48.5 | 51.3 | 55.0 | 48.8 | 62.5 |
-| CLUEWSC | 81.8 | 59.1 | 50.3 | 52.8 | 59.8 | 50.3 | 52.2 |
-| MATH | 5.0 | 7.1 | 2.8 | 3.0 | 6.6 | 2.2 | 2.8 |
-| GSM8K | 36.2 | 31.2 | 10.1 | 9.7 | 29.2 | 6.0 | 15.3 |
-| HumanEval | 15.9 | 10.4 | 14.0 | 9.2 | 9.2 | 9.2 | 11.0 |
-| RACE(High) | 80.3 | 57.4 | 46.9* | 28.1 | 66.3 | 40.7 | 54.0 |
-
-- The evaluation results were obtained from [OpenCompass 20230706](https://github.com/internLM/OpenCompass/) (some data marked with *, which means come from the original papers), and evaluation configuration can be found in the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/).
-- The evaluation data may have numerical differences due to the version iteration of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results of [OpenCompass](https://github.com/internLM/OpenCompass/).
-
-
+1. For chat models, InternLM2 Chat 7/20B have gone through online RLHF for better alignment and are recommended for downstream applications. We also release InternLM2 Chat 7/20B SFT, which have only gone through SFT and were used during RLHF to obtain InternLM2 Chat 7/20B. InternLM2 Chat 7/20B are trained from InternLM2 Base 7/20B.
+2. For base models, InternLM2 7/20B are further trained from InternLM2 Base 7/20B and are recommended for fast adaptation to downstream applications.
**Limitations:** Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
-## Usage Examples
+## Usage
+
+We briefly show how to use InternLM2 with [Transformers](#import-from-transformers), [ModelScope](#import-from-modelscope), and [Web demos](#dialogue).
+The chat models adopt the [chatml format](./chat/chat_format.md) to support both chat and agent applications.
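+
+To give a rough idea of what that format looks like, here is a minimal sketch of a chatml-style prompt assembled by hand. The exact special tokens and roles are defined in [chat_format.md](./chat/chat_format.md), so treat the token names below as an illustration rather than the authoritative spec; in practice, `model.chat(...)` builds the prompt for you.
+
+```python
+# Illustrative only: a hand-rolled chatml-style prompt. See chat_format.md
+# for the exact tokens and roles used by InternLM2.
+def build_chatml_prompt(system: str, user: str) -> str:
+    return (
+        f"<|im_start|>system\n{system}<|im_end|>\n"
+        f"<|im_start|>user\n{user}<|im_end|>\n"
+        f"<|im_start|>assistant\n"  # the model continues from here
+    )
+
+print(build_chatml_prompt("You are a helpful assistant.", "hello"))
+```
+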
### Import from Transformers
-To load the InternLM 7B Chat model using Transformers, use the following code:
+To load the InternLM2 7B Chat model using Transformers, use the following code:
```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
->>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True)
->>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True).cuda()
+>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)
+>>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True).cuda()
>>> model = model.eval()
>>> response, history = model.chat(tokenizer, "hello", history=[])
>>> print(response)
Hello! How can I help you today?
>>> response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
>>> print(response)
-Sure, here are three tips for effective time management:
-
-1. Prioritize tasks based on importance and urgency: Make a list of all your tasks and categorize them into "important and urgent," "important but not urgent," and "not important but urgent." Focus on completing the tasks in the first category before moving on to the others.
-2. Use a calendar or planner: Write down deadlines and appointments in a calendar or planner so you don't forget them. This will also help you schedule your time more effectively and avoid overbooking yourself.
-3. Minimize distractions: Try to eliminate any potential distractions when working on important tasks. Turn off notifications on your phone, close unnecessary tabs on your computer, and find a quiet place to work if possible.
-
-Remember, good time management skills take practice and patience. Start with small steps and gradually incorporate these habits into your daily routine.
```
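+
+If the full-precision weights do not fit on your GPU, you can also load the model in half precision. This is standard Transformers usage rather than anything InternLM2-specific, and the dtype choice below is only a suggestion (bfloat16 also works on recent GPUs):
+
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+# Loading in float16 roughly halves GPU memory compared to float32;
+# trust_remote_code is required to pull in the InternLM2 modeling code.
+tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    "internlm/internlm2-chat-7b",
+    torch_dtype=torch.float16,
+    trust_remote_code=True,
+).cuda()
+model = model.eval()
+response, _ = model.chat(tokenizer, "hello", history=[])
+print(response)
+```
+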
### Import from ModelScope
@@ -182,7 +102,7 @@ To load the InternLM model using ModelScope, use the following code:
```python
from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
import torch
-model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm-chat-7b', revision='v1.0.0')
+model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2-chat-7b')
tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", trust_remote_code=True,torch_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_dir,device_map="auto", trust_remote_code=True,torch_dtype=torch.float16)
model = model.eval()
@@ -199,82 +119,40 @@ You can interact with the InternLM Chat 7B model through a frontend interface by
```bash
pip install streamlit==1.24.0
pip install transformers==4.30.2
-streamlit run web_demo.py
+streamlit run ./chat/web_demo.py
```
-The effect is as follows
+The effect is shown below:
![demo](https://github.com/InternLM/InternLM/assets/9102141/11b60ee0-47e4-42c0-8278-3051b2f17fe4)
### Deployment
-We use [LMDeploy](https://github.com/InternLM/LMDeploy) to complete the one-click deployment of InternLM.
-
-1. First, install LMDeploy:
+We use [LMDeploy](https://github.com/InternLM/LMDeploy) for fast deployment of InternLM.
```shell
+# install LMDeploy
python3 -m pip install lmdeploy
+# chat with internlm2
+lmdeploy chat turbomind InternLM/internlm2-chat-7b --model-name internlm2-chat-7b
```
-2. Use the following command for iteractive communication with `internlm-chat-7b` model on localhost:
+Please refer to the [guidance](./chat/lmdeploy.md) for more details on model deployment. Additional deployment tutorials can be found in the [LMDeploy repository](https://github.com/InternLM/LMDeploy).
-```shell
-lmdeploy chat turbomind InternLM/internlm-chat-7b --model-name internlm-chat-7b
-```
+## Agent
-3. Besides chatting via command line, you can start lmdeploy `api_server` as below:
+InternLM2-Chat models have excellent tool utilization capabilities and can work with function calls in a zero-shot manner. See more examples in the [agent section](./agent/).
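+
+As a purely illustrative sketch of the idea, the snippet below shows how a zero-shot function call emitted by the chat model could be parsed and dispatched in Python. The tool name, JSON layout, and parsing here are hypothetical; the actual tool-calling protocol and prompt format are described in the [agent section](./agent/).
+
+```python
+import json
+
+# Hypothetical example: this registry and JSON layout are NOT the official
+# InternLM2 tool-calling protocol (see ./agent/ for the real format).
+TOOLS = {
+    "get_weather": lambda city: f"Sunny in {city}",
+}
+
+def dispatch(model_output: str) -> str:
+    # Expects something like {"name": "get_weather", "arguments": {"city": "Shanghai"}}
+    call = json.loads(model_output)
+    return TOOLS[call["name"]](**call["arguments"])
+
+print(dispatch('{"name": "get_weather", "arguments": {"city": "Shanghai"}}'))
+```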
-```shell
-lmdeploy serve api_server InternLM/internlm-chat-7b --model-name internlm-chat-7b
-```
-For a comprehensive understanding of the `api_server` RESTful API, kindly consult [this](https://github.com/InternLM/lmdeploy/blob/main/docs/en/restful_api.md) guide. For additional deployment tutorials, feel free to explore [here](https://github.com/InternLM/LMDeploy).
+## Fine-tuning
-## Fine-tuning & Training
+Please refer to [finetune docs](./finetune/) for fine-tuning with InternLM.
-### Pre-training and Fine-tuning Tutorial
-
-Please refer to [Usage Tutorial](./doc/en/usage.md) to start InternLM installation, data processing, pre-training and fine-tuning.
-
-### Convert to Transformers Format
-
-The model trained by InternLM can be easily converted to HuggingFace Transformers format, which is convenient for seamless docking with various open source projects in the community. With the help of `tools/transformers/convert2hf.py`, the weights saved during training can be converted into transformers format with one command
-
-```bash
-python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ./tools/V7_sft.model
-```
-
-After conversion, it can be loaded as transformers by the following code
-
-```python
->>> from transformers import AutoTokenizer, AutoModel
->>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
-```
-
-## Training System
-
-### System Architecture
-
-Please refer to the [System Architecture document](./doc/en/structure.md) for further details.
-
-### Training Performance
-
-InternLM deeply integrates Flash-Attention, Apex and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternLM supports expanding the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternLM's scalability test data at different configurations:
-
-| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
-| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
-| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
-
-TGS represents the average number of tokens processed per GPU per second. For more performance test data, please refer to the [Training Performance document](./doc/en/train_performance.md) for further details.
+**Note:** We have migrated the entire training functionality of this project to [InternEvo](https://github.com/InternLM/InternEvo) for a better user experience. InternEvo provides efficient pre-training and fine-tuning infrastructure for training InternLM.
## Contribution
We appreciate all the contributors for their efforts to improve and enhance InternLM. Community users are highly encouraged to participate in the project. Please refer to the contribution guidelines for instructions on how to contribute to the project.
-## Acknowledgements
-
-InternLM codebase is an open-source project contributed by Shanghai AI Laboratory and researchers from different universities and companies. We would like to thank all the contributors for their support in adding new features to the project and the users for providing valuable feedback. We hope that this toolkit and benchmark can provide the community with flexible and efficient code tools for fine-tuning InternLM and developing their own models, thus continuously contributing to the open-source community. Special thanks to the two open-source projects, [flash-attention](https://github.com/HazyResearch/flash-attention) and [ColossalAI](https://github.com/hpcaitech/ColossalAI).
-
## License
The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow **free** commercial usage. To apply for a commercial license, please fill in the [application form (English)](https://wj.qq.com/s2/12727483/5dba/)/[申请表(中文)](https://wj.qq.com/s2/12725412/f7c1/). For other questions or collaborations, please contact .
diff --git a/README_zh-CN.md b/README_zh-CN.md
new file mode 100644
index 0000000..e4a460b
--- /dev/null
+++ b/README_zh-CN.md
@@ -0,0 +1,158 @@
+# InternLM
+
+
+
+
+
+
+
+[![license](./assets/license.svg)](./LICENSE)
+[![evaluation](./assets/compass_support.svg)](https://github.com/internLM/OpenCompass/)
+
+
+[📘对话教程](./chat) |
+[🛠️智能体教程](./agent) |
+[📊评测](./evaluation) |
+[👀模型库](./model_cards) |
+[🤗HuggingFace](https://huggingface.co/spaces/internlm/internlm2-Chat-7B) |
+[🆕Update News](#news) |
+[🤔提交反馈](https://github.com/InternLM/InternLM/issues/new)
+
+[English](./README.md) |
+[简体中文](./README_zh-CN.md) |
+
+
+
+
+ 👋 加入我们的 Discord 和 微信社区
+
+
+## 简介
+
+InternLM2 系列模型在本仓库正式发布,具有如下特性:
+
+- 有效支持20万字超长上下文:模型在20万字长输入中几乎完美地实现长文“大海捞针”,而且在 LongBench 和 L-Eval 等长文任务中的表现也达到开源模型中的领先水平。 可以通过 [LMDeploy](./inference/) 尝试20万字超长上下文推理。
+- 综合性能全面提升:各能力维度相比上一代模型全面进步,在推理、数学、代码、对话体验、指令遵循和创意写作等方面的能力提升尤为显著,综合性能达到同量级开源模型的领先水平,在重点能力评测上 InternLM2-Chat-20B 能比肩甚至超越 ChatGPT (GPT-3.5)。
+- 代码解释器与数据分析:在配合代码解释器(code-interpreter)的条件下,InternLM2-Chat-20B 在 GSM8K 和 MATH 上可以达到和 GPT-4 相仿的水平。基于在数理和工具方面强大的基础能力,InternLM2-Chat 提供了实用的数据分析能力。
+- 工具调用能力整体升级:基于更强和更具有泛化性的指令理解、工具筛选与结果反思等能力,新版模型可以更可靠地支持复杂智能体的搭建,支持对工具进行有效的多轮调用,完成较复杂的任务。可以查看更多[样例](./agent/)。
+
+## 更新
+
+[2024.01.17] 我们发布了 InternLM2-7B 和 InternLM2-20B 以及相关的对话模型,InternLM2 在数理、代码、对话、创作等各方面能力都获得了长足进步,综合性能达到开源模型的领先水平。可以点击 [下面的模型库](#model-zoo)进行下载或者[查看模型文档](./model_cards/)来了解更多细节.
+
+[2023.12.13] 我们更新了 InternLM-7B-Chat 和 InternLM-20B-Chat 模型权重。通过改进微调数据和训练策略,新版对话模型生成的回复质量更高、语言风格更加多元。
+
+[2023.09.20] InternLM-20B 已发布,包括基础版和对话版。
+
+## Model Zoo
+
+| Model | Transformers(HF) | ModelScope(HF) | OpenXLab(HF) | Release Date |
+|---------------------------|------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|
+| **InternLM2 Chat 20B** | [🤗internlm/internlm2-chat-20b](https://huggingface.co/internlm/internlm2-chat-20b) | [ internlm2-chat-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-chat-20b) | 2024-01-17 |
+| **InternLM2 20B** | [🤗internlm/internlm2-20b](https://huggingface.co/internlm/internlm2-20b) | [ internlm2-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-20b) | 2024-01-17 |
+| **InternLM2 Chat 20B SFT** | [🤗internlm/internlm2-chat-20b-sft](https://huggingface.co/internlm/internlm2-chat-20b-sft) | [ internlm2-chat-20b-sft](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-20b-sft/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-chat-20b-sft) | 2024-01-17 |
+| **InternLM2 Base 20B** | [🤗internlm/internlm2-base-20b](https://huggingface.co/internlm/internlm2-base-20b) | [ internlm2-base-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-base-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-base-20b) | 2024-01-17 |
+| **InternLM2 Chat 7B** | [🤗internlm/internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) | [ internlm2-chat-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-chat-7b) | 2024-01-17 |
+| **InternLM2 7B** | [🤗internlm/internlm2-7b](https://huggingface.co/internlm/internlm2-7b) | [ internlm2-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-7b) | 2024-01-17 |
+| **InternLM2 Chat 7B SFT** | [🤗internlm/internlm2-chat-7b-sft](https://huggingface.co/internlm/internlm2-chat-7b-sft) | [ internlm2-chat-7b-sft](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-7b-sft/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-chat-7b-sft) | 2024-01-17 |
+| **InternLM2 Base 7B** | [🤗internlm/internlm2-base-7b](https://huggingface.co/internlm/internlm2-base-7b) | [ internlm2-base-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-base-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/internlm2-base-7b) | 2024-01-17 |
+
+## 使用案例
+
+接下来我们展示使用 [Transformers](#import-from-transformers), [ModelScope](#import-from-modelscope), 和 [Web demo](#dialogue) 进行推理.
+对话模型采用了 [chatml 格式](./chat/chat_format.md) 来支持通用对话和智能体应用。
+
+### 通过 Transformers 加载
+
+通过以下的代码从 Transformers 加载 InternLM 模型 (可修改模型名称替换不同的模型)
+
+```python
+>>> from transformers import AutoTokenizer, AutoModelForCausalLM
+>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)
+>>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True).cuda()
+>>> model = model.eval()
+>>> response, history = model.chat(tokenizer, "你好", history=[])
+>>> print(response)
+你好!有什么我可以帮助你的吗?
+>>> response, history = model.chat(tokenizer, "请提供三个管理时间的建议。", history=history)
+>>> print(response)
+```
+
+### 通过 ModelScope 加载
+
+通过以下的代码从 ModelScope 加载 InternLM 模型 (可修改模型名称替换不同的模型)
+
+```python
+from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
+import torch
+model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2-chat-7b')
+tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", trust_remote_code=True,torch_dtype=torch.float16)
+model = AutoModelForCausalLM.from_pretrained(model_dir,device_map="auto", trust_remote_code=True,torch_dtype=torch.float16)
+model = model.eval()
+response, history = model.chat(tokenizer, "hello", history=[])
+print(response)
+response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
+print(response)
+```
+
+### 通过前端网页对话
+
+可以通过以下代码启动一个前端的界面来与 InternLM Chat 7B 模型进行交互
+
+```bash
+pip install streamlit==1.24.0
+pip install transformers==4.30.2
+streamlit run ./chat/web_demo.py
+```
+
+效果如下
+
+![效果](https://github.com/InternLM/InternLM/assets/9102141/11b60ee0-47e4-42c0-8278-3051b2f17fe4)
+
+### 基于InternLM高性能部署
+
+我们使用 [LMDeploy](https://github.com/InternLM/LMDeploy) 完成 InternLM 的一键部署。
+
+```shell
+python3 -m pip install lmdeploy
+lmdeploy chat turbomind InternLM/internlm2-chat-7b --model-name internlm2-chat-7b
+```
+
+请参考[部署指南](./chat/lmdeploy.md)了解更多使用案例,更多部署教程则可在[这里](https://github.com/InternLM/LMDeploy)找到。
+
+## 微调&训练
+
+请参考[微调教程](./finetune/)尝试续训或微调 InternLM2。
+
+**注意:** 本项目中的全量训练功能已经迁移到 [InternEvo](https://github.com/InternLM/InternEvo),以提供更便捷的使用体验。InternEvo 为训练 InternLM 系列模型提供了高效的预训练和微调基建。
+
+## 贡献
+
+我们感谢所有的贡献者为改进和提升 InternLM 所作出的努力。非常欢迎社区用户能参与进项目中来。请参考贡献指南来了解参与项目贡献的相关指引。
+
+## 致谢
+
+InternLM 代码库是一款由上海人工智能实验室和来自不同高校、企业的研发人员共同参与贡献的开源项目。我们感谢所有为项目提供新功能支持的贡献者,以及提供宝贵反馈的用户。 我们希望这个工具箱和基准测试可以为社区提供灵活高效的代码工具,供用户微调 InternLM 并开发自己的新模型,从而不断为开源社区提供贡献。特别鸣谢[flash-attention](https://github.com/HazyResearch/flash-attention) 与 [ColossalAI](https://github.com/hpcaitech/ColossalAI) 两项开源项目。
+
+## 开源许可证
+
+本仓库的代码依照 Apache-2.0 协议开源。模型权重对学术研究完全开放,也可申请免费的商业使用授权([申请表](https://wj.qq.com/s2/12725412/f7c1/))。其他问题与合作请联系 。
+
+## 引用
+
+```
+@misc{2023internlm,
+ title={InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities},
+ author={InternLM Team},
+ howpublished = {\url{https://github.com/InternLM/InternLM}},
+ year={2023}
+}
+```
diff --git a/CHANGE_LOG.md b/agent/README.md
similarity index 100%
rename from CHANGE_LOG.md
rename to agent/README.md
diff --git a/doc/imgs/compass_support.svg b/assets/compass_support.svg
similarity index 100%
rename from doc/imgs/compass_support.svg
rename to assets/compass_support.svg
diff --git a/doc/imgs/license.svg b/assets/license.svg
similarity index 100%
rename from doc/imgs/license.svg
rename to assets/license.svg
diff --git a/doc/imgs/logo.svg b/assets/logo.svg
similarity index 100%
rename from doc/imgs/logo.svg
rename to assets/logo.svg
diff --git a/doc/imgs/modelscope_logo.png b/assets/modelscope_logo.png
similarity index 100%
rename from doc/imgs/modelscope_logo.png
rename to assets/modelscope_logo.png
diff --git a/doc/imgs/robot.png b/assets/robot.png
similarity index 100%
rename from doc/imgs/robot.png
rename to assets/robot.png
diff --git a/doc/imgs/user.png b/assets/user.png
similarity index 100%
rename from doc/imgs/user.png
rename to assets/user.png
diff --git a/chat/README.md b/chat/README.md
new file mode 100644
index 0000000..ee19878
--- /dev/null
+++ b/chat/README.md
@@ -0,0 +1,61 @@
+# Chat
+
+English | [简体中文](README_zh-CN.md)
+
+This document briefly shows how to use [Transformers](#import-from-transformers), [ModelScope](#import-from-modelscope), and [Web demos](#dialogue) to conduct inference with InternLM2-Chat.
+
+You can also learn more about the [chatml format](./chat_format.md) and how to use [LMDeploy for inference and model serving](./lmdeploy.md).
+
+## Import from Transformers
+
+To load the InternLM2 7B Chat model using Transformers, use the following code:
+
+```python
+>>> from transformers import AutoTokenizer, AutoModelForCausalLM
+>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)
+>>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True).cuda()
+>>> model = model.eval()
+>>> response, history = model.chat(tokenizer, "hello", history=[])
+>>> print(response)
+Hello! How can I help you today?
+>>> response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
+>>> print(response)
+Sure, here are three tips for effective time management:
+
+1. Prioritize tasks based on importance and urgency: Make a list of all your tasks and categorize them into "important and urgent," "important but not urgent," and "not important but urgent." Focus on completing the tasks in the first category before moving on to the others.
+2. Use a calendar or planner: Write down deadlines and appointments in a calendar or planner so you don't forget them. This will also help you schedule your time more effectively and avoid overbooking yourself.
+3. Minimize distractions: Try to eliminate any potential distractions when working on important tasks. Turn off notifications on your phone, close unnecessary tabs on your computer, and find a quiet place to work if possible.
+
+Remember, good time management skills take practice and patience. Start with small steps and gradually incorporate these habits into your daily routine.
+```
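+
+For streaming output, the model's remote code also exposes a `stream_chat` interface. The snippet below is a minimal sketch; it assumes `stream_chat` accepts the same `(tokenizer, query, history)` arguments as `chat` and yields the partial response together with the updated history:
+
+```python
+# Print the reply as it is generated; each yielded `response` is the text so far.
+for response, history in model.stream_chat(tokenizer, "hello", history=[]):
+    print(response)
+```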
+
+## Import from ModelScope
+
+To load the InternLM model using ModelScope, use the following code:
+
+```python
+from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
+import torch
+model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2-chat-7b')
+tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, torch_dtype=torch.float16)
+model = model.eval()
+response, history = model.chat(tokenizer, "hello", history=[])
+print(response)
+response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
+print(response)
+```
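+
+If you need a reproducible snapshot of the weights, `snapshot_download` also accepts a `revision` argument (the Chinese version of this guide pins `v1.0.0`), for example:
+
+```python
+from modelscope import snapshot_download
+
+# Download (or reuse the locally cached copy of) a fixed release of the weights
+model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2-chat-7b', revision='v1.0.0')
+```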
+
+## Dialogue
+
+You can interact with the InternLM Chat 7B model through a frontend interface by running the following code:
+
+```bash
+pip install streamlit==1.24.0
+pip install transformers==4.30.2
+streamlit run ./chat/web_demo.py
+```
+
+The effect is shown below:
+
+![demo](https://github.com/InternLM/InternLM/assets/9102141/11b60ee0-47e4-42c0-8278-3051b2f17fe4)
diff --git a/chat/README_zh-CN.md b/chat/README_zh-CN.md
new file mode 100644
index 0000000..5f8491b
--- /dev/null
+++ b/chat/README_zh-CN.md
@@ -0,0 +1,51 @@
+# 对话
+
+[English](README.md) | 简体中文
+
+本文介绍采用 [Transformers](#import-from-transformers)、[ModelScope](#import-from-modelscope)、[Web demos](#dialogue)
+对 InternLM2-Chat 进行推理。
+
+你还可以进一步了解 InternLM2-Chat 采用的[对话格式](./chat_format_zh-CN.md),以及如何[用 LMDeploy 进行推理或部署服务](./lmdeploy_zh-CN.md),或者尝试用 [OpenAOE](./openaoe.md) 与多个模型对话。
+
+## 通过 Transformers 加载
+
+通过以下的代码从 Transformers 加载 InternLM 模型 (可修改模型名称替换不同的模型)
+
+```python
+>>> from transformers import AutoTokenizer, AutoModelForCausalLM
+>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)
+>>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True).cuda()
+>>> model = model.eval()
+>>> response, history = model.chat(tokenizer, "你好", history=[])
+>>> print(response)
+你好!有什么我可以帮助你的吗?
+>>> response, history = model.chat(tokenizer, "请提供三个管理时间的建议。", history=history)
+>>> print(response)
+```
+
+## 通过 ModelScope 加载
+
+通过以下的代码从 ModelScope 加载 InternLM2-Chat 模型 (可修改模型名称替换不同的模型)
+
+```python
+from modelscope import snapshot_download, AutoTokenizer, AutoModelForCausalLM
+import torch
+model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm2-chat-7b', revision='v1.0.0')
+tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, torch_dtype=torch.float16)
+model = model.eval()
+response, history = model.chat(tokenizer, "hello", history=[])
+print(response)
+response, history = model.chat(tokenizer, "please provide three suggestions about time management", history=history)
+print(response)
+```
+
+## 通过前端网页对话
+
+可以通过以下代码启动一个前端的界面来与 InternLM2 Chat 7B 模型进行交互
+
+```bash
+pip install streamlit==1.24.0
+pip install transformers==4.30.2
+streamlit run ./chat/web_demo.py
+```
diff --git a/chat/chat_format_zh-CN.md b/chat/chat_format_zh-CN.md
new file mode 100644
index 0000000..f55c81e
--- /dev/null
+++ b/chat/chat_format_zh-CN.md
@@ -0,0 +1,99 @@
+# 对话格式
+
+[English](chat_format.md) | 简体中文
+
+InternLM2-Chat 采用了全新的对话格式,以灵活地支持工具调用等更广泛的应用,并防范来自用户输入的攻击。新的对话格式和 [ChatML](https://github.com/openai/openai-python/blob/release-v0.28.0/chatml.md) 格式类似,但是为了支持通用的智能体应用,在 `system`、`user`、`assistant` 的基础上,引入了 `environment` 角色。
+
+## 基本结构
+
+常规的对话结构一般包含 `system`,`user`,`assistant` 三个角色,采用如下格式进行多轮对话
+
+```
+[UNUSED_TOKEN_146]system
+你是书生浦语2,一个无害的人工智能助手[UNUSED_TOKEN_145]
+[UNUSED_TOKEN_146]user
+你好呀[UNUSED_TOKEN_145]
+[UNUSED_TOKEN_146]assistant
+你好,我是书生浦语,请问有什么可以帮助你的吗[UNUSED_TOKEN_145]
+```
+
+其中 `[UNUSED_TOKEN_146]` 充当了每轮对话开始符,`[UNUSED_TOKEN_145]` 充当了当前轮对话结束符。每轮对话一般以 `[UNUSED_TOKEN_146]role` 开头,以模型输出的 `[UNUSED_TOKEN_145]` 结尾,role 代表 `system`,`user`,`assistant` 和 `environment` 角色。目前,InternLM2-Chat 模型的词表中还维护了如下映射
+
+- `[UNUSED_TOKEN_146]`:每个角色对话的开始符
+- `[UNUSED_TOKEN_145]`:每个角色对话的结束符
+- `[UNUSED_TOKEN_144]`:模型调用外部插件的开始符
+- `[UNUSED_TOKEN_143]`:模型调用外部插件的结束符
+- `[UNUSED_TOKEN_142]`:代码解释器
+- `[UNUSED_TOKEN_141]`:外部插件,常规的 tools
+
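+按照上述约定,可以手工拼接出一轮对话的 prompt。下面是一个仅作示意的拼接示例,用于展示各特殊 token 的位置;实际推理时也可以直接使用模型自带的 `chat` 接口:
+
+```python
+system = "你是书生浦语2,一个无害的人工智能助手"
+user = "你好呀"
+
+# 依次拼接 system 与 user 轮次,并以 assistant 的开始符结尾,等待模型续写
+prompt = (
+    f"[UNUSED_TOKEN_146]system\n{system}[UNUSED_TOKEN_145]\n"
+    f"[UNUSED_TOKEN_146]user\n{user}[UNUSED_TOKEN_145]\n"
+    "[UNUSED_TOKEN_146]assistant\n"
+)
+print(prompt)
+```
+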
+## 完整结构
+
+InternLM2-Chat 的完整对话格式在上述基本结构的基础上还包含了针对通用智能体的设计,其核心目的是采用流式格式,使得同一套格式在支持各类插件扩展和智能体环境的同时,能够与通用对话兼容。通用的智能体对话状态如下所示
+
+```
+[UNUSED_TOKEN_146]system
+你是书生浦语2,一个无害的人工智能助手[UNUSED_TOKEN_145]
+[UNUSED_TOKEN_146]system name=[UNUSED_TOKEN_141]
+[
+ {
+ "name": "get_current_weather",
+ "description": "Get the current weather in a given location",
+ "parameters": {
+ "type": "object",
+ "properties": {
+ "location": {
+ "type": "string",
+ "description": "The city and state, e.g. San Francisco, CA",
+ },
+ "unit": {"type": "string"},
+ },
+ "required": ["location"],
+ },
+ }
+]
+[UNUSED_TOKEN_145]
+[UNUSED_TOKEN_146]user
+请帮我对该数据集进行数据处理并可视化。
+[UNUSED_TOKEN_145]
+[UNUSED_TOKEN_146]user name=file
+{"path": "data.csv"}[UNUSED_TOKEN_145]
+[UNUSED_TOKEN_146]assistant
+我已经帮您处理了数据并进行了可视化。
+
+[UNUSED_TOKEN_144][UNUSED_TOKEN_142]
+```python
+import plotly.express as px
+import pandas as pd
+
+# Load the data into a pandas dataframe
+df = pd.read_csv('data.csv')
+
+# Create a scatter plot of rainfall vs wind direction
+fig = px.scatter(df, x='WindDir9am', y='Rainfall', color='WindDir3pm',
+ labels={'WindDir9am': 'Wind Direction 9am', 'Rainfall': 'Rainfall', 'WindDir3pm': 'Wind Direction 3pm'},
+ title='Rainfall vs Wind Direction',
+ template='plotly_dark',
+ width=600, height=500)
+
+# Add a hover effect to show the date
+fig.update_traces(hovertemplate='Date: %{text} Wind Direction 9am: %{x} Rainfall: %{y} Wind Direction 3pm: %{marker.color}')
+
+# Show the plot
+fig.show()
+```[UNUSED_TOKEN_143]
+[UNUSED_TOKEN_145]
+[UNUSED_TOKEN_146]environment name=[UNUSED_TOKEN_142]
+[UNUSED_TOKEN_145]
+[UNUSED_TOKEN_146]assistant
+由于代码输出没有包含在这里,我无法提供具体的图表内容。但是,如果代码执行无误,它会显示一个极坐标图,其中包含两个填充区域,一个代表上午9点的风向与降雨量的关系,另一个代表下午3点的风向与降雨量的关系。每个方向上的值是根据提供的数据集计算得到的平均降雨量。图表应该有一个清晰的标题、图例,并且能够直观地比较不同风向下的降雨量。由于使用了暗色主题,图表的整体外观应该是深色背景上的亮色线条和填充区域。[UNUSED_TOKEN_145]
+[UNUSED_TOKEN_146]user
+你能解释一下代码吗[UNUSED_TOKEN_145]
+[UNUSED_TOKEN_146]assistant
+我使用的是Python的Plotly库来创建一个极坐标图,以展示风向与降雨量之间的关系。代码中首先创建了一个Plotly的Figure对象。接着,我添加了两个`Scatterpolar`追踪。第一个追踪展示的是上午9点的风向与降雨量的平均值的关系,第二个追踪展示的是下午3点的风向与降雨量的平均值的关系。在这两个追踪中,`r`参数代表降雨量的平均值,`theta`参数代表风向。`fill='toself'`表示图形以自身为边界进行填充。然后我更新了图表的布局,隐藏了径向轴,添加了图例,设置了标题和标题字体,选择了暗色主题,并设定了图表的背景色、宽度和高度。最后,使用`fig.show()`展示了图表。
+[UNUSED_TOKEN_145]
+[UNUSED_TOKEN_146]user
+我想了解今天上海的天气[UNUSED_TOKEN_145]
+[UNUSED_TOKEN_144][UNUSED_TOKEN_141]
+{"name": "get_current_weather", "parameters": {"location": "上海"}}[UNUSED_TOKEN_143]
+[UNUSED_TOKEN_145]
+```
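+
+上例中,模型用 `[UNUSED_TOKEN_144][UNUSED_TOKEN_141]` 与 `[UNUSED_TOKEN_143]` 包裹插件调用参数。下面给出一个示意性的解析函数(假设调用参数为一段 JSON,且模型输出严格遵循上述格式):
+
+```python
+import json
+import re
+
+
+def extract_tool_call(output: str):
+    """从模型输出中提取插件调用,返回解析后的 JSON;若不存在则返回 None。"""
+    pattern = r"\[UNUSED_TOKEN_144\]\[UNUSED_TOKEN_141\]\n(.*?)\[UNUSED_TOKEN_143\]"
+    match = re.search(pattern, output, flags=re.S)
+    return json.loads(match.group(1)) if match else None
+
+
+call = extract_tool_call(
+    "[UNUSED_TOKEN_144][UNUSED_TOKEN_141]\n"
+    '{"name": "get_current_weather", "parameters": {"location": "上海"}}[UNUSED_TOKEN_143]'
+)
+print(call)  # {'name': 'get_current_weather', 'parameters': {'location': '上海'}}
+```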
diff --git a/web_demo.py b/chat/web_demo.py
similarity index 98%
rename from web_demo.py
rename to chat/web_demo.py
index 26de0ba..2ddee4e 100644
--- a/web_demo.py
+++ b/chat/web_demo.py
@@ -73,8 +73,8 @@ def main():
model, tokenizer = load_model()
print("load model end.")
- user_avator = "doc/imgs/user.png"
- robot_avator = "doc/imgs/robot.png"
+ user_avator = "assets/user.png"
+ robot_avator = "assets/robot.png"
st.title("InternLM-Chat-7B")
diff --git a/ci_scripts/common/basic_func.sh b/ci_scripts/common/basic_func.sh
deleted file mode 100644
index 8ce1c54..0000000
--- a/ci_scripts/common/basic_func.sh
+++ /dev/null
@@ -1,18 +0,0 @@
-#!/bin/bash
-
-#######################################
-# Calculate the number of files in a directory.
-# Call this function like this: num_files "${file_path}".
-# Globals:
-# None
-# Arguments:
-# $1: the directory path
-# Returns:
-# the number of files in the directory
-#######################################
-num_files() {
- [[ $# -eq 1 ]] || return 1
- local file_num
- file_num=$(ls -l $1 | grep '^-' | wc -l)
- echo $file_num
-}
diff --git a/ci_scripts/common/com_func.py b/ci_scripts/common/com_func.py
deleted file mode 100644
index 4d1ba63..0000000
--- a/ci_scripts/common/com_func.py
+++ /dev/null
@@ -1,29 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-
-def merge_dicts(dict_a: dict, dict_b: dict):
- for key in dict_b.keys():
- if isinstance(dict_b[key], dict):
- dict_b[key] = {**dict_a[key], **dict_b[key]}
- merge_dicts(dict_a[key], dict_b[key])
- dict_c = {**dict_a, **dict_b}
- return dict_c
-
-
-def format_dict_to_py_string(data: dict, indent=0, is_nested=False):
- result = ""
- for key, value in data.items():
- if isinstance(value, dict):
- result += f"{' ' * indent}{key} = dict(\n"
- result += format_dict_to_py_string(value, indent + 4, is_nested=True)
- result += f"{' ' * indent})"
- else:
- result += f"{' ' * indent}{key} = {repr(value)}"
- if is_nested:
- result += ","
- result += "\n"
- result = f"""\
-{result}
-"""
- return result
diff --git a/ci_scripts/common/post_action.sh b/ci_scripts/common/post_action.sh
deleted file mode 100644
index 9aa4d22..0000000
--- a/ci_scripts/common/post_action.sh
+++ /dev/null
@@ -1,21 +0,0 @@
-#!/bin/bash
-set -x
-
-retry_times=3
-for ((i=1;i<=$retry_times;i++));do
- jobid=$(squeue -o "%A %j" -u $USER | grep ${GITHUB_RUN_ID}-${GITHUB_JOB} | awk '{print $1}')
- if [[ -n "$jobid" ]];then
- echo "The job $jobid will be canceled."
- scancel $jobid
- sleep 0.5
- else
- echo "There are no more jobs that need to be canceled."
- break
- fi
-done
-
-if [[ $i -gt $retry_times ]];then
- echo "There have been tried $retry_times times. Please contact user $USER to confirm the job status."
-fi
-
-exit 0
diff --git a/ci_scripts/common/variables.sh b/ci_scripts/common/variables.sh
deleted file mode 100644
index 077fee4..0000000
--- a/ci_scripts/common/variables.sh
+++ /dev/null
@@ -1,4 +0,0 @@
-#!/bin/bash
-
-readonly DATA_VOLUME=$(echo $GITHUB_WORKSPACE | cut -d '/' -f 1-4)/data
-readonly CLEAN_PATH=$(echo $GITHUB_WORKSPACE | cut -d '/' -f 1-4)/ci_clean_bak
diff --git a/ci_scripts/data/tokenizer_alpaca.sh b/ci_scripts/data/tokenizer_alpaca.sh
deleted file mode 100644
index db43d80..0000000
--- a/ci_scripts/data/tokenizer_alpaca.sh
+++ /dev/null
@@ -1,51 +0,0 @@
-#!/bin/bash
-set -x
-
-source ./ci_scripts/common/variables.sh
-[[ -n ${DATA_VOLUME} ]] || { echo "should set DATA_VOLUME first before ci, exit."; exit 1; }
-[[ -n ${CLEAN_PATH} ]] || { echo "should set CLEAN_PATH first before ci, exit."; exit 1; }
-
-readonly SRC_DATASET_META=${DATA_VOLUME}/lm_data/alpaca_data/alpaca_data.json
-readonly RESULTS=${DATA_VOLUME}/lm_data/alpaca_data/result
-readonly TRAIN_DATASET=${RESULTS}/train/en/dataset.bin
-readonly TRAIN_DATASET_META=${RESULTS}/train/en/dataset.bin.meta
-readonly VALID_DATASET=${RESULTS}/valid/en/dataset.bin
-readonly VALID_DATASET_META=${RESULTS}/valid/en/dataset.bin.meta
-
-split_ratio=0.1
-exit_code=0
-
-source ./ci_scripts/common/basic_func.sh
-
-echo "start to test alpaca_tokenizer.py."
-
-if [[ -d ${RESULTS} ]]; then
- if ! rsync -av --remove-source-files ${RESULTS} ${CLEAN_PATH}; then
- echo "cleaning test data in ${RESULTS} failed, exit."
- exit 1
- fi
-fi
-
-if [[ ! -f ${SRC_DATASET_META} ]]; then
- echo "${SRC_DATASET_META} should be exist, exit."
- exit 1
-fi
-
-python tools/alpaca_tokenizer.py ${SRC_DATASET_META} ${RESULTS} tools/V7_sft.model --split_ratio ${split_ratio}
-[[ $? -ne 0 ]] && { echo "test alpaca_tokenizer.py failed."; exit_code=$(($exit_code + 1)); }
-
-file_list=(${TRAIN_DATASET} ${TRAIN_DATASET_META} ${VALID_DATASET} ${VALID_DATASET_META})
-for file in ${file_list[@]}; do
- if [[ ! -f ${file} ]]; then
- echo "expect: ${file} exists, actual: not exist."
- exit_code=$(($exit_code + 1))
- fi
-done
-
-# move the test files.
-if ! rsync -av --remove-source-files ${RESULTS} ${CLEAN_PATH}; then
- echo "cleaning test data in ${RESULTS} failed."
- exit_code=$(($exit_code + 1))
-fi
-
-exit $exit_code
diff --git a/ci_scripts/data/tokenizer_chinese.sh b/ci_scripts/data/tokenizer_chinese.sh
deleted file mode 100644
index 81a5198..0000000
--- a/ci_scripts/data/tokenizer_chinese.sh
+++ /dev/null
@@ -1,43 +0,0 @@
-#!/bin/bash
-set -x
-
-source ./ci_scripts/common/variables.sh
-[[ -n ${DATA_VOLUME} ]] || { echo "should set DATA_VOLUME first before ci, exit."; exit 1; }
-[[ -n ${CLEAN_PATH} ]] || { echo "should set CLEAN_PATH first before ci, exit."; exit 1; }
-
-readonly DATA=${DATA_VOLUME}/lm_data/cn_data/raw_data.txt
-readonly RESULT=${DATA_VOLUME}/lm_data/cn_data/result.bin
-readonly RESULT_META=${DATA_VOLUME}/lm_data/cn_data/result.bin.meta
-readonly RESULTS=${DATA_VOLUME}/lm_data/cn_data/result.*
-exit_code=0
-
-source ./ci_scripts/common/basic_func.sh
-
-echo "start to test tokenizer.py."
-
-num=$(num_files "${RESULTS}")
-if [[ ${num} -gt 0 ]]; then
- if ! rsync -av --remove-source-files ${RESULTS} ${CLEAN_PATH}; then
- echo "cleaning test data ${RESULTS} failed, exit."
- exit 1
- fi
-fi
-
-srun -p ${SLURM_PARTITION} --quotatype=spot --job-name=$1 --gpus-per-task=1 python tools/tokenizer.py --text_input_path ${DATA} --bin_output_path ${RESULT}
-[[ $? -ne 0 ]] && { echo "test tokenizer.py failed."; exit_code=$(($exit_code + 1)); }
-
-file_list=($RESULT $RESULT_META)
-for file in ${file_list[@]}; do
- if [[ ! -f ${file} ]]; then
- echo "expect: ${file} exists, actual: not exist."
- exit_code=$(($exit_code + 1))
- fi
-done
-
-# move the test files.
-if ! rsync -av --remove-source-files ${RESULTS} ${CLEAN_PATH}; then
- echo "cleaning cached file in ${RESULTS} failed."
- exit_code=$(($exit_code + 1))
-fi
-
-exit $exit_code
diff --git a/ci_scripts/model/convert_to_hf.sh b/ci_scripts/model/convert_to_hf.sh
deleted file mode 100644
index d1af389..0000000
--- a/ci_scripts/model/convert_to_hf.sh
+++ /dev/null
@@ -1,48 +0,0 @@
-#!/bin/bash
-set -x
-
-source ./ci_scripts/common/variables.sh
-[[ -n ${DATA_VOLUME} ]] || { echo "should set DATA_VOLUME first before ci, exit."; exit 1; }
-[[ -n ${GITHUB_WORKSPACE} ]] || { echo "should set GITHUB_WORKSPACE first before ci, exit."; exit 1; }
-[[ -n ${CLEAN_PATH} ]] || { echo "should set CLEAN_PATH first before ci, exit."; exit 1; }
-
-readonly CKPTS_INPUT="${DATA_VOLUME}/lm_data/alpaca_data/llm_ckpts/20"
-readonly CKPTS_OUTPUT="${GITHUB_WORKSPACE}/hf_ckpt"
-readonly TOKENIZER="${GITHUB_WORKSPACE}/hf_ckpt/tokenizer.model"
-readonly CONFIG="${GITHUB_WORKSPACE}/hf_ckpt/config.json"
-readonly INERNLM="${GITHUB_WORKSPACE}/hf_ckpt/modeling_internlm.py"
-exit_code=0
-expected_num=9
-
-source ./ci_scripts/common/basic_func.sh
-
-echo "start to test convert2hf.py."
-
-if [[ -d ${CKPTS_OUTPUT} ]]; then
- if ! rsync -av --remove-source-files ${CKPTS_OUTPUT}/* ${CLEAN_PATH}; then
- echo "cleaning cached file in ${CKPTS_OUTPUT} failed, exit."
- exit 1
- fi
-fi
-
-python ./tools/transformers/convert2hf.py --src_folder ${CKPTS_INPUT} --tgt_folder ${CKPTS_OUTPUT} --tokenizer ./tools/V7_sft.model
-[[ $? -ne 0 ]] && { echo "test convert2hf.py failed."; exit_code=$(($exit_code + 1)); }
-
-#assert exists model
-file_list=($TOKENIZER $CONFIG $INERNLM)
-for file in ${file_list[@]}; do
- if [[ ! -f ${file} ]];then
- echo "file ${file} does not exist."
- exit_code=$(($exit_code + 1))
- fi
-done
-
-num=$(num_files "${CKPTS_OUTPUT}")
-
-if [[ ${num} -ne ${expected_num} ]]; then
- echo "expect: ${expected_num} files, actual: ${num} files."
- exit_code=$(($exit_code + 1))
-fi
-
-# NOTICE: should not remove the cached files, because the cached files will be used in the next test case.
-exit $exit_code
diff --git a/ci_scripts/model/demo_load_7B_chat_model.py b/ci_scripts/model/demo_load_7B_chat_model.py
deleted file mode 100644
index aed5a88..0000000
--- a/ci_scripts/model/demo_load_7B_chat_model.py
+++ /dev/null
@@ -1,13 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True)
-model = AutoModelForCausalLM.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True).cuda()
-model = model.eval()
-response, history = model.chat(tokenizer, "你好", history=[])
-print(response)
-assert len(response) != 0
-response, history = model.chat(tokenizer, "请提供三个管理时间的建议。", history=history)
-print(response)
-assert len(response) != 0
diff --git a/ci_scripts/model/loaded_as_transformer.py b/ci_scripts/model/loaded_as_transformer.py
deleted file mode 100644
index 5254fb9..0000000
--- a/ci_scripts/model/loaded_as_transformer.py
+++ /dev/null
@@ -1,9 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-from transformers import AutoModel
-
-model = AutoModel.from_pretrained("../hf_ckpt/", trust_remote_code=True).cuda()
-print(model)
-assert model.config.hidden_size == 2048
-assert model.config.num_attention_heads == 16
-assert model.config.num_hidden_layers == 16
diff --git a/ci_scripts/train/ci_7B_sft.py b/ci_scripts/train/ci_7B_sft.py
deleted file mode 100644
index fea45e1..0000000
--- a/ci_scripts/train/ci_7B_sft.py
+++ /dev/null
@@ -1,131 +0,0 @@
-JOB_NAME = "7b_train"
-
-SEQ_LEN = 1024
-HIDDEN_SIZE = 2048
-NUM_ATTENTION_HEAD = 16
-MLP_RATIO = 8 / 3
-NUM_LAYER = 16
-VOCAB_SIZE = 103168
-
-# Ckpt folder format:
-# fs: 'local:/mnt/nfs/XXX'
-# oss: 'boto3:s3://model_weights/XXX'
-# MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
-# SAVE_CKPT_FOLDER = "local:llm_ckpts"
-SAVE_CKPT_FOLDER = "local:llm_ckpts"
-# LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
-ckpt = dict(
- enable_save_ckpt=True,
- # Path to save training ckpt.
- save_ckpt_folder=SAVE_CKPT_FOLDER,
- # Path to continue training ckpt (load model weights and scheduler/context states).
- # load_ckpt_folder=LOAD_CKPT_FOLDER,
- # Path to initialize with given model weights.
- # load_model_only_folder=MODEL_ONLY_FOLDER,
- checkpoint_every=20,
- # Wheter to load optimizer states when continuing training.
- load_optimizer=True,
-)
-
-TRAIN_FOLDER = "local:../lm_data/alpaca_data/train/en"
-data = dict(
- seq_len=SEQ_LEN,
- # micro_num means the number of micro_batch contained in one gradient update
- micro_num=4,
- # packed_length = micro_bsz * SEQ_LEN
- micro_bsz=2,
- pack_sample_into_one=False,
- total_steps=20,
- skip_batches="",
- rampup_batch_size="",
- # Datasets with less than 50 rows will be discarded
- min_length=50,
- # train_folder=TRAIN_FOLDER,
-)
-
-grad_scaler = dict(
- fp16=dict(
- # the initial loss scale, defaults to 2**16
- initial_scale=2**16,
- # the minimum loss scale, defaults to None
- min_scale=1,
- # the number of steps to increase loss scale when no overflow occurs
- growth_interval=1000,
- ),
- # the multiplication factor for increasing loss scale, defaults to 2
- growth_factor=2,
- # the multiplication factor for decreasing loss scale, defaults to 0.5
- backoff_factor=0.5,
- # the maximum loss scale, defaults to None
- max_scale=2**24,
- # the number of overflows before decreasing loss scale, defaults to 2
- hysteresis=2,
-)
-
-hybrid_zero_optimizer = dict(
- # Enable low_level_optimzer overlap_communication
- zero_overlap_communication=True,
- # bucket size for nccl communication params
- reduce_bucket_size=512 * 1024 * 1024,
- # grad clipping
- clip_grad_norm=1.0,
-)
-
-loss = dict(
- label_smoothing=0,
-)
-
-adam = dict(
- lr=1e-4,
- adam_beta1=0.9,
- adam_beta2=0.95,
- adam_beta2_c=0,
- adam_eps=1e-8,
- weight_decay=0.01,
-)
-
-lr_scheduler = dict(
- total_steps=data["total_steps"],
- init_steps=0, # optimizer_warmup_step
- warmup_ratio=0.01,
- eta_min=1e-5,
- last_epoch=-1,
-)
-
-beta2_scheduler = dict(
- init_beta2=adam["adam_beta2"],
- c=adam["adam_beta2_c"],
- cur_iter=-1,
-)
-
-model = dict(
- checkpoint=False,
- num_attention_heads=NUM_ATTENTION_HEAD,
- embed_split_hidden=True,
- vocab_size=VOCAB_SIZE,
- embed_grad_scale=1,
- parallel_output=True,
- hidden_size=HIDDEN_SIZE,
- num_layers=NUM_LAYER,
- mlp_ratio=MLP_RATIO,
- apply_post_layer_norm=False,
- dtype="torch.bfloat16",
- norm_type="rmsnorm",
- layer_norm_epsilon=1e-5,
-)
-"""
-zero1 parallel:
- 1. if zero1 <= 0, The size of the zero process group is equal to the size of the dp process group,
- so parameters will be divided within the range of dp.
- 2. if zero1 == 1, zero is not used, and all dp groups retain the full amount of model parameters.
- 3. zero1 > 1 and zero1 <= dp world size, the world size of zero is a subset of dp world size.
- For smaller models, it is usually a better choice to split the parameters within nodes with a setting <= 8.
-pipeline parallel: pipeline parallel size, only 1 is accepted currently.
-tensor parallel: tensor parallel size, usually the number of GPUs per node, only 1 is accepted currently.
-"""
-parallel = dict(
- zero1=8,
-)
-
-cudnn_deterministic = False
-cudnn_benchmark = False
diff --git a/ci_scripts/train/generate_config.py b/ci_scripts/train/generate_config.py
deleted file mode 100644
index 096334d..0000000
--- a/ci_scripts/train/generate_config.py
+++ /dev/null
@@ -1,49 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-import argparse
-import json
-import os
-
-from ci_scripts.common import com_func
-from internlm.core.context import Config
-
-
-def generate_new_config(config_py_file, test_config_json, case_name):
- # generate path of the new config py
- config_path = os.path.split(config_py_file)
- new_config_py_file = os.path.join(config_path[0], case_name + ".py")
-
- # merge dict
- origin_config = Config.from_file(config_py_file)
- with open(test_config_json) as f:
- test_config = json.load(f)
- if test_config:
- if case_name not in test_config.keys():
- raise KeyError(f"the {case_name} doesn't exist.Please check {test_config} again!")
- new_config = com_func.merge_dicts(origin_config, test_config[case_name])
- print(f"new config is:\n{new_config}")
-
- # write new config to py file
- file_content = com_func.format_dict_to_py_string(new_config)
- with open(new_config_py_file, "w") as f:
- f.write(file_content)
- print(f"The new test train config file is {new_config_py_file}")
-
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument(
- "--origin_config",
- type=str,
- default="./ci_scripts/train/ci_7B_sft.py",
- help="path to the origin train config file",
- )
- parser.add_argument(
- "--test_config",
- type=str,
- default="./ci_scripts/train/test_config.json",
- help="path to the test train config file",
- )
- parser.add_argument("--case_name", type=str, help="name of the case which will be runned ")
- args = parser.parse_args()
- generate_new_config(args.origin_config, args.test_config, args.case_name)
diff --git a/ci_scripts/train/load_ckpt.sh b/ci_scripts/train/load_ckpt.sh
deleted file mode 100644
index 06c6c1e..0000000
--- a/ci_scripts/train/load_ckpt.sh
+++ /dev/null
@@ -1,43 +0,0 @@
-#!/bin/bash
-set -x
-
-source ./ci_scripts/common/variables.sh
-[[ -n ${GITHUB_WORKSPACE} ]] || { echo "should set GITHUB_WORKSPACE first before ci, exit."; exit 1; }
-[[ -n ${CLEAN_PATH} ]] || { echo "should set CLEAN_PATH first before ci, exit."; exit 1; }
-
-readonly CKPTS_PATH="$GITHUB_WORKSPACE/llm_ckpts"
-readonly CKPTS40_PATH="$GITHUB_WORKSPACE/llm_ckpts/40"
-readonly CKPTS40_OUTPUT="${CKPTS40_PATH}/*.pt"
-expected_num=22
-exit_code=0
-
-source ./ci_scripts/common/basic_func.sh
-
-echo "start to test slurm training with loading checkpoint."
-
-python ./ci_scripts/train/generate_config.py --case_name $1
-file="./ci_scripts/train/$1.py"
-if [[ ! -f ${file} ]]; then
- echo "expect: ${file} exists, actual: not exist."
- exit_code=$(($exit_code + 1))
- fi
-
-srun -p ${SLURM_PARTITION} --exclusive --quotatype=spot --job-name=$2 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ${file}
-[[ $? -ne 0 ]] && { echo "test slurm training failed."; exit_code=$(($exit_code + 1)); }
-
-
-num=$(num_files "${CKPTS40_OUTPUT}")
-if [[ ${num} -ne ${expected_num} ]]; then
- echo "expect: ${expected_num} files, actual: ${num} files."
- exit_code=$(($exit_code + 1))
-fi
-
-# move the test files.
-if [[ -d ${CKPTS_PATH} ]]; then
- if ! rsync -av --remove-source-files ${CKPTS_PATH} ${CLEAN_PATH}; then
- echo "cleaning cached file in ${CKPTS_PATH} failed."
- exit_code=$(($exit_code + 1))
- fi
-fi
-
-exit $exit_code
diff --git a/ci_scripts/train/slurm_train.sh b/ci_scripts/train/slurm_train.sh
deleted file mode 100644
index 3871fc4..0000000
--- a/ci_scripts/train/slurm_train.sh
+++ /dev/null
@@ -1,34 +0,0 @@
-#!/bin/bash
-set -x
-
-source ./ci_scripts/common/variables.sh
-[[ -n ${GITHUB_WORKSPACE} ]] || { echo "should set GITHUB_WORKSPACE first before ci, exit."; exit 1; }
-[[ -n ${CLEAN_PATH} ]] || { echo "should set CLEAN_PATH first before ci, exit."; exit 1; }
-
-readonly CKPTS_PATH="$GITHUB_WORKSPACE/llm_ckpts"
-readonly CKPTS20_PATH="$GITHUB_WORKSPACE/llm_ckpts/20"
-readonly CKPTS20_OUTPUT="${CKPTS20_PATH}/*.pt"
-expected_num=22
-exit_code=0
-
-source ./ci_scripts/common/basic_func.sh
-
-echo "start to test slurm training."
-
-if [[ -d ${CKPTS20_PATH} ]]; then
- if ! rsync -av --remove-source-files ${CKPTS20_PATH} ${CLEAN_PATH}; then
- echo "cleaning cached file in ${CKPTS20_PATH} failed, exit."
- exit 1
- fi
-fi
-
-srun -p ${SLURM_PARTITION} --exclusive --quotatype=spot --job-name=$1 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./ci_scripts/train/ci_7B_sft.py
-[[ $? -ne 0 ]] && { echo "test slurm training failed."; exit_code=$(($exit_code + 1)); }
-
-num=$(num_files "${CKPTS20_OUTPUT}")
-if [[ ${num} -ne ${expected_num} ]]; then
- echo "expect: ${expected_num} files, actual: ${num} files."
- exit_code=$(($exit_code + 1))
-fi
-
-exit $exit_code
diff --git a/ci_scripts/train/test_config.json b/ci_scripts/train/test_config.json
deleted file mode 100644
index b5e3b24..0000000
--- a/ci_scripts/train/test_config.json
+++ /dev/null
@@ -1,45 +0,0 @@
-{
- "7B_basic_train": {
- "SEQ_LEN": 1024,
- "HIDDEN_SIZE": 2048,
- "NUM_ATTENTION_HEAD": 16,
- "NUM_LAYER": 16,
- "TRAIN_FOLDER":"local:../lm_data/alpaca_data/train/en",
- "ckpt": {
- "checkpoint_every": 20
- },
- "data": {
- "total_steps": 20
- }
- },
- "7B_load_new_ckpt": {
- "SEQ_LEN": 1024,
- "HIDDEN_SIZE": 2048,
- "NUM_ATTENTION_HEAD": 16,
- "NUM_LAYER": 16,
- "TRAIN_FOLDER":"local:../lm_data/alpaca_data/train/en",
- "LOAD_CKPT_FOLDER": "local:llm_ckpts/20",
- "ckpt": {
- "load_ckpt_folder": "local:llm_ckpts/20",
- "checkpoint_every": 20
- },
- "data": {
- "total_steps": 40
- }
- },
- "7B_load_preset_ckpt": {
- "SEQ_LEN": 1024,
- "HIDDEN_SIZE": 2048,
- "NUM_ATTENTION_HEAD": 16,
- "NUM_LAYER": 16,
- "TRAIN_FOLDER":"local:../lm_data/alpaca_data/train/en",
- "LOAD_CKPT_FOLDER": "local:../lm_data/alpaca_data/llm_ckpts/20",
- "ckpt": {
- "load_ckpt_folder": "local:../lm_data/alpaca_data/llm_ckpts/20",
- "checkpoint_every": 20
- },
- "data": {
- "total_steps": 40
- }
- }
-}
diff --git a/ci_scripts/train/torchrun.sh b/ci_scripts/train/torchrun.sh
deleted file mode 100644
index 29ed54f..0000000
--- a/ci_scripts/train/torchrun.sh
+++ /dev/null
@@ -1,40 +0,0 @@
-#!/bin/bash
-set -x
-
-source ./ci_scripts/common/variables.sh
-[[ -n ${GITHUB_WORKSPACE} ]] || { echo "should set GITHUB_WORKSPACE first before ci, exit."; exit 1; }
-[[ -n ${CLEAN_PATH} ]] || { echo "should set CLEAN_PATH first before ci, exit."; exit 1; }
-
-readonly CKPTS_PATH="$GITHUB_WORKSPACE/llm_ckpts"
-readonly CKPTS20_PATH="$GITHUB_WORKSPACE/llm_ckpts/20"
-readonly CKPTS_OUTPUT="${CKPTS20_PATH}/*.pt"
-expected_num=22
-exit_code=0
-
-source ./ci_scripts/common/basic_func.sh
-
-echo "start to test torch training."
-
-if [[ -d ${CKPTS20_PATH} ]]; then
- if ! rsync -av --remove-source-files ${CKPTS20_PATH} ${CLEAN_PATH}; then
- echo "cleaning cached file in ${CKPTS20_PATH} failed, exit."
- exit 1
- fi
-fi
-
-srun -p ${SLURM_PARTITION} --exclusive --quotatype=spot --job-name=$1 -N 1 torchrun --nnodes=1 --nproc_per_node=8 --master_port=29501 train.py --config ./ci_scripts/train/ci_7B_sft.py --launcher torch
-[[ $? -ne 0 ]] && { echo "test torch training failed."; exit_code=$(($exit_code + 1)); }
-
-num=$(num_files "${CKPTS_OUTPUT}")
-if [[ ${num} -ne ${expected_num} ]]; then
- echo "expect: ${expected_num} files, actual: ${num} files."
- exit_code=$(($exit_code + 1))
-fi
-
-# move the test files.
-if ! rsync -av --remove-source-files ${CKPTS_PATH}/* ${CLEAN_PATH}; then
- echo "cleaning cached file in ${CKPTS_PATH} failed."
- exit_code=$(($exit_code + 1))
-fi
-
-exit $exit_code
diff --git a/configs/7B_sft.py b/configs/7B_sft.py
deleted file mode 100644
index a430e8a..0000000
--- a/configs/7B_sft.py
+++ /dev/null
@@ -1,164 +0,0 @@
-JOB_NAME = "7b_train"
-DO_ALERT = False
-
-SEQ_LEN = 2048
-HIDDEN_SIZE = 4096
-NUM_ATTENTION_HEAD = 32
-MLP_RATIO = 8 / 3
-NUM_LAYER = 32
-VOCAB_SIZE = 103168
-
-MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
-# Ckpt folder format:
-# fs: 'local:/mnt/nfs/XXX'
-SAVE_CKPT_FOLDER = "local:llm_ckpts"
-LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
-
-# boto3 Ckpt folder format:
-# import os
-# BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint
-# SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm"
-# LOAD_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm/snapshot/1/"
-CHECKPOINT_EVERY = 50
-ckpt = dict(
- enable_save_ckpt=False, # enable ckpt save.
- save_ckpt_folder=SAVE_CKPT_FOLDER, # Path to save training ckpt.
- # load_ckpt_folder= dict(path=MODEL_ONLY_FOLDER, content=["model"], ckpt_type="normal"),
- load_ckpt_folder="local:llm_ckpts/",
- # 'load_ckpt_info' setting guide:
- # 1. the 'path' indicate ckpt path,
- # 2. the 'content‘ means what states will be loaded, support: "model", "sampler", "optimizer", "scheduler", "all"
- # 3. the ’ckpt_type‘ means the type of checkpoint to be loaded, now only 'normal' type is supported.
- load_ckpt_info=dict(path=MODEL_ONLY_FOLDER, content=("model",), ckpt_type="internlm"),
- checkpoint_every=CHECKPOINT_EVERY,
- async_upload=True, # async ckpt upload. (only work for boto3 ckpt)
- async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/", # path for temporarily files during asynchronous upload.
- oss_snapshot_freq=int(CHECKPOINT_EVERY / 2), # snapshot ckpt save frequency.
-)
-
-TRAIN_FOLDER = "/path/to/dataset"
-VALID_FOLDER = "/path/to/dataset"
-data = dict(
- seq_len=SEQ_LEN,
- # micro_num means the number of micro_batch contained in one gradient update
- micro_num=4,
- # packed_length = micro_bsz * SEQ_LEN
- micro_bsz=2,
- # defaults to the value of micro_num
- valid_micro_num=4,
- # defaults to 0, means disable evaluate
- valid_every=50,
- pack_sample_into_one=False,
- total_steps=50000,
- skip_batches="",
- rampup_batch_size="",
- # Datasets with less than 50 rows will be discarded
- min_length=50,
- # train_folder=TRAIN_FOLDER,
- # valid_folder=VALID_FOLDER,
- empty_cache_and_diag_interval=10,
- diag_outlier_ratio=1.1,
-)
-
-grad_scaler = dict(
- fp16=dict(
- # the initial loss scale, defaults to 2**16
- initial_scale=2**16,
- # the minimum loss scale, defaults to None
- min_scale=1,
- # the number of steps to increase loss scale when no overflow occurs
- growth_interval=1000,
- ),
- # the multiplication factor for increasing loss scale, defaults to 2
- growth_factor=2,
- # the multiplication factor for decreasing loss scale, defaults to 0.5
- backoff_factor=0.5,
- # the maximum loss scale, defaults to None
- max_scale=2**24,
- # the number of overflows before decreasing loss scale, defaults to 2
- hysteresis=2,
-)
-
-hybrid_zero_optimizer = dict(
- # Enable low_level_optimzer overlap_communication
- overlap_sync_grad=True,
- overlap_sync_param=True,
- # bucket size for nccl communication params
- reduce_bucket_size=512 * 1024 * 1024,
- # grad clipping
- clip_grad_norm=1.0,
-)
-
-loss = dict(
- label_smoothing=0,
-)
-
-adam = dict(
- lr=1e-4,
- adam_beta1=0.9,
- adam_beta2=0.95,
- adam_beta2_c=0,
- adam_eps=1e-8,
- weight_decay=0.01,
-)
-
-lr_scheduler = dict(
- total_steps=data["total_steps"],
- init_steps=0, # optimizer_warmup_step
- warmup_ratio=0.01,
- eta_min=1e-5,
- last_epoch=-1,
-)
-
-beta2_scheduler = dict(
- init_beta2=adam["adam_beta2"],
- c=adam["adam_beta2_c"],
- cur_iter=-1,
-)
-
-model = dict(
- checkpoint=False, # The proportion of layers for activation aheckpointing, the optional value are True/False/[0-1]
- num_attention_heads=NUM_ATTENTION_HEAD,
- embed_split_hidden=True,
- vocab_size=VOCAB_SIZE,
- embed_grad_scale=1,
- parallel_output=True,
- hidden_size=HIDDEN_SIZE,
- num_layers=NUM_LAYER,
- mlp_ratio=MLP_RATIO,
- apply_post_layer_norm=False,
- dtype="torch.bfloat16", # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
- norm_type="rmsnorm",
- layer_norm_epsilon=1e-5,
- use_flash_attn=True,
- num_chunks=1, # if num_chunks > 1, interleaved pipeline scheduler is used.
-)
-"""
-zero1 parallel:
- 1. if zero1 <= 0, The size of the zero process group is equal to the size of the dp process group,
- so parameters will be divided within the range of dp.
- 2. if zero1 == 1, zero is not used, and all dp groups retain the full amount of model parameters.
- 3. zero1 > 1 and zero1 <= dp world size, the world size of zero is a subset of dp world size.
- For smaller models, it is usually a better choice to split the parameters within nodes with a setting <= 8.
-pipeline parallel (dict):
- 1. size: int, the size of pipeline parallel.
- 2. interleaved_overlap: bool, enable/disable communication overlap when using interleaved pipeline scheduler.
-tensor parallel: tensor parallel size, usually the number of GPUs per node.
-"""
-parallel = dict(
- zero1=8,
- pipeline=dict(size=1, interleaved_overlap=True),
- sequence_parallel=False,
-)
-
-cudnn_deterministic = False
-cudnn_benchmark = False
-
-monitor = dict(
- # feishu alert configs
- alert=dict(
- enable_feishu_alert=DO_ALERT,
- feishu_alert_address=None, # feishu webhook to send alert message
- light_monitor_address=None, # light_monitor address to send heartbeat
- ),
-)
diff --git a/doc/code-docs/Makefile b/doc/code-docs/Makefile
deleted file mode 100644
index d0c3cbf..0000000
--- a/doc/code-docs/Makefile
+++ /dev/null
@@ -1,20 +0,0 @@
-# Minimal makefile for Sphinx documentation
-#
-
-# You can set these variables from the command line, and also
-# from the environment for the first two.
-SPHINXOPTS ?=
-SPHINXBUILD ?= sphinx-build
-SOURCEDIR = source
-BUILDDIR = build
-
-# Put it first so that "make" without argument is like "make help".
-help:
- @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
-
-.PHONY: help Makefile
-
-# Catch-all target: route all unknown targets to Sphinx using the new
-# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
-%: Makefile
- @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/doc/code-docs/locales/en/LC_MESSAGES/checkpoint.po b/doc/code-docs/locales/en/LC_MESSAGES/checkpoint.po
deleted file mode 100644
index 3ddcb09..0000000
--- a/doc/code-docs/locales/en/LC_MESSAGES/checkpoint.po
+++ /dev/null
@@ -1,105 +0,0 @@
-# SOME DESCRIPTIVE TITLE.
-# Copyright (C) 2023, InternLM Team
-# This file is distributed under the same license as the InternLM package.
-# FIRST AUTHOR , 2023.
-#
-#, fuzzy
-msgid ""
-msgstr ""
-"Project-Id-Version: InternLM \n"
-"Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2023-09-13 17:07+0800\n"
-"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
-"Last-Translator: FULL NAME \n"
-"Language: en\n"
-"Language-Team: en \n"
-"Plural-Forms: nplurals=2; plural=(n != 1);\n"
-"MIME-Version: 1.0\n"
-"Content-Type: text/plain; charset=utf-8\n"
-"Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.12.1\n"
-
-#: ../../source/checkpoint.rst:2
-msgid "模型保存"
-msgstr "Model Checkpointing"
-
-#: ../../source/checkpoint.rst:4
-msgid ""
-"InternLM 使用 ``internlm.utils.model_checkpoint.CheckpointManager`` "
-"来管理模型保存。其中,可以使用 ``CheckpointManager.try_save_checkpoint(train_state)`` "
-"来保存指定 step 的模型状态。"
-msgstr ""
-"InternLM uses ``internlm.utils.model_checkpoint.CheckpointManager`` to "
-"manage model checkpointing. In the implementation, we use "
-"``CheckpointManager.try_save_checkpoint(train_state)`` to checkpoint "
-"training states at specific steps. "
-
-#: ../../source/checkpoint.rst:6
-msgid "InternLM支持启动时自动加载最新的模型备份,并在接收信号退出训练时自动进行模型备份。"
-msgstr "InternLM supports automatic loading of latest ckpt at startup and automatic model checkpointing at signal quit. "
-
-#: ../../source/checkpoint.rst:9
-msgid "Checkpointing"
-msgstr ""
-
-#: internlm.utils.model_checkpoint.CheckpointManager:1 of
-msgid "StorageManagerContext"
-msgstr ""
-
-#: internlm.utils.model_checkpoint.CheckpointManager.quit_signal_handler:1 of
-msgid ""
-"Exit signal detection function, if we write the exit step in the "
-"'QUIT_FILE_PATH' file, all ranks will save ckpt and exit. Negative "
-"integer step means save ckpt. Positive integer step means save ckpt and "
-"quit."
-msgstr ""
-
-#: internlm.utils.model_checkpoint.CheckpointManager.quit_signal_handler of
-msgid "参数"
-msgstr ""
-
-#: internlm.utils.model_checkpoint.CheckpointManager.quit_signal_handler of
-msgid "返回"
-msgstr ""
-
-#: internlm.utils.model_checkpoint.CheckpointManager.quit_signal_handler:9 of
-msgid "whether to quit."
-msgstr ""
-
-#: internlm.utils.model_checkpoint.CheckpointManager.quit_signal_handler of
-msgid "返回类型"
-msgstr ""
-
-#: internlm.utils.model_checkpoint.CheckpointManager.wait_async_upload_finish:1
-#: of
-msgid "wait for all checkpoint uploads to be completed"
-msgstr ""
-
-#: internlm.utils.model_checkpoint.CheckpointManager.query_latest_snapshot_step_boto3:1
-#: of
-msgid ""
-"Returns: Tuple(str, int): path of latest ckpt and ckpt step, if not "
-"found, None will return."
-msgstr ""
-
-#: internlm.utils.model_checkpoint.CheckpointManager.save_checkpoint:1 of
-msgid "Save checkpoint to the given folder path."
-msgstr ""
-
-#~ msgid "Attempt to restore the training state of the last ckpt."
-#~ msgstr ""
-
-#~ msgid "lr_scheduler object."
-#~ msgstr ""
-
-#~ msgid "optimizer object."
-#~ msgstr ""
-
-#~ msgid "learning rate."
-#~ msgstr ""
-
-#~ msgid "traing states."
-#~ msgstr ""
-
-#~ msgid "traning dataloader object"
-#~ msgstr ""
diff --git a/doc/code-docs/locales/en/LC_MESSAGES/example/30B_demo.po b/doc/code-docs/locales/en/LC_MESSAGES/example/30B_demo.po
deleted file mode 100644
index 6ac0e3b..0000000
--- a/doc/code-docs/locales/en/LC_MESSAGES/example/30B_demo.po
+++ /dev/null
@@ -1,49 +0,0 @@
-# SOME DESCRIPTIVE TITLE.
-# Copyright (C) 2023, InternLM Team
-# This file is distributed under the same license as the InternLM package.
-# FIRST AUTHOR , 2023.
-#
-#, fuzzy
-msgid ""
-msgstr ""
-"Project-Id-Version: InternLM \n"
-"Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2023-09-07 10:56+0800\n"
-"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
-"Last-Translator: FULL NAME \n"
-"Language: en\n"
-"Language-Team: en \n"
-"Plural-Forms: nplurals=2; plural=(n != 1);\n"
-"MIME-Version: 1.0\n"
-"Content-Type: text/plain; charset=utf-8\n"
-"Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.12.1\n"
-
-#: ../../source/example/30B_demo.rst:2 242d1f89ae2045f1bf1f31bf82f07846
-msgid "30B Demo"
-msgstr ""
-
-#: ../../source/example/30B_demo.rst:5 c2415bfa6978414a939dcc395fdfb544
-msgid "训练配置"
-msgstr "Training Config"
-
-#: ../../source/example/30B_demo.rst:7 75f568d1ca5546228f88958c12c2dd65
-msgid "30B demo 训练配置文件样例如下:"
-msgstr "30B demo config file example:"
-
-#: ../../source/example/30B_demo.rst:164 533cb04f94314eeb8381e45f06d03108
-msgid "启动训练"
-msgstr "Start Training"
-
-#: ../../source/example/30B_demo.rst:166 24974384d5ab42e68266aeb67ae222ce
-msgid "完成以上训练配置后,可启动模型训练,以在 ``slurm`` 平台上为例,启动两节点 16GPU 的训练命令如下所示:"
-msgstr "After completing the data preparation and relevant training configurations, you can start the demo training. "
-"The following example shows how to start distributed training in ``slurm`` environments with 16 GPUs."
-
-#: ../../source/example/30B_demo.rst:173 948ac71ed53848f9bad07f69d956c4bb
-msgid "训练结果"
-msgstr "Training Results"
-
-#: ../../source/example/30B_demo.rst:175 615a3481b0aa49729b7219b1365519aa
-msgid "基于以上训练配置和启动命令,两节点 16GPU 下的模型训练部分日志展示如下:"
-msgstr "Taking the configuration of the demo training on two nodes with 16 GPUs on slurm as an example, the training result log is shown below:"
diff --git a/doc/code-docs/locales/en/LC_MESSAGES/example/7B_demo.po b/doc/code-docs/locales/en/LC_MESSAGES/example/7B_demo.po
deleted file mode 100644
index 5e99429..0000000
--- a/doc/code-docs/locales/en/LC_MESSAGES/example/7B_demo.po
+++ /dev/null
@@ -1,49 +0,0 @@
-# SOME DESCRIPTIVE TITLE.
-# Copyright (C) 2023, InternLM Team
-# This file is distributed under the same license as the InternLM package.
-# FIRST AUTHOR , 2023.
-#
-#, fuzzy
-msgid ""
-msgstr ""
-"Project-Id-Version: InternLM \n"
-"Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2023-09-07 10:56+0800\n"
-"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
-"Last-Translator: FULL NAME \n"
-"Language: en\n"
-"Language-Team: en \n"
-"Plural-Forms: nplurals=2; plural=(n != 1);\n"
-"MIME-Version: 1.0\n"
-"Content-Type: text/plain; charset=utf-8\n"
-"Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.12.1\n"
-
-#: ../../source/example/7B_demo.rst:2 8576f969040249bb93e7c347ef210990
-msgid "7B Demo"
-msgstr ""
-
-#: ../../source/example/7B_demo.rst:5 5429ceea12424825991744bece744f60
-msgid "训练配置"
-msgstr "Training Config"
-
-#: ../../source/example/7B_demo.rst:7 c9a47faf5deb40b68ad2bc950fdf2b14
-msgid "7B demo 的训练配置文件样例如下:"
-msgstr "7B demo config file example:"
-
-#: ../../source/example/7B_demo.rst:162 eb93a6ca05c8421eb87a2470f9f31fc2
-msgid "启动训练"
-msgstr "Start Training"
-
-#: ../../source/example/7B_demo.rst:164 9e7a864ae2e14d05b0681f16792e5278
-msgid "完成以上训练配置后,可启动模型训练,以在 ``slurm`` 平台上为例,启动单节点 8GPU 的训练命令如下所示:"
-msgstr "After completing the data preparation and relevant training configurations, you can start the demo training. "
-"The following example shows how to start distributed training in ``slurm`` environments with 8 GPUs."
-
-#: ../../source/example/7B_demo.rst:171 fdd053efb1854d46aabf6c0f279fe7fc
-msgid "训练结果"
-msgstr "Training Results"
-
-#: ../../source/example/7B_demo.rst:173 33ec81f34e3c4340beacdb5254069d08
-msgid "基于以上训练配置和启动命令,单节点 8GPU 下的模型训练部分日志展示如下:"
-msgstr "Taking the configuration of the demo training on a single machine with 8 GPUs on slurm as an example, the training result log is shown below:"
diff --git a/doc/code-docs/locales/en/LC_MESSAGES/example/index.po b/doc/code-docs/locales/en/LC_MESSAGES/example/index.po
deleted file mode 100644
index 752345e..0000000
--- a/doc/code-docs/locales/en/LC_MESSAGES/example/index.po
+++ /dev/null
@@ -1,32 +0,0 @@
-# SOME DESCRIPTIVE TITLE.
-# Copyright (C) 2023, InternLM Team
-# This file is distributed under the same license as the InternLM package.
-# FIRST AUTHOR , 2023.
-#
-#, fuzzy
-msgid ""
-msgstr ""
-"Project-Id-Version: InternLM \n"
-"Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2023-09-07 10:56+0800\n"
-"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
-"Last-Translator: FULL NAME \n"
-"Language: en\n"
-"Language-Team: en \n"
-"Plural-Forms: nplurals=2; plural=(n != 1);\n"
-"MIME-Version: 1.0\n"
-"Content-Type: text/plain; charset=utf-8\n"
-"Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.12.1\n"
-
-#: ../../source/example/index.rst:2 de54695e8bde40ffb8878043072197e6
-msgid "训练样例"
-msgstr "Training Example"
-
-#: ../../source/example/index.rst:5 da388b3209ff4bd39fd0700a7fba413a
-msgid "7B Demo"
-msgstr ""
-
-#: ../../source/example/index.rst:13 b095e27dfc924a7a943b7cba5361700a
-msgid "30B Demo"
-msgstr ""
diff --git a/doc/code-docs/locales/en/LC_MESSAGES/index.po b/doc/code-docs/locales/en/LC_MESSAGES/index.po
deleted file mode 100644
index 69a862b..0000000
--- a/doc/code-docs/locales/en/LC_MESSAGES/index.po
+++ /dev/null
@@ -1,80 +0,0 @@
-# SOME DESCRIPTIVE TITLE.
-# Copyright (C) 2023, InternLM Team
-# This file is distributed under the same license as the InternLM package.
-# FIRST AUTHOR , 2023.
-#
-#, fuzzy
-msgid ""
-msgstr ""
-"Project-Id-Version: InternLM \n"
-"Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2023-09-07 10:56+0800\n"
-"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
-"Last-Translator: FULL NAME \n"
-"Language: en\n"
-"Language-Team: en \n"
-"Plural-Forms: nplurals=2; plural=(n != 1);\n"
-"MIME-Version: 1.0\n"
-"Content-Type: text/plain; charset=utf-8\n"
-"Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.12.1\n"
-
-#: ../../source/index.rst:8 11e029810acf410180311a3c63eb01f4
-msgid "InternLM"
-msgstr "InternLM"
-
-#: ../../source/index.rst:11 e6fd7d058e4b43bb81157ac79867e3d3
-msgid "环境构建"
-msgstr "Environment Setup"
-
-#: ../../source/index.rst:19 f323ede90c0f434d8b627eded1d8fc10
-msgid "快速上手"
-msgstr "Quickstart Guide"
-
-#: ../../source/index.rst:27 3c504b4b92264e9182abb0fa81fe80c3
-msgid "训练构建"
-msgstr "Model Setup"
-
-#: ../../source/index.rst:35 5cc5c831399a40b089d27b777a776b16
-msgid "训练 API"
-msgstr "Training API"
-
-#: ../../source/index.rst:43 21a7473eabb441f8bfe28d2a0e306889
-msgid "并行训练"
-msgstr "Parallel Training"
-
-#: ../../source/index.rst:51 9234725f3c464731993d73607608c874
-msgid "模型备份"
-msgstr "Model Checkpointing"
-
-#: ../../source/index.rst:59 8e4ce037017f4510b2892a66003877fa
-msgid "性能分析"
-msgstr "Profiler"
-
-#: ../../source/index.rst:67 a36e02819ecd4b448a8cb4ebbecb6600
-msgid "训练监控"
-msgstr "Monitor"
-
-#: ../../source/index.rst:75 b912e292486f455c8b5cdd75962e8ac2
-msgid "训练样例"
-msgstr "Example"
-
-#: ../../source/index.rst:83 ea9e9281720941a1830e5df7a2badf7a
-msgid "常见问题"
-msgstr "Q&A"
-
-#: ../../source/index.rst:91 e08edc5aa1c74965b10084b393b88fae
-msgid "索引和表格"
-msgstr "Indices and tables"
-
-#: ../../source/index.rst:93 f3fdca059caa49dcad09aa44be7f02d6
-msgid ":ref:`genindex`"
-msgstr ""
-
-#: ../../source/index.rst:94 b3791e811315435097bb507edc3f4b9b
-msgid ":ref:`modindex`"
-msgstr ""
-
-#: ../../source/index.rst:95 a164b772960f4ab8b18c7e8820f69f55
-msgid ":ref:`search`"
-msgstr ""
diff --git a/doc/code-docs/locales/en/LC_MESSAGES/initialize.po b/doc/code-docs/locales/en/LC_MESSAGES/initialize.po
deleted file mode 100644
index 632b470..0000000
--- a/doc/code-docs/locales/en/LC_MESSAGES/initialize.po
+++ /dev/null
@@ -1,247 +0,0 @@
-# SOME DESCRIPTIVE TITLE.
-# Copyright (C) 2023, InternLM Team
-# This file is distributed under the same license as the InternLM package.
-# FIRST AUTHOR , 2023.
-#
-#, fuzzy
-msgid ""
-msgstr ""
-"Project-Id-Version: InternLM \n"
-"Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2023-09-14 12:23+0800\n"
-"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
-"Last-Translator: FULL NAME \n"
-"Language: zh_CN\n"
-"Language-Team: zh_CN \n"
-"Plural-Forms: nplurals=1; plural=0;\n"
-"MIME-Version: 1.0\n"
-"Content-Type: text/plain; charset=utf-8\n"
-"Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.12.1\n"
-
-#: ../../source/initialize.rst:2
-msgid "训练构建"
-msgstr "Training Setup"
-
-#: ../../source/initialize.rst:4
-msgid "InternLM 的训练流程可以归纳为两个步骤:"
-msgstr "The training process of InternLM can be summarized into two steps: "
-
-#: ../../source/initialize.rst:6
-msgid "初始化"
-msgstr "Initialization"
-
-#: ../../source/initialize.rst:8
-msgid "初始化模型、优化器、数据加载器、Trainer,生成不同种类的进程组,为混合并行的迭代训练做准备。"
-msgstr ""
-"Initialize model, optimizer, dataloader, trainer, and create different "
-"types of process groups to prepare for iterative steps of hybrid parallel training. "
-
-#: ../../source/initialize.rst:9
-msgid "初始化Logger、Checkpoint管理器、Monitor管理器、Profiler,对迭代训练的过程观察、预警、记录。"
-msgstr ""
-"Initialize logger, checkpoint manager, monitor manager, and profiler to "
-"watch, alert, and record the iterative training steps. "
-
-#: ../../source/initialize.rst:11
-msgid "迭代训练"
-msgstr "Iterative training steps"
-
-#: ../../source/initialize.rst:13
-msgid "根据配置文件定义的张量并行、流水线并行、数据并行的大小,加载训练引擎和调度器进行混合并行训练。"
-msgstr ""
-"Load the training engine and scheduler for hybrid parallel training "
-"according to the configuration such as tensor parallel size, pipeline "
-"parallel size, and data parallel size. "
-
-#: ../../source/initialize.rst:14
-msgid "在迭代训练中,调用 Trainer API 进行梯度置零,前向传播计算损失并反向传播,参数更新。"
-msgstr ""
-"In iterative training steps, the Trainer API is called to perform zero "
-"gradients, forward-loss-backward, and parameter update."
-
-#: ../../source/initialize.rst:20
-msgid "InternLM训练流程图"
-msgstr "InternLM training process"
-
-#: ../../source/initialize.rst:25
-msgid "命令行参数解析"
-msgstr "Argument Parsing"
-
-#: ../../source/initialize.rst:27
-msgid ""
-"InternLM 使用 `argparse `_"
-" 库来向InternLM运行时提供命令行参数配置。"
-msgstr ""
-"InternLM uses the `argparse "
-"`_ library to supply "
-"commandline configuration to the InternLM runtime. "
-
-#: ../../source/initialize.rst:29
-msgid ""
-"用户可使用 ``internlm.initialize.get_default_parser()`` 来获取 InternLM "
-"的默认解析器,其中包含一些内置参数,用户可以向此解析器添加自定义参数。"
-msgstr ""
-"Use ``internlm.initialize.get_default_parser()`` to get InternLM's "
-"default parser with some builtin arguments, users can add custom "
-"parameters to this parser."
-
-#: internlm.initialize.launch.get_default_parser:1 of
-msgid ""
-"Reads user command line and uses an argument parser to parse the input "
-"arguments. Input arguments include configuration, host, port, world size,"
-" local rank, backend for torch.distributed."
-msgstr ""
-
-#: internlm.initialize.initialize_trainer.initialize_trainer
-#: internlm.initialize.launch.get_default_parser
-#: internlm.train.training_internlm.get_train_data_loader
-#: internlm.train.training_internlm.initialize_model
-#: internlm.train.training_internlm.initialize_optimizer of
-msgid "返回"
-msgstr ""
-
-#: internlm.initialize.launch.get_default_parser:4 of
-msgid ""
-"Returns the parser with the default arguments, the user may add "
-"customized arguments into this parser."
-msgstr ""
-
-#: internlm.initialize.initialize_trainer.initialize_trainer
-#: internlm.initialize.launch.get_default_parser
-#: internlm.train.training_internlm.initialize_model of
-msgid "返回类型"
-msgstr ""
-
-#: ../../source/initialize.rst:45
-msgid "模型初始化"
-msgstr "Model Initialization"
-
-#: internlm.train.training_internlm.initialize_model:1 of
-msgid "Initialize model with Automatic Mixed Precision."
-msgstr ""
-
-#: internlm.train.training_internlm.initialize_model:3 of
-msgid "The neural network model to be trained or evaluated."
-msgstr ""
-
-#: ../../source/initialize.rst:49
-msgid "InternLM 在配置文件中使用字段 ``model_type`` 和 ``model`` 来控制模型初始化过程。示例模型初始化配置定义如下:"
-msgstr ""
-"InternLM uses the field ``model_type`` and ``model`` in the config file "
-"to control model initialization process. An example model initialization "
-"configuratio"
-
-#: ../../source/initialize.rst:77
-msgid "字段 ``model_type`` 指明了要初始化的模型类型"
-msgstr ""
-"The field ``model_type`` specifics the model type has been registered and"
-" to be initialized."
-
-#: ../../source/initialize.rst:78
-msgid "字段 ``model`` 中的参数指定了在模型初始化过程中的参数设置"
-msgstr ""
-"The parameters in field ``model`` specific the configuration settings "
-"during model initialization."
-
-#: ../../source/initialize.rst:80
-msgid ""
-"值得注意的是,用户可以定义新的模型类型,并使用装饰器 ``@MODEL_INITIALIZER.register_module`` "
-"注册模型的初始化函数,其中 ``MODEL_INITIALIZER`` 是类 "
-"``internlm.util.registry.Registry`` 的一个实例化对象,示例如下所示:"
-msgstr ""
-"It is worth noting that, users can define new model type, and register "
-"model's initialization function by decorater "
-"``@MODEL_INITIALIZER.register_module``, which ``MODEL_INITIALIZER`` is an"
-" instantiated object of class ``internlm.util.registry.Registry``, the "
-"example is shown as follows."
-
-#: ../../source/initialize.rst:92
-msgid "优化器初始化"
-msgstr "Optimizer Initialization"
-
-#: internlm.train.training_internlm.initialize_optimizer:1 of
-msgid "Initialize optimizer."
-msgstr ""
-
-#: internlm.initialize.initialize_trainer.initialize_trainer
-#: internlm.train.training_internlm.get_train_data_loader
-#: internlm.train.training_internlm.initialize_optimizer of
-msgid "参数"
-msgstr ""
-
-#: internlm.train.training_internlm.initialize_optimizer:3 of
-msgid "Your model instance to be trained or evaluated."
-msgstr ""
-
-#: internlm.train.training_internlm.initialize_optimizer:6 of
-msgid "A tuple of (optimizer, beta2_scheduler, lr_scheduler)."
-msgstr ""
-
-#: ../../source/initialize.rst:99
-msgid "数据加载器初始化"
-msgstr "Dataloader Initialization"
-
-#: internlm.train.training_internlm.get_train_data_loader:1 of
-msgid "Generate and return the training data loader."
-msgstr ""
-
-#: internlm.train.training_internlm.get_train_data_loader:3 of
-msgid "number of subprocesses used for dataloader."
-msgstr ""
-
-#: internlm.train.training_internlm.get_train_data_loader:5 of
-msgid "generate function for dataset."
-msgstr ""
-
-#: internlm.train.training_internlm.get_train_data_loader:7 of
-msgid "dataset sampler for training dataloader."
-msgstr ""
-
-#: internlm.train.training_internlm.get_train_data_loader:9 of
-msgid "collate function for training dataloader."
-msgstr ""
-
-#: internlm.train.training_internlm.get_train_data_loader:12 of
-msgid "A tuple of (train_dl, dataset_types)."
-msgstr ""
-
-#: ../../source/initialize.rst:106
-msgid "Trainer 初始化"
-msgstr "Trainer Initialization"
-
-#: internlm.initialize.initialize_trainer.initialize_trainer:1 of
-msgid ""
-"Core function to wrap the essential training components with our "
-"functionality based on the config which is loaded into gpc.config."
-msgstr ""
-
-#: internlm.initialize.initialize_trainer.initialize_trainer:4 of
-msgid "Your model instance or a function to build the model."
-msgstr ""
-
-#: internlm.initialize.initialize_trainer.initialize_trainer:6 of
-msgid "Your optimizer for training."
-msgstr ""
-
-#: internlm.initialize.initialize_trainer.initialize_trainer:8 of
-msgid "Your criterion instance."
-msgstr ""
-
-#: internlm.initialize.initialize_trainer.initialize_trainer:10 of
-msgid "Dataloader for training."
-msgstr ""
-
-#: internlm.initialize.initialize_trainer.initialize_trainer:12 of
-msgid "Dataloader for testing."
-msgstr ""
-
-#: internlm.initialize.initialize_trainer.initialize_trainer:14 of
-msgid "Your lr scheduler instance, optional."
-msgstr ""
-
-#: internlm.initialize.initialize_trainer.initialize_trainer:17 of
-msgid ""
-"A tuple of ``(trainer, train_dataloader, test_dataloader, lr_scheduler)``"
-" where only ``trainer`` could not be None."
-msgstr ""
diff --git a/doc/code-docs/locales/en/LC_MESSAGES/install.po b/doc/code-docs/locales/en/LC_MESSAGES/install.po
deleted file mode 100644
index 7abeb3a..0000000
--- a/doc/code-docs/locales/en/LC_MESSAGES/install.po
+++ /dev/null
@@ -1,139 +0,0 @@
-# SOME DESCRIPTIVE TITLE.
-# Copyright (C) 2023, InternLM Team
-# This file is distributed under the same license as the InternLM package.
-# FIRST AUTHOR , 2023.
-#
-#, fuzzy
-msgid ""
-msgstr ""
-"Project-Id-Version: InternLM \n"
-"Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2023-09-07 10:56+0800\n"
-"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
-"Last-Translator: FULL NAME \n"
-"Language: en\n"
-"Language-Team: en \n"
-"Plural-Forms: nplurals=2; plural=(n != 1);\n"
-"MIME-Version: 1.0\n"
-"Content-Type: text/plain; charset=utf-8\n"
-"Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.12.1\n"
-
-#: ../../../install.md:2 ../../../install.md:28
-#: c237a7328df9440eb54f36c5e6ceef46 e55787faf3f74d5996f251b28422cf15
-msgid "环境安装"
-msgstr "Installation"
-
-#: ../../../install.md:4 d5cd61481eb04f55a9b1636e47e2bc49
-msgid "环境准备"
-msgstr "Environment Preparation"
-
-#: ../../../install.md:5 418763cd4acb4ff3afba059ae7066739
-msgid "首先,需要安装的依赖包及对应版本列表如下:"
-msgstr "The required packages and corresponding version are shown as follows:"
-
-#: ../../../install.md:6 dcb95218036f4452a92a5a9c2fdbe337
-msgid "Python == 3.10"
-msgstr ""
-
-#: ../../../install.md:7 79e3d9ff5df7455fa596ba63ce3089b7
-msgid "GCC == 10.2.0"
-msgstr ""
-
-#: ../../../install.md:8 d14840f7b64d4a32a0be5762027e9c32
-msgid "MPFR == 4.1.0"
-msgstr ""
-
-#: ../../../install.md:9 851e3e5c874a4d0f8fd37a4f85ec8f2f
-msgid "CUDA >= 11.7"
-msgstr ""
-
-#: ../../../install.md:10 dbf2012c72e1479ba6647baa047ecc04
-msgid "Pytorch >= 1.13.1"
-msgstr ""
-
-#: ../../../install.md:11 b191e289a079455ea906694a75439b3e
-msgid "Transformers >= 4.28.0"
-msgstr ""
-
-#: ../../../install.md:12 17accf19fe184e3cb704274d8a66e87e
-msgid "Flash-Attention >= v1.0.5"
-msgstr ""
-
-#: ../../../install.md:13 8063cdce4bb94947a07dbaedd97e1013
-msgid "Apex == 23.05"
-msgstr ""
-
-#: ../../../install.md:14 7d6d2682ed214d0cba0048903c128bce
-msgid "Ampere或者Hopper架构的GPU (例如H100, A100)"
-msgstr "GPU with Ampere or Hopper architecture (such as H100, A100)"
-
-#: ../../../install.md:15 91039fb42b94421586c558a2afcbed71
-msgid "Linux OS"
-msgstr ""
-
-#: ../../../install.md:17 694b95a146d54878a4a5d57e0c1e8c6c
-msgid "以上依赖包安装完成后,需要更新配置系统环境变量:"
-msgstr "After installing the above dependencies, some system environment variables need to be updated:"
-
-#: ../../../install.md:29 d0ebf84438dc43708ea517c7eff92e79
-msgid "将项目`internlm`及其依赖子模块,从 github 仓库中 clone 下来,命令如下:"
-msgstr "Clone the project `internlm` and its dependent submodules from the github repository, as follows:"
-
-#: ../../../install.md:34 c278177fc1974f3fac9b33688d0591fd
-msgid "推荐使用 conda 构建一个 Python-3.10 的虚拟环境, 并基于`requirements/`文件安装项目所需的依赖包:"
-msgstr "It is recommended to build a Python-3.10 virtual environment using conda and install the required dependencies based on the `requirements/` files:"
-
-#: ../../../install.md:43 6a152c8e332f47b0ba35a9bcec2ed32d
-msgid "安装 flash-attention (version v1.0.5):"
-msgstr "Install flash-attention (version v1.0.5):"
-
-#: ../../../install.md:55 d7b2116e6ca745ceb48a792fae371283
-msgid "安装 Apex (version 23.05):"
-msgstr "Install Apex (version 23.05):"
-
-#: ../../../install.md:62 8bcbfb9f74de4a2796212a339feb8283
-msgid "环境镜像"
-msgstr "Environment Image"
-
-#: ../../../install.md:63 6cbb97568d704cf19e7dabab20ce1d5b
-msgid ""
-"用户可以使用提供的 dockerfile 结合 docker.Makefile 来构建自己的镜像,或者也可以从 "
-"https://hub.docker.com/r/internlm/internlm 获取安装了 InternLM 运行环境的镜像。"
-msgstr "Users can use the provided dockerfile combined with docker.Makefile to build their own images, or obtain images with InternLM runtime environment installed from https://hub.docker.com/r/internlm/internlm."
-
-#: ../../../install.md:65 9c29ae2ac9984a8094daf52751f5c7b9
-msgid "镜像配置及构造"
-msgstr "Image Configuration and Build"
-
-#: ../../../install.md:66 12bd6b0729464cb5af663a384dadd0ec
-msgid ""
-"dockerfile 的配置以及构造均通过 docker.Makefile 文件实现,在 InternLM 根目录下执行如下命令即可 build "
-"镜像:"
-msgstr "The configuration and build of the Dockerfile are implemented through the docker.Makefile. To build the image, execute the following command in the root directory of InternLM:"
-
-#: ../../../install.md:70 b5f42dbca3e340c4bb80de1f502e0700
-msgid ""
-"在 docker.Makefile 中可自定义基础镜像,环境版本等内容,对应参数可直接通过命令行传递。对于 BASE_OS 分别支持 "
-"ubuntu20.04 和 centos7。"
-msgstr "In docker.Makefile, you can customize the basic image, environment version, etc., and the corresponding parameters can be passed directly through the command line. For BASE_OS, ubuntu20.04 and centos7 are respectively supported."
-
-#: ../../../install.md:72 4abb47ce9cf64b3c9b8dc23ace37a826
-msgid "镜像拉取"
-msgstr "Pull Standard Image"
-
-#: ../../../install.md:73 1b6e61b2e0cb4da98f5d70d67ac638f9
-msgid "基于 ubuntu 和 centos 的标准镜像已经 build 完成也可直接拉取使用:"
-msgstr "The standard image based on ubuntu and centos has been built and can be directly pulled:"
-
-#: ../../../install.md:82 2bd75cc4b74848c19775e2b1c83726c1
-msgid "容器启动"
-msgstr "Run Container"
-
-#: ../../../install.md:83 4bb2dd4bba904255a204776a50721159
-msgid "对于使用 dockerfile 构建或拉取的本地标准镜像,使用如下命令启动并进入容器:"
-msgstr "For the local standard image built with dockerfile or pulled, use the following command to run and enter the container:"
-
-#: ../../../install.md:87 66613606256e4094a6be5ab2af1269ae
-msgid "容器内默认目录即 `/InternLM`,根据[使用文档](./usage.md)即可启动训练。"
-msgstr "The default directory in the container is `/InternLM`, please start training according to the [Usage](./usage.md)."
diff --git a/doc/code-docs/locales/en/LC_MESSAGES/monitor.po b/doc/code-docs/locales/en/LC_MESSAGES/monitor.po
deleted file mode 100644
index c0ec5f5..0000000
--- a/doc/code-docs/locales/en/LC_MESSAGES/monitor.po
+++ /dev/null
@@ -1,197 +0,0 @@
-# SOME DESCRIPTIVE TITLE.
-# Copyright (C) 2023, InternLM Team
-# This file is distributed under the same license as the InternLM package.
-# FIRST AUTHOR , 2023.
-#
-#, fuzzy
-msgid ""
-msgstr ""
-"Project-Id-Version: InternLM \n"
-"Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2023-09-07 10:56+0800\n"
-"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
-"Last-Translator: FULL NAME \n"
-"Language: en\n"
-"Language-Team: en \n"
-"Plural-Forms: nplurals=2; plural=(n != 1);\n"
-"MIME-Version: 1.0\n"
-"Content-Type: text/plain; charset=utf-8\n"
-"Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.12.1\n"
-
-#: ../../source/monitor.rst:2 f95ef3bff8574c77a28ca2f6212cc4b8
-msgid "监控和告警"
-msgstr "Monitor and Alert"
-
-#: ../../source/monitor.rst:5 959bd4a6061f4483875c7950ab4546cf
-msgid "监控"
-msgstr "Monitoring"
-
-#: ../../source/monitor.rst:7 6071bc878d894865b73380cb887847c1
-msgid ""
-"InternLM 使用 ``internlm.monitor.monitor.initialize_monitor_manager()`` "
-"来初始化上下文监控管理。其中,一个实例化的单例对象 ``internlm.monitor.monitor.MonitorManager`` "
-"将管理监控线程并使用 ``internlm.monitor.monitor.MonitorTracker`` 来跟踪模型训练生命周期和训练状态。"
-msgstr ""
-"InternLM uses ``internlm.monitor.monitor.initialize_monitor_manager()`` to initialize context monitor. During this time, "
-"a singleton ``internlm.monitor.monitor.MonitorManager`` will manage monitoring thread and track training status "
-"with ``internlm.monitor.monitor.MonitorTracker``."
-
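For illustration, a minimal usage sketch, assuming the manager is used as a context manager and that the parameter names follow the argument descriptions below; both are assumptions:

```python
# Hypothetical sketch of wrapping a training entry point with the monitor manager.
# Parameter names (job_name, alert_address) are assumed from the parameter docs below.
from internlm.monitor.monitor import initialize_monitor_manager

with initialize_monitor_manager(
    job_name="demo_7b_training",
    alert_address="https://open.feishu.cn/open-apis/bot/v2/hook/<your-webhook>",  # placeholder
):
    main()  # your training entry point
```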
-#: 9256a063b6dd449786f29e03ce085176
-#: internlm.monitor.monitor.initialize_monitor_manager:1 of
-msgid ""
-"Initialize monitor manager for monitoring training lifetime and alerting "
-"exception info to Feishu."
-msgstr ""
-
-#: 138340fca72a4226be901f7f16c8a590 904b7938fdea46bf81c1ef738aa7bfae
-#: 9ed2a7b4af2243b289e72b2751aec902 aa0dd0dc6bee4a5bb15cc9705f7c13ee
-#: internlm.monitor.alert.send_feishu_msg_with_webhook
-#: internlm.monitor.monitor.MonitorManager.start_monitor
-#: internlm.monitor.monitor.MonitorTracker
-#: internlm.monitor.monitor.initialize_monitor_manager of
-msgid "参数"
-msgstr ""
-
-#: 3b302339e1d143b6b1d782ff59c9396d 6a06f053828b4c80aef56970750e2085
-#: internlm.monitor.monitor.MonitorManager.start_monitor:3
-#: internlm.monitor.monitor.initialize_monitor_manager:3 of
-msgid "The training job name."
-msgstr ""
-
-#: 3330d06145ee4d35b0b3632e799a35b3 c105473f2f6a4f838a9f0d098762d698
-#: internlm.monitor.monitor.MonitorManager.start_monitor:5
-#: internlm.monitor.monitor.initialize_monitor_manager:5 of
-msgid "The Feishu webhook address for sending alert messages."
-msgstr ""
-
-#: 774c6ff82a2e452295a1a7dcabaded3d internlm.monitor.monitor.MonitorManager:1
-#: of
-msgid ""
-"Monitor Manager for managing monitor thread and monitoring training "
-"status."
-msgstr ""
-
-#: 72e696c0ce8f41ea8c7947d35cf322f0
-#: internlm.monitor.monitor.MonitorManager.monitor_loss_spike:1 of
-msgid "Check loss value, if loss spike occurs, send alert message to Feishu."
-msgstr ""
-
-#: 2b668b057fa84e8b92c65bfd49bfb3e9
-#: internlm.monitor.monitor.MonitorManager.monitor_exception:1 of
-msgid "Catch and format exception information, send alert message to Feishu."
-msgstr ""
-
-#: 9852b7143026476d89e1a175223e6d79
-#: internlm.monitor.monitor.MonitorManager.handle_sigterm:1 of
-msgid "Catch SIGTERM signal, and send alert message to Feishu."
-msgstr ""
-
-#: 2e3827bad7b1445fb0d9a7c5a28def5d
-#: internlm.monitor.monitor.MonitorManager.start_monitor:1 of
-msgid ""
-"Initialize and start monitor thread for checking training job status, "
-"loss spike and so on."
-msgstr ""
-
-#: 271cc3e1b0834a7ba6a1ba4d5cce0ef1
-#: internlm.monitor.monitor.MonitorManager.start_monitor:7 of
-msgid "The time of monitor interval in seconds, defaults to 300."
-msgstr ""
-
-#: e4a06091fce8401b83e31ce26c8075a0
-#: internlm.monitor.monitor.MonitorManager.start_monitor:9 of
-msgid ""
-"The limit multiple of current loss to previous loss value, which means "
-"loss spike may be occurs, defaults to 1.5."
-msgstr ""
-
-#: 28bde748477e41f39fa6ca3e1855923d
-#: internlm.monitor.monitor.MonitorManager.stop_monitor:1 of
-msgid "Stop the monitor and alert thread."
-msgstr ""
-
-#: ffb3dda227664748bdb326b6630bc827 internlm.monitor.monitor.MonitorTracker:1
-#: of
-msgid "Track job status and alert to Feishu during job training."
-msgstr ""
-
-#: a1e93683cbb04d8ab825e2776e76efa7 internlm.monitor.monitor.MonitorTracker:3
-#: of
-msgid "The Feishu webhook address for sending alerting messages."
-msgstr ""
-
-#: 7913eeecc0904c128046e80cec1553f2 internlm.monitor.monitor.MonitorTracker:5
-#: of
-msgid "The interval in seconds for monitoring checks. Defaults to 300."
-msgstr ""
-
-#: 8d1abc3067584866983139dd3d85c59c internlm.monitor.monitor.MonitorTracker:7
-#: of
-msgid "The threshold for detecting loss value spikes. Defaults to 1.5."
-msgstr ""
-
-#: a0416fd68700450793daa2167f776618
-#: internlm.monitor.monitor.MonitorTracker.run:1 of
-msgid "start the monitor tracker."
-msgstr ""
-
-#: f55eb990c07b4e8f9388236dd60f0017
-#: internlm.monitor.monitor.MonitorTracker.stop:1 of
-msgid "Stop the monitor tracker."
-msgstr ""
-
-#: ../../source/monitor.rst:18 2202bc091aab417097a1b0268dfe6785
-msgid "告警"
-msgstr "Alerting"
-
-#: ../../source/monitor.rst:20 69334f83e644455aa619dde70b8ed1f2
-msgid ""
-"InternLM 监控线程会周期性地检查模型训练过程中是否出现 loss spike、潜在的 training stuck、运行时异常等,并捕获 "
-"SIGTERM 异常信号。当出现上述情况时,将触发警报,并通过调用 "
-"``internlm.monitor.alert.send_feishu_msg_with_webhook()`` 向飞书的 Webhook "
-"地址发送报警消息。"
-msgstr ""
-"InternLM monitor thread periodically tracks loss spike, potential stuck condition, runtime exception, and SIGTERM signal. "
-"When above situation occurs, an alert will be triggered and a message will be sent to the Feishu webhook address by calling "
-"``internlm.monitor.alert.send_feishu_msg_with_webhook()``."
-
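For illustration, a minimal sketch of sending such an alert manually; the exact parameter names are assumptions inferred from the argument descriptions below:

```python
# Hypothetical sketch: sending a Feishu alert directly.
# Argument names are assumed; only the webhook/title/body roles come from the docs below.
from internlm.monitor.alert import send_feishu_msg_with_webhook

response = send_feishu_msg_with_webhook(
    webhook="https://open.feishu.cn/open-apis/bot/v2/hook/<your-webhook>",  # placeholder
    title="loss spike detected",
    message="step 1200: current loss is 1.8x the previous loss",
)
# Per the docs below, the call returns the HTTP response, or None if an exception was caught.
```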
-#: 15980526c2fa4ed8befa1604f271a3f1
-#: internlm.monitor.alert.send_feishu_msg_with_webhook:1 of
-msgid "Use Feishu robot to send messages with the given webhook."
-msgstr ""
-
-#: 38e5738c2b914c8096e1a0f345e6c0b4
-#: internlm.monitor.alert.send_feishu_msg_with_webhook:3 of
-msgid "The webhook to be used to send message."
-msgstr ""
-
-#: 4984f1a3bb0d46b48b2aad4fba8b43d9
-#: internlm.monitor.alert.send_feishu_msg_with_webhook:5 of
-msgid "The message title."
-msgstr ""
-
-#: a9822a4cf30d4947b12f70a0efe62a5e
-#: internlm.monitor.alert.send_feishu_msg_with_webhook:7 of
-msgid "The message body."
-msgstr ""
-
-#: 57d9ab65fe9f45c28351839fecf2f31e
-#: internlm.monitor.alert.send_feishu_msg_with_webhook of
-msgid "返回"
-msgstr ""
-
-#: 2b6ac97fd152498183a8624a9087812b
-#: internlm.monitor.alert.send_feishu_msg_with_webhook:10 of
-msgid "The response from the request. Or catch the exception and return None."
-msgstr ""
-
-#: ec45dedf976046eb909f5b7f79a7d44c
-#: internlm.monitor.alert.send_feishu_msg_with_webhook of
-msgid "抛出"
-msgstr ""
-
-#: 4c6aeec19a6041cfbfa577b1c5a85ac1
-#: internlm.monitor.alert.send_feishu_msg_with_webhook:12 of
-msgid "An exception rasied by the HTTP post request."
-msgstr ""
diff --git a/doc/code-docs/locales/en/LC_MESSAGES/parallel.po b/doc/code-docs/locales/en/LC_MESSAGES/parallel.po
deleted file mode 100644
index d9770dc..0000000
--- a/doc/code-docs/locales/en/LC_MESSAGES/parallel.po
+++ /dev/null
@@ -1,456 +0,0 @@
-# SOME DESCRIPTIVE TITLE.
-# Copyright (C) 2023, InternLM Team
-# This file is distributed under the same license as the InternLM package.
-# FIRST AUTHOR , 2023.
-#
-#, fuzzy
-msgid ""
-msgstr ""
-"Project-Id-Version: InternLM \n"
-"Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2023-09-07 10:56+0800\n"
-"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
-"Last-Translator: FULL NAME \n"
-"Language: en\n"
-"Language-Team: en \n"
-"Plural-Forms: nplurals=2; plural=(n != 1);\n"
-"MIME-Version: 1.0\n"
-"Content-Type: text/plain; charset=utf-8\n"
-"Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.12.1\n"
-
-#: ../../source/parallel.rst:2 28d82a05db464e35aa3ec83e36597214
-msgid "并行训练"
-msgstr "Parallel Training"
-
-#: ../../source/parallel.rst:6 f5c2eef4812640fca0aeaef62a2d85d4
-msgid ""
-"InternLM 支持张量并行、流水线并行、序列并行、数据并行和 ZeRO1.5 "
-"等并行化训练策略。在初始化分布式环境时,我们需要指定张量并行大小、流水线并行大小、数据并行大小以及 ZeRO1.5 策略。"
-msgstr ""
-"InternLM supports tensor parallel, pipeline parallel, sequence parallel, data parallel, and ZeRO1.5 "
-"to parallelize the training pipeline. When initializing the distributed environment, we need to specify "
-"tensor parallel size, pipeline parallel size, data parallel size, and ZeRO1.5 strategy."
-
-#: ../../source/parallel.rst:8 649c52696a734a0c86d3d5377193aba5
-msgid ""
-"InternLM 的并行设置由配置文件中的 ``parallel`` 字段指定,用户可以通过修改配置文件 `config file "
-"`_ "
-"来更改并行配置。以下是一个并行训练配置示例:"
-msgstr ""
-"The parallel setting of InternLM is fully config-driven, and you can change the parallelism by modifying "
-"`config file `_. An exmaple parallel "
-"training configuration can be defined as follows:"
-
-#: ../../source/parallel.rst:19 a06ae11e51ea479b9501ada103c9d071
-msgid "zero1:zero 并行策略,分如下三种情况,默认值为 -1"
-msgstr "zero1: zero parallel strategy, divided into the following three cases, the default value is -1"
-
-#: ../../source/parallel.rst:21 08005d5cdde84057b870495d9683c7be
-msgid "当 ``zero1 <= 0``,则 zero1 进程组的大小等于数据并行进程组的大小,因此优化器状态参数将在数据并行范围内分配"
-msgstr "When ``zero1 <= 0``, the size of the zero1 process group is equal to the size of the data parallel process group, so the optimizer state parameters will be split within the data parallel range."
-
-#: ../../source/parallel.rst:22 fe30803c0aec4b70847ac40b68641e05
-msgid "当 ``zero1 == 1``,则不使用 zero1 ,所有数据并行组保留完整的优化器状态参数"
-msgstr "When ``zero1 == 1``, zero1 is not used, and all data parallel groups retain the complete optimizer state parameters."
-
-#: ../../source/parallel.rst:23 e0acea7d80094e018fab75404ec25163
-msgid ""
-"当 ``zero1 > 1`` 且 ``zero1 <= data_parallel_world_size``,则 zero1 "
-"进程组是数据并行进程组的子集"
-msgstr "When ``zero1 > 1`` and ``zero1 <= data_parallel_world_size``, the zero1 process group is a subset of the data parallel process group."
-
-#: ../../source/parallel.rst:25 17bba79e2e884993a602df9cf20d2489
-msgid "tensor:张量并行大小,通常是每个节点的 GPU 数量,默认值为 1"
-msgstr "tensor: tensor parallel size, usually the number of GPUs per node, the default value is 1"
-
-#: ../../source/parallel.rst:26 3bda721a03a144f28f33d360a87cbf83
-msgid "pipeline:流水线并行策略"
-msgstr "pipeline: pipeline parallel strategy"
-
-#: ../../source/parallel.rst:28 2b10f2b57ef64fcc872d036a7ad82b03
-msgid "size:流水线并行大小,默认值为 1"
-msgstr "size: pipeline parallel size, the default value is 1"
-
-#: ../../source/parallel.rst:29 49c8a409e60244c49514a27780ae39a3
-msgid "interleaved_overlap:bool 类型,交错式调度时,开启或关闭通信优化,默认值为 False"
-msgstr "interleaved_overlap: bool type, when interleaved scheduling, enable or disable communication optimization, the default value is False"
-
-#: ../../source/parallel.rst:31 e4ff81960c434b78847174787f0423e2
-msgid "sequence_parallel:是否开启序列化并行,默认值为 False"
-msgstr "sequence_parallel: whether to enable sequence parallelism, the default value is False"
-
-#: ../../source/parallel.rst:33 a24f4bc81fea48619ae2720e0cb6a392
-msgid "注意:数据并行大小 = 总的 GPU 数目 / 流水线并行大小 / 张量并行大小"
-msgstr "Note: `Data parallel size = Total number of GPUs / Pipeline parallel size / Tensor parallel size`"
-
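For reference, a representative ``parallel`` block assembled only from the fields described above; the Python-dict form and the concrete values are illustrative assumptions:

```python
# Hypothetical parallel configuration, using only fields documented above.
# With 64 GPUs: data parallel size = 64 / pipeline(2) / tensor(8) = 4,
# so zero1 must not exceed 4 in this setup.
parallel = dict(
    zero1=4,                 # optimizer states sharded across groups of 4 ranks
    tensor=8,                # tensor parallel size, usually the number of GPUs per node
    pipeline=dict(size=2, interleaved_overlap=True),
    sequence_parallel=False,
)
```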
-#: ../../source/parallel.rst:36 a93fc45f855c4ca7901ccbe23bf14edc
-msgid "张量并行"
-msgstr "Tensor Parallel"
-
-#: ../../source/parallel.rst:38 cce9e8f3c8f14c1c96c63273baceb164
-msgid ""
-"InternLM 的张量并行实现方案基于 `flash attention `_, 主要对 `attention "
-"`_"
-" 和 `linear "
-"`_"
-" 这两个模块进行张量并行操作。"
-msgstr ""
-"The implementation of tensor parallel for InternLM is based on `flash attention `_, "
-"which has tensor parallel extensions to parallelize `attention `_ "
-"and `linear `_ blocks in InternLM model. "
-
-#: ../../source/parallel.rst:41 f98a4b36ffdf4381a03899b605346be6
-msgid "用户可通过配置文件中的 ``parallel.tensor`` 字段来设置张量并行大小。"
-msgstr "To use tensor parallel, you need to set the value of tensor parallel size ``parallel.tensor`` in the config file, which is usually the number of GPUs per node."
-
-#: ../../source/parallel.rst:47 956804e7cde441989212f7eb505e8815
-msgid "张量并行,采用自 `flash-attention `_"
-msgstr "Tensor parallel, adopted from `flash-attention `_"
-
-#: ../../source/parallel.rst:50 a6424fd0ff0246fcadf56436260fadb6
-msgid "流水线并行"
-msgstr "Pipeline Parallel"
-
-#: ../../source/parallel.rst:52 f2c163418fed432a8f3f59f1a5229e88
-msgid ""
-"InternLM 在流水线并行中使用 `1F1B `_ "
-"(1F1B,一次前向传递后跟一次反向传递)策略。对于 1F1B 策略,有两种实现方式:"
-msgstr "InternLM uses `1F1B `_ (one forward pass followed by one backward pass) for pipeline parallel. For 1F1B strategy, there are two implementations:"
-
-#: ../../source/parallel.rst:54 43f3b988e2924fe9968b9d049b46ffa0
-msgid "非交错调度器,内存高效。"
-msgstr "non-interleaved scheduler, which is memory-efficient"
-
-#: ../../source/parallel.rst:55 7a45446082c441d48d49b6be661ea8d2
-msgid "交错调度器,内存高效且时间高效(GPU空泡较少)。"
-msgstr "interleaved scheduler, which is both memory-efficient and time-efficient."
-
-#: ../../source/parallel.rst:61 92f2a168d7794811b56f9bb3bc170982
-msgid "1F1B 流水线并行调度器,采用自 `Megatron-LM `_"
-msgstr "Non-interleaved and interleaved scheduler for 1F1B pipeline parallelism, adopted from `Megatron-LM `_"
-
-#: ../../source/parallel.rst:64 a6d3df0b74b14b158a04ddda3e904004
-msgid "非交错式流水线调度"
-msgstr "scheduler for non-interleaved 1F1B strategy"
-
-#: ../../source/parallel.rst:65 1fa48743f39a44a29d78fb7f9eed5a52
-msgid "如果要使用非交错式调度, 需要设置 ``model.num_chunks = 1``。"
-msgstr "To use non-interleaved pipeline scheduler, users need to set ``model.num_chunks = 1`` in the config file."
-
-#: 57206dc0bc734686841c363c88839708
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler:1 of
-msgid ""
-"A helper schedule class for pipeline parallelism running environment. It "
-"uses non-interleaved 1F1B strategy. Other properties are similar as "
-":class:`NonPipelineSchedule`."
-msgstr ""
-
-#: 6475fee6f3cd462ba1073a641b322e12 7060a021efb0459598f49f74e8e7185b
-#: 9218fee47e5542cab88ac65ff0054068 d1be8d5479fb48f59be379548ee24bd9
-#: d41da940b4a84cd0822c3f94c2eaf344 f5654fe6eacc49dba5baa1d058df5d29
-#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.pre_processing
-#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step
-#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.zero_grad of
-msgid "参数"
-msgstr ""
-
-#: 567e2a87a45245469af9f8709e020a20
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler:5 of
-msgid "The number of microbatches."
-msgstr ""
-
-#: 6d3b2256ea9c4897bf72f551f8b4696b
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler:7 of
-msgid "Type of data. torch.float by default."
-msgstr ""
-
-#: 6e36198f5ed344f7ad02f56aec9a333c
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler:9 of
-msgid ""
-"The post processing function which receives a micro batch of data, and it"
-" will be executed in `load_micro_batch`."
-msgstr ""
-
-#: ffae9611bd854615af1ced927f72c556
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler:12 of
-msgid "Specified shape in pipeline communication."
-msgstr ""
-
-#: 31d45af550334cb8a94142da335b9724
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler:14 of
-msgid ""
-"If set to `True`, communication will be reduced over pipeline when using "
-"1D tensor parallelization."
-msgstr ""
-
-#: 5c852dc7866f4e50ab87c15b86d338f2
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler:16 of
-msgid "List of scheduler hooks."
-msgstr ""
-
-#: 4ebec38a972b4c31a59f1fc824d51f62
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.pre_processing:1
-#: of
-msgid "To perform actions before running the schedule."
-msgstr ""
-
-#: d491d0dfa1bf41708150cc57567ac0f0
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.pre_processing:3
-#: of
-msgid "InternLM engine for training and inference."
-msgstr ""
-
-#: bc5dc62440b94825b192ad2e28641976
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step:1
-#: of
-msgid ""
-"Runs non-interleaved 1F1B schedule, with communication between pipeline "
-"stages. Returns a tuple with losses if the last stage, an empty tuple "
-"otherwise."
-msgstr ""
-
-#: 765809e448b644678a9fb822f6427a94 99c948f562e343aabdecac2d43650f59
-#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:4
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step:4
-#: of
-msgid "Colossalai engine for training and inference."
-msgstr ""
-
-#: 31af7a46c5a645628bea05ad35757dcf 4ea88ec52c5b4df79a57ab2d217de697
-#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:6
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step:6
-#: of
-msgid ""
-"Dataloader as the form of an iterator, obtained by calling "
-"iter(dataloader)."
-msgstr ""
-
-#: 2deff747718449fabc5b47a1de0be52e e0d2e154ac134da28470924aa65342a1
-#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:8
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step:8
-#: of
-msgid ""
-"Whether run forward step only. Default is false. If true, no backward "
-"will be run."
-msgstr ""
-
-#: 71aa2b45248c4af28525dbc1ba4a1aff d3b3c1e350334dd2a16cbb2e8c8d339a
-#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:10
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step:10
-#: of
-msgid "Whether returns the loss value. Default is true."
-msgstr ""
-
-#: 2021eaca687148539b03f6b0b1c118c8 5c138015fb254eccae2f0df2dab45629
-#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:12
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step:12
-#: of
-msgid "If False, the output and label won't be returned."
-msgstr ""
-
-#: 57a86115b88541b1a7220d9535058607 5dabcd12b6d844aab8039b022ad0cf1c
-#: b8ccfee837a242a3abbdf9e15eaa53d8
-#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step
-#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step of
-msgid "返回"
-msgstr ""
-
-#: 7dc47f5518e64d1095a6051184985f17 fe678c953e8149a5ade387e95d10d3b2
-#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:17
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step:15
-#: of
-msgid "A tuple of (output, label, loss), loss and label could be None."
-msgstr ""
-
-#: a50c7c3d40e14ba8a5af06aa0cb031cb ea3574b76d604402a41fcd3874d05c9a
-#: fa12b183c7534a20b61445eb9f2a2a7a
-#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step
-#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step of
-msgid "返回类型"
-msgstr ""
-
-#: 82936eed6da5408c9361732f8fd5cb93 c46a28c21ca149d98ff625b7fdad4c03
-#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:19
-#: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler.forward_backward_step:16
-#: of
-msgid "Tuple[:class:`torch.Tensor`]"
-msgstr ""
-
-#: ../../source/parallel.rst:71 d2bfdbbd9a7641c38e6957a72ac6bc97
-msgid "交错式流水线调度"
-msgstr "scheduler for interleaved 1F1B strategy"
-
-#: ../../source/parallel.rst:72 395c484fef984a65a284147dc3056241
-msgid "如果要使用交错式调度, 需要设置 ``model.num_chunks > 1``。"
-msgstr "To use interleaved pipeline scheduler, users need to set ``model.num_chunks > 1`` in the config file."
-
-#: 036fffe3aacc4400af38ce5252840a50
-#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler:1 of
-msgid "Interleaved Pipeline Scheduler."
-msgstr ""
-
-#: 1b6e63b4004e44999e3ad38382b4e308
-#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:1
-#: of
-msgid ""
-"Run interleaved 1F1B schedule (model split into model chunks), with "
-"communication between pipeline stages as needed."
-msgstr ""
-
-#: 6ece1dfcdb5e408db4870d6c0f524787
-#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:15
-#: of
-msgid ""
-"A tuple of (output, label, loss), loss and label could be None. The "
-"loss would be returned only in the last stage."
-msgstr ""
-
-#: ed7e5a4826f84e9eb2840e494761437f
-#: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler.forward_backward_step:18
-#: of
-msgid "The loss would be returned only in the last stage."
-msgstr ""
-
-#: ../../source/parallel.rst:77 1b771fea1d434f0b8b118f1b5344dde4
-msgid "值得注意的是,在使用交错式流水线调度器时可启用通信优化功能,即在 1F1B 阶段启用异步通信,以充分利用上行/下行带宽并实现通信与计算重叠。"
-msgstr "Asynchronous communication will be enabled in 1F1B stage to make full use of uplink/downlink bandwidth and achieve communication overlap. "
-
-#: ../../source/parallel.rst:79 27430e179b454d48a052b9fe6e11ecae
-msgid ""
-"用户需要在配置文件中设置 ``parallel.pipeline.interleaved_overlap = "
-"True``。该功能启用后,将调用函数 "
-"``InterleavedPipelineScheduler._run_1f1b_loop_with_overlap``,并创建 "
-"``internlm.core.communication.AsynCommunicator`` 以管理异步通信。"
-msgstr ""
-"When ``parallel.pipeline.interleaved_overlap = True``, function ``InterleavedPipelineScheduler._run_1f1b_loop_with_overlap`` will be called and "
-"``internlm.core.communication.AsynCommunicator`` will be created for managing async communication."
-
-#: ../../source/parallel.rst:81 4e0b6269ca48430098ed4619d0f0f22f
-msgid "``1F1B-without-overlap`` 和 ``1F1B-with-overlap`` 的区别如下所示:"
-msgstr "The difference between 1F1B stage without overlap and 1F1B stage with overlap is shown as follows:"
-
-#: ../../source/parallel.rst:102 8412b1f6f51c479d9cbb281763215327
-msgid "序列并行"
-msgstr "Sequence Parallel"
-
-#: ../../source/parallel.rst:104 45aea8164dd244e5a730881c693eeecf
-msgid ""
-"序列并行是一种在不引入额外计算、通信和内存开销的情况下,减少层 ``layer_norm`` 和 ``dropout`` "
-"操作中的激活值内存。InternLM 中的序列并行实现基于 `flash attention `_。这个并行策略有助于降低模型的内存消耗,提高了模型在资源受限环境中的可扩展性。"
-msgstr ""
-"Sequence parallel is a technique to reduce activation memory in layer norm and dropout without additional computation, "
-"communication or memory overhead. The implementation of sequence parallel for InternLM is based on `flash attention `_. "
-
-#: ../../source/parallel.rst:106 29836b441ee84df6a6dbe877930ba911
-msgid "如果要启用序列并行, 用户需要设置 ``parallel.sequence_parallel = True``。"
-msgstr "To enable sequence parallel, you need to set ``parallel.sequence_parallel = True`` in the config file."
-
-#: ../../source/parallel.rst:112 eadcd6e77c2547998b4e132939a15856
-msgid "序列并行, 采用自 flash-attention"
-msgstr "Sequence parallel, adopted from flash-attention"
-
-#: ../../source/parallel.rst:115 47a0ac84251949fab0d9d8d34efb8751
-msgid "数据并行"
-msgstr "Data Parallel"
-
-#: ../../source/parallel.rst:117 938ad5a1cbc846bab36e8d2f4804a685
-msgid "InternLM 支持数据并行。数据并行大小为:"
-msgstr "InternLM supports data parallel. For data parallel:"
-
-#: ../../source/parallel.rst:119 1e8691a5ff4a4b40ae24815c681f7306
-msgid ""
-"`Data parallel size = Total number of GPUs / Pipeline parallel size / "
-"Tensor parallel size`"
-msgstr ""
-
-#: ../../source/parallel.rst:122 c417e2af4e8e45ca8ca18ad39e96dadd
-msgid "ZeRO1.5"
-msgstr ""
-
-#: ../../source/parallel.rst:124 9c05b4baf8a04e4b8a0f204c4e30cc9c
-msgid ""
-"ZeRO1.5 的实现使用了分层分片的概念,通过配置值 ``parallel.zero1`` "
-"启用了本地节点内的分片。这个方法有助于有效管理和分配模型参数和梯度,以减少内存使用并提高训练效率。"
-msgstr "The implementation of ZeRO1.5 uses the concept of hierarchical sharding via config value ``parallel.zero1``, which enables sharding within local nodes."
-
-#: ../../source/parallel.rst:126 48c994fe37d54c35bbf81f4be070e151
-msgid "当 ``parallel.zero1 <= 0``,则 zero1 进程组的大小等于数据并行进程组的大小,因此优化器状态参数将在数据并行范围内分配"
-msgstr "If ``parallel.zero1 <= 0``, the size of the zero process group is equal to the size of the dp process group, so parameters will be divided within the range of dp."
-
-#: ../../source/parallel.rst:127 3d31193758e24a08b1e90eae21259f71
-msgid "当 ``parallel.zero1 == 1``,则不使用 zero1 ,所有数据并行组保留完整的优化器状态参数"
-msgstr "If ``parallel.zero1 == 1``, zero is not used, and all dp groups retain the full amount of model parameters."
-
-#: ../../source/parallel.rst:128 fb5c43d2ac75423cabc12ba1512df25e
-msgid ""
-"当 ``parallel.zero1 > 1`` 且 ``parallel.zero1 <= "
-"data_parallel_world_size``,则 zero1 进程组是数据并行进程组的子集"
-msgstr "If ``parallel.zero1 > 1`` and ``parallel.zero1 <= dp world size``, the world size of zero is a subset of dp world size. For smaller models, it is usually a better choice to split the parameters within nodes with a setting ``parallel.zero1 <= 8``."
-
-#: ../../source/parallel.rst:130 47f03cea956a4477854591363359cdb3
-msgid ""
-"此外,用户可以在配置文件中通过 ``hybrid_zero_optimizer`` "
-"字段启用优化器的通信优化功能,设置桶大小,以及梯度剪裁等参数。这些设置有助于优化训练过程中的通信和计算效率,以及梯度的处理方式。"
-msgstr "Furthermore, you can enable communication-computation overlap, set bucket reduce size, gradient clipping parameters in the config file."
-
-#: ../../source/parallel.rst:144 dfc63103d4e341ccb7df8ef031e29f4e
-msgid "这里有两个值得关注的通信优化点:"
-msgstr "There are two communication optimizations worth paying attention to here:"
-
-#: ../../source/parallel.rst:146 e4815f887d8f48368be01339b5e64d18
-msgid ""
-"overlap_sync_grad: 如果设置为 ``True``,则将训练的 ``backward pass`` 与梯度的 ``all-"
-"reduce`` 通信重叠"
-msgstr "overlap_sync_grad: If set True, overlapping training backward pass with gradients' all-reduce communication."
-
-#: ../../source/parallel.rst:147 bcb1aedd8a89441488b211cd81d4f80c
-msgid ""
-"overlap_sync_param: 如果设置为 ``True``,则将参数的 ``broadcast`` 通信与下一步的 ``forward "
-"pass`` 进行重叠"
-msgstr "overlap_sync_param: If set True, overlapping parameters' broadcast communication with next step's forward pass."
-
-#: ../../source/parallel.rst:149 3ba64e4762084e93ba62a70c909e7d82
-msgid "这些优化可以加速训练过程,提高训练效率。"
-msgstr "These optimizations can speed up the training process and improve training efficiency."
-
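For reference, a sketch of what such a ``hybrid_zero_optimizer`` block could look like; only ``overlap_sync_grad`` and ``overlap_sync_param`` come from the text above, while the remaining field names and values are illustrative assumptions:

```python
# Hypothetical hybrid_zero_optimizer configuration.
hybrid_zero_optimizer = dict(
    overlap_sync_grad=True,    # overlap backward pass with gradient all-reduce (documented above)
    overlap_sync_param=True,   # overlap parameter broadcast with the next forward pass (documented above)
    reduce_bucket_size=512 * 1024 * 1024,  # assumed name for the bucket reduce size
    clip_grad_norm=1.0,                    # assumed name for gradient clipping
)
```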
-#: 757dad6b9916403c83042b49eaa35ae5
-#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer:1 of
-msgid "Hybrid Zero Optimizer."
-msgstr ""
-
-#: 83bcd49c056446f6806a55e6138579f2
-#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.zero_grad:1
-#: of
-msgid ""
-"Set parameter gradients to zero. If set_to_none = True, gradient will be "
-"set to None to save memory."
-msgstr ""
-
-#: 2d3da89d360c458f80844f9caed6c316
-#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.zero_grad:4
-#: of
-msgid "Whether set the gradient to None. Default value is True."
-msgstr ""
-
-#: 4164523156dc460cbbeaa17feed3c689
-#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step:1 of
-msgid "Performs a single optimization step."
-msgstr ""
-
-#: 5c68dace1ec649bfa849b6652051daac
-#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step:3 of
-msgid "A closure that reevaluates the model and returns the loss."
-msgstr ""
-
-#: 91e366d604ce48afa6b92666ece87b85
-#: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer.step:7 of
-msgid "Whether the gradient is success updated, and the gradient."
-msgstr ""
diff --git a/doc/code-docs/locales/en/LC_MESSAGES/profiler.po b/doc/code-docs/locales/en/LC_MESSAGES/profiler.po
deleted file mode 100644
index 3acae56..0000000
--- a/doc/code-docs/locales/en/LC_MESSAGES/profiler.po
+++ /dev/null
@@ -1,174 +0,0 @@
-# SOME DESCRIPTIVE TITLE.
-# Copyright (C) 2023, InternLM Team
-# This file is distributed under the same license as the InternLM package.
-# FIRST AUTHOR , 2023.
-#
-#, fuzzy
-msgid ""
-msgstr ""
-"Project-Id-Version: InternLM \n"
-"Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2023-09-14 11:05+0800\n"
-"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
-"Last-Translator: FULL NAME \n"
-"Language: en\n"
-"Language-Team: en \n"
-"Plural-Forms: nplurals=2; plural=(n != 1);\n"
-"MIME-Version: 1.0\n"
-"Content-Type: text/plain; charset=utf-8\n"
-"Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.12.1\n"
-
-#: ../../source/profiler.rst:2
-msgid "性能分析"
-msgstr "Profiler"
-
-#: ../../source/profiler.rst:7
-msgid "Torch Profiler"
-msgstr ""
-
-#: ../../source/profiler.rst:9
-msgid ""
-"InternLM 使用 ``internlm.train.initialize_llm_profile()`` "
-"来收集和分析模型训练或推理期间的性能数据,如 CPU/CUDA/memory 等性能数据。这个实现基于 `torch.profiler "
-"`_ ,输出的性能分析 trace 文件可以使用 "
-"`tensorboard `_ 进行可视化。"
-msgstr ""
-"InternLM uses ``internlm.train.initialize_llm_profile()`` to profile "
-"performance data, execution time duration and breakdown analysis of step "
-"time. The implementation is based on `torch.profiler "
-"`_ and output tracing "
-"files can be visualized with `tensorboard `_."
-
-#: ../../source/profiler.rst:11
-msgid ""
-"用户如果想使用这个 torch 性能分析工具,需要在启动训练时传递 ``--profiling`` 参数以启用性能分析。完成 torch "
-"性能分析后,用户可以在 ``{JOB_NAME}/{start_time}/traces/rank{}_dp{}_tp{}_pp{}`` "
-"文件夹中看到性能分析结果。"
-msgstr ""
-"To use this torch profiler tool, you need to enable profiling by passing "
-"the ``--profiling`` flag when starting training. After torch profiling is"
-" completed, you can find the profiling results in the "
-"``{JOB_NAME}/{start_time}/traces/rank{}_dp{}_tp{}_pp{}`` folder."
-
-#: ../../source/profiler.rst:13
-msgid "实际运行生成的 ``Torch Profiler`` 目录结构如下:"
-msgstr ""
-"The directory structure of ``Torch Profiler`` generated files is as "
-"follows:"
-
-#: ../../source/profiler.rst:22
-msgid "其中, ``traces`` 可以通过 ``TensorBoard`` 可视化,运行命令"
-msgstr ""
-"Among them, ``traces`` can be visualized through ``TensorBoard`` and run "
-"with the command"
-
-#: ../../source/profiler.rst:29
-msgid ""
-"在打开的 ``TensorBoard -> PyTorch Profiler -> Views -> Trace`` "
-"页面可以看到Operator和GPU Kernel的性能分析时间线如下,更多的功能请参考 `torch profiler with "
-"tensorboard "
-"`_"
-msgstr ""
-"In the opened ``TensorBoard -> PyTorch Profiler -> Views -> Trace`` page,"
-" you can see the timeline of profiled operators and GPU kernels. For more"
-" usage, please refer to `torch profiler with tensorboard "
-"`_"
-
-#: internlm.train.training_internlm.initialize_llm_profile:1 of
-msgid "Initialize and return the profiler context manager instance."
-msgstr ""
-
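For context, a sketch of the kind of ``torch.profiler`` setup that a wrapper such as ``initialize_llm_profile()`` is presumably built on; this is an assumption about its internals, not the actual InternLM code:

```python
# Standard torch.profiler usage that writes TensorBoard-compatible traces.
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler


def train_one_step():
    # Placeholder workload; replace with a real training iteration.
    torch.randn(1024, 1024) @ torch.randn(1024, 1024)


prof = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./traces"),  # output folder is illustrative
    with_stack=True,
)
with prof:
    for _ in range(8):
        train_one_step()
        prof.step()  # advance the profiling schedule
```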
-#: ../../source/profiler.rst:38
-msgid "Memory Profiler"
-msgstr ""
-
-#: ../../source/profiler.rst:40
-msgid ""
-"InternLM 提供了一个实用的内存分析工具 "
-"``internlm.utils.simple_memory_profiler.SimpleMemoryProfiler`` 来监控实际的 GPU"
-" 内存使用情况。在实现中,会对模型数据(包括模型参数、模型梯度和优化器状态)和非模型数据(包括激活值)分别进行详细的统计。"
-msgstr ""
-"InternLM provides a practical solution "
-"``internlm.utils.simple_memory_profiler.SimpleMemoryProfiler`` to monitor"
-" actual GPU memory usage. In the implmentation, model data (including "
-"model parameters, model gradients, and optimizer states) and non-model "
-"data (including activations) are calculated."
-
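For illustration, a minimal usage sketch; the constructor argument names, the log-file path, and the keyword used in ``point()`` are assumptions inferred from the parameter descriptions below:

```python
# Hypothetical sketch of driving the memory profiler in a training loop.
from internlm.utils.simple_memory_profiler import SimpleMemoryProfiler

memory_profiler = SimpleMemoryProfiler(
    model,                                     # the model to profile (placeholder)
    optimizer,                                 # the optimizer used for training (placeholder)
    "memory_trace/rank0_dp0_tp0/memory.log",   # assumed log-file location
    total_steps=5,                             # assumed name for the number of steps to trace
)

for step in range(5):
    ...  # forward / backward / optimizer step
    memory_profiler.point(create=(step == 0))  # record the current memory state
    memory_profiler.step()                     # update optimizer-state memory statistics
```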
-#: ../../source/profiler.rst:42
-msgid ""
-"要使用这个内存分析工具,用户需要在启动训练时传递 ``--profiling`` 参数以启用内存分析。完成内存分析后,用户可以在 "
-"``memory_trace/rank{}_dp{}_tp{}`` 文件夹中找到特定 rank "
-"对应的内存分析结果(包括不同时间点的内存使用日志和显示总体内存使用情况的太阳图表)。"
-msgstr ""
-"To use this memory profiler tool, you need to enable profiling by passing"
-" the ``--profiling`` flag when starting training. After memory profiling "
-"is completed, you can find the profiling results (including logs of "
-"memory usage at different time point and sunburst charts showing overall "
-"memory usage) for a specific rank device in the "
-"``memory_trace/rank{}_dp{}_tp{}`` folder."
-
-#: ../../source/profiler.rst:44
-msgid "实际运行生成的 ``memory_trace`` 目录结构如下:"
-msgstr "The directory structure of ``memory_trace`` generated files is as follows:"
-
-#: ../../source/profiler.rst:107
-msgid "其中, ``memory.log`` 的内容示例如下:"
-msgstr "An example of ``memory.log`` is as follows:"
-
-#: ../../source/profiler.rst:157
-msgid "模型参数的太阳图示例如下:"
-msgstr "An example of model parameters sunburst chart is as follows:"
-
-#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler:1 of
-msgid "A memory profiler for a llm model."
-msgstr ""
-
-#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler
-#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.point of
-msgid "参数"
-msgstr ""
-
-#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler:3 of
-msgid "The model to profile."
-msgstr ""
-
-#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler:5 of
-msgid "The optimizer used for training the model."
-msgstr ""
-
-#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler:7 of
-msgid "The file to write the memory state information to."
-msgstr ""
-
-#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler:9 of
-msgid "number of steps to trace."
-msgstr ""
-
-#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.point:1 of
-msgid "Record the memory state."
-msgstr ""
-
-#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.point:3 of
-msgid "The options to include in the memory state. Defaults to \"\"."
-msgstr ""
-
-#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.point:5 of
-msgid "Whether to create a new memory record file. Defaults to False."
-msgstr ""
-
-#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.point
-#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.step of
-msgid "返回"
-msgstr ""
-
-#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.point:8
-#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.step:3 of
-msgid "None"
-msgstr ""
-
-#: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler.step:1 of
-msgid "Update the memory state of the optimizer state."
-msgstr ""
diff --git a/doc/code-docs/locales/en/LC_MESSAGES/qa.po b/doc/code-docs/locales/en/LC_MESSAGES/qa.po
deleted file mode 100644
index 651a825..0000000
--- a/doc/code-docs/locales/en/LC_MESSAGES/qa.po
+++ /dev/null
@@ -1,24 +0,0 @@
-# SOME DESCRIPTIVE TITLE.
-# Copyright (C) 2023, InternLM Team
-# This file is distributed under the same license as the InternLM package.
-# FIRST AUTHOR , 2023.
-#
-#, fuzzy
-msgid ""
-msgstr ""
-"Project-Id-Version: InternLM \n"
-"Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2023-09-07 10:56+0800\n"
-"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
-"Last-Translator: FULL NAME \n"
-"Language: en\n"
-"Language-Team: en \n"
-"Plural-Forms: nplurals=2; plural=(n != 1);\n"
-"MIME-Version: 1.0\n"
-"Content-Type: text/plain; charset=utf-8\n"
-"Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.12.1\n"
-
-#: ../../source/qa.rst:2 e3b22a39640a40cfb527068a7f4bbfc9
-msgid "问&答"
-msgstr "Q&A"
diff --git a/doc/code-docs/locales/en/LC_MESSAGES/training.po b/doc/code-docs/locales/en/LC_MESSAGES/training.po
deleted file mode 100644
index b8771f3..0000000
--- a/doc/code-docs/locales/en/LC_MESSAGES/training.po
+++ /dev/null
@@ -1,161 +0,0 @@
-# SOME DESCRIPTIVE TITLE.
-# Copyright (C) 2023, InternLM Team
-# This file is distributed under the same license as the InternLM package.
-# FIRST AUTHOR , 2023.
-#
-#, fuzzy
-msgid ""
-msgstr ""
-"Project-Id-Version: InternLM \n"
-"Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2023-09-14 12:23+0800\n"
-"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
-"Last-Translator: FULL NAME \n"
-"Language: en\n"
-"Language-Team: en \n"
-"Plural-Forms: nplurals=2; plural=(n != 1);\n"
-"MIME-Version: 1.0\n"
-"Content-Type: text/plain; charset=utf-8\n"
-"Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.12.1\n"
-
-#: ../../source/training.rst:2
-msgid "训练 API"
-msgstr "Training API"
-
-#: ../../source/training.rst:4
-msgid ""
-"InternLM 的训练 API 由 ``internlm.core.trainer.Trainer`` "
-"管理。在定义了训练引擎和调度器之后,我们可以调用 Trainer API 来执行模型训练、评估、梯度清零和参数更新等。"
-msgstr ""
-"InternLM training API is managed in ``internlm.core.trainer.Trainer``. "
-"After defining the training engine and runtime scheduler, we can call "
-"training API to perform training, evaluation, zero gradients and "
-"parameter update steps."
-
-#: ../../source/training.rst:6
-msgid "有关详细用法,请参阅 Trainer API 文档和示例。"
-msgstr ""
-"For detailed usage, please refer to Trainer API documentation and "
-"examples."
-
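For reference, a minimal sketch of an iteration loop built on the Trainer methods documented below; the trainer instance, dataloader, and step count are placeholders assumed to come from initialization:

```python
# Hypothetical sketch of a training step loop using the Trainer API.
# Method names (train, zero_grad, execute_schedule, step) are documented below;
# trainer, train_dl, and total_steps are assumed to be created during initialization.
trainer.train()                      # switch the model to training mode
train_iter = iter(train_dl)

for _ in range(total_steps):
    trainer.zero_grad()              # zero parameter gradients
    output, label, loss = trainer.execute_schedule(train_iter)  # forward + loss + backward
    trainer.step()                   # parameter update
```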
-#: internlm.core.trainer.Trainer:1 of
-msgid ""
-"This is a class tending for easy deployments of users' training and "
-"evaluation instead of writing their own scripts."
-msgstr ""
-
-#: internlm.core.trainer.Trainer internlm.core.trainer.Trainer.execute_schedule
-#: of
-msgid "参数"
-msgstr ""
-
-#: internlm.core.trainer.Trainer:4 of
-msgid "Engine responsible for the process function."
-msgstr ""
-
-#: internlm.core.trainer.Trainer:6 of
-msgid "Runtime schedule. Defaults to None."
-msgstr ""
-
-#: internlm.core.trainer.Trainer.engine:1 of
-msgid ""
-"Returns the engine that responsible for managing the training and "
-"evaluation process."
-msgstr ""
-
-#: internlm.core.trainer.Trainer.schedule:1 of
-msgid "Returns the runtime scheduler."
-msgstr ""
-
-#: internlm.core.trainer.Trainer.uses_pipeline:1 of
-msgid "Returns whether the pipeline parallel is used or not."
-msgstr ""
-
-#: internlm.core.trainer.Trainer.train:1 of
-msgid "Sets the model to training mode."
-msgstr ""
-
-#: internlm.core.trainer.Trainer.eval:1 of
-msgid "Sets the model to evaluation mode."
-msgstr ""
-
-#: internlm.core.trainer.Trainer.zero_grad:1 of
-msgid "Sets the gradient of all parameters in the model to zero."
-msgstr ""
-
-#: internlm.core.trainer.Trainer.step:1 of
-msgid "Executes the parameter update step."
-msgstr ""
-
-#: internlm.core.trainer.Trainer.execute_schedule:1 of
-msgid ""
-"Runs the forward, loss computation, and backward for the model. Returns a"
-" tuple of (output, label, loss)."
-msgstr ""
-
-#: internlm.core.trainer.Trainer.execute_schedule:4 of
-msgid "The data iterator."
-msgstr ""
-
-#: internlm.core.trainer.Trainer.execute_schedule:6 of
-msgid "Additional keyword arguments."
-msgstr ""
-
-#: internlm.core.trainer.Trainer.execute_schedule of
-msgid "返回"
-msgstr ""
-
-#: internlm.core.trainer.Trainer.execute_schedule:8 of
-msgid "A tuple of (output, label, loss)."
-msgstr ""
-
-#: internlm.core.trainer.Trainer.execute_schedule of
-msgid "返回类型"
-msgstr ""
-
-#: internlm.core.trainer.Trainer.execute_schedule:9 of
-msgid "Tuple[:class:`torch.Tensor`]"
-msgstr ""
-
-#~ msgid "InternLM 的训练流程可以归纳为两个步骤:"
-#~ msgstr "The training process of InternLM can be summarized into two steps: "
-
-#~ msgid "初始化"
-#~ msgstr "Initialization"
-
-#~ msgid "初始化模型、优化器、数据加载器、Trainer,生成不同种类的进程组,为混合并行的迭代训练做准备。"
-#~ msgstr ""
-#~ "Initialize model, optimizer, dataloader, "
-#~ "trainer, and create different types of"
-#~ " process groups to prepare for "
-#~ "iterative steps of hybrid parallel "
-#~ "training. "
-
-#~ msgid "初始化Logger、Checkpoint管理器、Monitor管理器、Profiler,对迭代训练的过程观察、预警、记录。"
-#~ msgstr ""
-#~ "Initialize logger, checkpoint manager, monitor"
-#~ " manager, and profiler to watch, "
-#~ "alert, and record the iterative training"
-#~ " steps. "
-
-#~ msgid "迭代训练"
-#~ msgstr "Iterative training steps"
-
-#~ msgid "根据配置文件定义的张量并行、流水线并行、数据并行的大小,加载训练引擎和调度器进行混合并行训练。"
-#~ msgstr ""
-#~ "Load the training engine and scheduler"
-#~ " for hybrid parallel training according "
-#~ "to the configuration such as tensor "
-#~ "parallel size, pipeline parallel size, "
-#~ "and data parallel size. "
-
-#~ msgid "在迭代训练中,调用 Trainer API 进行梯度置零,前向传播计算损失并反向传播,参数更新。"
-#~ msgstr ""
-#~ "In iterative training steps, the Trainer"
-#~ " API is called to perform zero "
-#~ "gradients, forward-loss-backward, and "
-#~ "parameter update."
-
-#~ msgid "InternLM训练流程图"
-#~ msgstr "InternLM training process"
diff --git a/doc/code-docs/locales/en/LC_MESSAGES/usage.po b/doc/code-docs/locales/en/LC_MESSAGES/usage.po
deleted file mode 100644
index 8717ecf..0000000
--- a/doc/code-docs/locales/en/LC_MESSAGES/usage.po
+++ /dev/null
@@ -1,366 +0,0 @@
-# SOME DESCRIPTIVE TITLE.
-# Copyright (C) 2023, InternLM Team
-# This file is distributed under the same license as the InternLM package.
-# FIRST AUTHOR , 2023.
-#
-#, fuzzy
-msgid ""
-msgstr ""
-"Project-Id-Version: InternLM \n"
-"Report-Msgid-Bugs-To: \n"
-"POT-Creation-Date: 2023-09-11 14:25+0800\n"
-"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
-"Last-Translator: FULL NAME \n"
-"Language: en\n"
-"Language-Team: en \n"
-"Plural-Forms: nplurals=2; plural=(n != 1);\n"
-"MIME-Version: 1.0\n"
-"Content-Type: text/plain; charset=utf-8\n"
-"Content-Transfer-Encoding: 8bit\n"
-"Generated-By: Babel 2.12.1\n"
-
-#: ../../../usage.md:2
-msgid "使用教程"
-msgstr "Quickstart Guide"
-
-#: ../../../usage.md:4
-msgid ""
-"启动一个 Demo "
-"模型训练,需要进行三项准备,**安装**,**数据集准备**和**模型训练配置**。接下来,首先会介绍数据准备相关的操作,再简要描述模型训练配置相关的内容。"
-msgstr ""
-"To start a demo model training, you need to prepare three things: "
-"**installation**, **dataset preparation**, and **model training "
-"configuration**. In this guide, we will first cover the steps for dataset"
-" preparation and then briefly describe the model training configuration."
-
-#: ../../../usage.md:6
-msgid "安装"
-msgstr "Installation"
-
-#: ../../../usage.md:7
-msgid "请参考[安装文档](./install.md)进行安装。"
-msgstr ""
-"Please refer to the [installation guide](./install.md) for instructions "
-"on how to install the necessary dependencies."
-
-#: ../../../usage.md:9
-msgid "数据准备 (预训练)"
-msgstr "Dataset Preparation (Pre-training)"
-
-#: ../../../usage.md:11
-msgid "InternLM训练任务的数据集包括一系列的`bin`和`meta`文件。使用`tokenizer`从原始文本文件生成训练用数据集。通过在`tools/tokenizer.py`中指定模型参数路径的方式来导入tokenizer模型。目前提供`V7_sft.model`来生成tokens。若想使用不同的模型,可直接修改`tokernizer.py`中的模型参数路径。"
-msgstr ""
-"The dataset for the InternLM training task includes a series of `bin` and"
-" `meta` files. A `tokenizer` is used to generate the training dataset "
-"from the original text files. The tokenizer model is imported by "
-"specifying the model parameter path in `tools/tokenizer.py`. Currently, "
-"`V7_sft.model` is provided to generate tokens. If you want to use a "
-"different model, you can directly modify the model parameter path in "
-"`tokenizer.py`."
-
-#: ../../../usage.md:13
-msgid "可以运行以下命令生成原始数据对应的`bin`和`meta`文件,其中参数`text_input_path`表示原始文本数据路径,目前支持`txt`、`json`和`jsonl`三种输入格式,`bin_output_path`表示生成的`bin`文件的保存路径。"
-msgstr ""
-"You can run the following command to generate `bin` and `meta` files "
-"corresponding to the original data. The parameter `text_input_path` "
-"represents the path of the original text data, currently supporting "
-"`txt`, `json`, and `jsonl` formats, while `bin_output_path` represents "
-"the save path of the generated `bin` files."
-
-#: ../../../usage.md:18
-msgid "下面是一个数据处理的例子:"
-msgstr "Here is an example of data processing:"
-
-#: ../../../usage.md:20
-msgid "给定一个包含原始数据集的文件`raw_data.txt`,原始数据集如下所示:"
-msgstr ""
-"Given a file `raw_data.txt` containing the raw dataset, the raw dataset "
-"is shown below:"
-
-#: ../../../usage.md:27
-msgid "可以通过运行以下命令来生成`bin`和`meta`文件:"
-msgstr ""
-"You can generate the `bin` and `meta` files by running the following "
-"command:"
-
-#: ../../../usage.md:32
-msgid "需要注意的是,生成的`bin`文件需要保存在`cn`或者`en`或者`code`或者`ja`或者`ar`或者`kaoshi`这六个目录下,以区分数据集的类型。"
-msgstr ""
-"It should be noted that the generated `bin` files need to be saved in one"
-" of the following directories: `cn`, `en`, `code`, `ja`, `ar`, or "
-"`kaoshi`, depending on the type of dataset."
-
-#: ../../../usage.md:34
-msgid "其中,`cn`表示中文数据集;`en`表示英文数据集;`code`表示代码数据集;`ja`表示日语数据集;`ar`表示阿拉伯语数据集;`kaoshi`表示考试数据集。"
-msgstr ""
-"Here, `cn` represents the Chinese dataset, `en` represents the English "
-"dataset, `code` represents the code dataset, `ja` represents the Japanese"
-" dataset, `ar` represents the Arabic dataset, and `kaoshi` represents the"
-" exam dataset."
-
-#: ../../../usage.md:36
-msgid "生成的bin文件的格式如下:"
-msgstr "The format of the generated `bin` files is as follows:"
-
-#: ../../../usage.md:42
-msgid "`bin`文件中的每一行均对应原始数据集中的每一个句子,表示每个句子的`token`(下文将用sequence指定)。"
-msgstr ""
-"Each line in the `bin` file corresponds to each sentence in the original "
-"dataset, representing the tokens of each sentence (referred to as "
-"sequence below)."
-
-#: ../../../usage.md:44
-msgid "生成的`meta`文件的格式如下:"
-msgstr "The format of the generated `meta` file is as follows:"
-
-#: ../../../usage.md:48
-msgid ""
-"在`meta`文件中,每个元组对应着`bin`文件中每一个`sequence`的元信息。其中,元组的第一个元素表示每个`sequence`在所有`sequence`中的`starting"
-" index`,第二个元素表示每个`sequence`中有多少个`tokens`。"
-msgstr ""
-"Each tuple in the `meta` file represents the meta information of each "
-"`sequence`, where the first element in the tuple indicates the `starting "
-"index` of each `sequence` among all `sequences`, and the second element "
-"indicates the number of `tokens` for each `sequence`."
-
-#: ../../../usage.md:50
-msgid ""
-"例如,对于第一个`sequence`,`starting index`为 0,有 11 "
-"个`tokens`;对于第二个`sequence`,由于第一个`sequence`转换为`string`后的长度为`89`,因此它的`starting"
-" index`为 90,有 15 个`tokens`。"
-msgstr ""
-"For example, the first `sequence` starts at index 0 and has 16 `tokens`. "
-"The second `sequence` starts at index 110 and has 24 `tokens`."
-
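For illustration, a tiny sketch of interpreting these tuples; the in-memory list below simply mirrors the example values from the text:

```python
# Each meta entry is (starting index, number of tokens) for one sequence.
meta = [(0, 11), (90, 15)]  # values taken from the example above

for i, (start, n_tokens) in enumerate(meta):
    print(f"sequence {i}: starts at index {start}, contains {n_tokens} tokens")
```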
-#: ../../../usage.md:52
-msgid "`json`和`jsonl`类型的文件的`bin`和`meta`文件格式和`txt`一致,此处不再赘叙。"
-msgstr ""
-"The `bin` and `meta` file formats for `json` and `jsonl` type files are "
-"the same as for `txt`, so we won't go over them here."
-
-#: ../../../usage.md:54
-msgid "数据准备 (微调)"
-msgstr "Data Preparation (Fine-tuning)"
-
-#: ../../../usage.md:56
-msgid ""
-"微调任务的数据集格式与预训练任务保持一致,生成的数据格式为一系列的`bin`和`meta`文件。以下以 Alpaca "
-"数据集为例,介绍微调的数据准备流程。"
-msgstr ""
-"The data format for fine-tuning tasks is the same as for pre-training "
-"tasks, which consists of a series of `bin` and `meta` files. Let's take "
-"the Alpaca dataset as an example to explain the data preparation process "
-"for fine-tuning."
-
-#: ../../../usage.md:58
-msgid ""
-"下载 [Alpaca 数据集](https://github.com/tatsu-"
-"lab/stanford_alpaca/blob/main/alpaca_data.json)"
-msgstr ""
-"Download the [Alpaca dataset](https://github.com/tatsu-"
-"lab/stanford_alpaca/blob/main/alpaca_data.json)."
-
-#: ../../../usage.md:60
-msgid "对 Alpaca 数据进行 tokenize,使用以下命令"
-msgstr "Tokenize the Alpaca dataset using the following command:"
-
-#: ../../../usage.md:66
-msgid "建议用户参考 alpaca_tokenizer.py 编写新的脚本对自己的数据集进行 tokenize"
-msgstr ""
-"It is recommended that users refer to alpaca_tokenizer.py to write new "
-"scripts to tokenize their own datasets"
-
-#: ../../../usage.md:68
-msgid "训练配置"
-msgstr "Training Configuration"
-
-#: ../../../usage.md:70
-#, fuzzy
-msgid "以 7B Demo 的配置文件`configs/7B_sft.py`为例:"
-msgstr ""
-"Taking the configuration file `configs/7B_sft.py` for the 7B demo as an "
-"example,"
-
-#: ../../../usage.md:237
-msgid "接下来将详细介绍启动一个模型训练所需要进行的数据、模型、并行和监控等相关的配置。"
-msgstr ""
-"let's discuss the data, model, parallel and monitoring configurations "
-"required to start a model training."
-
-#: ../../../usage.md:239
-msgid "数据配置"
-msgstr "Data Configuration"
-
-#: ../../../usage.md:240
-msgid "数据相关的关键参数配置及释义如下所示:"
-msgstr "Here are the key parameters and their explanations for data configuration:"
-
-#: ../../../usage.md:255
-msgid "![pack_into_one](./imgs/pack_into_one.png)"
-msgstr ""
-
-#: ../../../usage.md:255
-msgid "pack_into_one"
-msgstr ""
-
-#: ../../../usage.md:258
-msgid "目前支持传入数据集文件路径`train_folder`,且要求文件格式如下:"
-msgstr ""
-"Currently, it supports passing the dataset file path `train_folder`, and "
-"the file format is required to be as follows:"
-
-#: ../../../usage.md:265
-msgid "数据集的详细内容可参考``数据准备``模块相关的介绍。"
-msgstr ""
-"For detailed information about the dataset, please refer to the \"Data "
-"Preparation\" section."
-
-#: ../../../usage.md:267
-msgid "模型配置"
-msgstr "Model Configuration"
-
-#: ../../../usage.md:269
-msgid "如果在启动训练时要加载模型 `checkpoint`,可进行如下相关配置:"
-msgstr ""
-"If you want to load a model checkpoint when starting the training, you "
-"can configure it as follows:"
-
-#: ../../../usage.md:282
-msgid "注意:"
-msgstr "Note:"
-
-#: ../../../usage.md:283
-msgid "路径若以 `local:` 为前缀,则存储在本地文件系统;若以 `boto3:` 为前缀,则存储在远程 oss 上"
-msgstr ""
-"If the path starts with `local:`, it means the file is stored in the "
-"local file system. If it starts with `boto3:`, it means the file is "
-"stored in the remote OSS."
-
-#: ../../../usage.md:285
-msgid "模型相关关键参数配置如下所示:"
-msgstr "The configuration for the model is as follows:"
-
-#: ../../../usage.md:309
-msgid "注意:用户可自定义模型类型名和模型结构,并配置相对应的模型参数。通过`utils/registry.py`下的`MODEL_INITIALIZER`对象进行模型初始化函数接口注册,在训练主函数`train.py`中初始化模型时,可通过`model_type`配置获取指定的模型初始化接口函数。"
-msgstr ""
-"Note: Users can customize the model type name and model structure, and "
-"configure the corresponding model parameters. The model initialization "
-"function interface can be registered through the `MODEL_INITIALIZER` "
-"object in `utils/registry.py`. When initializing the model in the "
-"training main function `train.py`, the specified model initialization "
-"interface function can be obtained through the `model_type` "
-"configuration."
-
-#: ../../../usage.md:311
-msgid ""
-"*如果基于 InternLM 7B继续训练,可以参考 "
-"[ModelZoo](https://github.com/InternLM/InternLM/tree/main#model-zoo) 中 "
-"OpenXLab 链接下载权重*"
-msgstr ""
-"*If you want to start training based on InternLM 7B, you can refer to "
-"OpenXLab [ModelZoo](https://github.com/InternLM/InternLM/tree/main#model-"
-"zoo) to download weights*."
-
-#: ../../../usage.md:313
-msgid "并行配置"
-msgstr "Parallel Configuration"
-
-#: ../../../usage.md:315
-msgid "训练并行配置样例如下:"
-msgstr "Training parallel configuration example:"
-
-#: ../../../usage.md:324
-msgid "zero1:zero 并行策略,分如下三种情况,默认值为 -1"
-msgstr ""
-"zero1: zero parallel strategy, divided into the following three cases, "
-"default value is -1"
-
-#: ../../../usage.md:325
-msgid "当`zero1 <= 0`,则 zero1 进程组的大小等于数据并行进程组的大小,因此优化器状态参数将在数据并行范围内分配"
-msgstr ""
-"When `zero1 <= 0`, the size of the zero1 process group is equal to the "
-"size of the data parallel process group, so the optimizer state "
-"parameters will be split within the data parallel range."
-
-#: ../../../usage.md:326
-msgid "当`zero1 == 1`,则不使用 zero1 ,所有数据并行组保留完整的优化器状态参数"
-msgstr ""
-"When `zero1 == 1`, zero1 is not used, and all data parallel groups retain"
-" the complete optimizer state parameters."
-
-#: ../../../usage.md:327
-msgid "当`zero1 > 1`且`zero1 <= data_parallel_world_size`,则 zero1 进程组是数据并行进程组的子集"
-msgstr ""
-"When `zero1 > 1` and `zero1 <= data_parallel_world_size`, the zero1 "
-"process group is a subset of the data parallel process group."
-
-#: ../../../usage.md:328
-msgid "tensor:张量并行大小,通常是每个节点的 GPU 数量,默认值为 1"
-msgstr ""
-"tensor: tensor parallel size, usually the number of GPUs per node, "
-"default is 1"
-
-#: ../../../usage.md:329
-msgid "pipeline:流水线并行策略"
-msgstr "pipeline: pipeline parallel strategy"
-
-#: ../../../usage.md:330
-msgid "size:流水线并行大小,默认值为 1"
-msgstr "size: pipeline parallel size, the default value is 1"
-
-#: ../../../usage.md:331
-msgid "interleaved_overlap:bool 类型,交错式调度时,开启或关闭通信优化,默认值为关闭"
-msgstr ""
-"interleaved_overlap: bool type, when interleaved scheduling, enable or "
-"disable communication optimization, the default value is False"
-
-#: ../../../usage.md:332
-msgid "sequence_parallel:是否开启序列化并行,默认值为 False"
-msgstr ""
-"sequence_parallel: Whether to enable sequence parallelism, the default "
-"value is False"
-
-#: ../../../usage.md:334
-msgid "注意:`数据并行大小 = 总的 GPU 数目 / 流水线并行大小 / 张量并行大小`"
-msgstr ""
-"Note: `Data parallel size = Total number of GPUs / Pipeline parallel size"
-" / Tensor parallel size`"
-
-#: ../../../usage.md:336
-msgid "启动训练"
-msgstr "Start Training"
-
-#: ../../../usage.md:338
-msgid "完成了以上数据集准备和相关训练配置后,可启动 Demo 训练。接下来分别以 slurm 和 torch 环境为例,介绍训练启动方式。"
-msgstr ""
-"After completing the data preparation and relevant training "
-"configurations mentioned above, you can start the demo training. The "
-"following examples demonstrate how to start the training in both slurm "
-"and torch environments."
-
-#: ../../../usage.md:340
-msgid "若在 slurm 上启动分布式运行环境,多节点 16 卡的运行命令如下所示:"
-msgstr ""
-"If you want to start distributed training on slurm with 16 GPUs across "
-"multiple nodes, use the following command:"
-
-#: ../../../usage.md:345
-msgid "若在 torch 上启动分布式运行环境,单节点 8 卡的运行命令如下所示:"
-msgstr ""
-"If you want to start distributed training on torch with 8 GPUs on a "
-"single node, use the following command:"
-
-#: ../../../usage.md:350
-msgid "运行结果"
-msgstr "Training Results"
-
-#: ../../../usage.md:352
-msgid "以 slurm 上单机 8 卡的 Demo 训练配置为例,训练结果日志展示如下:"
-msgstr ""
-"Taking the configuration of the demo training on a single machine with 8 "
-"GPUs on slurm as an example, the training result log is shown below:"
-
-#~ msgid "`load_model_only_folder`与`load_ckpt_folder`不能同时设置"
-#~ msgstr ""
-#~ "`load_model_only_folder` and `load_ckpt_folder` "
-#~ "cannot be set at the same time."
diff --git a/doc/code-docs/make.bat b/doc/code-docs/make.bat
deleted file mode 100644
index 747ffb7..0000000
--- a/doc/code-docs/make.bat
+++ /dev/null
@@ -1,35 +0,0 @@
-@ECHO OFF
-
-pushd %~dp0
-
-REM Command file for Sphinx documentation
-
-if "%SPHINXBUILD%" == "" (
- set SPHINXBUILD=sphinx-build
-)
-set SOURCEDIR=source
-set BUILDDIR=build
-
-%SPHINXBUILD% >NUL 2>NUL
-if errorlevel 9009 (
- echo.
- echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
- echo.installed, then set the SPHINXBUILD environment variable to point
- echo.to the full path of the 'sphinx-build' executable. Alternatively you
- echo.may add the Sphinx directory to PATH.
- echo.
- echo.If you don't have Sphinx installed, grab it from
- echo.https://www.sphinx-doc.org/
- exit /b 1
-)
-
-if "%1" == "" goto help
-
-%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
-goto end
-
-:help
-%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
-
-:end
-popd
diff --git a/doc/code-docs/requirements.txt b/doc/code-docs/requirements.txt
deleted file mode 100644
index f44a0ae..0000000
--- a/doc/code-docs/requirements.txt
+++ /dev/null
@@ -1,11 +0,0 @@
-Sphinx
-sphinx-autobuild
-sphinx_rtd_theme
-sphinx_markdown_tables
-autodoc_pydantic==1.9
-enum_tools
-numpy
-torch
-tqdm
-pyecharts
-myst-parser
diff --git a/doc/code-docs/source/checkpoint.rst b/doc/code-docs/source/checkpoint.rst
deleted file mode 100644
index ee4f037..0000000
--- a/doc/code-docs/source/checkpoint.rst
+++ /dev/null
@@ -1,12 +0,0 @@
-模型保存
-===================
-
-InternLM 使用 ``internlm.utils.model_checkpoint.CheckpointManager`` 来管理模型保存。其中,可以使用 ``CheckpointManager.try_save_checkpoint(train_state)`` 来保存指定 step 的模型状态。
-
-InternLM支持启动时自动加载最新的模型备份,并在接收信号退出训练时自动进行模型备份。
-
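-A minimal usage sketch of this flow is shown below. It is illustrative only: the constructor arguments and surrounding variables are assumptions, and the authoritative interface is the ``CheckpointManager`` API documented below.
-
-.. code-block:: python
-
-    # Sketch only: constructor arguments and surrounding variables (ckpt_config,
-    # model, optimizer, train_state, total_steps) are assumed to come from the
-    # training setup; see the CheckpointManager API below for the real signature.
-    from internlm.utils.model_checkpoint import CheckpointManager
-
-    ckpt_manager = CheckpointManager(ckpt_config, model=model, optimizer=optimizer)
-
-    for step in range(total_steps):
-        ...  # one training iteration (forward / backward / optimizer step)
-        # Save a checkpoint for this step when the configured conditions are met.
-        ckpt_manager.try_save_checkpoint(train_state)
-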
-Checkpointing
--------------
-
-.. autoclass:: internlm.utils.model_checkpoint.CheckpointManager
- :members:
diff --git a/doc/code-docs/source/conf.py b/doc/code-docs/source/conf.py
deleted file mode 100644
index c752047..0000000
--- a/doc/code-docs/source/conf.py
+++ /dev/null
@@ -1,103 +0,0 @@
-# Configuration file for the Sphinx documentation builder.
-#
-# For the full list of built-in configuration values, see the documentation:
-# https://www.sphinx-doc.org/en/master/usage/configuration.html
-
-# -- Project information -----------------------------------------------------
-# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
-
-import os
-import sys
-
-project = "InternLM"
-copyright = "2023, InternLM Team"
-author = "InternLM Team"
-
-with open("../../../version.txt", "r") as f:
- release = f.readline().rstrip()
-
-master_doc = "index"
-
-autodoc_member_order = "bysource"
-
-# -- General configuration ---------------------------------------------------
-# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
-
-extensions = [
- "sphinx_rtd_theme",
- "sphinx.ext.viewcode",
- "sphinx.ext.autodoc",
- "sphinxcontrib.autodoc_pydantic",
- "sphinx.ext.autosectionlabel",
- "sphinx.ext.napoleon",
- "myst_parser",
-]
-
-pygments_style = "sphinx"
-
-# autodoc_pydantic config
-autodoc_pydantic_model_show_field_summary = False
-autodoc_pydantic_field_signature_prefix = " "
-autodoc_pydantic_model_signature_prefix = "class"
-autodoc_pydantic_model_show_json = False
-autodoc_pydantic_model_show_config_summary = False
-autodoc_pydantic_model_show_config_member = False
-autodoc_pydantic_model_show_validator_summary = False
-autodoc_pydantic_model_show_validator_members = False
-autodoc_pydantic_model_summary_list_order = "bysource"
-autodoc_pydantic_model_member_order = "bysource"
-autodoc_pydantic_field_list_validators = False
-
-# Napoleon settings
-napoleon_google_docstring = True
-napoleon_numpy_docstring = True
-napoleon_include_init_with_doc = False
-napoleon_include_private_with_doc = False
-napoleon_include_special_with_doc = True
-napoleon_use_admonition_for_examples = False
-napoleon_use_admonition_for_notes = False
-napoleon_use_admonition_for_references = False
-napoleon_use_ivar = False
-napoleon_use_param = True
-napoleon_use_rtype = True
-napoleon_preprocess_types = False
-napoleon_type_aliases = None
-napoleon_attr_annotations = True
-
-templates_path = ["_templates"]
-
-exclude_patterns = []
-
-# -- Options for HTML output -------------------------------------------------
-# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
-
-html_theme = "sphinx_rtd_theme"
-html_static_path = []
-
-# GitHub integration
-html_context = {
- "display_github": True,
- "github_user": "InternLM",
- "github_repo": "InternLM",
- "github_version": "main",
- "conf_py_path": "/doc/code-docs/source/",
-}
-
-sys.path.insert(0, os.path.abspath("../../../"))
-
-# Prepend module names to class descriptions
-add_module_names = True
-
-autoclass_content = "class"
-
-autodoc_mock_imports = [
- "apex",
- "torch",
- "numpy",
-]
-
-# support multi-language docs
-language = "zh_CN"
-locale_dirs = ["../locales/"] # path is example but recommended.
-gettext_compact = False # optional.
-gettext_uuid = False # optional.
diff --git a/doc/code-docs/source/example/30B_demo.rst b/doc/code-docs/source/example/30B_demo.rst
deleted file mode 100644
index 47f8dcc..0000000
--- a/doc/code-docs/source/example/30B_demo.rst
+++ /dev/null
@@ -1,202 +0,0 @@
-30B Demo
-================
-
-训练配置
-----------------
-
-30B demo 训练配置文件样例如下:
-
-.. code-block:: python
-
- JOB_NAME = "30b_train"
-
- SEQ_LEN = 2048
- HIDDEN_SIZE = 6144
- NUM_ATTENTION_HEAD = 48
- MLP_RATIO = 8 / 3
- NUM_LAYER = 60
- VOCAB_SIZE = 103168
-
- MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
- # Ckpt folder format:
- # fs: 'local:/mnt/nfs/XXX'
- SAVE_CKPT_FOLDER = "local:llm_ckpts"
- LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
-
- # boto3 Ckpt folder format:
- # import os
- # BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint
- # SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm"
- # LOAD_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm/snapshot/1/"
- CHECKPOINT_EVERY = 50
- ckpt = dict(
- enable_save_ckpt=False, # enable ckpt save.
- save_ckpt_folder=SAVE_CKPT_FOLDER, # Path to save training ckpt.
- # load_ckpt_folder=LOAD_CKPT_FOLDER, # Ckpt path to resume training(load weights and scheduler/context states).
- # load_model_only_folder=MODEL_ONLY_FOLDER, # Path to initialize with given model weights.
-        load_optimizer=True, # Whether to load optimizer states when continuing training.
- checkpoint_every=CHECKPOINT_EVERY,
-        async_upload=True, # async ckpt upload. (only works for boto3 ckpt)
-        async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/", # path for temporary files during asynchronous upload.
- snapshot_ckpt_folder="/".join([SAVE_CKPT_FOLDER, "snapshot"]), # directory for snapshot ckpt storage path.
- oss_snapshot_freq=int(CHECKPOINT_EVERY / 2), # snapshot ckpt save frequency.
- )
-
- TRAIN_FOLDER = "/path/to/dataset"
- VALID_FOLDER = "/path/to/dataset"
- data = dict(
- seq_len=SEQ_LEN,
- # micro_num means the number of micro_batch contained in one gradient update
- micro_num=4,
- # packed_length = micro_bsz * SEQ_LEN
- micro_bsz=2,
- # defaults to the value of micro_num
- valid_micro_num=4,
- # defaults to 0, means disable evaluate
- valid_every=50,
- pack_sample_into_one=False,
- total_steps=50000,
- skip_batches="",
- rampup_batch_size="",
- # Datasets with less than 50 rows will be discarded
- min_length=50,
- # train_folder=TRAIN_FOLDER,
- # valid_folder=VALID_FOLDER,
- )
-
- grad_scaler = dict(
- fp16=dict(
- # the initial loss scale, defaults to 2**16
- initial_scale=2**16,
- # the minimum loss scale, defaults to None
- min_scale=1,
- # the number of steps to increase loss scale when no overflow occurs
- growth_interval=1000,
- ),
- # the multiplication factor for increasing loss scale, defaults to 2
- growth_factor=2,
- # the multiplication factor for decreasing loss scale, defaults to 0.5
- backoff_factor=0.5,
- # the maximum loss scale, defaults to None
- max_scale=2**24,
- # the number of overflows before decreasing loss scale, defaults to 2
- hysteresis=2,
- )
-
- hybrid_zero_optimizer = dict(
-        # Enable low_level_optimizer overlap_communication
- overlap_sync_grad=True,
- overlap_sync_param=True,
- # bucket size for nccl communication params
- reduce_bucket_size=512 * 1024 * 1024,
- # grad clipping
- clip_grad_norm=1.0,
- )
-
- loss = dict(
- label_smoothing=0,
- )
-
- adam = dict(
- lr=1e-4,
- adam_beta1=0.9,
- adam_beta2=0.95,
- adam_beta2_c=0,
- adam_eps=1e-8,
- weight_decay=0.01,
- )
-
- lr_scheduler = dict(
- total_steps=data["total_steps"],
- init_steps=0, # optimizer_warmup_step
- warmup_ratio=0.01,
- eta_min=1e-5,
- last_epoch=-1,
- )
-
- beta2_scheduler = dict(
- init_beta2=adam["adam_beta2"],
- c=adam["adam_beta2_c"],
- cur_iter=-1,
- )
-
- model = dict(
-        checkpoint=False, # The proportion of layers for activation checkpointing, the optional values are True/False/[0-1]
- num_attention_heads=NUM_ATTENTION_HEAD,
- embed_split_hidden=True,
- vocab_size=VOCAB_SIZE,
- embed_grad_scale=1,
- parallel_output=True,
- hidden_size=HIDDEN_SIZE,
- num_layers=NUM_LAYER,
- mlp_ratio=MLP_RATIO,
- apply_post_layer_norm=False,
- dtype="torch.float16", # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
- norm_type="rmsnorm",
- layer_norm_epsilon=1e-5,
- use_flash_attn=True,
- num_chunks=1, # if num_chunks > 1, interleaved pipeline scheduler is used.
- )
- """
- zero1 parallel:
- 1. if zero1 <= 0, The size of the zero process group is equal to the size of the dp process group,
- so parameters will be divided within the range of dp.
- 2. if zero1 == 1, zero is not used, and all dp groups retain the full amount of model parameters.
- 3. zero1 > 1 and zero1 <= dp world size, the world size of zero is a subset of dp world size.
- For smaller models, it is usually a better choice to split the parameters within nodes with a setting <= 8.
- pipeline parallel (dict):
- 1. size: int, the size of pipeline parallel.
- 2. interleaved_overlap: bool, enable/disable communication overlap when using interleaved pipeline scheduler.
- tensor parallel: tensor parallel size, usually the number of GPUs per node.
- """
- parallel = dict(
- zero1=-1,
- tensor=4,
- pipeline=dict(size=1, interleaved_overlap=True),
- sequence_parallel=False,
- )
-
- cudnn_deterministic = False
- cudnn_benchmark = False
-
-
-启动训练
-----------------
-
-完成以上训练配置后,可启动模型训练,以在 ``slurm`` 平台上为例,启动两节点 16GPU 的训练命令如下所示:
-
-.. code-block:: bash
-
- srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./configs/30B_sft.py
-
-训练结果
-----------------
-
-基于以上训练配置和启动命令,两节点 16GPU 下的模型训练部分日志展示如下:
-
-.. code-block:: bash
-
- 2023-09-06 10:29:26,629 INFO parallel_context.py:508 in set_device -- process rank 10 is bound to host:HOST-10-140-66-20 device: 2
- 2023-09-06 10:29:26,632 INFO parallel_context.py:508 in set_device -- process rank 11 is bound to host:HOST-10-140-66-20 device: 3
- 2023-09-06 10:29:26,634 INFO parallel_context.py:508 in set_device -- process rank 12 is bound to host:HOST-10-140-66-20 device: 4
- 2023-09-06 10:29:26,636 INFO parallel_context.py:508 in set_device -- process rank 9 is bound to host:HOST-10-140-66-20 device: 1
- 2023-09-06 10:29:26,640 INFO parallel_context.py:508 in set_device -- process rank 15 is bound to host:HOST-10-140-66-20 device: 7
- 2023-09-06 10:29:26,639 INFO parallel_context.py:508 in set_device -- process rank 0 is bound to host:HOST-10-140-66-9 device: 0
- 2023-09-06 10:29:26,641 INFO parallel_context.py:508 in set_device -- process rank 2 is bound to host:HOST-10-140-66-9 device: 2
- 2023-09-06 10:29:26,643 INFO parallel_context.py:508 in set_device -- process rank 5 is bound to host:HOST-10-140-66-9 device: 5
- 2023-09-06 10:29:26,645 INFO parallel_context.py:508 in set_device -- process rank 6 is bound to host:HOST-10-140-66-9 device: 6
- 2023-09-06 10:29:26,661 INFO parallel_context.py:508 in set_device -- process rank 13 is bound to host:HOST-10-140-66-20 device: 5
- 2023-09-06 10:29:26,707 INFO parallel_context.py:508 in set_device -- process rank 1 is bound to host:HOST-10-140-66-9 device: 1
- 2023-09-06 10:29:26,826 INFO parallel_context.py:508 in set_device -- process rank 4 is bound to host:HOST-10-140-66-9 device: 4
- 2023-09-06 10:29:26,871 INFO parallel_context.py:508 in set_device -- process rank 7 is bound to host:HOST-10-140-66-9 device: 7
- 2023-09-06 10:29:26,932 INFO parallel_context.py:508 in set_device -- process rank 3 is bound to host:HOST-10-140-66-9 device: 3
- 2023-09-06 10:29:27,156 INFO parallel_context.py:508 in set_device -- process rank 14 is bound to host:HOST-10-140-66-20 device: 6
- 2023-09-06 10:29:27,271 INFO parallel_context.py:508 in set_device -- process rank 8 is bound to host:HOST-10-140-66-20 device: 0
- 2023-09-06 10:29:32,060 INFO launch.py:329 in launch -- Distributed environment is initialized, data parallel size: 4, pipeline parallel size: 1, tensor parallel size: 4
- 2023-09-06 10:30:06,141 INFO hybrid_zero_optim.py:291 in _partition_param_list -- Number of elements on ranks: [1782007296, 1812307968, 1812307968, 1706469888], rank:0
- 2023-09-06T10:30:38.216+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=15224 : tflops=40.00268401421643 step=0 loss=11.548227310180664 tgs (tokens/gpu/second)=227.37 lr=9.779754323328192e-05 loss_scale=65536.0 grad_norm={'0_default': 61.5836932112004} micro_num=4 num_consumed_tokens=65536 inf_nan_skip_batches=0 num_samples_in_batch=18 largest_length=2048 largest_batch=6 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=12.51 acc=0.0 perplexity=104121.5547 acc/en=0.0 acc/cn=0.0 acc/code=0.0 tokens/en=60571 tokens/cn=0 tokens/code=0 loss_from_metric=11.5533 loss/en=11.5533 loss/cn=nan loss/code=nan
- 2023-09-06T10:30:46.343+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=15224 : tflops=89.00005814543725 step=1 loss=6.05580997467041 tgs (tokens/gpu/second)=505.86 lr=9.140576474687264e-05 loss_scale=65536.0 grad_norm={'0_default': 27.397946290506887} micro_num=4 num_consumed_tokens=131072 inf_nan_skip_batches=0 num_samples_in_batch=19 largest_length=2048 largest_batch=6 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=7.91 acc=0.0885 perplexity=405.4076 acc/en=0.0885 acc/cn=0.0 acc/code=0.0 tokens/en=60265 tokens/cn=0 tokens/code=0 loss_from_metric=6.0049 loss/en=6.0049 loss/cn=nan loss/code=nan
- 2023-09-06T10:30:51.443+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=15224 : tflops=142.5138940898651 step=2 loss=5.054169654846191 tgs (tokens/gpu/second)=810.03 lr=8.14503363531613e-05 loss_scale=65536.0 grad_norm={'0_default': 10.438111430093606} micro_num=4 num_consumed_tokens=196608 inf_nan_skip_batches=0 num_samples_in_batch=17 largest_length=2048 largest_batch=5 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=4.87 acc=0.0715 perplexity=184.2986 acc/en=0.0715 acc/cn=0.0 acc/code=0.0 tokens/en=60244 tokens/cn=0 tokens/code=0 loss_from_metric=5.2166 loss/en=5.2166 loss/cn=nan loss/code=nan
- 2023-09-06T10:30:56.509+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=15224 : tflops=143.56131674769466 step=3 loss=4.662276268005371 tgs (tokens/gpu/second)=815.98 lr=6.890576474687264e-05 loss_scale=65536.0 grad_norm={'0_default': 9.15959986316653} micro_num=4 num_consumed_tokens=262144 inf_nan_skip_batches=0 num_samples_in_batch=17 largest_length=2048 largest_batch=5 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=4.83 acc=0.0775 perplexity=102.6568 acc/en=0.0775 acc/cn=0.0 acc/code=0.0 tokens/en=60328 tokens/cn=0 tokens/code=0 loss_from_metric=4.6314 loss/en=4.6314 loss/cn=nan loss/code=nan
- 2023-09-06T10:31:01.552+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=15224 : tflops=143.85087291011183 step=4 loss=4.020431041717529 tgs (tokens/gpu/second)=817.63 lr=5.500000000000001e-05 loss_scale=65536.0 grad_norm={'0_default': 6.873464794412589} micro_num=4 num_consumed_tokens=327680 inf_nan_skip_batches=0 num_samples_in_batch=22 largest_length=1893 largest_batch=8 smallest_batch=4 adam_beta2=0.95 fwd_bwd_time=4.82 acc=0.0701 perplexity=69.1167 acc/en=0.0701 acc/cn=0.0 acc/code=0.0 tokens/en=61028 tokens/cn=0 tokens/code=0 loss_from_metric=4.2358 loss/en=4.2358 loss/cn=nan loss/code=nan
- 2023-09-06T10:31:06.830+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=15224 : tflops=142.8966468353613 step=5 loss=3.733311891555786 tgs (tokens/gpu/second)=812.2 lr=4.109423525312737e-05 loss_scale=65536.0 grad_norm={'0_default': 5.811005102730085} micro_num=4 num_consumed_tokens=393216 inf_nan_skip_batches=0 num_samples_in_batch=13 largest_length=2048 largest_batch=4 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=4.85 acc=0.0688 perplexity=46.298 acc/en=0.0688 acc/cn=0.0 acc/code=0.0 tokens/en=61004 tokens/cn=0 tokens/code=0 loss_from_metric=3.8351 loss/en=3.8351 loss/cn=nan loss/code=nan
diff --git a/doc/code-docs/source/example/7B_demo.rst b/doc/code-docs/source/example/7B_demo.rst
deleted file mode 100644
index 7815417..0000000
--- a/doc/code-docs/source/example/7B_demo.rst
+++ /dev/null
@@ -1,192 +0,0 @@
-7B Demo
-================
-
-训练配置
-----------------
-
-7B demo 的训练配置文件样例如下:
-
-.. code-block:: python
-
- JOB_NAME = "7b_train"
-
- SEQ_LEN = 2048
- HIDDEN_SIZE = 4096
- NUM_ATTENTION_HEAD = 32
- MLP_RATIO = 8 / 3
- NUM_LAYER = 32
- VOCAB_SIZE = 103168
-
- MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
- # Ckpt folder format:
- # fs: 'local:/mnt/nfs/XXX'
- SAVE_CKPT_FOLDER = "local:llm_ckpts"
- LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
-
- # boto3 Ckpt folder format:
- # import os
- # BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint
- # SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm"
- # LOAD_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm/snapshot/1/"
- CHECKPOINT_EVERY = 50
- ckpt = dict(
- enable_save_ckpt=False, # enable ckpt save.
- save_ckpt_folder=SAVE_CKPT_FOLDER, # Path to save training ckpt.
- # load_ckpt_folder=LOAD_CKPT_FOLDER, # Ckpt path to resume training(load weights and scheduler/context states).
- # load_model_only_folder=MODEL_ONLY_FOLDER, # Path to initialize with given model weights.
-        load_optimizer=True, # Whether to load optimizer states when continuing training.
- checkpoint_every=CHECKPOINT_EVERY,
-        async_upload=True, # async ckpt upload. (only works for boto3 ckpt)
-        async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/", # path for temporary files during asynchronous upload.
- snapshot_ckpt_folder="/".join([SAVE_CKPT_FOLDER, "snapshot"]), # directory for snapshot ckpt storage path.
- oss_snapshot_freq=int(CHECKPOINT_EVERY / 2), # snapshot ckpt save frequency.
- )
-
- TRAIN_FOLDER = "/path/to/dataset"
- VALID_FOLDER = "/path/to/dataset"
- data = dict(
- seq_len=SEQ_LEN,
- # micro_num means the number of micro_batch contained in one gradient update
- micro_num=4,
- # packed_length = micro_bsz * SEQ_LEN
- micro_bsz=2,
- # defaults to the value of micro_num
- valid_micro_num=4,
- # defaults to 0, means disable evaluate
- valid_every=50,
- pack_sample_into_one=False,
- total_steps=50000,
- skip_batches="",
- rampup_batch_size="",
- # Datasets with less than 50 rows will be discarded
- min_length=50,
- # train_folder=TRAIN_FOLDER,
- # valid_folder=VALID_FOLDER,
- )
-
- grad_scaler = dict(
- fp16=dict(
- # the initial loss scale, defaults to 2**16
- initial_scale=2**16,
- # the minimum loss scale, defaults to None
- min_scale=1,
- # the number of steps to increase loss scale when no overflow occurs
- growth_interval=1000,
- ),
- # the multiplication factor for increasing loss scale, defaults to 2
- growth_factor=2,
- # the multiplication factor for decreasing loss scale, defaults to 0.5
- backoff_factor=0.5,
- # the maximum loss scale, defaults to None
- max_scale=2**24,
- # the number of overflows before decreasing loss scale, defaults to 2
- hysteresis=2,
- )
-
- hybrid_zero_optimizer = dict(
-        # Enable low_level_optimizer overlap_communication
- overlap_sync_grad=True,
- overlap_sync_param=True,
- # bucket size for nccl communication params
- reduce_bucket_size=512 * 1024 * 1024,
- # grad clipping
- clip_grad_norm=1.0,
- )
-
- loss = dict(
- label_smoothing=0,
- )
-
- adam = dict(
- lr=1e-4,
- adam_beta1=0.9,
- adam_beta2=0.95,
- adam_beta2_c=0,
- adam_eps=1e-8,
- weight_decay=0.01,
- )
-
- lr_scheduler = dict(
- total_steps=data["total_steps"],
- init_steps=0, # optimizer_warmup_step
- warmup_ratio=0.01,
- eta_min=1e-5,
- last_epoch=-1,
- )
-
- beta2_scheduler = dict(
- init_beta2=adam["adam_beta2"],
- c=adam["adam_beta2_c"],
- cur_iter=-1,
- )
-
- model = dict(
-        checkpoint=False, # The proportion of layers for activation checkpointing, the optional values are True/False/[0-1]
- num_attention_heads=NUM_ATTENTION_HEAD,
- embed_split_hidden=True,
- vocab_size=VOCAB_SIZE,
- embed_grad_scale=1,
- parallel_output=True,
- hidden_size=HIDDEN_SIZE,
- num_layers=NUM_LAYER,
- mlp_ratio=MLP_RATIO,
- apply_post_layer_norm=False,
- dtype="torch.float16", # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
- norm_type="rmsnorm",
- layer_norm_epsilon=1e-5,
- use_flash_attn=True,
- num_chunks=1, # if num_chunks > 1, interleaved pipeline scheduler is used.
- )
- """
- zero1 parallel:
- 1. if zero1 <= 0, The size of the zero process group is equal to the size of the dp process group,
- so parameters will be divided within the range of dp.
- 2. if zero1 == 1, zero is not used, and all dp groups retain the full amount of model parameters.
- 3. zero1 > 1 and zero1 <= dp world size, the world size of zero is a subset of dp world size.
- For smaller models, it is usually a better choice to split the parameters within nodes with a setting <= 8.
- pipeline parallel (dict):
- 1. size: int, the size of pipeline parallel.
- 2. interleaved_overlap: bool, enable/disable communication overlap when using interleaved pipeline scheduler.
- tensor parallel: tensor parallel size, usually the number of GPUs per node.
- """
- parallel = dict(
- zero1=8,
- pipeline=dict(size=1, interleaved_overlap=True),
- sequence_parallel=False,
- )
-
- cudnn_deterministic = False
- cudnn_benchmark = False
-
-启动训练
-----------------
-
-完成以上训练配置后,可启动模型训练,以在 ``slurm`` 平台上为例,启动单节点 8GPU 的训练命令如下所示:
-
-.. code-block:: bash
-
- srun -p internllm -N 1 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./configs/7B_sft.py
-
-训练结果
-----------------
-
-基于以上训练配置和启动命令,单节点 8GPU 下的模型训练部分日志展示如下:
-
-.. code-block:: bash
-
- 2023-09-05 11:47:44,649 INFO parallel_context.py:508 in set_device -- process rank 4 is bound to host:SH-IDC1-10-140-1-110 device: 4
- 2023-09-05 11:47:44,650 INFO parallel_context.py:508 in set_device -- process rank 3 is bound to host:SH-IDC1-10-140-1-110 device: 3
- 2023-09-05 11:47:44,651 INFO parallel_context.py:508 in set_device -- process rank 6 is bound to host:SH-IDC1-10-140-1-110 device: 6
- 2023-09-05 11:47:44,652 INFO parallel_context.py:508 in set_device -- process rank 7 is bound to host:SH-IDC1-10-140-1-110 device: 7
- 2023-09-05 11:47:44,652 INFO parallel_context.py:508 in set_device -- process rank 5 is bound to host:SH-IDC1-10-140-1-110 device: 5
- 2023-09-05 11:47:44,652 INFO parallel_context.py:508 in set_device -- process rank 1 is bound to host:SH-IDC1-10-140-1-110 device: 1
- 2023-09-05 11:47:44,652 INFO parallel_context.py:508 in set_device -- process rank 2 is bound to host:SH-IDC1-10-140-1-110 device: 2
- 2023-09-05 11:47:44,652 INFO parallel_context.py:508 in set_device -- process rank 0 is bound to host:SH-IDC1-10-140-1-110 device: 0
- 2023-09-05 11:47:51,006 INFO launch.py:354 in launch -- Distributed environment is initialized, data parallel size: 8, pipeline parallel size: 1, tensor parallel size: 1
- 2023-09-05 11:49:09,855 INFO hybrid_zero_optim.py:294 in _partition_param_list -- Number of elements on ranks: [894509056, 944865280, 966909952, 966909952, 966909952, 944865280, 966909952, 670068736], rank:0
- 2023-09-05T11:49:58.225+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=6794 : tflops=63.283263603947816 step=0 loss=11.641494750976562 tgs (tokens/gpu/second)=1424.93 lr=4.0000000000000003e-07 loss_scale=65536.0 grad_norm={'0_default': 66.51907327507652} micro_num=4 num_consumed_tokens=131072 inf_nan_skip_batches=0 num_samples_in_batch=19 largest_length=2048 largest_batch=6 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=6.87 acc=0.0 perplexity=112181.7188 acc/en=0.0 acc/cn=0.0 acc/code=0.0 tokens/en=120836 tokens/cn=0 tokens/code=0 loss_from_metric=11.6279 loss/en=11.6279 loss/cn=nan loss/code=nan
- 2023-09-05T11:50:02.553+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=6794 : tflops=171.92140761933035 step=1 loss=11.546792984008789 tgs (tokens/gpu/second)=3871.11 lr=6.000000000000001e-07 loss_scale=65536.0 grad_norm={'0_default': 64.47430144542088} micro_num=4 num_consumed_tokens=262144 inf_nan_skip_batches=0 num_samples_in_batch=16 largest_length=2048 largest_batch=5 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=4.14 acc=0.0 perplexity=103779.1406 acc/en=0.0 acc/cn=0.0 acc/code=0.0 tokens/en=120572 tokens/cn=0 tokens/code=0 loss_from_metric=11.55 loss/en=11.55 loss/cn=nan loss/code=nan
- 2023-09-05T11:50:06.504+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=6794 : tflops=186.0565203348341 step=2 loss=11.106071472167969 tgs (tokens/gpu/second)=4189.39 lr=8.000000000000001e-07 loss_scale=65536.0 grad_norm={'0_default': 62.520055376005146} micro_num=4 num_consumed_tokens=393216 inf_nan_skip_batches=0 num_samples_in_batch=16 largest_length=2048 largest_batch=6 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=3.82 acc=0.0001 perplexity=71139.6797 acc/en=0.0001 acc/cn=0.0 acc/code=0.0 tokens/en=122032 tokens/cn=0 tokens/code=0 loss_from_metric=11.1724 loss/en=11.1724 loss/cn=nan loss/code=nan
- 2023-09-05T11:50:10.487+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=6794 : tflops=185.48897918112567 step=3 loss=10.444510459899902 tgs (tokens/gpu/second)=4176.61 lr=1.0000000000000002e-06 loss_scale=65536.0 grad_norm={'0_default': 57.91057980979166} micro_num=4 num_consumed_tokens=524288 inf_nan_skip_batches=0 num_samples_in_batch=18 largest_length=2048 largest_batch=6 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=3.83 acc=0.0705 perplexity=39851.1289 acc/en=0.0705 acc/cn=0.0 acc/code=0.0 tokens/en=121125 tokens/cn=0 tokens/code=0 loss_from_metric=10.5929 loss/en=10.5929 loss/cn=nan loss/code=nan
- 2023-09-05T11:50:14.476+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=6794 : tflops=185.8751803758398 step=4 loss=9.798665046691895 tgs (tokens/gpu/second)=4185.31 lr=1.2000000000000002e-06 loss_scale=65536.0 grad_norm={'0_default': 48.1136933755285} micro_num=4 num_consumed_tokens=655360 inf_nan_skip_batches=0 num_samples_in_batch=14 largest_length=2048 largest_batch=4 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=3.82 acc=0.076 perplexity=18045.6699 acc/en=0.076 acc/cn=0.0 acc/code=0.0 tokens/en=121365 tokens/cn=0 tokens/code=0 loss_from_metric=9.8007 loss/en=9.8007 loss/cn=nan loss/code=nan
- 2023-09-05T11:50:18.442+08:00 INFO [training_internlm.py, line 413, in record_current_batch_training_metrics] - pid=6794 : tflops=185.6236609556878 step=5 loss=9.215429306030273 tgs (tokens/gpu/second)=4179.64 lr=1.4000000000000001e-06 loss_scale=65536.0 grad_norm={'0_default': 36.95489557069029} micro_num=4 num_consumed_tokens=786432 inf_nan_skip_batches=0 num_samples_in_batch=14 largest_length=2048 largest_batch=4 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=3.82 acc=0.0767 perplexity=8999.0869 acc/en=0.0767 acc/cn=0.0 acc/code=0.0 tokens/en=121223 tokens/cn=0 tokens/code=0 loss_from_metric=9.1049 loss/en=9.1049 loss/cn=nan loss/code=nan
diff --git a/doc/code-docs/source/example/index.rst b/doc/code-docs/source/example/index.rst
deleted file mode 100644
index 5437688..0000000
--- a/doc/code-docs/source/example/index.rst
+++ /dev/null
@@ -1,18 +0,0 @@
-训练样例
-================
-
-7B Demo
-------------
-
-.. toctree::
- :maxdepth: 2
-
- 7B_demo
-
-30B Demo
-------------
-
-.. toctree::
- :maxdepth: 2
-
- 30B_demo
diff --git a/doc/code-docs/source/index.rst b/doc/code-docs/source/index.rst
deleted file mode 100644
index c01ac54..0000000
--- a/doc/code-docs/source/index.rst
+++ /dev/null
@@ -1,95 +0,0 @@
-.. InternLM documentation master file, created by
- sphinx-quickstart on Mon Aug 28 17:33:28 2023.
- You can adapt this file completely to your liking, but it should at least
- contain the root `toctree` directive.
-
-
-InternLM
-========
-
-环境构建
--------------------
-
-.. toctree::
- :maxdepth: 2
-
- install
-
-快速上手
--------------------
-
-.. toctree::
- :maxdepth: 2
-
- usage
-
-训练构建
--------------------
-
-.. toctree::
- :maxdepth: 2
-
- initialize
-
-训练 API
--------------------
-
-.. toctree::
- :maxdepth: 2
-
- training
-
-并行训练
--------------------
-
-.. toctree::
- :maxdepth: 2
-
- parallel
-
-模型备份
---------------------
-
-.. toctree::
- :maxdepth: 2
-
- checkpoint
-
-性能分析
--------------------
-
-.. toctree::
- :maxdepth: 2
-
- profiler
-
-训练监控
--------------------
-
-.. toctree::
- :maxdepth: 2
-
- monitor
-
-训练样例
--------------------
-
-.. toctree::
- :maxdepth: 2
-
- example/index
-
-常见问题
--------------------
-
-.. toctree::
- :maxdepth: 2
-
- qa
-
-索引和表格
-==================
-
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`
diff --git a/doc/code-docs/source/initialize.rst b/doc/code-docs/source/initialize.rst
deleted file mode 100644
index fca87a5..0000000
--- a/doc/code-docs/source/initialize.rst
+++ /dev/null
@@ -1,108 +0,0 @@
-训练构建
-==============
-
-InternLM 的训练流程可以归纳为两个步骤:
-
-1. 初始化
-
- * 初始化模型、优化器、数据加载器、Trainer,生成不同种类的进程组,为混合并行的迭代训练做准备。
- * 初始化Logger、Checkpoint管理器、Monitor管理器、Profiler,对迭代训练的过程观察、预警、记录。
-
-2. 迭代训练
-
- * 根据配置文件定义的张量并行、流水线并行、数据并行的大小,加载训练引擎和调度器进行混合并行训练。
- * 在迭代训练中,调用 Trainer API 进行梯度置零,前向传播计算损失并反向传播,参数更新。
-
-.. figure:: ../../imgs/hybrid_parallel_training.png
- :scale: 45%
- :class: with-border
-
- InternLM训练流程图
-
-.. _InternLM-args:
-
-命令行参数解析
-----------------
-
-InternLM 使用 `argparse `_ 库来向InternLM运行时提供命令行参数配置。
-
-用户可使用 ``internlm.initialize.get_default_parser()`` 来获取 InternLM 的默认解析器,其中包含一些内置参数,用户可以向此解析器添加自定义参数。
-
-.. code-block:: python
-
- # Get InternLM default parser
- parser = internlm.initialize.get_default_parser()
- # Add new argument
- parser.add_argument("--user_arg", type=int, default=-1, help="arguments add by user.")
- cmd_args = parser.parse_args()
-
-.. autofunction:: internlm.initialize.get_default_parser
-
-
-.. _InternLM-model-init:
-
-模型初始化
--------------------------
-
-.. autofunction:: internlm.train.initialize_model
-
-InternLM 在配置文件中使用字段 ``model_type`` 和 ``model`` 来控制模型初始化过程。示例模型初始化配置定义如下:
-
-.. code-block:: python
-
- model_type = "INTERNLM" # default is "INTERNLM", used to register classes and modules for model initialization
- NUM_ATTENTION_HEAD = 32
- VOCAB_SIZE = 103168
- HIDDEN_SIZE = 4096
- NUM_LAYER = 32
- MLP_RATIO = 8 / 3
- model = dict(
-        checkpoint=False, # The proportion of layers for activation checkpointing, the optional values are True/False/[0-1]
- num_attention_heads=NUM_ATTENTION_HEAD,
- embed_split_hidden=True,
- vocab_size=VOCAB_SIZE,
- embed_grad_scale=1,
- parallel_output=True,
- hidden_size=HIDDEN_SIZE,
- num_layers=NUM_LAYER,
- mlp_ratio=MLP_RATIO,
- apply_post_layer_norm=False,
- dtype="torch.bfloat16", # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
- norm_type="rmsnorm",
- layer_norm_epsilon=1e-5,
- use_flash_attn=True,
- num_chunks=1, # if num_chunks > 1, interleaved pipeline scheduler is used.
- )
-
-- 字段 ``model_type`` 指明了要初始化的模型类型
-- 字段 ``model`` 中的参数指定了在模型初始化过程中的参数设置
-
-值得注意的是,用户可以定义新的模型类型,并使用装饰器 ``@MODEL_INITIALIZER.register_module`` 注册模型的初始化函数,其中 ``MODEL_INITIALIZER`` 是类 ``internlm.util.registry.Registry`` 的一个实例化对象,示例如下所示:
-
-.. code-block:: python
-
- MODEL_TYPE = "NEW_MODEL"
-
- @MODEL_INITIALIZER.register_module(module_name=MODEL_TYPE)
-    def build_new_model_with_cfg(*args, **kwargs):
-        ...  # build and return the new model instance here
-
-.. _InternLM-optim-init:
-
-优化器初始化
--------------------------
-
-.. autofunction:: internlm.train.initialize_optimizer
-
-.. _InternLM-dl-init:
-
-数据加载器初始化
--------------------------
-
-.. autofunction:: internlm.train.get_train_data_loader
-
-.. _InternLM-trainer-init:
-
-Trainer 初始化
--------------------------
-
-.. autofunction:: internlm.initialize.initialize_trainer
diff --git a/doc/code-docs/source/install.md b/doc/code-docs/source/install.md
deleted file mode 100644
index 912270c..0000000
--- a/doc/code-docs/source/install.md
+++ /dev/null
@@ -1,2 +0,0 @@
-```{include} ../../install.md
-```
diff --git a/doc/code-docs/source/monitor.rst b/doc/code-docs/source/monitor.rst
deleted file mode 100644
index de150fd..0000000
--- a/doc/code-docs/source/monitor.rst
+++ /dev/null
@@ -1,22 +0,0 @@
-监控和告警
-=================
-
-监控
------------------
-
-InternLM 使用 ``internlm.monitor.monitor.initialize_monitor_manager()`` 来初始化上下文监控管理。其中,一个实例化的单例对象 ``internlm.monitor.monitor.MonitorManager`` 将管理监控线程并使用 ``internlm.monitor.monitor.MonitorTracker`` 来跟踪模型训练生命周期和训练状态。
-
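-A minimal sketch of how the monitor could wrap a training run is shown below, assuming ``initialize_monitor_manager`` is used as a context manager and that the argument names (``job_name``, ``alert_address``) match its actual signature; both assumptions should be checked against the API documentation below.
-
-.. code-block:: python
-
-    # Sketch only: context-manager usage and argument names are assumptions.
-    from internlm.monitor.monitor import initialize_monitor_manager
-
-    with initialize_monitor_manager(job_name="7b_train", alert_address=None):
-        ...  # run the training loop; the monitor thread tracks training state in the background
-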
-.. autofunction:: internlm.monitor.monitor.initialize_monitor_manager
-
-.. autoclass:: internlm.monitor.monitor.MonitorManager
- :members:
-
-.. autoclass:: internlm.monitor.monitor.MonitorTracker
- :members:
-
-告警
------------------
-
-InternLM 监控线程会周期性地检查模型训练过程中是否出现 loss spike、潜在的 training stuck、运行时异常等,并捕获 SIGTERM 异常信号。当出现上述情况时,将触发警报,并通过调用 ``internlm.monitor.alert.send_feishu_msg_with_webhook()`` 向飞书的 Webhook 地址发送报警消息。
-
-.. autofunction:: internlm.monitor.alert.send_feishu_msg_with_webhook
diff --git a/doc/code-docs/source/parallel.rst b/doc/code-docs/source/parallel.rst
deleted file mode 100644
index 6de9545..0000000
--- a/doc/code-docs/source/parallel.rst
+++ /dev/null
@@ -1,152 +0,0 @@
-并行训练
-==================
-
-.. Brief introduction to training parallelism, and how-to guide about config setting
-
-InternLM 支持张量并行、流水线并行、序列并行、数据并行和 ZeRO1.5 等并行化训练策略。在初始化分布式环境时,我们需要指定张量并行大小、流水线并行大小、数据并行大小以及 ZeRO1.5 策略。
-
-InternLM 的并行设置由配置文件中的 ``parallel`` 字段指定,用户可以通过修改配置文件 `config file `_ 来更改并行配置。以下是一个并行训练配置示例:
-
-.. code-block:: python
-
- parallel = dict(
- zero1=8,
- tensor=1,
- pipeline=dict(size=1, interleaved_overlap=True),
- sequence_parallel=False,
- )
-
-- zero1:zero 并行策略,分如下三种情况,默认值为 -1
-
- - 当 ``zero1 <= 0``,则 zero1 进程组的大小等于数据并行进程组的大小,因此优化器状态参数将在数据并行范围内分配
- - 当 ``zero1 == 1``,则不使用 zero1 ,所有数据并行组保留完整的优化器状态参数
- - 当 ``zero1 > 1`` 且 ``zero1 <= data_parallel_world_size``,则 zero1 进程组是数据并行进程组的子集
-
-- tensor:张量并行大小,通常是每个节点的 GPU 数量,默认值为 1
-- pipeline:流水线并行策略
-
- - size:流水线并行大小,默认值为 1
- - interleaved_overlap:bool 类型,交错式调度时,开启或关闭通信优化,默认值为 False
-
-- sequence_parallel:是否开启序列化并行,默认值为 False
-
-注意:数据并行大小 = 总的 GPU 数目 / 流水线并行大小 / 张量并行大小
-
-张量并行
------------------
-
-InternLM 的张量并行实现方案基于 `flash attention `_, 主要对 `attention `_ 和
-`linear `_ 这两个模块进行张量并行操作。
-
-用户可通过配置文件中的 ``parallel.tensor`` 字段来设置张量并行大小。
-
-.. figure:: ../../imgs/tensor_parallel.png
- :scale: 50%
- :class: with-border
-
- 张量并行,采用自 `flash-attention `_
-
-流水线并行
------------------
-
-InternLM 在流水线并行中使用 `1F1B `_ (1F1B,一次前向传递后跟一次反向传递)策略。对于 1F1B 策略,有两种实现方式:
-
-1. 非交错调度器,内存高效。
-2. 交错调度器,内存高效且时间高效(GPU空泡较少)。
-
-.. figure:: ../../imgs/pipeline_schedule.png
- :scale: 45%
- :class: with-border
-
- 1F1B 流水线并行调度器,采用自 `Megatron-LM `_
-
-非交错式流水线调度
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-如果要使用非交错式调度, 需要设置 ``model.num_chunks = 1``。
-
-.. autoclass:: internlm.core.scheduler.pipeline_scheduler.PipelineScheduler
- :members:
-
-交错式流水线调度
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-如果要使用交错式调度, 需要设置 ``model.num_chunks > 1``。
-
-.. autoclass:: internlm.core.scheduler.pipeline_scheduler.InterleavedPipelineScheduler
- :members:
-
-值得注意的是,在使用交错式流水线调度器时可启用通信优化功能,即在 1F1B 阶段启用异步通信,以充分利用上行/下行带宽并实现通信与计算重叠。
-
-用户需要在配置文件中设置 ``parallel.pipeline.interleaved_overlap = True``。该功能启用后,将调用函数 ``InterleavedPipelineScheduler._run_1f1b_loop_with_overlap``,并创建 ``internlm.core.communication.AsynCommunicator`` 以管理异步通信。
-
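-For reference, the relevant configuration fields would look roughly like the sketch below (the pipeline size of 4 is only an illustrative value; the other fields follow the examples earlier on this page):
-
-.. code-block:: python
-
-    # Sketch: enable the interleaved scheduler together with communication overlap.
-    model = dict(
-        num_chunks=2,  # num_chunks > 1 selects the interleaved pipeline scheduler
-        # ... other model fields as in the demo configs ...
-    )
-    parallel = dict(
-        zero1=8,
-        tensor=1,
-        pipeline=dict(size=4, interleaved_overlap=True),  # turn on the overlap optimization
-        sequence_parallel=False,
-    )
-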
-``1F1B-without-overlap`` 和 ``1F1B-with-overlap`` 的区别如下所示:
-
-.. code-block:: bash
-
- # The 1F1B stage without overlap consists of the following steps:
- 1. Perform the forward pass.
- 2. Perform the backward pass.
- 3. Send the forward output of this iteration to the next stage, and send the backward output of this iteration to the previous stage, and receive the forward and backward inputs for the next iteration.
-
-.. code-block:: bash
-
- # The 1F1B stage with overlap consists of the following steps:
- 1. Perform the forward pass.
- 2. Check if the backward input is ready.
- 3. Send the forward output and receive the forward input for the next iteration.
- 4. Perform the backward pass.
- 5. Check if the forward input is ready.
- 6. Send the backward output and receive the backward input for the next iteration.
-
-
-序列并行
------------------
-
-序列并行是一种在不引入额外计算、通信和内存开销的情况下,减少层 ``layer_norm`` 和 ``dropout`` 操作中的激活值内存。InternLM 中的序列并行实现基于 `flash attention `_。这个并行策略有助于降低模型的内存消耗,提高了模型在资源受限环境中的可扩展性。
-
-如果要启用序列并行, 用户需要设置 ``parallel.sequence_parallel = True``。
-
-.. figure:: ../../imgs/sequence_parallel.png
- :scale: 50%
- :class: with-border
-
- 序列并行, 采用自 flash-attention
-
-数据并行
------------------
-
-InternLM 支持数据并行。数据并行大小为:
-
-`Data parallel size = Total number of GPUs / Pipeline parallel size / Tensor parallel size`
-
-ZeRO1.5
------------------
-
-ZeRO1.5 的实现使用了分层分片的概念,通过配置值 ``parallel.zero1`` 启用了本地节点内的分片。这个方法有助于有效管理和分配模型参数和梯度,以减少内存使用并提高训练效率。
-
-1. 当 ``parallel.zero1 <= 0``,则 zero1 进程组的大小等于数据并行进程组的大小,因此优化器状态参数将在数据并行范围内分配
-2. 当 ``parallel.zero1 == 1``,则不使用 zero1 ,所有数据并行组保留完整的优化器状态参数
-3. 当 ``parallel.zero1 > 1`` 且 ``parallel.zero1 <= data_parallel_world_size``,则 zero1 进程组是数据并行进程组的子集
-
-此外,用户可以在配置文件中通过 ``hybrid_zero_optimizer`` 字段启用优化器的通信优化功能,设置桶大小,以及梯度剪裁等参数。这些设置有助于优化训练过程中的通信和计算效率,以及梯度的处理方式。
-
-.. code-block:: python
-
- hybrid_zero_optimizer = dict(
-        # Enable low_level_optimizer overlap_communication
- overlap_sync_grad=True,
- overlap_sync_param=True,
- # bucket size for nccl communication params
- reduce_bucket_size=512 * 1024 * 1024,
- # grad clipping
- clip_grad_norm=1.0,
- )
-
-这里有两个值得关注的通信优化点:
-
-- overlap_sync_grad: 如果设置为 ``True``,则将训练的 ``backward pass`` 与梯度的 ``all-reduce`` 通信重叠
-- overlap_sync_param: 如果设置为 ``True``,则将参数的 ``broadcast`` 通信与下一步的 ``forward pass`` 进行重叠
-
-这些优化可以加速训练过程,提高训练效率。
-
-.. autoclass:: internlm.solver.optimizer.hybrid_zero_optim.HybridZeroOptimizer
- :members:
diff --git a/doc/code-docs/source/profiler.rst b/doc/code-docs/source/profiler.rst
deleted file mode 100644
index 0163ebe..0000000
--- a/doc/code-docs/source/profiler.rst
+++ /dev/null
@@ -1,164 +0,0 @@
-性能分析
-========
-
-.. Mainly about the usage of torch profiler and memory profiler
-
-Torch Profiler
------------------
-
-InternLM 使用 ``internlm.train.initialize_llm_profile()`` 来收集和分析模型训练或推理期间的性能数据,如 CPU/CUDA/memory 等性能数据。这个实现基于 `torch.profiler `_ ,输出的性能分析 trace 文件可以使用 `tensorboard `_ 进行可视化。
-
-用户如果想使用这个 torch 性能分析工具,需要在启动训练时传递 ``--profiling`` 参数以启用性能分析。完成 torch 性能分析后,用户可以在 ``{JOB_NAME}/{start_time}/traces/rank{}_dp{}_tp{}_pp{}`` 文件夹中看到性能分析结果。
-
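-For example, assuming the same single-node ``srun`` launch used in the demo configurations, enabling the profiler only adds the ``--profiling`` flag:
-
-.. code-block:: bash
-
-    # launch training with torch profiler enabled
-    srun -p internllm -N 1 -n 8 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./configs/7B_sft.py --profiling
-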
-实际运行生成的 ``Torch Profiler`` 目录结构如下:
-
-.. code-block:: bash
-
- # tree ./7b_train/Sep08_11-00-51/traces -L 2
- ./7b_train/Sep08_11-00-51/traces/
- └── rank0_dp0_tp0_pp0
- └── SH-IDC1-10-140-1-78_238619.1694142354680.pt.trace.json
-
-其中, ``traces`` 可以通过 ``TensorBoard`` 可视化,运行命令
-
-.. code-block:: bash
-
- # visualize traces with tensorboard and custom port
- tensorboard --logdir rank0_dp0_tp0_pp0 --port 10088
-
-在打开的 ``TensorBoard -> PyTorch Profiler -> Views -> Trace`` 页面可以看到Operator和GPU Kernel的性能分析时间线如下,更多的功能请参考 `torch profiler with tensorboard `_
-
-.. figure:: ../../imgs/torch_profiler_trace.png
- :scale: 45%
- :class: with-border
-
-.. autofunction:: internlm.train.initialize_llm_profile
-
-Memory Profiler
------------------
-
-InternLM 提供了一个实用的内存分析工具 ``internlm.utils.simple_memory_profiler.SimpleMemoryProfiler`` 来监控实际的 GPU 内存使用情况。在实现中,会对模型数据(包括模型参数、模型梯度和优化器状态)和非模型数据(包括激活值)分别进行详细的统计。
-
-要使用这个内存分析工具,用户需要在启动训练时传递 ``--profiling`` 参数以启用内存分析。完成内存分析后,用户可以在 ``memory_trace/rank{}_dp{}_tp{}`` 文件夹中找到特定 rank 对应的内存分析结果(包括不同时间点的内存使用日志和显示总体内存使用情况的太阳图表)。
-
-实际运行生成的 ``memory_trace`` 目录结构如下:
-
-.. code-block:: bash
-
- # tree ./memory_trace -L 2
- ./memory_trace
- ├── rank0_dp0_tp0 # Profiling results for a specific rank device
- │ ├── activation_memory_sunburst.html # Sunburst chart showing activation memory usage
- │ ├── grads_memory_sunburst.html # Sunburst chart showing gradient memory usage
- │ ├── memory.log # Log of GPU memory usage at different time points
- │ ├── os_memory_sunburst.html # Sunburst chart showing optimizer state memory usage
- │ ├── params_memory_sunburst.html # Sunburst chart showing parameter memory usage
- │ └── summary_sunburst.html # Sunburst chart showing overall memory usage
- ├── rank1_dp1_tp0
- │ ├── activation_memory_sunburst.html
- │ ├── grads_memory_sunburst.html
- │ ├── memory.log
- │ ├── os_memory_sunburst.html
- │ ├── params_memory_sunburst.html
- │ └── summary_sunburst.html
- ├── rank2_dp2_tp0
- │ ├── activation_memory_sunburst.html
- │ ├── grads_memory_sunburst.html
- │ ├── memory.log
- │ ├── os_memory_sunburst.html
- │ ├── params_memory_sunburst.html
- │ └── summary_sunburst.html
- ├── rank3_dp3_tp0
- │ ├── activation_memory_sunburst.html
- │ ├── grads_memory_sunburst.html
- │ ├── memory.log
- │ ├── os_memory_sunburst.html
- │ ├── params_memory_sunburst.html
- │ └── summary_sunburst.html
- ├── rank4_dp4_tp0
- │ ├── activation_memory_sunburst.html
- │ ├── grads_memory_sunburst.html
- │ ├── memory.log
- │ ├── os_memory_sunburst.html
- │ ├── params_memory_sunburst.html
- │ └── summary_sunburst.html
- ├── rank5_dp5_tp0
- │ ├── activation_memory_sunburst.html
- │ ├── grads_memory_sunburst.html
- │ ├── memory.log
- │ ├── os_memory_sunburst.html
- │ ├── params_memory_sunburst.html
- │ └── summary_sunburst.html
- ├── rank6_dp6_tp0
- │ ├── activation_memory_sunburst.html
- │ ├── grads_memory_sunburst.html
- │ ├── memory.log
- │ ├── os_memory_sunburst.html
- │ ├── params_memory_sunburst.html
- │ └── summary_sunburst.html
- └── rank7_dp7_tp0
- ├── activation_memory_sunburst.html
- ├── grads_memory_sunburst.html
- ├── memory.log
- ├── os_memory_sunburst.html
- ├── params_memory_sunburst.html
- └── summary_sunburst.html
-
-其中, ``memory.log`` 的内容示例如下:
-
-.. code-block:: bash
-
- Memory State:
- time: 37.56313228607178
- ---summary---
- total_memory: 55953.56 MB
- params_memory: 13965.51 MB, grads_memory: 13965.51 MB, os_params_memory: 3461.52 MB, os_state_memory: 6923.03 MB, activation_memory: 17638.00 MB
-
- Memory State:
- time: 38.46969723701477
- ---summary---
- total_memory: 38315.56 MB
- params_memory: 13965.51 MB, grads_memory: 13965.51 MB, os_params_memory: 3461.52 MB, os_state_memory: 6923.03 MB, activation_memory: 0.00 MB
- ---Layout---
- params_layout:
- layer: param_mem, layer_mem: 0.00 MB, total_mem: 13965.51 MB
- layer: param_mem.embedding, layer_mem: 0.00 MB, total_mem: 806.00 MB
- layer: param_mem.embedding.weight, layer_mem: 806.00 MB, total_mem: 806.00 MB
- layer: param_mem.blocks, layer_mem: 0.00 MB, total_mem: 12353.50 MB
- layer: param_mem.blocks.0, layer_mem: 0.00 MB, total_mem: 386.05 MB
- layer: param_mem.blocks.0.mixer, layer_mem: 0.00 MB, total_mem: 128.03 MB
- layer: param_mem.blocks.0.mixer.Wqkv, layer_mem: 0.00 MB, total_mem: 96.02 MB
- layer: param_mem.blocks.0.mixer.Wqkv.weight, layer_mem: 96.00 MB, total_mem: 96.00 MB
- layer: param_mem.blocks.0.mixer.Wqkv.bias, layer_mem: 0.02 MB, total_mem: 0.02 MB
- layer: param_mem.blocks.0.mixer.out_proj, layer_mem: 0.00 MB, total_mem: 32.01 MB
- layer: param_mem.blocks.0.mixer.out_proj.weight, layer_mem: 32.00 MB, total_mem: 32.00 MB
- layer: param_mem.blocks.0.mixer.out_proj.bias, layer_mem: 0.01 MB, total_mem: 0.01 MB
- layer: param_mem.blocks.0.norm1, layer_mem: 0.00 MB, total_mem: 0.01 MB
- layer: param_mem.blocks.0.norm1.weight, layer_mem: 0.01 MB, total_mem: 0.01 MB
- layer: param_mem.blocks.0.norm2, layer_mem: 0.00 MB, total_mem: 0.01 MB
- layer: param_mem.blocks.0.norm2.weight, layer_mem: 0.01 MB, total_mem: 0.01 MB
- layer: param_mem.blocks.0.mlp, layer_mem: 0.00 MB, total_mem: 258.00 MB
- layer: param_mem.blocks.0.mlp.w1, layer_mem: 0.00 MB, total_mem: 86.00 MB
- layer: param_mem.blocks.0.mlp.w1.weight, layer_mem: 86.00 MB, total_mem: 86.00 MB
- layer: param_mem.blocks.0.mlp.w2, layer_mem: 0.00 MB, total_mem: 86.00 MB
- layer: param_mem.blocks.0.mlp.w2.weight, layer_mem: 86.00 MB, total_mem: 86.00 MB
- layer: param_mem.blocks.0.mlp.w3, layer_mem: 0.00 MB, total_mem: 86.00 MB
- layer: param_mem.blocks.0.mlp.w3.weight, layer_mem: 86.00 MB, total_mem: 86.00 MB
- ......
- grads_layout:
- ......
- os_params_layout:
- ......
- os_state_layout:
- ......
- activation_base_layout:
- ......
-
-模型参数的太阳图示例如下:
-
-.. figure:: ../../imgs/params_memory_sunburst.png
- :scale: 50%
- :class: with-border
-
-.. autoclass:: internlm.utils.simple_memory_profiler.SimpleMemoryProfiler
- :members:
diff --git a/doc/code-docs/source/qa.rst b/doc/code-docs/source/qa.rst
deleted file mode 100644
index b32859f..0000000
--- a/doc/code-docs/source/qa.rst
+++ /dev/null
@@ -1,2 +0,0 @@
-问&答
-=====
diff --git a/doc/code-docs/source/training.rst b/doc/code-docs/source/training.rst
deleted file mode 100644
index 19bf80c..0000000
--- a/doc/code-docs/source/training.rst
+++ /dev/null
@@ -1,9 +0,0 @@
-训练 API
-============
-
-InternLM 的训练 API 由 ``internlm.core.trainer.Trainer`` 管理。在定义了训练引擎和调度器之后,我们可以调用 Trainer API 来执行模型训练、评估、梯度清零和参数更新等。
-
-有关详细用法,请参阅 Trainer API 文档和示例。
-
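-A simplified sketch of one training iteration with the Trainer API is shown below; the method names (``zero_grad``, ``execute_schedule``, ``step``) follow the usage in ``train.py``, but the exact signatures should be taken from the API documentation below.
-
-.. code-block:: python
-
-    # Simplified sketch of the training loop; trainer and train_dataloader are
-    # assumed to have been built during initialization.
-    for batch in train_dataloader:
-        trainer.zero_grad()  # clear gradients
-        # Forward + backward through the configured scheduler, returning the loss.
-        _, _, loss = trainer.execute_schedule(batch, forward_only=False, return_loss=True)
-        trainer.step()  # apply the parameter update
-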
-.. autoclass:: internlm.core.trainer.Trainer
- :members:
diff --git a/doc/code-docs/source/usage.md b/doc/code-docs/source/usage.md
deleted file mode 100644
index 7632959..0000000
--- a/doc/code-docs/source/usage.md
+++ /dev/null
@@ -1,4 +0,0 @@
-```{include} ../../usage.md
-:relative-docs: docs/
-:relative-images:
-```
diff --git a/doc/en/install.md b/doc/en/install.md
deleted file mode 100644
index 8885037..0000000
--- a/doc/en/install.md
+++ /dev/null
@@ -1,86 +0,0 @@
-## Installation
-
-### Environment Preparation
-The required packages and their corresponding versions are as follows:
-- Python == 3.10
-- GCC == 10.2.0
-- MPFR == 4.1.0
-- CUDA >= 11.7
-- PyTorch >= 1.13.1
-- Transformers >= 4.28.0
-- Flash-Attention >= v1.0.5
-- Apex == 23.05
-- GPU with Ampere or Hopper architecture (such as H100, A100)
-- Linux OS
-
-After installing the above dependencies, some system environment variables need to be updated:
-```bash
-export CUDA_PATH={path_of_cuda_11.7}
-export GCC_HOME={path_of_gcc_10.2.0}
-export MPFR_HOME={path_of_mpfr_4.1.0}
-export LD_LIBRARY_PATH=${GCC_HOME}/lib64:${MPFR_HOME}/lib:${CUDA_PATH}/lib64:$LD_LIBRARY_PATH
-export PATH=${GCC_HOME}/bin:${CUDA_PATH}/bin:$PATH
-export CC=${GCC_HOME}/bin/gcc
-export CXX=${GCC_HOME}/bin/c++
-```
-
-### Environment Installation
-Clone the project `internlm` and its dependent submodules from the github repository, as follows:
-```bash
-git clone git@github.com:InternLM/InternLM.git --recurse-submodules
-```
-
-It is recommended to build a Python-3.10 virtual environment using conda and install the required dependencies based on the `requirements/` files:
-```bash
-conda create --name internlm-env python=3.10 -y
-conda activate internlm-env
-cd internlm
-pip install -r requirements/torch.txt
-pip install -r requirements/runtime.txt
-```
-
-Install flash-attention (version v1.0.5):
-```bash
-cd ./third_party/flash-attention
-python setup.py install
-cd ./csrc
-cd fused_dense_lib && pip install -v .
-cd ../xentropy && pip install -v .
-cd ../rotary && pip install -v .
-cd ../layer_norm && pip install -v .
-cd ../../../../
-```
-
-Install Apex (version 23.05):
-```bash
-cd ./third_party/apex
-pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
-cd ../../
-```
-
-### Environment Image
-Users can build their own images with the provided Dockerfile combined with docker.Makefile, or pull images with the InternLM runtime environment pre-installed from https://hub.docker.com/r/internlm/internlm.
-
-#### Image Configuration and Build
-The configuration and build of the Dockerfile are implemented through the docker.Makefile. To build the image, execute the following command in the root directory of InternLM:
-``` bash
-make -f docker.Makefile BASE_OS=centos7
-```
-In docker.Makefile, you can customize the base image, environment versions, etc., and the corresponding parameters can be passed directly on the command line. For BASE_OS, both ubuntu20.04 and centos7 are supported.
-
-#### Pull Standard Image
-Standard images based on Ubuntu and CentOS have been built and can be pulled directly:
-
-```bash
-# ubuntu20.04
-docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-ubuntu20.04
-# centos7
-docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7
-```
-
-#### Run Container
-For a standard image built locally with the Dockerfile or pulled as above, use the following command to run and enter the container:
-```bash
-docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name myinternlm internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7 bash
-```
-The default working directory in the container is `/InternLM`. Please start training by following the [Usage](./usage.md) guide.
diff --git a/doc/en/structure.md b/doc/en/structure.md
deleted file mode 100644
index 5b50e93..0000000
--- a/doc/en/structure.md
+++ /dev/null
@@ -1,28 +0,0 @@
-## InternLM System Structure
-The system code file structure is shown below:
-```bash
-├── configs # Configuration module, managing model and training-related parameters
-│ └── 7B_sft.py # 7B_sft.py is a sample configuration file for the system demo
-├── internlm # Main directory of the system code
-│ ├── apis # Interface module, containing some interface functions related to inference, etc.
-│ ├── core # Core module, managing parallel context and training scheduling engine for training and inference
-│ │ ├── communication # Communication module, responsible for p2p communication in pipeline parallel scheduling
-│ │ ├── context # Context module, mainly responsible for initializing parallel process groups and managing parallel context
-│ │ │ ├── parallel_context.py
-│ │ │ └── process_group_initializer.py
-│ │ ├── scheduler # Scheduling module, which manages schedulers for parallel training, including non-pipeline and pipeline parallel schedulers
-│ │ │ ├── no_pipeline_scheduler.py
-│ │ │ └── pipeline_scheduler.py
-│ │ ├── engine.py # Responsible for managing the training and evaluation process of the model
-│ │ └── trainer.py # Responsible for managing the training engine and scheduler
-│ ├── data # Data module, responsible for managing dataset generation and processing
-│ ├── initialize # Initialization module, responsible for managing distributed environment startup and trainer initialization
-│ ├── model # Model module, responsible for managing model structure definition and implementation
-│ ├── solver # Responsible for managing the implementation of optimizer and lr_scheduler, etc.
-│ └── utils # Auxiliary module, responsible for managing logs, storage, model registration, etc.
-├── train.py # Main function entry file for model training
-├── requirements # List of dependent packages for system running
-├── third_party # Third-party modules on which the system depends, including apex and flash-attention, etc.
-├── tools # Some script tools for processing and converting raw datasets, model checkpoint conversion, etc.
-└── version.txt # System version number
-```
diff --git a/doc/en/train_performance.md b/doc/en/train_performance.md
deleted file mode 100644
index 823ecce..0000000
--- a/doc/en/train_performance.md
+++ /dev/null
@@ -1,92 +0,0 @@
-## Training Performance
-
-
-InternLM deeply integrates Flash-Attention, Apex, and other high-performance model operators to improve training efficiency. By building on the Hybrid Zero technique, it achieves efficient overlap of computation and communication and significantly reduces cross-node communication traffic during training. InternLM supports scaling the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the 1024-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternLM's scalability test data under different configurations:
-
-| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
-| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
-| TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
-
-
-We tested the performance of training the 7B model in InternLM using various parallel configurations on a GPU cluster. In each test group, the number of tokens processed per GPU in a single iteration remained consistent. The hardware and parameter configurations used in the tests are shown in the table below:
-
-| Hardware | Model |
-| ----------------------- | ----------------------------- |
-| GPU | nvidia_a100-sxm4-80gb |
-| Memory | 2TB |
-| Inter-machine bandwidth | 4 * 100Gb RoCE |
-| CPU | 128 core Intel(R) Xeon(R) CPU |
-
-| Hyperparameters | tp=1 | tp=2 |
-| --------------- | ---- | ---- |
-| micro_num | 4 | 4 |
-| micro_bsz | 2 | 4 |
-| seq_len | 2048 | 2048 |
-
-The configuration of `zero1` in InternLM determines the allocation range of optimizer states.
-- `zero1=-1` indicates that optimizer states are distributed across all data-parallel nodes (equivalent to Deepspeed Zero-1).
-- In the case of `zero1=8, tp=1`, optimizer states are sharded across the 8 GPUs within a single node, and each node holds an identical copy of these states, as sketched in the configuration snippet below.
-
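-As a reference, the `zero1=8, tp=1` rows in the tables below correspond to a parallel setting roughly like the following sketch (field names follow the sample training configuration shown in the docs; values other than `zero1` and `tensor` are illustrative):
-
-```python
-# Parallel section of a training config for the zero1=8, tp=1 case (sketch).
-parallel = dict(
-    zero1=8,   # shard optimizer states across the 8 GPUs within one node
-    tensor=1,  # no tensor parallelism
-    pipeline=dict(size=1, interleaved_overlap=True),
-    sequence_parallel=False,
-)
-```
-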
-### Throughput Measurement
-
-Throughput is defined as TGS, the average number of tokens processed per GPU per second. In this test, the training configuration had `pack_sample_into_one=False` and `checkpoint=False`. The test results are shown in the following table. When using `zero1=8, tp=1`, InternLM achieves an acceleration efficiency of `88%` for training the 7B model with 1024 GPUs.
-
-| Parallel Configuration | 8 GPUs | 16 GPUs | 32 GPUs | 64 GPUs | 128 GPUs | 256 GPUs | 512 GPUs | 1024 GPUs |
-| ---------------------- | ------ | ------- | ------- | ------- | -------- | -------- | -------- | --------- |
-| (tp=1, zero1=-1) | 4062 | 3842 | 3752 | 3690 | 3571 | 3209 | 2861 | 2271 |
-| (tp=1, zero1=8) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| (tp=2, zero1=-1) | 3822 | 3595 | 3475 | 3438 | 3308 | 3094 | 2992 | 2785 |
-| (tp=2, zero1=4) | 3761 | 3658 | 3655 | 3650 | 3651 | 3653 | 3589 | 3486 |
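-
-The quoted thousand-GPU acceleration efficiency can be read directly off this table as the ratio of the TGS at N GPUs to the 8-GPU baseline. A minimal sketch of that calculation, using the `(tp=1, zero1=8)` row above (the reported `88%` presumably corresponds to this ratio):
-
-```python
-# Acceleration efficiency = achieved TGS at N GPUs / TGS at the 8-GPU baseline.
-tgs = {8: 4078, 16: 3939, 32: 3919, 64: 3944, 128: 3928,
-       256: 3920, 512: 3835, 1024: 3625}
-
-baseline = tgs[8]
-for n, value in tgs.items():
-    print(f"{n:>5} GPUs: efficiency = {value / baseline:.1%}")
-# The 1024-GPU row comes out to roughly 89%, in line with the ~88% quoted above.
-```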
-
-
-
-
-
-
-### FLOPS Testing
-
-The computational workload of model training is based on the FLOPS calculation method described in the [Megatron](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf) paper. To ensure constant FLOPS during training, the test configuration had `pack_sample_into_one=True`, `dtype=torch.bfloat16`.
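-
-For orientation, the TFLOPS values below can be approximated from TGS with the analytical FLOPs-per-token formula from the Megatron paper, plugging in the 7B configuration used in this repository (hidden size 4096, 32 layers, sequence length 2048, vocabulary 103168). This is a rough sketch; the exact accounting used to produce the tables may differ slightly:
-
-```python
-# Megatron-style estimate of training FLOPs per token for a GPT-like model.
-# The leading factor is 96 with activation recomputation (fwd + bwd + recompute)
-# and 72 without it.
-def flops_per_token(hidden, layers, seq_len, vocab, activation_ckpt):
-    factor = 96 if activation_ckpt else 72
-    return factor * layers * hidden**2 * (
-        1 + seq_len / (6 * hidden) + vocab / (16 * layers * hidden)
-    )
-
-def tflops(tgs, activation_ckpt, hidden=4096, layers=32, seq_len=2048, vocab=103168):
-    # TGS (tokens/GPU/second) times FLOPs/token gives FLOPs per GPU per second.
-    return tgs * flops_per_token(hidden, layers, seq_len, vocab, activation_ckpt) / 1e12
-
-print(round(tflops(3314, activation_ckpt=True)))   # ~193, cf. the 8-GPU row below
-print(round(tflops(4103, activation_ckpt=False)))  # ~180, close to the reported 183
-```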
-
-
-When `Activation Ckpt` is enabled, the test results are shown in the table below. InternLM can achieve `>180 TFLOPS` for 7B model training with 1024 GPUs.
-
-- TGS: Tokens per GPU per Second
-
-- Global Bsz: The total number of tokens processed by all GPUs in one step
-
-| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
-|-|-|-|-|-|-|-|-|-|-|-|
-| 1 | 8 | TRUE | TRUE | 8 | 2048 | 8 | 1 | 0.125M | 3314 | 193 |
-| 1 | 8 | TRUE | TRUE | 16 | 2048 | 8 | 1 | 0.25M | 3268 | 191 |
-| 1 | 8 | TRUE | TRUE | 32 | 2048 | 8 | 1 | 0.5M | 3323 | 188 |
-| 1 | 8 | TRUE | TRUE | 64 | 2048 | 8 | 1 | 1M | 3217 | 188 |
-| 1 | 8 | TRUE | TRUE | 128 | 2048 | 8 | 1 | 2M | 3260 | 187 |
-| 1 | 8 | TRUE | TRUE | 256 | 2048 | 8 | 1 | 4M | 3215 | 187 |
-| 1 | 8 | TRUE | TRUE | 512 | 2048 | 8 | 1 | 8M | 3199 | 186 |
-| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 8 | 1 | 16M | 3163 | 184 |
-| 1 | 8 | TRUE | TRUE | 512 | 2048 | 4 | 1 | 4M | 2963 | 173 |
-| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 2 | 1 | 4M | 2341 | 136 |
-| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 4 | 1 | 8M | 2796 | 160 |
-
-When `Activation Ckpt` is turned off, the test results are as shown in the table below:
-
-| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
-|-|-|-|-|-|-|-|-|-|-|-|
-| 1 | 8 | TRUE | FALSE | 8 | 2048 | 2 | 4 | 0.125M | 4103 | 183 |
-| 1 | 8 | TRUE | FALSE | 16 | 2048 | 2 | 4 | 0.25M | 3939 | 177 |
-| 1 | 8 | TRUE | FALSE | 32 | 2048 | 2 | 4 | 0.5M | 3919 | 176 |
-| 1 | 8 | TRUE | FALSE | 64 | 2048 | 2 | 4 | 1M | 3944 | 174 |
-| 1 | 8 | TRUE | FALSE | 128 | 2048 | 2 | 4 | 2M | 3928 | 173 |
-| 1 | 8 | TRUE | FALSE | 256 | 2048 | 2 | 4 | 4M | 3920 | 173 |
-| 1 | 8 | TRUE | FALSE | 512 | 2048 | 2 | 4 | 8M | 3900 | 173 |
-| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 4 | 16M | 3625 | 160 |
-| 1 | 8 | TRUE | FALSE | 512 | 2048 | 2 | 2 | 4M | 3084 | 139 |
-| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 1 | 4M | 2346 | 105 |
-| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 2 | 8M | 2817 | 124 |
-
-
-
-
-
-
diff --git a/doc/en/usage.md b/doc/en/usage.md
deleted file mode 100644
index 864ead6..0000000
--- a/doc/en/usage.md
+++ /dev/null
@@ -1,387 +0,0 @@
-## Quickstart Guide for Pre-training and Fine-tuning
-
-To start a demo model training, you need to prepare three things: **installation**, **dataset preparation**, and **model training configuration**. In this guide, we will first cover the steps for dataset preparation and then briefly describe the model training configuration.
-
-### Installation
-
-Please refer to the [installation guide](./install.md) for instructions on how to install the necessary dependencies.
-
-### Dataset Preparation (Pre-training)
-
-The dataset for the InternLM training task includes a series of `bin` and `meta` files. A `tokenizer` is used to generate the training dataset from the original text files. The tokenizer model is imported by specifying the model parameter path in `tools/tokenizer.py`. Currently, `V7_sft.model` is provided to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.
-
-You can run the following command to generate `bin` and `meta` files corresponding to the original data. The parameter `text_input_path` represents the path of the original text data, currently supporting `txt`, `json`, and `jsonl` formats, while `bin_output_path` represents the save path of the generated `bin` files.
-
-
-```bash
-$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
-```
-
-Here is an example of data processing:
-
-Given a file `raw_data.txt` containing the raw dataset, the raw dataset is shown below:
-
-```bash
-Appreciate every detail in life to truly taste the flavor of happiness.
-Dreams are the source of life’s motivation. Pursue them diligently to achieve your goals.
-Learn to be tolerant and understanding to establish truly harmonious interpersonal relationships.
-```
-
-You can generate the `bin` and `meta` files by running the following command:
-
-```bash
-$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
-```
-
-It should be noted that the generated `bin` files need to be saved in one of the following directories: `cn`, `en`, `code`, `ja`, `ar`, or `kaoshi`, depending on the type of dataset.
-
-Here, `cn` represents the Chinese dataset, `en` represents the English dataset, `code` represents the code dataset, `ja` represents the Japanese dataset, `ar` represents the Arabic dataset, and `kaoshi` represents the exam dataset.
-
-The format of the generated `bin` files is as follows:
-
-```python
-{"tokens": [98655, 2317, 2922, 6649, 1595, 7856, 435, 2424, 442, 9556, 12807, 410, 17313, 446, 23331, 95746]}
-{"tokens": [98655, 302, 1383, 269, 657, 410, 2687, 446, 2424, 98667, 269, 25220, 281, 523, 1874, 492, 1248, 38127, 4563, 442, 11227, 829, 8980, 95746]}
-{"tokens": [98655, 24190, 442, 517, 15013, 649, 454, 8793, 442, 5849, 9556, 17917, 1369, 1084, 29890, 12021, 95746]}
-```
-Each line in the `bin` file corresponds to one sentence in the original dataset and contains that sentence's tokens (referred to as a `sequence` below).
-
-The format of the generated `meta` file is as follows:
-
-```bash
-(0, 16), (110, 24), (262, 17)
-```
-
-Each tuple in the `meta` file stores the meta information of one `sequence`: the first element is the `starting index` of that `sequence` (its offset within the `bin` file), and the second element is the number of `tokens` in the `sequence`.
-
-For example, the first `sequence` starts at index 0 and has 16 `tokens`. The second `sequence` starts at index 110 and has 24 `tokens`.
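-
-As an illustration of how these tuples can be used, here is a minimal sketch that reads one sequence back from the `bin` file. It assumes, as in the example above, that the first element is the byte offset of the sequence's line within the `bin` file and the second is its token count:
-
-```python
-import json
-
-def load_sequence(bin_path, meta_entry):
-    """Read one tokenized sequence given its (start, length) meta tuple."""
-    start, num_tokens = meta_entry
-    with open(bin_path, "rb") as f:
-        f.seek(start)                    # jump to the start of the sequence's line
-        tokens = json.loads(f.readline())["tokens"]
-    assert len(tokens) == num_tokens     # sanity-check against the meta info
-    return tokens
-
-# e.g. the second sequence above: byte offset 110, 24 tokens
-# tokens = load_sequence("cn/output.bin", (110, 24))
-```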
-
-The `bin` and `meta` file formats for `json` and `jsonl` type files are the same as for `txt`, so we won't go over them here.
-
-### Data Preparation (Fine-tuning)
-
-The data format for fine-tuning tasks is the same as for pre-training tasks, which consists of a series of `bin` and `meta` files. Let's take the Alpaca dataset as an example to explain the data preparation process for fine-tuning.
-
-1. Download the [Alpaca dataset](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json).
-
-2. Tokenize the Alpaca dataset using the following command:
-
-```shell
-python tools/alpaca_tokenizer.py /path/to/alpaca_dataset /path/to/output_dataset /path/to/tokenizer --split_ratio 0.1
-```
-
-It is recommended that users refer to `alpaca_tokenizer.py` when writing new scripts to tokenize their own datasets.
-
-### Training Configuration
-
-Taking the configuration file `configs/7B_sft.py` for the 7B demo as an example, let's discuss the data, model, parallel, and monitoring configurations required to start a training run.
-```python
-JOB_NAME = "7b_train"
-DO_ALERT = False
-
-SEQ_LEN = 2048
-HIDDEN_SIZE = 4096
-NUM_ATTENTION_HEAD = 32
-MLP_RATIO = 8 / 3
-NUM_LAYER = 32
-VOCAB_SIZE = 103168
-
-MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
-# Ckpt folder format:
-# fs: 'local:/mnt/nfs/XXX'
-SAVE_CKPT_FOLDER = "local:llm_ckpts"
-LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
-
-# boto3 Ckpt folder format:
-# import os
-# BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint
-# SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm"
-# LOAD_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm/snapshot/1/"
-CHECKPOINT_EVERY = 50
-ckpt = dict(
- enable_save_ckpt=False, # enable ckpt save.
- save_ckpt_folder=SAVE_CKPT_FOLDER, # Path to save training ckpt.
- # load_ckpt_folder= dict(path=MODEL_ONLY_FOLDER, content=["model"], ckpt_type="normal"),
- load_ckpt_folder="local:llm_ckpts/",
- # 'load_ckpt_info' setting guide:
- # 1. the 'path' indicates the ckpt path,
- # 2. the 'content' means what states will be loaded, support: "model", "sampler", "optimizer", "scheduler", "all"
- # 3. the 'ckpt_type' means the type of checkpoint to be loaded, now only 'normal' type is supported.
- load_ckpt_info=dict(path=MODEL_ONLY_FOLDER, content=("model",), ckpt_type="internlm"),
- checkpoint_every=CHECKPOINT_EVERY,
- async_upload=True, # async ckpt upload. (only work for boto3 ckpt)
- async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/", # path for temporary files during asynchronous upload.
- oss_snapshot_freq=int(CHECKPOINT_EVERY / 2), # snapshot ckpt save frequency.
-)
-
-TRAIN_FOLDER = "/path/to/dataset"
-VALID_FOLDER = "/path/to/dataset"
-data = dict(
- seq_len=SEQ_LEN,
- # micro_num means the number of micro_batch contained in one gradient update
- micro_num=4,
- # packed_length = micro_bsz * SEQ_LEN
- micro_bsz=2,
- # defaults to the value of micro_num
- valid_micro_num=4,
- # defaults to 0, means disable evaluate
- valid_every=50,
- pack_sample_into_one=False,
- total_steps=50000,
- skip_batches="",
- rampup_batch_size="",
- # Datasets with less than 50 rows will be discarded
- min_length=50,
- # train_folder=TRAIN_FOLDER,
- # valid_folder=VALID_FOLDER,
- empty_cache_and_diag_interval=10,
- diag_outlier_ratio=1.1,
-)
-
-grad_scaler = dict(
- fp16=dict(
- # the initial loss scale, defaults to 2**16
- initial_scale=2**16,
- # the minimum loss scale, defaults to None
- min_scale=1,
- # the number of steps to increase loss scale when no overflow occurs
- growth_interval=1000,
- ),
- # the multiplication factor for increasing loss scale, defaults to 2
- growth_factor=2,
- # the multiplication factor for decreasing loss scale, defaults to 0.5
- backoff_factor=0.5,
- # the maximum loss scale, defaults to None
- max_scale=2**24,
- # the number of overflows before decreasing loss scale, defaults to 2
- hysteresis=2,
-)
-
-hybrid_zero_optimizer = dict(
- # Enable low_level_optimizer overlap_communication
- overlap_sync_grad=True,
- overlap_sync_param=True,
- # bucket size for nccl communication params
- reduce_bucket_size=512 * 1024 * 1024,
- # grad clipping
- clip_grad_norm=1.0,
-)
-
-loss = dict(
- label_smoothing=0,
-)
-
-adam = dict(
- lr=1e-4,
- adam_beta1=0.9,
- adam_beta2=0.95,
- adam_beta2_c=0,
- adam_eps=1e-8,
- weight_decay=0.01,
-)
-
-lr_scheduler = dict(
- total_steps=data["total_steps"],
- init_steps=0, # optimizer_warmup_step
- warmup_ratio=0.01,
- eta_min=1e-5,
- last_epoch=-1,
-)
-
-beta2_scheduler = dict(
- init_beta2=adam["adam_beta2"],
- c=adam["adam_beta2_c"],
- cur_iter=-1,
-)
-
-model = dict(
- checkpoint=False, # The proportion of layers using activation checkpointing; valid values are True/False/[0-1]
- num_attention_heads=NUM_ATTENTION_HEAD,
- embed_split_hidden=True,
- vocab_size=VOCAB_SIZE,
- embed_grad_scale=1,
- parallel_output=True,
- hidden_size=HIDDEN_SIZE,
- num_layers=NUM_LAYER,
- mlp_ratio=MLP_RATIO,
- apply_post_layer_norm=False,
- dtype="torch.float16", # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
- norm_type="rmsnorm",
- layer_norm_epsilon=1e-5,
- use_flash_attn=True,
- num_chunks=1, # if num_chunks > 1, interleaved pipeline scheduler is used.
-)
-"""
-zero1 parallel:
- 1. if zero1 <= 0, the size of the zero process group is equal to the size of the dp process group,
-    so optimizer state parameters are sharded across the whole dp range.
- 2. if zero1 == 1, zero is not used, and every dp rank keeps the full set of optimizer state parameters.
- 3. if zero1 > 1 and zero1 <= dp world size, the zero process group is a subset of the dp process group.
-    For smaller models, it is usually better to shard within a node, i.e. a setting <= 8.
-pipeline parallel (dict):
- 1. size: int, the size of pipeline parallel.
- 2. interleaved_overlap: bool, enable/disable communication overlap when using interleaved pipeline scheduler.
-tensor parallel: tensor parallel size, usually the number of GPUs per node.
-"""
-parallel = dict(
- zero1=8,
- pipeline=dict(size=1, interleaved_overlap=True),
- sequence_parallel=False,
-)
-
-cudnn_deterministic = False
-cudnn_benchmark = False
-
-monitor = dict(
- # feishu alert configs
- alert=dict(
- enable_feishu_alert=DO_ALERT,
- feishu_alert_address=None, # feishu webhook to send alert message
- light_monitor_address=None, # light_monitor address to send heartbeat
- ),
-)
-```
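-
-One way to sanity-check the scheduler settings above: assuming a linear warmup over `warmup_ratio * total_steps` steps, the learning rate should climb by about `lr / 500 = 2e-7` per step, which matches the step-to-step increase in the sample training log at the end of this document (4e-07 at step 0, 6e-07 at step 1). A minimal sketch of that arithmetic:
-
-```python
-lr = 1e-4             # adam["lr"]
-total_steps = 50_000  # data["total_steps"]
-warmup_ratio = 0.01   # lr_scheduler["warmup_ratio"]
-
-warmup_steps = int(total_steps * warmup_ratio)  # 500
-per_step_increase = lr / warmup_steps           # 2e-07
-print(warmup_steps, per_step_increase)
-```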
-
-#### Data Configuration
-Here are the key parameters and their explanations for data configuration:
-
-```python
-TRAIN_FOLDER = "/path/to/dataset"
-SEQ_LEN = 2048
-data = dict(
- seq_len=SEQ_LEN, # Length of the data samples, default value is 2048
- micro_num=1, # Number of micro_batches processed in one model parameter update, default value is 1
- micro_bsz=1, # Packed_length = micro_bsz * SEQ_LEN, the size of data processed in one micro_batch, default value is 1
- total_steps=50000, # Total number of steps to be executed, default value is 50000
- min_length=50, # If the number of lines in the dataset file is less than 50, it will be discarded
- train_folder=TRAIN_FOLDER, # Dataset file path, default value is None; if train_folder is empty, training will be done using randomly generated datasets
- pack_sample_into_one=False, # Logic for data arrangement, determines whether to calculate attention based on the seq_len dimension or the actual length of the sequence
-)
-```
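-
-Taken together with the values used in the full example config above (`micro_num=4`, `micro_bsz=2`), these fields determine how many tokens one optimizer step consumes. A small sketch of that relationship (the data-parallel size of 8 is an illustrative assumption, matching the single-node demo later in this guide):
-
-```python
-SEQ_LEN = 2048
-micro_bsz = 2    # samples packed into one micro-batch
-micro_num = 4    # micro-batches per parameter update
-dp_size = 8      # data-parallel world size (illustrative assumption)
-
-packed_length = micro_bsz * SEQ_LEN                    # tokens per micro-batch per rank
-tokens_per_step = packed_length * micro_num * dp_size  # cf. num_consumed_tokens in the log
-print(packed_length, tokens_per_step)                  # 4096 131072
-```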
-
-![pack_into_one](../imgs/pack_into_one.png)
-
-Currently, it supports passing the dataset file path `train_folder`, and the file format is required to be as follows:
-
-```bash
-- folder
- - code
- train_000.bin
- train_000.bin.meta
-```
-
-For detailed information about the dataset, please refer to the "Data Preparation" section.
-
-#### Model Configuration
-
-If you want to load a model checkpoint when starting the training, you can configure it as follows:
-
-```python
-SAVE_CKPT_FOLDER = "local:/path/to/save/ckpt"
-LOAD_CKPT_FOLDER = "local:/path/to/load/resume/ckpt"
-ckpt = dict(
- save_ckpt_folder=SAVE_CKPT_FOLDER, # Path to save the model and optimizer checkpoints
- checkpoint_every=float("inf"), # Save a checkpoint every specified number of steps, default value is inf
- # When resuming training from a previous checkpoint:
- # (1) 'path' is the path of the checkpoint to load.
- # (2) 'content' indicates which states will be loaded, support: "model", "sampler", "optimizer", "scheduler", "all"
- # (3) 'ckpt_type' indicates which type of checkpoint will be loaded, currently supported: "internlm"
- load_ckpt_info=dict(path=MODEL_ONLY_FOLDER, content=("model",), ckpt_type="internlm"),
-)
-```
-
-Note:
-- If the path starts with `local:`, it means the file is stored in the local file system. If it starts with `boto3:`, it means the file is stored in the remote OSS.
-
-The configuration for the model is as follows:
-
-```python
-model_type = "INTERNLM" # Model type, default value is "INTERNLM", corresponding to the model structure initialization interface function
-NUM_ATTENTION_HEAD = 32
-VOCAB_SIZE = 103168
-HIDDEN_SIZE = 4096
-NUM_LAYER = 32
-MLP_RATIO = 8 / 3
-model = dict(
- checkpoint=False, # The proportion of layers using activation checkpointing; valid values are True/False/[0-1]
- num_attention_heads=NUM_ATTENTION_HEAD,
- embed_split_hidden=True,
- vocab_size=VOCAB_SIZE,
- embed_grad_scale=1,
- parallel_output=True,
- hidden_size=HIDDEN_SIZE,
- num_layers=NUM_LAYER,
- mlp_ratio=MLP_RATIO,
- apply_post_layer_norm=False,
- dtype="torch.bfloat16",
- norm_type="rmsnorm",
- layer_norm_epsilon=1e-5,
-)
-```
-
-Note: Users can customize the model type name and model structure and configure the corresponding model parameters. Model initialization functions are registered through the `MODEL_INITIALIZER` object in `utils/registry.py`; when the model is initialized in the training entry point `train.py`, the registered initialization function is looked up via the `model_type` setting.
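-
-As a purely hypothetical sketch of what such a registration could look like (the import path and the `register_module` decorator signature are assumptions based on common registry designs, not a confirmed API):
-
-```python
-# Hypothetical sketch only; the real registry interface in utils/registry.py may differ.
-from internlm.utils.registry import MODEL_INITIALIZER  # assumed import path
-
-@MODEL_INITIALIZER.register_module("MY_MODEL")          # assumed decorator signature
-def build_my_model(**model_cfg):
-    """Build and return the model from the fields of the `model` config dict."""
-    ...
-
-# With model_type = "MY_MODEL" in the config, train.py would look up this
-# function by name and call it with the `model` dict when initializing the model.
-```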
-
-#### Parallel Configuration
-
-Training parallel configuration example:
-
-```python
-parallel = dict(
- zero1=8,
- tensor=1,
- pipeline=dict(size=1, interleaved_overlap=True),
- sequence_parallel=False,
-)
-```
-
-- zero1: zero parallel strategy, divided into the following three cases, default value is -1
- - When `zero1 <= 0`, the size of the zero1 process group is equal to the size of the data parallel process group, so the optimizer state parameters will be split within the data parallel range.
- - When `zero1 == 1`, zero1 is not used, and all data parallel groups retain the complete optimizer state parameters.
- - When `zero1 > 1` and `zero1 <= data_parallel_world_size`, the zero1 process group is a subset of the data parallel process group.
-- tensor: tensor parallel size, usually the number of GPUs per node, default is 1
-- pipeline: pipeline parallel strategy
- - size: pipeline parallel size, the default value is 1
- - interleaved_overlap: bool, enables or disables communication overlap when using the interleaved pipeline scheduler; the default value is False
-- sequence_parallel: Whether to enable sequence parallelism, the default value is False
-
-Note: `Data parallel size = Total number of GPUs / Pipeline parallel size / Tensor parallel size`
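-
-A minimal sketch of that relationship, together with the `zero1` constraint described above (the GPU count is illustrative):
-
-```python
-total_gpus = 1024   # illustrative
-pipeline_size = 1
-tensor_size = 1
-zero1 = 8
-
-data_parallel_size = total_gpus // (pipeline_size * tensor_size)  # 1024
-# zero1 must stay within the data-parallel group; with zero1=8 the optimizer
-# states are sharded over groups of 8 ranks inside that group.
-assert 1 <= zero1 <= data_parallel_size
-print(data_parallel_size)
-```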
-
-### Start Training
-
-After completing the data preparation and relevant training configurations mentioned above, you can start the demo training. The following examples demonstrate how to start the training in both slurm and torch environments.
-
-If you want to start distributed training on slurm with 16 GPUs across multiple nodes, use the following command:
-
-```bash
-$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./configs/7B_sft.py
-```
-
-If you want to start distributed training on torch with 8 GPUs on a single node, use the following command:
-
-```bash
-$ torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_sft.py --launcher "torch"
-```
-
-### Training Results
-
-Taking the configuration of the demo training on a single machine with 8 GPUs on slurm as an example, the training result log is shown below:
-
-```bash
-2023-07-07 12:26:58,293 INFO launch.py:228 in launch -- Distributed environment is initialized, data parallel size: 8, pipeline parallel size: 1, tensor parallel size: 1
-2023-07-07 12:26:58,293 INFO parallel_context.py:535 in set_seed -- initialized seed on rank 2, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
-2023-07-07 12:26:58,295 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=0===========
-2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=5===========
-2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=1===========
-2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=6===========
-2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=7===========
-2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=2===========
-2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=4===========
-2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=3===========
-2023-07-07 12:28:27,826 INFO hybrid_zero_optim.py:295 in _partition_param_list -- Number of elements on ranks: [907415552, 907411456, 910163968, 910163968, 921698304, 921698304, 921698304, 921698304], rank:0
-2023-07-07 12:28:57,802 INFO train.py:323 in record_current_batch_training_metrics -- tflops=63.27010355651958,step=0,loss=11.634403228759766,tgs (tokens/gpu/second)=1424.64,lr=4.0000000000000003e-07,loss_scale=65536.0,grad_norm=63.672620777841004,micro_num=4,num_consumed_tokens=131072,inf_nan_skip_batches=0,num_samples_in_batch=19,largest_length=2048,largest_batch=5,smallest_batch=4,adam_beta2=0.95,fwd_bwd_time=6.48
-2023-07-07 12:29:01,636 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.83371103277346,step=1,loss=11.613704681396484,tgs (tokens/gpu/second)=4274.45,lr=6.000000000000001e-07,loss_scale=65536.0,grad_norm=65.150786641452,micro_num=4,num_consumed_tokens=262144,inf_nan_skip_batches=0,num_samples_in_batch=16,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.67
-2023-07-07 12:29:05,451 INFO train.py:323 in record_current_batch_training_metrics -- tflops=190.99928472960033,step=2,loss=11.490386962890625,tgs (tokens/gpu/second)=4300.69,lr=8.000000000000001e-07,loss_scale=65536.0,grad_norm=61.57798028719357,micro_num=4,num_consumed_tokens=393216,inf_nan_skip_batches=0,num_samples_in_batch=14,largest_length=2048,largest_batch=4,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.66
-2023-07-07 12:29:09,307 INFO train.py:323 in record_current_batch_training_metrics -- tflops=188.8613541410694,step=3,loss=11.099515914916992,tgs (tokens/gpu/second)=4252.55,lr=1.0000000000000002e-06,loss_scale=65536.0,grad_norm=63.5478796484391,micro_num=4,num_consumed_tokens=524288,inf_nan_skip_batches=0,num_samples_in_batch=16,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.7
-2023-07-07 12:29:13,147 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.65918563194305,step=4,loss=10.149517059326172,tgs (tokens/gpu/second)=4270.52,lr=1.2000000000000002e-06,loss_scale=65536.0,grad_norm=51.582841631508145,micro_num=4,num_consumed_tokens=655360,inf_nan_skip_batches=0,num_samples_in_batch=19,largest_length=2048,largest_batch=6,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.68
-2023-07-07 12:29:16,994 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.3109313713174,step=5,loss=9.822169303894043,tgs (tokens/gpu/second)=4262.67,lr=1.4000000000000001e-06,loss_scale=65536.0,grad_norm=47.10386835560855,micro_num=4,num_consumed_tokens=786432,inf_nan_skip_batches=0,num_samples_in_batch=17,largest_length=2048,largest_batch=6,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.69
-```
diff --git a/doc/imgs/flops.png b/doc/imgs/flops.png
deleted file mode 100644
index 4b2ea0c..0000000
Binary files a/doc/imgs/flops.png and /dev/null differ
diff --git a/doc/imgs/hybrid_parallel_training.png b/doc/imgs/hybrid_parallel_training.png
deleted file mode 100644
index 33e4ff9..0000000
Binary files a/doc/imgs/hybrid_parallel_training.png and /dev/null differ
diff --git a/doc/imgs/pack_into_one.png b/doc/imgs/pack_into_one.png
deleted file mode 100644
index 5b98ef8..0000000
Binary files a/doc/imgs/pack_into_one.png and /dev/null differ
diff --git a/doc/imgs/params_memory_sunburst.png b/doc/imgs/params_memory_sunburst.png
deleted file mode 100644
index c3ee8bc..0000000
Binary files a/doc/imgs/params_memory_sunburst.png and /dev/null differ
diff --git a/doc/imgs/pipeline_schedule.png b/doc/imgs/pipeline_schedule.png
deleted file mode 100644
index 64398ce..0000000
Binary files a/doc/imgs/pipeline_schedule.png and /dev/null differ
diff --git a/doc/imgs/sequence_parallel.png b/doc/imgs/sequence_parallel.png
deleted file mode 100644
index becf628..0000000
Binary files a/doc/imgs/sequence_parallel.png and /dev/null differ
diff --git a/doc/imgs/tensor_parallel.png b/doc/imgs/tensor_parallel.png
deleted file mode 100644
index b5dd310..0000000
Binary files a/doc/imgs/tensor_parallel.png and /dev/null differ
diff --git a/doc/imgs/torch_profiler_trace.png b/doc/imgs/torch_profiler_trace.png
deleted file mode 100644
index 76129ae..0000000
Binary files a/doc/imgs/torch_profiler_trace.png and /dev/null differ
diff --git a/doc/imgs/train_performance.png b/doc/imgs/train_performance.png
deleted file mode 100644
index e21c10b..0000000
Binary files a/doc/imgs/train_performance.png and /dev/null differ
diff --git a/doc/install.md b/doc/install.md
deleted file mode 100644
index 63d392b..0000000
--- a/doc/install.md
+++ /dev/null
@@ -1,86 +0,0 @@
-## 环境安装
-
-### 环境准备
-首先,需要安装的依赖包及对应版本列表如下:
-- Python == 3.10
-- GCC == 10.2.0
-- MPFR == 4.1.0
-- CUDA >= 11.7
-- Pytorch >= 1.13.1
-- Transformers >= 4.28.0
-- Flash-Attention >= v1.0.5
-- Apex == 23.05
-- Ampere或者Hopper架构的GPU (例如H100, A100)
-- Linux OS
-
-以上依赖包安装完成后,需要更新配置系统环境变量:
-```bash
-export CUDA_PATH={path_of_cuda_11.7}
-export GCC_HOME={path_of_gcc_10.2.0}
-export MPFR_HOME={path_of_mpfr_4.1.0}
-export LD_LIBRARY_PATH=${GCC_HOME}/lib64:${MPFR_HOME}/lib:${CUDA_PATH}/lib64:$LD_LIBRARY_PATH
-export PATH=${GCC_HOME}/bin:${CUDA_PATH}/bin:$PATH
-export CC=${GCC_HOME}/bin/gcc
-export CXX=${GCC_HOME}/bin/c++
-```
-
-### 环境安装
-将项目`internlm`及其依赖子模块,从 github 仓库中 clone 下来,命令如下:
-```bash
-git clone git@github.com:InternLM/InternLM.git --recurse-submodules
-```
-
-推荐使用 conda 构建一个 Python-3.10 的虚拟环境, 并基于`requirements/`文件安装项目所需的依赖包:
-```bash
-conda create --name internlm-env python=3.10 -y
-conda activate internlm-env
-cd internlm
-pip install -r requirements/torch.txt
-pip install -r requirements/runtime.txt
-```
-
-安装 flash-attention (version v1.0.5):
-```bash
-cd ./third_party/flash-attention
-python setup.py install
-cd ./csrc
-cd fused_dense_lib && pip install -v .
-cd ../xentropy && pip install -v .
-cd ../rotary && pip install -v .
-cd ../layer_norm && pip install -v .
-cd ../../../../
-```
-
-安装 Apex (version 23.05):
-```bash
-cd ./third_party/apex
-pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
-cd ../../
-```
-
-### 环境镜像
-用户可以使用提供的 dockerfile 结合 docker.Makefile 来构建自己的镜像,或者也可以从 https://hub.docker.com/r/internlm/internlm 获取安装了 InternLM 运行环境的镜像。
-
-#### 镜像配置及构造
-dockerfile 的配置以及构造均通过 docker.Makefile 文件实现,在 InternLM 根目录下执行如下命令即可 build 镜像:
-``` bash
-make -f docker.Makefile BASE_OS=centos7
-```
-在 docker.Makefile 中可自定义基础镜像,环境版本等内容,对应参数可直接通过命令行传递。对于 BASE_OS 分别支持 ubuntu20.04 和 centos7。
-
-#### 镜像拉取
-基于 ubuntu 和 centos 的标准镜像已经 build 完成也可直接拉取使用:
-
-```bash
-# ubuntu20.04
-docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-ubuntu20.04
-# centos7
-docker pull internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7
-```
-
-#### 容器启动
-对于使用 dockerfile 构建或拉取的本地标准镜像,使用如下命令启动并进入容器:
-```bash
-docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name myinternlm internlm/internlm:torch1.13.1-cuda11.7.1-flashatten1.0.5-centos7 bash
-```
-容器内默认目录即 `/InternLM`,根据[使用文档](./usage.md)即可启动训练。
diff --git a/doc/structure.md b/doc/structure.md
deleted file mode 100644
index 2893438..0000000
--- a/doc/structure.md
+++ /dev/null
@@ -1,28 +0,0 @@
-## InternLM系统结构
-本项目系统代码文件结构如下所示:
-```bash
-├── configs # 配置模块,管理模型和训练相关参数
-│ └── 7B_sft.py # 7B_sft.py 是系统 demo 的配置文件样例
-├── internlm # 系统代码的主目录
-│ ├── apis # 接口模块,包含一些关于推理等的接口函数
-│ ├── core # 核心模块,管理用于训练和推理的 parallel context 和训练调度引擎
-│ │ ├── communication # 通信模块,负责流水线并行调度中的p2p通信
-│ │ ├── context # context 模块,主要负责初始化并行进程组,并管理 parallel context
-│ │ │ ├── parallel_context.py
-│ │ │ └── process_group_initializer.py
-│ │ ├── scheduler # 调度模块,管理并行训练的调度器,包括非流水线并行调度器和流水线并行调度器
-│ │ │ ├── no_pipeline_scheduler.py
-│ │ │ └── pipeline_scheduler.py
-│ │ ├── engine.py # 负责管理模型的训练和评估过程
-│ │ └── trainer.py # 负责管理训练引擎和调度器
-│ ├── data # 数据模块,负责管理数据集生成和处理
-│ ├── initialize # 初始化模块,负责管理分布式环境启动和训练器初始化
-│ ├── model # 模型模块,负责管理模型结构定义和实现
-│ ├── solver # 负责管理 optimizer 和 lr_scheduler 等的实现
-│ └── utils # 辅助模块,负责管理日志、存储、模型注册等
-├── train.py # 模型训练的主函数入口文件
-├── requirements # 系统运行的依赖包列表
-├── third_party # 系统所依赖的第三方模块,包括 apex 和 flash-attention 等
-├── tools # 一些脚本工具,用于原始数据集处理和转换,模型 checkpoint 转换等
-└── version.txt # 系统版本号
-```
diff --git a/doc/train_performance.md b/doc/train_performance.md
deleted file mode 100644
index 64c768e..0000000
--- a/doc/train_performance.md
+++ /dev/null
@@ -1,89 +0,0 @@
-## 训练性能
-
-InternLM 深度整合了 Flash-Attention, Apex 等高性能模型算子,提高了训练效率。通过构建 Hybrid Zero 技术,实现计算和通信的高效重叠,大幅降低了训练过程中的跨节点通信流量。InternLM 支持 7B 模型从 8 卡扩展到 1024 卡,千卡规模下加速效率可高达 90%,训练吞吐超过 180TFLOPS,平均单卡每秒处理的 token 数量超过3600。下表为 InternLM 在不同配置下的扩展性测试数据:
-
-| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
-| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
-| TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
-
-
-我们在GPU集群上测试了多种并行配置下,InternLM训练7B模型的性能。在每组测试中,每张GPU在单次迭代中处理的token数量一致。测试使用的硬件和参数配置如下表所示:
-
-| 硬件 | 硬件型号 |
-| ----------------------- | ----------------------------- |
-| GPU | nvidia_a100-sxm4-80gb |
-| Memory | 2TB |
-| Inter-machine bandwidth | 4 * 100Gb RoCE |
-| CPU | 128 core Intel(R) Xeon(R) CPU |
-
-| 超参 | tp=1 | tp=2 |
-| --------- | ---- | ---- |
-| micro_num | 4 | 4 |
-| micro_bsz | 2 | 4 |
-| seq_len | 2048 | 2048 |
-
-InternLM中`zero1`的配置决定了优化器状态的分配范围。
-- `zero1=-1`表明优化器状态分布在全部数据并行节点(等同于Deepspeed Zero-1的效果)
-- `zero1=8,tp=1`的情况下,优化器状态分布在单节点8张GPU内,并且不同节点上的优化器状态保持一致。
-
-### 吞吐量测量
-
-吞吐量定义为TGS,平均每GPU每秒处理的token的数量(Tokens per GPU per Second)。在该项测试的训练配置中,`pack_sample_into_one=False`,`checkpoint=False`, `dtype=torch.bfloat16`。测试结果如下表所示。采用`zero1=8,tp=1`,InternLM针对7B模型训练的扩展性,在千卡训练的加速效率可以达到`88%`。
-
-| 并行配置 | 8卡 | 16卡 | 32卡 | 64卡 | 128卡 | 256卡 | 512卡 | 1024卡 |
-| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
-| (tp=1, zero1=-1) | 4062 | 3842 | 3752 | 3690 | 3571 | 3209 | 2861 | 2271 |
-| (tp=1, zero1=8) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| (tp=2, zero1=-1) | 3822 | 3595 | 3475 | 3438 | 3308 | 3094 | 2992 | 2785 |
-| (tp=2, zero1=4) | 3761 | 3658 | 3655 | 3650 | 3651 | 3653 | 3589 | 3486 |
-
-
-
-
-
-
-### FLOPS测试
-模型训练的计算量参考 [Megatron](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf) 论文中FLOPS计算方式。为了保证训练过程中的FLOPS恒定,在该项测试的训练配置中,`pack_sample_into_one=True`,`dtype=torch.bfloat16`。
-
-
-当开启 Activation Ckpt后,测试结果如下表所示,InternLM针对7B模型的千卡训练,可以达到 `>180 TFLOPS`:
-
-- TGS: Tokens per GPU per Second
-
-- Global Bsz: 一个step中所有GPU处理的token数量
-
-
-| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
-|-|-|-|-|-|-|-|-|-|-|-|
-| 1 | 8 | TRUE | TRUE | 8 | 2048 | 8 | 1 | 0.125M | 3314 | 193 |
-| 1 | 8 | TRUE | TRUE | 16 | 2048 | 8 | 1 | 0.25M | 3268 | 191 |
-| 1 | 8 | TRUE | TRUE | 32 | 2048 | 8 | 1 | 0.5M | 3323 | 188 |
-| 1 | 8 | TRUE | TRUE | 64 | 2048 | 8 | 1 | 1M | 3217 | 188 |
-| 1 | 8 | TRUE | TRUE | 128 | 2048 | 8 | 1 | 2M | 3260 | 187 |
-| 1 | 8 | TRUE | TRUE | 256 | 2048 | 8 | 1 | 4M | 3215 | 187 |
-| 1 | 8 | TRUE | TRUE | 512 | 2048 | 8 | 1 | 8M | 3199 | 186 |
-| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 8 | 1 | 16M | 3163 | 184 |
-| 1 | 8 | TRUE | TRUE | 512 | 2048 | 4 | 1 | 4M | 2963 | 173 |
-| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 2 | 1 | 4M | 2341 | 136 |
-| 1 | 8 | TRUE | TRUE | 1024 | 2048 | 4 | 1 | 8M | 2796 | 160 |
-
-当关闭 Activation Ckpt后,测试结果如下表所示:
-
-| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
-|-|-|-|-|-|-|-|-|-|-|-|
-| 1 | 8 | TRUE | FALSE | 8 | 2048 | 2 | 4 | 0.125M | 4103 | 183 |
-| 1 | 8 | TRUE | FALSE | 16 | 2048 | 2 | 4 | 0.25M | 3939 | 177 |
-| 1 | 8 | TRUE | FALSE | 32 | 2048 | 2 | 4 | 0.5M | 3919 | 176 |
-| 1 | 8 | TRUE | FALSE | 64 | 2048 | 2 | 4 | 1M | 3944 | 174 |
-| 1 | 8 | TRUE | FALSE | 128 | 2048 | 2 | 4 | 2M | 3928 | 173 |
-| 1 | 8 | TRUE | FALSE | 256 | 2048 | 2 | 4 | 4M | 3920 | 173 |
-| 1 | 8 | TRUE | FALSE | 512 | 2048 | 2 | 4 | 8M | 3900 | 173 |
-| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 4 | 16M | 3625 | 160 |
-| 1 | 8 | TRUE | FALSE | 512 | 2048 | 2 | 2 | 4M | 3084 | 139 |
-| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 1 | 4M | 2346 | 105 |
-| 1 | 8 | TRUE | FALSE | 1024 | 2048 | 2 | 2 | 8M | 2817 | 124 |
-
-
-
-
diff --git a/doc/usage.md b/doc/usage.md
deleted file mode 100644
index 82c20e0..0000000
--- a/doc/usage.md
+++ /dev/null
@@ -1,370 +0,0 @@
-## 使用教程
-
-启动一个 Demo 模型训练,需要进行三项准备,**安装**,**数据集准备**和**模型训练配置**。接下来,首先会介绍数据准备相关的操作,再简要描述模型训练配置相关的内容。
-
-### 安装
-请参考[安装文档](./install.md)进行安装。
-
-### 数据准备 (预训练)
-
-InternLM训练任务的数据集包括一系列的`bin`和`meta`文件。使用`tokenizer`从原始文本文件生成训练用数据集。通过在`tools/tokenizer.py`中指定模型参数路径的方式来导入tokenizer模型。目前提供`V7_sft.model`来生成tokens。若想使用不同的模型,可直接修改`tokenizer.py`中的模型参数路径。
-
-可以运行以下命令生成原始数据对应的`bin`和`meta`文件,其中参数`text_input_path`表示原始文本数据路径,目前支持`txt`、`json`和`jsonl`三种输入格式,`bin_output_path`表示生成的`bin`文件的保存路径。
-```bash
-$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
-```
-
-下面是一个数据处理的例子:
-
-给定一个包含原始数据集的文件`raw_data.txt`,原始数据集如下所示:
-```bash
-感恩生活中的每一个细节,才能真正体会到幸福的滋味。
-梦想是人生的动力源泉,努力追逐,才能实现自己的目标。
-学会宽容和理解,才能建立真正和谐的人际关系。
-```
-
-可以通过运行以下命令来生成`bin`和`meta`文件:
-```bash
-$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
-```
-
-需要注意的是,生成的`bin`文件需要保存在`cn`或者`en`或者`code`或者`ja`或者`ar`或者`kaoshi`这六个目录下,以区分数据集的类型。
-
-其中,`cn`表示中文数据集;`en`表示英文数据集;`code`表示代码数据集;`ja`表示日语数据集;`ar`表示阿拉伯语数据集;`kaoshi`表示考试数据集。
-
-生成的bin文件的格式如下:
-```python
-{"tokens": [73075, 75302, 69522, 69022, 98899, 67713, 68015, 81269, 74637, 75445, 99157]}
-{"tokens": [69469, 60355, 73026, 68524, 60846, 61844, 98899, 67775, 79241, 98899, 67713, 67800, 67453, 67838, 99157]}
-{"tokens": [68057, 79017, 60378, 68014, 98899, 67713, 67990, 68015, 70381, 67428, 61003, 67622, 99157]}
-```
-`bin`文件中的每一行均对应原始数据集中的每一个句子,表示每个句子的`token`(下文将用sequence指定)。
-
-生成的`meta`文件的格式如下:
-```bash
-(0, 11), (90, 15), (208, 13)
-```
-在`meta`文件中,每个元组对应着`bin`文件中每一个`sequence`的元信息。其中,元组的第一个元素表示每个`sequence`在所有`sequence`中的`starting index`,第二个元素表示每个`sequence`中有多少个`tokens`。
-
-例如,对于第一个`sequence`,`starting index`为 0,有 11 个`tokens`;对于第二个`sequence`,由于第一个`sequence`转换为`string`后的长度为`89`,因此它的`starting index`为 90,有 15 个`tokens`。
-
-`json`和`jsonl`类型的文件的`bin`和`meta`文件格式和`txt`一致,此处不再赘叙。
-
-### 数据准备 (微调)
-
-微调任务的数据集格式与预训练任务保持一致,生成的数据格式为一系列的`bin`和`meta`文件。以下以 Alpaca 数据集为例,介绍微调的数据准备流程。
-
-1. 下载 [Alpaca 数据集](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json)
-
-2. 对 Alpaca 数据进行 tokenize,使用以下命令
-
-```shell
-python tools/alpaca_tokenizer.py /path/to/alpaca_dataset /path/to/output_dataset /path/to/tokenizer --split_ratio 0.1
-```
-
-建议用户参考 alpaca_tokenizer.py 编写新的脚本对自己的数据集进行 tokenize
-
-### 训练配置
-
-以 7B Demo 的配置文件`configs/7B_sft.py`为例:
-```python
-JOB_NAME = "7b_train"
-DO_ALERT = False
-
-SEQ_LEN = 2048
-HIDDEN_SIZE = 4096
-NUM_ATTENTION_HEAD = 32
-MLP_RATIO = 8 / 3
-NUM_LAYER = 32
-VOCAB_SIZE = 103168
-
-MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
-# Ckpt folder format:
-# fs: 'local:/mnt/nfs/XXX'
-SAVE_CKPT_FOLDER = "local:llm_ckpts"
-LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
-
-# boto3 Ckpt folder format:
-# import os
-# BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint
-# SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm"
-# LOAD_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm/snapshot/1/"
-CHECKPOINT_EVERY = 50
-ckpt = dict(
- enable_save_ckpt=False, # enable ckpt save.
- save_ckpt_folder=SAVE_CKPT_FOLDER, # Path to save training ckpt.
- # load_ckpt_folder= dict(path=MODEL_ONLY_FOLDER, content=["model"], ckpt_type="normal"),
- load_ckpt_folder="local:llm_ckpts/",
- # 'load_ckpt_info' setting guide:
- # 1. the 'path' indicates the ckpt path,
- # 2. the 'content' means what states will be loaded, support: "model", "sampler", "optimizer", "scheduler", "all"
- # 3. the 'ckpt_type' means the type of checkpoint to be loaded, now only 'normal' type is supported.
- load_ckpt_info=dict(path=MODEL_ONLY_FOLDER, content=("model",), ckpt_type="internlm"),
- checkpoint_every=CHECKPOINT_EVERY,
- async_upload=True, # async ckpt upload. (only work for boto3 ckpt)
- async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/", # path for temporary files during asynchronous upload.
- oss_snapshot_freq=int(CHECKPOINT_EVERY / 2), # snapshot ckpt save frequency.
-)
-
-TRAIN_FOLDER = "/path/to/dataset"
-VALID_FOLDER = "/path/to/dataset"
-data = dict(
- seq_len=SEQ_LEN,
- # micro_num means the number of micro_batch contained in one gradient update
- micro_num=4,
- # packed_length = micro_bsz * SEQ_LEN
- micro_bsz=2,
- # defaults to the value of micro_num
- valid_micro_num=4,
- # defaults to 0, means disable evaluate
- valid_every=50,
- pack_sample_into_one=False,
- total_steps=50000,
- skip_batches="",
- rampup_batch_size="",
- # Datasets with less than 50 rows will be discarded
- min_length=50,
- # train_folder=TRAIN_FOLDER,
- # valid_folder=VALID_FOLDER,
- empty_cache_and_diag_interval=10,
- diag_outlier_ratio=1.1,
-)
-
-grad_scaler = dict(
- fp16=dict(
- # the initial loss scale, defaults to 2**16
- initial_scale=2**16,
- # the minimum loss scale, defaults to None
- min_scale=1,
- # the number of steps to increase loss scale when no overflow occurs
- growth_interval=1000,
- ),
- # the multiplication factor for increasing loss scale, defaults to 2
- growth_factor=2,
- # the multiplication factor for decreasing loss scale, defaults to 0.5
- backoff_factor=0.5,
- # the maximum loss scale, defaults to None
- max_scale=2**24,
- # the number of overflows before decreasing loss scale, defaults to 2
- hysteresis=2,
-)
-
-hybrid_zero_optimizer = dict(
- # Enable low_level_optimizer overlap_communication
- overlap_sync_grad=True,
- overlap_sync_param=True,
- # bucket size for nccl communication params
- reduce_bucket_size=512 * 1024 * 1024,
- # grad clipping
- clip_grad_norm=1.0,
-)
-
-loss = dict(
- label_smoothing=0,
-)
-
-adam = dict(
- lr=1e-4,
- adam_beta1=0.9,
- adam_beta2=0.95,
- adam_beta2_c=0,
- adam_eps=1e-8,
- weight_decay=0.01,
-)
-
-lr_scheduler = dict(
- total_steps=data["total_steps"],
- init_steps=0, # optimizer_warmup_step
- warmup_ratio=0.01,
- eta_min=1e-5,
- last_epoch=-1,
-)
-
-beta2_scheduler = dict(
- init_beta2=adam["adam_beta2"],
- c=adam["adam_beta2_c"],
- cur_iter=-1,
-)
-
-model = dict(
- checkpoint=False, # The proportion of layers using activation checkpointing; valid values are True/False/[0-1]
- num_attention_heads=NUM_ATTENTION_HEAD,
- embed_split_hidden=True,
- vocab_size=VOCAB_SIZE,
- embed_grad_scale=1,
- parallel_output=True,
- hidden_size=HIDDEN_SIZE,
- num_layers=NUM_LAYER,
- mlp_ratio=MLP_RATIO,
- apply_post_layer_norm=False,
- dtype="torch.float16", # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
- norm_type="rmsnorm",
- layer_norm_epsilon=1e-5,
- use_flash_attn=True,
- num_chunks=1, # if num_chunks > 1, interleaved pipeline scheduler is used.
-)
-"""
-zero1 parallel:
- 1. if zero1 <= 0, the size of the zero process group is equal to the size of the dp process group,
-    so optimizer state parameters are sharded across the whole dp range.
- 2. if zero1 == 1, zero is not used, and every dp rank keeps the full set of optimizer state parameters.
- 3. if zero1 > 1 and zero1 <= dp world size, the zero process group is a subset of the dp process group.
-    For smaller models, it is usually better to shard within a node, i.e. a setting <= 8.
-pipeline parallel (dict):
- 1. size: int, the size of pipeline parallel.
- 2. interleaved_overlap: bool, enable/disable communication overlap when using interleaved pipeline scheduler.
-tensor parallel: tensor parallel size, usually the number of GPUs per node.
-"""
-parallel = dict(
- zero1=8,
- pipeline=dict(size=1, interleaved_overlap=True),
- sequence_parallel=False,
-)
-
-cudnn_deterministic = False
-cudnn_benchmark = False
-
-monitor = dict(
- # feishu alert configs
- alert=dict(
- enable_feishu_alert=DO_ALERT,
- feishu_alert_address=None, # feishu webhook to send alert message
- light_monitor_address=None, # light_monitor address to send heartbeat
- ),
-)
-```
-接下来将详细介绍启动一个模型训练所需要进行的数据、模型、并行和监控等相关的配置。
-
-#### 数据配置
-数据相关的关键参数配置及释义如下所示:
-```python
-TRAIN_FOLDER = "/path/to/dataset"
-SEQ_LEN = 2048
-data = dict(
- seq_len=SEQ_LEN, # 数据样本长度,默认值为 2048
- micro_num=1, # micro_num 是指在一次模型参数更新中会处理的 micro_batch 的数目,默认值为 1
- micro_bsz=1, # packed_length = micro_bsz * SEQ_LEN,为一次处理的 micro_batch 的数据大小,默认值为 1
- total_steps=50000, # 总的所需执行的 step 的数目,默认值为 50000
- min_length=50, # 若数据集文件中,数据行数少于50,将会被废弃
- train_folder=TRAIN_FOLDER, # 数据集文件路径,默认值为 None;若 train_folder 为空,则以自动生成的随机数据集进行训练测试
- pack_sample_into_one=False, # 数据整理的逻辑,决定是按照 seq_len 维度或者是 sequence 的真实长度来进行attention计算
-)
-```
-
-![pack_into_one](./imgs/pack_into_one.png)
-
-
-目前支持传入数据集文件路径`train_folder`,且要求文件格式如下:
-```bash
-- folder
- - code
- train_000.bin
- train_000.bin.meta
-```
-数据集的详细内容可参考``数据准备``模块相关的介绍。
-
-#### 模型配置
-
-如果在启动训练时要加载模型 `checkpoint`,可进行如下相关配置:
-```python
-SAVE_CKPT_FOLDER = "local:/path/to/save/ckpt"
-LOAD_CKPT_FOLDER = "local:/path/to/load/resume/ckpt"
-ckpt = dict(
- save_ckpt_folder=SAVE_CKPT_FOLDER, # 存储模型和优化器 checkpoint 的路径
- checkpoint_every=float("inf"), # 每多少个 step 存储一次 checkpoint,默认值为 inf
- # 断点续训时,加载模型和优化器等权重的路径,将从指定的 step 恢复训练
- # content 表示哪些状态会被加载,支持: "model", "sampler", "optimizer", "scheduler", "all"
- # ckpt_type 表示加载的模型类型,目前支持: "internlm"
- load_ckpt_info=dict(path=MODEL_ONLY_FOLDER, content=("model",), ckpt_type="internlm"),
-)
-```
-注意:
-- 路径若以 `local:` 为前缀,则存储在本地文件系统;若以 `boto3:` 为前缀,则存储在远程 oss 上
-
-模型相关关键参数配置如下所示:
-```python
-model_type = "INTERNLM" # 模型类型,默认值为 "INTERNLM",对应模型结构初始化接口函数
-NUM_ATTENTION_HEAD = 32
-VOCAB_SIZE = 103168
-HIDDEN_SIZE = 4096
-NUM_LAYER = 32
-MLP_RATIO = 8 / 3
-model = dict(
- checkpoint=False, # 进行重计算的模型层数比例,可选值为 True/False/[0-1]
- num_attention_heads=NUM_ATTENTION_HEAD,
- embed_split_hidden=True,
- vocab_size=VOCAB_SIZE,
- embed_grad_scale=1,
- parallel_output=True,
- hidden_size=HIDDEN_SIZE,
- num_layers=NUM_LAYER,
- mlp_ratio=MLP_RATIO,
- apply_post_layer_norm=False,
- dtype="torch.bfloat16",
- norm_type="rmsnorm",
- layer_norm_epsilon=1e-5,
-)
-```
-注意:用户可自定义模型类型名和模型结构,并配置相对应的模型参数。通过`utils/registry.py`下的`MODEL_INITIALIZER`对象进行模型初始化函数接口注册,在训练主函数`train.py`中初始化模型时,可通过`model_type`配置获取指定的模型初始化接口函数。
-
-*如果基于 InternLM 7B继续训练,可以参考 [ModelZoo](https://github.com/InternLM/InternLM/tree/main#model-zoo) 中 OpenXLab 链接下载权重*
-
-#### 并行配置
-
-训练并行配置样例如下:
-```python
-parallel = dict(
- zero1=8,
- tensor=1,
- pipeline=dict(size=1, interleaved_overlap=True),
- sequence_parallel=False,
-)
-```
-- zero1:zero 并行策略,分如下三种情况,默认值为 -1
- - 当`zero1 <= 0`,则 zero1 进程组的大小等于数据并行进程组的大小,因此优化器状态参数将在数据并行范围内分配
- - 当`zero1 == 1`,则不使用 zero1 ,所有数据并行组保留完整的优化器状态参数
- - 当`zero1 > 1`且`zero1 <= data_parallel_world_size`,则 zero1 进程组是数据并行进程组的子集
-- tensor:张量并行大小,通常是每个节点的 GPU 数量,默认值为 1
-- pipeline:流水线并行策略
- - size:流水线并行大小,默认值为 1
- - interleaved_overlap:bool 类型,交错式调度时,开启或关闭通信优化,默认值为关闭
-- sequence_parallel:是否开启序列化并行,默认值为 False
-
-注意:`数据并行大小 = 总的 GPU 数目 / 流水线并行大小 / 张量并行大小`
-
-### 启动训练
-
-完成了以上数据集准备和相关训练配置后,可启动 Demo 训练。接下来分别以 slurm 和 torch 环境为例,介绍训练启动方式。
-
-若在 slurm 上启动分布式运行环境,多节点 16 卡的运行命令如下所示:
-```bash
-$ srun -p internllm -N 2 -n 16 --ntasks-per-node=8 --gpus-per-task=1 python train.py --config ./configs/7B_sft.py
-```
-
-若在 torch 上启动分布式运行环境,单节点 8 卡的运行命令如下所示:
-```bash
-$ torchrun --nnodes=1 --nproc_per_node=8 train.py --config ./configs/7B_sft.py --launcher "torch"
-```
-
-### 运行结果
-
-以 slurm 上单机 8 卡的 Demo 训练配置为例,训练结果日志展示如下:
-```bash
-2023-07-07 12:26:58,293 INFO launch.py:228 in launch -- Distributed environment is initialized, data parallel size: 8, pipeline parallel size: 1, tensor parallel size: 1
-2023-07-07 12:26:58,293 INFO parallel_context.py:535 in set_seed -- initialized seed on rank 2, numpy: 1024, python random: 1024, ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is ParallelMode.DATA.
-2023-07-07 12:26:58,295 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=0===========
-2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=5===========
-2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=1===========
-2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=6===========
-2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=7===========
-2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=2===========
-2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=4===========
-2023-07-07 12:26:58,296 INFO train.py:378 in main -- ===========New Run Jul07_12-26-58 on host:SH-IDC1-10-140-0-135,tp:0,pp=0,dp=3===========
-2023-07-07 12:28:27,826 INFO hybrid_zero_optim.py:295 in _partition_param_list -- Number of elements on ranks: [907415552, 907411456, 910163968, 910163968, 921698304, 921698304, 921698304, 921698304], rank:0
-2023-07-07 12:28:57,802 INFO train.py:323 in record_current_batch_training_metrics -- tflops=63.27010355651958,step=0,loss=11.634403228759766,tgs (tokens/gpu/second)=1424.64,lr=4.0000000000000003e-07,loss_scale=65536.0,grad_norm=63.672620777841004,micro_num=4,num_consumed_tokens=131072,inf_nan_skip_batches=0,num_samples_in_batch=19,largest_length=2048,largest_batch=5,smallest_batch=4,adam_beta2=0.95,fwd_bwd_time=6.48
-2023-07-07 12:29:01,636 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.83371103277346,step=1,loss=11.613704681396484,tgs (tokens/gpu/second)=4274.45,lr=6.000000000000001e-07,loss_scale=65536.0,grad_norm=65.150786641452,micro_num=4,num_consumed_tokens=262144,inf_nan_skip_batches=0,num_samples_in_batch=16,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.67
-2023-07-07 12:29:05,451 INFO train.py:323 in record_current_batch_training_metrics -- tflops=190.99928472960033,step=2,loss=11.490386962890625,tgs (tokens/gpu/second)=4300.69,lr=8.000000000000001e-07,loss_scale=65536.0,grad_norm=61.57798028719357,micro_num=4,num_consumed_tokens=393216,inf_nan_skip_batches=0,num_samples_in_batch=14,largest_length=2048,largest_batch=4,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.66
-2023-07-07 12:29:09,307 INFO train.py:323 in record_current_batch_training_metrics -- tflops=188.8613541410694,step=3,loss=11.099515914916992,tgs (tokens/gpu/second)=4252.55,lr=1.0000000000000002e-06,loss_scale=65536.0,grad_norm=63.5478796484391,micro_num=4,num_consumed_tokens=524288,inf_nan_skip_batches=0,num_samples_in_batch=16,largest_length=2048,largest_batch=5,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.7
-2023-07-07 12:29:13,147 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.65918563194305,step=4,loss=10.149517059326172,tgs (tokens/gpu/second)=4270.52,lr=1.2000000000000002e-06,loss_scale=65536.0,grad_norm=51.582841631508145,micro_num=4,num_consumed_tokens=655360,inf_nan_skip_batches=0,num_samples_in_batch=19,largest_length=2048,largest_batch=6,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.68
-2023-07-07 12:29:16,994 INFO train.py:323 in record_current_batch_training_metrics -- tflops=189.3109313713174,step=5,loss=9.822169303894043,tgs (tokens/gpu/second)=4262.67,lr=1.4000000000000001e-06,loss_scale=65536.0,grad_norm=47.10386835560855,micro_num=4,num_consumed_tokens=786432,inf_nan_skip_batches=0,num_samples_in_batch=17,largest_length=2048,largest_batch=6,smallest_batch=3,adam_beta2=0.95,fwd_bwd_time=3.69
-```
diff --git a/docker.Makefile b/docker.Makefile
deleted file mode 100644
index 7cfd55a..0000000
--- a/docker.Makefile
+++ /dev/null
@@ -1,107 +0,0 @@
-DOCKER_REGISTRY ?= docker.io
-DOCKER_ORG ?= my
-DOCKER_IMAGE ?= internlm
-DOCKER_FULL_NAME = $(DOCKER_REGISTRY)/$(DOCKER_ORG)/$(DOCKER_IMAGE)
-
-CUDA_VERSION = 11.7.1
-GCC_VERSION = 10.2.0
-
-CUDNN_VERSION = 8
-BASE_RUNTIME =
-# ubuntu20.04 centos7
-BASE_OS = centos7
-BASE_DEVEL = nvidia/cuda:$(CUDA_VERSION)-cudnn$(CUDNN_VERSION)-devel-${BASE_OS}
-# The conda channel to use to install cudatoolkit
-CUDA_CHANNEL = nvidia
-# The conda channel to use to install pytorch / torchvision
-INSTALL_CHANNEL ?= pytorch
-
-PYTHON_VERSION ?= 3.10
-PYTORCH_VERSION ?= 1.13.1
-TORCHVISION_VERSION ?= 0.14.1
-TORCHAUDIO_VERSION ?= 0.13.1
-BUILD_PROGRESS ?= auto
-TRITON_VERSION ?=
-GMP_VERSION ?= 6.2.1
-MPFR_VERSION ?= 4.1.0
-MPC_VERSION ?= 1.2.1
-GCC_VERSION ?= 10.2.0
-HTTPS_PROXY_I ?=
-HTTP_PROXY_I ?=
-FLASH_ATTEN_VERSION ?= 1.0.5
-FLASH_ATTEN_TAG ?= v${FLASH_ATTEN_VERSION}
-
-BUILD_ARGS = --build-arg BASE_IMAGE=$(BASE_IMAGE) \
- --build-arg PYTHON_VERSION=$(PYTHON_VERSION) \
- --build-arg CUDA_VERSION=$(CUDA_VERSION) \
- --build-arg CUDA_CHANNEL=$(CUDA_CHANNEL) \
- --build-arg PYTORCH_VERSION=$(PYTORCH_VERSION) \
- --build-arg TORCHVISION_VERSION=$(TORCHVISION_VERSION) \
- --build-arg TORCHAUDIO_VERSION=$(TORCHAUDIO_VERSION) \
- --build-arg INSTALL_CHANNEL=$(INSTALL_CHANNEL) \
- --build-arg TRITON_VERSION=$(TRITON_VERSION) \
- --build-arg GMP_VERSION=$(GMP_VERSION) \
- --build-arg MPFR_VERSION=$(MPFR_VERSION) \
- --build-arg MPC_VERSION=$(MPC_VERSION) \
- --build-arg GCC_VERSION=$(GCC_VERSION) \
- --build-arg https_proxy=$(HTTPS_PROXY_I) \
- --build-arg http_proxy=$(HTTP_PROXY_I) \
- --build-arg FLASH_ATTEN_TAG=$(FLASH_ATTEN_TAG)
-
-EXTRA_DOCKER_BUILD_FLAGS ?=
-
-BUILD ?= build
-# Intentionally left blank
-PLATFORMS_FLAG ?=
-PUSH_FLAG ?=
-USE_BUILDX ?=1
-BUILD_PLATFORMS ?=
-WITH_PUSH ?= false
-BUILD_TYPE ?= intrenlm-dev
-
-# Setup buildx flags
-ifneq ("$(USE_BUILDX)","")
-BUILD = buildx build
-ifneq ("$(BUILD_PLATFORMS)","")
-PLATFORMS_FLAG = --platform="$(BUILD_PLATFORMS)"
-endif
-endif
-# endif
-
-# # Only set platforms flags if using buildx
-# ifeq ("$(WITH_PUSH)","true")
-# PUSH_FLAG = --push
-# endif
-# endif
-
-ifeq ($(findstring centos,$(BASE_OS)),centos)
- DOCKERFILE_PATH ?= ./docker/Dockerfile-centos
-else
- DOCKERFILE_PATH ?= ./docker/Dockerfile-ubuntu
-endif
-
-#use -f to specify dockerfile
-DOCKER_BUILD = DOCKER_BUILDKIT=1 \
- docker $(BUILD) \
- --progress=$(BUILD_PROGRESS) \
- $(EXTRA_DOCKER_BUILD_FLAGS) \
- $(PLATFORMS_FLAG) \
- $(PUSH_FLAG) \
- -f $(DOCKERFILE_PATH) \
- -t $(DOCKER_FULL_NAME):$(DOCKER_TAG) \
- $(BUILD_ARGS) .
-
- # --target $(BUILD_TYPE)
-
-.PHONY: all
-all: devel-image
-
-.PHONY: devel-image
-devel-image: BASE_IMAGE := $(BASE_DEVEL)
-devel-image: DOCKER_TAG := torch${PYTORCH_VERSION}-cuda${CUDA_VERSION}-flashatten${FLASH_ATTEN_VERSION}-${BASE_OS}
-devel-image:
- $(DOCKER_BUILD)
-
-.PHONY: clean
-clean:
- -docker rmi -f $(shell docker images -q $(DOCKER_FULL_NAME))
diff --git a/docker/Dockerfile-centos b/docker/Dockerfile-centos
deleted file mode 100644
index eed33c8..0000000
--- a/docker/Dockerfile-centos
+++ /dev/null
@@ -1,131 +0,0 @@
-ARG BASE_IMAGE
-ARG https_proxy
-ARG http_proxy
-
-##############################################################################
-# Install the basic environment on centos
-##############################################################################
-FROM ${BASE_IMAGE} as base
-ARG https_proxy
-ARG http_proxy
-RUN yum install deltarpm -y && yum update -y \
- && yum install -y \
- ca-certificates \
- cmake \
- curl \
- git \
- wget \
- tar \
- m4 \
- bzip2 \
- gcc \
- gcc-c++ \
- file \
- texinfo \
- which
-
-
-##############################################################################
-# Install the conda environment
-##############################################################################
-FROM base as conda
-ARG PYTHON_VERSION=3.10
-ARG TARGETPLATFORM
-ARG https_proxy
-ARG http_proxy
-RUN case ${TARGETPLATFORM} in \
- "linux/arm64") MINICONDA_ARCH=aarch64 ;; \
- *) MINICONDA_ARCH=x86_64 ;; \
- esac && \
- curl -fsSL -v -o ~/miniconda.sh -O "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-${MINICONDA_ARCH}.sh"
-
-RUN chmod +x ~/miniconda.sh && \
- bash ~/miniconda.sh -b -p /opt/conda && \
- rm ~/miniconda.sh && \
- /opt/conda/bin/conda install -y python=${PYTHON_VERSION} cmake conda-build pyyaml numpy ipython && \
- /opt/conda/bin/conda clean -ya
-
-
-##############################################################################
-# Install environment dependencies
-##############################################################################
-FROM conda as dep
-WORKDIR /dep
-ARG https_proxy
-ARG http_proxy
-ARG GMP_VERSION
-ARG MPFR_VERSION
-ARG MPC_VERSION
-RUN wget https://ftp.gnu.org/gnu/gmp/gmp-${GMP_VERSION}.tar.bz2 \
- && tar -vxf gmp-${GMP_VERSION}.tar.bz2 \
- && cd gmp-${GMP_VERSION}/ \
- && ./configure --prefix=/usr/local/gmp-${GMP_VERSION} \
- && make -j64 && make install \
- && cd .. \
- && wget https://ftp.gnu.org/gnu/mpfr/mpfr-${MPFR_VERSION}.tar.gz \
- && tar -vxf mpfr-${MPFR_VERSION}.tar.gz \
- && cd mpfr-${MPFR_VERSION}/ \
- && ./configure --prefix=/usr/local/mpfr-${MPFR_VERSION} --with-gmp=/usr/local/gmp-${GMP_VERSION} \
- && make -j64 && make install \
- && cd .. \
- && wget http://www.multiprecision.org/downloads/mpc-${MPC_VERSION}.tar.gz \
- && tar -vxf mpc-${MPC_VERSION}.tar.gz \
- && cd mpc-${MPC_VERSION}/ \
- && ./configure --prefix=/usr/local/mpc-${MPC_VERSION} --with-gmp=/usr/local/gmp-${GMP_VERSION} --with-mpfr=/usr/local/mpfr-${MPFR_VERSION} \
- && make -j64 && make install \
- && cd .. \
- && git clone https://github.com/ninja-build/ninja.git \
- && cd ninja \
- && git checkout release \
- && ./configure.py --bootstrap \
- && mv ./ninja /usr/bin \
- && cd ..
-
-ENV MPFR_HOME=/usr/local/mpfr-${MPFR_VERSION}
-ENV LD_LIBRARY_PATH=${MPFR_HOME}/lib:$LD_LIBRARY_PATH
-
-ARG https_proxy
-ARG http_proxy
-ARG GCC_VERSION
-ARG GMP_VERSION
-ARG MPFR_VERSION
-ARG MPC_VERSION
-RUN wget https://ftp.gnu.org/gnu/gcc/gcc-${GCC_VERSION}/gcc-${GCC_VERSION}.tar.xz \
- && tar -vxf gcc-${GCC_VERSION}.tar.xz \
- && mkdir build \
- && cd build/ \
- && ../gcc-${GCC_VERSION}/configure --prefix=/usr/local/gcc-${GCC_VERSION}/ --enable-threads=posix --disable-checking --enable-languages=c,c++ --disable-multilib \
- --with-gmp=/usr/local/gmp-${GMP_VERSION} --with-mpfr=/usr/local/mpfr-${MPFR_VERSION} --with-mpc=/usr/local/mpc-${MPC_VERSION} \
- && make -j64 && make install
-
-ENV GCC_HOME=/usr/local/gcc-${GCC_VERSION}
-ENV LD_LIBRARY_PATH=${GCC_HOME}/lib64:${CUDA_PATH}/lib64:$LD_LIBRARY_PATH
-ENV PATH=${GCC_HOME}/bin:${CUDA_PATH}/bin:$PATH
-ENV CC=${GCC_HOME}/bin/gcc
-ENV CXX=${GCC_HOME}/bin/c++
-
-
-##############################################################################
-# Install InternLM development environment, including flash-attention and apex
-##############################################################################
-FROM dep as intrenlm-dev
-COPY . /InternLM
-WORKDIR /InternLM
-ARG https_proxy
-ARG http_proxy
-ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
-RUN git submodule update --init --recursive \
- && /opt/conda/bin/pip --no-cache-dir install -r requirements/torch.txt \
- && /opt/conda/bin/pip --no-cache-dir install -r requirements/runtime.txt \
- && cd /InternLM/third_party/flash-attention \
- && /opt/conda/bin/python setup.py install \
- && cd ./csrc \
- && cd fused_dense_lib && /opt/conda/bin/pip install -v . \
- && cd ../xentropy && /opt/conda/bin/pip install -v . \
- && cd ../rotary && /opt/conda/bin/pip install -v . \
- && cd ../layer_norm && /opt/conda/bin/pip install -v . \
- && cd ../../../../ \
- && cd ./third_party/apex \
- && /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
- && /opt/conda/bin/pip cache purge \
- && rm -rf ~/.cache/pip
diff --git a/docker/Dockerfile-ubuntu b/docker/Dockerfile-ubuntu
deleted file mode 100644
index a7c5526..0000000
--- a/docker/Dockerfile-ubuntu
+++ /dev/null
@@ -1,112 +0,0 @@
-ARG BASE_IMAGE
-ARG https_proxy
-ARG http_proxy
-
-##############################################################################
-# Install the basic environment on ubuntu
-##############################################################################
-FROM ${BASE_IMAGE} as base
-ARG https_proxy
-ARG http_proxy
-RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
- build-essential \
- ca-certificates \
- cmake \
- curl \
- git \
- wget \
- tar \
- m4 \
- ninja-build
-
-
-##############################################################################
-# Install the conda environment
-##############################################################################
-FROM base as conda
-ARG PYTHON_VERSION=3.10
-ARG TARGETPLATFORM
-ARG https_proxy
-ARG http_proxy
-RUN case ${TARGETPLATFORM} in \
- "linux/arm64") MINICONDA_ARCH=aarch64 ;; \
- *) MINICONDA_ARCH=x86_64 ;; \
- esac && \
- curl -fsSL -v -o ~/miniconda.sh -O "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-${MINICONDA_ARCH}.sh"
-
-RUN chmod +x ~/miniconda.sh && \
- bash ~/miniconda.sh -b -p /opt/conda && \
- rm ~/miniconda.sh && \
- /opt/conda/bin/conda install -y python=${PYTHON_VERSION} cmake conda-build pyyaml numpy ipython && \
- /opt/conda/bin/conda clean -ya
-
-
-##############################################################################
-# Install environment dependencies
-##############################################################################
-FROM conda as dep
-WORKDIR /dep
-ARG https_proxy
-ARG http_proxy
-ARG GCC_VERSION
-ARG GMP_VERSION
-ARG MPFR_VERSION
-ARG MPC_VERSION
-RUN wget https://ftp.gnu.org/gnu/gmp/gmp-${GMP_VERSION}.tar.bz2 \
- && tar -vxf gmp-${GMP_VERSION}.tar.bz2 \
- && cd gmp-${GMP_VERSION}/ \
- && ./configure --prefix=/usr/local/gmp-${GMP_VERSION} \
- && make -j64 && make install \
- && cd .. \
- && wget https://ftp.gnu.org/gnu/mpfr/mpfr-${MPFR_VERSION}.tar.gz \
- && tar -vxf mpfr-${MPFR_VERSION}.tar.gz \
- && cd mpfr-${MPFR_VERSION}/ \
- && ./configure --prefix=/usr/local/mpfr-${MPFR_VERSION} --with-gmp=/usr/local/gmp-${GMP_VERSION} \
- && make -j64 && make install \
- && cd .. \
- && wget http://www.multiprecision.org/downloads/mpc-${MPC_VERSION}.tar.gz \
- && tar -vxf mpc-${MPC_VERSION}.tar.gz \
- && cd mpc-${MPC_VERSION}/ \
- && ./configure --prefix=/usr/local/mpc-${MPC_VERSION} --with-gmp=/usr/local/gmp-${GMP_VERSION} --with-mpfr=/usr/local/mpfr-${MPFR_VERSION} \
- && make -j64 && make install \
- && cd .. \
- && wget https://ftp.gnu.org/gnu/gcc/gcc-${GCC_VERSION}/gcc-${GCC_VERSION}.tar.xz \
- && tar -vxJf gcc-${GCC_VERSION}.tar.xz \
- && mkdir build \
- && cd build/ \
- && ../gcc-${GCC_VERSION}/configure --prefix=/usr/local/gcc-${GCC_VERSION}/ --enable-checking=release --enable-languages=c,c++ --disable-multilib \
- --with-gmp=/usr/local/gmp-${GMP_VERSION} --with-mpfr=/usr/local/mpfr-${MPFR_VERSION} --with-mpc=/usr/local/mpc-${MPC_VERSION} \
- && make -j64 && make install
-
-ENV GCC_HOME=/usr/local/gcc-${GCC_VERSION}
-ENV MPFR_HOME=/usr/local/mpfr-${MPFR_VERSION}
-ENV LD_LIBRARY_PATH=${GCC_HOME}/lib64:${MPFR_HOME}/lib:${CUDA_PATH}/lib64:$LD_LIBRARY_PATH
-ENV PATH=${GCC_HOME}/bin:${CUDA_PATH}/bin:$PATH
-ENV CC=${GCC_HOME}/bin/gcc
-ENV CXX=${GCC_HOME}/bin/c++
-
-
-##############################################################################
-# Install InternLM development environment, including flash-attention and apex
-##############################################################################
-FROM dep as intrenlm-dev
-COPY . /InternLM
-WORKDIR /InternLM
-ARG https_proxy
-ARG http_proxy
-ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
-RUN git submodule update --init --recursive \
- && /opt/conda/bin/pip --no-cache-dir install -r requirements/torch.txt \
- && /opt/conda/bin/pip --no-cache-dir install -r requirements/runtime.txt \
- && cd /InternLM/third_party/flash-attention \
- && /opt/conda/bin/python setup.py install \
- && cd ./csrc \
- && cd fused_dense_lib && /opt/conda/bin/pip install -v . \
- && cd ../xentropy && /opt/conda/bin/pip install -v . \
- && cd ../rotary && /opt/conda/bin/pip install -v . \
- && cd ../layer_norm && /opt/conda/bin/pip install -v . \
- && cd ../../../../ \
- && cd ./third_party/apex \
- && /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
- && /opt/conda/bin/pip cache purge \
- && rm -rf ~/.cache/pip
diff --git a/experiment/Dockerfile-centos b/experiment/Dockerfile-centos
deleted file mode 100644
index a1b1424..0000000
--- a/experiment/Dockerfile-centos
+++ /dev/null
@@ -1,161 +0,0 @@
-ARG BASE_IMAGE
-ARG https_proxy
-ARG http_proxy
-
-##############################################################################
-# Install the basic environment on centos
-##############################################################################
-FROM ${BASE_IMAGE} as base
-ARG https_proxy
-ARG http_proxy
-RUN yum install deltarpm -y && yum update -y \
- && yum install -y \
- ca-certificates \
- cmake \
- curl \
- git \
- wget \
- tar \
- m4 \
- bzip2 \
- gcc \
- gcc-c++ \
- file \
- texinfo \
- which
-
-
-##############################################################################
-# Install the conda environment
-##############################################################################
-FROM base as conda
-ARG PYTHON_VERSION=3.10
-ARG TARGETPLATFORM
-ARG https_proxy
-ARG http_proxy
-RUN case ${TARGETPLATFORM} in \
- "linux/arm64") MINICONDA_ARCH=aarch64 ;; \
- *) MINICONDA_ARCH=x86_64 ;; \
- esac && \
- curl -fsSL -v -o ~/miniconda.sh -O "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-${MINICONDA_ARCH}.sh"
-
-RUN chmod +x ~/miniconda.sh && \
- bash ~/miniconda.sh -b -p /opt/conda && \
- rm ~/miniconda.sh && \
- /opt/conda/bin/conda install -y python=${PYTHON_VERSION} cmake conda-build pyyaml numpy ipython && \
- /opt/conda/bin/conda clean -ya
-
-
-##############################################################################
-# Install environment dependencies
-##############################################################################
-FROM conda as dep
-WORKDIR /dep
-ARG https_proxy
-ARG http_proxy
-ARG GMP_VERSION
-ARG MPFR_VERSION
-ARG MPC_VERSION
-RUN wget https://ftp.gnu.org/gnu/gmp/gmp-${GMP_VERSION}.tar.bz2 \
- && tar -vxf gmp-${GMP_VERSION}.tar.bz2 \
- && cd gmp-${GMP_VERSION}/ \
- && ./configure --prefix=/usr/local/gmp-${GMP_VERSION} \
- && make -j64 && make install \
- && cd .. \
- && wget https://ftp.gnu.org/gnu/mpfr/mpfr-${MPFR_VERSION}.tar.gz \
- && tar -vxf mpfr-${MPFR_VERSION}.tar.gz \
- && cd mpfr-${MPFR_VERSION}/ \
- && ./configure --prefix=/usr/local/mpfr-${MPFR_VERSION} --with-gmp=/usr/local/gmp-${GMP_VERSION} \
- && make -j64 && make install \
- && cd .. \
- && wget http://www.multiprecision.org/downloads/mpc-${MPC_VERSION}.tar.gz \
- && tar -vxf mpc-${MPC_VERSION}.tar.gz \
- && cd mpc-${MPC_VERSION}/ \
- && ./configure --prefix=/usr/local/mpc-${MPC_VERSION} --with-gmp=/usr/local/gmp-${GMP_VERSION} --with-mpfr=/usr/local/mpfr-${MPFR_VERSION} \
- && make -j64 && make install \
- && cd .. \
- && git clone https://github.com/ninja-build/ninja.git \
- && cd ninja \
- && git checkout release \
- && ./configure.py --bootstrap \
- && mv ./ninja /usr/bin \
- && cd ..
-
-ENV MPFR_HOME=/usr/local/mpfr-${MPFR_VERSION}
-ENV LD_LIBRARY_PATH=${MPFR_HOME}/lib:$LD_LIBRARY_PATH
-
-ARG https_proxy
-ARG http_proxy
-ARG GCC_VERSION
-ARG GMP_VERSION
-ARG MPFR_VERSION
-ARG MPC_VERSION
-RUN wget https://ftp.gnu.org/gnu/gcc/gcc-${GCC_VERSION}/gcc-${GCC_VERSION}.tar.xz \
- && tar -vxf gcc-${GCC_VERSION}.tar.xz \
- && mkdir build \
- && cd build/ \
- && ../gcc-${GCC_VERSION}/configure --prefix=/usr/local/gcc-${GCC_VERSION}/ --enable-threads=posix --disable-checking --enable-languages=c,c++ --disable-multilib \
- --with-gmp=/usr/local/gmp-${GMP_VERSION} --with-mpfr=/usr/local/mpfr-${MPFR_VERSION} --with-mpc=/usr/local/mpc-${MPC_VERSION} \
- && make -j64 && make install
-
-ENV GCC_HOME=/usr/local/gcc-${GCC_VERSION}
-ENV LD_LIBRARY_PATH=${GCC_HOME}/lib64:${CUDA_PATH}/lib64:$LD_LIBRARY_PATH
-ENV PATH=${GCC_HOME}/bin:${CUDA_PATH}/bin:$PATH
-ENV CC=${GCC_HOME}/bin/gcc
-ENV CXX=${GCC_HOME}/bin/c++
-
-
-##############################################################################
-# Install InternLM development environment, including flash-attention and apex
-##############################################################################
-FROM dep as intrenlm-dev
-COPY . /InternLM
-WORKDIR /InternLM
-ARG https_proxy
-ARG http_proxy
-ARG PYTORCH_VERSION
-ARG TORCHVISION_VERSION
-ARG TORCHAUDIO_VERSION
-
-RUN /opt/conda/bin/pip --no-cache-dir install \
- transformers==4.29.2 \
- sentencepiece \
- numpy \
- tqdm \
- psutil \
- packaging \
- pre-commit \
- ninja \
- gputil \
- pytest \
- packaging \
- boto3 \
- botocore \
- torch-scatter \
- pyecharts \
- -f https://data.pyg.org/whl/torch-${PYTORCH_VERSION}+cu117.html \
- && /opt/conda/bin/pip --no-cache-dir install \
- --extra-index-url https://download.pytorch.org/whl/cu117 \
- torch==${PYTORCH_VERSION}+cu117 \
- torchvision==${TORCHVISION_VERSION}+cu117 \
- torchaudio==${TORCHAUDIO_VERSION}
-
-ARG https_proxy
-ARG http_proxy
-ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
-ARG FLASH_ATTEN_TAG
-
-RUN git submodule update --init --recursive \
- && cd /InternLM/third_party/flash-attention \
- && git checkout ${FLASH_ATTEN_TAG} \
- && /opt/conda/bin/python setup.py install \
- && cd ./csrc \
- && cd fused_dense_lib && /opt/conda/bin/pip install -v . \
- && cd ../xentropy && /opt/conda/bin/pip install -v . \
- && cd ../rotary && /opt/conda/bin/pip install -v . \
- && cd ../layer_norm && /opt/conda/bin/pip install -v . \
- && cd ../../../../ \
- && cd ./third_party/apex \
- && /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
- && /opt/conda/bin/pip cache purge \
- && rm -rf ~/.cache/pip
diff --git a/experiment/Dockerfile-ubuntu b/experiment/Dockerfile-ubuntu
deleted file mode 100644
index ed78d50..0000000
--- a/experiment/Dockerfile-ubuntu
+++ /dev/null
@@ -1,142 +0,0 @@
-ARG BASE_IMAGE
-ARG https_proxy
-ARG http_proxy
-
-##############################################################################
-# Install the basic environment on ubuntu
-##############################################################################
-FROM ${BASE_IMAGE} as base
-ARG https_proxy
-ARG http_proxy
-RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
- build-essential \
- ca-certificates \
- cmake \
- curl \
- git \
- wget \
- tar \
- m4 \
- ninja-build
-
-
-##############################################################################
-# Install the conda environment
-##############################################################################
-FROM base as conda
-ARG PYTHON_VERSION=3.10
-ARG TARGETPLATFORM
-ARG https_proxy
-ARG http_proxy
-RUN case ${TARGETPLATFORM} in \
- "linux/arm64") MINICONDA_ARCH=aarch64 ;; \
- *) MINICONDA_ARCH=x86_64 ;; \
- esac && \
- curl -fsSL -v -o ~/miniconda.sh -O "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-${MINICONDA_ARCH}.sh"
-
-RUN chmod +x ~/miniconda.sh && \
- bash ~/miniconda.sh -b -p /opt/conda && \
- rm ~/miniconda.sh && \
- /opt/conda/bin/conda install -y python=${PYTHON_VERSION} cmake conda-build pyyaml numpy ipython && \
- /opt/conda/bin/conda clean -ya
-
-
-##############################################################################
-# Install environment dependencies
-##############################################################################
-FROM conda as dep
-WORKDIR /dep
-ARG https_proxy
-ARG http_proxy
-ARG GCC_VERSION
-ARG GMP_VERSION
-ARG MPFR_VERSION
-ARG MPC_VERSION
-RUN wget https://ftp.gnu.org/gnu/gmp/gmp-${GMP_VERSION}.tar.bz2 \
- && tar -vxf gmp-${GMP_VERSION}.tar.bz2 \
- && cd gmp-${GMP_VERSION}/ \
- && ./configure --prefix=/usr/local/gmp-${GMP_VERSION} \
- && make -j64 && make install \
- && cd .. \
- && wget https://ftp.gnu.org/gnu/mpfr/mpfr-${MPFR_VERSION}.tar.gz \
- && tar -vxf mpfr-${MPFR_VERSION}.tar.gz \
- && cd mpfr-${MPFR_VERSION}/ \
- && ./configure --prefix=/usr/local/mpfr-${MPFR_VERSION} --with-gmp=/usr/local/gmp-${GMP_VERSION} \
- && make -j64 && make install \
- && cd .. \
- && wget http://www.multiprecision.org/downloads/mpc-${MPC_VERSION}.tar.gz \
- && tar -vxf mpc-${MPC_VERSION}.tar.gz \
- && cd mpc-${MPC_VERSION}/ \
- && ./configure --prefix=/usr/local/mpc-${MPC_VERSION} --with-gmp=/usr/local/gmp-${GMP_VERSION} --with-mpfr=/usr/local/mpfr-${MPFR_VERSION} \
- && make -j64 && make install \
- && cd .. \
- && wget https://ftp.gnu.org/gnu/gcc/gcc-${GCC_VERSION}/gcc-${GCC_VERSION}.tar.xz \
- && tar -vxJf gcc-${GCC_VERSION}.tar.xz \
- && mkdir build \
- && cd build/ \
- && ../gcc-${GCC_VERSION}/configure --prefix=/usr/local/gcc-${GCC_VERSION}/ --enable-checking=release --enable-languages=c,c++ --disable-multilib \
- --with-gmp=/usr/local/gmp-${GMP_VERSION} --with-mpfr=/usr/local/mpfr-${MPFR_VERSION} --with-mpc=/usr/local/mpc-${MPC_VERSION} \
- && make -j64 && make install
-
-ENV GCC_HOME=/usr/local/gcc-${GCC_VERSION}
-ENV MPFR_HOME=/usr/local/mpfr-${MPFR_VERSION}
-ENV LD_LIBRARY_PATH=${GCC_HOME}/lib64:${MPFR_HOME}/lib:${CUDA_PATH}/lib64:$LD_LIBRARY_PATH
-ENV PATH=${GCC_HOME}/bin:${CUDA_PATH}/bin:$PATH
-ENV CC=${GCC_HOME}/bin/gcc
-ENV CXX=${GCC_HOME}/bin/c++
-
-
-##############################################################################
-# Install InternLM development environment, including flash-attention and apex
-##############################################################################
-FROM dep as intrenlm-dev
-COPY . /InternLM
-WORKDIR /InternLM
-ARG https_proxy
-ARG http_proxy
-ARG PYTORCH_VERSION
-ARG TORCHVISION_VERSION
-ARG TORCHAUDIO_VERSION
-
-RUN /opt/conda/bin/pip --no-cache-dir install \
- transformers==4.29.2 \
- sentencepiece \
- numpy \
- tqdm \
- psutil \
- packaging \
- pre-commit \
- ninja \
- gputil \
- pytest \
- packaging \
- boto3 \
- botocore \
- torch-scatter \
- pyecharts \
- -f https://data.pyg.org/whl/torch-${PYTORCH_VERSION}+cu117.html \
- && /opt/conda/bin/pip --no-cache-dir install \
- --extra-index-url https://download.pytorch.org/whl/cu117 \
- torch==${PYTORCH_VERSION}+cu117 \
- torchvision==${TORCHVISION_VERSION}+cu117 \
- torchaudio==${TORCHAUDIO_VERSION}
-
-ARG https_proxy
-ARG http_proxy
-ARG TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
-ARG FLASH_ATTEN_TAG
-
-RUN git submodule update --init --recursive \
- && cd /InternLM/third_party/flash-attention \
- && git checkout ${FLASH_ATTEN_TAG} \
- && /opt/conda/bin/python setup.py install \
- && cd ./csrc \
- && cd fused_dense_lib && /opt/conda/bin/pip install -v . \
- && cd ../xentropy && /opt/conda/bin/pip install -v . \
- && cd ../rotary && /opt/conda/bin/pip install -v . \
- && cd ../layer_norm && /opt/conda/bin/pip install -v . \
- && cd ../../../../ \
- && cd ./third_party/apex \
- && /opt/conda/bin/pip --no-cache-dir install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ \
- && /opt/conda/bin/pip cache purge \
- && rm -rf ~/.cache/pip
diff --git a/experiment/README-CN.md b/experiment/README-CN.md
deleted file mode 100644
index 7fee559..0000000
--- a/experiment/README-CN.md
+++ /dev/null
@@ -1,25 +0,0 @@
-## Experimental Environment Image
-This module is used to test new-version environments; by default it tests torch=2.0.1 and flash-attention=2.1.0. The new environment may be unstable. For the standard environment installation, please refer to the [installation guide](../doc/install.md).
-
-### Build and Pull the Image
-When building the image, please run docker.Makefile from the InternLM root directory. This Makefile is shared with the standard environment image, and the Dockerfile used is located in the experiment directory. You can also pull the image directly from https://hub.docker.com/r/internlm/internlm with the following commands:
-```bash
-# Build the image
-# ubuntu20.04
-make -f docker.Makefile BASE_OS=ubuntu20.04 DOCKERFILE_PATH=./experiment/Dockerfile-ubuntu PYTORCH_VERSION=2.0.1 TORCHVISION_VERSION=0.15.2 TORCHAUDIO_VERSION=2.0.2 FLASH_ATTEN_VERSION=2.1.0
-# centos7
-make -f docker.Makefile BASE_OS=centos7 DOCKERFILE_PATH=./experiment/Dockerfile-centos PYTORCH_VERSION=2.0.1 TORCHVISION_VERSION=0.15.2 TORCHAUDIO_VERSION=2.0.2 FLASH_ATTEN_VERSION=2.1.0
-
-# Pull the image
-# ubuntu20.04
-docker pull internlm/internlm:experiment-torch2.0.1-flashatten2.1.0-ubuntu20.04
-# centos7
-docker pull internlm/internlm:experiment-torch2.0.1-flashatten2.1.0-centos7
-```
-
-### Run the Container
-For a local image built with the Dockerfile or pulled, use the following command to start and enter the container:
-```bash
-docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name myinternlm internlm/internlm:experiment-torch2.0.1-flashatten2.1.0-centos7 bash
-```
-The default directory in the container is `/InternLM`; start training according to the [usage guide](../doc/usage.md).
diff --git a/experiment/README-EN.md b/experiment/README-EN.md
deleted file mode 100644
index f68efc8..0000000
--- a/experiment/README-EN.md
+++ /dev/null
@@ -1,25 +0,0 @@
-## Environment Image for experiment
-This module is used to test new-version environments; by default it tests torch=2.0.1 and flash-attention=2.1.0. The new environment may be unstable. For the standard environment installation, please refer to the [installation guide](../doc/en/install.md).
-
-### Build and Pull Image
-When building the image, please run docker.Makefile from the InternLM root directory. This Makefile is shared with the standard environment image, and the Dockerfile used is located in the experiment directory. You can also pull the image directly from https://hub.docker.com/r/internlm/internlm with the following commands:
-```bash
-# Build Image
-# ubuntu20.04
-make -f docker.Makefile BASE_OS=ubuntu20.04 DOCKERFILE_PATH=./experiment/Dockerfile-ubuntu PYTORCH_VERSION=2.0.1 TORCHVISION_VERSION=0.15.2 TORCHAUDIO_VERSION=2.0.2 FLASH_ATTEN_VERSION=2.1.0
-# centos7
-make -f docker.Makefile BASE_OS=centos7 DOCKERFILE_PATH=./experiment/Dockerfile-centos PYTORCH_VERSION=2.0.1 TORCHVISION_VERSION=0.15.2 TORCHAUDIO_VERSION=2.0.2 FLASH_ATTEN_VERSION=2.1.0
-
-# Pull Image
-# ubuntu20.04
-docker pull internlm/internlm:experiment-torch2.0.1-flashatten2.1.0-ubuntu20.04
-# centos7
-docker pull internlm/internlm:experiment-torch2.0.1-flashatten2.1.0-centos7
-```
-
-### Run Container
-For the local standard image built with the Dockerfile or pulled, use the following command to start and enter the container:
-```bash
-docker run --gpus all -it -m 500g --cap-add=SYS_PTRACE --cap-add=IPC_LOCK --shm-size 20g --network=host --name myinternlm internlm/internlm:experiment-torch2.0.1-flashatten2.1.0-centos7 bash
-```
-The default directory in the container is `/InternLM`; please start training according to the [usage guide](../doc/en/usage.md).
diff --git a/finetune/README.md b/finetune/README.md
new file mode 100644
index 0000000..b2c48f3
--- /dev/null
+++ b/finetune/README.md
@@ -0,0 +1,6 @@
+# Fine-tuning with InternLM
+
+We recommend two projects to fine-tune InternLM.
+
+1. [Xtuner](): brief introduction
+2. [InternLM-Train](): brief introduction
diff --git a/internlm/__init__.py b/internlm/__init__.py
deleted file mode 100644
index dc34a31..0000000
--- a/internlm/__init__.py
+++ /dev/null
@@ -1,9 +0,0 @@
-from .initialize.initialize_trainer import initialize_trainer
-from .initialize.launch import get_default_parser, launch_from_slurm, launch_from_torch
-
-__all__ = [
- "get_default_parser",
- "initialize_trainer",
- "launch_from_slurm",
- "launch_from_torch",
-]
diff --git a/internlm/apis/__init__.py b/internlm/apis/__init__.py
deleted file mode 100644
index e69de29..0000000
diff --git a/internlm/apis/inference.py b/internlm/apis/inference.py
deleted file mode 100644
index 88d6d50..0000000
--- a/internlm/apis/inference.py
+++ /dev/null
@@ -1,848 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import torch
-import torch.nn.functional as F
-from torch import nn
-
-__all__ = ["SequenceGenerator"]
-
-
-class InferenceParams:
- """
- Intermediate cache objects for inference
- """
-
- def __init__(
- self,
- max_sequence_len,
- max_batch_size,
- sequence_len_offset=0,
- batch_size_offset=0,
- key_value_memory_dict: dict = None,
- lengths_per_sample=None,
- attention_mask=None,
- ) -> None:
-
- self.max_sequence_len: int = max_sequence_len
- self.max_batch_size: int = max_batch_size
- self.sequence_len_offset: int = sequence_len_offset
- self.batch_size_offset: int = batch_size_offset
- if key_value_memory_dict is None:
- key_value_memory_dict = {}
- self.key_value_memory_dict: dict = key_value_memory_dict
- self.fused_ft_kernel: bool = False
- self.lengths_per_sample = lengths_per_sample
- self.attention_mask = attention_mask
-
- def reorder_state(self, indices):
- if self.lengths_per_sample is not None:
- self.lengths_per_sample = self.lengths_per_sample.index_select(index=indices, dim=0)
- for key, value in list(self.key_value_memory_dict.items()):
- value = value.index_select(index=indices, dim=0)
- self.key_value_memory_dict[key] = value
-
-
-def _get_model_device(model):
- """
- Obtain the device of an nn.Module model.
-
- Args:
- model: nn.Module
-
- Return: torch.device, or None if the model has no parameters.
- """
- assert isinstance(model, nn.Module)
-
- parameters = list(model.parameters())
- if len(parameters) == 0:
- return None
- else:
- return parameters[0].device
-
-
-class SequenceGenerator:
- """
- Sequence Generator.
- """
-
- def __init__(self, decoder, eos_token_id, pad_token_id, bos_token_id):
- self.decoder = decoder
- self.eos_token_id = eos_token_id
- self.pad_token_id = pad_token_id
- self.bos_token_id = bos_token_id
-
- @torch.no_grad()
- def generate(
- self,
- tokens: "torch.LongTensor" = None,
- num_return_sequences=1,
- max_length: int = 20,
- num_beams: int = 1,
- do_sample: bool = True,
- temperature: float = 1.0,
- top_k: int = 50,
- top_p: float = 1.0,
- repetition_penalty: float = 1,
- length_penalty: float = 1.0,
- ):
- """
- Args:
- tokens: the beginning tokens whose shape is [bsz, length]. If None, generation starts from the default
- bos_token.
- num_return_sequences: number of returned sequences.
- max_length: the max length of generated sequence.
- num_beams: the size of beam search.
- do_sample: whether to use sampling.
- temperature: only meaningful when do_sample is True.
- top_k: sample from the top_k tokens.
- top_p: sample from the top_p probability mass (nucleus sampling).
-
- Return:
- the token sequence whose shape is [bsz, num_return_sequences, max_length]. If eos_token_id is not None,
- the ending of each sequence must be eos_token_id.
- """
- assert num_return_sequences <= num_beams, f"num_return_sequences ({num_return_sequences}) must not exceed num_beams ({num_beams})."
- if do_sample:
- return sample_generate(
- self.decoder,
- tokens=tokens,
- max_length=max_length,
- num_beams=num_beams,
- num_return_sequences=num_return_sequences,
- temperature=temperature,
- top_k=top_k,
- top_p=top_p,
- eos_token_id=self.eos_token_id, # the ending token id
- pad_token_id=self.pad_token_id,
- repetition_penalty=repetition_penalty, # the penalty degree for repetition tokens
- length_penalty=length_penalty, # the penalty for length. if it > 1, then encourages long sequence.
- # Otherwise, encourages short sequence.
- bos_token_id=self.bos_token_id,
- )
- else:
- return greedy_generate(
- self.decoder,
- tokens=tokens,
- max_length=max_length,
- num_beams=num_beams,
- num_return_sequences=num_return_sequences,
- eos_token_id=self.eos_token_id,
- pad_token_id=self.pad_token_id,
- repetition_penalty=repetition_penalty,
- length_penalty=length_penalty,
- bos_token_id=self.bos_token_id,
- )
-
-
-@torch.no_grad()
-def greedy_generate(
- decoder,
- tokens=None,
- max_length=20,
- num_beams=1,
- num_return_sequences=1,
- eos_token_id=None,
- pad_token_id=0,
- repetition_penalty=1,
- length_penalty=1.0,
- bos_token_id=1,
- feat_mask=None,
- ffn_mask=None,
- layer_mask=None,
-):
- """
- Search sequence greedily.
-
- Args:
- decoder: the Decoder object.
- tokens: the shape is [batch size, length]. If None, generation begins with bos_token_id.
- max_length: the max length for generated sequence.
- num_beams: the size of beam to decode.
- eos_token_id: the ending token id. If None, the decode length is max_length.
- pad_token_id: the token id of pad.
- repetition_penalty: the penalty degree for repetition tokens
- length_penalty: the penalty for length.
-
- """
- if num_beams == 1:
- token_ids = _no_beam_search_generate(
- decoder,
- tokens=tokens,
- max_length=max_length,
- temperature=1,
- top_k=50,
- top_p=1,
- eos_token_id=eos_token_id,
- do_sample=False,
- repetition_penalty=repetition_penalty,
- length_penalty=length_penalty,
- pad_token_id=pad_token_id,
- bos_token_id=bos_token_id,
- feat_mask=feat_mask,
- ffn_mask=ffn_mask,
- layer_mask=layer_mask,
- )
- else:
- token_ids = _beam_search_generate(
- decoder,
- tokens=tokens,
- max_length=max_length,
- num_beams=num_beams,
- num_return_sequences=num_return_sequences,
- temperature=1,
- top_k=50,
- top_p=1,
- eos_token_id=eos_token_id,
- do_sample=False,
- repetition_penalty=repetition_penalty,
- length_penalty=length_penalty,
- pad_token_id=pad_token_id,
- bos_token_id=bos_token_id,
- feat_mask=feat_mask,
- ffn_mask=ffn_mask,
- layer_mask=layer_mask,
- )
-
- return token_ids
-
-
-@torch.no_grad()
-def sample_generate(
- decoder,
- tokens,
- max_length=20,
- num_beams=1,
- num_return_sequences=1,
- temperature=1.0,
- top_k=50,
- top_p=1.0,
- eos_token_id=None,
- pad_token_id=0,
- repetition_penalty=1.0,
- length_penalty=1.0,
- bos_token_id=1,
-):
- """
- Generate sequences by sampling.
-
- Args:
- decoder: the Decoder object.
- tokens: the shape is [batch size, length]. If None, generation begins with bos_token_id.
- max_length: the max length for generated sequence.
- num_beams: the size of beam to decode.
- num_return_sequences: number of returned sequence.
- temperature: annealing magnitude during sampling.
- top_k: sample from the top_k tokens. (Default: 50)
- top_p: sample from the top_p probability mass (nucleus sampling). (Default: 1.0)
- eos_token_id: the ending token id. If None, the decode length is max_length.
- pad_token_id: the token id of pad.
- repetition_penalty: the penalty degree for repetition tokens
- length_penalty: the penalty for length.
-
- """
- if num_beams == 1:
- token_ids = _no_beam_search_generate(
- decoder,
- tokens=tokens,
- max_length=max_length,
- temperature=temperature,
- top_k=top_k,
- top_p=top_p,
- eos_token_id=eos_token_id,
- do_sample=True,
- repetition_penalty=repetition_penalty,
- length_penalty=length_penalty,
- pad_token_id=pad_token_id,
- bos_token_id=bos_token_id,
- )
- else:
- token_ids = _beam_search_generate(
- decoder,
- tokens=tokens,
- max_length=max_length,
- num_beams=num_beams,
- num_return_sequences=num_return_sequences,
- temperature=temperature,
- top_k=top_k,
- top_p=top_p,
- eos_token_id=eos_token_id,
- do_sample=True,
- repetition_penalty=repetition_penalty,
- length_penalty=length_penalty,
- pad_token_id=pad_token_id,
- bos_token_id=bos_token_id,
- )
- return token_ids
-
-
-@torch.no_grad()
-def _no_beam_search_generate(
- decoder,
- tokens,
- inference_params=None,
- max_length=20,
- temperature=1.0,
- top_k=50,
- top_p=1.0,
- eos_token_id=None,
- do_sample=True,
- repetition_penalty=1.0,
- length_penalty=1.0,
- pad_token_id=0,
- bos_token_id=1,
- feat_mask=None,
- ffn_mask=None,
- layer_mask=None,
-):
- # delete num_return_sequences=1 for lint check;
- batch_size = tokens.size(0)
- if eos_token_id is None:
- _eos_token_id = -1
- else:
- _eos_token_id = eos_token_id
-
- has_bos = torch.all(tokens[:, 0].eq(bos_token_id))
- if has_bos:
- bos_pos = torch.where(tokens.eq(bos_token_id), 1, 0)
- bos_sum = bos_pos.cumsum(dim=-1)
- bos_pos = torch.where(bos_sum.eq(bos_sum[:, -1:]), 0, 1)
- to_atten_x = bos_pos[:, :, None]
- to_atten_y = bos_pos[:, None, :]
- # attention_mask = torch.einsum('bno,bom->bnm', to_atten_x, to_atten_y).eq(1)
- else:
- bos_pos = torch.where(tokens.eq(bos_token_id), 1, 0)
- to_atten_x = bos_pos[:, :, None]
- to_atten_y = bos_pos[:, None, :]
- # attention_mask = torch.einsum('bno,bom->bnm', to_atten_x, to_atten_y).eq(1)
- attention_mask = torch.logical_or(to_atten_x, to_atten_y).eq(1)
- if inference_params is None:
- inference_params = InferenceParams(
- max_sequence_len=max_length,
- max_batch_size=tokens.size(0),
- sequence_len_offset=0,
- batch_size_offset=0,
- key_value_memory_dict=None,
- lengths_per_sample=None,
- attention_mask=attention_mask,
- )
-
- if layer_mask is None:
- if feat_mask is None and ffn_mask is None:
- scores = decoder(**{"input_ids": tokens, "inference_params": inference_params})
- else:
- scores = decoder(
- **{
- "input_ids": tokens,
- "inference_params": inference_params,
- "feat_mask": feat_mask,
- "ffn_mask": ffn_mask,
- }
- )
- else:
- scores = decoder(
- **{
- "input_ids": tokens,
- "inference_params": inference_params,
- "feat_mask": feat_mask,
- "ffn_mask": ffn_mask,
- "layer_mask": layer_mask,
- }
- )
-
- if isinstance(scores, (list, tuple)):
- scores = scores[0]
- scores = scores[:, -1].float()
- inference_params.sequence_len_offset += tokens.size(1)
- if _eos_token_id != -1:
- scores[:, _eos_token_id] = -1e12
- next_tokens = scores.argmax(dim=-1, keepdim=True)
- token_ids = torch.cat([tokens, next_tokens], dim=1)
- cur_len = token_ids.size(1)
- dones = token_ids.new_zeros(batch_size).eq(1)
- # tokens = tokens[:, -1:]
-
- real_max_length = max_length
- max_lengths = tokens.new_full((tokens.size(0),), fill_value=max_length, dtype=torch.long)
-
- while cur_len < real_max_length:
- # batch_size x vocab_size
- if has_bos:
- bos_pos = torch.where(token_ids.eq(bos_token_id), 1, 0)
- bos_sum = bos_pos.cumsum(dim=-1)
- bos_pos = torch.where(bos_sum.eq(bos_sum[:, -1:]), 0, 1)
- to_atten_x = bos_pos[:, :, None]
- to_atten_y = bos_pos[:, None, :]
- # attention_mask = torch.einsum('bno,bom->bnm', to_atten_x, to_atten_y).eq(1)
- else:
- bos_pos = torch.where(token_ids.eq(bos_token_id), 1, 0)
- to_atten_x = bos_pos[:, :, None]
- to_atten_y = bos_pos[:, None, :]
- # attention_mask = torch.einsum('bno,bom->bnm', to_atten_x, to_atten_y).eq(1)
- attention_mask = torch.logical_or(to_atten_x, to_atten_y).eq(1)
- inference_params.attention_mask = attention_mask
- if layer_mask is None:
- if feat_mask is None and ffn_mask is None:
- scores = decoder(**{"input_ids": token_ids[:, -1:], "inference_params": inference_params})
- else:
- scores = decoder(
- **{
- "input_ids": token_ids[:, -1:],
- "inference_params": inference_params,
- "feat_mask": feat_mask,
- "ffn_mask": ffn_mask,
- }
- )
- else:
- scores = decoder(
- **{
- "input_ids": token_ids[:, -1:],
- "inference_params": inference_params,
- "feat_mask": feat_mask,
- "ffn_mask": ffn_mask,
- "layer_mask": layer_mask,
- }
- )
-
- if isinstance(scores, (list, tuple)):
- scores = scores[0]
- scores = scores[:, -1].float()
- inference_params.sequence_len_offset += 1
-
- if repetition_penalty != 1.0:
- token_scores = scores.gather(dim=1, index=token_ids)
- lt_zero_mask = token_scores.lt(0).float()
- ge_zero_mask = lt_zero_mask.eq(0).float()
- token_scores = (
- lt_zero_mask * repetition_penalty * token_scores + ge_zero_mask / repetition_penalty * token_scores
- )
- scores.scatter_(dim=1, index=token_ids, src=token_scores)
-
- if eos_token_id is not None and length_penalty != 1.0:
- # batch_size x vocab_size
- token_scores = scores / cur_len**length_penalty
- eos_mask = scores.new_ones(scores.size(1))
- eos_mask[eos_token_id] = 0
- eos_mask = eos_mask.unsqueeze(0).eq(1)
-
- scores = scores.masked_scatter(eos_mask, token_scores)
-
- if do_sample:
- if temperature > 0 and temperature != 1:
- scores = scores / temperature
-
- scores = top_k_top_p_filtering(scores, top_k, top_p, min_tokens_to_keep=2)
- # add 1e-12 to avoid https://github.com/pytorch/pytorch/pull/27523
- probs = F.softmax(scores, dim=-1) + 1e-12
-
- next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1) # batch_size
- else:
- next_tokens = torch.argmax(scores, dim=-1) # batch_size
-
- if _eos_token_id != -1:
- next_tokens = next_tokens.masked_fill(max_lengths.eq(cur_len + 1), _eos_token_id)
- next_tokens = next_tokens.masked_fill(dones, pad_token_id)
- tokens = next_tokens.unsqueeze(1)
-
- token_ids = torch.cat([token_ids, tokens], dim=-1) # batch_size x max_len
-
- end_mask = next_tokens.eq(_eos_token_id)
- dones = dones.__or__(end_mask)
- cur_len += 1
-
- if dones.min() == 1:
- break
-
- # if eos_token_id is not None:
- # # setting the eos at the maximum length position
- # tokens.scatter(index=max_lengths[:, None], dim=1, value=eos_token_id)
- # if cur_len == max_length:
- # # If eos is not reached by the maximum length, forcibly replace the last word with eos
- # token_ids[:, -1].masked_fill_(~dones, eos_token_id)
- # TODO Here we are simply adding an extra dimension for interface compatibility, but in the future it will need to
- # be able to return multiple real results
- return token_ids[:, None]
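
The repetition-penalty update inside the loop above is easy to misread, so here is a minimal, self-contained sketch of the same idea: scores of already-generated tokens are multiplied by the penalty when negative and divided by it when positive, so both become less attractive. The helper name `apply_repetition_penalty` and the toy numbers are illustrative assumptions, not part of the original module.

```python
import torch


def apply_repetition_penalty(scores: torch.Tensor, token_ids: torch.Tensor, penalty: float) -> torch.Tensor:
    """scores: (batch, vocab) next-token scores; token_ids: (batch, seq) tokens generated so far."""
    token_scores = scores.gather(dim=1, index=token_ids)  # scores of previously generated tokens
    lt_zero = token_scores < 0
    # negative scores become more negative, positive scores shrink -> both are penalized
    token_scores = torch.where(lt_zero, token_scores * penalty, token_scores / penalty)
    return scores.scatter(dim=1, index=token_ids, src=token_scores)


if __name__ == "__main__":
    scores = torch.tensor([[2.0, -1.0, 0.5, 3.0]])  # vocab of 4 tokens, batch of 1
    history = torch.tensor([[0, 3]])                # tokens 0 and 3 were already generated
    print(apply_repetition_penalty(scores, history, penalty=1.2))
    # tokens 0 and 3 get 2.0/1.2 and 3.0/1.2; unseen tokens are unchanged
```
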
-
-
-@torch.no_grad()
-def _beam_search_generate(
- decoder,
- tokens,
- inference_params=None,
- max_length=20,
- num_beams=4,
- num_return_sequences=1,
- temperature=1.0,
- top_k=50,
- top_p=1.0,
- eos_token_id=None,
- do_sample=True,
- repetition_penalty=1.0,
- length_penalty=1.0,
- pad_token_id=0,
- bos_token_id=1,
- feat_mask=None,
- ffn_mask=None,
- layer_mask=None,
-) -> torch.LongTensor:
-
- device = _get_model_device(decoder)
- batch_size = tokens.size(0)
-
- if eos_token_id is None:
- _eos_token_id = -1
- else:
- _eos_token_id = eos_token_id
-
- has_bos = torch.all(tokens[:, 0].eq(bos_token_id))
-
- if has_bos:
- bos_pos = torch.where(tokens.eq(bos_token_id), 1, 0)
- bos_sum = bos_pos.cumsum(dim=-1)
- bos_pos = torch.where(bos_sum.eq(bos_sum[:, -1:]), 0, 1)
- to_atten_x = bos_pos[:, :, None]
- to_atten_y = bos_pos[:, None, :]
- # attention_mask = torch.einsum('bno,bom->bnm', to_atten_x, to_atten_y).eq(1)
- else:
- bos_pos = torch.where(tokens.eq(bos_token_id), 1, 0)
- to_atten_x = bos_pos[:, :, None]
- to_atten_y = bos_pos[:, None, :]
- # attention_mask = torch.einsum('bno,bom->bnm', to_atten_x, to_atten_y).eq(1)
- attention_mask = torch.logical_or(to_atten_x, to_atten_y).eq(1)
-
- if inference_params is None:
- inference_params = InferenceParams(
- max_sequence_len=max_length,
- max_batch_size=tokens.size(0),
- sequence_len_offset=0,
- batch_size_offset=0,
- key_value_memory_dict=None,
- lengths_per_sample=None,
- attention_mask=attention_mask,
- )
-
- if layer_mask is None:
- if feat_mask is None and ffn_mask is None:
- scores = decoder(**{"input_ids": tokens, "inference_params": inference_params})
- else:
- scores = decoder(
- **{
- "input_ids": tokens,
- "inference_params": inference_params,
- "feat_mask": feat_mask,
- "ffn_mask": ffn_mask,
- }
- )
- else:
- scores = decoder(
- **{
- "input_ids": tokens,
- "inference_params": inference_params,
- "feat_mask": feat_mask,
- "ffn_mask": ffn_mask,
- "layer_mask": layer_mask,
- }
- )
-
- if isinstance(scores, (list, tuple)):
- scores = scores[0]
- scores = scores[:, -1].float()
- inference_params.sequence_len_offset += tokens.size(1)
- if _eos_token_id != -1:
- scores[:, _eos_token_id] = -1e12
- vocab_size = scores.size(1)
- assert vocab_size >= num_beams, "num_beams should not be larger than the vocabulary size."
-
- if do_sample:
- probs = F.softmax(scores, dim=-1) + 1e-12
- # (batch_size, num_beams)
- next_tokens = torch.multinomial(probs, num_samples=num_beams)
- logits = probs.log()
- # (batch_size, num_beams)
- next_scores = logits.gather(dim=1, index=next_tokens)
- else:
- scores = F.log_softmax(scores, dim=-1) # (batch_size, vocab_size)
- # obtain (batch_size, num_beams), (batch_size, num_beams)
- next_scores, next_tokens = torch.topk(scores, num_beams, dim=1, largest=True, sorted=True)
-
- indices = torch.arange(batch_size, dtype=torch.long).to(device)
- indices = indices.repeat_interleave(num_beams)
- inference_params.reorder_state(indices)
-
- # batch_size * num_beams x length
- tokens = tokens.index_select(dim=0, index=indices)
- # generated tokens (batch_size', cur_len)
- token_ids = torch.cat([tokens, next_tokens.view(-1, 1)], dim=-1)
- dones = [False] * batch_size
-
- beam_scores = next_scores.view(-1) # batch_size * num_beams
-
- cur_len = token_ids.size(1)
-
- real_max_length = max_length
- max_lengths = tokens.new_full((tokens.size(0),), fill_value=max_length, dtype=torch.long)
- hypos = [
- BeamHypotheses(num_beams, real_max_length, length_penalty, early_stopping=False) for _ in range(batch_size)
- ]
- # 0, num_beams, 2*num_beams, ...
- batch_inds_with_numbeams_interval = (torch.arange(batch_size) * num_beams).view(-1, 1).to(token_ids)
-
- while cur_len < real_max_length:
- if has_bos:
- bos_pos = torch.where(token_ids.eq(bos_token_id), 1, 0)
- bos_sum = bos_pos.cumsum(dim=-1)
- bos_pos = torch.where(bos_sum.eq(bos_sum[:, -1:]), 0, 1)
- to_atten_x = bos_pos[:, :, None]
- to_atten_y = bos_pos[:, None, :]
- # attention_mask = torch.einsum('bno,bom->bnm', to_atten_x, to_atten_y).eq(1)
- else:
- bos_pos = torch.where(token_ids.eq(bos_token_id), 1, 0)
- to_atten_x = bos_pos[:, :, None]
- to_atten_y = bos_pos[:, None, :]
- # attention_mask = torch.einsum('bno,bom->bnm', to_atten_x, to_atten_y).eq(1)
- attention_mask = torch.logical_or(to_atten_x, to_atten_y).eq(1)
-
- inference_params.attention_mask = attention_mask
- # (bsz x num_beams, vocab_size)
-
- if layer_mask is None:
- if feat_mask is None and ffn_mask is None:
- scores = decoder(**{"input_ids": token_ids[:, -1:], "inference_params": inference_params})
- else:
- scores = decoder(
- **{
- "input_ids": token_ids[:, -1:],
- "inference_params": inference_params,
- "feat_mask": feat_mask,
- "ffn_mask": ffn_mask,
- }
- )
- else:
- scores = decoder(
- **{
- "input_ids": token_ids[:, -1:],
- "inference_params": inference_params,
- "feat_mask": feat_mask,
- "ffn_mask": ffn_mask,
- "layer_mask": layer_mask,
- }
- )
-
- if isinstance(scores, (list, tuple)):
- scores = scores[0]
- scores = scores[:, -1].float()
- inference_params.sequence_len_offset += 1
- if repetition_penalty != 1.0:
- token_scores = scores.gather(dim=1, index=token_ids)
- lt_zero_mask = token_scores.lt(0).float()
- ge_zero_mask = lt_zero_mask.eq(0).float()
- token_scores = (
- lt_zero_mask * repetition_penalty * token_scores + ge_zero_mask / repetition_penalty * token_scores
- )
- scores.scatter_(dim=1, index=token_ids, src=token_scores)
-
- if _eos_token_id != -1:
- max_len_eos_mask = max_lengths.eq(cur_len + 1)
- eos_scores = scores[:, _eos_token_id]
- scores[:, _eos_token_id] = torch.where(max_len_eos_mask, eos_scores + 1e32, eos_scores)
-
- if do_sample:
- if temperature > 0 and temperature != 1:
- scores = scores / temperature
-
- scores = top_k_top_p_filtering(scores, top_k, top_p, min_tokens_to_keep=num_beams + 1)
- # add 1e-12 to avoid https://github.com/pytorch/pytorch/pull/27523
- probs = F.softmax(scores, dim=-1) + 1e-12
-
- # batch_size' x (num_beams+1)
- _tokens = torch.multinomial(probs, num_samples=num_beams + 1)
-
- logits = probs.log()
- # batch_size' x (num_beams+1)
- _scores = logits.gather(dim=1, index=_tokens)
- # batch_size' x (num_beams+1)
- _scores = _scores + beam_scores[:, None]
- _scores = _scores.view(batch_size, num_beams * (num_beams + 1))
- next_scores, ids = _scores.topk(2 * num_beams, dim=1, largest=True, sorted=True)
- _tokens = _tokens.view(batch_size, num_beams * (num_beams + 1))
- # (batch_size, 2*num_beams)
- next_tokens = _tokens.gather(dim=1, index=ids)
- # (batch_size, 2*num_beams)
- from_which_beam = torch.floor(ids.float() / (num_beams + 1)).long()
- else:
- # (batch_size * num_beams, vocab_size)
- scores = F.log_softmax(scores, dim=-1)
- # (batch_size * num_beams, vocab_size)
- _scores = scores + beam_scores[:, None]
- # (batch_size, num_beams*vocab_size)
- _scores = _scores.view(batch_size, -1)
- # (bsz, 2*num_beams)
- next_scores, ids = torch.topk(_scores, 2 * num_beams, dim=1, largest=True, sorted=True)
- # (batch_size, 2*num_beams)
- from_which_beam = torch.floor(ids.float() / vocab_size).long()
- next_tokens = ids % vocab_size # (batch_size, 2*num_beams)
-
- # next_scores, sorted_inds = next_scores.sort(dim=-1, descending=True)
- # next_tokens = next_tokens.gather(dim=1, index=sorted_inds)
- # from_which_beam = from_which_beam.gather(dim=1, index=sorted_inds)
-
- not_eos_mask = next_tokens.ne(_eos_token_id)
- keep_mask = not_eos_mask.cumsum(dim=1).le(num_beams)
- keep_mask = not_eos_mask.__and__(keep_mask)
-
- _next_tokens = next_tokens.masked_select(keep_mask).view(-1, 1)
- _from_which_beam = from_which_beam.masked_select(keep_mask).view(batch_size, num_beams)
- _next_scores = next_scores.masked_select(keep_mask).view(batch_size, num_beams)
- beam_scores = _next_scores.view(-1)
-
- flag = True
- if cur_len + 1 == real_max_length:
- eos_batch_idx = torch.arange(batch_size).to(next_tokens).repeat_interleave(repeats=num_beams, dim=0)
- eos_beam_ind = torch.arange(num_beams).to(token_ids).repeat(batch_size)
- eos_beam_idx = from_which_beam[:, :num_beams].reshape(-1)
- else:
- effective_eos_mask = next_tokens[:, :num_beams].eq(_eos_token_id) # batch_size x num_beams
- if effective_eos_mask.sum().gt(0):
- eos_batch_idx, eos_beam_ind = effective_eos_mask.nonzero(as_tuple=True)
- eos_beam_idx = eos_batch_idx * num_beams * 2 + eos_beam_ind
- eos_beam_idx = from_which_beam.view(-1)[eos_beam_idx]
- else:
- flag = False
-
- if flag:
- _token_ids = torch.cat([token_ids, _next_tokens], dim=-1)
- for batch_idx, beam_ind, beam_idx in zip(
- eos_batch_idx.tolist(), eos_beam_ind.tolist(), eos_beam_idx.tolist()
- ):
- if not dones[batch_idx]:
- score = next_scores[batch_idx, beam_ind].item()
- if _eos_token_id != -1:
- hypos[batch_idx].add(_token_ids[batch_idx * num_beams + beam_idx, :cur_len].clone(), score)
- else:
- hypos[batch_idx].add(_token_ids[batch_idx * num_beams + beam_idx].clone(), score)
-
- reorder_inds = (batch_inds_with_numbeams_interval + _from_which_beam).view(-1)
- inference_params.reorder_state(reorder_inds)
- token_ids = torch.cat([token_ids.index_select(index=reorder_inds, dim=0), _next_tokens], dim=-1)
-
- for batch_idx in range(batch_size):
- dones[batch_idx] = (
- dones[batch_idx]
- or hypos[batch_idx].is_done(next_scores[batch_idx, 0].item())
- or max_lengths[batch_idx * num_beams] == cur_len + 1
- )
-
- cur_len += 1
-
- if all(dones):
- break
-
- # select the best hypotheses
- tgt_len = token_ids.new_zeros(batch_size, num_return_sequences)
- best = []
-
- for i, hypotheses in enumerate(hypos):
- # best_hyp = max(hypotheses.hyp, key=lambda x: x[0])[1]
- sorted_hyp = list(sorted(hypotheses.hyp, key=lambda x: x[0], reverse=True))
- _best = []
- for j, hyp in zip(range(num_return_sequences), sorted_hyp):
- hyp = hyp[1]
- if _eos_token_id != -1:
- hyp = torch.cat([hyp, token_ids.new_ones(1) * _eos_token_id])
- tgt_len[i, j] = len(hyp)
- _best.append(hyp)
- best.append(_best)
-
- # generate target batch
- decoded = token_ids.new_zeros(batch_size, num_return_sequences, tgt_len.max().item()).fill_(pad_token_id)
- for i, hypo in enumerate(best):
- for j, _hypo in enumerate(hypo):
- decoded[i, j, : tgt_len[i, j]] = _hypo
-
- return decoded
-
-
-class BeamHypotheses(object):
- """
- BeamHypotheses
- """
-
- def __init__(self, num_beams, max_length, length_penalty, early_stopping):
- """Initialize n-best list of hypotheses."""
- self.max_length = max_length - 1 # ignoring bos_token
- self.length_penalty = length_penalty
- self.early_stopping = early_stopping
- self.num_beams = num_beams
- self.hyp = []
- self.worst_score = 1e9
-
- def __len__(self):
- """Number of hypotheses in the list."""
- return len(self.hyp)
-
- def add(self, hyp, sum_logprobs):
- """Add a new hypothesis to the list."""
- score = sum_logprobs / len(hyp) ** self.length_penalty
- if len(self) < self.num_beams or score > self.worst_score:
- self.hyp.append((score, hyp))
- if len(self) > self.num_beams:
- sorted_scores = sorted([(s, idx) for idx, (s, _) in enumerate(self.hyp)])
- del self.hyp[sorted_scores[0][1]]
- self.worst_score = sorted_scores[1][0]
- else:
- self.worst_score = min(score, self.worst_score)
-
- def is_done(self, best_sum_logprobs):
- """If there are enough hypotheses and that none of the hypotheses being
- generated can become better than the worst one in the heap, then we are
- done with this sentence."""
- if len(self) < self.num_beams:
- return False
- elif self.early_stopping:
- return True
- else:
- return self.worst_score >= best_sum_logprobs / self.max_length**self.length_penalty
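
As a quick sanity check on the `add`/`is_done` scoring above, the toy comparison below (made-up log-probabilities) shows how the `sum_logprobs / len**length_penalty` normalization shifts the ranking between a short and a long hypothesis as the length penalty grows, which is why a penalty above 1 favors longer sequences.

```python
short_hyp = {"sum_logprobs": -4.0, "length": 4}  # average logprob -1.0
long_hyp = {"sum_logprobs": -7.0, "length": 8}   # average logprob -0.875

for length_penalty in (0.5, 1.0, 2.0):
    s_short = short_hyp["sum_logprobs"] / short_hyp["length"] ** length_penalty
    s_long = long_hyp["sum_logprobs"] / long_hyp["length"] ** length_penalty
    winner = "long" if s_long > s_short else "short"
    print(f"length_penalty={length_penalty}: short={s_short:.3f}, long={s_long:.3f} -> {winner} wins")
# 0.5 favors the short hypothesis; 1.0 and 2.0 favor the long one
```
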
-
-
-def top_k_top_p_filtering(logits, top_k=0, top_p=1.0, filter_value=-float("Inf"), min_tokens_to_keep=1):
- """
- Based on the values of top_k and top_p, set the values that do not meet the criteria to the filter_value.
-
- Args:
- logits: logit value, shape is [bsz, vocab_size].
- top_k: if greater than 0, only the top_k highest-scoring tokens are kept, and the remaining
- positions are set to filter_value.
- top_p: nucleus sampling threshold, see http://arxiv.org/abs/1904.09751.
- filter_value: the value assigned to filtered-out positions.
- min_tokens_to_keep: at least this many tokens are kept in each sample's returned
- distribution.
-
- """
- if top_k > 0:
- # Safety check
- top_k = min(max(top_k, min_tokens_to_keep), logits.size(-1))
- # Remove all tokens with a probability less than the last token of
- # the top-k
- indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
- logits[indices_to_remove] = filter_value
-
- if top_p < 1.0:
- sorted_logits, sorted_indices = torch.sort(logits, descending=True)
- cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
-
- # Remove tokens with cumulative probability above the threshold
- # (token with 0 are kept)
- sorted_indices_to_remove = cumulative_probs > top_p
- if min_tokens_to_keep > 1:
- # Keep at least min_tokens_to_keep
- # (set to min_tokens_to_keep-1 because we add the first one below)
- sorted_indices_to_remove[..., :min_tokens_to_keep] = 0
- # Shift the indices to the right to keep also the first token
- # above the threshold
- sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
- sorted_indices_to_remove[..., 0] = 0
-
- # scatter sorted tensors to original indexing
- indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
- logits[indices_to_remove] = filter_value
- return logits
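
To make the filtering above concrete, here is a standalone toy walk-through of the same top-k and top-p masking steps on a five-token distribution. It re-creates the masking inline rather than importing the deleted module, and all numbers are illustrative.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, -3.0]])  # (bsz=1, vocab=5)

# top-k: keep the 3 highest logits, push the rest to -inf
top_k = 3
kth_best = torch.topk(logits, top_k)[0][..., -1, None]
filtered = logits.masked_fill(logits < kth_best, float("-inf"))

# top-p (nucleus): keep the smallest set of tokens whose cumulative probability
# exceeds p, always retaining at least the single best token
top_p = 0.8
sorted_logits, sorted_idx = torch.sort(filtered, descending=True)
cumprobs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
remove = cumprobs > top_p
remove[..., 1:] = remove[..., :-1].clone()  # shift right so the boundary token is kept
remove[..., 0] = False
filtered = filtered.masked_fill(remove.scatter(1, sorted_idx, remove), float("-inf"))

probs = F.softmax(filtered, dim=-1)
print(probs)                                 # only the surviving tokens carry probability
next_token = torch.multinomial(probs, num_samples=1)
print(next_token)
```
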
diff --git a/internlm/core/__init__.py b/internlm/core/__init__.py
deleted file mode 100644
index d6b7048..0000000
--- a/internlm/core/__init__.py
+++ /dev/null
@@ -1,9 +0,0 @@
-from .engine import Engine
-from .naive_amp import NaiveAMPModel
-from .trainer import Trainer
-
-__all__ = [
- "NaiveAMPModel",
- "Engine",
- "Trainer",
-]
diff --git a/internlm/core/communication/__init__.py b/internlm/core/communication/__init__.py
deleted file mode 100644
index a42b9ea..0000000
--- a/internlm/core/communication/__init__.py
+++ /dev/null
@@ -1,32 +0,0 @@
-from .p2p import (
- AsynCommunicator,
- recv_backward,
- recv_forward,
- send_backward,
- send_backward_and_recv_next_backward_async,
- send_backward_recv_backward,
- send_backward_recv_forward,
- send_forward,
- send_forward_and_recv_next_forward_async,
- send_forward_backward_recv_forward_backward,
- send_forward_recv_backward,
- send_forward_recv_forward,
-)
-from .utils import recv_obj_meta, send_obj_meta
-
-__all__ = [
- "send_forward",
- "send_forward_recv_forward",
- "send_forward_backward_recv_forward_backward",
- "send_backward",
- "send_backward_recv_backward",
- "send_backward_recv_forward",
- "send_forward_recv_backward",
- "recv_backward",
- "recv_forward",
- "send_obj_meta",
- "recv_obj_meta",
- "send_backward_and_recv_next_backward_async",
- "send_forward_and_recv_next_forward_async",
- "AsynCommunicator",
-]
diff --git a/internlm/core/communication/p2p.py b/internlm/core/communication/p2p.py
deleted file mode 100644
index e707661..0000000
--- a/internlm/core/communication/p2p.py
+++ /dev/null
@@ -1,582 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-# adapted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/communication
-
-import operator
-from functools import reduce
-from typing import List, Tuple, Union
-
-import torch
-import torch.distributed as dist
-
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
-from internlm.utils.common import get_current_device
-
-from .utils import gather_split_1d_tensor, split_tensor_into_1d_equal_chunks
-
-TensorShape = Union[torch.Size, List[int], Tuple[int]]
-
-
-def _get_tensor_shape(tensor_shape: TensorShape, chunk_tensor: bool = False) -> Tuple[TensorShape, bool]:
- """get the exact tensor shape when communicating and return whether the tensor is a chunk
-
- Args:
- tensor_shape (:class:`torch.Size`): shape of tensor
- chunk_tensor (bool, optional): whether to chunk tensor, defaults to False
-
- Returns:
- Tuple[Union[:class:`torch.Size`, List[int], Tuple[int]], bool]: exact tensor shape, whether to chunk tensor
- """
- if chunk_tensor:
- tensor_chunk_shape = reduce(operator.mul, tensor_shape, 1)
- tensor_parallel_world_size = gpc.get_world_size(ParallelMode.TENSOR)
- if tensor_chunk_shape % tensor_parallel_world_size == 0:
- tensor_chunk_shape = tensor_chunk_shape // tensor_parallel_world_size
- else:
- tensor_chunk_shape = tensor_shape
- chunk_tensor = False
- else:
- tensor_chunk_shape = tensor_shape
- return tensor_chunk_shape, chunk_tensor
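
To illustrate the shape arithmetic in `_get_tensor_shape`, here is a toy sketch of the same rule: when scatter/gather is enabled and the flattened element count divides evenly by the tensor-parallel world size, each rank communicates only `numel // world_size` elements, otherwise the full shape is used. The world-size values below are made up for the example.

```python
import operator
from functools import reduce


def chunked_shape(tensor_shape, world_size, chunk_tensor=True):
    numel = reduce(operator.mul, tensor_shape, 1)
    if chunk_tensor and numel % world_size == 0:
        return numel // world_size, True   # communicate a flat chunk per rank
    return tuple(tensor_shape), False      # fall back to the full shape


print(chunked_shape((4, 2048, 4096), world_size=8))  # -> (4194304, True)
print(chunked_shape((3, 5), world_size=8))           # -> ((3, 5), False)
```
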
-
-
-def create_recv_buffer_with_shapes(recv_shapes, dtype, scatter_gather_tensors):
- if isinstance(recv_shapes, torch.Size):
- recv_chunk_shape, recv_split = _get_tensor_shape(recv_shapes, scatter_gather_tensors)
- buffer_recv = torch.empty(recv_chunk_shape, requires_grad=True, device=get_current_device(), dtype=dtype)
- return buffer_recv, recv_split
- buffer_recv = []
- for recv_shape in recv_shapes:
- recv_chunk_shape, recv_split = _get_tensor_shape(recv_shape, scatter_gather_tensors)
- tensor_recv = torch.empty(recv_chunk_shape, requires_grad=True, device=get_current_device(), dtype=dtype)
- buffer_recv.append(tensor_recv)
- return buffer_recv, recv_split
-
-
-def process_object_to_send(object_send, scatter_gather_tensors):
- if isinstance(object_send, torch.Tensor):
- send_split = _get_tensor_shape(object_send.shape, scatter_gather_tensors)[1]
- if send_split:
- object_send = split_tensor_into_1d_equal_chunks(object_send)
- return object_send
-
- object_send_list = []
- for tensor_send in object_send:
- send_split = _get_tensor_shape(tensor_send.shape, scatter_gather_tensors)[1]
- if send_split:
- object_send_list.append(split_tensor_into_1d_equal_chunks(tensor_send))
- else:
- object_send_list.append(tensor_send)
- object_send = tuple(object_send_list)
-
- return object_send
-
-
-def filling_ops_queue(obj, comm_op, comm_rank, ops_queue):
- if isinstance(obj, torch.Tensor):
- op_to_add = dist.P2POp(comm_op, obj, comm_rank)
- ops_queue.append(op_to_add)
- else:
- for tensor_to_comm in obj:
- op_to_add = dist.P2POp(comm_op, tensor_to_comm, comm_rank)
- ops_queue.append(op_to_add)
-
-
-def _communicate(
- object_send_next: Union[torch.Tensor, List[torch.Tensor]] = None,
- object_send_prev: Union[torch.Tensor, List[torch.Tensor]] = None,
- recv_prev: bool = False,
- recv_next: bool = False,
- recv_prev_shape: Union[torch.Size, List[torch.Size]] = None,
- recv_next_shape: Union[torch.Size, List[torch.Size]] = None,
- prev_rank: int = None,
- next_rank: int = None,
- dtype: torch.dtype = None,
- scatter_gather_tensors: bool = False,
-) -> Tuple[Union[torch.Tensor, List[torch.Tensor]]]:
- """
- Adapted from megatron.p2p_communication.
- Communicate tensors between stages. Used as helper method in other
- communication methods that are used in pipeline schedule.
- Takes the following arguments:
- object_send_next (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): tensor to send to next rank
- (no tensor sent if set to None).
- object_send_prev (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): tensor to send to prev rank
- (no tensor sent if set to None).
- recv_prev (bool): boolean for whether tensor should be received from
- previous rank.
- recv_next (bool): boolean for whether tensor should be received from
- next rank.
- recv_prev_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): shape of the tensor to be received
- from the previous stage, defaults to None.
- recv_next_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): shape of the tensor to be received
- from the next stage, defaults to None.
- prev_rank (int): the rank of the previous pipeline stage, defaults to None.
- next_rank (int): the rank of the next pipeline stage, defaults to None.
- dtype (torch.dtype): data type of intermediate buffers, defaults to None
- scatter_gather_tensors (bool): whether to scatter and gather tensor between pipeline stages, defaults to False
-
- Returns:
- Tuple[Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]]: returns tensor_recv_prev, tensor_recv_next
- """
-
- # Create placeholder tensors for receive in forward and backward directions
- # if needed.
- tensor_recv_prev = None
- tensor_recv_next = None
-
- if recv_prev:
- assert recv_prev_shape is not None
- tensor_recv_prev, recv_prev_split = create_recv_buffer_with_shapes(
- recv_prev_shape, dtype, scatter_gather_tensors
- )
-
- if recv_next:
- assert recv_next_shape is not None
- tensor_recv_next, recv_next_split = create_recv_buffer_with_shapes(
- recv_next_shape, dtype, scatter_gather_tensors
- )
-
- if object_send_prev is not None or recv_prev:
- if prev_rank is None:
- prev_rank = gpc.get_prev_global_rank(ParallelMode.PIPELINE)
-
- if object_send_next is not None or recv_next:
- if next_rank is None:
- next_rank = gpc.get_next_global_rank(ParallelMode.PIPELINE)
-
- if object_send_prev is not None:
- object_send_prev = process_object_to_send(object_send_prev, scatter_gather_tensors)
-
- if object_send_next is not None:
- object_send_next = process_object_to_send(object_send_next, scatter_gather_tensors)
-
- ops = []
- if object_send_prev is not None:
- filling_ops_queue(object_send_prev, dist.isend, prev_rank, ops)
-
- if tensor_recv_prev is not None:
- filling_ops_queue(tensor_recv_prev, dist.irecv, prev_rank, ops)
-
- if tensor_recv_next is not None:
- filling_ops_queue(tensor_recv_next, dist.irecv, next_rank, ops)
-
- if object_send_next is not None:
- filling_ops_queue(object_send_next, dist.isend, next_rank, ops)
-
- if len(ops) > 0:
- reqs = dist.batch_isend_irecv(ops)
- for req in reqs:
- req.wait()
- # To protect against race condition when using batch_isend_irecv().
- torch.cuda.synchronize()
-
- if recv_prev and recv_prev_split:
- if isinstance(tensor_recv_prev, torch.Tensor):
- tensor_recv_prev = gather_split_1d_tensor(tensor_recv_prev).view(recv_prev_shape).requires_grad_()
- else:
- for index in range(len(tensor_recv_prev)):
- tensor_recv_prev[index] = (
- gather_split_1d_tensor(tensor_recv_prev[index]).view(recv_prev_shape[index]).requires_grad_()
- )
-
- if recv_next and recv_next_split:
- if isinstance(tensor_recv_next, torch.Tensor):
- tensor_recv_next = gather_split_1d_tensor(tensor_recv_next).view(recv_next_shape).requires_grad_()
- else:
- for index in range(len(tensor_recv_next)):
- tensor_recv_next[index] = (
- gather_split_1d_tensor(tensor_recv_next[index]).view(recv_next_shape[index]).requires_grad_()
- )
-
- return tensor_recv_prev, tensor_recv_next
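
The core pattern that `_communicate` wraps is PyTorch's batched point-to-point API: queue `dist.P2POp` operations, launch them together with `dist.batch_isend_irecv`, then wait on the returned requests. Below is a minimal two-rank sketch of that pattern, assuming a `torchrun --nproc_per_node=2` launch; the gloo backend, script name, and tensor contents are illustrative assumptions, not taken from the original repository.

```python
import torch
import torch.distributed as dist


def main() -> None:
    dist.init_process_group(backend="gloo")   # gloo so the demo also runs without GPUs
    rank = dist.get_rank()
    peer = 1 - rank

    send_buf = torch.full((4,), float(rank))  # each rank sends a tensor filled with its rank id
    recv_buf = torch.empty(4)

    ops = [
        dist.P2POp(dist.isend, send_buf, peer),
        dist.P2POp(dist.irecv, recv_buf, peer),
    ]
    reqs = dist.batch_isend_irecv(ops)        # one batched launch instead of separate send/recv calls
    for req in reqs:
        req.wait()

    print(f"rank {rank} received {recv_buf.tolist()} from rank {peer}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```
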
-
-
-def recv_forward(
- input_tensor_shape, prev_rank=None, dtype=torch.float, scatter_gather_tensors=False
-) -> Union[torch.Tensor, List[torch.Tensor]]:
- """Copy the forward output from the previous stage in pipeline as the input tensor of this stage.
-
- Args:
- input_tensor_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): The shape of the tensor
- to be received.
- prev_rank (int, optional): The rank of the source of the tensor.
-
- Returns:
- Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]: The input tensor or input tensor list.
- """
- input_tensor, _ = _communicate(
- recv_prev=True,
- recv_prev_shape=input_tensor_shape,
- prev_rank=prev_rank,
- dtype=dtype,
- scatter_gather_tensors=scatter_gather_tensors,
- )
- return input_tensor
-
-
-def recv_backward(
- output_grad_shape, next_rank=None, dtype=torch.float, scatter_gather_tensors=False
-) -> Union[torch.Tensor, List[torch.Tensor]]:
- """Copy the gradient tensor from the next stage in pipeline as the input gradient of this stage.
-
- Args:
- output_grad_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): The shape of the tensor
- to be received.
- next_rank (int, optional): The rank of the source of the tensor.
-
- Returns:
- Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]: The input gradient tensor or gradient tensor list.
- """
- _, output_tensor_grad = _communicate(
- recv_next=True,
- recv_next_shape=output_grad_shape,
- next_rank=next_rank,
- dtype=dtype,
- scatter_gather_tensors=scatter_gather_tensors,
- )
- return output_tensor_grad
-
-
-def send_forward(output_tensor, next_rank=None, scatter_gather_tensors=False) -> None:
- """Sends the input tensor to the next stage in pipeline.
-
- Args:
- output_tensor (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Tensor to be sent.
- next_rank (int, optional): The rank of the recipient of the tensor.
- """
- _communicate(object_send_next=output_tensor, next_rank=next_rank, scatter_gather_tensors=scatter_gather_tensors)
-
-
-def send_backward(input_tensor_grad, prev_rank=None, scatter_gather_tensors=False) -> None:
- """Sends the gradient tensor to the previous stage in pipeline.
-
- Args:
- input_tensor_grad (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Tensor to be sent
- prev_rank (int, optional): The rank of the recipient of the tensor
- """
-
- _communicate(object_send_prev=input_tensor_grad, prev_rank=prev_rank, scatter_gather_tensors=scatter_gather_tensors)
-
-
-def send_forward_recv_backward(
- output_tensor, output_grad_shape, next_rank=None, dtype=torch.float, scatter_gather_tensors=False
-) -> Union[torch.Tensor, List[torch.Tensor]]:
- """Batched communication operation. Sends the input tensor to the
- next stage in pipeline, while receives the gradient tensor from the
- next stage in pipeline as the input gradient tensor of this stage.
-
- Args:
- output_tensor (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Tensor to be sent.
- output_grad_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): The shape of the tensor
- to be received.
-
- Returns:
- Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]: The input gradient tensor.
- """
- _, output_tensor_grad = _communicate(
- object_send_next=output_tensor,
- recv_next=output_grad_shape is not None,
- recv_next_shape=output_grad_shape,
- next_rank=next_rank,
- dtype=dtype,
- scatter_gather_tensors=scatter_gather_tensors,
- )
-
- return output_tensor_grad
-
-
-def send_backward_recv_forward(
- input_tensor_grad,
- input_tensor_shape,
- prev_rank=None,
- dtype=torch.float,
- scatter_gather_tensors=False,
-) -> Union[torch.Tensor, List[torch.Tensor]]:
- """Batched communication operation. Sends the gradient tensor to the
- previous stage in pipeline, while receives the output tensor from the
- previous stage in pipeline as the input of this stage.
-
- Args:
- input_tensor_grad (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Tensor to be sent.
- input_tensor_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): The shape of the tensor
- to be received.
-
- Returns:
- Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]: The input tensor.
- """
- input_tensor, _ = _communicate(
- object_send_prev=input_tensor_grad,
- recv_prev=input_tensor_shape is not None,
- recv_prev_shape=input_tensor_shape,
- prev_rank=prev_rank,
- dtype=dtype,
- scatter_gather_tensors=scatter_gather_tensors,
- )
-
- return input_tensor
-
-
-def send_forward_recv_forward(
- output_tensor,
- input_tensor_shape,
- prev_rank=None,
- next_rank=None,
- dtype=torch.float,
- scatter_gather_tensors=False,
-) -> Union[torch.Tensor, List[torch.Tensor]]:
- """Batched communication operation. Sends the input tensor to the
- next stage in pipeline, while receives the output tensor from the
- previous stage in pipeline as the input of this stage.
-
- Args:
- output_tensor (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Tensor to be sent.
- input_tensor_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): The shape of the tensor
- to be received.
-
- Returns:
- Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]: The input tensor.
- """
- input_tensor, _ = _communicate(
- object_send_next=output_tensor,
- recv_prev=input_tensor_shape is not None,
- recv_prev_shape=input_tensor_shape,
- prev_rank=prev_rank,
- next_rank=next_rank,
- dtype=dtype,
- scatter_gather_tensors=scatter_gather_tensors,
- )
- return input_tensor
-
-
-def send_backward_recv_backward(
- input_tensor_grad,
- output_grad_shape,
- prev_rank=None,
- next_rank=None,
- dtype=torch.float,
- scatter_gather_tensors=False,
-) -> Union[torch.Tensor, List[torch.Tensor]]:
- """Batched communication operation. Sends the gradient tensor to the
- previous stage in pipeline, while receives the gradient tensor from the
- next member in pipeline as the input of this stage.
-
- Args:
- input_tensor_grad (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Tensor to be sent.
- output_grad_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): The shape of the tensor
- to be received.
-
- Returns:
- Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]: The input gradient tensor.
- """
- _, output_tensor_grad = _communicate(
- object_send_prev=input_tensor_grad,
- recv_next=output_grad_shape is not None,
- recv_next_shape=output_grad_shape,
- prev_rank=prev_rank,
- next_rank=next_rank,
- dtype=dtype,
- scatter_gather_tensors=scatter_gather_tensors,
- )
- return output_tensor_grad
-
-
-def send_forward_backward_recv_forward_backward(
- output_tensor,
- input_tensor_grad,
- input_tensor_shape,
- output_grad_shape,
- prev_rank=None,
- next_rank=None,
- dtype=torch.float,
- scatter_gather_tensors=False,
-) -> Tuple[Union[torch.Tensor, List[torch.Tensor]], Union[torch.Tensor, List[torch.Tensor]]]:
- """Batched communication operation. Sends the output tensor to the next stage in pipeline and
- the gradient tensor to the previous stage, while receiving the input gradient tensor from the
- next stage and the input tensor from the previous stage.
-
- Args:
- output_tensor (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Tensor sent to the next.
- input_tensor_grad (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Tensor sent to the previous.
- input_tensor_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): The shape of the tensor received
- from the previous.
- output_grad_shape (Union[:class:`torch.Size`, List[:class:`torch.Size`]]): The shape of the tensor received
- from the next.
-
- Returns:
- Tuple(Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]], Union[:class:`torch.Tensor`,
- List[:class:`torch.Tensor`]]): (the input tensor, the input gradient tensor)
- """
- input_tensor, output_tensor_grad = _communicate(
- object_send_next=output_tensor,
- object_send_prev=input_tensor_grad,
- recv_prev=input_tensor_shape is not None,
- recv_next=output_grad_shape is not None,
- recv_prev_shape=input_tensor_shape,
- recv_next_shape=output_grad_shape,
- prev_rank=prev_rank,
- next_rank=next_rank,
- dtype=dtype,
- scatter_gather_tensors=scatter_gather_tensors,
- )
- return input_tensor, output_tensor_grad
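In the steady state of a 1F1B-style schedule, the fused variants above replace pairs of separate calls so a stage can post its send and the matching receive in a single batched operation. A rough, hedged sketch of one steady-state step, again assuming an initialized pipeline group and with `ACT_SHAPE`, `GRAD_SHAPE` and `DTYPE` as placeholders:

    # Hedged sketch: one steady-state step of a 1F1B-style schedule.
    # Send this micro-batch's activation forward while receiving the gradient
    # for an earlier micro-batch from the next stage, in one batched call.
    output_tensor_grad = send_forward_recv_backward(output_tensor, GRAD_SHAPE, dtype=DTYPE)

    # Send the gradient of an earlier micro-batch backward while receiving the
    # next micro-batch's activation from the previous stage.
    input_tensor = send_backward_recv_forward(input_tensor_grad, ACT_SHAPE, dtype=DTYPE)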
-
-
-def send_forward_and_recv_next_forward_async(
- output_tensor,
- recv_prev_shape: Union[torch.Size, List[torch.Size]] = None,
- dtype: torch.dtype = None,
- scatter_gather_tensors=False,
-):
- """send forward output to next rank and recv forward input from prev rank"""
-
- reqs = []
- tensor_recv_prev = None
-
- # prepare send operations
- if output_tensor is not None:
- next_rank = gpc.get_next_global_rank(ParallelMode.PIPELINE)
-
- output_tensor = process_object_to_send(output_tensor, scatter_gather_tensors)
-
- if isinstance(output_tensor, torch.Tensor):
- reqs.append(dist.P2POp(dist.isend, output_tensor, next_rank))
- else:
- for tensor_to_comm in output_tensor:
- reqs.append(dist.P2POp(dist.isend, tensor_to_comm, next_rank))
-
- # prepare receive operations
- if recv_prev_shape is not None:
- prev_rank = gpc.get_prev_global_rank(ParallelMode.PIPELINE)
- # create receive buffer
- tensor_recv_prev, recv_prev_split = create_recv_buffer_with_shapes(
- recv_prev_shape, dtype, scatter_gather_tensors
- )
- # generate async receive operations
- if isinstance(tensor_recv_prev, torch.Tensor):
- reqs.append(dist.P2POp(dist.irecv, tensor_recv_prev, prev_rank))
- else:
- for tensor_to_comm in tensor_recv_prev:
- reqs.append(dist.P2POp(dist.irecv, tensor_to_comm, prev_rank))
-
- if len(reqs) > 0:
- reqs = dist.batch_isend_irecv(reqs)
-
- # return and do other things
- yield
-
- # check communication completed
- for req in reqs:
- req.wait()
- # To protect against race condition when using batch_isend_irecv()
- torch.cuda.synchronize()
-
- # Process received data
- if recv_prev_shape is not None and recv_prev_split:
- if isinstance(tensor_recv_prev, torch.Tensor):
- tensor_recv_prev = gather_split_1d_tensor(tensor_recv_prev).view(recv_prev_shape).requires_grad_()
- else:
- for index in range(len(tensor_recv_prev)):
- tensor_recv_prev[index] = (
- gather_split_1d_tensor(tensor_recv_prev[index]).view(recv_prev_shape[index]).requires_grad_()
- )
-
- yield tensor_recv_prev
-
-
-def send_backward_and_recv_next_backward_async(
- input_tensor,
- recv_next_shape: Union[torch.Size, List[torch.Size]] = None,
- dtype: torch.dtype = None,
- scatter_gather_tensors=False,
-):
- """Send the backward gradient to the previous rank and receive the backward gradient from the next rank."""
-
- reqs = []
- tensor_recv_next = None
-
- # prepare send operations
- if input_tensor is not None:
- prev_rank = gpc.get_prev_global_rank(ParallelMode.PIPELINE)
-
- input_tensor = process_object_to_send(input_tensor, scatter_gather_tensors)
-
- if isinstance(input_tensor, torch.Tensor):
- reqs.append(dist.P2POp(dist.isend, input_tensor, prev_rank))
- else:
- for tensor_to_comm in input_tensor:
- reqs.append(dist.P2POp(dist.isend, tensor_to_comm, prev_rank))
-
- # prepare receive operations
- if recv_next_shape is not None:
- next_rank = gpc.get_next_global_rank(ParallelMode.PIPELINE)
- # create receive buffer
- tensor_recv_next, recv_next_split = create_recv_buffer_with_shapes(
- recv_next_shape, dtype, scatter_gather_tensors
- )
- # generate async receive operations
- if isinstance(tensor_recv_next, torch.Tensor):
- reqs.append(dist.P2POp(dist.irecv, tensor_recv_next, next_rank))
- else:
- for tensor_to_comm in tensor_recv_next:
- reqs.append(dist.P2POp(dist.irecv, tensor_to_comm, next_rank))
-
- if len(reqs) > 0:
- reqs = dist.batch_isend_irecv(reqs)
-
- # return and do other things
- yield
-
- # check communication completed
- for req in reqs:
- req.wait()
- # To protect against race condition when using batch_isend_irecv()
- torch.cuda.synchronize()
-
- # Process received data
- if recv_next_shape is not None and recv_next_split:
- if isinstance(tensor_recv_next, torch.Tensor):
- tensor_recv_next = gather_split_1d_tensor(tensor_recv_next).view(recv_next_shape).requires_grad_()
- else:
- for index in range(len(tensor_recv_next)):
- tensor_recv_next[index] = (
- gather_split_1d_tensor(tensor_recv_next[index]).view(recv_next_shape[index]).requires_grad_()
- )
-
- yield tensor_recv_next
-
-
-class AsynCommunicator:
- """AsynCommunicator for managing async communication."""
-
- def __init__(
- self,
- tensor_to_send: Union[torch.Tensor, List[torch.Tensor]],
- recv_shape: Union[torch.Size, List[torch.Size]],
- dtype: torch.dtype = None,
- scatter_gather_tensors=False,
- forward: bool = True,
- ) -> None:
- self._need_receive = recv_shape is not None
-
- if forward:
- self._coroutine = send_forward_and_recv_next_forward_async(
- tensor_to_send, recv_shape, dtype, scatter_gather_tensors
- )
- else:
- self._coroutine = send_backward_and_recv_next_backward_async(
- tensor_to_send, recv_shape, dtype, scatter_gather_tensors
- )
-
- @property
- def need_receive(self) -> bool:
- return self._need_receive
-
- def start(self) -> None:
- next(self._coroutine)
-
- def wait_and_receive(self) -> Union[torch.Tensor, List[torch.Tensor]]:
- received = next(self._coroutine)
- self._coroutine.close()
-
- return received
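AsynCommunicator wraps the two generator functions above so that the caller can overlap communication with computation: `start()` posts the batched isend/irecv and returns immediately, local work proceeds, and `wait_and_receive()` blocks only when the result is actually needed. A hedged usage sketch; `compute_something()`, `output_tensor`, `ACT_SHAPE` and `DTYPE` are placeholders, not symbols from this codebase:

    # Hedged sketch: overlap p2p communication with local computation.
    comm = AsynCommunicator(
        tensor_to_send=output_tensor,   # forward output to ship to the next stage
        recv_shape=ACT_SHAPE,           # expected shape of the incoming activation
        dtype=DTYPE,
        forward=True,
    )
    comm.start()                        # post batched isend/irecv and return immediately

    local_result = compute_something()  # placeholder work that hides the communication latency

    if comm.need_receive:
        next_input = comm.wait_and_receive()  # block until the receive has completed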
diff --git a/internlm/core/communication/utils.py b/internlm/core/communication/utils.py
deleted file mode 100644
index f413286..0000000
--- a/internlm/core/communication/utils.py
+++ /dev/null
@@ -1,125 +0,0 @@
-# adopted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/communication
-
-from typing import List, Tuple, Union
-
-import torch
-import torch.distributed as dist
-
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
-from internlm.utils.common import get_current_device
-
-TensorShape = Union[torch.Size, List[int], Tuple[int]]
-
-
-def send_meta_helper(obj, next_rank, tensor_kwargs):
- send_shape = torch.tensor(obj.size(), **tensor_kwargs)
- send_ndims = torch.tensor(len(obj.size()), **tensor_kwargs)
- dist.send(send_ndims, next_rank)
- dist.send(send_shape, next_rank)
-
-
-def send_obj_meta(obj, next_rank=None):
- """Sends obj meta information before sending a specific obj.
- Since the recipient must know the shape of the obj in p2p communications,
- meta information of the obj should be sent before communications. This function
- synchronizes with :func:`recv_obj_meta`.
-
- Args:
- obj (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): obj to be sent.
- next_rank (int): The rank of the next member in the pipeline parallel group.
- """
- if next_rank is None:
- next_rank = gpc.get_next_global_rank(ParallelMode.PIPELINE)
-
- tensor_kwargs = {"dtype": torch.long, "device": get_current_device()}
- if isinstance(obj, torch.Tensor):
- send_obj_nums = torch.tensor(1, **tensor_kwargs)
- dist.send(send_obj_nums, next_rank)
- send_meta_helper(obj, next_rank, tensor_kwargs)
- else:
- send_obj_nums = torch.tensor(len(obj), **tensor_kwargs)
- dist.send(send_obj_nums, next_rank)
- for tensor_to_send in obj:
- send_meta_helper(tensor_to_send, next_rank, tensor_kwargs)
-
-
-def recv_meta_helper(prev_rank, tensor_kwargs):
- recv_ndims = torch.empty((), **tensor_kwargs)
- dist.recv(recv_ndims, prev_rank)
- recv_shape = torch.empty(recv_ndims, **tensor_kwargs)
- dist.recv(recv_shape, prev_rank)
- return recv_shape
-
-
-def recv_obj_meta(prev_rank=None) -> torch.Size:
- """Receives obj meta information before receiving a specific obj.
- Since the recipient must know the shape of the obj in p2p communications,
- meta information of the obj should be received before communications. This function
- synchronizes with :func:`send_obj_meta`.
-
- Args:
- prev_rank (int): The rank of the source of the obj.
-
- Returns:
- Union[:class:`torch.Size`, List[:class:`torch.Size`]]: The shape of the obj to be received.
- """
- if prev_rank is None:
- prev_rank = gpc.get_prev_global_rank(ParallelMode.PIPELINE)
-
- tensor_kwargs = {"dtype": torch.long, "device": get_current_device()}
- recv_obj_nums = torch.empty((), **tensor_kwargs)
- dist.recv(recv_obj_nums, prev_rank)
- if recv_obj_nums.item() == 1:
- recv_shape = recv_meta_helper(prev_rank, tensor_kwargs)
- obj_shape = torch.Size(recv_shape)
- else:
- obj_shape = []
- for _ in range(recv_obj_nums.item()):
- recv_shape = recv_meta_helper(prev_rank, tensor_kwargs)
- obj_shape.append(torch.Size(recv_shape))
-
- return obj_shape
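send_obj_meta and recv_obj_meta form a small handshake: the sender first transmits how many tensors follow and the number of dimensions and shape of each, so the receiver can allocate correctly sized buffers before the payload arrives. A hedged sketch of the paired calls on two adjacent stages; `activations` and `DTYPE` are placeholders, and the p2p send/recv helpers are assumed to be imported from the communication module:

    # Hedged sketch: meta handshake between two adjacent pipeline stages.
    # On the sending stage: announce the shapes, then send the tensors themselves.
    send_obj_meta(activations)        # a tensor or list of tensors produced by this stage
    send_forward(activations)

    # On the receiving stage: learn the shapes first, then receive into matching buffers.
    shapes = recv_obj_meta()          # torch.Size, or a list of torch.Size for multiple tensors
    inputs = recv_forward(shapes, dtype=DTYPE)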
-
-
-def split_tensor_into_1d_equal_chunks(tensor: torch.Tensor, new_buffer=False) -> torch.Tensor:
- """Break a tensor into equal 1D chunks.
-
- Args:
- tensor (:class:`torch.Tensor`): Tensor to be split before communication.
- new_buffer (bool, optional): Whether to use a new buffer to store sliced tensor.
-
- Returns:
- :class:`torch.Tensor`: The split tensor
- """
- partition_size = torch.numel(tensor) // gpc.get_world_size(ParallelMode.TENSOR)
- start_index = partition_size * gpc.get_local_rank(ParallelMode.TENSOR)
- end_index = start_index + partition_size
- if new_buffer:
- data = torch.empty(partition_size, dtype=tensor.dtype, device=torch.cuda.current_device(), requires_grad=False)
- data.copy_(tensor.view(-1)[start_index:end_index])
- else:
- data = tensor.view(-1)[start_index:end_index]
- return data
-
-
-def gather_split_1d_tensor(tensor: torch.Tensor) -> torch.Tensor:
- """Opposite of above function, gather values from model parallel ranks.
-
- Args:
- tensor (:class:`torch.Tensor`): Tensor to be gathered after communication.
- Returns:
- :class:`torch.Tensor`: The gathered tensor.
- """
- world_size = gpc.get_world_size(ParallelMode.TENSOR)
- numel = torch.numel(tensor)
- numel_gathered = world_size * numel
- gathered = torch.empty(numel_gathered, dtype=tensor.dtype, device=torch.cuda.current_device(), requires_grad=False)
- chunks = [gathered[i * numel : (i + 1) * numel] for i in range(world_size)]
- dist.all_gather(chunks, tensor, group=gpc.get_group(ParallelMode.TENSOR))
- return gathered
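Together, the split/gather pair implements the scatter-gather optimization: before a p2p send, each tensor-parallel rank ships only its 1/world_size slice of the flattened tensor, and the receiver reassembles the full tensor with an all-gather over the tensor-parallel group. The runnable single-process snippet below only illustrates the indexing arithmetic the two functions rely on; it does not call the distributed versions and the world size is made up.

    import torch

    # Illustration of the slicing done by split_tensor_into_1d_equal_chunks and
    # the reassembly done by gather_split_1d_tensor, with a fake world size of 4.
    world_size = 4
    tensor = torch.arange(24, dtype=torch.float32).view(2, 3, 4)

    partition_size = tensor.numel() // world_size
    chunks = [
        tensor.view(-1)[rank * partition_size : (rank + 1) * partition_size]
        for rank in range(world_size)
    ]

    # In the distributed version each rank holds exactly one of these chunks and
    # dist.all_gather rebuilds the flat tensor; here we simply concatenate them.
    gathered = torch.cat(chunks).view(tensor.shape)
    assert torch.equal(gathered, tensor)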
diff --git a/internlm/core/context/__init__.py b/internlm/core/context/__init__.py
deleted file mode 100644
index 3fc7deb..0000000
--- a/internlm/core/context/__init__.py
+++ /dev/null
@@ -1,49 +0,0 @@
-from .parallel_context import (
- IS_TENSOR_PARALLEL,
- Config,
- ParallelContext,
- global_context,
-)
-from .process_group_initializer import (
- Initializer_Data,
- Initializer_Model,
- Initializer_Nettest,
- Initializer_Pipeline,
- Initializer_Tensor,
- Initializer_Zero1,
- ParallelMode,
- ProcessGroupInitializer,
-)
-from .random import (
- add_seed,
- get_current_mode,
- get_seeds,
- get_states,
- seed,
- set_mode,
- set_seed_states,
- sync_states,
-)
-
-__all__ = [
- "Config",
- "IS_TENSOR_PARALLEL",
- "global_context",
- "ParallelContext",
- "ParallelMode",
- "Initializer_Tensor",
- "Initializer_Pipeline",
- "Initializer_Data",
- "Initializer_Zero1",
- "Initializer_Nettest",
- "ProcessGroupInitializer",
- "Initializer_Model",
- "seed",
- "set_mode",
- "add_seed",
- "get_seeds",
- "get_states",
- "get_current_mode",
- "set_seed_states",
- "sync_states",
-]
diff --git a/internlm/core/context/parallel_context.py b/internlm/core/context/parallel_context.py
deleted file mode 100644
index 968489c..0000000
--- a/internlm/core/context/parallel_context.py
+++ /dev/null
@@ -1,569 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-# adopted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/context
-
-import inspect
-import random
-import socket
-import sys
-from collections import Counter
-from importlib.machinery import SourceFileLoader
-from pathlib import Path
-from typing import Union
-
-import numpy as np
-import torch
-import torch.distributed as dist
-
-from internlm.utils.common import SingletonMeta
-from internlm.utils.logger import get_logger
-from internlm.utils.timeout import LLM_NCCL_TIMEOUT
-
-from . import process_group_initializer as pgroup_initializer
-from .process_group_initializer import ParallelMode
-from .random import add_seed, get_seeds, set_mode
-
-IS_TENSOR_PARALLEL = "is_tensor_parallel"
-
-logger = get_logger(__file__)
-
-
-class Config(dict):
- """This is a wrapper class for dict objects so that values of which can be
- accessed as attributes.
-
- Args:
- config (dict): The dict object to be wrapped.
- """
-
- def __init__(self, config: dict = None): # pylint: disable=W0231
- if config is not None:
- for k, v in config.items():
- self._add_item(k, v)
-
- def __missing__(self, key):
- raise KeyError(key)
-
- def __getattr__(self, key):
- try:
- value = super().__getitem__(key)
- return value
- except KeyError:
- raise AttributeError(key)
-
- def __setattr__(self, key, value):
- super().__setitem__(key, value)
-
- def _add_item(self, key, value):
- if isinstance(value, dict):
- self.__setattr__(key, Config(value))
- else:
- self.__setattr__(key, value)
-
- def update(self, config):
- assert isinstance(config, (Config, dict)), "can only update dictionary or Config objects."
- for k, v in config.items():
- self._add_item(k, v)
- return self
-
- @staticmethod
- def from_file(filename: str):
- """Reads a python file and constructs a corresponding :class:`Config` object.
-
- Args:
- filename (str): Name of the file to construct the return object.
-
- Returns:
- :class:`Config`: A :class:`Config` object constructed with information in the file.
-
- Raises:
- AssertionError: Raises an AssertionError if the file does not exist, or the file is not a .py file
- """
-
- # check config path
- if isinstance(filename, str):
- filepath = Path(filename).absolute()
- elif isinstance(filename, Path):
- filepath = filename.absolute()
-
- assert filepath.exists(), f"{filename} is not found, please check your configuration path"
-
- # check extension
- extension = filepath.suffix
- assert extension == ".py", "only .py files are supported"
-
- # import the config as module
- remove_path = False
- if filepath.parent not in sys.path:
- sys.path.insert(0, str(filepath.parent))
- remove_path = True
-
- module_name = filepath.stem
- source_file = SourceFileLoader(fullname=str(module_name), path=str(filepath))
- module = source_file.load_module() # pylint: disable=W4902,E1120,W1505
-
- # load into config
- config = Config()
-
- for k, v in module.__dict__.items():
- if k.startswith("__") or inspect.ismodule(v) or inspect.isclass(v):
- continue
- else:
- config._add_item(k, v)
-
- # remove module
- del sys.modules[module_name]
- if remove_path:
- sys.path.pop(0)
-
- return config
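Config.from_file executes a plain .py file and keeps its top-level names (excluding dunders, modules and classes) as attribute-accessible entries, wrapping nested dicts as Config objects. A small illustration, assuming the package is importable; the temporary file and its keys are made up for the example:

    import tempfile, textwrap
    from internlm.core.context import Config

    # Write a throwaway config file; its top-level names become Config attributes.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(textwrap.dedent("""
            SEQ_LEN = 2048
            parallel = dict(pipeline=2, tensor=4, zero1=-1)
        """))
        path = f.name

    cfg = Config.from_file(path)
    print(cfg.SEQ_LEN)          # 2048
    print(cfg.parallel.tensor)  # 4 -- nested dicts are wrapped, so attribute access works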
-
-
-class ParallelContext(metaclass=SingletonMeta):
- """This class provides interface functions for users to get the parallel context,
- such as the global rank, the local rank, the world size, etc. of each device.
-
- """
-
- def __init__(self):
- # distributed settings
- self._global_ranks = dict()
- self._local_ranks = dict()
- self._world_sizes = dict()
- self._groups = dict()
- self._cpu_groups = dict()
- self._ranks_in_group = dict()
-
- # load config from file
- self._config = None
-
- # default parallel args, will be overwritten during process group initialization
- self.world_size = 1
- self.data_parallel_size = 1
- self.pipeline_parallel_size = 1
- self.tensor_parallel_size = 1
- self.zero1_parallel_size = -1
- self.nettest_parallel_size = 1
- self.num_processes_on_current_node = -1
- self.virtual_pipeline_parallel_size = None
- self.virtual_pipeline_parallel_rank = None
-
- @property
- def config(self):
- return self._config
-
- def load_config(self, config: Union[dict, str]):
- """Loads the configuration from either a dict or a file.
-
- Args:
- config (dict or str): Either a dict containing the configuration information or the filename
- of a file containing the configuration information.
-
- Raises:
- TypeError: Raises a TypeError if `config` is neither a dict nor a str.
- """
- if isinstance(config, str):
- self._config = Config.from_file(config)
- elif isinstance(config, dict):
- self._config = Config(config)
- else:
- raise TypeError("Invalid type for config, only dictionary or string is supported")
-
- def detect_num_processes_on_current_node(self):
- hostname = socket.gethostname()
- hostname_list = [None for _ in range(self.get_world_size(ParallelMode.GLOBAL))]
- dist.all_gather_object(hostname_list, hostname, group=self.get_group(ParallelMode.GLOBAL))
- counter = Counter(hostname_list)
- self.num_processes_on_current_node = counter[hostname]
-
- @staticmethod
- def _check_parallel_mode(parallel_mode: ParallelMode):
- assert isinstance(
- parallel_mode, ParallelMode
- ), f"expected the argument parallel_mode to be of enum ParallelMode, but got {type(parallel_mode)}"
-
- def get_global_rank(self):
- """Returns the global rank of the current device.
-
- Returns:
- int: The global rank of the current device
- """
- return self._global_ranks[ParallelMode.GLOBAL]
-
- def get_local_rank(self, parallel_mode: ParallelMode):
- """Returns the local rank of the current device.
-
- Args:
- parallel_mode: The parallel mode for the rank.
-
- Returns:
- int: The local rank of the current device for `parallel_mode`.
- """
- self._check_parallel_mode(parallel_mode)
- return self._local_ranks.get(parallel_mode, 0)
-
- def get_next_global_rank(self, parallel_mode: ParallelMode):
- """Returns the global rank of the next device.
-
- Args:
- parallel_mode: The parallel mode for the rank.
-
- Returns:
- int: The global rank of the next device for `parallel_mode`.
- """
- self._check_parallel_mode(parallel_mode)
-
- # get rank and world size
- local_rank = self.get_local_rank(parallel_mode)
- world_size = self.get_world_size(parallel_mode)
- ranks_in_group = self.get_ranks_in_group(parallel_mode)
-
- return ranks_in_group[(local_rank + 1) % world_size]
-
- def get_prev_global_rank(self, parallel_mode: ParallelMode):
- """Returns the global rank of the previous device.
-
- Args:
- parallel_mode: The chosen parallel mode.
-
- Returns:
- int: The global rank of the previous device for `parallel_mode`.
- """
- self._check_parallel_mode(parallel_mode)
-
- # get rank and world size
- local_rank = self.get_local_rank(parallel_mode)
- world_size = self.get_world_size(parallel_mode)
- ranks_in_group = self.get_ranks_in_group(parallel_mode)
-
- return ranks_in_group[(local_rank - 1) % world_size]
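Both neighbour lookups are plain ring arithmetic over the ordered list of global ranks in the group: the next rank is at (local_rank + 1) mod world_size and the previous at (local_rank - 1) mod world_size, mapped back through ranks_in_group. A runnable standalone illustration with made-up ranks:

    # Standalone illustration of the ring arithmetic used above (no torch needed).
    ranks_in_group = [3, 7, 11, 15]   # hypothetical global ranks of one pipeline group
    world_size = len(ranks_in_group)

    for local_rank, global_rank in enumerate(ranks_in_group):
        nxt = ranks_in_group[(local_rank + 1) % world_size]
        prv = ranks_in_group[(local_rank - 1) % world_size]
        print(f"rank {global_rank}: next={nxt}, prev={prv}")
    # rank 3: next=7, prev=15 ... the last stage wraps around to the first.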
-
- def is_using_dp(self):
- """Returns a boolean value indicating whether the current device is initilized with
- ParallelMode.DATA and its world_size is greater than 1.
- """
- return self.is_initialized(ParallelMode.DATA) and self.get_world_size(ParallelMode.DATA) > 1
-
- def is_using_tp(self):
- """Returns a boolean value indicating whether the current device is initilized with
- ParallelMode.TENSOR and its world_size is greater than 1.
- """
- return self.is_initialized(ParallelMode.TENSOR) and self.get_world_size(ParallelMode.TENSOR) > 1
-
- def is_using_pp(self):
- """Returns a boolean value indicating whether the current device is initilized with
- ParallelMode.PIPELINE and its world_size is greater than 1.
- """
- return self.is_initialized(ParallelMode.PIPELINE) and self.get_world_size(ParallelMode.PIPELINE) > 1
-
- def is_using_sequence(self):
- """Returns a boolean value indicating whether the current device is initilized with
- ParallelMode.SEQUENCE and its world_size is greater than 1.
- """
- return False
- # return gpc.is_initialized(ParallelMode.SEQUENCE) and gpc.get_world_size(ParallelMode.SEQUENCE) > 1
-
- def is_first_rank(self, parallel_mode: ParallelMode):
- """Returns a boolean value indicating whether the current device is the first one
- among its group for `parallel_mode`.
-
- Args:
- parallel_mode: The chosen parallel mode.
-
- Returns:
- bool: a boolean value indicating whether the current device is the first one
- among its group for `parallel_mode`.
- """
- rank = 0
- if self.is_initialized(parallel_mode):
- rank = self.get_local_rank(parallel_mode)
- return rank == 0
-
- def is_rank_for_log(self):
- """Returns a boolean value indicating whether the current device should print log."""
- is_log_rank = (
- self.is_first_rank(ParallelMode.DATA)
- and self.is_first_rank(ParallelMode.TENSOR)
- and self.is_last_rank(ParallelMode.PIPELINE)
- )
- return is_log_rank
-
- def is_last_rank(self, parallel_mode: ParallelMode):
- """Returns a boolean value indicating whether the current device is the last one
- among its group for `parallel_mode`.
-
- Args:
- parallel_mode: The chosen parallel mode.
-
- Returns:
- bool: a boolean value indicating whether the current device is the last one
- among its group for `parallel_mode`.
- """
- rank = 0
- world_size = 1
- if self.is_initialized(parallel_mode):
- rank = self.get_local_rank(parallel_mode)
- world_size = self.get_world_size(parallel_mode)
- return rank == world_size - 1
-
- def is_pipeline_first_stage(self, ignore_virtual=False):
- if not ignore_virtual:
- if self.virtual_pipeline_parallel_size is not None and self.virtual_pipeline_parallel_rank != 0:
- return False
- return self.is_first_rank(ParallelMode.PIPELINE)
-
- def is_pipeline_last_stage(self, ignore_virtual=False):
- if not ignore_virtual:
- if (
- self.virtual_pipeline_parallel_size is not None
- and self.virtual_pipeline_parallel_rank != self.virtual_pipeline_parallel_size - 1
- ):
- return False
- return self.is_last_rank(ParallelMode.PIPELINE)
-
- def get_world_size(self, parallel_mode: ParallelMode):
- """Returns the world size for `parallel_mode`.
-
- Args:
- parallel_mode: The chosen parallel mode.
-
- Returns:
- int: The world size for `parallel_mode`.
- """
- self._check_parallel_mode(parallel_mode)
- return self._world_sizes.get(parallel_mode, 1)
-
- def get_group(self, parallel_mode: ParallelMode):
- """Returns the group of the current device for `parallel_mode`.
-
- Args:
- parallel_mode: The chosen parallel mode.
-
- Returns:
- torch.distributed.ProcessGroup: The group of the current device for `parallel_mode`.
- """
- self._check_parallel_mode(parallel_mode)
- return self._groups[parallel_mode]
-
- def get_ranks_in_group(self, parallel_mode: ParallelMode):
- """Returns the rank of the current device for `parallel_mode` in the group.
-
- Args:
- parallel_mode: The chosen parallel mode.
-
- Returns:
- List[int]: The ranks of the whole group for `parallel_mode`.
- """
- self._check_parallel_mode(parallel_mode)
- return self._ranks_in_group[parallel_mode]
-
- def get_cpu_group(self, parallel_mode: ParallelMode):
- self._check_parallel_mode(parallel_mode)
- return self._cpu_groups[parallel_mode]
-
- def init_global_dist(self, rank: int, world_size: int, backend: str, host: str, port: int, use_cpu: bool = False):
- """Initializes the global distributed environment
-
- Args:
- rank (int): rank for the default process group.
- world_size (int): world size of the default process group.
- backend (str): backend for ``torch.distributed``
- host (str): the master address for distributed training.
- port (int): the master port for distributed training.
- use_cpu (bool): whether to set up cpu process group.
- """
- # initialize the default process group
- init_method = f"tcp://[{host}]:{port}"
- dist.init_process_group(
- rank=rank,
- world_size=world_size,
- backend=backend,
- init_method=init_method,
- timeout=LLM_NCCL_TIMEOUT,
- )
-
- # None will give the default global process group for pytorch dist operations
- ranks = list(range(world_size))
- if use_cpu:
- cpu_group = (
- dist.new_group(ranks, backend="gloo", timeout=LLM_NCCL_TIMEOUT)
- if dist.get_backend() != "gloo"
- else None
- )
- else:
- cpu_group = None
- self._register_dist(rank, world_size, dist.GroupMember.WORLD, cpu_group, ranks, ParallelMode.GLOBAL)
- self._global_ranks[ParallelMode.GLOBAL] = rank
-
- def _register_dist(self, local_rank, world_size, process_group, cpu_group, ranks_in_group, mode):
- self._check_parallel_mode(mode)
- self._local_ranks[mode] = local_rank
- self._world_sizes[mode] = world_size
- self._groups[mode] = process_group
- self._cpu_groups[mode] = cpu_group
- self._ranks_in_group[mode] = ranks_in_group
-
- def check_sanity(self):
- """Checks sanity of the parallel context.
-
- Raises:
- AssertionError: Raises an AssertionError if the world size does not equal the product
- of data parallel size, pipeline parallel size and tensor parallel size.
- """
- dps = self.data_parallel_size
- pps = self.pipeline_parallel_size
- tps = self.tensor_parallel_size
- ws = self.world_size
- assert ws == dps * pps * tps, (
- f"Expected the world size {ws} to be equal to data"
- f" parallel size ({dps}) * pipeline parallel size "
- f"({pps}) * tensor parallel size ({tps})"
- )
- assert self.zero1_parallel_size > 0
- assert self.data_parallel_size % self.zero1_parallel_size == 0
-
- def _set_parallel_size_from_config(self, config: dict, key: str, attr_name: str):
- if key in config:
- ele = config[key]
- if isinstance(ele, int):
- setattr(self, attr_name, ele)
- elif isinstance(ele, dict):
- setattr(self, attr_name, ele["size"])
- else:
- raise NotImplementedError(
- "Parallel configuration does not support this kind of argument, please use int or dict"
- )
-
- def init_parallel_groups(self):
- """Initializes the parallel groups."""
-
- # get rank and world size
- rank = self.get_global_rank()
- world_size = self.get_world_size(ParallelMode.GLOBAL)
- self.world_size = world_size
-
- # set parallel size as attributes for global context
- parallel_config = self.config.get("parallel", None)
- if parallel_config is not None:
- self._set_parallel_size_from_config(parallel_config, "pipeline", "pipeline_parallel_size")
- self._set_parallel_size_from_config(parallel_config, "tensor", "tensor_parallel_size")
- self._set_parallel_size_from_config(parallel_config, "zero1", "zero1_parallel_size")
-
- # the user should not set the data parallel size manually
- # instead, it should be calculated based on other parallel config
- self.data_parallel_size = self.world_size // (self.pipeline_parallel_size * self.tensor_parallel_size)
-
- # the recommended nettest_parallel_size is 32 GPUs
- self.nettest_parallel_size = 32
-
- if self.zero1_parallel_size <= 0:
- self.zero1_parallel_size = self.data_parallel_size
-
- self.check_sanity()
-
- initializer_args = [
- rank,
- world_size,
- self.data_parallel_size,
- self.pipeline_parallel_size,
- self.tensor_parallel_size,
- self.zero1_parallel_size,
- self.nettest_parallel_size,
- ]
-
- # run initialization of different process groups
- initializers = []
- initializers.append(pgroup_initializer.Initializer_Data(*initializer_args))
- initializers.append(pgroup_initializer.Initializer_Model(*initializer_args))
- initializers.append(pgroup_initializer.Initializer_Tensor(*initializer_args))
- initializers.append(pgroup_initializer.Initializer_Zero1(*initializer_args))
- initializers.append(pgroup_initializer.Initializer_Nettest(*initializer_args))
- if self.pipeline_parallel_size > 1:
- initializers.append(pgroup_initializer.Initializer_Pipeline(*initializer_args))
- for initializer in initializers:
- parallel_setting = initializer.init_dist_group()
- if isinstance(parallel_setting, list):
- for args in parallel_setting:
- self._register_dist(*args)
- else:
- self._register_dist(*parallel_setting)
-
- def is_initialized(self, parallel_mode: ParallelMode):
- """Returns a boolean value indicating whether `parallel_mode` is initialized
- in the current system.
- """
- return parallel_mode in self._groups
-
- def destroy(self):
- """Destroys the current distributed parallel environment."""
- for mode, group in self._groups.items():
- if mode is not ParallelMode.GLOBAL:
- dist.destroy_process_group(group)
- # destroy global process group
- dist.destroy_process_group()
- self._groups.clear()
-
- def set_device(self, device_ordinal: int = None):
- """Sets distributed processes to be bound to devices.
-
- Args:
- device_ordinal (int, optional): the device id to be bound to
- """
- global_rank = self.get_global_rank()
- if device_ordinal is None:
- devices_per_node = torch.cuda.device_count()
- device_ordinal = global_rank % devices_per_node
-
- torch.cuda.set_device(device_ordinal)
- logger.info(f"process rank {global_rank} is bound to host:{socket.gethostname()} device: {device_ordinal}")
-
- def set_seed(self, seed: int, dpseed_with_tpoffset: bool = False):
- """Sets seeds for all random libraries.
-
- Args:
- seed (int): seed for random states
- """
- pipeline_offset = self._local_ranks.get(ParallelMode.PIPELINE, 0)
- global_rank = self.get_global_rank()
-
- random.seed(seed)
- np.random.seed(seed)
- torch.manual_seed(seed)
- assert torch.cuda.is_available()
-
- # data parallel seed are kept the same in the same pipeline stage
- dp_seed = seed
- if dpseed_with_tpoffset:
- dp_seed = seed + pipeline_offset * 1024
- add_seed(ParallelMode.DATA, dp_seed)
- add_seed(ParallelMode.DUMMY, dp_seed)
-
- # model parallel seeds are different across ranks
- if self.is_initialized(ParallelMode.TENSOR):
- tp_rank = self.get_local_rank(ParallelMode.TENSOR)
- tp_seed = seed + tp_rank + pipeline_offset * 1024
- add_seed(ParallelMode.TENSOR, tp_seed)
-
- # we do not set the random state mode to ParallelMode.DATA until model is built (instead, we use a dummy mode
- # during model construction), this is because the random state will be different in different tensor parallel
- # device of the same data parallel group. The underlying reason is that the device of tp_rank = 0 will perform
- # additional random operations during the RowParallelLinear module building process.
- set_mode(ParallelMode.DUMMY)
-
- seeds = get_seeds()
- seed_str = ", ".join([f"{k}: {v}" for k, v in seeds.items()])
- logger.info(
- f"initialized seed on rank {global_rank}, "
- f"numpy: {seed}, python random: {seed}, {seed_str},"
- f"the default parallel seed is {ParallelMode.DATA}."
- )
-
- def set_virtual_pipeline_parallel_size(self, size):
- self.virtual_pipeline_parallel_size = size
-
- def set_virtual_pipeline_parallel_rank(self, rank):
- self.virtual_pipeline_parallel_rank = rank
-
-
-global_context = ParallelContext()
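The sizes that init_parallel_groups derives follow directly from the invariant asserted in check_sanity, world_size = data * pipeline * tensor, with zero1 defaulting to the data-parallel size when it is left at -1. A runnable back-of-the-envelope check using hypothetical sizes:

    # Hypothetical job: 32 GPUs, tensor parallel 4, pipeline parallel 2, zero1 left at -1.
    world_size, tensor, pipeline, zero1 = 32, 4, 2, -1

    data = world_size // (pipeline * tensor)   # -> 4, never set by the user directly
    if zero1 <= 0:
        zero1 = data                           # -> 4, the default in init_parallel_groups

    assert world_size == data * pipeline * tensor   # the check_sanity() invariant
    assert data % zero1 == 0
    print(data, zero1)   # 4 4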
diff --git a/internlm/core/context/process_group_initializer.py b/internlm/core/context/process_group_initializer.py
deleted file mode 100644
index 97e9ef0..0000000
--- a/internlm/core/context/process_group_initializer.py
+++ /dev/null
@@ -1,418 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-# adopted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/context
-
-import math
-from abc import ABC, abstractmethod
-from enum import Enum
-
-import torch.distributed as dist
-
-from internlm.utils.timeout import LLM_NCCL_TIMEOUT
-
-
-# parallel modes
-class ParallelMode(Enum):
- """This is an enumeration class containing all possible parallel modes."""
-
- GLOBAL = "global"
-
- # common parallel
- DATA = "data"
-
- # model parallel - containing tensor and pipeline parallel groups
- # this is added to facilitate amp and grad clipping in hybrid parallel
- MODEL = "model"
-
- # pipeline parallel
- PIPELINE = "pipe"
-
- # containing all ranks in tensor parallel
- TENSOR = "tensor"
-
- # zero1 parallel
- ZERO1 = "zero1"
-
- # runtime network test
- NETTEST = "nettest"
-
- # dummy mode, only used during model construction
- DUMMY = "dummy"
-
-
-class ProcessGroupInitializer(ABC):
- """An object, knowing the parallelism configuration, that initializes parallel groups.
-
- Args:
- rank (int): The rank of current process.
- world_size (int): Size of whole communication world.
- data_parallel_size (int): Size of data parallel.
- pipeline_parallel_size (int): Size of pipeline parallel.
- tensor_parallel_size (int): Size of tensor parallel.
- zero1_parallel_size (int): Size of zero1 parallel.
- """
-
- def __init__(
- self,
- rank: int,
- world_size: int,
- data_parallel_size: int,
- pipeline_parallel_size: int,
- tensor_parallel_size: int,
- zero1_parallel_size: int,
- nettest_parallel_size: int,
- ):
- self.rank = rank
- self.world_size = world_size
- self.data_parallel_size = data_parallel_size
- self.pipeline_parallel_size = pipeline_parallel_size
- self.tensor_parallel_size = tensor_parallel_size
- self.zero1_parallel_size = zero1_parallel_size
- self.nettest_parallel_size = nettest_parallel_size
- super().__init__()
-
- @abstractmethod
- def init_dist_group(self, use_cpu: bool = False):
- pass
-
-
-class Initializer_Data(ProcessGroupInitializer):
- """A ProcessGroupInitializer for data parallelism.
-
- Args:
- rank (int): The rank of current process.
- world_size (int): Size of whole communication world.
- data_parallel_size (int): Size of data parallel.
- pipeline_parallel_size (int): Size of pipeline parallel.
- tensor_parallel_size (int): Size of tensor parallel.
- zero1_parallel_size (int): Size of zero1 parallel.
- """
-
- def __init__(self, *args, **kwargs):
- super().__init__(*args, **kwargs)
- self.rank_num_per_dp_group = self.world_size // self.data_parallel_size
-
- assert self.world_size % self.data_parallel_size == 0
-
- def init_dist_group(self, use_cpu: bool = False):
- """Initialize data parallel groups, and assign local_ranks and groups to each gpu.
-
- Returns:
- Tuple (local_rank, group_world_size, process_group, ranks_in_group, mode):
- A Data parallelism's information tuple.
- """
- local_rank = None
- ranks_in_group = None
- process_group = None
- cpu_group = None
- group_world_size = None
- mode = ParallelMode.DATA
-
- for i in range(self.rank_num_per_dp_group):
- ranks = [i + j * self.rank_num_per_dp_group for j in range(self.data_parallel_size)]
- group = dist.new_group(ranks, timeout=LLM_NCCL_TIMEOUT)
- if use_cpu:
- group_cpu = (
- dist.new_group(ranks, backend="gloo", timeout=LLM_NCCL_TIMEOUT)
- if dist.get_backend() != "gloo"
- else group
- )
- else:
- group_cpu = None
-
- if self.rank in ranks:
- local_rank = ranks.index(self.rank)
- group_world_size = len(ranks)
- process_group = group
- cpu_group = group_cpu
- ranks_in_group = ranks
-
- return local_rank, group_world_size, process_group, cpu_group, ranks_in_group, mode
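The grouping rule above places ranks that occupy the same position inside a model-parallel block into one data-parallel group: group i contains ranks i, i + stride, i + 2*stride, ... where stride = world_size // data_parallel_size. A runnable enumeration with hypothetical sizes makes the pattern visible:

    # Hypothetical 16-rank job with data_parallel_size = 4 (4 ranks per model replica).
    world_size, data_parallel_size = 16, 4
    rank_num_per_dp_group = world_size // data_parallel_size   # 4

    for i in range(rank_num_per_dp_group):
        ranks = [i + j * rank_num_per_dp_group for j in range(data_parallel_size)]
        print(f"data-parallel group {i}: {ranks}")
    # group 0: [0, 4, 8, 12], group 1: [1, 5, 9, 13], ...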
-
-
-class Initializer_Model(ProcessGroupInitializer):
- """A ProcessGroupInitializer for model parallelism (model parallel group contains pipeline and tensor parallel
- groups).
-
- Args:
- rank (int): The rank of current process.
- world_size (int): Size of whole communication world.
- data_parallel_size (int): Size of data parallel.
- pipeline_parallel_size (int): Size of pipeline parallel.
- tensor_parallel_size (int): Size of tensor parallel.
- zero1_parallel_size (int): Size of zero1 parallel.
- """
-
- def __init__(self, *args, **kwargs):
- super().__init__(*args, **kwargs)
- self.rank_num_per_group = self.tensor_parallel_size * self.pipeline_parallel_size
- self.num_group = self.world_size // self.rank_num_per_group
-
- assert self.world_size % self.rank_num_per_group == 0
-
- def init_dist_group(self, use_cpu: bool = False):
- """Initialize model parallel groups, and assign local_ranks and groups to each gpu.
-
- Returns:
- Tuple (local_rank, group_world_size, process_group, ranks_in_group, mode):
- A Model parallelism's information tuple.
- """
- local_rank = None
- ranks_in_group = None
- process_group = None
- cpu_group = None
- group_world_size = None
- mode = ParallelMode.MODEL
-
- for i in range(self.num_group):
- ranks = [i * self.rank_num_per_group + j for j in range(self.rank_num_per_group)]
- group = dist.new_group(ranks, timeout=LLM_NCCL_TIMEOUT)
- if use_cpu:
- group_cpu = (
- dist.new_group(ranks, backend="gloo", timeout=LLM_NCCL_TIMEOUT)
- if dist.get_backend() != "gloo"
- else group
- )
- else:
- group_cpu = None
-
- if self.rank in ranks:
- local_rank = ranks.index(self.rank)
- group_world_size = len(ranks)
- process_group = group
- cpu_group = group_cpu
- ranks_in_group = ranks
-
- return local_rank, group_world_size, process_group, cpu_group, ranks_in_group, mode
-
-
-class Initializer_Pipeline(ProcessGroupInitializer):
- """A ProcessGroupInitializer for pipeline parallelism.
-
- Args:
- rank (int): The rank of current process
- world_size (int): Size of whole communication world
- data_parallel_size (int): Size of data parallel
- pipeline_parallel_size (int): Size of pipeline parallel
- tensor_parallel_size (int): Size of tensor parallel
- zero1_parallel_size (int): Size of zero1 parallel.
- """
-
- def __init__(self, *args, **kwargs):
- super().__init__(*args, **kwargs)
- self.rank_num_per_dp_group = self.world_size // self.data_parallel_size
- self.pipeline_stage_size = self.rank_num_per_dp_group // self.pipeline_parallel_size
-
- assert self.world_size % self.data_parallel_size == 0
- assert self.rank_num_per_dp_group % self.pipeline_parallel_size == 0
-
- def init_dist_group(self, use_cpu: bool = False):
- """Initialize pipeline parallel groups, and assign local_ranks and groups to each gpu.
-
- Returns:
- List[Tuple (local_rank, group_world_size, process_group, ranks_in_group, mode)]:
- A Pipeline parallelism's information in list of tuples.
- """
- local_rank = None
- ranks_in_group = None
- process_group = None
- cpu_group = None
- group_world_size = None
- mode = ParallelMode.PIPELINE
-
- for i in range(self.data_parallel_size):
- for j in range(self.pipeline_stage_size):
- ranks = list(
- range(
- i * self.rank_num_per_dp_group + j,
- (i + 1) * self.rank_num_per_dp_group,
- self.pipeline_stage_size,
- )
- )
- pipe_group_size = len(ranks)
- pipe_group = dist.new_group(ranks, timeout=LLM_NCCL_TIMEOUT)
- if use_cpu:
- group_cpu = (
- dist.new_group(ranks, backend="gloo", timeout=LLM_NCCL_TIMEOUT)
- if dist.get_backend() != "gloo"
- else pipe_group
- )
- else:
- group_cpu = None
-
- if self.rank in ranks:
- local_rank = ranks.index(self.rank)
- group_world_size = pipe_group_size
- process_group = pipe_group
- cpu_group = group_cpu
- ranks_in_group = ranks
-
- return local_rank, group_world_size, process_group, cpu_group, ranks_in_group, mode
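Within each data-parallel block, pipeline groups are strided slices: group j of block i takes every pipeline_stage_size-th rank starting at i * rank_num_per_dp_group + j. A runnable enumeration with hypothetical sizes:

    # Hypothetical 16-rank job: data parallel 2, pipeline parallel 2 (tensor parallel 4).
    world_size, data_parallel_size, pipeline_parallel_size = 16, 2, 2
    rank_num_per_dp_group = world_size // data_parallel_size                # 8
    pipeline_stage_size = rank_num_per_dp_group // pipeline_parallel_size   # 4

    for i in range(data_parallel_size):
        for j in range(pipeline_stage_size):
            ranks = list(range(i * rank_num_per_dp_group + j,
                               (i + 1) * rank_num_per_dp_group,
                               pipeline_stage_size))
            print(f"pipeline group (block {i}, offset {j}): {ranks}")
    # (block 0, offset 0): [0, 4]  (block 0, offset 1): [1, 5] ... (block 1, offset 3): [11, 15]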
-
-
-class Initializer_Tensor(ProcessGroupInitializer):
- """A ProcessGroupInitializer for tensor parallelism.
-
- Args:
- rank (int): The rank of current process.
- world_size (int): Size of whole communication world.
- data_parallel_size (int): Size of data parallel.
- pipeline_parallel_size (int): Size of pipeline parallel.
- tensor_parallel_size (int): Size of tensor parallel.
- zero1_parallel_size (int): Size of zero1 parallel.
- """
-
- def __init__(self, *args, **kwargs):
- super().__init__(*args, **kwargs)
- self.num_tensor_parallel_group = self.world_size // self.tensor_parallel_size
-
- assert self.world_size % self.tensor_parallel_size == 0
-
- def init_dist_group(self, use_cpu: bool = False):
- """Initialize tensor parallel groups, and assign local_ranks and groups to each gpu.
-
- Returns:
- Tuple (local_rank, group_world_size, process_group, ranks_in_group, mode):
- A Tensor parallelism's information tuple.
- """
- local_rank = None
- ranks_in_group = None
- process_group = None
- cpu_group = None
- group_world_size = None
- mode = ParallelMode.TENSOR
-
- for i in range(self.num_tensor_parallel_group):
- ranks = [i * self.tensor_parallel_size + j for j in range(self.tensor_parallel_size)]
- group = dist.new_group(ranks, timeout=LLM_NCCL_TIMEOUT)
- if use_cpu:
- group_cpu = (
- dist.new_group(ranks, backend="gloo", timeout=LLM_NCCL_TIMEOUT)
- if dist.get_backend() != "gloo"
- else group
- )
- else:
- group_cpu = None
-
- if self.rank in ranks:
- local_rank = ranks.index(self.rank)
- group_world_size = len(ranks)
- process_group = group
- cpu_group = group_cpu
- ranks_in_group = ranks
-
- return local_rank, group_world_size, process_group, cpu_group, ranks_in_group, mode
-
-
-class Initializer_Zero1(ProcessGroupInitializer):
- """A ProcessGroupInitializer for zero-1 parallelism.
-
- Args:
- rank (int): The rank of current process.
- world_size (int): Size of whole communication world.
- data_parallel_size (int): Size of data parallel.
- pipeline_parallel_size (int): Size of pipeline parallel.
- tensor_parallel_size (int): Size of tensor parallel.
- zero1_parallel_size (int): Size of zero-1 parallel.
- """
-
- def __init__(self, *args, **kwargs):
- super().__init__(*args, **kwargs)
- self.rank_num_per_dp_group = self.world_size // self.data_parallel_size
- self.num_zero1_parallel_group = self.data_parallel_size // self.zero1_parallel_size
-
- assert self.world_size % self.data_parallel_size == 0
- assert self.world_size % self.zero1_parallel_size == 0
-
- def init_dist_group(self, use_cpu: bool = False):
- """Initialize zero1 parallel groups, and assign local_ranks and groups to each gpu.
-
- Returns:
- Tuple (local_rank, group_world_size, process_group, ranks_in_group, mode):
- A zero1 parallelism's information tuple.
- """
- local_rank = None
- ranks_in_group = None
- process_group = None
- cpu_group = None
- group_world_size = None
- mode = ParallelMode.ZERO1
-
- for i in range(self.rank_num_per_dp_group):
- for j in range(self.num_zero1_parallel_group):
- ranks = [
- i + (j * self.zero1_parallel_size + k) * self.rank_num_per_dp_group
- for k in range(self.zero1_parallel_size)
- ]
- group = dist.new_group(ranks, timeout=LLM_NCCL_TIMEOUT)
- if use_cpu:
- group_cpu = (
- dist.new_group(ranks, backend="gloo", timeout=LLM_NCCL_TIMEOUT)
- if dist.get_backend() != "gloo"
- else group
- )
- else:
- group_cpu = None
-
- if self.rank in ranks:
- local_rank = ranks.index(self.rank)
- group_world_size = len(ranks)
- process_group = group
- cpu_group = group_cpu
- ranks_in_group = ranks
-
- return local_rank, group_world_size, process_group, cpu_group, ranks_in_group, mode
-
-
-class Initializer_Nettest(ProcessGroupInitializer):
- """A ProcessGroupInitializer for network test, especailly for NCCL.
-
- Args:
- rank (int): The rank of current process.
- world_size (int): Size of whole communication world.
- nettest_parallel_size (int): Size of a network test group.
- """
-
- def __init__(self, *args, **kwargs):
- super().__init__(*args, **kwargs)
- self.num_nettest_group = math.ceil(self.world_size / self.nettest_parallel_size)
-
- def init_dist_group(self, use_cpu: bool = False):
- """Initialize tensor parallel groups, and assign local_ranks and groups to each gpu.
-
- Returns:
- Tuple (local_rank, group_world_size, process_group, ranks_in_group, mode):
- A network test group's information tuple.
- """
- local_rank = None
- ranks_in_group = None
- process_group = None
- cpu_group = None
- group_world_size = None
- mode = ParallelMode.NETTEST
-
- for i in range(self.num_nettest_group):
- ranks = []
- for j in range(self.nettest_parallel_size):
- rank = i * self.nettest_parallel_size + j
- if rank < self.world_size:
- ranks.append(rank)
- group = dist.new_group(ranks, timeout=LLM_NCCL_TIMEOUT)
- if use_cpu:
- group_cpu = (
- dist.new_group(ranks, backend="gloo", timeout=LLM_NCCL_TIMEOUT)
- if dist.get_backend() != "gloo"
- else group
- )
- else:
- group_cpu = None
-
- if self.rank in ranks:
- local_rank = ranks.index(self.rank)
- group_world_size = len(ranks)
- process_group = group
- cpu_group = group_cpu
- ranks_in_group = ranks
-
- return local_rank, group_world_size, process_group, cpu_group, ranks_in_group, mode
diff --git a/internlm/core/context/random.py b/internlm/core/context/random.py
deleted file mode 100644
index b2c0a1d..0000000
--- a/internlm/core/context/random.py
+++ /dev/null
@@ -1,131 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-# adopted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/context
-
-from contextlib import contextmanager
-
-import torch
-import torch.cuda
-from torch import Tensor
-
-from .process_group_initializer import ParallelMode
-
-
-class SeedManager:
- """This class is a manager of all random seeds involved in the system."""
-
- def __init__(self):
- self._current_mode = None
- self._seeds = {}
- self._seed_states = {}
-
- @property
- def current_mode(self):
- return self._current_mode
-
- @property
- def seeds(self):
- return self._seeds
-
- @property
- def seed_states(self):
- return self._seed_states
-
- def set_state(self, parallel_mode: ParallelMode, state: Tensor):
- """Sets the state of the seed manager for `parallel_mode`."""
- assert parallel_mode in self._seed_states, f"{parallel_mode} not found in seed manager"
- self._seed_states[parallel_mode] = state
-
- def set_mode(self, parallel_mode: ParallelMode):
- """Sets the current mode of the seed manager."""
- if self.current_mode:
- # save state for current mode
- self._seed_states[self._current_mode] = torch.cuda.get_rng_state()
-
- # set new state for new mode
- self._current_mode = parallel_mode
- torch.cuda.set_rng_state(self._seed_states[parallel_mode])
-
- def add_seed(self, parallel_mode: ParallelMode, seed: int, overwrite: bool = False):
- """Adds a seed to the seed manager for `parallel_mode`."""
- assert isinstance(parallel_mode, ParallelMode), "Invalid ParallelMode"
- if not overwrite:
- assert parallel_mode not in self._seed_states, f"Seed for {parallel_mode} exists"
- elif parallel_mode in self._seed_states:
- print(f"Warning: {parallel_mode} seed overwritten.", flush=True)
-
- current_state = torch.cuda.get_rng_state()
- torch.cuda.manual_seed(seed)
- self._seed_states[parallel_mode] = torch.cuda.get_rng_state()
- self._seeds[parallel_mode] = seed
- torch.cuda.set_rng_state(current_state)
-
- def reset(self):
- self._current_mode = None
- self._seeds = {}
- self._seed_states = {}
-
-
-_SEED_MANAGER = SeedManager()
-
-
-def get_seeds():
- """Returns the seeds of the seed manager.
- Returns:
- dict: The seeds of the seed manager.
- """
- return _SEED_MANAGER.seeds
-
-
-def get_states(copy=False):
- """Returns the seed states of the seed manager.
- Returns:
- dict: The seed states of the seed manager.
- """
- states = _SEED_MANAGER.seed_states
- if copy:
- new_states = dict()
- for parallel_mode, state in states.items():
- new_states[parallel_mode] = state.clone()
- return new_states
- else:
- return _SEED_MANAGER.seed_states
-
-
-def get_current_mode():
- """Returns the current mode of the seed manager.
- Returns:
- :class:`ParallelMode`: The current mode of the seed manager.
- """
- return _SEED_MANAGER.current_mode
-
-
-def add_seed(parallel_mode: ParallelMode, seed: int, overwrite: bool = False):
- """Adds a seed to the seed manager for `parallel_mode`."""
- _SEED_MANAGER.add_seed(parallel_mode, seed, overwrite)
-
-
-def set_mode(parallel_mode: ParallelMode):
- """Sets the current mode of the seed manager."""
- _SEED_MANAGER.set_mode(parallel_mode)
-
-
-def set_seed_states(parallel_mode: ParallelMode, state: Tensor):
- """Sets the state of the seed manager for `parallel_mode`."""
- _SEED_MANAGER.set_state(parallel_mode, state)
-
-
-def sync_states():
- current_mode = get_current_mode()
- current_states = torch.cuda.get_rng_state()
- set_seed_states(current_mode, current_states)
-
-
-@contextmanager
-def seed(parallel_mode: ParallelMode):
- """A context for seed switch"""
- current_mode = _SEED_MANAGER.current_mode
- try:
- yield _SEED_MANAGER.set_mode(parallel_mode)
- finally:
- _SEED_MANAGER.set_mode(current_mode)
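The module-level helpers are the intended entry points: register one CUDA RNG stream per parallel mode with add_seed, switch the active stream with set_mode, and use the seed context manager for a temporary switch that always restores the previous mode. A hedged usage sketch; it assumes these helpers and ParallelMode are imported, CUDA is available, and the seed values and `tp_rank` are placeholders:

    # Hedged sketch: per-parallel-mode RNG streams managed by the SeedManager.
    add_seed(ParallelMode.DATA, 1024)              # stream shared within a data-parallel group
    add_seed(ParallelMode.TENSOR, 1024 + tp_rank)  # per-rank stream; tp_rank is a placeholder
    set_mode(ParallelMode.DATA)                    # make the data-parallel stream the default

    with seed(ParallelMode.TENSOR):
        # random draws here (e.g. dropout masks) differ across tensor-parallel ranks ...
        mask = torch.rand(4, device="cuda")
    # ... and the previous (DATA) RNG state is restored on exit.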
diff --git a/internlm/core/engine.py b/internlm/core/engine.py
deleted file mode 100644
index a372b9e..0000000
--- a/internlm/core/engine.py
+++ /dev/null
@@ -1,190 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-# adopted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/engine
-
-from typing import List, Optional
-
-import torch
-from torch.nn import Module
-from torch.nn.modules.loss import _Loss
-from torch.optim.lr_scheduler import _LRScheduler
-
-from internlm.core.gradient_handler import BaseGradientHandler
-from internlm.solver.beta2_scheduler import Beta2Scheduler
-from internlm.solver.optimizer.hybrid_zero_optim import BaseOptimizer
-from internlm.utils.common import get_batch_size, move_to_device
-
-
-class Engine:
- """
- The Engine class is responsible for managing the training and evaluation process of a neural network model.
- It handles the forward and backward passes, parameter updates, gradient handling, and mode switching between
- training and evaluation.
-
- Args:
- model (torch.nn.Module): The neural network model to be trained or evaluated.
- optimizer (BaseOptimizer): The optimizer used for updating the parameters of the model.
- lr_scheduler (torch.optim.lr_scheduler._LRScheduler, optional): The learning rate scheduler for the optimizer.
- Default is None.
- beta2_scheduler (internlm.solver.beta2_scheduler.Beta2Scheduler, optional): The beta2 scheduler for the
- optimizer. Default is None.
- criterion (torch.nn.modules.loss._Loss, optional): The loss function used for calculating the loss during
- training. Default is None.
- gradient_handlers (List[BaseGradientHandler], optional): A list of gradient handlers used in the backward pass.
- Default is None.
- clip_grad_norm (float, optional): The norm value for gradient clipping. Default is 0.0.
-
- Examples:
- >>> # define model, criterion, optimizer, lr_scheduler, train_dataloader for your training
- >>> model = ...
- >>> criterion = ...
- >>> optimizer = ...
- >>> train_dataloader = ...
- >>> engine, _, _, _ = internlm.initialize_engine(model, optimizer, criterion)
- >>> engine.train()
- >>> for inputs, labels in train_dataloader:
- >>> # set gradients to zero
- >>> engine.zero_grad()
- >>> # run forward pass
- >>> outputs = engine(inputs)
- >>> # compute loss value and run backward pass
- >>> loss = engine.criterion(outputs, labels)
- >>> engine.backward(loss)
- >>> # update parameters
- >>> engine.step()
- """
-
- def __init__(
- self,
- model: Module,
- optimizer: BaseOptimizer,
- lr_scheduler: Optional[_LRScheduler] = None,
- beta2_scheduler: Optional[Beta2Scheduler] = None,
- criterion: Optional[_Loss] = None,
- gradient_handlers: Optional[List[BaseGradientHandler]] = None,
- clip_grad_norm: float = 0.0,
- ):
- self._model = model
- self._optimizer = optimizer
- self._lr_scheduler = lr_scheduler
- self._beta2_scheduler = beta2_scheduler
- self._criterion = criterion
- self._clip_grad_norm = clip_grad_norm
-
- # state
- self.training = True # default
-
- # build gradient handler
- self._gradient_handlers = gradient_handlers if gradient_handlers else []
-
- @property
- def model(self):
- """Returns the model attached to the engine."""
- return self._model
-
- @property
- def optimizer(self):
- """Returns the optimizer attached to the engine."""
- return self._optimizer
-
- @property
- def criterion(self):
- """Returns the criterion (loss function) attached to the engine."""
- return self._criterion
-
- def _all_reduce_gradients(self):
- """Handles all-reduce operations of gradients across different parallel groups."""
- for handler in self._gradient_handlers:
- handler.handle_gradient()
-
- def zero_grad(self):
- """Sets the gradient of all parameters in the model to zero."""
- self.optimizer.zero_grad()
-
- def step(self):
- """
- Executes the parameter update step. This includes all-reduce operations of gradients, gradient clipping,
- and parameter update. If successful, it also steps the learning rate scheduler and beta2 scheduler
- if they exist.
-
- Returns:
- success (bool): Whether the parameter update was successful.
- grad_norm (float): The norm of the gradient after clipping.
- """
- self._all_reduce_gradients()
- self.optimizer.clip_grad_norm(self.model, self._clip_grad_norm)
-
- success, grad_norm = self.optimizer.step()
-
- if success and self._lr_scheduler is not None:
- self._lr_scheduler.step()
-
- if success and self._beta2_scheduler is not None:
- self._beta2_scheduler.step()
-
- return success, grad_norm
-
- def train(self):
- """Sets the model to training mode."""
- self.training = True
- self._model.train()
-
- def eval(self):
- """Sets the model to evaluation mode."""
- self.training = False
- self._model.eval()
-
- def backward(self, loss: torch.Tensor):
- """
- Starts the backward propagation given the loss value computed by a loss function.
-
- Args:
- loss (torch.Tensor): The loss value computed by a loss function.
- """
- return self.optimizer.backward(loss)
-
- def backward_by_grad(self, tensor, grad):
- """
- Starts the backward propagation given the gradient of the output tensor.
-
- Args:
- tensor (torch.Tensor): The output tensor.
- grad (torch.Tensor): The gradient passed back to the output tensor.
- """
- return self.optimizer.backward_by_grad(tensor, grad)
-
- def __call__(self, *args, **kwargs):
- """
- Runs the forward step for the model.
-
- Returns:
- torch.Tensor: The output of the model.
- """
- return self.model(*args, **kwargs)
-
- def load_batch(self, data_iter, to_gpu=True):
- """
-        Loads a batch from the data iterator and, if requested, moves it to the same
-        device as the model. It returns the batch data together with its batch size.
-
- Args:
- data_iter (Iterable): The data iterator from which to get a batch of data, obtained by calling
- iter(dataloader).
- to_gpu (bool, optional): Whether the data should be moved to the GPU. Default is True.
-
- Returns:
-            Tuple[Any, int]: A tuple of (batch_data, batch_size).
- """
- if data_iter is None:
- raise RuntimeError("Dataloader is not defined.")
- try:
- batch_data = next(data_iter)
- except TypeError:
- batch_data = data_iter
-
- if to_gpu:
- batch_data = move_to_device(batch_data)
- batch_size = get_batch_size(batch_data)
-
- return batch_data, batch_size
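
The engine above fixes a simple per-batch call order: zero_grad, forward via __call__, criterion, backward, then step (which all-reduces, clips, updates the parameters, and advances the schedulers). As a quick illustration, here is a minimal sketch of one training epoch written against that interface; `engine` and `train_dataloader` are assumed to already exist and are not defined in this diff.

    def train_one_epoch(engine, train_dataloader):
        # Illustrative loop only: `engine` is an already-constructed internlm Engine,
        # and `train_dataloader` yields (data, label) pairs.
        engine.train()  # put the wrapped model into training mode
        for data, label in train_dataloader:
            engine.zero_grad()                      # clear gradients from the previous step
            output = engine(data)                   # forward pass via Engine.__call__
            loss = engine.criterion(output, label)  # loss from the attached criterion
            engine.backward(loss)                   # backward pass through the optimizer wrapper
            success, grad_norm = engine.step()      # all-reduce, clip, update, step schedulers
            if not success:
                print(f"parameter update skipped, grad norm = {grad_norm}")
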
diff --git a/internlm/core/gradient_handler.py b/internlm/core/gradient_handler.py
deleted file mode 100644
index f2aaa1d..0000000
--- a/internlm/core/gradient_handler.py
+++ /dev/null
@@ -1,76 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-from abc import ABC, abstractmethod
-from collections import defaultdict
-
-import torch
-import torch.distributed as dist
-from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
-
-from internlm.core.context import global_context as gpc
-
-
-class BaseGradientHandler(ABC):
- """A basic helper class to handle all-reduce operations of gradients across different parallel groups
- before optimization.
-
- Args:
- model (Module): Model where the gradients accumulate.
- optimizer (Optimizer): Optimizer for updating the parameters.
- """
-
- def __init__(self, model, optimizer):
- self._model = model
- self._optimizer = optimizer
-
- @abstractmethod
- def handle_gradient(self):
- """A method to accumulate gradients across different parallel groups. Users should
- write their own functions or just use the functions in pre-defined subclasses.
- """
- pass
-
-
-class PipelineSharedModuleGradientHandler(BaseGradientHandler):
- """A helper class to handle all-reduce operations in sub parallel groups.
-    An all-reduce collective communication is performed in
-    :func:`handle_gradient` among all sub pipeline parallel groups.
-    For better performance, it bucketizes the gradients of all parameters of
-    the same dtype to improve the efficiency of communication.
-
- Args:
- model (Module): Model where the gradients accumulate.
- optimizer (Optimizer): Optimizer for updating the parameters.
- """
-
- def handle_gradient(self):
-        """Runs an all-reduce operation on shared-module gradients within sub pipeline parallel groups."""
- if gpc.pipeline_parallel_size > 1:
- # bucketize and all-reduce
- buckets = defaultdict(lambda: defaultdict(list))
- # Pack the buckets.
- for param in self._model.parameters():
- group = getattr(param, "pipeline_shared_module_pg", None)
- if (
- param.requires_grad
- and group is not None
- and (
- (hasattr(param, "colo_attr") and not param.colo_attr.saved_grad.is_null())
- or param.grad is not None
- )
- ):
- tp = param.data.type()
- buckets[group][tp].append(param)
-
- # For each bucket, all-reduce and copy all-reduced grads.
- for group, group_buckets in buckets.items():
- for tp, bucket in group_buckets.items():
- grads = [
- param.colo_attr.grad_payload if hasattr(param, "colo_attr") else param.grad.data
- for param in bucket
- ]
- coalesced = _flatten_dense_tensors(grads).to(torch.cuda.current_device())
- dist.all_reduce(coalesced, op=dist.ReduceOp.SUM, group=group)
- for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)):
- buf.copy_(synced)
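
The handler above relies on a generic bucketing pattern: group gradients by tensor type, flatten each bucket into one contiguous buffer, all-reduce the buffer once, then scatter the reduced values back. The standalone sketch below shows that pattern with plain torch.distributed; it assumes a process group has already been initialized and is not taken verbatim from the repository.

    from collections import defaultdict

    import torch.distributed as dist
    from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors


    def allreduce_grads_bucketed(params, group=None):
        """Bucket gradients by tensor type and all-reduce each bucket with one collective."""
        buckets = defaultdict(list)
        for p in params:
            if p.requires_grad and p.grad is not None:
                buckets[p.data.type()].append(p.grad.data)

        for grads in buckets.values():
            coalesced = _flatten_dense_tensors(grads)  # one contiguous buffer per bucket
            dist.all_reduce(coalesced, op=dist.ReduceOp.SUM, group=group)
            for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)):
                buf.copy_(synced)  # write the reduced values back in place
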
diff --git a/internlm/core/naive_amp.py b/internlm/core/naive_amp.py
deleted file mode 100644
index 7470659..0000000
--- a/internlm/core/naive_amp.py
+++ /dev/null
@@ -1,136 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-# adopted from https://github.com/hpcaitech/ColossalAI/tree/main/colossalai/amp
-
-from typing import Any
-
-import torch
-import torch.distributed as dist
-from torch import Tensor, nn
-from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
-from torch.distributed import ReduceOp
-
-from internlm.core.context import ParallelMode
-from internlm.core.context.parallel_context import global_context as gpc
-
-
-class NaiveAMPModel(nn.Module):
- """
- This is a wrapper class for a model that automatically casts the model, its inputs, and outputs into fp16.
- It also provides options to cast the output back to fp32 and to synchronize buffers.
-
- Args:
- model (torch.nn.Module): The model to be wrapped and cast into fp16.
- output_to_fp32 (bool, optional): If True, the output of this module is cast into fp32. Defaults to True.
- parallel_mode (:class:`internlm.core.context.ParallelMode`): The parallel group mode used in this module.
- Defaults to ``ParallelMode.DATA``.
- sync_buffer (bool, optional): If True, the buffers are synchronized. Defaults to True.
- """
-
- def __init__(
- self,
- model: nn.Module,
- output_to_fp32: bool = True,
- parallel_mode: ParallelMode = ParallelMode.DATA,
- sync_buffer: bool = True,
- dtype=torch.float16,
- ):
- super().__init__()
- self.model = model.to(dtype)
- self._output_to_fp32 = output_to_fp32
- self._sync_buf = sync_buffer
- self.dtype = dtype
-
- if gpc.is_initialized(parallel_mode) and gpc.get_world_size(parallel_mode) > 1:
- self._process_group = gpc.get_group(parallel_mode)
- self._world_size = gpc.get_world_size(parallel_mode)
- else:
- self._process_group = None
- self._world_size = 1
- self._sync_buf = False
- self._first_eval_run = False
-
- @property
- def sync_buffer(self):
- """Returns the current state of the buffer synchronization."""
- return self._sync_buf
-
- @sync_buffer.setter
- def sync_buffer(self, state: bool):
- """Sets the state of the buffer synchronization."""
- self._sync_buf = state
-
- def _convert_to_fp16(self, input_: Any):
- """Converts the input to fp16 if it is a Tensor of dtype float32."""
- if isinstance(input_, Tensor) and input_.dtype == torch.float32:
- input_ = input_.to(self.dtype)
- return input_
-
- def _convert_to_fp32(self, input_: Any):
- """Converts the input to fp32 if it is a Tensor of dtype float16."""
- if isinstance(input_, Tensor) and input_.dtype == torch.float16:
- input_ = input_.float()
- return input_
-
- def convert_to_fp32(self, out):
- """Converts the output to fp32"""
- if isinstance(out, Tensor):
- out = self._convert_to_fp32(out)
- elif isinstance(out, (tuple, list)):
- out = [self._convert_to_fp32(val) for val in out]
- elif isinstance(out, dict):
- out = {key: self._convert_to_fp32(val) for key, val in out.items()}
-
- return out
-
- def _reduce_module_buffer(self):
- """
- All-reduces the buffers (e.g., running stats of batch normalization) across
- data parallel ranks so that all the ranks will produce consistent results
- when given the same input.
- """
- buf_list = []
-
- # find valid buffers
- for buf in self.model.buffers():
- if buf is not None:
- buf_list.append(buf)
-
- # reduce buffers across data parallel ranks
- if buf_list:
- coalesced_buf = _flatten_dense_tensors(buf_list)
- coalesced_buf.div_(self._world_size)
- dist.all_reduce(coalesced_buf, op=ReduceOp.SUM, group=self._process_group)
- unflattened_buf_list = _unflatten_dense_tensors(coalesced_buf, buf_list)
- for old, new in zip(buf_list, unflattened_buf_list):
- old.copy_(new)
-
- def eval(self):
- """Sets the model to evaluation mode. Buffers are only synchronized in the first eval iteration."""
- self.model.eval()
- self._first_eval_run = True
-
- def forward(self, *args, **kwargs):
- """
- Performs a forward pass on the model. Buffers are synchronized before the forward pass.
- The inputs are converted to fp16 and the outputs are optionally converted back to fp32.
- """
- if (self.training or self._first_eval_run) and self._sync_buf:
- with torch.no_grad():
- self._reduce_module_buffer()
-
- if self._first_eval_run:
- self._first_eval_run = False
-
- if args:
- args = [self._convert_to_fp16(arg) for arg in args]
- if kwargs:
- for k, v in kwargs.items():
- kwargs[k] = self._convert_to_fp16(v)
-
- out = self.model(*args, **kwargs)
-
- if self._output_to_fp32:
- out = self.convert_to_fp32(out)
- return out
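
Stripped of the buffer synchronization and parallel-context handling, the core of the wrapper above is recursive dtype casting around the wrapped module's forward: float32 inputs go in as fp16, and outputs come back as fp32. The sketch below isolates just that casting; `HalfPrecisionWrapper` is a hypothetical name used only for illustration.

    import torch
    from torch import Tensor, nn


    class HalfPrecisionWrapper(nn.Module):
        """Illustrative fp16-in / fp32-out wrapper, loosely mirroring NaiveAMPModel.forward."""

        def __init__(self, model: nn.Module, dtype: torch.dtype = torch.float16):
            super().__init__()
            self.model = model.to(dtype)
            self.dtype = dtype

        def _cast(self, value, dtype):
            # Only floating-point tensors are converted; everything else passes through.
            return value.to(dtype) if isinstance(value, Tensor) and value.is_floating_point() else value

        def forward(self, *args, **kwargs):
            args = [self._cast(a, self.dtype) for a in args]                    # inputs -> fp16
            kwargs = {k: self._cast(v, self.dtype) for k, v in kwargs.items()}
            out = self.model(*args, **kwargs)
            return self._cast(out, torch.float32)                               # outputs -> fp32
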
diff --git a/internlm/core/scheduler/__init__.py b/internlm/core/scheduler/__init__.py
deleted file mode 100644
index a9bf013..0000000
--- a/internlm/core/scheduler/__init__.py
+++ /dev/null
@@ -1,12 +0,0 @@
-from .base_scheduler import BaseScheduler, SchedulerHook, SchedulerMetricHook
-from .no_pipeline_scheduler import NonPipelineScheduler
-from .pipeline_scheduler import InterleavedPipelineScheduler, PipelineScheduler
-
-__all__ = [
- "BaseScheduler",
- "NonPipelineScheduler",
- "InterleavedPipelineScheduler",
- "PipelineScheduler",
- "SchedulerHook",
- "SchedulerMetricHook",
-]
diff --git a/internlm/core/scheduler/base_scheduler.py b/internlm/core/scheduler/base_scheduler.py
deleted file mode 100644
index 20b4460..0000000
--- a/internlm/core/scheduler/base_scheduler.py
+++ /dev/null
@@ -1,187 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-# adopted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/engine
-
-from abc import ABC, abstractmethod
-from typing import Any, Callable, Iterable, Optional
-
-import torch
-
-from internlm.core.engine import Engine
-from internlm.utils.megatron_timers import megatron_timer as timer
-
-
-class BaseScheduler(ABC):
- """A basic helper class to control the process of training or evaluation.
-    It mainly consists of forward_backward_step for the forward/backward pass and
-    optimizer_step for the parameter update.
-    For convenience when enabling FP16, all code that controls FP16 is aggregated
-    in the schedule classes.
-
- Args:
- data_process_func (Callable, optional): The preprocessing function which receives a batch of data and arranges
- them into data and label.
- """
-
- def __init__(self, data_process_func: Callable = None):
- self.data_process_func = data_process_func
-
- @abstractmethod
- def pre_processing(self, engine: Engine):
- """To perform actions before running the schedule.
-
- Args:
- engine (internlm.core.Engine): InternLM engine for training and inference.
- """
- pass
-
- def _load_micro_batch(self, data, label, offset, micro_bsz):
- assert isinstance(data, dict) and isinstance(label, torch.Tensor)
- micro_batch_data = {k: v[offset : offset + micro_bsz] for k, v in data.items()}
- micro_batch_label = label[offset : offset + micro_bsz]
-
- return micro_batch_data, micro_batch_label
-
- @abstractmethod
- def forward_backward_step(
- self,
- engine: Engine,
- data_iter: Iterable,
- forward_only: bool,
- return_loss: bool = True,
- return_output_label: bool = True,
- ):
- """The process function over a batch of dataset for training or evaluation.
-
- Args:
- engine (internlm.core.Engine): InternLM engine for training and inference.
- data_iter (Iterable): Data iterator from which get a batch of data, obtained by calling iter(dataloader).
- forward_only (bool): If True, the process won't include backward.
- return_loss (bool, optional): If False, the loss won't be returned.
- return_output_label (bool, optional): If False, the output and label won't be returned.
- """
- pass
-
- @staticmethod
- def _call_engine(engine: Engine, inputs: Any):
- """Calls the engine with the given inputs.
-
- Args:
- engine (internlm.core.Engine): InternLM engine for training and inference.
- inputs (Any): The inputs to the engine, can be of type torch.Tensor, list, tuple, or dict.
- """
- if isinstance(inputs, torch.Tensor):
- return engine(inputs)
- elif isinstance(inputs, (list, tuple)):
- return engine(*inputs)
- elif isinstance(inputs, dict):
- return engine(**inputs)
- else:
- raise TypeError(
- f"Expected engine inputs to be of type torch.Tensor, list, tuple, or dict, but got {type(inputs)}"
- )
-
- @staticmethod
- def _call_engine_criterion(engine: Engine, outputs: Any, labels: Any):
- """Calls the engine's criterion with the given outputs and labels.
-
- Args:
- engine (internlm.core.Engine): InternLM engine for training and inference.
- outputs (Any): The outputs from the model, can be of type torch.Tensor, list, tuple, or dict.
- labels (Any): The labels for the outputs, can be of type torch.Tensor, list, tuple, or dict.
- """
- assert isinstance(
- outputs, (torch.Tensor, list, tuple, dict)
-        ), f"Expected model outputs to be of type torch.Tensor, list, tuple, or dict, but got {type(outputs)}"
- if isinstance(outputs, torch.Tensor):
- outputs = (outputs,)
- if isinstance(labels, torch.Tensor):
- labels = (labels,)
-
- if isinstance(outputs, (tuple, list)) and isinstance(labels, (tuple, list)):
- return engine.criterion(*outputs, *labels)
- elif isinstance(outputs, (tuple, list)) and isinstance(labels, dict):
- return engine.criterion(*outputs, **labels)
- elif isinstance(outputs, dict) and isinstance(labels, dict):
- return engine.criterion(**outputs, **labels)
- elif isinstance(outputs, dict) and isinstance(labels, (list, tuple)):
- raise ValueError(f"Expected labels to be a dict when the model outputs are dict, but got {type(labels)}")
- else:
- raise TypeError(
-                f"Expected model outputs and labels to be of type torch.Tensor "
-                f"(which is auto-converted to tuple), list, tuple, or dict, "
-                f"but got {type(outputs)} (model outputs) and {type(labels)} (labels)"
- )
-
-
-class SchedulerHook(ABC):
- """
- Scheduler Hook.
- """
-
- @abstractmethod
- def before_forward(self, scheduler, inputs) -> None:
- """Actions before forward"""
-
- @abstractmethod
- def after_forward(self, scheduler, outputs) -> None:
- """Actions after forward"""
-
- @abstractmethod
- def before_criterion(self, scheduler, outputs, label) -> None:
- """Actions before criterion"""
-
- @abstractmethod
- def after_criterion(self, scheduler, loss) -> None:
- """Actions after criterion"""
-
- @abstractmethod
- def before_backward(self, scheduler, outputs, outputs_grad) -> None:
- """Actions before backward"""
-
- @abstractmethod
- def after_backward(self, scheduler, inputs_grad) -> None:
- """Actions after backward"""
-
- @abstractmethod
- def post_helper_func(self, scheduler, outputs, label) -> None:
- """A post helper function"""
-
-
-class SchedulerMetricHook(SchedulerHook):
- """
- Scheduler Metric Hook.
- """
-
- def __init__(self, metric: Optional[Callable] = None, skip: bool = False) -> None:
- self._post_func = metric
- self._skip = skip
-
- def before_forward(self, scheduler, inputs) -> None:
- if not self._skip:
- timer("fwd").start()
-
- def after_forward(self, scheduler, outputs) -> None:
- if not self._skip:
- timer("fwd").stop()
-
- def before_criterion(self, scheduler, outputs, label) -> None:
- if not self._skip:
- timer("cal_loss").start()
-
- def after_criterion(self, scheduler, loss) -> None:
- if not self._skip:
- timer("cal_loss").stop()
-
- def before_backward(self, scheduler, outputs, outputs_grad) -> None:
- if not self._skip:
- timer("bwd").start()
-
- def after_backward(self, scheduler, inputs_grad) -> None:
- if not self._skip:
- timer("bwd").stop()
-
- def post_helper_func(self, scheduler, outputs, label) -> None:
- if self._post_func is not None:
- self._post_func(outputs, label)
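
SchedulerHook defines seven callback points that bracket the forward, criterion, and backward of every microstep, and SchedulerMetricHook above is one concrete implementation. As a sketch of how a user-defined hook plugs into the same interface, the hypothetical LossLoggingHook below simply records each loss after the criterion call and leaves the other callbacks as no-ops.

    from internlm.core.scheduler import SchedulerHook


    class LossLoggingHook(SchedulerHook):
        """Illustrative hook: record every micro-step loss, do nothing elsewhere."""

        def __init__(self):
            self.losses = []

        def before_forward(self, scheduler, inputs) -> None:
            pass

        def after_forward(self, scheduler, outputs) -> None:
            pass

        def before_criterion(self, scheduler, outputs, label) -> None:
            pass

        def after_criterion(self, scheduler, loss) -> None:
            # Detach so the logged values do not keep the autograd graph alive.
            self.losses.append(loss.detach().float().item())

        def before_backward(self, scheduler, outputs, outputs_grad) -> None:
            pass

        def after_backward(self, scheduler, inputs_grad) -> None:
            pass

        def post_helper_func(self, scheduler, outputs, label) -> None:
            pass
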
diff --git a/internlm/core/scheduler/no_pipeline_scheduler.py b/internlm/core/scheduler/no_pipeline_scheduler.py
deleted file mode 100644
index 1d8b61e..0000000
--- a/internlm/core/scheduler/no_pipeline_scheduler.py
+++ /dev/null
@@ -1,194 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-# adopted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/engine
-
-from typing import Any, Callable, Iterable, List, Optional
-
-import torch
-
-from internlm.core.engine import Engine
-from internlm.utils.common import conditional_context
-from internlm.utils.timeout import llm_timeout
-
-from .base_scheduler import BaseScheduler, SchedulerHook
-
-
-class NonPipelineScheduler(BaseScheduler):
- """A helper schedule class for no pipeline parallelism running environment.
- During one process, it loads a batch of dataset and feeds it to the model.
- After getting the output and calculating the loss, it will use :meth:`step`
- to update the parameters if it is in training mode.
-
- Args:
- data_process_func (Callable, optional): The preprocessing function which receives a batch of data
- and returns a tuple in the form of (data, label), and it will be executed in load_batch.
-        gradient_accumulation_size (int, optional): The number of gradient accumulation steps; set to 1 to
-            disable gradient accumulation.
-
- Examples:
- >>> # this shows an example of customized data_process_func
- >>> def data_process_func(dataloader_output):
- >>> item1, item2, item3 = dataloader_output
- >>> data = (item1, item2)
- >>> label = item3
- >>> return data, label
- """
-
- def __init__(
- self,
- data_process_func: Callable = None,
- gradient_accumulation_size: int = 1,
- scheduler_hooks: Optional[List[SchedulerHook]] = None,
- ):
- self._grad_accum_size = gradient_accumulation_size
- self._grad_accum_offset = 0
-
- self._hooks = scheduler_hooks
-
- super().__init__(data_process_func)
-
- def pre_processing(self, engine: Engine):
- """Performs actions before running the schedule.
-
- Args:
- engine (internlm.core.Engine): InternLM engine for training and inference.
- """
- pass
-
- def _call_hooks(self, func_name: str, *args, **kwargs) -> None:
- for hook in self._hooks:
- getattr(hook, func_name)(self, *args, **kwargs)
-
- def _load_accum_batch(self, data: Any, label: Any):
-        """Loads one micro-batch of data and label for a gradient accumulation step.
-
- Args:
- data (Any): The data to be loaded.
- label (Any): The label to be loaded.
- """
-
- _data, _label = self._load_micro_batch(
- data=data, label=label, offset=self._grad_accum_offset, micro_bsz=self._grad_accum_batch_size
- )
- self._grad_accum_offset += self._grad_accum_batch_size
-
- if self.data_process_func:
- _data["input_ids"] = self.data_process_func(_data["input_ids"], _data["cu_seqlens"])
- _label = self.data_process_func(_label, _data["cu_seqlens"])
- _data.pop("cu_seqlens")
- _data.pop("indexes")
-
- return _data, _label
-
- def _train_one_batch(
- self,
- data: Any,
- label: Any,
- engine: Engine,
- forward_only: bool = False,
- return_loss: bool = True,
- scale_loss: int = 1,
- ):
- """Trains one batch of data.
-
- Args:
- data (Any): The data to be trained.
- label (Any): The label for the data.
- engine (internlm.core.Engine): InternLM engine for training and inference.
- forward_only (bool, optional): If True, the model is run for the forward pass, else back propagation will
- be executed.
- return_loss (bool, optional): Loss will be returned if True.
- scale_loss (int, optional): The scale factor for the loss.
- """
-
- # forward
- with conditional_context(torch.no_grad(), enable=forward_only):
- self._call_hooks("before_forward", data)
- output = self._call_engine(engine, data)
- self._call_hooks("after_forward", output)
-
- self._call_hooks("post_helper_func", output, label)
-
- if return_loss:
- self._call_hooks("before_criterion", output, label)
- loss = self._call_engine_criterion(engine, output, label)
- self._call_hooks("after_criterion", loss)
- loss /= scale_loss
-
- # backward
- if not forward_only:
- self._call_hooks("before_backward", None, None)
- engine.backward(loss)
- self._call_hooks("after_backward", None)
-
- if not return_loss:
- loss = None
-
- return output, loss
-
- @llm_timeout(func_name="nopp_forward_backward_step")
- def forward_backward_step(
- self,
- engine: Engine,
- data_iter: Iterable,
- forward_only: bool = False,
- return_loss: bool = True,
- return_output_label: bool = True,
- ):
- """The process function that loads a batch of dataset and feeds it to the model.
-        The returned labels and loss will be None if :attr:`return_loss` is False.
-
- Args:
- engine (internlm.core.Engine): InternLM engine for training and inference.
- data_iter (Iterable): Dataloader as the form of an iterator, obtained by calling iter(dataloader).
- forward_only (bool, optional):
- If True, the model is run for the forward pass, else back propagation will be executed.
- return_loss (bool, optional): Loss will be returned if True.
- return_output_label (bool, optional): Output and label will be returned if True.
-
- Returns:
- Tuple[:class:`torch.Tensor`]: A tuple of (output, label, loss), loss and label could be None.
- """
- assert (
- forward_only or return_loss
- ), "The argument 'return_loss' has to be True when 'forward_only' is False, but got False."
-
- batch_data, batch_size = engine.load_batch(data_iter)
-
- assert (
- batch_size % self._grad_accum_size == 0
- ), f"batch_size:{batch_size} must be an integer multiple of gradient accumulation steps:{self._grad_accum_size}"
- self._grad_accum_batch_size = batch_size // self._grad_accum_size
-
- data, label = batch_data
-
- loss = 0 if return_loss else None
- outputs = []
- labels = []
-
- # reset accumulation microbatch offset
- self._grad_accum_offset = 0
-
- for _current_accum_step in range(self._grad_accum_size):
- if _current_accum_step == self._grad_accum_size - 1:
- engine.optimizer.skip_grad_reduce = False
- else:
- engine.optimizer.skip_grad_reduce = True
-
- _data, _label = self._load_accum_batch(data, label)
-
- _output, _loss = self._train_one_batch(
- _data, _label, engine, forward_only, return_loss, self._grad_accum_size
- )
-
- if return_loss:
- loss += _loss
- if return_output_label:
- outputs.append(_output)
- labels.append(_label)
-
- if not return_output_label:
- outputs, labels = None, None
-
- return outputs, labels, loss
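
NonPipelineScheduler implements gradient accumulation by slicing each batch into gradient_accumulation_size micro-batches, dividing each micro-batch loss by that size, and letting only the last micro-batch trigger gradient reduction before the optimizer step. The plain-PyTorch sketch below mirrors that control flow without the engine or hook machinery; all names in it are placeholders, not code from the repository.

    def accumulate_step(model, criterion, optimizer, data, label, accum_size):
        """Illustrative gradient accumulation over one batch."""
        assert data.size(0) % accum_size == 0, "batch size must be a multiple of accum_size"
        micro_bsz = data.size(0) // accum_size

        optimizer.zero_grad()
        total_loss = 0.0
        for i in range(accum_size):
            start = i * micro_bsz
            micro_data = data[start : start + micro_bsz]
            micro_label = label[start : start + micro_bsz]
            loss = criterion(model(micro_data), micro_label) / accum_size  # same scaling as scale_loss
            loss.backward()                                                # grads accumulate in .grad
            total_loss += loss.item()
        optimizer.step()                                                   # one update per full batch
        return total_loss
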
diff --git a/internlm/core/scheduler/pipeline_scheduler.py b/internlm/core/scheduler/pipeline_scheduler.py
deleted file mode 100644
index e9b6c64..0000000
--- a/internlm/core/scheduler/pipeline_scheduler.py
+++ /dev/null
@@ -1,1295 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-# adopted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/engine
-
-from contextlib import contextmanager
-from typing import Callable, List, Optional, Tuple, Union
-
-import torch.cuda
-
-import internlm.core.communication as comm
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
-from internlm.core.engine import Engine
-from internlm.core.naive_amp import NaiveAMPModel
-from internlm.utils.common import get_current_device, move_to_device
-from internlm.utils.logger import get_logger
-from internlm.utils.timeout import llm_timeout
-
-from .base_scheduler import BaseScheduler, SchedulerHook
-
-logger = get_logger(__file__)
-
-
-def get_tensor_shape():
- if hasattr(gpc.config, "TENSOR_SHAPE"):
- return gpc.config.TENSOR_SHAPE
-
- if not gpc.is_initialized(ParallelMode.PIPELINE):
- return None
-
- if hasattr(gpc.config, "SEQ_LEN") and hasattr(gpc.config.data, "micro_bsz") and hasattr(gpc.config, "HIDDEN_SIZE"):
- if gpc.config.model.use_flash_attn:
- if gpc.config.parallel.sequence_parallel:
- sequence_world_size = gpc.get_world_size(ParallelMode.TENSOR)
- tensor_shape = (
- gpc.config.SEQ_LEN * gpc.config.data["micro_bsz"] // sequence_world_size,
- gpc.config.HIDDEN_SIZE,
- )
- else:
- tensor_shape = (
- gpc.config.SEQ_LEN * gpc.config.data["micro_bsz"],
- gpc.config.HIDDEN_SIZE,
- )
- else:
- tensor_shape = (
- gpc.config.data["micro_bsz"],
- gpc.config.SEQ_LEN,
- gpc.config.HIDDEN_SIZE,
- )
- return tensor_shape
- else:
- return None
-
-
-def pack_return_tensors(return_tensors):
- output, label = tuple(zip(*return_tensors))
- if isinstance(output[0], torch.Tensor):
- output = torch.cat(output, dim=0)
- elif isinstance(output[0], (list, tuple)):
- output = tuple(torch.cat(tensors, dim=0) for tensors in zip(*output))
- else:
- raise TypeError("Output of model must be tensor or list/tuple of tensors")
- if isinstance(label[0], torch.Tensor):
- label = torch.cat(label, dim=0)
- else:
- merged_label = {k: [] for k in label[0].keys()}
- for d in label:
- for k, v in d.items():
- merged_label[k].append(v)
- label = {k: torch.cat(v, dim=0) for k, v in merged_label.items()}
- return output, label
-
-
-@contextmanager
-def switch_virtual_pipeline_parallel_rank(rank):
- prev_rank = gpc.virtual_pipeline_parallel_rank
- try:
- gpc.set_virtual_pipeline_parallel_rank(rank)
- yield
- finally:
- gpc.set_virtual_pipeline_parallel_rank(prev_rank)
-
-
-@contextmanager
-def switch_optimizer_grad_sync_skip_mode(optimizer, skip: bool = True):
- prev_mode = optimizer.skip_grad_reduce
- try:
- optimizer.skip_grad_reduce = skip
- yield
- finally:
- optimizer.skip_grad_reduce = prev_mode
-
-
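
The two context managers above temporarily change global or optimizer state and restore it on exit. As a usage sketch, skipping gradient synchronization for every micro-batch except the last (the same policy _backward_step applies further below) could look like this; `engine` and `micro_losses` are placeholders for an existing Engine and a list of per-micro-batch losses.

    def backward_all_microbatches(engine, micro_losses):
        # Illustrative only: sync gradients on the last micro-batch, skip the sync otherwise.
        last = len(micro_losses) - 1
        for i, loss in enumerate(micro_losses):
            with switch_optimizer_grad_sync_skip_mode(engine.optimizer, skip=(i != last)):
                engine.backward(loss)
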
-class PipelineScheduler(BaseScheduler):
- """
- A helper schedule class for pipeline parallelism running environment.
-    It uses the non-interleaved 1F1B strategy. Other properties are similar to
-    :class:`NonPipelineScheduler`.
-
- Args:
- num_microbatches (int): The number of microbatches.
- dtype (torch.dtype): Type of data. torch.float by default.
- data_process_func (Callable, optional):
- The post processing function which receives a micro batch of data, and it will be executed
- in `load_micro_batch`.
- tensor_shape (torch.Size, optional): Specified shape in pipeline communication.
- scatter_gather_tensors (bool, optional):
- If set to `True`, communication will be reduced over pipeline when using 1D tensor parallelization.
- scheduler_hooks (Optional[List[SchedulerHook]], optional): List of scheduler hooks.
- """
-
- def __init__(
- self,
- num_microbatches: int,
- dtype: torch.dtype = torch.float,
- data_process_func: Callable = None,
- tensor_shape: Union[torch.Size, List[int], Tuple[int]] = None,
- scatter_gather_tensors: bool = False,
- scheduler_hooks: Optional[List[SchedulerHook]] = None,
- ):
-        assert num_microbatches > 0, f"expected num_microbatches to be larger than 0, but got {num_microbatches}"
-
- assert not isinstance(
- tensor_shape, int
- ), "tensor_shape type should be one of Union[torch.Size, List[int], Tuple[int]]."
-
- super().__init__(data_process_func=data_process_func)
-
- self.num_microbatches = num_microbatches
- self.dtype = dtype
- self._hooks = scheduler_hooks
-
- self._tensor_shape = (
- tensor_shape if tensor_shape is None or isinstance(tensor_shape, torch.Size) else torch.Size(tensor_shape)
- )
-
- self.scatter_gather_tensors = (
- scatter_gather_tensors
- and gpc.is_initialized(ParallelMode.TENSOR)
- and gpc.get_world_size(ParallelMode.TENSOR) > 1
- )
-
- if gpc.config.parallel.sequence_parallel:
- self.scatter_gather_tensors = False
-
- # cache for the batch data
- self.batch_data = None
-
- @property
- def tensor_shape(self) -> torch.Size:
- return self._tensor_shape
-
- @tensor_shape.setter
- def tensor_shape(self, tensor_shape: torch.Size):
- self._tensor_shape = tensor_shape
-
- def pre_processing(self, engine):
- types = set()
-
- for param in engine.model.parameters():
- types.add(param.dtype)
- assert len(types) == 1, f"Mixed types of parameter detected, {types}"
-
- self.dtype = types.pop()
-
- @staticmethod
- def _call_engine(engine, data): # pylint: disable=W0237
- if data is None:
- return None
-
- if isinstance(data, torch.Tensor):
- return engine(data)
- elif isinstance(data, (list, tuple)):
- return engine(*data)
- elif isinstance(data, dict):
- stage_output = data.pop("stage_output", None)
-
- if stage_output is None:
- return engine(**data)
- elif isinstance(stage_output, torch.Tensor):
- return engine(stage_output, **data)
- elif isinstance(stage_output, (tuple, list)):
- return engine(*stage_output, **data)
- else:
- raise TypeError(
- f"Expected stage_output to be of type torch.Tensor, list, or tuple, "
- f"but got {type(stage_output)}"
- )
- else:
- raise TypeError(f"Expected data to be of type torch.Tensor, list, tuple, or dict, but got {type(data)}")
-
- def load_batch(self, engine, data_iter):
- # Pipeline schedule just puts data in memory
- batch_data, batch_size = engine.load_batch(data_iter, to_gpu=False)
-        assert batch_size % self.num_microbatches == 0, "Batch size should be divisible by the number of microbatches"
-
- self.microbatch_offset = 0
- self.batch_size = batch_size
- self.batch_data, self.batch_label = batch_data
- self.microbatch_size = self.batch_size // self.num_microbatches
-
- def load_micro_batch(self):
- micro_batch_data, micro_batch_label = self._load_micro_batch(
- data=self.batch_data, label=self.batch_label, offset=self.microbatch_offset, micro_bsz=self.microbatch_size
- )
- if self.data_process_func:
- micro_batch_data["input_ids"] = self.data_process_func(
- micro_batch_data["input_ids"], micro_batch_data["cu_seqlens"]
- )
- micro_batch_label = self.data_process_func(micro_batch_label, micro_batch_data["cu_seqlens"])
-
- micro_batch_data.pop("cu_seqlens")
- micro_batch_data.pop("indexes")
-
- micro_batch_data["label"] = micro_batch_label
- self.microbatch_offset += self.microbatch_size
-
- return move_to_device(micro_batch_data)
-
- def _get_data_label_for_current_step(self, stage_output, micro_batch_data):
- if isinstance(micro_batch_data, (tuple, list)):
- if gpc.is_first_rank(ParallelMode.PIPELINE):
- # for the first stage, we use the data from the
- # dataloader output by default
- data, label = micro_batch_data
- else:
- # for non-first stage, we use the output passed
-                # by the previous stage as the model input
- data = stage_output
- _, label = micro_batch_data
- elif isinstance(micro_batch_data, dict):
- label = micro_batch_data.pop("label", None)
- data = {"stage_output": stage_output, **micro_batch_data}
-
- return data, label
-
- def _call_hooks(self, func_name: str, *args, **kwargs) -> None:
- for hook in self._hooks:
- getattr(hook, func_name)(self, *args, **kwargs)
-
- def _get_current_microbatch_id(self, step_id: int) -> int:
- """
- Get the current microbatch ID based on the step ID.
- In 1f1b scheduler, the microbatch ID is the same as the step ID,
- but it is important to note that the step ID is calculated separately
- for forward and backward passes.
- """
- return step_id
-
- def _forward_step(self, engine, input_obj, return_tensors, return_output_label=True, accum_loss=None):
- """
- Forward step for passed-in model. If it is the first stage, the input tensor
- is obtained from data_iterator, otherwise the passed-in input_obj is used.
- Returns output tensor. This is a helper function and can be ignored by users.
-
- Args:
- engine (colossalai.engine.Engine): Colossalai engine for training and inference.
- input_obj (Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]): Input tensor for this pipeline stage.
- return_tensors (List[:class:`torch.Tensor`]): A list of tensors to return.
- return_output_label (bool, optional): Whether returns output labels.
-            accum_loss (optional): The tensor in which the accumulated loss is stored.
- Returns:
- Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]: output or the loss value of the current
- pipeline stage.
- """
- micro_batch_data = self.load_micro_batch()
- data, label = self._get_data_label_for_current_step(input_obj, micro_batch_data)
-
- self._call_hooks("before_forward", data)
- output_obj = self._call_engine(engine.model, data)
- self._call_hooks("after_forward", output_obj)
-
- if gpc.is_last_rank(ParallelMode.PIPELINE):
- self._call_hooks("post_helper_func", output_obj, label)
- if return_output_label:
- return_tensors.append((output_obj, label))
- if accum_loss is not None:
- self._call_hooks("before_criterion", output_obj, label)
- loss = self._call_engine_criterion(engine, output_obj, label)
- self._call_hooks("after_criterion", loss)
-
- loss_reduced = loss / self.num_microbatches
- accum_loss.add_(loss_reduced.detach())
- output_obj = loss_reduced
-
- return output_obj
-
- def _backward_step(self, engine, step_id, input_obj, output_obj, output_obj_grad):
- """
- Backward step through the passed-in output tensor. If it is the last stage, the
- output_obj_grad is None, otherwise it is the gradients with respect to stage's output tensor.
- Returns the gradients with respect to the input tensor (None if first stage).
- This is a helper function and can be ignored by users.
-
- Args:
- engine (colossalai.engine.Engine): Colossalai engine for training and inference.
- step_id (int): The ID of the current step.
- input_obj (Union[torch.Tensor, List[torch.Tensor]]): Input tensor for this stage.
- output_obj (Union[torch.Tensor, List[torch.Tensor]]): Output tensor for this stage.
- output_obj_grad (Union[torch.Tensor, List[torch.Tensor]]): Gradient of output tensor for this stage.
-
- Returns:
- Union[torch.Tensor, List[torch.Tensor]]: Gradient of input tensor.
- """
-
- # Retain the grad on the input_obj.
- if input_obj is not None:
- if isinstance(input_obj, torch.Tensor):
- input_obj.retain_grad()
- else:
- for in_tensor in input_obj:
- if in_tensor is not None:
- in_tensor.retain_grad()
-
- # Backward pass.
-
- # Only the last microbatch does syncing grad.
- skip_grad_sync = self._get_current_microbatch_id(step_id) != self.num_microbatches - 1
-
- self._call_hooks("before_backward", output_obj, output_obj_grad)
- with switch_optimizer_grad_sync_skip_mode(engine.optimizer, skip_grad_sync):
- if output_obj_grad is None:
- engine.backward(output_obj)
- else:
- engine.backward_by_grad(output_obj, output_obj_grad)
-
- # Collect the grad of the input_obj.
- input_obj_grad = None
- if input_obj is not None:
- if isinstance(input_obj, torch.Tensor):
- input_obj_grad = input_obj.grad
- else:
- input_obj_grad = []
- for in_tensor in input_obj:
- input_obj_grad.append(in_tensor.grad)
- self._call_hooks("after_backward", input_obj_grad)
-
- return input_obj_grad
-
- def _forward_only_step(self, engine, return_loss=True, return_output_label=True):
- """
- This function performs forward only computation process. The scheduling of microbatches is similar to the
- warmup phase, where each microbatch first receives the forward input from the previous stage, then performs
- the forward computation, and finally passes the forward computation output to the next stage. There are two
- special cases to note:
- 1. The first stage of the pipeline does not need to receive forward input; its input comes from the dataloader.
- 2. The last stage of the pipeline does not need to send forward output; its output is returned to the user code
- for processing.
-
- Args:
-            engine (internlm.core.engine.Engine): InternLM engine for training and inference.
- return_loss (bool, optional): Whether to return the accumulated loss.
- return_output_label (bool, optional): Whether to return outputs and labels.
-
- Returns:
- Tuple[Union[torch.Tensor, None], Union[torch.Tensor, None], Union[torch.Tensor, None]]:
- output, label, and accumulated loss.
- """
-
- # Input, output tensors only need to be saved when doing backward passes
- return_tensors = []
- accum_loss = (
- torch.zeros(1, device=get_current_device())
- if return_loss and gpc.is_pipeline_last_stage(ignore_virtual=True)
- else None
- )
-
- # Used for tensor meta information communication
- forward_recv_shapes = self.tensor_shape
- need_forward_meta = self.tensor_shape is None
-
- # Run all forward passes.
- for _ in range(self.num_microbatches):
- # Receive input from the previous stage
- if not gpc.is_first_rank(ParallelMode.PIPELINE):
- if forward_recv_shapes is None:
- forward_recv_shapes = comm.recv_obj_meta()
- input_obj = comm.recv_forward(
- forward_recv_shapes,
- dtype=self.dtype,
- scatter_gather_tensors=self.scatter_gather_tensors,
- )
- else:
- input_obj = None
-
- # Perform forward computation
- output_obj = self._forward_step(
- engine,
- input_obj,
- return_tensors,
- return_output_label=return_output_label,
- accum_loss=accum_loss,
- )
-
- if not gpc.is_last_rank(ParallelMode.PIPELINE):
- if need_forward_meta:
- comm.send_obj_meta(output_obj)
- need_forward_meta = False # send only once.
- # Send the forward computation output to the next stage
- comm.send_forward(output_obj, scatter_gather_tensors=self.scatter_gather_tensors)
-
- output, label = pack_return_tensors(return_tensors) if len(return_tensors) > 0 else (None, None)
-
- return output, label, accum_loss
-
- def _forward_backward_step(self, engine, return_loss=True, return_output_label=True):
- """
- This function schedules the forward and backward computation of microbatches in the pipeline in a 1F1B manner.
- It consists of three stages: warmup, 1F1B, and cooldown.
-
- 1. Warmup Stage:
-        The warmup stage performs num_warmup forward microsteps, where num_warmup is the pipeline parallel size minus
-        the rank of the current pipeline stage minus 1. For each microstep, it receives data as input from the previous
- stage, performs the forward computation, and then sends the result to the next stage.
-
- 2. 1F1B Stage:
- The 1F1B stage consists of pairs of forward and backward microsteps. It performs num_1f1b_micropairs iterations,
- where num_1f1b_micropairs is calculated as the total number of microbatches minus the number of microbatches in
- the warmup stage. In each iteration, it first performs a forward computation, sends the result to the next
-        stage, receives input for the backward computation, performs the backward computation, and finally sends the
-        result to the previous stage while receiving the input for the next forward computation.
-
- 3. Cooldown Stage:
- The cooldown stage performs the same number of iterations as the warmup stage. In each iteration, it receives
- input for the backward computation, performs the backward computation, and finally sends the result to the
- previous stage.
-
- There are two special cases to consider:
- 1. The first stage of the pipeline does not need to receive forward input or send backward output. The last
- stage does not need to send forward output or receive backward input.
- 2. Pay attention to the communication between stages and use additional communication to bridge the gap.
-
- Args:
- engine (Engine): The engine used for computation.
- return_loss (bool, optional): Whether to return the accumulated loss.
- return_output_label (bool, optional): Whether to return outputs and labels.
-
- Returns:
- Tuple[Union[torch.Tensor, None], Union[torch.Tensor, None], Union[torch.Tensor, None]]:
- The output, label, and accumulated loss.
- """
-
- num_warmup_microsteps = (
- gpc.get_world_size(ParallelMode.PIPELINE) - gpc.get_local_rank(ParallelMode.PIPELINE) - 1
- )
- num_warmup_microsteps = min(num_warmup_microsteps, self.num_microbatches)
- num_1f1b_micropairs = self.num_microbatches - num_warmup_microsteps
-
- # Input, output tensors only need to be saved when doing backward passes
- input_objs = []
- output_objs = []
- return_tensors = []
- accum_loss = (
- torch.zeros(1, device=get_current_device())
- if return_loss and gpc.is_pipeline_last_stage(ignore_virtual=True)
- else None
- )
-
- # Used for tensor meta information communication
- forward_recv_shapes = self.tensor_shape
- backward_recv_shapes = None
- need_forward_meta = self.tensor_shape is None
-
- # Run warmup forward passes.
- for i in range(num_warmup_microsteps):
- # Receive the input from the previous stage
- if not gpc.is_first_rank(ParallelMode.PIPELINE):
- if forward_recv_shapes is None:
- forward_recv_shapes = comm.recv_obj_meta()
- input_obj = comm.recv_forward(
- forward_recv_shapes,
- dtype=self.dtype,
- scatter_gather_tensors=self.scatter_gather_tensors,
- )
- else:
- input_obj = None
-
- # Perform forward computation
- output_obj = self._forward_step(
- engine,
- input_obj,
- return_tensors,
- return_output_label=return_output_label,
- accum_loss=accum_loss,
- )
-
- if not gpc.is_last_rank(ParallelMode.PIPELINE):
- if isinstance(output_obj, torch.Tensor):
- backward_recv_shapes = output_obj.shape
- else:
- backward_recv_shapes = [out_tensor.shape for out_tensor in output_obj]
-
- if need_forward_meta:
- comm.send_obj_meta(output_obj)
- need_forward_meta = False # send only once.
-
- # Send the output of forward computation of this pipeline stage to the next pipeline stage as input for
- # forward computation
- if not gpc.is_last_rank(ParallelMode.PIPELINE):
- comm.send_forward(output_obj, scatter_gather_tensors=self.scatter_gather_tensors)
-
- input_objs.append(input_obj)
- output_objs.append(output_obj)
-
- # Before running 1F1B, need to receive first forward tensor.
- # If all microbatches are run in warmup / cooldown phase, then no need to
- # receive this tensor here.
- if num_1f1b_micropairs > 0:
- if not gpc.is_first_rank(ParallelMode.PIPELINE):
- if forward_recv_shapes is None:
- forward_recv_shapes = comm.recv_obj_meta(forward_recv_shapes)
- input_obj = comm.recv_forward(
- forward_recv_shapes,
- dtype=self.dtype,
- scatter_gather_tensors=self.scatter_gather_tensors,
- )
- else:
- input_obj = None
-
- # Run 1F1B in steady state.
- for i in range(num_1f1b_micropairs):
- # Perform forward computation
- output_obj = self._forward_step(
- engine,
- input_obj,
- return_tensors,
- return_output_label=return_output_label,
- accum_loss=accum_loss,
- )
-
- if gpc.is_last_rank(ParallelMode.PIPELINE):
- output_obj_grad = None
- else:
- output_obj_grad = comm.send_forward_recv_backward(
- output_obj,
- backward_recv_shapes,
- dtype=self.dtype,
- scatter_gather_tensors=self.scatter_gather_tensors,
- )
-
- # Add input_obj and output_obj to end of list.
- input_objs.append(input_obj)
- output_objs.append(output_obj)
-
-            # Pop input_obj and output_obj from the start of the list for
-            # the backward pass.
- input_obj = input_objs.pop(0)
- output_obj = output_objs.pop(0)
-
- input_obj_grad = self._backward_step(engine, i, input_obj, output_obj, output_obj_grad)
-
- if i == (num_1f1b_micropairs - 1):
- input_obj = None
- if not gpc.is_first_rank(ParallelMode.PIPELINE):
- comm.send_backward(
- input_obj_grad,
- scatter_gather_tensors=self.scatter_gather_tensors,
- )
- else:
- if gpc.is_first_rank(ParallelMode.PIPELINE):
- input_obj = None
- else:
- input_obj = comm.send_backward_recv_forward(
- input_obj_grad,
- forward_recv_shapes,
- dtype=self.dtype,
- scatter_gather_tensors=self.scatter_gather_tensors,
- )
-
- # Run cooldown backward passes.
- for i in range(num_warmup_microsteps):
- input_obj = input_objs.pop(0)
- output_obj = output_objs.pop(0)
-
- if not gpc.is_last_rank(ParallelMode.PIPELINE):
- output_obj_grad = comm.recv_backward(
- backward_recv_shapes,
- dtype=self.dtype,
- scatter_gather_tensors=self.scatter_gather_tensors,
- )
- else:
- output_obj_grad = None
-
- input_obj_grad = self._backward_step(
- engine, num_1f1b_micropairs + i, input_obj, output_obj, output_obj_grad
- )
-
- if not gpc.is_first_rank(ParallelMode.PIPELINE):
- comm.send_backward(input_obj_grad, scatter_gather_tensors=self.scatter_gather_tensors)
-
- output, label = pack_return_tensors(return_tensors) if len(return_tensors) > 0 else (None, None)
-
- return output, label, accum_loss
-
- @llm_timeout(func_name="nointerleaved_forward_backward_step")
- def forward_backward_step(self, engine, data_iter, forward_only=False, return_loss=True, return_output_label=True):
- """Runs non-interleaved 1F1B schedule, with communication between pipeline stages.
- Returns a tuple with losses if the last stage, an empty tuple otherwise.
-
- Args:
- engine (colossalai.engine.Engine): Colossalai engine for training and inference.
- data_iter (Iterable): Dataloader as the form of an iterator, obtained by calling iter(dataloader).
- forward_only (bool, optional):
- Whether run forward step only. Default is false. If true, no backward will be run.
- return_loss (bool, optional): Whether returns the loss value. Default is true.
- return_output_label (bool, optional): If False, the output and label won't be returned.
- Returns:
- Tuple[:class:`torch.Tensor`]: A tuple of (output, label, loss), loss and label could be None.
- """
-
- assert (
- forward_only or return_loss
- ), "The argument 'return_loss' has to be True when 'forward_only' is False, but got False."
-
- # Load data first
- self.load_batch(engine, data_iter)
-
- if forward_only:
- return self._forward_only_step(engine, return_loss, return_output_label)
- else:
- return self._forward_backward_step(engine, return_loss, return_output_label)
-
-
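
Before the interleaved variant, it helps to see how the non-interleaved 1F1B schedule above sizes its three phases: each stage runs some warm-up forwards, then 1F1B pairs, then the same number of cooldown backwards, with the counts depending on the stage's pipeline rank. The small sketch below reproduces only that arithmetic; pp_size, pp_rank, and num_microbatches are placeholder inputs.

    def one_f_one_b_phases(pp_size: int, pp_rank: int, num_microbatches: int):
        """Phase sizes used by PipelineScheduler._forward_backward_step (illustrative)."""
        num_warmup = min(pp_size - pp_rank - 1, num_microbatches)
        num_1f1b_pairs = num_microbatches - num_warmup
        num_cooldown = num_warmup
        return num_warmup, num_1f1b_pairs, num_cooldown


    # With 4 pipeline stages and 8 microbatches:
    #   stage 0 -> (3, 5, 3), stage 1 -> (2, 6, 2), stage 3 (last) -> (0, 8, 0)
    for rank in range(4):
        print(rank, one_f_one_b_phases(pp_size=4, pp_rank=rank, num_microbatches=8))
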
-class InterleavedPipelineScheduler(PipelineScheduler):
- """
- Interleaved Pipeline Scheduler.
- """
-
- def __init__(
- self,
- num_microbatches: int,
- num_chunks: int,
- dtype: torch.dtype = torch.float,
- data_process_func: Callable = None,
- tensor_shape: Union[torch.Size, List[int], Tuple[int]] = None,
- scatter_gather_tensors: bool = False,
- scheduler_hooks: Optional[List[SchedulerHook]] = None,
- communication_overlap: bool = False,
- ):
- """A helper schedule class for pipeline parallelism running environment.
-        It uses the interleaved 1F1B strategy. Other properties are similar to
-        :class:`NonPipelineScheduler`.
-
- Args:
- num_microbatches (int): The number of microbatches.
- num_chunks (int): The number of model chunks.
- dtype (torch.dtype, optional): The data type of the tensors. Default is torch.float.
- data_process_func (Callable, optional):
- The preprocessing function which receives a batch of data, and it will be executed in `load_batch`.
- tensor_shape (torch.Size, optional): Specified shape in pipeline communication.
- scatter_gather_tensors (bool, optional):
- If set to `True`, communication will be reduced over pipeline when using 1D tensor parallelization.
- scheduler_hooks (List[SchedulerHook], optional): List of scheduler hooks. Default is None.
- communication_overlap (bool, optional): Whether to enable communication overlap. Default is False.
- """
- assert (
- num_microbatches % gpc.get_world_size(ParallelMode.PIPELINE) == 0
- ), "num_microbatches must be an integer multiple of pipeline parallel world size"
-
- assert (
- isinstance(num_chunks, int) and num_chunks > 0
- ), f"expected num_chunks to be an integer and larger than 0, but got {num_chunks}"
-
- super().__init__(
- num_microbatches,
- dtype=dtype,
- data_process_func=data_process_func,
- tensor_shape=tensor_shape,
- scatter_gather_tensors=scatter_gather_tensors,
- scheduler_hooks=scheduler_hooks,
- )
-
- gpc.set_virtual_pipeline_parallel_size(num_chunks)
- gpc.set_virtual_pipeline_parallel_rank(0)
-
- self._num_chunks = num_chunks
- self._communication_overlap = communication_overlap
- # switch 1f1b loop runner function according to communication overlap
- self._run_1f1b_loop = (
- self._run_1f1b_loop_with_overlap if communication_overlap else self._run_1f1b_loop_without_overlap
- )
-
- # states
- self._pp_size = gpc.get_world_size(ParallelMode.PIPELINE)
- self._pp_rank = gpc.get_local_rank(ParallelMode.PIPELINE)
-
- self._accum_loss = None
- self._return_tensors = None
- self._input_objs = [[] for _ in range(num_chunks)]
- self._output_objs = [[] for _ in range(num_chunks)]
- self._output_obj_grads = [[] for _ in range(num_chunks)]
-
- self._input_obj_shapes = [self.tensor_shape for _ in range(num_chunks)]
- self._output_obj_shapes = [None for _ in range(num_chunks)]
- self._send_tensor_shape_flags = [self.tensor_shape is None for _ in range(num_chunks)]
-
- @property
- def tensor_shape(self) -> torch.Size:
- return self._tensor_shape
-
- @tensor_shape.setter
- def tensor_shape(self, tensor_shape: torch.Size):
- self._tensor_shape = tensor_shape
- self._input_obj_shapes = [self._tensor_shape for _ in range(self._num_chunks)]
- self._send_tensor_shape_flags = [self._tensor_shape is None for _ in range(self._num_chunks)]
-
- def _clear_state(self) -> None:
- self._accum_loss = None
- self._return_tensors = None
- self._input_objs = [[] for _ in range(self._num_chunks)]
- self._output_objs = [[] for _ in range(self._num_chunks)]
- self._output_obj_grads = [[] for _ in range(self._num_chunks)]
-
- self._input_obj_shapes = [self.tensor_shape for _ in range(self._num_chunks)]
- self._output_obj_shapes = [None for _ in range(self._num_chunks)]
- self._send_tensor_shape_flags = [self.tensor_shape is None for _ in range(self._num_chunks)]
-
- def load_batch(self, engine, data_iter):
- super().load_batch(engine, data_iter)
-        # overwrite microbatch_offset, since model chunks load the same microbatch and should track the offset
- self.microbatch_offset = [0 for _ in range(self._num_chunks)]
-
- def load_micro_batch(self, model_chunk_id):
- micro_batch_data, micro_batch_label = self._load_micro_batch(
- data=self.batch_data,
- label=self.batch_label,
- offset=self.microbatch_offset[model_chunk_id],
- micro_bsz=self.microbatch_size,
- )
- micro_batch_data["label"] = micro_batch_label
- self.microbatch_offset[model_chunk_id] += self.microbatch_size
- return move_to_device(micro_batch_data)
-
- def _forward_step(self, engine, chunk_id):
- """Forward step for passed-in model. If it is the first stage, the input tensor
- is obtained from data_iterator, otherwise the passed-in input_obj is used.
- Returns output tensor. This is a helper function and can be ignored by users.
-
- Args:
- engine (colossalai.engine.Engine): Colossalai engine for training and inference.
- chunk_id (int): The id of model chunks.
- Returns:
- Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]: output or the loss value of the current
- pipeline stage.
- """
- gpc.set_virtual_pipeline_parallel_rank(chunk_id)
-
- if gpc.is_pipeline_first_stage() and len(self._input_objs[chunk_id]) == len(self._output_objs[chunk_id]):
- self._input_objs[chunk_id].append(None)
- input_obj = self._input_objs[chunk_id][-1]
-
- micro_batch_data = self.load_micro_batch(chunk_id)
- data, label = self._get_data_label_for_current_step(input_obj, micro_batch_data)
-
- self._call_hooks("before_forward", data)
- output_obj = self._call_engine(engine.model[chunk_id], data)
- # Convert output_obj to fp32 when last model chunk of last stage
- if gpc.is_pipeline_last_stage(ignore_virtual=False) and isinstance(engine.model[chunk_id], NaiveAMPModel):
- output_obj = engine.model[chunk_id].convert_to_fp32(output_obj)
- self._call_hooks("after_forward", output_obj)
-
- if gpc.is_pipeline_last_stage():
- self._call_hooks("post_helper_func", output_obj, label)
-
- if self._return_tensors is not None:
- self._return_tensors.append((output_obj, label))
- if self._accum_loss is not None:
- self._call_hooks("before_criterion", output_obj, label)
- loss = self._call_engine_criterion(engine, output_obj, label)
- self._call_hooks("after_criterion", loss)
-
- loss_reduced = loss / self.num_microbatches
- self._accum_loss.add_(loss_reduced.detach())
- output_obj = loss_reduced
-
- self._output_objs[chunk_id].append(output_obj)
-
- return output_obj
-
- def _backward_step(self, engine, chunk_id, step_id):
- """
- Backward step for passed-in model. If it is the last stage, the input tensor
- is obtained from the previous forward step, otherwise the passed-in input_obj is used.
- Returns input tensor gradient. This is a helper function and can be ignored by users.
-
- Args:
- engine (colossalai.engine.Engine): Colossalai engine for training and inference.
- chunk_id (int): The id of model chunks.
- step_id (int): The current step id.
-
- Returns:
- Union[:class:`torch.Tensor`, List[:class:`torch.Tensor`]]: input tensor gradient.
- """
- gpc.set_virtual_pipeline_parallel_rank(chunk_id)
-
- if gpc.is_pipeline_last_stage() and len(self._output_obj_grads[chunk_id]) == 0:
- self._output_obj_grads[chunk_id].append(None)
-
- input_obj = self._input_objs[chunk_id].pop(0)
- output_obj = self._output_objs[chunk_id].pop(0)
- output_obj_grad = self._output_obj_grads[chunk_id].pop(0)
-
- input_obj_grad = super()._backward_step(engine, step_id, input_obj, output_obj, output_obj_grad)
-
- return input_obj_grad
-
- def _get_chunk_by_microbatch(self, step_id: int, backward: bool = False) -> int:
- """Helper method to get the model chunk ID given the iteration number."""
- microbatch_id_in_group = step_id % (self._pp_size * self._num_chunks)
- chunk_id = microbatch_id_in_group // self._pp_size
-
- if backward:
- chunk_id = self._num_chunks - chunk_id - 1
-
- return chunk_id
-
- def _get_current_microbatch_id(self, step_id: int) -> int:
- # format:
- # microstep_id : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
- # microbatch_id: 1 2 3 4 1 2 3 4 5 6 7 8 5 6 7 8
- num_microbatch_group = step_id // (self._pp_size * self._num_chunks)
- step_id_in_group = step_id % (self._pp_size * self._num_chunks)
-
- microbatch_id = num_microbatch_group * self._pp_size + step_id_in_group % self._pp_size
-
- return microbatch_id
-
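
The two helpers above map a microstep index to a model chunk and to a microbatch id for the interleaved schedule. The sketch below reproduces that mapping for pp_size=4 and num_chunks=2, matching the 0-indexed version of the comment table; the function name is hypothetical.

    def chunk_and_microbatch(step_id: int, pp_size: int = 4, num_chunks: int = 2):
        """Illustrative re-implementation of the two mapping helpers above."""
        group = pp_size * num_chunks
        chunk_id = (step_id % group) // pp_size
        microbatch_id = (step_id // group) * pp_size + (step_id % group) % pp_size
        return chunk_id, microbatch_id


    # step_id      : 0 1 2 3 4 5 6 7 8 9 ...
    # chunk_id     : 0 0 0 0 1 1 1 1 0 0 ...
    # microbatch_id: 0 1 2 3 0 1 2 3 4 5 ...
    for step in range(10):
        print(step, chunk_and_microbatch(step))
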
- def _run_warmup_loop(
- self,
- engine: Engine,
- num_microsteps: int,
- num_warmup_microsteps: int,
- receive_extra_backward: bool = False,
- forward_only: bool = False,
- ) -> None:
- """
- Run the warm-up loop and prepare data for the 1F1B stage.
-
- During the warm-up process, for each execution, it first performs a forward computation,
- and then sends the computation result to the next stage.
- It also receives data for the next forward computation.
- Since the input for the first forward computation is not considered initially,
- it needs to receive data once at the beginning.
-
- After the warm-up is completed, we need to prepare data for the 1F1B stage.
- The data preparation process should be consistent with the communication method of the 1F1B stage.
-
- Args:
- engine (Engine): The engine to run the warm-up loop.
- num_microsteps (int): The total number of microsteps.
- num_warmup_microsteps (int): The number of warm-up microsteps.
- receive_extra_backward (bool, optional): Whether to receive extra backward input for the 1F1B stage.
- Default is False.
- forward_only (bool, optional): Whether to only perform forward pass. Default is False.
- """
- if not gpc.is_pipeline_first_stage():
- if self._input_obj_shapes[0] is None:
- self._input_obj_shapes[0] = comm.recv_obj_meta(self._input_obj_shapes[0])
- self._input_objs[0].append(
- comm.recv_forward(
- self._input_obj_shapes[0],
- dtype=self.dtype,
- scatter_gather_tensors=self.scatter_gather_tensors,
- )
- )
- else:
- self._input_objs[0].append(None)
-
- for k in range(num_warmup_microsteps):
- chunk_id = self._get_chunk_by_microbatch(k)
-
- output_obj = self._forward_step(engine, chunk_id)
-
- if forward_only:
- # when forward-only, no need to save tensors for a backward pass
- self._input_objs[chunk_id].pop()
- self._output_objs[chunk_id].pop()
-
- if not gpc.is_pipeline_last_stage():
- if isinstance(output_obj, torch.Tensor):
- self._output_obj_shapes[chunk_id] = output_obj.shape
- else:
- self._output_obj_shapes[chunk_id] = [out_tensor.shape for out_tensor in output_obj]
-
- if self._send_tensor_shape_flags[chunk_id]:
- comm.send_obj_meta(output_obj)
- self._send_tensor_shape_flags[chunk_id] = False # send only once for each chunk.
-
- # Determine if tensor should be received from previous stage.
- next_forward_chunk_id = self._get_chunk_by_microbatch(k + 1)
-
- with switch_virtual_pipeline_parallel_rank(next_forward_chunk_id):
- if not gpc.is_pipeline_first_stage() and self._input_obj_shapes[next_forward_chunk_id] is None:
- self._input_obj_shapes[next_forward_chunk_id] = comm.recv_obj_meta()
- if k == (num_microsteps - 1) or gpc.is_pipeline_first_stage():
- input_shape = None
- else:
- input_shape = self._input_obj_shapes[next_forward_chunk_id]
-
- # Don't send tensor downstream if on last stage.
- if gpc.is_pipeline_last_stage():
- output_obj = None
-
- # Send and receive tensors as appropriate (send tensors computed
- # in this iteration; receive tensors for next iteration).
- if k != (num_warmup_microsteps - 1) or not receive_extra_backward:
- # Normal warm-up communication process, or no need to prepare backward input for the 1F1B stage
- input_obj = comm.send_forward_recv_forward(
- output_obj,
- input_shape,
- dtype=self.dtype,
- scatter_gather_tensors=self.scatter_gather_tensors,
- )
- else:
- # Receive output_obj_grad for next backward, if receive_extra_backward is True.
- if self._communication_overlap:
- # In this case, we should handle forward and backward communication separately, consistent with the
- # overlap version of the 1F1B stage
- input_obj = comm.send_forward_recv_forward(
- output_obj,
- input_shape,
- dtype=self.dtype,
- scatter_gather_tensors=self.scatter_gather_tensors,
- )
- output_obj_grad = comm.send_backward_recv_backward(
- None, # nothing to send
- self._output_obj_shapes[self._num_chunks - 1],
- dtype=self.dtype,
- scatter_gather_tensors=self.scatter_gather_tensors,
- )
- self._output_obj_grads[self._num_chunks - 1].append(output_obj_grad)
- else:
- # In this case, we should handle forward and backward communication together, consistent with the
- # non-overlap version of the 1F1B stage
- input_obj, output_obj_grad = comm.send_forward_backward_recv_forward_backward(
- output_obj,
- None, # no backward grad to send
- input_shape,
- self._output_obj_shapes[self._num_chunks - 1],
- dtype=self.dtype,
- scatter_gather_tensors=self.scatter_gather_tensors,
- )
- self._output_obj_grads[self._num_chunks - 1].append(output_obj_grad)
-
- self._input_objs[next_forward_chunk_id].append(input_obj)
-
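- # Illustrative sketch (not the scheduler's own helper): in an interleaved schedule,
- # microsteps cycle through `pp_size` consecutive microbatches per model chunk, so a
- # `_get_chunk_by_microbatch`-style mapping could look like the standalone function
- # below; `pp_size` and `num_chunks` are assumed inputs of this sketch.
- def chunk_of_microstep(step_id: int, pp_size: int, num_chunks: int, backward: bool = False) -> int:
-     """Map a global microstep id to the model-chunk index it runs on."""
-     step_in_group = step_id % (pp_size * num_chunks)
-     chunk_id = step_in_group // pp_size
-     if backward:  # backward passes visit the chunks in reverse order
-         chunk_id = num_chunks - 1 - chunk_id
-     return chunk_id
-
- # e.g. with pp_size=4 and num_chunks=2, forward microsteps 0-3 run on chunk 0,
- # microsteps 4-7 on chunk 1, and the pattern then repeats.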
- def _run_1f1b_loop_with_overlap(
- self,
- engine: Engine,
- num_warmup_microsteps: int,
- num_1f1b_micropairs: int,
- all_warmup_microsteps: bool = False,
- ) -> None:
- """
- Run the 1F1B loop with overlap.
-
- The 1F1B loop with overlap consists of the following steps:
- 1. Perform the forward pass.
- 2. Check if the backward input is ready.
- 3. Send the forward output and receive the forward input for the next iteration.
- 4. Perform the backward pass.
- 5. Check if the forward input is ready.
- 6. Send the backward output and receive the backward input for the next iteration.
-
- Args:
- engine (Engine): The engine to run the 1F1B loop.
- num_warmup_microsteps (int): The number of warm-up microsteps.
- num_1f1b_micropairs (int): The number of 1F1B micropairs.
- all_warmup_microsteps (bool, optional): Whether to run all warm-up microsteps. Default is False.
- """
-
- backward_async_communicator = None
-
- # Run 1F1B in steady state.
- for k in range(num_1f1b_micropairs):
- forward_microstep_id = k + num_warmup_microsteps
- backward_microstep_id = k
- forward_chunk_id = self._get_chunk_by_microbatch(forward_microstep_id)
- backward_chunk_id = self._get_chunk_by_microbatch(backward_microstep_id, backward=True)
-
- # 1. Forward pass.
- output_obj = self._forward_step(engine, forward_chunk_id)
-
- # 2. Check if the backward input is ready.
- if backward_async_communicator is not None:
- output_obj_grad = backward_async_communicator.wait_and_receive()
-
- if backward_async_communicator.need_receive:
- self._output_obj_grads[backward_chunk_id].append(output_obj_grad)
-
- # 3. Send the forward outputs and receive the forward inputs from the previous rank.
-
- # If this is the last model chunk of the last pipeline stage, there is no forward output to send.
- gpc.set_virtual_pipeline_parallel_rank(forward_chunk_id)
- if gpc.is_pipeline_last_stage():
- output_obj = None
-
- # Check if it needs to receive the results from the previous rank.
- next_forward_chunk_id = self._get_chunk_by_microbatch(forward_microstep_id + 1)
- with switch_virtual_pipeline_parallel_rank(next_forward_chunk_id):
- if gpc.is_pipeline_first_stage() or k == num_1f1b_micropairs - 1:
- input_obj_shape = None
- else:
- input_obj_shape = self._input_obj_shapes[next_forward_chunk_id]
-
- forward_async_communicator = comm.AsynCommunicator(
- output_obj,
- input_obj_shape,
- self.dtype,
- self.scatter_gather_tensors,
- forward=True,
- )
- forward_async_communicator.start()
-
- # 4. Backward pass.
-
- input_obj_grad = self._backward_step(engine, backward_chunk_id, backward_microstep_id)
-
- # 5. Check if the forward input is ready.
- input_obj = forward_async_communicator.wait_and_receive()
- if forward_async_communicator.need_receive:
- self._input_objs[next_forward_chunk_id].append(input_obj)
-
- # 6. Send the backward output and receive the backward input for the next iteration.
- gpc.set_virtual_pipeline_parallel_rank(backward_chunk_id)
- if gpc.is_pipeline_first_stage():
- input_obj_grad = None
-
- next_backward_chunk_id = self._get_chunk_by_microbatch(backward_microstep_id + 1, backward=True)
- with switch_virtual_pipeline_parallel_rank(next_backward_chunk_id):
- if gpc.is_pipeline_last_stage():
- output_obj_shape = None
- else:
- output_obj_shape = self._output_obj_shapes[next_backward_chunk_id]
-
- backward_async_communicator = comm.AsynCommunicator(
- input_obj_grad,
- output_obj_shape,
- self.dtype,
- self.scatter_gather_tensors,
- forward=False,
- )
- backward_async_communicator.start()
-
- if all_warmup_microsteps:
- if not gpc.is_pipeline_last_stage():
- self._output_obj_grads[self._num_chunks - 1].append(
- comm.recv_backward(
- self._output_obj_shapes[self._num_chunks - 1],
- dtype=self.dtype,
- scatter_gather_tensors=self.scatter_gather_tensors,
- )
- )
- else:
- self._output_obj_grads[self._num_chunks - 1].append(None)
- else:
- output_obj_grad = backward_async_communicator.wait_and_receive()
- if backward_async_communicator.need_receive:
- backward_chunk_id = self._get_chunk_by_microbatch(num_1f1b_micropairs, backward=True)
- self._output_obj_grads[backward_chunk_id].append(output_obj_grad)
-
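- # The loop above leans on `comm.AsynCommunicator` (start() / wait_and_receive() / need_receive).
- # A minimal stand-in built directly on torch.distributed is sketched below to show the pattern;
- # it is not the library's implementation, assumes single-tensor payloads and the default process
- # group, and ignores boundary stages.
- import torch
- import torch.distributed as dist
-
- class SimpleAsyncP2P:
-     """Overlap a send to one neighbour with a receive from the other."""
-
-     def __init__(self, tensor_to_send, recv_shape, dtype, forward=True):
-         self._send = tensor_to_send
-         self._recv = torch.empty(recv_shape, dtype=dtype) if recv_shape is not None else None
-         self._forward = forward
-         self._reqs = []
-         self.need_receive = recv_shape is not None
-
-     def start(self):
-         rank = dist.get_rank()
-         ops = []
-         if self._send is not None:
-             ops.append(dist.P2POp(dist.isend, self._send, rank + 1 if self._forward else rank - 1))
-         if self._recv is not None:
-             ops.append(dist.P2POp(dist.irecv, self._recv, rank - 1 if self._forward else rank + 1))
-         if ops:
-             self._reqs = dist.batch_isend_irecv(ops)
-
-     def wait_and_receive(self):
-         for req in self._reqs:
-             req.wait()
-         return self._recv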
- def _run_1f1b_loop_without_overlap(
- self,
- engine: Engine,
- num_warmup_microsteps: int,
- num_1f1b_micropairs: int,
- all_warmup_microsteps: bool = False,
- ) -> None:
- """
- Run the 1F1B loop without overlap.
-
- The 1F1B loop without overlap consists of the following steps:
- 1. Perform the forward pass.
- 2. Perform the backward pass.
- 3. Send the forward output of this iteration to the next stage, and send the backward output of this iteration
- to the previous stage, and receive the forward and backward inputs for the next iteration.
-
- Args:
- engine (Engine): The engine to use for computation.
- num_warmup_microsteps (int): The number of warmup microsteps.
- num_1f1b_micropairs (int): The number of 1F1B micro-pairs.
- all_warmup_microsteps (bool, optional): Whether to run all warmup microsteps. Defaults to False.
- """
- for k in range(num_1f1b_micropairs):
- # Forward pass.
- forward_microstep_id = k + num_warmup_microsteps
- forward_chunk_id = self._get_chunk_by_microbatch(forward_microstep_id)
- output_obj = self._forward_step(engine, forward_chunk_id)
-
- # Backward pass.
- backward_microstep_id = k
- backward_chunk_id = self._get_chunk_by_microbatch(backward_microstep_id, backward=True)
- input_obj_grad = self._backward_step(engine, backward_chunk_id, backward_microstep_id)
-
- # Send output_obj and input_obj_grad, receive input_obj
- # and output_obj_grad.
-
- # Determine if current stage has anything to send in either direction,
- # otherwise set obj to None.
- gpc.set_virtual_pipeline_parallel_rank(forward_chunk_id)
- if gpc.is_pipeline_last_stage():
- output_obj = None
-
- gpc.set_virtual_pipeline_parallel_rank(backward_chunk_id)
- if gpc.is_pipeline_first_stage():
- input_obj_grad = None
-
- # Determine if peers are sending, and where in data structure to put
- # received tensors.
- next_forward_chunk_id = self._get_chunk_by_microbatch(forward_microstep_id + 1)
- with switch_virtual_pipeline_parallel_rank(next_forward_chunk_id):
- if gpc.is_pipeline_first_stage() or k == num_1f1b_micropairs - 1:
- recv_prev = False
- else:
- recv_prev = True
-
- next_backward_chunk_id = self._get_chunk_by_microbatch(backward_microstep_id + 1, backward=True)
- with switch_virtual_pipeline_parallel_rank(next_backward_chunk_id):
- if gpc.is_pipeline_last_stage():
- recv_next = False
- else:
- recv_next = True
-
- input_shape = self._input_obj_shapes[next_forward_chunk_id] if recv_prev else None
- output_shape = self._output_obj_shapes[next_backward_chunk_id] if recv_next else None
-
- # Communicate objs.
- input_obj, output_obj_grad = comm.send_forward_backward_recv_forward_backward(
- output_obj,
- input_obj_grad,
- input_shape,
- output_shape,
- dtype=self.dtype,
- scatter_gather_tensors=self.scatter_gather_tensors,
- )
-
- # Put input_obj and output_obj_grad in data structures in the
- # right location.
- if recv_prev:
- self._input_objs[next_forward_chunk_id].append(input_obj)
- if recv_next:
- self._output_obj_grads[next_backward_chunk_id].append(output_obj_grad)
-
- # receive necessary data for next cooldown loop
- if all_warmup_microsteps:
- if not gpc.is_pipeline_last_stage():
- self._output_obj_grads[self._num_chunks - 1].append(
- comm.recv_backward(
- self._output_obj_shapes[self._num_chunks - 1],
- dtype=self.dtype,
- scatter_gather_tensors=self.scatter_gather_tensors,
- )
- )
- else:
- self._output_obj_grads[self._num_chunks - 1].append(None)
-
- def _run_cooldown_loop(self, engine: Engine, num_microsteps: int, num_1f1b_micropairs: int) -> None:
- """
- Run the cooldown loop.
-
- The cooldown loop consists of the following steps:
- 1. Perform the backward step.
- 2. Send the backward output to the previous stage and receive the input for the next backward step.
-
- Args:
- engine (Engine): The engine to use for computation.
- num_microsteps (int): The total number of microsteps.
- num_1f1b_micropairs (int): The number of 1F1B micro-pairs.
- """
- for k in range(num_1f1b_micropairs, num_microsteps):
- chunk_id = self._get_chunk_by_microbatch(k, backward=True)
-
- input_obj_grad = self._backward_step(engine, chunk_id, k)
-
- next_backward_chunk_id = self._get_chunk_by_microbatch(k + 1, backward=True)
-
- if k != (num_microsteps - 1) and not (
- gpc.is_pipeline_last_stage(ignore_virtual=True) and next_backward_chunk_id == (self._num_chunks - 1)
- ):
- output_shape = self._output_obj_shapes[next_backward_chunk_id]
- else:
- output_shape = None
-
- self._output_obj_grads[next_backward_chunk_id].append(
- comm.send_backward_recv_backward(
- input_obj_grad,
- output_shape,
- dtype=self.dtype,
- scatter_gather_tensors=self.scatter_gather_tensors,
- )
- )
-
- def _forward_only_step(self, engine: Engine):
- num_microsteps = self.num_microbatches * self._num_chunks
- num_warmup_microsteps = num_microsteps
-
- self._run_warmup_loop(
- engine,
- num_microsteps,
- num_warmup_microsteps,
- receive_extra_backward=False,
- forward_only=True,
- )
-
- def _forward_backward_step(self, engine: Engine):
- # Compute number of warmup and remaining microbatches.
- all_warmup_microsteps = False
- num_microsteps = self.num_microbatches * self._num_chunks
-
- # Run all forward passes and then all backward passes if the number of
- # microbatches equals the number of pipeline stages.
- # Otherwise, perform (num_chunks - 1) * pipeline_parallel_size warm-up
- # microsteps on all workers, followed by additional microsteps that depend
- # on the stage ID (earlier stages do more forward passes; later stages can
- # start 1F1B sooner).
- if self.num_microbatches == self._pp_size:
- num_warmup_steps = num_microsteps
- all_warmup_microsteps = True
- else:
- num_warmup_steps = (self._pp_size - self._pp_rank - 1) * 2
- num_warmup_steps += (self._num_chunks - 1) * self._pp_size
- num_warmup_steps = min(num_warmup_steps, num_microsteps)
- num_1f1b_micropairs = num_microsteps - num_warmup_steps
-
- # When the warm-up stage ends, we usually need to prepare one extra backward input for
- # the 1F1B stage, because each 1F1B step performs a forward and a backward pass together,
- # except in the following cases:
- receive_extra_backward = not (
- all_warmup_microsteps # Only warmup microsteps
- or gpc.is_pipeline_last_stage(ignore_virtual=True) # The rank is the last pipeline stage
- )
-
- # 1. Warmup
- self._run_warmup_loop(
- engine,
- num_microsteps,
- num_warmup_steps,
- receive_extra_backward=receive_extra_backward,
- )
-
- # 2. 1F1B
- self._run_1f1b_loop(
- engine,
- num_warmup_steps,
- num_1f1b_micropairs=num_1f1b_micropairs,
- all_warmup_microsteps=all_warmup_microsteps,
- )
-
- # 3. Cooldown
- self._run_cooldown_loop(engine, num_microsteps, num_1f1b_micropairs=num_1f1b_micropairs)
-
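- # Worked example of the warm-up bookkeeping above (numbers are illustrative): with
- # pp_size=4, num_chunks=2 and num_microbatches=8, num_microsteps is 16; rank 0 warms up
- # for (4 - 0 - 1) * 2 + (2 - 1) * 4 = 10 microsteps and then runs 6 1F1B micro-pairs,
- # while rank 3 warms up for only 4 microsteps and enters 1F1B sooner.
- def warmup_steps(pp_size, pp_rank, num_chunks, num_microbatches):
-     """Standalone mirror of the warm-up computation in _forward_backward_step."""
-     num_microsteps = num_microbatches * num_chunks
-     if num_microbatches == pp_size:
-         return num_microsteps
-     steps = (pp_size - pp_rank - 1) * 2 + (num_chunks - 1) * pp_size
-     return min(steps, num_microsteps)
-
- # warmup_steps(4, 0, 2, 8) -> 10, warmup_steps(4, 3, 2, 8) -> 4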
- @llm_timeout(func_name="interleaved_forward_backward_step")
- def forward_backward_step(self, engine, data_iter, forward_only=False, return_loss=True, return_output_label=True):
- """Run interleaved 1F1B schedule (model split into model chunks), with
- communication between pipeline stages as needed.
-
- Args:
- engine (colossalai.engine.Engine): Colossalai engine for training and inference.
- data_iter (Iterable): Dataloader as the form of an iterator, obtained by calling iter(dataloader).
- forward_only (bool, optional):
- Whether to run the forward step only. Default is False. If True, no backward pass will be run.
- return_loss (bool, optional): Whether to return the loss value. Default is True.
- return_output_label (bool, optional): If False, the output and label won't be returned.
-
- Returns:
- Tuple[:class:`torch.Tensor`]: A tuple of (output, label, loss), loss and label could be None.
- The loss would be returned only in the last stage.
- """
- assert (
- forward_only or return_loss
- ), "The argument 'return_loss' has to be True when 'forward_only' is False, but got False."
-
- gpc.set_virtual_pipeline_parallel_rank(0)
-
- self.load_batch(engine, data_iter)
-
- if return_loss and gpc.is_pipeline_last_stage(ignore_virtual=True):
- self._accum_loss = torch.zeros(1, device=get_current_device())
- if return_output_label:
- self._return_tensors = []
-
- if forward_only:
- self._forward_only_step(engine)
- else:
- self._forward_backward_step(engine)
-
- if return_output_label and len(self._return_tensors) > 0:
- output, label = pack_return_tensors(self._return_tensors)
- else:
- output, label = (None, None)
- accum_loss = self._accum_loss
-
- self._clear_state()
-
- return output, label, accum_loss
diff --git a/internlm/core/trainer.py b/internlm/core/trainer.py
deleted file mode 100644
index 6c747aa..0000000
--- a/internlm/core/trainer.py
+++ /dev/null
@@ -1,209 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-# adopted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/engine
-
-import json
-from collections import deque
-from typing import Iterable, Optional
-
-from internlm.core.engine import Engine
-from internlm.core.scheduler import (
- BaseScheduler,
- InterleavedPipelineScheduler,
- NonPipelineScheduler,
- PipelineScheduler,
-)
-
-
-class TrainState:
- """
- The TrainState class is used to record the current state of training.
-
- Args:
- config (Config): The InternLM config.
- batch_sampler (torch.utils.data.Sampler): The batch sampler kept as the anchor point for checkpoint reload.
- """
-
- def __init__(self, config, batch_sampler) -> None:
- """
- Args:
- config (Config): internlm config
- batch_sampler (torch.utils.data.Sampler): Because dataloader loading is
- asynchronous and prefetched, the batch_sampler state maintained inside the
- dataloader runs ahead of the actual training progress, so we keep a copy of
- the batch_sampler as the anchor point for checkpoint reload.
- """
- # The number of batches produced by the data iterator
- self.batch_count: int = 0
- # Used to store the number of samples consumed in the current epoch
- self.num_consumed_samples_in_epoch: int = 0
- # Total number of tokens consumed
- self.num_consumed_tokens: int = 0
- # Number of batches skipped due to inf or nan values
- self.inf_nan_skip_batches: int = 0
- # Records the number of updates, skipped batches and inf batches are not counted
- self.step_count: int = 0
-
- # Total step count
- self.total_steps: int = config.data.total_steps
-
- # resume tensorboard folder, need load from checkpoint or set manually.
- self.resume_tb_folder = config.resume_tb_folder
-
- self.tensorboard_folder = config.tensorboard_folder
-
- # learning rate
- self.lr = config.adam.lr
-
- # sampler state
- if batch_sampler:
- self.init_batch_sampler(batch_sampler)
-
- # tgs statistic
- self.tgs_statistic = {
- "sum_step": 0,
- "sum_tg": 0,
- "sum_time": 0,
- "sum_last_tg_10": 0,
- "sum_last_time_10": 0,
- "sum_last_tg_50": 0,
- "sum_last_time_50": 0,
- "SMA_tg_50": 0,
- "SMA_time_50": 0,
- "SMA_tg_50_list": deque(),
- "SMA_time_50_list": deque(),
- "sum_tgs": 0,
- "last_tgs_10": 0,
- "last_tgs_50": 0,
- }
-
- def init_batch_sampler(self, batch_sampler):
- """
- Args:
- batch_sampler (torch.utils.data.Sampler): sampler.
- """
- # make a copy of batch_sampler.
- self.batch_sampler = batch_sampler.copy()
- # Iterator for the batch sampler
- self.batch_sampler_iter = iter(self.batch_sampler)
-
- def __str__(self) -> str:
- """Returns a string representation of the training state in JSON format."""
- info = {
- "batch_count": self.batch_count,
- "num_consumed_samples_in_epoch": self.num_consumed_samples_in_epoch,
- "num_consumed_tokens": self.num_consumed_tokens,
- "inf_nan_skip_batches": self.inf_nan_skip_batches,
- "step_count": self.step_count,
- }
-
- return json.dumps(info, indent=4, sort_keys=True)
-
- def load_state_dict(self, other_stuffs):
- """
- Resumes training from a checkpoint.
-
- Args:
- other_stuffs (dict): Other information needed to resume training.
- """
- self.num_consumed_samples_in_epoch = other_stuffs["num_consumed_samples_in_epoch"]
- self.num_consumed_tokens = other_stuffs["num_consumed_tokens"]
- self.inf_nan_skip_batches = other_stuffs["inf_nan_skip_batches"]
-
- # Because the checkpoint is saved after 'step_count' has been updated,
- # there is no need to increment 'step_count' here.
- # However, 'batch_count' is updated before the checkpoint is stored,
- # so it needs to be incremented by one on resume.
- self.batch_count = other_stuffs["batch_count"] + 1 # resume from the next batch
- self.step_count = other_stuffs.get("step_count", self.batch_count)
-
- # resume tensorboard from older tensorboard_folder
- self.resume_tb_folder = other_stuffs.get("tensorboard_folder", None)
-
- def state_dict(self):
- return {
- "batch_count": self.batch_count,
- "num_consumed_samples_in_epoch": self.num_consumed_samples_in_epoch,
- "num_consumed_tokens": self.num_consumed_tokens,
- "inf_nan_skip_batches": self.inf_nan_skip_batches,
- "step_count": self.step_count,
- "tensorboard_folder": self.tensorboard_folder,
- }
-
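- # Sketch of how TrainState round-trips through a checkpoint. The config fields used here
- # (data.total_steps, adam.lr, resume_tb_folder, tensorboard_folder) follow the constructor
- # above; the SimpleNamespace config and the concrete numbers are illustrative only.
- from types import SimpleNamespace
-
- cfg = SimpleNamespace(
-     data=SimpleNamespace(total_steps=100),
-     adam=SimpleNamespace(lr=1e-4),
-     resume_tb_folder=None,
-     tensorboard_folder="tb_logs",
- )
- state = TrainState(cfg, batch_sampler=None)
- state.batch_count, state.step_count = 9, 9
- saved = state.state_dict()  # what the checkpoint stores
-
- resumed = TrainState(cfg, batch_sampler=None)
- resumed.load_state_dict(saved)
- assert resumed.batch_count == 10  # batch_count is advanced by one on resume
- assert resumed.step_count == 9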
-
-class Trainer:
- """A convenience class that lets users run training and evaluation without having to
- write their own driver scripts.
-
- Args:
- engine (:class:`Engine`): Engine responsible for the process function.
- schedule (:class:`BaseScheduler`, optional): Runtime schedule. Defaults to None.
- """
-
- def __init__(
- self,
- engine: Engine,
- schedule: Optional[BaseScheduler] = None,
- ):
- """Initializes the Trainer class.
-
- Args:
- engine (Engine): The engine responsible for the process function.
- schedule (Optional[BaseScheduler], optional): The runtime schedule. Defaults to None.
- """
- self._engine = engine
-
- # build schedule
- if schedule is None:
- self._schedule = NonPipelineScheduler()
- else:
- assert isinstance(
- schedule, BaseScheduler
- ), f"expected schedule to be of type BaseScheduler, but got {type(schedule)}"
- self._schedule = schedule
-
- self._schedule.pre_processing(self._engine)
-
- @property
- def engine(self):
- """Returns the engine responsible for managing the training and evaluation process."""
- return self._engine
-
- @property
- def schedule(self):
- """Returns the runtime scheduler."""
- return self._schedule
-
- @property
- def uses_pipeline(self):
- """Returns whether the pipeline parallel is used or not."""
- return isinstance(self._schedule, (PipelineScheduler, InterleavedPipelineScheduler))
-
- def train(self):
- """Sets the model to training mode."""
- self._engine.train()
-
- def eval(self):
- """Sets the model to evaluation mode."""
- self._engine.eval()
-
- def zero_grad(self):
- """Sets the gradient of all parameters in the model to zero."""
- self._engine.zero_grad()
-
- def step(self):
- """Executes the parameter update step."""
- return self._engine.step()
-
- def execute_schedule(self, data_iter: Iterable, **kwargs):
- """Runs the forward, loss computation, and backward for the model.
- Returns a tuple of (output, label, loss).
-
- Args:
- data_iter (Iterable): The data iterator.
- **kwargs: Additional keyword arguments.
-
- Returns:
- Tuple[:class:`torch.Tensor`]: A tuple of (output, label, loss).
- """
- output, label, loss = self._schedule.forward_backward_step(self._engine, data_iter, **kwargs)
- return output, label, loss
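- # Illustrative outer loop for the Trainer above. The engine, scheduler, dataloader and
- # total_steps are assumed to have been built elsewhere (e.g. via initialize_trainer);
- # the names used here are placeholders.
- trainer = Trainer(engine, schedule=scheduler)
- trainer.train()
- data_iter = iter(train_dataloader)
- for _ in range(total_steps):
-     trainer.zero_grad()
-     _, _, loss = trainer.execute_schedule(data_iter, forward_only=False, return_loss=True)
-     trainer.step()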
diff --git a/internlm/data/__init__.py b/internlm/data/__init__.py
deleted file mode 100644
index 23eb3ab..0000000
--- a/internlm/data/__init__.py
+++ /dev/null
@@ -1,13 +0,0 @@
-from .batch_sampler import get_dpsampler_dataloader
-from .collaters import jsonl_ds_collate_fn, packed_collate_fn
-from .dummy_dataset import RandomDataset
-from .packed_dataset import PackedDataset, PackedDatasetWithoutCuSeqlen
-
-__all__ = [
- "jsonl_ds_collate_fn",
- "packed_collate_fn",
- "RandomDataset",
- "PackedDataset",
- "PackedDatasetWithoutCuSeqlen",
- "get_dpsampler_dataloader",
-]
diff --git a/internlm/data/batch_sampler.py b/internlm/data/batch_sampler.py
deleted file mode 100644
index 16fd6fc..0000000
--- a/internlm/data/batch_sampler.py
+++ /dev/null
@@ -1,354 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import math
-import random
-from typing import Iterator, TypeVar
-
-import numpy as np
-import torch
-from torch.utils.data import DataLoader, Dataset, Sampler
-
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
-from internlm.utils.logger import get_logger
-
-logger = get_logger(__file__)
-
-T_co = TypeVar("T_co", covariant=True)
-
-
-class DataParallelSampler(Sampler):
- """A data sampler for distributed data parallelism.
-
- Args:
- dataset (:class:`torch.utils.data.Dataset`): The Dataset for sampling.
- shuffle (bool, optional): Whether to shuffle data, defaults to False.
- seed (int, optional): The random seed used for sampling, defaults to 0.
- drop_last (bool, optional): Set to True to drop the last incomplete batch, if the dataset size
- is not divisible by the batch size. If False and the size of dataset is not divisible by
- the batch size, then the last batch will be smaller, defaults to False.
- """
-
- def __init__(
- self,
- dataset: Dataset,
- shuffle: bool = False,
- seed: int = 0,
- drop_last: bool = False,
- ) -> None:
- self.dataset = dataset
- self.num_replicas = gpc.get_world_size(ParallelMode.DATA)
- self.rank = gpc.get_local_rank(ParallelMode.DATA)
- self.epoch = 0
- self.drop_last = drop_last
- # If the dataset length is evenly divisible by # of replicas, then there
- # is no need to drop any data, since the dataset will be split equally.
- # type: ignore[arg-type]
- if self.drop_last and len(self.dataset) % self.num_replicas != 0:
- # Split to nearest available length that is evenly divisible.
- # This is to ensure each rank receives the same amount of data when
- # using this Sampler.
- self.num_samples = math.ceil(
- # `type:ignore` is required because Dataset cannot provide a default __len__
- # see NOTE in pytorch/torch/utils/data/sampler.py
- (len(self.dataset) - self.num_replicas)
- / self.num_replicas # type: ignore[arg-type]
- )
- else:
- self.num_samples = math.ceil(len(self.dataset) / self.num_replicas) # type: ignore[arg-type]
- self.total_size = self.num_samples * self.num_replicas
- self.shuffle = shuffle
- self.seed = seed
-
- def __iter__(self) -> Iterator[T_co]:
- if self.shuffle:
- # deterministically shuffle based on epoch and seed
- g = torch.Generator()
- g.manual_seed(self.seed + self.epoch)
- # type: ignore[arg-type]
- indices = torch.randperm(len(self.dataset), generator=g).tolist()
-
- # update for next epoch so that there is no need to call
- # set_epoch manually
- self.epoch += 1
- else:
- indices = list(range(len(self.dataset))) # type: ignore[arg-type]
-
- if not self.drop_last:
- # add extra samples to make it evenly divisible
- padding_size = self.total_size - len(indices)
- if padding_size <= len(indices):
- indices += indices[:padding_size]
- else:
- indices += (indices * math.ceil(padding_size / len(indices)))[:padding_size]
- else:
- # remove tail of data to make it evenly divisible.
- indices = indices[: self.total_size]
- assert len(indices) == self.total_size
-
- # subsample
- indices = indices[self.rank : self.total_size : self.num_replicas]
- assert len(indices) == self.num_samples
-
- return iter(indices)
-
- def __len__(self) -> int:
- return self.num_samples
-
- def set_epoch(self, epoch: int) -> None:
- r"""Sets the epoch for this sampler. When :attr:`shuffle=True`, this ensures all replicas
- use a different random ordering for each epoch. Otherwise, the next iteration of this
- sampler will yield the same ordering.
-
- Args:
- epoch (int): Epoch number.
- """
- self.epoch = epoch
-
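- # Tiny worked example of the rank subsampling above (pure arithmetic, no gpc needed):
- # with 10 samples, 4 replicas and drop_last=False, the index list is padded to
- # total_size = ceil(10 / 4) * 4 = 12 and replica r keeps indices[r::4], 3 samples each.
- import math
-
- indices = list(range(10))
- num_replicas, rank = 4, 1
- num_samples = math.ceil(len(indices) / num_replicas)  # 3
- total_size = num_samples * num_replicas               # 12
- indices += indices[: total_size - len(indices)]       # pad with the head: [0..9, 0, 1]
- shard = indices[rank:total_size:num_replicas]         # [1, 5, 9]
- assert len(shard) == num_samples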
-
-def get_dpsampler_dataloader(
- dataset,
- shuffle=False,
- seed=1024,
- add_sampler=True,
- drop_last=False,
- pin_memory=False,
- num_workers=0,
- **kwargs,
-):
- r"""Set up a deterministic dataloader (also configures seeded workers, samplers, and shuffling).
-
- Note:
- When pipeline parallel is enabled, shuffle cannot be True as it will result in mismatch between input data
- on the 1st stage and label on the last stage.
-
- Args:
- dataset (:class:`torch.utils.data.Dataset`): The dataset to be loaded.
- shuffle (bool, optional): Whether to shuffle the dataset. Defaults to False.
- seed (int, optional): Random worker seed for sampling, defaults to 1024.
- add_sampler (bool, optional): Whether to add a ``DataParallelSampler`` for distributed data parallelism. Defaults to True.
- drop_last (bool, optional): Set to True to drop the last incomplete batch, if the dataset size
- is not divisible by the batch size. If False and the size of dataset is not divisible by
- the batch size, then the last batch will be smaller, defaults to False.
- pin_memory (bool, optional): Whether to pin memory address in CPU memory. Defaults to False.
- num_workers (int, optional): Number of worker threads for this dataloader. Defaults to 0.
- kwargs (dict): optional parameters for ``torch.utils.data.DataLoader``, more details could be found in
- `DataLoader `_.
-
- Returns:
- :class:`torch.utils.data.DataLoader`: A DataLoader used for training or testing.
- """
- _kwargs = kwargs.copy()
-
- if add_sampler and gpc.is_initialized(ParallelMode.DATA) and gpc.get_world_size(ParallelMode.DATA) > 1:
- sampler = DataParallelSampler(dataset, shuffle=shuffle, drop_last=drop_last)
- else:
- sampler = None
-
- # Deterministic dataloader
- def seed_worker():
- worker_seed = seed
- np.random.seed(worker_seed)
- torch.manual_seed(worker_seed)
- random.seed(worker_seed)
-
- if sampler is None:
- return DataLoader(
- dataset,
- worker_init_fn=seed_worker,
- shuffle=shuffle,
- drop_last=drop_last,
- pin_memory=pin_memory,
- num_workers=num_workers,
- **_kwargs,
- )
- else:
- return DataLoader(
- dataset,
- sampler=sampler,
- worker_init_fn=seed_worker,
- drop_last=drop_last,
- pin_memory=pin_memory,
- num_workers=num_workers,
- **_kwargs,
- )
-
-
-class StaticBatchSampler:
- """
- A static batch sampler that generates batches with a fixed micro-batch size.
-
- Args:
- datasets (list): The datasets to draw samples from.
- batch_size (int): The target batch size for the current rank. Defaults to 192.
- rampup_batch_size (str): A string with three space-separated integers: the starting
- batch size, the increment, and the number of steps between each
- increment. For example, "192 24 8" means that the batch size
- starts at 192 and increases by 24 every 8 steps. Defaults to
- "6 2 8", i.e. a batch size of 6 for the first 8 steps, then +2 every 8 steps.
- micro_bsz (int): The micro-batch size. Defaults to 2.
- seed (int): The random seed for shuffling the indices. Defaults to 0.
- drop_last (bool): If True, drop the last incomplete batch. Currently only supports True. Defaults to True.
- data_rank (int): The rank of the current process in the data parallel group. Defaults to 0.
- data_world_size (int): The number of processes in the data parallel group. Defaults to 1.
- """
-
- def __init__(
- self,
- datasets,
- batch_size=192,
- rampup_batch_size="6 2 8",
- micro_bsz=2,
- seed=0,
- drop_last=True,
- data_rank=0,
- data_world_size=1,
- ):
- assert drop_last is True, "Currently only support drop last"
- if rampup_batch_size:
- # Ramp the batch size up to batch_size over the course of training
- start_bsz, bsz_incre, incre_every = map(int, rampup_batch_size.split())
- else:
- start_bsz, bsz_incre, incre_every = batch_size, batch_size, 1
- self.raw_rampup_batch_size = rampup_batch_size
- self.start_bsz = start_bsz
- self.bsz_incre = bsz_incre
- self.incre_every = incre_every
- if gpc.is_initialized(ParallelMode.PIPELINE):
- assert (
- batch_size - self.start_bsz
- ) % self.bsz_incre == 0, f"{batch_size} - {self.start_bsz} should be multiple of {self.bsz_incre}"
- assert batch_size % micro_bsz == 0, f"batch_size({batch_size}) should be multiple of micro_bsz({micro_bsz})"
- assert (
- self.start_bsz % micro_bsz == 0
- ), f"start_bsz({self.start_bsz}) should be multiple of micro_bsz({micro_bsz})"
- assert (
- self.bsz_incre % micro_bsz == 0
- ), f"bsz_incre({self.bsz_incre}) should be multiple of micro_bsz({micro_bsz})"
-
- self.batch_size = batch_size
- self.epoch = 0
- self.seed = seed
- self.rng = np.random.RandomState(seed)
- self.batch_count = 0
- self.micro_bsz = micro_bsz
- self.data_rank = data_rank
- self.data_world_size = data_world_size
- self.num_consumed_samples_in_epoch = 0
- self.datasets = datasets
- self.num_samples = sum([len(ds) for ds in datasets])
-
- self.get_indices() # get data
-
- def get_indices(self, old_indices=None):
- if old_indices is not None:
- assert (
- len(old_indices) <= self.num_samples
- ), f"The checkpoint has {len(old_indices)} samples, \
-while the restarted run uses fewer samples ({self.num_samples})"
-
- else:
- old_indices = np.array([])
-
- # indices runs from len(old_indices) (inclusive) up to self.num_samples (exclusive)
- indices = np.arange(len(old_indices), self.num_samples)
- self.rng_state = self.rng.get_state()
- self.rng.shuffle(indices)
- # Need to consider drop_last
- ramp_steps = (self.batch_size - self.start_bsz) // self.bsz_incre
- if self.batch_count < ramp_steps * self.incre_every:
- rampup_samples = 0
- for i in range(ramp_steps):
- rampup_samples += (i * self.bsz_incre + self.start_bsz) * self.incre_every
- assert (
- rampup_samples * self.data_world_size <= self.num_samples
- ), f"Too many rampup samples: \
-{rampup_samples*self.data_world_size} vs. self.num_samples: {self.num_samples}"
-
- num_samples = (self.num_samples - rampup_samples * self.data_world_size) // (
- self.batch_size * self.data_world_size
- )
- num_samples = num_samples * self.batch_size * self.data_world_size + rampup_samples * self.data_world_size
- else:
- num_samples = self.num_samples // (self.batch_size * self.data_world_size)
- num_samples = num_samples * self.batch_size * self.data_world_size
- indices = np.concatenate([old_indices, indices]).astype(int) # splice the new indices after those already consumed
- indices = indices[:num_samples]
- self.indices = indices
- assert len(self.indices) >= self.batch_size, "The number of samples should be larger than batch_size"
- self.num_consumed_samples_in_epoch = 0
-
- def set_epoch(self, epoch):
- self.epoch = epoch
- self.rng = np.random.RandomState(self.seed + self.epoch)
-
- def __len__(self):
- ramp_steps = (self.batch_size - self.start_bsz) // self.bsz_incre
- if self.batch_count < ramp_steps * self.incre_every:
- rampup_samples = 0
- for i in range(ramp_steps):
- rampup_samples += (i * self.bsz_incre + self.start_bsz) * self.incre_every
- assert (
- rampup_samples * self.data_world_size <= self.num_samples
- ), f"Too many rampup samples: {rampup_samples*self.data_world_size} \
-vs. self.num_samples: {self.num_samples}"
-
- num_batches = (self.num_samples - rampup_samples * self.data_world_size) // self.batch_size
- num_batches = num_batches // self.data_world_size + self.incre_every * ramp_steps
- else:
- num_batches = self.num_samples // self.batch_size // self.data_world_size
-
- return num_batches
-
- def __iter__(self):
- indices = self.indices[self.data_rank :: self.data_world_size]
- while self.num_consumed_samples_in_epoch < len(indices):
- batch_rampup_idx = self.batch_count // self.incre_every
- cur_batch_size = batch_rampup_idx * self.bsz_incre + self.start_bsz
- cur_batch_size = min(cur_batch_size, self.batch_size)
- batch = indices[self.num_consumed_samples_in_epoch : self.num_consumed_samples_in_epoch + cur_batch_size]
- yield batch
- self.num_consumed_samples_in_epoch += len(batch) # Consider multiple processes.
- self.batch_count += 1
- self.get_indices() # get a new round
-
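- # The effective batch size in __iter__ follows a simple staircase; a standalone version
- # of that formula (same semantics as the loop above, purely illustrative):
- def rampup_batch_size_at(batch_count, start_bsz, bsz_incre, incre_every, target_bsz):
-     """Batch size used at the given batch_count under the ramp-up schedule."""
-     return min(start_bsz + (batch_count // incre_every) * bsz_incre, target_bsz)
-
- # With the default "6 2 8" schedule and target_bsz=32:
- # rampup_batch_size_at(0, 6, 2, 8, 32) -> 6, rampup_batch_size_at(8, 6, 2, 8, 32) -> 8,
- # and the full batch size of 32 is reached from batch_count 104 onwards.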
- def state_dict(self):
- states = {
- "batch_size": self.batch_size,
- "raw_rampup_batch_size": self.raw_rampup_batch_size,
- "rng_state": self.rng_state,
- "epoch": self.epoch,
- "seed": self.seed,
- "data_world_size": self.data_world_size,
- "num_consumed_samples_in_epoch": self.num_consumed_samples_in_epoch,
- "batch_count": self.batch_count, # With multiple processes a batch may be delivered more than once,
- # so this value needs to be overwritten by the externally tracked batch_count
- "indices": self.indices, # keep the same sample order when resuming from a checkpoint
- }
-
- return states
-
- def load_state_dict(self, states):
- for name in ("data_world_size", "raw_rampup_batch_size", "seed"): # 'batch_size'
- assert states[name] == getattr(self, name), (name, states[name], getattr(self, name)) # should not change
- self.rng.set_state(states["rng_state"])
- self.get_indices(old_indices=None) # Regenerate indices based on random state
- self.epoch = states["epoch"]
- self.batch_count = states["batch_count"]
- self.num_consumed_samples_in_epoch = states["num_consumed_samples_in_epoch"]
-
- def copy(self):
- copy_sampler = StaticBatchSampler(
- self.datasets,
- self.batch_size,
- self.raw_rampup_batch_size,
- self.micro_bsz,
- self.seed,
- drop_last=True,
- data_rank=self.data_rank,
- data_world_size=self.data_world_size,
- )
-
- copy_sampler.load_state_dict(self.state_dict())
- return copy_sampler
diff --git a/internlm/data/collaters.py b/internlm/data/collaters.py
deleted file mode 100644
index b327b54..0000000
--- a/internlm/data/collaters.py
+++ /dev/null
@@ -1,88 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import torch
-
-
-def packed_collate_fn(batch, packed_length):
-
- """
- Collate function for packed input sequences.
-
- Args:
- batch (List[Dict]): List of dictionaries representing each sample in batch.
- Each dictionary contains "tokens", "labels", "type_ids", "cu_seqlens", and "indexes" keys.
- packed_length (int): The length of packed sequence.
-
- Returns:
- Tuple[Dict[str, torch.Tensor], torch.Tensor]: A tuple containing a dictionary of tensors with "input_ids",
- "cu_seqlens", "indexes", and "type_ids" keys, and the tensor of padded "labels".
-
- Raises:
- AssertionError: If the length of a sample is not equal to packed_length.
- AssertionError: If the shape of the padded "input_ids" tensor does not have the correct shape.
- """
-
- xs, ys, cu_seqlens, indexes, ts = [], [], [], [], []
- for b in batch:
- assert (
- len(b["tokens"]) == packed_length
- ), f"length of a sample should be equal to packed_length, but got {len(b['tokens'])} and {packed_length}"
- assert (
- len(b["labels"]) == packed_length
- ), f"length of a sample should be equal to packed_length, but got {len(b['labels'])} and {packed_length}"
- assert (
- len(b["type_ids"]) == packed_length
- ), f"length of a sample should be equal to packed_length, but got {len(b['type_ids'])} and {packed_length}"
-
- tokens = [abs(w) for w in b["tokens"]]
- labels = [w if w > 0 else -100 for w in b["labels"]]
-
- xs.append(torch.LongTensor(tokens))
- # The labels have already been shifted, so each label aligns with the model output for the corresponding token
- ys.append(torch.LongTensor(labels))
- ts.append(torch.LongTensor(b["type_ids"]))
- cu_seqlens.append(torch.IntTensor(b["cu_seqlens"]))
- indexes.append(torch.LongTensor(b["indexes"]))
-
- xs = torch.nn.utils.rnn.pad_sequence(xs, batch_first=True)
- ys = torch.nn.utils.rnn.pad_sequence(ys, batch_first=True, padding_value=-100)
- ts = torch.nn.utils.rnn.pad_sequence(ts, batch_first=True, padding_value=0)
- indexes = torch.stack(indexes, dim=0)
- if len(set(map(len, cu_seqlens))) == 1: # if all have uniform length, stack them to save device transfer time
- cu_seqlens = torch.stack(cu_seqlens, dim=0)
-
- assert xs.shape[1] == packed_length, (xs.shape[1], packed_length)
-
- return {"input_ids": xs, "cu_seqlens": cu_seqlens, "indexes": indexes, "type_ids": ts}, ys
-
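- # Quick illustration of the sign convention handled above (and again in jsonl_ds_collate_fn
- # below): negative token values are kept as model inputs via abs() but masked out of the
- # loss with -100. The numbers are made up.
- raw = [12, -7, 33]                              # -7 marks a position to ignore in the loss
- tokens = [abs(w) for w in raw]                  # [12, 7, 33] is fed to the model
- labels = [w if w > 0 else -100 for w in raw]    # [12, -100, 33]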
-
-def jsonl_ds_collate_fn(batch, max_length_per_sample):
- """
- Collate function for json dataset.
-
- Args:
- batch (List[Dict]): List of dictionaries representing each sample in batch.
- Each dictionary contains "tokens".
- max_length_per_sample (int): The length of output sequence.
-
- Returns:
- Tuple[Dict[str, torch.Tensor], torch.Tensor]: A tuple containing a dictionary of tensors with "input_ids",
- and the tensor of padded "labels".
-
- """
- xs, ys = [], []
- for x in batch:
- x["tokens"] = x["tokens"][:max_length_per_sample]
- tokens = [abs(w) for w in x["tokens"]]
- labels = [w if w > 0 else -100 for w in x["tokens"]]
- labels = labels[1:] + [-100]
- xs.append(torch.as_tensor(tokens))
- ys.append(torch.as_tensor(labels)) # y has been shifted
- xs = torch.nn.utils.rnn.pad_sequence(xs, batch_first=True)
- ys = torch.nn.utils.rnn.pad_sequence(ys, batch_first=True, padding_value=-100)
-
- xs = torch.cat([xs, xs.new_zeros(len(xs), max_length_per_sample - len(xs[0]))], dim=-1)
- ys = torch.cat([ys, ys.new_full((len(ys), max_length_per_sample - len(ys[0])), fill_value=-100)], dim=-1)
-
- return {"input_ids": xs}, ys
diff --git a/internlm/data/dataset.py b/internlm/data/dataset.py
deleted file mode 100644
index 401e510..0000000
--- a/internlm/data/dataset.py
+++ /dev/null
@@ -1,56 +0,0 @@
-import os
-from typing import Dict
-
-from torch.utils.data import ConcatDataset
-
-from internlm.data.single_dataset import JsonlDataset
-
-
-def get_dataset_dict(folder, split="valid") -> Dict:
- """
- Return a dictionary of Datasets from a folder containing data files for validation.
-
- Args:
- folder (str): The path to the folder containing data files.
- split (str): The split of the data files to be used, default is "valid".
-
- Returns:
- A dictionary containing Datasets for each folder in the given path
- that contains data files with the specified split.
-
- Raises:
- AssertionError: If the given folder does not exist.
-
- Example:
- If the given folder is as follows,
- - data
- - zhihu
- - xxx.bin
- - valid.bin
- - baike
- - xxx.bin
- - valid.bin
-
- The returned dictionary will be,
- {
- 'zhihu': Dataset,
- 'baike': Dataset
- }
- """
-
- assert os.path.exists(folder), f"folder `{folder}` not exists"
- data_dict = {}
-
- for root, dirs, files in os.walk(folder, followlinks=True):
- dirs.sort() # Keep a deterministic order; newly added data whose folder name starts with 'z' sorts to the end
- datasets = []
- for fn in sorted(files): # Need sorted to ensure that the order is consistent
- if fn.endswith(".bin") and split in fn:
- fp = os.path.join(root, fn)
- ds = JsonlDataset(fp)
- datasets.append(ds)
- if datasets:
- ds = ConcatDataset(datasets=datasets)
- data_dict[os.path.basename(root)] = ds
-
- return data_dict
diff --git a/internlm/data/dummy_dataset.py b/internlm/data/dummy_dataset.py
deleted file mode 100644
index fb36184..0000000
--- a/internlm/data/dummy_dataset.py
+++ /dev/null
@@ -1,44 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import numpy as np
-from torch.utils.data import Dataset
-
-
-class RandomDataset(Dataset):
- """
- RandomDataset for generating random dataset.
-
- Args:
- num_samples (int): The number of samples to generate.
- max_len (int): The maximum length of each sample.
-
- """
-
- def __init__(self, num_samples=10000, max_len=1024) -> None:
- super().__init__()
- rng = np.random.RandomState(1999)
- max_num = rng.randint(1, 30, size=(num_samples,))
- rep_num = rng.randint(10, 200, size=(num_samples,))
- data = []
- lengths = []
- for n, r in zip(max_num, rep_num):
- d = list(range(n)) * r
- d = [n, r] + d
- d = d[:max_len]
- data.append(d)
- lengths.append(len(d))
- self.data = data
- self.max_len = max_len
- self.lengths = np.array(lengths, dtype=int)
-
- def __getitem__(self, index):
- d = self.data[index]
- input_ids = np.array(d, dtype=int)
- return {"tokens": list(input_ids), "type_id": 0}
-
- def get_dataset_name(self):
- return "dummy_path/dummy_lang/dummy_ds/train.bin"
-
- def __len__(self):
- return len(self.data)
diff --git a/internlm/data/packed_dataset.py b/internlm/data/packed_dataset.py
deleted file mode 100644
index c0d689f..0000000
--- a/internlm/data/packed_dataset.py
+++ /dev/null
@@ -1,421 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import itertools as it
-import operator
-import os
-from copy import deepcopy
-from typing import Dict
-
-import numpy as np
-import torch
-from torch.utils.data import ConcatDataset
-from tqdm import tqdm
-
-from internlm.core.context import global_context as gpc
-from internlm.data.single_dataset import JsonlDataset
-from internlm.data.utils import get_dataset_type_id
-from internlm.utils.logger import get_logger
-
-DEFAULT_SEED = 1024
-logger = get_logger(__file__)
-
-
-class PackedDataset(torch.utils.data.Dataset):
- """
- The class PackedDataset takes in a dataset and aggregates samples of different
- lengths together based on the packed_length.
-
- Args:
- dataset: The original dataset to pack.
- max_length_per_sample: The maximum length of each original sample. Default is 2048.
- packed_length: The length of each packed sample. Default is 4096.
- """
-
- def __init__(
- self,
- dataset,
- max_length_per_sample: int = 2048,
- packed_length: int = 4096,
- ):
- assert hasattr(dataset, "lengths")
- assert len(getattr(dataset, "lengths")) == len(
- dataset
- ), "The dataset must have lengths attribute and have the same length as the dataset"
- self.dataset = dataset
- self.max_length_per_sample = max_length_per_sample
- self.lengths = getattr(self.dataset, "lengths")
- self.packed_length = packed_length
- # Use a fixed seed so that restarts reproduce the same shuffle even if the RNG state is not restored
-
- self.seed = DEFAULT_SEED
- self.sample_indices, self.len_samples_shuffled, self.acm_len_samples = self.accu_sample_len(seed=self.seed)
- self.num_tokens = sum(self.lengths)
-
- def get_dataset_name(self):
- return self.dataset.get_dataset_name()
-
- def accu_sample_len(self, seed=None):
- """accumulative length of samples"""
- if seed is not None:
- rng = np.random.RandomState(seed)
- else:
- rng = np.random.RandomState(self.seed - 1)
-
- sample_indices = np.arange(len(self.lengths))
- rng.shuffle(sample_indices)
- len_samples_shuffled = list(map(self.lengths.__getitem__, sample_indices))
- acm_len_samples = list(it.accumulate(len_samples_shuffled, operator.add))
- return sample_indices, len_samples_shuffled, acm_len_samples
-
- def __len__(self):
- # Samples are spliced directly (as in line 405 of document_to_sequence.py in metaseq),
- # without additional handling of sos or eos tokens
- n_packs = self.num_tokens // self.packed_length
- return n_packs
-
- def cal_map(self, carriage_idx: int = 0):
- assert carriage_idx >= 0
- length_train = (carriage_idx + 1) * self.packed_length
- post_pos = np.searchsorted(self.acm_len_samples, length_train, side="left")
- return post_pos
-
- def mapping(self, pack_idx: int = 0):
- # pack_idx is zero-based
- pre_pos, pre_token_id = 0, 0
- if pack_idx > 0:
- pre_pos = self.cal_map(pack_idx - 1)
- pre_token_id = self.len_samples_shuffled[pre_pos] - (
- self.acm_len_samples[pre_pos] - (pack_idx) * self.packed_length
- )
- if pre_token_id == self.len_samples_shuffled[pre_pos]:
- pre_pos += 1
- pre_token_id = 0
-
- pos = self.cal_map(pack_idx)
- token_id = self.len_samples_shuffled[pos] - (self.acm_len_samples[pos] - (pack_idx + 1) * self.packed_length)
- return pre_pos, pre_token_id, pos, token_id
-
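- # Worked example of the index arithmetic above (illustrative lengths): with shuffled
- # sample lengths [3, 5, 4], acm_len_samples is [3, 8, 12]; for packed_length = 6,
- # pack 0 takes all of sample 0 plus tokens [0:3] of sample 1, and pack 1 takes the
- # rest of sample 1 plus all of sample 2.
- import numpy as np
-
- acm_len_samples = list(np.cumsum([3, 5, 4]))  # [3, 8, 12]
- packed_length = 6
- assert int(np.searchsorted(acm_len_samples, 1 * packed_length, side="left")) == 1  # pack 0 ends in sample 1
- assert int(np.searchsorted(acm_len_samples, 2 * packed_length, side="left")) == 2  # pack 1 ends in sample 2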
- def build_pack(self, pre_pos: int, pre_token_id: int, pos: int, token_id: int):
- pack, cu_seqlens, indexes, labels, type_ids = [], [0], [], [], []
-
- while pre_pos < pos:
- sample_idx = self.sample_indices[pre_pos]
- sample = self.dataset[sample_idx]
- chunk = sample["tokens"][pre_token_id:]
- pack.extend(chunk)
- _labels = deepcopy(chunk)
- _labels = list(_labels[1:]) + [-100]
- assert len(_labels) == len(chunk), (_labels, chunk)
- labels.extend(_labels)
- type_ids.extend([sample.get("type_id", 0)] * len(chunk))
- num_new_samples, tokens_left = divmod(len(chunk), self.max_length_per_sample)
- for _ in range(num_new_samples):
- cu_seqlens.append(cu_seqlens[-1] + self.max_length_per_sample)
- indexes.extend(list(range(self.max_length_per_sample)))
- if tokens_left > 0:
- cu_seqlens.append(cu_seqlens[-1] + tokens_left)
- indexes.extend(list(range(tokens_left)))
- pre_pos = pre_pos + 1
- pre_token_id = 0
-
- sample_idx = self.sample_indices[pos]
- sample = self.dataset[sample_idx]
- chunk = sample["tokens"][pre_token_id:token_id] # fragment of a sample
- pack.extend(chunk)
- _labels = deepcopy(chunk)
- if token_id == len(sample["tokens"]):
- _labels = list(_labels[1:]) + [-100]
- else:
- if token_id > len(sample["tokens"]):
- print(f"token_id {token_id}, len of sample {len(sample['tokens'])}")
- _labels = list(_labels[1:]) + [sample["tokens"][token_id]]
- assert len(_labels) == len(chunk), (_labels, chunk)
- labels.extend(_labels)
- type_ids.extend([sample.get("type_id", 0)] * len(chunk))
- num_new_samples, tokens_left = divmod(len(chunk), self.max_length_per_sample)
- for _ in range(num_new_samples):
- cu_seqlens.append(cu_seqlens[-1] + self.max_length_per_sample)
- indexes.extend(list(range(self.max_length_per_sample)))
- if tokens_left > 0:
- cu_seqlens.append(cu_seqlens[-1] + tokens_left)
- indexes.extend(list(range(tokens_left)))
-
- out = {"tokens": pack, "cu_seqlens": cu_seqlens, "indexes": indexes, "labels": labels, "type_ids": type_ids}
- return out
-
- def cal_pos_unpack(self, index):
- if index == 0:
- pre_pos = 0
- else:
- pre_pos = index * gpc.config.data["micro_bsz"]
-
- pos = (index + 1) * gpc.config.data["micro_bsz"]
- return pre_pos, pos
-
- def build_unpack(self, index):
-
- pre_pos, pos = self.cal_pos_unpack(index)
-
- pack, cu_seqlens, indexes, labels, type_ids = [], [0], [], [], []
-
- while pre_pos < pos and pre_pos < len(self.dataset):
- sample_idx = self.sample_indices[pre_pos]
- sample = self.dataset[sample_idx]
- length = min(len(sample["tokens"]), self.max_length_per_sample)
- chunk = sample["tokens"][0:length]
- pack.extend(chunk)
- _labels = deepcopy(chunk)
- _labels = list(_labels[1:]) + [-100]
- assert len(_labels) == len(chunk), (_labels, chunk)
- labels.extend(_labels)
- type_ids.extend([sample.get("type_id", 0)] * len(chunk))
- cu_seqlens.append(cu_seqlens[-1] + len(chunk))
- indexes.extend(list(range(length)))
- pre_pos = pre_pos + 1
-
- if cu_seqlens[-1] != self.packed_length:
- pack = pack + [0] * (self.packed_length - cu_seqlens[-1])
- labels = labels + [0] * (self.packed_length - cu_seqlens[-1])
- type_ids = type_ids + [0] * (self.packed_length - cu_seqlens[-1])
- indexes.extend(list(range(self.packed_length - cu_seqlens[-1])))
- cu_seqlens.append(self.packed_length)
-
- assert len(pack) == self.packed_length
-
- out = {"tokens": pack, "cu_seqlens": cu_seqlens, "indexes": indexes, "labels": labels, "type_ids": type_ids}
- return out
-
- def __getitem__(self, item: int) -> Dict:
- """Given the index, it returns a dict as
- {
- 'tokens': List[int],
- 'cu_seqlens': List[int],
- 'indexes': List[int], # position ids within each packed sub-sequence
- 'labels': List[int], # already shifted to align with 'tokens'; -100 marks positions excluded from the loss
- 'type_ids': List[int], # dataset type id for each token
- }
- """
-
- if gpc.config.model.use_flash_attn:
- pos_before, token_id_before, pos_after, token_id_after = self.mapping(item)
- return self.build_pack(pos_before, token_id_before, pos_after, token_id_after)
-
- return self.build_unpack(item)
-
-
-class PackedDatasetWithoutCuSeqlen(torch.utils.data.Dataset):
- """
- A dataset wrapper that aggregates samples with different lengths based on packed_length.
- If a sample is shorter than max_length_per_sample, it will be merged with other samples.
- For example, given a dataset whose samples are:
- [1, 2, 3, 4, 5]
- [6, 7]
- [8, 9, 10, 11]
- [12, ..., 100]
- ...
-
- Args:
- dataset: The original dataset to be wrapped.
- max_length_per_sample (int): The maximum length allowed for each sample.
- packed_length (int): The desired length for each packed sample.
- """
-
- def __init__(
- self,
- dataset,
- max_length_per_sample: int = 2048,
- packed_length: int = 4096,
- debug=False,
- ):
- assert packed_length % max_length_per_sample == 0
- assert hasattr(dataset, "lengths")
- assert len(getattr(dataset, "lengths")) == len(
- dataset
- ), "The dataset must have lengths attribute and have the same length as the dataset"
- self.dataset = dataset
- self.max_length_per_sample = max_length_per_sample
- self.lengths = getattr(self.dataset, "lengths")
- self.bsz = packed_length // max_length_per_sample
- self.packed_length = packed_length
- self.debug = debug
- # Use a fixed seed so that restarts reproduce the same shuffle even if the RNG state is not restored
-
- self.seed = DEFAULT_SEED
- indices = np.arange(len(self.lengths))
- rng = np.random.RandomState(self.seed)
- rng.shuffle(indices)
- self.indices = indices
- self.cum_lens = np.cumsum(self.lengths[self.indices])
- self.num_tokens = sum(self.lengths)
-
- def get_dataset_name(self):
- return self.dataset.get_dataset_name()
-
- def __len__(self):
- n_packs = self.num_tokens // self.packed_length
- return n_packs
-
- def find_offset(self, offset):
- idx = np.searchsorted(self.cum_lens, offset, side="right")
- if idx == 0:
- return idx, offset
- length = offset - self.cum_lens[idx - 1]
- return idx, length
-
- def pdebug(self, line):
- if self.debug:
- print(line, flush=True)
-
- def __getitem__(self, item: int) -> Dict:
- """Given the index, it returns a dict as
- {
- 'tokens': List[int],
- 'cu_seqlens': List[int],
- 'indexes': List[int], # position ids within each packed sub-sequence
- 'labels': List[int], # already shifted to align with 'tokens'; -100 marks positions excluded from the loss
- 'type_ids': List[int], # dataset type id for each token
- }
- """
-
- start_idx, start_length = self.find_offset(item * self.packed_length)
- end_idx, end_length = self.find_offset((item + 1) * self.packed_length)
- pack_tokens = []
- pack_labels = []
- type_ids = []
-
- self.pdebug(f"item : {item}, start_idx:{start_idx}, start_length:{start_length} ")
- self.pdebug(f"item : {item}, end_idx:{end_idx}, end_length:{end_length} ")
-
- if start_idx == end_idx:
- idx = self.indices[start_idx]
- sample = self.dataset[idx]
- self.pdebug(f"item : {item}, idx: {idx}, len : {len(sample['tokens'])}")
- tokens = sample["tokens"][start_length:end_length]
- pack_tokens.extend(tokens)
- pack_labels.extend(tokens[1:] + [-100])
- type_ids.extend([sample["type_id"]] * len(tokens))
- return {
- "tokens": pack_tokens,
- "cu_seqlens": [i * self.max_length_per_sample for i in range(self.bsz + 1)],
- "indexes": list(range(self.max_length_per_sample)) * self.bsz,
- "labels": pack_labels,
- "type_ids": type_ids,
- }
-
- idx = self.indices[start_idx]
- sample = self.dataset[idx]
- self.pdebug(f"item : {item}, idx: {idx}, len : {len(sample['tokens'])}")
- tokens = sample["tokens"][start_length:]
- pack_tokens.extend(tokens)
- pack_labels.extend(tokens[1:] + [-100])
- type_ids.extend([sample["type_id"]] * len(tokens))
-
- for i in range(start_idx + 1, end_idx):
- idx = self.indices[i]
- sample = self.dataset[idx]
- self.pdebug(f"item : {item}, idx: {idx}, len : {len(sample['tokens'])}")
- tokens = sample["tokens"]
- pack_tokens.extend(tokens)
- pack_labels.extend(tokens[1:] + [-100])
- type_ids.extend([sample.get("type_id")] * len(tokens))
-
- # corner case: the pack ends exactly on a sample boundary, so nothing is taken from sample end_idx
- if end_length == 0:
- pass
- else:
- idx = self.indices[end_idx]
- sample = self.dataset[idx]
- self.pdebug(f"item : {item}, idx: {idx}, len : {len(sample['tokens'])}")
- tokens = sample["tokens"][:end_length]
- pack_tokens.extend(tokens)
- pack_labels.extend(tokens[1:] + [-100])
- type_ids.extend([sample.get("type_id")] * len(tokens))
-
- return {
- "tokens": pack_tokens,
- "cu_seqlens": [i * self.max_length_per_sample for i in range(self.bsz + 1)],
- "indexes": list(range(self.max_length_per_sample)) * self.bsz,
- "labels": pack_labels,
- "type_ids": type_ids,
- }
-
-
-def get_packed_dataset_without_short_length(
- folder,
- max_length_per_sample=2048,
- packed_length=4096,
- show_progress=False,
- min_length=50,
- min_length_dict=None,
- pack_into_one_sample=False,
-):
- """
- Given a folder, combine all the .bin files into a single large dataset.
- And filter out short samples with length less than 'min_length'.
-
- Each .bin file is treated as a separate dataset.
-
- Args:
- folder (str): Path to the folder containing the .bin files.
- max_length_per_sample (int): Maximum length of each sample.
- packed_length (int): Length to pack samples to.
- show_progress (bool): Whether to show the progress bar.
- min_length (int): The minimum length of the sample.
- min_length_dict (dict): The minimum length of the sample for each dataset.
- The format is something like {'pile-arxiv': 50}
- pack_into_one_sample (bool): If True, wrap each dataset with PackedDatasetWithoutCuSeqlen instead of PackedDataset.
-
- Returns:
- A packed dataset containing all the data from the .bin files.
- """
-
- assert os.path.exists(folder), f"{folder} does not exist."
- datasets = []
- delete_samples = 0
-
- for root, dirs, files in os.walk(folder, followlinks=True):
- dirs.sort() # Make sure folders are traversed in a fixed order
- if gpc.is_rank_for_log():
- logger.info(f"Reading {root}...")
- num_token_in_folder = 0
-
- for fn in tqdm(sorted(files), total=len(files), leave=False, disable=not show_progress):
- if fn.endswith(".bin"):
- fp = os.path.join(root, fn)
- catch_ml_keys = []
- min_length_num = min_length
- if min_length_dict is not None:
- for k, v in min_length_dict.items():
- if k in fp:
- min_length_num = v
- catch_ml_keys.append(k)
- assert (
- len(catch_ml_keys) < 2
- ), f"The file name `{fp}` matched the following resample keys:{catch_ml_keys}"
-
- ds_type_id = get_dataset_type_id(path=fp)
- ds = JsonlDataset(fp, ds_type_id, min_length=min_length_num)
-
- if hasattr(ds, "old_length"):
- delete_samples += ds.old_length - len(ds)
- if len(ds) == 0:
- if gpc.is_rank_for_log():
- logger.info(f"None of the data in `{fp}` is longer than {min_length}")
- continue
-
- if pack_into_one_sample:
- ds = PackedDatasetWithoutCuSeqlen(ds, max_length_per_sample, packed_length)
- else:
- ds = PackedDataset(ds, max_length_per_sample, packed_length)
-
- num_token_in_folder += len(ds) * packed_length
- datasets.append(ds)
-
- dataset = ConcatDataset(datasets=datasets)
- if gpc.is_rank_for_log():
- logger.info(
- f"Found `{len(datasets)}` datasets, \
- {len(dataset)} samples, \
- deleted `{delete_samples}` samples because of short length",
- )
-
- return dataset
diff --git a/internlm/data/single_dataset.py b/internlm/data/single_dataset.py
deleted file mode 100644
index 5477d34..0000000
--- a/internlm/data/single_dataset.py
+++ /dev/null
@@ -1,117 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-"""
-A .bin file corresponds to a Dataset instance here.
-"""
-
-import json
-import mmap
-import os
-import threading
-from pathlib import Path
-
-import numpy as np
-import torch
-
-
-class JsonlDataset(torch.utils.data.Dataset):
- """
-
- JSONL format is expected to roughly follow that of The Pile.
- One-line-per-document of the form:
- ```
- {
- "tokens": List[int],
- }
- ```
-
- Note that only the "tokens" key is used.
- """
-
- def __init__(self, path: str, dataset_type_id: int = 0, min_length=50):
- self.path = path
- self.threadlocal = threading.local()
- resolved_path = Path(path).resolve()
- self.resolved_path = resolved_path
- self.meta = Path(f"{resolved_path}.meta")
- self.type_id = dataset_type_id
-
- # only build the cache on the primary worker to prevent overloading NFS
- assert os.path.exists(self.meta), f"The cache file:{self.meta} is not found for file:{self.path}"
- try:
- with open(self.meta, "rb") as f:
- meta = np.load(f)
- except Exception as e:
- print(f"Cannot load file {self.meta}...")
- raise e
- self.offsets = meta[:, 0]
- self.lengths = meta[:, -1]
-
- if min_length > 0:
- mask = self.lengths >= min_length
- self.old_lengths = self.lengths.copy()
- self.old_length = len(self.offsets)
- self.offsets = self.offsets[mask]
- self.lengths = self.lengths[mask]
-
- def __getitem__(self, idx):
- f = self._get_mmap()
- position = self.offsets[idx]
- f.seek(position)
- item = f.readline().decode("utf-8")
- try:
- item = json.loads(item)
- item["length"] = len(item["tokens"]) # add a length info
- item["type_id"] = self.type_id
- except Exception as err:
- raise json.decoder.JSONDecodeError(
- doc=self.path,
- pos=position,
- msg=(
- f"Error while loading JSONL line in file {self.path} at byte "
- f"{position}. Contents of line:\n{item}\n{err}"
- ),
- )
- return item
-
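- # Sketch of building a compatible .bin/.meta pair for this dataset. The two-column meta
- # layout (byte offset, token count per line) is inferred from how offsets/lengths are
- # read in __init__ above, so treat the exact on-disk format as an assumption.
- import json
- import numpy as np
-
- samples = [{"tokens": [1, 2, 3]}, {"tokens": [4, 5, 6, 7, 8]}]
- meta, offset = [], 0
- with open("demo.bin", "wb") as f:
-     for s in samples:
-         line = (json.dumps(s) + "\n").encode("utf-8")
-         meta.append([offset, len(s["tokens"])])  # [byte offset, token length]
-         f.write(line)
-         offset += len(line)
- with open("demo.bin.meta", "wb") as f:
-     np.save(f, np.array(meta, dtype=np.int64))
-
- ds = JsonlDataset("demo.bin", min_length=0)
- assert ds[1]["length"] == 5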
- def get_dataset_name(self):
- return str(self.resolved_path)
-
- def _get_mmap(self):
- if not hasattr(self.threadlocal, "handles"):
- with open(self.path, "rb") as f:
- mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
- self.threadlocal.handles = [f, mm]
- if self.path.endswith(".gz") or self.path.endswith(".bz") or self.path.endswith(".bz2"):
- raise NotImplementedError(
- "Compressed files are not supported because .seek() would require "
- "rereading the entire file, making performance too slow."
- )
- return self.threadlocal.handles[-1]
-
- def __setstate__(self, state):
- self.__dict__ = state
- self.threadlocal = threading.local()
-
- def __getstate__(self):
- d = {}
- for i, v in self.__dict__.items():
- if i != "threadlocal":
- d[i] = v
- return d
-
- def __del__(self):
- if hasattr(self.threadlocal, "handles"):
- # cleanup files we opened on initialization
- while self.threadlocal.handles:
- self.threadlocal.handles.pop().close()
-
- @staticmethod
- def exists(path):
- return os.path.exists(path)
-
- def __len__(self):
-        # Number of documents in this file (after the optional min_length filter).
- return len(self.offsets)
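
`JsonlDataset` relies on a sidecar `<file>.meta` array: column 0 holds each line's byte offset and the last column holds its token count. A rough sketch of how such a meta file could be produced offline (a hypothetical helper; in practice the repo's own tokenization tooling generates it):

```python
import json

import numpy as np

def build_meta(jsonl_path: str) -> None:
    """Write `<jsonl_path>.meta` with one row per line: (byte_offset, num_tokens)."""
    rows = []
    offset = 0
    with open(jsonl_path, "rb") as f:
        for line in f:
            sample = json.loads(line)          # each line is {"tokens": [...]}
            rows.append((offset, len(sample["tokens"])))
            offset += len(line)                # byte offset of the next line
    with open(f"{jsonl_path}.meta", "wb") as f:
        np.save(f, np.asarray(rows, dtype=np.int64))
```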
diff --git a/internlm/data/utils.py b/internlm/data/utils.py
deleted file mode 100644
index 724fb9f..0000000
--- a/internlm/data/utils.py
+++ /dev/null
@@ -1,46 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import torch
-
-from internlm.core.context import global_context as gpc
-
-DATASET_TYPE_IDS_MAP = {"en": 0, "cn": 1, "code": 2}
-
-
-def get_dataset_type_id(path):
- import re
-
- match_idxes = []
- for key, idx in DATASET_TYPE_IDS_MAP.items():
- if re.search(rf"/[z_]*{key}/", path):
- match_idxes.append(idx)
- assert len(match_idxes) == 1, f"{path}, match_idxes should be 1, but got {match_idxes} from {DATASET_TYPE_IDS_MAP}"
- return match_idxes[0]
-
-
-def unpack_data(input_ids, cu_seqlens):
- """
- input_ids: (n, packed_length)
- Return:
- output: (batch_size, max_length)
- """
-
- bsz = input_ids.shape[0]
-
- num_sequence = gpc.config.data["micro_bsz"]
-
- outputs = torch.zeros(bsz, num_sequence, gpc.config.data.seq_len, device=input_ids.device, dtype=input_ids.dtype)
-
- for i in range(bsz):
- output = torch.zeros(num_sequence, gpc.config.data.seq_len, device=input_ids.device, dtype=input_ids.dtype)
- cu_seqlens_slice = cu_seqlens[i]
- for j in range(num_sequence):
- seq_length = cu_seqlens_slice[j + 1] - cu_seqlens_slice[j]
-                output[j, 0:seq_length] = input_ids[i, cu_seqlens_slice[j] : cu_seqlens_slice[j + 1]]
- outputs[i] = output
-
- if bsz == 1:
- outputs = outputs.squeeze(0)
-
- return outputs
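
The reshaping performed by `unpack_data` is easiest to see on a tiny example. Below is a config-free equivalent (illustration only) that expands a pack of `packed_length` tokens back into `micro_bsz` zero-padded rows of `seq_len`:

```python
import torch

def unpack_packed_batch(input_ids: torch.Tensor, cu_seqlens: torch.Tensor,
                        micro_bsz: int, seq_len: int) -> torch.Tensor:
    """Config-free equivalent of unpack_data above, for illustration only."""
    bsz = input_ids.shape[0]
    out = torch.zeros(bsz, micro_bsz, seq_len, dtype=input_ids.dtype)
    for i in range(bsz):
        for j in range(micro_bsz):
            start, end = int(cu_seqlens[i][j]), int(cu_seqlens[i][j + 1])
            out[i, j, : end - start] = input_ids[i, start:end]
    return out.squeeze(0) if bsz == 1 else out

packed = torch.tensor([[1, 2, 3, 4, 5, 0, 0, 0]])  # one pack of length 8
cu = torch.tensor([[0, 3, 5]])                     # two samples: [1, 2, 3] and [4, 5]
print(unpack_packed_batch(packed, cu, micro_bsz=2, seq_len=4))
# tensor([[1, 2, 3, 0],
#         [4, 5, 0, 0]])
```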
diff --git a/internlm/initialize/__init__.py b/internlm/initialize/__init__.py
deleted file mode 100644
index ae94e0a..0000000
--- a/internlm/initialize/__init__.py
+++ /dev/null
@@ -1,15 +0,0 @@
-from .initialize_trainer import initialize_trainer
-from .launch import (
- get_default_parser,
- initialize_distributed_env,
- launch_from_slurm,
- launch_from_torch,
-)
-
-__all__ = [
- "get_default_parser",
- "initialize_trainer",
- "launch_from_slurm",
- "launch_from_torch",
- "initialize_distributed_env",
-]
diff --git a/internlm/initialize/initialize_tensor.py b/internlm/initialize/initialize_tensor.py
deleted file mode 100644
index b317f26..0000000
--- a/internlm/initialize/initialize_tensor.py
+++ /dev/null
@@ -1,63 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import math
-
-from torch import Tensor, nn
-
-
-def scaled_init_method_normal(sigma: float = 1.0, num_layers: int = 1):
-    """Init method based on N(0, sigma/sqrt(2*num_layers))."""
- std = sigma / math.sqrt(2.0 * num_layers)
-
- def init_(tensor):
- return nn.init.normal_(tensor, mean=0.0, std=std)
-
- return init_
-
-
-def normal_(mean: float = 0.0, std: float = 1.0):
- r"""Return the initializer filling the input Tensor with values drawn from the normal distribution
-
- .. math::
- \mathcal{N}(\text{mean}, \text{std}^2)
-
- Args:
- mean (float): the mean of the normal distribution. Defaults 0.0.
- std (float): the standard deviation of the normal distribution. Defaults 1.0.
- """
-
- def initializer(tensor: Tensor):
- return nn.init.normal_(tensor, mean, std)
-
- return initializer
-
-
-def scaled_init_method_uniform(sigma: float = 1.0, num_layers: int = 1):
- """Init method based on p(x)=Uniform(-a, a) where std(x)=sigma/sqrt(2*num_layers)."""
- std = sigma / math.sqrt(2.0 * num_layers)
-    a = math.sqrt(3.0) * std  # std of U(-a, a) is a / sqrt(3)
-
- def init_(tensor):
- return nn.init.uniform_(tensor, -a, a)
-
- return init_
-
-
-def uniform_(mean: float = 0.0, std: float = 1.0):
- r"""Return the initializer filling the input Tensor with values drawn from the uniform distribution
-
- .. math::
-        \mathcal{U}(\text{mean}-a, \text{mean}+a), \text{ where } a = \sqrt{3} \cdot \text{std}.
-
- Args:
- mean (float): the mean of the uniform distribution. Defaults 0.0.
- std (float): the standard deviation of the uniform distribution. Defaults 1.0.
- """
-
-    a = math.sqrt(3.0) * std  # std of U(mean-a, mean+a) is a / sqrt(3)
-
- def initializer(tensor: Tensor):
- return nn.init.uniform_(tensor, mean - a, mean + a)
-
- return initializer
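
A brief usage sketch of the initializers above (assuming this module is importable as `internlm.initialize.initialize_tensor`; the layer sizes are arbitrary examples):

```python
from torch import nn

from internlm.initialize.initialize_tensor import normal_, scaled_init_method_normal

# With sigma=0.02 and num_layers=32, the scaled std is 0.02 / sqrt(2 * 32) = 0.0025.
init_out_proj = scaled_init_method_normal(sigma=0.02, num_layers=32)
init_embed = normal_(std=0.02)

proj = nn.Linear(4096, 4096, bias=False)
embed = nn.Embedding(1000, 4096)
init_out_proj(proj.weight)   # in-place normal init with std 0.0025
init_embed(embed.weight)     # in-place normal init with std 0.02
print(round(proj.weight.std().item(), 4))  # roughly 0.0025
```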
diff --git a/internlm/initialize/initialize_trainer.py b/internlm/initialize/initialize_trainer.py
deleted file mode 100644
index beb4a40..0000000
--- a/internlm/initialize/initialize_trainer.py
+++ /dev/null
@@ -1,137 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-# adopted from https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/initialize
-
-from typing import Callable, Iterable, List, Optional, Tuple
-
-from torch import nn
-from torch.nn.modules.loss import _Loss
-from torch.optim.lr_scheduler import _LRScheduler
-from torch.optim.optimizer import Optimizer
-from torch.utils.data import DataLoader
-
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
-from internlm.core.engine import Engine
-from internlm.core.gradient_handler import PipelineSharedModuleGradientHandler
-from internlm.core.scheduler import (
- InterleavedPipelineScheduler,
- NonPipelineScheduler,
- PipelineScheduler,
- SchedulerHook,
-)
-from internlm.core.scheduler.pipeline_scheduler import get_tensor_shape
-from internlm.core.trainer import Trainer
-from internlm.data.utils import unpack_data
-from internlm.solver.beta2_scheduler import Beta2Scheduler
-from internlm.solver.optimizer.hybrid_zero_optim import BaseOptimizer
-from internlm.utils.common import get_current_device
-
-
-def initialize_trainer(
- model: nn.Module,
- optimizer: Optimizer,
- criterion: Optional[_Loss] = None,
- train_dataloader: Optional[Iterable] = None,
- test_dataloader: Optional[Iterable] = None,
- lr_scheduler: Optional[_LRScheduler] = None,
- beta2_scheduler: Optional[Beta2Scheduler] = None,
- scheduler_hooks: Optional[List[SchedulerHook]] = None,
-) -> Tuple[Trainer, DataLoader, DataLoader, _LRScheduler]:
- """Core function to wrap the essential training components with our functionality based on the config which is
- loaded into gpc.config.
-
- Args:
- model (:class:`torch.nn.Module` or `Callable`): Your model instance or a function to build the model.
- optimizer (:class:`BaseOptimizer`): Your optimizer for training.
- criterion (:class:`torch.nn.modules.loss._Loss`, optional): Your criterion instance.
- train_dataloader (:class:`torch.utils.data.DataLoader`, optional): Dataloader for training.
- test_dataloader (:class:`torch.utils.data.DataLoader`, optional): Dataloader for testing.
- lr_scheduler (:class:`torch.nn.lr_scheduler._LRScheduler`, optional): Your lr scheduler instance, optional.
-
- Returns:
- Tuple (trainer, train_dataloader, test_dataloader, lr_scheduler):
- A tuple of ``(trainer, train_dataloader, test_dataloader, lr_scheduler)``
-            where only ``trainer`` is guaranteed to be non-None.
- """
-
- if isinstance(model, nn.Module):
- # first sync model across dp ranks
- model.to(get_current_device())
- elif isinstance(model, Callable):
- model = model().to(get_current_device())
-
- # clip grad norm
- clip_grad_norm = gpc.config.hybrid_zero_optimizer.get("clip_grad_norm", 0.0)
-
- assert isinstance(optimizer, BaseOptimizer), "optimizer must be instance of BaseOptimizer"
-
- # gradient handler, only support PipelineSharedModuleGradientHandler now
- if gpc.is_using_pp():
- gpc.config.gradient_handler = [dict(type="PipelineSharedModuleGradientHandler")]
- gradient_handler_cfg = gpc.config.get("gradient_handler", [])
- gradient_handlers = []
- assert isinstance(gradient_handler_cfg, list), f"gradient_handler must be list but got {type(gradient_handler_cfg)}"
- for config in gradient_handler_cfg:
- if isinstance(config, dict) and config.get("type") == "PipelineSharedModuleGradientHandler":
- handler = PipelineSharedModuleGradientHandler(model=model, optimizer=optimizer)
- gradient_handlers.append(handler)
-
- # initialize scheduler for trainer
- scheduler = None
- if gpc.config.model.use_flash_attn:
- data_fn = None
- else:
- data_fn = unpack_data
- if gpc.is_using_pp():
- gpc.config.NUM_MICRO_BATCHES = gpc.config.data.micro_num
- tensor_shape = get_tensor_shape()
- use_interleaved = (
- hasattr(gpc.config, "model") and hasattr(gpc.config.model, "num_chunks") and gpc.config.model.num_chunks > 1
- )
- scatter_gather = gpc.is_initialized(ParallelMode.TENSOR)
- if use_interleaved:
- if isinstance(model, nn.Sequential):
- model = nn.ModuleList([model])
-
- communication_overlap = gpc.config.parallel["pipeline"].get("interleaved_overlap", False)
- scheduler = InterleavedPipelineScheduler(
- num_microbatches=gpc.config.NUM_MICRO_BATCHES,
- num_chunks=gpc.config.model.num_chunks,
- dtype=gpc.config.model["dtype"],
- tensor_shape=tensor_shape,
- scatter_gather_tensors=scatter_gather,
- scheduler_hooks=scheduler_hooks,
- communication_overlap=communication_overlap,
- )
- else:
- scheduler = PipelineScheduler(
- data_process_func=data_fn,
- num_microbatches=gpc.config.NUM_MICRO_BATCHES,
- dtype=gpc.config.model["dtype"],
- tensor_shape=tensor_shape,
- scatter_gather_tensors=scatter_gather,
- scheduler_hooks=scheduler_hooks,
- )
- else:
- scheduler = NonPipelineScheduler(
- data_process_func=data_fn,
- gradient_accumulation_size=gpc.config.data.gradient_accumulation,
- scheduler_hooks=scheduler_hooks,
- )
-
- # initialize engine for trainer
- engine = Engine(
- model=model,
- optimizer=optimizer,
- lr_scheduler=lr_scheduler,
- beta2_scheduler=beta2_scheduler,
- criterion=criterion,
- gradient_handlers=gradient_handlers,
- clip_grad_norm=clip_grad_norm,
- )
-
- trainer = Trainer(engine, scheduler)
-
- return trainer, train_dataloader, test_dataloader, lr_scheduler
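
The scheduler choice above reduces to two switches: whether pipeline parallelism is in use and whether the model is split into more than one chunk. A small stand-alone mirror of that branching (names only, no framework state):

```python
def select_scheduler(use_pipeline: bool, num_chunks: int) -> str:
    """Illustration only: which scheduler initialize_trainer above would construct."""
    if not use_pipeline:
        return "NonPipelineScheduler"
    return "InterleavedPipelineScheduler" if num_chunks > 1 else "PipelineScheduler"

for use_pp, chunks in [(False, 1), (True, 1), (True, 2)]:
    print(use_pp, chunks, "->", select_scheduler(use_pp, chunks))
# False 1 -> NonPipelineScheduler
# True 1 -> PipelineScheduler
# True 2 -> InterleavedPipelineScheduler
```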
diff --git a/internlm/initialize/launch.py b/internlm/initialize/launch.py
deleted file mode 100644
index 079c2cb..0000000
--- a/internlm/initialize/launch.py
+++ /dev/null
@@ -1,476 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import argparse
-import os
-from pathlib import Path
-from typing import Dict, Union
-
-import torch
-
-from internlm.core.context import Config
-from internlm.core.context import global_context as gpc
-from internlm.monitor import initialize_light_monitor
-from internlm.utils.common import get_master_node
-from internlm.utils.logger import get_logger
-from internlm.utils.timeout import llm_timeout
-
-logger = get_logger(__file__)
-
-
-def get_default_parser():
- """Reads user command line and uses an argument parser to parse the input arguments.
- Input arguments include configuration, host, port, world size, local rank, backend for torch.distributed.
-
- Returns:
- Parser: Returns the parser with the default arguments, the user may add customized arguments into this parser.
- """
- parser = argparse.ArgumentParser()
- parser.add_argument("--config", type=str, help="path to the config file")
- parser.add_argument(
- "--launcher",
- type=str,
- default="slurm",
- choices=["slurm", "torch"],
- help="launcher for launching distributed environment",
- )
- parser.add_argument("--host", type=str, help="the master address for distributed training")
- parser.add_argument("--port", type=int, default=8888, help="the master port for distributed training")
- parser.add_argument("--world_size", type=int, help="world size for distributed training")
- parser.add_argument("--rank", type=int, help="rank for the default process group")
- parser.add_argument("--local_rank", type=int, help="local rank on the node")
- parser.add_argument("--backend", type=str, default="nccl", help="backend for distributed communication")
- parser.add_argument("--seed", type=int, default=1024)
- parser.add_argument("--profiling", default=False, action="store_true", help="enable/disable profiling.")
- return parser
-
-
-def args_sanity_check():
-    assert gpc.config is not None, "config is not loaded!"
-
- # the default model type is INTERNLM
- if "model_type" not in gpc.config:
- gpc.config._add_item("model_type", "INTERNLM")
-
-    # processing the parallel config in gpc
- if "zero1" not in gpc.config.parallel:
- gpc.config.parallel._add_item("zero1", -1)
-
- if "pipeline" not in gpc.config.parallel:
- gpc.config.parallel._add_item("pipeline", 1)
-
- if "tensor" not in gpc.config.parallel:
- gpc.config.parallel._add_item("tensor", 1)
-
- # processing the data config in gpc
- data = gpc.config.data
-
- assert data.seq_len is not None, "'seq_len' must be given a value"
- assert data.micro_bsz is not None, "'micro_bsz' must be given a value"
-
- if "packed_length" in data and gpc.is_rank_for_log():
-        logger.warning("packed_length in the config will be ignored and reset to seq_len * micro_bsz.")
-
- data._add_item("packed_length", data.seq_len * data.micro_bsz)
-
- if "micro_num" not in data:
- data._add_item("micro_num", 1)
-
- data._add_item("gradient_accumulation", data.micro_num)
- if gpc.is_rank_for_log():
-        logger.info(f"gradient_accumulation size will be set to {data.micro_num}.")
-
-    # batch_size should equal micro_num and should not be used directly
- data._add_item("batch_size", data.micro_num)
-
- if "min_length" not in data:
- data._add_item("min_length", 0)
-
- if "train_folder" not in data:
- data._add_item("train_folder", None)
-
- if "valid_folder" not in data:
- data._add_item("valid_folder", None)
-
- if "valid_micro_num" not in data:
- data._add_item("valid_micro_num", data.micro_num)
-
- if "valid_every" not in data:
- data._add_item("valid_every", 0)
-
- if "empty_cache_and_diag_interval" not in data:
- data._add_item("empty_cache_and_diag_interval", 50)
-
- if "diag_outlier_ratio" not in data:
- data._add_item("diag_outlier_ratio", 1.1)
- data.diag_outlier_ratio = max(1, data.diag_outlier_ratio)
-
- if gpc.is_rank_for_log():
- logger.info("+" * 15 + " Data Info " + "+" * 15) # pylint: disable=W1201
- logger.info(f"seq_len: {data.seq_len}")
- logger.info(f"micro_num: {data.micro_num}")
- logger.info(f"micro_bsz: {data.micro_bsz}")
- logger.info(f"packed_length: {data.packed_length}")
- logger.info(f"pack_sample_into_one: {data.pack_sample_into_one}")
- logger.info(f"min_length: {data.min_length}")
- logger.info(f"valid_micro_num: {data.valid_micro_num}")
- logger.info(f"valid_every: {data.valid_every}")
-
- # processing the checkpoint config
- ckpt = gpc.config.ckpt
- if "enable_save_ckpt" not in ckpt:
- ckpt._add_item("enable_save_ckpt", True)
-
- # Saving checkpoint args.
- if ckpt.enable_save_ckpt:
-        assert "checkpoint_every" in ckpt, "If checkpoint saving is enabled, checkpoint_every must be given in config.ckpt!"
-        assert ckpt.checkpoint_every > 0
-        assert "save_ckpt_folder" in ckpt, "If checkpoint saving is enabled, save_ckpt_folder must be given in config.ckpt!"
-
- if "async_upload" not in ckpt:
-            ckpt._add_item("async_upload", False)  # async default is False.
- else:
- if ckpt.async_upload:
- assert "save_ckpt_folder" in ckpt
- if "boto3:" not in ckpt.save_ckpt_folder:
- if gpc.is_rank_for_log():
- logger.warning(
-                            "Storing checkpoints on a local file system does not support asynchronous upload; falling back to synchronous save!"
- )
- ckpt.async_upload = False
- else:
- if "async_upload_tmp_folder" not in ckpt:
- ckpt._add_item("async_upload_tmp_folder", "/dev/shm/internlm_tmp_ckpt/")
-
- if not ckpt.async_upload:
- ckpt._add_item("async_upload_tmp_folder", None)
-
- if "oss_snapshot_freq" not in ckpt:
-            ckpt._add_item("oss_snapshot_freq", float("inf"))  # if oss_snapshot_freq is not given, snapshots are disabled.
- else:
- ckpt._add_item("checkpoint_every", float("inf"))
- ckpt._add_item("oss_snapshot_freq", float("inf"))
- ckpt._add_item("save_ckpt_folder", None)
- ckpt._add_item("async_upload", False)
- ckpt._add_item("async_upload_tmp_folder", None)
- ckpt._add_item("snapshot_ckpt_folder", None)
-
- if "load_ckpt_folder" not in ckpt:
- ckpt._add_item("load_ckpt_folder", None)
-
- if "stop_file_path" not in ckpt:
- ckpt._add_item("stop_file_path", None)
-
- if "auto_resume" not in ckpt:
-        # If 'auto_resume' is not given, we set it to True so that internlm has the
-        # opportunity to auto-load the latest checkpoint.
- ckpt._add_item("auto_resume", True)
-
- if gpc.is_rank_for_log():
- logger.info("+" * 15 + " Ckpt Info " + "+" * 15) # pylint: disable=W1201
- logger.info(f"is enable save ckpt: {ckpt.enable_save_ckpt}")
- logger.info(f"save_ckpt_folder: {ckpt.save_ckpt_folder}")
- logger.info(f"checkpoint_every: {ckpt.checkpoint_every}")
-
- # tensorboard writer config
- if "enable_tb" not in gpc.config:
- gpc.config._add_item("enable_tb", True)
- if "tensorboard_folder" not in gpc.config:
- gpc.config._add_item(
- "tensorboard_folder", os.environ["tensorboard_folder"] if "tensorboard_folder" in os.environ else None
- )
- if "resume_tb_folder" not in gpc.config:
- gpc.config._add_item(
- "resume_tb_folder", os.environ["resume_tb_folder"] if "resume_tb_folder" in os.environ else None
- )
-
- if gpc.is_rank_for_log():
- logger.info(f"tensorboard_folder: {gpc.config.tensorboard_folder}")
- logger.info(f"resume_tb_folder: {gpc.config.resume_tb_folder}")
-
- # cudnn
- torch.backends.cudnn.benchmark = gpc.config.get("cudnn_benchmark", False)
- torch.backends.cudnn.deterministic = gpc.config.get("cudnn_deterministic", False)
- clip_grad_norm = gpc.config.hybrid_zero_optimizer.get("clip_grad_norm", 0.0)
-
- if gpc.is_rank_for_log():
- logger.info("+" * 15 + " Other Info " + "+" * 15) # pylint: disable=W1201
-        logger.info(f"cudnn.benchmark: {torch.backends.cudnn.benchmark}")
-        logger.info(f"cudnn.deterministic: {torch.backends.cudnn.deterministic}")
- logger.info(f"clip_grad_norm: {clip_grad_norm}")
-
- model = gpc.config.model
- if "dtype" not in model:
-        logger.warning("dtype is not set, using torch.float16 by default!")
- model._add_item("dtype", torch.float16)
- else:
- if gpc.config.model.dtype == "torch.bfloat16":
- gpc.config.model.dtype = torch.bfloat16
- elif gpc.config.model.dtype in ("torch.float16", "torch.half"):
- gpc.config.model.dtype = torch.float16
- elif gpc.config.model.dtype == "torch.float32":
- gpc.config.model.dtype = torch.float32
- elif gpc.config.model.dtype == "torch.tf32":
- torch.backends.cudnn.allow_tf32 = True
- torch.backends.cuda.matmul.allow_tf32 = True
- gpc.config.model.dtype = torch.float32
- else:
- assert gpc.config.model.dtype in [
- "torch.float16",
- "torch.half",
- "torch.bfloat16",
- "torch.float32",
- "torch.tf32",
- ]
-
- if "checkpoint" in model:
- if model.checkpoint is True:
- model.checkpoint = 1
- elif model.checkpoint is False:
- model.checkpoint = 0
- else:
- assert (
- model.checkpoint >= 0 and model.checkpoint <= 1
- ), f'model.checkpoint: "{model.checkpoint}" should >=0 and <=1'
-
- if gpc.is_rank_for_log():
- logger.info("+" * 15 + " Model Info " + "+" * 15) # pylint: disable=W1201
- logger.info(f"Model: {gpc.config.model}")
-
- logger.info("+" * 15 + " grad_scaler Info " + "+" * 15) # pylint: disable=W1201
- logger.info(f"grad_scaler: {gpc.config.grad_scaler}")
-
- logger.info("+" * 15 + " hybrid_zero_optimizer Info " + "+" * 15) # pylint: disable=W1201
- logger.info(f"hybrid_zero_optimizer: {gpc.config.hybrid_zero_optimizer}")
-
- logger.info("+" * 15 + " adam Info " + "+" * 15) # pylint: disable=W1201
- logger.info(f"adam: {gpc.config.adam}")
-
- logger.info("+" * 15 + " beta2_scheduler Info " + "+" * 15) # pylint: disable=W1201
- logger.info(f"beta2_scheduler: {gpc.config.beta2_scheduler}")
-
- # process the model config
- if "use_flash_attn" not in gpc.config.model:
- gpc.config.model._add_item("use_flash_attn", True)
-
- # process the parallel config
- if "sequence_parallel" not in gpc.config.parallel:
- gpc.config.parallel._add_item("sequence_parallel", False)
- else:
- assert not (
- gpc.config.parallel.sequence_parallel is True and gpc.config.model.use_flash_attn is False
- ), "sequence parallel does not support use_flash_attn=False"
-
- # monitoring default config
- monitor_default_config = {
- "alert_address": None, # compatible with old alert config
- "monitor": { # new monitoring config
- "alert": {"enable_feishu_alert": False, "feishu_alert_address": None, "light_monitor_address": None}
- },
- }
-
- for key, value in monitor_default_config.items():
- if key not in gpc.config:
- gpc.config._add_item(key, value)
-
- alert = gpc.config.monitor.alert
-
- if alert.enable_feishu_alert and not alert.feishu_alert_address and gpc.is_rank_for_log():
-        logger.warning("feishu alert is enabled but feishu_alert_address is not set")
-
- optim_ckpt = gpc.config.hybrid_zero_optimizer
- if "zero_overlap_communication" in optim_ckpt:
- # Compatible with the old interfaces.
- optim_ckpt._add_item("overlap_sync_grad", optim_ckpt.zero_overlap_communication)
- if "overlap_sync_grad" not in optim_ckpt:
- optim_ckpt._add_item("overlap_sync_grad", False)
- if "overlap_sync_param" not in optim_ckpt:
- optim_ckpt._add_item("overlap_sync_param", False)
- if gpc.is_rank_for_log():
- logger.info(
- f"overlap_sync_grad:{optim_ckpt.overlap_sync_grad}, overlap_sync_param:{optim_ckpt.overlap_sync_param}"
- )
-
-
-def launch(
- config: Union[str, Path, Config, Dict],
- rank: int,
- world_size: int,
- host: str,
- port: int,
- backend: str = "nccl",
- local_rank: int = None,
- seed: int = 1024,
-):
-    """This function first parses the configuration arguments, using :func:`parse_args()` in case one of the input
-    arguments is not given. It then initializes and sets up the distributed environment via global_context's functions.
-
- Args:
- config (Union[str, dict, Config]): Config file or config file path are both acceptable
- rank (int): Rank for the default process group
- world_size (int): World size of the default process group
- host (str): The master address for distributed training
-        port (int): The master port for distributed training
- backend (str, optional): Backend for ``torch.distributed``, defaults to ``nccl``
- local_rank (int, optional):
- Rank for the process on the node and is used to set the default CUDA device,
- defaults to None. If local_rank = None, the default device ordinal will be calculated automatically.
- seed (int, optional): Specified random seed for every process. Defaults to 1024.
-
- Raises:
- Exception: Raise exception when config type is wrong
- """
-
- # set config
- assert isinstance(
- config, (Config, str, Path, dict)
- ), f"expected argument config to be Config, str or Path, but got {type(config)}"
- if not isinstance(config, Config) and isinstance(config, dict):
- config = Config(config)
- if isinstance(config, (str, Path)):
- config = Config.from_file(config)
- gpc.load_config(config)
-
- # init default process group
- gpc.init_global_dist(rank, world_size, backend, host, port)
-
- # init process groups for different parallel modes from config
- gpc.init_parallel_groups()
-
- # set cuda device
- if torch.cuda.is_available():
- # if local rank is not given, calculate automatically
- gpc.set_device(local_rank)
-
- # set the number of processes running on the same node
- gpc.detect_num_processes_on_current_node()
-
- gpc.set_seed(seed)
-
- if gpc.is_rank_for_log():
- logger.info(
- f"Distributed environment is initialized, "
- f"data parallel size: {gpc.data_parallel_size}, pipeline parallel size: {gpc.pipeline_parallel_size}, "
- f"tensor parallel size: {gpc.tensor_parallel_size}",
- )
-
-
-def launch_from_slurm(
- config: Union[str, Path, Config, Dict],
- host: str,
- port: int,
- backend: str = "nccl",
- seed: int = 1024,
-):
- """A wrapper for internlm.launch for SLURM launcher by reading rank and world size from the environment variables
- set by SLURM
-
- Args:
- config (Union[str, dict, Config]): Config file or config file path are both acceptable
- host (str): The master address for distributed training
-        port (int): The master port for distributed training
- backend (str, optional): Backend for ``torch.distributed``, defaults to ``nccl``
- seed (int, optional): Specified random seed for every process. Defaults to 1024.
- """
- try:
- rank = int(os.environ["SLURM_PROCID"])
- world_size = int(os.environ["SLURM_NPROCS"])
- except KeyError as e:
- raise RuntimeError(f"Could not find {e} in the SLURM environment")
-
- launch(
- config=config,
- rank=rank,
- world_size=world_size,
- host=host,
- port=port,
- backend=backend,
- seed=seed,
- )
-
-
-def launch_from_torch(
- config: Union[str, Path, Config, Dict],
- backend: str = "nccl",
- seed: int = 1024,
-):
- """A wrapper for internlm.launch for torchrun or torch.distributed.launch by reading rank and world size
- from the environment variables set by PyTorch
-
- Args:
- config (Union[str, dict, Config]): Config file or config file path are both acceptable
- backend (str, optional): Backend for ``torch.distributed``, defaults to ``nccl``
- seed (int, optional): Specified random seed for every process. Defaults to 1024.
- """
- try:
- rank = int(os.environ["RANK"])
- local_rank = int(os.environ["LOCAL_RANK"])
- world_size = int(os.environ["WORLD_SIZE"])
- host = os.environ["MASTER_ADDR"]
- port = int(os.environ["MASTER_PORT"])
- except KeyError as e:
- raise RuntimeError(f"Could not find {e} in the torch environment")
-
- launch(
- config=config,
- local_rank=local_rank,
- rank=rank,
- world_size=world_size,
- host=host,
- port=port,
- backend=backend,
- seed=seed,
- )
-
-
-@llm_timeout(func_name="initialize_distributed_env")
-def initialize_distributed_env(
- config: str,
- launcher: str = "slurm",
- master_port: int = 8888,
- seed: int = 1024,
- args_check=True,
-):
- """
- Initialize distributed environment for distributed training.
-
- Args:
- config (str): Config file path.
- launcher (str): Launcher for launching distributed environment, can be slurm or torch. "slurm" by default.
-        master_port (int): The master port for distributed training. 8888 by default.
-        seed (int, optional): Specified random seed for every process. 1024 by default.
-        args_check (bool, optional): Whether to run args_sanity_check after launching. True by default.
-    """
-
- torch.cuda.empty_cache()
-
- if launcher == "torch":
- launch_from_torch(config=config, seed=seed)
- elif launcher == "slurm":
- launch_from_slurm(
- config=config,
- host=get_master_node(),
- port=master_port,
- seed=seed,
- )
- else:
-        assert launcher in ["slurm", "torch"], "launcher only supports slurm or torch"
-
- if args_check:
- args_sanity_check()
-
- # init light monitor client
- alert_config = gpc.config.monitor.alert
- if alert_config.enable_feishu_alert and gpc.is_rank_for_log():
- light_monitor_address = alert_config.light_monitor_address
- if light_monitor_address:
- initialize_light_monitor(light_monitor_address)
- else:
- logger.warning("monitor address is none, monitor could not be used!")
-
-
-def get_config_value(config, key, default):
-    try:
-        value = config[key]
-    except KeyError:
-        value = default
- return value
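
For reference, `launch_from_torch` reads the standard torchrun environment, while `launch_from_slurm` reads `SLURM_PROCID`/`SLURM_NPROCS` and derives the master address from `get_master_node()`. A minimal sketch of the torch path's environment contract; the `torchrun` command in the comment is an assumed example invocation, not a documented one:

```python
import os

def read_torchrun_env() -> dict:
    """The environment variables launch_from_torch consumes."""
    return {
        "rank": int(os.environ["RANK"]),
        "local_rank": int(os.environ["LOCAL_RANK"]),
        "world_size": int(os.environ["WORLD_SIZE"]),
        "host": os.environ["MASTER_ADDR"],
        "port": int(os.environ["MASTER_PORT"]),
    }

# Assumed example invocation on a single 8-GPU node:
#   torchrun --nproc_per_node=8 train.py --config ./configs/7B_sft.py --launcher torch
```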
diff --git a/internlm/initialize/legacy/__init__.py b/internlm/initialize/legacy/__init__.py
deleted file mode 100644
index e69de29..0000000
diff --git a/internlm/initialize/legacy/launch.py b/internlm/initialize/legacy/launch.py
deleted file mode 100644
index 8313654..0000000
--- a/internlm/initialize/legacy/launch.py
+++ /dev/null
@@ -1,40 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-from internlm.initialize.launch import get_config_value
-from internlm.utils.logger import get_logger
-
-logger = get_logger(__file__)
-
-
-def auto_resume_sanity_check(ckpt_config):
- load_given_ckpt = get_config_value(ckpt_config, "load_given_ckpt", None)
- if load_given_ckpt is None:
- return True # default value is True
- else:
- return not load_given_ckpt
-
-
-def ckpt_info_sanity_check(ckpt_config):
- load_ckpt_folder = get_config_value(ckpt_config, "load_ckpt_folder", None)
-
- load_model_only_folder = get_config_value(ckpt_config, "load_model_only_folder", None)
-
- if load_model_only_folder is not None:
- assert (
- load_ckpt_folder is None
-        ), "Detected 'load_ckpt_folder' and 'load_model_only_folder' set at the same time, \
-and 'load_given_ckpt' is True, so internlm will load from 'load_ckpt_folder'"
- return dict(path=load_model_only_folder, content=("model",), ckpt_type="internlm")
- else:
- load_optimizer = get_config_value(ckpt_config, "load_optimizer", True)
-
- if isinstance(load_ckpt_folder, str):
- if load_optimizer:
- return dict(path=load_ckpt_folder, content=("model", "sampler", "optimizer"), ckpt_type="internlm")
- else:
- return dict(path=load_ckpt_folder, content=("model", "sampler"), ckpt_type="internlm")
- elif load_ckpt_folder is None:
- return None
- else:
-            raise AssertionError(f"Unsupported data type: '{type(load_ckpt_folder)}' for config.ckpt arg 'load_ckpt_folder'")
diff --git a/internlm/model/__init__.py b/internlm/model/__init__.py
deleted file mode 100644
index b0fe77d..0000000
--- a/internlm/model/__init__.py
+++ /dev/null
@@ -1,21 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-from .embedding import Embedding1D, RotaryEmbedding
-from .linear import FeedForward, RewardModelLinear, ScaleColumnParallelLinear
-from .metrics import AccPerplex
-from .modeling_internlm import build_model_with_cfg
-from .multi_head_attention import MHA
-from .utils import gather_forward_split_backward
-
-__all__ = [
- "Embedding1D",
- "FeedForward",
- "RotaryEmbedding",
- "RewardModelLinear",
- "ScaleColumnParallelLinear",
- "AccPerplex",
- "MHA",
- "gather_forward_split_backward",
- "build_model_with_cfg",
-]
diff --git a/internlm/model/embedding.py b/internlm/model/embedding.py
deleted file mode 100644
index d4ae9b5..0000000
--- a/internlm/model/embedding.py
+++ /dev/null
@@ -1,232 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-from typing import Tuple
-
-import rotary_emb
-import torch
-import torch.nn.functional as F
-from einops import rearrange
-from flash_attn.layers.rotary import ApplyRotaryEmb as LegacyApplyRotaryEmb
-from flash_attn.layers.rotary import ApplyRotaryEmbQKV_ as LegacyApplyRotaryEmbQKV_
-from torch import Tensor, nn
-
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
-
-from .utils import gather_forward_split_backward, split_forward_gather_backward
-
-
-class Embedding1D(nn.Module):
- """
- 1D Embedding.
-
- Args:
- num_embeddings (int): The size of vocab.
-        embedding_dim (int): The dimension of the model.
- padding_idx (int): If specified, the entries at :attr:`padding_idx` do not contribute to the gradient;
- therefore, the embedding vector at :attr:`padding_idx` is not updated during training,
- i.e. it remains as a fixed "pad". None by default.
-        dtype (Optional[torch.dtype]): Data type. None by default.
-
- """
-
- def __init__(
- self,
- num_embeddings: int,
- embedding_dim: int,
- *args,
- padding_idx: int = None,
- dtype: torch.dtype = None,
- **kwargs,
- ):
- super().__init__()
-
- self.num_embeddings = num_embeddings
- self.embed_dim = embedding_dim
- embed_dim_per_partition = embedding_dim // gpc.tensor_parallel_size
-
- self.padding_idx = padding_idx
- self.embed_args = args
- self.embed_kwargs = kwargs
-
- self.weight = nn.Parameter(torch.empty((num_embeddings, embed_dim_per_partition), dtype=dtype))
-
- def forward(self, input_: Tensor) -> Tensor:
- output_parallel = F.embedding(input_, self.weight, self.padding_idx, *self.embed_args, **self.embed_kwargs)
-
- output = gather_forward_split_backward(output_parallel, ParallelMode.TENSOR, dim=-1)
-
- if gpc.config.parallel.sequence_parallel:
- output = split_forward_gather_backward(output, ParallelMode.TENSOR, dim=1)
-
- return output
-
-
-class ApplyRotaryEmbQKV_(torch.autograd.Function):
- """
- ApplyRotaryEmbQKV_
- """
-
- @staticmethod
- def forward(ctx, qkv, cos, sin, cos_k=None, sin_k=None):
- """
- qkv: (total, 3, nheads, headdim)
- cos, sin: (seqlen, rotary_dim / 2)
- cos_k, sin_k: (seqlen, rotary_dim / 2), optional
- rotary_dim must be <= headdim
- Apply rotary embedding *inplace* to the first rotary_dim of q and k.
- """
- _, three, _, headdim = qkv.shape
- assert three == 3
- rotary_seqlen, rotary_dim = cos.shape
- rotary_dim *= 2
- assert rotary_dim <= headdim
- cos_k = cos if cos_k is None else cos_k
- sin_k = sin if sin_k is None else sin_k
- assert sin.shape == cos_k.shape == sin_k.shape == (rotary_seqlen, rotary_dim // 2)
- q1, q2 = qkv[:, 0, :, :rotary_dim].chunk(2, dim=-1)
- rotary_emb.apply_rotary(q1, q2, rearrange(cos, "s d -> s 1 d"), rearrange(sin, "s d -> s 1 d"), q1, q2, False)
- k1, k2 = qkv[:, 1, :, :rotary_dim].chunk(2, dim=-1)
- rotary_emb.apply_rotary(
- k1, k2, rearrange(cos_k, "s d -> s 1 d"), rearrange(sin_k, "s d -> s 1 d"), k1, k2, False
- )
- ctx.save_for_backward(cos, sin, cos_k, sin_k)
- return qkv
-
- @staticmethod
- def backward(ctx, dqkv):
- cos, sin, cos_k, sin_k = ctx.saved_tensors
- rotary_dim = cos.shape[-1]
- rotary_dim *= 2
- dq1, dq2 = dqkv[:, 0, :, :rotary_dim].chunk(2, dim=-1)
- rotary_emb.apply_rotary(
- dq1, dq2, rearrange(cos, "s d -> s 1 d"), rearrange(sin, "s d -> s 1 d"), dq1, dq2, True
- )
- dk1, dk2 = dqkv[:, 1, :, :rotary_dim].chunk(2, dim=-1)
- rotary_emb.apply_rotary(
- dk1, dk2, rearrange(cos_k, "s d -> s 1 d"), rearrange(sin_k, "s d -> s 1 d"), dk1, dk2, True
- )
- return dqkv, None, None, None, None
-
-
-apply_rotary_emb_qkv_ = ApplyRotaryEmbQKV_.apply
-legacy_apply_rotary_embed_qkv = LegacyApplyRotaryEmbQKV_.apply
-legacy_apply_rotary_embed = LegacyApplyRotaryEmb.apply
-
-
-class RotaryEmbedding(torch.nn.Module):
- """
- The rotary position embeddings from RoFormer_ (Su et. al).
- A crucial insight from the method is that the query and keys are
- transformed by rotation matrices which depend on the relative positions.
-
- Other implementations are available in the Rotary Transformer repo_ and in
-    GPT-NeoX_; GPT-NeoX was an inspiration.
-
- .. _RoFormer: https://arxiv.org/abs/2104.09864
- .. _repo: https://github.com/ZhuiyiTechnology/roformer
- .. _GPT-NeoX: https://github.com/EleutherAI/gpt-neox
-
- If scale_base > 0, this implements XPos (Sun et al., https://arxiv.org/abs/2212.10554).
- A recommended value for scale_base is 512: https://github.com/HazyResearch/flash-attention/issues/96
- Reference: https://github.com/sunyt32/torchscale/blob/main/torchscale/component/xpos_relative_position.py
- """
-
- def __init__(self, dim: int, base=10000, scale_base=0, device=None):
- """ """
- super().__init__()
- # Generate and save the inverse frequency buffer (non trainable)
- self.inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, device=device, dtype=torch.float32) / dim))
- self.scale_base = scale_base
- self.scale = (
- (torch.arange(0, dim, 2, device=device, dtype=torch.float32) + 0.4 * dim) / (1.4 * dim)
- if scale_base > 0
- else None
- )
-
- self._seq_len_cached = 0
- self._cos_cached = None
- self._sin_cached = None
- self._cos_k_cached = None
- self._sin_k_cached = None
-
- def _update_cos_sin_cache(self, x, indexes):
- """x: (batch, seqlen, nheads, headdim) or (batch, seqlen, 3, nheads, headdim)"""
- if not isinstance(indexes, int):
- seqlen = indexes.max().item() + 1
- else:
- seqlen = indexes + 1 # eval_forward
- # Reset the tables if the sequence length has changed,
- # or if we're on a new device (possibly due to tracing for instance)
- if seqlen > self._seq_len_cached or self._cos_cached.device != x.device or self._cos_cached.dtype != x.dtype:
- self._seq_len_cached = seqlen
- t = torch.arange(seqlen, device=x.device, dtype=self.inv_freq.dtype)
- # Don't do einsum, it converts fp32 to fp16
- # freqs = torch.einsum("i,j->ij", t, self.inv_freq)
- freqs = torch.outer(t, self.inv_freq.to(device=t.device))
- if self.scale is None:
- self._cos_cached = torch.cos(freqs).to(x.dtype)
- self._sin_cached = torch.sin(freqs).to(x.dtype)
- else:
- power = (
- torch.arange(seqlen, dtype=self.scale.dtype, device=self.scale.device) - seqlen // 2
- ) / self.scale_base
- scale = self.scale.to(device=power.device) ** rearrange(power, "s -> s 1")
- # We want the multiplication by scale to happen in fp32
- self._cos_cached = (torch.cos(freqs) * scale).to(x.dtype)
- self._sin_cached = (torch.sin(freqs) * scale).to(x.dtype)
- self._cos_k_cached = (torch.cos(freqs) / scale).to(x.dtype)
- self._sin_k_cached = (torch.sin(freqs) / scale).to(x.dtype)
-
- def forward(self, qkv: torch.Tensor, **kwargs):
- if kwargs.get("indexes", None) is not None:
- return self._forward(qkv, kwargs.pop("indexes"))
- if kwargs.get("inference_params", None) is not None:
- return self._eval_forward(qkv, seqlen_offset=kwargs.get("inference_params", None).sequence_len_offset)
- else:
- return self._eval_forward(qkv)
-
- def _forward(self, qkv: torch.Tensor, indexes=0) -> Tuple[torch.Tensor, torch.Tensor]:
- self._update_cos_sin_cache(qkv, indexes)
- if self.scale is None:
- return apply_rotary_emb_qkv_(qkv, self._cos_cached[indexes], self._sin_cached[indexes])
- else:
- return apply_rotary_emb_qkv_(
- qkv,
- self._cos_cached[indexes],
- self._sin_cached[indexes],
- self._cos_k_cached[indexes],
- self._sin_k_cached[indexes],
- )
-
- def _eval_forward(self, qkv, seqlen_offset=0):
- """
- seqlen_offset: can be used in generation where the qkv being passed in is only the last
- token in the batch.
- """
- self._update_cos_sin_cache(qkv, seqlen_offset + qkv.shape[1])
- if self.scale is None:
- return legacy_apply_rotary_embed_qkv(
- qkv, self._cos_cached[seqlen_offset:], self._sin_cached[seqlen_offset:]
- )
- else:
- return legacy_apply_rotary_embed_qkv(
- qkv,
- self._cos_cached[seqlen_offset:],
- self._sin_cached[seqlen_offset:],
- self._cos_k_cached[seqlen_offset:],
- self._sin_k_cached[seqlen_offset:],
- )
-
- def _single_forward(self, x, indexes=0):
- assert self.scale is None
- self._update_cos_sin_cache(x, indexes)
- x = x[None, ...]
- ret = legacy_apply_rotary_embed(x, self._cos_cached[indexes], self._sin_cached[indexes]).squeeze(0)
- return ret
-
- def _single_eval_forward(self, x, seqlen_offset=0):
- assert self.scale is None
- self._update_cos_sin_cache(x, seqlen_offset + x.shape[1])
- return legacy_apply_rotary_embed(x, self._cos_cached[seqlen_offset:], self._sin_cached[seqlen_offset:])
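
The fused `rotary_emb.apply_rotary` kernel used above rotates the first `rotary_dim` channels of each head, with those channels split into two contiguous halves (`chunk(2, dim=-1)`). Below is a math-equivalent pure-PyTorch sketch of that rotation and of the cos/sin cache construction, useful for sanity-checking the cached tables; it is an illustration, not the kernel itself:

```python
import torch

def rotary_cache(seqlen: int, dim: int, base: float = 10000.0):
    """cos/sin tables as RotaryEmbedding builds them: freqs = outer(t, inv_freq)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    t = torch.arange(seqlen, dtype=torch.float32)
    freqs = torch.outer(t, inv_freq)  # (seqlen, dim / 2)
    return torch.cos(freqs), torch.sin(freqs)

def apply_rotary_half(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Pure-PyTorch stand-in for rotary_emb.apply_rotary with the half-split layout.

    x: (seqlen, nheads, headdim); only the first 2 * cos.shape[-1] channels are rotated.
    """
    ro_dim = cos.shape[-1] * 2
    x1, x2 = x[..., : ro_dim // 2], x[..., ro_dim // 2 : ro_dim]
    c, s = cos[:, None, :], sin[:, None, :]
    rotated = torch.cat([x1 * c - x2 * s, x1 * s + x2 * c], dim=-1)
    return torch.cat([rotated, x[..., ro_dim:]], dim=-1)

cos, sin = rotary_cache(seqlen=4, dim=8)
q = torch.randn(4, 2, 8)                     # (seqlen, nheads, headdim)
print(apply_rotary_half(q, cos, sin).shape)  # torch.Size([4, 2, 8])
```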
diff --git a/internlm/model/linear.py b/internlm/model/linear.py
deleted file mode 100644
index 5a3a4eb..0000000
--- a/internlm/model/linear.py
+++ /dev/null
@@ -1,201 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-from typing import Optional
-
-import torch
-import torch.nn.functional as F
-from flash_attn.ops.fused_dense import ColumnParallelLinear, RowParallelLinear
-from flash_attn.utils.distributed import all_reduce, reduce_scatter
-from torch import nn
-
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
-from internlm.model.utils import fused_dense_func_torch
-
-
-class ScaleColumnParallelLinear(nn.Linear):
- """
- ScaleColumnParallelLinear.
-
- Args:
- in_features (int): size of each input sample
- out_features (int): size of each output sample
- process_group (Optional[torch.distributed.ProcessGroup]): The group of the current device for `parallel_mode`.
- bias (bool): Whether the bias is needed for linears. True by default. But it is typically set to False
- in the config.
- sequence_parallel (bool): If sequence_parallel is True, we're doing Tensor Parallel with sequence parallelism:
- we do an all_gather of x before doing the matmul.
- If not, then the input is already gathered.
- device (Optional[Union[str, torch.device]]): The device will be used.
- dtype (Optional[torch.dtype]): The type of data.
- weight_scale (int): For training stability. 1 by default.
- """
-
- def __init__(
- self,
- in_features: int,
- out_features: int,
- process_group: Optional[torch.distributed.ProcessGroup],
- bias: bool = True,
- device: Optional[torch.device] = None,
- dtype: Optional[torch.dtype] = None,
- weight_scale: int = 1,
- ) -> None:
- world_size = torch.distributed.get_world_size(process_group)
- if out_features % world_size != 0:
- raise ValueError(f"out_features ({out_features}) must be divisible by " f"world_size ({world_size})")
- super().__init__(in_features, out_features // world_size, bias=bias, device=device, dtype=dtype)
- self.process_group = process_group
- self.weight_scale = weight_scale
-
- def forward(self, input): # pylint: disable=W0622
- # If self.sequence_parallel is True, we're doing Tensor Parallel with sequence parallelism:
- # we do an all_gather of x before doing the matmul.
- # If not, then the input is already gathered.
- if self.weight_scale != 1:
- weight = self.weight * self.weight_scale + (1 - self.weight_scale) * self.weight.detach()
- else:
- weight = self.weight
- return fused_dense_func_torch(
- input,
- weight,
- self.bias,
- process_group=self.process_group,
- sequence_parallel=gpc.config.parallel.sequence_parallel,
- )
-
-
-class RewardModelLinear(ScaleColumnParallelLinear):
- """
- RewardModelLinear.
- Args:
- in_features (int): size of each input sample
- out_features (int): size of each output sample
- process_group (Optional[torch.distributed.ProcessGroup]): The group of the current device for `parallel_mode`.
- bias (bool): Whether the bias is needed for linears. True by default. But it is typically set to False
- in the config.
- sequence_parallel (bool): If sequence_parallel is True, we're doing Tensor Parallel with sequence parallelism:
- we do an all_gather of x before doing the matmul.
- If not, then the input is already gathered.
- device (Optional[Union[str, torch.device]]): The device will be used.
- dtype (Optional[torch.dtype]): The type of data.
- weight_scale (int): For training stability. 1 by default.
- """
-
- def __init__(
- self,
- in_features: int,
- out_features: int,
- process_group: Optional[torch.distributed.ProcessGroup],
- bias: bool = True,
- device: Optional[torch.device] = None,
- dtype: Optional[torch.dtype] = None,
- weight_scale: int = 1,
- ) -> None:
- super().__init__(in_features, out_features, process_group, bias, device, dtype, weight_scale)
- torch.distributed.broadcast(self.weight, gpc.get_ranks_in_group(ParallelMode.TENSOR)[0], process_group)
- if bias:
- torch.distributed.broadcast(self.bias, gpc.get_ranks_in_group(ParallelMode.TENSOR)[0], process_group)
-
- def forward(self, input): # pylint: disable=W0622
- # If self.sequence_parallel is True, we're doing Tensor Parallel with sequence parallelism:
- # we do an all_gather of x before doing the matmul.
- # If not, then the input is already gathered.
- if self.weight_scale != 1:
- weight = self.weight * self.weight_scale + (1 - self.weight_scale) * self.weight.detach()
- else:
- weight = self.weight
- return fused_dense_func_torch(
- input,
- weight,
- self.bias,
- process_group=self.process_group,
- sequence_parallel=gpc.config.parallel.sequence_parallel,
- )
-
-
-class ColumnParallelLinearTorch(ColumnParallelLinear):
- def forward(self, x):
- # If self.sequence_parallel is True, we're doing Tensor Parallel with sequence parallelism:
- # we do an all_gather of x before doing the matmul.
- # If not, then the input is already gathered.
-
- return fused_dense_func_torch(
- x, self.weight, self.bias, process_group=self.process_group, sequence_parallel=self.sequence_parallel
- )
-
-
-class RowParallelLinearTorch(RowParallelLinear):
- def forward(self, x):
- """
- We're doing Tensor Parallel with sequence parallelism: we do the matmul and then
- a reduce_scatter of the result.
- """
- out = fused_dense_func_torch(x, self.weight, self.bias)
- reduce_fn = reduce_scatter if self.sequence_parallel else all_reduce
- return reduce_fn(out, self.process_group)
-
-
-class FeedForward(nn.Module):
- """
- FeedForward.
-
- Args:
- in_features (int): size of each input sample
- hidden_features (int): size of hidden state of FFN
- out_features (int): size of each output sample
- process_group (Optional[torch.distributed.ProcessGroup]): The group of the current device for `parallel_mode`.
- bias (bool): Whether the bias is needed for linears. True by default. But it is typically set to False
- in the config.
- device (Optional[Union[str, torch.device]]): The device will be used.
- dtype (Optional[torch.dtype]): The type of data.
- multiple_of (int): For efficient training. Reset the size of hidden feature. 256 by default.
- """
-
- def __init__(
- self,
- in_features: int,
- hidden_features: int,
- out_features: int = None,
- process_group: Optional[torch.distributed.ProcessGroup] = None,
- bias: bool = True,
- device: Optional[torch.device] = None,
- dtype: Optional[torch.dtype] = None,
- multiple_of: int = 256,
- ):
- super().__init__()
-
- hidden_features = multiple_of * ((hidden_features + multiple_of - 1) // multiple_of)
-
- self.w1 = ColumnParallelLinearTorch(
- in_features,
- hidden_features,
- process_group,
- bias,
- sequence_parallel=gpc.config.parallel.sequence_parallel,
- device=device,
- dtype=dtype,
- )
- self.w2 = ColumnParallelLinearTorch(
- in_features,
- hidden_features,
- process_group,
- bias,
- sequence_parallel=gpc.config.parallel.sequence_parallel,
- device=device,
- dtype=dtype,
- )
- self.w3 = RowParallelLinearTorch(
- hidden_features,
- out_features,
- process_group,
- bias=bias,
- sequence_parallel=gpc.config.parallel.sequence_parallel,
- device=device,
- dtype=dtype,
- )
-
- def forward(self, x):
- out = self.w3(F.silu(self.w1(x)) * self.w2(x))
- return out
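
`FeedForward` above is the SwiGLU variant, `w3(silu(w1(x)) * w2(x))`, with the hidden size rounded up to a multiple of `multiple_of` and each projection split across the tensor-parallel group. A single-GPU sketch with plain `nn.Linear` layers (the sizes are arbitrary examples):

```python
import torch
import torch.nn.functional as F
from torch import nn

class SwiGLUFeedForward(nn.Module):
    """Single-GPU sketch of the FeedForward above: out = w3(silu(w1(x)) * w2(x))."""

    def __init__(self, in_features: int, hidden_features: int, multiple_of: int = 256):
        super().__init__()
        # Round hidden_features up to a multiple of `multiple_of`, as in FeedForward.
        hidden_features = multiple_of * ((hidden_features + multiple_of - 1) // multiple_of)
        self.w1 = nn.Linear(in_features, hidden_features, bias=False)
        self.w2 = nn.Linear(in_features, hidden_features, bias=False)
        self.w3 = nn.Linear(hidden_features, in_features, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

ffn = SwiGLUFeedForward(in_features=512, hidden_features=512 * 8 // 3)
print(ffn.w1.out_features)                   # 1536: 1365 rounded up to a multiple of 256
print(ffn(torch.randn(2, 16, 512)).shape)    # torch.Size([2, 16, 512])
```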
diff --git a/internlm/model/loss.py b/internlm/model/loss.py
deleted file mode 100644
index ac92b4b..0000000
--- a/internlm/model/loss.py
+++ /dev/null
@@ -1,54 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-from flash_attn.losses.cross_entropy import CrossEntropyLoss as FlashCrossEntropyLoss
-from torch import nn
-
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
-
-
-class FlashGPTLMLoss(nn.Module):
- """
- Loss function for flash GPT Language Model.
- """
-
- def __init__(self, parallel_output=True, label_smoothing=0):
- super().__init__()
-
- if label_smoothing is not None:
- if label_smoothing != 0:
- if gpc.is_rank_for_log():
- print(f"use label_smoothing: {label_smoothing}")
- else:
- label_smoothing = 0
- self.label_smoothing = label_smoothing
-
- if parallel_output:
- self.loss_fn = FlashCrossEntropyLoss(
- reduction="mean",
- inplace_backward=True,
- process_group=gpc.get_group(ParallelMode.TENSOR),
- label_smoothing=label_smoothing,
- ) # The loss in this place is bound to the gather_output initialized by VocabParallelClassifier1D
- else:
-            # Here, gather_output is set in the model, so the full logits are available and the ordinary loss can be used
- self.loss_fn = nn.CrossEntropyLoss(reduction="mean", label_smoothing=label_smoothing)
-
- def forward(self, *args):
- if len(args) == 3:
- # residual is to match prenorm
- logits, _, labels = args
- elif len(args) == 2:
- # When using postnorm
- logits, labels = args
- else:
-            raise RuntimeError(f"The number of criterion inputs is: {len(args)}")
- shift_logits = logits.contiguous().view(-1, logits.size(-1))
- shift_labels = labels.contiguous().view(-1)
- loss = self.loss_fn(
- shift_logits, shift_labels
-        )  # There is no need to handle ignore_index explicitly here: the loss is only computed
-        # over the valid label range, and -100 always falls outside that range.
-
- return loss
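
In the non-parallel branch, the ignore-index handling comes for free: `nn.CrossEntropyLoss` skips positions labeled -100 (the padding label inside packs) when averaging. A quick check of that behavior:

```python
import torch
from torch import nn

# Positions labeled -100 are excluded from the mean automatically, because
# nn.CrossEntropyLoss uses ignore_index=-100 by default.
logits = torch.randn(6, 32)  # (tokens, vocab)
labels = torch.tensor([3, 7, -100, 1, -100, 9])
loss_fn = nn.CrossEntropyLoss(reduction="mean")
masked = labels != -100
assert torch.allclose(loss_fn(logits, labels), loss_fn(logits[masked], labels[masked]))
```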
diff --git a/internlm/model/metrics.py b/internlm/model/metrics.py
deleted file mode 100644
index 24ce592..0000000
--- a/internlm/model/metrics.py
+++ /dev/null
@@ -1,263 +0,0 @@
-from typing import List
-
-import torch
-from flash_attn.losses.cross_entropy import CrossEntropyLoss as FlashCrossEntropyLoss
-from torch_scatter import scatter
-
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
-from internlm.utils.parallel import is_no_pp_or_last_stage
-
-
-class AccPerplex:
- """
- AccPerplex module for calculating model's accuracy and perplexity metrics.
-
- Args:
- device: The GPU device.
- tp_pg: The tensor parallel process group.
- dp_pg: The data parallel process group.
- tokenizer: For calculating BPB.
- dataset_types (List[str]): Various data types that will be used in the current training process,
- such as ['en', 'cn', 'code']. The order of the List should be consistent with the type_id specified
- in the dataset. Changed parameters need to be used in conjunction with set_current_type_ids().
- """
-
- def __init__(self, device, tp_pg, dp_pg, tokenizer=None, dataset_types: List[str] = None):
- self.device = device
- self.right = torch.Tensor([0]).to(device=device)
- self.total = torch.Tensor([0]).to(device=device)
- self.total_log_probs = torch.Tensor([0]).to(device=device)
- self.tp_pg = tp_pg
- self.dp_pg = dp_pg
- self.tp_local_rank = torch.distributed.get_rank(self.tp_pg)
- self.tokenizer = tokenizer
- self.total_bytes = torch.Tensor([0]).to(device=device).view(1)
- self.batch_shift = 0
- self.type_ids = None
- if dataset_types is not None:
- self.dataset_types = dataset_types
- self.total_type_count = len(dataset_types)
- self.ds_right = torch.zeros(self.total_type_count, dtype=torch.long, device=device)
- self.ds_tokens = torch.zeros(self.total_type_count, dtype=torch.long, device=device)
-
- self.loss_with_type_id = LossWithTypeId(device, dp_pg, dataset_types)
-
- def set_current_type_ids(self, type_ids: torch.Tensor):
- self.batch_shift = 0
- self.type_ids = type_ids.cuda()
-
- def __call__(self, logits, labels):
- return self.update(logits, labels, type_ids=self.type_ids)
-
- def update(self, logits, labels, type_ids=None):
- if gpc.config.model.use_flash_attn:
- micro_bsz = labels.size(0)
- else:
- micro_bsz = 1
- if type_ids is not None:
- type_ids = type_ids[self.batch_shift * micro_bsz : (self.batch_shift + 1) * micro_bsz].view(-1)
- self.batch_shift += 1
- self.loss_with_type_id.update(logits, labels, type_ids)
-
- with torch.no_grad():
- if isinstance(logits, (list, tuple)):
- logits = logits[0]
-
- logits = logits.detach().clone()
- labels = labels.detach().clone()
-
- if self.tokenizer: # need to calculate bits per bytes
- sequences = self.tokenizer.decode_ids(labels.tolist())
- self.total_bytes += sum(map(lambda x: len(x.encode("utf-8")), sequences))
-
- shift_logits = logits.view(-1, logits.size(-1))
- shift_labels = labels.view(-1)
- # There is a shift according to the current rank, because the logits are split
- pred_shift = self.tp_local_rank * logits.shape[-1]
-
- logits_max = torch.max(shift_logits, dim=-1)[0]
- torch.distributed.all_reduce(logits_max, op=torch.distributed.ReduceOp.MAX, group=self.tp_pg)
- # Determine whether the maximum value of the current local tensor is the global maximum value
- logits_global = logits_max == torch.max(shift_logits, dim=-1)[0]
-
- corrects = torch.logical_and(
- (shift_labels == (shift_logits.argmax(dim=-1) + pred_shift)), logits_global
- ).long()
- mask = shift_labels.ne(-100).long()
- if hasattr(self, "total_type_count"):
- ds_acc = scatter(corrects, type_ids, dim=0, reduce="sum")
- token_num_type = scatter(mask, type_ids, dim=0, reduce="sum")
- if len(ds_acc) < self.total_type_count:
- ds_acc = torch.cat([ds_acc, ds_acc.new_zeros(self.total_type_count - len(ds_acc))])
- token_num_type = torch.cat(
- [token_num_type, token_num_type.new_zeros(self.total_type_count - len(token_num_type))]
- )
- self.ds_tokens += token_num_type
- sync_tensor = ds_acc
- torch.distributed.all_reduce(sync_tensor, op=torch.distributed.ReduceOp.SUM, group=self.tp_pg)
- self.ds_right += sync_tensor.view(-1)
-
- acc = corrects.sum()
- torch.distributed.all_reduce(acc, op=torch.distributed.ReduceOp.SUM, group=self.tp_pg)
-            self.right += acc  # masked_fill is not needed here because label -100 can never match a predicted index
- self.total += mask.sum()
-
- # Subtract the maximum value.
- shift_logits = shift_logits.sub(logits_max.unsqueeze(dim=-1))
-
-            # Get the partition's vocab indices
- partition_vocab_size = shift_logits.size()[-1]
- vocab_start_index = partition_vocab_size * self.tp_local_rank
- vocab_end_index = vocab_start_index + partition_vocab_size
-
- # Create a mask of valid vocab ids (1 means it needs to be masked).
- target_mask = (shift_labels < vocab_start_index) | (shift_labels >= vocab_end_index)
- masked_target = shift_labels - vocab_start_index
- masked_target[target_mask] = 0
-
- # Get predicted-logits = logits[target].
- # For Simplicity, we convert logits to a 2-D tensor with size
- # [*, partition-vocab-size] and target to a 1-D tensor of size [*].
- logits_2d = shift_logits.view(-1, partition_vocab_size)
- masked_target_1d = masked_target.view(-1)
- arange_1d = torch.arange(start=0, end=logits_2d.size()[0], device=logits_2d.device)
- predicted_logits_1d = logits_2d[arange_1d, masked_target_1d]
- predicted_logits_1d = predicted_logits_1d.clone().contiguous()
- predicted_logits = predicted_logits_1d.view_as(shift_labels) # bsz x max_len
- predicted_logits[target_mask] = 0.0
- # All reduce is needed to get the chunks from other GPUs.
- torch.distributed.all_reduce(predicted_logits, op=torch.distributed.ReduceOp.SUM, group=self.tp_pg)
-
- pred_exp_logits = torch.exp(predicted_logits)
- # Sum of exponential of logits along vocab dimension across all GPUs.
- sum_exp_logits = torch.exp(shift_logits).sum(dim=-1)
- torch.distributed.all_reduce(sum_exp_logits, op=torch.distributed.ReduceOp.SUM, group=self.tp_pg)
-
- total_log_probs = -(pred_exp_logits / sum_exp_logits).log().masked_fill(shift_labels.eq(-100), 0).sum()
- self.total_log_probs += total_log_probs
-
- def get_metric(self, reset=True):
- if is_no_pp_or_last_stage() and self.dp_pg is not None:
- torch.distributed.all_reduce(self.right, op=torch.distributed.ReduceOp.SUM, group=self.dp_pg)
- torch.distributed.all_reduce(self.total, op=torch.distributed.ReduceOp.SUM, group=self.dp_pg)
- torch.distributed.all_reduce(self.total_log_probs, op=torch.distributed.ReduceOp.SUM, group=self.dp_pg)
- if hasattr(self, "total_type_count"):
- torch.distributed.all_reduce(self.ds_right, op=torch.distributed.ReduceOp.SUM, group=self.dp_pg)
- torch.distributed.all_reduce(self.ds_tokens, op=torch.distributed.ReduceOp.SUM, group=self.dp_pg)
- if self.tokenizer:
- torch.distributed.all_reduce(self.total_bytes, op=torch.distributed.ReduceOp.SUM, group=self.dp_pg)
-
- acc = round((self.right / self.total).item(), 4)
- perplexity = round(torch.exp(self.total_log_probs / self.total).item(), 4)
- bits_per_bytes = round((self.total_log_probs / self.total_bytes).item(), 4) if self.tokenizer else 0
-
- if hasattr(self, "total_type_count"):
- ds_acc = {}
- ds_tokens = {}
- for i in range(self.total_type_count):
- ds_acc[f"acc/{self.dataset_types[i]}"] = round(
- (self.ds_right[i].float() / (self.ds_tokens[i].float() + 1e-5)).item(), 4
- )
- ds_tokens[f"tokens/{self.dataset_types[i]}"] = self.ds_tokens[i].item()
- if reset:
- self.right.fill_(0)
- self.total.fill_(0)
- self.total_log_probs.fill_(0)
- self.total_bytes.fill_(0)
- if hasattr(self, "total_type_count"):
- self.ds_right.fill_(0)
- self.ds_tokens.fill_(0)
- if self.tokenizer is not None:
- res = {"acc": acc, "perplexity": perplexity, "BPB": bits_per_bytes}
- else:
- res = {"acc": acc, "perplexity": perplexity}
- if hasattr(self, "total_type_count"):
- res.update(ds_acc)
- res.update(ds_tokens)
-
- loss_res = self.loss_with_type_id.get_metric(reset)
- res.update(loss_res)
-
- return res
-
-
-class LossWithTypeId:
- """
-    Note that the loss value computed here may differ from the loss in the main training info,
-    because the loss here is reduced across the data-parallel group.
- """
-
- def __init__(self, device, dp_pg, dataset_types: List[str] = None) -> None:
- self.device = device
- self.dp_pg = dp_pg
-
- self.loss = torch.Tensor([0.0]).to(device=device)
- self.token_num = torch.Tensor([0.0]).to(device=device)
-
- if dataset_types is not None:
- self.dataset_types = dataset_types
- self.total_type_count = len(dataset_types)
- self.ds_loss = torch.zeros(self.total_type_count, dtype=torch.float, device=device)
- self.ds_token_num = torch.zeros(self.total_type_count, dtype=torch.float, device=device)
-
- self.loss_fn = FlashCrossEntropyLoss(
- reduction="none", inplace_backward=True, process_group=gpc.get_group(ParallelMode.TENSOR)
- )
-
- def update(self, logits, labels, type_ids=None):
- with torch.no_grad():
- if isinstance(logits, (list, tuple)):
- logits = logits[0]
- logits = logits.contiguous().view(-1, logits.size(-1))
- labels = labels.contiguous().view(-1)
- loss_list = self.loss_fn(logits, labels)
-
- cond = labels != -100
- real_loss_list = loss_list[cond]
- self.loss += real_loss_list.sum()
- self.token_num += real_loss_list.numel()
-
- if hasattr(self, "total_type_count"):
- type_ids = type_ids.contiguous().view(-1).to(self.device)
- real_type_ids = type_ids[cond]
-
- loss_list_type = scatter(real_loss_list, real_type_ids, dim=0, reduce="sum")
- token_num_type = scatter(torch.ones_like(real_loss_list), real_type_ids, dim=0, reduce="sum")
-
- if len(loss_list_type) < self.total_type_count:
- loss_list_type = torch.cat(
- [loss_list_type, loss_list_type.new_zeros(self.total_type_count - len(loss_list_type))]
- )
- token_num_type = torch.cat(
- [token_num_type, token_num_type.new_zeros(self.total_type_count - len(token_num_type))]
- )
- self.ds_loss += loss_list_type
- self.ds_token_num += token_num_type
-
- def get_metric(self, reset=True):
- if is_no_pp_or_last_stage() and self.dp_pg is not None:
- torch.distributed.all_reduce(self.loss, op=torch.distributed.ReduceOp.SUM, group=self.dp_pg)
- torch.distributed.all_reduce(self.token_num, op=torch.distributed.ReduceOp.SUM, group=self.dp_pg)
- if hasattr(self, "total_type_count"):
- torch.distributed.all_reduce(self.ds_loss, op=torch.distributed.ReduceOp.SUM, group=self.dp_pg)
- torch.distributed.all_reduce(self.ds_token_num, op=torch.distributed.ReduceOp.SUM, group=self.dp_pg)
-
- loss = round((self.loss / self.token_num).item(), 4)
- res = {
- "loss_from_metric": loss,
- }
- if hasattr(self, "total_type_count"):
- ds_loss = {}
- for i in range(self.total_type_count):
- ds_loss[f"loss/{self.dataset_types[i]}"] = round((self.ds_loss[i] / self.ds_token_num[i]).item(), 4)
- res.update(ds_loss)
-
- if reset:
- self.loss.fill_(0.0)
- self.token_num.fill_(0.0)
- if hasattr(self, "total_type_count"):
- self.ds_loss.fill_(0.0)
- self.ds_token_num.fill_(0.0)
-
- return res
diff --git a/internlm/model/modeling_internlm.py b/internlm/model/modeling_internlm.py
deleted file mode 100644
index 64ff4de..0000000
--- a/internlm/model/modeling_internlm.py
+++ /dev/null
@@ -1,512 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import math
-from typing import Optional
-
-import torch
-from flash_attn.modules.embedding import ParallelGPT2Embeddings
-from flash_attn.modules.mlp import ParallelFusedMLP
-from torch import nn
-
-from internlm.core.context import IS_TENSOR_PARALLEL, ParallelMode
-from internlm.core.context.parallel_context import global_context as gpc
-from internlm.initialize.initialize_tensor import normal_, scaled_init_method_normal
-from internlm.model.embedding import Embedding1D
-from internlm.model.linear import (
- FeedForward,
- RewardModelLinear,
- ScaleColumnParallelLinear,
-)
-from internlm.model.multi_head_attention import MHA
-from internlm.model.utils import gather_forward_split_backward, try_import_RMSNorm
-from internlm.solver.pipeline_utils import partition_uniform
-from internlm.utils.checkpoint import activation_checkpoint
-from internlm.utils.common import filter_kwargs
-from internlm.utils.logger import get_logger
-from internlm.utils.registry import MODEL_INITIALIZER
-
-MODEL_TYPE = "INTERNLM"
-
-logger = get_logger(__file__)
-RMSNorm = try_import_RMSNorm()
-
-
-class PackedFlashBaseLayer1D(nn.Module):
- """
- 1D Packed Flash Base Layer.
-
- Args:
- hidden_size (int): The hidden size of model. 768 by default.
- num_attention_heads (int): The number of attention heads. 12 by default.
- mlp_ratio (int): The ratio of MLP layers. 4 by default.
- attn_drop_rate (float): The dropout rate of attention module. 0 by default.
- drop_rate (float): The dropout rate of the input hidden state. 0.0 by default.
- dtype (torch.dtype): Type of data. torch.float by default.
- layer_norm_epsilon (float): A value added to the denominator for numerical stability. 1e-6 by default.
- checkpoint (bool): Whether to use checkpointing to save VRAM. False by default.
- layer_idx (int): The index of current layer. 0 by default.
- residual_in_fp32 (bool): Whether to use residual in fp32. False by default.
- device (Optional[Union[str, torch.device]]): The device to be used.
- norm_type (str): Use RMSNorm or LayerNorm. "rmsnorm" by default.
- use_flash_attn (bool): Whether to use flash-attn. True by default.
- """
-
- def __init__(
- self,
- hidden_size: int = 768,
- num_attention_heads: int = 12,
- mlp_ratio: int = 4,
- attn_drop_rate: float = 0,
- drop_rate: float = 0.0,
- dtype: torch.dtype = torch.float,
- layer_norm_epsilon: float = 1e-6,
- checkpoint: bool = False,
- layer_idx: int = 0,
- residual_in_fp32: bool = False,
- device: Optional[torch.device] = None,
- norm_type: str = "rmsnorm",
- dropout_selective_checkpoint: bool = True,
- use_scaled_init: bool = True,
- use_swiglu: bool = True,
- use_flash_attn: bool = True,
- ):
- super().__init__()
- self.checkpoint = checkpoint
- # dropout selective checkpoint can only be enabled when checkpoint is disabled.
- self.dropout_selective_checkpoint = dropout_selective_checkpoint is True and checkpoint is False
- self.layer_idx = layer_idx
- self.use_flash_attn = use_flash_attn
-
- head_dim = hidden_size // num_attention_heads
- self.mixer = MHA(
- embed_dim=hidden_size,
- num_heads=num_attention_heads,
- process_group=gpc.get_group(ParallelMode.TENSOR),
- dropout=attn_drop_rate,
- softmax_scale=1 / math.sqrt(head_dim),
- causal=True,
- layer_idx=layer_idx,
- rotary_emb_dim=head_dim,
- rotary_emb_scale_base=0,
- use_flash_attn=use_flash_attn,
- device=device,
- dtype=dtype,
- )
-
- self.dropout1 = nn.Dropout(drop_rate)
- if norm_type == "rmsnorm":
- self.norm1 = RMSNorm(hidden_size, eps=layer_norm_epsilon)
- self.norm2 = RMSNorm(hidden_size, eps=layer_norm_epsilon)
- else:
- self.norm1 = nn.LayerNorm(hidden_size, eps=layer_norm_epsilon)
- self.norm2 = nn.LayerNorm(hidden_size, eps=layer_norm_epsilon)
-
- if use_swiglu:
- self.mlp = FeedForward(
- hidden_size,
- int(hidden_size * mlp_ratio),
- out_features=hidden_size,
- process_group=gpc.get_group(ParallelMode.TENSOR),
- bias=False,
- device=device,
- dtype=dtype,
- )
- else:
- self.mlp = ParallelFusedMLP(
- hidden_size,
- int(hidden_size * mlp_ratio),
- out_features=hidden_size,
- activation="gelu_approx",
- process_group=gpc.get_group(ParallelMode.TENSOR),
- bias1=False,
- bias2=False,
- sequence_parallel=gpc.config.parallel.sequence_parallel,
- checkpoint_lvl=0,
- heuristic="auto",
- device=device,
- dtype=dtype,
- )
- for _, param in self.mlp.named_parameters():
- if gpc.get_world_size(ParallelMode.TENSOR) > 1:
- setattr(param, IS_TENSOR_PARALLEL, True)
- self.dropout2 = nn.Dropout(drop_rate)
- self.use_swiglu = use_swiglu
- self.use_scaled_init = use_scaled_init
- self.residual_in_fp32 = residual_in_fp32  # only makes sense when using prenorm
- self.return_residual = False
- self.reset_parameters()
-
- def reset_parameters(self):
- with torch.no_grad():
- for name, param in self.mixer.named_parameters():
- if param.ndim == 1:
- param.data.zero_()
- elif "Wqkv" in name:
- normal_(std=0.006)(param.data)
- elif self.use_scaled_init:
- scaled_init_method_normal(sigma=0.006, num_layers=self.layer_idx + 1)(param.data)
- else:
- normal_(std=0.0015)(param.data)
-
- for name, param in self.mlp.named_parameters():
- if param.ndim == 1 and "bias" in name:
- param.data.zero_()
- elif self.use_swiglu:
- if self.use_scaled_init and "w2" in name:
- scaled_init_method_normal(sigma=0.006, num_layers=self.layer_idx + 1)(param.data)
- else:
- normal_(std=0.006 if "w1" in name or "w2" in name else 0.0015)(param.data)
- else:
- if self.use_scaled_init and "fc1" not in name:
- scaled_init_method_normal(sigma=0.006, num_layers=self.layer_idx + 1)(param.data)
- else:
- normal_(std=0.006 if "fc1" in name else 0.0015)(param.data)
-
- def forward(self, hidden_states, cu_seqlens=None, indexes=None, inference_params=None, max_seqlen=None):
- if self.checkpoint and self.training:
- return activation_checkpoint(
- self._forward, False, hidden_states, cu_seqlens, indexes, inference_params, max_seqlen
- )
- else:
- return self._forward(hidden_states, cu_seqlens, indexes, inference_params, max_seqlen)
-
- def _forward(self, hidden_states=None, cu_seqlens=None, indexes=None, inference_params=None, max_seqlen=None):
- r"""Pass the input through the encoder layer.
-
- Args:
- hidden_states: the sequence to the encoder layer (required).
- residual: hidden_states = Attn/MLP(LN(residual))
- cu_seqlens: 1d LongTensor of cumulative sequence lengths; its length equals the number of packed sequences + 1
- indexes: has the same length as hidden_states and indicates the position of each token within its sequence
- """
- mixer_kwargs = {
- "cu_seqlens": cu_seqlens,
- "max_seqlen": max_seqlen,
- "indexes": indexes,
- "inference_params": inference_params,
- }
-
- def _dropout_and_norm_attn(_hidden_states):
- _dropped = self.dropout1(_hidden_states)
- _residual = _dropped
- _hidden_states = self.norm1(_residual.float())
- return _residual, _hidden_states
-
- if self.dropout_selective_checkpoint:
- residual, hidden_states = activation_checkpoint(_dropout_and_norm_attn, False, hidden_states)
- else:
- residual, hidden_states = _dropout_and_norm_attn(hidden_states)
-
- if self.residual_in_fp32:
- residual = residual.to(torch.float32)
-
- hidden_states = self.mixer(hidden_states, **mixer_kwargs)
-
- def _dropout_and_norm_ffn(_residual, _hidden_states):
- _dropped = self.dropout2(_hidden_states)
- _residual = (_dropped + _residual) if _residual is not None else _dropped
- _hidden_states = self.norm2(_residual.float())
- return _residual, _hidden_states
-
- if self.dropout_selective_checkpoint:
- residual, hidden_states = activation_checkpoint(_dropout_and_norm_ffn, False, residual, hidden_states)
- else:
- residual, hidden_states = _dropout_and_norm_ffn(residual, hidden_states)
-
- if self.residual_in_fp32:
- residual = residual.to(torch.float32)
-
- hidden_states = self.mlp(hidden_states)
-
- return hidden_states + residual
-
-
-class PackedFlashInternLm1D(nn.Module):
- """
- 1D Packed Flash InternLm.
-
- Args:
- num_layers (int): The number of layers. 12 by default.
- hidden_size (int): The size of hidden state. 768 by default.
- num_attention_heads (int): The number of attention heads. 12 by default.
- vocab_size (int): The size of vocabulary. 50304 by default.
- mlp_ratio (int): The ratio of MLP layers. 4 by default.
- attn_drop_rate (float): The dropout rate of attention module. 0.0 by default.
- drop_rate (float): The dropout rate of input hidden state. 0.0 by default.
- dtype (torch.dtype): The type of data. torch.float by default.
- checkpoint (float): The proportion of layers that need to be checkpointed compared to the total number
- of layers. 0.0 by default.
- layer_norm_epsilon (float): A value added to the denominator for numerical stability. 1e-5 by default.
- first (bool): Whether this chunk contains the input embedding layer. False by default.
- last (bool): Whether this chunk contains the final norm and output head. False by default.
- embed_split_hidden (bool): Split the embedding layer in the hidden state dimension or vocabulary dimension.
- False by default.
- embed_grad_scale (float): Refer to GLM-130B, for training stability. 0.1 by default.
- parallel_output (bool): If it is necessary to collect the output of parallel computing. True by default.
- start_layer_idx (int): The index of start layer in the pipeline. 0 by default.
- device (Optional[Union[str, torch.device]]): The device will be used. None by default.
- residual_in_fp32 (bool): Whether to use residual in fp32. False by default.
- norm_type (str): Normalization type. Use RMSNorm or LayerNorm. "rmsnorm" by default.
- use_flash_attn (bool): Whether to use flash-attn. True by default.
-
- """
-
- def __init__(
- self,
- num_layers: int = 12,
- hidden_size: int = 768,
- num_attention_heads: int = 12,
- vocab_size: int = 50304,
- mlp_ratio: int = 4.0,
- attn_drop_rate: float = 0.0,
- drop_rate: float = 0.0,
- dtype: torch.dtype = torch.float,
- checkpoint: float = 0.0,
- layer_norm_epsilon: float = 1e-5,
- first: bool = False,
- last: bool = False,
- embed_split_hidden: bool = False,
- embed_grad_scale: float = 0.1,
- parallel_output: bool = True,
- start_layer_idx: int = 0,
- device: Optional[torch.device] = None,
- residual_in_fp32: bool = False,
- norm_type: str = "rmsnorm",
- is_reward: bool = False,
- dropout_selective_checkpoint: bool = True,
- use_scaled_init: bool = True,
- use_swiglu: bool = True,
- use_flash_attn: bool = True,
- ):
- super().__init__()
-
- checkpoint_layer_num = int(num_layers * checkpoint)
-
- if is_reward:
- head_cls = RewardModelLinear
- else:
- head_cls = ScaleColumnParallelLinear
- if first:
- if embed_split_hidden:
- self.embedding = Embedding1D(num_embeddings=vocab_size, embedding_dim=hidden_size)
- else:
- self.embedding = ParallelGPT2Embeddings(
- embed_dim=hidden_size,
- vocab_size=vocab_size,
- max_position_embeddings=-1,
- process_group=gpc.get_group(ParallelMode.TENSOR),
- padding_idx=None,
- sequence_parallel=gpc.config.parallel.sequence_parallel,
- device=device,
- dtype=dtype,
- )
- for _, param in self.embedding.named_parameters():
- normal_(std=0.0052)(param)
- if gpc.get_world_size(ParallelMode.TENSOR) > 1:
- setattr(param, IS_TENSOR_PARALLEL, True)
- self.embed_grad_scale = embed_grad_scale
- self.blocks = nn.ModuleList(
- [
- PackedFlashBaseLayer1D(
- hidden_size=hidden_size,
- num_attention_heads=num_attention_heads,
- mlp_ratio=mlp_ratio,
- attn_drop_rate=attn_drop_rate,
- drop_rate=drop_rate,
- dtype=dtype,
- layer_norm_epsilon=layer_norm_epsilon,
- checkpoint=lid < checkpoint_layer_num,
- layer_idx=lid + start_layer_idx, # This parameter is used for caching during generation
- residual_in_fp32=residual_in_fp32,
- device=device,
- norm_type=norm_type,
- dropout_selective_checkpoint=dropout_selective_checkpoint,
- use_scaled_init=use_scaled_init,
- use_swiglu=use_swiglu,
- use_flash_attn=use_flash_attn,
- )
- for lid in range(num_layers)
- ]
- )
- if last:
- if norm_type == "rmsnorm":
- self.norm = RMSNorm(hidden_size, eps=layer_norm_epsilon)
- else:
- self.norm = nn.LayerNorm(hidden_size, eps=layer_norm_epsilon)
- self.head = head_cls(
- in_features=hidden_size,
- out_features=gpc.get_world_size(ParallelMode.TENSOR) if is_reward else vocab_size,
- process_group=gpc.get_group(ParallelMode.TENSOR),
- bias=False,
- device=device,
- dtype=dtype,
- weight_scale=embed_grad_scale,
- )
- for _, param in self.head.named_parameters():
- normal_(std=0.0052)(param)
- if gpc.get_world_size(ParallelMode.TENSOR) > 1:
- setattr(param, IS_TENSOR_PARALLEL, True)
- self.parallel_output = parallel_output
-
- def forward(self, hidden_states=None, cu_seqlens=None, input_ids=None, indexes=None, inference_params=None):
- # attention_mask: compute attention on the places where the value is 1
- if hasattr(self, "embedding"):
- hidden_states = self.embedding(input_ids)
- if self.embed_grad_scale != 1:
- hidden_states = (
- self.embed_grad_scale * hidden_states + (1 - self.embed_grad_scale) * hidden_states.detach()
- )
- if isinstance(cu_seqlens, list):
- assert len(cu_seqlens) == 1
- cu_seqlens = cu_seqlens[0].to(hidden_states.device)
-
- if cu_seqlens is not None:
- cu_seqlens = cu_seqlens.squeeze(0)
- hidden_states = hidden_states.squeeze(0)  # If cu_seqlens is passed in, it indicates a packed state;
- # the batch dimension with a size of 1 should be directly squeezed off.
-
- if indexes is not None:
- assert len(indexes) == 1
- # The indexes are used to indicate the actual position IDs of each token in the packed input.
- indexes = indexes[0]
- max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item() if cu_seqlens is not None else None
-
- for _, block in enumerate(self.blocks):
- hidden_states = block(
- hidden_states,
- cu_seqlens=cu_seqlens,
- indexes=indexes,
- inference_params=inference_params,
- max_seqlen=max_seqlen,
- )
-
- if hasattr(self, "norm"):
- hidden_states = self.norm(hidden_states.float())
- if hasattr(self, "head"):
- hidden_states = self.head(hidden_states)
-
- if not self.parallel_output:
- hidden_states = gather_forward_split_backward(hidden_states, ParallelMode.TENSOR, dim=-1)
- return hidden_states
-
-
-def _build_generic_model_1d(num_layers, num_chunks, device=torch.device("cuda"), **kwargs):
- """
- build generic model 1d
-
- Args:
- num_layers (int): The number of layer.
- num_chunks (int): The number of partitions in pipeline parallel.
- device (Optional[Union[str, torch.device]]): The device will be used. torch.device("cuda") by default.
-
- """
- pipeline_size = gpc.get_world_size(ParallelMode.PIPELINE)
- pipeline_rank = gpc.get_local_rank(ParallelMode.PIPELINE)
-
- all_parts = partition_uniform(num_layers, pipeline_size, num_chunks)
- parts = all_parts[pipeline_rank]
- if gpc.is_rank_for_log():
- logger.info(f"The layer sharding is {all_parts}.")
-
- models = []
-
- for start, end in parts:
- kwargs["num_layers"] = end - start
- kwargs["first"] = start == 0
- # Mark this chunk as the last one only if it ends at the final layer and the last partition is non-empty.
- kwargs["last"] = end == num_layers and len(all_parts[-1]) != 0
- kwargs["device"] = device
- kwargs["start_layer_idx"] = start
- chunk = PackedFlashInternLm1D(**filter_kwargs(PackedFlashInternLm1D.__init__, kwargs)).to(device)
-
- models.append(chunk)
- torch.distributed.barrier()
- if len(models) == 1:
- model = models[0]
- else:
- model = nn.ModuleList(models)
-
- return model
-
-
-@MODEL_INITIALIZER.register_module(module_name=MODEL_TYPE)
-def build_model_with_cfg(
- num_chunks=1,
- checkpoint=0.0,
- dtype=torch.float,
- embed_split_hidden=False,
- num_layers=48,
- hidden_size=2048,
- vocab_size=50304,
- embed_grad_scale=1,
- parallel_output=True,
- num_attention_heads=32,
- mlp_ratio=4.0,
- residual_in_fp32=False,
- norm_type="rmsnorm",
- drop_rate=0,
- attn_drop_rate=0,
- apply_post_layer_norm=False, # pylint: disable=W0613
- layer_norm_epsilon=1e-5,
- is_reward=False,
- dropout_selective_checkpoint=True,
- use_scaled_init: bool = True,
- use_swiglu: bool = True,
- use_flash_attn: bool = True,
-):
- """
- Build model with config.
-
- Args:
- num_chunks (int): The number of partitions in pipeline parallel. 1 by default.
- checkpoint (float): The proportion of layers that use activation checkpointing. 0.0 by default.
- dtype (torch.dtype): The type of data. torch.float by default.
- embed_split_hidden (bool): Split the embedding layer in the hidden state dimension or vocabulary dimension.
- False by default.
- num_layers (int): The number of layers. 48 by default.
- hidden_size (int): The size of hidden state. 2048 by default.
- vocab_size (int): The size of vocabulary. 50304 by default.
- embed_grad_scale (float): Refer to GLM-130B, for training stability. 1 by default.
- parallel_output (bool): If it is necessary to collect the output of parallel computing. True by default.
- num_attention_heads (int): The number of attention heads. 32 by default.
- mlp_ratio (int): The ratio of MLP layers. 4.0 by default.
- residual_in_fp32 (bool): Whether to use residual in fp32. False by default. It cannot be enabled for now,
- because it would require passing inconsistent data types between pipeline stages,
- which would need significant modifications to internlm.
- norm_type (str): Normalization type. Use RMSNorm or LayerNorm. "rmsnorm" by default.
- drop_rate (float): The dropout rate of input hidden state. 0 by default.
- attn_drop_rate (float): The dropout rate of attention module. 0 by default.
- apply_post_layer_norm (bool): Whether to apply post layer norm. False by default.
- layer_norm_epsilon (float): A value added to the denominator for numerical stability. 1e-5 by default.
- is_reward (bool): Whether to use reward model. False by default.
- dropout_selective_checkpoint (bool): It can only be enabled when checkpoint is disabled. True by default.
- use_scaled_init (bool): Whether to use scaled init. True by default.
- use_swiglu (bool): Whether to use swiglu. True by default.
- use_flash_attn (bool): Whether to use flash-attn. True by default.
-
- """
-
- cfg = dict(
- hidden_size=hidden_size,
- num_attention_heads=num_attention_heads,
- checkpoint=checkpoint,
- dtype=dtype,
- embed_split_hidden=embed_split_hidden,
- vocab_size=vocab_size,
- embed_grad_scale=embed_grad_scale,
- parallel_output=parallel_output,
- mlp_ratio=mlp_ratio,
- residual_in_fp32=residual_in_fp32,
- norm_type=norm_type,
- drop_rate=drop_rate,
- attn_drop_rate=attn_drop_rate,
- layer_norm_epsilon=layer_norm_epsilon,
- is_reward=is_reward,
- dropout_selective_checkpoint=dropout_selective_checkpoint,
- use_scaled_init=use_scaled_init,
- use_swiglu=use_swiglu,
- use_flash_attn=use_flash_attn,
- )
-
- return _build_generic_model_1d(num_layers=num_layers, num_chunks=num_chunks, **cfg)
diff --git a/internlm/model/multi_head_attention.py b/internlm/model/multi_head_attention.py
deleted file mode 100644
index d634605..0000000
--- a/internlm/model/multi_head_attention.py
+++ /dev/null
@@ -1,186 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-from typing import Optional
-
-import torch
-from einops import rearrange
-from flash_attn.modules.mha import (
- CrossAttention,
- FlashCrossAttention,
- FlashSelfAttention,
- SelfAttention,
- _update_kv_cache,
-)
-from torch import nn
-
-from internlm.core.context import IS_TENSOR_PARALLEL, ParallelMode
-from internlm.core.context import global_context as gpc
-from internlm.model.embedding import RotaryEmbedding
-from internlm.model.linear import ColumnParallelLinearTorch, RowParallelLinearTorch
-
-
-class MHA(nn.Module):
- """
- Multi-head self-attention and cross-attention.
-
- Args:
- embed_dim (int): The dimension of hidden state.
- num_heads (int): The number of attention heads.
- process_group (torch.distributed.ProcessGroup): The group of the current device for `parallel_mode`.
- bias (boolean): Whether the bias is needed for linears. Will be used when initializing QKV matrix and
- output projection. True by default.
- dropout (float): The dropout rate for cross attention and self attention. 0.0 by default.
- softmax_scale (float): The temperature to use for the softmax attention.
- causal (boolean): Whether to apply causal attention mask. False by default.
- layer_idx (int): The index of current layer. None by default.
- rotary_emb_dim (int): The dimension of Rotary Embedding. 0 by default.
- rotary_emb_scale_base (int): The scaling factor of Rotary Embedding. If scale_base > 0, this implements
- XPos(Sun et al., https://arxiv.org/abs/2212.10554). 0 by default.
- use_flash_attn (boolean): Whether to use flash attention. If False, the vanilla attention module will be used.
- True by default.
- sequence_parallel (boolean): If True, we're doing Tensor Parallel with sequence parallelism. An all_gather_raw
- of x will be done before doing the matmul.
- device (Optional[Union[str, torch.device]]): The device to be used.
- dtype (Optional[torch.dtype]): The type of data.
-
- """
-
- def __init__(
- self,
- embed_dim: int,
- num_heads: int,
- process_group: Optional[torch.distributed.ProcessGroup],
- dropout: float = 0.0,
- softmax_scale: float = None,
- causal: bool = False,
- layer_idx: int = None,
- rotary_emb_dim: int = 0,
- rotary_emb_scale_base: int = 0,
- use_flash_attn: bool = True,
- device: Optional[torch.device] = None,
- dtype: Optional[torch.dtype] = None,
- ) -> None:
- factory_kwargs = {"device": device, "dtype": dtype}
- super().__init__()
- self.embed_dim = embed_dim
- self.causal = causal
- self.layer_idx = layer_idx
- self.rotary_emb_dim = rotary_emb_dim
- self.use_flash_attn = use_flash_attn
- self.num_heads = num_heads
- assert self.embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
- self.head_dim = self.embed_dim // num_heads
-
- if self.rotary_emb_dim > 0:
- self.rotary_emb = RotaryEmbedding(self.rotary_emb_dim, scale_base=rotary_emb_scale_base, device=device)
-
- # note: bias is set to True here
- self.Wqkv = ColumnParallelLinearTorch(
- embed_dim,
- 3 * embed_dim,
- process_group,
- bias=True,
- sequence_parallel=gpc.config.parallel.sequence_parallel,
- **factory_kwargs,
- ) # according to https://spaces.ac.cn/archives/9577
-
- inner_attn_cls = FlashSelfAttention if use_flash_attn else SelfAttention
- inner_cross_attn_cls = FlashCrossAttention if use_flash_attn else CrossAttention
- self.inner_attn = inner_attn_cls(causal=causal, softmax_scale=softmax_scale, attention_dropout=dropout)
- self.inner_cross_attn = inner_cross_attn_cls(
- causal=causal, softmax_scale=softmax_scale, attention_dropout=dropout
- )
-
- # output projection always has the bias (for now)
- self.out_proj = RowParallelLinearTorch(
- embed_dim,
- embed_dim,
- process_group,
- sequence_parallel=gpc.config.parallel.sequence_parallel,
- **factory_kwargs,
- )
- # need to assign the tp attribute so that internlm knows it is a tensor parallel module
- if gpc.get_world_size(ParallelMode.TENSOR) > 1:
- for name in ["out_proj", "Wqkv"]:
- for param in getattr(self, name).parameters():
- setattr(param, IS_TENSOR_PARALLEL, True)
-
- def forward(self, x, seqlen=None, inference_params=None, **kwargs):
- if kwargs.get("indexes", None) is not None:
- return self._packed_forward(x=x, inference_params=inference_params, **kwargs)
- else:
- return self._forward(x=x, seqlen=seqlen, inference_params=inference_params, **kwargs)
-
- def _forward(self, x, seqlen=None, inference_params=None, **kwargs):
- """
- Arguments:
- x: (batch, seqlen, hidden_dim) (where hidden_dim = num heads * head dim) if seqlen=None.
- If seqlen is not None, x is (batch * seqlen, hidden_dim). This is so that when we
- split x during sequence parallel, we split the batch * seqlen dimension
- (in case batch is small).
- """
- qkv = self.Wqkv(x)
- if seqlen is None:
- qkv = rearrange(qkv, "b s (three h d) -> b s three h d", three=3, d=self.head_dim)
- else:
- qkv = rearrange(qkv, "(b s) (three h d) -> b s three h d", s=seqlen, three=3, d=self.head_dim)
-
- if self.rotary_emb_dim > 0:
- kwargs["inference_params"] = inference_params
- qkv = self.rotary_emb(qkv, **kwargs)
-
- if inference_params is None:
- if gpc.config.model.dtype is torch.float32 and gpc.config.model.use_flash_attn:
- with torch.cuda.amp.autocast(dtype=torch.bfloat16):
- if qkv.dtype not in [torch.float16, torch.bfloat16]:
- qkv = qkv.to(torch.bfloat16)
- context = self.inner_attn(qkv).to(x.dtype)
- else:
- context = self.inner_attn(qkv)
- else:
- q = qkv[:, :, 0]
- assert self.layer_idx is not None, "Generation requires layer_idx in the constructor"
- kv = _update_kv_cache(qkv[:, :, 1:], inference_params, self.layer_idx)
- # If we're processing the prompt, causal=None (use self.causal).
- # If we're decoding, then causal=False.
- causal = None if inference_params.sequence_len_offset == 0 else False
- context = self.inner_cross_attn(q, kv, causal=causal)
-
- if seqlen is None:
- context = rearrange(context, "b s h d -> b s (h d)")
- else:
- context = rearrange(context, "b s h d -> (b s) (h d)")
-
- out = self.out_proj(context)
- return out
-
- def _packed_forward(self, x, inference_params=None, **kwargs):
- """
- Arguments:
- x: (total_tokens, hidden_dim), where the sequences in the batch are packed along the first
- dimension; cu_seqlens and indexes (passed via kwargs) describe the sequence boundaries
- and the position of each token within its sequence.
- """
- qkv = self.Wqkv(x) # total x hsz'
- qkv = rearrange(qkv, "t (three h d) -> t three h d", three=3, d=self.head_dim) # total x 3 x n_head x d
- qkv = self.rotary_emb(qkv, **kwargs)
- kwargs.pop("indexes")
-
- if inference_params is None:
- if gpc.config.model.dtype is torch.float32 and gpc.config.model.use_flash_attn:
- with torch.cuda.amp.autocast(dtype=torch.bfloat16):
- if qkv.dtype not in [torch.float16, torch.bfloat16]:
- qkv = qkv.to(torch.bfloat16)
- context = self.inner_attn(qkv, **kwargs).to(x.dtype)
- else:
- context = self.inner_attn(qkv, **kwargs)
-
- else:
- raise RuntimeError("Not support this right now")
-
- context = rearrange(context, "b h d -> b (h d)") # recover the shape
- out = self.out_proj(context)
- return out
diff --git a/internlm/model/norm.py b/internlm/model/norm.py
deleted file mode 100644
index 6598e17..0000000
--- a/internlm/model/norm.py
+++ /dev/null
@@ -1,46 +0,0 @@
- # adapted from https://github.com/NVIDIA/apex/blob/master/apex/normalization/fused_layer_norm
-
-import numbers
-
-import torch
-from torch.nn import init
-from torch.nn.parameter import Parameter
-
-
-def manual_rms_norm(my_input, normalized_shape, weight, eps):
- # layer norm should always be calculated in float32
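- # RMSNorm: y = weight * x / sqrt(mean(x^2) + eps); unlike LayerNorm there is no mean
- # subtraction and no bias term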
- dims = tuple(i for i in range(-1, -len(normalized_shape) - 1, -1))
- variance = my_input.to(torch.float32).pow(2).mean(dims, keepdim=True)
- my_input = my_input * torch.rsqrt(variance + eps)
-
- if weight is None:
- return my_input
-
- # convert into half-precision if necessary
- if weight.dtype in [torch.float16, torch.bfloat16]:
- my_input = my_input.to(weight.dtype)
-
- return weight * my_input
-
-
-class RMSNormTorch(torch.nn.Module):
- """A custom PyTorch module for RMS normalization."""
-
- def __init__(self, normalized_shape, eps=1e-5):
- super().__init__()
-
- if isinstance(normalized_shape, numbers.Integral):
- normalized_shape = (normalized_shape,)
- self.normalized_shape = torch.Size(normalized_shape)
- self.eps = eps
- self.weight = Parameter(torch.empty(*normalized_shape))
- self.reset_parameters()
-
- def forward(self, _input: torch.Tensor):
- return manual_rms_norm(_input, self.normalized_shape, self.weight, self.eps)
-
- def reset_parameters(self):
- init.ones_(self.weight)
-
- def extra_repr(self):
- return "{normalized_shape}, eps={eps}, ".format(**self.__dict__)
diff --git a/internlm/model/utils.py b/internlm/model/utils.py
deleted file mode 100644
index 12f80e3..0000000
--- a/internlm/model/utils.py
+++ /dev/null
@@ -1,209 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-from typing import Optional
-
-import torch
-import torch.nn.functional as F
-from flash_attn.ops.fused_dense import FusedDenseFunc
-from flash_attn.utils.distributed import (
- all_gather_raw,
- all_reduce_raw,
- reduce_scatter_raw,
-)
-from torch import Tensor
-from torch.cuda.amp import custom_bwd
-from torch.distributed import ProcessGroup
-
-from internlm.core.context import global_context as gpc
-from internlm.utils.logger import get_logger
-
-logger = get_logger(__file__)
-
-
-def _split(input_, parallel_mode, dim=-1):
- # skip if only one rank involved
- world_size = gpc.get_world_size(parallel_mode)
- if world_size == 1:
- return input_
-
- # Split along last dimension.
- dim_size = input_.size(dim)
- assert dim_size % world_size == 0, (
- f"The dimension to split ({dim_size}) is not a multiple of world size ({world_size}), "
- f"cannot split tensor evenly"
- )
-
- tensor_list = torch.split(input_, dim_size // world_size, dim=dim)
- rank = gpc.get_local_rank(parallel_mode)
- output = tensor_list[rank].contiguous()
-
- return output
-
-
-def _gather(input_, parallel_mode, dim=-1):
- # skip if only one rank involved
- world_size = gpc.get_world_size(parallel_mode)
- if world_size == 1:
- return input_
-
- # all gather
- rank = gpc.get_local_rank(parallel_mode)
- tensor_list = [torch.empty_like(input_) for _ in range(world_size)]
- tensor_list[rank] = input_
- group = gpc.get_cpu_group(parallel_mode) if input_.device.type == "cpu" else gpc.get_group(parallel_mode)
- torch.distributed.all_gather(tensor_list, input_, group=group)
-
- # concat
- output = torch.cat(tensor_list, dim=dim).contiguous()
-
- return output
-
-
-class _GatherForwardSplitBackward(torch.autograd.Function):
- """Gather the input from model parallel region and concatenate.
-
- Args:
- input_: input matrix.
- parallel_mode: parallel mode.
- dim: dimension
- """
-
- @staticmethod
- def symbolic(input_):
- return _gather(input_, parallel_mode=None)
-
- @staticmethod
- def forward(ctx, input_, parallel_mode, dim):
- ctx.mode = parallel_mode
- ctx.dim = dim
- return _gather(input_, parallel_mode, dim)
-
- @staticmethod
- def backward(ctx, grad_output):
- return _split(grad_output, ctx.mode, ctx.dim), None, None
-
-
-def gather_forward_split_backward(input_, parallel_mode, dim):
- return _GatherForwardSplitBackward.apply(input_, parallel_mode, dim)
-
-
-def linear_bias_wgrad_torch(my_input, grad_output, has_d_bias):
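- # plain-torch weight/bias gradients of a linear layer:
- # grad_weight = grad_output^T @ input, grad_bias = grad_output summed over the batch dimension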
- assert my_input.dtype == grad_output.dtype
- grad_weight = torch.matmul(grad_output.t(), my_input)
- grad_bias = grad_output.sum(dim=0) if has_d_bias else None
- return grad_weight, grad_bias
-
-
- # adapted from https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/ops/fused_dense.py
-class FusedDenseFuncTorch(FusedDenseFunc):
- """A custom PyTorch module extending FusedDenseFunc."""
-
- @staticmethod
- @custom_bwd
- def backward(ctx, grad_output, *args):
- grad_output = grad_output.contiguous()
- if ctx.return_residual:
- (grad_input,) = args
- grad_input = grad_input.contiguous()
- process_group = ctx.process_group
- sequence_parallel = ctx.sequence_parallel
- if ctx.compute_weight_gradient:
- x, weight = ctx.saved_tensors
- if process_group is not None and sequence_parallel:
- total_x, handle_x = all_gather_raw(x, process_group, async_op=True)
- else:
- total_x = x
- else:
- (weight,) = ctx.saved_tensors
- total_x = None
- batch_shape = grad_output.shape[:-1]
- batch_dim = batch_shape.numel()
- grad_output = grad_output.reshape(batch_dim, grad_output.shape[-1])
- if ctx.needs_input_grad[0]:
- if not ctx.return_residual:
- grad_input = F.linear(grad_output, weight.t())
- else:
- grad_input = torch.addmm(grad_input.reshape(batch_dim, grad_input.shape[-1]), grad_output, weight)
- grad_input = grad_input.reshape(*batch_shape, grad_input.shape[-1])
- if process_group is not None:
- reduce_fn = reduce_scatter_raw if sequence_parallel else all_reduce_raw
- grad_input, handle_grad_input = reduce_fn(grad_input, process_group, async_op=True)
- else:
- grad_input = None
- if ctx.needs_input_grad[1]:
- assert ctx.compute_weight_gradient
- if process_group is not None and sequence_parallel:
- handle_x.wait()
- # unlike flash_attn, the weight gradient here is computed with plain torch ops instead of the fused CUDA kernel
- grad_weight, grad_bias = linear_bias_wgrad_torch(
- total_x.reshape(batch_dim, total_x.shape[-1]), grad_output, ctx.needs_input_grad[2]
- )
- else:
- grad_weight = None
- grad_bias = grad_output if ctx.needs_input_grad[2] else None
- if process_group is not None and ctx.needs_input_grad[0]:
- handle_grad_input.wait()
- return grad_input, grad_weight, grad_bias, None, None, None
-
-
-def fused_dense_func_torch(
- x: Tensor,
- weight: Tensor,
- bias: Optional[Tensor] = None,
- return_residual: bool = False,
- process_group: Optional[ProcessGroup] = None,
- sequence_parallel: bool = True,
-):
- dtype_eligible = x.dtype in [torch.float16, torch.bfloat16] or (
- x.dtype == torch.float32 and torch.is_autocast_enabled()
- )
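- # dispatch: use the fused CUDA kernel only for CUDA tensors in an eligible dtype,
- # otherwise fall back to the torch backward implementation above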
- if x.is_cuda and weight.is_cuda and (bias is None or bias.is_cuda) and dtype_eligible:
- return FusedDenseFunc.apply(x, weight, bias, return_residual, process_group, sequence_parallel)
- else:
- return FusedDenseFuncTorch.apply(x, weight, bias, return_residual, process_group, sequence_parallel)
-
-
-class _SplitForwardGatherBackward(torch.autograd.Function):
- """
- Split the input and keep only the chunk corresponding to the rank.
-
- Args:
- input_: input matrix.
- parallel_mode: parallel mode.
- dim: dimension
- """
-
- @staticmethod
- def symbolic(input_):
- return _split(input_, parallel_mode=None)
-
- @staticmethod
- def forward(ctx, input_, parallel_mode, dim):
- ctx.mode = parallel_mode
- ctx.dim = dim
- return _split(input_, parallel_mode, dim)
-
- @staticmethod
- def backward(ctx, grad_output):
- return _gather(grad_output, ctx.mode, ctx.dim), None, None
-
-
-def split_forward_gather_backward(input_, parallel_mode, dim):
- return _SplitForwardGatherBackward.apply(input_, parallel_mode, dim)
-
-
-def try_import_RMSNorm():
- """
- Try to import MixedFusedRMSNorm from apex; if that fails, fall back to our RMSNorm.
-
- """
- try:
- from apex.normalization.fused_layer_norm import MixedFusedRMSNorm as RMSNorm
-
- return RMSNorm
- except ModuleNotFoundError:
- logger.warning("The torch implementation for MixFusedRMSNorm is slower than apex. Please note this!")
- from internlm.model.norm import RMSNormTorch as RMSNorm
-
- return RMSNorm
diff --git a/internlm/monitor/__init__.py b/internlm/monitor/__init__.py
deleted file mode 100644
index 2501d66..0000000
--- a/internlm/monitor/__init__.py
+++ /dev/null
@@ -1,11 +0,0 @@
-from .alert import initialize_light_monitor, send_heartbeat
-from .monitor import initialize_monitor_manager, send_alert_message
-from .utils import set_env_var
-
-__all__ = [
- "send_alert_message",
- "initialize_monitor_manager",
- "set_env_var",
- "initialize_light_monitor",
- "send_heartbeat",
-]
diff --git a/internlm/monitor/alert.py b/internlm/monitor/alert.py
deleted file mode 100644
index 1772e7f..0000000
--- a/internlm/monitor/alert.py
+++ /dev/null
@@ -1,104 +0,0 @@
-import json
-import math
-import os
-import re
-import time
-from typing import Dict
-
-import requests
-
-from internlm.utils.logger import get_logger
-
-logger = get_logger(__file__)
-
-
-def initialize_light_monitor(monitor_address: str = None):
- try:
- from uniscale_monitoring import init_monitor
-
- init_monitor(monitor_address)
- except Exception as e:
- logger.warning(f"init monitor meet error: {e}")
-
-
-def send_heartbeat(msg_type: str, msg: Dict):
- def nan2none(v):
- if isinstance(v, float) and math.isnan(v):
- return None
- return v
-
- try:
- from uniscale_monitoring import send_meta
-
- data = {}
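- # flatten nested metric dicts into "<outer>_<inner>" keys restricted to [a-zA-Z0-9_],
- # and convert NaN values to None before sending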
- for k, v in msg.items():
- if isinstance(v, Dict):
- for k1, v1 in v.items():
- new_k = f"{k}_{k1}".split(" ")[0]
- new_k = re.sub(r"[^a-zA-Z0-9_]", "_", new_k)
- data[new_k] = nan2none(v1)
- else:
- new_k = k.split(" ")[0]
- new_k = re.sub(r"[^a-zA-Z0-9_]", "_", new_k)
- data[new_k] = nan2none(v)
-
- if os.getenv("CLUSTER_NAME"):
- data.update({"cluster": os.getenv("CLUSTER_NAME")})
- if msg_type == "train_metrics":
- data.update({"msg_type": "train_metrics"})
- elif msg_type == "init_time":
- data.update({"msg_type": "init_time"})
- elif msg_type == "stage_time":
- data.update({"msg_type": "stage_time"})
- send_meta(data, timeout=0.1)
- except Exception as e:
- logger.warning(f"send heartbeat meet error: {e}")
-
-
-def send_feishu_msg_with_webhook(webhook: str, title: str, message: str):
- """
- Use Feishu robot to send messages with the given webhook.
-
- Args:
- webhook (str): The webhook to be used to send message.
- title (str): The message title.
- message (str): The message body.
-
- Returns:
- The response from the request, or None if an exception was caught.
-
- Raises:
- Exception: An exception raised by the HTTP post request.
-
- """
-
- headers = {"Content-Type": "application/json;charset=utf-8"}
- msg_body = {
- "timestamp": int(time.time()),
- "msg_type": "post",
- "content": {
- "post": {
- "zh_cn": {
- "title": title,
- "content": [
- [
- {
- "tag": "text",
- "text": message,
- },
- ],
- ],
- },
- },
- },
- }
-
- try:
- res = requests.post(webhook, data=json.dumps(msg_body), headers=headers, timeout=30)
- res = res.json()
- print(f"Feishu webhook response: {res}")
- except Exception as err: # pylint: disable=W0703
- print(f"HTTP Post error: {err}")
- res = None
-
- return res
diff --git a/internlm/monitor/monitor.py b/internlm/monitor/monitor.py
deleted file mode 100644
index 6a3b9dc..0000000
--- a/internlm/monitor/monitor.py
+++ /dev/null
@@ -1,232 +0,0 @@
-import os
-import signal
-import socket
-import time
-from contextlib import contextmanager
-from threading import Thread
-
-from internlm.core.context import global_context as gpc
-from internlm.monitor.alert import send_feishu_msg_with_webhook
-from internlm.utils.common import SingletonMeta
-
-from .utils import get_job_key, set_env_var
-
-
-def send_alert_message(address: str = None, title: str = None, message: str = None):
- """
- Send alert messages to the given Feishu webhook address in log rank.
-
- Args:
- address (str): The alert address to be used to send message, defaults to None.
- title (str): The message title, defaults to None.
- message (str): The message body, defaults to None.
- """
-
- if address is not None and gpc.is_rank_for_log():
- send_feishu_msg_with_webhook(
- webhook=address,
- title=title if title else get_job_key(),
- message=message,
- )
-
-
-class MonitorTracker(Thread):
- """
- Track job status and alert to Feishu during job training.
-
- Args:
- alert_address (str): The Feishu webhook address for sending alerting messages.
- check_interval (float): The interval in seconds for monitoring checks. Defaults to 300.
- loss_spike_limit (float): The threshold for detecting loss value spikes. Defaults to 1.5.
- """
-
- def __init__(
- self,
- alert_address: str,
- check_interval: float = 300,
- loss_spike_limit: float = 1.5,
- ):
- super().__init__()
- self.alert_address = alert_address
- self.check_interval = check_interval
- self.loss_spike_limit = loss_spike_limit
- self.last_active_time = -1
- self.last_loss_value = -1
- self.stopped = False
- self.start()
-
- def run(self):
- """
- start the monitor tracker.
- """
-
- while not self.stopped:
- try:
- self._check_stuck()
- self._check_loss_spike()
- except Exception:
- continue
- time.sleep(self.check_interval)
-
- def _check_stuck(self):
- """
- Check training status for potential stuck condition.
- """
-
- new_active_time = -1
- if os.getenv("LAST_ACTIVE_TIMESTAMP") is not None:
- new_active_time = os.getenv("LAST_ACTIVE_TIMESTAMP")
- if int(new_active_time) <= int(self.last_active_time) and new_active_time != -1:
- self._send_alert("Training may be in stuck status, please check it.")
- self.last_active_time = new_active_time
-
- def _check_loss_spike(self):
- """
- Check for loss value spikes.
- """
-
- if gpc.is_rank_for_log():
- new_loss_value = -1
- new_step_id = -1
- if os.getenv("LOSS") is not None:
- new_loss_value = os.getenv("LOSS")
- if os.getenv("STEP_ID") is not None:
- new_step_id = os.getenv("STEP_ID")
-
- if (float(new_loss_value) / float(self.last_loss_value)) > self.loss_spike_limit and new_loss_value != -1:
- assert int(new_step_id) >= 0
- self._send_alert(
- f"Checking periodically: Loss spike may be happened in step {new_step_id}, "
- f"loss value from {self.last_loss_value} to {new_loss_value}, please check it."
- )
-
- self.last_loss_value = new_loss_value
-
- def _send_alert(self, message):
- """
- Send alerting message to the Feishu webhook address.
-
- Args:
- message (str): The alerting message to be sent.
- """
-
- send_alert_message(
- address=self.alert_address,
- message=message,
- )
-
- def stop(self):
- """
- Stop the monitor tracker.
- """
-
- self.stopped = True
-
-
-class MonitorManager(metaclass=SingletonMeta):
- """
- Monitor Manager for managing monitor thread and monitoring training status.
- """
-
- def __init__(self, loss_spike_limit: float = 1.5) -> None:
- self.monitor_thread = None
- self.loss_spike_limit = loss_spike_limit
- self.last_step_loss = -1
-
- def monitor_loss_spike(self, alert_address: str = None, step_count: int = 0, cur_step_loss: float = 0.0):
- """Check loss value, if loss spike occurs, send alert message to Feishu."""
- set_env_var(key="LOSS", value=cur_step_loss)
- set_env_var(key="STEP_ID", value=step_count)
-
- if self.last_step_loss != -1 and cur_step_loss > self.loss_spike_limit * self.last_step_loss:
- send_alert_message(
- address=alert_address,
- message=(
- f"Checking step by step: Loss spike may be happened in step {step_count}, "
- f"loss value from {self.last_step_loss} to {cur_step_loss}, please check it."
- ),
- )
- self.last_step_loss = cur_step_loss
-
- def monitor_exception(self, alert_address: str = None, excp_info: str = None):
- """Catch and format exception information, send alert message to Feishu."""
- filtered_trace = excp_info.split("\n")[-10:]
- format_trace = ""
- for line in filtered_trace:
- format_trace += "\n" + line
- send_alert_message(
- address=alert_address,
- message=f"Catch Exception from {socket.gethostname()} with rank id {gpc.get_global_rank()}:{format_trace}",
- )
-
- def handle_sigterm(self, alert_address: str = None):
- """Catch SIGTERM signal, and send alert message to Feishu."""
-
- def sigterm_handler(sys_signal, frame):
- print("receive frame: ", frame)
- print("receive signal: ", sys_signal)
- send_alert_message(
- address=alert_address,
- message=f"Process received signal {signal} and exited.",
- )
-
- signal.signal(signal.SIGTERM, sigterm_handler)
-
- def start_monitor(
- self,
- job_name: str,
- alert_address: str,
- monitor_interval_seconds: int = 300,
- loss_spike_limit: float = 1.5,
- ):
- """
- Initialize and start monitor thread for checking training job status, loss spike and so on.
-
- Args:
- job_name (str): The training job name.
- alert_address (str): The Feishu webhook address for sending alert messages.
- monitor_interval_seconds (int): The time of monitor interval in seconds, defaults to 300.
- loss_spike_limit (float): The threshold ratio of the current loss to the previous loss above which
- a loss spike is assumed to have occurred, defaults to 1.5.
- """
-
- # initialize some variables for monitoring
- set_env_var(key="JOB_NAME", value=job_name)
-
- # start a monitor thread, periodically check the training status
- self.monitor_thread = MonitorTracker(
- alert_address=alert_address,
- check_interval=monitor_interval_seconds,
- loss_spike_limit=loss_spike_limit,
- )
-
- def stop_monitor(self):
- """Stop the monitor and alert thread."""
- if self.monitor_thread is not None:
- self.monitor_thread.stop()
-
-
-monitor_manager = MonitorManager()
-
-
-@contextmanager
-def initialize_monitor_manager(job_name: str = None, alert_address: str = None):
- """
- Initialize monitor manager for monitoring training lifetime and alerting exception info to Feishu.
-
- Args:
- job_name (str): The training job name.
- alert_address (str): The Feishu webhook address for sending alert messages.
- """
-
- if alert_address is not None:
- try:
- monitor_manager.start_monitor(job_name=job_name, alert_address=alert_address)
- monitor_manager.handle_sigterm(alert_address=alert_address)
- send_alert_message(address=alert_address, message=f"Training in {socket.gethostname()} is starting.")
- yield
- finally:
- send_alert_message(address=alert_address, message=f"Training in {socket.gethostname()} completed.")
- monitor_manager.stop_monitor()
- else:
- yield
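-
-
-# Typical usage (sketch): wrap the training entry point so that start/completion alerts and the
-# monitor thread are managed automatically, e.g.
-#   with initialize_monitor_manager(job_name="my_job", alert_address=feishu_webhook_url):
-#       train()
-# where "my_job", feishu_webhook_url and train() are placeholders, not names defined in this module.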
diff --git a/internlm/monitor/utils.py b/internlm/monitor/utils.py
deleted file mode 100644
index f64c7dc..0000000
--- a/internlm/monitor/utils.py
+++ /dev/null
@@ -1,32 +0,0 @@
-import os
-from datetime import datetime
-
-
-def now_time():
- return datetime.now().strftime("%b%d_%H-%M-%S")
-
-
-def set_env_var(key, value):
- os.environ[str(key)] = str(value)
-
-
-def get_job_id():
- job_id = "none"
- if os.getenv("SLURM_JOB_ID") is not None:
- job_id = os.getenv("SLURM_JOB_ID")
- elif os.getenv("K8S_WORKSPACE_ID") is not None:
- job_id = os.getenv("K8S_WORKSPACE_ID")
-
- return job_id
-
-
-def get_job_name():
- job_name = f"unknown-{now_time()}"
- if os.getenv("JOB_NAME") is not None:
- job_name = os.getenv("JOB_NAME")
-
- return job_name
-
-
-def get_job_key():
- return f"{get_job_id()}_{get_job_name()}"
diff --git a/internlm/solver/__init__.py b/internlm/solver/__init__.py
deleted file mode 100644
index 773f2dc..0000000
--- a/internlm/solver/__init__.py
+++ /dev/null
@@ -1,8 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-from .beta2_scheduler import Beta2Scheduler
-from .lr_scheduler import FineTuneCosineAnnealingWarmupLR
-from .optimizer import HybridZeroOptimizer
-
-__all__ = ["Beta2Scheduler", "FineTuneCosineAnnealingWarmupLR", "HybridZeroOptimizer"]
diff --git a/internlm/solver/beta2_scheduler.py b/internlm/solver/beta2_scheduler.py
deleted file mode 100644
index 904f4e0..0000000
--- a/internlm/solver/beta2_scheduler.py
+++ /dev/null
@@ -1,36 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import torch
-
-
-class Beta2Scheduler:
- """
- Beta2Scheduler: anneals Adam's beta2 coefficient over training iterations.
- """
-
- def __init__(self, optimizer: torch.optim.Adam, init_beta2, c=0.8, cur_iter=-1):
- self.cur_iter = 0 if cur_iter == -1 else cur_iter
- self.init_beta2 = init_beta2
- self.c = c
- self.optimizer = optimizer
- assert isinstance(
- optimizer, (torch.optim.Adam, torch.optim.AdamW)
- ), "should use Adam optimzier, which has beta2"
-
- def step(self, cur_iter=None):
- if cur_iter is None:
- self.cur_iter += 1
- else:
- self.cur_iter = cur_iter
-
- new_beta2 = self.get_beta2()
- for pg in self.optimizer.param_groups:
- beta1, _ = pg["betas"]
- pg["betas"] = (beta1, new_beta2)
-
- def get_beta2(self):
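- # beta2 is annealed toward 1 following 1 - cur_iter**(-c), clamped below by init_beta2,
- # so early iterations keep the configured beta2 and later iterations approach 1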
- if self.c <= 0:
- return self.init_beta2
- scale = 1 - (1 / self.cur_iter**self.c)
- return max(self.init_beta2, scale)
diff --git a/internlm/solver/lr_scheduler.py b/internlm/solver/lr_scheduler.py
deleted file mode 100644
index bcbca88..0000000
--- a/internlm/solver/lr_scheduler.py
+++ /dev/null
@@ -1,135 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import json
-
-from torch.optim.lr_scheduler import CosineAnnealingLR as _CosineAnnealingLR
-from torch.optim.lr_scheduler import _LRScheduler
-
-
-class WarmupScheduler(_LRScheduler):
- """Starts with a linear warmup lr schedule until it reaches N epochs then applies
- the specific scheduler (For example: ReduceLROnPlateau).
-
- Args:
- optimizer (:class:`torch.optim.Optimizer`): Wrapped optimizer.
- warmup_epochs (int): Number of epochs to linearly warmup lr until starting applying the scheduler.
- after_scheduler (:class:`torch.optim.lr_scheduler`): After target_epoch, use this scheduler.
- last_epoch (int, optional): The index of the last epoch, defaults to -1. When last_epoch=-1,
- the schedule starts from the beginning and the initial lr is set to lr.
- """
-
- def __init__(self, optimizer, warmup_epochs, after_scheduler, last_epoch=-1):
- self.warmup_epochs = int(warmup_epochs)
- self.after_scheduler = after_scheduler
- self.finished = False
- super().__init__(optimizer, last_epoch)
-
- def state_dict(self):
- state_dict = {key: value for key, value in self.__dict__.items() if key != "optimizer"}
- if isinstance(state_dict["after_scheduler"], (_LRScheduler, _CosineAnnealingLR)):
- state_dict["after_scheduler_type"] = type(state_dict["after_scheduler"]).__name__
- state_dict["after_scheduler_dict"] = state_dict["after_scheduler"].state_dict()
- del state_dict["after_scheduler"]
- else:
- raise NotImplementedError()
- return state_dict
-
- def load_state_dict(self, state_dict):
- # state_dict = {key: value for key, value in self.__dict__.items() if key not in 'optimizer'}
- for key in list(self.__dict__.keys()):
- if key in state_dict:
- self.__dict__[key] = state_dict[key]
- if isinstance(self.after_scheduler, (_LRScheduler, _CosineAnnealingLR)):
- assert type(self.after_scheduler).__name__ == state_dict["after_scheduler_type"]
- # state_dict['after_scheduler_dict'] = state_dict['after_scheduler'].state_dict()
- self.after_scheduler.load_state_dict(state_dict["after_scheduler_dict"])
- # del state_dict['after_scheduler']
- else:
- raise NotImplementedError()
- return state_dict
-
- def get_lr(self):
- if self.last_epoch >= self.warmup_epochs:
- if not self.finished:
- self.after_scheduler.base_lrs = self.base_lrs
- self.finished = True
- return self.after_scheduler.get_lr()
-
- return [(self.last_epoch + 1) / self.warmup_epochs * lr for lr in self.base_lrs]
-
- def step(self, epoch=None):
- if self.finished:
- if epoch is None:
- self.after_scheduler.step(None)
- self._last_lr = self.after_scheduler.get_last_lr()
- else:
- self.after_scheduler.step(epoch - self.warmup_epochs)
- self._last_lr = self.after_scheduler.get_last_lr()
- else:
- return super().step(epoch)
-
-
-class CosineAnnealingWarmupLR(WarmupScheduler):
- """Cosine annealing learning rate scheduler with learning rate warmup. A linear warmup schedule will be applied.
-
- Args:
- optimizer (:class:`torch.optim.Optimizer`): Wrapped optimizer.
- total_steps (int): Number of total training steps.
- warmup_steps (int, optional): Number of warmup steps, defaults to 0.
- eta_min (int, optional): Minimum learning rate, defaults to 0.
- last_epoch (int, optional): The index of the last epoch, defaults to -1. When last_epoch=-1,
- the schedule starts from the beginning and the initial lr is set to lr.
- """
-
- def __init__(self, optimizer, total_steps: int, warmup_steps: int = 0, eta_min: float = 0.0, last_epoch: int = -1):
- base_scheduler = _CosineAnnealingLR(
- optimizer, total_steps - warmup_steps, eta_min=eta_min, last_epoch=last_epoch
- )
- super().__init__(optimizer, warmup_steps, base_scheduler)
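- # schedule: linear warmup for `warmup_steps` steps, then cosine decay from the base lr
- # down to `eta_min` over the remaining (total_steps - warmup_steps) steps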
-
-
-class FineTuneCosineAnnealingWarmupLR(CosineAnnealingWarmupLR):
- """
- FineTune Cosine Annealing Warmup LR.
-
- Args:
- optimizer: The optimizer object.
- total_steps (int): The number of total steps.
- init_steps (int): The number of init steps, default is 0.
- warmup_ratio (float): The ratio of warmup steps to the total number of steps, default is 0.0.
- eta_min (float): The minimum learning rate, default is 0.0.
- last_epoch: Last epoch, default is -1.
-
- """
-
- def __init__(
- self,
- optimizer,
- total_steps: int,
- init_steps: int = 0,
- warmup_ratio: float = 0.0,
- eta_min: float = 0.0,
- last_epoch: int = -1,
- ):
- self._init_steps = init_steps
- self._warmup_steps = int(total_steps * warmup_ratio)
- # Use this value to calculate the lr of warmup, because warmup_epochs = init_steps + warmup_steps
- super().__init__(optimizer, total_steps, self._warmup_steps + init_steps, eta_min, last_epoch)
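- # resulting schedule: lr is 0 for the first `init_steps` steps, warms up linearly over the
- # next `total_steps * warmup_ratio` steps, then follows cosine annealing for the rest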
-
- def get_lr(self):
- if self.last_epoch >= self.warmup_epochs:
- if not self.finished: # pylint: disable=E0203
- # Setting this flag avoids the lr-scheduler warning when warmup finishes and we switch to after_scheduler
- self.after_scheduler._get_lr_called_within_step = True
- self.after_scheduler.base_lrs = self.base_lrs
- self.finished = True
- return self.after_scheduler.get_lr()
-
- elif self.last_epoch >= self._init_steps:
- return [(self.last_epoch + 1 - self._init_steps) / self._warmup_steps * lr for lr in self.base_lrs]
- else:
- return [0 for lr in self.base_lrs]
-
- def __str__(self):
- return json.dumps(self.state_dict(), indent=4, sort_keys=True)
diff --git a/internlm/solver/optimizer/__init__.py b/internlm/solver/optimizer/__init__.py
deleted file mode 100644
index 99051f4..0000000
--- a/internlm/solver/optimizer/__init__.py
+++ /dev/null
@@ -1,6 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-from .hybrid_zero_optim import HybridZeroOptimizer, reload_zero_fp32_buff
-
-__all__ = ["HybridZeroOptimizer", "reload_zero_fp32_buff"]
diff --git a/internlm/solver/optimizer/hybrid_zero_optim.py b/internlm/solver/optimizer/hybrid_zero_optim.py
deleted file mode 100644
index 4db758d..0000000
--- a/internlm/solver/optimizer/hybrid_zero_optim.py
+++ /dev/null
@@ -1,817 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import math
-from functools import partial
-from itertools import product
-
-import torch
-import torch.distributed as dist
-from torch.optim import Optimizer
-
-from internlm.core.context import Config, ParallelMode
-from internlm.core.context import global_context as gpc
-from internlm.monitor import send_alert_message
-from internlm.solver.optimizer.store import (
- BucketStore,
- GradientStore,
- ParameterStore,
- TensorBucket,
-)
-from internlm.solver.optimizer.utils import (
- DynamicGradScaler,
- ParamBcastSyncHandler,
- flatten,
- get_grad_accumulate_object,
- has_inf_or_nan,
- reduce_tensor,
- release_param_grad,
- split_half_float_double,
- sync_param,
-)
-from internlm.utils.common import get_current_device
-from internlm.utils.logger import get_logger
-from internlm.utils.megatron_timers import megatron_timer as timer
-from internlm.utils.timeout import llm_timeout
-
-from .utils import compute_norm
-
-inf = math.inf
-logger = get_logger(__file__)
-
-
-class BaseOptimizer(Optimizer):
- """
- Base Optimizer.
- """
-
- def __init__(self, optim: Optimizer): # pylint: disable=W0231
- self.optim = optim
-
- @property
- def param_groups(self):
- return self.optim.param_groups
-
- @property
- def defaults(self):
- return self.optim.defaults
-
- def add_param_group(self, *args, **kwargs):
- return self.optim.add_param_group(*args, **kwargs)
-
- def step(self, *args, **kwargs):
- return self.optim.step(*args, **kwargs)
-
- def zero_grad(self, *args, **kwargs):
- self.optim.zero_grad(*args, **kwargs)
-
- def load_state_dict(self, *args, **kwargs):
- self.optim.load_state_dict(*args, **kwargs)
-
- def state_dict(self):
- return self.optim.state_dict()
-
- def backward(self, loss):
- loss.backward()
-
- def backward_by_grad(self, tensor, grad):
- torch.autograd.backward(tensors=tensor, grad_tensors=grad)
-
- def clip_grad_norm(self):
- pass
-
-
-class HybridZeroOptimizer(BaseOptimizer):
- """
- Hybrid ZeRO optimizer: shards fp32 master weights and optimizer states across the ZeRO1
- process group while keeping fp16 parameters for forward/backward, and reduces gradients
- in buckets with optional overlap of gradient synchronization and parameter broadcast.
- """
-
- def __init__(
- self,
- optimizer: Optimizer,
- cpu_offload=False,
- grad_scal_cfg: Config = None,
- zero_cfg: Config = None,
- param_bcast_sync_handler: ParamBcastSyncHandler = None,
- ):
- # DynamicGradScaler related args
- if gpc.config.model.dtype is torch.float32:
- initial_scale = 1
- else:
- initial_scale = grad_scal_cfg.fp16.initial_scale
- min_scale = grad_scal_cfg.fp16.min_scale
- growth_interval = grad_scal_cfg.fp16.growth_interval
- growth_factor = grad_scal_cfg.growth_factor
- backoff_factor = grad_scal_cfg.backoff_factor
- hysteresis = grad_scal_cfg.hysteresis
- max_scale = grad_scal_cfg.max_scale
-
- # Zero related args
- reduce_bucket_size = zero_cfg.reduce_bucket_size
- clip_grad_norm = zero_cfg.clip_grad_norm
- self._overlap_sync_grad = zero_cfg.overlap_sync_grad
- self._overlap_sync_param = zero_cfg.overlap_sync_param
-
- super().__init__(optim=optimizer)
-
- self._dtype = self.optim.param_groups[0]["params"][0].dtype
- self._cpu_offload = cpu_offload
- self._zero_local_rank = gpc.get_local_rank(ParallelMode.ZERO1)
- self._zero_world_size = gpc.get_world_size(ParallelMode.ZERO1)
- self._broadcast_parallel_mode = ParallelMode.ZERO1
-
- # ParameterStore will manage the tensor buffers used for zero
- # it will not manage the tensors used by mixed precision training
- self._param_store = ParameterStore(ParallelMode.ZERO1)
- self._grad_store = GradientStore(ParallelMode.DATA)
- self._bucket_store = BucketStore(ParallelMode.DATA)
- self._bucket_in_progress = []
-
- # fp16 and fp32 params for mixed precision training
- self._fp16_param_groups = dict()
- self._fp32_flat_param_groups_of_current_rank = dict()
-
- # communication params
- # self._overlap_communication = overlap_communication
- self._reduce_bucket_size = reduce_bucket_size
-
- self._comm_bcast_stream = torch.cuda.Stream()
-
- # gradient scaler
- self.grad_scaler = DynamicGradScaler(
- initial_scale=initial_scale,
- min_scale=min_scale,
- growth_factor=growth_factor,
- backoff_factor=backoff_factor,
- growth_interval=growth_interval,
- hysteresis=hysteresis,
- max_scale=max_scale,
- )
- self._found_overflow = torch.cuda.FloatTensor([0], device=get_current_device())
-
- # gradient clipping
- self._clip_grad_norm = clip_grad_norm
-
- # need to record, for each parameter group, the ranks that are not assigned any parameters.
- self.param_group_has_params = []
- self.param_group_no_params_ranks = []
- self.padding_grad = torch.zeros([32], dtype=self._dtype, device=get_current_device())
- self.padding_tensor = torch.zeros([32], dtype=self._dtype, device=get_current_device())
-
- self.rank_unique_id = (
- f"gpus-{gpc.get_world_size(ParallelMode.GLOBAL)}_"
- + f"pp-{gpc.get_local_rank(ParallelMode.PIPELINE)}_"
- + f"tp-{gpc.get_local_rank(ParallelMode.TENSOR)}_"
- + f"zo-{self._zero_local_rank}.pt"
- )
- self.params_per_rank_id_dict = []
- self._param_bcast_sync_handler = param_bcast_sync_handler
- if self._overlap_sync_param:
- assert self._param_bcast_sync_handler is not None
-
- # iterate over the param group in the optimizer
- # partition these param groups for data parallel training
- # and add buffers to parameter store for future access
- for group_id, param_group in enumerate(self.optim.param_groups):
- group_params = param_group["params"]
-
- # add the fp16 params to fp16_param_groups for bookkeeping
- self._fp16_param_groups[group_id] = group_params
-
- # assign parameters to ranks; the params in the list are sorted
- params_per_rank, no_params_ranks = self._partition_param_list(group_params)
- self.param_group_no_params_ranks.append(no_params_ranks)
- self.param_group_has_params.append(self._zero_local_rank not in no_params_ranks)
-
- # store the mapping from param to rank; each param should belong to only one rank
- for rank, params in enumerate(params_per_rank):
- # check whether any rank is not assigned params.
- if len(params) != 0:
- self._param_store.add_fp16_param_list_by_rank_group(rank, group_id, params)
- for param in params:
- setattr(param, "group_id", group_id)
- self._param_store.set_param_to_rank(param, rank)
-
- # move to cpu to make room to create the flat tensor
- for param in group_params:
- param.data = param.data.cpu()
-
- # flatten the reordered tensors
- for rank in range(self._zero_world_size):
- # No flat fp16 buffer is allocated if the process has no parameters.
- if rank not in self.param_group_no_params_ranks[group_id]:
- tensor_list = self._param_store.get_fp16_params_by_rank_group(rank, group_id)
- with torch.no_grad():
- flat_tensor = flatten(tensor_list)
- flat_tensor = flat_tensor.data.cuda()
- self._param_store.add_flat_fp16_param_by_rank_group(rank, group_id, flat_tensor)
- sync_param(flat_tensor=flat_tensor, tensor_list=tensor_list)
-
- # create a copy of fp32 weights of the parameters for which this rank is responsible
- # No flat fp32 buffer is allocated if the process has no parameters.
- if self.param_group_has_params[group_id]:
- fp16_flat_current_rank = self._param_store.get_flat_fp16_param_by_rank_group(
- self._zero_local_rank, group_id
- )
- fp32_flat_current_rank = fp16_flat_current_rank.float()
- device = "cpu" if self._cpu_offload else get_current_device()
- fp32_flat_current_rank = fp32_flat_current_rank.to(device)
- fp32_flat_current_rank.requires_grad = True
- self._fp32_flat_param_groups_of_current_rank[group_id] = fp32_flat_current_rank
-
- # need to replace the params in the `params` field in the optimizer
- # so that when the optimizer calls step(), it only updates the tensors
- # managed by this data parallel rank
- param_group["params"] = [fp32_flat_current_rank]
-
- # set reduction state
- for param in self._fp16_param_groups[group_id]:
- self._param_store.set_param_reduction_state(param, False)
-
- assert len(self._fp16_param_groups) != 0
-
- # If this rank is not assigned any parameters in any group, 'has_params' is False.
- self.has_params = sum(self.param_group_has_params) != 0
- # flag used to skip unnecessary gradient reduce operation when gradient accumulation is enabled.
- self.skip_grad_reduce = False
-
- # reduction hook is only used if overlapping communication
- # if it is stage 1 without overlapping, no hook will be attached
- if self._overlap_sync_grad:
- self._attach_reduction_hook()
-
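A rough back-of-the-envelope illustration of what the sharding set up above buys: with ZeRO-1, each rank keeps the flat fp32 master copy (and the Adam states built on it) only for its own partition instead of the full model. The numbers below (7B parameters, zero1 world size 8, fp32 master plus two Adam moments) are illustrative assumptions, not values taken from this repository.

```python
# Illustrative only: memory for fp32 master weights + Adam exp_avg + exp_avg_sq.
params = 7e9                                    # hypothetical 7B-parameter model
zero1_world_size = 8                            # hypothetical ZeRO-1 partition count
full_bytes = params * (4 + 4 + 4)               # every rank keeps everything: ~84 GB
sharded_bytes = full_bytes / zero1_world_size   # each rank keeps only its shard: ~10.5 GB
print(f"{full_bytes / 1e9:.1f} GB -> {sharded_bytes / 1e9:.1f} GB per rank")
```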
- @property
- def zero_local_rank(self):
- return self._zero_local_rank
-
- @property
- def zero_world_size(self):
- return self._zero_world_size
-
- @property
- def dtype(self):
- return self._dtype
-
- @property
- def loss_scale(self):
- return self.grad_scaler.scale
-
- @property
- def num_param_groups(self):
- return len(self._fp16_param_groups)
-
- def _partition_param_list(self, param_list):
- no_params_ranks = []
- params_per_rank = [[] for _ in range(self._zero_world_size)]
- numel_per_rank = [0 for _ in range(self._zero_world_size)]
- self.params_per_rank_id_dict.append([[] for _ in range(self._zero_world_size)])
-
- sorted_params = sorted(param_list, key=lambda x: x.numel(), reverse=True)
- for i, param in enumerate(sorted_params):
- global_id = str(i)
- for j in range(len(param.size())):
- global_id = "_".join([global_id, str(param.size()[j])])
- if self._overlap_sync_param:
- rank_to_go = self._param_bcast_sync_handler.get_rank_by_param(param)
- else:
- rank_to_go = numel_per_rank.index(min(numel_per_rank))
- params_per_rank[rank_to_go].append(param)
- self.params_per_rank_id_dict[-1][rank_to_go].append(global_id)
- numel_per_rank[rank_to_go] += param.numel()
-
- # check whether any rank is not assigned any parameters.
- for rank, params in enumerate(params_per_rank):
- if len(params) == 0:
- no_params_ranks.append(rank)
-
- if gpc.is_rank_for_log():
- logger.info(f"Number of elements on ranks: {numel_per_rank}, rank:{gpc.get_global_rank()}")
-
- return params_per_rank, set(no_params_ranks)
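A minimal sketch of the greedy strategy `_partition_param_list` uses when parameter broadcast is not overlapped: parameters are visited from largest to smallest and each one goes to the ZeRO rank that currently holds the fewest elements, which keeps the per-rank element counts roughly balanced.

```python
def greedy_partition(numels, world_size):
    """Assign parameter sizes to ranks, largest first, always to the least-loaded rank."""
    buckets = [[] for _ in range(world_size)]
    load = [0] * world_size
    for n in sorted(numels, reverse=True):
        rank = load.index(min(load))   # rank with the fewest elements so far
        buckets[rank].append(n)
        load[rank] += n
    return buckets, load

# e.g. greedy_partition([100, 80, 60, 40, 20], 2) -> per-rank loads [160, 140]
```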
-
- def _attach_reduction_hook(self):
- # we iterate over the fp16 params
- # on each param, we register a hook to its AccumulateGrad object
- for group_id in range(self.num_param_groups):
- param_group = self._fp16_param_groups[group_id]
- for param in param_group:
- if param.requires_grad:
- reduce_rank = None
-
- def _define_and_attach(param, reduce_rank=None):
- # get the AccumulateGrad object of the param itself
- # If these objects are not kept, reduction hooks may not be attached successfully.
- accum_grad_obj = get_grad_accumulate_object(param)
- self._grad_store.add_accumulate_grad_object(accum_grad_obj)
-
- reduction_func = partial(
- self._store_and_try_reduce_grads_by_bucket,
- param=param,
- reduce_rank=reduce_rank,
- )
-
- # define hook
- # NOT IMPORTANT BUT GOOD TO KNOW:
- # the args here are not grads, but allow_unreachable and accumulate_grad
- def reduce_grad_hook(*args): # pylint: disable=W0613
- if self.skip_grad_reduce is False:
- reduction_func()
-
- accum_grad_obj.register_hook(reduce_grad_hook)
-
- _define_and_attach(param, reduce_rank)
-
- def _store_and_try_reduce_grads_by_bucket(self, param, reduce_rank=None):
- param_size = param.numel()
-
- # check if the bucket is full
- # if full, will reduce the grads already in the bucket
- # after reduction, the bucket will be empty
- if self._bucket_store.num_elements_in_bucket(reduce_rank) + param_size > self._reduce_bucket_size:
- self._reduce_grads_stored_in_bucket(reduce_rank, last_bucket=False)
-
- # the param must not have been reduced already, to ensure correctness
- is_param_reduced = self._param_store.is_param_reduced(param)
- if is_param_reduced:
- msg = (
- f"Parameter of size ({param.size()}) has already been reduced, "
- + "duplicate reduction will lead to arithmetic incorrectness"
- )
- raise RuntimeError(msg)
-
- # the param must have grad for reduction
- assert param.grad is not None, f"Parameter of size ({param.size()}) has None grad, cannot be reduced"
-
- self._bucket_store.add_num_elements_in_bucket(param_size, reduce_rank)
- self._bucket_store.add_grad(param.grad, reduce_rank)
- self._bucket_store.add_param(param, reduce_rank)
-
- def _reduce_grads_stored_in_bucket(self, reduce_rank=None, last_bucket=False):
- # reduce grads
- self._reduce_grads_by_rank(
- reduce_rank=reduce_rank,
- grads=self._bucket_store.get_grad(reduce_rank=reduce_rank),
- bucket_size=self._bucket_store.num_elements_in_bucket(reduce_rank),
- )
-
- params_in_bucket = self._bucket_store.get_param(reduce_rank=reduce_rank)
-
- for param in params_in_bucket:
- # the is_param_reduced flag should be False, showing that
- # this param has not been reduced before calling self._reduce_grads_by_rank
- is_param_reduced = self._param_store.is_param_reduced(param)
-
- if is_param_reduced:
- msg = (
- f"Parameter of size ({param.size()}) has been reduced, "
- + "duplicate reduction will lead to arithmetic incorrectness"
- )
- raise RuntimeError(msg)
-
- # update the flag
- self._param_store.set_param_reduction_state(param, True)
-
- if self._param_store.belongs_to_current_rank(param):
- self._param_store.add_reduced_param_for_compute_norm(param, last_bucket)
- else:
- self._param_store.add_previous_reduced_param(param)
-
- self._bucket_store.reset_by_rank(reduce_rank)
-
- def _reduce_grads_by_rank(self, reduce_rank, grads, bucket_size):
- grad_buckets_by_dtype = split_half_float_double(grads)
- next_bucket_list = []
- # add parameters into bucket for reduction
- for tensor_list in grad_buckets_by_dtype:
- param_bucket = TensorBucket(size=bucket_size)
- for tensor in tensor_list:
- param_bucket.add_to_bucket(tensor, allow_oversize=True)
- if not param_bucket.is_empty():
- self._reduce_and_copy(bucket=param_bucket, reduce_rank=reduce_rank)
- next_bucket_list.append(param_bucket)
-
- # wait for the completion of the previous bucket list reduction, and do unflatten_and_copy()
- # here we can also overlap the communication with some memcpy operation caused by bucket.flatten()
- for bucket in self._bucket_in_progress:
- bucket.commu_handle.wait()
- bucket.unflatten_and_copy()
- bucket.empty()
- self._bucket_in_progress = []
- self._param_store.clear_grads_of_previous_reduced_params()
-
- # after the completion of bucket list reduction, add new buckets into _bucket_in_progress
- self._bucket_in_progress = next_bucket_list.copy()
-
- def _reduce_and_copy(self, bucket: TensorBucket, reduce_rank):
- # flatten the tensors and do allreduce
- bucket.flatten()
- bucket.commu_handle = reduce_tensor(
- tensor=bucket.get_flat_tensor(),
- dtype=None,
- dst_rank=reduce_rank,
- parallel_mode=ParallelMode.DATA,
- )
-
- # update the reduced tensor
- if reduce_rank is None or reduce_rank == self._zero_local_rank:
- bucket.set_unflatten_and_copy_flag(flag=True)
-
- def _has_inf_or_nan(self, tensor):
- try:
- tensor_mean = float(tensor.mean())
- except RuntimeError as instance:
- # We want to check whether this is actually an overflow exception.
- # RuntimeError could come from a different error.
- # If so, we still want the exception to propagate.
- if "value cannot be converted" not in instance.args[0]:
- raise
- return True
- else:
- if tensor_mean == float("inf") or tensor_mean == -float("inf") or tensor_mean != tensor_mean:
- return True
- return False
-
- def _sync_grad(self):
- # update param already reduced flag
- reduction_states = self._param_store.get_param_reduction_states()
- for tensor, _ in reduction_states.items():
- reduction_states[tensor] = False
- self._param_store.reset_reduced_data_for_compute_norm()
-
- # accumulate gradient
- avg_gradients = self._grad_store._averaged_gradients
- for group_id in range(self.num_param_groups):
- # the following operations are performed only on the rank to which parameters are assigned.
- if self._zero_local_rank not in self.param_group_no_params_ranks[group_id]:
- param_group = self._param_store.get_fp16_params_by_rank_group(self._zero_local_rank, group_id)
-
- if group_id not in avg_gradients:
- avg_gradients[group_id] = []
-
- param_idx = 0
- for param in param_group:
- if param.grad is not None:
- if len(avg_gradients[group_id]) == param_idx:
- avg_gradients[group_id].append(param.grad)
- else:
- avg_gradients[group_id][param_idx].add_(param.grad)
- param_idx += 1
-
- # the gradients needed are stored in the avg_gradients buffer
- # thus, we can clear them here via zero_grad()
- self.zero_grad()
-
- def zero_grad(self, set_to_none=True):
- """
- Set parameter gradients to zero. If set_to_none = True, gradients
- will be set to None to save memory.
-
- :param set_to_none: Whether set the gradient to None. Default value is True.
- :type set_to_none: bool
- """
- for _, param_group in self._fp16_param_groups.items():
- for param in param_group:
- if set_to_none:
- param.grad = None
- elif param.grad is not None:
- param.grad.detach()
- param.grad.zero_()
- else:
- pass
-
- def backward(self, loss, retain_graph=False):
- loss = self.loss_scale * loss
- loss.backward(retain_graph=retain_graph)
-
- # Gradients may not be fully synchronized here.
-
- def _compute_norm_with_stage(
- self,
- group_id: int = 0,
- last_bucket: bool = False,
- last_stage: bool = False,
- previous_norm=None,
- ):
- # compute norm for gradients that have been reduced
- params, grads = self._param_store.get_reduced_param_for_compute_norm(group_id=group_id, last_bucket=last_bucket)
- if len(params) == 0:
- grads = [self.padding_grad]
- params = [self.padding_tensor]
-
- norm = 0
- if self._clip_grad_norm > 0:
- # this norm is before scaling, it will be very large
- norm = compute_norm(
- gradients=grads,
- parameters=params,
- last_stage=last_stage,
- previous_norm=previous_norm,
- )
-
- return norm
-
- @llm_timeout(func_name="optim_step")
- def step(self, closure=None):
- """Performs a single optimization step.
-
- Args:
- closure (Callable, optional): A closure that reevaluates the model
- and returns the loss.
- Returns:
- Tuple[bool, dict]: Whether the parameters were successfully updated, and the per-group gradient norms.
- """
- assert closure is None, "closure is not supported by step()"
-
- # if not overlapping communication (no reduction hook is attached)
- # we need to manually reduce these gradients
- if not self._overlap_sync_grad:
- for group_id in range(len(self._fp16_param_groups)):
- for param in self._fp16_param_groups[group_id]:
- if param.grad is not None:
- self._store_and_try_reduce_grads_by_bucket(param)
-
- # we need to reduce the gradients left in the communication bucket
- self._reduce_grads_stored_in_bucket(reduce_rank=None, last_bucket=True)
-
- # compute norm for the gradients reduced in the earlier buckets
- groups_norms = []
- for group_id in range(self.num_param_groups):
- groups_norms.append(self._compute_norm_with_stage(group_id=group_id))
-
- # clear reduced grads
- # grads in the last bucket are reduced
- for bucket in self._bucket_in_progress:
- bucket.commu_handle.wait()
- bucket.unflatten_and_copy()
- bucket.empty()
- self._bucket_in_progress = []
- self._param_store.clear_grads_of_previous_reduced_params()
-
- # compute norm for gradients in the last bucket
- total_norms = {}
- for group_id in range(self.num_param_groups):
- group_name = self.param_groups[group_id]["name"] if "name" in self.param_groups[group_id] else "default"
- group_name = f"{group_id}_{group_name}"
- total_norms[group_name] = self._compute_norm_with_stage(
- group_id=group_id,
- last_bucket=True,
- last_stage=True,
- previous_norm=groups_norms[group_id],
- )
-
- timer("sync_grad").start()
- self._sync_grad()
- timer("sync_grad").stop()
-
- return self._step(closure=closure, norms=total_norms)
-
- def _step(self, closure=None, norms=None):
- assert closure is None, "closure is not supported by step()"
-
- # check for overflow
- found_inf = False
- found_nan = False
- # if there are INF values in grads, the compute_norm func would also return -1,
- # thus we try to avoid calling _check_overflow here
- # found_inf = self._check_overflow()
- # because you may encounter inf when computing the norm
-
- if -1 in norms.values():
- found_inf = True
-
- if -2 in norms.values():
- found_nan = True
-
- loss_scale = float(self.loss_scale.item()) # backup
- if gpc.config.model.dtype is not torch.float32:
- self.grad_scaler.update(found_inf)
-
- # update loss scale if overflow occurs
- if found_inf:
- if gpc.is_rank_for_log():
- logger.warning("Overflow occurs, please check it.")
- send_alert_message(
- address=gpc.config.monitor.alert.feishu_alert_address,
- message="Overflow occurs, please check it.",
- )
- self._grad_store._averaged_gradients = dict()
- self.zero_grad()
- return False, norms
-
- if found_nan:
- if gpc.is_rank_for_log():
- logger.warning("Nan grad norm occurs, please check it.")
- send_alert_message(
- address=gpc.config.monitor.alert.feishu_alert_address,
- message="Nan grad norm occurs, please check it.",
- )
- self._grad_store._averaged_gradients = dict()
- self.zero_grad()
- return False, norms
-
- # copy the grad of fp16 param to fp32 param
- single_grad_partition_groups = []
- for group_id in range(self.num_param_groups):
- # compute norm
- # The following operations are performed only on the rank to which parameters are assigned.
- if not self.param_group_has_params[group_id]:
- continue
-
- # create flat gradient for the flat fp32 params
- gradients = self._grad_store.get_averaged_gradients_by_group(group_id)
- with torch.no_grad():
- flat_fp16_avg_grads = flatten(gradients)
- self._grad_store.reset_average_gradients_by_group(group_id)
- gradients = None # release cuda memory
-
- dtype = self._fp32_flat_param_groups_of_current_rank[group_id].dtype
- flat_fp32_avg_grads = flat_fp16_avg_grads.to(dtype)
- flat_fp16_avg_grads = None # release cuda memory
-
- param_shape = self._fp32_flat_param_groups_of_current_rank[group_id].shape
- assert (
- param_shape == flat_fp32_avg_grads.shape
- ), f"fp32 param and grad have different shape {param_shape} vs {flat_fp32_avg_grads.shape}"
-
- single_grad_partition_groups.append(flat_fp32_avg_grads)
- device = self._fp32_flat_param_groups_of_current_rank[group_id].device
- self._fp32_flat_param_groups_of_current_rank[group_id].grad = flat_fp32_avg_grads.to(device)
-
- # unscale and clip grads
- # get the global norm
- global_norm_groups = {}
- if self._clip_grad_norm > 0:
- for group_name, norm in norms.items():
- global_norm_groups[group_name] = norm**0.5
-
- # the following operations are performed only on the rank to which parameters are assigned.
- if gpc.config.model.dtype is not torch.float32:
- if len(single_grad_partition_groups) != 0 and self._clip_grad_norm > 0:
- self._unscale_and_clip_grads(
- single_grad_partition_groups,
- list(global_norm_groups.values()),
- loss_scale,
- )
-
- # update the parameters
- timer("step").start()
-
- # For those ranks that are not assigned parameters, we just wait for the other ranks
- # to send them the updated parameters.
- if self.has_params:
- self.optim.step()
- # release the fp32 grad
- release_param_grad(self._fp32_flat_param_groups_of_current_rank.values())
- # update fp16 partition updated by the current rank
- for group_id in range(len(self._fp16_param_groups)):
- if self.param_group_has_params[group_id]:
- fp16_param = self._param_store.get_flat_fp16_param_by_rank_group(
- rank=self._zero_local_rank, group_id=group_id
- )
- fp32_param = self._fp32_flat_param_groups_of_current_rank[group_id]
- fp16_param.data.copy_(fp32_param)
-
- torch.cuda.synchronize()
- with torch.cuda.stream(self._comm_bcast_stream):
- self.broadcast_params()
-
- timer("step").stop()
-
- # update gradients may not be needed here, because the sync_params function is used in initialization,
- # so synchronization is maintained
- for group_name, global_norm in global_norm_groups.items():
- global_norm_groups[group_name] = global_norm / loss_scale
- return True, global_norm_groups
-
- def broadcast_params(self):
- handles = []
-
- for rank, group_id in product(range(self._zero_world_size), range(self.num_param_groups)):
- # The following operations are performed only on the rank to which parameters are assigned.
- if rank in self.param_group_no_params_ranks[group_id]:
- continue
- fp16_param = self._param_store.get_flat_fp16_param_by_rank_group(rank=rank, group_id=group_id)
- # grank = gpc.get_ranks_in_group(group_type)[rank] # need to convert to the global rank
- # assert grank == rank, f"{grank} == {rank}"
- g_rank = gpc.get_ranks_in_group(self._broadcast_parallel_mode)[rank]
- handle = dist.broadcast(
- fp16_param,
- src=g_rank,
- group=gpc.get_group(ParallelMode.ZERO1),
- async_op=True,
- )
-
- if self._overlap_sync_param:
- self._param_bcast_sync_handler.add_bcast_handle(rank, handle)
- else:
- handles.append(handle)
-
- for handle in handles:
- handle.wait()
-
- torch.cuda.synchronize()
-
- ##################
- # FP16 Utilities #
- ##################
-
- def _check_overflow(self):
- # clear previous overflow record
- self._found_overflow.fill_(0.0)
-
- # check for overflow
- for group_id in range(len(self._fp16_param_groups)):
- # The following operations are performed only on the rank to which parameters are assigned.
- if self._zero_local_rank not in self.param_group_no_params_ranks[group_id]:
- for avg_grad in self._grad_store.get_averaged_gradients_by_group(group_id):
- if avg_grad is not None and has_inf_or_nan(avg_grad):
- self._found_overflow.fill_(1.0)
- break
- dist.all_reduce(
- self._found_overflow,
- op=dist.ReduceOp.MAX,
- group=gpc.get_group(ParallelMode.GLOBAL),
- )
-
- return self._found_overflow.item() > 0
-
- def _unscale_and_clip_grads(self, grad_groups_flat, total_norm_groups, loss_scale):
- # compute combined scale factor for this group
- combined_scale_groups = []
-
- if self._clip_grad_norm > 0.0:
- # norm is in fact norm*scale
- for group_id, total_norm in enumerate(total_norm_groups):
- combined_scale_groups.append(loss_scale)
- clip = ((total_norm / loss_scale) + 1e-6) / self._clip_grad_norm
- if clip > 1.0:
- combined_scale_groups[group_id] = clip * loss_scale
-
- for group_id, grad in enumerate(grad_groups_flat):
- grad.data.mul_(1.0 / combined_scale_groups[group_id])
-
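A worked sketch of the combined factor computed in `_unscale_and_clip_grads` above. At this point the gradients, and therefore the stored norm, still carry the loss scale, so dividing by the combined factor removes the loss scale and, when needed, clips the true norm down to `clip_grad_norm` in a single multiplication. The numbers below are hypothetical.

```python
def combined_scale(scaled_norm: float, loss_scale: float, clip_grad_norm: float) -> float:
    """scaled_norm is the gradient norm *including* the loss scale."""
    clip = ((scaled_norm / loss_scale) + 1e-6) / clip_grad_norm
    return clip * loss_scale if clip > 1.0 else loss_scale

# hypothetical numbers: true grad norm 4.0, loss_scale 1024 -> stored norm 4096
factor = combined_scale(scaled_norm=4096.0, loss_scale=1024.0, clip_grad_norm=1.0)
# grads are multiplied by 1/factor (~1/4096): unscaled by 1024 and clipped from 4.0 to ~1.0
```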
- def clip_grad_norm(self, model, max_norm):
- # will conduct in the step()
- pass
-
- def state_dict(self):
- states = {}
- grad_scaler = self.grad_scaler.state_dict()
- states["grad_scaler"] = grad_scaler
- optim_states = self.optim.state_dict()
- states["base_optim_states"] = optim_states
-
- flat_fp32_weights = {}
- for group_id, param in self._fp32_flat_param_groups_of_current_rank.items():
- if self._zero_local_rank not in self.param_group_no_params_ranks[group_id]:
- assert param.grad is None
- flat_fp32_weights[group_id] = param
- states["flat_fp32_weights"] = flat_fp32_weights
- states["zero_devide_optim_plan"] = self.params_per_rank_id_dict
-
- return states
-
- def load_state_dict(self, states):
- # TODO: Need to take into account the change in the number of DP.
- assert "grad_scaler" in states, "Not found grad_scaler state!"
- grad_scaler = states["grad_scaler"]
- self.grad_scaler.load_state_dict(grad_scaler)
- optim_states = states["base_optim_states"]
- self.optim.load_state_dict(optim_states)
-
- # load fp32 model weight.
- flat_fp32_weights = states["flat_fp32_weights"]
- assert set(flat_fp32_weights.keys()) == set(self._fp32_flat_param_groups_of_current_rank)
- for group_id, param in flat_fp32_weights.items():
- if self._zero_local_rank not in self.param_group_no_params_ranks[group_id]:
- self_param = self._fp32_flat_param_groups_of_current_rank[group_id]
- assert (
- self_param.shape == param.shape
- ), f"The loaded parameter shape is inconsistent, {self_param.shape} != {param.shape}"
- self_param.data.copy_(param.data)
-
- # Load the fp16 model weights.
- for group_id in range(len(self._fp16_param_groups)):
- if self._zero_local_rank not in self.param_group_no_params_ranks[group_id]:
- fp16_param = self._param_store.get_flat_fp16_param_by_rank_group(
- rank=self._zero_local_rank, group_id=group_id
- )
- fp32_param = self._fp32_flat_param_groups_of_current_rank[group_id]
- fp16_param.data.copy_(fp32_param)
-
- if "zero_devide_optim_plan" in states:
- self.params_per_rank_id_dict = states["zero_devide_optim_plan"]
-
-
-def reload_zero_fp32_buff(optimizer):
- # If we use AMP optimizer, we need to update its fp32 buffer as newly loaded weights value.
- # Or we must ensure that loading model weights must be done before zero is initialized.
- if isinstance(optimizer, HybridZeroOptimizer):
- for group_id, param_group in enumerate(optimizer.optim.param_groups):
- if optimizer.param_group_has_params[group_id]:
- # flatten fp16 params have already been updated by 'load_model_checkpoint'
- fp16_flat_current_rank = optimizer._param_store.get_flat_fp16_param_by_rank_group(
- optimizer._zero_local_rank, group_id
- )
- # param_group["params"] is fp32 flatten optimizer states of this zero rank.
- param_group["params"][0].data.copy_(fp16_flat_current_rank.float())
diff --git a/internlm/solver/optimizer/store.py b/internlm/solver/optimizer/store.py
deleted file mode 100644
index adab6c9..0000000
--- a/internlm/solver/optimizer/store.py
+++ /dev/null
@@ -1,343 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-from typing import List
-
-from torch import Tensor
-from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
-
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
-
-
-class BaseStore:
- """
- Base Store
- """
-
- def __init__(self, dp_parallel_mode=ParallelMode.DATA):
- self._world_size = gpc.get_world_size(dp_parallel_mode)
- self._local_rank = gpc.get_local_rank(dp_parallel_mode)
-
- @property
- def world_size(self):
- return self._world_size
-
- @property
- def local_rank(self):
- return self._local_rank
-
-
-class BucketStore(BaseStore):
- """
- Bucket Store
- """
-
- def __init__(self, dp_parallel_mode):
- super().__init__(dp_parallel_mode)
- self._grads = dict()
- self._params = dict()
- self._num_elements_in_bucket = dict()
-
- self.reset()
-
- def num_elements_in_bucket(self, reduce_rank: int = None):
- return self._num_elements_in_bucket[reduce_rank]
-
- def add_num_elements_in_bucket(self, num_elements, reduce_rank: int = None):
- self._num_elements_in_bucket[reduce_rank] += num_elements
-
- def add_grad(self, tensor, reduce_rank: int = None):
- self._grads[reduce_rank].append(tensor)
-
- def add_param(self, tensor, reduce_rank: int = None):
- self._params[reduce_rank].append(tensor)
-
- def reset(self):
- keys = [None] + list(range(self._world_size))
- self._grads = {rank: [] for rank in keys}
- self._params = {rank: [] for rank in keys}
- self._num_elements_in_bucket = {rank: 0 for rank in keys}
-
- def reset_by_rank(self, reduce_rank=None):
- self._grads[reduce_rank] = []
- self._params[reduce_rank] = []
- self._num_elements_in_bucket[reduce_rank] = 0
-
- def get_grad(self, reduce_rank: int = None):
- return self._grads[reduce_rank]
-
- def get_param(self, reduce_rank: int = None):
- return self._params[reduce_rank]
-
-
-class GradientStore(BaseStore):
- """
- Gradient Store
- """
-
- def __init__(self, *args):
- super().__init__(*args)
- # bookkeeping data structures
- self._averaged_gradients = dict()
-
- # for backward reduction hooks
- self._grad_acc_objs = []
-
- def add_accumulate_grad_object(self, obj):
- """
- Keep :class:`AccumulateGrad` objects. If these objects are not kept, reduction hooks may not
- be attached successfully.
-
- :param obj: An object of :class:`AccumulateGrad` class
- :type obj: :class:`AccumulateGrad`
- """
-
- self._grad_acc_objs.append(obj)
-
- def get_averaged_gradients_by_group(self, group_id: int) -> List[Tensor]:
- """
- Return average gradients of a parameter group
-
- :param group_id: The index of parameter group
- :type group_id: int
-
- :return: Return the list of averaged gradients of a parameter group. Each element is a gradient,
- not a parameter.
- :rtype: List[torch.Tensor]
- """
-
- return self._averaged_gradients[group_id]
-
- def add_average_gradient_by_group(self, group_id: int, tensor: Tensor) -> None:
- """
- Append an average gradient to the list of averaged gradients of a parameter group
-
- :param group_id: The index of a parameter group
- :param tensor: A :class:`torch.Tensor` object
- :type group_id: int
- :type tensor: torch.Tensor
-
- """
-
- if group_id in self._averaged_gradients:
- self._averaged_gradients[group_id].append(tensor)
- else:
- self._averaged_gradients[group_id] = [tensor]
-
- def reset_average_gradients_by_group(self, group_id: int) -> None:
- """
- Reset the bookkeeping data structure for averaged gradients to an empty list
-
- :param group_id: The index of a parameter group
- :type group_id: int
- """
-
- self._averaged_gradients[group_id] = []
-
-
-class ParameterStore(BaseStore):
- """
- Parameter Store
- """
-
- def __init__(self, dp_paralle_mode):
- super().__init__(dp_paralle_mode)
- # param partitioning data structures
- self._fp16_param_to_rank = dict()
- self._rank_groupid_to_fp16_param_list = dict()
- self._rank_group_id_to_flat_fp16_param = dict()
-
- # param reduction data structures
- self._is_param_reduced = dict()
- self._reduced_param = []
-
- self._former_bucket_reduced_param = {}
- self._last_bucket_reduced_param = {}
- self._former_bucket_reduced_grad = {}
- self._last_bucket_reduced_grad = {}
-
- def set_param_to_rank(self, tensor: Tensor, rank: int) -> None:
- """
- Set the mapping between parameter to rank, each parameter should be owned by a rank.
-
- :param tensor: A :class:`torch.Tensor` object
- :type tensor: torch.Tensor
- :param rank: The rank of which the process is responsible for updating the parameter
- :type rank: int
- """
-
- self._fp16_param_to_rank[tensor] = rank
-
- def get_param_rank(self, tensor: Tensor) -> int:
- """
- Gives the rank which the parameter belongs to
-
- :param tensor: A :class:`torch.Tensor` object
- :type tensor: torch.Tensor
- """
- return self._fp16_param_to_rank[tensor]
-
- def belongs_to_current_rank(self, tensor) -> bool:
- """
- Check whether a parameter is supposed to be updated by the process of the current rank
-
- :param tensor: A :class:`torch.Tensor` object
- :type tensor: torch.Tensor
-
- :return: True if the parameter should be updated by the current rank. Otherwise false.
- :rtype: bool
- """
-
- tensor_rank = self._fp16_param_to_rank[tensor]
- return tensor_rank == self._local_rank
-
- def add_fp16_param_list_by_rank_group(self, rank, group_id, tensor_list) -> None:
- if rank not in self._rank_groupid_to_fp16_param_list:
- self._rank_groupid_to_fp16_param_list[rank] = dict()
-
- if group_id not in self._rank_groupid_to_fp16_param_list[rank]:
- self._rank_groupid_to_fp16_param_list[rank][group_id] = []
-
- self._rank_groupid_to_fp16_param_list[rank][group_id].extend(tensor_list)
-
- def get_fp16_params_by_rank_group(self, rank, group_id) -> List[Tensor]:
- return self._rank_groupid_to_fp16_param_list[rank][group_id]
-
- def add_flat_fp16_param_by_rank_group(self, rank, group_id, tensor) -> None:
- if rank not in self._rank_group_id_to_flat_fp16_param:
- self._rank_group_id_to_flat_fp16_param[rank] = dict()
-
- self._rank_group_id_to_flat_fp16_param[rank][group_id] = tensor
-
- def get_flat_fp16_param_by_rank_group(self, rank, group_id) -> Tensor:
- return self._rank_group_id_to_flat_fp16_param[rank][group_id]
-
- def is_param_reduced(self, tensor):
- return self._is_param_reduced[tensor]
-
- def set_param_reduction_state(self, tensor, state):
- self._is_param_reduced[tensor] = state
-
- def get_param_reduction_states(self):
- return self._is_param_reduced
-
- def reset_previous_reduced_params(self):
- self._reduced_param = []
-
- def add_previous_reduced_param(self, tensor):
- self._reduced_param.append(tensor)
-
- def add_reduced_param_for_compute_norm(self, param, last_bucket=False):
- group_id = getattr(param, "group_id")
- if last_bucket:
- if group_id not in self._last_bucket_reduced_param:
- self._last_bucket_reduced_param[group_id] = []
- self._last_bucket_reduced_grad[group_id] = []
-
- self._last_bucket_reduced_param[group_id].append(param)
- self._last_bucket_reduced_grad[group_id].append(param.grad)
- else:
- if group_id not in self._former_bucket_reduced_param:
- self._former_bucket_reduced_param[group_id] = []
- self._former_bucket_reduced_grad[group_id] = []
-
- self._former_bucket_reduced_param[group_id].append(param)
- self._former_bucket_reduced_grad[group_id].append(param.grad)
-
- def get_reduced_param_for_compute_norm(self, group_id=0, last_bucket=False):
- if not last_bucket:
- if group_id not in self._former_bucket_reduced_param:
- return [], []
- return (
- self._former_bucket_reduced_param[group_id],
- self._former_bucket_reduced_grad[group_id],
- )
- else:
- if group_id not in self._last_bucket_reduced_param:
- return [], []
- return (
- self._last_bucket_reduced_param[group_id],
- self._last_bucket_reduced_grad[group_id],
- )
-
- def reset_reduced_data_for_compute_norm(self):
- self._former_bucket_reduced_param = {}
- self._last_bucket_reduced_param = {}
- self._former_bucket_reduced_grad = {}
- self._last_bucket_reduced_grad = {}
-
- def clear_grads_of_previous_reduced_params(self):
- if len(self._reduced_param) > 0:
- for param in self._reduced_param:
- param.grad = None
- self.reset_previous_reduced_params()
-
-
-class TensorBucket:
- """
- Tensor Bucket
- """
-
- def __init__(self, size):
- self._max_size = size
- self._current_size = 0
- self._bucket = []
- self._flat_tensor = None
- self._unflatten_and_copy_flag = False
- self.commu_handle = None
-
- @property
- def max_size(self):
- return self._max_size
-
- @property
- def current_size(self):
- return self._current_size
-
- def is_full_or_oversized(self):
- return self._current_size >= self._max_size
-
- def is_empty(self):
- return len(self._bucket) == 0
-
- def set_unflatten_and_copy_flag(self, flag):
- self._unflatten_and_copy_flag = flag
-
- def get_unflatten_and_copy_flag(self):
- return self._unflatten_and_copy_flag
-
- def get_flat_tensor(self):
- return self._flat_tensor
-
- def add_to_bucket(self, tensor, allow_oversize=False):
- tensor_size = tensor.numel()
-
- if not allow_oversize and self.will_exceed_max_size(tensor_size):
- msg = f"The param bucket max size {self._max_size} is exceeded" + f"by tensor (size {tensor_size})"
- raise RuntimeError(msg)
-
- self._bucket.append(tensor)
- self._current_size += tensor_size
-
- def will_exceed_max_size(self, tensor_size):
- expected_size = self._current_size + tensor_size
- return expected_size > self._max_size
-
- def get_bucket(self):
- return self._bucket
-
- def empty(self):
- self._bucket = []
- self._current_size = 0
- self._flat_tensor = None
- self.commu_handle = None
-
- def flatten(self):
- self._flat_tensor = _flatten_dense_tensors(self._bucket)
-
- def unflatten_and_copy(self):
- if self._unflatten_and_copy_flag:
- unflattened_tensor_list = _unflatten_dense_tensors(self._flat_tensor, self._bucket)
- for old, new in zip(self._bucket, unflattened_tensor_list):
- old.copy_(new)
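A small runnable illustration of the flatten / unflatten-and-copy round trip that `TensorBucket` relies on: the bucket is packed into one contiguous tensor (so it can be reduced with a single collective), and after the communication the flat result is viewed back into pieces and copied into the original tensors.

```python
import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

bucket = [torch.ones(3), torch.full((2,), 2.0)]
flat = _flatten_dense_tensors(bucket)        # tensor([1., 1., 1., 2., 2.])
flat.mul_(0.5)                               # stand-in for an averaging all-reduce
for old, new in zip(bucket, _unflatten_dense_tensors(flat, bucket)):
    old.copy_(new)                           # write the reduced values back in place
print(bucket[0])                             # tensor([0.5000, 0.5000, 0.5000])
```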
diff --git a/internlm/solver/optimizer/utils.py b/internlm/solver/optimizer/utils.py
deleted file mode 100644
index dbfcc34..0000000
--- a/internlm/solver/optimizer/utils.py
+++ /dev/null
@@ -1,569 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import math
-from abc import ABC, abstractmethod
-from collections import OrderedDict
-from functools import partial
-from typing import Any, Dict, Optional, Union
-
-import torch
-import torch.distributed as dist
-from torch import Tensor, nn
-from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
-
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
-from internlm.core.naive_amp import NaiveAMPModel
-from internlm.utils.common import get_tensor_norm, move_norm_to_cuda
-from internlm.utils.logger import get_logger
-from internlm.utils.parallel import is_model_parallel_parameter
-
-logger = get_logger(__file__)
-
-try:
- import amp_C
- from apex.multi_tensor_apply import multi_tensor_applier
-
- APEX_AVAILABLE = True
-except (ModuleNotFoundError, ImportError):
- logger.warning("The torch implementation of calc_l2_norm is slower than apex. Please note this!")
- APEX_AVAILABLE = False
-
-inf = math.inf
-
-
-def flatten(input_):
- return _flatten_dense_tensors(input_)
-
-
-def unflatten(flat, tensors):
- return _unflatten_dense_tensors(flat, tensors)
-
-
-def get_grad_accumulate_object(tensor):
- """
- Return the AccumulateGrad of the input tensor
- """
-
- # grad_fn reference:
- # https://discuss.pytorch.org/t/in-the-grad-fn-i-find-a-next-functions-but-i-dont-understand-the-meaning-of-the-attribute/24463
- # expand_as reference: https://pytorch.org/docs/stable/generated/torch.Tensor.expand.html#torch.Tensor.expand
- #
- # `next_functions` will return the backward graph where
- # the first element is the AccumulateGrad of the leaf nodes.
- # we want to get the AccumulateGrad of the input tensor instead of the leaf
- # node in the whole computation graph.
- # Therefore, we call expand_as to create a dummy graph
- # where tensor_tmp and tensor indeed point to the same object.
- # You can check this by print(tensor.data_ptr() == tensor_tmp.data_ptr())
- tensor_tmp = tensor.expand_as(tensor)
- grad_acc_obj = tensor_tmp.grad_fn.next_functions[0][0]
- return grad_acc_obj
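A tiny CPU check of the `expand_as` trick described in the comment above, offered as a sketch: the hook registered on the leaf's AccumulateGrad node fires during the backward pass when that node runs, which is the point where the deleted optimizer launches its bucketed gradient reduction.

```python
import torch

w = torch.randn(3, requires_grad=True)
# expand_as creates a non-leaf view whose grad_fn's next_functions[0][0] is the
# AccumulateGrad node of the leaf `w` itself
acc = w.expand_as(w).grad_fn.next_functions[0][0]
acc.register_hook(lambda *args: print("AccumulateGrad for w fired"))
(w * 2).sum().backward()   # the hook fires when the AccumulateGrad node for `w` runs
```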
-
-
-def split_half_float_double(tensor_list):
- dtype_buckets = {
- "torch.cuda.HalfTensor": [],
- "torch.cuda.FloatTensor": [],
- "torch.cuda.DoubleTensor": [],
- "torch.cuda.BFloat16Tensor": [],
- }
-
- for t in tensor_list:
- dtype = t.type()
- if dtype in dtype_buckets:
- dtype_buckets[dtype].append(t)
-
- buckets = [bucket for bucket in dtype_buckets.values() if bucket]
- return buckets
-
-
-def reduce_tensor(tensor, dtype=None, dst_rank=None, parallel_mode=ParallelMode.DATA):
- """
- Reduce the tensor in the data parallel process group
-
- :param tensor: A tensor object to reduce/all-reduce
- :param dtype: The data type used in communication
- :param dst_rank: The destination rank for reduce. If dst_rank is None,
- all-reduce will be used instead of reduce. Default is None.
- :param parallel_mode: Communication parallel mode
-
- :type tensor: torch.Tensor
- :type dtype: torch.dtype, optional
- :type dst_rank: int, optional
- :type parallel_mode: ParallelMode, optional
- """
- # use the original dtype
- # if dtype is None:
- assert dtype is None
- dtype = tensor.dtype
-
- # cast the data to specified dtype for reduce/all-reduce
- # if tensor.dtype != dtype:
- # tensor_to_reduce = tensor.to(dtype)
- # else:
- # tensor_to_reduce = tensor
-
- # world_size = gpc.get_world_size(parallel_mode)
- # tensor.div_(world_size)
- group = gpc.get_group(parallel_mode)
-
- # if rank is None, all reduce will be used
- # else, reduce is used
- use_all_reduce = dst_rank is None
-
- if use_all_reduce:
- handle = dist.all_reduce(tensor=tensor, group=group, op=torch.distributed.ReduceOp.AVG, async_op=True)
- else:
- ranks_in_group = gpc.get_ranks_in_group(parallel_mode)
- global_rank = ranks_in_group[dst_rank]
- handle = dist.reduce(
- tensor=tensor, dst=global_rank, group=group, op=torch.distributed.ReduceOp.AVG, async_op=True
- )
-
- return handle
-
-
-def has_inf_or_nan(tensor):
- try:
- # if tensor is half, the .float() incurs an additional deep copy, but it's necessary if
- # Pytorch's .sum() creates a one-element tensor of the same type as tensor
- # (which is true for some recent version of pytorch).
- tensor_sum = float(tensor.float().sum())
- # More efficient version that can be used if .sum() returns a Python scalar
- # tensor_sum = float(tensor.sum())
- except RuntimeError as instance:
- # We want to check whether this is actually an overflow exception.
- # RuntimeError could come from a different error.
- # If so, we still want the exception to propagate.
- if "value cannot be converted" not in instance.args[0]:
- raise
- return True
- else:
- if tensor_sum == float("inf") or tensor_sum == -float("inf") or tensor_sum != tensor_sum:
- return True
- return False
-
-
-def release_param_grad(tensor_list):
- for tensor in tensor_list:
- tensor.grad = None
-
-
-def sync_param(flat_tensor, tensor_list):
- """
- Synchronize the flattened tensor and unflattened tensor list. When
- a list of tensors is flattened with `torch._utils._flatten_dense_tensors`,
- a new tensor is created. Thus, the flat tensor and original tensor list do not
- share the same memory space. This function will update the tensor list so that
- they point to the same value.
-
- :param flat_tensor: A flat tensor obtained by calling `torch._utils._flatten_dense_tensors` on a tensor list
- :param tensor_list: A list of tensors corresponding to the flattened tensor
- :type flat_tensor: torch.Tensor
- :type tensor_list: List[torch.Tensor]
- """
- updated_params = unflatten(flat_tensor, tensor_list)
-
- # update the tensor data
- for p, q in zip(tensor_list, updated_params):
- p.data = q.data
-
-
-def multi_tensor_l2norm_torch(tensor_list, per_tensor):
- # Convert tensor_list elements to torch.float32
- tensor_list = [tensor.float() for tensor in tensor_list]
- norms_tensor = torch.stack([torch.norm(tensor, p=2) for tensor in tensor_list])
- l2_norm = torch.norm(norms_tensor, p=2).unsqueeze(0)
-
- if per_tensor:
- per_tensor_norm = norms_tensor
- else:
- per_tensor_norm = torch.Tensor([]).to(norms_tensor.device)
-
- return l2_norm, per_tensor_norm
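A quick CPU sanity check of the fallback above: taking the L2 norm of the per-tensor L2 norms equals the L2 norm of all elements concatenated, which is what the fused apex kernel computes.

```python
import torch

ts = [torch.tensor([3.0]), torch.tensor([4.0])]
l2, _ = multi_tensor_l2norm_torch(ts, per_tensor=False)
print(l2)                          # tensor([5.])
print(torch.cat(ts).norm(p=2))     # tensor(5.)
```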
-
-
-def calc_l2_norm(grads):
- norm = 0.0
- if len(grads) > 0:
- if APEX_AVAILABLE:
- dummy_overflow_buf = torch.cuda.IntTensor([0])
- norm, _ = multi_tensor_applier(
- amp_C.multi_tensor_l2norm,
- dummy_overflow_buf,
- [grads],
- False, # no per-parameter norm
- )
- else:
- norm, _ = multi_tensor_l2norm_torch(grads, False)
- return norm
-
-
-def calc_lp(grads, norm_type):
- norm = 0.0
- for grad in grads:
- grad_norm = torch.norm(grad, norm_type)
- norm += grad_norm**norm_type
- return norm
-
-
-def compute_norm(gradients, parameters, last_stage=False, previous_norm=None, norm_type=2):
- """Get the norm
- Arguments:
- gradients (Iterable[Tensor]): The gradient value.
- parameters (Iterable[Tensor]): The parameter each gradient corresponds to.
- norm_type (float or int): type of the used p-norm. Can be ``'inf'`` for
- infinity norm.
-
- Returns:
- Total norm of the gradients; apply total_norm**(1/norm_type) before using.
- """
-
- enable_cuda_kernels = gradients[0].device.type == "cuda"
- # Norm parameters.
- norm_type = float(norm_type)
-
- # Calculate norm.
- if norm_type == inf:
- total_norm = max(g.data.abs().max() for g in gradients)
- total_norm_cuda = torch.FloatTensor([float(total_norm)], device=gradients[0].device)
-
- if last_stage is False:
- return total_norm_cuda
-
- if previous_norm is not None:
- total_norm_cuda = max(total_norm_cuda, previous_norm)
-
- # Take max across all model-parallel GPUs.
- if gpc.get_world_size(ParallelMode.MODEL) > 1:
- dist.all_reduce(
- total_norm_cuda,
- op=dist.ReduceOp.MAX,
- group=gpc.get_group(ParallelMode.MODEL),
- )
- total_norm = total_norm_cuda[0].item()
- else:
- tensor_parallel_grads = []
- for g, p in zip(gradients, parameters):
- # TODO: consider the pipeline shared parameter
- if (
- gpc.is_initialized(ParallelMode.PIPELINE)
- and hasattr(p, "pipeline_shared_module_pg")
- and dist.get_rank(p.pipeline_shared_module_pg) == 0
- ): # if shared between different pipeline stages, only count it once
- tensor_parallel_grads.append(g.data.float())
- elif (
- gpc.is_initialized(ParallelMode.PIPELINE)
- and hasattr(p, "pipeline_shared_module_pg")
- and dist.get_rank(p.pipeline_shared_module_pg) != 0
- ):
- continue
- elif (
- gpc.is_initialized(ParallelMode.TENSOR)
- and not is_model_parallel_parameter(p)
- and gpc.get_local_rank(ParallelMode.TENSOR) == 0
- ): # if not used in each chunk, such as layernorm
- tensor_parallel_grads.append(g.data.float())
- elif is_model_parallel_parameter(p):
- tensor_parallel_grads.append(g.data.float())
- elif gpc.get_local_rank(ParallelMode.TENSOR) != 0:
- continue
- else:
- raise RuntimeError("Should not arrive here")
-
- if norm_type == 2.0 and enable_cuda_kernels:
- tensor_parallel_norm = calc_l2_norm(tensor_parallel_grads) ** norm_type
- else:
- tensor_parallel_norm = calc_lp(tensor_parallel_grads, norm_type)
-
- # If norm is type of float, then we convert them into torch.Tensor.
- tensor_parallel_norm = get_tensor_norm(tensor_parallel_norm, enable_cuda_kernels)
- # If grads are on CPU, the norm is also on CPU. Cast it to a CUDA tensor
- if not enable_cuda_kernels:
- tensor_parallel_norm = move_norm_to_cuda(tensor_parallel_norm)
-
- total_norm = tensor_parallel_norm
-
- if last_stage is False:
- return total_norm
-
- if previous_norm is not None:
- total_norm = total_norm + previous_norm
-
- # Sum across all model-parallel GPUs.
- if gpc.is_initialized(ParallelMode.MODEL):
- dist.all_reduce(
- total_norm,
- op=dist.ReduceOp.SUM,
- group=gpc.get_group(ParallelMode.MODEL),
- )
-
- # Because we use zero1, the norm also needs to be summed across the zero1 group.
- # TODO: Check zero group to be a subset of dp group.
- dist.all_reduce(total_norm, op=dist.ReduceOp.SUM, group=gpc.get_group(ParallelMode.ZERO1))
-
- if torch.is_tensor(total_norm):
- total_norm = total_norm.item()
-
- # Scale.
- if total_norm == float("inf") or total_norm == -float("inf"):
- total_norm = -1
-
- if math.isnan(total_norm):
- total_norm = -2
-
- return total_norm
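Note that `compute_norm` returns the norm raised to the p-th power; the caller applies the root (`** 0.5` in `_step`). A minimal single-process sketch, using plain tensors instead of the gpc groups above, of why summing local p-th powers and taking the root only at the end gives the global norm:

```python
import torch

def local_norm_power(grads, p: float = 2.0):
    # each rank sums |g|**p over its own shard; the results are then all-reduced with SUM
    return sum(g.float().abs().pow(p).sum() for g in grads)

rank0_grads = [torch.tensor([3.0])]   # pretend shard on rank 0
rank1_grads = [torch.tensor([4.0])]   # pretend shard on rank 1
total = local_norm_power(rank0_grads) + local_norm_power(rank1_grads)  # stands in for all_reduce(SUM)
print(total ** 0.5)                   # tensor(5.) == the global L2 norm of [3, 4]
```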
-
-
-class BaseGradScaler(ABC):
- """A base class for the gradient scaler.
-
- Args:
- initial_scale (float): the initial loss scale
- """
-
- def __init__(self, initial_scale: float):
- assert initial_scale > 0
- self._scale = torch.cuda.FloatTensor([initial_scale])
-
- @property
- def scale(self) -> Tensor:
- """Returns the loss scale."""
-
- return self._scale
-
- @property
- def inv_scale(self) -> Tensor:
- """Returns the inverse of the loss scale."""
-
- return self._scale.double().reciprocal().float()
-
- def state_dict(self) -> Dict:
- """Returns the states of the gradient scaler as a dict object."""
-
- state_dict = dict()
- state_dict["scale"] = self.scale
- return state_dict
-
- def load_state_dict(self, state_dict: Dict) -> None:
- """Load the states of the gradient scaler from a dict object.
-
- Args:
- state_dict (dict): the states of the gradient scaler
- """
-
- self._scale = state_dict["scale"]
-
- @abstractmethod
- def update(self, overflow: bool) -> None:
- """Update the loss scale.
-
- Args:
- overflow (bool): whether overflow occurs
- """
-
- pass
-
-
-class DynamicGradScaler(BaseGradScaler):
- """A gradient scaler which uses dynamic loss scale
-
- Args:
- initial_scale (float): the initial loss scale, defaults to 2**16
- growth_factor (float): the multiplication factor for increasing loss scale, defaults to 2
- backoff_factor (float): the multiplication factor for decreasing loss scale, defaults to 0.5
- growth_interval (int): the number of steps to increase loss scale when no overflow occurs, defaults to 1000
- min_scale (float): the minimum loss scale, defaults to None
- max_scale (float): the maximum loss scale, defaults to None
- hysteresis (int): the number of overflows before decreasing loss scale, defaults to 2
- """
-
- def __init__(
- self,
- initial_scale: float = 2**16,
- growth_factor: float = 2,
- backoff_factor: float = 0.5,
- growth_interval: int = 1000,
- min_scale: Optional[float] = None,
- max_scale: Optional[float] = None,
- hysteresis: int = 2,
- ):
- super().__init__(initial_scale)
- if min_scale:
- self._min_scale = torch.cuda.FloatTensor([min_scale])
- else:
- self._min_scale = None
-
- if max_scale:
- self._max_scale = torch.cuda.FloatTensor([max_scale])
- else:
- self._max_scale = None
-
- self._growth_factor = growth_factor
- self._backoff_factor = backoff_factor
- self._growth_interval = growth_interval
- self._growth_step = 0
- self._hysteresis = hysteresis
- self._hysteresis_step = 0
- self._sanity_checks()
-
- def _sanity_checks(self) -> None:
- """Check if the arguments are correct."""
-
- if self._min_scale:
- assert self._min_scale > 0, "The minimum gradient scale cannot be zero or negative"
- if self._max_scale:
- assert self._max_scale > 0, "The maximum gradient scale cannot be zero or negative"
- assert self._growth_factor > 1, "The growth factor cannot be equal or smaller than 1"
- assert self._backoff_factor < 1 and self._backoff_factor > 0, "The backoff factor must be between 0 and 1"
- assert self._hysteresis >= 0, "The hysteresis cannot be negative"
-
- def update(self, overflow: bool) -> None:
- """Update the loss scale.
-
- Args:
- overflow (bool): whether overflow occurs
- """
- if overflow:
- self._hysteresis_step += 1
- self._growth_step = 0
-
- if self._hysteresis_step >= self._hysteresis:
- self._backoff_scale()
- if gpc.is_rank_for_log():
- logger.warning(f"Overflow occurs, the loss scale is adjusted to {self.scale.item()}")
- else:
- self._growth_step += 1
- if self._growth_step == self._growth_interval:
- self._growth_step = 0
- self._hysteresis_step = 0
- self._grow_scale()
- if gpc.is_rank_for_log():
- logger.warning(
- f"No overflow for consecutive {self._growth_interval} steps, "
- f"the loss scale is adjusted to {self.scale.item()}",
- )
-
- def _backoff_scale(self) -> None:
- """Decrease the loss scale"""
-
- self._scale = self._scale * self._backoff_factor
- if self._min_scale:
- self._scale = torch.max(self._scale, self._min_scale)
-
- def _grow_scale(self) -> None:
- """Increase the loss scale"""
-
- self._scale = self._scale * self._growth_factor
- if self._max_scale:
- self._scale = torch.min(self._scale, self._max_scale)
-
- def state_dict(self):
- """Returns the states of the gradient scaler as a dict object."""
-
- state_dict = dict()
- state_dict["_scale"] = self._scale.item()
- state_dict["_growth_step"] = self._growth_step
- state_dict["_hysteresis_step"] = self._hysteresis_step
-
- return state_dict
-
- def load_state_dict(self, state_dict):
- """Load the states of the gradient scaler from a dict object.
-
- Args:
- state_dict (dict): the states of the gradient scaler
- """
-
- self._scale = self._scale.fill_(state_dict["_scale"])
- self._growth_step = state_dict["_growth_step"]
- self._hysteresis_step = state_dict["_hysteresis_step"]
-
-
-class ParamBcastSyncHandler:
- """
- Model Partition Handler for overlap broadcast with forward
- """
-
- def __init__(self, model: Union[nn.Module, nn.ModuleList]) -> None:
- self._block_to_param = OrderedDict()  # block -> list of its parameters
- self._param_to_rank = dict()  # parameter -> the zero1 rank that owns it
- self._block_to_rank = dict()  # block -> the zero1 ranks its parameters span
- self._bcast_handles = dict()  # zero1 rank -> pending broadcast handles
-
- zero1_size = gpc.get_world_size(ParallelMode.ZERO1)
- total_param_num = sum(p.numel() for p in model.parameters())
- avg_param_num = total_param_num * 1.0 // zero1_size
-
- # just want to share same for loop for ModuleList and Module
- if not isinstance(model, nn.ModuleList):
- model = [model]
-
- # record which transformer/embedding/head/norm block each parameter belongs to
- for _chunk in model:
- if isinstance(_chunk, NaiveAMPModel):
- _chunk = _chunk.model
-
- for _, children in _chunk.named_children():
- # should be the transformer block definition in modeling_xxx.py
- if isinstance(children, nn.ModuleList):
- # record the block that a parameter belongs to
- for _, block in enumerate(children):
- # self._block_to_param[f"{name}.{idx}"] = list(block.parameters())
- self._block_to_param[block] = list(block.parameters())
- else:
- # record the block that a parameter belongs to
- # self._block_to_param[name] = list(children.parameters())
- self._block_to_param[children] = list(children.parameters())
-
- alloc_num = 0
- rank_to_go = 0
-
- # process the parameters in block_to_param sequentially,
- # allocate each parameter to a local rank of ParallelMode.ZERO1,
- # NOTE that we do NOT consider the following scenarios:
- # 1) whether a parameter is trainable;
- # 2) parameters may be in different optimizer groups
- for block, params in self._block_to_param.items():
- # allocate a model block to a local rank of ParallelMode.ZERO1
- self._block_to_rank[block] = [rank_to_go]
- for p in params:
- alloc_num = alloc_num + p.numel()
- # in this case, allocate the param to next rank if possible
- if alloc_num > avg_param_num * 1.01 and rank_to_go < zero1_size - 1:
- rank_to_go = rank_to_go + 1
- alloc_num = 0
- self._block_to_rank[block].append(rank_to_go)
- # allocate a parameter to a local rank of ParallelMode.ZERO1
- self._param_to_rank[p] = rank_to_go
-
- # initialize an empty list for _bcast_handles of each rank
- for rank in range(gpc.get_world_size(ParallelMode.ZERO1)):
- self._bcast_handles[rank] = []
-
- # register_forward_pre_hook for transformer/embedding/norm/xxx block
- self._register_sync_parameters_hook()
-
- def _register_sync_parameters_hook(self) -> None:
- def _pre_forward_hook(model: nn.Module, inputs: Any): # pylint: disable=W0613
- bcast_handles = []
- # gather all required broadcast handles into a list
- for rank in self._block_to_rank[model]:
- bcast_handles.extend(self._bcast_handles[rank])
- # need to clear _bcast_handles since they will be waited on just below
- self._bcast_handles[rank] = []
- # wait all required broadcast handles to be completed
- for handle in bcast_handles:
- handle.wait()
-
- # register_forward_pre_hook for transformer/embedding/norm/xxx block
- for block, _ in self._block_to_rank.items():
- block.register_forward_pre_hook(partial(_pre_forward_hook))
-
- def get_rank_by_param(self, param) -> int:
- return self._param_to_rank[param]
-
- def add_bcast_handle(self, rank, handle) -> None:
- self._bcast_handles[rank].append(handle)
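A hedged sketch of the overlap pattern implemented by `ParamBcastSyncHandler`: parameter broadcasts are issued asynchronously after the optimizer step, and each block's forward pre-hook waits only on the handles owned by the ranks holding that block's parameters, so communication hides behind the forward of earlier blocks. The helper below is a simplified stand-in, not the class's own API.

```python
def make_pre_forward_hook(handles_by_rank, ranks_of_block):
    """handles_by_rank: dict rank -> list of async broadcast handles (dist.Work objects)."""
    def hook(module, inputs):
        for rank in ranks_of_block:
            for handle in handles_by_rank[rank]:
                handle.wait()            # block only on the broadcasts this block depends on
            handles_by_rank[rank] = []   # drop them so they are not waited on twice
    return hook

# usage sketch: block.register_forward_pre_hook(make_pre_forward_hook(bcast_handles, block_ranks))
```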
diff --git a/internlm/solver/pipeline_utils.py b/internlm/solver/pipeline_utils.py
deleted file mode 100644
index c57765e..0000000
--- a/internlm/solver/pipeline_utils.py
+++ /dev/null
@@ -1,34 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-from internlm.utils.logger import get_logger
-
-logger = get_logger(__file__)
-
-
-def partition_uniform(num_items, pipeline_parallel_size, num_chunks):
- assert (
- num_items % num_chunks == 0
- ), "Layer length should be divided by the number of chunks, otherwise parameter method is recomended"
-
- parts = [[] for _ in range(pipeline_parallel_size)]
- partition_items = num_items // num_chunks
- for idx in range(num_chunks):
- base_idx = idx * partition_items
- chunk_size = partition_items // pipeline_parallel_size
- left = pipeline_parallel_size - partition_items % pipeline_parallel_size
- if chunk_size == 0:
- raise ValueError("Some nodes in Pipeline have no requests")
-
- for p in range(pipeline_parallel_size):
- st = base_idx
- base_idx += chunk_size + (p >= left)
- parts[p].append((st, base_idx))
-
- indexes = []
- for _parts in parts:
- for s, e in _parts:
- indexes.extend(list(range(s, e)))
- assert len(indexes) == len(set(indexes)), indexes # should have no duplicates
- assert set(indexes) == set(list(range(num_items))), (indexes, num_items) # should have the same indexes as expected
- return parts
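A concrete example of the output shape, with hypothetical sizes: 32 layers split uniformly over 4 pipeline stages with 2 chunks per stage (an interleaved schedule) gives each stage two contiguous layer ranges, one from each chunk.

```python
parts = partition_uniform(num_items=32, pipeline_parallel_size=4, num_chunks=2)
# parts == [[(0, 4), (16, 20)],
#           [(4, 8), (20, 24)],
#           [(8, 12), (24, 28)],
#           [(12, 16), (28, 32)]]
```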
diff --git a/internlm/train/__init__.py b/internlm/train/__init__.py
deleted file mode 100644
index 457d7a4..0000000
--- a/internlm/train/__init__.py
+++ /dev/null
@@ -1,19 +0,0 @@
-from .training_internlm import (
- get_train_data_loader,
- get_validation_data_loader,
- initialize_llm_profile,
- initialize_model,
- initialize_optimizer,
- load_new_batch,
- record_current_batch_training_metrics,
-)
-
-__all__ = [
- "get_train_data_loader",
- "get_validation_data_loader",
- "initialize_llm_profile",
- "initialize_model",
- "initialize_optimizer",
- "load_new_batch",
- "record_current_batch_training_metrics",
-]
diff --git a/internlm/train/training_internlm.py b/internlm/train/training_internlm.py
deleted file mode 100644
index e08d4ec..0000000
--- a/internlm/train/training_internlm.py
+++ /dev/null
@@ -1,499 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import time
-from functools import partial
-from typing import Callable, Iterable, Union
-
-import torch
-import torch.distributed as dist
-from torch import nn
-from torch.utils.data import ConcatDataset, DataLoader
-
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
-from internlm.core.context.random import set_mode
-from internlm.core.naive_amp import NaiveAMPModel
-from internlm.core.trainer import TrainState
-from internlm.data.batch_sampler import StaticBatchSampler, get_dpsampler_dataloader
-from internlm.data.collaters import jsonl_ds_collate_fn, packed_collate_fn
-from internlm.data.dataset import get_dataset_dict
-from internlm.data.dummy_dataset import RandomDataset
-from internlm.data.packed_dataset import (
- PackedDataset,
- PackedDatasetWithoutCuSeqlen,
- get_packed_dataset_without_short_length,
-)
-from internlm.data.utils import DATASET_TYPE_IDS_MAP, unpack_data
-from internlm.monitor import send_heartbeat, set_env_var
-from internlm.monitor.monitor import monitor_manager as mm
-from internlm.solver.beta2_scheduler import Beta2Scheduler
-from internlm.solver.lr_scheduler import FineTuneCosineAnnealingWarmupLR
-from internlm.solver.optimizer import HybridZeroOptimizer
-from internlm.solver.optimizer.utils import ParamBcastSyncHandler
-from internlm.utils.common import DummyProfile
-from internlm.utils.logger import get_logger
-from internlm.utils.megatron_timers import megatron_timer as timer
-from internlm.utils.parallel import (
- is_no_pp_or_last_stage,
- sync_model_param,
- sync_model_param_within_tp,
-)
-from internlm.utils.registry import MODEL_INITIALIZER
-from internlm.utils.timeout import llm_timeout
-
-logger = get_logger(__file__)
-
-
-@llm_timeout(func_name="initialize_model")
-def initialize_model():
- """
- Initialize model with Automatic Mixed Precision.
-
- Returns:
- torch.nn.Module:
- The neural network model to be trained or evaluated.
- """
-
- model = MODEL_INITIALIZER.get_module(module_name=gpc.config.model_type)(**(gpc.config.model))
- if isinstance(model, nn.ModuleList):
- model = nn.ModuleList(
- [
- NaiveAMPModel(
- model=_m,
- output_to_fp32=False, # manually controlled by interleaved pipeline scheduler
- dtype=gpc.config.model.get("dtype", torch.half),
- sync_buffer=False,
- )
- for _m in model
- ]
- )
- else:
- model = NaiveAMPModel(
- model=model,
- output_to_fp32=is_no_pp_or_last_stage(),
- dtype=gpc.config.model.get("dtype", torch.half),
- sync_buffer=False,
- )
-
- # This sync is very important: the model weights kept in the optimizer are copied
- # from the original parameters in memory, so we must make sure the dp sync
- # does not make the optimizer's copy diverge from the original parameters.
- sync_model_param(model, parallel_mode=ParallelMode.DATA)
-
- # This call makes sure that parameters not split by tensor parallelism are
- # identical across tensor-parallel ranks.
- sync_model_param_within_tp(model)
-
- # Change random state mode to ParallelMode.DATA after model is built, guaranteeing the random
- # state in the same dp group are all the same.
- set_mode(ParallelMode.DATA)
-
- return model
-
-
-@llm_timeout(func_name="initialize_optimizer")
-def initialize_optimizer(model: Union[nn.Module, nn.ModuleList]):
- """
- Initialize optimizer.
-
- Args:
- model (:class:`torch.nn.Module`): Your model instance to be trained or evaluated.
-
- Returns:
- A tuple of (optimizer, beta2_scheduler, lr_scheduler).
- """
- if gpc.config.hybrid_zero_optimizer.overlap_sync_param:
- param_bcast_sync_handler = ParamBcastSyncHandler(model)
- else:
- param_bcast_sync_handler = None
-
- adam_cfg = gpc.config.adam
- naive_optimizer = torch.optim.AdamW(
- params=[{"params": model.parameters(), "weight_decay": adam_cfg.weight_decay}],
- lr=adam_cfg.lr,
- betas=(adam_cfg.adam_beta1, adam_cfg.adam_beta2),
- eps=adam_cfg.adam_eps,
- )
-
- optimizer = HybridZeroOptimizer(
- naive_optimizer,
- grad_scal_cfg=gpc.config.grad_scaler,
- zero_cfg=gpc.config.hybrid_zero_optimizer,
- param_bcast_sync_handler=param_bcast_sync_handler,
- )
-
- beta2_scheduler = Beta2Scheduler(optimizer=naive_optimizer, **gpc.config.beta2_scheduler)
-
- lr_scheduler = FineTuneCosineAnnealingWarmupLR(optimizer, **gpc.config.lr_scheduler)
-
- return optimizer, beta2_scheduler, lr_scheduler
-
-
-@llm_timeout(func_name="get_train_data_loader")
-def get_train_data_loader(
- num_worker: int = 0, dataset_generate_func: Callable = None, train_sampler=None, train_collate_fn=None
-):
- """
- Generate and return the training data loader.
-
- Args:
- num_worker (:class:`int`): number of subprocesses used for dataloader.
- dataset_generate_func (:class:`Callable`, optional): generate function for dataset.
- train_sampler (:class:`torch.utils.data.sampler`, optional): dataset sampler for training dataloader.
- train_collate_fn (:class:`Callable`, optional): collate function for training dataloader.
-
- Returns:
- A tuple of (train_dl, dataset_types).
- """
-
- # Get the dataset types
- dataset_types = None
- dataset_types = list(DATASET_TYPE_IDS_MAP.keys())
- data_cfg = gpc.config.data
-
- # Get the sample weight dictionary
- train_folder = data_cfg.train_folder
-
- if not train_folder:
- train_ds = RandomDataset(num_samples=1000000, max_len=data_cfg.seq_len)
- if data_cfg.pack_sample_into_one:
- train_ds = PackedDatasetWithoutCuSeqlen(
- train_ds, max_length_per_sample=data_cfg.seq_len, packed_length=data_cfg.packed_length
- )
- else:
- train_ds = PackedDataset(
- train_ds, max_length_per_sample=data_cfg.seq_len, packed_length=data_cfg.packed_length
- )
- else:
- if dataset_generate_func is not None:
- train_ds = dataset_generate_func()
- else:
- train_ds = get_packed_dataset_without_short_length(
- folder=data_cfg.train_folder,
- packed_length=data_cfg.packed_length,
- max_length_per_sample=data_cfg.seq_len,
- show_progress=dist.get_rank() == 0,
- min_length=data_cfg.min_length,
- min_length_dict=data_cfg.get("min_length_dict", {}),
- pack_into_one_sample=data_cfg.pack_sample_into_one,
- )
-
- if dataset_generate_func is None or not train_folder:
- # partition already completed
- assert isinstance(train_ds, (PackedDataset, PackedDatasetWithoutCuSeqlen, ConcatDataset))
- # Create the training dataset sampler
- train_sampler = StaticBatchSampler(
- train_ds.datasets if isinstance(train_ds, ConcatDataset) else [train_ds],
- batch_size=data_cfg.micro_num,
- rampup_batch_size=data_cfg.rampup_batch_size,
- micro_bsz=data_cfg.micro_bsz,
- seed=1024,
- drop_last=True,
- data_rank=gpc.get_local_rank(ParallelMode.DATA),
- data_world_size=gpc.get_world_size(ParallelMode.DATA),
- )
-
- if dataset_generate_func is None or not train_folder:
- train_collate_fn = partial(packed_collate_fn, packed_length=data_cfg.packed_length)
-
- # Create the training data loader
- train_dl = DataLoader(
- dataset=train_ds,
- batch_sampler=train_sampler,
- num_workers=num_worker,
- pin_memory=True,
- collate_fn=train_collate_fn,
- persistent_workers=num_worker > 0,
- )
-
- return train_dl, dataset_types
-
-
-@llm_timeout(func_name="get_validation_data_loader")
-def get_validation_data_loader(
- num_worker: int = 0, dataset_generate_func: Callable = None, val_collate_fn=None, dataloader_func=None
-):
- """Generate and return the validation data loader."""
-
- data_cfg = gpc.config.data
-
- if not data_cfg.valid_folder:
- val_ds = RandomDataset(num_samples=gpc.get_world_size(ParallelMode.DATA) * 500, max_len=data_cfg.seq_len)
- else:
- if dataset_generate_func is not None:
- assert val_collate_fn is not None and dataloader_func is not None
- val_ds = dataset_generate_func()
- else:
- val_ds = get_dataset_dict(folder=data_cfg.valid_folder, split="")
-
- if not isinstance(val_ds, dict):
- val_ds = {"val": val_ds}
-
- if val_collate_fn is None or not data_cfg.valid_folder:
- val_collate_fn = partial(jsonl_ds_collate_fn, max_length_per_sample=data_cfg.seq_len)
-
- val_dls = {}
- for val_name, ds in val_ds.items():
- if dataloader_func and data_cfg.valid_folder is not None:
- val_dls[val_name] = dataloader_func(dataset=ds, collate_fn=val_collate_fn)
- if gpc.is_rank_for_log():
- logger.info(
- f"load validation dataset {val_name} with valid batch size {str(data_cfg.valid_micro_num)} and "
- f"{ds.size} Byte samples."
- )
- else:
- # Increasing the validation batch size can speed up evaluation, but it should not be too large,
- # otherwise too much data may be dropped (a worked example of this rounding follows the function).
- batch_size = min(
- data_cfg.valid_micro_num * data_cfg.micro_bsz, len(ds) // gpc.get_world_size(ParallelMode.DATA)
- )
- batch_size = batch_size // data_cfg.micro_bsz * data_cfg.micro_bsz
-
- if batch_size == 0 and gpc.is_rank_for_log():
- logger.info(f"skip validate {val_name}.")
- continue
-
- val_dls[val_name] = get_dpsampler_dataloader(
- ds,
- shuffle=False,
- num_workers=num_worker,
- batch_size=batch_size,
- collate_fn=val_collate_fn,
- drop_last=True,
- ) # drop_last=True, otherwise it may cause problems in the last batch
-
- if gpc.is_rank_for_log():
- logger.info(
- f"load validation dataset {val_name} with valid batch size {str(batch_size)} and "
- f"samples {str(len(val_dls[val_name]))}."
- )
-
- return val_dls
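
A quick arithmetic sketch of the batch-size rounding above, with hypothetical numbers: valid_micro_num=4, micro_bsz=2 and len(ds) // dp_world_size == 37.

batch_size = min(4 * 2, 37)        # 8
batch_size = batch_size // 2 * 2   # 8, already a multiple of micro_bsz
# With only 5 samples per data-parallel rank: min(8, 5) = 5, then 5 // 2 * 2 = 4,
# so one sample is dropped to keep the batch a multiple of micro_bsz.
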
-
-
-@llm_timeout(func_name="load_new_batch")
-def load_new_batch(train_dl: DataLoader, train_iter: Iterable, train_state: TrainState):
- """
- Load and return the new batch data based on training data loader.
-
- Args:
- train_dl (torch.utils.data.DataLoader): Dataloader for training.
- train_iter (Iterable): Data iterator from which get a batch of data, obtained by calling iter(dataloader).
- train_state (TrainState): Current training state.
-
- Returns: A batch data and the updated train_iter.
- """
-
- timer("batch-gen").start()
- try:
- batch = next(train_iter) # structure is ({'input_ids': Tensor, 'cu_seqlens': Tensor}, Tensor)
- if hasattr(train_state, "batch_sampler_iter"):
- next(train_state.batch_sampler_iter)
- except StopIteration:
- train_iter = iter(train_dl)
- batch = next(train_iter)
- train_state.num_consumed_samples_in_epoch = 0
- if hasattr(train_state, "batch_sampler"):
- train_state.batch_sampler_iter = iter(train_state.batch_sampler)
- next(train_state.batch_sampler_iter)
- timer("batch-gen").stop()
-
- if batch[0].get("type_ids", None) is not None:
- # if use_flash_attn is False, we need to unpack type_ids
- if not gpc.config.model.use_flash_attn:
- batch[0]["type_ids"] = unpack_data(batch[0]["type_ids"], batch[0]["cu_seqlens"])
-
- return batch, train_iter
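
A minimal illustration (hypothetical values) of how cu_seqlens encodes packed samples in the batch structure above; the same boundaries are used later by record_current_batch_training_metrics.

import torch

# Three samples of lengths 5, 3 and 8 packed into one sequence.
cu_seqlens = torch.tensor([0, 5, 8, 16])
num_samples = len(cu_seqlens) - 1                      # 3
max_length = (cu_seqlens[1:] - cu_seqlens[:-1]).max()  # tensor(8)
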
-
-
-def initialize_llm_profile(profiling: bool = False, start_time: str = None):
- """Initialize and return the profiler context manager instance."""
-
- if profiling and gpc.get_local_rank(ParallelMode.DATA) == 0 and gpc.get_local_rank(ParallelMode.TENSOR) == 0:
- llm_profile = torch.profiler.profile
- logger.info(f"Do profiling in rank {gpc.get_global_rank()}!")
- else:
- llm_profile = DummyProfile
-
- return llm_profile(
- activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
- schedule=torch.profiler.schedule(skip_first=5, wait=1, warmup=1, active=1, repeat=1),
- on_trace_ready=torch.profiler.tensorboard_trace_handler(
- f"{gpc.config.JOB_NAME}/{start_time}/traces/rank{gpc.get_global_rank()}_"
- + f"dp{gpc.get_local_rank(ParallelMode.DATA)}_"
- + f"tp{gpc.get_local_rank(ParallelMode.TENSOR)}_"
- + f"pp{gpc.get_local_rank(ParallelMode.PIPELINE)}",
- ),
- with_stack=True,
- with_modules=True,
- )
-
-
-@llm_timeout(func_name="record_current_batch_training_metrics")
-def record_current_batch_training_metrics(
- get_tflops_func,
- logger,
- writer,
- success_update,
- batch_count,
- batch,
- train_state,
- optimizer,
- beta2_scheduler,
- trainer,
- start_time,
- loss,
- grad_norm,
- metric,
- update_panel,
-):
- """
- Print some training metrics of current batch.
- """
-
- set_env_var(key="LAST_ACTIVE_TIMESTAMP", value=int(time.time()))
-
- timer.store_last_timers()
- if success_update in (0, True):
- train_state.num_consumed_tokens += batch[1].nelement() * gpc.get_world_size(ParallelMode.DATA)
- if is_no_pp_or_last_stage():
- acc_perplex = metric.get_metric()
-
- if success_update and gpc.is_rank_for_log():
- lr = optimizer.param_groups[0]["lr"]
- if hasattr(trainer.engine.optimizer, "grad_scaler"):
- scaler = trainer.engine.optimizer.grad_scaler._scale.item()
- elif hasattr(trainer.engine.optimizer.optim, "grad_scaler"):
- scaler = trainer.engine.optimizer.optim.grad_scaler._scale.item()
-
- num_tokens_in_batch = batch[1].nelement()
- num_samples_in_batch = sum([len(b) - 1 for b in batch[0]["cu_seqlens"]])
- max_length_in_batch = max([(b[1:] - b[:-1]).max().item() for b in batch[0]["cu_seqlens"]])
- max_samples_in_batch = max([len(b) - 1 for b in batch[0]["cu_seqlens"]])
- min_samples_in_batch = min([len(b) - 1 for b in batch[0]["cu_seqlens"]])
- time_cost = time.time() - start_time
- tk_per_gpu = round(
- num_tokens_in_batch * gpc.get_world_size(ParallelMode.DATA) / gpc.get_world_size(ParallelMode.GLOBAL),
- 4,
- )
- tgs_statistic = train_state.tgs_statistic
- tgs_statistic["sum_step"] += 1
- tgs_statistic["sum_tg"] += tk_per_gpu
- tgs_statistic["sum_time"] += time_cost
- tgs_statistic["sum_last_tg_10"] += tk_per_gpu
- tgs_statistic["sum_last_time_10"] += time_cost
- tgs_statistic["sum_last_tg_50"] += tk_per_gpu
- tgs_statistic["sum_last_time_50"] += time_cost
- tgs_statistic["SMA_tg_50"] += tk_per_gpu
- tgs_statistic["SMA_time_50"] += time_cost
- tgs_statistic["SMA_tg_50_list"].append(tk_per_gpu)
- tgs_statistic["SMA_time_50_list"].append(time_cost)
- if tgs_statistic["sum_step"] > 50:
- tgs_statistic["SMA_tg_50"] -= tgs_statistic["SMA_tg_50_list"][0]
- tgs_statistic["SMA_time_50"] -= tgs_statistic["SMA_time_50_list"][0]
- tgs_statistic["SMA_tg_50_list"].popleft()
- tgs_statistic["SMA_time_50_list"].popleft()
-
- last_tgs_1 = round(tk_per_gpu / time_cost, 2)
- tgs_statistic["sum_tgs"] += last_tgs_1
-
- if tgs_statistic["sum_step"] % 10 == 0:
- tgs_statistic["last_tgs_10"] = round(tgs_statistic["sum_last_tg_10"] / tgs_statistic["sum_last_time_10"], 2)
- tgs_statistic["sum_last_tg_10"] = 0
- tgs_statistic["sum_last_time_10"] = 0
-
- if tgs_statistic["sum_step"] % 50 == 0:
- tgs_statistic["last_tgs_50"] = round(tgs_statistic["sum_last_tg_50"] / tgs_statistic["sum_last_time_50"], 2)
- tgs_statistic["sum_last_tg_50"] = 0
- tgs_statistic["sum_last_time_50"] = 0
-
- last_tgs_10 = tgs_statistic["last_tgs_10"]
- last_tgs_50 = tgs_statistic["last_tgs_50"]
-
- tgs_all = round(tgs_statistic["sum_tg"] / tgs_statistic["sum_time"], 2)
- tgs_avg = round(tgs_statistic["sum_tgs"] / tgs_statistic["sum_step"], 2)
- tgs_SMA = round(tgs_statistic["SMA_tg_50"] / tgs_statistic["SMA_time_50"], 2)
-
- tflops = get_tflops_func((time.time() - start_time))
-
- tgs_origin = round(
- num_tokens_in_batch
- * gpc.get_world_size(ParallelMode.DATA)
- / gpc.get_world_size(ParallelMode.GLOBAL)
- / (time.time() - start_time),
- 2,
- )
-
- infos = {
- "tflops": tflops,
- "step": batch_count,
- "loss": loss.item(),
- "tgs (tokens/gpu/second)": tgs_origin,
- "tgs/last_tgs_1": last_tgs_1,
- "tgs/tgs_all": tgs_all,
- "tgs/tgs_avg": tgs_avg,
- "tgs/tgs_SMA": tgs_SMA,
- "tgs/last_tgs_10": last_tgs_10,
- "tgs/last_tgs_50": last_tgs_50,
- "lr": lr,
- "loss_scale": scaler,
- "grad_norm": grad_norm,
- }
-
- infos["micro_num"] = len(batch[1])
- infos["num_consumed_tokens"] = train_state.num_consumed_tokens
- infos["inf_nan_skip_batches"] = train_state.inf_nan_skip_batches
- infos["num_samples_in_batch"] = num_samples_in_batch # the number of batches which have the most samples
- infos["largest_length"] = max_length_in_batch # the longest input
- infos["largest_batch"] = max_samples_in_batch # the batch with the most samples
- infos["smallest_batch"] = min_samples_in_batch
- infos["adam_beta2"] = beta2_scheduler.get_beta2()
-
- fwd_bwd_time = round(timer("fwd-bwd").elapsed(), 2)
- infos["fwd_bwd_time"] = fwd_bwd_time
-
- for key, value in acc_perplex.items():
- infos[key] = value
-
- line = ""
- for key, value in infos.items():
- line += f"{key}={value} "
- if isinstance(value, dict):
- writer.add_scalars(key=key, value=value, step=train_state.step_count)
- else:
- writer.add_scalar(key=key, value=value, step=train_state.step_count)
-
- if gpc.config.monitor.alert.get("light_monitor_address", None) and batch_count % 50 == 0:
- send_heartbeat("train_metrics", infos)
-
- if update_panel:
- # metrics shown with dashboard panels
- panel_metrics = {
- "step": batch_count,
- "lr": lr,
- "num_consumed_tokens": train_state.num_consumed_tokens,
- "loss": loss.item(),
- "flops": tflops,
- "tgs": last_tgs_1,
- "acc": acc_perplex["acc"],
- "perplexity": acc_perplex["perplexity"],
- "fwd_bwd_time": fwd_bwd_time,
- }
- for norm_key, norm_value in grad_norm.items():
- panel_metrics[norm_key] = norm_value
-
- logger.info(
- "{line}",
- line=line,
- extra=panel_metrics,
- )
- else:
- logger.info(line)
-
- # if loss spike occurs, send alert info to feishu
- mm.monitor_loss_spike(
- alert_address=gpc.config.monitor.alert.feishu_alert_address,
- step_count=batch_count,
- cur_step_loss=loss.item(),
- )
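
The tgs bookkeeping above maintains a 50-step simple moving average of tokens/gpu/second; a standalone sketch of that bookkeeping (hypothetical helper, not the actual train_state structure):

from collections import deque

window_tg, window_time = deque(), deque()
sma_tg, sma_time = 0.0, 0.0

def update_sma(tokens_per_gpu, step_time, window=50):
    """Return the simple moving average of tokens/gpu/second over the last `window` steps."""
    global sma_tg, sma_time
    window_tg.append(tokens_per_gpu)
    window_time.append(step_time)
    sma_tg += tokens_per_gpu
    sma_time += step_time
    if len(window_tg) > window:
        sma_tg -= window_tg.popleft()
        sma_time -= window_time.popleft()
    return round(sma_tg / sma_time, 2)
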
diff --git a/internlm/utils/__init__.py b/internlm/utils/__init__.py
deleted file mode 100644
index e69de29..0000000
diff --git a/internlm/utils/checkpoint.py b/internlm/utils/checkpoint.py
deleted file mode 100644
index 31a97af..0000000
--- a/internlm/utils/checkpoint.py
+++ /dev/null
@@ -1,269 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import weakref
-
-import torch
-from torch.utils.checkpoint import check_backward_validity, detach_variable
-
-from internlm.core.context.random import (
- get_current_mode,
- get_states,
- set_mode,
- set_seed_states,
- sync_states,
-)
-
-from .common import get_current_device
-
-
-def copy_to_device(obj, device):
- if torch.is_tensor(obj):
- # Notice:
- # When in no_grad context, requires_grad is False after movement
- ret = obj.to(device).detach()
- ret.requires_grad = obj.requires_grad
- return ret
- elif isinstance(obj, list):
- return [copy_to_device(i, device) for i in obj]
- elif isinstance(obj, tuple):
- return tuple([copy_to_device(v, device) for v in obj])
- elif isinstance(obj, dict):
- return {k: copy_to_device(v, device) for k, v in obj.items()}
- else:
- return obj
-
-
-class CheckpointFunction(torch.autograd.Function):
- """
- Checkpoint Function
- """
-
- @staticmethod
- def forward(ctx, run_function, activation_offload=False, *args): # pylint: disable=W1113
- check_backward_validity(args)
- ctx.run_function = run_function
- ctx.activation_offload = activation_offload
- ctx.device = get_current_device()
-
- # preserve rng states
- ctx.fwd_cpu_rng_state = torch.get_rng_state()
- sync_states()
- ctx.fwd_seed_states = get_states(copy=True)
- ctx.fwd_current_mode = get_current_mode()
-
- if hasattr(torch, "is_autocast_enabled"):
- ctx.had_autocast_in_fwd = torch.is_autocast_enabled()
- else:
- ctx.had_autocast_in_fwd = False
-
- if activation_offload:
- inputs_cuda = copy_to_device(args, ctx.device)
- else:
- inputs_cuda = args
-
- with torch.no_grad():
- outputs = run_function(*inputs_cuda)
- # Save non-tensor inputs in ctx, keep a placeholder None for tensors
- # to be filled out during the backward.
- ctx.inputs = []
- ctx.tensor_indices = []
- tensor_inputs = []
- for i, arg in enumerate(args):
- if torch.is_tensor(arg):
- if activation_offload:
- tensor_inputs.append(copy_to_device(arg, "cpu"))
- else:
- tensor_inputs.append(arg)
- ctx.tensor_indices.append(i)
- ctx.inputs.append(None)
- else:
- ctx.inputs.append(arg)
-
- if activation_offload:
- ctx.tensor_inputs = tensor_inputs
- else:
- ctx.save_for_backward(*tensor_inputs)
- return outputs
-
- @staticmethod
- def backward(ctx, *args):
- if not torch.autograd._is_checkpoint_valid():
- raise RuntimeError(
- "Checkpointing is not compatible with .grad() or when an `inputs` parameter is "
- "passed to .backward(). Please use .backward() and do not pass its `inputs` argument."
- )
- # Copy the list to avoid modifying original list.
- inputs = list(ctx.inputs)
- tensor_indices = ctx.tensor_indices
-
- if ctx.activation_offload:
- tensors = ctx.tensor_inputs
- else:
- tensors = ctx.saved_tensors
-
- # store the current states
- bwd_cpu_rng_state = torch.get_rng_state()
- sync_states()
- bwd_seed_states = get_states(copy=True)
- bwd_current_mode = get_current_mode()
-
- # set the states to what it used to be
- torch.set_rng_state(ctx.fwd_cpu_rng_state)
- for parallel_mode, state in ctx.fwd_seed_states.items():
- set_seed_states(parallel_mode, state)
- set_mode(ctx.fwd_current_mode)
- if ctx.activation_offload:
- tensors = copy_to_device(tensors, ctx.device)
-
- # Fill in inputs with appropriate saved tensors.
- for i, idx in enumerate(tensor_indices):
- inputs[idx] = tensors[i]
- detached_inputs = detach_variable(tuple(inputs))
- if ctx.had_autocast_in_fwd:
- with torch.enable_grad(), torch.cuda.amp.autocast():
- outputs = ctx.run_function(*detached_inputs)
- else:
- with torch.enable_grad():
- outputs = ctx.run_function(*detached_inputs)
-
- if isinstance(outputs, torch.Tensor):
- outputs = (outputs,)
- # recover the rng states
- torch.set_rng_state(bwd_cpu_rng_state)
- for parallel_mode, state in bwd_seed_states.items():
- set_seed_states(parallel_mode, state)
- set_mode(bwd_current_mode)
-
- # run backward() with only tensor that requires grad
- outputs_with_grad = []
- args_with_grad = []
- for i in range(len(outputs)):
- if torch.is_tensor(outputs[i]) and outputs[i].requires_grad:
- outputs_with_grad.append(outputs[i])
- args_with_grad.append(args[i])
- if len(outputs_with_grad) == 0:
- raise RuntimeError("none of output has requires_grad=True," " this checkpoint() is not necessary")
- torch.autograd.backward(outputs_with_grad, args_with_grad)
- grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else None for inp in detached_inputs)
- return (None, None) + grads
-
-
-def activation_checkpoint(function, activation_offload, *args, use_reentrant: bool = True):
- """Checkpoint the computation while preserve the rng states, modified from Pytorch torch.utils.checkpoint.
- Args:
- function: Describe the forward pass function. It should know how to handle the input tuples.
- activation_offload: The variable to check whether we should offload activation to cpu
- args (list): Tuple containing the parameters of the function
- use_reentrant: Bool type to check if we need to use_reentrant, if use_reentrant=False, there
- might be more flexibility for user to define there checkpoint function
- Returns:
- Output of running function with provided args.
- """
- if use_reentrant:
- return CheckpointFunction.apply(function, activation_offload, *args)
- else:
- return _checkpoint_without_reentrant(
- function,
- activation_offload,
- *args,
- )
-
-
-def _checkpoint_without_reentrant(function, activation_offload=False, *args): # pylint: disable=W1113
- # store rng_state
- fwd_cpu_state = torch.get_rng_state()
- sync_states()
- fwd_seed_states = get_states(copy=True)
- fwd_current_mode = get_current_mode()
-
- # check if use autocast
- if hasattr(torch, "is_autocast_enabled"):
- has_autocast_in_fwd = torch.is_autocast_enabled()
- else:
- has_autocast_in_fwd = False
-
- # using WeakKeyDictionary to store all the activation the first time we call unpack
- storage: weakref.WeakKeyDictionary = weakref.WeakKeyDictionary()
- weak_holder_list = []
-
- # class for weakref.ref
- class Holder:
- pass
-
- # return a Holder object for later unpack process
- def pack():
- res = Holder()
- weak_holder_list.append(weakref.ref(res))
- return res
-
- # unpack hook
- def unpack(x):
- unpack_counter = 0
-
- # re-compute all the activation inside the function when we first call unpack
- if len(storage) == 0:
-
- def inner_pack(inner):
- nonlocal unpack_counter
- unpack_counter += 1
-
- # If the holder went out of scope, the SavedVariable is dead and so
- # the value will never be read from the storage. Skip filling it.
- if weak_holder_list[unpack_counter - 1]() is None:
- return
-
- # Use detach here to ensure we don't keep the temporary autograd
- # graph created during the second forward
- storage[weak_holder_list[unpack_counter - 1]()] = inner.detach()
- return
-
- def inner_unpack(packed):
- raise RuntimeError("You are calling backwards on a tensor that is never exposed. Please open an issue.")
-
- # restore rng state
- torch.set_rng_state(fwd_cpu_state)
- for parallel_mode, state in fwd_seed_states.items():
- set_seed_states(parallel_mode, state)
- set_mode(fwd_current_mode)
-
- # reload arg into device if needed
- if activation_offload:
- for arg in args:
- if torch.is_tensor(arg):
- arg = arg.to(device=device)
-
- # rerun forward, the inner_pack will store all the activations in storage
- if has_autocast_in_fwd:
- with torch.enable_grad(), torch.cuda.amp.autocast(), torch.autograd.graph.saved_tensors_hooks(
- inner_pack, inner_unpack
- ):
- function(*args)
- else:
- with torch.enable_grad(), torch.autograd.graph.saved_tensors_hooks(inner_pack, inner_unpack):
- function(*args)
-
- if x not in storage:
- raise RuntimeError(
- "Attempt to retrieve a tensor saved by autograd multiple times without checkpoint"
- " recomputation being triggered in between, this is not currently supported. Please"
- " open an issue with details on your use case so that we can prioritize adding this."
- )
-
- return storage[x]
-
- # get device if we need to offload the activation
- if activation_offload:
- device = get_current_device()
-
- # run function with pack and unpack as saved_tensors_hooks
- with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
- output = function(*args)
-
- # offload activation if needed
- if activation_offload:
- for arg in args:
- if torch.is_tensor(arg):
- arg = arg.to(device="cpu")
-
- return output
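
The non-reentrant path above is built on torch.autograd.graph.saved_tensors_hooks; a minimal standalone sketch of that mechanism, here simply offloading saved activations to CPU:

import torch

def pack(tensor):
    # called when autograd saves an activation for the backward pass
    return tensor.to("cpu")

def unpack(tensor):
    # called when the backward pass needs the activation again
    return tensor.to("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(4, 4, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = (x * x).sum()
y.backward()  # unpack runs here
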
diff --git a/internlm/utils/common.py b/internlm/utils/common.py
deleted file mode 100644
index f3b58c0..0000000
--- a/internlm/utils/common.py
+++ /dev/null
@@ -1,238 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import bisect
-import inspect
-import os
-import random
-from contextlib import contextmanager
-from datetime import datetime
-from typing import Union
-
-import numpy as np
-import torch
-
-import internlm
-
-CURRENT_TIME = None
-
-
-def parse_args():
- parser = internlm.get_default_parser()
- args = parser.parse_args()
-
- return args
-
-
-def get_master_node():
- import subprocess
-
- if os.getenv("SLURM_JOB_ID") is None:
- raise RuntimeError("get_master_node can only used in Slurm launch!")
- result = subprocess.check_output('scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1', shell=True)
- result = result.decode("utf8").strip()
- return result
-
-
-def move_norm_to_cuda(norm: Union[float, torch.Tensor]) -> Union[float, torch.Tensor]:
- if torch.is_tensor(norm) and norm.device.type != "cuda":
- norm = norm.to(torch.cuda.current_device())
- return norm
-
-
-def _move_tensor(element):
- if not torch.is_tensor(element):
- # we expect the data to be a list of dictionaries
- for item in element:
- if isinstance(item, dict):
- for key, value in item.items():
- assert not value.is_cuda, "elements are already on devices."
- item[key] = value.to(get_current_device()).detach()
- elif isinstance(item, list):
- for index, value in enumerate(item):
- assert not value.is_cuda, "elements are already on devices."
- item[index] = value.to(get_current_device()).detach()
- elif torch.is_tensor(item):
- if not item.is_cuda:
- item = item.to(get_current_device()).detach()
- else:
- assert torch.is_tensor(element), f"element should be of type tensor, but got {type(element)}"
- if not element.is_cuda:
- element = element.to(get_current_device()).detach()
- return element
-
-
-def move_to_device(data):
- if isinstance(data, torch.Tensor):
- data = data.to(get_current_device())
- elif isinstance(data, (list, tuple)):
- data_to_return = []
- for element in data:
- if isinstance(element, dict):
- data_to_return.append({k: _move_tensor(v) for k, v in element.items()})
- else:
- data_to_return.append(_move_tensor(element))
- data = data_to_return
- elif isinstance(data, dict):
- data = {k: _move_tensor(v) for k, v in data.items()}
- else:
- raise TypeError(f"Expected batch data to be of type torch.Tensor, list, tuple, or dict, but got {type(data)}")
- return data
-
-
-def get_tensor_norm(norm: Union[float, torch.Tensor], move_to_cuda) -> torch.Tensor:
- if isinstance(norm, float):
- norm = torch.Tensor([norm])
- if move_to_cuda:
- norm = norm.to(torch.cuda.current_device())
- return norm
-
-
-def get_current_device() -> torch.device:
- """
- Returns currently selected device (gpu/cpu).
- If cuda available, return gpu, otherwise return cpu.
- """
- if torch.cuda.is_available():
- return torch.device(f"cuda:{torch.cuda.current_device()}")
- else:
- return torch.device("cpu")
-
-
-def get_batch_size(data):
- if isinstance(data, torch.Tensor):
- return data.size(0)
- elif isinstance(data, (list, tuple)):
- if isinstance(data[0], dict):
- return data[0][list(data[0].keys())[0]].size(0)
- return data[0].size(0)
- elif isinstance(data, dict):
- return data[list(data.keys())[0]].size(0)
-
-
-def filter_kwargs(func, kwargs):
- sig = inspect.signature(func)
- return {k: v for k, v in kwargs.items() if k in sig.parameters}
-
-
-def launch_time():
- global CURRENT_TIME
- if not CURRENT_TIME:
- CURRENT_TIME = datetime.now().strftime("%b%d_%H-%M-%S")
- return CURRENT_TIME
-
-
-def set_random_seed(seed):
- """Set random seed for reproducability."""
- # It is recommended to use this only when inference.
- if seed is not None:
- assert seed > 0
- random.seed(seed)
- np.random.seed(seed)
- torch.manual_seed(seed)
- torch.cuda.manual_seed(seed)
- # if you are using multi-GPU.
- torch.cuda.manual_seed_all(seed)
-
-
-@contextmanager
-def conditional_context(context_manager, enable=True):
- if enable:
- with context_manager:
- yield
- else:
- yield
-
-
-class BatchSkipper:
- """
- BatchSkipper is used to determine whether to skip the current batch_idx.
- """
-
- def __init__(self, skip_batches):
- if skip_batches == "":
- pass
- intervals = skip_batches.split(",")
- spans = []
- if skip_batches != "":
- for interval in intervals:
- if "-" in interval:
- start, end = map(int, interval.split("-"))
- else:
- start, end = int(interval), int(interval)
- if spans:
- assert spans[-1] <= start
- spans.extend((start, end + 1))
- self.spans = spans
-
- def __call__(self, batch_count):
- index = bisect.bisect_right(self.spans, batch_count)
- return index % 2 == 1
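
Usage sketch for BatchSkipper (assuming the class remains importable from internlm.utils.common, the module removed by this diff): the string "10-12,50" marks batch indices 10 through 12 and 50 as skipped.

from internlm.utils.common import BatchSkipper

skipper = BatchSkipper("10-12,50")
skipper(11)  # True  -> skip this batch
skipper(13)  # False -> run this batch
skipper(50)  # True
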
-
-
-class SingletonMeta(type):
- """
- Singleton Meta.
- """
-
- _instances = {}
-
- def __call__(cls, *args, **kwargs):
- if cls not in cls._instances:
- cls._instances[cls] = super().__call__(*args, **kwargs)
- else:
- assert (
- len(args) == 0 and len(kwargs) == 0
- ), f"{cls.__name__} is a singleton class and a instance has been created."
- return cls._instances[cls]
-
-
-def get_megatron_flops(
- elapsed_time_per_iter,
- checkpoint=False,
- seq_len=2048,
- hidden_size=12,
- num_layers=32,
- vocab_size=12,
- global_batch_size=4,
- global_world_size=1,
- mlp_ratio=4,
- use_swiglu=True,
-):
- """
- Calculate FLOPS based on the Megatron paper: https://deepakn94.github.io/assets/papers/megatron-sc21.pdf
- """
-
- checkpoint_activations_factor = 4 if checkpoint else 3
-
- if use_swiglu:
- mlp_ratio = mlp_ratio * 3 / 2
-
- flops_per_iteration = (
- checkpoint_activations_factor
- * (
- (8 + mlp_ratio * 4) * global_batch_size * seq_len * hidden_size**2
- + 4 * global_batch_size * seq_len**2 * hidden_size
- )
- ) * num_layers + 6 * global_batch_size * seq_len * hidden_size * vocab_size
-
- tflops = flops_per_iteration / (elapsed_time_per_iter * global_world_size * (10**12))
- return tflops
-
-
-class DummyProfile:
- """
- Dummy Profile.
- """
-
- def __init__(self, *args, **kwargs) -> None:
- pass
-
- def __enter__(self):
- return self
-
- def __exit__(self, a, b, c):
- pass
-
- def step(self):
- pass
diff --git a/internlm/utils/evaluation.py b/internlm/utils/evaluation.py
deleted file mode 100644
index 2b9a384..0000000
--- a/internlm/utils/evaluation.py
+++ /dev/null
@@ -1,168 +0,0 @@
-from contextlib import contextmanager
-
-import torch
-import torch.distributed as dist
-from tqdm import tqdm
-
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
-from internlm.core.scheduler import SchedulerMetricHook
-from internlm.model.metrics import AccPerplex
-
-
-@contextmanager
-def switch_evaluation_no_pipeline_scheduler(trainer, grad_accum_size, grad_accum_batch_size, metric_hook_list):
- if not gpc.is_using_pp():
- prev_data_process_func = trainer.schedule.data_process_func
- prev_grad_accum_size = trainer.schedule._grad_accum_size
- prev_grad_accum_batch_size = trainer.schedule._grad_accum_batch_size
- prev_metric_hooks = trainer.schedule._hooks
- try:
- trainer.schedule.data_process_func = None
- trainer.schedule._grad_accum_size = grad_accum_size
- trainer.schedule._grad_accum_batch_size = grad_accum_batch_size
- trainer.schedule._hooks = metric_hook_list
- yield
- finally:
- trainer.schedule.data_process_func = prev_data_process_func
- trainer.schedule._grad_accum_size = prev_grad_accum_size
- trainer.schedule._grad_accum_batch_size = prev_grad_accum_batch_size
- trainer.schedule._hooks = prev_metric_hooks
-
-
-@contextmanager
-def switch_evaluation_pipeline_scheduler(trainer, num_microbatches, tensor_shape, metric_hook_list):
- if gpc.is_using_pp():
- pre_data_process_func = trainer.schedule.data_process_func
- prev_num_microbatches = trainer.schedule.num_microbatches
- prev_tensor_shape = trainer.schedule.tensor_shape
- prev_metric_hooks = trainer.schedule._hooks
- try:
- trainer.schedule.data_process_func = None
- trainer.schedule.num_microbatches = num_microbatches
- trainer.schedule.tensor_shape = tensor_shape
- trainer.schedule._hooks = metric_hook_list
- yield
- finally:
- trainer.schedule.data_process_func = pre_data_process_func
- trainer.schedule.num_microbatches = prev_num_microbatches
- trainer.schedule.tensor_shape = prev_tensor_shape
- trainer.schedule._hooks = prev_metric_hooks
-
-
-@contextmanager
-def switch_sequence_parallel_mode():
- prev_mode = gpc.config.parallel.sequence_parallel
- try:
- gpc.config.parallel.sequence_parallel = False
- yield
- finally:
- gpc.config.parallel.sequence_parallel = prev_mode
-
-
-def evaluate_on_val_dls(
- trainer,
- val_dls,
- writer,
- logger,
- step_count,
- update_panel: bool = False,
- streaming: bool = False,
-):
- with switch_sequence_parallel_mode():
- torch.cuda.empty_cache()
- trainer.eval()
- verbose = gpc.is_rank_for_log()
- data_cfg = gpc.config.data
-
- for val_name, val_dl in val_dls.items():
- if not streaming and len(val_dl) == 0 and verbose:
- logger.info(f"Validation dataset: {val_name} is empty")
- continue
-
- val_metric = AccPerplex(
- device=torch.cuda.current_device(),
- tp_pg=gpc.get_group(ParallelMode.TENSOR),
- dp_pg=gpc.get_group(ParallelMode.DATA),
- )
- val_sche_metric_hook = SchedulerMetricHook(metric=val_metric)
-
- val_loss = 0
- val_idx = -1
- for val_idx, batch in tqdm(
- enumerate(val_dl),
- desc="Val.",
- total=len(val_dl) if not streaming else None,
- position=1,
- disable=not verbose,
- leave=False,
- ):
- with torch.inference_mode():
- if gpc.is_using_pp():
- total_val_bsz = len(batch[1])
- assert total_val_bsz % data_cfg.micro_bsz == 0
- num_microbatches = total_val_bsz // data_cfg.micro_bsz
- tensor_shape = torch.Size(
- [data_cfg.micro_bsz, batch[0]["input_ids"].shape[1], gpc.config.HIDDEN_SIZE]
- )
-
- with switch_evaluation_pipeline_scheduler(
- trainer=trainer,
- num_microbatches=num_microbatches,
- tensor_shape=tensor_shape,
- metric_hook_list=[val_sche_metric_hook],
- ):
- _, _, loss = trainer.execute_schedule(
- batch, forward_only=True, return_loss=True, return_output_label=False
- )
- else:
- total_val_bsz = len(batch[1])
- assert total_val_bsz % data_cfg.micro_bsz == 0
- grad_accum_size = total_val_bsz // data_cfg.micro_bsz
- grad_accum_batch_size = data_cfg.micro_bsz
- with switch_evaluation_no_pipeline_scheduler(
- trainer=trainer,
- grad_accum_size=grad_accum_size,
- grad_accum_batch_size=grad_accum_batch_size,
- metric_hook_list=[val_sche_metric_hook],
- ):
- _, _, loss = trainer.execute_schedule(
- batch, forward_only=True, return_loss=True, return_output_label=False
- )
- if verbose:
- val_loss += loss.item()
-
- assert val_idx != -1
- dist.barrier()
-
- val_res = val_metric.get_metric()
- if verbose and (streaming or len(val_dl) != 0):
- val_loss = val_loss / (val_idx + 1 + 1e-6)
- infos = {
- "step": step_count,
- f"val/{val_name}_loss": val_loss,
- f"val/{val_name}_acc": val_res["acc"],
- f"val/{val_name}_plex": val_res["perplexity"],
- }
-
- for key, value in infos.items():
- writer.add_scalar(key=key, value=value, step=step_count)
-
- if update_panel:
- logger.info(
- f"Validation on {val_name}: " + " ".join([f"{key}={value}" for key, value in infos.items()]),
- extra={
- "step": step_count,
- "val_loss": val_loss,
- "val_acc": val_res["acc"],
- "val_perplexity": val_res["perplexity"],
- },
- )
- else:
- logger.info(
- f"Validation on {val_name}: " + " ".join([f"{key}={value}" for key, value in infos.items()])
- )
-
- trainer.train()
- torch.cuda.empty_cache()
- dist.barrier()
diff --git a/internlm/utils/gputest.py b/internlm/utils/gputest.py
deleted file mode 100644
index ddb4932..0000000
--- a/internlm/utils/gputest.py
+++ /dev/null
@@ -1,256 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import math
-import socket
-
-import torch
-import torch.distributed as dist
-from flash_attn.modules.mha import FlashSelfAttention, SelfAttention
-from torch.utils import benchmark
-
-from internlm.monitor import send_alert_message
-from internlm.utils.logger import get_logger
-from internlm.utils.megatron_timers import megatron_timer as timer
-
-try:
- import GPUtil
- import psutil
-except ImportError:
- GPUtil, psutil = None, None
-
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
-from internlm.utils.common import get_current_device
-
-logger = get_logger(__file__)
-
-
-def empty_cache_and_diag(batch_count, interval=50):
- """empty cuda cache and run diag bench or tests."""
- if interval <= 0:
- interval = 50
- if batch_count % int(interval) == 0:
- # there is no need to do diag on the first batch
- if batch_count > 0:
- if gpc.is_rank_for_log():
- logger.info("Empty Cache and Diagnosis GPU/NCCL/Timer ...")
- with torch.no_grad():
- timer_diagnosis()
- bench_gpu()
- bench_net()
- # do empty_cache after the bench
- torch.cuda.empty_cache()
-
-
-def benchmark_forward(
- test_fn,
- *inputs,
- repeats=100,
- amp=True,
- amp_dtype=torch.float16,
- **kwinputs,
-):
- """Use Pytorch Benchmark on the forward pass of an arbitrary function."""
-
- def amp_wrapper(*inputs, **kwinputs):
- with torch.autocast(device_type="cuda", dtype=amp_dtype, enabled=amp):
- test_fn(*inputs, **kwinputs)
-
- bench_timer = benchmark.Timer(
- stmt="test_fn_amp(*inputs, **kwinputs)",
- globals={"test_fn_amp": amp_wrapper, "inputs": inputs, "kwinputs": kwinputs},
- num_threads=torch.get_num_threads(),
- )
- used_time = bench_timer.timeit(repeats)
- return used_time.mean
-
-
-def flops(batch, seqlen, headdim, nheads, time_f):
- """Compute the flops value of a GPU with give flashattention function"""
-
- flop = 4 * batch * seqlen**2 * nheads * headdim
- return (flop / time_f / 10**12) if not math.isnan(time_f) else 0.0
-
-
-def get_gpu_temperature():
- """Get current GPU temperature."""
- try:
- gpu_id = torch.cuda.current_device()
- except AssertionError:
- gpu_id = -1
-
- if GPUtil is not None and gpu_id >= 0:
- gpus = GPUtil.getGPUs()
- gpu_temperature = gpus[gpu_id].temperature
- else:
- gpu_temperature = -1
-
- return gpu_temperature
-
-
-def get_cpu_temperature():
- """Get current CPU temperature."""
-
- if psutil is not None:
- cpu_temperature = psutil.sensors_temperatures()["coretemp"][0].current
- else:
- cpu_temperature = -1
-
- return cpu_temperature
-
-
-def timer_diagnosis():
- """Diagnosis running time"""
-
- if len(timer.names) == 0 or len(timer.times) == 0:
- return
-
- world_size = gpc.get_world_size(ParallelMode.DATA)
- if world_size < 2:
- return
-
- # if gpc.is_rank_for_log():
- # logger.info("Diagnosis running timers ...")
-
- # detect slow rank compared to other ranks in the same DP group
- running_time = torch.Tensor(timer.times).to(device=get_current_device())
- avg_time = running_time.detach().clone()
- if world_size <= 4:
- dist.all_reduce(avg_time, op=torch.distributed.ReduceOp.AVG, group=gpc.get_group(ParallelMode.DATA))
- else:
- running_time_max = avg_time.detach().clone()
- running_time_min = avg_time.detach().clone()
- dist.all_reduce(running_time_max, op=torch.distributed.ReduceOp.MAX, group=gpc.get_group(ParallelMode.DATA))
- dist.all_reduce(running_time_min, op=torch.distributed.ReduceOp.MIN, group=gpc.get_group(ParallelMode.DATA))
- dist.all_reduce(avg_time, op=torch.distributed.ReduceOp.SUM, group=gpc.get_group(ParallelMode.DATA))
- avg_time = (avg_time - running_time_max - running_time_min) / (world_size - 2)
-
- diag_result = running_time > avg_time * gpc.config.data.diag_outlier_ratio
- diag_result = diag_result.tolist()
- avg_time = avg_time.tolist()
-
- for slow, name, time, avg in zip(diag_result, timer.names, timer.times, avg_time):
- if slow is False or avg < 0.5:
- continue
- msg = (
- f"Rank {gpc.get_local_rank(ParallelMode.GLOBAL)} is slower than avg on {name}, "
- f"Hostname {socket.gethostname()}, "
- f"its time {time:.2f}, avg {avg:.2f}, "
- f"CPU temp {get_cpu_temperature()}, GPU temp { get_gpu_temperature()}"
- )
- logger.warning(msg)
- send_alert_message(
- address=gpc.config.monitor.alert.feishu_alert_address,
- message=msg,
- )
-
- # detect slow rank compared to historical timer data
- for name, time in zip(timer.names, timer.times):
- if name not in timer.hist or len(timer.hist[name]) < 5:
- continue
- hist_avg = sum(timer.hist[name]) / len(timer.hist[name])
- if time > hist_avg * gpc.config.data.diag_outlier_ratio and time > 0.5:
- msg = (
- f"Rank {gpc.get_local_rank(ParallelMode.GLOBAL)} is slower than hist avg on {name}, "
- f"Hostname {socket.gethostname()}, "
- f"its time {time:.2f}, hist_avg {hist_avg:.2f}, "
- f"CPU temp {get_cpu_temperature()}, GPU temp { get_gpu_temperature()}"
- )
- logger.warning(msg)
- send_alert_message(
- address=gpc.config.monitor.alert.feishu_alert_address,
- message=msg,
- )
-
-
-def bench_net():
- """Benchmark nccl performance for slow node detection."""
-
- if gpc.get_world_size(ParallelMode.GLOBAL) <= 1:
- return
-
- # if gpc.is_rank_for_log():
- # logger.info("benchmarking network speed ...")
-
- repeats = 100
- input_data = torch.randn(
- 8 * 1024 * 1024,
- device=get_current_device(),
- dtype=torch.bfloat16,
- )
-
- def allreduce_fn(inputs):
- dist.all_reduce(inputs, op=torch.distributed.ReduceOp.AVG, group=gpc.get_group(ParallelMode.NETTEST))
-
- bench_timer = benchmark.Timer(
- stmt="test_fn_amp(inputs)",
- globals={"test_fn_amp": allreduce_fn, "inputs": input_data},
- num_threads=torch.get_num_threads(),
- )
- allreduce_time = bench_timer.timeit(repeats).mean
- allreduce_time = allreduce_time * 10**3
- allreduce_time_this = allreduce_time
- allreduce_time = torch.Tensor([allreduce_time]).to(device=get_current_device())
- dist.all_reduce(allreduce_time, group=gpc.get_group(ParallelMode.GLOBAL))
- allreduce_time_avg = allreduce_time / gpc.get_world_size(ParallelMode.GLOBAL)
- allreduce_time_avg = float(allreduce_time_avg.item())
-
- if allreduce_time_this >= allreduce_time_avg * gpc.config.data.diag_outlier_ratio:
- msg = (
- f"Rank {gpc.get_local_rank(ParallelMode.GLOBAL)} NCCL test is slower than avg, "
- f"Hostname {socket.gethostname()}, "
- f"allreduce_time {allreduce_time_this:.2f}, avg {allreduce_time_avg:.2f}, "
- f"CPU temp {get_cpu_temperature()}, GPU temp { get_gpu_temperature()}"
- )
- logger.warning(msg)
- send_alert_message(
- address=gpc.config.monitor.alert.feishu_alert_address,
- message=msg,
- )
-
-
-def bench_gpu(use_flash_attn=True):
- """Benchmark single GPU performance for slow node detection."""
-
- # if gpc.is_rank_for_log():
- # logger.info("benchmarking gpu speed ...")
-
- headdim = 64
- dim = 2048
- batch_size, seqlen = 2, 1024
- nheads = dim // headdim
-
- inner_attn = FlashSelfAttention if use_flash_attn else SelfAttention
- inner_attn = inner_attn(causal=True, softmax_scale=None, attention_dropout=0)
-
- qkv = torch.randn(
- batch_size,
- seqlen,
- 3,
- dim // headdim,
- headdim,
- device=get_current_device(),
- dtype=torch.float16,
- requires_grad=True,
- )
- time_f = benchmark_forward(inner_attn, qkv)
- speed = flops(batch_size, seqlen, headdim, nheads, time_f)
- speed_this = speed
- speed = torch.Tensor([speed]).to(device=get_current_device())
- dist.all_reduce(speed, group=gpc.get_group(ParallelMode.GLOBAL))
- speed_avg = speed / gpc.get_world_size(ParallelMode.GLOBAL)
- speed_avg = float(speed_avg.item())
-
- if speed_this <= speed_avg / gpc.config.data.diag_outlier_ratio:
- msg = (
- f"Rank {gpc.get_local_rank(ParallelMode.GLOBAL)} GPU is slower than avg, "
- f"Hostname {socket.gethostname()}, "
- f"tflops {speed_this:.2f}, avg {speed_avg:.2f}, "
- f"CPU temp {get_cpu_temperature()}, GPU temp { get_gpu_temperature()}"
- )
- logger.warning(msg)
- send_alert_message(
- address=gpc.config.monitor.alert.feishu_alert_address,
- message=msg,
- )
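
The GPU and NCCL diagnostics above follow the same torch.utils.benchmark pattern; a minimal standalone sketch:

import torch
from torch.utils import benchmark

x = torch.randn(1024, 1024)

bench_timer = benchmark.Timer(
    stmt="x @ x",
    globals={"x": x},
    num_threads=torch.get_num_threads(),
)
mean_seconds = bench_timer.timeit(100).mean  # average time per run
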
diff --git a/internlm/utils/logger.py b/internlm/utils/logger.py
deleted file mode 100644
index 6111553..0000000
--- a/internlm/utils/logger.py
+++ /dev/null
@@ -1,98 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import logging
-import os
-
-LOGGER_NAME = "internlm"
-LOGGER_FORMAT = "%(asctime)s\t%(levelname)s %(filename)s:%(lineno)s in %(funcName)s -- %(message)s"
-LOGGER_LEVEL = "info"
-LOGGER_LEVEL_CHOICES = ["debug", "info", "warning", "error", "critical"]
-LOGGER_LEVEL_HELP = (
- "The logging level threshold, choices=['debug', 'info', 'warning', 'error', 'critical'], default='info'"
-)
-
-uniscale_logger = None
-
-
-def get_logger(logger_name: str = LOGGER_NAME, logging_level: str = LOGGER_LEVEL) -> logging.Logger:
- """Configure the logger that is used for uniscale framework.
-
- Args:
- logger_name (str): used to create or get the corresponding logger in
- getLogger call. It will be "internlm" by default.
- logging_level (str, optional): Logging level in string or logging enum.
-
- Returns:
- logger (logging.Logger): the created or modified logger.
-
- """
-
- if uniscale_logger is not None:
- return uniscale_logger
-
- logger = logging.getLogger(logger_name)
-
- if logging_level not in LOGGER_LEVEL_CHOICES:
- logging_level = LOGGER_LEVEL
- print(LOGGER_LEVEL_HELP)
-
- logging_level = logging.getLevelName(logging_level.upper())
-
- handler = logging.StreamHandler()
- handler.setLevel(logging_level)
- logger.setLevel(logging_level)
- handler.setFormatter(logging.Formatter(LOGGER_FORMAT))
- logger.addHandler(handler)
-
- return logger
-
-
-def initialize_uniscale_logger(
- job_name: str = None,
- launch_time: str = None,
- file_name: str = None,
- name: str = LOGGER_NAME,
- level: str = LOGGER_LEVEL,
- file_path: str = None,
- is_std: bool = True,
-):
- """
- Initialize uniscale logger.
-
- Args:
- job_name (str): The name of training job, defaults to None.
- launch_time (str): The launch time of training job, defaults to None.
- file_name (str): The log file name, defaults to None.
- name (str): The logger name, defaults to "internlm".
- level (str): The log level, defaults to "info".
- file_path (str): The log file path, defaults to None.
- is_std (bool): Whether to output to console, defaults to True.
-
- Returns:
- Uniscale logger instance.
- """
-
- try:
- from uniscale_monitoring import get_logger as get_uniscale_logger
- except ImportError:
- print("Failed to import module uniscale_monitoring. Use default python logger.")
- return None
-
- if not file_path:
- assert (
- job_name and launch_time and file_name
- ), "If file_path is None, job_name, launch_time and file_name must be setted."
- log_file_name = file_name
- log_folder = os.path.join("RUN", job_name, launch_time, "logs")
- log_dir = os.path.join(log_folder, log_file_name)
- file_path = log_dir
-
- logger = get_uniscale_logger(name=name, level=level, filename=file_path, is_std=is_std)
- if isinstance(logger, (list, tuple)):
- logger = list(logger)[0]
-
- global uniscale_logger
- uniscale_logger = logger
-
- return logger
diff --git a/internlm/utils/megatron_timers.py b/internlm/utils/megatron_timers.py
deleted file mode 100644
index d5d89e5..0000000
--- a/internlm/utils/megatron_timers.py
+++ /dev/null
@@ -1,133 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import time
-
-import torch
-
-
-class _Timer:
- """Timer."""
-
- def __init__(self, name):
- self.name_ = name
- self.elapsed_ = 0.0
- self.started_ = False
- self.start_time = time.time()
- self.stream = torch.cuda.current_stream()
-
- def start(self, reset_all=True):
- """Start the timer."""
- # need to reset all timers in a new batch
- if self.name_ == "one-batch" and reset_all is True:
- megatron_timer.reset()
-
- assert not self.started_, "timer has already been started"
- self.stream.synchronize()
- self.start_time = time.time()
- self.started_ = True
-
- def stop(self):
- """Stop the timer."""
- assert self.started_, "timer is not started"
- self.stream.synchronize()
- self.elapsed_ += time.time() - self.start_time
- self.started_ = False
-
- def reset(self):
- """Reset timer."""
- self.elapsed_ = 0.0
- self.started_ = False
-
- def elapsed(self, reset=True):
- """Calculate the elapsed time."""
- started_ = self.started_
- # If timing is in progress, stop it first.
- if self.started_:
- self.stop()
- # Get the elapsed time.
- elapsed_ = self.elapsed_
- # Reset the elapsed time
- if reset:
- self.reset()
- # If timing was in progress, set it back.
- if started_:
- self.start(reset_all=False)
- return elapsed_
-
-
-class Timers:
- """Group of timers."""
-
- def __init__(self):
- self.timers = {}
- self.hist = {}
- self.names = []
- self.times = []
-
- def __call__(self, name):
- if name not in self.timers:
- self.timers[name] = _Timer(name)
- return self.timers[name]
-
- def store_last_timers(self):
- """Store timers to two list"""
- self.names = []
- self.times = []
- for key, value in self.timers.items():
- senconds = round(float(value.elapsed(reset=False)), 4)
- self.names.append(key)
- self.times.append(senconds)
- if key not in self.hist:
- self.hist[key] = []
- self.hist[key].append(senconds)
- if len(self.hist[key]) > 10:
- self.hist[key].pop(0)
-
- def write(self, names, writer, iteration, normalizer=1.0, reset=False):
- """Write timers to a tensorboard writer"""
- # Currently, when using add_scalars,
- # torch.utils.tensorboard's add_scalars makes each timer its own run, which
- # pollutes the runs list, so we just add each as a scalar
- assert normalizer > 0.0
- for name in names:
- if name in self.timers:
- value = self.timers[name].elapsed(reset=reset) / normalizer
- writer.add_scalar(f"time/{name}-time", value, iteration)
-
- def log(self, names, logger, normalizer=1.0, reset=True):
- """Log a group of timers."""
- assert normalizer > 0.0
- string = ""
- for name in names:
- if name in self.timers:
- elapsed_time = self.timers[name].elapsed(reset=reset) * 1000.0 / normalizer
- string += " | {}: {:.2f}".format(name, elapsed_time)
- if not len(string): # pylint: disable=C1802
- return
- string = "time (ms)" + string
-
- logger.info(string)
- return string
-
- def debug(self, names, logger, normalizer=1.0, reset=True):
- """Log a group of timers."""
- assert normalizer > 0.0
- string = ""
- for name in names:
- if name in self.timers:
- elapsed_time = self.timers[name].elapsed(reset=reset) * 1000.0 / normalizer
- string += " | {}: {:.2f}".format(name, elapsed_time)
- if not len(string): # pylint: disable=C1802
- return
- string = "time (ms)" + string
-
- logger.debug(string)
- return string
-
- def reset(self):
- for _, t in self.timers.items():
- t.reset()
-
-
-megatron_timer = Timers()
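
Usage sketch for the timer group above (assumes a CUDA device is available, since each _Timer synchronizes its CUDA stream):

from internlm.utils.megatron_timers import megatron_timer as timer

timer("fwd-bwd").start()
# ... forward/backward work ...
timer("fwd-bwd").stop()
fwd_bwd_seconds = timer("fwd-bwd").elapsed(reset=True)
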
diff --git a/internlm/utils/model_checkpoint.py b/internlm/utils/model_checkpoint.py
deleted file mode 100644
index dad2fc6..0000000
--- a/internlm/utils/model_checkpoint.py
+++ /dev/null
@@ -1,825 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import copy
-import inspect
-import os
-import socket
-import time
-from enum import Enum
-from typing import Callable, Dict, Union
-
-import torch
-
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
-from internlm.core.trainer import TrainState
-from internlm.initialize.launch import get_config_value
-from internlm.initialize.legacy.launch import (
- auto_resume_sanity_check,
- ckpt_info_sanity_check,
-)
-from internlm.monitor import send_alert_message
-from internlm.solver.optimizer import HybridZeroOptimizer, reload_zero_fp32_buff
-from internlm.utils.common import get_current_device
-from internlm.utils.logger import get_logger
-from internlm.utils.megatron_timers import megatron_timer as timer
-from internlm.utils.storage_manager import (
- get_fns,
- get_storage_manager,
- init_storage_manager,
- llm_load,
- llm_save,
- try_get_storage_backend,
-)
-from internlm.utils.timeout import llm_timeout
-
-logger = get_logger(__file__)
-
-
-class CheckpointSaveType(Enum):
- NORMAL_CHECKPOINT = 1
- SNAPSHOT_CHECKPOINT = 2
-
-
-class CheckpointLoadType(Enum):
- INTERNLM = "internlm"
-
-
-# The load method implemented by internlm by default does not use string representation types,
-# but uses enumeration types defined in advance.
-LOAD_TYPE_DICT = {
- "internlm": CheckpointLoadType.INTERNLM,
-}
-
-
-class CheckpointLoadContent:
- MODEL = "model"
- SAMPLER = "sampler"
- OPIMIZER = "optimizer"
- SCHEDULAER = "scheduler"
-
-
-class CheckpointLoadMethod:
- """The registration class of the checkpoint loading method,
- users can define their own custom ckpt loading methods."""
-
- LOAD_FUNC_SIG = None
- LOAD_TYPE_FUNC = {}
-
- @staticmethod
- def convet_load_type(load_type: str) -> Union[CheckpointLoadType, str]:
- if load_type.lower() in LOAD_TYPE_DICT:
- # The ckpt load method implemented by internlm by default.
- return LOAD_TYPE_DICT[load_type.lower()]
- else:
- # If it is a user-defined field, we do not do any conversion and represent it as a string.
- return load_type
-
- @staticmethod
- def register_ckpt_load_type(load_type: Union[str, CheckpointLoadType], load_func: Callable):
- if load_type in CheckpointLoadMethod.LOAD_TYPE_FUNC:
- logger.warning(f"{load_type} has aleady been registed!")
- return
-
- CheckpointLoadMethod.LOAD_TYPE_FUNC.update({load_type: load_func})
-
- if load_type == CheckpointLoadType.INTERNLM:
- CheckpointLoadMethod.LOAD_FUNC_SIG = inspect.signature(load_func)
- else:
- if inspect.signature(load_func) != CheckpointLoadMethod.LOAD_FUNC_SIG:
- logger.warning(
- f"registe load model ckpt signature is not same with: {CheckpointLoadMethod.LOAD_FUNC_SIG}"
- )
-
- @staticmethod
- def get_ckpt_load_type_func(load_type: Union[str, CheckpointLoadType]):
- return CheckpointLoadMethod.LOAD_TYPE_FUNC[load_type]
-
-
-class CheckpointLoadMask:
- """
- According to the content field in the incoming ckpt_info, decide which components to load.
- """
-
- LOAD_CONTENT_DICT = {
- "model": CheckpointLoadContent.MODEL,
- "sampler": CheckpointLoadContent.SAMPLER,
- "optimizer": CheckpointLoadContent.OPIMIZER,
- "scheduler": CheckpointLoadContent.SCHEDULAER,
- }
-
- def __init__(self, content: tuple) -> None:
- self.load_set = set(map(lambda x: x.lower(), content))
- if "all" in self.load_set:
- self.load_set = set(CheckpointLoadMask.LOAD_CONTENT_DICT.values())
- else:
- self.load_set = set(map(lambda x: CheckpointLoadMask.LOAD_CONTENT_DICT[x.lower()], content))
-
- def need_load(self, content: CheckpointLoadContent):
- return content in self.load_set
-
- def not_only_load(self, content: CheckpointLoadContent):
- return content in self.load_set and len(self.load_set) > 1
-
- def only_load(self, content: CheckpointLoadContent):
- return set((content,)) == self.load_set
-
- def __str__(self) -> str:
- return f"{self.load_set}."
-
- def __repr__(self) -> str:
- return f"{self.load_set}."
-
-
-def get_model_topology(model):
- """
- Returns:
- {
- '{name}': {'dim': int}
- }
- where name is the name of the module, and all parameters under this module are
- concatenated along the dimension 'dim'.
- """
-
- from flash_attn.modules.embedding import VocabParallelEmbedding
-
- topos = {}
- for name, module in model.named_modules():
- # If it does not meet these conditions, it is shared between various tp/dp, and it is necessary to assert
- # that they are consistent.
- if isinstance(module, VocabParallelEmbedding):
- topos[name] = {"dim": 0}
- return topos
-
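-# Illustrative output of get_model_topology for a model with a single VocabParallelEmbedding
-# module (the module name is hypothetical): {"embedding": {"dim": 0}}. Only such modules are
-# recorded; their tp shards are concatenated along dim 0 when weights are merged.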
-
-def try_load_internlm_ckpt(ckpt_mm, load_info, train_state: TrainState):
- load_content_str = ""
- load_ckpt_folder = load_info["path"]
- load_content: CheckpointLoadMask = load_info["content"]
-
- if gpc.is_rank_for_log():
- logger.info(f"Try load_ckpt_folder: {load_ckpt_folder}")
-
- if load_content.need_load(CheckpointLoadContent.MODEL):
- load_model_checkpoint(folder=load_ckpt_folder, model=ckpt_mm.model)
- load_content_str += f"{CheckpointLoadContent.MODEL}, "
-
- if load_content.not_only_load(CheckpointLoadContent.MODEL):
- # load training states.
- load_context(load_ckpt_folder, train_state)
-
- # load optimizer states.
- if load_content.need_load(CheckpointLoadContent.OPIMIZER):
- load_optimizer_checkpoint(load_ckpt_folder, ckpt_mm.optimizer)
- load_content_str += f"{CheckpointLoadContent.OPIMIZER}, "
- else:
- if gpc.is_rank_for_log():
- logger.warning("CheckpointManager has no 'optimizer', skip reload optim checkpoint!")
-
- # load lr scheduler states.
- if load_content.need_load(CheckpointLoadContent.SCHEDULAER):
- if ckpt_mm.lr_scheduler:
- load_scheduler(load_ckpt_folder, ckpt_mm.lr_scheduler, ckpt_mm.optimizer, train_state)
- load_content_str += f"{CheckpointLoadContent.SCHEDULAER}, "
- else:
- if gpc.is_rank_for_log():
- logger.warning("CheckpointManager has no 'lr_scheduler', skip reload lr_scheduler checkpoint!")
-
- # load dataloader sampler states.
- if load_content.need_load(CheckpointLoadContent.SAMPLER):
- if hasattr(train_state, "batch_sampler") and not isinstance(
- train_state.batch_sampler, torch.utils.data.sampler.BatchSampler
- ):
- load_sampler(load_ckpt_folder, ckpt_mm.train_dl.batch_sampler)
- # track the actual updates of sampler when using weighted sampling
- train_state.init_batch_sampler(ckpt_mm.train_dl.batch_sampler)
- load_content_str += f"{CheckpointLoadContent.SAMPLER}, "
- else:
- if gpc.is_rank_for_log():
- logger.warning("CheckpointManager skip reload 'batch_sampler'")
-
- # reload data state dict.
- if hasattr(train_state, "data_state_dict"):
- ckpt_mm.train_dl.dataset.load_state_dict(
- llm_load(os.path.join(load_ckpt_folder, "sampler_0.pt")), ckpt_path=load_ckpt_folder
- )
- load_content_str += f"{CheckpointLoadContent.SAMPLER}, "
- else:
- if gpc.is_rank_for_log():
- logger.warning(
- "CheckpointManager has no 'data_state_dict', skip reload data_state_dict checkpoint!"
- )
- return load_content_str
-
-
-def save_model_checkpoint(folder, model):
- """
- Save the model according to the tp/dp layout. Each tensor-parallel shard is saved separately without
- being gathered, i.e. the weights are kept sharded on disk. The saved weight is named:
- - folder
- - model_tp{tp_rank}_pp{pp_rank}.pt
-
- If the tp is inconsistent with the saved one in the future use, the weight needs to be converted before loading.
-
- Args:
- folder: The folder to save the model
- model: The model to be saved
- """
-
- states = model.state_dict()
- topo = get_model_topology(model)
-
- if folder is not None:
- dp_size = gpc.get_world_size(ParallelMode.DATA)
- tp_size = gpc.get_world_size(ParallelMode.TENSOR)
- dp_rank = gpc.get_local_rank(ParallelMode.DATA)
- tp_rank = gpc.get_local_rank(ParallelMode.TENSOR)
- pp_rank = gpc.get_local_rank(ParallelMode.PIPELINE)
-
- # TODO: In theory, the pp dimension should also be considered, but since pp ranks generally span
- # machines, the same file will not be written twice on one machine even if pp is ignored.
- should_save_rank_pair = set() # (tp_rank, dp_rank)
- for i in range(tp_size):
- should_save_rank_pair.add((i, i % dp_size))
-
- if (tp_rank, dp_rank) in should_save_rank_pair:
- fn = f"model_tp{tp_rank}_pp{pp_rank}.pt"
- fp = os.path.join(folder, fn)
- llm_save(fp, saved_obj=states)
- topo_fn = f"topo_tp{tp_rank}_pp{pp_rank}.json"
- topo_fp = os.path.join(folder, topo_fn)
- llm_save(topo_fp, saved_obj=topo)
-
- torch.distributed.barrier()
-
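-# For illustration: with tensor parallel size 2 and pipeline parallel size 2, save_model_checkpoint
-# above writes one shard (plus a topology file) per (tp, pp) rank pair into the target folder, e.g.:
-#
-#   folder/
-#     model_tp0_pp0.pt   topo_tp0_pp0.json
-#     model_tp1_pp0.pt   topo_tp1_pp0.json
-#     model_tp0_pp1.pt   topo_tp0_pp1.json
-#     model_tp1_pp1.pt   topo_tp1_pp1.json
-#
-# load_model_checkpoint below expects exactly this naming scheme and the same tp/pp sizes.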
-
-def load_model_checkpoint(folder, model):
- """
- There should be weights with names similar to the following under the folder.
- - folder
- - model_tp{tp_rank}_pp{pp_rank}.pt
-
- If the tp is inconsistent with the saved one in the future use, the weight needs to be converted before loading.
- """
-
- tp_size = gpc.get_world_size(ParallelMode.TENSOR)
- pp_size = gpc.get_world_size(ParallelMode.PIPELINE)
- tp_rank = gpc.get_local_rank(ParallelMode.TENSOR)
- pp_rank = gpc.get_local_rank(ParallelMode.PIPELINE)
-
- fns = get_fns(folder)
- max_pp, max_tp = 0, 0
- for fn in fns:
- if fn.startswith("model_t") and not fn.endswith(".md5"):
- segements = os.path.splitext(fn)[0].split("_")
- max_pp = max(max_pp, int(segements[-1][2:]))
- max_tp = max(max_tp, int(segements[-2][2:]))
-
- assert (
- pp_size == max_pp + 1
- ), f"The weights are save for {max_pp+1} pipelines, while current has {pp_size} pipelines"
- assert (
- tp_size == max_tp + 1
- ), f"The weights are save for {max_tp+1} parallelism, while current has {tp_size} tensor parallelism"
-
- should_load_name = f"model_tp{tp_rank}_pp{pp_rank}.pt"
- fp = os.path.join(folder, should_load_name)
- states = llm_load(fp, map_location=get_current_device())
-
- missing_k, unexpected_keys = model.load_state_dict(states, strict=False)
- if len(missing_k) != 0:
- logger.warning(f"Warning: missing keys {missing_k}")
- if len(unexpected_keys) != 0:
- logger.warning(f"Warning: unexpected keys {unexpected_keys}")
-
- # avoid CUDA OOM, Ref: https://discuss.pytorch.org/t/load-state-dict-causes-memory-leak/36189/11
- del states
- torch.cuda.empty_cache()
-
-
-def save_optimizer_checkpoint(optim, state_path):
- """Store the state of the optimizer to the local file system or remote OSS.
-
- Args:
- optim (Optimizer)
- state_path (str): The path where the optimizer state will be saved.
- """
-
- # TODO sanity check for optimizer type
- zero_rank = gpc.get_local_rank(ParallelMode.ZERO1)
- tp_rank = gpc.get_local_rank(ParallelMode.TENSOR)
- pp_rank = gpc.get_local_rank(ParallelMode.PIPELINE)
- tp_size = gpc.get_world_size(ParallelMode.TENSOR)
- pp_size = gpc.get_world_size(ParallelMode.PIPELINE)
- fp = f"optimizer_tp{tp_rank}_pp{pp_rank}_zo{zero_rank}.pt"
-
- states = optim.state_dict()
- if isinstance(optim, HybridZeroOptimizer):
- if gpc.get_global_rank() < optim.zero_world_size * tp_size * pp_size:
- llm_save(os.path.join(state_path, fp), states)
- if "zero_devide_optim_plan" in states:
- params_per_rank_id_dict = states.pop("zero_devide_optim_plan")
- fp_meta = os.path.join(state_path, optim.rank_unique_id)
- llm_save(fp_meta, params_per_rank_id_dict)
- else:
- llm_save(os.path.join(state_path, fp), states)
-
-
-def load_optimizer_checkpoint(folder, optim):
- """Load the optimizer state from the local file system or remote
- object storage Service (OSS).
-
- Args:
- optim (Optimizer): optimizer
- folder (str): The FS/OSS path from which the optimizer state will be loaded.
- """
-
- fns = get_fns(folder)
- max_tp, max_pp, max_zero = 0, 0, 0
- for fn in fns:
- if fn.startswith("optimizer_") and not fn.endswith(".md5"):
- _, tp, pp, zero = os.path.splitext(fn)[0].split("_")
- max_zero = max(max_zero, int(zero[2:]))
- max_tp = max(max_tp, int(tp[2:]))
- max_pp = max(max_pp, int(pp[2:]))
-
- zero_size = gpc.get_world_size(ParallelMode.ZERO1)
- zero_rank = gpc.get_local_rank(ParallelMode.ZERO1)
- tp_size = gpc.get_world_size(ParallelMode.TENSOR)
- pp_size = gpc.get_world_size(ParallelMode.PIPELINE)
-
- assert (
- zero_size == max_zero + 1
- ), f"The weights are save for {max_zero+1} data parallel, while current has {zero_size} zero broadcast range."
- assert (
- pp_size == max_pp + 1
- ), f"The weights are save for {max_pp+1} pipelines, while current has {pp_size} pipelines"
- assert (
- tp_size == max_tp + 1
- ), f"The weights are save for {max_tp+1} parallelism, while current has {tp_size} tensor parallelism"
-
- fp = f"optimizer_tp{gpc.get_local_rank(ParallelMode.TENSOR)}_"
- fp += f"pp{gpc.get_local_rank(ParallelMode.PIPELINE)}_"
- fp += f"zo{zero_rank}.pt"
- states = llm_load(os.path.join(folder, fp), map_location=get_current_device())
-
- if isinstance(optim, HybridZeroOptimizer):
- fp_meta = os.path.join(folder, optim.rank_unique_id)
- try:
- zero_devide_optim_plan = llm_load(fp_meta)
- states.update({"zero_devide_optim_plan": zero_devide_optim_plan})
- except Exception as e:
- logger.warning(
- f"Read zero optimzer split file '{fp_meta}', for '{e}'"
- f"Please check whether loading ckpts are saved with the HybridZeroOptimizer."
- )
-
- optim.load_state_dict(states)
- del states
- torch.cuda.empty_cache()
-
-
-def load_sampler(ckpt_path: str, sampler):
- sampler_states = llm_load(os.path.join(ckpt_path, "sampler.pt"))
- sampler.load_state_dict(sampler_states)
- if gpc.is_rank_for_log():
- pstate = copy.deepcopy(sampler_states)
- pstate.pop("indices")
- pstate.pop("rng_state")
- logger.info(f"reload sampler_states:{pstate}")
- torch.cuda.empty_cache()
-
-
-def load_context(ckpt_path: str, train_state: TrainState):
- context_stuffs = llm_load(os.path.join(ckpt_path, "context.pt"))
- train_state.load_state_dict(context_stuffs)
- if gpc.is_rank_for_log():
- logger.info(f"reload train_state:{train_state}")
- torch.cuda.empty_cache()
-
-
-def load_scheduler(ckpt_path: str, lr_scheduler, optimizer, train_state: TrainState):
- learning_rate = train_state.lr
- scheduler_states = llm_load(os.path.join(ckpt_path, "schedulder.pt"))
- if learning_rate != scheduler_states["base_lrs"][0] and gpc.is_rank_for_log():
- logger.warning(
- f"Using new learning rate {learning_rate} to replace old learn rate {scheduler_states['base_lrs'][0]}."
- )
-
- base_lrs = copy.deepcopy(scheduler_states["base_lrs"])
- scheduler_states["base_lrs"] = [learning_rate] * len(scheduler_states["base_lrs"])
- if "after_scheduler_dict" in scheduler_states:
- scheduler_states["after_scheduler_dict"]["base_lrs"] = [learning_rate] * len(
- scheduler_states["after_scheduler_dict"]["base_lrs"]
- )
-
- lr_scheduler.load_state_dict(scheduler_states)
- lr_scheduler.last_epoch = train_state.step_count + 1
-
- ratios = [learning_rate / lr for lr in base_lrs]
- for idx, param_group in enumerate(optimizer.param_groups):
- param_group["lr"] = param_group["lr"] * ratios[idx]
- torch.cuda.empty_cache()
-
- if gpc.is_rank_for_log():
- logger.info(f"reload load_scheduler:{lr_scheduler}")
-
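-# A worked example of the learning-rate rescaling in load_scheduler above (illustrative numbers):
-# if the checkpoint was saved with base_lrs = [1e-4] and the new config sets lr = 2e-4, then
-# ratios = [2e-4 / 1e-4] = [2.0], so each optimizer param_group's current lr is doubled while the
-# scheduler's base_lrs are reset to the new learning rate.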
-
-class CheckpointManager:
- """StorageManagerContext"""
-
- def __init__(
- self,
- ckpt_config,
- model,
- train_dl=None,
- optimizer=None,
- lr_scheduler=None,
- model_config=None,
- model_config_file=None,
- feishu_address=None,
- ) -> None:
- """
- CheckpointManager is used to decide when to store a ckpt. If asynchronous upload mode is
- enabled, you must call wait_async_upload_finish at the end of the program to wait for the
- asynchronous ckpt upload to complete.
-
- Args:
- ckpt_config (dict): model checkpoint config.
- model (nn.Module): model obj.
- train_dl (DataLoader): training dataloader obj.
- optimizer (object): optimizer obj.
- lr_scheduler (object): lr_scheduler obj.
- model_config (dict): model config.
- model_config_file (str): content of the training config file.
- feishu_address (str): feishu webhook address used for alert messages.
- """
- self.enable_save_ckpt = get_config_value(ckpt_config, "enable_save_ckpt", False)
- self.checkpoint_every = get_config_value(ckpt_config, "checkpoint_every", 100)
- self.save_ckpt_folder = get_config_value(ckpt_config, "save_ckpt_folder", None)
- self.oss_snapshot_freq: int = get_config_value(ckpt_config, "oss_snapshot_freq", 50)
- self.stop_file_path = get_config_value(ckpt_config, "stop_file_path", None)
- if self.save_ckpt_folder:
- self.snapshot_ckpt_folder = get_config_value(
- ckpt_config, "snapshot_ckpt_folder", os.path.join(self.save_ckpt_folder, "snapshot")
- )
- self.async_upload_tmp_folder = get_config_value(
- ckpt_config, "async_upload_tmp_folder", "/dev/shm/internlm_tmp_ckpt/"
- )
- else:
- self.snapshot_ckpt_folder = None
- self.async_upload_tmp_folder = None
-
- self.async_upload = get_config_value(ckpt_config, "async_upload", False)
-
- # initialize the storage manager
- init_storage_manager(self.enable_save_ckpt, self.async_upload_tmp_folder, self.async_upload)
-
- self.feishu_address = feishu_address
- self.storage_manager = get_storage_manager()
- self.snapshot_counter = 0
-
- self.model = model
- self.optimizer = optimizer
- self.lr_scheduler = lr_scheduler
- self.train_dl = train_dl
- self.model_config = model_config
- self.model_config_file = model_config_file
-
- # Register the default internlm ckpt load types.
- self.defalut_load_type_func = {CheckpointLoadType.INTERNLM: try_load_internlm_ckpt}
- for ckpt_load_type in CheckpointLoadType:
- CheckpointLoadMethod.register_ckpt_load_type(ckpt_load_type, self.defalut_load_type_func[ckpt_load_type])
-
- # Init the alert/stop file.
- if self.stop_file_path and gpc.get_global_rank() == 0:
- dir_path = os.path.dirname(self.stop_file_path)
- if dir_path != "" and not os.path.exists(dir_path):
- os.makedirs(dir_path)
- with open(self.stop_file_path, "w", encoding="utf-8") as f:
- f.write("0")
-
- self.load_ckpt_info = get_config_value(ckpt_config, "load_ckpt_info", None)
- if self.load_ckpt_info is None: # (legacy): try to stay compatible with old interfaces
- self.load_ckpt_info = ckpt_info_sanity_check(ckpt_config)
-
- # Auto-resume from the latest checkpoint; this overwrites the 'load_ckpt_info' setting.
- self.auto_resume = get_config_value(ckpt_config, "auto_resume", None)
- if self.auto_resume is None: # (legacy): try to stay compatible with old interfaces
- self.auto_resume = auto_resume_sanity_check(ckpt_config)
- if self.auto_resume:
- self.load_ckpt_info = self.query_lastest_ckpt()
-
- if self.stop_file_path is None and gpc.is_rank_for_log():
- logger.warning("no set stop_file_path, quit_signal_handler is disable")
-
- # convert to internal representation
- if self.load_ckpt_info:
- assert (
- "path" in self.load_ckpt_info
- and "content" in self.load_ckpt_info
- and "ckpt_type" in self.load_ckpt_info
- ), "please set content in ckpt setting, eg: ckpt = dict(path='', content=['model'], ckpt_type='internlm')"
-
- # replace load_ckpt
- self.load_ckpt_info["content"] = CheckpointLoadMask(self.load_ckpt_info["content"])
- self.load_ckpt_info["ckpt_type"] = CheckpointLoadMethod.convet_load_type(self.load_ckpt_info["ckpt_type"])
-
- # test storage setting is ok.
- if self.enable_save_ckpt:
- self.try_ping_storage()
-
- def quit_signal_handler(self, train_state) -> bool:
- """
- Exit signal detection function. If an exit step is written to the file at 'stop_file_path',
- all ranks will save a ckpt and/or exit.
- A negative integer step means: save a ckpt at that step.
- A positive integer step means: save a ckpt at that step and quit.
-
- Args:
- train_state (TrainState): current training state.
- Returns:
- Tuple[bool, bool, CheckpointSaveType]: whether to quit, whether to save a ckpt now, and the save type.
- """
- now_break, now_save_ckpt, save_type = False, False, CheckpointSaveType.NORMAL_CHECKPOINT
-
- if self.stop_file_path is None:
- return now_break, now_save_ckpt, save_type
-
- with torch.no_grad():
- action_step_t = torch.zeros((1,), dtype=torch.int64).cuda()
- if gpc.get_global_rank() == 0:
- with open(self.stop_file_path, "r+", encoding="utf-8") as f:
- f.seek(0)
- msg = f.read()
- action_step_t.fill_(int(msg))
-
- torch.distributed.broadcast(action_step_t, src=0)
- action_step = action_step_t.item()
- del action_step_t
-
- if action_step < 0 and abs(action_step) == train_state.step_count:
- now_save_ckpt = True
-
- if action_step > 0 and action_step == train_state.step_count:
- now_break, now_save_ckpt = True, True
-
- if action_step != 0 and gpc.is_rank_for_log():
- msg = "Stop" if action_step > 0 else "Save"
- action_step = abs(action_step)
- if train_state.step_count <= action_step:
- if self.feishu_address:
- send_alert_message(
- address=self.feishu_address,
- message=f"training will {msg} at step_count {action_step}!\
-now step_count is {train_state.step_count}",
- )
-
- return now_break, now_save_ckpt, save_type
-
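- # How the stop file interacts with quit_signal_handler / is_now_to_save_ckpt (a sketch; the
- # path comes from ckpt_config.stop_file_path):
- #
- #   echo "1000"  > /path/to/stop_file   # save a ckpt at step 1000 and quit
- #   echo "-1000" > /path/to/stop_file   # save a ckpt at step 1000 and keep training
- #   echo "0"     > /path/to/stop_file   # no-op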
- def is_now_to_save_ckpt(self, train_state) -> (bool, CheckpointSaveType, bool):
- save_ckpts, save_type, now_break = False, CheckpointSaveType.NORMAL_CHECKPOINT, False
- if self.oss_snapshot_freq > 1 and train_state.step_count % self.oss_snapshot_freq == 0:
- save_ckpts, save_type = True, CheckpointSaveType.SNAPSHOT_CHECKPOINT
- if train_state.step_count % self.checkpoint_every == 0:
- save_ckpts, save_type = True, CheckpointSaveType.NORMAL_CHECKPOINT
- now_break, signal_save_ckpts, signal_save_type = self.quit_signal_handler(train_state)
- if save_ckpts is False:
- save_ckpts = signal_save_ckpts
- save_type = signal_save_type
-
- return save_ckpts, save_type, now_break
-
- def try_save_checkpoint(self, train_state):
- if not self.enable_save_ckpt:
- return False
-
- save_ckpts, save_type, now_break = self.is_now_to_save_ckpt(train_state)
-
- if save_ckpts:
- # Wait for the previous round of asynchronous upload storage to complete.
- self.storage_manager.wait()
- if save_type == CheckpointSaveType.SNAPSHOT_CHECKPOINT:
- # Snapshot number, with only two snapshots written alternately.
- self.snapshot_counter = (self.snapshot_counter + 1) % 2
- save_ckpt_folder = os.path.join(self.snapshot_ckpt_folder, f"{self.snapshot_counter}")
- else:
- save_ckpt_folder = os.path.join(self.save_ckpt_folder, str(train_state.step_count))
-
- self.save_checkpoint(
- folder=save_ckpt_folder,
- model=self.model,
- optimizer=self.optimizer,
- scheduler=self.lr_scheduler,
- train_state=train_state,
- model_config=self.model_config,
- model_config_file=self.model_config_file,
- )
-
- return now_break
-
- def wait_async_upload_finish(self):
- """wait for all checkpoint uploads to be completed"""
- self.storage_manager.wait()
- torch.distributed.barrier()
-
- def query_latest_snapshot_step_boto3(self):
- """query_latest_snapshot_step_boto3
- Returns:
- Tuple(str, int): path of latest ckpt and ckpt step, if not found, None will return.
- """
- ckpt_list = self.storage_manager.get_fns(self.save_ckpt_folder)
- if ckpt_list is None or len(ckpt_list) == 0:
- return None, None
-
- max_normal_step = 0
- # The returned ckpt_list looks like: ['pings', 'snapshot', '4'].
- # Here we only look for ckpt folders named after a step, ignoring 'snapshot' and other folders.
- ckpt_list = [int(fn.strip("/")) for fn in ckpt_list if fn.strip("/").isdigit()]
- if len(ckpt_list) == 0:
- logger.warning("Not found avaliable normal checkpoint!")
- else:
- logger.info(f"Found avaliable normal checkpoint: {ckpt_list}!")
- ckpt_list.sort(reverse=True)
- for ckpt in ckpt_list:
- fns_list = self.storage_manager.get_fns(os.path.join(self.save_ckpt_folder, str(ckpt)))
- for fn in fns_list:
- if fn.endswith(".step"):
- max_normal_step = ckpt
- break
- if max_normal_step != 0:
- break
-
- max_normal_step = ckpt_list[0]
- load_normal_ckpt_path = os.path.join(self.save_ckpt_folder, str(max_normal_step))
-
- snapshot_path_0 = os.path.join(self.save_ckpt_folder, "snapshot", "0")
- snapshot_path_1 = os.path.join(self.save_ckpt_folder, "snapshot", "1")
- ckpt_list_0 = self.storage_manager.get_fns(snapshot_path_0)
- ckpt_list_1 = self.storage_manager.get_fns(snapshot_path_1)
-
- def found_latest_snapshot(_ckpt_list):
- _max_step_snapshot = 0
- if _ckpt_list:
- for ckpt in _ckpt_list:
- ckpt = ckpt.strip("/")
- if ckpt.endswith(".step"):
- _max_step_snapshot = max(_max_step_snapshot, int(ckpt.split(".")[0]))
- return _max_step_snapshot
-
- max_step_0 = found_latest_snapshot(ckpt_list_0)
- max_step_1 = found_latest_snapshot(ckpt_list_1)
-
- if sum([max_step_0, max_step_1, max_normal_step]) == 0:
- return None, None
- else:
- snap_load_path = snapshot_path_0 if max_step_0 > max_step_1 else snapshot_path_1
- snap_step = max(max_step_0, max_step_1)
- load_path = snap_load_path if snap_step > max_normal_step else load_normal_ckpt_path
- return load_path, max(snap_step, max_normal_step)
-
- def query_latest_snapshot_step_local(self):
- max_step, max_step_path = 0, None
- save_ckpt_folder = self.save_ckpt_folder.split(":")[1]
- for root, _, files in os.walk(save_ckpt_folder, followlinks=True):
- for fn in files:
- fn = fn.strip("/")
- if fn.endswith(".step"):
- # We assume that both internlm ckpt and snapshot ckpt will store the '.step' file
- # as an integrity flag.
- step = int(fn.rsplit(".", maxsplit=1)[0])
- if max_step < step:
- max_step = step
- max_step_path = root
-
- return max_step_path, max_step
-
- def query_lastest_ckpt(self):
- latest_ckpt, step = None, -1
- # When training is automatically restarted, force the latest checkpoint/snapshot to be read.
- if self.save_ckpt_folder:
- backend, _ = try_get_storage_backend(self.save_ckpt_folder)
- if backend == "boto3":
- latest_ckpt, step = self.query_latest_snapshot_step_boto3()
- if latest_ckpt and not latest_ckpt.startswith("boto3:"):
- latest_ckpt = ":".join(["boto3", latest_ckpt])
- elif backend == "local":
- latest_ckpt, step = self.query_latest_snapshot_step_local()
- if latest_ckpt and not latest_ckpt.startswith("local:"):
- latest_ckpt = ":".join(["local", latest_ckpt])
-
- if gpc.is_rank_for_log():
- logger.info(f"Found latest ckpt {latest_ckpt if latest_ckpt else 'None'}, step: {step}...")
-
- return dict(path=latest_ckpt, content=("all",), ckpt_type="internlm")
-
- def try_resume_training(self, train_state: TrainState, current_time=""):
- if self.load_ckpt_info is None or self.load_ckpt_info["path"] is None:
- if gpc.is_rank_for_log():
- logger.info(
- f"===========New Run {current_time} on host:{socket.gethostname()},rank={gpc.get_global_rank()},"
- f"tp={gpc.get_local_rank(ParallelMode.TENSOR)},pp={gpc.get_local_rank(ParallelMode.PIPELINE)},"
- f"dp={gpc.get_local_rank(ParallelMode.DATA)}==========="
- )
- else:
- load_path = self.load_ckpt_info["path"]
- load_content = self.load_ckpt_info["content"]
- load_type = self.load_ckpt_info["ckpt_type"]
-
- load_func = CheckpointLoadMethod.get_ckpt_load_type_func(load_type)
- load_content_str = load_func(self, self.load_ckpt_info, train_state)
-
- # If we only load the model weights, we need to rewrite the zero optimizer's fp32 buffer.
- if load_content.only_load(CheckpointLoadContent.MODEL) and isinstance(self.optimizer, HybridZeroOptimizer):
- reload_zero_fp32_buff(self.optimizer)
-
- if gpc.is_rank_for_log():
- logger.info(f"load_ckpt_info : {self.load_ckpt_info}")
- logger.info(
- f"===========Resume training from `{load_path}` {current_time} on host:"
- f"{socket.gethostname()}==========="
- )
- if load_content_str:
- logger.info(f"===========Load contents are: {load_content_str}")
-
- @llm_timeout(func_name="save_checkpoint")
- def save_checkpoint(
- self,
- folder,
- model,
- optimizer,
- scheduler,
- train_state: TrainState,
- model_config: Dict = None,
- model_config_file: str = None,
- ):
- """
- Save checkpoint to the given folder path.
- """
-
- start = time.time()
- self.set_save_folder(folder, train_state.step_count)
- torch.cuda.synchronize()
- torch.distributed.barrier()
- if gpc.is_rank_for_log():
- logger.info(f"Saving checkpoint to `{folder}` at batch count:{train_state.step_count}...")
-
- timer("save-model").start()
- save_model_checkpoint(folder=folder, model=model)
- timer("save-model").stop()
-
- timer("save-optimizer").start()
- save_optimizer_checkpoint(optim=optimizer, state_path=folder)
- timer("save-optimizer").stop()
-
- if (
- hasattr(train_state, "data_state_dict")
- and gpc.get_local_rank(ParallelMode.TENSOR) == 0
- and gpc.get_local_rank(ParallelMode.PIPELINE) == 0
- ):
- llm_save(
- os.path.join(folder, f"sampler_{gpc.get_local_rank(ParallelMode.DATA)}.pt"),
- saved_obj=train_state.data_state_dict,
- )
-
- if gpc.is_rank_for_log():
- if scheduler:
- scheduler_states = scheduler.state_dict()
- llm_save(os.path.join(folder, "schedulder.pt"), saved_obj=scheduler_states)
-
- if hasattr(train_state, "batch_sampler") and not isinstance(
- train_state.batch_sampler, torch.utils.data.sampler.BatchSampler
- ):
- sampler_state = train_state.batch_sampler.state_dict()
- llm_save(os.path.join(folder, "sampler.pt"), saved_obj=sampler_state)
- llm_save(os.path.join(folder, "context.pt"), saved_obj=train_state.state_dict())
-
- if model_config is not None:
- # Model configuration dictionary.
- llm_save(os.path.join(folder, "model_config.pt"), saved_obj=model_config)
-
- if model_config_file is not None:
- # The complete training config file content, stored in binary format.
- llm_save(os.path.join(folder, "config_file.pt"), saved_obj=model_config_file)
-
- torch.distributed.barrier()
-
- if gpc.is_rank_for_log():
- timer.log(["save-model", "save-optimizer"], logger=logger)
- logger.info(f"Step: {train_state.step_count}, rank 0 save ckpt use {time.time() - start:.3f} s")
- if self.storage_manager.async_mode is False:
- llm_save(
- os.path.join(folder, f"{train_state.step_count}.step"),
- saved_obj=dict({"step": train_state.step_count}),
- )
-
- def set_save_folder(self, folder, step):
- self.storage_manager.latest_save_folder = folder
- self.storage_manager.latest_save_step = step
-
- def try_ping_storage(self):
- if gpc.get_global_rank() % 8 == 0:
- buff = torch.ones((1, 64, 64), dtype=torch.bfloat16)
- test_fn = os.path.join(self.save_ckpt_folder, f"pings/{socket.gethostname()}.ping")
- self.storage_manager.save(test_fn, buff)
- self.storage_manager.wait()
- self.storage_manager.load(test_fn)
- del buff
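-
-# A minimal ckpt_config sketch for CheckpointManager (field names follow the get_config_value calls
-# above; the concrete paths are hypothetical):
-#
-#   ckpt = dict(
-#       enable_save_ckpt=True,
-#       save_ckpt_folder="local:/path/to/ckpts",
-#       checkpoint_every=100,
-#       oss_snapshot_freq=50,
-#       async_upload=False,
-#       stop_file_path="/path/to/stop_file",
-#       auto_resume=True,
-#       load_ckpt_info=dict(path="local:/path/to/ckpts/100", content=("model",), ckpt_type="internlm"),
-#   )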
diff --git a/internlm/utils/parallel.py b/internlm/utils/parallel.py
deleted file mode 100644
index cffcdc1..0000000
--- a/internlm/utils/parallel.py
+++ /dev/null
@@ -1,61 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import torch.distributed as dist
-
-from internlm.core.context import IS_TENSOR_PARALLEL, ParallelMode
-from internlm.core.context import global_context as gpc
-
-
-def is_model_parallel_parameter(p):
- return hasattr(p, IS_TENSOR_PARALLEL) and getattr(p, IS_TENSOR_PARALLEL)
-
-
-def sync_model_param(model, parallel_mode):
- r"""Make sure data parameters are consistent during Data Parallel Mode.
-
- Args:
- model (:class:`torch.nn.Module`): A pyTorch model on whose parameters you check the consistency.
- parallel_mode (:class:`internlm.core.context.ParallelMode`): Parallel mode to be checked.
- """
- if gpc.is_initialized(parallel_mode) and gpc.get_world_size(parallel_mode) > 1:
- for param in model.parameters():
- ranks = gpc.get_ranks_in_group(parallel_mode)
- dist.broadcast(param, src=ranks[0], group=gpc.get_group(parallel_mode))
-
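-# Typical call at initialization time (a sketch): make all data-parallel ranks start from identical
-# parameters by broadcasting from the first rank of the group.
-#
-#   sync_model_param(model, parallel_mode=ParallelMode.DATA)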
-
-def sync_model_param_within_tp(model):
- r"""This function is changed from colossalai, which is ``sync_model_param``.
-
- We modified this function to make sure it only sync parameters within tensor parallelism
- but they are not splitted by tensor parallelism.
- This function is used to make sure parameters that are not splitted by tensor parallelism
- are the same across each tensor parallelism.
- For example, parameters like RMSNorm, LayerNorm...
-
- Args:
- model (:class:`torch.nn.Module`): A pyTorch model on whose parameters you check the consistency.
- """
- parallel_mode = ParallelMode.TENSOR
- if gpc.is_initialized(parallel_mode) and gpc.get_world_size(parallel_mode) > 1:
- for param in model.parameters():
- if not is_model_parallel_parameter(param):
- ranks = gpc.get_ranks_in_group(parallel_mode)
- dist.broadcast(param, src=ranks[0], group=gpc.get_group(parallel_mode))
-
-
-def is_no_pp_or_last_stage():
- return not gpc.is_initialized(ParallelMode.PIPELINE) or gpc.is_last_rank(ParallelMode.PIPELINE)
-
-
-def get_parallel_log_file_name():
- if gpc.is_rank_for_log():
- fn_prefix = "main_" # Indicates a rank with more output information
- else:
- fn_prefix = ""
-
- log_file_name = (
- f"{fn_prefix}dp={gpc.get_local_rank(ParallelMode.DATA)}_"
- f"tp={gpc.get_local_rank(ParallelMode.TENSOR)}_pp={gpc.get_local_rank(ParallelMode.PIPELINE)}"
- )
- return log_file_name
diff --git a/internlm/utils/registry.py b/internlm/utils/registry.py
deleted file mode 100644
index 7cbfcc5..0000000
--- a/internlm/utils/registry.py
+++ /dev/null
@@ -1,71 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-
-class Registry:
- """This is a registry class used to register classes and modules so that a universal
- object builder can be enabled.
-
- Args:
- name (str): The name of the registry.
- """
-
- def __init__(self, name: str):
- self._name = name
- self._registry = dict()
-
- @property
- def name(self):
- return self._name
-
- def register_module(self, module_name: str):
- """Registers a module represented in `module_class`.
-
- Args:
- module_name (str): The name of module to be registered.
- Returns:
- function: A decorator that registers the module while leaving it usable via normal imports.
- Raises:
- AssertionError: Raises an AssertionError if the module has already been registered before.
- """
-
- assert module_name not in self._registry, f"{module_name} has already been registered in {self.name}"
-
- def decorator_wrapper(original_func):
- self._registry[module_name] = original_func
- return original_func
-
- return decorator_wrapper
-
- def get_module(self, module_name: str):
- """Retrieves a module with name `module_name` and returns the module if it has
- already been registered before.
-
- Args:
- module_name (str): The name of the module to be retrieved.
- Returns:
- :class:`object`: The retrieved module or None.
- Raises:
- NameError: Raises a NameError if the module to be retrieved has neither been
- registered directly nor as third party modules before.
- """
- if module_name in self._registry:
- return self._registry[module_name]
- raise NameError(f"Module {module_name} not found in the registry {self.name}")
-
- def has(self, module_name: str):
- """Searches for a module with name `module_name` and returns a boolean value indicating
- whether the module has been registered directly or as third party modules before.
-
- Args:
- module_name (str): The name of the module to be searched for.
- Returns:
- bool: A boolean value indicating whether the module has been registered directly or
- as third party modules before.
- """
- found_flag = module_name in self._registry
-
- return found_flag
-
-
-MODEL_INITIALIZER = Registry("model_initializer")
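-
-# A minimal usage sketch of the registry (the model name below is hypothetical):
-#
-#   @MODEL_INITIALIZER.register_module("my_transformer")
-#   def build_my_transformer(**kwargs):
-#       ...
-#
-#   assert MODEL_INITIALIZER.has("my_transformer")
-#   builder = MODEL_INITIALIZER.get_module("my_transformer")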
diff --git a/internlm/utils/simple_memory_profiler.py b/internlm/utils/simple_memory_profiler.py
deleted file mode 100644
index 9caf0a2..0000000
--- a/internlm/utils/simple_memory_profiler.py
+++ /dev/null
@@ -1,672 +0,0 @@
-import os
-import time
-from collections import OrderedDict
-from functools import partial, reduce
-from typing import Any, Dict, List, Tuple
-
-import pyecharts
-import torch
-
-from internlm.core.naive_amp import NaiveAMPModel
-
-mb = 1024 * 1024
-
-
-class SimpleMemState:
- """
- A class to represent the memory state of a model layer.
-
- Args:
- layer_name (str): The name of the layer.
- layer_mem (int): The memory usage of the layer in bytes.
- """
-
- def __init__(self, layer_name: str, layer_mem: int = 0) -> None:
- self.layer_name = layer_name
-
- # Memory status of the current model layer.
- self._layer_mem: int = layer_mem
- # Total memory status of the model and sub-models, initialized with layer memory.
- self._total_mem: int = self._layer_mem
- # SimpleMemState of sub-models.
- self.sub_model_stats = OrderedDict()
-
- @property
- def layer_mem(self) -> int:
- """
- Get the memory usage of the layer.
-
- Returns:
- int: The memory usage of the layer in bytes.
- """
- return self._layer_mem
-
- @layer_mem.setter
- def layer_mem(self, new_layer_mem: int) -> None:
- """
- Set the memory usage of the layer.
-
- Args:
- new_layer_mem (int): The new memory usage of the layer in bytes.
- """
- diff = new_layer_mem - self._layer_mem
- self._layer_mem = new_layer_mem
- self._total_mem += diff
-
- @property
- def total_mem(self) -> int:
- """
- Get the total memory usage of the model and sub-models.
-
- Returns:
- int: The total memory usage in bytes.
- """
- return self._total_mem
-
- def add(self, layer_name: str, layer_mem: int = 0, flush: bool = True) -> None:
- """
- Add a layer to the memory state.
-
- Args:
- layer_name (str): The name of the layer.
- layer_mem (int, optional): The memory usage of the layer in bytes. Defaults to 0.
- flush (bool, optional): Whether to update the total memory usage. Defaults to True.
- """
- path = layer_name.split(".")
-
- target = self.find_layer_state(path, create=True)
- target.layer_mem = layer_mem
-
- if flush:
- self.update_total_memory()
-
- def delete(self, layer_name: str, flush: bool = True) -> None:
- """
- Delete a layer from the memory state.
-
- Args:
- layer_name (str): The name of the layer.
- flush (bool, optional): Whether to update the total memory usage. Defaults to True.
- """
- path = layer_name.split(".")
- assert len(path) >= 2, f"Only support deleting non-root layers, layer_name: {layer_name}"
-
- parent_path = path[0:-1]
- layer = path[-1]
- parent = self.find_layer_state(parent_path)
-
- if parent is not None and layer in parent.sub_model_stats:
- del parent.sub_model_stats[layer]
-
- if flush:
- self.update_total_memory()
-
- def update_total_memory(self) -> None:
- """
- Update the total memory usage of the model and sub-models.
- """
- self._total_mem = self._layer_mem
-
- for stat in self.sub_model_stats.values():
- # Update sub-model status first.
- stat.update_total_memory()
- # Add sub-model total_mem to model total_mem.
- self._total_mem += stat._total_mem
-
- def find_layer_state(self, path: Tuple[str], create: bool = False) -> "SimpleMemState":
- """
- Find the memory state of a layer.
-
- Args:
- path (Tuple[str]): The path to the layer.
- create (bool, optional): Whether to create the layer if it doesn't exist. Defaults to False.
-
- Returns:
- SimpleMemState: The memory state of the layer.
- """
- current_node = self
-
- for _node in path:
- if _node not in current_node.sub_model_stats:
- if not create:
- return None
- # Create a layer node.
- current_node.sub_model_stats[_node] = SimpleMemState(_node)
-
- current_node = current_node.sub_model_stats[_node]
-
- return current_node
-
- def dump(self, prefix: str = "") -> str:
- """
- Dump the memory state of the model and sub-models.
-
- Args:
- prefix (str, optional): The prefix to add to the layer names. Defaults to "".
-
- Returns:
- str: The memory state information.
- """
- cur_prefix = prefix + "." + self.layer_name if prefix != "" else self.layer_name
- res = f"layer: {cur_prefix}, layer_mem: {self.layer_mem / mb:.2f} MB, total_mem: {self.total_mem / mb:.2f} MB\n"
-
- for sub_layer in self.sub_model_stats.values():
- res += sub_layer.dump(cur_prefix)
-
- return res
-
- def to_json(self, base: int = 1024 * 1024) -> dict:
- """
- Convert the memory state to a JSON structure.
-
- Returns:
- dict: The JSON structure of the memory state.
- """
- children = [child.to_json() for child in self.sub_model_stats.values()]
- if len(children) == 0:
- return {"name": self.layer_name, "value": self.layer_mem // base}
- else:
- return {"name": self.layer_name, "children": children}
-
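-# A minimal usage sketch of SimpleMemState (sizes are made-up illustration values, in bytes):
-#
-#   root = SimpleMemState("model")
-#   root.add("layer1.weight", 4 * 5120 * 5120)  # a nested path creates intermediate nodes
-#   root.add("layer1.bias", 4 * 5120)
-#   total = root.total_mem                      # sum over the whole sub-tree
-#   print(root.dump())                          # per-layer / total memory report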
-
-class ActivationMemState:
- """
- Activation Memory State
- """
-
- def __init__(self, num_chunks: int) -> None:
- self._num_chunks = num_chunks
-
- self.inited: List[bool] = [False for _ in range(num_chunks)]
- self.states: List[SimpleMemState] = [SimpleMemState(f"activations_{idx}") for idx in range(num_chunks)]
-
- @property
- def total_mem(self) -> int:
- return sum(state.total_mem for state in self.states)
-
- def dump(self, prefix: str = "") -> str:
- return reduce(lambda x, y: x + y, [state.dump(prefix) for state in self.states])
-
- def to_json(self, base: int = 1024 * 1024) -> List:
- return [state.to_json(base) for state in self.states]
-
-
-def _unpack_naive_wrapper(model: torch.nn.Module) -> Tuple[torch.nn.Module, int]:
- num_chunks = len(model) if isinstance(model, torch.nn.ModuleList) else 1
-
- if num_chunks > 1:
- model = torch.nn.ModuleList([_model.model if isinstance(_model, NaiveAMPModel) else _model for _model in model])
- else:
- model = model.model if isinstance(model, NaiveAMPModel) else model
-
- return model, num_chunks
-
-
-class SimpleMemoryProfiler:
- """
- A memory profiler for a llm model.
-
- Args:
- model (torch.nn.Module): The model to profile.
- optimizer (torch.optim.Optimizer): The optimizer used for training the model.
- log_file (str): The file to write the memory state information to.
- total_steps: number of steps to trace.
- """
-
- def __init__(
- self,
- model: torch.nn.Module,
- optimizer: torch.optim.Optimizer,
- log_folder: str,
- total_steps: int = 5,
- ):
- self._model, self._num_model_chunks = _unpack_naive_wrapper(model)
- self._optimizer = optimizer
- self._log_folder = log_folder
- self._remaining_steps = total_steps
-
- self._stoped = False
- self._record_start_time = time.time()
-
- # For activation memory state.
-
- self._activation_mem: int = 0
- self._activation_mem_max: int = 0
- self._activation_base_mems = ActivationMemState(self._num_model_chunks)
-
- # Check or create log folder
- os.makedirs(self._log_folder, exist_ok=True)
-
- # Register activation memory tracking hooks
- if self._num_model_chunks > 1:
- for chunk_id in range(self._num_model_chunks):
- self._register_activation_trace_hooks(chunk_id, self._model[chunk_id])
- else:
- self._register_activation_trace_hooks(0, self._model)
-
- # Calculate static parameter cuda memory
- self._param_mem_state = SimpleMemState("param_mem")
- self._calc_tensor_memory(self._param_mem_state, self._model.named_parameters())
- # Calculate static grad cuda memory
- self._grad_mem_state = SimpleMemState("grad_mem")
- self._calc_tensor_memory(self._grad_mem_state, self._model.named_parameters(), True)
- # Calculate static optimizer state cuda memory
- self._os_params_mem_state = SimpleMemState("os_params_mem")
- self._os_state_mem_state = SimpleMemState("os_state_mem")
- self._calc_tensor_group_memory(self._os_params_mem_state, list(enumerate(self._optimizer.param_groups)))
-
- # Generate the first memory record
- self.point(with_options="params,grads,os_params", create=True)
-
- def point(self, with_options: str = "", create: bool = False) -> None:
- """
- Record the memory state.
-
- Args:
- with_options (str, optional): The options to include in the memory state. Defaults to "".
- create (bool, optional): Whether to create a new memory record file. Defaults to False.
-
- Returns:
- None
- """
- now = time.time()
- file = f"{self._log_folder}/memory.log"
-
- if with_options == "all":
- options = ["params", "grads", "os_params", "os_state", "activation_base"]
- else:
- options = with_options.split(",")
-
- total_mem = (
- self._param_mem_state.total_mem
- + self._grad_mem_state.total_mem
- + self._os_params_mem_state.total_mem
- + self._os_state_mem_state.total_mem
- + self._activation_mem
- ) / mb
-
- # Generate summary information for memory state
- summary_info = (
- f"total_memory: {total_mem:.2f} MB"
- + "\n"
- + f"params_memory: {self._param_mem_state.total_mem / mb:.2f} MB, "
- + f"grads_memory: {self._grad_mem_state.total_mem / mb:.2f} MB, "
- + f"os_params_memory: {self._os_params_mem_state.total_mem / mb:.2f} MB, "
- + f"os_state_memory: {self._os_state_mem_state.total_mem / mb:.2f} MB, "
- + f"activation_memory: {self._activation_mem / mb:.2f} MB"
- )
-
- # Generate layout information based on selected options
- layout_info = ""
- if "params" in options:
- layout_info += "params_layout:\n" + self._param_mem_state.dump()
- if "grads" in options:
- layout_info += "grads_layout:\n" + self._grad_mem_state.dump()
- if "os_params" in options:
- layout_info += "os_params_layout:\n" + self._os_params_mem_state.dump()
- if "os_state" in options:
- layout_info += "os_state_layout:\n" + self._os_state_mem_state.dump()
- if "activation_base" in options:
- layout_info += "activation_base_layout:\n" + self._activation_base_mems.dump()
-
- # Write memory state information to log file
- file_mode = "w" if create else "a"
- with open(file, file_mode, encoding="utf-8") as writer:
- writer.write(
- "Memory State:\n" + f"time: {now - self._record_start_time}\n" + "---summary---\n" + summary_info + "\n"
- )
- if layout_info != "":
- writer.write("---Layout---\n" + layout_info)
- writer.write("\n")
-
- def step(self) -> None:
- """
- Update the memory state of the optimizer state.
-
- Returns:
- None
- """
- if self._stoped:
- return
-
- self._remaining_steps -= 1
- if self._remaining_steps == 0:
- self._stoped = True
-
- # Update os state memory usage
- self._os_state_mem_state = SimpleMemState("os_state_mem")
- self._calc_tensor_group_memory(self._os_state_mem_state, list(self._optimizer.state_dict()["state"].items()))
-
- if not self._stoped:
- # Do we need to print os_state_layout every time? Is it always constant?
- self.point(with_options="os_state")
- else:
- # Dump memory layout
- self.point(with_options="all")
- # Generate sunburst charts
- self._render_sunburst_chart(self._param_mem_state.to_json()["children"], "params_memory_sunburst")
- self._render_sunburst_chart(self._grad_mem_state.to_json()["children"], "grads_memory_sunburst")
- self._render_sunburst_chart(
- [self._os_params_mem_state.to_json(), self._os_state_mem_state.to_json()],
- "os_memory_sunburst",
- )
- self._render_sunburst_chart(self._activation_base_mems.to_json(), "activation_memory_sunburst")
- # Generate summary sunburst chart
- summary_sunburst_data = [
- {"name": "params", "value": self._param_mem_state.total_mem // mb},
- {"name": "grads", "value": self._grad_mem_state.total_mem // mb},
- {"name": "os_params", "value": self._os_params_mem_state.total_mem // mb},
- {"name": "os_state", "value": self._os_state_mem_state.total_mem // mb},
- {"name": "activation", "value": self._activation_mem_max // mb},
- ]
-
- self._render_sunburst_chart(summary_sunburst_data, "summary_sunburst")
-
- def _render_sunburst_chart(self, data: Any, name: str) -> None:
- pyecharts.charts.Sunburst(init_opts=pyecharts.options.InitOpts(width="1000px", height="1000px")).add(
- name,
- data_pair=data,
- highlight_policy="ancestor",
- radius=[0, "95%"],
- levels=[
- {},
- {
- "r0": "10%",
- "r": "35%",
- "itemStyle": {"borderWidth": 3},
- "label": {"align": "left"},
- },
- {"r0": "35%", "r": "55%", "label": {"align": "left"}},
- {"r0": "55%", "r": "70%", "label": {"align": "left"}},
- {"r0": "70%", "r": "80%", "label": {"align": "left"}},
- {"r0": "80%", "r": "90%", "label": {"align": "left"}},
- {
- "r0": "90%",
- "r": "92%",
- "label": {"position": "outside", "padding": 3, "silent": False},
- "itemStyle": {"borderWidth": 3},
- },
- ],
- ).set_global_opts(title_opts=pyecharts.options.TitleOpts(title="CUDA Memory")).set_series_opts(
- label_opts=pyecharts.options.LabelOpts(formatter="{b}")
- ).render(
- f"{self._log_folder}/{name}.html"
- )
-
- def _inner_activation_trace_hook(
- self,
- chunk_id: int,
- layer_name: str,
- model: Any,
- inputs: Any,
- output: torch.Tensor,
- ) -> None:
- """
- Hook function to trace the activation memory usage for an inner layer.
-
- Args:
- layer_name (str): The name of the layer.
- model (Any): The model.
- inputs (Any): The inputs to the layer.
- output (torch.Tensor): The output tensor.
-
- Returns:
- None
- """
- del model, inputs
- assert isinstance(output, torch.Tensor), f"Invalid output type: {type(output)}"
-
- if self._stoped or self._activation_base_mems.inited[chunk_id]:
- return
-
- # Delay updating the total_mem of activation_base_mem here, it will be handled in the forward ending hook.
- self._activation_base_mems.states[chunk_id].add(
- layer_name, output.element_size() * output.nelement(), flush=False
- )
-
- def _activation_trace_hook_forward(self, chunk_id: int, model: Any, inputs: Any, output: torch.Tensor) -> None:
- """
- Hook function to trace the activation memory usage for a forward pass.
-
- Args:
- model (Any): The model.
- inputs (Any): The inputs to the model.
- output (torch.Tensor): The output tensor.
-
- Returns:
- None
- """
- del model, inputs
- assert isinstance(output, torch.Tensor), f"invalid output type: {type(output)}"
-
- if self._stoped:
- return
-
- # Check if the activation memory has been initialized
- if self._activation_base_mems.inited[chunk_id] is False:
- self._activation_base_mems.inited[chunk_id] = True
- # Update the total memory of the activation base memory state
- self._activation_base_mems.states[chunk_id].update_total_memory()
- # Set with_options to "activation_base" to include activation_base_layout in the memory dump
- with_options = "activation_base"
- else:
- with_options = ""
-
- # Accumulate activation memory usage for each forward pass
- self._activation_mem += self._activation_base_mems.states[chunk_id].total_mem
- if self._activation_mem > self._activation_mem_max:
- self._activation_mem_max = self._activation_mem
-
- # Trigger a memory record
- self.point(with_options)
-
- def _activation_tarce_hook_backward(self, chunk_id: int, model: Any, inputs: Any, grad_outputs: Any) -> None:
- """
- Hook function to trace the activation memory usage for a backward pass.
-
- Args:
- model (Any): The model.
- inputs (Any): The inputs to the model.
- grad_outputs (Any): The gradients of the outputs.
-
- Returns:
- None
- """
- del model, inputs, grad_outputs
-
- if self._stoped:
- return
-
- # Release activation memory usage for each backward pass
- self._activation_mem -= self._activation_base_mems.states[chunk_id].total_mem
-
- # Trigger a memory record
- self.point()
-
- def _register_activation_trace_hooks(self, chunk_id: int, model_chunk: torch.nn.Module) -> None:
- """
- Register activation trace hooks for the model and each submodule in the model.
- """
-
- # Register inner activation trace hooks for each submodule in the model
- for layer_name, sub_model in model_chunk.named_modules():
- # Register the hook
- if len(sub_model._modules) != 0:
- continue # TODO: in some special cases, additional configuration may be needed to get correct results
-
- sub_model.register_forward_hook(partial(self._inner_activation_trace_hook, chunk_id, layer_name))
-
- # Register a forward hook for the main model to track activation memory usage
- model_chunk.register_forward_hook(partial(self._activation_trace_hook_forward, chunk_id))
- # Register a backward hook for the main model to release activation memory usage
- model_chunk.register_full_backward_hook(partial(self._activation_tarce_hook_backward, chunk_id))
-
- def _calc_tensor_memory(
- self, root_stat: SimpleMemState, named_tensors: Dict[str, torch.Tensor], require_grad: bool = False
- ) -> None:
- """
- Calculate the memory usage of tensors and update the memory state.
-
- Args:
- root_stat (SimpleMemState): The root memory state.
- named_tensors (Dict[str, torch.Tensor]): A dictionary containing the named tensors.
- require_grad (bool, optional): Whether to consider tensors with gradients. Defaults to False.
-
- Returns:
- None
- """
- for name, tensor in named_tensors:
- if require_grad and not tensor.requires_grad:
- continue
-
- layer_splits = name.split(sep=".")
- layer_stat = root_stat.find_layer_state(layer_splits, create=True)
- layer_stat.layer_mem = tensor.element_size() * tensor.nelement()
-
- root_stat.update_total_memory()
-
- def _calc_tensor_group_memory(self, root_stat: SimpleMemState, tensor_groups: List[Tuple[int, torch.Tensor]]):
- """
- Calculate the memory usage of a group of tensors.
-
- Args:
- root_stat (SimpleMemState): The root memory state.
- tensor_groups (List[Tuple[int, torch.Tensor]]): A list of tuples containing the tensor groups.
-
- Returns:
- None
- """
-
- def _normalize_helper(named_tensors: Dict[str, Any]) -> List[Tuple[str, Any]]:
- """
- Normalize the named tensors.
-
- Args:
- named_tensors (Dict[str, Any]): The named tensors to normalize.
-
- Returns:
- List[Tuple[str, Any]]: The normalized named tensors.
- """
- res = {}
-
- for name, tensors in named_tensors.items():
- if isinstance(tensors, torch.Tensor):
- res[name] = tensors
- elif isinstance(tensors, (list, tuple)):
- for index, tensor in enumerate(tensors):
- res[f"{name}.{index}"] = tensor
- elif isinstance(tensors, dict):
- for subname, tensor in tensors.items():
- res[f"{name}.{subname}"] = tensor
- else:
- raise TypeError(f"unsupported normalize value type: {type(tensors)}")
-
- return list(res.items())
-
- def _value_check(tensor_or_tensors):
- """
- Check if the input is a tensor or a collection of tensors.
-
- Args:
- tensor_or_tensors (Any): The input to check.
-
- Returns:
- bool: True if the input is a tensor or a collection of tensors, False otherwise.
- """
- if torch.is_tensor(tensor_or_tensors):
- return True
- elif isinstance(tensor_or_tensors, (list, tuple)) and all(torch.is_tensor(x) for x in tensor_or_tensors):
- return True
- elif isinstance(tensor_or_tensors, dict) and all(torch.is_tensor(x) for x in tensor_or_tensors.values()):
- return True
- else:
- return False
-
- # Calculate the memory usage of a group of tensors.
- for idx, tensors in tensor_groups:
- # Normalize the named tensors
- named_tensors = {f"{idx}.{k}": v for k, v in tensors.items() if _value_check(v)}
- named_tensors = _normalize_helper(named_tensors)
- # Calculate the memory usage of the tensors and update the memory state
- self._calc_tensor_memory(root_stat, named_tensors)
-
-
-if __name__ == "__main__":
-
- class SimpleModel(torch.nn.Module):
- """
- A simple model with three linear layers.
-
- Args:
- skip_layer2 (bool, optional): Whether to skip layer2. Defaults to False.
- """
-
- def __init__(self, skip_layer2: bool = False):
- super().__init__()
- self.layer1 = torch.nn.Linear(5120, 5120, True)
- self.layer3 = torch.nn.Linear(5120, 5120, False)
-
- if skip_layer2:
- self.layer2 = None
- else:
- self.layer2 = SimpleModel(skip_layer2=True)
-
- def forward(self, inputs: torch.Tensor) -> torch.Tensor:
- """
- Forward pass of the model.
-
- Args:
- inputs (torch.Tensor): The input tensor.
-
- Returns:
- torch.Tensor: The output tensor.
- """
- output1 = self.layer1(inputs)
- if self.layer2 is not None:
- output2 = self.layer2(output1)
- else:
- output2 = output1
- output = self.layer3(output2)
-
- return output
-
- def _simple_schedule(_num_chunks, _model_chunks, _input) -> torch.Tensor:
- if _num_chunks > 1:
- _output = _input
- for _model_chunk in _model_chunks:
- _output = _model_chunk(_output)
- else:
- _output = _model_chunks(_input)
-
- return _output
-
- # num_chunks config
- _num_chunks = 1
-
- # init model and optimizer
- if _num_chunks > 1:
- _chunks = [SimpleModel(skip_layer2=idx % 2 == 0) for idx in range(_num_chunks)]
- _model = torch.nn.ModuleList(_chunks).cuda()
- else:
- _model: torch.nn.Module = SimpleModel().cuda()
- _optimizer = torch.optim.Adam(_model.parameters())
-
- # init profiler
- profiler = SimpleMemoryProfiler(_model, _optimizer, "./test_simple_memory_profiler", total_steps=1)
-
- _optimizer.zero_grad()
-
- # inputs
- x1 = torch.randn((128, 5120)).cuda()
- x2 = torch.randn((128, 5120)).cuda()
- # forward
- out1 = _simple_schedule(_num_chunks, _model, x1)
- out2 = _simple_schedule(_num_chunks, _model, x2)
- # backward
- out1.mean().backward()
- out2.mean().backward()
-
- _optimizer.step()
-
- # Update the optimizer state memory usage and record the memory state
- profiler.step()
diff --git a/internlm/utils/storage_manager.py b/internlm/utils/storage_manager.py
deleted file mode 100644
index 36bd105..0000000
--- a/internlm/utils/storage_manager.py
+++ /dev/null
@@ -1,677 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import asyncio
-import concurrent.futures
-import hashlib
-import io
-import os
-import pickle
-import re
-import socket
-import stat
-from asyncio import InvalidStateError
-from asyncio.tasks import ALL_COMPLETED
-from datetime import datetime
-from typing import Any, Awaitable, Callable, Dict, List, Union
-
-import torch
-import torch.distributed as dist
-
-from internlm.core.context import global_context as gpc
-from internlm.utils.common import SingletonMeta
-from internlm.utils.logger import get_logger
-
-try:
- import boto3
- import botocore
-except ImportError:
- pass
-
-
-logger = get_logger(__file__)
-
-boto3_url_re = re.compile(r"([^\.]+)\.([\d\.]+)")
-
-MB = 1024**2
-
-storage_manager = None
-
-
-def check_folder(fp: str):
- storage_manager.assert_fp_exists(fp)
-
-
-def get_fns(fp: str):
- return storage_manager.get_fns(fp)
-
-
-def llm_load(fp: str, **kwargs):
- return storage_manager.load(fp, **kwargs)
-
-
-def llm_save(save_path: str, saved_obj: Any, **kwargs):
- storage_manager.save(save_path, to_save_obj=saved_obj, **kwargs)
-
-
-class StorageClient:
- """
- StorageClient as a client for s3 storage access.
- """
-
- def __init__(self, handler) -> None:
- self.handler = handler
-
- @staticmethod
- def load(*args, **kwargs):
- raise NotImplementedError
-
- @staticmethod
- def sync_upload_fileobj(*args, **kwargs):
- raise NotImplementedError
-
- @staticmethod
- def async_upload_fileobj(*args, **kwargs):
- raise NotImplementedError
-
- @staticmethod
- def assert_fp_exists(*args, **kwargs):
- raise NotImplementedError
-
- @staticmethod
- def get_fns(*args, **kwargs):
- raise NotImplementedError
-
-
-class Boto3MetaInfo:
- """Boto3 meta info for save/load etc."""
-
- def __init__(
- self,
- is_async,
- handler: StorageClient,
- bucket_name: str,
- endpoint: str,
- file_path: str,
- async_upload_fn: callable,
- local_nvme_path=None,
- ) -> None:
- # all need info.
- self.client = handler
- self.bucket_name = bucket_name
- self.file_path = file_path
- # only save need info.
- self.local_nvme_path = local_nvme_path
- self.is_async = is_async
- self.endpoint = endpoint
- self.async_upload_fn = async_upload_fn
-
- def __str__(self) -> str:
- return f"is_async: {self.is_async}, bucket_name:{self.bucket_name}, endpoint:{self.endpoint}, \
-local_nvme_path: {self.local_nvme_path}"
-
- @staticmethod
- def unpack_boto3_save_meta(meta):
- if meta.is_async:
- return meta.client, meta.bucket_name, meta.file_path, meta.local_nvme_path
- else:
- return meta.client, meta.bucket_name, meta.file_path
-
- @staticmethod
- def unpack_boto3_nosave_meta(meta):
- return meta.client, meta.bucket_name, meta.file_path
-
-
-class LocalMetaInfo:
- """Local meta info for save/load etc."""
-
- def __init__(self, file_path: str) -> None:
- self.file_path = file_path
- self.async_upload_fn = None
- self.is_async = False
-
- @staticmethod
- def unpack_local_save_meta(meta):
- return (meta.file_path,)
-
- @staticmethod
- def unpack_local_nosave_meta(meta):
- return (meta.file_path,)
-
-
-def unpack_save_meta(meta: Union[Boto3MetaInfo, LocalMetaInfo]):
- if isinstance(meta, Boto3MetaInfo):
- return Boto3MetaInfo.unpack_boto3_save_meta(meta)
- elif isinstance(meta, LocalMetaInfo):
- return LocalMetaInfo.unpack_local_save_meta(meta)
- else:
- raise ValueError(f"unkonwn meta info: {type(meta)}")
-
-
-def unpack_nosave_meta(meta: Union[Boto3MetaInfo, LocalMetaInfo]):
- if isinstance(meta, Boto3MetaInfo):
- return Boto3MetaInfo.unpack_boto3_nosave_meta(meta)
- elif isinstance(meta, LocalMetaInfo):
- return LocalMetaInfo.unpack_local_nosave_meta(meta)
- else:
- raise ValueError(f"unkonwn meta info: {type(meta)}")
-
-
-def compute_file_md5_by_chunk(file_name: str):
- hash_md5 = hashlib.md5()
- with open(file_name, "rb") as f:
- for chunk in iter(lambda: f.read(4096), b""):
- hash_md5.update(chunk)
- return hash_md5.hexdigest()
-
-
-def try_get_storage_backend(path: str):
- sre = path.split(":", maxsplit=1)
- if len(sre) == 1:
- if path.startswith("s3:"):
- backend = "boto3"
- if gpc.is_rank_for_log():
- logger.warning(f"path: '{path}' not start with backend prefix, guess it is the backend of boto3.")
- else:
- backend = "local"
- if gpc.is_rank_for_log():
- logger.warning(f"path: '{path}' not start with backend prefix, guess it is the backend of local.")
- return backend, sre
- else:
- return sre[0], sre[1] # (backend_prefix, split_path)
-
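-# Expected behaviour of try_get_storage_backend above (paths are hypothetical examples):
-#
-#   try_get_storage_backend("boto3:s3://bucket/ckpts")  # -> ("boto3", "s3://bucket/ckpts")
-#   try_get_storage_backend("local:/mnt/ckpts")         # -> ("local", "/mnt/ckpts")
-#   try_get_storage_backend("/mnt/ckpts")               # -> ("local", ["/mnt/ckpts"]), with a warning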
-
-class Boto3Client(StorageClient):
- """
- Boto3Client
- """
-
- def __init__(
- self,
- s3_endpoint_url: str,
- use_threads: bool = True,
- multipart_chunksize=8 * MB,
- max_concurrency: int = 10,
- multipart_threshold=100 * MB,
- ) -> None:
- """S3 object/file storage management class
-
- Args:
- s3_endpoint_url (str): S3 endpoint URL. The access credentials are read from the
- 'S3_ACCESS_KEY_ID' and 'S3_SECRET_ACCESS_KEY_ID' environment variables.
- use_threads (bool, optional): Whether to enable multipart threading. Defaults to True.
- multipart_chunksize (int, optional): Multipart chunk size in bytes. Defaults to 8*MB.
- max_concurrency (int, optional): Defaults to 10.
- multipart_threshold (int, optional): Multipart threshold in bytes. Defaults to 100*MB.
-
- Raises:
- RuntimeError: Connection failures caused by misconfiguration or network problems.
- """
- super().__init__(boto3)
- self.botocore = botocore
- try:
- s3_access_key_id = os.environ["S3_ACCESS_KEY_ID"]
- s3_secret_access_key = os.environ["S3_SECRET_ACCESS_KEY_ID"]
- except KeyError as exc:
- raise RuntimeError(
- "Please set boto3 bucket 'S3_ACCESS_KEY_ID' and 'S3_SECRET_ACCESS_KEY_ID' using environment variable!"
- ) from exc
-
- self.client = self.handler.client(
- "s3",
- "",
- use_ssl=False,
- verify=False,
- endpoint_url=s3_endpoint_url,
- aws_access_key_id=s3_access_key_id,
- aws_secret_access_key=s3_secret_access_key,
- )
-
- self.config = self.handler.s3.transfer.TransferConfig(
- multipart_threshold=multipart_threshold,
- max_concurrency=max_concurrency,
- multipart_chunksize=multipart_chunksize,
- use_threads=use_threads,
- )
-
- @staticmethod
- def sync_upload_fileobj(handler, bucket_name: str, fp: str, saved_obj=None, **kwargs):
- assert saved_obj is not None, "saved_obj is None!"
- try:
- with io.BytesIO() as f:
- torch.save(saved_obj, f, **kwargs)
- f.seek(0)
- handler.client.upload_fileobj(f, bucket_name, fp, Config=handler.config)
- except handler.botocore.exceptions.EndpointConnectionError as exc:
- raise RuntimeError(
- f"Boto3 Network Error: Please Check your Internet Connection in {socket.gethostname()}"
- ) from exc
-
- @staticmethod
- def load(handler, bucket_name: str, fp: str, **kwargs) -> Dict:
- """
- Args:
- fp (str): Path to load, e.g. s3://opennlplab/model_weights/xxx/ddd.pt
- """
- try:
- with io.BytesIO() as f:
- handler.client.download_fileobj(bucket_name, fp, f, Config=handler.config)
- f.seek(0)
- states = torch.load(f, **kwargs)
- except handler.botocore.exceptions.EndpointConnectionError as exc:
- raise RuntimeError(
- f"Boto3 Network Error: Please Check your Internet Connection in {socket.gethostname()}"
- ) from exc
- return states
-
- @staticmethod
- def assert_fp_exists(handler, bucket_name: str, fp: str): # pylint: disable=W0613
- assert len(list(handler.client.list_objects(Bucket=bucket_name, Prefix=fp)["Contents"])) > 0, fp
-
- @staticmethod
- def is_fp_exists(handler, bucket_name: str, fp: str): # pylint: disable=W0613
- re = handler.client.list_objects(Bucket=bucket_name, Prefix=fp)
- if "Contents" in re:
- return len(list(re["Contents"])) > 0
- else:
- return False
-
- @staticmethod
- def get_fns(handler, bucket_name: str, fp: str):
- """
- Ref: https://stackoverflow.com/questions/54314563/
- how-to-get-more-than-1000-objects-from-s3-by-using-list-objects-v2
- """
- if Boto3Client.is_fp_exists(handler, bucket_name, fp):
- paginator = handler.client.get_paginator("list_objects_v2")
- pages = paginator.paginate(Bucket=bucket_name, Prefix=fp)
- folder_name_list = []
- for page in pages:
- if "Contents" in page:
- for obj in page["Contents"]:
- pth: str = obj["Key"]
- folder_name_list.append(pth.split(fp, maxsplit=1)[1].strip("/").split("/", maxsplit=1)[0])
- return list(set(folder_name_list))
- else:
- if gpc.is_rank_for_log():
- logger.warning(f"'{fp}' not found!")
- return None
-
- @staticmethod
- def async_upload_fileobj(handler, bucket_name: str, fp: str, local_nvme_path: str):
- try:
- with open(local_nvme_path, "rb") as f:
- handler.client.upload_fileobj(f, bucket_name, fp, Config=handler.config)
- except handler.botocore.exceptions.EndpointConnectionError as exc:
- raise RuntimeError(
- f"Boto3 Network Error: Please Check your Internet Connection in {socket.gethostname()}"
- ) from exc
- except Exception as e:
- raise e
-
- @staticmethod
- def delete_obj(handler, fp: str):
- raise NotImplementedError("boto3 not support delete_obj")
-
-
-class LocalClient(StorageClient):
- """
- Storage Client for local NFS.
- """
-
- def __init__(self, *args, **kwargs) -> None: # pylint: disable=W0613
- super().__init__(None)
-
- @staticmethod
- def sync_upload_fileobj(fp: str, saved_obj=None, **kwargs):
- assert saved_obj is not None
- fp_dirname = os.path.dirname(fp)
- if not os.path.exists(fp_dirname):
- os.makedirs(fp_dirname, exist_ok=True)
- torch.save(saved_obj, fp, **kwargs)
-
- @staticmethod
- def load(load_path: str, **kwargs):
- assert os.path.exists(load_path), f"{load_path} is not found!"
- with open(load_path, "rb") as f:
- states = torch.load(f, **kwargs)
- return states
-
- @staticmethod
- def assert_fp_exists(folder):
- assert os.path.exists(folder), folder
-
- @staticmethod
- def get_fns(folder):
- if not os.path.exists(folder):
- if gpc.is_rank_for_log():
- logger.warning(f"'{folder}' not found!")
- return None
- else:
- return os.listdir(folder)
-
- @staticmethod
- def delete_obj(fp: str):
- if not os.path.isdir(fp):
- os.remove(fp)
-
-
-def get_tmp_file_name(tmp_local_folder: str, fp: str):
- """
- It should be noted that all our temporary files will be stored in the same folder,
- so the file name passed upstream must be unique.
- """
- base_path = os.path.join(tmp_local_folder, fp.split("/")[-1])
- current_time = datetime.now().strftime("%b%d_%H-%M-%S")
- pid = os.getpid()
- # step = self.step_counter
- return "-".join([base_path, current_time, str(pid)]) + ".tmpfile" # , str(step)
-
-
-def get_boto3_meta(fp: str, tmp_local_folder: str, is_async: bool) -> Boto3MetaInfo:
- assert fp.startswith("s3://"), f"Path '{fp}' is not a boto3 url"
- parts = fp.lstrip("s3://").split(os.path.sep)
- match = boto3_url_re.match(parts[0])
- assert match is not None, f"url '{fp}' is not a valid boto3 url"
- bucket_name, endpoint = match.group(1), match.group(2)
- endpoint = "http://" + endpoint + ":80"
- if is_async:
- tmp_step_file = get_tmp_file_name(tmp_local_folder, fp)
- else:
- tmp_step_file = None
- return Boto3MetaInfo(
- is_async=is_async,
- handler=None,
- bucket_name=bucket_name,
- endpoint=endpoint,
- file_path=os.path.sep.join(parts[1:]),
- async_upload_fn=Boto3Client.async_upload_fileobj,
- local_nvme_path=tmp_step_file,
- )
-
-
-def get_local_meta(fp: str) -> LocalMetaInfo:
- assert not fp.startswith("s3://"), f"Path '{fp}' is not a local path"
- return LocalMetaInfo(fp)
-
-
-def get_mount_point_free_size(path: str):
- """
- Returns the free space of the temporary storage mount point, in GB.
- Args:
- path (str): temporary storage folder path.
-
- Raises:
- FileNotFoundError: If the temporary storage folder does not exist,
- an error will be reported.
- """
- if os.path.exists(path):
- st = os.statvfs(path)
- # f_bavail: Number of free blocks for unprivileged users.
- # f_bsize: Filesystem block size.
- # return unit is GB.
- return st.f_bavail * st.f_bsize / (1024**3)
-
-
-def check_tmp_folder_accessibility(tmp_local_folder: str):
- """
- Check access permissions for temporary storage.
- """
- ret = True
- if os.path.exists(tmp_local_folder):
- ret &= os.access(tmp_local_folder, os.W_OK)
- ret &= os.access(tmp_local_folder, os.R_OK)
- if ret is False:
- error_str = f'{socket.gethostname()} does not have read and write permissions on {tmp_local_folder}'
- raise RuntimeError(error_str)
-
-
-class StorageManager(metaclass=SingletonMeta):
- """
- Storage Manager for saving or loading checkpoint.
- TODO: add a thread to poll the asynchronous storage state.
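-
- Example (illustrative; path prefixes follow the format described in _get_client below):
- >>> sm = StorageManager(enable_save=True, tmp_local_folder="/dev/shm/ckpt_tmp/", async_mode=True)
- >>> sm.save("boto3:s3://bucket.endpoint/ckpt/model_tp0_pp0.pt", state_dict)
- >>> sm.wait()  # block until pending async uploads have finished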
- """
-
- BACKEND_TYPE = {"boto3", "local"}
- BACKEND_INIT_METHOD = {
- "boto3": Boto3Client,
- "local": LocalClient,
- }
- CLI_DICT = {}
-
- def __init__(self, enable_save, tmp_local_folder="/dev/shm/test/", async_mode=True, n_async_workers=8) -> None:
- self._exception_list = []
- self._to_be_del_files = []
- self._async_stack = []
- self.upload_count = 0
- self.tmp_local_folder = tmp_local_folder
- self.async_mode = async_mode
- self.has_warning = False
- self._async_loop = None
- self._thread_pool = None
- self.latest_save_folder = None
- self.latest_save_step = 0
- self.async_task_pending = False
-
- if enable_save and self.async_mode:
- self._async_loop = asyncio.new_event_loop()
- self._thread_pool = concurrent.futures.ThreadPoolExecutor(max_workers=n_async_workers)
-
- check_tmp_folder_accessibility(os.path.dirname(self.tmp_local_folder))
-
- # Try to create tmp folder
- try:
- os.makedirs(self.tmp_local_folder, exist_ok=True)
- os.chmod(self.tmp_local_folder, stat.S_IRWXU | stat.S_IRWXG | stat.S_IRWXO)
- except FileExistsError:
- pass
-
- # In case it is a directory created by other users, we check the permissions again.
- check_tmp_folder_accessibility(self.tmp_local_folder)
-
- # Try to clean tmp folder's empty folder.
- self.try_delete_tmpfile(self.tmp_local_folder)
-
- # Available storage space check.
- free_size = get_mount_point_free_size(self.tmp_local_folder)
- if free_size < 0.1:
- logger.error(f'tmp_local_folder only has "{free_size}" GB of free space, less than the required 0.1 GB!')
- raise RuntimeError(f"Insufficient temporary storage space on {socket.gethostname()}")
-
- def _get_client(self, path: str, async_mode: bool = False) -> Union[Boto3MetaInfo, LocalMetaInfo]:
- """
- example:
- local:/path/to/checkpoint
- boto3:s3://model_weights/0331/120bi
-
- Args:
- path (str): storage path, prefixed with the backend name (e.g. "local:" or "boto3:").
- """
- backend, path = try_get_storage_backend(path)
-
- init_args = (None,)
- if backend == "local":
- meta_info = get_local_meta(path)
- backend_key = backend
- elif backend == "boto3":
- meta_info = get_boto3_meta(path, self.tmp_local_folder, async_mode)
- backend_key = backend + ":" + meta_info.endpoint
- init_args = (meta_info.endpoint,)
- if (
- "http_proxy" in os.environ
- or "https_proxy" in os.environ
- or "HTTP_PROXY" in os.environ
- or "HTTPS_PROXY" in os.environ
- ):
- if not self.has_warning:
- logger.warning(
- "HTTP/HTTPS proxy is detected when using boto3, incorrectly setting \
- the proxy may make boto3 unavailable or affect performance."
- )
- self.has_warning = True
-
- assert backend in StorageManager.BACKEND_TYPE, f"Unknown backend: {backend}"
-
- # The boto3 backend needs special treatment.
- if backend_key not in StorageManager.CLI_DICT:
- StorageManager.CLI_DICT.update({backend_key: StorageManager.BACKEND_INIT_METHOD[backend](*init_args)})
-
- meta_info.client = StorageManager.CLI_DICT[backend_key]
-
- return meta_info
-
- def assert_fp_exists(self, folder) -> None:
- meta = self._get_client(path=folder)
- meta.client.assert_fp_exists(*unpack_nosave_meta(meta))
-
- def get_fns(self, folder) -> List[str]:
- meta = self._get_client(path=folder)
- return meta.client.get_fns(*unpack_nosave_meta(meta))
-
- def save(self, save_path: str, to_save_obj: Any, async_upload=None, **kwargs):
-
- if async_upload is None:
- async_upload = self.async_mode
-
- if not save_path.startswith("boto3:"):
- async_upload = False
-
- meta = self._get_client(save_path, async_upload)
-
- if async_upload:
- assert (
- self.tmp_local_folder
- ), "StorageManager is not setted tmp_local_folder, so async save cannot be performed."
- tmp_step_file = meta.local_nvme_path
- self._to_be_del_files.append(tmp_step_file)
- with open(tmp_step_file, "wb") as f:
- torch.save(to_save_obj, f, pickle_protocol=pickle.HIGHEST_PROTOCOL)
- self.async_executor(meta.async_upload_fn, *unpack_save_meta(meta))
- os.chmod(tmp_step_file, stat.S_IRWXU | stat.S_IRWXG | stat.S_IRWXO)
- self.async_task_pending = True
- else:
- meta.client.sync_upload_fileobj(*unpack_save_meta(meta), saved_obj=to_save_obj, **kwargs)
- self.upload_count += 1
-
- def load(self, load_path: str, **kwargs) -> Any:
- self.wait()
- meta = self._get_client(path=load_path)
- return meta.client.load(*unpack_nosave_meta(meta), **kwargs)
-
- def delete_obj(self, fp: str):
- meta = self._get_client(path=fp)
- meta.client.delete_obj(*unpack_nosave_meta(meta))
-
- def _del_tmp_folder(self):
- for fp in self._to_be_del_files:
- try:
- os.remove(fp)
- except FileNotFoundError:
- pass
- except OSError as e:
- logger.error(f'delete file: {fp}, failed for reason: "{e}"')
-
- def try_delete_tmpfile(self, tmp_dir: str):
- """Delete temporary files in tmp_dir."""
-
- for filename in os.listdir(tmp_dir):
- if filename.endswith(".tmpfile"):
- file_path = os.path.join(tmp_dir, filename)
- try:
- os.remove(file_path)
- logger.info(f"Delete tmpfile: {file_path}")
- except OSError:
- # Ignore deletion errors
- pass
-
- async def _sync_tasks(self) -> Awaitable[None]:
- if self._async_stack:
- await asyncio.wait(self._async_stack, return_when=ALL_COMPLETED)
- count = 0
- while self._async_stack:
- t = self._async_stack[0]
- try:
- e = t.exception()
- if e:
- self._exception_list.append((e, count))
- logger.error(f"File:{self._to_be_del_files[count]}, upload failed for {e}")
- # raise e
- count += 1
- self._async_stack.pop(0)
- except InvalidStateError:
- # Not finished. https://docs.python.org/3/library/asyncio-task.html#asyncio.Task.exception
- pass
-
- def async_executor(self, fn: Callable, *args, **kwargs) -> None:
- """
- Overview:
- Execute the task in the background, then append the future instance to _async_stack.
- Arguments:
- - fn (:obj:`Callable`): Synchronous function to execute.
- """
- if not self._async_loop:
- raise RuntimeError("Event loop was not initialized, please call this function in async or parallel mode")
- t = self._async_loop.run_in_executor(self._thread_pool, fn, *args, **kwargs)
- self._async_stack.append(t)
-
- def wait(self) -> None:
- """Wait for async operations to complete."""
-
- if not self.async_mode:
- return
-
- if not self.async_task_pending:
- return
-
- if self._async_loop:
- self._async_loop.run_until_complete(self._sync_tasks())
-
- if self._exception_list:
- for error_msg, file_id in self._exception_list:
- logger.error(
- f"Node:{socket.gethostname()}, Error: Checkpoint {self._to_be_del_files[file_id]} "
- f"failed on step {self.upload_count}: {error_msg}"
- )
-
- # TODO: Re-upload in sync mode
- raise RuntimeError(
- f"Failed to upload {self._to_be_del_files[file_id]} " f"on step {self.upload_count}: {error_msg}"
- )
-
- self._del_tmp_folder()
- self._exception_list.clear()
- self._to_be_del_files.clear()
- self.async_task_pending = False
-
- if gpc.is_rank_for_log():
- self.upload_count += 1
- if self.async_mode and self.latest_save_folder:
- self.save(
- os.path.join(self.latest_save_folder, f"{self.latest_save_step}.step"),
- to_save_obj=dict({"step": self.latest_save_step}),
- async_upload=False,
- )
- self.latest_save_folder = None
-
-
-storage_manager: StorageManager = None
-
-
-def init_storage_manager(enable_save_ckpt, async_upload_tmp_folder, async_upload):
- global storage_manager
- storage_manager = StorageManager(
- enable_save_ckpt,
- tmp_local_folder=async_upload_tmp_folder,
- async_mode=async_upload,
- )
-
-
-def get_storage_manager():
- assert storage_manager is not None, "storage_manager has not been initialized!"
- return storage_manager
-
-
-def wait_async_upload_finish():
- dist.barrier()
- storage_manager.wait()
diff --git a/internlm/utils/timeout.py b/internlm/utils/timeout.py
deleted file mode 100644
index 7a96841..0000000
--- a/internlm/utils/timeout.py
+++ /dev/null
@@ -1,113 +0,0 @@
-import datetime
-import os
-import signal
-import socket
-import traceback
-from functools import wraps
-
-from internlm.utils.logger import get_logger
-
-logger = get_logger(__file__)
-
-
-class Timeout:
- """Timer to execute code
-
- Adapted from https://github.com/reasoning-machines/pal
-
- Args:
- seconds (int): The maximum number of seconds the code is allowed to run
- error_message (str): Message attached to the raised TimeoutError
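-
- Example (illustrative; relies on SIGALRM, so it only works in the main thread on POSIX systems):
- >>> with Timeout(seconds=2, error_message="code block timed out"):
- ... run_slow_code()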
- """
-
- def __init__(self, seconds=1, error_message="Timeout"):
- self.seconds = seconds
- self.error_message = error_message
-
- def timeout_handler(self, signum, frame):
- raise TimeoutError(self.error_message)
-
- def __enter__(self):
- signal.signal(signal.SIGALRM, self.timeout_handler)
- signal.alarm(self.seconds)
-
- def __exit__(self, error_type, value, traceback):
- signal.alarm(0)
-
-
-ENABLE_TIMEOUT = os.getenv("INTERNLM_ENABLE_TIMEOUT", None)
-
-
-timeout_threshold_dict = {
- "initialize_distributed_env": 120,
- "nopp_forward_backward_step": 360,
- "initialize_model": 10,
- "initialize_optimizer": 20,
- "optim_step": 30,
- "get_train_data_loader": 600,
- "get_validation_data_loader": 60,
- "load_new_batch": 10,
- "record_current_batch_training_metrics": 10,
- "save_checkpoint": 1200,
- "interleaved_forward_backward_step": 600,
- "nointerleaved_forward_backward_step": 600,
-}
-
-if ENABLE_TIMEOUT is not None:
- os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
- LLM_NCCL_TIMEOUT = datetime.timedelta(seconds=int(os.getenv("NCCL_TIMEOUT", str(60))))
-else:
- timeout_threshold_dict = dict.fromkeys(timeout_threshold_dict.keys(), 0)
- LLM_NCCL_TIMEOUT = datetime.timedelta(seconds=1800)
-
-
-def try_get_gpc_rank():
- try:
- from internlm.core.context import global_context as gpc
-
- rank = gpc.get_global_rank()
- except: # noqa # pylint: disable=bare-except
- rank = "unknown"
-
- return f"host-{socket.gethostname()}-rank-{rank}"
-
-
-def llm_timeout(seconds=0, func_name=None):
- """timeout decorator, Note that this decorator cannot be reentrant,
- otherwise the signal will be reset.
-
- Args:
- seconds (int, optional): timeout threshold. Defaults to 0 and is overridden by
- timeout_threshold_dict when func_name is registered there.
- func_name (str, optional): name of the decorated function being watched for timeout.
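-
- Example (illustrative):
- >>> @llm_timeout(func_name="save_checkpoint")
- ... def save_checkpoint(state):
- ... ... # raises TimeoutError if it runs longer than the configured threshold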
- """
-
- def decorator(func):
- nonlocal func_name
- if func_name is None:
- func_name = func.__name__
-
- @wraps(func)
- def wrapper(*args, **kwargs):
- def _handle_timeout(signum, frame):
- raise TimeoutError
-
- nonlocal seconds
- seconds = timeout_threshold_dict.get(func_name, seconds)
-
- if seconds > 0:
- signal.signal(signal.SIGALRM, _handle_timeout)
- signal.alarm(seconds)
-
- try:
- result = func(*args, **kwargs)
- except TimeoutError as e:
- logger.error(f"TimeoutError at {try_get_gpc_rank()}: {func_name}\\n {traceback.format_exc()}")
- raise e
- finally:
- signal.alarm(0)
-
- return result
-
- return wrapper
-
- return decorator
diff --git a/internlm/utils/writer.py b/internlm/utils/writer.py
deleted file mode 100644
index b519b95..0000000
--- a/internlm/utils/writer.py
+++ /dev/null
@@ -1,150 +0,0 @@
-import logging
-import os
-import socket
-import sys
-import traceback
-from functools import partial
-
-import torch
-from torch.utils.tensorboard import SummaryWriter
-
-from internlm.core.context import global_context as gpc
-
-
-def tb_save_run_info(writer, config_lines, global_step=0):
- writer.add_text(tag="cmd", text_string=" ".join(sys.argv[:]), global_step=global_step)
- lines = []
- for line in config_lines:
- if line.strip().startswith("#"):
- continue
- lines.append(line)
- writer.add_text(tag="config", text_string="\n".join(lines), global_step=global_step)
-
-
-def init_tb_writer(
- job_name: str,
- launch_time: str,
- file_name: str,
- tensorboard_folder: str,
- resume_tb_folder: str,
- step_count: int,
- config: str,
- logger: logging.Logger,
-):
- tb_log_file_name = file_name
- if not tensorboard_folder:
- tb_folder = os.path.join(job_name, launch_time, "tensorboards")
- else:
- tb_folder = tensorboard_folder
-
- if gpc.get_global_rank() == 0:
- # If we don't load ckpt, 'resume_tb_folder' is set as the tensorboard
- # dir of the last task by 'make_launch_script.sh'.
- # If we load ckpt, 'resume_tb_folder' will be overwritten as the
- # reloaded 'train_state.resume_tb_folder'.
- if resume_tb_folder is not None:
- assert len(resume_tb_folder) > 0 and resume_tb_folder != "/"
- if not os.path.exists(resume_tb_folder):
- logger.error(
- f"Can't found resume_tb_folder{resume_tb_folder}, \
-please make sure this folder is located at local file system."
- )
- else:
- logger.info(f"Try mv tensorboard logs: {resume_tb_folder} to {tb_folder}... ")
- os.system(f"cp -r {resume_tb_folder}/* {tb_folder}/")
- os.system(f"chmod -R +w {tb_folder}/")
- else:
- logger.info(f"Login tensorboard logs to: {tb_folder}")
-
- tb_logdir = os.path.join(tb_folder, tb_log_file_name)
- writer = SummaryWriter(log_dir=tb_logdir, max_queue=5, purge_step=step_count, flush_secs=3)
- writer.add_text(tag="job_name", text_string=job_name, global_step=step_count)
- writer.add_text(tag="tensorboard_folder", text_string=tb_logdir, global_step=step_count)
-
- torch.distributed.broadcast_object_list([tb_folder], src=0)
- else:
- objects = [None]
- torch.distributed.broadcast_object_list(objects, src=0)
- tb_folder = objects[0]
- tb_logdir = os.path.join(tb_folder, tb_log_file_name)
- writer = SummaryWriter(log_dir=tb_logdir, max_queue=5, purge_step=step_count, flush_secs=3)
-
- if gpc.is_rank_for_log():
- tb_save_run_info(
- writer=writer,
- config_lines=config,
- global_step=step_count,
- )
-
- writer.add_text(
- tag=f"mapping_{tb_log_file_name}",
- text_string=f"file_path={tb_logdir} hostname={socket.gethostname()} device={torch.cuda.current_device()}",
- global_step=step_count,
- )
- writer.add_scaler = partial(writer.add_scalar, new_style=True)
-
- return writer, tb_logdir
-
-
-class Writer:
- """
- Customized writer based on TensorBoard for recording training metrics.
-
- Args:
- job_name (str): The name of training job, defaults to None.
- launch_time (str): A string representing the launch time of the training.
- file_name (str): The log file name, defaults to None.
- tensorboard_folder (str): A string representing the folder for saving tensorboard logs.
- resume_tb_folder (str): A string representing the folder for resuming tensorboard logs.
- step_count (int): An integer representing the step count of the training.
- config (str): A string representing the configuration of the training.
- logger (logging.Logger): A logging.Logger object for logging information during training.
- enable_tb (bool): A boolean indicating whether to enable the tensorboard writer.
-
- """
-
- def __init__(
- self,
- job_name: str = None,
- launch_time: str = None,
- file_name: str = None,
- tensorboard_folder: str = None,
- resume_tb_folder: str = None,
- step_count: int = 0,
- config: str = None,
- logger: logging.Logger = None,
- enable_tb: bool = True,
- ) -> None:
- self.enable_tb = enable_tb
- self.tb_writer, self.tb_logdir = init_tb_writer(
- job_name=job_name,
- launch_time=launch_time,
- file_name=file_name,
- tensorboard_folder=tensorboard_folder,
- resume_tb_folder=resume_tb_folder,
- step_count=step_count,
- config=config,
- logger=logger,
- )
-
- def add_scalar(self, key, value, step):
- try:
- if self.enable_tb and self.tb_writer is not None:
- self.tb_writer.add_scalar(tag=key, scalar_value=value, global_step=step)
- except Exception:
- traceback.print_exc()
-
- def add_scalars(self, key, value, step):
- try:
- assert isinstance(value, dict)
- if self.enable_tb and self.tb_writer is not None:
- self.tb_writer.add_scalars(main_tag=key, tag_scalar_dict=value, global_step=step)
- except Exception:
- traceback.print_exc()
-
- def add_text(self, key, value, step):
- try:
- if self.enable_tb and self.tb_writer is not None:
- self.tb_writer.add_text(tag=key, text_string=value, global_step=step)
- except Exception:
- traceback.print_exc()
diff --git a/model_cards/internlm_20b.md b/model_cards/internlm_20b.md
new file mode 100644
index 0000000..5585eba
--- /dev/null
+++ b/model_cards/internlm_20b.md
@@ -0,0 +1,59 @@
+# InternLM-20B
+
+## Introduction
+
+InternLM-20B was pre-trained on over **2.3T** Tokens containing high-quality English, Chinese, and code data. Additionally, the Chat version has undergone SFT and RLHF training, enabling it to better and more securely meet users' needs.
+
+In terms of model structure, InternLM-20B opted for a deeper architecture, with a depth set at 60 layers. This surpasses the conventional 7B and 13B models that utilize 32 or 40 layers. When parameters are limited, increasing the number of layers can enhance the model's overall capability. Furthermore, compared to InternLM-7B, the pre-training data used for InternLM-20B underwent higher quality cleansing and was supplemented with data rich in knowledge and designed for reinforcing understanding and reasoning capabilities. As a result, it exhibits significant improvements in understanding, reasoning, mathematical, and programming abilities—all of which test the technical proficiency of language models. Overall, InternLM-20B features the following characteristics:
+
+- Outstanding overall performance
+- Strong utility invocation capability
+- Supports a 16k context length (through inference extrapolation)
+- Better value alignment
+
+## Model Zoo
+
+| Model | Transformers(HF) | ModelScope(HF) | OpenXLab(HF) | OpenXLab(Original) | Release Date |
+|---------------------------|------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|
+| **InternLM Chat 20B** | [🤗internlm/internlm-chat-20b](https://huggingface.co/internlm/internlm-20b-chat) | [ Shanghai_AI_Laboratory/internlm-chat-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-20b-chat/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-20b) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-20b-original) | 2023-12-12 |
+| **InternLM 20B** | [🤗internlm/internlm-20b](https://huggingface.co/internlm/internlm-20b) | [ Shanghai_AI_Laboratory/internlm-20b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-20b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-20b) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-20b-original) | 2023-09-20 |
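+
+As a quick start, the sketch below loads the Transformers-format weights listed above with the standard `transformers` API; the prompt, memory figure, and `device_map` setting are illustrative assumptions rather than official instructions.
+
+```python
+# Minimal sketch: load InternLM-20B from its Transformers-format weights.
+# Assumes `transformers` and `sentencepiece` are installed; device_map="auto"
+# additionally requires `accelerate`, and the bfloat16 weights take roughly 40 GB.
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-20b", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    "internlm/internlm-20b", torch_dtype="auto", device_map="auto", trust_remote_code=True
+)
+
+inputs = tokenizer("A poem about the sea:", return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=64)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```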
+
+## Performance Evaluation
+
+On the 5 capability dimensions proposed by OpenCompass, InternLM-20B has achieved excellent results (the bolded scores represent the best performances within the 13B-33B parameter range).
+
+| Capability | Llama-13B | Llama2-13B | Baichuan2-13B | InternLM-20B | Llama-33B | Llama-65B | Llama2-70B |
+|----------|-----------|------------|---------------|--------------|-----------|-----------|------------|
+| Language | 42.5 | 47 | 47.5 | **55** | 44.6 | 47.1 | 51.6 |
+| Knowledge | 58.2 | 58.3 | 48.9 | 60.1 | **64** | 66 | 67.7 |
+| Understanding | 45.5 | 50.9 | 58.1 | **67.3** | 50.6 | 54.2 | 60.8 |
+| Reasoning | 42.7 | 43.6 | 44.2 | **54.9** | 46.4 | 49.8 | 55 |
+| Examination | 37.3 | 45.2 | 51.8 | **62.5** | 47.4 | 49.7 | 57.3 |
+| Overall | 43.8 | 47.3 | 49.4 | **59.2** | 48.9 | 51.9 | 57.4 |
+
+The table below compares the performance of mainstream open-source models on some influential and typical datasets.
+
+| | Benchmarks | Llama-13B | Llama2-13B | Baichuan2-13B | InternLM-20B | Llama-33B | Llama-65B | Llama2-70B |
+|------|------------------|-----------|------------|---------------|--------------|-----------|-----------|------------|
+| Examination | MMLU | 47.73 | 54.99 | 59.55 | **62.05** | 58.73 | 63.71 | 69.75 |
+| | C-Eval (val) | 31.83 | 41.4 | **59.01** | 58.8 | 37.47 | 40.36 | 50.13 |
+| | AGI-Eval | 22.03 | 30.93 | 37.37 | **44.58** | 33.53 | 33.92 | 40.02 |
+| Knowledge | BoolQ | 78.75 | 82.42 | 67 | **87.46** | 84.43 | 86.61 | 87.74 |
+| | TriviaQA | 52.47 | 59.36 | 46.61 | 57.26 | **66.24** | 69.79 | 70.71 |
+| | NaturalQuestions | 20.17 | 24.85 | 16.32 | 25.15 | **30.89** | 33.41 | 34.16 |
+| Understanding | CMRC | 9.26 | 31.59 | 29.85 | **68.78** | 14.17 | 34.73 | 43.74 |
+| | CSL | 55 | 58.75 | 63.12 | **65.62** | 57.5 | 59.38 | 60 |
+| | RACE (middle) | 53.41 | 63.02 | 68.94 | **86.35** | 64.55 | 72.35 | 81.55 |
+| | RACE (high) | 47.63 | 58.86 | 67.18 | **83.28** | 62.61 | 68.01 | 79.93 |
+| | XSum | 20.37 | 23.37 | 25.23 | **35.54** | 20.55 | 19.91 | 25.38 |
+| Reasoning | WinoGrande | 64.64 | 64.01 | 67.32 | **69.38** | 66.85 | 69.38 | 69.77 |
+| | BBH | 37.93 | 45.62 | 48.98 | **52.51** | 49.98 | 58.38 | 64.91 |
+| | GSM8K | 20.32 | 29.57 | **52.62** | **52.62** | 42.3 | 54.44 | 63.31 |
+| | PIQA | 79.71 | 79.76 | 78.07 | 80.25 | **81.34** | 82.15 | 82.54 |
+| Programming | HumanEval | 14.02 | 18.9 | 17.07 | **25.61** | 17.68 | 18.9 | 26.22 |
+| | MBPP | 20.6 | 26.8 | 30.8 | **35.6** | 28.4 | 33.6 | 39.6 |
+
+Overall, InternLM-20B comprehensively outperforms open-source models in the 13B parameter range in terms of overall capabilities, and on inference evaluation sets, it approaches or even surpasses the performance of Llama-65B.
+
+- The evaluation results were obtained from [OpenCompass 20230920](https://github.com/internLM/OpenCompass/).
+- The evaluation data may have numerical differences due to the version iteration of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results of [OpenCompass](https://github.com/internLM/OpenCompass/).
diff --git a/model_cards/internlm_7b.md b/model_cards/internlm_7b.md
new file mode 100644
index 0000000..f700a70
--- /dev/null
+++ b/model_cards/internlm_7b.md
@@ -0,0 +1,36 @@
+# InternLM-7B Model Card
+
+## Introduction
+
+InternLM-7B contains a 7 billion parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:
+
+- It leverages trillions of high-quality tokens for training to establish a powerful knowledge base.
+- It supports an 8k context window length, enabling longer input sequences and stronger reasoning capabilities.
+- It provides a versatile toolset for users to flexibly build their own workflows.
+
+## Model Zoo
+
+| Model | Transformers(HF) | ModelScope(HF) | OpenXLab(HF) | OpenXLab(Original) | Release Date |
+|---------------------------|------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|
+| **InternLM Chat 7B** | [🤗internlm/internlm-chat-7b](https://huggingface.co/internlm/internlm-chat-7b) | [ Shanghai_AI_Laboratory/internlm-chat-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-chat-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-chat-7b-original) | 2023-12-12 |
+| **InternLM 7B** | [🤗internlm/internlm-7b](https://huggingface.co/internlm/internlm-7b) | [ Shanghai_AI_Laboratory/internlm-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-7b/summary) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-7b) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/OpenLMLab/InternLM-7b-original) | 2023-07-06 |
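+
+A minimal chat sketch is shown below; it assumes the `chat()` helper exposed by the model's remote code when loaded with `trust_remote_code=True`, and the prompt and dtype are illustrative.
+
+```python
+# Minimal sketch: one chat round-trip with the Transformers-format chat weights.
+# The chat() helper comes from the model's custom remote code, not from the core
+# transformers API; a single GPU with enough memory for bfloat16 weights is assumed.
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-chat-7b", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    "internlm/internlm-chat-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
+).cuda().eval()
+
+response, history = model.chat(tokenizer, "Hello! Who are you?", history=[])
+print(response)
+```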
+
+## Performance Evaluation
+
+We conducted a comprehensive evaluation of InternLM using the open-source evaluation tool [OpenCompass](https://github.com/internLM/OpenCompass/). The evaluation covered five dimensions of capabilities: disciplinary competence, language competence, knowledge competence, inference competence, and comprehension competence. Here are some of the evaluation results, and you can visit the [OpenCompass leaderboard](https://opencompass.org.cn/rank) for more evaluation results.
+
+| Datasets\Models | **InternLM-Chat-7B** | **InternLM-7B** | LLaMA-7B | Baichuan-7B | ChatGLM2-6B | Alpaca-7B | Vicuna-7B |
+| --------------- | -------------------------- | --------------------- | -------- | ----------- | ----------- | --------- | --------- |
+| C-Eval(Val) | 52.0 | 53.4 | 24.2 | 42.7 | 50.9 | 28.9 | 31.2 |
+| MMLU | 52.6 | 51.0 | 35.2* | 41.5 | 46.0 | 39.7 | 47.3 |
+| AGIEval | 46.4 | 37.6 | 20.8 | 24.6 | 39.0 | 24.1 | 26.4 |
+| CommonSenseQA | 80.8 | 59.5 | 65.0 | 58.8 | 60.0 | 68.7 | 66.7 |
+| BUSTM | 80.6 | 50.6 | 48.5 | 51.3 | 55.0 | 48.8 | 62.5 |
+| CLUEWSC | 81.8 | 59.1 | 50.3 | 52.8 | 59.8 | 50.3 | 52.2 |
+| MATH | 5.0 | 7.1 | 2.8 | 3.0 | 6.6 | 2.2 | 2.8 |
+| GSM8K | 36.2 | 31.2 | 10.1 | 9.7 | 29.2 | 6.0 | 15.3 |
+| HumanEval | 15.9 | 10.4 | 14.0 | 9.2 | 9.2 | 9.2 | 11.0 |
+| RACE(High) | 80.3 | 57.4 | 46.9* | 28.1 | 66.3 | 40.7 | 54.0 |
+
+- The evaluation results were obtained from [OpenCompass 20230706](https://github.com/internLM/OpenCompass/) (results marked with * are taken from the original papers), and the evaluation configuration can be found in the configuration files provided by [OpenCompass](https://github.com/internLM/OpenCompass/).
+- The evaluation data may have numerical differences due to the version iteration of [OpenCompass](https://github.com/internLM/OpenCompass/), so please refer to the latest evaluation results of [OpenCompass](https://github.com/internLM/OpenCompass/).
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..dce94b4
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,2 @@
+transformers<4.30.0
+sentencepiece
diff --git a/requirements/runtime.txt b/requirements/runtime.txt
deleted file mode 100644
index e60ee2f..0000000
--- a/requirements/runtime.txt
+++ /dev/null
@@ -1,16 +0,0 @@
-transformers<4.30.0
-sentencepiece
-numpy
-tqdm
-psutil
-packaging
-pre-commit
-ninja
-gputil
-pytest
-packaging
-boto3
-botocore
-torch-scatter
-pyecharts
--f https://data.pyg.org/whl/torch-1.13.1+cu117.html
diff --git a/requirements/torch.txt b/requirements/torch.txt
deleted file mode 100644
index 4b1efcb..0000000
--- a/requirements/torch.txt
+++ /dev/null
@@ -1,4 +0,0 @@
---extra-index-url https://download.pytorch.org/whl/cu117
-torch==1.13.1+cu117
-torchvision==0.14.1+cu117
-torchaudio==0.13.1
diff --git a/tests/__init__.py b/tests/__init__.py
deleted file mode 100644
index e69de29..0000000
diff --git a/tests/test_model/test_embedding.py b/tests/test_model/test_embedding.py
deleted file mode 100644
index 324ca2b..0000000
--- a/tests/test_model/test_embedding.py
+++ /dev/null
@@ -1,65 +0,0 @@
-import multiprocessing as mp
-
-import pytest
-import torch
-
-from internlm.model.embedding import Embedding1D
-from tests.test_model.test_model_internlm import build_environment, seed_all
-
-
-def check_embedding(args):
- # init
- rank, world_size = args
- device = torch.device("cuda")
- build_environment(rank, world_size)
- rtol, atol = (1e-3, 5e-3)
- vocab_size = 4
- hidden_size = 2
-
- # fix seed
- seed_all(1024)
-
- # define embedding
- embedding = Embedding1D(
- num_embeddings=vocab_size,
- embedding_dim=hidden_size,
- padding_idx=None,
- )
-
- embedding.weight.data.copy_(torch.randn(vocab_size, hidden_size))
- embedding = embedding.to(device)
-
- # create input
- input_ids = torch.tensor([[0, 2], [1, 3]]).to(device)
- result = embedding(input_ids)
-
- standard_list = [[[-1.4837, 0.2671], [0.6002, -0.5496]], [[-1.8337, -0.1047], [1.0391, 0.2261]]]
- standard_result = torch.tensor(standard_list).to(device)
-
- # check output
- assert torch.allclose(result, standard_result, rtol=rtol, atol=atol, equal_nan=True)
-
- loss = torch.randn_like(result)
-
- # backward
- result.backward(loss)
-
- grad = embedding.weight.grad
- standard_glist = [[-0.4461, 0.5602], [0.4353, 1.2988], [-0.0625, -1.3609], [0.9595, -0.1144]]
- standard_grad = torch.tensor(standard_glist).to(device)
-
- # check grad
- assert torch.allclose(grad, standard_grad, rtol=rtol, atol=atol, equal_nan=True)
-
-
-@pytest.mark.embedding
-def test_embedding():
- ctx = mp.get_context("spawn")
- with ctx.Pool(processes=8) as pool:
- pool.map(check_embedding, [[rank, 8] for rank in range(8)])
- pool.close()
- pool.join()
-
-
-if __name__ == "__main__":
- pytest.main(["-s", "-q", "test_embedding.py"])
diff --git a/tests/test_model/test_model_internlm.py b/tests/test_model/test_model_internlm.py
deleted file mode 100644
index fb9c678..0000000
--- a/tests/test_model/test_model_internlm.py
+++ /dev/null
@@ -1,379 +0,0 @@
-import multiprocessing as mp
-import random
-
-import numpy as np
-import pytest
-import torch
-from torch import nn
-
-import internlm
-from internlm.core.context import ParallelMode
-from internlm.core.context.parallel_context import Config
-from internlm.core.context.parallel_context import global_context as gpc
-from internlm.model.linear import RewardModelLinear, ScaleColumnParallelLinear
-from internlm.model.modeling_internlm import PackedFlashBaseLayer1D
-from internlm.model.utils import gather_forward_split_backward
-
-config = Config(
- dict(
- parallel=dict(zero1=1, pipeline=dict(size=1, interleaved_overlap=False), sequence_parallel=False, tensor=1),
- model_type="INTERNLM",
- data=dict(seq_len=2048, micro_num=1, micro_bsz=1, pack_sample_into_one=False, min_length=0, total_steps=9999),
- model=dict(
- checkpoint=False,
- num_attention_heads=2,
- embed_split_hidden=True,
- vocab_size=103168,
- embed_grad_scale=1,
- parallel_output=True,
- hidden_size=1024,
- num_layers=2,
- mlp_ratio=1,
- apply_post_layer_norm=False,
- dtype=torch.bfloat16,
- norm_type="rmsnorm",
- layer_norm_epsilon=1e-5,
- use_flash_attn=True,
- num_chunks=1,
- ),
- resume_tb_folder="",
- tensorboard_folder="",
- alert_address=None,
- monitor=dict(alert=dict(enable_feishu_alert=False, feishu_alert_address=None, light_monitor_address=None)),
- )
-)
-
-
-def build_environment(rank, world_size):
- import os
-
- os.environ["RANK"] = str(rank)
- os.environ["LOCAL_RANK"] = str(rank)
- os.environ["WORLD_SIZE"] = str(world_size)
- os.environ["MASTER_ADDR"] = "127.0.0.1"
- os.environ["MASTER_PORT"] = "12345"
- torch.cuda.empty_cache()
- # launcher="torch"
- internlm.launch_from_torch(config=config, seed=1024)
-
-
-def seed_all(seed, cuda_deterministic=False):
- random.seed(seed)
- np.random.seed(seed)
- torch.manual_seed(seed)
- if torch.cuda.is_available():
- torch.cuda.manual_seed(seed)
- torch.cuda.manual_seed_all(seed)
- if cuda_deterministic: # slower, more reproducible
- torch.backends.cudnn.deterministic = True
- torch.backends.cudnn.benchmark = False
- else:
- torch.backends.cudnn.deterministic = False
- torch.backends.cudnn.benchmark = True
-
-
-def check_block(args):
- # init
- rank, world_size = args
- build_environment(rank, world_size)
- device = torch.device("cuda")
- rtol, atol = (1e-3, 5e-3)
-
- # fix seed
- seed_all(1024)
-
- # define block
- blocks = nn.ModuleList(
- [
- PackedFlashBaseLayer1D(
- hidden_size=4, # 768
- num_attention_heads=2, # 12
- mlp_ratio=2,
- attn_drop_rate=0.0,
- drop_rate=0.0,
- dtype=torch.bfloat16,
- layer_norm_epsilon=1e-5,
- checkpoint=lid < 0,
- layer_idx=lid + 0, # This parameter is used for caching during generation
- residual_in_fp32=False,
- device=device,
- norm_type="rmsnorm",
- dropout_selective_checkpoint=True,
- use_scaled_init=True,
- use_swiglu=True,
- )
- for lid in range(4) # 32
- ]
- )
-
- # create input
- cu_seqlens = torch.tensor([0, 2, 4], dtype=torch.int32).to(device) # [0, 8, 16]
- indexes = torch.tensor([0, 1, 0, 1]).to(device) # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7]
- hidden_states = torch.tensor([[0, 3, 2, 1]]).to(device) # [[4, 118, 0, 1, 2, 3, 0, 1, 1, 97, 0, 0, 0, 0, 0, 0]]
- max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max().item()
-
- hidden_states = torch.tensor(
- [
- [
- [-1.1620, 1.3113, 0.1507, 2.2698],
- [-1.2610, 1.0990, 0.3787, -0.3478],
- [1.4001, 1.1982, -0.6696, 0.3269],
- [1.3304, 1.2262, 1.0735, -1.1169],
- ]
- ]
- )
-
- hidden_states = hidden_states.squeeze(0).to(device).requires_grad_()
-
- # forward
- for _, block in enumerate(blocks):
- block = block.to(torch.bfloat16)
- block = block.to(device)
- hidden_states = block(
- hidden_states,
- cu_seqlens=cu_seqlens,
- indexes=indexes,
- inference_params=None,
- max_seqlen=max_seqlen,
- )
-
- result = hidden_states
- standard_result = torch.tensor(
- [
- [-1.1621, 1.3111, 0.1509, 2.2697],
- [-1.2611, 1.0988, 0.3787, -0.3478],
- [1.4000, 1.1982, -0.6694, 0.3268],
- [1.3303, 1.2262, 1.0736, -1.1169],
- ]
- ).to(device)
-
- # check output
- assert torch.allclose(result, standard_result, rtol=rtol, atol=atol)
-
- hidden_states.retain_grad()
- loss = torch.randn_like(result)
-
- # backward
- result.backward(loss)
-
- grad = hidden_states.grad
- standard_grad = torch.tensor(
- [
- [0.7999, -0.2595, 0.2649, -1.3256],
- [0.7064, 0.0283, -0.5508, 0.6494],
- [-1.4657, -2.0316, 1.3776, 0.7211],
- [-0.6046, 0.4329, -0.1884, 1.1170],
- ]
- ).to(device)
-
- # check grad
- assert torch.allclose(grad, standard_grad, rtol=rtol, atol=atol)
-
-
-def check_head(args):
- # init
- rank, world_size, is_reward = args
- device = torch.device("cuda")
- build_environment(rank, world_size)
- rtol, atol = (1e-3, 5e-3)
- hidden_size = 4
- vocab_size = 4
- embed_grad_scale = 1
-
- # fix seed
- seed_all(1024)
-
- # load standard
- if is_reward:
- head_cls = RewardModelLinear
- standard_result = torch.tensor([[3.5938], [1.0703], [3.6250], [3.6250]], dtype=torch.bfloat16).to(device)
- standard_grad = torch.tensor(
- [
- [-0.2246, 0.0164, -0.0591, 0.1660],
- [-0.5625, 0.0408, -0.1484, 0.4160],
- [-0.1758, 0.0128, -0.0464, 0.1299],
- [-0.4785, 0.0347, -0.1260, 0.3516],
- ],
- dtype=torch.bfloat16,
- ).to(device)
- else:
- head_cls = ScaleColumnParallelLinear
- standard_result = torch.tensor(
- [
- [3.5938, -2.2188, 2.0312, 3.5625],
- [1.0703, -1.1797, 1.1406, 1.6641],
- [3.6250, -2.0156, 1.7656, 3.4531],
- [3.6250, -2.0156, 1.7656, 3.4531],
- ],
- dtype=torch.bfloat16,
- ).to(device)
- standard_grad = torch.tensor(
- [
- [-0.2354, 0.0981, -0.2930, -0.6328],
- [0.2344, -0.2334, -0.0918, 0.1396],
- [-0.5898, -1.0156, -0.7070, 1.3750],
- [0.0242, -0.1494, 0.1206, -0.0427],
- ],
- dtype=torch.bfloat16,
- ).to(device)
-
- # define head
- head = head_cls(
- in_features=hidden_size,
- out_features=gpc.get_world_size(ParallelMode.TENSOR) if is_reward else vocab_size,
- process_group=gpc.get_group(ParallelMode.TENSOR),
- bias=False,
- device=device,
- dtype=torch.bfloat16,
- weight_scale=embed_grad_scale,
- )
-
- head = head.to(torch.bfloat16)
- head = head.to(device)
-
- # create input
- hidden_states = torch.tensor(
- [
- [8.3726, 1.9245, 5.5101, 1.0000],
- [3.3474, 2.9582, 1.0000, 1.0000],
- [8.3726, 1.2875, 5.5101, 1.0000],
- [8.3726, 1.2875, 5.5101, 1.0000],
- ],
- dtype=torch.bfloat16,
- requires_grad=True,
- ).to(device)
-
- # forward
- result = head(hidden_states)
-
- # check output
- assert torch.allclose(result, standard_result, rtol=rtol, atol=atol)
-
- hidden_states.retain_grad()
- loss = torch.randn_like(result)
-
- # backward
- result.backward(loss)
- grad = hidden_states.grad
-
- # check grad
- assert torch.allclose(grad, standard_grad, rtol=rtol, atol=atol)
-
-
-def check_gather_forward(args):
- # init
- rank, world_size, parallel_tensor = args
- assert parallel_tensor in [1, 2]
- config.parallel.tensor = parallel_tensor
- device = torch.device("cuda")
- build_environment(rank, world_size)
- rtol, atol = (1e-3, 5e-3)
-
- # fix seed
- seed_all(1024)
-
- # load standard
- if parallel_tensor == 1:
- standard_result = torch.tensor(
- [
- [8.3726, 1.9245, 5.5101, 1.0000],
- [3.3474, 2.9582, 1.0000, 1.0000],
- [8.3726, 1.2875, 5.5101, 1.0000],
- [8.3726, 1.2875, 5.5101, 1.0000],
- ]
- ).to(device)
- standard_grad = torch.tensor(
- [
- [-0.4461, 0.5602, -0.0625, -1.3609],
- [0.4353, 1.2988, 0.9595, -0.1144],
- [-0.7593, -0.4031, 0.2041, 1.4955],
- [0.5706, 0.9047, -0.6965, -0.3757],
- ]
- ).to(device)
- else:
- standard_result = torch.tensor(
- [
- [8.3726, 1.9245, 5.5101, 1.0000, 8.3726, 1.9245, 5.5101, 1.0000],
- [3.3474, 2.9582, 1.0000, 1.0000, 3.3474, 2.9582, 1.0000, 1.0000],
- [8.3726, 1.2875, 5.5101, 1.0000, 8.3726, 1.2875, 5.5101, 1.0000],
- [8.3726, 1.2875, 5.5101, 1.0000, 8.3726, 1.2875, 5.5101, 1.0000],
- ]
- ).to(device)
- if rank % 2 == 0:
- standard_grad = torch.tensor(
- [
- [-0.4461, 0.5602, -0.0625, -1.3609],
- [-0.7593, -0.4031, 0.2041, 1.4955],
- [0.8093, 1.7580, 1.2996, -0.7545],
- [1.0474, -0.5767, -1.0401, 0.8233],
- ]
- ).to(device)
- else:
- standard_grad = torch.tensor(
- [
- [0.4353, 1.2988, 0.9595, -0.1144],
- [0.5706, 0.9047, -0.6965, -0.3757],
- [-1.3589, -0.7202, 0.6094, -0.8208],
- [-1.0042, 0.3695, 0.2511, -0.2718],
- ]
- ).to(device)
-
- # create input
- hidden_states = torch.tensor(
- [
- [8.3726, 1.9245, 5.5101, 1.0000],
- [3.3474, 2.9582, 1.0000, 1.0000],
- [8.3726, 1.2875, 5.5101, 1.0000],
- [8.3726, 1.2875, 5.5101, 1.0000],
- ],
- requires_grad=True,
- ).to(device)
-
- # forward
- result = gather_forward_split_backward(hidden_states, ParallelMode.TENSOR, dim=-1)
-
- # check output
- assert torch.allclose(result, standard_result, rtol=rtol, atol=atol)
-
- loss = torch.randn_like(result)
- hidden_states.retain_grad()
-
- # backward
- result.backward(loss)
- grad = hidden_states.grad
-
- # check grad
- assert torch.allclose(grad, standard_grad, rtol=rtol, atol=atol)
-
-
-@pytest.mark.block
-def test_block():
- ctx = mp.get_context("spawn")
- with ctx.Pool(processes=8) as pool:
- pool.map(check_block, [[rank, 8] for rank in range(8)])
- pool.close()
- pool.join()
-
-
-@pytest.mark.head
-@pytest.mark.parametrize("is_reward", [True, False])
-def test_head(is_reward):
- ctx = mp.get_context("spawn")
- with ctx.Pool(processes=8) as pool:
- pool.map(check_head, [[rank, 8, is_reward] for rank in range(8)])
- pool.close()
- pool.join()
-
-
-@pytest.mark.gather_forward
-@pytest.mark.parametrize("parallel_tensor", [1, 2])
-def test_gather_forward(parallel_tensor):
- ctx = mp.get_context("spawn")
- with ctx.Pool(processes=8) as pool:
- pool.map(check_gather_forward, [[rank, 8, parallel_tensor] for rank in range(8)])
- pool.close()
- pool.join()
-
-
-if __name__ == "__main__":
- pytest.main(["-s", "-q", "test_model_internlm.py"])
diff --git a/tests/test_model/test_norm.py b/tests/test_model/test_norm.py
deleted file mode 100644
index 4078ef5..0000000
--- a/tests/test_model/test_norm.py
+++ /dev/null
@@ -1,84 +0,0 @@
-import multiprocessing as mp
-
-import pytest
-import torch
-
-from internlm.model.utils import try_import_RMSNorm
-from tests.test_model.test_model_internlm import build_environment, seed_all
-
-RMSNorm = try_import_RMSNorm()
-
-
-def check_norm(args):
- # init
- rank, world_size = args
- device = torch.device("cuda")
- build_environment(rank, world_size)
- rtol, atol = (1e-3, 5e-3)
- hidden_size = 4
- layer_norm_epsilon = 1e-05
-
- # fix seed
- seed_all(1024)
-
- # define norm
- norm = RMSNorm(hidden_size, eps=layer_norm_epsilon)
- norm = norm.to(device)
-
- # create input
- hidden_states = torch.tensor(
- [
- [8.3726, 1.9245, 5.5101, 1.0000],
- [3.3474, 2.9582, 1.0000, 1.0000],
- [8.3726, 1.2875, 5.5101, 1.0000],
- [8.3726, 1.2875, 5.5101, 1.0000],
- ],
- requires_grad=True,
- ).to(device)
-
- # forward
- result = norm(hidden_states.float())
-
- standard = torch.tensor(
- [
- [1.6329, 0.3753, 1.0746, 0.1950],
- [1.4288, 1.2626, 0.4268, 0.4268],
- [1.6490, 0.2536, 1.0852, 0.1970],
- [1.6490, 0.2536, 1.0852, 0.1970],
- ]
- ).to(device)
-
- # check output
- assert torch.allclose(result, standard, rtol=rtol, atol=atol, equal_nan=True)
-
- hidden_states.retain_grad()
- loss = torch.randn_like(result)
-
- # backward
- result.backward(loss)
- grad = hidden_states.grad
-
- standard_grad = torch.tensor(
- [
- [-0.0193, 0.1248, 0.0324, -0.2573],
- [-0.2140, 0.2010, 0.2901, -0.1683],
- [-0.0815, -0.0689, 0.0850, 0.3027],
- [0.0847, 0.1739, -0.1554, -0.0773],
- ]
- ).to(device)
-
- # check grad
- assert torch.allclose(grad, standard_grad, rtol=rtol, atol=atol, equal_nan=True)
-
-
-@pytest.mark.norm
-def test_norm():
- ctx = mp.get_context("spawn")
- with ctx.Pool(processes=8) as pool:
- pool.map(check_norm, [[rank, 8] for rank in range(8)])
- pool.close()
- pool.join()
-
-
-if __name__ == "__main__":
- pytest.main(["-s", "-q", "test_norm.py"])
diff --git a/tests/test_solver/test_optimizer.py b/tests/test_solver/test_optimizer.py
deleted file mode 100644
index 6a22797..0000000
--- a/tests/test_solver/test_optimizer.py
+++ /dev/null
@@ -1,364 +0,0 @@
-import copy
-import multiprocessing as mp
-import random
-
-import numpy as np
-import pytest
-import torch
-from torch import nn
-from torch.nn.parallel import DistributedDataParallel as DDP
-from torch.testing import assert_close
-
-import internlm
-from internlm.core.context.parallel_context import Config
-from internlm.solver.optimizer import HybridZeroOptimizer
-from internlm.solver.optimizer.utils import ParamBcastSyncHandler
-
-
-class MlpModel(nn.Module):
- def __init__(self):
- super().__init__()
- self.linear1 = nn.Linear(128, 256)
- self.linear2 = nn.Linear(256, 512)
-
- def forward(self, x):
- x = self.linear1(x)
- x = self.linear2(x)
- return x
-
-
-config = Config(
- dict(
- parallel=dict(zero1=1, pipeline=dict(size=1, interleaved_overlap=False), sequence_parallel=False, tensor=1),
- model_type="INTERNLM",
- data=dict(seq_len=2048, micro_num=1, micro_bsz=1, pack_sample_into_one=False, min_length=0, total_steps=9999),
- model=dict(
- dtype=torch.bfloat16,
- ),
- resume_tb_folder="",
- tensorboard_folder="",
- alert_address=None,
- monitor=dict(alert=dict(enable_feishu_alert=False, feishu_alert_address=None, light_monitor_address=None)),
- grad_scaler=dict(
- fp16=dict(
- initial_scale=1,
- min_scale=1,
- growth_interval=1,
- ),
- growth_factor=1.1,
- backoff_factor=0.9,
- max_scale=1,
- hysteresis=1,
- ),
- adam=dict(
- lr=1e-4,
- adam_beta1=0.9,
- adam_beta2=0.95,
- adam_beta2_c=0,
- adam_eps=1e-8,
- weight_decay=0.01,
- ),
- hybrid_zero_optimizer=dict(
- overlap_sync_grad=False,
- overlap_sync_param=False,
- reduce_bucket_size=512 * 1024 * 1024,
- clip_grad_norm=1.0,
- ),
- )
-)
-
-
-def build_environment(rank, world_size):
- import os
-
- os.environ["RANK"] = str(rank)
- os.environ["LOCAL_RANK"] = str(rank)
- os.environ["WORLD_SIZE"] = str(world_size)
- os.environ["MASTER_ADDR"] = "127.0.0.1"
- os.environ["MASTER_PORT"] = "12345"
- torch.cuda.empty_cache()
- # launcher="torch"
- internlm.launch_from_torch(config=config, seed=1024)
-
-
-def loose_close(a, b, dtype: torch.dtype = torch.float32):
-
- if dtype is torch.float32:
- rtol = 1.3e-6
- atol = 1e-5
- elif dtype is torch.bfloat16:
- rtol = 2e-2
- atol = 2e-2
-
- if isinstance(a, torch.Tensor):
- a = a.detach().to(dtype)
- b = b.detach().to(dtype)
-
- assert_close(a, b, rtol=rtol, atol=atol)
-
-
-def init_optimizer_grouped_parameters(check_group, model):
- if check_group:
- optimizer_grouped_parameters = [
- {
- "params": list(model.parameters())[:2],
- "weight_decay": config.adam.weight_decay,
- },
- {
- "params": list(model.parameters())[2:],
- "weight_decay": config.adam.weight_decay,
- },
- ]
- else:
- optimizer_grouped_parameters = [{"params": model.parameters(), "weight_decay": config.adam.weight_decay}]
-
- return optimizer_grouped_parameters
-
-
-def seed_all(seed, cuda_deterministic=False):
- random.seed(seed)
- np.random.seed(seed)
- torch.manual_seed(seed)
- if torch.cuda.is_available():
- torch.cuda.manual_seed(seed)
- torch.cuda.manual_seed_all(seed)
- if cuda_deterministic: # slower, more reproducible
- torch.backends.cudnn.deterministic = True
- torch.backends.cudnn.benchmark = False
- else:
- torch.backends.cudnn.deterministic = False
- torch.backends.cudnn.benchmark = True
-
-
-def exam_hybrid_zero_optim_with_ddp(args):
- # init
- rank, world_size, zero_parallel, overlap_sync_param, overlap_sync_grad, micro_num, check_group, dtype = args
- # TODO: Need to test the combination of overlap_sync_param and grouped params when ready
- # ParamBcastSyncHandler currently does not consider parameters in different optimizer groups
- if overlap_sync_param and check_group:
- return
- config.parallel.zero1 = zero_parallel
- config.hybrid_zero_optimizer.overlap_sync_param = overlap_sync_param
- config.hybrid_zero_optimizer.overlap_sync_grad = overlap_sync_grad
- config.data.micro_num = micro_num
- config.model.dtype = dtype
- total_step = 5
- if not overlap_sync_param:
- total_step = 1
-
- build_environment(rank, world_size)
- seed_all(1024)
-
- # create models
- torch_model = MlpModel().cuda()
- zero_model = copy.deepcopy(torch_model).to(dtype)
- torch_model = DDP(torch_model.cuda(), static_graph=True).cuda()
-
- # create optimizer
- if config.hybrid_zero_optimizer.overlap_sync_param:
- param_bcast_sync_handler = ParamBcastSyncHandler(zero_model)
- else:
- param_bcast_sync_handler = None
-
- optimizer_grouped_parameters_zero = init_optimizer_grouped_parameters(check_group, zero_model)
- optimizer_grouped_parameters_torch = init_optimizer_grouped_parameters(check_group, torch_model)
-
- naive_optimizer = torch.optim.AdamW(
- params=optimizer_grouped_parameters_zero,
- lr=config.adam.lr,
- betas=(config.adam.adam_beta1, config.adam.adam_beta2),
- eps=config.adam.adam_eps,
- )
-
- zero_optimizer = HybridZeroOptimizer(
- naive_optimizer,
- grad_scal_cfg=config.grad_scaler,
- zero_cfg=config.hybrid_zero_optimizer,
- param_bcast_sync_handler=param_bcast_sync_handler,
- )
-
- torch_optimizer = torch.optim.AdamW(
- params=optimizer_grouped_parameters_torch,
- lr=config.adam.lr,
- betas=(config.adam.adam_beta1, config.adam.adam_beta2),
- eps=config.adam.adam_eps,
- )
-
- for _ in range(total_step):
- zero_optimizer.zero_grad()
- torch_optimizer.zero_grad()
- zero_optimizer.skip_grad_reduce = True
- for num in range(micro_num):
- if num == micro_num - 1:
- zero_optimizer.skip_grad_reduce = False
-
- seed_all(1024 + rank)
- # create input
- input_data = torch.rand(16, 128).cuda()
-
- # zero-dp forward
- zero_output = zero_model(input_data.to(dtype))
-
- # torch-ddp forward
- torch_output = torch_model(input_data)
-
- # check output
- loose_close(zero_output, torch_output, dtype=dtype)
-
- # zero-dp backward
- zero_optimizer.backward(zero_output.mean())
-
- # torch-ddp backward
- if num == micro_num - 1:
- torch_output.mean().backward()
- else:
- with torch_model.no_sync():
- torch_output.mean().backward()
-
- # zero-dp step
- zero_optimizer.step()
-
- # torch-ddp step
- torch_optimizer.step()
-
- # check grad
- if check_group:
- group1 = zip(list(torch_model.parameters())[:2], list(zero_model.parameters())[:2])
- group2 = zip(list(torch_model.parameters())[2:], list(zero_model.parameters())[2:])
- for torch_parm, zero_parm in group1:
- if zero_parm.grad is not None:
- loose_close(torch_parm.grad, zero_parm.grad, dtype=dtype)
- for torch_parm, zero_parm in group2:
- if zero_parm.grad is not None:
- loose_close(torch_parm.grad, zero_parm.grad, dtype=dtype)
- else:
- for torch_parm, zero_parm in zip(torch_model.parameters(), zero_model.parameters()):
- if zero_parm.grad is not None:
- loose_close(torch_parm.grad, zero_parm.grad, dtype=dtype)
-
- torch.cuda.synchronize()
- # check updated param
- if check_group:
- group1 = zip(list(torch_model.parameters())[:2], list(zero_model.parameters())[:2])
- group2 = zip(list(torch_model.parameters())[2:], list(zero_model.parameters())[2:])
- for torch_parm, zero_parm in group1:
- loose_close(torch_parm, zero_parm, dtype=dtype)
- for torch_parm, zero_parm in group2:
- loose_close(torch_parm, zero_parm, dtype=dtype)
- else:
- for torch_parm, zero_parm in zip(torch_model.parameters(), zero_model.parameters()):
- loose_close(torch_parm, zero_parm, dtype=dtype)
-
-
-def exam_hybrid_zero_optim_with_ckpt_load_save(args):
- # init
- rank, world_size, zero_parallel, check_group, dtype = args
- config.parallel.zero1 = zero_parallel
- config.model.dtype = dtype
-
- build_environment(rank, world_size)
-
- # create models
- zero_model = MlpModel().cuda().to(dtype)
-
- # create optimizer
- if config.hybrid_zero_optimizer.overlap_sync_param:
- param_bcast_sync_handler = ParamBcastSyncHandler(zero_model)
- else:
- param_bcast_sync_handler = None
-
- optimizer_grouped_parameters1 = init_optimizer_grouped_parameters(check_group, zero_model)
- optimizer_grouped_parameters2 = init_optimizer_grouped_parameters(check_group, zero_model)
-
- naive_optimizer = torch.optim.AdamW(
- params=optimizer_grouped_parameters1,
- lr=config.adam.lr,
- betas=(config.adam.adam_beta1, config.adam.adam_beta2),
- eps=config.adam.adam_eps,
- )
-
- zero_optimizer = HybridZeroOptimizer(
- naive_optimizer,
- grad_scal_cfg=config.grad_scaler,
- zero_cfg=config.hybrid_zero_optimizer,
- param_bcast_sync_handler=param_bcast_sync_handler,
- )
-
- naive_optimizer2 = torch.optim.AdamW(
- params=optimizer_grouped_parameters2,
- lr=config.adam.lr,
- betas=(config.adam.adam_beta1, config.adam.adam_beta2),
- eps=config.adam.adam_eps,
- )
-
- zero_optimizer2 = HybridZeroOptimizer(
- naive_optimizer2,
- grad_scal_cfg=config.grad_scaler,
- zero_cfg=config.hybrid_zero_optimizer,
- param_bcast_sync_handler=param_bcast_sync_handler,
- )
-
- # save and load states
- states = zero_optimizer.state_dict()
- zero_optimizer2.load_state_dict(states)
-
- # check fp32 model weights
- for zero1_param, zero2_param in zip(
- zero_optimizer._fp32_flat_param_groups_of_current_rank.values(),
- zero_optimizer2._fp32_flat_param_groups_of_current_rank.values(),
- ):
- assert torch.equal(zero1_param, zero2_param)
-
- # check fp16 model weights
- for zero1_param, zero2_param in zip(
- zero_optimizer._fp16_param_groups.values(), zero_optimizer2._fp16_param_groups.values()
- ):
- assert zero1_param == zero2_param
-
-
-zero_parallel_check_list = [-1, 1, 4]
-overlap_sync_param_check_list = [True, False]
-overlap_sync_grad_check_list = [True, False]
-micro_num_check_list = [1, 2, 4]
-check_group_list = [True, False]
-dtype_list = [torch.float32, torch.bfloat16]
-
-
-@pytest.mark.parametrize("zero_parallel", zero_parallel_check_list)
-@pytest.mark.parametrize("overlap_sync_param", overlap_sync_param_check_list)
-@pytest.mark.parametrize("overlap_sync_grad", overlap_sync_grad_check_list)
-@pytest.mark.parametrize("micro_num", micro_num_check_list)
-@pytest.mark.parametrize("check_group", check_group_list)
-@pytest.mark.parametrize("dtype", dtype_list)
-def test_hybrid_zero_optim_with_ddp(
- zero_parallel, overlap_sync_param, overlap_sync_grad, micro_num, check_group, dtype
-):
- ctx = mp.get_context("spawn")
- with ctx.Pool(processes=8) as pool:
- pool.map(
- exam_hybrid_zero_optim_with_ddp,
- [
- [rank, 8, zero_parallel, overlap_sync_param, overlap_sync_grad, micro_num, check_group, dtype]
- for rank in range(8)
- ],
- )
- pool.close()
- pool.join()
-
-
-@pytest.mark.parametrize("zero_parallel", zero_parallel_check_list)
-@pytest.mark.parametrize("check_group", check_group_list)
-@pytest.mark.parametrize("dtype", dtype_list)
-def test_hybrid_zero_optim_with_ckpt_load_save(zero_parallel, check_group, dtype):
- ctx = mp.get_context("spawn")
- with ctx.Pool(processes=8) as pool:
- pool.map(
- exam_hybrid_zero_optim_with_ckpt_load_save,
- [[rank, 8, zero_parallel, check_group, dtype] for rank in range(8)],
- )
- pool.close()
- pool.join()
-
-
-if __name__ == "__main__":
- pytest.main(["-s", "-q", "test_optimizer.py"])
diff --git a/tests/test_utils/common_fixture.py b/tests/test_utils/common_fixture.py
deleted file mode 100644
index 80cb353..0000000
--- a/tests/test_utils/common_fixture.py
+++ /dev/null
@@ -1,183 +0,0 @@
-import os
-import shutil
-from subprocess import PIPE, STDOUT, Popen
-
-import pytest
-import torch
-
-from internlm.core.context import global_context as gpc
-from internlm.core.context.parallel_context import Config
-from internlm.solver.optimizer.hybrid_zero_optim import HybridZeroOptimizer
-from internlm.utils.common import SingletonMeta
-
-OSS_NAME = os.environ["OSS_BUCKET_NAME"]
-OSS_IP = os.environ["OSS_IP"]
-USER = os.environ["USER"]
-JOB_NAME = "CI_TEST"
-LOCAL_SAVE_PATH = "local:local_ckpt"
-
-BOTO_SAVE_PATH = f"boto3:s3://{OSS_NAME}.{OSS_IP}/{USER}/{JOB_NAME}"
-BOTO_SAVE_PATH_NO_PREFIX = f"s3://{OSS_NAME}.{OSS_IP}/{USER}/{JOB_NAME}/"
-
-ASYNC_TMP_FOLDER = "./async_tmp_folder"
-
-
-# 1B
-init_config = Config(
- dict(
- parallel=dict(zero1=1, pipeline=dict(size=1, interleaved_overlap=False), sequence_parallel=False, tensor=1),
- model_type="INTERNLM",
- adam=dict(
- lr=1e-4,
- ),
- data=dict(seq_len=2048, micro_num=1, micro_bsz=1, pack_sample_into_one=False, min_length=0, total_steps=9999),
- model=dict(
- checkpoint=False,
- num_attention_heads=2,
- embed_split_hidden=True,
- vocab_size=103168,
- embed_grad_scale=1,
- parallel_output=True,
- hidden_size=1024,
- num_layers=2,
- mlp_ratio=1,
- apply_post_layer_norm=False,
- dtype=torch.bfloat16,
- norm_type="rmsnorm",
- layer_norm_epsilon=1e-5,
- use_flash_attn=True,
- num_chunks=1,
- ),
- resume_tb_folder="",
- tensorboard_folder="",
- alert_address=None,
- monitor=dict(alert=dict(enable_feishu_alert=False, feishu_alert_address=None, light_monitor_address=None)),
- )
-)
-
-
-def init_naive_model():
- # let MODEL_INITIALIZER work (registers the InternLM model class)
- import internlm.model.modeling_internlm # noqa # pylint: disable=unused-import
- from internlm.core.naive_amp import NaiveAMPModel
- from internlm.utils.registry import MODEL_INITIALIZER
-
- model = MODEL_INITIALIZER.get_module(module_name=gpc.config.model_type)(**(init_config.model))
- model = NaiveAMPModel(
- model=model,
- output_to_fp32=False,
- dtype=torch.bfloat16,
- sync_buffer=False,
- )
- return model
-
-
-def init_naive_optim(model):
- naive_optimizer = torch.optim.AdamW(
- params=[{"params": model.parameters(), "weight_decay": 0.01}],
- lr=1e-4,
- betas=(0.9, 0.95),
- eps=1e-8,
- )
- return naive_optimizer
-
-
-def init_hybrid_optim(model):
- naive_optimizer = torch.optim.AdamW(
- params=[{"params": model.parameters(), "weight_decay": 0.01}],
- lr=1e-4,
- betas=(0.9, 0.95),
- eps=1e-8,
- )
- optimizer = HybridZeroOptimizer(
- naive_optimizer,
- grad_scal_cfg=Config(
- dict(
- fp16=dict(
- initial_scale=2**16,
- min_scale=1,
- growth_interval=1000,
- ),
- growth_factor=2,
- backoff_factor=0.5,
- max_scale=2**24,
- hysteresis=2,
- )
- ),
- zero_cfg=Config(
- dict(
- overlap_sync_grad=False,
- overlap_sync_param=False,
- reduce_bucket_size=512 * 1024 * 1024,
- clip_grad_norm=1.0,
- )
- ),
- param_bcast_sync_handler=None,
- )
- return optimizer
-
-
-@pytest.fixture(autouse=True, scope="function")
-def reset_singletons():
- SingletonMeta._instances = {}
-
-
-def reset_seed():
- from internlm.core.context.random import _SEED_MANAGER
-
- _SEED_MANAGER.reset()
-
-
-@pytest.fixture(scope="module")
-def init_dist_and_model(rank=0, world_size=1):
- from internlm.initialize import initialize_distributed_env
-
- os.environ["RANK"] = str(rank)
- os.environ["LOCAL_RANK"] = str(rank)
- os.environ["WORLD_SIZE"] = str(world_size)
- os.environ["MASTER_ADDR"] = "127.0.0.1"
- os.environ["MASTER_PORT"] = "12377"
- initialize_distributed_env(config=init_config, launcher="torch", master_port=12377, args_check=False)
-
- # setup
- print("set up", flush=True)
- model = init_naive_model()
- # opim = init_naive_optim(model)
- opim = init_hybrid_optim(model)
-
- yield model, opim
-
- # teardown
- del model, opim
- print("teardown", flush=True)
- gpc.destroy()
- reset_seed()
-
-
-def enter_flag(text):
- print(f"{text} begin!", flush=True)
- yield
- print(f"{text} end!", flush=True)
-
-
-def del_tmp_file():
- try:
- shutil.rmtree(ASYNC_TMP_FOLDER, ignore_errors=True)
- except FileNotFoundError:
- pass
-
- try:
- shutil.rmtree(LOCAL_SAVE_PATH.split(":")[1], ignore_errors=True)
- except FileNotFoundError:
- pass
-
- try:
- cmd = r"/mnt/petrelfs/share/sensesync --dryrun --deleteSrc cp " + BOTO_SAVE_PATH_NO_PREFIX + " / "
- with Popen(cmd, stdout=PIPE, stderr=STDOUT, shell=True) as output:
- results, presults = "", ""
- for line in iter(output.stdout.readline, b""):
- results += str(line.rstrip())
- presults += line.rstrip().decode() + "\n"
- print(presults, flush=True)
- except: # noqa # pylint: disable=bare-except
- pass
diff --git a/tests/test_utils/test_model_checkpoint.py b/tests/test_utils/test_model_checkpoint.py
deleted file mode 100644
index 956880b..0000000
--- a/tests/test_utils/test_model_checkpoint.py
+++ /dev/null
@@ -1,358 +0,0 @@
-import os
-from functools import partial
-
-import pytest
-import torch
-import torch.distributed as dist
-
-from internlm.core.context.parallel_context import Config
-from internlm.core.trainer import TrainState
-from internlm.solver.optimizer.hybrid_zero_optim import HybridZeroOptimizer
-from internlm.utils.common import SingletonMeta
-from internlm.utils.model_checkpoint import CheckpointManager
-from internlm.utils.storage_manager import wait_async_upload_finish
-from tests.test_utils.common_fixture import ( # noqa # pylint: disable=unused-import
- ASYNC_TMP_FOLDER,
- BOTO_SAVE_PATH,
- LOCAL_SAVE_PATH,
- del_tmp_file,
- init_config,
- init_dist_and_model,
- reset_singletons,
-)
-
-# (TOTAL_STEP, CKPT_EVERY, SNAPSHOT_EVERY)
-step_info_list = [(8, 4, 2), (3, 4, 2), (1, 6, 3)]
-ckpt_config_list = [
- # Old interface format
- dict(
- enable_save_ckpt=True,
- save_ckpt_folder=BOTO_SAVE_PATH,
- load_optimizer=True,
- checkpoint_every=0,
- async_upload=True,
- async_upload_tmp_folder=ASYNC_TMP_FOLDER,
- snapshot_ckpt_folder="/".join([BOTO_SAVE_PATH, "snapshot"]),
- oss_snapshot_freq=0,
- stop_file_path=None,
- load_model_only_folder=None,
- load_given_ckpt=False,
- load_ckpt_folder=None,
- is_old_api=True,
- ),
- # Old interface format
- dict(
- enable_save_ckpt=True,
- save_ckpt_folder=LOCAL_SAVE_PATH,
- load_optimizer=True,
- checkpoint_every=0,
- async_upload=False,
- async_upload_tmp_folder=ASYNC_TMP_FOLDER,
- snapshot_ckpt_folder="/".join([LOCAL_SAVE_PATH, "snapshot"]),
- oss_snapshot_freq=0,
- stop_file_path=None,
- load_model_only_folder=None,
- load_given_ckpt=False,
- load_ckpt_folder=None,
- is_old_api=True,
- ),
- # New interface format
- dict(
- enable_save_ckpt=True,
- save_ckpt_folder=BOTO_SAVE_PATH,
- checkpoint_every=0,
- async_upload=True,
- async_upload_tmp_folder=ASYNC_TMP_FOLDER,
- oss_snapshot_freq=0,
- stop_file_path=None,
- is_old_api=False,
- auto_resume=True,
- ),
- dict(
- enable_save_ckpt=True,
- save_ckpt_folder=LOCAL_SAVE_PATH,
- checkpoint_every=0,
- async_upload=False,
- async_upload_tmp_folder=ASYNC_TMP_FOLDER,
- oss_snapshot_freq=0,
- stop_file_path=None,
- load_ckpt_folder=None,
- is_old_api=False,
- auto_resume=True,
- ),
-]
-
-
-def overwrite_optim_state(optim, set_value):
- if isinstance(optim, HybridZeroOptimizer):
- for group_id, p in optim._fp32_flat_param_groups_of_current_rank.items():
- if optim._zero_local_rank not in optim.param_group_no_params_ranks[group_id]:
- # p.copy_(torch.full_like(p, set_value, dtype=p.dtype))
- p.data.fill_(set_value)
- for group_id in range(len(optim._fp16_param_groups)):
- if optim._zero_local_rank not in optim.param_group_no_params_ranks[group_id]:
- fp16_p = optim._param_store.get_flat_fp16_param_by_rank_group(
- rank=optim._zero_local_rank, group_id=group_id
- )
- fp16_p.fill_(set_value)
- else:
- for group in optim.param_groups:
- for p in group["params"]:
- # p.copy_(torch.full_like(p, set_value, dtype=p.dtype))
- p.data.fill_(set_value)
-
-
-def compare_optim_state(optim1, optim2):
- re = True
- if isinstance(optim1, HybridZeroOptimizer):
- fp32_buff1 = optim1._fp32_flat_param_groups_of_current_rank
- fp32_buff2 = optim2._fp32_flat_param_groups_of_current_rank
- for group_id_1, group_id_2 in zip(fp32_buff1, fp32_buff2):
- re &= group_id_1 == group_id_2
- if optim1.zero_local_rank not in optim1.param_group_no_params_ranks[group_id_1]:
- re &= torch.equal(fp32_buff1[group_id_1], fp32_buff2[group_id_2])
- else:
- for group1, group2 in zip(optim1.param_groups, optim2.param_groups):
- for p1, p2 in zip(group1["params"], group2["params"]):
- re &= torch.equal(p1, p2)
- return re
-
-
-def compare_optim_value(optim, value):
- re = True
- if isinstance(optim, HybridZeroOptimizer):
- for group_id, p in optim._fp32_flat_param_groups_of_current_rank.items():
- if optim._zero_local_rank not in optim.param_group_no_params_ranks[group_id]:
- re &= torch.equal(p, torch.full_like(p, value, dtype=p.dtype))
- for group_id in range(len(optim._fp16_param_groups)):
- if optim._zero_local_rank not in optim.param_group_no_params_ranks[group_id]:
- fp16_p = optim._param_store.get_flat_fp16_param_by_rank_group(
- rank=optim._zero_local_rank, group_id=group_id
- )
- re &= torch.equal(fp16_p, torch.full_like(fp16_p, value, dtype=fp16_p.dtype))
- else:
- for group in optim.param_groups:
- for p in group["params"]:
- re &= torch.equal(p, torch.full_like(p, value, dtype=p.dtype))
- return re
-
-
-def overwrite_model_value(model, value):
- for p in model.parameters():
- # p.copy_(torch.full_like(p, value, dtype=p.dtype))
- p.data.fill_(value)
-
-
-def compare_model_value(model, value):
- re = True
- for p in model.parameters():
- re &= torch.equal(p, torch.full_like(p, value, dtype=p.dtype))
- return re
-
-
-@pytest.fixture(scope="function")
-def del_tmp():
- del_tmp_file()
- yield
- del_tmp_file()
-
-
-def return_prefix_path(save_ckpt_folder):
- if save_ckpt_folder.startswith("local:"):
- return LOCAL_SAVE_PATH
- else:
- return BOTO_SAVE_PATH
-
-
-def return_latest_save_path(save_ckpt_folder, total_step, snapshot_freq, ckpt_freq):
-
- snapshot_latest_step, normal_latest_step = 0, 0
- snapshot_latest_count, normal_latest_count = 0, 0
-
- for i in range(total_step):
- if (i + 1) % ckpt_freq == 0:
- normal_latest_step = i + 1
- normal_latest_count += 1
- else:
- if (i + 1) % snapshot_freq == 0:
- snapshot_latest_step = i + 1
- snapshot_latest_count += 1
-
- if snapshot_latest_step == 0:
- return None, None
-
- if normal_latest_step >= snapshot_latest_step:
- return normal_latest_step, os.path.join(return_prefix_path(save_ckpt_folder), f"{normal_latest_step}")
- elif normal_latest_step < snapshot_latest_step:
- if snapshot_latest_count % 2 == 0:
- re_path = f"{return_prefix_path(save_ckpt_folder)}/snapshot/0"
- else:
- re_path = f"{return_prefix_path(save_ckpt_folder)}/snapshot/1"
- return snapshot_latest_step, re_path
- else:
- assert False
-
-
-@pytest.mark.usefixtures("del_tmp")
-@pytest.mark.usefixtures("reset_singletons")
-@pytest.mark.parametrize("step_info", step_info_list)
-@pytest.mark.parametrize("ckpt_config", ckpt_config_list)
-def test_ckpt_mm(step_info, ckpt_config, init_dist_and_model): # noqa # pylint: disable=unused-import
- from internlm.core.context import global_context as gpc
- from internlm.utils.model_checkpoint import CheckpointLoadMask, CheckpointLoadType
-
- ckpt_config = Config(ckpt_config)
- total_step, checkpoint_every, oss_snapshot_freq = step_info
- print(total_step, checkpoint_every, oss_snapshot_freq, flush=True)
- ckpt_config.checkpoint_every = checkpoint_every
- ckpt_config.oss_snapshot_freq = oss_snapshot_freq
-
- bond_return_latest_save_path = partial(
- return_latest_save_path,
- ckpt_config.save_ckpt_folder,
- total_step,
- ckpt_config.oss_snapshot_freq,
- ckpt_config.checkpoint_every,
- )
-
- model, opim = init_dist_and_model
- train_state = TrainState(gpc.config, None)
- if isinstance(opim, HybridZeroOptimizer):
- print("Is HybridZeroOptimizer!", flush=True)
- else:
- print("Is naive Adam!", flush=True)
-
- ckpt_mm = CheckpointManager(ckpt_config, model=model, optimizer=opim)
- latest_ckpt_step = None
- for i in range(total_step):
- overwrite_model_value(model, i)
- overwrite_optim_state(opim, i)
-
- train_state.batch_count = i
- train_state.step_count += 1
-
- save_ckpts, _, _ = ckpt_mm.is_now_to_save_ckpt(train_state)
- if save_ckpts:
- latest_ckpt_step = i
-
- ckpt_mm.try_save_checkpoint(train_state)
-
- wait_async_upload_finish()
- latest_ckpt_info = ckpt_mm.query_lastest_ckpt()
- step, path = bond_return_latest_save_path()
- assert latest_ckpt_info["path"] == path
- if latest_ckpt_step is None:
- assert latest_ckpt_step == step
- else:
- assert latest_ckpt_step == step - 1
-
- # resume from the state before the last saved ckpt
- del ckpt_mm
- SingletonMeta._instances = {}
- ckpt_mm = CheckpointManager(ckpt_config, model=model, optimizer=opim)
- ckpt_mm.try_resume_training(train_state)
-
- if ckpt_config.checkpoint_every < total_step:
- # we use step_count to decide when to save ckpt, so here latest_ckpt_step = step_count - 1
- assert train_state.step_count == latest_ckpt_step + 1
- assert train_state.batch_count == latest_ckpt_step + 1
- assert compare_optim_value(ckpt_mm.optimizer, latest_ckpt_step), ckpt_mm.optimizer.param_groups[0]["params"][0]
- assert compare_model_value(ckpt_mm.model, latest_ckpt_step), list(ckpt_mm.model.parameters())[0][0]
-
- if ckpt_mm.save_ckpt_folder.startswith("local:"):
- ckpt_mm.load_ckpt_info = dict(
- path=os.path.join(LOCAL_SAVE_PATH, f"{ckpt_config.checkpoint_every}"),
- content=CheckpointLoadMask(("all",)),
- ckpt_type=CheckpointLoadType.INTERNLM,
- )
- else:
- ckpt_mm.load_ckpt_info = dict(
- path=os.path.join(BOTO_SAVE_PATH, f"{ckpt_config.checkpoint_every}"),
- content=CheckpointLoadMask(("all",)),
- ckpt_type=CheckpointLoadType.INTERNLM,
- )
-
- ckpt_mm.try_resume_training(train_state)
-
- assert train_state.step_count == ckpt_config.checkpoint_every
- assert train_state.batch_count == ckpt_config.checkpoint_every
- # compare value is same with i.
- assert compare_optim_value(ckpt_mm.optimizer, ckpt_config.checkpoint_every - 1), ckpt_mm.optimizer.param_groups[
- 0
- ]["params"][0]
- assert compare_model_value(ckpt_mm.model, ckpt_config.checkpoint_every - 1), list(ckpt_mm.model.parameters())[
- 0
- ][0]
- else:
- pass
-
-
-STOP_FILE_PATH = "./alter.log"
-
-
-def query_quit_file(rank, world_size=2):
- from internlm.core.context import global_context as gpc
- from internlm.initialize import initialize_distributed_env
- from internlm.utils.model_checkpoint import CheckpointSaveType
-
- ckpt_config = Config(
- dict(
- enable_save_ckpt=True,
- save_ckpt_folder=BOTO_SAVE_PATH,
- load_optimizer=True,
- checkpoint_every=0,
- async_upload=True,
- async_upload_tmp_folder=ASYNC_TMP_FOLDER,
- snapshot_ckpt_folder="/".join([BOTO_SAVE_PATH, "snapshot"]),
- oss_snapshot_freq=0,
- stop_file_path=STOP_FILE_PATH,
- load_model_only_folder=None,
- load_given_ckpt=False,
- load_ckpt_folder=None,
- is_old_api=True,
- ),
- )
-
- os.environ["RANK"] = str(rank)
- os.environ["LOCAL_RANK"] = str(rank)
- os.environ["WORLD_SIZE"] = str(world_size)
- os.environ["MASTER_ADDR"] = "127.0.0.1"
- os.environ["MASTER_PORT"] = "12376"
-
- initialize_distributed_env(config=init_config, launcher="torch", master_port=12376, args_check=False)
- train_state = TrainState(init_config, None)
- ckpt_mm = CheckpointManager(ckpt_config, model=None, optimizer=None)
- if rank == 0:
- with open(STOP_FILE_PATH, "w+") as f:
- f.write("5")
- dist.barrier()
- for i in range(10):
- train_state.step_count = i
- now_break, now_save_ckpt, save_type = ckpt_mm.quit_signal_handler(train_state)
- print(
- f"step:{i}, rank:{rank}, now_break:{now_break}, now_save_ckpt:{now_save_ckpt}, save_type:{save_type}",
- flush=True,
- )
- if train_state.step_count == 5:
- assert now_break is True
- assert now_save_ckpt is True
- assert save_type is CheckpointSaveType.NORMAL_CHECKPOINT
- dist.barrier()
- gpc.destroy()
-
-
-def test_quit_signal_handler(): # noqa # pylint: disable=unused-import
- import multiprocessing
- from multiprocessing.pool import Pool
-
- world_size = 2
- with Pool(processes=world_size, context=multiprocessing.get_context("spawn")) as pool:
- items = [(0,), (1,)]
- for result in pool.starmap(query_quit_file, items):
- print(f"Got result: {result}", flush=True)
-
- os.remove(STOP_FILE_PATH)
-
-
-if __name__ == "__main__":
- pytest.main()
diff --git a/tests/test_utils/test_storage_manager.py b/tests/test_utils/test_storage_manager.py
deleted file mode 100644
index 32f905b..0000000
--- a/tests/test_utils/test_storage_manager.py
+++ /dev/null
@@ -1,89 +0,0 @@
-import os
-
-import pytest
-import torch
-
-from internlm.core.context.parallel_context import Config
-from internlm.initialize.launch import get_config_value
-from tests.test_utils.common_fixture import ( # noqa # pylint: disable=unused-import
- ASYNC_TMP_FOLDER,
- BOTO_SAVE_PATH,
- LOCAL_SAVE_PATH,
- del_tmp_file,
- init_dist_and_model,
- reset_singletons,
-)
-
-ASYNC_TMP_FOLDER = "./async_tmp_folder"
-ckpt_config_list = [
- # async boto
- dict(
- enable_save_ckpt=True,
- async_upload_tmp_folder=ASYNC_TMP_FOLDER,
- async_upload=True,
- save_folder=BOTO_SAVE_PATH,
- test_id=0,
- ),
- # sync local
- dict(
- enable_save_ckpt=True,
- async_upload_tmp_folder=None,
- async_upload=False,
- save_folder=LOCAL_SAVE_PATH,
- test_id=1,
- ),
- # sync boto
- dict(
- enable_save_ckpt=True,
- async_upload_tmp_folder=None,
- async_upload=False,
- save_folder=BOTO_SAVE_PATH,
- test_id=2,
- ),
- # async local
- dict(
- enable_save_ckpt=True,
- async_upload_tmp_folder=ASYNC_TMP_FOLDER,
- async_upload=True,
- save_folder=LOCAL_SAVE_PATH,
- test_id=3,
- ),
-]
-
-
-@pytest.fixture(scope="function")
-def del_tmp():
- del_tmp_file()
- yield
- del_tmp_file()
-
-
-@pytest.mark.usefixtures("del_tmp")
-@pytest.mark.usefixtures("reset_singletons")
-@pytest.mark.parametrize("ckpt_config", ckpt_config_list)
-def test_storage_mm_save_load(ckpt_config, init_dist_and_model): # noqa # pylint: disable=unused-argument
- from internlm.utils.storage_manager import (
- check_folder,
- get_fns,
- init_storage_manager,
- llm_load,
- llm_save,
- wait_async_upload_finish,
- )
-
- ckpt_config = Config(ckpt_config)
- enable_save_ckpt = get_config_value(ckpt_config, "enable_save_ckpt", False)
- async_upload_tmp_folder = get_config_value(ckpt_config, "async_upload_tmp_folder", False)
- async_upload = get_config_value(ckpt_config, "async_upload", False)
-
- init_storage_manager(enable_save_ckpt, async_upload_tmp_folder, async_upload)
-
- tobj = torch.rand(64, 64)
- save_fn = os.path.join(ckpt_config.save_folder, "test.pt")
- llm_save(save_fn, tobj)
- if ckpt_config.test_id == 0:
- wait_async_upload_finish()
- check_folder(save_fn)
- assert get_fns(ckpt_config.save_folder)[0] == "test.pt"
- load_obj = llm_load(save_fn, map_location="cpu")
- assert 0 == ((load_obj != tobj).sum())
diff --git a/tests/test_utils/test_timeout.py b/tests/test_utils/test_timeout.py
deleted file mode 100644
index a3f15f9..0000000
--- a/tests/test_utils/test_timeout.py
+++ /dev/null
@@ -1,119 +0,0 @@
-import fcntl
-import os
-import time
-from multiprocessing import Process
-
-import pytest
-import torch
-import torch.distributed as dist
-
-os.environ["INTERNLM_ENABLE_TIMEOUT"] = "1" # noqa # pylint: disable=wrong-import-position
-os.environ["NCCL_TIMEOUT"] = "5"
-from internlm.utils.timeout import llm_timeout
-from tests.test_utils.common_fixture import ( # noqa # pylint: disable=unused-import
- init_config,
-)
-
-WORLD_SIZE = 2
-
-
-@llm_timeout(2, "fake_timeout_func")
-def fake_timeout_func():
- time.sleep(10)
-
-
-@llm_timeout(10, "nccl_timeout_func")
-def nccl_timeout_func(rank):
- # see: https://github.com/pytorch/pytorch/issues/104506#issuecomment-1679762880
- # 'NCCL_ASYNC_ERROR_HANDLING' cannot take effect on the first collective communication.
- buff = torch.ones([64, 64]).cuda(rank)
- dist.all_reduce(buff) # lazy communicator init
- torch.cuda.synchronize()
- if rank == 0:
- dist.all_reduce(buff)
- torch.cuda.synchronize() # main thread will hang at here.
- else:
- time.sleep(9999)
-
-
-@llm_timeout(10, "try_file_lock")
-def try_file_lock(rank, stop_file_path):
- if rank == 1:
- time.sleep(5)
-
- with open(stop_file_path, "r", encoding="utf-8") as f:
- fcntl.flock(f, fcntl.LOCK_EX) # rank 1 hang.
- if rank == 0:
- time.sleep(99999) # rank 0 hang.
- f.seek(0)
- f.read()
- fcntl.flock(f, fcntl.LOCK_UN)
-
-
-def local_timeout(rank, _):
-
- try:
- fake_timeout_func()
- except TimeoutError as e:
- print(f"local_timeout, rank:{rank}, e:{e}", flush=True)
- else:
- assert False, "It should timeout!"
-
-
-def gpc_timeout(rank, world_size):
-
- from internlm.initialize import initialize_distributed_env
-
- os.environ["RANK"] = str(rank)
- os.environ["LOCAL_RANK"] = str(rank)
- os.environ["WORLD_SIZE"] = str(world_size)
- os.environ["MASTER_ADDR"] = "127.0.0.1"
- os.environ["MASTER_PORT"] = "12377"
- initialize_distributed_env(config=init_config, launcher="torch", master_port=12377, args_check=False)
-
- try:
- nccl_timeout_func(rank)
- except TimeoutError as e:
- print(f"gpc_timeout, rank:{rank}, e:{e}", flush=True)
- time.sleep(5) # wait rank 0 to be killed
- else:
- time.sleep(5) # give some time to let Watchdog kill rank 0.
- assert False, "It should timeout!"
-
-
-def file_lock_timeout(rank, _, stop_file_path):
- if rank == 0:
- with open(stop_file_path, "w"):
- pass
- try:
- try_file_lock(rank, stop_file_path)
- except TimeoutError as e:
- print(e, flush=True)
- else:
- assert False, "It should timeout!"
- finally:
- if rank == 0:
- os.remove(stop_file_path)
-
-
-timeout_func_list = [(gpc_timeout, 2, None), (local_timeout, 1, None), (file_lock_timeout, 2, "test_lock.log")]
-
-
-@pytest.mark.parametrize("timeout_func_and_args", timeout_func_list)
-def test_timeout(timeout_func_and_args):
- timeout_func, world_size, other_args = timeout_func_and_args
- procs = []
- for i in range(world_size):
- if other_args is None:
- args = (i, world_size)
- else:
- args = (i, world_size, other_args)
- proc = Process(target=timeout_func, args=args)
- proc.start()
- procs.append(proc)
-
- for proc in procs:
- proc.join(15)
- if proc.is_alive():
- proc.terminate()
- proc.join()
diff --git a/third_party/apex b/third_party/apex
deleted file mode 160000
index 0da3ffb..0000000
--- a/third_party/apex
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit 0da3ffb92ee6fbe5336602f0e3989db1cd16f880
diff --git a/third_party/flash-attention b/third_party/flash-attention
deleted file mode 160000
index eff9fe6..0000000
--- a/third_party/flash-attention
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit eff9fe6b8076df59d64d7a3f464696738a3c7c24
diff --git a/tools/README.md b/tools/README.md
deleted file mode 100644
index 0c78a56..0000000
--- a/tools/README.md
+++ /dev/null
@@ -1,111 +0,0 @@
-本目录提供辅助模型训练的一些工具,文件结构如下所示:
-
-```bash
-├── transformers # 适配hugging face的transformers的一些工具
-│ ├── configuration_internlm.py # config适配工具
-│ ├── modeling_internlm.py # model适配工具
-│ ├── tokenization_internlm.py # tokenizer适配工具
-│ └── convert2hf.py # 模型适配hugging face工具
-└── tokenizer.py # 将原始数据转换成bin和meta文件的工具
-```
-
-# tokenizer.py
-
-生成原始数据的`bin`和`meta`文件需要使用`tokenizer`,我们通过在`tools/tokenizer.py`中指定模型参数路径的方式来导入tokenizer模型。目前我们提供了`V7_sft.model`来生成tokens。若想使用不同的模型,可直接修改`tokenizer.py`中的模型参数路径。
-
-可以运行以下命令生成原始数据对应的`bin`和`meta`文件,其中参数`text_input_path`表示原始文本数据路径,目前支持`txt`、`json`和`jsonl`三种输入格式,`bin_output_path`表示生成的`bin`文件的保存路径。
-
-```bash
-$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
-```
-
-下面是一个数据处理的例子:
-
-给定一个包含原始数据集的文件`raw_data.txt`,原始数据集如下所示:
-
-```bash
-感恩生活中的每一个细节,才能真正体会到幸福的滋味。
-梦想是人生的动力源泉,努力追逐,才能实现自己的目标。
-学会宽容和理解,才能建立真正和谐的人际关系。
-```
-
-可以通过运行以下命令来生成`bin`和`meta`文件:
-```bash
-$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
-```
-
-需要注意的是,生成的`bin`文件需要保存在`cn`或者`en`或者`code`或者`ja`或者`ar`或者`kaoshi`这六个目录下,以区分数据集的类型。
-
-其中,`cn`表示中文数据集;`en`表示英文数据集;`code`表示代码数据集;`ja`表示日语数据集;`ar`表示阿拉伯语数据集;`kaoshi`表示考试数据集。
-
-生成的bin文件的格式如下:
-
-```python
-{"tokens": [73075, 75302, 69522, 69022, 98899, 67713, 68015, 81269, 74637, 75445, 99157]}
-{"tokens": [69469, 60355, 73026, 68524, 60846, 61844, 98899, 67775, 79241, 98899, 67713, 67800, 67453, 67838, 99157]}
-{"tokens": [68057, 79017, 60378, 68014, 98899, 67713, 67990, 68015, 70381, 67428, 61003, 67622, 99157]}
-```
-
-`bin`文件中的每一行均对应原始数据集中的每一个句子,表示每个句子的`token`(下文将用sequence指定)。
-
-生成的`meta`文件的格式如下:
-
-```bash
-(0, 11), (90, 15), (208, 13)
-```
-
-在`meta`文件中,每个元组对应着`bin`文件中每一个`sequence`的元信息。其中,元组的第一个元素表示每个`sequence`在所有`sequence`中的`starting index`,第二个元素表示每个`sequence`中有多少个`tokens`。
-
-例如,对于第一个`sequence`,`starting index`为 0,有 11 个`tokens`;对于第二个`sequence`,由于第一个`sequence`转换为`string`后的长度为`89`,因此它的`starting index`为 90,有 15 个`tokens`。
-
-`json`和`jsonl`类型的文件的`bin`和`meta`文件格式和`txt`一致,此处不再赘叙。
-
-# pal_inference.py
-
-在 [GSM8K](https://huggingface.co/datasets/gsm8k) 数据集上使用 [PAL](https://github.com/reasoning-machines/pal) 范式推理,使模型编写代码并通过 Python 解释器执行来解决数学问题。其用法如下:
-
-```bash
-# 用法:
-python pal_inference.py <model> <out_dir> [--dataset <dataset>] [--max_length <length>] [--top_p <threshold>] [--eoh <end token>] [--eoa <end token>] [--eos <end token>] [--temperature <temp>] [--time_out <time>] [--verbose, -v] [--append, -a]
-
-# 参数:
-# <model> 用于推理的模型的路径。
-# <out_dir> 生成代码将保存在指定的输出文件夹中。
-
-# 可选参数:
-# --dataset 用于代码生成的数据集名称(默认:gsm8k)。
-# --max_length 模型最大输入 token 长度(默认:2048)。
-# --top_p 候选 token 相加的概率阈值(默认:0.8)。
-# --eoh 用户输入结束标识符 (默认: "") 。
-# --eoa 模型输入结束标识符 (默认: "") 。
-# --eos 系统输入结束标识符. (默认: "") 。
-# --temperature, -t 生成过程中的采样温度(默认:1.0)。
-# --time_out 执行生成的代码的最大时间(秒)(默认:100)。
-# --verbose, -v 打印代码错误信息(可选)。
-# --append, -a 将输出追加到历史结果中(可选)。
-```
-
-以下是使用示例:
-
-```bash
-python tools/pal_inference.py internlm/internlm-chat-7k ./output -v
-```
-
-其输出文件每一行包括输入的问题,正确答案,执行答案,得分,以及模型生成的 Python 代码块:
-
-````json
-{
- "question": "Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
- "target": 18.0,
- "answer": 18.0,
- "score": 1,
- "generation": ["```python\ndef solution():\n eggs_per_day = 16\n eggs_per_breakfast = 3\n eggs_per_muffin = 4\n eggs_used = eggs_per_day - eggs_per_breakfast - eggs_per_muffin\n eggs_sold = eggs_used\n price_per_egg = 2\n eggs_made = eggs_sold * price_per_egg\n result = eggs_made\n return result\n```"]
-}
-````
-
-InternLM 在 GSM8K 数据集中带工具和不带工具的性能表现:
-
-| Method | **InternLM-Chat-7B** |
-| -------- | -------------------- |
-| w/o tool | 34.5 |
-| w tool | 39.2 |
diff --git a/tools/README_EN.md b/tools/README_EN.md
deleted file mode 100644
index 3105146..0000000
--- a/tools/README_EN.md
+++ /dev/null
@@ -1,109 +0,0 @@
-This directory provides some tools for model training, with the following file structure.
-
-```bash
-├── transformers # tools for adapting Hugging Face's transformers
-│ ├── configuration_internlm.py # tools for adapting config
-│ ├── modeling_internlm.py # tools for adapting model
-│ ├── tokenization_internlm.py # tools for adapting tokenizer
-│ └── convert2hf.py # tools for adapting models to Hugging Face's format
-└── tokenizer.py # tools for generating `bin` and `meta` file for raw data
-```
-
-# tokenizer.py
-
-We need to use a `tokenizer` to generate `bin` and `meta` files for raw data. We import the tokenizer model by specifying the model weight path in `tools/tokenizer.py`. Currently, we provide `V7_sft.model` to generate tokens. If you want to use a different model, you can modify the model weight path in `tokenizer.py` directly.
-
-We can run the following command to generate `bin` and `meta` files corresponding to the original data. The parameter `text_input_path` represents the path of the original text data, currently supporting `txt`, `json`, and `jsonl` formats, while `bin_output_path` represents the save path of the generated `bin` files.
-```bash
-$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
-```
-
-An example of data processing in `txt` format is given here:
-
-Given a file `raw_data.txt` containing raw data with the following content.
-
-```bash
-Appreciate every detail in life to truly taste the flavor of happiness.
-Dreams are the source of life’s motivation. Pursue them diligently to achieve your goals.
-Learn to be tolerant and understanding to establish truly harmonious interpersonal relationships.
-```
-
-Next, we can run the following command to generate `bin` and `meta` files for raw data.
-
-```bash
-$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
-```
-
-It should be noted that the generated `bin` files should be placed in one of the following directories to clarify the data type: `cn`(Chinese), `en`(English), `code`(code data), `ja`(Japanese), `ar`(Arabic) and `kaoshi`(exam data).
-
-The format of the generated `bin` file is as follows.
-
-```python
-{"tokens": [98655, 2317, 2922, 6649, 1595, 7856, 435, 2424, 442, 9556, 12807, 410, 17313, 446, 23331, 95746]}
-{"tokens": [98655, 302, 1383, 269, 657, 410, 2687, 446, 2424, 98667, 269, 25220, 281, 523, 1874, 492, 1248, 38127, 4563, 442, 11227, 829, 8980, 95746]}
-{"tokens": [98655, 24190, 442, 517, 15013, 649, 454, 8793, 442, 5849, 9556, 17917, 1369, 1084, 29890, 12021, 95746]}
-```
-
-In the generated `bin` file, each line (`sequence`) corresponds to the `tokens` for each sentence in the raw data.
-
-The format of the generated `meta` file is as follows.
-
-```bash
-(0, 16), (110, 24), (262, 17)
-```
-
-Each tuple in the `meta` file represents the meta information of each `sequence` where the first element in the tuple indicates the `starting index` of each `sequence` among all `sequences` and the second element indicates the amount of `tokens` for each `sequence`.
-
-For example, the `starting index` is 0 for the first `sequence`, which has 16 `tokens`. Since the first `sequence` is 109 characters long in `string` form, the `starting index` of the second `sequence` is 110, and it has 24 `tokens`.
-
-The `bin` and `meta` file formats for `json` and `jsonl` type files are the same as for `txt`, so we won't go over them here.
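-
-If you need to consume these files in your own code, the sketch below shows one way to read a `bin`/`meta` pair back into token lists. It is only an illustrative example based on the formats described above (JSON lines with a `tokens` field, plus a NumPy-saved array of `(starting index, token amount)` pairs); it is not part of the provided tools, and the paths in the usage comment are placeholders.
-
-```python
-import json
-
-import numpy as np
-
-
-def load_bin_and_meta(bin_path: str, meta_path: str):
-    """Read every tokenized sequence from a bin file using its meta index."""
-    # The meta file is a NumPy int32 array of (starting byte offset, token amount) pairs.
-    meta = np.load(meta_path)
-    sequences = []
-    with open(bin_path, "rb") as f:
-        for start, length in meta:
-            f.seek(int(start))
-            line = f.readline()  # one JSON object per line: {"tokens": [...]}
-            tokens = json.loads(line)["tokens"]
-            assert len(tokens) == int(length)
-            sequences.append(tokens)
-    return sequences
-
-
-# Placeholder paths, for illustration only:
-# sequences = load_bin_and_meta("cn/output.bin", "cn/output.bin.meta")
-```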
-
-# pal_inference.py
-
-Perform reasoning using [PAL](https://github.com/reasoning-machines/pal) on the [GSM8K](https://huggingface.co/datasets/gsm8k) dataset, allowing the model to generate code and solve mathematical problems through Python interpretation. Here's how you can use it:
-
-```bash
-# Usage:
-python pal_inference.py <model> <out_dir> [--dataset <dataset>] [--max_length <length>] [--top_p <threshold>] [--eoh <end token>] [--eoa <end token>] [--eos <end token>] [--temperature <temp>] [--time_out <time>] [--verbose, -v] [--append, -a]
-
-# Parameters:
-# <model> Path to the model used for inference.
-# <out_dir> Generated code will be saved in the specified output folder.
-
-# Optional arguments:
-# --dataset Dataset name used for code generation (default: gsm8k).
-# --max_length Model's maximum input token length (default: 2048).
-# --top_p Probability threshold for candidate tokens (default: 0.8).
-# --eoh End of human (user) token. (default: "").
-# --eoa End of assistant (bot) token. (default: "").
-# --eos End of system token. (default: "").
-# --temperature, -t Sampling temperature during generation (default: 1.0).
-# --time_out Maximum time (in seconds) for executing the generated code (default: 100).
-# --verbose, -v Print code error messages (optional).
-# --append, -a Append the output to historical results (optional).
-```
-
-Below is an example of usage:
-
-```bash
-python tools/pal_inference.py internlm/internlm-chat-7k ./output -v
-```
-
-The output file contains each line with the input question, the correct answer, the executed answer, the score, and the Python code block generated by the model:
-
-````json
-{
- "question": "Janet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
- "target": 18.0,
- "answer": 18.0,
- "score": 1,
- "generation": ["```python\ndef solution():\n eggs_per_day = 16\n eggs_per_breakfast = 3\n eggs_per_muffin = 4\n eggs_used = eggs_per_day - eggs_per_breakfast - eggs_per_muffin\n eggs_sold = eggs_used\n price_per_egg = 2\n eggs_made = eggs_sold * price_per_egg\n result = eggs_made\n return result\n```"]
-}
-````
-
-InternLM performance in the GSM8K dataset with and without tools:
-
-| Method | **InternLM-Chat-7B** |
-| -------- | -------------------- |
-| w/o tool | 34.5 |
-| w tool | 39.2 |
diff --git a/tools/V7_sft.model b/tools/V7_sft.model
deleted file mode 100644
index f7d52d6..0000000
Binary files a/tools/V7_sft.model and /dev/null differ
diff --git a/tools/alpaca_tokenizer.py b/tools/alpaca_tokenizer.py
deleted file mode 100644
index 0904bb9..0000000
--- a/tools/alpaca_tokenizer.py
+++ /dev/null
@@ -1,164 +0,0 @@
-import argparse
-import json
-import os.path as osp
-from pathlib import Path
-
-import numpy as np
-import sentencepiece as spm
-from tqdm import tqdm
-
-
-def process(dataset_path, sp_model):
- """Process data sample from input dataset
-
- Args:
- dataset_path (str): Path of dataset json file.
- sp_model (str): Path of tokenizer.
-
- Yields:
- tuple: dumped processed data sample and length of tokens.
- """
-
- dataset = json.load(open(dataset_path))
-
- for data in dataset:
- yield tokenize(get_chat_format_data(data), sp_model)
-
-
-def get_chat_format_data(ori_data):
- """Format original data
-
- Args:
- ori_data (dict): input data sample.
-
- Returns:
- dict: data sample with chat format.
- """
- input_str = ori_data["input"]
- instruction_str = ori_data["instruction"]
- output_str = ori_data["output"]
- data = dict()
- if input_str != "":
- data["user"] = f"<|User|>:{instruction_str}\n{input_str}"
- else:
- data["user"] = f"<|User|>:{instruction_str}"
- data["bot"] = f"<|Bot|>:{output_str}"
- return data
-
-
-def tokenize(sample, sp_model):
- """Tokenize input dataset
-
- Args:
- sample (dict): Input data sample.
- sp_model (str): Path of tokenizer.
-
- Returns:
- tuple: dumped processed data sample and length of tokens.
- """
- special_tokens_map = {"<eoh>": 103167, "<eoa>": 103166, "nl_id": 13}
- token_ids = [sp_model.bos_id()]
- human_s = sample["user"]
- ass_s = sample["bot"]
-
- human_ids = sp_model.encode(human_s) + [special_tokens_map["<eoh>"], special_tokens_map["nl_id"]]
- human_ids_ignore = [-token_id for token_id in human_ids]
-
- ass_template_ids = sp_model.encode("<|Bot|>:")
- ass_template_ids_ignore = [-token_id for token_id in ass_template_ids]
- ass_ids = (
- ass_template_ids_ignore
- + sp_model.encode(ass_s[8:])
- + [special_tokens_map["<eoa>"], special_tokens_map["nl_id"]]
- )
-
- token_ids += human_ids_ignore + ass_ids
- if len(token_ids) > 2047:
- token_ids = token_ids[:2047]
- token_ids += [sp_model.eos_id()]
- line = str.encode(json.dumps({"tokens": token_ids}) + "\n")
- return line, len(token_ids)
-
-
-def dump_bin_meta_bin(samples, path, split_ratio=0.1):
- """Dump processed dataset
-
- Args:
- samples (dict): Input data sample.
- path (str): Path for output dataset.
- split_ratio (float): Ratio for validation dataset splitting.
- Default to: 0.1.
-
- Returns:
- tuple: number of train/valid tokens of processed dataset,
- number of train/valid samples of processed dataset.
- """
-
- train_path = osp.join(path, "train/en/")
- valid_path = osp.join(path, "valid/en/")
- train_dir = Path(train_path)
- valid_dir = Path(valid_path)
- train_dir.mkdir(exist_ok=True, parents=True)
- valid_dir.mkdir(exist_ok=True, parents=True)
- train_f = open(train_dir.joinpath("dataset.bin"), "wb")
- valid_f = open(valid_dir.joinpath("dataset.bin"), "wb")
-
- train_tokens = 0
- valid_tokens = 0
- last_train_position = 0
- last_valid_position = 0
- train_samples = 0
- valid_samples = 0
- train_meta = []
- valid_meta = []
-
- sample_length = len(samples)
- np.random.seed(0)
- valid_indices = np.random.choice(range(sample_length), int(sample_length * split_ratio)).tolist()
-
- count = -1
- for line, token_num in samples:
- count += 1
- if count in valid_indices:
- valid_tokens += token_num
- valid_f.write(line)
- valid_meta.append((last_valid_position, token_num))
- last_valid_position += len(line)
- valid_samples += 1
- else:
- train_tokens += token_num
- train_f.write(line)
- train_meta.append((last_train_position, token_num))
- last_train_position += len(line)
- train_samples += 1
-
- train_f.close()
- valid_f.close()
- np.save(open(train_dir.joinpath("dataset.bin.meta"), "wb"), train_meta)
- np.save(open(valid_dir.joinpath("dataset.bin.meta"), "wb"), valid_meta)
-
- return train_tokens, valid_tokens, train_samples, valid_samples
-
-
-if __name__ == "__main__":
-
- parser = argparse.ArgumentParser()
- parser.add_argument("dataset_path", type=str, help="path of dataset json file")
- parser.add_argument("output_path", type=str, help="path of processed dataset")
- parser.add_argument("tokenizer_path", type=str, help="path of tokenizer")
- parser.add_argument("--split_ratio", type=float, default=0.1, help="ratio for validation dataset splitting")
-
- args = parser.parse_args()
- sp_model = spm.SentencePieceProcessor(model_file=args.tokenizer_path)
- split_ratio = args.split_ratio
- samples = []
-
- dataset = process(args.dataset_path, sp_model)
- for sample in tqdm(dataset):
- samples.append(sample)
-
- train_tokens, valid_tokens, train_samples, valid_samples = dump_bin_meta_bin(
- samples, args.output_path, args.split_ratio
- )
- print(f"number of train dataset: {train_samples}, number of train dataset token: {train_tokens}")
- print(f"number of validation dataset: {valid_samples}, number of validation dataset token: {valid_tokens}")
diff --git a/tools/pal_inference.py b/tools/pal_inference.py
deleted file mode 100644
index 648ec58..0000000
--- a/tools/pal_inference.py
+++ /dev/null
@@ -1,322 +0,0 @@
-# flake8: noqa
-
-# This file is modified from:
-# https://github.com/reasoning-machines/pal/blob/main/pal/core/interface.py
-#
-# Copyright 2022 PAL Authors. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-import copy
-import json
-import os
-from dataclasses import asdict
-from typing import Any, Dict, List
-
-import torch
-import tqdm
-from datasets import load_dataset
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-from internlm.utils.timeout import Timeout
-from tools.transformers.interface import GenerationConfig, generate_interactive
-
-
-def parse_args():
- parser = argparse.ArgumentParser(description="PAL Inference")
- parser.add_argument("model", type=str, help="Path to the pre-trained LLM used for inference.")
- parser.add_argument(
- "out_dir", type=str, help="Name of the output folder where generated code snippets will be saved."
- )
- parser.add_argument("--dataset", default="gsm8k", type=str, help="Name of the dataset used for code generation.")
- parser.add_argument(
- "--max_length",
- default=2048,
- type=int,
- help="Maximum input token length for the natural language description.",
- )
- parser.add_argument(
- "--top_p",
- default=0.8,
- type=float,
- help="Probability threshold to choose sample tokens during generation.",
- )
- parser.add_argument(
- "--eoh",
- default="",
- type=str,
- help="End of human (user) token.",
- )
- parser.add_argument(
- "--eoa",
- default="",
- type=str,
- help="End of assistant (bot) token.",
- )
- parser.add_argument(
- "--eos",
- default="",
- type=str,
- help="End of system token.",
- )
- parser.add_argument(
- "--temperature", "-t", default=1.0, type=float, help="Temperature of token sampling during generation."
- )
- parser.add_argument(
- "--time_out", default=100, type=float, help="Maximum time allowed for executing generated code."
- )
- parser.add_argument(
- "--verbose",
- "-v",
- action="store_true",
- help="Print code error information when executing generated code (optional).",
- )
- parser.add_argument("--append", "-a", action="store_true", help="Append output to the history results (optional).")
- args = parser.parse_args()
- return args
-
-
-class GenericRuntime:
- """Adapted from https://github.com/reasoning-machines/pal"""
-
- GLOBAL_DICT: dict = {}
- LOCAL_DICT = None
- HEADERS: List = []
-
- def __init__(self):
- self._global_vars = copy.copy(self.GLOBAL_DICT)
- self._local_vars = copy.copy(self.LOCAL_DICT) if self.LOCAL_DICT else None
-
- for c in self.HEADERS:
- self.exec_code(c)
-
- def exec_code(self, code_piece: str) -> None:
- exec(code_piece, self._global_vars)
-
- def eval_code(self, expr: str) -> Any:
- return eval(expr, self._global_vars)
-
- def inject(self, var_dict: Dict[str, Any]) -> None:
- for k, v in var_dict.items():
- self._global_vars[k] = v
-
- @property
- def answer(self):
- return self._global_vars["answer"]
-
-
-class PALInterface:
- """PAL interface wrap fun:`generate_interactive` to extract and execute
- generated code.
-
- Adapted from https://github.com/reasoning-machines/pal
-
- Args:
- model (AutoModelForCausalLM)
- tokenizer (AutoTokenizer)
- generation_config (GenerationConfig): Decode strategies
- additional_eos_token_id (int): End of sentence token id, default: 103028
- get_answer_expr (str): The function name of generated code, default: "solution()"
- verbose (bool): Print error information
- """
-
- def __init__(
- self,
- model: AutoModelForCausalLM,
- tokenizer: AutoTokenizer,
- generation_config: GenerationConfig,
- additional_eos_token_id: int = 103028,
- get_answer_expr: str = "solution()",
- verbose: bool = False,
- ):
- self.runtime = GenericRuntime()
- self.history: List = []
- self.model = model
- self.tokenizer = tokenizer
- self.generation_config = generation_config
- self.additional_eos_token_id = additional_eos_token_id
- self.answer_expr = get_answer_expr
- self.verbose = verbose
-
- def generate(self, prompt):
- # The API generates the response word by word;
- # we only need the last generation as the final result.
- for cur_gen in generate_interactive(
- model=self.model,
- tokenizer=self.tokenizer,
- prompt=prompt,
- additional_eos_token_id=self.additional_eos_token_id,
- **asdict(self.generation_config),
- ):
- continue
- # Get final response
- self.history.append(cur_gen)
- # Extract code block
- code = self.process_generation_to_code(cur_gen)
- return code
-
- def process_generation_to_code(self, gens: str):
- if "```python" in gens:
- gens = gens.split("```python")[1].split("```")[0]
- elif "```" in gens:
- gens = gens.split("```")[1].split("```")[0]
- code = gens.split("\n")
- return code
-
- def run(self, prompt, time_out: float = 100):
- code = self.generate(prompt)
- exec_result = None # keep the return value defined if execution raises
- with Timeout(time_out):
- try:
- exec_result = self.execute(code)
- except Exception as e:
- if self.verbose:
- print(e)
- return exec_result
-
- def execute(self, code: List[str]):
- self.runtime.exec_code("\n".join(code))
- return self.runtime.eval_code(self.answer_expr)
-
- def clear_history(self):
- self.history = []
-
-
-def load_model(args):
- model = AutoModelForCausalLM.from_pretrained(args.model, trust_remote_code=True).to(torch.bfloat16).cuda()
- tokenizer = AutoTokenizer.from_pretrained(args.model, trust_remote_code=True)
- return model, tokenizer
-
-
-def load_data(args):
- # Load data from huggingface dataset
- if args.dataset == "gsm8k":
- gsm8k = load_dataset(path=args.dataset, name="main")
- test_set = gsm8k["test"]
- input_data = []
- for data in test_set:
- question = data["question"]
- target = float(data["answer"].split("#")[-1].replace(",", ""))
- input_data.append({"question": question, "target": target})
- else:
- raise NotImplementedError
- return input_data
-
-
-PROMPT = """<|System|>:You are a helpful assistant which use tools to solve mathematical reasoning questions. The tools you can use are:
-PythonExecutor: It can execute Python code. The code must be a function, and the function name must be 'solution'. The example format is as follows:
-```python
-def solution():
- variable_names_with_real_meaning = func(variable)
- return variable_names_with_real_meaning
-```{eos}
-<|User|>:Olivia has $23. She bought five bagels for $3 each. How much money does she have left?{eoh}
-<|Bot|>:
-```python
-def solution():
- money_initial = 23
- bagels = 5
- bagel_cost = 3
- money_spent = bagels * bagel_cost
- money_left = money_initial - money_spent
- result = money_left
- return result
-```{eoa}
-<|User|>:Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?{eoh}
-<|Bot|>:
-```python
-def solution():
- golf_balls_initial = 58
- golf_balls_lost_tuesday = 23
- golf_balls_lost_wednesday = 2
- golf_balls_left = golf_balls_initial - golf_balls_lost_tuesday - golf_balls_lost_wednesday
- result = golf_balls_left
- return result
-```{eoa}
-<|User|>:There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?{eoh}
-<|Bot|>:
-```python
-def solution():
- computers_initial = 9
- computers_per_day = 5
- num_days = 4 # 4 days between monday and thursday
- computers_added = computers_per_day * num_days
- computers_total = computers_initial + computers_added
- result = computers_total
- return result
-```{eoa}
-<|System|>:How about this question?{eos}
-<|User|>:{question}{eoh}
-<|Bot|>:""".strip()
-
-
-def main():
-
- args = parse_args()
-
- print("load model begin.")
- model, tokenizer = load_model(args)
- print("load model end.")
-
- generation_config = GenerationConfig(max_length=args.max_length, top_p=args.top_p, temperature=args.temperature)
-
- verbose = args.verbose
- interface = PALInterface(model=model, tokenizer=tokenizer, generation_config=generation_config, verbose=verbose)
-
- if not os.path.exists(args.out_dir):
- os.makedirs(args.out_dir)
- savepath = os.path.join(args.out_dir, args.dataset + ".json")
-
- # Load from history results
- if args.append and os.path.exists(savepath):
- lines = open(savepath).readlines()
- num_skip_exps = len(lines)
- scores = [x["score"] for x in map(json.loads, lines)]
- else:
- num_skip_exps = 0
- scores = []
-
- examples = load_data(args)
- with open(savepath, "a" if args.append else "w") as f:
- pbar = tqdm.tqdm(examples[num_skip_exps:], initial=num_skip_exps, total=len(examples))
- for x in pbar:
- question = x["question"]
- result = copy.copy(x)
-
- try:
- answer = interface.run(
- prompt=PROMPT.format(question=question, eoh=args.eoh, eoa=args.eoa, eos=args.eos),
- time_out=args.time_out,
- )
- answer = float(answer)
- score = 1 if abs(answer - x["target"]) < 1e-3 else 0
- except Exception as e:
- if verbose:
- print(e)
- answer = ""
- score = 0
- scores.append(score)
- result["answer"] = answer
- result["score"] = score
- result["generation"] = interface.history
- f.write(json.dumps(result) + "\n")
-
- interface.clear_history()
- f.flush()
-
- print(f"{args.model}: Accuracy - {sum(scores) / len(scores)}")
- torch.cuda.empty_cache()
-
-
-if __name__ == "__main__":
- main()
diff --git a/tools/tokenizer.py b/tools/tokenizer.py
deleted file mode 100644
index cf4ddec..0000000
--- a/tools/tokenizer.py
+++ /dev/null
@@ -1,144 +0,0 @@
-import argparse
-import json
-import os
-import sys
-
-import numpy as np
-
-current_dir = os.path.dirname(os.path.abspath(__file__))
-model_path = os.path.join(current_dir, "V7_sft.model")
-sys.path.append(os.path.join(current_dir, "transformers"))
-from tokenization_internlm import InternLMTokenizer
-
-tokenizer = InternLMTokenizer(
- vocab_file=model_path, add_bos_token=True, add_eos_token=True
-)
-
-
-def write_bin(context: str, bin_file) -> None:
- """
- Write bin file based on the context.
-
- Args:
- context (str): the context of raw file.
- bin_file (file handler): the opened bin file.
-
- Example:
- >>> write_bin("今天天气晴朗适合出门散步", "out.bin") # the output file format is 'txt'
- >>> out.bin
- >>> {"tokens": [67577, 69095, 63010, 61770, 67783, 69301, 74732]}
- """
- # encode the context into tokens, which is a list, eg. [67577, 69095, 63010, 61770, 67783, 69301, 74732]
- tokens = tokenizer.encode(context)
- # transfer the list into dic, key is str 'tokens', value is tokens.
- # eg. {"tokens": [67577, 69095, 63010, 61770, 67783, 69301, 74732]}
- data = dict(tokens=tokens)
- # encode the data into bytes to save
- saved_bin = str.encode(json.dumps(data) + "\n")
-
- # write bytes into bin_file
- bin_file.write(saved_bin)
-
-
-def prepare_meta(bin_output_path: str):
- """
- Prepare metadata for the given bin file.
-
- Args:
- bin_output_path (str): Output bin file path.
- """
- meta = []
- cur = 0
- with open(bin_output_path, "rb") as f:
- while True:
- # read lines
- line = f.readline()
- # if line is empty, then break
- if line == b"":
- break
- # obtain the token amount of each line
- length = len(json.loads(line)["tokens"])
- # meta is a list of tuple(cur, length)
- # cur: the start index of each line
- # length: the token amount of each line
- meta.append((cur, length))
- # update the cur to generate the meta information of next line
- cur += len(line)
-
- # define path of the generated meta file
- meta_fp = bin_output_path + ".meta"
- # save the generated meta information
- with open(meta_fp, "wb") as f:
- meta = np.array(meta, dtype=np.int32)
- np.save(f, meta)
-
-
-def text2bin(text_input_path: str, bin_output_path: str):
- """
- Read content from the input file and write to bin file.
- Currently supports 3 input formats: 'txt', 'json' and 'jsonl'.
-
- Args:
- text_input_path (str): txt file path.
- bin_output_path (str): output bin file path.
- """
- # Check if the txt file exists
- if not os.path.isfile(text_input_path):
- raise FileNotFoundError(f"{text_input_path} does not exist.")
-
- file_format = text_input_path.split(".")[-1]
- assert file_format in ["txt", "json", "jsonl"], (
- "Invalid input file type. Currently supports `txt`, `json` and `jsonl`."
- )
-
- with open(text_input_path, "r") as text_file, open(bin_output_path, "ab") as bin_file:
- if file_format == "txt":
- for line in text_file:
- # Strip any leading/trailing whitespace
- stripped_line = line.strip()
- if stripped_line:
- # Pass each line to the write_bin function
- write_bin(stripped_line, bin_file)
-
- elif file_format == "json":
- data = json.load(text_file)
- # assuming data is a list of dictionaries
- for record in data:
- # the type of record is dict, transfer the dict into str
- context = json.dumps(record)
- # encode the str and write into bin
- write_bin(context, bin_file)
-
- elif file_format == "jsonl":
- for line in text_file:
- # encode the str and write into bin
- write_bin(line, bin_file)
-
-
-def parse_args():
- parser = argparse.ArgumentParser()
- parser.add_argument(
- "--text_input_path",
- type=str,
- required=True,
- help="Path to the input text file.",
- )
- parser.add_argument("--bin_output_path", type=str, required=True, help="Path to the output bin file.")
-
- return parser.parse_args()
-
-
-def main():
- # parse arguments
- args = parse_args()
-
- text2bin(args.text_input_path, args.bin_output_path)
- print(f"Successfully converted {args.text_input_path} to {args.bin_output_path}")
-
- # To avoid potential read/write errors, the metadata preparation follows after creating the .bin file.
- prepare_meta(args.bin_output_path)
- print(f"Successfully generated {args.bin_output_path}.meta")
-
-
-if __name__ == "__main__":
- main()
diff --git a/tools/transformers/README-zh-Hans.md b/tools/transformers/README-zh-Hans.md
deleted file mode 100644
index 34f12fe..0000000
--- a/tools/transformers/README-zh-Hans.md
+++ /dev/null
@@ -1,25 +0,0 @@
-# InternLM Transformers
-
-[English](./README.md) |
-[简体中文](./README-zh-Hans.md)
-
-该文件夹下包含了 transformers 格式的 `InternLM` 模型。
-
-
-## 权重转换
-
-`convert2hf.py` 可以将训练保存的权重一键转换为 transformers 格式。在仓库根目录运行以下命令:
-
-```bash
-python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ./tools/V7_sft.model
-```
-
-然后可以使用 `from_pretrained` 接口加载:
-
-```python
->>> from transformers import AutoTokenizer, AutoModel
->>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
-```
-
-
-`intern_moss_example.py` 展示了如何使用 LoRA 来在 `fnlp/moss-moon-002-sft` 数据集上进行微调的样例。
diff --git a/tools/transformers/README.md b/tools/transformers/README.md
deleted file mode 100644
index 6b453f3..0000000
--- a/tools/transformers/README.md
+++ /dev/null
@@ -1,23 +0,0 @@
-# InternLM Transformers
-
-[English](./README.md) |
-[简体中文](./README-zh-Hans.md)
-
-This folder contains the `InternLM` model in transformers format.
-
-## Weight Conversion
-
-`convert2hf.py` can convert saved training weights into the transformers format with a single command. Execute the command in the root directory of the repository:
-
-```bash
-python tools/transformers/convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer ./tools/V7_sft.model
-```
-
-Then, you can load it using the `from_pretrained` interface:
-
-```python
->>> from transformers import AutoTokenizer, AutoModel
->>> model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
-```
-
-`intern_moss_example.py` demonstrates an example of how to use LoRA for fine-tuning on the `fnlp/moss-moon-002-sft` dataset.
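-
-For orientation only, here is a minimal sketch of what LoRA fine-tuning of the converted checkpoint could look like with the `peft` library. It is not a copy of `intern_moss_example.py`; the target module names (`q_proj`, `v_proj`), the hyperparameters, and the use of `AutoModelForCausalLM` are assumptions for this sketch and may differ from the actual example script.
-
-```python
-from peft import LoraConfig, TaskType, get_peft_model
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-# Load the checkpoint produced by convert2hf.py
-tokenizer = AutoTokenizer.from_pretrained("hf_ckpt/", trust_remote_code=True)
-model = AutoModelForCausalLM.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda()
-
-# Attach LoRA adapters to the attention projections (assumed module names)
-lora_config = LoraConfig(
-    task_type=TaskType.CAUSAL_LM,
-    r=8,
-    lora_alpha=16,
-    lora_dropout=0.05,
-    target_modules=["q_proj", "v_proj"],
-)
-model = get_peft_model(model, lora_config)
-model.print_trainable_parameters()
-
-# From here, train with any standard training loop or Trainer on the
-# fnlp/moss-moon-002-sft data; only the LoRA weights receive gradient updates.
-```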
diff --git a/tools/transformers/configuration_internlm.py b/tools/transformers/configuration_internlm.py
deleted file mode 100644
index ebeb27d..0000000
--- a/tools/transformers/configuration_internlm.py
+++ /dev/null
@@ -1,119 +0,0 @@
-# coding=utf-8
-# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
-#
-# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
-# and OPT implementations in this library. It has been modified from its
-# original forms to accommodate minor architectural differences compared
-# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-""" InternLM model configuration"""
-
-from transformers.configuration_utils import PretrainedConfig
-from transformers.utils import logging
-
-logger = logging.get_logger(__name__)
-
-INTERNLM_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
-
-
-class InternLMConfig(PretrainedConfig):
- r"""
- This is the configuration class to store the configuration of a [`InternLMModel`]. It is used to instantiate an
- InternLM model according to the specified arguments, defining the model architecture. Instantiating a
- configuration with the defaults will yield a similar configuration to that of the InternLM-7B.
-
- Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
- documentation from [`PretrainedConfig`] for more information.
-
-
- Args:
- vocab_size (`int`, *optional*, defaults to 103168):
- Vocabulary size of the InternLM model. Defines the number of different tokens that can be represented by the
- `input_ids` passed when calling [`InternLMModel`]
- hidden_size (`int`, *optional*, defaults to 4096):
- Dimension of the hidden representations.
- intermediate_size (`int`, *optional*, defaults to 11008):
- Dimension of the MLP representations.
- num_hidden_layers (`int`, *optional*, defaults to 32):
- Number of hidden layers in the Transformer encoder.
- num_attention_heads (`int`, *optional*, defaults to 32):
- Number of attention heads for each attention layer in the Transformer encoder.
- hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
- The non-linear activation function (function or string) in the decoder.
- max_position_embeddings (`int`, *optional*, defaults to 2048):
- The maximum sequence length that this model might ever be used with. Typically set this to something large
- just in case (e.g., 512 or 1024 or 2048).
- initializer_range (`float`, *optional*, defaults to 0.02):
- The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- rms_norm_eps (`float`, *optional*, defaults to 1e-6):
- The epsilon used by the rms normalization layers.
- use_cache (`bool`, *optional*, defaults to `True`):
- Whether or not the model should return the last key/values attentions (not used by all models). Only
- relevant if `config.is_decoder=True`.
- tie_word_embeddings(`bool`, *optional*, defaults to `False`):
- Whether to tie weight embeddings
- Example:
-
- ```python
- >>> from transformers import InternLMModel, InternLMConfig
-
- >>> # Initializing a InternLM internlm-7b style configuration
- >>> configuration = InternLMConfig()
-
- >>> # Initializing a model from the internlm-7b style configuration
- >>> model = InternLMModel(configuration)
-
- >>> # Accessing the model configuration
- >>> configuration = model.config
- ```"""
- model_type = "internlm"
- _auto_class = "AutoConfig"
-
- def __init__(
- self,
- vocab_size=103168,
- hidden_size=4096,
- intermediate_size=11008,
- num_hidden_layers=32,
- num_attention_heads=32,
- hidden_act="silu",
- max_position_embeddings=2048,
- initializer_range=0.02,
- rms_norm_eps=1e-6,
- use_cache=True,
- pad_token_id=0,
- bos_token_id=1,
- eos_token_id=2,
- tie_word_embeddings=False,
- bias=True,
- **kwargs,
- ):
- self.vocab_size = vocab_size
- self.max_position_embeddings = max_position_embeddings
- self.hidden_size = hidden_size
- self.intermediate_size = intermediate_size
- self.num_hidden_layers = num_hidden_layers
- self.num_attention_heads = num_attention_heads
- self.hidden_act = hidden_act
- self.initializer_range = initializer_range
- self.rms_norm_eps = rms_norm_eps
- self.use_cache = use_cache
- self.bias = bias
- super().__init__(
- pad_token_id=pad_token_id,
- bos_token_id=bos_token_id,
- eos_token_id=eos_token_id,
- tie_word_embeddings=tie_word_embeddings,
- **kwargs,
- )
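
As a quick sanity check (a sketch, not part of the original file, assuming the module is importable, e.g. when run from `tools/transformers`): the defaults above describe the InternLM-7B geometry, and because `InternLMConfig` inherits from `PretrainedConfig`, it serializes and reloads directly.

```python
from configuration_internlm import InternLMConfig

config = InternLMConfig()  # defaults: 32 layers, 32 heads, hidden size 4096
print(config.hidden_size // config.num_attention_heads)  # 128 (per-head dimension)

# Round-trip through config.json via the PretrainedConfig machinery.
config.save_pretrained("./internlm_config")
reloaded = InternLMConfig.from_pretrained("./internlm_config")
assert reloaded.rms_norm_eps == 1e-6 and reloaded.vocab_size == 103168
```
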
diff --git a/tools/transformers/convert2hf.py b/tools/transformers/convert2hf.py
deleted file mode 100644
index 594ab88..0000000
--- a/tools/transformers/convert2hf.py
+++ /dev/null
@@ -1,175 +0,0 @@
-import argparse
-import json
-import math
-import os
-import re
-import tempfile
-
-import torch
-from modeling_internlm import InternLMConfig, InternLMForCausalLM
-from tokenization_internlm import InternLMTokenizer
-
-NUM_SHARDS = {
- "7B": 1,
-}
-
-
-def convert2hf(model_config, states_tp_pps):
-
- with tempfile.TemporaryDirectory() as folder:
- states = merge_pp(states_tp_pps)[0]
-
- if "embedding.word_embeddings.weight" in states:
- embedding_key = "embedding.word_embeddings.weight"
- elif "embedding.weight" in states:
- embedding_key = "embedding.weight"
- else:
- print("Could not find the word embedding weight; available state dict keys are:", flush=True)
- print(list(states.keys()), flush=True)
- raise KeyError("no embedding weight found in the checkpoint state dict")
-
- dims_per_head = model_config["hidden_size"] // model_config["num_attention_heads"]
- base = 10000.0
- inv_freq = 1.0 / (base ** (torch.arange(0, dims_per_head, 2).float() / dims_per_head))
-
- current_states = {}
-
- current_states["model.embed_tokens.weight"] = states.pop(embedding_key)
- current_states["model.norm.weight"] = states.pop("norm.weight")
- current_states["lm_head.weight"] = states.pop("head.weight")
-
- for i in range(model_config["num_layers"]):
- states.pop(f"blocks.{i}.mixer.rotary_emb.inv_freq", None)
-
- wqkv = states.pop(f"blocks.{i}.mixer.Wqkv.weight").reshape(
- 3, model_config["num_attention_heads"], -1, model_config["hidden_size"]
- )
- bqkv = states.pop(f"blocks.{i}.mixer.Wqkv.bias").reshape(3, model_config["num_attention_heads"], -1)
-
- current_states[f"model.layers.{i}.self_attn.q_proj.weight"] = wqkv[0].reshape(
- -1, model_config["hidden_size"]
- )
- current_states[f"model.layers.{i}.self_attn.q_proj.bias"] = bqkv[0].reshape(-1)
- current_states[f"model.layers.{i}.self_attn.k_proj.weight"] = wqkv[1].reshape(
- -1, model_config["hidden_size"]
- )
- current_states[f"model.layers.{i}.self_attn.k_proj.bias"] = bqkv[1].reshape(-1)
- current_states[f"model.layers.{i}.self_attn.v_proj.weight"] = wqkv[2].reshape(
- -1, model_config["hidden_size"]
- )
- current_states[f"model.layers.{i}.self_attn.v_proj.bias"] = bqkv[2].reshape(-1)
-
- current_states[f"model.layers.{i}.self_attn.o_proj.weight"] = states.pop(
- f"blocks.{i}.mixer.out_proj.weight"
- )
- current_states[f"model.layers.{i}.self_attn.o_proj.bias"] = states.pop(f"blocks.{i}.mixer.out_proj.bias")
-
- current_states[f"model.layers.{i}.mlp.gate_proj.weight"] = states.pop(f"blocks.{i}.mlp.w1.weight")
- current_states[f"model.layers.{i}.mlp.down_proj.weight"] = states.pop(f"blocks.{i}.mlp.w3.weight")
- current_states[f"model.layers.{i}.mlp.up_proj.weight"] = states.pop(f"blocks.{i}.mlp.w2.weight")
-
- current_states[f"model.layers.{i}.input_layernorm.weight"] = states.pop(f"blocks.{i}.norm1.weight")
- current_states[f"model.layers.{i}.post_attention_layernorm.weight"] = states.pop(f"blocks.{i}.norm2.weight")
- current_states[f"model.layers.{i}.self_attn.rotary_emb.inv_freq"] = inv_freq
-
- config = InternLMConfig(
- hidden_size=model_config["hidden_size"],
- intermediate_size=compute_intermediate_size(model_config["hidden_size"]),
- num_attention_heads=model_config["num_attention_heads"],
- num_hidden_layers=model_config["num_layers"],
- rms_norm_eps=1e-06,
- bias=True,
- )
-
- if model_config["vocab_size"] != -1:
- config.vocab_size = model_config["vocab_size"]
-
- config.save_pretrained(folder)
- torch.save(current_states, os.path.join(folder, "pytorch_model.bin"))
-
- model = InternLMForCausalLM.from_pretrained(folder, torch_dtype=torch.float16)
- del model.config._name_or_path
-
- return config, model
-
-
-def compute_intermediate_size(n):
- return int(math.ceil(n * 8 / 3) + 255) // 256 * 256
-
-
-def merge_pp(states_tp_pp):
- max_tp = len(states_tp_pp)
- max_pp = len(states_tp_pp[0])
-
- full_states = []
- for tp in range(max_tp):
- layer_shift = 0
-
- tp_states = {}
- for pp in range(max_pp):
- _layer_shift = 0
- states = states_tp_pp[tp][pp]
- keys = list(states.keys())
- for key in keys:
- match = re.search(r"\.\d+\.", key)
- if match is not None:
- s, e = match.span()
- layer_idx = int(key[s + 1 : e - 1]) + layer_shift
- _layer_shift = max(_layer_shift, int(key[s + 1 : e - 1]))
- name = key[:s] + f".{layer_idx}." + key[e:]
- tp_states[name] = states[key]
- else:
- tp_states[key] = states[key]
- layer_shift += _layer_shift + 1
- full_states.append({(key[6:] if key.startswith("model.") else key): value for key, value in tp_states.items()})
- return full_states
-
-
-if __name__ == "__main__":
- parser = argparse.ArgumentParser()
- parser.add_argument("--src_folder", type=str, default="~/test/") # 需要转换为hf格式的checkpoint文件夹
- parser.add_argument("--tgt_folder", type=str, default="~/output/") # 存放转换后checkpoint的目标文件夹
- parser.add_argument("--tokenizer", type=str, default="~/test/tokenizer.model") # Tokenizer 文件的路径
- args = parser.parse_args()
-
- def load(fp):
- with open(fp, "rb") as f:
- pt_data = torch.load(f, map_location="cpu")
- return pt_data
-
- folder = args.src_folder
- target_folder = args.tgt_folder
- model_config = load(os.path.join(folder, "model_config.pt"))
-
- fns = list(os.listdir(folder))
-
- model_fns = []
- for fn in fns:
- if fn.startswith("model_t") and not fn.endswith("md5"):
- model_fns.append(fn)
-
- max_tp, max_pp = -1, -1
- for fn in model_fns:
- _, tp, pp = os.path.splitext(fn)[0].split("_")
- max_pp = max(max_pp, int(pp[2:]) + 1)
- max_tp = max(max_tp, int(tp[2:]) + 1)
-
- states_tp_pps = [[]]
-
- for pp in range(max_pp):
- model_name = f"model_tp0_pp{pp}.pt"
- states = load(os.path.join(folder, model_name))
- states_tp_pps[0].append(states)
-
- config, model = convert2hf(model_config, states_tp_pps)
-
- os.makedirs(target_folder, exist_ok=True)
- model.save_pretrained(target_folder, max_shard_size="20GB")
- # TODO There should be a better way to add this.
- with open(os.path.join(target_folder, "config.json")) as fp:
- config_dict = json.load(fp)
- config_dict["auto_map"]["AutoModel"] = "modeling_internlm.InternLMForCausalLM"
- with open(os.path.join(target_folder, "config.json"), "w") as fp:
- json.dump(config_dict, fp, indent=2)
-
- tokenizer = InternLMTokenizer(args.tokenizer)
- tokenizer.save_pretrained(target_folder)
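
Two things worth noting about the converter above: it only loads the `model_tp0_pp*.pt` shards, so it effectively assumes a tensor-parallel size of 1 (`NUM_SHARDS` and `max_tp` are computed but never used), and the MLP width is re-derived from the hidden size rather than read from the checkpoint. A quick check, using the same rounding rule, confirms the derived value matches the 7B configuration:

```python
import math

def compute_intermediate_size(n):
    # ceil(8n/3), rounded up to the next multiple of 256 -- same rule as convert2hf.py.
    return int(math.ceil(n * 8 / 3) + 255) // 256 * 256

assert compute_intermediate_size(4096) == 11008  # InternLM-7B: hidden 4096 -> MLP width 11008
```
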
diff --git a/tools/transformers/interface.py b/tools/transformers/interface.py
deleted file mode 100644
index 50fff85..0000000
--- a/tools/transformers/interface.py
+++ /dev/null
@@ -1,134 +0,0 @@
-import copy
-import warnings
-from dataclasses import dataclass
-from typing import Callable, List, Optional
-
-import torch
-from torch import nn
-from transformers.generation.utils import LogitsProcessorList, StoppingCriteriaList
-from transformers.utils import logging
-
-logger = logging.get_logger(__name__)
-
-
-@dataclass
-class GenerationConfig:
- max_length: Optional[int] = None
- top_p: Optional[float] = None
- temperature: Optional[float] = None
- do_sample: Optional[bool] = True
- repetition_penalty: Optional[float] = 1.0
-
-
-@torch.inference_mode()
-def generate_interactive(
- model,
- tokenizer,
- prompt,
- generation_config: Optional[GenerationConfig] = None,
- logits_processor: Optional[LogitsProcessorList] = None,
- stopping_criteria: Optional[StoppingCriteriaList] = None,
- prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], List[int]]] = None,
- additional_eos_token_id: Optional[int] = None,
- **kwargs,
-):
- inputs = tokenizer([prompt], padding=True, return_tensors="pt")
- input_length = len(inputs["input_ids"][0])
- for k, v in inputs.items():
- inputs[k] = v.cuda()
- input_ids = inputs["input_ids"]
- batch_size, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1] # noqa: F841
- if generation_config is None:
- generation_config = model.generation_config
- generation_config = copy.deepcopy(generation_config)
- model_kwargs = generation_config.update(**kwargs)
- bos_token_id, eos_token_id = generation_config.bos_token_id, generation_config.eos_token_id # noqa: F841
- if isinstance(eos_token_id, int):
- eos_token_id = [eos_token_id]
- if additional_eos_token_id is not None:
- eos_token_id.append(additional_eos_token_id)
- has_default_max_length = kwargs.get("max_length") is None and generation_config.max_length is not None
- if has_default_max_length and generation_config.max_new_tokens is None:
- warnings.warn(
- f"Using `max_length`'s default ({generation_config.max_length}) to control the generation length. "
- "This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we"
- " recommend using `max_new_tokens` to control the maximum length of the generation.",
- UserWarning,
- )
- elif generation_config.max_new_tokens is not None:
- generation_config.max_length = generation_config.max_new_tokens + input_ids_seq_length
- if not has_default_max_length:
- logger.warning(
- f"Both `max_new_tokens` (={generation_config.max_new_tokens}) and `max_length`(="
- f"{generation_config.max_length}) seem to have been set. `max_new_tokens` will take precedence. "
- "Please refer to the documentation for more information. "
- "(https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)"
- )
-
- if input_ids_seq_length >= generation_config.max_length:
- input_ids_string = "input_ids"
- logger.warning(
- f"Input length of {input_ids_string} is {input_ids_seq_length}, but `max_length` is set to"
- f" {generation_config.max_length}. This can lead to unexpected behavior. You should consider"
- " increasing `max_new_tokens`."
- )
-
- # 2. Set generation parameters if not already defined
- logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
- stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()
-
- logits_processor = model._get_logits_processor(
- generation_config=generation_config,
- input_ids_seq_length=input_ids_seq_length,
- encoder_input_ids=input_ids,
- prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
- logits_processor=logits_processor,
- )
-
- stopping_criteria = model._get_stopping_criteria(
- generation_config=generation_config, stopping_criteria=stopping_criteria
- )
- logits_warper = model._get_logits_warper(generation_config)
-
- unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)
- scores = None
- while True:
- model_inputs = model.prepare_inputs_for_generation(input_ids, **model_kwargs)
- # forward pass to get next token
- outputs = model(
- **model_inputs,
- return_dict=True,
- output_attentions=False,
- output_hidden_states=False,
- )
-
- next_token_logits = outputs.logits[:, -1, :]
-
- # pre-process distribution
- next_token_scores = logits_processor(input_ids, next_token_logits)
- next_token_scores = logits_warper(input_ids, next_token_scores)
-
- # sample
- probs = nn.functional.softmax(next_token_scores, dim=-1)
- if generation_config.do_sample:
- next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
- else:
- next_tokens = torch.argmax(probs, dim=-1)
-
- # update generated ids, model inputs, and length for next step
- input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
- model_kwargs = model._update_model_kwargs_for_generation(outputs, model_kwargs, is_encoder_decoder=False)
- unfinished_sequences = unfinished_sequences.mul(torch.stack([next_tokens != i for i in eos_token_id]).all(dim=0).long())
-
- output_token_ids = input_ids[0].cpu().tolist()
- output_token_ids = output_token_ids[input_length:]
- for each_eos_token_id in eos_token_id:
- if output_token_ids[-1] == each_eos_token_id:
- output_token_ids = output_token_ids[:-1]
- response = tokenizer.decode(output_token_ids)
-
- yield response
- # stop when each sentence is finished, or if we exceed the maximum length
- if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
- break
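
A hedged usage sketch for `generate_interactive`; the checkpoint path and the prompt format are assumptions, not part of `interface.py`. Note that the local `GenerationConfig` dataclass above lacks the `update()` and token-id attributes the function relies on, so the sketch leaves `generation_config` unset and passes sampling options as keyword arguments, which are merged into the model's own generation config.

```python
from transformers import AutoModel, AutoTokenizer

from interface import generate_interactive  # assumes this file is importable as `interface`

tokenizer = AutoTokenizer.from_pretrained("hf_ckpt/", trust_remote_code=True)
model = AutoModel.from_pretrained("hf_ckpt/", trust_remote_code=True).cuda().eval()

# Each yielded value is the full decoded response so far, so re-printing it in place
# gives a simple streaming display.
for partial in generate_interactive(
    model,
    tokenizer,
    prompt="<|User|>:Hello<eoh>\n<|Bot|>:",  # assumed chat-style prompt
    max_new_tokens=128,
    top_p=0.8,
    temperature=0.8,
):
    print("\r" + partial, end="")
print()
```
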
diff --git a/tools/transformers/intern_moss_example.py b/tools/transformers/intern_moss_example.py
deleted file mode 100644
index 303efac..0000000
--- a/tools/transformers/intern_moss_example.py
+++ /dev/null
@@ -1,82 +0,0 @@
-import torch
-from internlm_sft_on_moss import collate_fn, get_dataset
-from peft import LoraConfig, TaskType, get_peft_model
-from torch.utils.data import DataLoader
-from tqdm import tqdm
-from transformers import (
- AutoModelForCausalLM,
- AutoTokenizer,
- get_linear_schedule_with_warmup,
-)
-
-model_path = "model_path"
-data_dir = "moss_002_sft"
-data_num = -1
-test_size = 10
-train_batch_size = 1
-epochs = 5
-val_per_steps = 1000
-lr = 9e-6
-peft_config = LoraConfig(
- task_type=TaskType.CAUSAL_LM,
- r=32,
- lora_alpha=32,
- lora_dropout=0.1,
- target_modules=["gate_proj", "down_proj", "up_proj", "q_proj", "k_proj", "v_proj", "o_proj"],
-)
-
-
-# model
-model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
-tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
-model = get_peft_model(model, peft_config)
-model.cuda()
-
-# dataset
-train_dataset, val_dataset = get_dataset(tokenizer, data_dir, num=data_num, test_size=test_size)
-train_dataloader = DataLoader(
- train_dataset, batch_size=train_batch_size, shuffle=True, collate_fn=lambda x: collate_fn(x, tokenizer)
-)
-
-optimizer = torch.optim.AdamW(model.parameters(), lr)
-scheduler = get_linear_schedule_with_warmup(optimizer, 1000, epochs * len(train_dataloader))
-
-# train
-fp = open("output", "w")
-model.train()
-for epoch in tqdm(range(epochs), desc="Training Epoch"):
- batch_bar = tqdm(train_dataloader, desc="Training Batch")
- for step, batch in enumerate(batch_bar):
- batch = {k: v.cuda() for k, v in batch.items()}
- with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
- output = model(**batch)
-
- loss = output.loss
- loss.backward()
- optimizer.step()
- scheduler.step()
- optimizer.zero_grad()
- batch_bar.set_postfix({"loss": loss.item()})
- if (step + 1) % val_per_steps == 0:
- fp.write(f"Epoch {epoch} Batch {step}: Loss={loss.item()}\n")
- for i in tqdm(range(len(val_dataset)), desc="Generating"):
- data, label = val_dataset[i]
- prefix = tokenizer.decode(data.tolist(), skip_special_tokens=True)
- try:
- generate = model.generate(
- input_ids=data.unsqueeze(0).cuda(),
- temperature=0.7,
- top_k=50,
- do_sample=True,
- repetition_penalty=1.02,
- max_new_tokens=100,
- top_p=0.9,
- )
- text = tokenizer.decode(generate[0].tolist(), skip_special_tokens=True)
- text = text.replace(prefix, "")
- fp.write(f"Prefix: {prefix}\nGenerated: {text}" + "\n---------------------------------\n")
- except Exception as e:
- fp.write(f"Prefix: {prefix}\nError: {e}" + "\n---------------------------------\n")
- fp.write("\n==============================\n")
- model.train()
- torch.cuda.empty_cache()
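
One gap in the script above, noted here as a hedged suggestion rather than a change: the validation loop generates while the model is still in training mode (so LoRA dropout stays active), and the trained adapter is never written to disk. A minimal final step might look like this; `lora_adapter` is an arbitrary output path.

```python
# Hypothetical final step after the training loop.
model.save_pretrained("lora_adapter")      # saves only the LoRA adapter weights (PEFT)
tokenizer.save_pretrained("lora_adapter")
fp.close()
```
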
diff --git a/tools/transformers/internlm_sft_on_moss.py b/tools/transformers/internlm_sft_on_moss.py
deleted file mode 100644
index ef88523..0000000
--- a/tools/transformers/internlm_sft_on_moss.py
+++ /dev/null
@@ -1,110 +0,0 @@
-import copy
-import os
-
-import torch
-from datasets import Dataset as HFDataset
-from datasets import load_dataset
-from torch.utils.data import Dataset
-
-
-class SFTDataset(Dataset):
- # https://github.com/OpenLMLab/MOSS/blob/main/finetune_moss.py
- def __init__(self, dataset):
- super().__init__()
- self.dataset = dataset
-
- def __len__(self):
- return len(self.dataset)
-
- def __getitem__(self, index):
- data = copy.deepcopy(self.dataset[index]["input_ids"])
- no_loss_spans = copy.deepcopy(self.dataset[index]["no_loss_spans"])
-
- data = torch.tensor(data, dtype=torch.long)
- label = copy.deepcopy(data)
-
- for no_loss_span in no_loss_spans:
- label[no_loss_span[0] : no_loss_span[1]] = -100
-
- return data, label
-
-
-def collate_fn(batch, tokenizer):
- batch_input_ids, batch_labels = [], []
- for input_ids, label in batch:
- batch_input_ids.append(input_ids)
- batch_labels.append(label)
-
- batch_input_ids = torch.nn.utils.rnn.pad_sequence(
- batch_input_ids, batch_first=True, padding_value=tokenizer.eos_token_id
- )
- batch_labels = torch.nn.utils.rnn.pad_sequence(batch_labels, batch_first=True, padding_value=-100)
-
- return {
- "input_ids": batch_input_ids,
- "attention_mask": (batch_input_ids == tokenizer.eos_token_id).long(),
- "labels": batch_labels,
- }
-
-
-def process(sample, tokenizer, max_len):
- chat = sample["plain_text"].split("")[:-1]
- num_turns = sample["num_turns"]
- meta_instruction = sample["prefix"]
-
- # encode instruction
- instruction_ids = tokenizer.encode(meta_instruction)
- assert isinstance(instruction_ids, list), instruction_ids
- assert len(instruction_ids) > 0, len(instruction_ids)
- input_ids = copy.deepcopy(instruction_ids)
- # We do not calculate loss for instruction.
- no_loss_spans = [(0, len(instruction_ids))]
-
- for i in range(num_turns):
- # Collect dialogues
- cur_turn_ids = []
- cur_no_loss_spans = []
- # Add to cur_turn_ids
- cur_turn_ids.extend(tokenizer.encode(chat[i] + "<eoa>"))
- # if key == 'Tool Responses':
- # # The format tokens (<|Results|>:...\n) should have losses.
- # cur_no_loss_spans.append((len(input_ids + cur_turn_ids) + 5, len(input_ids + cur_turn_ids + cur_ids) - 2))
- if len(input_ids + cur_turn_ids) > max_len:
- # Too long, break
- break
- # Extend input_ids
- input_ids.extend(cur_turn_ids)
- no_loss_spans.extend(cur_no_loss_spans)
-
- if len(input_ids) == len(instruction_ids):
- # No dialogue, return
- return {"input_ids": [], "no_loss_spans": []}
- else:
- return {"input_ids": input_ids, "no_loss_spans": no_loss_spans}
-
-
-def load_data(save_dir, tokenizer, max_len, num=-1) -> HFDataset:
- if os.path.exists(save_dir):
- print(f"Loading moss-002-sft from {save_dir}")
- else:
- print("Loading moss-002-sft from datasets")
- moss_sft = load_dataset("fnlp/moss-002-sft-data", split="train")
- moss_sft = moss_sft.map(lambda x: process(x, tokenizer, max_len), num_proc=10)
- moss_sft = moss_sft.filter(lambda x: len(x["input_ids"]) != 0)
- moss_sft.save_to_disk(save_dir)
-
- moss_sft = HFDataset.load_from_disk(save_dir)
- if num != -1:
- moss_sft = moss_sft.select(range(num))
- print(f"Load successfully, total {len(moss_sft)} samples.")
-
- return moss_sft
-
-
-def get_dataset(tokenizer, save_dir, max_len=1024, num=-1, test_size=0.1):
- moss_sft_data = load_data(save_dir, tokenizer, max_len, num)
- moss_sft_split = moss_sft_data.train_test_split(test_size=test_size)
- train_dataset = SFTDataset(moss_sft_split["train"])
- val_dataset = SFTDataset(moss_sft_split["test"])
-
- return train_dataset, val_dataset
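
A toy illustration (not part of the module) of what `SFTDataset.__getitem__` produces: positions covered by `no_loss_spans` are set to `-100` in the labels, which `CrossEntropyLoss` ignores by default, so no loss is computed on the meta-instruction tokens.

```python
import torch

input_ids = torch.tensor([101, 102, 8, 9, 10])  # first two tokens: meta-instruction
no_loss_spans = [(0, 2)]

labels = input_ids.clone()
for start, end in no_loss_spans:
    labels[start:end] = -100

print(labels)  # tensor([-100, -100,    8,    9,   10])
```
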
diff --git a/tools/transformers/modeling_internlm.py b/tools/transformers/modeling_internlm.py
deleted file mode 100644
index 571971d..0000000
--- a/tools/transformers/modeling_internlm.py
+++ /dev/null
@@ -1,1022 +0,0 @@
-# coding=utf-8
-# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
-#
-# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
-# and OPT implementations in this library. It has been modified from its
-# original forms to accommodate minor architectural differences compared
-# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-""" PyTorch InternLM model."""
-import math
-import queue
-import threading
-from typing import List, Optional, Tuple, Union
-
-import torch
-import torch.utils.checkpoint
-from configuration_internlm import InternLMConfig
-from torch import nn
-from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
-from transformers.activations import ACT2FN
-from transformers.generation.streamers import BaseStreamer
-from transformers.modeling_outputs import (
- BaseModelOutputWithPast,
- CausalLMOutputWithPast,
- SequenceClassifierOutputWithPast,
-)
-from transformers.modeling_utils import PreTrainedModel
-from transformers.utils import (
- add_start_docstrings,
- add_start_docstrings_to_model_forward,
- logging,
- replace_return_docstrings,
-)
-
-logger = logging.get_logger(__name__)
-
-_CONFIG_FOR_DOC = "InternLMConfig"
-
-
-# Copied from transformers.models.bart.modeling_bart._make_causal_mask
-def _make_causal_mask(
- input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
-):
- """
- Make causal mask used for bi-directional self-attention.
- """
- bsz, tgt_len = input_ids_shape
- mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
- mask_cond = torch.arange(mask.size(-1), device=device)
- mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
- mask = mask.to(dtype)
-
- if past_key_values_length > 0:
- mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
- return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)
-
-
-# Copied from transformers.models.bart.modeling_bart._expand_mask
-def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
- """
- Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
- """
- bsz, src_len = mask.size()
- tgt_len = tgt_len if tgt_len is not None else src_len
-
- expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
-
- inverted_mask = 1.0 - expanded_mask
-
- return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)
-
-
-class InternLMRMSNorm(nn.Module):
- def __init__(self, hidden_size, eps=1e-6):
- """
- InternLMRMSNorm is equivalent to T5LayerNorm
- """
- super().__init__()
- self.weight = nn.Parameter(torch.ones(hidden_size))
- self.variance_epsilon = eps
-
- def forward(self, hidden_states):
- variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
- hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
-
- # convert into half-precision if necessary
- if self.weight.dtype in [torch.float16, torch.bfloat16]:
- hidden_states = hidden_states.to(self.weight.dtype)
-
- return self.weight * hidden_states
-
-
-class InternLMRotaryEmbedding(torch.nn.Module):
- def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
- super().__init__()
- inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float().to(device) / dim))
- self.register_buffer("inv_freq", inv_freq)
-
- # Build here to make `torch.jit.trace` work.
- self.max_seq_len_cached = max_position_embeddings
- t = torch.arange(self.max_seq_len_cached, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
- freqs = torch.einsum("i,j->ij", t, self.inv_freq)
- # Different from paper, but it uses a different permutation in order to obtain the same calculation
- emb = torch.cat((freqs, freqs), dim=-1)
- self.register_buffer("cos_cached", emb.cos()[None, None, :, :], persistent=False)
- self.register_buffer("sin_cached", emb.sin()[None, None, :, :], persistent=False)
-
- def forward(self, x, seq_len=None):
- # x: [bs, num_attention_heads, seq_len, head_size]
- # This `if` block is unlikely to be run after we build sin/cos in `__init__`. Keep the logic here just in case.
- if seq_len > self.max_seq_len_cached:
- self.max_seq_len_cached = seq_len
- t = torch.arange(self.max_seq_len_cached, device=x.device, dtype=self.inv_freq.dtype)
- freqs = torch.einsum("i,j->ij", t, self.inv_freq)
- # Different from paper, but it uses a different permutation in order to obtain the same calculation
- emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
- self.register_buffer("cos_cached", emb.cos()[None, None, :, :], persistent=False)
- self.register_buffer("sin_cached", emb.sin()[None, None, :, :], persistent=False)
- return (
- self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
- self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
- )
-
-
-def rotate_half(x):
- """Rotates half the hidden dims of the input."""
- x1 = x[..., : x.shape[-1] // 2]
- x2 = x[..., x.shape[-1] // 2 :]
- return torch.cat((-x2, x1), dim=-1)
-
-
-def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
- # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
- cos = cos.squeeze(1).squeeze(0) # [seq_len, dim]
- sin = sin.squeeze(1).squeeze(0) # [seq_len, dim]
- cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
- sin = sin[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
- q_embed = (q * cos) + (rotate_half(q) * sin)
- k_embed = (k * cos) + (rotate_half(k) * sin)
- return q_embed, k_embed
-
-
-class InternLMMLP(nn.Module):
- def __init__(
- self,
- hidden_size: int,
- intermediate_size: int,
- hidden_act: str,
- ):
- super().__init__()
- self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
- self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
- self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
- self.act_fn = ACT2FN[hidden_act]
-
- def forward(self, x):
- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
-
-
-class InternLMAttention(nn.Module):
- """Multi-headed attention from 'Attention Is All You Need' paper"""
-
- def __init__(self, config: InternLMConfig):
- super().__init__()
- self.config = config
- self.hidden_size = config.hidden_size
- self.num_heads = config.num_attention_heads
- self.head_dim = self.hidden_size // self.num_heads
- self.max_position_embeddings = config.max_position_embeddings
-
- if (self.head_dim * self.num_heads) != self.hidden_size:
- raise ValueError(
- f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
- f" and `num_heads`: {self.num_heads})."
- )
- self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
- self.k_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
- self.v_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.bias)
- self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.bias)
- self.rotary_emb = InternLMRotaryEmbedding(self.head_dim, max_position_embeddings=self.max_position_embeddings)
-
- def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
- return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
-
- def forward(
- self,
- hidden_states: torch.Tensor,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_value: Optional[Tuple[torch.Tensor]] = None,
- output_attentions: bool = False,
- use_cache: bool = False,
- ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
- bsz, q_len, _ = hidden_states.size()
-
- query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
- key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
- value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
-
- kv_seq_len = key_states.shape[-2]
- if past_key_value is not None:
- kv_seq_len += past_key_value[0].shape[-2]
- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
- # [bsz, nh, t, hd]
-
- if past_key_value is not None:
- # reuse k, v, self_attention
- key_states = torch.cat([past_key_value[0], key_states], dim=2)
- value_states = torch.cat([past_key_value[1], value_states], dim=2)
-
- past_key_value = (key_states, value_states) if use_cache else None
-
- attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
-
- if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
- raise ValueError(
- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
- f" {attn_weights.size()}"
- )
-
- if attention_mask is not None:
- if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
- raise ValueError(
- f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
- )
- attn_weights = attn_weights + attention_mask
- attn_weights = torch.max(attn_weights, torch.tensor(torch.finfo(attn_weights.dtype).min))
-
- # upcast attention to fp32
- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
- attn_output = torch.matmul(attn_weights, value_states)
-
- if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
- raise ValueError(
- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
- f" {attn_output.size()}"
- )
-
- attn_output = attn_output.transpose(1, 2)
- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
-
- attn_output = self.o_proj(attn_output)
-
- if not output_attentions:
- attn_weights = None
-
- return attn_output, attn_weights, past_key_value
-
-
-class InternLMDecoderLayer(nn.Module):
- def __init__(self, config: InternLMConfig):
- super().__init__()
- self.hidden_size = config.hidden_size
- self.self_attn = InternLMAttention(config=config)
- self.mlp = InternLMMLP(
- hidden_size=self.hidden_size,
- intermediate_size=config.intermediate_size,
- hidden_act=config.hidden_act,
- )
- self.input_layernorm = InternLMRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
- self.post_attention_layernorm = InternLMRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
-
- def forward(
- self,
- hidden_states: torch.Tensor,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_value: Optional[Tuple[torch.Tensor]] = None,
- output_attentions: Optional[bool] = False,
- use_cache: Optional[bool] = False,
- ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
- """
- Args:
- hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
- attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
- `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
- output_attentions (`bool`, *optional*):
- Whether or not to return the attentions tensors of all attention layers. See `attentions` under
- returned tensors for more detail.
- use_cache (`bool`, *optional*):
- If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
- (see `past_key_values`).
- past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
- """
-
- residual = hidden_states
-
- hidden_states = self.input_layernorm(hidden_states)
-
- # Self Attention
- hidden_states, self_attn_weights, present_key_value = self.self_attn(
- hidden_states=hidden_states,
- attention_mask=attention_mask,
- position_ids=position_ids,
- past_key_value=past_key_value,
- output_attentions=output_attentions,
- use_cache=use_cache,
- )
- hidden_states = residual + hidden_states
-
- # Fully Connected
- residual = hidden_states
- hidden_states = self.post_attention_layernorm(hidden_states)
- hidden_states = self.mlp(hidden_states)
- hidden_states = residual + hidden_states
-
- outputs = (hidden_states,)
-
- if output_attentions:
- outputs += (self_attn_weights,)
-
- if use_cache:
- outputs += (present_key_value,)
-
- return outputs
-
-
-INTERNLM_START_DOCSTRING = r"""
- This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
- library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
- etc.)
-
- This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
- Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
- and behavior.
-
- Parameters:
- config ([`InternLMConfig`]):
- Model configuration class with all the parameters of the model. Initializing with a config file does not
- load the weights associated with the model, only the configuration. Check out the
- [`~PreTrainedModel.from_pretrained`] method to load the model weights.
-"""
-
-
-@add_start_docstrings(
- "The bare InternLM Model outputting raw hidden-states without any specific head on top.",
- INTERNLM_START_DOCSTRING,
-)
-class InternLMPreTrainedModel(PreTrainedModel):
- config_class = InternLMConfig
- base_model_prefix = "model"
- supports_gradient_checkpointing = True
- _no_split_modules = ["InternLMDecoderLayer"]
- _keys_to_ignore_on_load_unexpected = [r"decoder\.version"]
-
- def _init_weights(self, module):
- std = self.config.initializer_range
- if isinstance(module, nn.Linear):
- module.weight.data.normal_(mean=0.0, std=std)
- if module.bias is not None:
- module.bias.data.zero_()
- elif isinstance(module, nn.Embedding):
- module.weight.data.normal_(mean=0.0, std=std)
- if module.padding_idx is not None:
- module.weight.data[module.padding_idx].zero_()
-
- def _set_gradient_checkpointing(self, module, value=False):
- if isinstance(module, InternLMModel):
- module.gradient_checkpointing = value
-
-
-INTERNLM_INPUTS_DOCSTRING = r"""
- Args:
- input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
- Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
- it.
-
- Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
- [`PreTrainedTokenizer.__call__`] for details.
-
- [What are input IDs?](../glossary#input-ids)
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
- Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
-
- - 1 for tokens that are **not masked**,
- - 0 for tokens that are **masked**.
-
- [What are attention masks?](../glossary#attention-mask)
-
- Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
- [`PreTrainedTokenizer.__call__`] for details.
-
- If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
- `past_key_values`).
-
- If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
- and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
- information on the default strategy.
-
- - 1 indicates the head is **not masked**,
- - 0 indicates the head is **masked**.
- position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
- Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
- config.n_positions - 1]`.
-
- [What are position IDs?](../glossary#position-ids)
- past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
- Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
- `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
- `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
-
- Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
- blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
-
- If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
- don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
- `decoder_input_ids` of shape `(batch_size, sequence_length)`.
- inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
- Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
- is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
- model's internal embedding lookup matrix.
- use_cache (`bool`, *optional*):
- If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
- `past_key_values`).
- output_attentions (`bool`, *optional*):
- Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
- tensors for more detail.
- output_hidden_states (`bool`, *optional*):
- Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
- more detail.
- return_dict (`bool`, *optional*):
- Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
-""" # noqa: E501
-
-
-@add_start_docstrings(
- "The bare InternLM Model outputting raw hidden-states without any specific head on top.",
- INTERNLM_START_DOCSTRING,
-)
-class InternLMModel(InternLMPreTrainedModel):
- """
- Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is an [`InternLMDecoderLayer`]
-
- Args:
- config: InternLMConfig
- """
-
- _auto_class = "AutoModel"
-
- def __init__(self, config: InternLMConfig):
- super().__init__(config)
- self.padding_idx = config.pad_token_id
- self.vocab_size = config.vocab_size
-
- self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
- self.layers = nn.ModuleList([InternLMDecoderLayer(config) for _ in range(config.num_hidden_layers)])
- self.norm = InternLMRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
-
- self.gradient_checkpointing = False
- # Initialize weights and apply final processing
- self.post_init()
-
- def get_input_embeddings(self):
- return self.embed_tokens
-
- def set_input_embeddings(self, value):
- self.embed_tokens = value
-
- # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
- def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
- # create causal mask
- # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
- combined_attention_mask = None
- if input_shape[-1] > 1:
- combined_attention_mask = _make_causal_mask(
- input_shape,
- inputs_embeds.dtype,
- device=inputs_embeds.device,
- past_key_values_length=past_key_values_length,
- )
-
- if attention_mask is not None:
- # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
- expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
- inputs_embeds.device
- )
- combined_attention_mask = (
- expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
- )
-
- return combined_attention_mask
-
- @add_start_docstrings_to_model_forward(INTERNLM_INPUTS_DOCSTRING)
- def forward(
- self,
- input_ids: torch.LongTensor = None,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_values: Optional[List[torch.FloatTensor]] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- use_cache: Optional[bool] = None,
- output_attentions: Optional[bool] = None,
- output_hidden_states: Optional[bool] = None,
- return_dict: Optional[bool] = None,
- ) -> Union[Tuple, BaseModelOutputWithPast]:
- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
- output_hidden_states = (
- output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
- )
- use_cache = use_cache if use_cache is not None else self.config.use_cache
-
- return_dict = return_dict if return_dict is not None else self.config.use_return_dict
-
- # retrieve input_ids and inputs_embeds
- if input_ids is not None and inputs_embeds is not None:
- raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
- elif input_ids is not None:
- batch_size, seq_length = input_ids.shape
- elif inputs_embeds is not None:
- batch_size, seq_length, _ = inputs_embeds.shape
- else:
- raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")
-
- seq_length_with_past = seq_length
- past_key_values_length = 0
-
- if past_key_values is not None:
- past_key_values_length = past_key_values[0][0].shape[2]
- seq_length_with_past = seq_length_with_past + past_key_values_length
-
- if position_ids is None:
- device = input_ids.device if input_ids is not None else inputs_embeds.device
- position_ids = torch.arange(
- past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
- )
- position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
- else:
- position_ids = position_ids.view(-1, seq_length).long()
-
- if inputs_embeds is None:
- inputs_embeds = self.embed_tokens(input_ids)
- # embed positions
- if attention_mask is None:
- attention_mask = torch.ones(
- (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
- )
- attention_mask = self._prepare_decoder_attention_mask(
- attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
- )
-
- hidden_states = inputs_embeds
-
- if self.gradient_checkpointing and self.training:
- if use_cache:
- logger.warning_once(
- "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
- )
- use_cache = False
-
- # decoder layers
- all_hidden_states = () if output_hidden_states else None
- all_self_attns = () if output_attentions else None
- next_decoder_cache = () if use_cache else None
-
- for idx, decoder_layer in enumerate(self.layers):
- if output_hidden_states:
- all_hidden_states += (hidden_states,)
-
- past_key_value = past_key_values[idx] if past_key_values is not None else None
-
- if self.gradient_checkpointing and self.training:
-
- def create_custom_forward(module):
- def custom_forward(*inputs):
- # None for past_key_value
- return module(*inputs, output_attentions, None)
-
- return custom_forward
-
- layer_outputs = torch.utils.checkpoint.checkpoint(
- create_custom_forward(decoder_layer),
- hidden_states,
- attention_mask,
- position_ids,
- None,
- )
- else:
- layer_outputs = decoder_layer(
- hidden_states,
- attention_mask=attention_mask,
- position_ids=position_ids,
- past_key_value=past_key_value,
- output_attentions=output_attentions,
- use_cache=use_cache,
- )
-
- hidden_states = layer_outputs[0]
-
- if use_cache:
- next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
-
- if output_attentions:
- all_self_attns += (layer_outputs[1],)
-
- hidden_states = self.norm(hidden_states)
-
- # add hidden states from the last decoder layer
- if output_hidden_states:
- all_hidden_states += (hidden_states,)
-
- next_cache = next_decoder_cache if use_cache else None
- if not return_dict:
- return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
- return BaseModelOutputWithPast(
- last_hidden_state=hidden_states,
- past_key_values=next_cache,
- hidden_states=all_hidden_states,
- attentions=all_self_attns,
- )
-
-
-class InternLMForCausalLM(InternLMPreTrainedModel):
- _auto_class = "AutoModelForCausalLM"
-
- def __init__(self, config):
- super().__init__(config)
- self.model = InternLMModel(config)
-
- self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
-
- # Initialize weights and apply final processing
- self.post_init()
-
- def get_input_embeddings(self):
- return self.model.embed_tokens
-
- def set_input_embeddings(self, value):
- self.model.embed_tokens = value
-
- def get_output_embeddings(self):
- return self.lm_head
-
- def set_output_embeddings(self, new_embeddings):
- self.lm_head = new_embeddings
-
- def set_decoder(self, decoder):
- self.model = decoder
-
- def get_decoder(self):
- return self.model
-
- @add_start_docstrings_to_model_forward(INTERNLM_INPUTS_DOCSTRING)
- @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
- def forward(
- self,
- input_ids: torch.LongTensor = None,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_values: Optional[List[torch.FloatTensor]] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- labels: Optional[torch.LongTensor] = None,
- use_cache: Optional[bool] = None,
- output_attentions: Optional[bool] = None,
- output_hidden_states: Optional[bool] = None,
- return_dict: Optional[bool] = None,
- ) -> Union[Tuple, CausalLMOutputWithPast]:
- r"""
- Args:
- labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
- Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
- config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
- (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
-
- Returns:
-
- Example:
-
- ```python
- >>> from transformers import AutoTokenizer, InternLMForCausalLM
-
- >>> model = InternLMForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
- >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
-
- >>> prompt = "Hey, are you consciours? Can you talk to me?"
- >>> inputs = tokenizer(prompt, return_tensors="pt")
-
- >>> # Generate
- >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
- "Hey, are you consciours? Can you talk to me?\nI'm not consciours, but I can talk to you."
- ```"""
-
- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
- output_hidden_states = (
- output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
- )
- return_dict = return_dict if return_dict is not None else self.config.use_return_dict
-
- # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
- outputs = self.model(
- input_ids=input_ids,
- attention_mask=attention_mask,
- position_ids=position_ids,
- past_key_values=past_key_values,
- inputs_embeds=inputs_embeds,
- use_cache=use_cache,
- output_attentions=output_attentions,
- output_hidden_states=output_hidden_states,
- return_dict=return_dict,
- )
-
- hidden_states = outputs[0]
- logits = self.lm_head(hidden_states)
-
- loss = None
- if labels is not None:
- # Shift so that tokens < n predict n
- shift_logits = logits[..., :-1, :].contiguous()
- shift_labels = labels[..., 1:].contiguous()
- # Flatten the tokens
- loss_fct = CrossEntropyLoss()
- shift_logits = shift_logits.view(-1, self.config.vocab_size)
- shift_labels = shift_labels.view(-1)
- # Enable model parallelism
- shift_labels = shift_labels.to(shift_logits.device)
- loss = loss_fct(shift_logits, shift_labels)
-
- if not return_dict:
- output = (logits,) + outputs[1:]
- return (loss,) + output if loss is not None else output
-
- return CausalLMOutputWithPast(
- loss=loss,
- logits=logits,
- past_key_values=outputs.past_key_values,
- hidden_states=outputs.hidden_states,
- attentions=outputs.attentions,
- )
-
- def prepare_inputs_for_generation(
- self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
- ):
- if past_key_values:
- input_ids = input_ids[:, -1:]
-
- position_ids = kwargs.get("position_ids", None)
- if attention_mask is not None and position_ids is None:
- # create position_ids on the fly for batch generation
- position_ids = attention_mask.long().cumsum(-1) - 1
- position_ids.masked_fill_(attention_mask == 0, 1)
- if past_key_values:
- position_ids = position_ids[:, -1].unsqueeze(-1)
-
- # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
- if inputs_embeds is not None and past_key_values is None:
- model_inputs = {"inputs_embeds": inputs_embeds}
- else:
- model_inputs = {"input_ids": input_ids}
-
- model_inputs.update(
- {
- "position_ids": position_ids,
- "past_key_values": past_key_values,
- "use_cache": kwargs.get("use_cache"),
- "attention_mask": attention_mask,
- }
- )
- return model_inputs
-
- @staticmethod
- def _reorder_cache(past_key_values, beam_idx):
- reordered_past = ()
- for layer_past in past_key_values:
- reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)
- return reordered_past
-
- def build_inputs(self, tokenizer, query: str, history: List[Tuple[str, str]] = []):
- prompt = ""
- for record in history:
- prompt += f"""<|User|>:{record[0]}\n<|Bot|>:{record[1]}\n"""
- if len(prompt) == 0:
- prompt += ""
- prompt += f"""<|User|>:{query}\n<|Bot|>:"""
- return tokenizer([prompt], return_tensors="pt")
-
- @torch.no_grad()
- def chat(
- self,
- tokenizer,
- query: str,
- history: List[Tuple[str, str]] = [],
- streamer: Optional[BaseStreamer] = None,
- max_new_tokens: int = 1024,
- do_sample: bool = True,
- temperature: float = 0.8,
- top_p: float = 0.8,
- **kwargs,
- ):
- inputs = self.build_inputs(tokenizer, query, history)
- inputs = {k: v.to(self.device) for k, v in inputs.items() if torch.is_tensor(v)}
- outputs = self.generate(
- **inputs,
- streamer=streamer,
- max_new_tokens=max_new_tokens,
- do_sample=do_sample,
- temperature=temperature,
- top_p=top_p,
- **kwargs,
- )
- outputs = outputs[0].cpu().tolist()[len(inputs["input_ids"][0]) :]
- response = tokenizer.decode(outputs, skip_special_tokens=True)
- response = response.split("<eoa>")[0]
- history = history + [(query, response)]
- return response, history
-
- @torch.no_grad()
- def stream_chat(
- self,
- tokenizer,
- query: str,
- history: List[Tuple[str, str]] = [],
- max_new_tokens: int = 1024,
- do_sample: bool = True,
- temperature: float = 0.8,
- top_p: float = 0.8,
- **kwargs,
- ):
- """
- Return a generator that yields (response, history) tuples, e.g.:
- ('你好,有什么可以帮助您的吗', [('你好', '你好,有什么可以帮助您的吗')])
- ('你好,有什么可以帮助您的吗?', [('你好', '你好,有什么可以帮助您的吗?')])
- """
-
- response_queue = queue.Queue(maxsize=20)
-
- class ChatStreamer(BaseStreamer):
- def __init__(self, tokenizer) -> None:
- super().__init__()
- self.tokenizer = tokenizer
- self.queue = response_queue
- self.query = query
- self.history = history
- self.response = ""
- self.cache = []
- self.received_inputs = False
- self.queue.put((self.response, history + [(self.query, self.response)]))
-
- def put(self, value):
- if len(value.shape) > 1 and value.shape[0] > 1:
- raise ValueError("ChatStreamer only supports batch size 1")
- elif len(value.shape) > 1:
- value = value[0]
-
- if not self.received_inputs:
- # The first received value is input_ids, ignore here
- self.received_inputs = True
- return
-
- self.cache.extend(value.tolist())
- token = self.tokenizer.decode(self.cache, skip_special_tokens=True)
- if "�" in token and len(token) <= 5:
- return
-
- if token.strip() != "":
- self.response = self.response + token
- history = self.history + [(self.query, self.response)]
- self.queue.put((self.response, history))
- self.cache = []
- else:
- self.end()
-
- def end(self):
- self.queue.put(None)
-
- def stream_producer():
- return self.chat(
- tokenizer=tokenizer,
- query=query,
- streamer=ChatStreamer(tokenizer=tokenizer),
- history=history,
- max_new_tokens=max_new_tokens,
- do_sample=do_sample,
- temperature=temperature,
- top_p=top_p,
- **kwargs,
- )
-
- def consumer():
- producer = threading.Thread(target=stream_producer)
- producer.start()
- while True:
- res = response_queue.get()
- if res is None:
- return
- yield res
-
- return consumer()
-
-
-@add_start_docstrings(
- """
- The InternLM Model transformer with a sequence classification head on top (linear layer).
-
- [`InternLMForSequenceClassification`] uses the last token in order to do the classification, as other causal models
- (e.g. GPT-2) do.
-
- Since it does classification on the last token, it requires to know the position of the last token. If a
- `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
- no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
- padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
- each row of the batch).
- """,
- INTERNLM_START_DOCSTRING,
-)
-class InternLMForSequenceClassification(InternLMPreTrainedModel):
- _keys_to_ignore_on_load_missing = [r"lm_head.weight"]
-
- def __init__(self, config):
- super().__init__(config)
- self.num_labels = config.num_labels
- self.model = InternLMModel(config)
- self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
-
- # Initialize weights and apply final processing
- self.post_init()
-
- def get_input_embeddings(self):
- return self.model.embed_tokens
-
- def set_input_embeddings(self, value):
- self.model.embed_tokens = value
-
- @add_start_docstrings_to_model_forward(INTERNLM_INPUTS_DOCSTRING)
- def forward(
- self,
- input_ids: torch.LongTensor = None,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_values: Optional[List[torch.FloatTensor]] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- labels: Optional[torch.LongTensor] = None,
- use_cache: Optional[bool] = None,
- output_attentions: Optional[bool] = None,
- output_hidden_states: Optional[bool] = None,
- return_dict: Optional[bool] = None,
- ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
- r"""
- labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
- Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
- config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss); if
- `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
- """
- return_dict = return_dict if return_dict is not None else self.config.use_return_dict
-
- transformer_outputs = self.model(
- input_ids,
- attention_mask=attention_mask,
- position_ids=position_ids,
- past_key_values=past_key_values,
- inputs_embeds=inputs_embeds,
- use_cache=use_cache,
- output_attentions=output_attentions,
- output_hidden_states=output_hidden_states,
- return_dict=return_dict,
- )
- hidden_states = transformer_outputs[0]
- logits = self.score(hidden_states)
-
- if input_ids is not None:
- batch_size = input_ids.shape[0]
- else:
- batch_size = inputs_embeds.shape[0]
-
- if self.config.pad_token_id is None and batch_size != 1:
- raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
- if self.config.pad_token_id is None:
- sequence_lengths = -1
- else:
- if input_ids is not None:
- sequence_lengths = (torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1).to(logits.device)
- else:
- sequence_lengths = -1
-
- pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
-
- loss = None
- if labels is not None:
- labels = labels.to(logits.device)
- if self.config.problem_type is None:
- if self.num_labels == 1:
- self.config.problem_type = "regression"
- elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
- self.config.problem_type = "single_label_classification"
- else:
- self.config.problem_type = "multi_label_classification"
-
- if self.config.problem_type == "regression":
- loss_fct = MSELoss()
- if self.num_labels == 1:
- loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
- else:
- loss = loss_fct(pooled_logits, labels)
- elif self.config.problem_type == "single_label_classification":
- loss_fct = CrossEntropyLoss()
- loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
- elif self.config.problem_type == "multi_label_classification":
- loss_fct = BCEWithLogitsLoss()
- loss = loss_fct(pooled_logits, labels)
- if not return_dict:
- output = (pooled_logits,) + transformer_outputs[1:]
- return ((loss,) + output) if loss is not None else output
-
- return SequenceClassifierOutputWithPast(
- loss=loss,
- logits=pooled_logits,
- past_key_values=transformer_outputs.past_key_values,
- hidden_states=transformer_outputs.hidden_states,
- attentions=transformer_outputs.attentions,
- )
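The docstring above describes how the classification head pools the logit of the last non-padding token per row and how the loss is chosen from `num_labels` and the label dtype. Below is a hedged sketch of that behaviour, assuming the class is importable from the local modeling file; the checkpoint name, import path, and label values are placeholders:

```python
# Sketch only: exercising last-token pooling and the dtype-driven loss selection.
import torch
from transformers import AutoTokenizer
from modeling_internlm import InternLMForSequenceClassification  # local import path assumed

name = "internlm/internlm-7b"  # illustrative checkpoint name
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = InternLMForSequenceClassification.from_pretrained(name, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id  # required for batch sizes > 1

batch = tokenizer(["short text", "a somewhat longer second text"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([0, 1])  # long dtype -> single_label_classification -> CrossEntropyLoss
out = model(**batch, labels=labels)
print(out.loss, out.logits.shape)  # logits are pooled to (batch_size, num_labels)
```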
diff --git a/tools/transformers/tokenization_internlm.py b/tools/transformers/tokenization_internlm.py
deleted file mode 100644
index 2e1b114..0000000
--- a/tools/transformers/tokenization_internlm.py
+++ /dev/null
@@ -1,240 +0,0 @@
-# coding=utf-8
-# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
-#
-# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
-# and OPT implementations in this library. It has been modified from its
-# original forms to accommodate minor architectural differences compared
-# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-"""Tokenization classes for InternLM."""
-import os
-from shutil import copyfile
-from typing import Any, Dict, List, Optional, Tuple
-
-import sentencepiece as spm
-from transformers.tokenization_utils import PreTrainedTokenizer
-from transformers.utils import logging
-
-logger = logging.get_logger(__name__)
-
-VOCAB_FILES_NAMES = {"vocab_file": "./tokenizer.model"}
-
-PRETRAINED_VOCAB_FILES_MAP = {}
-
-
-class InternLMTokenizer(PreTrainedTokenizer):
- """
- Construct an InternLM tokenizer, based on a SentencePiece model.
-
- Args:
- vocab_file (`str`):
- Path to the vocabulary file.
- """
-
- vocab_files_names = VOCAB_FILES_NAMES
- pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
- model_input_names = ["input_ids", "attention_mask"]
- _auto_class = "AutoTokenizer"
-
- def __init__(
- self,
- vocab_file,
- unk_token="<unk>",
- bos_token="<s>",
- eos_token="</s>",
- pad_token="</s>",
- sp_model_kwargs: Optional[Dict[str, Any]] = None,
- add_bos_token=True,
- add_eos_token=False,
- decode_with_prefix_space=False,
- clean_up_tokenization_spaces=False,
- **kwargs,
- ):
- self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
- self.vocab_file = vocab_file
- self.add_bos_token = add_bos_token
- self.add_eos_token = add_eos_token
- self.decode_with_prefix_space = decode_with_prefix_space
- self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
- self.sp_model.Load(vocab_file)
- self._no_prefix_space_tokens = None
- super().__init__(
- bos_token=bos_token,
- eos_token=eos_token,
- unk_token=unk_token,
- pad_token=pad_token,
- clean_up_tokenization_spaces=clean_up_tokenization_spaces,
- **kwargs,
- )
-
- """ Initialization"""
-
- @property
- def no_prefix_space_tokens(self):
- if self._no_prefix_space_tokens is None:
- vocab = self.convert_ids_to_tokens(list(range(self.vocab_size)))
- self._no_prefix_space_tokens = {i for i, tok in enumerate(vocab) if not tok.startswith("▁")}
- return self._no_prefix_space_tokens
-
- @property
- def vocab_size(self):
- """Returns vocab size"""
- return self.sp_model.get_piece_size()
-
- @property
- def bos_token_id(self) -> Optional[int]:
- return self.sp_model.bos_id()
-
- @property
- def eos_token_id(self) -> Optional[int]:
- return self.sp_model.eos_id()
-
- def get_vocab(self):
- """Returns vocab as a dict"""
- vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
- vocab.update(self.added_tokens_encoder)
- return vocab
-
- def _tokenize(self, text):
- """Returns a tokenized string."""
- return self.sp_model.encode(text, out_type=str)
-
- def _convert_token_to_id(self, token):
- """Converts a token (str) to an id using the vocab."""
- return self.sp_model.piece_to_id(token)
-
- def _convert_id_to_token(self, index):
- """Converts an index (integer) to a token (str) using the vocab."""
- token = self.sp_model.IdToPiece(index)
- return token
-
- def _maybe_add_prefix_space(self, tokens, decoded):
- if tokens and tokens[0] not in self.no_prefix_space_tokens:
- return " " + decoded
- else:
- return decoded
-
- def convert_tokens_to_string(self, tokens):
- """Converts a sequence of tokens (strings) into a single string."""
- current_sub_tokens = []
- out_string = ""
- prev_is_special = False
- for token in tokens:
- # make sure that special tokens are not decoded using sentencepiece model
- if token in self.all_special_tokens:
- if not prev_is_special:
- out_string += " "
- out_string += self.sp_model.decode(current_sub_tokens) + token
- prev_is_special = True
- current_sub_tokens = []
- else:
- current_sub_tokens.append(token)
- prev_is_special = False
- out_string += self.sp_model.decode(current_sub_tokens)
- out_string = self.clean_up_tokenization(out_string)
- out_string = self._maybe_add_prefix_space(tokens=tokens, decoded=out_string)
- return out_string[1:]
-
- def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]:
- """
- Save the vocabulary and special tokens file to a directory.
-
- Args:
- save_directory (`str`):
- The directory in which to save the vocabulary.
-
- Returns:
- `Tuple(str)`: Paths to the files saved.
- """
- if not os.path.isdir(save_directory):
- logger.error(f"Vocabulary path ({save_directory}) should be a directory")
- return
- out_vocab_file = os.path.join(
- save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
- )
-
- if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
- copyfile(self.vocab_file, out_vocab_file)
- elif not os.path.isfile(self.vocab_file):
- with open(out_vocab_file, "wb") as fi:
- content_spiece_model = self.sp_model.serialized_model_proto()
- fi.write(content_spiece_model)
-
- return (out_vocab_file,)
-
- def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
- if self.add_bos_token:
- bos_token_ids = [self.bos_token_id]
- else:
- bos_token_ids = []
-
- output = bos_token_ids + token_ids_0
-
- if token_ids_1 is not None:
- output = output + token_ids_1
-
- if self.add_eos_token:
- output = output + [self.eos_token_id]
-
- return output
-
- def get_special_tokens_mask(
- self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
- ) -> List[int]:
- """
- Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
- special tokens using the tokenizer `prepare_for_model` method.
-
- Args:
- token_ids_0 (`List[int]`):
- List of IDs.
- token_ids_1 (`List[int]`, *optional*):
- Optional second list of IDs for sequence pairs.
- already_has_special_tokens (`bool`, *optional*, defaults to `False`):
- Whether or not the token list is already formatted with special tokens for the model.
-
- Returns:
- `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
- """
- if already_has_special_tokens:
- return super().get_special_tokens_mask(
- token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
- )
-
- if token_ids_1 is None:
- return [1] + ([0] * len(token_ids_0)) + [1]
- return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]
-
- def create_token_type_ids_from_sequences(
- self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
- ) -> List[int]:
- """
- Create a mask from the two sequences passed, to be used in a sequence-pair classification task. InternLM does
- not make use of token type ids, therefore a list of zeros is returned.
-
- Args:
- token_ids_0 (`List[int]`):
- List of IDs.
- token_ids_1 (`List[int]`, *optional*):
- Optional second list of IDs for sequence pairs.
-
- Returns:
- `List[int]`: List of zeros.
- """
- eos = [self.eos_token_id]
-
- if token_ids_1 is None:
- return len(token_ids_0 + eos) * [0]
- return len(token_ids_0 + eos + token_ids_1 + eos) * [0]
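To make the tokenizer's behaviour above concrete (SentencePiece vocabulary, BOS prepended by default, EOS appended only when `add_eos_token=True`), here is a small hedged sketch; the checkpoint name is again a placeholder and loading through `AutoTokenizer` with remote code is assumed:

```python
# Sketch: basic round trip through InternLMTokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("internlm/internlm-7b", trust_remote_code=True)

ids = tok("Hello world")["input_ids"]
print(ids[0] == tok.bos_token_id)                 # True: add_bos_token defaults to True
print(tok.convert_ids_to_tokens(ids))             # SentencePiece pieces, word starts carry "▁"
print(tok.decode(ids, skip_special_tokens=True))  # "Hello world"

# build_inputs_with_special_tokens only prepends BOS (and appends EOS if add_eos_token=True)
print(tok.build_inputs_with_special_tokens([11, 12, 13]))
```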
diff --git a/train.py b/train.py
deleted file mode 100644
index ff15354..0000000
--- a/train.py
+++ /dev/null
@@ -1,310 +0,0 @@
-#!/usr/bin/env python
-# -*- encoding: utf-8 -*-
-
-import socket
-import time
-import traceback
-from functools import partial
-
-import torch
-import torch.distributed as dist
-
-import internlm
-from internlm.core.context import ParallelMode
-from internlm.core.context import global_context as gpc
-from internlm.core.scheduler import SchedulerMetricHook
-from internlm.core.trainer import TrainState
-from internlm.initialize import initialize_distributed_env
-from internlm.model.loss import FlashGPTLMLoss
-from internlm.model.metrics import AccPerplex
-from internlm.monitor import initialize_monitor_manager, send_alert_message
-from internlm.monitor.monitor import monitor_manager as mm
-from internlm.train import (
- get_train_data_loader,
- get_validation_data_loader,
- initialize_llm_profile,
- initialize_model,
- initialize_optimizer,
- load_new_batch,
- record_current_batch_training_metrics,
-)
-from internlm.utils.common import (
- BatchSkipper,
- get_megatron_flops,
- launch_time,
- parse_args,
-)
-from internlm.utils.evaluation import evaluate_on_val_dls
-from internlm.utils.gputest import empty_cache_and_diag
-from internlm.utils.logger import get_logger, initialize_uniscale_logger
-from internlm.utils.megatron_timers import megatron_timer as timer
-from internlm.utils.model_checkpoint import CheckpointManager
-from internlm.utils.parallel import get_parallel_log_file_name
-from internlm.utils.simple_memory_profiler import SimpleMemoryProfiler
-from internlm.utils.writer import Writer
-
-# global llm logger
-logger = get_logger(__file__)
-
-
-def initialize_llm_logger(start_time: str):
- """
- Initialize the customized uniscale logger.
-
- Args:
- start_time (str): The launch time of the current training job.
-
- Returns: The uniscale logger instance.
- """
-
- uniscale_logger = initialize_uniscale_logger(
- job_name=gpc.config.JOB_NAME, launch_time=start_time, file_name=get_parallel_log_file_name()
- )
- if uniscale_logger is not None:
- global logger
- logger = uniscale_logger
-
- return uniscale_logger
-
-
-def main(args):
- # init setting
- skip_batches = gpc.config.data.skip_batches
- total_steps = gpc.config.data.total_steps
- valid_every = gpc.config.data.valid_every
- label_smoothing = gpc.config.loss.label_smoothing
-
- get_tflops_func = partial(
- get_megatron_flops,
- checkpoint=gpc.config.model.checkpoint,
- seq_len=gpc.config.SEQ_LEN,
- hidden_size=gpc.config.model.hidden_size,
- num_layers=gpc.config.model.num_layers,
- vocab_size=gpc.config.model.vocab_size,
- global_batch_size=gpc.config.data.micro_bsz * gpc.config.data.micro_num * gpc.get_world_size(ParallelMode.DATA),
- global_world_size=gpc.get_world_size(ParallelMode.GLOBAL),
- mlp_ratio=gpc.config.MLP_RATIO,
- )
-
- # get and broadcast current time
- current_time = launch_time()
- objs = [current_time]
- dist.broadcast_object_list(objs, src=0)
- current_time = objs[0]
-
- # initialize customized llm logger
- uniscale_logger = initialize_llm_logger(start_time=current_time)
-
- # initialize model
- model = initialize_model()
-
- with open(args.config, "r") as f:
- config_lines = f.readlines()
-
- # initialize loss function
- criterion = FlashGPTLMLoss(parallel_output=True, label_smoothing=label_smoothing)
-
- # initialize the train and validation data loader
- train_dl, dataset_types = get_train_data_loader(num_worker=4)
- val_dls = get_validation_data_loader()
-
- # initialize and resume train state
- train_state = TrainState(gpc.config, train_dl.batch_sampler)
-
- optimizer, beta2_scheduler, lr_scheduler = initialize_optimizer(model=model)
-
- ckpt_manager = CheckpointManager(
- ckpt_config=gpc.config.ckpt,
- model=model,
- optimizer=optimizer,
- lr_scheduler=lr_scheduler,
- train_dl=train_dl,
- model_config=gpc.config.model,
- model_config_file="".join(config_lines),
- feishu_address=gpc.config.monitor.alert.feishu_alert_address,
- )
-
- # Loading other persistent training states.
- ckpt_manager.try_resume_training(train_state, current_time)
-
- # initialize customized llm writer
- writer = Writer(
- job_name=gpc.config.JOB_NAME,
- launch_time=current_time,
- file_name=get_parallel_log_file_name(),
- tensorboard_folder=gpc.config.tensorboard_folder,
- resume_tb_folder=train_state.resume_tb_folder, # resume from ckpt.
- step_count=train_state.step_count, # resume from ckpt.
- config=config_lines,
- logger=logger,
- enable_tb=gpc.config.enable_tb,
- )
-
- # initialize metric for calculating accuracy and perplexity
- metric = AccPerplex(
- device=torch.cuda.current_device(),
- tp_pg=gpc.get_group(ParallelMode.TENSOR),
- dp_pg=gpc.get_group(ParallelMode.DATA),
- dataset_types=dataset_types,
- )
-
- # initialize trainer
- scheduler_hooks = [
- SchedulerMetricHook(
- metric=metric,
- skip=(
- gpc.is_using_pp()
- and hasattr(gpc.config.model, "num_chunks")
- and gpc.config.model.num_chunks > 1
- and gpc.config.parallel["pipeline"].get("interleaved_overlap", False)
- ),
- ),
- ]
-
- trainer, train_dl, _, _ = internlm.initialize_trainer(
- model=model,
- optimizer=optimizer,
- criterion=criterion,
- train_dataloader=train_dl,
- lr_scheduler=lr_scheduler,
- beta2_scheduler=beta2_scheduler,
- scheduler_hooks=scheduler_hooks,
- )
-
- # initialize simple memory profiler
- if args.profiling:
- memory_profiler = SimpleMemoryProfiler(
- model,
- optimizer.optim,
- log_folder=f"memory_trace/rank{gpc.get_global_rank()}_"
- + f"dp{gpc.get_local_rank(ParallelMode.DATA)}_"
- + f"tp{gpc.get_local_rank(ParallelMode.TENSOR)}",
- )
- else:
- memory_profiler = None
-
- # initialize the batch skipper
- batch_skipper = BatchSkipper(skip_batches)
-
- trainer.train()
-
- # transfer the train data loader into train data iterator
- train_iter = iter(train_dl)
-
- with initialize_llm_profile(profiling=args.profiling, start_time=current_time) as prof:
- # start iterating the train data and begin training
- for batch_count in range(train_state.batch_count, total_steps):
- empty_cache_and_diag(batch_count, interval=gpc.config.data.empty_cache_and_diag_interval)
- start_time = time.time()
- timer("one-batch").start()
-
- # load batch data
- batch, train_iter = load_new_batch(train_dl=train_dl, train_iter=train_iter, train_state=train_state)
-
- # record the consumed samples in training
- train_state.batch_count = batch_count
- train_state.num_consumed_samples_in_epoch += len(batch[1])
- if batch_skipper(batch_count): # skip this batch
- if gpc.is_rank_for_log():
- logger.info(f"Skip batch count:`{batch_count}`...")
- timer("one-batch").stop()
- continue
-
- # zero the grads of parameters
- trainer.zero_grad()
- # process data
- if batch[0].get("type_ids", None) is not None:
- metric.set_current_type_ids(type_ids=batch[0].pop("type_ids", None))
-
- # do forward and backward
- timer("fwd-bwd").start()
-
- _, _, loss = trainer.execute_schedule(
- batch, forward_only=False, return_loss=True, return_output_label=False
- )
- timer("fwd-bwd").stop()
-
- # update parameters, and returns (success_update, grad_norm)
- trainer_result = trainer.step()
- assert trainer_result is not None
-
- success_update, grad_norm_groups = trainer_result
- if success_update: # update parameters successfully
- train_state.step_count += 1
- else:
- train_state.inf_nan_skip_batches += 1 # record the number of batches whose parameter update was skipped
- if -1 in grad_norm_groups.values() and gpc.is_rank_for_log(): # -1 marks a group whose gradient norm overflowed (inf/nan)
- logger.warning(f"Warning: skip parameter update at step {batch_count}.")
- send_alert_message(
- address=gpc.config.monitor.alert.feishu_alert_address,
- message=f"Warning: skip parameter update at step {batch_count}.",
- )
-
- # calculate and record the training metrics, e.g. loss and accuracy
- record_current_batch_training_metrics(
- get_tflops_func=get_tflops_func,
- logger=logger,
- writer=writer,
- success_update=success_update,
- batch_count=batch_count,
- batch=batch,
- train_state=train_state,
- optimizer=optimizer,
- beta2_scheduler=beta2_scheduler,
- trainer=trainer,
- start_time=start_time,
- loss=loss,
- grad_norm=grad_norm_groups,
- metric=metric,
- update_panel=uniscale_logger is not None,
- )
-
- timer("one-batch").stop()
-
- # evaluate on validation data loaders
- if valid_every > 0 and train_state.step_count % valid_every == 0:
- evaluate_on_val_dls(
- trainer=trainer,
- val_dls=val_dls,
- writer=writer,
- logger=logger,
- step_count=train_state.step_count,
- update_panel=uniscale_logger is not None,
- )
-
- # checkpoint the training state at the interval determined by the "checkpoint_every" argument;
- # this also saves the batch sampler that tracks the truly consumed samples
- now_break = ckpt_manager.try_save_checkpoint(train_state)
- if now_break:
- break
-
- if memory_profiler is not None:
- memory_profiler.step()
-
- if batch_count % 2 == 0:
- prof.step()
-
- ckpt_manager.wait_async_upload_finish()
-
-
-if __name__ == "__main__":
- args = parse_args()
- hostname = socket.gethostname()
-
- # initialize distributed environment
- initialize_distributed_env(config=args.config, launcher=args.launcher, master_port=args.port, seed=args.seed)
- assert hasattr(gpc, "config") and gpc.config is not None
-
- # initialize monitor manager context
- with initialize_monitor_manager(
- job_name=gpc.config.JOB_NAME, alert_address=gpc.config.monitor.alert.feishu_alert_address
- ):
- try:
- main(args)
- except Exception:
- logger.error(
- f"Raise exception from {hostname} with rank id: {gpc.get_global_rank()}\n{traceback.format_exc()}",
- )
- mm.monitor_exception(
- alert_address=gpc.config.monitor.alert.feishu_alert_address, excp_info=traceback.format_exc()
- )
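For orientation, the `gpc.config` fields that `train.py` reads can be collected into an illustrative config sketch. This is not a shipped configuration; every value is a placeholder and only the field names are taken from the script above:

```python
# Illustrative config sketch (placeholder values) covering the fields train.py accesses.
JOB_NAME = "demo_7b"
SEQ_LEN = 2048
MLP_RATIO = 8 / 3

model = dict(
    checkpoint=False,      # activation checkpointing flag, used by get_megatron_flops
    hidden_size=4096,
    num_layers=32,
    vocab_size=103168,
    num_chunks=1,          # >1 only with interleaved pipeline parallelism
)

data = dict(
    micro_bsz=2,
    micro_num=4,
    total_steps=50000,
    skip_batches="",       # consumed by BatchSkipper
    valid_every=50,
    empty_cache_and_diag_interval=10,
)

loss = dict(label_smoothing=0)

ckpt = dict()              # consumed by CheckpointManager ("checkpoint_every", paths, ...)
monitor = dict(alert=dict(feishu_alert_address=None))
parallel = dict(pipeline=dict(interleaved_overlap=False))

enable_tb = True
tensorboard_folder = None
```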
diff --git a/version.txt b/version.txt
deleted file mode 100644
index 0ea3a94..0000000
--- a/version.txt
+++ /dev/null
@@ -1 +0,0 @@
-0.2.0