docs(doc/code-docs): add figure for training docs (#307)

* add training image for docs

* docs(doc/code-docs): add training img for en doc

* docs(doc/code-docs): fix en docs for initialize

* docs(doc/code-docs): update conf file for readthedocs

* docs(doc/code-docs): fix typos

* docs(doc/code-docs): fix typos for readthedocs

* docs(doc/code-docs): minor typo fix for readthedocs

* docs(doc/code-docs): fix readthedocs conf file

* docs(doc/code-docs): update training image

* docs(doc/code-docs): fix typos

* docs(doc/code-docs): update training image

* docs(doc/code-docs): move training image to section initialize

* docs(doc/code-docs): fix lint

* add badge about readthedocs status
Season 2023-09-15 15:22:22 +08:00 committed by GitHub
parent 07fc5f674a
commit de68cc5007
16 changed files with 274 additions and 183 deletions

View File

@@ -16,6 +16,7 @@
[![license](./doc/imgs/license.svg)](./LICENSE)
[![evaluation](./doc/imgs/compass_support.svg)](https://github.com/internLM/OpenCompass/)
[![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest)
[📘使用法](./doc/en/usage.md) |
[🛠️インストール](./doc/en/install.md) |

View File

@@ -16,6 +16,7 @@
[![license](./doc/imgs/license.svg)](https://github.com/open-mmlab/mmdetection/blob/main/LICENSE)
[![evaluation](./doc/imgs/compass_support.svg)](https://github.com/internLM/OpenCompass/)
[![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest)
[📘使用文档](./doc/usage.md) |
[🛠️安装教程](./doc/install.md) |

View File

@@ -16,6 +16,7 @@
[![license](./doc/imgs/license.svg)](./LICENSE)
[![evaluation](./doc/imgs/compass_support.svg)](https://github.com/internLM/OpenCompass/)
[![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest)
[📘Usage](./doc/en/usage.md) |
[🛠Installation](./doc/en/install.md) |

View File

@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-07 10:56+0800\n"
"POT-Creation-Date: 2023-09-13 17:07+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
@@ -19,30 +19,33 @@ msgstr ""
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../source/checkpoint.rst:2 09c8645fba264cdf9a80c4b62c2bb4d1
#: ../../source/checkpoint.rst:2
msgid "模型保存"
msgstr "Model Checkpointing"
#: ../../source/checkpoint.rst:4 8b158d34631045b1afdb4fb0169b3c71
#: ../../source/checkpoint.rst:4
msgid ""
"InternLM 使用 ``internlm.utils.model_checkpoint.CheckpointManager`` "
"来管理模型保存。 其中,可以 使用 ``CheckpointManager.try_save_checkpoint(train_state)`` "
"来保存指定 step 的模型状态。InternLM支持启动时自动加载最新的模型备份并在接收信号退出训练时自动进行模型备份。"
"来管理模型保存。其中,可以使用 ``CheckpointManager.try_save_checkpoint(train_state)`` "
"来保存指定 step 的模型状态。"
msgstr ""
"InternLM uses ``internlm.utils.model_checkpoint.CheckpointManager`` to manage model checkpointing. In the implementation, "
"we use ``CheckpointManager.try_save_checkpoint(train_state)`` to checkpoint training states at specific steps. InternLM supports "
"automatic loading of latest ckpt at startup and automatic model checkpointing at signal quit."
"InternLM uses ``internlm.utils.model_checkpoint.CheckpointManager`` to "
"manage model checkpointing. In the implementation, we use "
"``CheckpointManager.try_save_checkpoint(train_state)`` to checkpoint "
"training states at specific steps. "
#: ../../source/checkpoint.rst:8 a023b5a6d15749bfaa51cf2da194bda1
#: ../../source/checkpoint.rst:6
msgid "InternLM支持启动时自动加载最新的模型备份并在接收信号退出训练时自动进行模型备份。"
msgstr "InternLM supports automatic loading of latest ckpt at startup and automatic model checkpointing at signal quit. "
#: ../../source/checkpoint.rst:9
msgid "Checkpointing"
msgstr ""
#: 938575c699d1426c87e0b3f589a85d50
#: internlm.utils.model_checkpoint.CheckpointManager:1 of
msgid "StorageManagerContext"
msgstr ""
#: 754d6881cd034c5ebaab0f3362dd14c2
#: internlm.utils.model_checkpoint.CheckpointManager.quit_signal_handler:1 of
msgid ""
"Exit signal detection function, if we write the exit step in the "
@@ -51,34 +54,27 @@ msgid ""
"quit."
msgstr ""
#: 2169f9fb4a8b40bc9bf6093894fc7a5e 6a55d2b2b24a44c8b78b40f19f4d950b
#: internlm.utils.model_checkpoint.CheckpointManager.quit_signal_handler
#: internlm.utils.model_checkpoint.CheckpointManager.try_resume_training of
#: internlm.utils.model_checkpoint.CheckpointManager.quit_signal_handler of
msgid "参数"
msgstr ""
#: 360a89b1591e4627ac432f4d75050354
#: internlm.utils.model_checkpoint.CheckpointManager.quit_signal_handler of
msgid "返回"
msgstr ""
#: 2426832f4a8a4c5481be1c940e0e7b50
#: internlm.utils.model_checkpoint.CheckpointManager.quit_signal_handler:9 of
msgid "whether to quit."
msgstr ""
#: 5f6842c261544a3c89f32d981b3ad755
#: internlm.utils.model_checkpoint.CheckpointManager.quit_signal_handler of
msgid "返回类型"
msgstr ""
#: 1392da84b6e645bcb8dab605e1231fdc
#: internlm.utils.model_checkpoint.CheckpointManager.wait_async_upload_finish:1
#: of
msgid "wait for all checkpoint uploads to be completed"
msgstr ""
#: d1774593e9c94608b49b10504bfbc38b
#: internlm.utils.model_checkpoint.CheckpointManager.query_latest_snapshot_step_boto3:1
#: of
msgid ""
@@ -86,38 +82,25 @@ msgid ""
"found, None will return."
msgstr ""
#: a3abbbd2bd574872892d908ab248e804
#: internlm.utils.model_checkpoint.CheckpointManager.try_resume_training:1 of
msgid "Attempt to restore the training state of the last ckpt."
msgstr ""
#: de021d1eb6d54955a2850c11c0191710
#: internlm.utils.model_checkpoint.CheckpointManager.try_resume_training:3 of
msgid "lr_scheduler object."
msgstr ""
#: 20be15854f2e420a9d96c86b5869bfa6
#: internlm.utils.model_checkpoint.CheckpointManager.try_resume_training:5 of
msgid "optimizer object."
msgstr ""
#: 68f69086c5054acc8aca15c8a764acc5
#: internlm.utils.model_checkpoint.CheckpointManager.try_resume_training:7 of
msgid "learning rate."
msgstr ""
#: 5d34d34a972d4abeab4bda3e49ee157b
#: internlm.utils.model_checkpoint.CheckpointManager.try_resume_training:9 of
msgid "traing states."
msgstr ""
#: 82ebb67afaa748ecabc4cef598d7fc30
#: internlm.utils.model_checkpoint.CheckpointManager.try_resume_training:11 of
msgid "traning dataloader object"
msgstr ""
#: 0c95dfcd712749279daca78166bb4326
#: internlm.utils.model_checkpoint.CheckpointManager.save_checkpoint:1 of
msgid "Save checkpoint to the given folder path."
msgstr ""
#~ msgid "Attempt to restore the training state of the last ckpt."
#~ msgstr ""
#~ msgid "lr_scheduler object."
#~ msgstr ""
#~ msgid "optimizer object."
#~ msgstr ""
#~ msgid "learning rate."
#~ msgstr ""
#~ msgid "traing states."
#~ msgstr ""
#~ msgid "traning dataloader object"
#~ msgstr ""

View File

@@ -37,8 +37,8 @@ msgstr "Start Training"
#: ../../source/example/30B_demo.rst:166 24974384d5ab42e68266aeb67ae222ce
msgid "完成以上训练配置后,可启动模型训练,以在 ``slurm`` 平台上为例,启动两节点 16GPU 的训练命令如下所示:"
msgstr "After completing the data preparation and relevant training configurations, you can start the demo training.
The following example shows how to start distributed training in ``slurm`` environments with 16 GPUs."
msgstr "After completing the data preparation and relevant training configurations, you can start the demo training. "
"The following example shows how to start distributed training in ``slurm`` environments with 16 GPUs."
#: ../../source/example/30B_demo.rst:173 948ac71ed53848f9bad07f69d956c4bb
msgid "训练结果"

View File

@@ -37,8 +37,8 @@ msgstr "Start Training"
#: ../../source/example/7B_demo.rst:164 9e7a864ae2e14d05b0681f16792e5278
msgid "完成以上训练配置后,可启动模型训练,以在 ``slurm`` 平台上为例,启动单节点 8GPU 的训练命令如下所示:"
msgstr "After completing the data preparation and relevant training configurations, you can start the demo training.
The following example shows how to start distributed training in ``slurm`` environments with 8 GPUs."
msgstr "After completing the data preparation and relevant training configurations, you can start the demo training. "
"The following example shows how to start distributed training in ``slurm`` environments with 8 GPUs."
#: ../../source/example/7B_demo.rst:171 fdd053efb1854d46aabf6c0f279fe7fc
msgid "训练结果"

View File

@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-08 15:32+0800\n"
"POT-Creation-Date: 2023-09-14 12:23+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: zh_CN\n"
@@ -23,24 +23,68 @@ msgstr ""
msgid "训练构建"
msgstr "Training Setup"
#: ../../source/initialize.rst:7
#: ../../source/initialize.rst:4
msgid "InternLM 的训练流程可以归纳为两个步骤:"
msgstr "The training process of InternLM can be summarized into two steps: "
#: ../../source/initialize.rst:6
msgid "初始化"
msgstr "Initialization"
#: ../../source/initialize.rst:8
msgid "初始化模型、优化器、数据加载器、Trainer生成不同种类的进程组为混合并行的迭代训练做准备。"
msgstr ""
"Initialize model, optimizer, dataloader, trainer, and create different "
"types of process groups to prepare for iterative steps of hybrid parallel training. "
#: ../../source/initialize.rst:9
msgid "初始化Logger、Checkpoint管理器、Monitor管理器、Profiler对迭代训练的过程观察、预警、记录。"
msgstr ""
"Initialize logger, checkpoint manager, monitor manager, and profiler to "
"watch, alert, and record the iterative training steps. "
#: ../../source/initialize.rst:11
msgid "迭代训练"
msgstr "Iterative training steps"
#: ../../source/initialize.rst:13
msgid "根据配置文件定义的张量并行、流水线并行、数据并行的大小,加载训练引擎和调度器进行混合并行训练。"
msgstr ""
"Load the training engine and scheduler for hybrid parallel training "
"according to the configuration such as tensor parallel size, pipeline "
"parallel size, and data parallel size. "
#: ../../source/initialize.rst:14
msgid "在迭代训练中,调用 Trainer API 进行梯度置零,前向传播计算损失并反向传播,参数更新。"
msgstr ""
"In iterative training steps, the Trainer API is called to perform zero "
"gradients, forward-loss-backward, and parameter update."
#: ../../source/initialize.rst:20
msgid "InternLM训练流程图"
msgstr "InternLM training process"
#: ../../source/initialize.rst:25
msgid "命令行参数解析"
msgstr "Argument Parsing"
#: ../../source/initialize.rst:9
#, fuzzy
#: ../../source/initialize.rst:27
msgid ""
"InternLM 使用 `argparse <https://docs.python.org/3/library/argparse.html>`_"
" 库来向InternLM运行时提供命令行参数配置。用户可使用 "
"``internlm.initialize.get_default_parser()`` 来获取 InternLM "
"的默认解析器,其中包含一些内置参数,用户可以向此解析器添加自定义参数。"
" 库来向InternLM运行时提供命令行参数配置。"
msgstr ""
"InternLM uses the `argparse "
"<https://docs.python.org/3/library/argparse.html>`_ library to supply "
"commandline configuration to the InternLM runtime. Use "
"``internlm.initialize.get_default_parser()`` to get InternLM's default "
"parser with some builtin arguments, users can add custom parameters to "
"this parser."
"commandline configuration to the InternLM runtime. "
#: ../../source/initialize.rst:29
msgid ""
"用户可使用 ``internlm.initialize.get_default_parser()`` 来获取 InternLM "
"的默认解析器,其中包含一些内置参数,用户可以向此解析器添加自定义参数。"
msgstr ""
"Use ``internlm.initialize.get_default_parser()`` to get InternLM's "
"default parser with some builtin arguments, users can add custom "
"parameters to this parser."
#: internlm.initialize.launch.get_default_parser:1 of
msgid ""
@@ -69,7 +113,7 @@ msgstr ""
msgid "返回类型"
msgstr ""
#: ../../source/initialize.rst:25
#: ../../source/initialize.rst:45
msgid "模型初始化"
msgstr "Model Initialization"
@@ -81,26 +125,26 @@ msgstr ""
msgid "The neural network model to be trained or evaluated."
msgstr ""
#: ../../source/initialize.rst:29
#: ../../source/initialize.rst:49
msgid "InternLM 在配置文件中使用字段 ``model_type`` 和 ``model`` 来控制模型初始化过程。示例模型初始化配置定义如下:"
msgstr ""
"InternLM uses the field ``model_type`` and ``model`` in the config file "
"to control model initialization process. An example model initialization "
"configuratio"
#: ../../source/initialize.rst:57
#: ../../source/initialize.rst:77
msgid "字段 ``model_type`` 指明了要初始化的模型类型"
msgstr ""
"The field ``model_type`` specifics the model type has been registered and"
" to be initialized."
#: ../../source/initialize.rst:58
#: ../../source/initialize.rst:78
msgid "字段 ``model`` 中的参数指定了在模型初始化过程中的参数设置"
msgstr ""
"The parameters in field ``model`` specific the configuration settings "
"during model initialization."
#: ../../source/initialize.rst:60
#: ../../source/initialize.rst:80
msgid ""
"值得注意的是,用户可以定义新的模型类型,并使用装饰器 ``@MODEL_INITIALIZER.register_module`` "
"注册模型的初始化函数,其中 ``MODEL_INITIALIZER`` 是类 "
@@ -112,7 +156,7 @@ msgstr ""
" instantiated object of class ``internlm.util.registry.Registry``, the "
"example is shown as follows."
#: ../../source/initialize.rst:72
#: ../../source/initialize.rst:92
msgid "优化器初始化"
msgstr "Optimizer Initialization"
@@ -134,7 +178,7 @@ msgstr ""
msgid "A tuple of (optimizer, beta2_scheduler, lr_scheduler)."
msgstr ""
#: ../../source/initialize.rst:79
#: ../../source/initialize.rst:99
msgid "数据加载器初始化"
msgstr "Dataloader Initialization"
@@ -162,7 +206,7 @@ msgstr ""
msgid "A tuple of (train_dl, dataset_types)."
msgstr ""
#: ../../source/initialize.rst:86
#: ../../source/initialize.rst:106
msgid "Trainer 初始化"
msgstr "Trainer Initialization"

View File

@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-08 15:32+0800\n"
"POT-Creation-Date: 2023-09-14 11:05+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
@@ -32,13 +32,13 @@ msgid ""
"InternLM 使用 ``internlm.train.initialize_llm_profile()`` "
"来收集和分析模型训练或推理期间的性能数据,如 CPU/CUDA/memory 等性能数据。这个实现基于 `torch.profiler "
"<https://pytorch.org/docs/stable/profiler.html>`_ ,输出的性能分析 trace 文件可以使用 "
"`tensorboard <https://www.tensorflow.org>`_ 进行可视化。"
"`tensorboard <https://www.tensorflow.org/tensorboard?hl=en>`_ 进行可视化。"
msgstr ""
"InternLM uses ``internlm.train.initialize_llm_profile()`` to profile "
"performance data, execution time duration and breakdown analysis of step "
"time. The implementation is based on `torch.profiler "
"<https://pytorch.org/docs/stable/profiler.html>`_ and output tracing "
"files can be visualized with `tensorboard <https://www.tensorflow.org>`_."
"files can be visualized with `tensorboard <https://www.tensorflow.org/tensorboard?hl=en>`_."
#: ../../source/profiler.rst:11
msgid ""
@@ -53,11 +53,15 @@ msgstr ""
#: ../../source/profiler.rst:13
msgid "实际运行生成的 ``Torch Profiler`` 目录结构如下:"
msgstr "The directory structure of ``Torch Profiler`` generated files is as follows:"
msgstr ""
"The directory structure of ``Torch Profiler`` generated files is as "
"follows:"
#: ../../source/profiler.rst:22
msgid "其中, ``traces`` 可以通过 ``TensorBoard`` 可视化,运行命令"
msgstr "Among them, ``traces`` can be visualized through ``TensorBoard`` and run with the command"
msgstr ""
"Among them, ``traces`` can be visualized through ``TensorBoard`` and run "
"with the command"
#: ../../source/profiler.rst:29
msgid ""
@@ -66,7 +70,12 @@ msgid ""
"tensorboard "
"<https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html"
"#pytorch-profiler-with-tensorboard>`_"
msgstr "In the opened ``TensorBoard -> PyTorch Profiler -> Views -> Trace`` page, you can see the timeline of profiled operators and GPU kernels. For more usage, please refer to `torch profiler with tensorboard <https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html#pytorch-profiler-with-tensorboard>`_"
msgstr ""
"In the opened ``TensorBoard -> PyTorch Profiler -> Views -> Trace`` page,"
" you can see the timeline of profiled operators and GPU kernels. For more"
" usage, please refer to `torch profiler with tensorboard "
"<https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html"
"#pytorch-profiler-with-tensorboard>`_"
#: internlm.train.training_internlm.initialize_llm_profile:1 of
msgid "Initialize and return the profiler context manager instance."

View File

@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-07 10:56+0800\n"
"POT-Creation-Date: 2023-09-14 12:23+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
@@ -19,109 +19,144 @@ msgstr ""
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../source/training.rst:2 6eafa5eb08e040039309a39cdb0f1bfe
#: ../../source/training.rst:2
msgid "训练 API"
msgstr "Training API"
#: ../../source/training.rst:4 74d81f3d0ca54c839d4e80bd589aedb2
#: ../../source/training.rst:4
msgid ""
"InternLM 的训练 API 由 ``internlm.core.trainer.Trainer`` "
"管理。在定义了训练引擎和调度器之后,我们可以调用 Trainer API 来执行模型训练、评估、梯度清零和参数更新等。"
msgstr ""
"InternLM training API is managed in ``internlm.core.trainer.Trainer``. After defining the "
"training engine and runtime scheduler, we can call training API to perform training, evaluation, "
"zero gradients and parameter update steps."
"InternLM training API is managed in ``internlm.core.trainer.Trainer``. "
"After defining the training engine and runtime scheduler, we can call "
"training API to perform training, evaluation, zero gradients and "
"parameter update steps."
#: ../../source/training.rst:6 0e0cfddbb2334d3da99d3289edf4161d
#: ../../source/training.rst:6
msgid "有关详细用法,请参阅 Trainer API 文档和示例。"
msgstr "For detailed usage, please refer to Trainer API documentation and examples."
msgstr ""
"For detailed usage, please refer to Trainer API documentation and "
"examples."
#: 7ea10280a8f1489984cb9994aa08976b internlm.core.trainer.Trainer:1 of
#: internlm.core.trainer.Trainer:1 of
msgid ""
"This is a class tending for easy deployments of users' training and "
"evaluation instead of writing their own scripts."
msgstr ""
#: 7969dca55840451193bffd3b071ab3b3 aff576168b59460491bb5da0ce41ea74
#: internlm.core.trainer.Trainer internlm.core.trainer.Trainer.execute_schedule
#: of
msgid "参数"
msgstr ""
#: 59754d3e9ee8452a872bf397c01e0d8c internlm.core.trainer.Trainer:4 of
#: internlm.core.trainer.Trainer:4 of
msgid "Engine responsible for the process function."
msgstr ""
#: 2d18ff15256e48f98901c7a7e0cbbe35 internlm.core.trainer.Trainer:6 of
#: internlm.core.trainer.Trainer:6 of
msgid "Runtime schedule. Defaults to None."
msgstr ""
#: 76f4b3c7feba40eca3ee2b32559c53f5 internlm.core.trainer.Trainer.engine:1 of
#: internlm.core.trainer.Trainer.engine:1 of
msgid ""
"Returns the engine that responsible for managing the training and "
"evaluation process."
msgstr ""
#: c7eae2d4d06c4ef891e314902d80b7f3 internlm.core.trainer.Trainer.schedule:1 of
#: internlm.core.trainer.Trainer.schedule:1 of
msgid "Returns the runtime scheduler."
msgstr ""
#: cb495b21b3444881aec83803e92386d9
#: internlm.core.trainer.Trainer.uses_pipeline:1 of
msgid "Returns whether the pipeline parallel is used or not."
msgstr ""
#: 86b0b631189e46468281a397c5e97350 internlm.core.trainer.Trainer.train:1 of
#: internlm.core.trainer.Trainer.train:1 of
msgid "Sets the model to training mode."
msgstr ""
#: f997e13120ee4d8b9e45ea6698b3e2a6 internlm.core.trainer.Trainer.eval:1 of
#: internlm.core.trainer.Trainer.eval:1 of
msgid "Sets the model to evaluation mode."
msgstr ""
#: a8179e50312d47dcbe9de0433a65c2f7 internlm.core.trainer.Trainer.zero_grad:1
#: of
#: internlm.core.trainer.Trainer.zero_grad:1 of
msgid "Sets the gradient of all parameters in the model to zero."
msgstr ""
#: f936136ef9e0452ca439b7c66dc8884b internlm.core.trainer.Trainer.step:1 of
#: internlm.core.trainer.Trainer.step:1 of
msgid "Executes the parameter update step."
msgstr ""
#: 250e2af89cfd432c84d228f9e03c174c
#: internlm.core.trainer.Trainer.execute_schedule:1 of
msgid ""
"Runs the forward, loss computation, and backward for the model. Returns a"
" tuple of (output, label, loss)."
msgstr ""
#: 6ca7de83033b432792eb0d7935ea04da
#: internlm.core.trainer.Trainer.execute_schedule:4 of
msgid "The data iterator."
msgstr ""
#: 6d3044e75b3149beba3c659e15607b79
#: internlm.core.trainer.Trainer.execute_schedule:6 of
msgid "Additional keyword arguments."
msgstr ""
#: 99d5a297d6414c30b432acf2566f0d3c
#: internlm.core.trainer.Trainer.execute_schedule of
msgid "返回"
msgstr ""
#: b625ebf0cf874edba384456d33e740b4
#: internlm.core.trainer.Trainer.execute_schedule:8 of
msgid "A tuple of (output, label, loss)."
msgstr ""
#: 391cde57d2e2478d8f83a7ad270c2a65
#: internlm.core.trainer.Trainer.execute_schedule of
msgid "返回类型"
msgstr ""
#: d4c4fb0fbddb499786970509cf0c9e13
#: internlm.core.trainer.Trainer.execute_schedule:9 of
msgid "Tuple[:class:`torch.Tensor`]"
msgstr ""
#~ msgid "InternLM 的训练流程可以归纳为两个步骤:"
#~ msgstr "The training process of InternLM can be summarized into two steps: "
#~ msgid "初始化"
#~ msgstr "Initialization"
#~ msgid "初始化模型、优化器、数据加载器、Trainer生成不同种类的进程组为混合并行的迭代训练做准备。"
#~ msgstr ""
#~ "Initialize model, optimizer, dataloader, "
#~ "trainer, and create different types of"
#~ " process groups to prepare for "
#~ "iterative steps of hybrid parallel "
#~ "training. "
#~ msgid "初始化Logger、Checkpoint管理器、Monitor管理器、Profiler对迭代训练的过程观察、预警、记录。"
#~ msgstr ""
#~ "Initialize logger, checkpoint manager, monitor"
#~ " manager, and profiler to watch, "
#~ "alert, and record the iterative training"
#~ " steps. "
#~ msgid "迭代训练"
#~ msgstr "Iterative training steps"
#~ msgid "根据配置文件定义的张量并行、流水线并行、数据并行的大小,加载训练引擎和调度器进行混合并行训练。"
#~ msgstr ""
#~ "Load the training engine and scheduler"
#~ " for hybrid parallel training according "
#~ "to the configuration such as tensor "
#~ "parallel size, pipeline parallel size, "
#~ "and data parallel size. "
#~ msgid "在迭代训练中,调用 Trainer API 进行梯度置零,前向传播计算损失并反向传播,参数更新。"
#~ msgstr ""
#~ "In iterative training steps, the Trainer"
#~ " API is called to perform zero "
#~ "gradients, forward-loss-backward, and "
#~ "parameter update."
#~ msgid "InternLM训练流程图"
#~ msgstr "InternLM training process"

View File

@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-07 14:15+0800\n"
"POT-Creation-Date: 2023-09-13 17:07+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
@@ -19,11 +19,11 @@ msgstr ""
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../../usage.md:2 a64aaaa1525e4e01b0ddcebc42c24bbd
#: ../../../usage.md:2
msgid "使用教程"
msgstr "Quickstart Guide"
#: ../../../usage.md:4 f1b40737fb584d889b82c7f55b652977
#: ../../../usage.md:4
msgid ""
"启动一个 Demo "
"模型训练,需要进行三项准备,**安装****数据集准备**和**模型训练配置**。接下来,首先会介绍数据准备相关的操作,再简要描述模型训练配置相关的内容。"
@@ -33,21 +33,21 @@ msgstr ""
"configuration**. In this guide, we will first cover the steps for dataset"
" preparation and then briefly describe the model training configuration."
#: ../../../usage.md:6 b35abe307c2f4d23866fff828308ebf2
#: ../../../usage.md:6
msgid "安装"
msgstr "Installation"
#: ../../../usage.md:7 64a8c1f5f71c45519e636aa7edba10bc
#: ../../../usage.md:7
msgid "请参考[安装文档](./install.md)进行安装。"
msgstr ""
"Please refer to the [installation guide](./install.md) for instructions "
"on how to install the necessary dependencies."
#: ../../../usage.md:9 bd96714d12ee415794dea5a4578bd8cd
#: ../../../usage.md:9
msgid "数据准备 (预训练)"
msgstr "Dataset Preparation (Pre-training)"
#: ../../../usage.md:11 5a0b39fb9da94e96b87db40d1f231a0c
#: ../../../usage.md:11
msgid "InternLM训练任务的数据集包括一系列的`bin`和`meta`文件。使用`tokenizer`从原始文本文件生成训练用数据集。通过在`tools/tokenizer.py`中指定模型参数路径的方式来导入tokenizer模型。目前提供`V7_sft.model`来生成tokens。若想使用不同的模型可直接修改`tokernizer.py`中的模型参数路径。"
msgstr ""
"The dataset for the InternLM training task includes a series of `bin` and"
@@ -58,7 +58,7 @@ msgstr ""
"different model, you can directly modify the model parameter path in "
"`tokenizer.py`."
#: ../../../usage.md:13 3cef8126b8784af48d81cc140322909e
#: ../../../usage.md:13
msgid "可以运行以下命令生成原始数据对应的`bin`和`meta`文件,其中参数`text_input_path`表示原始文本数据路径,目前支持`txt`、`json`和`jsonl`三种输入格式,`bin_output_path`表示生成的`bin`文件的保存路径。"
msgstr ""
"You can run the following command to generate `bin` and `meta` files "
@@ -67,30 +67,30 @@ msgstr ""
"`txt`, `json`, and `jsonl` formats, while `bin_output_path` represents "
"the save path of the generated `bin` files."
#: ../../../usage.md:18 107ff2280da14cb6a27f4e9857186333
#: ../../../usage.md:18
msgid "下面是一个数据处理的例子:"
msgstr "Here is an example of data processing:"
#: ../../../usage.md:20 c11a9860263c4e2288a561f3435fa706
#: ../../../usage.md:20
msgid "给定一个包含原始数据集的文件`raw_data.txt`,原始数据集如下所示:"
msgstr ""
"Given a file `raw_data.txt` containing the raw dataset, the raw dataset "
"is shown below:"
#: ../../../usage.md:27 4012599b42ab47bd979d2a0b79ca1147
#: ../../../usage.md:27
msgid "可以通过运行以下命令来生成`bin`和`meta`文件:"
msgstr ""
"You can generate the `bin` and `meta` files by running the following "
"command:"
#: ../../../usage.md:32 cca91b6cf53a4082932dd34ea4b7f954
#: ../../../usage.md:32
msgid "需要注意的是,生成的`bin`文件需要保存在`cn`或者`en`或者`code`或者`ja`或者`ar`或者`kaoshi`这六个目录下,以区分数据集的类型。"
msgstr ""
"It should be noted that the generated `bin` files need to be saved in one"
" of the following directories: `cn`, `en`, `code`, `ja`, `ar`, or "
"`kaoshi`, depending on the type of dataset."
#: ../../../usage.md:34 417312ca1e35479e811953f777e3565a
#: ../../../usage.md:34
msgid "其中,`cn`表示中文数据集;`en`表示英文数据集;`code`表示代码数据集;`ja`表示日语数据集;`ar`表示阿拉伯语数据集;`kaoshi`表示考试数据集。"
msgstr ""
"Here, `cn` represents the Chinese dataset, `en` represents the English "
@@ -98,22 +98,22 @@ msgstr ""
" dataset, `ar` represents the Arabic dataset, and `kaoshi` represents the"
" exam dataset."
#: ../../../usage.md:36 79c21f8e89b34499ba4e25e20593ec28
#: ../../../usage.md:36
msgid "生成的bin文件的格式如下"
msgstr "The format of the generated `bin` files is as follows:"
#: ../../../usage.md:42 26388d996c4e4116bc216be9bc007f62
#: ../../../usage.md:42
msgid "`bin`文件中的每一行均对应原始数据集中的每一个句子,表示每个句子的`token`下文将用sequence指定。"
msgstr ""
"Each line in the `bin` file corresponds to each sentence in the original "
"dataset, representing the tokens of each sentence (referred to as "
"sequence below)."
#: ../../../usage.md:44 b39148a85ee64a349975d26282fbe59b
#: ../../../usage.md:44
msgid "生成的`meta`文件的格式如下:"
msgstr "The format of the generated `meta` file is as follows:"
#: ../../../usage.md:48 175a6007197a40568535f945672e5df2
#: ../../../usage.md:48
msgid ""
"在`meta`文件中,每个元组对应着`bin`文件中每一个`sequence`的元信息。其中,元组的第一个元素表示每个`sequence`在所有`sequence`中的`starting"
" index`,第二个元素表示每个`sequence`中有多少个`tokens`。"
@@ -123,7 +123,7 @@ msgstr ""
"index` of each `sequence` among all `sequences`, and the second element "
"indicates the number of `tokens` for each `sequence`."
#: ../../../usage.md:50 46874a3de3924837979f9949f1237e39
#: ../../../usage.md:50
msgid ""
"例如,对于第一个`sequence``starting index`为 0有 11 "
"个`tokens`;对于第二个`sequence`,由于第一个`sequence`转换为`string`后的长度为`89`,因此它的`starting"
@@ -132,17 +132,17 @@ msgstr ""
"For example, the first `sequence` starts at index 0 and has 16 `tokens`. "
"The second `sequence` starts at index 110 and has 24 `tokens`."
#: ../../../usage.md:52 25ea049fa411408b8856e7aa657835ab
#: ../../../usage.md:52
msgid "`json`和`jsonl`类型的文件的`bin`和`meta`文件格式和`txt`一致,此处不再赘叙。"
msgstr ""
"The `bin` and `meta` file formats for `json` and `jsonl` type files are "
"the same as for `txt`, so we won't go over them here."
#: ../../../usage.md:54 bc52f959cb57494483a181e843014ed1
#: ../../../usage.md:54
msgid "数据准备 (微调)"
msgstr "Data Preparation (Fine-tuning)"
#: ../../../usage.md:56 73c74620c2994486acc747ba0c7f0b46
#: ../../../usage.md:56
msgid ""
"微调任务的数据集格式与预训练任务保持一致,生成的数据格式为一系列的`bin`和`meta`文件。以下以 Alpaca "
"数据集为例,介绍微调的数据准备流程。"
@@ -152,7 +152,7 @@ msgstr ""
"the Alpaca dataset as an example to explain the data preparation process "
"for fine-tuning."
#: ../../../usage.md:58 75f0e22d10ca413389ec8b947ae6141f
#: ../../../usage.md:58
msgid ""
"下载 [Alpaca 数据集](https://github.com/tatsu-"
"lab/stanford_alpaca/blob/main/alpaca_data.json)"
@@ -160,87 +160,81 @@ msgstr ""
"Download the [Alpaca dataset](https://github.com/tatsu-"
"lab/stanford_alpaca/blob/main/alpaca_data.json)."
#: ../../../usage.md:60 667606fcea454af48353a5b40f82fc46
#: ../../../usage.md:60
msgid "对 Alpaca 数据进行 tokenize使用以下命令"
msgstr "Tokenize the Alpaca dataset using the following command:"
#: ../../../usage.md:66 60283b9237c8462ea37288b8ece79081
#: ../../../usage.md:66
msgid "建议用户参考 alpaca_tokenizer.py 编写新的脚本对自己的数据集进行 tokenize"
msgstr ""
"It is recommended that users refer to alpaca_tokenizer.py to write new "
"scripts to tokenize their own datasets"
#: ../../../usage.md:68 cdf45a4de9874e9fb65f7104dcee3c61
#: ../../../usage.md:68
msgid "训练配置"
msgstr "Training Configuration"
#: ../../../usage.md:70 7c42ebc23246450cbc1270e1461b16f6
#: ../../../usage.md:70
msgid "以 7B Demo 的配置文件`configs/7B_sft.py`为例,介绍启动一个模型训练所需要进行的数据、模型和并行等相关的配置。"
msgstr ""
"Taking the configuration file `configs/7B_sft.py` for the 7B demo as an "
"example, let's discuss the data, model, and parallel configurations "
"required to start a model training."
#: ../../../usage.md:72 247cfe98a7f44c2293aa2e2351f1ea69
#: ../../../usage.md:72
msgid "数据配置"
msgstr "Data Configuration"
#: ../../../usage.md:73 31327e7dce5848778db5361b3fbded1c
#: ../../../usage.md:73
msgid "数据相关的关键参数配置及释义如下所示:"
msgstr "Here are the key parameters and their explanations for data configuration:"
#: ../../../usage.md:88 4d2608136fef4141bd6e47f78b8591b2
#: ../../../usage.md:88
msgid "![pack_into_one](./imgs/pack_into_one.png)"
msgstr ""
#: ../../../usage.md:88 c5acb028f2694712b2af788a864d5927
#: ../../../usage.md:88
msgid "pack_into_one"
msgstr ""
#: ../../../usage.md:91 db6b9ce8e8294952845893dd7aad098f
#: ../../../usage.md:91
msgid "目前支持传入数据集文件路径`train_folder`,且要求文件格式如下:"
msgstr ""
"Currently, it supports passing the dataset file path `train_folder`, and "
"the file format is required to be as follows:"
#: ../../../usage.md:98 f22536fc3dfa4552a103a7cb57a20f92
#: ../../../usage.md:98
msgid "数据集的详细内容可参考``数据准备``模块相关的介绍。"
msgstr ""
"For detailed information about the dataset, please refer to the \"Data "
"Preparation\" section."
#: ../../../usage.md:100 bc4f0b06e9c24730a7a831b7aca417e2
#: ../../../usage.md:100
msgid "模型配置"
msgstr "Model Configuration"
#: ../../../usage.md:102 ecf278a0a851496fae2e49c436e59368
#: ../../../usage.md:102
msgid "如果在启动训练时要加载模型 `checkpoint`,可进行如下相关配置:"
msgstr ""
"If you want to load a model checkpoint when starting the training, you "
"can configure it as follows:"
#: ../../../usage.md:115 38244aba74294067a4019d0777621746
#: ../../../usage.md:115
msgid "注意:"
msgstr "Note:"
#: ../../../usage.md:116 19d1eb0a797f4bd9a702a00e525d7753
msgid "`load_model_only_folder`与`load_ckpt_folder`不能同时设置"
msgstr ""
"`load_model_only_folder` and `load_ckpt_folder` cannot be set at the same"
" time."
#: ../../../usage.md:117 3ea27a1f6be044a3959890be69311b24
#: ../../../usage.md:116
msgid "路径若以 `local:` 为前缀,则存储在本地文件系统;若以 `boto3:` 为前缀,则存储在远程 oss 上"
msgstr ""
"If the path starts with `local:`, it means the file is stored in the "
"local file system. If it starts with `boto3:`, it means the file is "
"stored in the remote OSS."
#: ../../../usage.md:119 1d6381b4cfff41d8bdd5347e8a135869
#: ../../../usage.md:118
msgid "模型相关关键参数配置如下所示:"
msgstr "The configuration for the model is as follows:"
#: ../../../usage.md:143 1026791c9f054576857ef1930db6b167
#: ../../../usage.md:142
msgid "注意:用户可自定义模型类型名和模型结构,并配置相对应的模型参数。通过`utils/registry.py`下的`MODEL_INITIALIZER`对象进行模型初始化函数接口注册,在训练主函数`train.py`中初始化模型时,可通过`model_type`配置获取指定的模型初始化接口函数。"
msgstr ""
"Note: Users can customize the model type name and model structure, and "
@@ -251,7 +245,7 @@ msgstr ""
"interface function can be obtained through the `model_type` "
"configuration."
#: ../../../usage.md:145 34823bcbe7754190bc9747758c1aad0c
#: ../../../usage.md:144
msgid ""
"*如果基于 InternLM 7B继续训练可以参考 "
"[ModelZoo](https://github.com/InternLM/InternLM/tree/main#model-zoo) 中 "
@@ -261,79 +255,76 @@ msgstr ""
"OpenXLab [ModelZoo](https://github.com/InternLM/InternLM/tree/main#model-"
"zoo) to download weights*."
#: ../../../usage.md:147 4cabc928f8884cd38a6bb683b3bfade3
#: ../../../usage.md:146
msgid "并行配置"
msgstr "Parallel Configuration"
#: ../../../usage.md:149 f97ade07340340959345e73567bae793
#: ../../../usage.md:148
msgid "训练并行配置样例如下:"
msgstr "Training parallel configuration example:"
#: ../../../usage.md:158 87fb5a4e4a4047ee8a9b8bb43915636d
#: ../../../usage.md:157
msgid "zero1zero 并行策略,分如下三种情况,默认值为 -1"
msgstr ""
"zero1: zero parallel strategy, divided into the following three cases, "
"default value is -1"
#: ../../../usage.md:159 58dc08e2c52e4aaba99b4fbb6cf2e8b4
#, fuzzy
#: ../../../usage.md:158
msgid "当`zero1 <= 0`,则 zero1 进程组的大小等于数据并行进程组的大小,因此优化器状态参数将在数据并行范围内分配"
msgstr ""
"When `zero1 <= 0`, the size of the zero1 process group is equal to the "
"size of the data parallel process group, so the optimizer state "
"parameters will be split within the data parallel range."
#: ../../../usage.md:160 67e2ebd795d840b29fd1d684a068e90d
#, fuzzy
#: ../../../usage.md:159
msgid "当`zero1 == 1`,则不使用 zero1 ,所有数据并行组保留完整的优化器状态参数"
msgstr ""
"When `zero1 == 1`, zero1 is not used, and all data parallel groups retain "
"the complete optimizer state parameters."
"When `zero1 == 1`, zero1 is not used, and all data parallel groups retain"
" the complete optimizer state parameters."
#: ../../../usage.md:161 7caedfc943514b9b83090b858ef6d163
#, fuzzy
#: ../../../usage.md:160
msgid "当`zero1 > 1`且`zero1 <= data_parallel_world_size`,则 zero1 进程组是数据并行进程组的子集"
msgstr ""
"When `zero1 > 1` and `zero1 <= data_parallel_world_size`, the zero1 process"
" group is a subset of the data parallel process group."
"When `zero1 > 1` and `zero1 <= data_parallel_world_size`, the zero1 "
"process group is a subset of the data parallel process group."
#: ../../../usage.md:162 b38d3a1f72d543c6a44728fb6babea6b
#: ../../../usage.md:161
msgid "tensor张量并行大小通常是每个节点的 GPU 数量,默认值为 1"
msgstr ""
"tensor: tensor parallel size, usually the number of GPUs per node, "
"default is 1"
#: ../../../usage.md:163 237ac76df68f4a999396dad37c5495c3
#: ../../../usage.md:162
msgid "pipeline流水线并行策略"
msgstr "pipeline: pipeline parallel strategy"
#: ../../../usage.md:164 c8c38f6ab2ea432eb9ebbb62618ca33e
#: ../../../usage.md:163
msgid "size流水线并行大小默认值为 1"
msgstr "size: pipeline parallel size, the default value is 1"
#: ../../../usage.md:165 b9158818e72e49acbdd52ad317cb80df
#: ../../../usage.md:164
msgid "interleaved_overlapbool 类型,交错式调度时,开启或关闭通信优化,默认值为关闭"
msgstr ""
"interleaved_overlap: bool type, when interleaved scheduling, enable or "
"disable communication optimization, the default value is False"
#: ../../../usage.md:166 28e4d48661ff4f80aff788fdda604433
#: ../../../usage.md:165
msgid "sequence_parallel是否开启序列化并行默认值为 False"
msgstr ""
"sequence_parallel: Whether to enable sequence parallelism, the default "
"value is False"
#: ../../../usage.md:168 27528ab826824d2280506460e1f2f7bd
#: ../../../usage.md:167
msgid "注意:`数据并行大小 = 总的 GPU 数目 / 流水线并行大小 / 张量并行大小`"
msgstr ""
"Note: `Data parallel size = Total number of GPUs / Pipeline parallel size"
" / Tensor parallel size`"
#: ../../../usage.md:170 5a7af23cec604f1d9096a5ab81993c87
#: ../../../usage.md:169
msgid "启动训练"
msgstr "Start Training"
#: ../../../usage.md:172 795e51542ed84cea83b63c5233bb88bc
#: ../../../usage.md:171
msgid "完成了以上数据集准备和相关训练配置后,可启动 Demo 训练。接下来分别以 slurm 和 torch 环境为例,介绍训练启动方式。"
msgstr ""
"After completing the data preparation and relevant training "
@@ -341,25 +332,30 @@ msgstr ""
"following examples demonstrate how to start the training in both slurm "
"and torch environments."
#: ../../../usage.md:174 96402cbe443044c0a0a1695c9847140b
#: ../../../usage.md:173
msgid "若在 slurm 上启动分布式运行环境,多节点 16 卡的运行命令如下所示:"
msgstr ""
"If you want to start distributed training on slurm with 16 GPUs across "
"multiple nodes, use the following command:"
#: ../../../usage.md:179 c569e60401a6471eb9af2473acc4d5a6
#: ../../../usage.md:178
msgid "若在 torch 上启动分布式运行环境,单节点 8 卡的运行命令如下所示:"
msgstr ""
"If you want to start distributed training on torch with 8 GPUs on a "
"single node, use the following command:"
#: ../../../usage.md:184 a045a060d0734aab9d894aed553cef34
#: ../../../usage.md:183
msgid "运行结果"
msgstr "Training Results"
#: ../../../usage.md:186 c68e8dfa259647c7a6e6e0c0446b0b18
#: ../../../usage.md:185
msgid "以 slurm 上单机 8 卡的 Demo 训练配置为例,训练结果日志展示如下:"
msgstr ""
"Taking the configuration of the demo training on a single machine with 8 "
"GPUs on slurm as an example, the training result log is shown below:"
#~ msgid "`load_model_only_folder`与`load_ckpt_folder`不能同时设置"
#~ msgstr ""
#~ "`load_model_only_folder` and `load_ckpt_folder` "
#~ "cannot be set at the same time."

View File

@@ -1,8 +1,9 @@
模型保存
===================
InternLM 使用 ``internlm.utils.model_checkpoint.CheckpointManager`` 来管理模型保存。 其中,可以
使用 ``CheckpointManager.try_save_checkpoint(train_state)`` 来保存指定 step 的模型状态。InternLM支持启动时自动加载最新的模型备份并在接收信号退出训练时自动进行模型备份。
InternLM 使用 ``internlm.utils.model_checkpoint.CheckpointManager`` 来管理模型保存。其中,可以使用 ``CheckpointManager.try_save_checkpoint(train_state)`` 来保存指定 step 的模型状态。
InternLM支持启动时自动加载最新的模型备份并在接收信号退出训练时自动进行模型备份。
Checkpointing
-------------

View File

@@ -72,14 +72,14 @@ exclude_patterns = []
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
html_theme = "sphinx_rtd_theme"
html_static_path = ["_static"]
html_static_path = []
# GitHub integration
html_context = {
"display_github": True,
"github_user": "InternLM",
"github_repo": "InternLM",
"github_version": "master",
"github_version": "main",
"conf_py_path": "/doc/code-docs/source/",
}

View File

@@ -1,12 +1,32 @@
训练构建
==============
InternLM 的训练流程可以归纳为两个步骤:
1. 初始化
* 初始化模型、优化器、数据加载器、Trainer生成不同种类的进程组为混合并行的迭代训练做准备。
* 初始化Logger、Checkpoint管理器、Monitor管理器、Profiler对迭代训练的过程观察、预警、记录。
2. 迭代训练
* 根据配置文件定义的张量并行、流水线并行、数据并行的大小,加载训练引擎和调度器进行混合并行训练。
* 在迭代训练中,调用 Trainer API 进行梯度置零,前向传播计算损失并反向传播,参数更新。
.. figure:: ../../imgs/hybrid_parallel_training.png
:scale: 45%
:class: with-border
InternLM训练流程图
.. _InternLM-args:
命令行参数解析
----------------
InternLM 使用 `argparse <https://docs.python.org/3/library/argparse.html>`_ 库来向InternLM运行时提供命令行参数配置。用户可使用 ``internlm.initialize.get_default_parser()`` 来获取 InternLM 的默认解析器,其中包含一些内置参数,用户可以向此解析器添加自定义参数。
InternLM 使用 `argparse <https://docs.python.org/3/library/argparse.html>`_ 库来向InternLM运行时提供命令行参数配置。
用户可使用 ``internlm.initialize.get_default_parser()`` 来获取 InternLM 的默认解析器,其中包含一些内置参数,用户可以向此解析器添加自定义参数。
.. code-block:: python

View File

@@ -6,7 +6,7 @@
Torch Profiler
-----------------
InternLM 使用 ``internlm.train.initialize_llm_profile()`` 来收集和分析模型训练或推理期间的性能数据,如 CPU/CUDA/memory 等性能数据。这个实现基于 `torch.profiler <https://pytorch.org/docs/stable/profiler.html>`_ ,输出的性能分析 trace 文件可以使用 `tensorboard <https://www.tensorflow.org>`_ 进行可视化。
InternLM 使用 ``internlm.train.initialize_llm_profile()`` 来收集和分析模型训练或推理期间的性能数据,如 CPU/CUDA/memory 等性能数据。这个实现基于 `torch.profiler <https://pytorch.org/docs/stable/profiler.html>`_ ,输出的性能分析 trace 文件可以使用 `tensorboard <https://www.tensorflow.org/tensorboard?hl=en>`_ 进行可视化。
用户如果想使用这个 torch 性能分析工具,需要在启动训练时传递 ``--profiling`` 参数以启用性能分析。完成 torch 性能分析后,用户可以在 ``{JOB_NAME}/{start_time}/traces/rank{}_dp{}_tp{}_pp{}`` 文件夹中看到性能分析结果。

View File

@@ -1,2 +1,2 @@
问&答
====
=====

Binary file not shown (new training image added, 208 KiB).