Merge main to develop (#312)

* fix(chat): fix stream_chat to return generator (#123)

* fix(configs/7B_sft.py): model dtype float16 to bfloat16 (#302)

* fix(convert2hf.py): fix the rotary_emb.inv_freq KeyError (#299)

* docs(doc/code-docs): update quickstart usage (#301)

* docs(usage.md): update usage.md

* docs(doc/code-docs): update en usage

---------

Co-authored-by: huangting4201 <huangting3@sensetime.com>

* docs(doc/code-docs): update en usage

---------

Co-authored-by: yingtongxiong <974106207@qq.com>
Co-authored-by: zhjunqin <zhjunqin@users.noreply.github.com>
Co-authored-by: jiangtann <39088437+jiangtann@users.noreply.github.com>
Co-authored-by: huangting4201 <huangting3@sensetime.com>
huangting4201 2023-09-15 16:19:26 +08:00 committed by GitHub
parent de68cc5007
commit 607f691e16
4 changed files with 377 additions and 39 deletions


@@ -8,7 +8,7 @@ msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-13 17:07+0800\n"
"POT-Creation-Date: 2023-09-11 14:25+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
@@ -175,66 +175,72 @@ msgid "训练配置"
msgstr "Training Configuration"
#: ../../../usage.md:70
msgid "以 7B Demo 的配置文件`configs/7B_sft.py`为例,介绍启动一个模型训练所需要进行的数据、模型和并行等相关的配置。"
#, fuzzy
msgid "以 7B Demo 的配置文件`configs/7B_sft.py`为例:"
msgstr ""
"Taking the configuration file `configs/7B_sft.py` for the 7B demo as an "
"example, let's discuss the data, model, and parallel configurations "
"example,"
#: ../../../usage.md:237
msgid "接下来将详细介绍启动一个模型训练所需要进行的数据、模型、并行和监控等相关的配置。"
msgstr ""
"let's discuss the data, model, parallel and monitoring configurations "
"required to start a model training."
#: ../../../usage.md:72
#: ../../../usage.md:239
msgid "数据配置"
msgstr "Data Configuration"
#: ../../../usage.md:73
#: ../../../usage.md:240
msgid "数据相关的关键参数配置及释义如下所示:"
msgstr "Here are the key parameters and their explanations for data configuration:"
#: ../../../usage.md:88
#: ../../../usage.md:255
msgid "![pack_into_one](./imgs/pack_into_one.png)"
msgstr ""
#: ../../../usage.md:88
#: ../../../usage.md:255
msgid "pack_into_one"
msgstr ""
#: ../../../usage.md:91
#: ../../../usage.md:258
msgid "目前支持传入数据集文件路径`train_folder`,且要求文件格式如下:"
msgstr ""
"Currently, it supports passing the dataset file path `train_folder`, and "
"the file format is required to be as follows:"
#: ../../../usage.md:98
#: ../../../usage.md:265
msgid "数据集的详细内容可参考``数据准备``模块相关的介绍。"
msgstr ""
"For detailed information about the dataset, please refer to the \"Data "
"Preparation\" section."
#: ../../../usage.md:100
#: ../../../usage.md:267
msgid "模型配置"
msgstr "Model Configuration"
#: ../../../usage.md:102
#: ../../../usage.md:269
msgid "如果在启动训练时要加载模型 `checkpoint`,可进行如下相关配置:"
msgstr ""
"If you want to load a model checkpoint when starting the training, you "
"can configure it as follows:"
#: ../../../usage.md:115
#: ../../../usage.md:282
msgid "注意:"
msgstr "Note:"
#: ../../../usage.md:116
#: ../../../usage.md:283
msgid "路径若以 `local:` 为前缀,则存储在本地文件系统;若以 `boto3:` 为前缀,则存储在远程 oss 上"
msgstr ""
"If the path starts with `local:`, it means the file is stored in the "
"local file system. If it starts with `boto3:`, it means the file is "
"stored in the remote OSS."
#: ../../../usage.md:118
#: ../../../usage.md:285
msgid "模型相关关键参数配置如下所示:"
msgstr "The configuration for the model is as follows:"
#: ../../../usage.md:142
#: ../../../usage.md:309
msgid "注意:用户可自定义模型类型名和模型结构,并配置相对应的模型参数。通过`utils/registry.py`下的`MODEL_INITIALIZER`对象进行模型初始化函数接口注册,在训练主函数`train.py`中初始化模型时,可通过`model_type`配置获取指定的模型初始化接口函数。"
msgstr ""
"Note: Users can customize the model type name and model structure, and "
@@ -245,7 +251,7 @@ msgstr ""
"interface function can be obtained through the `model_type` "
"configuration."
#: ../../../usage.md:144
#: ../../../usage.md:311
msgid ""
"*如果基于 InternLM 7B继续训练可以参考 "
"[ModelZoo](https://github.com/InternLM/InternLM/tree/main#model-zoo) 中 "
@@ -255,76 +261,76 @@ msgstr ""
"OpenXLab [ModelZoo](https://github.com/InternLM/InternLM/tree/main#model-"
"zoo) to download weights*."
#: ../../../usage.md:146
#: ../../../usage.md:313
msgid "并行配置"
msgstr "Parallel Configuration"
#: ../../../usage.md:148
#: ../../../usage.md:315
msgid "训练并行配置样例如下:"
msgstr "Training parallel configuration example:"
#: ../../../usage.md:157
#: ../../../usage.md:324
msgid "zero1zero 并行策略,分如下三种情况,默认值为 -1"
msgstr ""
"zero1: zero parallel strategy, divided into the following three cases, "
"default value is -1"
#: ../../../usage.md:158
#: ../../../usage.md:325
msgid "当`zero1 <= 0`,则 zero1 进程组的大小等于数据并行进程组的大小,因此优化器状态参数将在数据并行范围内分配"
msgstr ""
"When `zero1 <= 0`, the size of the zero1 process group is equal to the "
"size of the data parallel process group, so the optimizer state "
"parameters will be split within the data parallel range."
#: ../../../usage.md:159
#: ../../../usage.md:326
msgid "当`zero1 == 1`,则不使用 zero1 ,所有数据并行组保留完整的优化器状态参数"
msgstr ""
"When `zero1 == 1`, zero1 is not used, and all data parallel groups retain"
" the complete optimizer state parameters."
#: ../../../usage.md:160
#: ../../../usage.md:327
msgid "当`zero1 > 1`且`zero1 <= data_parallel_world_size`,则 zero1 进程组是数据并行进程组的子集"
msgstr ""
"When `zero1 > 1` and `zero1 <= data_parallel_world_size`, the zero1 "
"process group is a subset of the data parallel process group."
#: ../../../usage.md:161
#: ../../../usage.md:328
msgid "tensor张量并行大小通常是每个节点的 GPU 数量,默认值为 1"
msgstr ""
"tensor: tensor parallel size, usually the number of GPUs per node, "
"default is 1"
#: ../../../usage.md:162
#: ../../../usage.md:329
msgid "pipeline流水线并行策略"
msgstr "pipeline: pipeline parallel strategy"
#: ../../../usage.md:163
#: ../../../usage.md:330
msgid "size流水线并行大小默认值为 1"
msgstr "size: pipeline parallel size, the default value is 1"
#: ../../../usage.md:164
#: ../../../usage.md:331
msgid "interleaved_overlapbool 类型,交错式调度时,开启或关闭通信优化,默认值为关闭"
msgstr ""
"interleaved_overlap: bool type, when interleaved scheduling, enable or "
"disable communication optimization, the default value is False"
#: ../../../usage.md:165
#: ../../../usage.md:332
msgid "sequence_parallel是否开启序列化并行默认值为 False"
msgstr ""
"sequence_parallel: Whether to enable sequence parallelism, the default "
"value is False"
#: ../../../usage.md:167
#: ../../../usage.md:334
msgid "注意:`数据并行大小 = 总的 GPU 数目 / 流水线并行大小 / 张量并行大小`"
msgstr ""
"Note: `Data parallel size = Total number of GPUs / Pipeline parallel size"
" / Tensor parallel size`"
#: ../../../usage.md:169
#: ../../../usage.md:336
msgid "启动训练"
msgstr "Start Training"
#: ../../../usage.md:171
#: ../../../usage.md:338
msgid "完成了以上数据集准备和相关训练配置后,可启动 Demo 训练。接下来分别以 slurm 和 torch 环境为例,介绍训练启动方式。"
msgstr ""
"After completing the data preparation and relevant training "
@@ -332,23 +338,23 @@ msgstr ""
"following examples demonstrate how to start the training in both slurm "
"and torch environments."
#: ../../../usage.md:173
#: ../../../usage.md:340
msgid "若在 slurm 上启动分布式运行环境,多节点 16 卡的运行命令如下所示:"
msgstr ""
"If you want to start distributed training on slurm with 16 GPUs across "
"multiple nodes, use the following command:"
#: ../../../usage.md:178
#: ../../../usage.md:345
msgid "若在 torch 上启动分布式运行环境,单节点 8 卡的运行命令如下所示:"
msgstr ""
"If you want to start distributed training on torch with 8 GPUs on a "
"single node, use the following command:"
#: ../../../usage.md:183
#: ../../../usage.md:350
msgid "运行结果"
msgstr "Training Results"
#: ../../../usage.md:185
#: ../../../usage.md:352
msgid "以 slurm 上单机 8 卡的 Demo 训练配置为例,训练结果日志展示如下:"
msgstr ""
"Taking the configuration of the demo training on a single machine with 8 "


@@ -74,7 +74,173 @@ It is recommended that users refer to alpaca_tokenizer.py to write new scripts t
### Training Configuration
Taking the configuration file `configs/7B_sft.py` for the 7B demo as an example, let's discuss the data, model, and parallel configurations required to start a model training.
Taking the configuration file `configs/7B_sft.py` for the 7B demo as an example, let's discuss the data, model, parallel and monitoring configurations required to start a model training.
```python
JOB_NAME = "7b_train"
DO_ALERT = False
SEQ_LEN = 2048
HIDDEN_SIZE = 4096
NUM_ATTENTION_HEAD = 32
MLP_RATIO = 8 / 3
NUM_LAYER = 32
VOCAB_SIZE = 103168
MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
# Ckpt folder format:
# fs: 'local:/mnt/nfs/XXX'
SAVE_CKPT_FOLDER = "local:llm_ckpts"
LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
# boto3 Ckpt folder format:
# import os
# BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint
# SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm"
# LOAD_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm/snapshot/1/"
CHECKPOINT_EVERY = 50
ckpt = dict(
enable_save_ckpt=False, # enable ckpt save.
save_ckpt_folder=SAVE_CKPT_FOLDER, # Path to save training ckpt.
# load_ckpt_folder= dict(path=MODEL_ONLY_FOLDER, content=["model"], ckpt_type="normal"),
load_ckpt_folder="local:llm_ckpts/",
# 'load_ckpt_info' setting guide:
# 1. the 'path' indicates the ckpt path,
# 2. the 'content' means what states will be loaded, support: "model", "sampler", "optimizer", "scheduler", "all"
# 3. the 'ckpt_type' means the type of checkpoint to be loaded, now only 'normal' type is supported.
load_ckpt_info=dict(path=MODEL_ONLY_FOLDER, content=("model",), ckpt_type="internlm"),
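# Illustrative example (not part of the original config): to resume a full
# training state instead of only loading model weights, point 'path' at a
# saved checkpoint and load everything:
# load_ckpt_info=dict(path=LOAD_CKPT_FOLDER, content=("all",), ckpt_type="internlm"),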
checkpoint_every=CHECKPOINT_EVERY,
async_upload=True, # async ckpt upload. (only work for boto3 ckpt)
async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/", # path for temporary files during asynchronous upload.
oss_snapshot_freq=int(CHECKPOINT_EVERY / 2), # snapshot ckpt save frequency.
)
TRAIN_FOLDER = "/path/to/dataset"
VALID_FOLDER = "/path/to/dataset"
data = dict(
seq_len=SEQ_LEN,
# micro_num means the number of micro_batch contained in one gradient update
micro_num=4,
# packed_length = micro_bsz * SEQ_LEN
micro_bsz=2,
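# Illustrative note: with micro_num=4, micro_bsz=2 and SEQ_LEN=2048, each
# optimizer step consumes micro_num * micro_bsz * SEQ_LEN = 16384 tokens per data parallel rank.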
# defaults to the value of micro_num
valid_micro_num=4,
# defaults to 0, means disable evaluate
valid_every=50,
pack_sample_into_one=False,
total_steps=50000,
skip_batches="",
rampup_batch_size="",
# Datasets with less than 50 rows will be discarded
min_length=50,
# train_folder=TRAIN_FOLDER,
# valid_folder=VALID_FOLDER,
empty_cache_and_diag_interval=10,
diag_outlier_ratio=1.1,
)
grad_scaler = dict(
fp16=dict(
# the initial loss scale, defaults to 2**16
initial_scale=2**16,
# the minimum loss scale, defaults to None
min_scale=1,
# the number of steps to increase loss scale when no overflow occurs
growth_interval=1000,
),
# the multiplication factor for increasing loss scale, defaults to 2
growth_factor=2,
# the multiplication factor for decreasing loss scale, defaults to 0.5
backoff_factor=0.5,
# the maximum loss scale, defaults to None
max_scale=2**24,
# the number of overflows before decreasing loss scale, defaults to 2
hysteresis=2,
)
hybrid_zero_optimizer = dict(
# Enable low_level_optimizer overlap_communication
overlap_sync_grad=True,
overlap_sync_param=True,
# bucket size for nccl communication params
reduce_bucket_size=512 * 1024 * 1024,
# grad clipping
clip_grad_norm=1.0,
)
loss = dict(
label_smoothing=0,
)
adam = dict(
lr=1e-4,
adam_beta1=0.9,
adam_beta2=0.95,
adam_beta2_c=0,
adam_eps=1e-8,
weight_decay=0.01,
)
lr_scheduler = dict(
total_steps=data["total_steps"],
init_steps=0, # optimizer_warmup_step
warmup_ratio=0.01,
eta_min=1e-5,
last_epoch=-1,
)
beta2_scheduler = dict(
init_beta2=adam["adam_beta2"],
c=adam["adam_beta2_c"],
cur_iter=-1,
)
model = dict(
checkpoint=False, # The proportion of layers for activation checkpointing, the optional values are True/False/[0-1]
num_attention_heads=NUM_ATTENTION_HEAD,
embed_split_hidden=True,
vocab_size=VOCAB_SIZE,
embed_grad_scale=1,
parallel_output=True,
hidden_size=HIDDEN_SIZE,
num_layers=NUM_LAYER,
mlp_ratio=MLP_RATIO,
apply_post_layer_norm=False,
dtype="torch.float16", # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
norm_type="rmsnorm",
layer_norm_epsilon=1e-5,
use_flash_attn=True,
num_chunks=1, # if num_chunks > 1, interleaved pipeline scheduler is used.
)
"""
zero1 parallel:
1. if zero1 <= 0, the size of the zero process group is equal to the size of the dp process group,
so parameters will be divided within the range of dp.
2. if zero1 == 1, zero is not used, and all dp groups retain the full amount of model parameters.
3. if zero1 > 1 and zero1 <= dp world size, the zero1 process group is a subset of the dp process group.
For smaller models, it is usually a better choice to split the parameters within nodes with a setting <= 8.
pipeline parallel (dict):
1. size: int, the size of pipeline parallel.
2. interleaved_overlap: bool, enable/disable communication overlap when using interleaved pipeline scheduler.
tensor parallel: tensor parallel size, usually the number of GPUs per node.
"""
parallel = dict(
zero1=8,
pipeline=dict(size=1, interleaved_overlap=True),
sequence_parallel=False,
)
cudnn_deterministic = False
cudnn_benchmark = False
monitor = dict(
# feishu alert configs
alert=dict(
enable_feishu_alert=DO_ALERT,
feishu_alert_address=None, # feishu webhook to send alert message
light_monitor_address=None, # light_monitor address to send heartbeat
),
)
```
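A minimal sketch (assuming a hypothetical 16-GPU job and the `parallel` dict above, i.e. pipeline size 1, tensor size 1, `zero1=8`) of how the group sizes follow from `data parallel size = total number of GPUs / pipeline parallel size / tensor parallel size`:

```python
# Hypothetical 16-GPU job; values mirror the parallel dict above.
total_gpus = 16
pipeline_size = 1
tensor_size = 1
zero1 = 8

# Data parallel size = total GPUs / pipeline parallel size / tensor parallel size
data_parallel_size = total_gpus // pipeline_size // tensor_size  # 16

# zero1 > 1 and zero1 <= data_parallel_size: optimizer states are sharded
# across sub-groups of `zero1` ranks within the data parallel group.
assert 1 < zero1 <= data_parallel_size
num_zero1_groups = data_parallel_size // zero1  # 2 groups of 8 ranks each

print(data_parallel_size, num_zero1_groups)
```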
#### Data Configuration
Here are the key parameters and their explanations for data configuration:


@@ -66,7 +66,174 @@ python tools/alpaca_tokenizer.py /path/to/alpaca_dataset /path/to/output_dataset
### 训练配置
以 7B Demo 的配置文件`configs/7B_sft.py`为例,介绍启动一个模型训练所需要进行的数据、模型和并行等相关的配置。
以 7B Demo 的配置文件`configs/7B_sft.py`为例:
```python
JOB_NAME = "7b_train"
DO_ALERT = False
SEQ_LEN = 2048
HIDDEN_SIZE = 4096
NUM_ATTENTION_HEAD = 32
MLP_RATIO = 8 / 3
NUM_LAYER = 32
VOCAB_SIZE = 103168
MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
# Ckpt folder format:
# fs: 'local:/mnt/nfs/XXX'
SAVE_CKPT_FOLDER = "local:llm_ckpts"
LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
# boto3 Ckpt folder format:
# import os
# BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint
# SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm"
# LOAD_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm/snapshot/1/"
CHECKPOINT_EVERY = 50
ckpt = dict(
enable_save_ckpt=False, # enable ckpt save.
save_ckpt_folder=SAVE_CKPT_FOLDER, # Path to save training ckpt.
# load_ckpt_folder= dict(path=MODEL_ONLY_FOLDER, content=["model"], ckpt_type="normal"),
load_ckpt_folder="local:llm_ckpts/",
# 'load_ckpt_info' setting guide:
# 1. the 'path' indicates the ckpt path,
# 2. the 'content' means what states will be loaded, support: "model", "sampler", "optimizer", "scheduler", "all"
# 3. the 'ckpt_type' means the type of checkpoint to be loaded, now only 'normal' type is supported.
load_ckpt_info=dict(path=MODEL_ONLY_FOLDER, content=("model",), ckpt_type="internlm"),
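# Illustrative example (not part of the original config): to resume a full
# training state instead of only loading model weights, point 'path' at a
# saved checkpoint and load everything:
# load_ckpt_info=dict(path=LOAD_CKPT_FOLDER, content=("all",), ckpt_type="internlm"),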
checkpoint_every=CHECKPOINT_EVERY,
async_upload=True, # async ckpt upload. (only work for boto3 ckpt)
async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/", # path for temporary files during asynchronous upload.
oss_snapshot_freq=int(CHECKPOINT_EVERY / 2), # snapshot ckpt save frequency.
)
TRAIN_FOLDER = "/path/to/dataset"
VALID_FOLDER = "/path/to/dataset"
data = dict(
seq_len=SEQ_LEN,
# micro_num means the number of micro_batch contained in one gradient update
micro_num=4,
# packed_length = micro_bsz * SEQ_LEN
micro_bsz=2,
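# Illustrative note: with micro_num=4, micro_bsz=2 and SEQ_LEN=2048, each
# optimizer step consumes micro_num * micro_bsz * SEQ_LEN = 16384 tokens per data parallel rank.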
# defaults to the value of micro_num
valid_micro_num=4,
# defaults to 0, means disable evaluate
valid_every=50,
pack_sample_into_one=False,
total_steps=50000,
skip_batches="",
rampup_batch_size="",
# Datasets with less than 50 rows will be discarded
min_length=50,
# train_folder=TRAIN_FOLDER,
# valid_folder=VALID_FOLDER,
empty_cache_and_diag_interval=10,
diag_outlier_ratio=1.1,
)
grad_scaler = dict(
fp16=dict(
# the initial loss scale, defaults to 2**16
initial_scale=2**16,
# the minimum loss scale, defaults to None
min_scale=1,
# the number of steps to increase loss scale when no overflow occurs
growth_interval=1000,
),
# the multiplication factor for increasing loss scale, defaults to 2
growth_factor=2,
# the multiplication factor for decreasing loss scale, defaults to 0.5
backoff_factor=0.5,
# the maximum loss scale, defaults to None
max_scale=2**24,
# the number of overflows before decreasing loss scale, defaults to 2
hysteresis=2,
)
hybrid_zero_optimizer = dict(
# Enable low_level_optimizer overlap_communication
overlap_sync_grad=True,
overlap_sync_param=True,
# bucket size for nccl communication params
reduce_bucket_size=512 * 1024 * 1024,
# grad clipping
clip_grad_norm=1.0,
)
loss = dict(
label_smoothing=0,
)
adam = dict(
lr=1e-4,
adam_beta1=0.9,
adam_beta2=0.95,
adam_beta2_c=0,
adam_eps=1e-8,
weight_decay=0.01,
)
lr_scheduler = dict(
total_steps=data["total_steps"],
init_steps=0, # optimizer_warmup_step
warmup_ratio=0.01,
eta_min=1e-5,
last_epoch=-1,
)
beta2_scheduler = dict(
init_beta2=adam["adam_beta2"],
c=adam["adam_beta2_c"],
cur_iter=-1,
)
model = dict(
checkpoint=False, # The proportion of layers for activation checkpointing, the optional values are True/False/[0-1]
num_attention_heads=NUM_ATTENTION_HEAD,
embed_split_hidden=True,
vocab_size=VOCAB_SIZE,
embed_grad_scale=1,
parallel_output=True,
hidden_size=HIDDEN_SIZE,
num_layers=NUM_LAYER,
mlp_ratio=MLP_RATIO,
apply_post_layer_norm=False,
dtype="torch.float16", # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
norm_type="rmsnorm",
layer_norm_epsilon=1e-5,
use_flash_attn=True,
num_chunks=1, # if num_chunks > 1, interleaved pipeline scheduler is used.
)
"""
zero1 parallel:
1. if zero1 <= 0, the size of the zero process group is equal to the size of the dp process group,
so parameters will be divided within the range of dp.
2. if zero1 == 1, zero is not used, and all dp groups retain the full amount of model parameters.
3. if zero1 > 1 and zero1 <= dp world size, the zero1 process group is a subset of the dp process group.
For smaller models, it is usually a better choice to split the parameters within nodes with a setting <= 8.
pipeline parallel (dict):
1. size: int, the size of pipeline parallel.
2. interleaved_overlap: bool, enable/disable communication overlap when using interleaved pipeline scheduler.
tensor parallel: tensor parallel size, usually the number of GPUs per node.
"""
parallel = dict(
zero1=8,
pipeline=dict(size=1, interleaved_overlap=True),
sequence_parallel=False,
)
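# Illustrative example (assumption, not from the original config): on a
# hypothetical 16-GPU job with pipeline size 1 and tensor size 1, the data
# parallel size is 16 / 1 / 1 = 16, and zero1=8 shards the optimizer states
# across two sub-groups of 8 ranks each.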
cudnn_deterministic = False
cudnn_benchmark = False
monitor = dict(
# feishu alert configs
alert=dict(
enable_feishu_alert=DO_ALERT,
feishu_alert_address=None, # feishu webhook to send alert message
light_monitor_address=None, # light_monitor address to send heartbeat
),
)
```
接下来将详细介绍启动一个模型训练所需要进行的数据、模型、并行和监控等相关的配置。
#### 数据配置
数据相关的关键参数配置及释义如下所示:


@@ -447,8 +447,8 @@ class CheckpointManager:
Args:
ckpt_config (dict): model checkpoint config.
model (nn.module): model obj
optimizer (object): optimzier obj.
model (nn.module): model obj.
optimizer (object): optimizer obj.
lr_scheduler (object): lr_scheduler obj.
model_config (dict): model config.
"""
@@ -712,7 +712,6 @@ now step_count is {train_state.step_count}",
return dict(path=latest_ckpt, content=("all",), ckpt_type="internlm")
def try_resume_training(self, train_state: TrainState, current_time=""):
if self.load_ckpt_info is None or self.load_ckpt_info["path"] is None:
if gpc.is_rank_for_log():
logger.info(