diff --git a/doc/code-docs/locales/en/LC_MESSAGES/usage.po b/doc/code-docs/locales/en/LC_MESSAGES/usage.po index c641deb..37e7cba 100644 --- a/doc/code-docs/locales/en/LC_MESSAGES/usage.po +++ b/doc/code-docs/locales/en/LC_MESSAGES/usage.po @@ -8,7 +8,7 @@ msgid "" msgstr "" "Project-Id-Version: InternLM \n" "Report-Msgid-Bugs-To: \n" -"POT-Creation-Date: 2023-09-13 17:07+0800\n" +"POT-Creation-Date: 2023-09-11 14:25+0800\n" "PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n" "Last-Translator: FULL NAME \n" "Language: en\n" @@ -175,66 +175,72 @@ msgid "训练配置" msgstr "Training Configuration" #: ../../../usage.md:70 -msgid "以 7B Demo 的配置文件`configs/7B_sft.py`为例,介绍启动一个模型训练所需要进行的数据、模型和并行等相关的配置。" +#, fuzzy +msgid "以 7B Demo 的配置文件`configs/7B_sft.py`为例:" msgstr "" "Taking the configuration file `configs/7B_sft.py` for the 7B demo as an " -"example, let's discuss the data, model, and parallel configurations " +"example," + +#: ../../../usage.md:237 +msgid "接下来将详细介绍启动一个模型训练所需要进行的数据、模型、并行和监控等相关的配置。" +msgstr "" +"let's discuss the data, model, parallel and monitoring configurations " "required to start a model training." -#: ../../../usage.md:72 +#: ../../../usage.md:239 msgid "数据配置" msgstr "Data Configuration" -#: ../../../usage.md:73 +#: ../../../usage.md:240 msgid "数据相关的关键参数配置及释义如下所示:" msgstr "Here are the key parameters and their explanations for data configuration:" -#: ../../../usage.md:88 +#: ../../../usage.md:255 msgid "![pack_into_one](./imgs/pack_into_one.png)" msgstr "" -#: ../../../usage.md:88 +#: ../../../usage.md:255 msgid "pack_into_one" msgstr "" -#: ../../../usage.md:91 +#: ../../../usage.md:258 msgid "目前支持传入数据集文件路径`train_folder`,且要求文件格式如下:" msgstr "" "Currently, it supports passing the dataset file path `train_folder`, and " "the file format is required to be as follows:" -#: ../../../usage.md:98 +#: ../../../usage.md:265 msgid "数据集的详细内容可参考``数据准备``模块相关的介绍。" msgstr "" "For detailed information about the dataset, please refer to the \"Data " "Preparation\" section." -#: ../../../usage.md:100 +#: ../../../usage.md:267 msgid "模型配置" msgstr "Model Configuration" -#: ../../../usage.md:102 +#: ../../../usage.md:269 msgid "如果在启动训练时要加载模型 `checkpoint`,可进行如下相关配置:" msgstr "" "If you want to load a model checkpoint when starting the training, you " "can configure it as follows:" -#: ../../../usage.md:115 +#: ../../../usage.md:282 msgid "注意:" msgstr "Note:" -#: ../../../usage.md:116 +#: ../../../usage.md:283 msgid "路径若以 `local:` 为前缀,则存储在本地文件系统;若以 `boto3:` 为前缀,则存储在远程 oss 上" msgstr "" "If the path starts with `local:`, it means the file is stored in the " "local file system. If it starts with `boto3:`, it means the file is " "stored in the remote OSS." -#: ../../../usage.md:118 +#: ../../../usage.md:285 msgid "模型相关关键参数配置如下所示:" msgstr "The configuration for the model is as follows:" -#: ../../../usage.md:142 +#: ../../../usage.md:309 msgid "注意:用户可自定义模型类型名和模型结构,并配置相对应的模型参数。通过`utils/registry.py`下的`MODEL_INITIALIZER`对象进行模型初始化函数接口注册,在训练主函数`train.py`中初始化模型时,可通过`model_type`配置获取指定的模型初始化接口函数。" msgstr "" "Note: Users can customize the model type name and model structure, and " @@ -245,7 +251,7 @@ msgstr "" "interface function can be obtained through the `model_type` " "configuration." -#: ../../../usage.md:144 +#: ../../../usage.md:311 msgid "" "*如果基于 InternLM 7B继续训练,可以参考 " "[ModelZoo](https://github.com/InternLM/InternLM/tree/main#model-zoo) 中 " @@ -255,76 +261,76 @@ msgstr "" "OpenXLab [ModelZoo](https://github.com/InternLM/InternLM/tree/main#model-" "zoo) to download weights*." 
-#: ../../../usage.md:146 +#: ../../../usage.md:313 msgid "并行配置" msgstr "Parallel Configuration" -#: ../../../usage.md:148 +#: ../../../usage.md:315 msgid "训练并行配置样例如下:" msgstr "Training parallel configuration example:" -#: ../../../usage.md:157 +#: ../../../usage.md:324 msgid "zero1:zero 并行策略,分如下三种情况,默认值为 -1" msgstr "" "zero1: zero parallel strategy, divided into the following three cases, " "default value is -1" -#: ../../../usage.md:158 +#: ../../../usage.md:325 msgid "当`zero1 <= 0`,则 zero1 进程组的大小等于数据并行进程组的大小,因此优化器状态参数将在数据并行范围内分配" msgstr "" "When `zero1 <= 0`, the size of the zero1 process group is equal to the " "size of the data parallel process group, so the optimizer state " "parameters will be split within the data parallel range." -#: ../../../usage.md:159 +#: ../../../usage.md:326 msgid "当`zero1 == 1`,则不使用 zero1 ,所有数据并行组保留完整的优化器状态参数" msgstr "" "When `zero1 == 1`, zero1 is not used, and all data parallel groups retain" " the complete optimizer state parameters." -#: ../../../usage.md:160 +#: ../../../usage.md:327 msgid "当`zero1 > 1`且`zero1 <= data_parallel_world_size`,则 zero1 进程组是数据并行进程组的子集" msgstr "" "When `zero1 > 1` and `zero1 <= data_parallel_world_size`, the zero1 " "process group is a subset of the data parallel process group." -#: ../../../usage.md:161 +#: ../../../usage.md:328 msgid "tensor:张量并行大小,通常是每个节点的 GPU 数量,默认值为 1" msgstr "" "tensor: tensor parallel size, usually the number of GPUs per node, " "default is 1" -#: ../../../usage.md:162 +#: ../../../usage.md:329 msgid "pipeline:流水线并行策略" msgstr "pipeline: pipeline parallel strategy" -#: ../../../usage.md:163 +#: ../../../usage.md:330 msgid "size:流水线并行大小,默认值为 1" msgstr "size: pipeline parallel size, the default value is 1" -#: ../../../usage.md:164 +#: ../../../usage.md:331 msgid "interleaved_overlap:bool 类型,交错式调度时,开启或关闭通信优化,默认值为关闭" msgstr "" "interleaved_overlap: bool type, when interleaved scheduling, enable or " "disable communication optimization, the default value is False" -#: ../../../usage.md:165 +#: ../../../usage.md:332 msgid "sequence_parallel:是否开启序列化并行,默认值为 False" msgstr "" "sequence_parallel: Whether to enable sequence parallelism, the default " "value is False" -#: ../../../usage.md:167 +#: ../../../usage.md:334 msgid "注意:`数据并行大小 = 总的 GPU 数目 / 流水线并行大小 / 张量并行大小`" msgstr "" "Note: `Data parallel size = Total number of GPUs / Pipeline parallel size" " / Tensor parallel size`" -#: ../../../usage.md:169 +#: ../../../usage.md:336 msgid "启动训练" msgstr "Start Training" -#: ../../../usage.md:171 +#: ../../../usage.md:338 msgid "完成了以上数据集准备和相关训练配置后,可启动 Demo 训练。接下来分别以 slurm 和 torch 环境为例,介绍训练启动方式。" msgstr "" "After completing the data preparation and relevant training " @@ -332,23 +338,23 @@ msgstr "" "following examples demonstrate how to start the training in both slurm " "and torch environments." 
-#: ../../../usage.md:173
+#: ../../../usage.md:340
 msgid "若在 slurm 上启动分布式运行环境,多节点 16 卡的运行命令如下所示:"
 msgstr ""
 "If you want to start distributed training on slurm with 16 GPUs across "
 "multiple nodes, use the following command:"
 
-#: ../../../usage.md:178
+#: ../../../usage.md:345
 msgid "若在 torch 上启动分布式运行环境,单节点 8 卡的运行命令如下所示:"
 msgstr ""
 "If you want to start distributed training on torch with 8 GPUs on a "
 "single node, use the following command:"
 
-#: ../../../usage.md:183
+#: ../../../usage.md:350
 msgid "运行结果"
 msgstr "Training Results"
 
-#: ../../../usage.md:185
+#: ../../../usage.md:352
 msgid "以 slurm 上单机 8 卡的 Demo 训练配置为例,训练结果日志展示如下:"
 msgstr ""
 "Taking the configuration of the demo training on a single machine with 8 "
diff --git a/doc/en/usage.md b/doc/en/usage.md
index d115fb1..864ead6 100644
--- a/doc/en/usage.md
+++ b/doc/en/usage.md
@@ -74,7 +74,173 @@ It is recommended that users refer to alpaca_tokenizer.py to write new scripts t
 ### Training Configuration
 
-Taking the configuration file `configs/7B_sft.py` for the 7B demo as an example, let's discuss the data, model, and parallel configurations required to start a model training.
+Taking the configuration file `configs/7B_sft.py` for the 7B demo as an example, let's discuss the data, model, parallel, and monitoring configurations required to start model training.
+```python
+JOB_NAME = "7b_train"
+DO_ALERT = False
+
+SEQ_LEN = 2048
+HIDDEN_SIZE = 4096
+NUM_ATTENTION_HEAD = 32
+MLP_RATIO = 8 / 3
+NUM_LAYER = 32
+VOCAB_SIZE = 103168
+
+MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx"
+# Ckpt folder format:
+# fs: 'local:/mnt/nfs/XXX'
+SAVE_CKPT_FOLDER = "local:llm_ckpts"
+LOAD_CKPT_FOLDER = "local:llm_ckpts/49"
+
+# boto3 Ckpt folder format:
+# import os
+# BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint
+# SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm"
+# LOAD_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm/snapshot/1/"
+CHECKPOINT_EVERY = 50
+ckpt = dict(
+    enable_save_ckpt=False,  # enable ckpt save.
+    save_ckpt_folder=SAVE_CKPT_FOLDER,  # Path to save training ckpt.
+    # load_ckpt_folder= dict(path=MODEL_ONLY_FOLDER, content=["model"], ckpt_type="normal"),
+    load_ckpt_folder="local:llm_ckpts/",
+    # 'load_ckpt_info' setting guide:
+    # 1. the 'path' indicates the ckpt path,
+    # 2. the 'content' means what states will be loaded, supported: "model", "sampler", "optimizer", "scheduler", "all"
+    # 3. the 'ckpt_type' means the type of checkpoint to be loaded, now only 'normal' type is supported.
+    load_ckpt_info=dict(path=MODEL_ONLY_FOLDER, content=("model",), ckpt_type="internlm"),
+    checkpoint_every=CHECKPOINT_EVERY,
+    async_upload=True,  # async ckpt upload. (only works for boto3 ckpt)
+    async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/",  # path for temporary files during asynchronous upload.
+    oss_snapshot_freq=int(CHECKPOINT_EVERY / 2),  # snapshot ckpt save frequency.
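+    # e.g. with CHECKPOINT_EVERY = 50 as set above, oss_snapshot_freq evaluates to 25, i.e. one snapshot ckpt every 25 steps.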
+)
+
+TRAIN_FOLDER = "/path/to/dataset"
+VALID_FOLDER = "/path/to/dataset"
+data = dict(
+    seq_len=SEQ_LEN,
+    # micro_num means the number of micro_batch contained in one gradient update
+    micro_num=4,
+    # packed_length = micro_bsz * SEQ_LEN
+    micro_bsz=2,
+    # defaults to the value of micro_num
+    valid_micro_num=4,
+    # defaults to 0, which means evaluation is disabled
+    valid_every=50,
+    pack_sample_into_one=False,
+    total_steps=50000,
+    skip_batches="",
+    rampup_batch_size="",
+    # Datasets with less than 50 rows will be discarded
+    min_length=50,
+    # train_folder=TRAIN_FOLDER,
+    # valid_folder=VALID_FOLDER,
+    empty_cache_and_diag_interval=10,
+    diag_outlier_ratio=1.1,
+)
+
+grad_scaler = dict(
+    fp16=dict(
+        # the initial loss scale, defaults to 2**16
+        initial_scale=2**16,
+        # the minimum loss scale, defaults to None
+        min_scale=1,
+        # the number of steps to increase loss scale when no overflow occurs
+        growth_interval=1000,
+    ),
+    # the multiplication factor for increasing loss scale, defaults to 2
+    growth_factor=2,
+    # the multiplication factor for decreasing loss scale, defaults to 0.5
+    backoff_factor=0.5,
+    # the maximum loss scale, defaults to None
+    max_scale=2**24,
+    # the number of overflows before decreasing loss scale, defaults to 2
+    hysteresis=2,
+)
+
+hybrid_zero_optimizer = dict(
+    # Enable low_level_optimizer overlap_communication
+    overlap_sync_grad=True,
+    overlap_sync_param=True,
+    # bucket size for nccl communication params
+    reduce_bucket_size=512 * 1024 * 1024,
+    # grad clipping
+    clip_grad_norm=1.0,
+)
+
+loss = dict(
+    label_smoothing=0,
+)
+
+adam = dict(
+    lr=1e-4,
+    adam_beta1=0.9,
+    adam_beta2=0.95,
+    adam_beta2_c=0,
+    adam_eps=1e-8,
+    weight_decay=0.01,
+)
+
+lr_scheduler = dict(
+    total_steps=data["total_steps"],
+    init_steps=0,  # optimizer_warmup_step
+    warmup_ratio=0.01,
+    eta_min=1e-5,
+    last_epoch=-1,
+)
+
+beta2_scheduler = dict(
+    init_beta2=adam["adam_beta2"],
+    c=adam["adam_beta2_c"],
+    cur_iter=-1,
+)
+
+model = dict(
+    checkpoint=False,  # The proportion of layers for activation checkpointing, the optional values are True/False/[0-1]
+    num_attention_heads=NUM_ATTENTION_HEAD,
+    embed_split_hidden=True,
+    vocab_size=VOCAB_SIZE,
+    embed_grad_scale=1,
+    parallel_output=True,
+    hidden_size=HIDDEN_SIZE,
+    num_layers=NUM_LAYER,
+    mlp_ratio=MLP_RATIO,
+    apply_post_layer_norm=False,
+    dtype="torch.float16",  # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
+    norm_type="rmsnorm",
+    layer_norm_epsilon=1e-5,
+    use_flash_attn=True,
+    num_chunks=1,  # if num_chunks > 1, interleaved pipeline scheduler is used.
+)
+"""
+zero1 parallel:
+    1. if zero1 <= 0, the size of the zero process group is equal to the size of the dp process group,
+        so parameters will be divided within the range of dp.
+    2. if zero1 == 1, zero is not used, and all dp groups retain the full amount of model parameters.
+    3. if zero1 > 1 and zero1 <= dp world size, the zero process group is a subset of the dp process group.
+    For smaller models, it is usually a better choice to split the parameters within nodes with a setting <= 8.
+pipeline parallel (dict):
+    1. size: int, the size of pipeline parallel.
+    2. interleaved_overlap: bool, enable/disable communication overlap when using interleaved pipeline scheduler.
+tensor parallel: tensor parallel size, usually the number of GPUs per node.
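+data parallel: the data parallel size is not set here; it is derived as
+    total number of GPUs / pipeline parallel size / tensor parallel size,
+    e.g. 16 GPUs with pipeline size 1 and tensor size 1 give a data parallel size of 16.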
+""" +parallel = dict( + zero1=8, + pipeline=dict(size=1, interleaved_overlap=True), + sequence_parallel=False, +) + +cudnn_deterministic = False +cudnn_benchmark = False + +monitor = dict( + # feishu alert configs + alert=dict( + enable_feishu_alert=DO_ALERT, + feishu_alert_address=None, # feishu webhook to send alert message + light_monitor_address=None, # light_monitor address to send heartbeat + ), +) +``` #### Data Configuration Here are the key parameters and their explanations for data configuration: diff --git a/doc/usage.md b/doc/usage.md index 1b98c10..82c20e0 100644 --- a/doc/usage.md +++ b/doc/usage.md @@ -66,7 +66,174 @@ python tools/alpaca_tokenizer.py /path/to/alpaca_dataset /path/to/output_dataset ### 训练配置 -以 7B Demo 的配置文件`configs/7B_sft.py`为例,介绍启动一个模型训练所需要进行的数据、模型和并行等相关的配置。 +以 7B Demo 的配置文件`configs/7B_sft.py`为例: +```python +JOB_NAME = "7b_train" +DO_ALERT = False + +SEQ_LEN = 2048 +HIDDEN_SIZE = 4096 +NUM_ATTENTION_HEAD = 32 +MLP_RATIO = 8 / 3 +NUM_LAYER = 32 +VOCAB_SIZE = 103168 + +MODEL_ONLY_FOLDER = "local:llm_ckpts/xxxx" +# Ckpt folder format: +# fs: 'local:/mnt/nfs/XXX' +SAVE_CKPT_FOLDER = "local:llm_ckpts" +LOAD_CKPT_FOLDER = "local:llm_ckpts/49" + +# boto3 Ckpt folder format: +# import os +# BOTO3_IP = os.environ["BOTO3_IP"] # boto3 bucket endpoint +# SAVE_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm" +# LOAD_CKPT_FOLDER = f"boto3:s3://model_weights.{BOTO3_IP}/internlm/snapshot/1/" +CHECKPOINT_EVERY = 50 +ckpt = dict( + enable_save_ckpt=False, # enable ckpt save. + save_ckpt_folder=SAVE_CKPT_FOLDER, # Path to save training ckpt. + # load_ckpt_folder= dict(path=MODEL_ONLY_FOLDER, content=["model"], ckpt_type="normal"), + load_ckpt_folder="local:llm_ckpts/", + # 'load_ckpt_info' setting guide: + # 1. the 'path' indicate ckpt path, + # 2. the 'content‘ means what states will be loaded, support: "model", "sampler", "optimizer", "scheduler", "all" + # 3. the ’ckpt_type‘ means the type of checkpoint to be loaded, now only 'normal' type is supported. + load_ckpt_info=dict(path=MODEL_ONLY_FOLDER, content=("model",), ckpt_type="internlm"), + checkpoint_every=CHECKPOINT_EVERY, + async_upload=True, # async ckpt upload. (only work for boto3 ckpt) + async_upload_tmp_folder="/dev/shm/internlm_tmp_ckpt/", # path for temporarily files during asynchronous upload. + oss_snapshot_freq=int(CHECKPOINT_EVERY / 2), # snapshot ckpt save frequency. 
+)
+
+TRAIN_FOLDER = "/path/to/dataset"
+VALID_FOLDER = "/path/to/dataset"
+data = dict(
+    seq_len=SEQ_LEN,
+    # micro_num means the number of micro_batch contained in one gradient update
+    micro_num=4,
+    # packed_length = micro_bsz * SEQ_LEN
+    micro_bsz=2,
+    # defaults to the value of micro_num
+    valid_micro_num=4,
+    # defaults to 0, which means evaluation is disabled
+    valid_every=50,
+    pack_sample_into_one=False,
+    total_steps=50000,
+    skip_batches="",
+    rampup_batch_size="",
+    # Datasets with less than 50 rows will be discarded
+    min_length=50,
+    # train_folder=TRAIN_FOLDER,
+    # valid_folder=VALID_FOLDER,
+    empty_cache_and_diag_interval=10,
+    diag_outlier_ratio=1.1,
+)
+
+grad_scaler = dict(
+    fp16=dict(
+        # the initial loss scale, defaults to 2**16
+        initial_scale=2**16,
+        # the minimum loss scale, defaults to None
+        min_scale=1,
+        # the number of steps to increase loss scale when no overflow occurs
+        growth_interval=1000,
+    ),
+    # the multiplication factor for increasing loss scale, defaults to 2
+    growth_factor=2,
+    # the multiplication factor for decreasing loss scale, defaults to 0.5
+    backoff_factor=0.5,
+    # the maximum loss scale, defaults to None
+    max_scale=2**24,
+    # the number of overflows before decreasing loss scale, defaults to 2
+    hysteresis=2,
+)
+
+hybrid_zero_optimizer = dict(
+    # Enable low_level_optimizer overlap_communication
+    overlap_sync_grad=True,
+    overlap_sync_param=True,
+    # bucket size for nccl communication params
+    reduce_bucket_size=512 * 1024 * 1024,
+    # grad clipping
+    clip_grad_norm=1.0,
+)
+
+loss = dict(
+    label_smoothing=0,
+)
+
+adam = dict(
+    lr=1e-4,
+    adam_beta1=0.9,
+    adam_beta2=0.95,
+    adam_beta2_c=0,
+    adam_eps=1e-8,
+    weight_decay=0.01,
+)
+
+lr_scheduler = dict(
+    total_steps=data["total_steps"],
+    init_steps=0,  # optimizer_warmup_step
+    warmup_ratio=0.01,
+    eta_min=1e-5,
+    last_epoch=-1,
+)
+
+beta2_scheduler = dict(
+    init_beta2=adam["adam_beta2"],
+    c=adam["adam_beta2_c"],
+    cur_iter=-1,
+)
+
+model = dict(
+    checkpoint=False,  # The proportion of layers for activation checkpointing, the optional values are True/False/[0-1]
+    num_attention_heads=NUM_ATTENTION_HEAD,
+    embed_split_hidden=True,
+    vocab_size=VOCAB_SIZE,
+    embed_grad_scale=1,
+    parallel_output=True,
+    hidden_size=HIDDEN_SIZE,
+    num_layers=NUM_LAYER,
+    mlp_ratio=MLP_RATIO,
+    apply_post_layer_norm=False,
+    dtype="torch.float16",  # Support: "torch.float16", "torch.half", "torch.bfloat16", "torch.float32", "torch.tf32"
+    norm_type="rmsnorm",
+    layer_norm_epsilon=1e-5,
+    use_flash_attn=True,
+    num_chunks=1,  # if num_chunks > 1, interleaved pipeline scheduler is used.
+)
+"""
+zero1 parallel:
+    1. if zero1 <= 0, the size of the zero process group is equal to the size of the dp process group,
+        so parameters will be divided within the range of dp.
+    2. if zero1 == 1, zero is not used, and all dp groups retain the full amount of model parameters.
+    3. if zero1 > 1 and zero1 <= dp world size, the zero process group is a subset of the dp process group.
+    For smaller models, it is usually a better choice to split the parameters within nodes with a setting <= 8.
+pipeline parallel (dict):
+    1. size: int, the size of pipeline parallel.
+    2. interleaved_overlap: bool, enable/disable communication overlap when using interleaved pipeline scheduler.
+tensor parallel: tensor parallel size, usually the number of GPUs per node.
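+data parallel: the data parallel size is not set here; it is derived as
+    total number of GPUs / pipeline parallel size / tensor parallel size,
+    e.g. 16 GPUs with pipeline size 1 and tensor size 1 give a data parallel size of 16.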
+""" +parallel = dict( + zero1=8, + pipeline=dict(size=1, interleaved_overlap=True), + sequence_parallel=False, +) + +cudnn_deterministic = False +cudnn_benchmark = False + +monitor = dict( + # feishu alert configs + alert=dict( + enable_feishu_alert=DO_ALERT, + feishu_alert_address=None, # feishu webhook to send alert message + light_monitor_address=None, # light_monitor address to send heartbeat + ), +) +``` +接下来将详细介绍启动一个模型训练所需要进行的数据、模型、并行和监控等相关的配置。 #### 数据配置 数据相关的关键参数配置及释义如下所示: diff --git a/internlm/utils/model_checkpoint.py b/internlm/utils/model_checkpoint.py index b6aab02..dad2fc6 100644 --- a/internlm/utils/model_checkpoint.py +++ b/internlm/utils/model_checkpoint.py @@ -447,8 +447,8 @@ class CheckpointManager: Args: ckpt_config (dict): model checkpoint config. - model (nn.module): model obj - optimizer (object): optimzier obj. + model (nn.module): model obj. + optimizer (object): optimizer obj. lr_scheduler (object): lr_scheduler obj. model_config (dict): model config. """ @@ -712,7 +712,6 @@ now step_count is {train_state.step_count}", return dict(path=latest_ckpt, content=("all",), ckpt_type="internlm") def try_resume_training(self, train_state: TrainState, current_time=""): - if self.load_ckpt_info is None or self.load_ckpt_info["path"] is None: if gpc.is_rank_for_log(): logger.info(