InternLM/doc/code-docs/locales/en/LC_MESSAGES/usage.po

# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2023, InternLM Team
# This file is distributed under the same license as the InternLM package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2023.
#
msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-27 11:14+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
"Language-Team: en <LL@li.org>\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../../usage.md:2
msgid "使用教程"
msgstr "Quickstart Guide"
#: ../../../usage.md:4
msgid ""
"启动一个 Demo "
"模型训练,需要进行三项准备,**安装****数据集准备**和**模型训练配置**。接下来,首先会介绍数据准备相关的操作,再简要描述模型训练配置相关的内容。"
msgstr ""
"To start a demo model training, you need to prepare three things: "
"**installation**, **dataset preparation**, and **model training "
"configuration**. In this guide, we will first cover the steps for dataset"
" preparation and then briefly describe the model training configuration."
#: ../../../usage.md:6
msgid "安装"
msgstr "Installation"
#: ../../../usage.md:7
msgid "请参考[安装文档](./install.md)进行安装。"
msgstr ""
"Please refer to the [installation guide](./install.md) for instructions "
"on how to install the necessary dependencies."
#: ../../../usage.md:9
msgid "数据准备 (预训练)"
msgstr "Dataset Preparation (Pre-training)"
#: ../../../usage.md:11
msgid "InternLM训练任务的数据集包括一系列的`bin`和`meta`文件。使用`tokenizer`从原始文本文件生成训练用数据集。通过在`tools/tokenizer.py`中指定模型参数路径的方式来导入tokenizer模型。目前提供`V7_sft.model`来生成tokens。若想使用不同的模型可直接修改`tokernizer.py`中的模型参数路径。"
msgstr ""
"The dataset for the InternLM training task includes a series of `bin` and"
" `meta` files. A `tokenizer` is used to generate the training dataset "
"from the original text files. The tokenizer model is imported by "
"specifying the model parameter path in `tools/tokenizer.py`. Currently, "
"`V7_sft.model` is provided to generate tokens. If you want to use a "
"different model, you can directly modify the model parameter path in "
"`tokenizer.py`."
#: ../../../usage.md:13
msgid "可以运行以下命令生成原始数据对应的`bin`和`meta`文件,其中参数`text_input_path`表示原始文本数据路径,目前支持`txt`、`json`和`jsonl`三种输入格式,`bin_output_path`表示生成的`bin`文件的保存路径。"
msgstr ""
"You can run the following command to generate `bin` and `meta` files "
"corresponding to the original data. The parameter `text_input_path` "
"represents the path of the original text data, currently supporting "
"`txt`, `json`, and `jsonl` formats, while `bin_output_path` represents "
"the save path of the generated `bin` files."
#: ../../../usage.md:18
msgid "下面是一个数据处理的例子:"
msgstr "Here is an example of data processing:"
#: ../../../usage.md:20
msgid "给定一个包含原始数据集的文件`raw_data.txt`,原始数据集如下所示:"
msgstr ""
"Given a file `raw_data.txt` containing the raw dataset, the raw dataset "
"is shown below:"
#: ../../../usage.md:27
msgid "可以通过运行以下命令来生成`bin`和`meta`文件:"
msgstr ""
"You can generate the `bin` and `meta` files by running the following "
"command:"
#: ../../../usage.md:32
msgid "需要注意的是,生成的`bin`文件需要保存在`cn`或者`en`或者`code`或者`ja`或者`ar`或者`kaoshi`这六个目录下,以区分数据集的类型。"
msgstr ""
"It should be noted that the generated `bin` files need to be saved in one"
" of the following directories: `cn`, `en`, `code`, `ja`, `ar`, or "
"`kaoshi`, depending on the type of dataset."
#: ../../../usage.md:34
msgid "其中,`cn`表示中文数据集;`en`表示英文数据集;`code`表示代码数据集;`ja`表示日语数据集;`ar`表示阿拉伯语数据集;`kaoshi`表示考试数据集。"
msgstr ""
"Here, `cn` represents the Chinese dataset, `en` represents the English "
"dataset, `code` represents the code dataset, `ja` represents the Japanese"
" dataset, `ar` represents the Arabic dataset, and `kaoshi` represents the"
" exam dataset."
#: ../../../usage.md:36
msgid "生成的bin文件的格式如下"
msgstr "The format of the generated `bin` files is as follows:"
#: ../../../usage.md:42
msgid "`bin`文件中的每一行均对应原始数据集中的每一个句子,表示每个句子的`token`下文将用sequence指定。"
msgstr ""
"Each line in the `bin` file corresponds to each sentence in the original "
"dataset, representing the tokens of each sentence (referred to as "
"sequence below)."
#: ../../../usage.md:44
msgid "生成的`meta`文件的格式如下:"
msgstr "The format of the generated `meta` file is as follows:"
#: ../../../usage.md:48
msgid ""
"在`meta`文件中,每个元组对应着`bin`文件中每一个`sequence`的元信息。其中,元组的第一个元素表示每个`sequence`在所有`sequence`中的`starting"
" index`,第二个元素表示每个`sequence`中有多少个`tokens`。"
msgstr ""
"Each tuple in the `meta` file represents the meta information of each "
"`sequence`, where the first element in the tuple indicates the `starting "
"index` of each `sequence` among all `sequences`, and the second element "
"indicates the number of `tokens` for each `sequence`."
#: ../../../usage.md:50
msgid ""
"例如,对于第一个`sequence``starting index`为 0有 11 "
"个`tokens`;对于第二个`sequence`,由于第一个`sequence`转换为`string`后的长度为`89`,因此它的`starting"
" index`为 90有 15 个`tokens`。"
msgstr ""
"For example, the first `sequence` starts at index 0 and has 16 `tokens`. "
"The second `sequence` starts at index 110 and has 24 `tokens`."
#: ../../../usage.md:52
msgid "`json`和`jsonl`类型的文件的`bin`和`meta`文件格式和`txt`一致,此处不再赘叙。"
msgstr ""
"The `bin` and `meta` file formats for `json` and `jsonl` type files are "
"the same as for `txt`, so we won't go over them here."
#: ../../../usage.md:54
msgid "数据准备 (微调)"
msgstr "Data Preparation (Fine-tuning)"
#: ../../../usage.md:56
msgid ""
"微调任务的数据集格式与预训练任务保持一致,生成的数据格式为一系列的`bin`和`meta`文件。以下以 Alpaca "
"数据集为例,介绍微调的数据准备流程。"
msgstr ""
"The data format for fine-tuning tasks is the same as for pre-training "
"tasks, which consists of a series of `bin` and `meta` files. Let's take "
"the Alpaca dataset as an example to explain the data preparation process "
"for fine-tuning."
#: ../../../usage.md:58
msgid ""
"下载 [Alpaca 数据集](https://github.com/tatsu-"
"lab/stanford_alpaca/blob/main/alpaca_data.json)"
msgstr ""
"Download the [Alpaca dataset](https://github.com/tatsu-"
"lab/stanford_alpaca/blob/main/alpaca_data.json)."
#: ../../../usage.md:60
msgid "对 Alpaca 数据进行 tokenize使用以下命令"
msgstr "Tokenize the Alpaca dataset using the following command:"
#: ../../../usage.md:66
msgid "建议用户参考 alpaca_tokenizer.py 编写新的脚本对自己的数据集进行 tokenize"
msgstr ""
"It is recommended that users refer to alpaca_tokenizer.py to write new "
"scripts to tokenize their own datasets"
#: ../../../usage.md:68
msgid "训练配置"
msgstr "Training Configuration"
#: ../../../usage.md:70
msgid "以 7B Demo 的配置文件`configs/7B_sft.py`为例:"
msgstr ""
"Taking the configuration file `configs/7B_sft.py` for the 7B demo as an "
"example,"
#: ../../../usage.md:237
msgid "接下来将详细介绍启动一个模型训练所需要进行的数据、模型、并行和监控等相关的配置。"
msgstr ""
"let's discuss the data, model, parallel and monitoring configurations "
"required to start a model training."
#: ../../../usage.md:239
msgid "数据配置"
msgstr "Data Configuration"
#: ../../../usage.md:240
msgid "数据相关的关键参数配置及释义如下所示:"
msgstr "Here are the key parameters and their explanations for data configuration:"
#: ../../../usage.md:255
msgid "![pack_into_one](./imgs/pack_into_one.png)"
msgstr ""
#: ../../../usage.md:255
msgid "pack_into_one"
msgstr ""
#: ../../../usage.md:258
msgid "目前支持传入数据集文件路径`train_folder`,且要求文件格式如下:"
msgstr ""
"Currently, it supports passing the dataset file path `train_folder`, and "
"the file format is required to be as follows:"
#: ../../../usage.md:265
msgid "数据集的详细内容可参考``数据准备``模块相关的介绍。"
msgstr ""
"For detailed information about the dataset, please refer to the \"Data "
"Preparation\" section."
#: ../../../usage.md:267
msgid "模型配置"
msgstr "Model Configuration"
#: ../../../usage.md:269
msgid "如果在启动训练时要加载模型 `checkpoint`,可进行如下相关配置:"
msgstr ""
"If you want to load a model checkpoint when starting the training, you "
"can configure it as follows:"
#: ../../../usage.md:282
msgid "注意:"
msgstr "Note:"
#: ../../../usage.md:283
msgid "路径若以 `local:` 为前缀,则存储在本地文件系统;若以 `boto3:` 为前缀,则存储在远程 oss 上"
msgstr ""
"If the path starts with `local:`, it means the file is stored in the "
"local file system. If it starts with `boto3:`, it means the file is "
"stored in the remote OSS."
#: ../../../usage.md:285
msgid "模型相关关键参数配置如下所示:"
msgstr "The configuration for the model is as follows:"
#: ../../../usage.md:309
msgid "注意:用户可自定义模型类型名和模型结构,并配置相对应的模型参数。通过`utils/registry.py`下的`MODEL_INITIALIZER`对象进行模型初始化函数接口注册,在训练主函数`train.py`中初始化模型时,可通过`model_type`配置获取指定的模型初始化接口函数。"
msgstr ""
"Note: Users can customize the model type name and model structure, and "
"configure the corresponding model parameters. The model initialization "
"function interface can be registered through the `MODEL_INITIALIZER` "
"object in `utils/registry.py`. When initializing the model in the "
"training main function `train.py`, the specified model initialization "
"interface function can be obtained through the `model_type` "
"configuration."
#: ../../../usage.md:311
msgid ""
"*如果基于 InternLM 7B继续训练可以参考 "
"[ModelZoo](https://github.com/InternLM/InternLM/tree/main#model-zoo) 中 "
"OpenXLab 链接下载权重*"
msgstr ""
"*If you want to start training based on InternLM 7B, you can refer to "
"OpenXLab [ModelZoo](https://github.com/InternLM/InternLM/tree/main#model-"
"zoo) to download weights*."
#: ../../../usage.md:313
msgid "并行配置"
msgstr "Parallel Configuration"
#: ../../../usage.md:315
msgid "训练并行配置样例如下:"
msgstr "Training parallel configuration example:"
#: ../../../usage.md:324
msgid "zero1zero 并行策略,分如下三种情况,默认值为 -1"
msgstr ""
"zero1: zero parallel strategy, divided into the following three cases, "
"default value is -1"
#: ../../../usage.md:325
msgid "当`zero1 <= 0`,则 zero1 进程组的大小等于数据并行进程组的大小,因此优化器状态参数将在数据并行范围内分配"
msgstr ""
"When `zero1 <= 0`, the size of the zero1 process group is equal to the "
"size of the data parallel process group, so the optimizer state "
"parameters will be split within the data parallel range."
#: ../../../usage.md:326
msgid "当`zero1 == 1`,则不使用 zero1 ,所有数据并行组保留完整的优化器状态参数"
msgstr ""
"When `zero1 == 1`, zero1 is not used, and all data parallel groups retain"
" the complete optimizer state parameters."
#: ../../../usage.md:327
msgid "当`zero1 > 1`且`zero1 <= data_parallel_world_size`,则 zero1 进程组是数据并行进程组的子集"
msgstr ""
"When `zero1 > 1` and `zero1 <= data_parallel_world_size`, the zero1 "
"process group is a subset of the data parallel process group."
#: ../../../usage.md:328
msgid "tensor张量并行大小通常是每个节点的 GPU 数量,默认值为 1"
msgstr ""
"tensor: tensor parallel size, usually the number of GPUs per node, "
"default is 1"
#: ../../../usage.md:329
msgid "pipeline流水线并行策略"
msgstr "pipeline: pipeline parallel strategy"
#: ../../../usage.md:330
msgid "size流水线并行大小默认值为 1"
msgstr "size: pipeline parallel size, the default value is 1"
#: ../../../usage.md:331
msgid "interleaved_overlapbool 类型,交错式调度时,开启或关闭通信优化,默认值为关闭"
msgstr ""
"interleaved_overlap: bool type, when interleaved scheduling, enable or "
"disable communication optimization, the default value is False"
#: ../../../usage.md:332
msgid "sequence_parallel是否开启序列化并行默认值为 False"
msgstr ""
"sequence_parallel: Whether to enable sequence parallelism, the default "
"value is False"
#: ../../../usage.md:334
msgid "注意:`数据并行大小 = 总的 GPU 数目 / 流水线并行大小 / 张量并行大小`"
msgstr ""
"Note: `Data parallel size = Total number of GPUs / Pipeline parallel size"
" / Tensor parallel size`"
#: ../../../usage.md:336
msgid "启动训练"
msgstr "Start Training"
#: ../../../usage.md:338
msgid "完成了以上数据集准备和相关训练配置后,可启动 Demo 训练。接下来分别以 slurm 和 torch 环境为例,介绍训练启动方式。"
msgstr ""
"After completing the data preparation and relevant training "
"configurations mentioned above, you can start the demo training. The "
"following examples demonstrate how to start the training in both slurm "
"and torch environments."
#: ../../../usage.md:340
msgid "若在 slurm 上启动分布式运行环境,多节点 16 卡的运行命令如下所示:"
msgstr ""
"If you want to start distributed training on slurm with 16 GPUs across "
"multiple nodes, use the following command:"
#: ../../../usage.md:345
msgid "若在 torch 上启动分布式运行环境,单节点 8 卡的运行命令如下所示:"
msgstr ""
"If you want to start distributed training on torch with 8 GPUs on a "
"single node, use the following command:"
#: ../../../usage.md:350
msgid "运行结果"
msgstr "Training Results"
#: ../../../usage.md:352
msgid "以 slurm 上单机 8 卡的 Demo 训练配置为例,训练结果日志展示如下:"
msgstr ""
"Taking the configuration of the demo training on a single machine with 8 "
"GPUs on slurm as an example, the training result log is shown below:"
#: ../../../usage.md:373
msgid "长文本生成"
msgstr "Long Text Generation"
#: ../../../usage.md:375
msgid ""
"在推理阶段,您可以在模型配置中通过设置 `use_dynamic_ntk_rope=True` 开启 RoPE 的 Dynamic NTK "
"选项,从而使得模型适应长文本输入输出,达到 16K 的外推效果:"
msgstr "During the inference phase, you can turn on the Dynamic NTK option of RoPE by setting `use_dynamic_ntk_rope=True` in the model configuration, "
"so that the model can adapt to long text input and output and achieve an extrapolation effect of 16K:"
#: ../../../usage.md:401
msgid "关于 Dyanmic NTK 的原理,详细请参考"
msgstr "Regarding the principle of Dyanmic NTK, please refer to"
#: ../../../usage.md:403
msgid "https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases"
msgstr ""
#: ../../../usage.md:404
msgid "https://kexue.fm/archives/9675"
msgstr ""
#~ msgid "`load_model_only_folder`与`load_ckpt_folder`不能同时设置"
#~ msgstr ""
#~ "`load_model_only_folder` and `load_ckpt_folder` "
#~ "cannot be set at the same time."