From 6e561e65f6938ce3b1f7a5fbb58dcadd75f61dc7 Mon Sep 17 00:00:00 2001
From: gaoyang07
Date: Mon, 10 Jul 2023 19:25:13 +0800
Subject: [PATCH] update README.md & usage.md

---
 doc/en/usage.md    | 10 +++++-----
 doc/usage.md       | 10 +++++-----
 tools/README.md    | 12 ++++++------
 tools/README_EN.md | 10 +++++-----
 4 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/doc/en/usage.md b/doc/en/usage.md
index d6adf9f..5912250 100644
--- a/doc/en/usage.md
+++ b/doc/en/usage.md
@@ -8,16 +8,16 @@ Please refer to the [installation guide](./install.md) for instructions on how t
 
 ### Dataset Preparation (Pre-training)
 
-The dataset for InternLM training consists of a series of `bin` and `meta` files. To generate the training dataset, you need to use the `tokenizer` tool to tokenize the raw text data. The tokenizer model can be imported by specifying the model path in the `tools/tokenizer.py` script. The current provided model is `V7.model`. If you want to use a different model, you can modify the model path directly in the `tokenizer.py` script.
+The dataset for the InternLM training task consists of a series of `bin` and `meta` files. A `tokenizer` is used to generate the training dataset from the original text files. The tokenizer model is imported by specifying the model parameter path in `tools/tokenizer.py`. Currently, `V7_sft.model` is provided to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.
 
-You can generate the `bin` and `meta` files for your raw data by running the following command, where the `raw_data_name` parameter represents the name of your raw data file, `input_file_type` represents the format of your raw data file (currently supports `txt`, `json`, and `jsonl`), and `bin` represents the path to save the generated `bin` files.
+You can run the following command to generate the `bin` and `meta` files corresponding to the original data. The parameter `text_input_path` represents the path of the original text data, currently supporting the `txt`, `json`, and `jsonl` formats, while `bin_output_path` represents the save path of the generated `bin` files.
 
 ```bash
-$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'txt' or 'json' or 'jsonl' --bin your_output_bin_path
+$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
 ```
 
-Here is an example of data processing (only the data processing example for the `txt` format is provided here, the data processing process for `json` and `jsonl` is exactly the same as for `txt`):
+Here is an example of data processing:
 
 Given a file `raw_data.txt` containing the raw dataset, as shown below:
 
@@ -30,7 +30,7 @@ Learn to be tolerant and understanding to establish truly harmonious interperson
 You can generate the `bin` and `meta` files by running the following command:
 
 ```bash
-$ python tools/tokenizer.py --raw_data_name raw_data --input_file_type 'text' --bin cn/output.bin
+$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
 ```
 
 It should be noted that the generated `bin` files need to be saved in one of the following directories: `cn`, `en`, `code`, `ja`, `ar`, or `kaoshi`, depending on the type of dataset.
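To exercise the renamed flags end to end, a wrapper along the following lines can batch-tokenize several raw files into the per-type directories required above. This is a minimal sketch that assumes only the documented `--text_input_path` and `--bin_output_path` options; the `raw_cn_*.txt` / `raw_en_*.txt` file names are hypothetical placeholders.

```bash
#!/usr/bin/env bash
# Minimal sketch: batch-tokenize raw text files into the per-type output
# directories listed above. Only the documented --text_input_path and
# --bin_output_path flags are used; the input file names are hypothetical.
set -euo pipefail
shopt -s nullglob

mkdir -p cn en

for f in raw_cn_*.txt; do   # Chinese data goes under cn/
  python tools/tokenizer.py --text_input_path "$f" --bin_output_path "cn/${f%.txt}.bin"
done
for f in raw_en_*.txt; do   # English data goes under en/
  python tools/tokenizer.py --text_input_path "$f" --bin_output_path "en/${f%.txt}.bin"
done
```

Each invocation writes one `bin` file with its accompanying `meta` file, so the per-type directories stay easy to audit.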
diff --git a/doc/usage.md b/doc/usage.md
index d6e85ae..e46f1f7 100644
--- a/doc/usage.md
+++ b/doc/usage.md
@@ -7,14 +7,14 @@
 ### Dataset Preparation (Pre-training)
 
-The dataset for the InternLM training task includes a series of `bin` and `meta` files. A `tokenizer` is used to generate the training dataset from the original text files. The tokenizer model is imported by specifying the model parameter path in `tools/tokenizer.py`. Currently, `V7.model` is provided to generate tokens. To use a different model, directly modify the model parameter path in `tokenizer.py`.
+The dataset for the InternLM training task includes a series of `bin` and `meta` files. A `tokenizer` is used to generate the training dataset from the original text files. The tokenizer model is imported by specifying the model parameter path in `tools/tokenizer.py`. Currently, `V7_sft.model` is provided to generate tokens. To use a different model, directly modify the model parameter path in `tokenizer.py`.
 
-You can run the following command to generate the `bin` and `meta` files corresponding to the original data, where the parameter `raw_data_name` is the file name of the original dataset, `input_file_type` is the file format of the original dataset (the `txt`, `json`, and `jsonl` formats are currently supported), and `bin` is the save path of the generated `bin` files.
+You can run the following command to generate the `bin` and `meta` files corresponding to the original data, where the parameter `text_input_path` is the path of the original text data (the `txt`, `json`, and `jsonl` input formats are currently supported) and `bin_output_path` is the save path of the generated `bin` files.
 
 ```bash
-$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'txt' or 'json' or 'jsonl' --bin your_output_bin_path
+$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
 ```
 
-Here is an example of data processing (only the `txt` case is shown here; the processing flow for `json` and `jsonl` is exactly the same as for `txt`):
+Here is an example of data processing:
 
 Given a file `raw_data.txt` containing the raw dataset, as shown below:
 
 ```bash
 感恩生活中的每一个细节,才能真正体会到幸福的滋味。
 梦想是人生的动力源泉,努力追逐,才能实现自己的目标。
 学会宽容和理解,才能建立真正和谐的人际关系。
 ```
 
@@ -25,7 +25,7 @@ $ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suff
 You can generate the `bin` and `meta` files by running the following command:
 
 ```bash
-$ python tools/tokenizer.py --raw_data_name raw_data --input_file_type 'text' --bin cn/output.bin
+$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
 ```
 
 Note that the generated `bin` files need to be saved in one of the six directories `cn`, `en`, `code`, `ja`, `ar`, or `kaoshi`, to distinguish the type of dataset.

diff --git a/tools/README.md b/tools/README.md
index 38e9590..f3ba385 100644
--- a/tools/README.md
+++ b/tools/README.md
@@ -9,14 +9,14 @@
 ```
 
 # tokenizer.py
 
-A `tokenizer` is needed to generate the `bin` and `meta` files for raw data. We import the tokenizer model by specifying the model parameter path in `tools/tokenizer.py`. Currently, we provide `V7.model` to generate tokens. To use a different model, directly modify the model parameter path in `tokenizer.py`.
+A `tokenizer` is needed to generate the `bin` and `meta` files for raw data. We import the tokenizer model by specifying the model parameter path in `tools/tokenizer.py`. Currently, we provide `V7_sft.model` to generate tokens. To use a different model, directly modify the model parameter path in `tokenizer.py`.
 
-We can run the following command to generate the `bin` and `meta` files corresponding to the raw data, where the parameter `raw_data_name` is the file name of the raw dataset, `input_file_type` is the file format of the raw dataset (we currently support `txt`, `json`, and `jsonl`), and `bin` is the save path of the generated `bin` files.
+You can run the following command to generate the `bin` and `meta` files corresponding to the original data, where the parameter `text_input_path` is the path of the original text data (the `txt`, `json`, and `jsonl` input formats are currently supported) and `bin_output_path` is the save path of the generated `bin` files.
 
 ```bash
-$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'text' or 'json' or 'jsonl' --bin your_output_bin_path
+$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
 ```
 
-Here is an example of data processing (only the `txt` case is shown here; the processing flow for `json` and `jsonl` is exactly the same as for `txt`):
+Here is an example of data processing:
 
 Given a file `raw_data.txt` containing the raw dataset, as shown below:
 
 ```bash
@@ -25,9 +25,9 @@ $ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suff
 学会宽容和理解,才能建立真正和谐的人际关系。
 ```
 
-Next, we can generate the `bin` and `meta` files by running the following command:
+You can generate the `bin` and `meta` files by running the following command:
 
 ```bash
-$ python tools/tokenizer.py --raw_data_name raw_data --input_file_type 'text' --bin cn/output.bin
+$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
 ```
 
 Note that the generated `bin` files need to be saved in one of the six directories `cn`, `en`, `code`, `ja`, `ar`, or `kaoshi`, to distinguish the type of dataset.

diff --git a/tools/README_EN.md b/tools/README_EN.md
index 3af5637..140c36f 100644
--- a/tools/README_EN.md
+++ b/tools/README_EN.md
@@ -11,12 +11,12 @@ This directory provides some tools for model training with the following file str
 
 # tokenizer.py
 
-We need to use a `tokenizer` to generate `bin` and `meta` files for raw data. We import the tokenizer model by specifying the model weight path in `tools/tokenizer.py`. Currently, we provide `V7.model` to generate tokens. If you want to use a different model, you can modify the model weight path in `tokenizer.py` directly.
+We need to use a `tokenizer` to generate `bin` and `meta` files for raw data. We import the tokenizer model by specifying the model weight path in `tools/tokenizer.py`. Currently, we provide `V7_sft.model` to generate tokens. If you want to use a different model, you can modify the model weight path in `tokenizer.py` directly.
 
-We can run the following command to generate `bin` and `meta` files for raw data, where the parameter `raw_data_name` indicates the file name of raw data, `input_file_type` denotes the raw data format, which should be `txt`, `json` and `jsonl`, and `bin` indicates the path to save the generated `bin` file.
+We can run the following command to generate `bin` and `meta` files corresponding to the original data. The parameter `text_input_path` represents the path of the original text data (the `txt`, `json`, and `jsonl` formats are currently supported), while `bin_output_path` represents the save path of the generated `bin` files.
 
 ```bash
-$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'text' or 'json' or 'jsonl' --bin your_output_bin_path
+$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
 ```
 
-An example of data processing in `txt` format is given here (the data processing for `json` and `jsonl` is identical to that for `txt`).
+An example of data processing in `txt` format is given here:
 
 Given a file `raw_data.txt` containing raw data with the following content:
 
 ```bash
 Appreciate every detail in life to truly taste the flavor of happiness.
 Dreams are the source of life's motivation. Pursue them diligently to achieve your goals.
@@ -26,7 +26,7 @@ Learn to be tolerant and understanding to establish truly harmonious interperson
 ```
 
 Next, we can run the following command to generate `bin` and `meta` files for raw data.
 
 ```bash
-$ python tools/tokenizer.py --raw_data_name raw_data --input_file_type 'text' --bin cn/output.bin
+$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
 ```
 
 It should be noted that the generated `bin` files should be placed in one of the following directories to clarify the data type: `cn` (Chinese), `en` (English), `code` (code data), `ja` (Japanese), `ar` (Arabic), and `kaoshi` (exam data).
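Since every generated `bin` file is expected to arrive with a companion `meta` file, a quick post-run check can catch interrupted conversions. This is a minimal sketch only: it assumes the meta file is named by appending `.meta` to the `bin` path, which the patch does not specify, so adjust the suffix to whatever your `tokenizer.py` actually writes.

```bash
#!/usr/bin/env bash
# Minimal sketch: verify that each generated bin file has a meta companion.
# The "${bin}.meta" naming is an assumption; adjust it to match the files
# your tokenizer.py actually produces.
shopt -s nullglob
for bin in cn/*.bin en/*.bin code/*.bin ja/*.bin ar/*.bin kaoshi/*.bin; do
  if [ ! -f "${bin}.meta" ]; then
    echo "missing meta for ${bin}" >&2
  fi
done
```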