update README.md & usage.md

pull/51/head
gaoyang07 2023-07-10 19:25:13 +08:00
parent dc8dd6ec4d
commit 6e561e65f6
4 changed files with 20 additions and 20 deletions


@@ -8,16 +8,16 @@ Please refer to the [installation guide](./install.md) for instructions on how t
### Dataset Preparation (Pre-training)
The dataset for InternLM training consists of a series of `bin` and `meta` files. To generate the training dataset, you need to use the `tokenizer` tool to tokenize the raw text data. The tokenizer model can be imported by specifying the model path in the `tools/tokenizer.py` script. The currently provided model is `V7.model`. If you want to use a different model, you can modify the model path directly in the `tokenizer.py` script.
The dataset for the InternLM training task includes a series of `bin` and `meta` files. A `tokenizer` is used to generate the training dataset from the original text files. The tokenizer model is imported by specifying the model parameter path in `tools/tokenizer.py`. Currently, `V7_sft.model` is provided to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.
You can generate the `bin` and `meta` files for your raw data by running the following command, where the `raw_data_name` parameter represents the name of your raw data file, `input_file_type` represents the format of your raw data file (currently supports `txt`, `json`, and `jsonl`), and `bin` represents the path to save the generated `bin` files.
You can run the following command to generate `bin` and `meta` files corresponding to the original data. The parameter `text_input_path` represents the path of the original text data, currently supporting `txt`, `json`, and `jsonl` formats, while `bin_output_path` represents the save path of the generated `bin` files.
```bash
$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'txt' or 'json' or 'jsonl' --bin your_output_bin_path
$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
```
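The same two flags cover all supported input formats; a JSON Lines corpus, for example, would be tokenized with an invocation like the one below (the file names are illustrative, not files shipped with the repository):
```bash
# Illustrative invocation only: tokenize a hypothetical jsonl corpus with the
# same flags documented above, writing the output under the `en` directory.
$ python tools/tokenizer.py --text_input_path dataset.jsonl --bin_output_path en/dataset.bin
```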
Here is an example of data processing (only the example for the `txt` format is provided here; the processing for `json` and `jsonl` is exactly the same as for `txt`):
Here is an example of data processing:
Given a file `raw_data.txt` containing the raw dataset, the raw dataset is shown below:
@@ -30,7 +30,7 @@ Learn to be tolerant and understanding to establish truly harmonious interperson
You can generate the `bin` and `meta` files by running the following command:
```bash
$ python tools/tokenizer.py --raw_data_name raw_data --input_file_type 'text' --bin cn/output.bin
$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
```
It should be noted that the generated `bin` files need to be saved in one of the following directories: `cn`, `en`, `code`, `ja`, `ar`, or `kaoshi`, depending on the type of dataset.
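For the example above, the resulting on-disk layout might look like the following sketch (a minimal illustration; the exact name of the generated `meta` file is an assumption and may differ depending on your version of `tools/tokenizer.py`):
```bash
# Illustrative layout only: create the directory for Chinese data, tokenize
# into it, and list the outputs; the `output.bin.meta` name is an assumption.
$ mkdir -p cn
$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
$ ls cn/
output.bin  output.bin.meta
```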


@@ -7,14 +7,14 @@
### Dataset Preparation (Pre-training)
The dataset for the InternLM training task includes a series of `bin` and `meta` files. A `tokenizer` is used to generate the training dataset from the original text files. The tokenizer model is imported by specifying the model parameter path in `tools/tokenizer.py`. Currently, `V7.model` is provided to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.
The dataset for the InternLM training task includes a series of `bin` and `meta` files. A `tokenizer` is used to generate the training dataset from the original text files. The tokenizer model is imported by specifying the model parameter path in `tools/tokenizer.py`. Currently, `V7_sft.model` is provided to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.
You can run the following command to generate the `bin` and `meta` files corresponding to the original data, where the parameter `raw_data_name` is the file name of the original dataset, `input_file_type` is the file format of the original dataset (currently `txt`, `json`, and `jsonl` are supported), and `bin` is the save path of the generated `bin` files.
You can run the following command to generate the `bin` and `meta` files corresponding to the original data, where the parameter `text_input_path` is the path of the original text data (currently `txt`, `json`, and `jsonl` input formats are supported) and `bin_output_path` is the save path of the generated `bin` files.
```bash
$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'txt' or 'json' or 'jsonl' --bin your_output_bin_path
$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
```
Here is an example of data processing (only the example for the `txt` format is given here; the processing for `json` and `jsonl` is exactly the same as for `txt`):
Here is an example of data processing:
Given a file `raw_data.txt` containing the original dataset, the original dataset is shown below:
```bash
@@ -25,7 +25,7 @@ $ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suff
You can generate the `bin` and `meta` files by running the following command:
```bash
$ python tools/tokenizer.py --raw_data_name raw_data --input_file_type 'text' --bin cn/output.bin
$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
```
It should be noted that the generated `bin` files need to be saved in one of the following six directories, depending on the type of dataset: `cn`, `en`, `code`, `ja`, `ar`, or `kaoshi`.


@@ -9,14 +9,14 @@
```
# tokenizer.py
A `tokenizer` is needed to generate the `bin` and `meta` files for the original data. We import the tokenizer model by specifying the model parameter path in `tools/tokenizer.py`. Currently, we provide `V7.model` to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.
A `tokenizer` is needed to generate the `bin` and `meta` files for the original data. We import the tokenizer model by specifying the model parameter path in `tools/tokenizer.py`. Currently, we provide `V7_sft.model` to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.
We can run the following command to generate the `bin` and `meta` files corresponding to the original data, where the parameter `raw_data_name` is the file name of the original dataset, `input_file_type` is the file format of the original dataset (we currently support `txt`, `json`, and `jsonl`), and `bin` is the save path of the generated `bin` files.
You can run the following command to generate the `bin` and `meta` files corresponding to the original data, where the parameter `text_input_path` is the path of the original text data (currently `txt`, `json`, and `jsonl` input formats are supported) and `bin_output_path` is the save path of the generated `bin` files.
```bash
$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'text' or 'json' or 'jsonl' --bin your_output_bin_path
$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
```
Here is an example of data processing (only the example for the `txt` format is given here; the processing for `json` and `jsonl` is exactly the same as for `txt`):
Here is an example of data processing:
Given a file `raw_data.txt` containing the original dataset, the original dataset is shown below:
```bash
@@ -25,9 +25,9 @@ $ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suff
Learn to be tolerant and understanding to establish truly harmonious interpersonal relationships.
```
Next, we can generate the `bin` and `meta` files by running the following command:
You can generate the `bin` and `meta` files by running the following command:
```bash
$ python tools/tokenizer.py --raw_data_name raw_data --input_file_type 'text' --bin cn/output.bin
$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
```
It should be noted that the generated `bin` files need to be saved in one of the following six directories, depending on the type of dataset: `cn`, `en`, `code`, `ja`, `ar`, or `kaoshi`.


@@ -11,12 +11,12 @@ This directory provide some tools for model training with the following file str
# tokenizer.py
We need to use a `tokenizer` to generate `bin` and `meta` files for raw data. We import the tokenizer model by specifying the model weight path in `tools/tokenizer.py`. Currently, we provide `V7.model` to generate tokens. If you want to use a different model, you can modify the model weight path in `tokenizer.py` directly.
We can run the following command to generate `bin` and `meta` files for raw data, where the parameter `raw_data_name` indicates the file name of raw data, `input_file_type` denotes the raw data format, which should be `txt`, `json`, or `jsonl`, and `bin` indicates the path to save the generated `bin` file.
We can run the following command to generate `bin` and `meta` files corresponding to the original data. The parameter `text_input_path` represents the path of the original text data, currently supporting `txt`, `json`, and `jsonl` formats, while `bin_output_path` represents the save path of the generated `bin` files.
```bash
$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'text' or 'json' or 'jsonl' --bin your_output_bin_path
$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
```
An example of data processing in `txt` format is given here (the data processing for `json` and `jsonl` is identical to that for `txt`).
An example of data processing in `txt` format is given here:
Given a file `raw_data.txt` containing raw data with the following content:
```bash
@@ -26,7 +26,7 @@ Learn to be tolerant and understanding to establish truly harmonious interperson
```
Next, we can run the following command to generate `bin` and `meta` files for raw data.
```bash
$ python tools/tokenizer.py --raw_data_name raw_data --input_file_type 'text' --bin cn/output.bin
$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
```
It should be noted that the generated `bin` files should be placed in one of the following directories to clarify the data type: `cn`(Chinese), `en`(English), `code`(code data), `ja`(Japanese), `ar`(Arabic) and `kaoshi`(kaoshi data).