mirror of https://github.com/InternLM/InternLM
update README.md & usage.md
parent dc8dd6ec4d
commit 6e561e65f6
|
@ -8,16 +8,16 @@ Please refer to the [installation guide](./install.md) for instructions on how t
|
|||
|
||||
### Dataset Preparation (Pre-training)
|
||||
|
||||
The dataset for InternLM training consists of a series of `bin` and `meta` files. To generate the training dataset, you need to use the `tokenizer` tool to tokenize the raw text data. The tokenizer model can be imported by specifying the model path in the `tools/tokenizer.py` script. The current provided model is `V7.model`. If you want to use a different model, you can modify the model path directly in the `tokenizer.py` script.
|
||||
The dataset for the InternLM training task includes a series of `bin` and `meta` files. A `tokenizer` is used to generate the training dataset from the original text files. The tokenizer model is imported by specifying the model parameter path in `tools/tokenizer.py`. Currently, `V7_sft.model` is provided to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.
|
||||
|
||||
You can generate the `bin` and `meta` files for your raw data by running the following command, where the `raw_data_name` parameter represents the name of your raw data file, `input_file_type` represents the format of your raw data file (currently supports `txt`, `json`, and `jsonl`), and `bin` represents the path to save the generated `bin` files.
|
||||
You can run the following command to generate `bin` and `meta` files corresponding to the original data. The parameter `text_input_path` represents the path of the original text data, currently supporting `txt`, `json`, and `jsonl` formats, while `bin_output_path` represents the save path of the generated `bin` files.
|
||||
|
||||
|
||||
```bash
|
||||
$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'txt' or 'json' or 'jsonl' --bin your_output_bin_path
|
||||
$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
|
||||
```
|
||||
|
||||
Here is an example of data processing (only the `txt` format is shown; `json` and `jsonl` files are processed in exactly the same way):
|
||||
Here is an example of data processing:
|
||||
|
||||
Given a file `raw_data.txt` containing the raw dataset, the raw dataset is shown below:
|
||||
|
||||
|
@ -30,7 +30,7 @@ Learn to be tolerant and understanding to establish truly harmonious interperson
|
|||
You can generate the `bin` and `meta` files by running the following command:
|
||||
|
||||
```bash
|
||||
$ python tools/tokenizer.py --raw_data_name raw_data --input_file_type 'text' --bin cn/output.bin
|
||||
$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
|
||||
```
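When several raw files need tokenizing, the command above can be generated per file. The following is a minimal sketch, not part of InternLM: `build_tokenizer_cmd` is a hypothetical helper, and naming each `bin` file after its input file is an assumption. Only the `--text_input_path` and `--bin_output_path` flags come from the documentation. The sketch prints the commands instead of running them, so they can be reviewed first.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of InternLM): print the tokenizer command
# for one input file and one dataset-type directory (e.g. cn, en, code).
build_tokenizer_cmd() {
    local input_path="$1" dataset_dir="$2"
    local stem
    # Drop the directory part, then the file extension (assumed naming scheme).
    stem="$(basename "$input_path")"
    stem="${stem%.*}"
    printf 'python tools/tokenizer.py --text_input_path %s --bin_output_path %s/%s.bin\n' \
        "$input_path" "$dataset_dir" "$stem"
}

# Dry run: print one command per input file instead of executing it.
for f in raw_data.txt more_data.jsonl; do
    build_tokenizer_cmd "$f" cn
done
```

Piping the printed lines to `bash` would execute them; whether the tokenizer creates a missing output directory is not stated in the documentation, so creating it beforehand is the safer assumption.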
|
||||
|
||||
Note that the generated `bin` files must be saved in one of the following directories, depending on the type of dataset: `cn`, `en`, `code`, `ja`, `ar`, or `kaoshi`.
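The directory rule can be checked before tokenizing. The sketch below is an illustration under assumptions, not part of InternLM: `is_valid_bin_path` is a hypothetical helper, and it assumes the rule applies to the directory that directly contains the `bin` file.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of InternLM): succeed only if the bin file's
# immediate parent directory is one of the documented dataset-type names.
is_valid_bin_path() {
    local bin_path="$1"
    local parent
    parent="$(basename "$(dirname "$bin_path")")"
    case "$parent" in
        cn|en|code|ja|ar|kaoshi) return 0 ;;
        *) return 1 ;;
    esac
}

# Example check on the output path used in this document.
if is_valid_bin_path "cn/output.bin"; then
    echo "ok: cn/output.bin"
fi
```

A guard such as `is_valid_bin_path cn/output.bin && python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin` would then run the tokenizer only for an accepted path.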
|
||||
|
|
10
doc/usage.md
|
@ -7,14 +7,14 @@
|
|||
|
||||
### Dataset Preparation (Pre-training)
|
||||
|
||||
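The dataset for the InternLM training task includes a series of `bin` and `meta` files. A `tokenizer` is used to generate the training dataset from the original text files. The tokenizer model is imported by specifying the model parameter path in `tools/tokenizer.py`. Currently, `V7.model` is provided to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.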
|
||||
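The dataset for the InternLM training task includes a series of `bin` and `meta` files. A `tokenizer` is used to generate the training dataset from the original text files. The tokenizer model is imported by specifying the model parameter path in `tools/tokenizer.py`. Currently, `V7_sft.model` is provided to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.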
|
||||
|
||||
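You can run the following command to generate the `bin` and `meta` files corresponding to the original data. The parameter `raw_data_name` is the file name of the original dataset, `input_file_type` is the file format of the original dataset (currently `txt`, `json`, and `jsonl` are supported), and `bin` is the save path of the generated `bin` files.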
|
||||
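You can run the following command to generate the `bin` and `meta` files corresponding to the original data. The parameter `text_input_path` is the path of the original text data (currently `txt`, `json`, and `jsonl` input formats are supported), and `bin_output_path` is the save path of the generated `bin` files.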
|
||||
```bash
|
||||
$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'txt' or 'json' or 'jsonl' --bin your_output_bin_path
|
||||
$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
|
||||
```
|
||||
|
||||
Here is an example of data processing (only the `txt` format is shown; `json` and `jsonl` files are processed in exactly the same way):
|
||||
Here is an example of data processing:
|
||||
|
||||
Given a file `raw_data.txt` containing the raw dataset, the raw dataset is shown below:
|
||||
```bash
|
||||
|
@ -25,7 +25,7 @@ $ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suff
|
|||
|
||||
You can generate the `bin` and `meta` files by running the following command:
|
||||
```bash
|
||||
$ python tools/tokenizer.py --raw_data_name raw_data --input_file_type 'text' --bin cn/output.bin
|
||||
$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
|
||||
```
|
||||
|
||||
Note that the generated `bin` files must be saved in one of the six directories `cn`, `en`, `code`, `ja`, `ar`, or `kaoshi`, to distinguish the type of dataset.
|
||||
|
|
|
@ -9,14 +9,14 @@
|
|||
```
|
||||
|
||||
# tokenizer.py
|
||||
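Generating the `bin` and `meta` files for raw data requires a `tokenizer`. We import the tokenizer model by specifying the model parameter path in `tools/tokenizer.py`. Currently, we provide `V7.model` to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.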
|
||||
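Generating the `bin` and `meta` files for raw data requires a `tokenizer`. We import the tokenizer model by specifying the model parameter path in `tools/tokenizer.py`. Currently, we provide `V7_sft.model` to generate tokens. If you want to use a different model, you can directly modify the model parameter path in `tokenizer.py`.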
|
||||
|
||||
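We can run the following command to generate the `bin` and `meta` files corresponding to the original data. The parameter `raw_data_name` is the file name of the original dataset, `input_file_type` is the file format of the original dataset (we currently support `txt`, `json`, and `jsonl`), and `bin` is the save path of the generated `bin` files.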
|
||||
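You can run the following command to generate the `bin` and `meta` files corresponding to the original data. The parameter `text_input_path` is the path of the original text data (currently `txt`, `json`, and `jsonl` input formats are supported), and `bin_output_path` is the save path of the generated `bin` files.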
|
||||
```bash
|
||||
$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'text' or 'json' or 'jsonl' --bin your_output_bin_path
|
||||
$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
|
||||
```
|
||||
|
||||
Here is an example of data processing (only the `txt` format is shown; `json` and `jsonl` files are processed in exactly the same way):
|
||||
Here is an example of data processing:
|
||||
|
||||
Given a file `raw_data.txt` containing the raw dataset, the raw dataset is shown below:
|
||||
```bash
|
||||
|
@ -25,9 +25,9 @@ $ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suff
|
|||
学会宽容和理解,才能建立真正和谐的人际关系。
|
||||
```
|
||||
|
||||
Next, we can generate the `bin` and `meta` files by running the following command:
|
||||
You can generate the `bin` and `meta` files by running the following command:
|
||||
```bash
|
||||
$ python tools/tokenizer.py --raw_data_name raw_data --input_file_type 'text' --bin cn/output.bin
|
||||
$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
|
||||
```
|
||||
|
||||
Note that the generated `bin` files must be saved in one of the six directories `cn`, `en`, `code`, `ja`, `ar`, or `kaoshi`, to distinguish the type of dataset.
|
||||
|
|
|
@ -11,12 +11,12 @@ This directory provide some tools for model training with the following file str
|
|||
# tokenizer.py
|
||||
We need to use a `tokenizer` to generate `bin` and `meta` files for raw data. We import the tokenizer model by specifying the model weight path in `tools/tokenizer.py`. Currently, we provide `V7.model` to generate tokens. If you want to use a different model, you can modify the model weight path in `tokenizer.py` directly.
|
||||
|
||||
We can run the following command to generate `bin` and `meta` files for raw data, where the parameter `raw_data_name` indicates the file name of raw data, `input_file_type` denotes the raw data format, which should be `txt`, `json` and `jsonl`, and `bin` indicates the path to save the generated `bin` file.
|
||||
We can run the following command to generate `bin` and `meta` files corresponding to the original data. The parameter `text_input_path` represents the path of the original text data, currently supporting `txt`, `json`, and `jsonl` formats, while `bin_output_path` represents the save path of the generated `bin` files.
|
||||
```bash
|
||||
$ python tools/tokenizer.py --raw_data_name your_raw_data_file_name(without suffix) --input_file_type 'text' or 'json' or 'jsonl' --bin your_output_bin_path
|
||||
$ python tools/tokenizer.py --text_input_path your_input_text_path --bin_output_path your_output_bin_path
|
||||
```
|
||||
|
||||
An example of data processing in `txt` format is given here (the data processing for `json` and `jsonl` is identical to that for `txt`).
|
||||
An example of data processing in `txt` format is given here:
|
||||
|
||||
Given a file `raw_data.txt` containing raw data with the following content.
|
||||
```bash
|
||||
|
@ -26,7 +26,7 @@ Learn to be tolerant and understanding to establish truly harmonious interperson
|
|||
```
|
||||
Next, we can run the following command to generate `bin` and `meta` files for raw data.
|
||||
```bash
|
||||
$ python tools/tokenizer.py --raw_data_name raw_data --input_file_type 'text' --bin cn/output.bin
|
||||
$ python tools/tokenizer.py --text_input_path raw_data.txt --bin_output_path cn/output.bin
|
||||
```
|
||||
|
||||
Note that the generated `bin` files should be placed in one of the following directories to indicate the data type: `cn` (Chinese), `en` (English), `code` (code data), `ja` (Japanese), `ar` (Arabic), or `kaoshi` (kaoshi data).
|
||||
|
|