mirror of https://github.com/hpcaitech/ColossalAI
update readme
parent 3b35989ee7
commit 07b74c2d20

@@ -1,6 +1,6 @@
 <div align="center">
 <h1>
-<img src="https://github.com/hpcaitech/public_assets/blob/main/applications/colossal-llama-2/colossalllam2.jpg?raw=true" width=800/>
+Colossal-LLaMA
 </h1>
 </div>

@@ -289,7 +289,7 @@ Here is details about CLI arguments:
 
 #### 1. Install required packages
 ```
-cd Colossal-LLaMA-2
+cd Colossal-LLaMA
 pip install -r requirements.txt
 ```
 #### 2. Install `xentropy`, `layer_norm` and `rotary`

@@ -314,7 +314,7 @@ Initialize new tokenizer with additional Chinese tokens. Additional Chinese toke
 Command to initialize new tokenizer:
 ```bash
 export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION='python'
-python colossal_llama2/tokenizer/init_tokenizer.py \
+python colossal_llama/tokenizer/init_tokenizer.py \
     --source_tokenizer_dir "<SOURCE_TOKENIZER_DIR>" \
     --target_tokenizer_dir "<TARGET_TOKENIZER_DIR>" \
     --expand_tokens_file "<NEW_TOKENS_FILE>.jsonl"

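Note: vocabulary expansion itself amounts to registering the new token strings with the tokenizer and saving the result. The sketch below illustrates that idea with the Hugging Face `transformers` API; it is not the repository's `init_tokenizer.py`, and the `"piece"` field name assumed for the new-tokens jsonl file is an assumption.

```python
# Illustrative sketch only -- not colossal_llama/tokenizer/init_tokenizer.py.
# The "piece" field name in the new-tokens jsonl file is an assumption.
import json

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<SOURCE_TOKENIZER_DIR>")

new_tokens = []
with open("<NEW_TOKENS_FILE>.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        new_tokens.append(record["piece"])  # assumed schema: one token string per line

# add_tokens skips tokens that are already present in the vocabulary.
num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} new tokens")

tokenizer.save_pretrained("<TARGET_TOKENIZER_DIR>")
```
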
@@ -328,7 +328,7 @@ Here is details about CLI arguments:
 Initialize the new model checkpoint by calculating the mean values from the original model checkpoint.
 Command to initialize new model checkpoint:
 ```bash
-python colossal_llama2/model/init_model.py \
+python colossal_llama/model/init_model.py \
     --source_model_and_tokenizer_path "<SOURCE_MODEL_AND_TOKENIZER_DIR>" \
     --target_tokenizer_path "<TARGET_TOKENIZER_DIR>" \
     --target_model_path "<TARGET_MODEL_DIR>"

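Note: the mean-value initialization above is a common recipe for newly added vocabulary entries: the embedding matrix (and LM head) is resized to the new vocabulary size, and the new rows start from the mean of the original rows rather than from random values. A minimal sketch of that idea with Hugging Face `transformers` follows; it is an illustration, not the repository's `init_model.py`.

```python
# Illustrative sketch only -- not colossal_llama/model/init_model.py.
# Mean-initialize the rows added for new tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("<SOURCE_MODEL_AND_TOKENIZER_DIR>")
tokenizer = AutoTokenizer.from_pretrained("<TARGET_TOKENIZER_DIR>")  # expanded vocabulary

old_vocab_size = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(tokenizer))

with torch.no_grad():
    input_emb = model.get_input_embeddings().weight
    output_emb = model.get_output_embeddings().weight
    # New rows are set to the mean of the original rows instead of random values.
    input_emb[old_vocab_size:] = input_emb[:old_vocab_size].mean(dim=0, keepdim=True)
    output_emb[old_vocab_size:] = output_emb[:old_vocab_size].mean(dim=0, keepdim=True)

model.save_pretrained("<TARGET_MODEL_DIR>")
tokenizer.save_pretrained("<TARGET_MODEL_DIR>")
```
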
@@ -362,18 +362,17 @@ Command to convert jsonl dataset to arrow format:
 python prepare_pretrain_dataset.py \
     --data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
     --tokenizer_dir "<TOKENIZER_DIR>" \
-    --data_cache_dir "jsonl_to_arrow_cache" \
-    --data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
-    --data_arrow_output_dir "spliced_tokenized_output_arrow" \
+    --data_output_dirs "spliced tokenized output" \
     --max_length 4096 \
     --num_spliced_dataset_bins 10
 ```
 Here are details about the CLI arguments:
 * Source data directory: `data_input_dirs`. Each `<JSONL_DIR>` can have multiple files in `jsonl` format.
 * Tokenizer directory: `tokenizer_dir`. Path to the tokenizer in Hugging Face format.
-* Data cache directory: `data_cache_dir`. Directory to store Hugging Face data cache. Default case will create `cache` folder locally.
-* Output directory for jsonl format: `data_jsonl_output_dir`. Output directory to store converted dataset in jsonl format.
-* Output directory for arrow format: `data_arrow_output_dir`. Output directory to store converted dataset in arrow format, which can be used for training directly.
+* Data output directory: `data_output_dirs`. Directory to store preprocessed output, including three sub-directories:
+  * `cache`: Directory to store Hugging Face data cache.
+  * `jsonl`: Output directory to store converted dataset in jsonl format.
+  * `arrow`: Output directory to store converted dataset in arrow format, which can be used for training directly.
 * Max length: `max_length`. Max length of spliced samples. Default value is 4096.
 * Number of bins for each category: `num_spliced_dataset_bins`. Number of bins for each category, used for bucket-based training.
 
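Note: the "spliced" samples here are formed by packing tokenized documents into sequences of at most `max_length` tokens before they are written out in arrow format. A simplified sketch of that packing idea (not the repository's `prepare_pretrain_dataset.py`):

```python
# Illustrative sketch only -- not prepare_pretrain_dataset.py.
# Pack tokenized samples into fixed-length "spliced" sequences.
from typing import Iterable, List


def splice(token_id_lists: Iterable[List[int]], max_length: int = 4096) -> List[List[int]]:
    """Concatenate samples and cut the stream into chunks of at most max_length tokens."""
    spliced, buffer = [], []
    for ids in token_id_lists:
        buffer.extend(ids)
        while len(buffer) >= max_length:
            spliced.append(buffer[:max_length])
            buffer = buffer[max_length:]
    if buffer:  # keep the shorter remainder as a final sample
        spliced.append(buffer)
    return spliced


# Example: three short "documents" packed into chunks of 8 tokens.
print(splice([[1, 2, 3], [4, 5, 6, 7, 8], [9, 10, 11, 12]], max_length=8))
```
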
@@ -392,13 +391,15 @@ Command to convert jsonl dataset to arrow format is similar to the command in [3
 python prepare_sft_dataset.py \
     --data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
     --tokenizer_dir "<TOKENIZER_DIR>" \
-    --data_cache_dir "jsonl_to_arrow_cache" \
-    --data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
-    --data_arrow_output_dir "spliced_tokenized_output_arrow" \
+    --data_output_dirs "spliced tokenized output" \
     --max_length 4096 \
-    --num_spliced_dataset_bins 10
+    --num_spliced_dataset_bins 10 \
+    --llama_version 3
 ```
 
+Additional CLI arguments:
+* LLaMA version: `llama_version`. Specify the LLaMA version.
+
 #### 4. Command Line Arguments for Training
 
 ##### 4.1 Arguments for Pretraining