update readme

llama3
Tong Li 2024-04-21 11:32:57 +08:00
parent 3b35989ee7
commit 07b74c2d20
1 changed file with 15 additions and 14 deletions

@@ -1,6 +1,6 @@
 <div align="center">
 <h1>
-<img src="https://github.com/hpcaitech/public_assets/blob/main/applications/colossal-llama-2/colossalllam2.jpg?raw=true" width=800/>
+Colossal-LLaMA
 </h1>
 </div>
@@ -289,7 +289,7 @@ Here is details about CLI arguments:
 #### 1. Install required packages
 ```
-cd Colossal-LLaMA-2
+cd Colossal-LLaMA
 pip install -r requirements.txt
 ```
 #### 2. Install `xentropy`, `layer_norm` and `rotary`
@@ -314,7 +314,7 @@ Initialize new tokenizer with additional Chinese tokens. Additional Chinese toke
 Command to initialize new tokenizer:
 ```bash
 export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION='python'
-python colossal_llama2/tokenizer/init_tokenizer.py \
+python colossal_llama/tokenizer/init_tokenizer.py \
     --source_tokenizer_dir "<SOURCE_TOKENIZER_DIR>" \
     --target_tokenizer_dir "<TARGET_TOKENIZER_DIR>" \
     --expand_tokens_file "<NEW_TOKENS_FILE>.jsonl"
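As a point of reference, the vocabulary-expansion step can be pictured with the short Python sketch below. It is only an illustration, not the repository's `init_tokenizer.py` (which works at the sentencepiece level); the `piece` field name in the jsonl file is an assumption.

```python
import json
from transformers import AutoTokenizer

# Illustrative vocabulary-expansion sketch (not the repository's init_tokenizer.py):
# read candidate tokens from the jsonl file and append any the source tokenizer
# does not already contain. The "piece" field name is an assumption.
tokenizer = AutoTokenizer.from_pretrained("<SOURCE_TOKENIZER_DIR>")

with open("<NEW_TOKENS_FILE>.jsonl", encoding="utf-8") as f:
    candidates = [json.loads(line)["piece"] for line in f if line.strip()]

num_added = tokenizer.add_tokens([t for t in candidates if t not in tokenizer.get_vocab()])
tokenizer.save_pretrained("<TARGET_TOKENIZER_DIR>")
print(f"Added {num_added} tokens to the tokenizer")
```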
@@ -328,7 +328,7 @@ Here is details about CLI arguments:
 Initialize the new model checkpoint by calculating the mean values from the original model checkpoint.
 Command to initialize new model checkpoint:
 ```bash
-python colossal_llama2/model/init_model.py \
+python colossal_llama/model/init_model.py \
     --source_model_and_tokenizer_path "<SOURCE_MODEL_AND_TOKENIZER_DIR>" \
     --target_tokenizer_path "<TARGET_TOKENIZER_DIR>" \
     --target_model_path "<TARGET_MODEL_DIR>"
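The mean-initialization idea can be sketched as follows. This is only an illustration under the assumption of a Hugging Face LLaMA checkpoint, not the actual `init_model.py` implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative mean-initialization sketch (not the repository's init_model.py):
# grow the embedding matrices to the expanded vocabulary and fill each new row
# with the mean of the original rows, so new tokens start from an "average"
# representation instead of random noise.
model = AutoModelForCausalLM.from_pretrained("<SOURCE_MODEL_AND_TOKENIZER_DIR>")
new_tokenizer = AutoTokenizer.from_pretrained("<TARGET_TOKENIZER_DIR>")

old_vocab_size = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(new_tokenizer))

with torch.no_grad():
    for embedding in (model.get_input_embeddings(), model.get_output_embeddings()):
        mean_row = embedding.weight[:old_vocab_size].mean(dim=0)
        embedding.weight[old_vocab_size:] = mean_row

model.save_pretrained("<TARGET_MODEL_DIR>")
```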
@@ -362,18 +362,17 @@ Command to convert jsonl dataset to arrow format:
 python prepare_pretrain_dataset.py \
     --data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
     --tokenizer_dir "<TOKENIZER_DIR>" \
-    --data_cache_dir "jsonl_to_arrow_cache" \
-    --data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
-    --data_arrow_output_dir "spliced_tokenized_output_arrow" \
+    --data_output_dirs "spliced tokenized output" \
     --max_length 4096 \
     --num_spliced_dataset_bins 10
 ```
 Here are details about the CLI arguments:
 * Source data directory: `data_input_dirs`. Each `<JSONL_DIR>` can have multiple files in `jsonl` format.
 * Tokenizer directory: `tokenizer_dir`. Path to the tokenizer in Hugging Face format.
-* Data cache directory: `data_cache_dir`. Directory to store the Hugging Face data cache. By default a `cache` folder is created locally.
-* Output directory for jsonl format: `data_jsonl_output_dir`. Output directory to store the converted dataset in jsonl format.
-* Output directory for arrow format: `data_arrow_output_dir`. Output directory to store the converted dataset in arrow format, which can be used for training directly.
+* Data output directory: `data_output_dirs`. Directory to store the preprocessed output, including three sub-directories:
+  * `cache`: Directory to store the Hugging Face data cache.
+  * `jsonl`: Output directory to store the converted dataset in jsonl format.
+  * `arrow`: Output directory to store the converted dataset in arrow format, which can be used for training directly.
 * Max length: `max_length`. Max length of spliced samples. Default value is 4096.
 * Number of bins for each category: `num_spliced_dataset_bins`. Number of bins for each category, used for bucket-based training.
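To make the `max_length` / splicing behaviour concrete, here is a minimal Python sketch of the general packing idea, assuming a Hugging Face tokenizer; it is not the actual logic of `prepare_pretrain_dataset.py`.

```python
from transformers import AutoTokenizer

# Illustrative packing ("splicing") sketch, not the repository's implementation:
# tokenized documents are concatenated and cut into fixed-length chunks so that
# each training sample contains exactly MAX_LENGTH tokens.
tokenizer = AutoTokenizer.from_pretrained("<TOKENIZER_DIR>")
MAX_LENGTH = 4096

def splice(documents):
    buffer = []
    for doc in documents:
        buffer.extend(tokenizer.encode(doc) + [tokenizer.eos_token_id])
        while len(buffer) >= MAX_LENGTH:
            yield buffer[:MAX_LENGTH]      # one spliced sample
            buffer = buffer[MAX_LENGTH:]   # carry the remainder forward
```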
@@ -392,13 +391,15 @@ Command to convert jsonl dataset to arrow format is similar to the command in [3
 python prepare_sft_dataset.py \
     --data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
     --tokenizer_dir "<TOKENIZER_DIR>" \
-    --data_cache_dir "jsonl_to_arrow_cache" \
-    --data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
-    --data_arrow_output_dir "spliced_tokenized_output_arrow" \
+    --data_output_dirs "spliced tokenized output" \
     --max_length 4096 \
-    --num_spliced_dataset_bins 10
+    --num_spliced_dataset_bins 10 \
+    --llama_version 3
 ```
+Additional CLI arguments:
+* LLaMA version: `llama_version`. Specify the LLaMA version (e.g. 3 for LLaMA-3 models).
 #### 4. Command Line Arguments for Training
 ##### 4.1 Arguments for Pretraining
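For the supervised fine-tuning data consumed by `prepare_sft_dataset.py` above, each jsonl line holds one conversation. A hypothetical example record is sketched below; the field names (`messages`, `from`, `content`) are assumptions not confirmed by this diff, so verify them against the repository's data-format documentation.

```python
import json

# Hypothetical shape of a single SFT conversation record. The field names are
# assumptions for illustration only -- verify against the repository docs.
sample = {
    "messages": [
        {"from": "human", "content": "What is Colossal-LLaMA?"},
        {"from": "assistant", "content": "A continually pre-trained and fine-tuned LLaMA-based model series."},
    ]
}

# Append one JSON object per line to build a jsonl input file.
with open("sft_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```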