[Feature] Support LLaMA-3 CPT and ST (#5619)

* support LLaMA-3

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Run pre-commit

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Tong Li 2024-04-23 13:54:05 +08:00 committed by GitHub
parent e094933da1
commit 862fbaaa62
28 changed files with 89 additions and 87 deletions

View File: applications/Colossal-LLaMA-2/version.txt

@@ -1 +0,0 @@
-0.0.1

View File: applications/Colossal-LLaMA/README.md

@@ -1,6 +1,6 @@
 <div align="center">
 <h1>
-  <img src="https://github.com/hpcaitech/public_assets/blob/main/applications/colossal-llama-2/colossalllam2.jpg?raw=true" width=800/>
+  Colossal-LLaMA
 </h1>
 </div>
@@ -47,6 +47,7 @@
 - [Citations](#citations)
 ## News
+* [2024/4] Support continual pre-training and supervised fine-tuning of LLaMA-3.
 * [2024/01] [Construct Refined 13B Private Model With Just $5000 USD, Upgraded Colossal-AI Llama-2 Open Source](https://hpc-ai.com/blog/colossal-llama-2-13b).
 [[code]](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Colossal-LLaMA-2)
 [[blog]](https://hpc-ai.com/blog/colossal-llama-2-13b)
@@ -289,7 +290,7 @@ Here is details about CLI arguments:
 #### 1. Install required packages
 ```
-cd Colossal-LLaMA-2
+cd Colossal-LLaMA
 pip install -r requirements.txt
 ```
 #### 2. Install `xentropy`, `layer_norm` and `rotary`
@@ -314,7 +315,7 @@ Initialize new tokenizer with additional Chinese tokens. Additional Chinese toke
 Command to initialize new tokenizer:
 ```bash
 export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION='python'
-python colossal_llama2/tokenizer/init_tokenizer.py \
+python colossal_llama/tokenizer/init_tokenizer.py \
     --source_tokenizer_dir "<SOURCE_TOKENIZER_DIR>" \
     --target_tokenizer_dir "<TARGET_TOKENIZER_DIR>" \
     --expand_tokens_file "<NEW_TOKENS_FILE>.jsonl"
@@ -328,7 +329,7 @@ Here is details about CLI arguments:
 Initialize the new model checkpoint by calculating the mean values from the original model checkpoint.
 Command to initialize new model checkpoint:
 ```bash
-python colossal_llama2/model/init_model.py \
+python colossal_llama/model/init_model.py \
     --source_model_and_tokenizer_path "<SOURCE_MODEL_AND_TOKENIZER_DIR>" \
     --target_tokenizer_path "<TARGET_TOKENIZER_DIR>" \
     --target_model_path "<TARGET_MODEL_DIR>"
@@ -362,18 +363,17 @@ Command to convert jsonl dataset to arrow format:
 python prepare_pretrain_dataset.py \
     --data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
     --tokenizer_dir "<TOKENIZER_DIR>" \
-    --data_cache_dir "jsonl_to_arrow_cache" \
-    --data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
-    --data_arrow_output_dir "spliced_tokenized_output_arrow" \
+    --data_output_dirs "spliced tokenized output" \
     --max_length 4096 \
     --num_spliced_dataset_bins 10
 ```
 Here is details about CLI arguments:
 * Source data directory: `data_input_dirs`. Each `<JSONL_DIR>` can have multiple file in `jsonl` format.
 * Tokenizer directory: `tokenizer_dir`. Path to the tokenizer in Hugging Face format.
-* Data cache directory: `data_cache_dir`. Directory to store Hugging Face data cache. Default case will create `cache` folder locally.
-* Output directory for jsonl format: `data_jsonl_output_dir`. Output directory to store converted dataset in jsonl format.
-* Output directory for arrow format: `data_arrow_output_dir`. Output directory to store converted dataset in arrow format, which can be used for training directly.
+* Data output directory: `data_output_dirs`. Directory to store preprocessed output, including three sub-directories:
+  * `cache`: Directory to store Hugging Face data cache.
+  * `jsonl`: Output directory to store converted dataset in jsonl format.
+  * `arrow`: Output directory to store converted dataset in arrow format, which can be used for training directly.
 * Max length: `max_length`. Max length of spliced samples. Default value is 4096.
 * Number of bins for each category: `num_spliced_dataset_bins`. Number of bins for each category, used for bucket-based training.
@@ -392,13 +392,15 @@ Command to convert jsonl dataset to arrow format is similar to the command in [3
 python prepare_sft_dataset.py.py \
     --data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
     --tokenizer_dir "<TOKENIZER_DIR>" \
-    --data_cache_dir "jsonl_to_arrow_cache" \
-    --data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
-    --data_arrow_output_dir "spliced_tokenized_output_arrow" \
+    --data_output_dirs "spliced tokenized output" \
     --max_length 4096 \
-    --num_spliced_dataset_bins 10
+    --num_spliced_dataset_bins 10 \
+    --llama_version 3
 ```
+Additional CLI arguments:
+* LLaMA verison: `llama_version`. Specify the LLaMA version.
 #### 4. Command Line Arguments for Training
 ##### 4.1 Arguments for Pretraining
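The model-initialization step above ("calculating the mean values from the original model checkpoint") is the standard recipe for vocabulary expansion. Below is a minimal sketch of that idea, assuming mean-of-existing-rows initialization; `mean_init_rows` is a hypothetical helper, not the repo's `init_model.py`:

```python
# Hypothetical sketch of mean-value initialization for an expanded vocabulary;
# the real logic lives in colossal_llama/model/init_model.py and may differ.
import torch

def mean_init_rows(old_weight: torch.Tensor, new_vocab_size: int) -> torch.Tensor:
    """Copy existing embedding rows and fill the new rows with their mean."""
    old_vocab_size, hidden_dim = old_weight.shape
    new_weight = old_weight.new_empty((new_vocab_size, hidden_dim))
    new_weight[:old_vocab_size] = old_weight
    new_weight[old_vocab_size:] = old_weight.mean(dim=0, keepdim=True)
    return new_weight

# Toy example: grow a 4-row embedding table to 6 rows.
print(mean_init_rows(torch.randn(4, 8), 6).shape)  # torch.Size([6, 8])
```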

View File: applications/Colossal-LLaMA/colossal_llama/dataset/conversation.py

@@ -83,7 +83,7 @@ class Conversation:
     }
-conv = Conversation(
+LLaMA2_Conv = Conversation(
     system="A chat between a curious human and an artificial intelligence assistant. "
     "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
     roles=("Human", "Assistant"),
@@ -93,4 +93,14 @@ conv = Conversation(
     seps=["<s>", "</s>"],
 )
-default_conversation = conv
+LLaMA3_Conv = Conversation(
+    system="A chat between a curious human and an artificial intelligence assistant. "
+    "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
+    roles=("Human", "Assistant"),
+    messages=[],
+    offset=0,
+    sep_style=SeparatorStyle.ADD_BOS_EOS_TOKEN,
+    seps=["<|begin_of_text|>", "<|end_of_text|>"],
+)
+
+default_conversation = LLaMA3_Conv
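To see what the new `LLaMA3_Conv` template changes in practice, here is a rough, hypothetical rendering helper; the real formatting is done by the `Conversation` class in this file and may differ. Only the `seps` swap to `<|begin_of_text|>` / `<|end_of_text|>` comes from the diff:

```python
# Illustrative only: how a seps-based template might assemble a prompt string.
def render(system, messages, seps):
    prompt = seps[0] + system  # open with the BOS-style separator
    for role, text in messages:
        prompt += f"{role}: {text}{seps[1]}"  # close each turn with the EOS-style separator
    return prompt

print(render(
    "A chat between a curious human and an artificial intelligence assistant.\n\n",
    [("Human", "Hello!"), ("Assistant", "Hi, how can I help?")],
    ["<|begin_of_text|>", "<|end_of_text|>"],
))
```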

View File: applications/Colossal-LLaMA/colossal_llama/dataset/spliced_and_tokenized_dataset.py

@@ -12,6 +12,7 @@ from typing import Any, Callable, Dict, Iterable, List, Tuple, Union
 from datasets import dataset_dict
 from torch.utils.data import ConcatDataset, Dataset, IterableDataset
+from transformers import AutoTokenizer
 from transformers.models.llama.tokenization_llama import LlamaTokenizer
 from transformers.tokenization_utils import PreTrainedTokenizer
@@ -71,7 +72,7 @@ def supervised_tokenize_pretrain(
 def supervised_tokenize_sft(
     data_point: Dict[str, str],
-    tokenizer: LlamaTokenizer,
+    tokenizer: AutoTokenizer,
     conversation_template: Conversation = default_conversation,
     ignore_index: int = None,
     max_length: int = 4096,
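The `LlamaTokenizer` to `AutoTokenizer` change matters because LLaMA-3 dropped the sentencepiece tokenizer that `LlamaTokenizer` wraps; `AutoTokenizer` dispatches to the right concrete class for either family. A quick check, with a placeholder path:

```python
# AutoTokenizer picks the concrete tokenizer class from the checkpoint config.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<TOKENIZER_DIR>")  # placeholder path
print(type(tokenizer).__name__)  # e.g. LlamaTokenizerFast (LLaMA-2) or PreTrainedTokenizerFast (LLaMA-3)
```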

View File: applications/Colossal-LLaMA/inference_example.py

@@ -1,7 +1,7 @@
 import argparse
 import torch
-from colossal_llama2.dataset.conversation import default_conversation
+from colossal_llama.dataset.conversation import default_conversation
 from transformers import AutoModelForCausalLM, AutoTokenizer
 from colossalai.logging import get_dist_logger

View File: applications/Colossal-LLaMA/prepare_pretrain_dataset.py

@@ -11,12 +11,12 @@ import os
 import time
 from multiprocessing import cpu_count
-from colossal_llama2.dataset.spliced_and_tokenized_dataset import (
+from colossal_llama.dataset.spliced_and_tokenized_dataset import (
     ClosedToConstantLengthSplicedDataset,
     supervised_tokenize_pretrain,
 )
 from datasets import dataset_dict, load_dataset
-from transformers.models.llama.tokenization_llama import LlamaTokenizer
+from transformers import AutoTokenizer
 from colossalai.logging import get_dist_logger
@@ -35,34 +35,23 @@ def main():
     parser.add_argument(
         "--tokenizer_dir", type=str, required=True, default=None, help="A directory containing the tokenizer"
     )
-    parser.add_argument("--data_cache_dir", type=str, default="cache", help="Data cache directory")
-    parser.add_argument(
-        "--data_jsonl_output_dir",
-        type=str,
-        default="jsonl_output",
-        help="Output directory of spliced dataset with jsonl format",
-    )
-    parser.add_argument(
-        "--data_arrow_output_dir",
-        type=str,
-        default="arrow_output",
-        help="Output directory of spliced dataset with arrow format",
-    )
-    parser.add_argument("--max_length", type=int, default=4096, help="Max length of each spliced tokenized sequence")
+    parser.add_argument("--data_output_dirs", type=str, default="data_output_dirs", help="Data output directory")
+    parser.add_argument("--max_length", type=int, default=8192, help="Max length of each spliced tokenized sequence")
     parser.add_argument("--num_spliced_dataset_bins", type=int, default=10, help="Number of spliced dataset bins")
     args = parser.parse_args()
     if args.num_spliced_dataset_bins >= 100000:
         raise ValueError("Too many spliced divisions, must be smaller than 100000")
-    assert not os.path.exists(args.data_cache_dir), f"Find existed data cache dir {args.data_cache_dir}"
-    assert not os.path.exists(
-        args.data_jsonl_output_dir
-    ), f"Find existed jsonl data output dir {args.data_jsonl_output_dir}"
-    assert not os.path.exists(
-        args.data_arrow_output_dir
-    ), f"Find existed arrow data output dir {args.data_arrow_output_dir}"
-    os.makedirs(args.data_jsonl_output_dir)
-    os.makedirs(args.data_arrow_output_dir)
+    args.data_cache_dir = os.path.join(args.data_output_dirs, "cache")
+    args.data_jsonl_output_dir = os.path.join(args.data_output_dirs, "jsonl")
+    args.data_arrow_output_dir = os.path.join(args.data_output_dirs, "arrow")
+    if not os.path.exists(args.data_cache_dir):
+        os.makedirs(args.data_cache_dir)
+    if not os.path.exists(args.data_jsonl_output_dir):
+        os.makedirs(args.data_jsonl_output_dir)
+    if not os.path.exists(args.data_arrow_output_dir):
+        os.makedirs(args.data_arrow_output_dir)
     # Prepare to all input datasets
@@ -86,7 +75,7 @@ def main():
         train_splits.append(f"train[{start}%:{end}%]")
     # Prepare to the tokenizer.
-    tokenizer = LlamaTokenizer.from_pretrained(args.tokenizer_dir)
+    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_dir)
     tokenizer.add_bos_token = False
     tokenizer.add_eos_token = False
     if tokenizer.pad_token is None:
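The net effect of this hunk: three must-not-exist output flags collapse into one `--data_output_dirs` root with fixed `cache`/`jsonl`/`arrow` sub-directories created on demand. A standalone restatement of that layout logic, using `exist_ok=True` as an equivalent to the explicit existence checks above:

```python
# Condensed restatement of the new output-directory handling.
import os

def prepare_output_dirs(data_output_dirs):
    dirs = {sub: os.path.join(data_output_dirs, sub) for sub in ("cache", "jsonl", "arrow")}
    for path in dirs.values():
        os.makedirs(path, exist_ok=True)  # idempotent, unlike the old asserts
    return dirs

print(prepare_output_dirs("spliced tokenized output"))
```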

View File: applications/Colossal-LLaMA/prepare_sft_dataset.py

@@ -10,10 +10,10 @@ import math
 import os
 from multiprocessing import cpu_count
-from colossal_llama2.dataset.conversation import default_conversation
-from colossal_llama2.dataset.spliced_and_tokenized_dataset import supervised_tokenize_sft
+from colossal_llama.dataset.conversation import default_conversation
+from colossal_llama.dataset.spliced_and_tokenized_dataset import supervised_tokenize_sft
 from datasets import dataset_dict, load_dataset
-from transformers.models.llama.tokenization_llama import LlamaTokenizer
+from transformers import AddedToken, AutoTokenizer
 from colossalai.logging import get_dist_logger
@@ -32,34 +32,24 @@ def main():
     parser.add_argument(
         "--tokenizer_dir", type=str, required=True, default=None, help="A directory containing the tokenizer"
     )
-    parser.add_argument("--data_cache_dir", type=str, default="cache", help="Data cache directory")
-    parser.add_argument(
-        "--data_jsonl_output_dir",
-        type=str,
-        default="jsonl_output",
-        help="Output directory of spliced dataset with jsonl format",
-    )
-    parser.add_argument(
-        "--data_arrow_output_dir",
-        type=str,
-        default="arrow_output",
-        help="Output directory of spliced dataset with arrow format",
-    )
-    parser.add_argument("--max_length", type=int, default=4096, help="Max length of each spliced tokenized sequence")
+    parser.add_argument("--data_output_dirs", type=str, default="data_output_dirs", help="Data output directory")
+    parser.add_argument("--max_length", type=int, default=8192, help="Max length of each spliced tokenized sequence")
     parser.add_argument("--num_spliced_dataset_bins", type=int, default=10, help="Number of spliced dataset bins")
+    parser.add_argument("--llama_version", type=int, default=3, help="LLaMA version")
     args = parser.parse_args()
     if args.num_spliced_dataset_bins >= 100000:
         raise ValueError("Too many spliced divisions, must be smaller than 100000")
-    assert not os.path.exists(args.data_cache_dir), f"Find existed data cache dir {args.data_cache_dir}"
-    assert not os.path.exists(
-        args.data_jsonl_output_dir
-    ), f"Find existed jsonl data output dir {args.data_jsonl_output_dir}"
-    assert not os.path.exists(
-        args.data_arrow_output_dir
-    ), f"Find existed arrow data output dir {args.data_arrow_output_dir}"
-    os.makedirs(args.data_jsonl_output_dir)
-    os.makedirs(args.data_arrow_output_dir)
+    args.data_cache_dir = os.path.join(args.data_output_dirs, "cache")
+    args.data_jsonl_output_dir = os.path.join(args.data_output_dirs, "jsonl")
+    args.data_arrow_output_dir = os.path.join(args.data_output_dirs, "arrow")
+    if not os.path.exists(args.data_cache_dir):
+        os.makedirs(args.data_cache_dir)
+    if not os.path.exists(args.data_jsonl_output_dir):
+        os.makedirs(args.data_jsonl_output_dir)
+    if not os.path.exists(args.data_arrow_output_dir):
+        os.makedirs(args.data_arrow_output_dir)
     # Prepare to all input datasets
@@ -83,11 +73,20 @@ def main():
         train_splits.append(f"train[{start}%:{end}%]")
     # Prepare to the tokenizer.
-    tokenizer = LlamaTokenizer.from_pretrained(args.tokenizer_dir)
+    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_dir)
+    # Fix </s> split issue: https://github.com/huggingface/transformers/issues/23833
+    if args.llama_version == 2:
+        tokenizer.add_tokens(AddedToken("</s>", normalized=False, special=True), special_tokens=True)
     tokenizer.add_bos_token = False
     tokenizer.add_eos_token = False
     if tokenizer.pad_token is None:
-        tokenizer.pad_token = tokenizer.unk_token
+        if tokenizer.unk_token is not None:
+            tokenizer.pad_token = tokenizer.unk_token
+        else:
+            tokenizer.pad_token = tokenizer.eos_token
+    tokenizer.unk_token = tokenizer.eos_token
     list_dataset = load_dataset(
         path="json",
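The `--llama_version 2` guard above works around huggingface/transformers#23833, where LLaMA-2's sentencepiece tokenizer splits a literal `</s>` in input text instead of mapping it to the EOS id. A standalone restatement, with a placeholder path:

```python
# Register "</s>" as a special AddedToken so it encodes to the single EOS id.
# LLaMA-3 is unaffected: it uses <|begin_of_text|> / <|end_of_text|> instead.
from transformers import AddedToken, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<LLAMA2_TOKENIZER_DIR>")  # placeholder path
tokenizer.add_tokens(AddedToken("</s>", normalized=False, special=True), special_tokens=True)
print(tokenizer.encode("</s>", add_special_tokens=False))  # expect a single EOS id after the fix
```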

View File: applications/Colossal-LLaMA/requirements.txt

@@ -1,9 +1,10 @@
-torch<2.0.0, >=1.12.1
-packaging==23.1
-colossalai==0.3.5
+torch==2.1.2
+huggingface-hub
+packaging==24.0
+colossalai==0.3.6
 autoflake==2.2.1
 black==23.9.1
-transformers==4.33.3
+transformers==4.34.1
 tensorboard==2.14.0
 six==1.16.0
 datasets

View File: applications/Colossal-LLaMA/stream_chat_example.py

@@ -1,6 +1,6 @@
 import argparse
-from colossal_llama2.utils.stream_chat_patch import streaming_chat
+from colossal_llama.utils.stream_chat_patch import streaming_chat
 from transformers import AutoModelForCausalLM, AutoTokenizer
 SYSTEM = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions."

View File: applications/Colossal-LLaMA/train_sft.py

@@ -12,18 +12,18 @@ from contextlib import nullcontext
 import torch
 import torch.distributed as dist
-from colossal_llama2.dataset.loader import (
+from colossal_llama.dataset.loader import (
     DataCollatorForSupervisedDataset,
     StatefulDistributedSampler,
     load_tokenized_dataset,
 )
-from colossal_llama2.utils.ckpt_io import load_checkpoint, save_checkpoint
-from colossal_llama2.utils.flash_attention_patch import replace_with_flash_attention
-from colossal_llama2.utils.froze import freeze_non_embeds_parameters
-from colossal_llama2.utils.neftune_patch import activate_neftune, deactivate_neftune
+from colossal_llama.utils.ckpt_io import load_checkpoint, save_checkpoint
+from colossal_llama.utils.flash_attention_patch import replace_with_flash_attention
+from colossal_llama.utils.froze import freeze_non_embeds_parameters
+from colossal_llama.utils.neftune_patch import activate_neftune, deactivate_neftune
 from torch.utils.tensorboard import SummaryWriter
 from tqdm import tqdm
-from transformers import LlamaForCausalLM, LlamaTokenizer
+from transformers import AutoTokenizer, LlamaForCausalLM
 import colossalai
 from colossalai.accelerator import get_accelerator
@@ -89,7 +89,7 @@ def main() -> None:
     parser.add_argument("--accumulation_steps", type=int, default=1, help="Number of accumulation steps")
     parser.add_argument("--micro_batch_size", type=int, default=2, help="Batch size of each process")
     parser.add_argument("--lr", type=float, default=3e-4, help="Learning rate")
-    parser.add_argument("--max_length", type=int, default=4096, help="Model max length")
+    parser.add_argument("--max_length", type=int, default=8192, help="Model max length")
     parser.add_argument(
         "--mixed_precision",
         type=str,
@@ -196,7 +196,7 @@ def main() -> None:
     # ======================================================
     # Initialize Tokenizer, Dataset, Collator and Dataloader
     # ======================================================
-    tokenizer = LlamaTokenizer.from_pretrained(args.pretrained)
+    tokenizer = AutoTokenizer.from_pretrained(args.pretrained)
     if args.pad_token == "eos":
         tokenizer.pad_token = tokenizer.eos_token
     elif args.pad_token == "unk":
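With `AutoTokenizer` in place, the existing `--pad_token` flag keeps working for both families, with one caveat: LLaMA-3 tokenizers define no `unk_token`, so `--pad_token unk` would leave `pad_token` as `None` there. A small sketch of the selection logic, with a placeholder path:

```python
# Sketch of the pad-token selection driven by --pad_token.
from transformers import AutoTokenizer

pad_token_choice = "eos"  # mirrors args.pad_token ("eos" or "unk")
tokenizer = AutoTokenizer.from_pretrained("<PRETRAINED_DIR>")  # placeholder path
if pad_token_choice == "eos":
    tokenizer.pad_token = tokenizer.eos_token
elif pad_token_choice == "unk":
    tokenizer.pad_token = tokenizer.unk_token  # None for LLaMA-3 tokenizers
```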

View File: applications/Colossal-LLaMA/version.txt

@@ -0,0 +1 @@
+1.0.0

View File: applications/README.md

@@ -5,7 +5,7 @@ This directory contains the applications that are powered by Colossal-AI.
 The list of applications include:
 - [X] [Open-Sora](https://github.com/hpcaitech/Open-Sora): Revealing Complete Model Parameters, Training Details, and Everything for Sora-like Video Generation Models
-- [X] [Colossal-LLaMA-2](./Colossal-LLaMA-2/): Continual Pre-training of LLaMA-2.
+- [X] [Colossal-LLaMA](./Colossal-LLaMA/): Continual Pre-training and Supervisied Fine-tuning of LLaMA2 / LLaMA3.
 - [X] [ColossalEval](./ColossalEval): Evaluation Pipeline for LLMs.
 - [X] [ColossalChat](./Chat/README.md): Replication of ChatGPT with RLHF.
 - [X] [FastFold](https://github.com/hpcaitech/FastFold): Optimizing AlphaFold (Biomedicine) Training and Inference on GPU Clusters.