mirror of https://github.com/hpcaitech/ColossalAI
[Feature] Support LLaMA-3 CPT and SFT (#5619)
* support LLaMA-3
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Run pre-commit

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
parent e094933da1
commit 862fbaaa62
@@ -1 +0,0 @@
-0.0.1
@@ -1,6 +1,6 @@
 <div align="center">
 <h1>
-<img src="https://github.com/hpcaitech/public_assets/blob/main/applications/colossal-llama-2/colossalllam2.jpg?raw=true" width=800/>
+Colossal-LLaMA
 </h1>
 </div>
 
@@ -47,6 +47,7 @@
 - [Citations](#citations)
 
 ## News
+* [2024/4] Support continual pre-training and supervised fine-tuning of LLaMA-3.
 * [2024/01] [Construct Refined 13B Private Model With Just $5000 USD, Upgraded Colossal-AI Llama-2 Open Source](https://hpc-ai.com/blog/colossal-llama-2-13b).
 [[code]](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Colossal-LLaMA-2)
 [[blog]](https://hpc-ai.com/blog/colossal-llama-2-13b)
@@ -289,7 +290,7 @@ Here is details about CLI arguments:
 
 #### 1. Install required packages
 ```
-cd Colossal-LLaMA-2
+cd Colossal-LLaMA
 pip install -r requirements.txt
 ```
 #### 2. Install `xentropy`, `layer_norm` and `rotary`
@@ -314,7 +315,7 @@ Initialize new tokenizer with additional Chinese tokens. Additional Chinese toke
 Command to initialize new tokenizer:
 ```bash
 export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION='python'
-python colossal_llama2/tokenizer/init_tokenizer.py \
+python colossal_llama/tokenizer/init_tokenizer.py \
     --source_tokenizer_dir "<SOURCE_TOKENIZER_DIR>" \
     --target_tokenizer_dir "<TARGET_TOKENIZER_DIR>" \
     --expand_tokens_file "<NEW_TOKENS_FILE>.jsonl"
@@ -328,7 +329,7 @@ Here is details about CLI arguments:
 Initialize the new model checkpoint by calculating the mean values from the original model checkpoint.
 Command to initialize new model checkpoint:
 ```bash
-python colossal_llama2/model/init_model.py \
+python colossal_llama/model/init_model.py \
     --source_model_and_tokenizer_path "<SOURCE_MODEL_AND_TOKENIZER_DIR>" \
     --target_tokenizer_path "<TARGET_TOKENIZER_DIR>" \
     --target_model_path "<TARGET_MODEL_DIR>"
@@ -362,18 +363,17 @@ Command to convert jsonl dataset to arrow format:
 python prepare_pretrain_dataset.py \
     --data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
     --tokenizer_dir "<TOKENIZER_DIR>" \
-    --data_cache_dir "jsonl_to_arrow_cache" \
-    --data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
-    --data_arrow_output_dir "spliced_tokenized_output_arrow" \
+    --data_output_dirs "spliced tokenized output" \
     --max_length 4096 \
     --num_spliced_dataset_bins 10
 ```
 Here is details about CLI arguments:
 * Source data directory: `data_input_dirs`. Each `<JSONL_DIR>` can have multiple file in `jsonl` format.
 * Tokenizer directory: `tokenizer_dir`. Path to the tokenizer in Hugging Face format.
-* Data cache directory: `data_cache_dir`. Directory to store Hugging Face data cache. Default case will create `cache` folder locally.
-* Output directory for jsonl format: `data_jsonl_output_dir`. Output directory to store converted dataset in jsonl format.
-* Output directory for arrow format: `data_arrow_output_dir`. Output directory to store converted dataset in arrow format, which can be used for training directly.
+* Data output directory: `data_output_dirs`. Directory to store preprocessed output, including three sub-directories:
+  * `cache`: Directory to store Hugging Face data cache.
+  * `jsonl`: Output directory to store converted dataset in jsonl format.
+  * `arrow`: Output directory to store converted dataset in arrow format, which can be used for training directly.
 * Max length: `max_length`. Max length of spliced samples. Default value is 4096.
 * Number of bins for each category: `num_spliced_dataset_bins`. Number of bins for each category, used for bucket-based training.
 
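The three per-format flags above collapse into a single `--data_output_dirs` root that always holds `cache`, `jsonl` and `arrow` sub-directories. A minimal Python sketch of that mapping, mirroring the `os.path.join` logic in the `prepare_pretrain_dataset.py` hunk further down this diff (the root path here is only a placeholder):

```python
import os

# Placeholder root, i.e. the value that would be passed via --data_output_dirs.
data_output_dirs = "spliced_tokenized_output"

# The scripts derive the three fixed output locations from this single root.
data_cache_dir = os.path.join(data_output_dirs, "cache")         # Hugging Face datasets cache
data_jsonl_output_dir = os.path.join(data_output_dirs, "jsonl")  # spliced samples in jsonl format
data_arrow_output_dir = os.path.join(data_output_dirs, "arrow")  # arrow output used for training

for path in (data_cache_dir, data_jsonl_output_dir, data_arrow_output_dir):
    if not os.path.exists(path):
        os.makedirs(path)
```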
@@ -392,13 +392,15 @@ Command to convert jsonl dataset to arrow format is similar to the command in [3
 python prepare_sft_dataset.py.py \
     --data_input_dirs "<JSONL_DIR_1>,<JSONL_DIR_2>,<JSONL_DIR_3>" \
     --tokenizer_dir "<TOKENIZER_DIR>" \
-    --data_cache_dir "jsonl_to_arrow_cache" \
-    --data_jsonl_output_dir "spliced_tokenized_output_jsonl" \
-    --data_arrow_output_dir "spliced_tokenized_output_arrow" \
+    --data_output_dirs "spliced tokenized output" \
     --max_length 4096 \
-    --num_spliced_dataset_bins 10
+    --num_spliced_dataset_bins 10 \
+    --llama_version 3
 ```
 
+Additional CLI arguments:
+* LLaMA verison: `llama_version`. Specify the LLaMA version.
+
 #### 4. Command Line Arguments for Training
 
 ##### 4.1 Arguments for Pretraining
@@ -83,7 +83,7 @@ class Conversation:
 }
 
 
-conv = Conversation(
+LLaMA2_Conv = Conversation(
     system="A chat between a curious human and an artificial intelligence assistant. "
     "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
     roles=("Human", "Assistant"),
@@ -93,4 +93,14 @@ conv = Conversation(
     seps=["<s>", "</s>"],
 )
 
-default_conversation = conv
+LLaMA3_Conv = Conversation(
+    system="A chat between a curious human and an artificial intelligence assistant. "
+    "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
+    roles=("Human", "Assistant"),
+    messages=[],
+    offset=0,
+    sep_style=SeparatorStyle.ADD_BOS_EOS_TOKEN,
+    seps=["<|begin_of_text|>", "<|end_of_text|>"],
+)
+
+default_conversation = LLaMA3_Conv
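The hunk above keeps the LLaMA-2 template as `LLaMA2_Conv`, adds a `LLaMA3_Conv` that swaps the `<s>`/`</s>` separators for `<|begin_of_text|>`/`<|end_of_text|>`, and points `default_conversation` at the LLaMA-3 template. As an illustrative sketch only (the commit itself just changes the module-level default; a `get_conversation_template` helper like this is not part of the diff), the template could be chosen from the `--llama_version` flag introduced later in this diff:

```python
from colossal_llama.dataset.conversation import Conversation, LLaMA2_Conv, LLaMA3_Conv


def get_conversation_template(llama_version: int) -> Conversation:
    """Hypothetical helper: pick the prompt template matching --llama_version."""
    if llama_version == 2:
        return LLaMA2_Conv  # seps: ["<s>", "</s>"]
    return LLaMA3_Conv  # seps: ["<|begin_of_text|>", "<|end_of_text|>"]


conversation_template = get_conversation_template(llama_version=3)
print(conversation_template.seps)  # ['<|begin_of_text|>', '<|end_of_text|>']
```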
@@ -12,6 +12,7 @@ from typing import Any, Callable, Dict, Iterable, List, Tuple, Union
 
 from datasets import dataset_dict
 from torch.utils.data import ConcatDataset, Dataset, IterableDataset
+from transformers import AutoTokenizer
 from transformers.models.llama.tokenization_llama import LlamaTokenizer
 from transformers.tokenization_utils import PreTrainedTokenizer
 
@@ -71,7 +72,7 @@ def supervised_tokenize_pretrain(
 
 def supervised_tokenize_sft(
     data_point: Dict[str, str],
-    tokenizer: LlamaTokenizer,
+    tokenizer: AutoTokenizer,
     conversation_template: Conversation = default_conversation,
     ignore_index: int = None,
     max_length: int = 4096,
@@ -1,7 +1,7 @@
 import argparse
 
 import torch
-from colossal_llama2.dataset.conversation import default_conversation
+from colossal_llama.dataset.conversation import default_conversation
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
 from colossalai.logging import get_dist_logger
@@ -11,12 +11,12 @@ import os
 import time
 from multiprocessing import cpu_count
 
-from colossal_llama2.dataset.spliced_and_tokenized_dataset import (
+from colossal_llama.dataset.spliced_and_tokenized_dataset import (
     ClosedToConstantLengthSplicedDataset,
     supervised_tokenize_pretrain,
 )
 from datasets import dataset_dict, load_dataset
-from transformers.models.llama.tokenization_llama import LlamaTokenizer
+from transformers import AutoTokenizer
 
 from colossalai.logging import get_dist_logger
 
@@ -35,34 +35,23 @@ def main():
     parser.add_argument(
         "--tokenizer_dir", type=str, required=True, default=None, help="A directory containing the tokenizer"
     )
-    parser.add_argument("--data_cache_dir", type=str, default="cache", help="Data cache directory")
-    parser.add_argument(
-        "--data_jsonl_output_dir",
-        type=str,
-        default="jsonl_output",
-        help="Output directory of spliced dataset with jsonl format",
-    )
-    parser.add_argument(
-        "--data_arrow_output_dir",
-        type=str,
-        default="arrow_output",
-        help="Output directory of spliced dataset with arrow format",
-    )
-    parser.add_argument("--max_length", type=int, default=4096, help="Max length of each spliced tokenized sequence")
+    parser.add_argument("--data_output_dirs", type=str, default="data_output_dirs", help="Data output directory")
+    parser.add_argument("--max_length", type=int, default=8192, help="Max length of each spliced tokenized sequence")
     parser.add_argument("--num_spliced_dataset_bins", type=int, default=10, help="Number of spliced dataset bins")
     args = parser.parse_args()
 
     if args.num_spliced_dataset_bins >= 100000:
         raise ValueError("Too many spliced divisions, must be smaller than 100000")
 
-    assert not os.path.exists(args.data_cache_dir), f"Find existed data cache dir {args.data_cache_dir}"
-    assert not os.path.exists(
-        args.data_jsonl_output_dir
-    ), f"Find existed jsonl data output dir {args.data_jsonl_output_dir}"
-    assert not os.path.exists(
-        args.data_arrow_output_dir
-    ), f"Find existed arrow data output dir {args.data_arrow_output_dir}"
-    os.makedirs(args.data_jsonl_output_dir)
-    os.makedirs(args.data_arrow_output_dir)
+    args.data_cache_dir = os.path.join(args.data_output_dirs, "cache")
+    args.data_jsonl_output_dir = os.path.join(args.data_output_dirs, "jsonl")
+    args.data_arrow_output_dir = os.path.join(args.data_output_dirs, "arrow")
+
+    if not os.path.exists(args.data_cache_dir):
+        os.makedirs(args.data_cache_dir)
+    if not os.path.exists(args.data_jsonl_output_dir):
+        os.makedirs(args.data_jsonl_output_dir)
+    if not os.path.exists(args.data_arrow_output_dir):
+        os.makedirs(args.data_arrow_output_dir)
 
     # Prepare to all input datasets
@@ -86,7 +75,7 @@ def main():
         train_splits.append(f"train[{start}%:{end}%]")
 
     # Prepare to the tokenizer.
-    tokenizer = LlamaTokenizer.from_pretrained(args.tokenizer_dir)
+    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_dir)
     tokenizer.add_bos_token = False
     tokenizer.add_eos_token = False
     if tokenizer.pad_token is None:
@@ -10,10 +10,10 @@ import math
 import os
 from multiprocessing import cpu_count
 
-from colossal_llama2.dataset.conversation import default_conversation
-from colossal_llama2.dataset.spliced_and_tokenized_dataset import supervised_tokenize_sft
+from colossal_llama.dataset.conversation import default_conversation
+from colossal_llama.dataset.spliced_and_tokenized_dataset import supervised_tokenize_sft
 from datasets import dataset_dict, load_dataset
-from transformers.models.llama.tokenization_llama import LlamaTokenizer
+from transformers import AddedToken, AutoTokenizer
 
 from colossalai.logging import get_dist_logger
 
@@ -32,34 +32,24 @@ def main():
     parser.add_argument(
         "--tokenizer_dir", type=str, required=True, default=None, help="A directory containing the tokenizer"
     )
-    parser.add_argument("--data_cache_dir", type=str, default="cache", help="Data cache directory")
-    parser.add_argument(
-        "--data_jsonl_output_dir",
-        type=str,
-        default="jsonl_output",
-        help="Output directory of spliced dataset with jsonl format",
-    )
-    parser.add_argument(
-        "--data_arrow_output_dir",
-        type=str,
-        default="arrow_output",
-        help="Output directory of spliced dataset with arrow format",
-    )
-    parser.add_argument("--max_length", type=int, default=4096, help="Max length of each spliced tokenized sequence")
+    parser.add_argument("--data_output_dirs", type=str, default="data_output_dirs", help="Data output directory")
+    parser.add_argument("--max_length", type=int, default=8192, help="Max length of each spliced tokenized sequence")
     parser.add_argument("--num_spliced_dataset_bins", type=int, default=10, help="Number of spliced dataset bins")
+    parser.add_argument("--llama_version", type=int, default=3, help="LLaMA version")
     args = parser.parse_args()
 
     if args.num_spliced_dataset_bins >= 100000:
         raise ValueError("Too many spliced divisions, must be smaller than 100000")
 
-    assert not os.path.exists(args.data_cache_dir), f"Find existed data cache dir {args.data_cache_dir}"
-    assert not os.path.exists(
-        args.data_jsonl_output_dir
-    ), f"Find existed jsonl data output dir {args.data_jsonl_output_dir}"
-    assert not os.path.exists(
-        args.data_arrow_output_dir
-    ), f"Find existed arrow data output dir {args.data_arrow_output_dir}"
-    os.makedirs(args.data_jsonl_output_dir)
-    os.makedirs(args.data_arrow_output_dir)
+    args.data_cache_dir = os.path.join(args.data_output_dirs, "cache")
+    args.data_jsonl_output_dir = os.path.join(args.data_output_dirs, "jsonl")
+    args.data_arrow_output_dir = os.path.join(args.data_output_dirs, "arrow")
+
+    if not os.path.exists(args.data_cache_dir):
+        os.makedirs(args.data_cache_dir)
+    if not os.path.exists(args.data_jsonl_output_dir):
+        os.makedirs(args.data_jsonl_output_dir)
+    if not os.path.exists(args.data_arrow_output_dir):
+        os.makedirs(args.data_arrow_output_dir)
 
     # Prepare to all input datasets
@@ -83,11 +73,20 @@ def main():
         train_splits.append(f"train[{start}%:{end}%]")
 
     # Prepare to the tokenizer.
-    tokenizer = LlamaTokenizer.from_pretrained(args.tokenizer_dir)
+    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_dir)
+
+    # Fix </s> split issue: https://github.com/huggingface/transformers/issues/23833
+    if args.llama_version == 2:
+        tokenizer.add_tokens(AddedToken("</s>", normalized=False, special=True), special_tokens=True)
+
     tokenizer.add_bos_token = False
     tokenizer.add_eos_token = False
     if tokenizer.pad_token is None:
-        tokenizer.pad_token = tokenizer.unk_token
+        if tokenizer.unk_token is not None:
+            tokenizer.pad_token = tokenizer.unk_token
+        else:
+            tokenizer.pad_token = tokenizer.eos_token
+            tokenizer.unk_token = tokenizer.eos_token
 
     list_dataset = load_dataset(
         path="json",
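With `AutoTokenizer` the script no longer assumes a sentencepiece `LlamaTokenizer`: the `</s>` `AddedToken` workaround is applied only when `--llama_version 2`, and the pad token falls back to `eos` when neither `pad` nor `unk` is defined, which is typically the case for LLaMA-3 tokenizers. A standalone sketch of the same fallback, assuming `<TOKENIZER_DIR>` is a local Hugging Face checkpoint directory:

```python
from transformers import AutoTokenizer

# "<TOKENIZER_DIR>" is a placeholder for a local Hugging Face tokenizer directory.
tokenizer = AutoTokenizer.from_pretrained("<TOKENIZER_DIR>")

# Same fallback as the hunk above: prefer unk as the pad token, otherwise reuse eos.
if tokenizer.pad_token is None:
    if tokenizer.unk_token is not None:
        tokenizer.pad_token = tokenizer.unk_token
    else:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.unk_token = tokenizer.eos_token

print(tokenizer.pad_token)
```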
@@ -1,9 +1,10 @@
-torch<2.0.0, >=1.12.1
-packaging==23.1
-colossalai==0.3.5
+torch==2.1.2
+huggingface-hub
+packaging==24.0
+colossalai==0.3.6
 autoflake==2.2.1
 black==23.9.1
-transformers==4.33.3
+transformers==4.34.1
 tensorboard==2.14.0
 six==1.16.0
 datasets
@@ -1,6 +1,6 @@
 import argparse
 
-from colossal_llama2.utils.stream_chat_patch import streaming_chat
+from colossal_llama.utils.stream_chat_patch import streaming_chat
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
 SYSTEM = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions."
@@ -12,18 +12,18 @@ from contextlib import nullcontext
 
 import torch
 import torch.distributed as dist
-from colossal_llama2.dataset.loader import (
+from colossal_llama.dataset.loader import (
     DataCollatorForSupervisedDataset,
     StatefulDistributedSampler,
     load_tokenized_dataset,
 )
-from colossal_llama2.utils.ckpt_io import load_checkpoint, save_checkpoint
-from colossal_llama2.utils.flash_attention_patch import replace_with_flash_attention
-from colossal_llama2.utils.froze import freeze_non_embeds_parameters
-from colossal_llama2.utils.neftune_patch import activate_neftune, deactivate_neftune
+from colossal_llama.utils.ckpt_io import load_checkpoint, save_checkpoint
+from colossal_llama.utils.flash_attention_patch import replace_with_flash_attention
+from colossal_llama.utils.froze import freeze_non_embeds_parameters
+from colossal_llama.utils.neftune_patch import activate_neftune, deactivate_neftune
 from torch.utils.tensorboard import SummaryWriter
 from tqdm import tqdm
-from transformers import LlamaForCausalLM, LlamaTokenizer
+from transformers import AutoTokenizer, LlamaForCausalLM
 
 import colossalai
 from colossalai.accelerator import get_accelerator
@@ -89,7 +89,7 @@ def main() -> None:
     parser.add_argument("--accumulation_steps", type=int, default=1, help="Number of accumulation steps")
     parser.add_argument("--micro_batch_size", type=int, default=2, help="Batch size of each process")
    parser.add_argument("--lr", type=float, default=3e-4, help="Learning rate")
-    parser.add_argument("--max_length", type=int, default=4096, help="Model max length")
+    parser.add_argument("--max_length", type=int, default=8192, help="Model max length")
     parser.add_argument(
         "--mixed_precision",
         type=str,
@@ -196,7 +196,7 @@ def main() -> None:
     # ======================================================
     # Initialize Tokenizer, Dataset, Collator and Dataloader
     # ======================================================
-    tokenizer = LlamaTokenizer.from_pretrained(args.pretrained)
+    tokenizer = AutoTokenizer.from_pretrained(args.pretrained)
     if args.pad_token == "eos":
         tokenizer.pad_token = tokenizer.eos_token
     elif args.pad_token == "unk":
@@ -0,0 +1 @@
+1.0.0
@@ -5,7 +5,7 @@ This directory contains the applications that are powered by Colossal-AI.
 The list of applications include:
 
 - [X] [Open-Sora](https://github.com/hpcaitech/Open-Sora): Revealing Complete Model Parameters, Training Details, and Everything for Sora-like Video Generation Models
-- [X] [Colossal-LLaMA-2](./Colossal-LLaMA-2/): Continual Pre-training of LLaMA-2.
+- [X] [Colossal-LLaMA](./Colossal-LLaMA/): Continual Pre-training and Supervisied Fine-tuning of LLaMA2 / LLaMA3.
 - [X] [ColossalEval](./ColossalEval): Evaluation Pipeline for LLMs.
 - [X] [ColossalChat](./Chat/README.md): Replication of ChatGPT with RLHF.
 - [X] [FastFold](https://github.com/hpcaitech/FastFold): Optimizing AlphaFold (Biomedicine) Training and Inference on GPU Clusters.