diff --git a/.github/workflows/run_chatgpt_examples.yml b/.github/workflows/run_chatgpt_examples.yml index 4ea86b609..d0b5c2164 100644 --- a/.github/workflows/run_chatgpt_examples.yml +++ b/.github/workflows/run_chatgpt_examples.yml @@ -52,6 +52,7 @@ jobs: mkdir sft_data mkdir prompt_data mkdir preference_data + mkdir kto_data ./tests/test_data_preparation.sh ./tests/test_train.sh env: @@ -61,3 +62,4 @@ jobs: SFT_DATASET: ./sft_data PROMPT_DATASET: ./prompt_data PREFERENCE_DATASET: ./preference_data + KTO_DATASET: ./kto_data diff --git a/applications/ColossalChat/README.md b/applications/ColossalChat/README.md index b1b8f7eb2..4fbe290ba 100755 --- a/applications/ColossalChat/README.md +++ b/applications/ColossalChat/README.md @@ -24,7 +24,9 @@ - [Limitation for LLaMA-finetuned models](#limitation) - [Limitation of dataset](#limitation) - [Alternative Option For RLHF: DPO](#alternative-option-for-rlhf-direct-preference-optimization) -- [Alternative Option For RLHF: SimPO](#alternative-option-for-rlhf-simple-preference-optimization) +- [Alternative Option For RLHF: SimPO](#alternative-option-for-rlhf-simple-preference-optimization-simpo) +- [Alternative Option For RLHF: ORPO](#alternative-option-for-rlhf-odds-ratio-preference-optimization-orpo) +- [Alternative Option For RLHF: KTO](#alternative-option-for-rlhf-kahneman-tversky-optimization-kto) - [FAQ](#faq) - [How to save/load checkpoint](#faq) - [How to train with limited resources](#faq) @@ -284,6 +286,9 @@ Simple Preference Optimization (SimPO) from this [paper](https://arxiv.org/pdf/2 ## Alternative Option For RLHF: Odds Ratio Preference Optimization (ORPO) Odds Ratio Preference Optimization (ORPO) from this [paper](https://arxiv.org/pdf/2403.07691) is a reference model free alignment method that use a mixture of SFT loss and a reinforcement leanring loss calculated based on odds-ratio-based implicit reward to makes the training more efficient and stable. Read this [README](./examples/README.md) for more information. +## Alternative Option For RLHF: Kahneman-Tversky Optimization (KTO) +We support the method introduced in the paper [KTO: Model Alignment as Prospect Theoretic Optimization](https://arxiv.org/pdf/2402.01306) (KTO), which is an alignment method that directly maximizes the "human utility" of generation results. Read this [README](./examples/README.md) for more information. + ### Inference Quantization and Serving - After Training We provide an online inference server and a benchmark. We aim to run inference on single GPU, so quantization is essential when using large models.
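The KTO objective itself lives in coati's trainer/loss code rather than in this README hunk. As background for the data pipeline added later in this diff (the new `DataCollatorForKTODataset` pairs each prompt with a mismatched completion, `kl_input_ids`, to estimate a batch-level KL reference point), here is a minimal sketch of the prospect-theoretic loss described in the KTO paper. The function name, the `desirable_weight`/`undesirable_weight` arguments, and the toy inputs are illustrative assumptions, not the PR's actual API.

```python
import torch


def kto_loss_sketch(
    policy_logps: torch.Tensor,      # summed log-probs of each completion given its prompt (policy), shape (bsz,)
    ref_logps: torch.Tensor,         # same quantity under the frozen reference model, shape (bsz,)
    kl_policy_logps: torch.Tensor,   # policy log-probs of the *mismatched* completions (cf. kl_input_ids), (bsz,)
    kl_ref_logps: torch.Tensor,      # reference log-probs of the mismatched completions, (bsz,)
    labels: torch.Tensor,            # bool tensor, True = desirable sample, False = undesirable
    beta: float = 0.1,
    desirable_weight: float = 1.0,
    undesirable_weight: float = 1.0,
):
    # Batch-level KL reference point z0, clamped at zero and detached (no gradient flows through it)
    kl = (kl_policy_logps - kl_ref_logps).mean().clamp(min=0).detach()
    # Implicit reward: log-ratio between policy and reference on the real prompt/completion pair
    rewards = policy_logps - ref_logps
    # Prospect-theoretic value: desirable samples should score above z0, undesirable ones below it
    desirable_loss = desirable_weight * (1 - torch.sigmoid(beta * (rewards - kl)))
    undesirable_loss = undesirable_weight * (1 - torch.sigmoid(beta * (kl - rewards)))
    losses = torch.where(labels, desirable_loss, undesirable_loss)
    return losses.mean(), rewards


if __name__ == "__main__":
    # Toy usage with random numbers standing in for real model log-probs
    bsz = 4
    loss, rewards = kto_loss_sketch(
        policy_logps=torch.randn(bsz),
        ref_logps=torch.randn(bsz),
        kl_policy_logps=torch.randn(bsz),
        kl_ref_logps=torch.randn(bsz),
        labels=torch.tensor([True, False, True, False]),
    )
    print(f"loss={loss.item():.4f}, rewards={rewards.tolist()}")
```

Desirable samples are pushed to score above the batch KL estimate and undesirable ones below it, which is the sense in which KTO optimizes "human utility" from per-sample binary labels instead of paired preference data.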
diff --git a/applications/ColossalChat/benchmarks/benchmark_dpo.sh b/applications/ColossalChat/benchmarks/benchmark_dpo.sh index dfd0ff846..44d821a87 100755 --- a/applications/ColossalChat/benchmarks/benchmark_dpo.sh +++ b/applications/ColossalChat/benchmarks/benchmark_dpo.sh @@ -19,30 +19,33 @@ PROJECT_NAME="dpo" PARENT_CONFIG_FILE="./benchmark_config" # Path to a folder to save training config logs PRETRAINED_MODEL_PATH="" # huggingface or local model path PRETRAINED_TOKENIZER_PATH="" # huggingface or local tokenizer path +BENCHMARK_DATA_DIR="./temp/dpo" # Path to benchmark data +DATASET_SIZE=320 TIMESTAMP=$(date +%Y-%m-%d-%H-%M-%S) FULL_PROJECT_NAME="${PROJECT_NAME}-${TIMESTAMP}" -SAVE_DIR="${PARENT_SAVE_DIR}${FULL_PROJECT_NAME}" -CONFIG_FILE="${PARENT_CONFIG_FILE}-${FULL_PROJECT_NAME}.json" +declare -a dataset=( + $BENCHMARK_DATA_DIR/arrow/part-0 +) -colossalai run --nproc_per_node 4 --master_port 31313 benchmark_dpo.py \ +# Generate dummy test data +python prepare_dummy_test_dataset.py --data_dir $BENCHMARK_DATA_DIR --dataset_size $DATASET_SIZE --max_length 2048 --data_type preference + + +colossalai run --nproc_per_node 4 --master_port 31313 ../examples/training_scripts/train_dpo.py \ --pretrain $PRETRAINED_MODEL_PATH \ --tokenizer_dir $PRETRAINED_TOKENIZER_PATH \ - --config_file $CONFIG_FILE \ + --dataset ${dataset[@]} \ --plugin "zero2_cpu" \ --max_epochs 1 \ --accumulation_steps 1 \ - --batch_size 8 \ + --batch_size 4 \ --lr 1e-6 \ --beta 0.1 \ - --gamma 0.6 \ --mixed_precision "bf16" \ --grad_clip 1.0 \ --max_length 2048 \ - --dataset_size 640 \ --weight_decay 0.01 \ --warmup_steps 60 \ - --disable_reference_model \ - --length_normalization \ --grad_checkpoint \ --use_flash_attn diff --git a/applications/ColossalChat/benchmarks/benchmark_kto.sh b/applications/ColossalChat/benchmarks/benchmark_kto.sh new file mode 100755 index 000000000..82d3e3421 --- /dev/null +++ b/applications/ColossalChat/benchmarks/benchmark_kto.sh @@ -0,0 +1,51 @@ +#!/bin/bash +set_n_least_used_CUDA_VISIBLE_DEVICES() { + local n=${1:-"9999"} + echo "GPU Memory Usage:" + local FIRST_N_GPU_IDS=$(nvidia-smi --query-gpu=memory.used --format=csv | + tail -n +2 | + nl -v 0 | + tee /dev/tty | + sort -g -k 2 | + awk '{print $1}' | + head -n $n) + export CUDA_VISIBLE_DEVICES=$(echo $FIRST_N_GPU_IDS | sed 's/ /,/g') + echo "Now CUDA_VISIBLE_DEVICES is set to:" + echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" +} +set_n_least_used_CUDA_VISIBLE_DEVICES 4 + +PROJECT_NAME="kto" +PARENT_CONFIG_FILE="./benchmark_config" # Path to a folder to save training config logs +PRETRAINED_MODEL_PATH="" # huggingface or local model path +PRETRAINED_TOKENIZER_PATH="" # huggingface or local tokenizer path +BENCHMARK_DATA_DIR="./temp/kto" # Path to benchmark data +DATASET_SIZE=80 + +TIMESTAMP=$(date +%Y-%m-%d-%H-%M-%S) +FULL_PROJECT_NAME="${PROJECT_NAME}-${TIMESTAMP}" +declare -a dataset=( + $BENCHMARK_DATA_DIR/arrow/part-0 +) + +# Generate dummy test data +python prepare_dummy_test_dataset.py --data_dir $BENCHMARK_DATA_DIR --dataset_size $DATASET_SIZE --max_length 2048 --data_type kto + + +colossalai run --nproc_per_node 2 --master_port 31313 ../examples/training_scripts/train_kto.py \ + --pretrain $PRETRAINED_MODEL_PATH \ + --tokenizer_dir $PRETRAINED_TOKENIZER_PATH \ + --dataset ${dataset[@]} \ + --plugin "zero2_cpu" \ + --max_epochs 1 \ + --accumulation_steps 1 \ + --batch_size 2 \ + --lr 1e-5 \ + --beta 0.1 \ + --mixed_precision "bf16" \ + --grad_clip 1.0 \ + --max_length 2048 \ + --weight_decay 0.01 \ + --warmup_steps 60 \ + 
--grad_checkpoint \ + --use_flash_attn diff --git a/applications/ColossalChat/benchmarks/benchmark_orpo.py b/applications/ColossalChat/benchmarks/benchmark_orpo.py deleted file mode 100755 index 1325bada2..000000000 --- a/applications/ColossalChat/benchmarks/benchmark_orpo.py +++ /dev/null @@ -1,315 +0,0 @@ -import argparse -import json -import os -import resource -from contextlib import nullcontext - -import torch -from coati.dataset import DataCollatorForPreferenceDataset, StatefulDistributedSampler -from coati.models import convert_to_lora_module, disable_dropout -from coati.trainer import ORPOTrainer -from coati.utils import load_checkpoint -from dummy_dataset import DummyLLMDataset -from transformers import AutoModelForCausalLM, AutoTokenizer - -import colossalai -from colossalai.booster import Booster -from colossalai.booster.plugin import GeminiPlugin, HybridParallelPlugin, LowLevelZeroPlugin -from colossalai.cluster import DistCoordinator -from colossalai.logging import get_dist_logger -from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR -from colossalai.nn.optimizer import HybridAdam - -logger = get_dist_logger() - - -def train(args): - # check lora compatibility - if "gemini" in args.plugin and args.lora_rank > 0: - raise ValueError("LoRA is not supported in GeminiPlugin. Please use other plugin") - if args.plugin == "gemini_auto" and args.accumulation_steps > 1: - raise ValueError("Gradient accumulation is not supported in GeminiPlugin. Please use other plugin") - - # ============================== - # Initialize Distributed Training - # ============================== - colossalai.launch_from_torch() - coordinator = DistCoordinator() - - # ============================== - # Initialize Booster - # ============================== - if args.plugin == "ddp": - """ - Default torch ddp plugin without any acceleration, for - debugging purpose acceleration, for debugging purpose - """ - plugin = TorchDDPPlugin(find_unused_parameters=True) - elif args.plugin == "gemini": - plugin = GeminiPlugin( - precision=args.mixed_precision, - placement_policy="static", - initial_scale=2**16, - max_norm=args.grad_clip, - enable_gradient_accumulation=True, - enable_flash_attention=args.use_flash_attn, - ) - elif args.plugin == "gemini_auto": - plugin = GeminiPlugin( - precision=args.mixed_precision, - placement_policy="auto", - initial_scale=2**16, - max_norm=args.grad_clip, - enable_flash_attention=args.use_flash_attn, - ) - elif args.plugin == "zero2": - plugin = LowLevelZeroPlugin( - stage=2, - precision=args.mixed_precision, - initial_scale=2**16, - max_norm=args.grad_clip, - ) - elif args.plugin == "zero2_cpu": - plugin = LowLevelZeroPlugin( - stage=2, - precision=args.mixed_precision, - initial_scale=2**16, - cpu_offload=True, - max_norm=args.grad_clip, - ) - elif args.plugin == "3d": - plugin = HybridParallelPlugin( - tp_size=args.tp, - pp_size=args.pp, - sp_size=args.sp, - sequence_parallelism_mode=args.sp_mode, - zero_stage=args.zero_stage, - enable_flash_attention=args.use_flash_attn, - enable_sequence_parallelism=args.enable_sequence_parallelism, - cpu_offload=True if args.zero_stage >= 1 and args.zero_cpu_offload else False, - parallel_output=False, - max_norm=args.grad_clip, - precision=args.mixed_precision, - ) - else: - raise ValueError(f"Unknown plugin {args.plugin}") - - booster = Booster(plugin=plugin) - - # ====================================================== - # Initialize Model, Objective, Optimizer and LR Scheduler - # 
====================================================== - # Temp Fix: Disable lazy init due to version conflict - # init_ctx = ( - # LazyInitContext(default_device=get_current_device()) if isinstance(plugin, (GeminiPlugin,)) else nullcontext() - # ) - - init_ctx = nullcontext() - with init_ctx: - if args.use_flash_attn: - model = AutoModelForCausalLM.from_pretrained( - args.pretrain, - torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16, - use_flash_attention_2=True, - ) - coordinator.print_on_master(msg="Flash-attention enabled successfully") - else: - model = AutoModelForCausalLM.from_pretrained(args.pretrain) - disable_dropout(model) - if args.lora_rank > 0: - model = convert_to_lora_module(model, args.lora_rank, lora_train_bias=args.lora_train_bias) - - if args.grad_checkpoint: - # Note, for some models, lora may not be compatible with gradient checkpointing - model.gradient_checkpointing_enable() - coordinator.print_on_master(msg="Gradient checkpointing enabled successfully") - - # configure tokenizer - tokenizer_dir = args.tokenizer_dir if args.tokenizer_dir is not None else args.pretrain - tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir, use_fast=False, trust_remote_code=True) - if hasattr(tokenizer, "pad_token") and hasattr(tokenizer, "eos_token") and tokenizer.eos_token is not None: - try: - # Some tokenizers doesn't allow to set pad_token mannually e.g., Qwen - tokenizer.pad_token = tokenizer.eos_token - except AttributeError as e: - logger.warning(f"Unable to set pad token to eos token, {str(e)}") - if not hasattr(tokenizer, "pad_token") or tokenizer.pad_token is None: - logger.warning( - "The tokenizer does not have a pad token which is required. May lead to unintended behavior in training, Please consider manually set them." 
- ) - - tokenizer.add_bos_token = False - tokenizer.add_eos_token = False - - # configure optimizer - optim = HybridAdam( - model_params=model.parameters(), - lr=args.lr, - betas=(0.9, 0.95), - weight_decay=args.weight_decay, - adamw_mode=True, - ) - - # configure dataset - coordinator.print_on_master(f"Load dataset: {args.dataset}") - mode_map = {"train": "train", "valid": "validation", "test": "test"} - train_dataset = DummyLLMDataset( - ["chosen_input_ids", "chosen_loss_mask", "rejected_input_ids", "rejected_loss_mask"], - args.max_length, - args.dataset_size, - ) - data_collator = DataCollatorForPreferenceDataset(tokenizer=tokenizer, max_length=args.max_length) - - train_dataloader = plugin.prepare_dataloader( - dataset=train_dataset, - batch_size=args.batch_size, - shuffle=True, - drop_last=True, - collate_fn=data_collator, - distributed_sampler_cls=StatefulDistributedSampler, - ) - - num_update_steps_per_epoch = len(train_dataloader) // args.accumulation_steps - if args.warmup_steps is None: - args.warmup_steps = int(args.max_epochs * 0.025 * (len(train_dataloader) // args.accumulation_steps)) - coordinator.print_on_master(f"Warmup steps is set to {args.warmup_steps}") - - lr_scheduler = CosineAnnealingWarmupLR( - optimizer=optim, - total_steps=args.max_epochs * num_update_steps_per_epoch, - warmup_steps=args.warmup_steps, - eta_min=0.1 * args.lr, - ) - - default_dtype = torch.float16 if args.mixed_precision == "fp16" else torch.bfloat16 - torch.set_default_dtype(default_dtype) - model, optim, _, train_dataloader, lr_scheduler = booster.boost( - model=model, - optimizer=optim, - lr_scheduler=lr_scheduler, - dataloader=train_dataloader, - ) - torch.set_default_dtype(torch.float) - - coordinator.print_on_master(f"Booster init max CUDA memory: {torch.cuda.max_memory_allocated() / 1024 ** 2:.2f} MB") - coordinator.print_on_master( - f"Booster init max CPU memory: {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024:.2f} MB" - ) - - start_epoch = 0 - sampler_start_idx = 0 - start_step = 0 - if args.checkpoint_path is not None: - if "modeling" in args.checkpoint_path: - coordinator.print_on_master(f"Continued pretrain from checkpoint {args.checkpoint_path}") - booster.load_model(model, args.checkpoint_path) - else: - coordinator.print_on_master(f"Load model checkpoint from {args.checkpoint_path}") - start_epoch, start_step, sampler_start_idx = load_checkpoint( - load_dir=args.checkpoint_path, - booster=booster, - model=model, - optimizer=optim, - lr_scheduler=lr_scheduler, - ) - assert isinstance(train_dataloader.sampler, StatefulDistributedSampler) - train_dataloader.sampler.set_start_index(start_index=sampler_start_idx) - - coordinator.print_on_master( - f"Loaded checkpoint {args.checkpoint_path} at epoch {start_epoch} step {start_step}" - ) - coordinator.print_on_master(f"Loaded sample at index {sampler_start_idx}") - - coordinator.print_on_master( - f"Checkpoint loaded max CUDA memory: {torch.cuda.max_memory_allocated() / 1024 ** 2:.2f} MB" - ) - coordinator.print_on_master( - f"Checkpoint loaded CUDA memory: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB" - ) - coordinator.print_on_master( - f"Checkpoint loaded max CPU memory: {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024:.2f} MB" - ) - - trainer = ORPOTrainer( - actor=model, - booster=booster, - actor_optim=optim, - actor_lr_scheduler=lr_scheduler, - tokenizer=tokenizer, - max_epochs=args.max_epochs, - accumulation_steps=args.accumulation_steps, - start_epoch=start_epoch, - save_interval=None, - save_dir=None, 
- coordinator=coordinator, - lam=args.lam, - ) - - trainer.fit( - train_preference_dataloader=train_dataloader, - eval_preference_dataloader=None, - log_dir=None, - use_wandb=False, - ) - coordinator.print_on_master(f"Max CUDA memory usage: {torch.cuda.max_memory_allocated()/1024**2:.2f} MB") - - -if __name__ == "__main__": - # ============================== - # Parse Arguments - # ============================== - parser = argparse.ArgumentParser() - parser.add_argument( - "--plugin", - type=str, - default="gemini", - choices=["gemini", "gemini_auto", "zero2", "zero2_cpu", "3d"], - help="Choose which plugin to use", - ) - parser.add_argument("--grad_clip", type=float, default=1.0, help="Gradient clipping value") - parser.add_argument("--weight_decay", type=float, default=0.1, help="Weight decay") - parser.add_argument("--warmup_steps", type=int, default=None, help="Warmup steps") - parser.add_argument("--tp", type=int, default=1) - parser.add_argument("--pp", type=int, default=1) - parser.add_argument("--sp", type=int, default=1) - parser.add_argument("--lam", type=float, default=0.1, help="lambda in ORPO loss") - parser.add_argument("--enable_sequence_parallelism", default=False, action="store_true") - parser.add_argument("--zero_stage", type=int, default=0, help="Zero stage", choices=[0, 1, 2]) - parser.add_argument("--zero_cpu_offload", default=False, action="store_true") - parser.add_argument("--sp_mode", type=str, default="split_gather", choices=["split_gather", "ring", "all_to_all"]) - parser.add_argument("--pretrain", type=str, default=None) - parser.add_argument("--model_type", type=str, default=None) - parser.add_argument("--tokenizer_dir", type=str, default=None) - parser.add_argument("--dataset", nargs="+", default=[]) - parser.add_argument( - "--checkpoint_path", type=str, default=None, help="Checkpoint path if need to resume training form a checkpoint" - ) - parser.add_argument("--config_file", type=str, default="config_file", help="Config file") - parser.add_argument("--max_length", type=int, default=2048, help="Model max length") - parser.add_argument("--max_epochs", type=int, default=3) - parser.add_argument("--batch_size", type=int, default=4) - parser.add_argument( - "--disable_reference_model", - action="store_true", - default=False, - help="Disable the reference model (enabled by default)", - ) - parser.add_argument("--dataset_size", type=int, default=500) - parser.add_argument("--mixed_precision", type=str, default="fp16", choices=["fp16", "bf16"], help="Mixed precision") - parser.add_argument("--lora_rank", type=int, default=0, help="low-rank adaptation matrices rank") - parser.add_argument( - "--lora_train_bias", - type=str, - default="none", - help="'none' means it doesn't train biases. 'all' means it trains all biases. 
'lora_only' means it only trains biases of LoRA layers", - ) - parser.add_argument("--merge_lora_weights", type=bool, default=True) - parser.add_argument("--lr", type=float, default=5e-6) - parser.add_argument("--accumulation_steps", type=int, default=8) - parser.add_argument("--grad_checkpoint", default=False, action="store_true") - parser.add_argument("--use_flash_attn", default=False, action="store_true") - args = parser.parse_args() - os.makedirs(os.path.dirname(args.config_file), exist_ok=True) - with open(args.config_file, "w") as f: - json.dump(args.__dict__, f, indent=4) - train(args) diff --git a/applications/ColossalChat/benchmarks/benchmark_orpo.sh b/applications/ColossalChat/benchmarks/benchmark_orpo.sh index cc6eef510..f8fb264ae 100755 --- a/applications/ColossalChat/benchmarks/benchmark_orpo.sh +++ b/applications/ColossalChat/benchmarks/benchmark_orpo.sh @@ -15,20 +15,28 @@ set_n_least_used_CUDA_VISIBLE_DEVICES() { } set_n_least_used_CUDA_VISIBLE_DEVICES 2 -PROJECT_NAME="dpo" +PROJECT_NAME="orpo" PARENT_CONFIG_FILE="./benchmark_config" # Path to a folder to save training config logs PRETRAINED_MODEL_PATH="" # huggingface or local model path PRETRAINED_TOKENIZER_PATH="" # huggingface or local tokenizer path +BENCHMARK_DATA_DIR="./temp/orpo" # Path to benchmark data +DATASET_SIZE=160 TIMESTAMP=$(date +%Y-%m-%d-%H-%M-%S) FULL_PROJECT_NAME="${PROJECT_NAME}-${TIMESTAMP}" -CONFIG_FILE="${PARENT_CONFIG_FILE}-${FULL_PROJECT_NAME}.json" +declare -a dataset=( + $BENCHMARK_DATA_DIR/arrow/part-0 +) -colossalai run --nproc_per_node 2 --master_port 31313 benchmark_orpo.py \ +# Generate dummy test data +python prepare_dummy_test_dataset.py --data_dir $BENCHMARK_DATA_DIR --dataset_size $DATASET_SIZE --max_length 2048 --data_type preference + + +colossalai run --nproc_per_node 2 --master_port 31313 ../examples/training_scripts/train_orpo.py \ --pretrain $PRETRAINED_MODEL_PATH \ --tokenizer_dir $PRETRAINED_TOKENIZER_PATH \ + --dataset ${dataset[@]} \ --plugin "zero2" \ - --config_file $CONFIG_FILE \ --max_epochs 1 \ --accumulation_steps 1 \ --batch_size 4 \ @@ -39,6 +47,5 @@ colossalai run --nproc_per_node 2 --master_port 31313 benchmark_orpo.py \ --max_length 2048 \ --weight_decay 0.01 \ --warmup_steps 60 \ - --dataset_size 160 \ --grad_checkpoint \ --use_flash_attn diff --git a/applications/ColossalChat/benchmarks/benchmark_sft.py b/applications/ColossalChat/benchmarks/benchmark_sft.py deleted file mode 100644 index b6438c503..000000000 --- a/applications/ColossalChat/benchmarks/benchmark_sft.py +++ /dev/null @@ -1,315 +0,0 @@ -import argparse -import json -import math -import os -import resource -from contextlib import nullcontext - -import torch -from coati.dataset import DataCollatorForSupervisedDataset, StatefulDistributedSampler -from coati.models import convert_to_lora_module -from coati.trainer import SFTTrainer -from coati.utils import load_checkpoint -from dummy_dataset import DummyLLMDataset -from transformers import AutoModelForCausalLM, AutoTokenizer - -import colossalai -from colossalai.booster import Booster -from colossalai.booster.plugin import GeminiPlugin, HybridParallelPlugin, LowLevelZeroPlugin, TorchDDPPlugin -from colossalai.cluster import DistCoordinator -from colossalai.logging import get_dist_logger -from colossalai.nn.lr_scheduler import CosineAnnealingWarmupLR -from colossalai.nn.optimizer import HybridAdam - -logger = get_dist_logger() - - -def train(args): - # check lora compatibility - if "gemini" in args.plugin and args.lora_rank > 0: - raise ValueError("LoRA 
is not supported in GeminiPlugin. Please use other plugin") - if args.plugin == "gemini_auto" and args.accumulation_steps > 1: - raise ValueError("Gradient accumulation is not supported in GeminiPlugin. Please use other plugin") - # ============================== - # Initialize Distributed Training - # ============================== - colossalai.launch_from_torch() - coordinator = DistCoordinator() - - # ============================== - # Initialize Booster - # ============================== - init_ctx = nullcontext() - with init_ctx: - if args.use_flash_attn: - model = AutoModelForCausalLM.from_pretrained( - args.pretrain, - torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16, - attn_implementation="flash_attention_2", - trust_remote_code=True, - ) - else: - model = AutoModelForCausalLM.from_pretrained( - args.pretrain, - torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16, - trust_remote_code=True, - ) - if args.lora_rank > 0: - model = convert_to_lora_module(model, args.lora_rank, lora_train_bias=args.lora_train_bias) - - if args.plugin == "ddp": - """ - Default torch ddp plugin without any acceleration, for - debugging purpose acceleration, for debugging purpose - """ - plugin = TorchDDPPlugin(find_unused_parameters=True) - elif args.plugin == "gemini": - plugin = GeminiPlugin( - precision=args.mixed_precision, - placement_policy="static", - initial_scale=2**16, - max_norm=args.grad_clip, - enable_gradient_accumulation=True if args.accumulation_steps > 1 else False, - enable_flash_attention=args.use_flash_attn, - ) - elif args.plugin == "gemini_auto": - plugin = GeminiPlugin( - precision=args.mixed_precision, - placement_policy="auto", - initial_scale=2**16, - max_norm=args.grad_clip, - enable_flash_attention=args.use_flash_attn, - ) - elif args.plugin == "zero2": - plugin = LowLevelZeroPlugin( - stage=2, - precision=args.mixed_precision, - initial_scale=2**16, - max_norm=args.grad_clip, - ) - elif args.plugin == "zero2_cpu": - plugin = LowLevelZeroPlugin( - stage=2, - precision=args.mixed_precision, - initial_scale=2**16, - cpu_offload=True, - max_norm=args.grad_clip, - ) - elif args.plugin == "3d": - plugin = HybridParallelPlugin( - tp_size=args.tp, - pp_size=args.pp, - sp_size=args.sp, - sequence_parallelism_mode=args.sp_mode, - zero_stage=args.zero_stage, - enable_flash_attention=args.use_flash_attn, - enable_sequence_parallelism=args.enable_sequence_parallelism, - cpu_offload=True if args.zero_stage >= 1 and args.zero_cpu_offload else False, - parallel_output=False, - max_norm=args.grad_clip, - precision=args.mixed_precision, - microbatch_size=args.batch_size, - ) - else: - raise ValueError(f"Unknown plugin {args.plugin}") - - booster = Booster(plugin=plugin) - - # ====================================================== - # Initialize Model, Objective, Optimizer and LR Scheduler - # ====================================================== - # Temp Fix: Disable lazy init due to version conflict - # init_ctx = ( - # LazyInitContext(default_device=get_current_device()) if isinstance(plugin, (GeminiPlugin,)) else nullcontext() - # ) - - if args.grad_checkpoint: - # Note, for some models, lora may not be compatible with gradient checkpointing - model.gradient_checkpointing_enable() - coordinator.print_on_master(msg="Gradient checkpointing enabled successfully") - - # configure tokenizer - tokenizer = AutoTokenizer.from_pretrained( - args.tokenizer_dir or args.pretrain, use_fast=False, trust_remote_code=True - ) - if 
hasattr(tokenizer, "pad_token") and hasattr(tokenizer, "eos_token") and tokenizer.eos_token is not None: - try: - # Some tokenizers doesn't allow to set pad_token mannually e.g., Qwen - tokenizer.pad_token = tokenizer.eos_token - except AttributeError as e: - logger.warning(f"Unable to set pad token to eos token, {str(e)}") - if not hasattr(tokenizer, "pad_token") or tokenizer.pad_token is None: - logger.warning( - "The tokenizer does not have a pad token which is required. May lead to unintended behavior in training, Please consider manually set them." - ) - - tokenizer.add_bos_token = False - tokenizer.add_eos_token = False - tokenizer.padding_side = "right" - - coordinator.print_on_master(f"Configuration file will be saved at: {args.config_file}") - - # configure optimizer - optim = HybridAdam( - model_params=model.parameters(), - lr=args.lr, - betas=(0.9, 0.95), - weight_decay=args.weight_decay, - adamw_mode=True, - ) - - # configure dataset - coordinator.print_on_master( - f"Max CUDA memory before data loader: {torch.cuda.max_memory_allocated() / 1024 ** 2:.2f} MB" - ) - dataset = DummyLLMDataset(["input_ids", "attention_mask", "labels"], args.max_len, args.dataset_size) - data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer, max_length=args.max_len) - - train_dataloader = plugin.prepare_dataloader( - dataset=dataset, - batch_size=args.batch_size, - shuffle=True, - drop_last=True, - collate_fn=data_collator, - distributed_sampler_cls=StatefulDistributedSampler, - ) - coordinator.print_on_master( - f"Max CUDA memory after data loader: {torch.cuda.max_memory_allocated() / 1024 ** 2:.2f} MB" - ) - - num_update_steps_per_epoch = len(train_dataloader) // args.accumulation_steps - math.ceil(args.max_epochs * num_update_steps_per_epoch) - - if args.warmup_steps is None: - args.warmup_steps = int(args.max_epochs * 0.025 * (len(train_dataloader) // args.accumulation_steps)) - coordinator.print_on_master(f"Warmup steps is set to {args.warmup_steps}") - - lr_scheduler = CosineAnnealingWarmupLR( - optimizer=optim, - total_steps=args.max_epochs * num_update_steps_per_epoch, - warmup_steps=args.warmup_steps, - eta_min=0.1 * args.lr, - ) - - # Flash attention will be disabled because it does NOT support fp32. 
- default_dtype = torch.float16 if args.mixed_precision == "fp16" else torch.bfloat16 - torch.set_default_dtype(default_dtype) - model, optim, _, train_dataloader, lr_scheduler = booster.boost( - model=model, - optimizer=optim, - lr_scheduler=lr_scheduler, - dataloader=train_dataloader, - ) - torch.set_default_dtype(torch.float) - - coordinator.print_on_master(f"Booster init max CUDA memory: {torch.cuda.max_memory_allocated() / 1024 ** 2:.2f} MB") - coordinator.print_on_master( - f"Booster init max CPU memory: {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024:.2f} MB" - ) - - start_epoch = 0 - sampler_start_idx = 0 - start_step = 0 - if args.checkpoint_path is not None: - if "modeling" in args.checkpoint_path: - coordinator.print_on_master(f"Continued pretrain from checkpoint {args.checkpoint_path}") - booster.load_model(model, args.checkpoint_path) - else: - coordinator.print_on_master(f"Load model checkpoint from {args.checkpoint_path}") - start_epoch, start_step, sampler_start_idx = load_checkpoint( - load_dir=args.checkpoint_path, - booster=booster, - model=model, - optimizer=optim, - lr_scheduler=lr_scheduler, - ) - train_dataloader.sampler.set_start_index(start_index=sampler_start_idx) - - coordinator.print_on_master( - f"Loaded checkpoint {args.checkpoint_path} at epoch {start_epoch} step {start_step}" - ) - coordinator.print_on_master(f"Loaded sample at index {sampler_start_idx}") - - coordinator.print_on_master( - f"Checkpoint loaded max CUDA memory: {torch.cuda.max_memory_allocated() / 1024 ** 2:.2f} MB" - ) - coordinator.print_on_master( - f"Checkpoint loaded CUDA memory: {torch.cuda.memory_allocated() / 1024 ** 2:.2f} MB" - ) - coordinator.print_on_master( - f"Checkpoint loaded max CPU memory: {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024:.2f} MB" - ) - - trainer = SFTTrainer( - model=model, - booster=booster, - optim=optim, - lr_scheduler=lr_scheduler, - max_epochs=args.max_epochs, - accumulation_steps=args.accumulation_steps, - start_epoch=start_epoch, - save_interval=None, - save_dir=None, - coordinator=coordinator, - ) - - trainer.fit( - train_dataloader=train_dataloader, - eval_dataloader=None, - log_dir=None, - use_wandb=False, - ) - - coordinator.print_on_master(f"Max CUDA memory usage: {torch.cuda.max_memory_allocated()/1024**2:.2f} MB") - - -if __name__ == "__main__": - # ============================== - # Parse Arguments - # ============================== - parser = argparse.ArgumentParser() - parser.add_argument( - "--plugin", - type=str, - default="gemini", - choices=["gemini", "gemini_auto", "3d", "ddp", "zero2_cpu", "zero2"], - help="Choose which plugin to use", - ) - parser.add_argument("--grad_clip", type=float, default=1.0, help="Gradient clipping value") - parser.add_argument("--weight_decay", type=float, default=0.1, help="Weight decay") - parser.add_argument("--warmup_steps", type=int, default=None, help="Warmup steps") - parser.add_argument("--tp", type=int, default=1) - parser.add_argument("--pp", type=int, default=1) - parser.add_argument("--sp", type=int, default=1) - parser.add_argument("--enable_sequence_parallelism", default=False, action="store_true") - parser.add_argument("--zero_stage", type=int, default=0, help="Zero stage", choices=[0, 1, 2]) - parser.add_argument("--zero_cpu_offload", default=False, action="store_true") - parser.add_argument("--sp_mode", type=str, default="split_gather", choices=["split_gather", "ring", "all_to_all"]) - parser.add_argument("--pretrain", type=str, default=None) - 
parser.add_argument("--tokenizer_dir", type=str, default=None) - parser.add_argument( - "--checkpoint_path", type=str, default=None, help="Checkpoint path if need to resume training form a checkpoint" - ) - parser.add_argument("--max_epochs", type=int, default=3) - parser.add_argument("--batch_size", type=int, default=4) - parser.add_argument("--max_len", type=int, default=512) - parser.add_argument("--mixed_precision", type=str, default="bf16", choices=["fp16", "bf16"], help="Mixed precision") - parser.add_argument("--lora_rank", type=int, default=0, help="low-rank adaptation matrices rank") - parser.add_argument( - "--lora_train_bias", - type=str, - default="none", - help="'none' means it doesn't train biases. 'all' means it trains all biases. 'lora_only' means it only trains biases of LoRA layers", - ) - parser.add_argument("--merge_lora_weights", type=bool, default=True) - parser.add_argument("--lr", type=float, default=5e-6) - parser.add_argument("--config_file", type=str, default="config_file", help="Config file") - parser.add_argument("--accumulation_steps", type=int, default=8) - parser.add_argument("--grad_checkpoint", default=False, action="store_true") - parser.add_argument("--use_flash_attn", default=False, action="store_true") - parser.add_argument("--dataset_size", type=int, default=500) - args = parser.parse_args() - os.makedirs(os.path.dirname(args.config_file), exist_ok=True) - with open(args.config_file, "w") as f: - json.dump(args.__dict__, f, indent=4) - train(args) diff --git a/applications/ColossalChat/benchmarks/benchmark_sft.sh b/applications/ColossalChat/benchmarks/benchmark_sft.sh index 0c80386ef..efcd428dd 100755 --- a/applications/ColossalChat/benchmarks/benchmark_sft.sh +++ b/applications/ColossalChat/benchmarks/benchmark_sft.sh @@ -14,21 +14,31 @@ set_n_least_used_CUDA_VISIBLE_DEVICES() { } set_n_least_used_CUDA_VISIBLE_DEVICES 4 -# export CUDA_VISIBLE_DEVICES=3,4 + PROJECT_NAME="sft" PARENT_CONFIG_FILE="./benchmark_config" # Path to a folder to save training config logs PRETRAINED_MODEL_PATH="" # huggingface or local model path PRETRAINED_TOKENIZER_PATH="" # huggingface or local tokenizer path +BENCHMARK_DATA_DIR="./temp/sft" # Path to benchmark data +DATASET_SIZE=640 TIMESTAMP=$(date +%Y-%m-%d-%H-%M-%S) FULL_PROJECT_NAME="${PROJECT_NAME}-${TIMESTAMP}" CONFIG_FILE="${PARENT_CONFIG_FILE}-${FULL_PROJECT_NAME}.json" +declare -a dataset=( + $BENCHMARK_DATA_DIR/arrow/part-0 +) + + +# Generate dummy test data +python prepare_dummy_test_dataset.py --data_dir $BENCHMARK_DATA_DIR --dataset_size $DATASET_SIZE --max_length 2048 --data_type sft + # the real batch size for gradient descent is number_of_node_in_hostfile * nproc_per_node * train_batch_size -colossalai run --nproc_per_node 4 --master_port 31312 benchmark_sft.py \ +colossalai run --nproc_per_node 1 --master_port 31312 ../examples/training_scripts/train_sft.py \ --pretrain $PRETRAINED_MODEL_PATH \ --tokenizer_dir $PRETRAINED_TOKENIZER_PATH \ - --config_file $CONFIG_FILE \ + --dataset ${dataset[@]} \ --plugin zero2 \ --batch_size 8 \ --max_epochs 1 \ @@ -36,6 +46,5 @@ colossalai run --nproc_per_node 4 --master_port 31312 benchmark_sft.py \ --lr 5e-5 \ --lora_rank 32 \ --max_len 2048 \ - --dataset_size 640 \ --grad_checkpoint \ --use_flash_attn diff --git a/applications/ColossalChat/benchmarks/benchmark_simpo.sh b/applications/ColossalChat/benchmarks/benchmark_simpo.sh new file mode 100755 index 000000000..47dfc8595 --- /dev/null +++ b/applications/ColossalChat/benchmarks/benchmark_simpo.sh @@ -0,0 +1,55 @@ 
+#!/bin/bash +set_n_least_used_CUDA_VISIBLE_DEVICES() { + local n=${1:-"9999"} + echo "GPU Memory Usage:" + local FIRST_N_GPU_IDS=$(nvidia-smi --query-gpu=memory.used --format=csv | + tail -n +2 | + nl -v 0 | + tee /dev/tty | + sort -g -k 2 | + awk '{print $1}' | + head -n $n) + export CUDA_VISIBLE_DEVICES=$(echo $FIRST_N_GPU_IDS | sed 's/ /,/g') + echo "Now CUDA_VISIBLE_DEVICES is set to:" + echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" +} +set_n_least_used_CUDA_VISIBLE_DEVICES 4 + +PROJECT_NAME="simpo" +PARENT_CONFIG_FILE="./benchmark_config" # Path to a folder to save training config logs +PRETRAINED_MODEL_PATH="" # huggingface or local model path +PRETRAINED_TOKENIZER_PATH="" # huggingface or local tokenizer path +BENCHMARK_DATA_DIR="./temp/simpo" # Path to benchmark data +DATASET_SIZE=640 + +TIMESTAMP=$(date +%Y-%m-%d-%H-%M-%S) +FULL_PROJECT_NAME="${PROJECT_NAME}-${TIMESTAMP}" +declare -a dataset=( + $BENCHMARK_DATA_DIR/arrow/part-0 +) + +# Generate dummy test data +python prepare_dummy_test_dataset.py --data_dir $BENCHMARK_DATA_DIR --dataset_size $DATASET_SIZE --max_length 2048 --data_type preference + + +colossalai run --nproc_per_node 4 --master_port 31313 ../examples/training_scripts/train_dpo.py \ + --pretrain $PRETRAINED_MODEL_PATH \ + --tokenizer_dir $PRETRAINED_TOKENIZER_PATH \ + --dataset ${dataset[@]} \ + --plugin "zero2_cpu" \ + --loss_type "simpo_loss" \ + --max_epochs 1 \ + --accumulation_steps 1 \ + --batch_size 8 \ + --lr 1e-6 \ + --beta 0.1 \ + --gamma 0.6 \ + --mixed_precision "bf16" \ + --grad_clip 1.0 \ + --max_length 2048 \ + --weight_decay 0.01 \ + --warmup_steps 60 \ + --disable_reference_model \ + --length_normalization \ + --grad_checkpoint \ + --use_flash_attn diff --git a/applications/ColossalChat/benchmarks/dummy_dataset.py b/applications/ColossalChat/benchmarks/dummy_dataset.py index 070531fd5..9af0f1641 100644 --- a/applications/ColossalChat/benchmarks/dummy_dataset.py +++ b/applications/ColossalChat/benchmarks/dummy_dataset.py @@ -1,10 +1,12 @@ -import torch +from typing import Callable + from torch.utils.data import Dataset class DummyLLMDataset(Dataset): - def __init__(self, keys, seq_len, size=500): + def __init__(self, keys, seq_len, size=500, gen_fn={}): self.keys = keys + self.gen_fn = gen_fn self.seq_len = seq_len self.data = self._generate_data() self.size = size @@ -12,11 +14,17 @@ class DummyLLMDataset(Dataset): def _generate_data(self): data = {} for key in self.keys: - data[key] = torch.ones(self.seq_len, dtype=torch.long) + if key in self.gen_fn: + data[key] = self.gen_fn[key] + else: + data[key] = [1] * self.seq_len return data def __len__(self): return self.size def __getitem__(self, idx): - return {key: self.data[key] for key in self.keys} + return { + key: self.data[key] if not isinstance(self.data[key], Callable) else self.data[key](idx) + for key in self.keys + } diff --git a/applications/ColossalChat/benchmarks/prepare_dummy_test_dataset.py b/applications/ColossalChat/benchmarks/prepare_dummy_test_dataset.py new file mode 100644 index 000000000..f501c5358 --- /dev/null +++ b/applications/ColossalChat/benchmarks/prepare_dummy_test_dataset.py @@ -0,0 +1,105 @@ +import argparse +import json +import os +import time +from multiprocessing import cpu_count + +from datasets import load_dataset +from dummy_dataset import DummyLLMDataset + +from colossalai.logging import get_dist_logger + +logger = get_dist_logger() + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument( + "--data_dir", + type=str, + 
required=True, + default=None, + help="The output dir", + ) + parser.add_argument( + "--dataset_size", + type=int, + required=True, + default=None, + help="The size of data", + ) + parser.add_argument( + "--max_length", + type=int, + required=True, + default=None, + help="The max length of data", + ) + parser.add_argument( + "--data_type", + type=str, + required=True, + default=None, + help="The type of data, choose one from ['sft', 'prompt', 'preference', 'kto']", + ) + args = parser.parse_args() + if args.data_type == "sft": + dataset = DummyLLMDataset(["input_ids", "attention_mask", "labels"], args.max_length, args.dataset_size) + elif args.data_type == "prompt": + # pass PPO dataset is prepared separately + pass + elif args.data_type == "preference": + dataset = DummyLLMDataset( + ["chosen_input_ids", "chosen_loss_mask", "rejected_input_ids", "rejected_loss_mask"], + args.max_length, + args.dataset_size, + ) + elif args.data_type == "kto": + dataset = DummyLLMDataset( + ["prompt", "completion", "label"], + args.max_length - 512, + args.dataset_size, + gen_fn={ + "completion": lambda x: [1] * 512, + "label": lambda x: x % 2, + }, + ) + else: + raise ValueError(f"Unknown data type {args.data_type}") + + # Save each jsonl spliced dataset. + output_index = "0" + output_name = f"part-{output_index}" + os.makedirs(args.data_dir, exist_ok=True) + output_jsonl_path = os.path.join(args.data_dir, "json") + output_arrow_path = os.path.join(args.data_dir, "arrow") + output_cache_path = os.path.join(args.data_dir, "cache") + os.makedirs(output_jsonl_path, exist_ok=True) + os.makedirs(output_arrow_path, exist_ok=True) + output_jsonl_file_path = os.path.join(output_jsonl_path, output_name + ".jsonl") + st = time.time() + with open(file=output_jsonl_file_path, mode="w", encoding="utf-8") as fp_writer: + count = 0 + for i in range(len(dataset)): + data_point = dataset[i] + if count % 500 == 0: + logger.info(f"processing {count} spliced data points for {fp_writer.name}") + count += 1 + fp_writer.write(json.dumps(data_point, ensure_ascii=False) + "\n") + logger.info( + f"Current file {fp_writer.name}; " + f"Data size: {len(dataset)}; " + f"Time cost: {round((time.time() - st) / 60, 6)} minutes." 
+ ) + # Save each arrow spliced dataset + output_arrow_file_path = os.path.join(output_arrow_path, output_name) + logger.info(f"Start to save {output_arrow_file_path}") + dataset = load_dataset( + path="json", + data_files=[output_jsonl_file_path], + cache_dir=os.path.join(output_cache_path, "tokenized"), + keep_in_memory=False, + num_proc=cpu_count(), + split="train", + ) + dataset.save_to_disk(dataset_path=output_arrow_file_path, num_proc=min(len(dataset), cpu_count())) diff --git a/applications/ColossalChat/coati/dataset/__init__.py b/applications/ColossalChat/coati/dataset/__init__.py index deb7b6d92..8e9060a1a 100755 --- a/applications/ColossalChat/coati/dataset/__init__.py +++ b/applications/ColossalChat/coati/dataset/__init__.py @@ -1,24 +1,26 @@ from .conversation import Conversation, setup_conversation_template from .loader import ( + DataCollatorForKTODataset, DataCollatorForPreferenceDataset, DataCollatorForPromptDataset, DataCollatorForSupervisedDataset, StatefulDistributedSampler, load_tokenized_dataset, ) -from .tokenization_utils import supervised_tokenize_sft, tokenize_prompt_dataset, tokenize_rlhf +from .tokenization_utils import tokenize_kto, tokenize_prompt, tokenize_rlhf, tokenize_sft __all__ = [ - "tokenize_prompt_dataset", + "tokenize_prompt", "DataCollatorForPromptDataset", "is_rank_0", "DataCollatorForPreferenceDataset", "DataCollatorForSupervisedDataset", + "DataCollatorForKTODataset", "StatefulDistributedSampler", "load_tokenized_dataset", - "supervised_tokenize_pretrain", - "supervised_tokenize_sft", + "tokenize_sft", "tokenize_rlhf", + "tokenize_kto", "setup_conversation_template", "Conversation", ] diff --git a/applications/ColossalChat/coati/dataset/conversation.py b/applications/ColossalChat/coati/dataset/conversation.py index 37900f3b8..a77c220d3 100755 --- a/applications/ColossalChat/coati/dataset/conversation.py +++ b/applications/ColossalChat/coati/dataset/conversation.py @@ -18,6 +18,7 @@ class Conversation: chat_template: str stop_ids: List[int] end_of_assistant: str + roles = ["user", "assistant"] @classmethod def from_config(cls, tokenizer: PreTrainedTokenizer, config: Dict): @@ -85,7 +86,7 @@ class Conversation: Raises: AssertionError: If the role is not 'user' or 'assistant'. """ - assert role in ["user", "assistant"] + assert role in self.roles self.messages.append({"role": role, "content": message}) def copy(self): diff --git a/applications/ColossalChat/coati/dataset/loader.py b/applications/ColossalChat/coati/dataset/loader.py index 48011c941..b92cd76ad 100755 --- a/applications/ColossalChat/coati/dataset/loader.py +++ b/applications/ColossalChat/coati/dataset/loader.py @@ -235,6 +235,91 @@ class DataCollatorForPreferenceDataset(object): ) +@dataclass +class DataCollatorForKTODataset(object): + """ + Collate instances for kto dataset. + Each input instance is a tokenized dictionary with fields + `prompt`(List[int]), `completion`(List[int]) and `label`(bool). + Each output instance is a tokenized dictionary with fields + `kl_input_ids`(List[int]), `kl_attention_mask`(List[int]) and `kl_loss_mask`(List[int]). + `input_ids`(List[int]), `attention_mask`(List[int]), `loss_mask`(List[int]) and `label`(bool). 
+ """ + + tokenizer: PreTrainedTokenizer + max_length: int = 4096 + ignore_index: int = -100 + + def __call__(self, instances: Sequence[Dict[str, List[int]]]) -> Dict[str, torch.Tensor]: + """ + + Args: + instances (`Sequence[Dict[str, List[int]]]`): + Mini-batch samples, each sample is stored in an individual dictionary contains the following fields: + `prompt`(List[int]), `completion`(List[int]) and `label`(bool, if the sample is desirable or not). + + Returns: + (`Dict[str, torch.Tensor]`): Contains the following `torch.Tensor`: + `input_ids`: `torch.Tensor` of shape (bsz, max_len); + `attention_mask`: `torch.BoolTensor` of shape (bsz, max_len); + `labels`: `torch.Tensor` of shape (bsz, max_len), which contains `IGNORE_INDEX`. + """ + assert isinstance(self.tokenizer.pad_token_id, int) and self.tokenizer.pad_token_id >= 0, ( + f"`{self.tokenizer.__class__.__name__}.pad_token_id` must be a valid non-negative integer index value, " + f"but now `{self.tokenizer.pad_token_id}`" + ) + # prepare the preference data + prompt = [torch.LongTensor(instance["prompt"]) for instance in instances] + prompt_zeros = [torch.zeros_like(t) for t in prompt] + completion = [torch.LongTensor(instance["completion"]) for instance in instances] + completion_ones = [torch.ones_like(t) for t in completion] + label = [torch.tensor(instance["label"], dtype=torch.bool) for instance in instances] + input_ids = [torch.cat([prompt[i], completion[i]], dim=-1) for i in range(len(instances))] + loss_mask = [torch.cat([prompt_zeros[i], completion_ones[i]], dim=-1) for i in range(len(instances))] + # right padding + input_ids = torch.nn.utils.rnn.pad_sequence( + sequences=input_ids, + batch_first=True, + padding_value=self.tokenizer.pad_token_id, + ) # (bsz, max_len) + loss_mask = torch.nn.utils.rnn.pad_sequence( + sequences=loss_mask, batch_first=True, padding_value=0 + ) # (bsz, max_len) + to_pad = self.max_length - input_ids.size(1) + input_ids = F.pad(input_ids, (0, to_pad), value=self.tokenizer.pad_token_id) + loss_mask = F.pad(loss_mask, (0, to_pad), value=0) + attention_mask = input_ids.ne(self.tokenizer.pad_token_id) # `torch.BoolTensor`, (bsz, max_len) + + # prepare kt data + kl_completion = completion[::-1] # y' + kl_completion_ones = [torch.ones_like(t) for t in kl_completion] + kl_input_ids = [torch.cat([prompt[i], kl_completion[i]], dim=-1) for i in range(len(instances))] + kl_loss_mask = [torch.cat([prompt_zeros[i], kl_completion_ones[i]], dim=-1) for i in range(len(instances))] + # right padding + kl_input_ids = torch.nn.utils.rnn.pad_sequence( + sequences=kl_input_ids, + batch_first=True, + padding_value=self.tokenizer.pad_token_id, + ) # (bsz, max_len) + kl_loss_mask = torch.nn.utils.rnn.pad_sequence( + sequences=kl_loss_mask, batch_first=True, padding_value=0 + ) # (bsz, max_len) + to_pad = self.max_length - kl_input_ids.size(1) + kl_input_ids = F.pad(kl_input_ids, (0, to_pad), value=self.tokenizer.pad_token_id) + kl_loss_mask = F.pad(kl_loss_mask, (0, to_pad), value=0) + kl_attention_mask = kl_input_ids.ne(self.tokenizer.pad_token_id) # `torch.BoolTensor`, (bsz, max_len) + data_dict = { + "input_ids": input_ids, + "attention_mask": attention_mask, + "loss_mask": loss_mask, + "label": torch.stack(label), + "kl_input_ids": kl_input_ids, + "kl_attention_mask": kl_attention_mask, + "kl_loss_mask": kl_loss_mask, + } + return data_dict + + class StatefulDistributedSampler(DistributedSampler): def __init__( self, diff --git a/applications/ColossalChat/coati/dataset/tokenization_utils.py 
b/applications/ColossalChat/coati/dataset/tokenization_utils.py index 27addcb0d..9eb2eba87 100755 --- a/applications/ColossalChat/coati/dataset/tokenization_utils.py +++ b/applications/ColossalChat/coati/dataset/tokenization_utils.py @@ -23,11 +23,10 @@ IGNORE_INDEX = -100 DSType = Union[Dataset, ConcatDataset, dataset_dict.Dataset] -def supervised_tokenize_sft( +def tokenize_sft( data_point: Dict[str, str], tokenizer: PreTrainedTokenizer, conversation_template: Conversation = None, - ignore_index: int = None, max_length: int = 4096, ) -> Dict[str, Union[int, str, List[int]]]: """ @@ -39,54 +38,37 @@ def supervised_tokenize_sft( Args: data_point: the data point of the following format - {"messages": [{"from": "human", "content": "xxx"}, {"from": "assistant", "content": "xxx"}]} + {"messages": [{"from": "user", "content": "xxx"}, {"from": "assistant", "content": "xxx"}]} tokenizer: the tokenizer whose conversation_template: the conversation template to apply ignore_index: the ignore index when calculate loss during training max_length: the maximum context length """ - if ignore_index is None: - ignore_index = IGNORE_INDEX + ignore_index = IGNORE_INDEX messages = data_point["messages"] template = deepcopy(conversation_template) template.messages = [] - - for mess in messages: - from_str = mess["from"] - if from_str.lower() == "human": - from_str = "user" - elif from_str.lower() == "assistant": - from_str = "assistant" - else: - raise ValueError(f"Unsupported role {from_str.lower()}") - - template.append_message(from_str, mess["content"]) + for idx, mess in enumerate(messages): + if mess["from"] != template.roles[idx % 2]: + raise ValueError( + f"Message should iterate between user and assistant and starts with a \ + line from the user. Got the following data:\n{messages}" + ) + template.append_message(mess["from"], mess["content"]) if len(template.messages) % 2 != 0: + # Force to end with assistant response template.messages = template.messages[0:-1] - # `target_turn_index` is the number of turns which exceeds `max_length - 1` for the first time. - turns = [i for i in range(1, len(messages) // 2 + 1)] - - lo, hi = 0, len(turns) - while lo < hi: - mid = (lo + hi) // 2 - prompt = template.get_prompt(2 * turns[mid] - 1) - chunks, require_loss = split_templated_prompt_into_chunks( - template.messages[: 2 * turns[mid] - 1], prompt, conversation_template.end_of_assistant - ) - tokenized, starts, ends = tokenize_and_concatenate(tokenizer, chunks, require_loss) - if max_length - 1 < len(tokenized): - hi = mid - else: - lo = mid + 1 - target_turn_index = lo - - # The tokenized length for first turn already exceeds `max_length - 1`. 
- if target_turn_index - 1 < 0: - warnings.warn("The tokenized length for first turn already exceeds `max_length - 1`.") + # tokenize and calculate masked labels -100 for positions corresponding to non-assistant lines + prompt = template.get_prompt() + chunks, require_loss = split_templated_prompt_into_chunks( + template.messages, prompt, conversation_template.end_of_assistant + ) + tokenized, starts, ends = tokenize_and_concatenate(tokenizer, chunks, require_loss, max_length=max_length) + if tokenized is None: return dict( input_ids=None, labels=None, @@ -96,45 +78,18 @@ def supervised_tokenize_sft( seq_category=None, ) - target_turn = turns[target_turn_index - 1] - prompt = template.get_prompt(2 * target_turn) - chunks, require_loss = split_templated_prompt_into_chunks( - template.messages[: 2 * target_turn], prompt, conversation_template.end_of_assistant - ) - tokenized, starts, ends = tokenize_and_concatenate(tokenizer, chunks, require_loss) - labels = [ignore_index] * len(tokenized) for start, end in zip(starts, ends): - if end == len(tokenized): - tokenized = tokenized + [tokenizer.eos_token_id] - labels = labels + [ignore_index] labels[start:end] = tokenized[start:end] - # truncate the sequence at the last token that requires loss calculation - to_truncate_len = 0 - for i in range(len(tokenized) - 1, -1, -1): - if labels[i] == ignore_index: - to_truncate_len += 1 - else: - break - to_truncate_len = max(len(tokenized) - max_length, to_truncate_len) - tokenized = tokenized[: len(tokenized) - to_truncate_len] - labels = labels[: len(labels) - to_truncate_len] - if tokenizer.bos_token_id is not None: + # Force to add bos token at the beginning of the tokenized sequence if the input ids doesn;t starts with bos if tokenized[0] != tokenizer.bos_token_id: + # Some chat templates already include bos token tokenized = [tokenizer.bos_token_id] + tokenized - labels = [ignore_index] + labels + labels = [-100] + labels - if tokenizer.eos_token_id is not None: - # Force to add eos token at the end of the tokenized sequence - if tokenized[-1] != tokenizer.eos_token_id: - tokenized = tokenized + [tokenizer.eos_token_id] - labels = labels + [tokenizer.eos_token_id] - else: - labels[-1] = tokenizer.eos_token_id - - # For some model without bos/eos may raise the following errors + # log decoded inputs and labels for debugging inputs_decode = tokenizer.decode(tokenized) start = 0 end = 0 @@ -171,11 +126,10 @@ def supervised_tokenize_sft( ) -def tokenize_prompt_dataset( +def tokenize_prompt( data_point: Dict[str, str], tokenizer: PreTrainedTokenizer, conversation_template: Conversation = None, - ignore_index: int = None, max_length: int = 4096, ) -> Dict[str, Union[int, str, List[int]]]: """ @@ -183,48 +137,39 @@ def tokenize_prompt_dataset( "Something here can be system message[user_line_start]User line[User line end][Assistant line start]Assistant line[Assistant line end]...[Assistant line start]" Args: data_point: the data point of the following format - {"messages": [{"from": "human", "content": "xxx"}, {"from": "assistant", "content": "xxx"}]} + {"messages": [{"from": "user", "content": "xxx"}, {"from": "assistant", "content": "xxx"}]} tokenizer: the tokenizer whose conversation_template: the conversation template to apply ignore_index: the ignore index when calculate loss during training max_length: the maximum context length """ - if ignore_index is None: - ignore_index = IGNORE_INDEX messages = data_point["messages"] template = deepcopy(conversation_template) template.messages = [] - for mess in 
messages: - from_str = mess["from"] - if from_str.lower() == "human": - from_str = "user" - elif from_str.lower() == "assistant": - from_str = "assistant" - else: - raise ValueError(f"Unsupported role {from_str.lower()}") - - template.append_message(from_str, mess["content"]) + for idx, mess in enumerate(messages): + if mess["from"] != template.roles[idx % 2]: + raise ValueError( + f"Message should iterate between user and assistant and starts with a \ + line from the user. Got the following data:\n{messages}" + ) + template.append_message(mess["from"], mess["content"]) # `target_turn_index` is the number of turns which exceeds `max_length - 1` for the first time. - target_turn = len(template.messages) - if target_turn % 2 != 1: + if len(template.messages) % 2 != 1: # exclude the answer if provided. keep only the prompt - target_turn = target_turn - 1 + template.messages = template.messages[:-1] # Prepare data - prompt = template.get_prompt(target_turn, add_generation_prompt=True) - chunks, require_loss = split_templated_prompt_into_chunks( - template.messages[:target_turn], prompt, conversation_template.end_of_assistant - ) - tokenized, starts, ends = tokenize_and_concatenate(tokenizer, chunks, require_loss) + prompt = template.get_prompt(length=len(template.messages) - 1, add_generation_prompt=True) + tokenized = tokenizer([prompt], add_special_tokens=False)["input_ids"][0] + if tokenizer.bos_token_id is not None: if tokenized[0] != tokenizer.bos_token_id: tokenized = [tokenizer.bos_token_id] + tokenized - # Skip overlength data - if max_length - 1 < len(tokenized): + if len(tokenized) > max_length: return dict( input_ids=None, inputs_decode=None, @@ -235,47 +180,32 @@ def tokenize_prompt_dataset( # `inputs_decode` can be used to check whether the tokenization method is true. 
return dict( input_ids=tokenized, - inputs_decode=tokenizer.decode(tokenized), + inputs_decode=prompt, seq_length=len(tokenized), seq_category=data_point["category"] if "category" in data_point else "None", ) -def apply_rlhf_data_format( - template: Conversation, tokenizer: Any, context_len: int, mask_out_target_assistant_line_end=False -): +def apply_rlhf_data_format(template: Conversation, tokenizer: Any): target_turn = int(len(template.messages) / 2) prompt = template.get_prompt(target_turn * 2) chunks, require_loss = split_templated_prompt_into_chunks( template.messages[: 2 * target_turn], prompt, template.end_of_assistant ) - tokenized, starts, ends = tokenize_and_concatenate(tokenizer, chunks, require_loss) - loss_mask = [0] * len(tokenized) - mask_token = tokenizer.eos_token_id or tokenizer.pad_token_id - if mask_token is None: - mask_token = 1 # If the tokenizer doesn't have eos_token or pad_token: Qwen + # no truncation applied + tokenized, starts, ends = tokenize_and_concatenate(tokenizer, chunks, require_loss, max_length=None) + loss_mask = [0] * len(tokenized) label_decode = [] - for start, end in zip(starts[-1:], ends[-1:]): - # only the last round (chosen/rejected) counts - if end == len(tokenized): - tokenized = tokenized + [tokenizer.eos_token_id] - loss_mask = loss_mask + [1] - loss_mask[start:end] = [1] * len(loss_mask[start:end]) - label_decode.append(tokenizer.decode(tokenized[start:end], skip_special_tokens=False)) + # only the last round (chosen/rejected) is used to calculate loss + for i in range(starts[-1], ends[-1]): + loss_mask[i] = 1 + label_decode.append(tokenizer.decode(tokenized[starts[-1] : ends[-1]], skip_special_tokens=False)) if tokenizer.bos_token_id is not None: if tokenized[0] != tokenizer.bos_token_id: tokenized = [tokenizer.bos_token_id] + tokenized loss_mask = [0] + loss_mask - if tokenizer.eos_token_id is not None: - # Force to add eos token at the end of the tokenized sequence - if tokenized[-1] != tokenizer.eos_token_id: - tokenized = tokenized + [tokenizer.eos_token_id] - loss_mask = loss_mask + [1] - else: - loss_mask[-1] = 1 - return {"input_ids": tokenized, "loss_mask": loss_mask, "label_decode": label_decode} @@ -283,39 +213,29 @@ def tokenize_rlhf( data_point: Dict[str, str], tokenizer: PreTrainedTokenizer, conversation_template: Conversation = None, - ignore_index: int = None, max_length: int = 4096, ) -> Dict[str, Union[int, str, List[int]]]: """ A tokenization function to tokenize an original pretraining data point as following: - {"context": [{"from": "human", "content": "xxx"}, {"from": "assistant", "content": "xxx"}], + {"context": [{"from": "user", "content": "xxx"}, {"from": "assistant", "content": "xxx"}], "chosen": {"from": "assistant", "content": "xxx"}, "rejected": {"from": "assistant", "content": "xxx"}} """ - if ignore_index is None: - ignore_index = IGNORE_INDEX context = data_point["context"] template = deepcopy(conversation_template) template.clear() - for mess in context: - from_str = mess["from"] - if from_str.lower() == "human": - from_str = "user" - elif from_str.lower() == "assistant": - from_str = "assistant" - else: - raise ValueError(f"Unsupported role {from_str.lower()}") - - if len(template.messages) > 0 and from_str == template.messages[-1]["role"]: - # Concate adjacent message from the same role - template.messages[-1]["content"] = str(template.messages[-1]["content"] + " " + mess["content"]) - else: - template.append_message(from_str, mess["content"]) + for idx, mess in enumerate(context): + if mess["from"] != 
template.roles[idx % 2]: + raise ValueError( + f"Message should iterate between user and assistant and starts with a \ + line from the user. Got the following data:\n{context}" + ) + template.append_message(mess["from"], mess["content"]) if len(template.messages) % 2 != 1: warnings.warn( - "Please make sure leading context starts and ends with a line from human\nLeading context: " + "Please make sure leading context starts and ends with a line from user\nLeading context: " + str(template.messages) ) return dict( @@ -326,31 +246,27 @@ def tokenize_rlhf( rejected_loss_mask=None, rejected_label_decode=None, ) - round_of_context = int((len(template.messages) - 1) / 2) - assert context[-1]["from"].lower() == "human", "The last message in context should be from human." + assert context[-1]["from"].lower() == template.roles[0], "The last message in context should be from user." chosen = deepcopy(template) rejected = deepcopy(template) + chosen_continuation = data_point["chosen"] + rejected_continuation = data_point["rejected"] + for round in range(len(chosen_continuation)): + if chosen_continuation[round]["from"] != template.roles[(round + 1) % 2]: + raise ValueError( + f"Message should iterate between user and assistant and starts with a \ + line from the user. Got the following data:\n{chosen_continuation}" + ) + chosen.append_message(chosen_continuation[round]["from"], chosen_continuation[round]["content"]) - for round in range(len(data_point["chosen"])): - from_str = data_point["chosen"][round]["from"] - if from_str.lower() == "human": - from_str = "user" - elif from_str.lower() == "assistant": - from_str = "assistant" - else: - raise ValueError(f"Unsupported role {from_str.lower()}") - chosen.append_message(from_str, data_point["chosen"][round]["content"]) - - for round in range(len(data_point["rejected"])): - from_str = data_point["rejected"][round]["from"] - if from_str.lower() == "human": - from_str = "user" - elif from_str.lower() == "assistant": - from_str = "assistant" - else: - raise ValueError(f"Unsupported role {from_str.lower()}") - rejected.append_message(from_str, data_point["rejected"][round]["content"]) + for round in range(len(rejected_continuation)): + if rejected_continuation[round]["from"] != template.roles[(round + 1) % 2]: + raise ValueError( + f"Message should iterate between user and assistant and starts with a \ + line from the user. 
Got the following data:\n{rejected_continuation}" + ) + rejected.append_message(rejected_continuation[round]["from"], rejected_continuation[round]["content"]) ( chosen_input_ids, @@ -361,16 +277,14 @@ def tokenize_rlhf( rejected_label_decode, ) = (None, None, None, None, None, None) - chosen_data_packed = apply_rlhf_data_format(chosen, tokenizer, round_of_context) + chosen_data_packed = apply_rlhf_data_format(chosen, tokenizer) (chosen_input_ids, chosen_loss_mask, chosen_label_decode) = ( chosen_data_packed["input_ids"], chosen_data_packed["loss_mask"], chosen_data_packed["label_decode"], ) - rejected_data_packed = apply_rlhf_data_format( - rejected, tokenizer, round_of_context, mask_out_target_assistant_line_end=True - ) + rejected_data_packed = apply_rlhf_data_format(rejected, tokenizer) (rejected_input_ids, rejected_loss_mask, rejected_label_decode) = ( rejected_data_packed["input_ids"], rejected_data_packed["loss_mask"], @@ -387,7 +301,7 @@ def tokenize_rlhf( rejected_label_decode=None, ) # Check if loss mask is all 0s (no loss), this may happen when the tokenized length is too long - if chosen_loss_mask[1:].count(1) == 0 or rejected_loss_mask[1:].count(1) == 0: + if chosen_loss_mask.count(1) == 0 or rejected_loss_mask.count(1) == 0: return dict( chosen_input_ids=None, chosen_loss_mask=None, @@ -405,3 +319,62 @@ def tokenize_rlhf( "rejected_loss_mask": rejected_loss_mask, "rejected_label_decode": rejected_label_decode, } + + +def tokenize_kto( + data_point: Dict[str, str], + tokenizer: PreTrainedTokenizer, + conversation_template: Conversation = None, + max_length: int = 4096, +) -> Dict[str, Union[int, str, List[int]]]: + """ + Tokenize a dataset for KTO training. + The raw input data is a conversation in the following format: + { + "prompt": [{"from": "user", "content": "xxx"}...], + "completion": {"from": "assistant", "content": "xxx"}, + "label": true/false + } + It returns three fields: + the context, which contains the query and the assistant start, + the completion, which contains only the assistant's answer, + and a binary label, which indicates whether the sample is preferred. + """ + prompt = data_point["prompt"] + completion = data_point["completion"] + template = deepcopy(conversation_template) + template.clear() + + if prompt[0].get("from", None) != "user": + raise ValueError("conversation should start with user") + if completion.get("from", None) != "assistant": + raise ValueError("conversation should end with assistant") + + for mess in prompt: + if mess.get("from", None) == "user": + template.append_message("user", mess["content"]) + elif mess.get("from", None) == "assistant": + template.append_message("assistant", mess["content"]) + else: + raise ValueError(f"Unsupported role {mess.get('from', None)}") + generation_prompt = template.get_prompt(len(prompt), add_generation_prompt=True) + template.append_message("assistant", completion["content"]) + full_prompt = template.get_prompt(len(prompt) + 1, add_generation_prompt=False) + tokenized_full_prompt = tokenizer(full_prompt, add_special_tokens=False)["input_ids"] + if len(tokenized_full_prompt) + 1 > max_length: + return dict(prompt=None, completion=None, label=None, input_id_decode=None, completion_decode=None) + tokenized_generation_prompt = tokenizer(generation_prompt, add_special_tokens=False)["input_ids"] + tokenized_completion = tokenized_full_prompt[len(tokenized_generation_prompt) :] + tokenized_completion = deepcopy(tokenized_completion) + if tokenizer.bos_token_id is not None and
tokenized_generation_prompt[0] != tokenizer.bos_token_id: + tokenized_generation_prompt = [tokenizer.bos_token_id] + tokenized_generation_prompt + decoded_full_prompt = tokenizer.decode(tokenized_full_prompt, skip_special_tokens=False) + decoded_completion = tokenizer.decode(tokenized_completion, skip_special_tokens=False) + + return { + "prompt": tokenized_generation_prompt, + "completion": tokenized_completion, + "label": data_point["label"], + "input_id_decode": decoded_full_prompt, + "completion_decode": decoded_completion, + } diff --git a/applications/ColossalChat/coati/dataset/utils.py b/applications/ColossalChat/coati/dataset/utils.py index f41a4d772..42c3191db 100755 --- a/applications/ColossalChat/coati/dataset/utils.py +++ b/applications/ColossalChat/coati/dataset/utils.py @@ -88,7 +88,13 @@ def find_first_occurrence_subsequence(seq: torch.Tensor, subseq: torch.Tensor, s return -1 -def tokenize_and_concatenate(tokenizer: PreTrainedTokenizer, text: List[str], require_loss: List[bool]): +def tokenize_and_concatenate( + tokenizer: PreTrainedTokenizer, + text: List[str], + require_loss: List[bool], + max_length: int, + discard_non_loss_tokens_at_tail: bool = True, +): """ Tokenizes a list of texts using the provided tokenizer and concatenates the tokenized outputs. @@ -96,6 +102,13 @@ def tokenize_and_concatenate(tokenizer: PreTrainedTokenizer, text: List[str], re tokenizer (PreTrainedTokenizer): The tokenizer to use for tokenization. text (List[str]): The list of texts to tokenize. require_loss (List[bool]): A list of boolean values indicating whether each text requires loss calculation. + max_length: used to truncate the input ids + discard_non_loss_tokens_at_tail: whether to discard the non-loss tokens at the tail + + if the first round has already exeeded max length + - if the user query already exeeded max length, discard the sample + - if only the first assistant response exeeded max length, truncate the response to fit the max length + else keep the first several complete rounds of the conversations until max length is reached Returns: Tuple[List[int], List[int], List[int]]: A tuple containing the concatenated tokenized input ids, @@ -106,10 +119,18 @@ def tokenize_and_concatenate(tokenizer: PreTrainedTokenizer, text: List[str], re loss_ends = [] for s, r in zip(text, require_loss): tokenized = tokenizer(s, add_special_tokens=False)["input_ids"] - if r: - loss_starts.append(len(input_ids)) - loss_ends.append(len(input_ids) + len(tokenized)) - input_ids.extend(tokenized) + if not max_length or len(input_ids) + len(tokenized) <= max_length or len(loss_ends) == 0: + if r: + loss_starts.append(len(input_ids)) + loss_ends.append(len(input_ids) + len(tokenized)) + input_ids.extend(tokenized) + if max_length and loss_starts[0] >= max_length: + return None, None, None + if discard_non_loss_tokens_at_tail: + input_ids = input_ids[: loss_ends[-1]] + if max_length: + input_ids = input_ids[:max_length] + loss_ends[-1] = min(max_length, loss_ends[-1]) return input_ids, loss_starts, loss_ends @@ -125,6 +146,12 @@ def split_templated_prompt_into_chunks(messages: List[Dict[str, str]], prompt: s content_length = ( prompt.find(end_of_assistant, first_occur + content_length) + len(end_of_assistant) - first_occur ) + # if the tokenized content start with a leading space, we want to keep it in loss calculation + # e.g., Assistant: I am saying... 
+ # if the tokenized content doesn't start with a leading space, we only need to keep the content in loss calculation + # e.g., + # Assistant: # '\n' as line breaker + # I am saying... if prompt[first_occur - 1] != " ": chunks.append(prompt[start_idx:first_occur]) chunks.append(prompt[first_occur : first_occur + content_length]) diff --git a/applications/ColossalChat/coati/models/__init__.py b/applications/ColossalChat/coati/models/__init__.py index 14073207f..f554cbfa5 100755 --- a/applications/ColossalChat/coati/models/__init__.py +++ b/applications/ColossalChat/coati/models/__init__.py @@ -2,7 +2,7 @@ from .base import BaseModel from .critic import Critic from .generation import generate, generate_streaming, prepare_inputs_fn, update_model_kwargs_fn from .lora import convert_to_lora_module -from .loss import DpoLoss, LogExpLoss, LogSigLoss, PolicyLoss, ValueLoss +from .loss import DpoLoss, KTOLoss, LogExpLoss, LogSigLoss, PolicyLoss, ValueLoss from .reward_model import RewardModel from .utils import disable_dropout @@ -16,7 +16,7 @@ __all__ = [ "LogExpLoss", "convert_to_lora_module", "DpoLoss", - "generate", + "KTOLoss", "generate", "generate_streaming", "disable_dropout", "update_model_kwargs_fn", diff --git a/applications/ColossalChat/coati/models/base.py b/applications/ColossalChat/coati/models/base.py index fcea9414b..cfdffdf28 100755 --- a/applications/ColossalChat/coati/models/base.py +++ b/applications/ColossalChat/coati/models/base.py @@ -42,7 +42,6 @@ class BaseModel(nn.Module): out = self.model(dummy_input) self.last_hidden_state_size = out.last_hidden_state.shape[-1] self.model = self.model.cpu() - # print("self.last_hidden_state_size: ",self.last_hidden_state_size) def resize_token_embeddings(self, *args, **kwargs): """ diff --git a/applications/ColossalChat/coati/models/lora.py b/applications/ColossalChat/coati/models/lora.py index 9553b00ff..116c5acec 100755 --- a/applications/ColossalChat/coati/models/lora.py +++ b/applications/ColossalChat/coati/models/lora.py @@ -50,7 +50,7 @@ class LoraLinear(lora.LoRALayer, nn.Module): self.fan_in_fan_out = fan_in_fan_out # Actual trainable parameters if r > 0: - self.lora_A = nn.Parameter(self.weight.new_zeros((r, in_features))) + self.lora_A = nn.Parameter(self.weight.new_zeros((r, in_features)), requires_grad=False) self.lora_B = nn.Parameter(self.weight.new_zeros((out_features, r))) self.scaling = self.lora_alpha / self.r # Freezing the pre-trained weight matrix diff --git a/applications/ColossalChat/coati/models/loss.py b/applications/ColossalChat/coati/models/loss.py index e6872276d..840cca074 100755 --- a/applications/ColossalChat/coati/models/loss.py +++ b/applications/ColossalChat/coati/models/loss.py @@ -5,6 +5,7 @@ loss functions from typing import Optional, Tuple import torch +import torch.distributed as dist import torch.nn as nn from .utils import masked_mean @@ -201,7 +202,72 @@ class OddsRatioLoss(nn.Module): chosen_odds_masked = torch.sum(chosen_odds * chosen_loss_mask.float()) / torch.sum(chosen_loss_mask) reject_odds = reject_logp - torch.log(-torch.exp(reject_logp) + 1.0001) reject_odds_masked = torch.sum(reject_odds * reject_loss_mask.float()) / torch.sum(reject_loss_mask) - # print("chosen_odds_masked", chosen_odds_masked[0], "reject_odds_masked", reject_odds_masked[0]) log_odds_ratio = chosen_odds_masked - reject_odds_masked ratio = torch.log(torch.nn.functional.sigmoid(log_odds_ratio)) return ratio.to(dtype=torch.bfloat16), log_odds_ratio + + +class KTOLoss(nn.Module): + def __init__(self, beta: float = 0.1,
desirable_weight: float = 1.0, undesirable_weight: float = 1.0): + """ + Args: + beta: The temperature parameter in the KTO paper. + desirable_weight: The weight for the desirable responses. + undesirable_weight: The weight for the undesirable + """ + super().__init__() + self.beta = beta + self.desirable_weight = desirable_weight + self.undesirable_weight = undesirable_weight + + def forward( + self, + chosen_logps: torch.Tensor, + rejected_logps: torch.Tensor, + kl_logps: torch.Tensor, + ref_chosen_logps: torch.Tensor, + ref_rejected_logps: torch.Tensor, + ref_kl_logps: torch.Tensor, + ): + """ + Reference: + https://github.com/huggingface/trl/blob/a2adfb836a90d1e37b1253ab43dace05f1241e04/trl/trainer/kto_trainer.py#L585 + + Compute the KTO loss for a batch of policy and reference model log probabilities. + Args: + chosen_logps: Log probabilities of the policy model for the chosen responses. Shape: (batch_size,) + rejected_logps: Log probabilities of the policy model for the rejected responses. Shape: (batch_size,) + kl_logps: KL divergence of the policy model. Shape: (batch_size,) + ref_chosen_logps: Log probabilities of the reference model for the chosen responses. Shape: (batch_size,) + ref_rejected_logps: Log probabilities of the reference model for the rejected responses. Shape: (batch_size,) + ref_kl_logps: KL divergence of the reference model. Shape: (batch_size,) + beta: The temperature parameter in the DPO paper. + desirable_weight: The weight for the desirable responses. + undesirable_weight: The weight for the undesirable responses. + + Refer to the KTO paper for details about hyperparameters https://arxiv.org/pdf/2402.01306 + """ + kl = (kl_logps - ref_kl_logps).mean().detach() + # all gather + dist.all_reduce(kl, op=dist.ReduceOp.SUM) + kl = (kl / dist.get_world_size()).clamp(min=0) + + if chosen_logps.shape[0] != 0 and ref_chosen_logps.shape[0] != 0: + chosen_logratios = chosen_logps - ref_chosen_logps + chosen_losses = 1 - nn.functional.sigmoid(self.beta * (chosen_logratios - kl)) + chosen_rewards = self.beta * chosen_logratios.detach() + else: + chosen_losses = torch.Tensor([]).to(kl_logps.device) + chosen_rewards = torch.Tensor([]).to(kl_logps.device) + + if rejected_logps.shape[0] != 0 and ref_rejected_logps.shape[0] != 0: + rejected_logratios = rejected_logps - ref_rejected_logps + rejected_losses = 1 - nn.functional.sigmoid(self.beta * (kl - rejected_logratios)) + rejected_rewards = self.beta * rejected_logratios.detach() + else: + rejected_losses = torch.Tensor([]).to(kl_logps.device) + rejected_rewards = torch.Tensor([]).to(kl_logps.device) + + losses = torch.cat((self.desirable_weight * chosen_losses, self.undesirable_weight * rejected_losses), 0).mean() + + return losses, chosen_rewards, rejected_rewards, kl diff --git a/applications/ColossalChat/coati/trainer/__init__.py b/applications/ColossalChat/coati/trainer/__init__.py index 6ce159678..6d0900153 100755 --- a/applications/ColossalChat/coati/trainer/__init__.py +++ b/applications/ColossalChat/coati/trainer/__init__.py @@ -1,8 +1,18 @@ from .base import OLTrainer, SLTrainer from .dpo import DPOTrainer +from .kto import KTOTrainer from .orpo import ORPOTrainer from .ppo import PPOTrainer from .rm import RewardModelTrainer from .sft import SFTTrainer -__all__ = ["SLTrainer", "OLTrainer", "RewardModelTrainer", "SFTTrainer", "PPOTrainer", "DPOTrainer", "ORPOTrainer"] +__all__ = [ + "SLTrainer", + "OLTrainer", + "RewardModelTrainer", + "SFTTrainer", + "PPOTrainer", + "DPOTrainer", + "ORPOTrainer", + "KTOTrainer", +] 
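Since the patch exports `KTOLoss` from `coati.models` and wires it into the new `KTOTrainer`, a minimal standalone sketch of the loss may help reviewers sanity-check the objective. This is illustrative only and not part of the patch: the dummy per-sequence log-probabilities and the single-process `gloo` group are assumptions (a process group is needed because the loss all-reduces its KL estimate), and it assumes the patched `coati` package is importable.

```python
# Illustrative sketch only -- not part of this patch.
import os

import torch
import torch.distributed as dist
from coati.models.loss import KTOLoss  # added by this patch

# KTOLoss.forward() all-reduces the KL term, so set up a 1-process gloo group.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

loss_fn = KTOLoss(beta=0.1, desirable_weight=1.0, undesirable_weight=1.0)

# Made-up per-sequence log-probs: two desirable (label=1) and one undesirable (label=0) sample.
chosen_logps = torch.tensor([-12.3, -15.1])      # policy log p(completion | prompt), desirable
rejected_logps = torch.tensor([-20.4])           # policy log p(completion | prompt), undesirable
kl_logps = torch.tensor([-14.0, -16.0, -19.0])   # policy log-probs on the mismatched (KL) batch
ref_chosen_logps = torch.tensor([-13.0, -15.0])  # same quantities under the frozen reference model
ref_rejected_logps = torch.tensor([-19.5])
ref_kl_logps = torch.tensor([-14.2, -15.8, -18.9])

loss, chosen_rewards, rejected_rewards, kl = loss_fn(
    chosen_logps, rejected_logps, kl_logps, ref_chosen_logps, ref_rejected_logps, ref_kl_logps
)
print(f"loss={loss.item():.4f} kl={kl.item():.4f}")
print("chosen rewards:", chosen_rewards, "rejected rewards:", rejected_rewards)

dist.destroy_process_group()
```

Samples with `label == 1` flow through the `chosen_*` tensors and samples with `label == 0` through the `rejected_*` tensors, mirroring how `KTOTrainer` splits each batch by `label` before calling the loss.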
diff --git a/applications/ColossalChat/coati/trainer/dpo.py b/applications/ColossalChat/coati/trainer/dpo.py index 3daab54f6..c7ef2be8f 100755 --- a/applications/ColossalChat/coati/trainer/dpo.py +++ b/applications/ColossalChat/coati/trainer/dpo.py @@ -26,7 +26,7 @@ from .utils import is_rank_0, to_device class DPOTrainer(SLTrainer): """ - Trainer for PPO algorithm. + Trainer for DPO algorithm. Args: actor (Actor): the actor model in ppo algorithm diff --git a/applications/ColossalChat/coati/trainer/kto.py b/applications/ColossalChat/coati/trainer/kto.py new file mode 100755 index 000000000..8ab0bc66b --- /dev/null +++ b/applications/ColossalChat/coati/trainer/kto.py @@ -0,0 +1,318 @@ +""" +KTO trainer +""" + +import os +from typing import Any, Optional + +import torch +import torch.distributed +from coati.models.loss import KTOLoss +from coati.models.utils import calc_masked_log_probs +from coati.trainer.utils import all_reduce_mean +from coati.utils import AccumulativeMeanMeter, save_checkpoint +from torch.optim import Optimizer +from torch.optim.lr_scheduler import _LRScheduler +from torch.utils.data import DataLoader +from tqdm import trange +from transformers import PreTrainedTokenizerBase + +from colossalai.booster import Booster +from colossalai.cluster import DistCoordinator +from colossalai.utils import get_current_device + +from .base import SLTrainer +from .utils import is_rank_0, to_device + + +class KTOTrainer(SLTrainer): + """ + Trainer for KTO algorithm. + + Args: + actor (Actor): the actor model in ppo algorithm + ref_model (Critic): the reference model in ppo algorithm + booster (Strategy): the strategy to use for training + actor_optim (Optimizer): the optimizer to use for actor model + actor_lr_scheduler (_LRScheduler): the lr scheduler to use for actor model + tokenizer (PreTrainedTokenizerBase): the tokenizer to use for encoding + max_epochs (int, defaults to 1): the max number of epochs to train + accumulation_steps (int): the number of steps to accumulate gradients + start_epoch (int, defaults to 0): the start epoch, non-zero if resumed from a checkpoint + save_interval (int): the interval to save model checkpoints, default to 0, which means no checkpoint will be saved during trainning + save_dir (str): the directory to save checkpoints + coordinator (DistCoordinator): the coordinator to use for distributed logging + beta (float, defaults to 0.1): the beta parameter in kto loss + desirable_weight (float, defaults to 1.0): the weight for desirable reward + undesirable_weight (float, defaults to 1.0): the weight for undesirable reward + """ + + def __init__( + self, + actor: Any, + ref_model: Any, + booster: Booster, + actor_optim: Optimizer, + actor_lr_scheduler: _LRScheduler, + tokenizer: PreTrainedTokenizerBase, + max_epochs: int = 1, + beta: float = 0.1, + desirable_weight: float = 1.0, + undesirable_weight: float = 1.0, + accumulation_steps: int = 1, + start_epoch: int = 0, + save_interval: int = 0, + save_dir: str = None, + coordinator: DistCoordinator = None, + ) -> None: + super().__init__(booster, max_epochs=max_epochs, model=actor, optimizer=actor_optim, start_epoch=start_epoch) + self.ref_model = ref_model + self.actor_scheduler = actor_lr_scheduler + self.tokenizer = tokenizer + self.kto_loss = KTOLoss(beta=beta, desirable_weight=desirable_weight, undesirable_weight=undesirable_weight) + self.save_interval = save_interval + self.coordinator = coordinator + self.save_dir = save_dir + self.num_train_step = 0 + self.accumulation_steps = accumulation_steps + 
self.device = get_current_device() + self.accumulative_meter = AccumulativeMeanMeter() + self.desirable_weight = desirable_weight + self.undesirable_weight = undesirable_weight + self.beta = beta + + def _before_fit( + self, + train_preference_dataloader: DataLoader = None, + eval_preference_dataloader: DataLoader = None, + log_dir: Optional[str] = None, + use_wandb: bool = False, + ): + """ + Args: + prompt_dataloader (DataLoader): the dataloader to use for prompt data + pretrain_dataloader (DataLoader): the dataloader to use for pretrain data + """ + self.train_dataloader = train_preference_dataloader + self.eval_dataloader = eval_preference_dataloader + self.writer = None + if use_wandb and is_rank_0(): + assert log_dir is not None, "log_dir must be provided when use_wandb is True" + import wandb + + self.wandb_run = wandb.init(project="Coati-kto", sync_tensorboard=True) + if log_dir is not None and is_rank_0(): + import os + import time + + from torch.utils.tensorboard import SummaryWriter + + log_dir = os.path.join(log_dir, "kto") + log_dir = os.path.join(log_dir, time.strftime("%Y-%m-%d_%H:%M:%S", time.localtime())) + self.writer = SummaryWriter(log_dir=log_dir) + + def _train(self, epoch: int): + """ + Args: + epoch int: the number of current epoch + """ + self.model.train() + self.accumulative_meter.reset() + step_bar = trange( + len(self.train_dataloader) // self.accumulation_steps, + desc=f"Epoch {epoch + 1}/{self.max_epochs}", + disable=not is_rank_0(), + ) + for i, batch in enumerate(self.train_dataloader): + batch = to_device(batch, self.device) + (input_ids, attention_mask, loss_mask, label, kl_input_ids, kl_attention_mask, kl_loss_mask) = ( + batch["input_ids"], + batch["attention_mask"], + batch["loss_mask"], + batch["label"], + batch["kl_input_ids"], + batch["kl_attention_mask"], + batch["kl_loss_mask"], + ) + batch_size = input_ids.size()[0] + + # actor logits + with torch.no_grad(): + # calculate KL term with KT data + kl_logits = self.model( + input_ids=kl_input_ids, + attention_mask=kl_attention_mask, + )["logits"] + + logits = self.model( + input_ids=input_ids, + attention_mask=attention_mask, + )["logits"] + + logprob = calc_masked_log_probs(logits, input_ids, loss_mask[:, 1:]).sum(-1) + kl_logprob = calc_masked_log_probs(kl_logits, kl_input_ids, kl_loss_mask[:, 1:]).sum(-1) + chosen_index = [i for i in range(batch_size) if label[i] == 1] + rejected_index = [i for i in range(batch_size) if label[i] == 0] + chosen_logprob = logprob[chosen_index] + rejected_logprob = logprob[rejected_index] + with torch.no_grad(): + ref_kl_logits = self.ref_model( + input_ids=kl_input_ids, + attention_mask=kl_attention_mask, + )["logits"] + ref_logits = self.ref_model( + input_ids=input_ids, + attention_mask=attention_mask, + )["logits"] + + ref_logprob = calc_masked_log_probs(ref_logits, input_ids, loss_mask[:, 1:]).sum(-1) + ref_kl_logprob = calc_masked_log_probs(ref_kl_logits, kl_input_ids, kl_loss_mask[:, 1:]).sum(-1) + ref_chosen_logprob = ref_logprob[chosen_index] + ref_rejected_logprob = ref_logprob[rejected_index] + + loss, chosen_rewards, rejected_rewards, kl = self.kto_loss( + chosen_logprob, rejected_logprob, kl_logprob, ref_chosen_logprob, ref_rejected_logprob, ref_kl_logprob + ) + + self.booster.backward(loss=loss, optimizer=self.optimizer) + if self.num_train_step % self.accumulation_steps == self.accumulation_steps - 1: + self.optimizer.step() + self.optimizer.zero_grad() + self.actor_scheduler.step() + + # sync + loss_mean = all_reduce_mean(tensor=loss) + 
chosen_rewards_mean = all_reduce_mean(tensor=chosen_rewards.mean()) + rejected_rewards_mean = all_reduce_mean(tensor=rejected_rewards.mean()) + self.accumulative_meter.add("chosen_rewards", chosen_rewards_mean.to(torch.float16).mean().item()) + self.accumulative_meter.add("rejected_rewards", rejected_rewards_mean.to(torch.float16).mean().item()) + self.accumulative_meter.add("loss", loss_mean.to(torch.float16).detach().item()) + + if i % self.accumulation_steps == self.accumulation_steps - 1: + self.num_train_step += 1 + step_bar.update() + # logging + if self.writer and is_rank_0(): + self.writer.add_scalar("train/loss", self.accumulative_meter.get("loss"), self.num_train_step) + self.writer.add_scalar("train/lr", self.optimizer.param_groups[0]["lr"], self.num_train_step) + self.writer.add_scalar( + "train/chosen_rewards", self.accumulative_meter.get("chosen_rewards"), self.num_train_step + ) + self.writer.add_scalar( + "train/rejected_rewards", + self.accumulative_meter.get("rejected_rewards"), + self.num_train_step, + ) + self.writer.add_scalar( + "train/margin", + self.accumulative_meter.get("chosen_rewards") - self.accumulative_meter.get("rejected_rewards"), + self.num_train_step, + ) + self.accumulative_meter.reset() + + if self.save_dir is not None and (self.num_train_step + 1) % self.save_interval == 0: + # save checkpoint + self.coordinator.print_on_master("\nStart saving model checkpoint with running states") + save_checkpoint( + save_dir=self.save_dir, + booster=self.booster, + model=self.model, + optimizer=self.optimizer, + lr_scheduler=self.actor_scheduler, + epoch=epoch, + step=i + 1, + batch_size=batch_size, + coordinator=self.coordinator, + ) + self.coordinator.print_on_master( + f"Saved checkpoint at epoch {epoch} step {i + 1} at folder {self.save_dir}" + ) + + step_bar.close() + + def _eval(self, epoch: int): + """ + Args: + epoch int: the number of current epoch + """ + if self.eval_dataloader is None: + self.coordinator.print_on_master("No eval dataloader is provided, skip evaluation") + return + self.model.eval() + self.accumulative_meter.reset() + step_bar = trange( + len(self.eval_dataloader) // self.accumulation_steps, + desc=f"Epoch {epoch + 1}/{self.max_epochs}", + disable=not is_rank_0(), + ) + for i, batch in enumerate(self.eval_dataloader): + batch = to_device(batch, self.device) + (input_ids, attention_mask, loss_mask, label, kl_input_ids, kl_attention_mask, kl_loss_mask) = ( + batch["input_ids"], + batch["attention_mask"], + batch["loss_mask"], + batch["label"], + batch["kl_input_ids"], + batch["kl_attention_mask"], + batch["kl_loss_mask"], + ) + batch_size = input_ids.size()[0] + + # actor logits + with torch.no_grad(): + # calculate KL term with KT data + kl_logits = self.model( + input_ids=kl_input_ids, + attention_mask=kl_attention_mask, + )["logits"] + + logits = self.model( + input_ids=input_ids, + attention_mask=attention_mask, + )["logits"] + + logprob = calc_masked_log_probs(logits, input_ids, loss_mask[:, 1:]).sum(-1) + kl_logprob = calc_masked_log_probs(kl_logits, kl_input_ids, kl_loss_mask[:, 1:]).sum(-1) + chosen_index = [i for i in range(batch_size) if label[i] == 1] + rejected_index = [i for i in range(batch_size) if label[i] == 0] + chosen_logprob = logprob[chosen_index] + rejected_logprob = logprob[rejected_index] + with torch.no_grad(): + ref_kl_logits = self.ref_model( + input_ids=kl_input_ids, + attention_mask=kl_attention_mask, + )["logits"] + + ref_logits = self.ref_model( + input_ids=input_ids, +
attention_mask=attention_mask, + )["logits"] + + ref_logprob = calc_masked_log_probs(ref_logits, input_ids, loss_mask[:, 1:]).sum(-1) + ref_kl_logprob = calc_masked_log_probs(ref_kl_logits, kl_input_ids, kl_loss_mask[:, 1:]).sum(-1) + ref_chosen_logprob = ref_logprob[chosen_index] + ref_rejected_logprob = ref_logprob[rejected_index] + + loss, chosen_rewards, rejected_rewards, kl = self.kto_loss( + chosen_logprob, rejected_logprob, kl_logprob, ref_chosen_logprob, ref_rejected_logprob, ref_kl_logprob + ) + + # sync + loss_mean = all_reduce_mean(tensor=loss) + chosen_rewards_mean = all_reduce_mean(tensor=chosen_rewards.mean()) + rejected_rewards_mean = all_reduce_mean(tensor=rejected_rewards.mean()) + self.accumulative_meter.add("chosen_rewards", chosen_rewards_mean.to(torch.float16).mean().item()) + self.accumulative_meter.add("rejected_rewards", rejected_rewards_mean.to(torch.float16).mean().item()) + self.accumulative_meter.add("loss", loss_mean.to(torch.float16).detach().item()) + self.accumulative_meter.add( + "margin", (chosen_rewards_mean - rejected_rewards_mean).to(torch.float16).mean().item() + ) + step_bar.update() + msg = "Evaluation Result:\n" + for tag in ["loss", "chosen_rewards", "rejected_rewards", "margin"]: + msg = msg + f"{tag}: {self.accumulative_meter.get(tag)}\n" + self.coordinator.print_on_master(msg) + os.makedirs(self.save_dir, exist_ok=True) + with open(os.path.join(self.save_dir, f"eval_result_epoch{epoch}.txt"), "w") as f: + f.write(msg) + step_bar.close() diff --git a/applications/ColossalChat/coati/trainer/orpo.py b/applications/ColossalChat/coati/trainer/orpo.py index 495bb332b..b039da4af 100644 --- a/applications/ColossalChat/coati/trainer/orpo.py +++ b/applications/ColossalChat/coati/trainer/orpo.py @@ -26,7 +26,7 @@ from .utils import is_rank_0, to_device class ORPOTrainer(SLTrainer): """ - Trainer for PPO algorithm. + Trainer for ORPO algorithm. Args: actor (Actor): the actor model in ppo algorithm diff --git a/applications/ColossalChat/examples/README.md b/applications/ColossalChat/examples/README.md index d6114c8d5..f68875568 100755 --- a/applications/ColossalChat/examples/README.md +++ b/applications/ColossalChat/examples/README.md @@ -30,6 +30,8 @@ - [DPO Stage 1: Supervised Instruction Tuning](#dpo-training-stage1---supervised-instructs-tuning) - [DPO Stage 2: DPO Training](#dpo-training-stage2---dpo-training) - [Alternative Option For RLHF: Simple Preference Optimization](#alternative-option-for-rlhf-simple-preference-optimization) + - [Alternative Option For RLHF: Kahneman-Tversky Optimization (KTO)](#alternative-option-for-rlhf-kahneman-tversky-optimization-kto) + - [Alternative Option For RLHF: Odds Ratio Preference Optimization](#alternative-option-for-rlhf-odds-ratio-preference-optimization) - [List of Supported Models](#list-of-supported-models) - [Hardware Requirements](#hardware-requirements) - [Inference example](#inference-example) @@ -446,7 +448,7 @@ The first step in Stage 1 is to collect a dataset of human demonstrations of the {"messages": [ { - "from": "human", + "from": "user", "content": "what are some pranks with a pen i can do?" }, { @@ -527,7 +529,7 @@ Below shows the preference dataset format used in training the reward model. [ {"context": [ { - "from": "human", + "from": "user", "content": "Introduce butterflies species in Oregon." 
} ] @@ -596,7 +598,7 @@ In stage3 we will use reinforcement learning algorithm--- Proximal Policy Optimi #### Step 1: Data Collection -PPO uses two kinds of training data--- the prompt data and the pretrain data (optional). The first dataset is mandatory, data samples within the prompt dataset ends with a line from "human" and thus the "assistant" needs to generate a response to answer to the "human". Note that you can still use conversation that ends with a line from the "assistant", in that case, the last line will be dropped. Here is an example of the prompt dataset format. +PPO uses two kinds of training data--- the prompt data and the pretrain data (optional). The first dataset is mandatory; data samples within the prompt dataset end with a line from the "user", and thus the "assistant" needs to generate a response to answer the "user". Note that you can still use conversations that end with a line from the "assistant"; in that case, the last line will be dropped. Here is an example of the prompt dataset format. ```json @@ -604,7 +606,7 @@ PPO uses two kinds of training data--- the prompt data and the pretrain data (op {"messages": [ { - "from": "human", + "from": "user", "content": "what are some pranks with a pen i can do?" } ... @@ -744,13 +746,40 @@ with a Reference-Free Reward](https://arxiv.org/pdf/2405.14734) (SimPO). Which i ### Alternative Option For RLHF: Odds Ratio Preference Optimization -We support the method introduced in the paper [ORPO: Monolithic Preference Optimization without Reference Model](https://arxiv.org/abs/2403.07691) (ORPO). Which is a reference model free aligment method that mixes the SFT loss with a reinforcement learning loss that uses odds ratio as the implicit reward to enhance training stability and efficiency. Simply set the flag to disable the use of the reference model, set the reward target margin and enable length normalization in the DPO training script. To use ORPO in alignment, use the [train_orpo.sh](./training_scripts/train_orpo.sh) script, You can set the value for `lambda` (which determine how strongly the reinforcement learning loss affect the training) but it is optional. +We support the method introduced in the paper [ORPO: Monolithic Preference Optimization without Reference Model](https://arxiv.org/abs/2403.07691) (ORPO), a reference-model-free alignment method that mixes the SFT loss with a reinforcement learning loss that uses the odds ratio as the implicit reward to enhance training stability and efficiency. To use ORPO in alignment, use the [train_orpo.sh](./training_scripts/train_orpo.sh) script. You can optionally set the value of `lambda`, which determines how strongly the reinforcement learning loss affects the training. #### ORPO Result
+
+
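For readers wondering about the `lambda` knob mentioned in the ORPO section above: in the ORPO paper's formulation (quoted here for reference; this is the paper's notation, not a claim about the exact code path in this repository), the objective adds a lambda-weighted odds-ratio term to the SFT loss:

```latex
\mathcal{L}_{\mathrm{ORPO}}
  = \mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\mathcal{L}_{\mathrm{SFT}}(x, y_w)
    + \lambda \, \mathcal{L}_{\mathrm{OR}}(x, y_w, y_l)\right],
\qquad
\mathcal{L}_{\mathrm{OR}}
  = -\log \sigma\!\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right),
\qquad
\mathrm{odds}_\theta(y \mid x) = \frac{p_\theta(y \mid x)}{1 - p_\theta(y \mid x)}
```

A larger `lambda` pushes the policy harder toward the chosen response relative to the rejected one, while `lambda = 0` reduces the objective to plain SFT on the chosen responses.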