ColossalAI/applications/Chat/examples/generate_prompt_dataset.py

import argparse
import json
import random

random.seed(42)


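def sample(args):
    # Load the source dataset: a JSON list of records that each carry an "instruction" field.
    with open(args.dataset_path, mode="r", encoding="utf-8") as f:
        dataset_list = json.load(f)

    # random.sample raises ValueError when asked for more items than exist, so cap the size.
    sample_size = min(args.sample_size, len(dataset_list))

    # Keep only the instruction text and assign sequential ids in sampled order.
    sampled_dataset = [
        {"instruction": record["instruction"], "id": idx}
        for idx, record in enumerate(random.sample(dataset_list, sample_size))
    ]

    # ensure_ascii=False writes non-ASCII text verbatim, so pin the file encoding to UTF-8.
    with open(args.save_path, mode="w", encoding="utf-8") as f:
        json.dump(sampled_dataset, f, indent=4, default=str, ensure_ascii=False)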


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset_path", type=str, required=True, help="path to the pretrain dataset")
    parser.add_argument("--save_path", type=str, default="prompt.json", help="path to save the prompt dataset")
    parser.add_argument("--sample_size", type=int, default=16384, help="size of the prompt dataset")
    args = parser.parse_args()
    sample(args)
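
A minimal sketch of the same sampling step on in-memory data, for readers who want to see the output shape without preparing a dataset file. The helper name `sample_prompts`, the toy records, and the extra `input`/`output` fields are assumptions made for illustration and are not part of the script. The sketch also caps the sample size at the dataset length and uses a local `random.Random` instance, which the script above does not necessarily do.

```python
import json
import random


def sample_prompts(records, sample_size, seed=42):
    """Hypothetical helper: draw records without replacement and keep only the instruction."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    # Assumption: cap the request so small datasets do not raise ValueError.
    picked = rng.sample(records, min(sample_size, len(records)))
    return [{"instruction": r["instruction"], "id": i} for i, r in enumerate(picked)]


# Toy records; the field names other than "instruction" are illustrative only.
records = [
    {"instruction": f"Question {n}", "input": "", "output": f"Answer {n}"}
    for n in range(10)
]

prompts = sample_prompts(records, sample_size=3)
print(json.dumps(prompts, indent=4, ensure_ascii=False))
```

Each output entry holds only `instruction` and a sequential `id`; every other field of the source record is dropped.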