ColossalAI/applications/ChatGPT
ver217 1b34701027
[app] add chatgpt application (#2698)
2023-02-14 22:17:25 +08:00
benchmarks [app] add chatgpt application (#2698) 2023-02-14 22:17:25 +08:00
chatgpt [app] add chatgpt application (#2698) 2023-02-14 22:17:25 +08:00
examples [app] add chatgpt application (#2698) 2023-02-14 22:17:25 +08:00
requirements [app] add chatgpt application (#2698) 2023-02-14 22:17:25 +08:00
tests [app] add chatgpt application (#2698) 2023-02-14 22:17:25 +08:00
.gitignore [app] add chatgpt application (#2698) 2023-02-14 22:17:25 +08:00
LICENSE [app] add chatgpt application (#2698) 2023-02-14 22:17:25 +08:00
README.md [app] add chatgpt application (#2698) 2023-02-14 22:17:25 +08:00
pytest.ini [app] add chatgpt application (#2698) 2023-02-14 22:17:25 +08:00
setup.py [app] add chatgpt application (#2698) 2023-02-14 22:17:25 +08:00
version.txt [app] add chatgpt application (#2698) 2023-02-14 22:17:25 +08:00

README.md

RLHF - ColossalAI

Implementation of RLHF (Reinforcement Learning from Human Feedback) powered by ColossalAI. It supports distributed training and offloading, so it can fit extremely large models.

Training process (step 3)

Install

pip install .

Usage

The main entrypoint is Trainer. Only the PPO trainer is supported for now. The following training strategies are available:

  • NaiveStrategy: the simplest strategy. Trains on a single GPU.
  • DDPStrategy: uses torch.nn.parallel.DistributedDataParallel. Trains on multiple GPUs.
  • ColossalAIStrategy: uses Gemini and ZeRO from ColossalAI. It eliminates model duplication on each GPU and supports offloading, which is very useful when training large models on multiple GPUs.
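
One way to choose among these strategies at launch time is a small name-to-factory lookup. The sketch below is illustrative only: `build_strategy` and `default_registry` are hypothetical helpers, not part of the `chatgpt` package. It also assumes that `NaiveStrategy` and `DDPStrategy` can be imported from `chatgpt.trainer.strategies` and constructed without arguments; this README only shows that for `ColossalAIStrategy`.

```python
from typing import Any, Callable, Dict


def default_registry() -> Dict[str, Callable[[], Any]]:
    """Map short names to zero-argument strategy factories.

    The import is deferred so this module loads even when the package is
    not installed. Assumption: all three classes live in
    chatgpt.trainer.strategies and accept no required arguments.
    """
    from chatgpt.trainer.strategies import (
        ColossalAIStrategy,
        DDPStrategy,
        NaiveStrategy,
    )

    return {
        "naive": NaiveStrategy,
        "ddp": DDPStrategy,
        "colossalai": ColossalAIStrategy,
    }


def build_strategy(name: str, registry: Dict[str, Callable[[], Any]]) -> Any:
    """Return a new strategy instance for `name`, or raise ValueError."""
    try:
        factory = registry[name]
    except KeyError:
        known = ", ".join(sorted(registry))
        raise ValueError(
            f"unknown strategy {name!r}; expected one of: {known}"
        ) from None
    return factory()
```

With the package installed, `build_strategy("colossalai", default_registry())` would stand in for the direct `ColossalAIStrategy()` call. Passing the registry explicitly keeps the lookup testable without GPUs.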

Simplest usage:

from chatgpt.trainer import PPOTrainer
from chatgpt.trainer.strategies import ColossalAIStrategy

strategy = ColossalAIStrategy()

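with strategy.model_init_context():
    # init your models here; Actor and Critic stand in for your own model classes
    actor = Actor()
    critic = Critic()

trainer = PPOTrainer(
    strategy,  # positional arguments must precede keyword arguments
    actor=actor,
    critic=critic,
    # ... remaining arguments elided
)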

trainer.fit(dataset, ...)

For more details, see examples/.
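
For readers who want a feel for what a PPO trainer optimizes, the following is a minimal sketch of the standard clipped surrogate policy loss from the PPO algorithm. It is not the code used by `PPOTrainer`; the function name `ppo_clip_loss` and the default `clip_eps=0.2` are assumptions chosen for illustration, and it works on plain Python floats rather than tensors.

```python
import math
from typing import Sequence


def ppo_clip_loss(
    log_probs: Sequence[float],
    old_log_probs: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Mean clipped-surrogate policy loss over a batch of sampled actions."""
    n = len(log_probs)
    if n == 0 or n != len(old_log_probs) or n != len(advantages):
        raise ValueError("inputs must be non-empty and of equal length")

    total = 0.0
    for lp, old_lp, adv in zip(log_probs, old_log_probs, advantages):
        # Probability ratio between the current and the sampling policy.
        ratio = math.exp(lp - old_lp)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the two surrogate objectives.
        total += min(ratio * adv, clipped * adv)

    # The objective is maximized, so the loss is its negated mean.
    return -total / n
```

The clipping removes the incentive to move the policy far from the one that generated the samples within a single update.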

We also support training a reward model with real-world data. See examples/train_reward_model.py.
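
As background, reward models for RLHF are commonly trained on pairs of responses where a human preferred one over the other, using the pairwise loss -log(sigmoid(r_chosen - r_rejected)) described in the instruction-following paper cited below. The sketch here shows that loss on plain floats. Whether examples/train_reward_model.py uses exactly this form is an assumption, and `pairwise_ranking_loss` is a hypothetical name.

```python
import math
from typing import Sequence


def pairwise_ranking_loss(
    chosen_rewards: Sequence[float],
    rejected_rewards: Sequence[float],
) -> float:
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over comparison pairs."""
    n = len(chosen_rewards)
    if n == 0 or n != len(rejected_rewards):
        raise ValueError("inputs must be non-empty and of equal length")

    total = 0.0
    for chosen, rejected in zip(chosen_rewards, rejected_rewards):
        diff = chosen - rejected
        # -log(sigmoid(d)) equals log(1 + exp(-d)); this form avoids overflow
        # for large |d| by never exponentiating a positive number.
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / n
```

The loss shrinks toward zero as the preferred response is scored increasingly higher than the rejected one, and equals log(2) when both receive the same score.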

Todo

  • implement PPO training
  • implement reward model training
  • support LoRA
  • implement PPO-ptx fine-tuning
  • integrate with Ray
  • support more RL paradigms, like Implicit Language Q-Learning (ILQL)

Citations

@article{Hu2021LoRALA,
    title   = {LoRA: Low-Rank Adaptation of Large Language Models},
    author  = {Edward J. Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Weizhu Chen},
    journal = {ArXiv},
    year    = {2021},
    volume  = {abs/2106.09685}
}

@article{ouyang2022training,
    title   = {Training language models to follow instructions with human feedback},
    author  = {Ouyang, Long and Wu, Jeff and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll L and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and others},
    journal = {arXiv preprint arXiv:2203.02155},
    year    = {2022}
}