README.md

Distributed PPO Training on Stage 3

Detach Experience Makers and Trainers

We can completely separate the trainers and makers.

The experience maker performs inference, produces experience, and remotely delivers it to the trainer (1).
The trainer consumes experience to train models, and periodically transmits new model parameters to the maker (2.1, 2.2).
An experience buffer is used to overlap transmission and computation.

In this manner, each node works continuously with no model idle time, and different optimization strategies can be applied to inference and training to meet speed or storage requirements. The separation also helps scalability.
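
The overlap described above can be illustrated with a toy producer/consumer sketch. It is not the Ray-based implementation: it uses only Python threads and `queue.Queue`, and the names `run_detached`, `maker`, and `trainer` are hypothetical stand-ins. It only shows how a bounded buffer lets the maker keep producing while the trainer is busy.

```python
import queue
import threading


def run_detached(num_batches: int, buffer_size: int = 2) -> list:
    """Toy stand-in: a 'maker' thread produces experience, a 'trainer' thread consumes it."""
    buffer = queue.Queue(maxsize=buffer_size)  # bounded experience buffer
    trained = []

    def maker():
        for step in range(num_batches):
            experience = {"step": step}  # placeholder for real inference output
            buffer.put(experience)       # blocks only while the buffer is full
        buffer.put(None)                 # sentinel: no more experience

    def trainer():
        while True:
            experience = buffer.get()    # blocks only while the buffer is empty
            if experience is None:
                break
            trained.append(experience["step"])  # placeholder for a PPO update

    threads = [threading.Thread(target=maker), threading.Thread(target=trainer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trained


consumed_steps = run_detached(num_batches=5)
```

With a single maker and a single trainer the queue preserves order, so the trainer sees the batches in the order they were produced.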

DetachedPPOTrainer and ExperienceMakerHolder are Ray actors (not to be confused with the actor model used in PPO). They represent the Trainer and the Experience Maker in the graph above, respectively.

More about Ray Core

Usage

See the examples at ColossalAI/applications/Chat/examples/ray.

Setup Makers

Define the makers' environment variables:

env_info_makers = [{
    'local_rank': '0',
    'rank': str(rank),
    'world_size': str(num_makers),
    'master_port': maker_port,
    'master_addr': master_addr
} for rank in range(num_makers)]
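
As a rough sketch of how such a list might be built and consumed, the helpers below generate one dictionary per worker and export it as the upper-case variables (`RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) that `torch.distributed` reads with environment-variable initialization. Both `build_env_info` and `apply_env_info` are hypothetical names, and the assumption that each worker exports its entry this way is not taken from the text above.

```python
import os


def build_env_info(num_workers: int, master_addr: str, master_port: str) -> list:
    """Build one env-info dict per worker; all values are strings."""
    return [{
        'local_rank': '0',
        'rank': str(rank),
        'world_size': str(num_workers),
        'master_port': master_port,
        'master_addr': master_addr,
    } for rank in range(num_workers)]


def apply_env_info(env_info: dict) -> None:
    """Assumed behaviour: export each entry as an upper-case environment variable."""
    for key, value in env_info.items():
        os.environ[key.upper()] = value


example_env = build_env_info(num_workers=2, master_addr='127.0.0.1', master_port='29500')
apply_env_info(example_env[1])  # what the worker with rank 1 might do on start-up
```

Keeping every value a string matters because `os.environ` rejects non-string values.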

Define the maker models:

def model_fn():
    actor = get_actor_from_args(...)
    critic = get_critic_from_args(...)
    reward_model = get_reward_model_from_args(...)
    initial_model = get_actor_from_args(...)
    return actor, critic, reward_model, initial_model

Create experience_holder_refs:

experience_holder_refs = [
    ExperienceMakerHolder.options(
        name=f"maker{i}",    # actor name that trainers use to look up this maker
        num_gpus=1,
        max_concurrency=2
    ).remote(
        # names of the trainers this maker sends experience to
        detached_trainer_name_list=[f"trainer{x}" for x in target_trainers(...)],
        model_fn=model_fn,
        ...)
    for i, env_info_maker in enumerate(env_info_makers)
]

The names in detached_trainer_name_list refer to the target trainers that the maker should send experience to. A trainer's name is set in the same way as a maker's, via .options(name="str"). See below.

Setup Trainers

Define the trainers' environment variables:

env_info_trainers = [{
    'local_rank': '0',
    'rank': str(rank),
    'world_size': str(num_trainers),
    'master_port': trainer_port,
    'master_addr': master_addr
} for rank in range(num_trainers)]

Define the trainer models:

def trainer_model_fn():
    actor = get_actor_from_args(...)
    critic = get_critic_from_args(...)
    return actor, critic

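Create trainer_refs:

trainer_refs = [
    DetachedPPOTrainer.options(
        name=f"trainer{i}",    # actor name that makers use to look up this trainer
        num_gpus=1,
        max_concurrency=2
    ).remote(
        # names of the makers this trainer sends updated models to
        experience_maker_holder_name_list=[f"maker{x}" for x in target_makers(...)],
        model_fn=trainer_model_fn,    # pass the function itself, as with the makers' model_fn
        ...)
    for i, env_info_trainer in enumerate(env_info_trainers)
]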

The names in experience_maker_holder_name_list refer to the target makers that the trainer should send updated models to. By setting detached_trainer_name_list and experience_maker_holder_name_list, we can customize the transmission graph.
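
One way to think about the transmission graph is as a mapping from maker names to trainer names, plus its inverse. The sketch below is only an illustration under assumed conventions: `makers_to_trainers` and `trainers_to_makers` are hypothetical helpers (they are not the elided `target_trainers(...)` and `target_makers(...)`), and the round-robin assignment is just one possible policy.

```python
def makers_to_trainers(num_makers: int, num_trainers: int) -> dict:
    """Assign each maker to one trainer in round-robin order (assumed policy)."""
    return {f"maker{m}": [f"trainer{m % num_trainers}"] for m in range(num_makers)}


def trainers_to_makers(graph: dict) -> dict:
    """Invert the graph: each trainer sends updated models to the makers that feed it."""
    inverse = {}
    for maker, trainers in graph.items():
        for trainer in trainers:
            inverse.setdefault(trainer, []).append(maker)
    return inverse


# Two makers feeding one trainer
graph = makers_to_trainers(num_makers=2, num_trainers=1)
inverse = trainers_to_makers(graph)
```

The lists in `graph` would play the role of detached_trainer_name_list for each maker, and the lists in `inverse` the role of experience_maker_holder_name_list for each trainer.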

Launch Jobs

Define the data loader:

def data_loader_fn():
    return torch.utils.data.DataLoader(dataset=dataset)

Launch the makers:

wait_tasks = []
for experience_holder_ref in experience_holder_refs:
    wait_tasks.append(
        experience_holder_ref.workingloop.remote(data_loader_fn(),
                                                 num_steps=experience_steps))

Launch the trainers:

for trainer_ref in trainer_refs:
    wait_tasks.append(trainer_ref.fit.remote(total_steps, update_steps, train_epochs))

Wait for all tasks to finish:
```
ray.get(wait_tasks)
```

Flexible Structure

We can deploy different strategies to makers and trainers. Here are some possible configurations.

2 Makers 1 Trainer

2 Makers 2 Trainers

Maker Inference Quantization

Tensor Parallel

TODO