ColossalAI/colossalai/zero/low_level
flybird11111 af6aa9ed06
[plugin] hybrid support zero bubble pipeline (#6060)
* hybrid support zbv

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update zero_bubble_pp.py

* fix-ci

* [zerobubble]Support ZeroBubble Pipeline (#6034)

* [feat] add zerobubble pp (just a frame now); add POC test for dx_dw; add test for zerobubble;

* [feat] add dw test;

* [fix] fix weight not close;

* [update] update text;

* [feat] add test run_fwd_bwd automatic scheduling;

* [feat] split communication and calculation; fix pop empty send_bwd_buffer error;

* [feat] add test for p & p grad;

* [feat] add comments for ZBV func;

* [fix] rm useless assign and comments;

* [fix] fix ci test; add pytest;

* [feat] add run_fwd_bwd_with_microbatch  (replace input) & test; add p&p.grad assert close test & all pass;

* [feat] add apply v_schedule graph; p & p.grad assert err exist;

* [fix] update

* [feat] fix ci; add assert;

* [feat] fix poc format

* [feat] fix func name & ci; add comments;

* [fix] fix poc test; add comments in poc;

* [feat] add optim backward_b_by_grad

* [feat] fix optimizer bwd b & w; support return accum loss & output

* [feat] add fwd_bwd_step, run_fwd_only;

* [fix] fix optim bwd; add license for v_schedule; remove redundant attributes; fix schedule loop "while"--> "for"; add communication dict;

* [fix] fix communication_map;

* [feat] update test; rm comments;

* [fix] rm zbv in hybridplugin

* [fix] fix optim bwd;

* [fix] fix optim bwd;

* [fix] rm output.data after send fwd;

* [fix] fix bwd step if condition; remove useless comments and format info;

* [fix] fix detach output & release output;

* [fix] rm requir_grad for output;

* [fix] fix requir grad position and detach position and input&output local buffer append position;

* [feat] add memory assertation;

* [fix] fix mem check;

* [fix] mem assertation'

* [fix] fix mem assertation

* [fix] fix mem; use a new model shape; only assert mem less and equal than theo;

* [fix] fix model zoo import;

* [fix] fix redundant detach & clone; add buffer assertation in the end;

* [fix] add output_obj_grad assert None at bwd b step; replace input_obj.require_grad_ with treemap;

* [fix] update optim state dict assert (include param group & state); fix mem assert after add optim;

* [fix] add testcase with microbatch 4;

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: duanjunwen <935724073@qq.com>
2024-09-27 14:48:55 +08:00
bookkeeping [moe] full test for deepseek and mixtral (pp + sp to fix) 2024-08-01 10:06:59 +08:00
__init__.py [misc] update pre-commit and run all files (#4752) 2023-09-19 14:20:26 +08:00
_utils.py [devops] remove post commit ci (#5566) 2024-04-08 15:09:40 +08:00
low_level_optim.py [plugin] hybrid support zero bubble pipeline (#6060) 2024-09-27 14:48:55 +08:00
readme.md [zero]support zero2 with gradient accumulation (#4511) 2023-08-25 13:44:07 +08:00
zero_hook.py [zero] support all-gather overlap (#5898) 2024-07-11 18:59:59 +08:00

readme.md

Low Level ZeRO

Low Level ZeRO covers ZeRO-DP stages 1 and 2. This document refers to it simply as ZeRO.

Examples of ZeRO and gradient accumulation

The code below shows only a typical gradient accumulation loop and omits many details, such as how the loss is processed.

# example of ZeRO1 with gradient accumulation
...
outputs = model(input)
loss = SomeLoss(outputs)
if (idx + 1) % ACCUMULATE_STEP != 0:
    with booster.no_sync(model, optimizer):
        # Under this context, gradients are not synced during backward,
        # so each rank is left holding a different gradient.
        # Skipping the sync saves backward time.
        booster.backward(loss, optimizer)
        continue
else:
    # Sync all the accumulated gradients on this step.
    booster.backward(loss, optimizer)
    optimizer.step()
    ...

# example of ZeRO2 with gradient accumulation
...
outputs = model(input)
loss = SomeLoss(outputs)
# ZeRO2 partitions the gradients, so the sync cannot be skipped
# during accumulation: every backward syncs.
booster.backward(loss, optimizer)
if (idx + 1) % ACCUMULATE_STEP == 0:
    optimizer.step()
...
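
The two accumulation patterns above can be compared with a small simulation. This is a minimal sketch, not ColossalAI code: it assumes the sync is a plain mean over ranks and uses scalar gradients, and `WORLD_SIZE`, `all_reduce_mean`, and both `accumulate_*` helpers are hypothetical names introduced for illustration. It only shows that syncing once at the end and syncing on every backward give the same accumulated gradient, because averaging is linear.

```python
from fractions import Fraction

ACCUMULATE_STEP = 4  # same name as in the snippets above
WORLD_SIZE = 2       # hypothetical number of ranks


def all_reduce_mean(per_rank_values):
    """Stand-in for a gradient sync: every rank ends up with the mean."""
    mean = sum(per_rank_values) / len(per_rank_values)
    return [mean for _ in per_rank_values]


def accumulate_sync_once(micro_grads):
    """ZeRO1-style: accumulate locally, sync only on the last micro-step."""
    local = [Fraction(0)] * WORLD_SIZE
    for step_grads in micro_grads:
        local = [acc + g for acc, g in zip(local, step_grads)]
    return all_reduce_mean(local)


def accumulate_sync_every_step(micro_grads):
    """ZeRO2-style: sync on every backward, accumulate the synced result."""
    total = [Fraction(0)] * WORLD_SIZE
    for step_grads in micro_grads:
        synced = all_reduce_mean(step_grads)
        total = [acc + g for acc, g in zip(total, synced)]
    return total


# micro_grads[step][rank] is a made-up scalar gradient for that micro-batch.
micro_grads = [
    [Fraction(step + rank) for rank in range(WORLD_SIZE)]
    for step in range(ACCUMULATE_STEP)
]
once = accumulate_sync_once(micro_grads)
every = accumulate_sync_every_step(micro_grads)
print(once == every)
```

The difference between the two patterns is therefore in communication cost and memory layout, not in the resulting gradient.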

Design:

Notation

p32 denotes the param copy in the optimizer
p denotes the model param
g denotes the gradient

INIT

In low level zero (1, 2), p32 is split. Unlike the previous implementation, each p32 is split evenly by world_size. Thus, rank0 holds the param list [p-00, p-10], rank1 holds the param list [p-01, p-11], etc., where p-XY is the part of param X held by rank Y.

In the detailed implementation, we first pad p, if needed, so that it can be split evenly by world_size. We then view it with the shape [world_size, -1], and each rank obtains its own part of p32 by cloning its row.
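
The pad-then-view step can be sketched with plain Python lists standing in for tensors. This is an illustration under stated assumptions, not the code in low_level_optim.py: `shard_param`, `pad_value`, and the sample parameter are hypothetical.

```python
def shard_param(flat_param, world_size, pad_value=0.0):
    """Pad a flat parameter so it divides evenly, then split it into rows."""
    padding = (-len(flat_param)) % world_size
    padded = list(flat_param) + [pad_value] * padding
    shard_len = len(padded) // world_size
    # Equivalent of viewing the padded tensor with shape [world_size, -1].
    rows = [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]
    return rows, padding


p = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical flattened parameter
rows, padding = shard_param(p, world_size=3)

# Each rank keeps a copy (a "clone") of its own row as its p32 shard.
p32_rank1 = list(rows[1])
print(rows, padding, p32_rank1)
```

With five elements and three ranks, one padding element is appended so that every rank holds a shard of the same length.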

BWD

To use communication efficiently, a gradient is first added to a bucket. When the bucket is full, each g in it is reshaped as [world_size, -1], and the parts that belong to the same rank are grouped together. The data structure looks like this:

{
0: [g-00, g-10],
1: [g-01, g-11],
2: [g-02, g-12]
}

After that, the gradients are flattened per rank, and the data structure looks like this:

# g-X0 means flatten([g-00, g-10])
{
0: [g-X0],
1: [g-X1],
2: [g-X2]
}
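
The two rank-keyed structures above can be reproduced with a short sketch. It assumes every gradient in the bucket is already padded to a multiple of world_size, and the helper names (`split_by_rank`, `group_bucket_by_rank`, `flatten_by_rank`) are hypothetical, chosen only for this illustration.

```python
def split_by_rank(flat_grad, world_size):
    """View a flat gradient as [world_size, -1] and return its rows."""
    shard_len = len(flat_grad) // world_size
    return [flat_grad[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def group_bucket_by_rank(bucket, world_size):
    """Collect, for each rank, the matching slice of every gradient."""
    grouped = {rank: [] for rank in range(world_size)}
    for grad in bucket:
        for rank, piece in enumerate(split_by_rank(grad, world_size)):
            grouped[rank].append(piece)
    return grouped


def flatten_by_rank(grouped):
    """Concatenate each rank's slices into one flat list."""
    return {
        rank: [x for piece in pieces for x in piece]
        for rank, pieces in grouped.items()
    }


g0 = [0, 1, 2, 3, 4, 5]  # hypothetical gradient of param 0
g1 = [10, 11, 12]        # hypothetical gradient of param 1
grouped = group_bucket_by_rank([g0, g1], world_size=3)
flat = flatten_by_rank(grouped)
print(grouped)
print(flat)
```

Here `grouped[0]` plays the role of [g-00, g-10] and `flat[0]` plays the role of g-X0.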

For zero1, we iterate over the dictionary and all_reduce each entry. For zero2, a single reduce-scatter is enough, since each rank only needs the reduced gradient of its own part.
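
The difference between the two collectives can be simulated without a distributed backend. This sketch assumes a sum reduction (the real implementation may average instead), and `all_reduce_sum`, `zero1_sync`, and `zero2_sync` are hypothetical helpers, not library calls.

```python
def all_reduce_sum(per_rank_lists):
    """Every rank receives the elementwise sum over all ranks."""
    reduced = [sum(vals) for vals in zip(*per_rank_lists)]
    return [list(reduced) for _ in per_rank_lists]


def zero1_sync(per_rank_buckets):
    """ZeRO1-style: all_reduce each dictionary entry in turn."""
    world_size = len(per_rank_buckets)
    result = [dict() for _ in range(world_size)]
    for key in range(world_size):
        outs = all_reduce_sum([bucket[key] for bucket in per_rank_buckets])
        for rank in range(world_size):
            result[rank][key] = outs[rank]
    return result


def zero2_sync(per_rank_buckets):
    """ZeRO2-style: reduce-scatter, rank r receives only reduced entry r."""
    world_size = len(per_rank_buckets)
    return [
        [sum(vals) for vals in zip(*(bucket[rank] for bucket in per_rank_buckets))]
        for rank in range(world_size)
    ]


buckets = [
    {0: [1, 2], 1: [3, 4]},      # flattened local gradients on rank 0
    {0: [10, 20], 1: [30, 40]},  # flattened local gradients on rank 1
]
z1 = zero1_sync(buckets)
z2 = zero2_sync(buckets)
print(z1)
print(z2)
```

After the ZeRO1-style sync every rank holds all reduced entries, while after the ZeRO2-style sync each rank holds only its own, which is where the gradient memory saving comes from.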

Optim

Since each rank holds its own p32 and the corresponding g, running optim.step() is straightforward.

However, we have to handle layer drop, where a registered layer is never used in the forward pass. For instance:

import torch.nn as nn


class MlpModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(128, 256)
        # drop_linear is registered but never called in forward(),
        # so its parameters never receive a gradient.
        self.drop_linear = nn.Linear(256, 256)
        self.linear2 = nn.Linear(256, 512)

    def forward(self, x):
        x = self.linear1(x)
        x = self.linear2(x)
        return x

The solution is to build a mapping between p32, p, and g. Before optim.step(), we collect every p that has requires_grad=True and p.grad != None as a real working param, and then select the corresponding p32 and g.
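
The selection step can be sketched as follows. This is a simplified illustration, not the optimizer's actual bookkeeping: `FakeParam`, `select_working`, and the name-keyed `master_of` mapping are hypothetical stand-ins for real tensors and the real p-to-p32 mapping.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FakeParam:
    """Stand-in for a model param p with the two attributes that matter here."""
    name: str
    requires_grad: bool = True
    grad: Optional[List[float]] = None


def select_working(params, master_of):
    """Keep only params that actually received a gradient, with their p32 and g."""
    working, masters, grads = [], [], []
    for p in params:
        if p.requires_grad and p.grad is not None:
            working.append(p)
            masters.append(master_of[p.name])
            grads.append(p.grad)
    return working, masters, grads


params = [
    FakeParam("linear1.weight", grad=[0.1]),
    FakeParam("drop_linear.weight"),  # unused in forward, so grad stays None
    FakeParam("linear2.weight", grad=[0.2]),
]
master_of = {p.name: f"p32:{p.name}" for p in params}  # mapping p -> p32
working, masters, grads = select_working(params, master_of)
names = [p.name for p in working]
print(names, masters, grads)
```

The dropped layer is skipped, so the optimizer step only touches p32 shards that have a matching gradient.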