ColossalAI/colossalai
flybird11111 af6aa9ed06
[plugin] hybrid support zero bubble pipeline (#6060)
* hybrid support zbv

* fix

fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* Update zero_bubble_pp.py

* fix

* fix-ci

* fix

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* fix

* fix

* fix

* [zerobubble]Support ZeroBubble Pipeline (#6034)

* [feat] add zerobubble pp (just a frame now); add POC test for dx_dw; add test for zerobubble;

* [feat] add dw test;

* [fix] fix weight not close;

* [update] update text;

* [feat] add test run_fwd_bwd automatic scheduling;

* [feat] split communication and calculation; fix pop empty send_bwd_buffer error;

* [feat] add test for p & p grad;

* [feat] add comments for ZBV func;

* [fix] rm useless assign and comments;

* [fix] fix ci test; add pytest;

* [feat] add run_fwd_bwd_with_microbatch  (replace input) & test; add p&p.grad assert close test & all pass;

* [feat] add apply v_schedule graph; p & p.grad assert err exist;

* [fix] update

* [feat] fix ci; add assert;

* [feat] fix poc format

* [feat] fix func name & ci; add comments;

* [fix] fix poc test; add comments in poc;

* [feat] add optim backward_b_by_grad

* [feat] fix optimizer bwd b & w; support return accum loss & output

* [feat] add fwd_bwd_step, run_fwd_only;

* [fix] fix optim bwd; add license for v_schedule; remove redundant attributes; fix schedule loop "while"--> "for"; add communication dict;

* [fix] fix communication_map;

* [feat] update test; rm comments;

* [fix] rm zbv in hybridplugin

* [fix] fix optim bwd;

* [fix] fix optim bwd;

* [fix] rm output.data after send fwd;

* [fix] fix bwd step if condition; remove useless comments and format info;

* [fix] fix detach output & release output;

* [fix] rm requir_grad for output;

* [fix] fix requir grad position and detach position and input&output local buffer append position;

* [feat] add memory assertation;

* [fix] fix mem check;

* [fix] mem assertation'

* [fix] fix mem assertation

* [fix] fix mem; use a new model shape; only assert mem less and equal than theo;

* [fix] fix model zoo import;

* [fix] fix redundant detach & clone; add buffer assertation in the end;

* [fix] add output_obj_grad assert None at bwd b step; replace input_obj.require_grad_ with treemap;

* [fix] update optim state dict assert (include param group & state); fix mem assert after add optim;

* [fix] add testcase with microbatch 4;

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: duanjunwen <935724073@qq.com>
2024-09-27 14:48:55 +08:00
_C Clean up 2024-06-07 09:09:29 +00:00
_analyzer [test] Fix/fix testcase (#5770) 2024-06-03 15:26:01 +08:00
accelerator [hotfix] fix typo change MoECheckpintIO to MoECheckpointIO (#5335) 2024-03-05 21:52:30 +08:00
amp [plugin] hybrid support zero bubble pipeline (#6060) 2024-09-27 14:48:55 +08:00
auto_parallel [pre-commit.ci] pre-commit autoupdate (#5572) 2024-07-01 17:16:41 +08:00
autochunk [hotfix] Fix examples no pad token & auto parallel codegen bug; (#5606) 2024-04-18 18:15:50 +08:00
booster [plugin] hybrid support zero bubble pipeline (#6060) 2024-09-27 14:48:55 +08:00
checkpoint_io [Feature] Zigzag Ring attention (#5905) 2024-08-16 13:56:38 +08:00
cli [devops] fix extention building (#5427) 2024-03-05 15:35:54 +08:00
cluster Revert "[moe] implement submesh initialization" 2024-08-01 10:06:59 +08:00
context [Fix]: implement thread-safety singleton to avoid deadlock for very large-scale training scenarios (#5625) 2024-04-25 14:45:52 +08:00
device [Feature] Distributed optimizers: Lamb, Galore, CAME and Adafactor (#5694) 2024-05-14 13:52:45 +08:00
fx [test] Fix/fix testcase (#5770) 2024-06-03 15:26:01 +08:00
inference [Feat] Distrifusion Acceleration Support for Diffusion Inference (#5895) 2024-07-30 10:43:26 +08:00
interface [plugin] hybrid support zero bubble pipeline (#6060) 2024-09-27 14:48:55 +08:00
kernel [NFC] Fix code factors on inference triton kernels (#5743) 2024-05-21 22:12:15 +08:00
lazy [Feature] Zigzag Ring attention (#5905) 2024-08-16 13:56:38 +08:00
legacy [Feature] Zigzag Ring attention (#5905) 2024-08-16 13:56:38 +08:00
logging [Feature] Zigzag Ring attention (#5905) 2024-08-16 13:56:38 +08:00
moe [moe] remove ops 2024-08-01 10:06:59 +08:00
nn [misc] fix dist logger (#5782) 2024-06-05 15:04:22 +08:00
pipeline [plugin] hybrid support zero bubble pipeline (#6060) 2024-09-27 14:48:55 +08:00
quantization [quant] fix bitsandbytes version check (#5882) 2024-07-04 11:33:23 +08:00
shardformer [plugin] hybrid support zero bubble pipeline (#6060) 2024-09-27 14:48:55 +08:00
tensor [compatibility] support torch 2.2 (#5875) 2024-07-16 13:59:25 +08:00
testing [misc] update compatibility (#6008) 2024-08-16 18:49:14 +08:00
utils Merge pull request #5310 from hpcaitech/feature/npu 2024-01-29 13:49:39 +08:00
zero [plugin] hybrid support zero bubble pipeline (#6060) 2024-09-27 14:48:55 +08:00
__init__.py [devops] remove post commit ci (#5566) 2024-04-08 15:09:40 +08:00
initialize.py [Hoxfix] Fix CUDA_DEVICE_MAX_CONNECTIONS for comm overlap 2024-07-05 20:02:36 +08:00