* [ci/tests] simplify some test case to reduce testing time
* [ci/tests] continue to remove test case to reduce ci time cost
* restore some test config
* [ci/tests] continue to reduce ci time cost
* [release] update version
* [devops] update compatibility test
* [devops] update compatibility test
* [devops] update compatibility test
* [devops] update compatibility test
* [test] fix ddp plugin test
* [test] fix gptj and rpc test
* [devops] fix cuda ext compatibility
* [inference] fix flash decoding test
* [inference] fix flash decoding test
* [feat] Add distributed lamb; minor fixes in DeviceMesh (#5476)
* init: add dist lamb; add debiasing for lamb
* dist lamb tester mostly done
* all tests passed
* add comments
* all tests passed. Removed debugging statements
* moved setup_distributed inside plugin. Added dist layout caching
* organize better
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
* [hotfix] Improve tester precision by removing ZeRO on vanilla lamb (#5576)
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
* [optim] add distributed came (#5526)
* test CAME under LowLevelZeroOptimizer wrapper
* test CAME TP row and col pass
* test CAME zero pass
* came zero add master and worker param id convert
* came zero test pass
* came zero test pass
* test distributed came passed
* reform code, Modify some expressions and add comments
* minor fix of test came
* minor fix of dist_came and test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* minor fix of dist_came and test
* rebase dist-optim
* rebase dist-optim
* fix remaining comments
* add test dist came using booster api
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [optim] Distributed Adafactor (#5484)
* [feature] solve conflict; update optimizer readme;
* [feature] update optimize readme;
* [fix] fix testcase;
* [feature] Add transformer-bert to testcase;solve a bug related to indivisible shape (induction in use_zero and tp is row parallel);
* [feature] Add transformers_bert model zoo in testcase;
* [feature] add user documentation to docs/source/feature.
* [feature] add API Reference & Sample to optimizer Readme; add state check for bert exam;
* [feature] modify user documentation;
* [fix] fix readme format issue;
* [fix] add zero=0 in testcase; cached augment in dict;
* [fix] fix percision issue;
* [feature] add distributed rms;
* [feature] remove useless comment in testcase;
* [fix] Remove useless test; open zero test; remove fp16 test in bert exam;
* [feature] Extract distributed rms function;
* [feature] add booster + lowlevelzeroPlugin in test;
* [feature] add Start_with_booster_API case in md; add Supporting Information in md;
* [fix] Also remove state movement in base adafactor;
* [feature] extract factor function;
* [feature] add LowLevelZeroPlugin test;
* [fix] add tp=False and zero=True in logic;
* [fix] fix use zero logic;
* [feature] add row residue logic in column parallel factor;
* [feature] add check optim state func;
* [feature] Remove duplicate logic;
* [feature] update optim state check func and percision test bug;
* [fix] update/fix optim state; Still exist percision issue;
* [fix] Add use_zero check in _rms; Add plugin support info in Readme; Add Dist Adafactor init Info;
* [feature] removed print & comments in utils;
* [feature] uodate Readme;
* [feature] add LowLevelZeroPlugin test with Bert model zoo;
* [fix] fix logic in _rms;
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [fix] remove comments in testcase;
* [feature] add zh-Han Readme;
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Feature] refractor dist came; fix percision error; add low level zero test with bert model zoo; (#5676)
* [feature] daily update;
* [fix] fix dist came;
* [feature] refractor dist came; fix percision error; add low level zero test with bert model zoo;
* [fix] open rms; fix low level zero test; fix dist came test function name;
* [fix] remove redundant test;
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Feature] Add Galore (Adam, Adafactor) and distributed GaloreAdamW8bit (#5570)
* init: add dist lamb; add debiasing for lamb
* dist lamb tester mostly done
* all tests passed
* add comments
* all tests passed. Removed debugging statements
* moved setup_distributed inside plugin. Added dist layout caching
* organize better
* update comments
* add initial distributed galore
* add initial distributed galore
* add galore set param utils; change setup_distributed interface
* projected grad precision passed
* basic precision tests passed
* tests passed; located svd precision issue in fwd-bwd; banned these tests
* Plugin DP + TP tests passed
* move get_shard_dim to d_tensor
* add comments
* remove useless files
* remove useless files
* fix zero typo
* improve interface
* remove moe changes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix import
* fix deepcopy
* update came & adafactor to main
* fix param map
* fix typo
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Hotfix] Remove one buggy test case from dist_adafactor for now (#5692)
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: chongqichuizi875 <107315010+chongqichuizi875@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: duanjunwen <54985467+duanjunwen@users.noreply.github.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
* feat: support qwen2 model
* fix: modify model config and add Qwen2RMSNorm
* fix qwen2 model conflicts
* test: add qwen2 shard test
* to: add qwen2 auto policy
* support qwen model
* fix the conflicts
* add try catch
* add transformers version for qwen2
* add the ColoAttention for the qwen2 model
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add the unit test version check
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix the test input bug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix the version check
* fix the version check
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [misc] remove config arg from initialize
* [misc] remove old tensor contrusctor
* [plugin] add npu support for ddp
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [devops] fix doc test ci
* [test] fix test launch
* [doc] update launch doc
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* modify coloattention
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix
* fix
* fix
fxi
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* sequence parallel optimization
* validate sequence parallel in llama (code to be polished)
* shardformer api writing
* integrate sequence parallel in ShardFormer
* fix pp bugs and sp bugs for LlaMa model
* integrating ring-based sequence parallelism into ShardFormer
* [sequence parallelism]: Add fused megatron function
* integrating ring-based sequence parallelism into ShardFormer
---------
Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>
* fix bugs when useing sp and flashattention together
* fix operation function name
* support flash attention for ulysses-style sp
* clarify sp process group
* fix compatibility bugs in moe plugin
* fix fused linear bugs
* fix linear layer test
* support gpt model all-to-all sp
* modify shard data dimension (meant to be dim=-1)
* support megtron-style sp and distributed attn for llama model
* [shardformer] add megatron sp to llama
* support llama7B 128k with distributed attention
* [shardformer] robustness enhancement
* add block attn
* sp mode 1: keep input as a complete sequence
* fix sp compatability
* finish sp mode 3 support for gpt
* using all_to_all_single when batch size is 1
* support mode 2 sp in gpt2 (#5)
* [shardformer] add megatron sp to llama
* support llama7B 128k with distributed attention
* [shardformer] robustness enhancement
* add block attn
* sp mode 1: keep input as a complete sequence
* fix sp compatability
* refactor ring implementation
* support mode 2 sp in gpt2
* polish code
* enable distributed attn mask when using sp mode 2 and 3 in llama
* automatically enable flash attn when using sp mode 2 and 3 in llama
* inplace attn mask
* add zero2 support for sequence parallel
* polish code
* fix bugs
* fix gemini checkpoint io
* loose tensor checking atol and rtol
* add comment
* fix llama layernorm grad
* fix zero grad
* fix zero grad
* fix conflict
* update split and gather auto grad func
* sequence parallel: inside text split (#6)
* polish code (part 1)
* polish code (part 2)
* polish code (part 2.5)
* polish code (part 3)
* sequence parallel: inside text split
* miscellaneous minor fixes
* polish code
* fix ulysses style ZeRO
* sequence parallel: inside text split
* miscellaneous minor fixes
* disaggregate sp group and dp group for sp
* fix llama and gpt sp
* polish code
* move ulysses grad sync to ddp (#9)
* remove zero_stage and unbind the grad sync for alltoall sp
* add 2d group creation test
* move ulysses grad sync to ddp
* add 2d group creation test
* remove useless code
* change shard config not to enable sp when enable_all_optimizations
* add sp warnings for several model
* remove useless code
---------
Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>
* Use self.[distribute_layers|get_stage_index] to exploit custom layer distribution
* Change static methods for t5 layer distribution to member functions
* Change static methods for whisper layer distribution to member functions
* Replace whisper policy usage with self one
* Fix test case to use non-static layer distribution methods
* fix: fix typo
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
* [devops] fix compatibility
* [hotfix] update compatibility test on pr
* [devops] fix compatibility
* [devops] record duration during comp test
* [test] decrease test duration
* fix falcon
* test: add more p2p tests
* fix: remove send_forward_recv_forward as p2p op list need to use the same group
* fix: make send and receive atomic
* feat: update P2PComm fn
* feat: add metadata cache in 1f1b
* feat: add metadata cache in interleaved pp
* feat: modify is_xx_stage fn
* revert: add _broadcast_object_list
* feat: add interleaved pp in llama policy
* feat: set NCCL_BUFFSIZE in HybridParallelPlugin
* [shardformer] implement policy for all GPT-J models and test
* [shardformer] support interleaved pipeline parallel for bert finetune
* [shardformer] shardformer support falcon (#4883)
* [shardformer]: fix interleaved pipeline for bert model (#5048)
* [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093)
* Add Mistral support for Shardformer (#5103)
* [shardformer] add tests to mistral (#5105)
---------
Co-authored-by: Pengtai Xu <henryxu880@gmail.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: eric8607242 <e0928021388@gmail.com>
* [shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
[shardformer] chatglm support sequence parallel
* fix
fix
fix
fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* [shardformer] jit fused fix
* activate checks
* [Test] test ci
* test ci
* test ci
* test ci
* test ci
* test ci
* test ci
* fix