* sequence parallel optimization
* validate sequence parallel in llama (code to be polished)
* shardformer api writing
* integrate sequence parallel in ShardFormer
* fix pp bugs and sp bugs for Llama model
* integrating ring-based sequence parallelism into ShardFormer
* [sequence parallelism]: Add fused megatron function
---------
Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>
* fix bugs when using sp and flash attention together
* fix operation function name
* support flash attention for ulysses-style sp
* clarify sp process group
* fix compatibility bugs in moe plugin
* fix fused linear bugs
* fix linear layer test
* support gpt model all-to-all sp
* modify shard data dimension (meant to be dim=-1)
* support megatron-style sp and distributed attn for llama model
* [shardformer] add megatron sp to llama
* support llama7B 128k with distributed attention
* [shardformer] robustness enhancement
* add block attn
* sp mode 1: keep input as a complete sequence
* fix sp compatibility
* finish sp mode 3 support for gpt
* using all_to_all_single when batch size is 1
* support mode 2 sp in gpt2 (#5)
* refactor ring implementation
* support mode 2 sp in gpt2
* polish code
* enable distributed attn mask when using sp mode 2 and 3 in llama
* automatically enable flash attn when using sp mode 2 and 3 in llama
* inplace attn mask
* add zero2 support for sequence parallel
* polish code
* fix bugs
* fix gemini checkpoint io
* loosen atol and rtol for tensor checking
* add comment
* fix llama layernorm grad
* fix zero grad
* fix conflict
* update split and gather auto grad func
* sequence parallel: inside text split (#6)
* polish code (part 1)
* polish code (part 2)
* polish code (part 2.5)
* polish code (part 3)
* sequence parallel: inside text split
* miscellaneous minor fixes
* polish code
* fix ulysses style ZeRO
* disaggregate sp group and dp group for sp
* fix llama and gpt sp
* polish code
* move ulysses grad sync to ddp (#9)
* remove zero_stage and unbind the grad sync for alltoall sp
* add 2d group creation test
* remove useless code
* change shard config not to enable sp when enable_all_optimizations is set
* add sp warnings for several models
---------
Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>
* Use self.[distribute_layers|get_stage_index] to exploit custom layer distribution
* Change static methods for t5 layer distribution to member functions
* Change static methods for whisper layer distribution to member functions
* Replace whisper policy usage with self one
* Fix test case to use non-static layer distribution methods
* fix: fix typo
---------
Co-authored-by: Wenhao Chen <cwher@outlook.com>
* [devops] fix compatibility
* [hotfix] update compatibility test on pr
* [devops] record duration during comp test
* [test] decrease test duration
* fix falcon
* test: add more p2p tests
* fix: remove send_forward_recv_forward as p2p op list needs to use the same group
* fix: make send and receive atomic
* feat: update P2PComm fn
* feat: add metadata cache in 1f1b
* feat: add metadata cache in interleaved pp
* feat: modify is_xx_stage fn
* revert: add _broadcast_object_list
* feat: add interleaved pp in llama policy
* feat: set NCCL_BUFFSIZE in HybridParallelPlugin
* [shardformer] implement policy for all GPT-J models and test
* [shardformer] support interleaved pipeline parallel for bert finetune
* [shardformer] shardformer support falcon (#4883)
* [shardformer]: fix interleaved pipeline for bert model (#5048)
* [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093)
* Add Mistral support for Shardformer (#5103)
* [shardformer] add tests to mistral (#5105)
---------
Co-authored-by: Pengtai Xu <henryxu880@gmail.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: eric8607242 <e0928021388@gmail.com>
* [shardformer] chatglm support sequence parallel
* fix
* [shardformer] jit fused fix
* activate checks
* [Test] test ci
* fix
* [shardformer/sequence parallel] Support sequence parallel for gpt2 (#4384)
* [sequence parallel] add sequence parallel linear col/row support (#4336)
* add sequence parallel linear col/row support
* add annotation
* add support for gpt2 fused qkv linear layer
* support sequence parallel in GPT2
* add docstring and note
* add requirements
* remove unused flash-attn
* modify flash attn test
* modify flash attn setting
* modify flash attn code
* add assert before divide, rename forward function
* [shardformer/test] fix gpt2 test with seq-parallel
* [shardformer/sequence parallel] Overlap input gather and grad computation during col backward (#4401)
* overlap gather input / grad computing during col backward
* modify test for overlap
* simplify code
* fix code and modify cuda stream synchronization
* [shardformer/sequence parallel] polish code
* [shardformer] gpt2 tests fix
* [shardformer] test all optimizations (#4399)
* [shardformer] update t5 to use all optimizations
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update the code
* fix bugs
* fix name conflict
* add bloom model and policy, revise the base class of policy
* revise
* revision
* add bert_for_pretraining
* add bert_for_pretraining forward and policy
* fix typos
* cancel warning
* change the intermediate output to default dict
* change the default output of get_shared_params
* rewrite bert test
* fix some bugs
* del pipeline tests
* del useless print
* rewrite data repeats