* [test] smaller gpt2 test case
* [test] reduce test cases: tests/test_zero/test_gemini/test_zeroddp_state_dict.py
* [test] reduce test cases: tests/test_zero/test_gemini/test_grad_accum.py
* [test] reduce test cases: tests/test_zero/test_gemini/test_optim.py
* Revert "[test] smaller gpt2 test case"
Some tests might depend on the model size (number of chunks)
This reverts commit df705a5210.
* [test] reduce test cases: tests/test_checkpoint_io/test_gemini_checkpoint_io.py
* [CI] smaller test model for the two modified cases
* [CI] hardcode gpt model for tests/test_zero/test_gemini/test_search.py since we need a fixed answer there
* sequence parallel optimization
* validate sequence parallel in llama (code to be polished)
* shardformer api writing
* integrate sequence parallel in ShardFormer
* fix pp bugs and sp bugs for Llama model
* integrating ring-based sequence parallelism into ShardFormer
* [sequence parallelism]: Add fused megatron function
* integrating ring-based sequence parallelism into ShardFormer
---------
Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>
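A hedged sketch of the ring-based scheme the commits above integrate: each rank keeps one sequence chunk, and key blocks circulate around the ring so every query chunk eventually attends over the whole sequence. Function and group names (`ring_attention_scores`, `sp_group`) are illustrative assumptions, not the ShardFormer API.

```python
# Illustrative sketch of ring-based sequence parallelism, NOT the actual
# ShardFormer implementation. Each rank owns one sequence chunk; key blocks
# travel around the ring so every rank sees the full sequence.
import torch
import torch.distributed as dist

def ring_attention_scores(q: torch.Tensor, k: torch.Tensor, sp_group) -> torch.Tensor:
    """q, k: (batch, local_seq_len, dim) chunks local to this rank."""
    world_size = dist.get_world_size(sp_group)
    rank = dist.get_rank(sp_group)
    send_to = (rank + 1) % world_size
    recv_from = (rank - 1) % world_size

    k_block = k.contiguous()
    scores = [None] * world_size
    for step in range(world_size):
        # The block we currently hold originated on rank (rank - step) % world_size.
        src = (rank - step) % world_size
        scores[src] = q @ k_block.transpose(-1, -2)
        if step < world_size - 1:
            recv_buf = torch.empty_like(k_block)
            ops = [
                dist.P2POp(dist.isend, k_block, send_to, group=sp_group),
                dist.P2POp(dist.irecv, recv_buf, recv_from, group=sp_group),
            ]
            for req in dist.batch_isend_irecv(ops):
                req.wait()
            k_block = recv_buf
    # Concatenate along the key dimension to recover full-sequence scores.
    return torch.cat(scores, dim=-1)
```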
* fix bugs when using sp and flash attention together
* fix operation function name
* support flash attention for ulysses-style sp
* clarify sp process group
* fix compatibility bugs in moe plugin
* fix fused linear bugs
* fix linear layer test
* support gpt model all-to-all sp
* modify shard data dimension (meant to be dim=-1)
* support megatron-style sp and distributed attn for llama model
* [shardformer] add megatron sp to llama
* support llama7B 128k with distributed attention
* [shardformer] robustness enhancement
* add block attn
* sp mode 1: keep input as a complete sequence
* fix sp compatibility
* finish sp mode 3 support for gpt
* using all_to_all_single when batch size is 1
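The `all_to_all_single` commit above refers to the exchange at the heart of all-to-all (Ulysses-style) sp: each rank trades its sequence shard for a head shard, then runs full-sequence attention over its subset of heads. A minimal sketch under an assumed `(seq, heads, head_dim)` layout; the helper name is hypothetical.

```python
# Hedged sketch of the Ulysses-style all-to-all step (layout assumed,
# not the repo's exact code). With batch size 1 a single flat
# all_to_all_single suffices; no per-sample loop is needed.
import torch
import torch.distributed as dist

def seq_to_head_shard(x: torch.Tensor, sp_group) -> torch.Tensor:
    """x: (local_seq_len, num_heads, head_dim) -> (full_seq_len, heads/sp, head_dim)."""
    sp_size = dist.get_world_size(sp_group)
    s, h, d = x.shape
    # Make dim 0 enumerate destination ranks (one head shard per rank).
    x = x.reshape(s, sp_size, h // sp_size, d).permute(1, 0, 2, 3).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=sp_group)
    # out[i] is rank i's sequence chunk for our head shard; stack them back
    # into the full sequence.
    return out.reshape(sp_size * s, h // sp_size, d)
```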
* support mode 2 sp in gpt2 (#5)
* refactor ring implementation
* support mode 2 sp in gpt2
* polish code
* enable distributed attn mask when using sp mode 2 and 3 in llama
* automatically enable flash attn when using sp mode 2 and 3 in llama
* inplace attn mask
* add zero2 support for sequence parallel
* polish code
* fix bugs
* fix gemini checkpoint io
* loosen tensor-checking atol and rtol
* add comment
* fix llama layernorm grad
* fix zero grad
* fix zero grad
* fix conflict
* update split and gather auto grad func
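The "split and gather auto grad func" above is conventionally an autograd `Function` that splits activations in the forward pass and all-gathers gradients in the backward pass. A minimal sketch under assumed shapes, not the ColossalAI source:

```python
# Minimal split-forward / gather-backward autograd function, the usual
# building block behind "split and gather" in sequence parallelism.
import torch
import torch.distributed as dist

class SplitForwardGatherBackward(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor, dim: int, group) -> torch.Tensor:
        ctx.dim, ctx.group = dim, group
        rank = dist.get_rank(group)
        # Keep only this rank's chunk of the sequence in the forward pass.
        return x.chunk(dist.get_world_size(group), dim=dim)[rank].contiguous()

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        world_size = dist.get_world_size(ctx.group)
        parts = [torch.empty_like(grad_output) for _ in range(world_size)]
        # Reassemble the full-sequence gradient before it flows upstream.
        dist.all_gather(parts, grad_output.contiguous(), group=ctx.group)
        return torch.cat(parts, dim=ctx.dim), None, None
```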
* sequence parallel: inside text split (#6)
* polish code (part 1)
* polish code (part 2)
* polish code (part 2.5)
* polish code (part 3)
* sequence parallel: inside text split
* miscellaneous minor fixes
* polish code
* fix Ulysses-style ZeRO
* sequence parallel: inside text split
* miscellaneous minor fixes
* disaggregate sp group and dp group for sp
* fix llama and gpt sp
* polish code
* move ulysses grad sync to ddp (#9)
* remove zero_stage and unbind the grad sync for alltoall sp
* add 2d group creation test
* move ulysses grad sync to ddp
* add 2d group creation test
* remove useless code
* change shard config so that enable_all_optimizations does not enable sp
* add sp warnings for several models
* remove useless code
---------
Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>
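A hypothetical sketch of the (dp_size x sp_size) rank grid implied by the "2d group creation" and "disaggregate sp group and dp group" commits above; group construction and the averaging step are assumptions for illustration, not the plugin's code.

```python
# Sketch of a 2D (data-parallel x sequence-parallel) process-group layout.
import torch.distributed as dist

def build_2d_groups(dp_size: int, sp_size: int):
    rank = dist.get_rank()
    dp_group = sp_group = None
    # new_group is collective: every rank must create every group.
    for i in range(dp_size):  # one sp group per row of the grid
        g = dist.new_group(list(range(i * sp_size, (i + 1) * sp_size)))
        if rank // sp_size == i:
            sp_group = g
    for j in range(sp_size):  # one dp group per column of the grid
        g = dist.new_group(list(range(j, dp_size * sp_size, sp_size)))
        if rank % sp_size == j:
            dp_group = g
    return dp_group, sp_group

def sync_sp_grads(model, sp_group):
    # Average gradients over the sp group after backward -- the sync that the
    # commits above move into DDP's gradient hooks.
    world_size = dist.get_world_size(sp_group)
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, group=sp_group)
            p.grad.div_(world_size)
```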
* [shardformer] implement policy for all GPT-J models and test
* [shardformer] support interleaved pipeline parallel for bert finetune
* [shardformer] shardformer support falcon (#4883)
* [shardformer]: fix interleaved pipeline for bert model (#5048)
* [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093)
* Add Mistral support for Shardformer (#5103)
* [shardformer] add tests to mistral (#5105)
---------
Co-authored-by: Pengtai Xu <henryxu880@gmail.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: eric8607242 <e0928021388@gmail.com>
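The interleaved pipeline mentioned above assigns every stage several non-contiguous layer chunks instead of one block, which shrinks the pipeline bubble. A toy layer-assignment sketch (not the ShardFormer policy code):

```python
# Which transformer layers a stage hosts under an interleaved schedule.
def interleaved_layer_ids(num_layers: int, num_stages: int, chunks: int, stage: int):
    per_chunk = num_layers // (num_stages * chunks)
    ids = []
    for v in range(chunks):  # the v-th "virtual stage" hosted by this rank
        start = (v * num_stages + stage) * per_chunk
        ids.extend(range(start, start + per_chunk))
    return ids

# e.g. 8 layers, 2 stages, 2 chunks: stage 0 holds [0, 1, 4, 5], stage 1 holds [2, 3, 6, 7].
```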
* [test] add custom models in model zoo
* [test] update legacy test
* [test] update model zoo
* [test] update gemini test
* [test] remove components to test
* [shardformer] supported flash attention test dependency (#4158)
* [shardformer] fix flash attention utils test (#4180)
* [shardformer] opt support flash attention (#4163)
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] move to modeling
* [shardformer] move to modeling
* [shardformer] add performance benchmark of shardformer (#4175)
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] benchmark fix
* [shardformer] benchmark fix
* [shardformer] llama support flash attention (#4185)
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] move to modeling
* [shardformer] move to modeling
* [shardformer] llama support flash attention
* [shardformer] llama support flash attention
* [shardformer] Move the import statement for xformer outside the forward function.
* [shardformer] gpt2 support flash attention. (#4191)
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] move to modeling
* [shardformer] move to modeling
* [shardformer] gpt2 support flash attention
* [shardformer] gpt2 support flash attention
* [shardformer] gpt2 support flash attention
* [shardformer] bloom support flash attention (#4188)
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] move to modeling
* [shardformer] move to modeling
* [shardformer] bloom support flash attention
* [shardformer] add assert to sequence length
* [shardformer] fix
* [shardformer] fix
* [shardformer] fix
* [shardformer] bert support flash attention. (#4206)
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] move to modeling
* [shardformer] move to modeling
* [shardformer] bert support flash attention
* [shardformer] t5 support flash attention. (#4216)
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] move to modeling
* [shardformer] move to modeling
* [shardformer] t5 support flash attention
* [shardformer] t5 support flash attention
* fix typo
* fix typo
* fix typo
* fix typo
* fix typo
* fix typo
* [shardformer] support 'paddedcausal' type of attention mask in ColoAttention. (#4215)
* added padded causal attn mask type for ColoAttention
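The 'paddedcausal' mask type combines the causal triangle with each sample's padding mask. A hedged sketch of building such a mask (layout assumed, helper name hypothetical):

```python
# Combine a causal mask with per-sample padding into one boolean mask.
import torch

def padded_causal_mask(padding_mask: torch.Tensor) -> torch.Tensor:
    """padding_mask: (batch, seq) with 1 for real tokens, 0 for padding."""
    b, s = padding_mask.shape
    causal = torch.tril(torch.ones(s, s, dtype=torch.bool, device=padding_mask.device))
    # A position may be attended to only if it is both causally visible
    # and not a padding token.
    return causal.unsqueeze(0) & padding_mask.bool().unsqueeze(1)
```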
* [shardformer]t5 flash attention fix (#4239)
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] move to modeling
* [shardformer] move to modeling
* [shardformer] t5 flash attention fix
* [shardformer] update gpt2 to use ColoAttention. (#4234)
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] move to modeling
* [shardformer] move to modeling
* [shardformer] update gpt2 to use ColoAttention
* [shardformer] update gpt2 to use ColoAttention
* [shardformer] update gpt2 to use ColoAttention
* [shardformer] update gpt2 to use ColoAttention
* [shardformer] update gpt2
* [shardformer] update opt and llama to use ColoAttention. (#4226)
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] move to modeling
* [shardformer] move to modeling
* update opt to use ColoAttention
* [shardformer] update opt to use ColoAttention
* [shardformer] update opt to use ColoAttention
* [shardformer] update opt to use ColoAttention
* [shardformer] update opt to use ColoAttention
* [shardformer] update opt to use ColoAttention
* [shardformer] update opt to use ColoAttention
* [shardformer] update opt
* [shardformer] shardformer support jit fused operator. (#4236)
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] opt support flash attention
* [shardformer] move to modeling
* [shardformer] move to modeling
* [shardformer] bloom support jit fused operator
* [shardformer] bloom support jit fused operator
* [shardformer] bloom support jit fused operator
* [shardformer] t5 support jit fused operator
* [shardformer] t5 support jit fused operator
* [shardformer] t5 support jit fused operator
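The "jit fused operator" commits above script elementwise chains such as bias + GeLU into a single kernel. A representative fusion via TorchScript; the constants follow the usual tanh GeLU approximation, but this is a sketch, not the repo's exact kernel:

```python
# One scripted kernel instead of several elementwise launches.
import torch

@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    y = x + bias
    # tanh approximation of GeLU, kept in one kernel by the JIT fuser
    return y * 0.5 * (1.0 + torch.tanh(0.79788456 * y * (1.0 + 0.044715 * y * y)))
```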
* [shardformer] add roadmap of flash attention
* [shardformer] add roadmap of flash attention
* [shardformer] add roadmap of flash attention
* [shardformer] add type hint to 'self' param of forward
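The per-model "support flash attention" commits above replace the quadratic attention materialization with a fused kernel. The repo routes this through ColoAttention/xformers; PyTorch's built-in SDPA (2.0+) expresses the same idea, shown here purely as an illustration:

```python
# Fused attention: the (seq x seq) score matrix is never materialized.
import torch
import torch.nn.functional as F

q = torch.randn(2, 12, 1024, 64)  # (batch, heads, seq, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)
# is_causal=True applies the causal mask inside the kernel.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```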
* [shardformer] merge feature/shardformer-models branch to feature/flash-attention-shardformer branch. (#4290)
* Feature/vit support (#4182)
* [shardformer] added tests
* [shardformer] vit test finish and support
* fix attention dropout
* [shardformer] support SAM (#4231)
* 1. support SAM 2. add fused qkv for nn.Linear
* update utils support set element in list
* overwrite SamVisionAttention forward to use DropoutForParallelInput
* remove unused code
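"Fused qkv for nn.Linear" above means the three projections share one matmul and are sliced afterwards. A minimal sketch with assumed dimensions:

```python
# One GEMM produces Q, K, and V instead of three separate projections.
import torch
import torch.nn as nn

class FusedQKV(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)  # single fused projection

    def forward(self, x: torch.Tensor):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        return q, k, v
```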
* [shardformer] support whisper (#4212)
* support whisper
* fix bug in vocabembedding
* support downstream model of whisper
* update readme
* Feature/chatglm (#4240)
* [shardformer] added tests
* [shardformer] vit test finish and support
* [shardformer] chatglm ready
* import chatglm
* [shardformer] add test kit in model zoo for chatglm
* [shardformer] add first version of policy of chatglm
* [shardformer] polish chatglm code
* [shardformer] polish code
* [shardformer] support chatglm without layernorm
* [shardformer] chatglm shard without mlp sharding
* [shardformer] delete some file
* [shardformer] ChatGLM support layernorm sharding
* [shardformer] register without auto policy
* [shardformer] pre-commit check files
* [shardformer] fix chatglm configuration with pre-commit
---------
Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com>
Co-authored-by: FoolPlayer <45593998+FoolPlayer@users.noreply.github.com>
* [shardformer] whisper support flash attention (#4301)
* [shardformer] whisper support flash attention
* [shardformer] whisper support flash attention
* [shardformer] whisper support jit operator
---------
Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com>
Co-authored-by: FoolPlayer <45593998+FoolPlayer@users.noreply.github.com>
* [shardformer] sam support flash attention (#4316)
* [shardformer] sam support flash attention
---------
Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com>
Co-authored-by: FoolPlayer <45593998+FoolPlayer@users.noreply.github.com>
* [shardformer] merge blip2/chatglm (#4321)
* [shardformer] support ChatGLMForConditionalGeneration & add fusedlayernorm for vit
* [shardformer] support Blip2 (#4243)
* support base blip2
* add support for downstream blip2 model
* update readme
* add forward injection
* skip tests for incompatible models
* fix test for gemini and low_level_zero_plugin
---------
Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com>
Co-authored-by: FoolPlayer <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: klhhhhh <1412841649@qq.com>
* [shardformer] blip2 support flash attention and jit operator (#4325)
* [shardformer] blip2 support flash attention and jit operator
* [shardformer] blip2 support flash attention and jit operator
* [shardformer] blip2 support flash attention and jit operator
---------
Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com>
Co-authored-by: FoolPlayer <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: klhhhhh <1412841649@qq.com>
* [shardformer] chatglm support flash attention and jit operator (#4330)
* [shardformer] chatglm support flash attention and jit operator
* [shardformer] chatglm support flash attention and jit operator
* [shardformer] chatglm support flash attention and jit operator
* [shardformer] chatglm support flash attention and jit operator
---------
Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com>
Co-authored-by: FoolPlayer <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: klhhhhh <1412841649@qq.com>
* [shardformer] vit support flash attention and jit operator (#4334)
* [shardformer] vit support flash attention and jit operator
* [shardformer] vit support flash attention and jit operator
---------
Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com>
Co-authored-by: FoolPlayer <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: klhhhhh <1412841649@qq.com>
* [pipeline] merge flash attention branch
* [pipeline] merge flash attention branch
* [pipeline] merge flash attention branch
* [pipeline] fix conflict
* [pipeline] fix conflict
* Merge branch 'feature/pipeline' into feature/pipeline
* Merge branch 'feature/pipeline' into feature/pipeline
* Merge branch 'feature/pipeline' into feature/pipeline
* activate checks
* activate checks
* activate checks
* activate checks
* activate checks
* activate checks
* activate checks
* activate checks
* fix flash attention tests
* gemini ignore whisper
* fix vit
* fix xformers import handle
---------
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com>
Co-authored-by: FoolPlayer <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: klhhhhh <1412841649@qq.com>
* fix llama test
* fix test bug of bert, blip2, bloom, gpt2
* fix llama test
* fix opt test
* fix sam test
* fix sam test
* fix t5 test
* fix vit test
* fix whisper test
* fix whisper test
* polish code
* adjust allclose parameter
* restore mistakenly deleted code
* adjust allclose
* change loss function for some base models
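The allclose adjustments above loosen the elementwise comparison bounds for sharded vs. unsharded outputs. The typical form, with placeholder tolerance values:

```python
# rtol scales with magnitude; atol absorbs noise near zero.
import torch

a = torch.randn(4, 4)
b = a + 1e-4 * torch.randn(4, 4)
torch.testing.assert_close(a, b, rtol=1e-3, atol=5e-4)
```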
* add naive optimizer for 3DPlugin/refactor gpt2 shardformer test
* merge tests of PP/DP/TP combinations into one test file
* fix bug when sync grad for dp in HybridPlugin
* update supported precisions for 3DPlugin/fix bug when shifting tp_degree
* improve the passing of lazy_init
* modify lazy_init/use sync_shared_params
* bloom policy
* llama pipeline forward and tests
* fix the output and attention_mask
* fix name
* bind argument to policy
* finish bloom model
* test shard gpt2
* clear cache
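The "bind argument to policy" commits above attach a stage-specific forward to the model. A schematic of the mechanism's shape only; the real ColossalAI policy API differs, and the layer attribute (`h`, as in GPT-2) is an assumption:

```python
# Bind a stage's layer range into a pipeline-aware replacement forward.
from functools import partial

def pipeline_forward(self, hidden_states, stage_index):
    start, end = stage_index
    for layer in self.h[start:end]:  # e.g. GPT-2 stores its blocks in `h`
        hidden_states = layer(hidden_states)[0]
    return hidden_states

def bind_forward(model, start: int, end: int):
    model.forward = partial(pipeline_forward, model, stage_index=(start, end))
```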
* Revert "bloom policy"
This reverts commit 8dee68a0a2.
This policy should be reverted and copied to feature/bloom
* revert the bloom changes
* cancel unneeded inputs
* gpt
* finish llama
* causal lm and sequence classification
* revision
* pass gpt trace and meta_prop
* pass t5 trace and meta_prop
* [FX] refactor experimental tracer and adapt it with hf models
* pass all mainstream model zoo
* fix CI
* fix CI
* fix CI
* fix CI
* fix CI
* fix CI
* fix CI
* fix CI
* skip tests
* fix CI
* using packaging version
* polish