Commit Graph

1067 Commits (633e95b301336c4c237537f584882b3d8e5f4145)

Author SHA1 Message Date
Cuiqing Li 459a88c806 [Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention (#4965) (1 year ago)
Jianghai cf579ff46d [Inference] Dynamic Batching Inference, online and offline (#4953) (1 year ago)
Bin Jia 1db6727678 [Pipeline inference] Combine kvcache with pipeline inference (#4938) (1 year ago)
Hongxin Liu b8e770c832 [test] merge old components to test to model zoo (#4945) (1 year ago)
Cuiqing Li 3a41e8304e [Refactor] Integrated some lightllm kernels into token-attention (#4946) (1 year ago)
github-actions[bot] 486d06a2d5 [format] applied code formatting on changed files in pull request 4820 (#4886) (1 year ago)
Zhongkai Zhao c7aa319ba0 [test] add no master test for low level zero plugin (#4934) (1 year ago)
Hongxin Liu 1f5d2e8062 [hotfix] fix torch 2.0 compatibility (#4936) (1 year ago)
Baizhou Zhang 21ba89cab6 [gemini] support gradient accumulation (#4869) (1 year ago)
Hongxin Liu 4f68b3f10c [kernel] support pure fp16 for cpu adam and update gemini optim tests (#4921) (1 year ago)
Xu Kai 611a5a80ca [inference] Add smmoothquant for llama (#4904) (1 year ago)
Xu Kai 77a9328304 [inference] add llama2 support (#4898) (1 year ago)
Baizhou Zhang 39f2582e98 [hotfix] fix lr scheduler bug in torch 2.0 (#4864) (1 year ago)
littsk 83b52c56cd [feature] Add clip_grad_norm for hybrid_parallel_plugin (#4837) (1 year ago)
Hongxin Liu df63564184 [gemini] support amp o3 for gemini (#4872) (1 year ago)
littsk ffd9a3cbc9 [hotfix] fix bug in sequence parallel test (#4887) (1 year ago)
Xu Kai fdec650bb4 fix test llama (#4884) (1 year ago)
Bin Jia 08a9f76b2f [Pipeline Inference] Sync pipeline inference branch to main (#4820) (1 year ago)
Hongxin Liu cb3a25a062 [checkpointio] hotfix torch 2.0 compatibility (#4824) (1 year ago)
Zhongkai Zhao db40e086c8 [test] modify model supporting part of low_level_zero plugin (including correspoding docs) (1 year ago)
Xu Kai d1fcc0fa4d [infer] fix test bug (#4838) (1 year ago)
Jianghai 013a4bedf0 [inference]fix import bug and delete down useless init (#4830) (1 year ago)
Hongxin Liu 4965c0dabd [lazy] support from_pretrained (#4801) (1 year ago)
Baizhou Zhang 64a08b2dc3 [checkpointio] support unsharded checkpointIO for hybrid parallel (#4774) (1 year ago)
Jianghai ce7ade3882 [inference] chatglm2 infer demo (#4724) (1 year ago)
Xu Kai 946ab56c48 [feature] add gptq for inference (#4754) (1 year ago)
Hongxin Liu 3e05c07bb8 [lazy] support torch 2.0 (#4763) (1 year ago)
Baizhou Zhang c0a033700c [shardformer] fix master param sync for hybrid plugin/rewrite unwrapping logic (#4758) (1 year ago)
Hongxin Liu 079bf3cb26 [misc] update pre-commit and run all files (#4752) (1 year ago)
Hongxin Liu b5f9e37c70 [legacy] clean up legacy code (#4743) (1 year ago)
Pengtai Xu cd4e61d149 [legacy] remove deterministic data loader test (1 year ago)
digger yu 9c2feb2f0b fix some typo with colossalai/device colossalai/tensor/ etc. (#4171) (1 year ago)
Cuiqing Li bce0f16702 [Feature] The first PR to Add TP inference engine, kv-cache manager and related kernels for our inference system (#4577) (1 year ago)
flybird11111 eedaa3e1ef [shardformer]fix gpt2 double head (#4663) (1 year ago)
Hongxin Liu 554aa9592e [legacy] move communication and nn to legacy and refactor logger (#4671) (1 year ago)
flybird11111 7486ed7d3a [shardformer] update llama2/opt finetune example and fix llama2 policy (#4645) (1 year ago)
Baizhou Zhang 660eed9124 [pipeline] set optimizer to optional in execute_pipeline (#4630) (1 year ago)
Hongxin Liu fae6c92ead Merge branch 'main' into feature/shardformer (1 year ago)
Hongxin Liu 8accecd55b [legacy] move engine to legacy (#4560) (1 year ago)
Hongxin Liu 89fe027787 [legacy] move trainer to legacy (#4545) (1 year ago)
Hongxin Liu bd18678478 [test] fix gemini checkpoint and gpt test (#4620) (1 year ago)
Hongxin Liu 807e01a4ba [zero] hotfix master param sync (#4618) (1 year ago)
Hongxin Liu e71d245293 [test] ignore gpt2 shardformer test (#4619) (1 year ago)
Hongxin Liu a39a5c66fe Merge branch 'main' into feature/shardformer (1 year ago)
Baizhou Zhang e79b1e80e2 [checkpointio] support huggingface from_pretrained for all plugins (#4606) (1 year ago)
Jianghai 24c0768795 [shardformer] Pytree fix (#4533) (1 year ago)
Hongxin Liu 508ca36fe3 [pipeline] 1f1b schedule receive microbatch size (#4589) (1 year ago)
LuGY cbac782254 [zero]fix zero ckptIO with offload (#4529) (1 year ago)
Baizhou Zhang 38ccb8b1a3 [shardformer] support from_pretrained when loading model with HybridParallelPlugin (#4575) (1 year ago)
Baizhou Zhang c9625dbb63 [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540) (1 year ago)
Baizhou Zhang 2c787d7f47 [shardformer] fix submodule replacement bug when enabling pp (#4544) (1 year ago)
flybird11111 ec18fc7340 [shardformer] support pp+tp+zero1 tests (#4531) (1 year ago)
flybird11111 d367b88785 [shardformer] fix opt test hanging (#4521) (1 year ago)
Bin Jia e241b74f24 [shardformer] Add overlap support for gpt2 (#4535) (1 year ago)
Baizhou Zhang 0387a47e63 [shardformer] fix emerged bugs after updating transformers (#4526) (1 year ago)
Bin Jia c554b7f559 [shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516) (1 year ago)
Jianghai 376533a564 [shardformer] zero1+pp and the corresponding tests (#4517) (1 year ago)
Baizhou Zhang 44eab2b27f [shardformer] support sharded checkpoint IO for models of HybridParallelPlugin (#4506) (1 year ago)
flybird11111 de8a65babc [shardformer] opt fix. (#4514) (1 year ago)
LuGY 839847b7d7 [zero]support zero2 with gradient accumulation (#4511) (1 year ago)
flybird11111 3353e55c80 [shardformer] vit/llama/t5 ignore the sequence parallelism flag and some fix. (#4498) (1 year ago)
Hongxin Liu 27061426f7 [gemini] improve compatibility and add static placement policy (#4479) (1 year ago)
Jianghai e04436a82a [shardformer] tests for 3d parallel (#4493) (1 year ago)
flybird11111 59e252ecdb [shardformer] chatglm support sequence parallel (#4482) (1 year ago)
Jianghai 5545114fd8 rename chatglm to chatglm2 (#4484) (1 year ago)
Baizhou Zhang 1c7df566e2 [shardformer] support tp+zero for shardformer (#4472) (1 year ago)
Jianghai 8739aa7fa0 [shardformer] Pipeline/whisper (#4456) (1 year ago)
Bin Jia 7c8be77081 [shardformer/sequence parallel] support gpt2 seq parallel with pp/dp/tp (#4460) (1 year ago)
LuGY a78daf6180 [shardformer] support interleaved pipeline (#4448) (1 year ago)
Hongxin Liu 26e29d58f0 [devops] add large-scale distributed test marker (#4452) (1 year ago)
Baizhou Zhang 6ef33f75aa [shardformer] support DDP in HybridPlugin/add tp+dp tests (#4446) (1 year ago)
Bin Jia 424629fea0 [shardformer/sequence parallel] Cherry pick commit to new branch (#4450) (1 year ago)
github-actions[bot] d20dceb9a3 [format] applied code formatting on changed files in pull request 4441 (#4445) (1 year ago)
Hongxin Liu 172f7fa3cf [misc] resolve code factor issues (#4433) (1 year ago)
flybird11111 328a791d10 [shardformer] update bloom/llama/vit/chatglm tests (#4420) (1 year ago)
flybird11111 108e54a0b4 [shardformer]update t5 tests for using all optimizations. (#4407) (1 year ago)
flybird11111 1edc9b5fb3 [shardformer] update tests for all optimization (#4413) (1 year ago)
Baizhou Zhang 7711bd524a [shardformer] rewrite tests for opt/bloom/llama/vit/chatglm (#4395) (1 year ago)
flybird11111 21e0a42fd1 [shardformer]fix, test gpt2 for AMP+TP (#4403) (1 year ago)
Jianghai 7596e9ae08 [pipeline] rewrite bert tests and fix some bugs (#4409) (1 year ago)
flybird1111 d2cd48e0be [shardformer] test all optimizations (#4399) (1 year ago)
flybird1111 7a3dfd0c64 [shardformer] update shardformer to use flash attention 2 (#4392) (1 year ago)
Baizhou Zhang ed4c448488 [pipeline] rewrite t5 tests & support multi-tensor transmitting in pipeline (#4388) (1 year ago)
flybird1111 906426cb44 [Shardformer] Merge flash attention branch to pipeline branch (#4362) (1 year ago)
Jianghai a88e92251d [pipeline] add chatglm (#4363) (1 year ago)
Baizhou Zhang b1feeced8e [shardformer] add util functions for shardformer tests/fix sync_shared_param (#4366) (1 year ago)
Bin Jia 5c6f183192 [test] Hotfix/fix some model test and refactor check util api (#4369) (1 year ago)
FoolPlayer c3ca53cf05 [test] skip some not compatible models (1 year ago)
FoolPlayer 726541afe2 update some module with new api version (1 year ago)
FoolPlayer 879301d0da [shardformer] support Blip2 (#4243) (1 year ago)
klhhhhh 8120eca0c0 [shardformer] support ChatGLMForConditionalGeneration & add fusedlayernorm for vit (1 year ago)
klhhhhh 4da05052f4 [shardformer] pre-commit check files (1 year ago)
klhhhhh f155ae89c4 [shardformer] ChatGLM support layernorm sharding (1 year ago)
klhhhhh 00f6ef159d [shardformer] delete some file (1 year ago)
klhhhhh dad00c42aa [shardformer] support chatglm without layernorm (1 year ago)
klhhhhh cbb54d3202 [shardformer] polish code (1 year ago)
klhhhhh 1a29e8fc29 [shardformer] polish chatglm code (1 year ago)
klhhhhh 8620009dd7 [sharformer] add first version of policy of chatglm (1 year ago)
klhhhhh 6ee4c9ee21 [shardformer] add test kit in model zoo for chatglm (1 year ago)
klhhhhh 7377be7a53 import chatglm (1 year ago)
klhhhhh c49286985d [shardformer] vit test finish and support (1 year ago)
klhhhhh f60162b265 [shardformer] added tests (1 year ago)
Kun Lin ed34bb1310 Feature/chatglm (#4240) (1 year ago)
FoolPlayer 9ee4ebea83 [shardformer] support whisper (#4212) (1 year ago)
FoolPlayer dd2bf02679 [shardformer] support SAM (#4231) (1 year ago)
Baizhou Zhang 0ceec8f9a9 [pipeline] support fp32 for HybridPlugin/merge shardformer test and pipeline test into one file (#4354) (1 year ago)
Jianghai f13954cd58 [pipeline] refactor test pipeline and remove useless utils in pipeline (#4324) (1 year ago)
LuGY d3c6cd66f3 [pipeline] add unit test for 1f1b (#4303) (1 year ago)
Baizhou Zhang da3cef27ad [pipeline] fix return_dict/fix pure_pipeline_test (#4331) (1 year ago)
Hongxin Liu 411cf1d2db [hotfix] fix gemini and zero test (#4333) (1 year ago)
Hongxin Liu 261eab02fb [plugin] add 3d parallel plugin (#4295) (1 year ago)
FoolPlayer b3f5d7a3ba [shardformer] support pipeline base vit model (#4284) (1 year ago)
Baizhou Zhang 083d7da33d [pipeline] add pipeline support for all T5 models (#4310) (1 year ago)
Jianghai d0807122e2 [pipeline] test pure pipeline process using llama (#4218) (1 year ago)
Baizhou Zhang 36e546b2cc [pipeline] add pipeline support for T5Stack/T5EncoderModel (#4300) (1 year ago)
Jianghai d8408d185c [pipeline] OPT model pipeline (#4258) (1 year ago)
Hongxin Liu d921ce8391 [shardformer] support inplace sharding (#4251) (1 year ago)
Baizhou Zhang 2a2eacfaf1 [pipeline] support shardformer for GPT2ForQuestionAnswering & complete pipeline support for GPT2 (#4245) (1 year ago)
Jianghai d9be0472ef [bugs] hot fix some testing bugs for new models (#4268) (1 year ago)
Jianghai 34f0e34a4c [pipeline] finish bloom models pipeline and tests (#4223) (1 year ago)
Jianghai e7cc62d735 [pipeline] All bert models (#4233) (1 year ago)
Baizhou Zhang a14d352088 [pipeline] add pipeline forward for variants of gpt2 (#4238) (1 year ago)
Baizhou Zhang 208ac8f2ba [pipeline] Add Pipeline Forward for GPT2Model Shardformer (#4224) (1 year ago)
Jianghai 37d22f6878 [pipeline] add bloom model pipeline (#4210) (1 year ago)
Jianghai 31bcf867ae [pipeline] Llama causal lm and llama for sequence classification pipeline (#4208) (1 year ago)
Jianghai 1622031058 [pipeline] Llama pipeline (#4205) (1 year ago)
Jianghai 1094e0f0d3 [pipeline] Bert pipeline for shardformer and its tests (#4197) (1 year ago)
Hongxin Liu 890774b2fb [shardformer] support lazy init (#4202) (1 year ago)
Jianghai f3bcc292c8 [pipeline] move bert related pipeline components to shardformer (#4187) (1 year ago)
Jianghai c5ea728016 [pipeline] add bert_for_pretraining bert_lmhead forward and policy (#4172) (1 year ago)
ver217 5fc60a3a04 [test] add shard util tests (1 year ago)
ver217 2d6cc07feb [test] update shardformer tests (1 year ago)
Jianghai 90a65ea682 [pipeline] build bloom model and policy , revise the base class of policy (#4161) (1 year ago)
Jianghai c552cefa93 [pipeline]add pipeline policy and bert forward (#4130) (1 year ago)
Hongxin Liu 5c897ddb94 [pipeline] add stage manager (#4093) (1 year ago)
Jianghai e8e7e49243 [pipeline]add pipeline policy and bert forward (#4130) (1 year ago)
Hongxin Liu f51ce1bc8e [pipeline] refactor 1f1b schedule (#4115) (1 year ago)
Hongxin Liu 45fdc9b42c [pipeline] implement p2p communication (#4100) (1 year ago)
Hongxin Liu 422544222f [pipeline] add stage manager (#4093) (1 year ago)
Hongxin Liu 5e1a9d48dd [cluster] add process group mesh (#4039) (1 year ago)
LuGY d86ddd9b29 [hotfix] fix unsafe async comm in zero (#4404) (1 year ago)
flybird1111 458ae331ad [kernel] updated unittests for coloattention (#4389) (1 year ago)
flybird1111 38b792aab2 [coloattention] fix import error (#4380) (1 year ago)
flybird1111 25c57b9fb4 [fix] coloattention support flash attention 2 (#4347) (1 year ago)
Hongxin Liu 16bf4c0221 [test] remove useless tests (#4359) (1 year ago)
LuGY 1a49a5ea00 [zero] support shard optimizer state dict of zero (#4194) (1 year ago)
LuGY dd7cc58299 [zero] add state dict for low level zero (#4179) (1 year ago)
LuGY c668801d36 [zero] allow passing process group to zero12 (#4153) (1 year ago)
LuGY 79cf1b5f33 [zero]support no_sync method for zero1 plugin (#4138) (1 year ago)
LuGY c6ab96983a [zero] refactor low level zero for shard evenly (#4030) (1 year ago)