955 Commits (39f2582e987871c198f2f2526cd4435cbd569741)

Author SHA1 Message Date
Baizhou Zhang 39f2582e98
[hotfix] fix lr scheduler bug in torch 2.0 (#4864) 1 year ago
littsk 83b52c56cd
[feature] Add clip_grad_norm for hybrid_parallel_plugin (#4837) 1 year ago
Hongxin Liu df63564184
[gemini] support amp o3 for gemini (#4872) 1 year ago
littsk ffd9a3cbc9
[hotfix] fix bug in sequence parallel test (#4887) 1 year ago
Xu Kai fdec650bb4
fix test llama (#4884) 1 year ago
Bin Jia 08a9f76b2f
[Pipeline Inference] Sync pipeline inference branch to main (#4820) 1 year ago
Hongxin Liu cb3a25a062
[checkpointio] hotfix torch 2.0 compatibility (#4824) 1 year ago
Zhongkai Zhao db40e086c8
[test] modify model supporting part of low_level_zero plugin (including corresponding docs) 1 year ago
Xu Kai d1fcc0fa4d
[infer] fix test bug (#4838) 1 year ago
Jianghai 013a4bedf0
[inference]fix import bug and delete down useless init (#4830) 1 year ago
Hongxin Liu 4965c0dabd
[lazy] support from_pretrained (#4801) 1 year ago
Baizhou Zhang 64a08b2dc3
[checkpointio] support unsharded checkpointIO for hybrid parallel (#4774) 1 year ago
Jianghai ce7ade3882
[inference] chatglm2 infer demo (#4724) 1 year ago
Xu Kai 946ab56c48
[feature] add gptq for inference (#4754) 1 year ago
Hongxin Liu 3e05c07bb8
[lazy] support torch 2.0 (#4763) 1 year ago
Baizhou Zhang c0a033700c
[shardformer] fix master param sync for hybrid plugin/rewrite unwrapping logic (#4758) 1 year ago
Hongxin Liu 079bf3cb26
[misc] update pre-commit and run all files (#4752) 1 year ago
Hongxin Liu b5f9e37c70
[legacy] clean up legacy code (#4743) 1 year ago
Pengtai Xu cd4e61d149
[legacy] remove deterministic data loader test 1 year ago
digger yu 9c2feb2f0b
fix some typo with colossalai/device colossalai/tensor/ etc. (#4171) 1 year ago
Cuiqing Li bce0f16702
[Feature] The first PR to Add TP inference engine, kv-cache manager and related kernels for our inference system (#4577) 1 year ago
flybird11111 eedaa3e1ef
[shardformer]fix gpt2 double head (#4663) 1 year ago
Hongxin Liu 554aa9592e
[legacy] move communication and nn to legacy and refactor logger (#4671) 1 year ago
flybird11111 7486ed7d3a
[shardformer] update llama2/opt finetune example and fix llama2 policy (#4645) 1 year ago
Baizhou Zhang 660eed9124
[pipeline] set optimizer to optional in execute_pipeline (#4630) 1 year ago
Hongxin Liu 8accecd55b
[legacy] move engine to legacy (#4560) 1 year ago
Hongxin Liu 89fe027787
[legacy] move trainer to legacy (#4545) 1 year ago
Hongxin Liu bd18678478
[test] fix gemini checkpoint and gpt test (#4620) 1 year ago
Hongxin Liu 807e01a4ba
[zero] hotfix master param sync (#4618) 1 year ago
Hongxin Liu e71d245293
[test] ignore gpt2 shardformer test (#4619) 1 year ago
Baizhou Zhang e79b1e80e2
[checkpointio] support huggingface from_pretrained for all plugins (#4606) 1 year ago
Jianghai 24c0768795
[shardformer] Pytree fix (#4533) 1 year ago
Hongxin Liu 508ca36fe3
[pipeline] 1f1b schedule receive microbatch size (#4589) 1 year ago
LuGY cbac782254
[zero]fix zero ckptIO with offload (#4529) 1 year ago
Baizhou Zhang 38ccb8b1a3
[shardformer] support from_pretrained when loading model with HybridParallelPlugin (#4575) 1 year ago
Baizhou Zhang c9625dbb63
[shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540) 1 year ago
Baizhou Zhang 2c787d7f47
[shardformer] fix submodule replacement bug when enabling pp (#4544) 1 year ago
flybird11111 ec18fc7340
[shardformer] support pp+tp+zero1 tests (#4531) 1 year ago
flybird11111 d367b88785
[shardformer] fix opt test hanging (#4521) 1 year ago
Bin Jia e241b74f24
[shardformer] Add overlap support for gpt2 (#4535) 1 year ago
Baizhou Zhang 0387a47e63
[shardformer] fix emerged bugs after updating transformers (#4526) 1 year ago
Bin Jia c554b7f559
[shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516) 1 year ago
Jianghai 376533a564
[shardformer] zero1+pp and the corresponding tests (#4517) 1 year ago
Baizhou Zhang 44eab2b27f
[shardformer] support sharded checkpoint IO for models of HybridParallelPlugin (#4506) 1 year ago
flybird11111 de8a65babc
[shardformer] opt fix. (#4514) 1 year ago
LuGY 839847b7d7
[zero]support zero2 with gradient accumulation (#4511) 1 year ago
flybird11111 3353e55c80
[shardformer] vit/llama/t5 ignore the sequence parallelism flag and some fix. (#4498) 1 year ago
Hongxin Liu 27061426f7
[gemini] improve compatibility and add static placement policy (#4479) 1 year ago
Jianghai e04436a82a
[shardformer] tests for 3d parallel (#4493) 1 year ago
flybird11111 59e252ecdb
[shardformer] chatglm support sequence parallel (#4482) 1 year ago