152 Commits (colossalchat)

Author SHA1 Message Date
Wang Binluo 75c963686f [lora] lora support hybrid parallel plugin (#5956) 4 months ago
botbw d1d1ab871e [moe] solve dp axis issue 4 months ago
botbw 65daa87627 [doc] add MoeHybridParallelPlugin docstring 4 months ago
hxwang 7bedd03739 [moe] remove force_overlap_comm flag and add warning instead 4 months ago
hxwang f7c5485ed6 [chore] docstring 4 months ago
haze188 70793ce9ed [misc] fix ci failure: change default value to false in moe plugin 4 months ago
hxwang 606b0891ed [chore] change moe_pg_mesh to private 4 months ago
hxwang cb01c0d5ce [moe] refactor mesh assignment 4 months ago
hxwang 6c39f0b144 [test] add check 4 months ago
botbw 96d0fbc531 [bug] fix: somehow logger hangs the program 4 months ago
hxwang 70c9924d0d [chore] solve moe ckpt test failure and some other arg pass failure 4 months ago
hxwang 46037c2ccd [chore] minor fix after rebase 4 months ago
hxwang 803878b2fd [moe] full test for deepseek and mixtral (pp + sp to fix) 4 months ago
hxwang 7077d38d5a [moe] finalize test (no pp) 4 months ago
haze188 2cddeac717 moe sp + ep bug fix 4 months ago
hxwang 877d94bb8c [moe] init moe plugin comm setting with sp 4 months ago
Haze188 404b16faf3 [Feature] MoE Ulysses Support (#5918) 4 months ago
botbw dc583aa576 [moe] implement tp 4 months ago
hxwang 102b784a10 [chore] arg pass & remove drop token 4 months ago
botbw 014faf6c5a [chore] manually revert unintended commit 4 months ago
botbw 9b9b76bdcd [moe] add mixtral dp grad scaling when not all experts are activated 4 months ago
botbw e28e05345b [moe] implement submesh initialization 4 months ago
haze188 5ed5e8cfba solve hang when parallel mode = pp + dp 4 months ago
botbw 13b48ac0aa [zero] solve hang 4 months ago
botbw b5bfeb2efd [moe] implement transit between non moe tp and ep 4 months ago
botbw 37443cc7e4 [test] pass mixtral shardformer test 4 months ago
hxwang 46c069b0db [zero] solve hang 4 months ago
hxwang a249e71946 [test] mixtra pp shard test 4 months ago
hxwang 8ae8525bdf [moe] fix plugin 4 months ago
Hongxin Liu e86127925a [plugin] support all-gather overlap for hybrid parallel (#5919) 4 months ago
Hongxin Liu c068ef0fa0 [zero] support all-gather overlap (#5898) 4 months ago
Edenzzzz fbf33ecd01 [Feature] Enable PP + SP for llama (#5868) 5 months ago
Wang Binluo 6cd4c32be4 [shardformer] fix the moe (#5883) 5 months ago
Haze188 416580b314 [MoE/ZeRO] Moe refactor with zero refactor (#5821) 5 months ago
botbw 8e718a1421 [gemini] fixes for benchmarking (#5847) 5 months ago
Edenzzzz 2a25a2aff7 [Feature] optimize PP overlap (#5735) 5 months ago
Edenzzzz 8795bb2e80 Support 4d parallel + flash attention (#5789) 5 months ago
botbw 3f7e3131d9 [gemini] optimize reduce scatter d2h copy (#5760) 6 months ago
Edenzzzz 5f8c0a0ac3 [Feature] auto-cast optimizers to distributed version (#5746) 6 months ago
botbw 2fc85abf43 [gemini] async grad chunk reduce (all-reduce&reduce-scatter) (#5713) 6 months ago
pre-commit-ci[bot] 5bedea6e10 [pre-commit.ci] auto fixes from pre-commit.com hooks 6 months ago
hxwang b2e9745888 [chore] sync 6 months ago
Edenzzzz 43995ee436 [Feature] Distributed optimizers: Lamb, Galore, CAME and Adafactor (#5694) 6 months ago
flybird11111 77ec773388 [zero]remove registered gradients hooks (#5687) 7 months ago
Hongxin Liu 7f8b16635b [misc] refactor launch API and tensor constructor (#5666) 7 months ago
linsj20 91fa553775 [Feature] qlora support (#5586) 7 months ago
flybird11111 8954a0c2e2 [LowLevelZero] low level zero support lora (#5153) 7 months ago
Baizhou Zhang 14b0d4c7e5 [lora] add lora APIs for booster, support lora for TorchDDP (#4981) 7 months ago
Hongxin Liu 1b387ca9fe [shardformer] refactor pipeline grad ckpt config (#5646) 7 months ago
Hongxin Liu 4de4e31818 [exampe] update llama example (#5626) 7 months ago