24 Commits (152162a80e95c6e30d335f6af805f1107bc6e35e)

Author          SHA1        Message                                                                      Date
flybird11111    0c10afd372  [FP8] rebase main (#5963)                                                    4 months ago
hxwang          5b4c12381b  Revert "[moe] implement submesh initialization"                              4 months ago
Haze188         404b16faf3  [Feature] MoE Ulysses Support (#5918)                                        4 months ago
botbw           8dbb86899d  [chore] trivial fix                                                          4 months ago
botbw           e28e05345b  [moe] implement submesh initialization                                       4 months ago
hxwang          46c069b0db  [zero] solve hang                                                            4 months ago
hxwang          0fad23c691  [chore] handle non member group                                              4 months ago
Gao, Ruiyuan    5fb958cc83  [FIX BUG] convert env param to int in (#5934)                                4 months ago
Haze188         3420921101  [shardformer] DeepseekMoE support (#5871)                                    5 months ago
Haze188         416580b314  [MoE/ZeRO] Moe refactor with zero refactor (#5821)                           5 months ago
Edenzzzz        2a25a2aff7  [Feature] optimize PP overlap (#5735)                                        5 months ago
Edenzzzz        43995ee436  [Feature] Distributed optimizers: Lamb, Galore, CAME and Adafactor (#5694)   6 months ago
Hongxin Liu     641b1ee71a  [devops] remove post commit ci (#5566)                                       8 months ago
Zhongkai Zhao   8e412a548e  [shardformer] Sequence Parallelism Optimization (#5533)                      8 months ago
flybird11111    365671be10  fix-test (#5210)                                                             11 months ago
flybird11111    576a2f7b10  [gemini] gemini support tensor parallelism. (#4942)                          1 year ago
littsk          be82b5d4ca  [hotfix] Fix the bug where process groups were not being properly released. (#4940)  1 year ago
Baizhou Zhang   a2db75546d  [doc] polish shardformer doc (#4779)                                         1 year ago
Hongxin Liu     079bf3cb26  [misc] update pre-commit and run all files (#4752)                           1 year ago
LuGY            a78daf6180  [shardformer] support interleaved pipeline (#4448)                           1 year ago
Hongxin Liu     5e1a9d48dd  [cluster] add process group mesh (#4039)                                     1 year ago
digger yu       7f8203af69  fix typo colossalai/auto_parallel autochunk fx/passes etc. (#3808)           2 years ago
Frank Lee       73d3e4d309  [booster] implemented the torch ddd + resnet example (#3232)                 2 years ago
YuliangLiu0306  4d5d8f98a4  [API] implement device mesh manager (#3221)                                  2 years ago
Frank Lee       e3ad88fb48  [booster] implemented the cluster module (#3191)                             2 years ago