43 Commits (44014faa67bbbed40ecd7376e030218ae3d78826)

Author SHA1 Message Date
Xuanlei Zhao 44014faa67
fix optim
11 months ago
Xuanlei Zhao 0a3aae509b
update utils and fwd bwd
11 months ago
Xuanlei Zhao a5580e6289
update test
11 months ago
Wenhao Chen 3c08f17348
[hotfix]: modify create_ep_hierarchical_group and add test (#5032)
1 year ago
Wenhao Chen 724441279b
[moe]: fix ep/tp tests, add hierarchical all2all (#4982)
1 year ago
Xuanlei Zhao f71e63b0f3
[moe] support optimizer checkpoint (#5015)
1 year ago
Xuanlei Zhao dc003c304c
[moe] merge moe into main (#4978)
1 year ago
Hongxin Liu 079bf3cb26
[misc] update pre-commit and run all files (#4752)
1 year ago
Hongxin Liu b5f9e37c70
[legacy] clean up legacy code (#4743)
1 year ago
Hongxin Liu 8accecd55b
[legacy] move engine to legacy (#4560)
1 year ago
digger-yu 1f73609adb
[CI] fix typo with tests/ etc. (#3727)
2 years ago
Frank Lee 80eba05b0a
[test] refactor tests with spawn (#3452)
2 years ago
ver217 933048ad3e
[test] reorganize zero/gemini tests (#3445)
2 years ago
ver217 26b7aac0be
[zero] reorganize zero/gemini folder structure (#3424)
2 years ago
HELSON 1a1d68b053
[moe] add checkpoint for moe models (#3354)
2 years ago
Jiarui Fang 1e885329f4
[test] align model name with the file name. (#2045)
2 years ago
HELSON 95c35f73bd
[moe] initialize MoE groups by ProcessGroup (#1640)
2 years ago
HELSON a088022efc
[moe] fix moe bugs (#1633)
2 years ago
HELSON f7f2248771
[moe] fix MoE bugs (#1628)
2 years ago
Frank Lee 5a1a095b92
[test] refactored with the new rerun decorator (#763)
3 years ago
ver217 e396bb71f2
[zero] add tensor placement policies (#743)
3 years ago
HELSON 22c4b88d56
[zero] refactor ShardedParamV2 for convenience (#742)
3 years ago
Jiarui Fang 53cb584808
[utils] correct cpu memory used and capacity in the context of multi-process (#726)
3 years ago
HELSON b9b469ea50
[moe] add checkpoint for moe zero test (#729)
3 years ago
Jiarui Fang 193dc8dacb
[refactor] refactor the memory utils (#715)
3 years ago
HELSON a9b8300d54
[zero] improve adaptability for not-shard parameters (#708)
3 years ago
HELSON ee112fe1da
[zero] adapt zero hooks for unsharded module (#699)
3 years ago
HELSON d7ecaf362b
[zero] fix init bugs in zero context (#686)
3 years ago
Jiarui Fang 0aab52301e
[hotfix] fix a bug in model data stats tracing (#655)
3 years ago
HELSON e5d615aeee
[hotfix] fix bugs in testing (#659)
3 years ago
HELSON b31daed4cf
fix bugs in CPU adam (#633)
3 years ago
HELSON 055fbf5be6
[zero] adapt zero for unsharded paramters (Optimizer part) (#601)
3 years ago
HELSON e6d50ec107
[zero] adapt zero for unsharded parameters (#561)
3 years ago
Jiarui Fang 7675366fce
[polish] rename col_attr -> colo_attr (#558)
3 years ago
HELSON 8c90d4df54
[zero] add zero context manager to change config during initialization (#546)
3 years ago
Liang Bowen ec5086c49c
Refactored docstring to google style
3 years ago
Frank Lee 3601b2bad0
[test] fixed rerun_on_exception and adapted test cases (#487)
3 years ago
Jiarui Fang a445e118cf
[polish] polish singleton and global context (#500)
3 years ago
Jiarui Fang 65c0f380c2
[format] polish name format for MOE (#481)
3 years ago
HELSON 7544347145
[MOE] add unitest for MOE experts layout, gradient handler and kernel (#469)
3 years ago
HELSON 84fd7c1d4d
add moe context, moe utilities and refactor gradient handler (#455)
3 years ago
1SAA 82023779bb
Added TPExpert for special situation
3 years ago
1SAA 219df6e685
Optimized MoE layer and fixed some bugs;
3 years ago