Commit Graph

45 Commits (773d9f964a34a4aa905286a4a0a0a6ddb9de281d)

Author SHA1 Message Date
Hongxin Liu 7f8b16635b
[misc] refactor launch API and tensor constructor (#5666)
7 months ago
Wenhao Chen bb0a668fee
[hotfix] set return_outputs=False in examples and polish code (#5404)
8 months ago
ver217 06db94fbc9 [moe] fix tests
10 months ago
Xuanlei Zhao 7d8e0338a4 [moe] init mixtral impl
10 months ago
Hongxin Liu d202cc28c0
[npu] change device to accelerator api (#5239)
11 months ago
Wenhao Chen 3c08f17348
[hotfix]: modify create_ep_hierarchical_group and add test (#5032)
1 year ago
Wenhao Chen 724441279b
[moe]: fix ep/tp tests, add hierarchical all2all (#4982)
1 year ago
Xuanlei Zhao f71e63b0f3
[moe] support optimizer checkpoint (#5015)
1 year ago
Xuanlei Zhao dc003c304c
[moe] merge moe into main (#4978)
1 year ago
Hongxin Liu 079bf3cb26
[misc] update pre-commit and run all files (#4752)
1 year ago
Hongxin Liu b5f9e37c70
[legacy] clean up legacy code (#4743)
1 year ago
Hongxin Liu 8accecd55b [legacy] move engine to legacy (#4560)
1 year ago
digger-yu 1f73609adb
[CI] fix typo with tests/ etc. (#3727)
2 years ago
Frank Lee 80eba05b0a
[test] refactor tests with spawn (#3452)
2 years ago
ver217 933048ad3e
[test] reorganize zero/gemini tests (#3445)
2 years ago
ver217 26b7aac0be
[zero] reorganize zero/gemini folder structure (#3424)
2 years ago
HELSON 1a1d68b053
[moe] add checkpoint for moe models (#3354)
2 years ago
Jiarui Fang 1e885329f4
[test] align model name with the file name. (#2045)
2 years ago
HELSON 95c35f73bd
[moe] initialize MoE groups by ProcessGroup (#1640)
2 years ago
HELSON a088022efc
[moe] fix moe bugs (#1633)
2 years ago
HELSON f7f2248771
[moe] fix MoE bugs (#1628)
2 years ago
Frank Lee 5a1a095b92
[test] refactored with the new rerun decorator (#763)
3 years ago
ver217 e396bb71f2
[zero] add tensor placement policies (#743)
3 years ago
HELSON 22c4b88d56
[zero] refactor ShardedParamV2 for convenience (#742)
3 years ago
Jiarui Fang 53cb584808
[utils] correct cpu memory used and capacity in the context of multi-process (#726)
3 years ago
HELSON b9b469ea50
[moe] add checkpoint for moe zero test (#729)
3 years ago
Jiarui Fang 193dc8dacb
[refactor] refactor the memory utils (#715)
3 years ago
HELSON a9b8300d54
[zero] improve adaptability for not-shard parameters (#708)
3 years ago
HELSON ee112fe1da
[zero] adapt zero hooks for unsharded module (#699)
3 years ago
HELSON d7ecaf362b
[zero] fix init bugs in zero context (#686)
3 years ago
Jiarui Fang 0aab52301e
[hotfix] fix a bug in model data stats tracing (#655)
3 years ago
HELSON e5d615aeee
[hotfix] fix bugs in testing (#659)
3 years ago
HELSON b31daed4cf
fix bugs in CPU adam (#633)
3 years ago
HELSON 055fbf5be6
[zero] adapt zero for unsharded paramters (Optimizer part) (#601)
3 years ago
HELSON e6d50ec107
[zero] adapt zero for unsharded parameters (#561)
3 years ago
Jiarui Fang 7675366fce
[polish] rename col_attr -> colo_attr (#558)
3 years ago
HELSON 8c90d4df54
[zero] add zero context manager to change config during initialization (#546)
3 years ago
Liang Bowen ec5086c49c Refactored docstring to google style
3 years ago
Frank Lee 3601b2bad0
[test] fixed rerun_on_exception and adapted test cases (#487)
3 years ago
Jiarui Fang a445e118cf
[polish] polish singleton and global context (#500)
3 years ago
Jiarui Fang 65c0f380c2
[format] polish name format for MOE (#481)
3 years ago
HELSON 7544347145
[MOE] add unitest for MOE experts layout, gradient handler and kernel (#469)
3 years ago
HELSON 84fd7c1d4d
add moe context, moe utilities and refactor gradient handler (#455)
3 years ago
1SAA 82023779bb Added TPExpert for special situation
3 years ago
1SAA 219df6e685 Optimized MoE layer and fixed some bugs;
3 years ago