286 Commits (7f8b16635b42013b73e1cb1ffdebc07b4d71ac93)

Author SHA1 Message Date
linsj20 91fa553775 [Feature] qlora support (#5586) 7 months ago
flybird11111 8954a0c2e2 [LowLevelZero] low level zero support lora (#5153) 7 months ago
Wang Binluo 0d0a582033
[shardformer] update transformers (#5583) 7 months ago
flybird11111 a0ad587c24
[shardformer] refactor embedding resize (#5603) 7 months ago
Hongxin Liu 3788fefc7a
[zero] support multiple (partial) backward passes (#5596) 7 months ago
Hongxin Liu 641b1ee71a
[devops] remove post commit ci (#5566) 8 months ago
Zhongkai Zhao 8e412a548e
[shardformer] Sequence Parallelism Optimization (#5533) 8 months ago
Hongxin Liu 7303801854
[llama] fix training and inference scripts (#5384) 9 months ago
ver217 06db94fbc9 [moe] fix tests 10 months ago
Hongxin Liu da39d21b71 [moe] support mixtral (#5309) 10 months ago
Xuanlei Zhao 7d8e0338a4 [moe] init mixtral impl 10 months ago
Hongxin Liu ffffc32dc7
[checkpointio] fix gemini and hybrid parallel optim checkpoint (#5347) 10 months ago
Frank Lee 7cfed5f076
[feat] refactored extension module (#5298) 10 months ago
digger yu bce9499ed3
fix some typo (#5307) 10 months ago
Hongxin Liu d202cc28c0
[npu] change device to accelerator api (#5239) 11 months ago
Wenhao Chen 7172459e74
[shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert (#5088) 1 year ago
flybird11111 4ccb9ded7d
[gemini]fix gemini optimzer, saving Shardformer in Gemini got list assignment index out of range (#5085) 1 year ago
github-actions[bot] 8921a73c90
[format] applied code formatting on changed files in pull request 5067 (#5072) 1 year ago
Hongxin Liu e5ce4c8ea6
[npu] add npu support for gemini and zero (#5067) 1 year ago
flybird11111 3e02154710
[gemini] gemini support extra-dp (#5043) 1 year ago
flybird11111 576a2f7b10
[gemini] gemini support tensor parallelism. (#4942) 1 year ago
littsk 1a3315e336
[hotfix] Add layer norm gradients all-reduce for sequence parallel (#4926) 1 year ago
Baizhou Zhang d99b2c961a
[hotfix] fix grad accumulation plus clipping for gemini (#5002) 1 year ago
Xuanlei Zhao dc003c304c
[moe] merge moe into main (#4978) 1 year ago
Baizhou Zhang 21ba89cab6
[gemini] support gradient accumulation (#4869) 1 year ago
Zhongkai Zhao a0684e7bd6
[feature] support no master weights option for low level zero plugin (#4816) 1 year ago
littsk 83b52c56cd
[feature] Add clip_grad_norm for hybrid_parallel_plugin (#4837) 1 year ago
Hongxin Liu df63564184
[gemini] support amp o3 for gemini (#4872) 1 year ago
Hongxin Liu cb3a25a062
[checkpointio] hotfix torch 2.0 compatibility (#4824) 1 year ago
littsk 11f1e426fe
[hotfix] Correct several erroneous code comments (#4794) 1 year ago
littsk 54b3ad8924
[hotfix] fix norm type error in zero optimizer (#4795) 1 year ago
Baizhou Zhang c0a033700c
[shardformer] fix master param sync for hybrid plugin/rewrite unwrapping logic (#4758) 1 year ago
Hongxin Liu 079bf3cb26
[misc] update pre-commit and run all files (#4752) 1 year ago
Hongxin Liu b5f9e37c70
[legacy] clean up legacy code (#4743) 1 year ago
Hongxin Liu 554aa9592e
[legacy] move communication and nn to legacy and refactor logger (#4671) 1 year ago
Hongxin Liu ac178ca5c1 [legacy] move builder and registry to legacy (#4603) 1 year ago
Hongxin Liu 807e01a4ba
[zero] hotfix master param sync (#4618) 1 year ago
Hongxin Liu 63ecafb1fb
[checkpointio] optimize zero optim checkpoint io (#4591) 1 year ago
LuGY cbac782254
[zero]fix zero ckptIO with offload (#4529) 1 year ago
Baizhou Zhang c9625dbb63
[shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540) 1 year ago
flybird11111 ec18fc7340
[shardformer] support pp+tp+zero1 tests (#4531) 1 year ago
Jianghai 376533a564
[shardformer] zero1+pp and the corresponding tests (#4517) 1 year ago
Baizhou Zhang 44eab2b27f
[shardformer] support sharded checkpoint IO for models of HybridParallelPlugin (#4506) 1 year ago
LuGY 839847b7d7
[zero]support zero2 with gradient accumulation (#4511) 1 year ago
Hongxin Liu 27061426f7
[gemini] improve compatibility and add static placement policy (#4479) 1 year ago
LuGY d86ddd9b29
[hotfix] fix unsafe async comm in zero (#4404) 1 year ago
Baizhou Zhang 6ccecc0c69
[gemini] fix tensor storage cleaning in state dict collection (#4396) 1 year ago
LuGY 45b08f08cb [zero] optimize the optimizer step time (#4221) 1 year ago
LuGY 1a49a5ea00 [zero] support shard optimizer state dict of zero (#4194) 1 year ago
LuGY dd7cc58299 [zero] add state dict for low level zero (#4179) 1 year ago