Commit Graph

263 Commits (20332a7a34685e3138f4c8ec8a168d5a9cdb2e2b)

Author SHA1 Message Date
littsk 1a3315e336
[hotfix] Add layer norm gradients all-reduce for sequence parallel (#4926)
1 year ago
Baizhou Zhang d99b2c961a
[hotfix] fix grad accumulation plus clipping for gemini (#5002)
1 year ago
Xuanlei Zhao dc003c304c
[moe] merge moe into main (#4978)
1 year ago
Baizhou Zhang 21ba89cab6
[gemini] support gradient accumulation (#4869)
1 year ago
Zhongkai Zhao a0684e7bd6
[feature] support no master weights option for low level zero plugin (#4816)
1 year ago
littsk 83b52c56cd
[feature] Add clip_grad_norm for hybrid_parallel_plugin (#4837)
1 year ago
Hongxin Liu df63564184
[gemini] support amp o3 for gemini (#4872)
1 year ago
Hongxin Liu cb3a25a062
[checkpointio] hotfix torch 2.0 compatibility (#4824)
1 year ago
littsk 11f1e426fe
[hotfix] Correct several erroneous code comments (#4794)
1 year ago
littsk 54b3ad8924
[hotfix] fix norm type error in zero optimizer (#4795)
1 year ago
Baizhou Zhang c0a033700c
[shardformer] fix master param sync for hybrid plugin/rewrite unwrapping logic (#4758)
1 year ago
Hongxin Liu 079bf3cb26
[misc] update pre-commit and run all files (#4752)
1 year ago
Hongxin Liu b5f9e37c70
[legacy] clean up legacy code (#4743)
1 year ago
Hongxin Liu 554aa9592e
[legacy] move communication and nn to legacy and refactor logger (#4671)
1 year ago
Hongxin Liu fae6c92ead
Merge branch 'main' into feature/shardformer
1 year ago
Hongxin Liu ac178ca5c1 [legacy] move builder and registry to legacy (#4603)
1 year ago
Hongxin Liu 807e01a4ba
[zero] hotfix master param sync (#4618)
1 year ago
Hongxin Liu a39a5c66fe
Merge branch 'main' into feature/shardformer
1 year ago
Hongxin Liu 63ecafb1fb
[checkpointio] optimize zero optim checkpoint io (#4591)
1 year ago
LuGY cbac782254
[zero]fix zero ckptIO with offload (#4529)
1 year ago
Baizhou Zhang c9625dbb63
[shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540)
1 year ago
flybird11111 ec18fc7340
[shardformer] support pp+tp+zero1 tests (#4531)
1 year ago
Jianghai 376533a564
[shardformer] zero1+pp and the corresponding tests (#4517)
1 year ago
Baizhou Zhang 44eab2b27f
[shardformer] support sharded checkpoint IO for models of HybridParallelPlugin (#4506)
1 year ago
LuGY 839847b7d7
[zero]support zero2 with gradient accumulation (#4511)
1 year ago
Hongxin Liu 27061426f7
[gemini] improve compatibility and add static placement policy (#4479)
1 year ago
LuGY d86ddd9b29
[hotfix] fix unsafe async comm in zero (#4404)
1 year ago
Baizhou Zhang 6ccecc0c69
[gemini] fix tensor storage cleaning in state dict collection (#4396)
1 year ago
LuGY 45b08f08cb [zero] optimize the optimizer step time (#4221)
1 year ago
LuGY 1a49a5ea00 [zero] support shard optimizer state dict of zero (#4194)
1 year ago
LuGY dd7cc58299 [zero] add state dict for low level zero (#4179)
1 year ago
LuGY c668801d36 [zero] allow passing process group to zero12 (#4153)
1 year ago
LuGY 79cf1b5f33 [zero]support no_sync method for zero1 plugin (#4138)
1 year ago
LuGY c6ab96983a [zero] refactor low level zero for shard evenly (#4030)
1 year ago
Baizhou Zhang c6f6005990
[checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin (#4302)
1 year ago
Baizhou Zhang 58913441a1
Next commit [checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin (#4141)
1 year ago
Baizhou Zhang 0bb0b481b4 [gemini] fix argument naming during chunk configuration searching
1 year ago
Frank Lee 71fe52769c [gemini] fixed the gemini checkpoint io (#3934)
1 year ago
digger yu de0d7df33f
[nfc] fix typo colossalai/zero (#3923)
2 years ago
digger yu a9d1cadc49
fix typo with colossalai/trainer utils zero (#3908)
2 years ago
Hongxin Liu ae02d4e4f7
[bf16] add bf16 support (#3882)
2 years ago
Hongxin Liu dbb32692d2
[lazy] refactor lazy init (#3891)
2 years ago
digger yu 9265f2d4d7
[NFC]fix typo colossalai/auto_parallel nn utils etc. (#3779)
2 years ago
jiangmingyan 307894f74d
[booster] gemini plugin support shard checkpoint (#3610)
2 years ago
YH a22407cc02
[zero] Suggests a minor change to confusing variable names in the ZeRO optimizer. (#3173)
2 years ago
Hongxin Liu 50793b35f4
[gemini] accelerate inference (#3641)
2 years ago
Hongxin Liu 4b3240cb59
[booster] add low level zero plugin (#3594)
2 years ago
digger-yu b9a8dff7e5
[doc] Fix typo under colossalai and doc(#3618)
2 years ago
Hongxin Liu 12eff9eb4c
[gemini] state dict supports fp16 (#3590)
2 years ago
Hongxin Liu f313babd11
[gemini] support save state dict in shards (#3581)
2 years ago