61 Commits (main)

Author SHA1 Message Date
Hongxin Liu cf519dac6a [optim] hotfix adam load (#6146) 2 days ago
flybird11111 eb69e640e5 [async io]supoort async io (#6137) 3 days ago
Wang Binluo 8e08c27e19 [ckpt] Add async ckpt api (#6136) 3 days ago
Wang Binluo eea37da6fa [fp8] Merge feature/fp8_comm to main branch of Colossalai (#6016) 3 months ago
Edenzzzz f5c84af0b0 [Feature] Zigzag Ring attention (#5905) 3 months ago
Haze188 416580b314 [MoE/ZeRO] Moe refactor with zero refactor (#5821) 5 months ago
botbw 80c3c8789b [Test/CI] remove test cases to reduce CI duration (#5753) 6 months ago
Edenzzzz 79f7a7b211 [misc] Accelerate CI for zero and dist optim (#5758) 6 months ago
Hongxin Liu 7f8b16635b [misc] refactor launch API and tensor constructor (#5666) 7 months ago
flybird11111 8954a0c2e2 [LowLevelZero] low level zero support lora (#5153) 7 months ago
flybird11111 a0ad587c24 [shardformer] refactor embedding resize (#5603) 7 months ago
Hongxin Liu 641b1ee71a [devops] remove post commit ci (#5566) 8 months ago
Zhongkai Zhao 8e412a548e [shardformer] Sequence Parallelism Optimization (#5533) 8 months ago
Wenhao Chen bb0a668fee [hotfix] set return_outputs=False in examples and polish code (#5404) 8 months ago
Hongxin Liu f2e8b9ef9f [devops] fix compatibility (#5444) 8 months ago
flybird11111 29695cf70c [example]add gpt2 benchmark example script. (#5295) 9 months ago
QinLuo bf34c6fef6 [fsdp] impl save/load shard model/optimizer (#5357) 9 months ago
Hongxin Liu ffffc32dc7 [checkpointio] fix gemini and hybrid parallel optim checkpoint (#5347) 10 months ago
Frank Lee 7cfed5f076 [feat] refactored extension module (#5298) 10 months ago
flybird11111 2a0558d8ec [ci] fix test_hybrid_parallel_plugin_checkpoint_io.py (#5276) 10 months ago
Frank Lee d69cd2eb89 [workflow] fixed oom tests (#5275) 10 months ago
Frank Lee edf94a35c3 [workflow] fixed build CI (#5240) 11 months ago
flybird11111 3e02154710 [gemini] gemini support extra-dp (#5043) 1 year ago
flybird11111 576a2f7b10 [gemini] gemini support tensor parallelism. (#4942) 1 year ago
Baizhou Zhang 39f2582e98 [hotfix] fix lr scheduler bug in torch 2.0 (#4864) 1 year ago
Hongxin Liu df63564184 [gemini] support amp o3 for gemini (#4872) 1 year ago
Bin Jia 08a9f76b2f [Pipeline Inference] Sync pipeline inference branch to main (#4820) 1 year ago
Hongxin Liu cb3a25a062 [checkpointio] hotfix torch 2.0 compatibility (#4824) 1 year ago
Hongxin Liu 4965c0dabd [lazy] support from_pretrained (#4801) 1 year ago
Baizhou Zhang 64a08b2dc3 [checkpointio] support unsharded checkpointIO for hybrid parallel (#4774) 1 year ago
Baizhou Zhang c0a033700c [shardformer] fix master param sync for hybrid plugin/rewrite unwrapping logic (#4758) 1 year ago
Hongxin Liu 079bf3cb26 [misc] update pre-commit and run all files (#4752) 1 year ago
Hongxin Liu bd18678478 [test] fix gemini checkpoint and gpt test (#4620) 1 year ago
Hongxin Liu 807e01a4ba [zero] hotfix master param sync (#4618) 1 year ago
Baizhou Zhang e79b1e80e2 [checkpointio] support huggingface from_pretrained for all plugins (#4606) 1 year ago
LuGY cbac782254 [zero]fix zero ckptIO with offload (#4529) 1 year ago
Baizhou Zhang 38ccb8b1a3 [shardformer] support from_pretrained when loading model with HybridParallelPlugin (#4575) 1 year ago
Baizhou Zhang c9625dbb63 [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540) 1 year ago
Baizhou Zhang 2c787d7f47 [shardformer] fix submodule replacement bug when enabling pp (#4544) 1 year ago
Baizhou Zhang 44eab2b27f [shardformer] support sharded checkpoint IO for models of HybridParallelPlugin (#4506) 1 year ago
Hongxin Liu 27061426f7 [gemini] improve compatibility and add static placement policy (#4479) 1 year ago
LuGY 1a49a5ea00 [zero] support shard optimizer state dict of zero (#4194) 1 year ago
Baizhou Zhang c6f6005990 [checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin (#4302) 1 year ago
Baizhou Zhang 58913441a1 [checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin (#4141) 1 year ago
Frank Lee c4b1b65931 [test] fixed tests failed due to dtensor change (#4082) 1 year ago
Baizhou Zhang 0bb0b481b4 [gemini] fix argument naming during chunk configuration searching 1 year ago
Frank Lee a5883aa790 [test] fixed codefactor format report (#4026) 1 year ago
Baizhou Zhang 822c3d4d66 [checkpointio] sharded optimizer checkpoint for DDP plugin (#4002) 1 year ago
Baizhou Zhang c9cff7e7fa [checkpointio] General Checkpointing of Sharded Optimizers (#3984) 1 year ago
wukong1992 6b305a99d6 [booster] torch fsdp fix ckpt (#3788) 2 years ago