Commit Graph

57 Commits (eb69e640e58ab89bf2e4d5955fa91d9eff55b61c)

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| flybird11111 | eb69e640e5 | [async io] support async io (#6137) | 6 days ago |
| Hongxin Liu | b90835bd32 | [checkpointio] fix performance issue (#6139) | 6 days ago |
| Wang Binluo | 8e08c27e19 | [ckpt] Add async ckpt api (#6136) | 6 days ago |
| Hongxin Liu | d4a436051d | [checkpointio] support async model save (#6131) | 6 days ago |
| Hongxin Liu | c2e8f61592 | [checkpointio] fix hybrid plugin model save (#6106) | 4 weeks ago |
| BurkeHulk | 6d6cafabe2 | pre-commit fix | 1 month ago |
| BurkeHulk | b10339df7c | fix lora ckpt save format (ColoTensor to Tensor) | 1 month ago |
| Gao, Ruiyuan | e9032fb0b2 | [colossalai/checkpoint_io/...] fix bug in load_state_dict_into_model; format error msg (#6020) | 3 months ago |
| Edenzzzz | f5c84af0b0 | [Feature] Zigzag Ring attention (#5905) | 3 months ago |
| Wang Binluo | 75c963686f | [lora] lora support hybrid parallel plugin (#5956) | 4 months ago |
| botbw | dc583aa576 | [moe] implement tp | 4 months ago |
| Hongxin Liu | e86127925a | [plugin] support all-gather overlap for hybrid parallel (#5919) | 4 months ago |
| Haze188 | 416580b314 | [MoE/ZeRO] Moe refactor with zero refactor (#5821) | 5 months ago |
| flybird11111 | 2ddf624a86 | [shardformer] upgrade transformers to 4.39.3 (#5815) | 5 months ago |
| Baizhou Zhang | 14b0d4c7e5 | [lora] add lora APIs for booster, support lora for TorchDDP (#4981) | 7 months ago |
| flybird11111 | a0ad587c24 | [shardformer] refactor embedding resize (#5603) | 7 months ago |
| Hongxin Liu | 641b1ee71a | [devops] remove post commit ci (#5566) | 8 months ago |
| digger yu | 5e1c93d732 | [hotfix] fix typo change MoECheckpintIO to MoECheckpointIO (#5335) | 9 months ago |
| Frank Lee | efef43b53c | Merge pull request #5372 from hpcaitech/exp/mixtral | 10 months ago |
| Hongxin Liu | da39d21b71 | [moe] support mixtral (#5309) | 10 months ago |
| Hongxin Liu | eb4f2d90f9 | [llama] polish training script and fix optim ckpt (#5368) | 10 months ago |
| Hongxin Liu | ffffc32dc7 | [checkpointio] fix gemini and hybrid parallel optim checkpoint (#5347) | 10 months ago |
| Elsa Granger | b2ad0d9e8f | [pipeline,shardformer] Fix p2p efficiency in pipeline, allow skipping loading weight not in weight_map when `strict=False`, fix llama flash attention forward, add flop estimation by megatron in llama benchmark (#5017) | 1 year ago |
| Jun Gao | a4489384d5 | [shardformer] Fix serialization error with Tensor Parallel state saving (#5018) | 1 year ago |
| Hongxin Liu | cb3a25a062 | [checkpointio] hotfix torch 2.0 compatibility (#4824) | 1 year ago |
| Baizhou Zhang | 64a08b2dc3 | [checkpointio] support unsharded checkpointIO for hybrid parallel (#4774) | 1 year ago |
| Baizhou Zhang | c0a033700c | [shardformer] fix master param sync for hybrid plugin/rewrite unwrapping logic (#4758) | 1 year ago |
| Hongxin Liu | 079bf3cb26 | [misc] update pre-commit and run all files (#4752) | 1 year ago |
| Hongxin Liu | b5f9e37c70 | [legacy] clean up legacy code (#4743) | 1 year ago |
| flybird11111 | 4c4482f3ad | [example] llama2 add fine-tune example (#4673) | 1 year ago |
| Baizhou Zhang | d8ceeac14e | [hotfix] fix typo in hybrid parallel io (#4697) | 1 year ago |
| Hongxin Liu | 554aa9592e | [legacy] move communication and nn to legacy and refactor logger (#4671) | 1 year ago |
| Hongxin Liu | a39a5c66fe | Merge branch 'main' into feature/shardformer | 1 year ago |
| Baizhou Zhang | e79b1e80e2 | [checkpointio] support huggingface from_pretrained for all plugins (#4606) | 1 year ago |
| Hongxin Liu | 63ecafb1fb | [checkpointio] optimize zero optim checkpoint io (#4591) | 1 year ago |
| Baizhou Zhang | 38ccb8b1a3 | [shardformer] support from_pretrained when loading model with HybridParallelPlugin (#4575) | 1 year ago |
| Baizhou Zhang | c9625dbb63 | [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540) | 1 year ago |
| Baizhou Zhang | 44eab2b27f | [shardformer] support sharded checkpoint IO for models of HybridParallelPlugin (#4506) | 1 year ago |
| Baizhou Zhang | c6f6005990 | [checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin (#4302) | 1 year ago |
| Baizhou Zhang | 58913441a1 | [checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin (#4141) | 1 year ago |
| Frank Lee | c4b1b65931 | [test] fixed tests failed due to dtensor change (#4082) | 1 year ago |
| Frank Lee | 8eb09a4c69 | [shardformer] support module saving and loading (#4062) | 1 year ago |
| Baizhou Zhang | 1350ece492 | [hotfix] fix import bug in checkpoint_io (#4142) | 1 year ago |
| digger yu | 7e46bc87b6 | fix CheckpointIndexFile is not defined (#4109) | 1 year ago |
| Baizhou Zhang | 822c3d4d66 | [checkpointio] sharded optimizer checkpoint for DDP plugin (#4002) | 1 year ago |
| Baizhou Zhang | c9cff7e7fa | [checkpointio] General Checkpointing of Sharded Optimizers (#3984) | 1 year ago |
| Frank Lee | bd1ab98158 | [gemini] fixed the gemini checkpoint io (#3934) | 1 year ago |
| wukong1992 | 6b305a99d6 | [booster] torch fsdp fix ckpt (#3788) | 2 years ago |
| Hongxin Liu | 5452df63c5 | [plugin] torch ddp plugin supports sharded model checkpoint (#3775) | 2 years ago |
| digger-yu | ad6460cf2c | [NFC] fix typo applications/ and colossalai/ (#3735) | 2 years ago |