3877 Commits (feature/zerobubble)

Author SHA1 Message Date
duanjunwen 810cafb2f9 Merge pull request #6114 from duanjunwen/dev/zero_bubble 4 days ago
duanjunwen 41fdd2139b [fix] rm unused comments 4 days ago
duanjunwen dafda0fb70 [fix] remove debug info; 4 days ago
duanjunwen 9a21f87ed6 [fix] fix wait handle in run_fwd_bwd 4 days ago
duanjunwen f48a85e91d [fix] fix test_lora in llama policy 6 days ago
duanjunwen 2980da559f [fix] fix test_lora 6 days ago
duanjunwen 0fb500c7d4 [fix] rm debug info; update llama policy; update wait handle 7 days ago
duanjunwen cf86c1b1c5 [fix] fix zbv wait_handle 7 days ago
duanjunwen 5c2ebbfd48 [fix] fix mixtral modeling & policy; update wait handles; doing benchmarking for llama hybrid; 7 days ago
duanjunwen 014afbdb59 [fix] fix attn 1 week ago
duanjunwen 1bc4dba3a3 [fix] fix p2p error in zbv 1 week ago
duanjunwen b6d5e61809 [feat] update mixtral policy & bert policy for zerobubble 1 week ago
duanjunwen 80b04d7855 [feat] support mixtral policy with zbv tp_Linear & non_tp_Linear 1 week ago
duanjunwen 337debcf2a [feat] fix testcase; 1 week ago
duanjunwen 12919de424 [fix] fix send_tensor_metadata & send_grad_metadata; 2 weeks ago
duanjunwen 0d6d40ccc6 [fix] fix zbv llama pp4 2 weeks ago
duanjunwen 4fc92aa77d [feat] support no_tp Linear for shardformer.llama 2 weeks ago
duanjunwen 37b23e32b1 Merge pull request #6107 from duanjunwen/dev/zero_bubble 2 weeks ago
duanjunwen 8e40087633 [fix] fix model zoo init 3 weeks ago
duanjunwen 0218e673db [fix] fix use_fp8 flag 3 weeks ago
duanjunwen 5b5fbcff09 [fix] fix hybridparallel use_fp8 config 3 weeks ago
duanjunwen 3b5c314bea [fix] fix fp8 args in HybridParallel 3 weeks ago
duanjunwen c82c75a9b4 Merge branch 'feature/zerobubble' of github.com:hpcaitech/ColossalAI into dev/zero_bubble 3 weeks ago
duanjunwen 1d328ff651 Merge branch 'main' into dev/zero_bubble 3 weeks ago
pre-commit-ci[bot] 2f583c1549 [pre-commit.ci] pre-commit autoupdate (#6078) 3 weeks ago
duanjunwen aed20fb2df [feat] support zbv in mixtral benchmark; (#6083) 3 weeks ago
Hongxin Liu c2e8f61592 [checkpointio] fix hybrid plugin model save (#6106) 3 weeks ago
duanjunwen 5f0924361d [fix] fix linear (no tp) ops func name; 3 weeks ago
duanjunwen d2e05a99b3 [feat] support no tensor parallel Linear in shardformer; add tests with and without WeightGradStore 3 weeks ago
duanjunwen 982e4ee1f8 [fix] fix comment in llama & benchmark 3 weeks ago
duanjunwen fa3ccda8ee [fix] fix send recv signature; 3 weeks ago
duanjunwen fafe049b83 [fix] fix handle name; rm useless comments; 3 weeks ago
duanjunwen 5aee4261a6 [fix] fix test zerobubble 4 weeks ago
duanjunwen 6377aa0fff [fix] fix test_shard_llama ci; 4 weeks ago
duanjunwen 03fa79a55c [fix] fix llama modeling policy; 4 weeks ago
duanjunwen cc0dfddcbc [fix] fix test_shard_llama 4 weeks ago
duanjunwen d0ec221b38 [fix] fix failing case test_shard_llama 4 weeks ago
Tong Li 89a9a600bc [MCTS] Add self-refined MCTS (#6098) 4 weeks ago
duanjunwen 2eca112c90 [feat] support meta cache, meta_grad_send, meta_tensor_send; fix overly long runtime in Recv Bwd; benchmark for llama + Hybrid(tp+pp); 4 weeks ago
binmakeswell 4294ae83bb [doc] sora solution news (#6100) 4 weeks ago
Hongxin Liu 80a8ca916a [extension] hotfix compile check (#6099) 4 weeks ago
Hanks dee63cc5ef Merge pull request #6096 from BurkeHulk/hotfix/lora_ckpt 1 month ago
BurkeHulk 6d6cafabe2 pre-commit fix 1 month ago
BurkeHulk b10339df7c fix lora ckpt save format (ColoTensor to Tensor) 1 month ago
Hongxin Liu 19baab5fd5 [release] update version (#6094) 1 month ago
Hongxin Liu 58d8b8a2dd [misc] fit torch api upgrade and remove legacy import (#6093) 1 month ago
Hongxin Liu 5ddad486ca [fp8] add fallback and make compile option configurable (#6092) 1 month ago
botbw 3b1d7d1ae8 [chore] refactor 1 month ago
botbw 2bcd0b6844 [ckpt] add safetensors util 1 month ago
Hongxin Liu cd61353bae [pipeline] hotfix backward for multiple outputs (#6090) 1 month ago