duanjunwen
810cafb2f9
Merge pull request #6114 from duanjunwen/dev/zero_bubble
[Zerobubble] Support LinearWithAsyncCommunication for shardformer policy
4 days ago
duanjunwen
41fdd2139b
[fix] rm unused comments
4 days ago
duanjunwen
dafda0fb70
[fix] remove debug info;
4 days ago
duanjunwen
9a21f87ed6
[fix] fix wait handle in run_fwd_bwd
4 days ago
duanjunwen
f48a85e91d
[fix] fix test_lora in llama policy
6 days ago
duanjunwen
2980da559f
[fix] fix test_lora
6 days ago
duanjunwen
0fb500c7d4
[fix] rm debug info; update llama policy; update wait handle
7 days ago
duanjunwen
cf86c1b1c5
[fix] fix zbv wait_handle
7 days ago
duanjunwen
5c2ebbfd48
[fix] fix mixtral modeling & policy; update wait handles; doing benchmarking for llama hybrid;
7 days ago
duanjunwen
014afbdb59
[fix] fix attn
1 week ago
duanjunwen
1bc4dba3a3
[fix] fix p2p error in zbv
1 week ago
duanjunwen
b6d5e61809
[feat] update mixtral policy & bert policy for zerobubble
1 week ago
duanjunwen
80b04d7855
[feat] support mixtral policy with zbv tp_Linear & non_tp_Linear
1 week ago
duanjunwen
337debcf2a
[feat] fix testcase;
1 week ago
duanjunwen
12919de424
[fix] fix send_tensor_metadata & send_grad_metadata;
2 weeks ago
duanjunwen
0d6d40ccc6
[fix] fix zbv llama pp4
2 weeks ago
duanjunwen
4fc92aa77d
[feat] support no_tp Linear for shardformer.llama
2 weeks ago
duanjunwen
37b23e32b1
Merge pull request #6107 from duanjunwen/dev/zero_bubble
[Zerobubble] Merge Main.
2 weeks ago
duanjunwen
8e40087633
[fix] fix model zoo init
3 weeks ago
duanjunwen
0218e673db
[fix] fix use_fp8 flag
3 weeks ago
duanjunwen
5b5fbcff09
[fix] fix HybridParallel use_fp8 config
3 weeks ago
duanjunwen
3b5c314bea
[fix] fix fp8 args in HybridParallel
3 weeks ago
duanjunwen
c82c75a9b4
Merge branch 'feature/zerobubble' of github.com:hpcaitech/ColossalAI into dev/zero_bubble
3 weeks ago
duanjunwen
1d328ff651
Merge branch 'main' into dev/zero_bubble
3 weeks ago
pre-commit-ci[bot]
2f583c1549
[pre-commit.ci] pre-commit autoupdate (#6078)
updates:
- [github.com/psf/black-pre-commit-mirror: 24.8.0 → 24.10.0](https://github.com/psf/black-pre-commit-mirror/compare/24.8.0...24.10.0)
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.2](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.2)
- [github.com/pre-commit/pre-commit-hooks: v4.6.0 → v5.0.0](https://github.com/pre-commit/pre-commit-hooks/compare/v4.6.0...v5.0.0)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
3 weeks ago
duanjunwen
aed20fb2df
[feat] support zbv in mixtral benchmark; (#6083)
* [feat] support zbv in mixtral benchmark;
* [fix] MixtralForCausalLMPolicy get_held_layer support zbv;
* [feat] update MixtralPipelineForwards --> mixtral_model_forward; support zbv;
* [feat] support MixtralPipelineForwards--> mixtral_for_causal_lm_forward for zbv
* [fix] fix llama, mixtral benchmark zbv loss none bug; update mixtral & llama policy and modeling;
* [feat] Linear1D_COL/ROW support zbv WeightGradStore;
* [feat] support use_zbv in llama, mixtral modeling; only replace Linear1D_Col/Row policy;
* [fix] fix test case; moe error in second iter
* [feat]EPMixtralSparseMoeBlock (op in MOE) support zbv;
* [fix] fix bwd b; now bwd w only for layers replaced by Linear1D_Col/Row; other layers perform a full bwd;
* [fix] debug zbv llama test;
* [fix] rm use_zbv flag in Shardconfig; rm debug info;
* [fix] add & fix llama test
* [feat] support meta cache, meta_grad_send, meta_tensor_send; fix runtime too long in Recv Bwd; benchmark for llama + Hybrid(tp+pp);
* [fix] fix fail case test_shard_llama
* [fix] fix test_shard_llama
* [fix] fix llama modeling policy;
* [fix] fix test_shard_llama ci;
* [fix] fix test zerobubble
* [fix] fix handle name; rm useless comments;
* [fix] fix send recv signature;
* [fix] fix comment in llama & benchmark
* [feat] support no tensor parallel Linear in shardformer; add test for using WeightGradStore and not using WeightGradStore
* [fix] fix linear (no tp) ops func name;
3 weeks ago
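Several bullets in the squashed commit above mention WeightGradStore, i.e. the zero-bubble idea of splitting a linear layer's backward into an immediate input-gradient step (bwd b) and a deferred weight-gradient step (bwd w) that can be scheduled into pipeline bubbles. The sketch below only illustrates that split; it is not ColossalAI's WeightGradStore, and the class and function names in it are hypothetical.

```python
# Illustrative sketch (hypothetical names), not ColossalAI code: defer the
# weight-gradient computation of a linear layer so the input gradient can be
# returned immediately and the dW work done later (e.g. inside a bubble).
from collections import deque

import torch


class DeferredWeightGradStore:
    """Queue of pending (input, grad_output, weight) triples for deferred dW."""

    def __init__(self):
        self._queue = deque()

    def put(self, saved_input, grad_output, weight):
        self._queue.append((saved_input, grad_output, weight))

    def flush(self):
        # Compute the deferred weight gradients, e.g. while waiting on p2p.
        with torch.no_grad():
            while self._queue:
                saved_input, grad_output, weight = self._queue.popleft()
                grad_w = grad_output.t() @ saved_input  # dW = dY^T X
                weight.grad = grad_w if weight.grad is None else weight.grad + grad_w


_store = DeferredWeightGradStore()


class LinearSplitBackward(torch.autograd.Function):
    """Linear whose backward returns dX now and defers dW to the store."""

    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_output):
        x, weight = ctx.saved_tensors
        _store.put(x, grad_output, weight)  # bwd w deferred
        return grad_output @ weight, None   # bwd b returned immediately


if __name__ == "__main__":
    x = torch.randn(4, 8, requires_grad=True)
    w = torch.nn.Parameter(torch.randn(16, 8))
    LinearSplitBackward.apply(x, w).sum().backward()
    assert w.grad is None  # dW not computed yet
    _store.flush()         # fill the bubble with the deferred dW
    print(w.grad.shape)    # torch.Size([16, 8])
```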
Hongxin Liu
c2e8f61592
[checkpointio] fix hybrid plugin model save (#6106)
3 weeks ago
duanjunwen
5f0924361d
[fix] fix linear (no tp) ops func name;
3 weeks ago
duanjunwen
d2e05a99b3
[feat] support no tensor parallel Linear in shardformer; add test for using WeightGradStore and not using WeightGradStore
3 weeks ago
duanjunwen
982e4ee1f8
[fix] fix comment in llama & benchmark
3 weeks ago
duanjunwen
fa3ccda8ee
[fix] fix send recv signature;
3 weeks ago
duanjunwen
fafe049b83
[fix] fix handle name; rm useless comments;
3 weeks ago
duanjunwen
5aee4261a6
[fix] fix test zerobubble
4 weeks ago
duanjunwen
6377aa0fff
[fix] fix test_shard_llama ci;
4 weeks ago
duanjunwen
03fa79a55c
[fix] fix llama modeling policy;
4 weeks ago
duanjunwen
cc0dfddcbc
[fix] fix test_shard_llama
4 weeks ago
duanjunwen
d0ec221b38
[fix] fix fail case test_shard_llama
4 weeks ago
Tong Li
89a9a600bc
[MCTS] Add self-refined MCTS (#6098)
* add reasoner
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update code
* delete llama
* update prompts
* update readme
* update readme
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
4 weeks ago
duanjunwen
2eca112c90
[feat] support meta cache, meta_grad_send, meta_tensor_send; fix runtime too long in Recv Bwd; benchmark for llama + Hybrid(tp+pp);
4 weeks ago
binmakeswell
4294ae83bb
[doc] sora solution news (#6100)
* [doc] sora solution news
* [doc] sora solution news
4 weeks ago
Hongxin Liu
80a8ca916a
[extension] hotfix compile check (#6099)
4 weeks ago
Hanks
dee63cc5ef
Merge pull request #6096 from BurkeHulk/hotfix/lora_ckpt
[hotfix] fix lora ckpt saving format
1 month ago
BurkeHulk
6d6cafabe2
pre-commit fix
1 month ago
BurkeHulk
b10339df7c
fix lora ckpt save format (ColoTensor to Tensor)
1 month ago
Hongxin Liu
19baab5fd5
[release] update version (#6094)
1 month ago
Hongxin Liu
58d8b8a2dd
[misc] fit torch api upgrade and remove legacy import (#6093)
* [amp] fit torch's new api
* [amp] fix api call
* [amp] fix api call
* [misc] fit torch pytree api upgrade
* [misc] remove legacy import
* [misc] fit torch amp api
* [misc] fit torch amp api
1 month ago
Hongxin Liu
5ddad486ca
[fp8] add fallback and make compile option configurable (#6092)
1 month ago
botbw
3b1d7d1ae8
[chore] refactor
1 month ago
botbw
2bcd0b6844
[ckpt] add safetensors util
1 month ago
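The "[ckpt] add safetensors util" commit above concerns ColossalAI's checkpoint IO. As a generic, stand-alone illustration of the safetensors format such a utility builds on (not the utility itself):

```python
# Generic safetensors usage, not ColossalAI's util: a checkpoint is stored as
# a flat {name: tensor} mapping plus optional string-valued metadata.
import torch
from safetensors.torch import load_file, save_file

state = {
    "linear.weight": torch.randn(16, 8),
    "linear.bias": torch.zeros(16),
}

# Tensors must be contiguous and must not share storage; metadata values are str.
save_file(state, "checkpoint.safetensors", metadata={"format": "pt"})

restored = load_file("checkpoint.safetensors")
print(sorted(restored), restored["linear.weight"].shape)
```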
Hongxin Liu
cd61353bae
[pipeline] hotfix backward for multiple outputs (#6090)
* [pipeline] hotfix backward for multiple outputs
* [pipeline] hotfix backward for multiple outputs
1 month ago
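The "backward for multiple outputs" in the last entry touches a generic PyTorch point worth recalling: when a pipeline stage produces several output tensors, the backward call needs one incoming gradient per output. A minimal stand-alone illustration (not the ColossalAI hotfix itself):

```python
# Stand-alone illustration, not the ColossalAI fix: backprop through a stage
# with two outputs by pairing each output with its incoming gradient.
import torch

x = torch.randn(3, 5, requires_grad=True)
w1 = torch.nn.Parameter(torch.randn(5, 5))
w2 = torch.nn.Parameter(torch.randn(5, 5))

out1 = x @ w1               # first stage output
out2 = torch.relu(x @ w2)   # second stage output

grad1 = torch.ones_like(out1)  # gradients "received from the next stage"
grad2 = torch.ones_like(out2)

# One grad tensor per output; a plain out1.backward(grad1) would skip out2's path.
torch.autograd.backward((out1, out2), grad_tensors=(grad1, grad2))

print(x.grad.shape, w1.grad.shape, w2.grad.shape)
```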