duanjunwen
0d6d40ccc6
[fix] fix zbv llama pp4
2024-11-06 03:35:12 +00:00
duanjunwen
4fc92aa77d
[feat] support no_tp Linear for shardformer.llama
2024-11-05 05:55:42 +00:00
duanjunwen
0218e673db
[fix] fix use_fp8 flag
2024-11-01 07:05:24 +00:00
duanjunwen
5b5fbcff09
[fix] fix HybridParallel use_fp8 config
2024-11-01 05:27:11 +00:00
duanjunwen
3b5c314bea
[fix] fix fp8 args in HybridParallel
2024-11-01 03:54:08 +00:00
duanjunwen
c82c75a9b4
Merge branch 'feature/zerobubble' of github.com:hpcaitech/ColossalAI into dev/zero_bubble
2024-11-01 03:32:18 +00:00
duanjunwen
1d328ff651
Merge branch 'main' into dev/zero_bubble
2024-11-01 03:10:53 +00:00
duanjunwen
aed20fb2df
[feat] support zbv in mixtral benchmark; ( #6083 )
...
* [feat] support zbv in mixtral benchmark;
* [fix] MixtralForCausalLMPolicy get_held_layer supports zbv;
* [feat] update MixtralPipelineForwards --> mixtral_model_forward; support zbv;
* [feat] support MixtralPipelineForwards --> mixtral_for_causal_lm_forward for zbv
* [fix] fix llama, mixtral benchmark zbv loss none bug; update mixtral & llama policy and modeling;
* [feat] Linear1D_Col/Row support zbv WeightGradStore (see the sketch after this entry);
* [feat] support use_zbv in llama, mixtral modeling; only replace Linear1D_Col/Row policy;
* [fix] fix test case; MoE error in the second iter
* [feat] EPMixtralSparseMoeBlock (op in MoE) supports zbv;
* [fix] fix bwd b; now bwd w runs only for layers replaced by Linear1D_Col/Row; other layers perform a full bwd;
* [fix] debug zbv llama test;
* [fix] rm use_zbv flag in ShardConfig; rm debug info;
* [fix] add & fix llama test
* [feat] support meta cache, meta_grad_send, meta_tensor_send; fix overly long runtime in Recv Bwd; benchmark for llama + Hybrid(tp+pp);
* [fix] fix failing case test_shard_llama
* [fix] fix test_shard_llama
* [fix] fix llama modeling policy;
* [fix] fix test_shard_llama ci;
* [fix] fix test zerobubble
* [fix] fix handle name; rm useless comments;
* [fix] fix send recv signature;
* [fix] fix comment in llama & benchmark
* [feat] support no tensor parallel Linear in shardformer; add tests for using WeightGradStore and for not using WeightGradStore
* [fix] fix linear (no tp) ops func name;
2024-10-31 18:17:29 +08:00
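The WeightGradStore bullets in the PR above describe the core zero-bubble (zbv) trick: backward is split into a B pass (input gradients, needed immediately by upstream layers) and a W pass (weight gradients, which can be deferred into pipeline bubbles). The sketch below is a minimal illustration of that split, assuming a hypothetical `WeightGradStore` queue and a toy `DeferredLinear`; it is not ColossalAI's `Linear1D_Col/Row` implementation.

```python
# Minimal sketch of the zbv backward split: B (input grad) runs immediately,
# W (weight grad) is queued and flushed later in the schedule.
# Names like WeightGradStore/DeferredLinear are illustrative, not ColossalAI's API.
import torch
from torch.autograd import Function


class WeightGradStore:
    """Queue of deferred weight-gradient computations."""
    _queue = []

    @classmethod
    def put(cls, fn):
        cls._queue.append(fn)

    @classmethod
    def flush(cls):
        while cls._queue:
            cls._queue.pop(0)()  # run the deferred W passes


class DeferredLinear(Function):
    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        grad_input = grad_out @ weight  # B: upstream layers need this now

        def weight_grad():  # W: can wait for a bubble in the pipeline
            g = grad_out.t() @ x
            weight.grad = g if weight.grad is None else weight.grad + g

        WeightGradStore.put(weight_grad)
        return grad_input, None  # weight grad is filled in later by flush()


x = torch.randn(4, 8, requires_grad=True)
w = torch.nn.Parameter(torch.randn(16, 8))
DeferredLinear.apply(x, w).sum().backward()  # only B runs here
WeightGradStore.flush()                      # W runs later; w.grad is now populated
```

In a real schedule, flush() would be invoked at the points the zero-bubble plan assigns to W, which is what lets the pipeline fill otherwise idle bubbles.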
Hongxin Liu
c2e8f61592
[checkpointio] fix hybrid plugin model save ( #6106 )
2024-10-31 17:04:53 +08:00
duanjunwen
5f0924361d
[fix] fix linear (no tp) ops func name;
2024-10-31 08:18:28 +00:00
duanjunwen
d2e05a99b3
[feat] support no tensor parallel Linear in shardformer; add tests for using WeightGradStore and for not using WeightGradStore
2024-10-30 02:54:32 +00:00
duanjunwen
982e4ee1f8
[fix] fix comment in llama & benchmark
2024-10-29 07:35:50 +00:00
duanjunwen
fa3ccda8ee
[fix] fix send recv signature;
2024-10-29 03:33:58 +00:00
duanjunwen
fafe049b83
[fix] fix handle name; rm useless comments;
2024-10-29 03:24:15 +00:00
duanjunwen
5aee4261a6
[fix] fix test zerobubble
2024-10-28 06:06:07 +00:00
duanjunwen
6377aa0fff
[fix] fix test_shard_llama ci;
2024-10-28 02:42:33 +00:00
duanjunwen
03fa79a55c
[fix] fix llama modeling policy;
2024-10-25 10:17:06 +00:00
duanjunwen
cc0dfddcbc
[fix] fix test_shard_llama
2024-10-25 09:01:13 +00:00
duanjunwen
d0ec221b38
[fix] fix failing case test_shard_llama
2024-10-25 02:28:55 +00:00
duanjunwen
2eca112c90
[feat] support meta cache, meta_grad_send, meta_tensor_send; fix overly long runtime in Recv Bwd; benchmark for llama + Hybrid(tp+pp);
2024-10-24 07:30:19 +00:00
Hanks
dee63cc5ef
Merge pull request #6096 from BurkeHulk/hotfix/lora_ckpt
...
[hotfix] fix lora ckpt saving format
2024-10-21 14:13:04 +08:00
BurkeHulk
6d6cafabe2
pre-commit fix
2024-10-21 14:04:32 +08:00
BurkeHulk
b10339df7c
fix lora ckpt save format (ColoTensor to Tensor)
2024-10-21 13:55:43 +08:00
Hongxin Liu
58d8b8a2dd
[misc] fit torch API upgrade and remove legacy import ( #6093 )
...
* [amp] fit torch's new api
* [amp] fix api call
* [amp] fix api call
* [misc] fit torch pytree api upgrade
* [misc] remove legacy import
* [misc] fit torch amp api
* [misc] fit torch amp api
2024-10-18 16:48:52 +08:00
Hongxin Liu
5ddad486ca
[fp8] add fallback and make compile option configurable ( #6092 )
2024-10-18 13:55:31 +08:00
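For context, a generic sketch of the pattern this commit title names: making torch.compile opt-in with an eager fallback. The helper below is illustrative and assumes nothing about the actual fp8 hooks in PR #6092.

```python
# Illustrative only: a configurable torch.compile wrapper with an eager
# fallback; not the code from PR #6092.
import torch


def maybe_compile(fn, use_compile: bool = True):
    if not use_compile or not hasattr(torch, "compile"):
        return fn  # disabled by config, or PyTorch < 2.0: stay eager
    try:
        return torch.compile(fn)
    except Exception:
        return fn  # compilation unavailable or failed: fall back to eager


gelu = maybe_compile(torch.nn.functional.gelu, use_compile=True)
print(gelu(torch.randn(4)))
```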
botbw
3b1d7d1ae8
[chore] refactor
2024-10-17 11:04:47 +08:00
botbw
2bcd0b6844
[ckpt] add safetensors util
2024-10-17 11:04:47 +08:00
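As background for the safetensors util above, a minimal sketch of saving and loading a state dict with the `safetensors` package; the helper names are hypothetical and the actual utility in this commit may differ.

```python
# Minimal sketch of a safetensors checkpoint helper (hypothetical names).
import torch
from safetensors.torch import load_file, save_file


def save_state_dict(state_dict: dict, path: str) -> None:
    # safetensors expects a flat {name: tensor} dict of CPU tensors.
    save_file({k: v.detach().cpu().contiguous() for k, v in state_dict.items()}, path)


def load_state_dict(path: str, device: str = "cpu") -> dict:
    return load_file(path, device=device)


model = torch.nn.Linear(8, 4)
save_state_dict(model.state_dict(), "linear.safetensors")
model.load_state_dict(load_state_dict("linear.safetensors"))
```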
Hongxin Liu
cd61353bae
[pipeline] hotfix backward for multiple outputs ( #6090 )
...
* [pipeline] hotfix backward for multiple outputs
* [pipeline] hotfix backward for multiple outputs
2024-10-16 17:27:33 +08:00
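The hotfix above concerns backward through a pipeline stage that returns more than one tensor. The generic PyTorch pattern for that case is `torch.autograd.backward` with matched lists of outputs and received gradients; the snippet below illustrates that call only and is not the plugin code.

```python
# Generic illustration: backward for a stage with multiple outputs uses
# torch.autograd.backward with matched lists of tensors and gradients.
import torch

x = torch.randn(2, 4, requires_grad=True)
out_a, out_b = x * 2, x.sum(dim=1)  # a stage producing two outputs

# Gradients "received from the next stage" (here just ones, for the demo).
grads = [torch.ones_like(out_a), torch.ones_like(out_b)]

# Keep only outputs that actually require grad before calling backward.
pairs = [(t, g) for t, g in zip([out_a, out_b], grads) if t.requires_grad]
tensors, grad_tensors = zip(*pairs)
torch.autograd.backward(tensors, grad_tensors)
print(x.grad)  # contributions from both outputs are accumulated
```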
duanjunwen
705b18e1e7
[fix] add & fix llama test
2024-10-16 03:58:50 +00:00
duanjunwen
e76308c6e6
[fix] rm use_zbv flag in ShardConfig; rm debug info;
2024-10-16 03:25:04 +00:00
Wenxuan Tan
62c13e7969
[Ring Attention] Improve comments ( #6085 )
...
* improve comments
* improve comments
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
2024-10-16 11:23:35 +08:00
Wang Binluo
dcd41d0973
Merge pull request #6071 from wangbluo/ring_attention
...
[Ring Attention] fix the 2D ring attn when using multiple machines
2024-10-15 15:17:21 +08:00
wangbluo
83cf2f84fb
fix
2024-10-15 14:50:27 +08:00
duanjunwen
52dcc73313
Merge branch 'feature/zerobubble' of github.com:hpcaitech/ColossalAI into dev/zero_bubble
2024-10-15 06:31:45 +00:00
duanjunwen
9912cc8c07
[fix] fix bwd b; now bwd w runs only for layers replaced by Linear1D_Col/Row; other layers perform a full bwd;
2024-10-15 06:26:01 +00:00
wangbluo
bc7eeade33
fix
2024-10-15 13:28:33 +08:00
wangbluo
fd92789af2
fix
2024-10-15 13:26:44 +08:00
wangbluo
6be9862aaf
fix
2024-10-15 11:56:49 +08:00
wangbluo
3dc08c8a5a
fix
2024-10-15 11:01:34 +08:00
wangbluo
fe9208feac
fix
2024-10-14 18:07:56 +08:00
wangbluo
3201377e94
fix
2024-10-14 18:06:24 +08:00
wangbluo
23199e34cc
fix
2024-10-14 18:01:53 +08:00
duanjunwen
160e9a4175
[feat] EPMixtralSparseMoeBlock (op in MoE) supports zbv;
2024-10-14 08:22:51 +00:00
duanjunwen
abd455189d
[fix] fix test case; MoE error in the second iter
2024-10-14 07:38:02 +00:00
duanjunwen
a11b4b50a7
[feat] support use_zbv in llama, mixtral modeling; only replace Linear1D_Col/Row policy;
2024-10-14 07:12:14 +00:00
duanjunwen
cfade4c36d
[feat] Linear1D_Col/Row support zbv WeightGradStore;
2024-10-14 07:02:43 +00:00
wangbluo
d891e50617
fix
2024-10-14 14:56:05 +08:00
wangbluo
e1e86f9f1f
fix
2024-10-14 11:45:35 +08:00
Tong Li
4c8e85ee0d
[Coati] Train DPO using PP ( #6054 )
...
* update dpo
* remove unsupported plugin
* update msg
* update dpo
* remove unsupported plugin
* update msg
* update template
* update dataset
* add pp for dpo
* update dpo
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add dpo fn
* update dpo
* update dpo
* update dpo
* update dpo
* minor update
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update loss (see the DPO loss sketch after this entry)
* update help
* polish code
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-10-11 19:32:00 +08:00
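Since the PR above wires DPO training through pipeline parallelism, here is the standard DPO objective for reference: a generic sketch of the published loss (Rafailov et al., 2023), not Coati's implementation; the tensor names are assumptions.

```python
# Standard DPO loss, written generically; not the Coati code. Inputs are
# per-sample sequence log-probs under the policy and a frozen reference
# model for the chosen and rejected responses.
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Prefer the chosen response by a margin relative to the reference model.
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()


loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.5]))
print(loss)
```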
duanjunwen
0ca16d5cbe
[fix] fix llama, mixtral benchmark zbv loss none bug; update mixtral & llama policy and modeling;
2024-10-11 07:32:43 +00:00