Commit Graph

3874 Commits (dafda0fb7082506ad76b5deff3024b3d5dbb904b)

Author SHA1 Message Date
duanjunwen dafda0fb70 [fix] remove debug info; 2024-11-18 03:32:04 +00:00
duanjunwen 9a21f87ed6 [fix] fix wait handle in run_fwd_bwd 2024-11-18 02:50:14 +00:00
duanjunwen f48a85e91d [fix] fix test_lora in llama policy 2024-11-15 10:27:13 +00:00
duanjunwen 2980da559f [fix] fix test_lora 2024-11-15 10:26:30 +00:00
duanjunwen 0fb500c7d4 [fix] rm debug info; update llama policy; update wait handle 2024-11-15 09:47:05 +00:00
duanjunwen cf86c1b1c5 [fix] fix zbv wait_handle 2024-11-15 07:56:14 +00:00
duanjunwen 5c2ebbfd48 [fix] fix mixtral modeling & policy; update wait handles; doing benchmarking for llama hybrid; 2024-11-15 05:58:56 +00:00
duanjunwen 014afbdb59 [fix] fix attn 2024-11-14 09:43:47 +00:00
duanjunwen 1bc4dba3a3 [fix] fix p2p error in zbv 2024-11-14 09:40:38 +00:00
duanjunwen b6d5e61809 [feat] update mixtral policy & bert policy for zerobubble 2024-11-14 02:51:34 +00:00
duanjunwen 80b04d7855 [feat] support mixtral policy with zbv tp_Linear & non_tp_Linear 2024-11-12 07:28:49 +00:00
duanjunwen 337debcf2a [feat] fix testcase; 2024-11-11 11:34:29 +00:00
duanjunwen 12919de424 [fix] fix send_tensor_metadata & send_grad_metadata; 2024-11-11 08:54:39 +00:00
duanjunwen 0d6d40ccc6 [fix] fix zbv llama pp4 2024-11-06 03:35:12 +00:00
duanjunwen 4fc92aa77d [feat] support no_tp Linear for shardformer.llama 2024-11-05 05:55:42 +00:00
duanjunwen 8e40087633 [fix] fix model zoo init 2024-11-01 09:02:07 +00:00
duanjunwen 0218e673db [fix] fix use_fp8 flag 2024-11-01 07:05:24 +00:00
duanjunwen 5b5fbcff09 [fix] fix hybridparallel use_fp8 config 2024-11-01 05:27:11 +00:00
duanjunwen 3b5c314bea [fix] fix fp8 args in HybridParallel 2024-11-01 03:54:08 +00:00
duanjunwen c82c75a9b4 Merge branch 'feature/zerobubble' of github.com:hpcaitech/ColossalAI into dev/zero_bubble 2024-11-01 03:32:18 +00:00
duanjunwen 1d328ff651 Merge branch 'main' into dev/zero_bubble 2024-11-01 03:10:53 +00:00
pre-commit-ci[bot] 2f583c1549 [pre-commit.ci] pre-commit autoupdate (#6078)
updates:
- [github.com/psf/black-pre-commit-mirror: 24.8.0 → 24.10.0](https://github.com/psf/black-pre-commit-mirror/compare/24.8.0...24.10.0)
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.2](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.2)
- [github.com/pre-commit/pre-commit-hooks: v4.6.0 → v5.0.0](https://github.com/pre-commit/pre-commit-hooks/compare/v4.6.0...v5.0.0)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-10-31 18:18:01 +08:00
duanjunwen aed20fb2df [feat] support zbv in mixtral benchmark; (#6083)
* [feat] support zbv in mixtral benchmark;

* [fix] MixtralForCausalLMPolicy get_held_layer support zbv;

* [feat] update MixtralPipelineForwards --> mixtral_model_forward; support zbv;

* [feat] support MixtralPipelineForwards --> mixtral_for_causal_lm_forward for zbv

* [fix] fix llama, mixtral benchmark zbv loss none bug; update mixtral & llama policy and modeling;

* [feat] Linear1D_COL/ROW support zbv WeightGradStore;

* [feat] support use_zbv in llama, mixtral modeling; only replace Linear1D_Col/Row policy;

* [fix] fix test case; moe error in second iter

* [feat] EPMixtralSparseMoeBlock (op in MoE) supports zbv;

* [fix] fix bwd b; now bwd w only for layers replaced by Linear1D_Col/Row; other layers perform a full bwd;

* [fix] debug zbv llama test;

* [fix] rm use_zbv flag in Shardconfig; rm debug info;

* [fix] add & fix llama test

* [feat] support meta cache, meta_grad_send, meta_tensor_send; fix runtime too long in Recv Bwd; benchmark for llama + Hybrid(tp+pp);

* [fix] fix fail case test_shard_llama

* [fix] fix test_shard_llama

* [fix] fix llama modeling policy;

* [fix] fix test_shard_llama ci;

* [fix] fix test zerobubble

* [fix] fix handle name; rm useless comments;

* [fix] fix send recv signature;

* [fix] fix comment in llama & benchmark

* [feat] support no tensor parallel Linear in shardformer; add tests for both using and not using WeightGradStore

* [fix] fix linear (no tp) ops func name;
2024-10-31 18:17:29 +08:00
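
Many of the zbv entries above (WeightGradStore support in Linear1D_Col/Row, "bwd b" vs. "bwd w") revolve around splitting a linear layer's backward pass so the weight gradients can be deferred and computed while the pipeline would otherwise sit in a bubble. The snippet below is only a plain-PyTorch sketch of that scheduling idea; the class and function names are hypothetical and do not reflect ColossalAI's actual WeightGradStore or Linear1D implementation.

```python
# Sketch of zero-bubble-style backward splitting: "backward-B" (input grads,
# needed immediately by the previous stage) vs. "backward-W" (weight grads,
# which can be deferred and computed later to fill pipeline bubbles).
from collections import deque

import torch


class DeferredWeightGradStore:
    """Queues (input, grad_output, weight) so dW can be computed later."""

    def __init__(self):
        self._queue = deque()

    def put(self, inp, grad_output, weight):
        self._queue.append((inp.detach(), grad_output.detach(), weight))

    def flush(self):
        # Backward-W: compute and accumulate the deferred weight gradients.
        while self._queue:
            inp, grad_output, weight = self._queue.popleft()
            grad_weight = grad_output.t() @ inp  # dW = dY^T X for y = x W^T
            weight.grad = grad_weight if weight.grad is None else weight.grad + grad_weight


def linear_backward_b(inp, weight, grad_output, store):
    """Backward-B only: return dX now, defer dW to the store."""
    store.put(inp, grad_output, weight)
    return grad_output @ weight  # dX = dY W


store = DeferredWeightGradStore()
x = torch.randn(4, 8)
w = torch.nn.Parameter(torch.randn(16, 8))
grad_y = torch.randn(4, 16)
grad_x = linear_backward_b(x, w, grad_y)  # sent upstream right away
store.flush()  # weight grads computed later, e.g. during a pipeline bubble
```
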
Hongxin Liu c2e8f61592 [checkpointio] fix hybrid plugin model save (#6106) 2024-10-31 17:04:53 +08:00
duanjunwen 5f0924361d [fix] fix linear (no tp) ops func name; 2024-10-31 08:18:28 +00:00
duanjunwen d2e05a99b3 [feat] support no tensor parallel Linear in shardformer; add tests for both using and not using WeightGradStore 2024-10-30 02:54:32 +00:00
duanjunwen 982e4ee1f8 [fix] fix comment in llama & benchmark 2024-10-29 07:35:50 +00:00
duanjunwen fa3ccda8ee [fix] fix send recv signature; 2024-10-29 03:33:58 +00:00
duanjunwen fafe049b83 [fix] fix handle name; rm useless comments; 2024-10-29 03:24:15 +00:00
duanjunwen 5aee4261a6 [fix] fix test zerobubble 2024-10-28 06:06:07 +00:00
duanjunwen 6377aa0fff [fix] fix test_shard_llama ci; 2024-10-28 02:42:33 +00:00
duanjunwen 03fa79a55c [fix] fix llama modeling policy; 2024-10-25 10:17:06 +00:00
duanjunwen cc0dfddcbc [fix] fix test_shard_llama 2024-10-25 09:01:13 +00:00
duanjunwen d0ec221b38 [fix] fix fail case test_shard_llama 2024-10-25 02:28:55 +00:00
Tong Li 89a9a600bc [MCTS] Add self-refined MCTS (#6098)
* add reasoner

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update code

* delete llama

* update prompts

* update readme

* update readme

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-10-24 17:51:19 +08:00
duanjunwen 2eca112c90 [feat] support meta cache, meta_grad_send, meta_tensor_send; fix runtime too long in Recv Bwd; benchmark for llama + Hybrid(tp+pp); 2024-10-24 07:30:19 +00:00
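
The meta-cache commit above targets the cost of exchanging tensor metadata (shape/dtype) on every pipeline send/recv. Below is a framework-agnostic sketch of that caching idea, with hypothetical names rather than the repository's actual p2p code.

```python
import torch


class TensorMetadataCache:
    """Skip re-sending shape/dtype metadata when the peer already holds it."""

    def __init__(self):
        self._last_sent = None  # metadata the peer is assumed to have cached

    def metadata_to_send(self, tensor: torch.Tensor):
        meta = (tuple(tensor.shape), tensor.dtype)
        if meta == self._last_sent:
            return None  # steady state: peer reuses its cached metadata
        self._last_sent = meta
        return meta


cache = TensorMetadataCache()
t = torch.randn(2, 3)
assert cache.metadata_to_send(t) == ((2, 3), torch.float32)  # first send includes metadata
assert cache.metadata_to_send(t) is None                     # later sends skip it
```
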
binmakeswell 4294ae83bb [doc] sora solution news (#6100)
* [doc] sora solution news

* [doc] sora solution news
2024-10-24 13:24:37 +08:00
Hongxin Liu 80a8ca916a [extension] hotfix compile check (#6099) 2024-10-24 11:11:44 +08:00
Hanks dee63cc5ef Merge pull request #6096 from BurkeHulk/hotfix/lora_ckpt
[hotfix] fix lora ckpt saving format
2024-10-21 14:13:04 +08:00
BurkeHulk 6d6cafabe2 pre-commit fix 2024-10-21 14:04:32 +08:00
BurkeHulk b10339df7c fix lora ckpt save format (ColoTensor to Tensor) 2024-10-21 13:55:43 +08:00
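
The LoRA checkpoint fix above converts wrapped tensors back to plain torch.Tensor before saving, so the file can be loaded outside the distributed setup. A generic sketch of that conversion (not ColossalAI's checkpoint IO; gathering sharded tensors beforehand, if required, is out of scope here):

```python
import torch


def to_plain_state_dict(state_dict):
    """Downcast torch.Tensor subclasses (e.g. distributed tensor wrappers)
    to plain torch.Tensor so the checkpoint loads anywhere."""
    plain = {}
    for name, value in state_dict.items():
        if isinstance(value, torch.Tensor) and type(value) is not torch.Tensor:
            value = value.detach().clone().as_subclass(torch.Tensor)
        plain[name] = value
    return plain


# Usage: torch.save(to_plain_state_dict(model.state_dict()), "adapter.bin")
```
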
Hongxin Liu 19baab5fd5 [release] update version (#6094) 2024-10-21 10:19:08 +08:00
Hongxin Liu 58d8b8a2dd [misc] fit torch api upgrade and remove legacy import (#6093)
* [amp] fit torch's new api

* [amp] fix api call

* [amp] fix api call

* [misc] fit torch pytree api upgrade

* [misc] remove legacy import

* [misc] fit torch amp api

* [misc] fit torch amp api
2024-10-18 16:48:52 +08:00
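
The "fit torch api" commits in the entry above track PyTorch's move from the device-specific torch.cuda.amp entry points to the unified torch.amp ones. A rough sketch of what that migration looks like; exact availability depends on the installed PyTorch version (torch.amp.GradScaler with a device argument appeared around 2.3).

```python
import torch

# Legacy spelling (emits deprecation warnings on newer PyTorch):
#   scaler = torch.cuda.amp.GradScaler()
#   with torch.cuda.amp.autocast():
#       ...

# Unified spelling:
use_cuda = torch.cuda.is_available()
scaler = torch.amp.GradScaler("cuda", enabled=use_cuda)
with torch.amp.autocast("cuda", enabled=use_cuda):
    pass  # forward pass would run under autocast here
```
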
Hongxin Liu 5ddad486ca [fp8] add fallback and make compile option configurable (#6092) 2024-10-18 13:55:31 +08:00
botbw 3b1d7d1ae8 [chore] refactor 2024-10-17 11:04:47 +08:00
botbw 2bcd0b6844 [ckpt] add safetensors util 2024-10-17 11:04:47 +08:00
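
The safetensors util commit above adds checkpoint helpers built on the safetensors format. Generic usage of the safetensors library itself (not the util added in that commit) looks roughly like this:

```python
import torch
from safetensors.torch import load_file, save_file

state_dict = {"weight": torch.randn(4, 4), "bias": torch.zeros(4)}

# safetensors stores a flat name -> tensor map and expects contiguous tensors.
save_file({k: v.contiguous() for k, v in state_dict.items()}, "model.safetensors")

loaded = load_file("model.safetensors", device="cpu")
assert torch.equal(loaded["bias"], state_dict["bias"])
```
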
Hongxin Liu cd61353bae [pipeline] hotfix backward for multiple outputs (#6090)
* [pipeline] hotfix backward for multiple outputs

* [pipeline] hotfix backward for multiple outputs
2024-10-16 17:27:33 +08:00
duanjunwen 705b18e1e7 [fix] add & fix llama test 2024-10-16 03:58:50 +00:00
duanjunwen e76308c6e6 [fix] rm use_zbv flag in Shardconfig; rm debug info; 2024-10-16 03:25:04 +00:00
Wenxuan Tan 62c13e7969 [Ring Attention] Improve comments (#6085)
* improve comments

* improve comments

---------

Co-authored-by: Edenzzzz <wtan45@wisc.edu>
2024-10-16 11:23:35 +08:00