Commit Graph

3874 Commits (dafda0fb7082506ad76b5deff3024b3d5dbb904b)

Author SHA1 Message Date
duanjunwen dafda0fb70 [fix] remove debug info; 2024-11-18 03:32:04 +00:00
duanjunwen 9a21f87ed6 [fix] fix wait handle in run_fwd_bwd 2024-11-18 02:50:14 +00:00
duanjunwen f48a85e91d [fix] fix test_lora in llama policy 2024-11-15 10:27:13 +00:00
duanjunwen 2980da559f [fix] fix test_lora 2024-11-15 10:26:30 +00:00
duanjunwen 0fb500c7d4 [fix] rm debug info; update llama policy; update wait handle 2024-11-15 09:47:05 +00:00
duanjunwen cf86c1b1c5 [fix] fix zbv wait_handle 2024-11-15 07:56:14 +00:00
duanjunwen 5c2ebbfd48 [fix] fix mixtral modeling & policy; update wait handles; doing benchmarking for llama hybrid; 2024-11-15 05:58:56 +00:00
duanjunwen 014afbdb59 [fix] fix attn 2024-11-14 09:43:47 +00:00
duanjunwen 1bc4dba3a3 [fix] fix p2p error in zbv 2024-11-14 09:40:38 +00:00
duanjunwen b6d5e61809 [feat] update mixtral policy & bert policy for zerobubble 2024-11-14 02:51:34 +00:00
duanjunwen 80b04d7855 [feat] support mixtral policy with zbv tp_Linear & non_tp_Linear 2024-11-12 07:28:49 +00:00
duanjunwen 337debcf2a [feat] fix testcase; 2024-11-11 11:34:29 +00:00
duanjunwen 12919de424 [fix] fix send_tensor_metadata & send_grad_metadata; 2024-11-11 08:54:39 +00:00
duanjunwen 0d6d40ccc6 [fix] fix zbv llama pp4 2024-11-06 03:35:12 +00:00
duanjunwen 4fc92aa77d [feat] support no_tp Linear for shardformer.llama 2024-11-05 05:55:42 +00:00
duanjunwen 8e40087633 [fix] fix model zoo init 2024-11-01 09:02:07 +00:00
duanjunwen 0218e673db [fix] fix use_fp8 flag 2024-11-01 07:05:24 +00:00
duanjunwen 5b5fbcff09 [fix] fix hybridparallel use_fp8 config 2024-11-01 05:27:11 +00:00
duanjunwen 3b5c314bea [fix] fix fp8 args in HybridParallel 2024-11-01 03:54:08 +00:00
duanjunwen c82c75a9b4 Merge branch 'feature/zerobubble' of github.com:hpcaitech/ColossalAI into dev/zero_bubble 2024-11-01 03:32:18 +00:00
duanjunwen 1d328ff651 Merge branch 'main' into dev/zero_bubble 2024-11-01 03:10:53 +00:00
pre-commit-ci[bot] 2f583c1549 [pre-commit.ci] pre-commit autoupdate (#6078)
updates:
- [github.com/psf/black-pre-commit-mirror: 24.8.0 → 24.10.0](https://github.com/psf/black-pre-commit-mirror/compare/24.8.0...24.10.0)
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.2](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.2)
- [github.com/pre-commit/pre-commit-hooks: v4.6.0 → v5.0.0](https://github.com/pre-commit/pre-commit-hooks/compare/v4.6.0...v5.0.0)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-10-31 18:18:01 +08:00
duanjunwen aed20fb2df [feat] support zbv in mixtral benchmark; (#6083)
* [feat] support zbv in mixtral benchmark;

* [fix] MixtralForCausalLMPolicy get_held_layer support zbv;

* [feat] update MixtralPipelineForwards --> mixtral_model_forward; support zbv;

* [feat] support MixtralPipelineForwards --> mixtral_for_causal_lm_forward for zbv

* [fix] fix llama, mixtral benchmark zbv loss none bug; update mixtral & llama policy and modeling;

* [feat] Linear1D_COL/ROW support zbv WeightGradStore;

* [feat] support use_zbv in llama, mixtral modeling; only replace Linear1D_Col/Row policy;

* [fix] fix test case; moe error in second iter

* [feat] EPMixtralSparseMoeBlock (op in MoE) supports zbv;

* [fix] fix bwd b; now bwd w only for layers replaced by Linear1D_Col/Row; other layers perform a full bwd;

* [fix] debug zbv llama test;

* [fix] rm use_zbv flag in Shardconfig; rm debug info;

* [fix] add & fix llama test

* [feat] support meta cache, meta_grad_send, meta_tensor_send; fix runtime too long in Recv Bwd; benchmark for llama + Hybrid(tp+pp);

* [fix] fix fail case test_shard_llama

* [fix] fix test_shard_llama

* [fix] fix llama modeling policy;

* [fix] fix test_shard_llama ci;

* [fix] fix test zerobubble

* [fix] fix handle name; rm useless comments;

* [fix] fix send recv signature;

* [fix] fix comment in llama & benchmark

* [feat] support no tensor parallel Linear in shardformer; add tests for both using and not using WeightGradStore

* [fix] fix linear (no tp) ops func name;
2024-10-31 18:17:29 +08:00
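
Many of the zbv entries above (WeightGradStore support in Linear1D_Col/Row, "bwd b" vs. "bwd w") revolve around splitting a linear layer's backward pass so the weight gradients can be deferred and computed while the pipeline would otherwise sit in a bubble. The snippet below is only a plain-PyTorch sketch of that scheduling idea; the class and function names are hypothetical and do not reflect ColossalAI's actual WeightGradStore or Linear1D implementation.

```python
# Sketch of zero-bubble-style backward splitting: "backward-B" (input grads,
# needed immediately by the previous stage) vs. "backward-W" (weight grads,
# which can be deferred and computed later to fill pipeline bubbles).
from collections import deque

import torch


class DeferredWeightGradStore:
    """Queues (input, grad_output, weight) so dW can be computed later."""

    def __init__(self):
        self._queue = deque()

    def put(self, inp, grad_output, weight):
        self._queue.append((inp.detach(), grad_output.detach(), weight))

    def flush(self):
        # Backward-W: compute and accumulate the deferred weight gradients.
        while self._queue:
            inp, grad_output, weight = self._queue.popleft()
            grad_weight = grad_output.t() @ inp  # dW = dY^T X for y = x W^T
            weight.grad = grad_weight if weight.grad is None else weight.grad + grad_weight


def linear_backward_b(inp, weight, grad_output, store):
    """Backward-B only: return dX now, defer dW to the store."""
    store.put(inp, grad_output, weight)
    return grad_output @ weight  # dX = dY W


store = DeferredWeightGradStore()
x = torch.randn(4, 8)
w = torch.nn.Parameter(torch.randn(16, 8))
grad_y = torch.randn(4, 16)
grad_x = linear_backward_b(x, w, grad_y)  # sent upstream right away
store.flush()  # weight grads computed later, e.g. during a pipeline bubble
```
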
Hongxin Liu c2e8f61592 [checkpointio] fix hybrid plugin model save (#6106) 2024-10-31 17:04:53 +08:00
duanjunwen 5f0924361d [fix] fix linear (no tp) ops func name; 2024-10-31 08:18:28 +00:00
duanjunwen d2e05a99b3 [feat] support no tensor parallel Linear in shardformer; add tests for both using and not using WeightGradStore 2024-10-30 02:54:32 +00:00
duanjunwen 982e4ee1f8 [fix] fix comment in llama & benchmark 2024-10-29 07:35:50 +00:00
duanjunwen fa3ccda8ee [fix] fix send recv signature; 2024-10-29 03:33:58 +00:00
duanjunwen fafe049b83 [fix] fix handle name; rm useless comments; 2024-10-29 03:24:15 +00:00
duanjunwen 5aee4261a6 [fix] fix test zerobubble 2024-10-28 06:06:07 +00:00
duanjunwen 6377aa0fff [fix] fix test_shard_llama ci; 2024-10-28 02:42:33 +00:00
duanjunwen 03fa79a55c [fix] fix llama modeling policy; 2024-10-25 10:17:06 +00:00
duanjunwen cc0dfddcbc [fix] fix test_shard_llama 2024-10-25 09:01:13 +00:00
duanjunwen d0ec221b38 [fix] fix fail case test_shard_llama 2024-10-25 02:28:55 +00:00
Tong Li 89a9a600bc [MCTS] Add self-refined MCTS (#6098)
* add reasoner

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update code

* delete llama

* update prompts

* update readme

* update readme

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-10-24 17:51:19 +08:00
duanjunwen 2eca112c90 [feat] support meta cache, meta_grad_send, meta_tensor_send; fix runtime too long in Recv Bwd; benchmark for llama + Hybrid(tp+pp); 2024-10-24 07:30:19 +00:00
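
The meta-cache commit above targets the cost of exchanging tensor metadata (shape/dtype) on every pipeline send/recv. Below is a framework-agnostic sketch of that caching idea, with hypothetical names rather than the repository's actual p2p code.

```python
import torch


class TensorMetadataCache:
    """Skip re-sending shape/dtype metadata when the peer already holds it."""

    def __init__(self):
        self._last_sent = None  # metadata the peer is assumed to have cached

    def metadata_to_send(self, tensor: torch.Tensor):
        meta = (tuple(tensor.shape), tensor.dtype)
        if meta == self._last_sent:
            return None  # steady state: peer reuses its cached metadata
        self._last_sent = meta
        return meta


cache = TensorMetadataCache()
t = torch.randn(2, 3)
assert cache.metadata_to_send(t) == ((2, 3), torch.float32)  # first send includes metadata
assert cache.metadata_to_send(t) is None                     # later sends skip it
```
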
binmakeswell 4294ae83bb [doc] sora solution news (#6100)
* [doc] sora solution news

* [doc] sora solution news
2024-10-24 13:24:37 +08:00
Hongxin Liu 80a8ca916a [extension] hotfix compile check (#6099) 2024-10-24 11:11:44 +08:00
Hanks dee63cc5ef Merge pull request #6096 from BurkeHulk/hotfix/lora_ckpt
[hotfix] fix lora ckpt saving format
2024-10-21 14:13:04 +08:00
BurkeHulk 6d6cafabe2 pre-commit fix 2024-10-21 14:04:32 +08:00
BurkeHulk b10339df7c fix lora ckpt save format (ColoTensor to Tensor) 2024-10-21 13:55:43 +08:00
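
The LoRA checkpoint fix above converts wrapped tensors back to plain torch.Tensor before saving, so the file can be loaded outside the distributed setup. A generic sketch of that conversion (not ColossalAI's checkpoint IO; gathering sharded tensors beforehand, if required, is out of scope here):

```python
import torch


def to_plain_state_dict(state_dict):
    """Downcast torch.Tensor subclasses (e.g. distributed tensor wrappers)
    to plain torch.Tensor so the checkpoint loads anywhere."""
    plain = {}
    for name, value in state_dict.items():
        if isinstance(value, torch.Tensor) and type(value) is not torch.Tensor:
            value = value.detach().clone().as_subclass(torch.Tensor)
        plain[name] = value
    return plain


# Usage: torch.save(to_plain_state_dict(model.state_dict()), "adapter.bin")
```
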
Hongxin Liu 19baab5fd5 [release] update version (#6094) 2024-10-21 10:19:08 +08:00
Hongxin Liu 58d8b8a2dd [misc] fit torch api upgrade and remove legacy import (#6093)
* [amp] fit torch's new api

* [amp] fix api call

* [amp] fix api call

* [misc] fit torch pytree api upgrade

* [misc] remove legacy import

* [misc] fit torch amp api

* [misc] fit torch amp api
2024-10-18 16:48:52 +08:00
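
The "fit torch api" commits in the entry above track PyTorch's move from the device-specific torch.cuda.amp entry points to the unified torch.amp ones. A rough sketch of what that migration looks like; exact availability depends on the installed PyTorch version (torch.amp.GradScaler with a device argument appeared around 2.3).

```python
import torch

# Legacy spelling (emits deprecation warnings on newer PyTorch):
#   scaler = torch.cuda.amp.GradScaler()
#   with torch.cuda.amp.autocast():
#       ...

# Unified spelling:
use_cuda = torch.cuda.is_available()
scaler = torch.amp.GradScaler("cuda", enabled=use_cuda)
with torch.amp.autocast("cuda", enabled=use_cuda):
    pass  # forward pass would run under autocast here
```
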
Hongxin Liu 5ddad486ca [fp8] add fallback and make compile option configurable (#6092) 2024-10-18 13:55:31 +08:00
botbw 3b1d7d1ae8 [chore] refactor 2024-10-17 11:04:47 +08:00
botbw 2bcd0b6844 [ckpt] add safetensors util 2024-10-17 11:04:47 +08:00
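
The safetensors util commit above adds checkpoint helpers built on the safetensors format. Generic usage of the safetensors library itself (not the util added in that commit) looks roughly like this:

```python
import torch
from safetensors.torch import load_file, save_file

state_dict = {"weight": torch.randn(4, 4), "bias": torch.zeros(4)}

# safetensors stores a flat name -> tensor map and expects contiguous tensors.
save_file({k: v.contiguous() for k, v in state_dict.items()}, "model.safetensors")

loaded = load_file("model.safetensors", device="cpu")
assert torch.equal(loaded["bias"], state_dict["bias"])
```
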
Hongxin Liu cd61353bae [pipeline] hotfix backward for multiple outputs (#6090)
* [pipeline] hotfix backward for multiple outputs

* [pipeline] hotfix backward for multiple outputs
2024-10-16 17:27:33 +08:00
duanjunwen 705b18e1e7 [fix] add & fix llama test 2024-10-16 03:58:50 +00:00
duanjunwen e76308c6e6 [fix] rm use_zbv flag in Shardconfig; rm debug info; 2024-10-16 03:25:04 +00:00
Wenxuan Tan 62c13e7969 [Ring Attention] Improve comments (#6085)
* improve comments

* improve comments

---------

Co-authored-by: Edenzzzz <wtan45@wisc.edu>
2024-10-16 11:23:35 +08:00