8106d7b8c7  [ddp] refactor ColoDDP and ZeroDDP (#1146)  (ver217, 2 years ago)
    * ColoDDP supports overwriting default process group
    * rename ColoDDPV2 to ZeroDDP
    * add docstr for ZeroDDP
    * polish docstr

0e4e62d30d  [tensor] added __repr__ to spec (#1147)  (Frank Lee, 2 years ago)

70dd88e2ee  [pipeline] add customized policy (#1139)  (YuliangLiu0306, 2 years ago)
    * [CLI] add CLI launcher
    * Revert "[CLI] add CLI launcher"
      This reverts commit df7e6506d4.
    * [pipeline] add customized policy

d1918304bb  [workflow] added workflow to auto draft the release post (#1144)  (Frank Lee, 2 years ago)

18091581c0  [pipeline] support more flexible pipeline (#1138)  (YuliangLiu0306, 2 years ago)
    * [CLI] add CLI launcher
    * Revert "[CLI] add CLI launcher"
      This reverts commit df7e6506d4.
    * [pipeline] support more flexible pipeline

ccf3c58c89  embedding op use gather_out (#1143)  (ver217, 2 years ago)

e61dc31b05  [ci] added scripts to auto-generate release post text (#1142)  (Frank Lee, 2 years ago)
    * [ci] added scripts to auto-generate release post text
    * polish code

6690a61b4d  [hotfix] prevent nested ZeRO (#1140)  (ver217, 2 years ago)

15aab1476e  [zero] avoid zero hook spam by changing log to debug level (#1137)  (Frank Lee, 2 years ago)

73ad05fc8c  [zero] added error message to handle on-the-fly import of torch Module class (#1135)  (Frank Lee, 2 years ago)
    * [zero] added error message to handle on-the-fly import of torch Module class
    * polish code

e4f555f29a  [optim] refactor fused sgd (#1134)  (ver217, 2 years ago)

d26902645e  [ddp] add save/load state dict for ColoDDP (#1127)  (ver217, 2 years ago)
    * add save/load state dict for ColoDDP
    * add unit test
    * refactor unit test folder
    * polish unit test
    * rename unit test

946dbd629d  [hotfix] fix bugs caused by refactored pipeline (#1133)  (YuliangLiu0306, 2 years ago)
    * [CLI] add CLI launcher
    * Revert "[CLI] add CLI launcher"
      This reverts commit df7e6506d4.
    * [hotfix] fix bugs caused by refactored pipeline

789cad301b  [hotfix] fix param op hook (#1131)  (ver217, 2 years ago)
    * fix param op hook
    * update zero tp test
    * fix bugs

a1a7899cae  [hotfix] fix zero init ctx numel (#1128)  (ver217, 2 years ago)

f0a954f16d  [ddp] add set_params_to_ignore for ColoDDP (#1122)  (ver217, 2 years ago)
    * add set_params_to_ignore for ColoDDP
    * polish code
    * fix zero hook v2
    * add unit test
    * polish docstr

3175bcb4d8  [pipeline] support List of Dict data (#1125)  (YuliangLiu0306, 2 years ago)
    * [CLI] add CLI launcher
    * Revert "[CLI] add CLI launcher"
      This reverts commit df7e6506d4.
    * [pipeline] support List of Dict data
    * polish

91a5999825  [ddp] supported customized torch ddp configuration (#1123)  (Frank Lee, 2 years ago)

fcf55777dd  [fx] add autoparallel passes (#1121)  (YuliangLiu0306, 2 years ago)
    * [CLI] add CLI launcher
    * Revert "[CLI] add CLI launcher"
      This reverts commit df7e6506d4.
    * feature/add autoparallel passes

e127b4375b  cast colo ddp v2 inputs/outputs (#1120)  (ver217, 2 years ago)

16302a5359  [fx] added unit test for coloproxy (#1119)  (Frank Lee, 2 years ago)
    * [fx] added unit test for coloproxy
    * polish code
    * polish code

7d14b473f0  [gemini] gemini mgr supports "cpu" placement policy (#1118)  (ver217, 2 years ago)
    * update gemini mgr
    * update chunk
    * add docstr
    * polish placement policy
    * update test chunk
    * update test zero
    * polish unit test
    * remove useless unit test

f99f56dff4  fix colo parameter torch function (#1117)  (ver217, 2 years ago)

e1620ddac2  [fx] added coloproxy (#1115)  (Frank Lee, 2 years ago)

6f82ac9bcb  [pipeline] supported more flexible dataflow control for pipeline parallel training (#1108)  (Frank Lee, 2 years ago)
    * [pipeline] supported more flexible dataflow control for pipeline parallel training
    * polish code
    * polish code
    * polish code

53297330c0  [test] fixed hybrid parallel test case on 8 GPUs (#1106)  (Frank Lee, 2 years ago)

85b58093d2  Automated submodule synchronization (#1105)  (github-actions[bot], 2 years ago)
    Co-authored-by: github-actions <github-actions@github.com>

74948b095c  [release] update version.txt (#1103)  (Frank Lee, 2 years ago)

895c1c5ee7  [tensor] refactor param op hook (#1097)  (ver217, 2 years ago)
    * refactor param op hook
    * add docstr
    * fix bug

1e9f9c227f  [hotfix] change to fit latest p2p (#1100)  (YuliangLiu0306, 2 years ago)
    * [CLI] add CLI launcher
    * Revert "[CLI] add CLI launcher"
      This reverts commit df7e6506d4.
    * [hotfix] change to fit latest p2p
    * polish
    * polish

72bd7c696b  [amp] included dict for type casting of model output (#1102)  (Frank Lee, 2 years ago)

5a9d8ef4d5  [workflow] fixed 8-gpu test workflow (#1101)  (Frank Lee, 2 years ago)

03e52ecba3  [workflow] added regular 8 GPU testing (#1099)  (Frank Lee, 2 years ago)
    * [workflow] added regular 8 GPU testing
    * polish workflow

7f2d2b2b5b  [engine] fixed empty op hook check (#1096)  (Frank Lee, 2 years ago)
    * [engine] fixed empty op hook check
    * polish code

14e5b11d7f  [zero] fixed api consistency (#1098)  (Frank Lee, 2 years ago)

cb18922c47  [doc] added documentation to chunk and chunk manager (#1094)  (Frank Lee, 2 years ago)
    * [doc] added documentation to chunk and chunk manager
    * polish code
    * polish code
    * polish code

1f894e033f  [gemini] zero supports gemini (#1093)  (ver217, 2 years ago)
    * add placement policy
    * add gemini mgr
    * update mem stats collector
    * update zero
    * update zero optim
    * fix bugs
    * zero optim monitor os
    * polish unit test
    * polish unit test
    * add assert

2b2dc1c86b  [pipeline] refactor the pipeline module (#1087)  (Frank Lee, 2 years ago)
    * [pipeline] refactor the pipeline module
    * polish code

bad5d4c0a1  [context] support lazy init of module (#1088)  (Frank Lee, 2 years ago)
    * [context] support lazy init of module
    * polish code

be01db37c8  [tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077)  (ver217, 2 years ago)
    * polish chunk manager
    * polish unit test
    * impl add_extern_static_tensor for chunk mgr
    * add mem stats collector v2
    * polish code
    * polish unit test
    * polish code
    * polish get chunks

b3a03e4bfd  [Tensor] fix equal assert (#1091)  (Ziyue Jiang, 2 years ago)
    * fix equal assert
    * polish

50ec3a7e06  [test] skip tests when not enough GPUs are detected (#1090)  (Frank Lee, 2 years ago)
    * [test] skip tests when not enough GPUs are detected
    * polish code
    * polish code

3a7571b1d7  Automated submodule synchronization (#1081)  (github-actions[bot], 2 years ago)
    Co-authored-by: github-actions <github-actions@github.com>

1bd8a72fc9  [workflow] disable p2p via shared memory on non-nvlink machine (#1086)  (Frank Lee, 2 years ago)

65ee6dcc20  [test] ignore 8 gpu test (#1080)  (Frank Lee, 3 years ago)
    * [test] ignore 8 gpu test
    * polish code
    * polish workflow
    * polish workflow

0653c63eaa  [Tensor] 1d row embedding (#1075)  (Ziyue Jiang, 3 years ago)
    * Add CPU 1d row embedding
    * polish

d66ffb4df4  Remove duplication registry (#1078)  (junxu, 3 years ago)

bcab249565  fix issue #1080 (#1071)  (Jiarui Fang, 3 years ago)

1b17859328  [tensor] chunk manager monitor mem usage (#1076)  (ver217, 3 years ago)

98cdbf49c6  [hotfix] fix chunk comm src rank (#1072)  (ver217, 3 years ago)