ColossalAI

Commit Graph

Author	SHA1	Message	Date
YuliangLiu0306	aa0f6686f9	[autoparallel] accelerate gpt2 training (#2495 )	2023-01-29 11:13:15 +08:00
HELSON	707b11d4a0	[gemini] update ddp strict mode (#2518 ) * [zero] add strict ddp mode for chunk init * [gemini] update gpt example	2023-01-28 14:35:25 +08:00
Jiarui Fang	8f72b6f8fb	[hotfix] fix implement error in diffusers	2023-01-07 07:56:39 +08:00
1SAA	33f3023e19	[hotfix] fix implement error in diffusers	2023-01-06 18:37:18 +08:00
Jiarui Fang	1aaeb596c6	[example] gpt, shard init on all processes (#2366 )	2023-01-06 15:44:50 +08:00
Boyuan Yao	22e947f982	[autoparallel] fix runtime apply memory estimation (#2281 ) * [autoparallel] align the data_ptr with the old version of auto activation checkpoint pipeline * [autoparallel] using fwd_time and bwd_time instead of fwd_flop and bwd_flop * [autoparallel] specifycomm nodes' memory cost in construct chain * [autoparallel] fix wrong runtime apply calculation * [autoparallel] fix wrong runtime apply calculation * [autoparallel] fix wrong runtime apply calculation	2023-01-03 17:18:07 +08:00
xcnick	85178a397a	[hotfix] fix error for torch 2.0 (#2243 )	2022-12-30 23:11:55 +08:00
Boyuan Yao	24246f7aa5	[autoparallel] Attach input, buffer and output tensor to MetaInfo class (#2162 ) * [fx] metainfo class for auto parallel * [fx] add unit test for linear metainfo * [fx] fix bwd param for linear * [fx] modify unit test * [fx] modify unit test * [fx] modify import * [fx] modify import * [fx] modify import * [fx] move meta profiler to auto parallel * [fx] add conv metainfo class * [fx] restore profiler * [fx] restore meta profiler * [autoparallel] modify unit test * [fx] modify unit test * [autoparallel] add batchnorm metainfo class * [autoparallel] fix batchnorm unit test function declaration * [fx] restore profiler * [fx] add relu metainfo class * [fx] restore profiler * [autoparallel] modify metainfo input * [autoparallel] add pooling metainfo * [autoparallel] add F.linear metainfo generator * [autoparallel] add binary elementwise metainfo * [fx] recover profiler * [autoparallel] fix forward memory calculation * [autoparallel] modify constants.py * [autoparallel] remove redundant print * [autoparallel] add F.conv metainfo * [autoparallel] linear fix * [autoparallel] memory estimation for communication actions * [autoparallel] fix docstring * [autoparallel] fix variables name * [autoparallel] attach tensor to metainfo class * [autoparallel] fix dangerous try except * [autoparallel] attach memory cost to shape consistency node * [autoparallel] attach shape consistency node's metainfo to the node * [autoparallel] remove todo in shape consistency memory estimation * [autoparallel] fix the annotation	2022-12-28 13:37:40 +08:00
HELSON	2458659919	[zero] fix error for BEiT models (#2169 ) * [zero] fix error for BEiT models * [ColoParameter] add unpack operation for tuple arguments * fix bugs * fix chunkv2 unit testing * add assertion for gradient state	2022-12-26 15:03:54 +08:00
Boyuan Yao	cfe2a9bd90	[autoparallel] memory estimation for shape consistency (#2144 ) * [fx] metainfo class for auto parallel * [fx] add unit test for linear metainfo * [fx] fix bwd param for linear * [fx] modify unit test * [fx] modify unit test * [fx] modify import * [fx] modify import * [fx] modify import * [fx] move meta profiler to auto parallel * [fx] add conv metainfo class * [fx] restore profiler * [fx] restore meta profiler * [autoparallel] modify unit test * [fx] modify unit test * [autoparallel] add batchnorm metainfo class * [autoparallel] fix batchnorm unit test function declaration * [fx] restore profiler * [fx] add relu metainfo class * [fx] restore profiler * [autoparallel] modify metainfo input * [autoparallel] add pooling metainfo * [autoparallel] add F.linear metainfo generator * [autoparallel] add binary elementwise metainfo * [fx] recover profiler * [autoparallel] fix forward memory calculation * [autoparallel] modify constants.py * [autoparallel] remove redundant print * [autoparallel] add F.conv metainfo * [autoparallel] linear fix * [autoparallel] memory estimation for communication actions * [autoparallel] fix docstring * [autoparallel] fix variables name	2022-12-21 10:39:37 +08:00
Jiarui Fang	2827f41898	[Gemini] GeminiDPP convert to PyTorch Module. (#2151 )	2022-12-20 10:19:36 +08:00
Jiarui Fang	e99edfcb51	[NFC] polish comments for Chunk class (#2116 )	2022-12-12 15:39:31 +08:00
Jiarui Fang	b3b89865e2	[Gemini] ParamOpHook -> ColoParamOpHook (#2080 )	2022-12-05 17:11:06 +08:00
YuliangLiu0306	81330b0352	[autoparallel] add experimental permute handler (#2029 )	2022-11-27 20:26:52 +08:00
Genghan Zhang	d655eea515	[autoparallel] mix gather (#1977 ) * Add mix-gather * Add comments * Add comments * Polish comments * Change the global rank assumption * Add tests * Add two-step tests * Fix 10 and 01 * Skip test becasue the number of GPUs	2022-11-23 21:49:17 +08:00
YuliangLiu0306	36c0f3ea5b	[autoparallel] remove redundancy comm node (#1893 )	2022-11-15 10:53:41 +08:00
YuliangLiu0306	49216d7ab1	[autoparallel] fix bugs caused by negative dim key (#1808 ) * [autoparallel] fix bugs caused by negative dim key * fix import error * fix matmul test issue * fix unit test issue	2022-11-08 17:03:50 +08:00
Jiarui Fang	218c75fd9d	[NFC] polish type hint for shape consistency (#1801 ) * [NFC] polish type hint for shape consistency * polish code * polish code	2022-11-07 14:13:03 +08:00
HELSON	c6a1a62636	[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786 ) * [hotfix] fix zero's incompatibility with checkpoint in torch-1.12 * [zero] add cpu shard init * [zero] add tiny example test * [colo_tensor] fix bugs for torch-1.11	2022-11-02 16:11:34 +08:00
Frank Lee	f3f19a5c47	[autoparallel] added matmul handler (#1763 ) * [autoparallel] added matmul handler * polish code	2022-11-01 15:14:53 +08:00
YuliangLiu0306	b0f7c8bde8	[autoparallel] update CommSpec to CommActions (#1768 ) * [autoparallel] update CommSpec to CommActions * polish code	2022-10-28 09:57:43 +08:00
YuliangLiu0306	b4cc59b61e	[autoparallel] add numerical test for node strategies (#1760 ) * [autoparallel] add numerical test for node strategies * polish code * polish code	2022-10-27 10:42:54 +08:00
YuliangLiu0306	980ed21723	[autoparallel] shard param and buffer as expected (#1753 ) * [autoparallel] shard param and buffer as expected * fix unit test issue	2022-10-21 15:45:13 +08:00
YuliangLiu0306	a4ce180e85	[autoparallel] add sequential order to communication actions (#1735 )	2022-10-20 18:48:18 +08:00
Frank Lee	993b8875b6	[autoparallel] handled illegal sharding strategy in shape consistency (#1744 ) * [autoparallel] handled illegal sharding strategy in shape consistency * polish code	2022-10-20 12:06:25 +08:00
Frank Lee	eee84908d4	[autoparallel] handled illegal sharding strategy (#1728 ) * [autoparallel] handled illegal sharding strategy * polish code	2022-10-19 12:53:06 +08:00
YuliangLiu0306	51b89d2202	[autoparallel] runtime_backward_apply (#1720 )	2022-10-18 10:44:58 +08:00
Frank Lee	4973157ad7	[autoparallel] added sharding spec conversion for linear handler (#1687 )	2022-10-12 11:16:18 +08:00
YuliangLiu0306	3f068d1409	[autoparallel] update CommSpec (#1667 )	2022-09-29 11:20:59 +08:00
Frank Lee	154d3ef432	[fix] fixed the collective pattern name for consistency (#1649 ) * [fix] fixed the collective pattern name for consistency * polish code	2022-09-26 16:39:37 +08:00
YuliangLiu0306	702dbc5288	[tensor] use communication autograd func (#1617 ) * [tensor] use communication autograd func * change all to all comm spec info * rename pattern and distinguish fwd/bwd * polish code	2022-09-23 13:31:15 +08:00
Frank Lee	27fe8af60c	[autoparallel] refactored shape consistency to remove redundancy (#1591 ) * [autoparallel] refactored shape consistency to remove redundancy * polish code * polish code * polish code	2022-09-13 18:30:18 +08:00
YuliangLiu0306	44c866a3e3	[autoparallel] change the merge node logic (#1533 )	2022-09-07 11:18:19 +08:00
YuliangLiu0306	4b03c25f85	[tensor]add 1D device mesh (#1492 )	2022-08-25 16:48:12 +08:00
YuliangLiu0306	26a37b5cd5	[autoparallel] Add conv handler to generate strategies and costs info for conv (#1467 )	2022-08-19 14:57:23 +08:00
Jiarui Fang	1b491ad7de	[doc] update docstring in ProcessGroup (#1468 )	2022-08-19 13:41:57 +08:00
YuliangLiu0306	b73fb7a077	[tensor] support runtime ShardingSpec apply (#1453 ) * [tensor] support runtime ShardingSpec apply * polish code * polish code	2022-08-19 13:39:51 +08:00
Jiarui Fang	36824a304c	[Doc] add more doc for ColoTensor. (#1458 )	2022-08-16 10:38:41 +08:00
Jiarui Fang	a1476ea882	[NFC] polish doc style for ColoTensor (#1457 )	2022-08-16 09:21:05 +08:00
YuliangLiu0306	0f3042363c	[tensor] shape consistency generate transform path and communication cost (#1435 ) * [tensor] shape consistency output transform path and communication cost * polish code	2022-08-12 14:02:32 +08:00
Frank Lee	ae1b58cd16	[tensor] added linear implementation for the new sharding spec (#1416 ) * [tensor] added linear implementation for the new sharding spec * polish code	2022-08-12 11:33:09 +08:00
YuliangLiu0306	33f0744d51	[tensor] add shape consistency feature to support auto spec transform (#1418 ) * [tensor] add shape consistency feature to supportauto sharding spec transform. * [tensor] remove unused argument in simulator, add doc string for target pair.	2022-08-10 11:29:17 +08:00
YuliangLiu0306	7c96055c68	[tensor]build sharding spec to replace distspec in future. (#1405 )	2022-08-08 11:15:57 +08:00
HELSON	c7221cb2d4	[hotfix] adapt ProcessGroup and Optimizer to ColoTensor (#1388 )	2022-07-29 19:33:24 +08:00
ver217	828b9e5e0d	[hotfix] fix zero optim save/load state dict (#1381 )	2022-07-28 17:19:39 +08:00
HELSON	943a96323e	[hotfix] fix no optimizer in save/load (#1363 )	2022-07-26 10:53:53 +08:00
ver217	d068af81a3	[doc] update rst and docstring (#1351 ) * update rst * add zero docstr * fix docstr * remove fx.tracer.meta_patch * fix docstr * fix docstr * update fx rst * fix fx docstr * remove useless rst	2022-07-21 15:54:53 +08:00
HELSON	7a8702c06d	[colotensor] add Tensor.view op and its unit test (#1343 ) [colotensor] add megatron initialization for gpt2	2022-07-21 10:53:15 +08:00
HELSON	f92c100ddd	[checkpoint] use gather_tensor in checkpoint and update its unit test (#1339 )	2022-07-19 14:15:28 +08:00
ver217	0c51ff2c13	[hotfix] ZeroDDP use new process group (#1333 ) * process group supports getting ranks in group * chunk mgr receives a process group * update unit test * fix unit tests	2022-07-18 14:14:52 +08:00
HELSON	d49708ae43	[hotfix] fix ddp for unit test test_gpt2 (#1326 )	2022-07-15 18:19:52 +08:00
HELSON	1b41686461	[hotfix] fix unit test test_module_spec (#1321 )	2022-07-15 14:02:32 +08:00
Jiarui Fang	85f933b58b	[Optimizer] Remove useless ColoOptimizer (#1312 )	2022-07-14 16:57:48 +08:00
Jiarui Fang	9f10524313	[Optimizer] polish the init method of ColoOptimizer (#1310 )	2022-07-14 16:37:33 +08:00
HELSON	260a55804a	[hotfix] fix shape error in backward when using ColoTensor (#1298 )	2022-07-13 23:06:12 +08:00
Jiarui Fang	556b9b7e1a	[hotfix] Dist Mgr gather torch version (#1284 ) * make it faster * [hotfix] torchvison fx tests * [hotfix] rename duplicated named test_gpt.py * [hotfix] dist mgr torch version	2022-07-13 00:18:56 +08:00
ver217	7aadcbd070	hotfix colotensor _scan_for_pg_from_args (#1276 )	2022-07-12 20:46:31 +08:00
Jiarui Fang	c92f84fcdb	[tensor] distributed checkpointing for parameters (#1240 )	2022-07-12 15:51:06 +08:00
Jiarui Fang	1aad903c15	[tensor] redistribute among different process groups (#1247 ) * make it faster * [tensor] rename convert_to_dist -> redistribute * [tensor] ShardSpec and ReplicaSpec * [tensor] redistribute among diff pgs * polish code	2022-07-12 10:24:05 +08:00
Jiarui Fang	9bcd2fd4af	[tensor] a shorter shard and replicate spec (#1245 )	2022-07-11 15:51:48 +08:00
Jiarui Fang	2699dfbbfd	[rename] convert_to_dist -> redistribute (#1243 )	2022-07-11 13:05:44 +08:00
HELSON	f6add9b720	[tensor] redirect .data.__get__ to a tensor instance (#1239 )	2022-07-11 11:41:29 +08:00
Jiarui Fang	20da6e48c8	[checkpoint] save sharded optimizer states (#1237 )	2022-07-08 16:33:13 +08:00
Jiarui Fang	4a76084dc9	[tensor] add zero_like colo op, important for Optimizer (#1236 )	2022-07-08 14:55:27 +08:00
Jiarui Fang	3b500984b1	[tensor] fix some unittests (#1234 )	2022-07-08 14:18:30 +08:00
HELSON	f071b500b6	[polish] polish __repr__ for ColoTensor, DistSpec, ProcessGroup (#1235 )	2022-07-08 13:25:57 +08:00
Yi Zhao	04537bf83e	[checkpoint]support generalized scheduler (#1222 )	2022-07-07 18:16:38 +08:00
Jiarui Fang	a98319f023	[tensor] torch function return colotensor (#1229 )	2022-07-07 18:09:18 +08:00
HELSON	280a81243d	[tensor] improve robustness of class 'ProcessGroup' (#1223 )	2022-07-07 13:55:24 +08:00
Jiarui Fang	15d988f954	[tensor] sharded global process group (#1219 )	2022-07-07 13:38:48 +08:00
Jiarui Fang	ae7d3f4927	[refactor] move process group from _DistSpec to ColoTensor. (#1203 )	2022-07-06 16:15:16 +08:00
Jiarui Fang	b5f25eb32a	[Tensor] add cpu group to ddp (#1200 )	2022-07-05 14:58:28 +08:00
Jiarui Fang	060b917daf	[refactor] remove gpc dependency in colotensor's _ops (#1189 )	2022-07-04 18:54:37 +08:00
Jiarui Fang	c463f8adf9	[tensor] remove gpc in tensor tests (#1186 )	2022-06-29 14:08:40 +08:00
Jiarui Fang	372f791444	[refactor] move chunk and chunkmgr to directory gemini (#1182 )	2022-06-29 13:31:02 +08:00
Jiarui Fang	7487215b95	[ColoTensor] add independent process group (#1179 )	2022-06-29 10:03:09 +08:00
Jiarui Fang	1b657f9ce1	[tensor] revert local view back (#1178 )	2022-06-27 18:38:34 +08:00
Jiarui Fang	0dd4e2bbfb	[Tensor] rename some APIs in TensorSpec and Polish view unittest (#1176 )	2022-06-27 15:56:11 +08:00
Ziyue Jiang	dd0420909f	[Tensor] rename parallel_action (#1174 ) * rename parallel_action * polish	2022-06-27 10:04:45 +08:00
Jiarui Fang	aa7bef73d4	[Tensor] distributed view supports inter-process hybrid parallel (#1169 )	2022-06-27 09:45:26 +08:00
Jiarui Fang	4b9bba8116	[ColoTensor] rename APIs and add output_replicate to ComputeSpec (#1168 )	2022-06-24 13:08:54 +08:00
Jiarui Fang	f4ef224358	[Tensor] remove ParallelAction, use ComputeSpec instread (#1166 )	2022-06-23 17:34:59 +08:00
Jiarui Fang	177c374401	remove gather out in parallel action (#1163 )	2022-06-23 16:35:05 +08:00
ver217	634eecb98e	mark sanity_check of dist_spec_mgr as staticmethod (#1161 )	2022-06-23 11:35:25 +08:00
ver217	4e67b2a890	fix chunk move device (#1158 )	2022-06-22 18:07:10 +08:00
Jiarui Fang	07f9c781f9	[graph] improve the graph building. (#1157 )	2022-06-22 16:47:20 +08:00
ver217	ffa025e120	[tensor] dist spec s2s uses all-to-all (#1136 ) * dist spec s2s uses all-to-all * update unit test * add sanity check * polish unitest test with titans * add sanity check for DistMgr * add sanity check Co-authored-by: jiaruifang <fangjiarui123@gmail.com>	2022-06-22 11:32:38 +08:00
Jiarui Fang	8cdce0399c	[ColoTensor] improves init functions. (#1150 )	2022-06-21 18:28:38 +08:00
Frank Lee	0e4e62d30d	[tensor] added __repr__ to spec (#1147 )	2022-06-21 15:38:05 +08:00
ver217	789cad301b	[hotfix] fix param op hook (#1131 ) * fix param op hook * update zero tp test * fix bugs	2022-06-17 16:12:05 +08:00
ver217	7d14b473f0	[gemini] gemini mgr supports "cpu" placement policy (#1118 ) * update gemini mgr * update chunk * add docstr * polish placement policy * update test chunk * update test zero * polish unit test * remove useless unit test	2022-06-15 15:05:19 +08:00
ver217	f99f56dff4	fix colo parameter torch function (#1117 )	2022-06-15 14:23:27 +08:00
ver217	895c1c5ee7	[tensor] refactor param op hook (#1097 ) * refactor param op hook * add docstr * fix bug	2022-06-13 16:11:53 +08:00
Frank Lee	cb18922c47	[doc] added documentation to chunk and chunk manager (#1094 ) * [doc] added documentation to chunk and chunk manager * polish code * polish code * polish code	2022-06-10 15:33:06 +08:00
ver217	1f894e033f	[gemini] zero supports gemini (#1093 ) * add placement policy * add gemini mgr * update mem stats collector * update zero * update zero optim * fix bugs * zero optim monitor os * polish unit test * polish unit test * add assert	2022-06-10 14:48:28 +08:00
ver217	be01db37c8	[tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077 ) * polish chunk manager * polish unit test * impl add_extern_static_tensor for chunk mgr * add mem stats collector v2 * polish code * polish unit test * polish code * polish get chunks	2022-06-09 20:56:34 +08:00
ver217	1b17859328	[tensor] chunk manager monitor mem usage (#1076 )	2022-06-07 15:00:00 +08:00
ver217	98cdbf49c6	[hotfix] fix chunk comm src rank (#1072 )	2022-06-07 11:54:56 +08:00
ver217	c5cd3b0f35	[zero] zero optim copy chunk rather than copy tensor (#1070 )	2022-06-07 10:30:46 +08:00
Jiarui Fang	a00644079e	reorgnize colotensor directory (#1062 ) * reorgnize colotensor directory * polish code	2022-06-03 18:04:22 +08:00

205 Commits (544b7a38a167cb05cdc7590cfc100e23c0ed5ab7)