Jiarui Fang
2e9cbfca12
[Gemini] add unit tests to check Gemini correctness ( #2015 )
2022-11-24 16:51:45 +08:00
Genghan Zhang
d655eea515
[autoparallel] mix gather ( #1977 )
* Add mix-gather
* Add comments
* Add comments
* Polish comments
* Change the global rank assumption
* Add tests
* Add two-step tests
* Fix 10 and 01
* Skip test because of the number of GPUs
2022-11-23 21:49:17 +08:00
Jiarui Fang
f7e276fa71
[Gemini] add GeminiAdamOptimizer ( #1960 )
2022-11-16 14:44:28 +08:00
Jiarui Fang
52c6ad26e0
[ColoTensor] reconfig ColoInitContext, decouple default_pg and default_dist_spec. ( #1953 )
2022-11-15 16:24:16 +08:00
Jiarui Fang
9f4fb3f28a
[ColoTensor] ColoInitContext initialize parameters in shard mode. ( #1937 )
2022-11-14 16:05:09 +08:00
Jiarui Fang
3ce4463fe6
[utils] remove lazy_memory_allocate from ColoInitContext ( #1844 )
2022-11-09 11:50:33 +08:00
YuliangLiu0306
980ed21723
[autoparallel] shard param and buffer as expected ( #1753 )
* [autoparallel] shard param and buffer as expected
* fix unit test issue
2022-10-21 15:45:13 +08:00
Frank Lee
eee84908d4
[autoparallel] handled illegal sharding strategy ( #1728 )
* [autoparallel] handled illegal sharding strategy
* polish code
2022-10-19 12:53:06 +08:00
HELSON
f69f9bf223
[zero] add chunk init function for users ( #1729 )
* add chunk manager init function
* fix unit tests
* add comment
* add flush=True
2022-10-18 16:31:22 +08:00
HELSON
b28991dd0a
[feature] A new ZeRO implementation ( #1644 )
2022-10-09 09:18:51 +08:00
YuliangLiu0306
3f068d1409
[autoparallel] update CommSpec ( #1667 )
2022-09-29 11:20:59 +08:00
Frank Lee
154d3ef432
[fix] fixed the collective pattern name for consistency ( #1649 )
* [fix] fixed the collective pattern name for consistency
* polish code
2022-09-26 16:39:37 +08:00
Jiarui Fang
c5d39215f6
Revert "[feature] new zero implementation ( #1623 )" ( #1643 )
This reverts commit 5be118f405.
2022-09-26 10:06:03 +08:00
HELSON
5be118f405
[feature] new zero implementation ( #1623 )
2022-09-24 19:58:18 +08:00
YuliangLiu0306
702dbc5288
[tensor] use communication autograd func ( #1617 )
* [tensor] use communication autograd func
* change all-to-all comm spec info
* rename pattern and distinguish fwd/bwd
* polish code
2022-09-23 13:31:15 +08:00
YuliangLiu0306
4b03c25f85
[tensor] add 1D device mesh ( #1492 )
2022-08-25 16:48:12 +08:00
YuliangLiu0306
b73fb7a077
[tensor] support runtime ShardingSpec apply ( #1453 )
* [tensor] support runtime ShardingSpec apply
* polish code
* polish code
2022-08-19 13:39:51 +08:00
YuliangLiu0306
0f3042363c
[tensor] shape consistency generate transform path and communication cost ( #1435 )
* [tensor] shape consistency output transform path and communication cost
* polish code
2022-08-12 14:02:32 +08:00
Frank Lee
ae1b58cd16
[tensor] added linear implementation for the new sharding spec ( #1416 )
* [tensor] added linear implementation for the new sharding spec
* polish code
2022-08-12 11:33:09 +08:00
Jiarui Fang
89c434a0a6
[polish] add test_ops directory ( #1431 )
2022-08-10 15:35:26 +08:00
Jiarui Fang
10b3df65c8
[FAW] move coloparam setting into test code. ( #1429 )
2022-08-10 14:31:53 +08:00
Jiarui Fang
cb98cf5558
[FAW] parallel FreqAwareEmbedding ( #1424 )
2022-08-10 13:44:30 +08:00
YuliangLiu0306
33f0744d51
[tensor] add shape consistency feature to support auto spec transform ( #1418 )
* [tensor] add shape consistency feature to support auto sharding spec transform.
* [tensor] remove unused argument in simulator, add doc string for target pair.
2022-08-10 11:29:17 +08:00
Jiarui Fang
d209aff684
Add FreqAwareEmbeddingBag ( #1421 )
2022-08-09 16:26:12 +08:00
Jiarui Fang
504419d261
[FAW] add cache manager for the cached embedding ( #1419 )
2022-08-09 15:17:17 +08:00
YuliangLiu0306
7c96055c68
[tensor] build sharding spec to replace distspec in the future. ( #1405 )
2022-08-08 11:15:57 +08:00
HELSON
87775a0682
[colotensor] use CPU memory to store state_dict ( #1367 )
2022-07-26 14:13:38 +08:00
HELSON
4417804129
[unit test] add megatron init test in zero_optim ( #1358 )
2022-07-25 11:18:08 +08:00
HELSON
7a065dc9f6
[hotfix] fix megatron_init in test_gpt2.py ( #1357 )
2022-07-25 10:28:19 +08:00
HELSON
7a8702c06d
[colotensor] add Tensor.view op and its unit test ( #1343 )
[colotensor] add megatron initialization for gpt2
2022-07-21 10:53:15 +08:00
HELSON
bf5066fba7
[refactor] refactor ColoTensor's unit tests ( #1340 )
2022-07-19 15:46:24 +08:00
ver217
0c51ff2c13
[hotfix] ZeroDDP use new process group ( #1333 )
* process group supports getting ranks in group
* chunk mgr receives a process group
* update unit test
* fix unit tests
2022-07-18 14:14:52 +08:00
HELSON
d49708ae43
[hotfix] fix ddp for unit test test_gpt2 ( #1326 )
2022-07-15 18:19:52 +08:00
HELSON
1b41686461
[hotfix] fix unit test test_module_spec ( #1321 )
2022-07-15 14:02:32 +08:00
Jiarui Fang
85f933b58b
[Optimizer] Remove useless ColoOptimizer ( #1312 )
2022-07-14 16:57:48 +08:00
Jiarui Fang
9f10524313
[Optimizer] polish the init method of ColoOptimizer ( #1310 )
2022-07-14 16:37:33 +08:00
HELSON
36086927e1
[hotfix] fix ColoTensor GPT2 unit test ( #1309 )
2022-07-14 16:37:20 +08:00
HELSON
260a55804a
[hotfix] fix shape error in backward when using ColoTensor ( #1298 )
2022-07-13 23:06:12 +08:00
Jiarui Fang
79fe7b027a
[hotfix] fix test model unit tests ( #1281 )
2022-07-12 23:45:29 +08:00
Jiarui Fang
e56731e916
[hotfix] test_gpt.py duplicated ( #1279 )
* make it faster
* [hotfix] torchvision fx tests
* [hotfix] rename the duplicated test_gpt.py
2022-07-12 23:29:17 +08:00
HELSON
abba4d84e1
[hotfix] fix BERT model test in unit tests ( #1272 )
2022-07-12 23:26:45 +08:00
Jiarui Fang
c92f84fcdb
[tensor] distributed checkpointing for parameters ( #1240 )
2022-07-12 15:51:06 +08:00
Jiarui Fang
1aad903c15
[tensor] redistribute among different process groups ( #1247 )
* make it faster
* [tensor] rename convert_to_dist -> redistribute
* [tensor] ShardSpec and ReplicaSpec
* [tensor] redistribute among diff pgs
* polish code
2022-07-12 10:24:05 +08:00
Jiarui Fang
9bcd2fd4af
[tensor] a shorter shard and replicate spec ( #1245 )
2022-07-11 15:51:48 +08:00
Jiarui Fang
2699dfbbfd
[rename] convert_to_dist -> redistribute ( #1243 )
2022-07-11 13:05:44 +08:00
HELSON
f6add9b720
[tensor] redirect .data.__get__ to a tensor instance ( #1239 )
2022-07-11 11:41:29 +08:00
Jiarui Fang
4a76084dc9
[tensor] add zero_like colo op, important for Optimizer ( #1236 )
2022-07-08 14:55:27 +08:00
Jiarui Fang
3b500984b1
[tensor] fix some unittests ( #1234 )
2022-07-08 14:18:30 +08:00
HELSON
0453776def
[tensor] fix an assertion in colo_tensor cross_entropy ( #1232 )
2022-07-08 11:18:00 +08:00
Jiarui Fang
0e199d71e8
[hotfix] fx get comm size bugs ( #1233 )
* init a checkpoint dir
* [checkpoint] support resume for cosinewarmuplr
* [checkpoint] add unit test
* fix some bugs but still not OK
* fix bugs
* make it faster
* [checkpoint] support generalized scheduler
* polish
* [tensor] torch functions return ColoTensor
* polish
* fix bugs
* remove debug info
* polish
* polish
* [tensor] test_model passes unit tests
* polish
* [hotfix] fx get comm size bug
Co-authored-by: ZhaoYi1222 <zhaoyi9499@gmail.com>
2022-07-08 10:54:41 +08:00