ColossalAI

Commit Graph

Author	SHA1	Message	Date
Geng Zhang	0e06f62160	[NFC] polish colossalai/nn/layer/parallel_sequence/_operation.py code style (#1266 )	2022-07-13 12:08:21 +08:00
binmakeswell	c95e18cdb9	[NFC] polish colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.h code style (#1270 )	2022-07-13 12:08:21 +08:00
xyupeng	94bfd35184	[NFC] polish colossalai/builder/builder.py code style (#1265 )	2022-07-13 12:08:21 +08:00
DouJS	db13f96333	[NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh code style (#1264 )	2022-07-13 12:08:21 +08:00
shenggan	5d7366b144	[NFC] polish colossalai/kernel/cuda_native/csrc/scaled_masked_softmax.h code style (#1263 )	2022-07-13 12:08:21 +08:00
Zangwei Zheng	197a2c89e2	[NFC] polish colossalai/communication/collective.py (#1262 )	2022-07-13 12:08:21 +08:00
ziyu huang	f1cafcc73a	[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/dropout_kernels.cu code style (#1261 ) Co-authored-by: “Arsmart123 <202476410arsmart@gmail.com>	2022-07-13 12:08:21 +08:00
Sze-qq	f8b9aaef47	[NFC] polish colossalai/kernel/cuda_native/csrc/type_shim.h code style (#1260 )	2022-07-13 12:08:21 +08:00
superhao1995	f660152c73	[NFC] polish colossalai/nn/layer/parallel_3d/_operation.py code style (#1258 ) Co-authored-by: Research <research@soccf-snr3-017.comp.nus.edu.sg>	2022-07-13 12:08:21 +08:00
Thunderbeee	9738fb0f78	[NFC] polish colossalai/nn/lr_scheduler/__init__.py (#1255 ) code style	2022-07-13 12:08:21 +08:00
Kai Wang (Victor Kai)	50f2ad213f	[NFC] polish colossalai/engine/ophooks/utils.py code style (#1256 )	2022-07-13 12:08:21 +08:00
Ofey Chan	2dd4d556fb	[NFC] polish colossalai/nn/init.py code style (#1292 )	2022-07-13 10:51:55 +08:00
Jiarui Fang	556b9b7e1a	[hotfix] Dist Mgr gather torch version (#1284 ) * make it faster * [hotfix] torchvision fx tests * [hotfix] rename duplicated named test_gpt.py * [hotfix] dist mgr torch version	2022-07-13 00:18:56 +08:00
HELSON	abba4d84e1	[hotfix] fix bert model test in unit tests (#1272 )	2022-07-12 23:26:45 +08:00
ver217	7aadcbd070	hotfix colotensor _scan_for_pg_from_args (#1276 )	2022-07-12 20:46:31 +08:00
oahzxl	0cf8e8e91c	[NFC] polish <colossalai/nn/lr_scheduler/poly.py> code style (#1267 )	2022-07-12 18:18:14 +08:00
Jiarui Fang	c92f84fcdb	[tensor] distributed checkpointing for parameters (#1240 )	2022-07-12 15:51:06 +08:00
Frank Lee	fb35460595	[fx] added ndim property to proxy (#1253 )	2022-07-12 15:27:13 +08:00
Frank Lee	4a09fc0947	[fx] fixed tracing with apex-based T5 model (#1252 ) * [fx] fixed tracing with apex-based T5 model * polish code * polish code	2022-07-12 15:19:25 +08:00
Frank Lee	7531c6271f	[fx] refactored the file structure of patched function and module (#1238 ) * [fx] refactored the file structure of patched function and module * polish code	2022-07-12 15:01:58 +08:00
YuliangLiu0306	17ed33350b	[hotfix] fix an assertion bug in base schedule. (#1250 )	2022-07-12 14:20:02 +08:00
YuliangLiu0306	97d713855a	[fx] methods to get fx graph property. (#1246 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * manipulation * [fx]add graph manipulation methods. * [fx]methods to get fx graph property. * add unit test * add docstring to explain top node and leaf node in this context	2022-07-12 14:10:37 +08:00
YuliangLiu0306	30b4fc0eb0	[fx]add split module pass and unit test from pipeline passes (#1242 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [fx]add split module pass and unit test from pipeline passes * fix MNASNet bug * polish	2022-07-12 13:45:01 +08:00
Jiarui Fang	1aad903c15	[tensor] redistribute among different process groups (#1247 ) * make it faster * [tensor] rename convert_to_dist -> redistribute * [tensor] ShardSpec and ReplicaSpec * [tensor] redistribute among diff pgs * polish code	2022-07-12 10:24:05 +08:00
Jiarui Fang	9bcd2fd4af	[tensor] a shorter shard and replicate spec (#1245 )	2022-07-11 15:51:48 +08:00
Jiarui Fang	2699dfbbfd	[rename] convert_to_dist -> redistribute (#1243 )	2022-07-11 13:05:44 +08:00
HELSON	f6add9b720	[tensor] redirect .data.__get__ to a tensor instance (#1239 )	2022-07-11 11:41:29 +08:00
Jiarui Fang	20da6e48c8	[checkpoint] save sharded optimizer states (#1237 )	2022-07-08 16:33:13 +08:00
Jiarui Fang	4a76084dc9	[tensor] add zero_like colo op, important for Optimizer (#1236 )	2022-07-08 14:55:27 +08:00
Jiarui Fang	3b500984b1	[tensor] fix some unittests (#1234 )	2022-07-08 14:18:30 +08:00
ver217	a45ddf2d5f	[hotfix] fix sharded optim step and clip_grad_norm (#1226 )	2022-07-08 13:34:48 +08:00
HELSON	f071b500b6	[polish] polish __repr__ for ColoTensor, DistSpec, ProcessGroup (#1235 )	2022-07-08 13:25:57 +08:00
HELSON	0453776def	[tensor] fix an assertion in colo_tensor cross_entropy (#1232 )	2022-07-08 11:18:00 +08:00
Jiarui Fang	0e199d71e8	[hotfix] fx get comm size bugs (#1233 ) * init a checkpoint dir * [checkpoint]support resume for cosinewarmuplr * [checkpoint]add unit test * fix some bugs but still not OK * fix bugs * make it faster * [checkpoint]support generalized scheduler * polish * [tensor] torch function return colotensor * polish * fix bugs * remove debug info * polish * polish * [tensor] test_model pass unittests * polish * [hotfix] fx get comm size bug Co-authored-by: ZhaoYi1222 <zhaoyi9499@gmail.com>	2022-07-08 10:54:41 +08:00
HELSON	42ab36b762	[tensor] add unitest for colo_tensor 1DTP cross_entropy (#1230 )	2022-07-07 19:17:23 +08:00
Yi Zhao	04537bf83e	[checkpoint]support generalized scheduler (#1222 )	2022-07-07 18:16:38 +08:00
Jiarui Fang	a98319f023	[tensor] torch function return colotensor (#1229 )	2022-07-07 18:09:18 +08:00
YuliangLiu0306	2b7dca44b5	[fx]get communication size between partitions (#1224 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [fx]get communication size between partitions. * polish	2022-07-07 16:22:00 +08:00
Frank Lee	84f2298a96	[fx] added patches for tracing swin transformer (#1228 )	2022-07-07 15:20:13 +08:00
Frank Lee	b6cb5a47ad	[fx] added timm model tracing testing (#1221 )	2022-07-07 14:02:17 +08:00
HELSON	280a81243d	[tensor] improve robustness of class 'ProcessGroup' (#1223 )	2022-07-07 13:55:24 +08:00
Jiarui Fang	15d988f954	[tensor] sharded global process group (#1219 )	2022-07-07 13:38:48 +08:00
Jiarui Fang	db1bef9032	[hotfix] fx shard 1d pass bug fixing (#1220 )	2022-07-07 13:37:31 +08:00
Frank Lee	11973d892d	[fx] added torchvision model tracing testing (#1216 ) * [fx] added torchvision model tracing testing * remove unused imports	2022-07-06 21:37:56 +08:00
Jiarui Fang	52736205d9	[checkpoint] make unitest faster (#1217 )	2022-07-06 17:39:46 +08:00
Jiarui Fang	f38006ea83	[checkpoint] checkpoint for ColoTensor Model (#1196 )	2022-07-06 17:22:03 +08:00
XYE	291e22aac6	[fx] temporarily used (#1215 )	2022-07-06 17:19:26 +08:00
Jiarui Fang	ae7d3f4927	[refactor] move process group from _DistSpec to ColoTensor. (#1203 )	2022-07-06 16:15:16 +08:00
Frank Lee	5da87ce35d	[fx] added testing for all albert variants (#1211 )	2022-07-06 15:11:08 +08:00
Frank Lee	2d13a45a3b	[fx] added testing for all gpt variants (#1210 ) * [fx] added testing for all gpt variants * polish code * polish code	2022-07-06 14:03:13 +08:00
YuliangLiu0306	189946c5c4	[fx]add uniform policy (#1208 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [fx]add uniform policy	2022-07-06 13:48:11 +08:00
Frank Lee	426a279ce7	[fx] added testing for all bert variants (#1207 ) * [fx] added testing for all bert variants * polish code	2022-07-06 10:50:49 +08:00
Jiarui Fang	b5f25eb32a	[Tensor] add cpu group to ddp (#1200 )	2022-07-05 14:58:28 +08:00
Frank Lee	f7878f465c	[fx] supported model tracing for huggingface bert (#1201 ) * [fx] supported model tracing for huggingface bert * polish test	2022-07-05 13:19:57 +08:00
Jiarui Fang	060b917daf	[refactor] remove gpc dependency in colotensor's _ops (#1189 )	2022-07-04 18:54:37 +08:00
Frank Lee	abf6a262dc	[fx] added module patch for pooling layers (#1197 )	2022-07-04 15:21:26 +08:00
YuliangLiu0306	63d2a93878	[context]support arbitrary module materialization. (#1193 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [context]support arbitrary module materialization. * [test]add numerical check for lazy init context.	2022-07-04 10:12:02 +08:00
Jiarui Fang	a444633d13	warmup ratio configuration (#1192 )	2022-06-30 15:23:50 +08:00
ver217	dba7e0cfb4	make AutoPlacementPolicy configurable (#1191 )	2022-06-30 15:18:30 +08:00
YuliangLiu0306	2053e138a2	[context]use meta tensor to init model lazily. (#1187 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [context]use meta tensor to init model lazily. * polish * make module with device kwargs bypass the normal init. * change unit test to adapt updated context.	2022-06-29 21:02:30 +08:00
Frank Lee	2c8c05675d	[fx] patched conv and normalization (#1188 )	2022-06-29 18:58:38 +08:00
Frank Lee	6d86f1bc91	[fx] supported data-dependent control flow in model tracing (#1185 ) * [fx] supported data-dependent control flow in model tracing * polish code	2022-06-29 15:05:25 +08:00
Jiarui Fang	c463f8adf9	[tensor] remove gpc in tensor tests (#1186 )	2022-06-29 14:08:40 +08:00
Jiarui Fang	372f791444	[refactor] move chunk and chunkmgr to directory gemini (#1182 )	2022-06-29 13:31:02 +08:00
ver217	6b2f2ab9bb	[ddp] ColoDDP uses bucket all-reduce (#1177 ) * add reducer * update colo ddp with reducer * polish unit test * polish unit test	2022-06-29 10:34:13 +08:00
Jiarui Fang	7487215b95	[ColoTensor] add independent process group (#1179 )	2022-06-29 10:03:09 +08:00
YuliangLiu0306	26ba87272d	[hotfix]fixed p2p process send stuck (#1181 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [hotfix]fixed p2p process send stuck	2022-06-28 14:41:11 +08:00
Jiarui Fang	1b657f9ce1	[tensor] revert local view back (#1178 )	2022-06-27 18:38:34 +08:00
Jiarui Fang	0dd4e2bbfb	[Tensor] rename some APIs in TensorSpec and Polish view unittest (#1176 )	2022-06-27 15:56:11 +08:00
Ziyue Jiang	dd0420909f	[Tensor] rename parallel_action (#1174 ) * rename parallel_action * polish	2022-06-27 10:04:45 +08:00
YuliangLiu0306	e27645376d	[hotfix]different overflow status lead to communication stuck. (#1175 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [hotfix]fix some bugs caused by refactored schedule. * [hotfix]different overflow status lead to communication stuck.	2022-06-27 09:53:57 +08:00
Jiarui Fang	aa7bef73d4	[Tensor] distributed view supports inter-process hybrid parallel (#1169 )	2022-06-27 09:45:26 +08:00
ver217	9e1daa63d2	[zero] sharded optim supports loading local state dict (#1170 ) * sharded optim supports loading local state dict * polish code * add unit test	2022-06-24 18:05:16 +08:00
ver217	561e90493f	[zero] zero optim supports loading local state dict (#1171 ) * zero optim supports loading local state dict * polish code * add unit test	2022-06-24 17:25:57 +08:00
Jiarui Fang	4b9bba8116	[ColoTensor] rename APIs and add output_replicate to ComputeSpec (#1168 )	2022-06-24 13:08:54 +08:00
Jiarui Fang	f4ef224358	[Tensor] remove ParallelAction, use ComputeSpec instead (#1166 )	2022-06-23 17:34:59 +08:00
Jiarui Fang	177c374401	remove gather out in parallel action (#1163 )	2022-06-23 16:35:05 +08:00
ver217	634eecb98e	mark sanity_check of dist_spec_mgr as staticmethod (#1161 )	2022-06-23 11:35:25 +08:00
Ziyue Jiang	955ac912de	remove log (#1160 )	2022-06-23 10:32:42 +08:00
ver217	4e67b2a890	fix chunk move device (#1158 )	2022-06-22 18:07:10 +08:00
Jiarui Fang	07f9c781f9	[graph] improve the graph building. (#1157 )	2022-06-22 16:47:20 +08:00
ver217	22717a856f	[tensor] add embedding bag op (#1156 )	2022-06-22 15:54:03 +08:00
ver217	ae86151968	[tensor] add more element-wise ops (#1155 ) * add more element-wise ops * update test_op * polish unit test	2022-06-22 15:16:47 +08:00
ver217	54aabb8da4	[gemini] refactor gemini mgr (#1151 ) * refactor gemini mgr * update __init__	2022-06-22 11:54:36 +08:00
Frank Lee	f8eec98ff5	[tensor] fixed non-serializable colo parameter during model checkpointing (#1153 )	2022-06-22 11:43:38 +08:00
ver217	ffa025e120	[tensor] dist spec s2s uses all-to-all (#1136 ) * dist spec s2s uses all-to-all * update unit test * add sanity check * polish unitest test with titans * add sanity check for DistMgr * add sanity check Co-authored-by: jiaruifang <fangjiarui123@gmail.com>	2022-06-22 11:32:38 +08:00
YuliangLiu0306	f1f51990b9	[hotfix]fix some bugs caused by refactored schedule. (#1148 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [hotfix]fix some bugs caused by refactored schedule.	2022-06-21 22:46:30 +08:00
Jiarui Fang	8cdce0399c	[ColoTensor] improves init functions. (#1150 )	2022-06-21 18:28:38 +08:00
ver217	8106d7b8c7	[ddp] refactor ColoDDP and ZeroDDP (#1146 ) * ColoDDP supports overwriting default process group * rename ColoDDPV2 to ZeroDDP * add docstr for ZeroDDP * polish docstr	2022-06-21 16:35:23 +08:00
Frank Lee	0e4e62d30d	[tensor] added __repr__ to spec (#1147 )	2022-06-21 15:38:05 +08:00
YuliangLiu0306	70dd88e2ee	[pipeline]add customized policy (#1139 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [pipeline]add customized policy	2022-06-21 15:23:41 +08:00
YuliangLiu0306	18091581c0	[pipeline]support more flexible pipeline (#1138 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [pipeline]support more flexible pipeline	2022-06-21 14:40:50 +08:00
ver217	ccf3c58c89	embedding op use gather_out (#1143 )	2022-06-21 13:21:20 +08:00
ver217	6690a61b4d	[hotfix] prevent nested ZeRO (#1140 )	2022-06-21 11:33:53 +08:00
Frank Lee	15aab1476e	[zero] avoid zero hook spam by changing log to debug level (#1137 )	2022-06-21 10:44:01 +08:00
Frank Lee	73ad05fc8c	[zero] added error message to handle on-the-fly import of torch Module class (#1135 ) * [zero] added error message to handle on-the-fly import of torch Module class * polish code	2022-06-20 11:24:27 +08:00
ver217	e4f555f29a	[optim] refactor fused sgd (#1134 )	2022-06-20 11:19:38 +08:00
ver217	d26902645e	[ddp] add save/load state dict for ColoDDP (#1127 ) * add save/load state dict for ColoDDP * add unit test * refactor unit test folder * polish unit test * rename unit test	2022-06-20 10:51:47 +08:00
YuliangLiu0306	946dbd629d	[hotfix]fix bugs caused by refactored pipeline (#1133 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [hotfix]fix bugs caused by refactored pipeline	2022-06-17 17:54:15 +08:00
ver217	789cad301b	[hotfix] fix param op hook (#1131 ) * fix param op hook * update zero tp test * fix bugs	2022-06-17 16:12:05 +08:00
ver217	a1a7899cae	[hotfix] fix zero init ctx numel (#1128 )	2022-06-16 17:17:27 +08:00
ver217	f0a954f16d	[ddp] add set_params_to_ignore for ColoDDP (#1122 ) * add set_params_to_ignore for ColoDDP * polish code * fix zero hook v2 * add unit test * polish docstr	2022-06-16 12:54:46 +08:00
YuliangLiu0306	3175bcb4d8	[pipeline]support List of Dict data (#1125 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [pipeline]support List of Dict data * polish	2022-06-16 11:19:48 +08:00
Frank Lee	91a5999825	[ddp] supported customized torch ddp configuration (#1123 )	2022-06-15 18:11:53 +08:00
YuliangLiu0306	fcf55777dd	[fx]add autoparallel passes (#1121 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * feature/add autoparallel passes	2022-06-15 16:36:46 +08:00
ver217	e127b4375b	cast colo ddp v2 inputs/outputs (#1120 )	2022-06-15 15:57:04 +08:00
Frank Lee	16302a5359	[fx] added unit test for coloproxy (#1119 ) * [fx] added unit test for coloproxy * polish code * polish code	2022-06-15 15:27:51 +08:00
ver217	7d14b473f0	[gemini] gemini mgr supports "cpu" placement policy (#1118 ) * update gemini mgr * update chunk * add docstr * polish placement policy * update test chunk * update test zero * polish unit test * remove useless unit test	2022-06-15 15:05:19 +08:00
ver217	f99f56dff4	fix colo parameter torch function (#1117 )	2022-06-15 14:23:27 +08:00
Frank Lee	e1620ddac2	[fx] added coloproxy (#1115 )	2022-06-15 10:47:57 +08:00
Frank Lee	6f82ac9bcb	[pipeline] supported more flexible dataflow control for pipeline parallel training (#1108 ) * [pipeline] supported more flexible dataflow control for pipeline parallel training * polish code * polish code * polish code	2022-06-15 10:41:28 +08:00
ver217	895c1c5ee7	[tensor] refactor param op hook (#1097 ) * refactor param op hook * add docstr * fix bug	2022-06-13 16:11:53 +08:00
YuliangLiu0306	1e9f9c227f	[hotfix]change to fit latest p2p (#1100 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [hotfix]change to fit latest p2p * polish * polish	2022-06-13 14:57:25 +08:00
Frank Lee	72bd7c696b	[amp] included dict for type casting of model output (#1102 )	2022-06-13 14:18:04 +08:00
Frank Lee	7f2d2b2b5b	[engine] fixed empty op hook check (#1096 ) * [engine] fixed empty op hook check * polish code	2022-06-10 17:27:27 +08:00
Frank Lee	14e5b11d7f	[zero] fixed api consistency (#1098 )	2022-06-10 16:59:59 +08:00
Frank Lee	cb18922c47	[doc] added documentation to chunk and chunk manager (#1094 ) * [doc] added documentation to chunk and chunk manager * polish code * polish code * polish code	2022-06-10 15:33:06 +08:00
ver217	1f894e033f	[gemini] zero supports gemini (#1093 ) * add placement policy * add gemini mgr * update mem stats collector * update zero * update zero optim * fix bugs * zero optim monitor os * polish unit test * polish unit test * add assert	2022-06-10 14:48:28 +08:00
Frank Lee	2b2dc1c86b	[pipeline] refactor the pipeline module (#1087 ) * [pipeline] refactor the pipeline module * polish code	2022-06-10 11:27:38 +08:00
Frank Lee	bad5d4c0a1	[context] support lazy init of module (#1088 ) * [context] support lazy init of module * polish code	2022-06-10 10:09:48 +08:00
ver217	be01db37c8	[tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077 ) * polish chunk manager * polish unit test * impl add_extern_static_tensor for chunk mgr * add mem stats collector v2 * polish code * polish unit test * polish code * polish get chunks	2022-06-09 20:56:34 +08:00
Frank Lee	50ec3a7e06	[test] skip tests when not enough GPUs are detected (#1090 ) * [test] skip tests when not enough GPUs are detected * polish code * polish code	2022-06-09 17:19:13 +08:00
Ziyue Jiang	0653c63eaa	[Tensor] 1d row embedding (#1075 ) * Add CPU 1d row embedding * polish	2022-06-08 12:04:59 +08:00
junxu	d66ffb4df4	Remove duplication registry (#1078 )	2022-06-08 07:47:24 +08:00
Jiarui Fang	bcab249565	fix issue #1080 (#1071 )	2022-06-07 17:21:11 +08:00
ver217	1b17859328	[tensor] chunk manager monitor mem usage (#1076 )	2022-06-07 15:00:00 +08:00
ver217	98cdbf49c6	[hotfix] fix chunk comm src rank (#1072 )	2022-06-07 11:54:56 +08:00
Frank Lee	bfdc5ccb7b	[context] maintain the context object in with statement (#1073 )	2022-06-07 10:48:45 +08:00
ver217	c5cd3b0f35	[zero] zero optim copy chunk rather than copy tensor (#1070 )	2022-06-07 10:30:46 +08:00
Ziyue Jiang	4fc748f69b	[Tensor] fix optimizer for CPU parallel (#1069 )	2022-06-06 17:36:11 +08:00
Jiarui Fang	49832b2344	[refactory] add nn.parallel module (#1068 )	2022-06-06 15:34:41 +08:00
Ziyue Jiang	6754f1b77f	fix module utils bug (#1066 )	2022-06-06 12:11:48 +08:00
Jiarui Fang	a00644079e	reorganize colotensor directory (#1062 ) * reorganize colotensor directory * polish code	2022-06-03 18:04:22 +08:00
Frank Lee	3d10be33bd	[cudnn] set False to cudnn benchmark by default (#1063 )	2022-06-03 17:58:06 +08:00
Ziyue Jiang	df9dcbbff6	[Tensor] add hybrid device demo and fix bugs (#1059 )	2022-06-03 12:09:49 +08:00
YuliangLiu0306	b167258b6a	[pipeline]refactor ppschedule to support tensor list (#1050 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * refactor ppschedule to support tensor list * polish	2022-06-02 13:48:59 +08:00
ver217	e3fde4ee6b	fix import error in sharded model v2 (#1053 )	2022-06-02 13:48:22 +08:00
ver217	e1922ea4f6	[zero] add chunk size search for chunk manager (#1052 )	2022-06-02 13:20:20 +08:00
アマデウス	2c42b230f3	updated collective ops api (#1054 )	2022-06-02 12:52:27 +08:00
ver217	51b9a49655	[zero] add zero optimizer for ColoTensor (#1046 ) * add zero optimizer * torch ok * unit test ok * polish code * fix bugs * polish unit test * polish zero optim * polish colo ddp v2 * refactor folder structure * add comment * polish unit test * polish zero optim * polish unit test	2022-06-02 12:13:15 +08:00
ver217	7faef93326	fix dist spec mgr (#1045 )	2022-05-31 12:14:39 +08:00
ver217	9492a561c3	[tensor] ColoTensor supports ZeRo (#1015 ) * impl chunk manager * impl param op hook * add reduce_chunk * add zero hook v2 * add zero dp * fix TensorInfo * impl load balancing when using zero without chunk * fix zero hook * polish chunk * fix bugs * ddp ok * zero ok * polish code * fix bugs about load balancing * polish code * polish code * add end-to-end test * polish code * polish code * polish code * fix typo * add test_chunk * fix bugs * fix bugs * polish code	2022-05-31 12:00:12 +08:00
Ziyue Jiang	7c530b9de2	[Tensor] add Parameter inheritance for ColoParameter (#1041 ) * add Parameter inheritance for ColoParameter * remove tricks * remove tricks * polish * polish	2022-05-30 17:23:44 +08:00
ver217	7cfd6c827e	[zero] add load_state_dict for sharded model (#894 ) * add load_state_dict for sharded model * fix bug * fix bug * fix ckpt dtype and device * support load state dict in zero init ctx * fix bugs	2022-05-27 10:25:08 +08:00
Ziyue Jiang	6c5996a56e	[Tensor] add module check and bert test (#1031 ) * add Embedding * Add bert test * polish * add check module test * polish * polish * polish * polish	2022-05-26 18:15:42 +08:00
YuliangLiu0306	7106bd671d	[p2p]add object list send/recv (#1024 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [p2p]add object list send recv * refactor for code reusability * polish	2022-05-26 14:28:46 +08:00
Frank Lee	e4685832f8	[engine] fixed bug in gradient accumulation dataloader to keep the last step (#1030 )	2022-05-26 14:28:23 +08:00
Ziyue Jiang	32291dd73f	[Tensor] add module handler for linear (#1021 ) * add module spec for linear * polish * polish * polish	2022-05-26 11:50:44 +08:00
Ryan Russell	9b0c037027	fix typo in constants (#1027 )	2022-05-26 08:45:08 +08:00
ver217	007ca0df92	fix colo init context (#1026 )	2022-05-25 20:41:58 +08:00
YuliangLiu0306	d182b0bd47	[hotfix] fix some bugs caused by size mismatch. (#1011 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [hotfix]fix some bugs caused by size mismatch. * add warning logs * polish	2022-05-23 14:02:28 +08:00
ver217	cefc29ff06	[tensor] impl ColoDDP for ColoTensor (#1009 ) * impl ColoDDP for ColoTensor * polish code	2022-05-21 13:52:04 +08:00
zhengzangw	ae7c338105	[NFC] polish colossalai/kernel/cuda_native/csrc/colossal_C_frontend.cpp code style	2022-05-20 23:57:38 +08:00
ver217	a3b66f6def	[tensor] refactor parallel action (#1007 ) * refactor parallel action * polish unit tests	2022-05-20 20:19:58 +08:00
ver217	ad536e308e	[tensor] refactor colo-tensor (#992 ) * refactor colo-tensor and update linear op * polish code * polish code * update ops and unit tests * update unit tests * polish code * rename dist_spec module * polish code * polish code * remove unneeded import * fix pipelinable	2022-05-19 12:44:59 +08:00
Frank Lee	1467d83edf	[cli] remove unused imports (#1001 )	2022-05-18 23:27:18 +08:00
Frank Lee	533d0c46d8	[kernel] fixed the include bug in dropout kernel (#999 )	2022-05-18 21:43:18 +08:00
Jiarui Fang	802ac297cc	[Tensor] remove useless import in tensor dir (#997 )	2022-05-18 14:54:51 +08:00
Ziheng Qin	571f12eff3	[NFC] polish colossalai/nn/layer/utils/common.py code style (#983 )	2022-05-17 10:25:06 +08:00
puck_WCR	bda70b4b66	[NFC] polish colossalai/kernel/cuda_native/layer_norm.py code style (#980 )	2022-05-17 10:25:06 +08:00
Kai Wang (Victor Kai)	c50c08dcbb	[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/dropout_kernels.cu code style (#979 )	2022-05-17 10:25:06 +08:00
binmakeswell	f28c021376	[NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu code style (#978 )	2022-05-17 10:25:06 +08:00
shenggan	18542b47fc	[NFC] polish colossalai/nn/layer/parallel_2d/layers.py code style (#976 )	2022-05-17 10:25:06 +08:00
Jie Zhu	b67eebd20f	[NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu code style (#977 )	2022-05-17 10:25:06 +08:00
DouJS	52705ec5c5	[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/normalize_kernels.cu code style (#974 )	2022-05-17 10:25:06 +08:00
Ofey Chan	136946422b	[NFC] polish colossalai/kernel/cuda_native/csrc/layer_norm_cuda.cpp code style (#973 )	2022-05-17 10:25:06 +08:00
Zirui Zhu	598cde4a0f	[NFC] polish colossalai/nn/layer/parallel_2p5d/layers.py code style (#972 )	2022-05-17 10:25:06 +08:00
Xu Kai	632e94abde	[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/dropout.h code style (#970 )	2022-05-17 10:25:06 +08:00
ExtremeViscent	22d1df224d	[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/feed_forward.h (#968 ) code style	2022-05-17 10:25:06 +08:00
LuGY	fb5bc6cb28	[NFC] polish colossalai/nn/layer/parallel_3d/layers.py code style (#966 )	2022-05-17 10:25:06 +08:00
lucasliunju	955463e542	[NFC] polish __init__.py code style (#965 )	2022-05-17 10:25:06 +08:00
Yuer867	7106a399fc	[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/softmax.h code style (#964 )	2022-05-17 10:25:06 +08:00
ziyu huang	5bd80b7dd1	[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/general_kernels.cu code style (#963 ) Co-authored-by: “Arsmart123 <202476410arsmart@gmail.com>	2022-05-17 10:25:06 +08:00
superhao1995	48c4a180c7	[NFC] polish colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.cpp code style (#959 )	2022-05-17 10:25:06 +08:00
MaxT	442a2975ab	[NFC] polish colossalai/kernel/cuda_native/csrc/multihead_attention_1d.h code style (#962 )	2022-05-17 10:25:06 +08:00
runluo	89e2767a92	[NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu code style (#958 )	2022-05-17 10:25:06 +08:00
doubleHU	1dc1b6fa00	[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/cross_entropy_layer.h code style (#957 )	2022-05-17 10:25:06 +08:00
RichardoLuo	0e922da874	[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/context.h code style (#956 ) Co-authored-by: RichardoLuo <14049555596@qq.com>	2022-05-17 10:25:06 +08:00
Wangbo Zhao(黑色枷锁)	8ca2a85682	[NFC] polish colossalai/kernel/cuda_native/scaled_softmax.py code style (#955 )	2022-05-17 10:25:06 +08:00
Luxios22	f6970ef8b1	[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/softmax_kernels.cu code style (#954 )	2022-05-17 10:25:06 +08:00
Cautiousss	0b86a6345e	[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/cross_entropy.cu code style (#953 ) Co-authored-by: 何晓昕 <cautious@hexiaoxins-MacBook-Pro.local>	2022-05-17 10:25:06 +08:00
Sze-qq	d8d07b0e2b	[NFC] polish colossalai/kernel/cuda_native/csrc/multihead_attention_1d.cpp code style (#952 )	2022-05-17 10:25:06 +08:00
xyupeng	fa43bb216d	[NFC] polish colossalai/builder/pipeline.py code style (#951 )	2022-05-17 10:25:06 +08:00
JT.Han	c3e423c8be	[NFC] polish colossalai/kernel/cuda_native/csrc/scaled_masked_softmax_cuda.cu code style (#949 ) Co-authored-by: Jiatong <jiatong.han@u.nus.edu>	2022-05-17 10:25:06 +08:00
luoling-LC	72c71b67ec	[NFC] polish colossalai/kernel/jit/bias_gelu.py code style (#946 ) Co-authored-by: jnbai <897086360@qq.com>	2022-05-17 10:25:06 +08:00
bajiaoyu517	eb9a81d72a	[NFC] polish colossalai/kernel/cuda_native/csrc/cpu_adam.h code style (#945 )	2022-05-17 10:25:06 +08:00
wky	8ffdc38376	[NFC] polish colossalai/kernel/cuda_native/csrc/moe_cuda.cpp code style (#942 )	2022-05-17 10:25:06 +08:00
HaoyuQin	c0f373db5d	[NFC] polish pre-commit run --files colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax_cuda.cu code style (#943 )	2022-05-17 10:25:06 +08:00
XYE	5bbefeb06a	[NFC] polish moe_cuda_kernel.cu code style (#940 ) Co-authored-by: Xiao Ye <xiaoye2@illinois.edu>	2022-05-17 10:25:06 +08:00
Maruyama_Aya	7aa35eae6a	[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/block_reduce.h code style (#938 )	2022-05-17 10:25:06 +08:00
Geng Zhang	b6cc9313ef	[NFC] polish colossalai/kernel/cuda_native/csrc/cpu_adam.cpp code style (#936 )	2022-05-17 10:25:06 +08:00
yuxuan-lou	44b6f8947b	[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/cuda_util.h code style (#939 )	2022-05-17 10:25:06 +08:00
BoxiangW	872aa413c2	[NFC] Polish colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu code style. (#937 )	2022-05-17 10:25:06 +08:00
ver217	58580b50fe	Revert "[NFC] Hotfix/format (#984 )" (#986 ) This reverts commit `0772828fba`.	2022-05-17 10:23:38 +08:00
binmakeswell	0772828fba	[NFC] Hotfix/format (#984 ) * [NFC] Polish colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu code style. (#937) * [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/cuda_util.h code style (#939) * [NFC] polish colossalai/kernel/cuda_native/csrc/cpu_adam.cpp code style (#936) * [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/block_reduce.h code style (#938) * [NFC] polish moe_cuda_kernel.cu code style (#940) Co-authored-by: Xiao Ye <xiaoye2@illinois.edu> * [NFC] polish pre-commit run --files colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax_cuda.cu code style (#943) * [NFC] polish colossalai/kernel/cuda_native/csrc/moe_cuda.cpp code style (#942) * [NFC] polish colossalai/kernel/cuda_native/csrc/cpu_adam.h code style (#945) * [NFC] polish colossalai/kernel/jit/bias_gelu.py code style (#946) Co-authored-by: jnbai <897086360@qq.com> * [NFC] polish colossalai/kernel/cuda_native/csrc/scaled_masked_softmax_cuda.cu code style (#949) Co-authored-by: Jiatong <jiatong.han@u.nus.edu> * [NFC] polish colossalai/builder/pipeline.py code style (#951) * [NFC] polish colossalai/kernel/cuda_native/csrc/multihead_attention_1d.cpp code style (#952) * [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/cross_entropy.cu code style (#953) Co-authored-by: 何晓昕 <cautious@hexiaoxins-MacBook-Pro.local> * [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/softmax_kernels.cu code style (#954) * [NFC] polish colossalai/kernel/cuda_native/scaled_softmax.py code style (#955) * [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/context.h code style (#956) Co-authored-by: RichardoLuo <14049555596@qq.com> * [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/cross_entropy_layer.h code style (#957) * [NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu code style (#958) * [NFC] polish colossalai/kernel/cuda_native/csrc/multihead_attention_1d.h code style (#962) * [NFC] polish colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.cpp code style (#959) * [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/general_kernels.cu code style (#963) Co-authored-by: “Arsmart123 <202476410arsmart@gmail.com> * [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/softmax.h code style (#964) * [NFC] polish __init__.py code style (#965) * [NFC] polish colossalai/nn/layer/parallel_3d/layers.py code style (#966) * [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/feed_forward.h (#968) code style * [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/dropout.h code style (#970) * [NFC] polish colossalai/nn/layer/parallel_2p5d/layers.py code style (#972) * [NFC] polish colossalai/kernel/cuda_native/csrc/layer_norm_cuda.cpp code style (#973) * [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/normalize_kernels.cu code style (#974) * [NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu code style (#977) * [NFC] polish colossalai/nn/layer/parallel_2d/layers.py code style (#976) * [NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu code style (#978) * [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/dropout_kernels.cu code style (#979) * [NFC] polish colossalai/kernel/cuda_native/layer_norm.py code style (#980) * [NFC] polish colossalai/nn/layer/utils/common.py code style (#983) Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com> Co-authored-by: yuxuan-lou <83441848+yuxuan-lou@users.noreply.github.com> Co-authored-by: Geng Zhang <34452939+zxgx@users.noreply.github.com> Co-authored-by: Maruyama_Aya <38985202+MaruyamaAya@users.noreply.github.com> Co-authored-by: XYE <92607131+Itok2000u@users.noreply.github.com> Co-authored-by: Xiao Ye <xiaoye2@illinois.edu> Co-authored-by: HaoyuQin <79465534+coder-chin@users.noreply.github.com> Co-authored-by: wky <64853922+wangkuangyi@users.noreply.github.com> Co-authored-by: bajiaoyu517 <59548007+bajiaoyu517@users.noreply.github.com> Co-authored-by: luoling-LC <105470086+luoling-LC@users.noreply.github.com> Co-authored-by: jnbai <897086360@qq.com> Co-authored-by: JT.Han <59948448+JThh@users.noreply.github.com> Co-authored-by: Jiatong <jiatong.han@u.nus.edu> Co-authored-by: xyupeng <99191637+xyupeng@users.noreply.github.com> Co-authored-by: Sze-qq <68757353+Sze-qq@users.noreply.github.com> Co-authored-by: Cautiousss <48676630+Cautiousss@users.noreply.github.com> Co-authored-by: 何晓昕 <cautious@hexiaoxins-MacBook-Pro.local> Co-authored-by: Luxios22 <67457897+Luxios22@users.noreply.github.com> Co-authored-by: Wangbo Zhao(黑色枷锁) <56866854+wangbo-zhao@users.noreply.github.com> Co-authored-by: RichardoLuo <50363844+RichardoLuo@users.noreply.github.com> Co-authored-by: RichardoLuo <14049555596@qq.com> Co-authored-by: doubleHU <98150031+huxin711@users.noreply.github.com> Co-authored-by: runluo <68489000+run-qiao@users.noreply.github.com> Co-authored-by: MaxT <854721132@qq.com> Co-authored-by: superhao1995 <804673818@qq.com> Co-authored-by: ziyu huang <huang0ziyu@gmail.com> Co-authored-by: “Arsmart123 <202476410arsmart@gmail.com> Co-authored-by: Yuer867 <62204893+Yuer867@users.noreply.github.com> Co-authored-by: lucasliunju <lucasliunju@gmail.com> Co-authored-by: LuGY <74758262+Gy-Lu@users.noreply.github.com> Co-authored-by: ExtremeViscent <zhangyiqi55732@sina.com> Co-authored-by: Xu Kai <xukai16@foxmail.com> Co-authored-by: Zirui Zhu <zhuzr21@gmail.com> Co-authored-by: Ofey Chan <ofey206@gmail.com> Co-authored-by: DouJS <dujiangsu@163.com> Co-authored-by: Jie Zhu <chore.08-protist@icloud.com> Co-authored-by: shenggan <csg19971016@gmail.com> Co-authored-by: Kai Wang (Victor Kai) <37533040+kaiwang960112@users.noreply.github.com> Co-authored-by: puck_WCR <46049915+WANG-CR@users.noreply.github.com> Co-authored-by: Ziheng Qin <37519855+henryqin1997@users.noreply.github.com>	2022-05-17 09:54:49 +08:00
ver217	c2fdc6a011	[tensor] derive compute pattern from dist spec (#971 ) * derive compute pattern from dist spec * polish code	2022-05-16 14:58:08 +08:00
Ziyue Jiang	797a9dc5a9	add DistSpec for loss and test_model (#947 )	2022-05-13 20:29:50 +08:00
ver217	67c33f57eb	[tensor] design DistSpec and DistSpecManager for ColoTensor (#934 ) * add dist spec * update linear op * polish code * polish code * update embedding op * polish unit tests * polish unit tests * polish comments * polish code * add test_dist_spec_mgr * polish code * refactor folder structure * polish unit tests * add get_process_group() for TensorSpec * polish code	2022-05-13 15:13:52 +08:00
Ziyue Jiang	d73c2b1d79	[Tensor] fix init context (#931 ) * change torch.Parameter to ColoParameter * fix post assignment for init context * polish * polish	2022-05-11 15:48:12 +08:00
Ziyue Jiang	dfc88b85ea	[Tensor] simplify named param (#928 ) * simplify ColoModulize * simplify ColoModulize * polish * polish	2022-05-11 10:54:19 +08:00
YuliangLiu0306	32a45cd7ef	[pipelinable]use pipelinable to support GPT model. (#903 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [pipelinable]use pipelinable to support GPT model. * fix a bug caused by ShardedModel * polish * fix front func list	2022-05-11 09:23:58 +08:00
ver217	4ca732349e	[tensor] colo tensor overrides mul (#927 ) * colo tensor overrides mul * polish code	2022-05-10 16:04:08 +08:00
ver217	45b9124df4	[tensor] hijack addmm for colo tensor (#923 ) * hijack addmm for colo tensor * fix bugs * polish unit test * polish comments	2022-05-09 18:55:49 +08:00
Ziyue Jiang	c195d2814c	[Tensor] add from_pretrained support and bert pretrained test (#921 ) * add from_pretrained support and test * polish * polish * polish * polish	2022-05-09 16:11:47 +08:00
Jiarui Fang	845856ea29	[Graph] building computing graph with ColoTensor, Linear only (#917 )	2022-05-07 17:10:37 +08:00
Ziyue Jiang	75d221918a	[Tensor] add 1d vocab loss (#918 ) * add 1d vocab loss * polish	2022-05-07 15:49:14 +08:00
Jiarui Fang	ab95ec9aea	[Tensor] init ColoParameter (#914 )	2022-05-06 12:57:14 +08:00
Ziyue Jiang	f593a5637e	[Tensor] add embedding tp1d row (#904 )	2022-04-29 14:10:05 +08:00
Ziyue Jiang	2c0d19d755	[Tensor] add ColoTensor TP1Dcol Embedding (#899 )	2022-04-28 17:45:06 +08:00
Jiarui Fang	d16671da75	[Tensor] initialize the ColoOptimizer (#898 ) * [Tensor] activation is an attr of ColoTensor * [Tensor] add optimizer * only detach parameters in context * polish code	2022-04-28 15:23:40 +08:00
Jiarui Fang	676f191532	[Tensor] activation is an attr of ColoTensor (#897 )	2022-04-28 14:43:22 +08:00
Ziyue Jiang	cb182da7c5	[tensor] refine linear and add gather for layernorm (#893 ) * refine linear and add function to ColoTensor * add gather for layernorm * polish * polish	2022-04-28 10:55:40 +08:00
Jiarui Fang	26c49639d8	[Tensor] overriding parameters() for Module using ColoTensor (#889 )	2022-04-27 15:28:59 +08:00
Ziyue Jiang	1d0aba4153	[tensor] add ColoTensor 1Dcol (#888 )	2022-04-27 14:13:55 +08:00
Jiarui Fang	72cdc06875	[Tensor] make ColoTensor more robust for getattr (#886 ) * [Tensor] make ColoTensor more robust for getattr * polish * polish	2022-04-27 10:57:49 +08:00
Ziyue Jiang	9bc5a77c31	[tensor] wrap function in the torch_tensor to ColoTensor (#881 )	2022-04-26 20:13:56 +08:00
ver217	4df6471f5d	fix import error (#880 )	2022-04-26 19:28:40 +08:00
Jiarui Fang	7f76517a85	[Tensor] make a simple net works with 1D row TP (#879 )	2022-04-26 18:11:47 +08:00
ver217	c4d903e64a	[gemini] accelerate adjust_layout() (#878 ) * add lru cache * polish code * update unit test * fix sharded optim	2022-04-26 18:08:31 +08:00
Jiarui Fang	909211453b	[Tensor] Add some attributes to ColoTensor (#877 ) * [Tensor] add some function to ColoTensor * torch.allclose * rm torch.add	2022-04-26 15:10:47 +08:00
HELSON	425b4a96b8	[gemini] polish stateful_tensor_mgr (#876 )	2022-04-26 15:05:03 +08:00
Jiarui Fang	e43f83aa5c	[Tensor] get named parameters for model using ColoTensors (#874 )	2022-04-26 14:08:01 +08:00
Jiarui Fang	96211c2cc8	[tensor] customized op returns ColoTensor (#875 ) * [tensor] customized op returns ColoTensor * polish * polish code	2022-04-26 13:23:59 +08:00
Ziyue Jiang	26d4ab8b03	[Tensor] Add function to spec and update linear 1Drow and unit tests (#869 )	2022-04-26 10:15:26 +08:00
Frank Lee	11f54c7b6b	[doc] improved docstring and assertion messages for the engine module (#871 )	2022-04-26 10:00:18 +08:00
Frank Lee	1c34382678	[doc] improved assertion messages in trainer (#873 )	2022-04-26 10:00:12 +08:00
Frank Lee	7a64fae33a	[doc] improved error messages in initialize (#872 )	2022-04-26 10:00:03 +08:00
Jiarui Fang	1190b2c4a4	[tensor] add cross_entrophy_loss (#868 )	2022-04-25 16:01:52 +08:00
HELSON	3107817172	[gemini] add stateful tensor container (#867 )	2022-04-25 14:58:16 +08:00
Jiarui Fang	d01d3b8cb0	colo init context add device attr. (#866 )	2022-04-25 14:24:26 +08:00
Frank Lee	2238758c2e	[usability] improved error messages in the context module (#856 )	2022-04-25 13:42:31 +08:00
Frank Lee	9fdebadd69	[doc] improved docstring in the amp module (#857 )	2022-04-25 13:42:17 +08:00
Frank Lee	b862d89d00	[doc] improved docstring in the logging module (#861 )	2022-04-25 13:42:00 +08:00
Frank Lee	8004c8e938	[doc] improved docstring in the communication module (#863 )	2022-04-25 13:41:43 +08:00
Jiarui Fang	8af5f7423d	[tensor] an initial idea of tensor spec (#865 ) * an initial idea of tensor spec * polish * polish	2022-04-25 13:33:52 +08:00
Jiarui Fang	126ba573a8	[Tensor] add layer norm Op (#852 )	2022-04-25 11:49:20 +08:00
Frank Lee	a82da26f7e	[cli] refactored micro-benchmarking cli and added more metrics (#858 )	2022-04-25 11:48:07 +08:00
Frank Lee	ee222dfbf3	[usability] added assertion message in registry (#864 )	2022-04-25 11:45:15 +08:00
HELSON	f0e654558f	[gemini] polish code (#855 )	2022-04-25 10:40:14 +08:00
Jiarui Fang	29159d9b5b	hotfix tensor unittest bugs (#862 )	2022-04-25 10:06:53 +08:00
YuliangLiu0306	c6930d8ddf	[pipelinable]use ColoTensor to replace dummy tensor. (#853 )	2022-04-24 18:31:22 +08:00
Ziyue Jiang	bcc8655021	[Tensor ] Add 1Drow weight reshard by spec (#854 )	2022-04-24 18:30:20 +08:00
ver217	d7e0303d1e	[zero] use GeminiMemoryManager when sampling model data (#850 )	2022-04-24 17:17:22 +08:00
ver217	232142f402	[utils] refactor profiler (#837 ) * add model data profiler * add a subclass of torch.profiler.profile * refactor folder structure * remove redundant codes * polish code * use GeminiMemoryManager * fix import path * fix stm profiler ext * polish comments * remove useless file	2022-04-24 17:03:59 +08:00
Jiarui Fang	62f059251b	[Tensor] init a tp network training unittest (#849 )	2022-04-24 16:43:44 +08:00
ver217	0dea140760	[hotfix] add deconstructor for stateful tensor (#848 ) * add deconstructor for stateful tensor * fix colo init context	2022-04-24 15:03:04 +08:00
ver217	0f7ed8c192	fix _post_init_method of zero init ctx (#847 )	2022-04-24 14:16:50 +08:00
Ziyue Jiang	2a0a427e04	[tensor]add assert for colo_tensor 1Drow (#846 )	2022-04-24 14:12:45 +08:00
Ziyue Jiang	05023ecfee	[Tensor] TP Linear 1D row (#843 )	2022-04-24 13:43:12 +08:00
Frank Lee	cf6d1c9284	[CLI] refactored the launch CLI and fixed bugs in multi-node launching (#844 ) * [cli] fixed multi-node job launching * [cli] fixed a bug in version comparison * [cli] support launching with env var * [cli] fixed multi-node job launching * [cli] fixed a bug in version comparison * [cli] support launching with env var * added docstring * [cli] added extra launch arguments * [cli] added default launch rdzv args * [cli] fixed version comparison * [cli] added docstring examples and requirement * polish docstring * polish code * polish code	2022-04-24 13:26:26 +08:00

788 Commits (3abf98a6337ae39f11b3c259a0af8d40477fe7f7)