Geng Zhang
0e06f62160
[NFC] polish colossalai/nn/layer/parallel_sequence/_operation.py code style ( #1266 )
2022-07-13 12:08:21 +08:00
binmakeswell
c95e18cdb9
[NFC] polish colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.h code style ( #1270 )
2022-07-13 12:08:21 +08:00
xyupeng
94bfd35184
[NFC] polish colossalai/builder/builder.py code style ( #1265 )
2022-07-13 12:08:21 +08:00
DouJS
db13f96333
[NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh code style ( #1264 )
2022-07-13 12:08:21 +08:00
shenggan
5d7366b144
[NFC] polish colossalai/kernel/cuda_native/csrc/scaled_masked_softmax.h code style ( #1263 )
2022-07-13 12:08:21 +08:00
Zangwei Zheng
197a2c89e2
[NFC] polish colossalai/communication/collective.py ( #1262 )
2022-07-13 12:08:21 +08:00
ziyu huang
f1cafcc73a
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/dropout_kernels.cu code style ( #1261 )
...
Co-authored-by: “Arsmart123 <202476410arsmart@gmail.com>
2022-07-13 12:08:21 +08:00
Sze-qq
f8b9aaef47
[NFC] polish colossalai/kernel/cuda_native/csrc/type_shim.h code style ( #1260 )
2022-07-13 12:08:21 +08:00
superhao1995
f660152c73
[NFC] polish colossalai/nn/layer/parallel_3d/_operation.py code style ( #1258 )
...
Co-authored-by: Research <research@soccf-snr3-017.comp.nus.edu.sg>
2022-07-13 12:08:21 +08:00
Thunderbeee
9738fb0f78
[NFC] polish colossalai/nn/lr_scheduler/__init__.py ( #1255 )
...
code style
2022-07-13 12:08:21 +08:00
Kai Wang (Victor Kai)
50f2ad213f
[NFC] polish colossalai/engine/ophooks/utils.py code style ( #1256 )
2022-07-13 12:08:21 +08:00
Ofey Chan
2dd4d556fb
[NFC] polish colossalai/nn/init.py code style ( #1292 )
2022-07-13 10:51:55 +08:00
Jiarui Fang
556b9b7e1a
[hotfix] Dist Mgr gather torch version ( #1284 )
...
* make it faster
* [hotfix] torchvision fx tests
* [hotfix] rename duplicated named test_gpt.py
* [hotfix] dist mgr torch version
2022-07-13 00:18:56 +08:00
HELSON
abba4d84e1
[hotfix] fix bert model test in unitests ( #1272 )
2022-07-12 23:26:45 +08:00
ver217
7aadcbd070
hotfix colotensor _scan_for_pg_from_args ( #1276 )
2022-07-12 20:46:31 +08:00
oahzxl
0cf8e8e91c
[NFC] polish <colossalai/nn/lr_scheduler/poly.py> code style ( #1267 )
2022-07-12 18:18:14 +08:00
Jiarui Fang
c92f84fcdb
[tensor] distributed checkpointing for parameters ( #1240 )
2022-07-12 15:51:06 +08:00
Frank Lee
fb35460595
[fx] added ndim property to proxy ( #1253 )
2022-07-12 15:27:13 +08:00
Frank Lee
4a09fc0947
[fx] fixed tracing with apex-based T5 model ( #1252 )
...
* [fx] fixed tracing with apex-based T5 model
* polish code
* polish code
2022-07-12 15:19:25 +08:00
Frank Lee
7531c6271f
[fx] refactored the file structure of patched function and module ( #1238 )
...
* [fx] refactored the file structure of patched function and module
* polish code
2022-07-12 15:01:58 +08:00
YuliangLiu0306
17ed33350b
[hotfix] fix an assertion bug in base schedule. ( #1250 )
2022-07-12 14:20:02 +08:00
YuliangLiu0306
97d713855a
[fx] methods to get fx graph property. ( #1246 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* manipulation
* [fx]add graph manipulation methods.
* [fx]methods to get fx graph property.
* add unit test
* add docstring to explain top node and leaf node in this context
2022-07-12 14:10:37 +08:00
YuliangLiu0306
30b4fc0eb0
[fx]add split module pass and unit test from pipeline passes ( #1242 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [fx]add split module pass and unit test from pipeline passes
* fix MNASNet bug
* polish
2022-07-12 13:45:01 +08:00
Jiarui Fang
1aad903c15
[tensor] redistribute among different process groups ( #1247 )
...
* make it faster
* [tensor] rename convert_to_dist -> redistribute
* [tensor] ShardSpec and ReplicaSpec
* [tensor] redistribute among diff pgs
* polish code
2022-07-12 10:24:05 +08:00
Jiarui Fang
9bcd2fd4af
[tensor] a shorter shard and replicate spec ( #1245 )
2022-07-11 15:51:48 +08:00
Jiarui Fang
2699dfbbfd
[rename] convert_to_dist -> redistribute ( #1243 )
2022-07-11 13:05:44 +08:00
HELSON
f6add9b720
[tensor] redirect .data.__get__ to a tensor instance ( #1239 )
2022-07-11 11:41:29 +08:00
Jiarui Fang
20da6e48c8
[checkpoint] save sharded optimizer states ( #1237 )
2022-07-08 16:33:13 +08:00
Jiarui Fang
4a76084dc9
[tensor] add zero_like colo op, important for Optimizer ( #1236 )
2022-07-08 14:55:27 +08:00
Jiarui Fang
3b500984b1
[tensor] fix some unittests ( #1234 )
2022-07-08 14:18:30 +08:00
ver217
a45ddf2d5f
[hotfix] fix sharded optim step and clip_grad_norm ( #1226 )
2022-07-08 13:34:48 +08:00
HELSON
f071b500b6
[polish] polish __repr__ for ColoTensor, DistSpec, ProcessGroup ( #1235 )
2022-07-08 13:25:57 +08:00
HELSON
0453776def
[tensor] fix an assertion in colo_tensor cross_entropy ( #1232 )
2022-07-08 11:18:00 +08:00
Jiarui Fang
0e199d71e8
[hotfix] fx get comm size bugs ( #1233 )
...
* init a checkpoint dir
* [checkpoint]support resume for cosinewarmuplr
* [checkpoint]add unit test
* fix some bugs but still not OK
* fix bugs
* make it faster
* [checkpoint]support generalized scheduler
* polish
* [tensor] torch function return colotensor
* polish
* fix bugs
* remove debug info
* polish
* polish
* [tensor] test_model pass unittests
* polish
* [hotfix] fx get comm size bug
Co-authored-by: ZhaoYi1222 <zhaoyi9499@gmail.com>
2022-07-08 10:54:41 +08:00
HELSON
42ab36b762
[tensor] add unitest for colo_tensor 1DTP cross_entropy ( #1230 )
2022-07-07 19:17:23 +08:00
Yi Zhao
04537bf83e
[checkpoint]support generalized scheduler ( #1222 )
2022-07-07 18:16:38 +08:00
Jiarui Fang
a98319f023
[tensor] torch function return colotensor ( #1229 )
2022-07-07 18:09:18 +08:00
YuliangLiu0306
2b7dca44b5
[fx]get communication size between partitions ( #1224 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [fx]get communication size between partitions.
* polish
2022-07-07 16:22:00 +08:00
Frank Lee
84f2298a96
[fx] added patches for tracing swin transformer ( #1228 )
2022-07-07 15:20:13 +08:00
Frank Lee
b6cb5a47ad
[fx] added timm model tracing testing ( #1221 )
2022-07-07 14:02:17 +08:00
HELSON
280a81243d
[tensor] improve robustness of class 'ProcessGroup' ( #1223 )
2022-07-07 13:55:24 +08:00
Jiarui Fang
15d988f954
[tensor] sharded global process group ( #1219 )
2022-07-07 13:38:48 +08:00
Jiarui Fang
db1bef9032
[hotfix] fx shard 1d pass bug fixing ( #1220 )
2022-07-07 13:37:31 +08:00
Frank Lee
11973d892d
[fx] added torchvision model tracing testing ( #1216 )
...
* [fx] added torchvision model tracing testing
* remove unused imports
2022-07-06 21:37:56 +08:00
Jiarui Fang
52736205d9
[checkpoint] make unitest faster ( #1217 )
2022-07-06 17:39:46 +08:00
Jiarui Fang
f38006ea83
[checkpoint] checkpoint for ColoTensor Model ( #1196 )
2022-07-06 17:22:03 +08:00
XYE
291e22aac6
[fx] temporarily used ( #1215 )
2022-07-06 17:19:26 +08:00
Jiarui Fang
ae7d3f4927
[refactor] move process group from _DistSpec to ColoTensor. ( #1203 )
2022-07-06 16:15:16 +08:00
Frank Lee
5da87ce35d
[fx] added testing for all albert variants ( #1211 )
2022-07-06 15:11:08 +08:00
Frank Lee
2d13a45a3b
[fx] added testing for all gpt variants ( #1210 )
...
* [fx] added testing for all gpt variants
* polish code
* polish code
2022-07-06 14:03:13 +08:00
YuliangLiu0306
189946c5c4
[fx]add uniform policy ( #1208 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [fx]add uniform policy
2022-07-06 13:48:11 +08:00
Frank Lee
426a279ce7
[fx] added testing for all bert variants ( #1207 )
...
* [fx] added testing for all bert variants
* polish code
2022-07-06 10:50:49 +08:00
Jiarui Fang
b5f25eb32a
[Tensor] add cpu group to ddp ( #1200 )
2022-07-05 14:58:28 +08:00
Frank Lee
f7878f465c
[fx] supported model tracing for huggingface bert ( #1201 )
...
* [fx] supported model tracing for huggingface bert
* polish test
2022-07-05 13:19:57 +08:00
Jiarui Fang
060b917daf
[refactor] remove gpc dependency in colotensor's _ops ( #1189 )
2022-07-04 18:54:37 +08:00
Frank Lee
abf6a262dc
[fx] added module patch for pooling layers ( #1197 )
2022-07-04 15:21:26 +08:00
YuliangLiu0306
63d2a93878
[context]support arbitrary module materialization. ( #1193 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [context]support arbitrary module materialization.
* [test]add numerical check for lazy init context.
2022-07-04 10:12:02 +08:00
Jiarui Fang
a444633d13
warmup ratio configuration ( #1192 )
2022-06-30 15:23:50 +08:00
ver217
dba7e0cfb4
make AutoPlacementPolicy configurable ( #1191 )
2022-06-30 15:18:30 +08:00
YuliangLiu0306
2053e138a2
[context]use meta tensor to init model lazily. ( #1187 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [context]use meta tensor to init model lazily.
* polish
* make module with device kwargs bypass the normal init.
* change unit test to adapt updated context.
2022-06-29 21:02:30 +08:00
Frank Lee
2c8c05675d
[fx] patched conv and normalization ( #1188 )
2022-06-29 18:58:38 +08:00
Frank Lee
6d86f1bc91
[fx] supported data-dependent control flow in model tracing ( #1185 )
...
* [fx] supported data-dependent control flow in model tracing
* polish code
2022-06-29 15:05:25 +08:00
Jiarui Fang
c463f8adf9
[tensor] remove gpc in tensor tests ( #1186 )
2022-06-29 14:08:40 +08:00
Jiarui Fang
372f791444
[refactor] move chunk and chunkmgr to directory gemini ( #1182 )
2022-06-29 13:31:02 +08:00
ver217
6b2f2ab9bb
[ddp] ColoDDP uses bucket all-reduce ( #1177 )
...
* add reducer
* update colo ddp with reducer
* polish unit test
* polish unit test
2022-06-29 10:34:13 +08:00
Jiarui Fang
7487215b95
[ColoTensor] add independent process group ( #1179 )
2022-06-29 10:03:09 +08:00
YuliangLiu0306
26ba87272d
[hotfix]fixed p2p process send stuck ( #1181 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [hotfix]fixed p2p process send stuck
2022-06-28 14:41:11 +08:00
Jiarui Fang
1b657f9ce1
[tensor] revert local view back ( #1178 )
2022-06-27 18:38:34 +08:00
Jiarui Fang
0dd4e2bbfb
[Tensor] rename some APIs in TensorSpec and Polish view unittest ( #1176 )
2022-06-27 15:56:11 +08:00
Ziyue Jiang
dd0420909f
[Tensor] rename parallel_action ( #1174 )
...
* rename parallel_action
* polish
2022-06-27 10:04:45 +08:00
YuliangLiu0306
e27645376d
[hotfix]different overflow status lead to communication stuck. ( #1175 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [hotfix]fix some bugs caused by refactored schedule.
* [hotfix]different overflow status lead to communication stuck.
2022-06-27 09:53:57 +08:00
Jiarui Fang
aa7bef73d4
[Tensor] distributed view supports inter-process hybrid parallel ( #1169 )
2022-06-27 09:45:26 +08:00
ver217
9e1daa63d2
[zero] sharded optim supports loading local state dict ( #1170 )
...
* sharded optim supports loading local state dict
* polish code
* add unit test
2022-06-24 18:05:16 +08:00
ver217
561e90493f
[zero] zero optim supports loading local state dict ( #1171 )
...
* zero optim supports loading local state dict
* polish code
* add unit test
2022-06-24 17:25:57 +08:00
Jiarui Fang
4b9bba8116
[ColoTensor] rename APIs and add output_replicate to ComputeSpec ( #1168 )
2022-06-24 13:08:54 +08:00
Jiarui Fang
f4ef224358
[Tensor] remove ParallelAction, use ComputeSpec instead ( #1166 )
2022-06-23 17:34:59 +08:00
Jiarui Fang
177c374401
remove gather out in parallel action ( #1163 )
2022-06-23 16:35:05 +08:00
ver217
634eecb98e
mark sanity_check of dist_spec_mgr as staticmethod ( #1161 )
2022-06-23 11:35:25 +08:00
Ziyue Jiang
955ac912de
remove log ( #1160 )
2022-06-23 10:32:42 +08:00
ver217
4e67b2a890
fix chunk move device ( #1158 )
2022-06-22 18:07:10 +08:00
Jiarui Fang
07f9c781f9
[graph] improve the graph building. ( #1157 )
2022-06-22 16:47:20 +08:00
ver217
22717a856f
[tensor] add embedding bag op ( #1156 )
2022-06-22 15:54:03 +08:00
ver217
ae86151968
[tensor] add more element-wise ops ( #1155 )
...
* add more element-wise ops
* update test_op
* polish unit test
2022-06-22 15:16:47 +08:00
ver217
54aabb8da4
[gemini] refactor gemini mgr ( #1151 )
...
* refactor gemini mgr
* update __init__
2022-06-22 11:54:36 +08:00
Frank Lee
f8eec98ff5
[tensor] fixed non-serializable colo parameter during model checkpointing ( #1153 )
2022-06-22 11:43:38 +08:00
ver217
ffa025e120
[tensor] dist spec s2s uses all-to-all ( #1136 )
...
* dist spec s2s uses all-to-all
* update unit test
* add sanity check
* polish unitest test with titans
* add sanity check for DistMgr
* add sanity check
Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
2022-06-22 11:32:38 +08:00
YuliangLiu0306
f1f51990b9
[hotfix]fix some bugs caused by refactored schedule. ( #1148 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [hotfix]fix some bugs caused by refactored schedule.
2022-06-21 22:46:30 +08:00
Jiarui Fang
8cdce0399c
[ColoTensor] improves init functions. ( #1150 )
2022-06-21 18:28:38 +08:00
ver217
8106d7b8c7
[ddp] refactor ColoDDP and ZeroDDP ( #1146 )
...
* ColoDDP supports overwriting default process group
* rename ColoDDPV2 to ZeroDDP
* add docstr for ZeroDDP
* polish docstr
2022-06-21 16:35:23 +08:00
Frank Lee
0e4e62d30d
[tensor] added __repr__ to spec ( #1147 )
2022-06-21 15:38:05 +08:00
YuliangLiu0306
70dd88e2ee
[pipeline]add customized policy ( #1139 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [pipeline]add customized policy
2022-06-21 15:23:41 +08:00
YuliangLiu0306
18091581c0
[pipeline]support more flexible pipeline ( #1138 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [pipeline]support more flexible pipeline
2022-06-21 14:40:50 +08:00
ver217
ccf3c58c89
embedding op use gather_out ( #1143 )
2022-06-21 13:21:20 +08:00
ver217
6690a61b4d
[hotfix] prevent nested ZeRO ( #1140 )
2022-06-21 11:33:53 +08:00
Frank Lee
15aab1476e
[zero] avoid zero hook spam by changing log to debug level ( #1137 )
2022-06-21 10:44:01 +08:00
Frank Lee
73ad05fc8c
[zero] added error message to handle on-the-fly import of torch Module class ( #1135 )
...
* [zero] added error message to handle on-the-fly import of torch Module class
* polish code
2022-06-20 11:24:27 +08:00
ver217
e4f555f29a
[optim] refactor fused sgd ( #1134 )
2022-06-20 11:19:38 +08:00
ver217
d26902645e
[ddp] add save/load state dict for ColoDDP ( #1127 )
...
* add save/load state dict for ColoDDP
* add unit test
* refactor unit test folder
* polish unit test
* rename unit test
2022-06-20 10:51:47 +08:00
YuliangLiu0306
946dbd629d
[hotfix]fix bugs caused by refactored pipeline ( #1133 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [hotfix]fix bugs caused by refactored pipeline
2022-06-17 17:54:15 +08:00
ver217
789cad301b
[hotfix] fix param op hook ( #1131 )
...
* fix param op hook
* update zero tp test
* fix bugs
2022-06-17 16:12:05 +08:00
ver217
a1a7899cae
[hotfix] fix zero init ctx numel ( #1128 )
2022-06-16 17:17:27 +08:00
ver217
f0a954f16d
[ddp] add set_params_to_ignore for ColoDDP ( #1122 )
...
* add set_params_to_ignore for ColoDDP
* polish code
* fix zero hook v2
* add unit test
* polish docstr
2022-06-16 12:54:46 +08:00
YuliangLiu0306
3175bcb4d8
[pipeline]support List of Dict data ( #1125 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [pipeline]support List of Dict data
* polish
2022-06-16 11:19:48 +08:00
Frank Lee
91a5999825
[ddp] supported customized torch ddp configuration ( #1123 )
2022-06-15 18:11:53 +08:00
YuliangLiu0306
fcf55777dd
[fx]add autoparallel passes ( #1121 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* feature/add autoparallel passes
2022-06-15 16:36:46 +08:00
ver217
e127b4375b
cast colo ddp v2 inputs/outputs ( #1120 )
2022-06-15 15:57:04 +08:00
Frank Lee
16302a5359
[fx] added unit test for coloproxy ( #1119 )
...
* [fx] added unit test for coloproxy
* polish code
* polish code
2022-06-15 15:27:51 +08:00
ver217
7d14b473f0
[gemini] gemini mgr supports "cpu" placement policy ( #1118 )
...
* update gemini mgr
* update chunk
* add docstr
* polish placement policy
* update test chunk
* update test zero
* polish unit test
* remove useless unit test
2022-06-15 15:05:19 +08:00
ver217
f99f56dff4
fix colo parameter torch function ( #1117 )
2022-06-15 14:23:27 +08:00
Frank Lee
e1620ddac2
[fx] added coloproxy ( #1115 )
2022-06-15 10:47:57 +08:00
Frank Lee
6f82ac9bcb
[pipeline] supported more flexible dataflow control for pipeline parallel training ( #1108 )
...
* [pipeline] supported more flexible dataflow control for pipeline parallel training
* polish code
* polish code
* polish code
2022-06-15 10:41:28 +08:00
ver217
895c1c5ee7
[tensor] refactor param op hook ( #1097 )
...
* refactor param op hook
* add docstr
* fix bug
2022-06-13 16:11:53 +08:00
YuliangLiu0306
1e9f9c227f
[hotfix]change to fit latest p2p ( #1100 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [hotfix]change to fit latest p2p
* polish
* polish
2022-06-13 14:57:25 +08:00
Frank Lee
72bd7c696b
[amp] included dict for type casting of model output ( #1102 )
2022-06-13 14:18:04 +08:00
Frank Lee
7f2d2b2b5b
[engine] fixed empty op hook check ( #1096 )
...
* [engine] fixed empty op hook check
* polish code
2022-06-10 17:27:27 +08:00
Frank Lee
14e5b11d7f
[zero] fixed api consistency ( #1098 )
2022-06-10 16:59:59 +08:00
Frank Lee
cb18922c47
[doc] added documentation to chunk and chunk manager ( #1094 )
...
* [doc] added documentation to chunk and chunk manager
* polish code
* polish code
* polish code
2022-06-10 15:33:06 +08:00
ver217
1f894e033f
[gemini] zero supports gemini ( #1093 )
...
* add placement policy
* add gemini mgr
* update mem stats collector
* update zero
* update zero optim
* fix bugs
* zero optim monitor os
* polish unit test
* polish unit test
* add assert
2022-06-10 14:48:28 +08:00
Frank Lee
2b2dc1c86b
[pipeline] refactor the pipeline module ( #1087 )
...
* [pipeline] refactor the pipeline module
* polish code
2022-06-10 11:27:38 +08:00
Frank Lee
bad5d4c0a1
[context] support lazy init of module ( #1088 )
...
* [context] support lazy init of module
* polish code
2022-06-10 10:09:48 +08:00
ver217
be01db37c8
[tensor] refactor chunk mgr and impl MemStatsCollectorV2 ( #1077 )
...
* polish chunk manager
* polish unit test
* impl add_extern_static_tensor for chunk mgr
* add mem stats collector v2
* polish code
* polish unit test
* polish code
* polish get chunks
2022-06-09 20:56:34 +08:00
Frank Lee
50ec3a7e06
[test] skip tests when not enough GPUs are detected ( #1090 )
...
* [test] skip tests when not enough GPUs are detected
* polish code
* polish code
2022-06-09 17:19:13 +08:00
Ziyue Jiang
0653c63eaa
[Tensor] 1d row embedding ( #1075 )
...
* Add CPU 1d row embedding
* polish
2022-06-08 12:04:59 +08:00
junxu
d66ffb4df4
Remove duplication registry ( #1078 )
2022-06-08 07:47:24 +08:00
Jiarui Fang
bcab249565
fix issue #1080 ( #1071 )
2022-06-07 17:21:11 +08:00
ver217
1b17859328
[tensor] chunk manager monitor mem usage ( #1076 )
2022-06-07 15:00:00 +08:00
ver217
98cdbf49c6
[hotfix] fix chunk comm src rank ( #1072 )
2022-06-07 11:54:56 +08:00
Frank Lee
bfdc5ccb7b
[context] maintain the context object in with statement ( #1073 )
2022-06-07 10:48:45 +08:00
ver217
c5cd3b0f35
[zero] zero optim copy chunk rather than copy tensor ( #1070 )
2022-06-07 10:30:46 +08:00
Ziyue Jiang
4fc748f69b
[Tensor] fix optimizer for CPU parallel ( #1069 )
2022-06-06 17:36:11 +08:00
Jiarui Fang
49832b2344
[refactory] add nn.parallel module ( #1068 )
2022-06-06 15:34:41 +08:00
Ziyue Jiang
6754f1b77f
fix module utils bug ( #1066 )
2022-06-06 12:11:48 +08:00
Jiarui Fang
a00644079e
reorganize colotensor directory ( #1062 )
...
* reorganize colotensor directory
* polish code
2022-06-03 18:04:22 +08:00
Frank Lee
3d10be33bd
[cudnn] set False to cudnn benchmark by default ( #1063 )
2022-06-03 17:58:06 +08:00
Ziyue Jiang
df9dcbbff6
[Tensor] add hybrid device demo and fix bugs ( #1059 )
2022-06-03 12:09:49 +08:00
YuliangLiu0306
b167258b6a
[pipeline]refactor ppschedule to support tensor list ( #1050 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* refactor ppschedule to support tensor list
* polish
2022-06-02 13:48:59 +08:00
ver217
e3fde4ee6b
fix import error in sharded model v2 ( #1053 )
2022-06-02 13:48:22 +08:00
ver217
e1922ea4f6
[zero] add chunk size search for chunk manager ( #1052 )
2022-06-02 13:20:20 +08:00
アマデウス
2c42b230f3
updated collective ops api ( #1054 )
2022-06-02 12:52:27 +08:00
ver217
51b9a49655
[zero] add zero optimizer for ColoTensor ( #1046 )
...
* add zero optimizer
* torch ok
* unit test ok
* polish code
* fix bugs
* polish unit test
* polish zero optim
* polish colo ddp v2
* refactor folder structure
* add comment
* polish unit test
* polish zero optim
* polish unit test
2022-06-02 12:13:15 +08:00
ver217
7faef93326
fix dist spec mgr ( #1045 )
2022-05-31 12:14:39 +08:00
ver217
9492a561c3
[tensor] ColoTensor supports ZeRo ( #1015 )
...
* impl chunk manager
* impl param op hook
* add reduce_chunk
* add zero hook v2
* add zero dp
* fix TensorInfo
* impl load balancing when using zero without chunk
* fix zero hook
* polish chunk
* fix bugs
* ddp ok
* zero ok
* polish code
* fix bugs about load balancing
* polish code
* polish code
* add end-to-end test
* polish code
* polish code
* polish code
* fix typo
* add test_chunk
* fix bugs
* fix bugs
* polish code
2022-05-31 12:00:12 +08:00
Ziyue Jiang
7c530b9de2
[Tensor] add Parameter inheritance for ColoParameter ( #1041 )
...
* add Parameter inheritance for ColoParameter
* remove tricks
* remove tricks
* polish
* polish
2022-05-30 17:23:44 +08:00
ver217
7cfd6c827e
[zero] add load_state_dict for sharded model ( #894 )
...
* add load_state_dict for sharded model
* fix bug
* fix bug
* fix ckpt dtype and device
* support load state dict in zero init ctx
* fix bugs
2022-05-27 10:25:08 +08:00
Ziyue Jiang
6c5996a56e
[Tensor] add module check and bert test ( #1031 )
...
* add Embedding
* Add bert test
* polish
* add check module test
* polish
* polish
* polish
* polish
2022-05-26 18:15:42 +08:00
YuliangLiu0306
7106bd671d
[p2p]add object list send/recv ( #1024 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [p2p]add object list send recv
* refactor for code reusability
* polish
2022-05-26 14:28:46 +08:00
Frank Lee
e4685832f8
[engine] fixed bug in gradient accumulation dataloader to keep the last step ( #1030 )
2022-05-26 14:28:23 +08:00
Ziyue Jiang
32291dd73f
[Tensor] add module handler for linear ( #1021 )
...
* add module spec for linear
* polish
* polish
* polish
2022-05-26 11:50:44 +08:00
Ryan Russell
9b0c037027
fix typo in constants ( #1027 )
2022-05-26 08:45:08 +08:00
ver217
007ca0df92
fix colo init context ( #1026 )
2022-05-25 20:41:58 +08:00
YuliangLiu0306
d182b0bd47
[hotfix] fix some bugs caused by size mismatch. ( #1011 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [hotfix]fix some bugs caused by size mismatch.
* add warning logs
* polish
2022-05-23 14:02:28 +08:00
ver217
cefc29ff06
[tensor] impl ColoDDP for ColoTensor ( #1009 )
...
* impl ColoDDP for ColoTensor
* polish code
2022-05-21 13:52:04 +08:00
zhengzangw
ae7c338105
[NFC] polish colossalai/kernel/cuda_native/csrc/colossal_C_frontend.cpp code style
2022-05-20 23:57:38 +08:00
ver217
a3b66f6def
[tensor] refactor parallel action ( #1007 )
...
* refactor parallel action
* polish unit tests
2022-05-20 20:19:58 +08:00
ver217
ad536e308e
[tensor] refactor colo-tensor ( #992 )
...
* refactor colo-tensor and update linear op
* polish code
* polish code
* update ops and unit tests
* update unit tests
* polish code
* rename dist_spec module
* polish code
* polish code
* remove unneeded import
* fix pipelinable
2022-05-19 12:44:59 +08:00
Frank Lee
1467d83edf
[cli] remove unused imports ( #1001 )
2022-05-18 23:27:18 +08:00
Frank Lee
533d0c46d8
[kernel] fixed the include bug in dropout kernel ( #999 )
2022-05-18 21:43:18 +08:00
Jiarui Fang
802ac297cc
[Tensor] remove useless import in tensor dir ( #997 )
2022-05-18 14:54:51 +08:00
Ziheng Qin
571f12eff3
[NFC] polish colossalai/nn/layer/utils/common.py code style ( #983 )
2022-05-17 10:25:06 +08:00
puck_WCR
bda70b4b66
[NFC] polish colossalai/kernel/cuda_native/layer_norm.py code style ( #980 )
2022-05-17 10:25:06 +08:00
Kai Wang (Victor Kai)
c50c08dcbb
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/dropout_kernels.cu code style ( #979 )
2022-05-17 10:25:06 +08:00
binmakeswell
f28c021376
[NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu code style ( #978 )
2022-05-17 10:25:06 +08:00
shenggan
18542b47fc
[NFC] polish colossalai/nn/layer/parallel_2d/layers.py code style ( #976 )
2022-05-17 10:25:06 +08:00
Jie Zhu
b67eebd20f
[NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu code style ( #977 )
2022-05-17 10:25:06 +08:00
DouJS
52705ec5c5
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/normalize_kernels.cu code style ( #974 )
2022-05-17 10:25:06 +08:00
Ofey Chan
136946422b
[NFC] polish colossalai/kernel/cuda_native/csrc/layer_norm_cuda.cpp code style ( #973 )
2022-05-17 10:25:06 +08:00
Zirui Zhu
598cde4a0f
[NFC] polish colossalai/nn/layer/parallel_2p5d/layers.py code style ( #972 )
2022-05-17 10:25:06 +08:00
Xu Kai
632e94abde
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/dropout.h code style ( #970 )
2022-05-17 10:25:06 +08:00
ExtremeViscent
22d1df224d
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/feed_forward.h ( #968 )
...
code style
2022-05-17 10:25:06 +08:00
LuGY
fb5bc6cb28
[NFC] polish colossalai/nn/layer/parallel_3d/layers.py code style ( #966 )
2022-05-17 10:25:06 +08:00
lucasliunju
955463e542
[NFC] polish __init__.py code style ( #965 )
2022-05-17 10:25:06 +08:00
Yuer867
7106a399fc
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/softmax.h code style ( #964 )
2022-05-17 10:25:06 +08:00
ziyu huang
5bd80b7dd1
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/general_kernels.cu code style ( #963 )
...
Co-authored-by: “Arsmart123 <202476410arsmart@gmail.com>
2022-05-17 10:25:06 +08:00
superhao1995
48c4a180c7
[NFC] polish colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.cpp code style ( #959 )
2022-05-17 10:25:06 +08:00
MaxT
442a2975ab
[NFC] polish colossalai/kernel/cuda_native/csrc/multihead_attention_1d.h code style ( #962 )
2022-05-17 10:25:06 +08:00
runluo
89e2767a92
[NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu code style ( #958 )
2022-05-17 10:25:06 +08:00
doubleHU
1dc1b6fa00
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/cross_entropy_layer.h code style ( #957 )
2022-05-17 10:25:06 +08:00
RichardoLuo
0e922da874
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/context.h code style ( #956 )
...
Co-authored-by: RichardoLuo <14049555596@qq.com>
2022-05-17 10:25:06 +08:00
Wangbo Zhao(黑色枷锁)
8ca2a85682
[NFC] polish colossalai/kernel/cuda_native/scaled_softmax.py code style ( #955 )
2022-05-17 10:25:06 +08:00
Luxios22
f6970ef8b1
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/softmax_kernels.cu code style ( #954 )
2022-05-17 10:25:06 +08:00
Cautiousss
0b86a6345e
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/cross_entropy.cu code style ( #953 )
...
Co-authored-by: 何晓昕 <cautious@hexiaoxins-MacBook-Pro.local>
2022-05-17 10:25:06 +08:00
Sze-qq
d8d07b0e2b
[NFC] polish colossalai/kernel/cuda_native/csrc/multihead_attention_1d.cpp code style ( #952 )
2022-05-17 10:25:06 +08:00
xyupeng
fa43bb216d
[NFC] polish colossalai/builder/pipeline.py code style ( #951 )
2022-05-17 10:25:06 +08:00
JT.Han
c3e423c8be
[NFC] polish colossalai/kernel/cuda_native/csrc/scaled_masked_softmax_cuda.cu code style ( #949 )
...
Co-authored-by: Jiatong <jiatong.han@u.nus.edu>
2022-05-17 10:25:06 +08:00
luoling-LC
72c71b67ec
[NFC] polish colossalai/kernel/jit/bias_gelu.py code style ( #946 )
...
Co-authored-by: jnbai <897086360@qq.com>
2022-05-17 10:25:06 +08:00
bajiaoyu517
eb9a81d72a
[NFC] polish colossalai/kernel/cuda_native/csrc/cpu_adam.h code style ( #945 )
2022-05-17 10:25:06 +08:00
wky
8ffdc38376
[NFC] polish colossalai/kernel/cuda_native/csrc/moe_cuda.cpp code style ( #942 )
2022-05-17 10:25:06 +08:00
HaoyuQin
c0f373db5d
[NFC] polish pre-commit run --files colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax_cuda.cu code style ( #943 )
2022-05-17 10:25:06 +08:00
XYE
5bbefeb06a
[NFC] polish moe_cuda_kernel.cu code style ( #940 )
...
Co-authored-by: Xiao Ye <xiaoye2@illinois.edu>
2022-05-17 10:25:06 +08:00
Maruyama_Aya
7aa35eae6a
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/block_reduce.h code style ( #938 )
2022-05-17 10:25:06 +08:00
Geng Zhang
b6cc9313ef
[NFC] polish colossalai/kernel/cuda_native/csrc/cpu_adam.cpp code style ( #936 )
2022-05-17 10:25:06 +08:00
yuxuan-lou
44b6f8947b
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/cuda_util.h code style ( #939 )
2022-05-17 10:25:06 +08:00
BoxiangW
872aa413c2
[NFC] Polish colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu code style. ( #937 )
2022-05-17 10:25:06 +08:00
ver217
58580b50fe
Revert "[NFC] Hotfix/format ( #984 )" ( #986 )
...
This reverts commit 0772828fba.
2022-05-17 10:23:38 +08:00
binmakeswell
0772828fba
[NFC] Hotfix/format ( #984 )
...
* [NFC] Polish colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu code style. (#937 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/cuda_util.h code style (#939 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/cpu_adam.cpp code style (#936 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/block_reduce.h code style (#938 )
* [NFC] polish moe_cuda_kernel.cu code style (#940 )
Co-authored-by: Xiao Ye <xiaoye2@illinois.edu>
* [NFC] polish pre-commit run --files colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax_cuda.cu code style (#943 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/moe_cuda.cpp code style (#942 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/cpu_adam.h code style (#945 )
* [NFC] polish colossalai/kernel/jit/bias_gelu.py code style (#946 )
Co-authored-by: jnbai <897086360@qq.com>
* [NFC] polish colossalai/kernel/cuda_native/csrc/scaled_masked_softmax_cuda.cu code style (#949 )
Co-authored-by: Jiatong <jiatong.han@u.nus.edu>
* [NFC] polish colossalai/builder/pipeline.py code style (#951 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/multihead_attention_1d.cpp code style (#952 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/cross_entropy.cu code style (#953 )
Co-authored-by: 何晓昕 <cautious@hexiaoxins-MacBook-Pro.local>
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/softmax_kernels.cu code style (#954 )
* [NFC] polish colossalai/kernel/cuda_native/scaled_softmax.py code style (#955 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/context.h code style (#956 )
Co-authored-by: RichardoLuo <14049555596@qq.com>
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/cross_entropy_layer.h code style (#957 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu code style (#958 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/multihead_attention_1d.h code style (#962 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.cpp code style (#959 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/general_kernels.cu code style (#963 )
Co-authored-by: “Arsmart123 <202476410arsmart@gmail.com>
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/softmax.h code style (#964 )
* [NFC] polish __init__.py code style (#965 )
* [NFC] polish colossalai/nn/layer/parallel_3d/layers.py code style (#966 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/feed_forward.h (#968 )
code style
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/dropout.h code style (#970 )
* [NFC] polish colossalai/nn/layer/parallel_2p5d/layers.py code style (#972 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/layer_norm_cuda.cpp code style (#973 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/normalize_kernels.cu code style (#974 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu code style (#977 )
* [NFC] polish colossalai/nn/layer/parallel_2d/layers.py code style (#976 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu code style (#978 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/dropout_kernels.cu code style (#979 )
* [NFC] polish colossalai/kernel/cuda_native/layer_norm.py code style (#980 )
* [NFC] polish colossalai/nn/layer/utils/common.py code style (#983 )
Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>
Co-authored-by: yuxuan-lou <83441848+yuxuan-lou@users.noreply.github.com>
Co-authored-by: Geng Zhang <34452939+zxgx@users.noreply.github.com>
Co-authored-by: Maruyama_Aya <38985202+MaruyamaAya@users.noreply.github.com>
Co-authored-by: XYE <92607131+Itok2000u@users.noreply.github.com>
Co-authored-by: Xiao Ye <xiaoye2@illinois.edu>
Co-authored-by: HaoyuQin <79465534+coder-chin@users.noreply.github.com>
Co-authored-by: wky <64853922+wangkuangyi@users.noreply.github.com>
Co-authored-by: bajiaoyu517 <59548007+bajiaoyu517@users.noreply.github.com>
Co-authored-by: luoling-LC <105470086+luoling-LC@users.noreply.github.com>
Co-authored-by: jnbai <897086360@qq.com>
Co-authored-by: JT.Han <59948448+JThh@users.noreply.github.com>
Co-authored-by: Jiatong <jiatong.han@u.nus.edu>
Co-authored-by: xyupeng <99191637+xyupeng@users.noreply.github.com>
Co-authored-by: Sze-qq <68757353+Sze-qq@users.noreply.github.com>
Co-authored-by: Cautiousss <48676630+Cautiousss@users.noreply.github.com>
Co-authored-by: 何晓昕 <cautious@hexiaoxins-MacBook-Pro.local>
Co-authored-by: Luxios22 <67457897+Luxios22@users.noreply.github.com>
Co-authored-by: Wangbo Zhao(黑色枷锁) <56866854+wangbo-zhao@users.noreply.github.com>
Co-authored-by: RichardoLuo <50363844+RichardoLuo@users.noreply.github.com>
Co-authored-by: RichardoLuo <14049555596@qq.com>
Co-authored-by: doubleHU <98150031+huxin711@users.noreply.github.com>
Co-authored-by: runluo <68489000+run-qiao@users.noreply.github.com>
Co-authored-by: MaxT <854721132@qq.com>
Co-authored-by: superhao1995 <804673818@qq.com>
Co-authored-by: ziyu huang <huang0ziyu@gmail.com>
Co-authored-by: “Arsmart123 <202476410arsmart@gmail.com>
Co-authored-by: Yuer867 <62204893+Yuer867@users.noreply.github.com>
Co-authored-by: lucasliunju <lucasliunju@gmail.com>
Co-authored-by: LuGY <74758262+Gy-Lu@users.noreply.github.com>
Co-authored-by: ExtremeViscent <zhangyiqi55732@sina.com>
Co-authored-by: Xu Kai <xukai16@foxmail.com>
Co-authored-by: Zirui Zhu <zhuzr21@gmail.com>
Co-authored-by: Ofey Chan <ofey206@gmail.com>
Co-authored-by: DouJS <dujiangsu@163.com>
Co-authored-by: Jie Zhu <chore.08-protist@icloud.com>
Co-authored-by: shenggan <csg19971016@gmail.com>
Co-authored-by: Kai Wang (Victor Kai) <37533040+kaiwang960112@users.noreply.github.com>
Co-authored-by: puck_WCR <46049915+WANG-CR@users.noreply.github.com>
Co-authored-by: Ziheng Qin <37519855+henryqin1997@users.noreply.github.com>
2022-05-17 09:54:49 +08:00
ver217
c2fdc6a011
[tensor] derive compute pattern from dist spec ( #971 )
...
* derive compute pattern from dist spec
* polish code
2022-05-16 14:58:08 +08:00
Ziyue Jiang
797a9dc5a9
add DistSpec for loss and test_model ( #947 )
2022-05-13 20:29:50 +08:00
ver217
67c33f57eb
[tensor] design DistSpec and DistSpecManager for ColoTensor ( #934 )
...
* add dist spec
* update linear op
* polish code
* polish code
* update embedding op
* polish unit tests
* polish unit tests
* polish comments
* polish code
* add test_dist_spec_mgr
* polish code
* refactor folder structure
* polish unit tests
* add get_process_group() for TensorSpec
* polish code
2022-05-13 15:13:52 +08:00
Ziyue Jiang
d73c2b1d79
[Tensor] fix init context ( #931 )
...
* change torch.Parameter to ColoParameter
* fix post assignment for init context
* polish
* polish
2022-05-11 15:48:12 +08:00
Ziyue Jiang
dfc88b85ea
[Tensor] simplify named param ( #928 )
...
* simplify ColoModulize
* simplify ColoModulize
* polish
* polish
2022-05-11 10:54:19 +08:00
YuliangLiu0306
32a45cd7ef
[pipelinable]use pipelinable to support GPT model. ( #903 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [pipelinable]use pipelinable to support GPT model.
* fix a bug caused by ShardedModel
* polish
* fix front func list
2022-05-11 09:23:58 +08:00
ver217
4ca732349e
[tensor] colo tensor overrides mul ( #927 )
...
* colo tensor overrides mul
* polish code
2022-05-10 16:04:08 +08:00
ver217
45b9124df4
[tensor] hijack addmm for colo tensor ( #923 )
...
* hijack addmm for colo tensor
* fix bugs
* polish unit test
* polish comments
2022-05-09 18:55:49 +08:00
Ziyue Jiang
c195d2814c
[Tensor] add from_pretrained support and bert pretrained test ( #921 )
...
* add from_pretrained support and test
* polish
* polish
* polish
* polish
2022-05-09 16:11:47 +08:00
Jiarui Fang
845856ea29
[Graph] building computing graph with ColoTensor, Linear only ( #917 )
2022-05-07 17:10:37 +08:00
Ziyue Jiang
75d221918a
[Tensor] add 1d vocab loss ( #918 )
...
* add 1d vocab loss
* polish
2022-05-07 15:49:14 +08:00
Jiarui Fang
ab95ec9aea
[Tensor] init ColoParameter ( #914 )
2022-05-06 12:57:14 +08:00
Ziyue Jiang
f593a5637e
[Tensor] add embedding tp1d row ( #904 )
2022-04-29 14:10:05 +08:00
Ziyue Jiang
2c0d19d755
[Tensor] add ColoTensor TP1Dcol Embedding ( #899 )
2022-04-28 17:45:06 +08:00
Jiarui Fang
d16671da75
[Tensor] initialize the ColoOptimizer ( #898 )
...
* [Tensor] activation is an attr of ColoTensor
* [Tensor] add optimizer
* only detach parameters in context
* polish code
2022-04-28 15:23:40 +08:00
Jiarui Fang
676f191532
[Tensor] activation is an attr of ColoTensor ( #897 )
2022-04-28 14:43:22 +08:00
Ziyue Jiang
cb182da7c5
[tensor] refine linear and add gather for layernorm ( #893 )
...
* refine linear and add function to ColoTensor
* add gather for layernorm
* polish
* polish
2022-04-28 10:55:40 +08:00
Jiarui Fang
26c49639d8
[Tensor] overriding parameters() for Module using ColoTensor ( #889 )
2022-04-27 15:28:59 +08:00
Ziyue Jiang
1d0aba4153
[tensor] add ColoTensor 1Dcol ( #888 )
2022-04-27 14:13:55 +08:00
Jiarui Fang
72cdc06875
[Tensor] make ColoTensor more robust for getattr ( #886 )
...
* [Tensor] make ColoTensor more robust for getattr
* polish
* polish
2022-04-27 10:57:49 +08:00
Ziyue Jiang
9bc5a77c31
[tensor] wrap function in the torch_tensor to ColoTensor ( #881 )
2022-04-26 20:13:56 +08:00
ver217
4df6471f5d
fix import error ( #880 )
2022-04-26 19:28:40 +08:00
Jiarui Fang
7f76517a85
[Tensor] make a simple net work with 1D row TP ( #879 )
2022-04-26 18:11:47 +08:00
ver217
c4d903e64a
[gemini] accelerate adjust_layout() ( #878 )
...
* add lru cache
* polish code
* update unit test
* fix sharded optim
2022-04-26 18:08:31 +08:00
Jiarui Fang
909211453b
[Tensor] Add some attributes to ColoTensor ( #877 )
...
* [Tensor] add some function to ColoTensor
* torch.allclose
* rm torch.add
2022-04-26 15:10:47 +08:00
HELSON
425b4a96b8
[gemini] polish stateful_tensor_mgr ( #876 )
2022-04-26 15:05:03 +08:00
Jiarui Fang
e43f83aa5c
[Tensor] get named parameters for model using ColoTensors ( #874 )
2022-04-26 14:08:01 +08:00
Jiarui Fang
96211c2cc8
[tensor] customized op returns ColoTensor ( #875 )
...
* [tensor] customized op returns ColoTensor
* polish
* polish code
2022-04-26 13:23:59 +08:00
Ziyue Jiang
26d4ab8b03
[Tensor] Add function to spec and update linear 1Drow and unit tests ( #869 )
2022-04-26 10:15:26 +08:00
Frank Lee
11f54c7b6b
[doc] improved docstring and assertion messages for the engine module ( #871 )
2022-04-26 10:00:18 +08:00
Frank Lee
1c34382678
[doc] improved assertion messages in trainer ( #873 )
2022-04-26 10:00:12 +08:00
Frank Lee
7a64fae33a
[doc] improved error messages in initialize ( #872 )
2022-04-26 10:00:03 +08:00
Jiarui Fang
1190b2c4a4
[tensor] add cross_entrophy_loss ( #868 )
2022-04-25 16:01:52 +08:00
HELSON
3107817172
[gemini] add stateful tensor container ( #867 )
2022-04-25 14:58:16 +08:00
Jiarui Fang
d01d3b8cb0
colo init context add device attr. ( #866 )
2022-04-25 14:24:26 +08:00
Frank Lee
2238758c2e
[usability] improved error messages in the context module ( #856 )
2022-04-25 13:42:31 +08:00
Frank Lee
9fdebadd69
[doc] improved docstring in the amp module ( #857 )
2022-04-25 13:42:17 +08:00
Frank Lee
b862d89d00
[doc] improved docstring in the logging module ( #861 )
2022-04-25 13:42:00 +08:00
Frank Lee
8004c8e938
[doc] improved docstring in the communication module ( #863 )
2022-04-25 13:41:43 +08:00
Jiarui Fang
8af5f7423d
[tensor] an initial idea of tensor spec ( #865 )
...
* an initial idea of tensor spec
* polish
* polish
2022-04-25 13:33:52 +08:00
Jiarui Fang
126ba573a8
[Tensor] add layer norm Op ( #852 )
2022-04-25 11:49:20 +08:00
Frank Lee
a82da26f7e
[cli] refactored micro-benchmarking cli and added more metrics ( #858 )
2022-04-25 11:48:07 +08:00
Frank Lee
ee222dfbf3
[usability] added assertion message in registry ( #864 )
2022-04-25 11:45:15 +08:00
HELSON
f0e654558f
[gemini] polish code ( #855 )
2022-04-25 10:40:14 +08:00
Jiarui Fang
29159d9b5b
hotfix tensor unittest bugs ( #862 )
2022-04-25 10:06:53 +08:00
YuliangLiu0306
c6930d8ddf
[pipelinable]use ColoTensor to replace dummy tensor. ( #853 )
2022-04-24 18:31:22 +08:00
Ziyue Jiang
bcc8655021
[Tensor] Add 1Drow weight reshard by spec ( #854 )
2022-04-24 18:30:20 +08:00
ver217
d7e0303d1e
[zero] use GeminiMemoryManager when sampling model data ( #850 )
2022-04-24 17:17:22 +08:00
ver217
232142f402
[utils] refactor profiler ( #837 )
...
* add model data profiler
* add a subclass of torch.profiler.profile
* refactor folder structure
* remove redundant codes
* polish code
* use GeminiMemoryManager
* fix import path
* fix stm profiler ext
* polish comments
* remove useless file
2022-04-24 17:03:59 +08:00
Jiarui Fang
62f059251b
[Tensor] init a tp network training unittest ( #849 )
2022-04-24 16:43:44 +08:00
ver217
0dea140760
[hotfix] add deconstructor for stateful tensor ( #848 )
...
* add deconstructor for stateful tensor
* fix colo init context
2022-04-24 15:03:04 +08:00
ver217
0f7ed8c192
fix _post_init_method of zero init ctx ( #847 )
2022-04-24 14:16:50 +08:00
Ziyue Jiang
2a0a427e04
[tensor]add assert for colo_tensor 1Drow ( #846 )
2022-04-24 14:12:45 +08:00
Ziyue Jiang
05023ecfee
[Tensor] TP Linear 1D row ( #843 )
2022-04-24 13:43:12 +08:00
Frank Lee
cf6d1c9284
[CLI] refactored the launch CLI and fixed bugs in multi-node launching ( #844 )
...
* [cli] fixed multi-node job launching
* [cli] fixed a bug in version comparison
* [cli] support launching with env var
* [cli] fixed multi-node job launching
* [cli] fixed a bug in version comparison
* [cli] support launching with env var
* added docstring
* [cli] added extra launch arguments
* [cli] added default launch rdzv args
* [cli] fixed version comparison
* [cli] added docstring examples and requirement
* polish docstring
* polish code
* polish code
2022-04-24 13:26:26 +08:00