ver217
c415240db6
[nvme] CPUAdam and HybridAdam support NVMe offload ( #1360 )
* impl nvme optimizer
* update cpu adam
* add unit test
* update hybrid adam
* update docstr
* add TODOs
* update CI
* fix CI
* fix CI
* fix CI path
* fix CI path
* fix CI path
* fix install tensornvme
* fix CI
* fix CI path
* fix CI env variables
* test CI
* test CI
* fix CI
* fix nvme optim __del__
* fix adam __del__
* fix nvme optim
* fix CI env variables
* fix nvme optim import
* test CI
* test CI
* fix CI
2022-07-26 17:25:24 +08:00
HELSON
87775a0682
[colotensor] use cpu memory to store state_dict ( #1367 )
2022-07-26 14:13:38 +08:00
ver217
d068af81a3
[doc] update rst and docstring ( #1351 )
* update rst
* add zero docstr
* fix docstr
* remove fx.tracer.meta_patch
* fix docstr
* fix docstr
* update fx rst
* fix fx docstr
* remove useless rst
2022-07-21 15:54:53 +08:00
HELSON
7a8702c06d
[colotensor] add Tensor.view op and its unit test ( #1343 )
[colotensor] add megatron initialization for gpt2
2022-07-21 10:53:15 +08:00
ver217
0c51ff2c13
[hotfix] ZeroDDP use new process group ( #1333 )
* process group supports getting ranks in group
* chunk mgr receives a process group
* update unit test
* fix unit tests
2022-07-18 14:14:52 +08:00
HELSON
1b41686461
[hotfix] fix unit test test_module_spec ( #1321 )
2022-07-15 14:02:32 +08:00
Jiarui Fang
9e4c6449b0
[checkpoint] add ColoOptimizer checkpointing ( #1316 )
2022-07-15 09:52:55 +08:00
Jiarui Fang
85f933b58b
[Optimizer] Remove useless ColoOptimizer ( #1312 )
2022-07-14 16:57:48 +08:00
Jiarui Fang
9f10524313
[Optimizer] polish the init method of ColoOptimizer ( #1310 )
2022-07-14 16:37:33 +08:00
HELSON
260a55804a
[hotfix] fix shape error in backward when using ColoTensor ( #1298 )
2022-07-13 23:06:12 +08:00
runluo
f83c4d6597
[NFC] polish colossalai/nn/layer/wrapper/pipeline_wrapper.py code style ( #1303 )
2022-07-13 19:01:07 +08:00
XYE
e83b2ce853
[NFC] polish colossalai/nn/layer/vanilla/layers.py code style ( #1295 )
2022-07-13 12:08:21 +08:00
Liping233
1000a41fd5
[NFC] polish colossalai/nn/layer/vanilla/__init__.py code style ( #1293 )
2022-07-13 12:08:21 +08:00
Wangbo Zhao(黑色枷锁)
552667825b
[NFC] polish colossalai/nn/layer/parallel_1d/layers.py code style ( #1290 )
2022-07-13 12:08:21 +08:00
Jiatong Han
38e3ccd1e9
[NFC] polish colossalai/nn/layer/parallel_sequence/layers.py code style ( #1280 )
Co-authored-by: JThh <jiatong.han@u.nus.edu>
2022-07-13 12:08:21 +08:00
Boyuan Yao
b414eaa5db
[NFC] polish colossalai/nn/optimizer/lamb.py code style ( #1275 )
2022-07-13 12:08:21 +08:00
Super Daniel
52d145a342
[NFC] polish colossalai/nn/lr_scheduler/onecycle.py code style ( #1269 )
2022-07-13 12:08:21 +08:00
Geng Zhang
0e06f62160
[NFC] polish colossalai/nn/layer/parallel_sequence/_operation.py code style ( #1266 )
2022-07-13 12:08:21 +08:00
superhao1995
f660152c73
[NFC] polish colossalai/nn/layer/parallel_3d/_operation.py code style ( #1258 )
Co-authored-by: Research <research@soccf-snr3-017.comp.nus.edu.sg>
2022-07-13 12:08:21 +08:00
Thunderbeee
9738fb0f78
[NFC] polish colossalai/nn/lr_scheduler/__init__.py ( #1255 )
code style
2022-07-13 12:08:21 +08:00
Ofey Chan
2dd4d556fb
[NFC] polish colossalai/nn/init.py code style ( #1292 )
2022-07-13 10:51:55 +08:00
HELSON
abba4d84e1
[hotfix] fix bert model test in unit tests ( #1272 )
2022-07-12 23:26:45 +08:00
oahzxl
0cf8e8e91c
[NFC] polish <colossalai/nn/lr_scheduler/poly.py> code style ( #1267 )
2022-07-12 18:18:14 +08:00
Jiarui Fang
1aad903c15
[tensor] redistribute among different process groups ( #1247 )
* make it faster
* [tensor] rename convert_to_dist -> redistribute
* [tensor] ShardSpec and ReplicaSpec
* [tensor] redistribute among diff pgs
* polish code
2022-07-12 10:24:05 +08:00
Jiarui Fang
9bcd2fd4af
[tensor] a shorter shard and replicate spec ( #1245 )
2022-07-11 15:51:48 +08:00
Jiarui Fang
2699dfbbfd
[rename] convert_to_dist -> redistribute ( #1243 )
2022-07-11 13:05:44 +08:00
Jiarui Fang
4a76084dc9
[tensor] add zero_like colo op, important for Optimizer ( #1236 )
2022-07-08 14:55:27 +08:00
Jiarui Fang
3b500984b1
[tensor] fix some unittests ( #1234 )
2022-07-08 14:18:30 +08:00
HELSON
0453776def
[tensor] fix an assertion in colo_tensor cross_entropy ( #1232 )
2022-07-08 11:18:00 +08:00
HELSON
42ab36b762
[tensor] add unit test for colo_tensor 1DTP cross_entropy ( #1230 )
2022-07-07 19:17:23 +08:00
Yi Zhao
04537bf83e
[checkpoint] support generalized scheduler ( #1222 )
2022-07-07 18:16:38 +08:00
Jiarui Fang
a98319f023
[tensor] torch function return colotensor ( #1229 )
2022-07-07 18:09:18 +08:00
Jiarui Fang
ae7d3f4927
[refactor] move process group from _DistSpec to ColoTensor. ( #1203 )
2022-07-06 16:15:16 +08:00
Jiarui Fang
b5f25eb32a
[Tensor] add cpu group to ddp ( #1200 )
2022-07-05 14:58:28 +08:00
Jiarui Fang
060b917daf
[refactor] remove gpc dependency in colotensor's _ops ( #1189 )
2022-07-04 18:54:37 +08:00
Jiarui Fang
372f791444
[refactor] move chunk and chunkmgr to directory gemini ( #1182 )
2022-06-29 13:31:02 +08:00
ver217
6b2f2ab9bb
[ddp] ColoDDP uses bucket all-reduce ( #1177 )
* add reducer
* update colo ddp with reducer
* polish unit test
* polish unit test
2022-06-29 10:34:13 +08:00
Jiarui Fang
1b657f9ce1
[tensor] revert local view back ( #1178 )
2022-06-27 18:38:34 +08:00
Jiarui Fang
0dd4e2bbfb
[Tensor] rename some APIs in TensorSpec and Polish view unittest ( #1176 )
2022-06-27 15:56:11 +08:00
Ziyue Jiang
dd0420909f
[Tensor] rename parallel_action ( #1174 )
* rename parallel_action
* polish
2022-06-27 10:04:45 +08:00
Jiarui Fang
aa7bef73d4
[Tensor] distributed view supports inter-process hybrid parallel ( #1169 )
2022-06-27 09:45:26 +08:00
Jiarui Fang
4b9bba8116
[ColoTensor] rename APIs and add output_replicate to ComputeSpec ( #1168 )
2022-06-24 13:08:54 +08:00
Jiarui Fang
f4ef224358
[Tensor] remove ParallelAction, use ComputeSpec instead ( #1166 )
2022-06-23 17:34:59 +08:00
Jiarui Fang
177c374401
remove gather out in parallel action ( #1163 )
2022-06-23 16:35:05 +08:00
Ziyue Jiang
955ac912de
remove log ( #1160 )
2022-06-23 10:32:42 +08:00
Jiarui Fang
07f9c781f9
[graph] improve the graph building. ( #1157 )
2022-06-22 16:47:20 +08:00
ver217
22717a856f
[tensor] add embedding bag op ( #1156 )
2022-06-22 15:54:03 +08:00
ver217
ae86151968
[tensor] add more element-wise ops ( #1155 )
* add more element-wise ops
* update test_op
* polish unit test
2022-06-22 15:16:47 +08:00
ver217
54aabb8da4
[gemini] refactor gemini mgr ( #1151 )
* refactor gemini mgr
* update __init__
2022-06-22 11:54:36 +08:00
ver217
8106d7b8c7
[ddp] refactor ColoDDP and ZeroDDP ( #1146 )
* ColoDDP supports overwriting default process group
* rename ColoDDPV2 to ZeroDDP
* add docstr for ZeroDDP
* polish docstr
2022-06-21 16:35:23 +08:00