Jiarui Fang
3ce4463fe6
[utils] remove lazy_memory_allocate from ColoInitContext ( #1844 )
2 years ago
YuliangLiu0306
980ed21723
[autoparallel] shard param and buffer as expected ( #1753 )
* [autoparallel] shard param and buffer as expected
* fix unit test issue
2 years ago
Frank Lee
eee84908d4
[autoparallel] handled illegal sharding strategy ( #1728 )
* [autoparallel] handled illegal sharding strategy
* polish code
2 years ago
HELSON
f69f9bf223
[zero] add chunk init function for users ( #1729 )
* add chunk manager init function
* fix unit tests
* add comment
* add flush=True
2 years ago
HELSON
b28991dd0a
[feature] A new ZeRO implementation ( #1644 )
2 years ago
YuliangLiu0306
3f068d1409
[autoparallel] update CommSpec ( #1667 )
2 years ago
Frank Lee
154d3ef432
[fix] fixed the collective pattern name for consistency ( #1649 )
* [fix] fixed the collective pattern name for consistency
* polish code
2 years ago
Jiarui Fang
c5d39215f6
Revert "[feature] new zero implementation ( #1623 )" ( #1643 )
This reverts commit 5be118f405.
2 years ago
HELSON
5be118f405
[feature] new zero implementation ( #1623 )
2 years ago
YuliangLiu0306
702dbc5288
[tensor] use communication autograd func ( #1617 )
* [tensor] use communication autograd func
* change all to all comm spec info
* rename pattern and distinguish fwd/bwd
* polish code
2 years ago
YuliangLiu0306
4b03c25f85
[tensor]add 1D device mesh ( #1492 )
2 years ago
YuliangLiu0306
b73fb7a077
[tensor] support runtime ShardingSpec apply ( #1453 )
* [tensor] support runtime ShardingSpec apply
* polish code
* polish code
2 years ago
YuliangLiu0306
0f3042363c
[tensor] shape consistency generate transform path and communication cost ( #1435 )
* [tensor] shape consistency output transform path and communication cost
* polish code
2 years ago
Frank Lee
ae1b58cd16
[tensor] added linear implementation for the new sharding spec ( #1416 )
* [tensor] added linear implementation for the new sharding spec
* polish code
2 years ago
Jiarui Fang
89c434a0a6
[polish] add test_ops directory ( #1431 )
2 years ago
Jiarui Fang
10b3df65c8
[FAW] move coloparam setting in test code. ( #1429 )
2 years ago
Jiarui Fang
cb98cf5558
[FAW] parallel FreqAwareEmbedding ( #1424 )
2 years ago
YuliangLiu0306
33f0744d51
[tensor] add shape consistency feature to support auto spec transform ( #1418 )
* [tensor] add shape consistency feature to support auto sharding spec transform.
* [tensor] remove unused argument in simulator, add doc string for target pair.
2 years ago
Jiarui Fang
d209aff684
Add FreqAwareEmbeddingBag ( #1421 )
2 years ago
Jiarui Fang
504419d261
[FAW] add cache manager for the cached embedding ( #1419 )
2 years ago
YuliangLiu0306
7c96055c68
[tensor]build sharding spec to replace distspec in future. ( #1405 )
2 years ago
HELSON
87775a0682
[colotensor] use cpu memory to store state_dict ( #1367 )
2 years ago
HELSON
4417804129
[unit test] add megatron init test in zero_optim ( #1358 )
2 years ago
HELSON
7a065dc9f6
[hotfix] fix megatron_init in test_gpt2.py ( #1357 )
2 years ago
HELSON
7a8702c06d
[colotensor] add Tensor.view op and its unit test ( #1343 )
[colotensor] add megatron initialization for gpt2
2 years ago
HELSON
bf5066fba7
[refactor] refactor ColoTensor's unit tests ( #1340 )
2 years ago
ver217
0c51ff2c13
[hotfix] ZeroDDP use new process group ( #1333 )
* process group supports getting ranks in group
* chunk mgr receives a process group
* update unit test
* fix unit tests
2 years ago
HELSON
d49708ae43
[hotfix] fix ddp for unit test test_gpt2 ( #1326 )
2 years ago
HELSON
1b41686461
[hotfix] fix unit test test_module_spec ( #1321 )
2 years ago
Jiarui Fang
85f933b58b
[Optimizer] Remove useless ColoOptimizer ( #1312 )
2 years ago
Jiarui Fang
9f10524313
[Optimizer] polish the init method of ColoOptimizer ( #1310 )
2 years ago
HELSON
36086927e1
[hotfix] fix ColoTensor GPT2 unitest ( #1309 )
2 years ago
HELSON
260a55804a
[hotfix] fix shape error in backward when using ColoTensor ( #1298 )
2 years ago
Jiarui Fang
79fe7b027a
[hotfix] test model unittest hotfix ( #1281 )
2 years ago
Jiarui Fang
e56731e916
[hotfix] test_gpt.py duplicated ( #1279 )
* make it faster
* [hotfix] torchvison fx tests
* [hotfix] rename duplicated named test_gpt.py
2 years ago
HELSON
abba4d84e1
[hotfix] fix bert model test in unitests ( #1272 )
2 years ago
Jiarui Fang
c92f84fcdb
[tensor] distributed checkpointing for parameters ( #1240 )
2 years ago
Jiarui Fang
1aad903c15
[tensor] redistribute among different process groups ( #1247 )
* make it faster
* [tensor] rename convert_to_dist -> redistribute
* [tensor] ShardSpec and ReplicaSpec
* [tensor] redistribute among diff pgs
* polish code
2 years ago
Jiarui Fang
9bcd2fd4af
[tensor] a shorter shard and replicate spec ( #1245 )
2 years ago
Jiarui Fang
2699dfbbfd
[rename] convert_to_dist -> redistribute ( #1243 )
2 years ago
HELSON
f6add9b720
[tensor] redirect .data.__get__ to a tensor instance ( #1239 )
2 years ago
Jiarui Fang
4a76084dc9
[tensor] add zero_like colo op, important for Optimizer ( #1236 )
2 years ago
Jiarui Fang
3b500984b1
[tensor] fix some unittests ( #1234 )
2 years ago
HELSON
0453776def
[tensor] fix a assertion in colo_tensor cross_entropy ( #1232 )
2 years ago
Jiarui Fang
0e199d71e8
[hotfix] fx get comm size bugs ( #1233 )
* init a checkpoint dir
* [checkpoint]support resume for cosinewarmuplr
* [checkpoint]add unit test
* fix some bugs but still not OK
* fix bugs
* make it faster
* [checkpoint]support generalized scheduler
* polish
* [tensor] torch function return colotensor
* polish
* fix bugs
* remove debug info
* polish
* polish
* [tensor] test_model pass unittests
* polish
* [hotfix] fx get comm size bug
Co-authored-by: ZhaoYi1222 <zhaoyi9499@gmail.com>
2 years ago
HELSON
42ab36b762
[tensor] add unitest for colo_tensor 1DTP cross_entropy ( #1230 )
2 years ago
Jiarui Fang
a98319f023
[tensor] torch function return colotensor ( #1229 )
2 years ago
Jiarui Fang
15d988f954
[tensor] sharded global process group ( #1219 )
2 years ago
Jiarui Fang
ae7d3f4927
[refactor] move process group from _DistSpec to ColoTensor. ( #1203 )
2 years ago
Jiarui Fang
060b917daf
[refactor] remove gpc dependency in colotensor's _ops ( #1189 )
2 years ago