cb5a4778e1 | Jiarui Fang | 2022-04-22 14:45:57 +08:00
  Revert "[WIP] Applying ColoTensor on TP-1D-row Linear. (#831)" (#835)
  This reverts commit ac88de6dfc.
ac88de6dfc | Jiarui Fang | 2022-04-22 14:03:26 +08:00
  [WIP] Applying ColoTensor on TP-1D-row Linear. (#831)
  * revert zero tensors back
  * [tensor] init row 1d linear
294a6060d0 | Jiarui Fang | 2022-04-22 12:00:48 +08:00
  [tensor] ZeRO use ColoTensor as the base class. (#828)
  * [refactor] moving InsertPostInitMethodToModuleSubClasses to utils.
  * [tensor] ZeRO use ColoTensor as the base class.
  * polish
8e6fdb4f29 | Ziyue Jiang | 2022-04-21 17:18:56 +08:00
  [tensor] fix test_linear (#826)
1a9e2c2dff | Ziyue Jiang | 2022-04-21 16:47:35 +08:00
  [tensor] fix kwargs in colo_tensor torch_function (#825)
2ecc3d7a55 | Jiarui Fang | 2022-04-21 15:40:23 +08:00
  [tensor] lazy init (#823)
660d2d1f1b | Jiarui Fang | 2022-04-21 14:21:10 +08:00
  [Tensor] apply ColoTensor on Torch functions (#821)
  * Revert "[zero] add ZeroTensorShardStrategy (#793)"; this reverts commit 88759e289e.
  * [gemini] set cpu memory capacity
  * [log] local throughput collecting
  * polish / polish code (six small cleanup commits)
  * add a new tensor structure and override linear for it
  * polish (eleven small cleanup commits)
  * [tensor] renaming and reorganize directory structure.
  * rm useless dir
  * polish (two small cleanup commits)
  * [tensor] handle the function not wrapped
0ce8924ceb | Jiarui Fang | 2022-04-21 14:15:48 +08:00
  [tensor] reorganize files (#820)
ab962b9735 | Jiarui Fang | 2022-04-21 11:42:37 +08:00
  [gemini] a new tensor structure (#818)
  * Revert "[zero] add ZeroTensorShardStrategy (#793)"; this reverts commit 88759e289e.
  * [gemini] set cpu memory capacity
  * [log] local throughput collecting
  * polish / polish code (six small cleanup commits)
  * add a new tensor structure and override linear for it
  * polish (eleven small cleanup commits)
e761ad2cd7 | Jiarui Fang | 2022-04-19 14:40:02 +08:00
  Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806)
88759e289e | HELSON | 2022-04-19 14:32:45 +08:00
  [zero] add ZeroTensorShardStrategy (#793)
681addb512 | Jiarui Fang | 2022-04-19 14:03:21 +08:00
  [refactor] moving grad acc logic to engine (#804)
4d9332b4c5 | Jiarui Fang | 2022-04-19 10:13:08 +08:00
  [refactor] moving memtracer to gemini (#801)
4c4388c46e | HELSON | 2022-04-18 13:57:03 +08:00
  [hotfix] fix memory leak in zero (#781)
5a1a095b92 | Frank Lee | 2022-04-15 00:33:04 +08:00
  [test] refactored with the new rerun decorator (#763)
  * [test] refactored with the new rerun decorator
  * polish test case
10ef8afdd2 | Jiarui Fang | 2022-04-14 16:40:26 +08:00
  [gemini] init gemini individual directory (#754)
dcca614eee | ver217 | 2022-04-14 15:50:09 +08:00
  [hotfix] fix test_stateful_tensor_mgr (#762)
a93a7d7364 | ver217 | 2022-04-14 14:56:46 +08:00
  [hotfix] fix reuse_fp16_shard of sharded model (#756)
  * fix reuse_fp16_shard
  * disable test stm
  * polish code
84c6700b2a | HELSON | 2022-04-14 12:01:12 +08:00
  [zero] refactor memstats_collector (#746)
e396bb71f2 | ver217 | 2022-04-13 15:00:48 +08:00
  [zero] add tensor placement policies (#743)
  * add tensor placement policies
  * polish comments
  * polish comments
  * update moe unit tests
22c4b88d56 | HELSON | 2022-04-13 14:54:26 +08:00
  [zero] refactor ShardedParamV2 for convenience (#742)
f4f42d4c3c | Frank Lee | 2022-04-13 00:08:46 +08:00
  [bug] fixed DDP compatibility with torch 1.8 (#739)
53cb584808 | Jiarui Fang | 2022-04-12 14:57:54 +08:00
  [utils] correct cpu memory used and capacity in the multi-process context (#726)
b9b469ea50 | HELSON | 2022-04-12 12:11:54 +08:00
  [moe] add checkpoint for moe zero test (#729)
e88a498c9c | FrankLeeeee | 2022-04-12 11:08:15 +08:00
  [test] removed trivial outdated test
62b4ce7326 | FrankLeeeee | 2022-04-12 11:08:15 +08:00
  [test] added missing decorators to model checkpointing tests
4d90a7b513 | Jiarui Fang | 2022-04-11 23:13:02 +08:00
  [refactor] zero directory (#724)
20ab1f5520 | Frank Lee | 2022-04-11 22:00:27 +08:00
  [bug] fixed broken test_found_inf (#725)
193dc8dacb | Jiarui Fang | 2022-04-11 16:47:57 +08:00
  [refactor] refactor the memory utils (#715)
dbd96fe90a | HELSON | 2022-04-11 15:40:13 +08:00
  [zero] check whether gradients have inf and nan in gpu (#712)
a9b8300d54 | HELSON | 2022-04-11 13:38:51 +08:00
  [zero] improve adaptability for non-sharded parameters (#708)
  * adapt post-grad hooks for non-sharded parameters
  * adapt optimizer for non-sharded parameters
  * offload gradients for non-replicated parameters
ab8c6b4a0e | ver217 | 2022-04-11 10:46:08 +08:00
  [zero] refactor memstats collector (#706)
  * refactor memstats collector
  * fix disposable
  * polish code
ee112fe1da | HELSON | 2022-04-08 20:23:26 +08:00
  [zero] adapt zero hooks for unsharded module (#699)
3c9cd5bb5e | ver217 | 2022-04-08 17:51:34 +08:00
  [zero] stateful tensor manager (#687)
  * [WIP] stateful tensor manager
  * add eviction strategy
  * polish code
  * polish code
  * polish comment
  * add unit test
  * fix sampler bug
  * polish code
  * fix max sampling count resetting bug
  * fix sampler bug
  * polish code
  * fix bug
  * fix unit test
  Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
d7ecaf362b | HELSON | 2022-04-07 17:38:45 +08:00
  [zero] fix init bugs in zero context (#686)
  * adapt model weight initialization for methods in PyTorch nn.init
0aab52301e | Jiarui Fang | 2022-04-03 21:48:06 +08:00
  [hotfix] fix a bug in model data stats tracing (#655)
ade05a5d83 | YuliangLiu0306 | 2022-04-03 20:46:45 +08:00
  [refactor] pipeline, put runtime schedule into engine. (#627)
e5d615aeee | HELSON | 2022-04-02 21:58:47 +08:00
  [hotfix] fix bugs in testing (#659)
  * remove hybrid adam in test_moe_zero_optim
  * fix activation checkpointing and its unit test
b31daed4cf | HELSON | 2022-04-02 17:04:05 +08:00
  fix bugs in CPU adam (#633)
  * add cpu adam counter for all cpu adam
  * fixed updating error in adam kernel
055fbf5be6 | HELSON | 2022-04-01 20:10:47 +08:00
  [zero] adapt zero for unsharded parameters (Optimizer part) (#601)
354b7954d1 | アマデウス | 2022-04-01 16:53:32 +08:00
  [model checkpoint] added unit tests for checkpoint save/load (#599)
93f14d2a33 | FredHuang99 | 2022-04-01 15:16:59 +08:00
  [zero] test zero tensor utils (#609)
e956d93ac2 | Jiarui Fang | 2022-04-01 09:22:33 +08:00
  [refactor] memory utils (#577)
e6d50ec107 | HELSON | 2022-03-31 18:34:11 +08:00
  [zero] adapt zero for unsharded parameters (#561)
  * support existing sharded and unsharded parameters in zero
  * add unit test for moe-zero model init
  * polish moe gradient handler
7c6c427db1 | ver217 | 2022-03-31 16:26:54 +08:00
  [zero] trace states of fp16/32 grad and fp32 param (#571)
7675366fce | Jiarui Fang | 2022-03-31 12:25:45 +08:00
  [polish] rename col_attr -> colo_attr (#558)
014bac0c49 | ver217 | 2022-03-30 18:14:50 +08:00
  [zero] hijack p.grad in sharded model (#554)
  * hijack p.grad in sharded model
  * polish comments
  * polish comments
f552b11294 | Jiarui Fang | 2022-03-30 15:57:46 +08:00
  [zero] label state for param fp16 and grad (#551)
214da761d4 | Jiarui Fang | 2022-03-30 13:51:37 +08:00
  [zero] add stateful tensor (#549)
8c90d4df54 | HELSON | 2022-03-29 17:57:59 +08:00
  [zero] add zero context manager to change config during initialization (#546)