780 Commits (a52f62082de0f4b4544ba2d04e909f74123425ce)

Author SHA1 Message Date
Jiarui Fang 909211453b
[Tensor] Add some attributes to ColoTensor (#877) 3 years ago
Jiarui Fang e43f83aa5c
[Tensor] get named parameters for model using ColoTensors (#874) 3 years ago
Jiarui Fang 96211c2cc8
[tensor] customized op returns ColoTensor (#875) 3 years ago
Ziyue Jiang 26d4ab8b03
[Tensor] Add function to spec and update linear 1Drow and unit tests (#869) 3 years ago
Jiarui Fang 1190b2c4a4
[tensor] add cross_entrophy_loss (#868) 3 years ago
HELSON 3107817172
[gemini] add stateful tensor container (#867) 3 years ago
Jiarui Fang d01d3b8cb0
colo init context add device attr. (#866) 3 years ago
Jiarui Fang 126ba573a8
[Tensor] add layer norm Op (#852) 3 years ago
Frank Lee 1258af71cc
[ci] cache cuda extension (#860) 3 years ago
Ziyue Jiang bcc8655021
[Tensor ] Add 1Drow weight reshard by spec (#854) 3 years ago
Jiarui Fang 62f059251b
[Tensor] init a tp network training unittest (#849) 3 years ago
Ziyue Jiang 2a0a427e04
[tensor]add assert for colo_tensor 1Drow (#846) 3 years ago
Ziyue Jiang 05023ecfee
[Tensor] TP Linear 1D row (#843) 3 years ago
HELSON e5ea3fdeef
[gemini] add GeminiMemoryManger (#832) 3 years ago
YuliangLiu0306 35ea6e1023
[pipelinable]use pipelinable context to initialize non-pipeline model (#816) 3 years ago
Jiarui Fang ea0a2ed25f
[hotfix] the bug of numel() in ColoTensor (#845) 3 years ago
Jiarui Fang 8789850eea
Init Conext supports lazy allocate model memory (#842) 3 years ago
Frank Lee 943982d29a
[unittest] refactored unit tests for change in dependency (#838) 3 years ago
Frank Lee 01e9f834f5
[dependency] removed torchvision (#833) 3 years ago
Jiarui Fang cb5a4778e1
Revert "[WIP] Applying ColoTensor on TP-1D-row Linear. (#831)" (#835) 3 years ago
Jiarui Fang ac88de6dfc
[WIP] Applying ColoTensor on TP-1D-row Linear. (#831) 3 years ago
Jiarui Fang 294a6060d0
[tensor] ZeRO use ColoTensor as the base class. (#828) 3 years ago
Ziyue Jiang 8e6fdb4f29
[tensor]fix test_linear (#826) 3 years ago
Ziyue Jiang 1a9e2c2dff
[tensor] fix kwargs in colo_tensor torch_funtion (#825) 3 years ago
Jiarui Fang 2ecc3d7a55
[tensor] lazy init (#823) 3 years ago
Jiarui Fang 660d2d1f1b
[Tensor] apply ColoTensor on Torch functions (#821) 3 years ago
Jiarui Fang 0ce8924ceb
[tensor] reorganize files (#820) 3 years ago
Jiarui Fang ab962b9735
[gemini] a new tensor structure (#818) 3 years ago
Jiarui Fang e761ad2cd7
Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806) 3 years ago
HELSON 88759e289e
[zero] add ZeroTensorShardStrategy (#793) 3 years ago
Jiarui Fang 681addb512
[refactor] moving grad acc logic to engine (#804) 3 years ago
Jiarui Fang 4d9332b4c5
[refactor] moving memtracer to gemini (#801) 3 years ago
HELSON 4c4388c46e
[hotfix] fix memory leak in zero (#781) 3 years ago
Frank Lee 5a1a095b92
[test] refactored with the new rerun decorator (#763) 3 years ago
Jiarui Fang 10ef8afdd2
[gemini] init genimi individual directory (#754) 3 years ago
ver217 dcca614eee
[hotfix] fix test_stateful_tensor_mgr (#762) 3 years ago
ver217 a93a7d7364
[hotfix] fix reuse_fp16_shard of sharded model (#756) 3 years ago
HELSON 84c6700b2a
[zero] refactor memstats_collector (#746) 3 years ago
ver217 e396bb71f2
[zero] add tensor placement policies (#743) 3 years ago
HELSON 22c4b88d56
[zero] refactor ShardedParamV2 for convenience (#742) 3 years ago
Frank Lee f4f42d4c3c
[bug] fixed DDP compatibility with torch 1.8 (#739) 3 years ago
Jiarui Fang 53cb584808
[utils] correct cpu memory used and capacity in the context of multi-process (#726) 3 years ago
HELSON b9b469ea50
[moe] add checkpoint for moe zero test (#729) 3 years ago
FrankLeeeee e88a498c9c
[test] removed trivial outdated test 3 years ago
FrankLeeeee 62b4ce7326
[test] added missing decorators to model checkpointing tests 3 years ago
Jiarui Fang 4d90a7b513
[refactor] zero directory (#724) 3 years ago
Frank Lee 20ab1f5520
[bug] fixed broken test_found_inf (#725) 3 years ago
Jiarui Fang 193dc8dacb
[refactor] refactor the memory utils (#715) 3 years ago
HELSON dbd96fe90a
[zero] check whether gradients have inf and nan in gpu (#712) 3 years ago
HELSON a9b8300d54
[zero] improve adaptability for not-shard parameters (#708) 3 years ago