Commit Graph

176 Commits (96211c2cc8fc8e081eddb5111557d1d474b1446d)

Author SHA1 Message Date
Jiarui Fang 96211c2cc8
[tensor] customized op returns ColoTensor (#875)
3 years ago
Ziyue Jiang 26d4ab8b03
[Tensor] Add function to spec and update linear 1Drow and unit tests (#869)
3 years ago
Jiarui Fang 1190b2c4a4
[tensor] add cross_entrophy_loss (#868)
3 years ago
HELSON 3107817172
[gemini] add stateful tensor container (#867)
3 years ago
Jiarui Fang d01d3b8cb0
colo init context add device attr. (#866)
3 years ago
Jiarui Fang 126ba573a8
[Tensor] add layer norm Op (#852)
3 years ago
Frank Lee 1258af71cc
[ci] cache cuda extension (#860)
3 years ago
Ziyue Jiang bcc8655021
[Tensor ] Add 1Drow weight reshard by spec (#854)
3 years ago
Jiarui Fang 62f059251b
[Tensor] init a tp network training unittest (#849)
3 years ago
Ziyue Jiang 2a0a427e04
[tensor]add assert for colo_tensor 1Drow (#846)
3 years ago
Ziyue Jiang 05023ecfee
[Tensor] TP Linear 1D row (#843)
3 years ago
HELSON e5ea3fdeef
[gemini] add GeminiMemoryManger (#832)
3 years ago
YuliangLiu0306 35ea6e1023
[pipelinable]use pipelinable context to initialize non-pipeline model (#816)
3 years ago
Jiarui Fang ea0a2ed25f
[hotfix] the bug of numel() in ColoTensor (#845)
3 years ago
Jiarui Fang 8789850eea
Init Conext supports lazy allocate model memory (#842)
3 years ago
Frank Lee 943982d29a
[unittest] refactored unit tests for change in dependency (#838)
3 years ago
Frank Lee 01e9f834f5
[dependency] removed torchvision (#833)
3 years ago
Jiarui Fang cb5a4778e1
Revert "[WIP] Applying ColoTensor on TP-1D-row Linear. (#831)" (#835)
3 years ago
Jiarui Fang ac88de6dfc
[WIP] Applying ColoTensor on TP-1D-row Linear. (#831)
3 years ago
Jiarui Fang 294a6060d0
[tensor] ZeRO use ColoTensor as the base class. (#828)
3 years ago
Ziyue Jiang 8e6fdb4f29
[tensor]fix test_linear (#826)
3 years ago
Ziyue Jiang 1a9e2c2dff
[tensor] fix kwargs in colo_tensor torch_funtion (#825)
3 years ago
Jiarui Fang 2ecc3d7a55
[tensor] lazy init (#823)
3 years ago
Jiarui Fang 660d2d1f1b
[Tensor] apply ColoTensor on Torch functions (#821)
3 years ago
Jiarui Fang 0ce8924ceb
[tensor] reorganize files (#820)
3 years ago
Jiarui Fang ab962b9735
[gemini] a new tensor structure (#818)
3 years ago
Jiarui Fang e761ad2cd7
Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806)
3 years ago
HELSON 88759e289e
[zero] add ZeroTensorShardStrategy (#793)
3 years ago
Jiarui Fang 681addb512
[refactor] moving grad acc logic to engine (#804)
3 years ago
Jiarui Fang 4d9332b4c5
[refactor] moving memtracer to gemini (#801)
3 years ago
HELSON 4c4388c46e
[hotfix] fix memory leak in zero (#781)
3 years ago
Frank Lee 5a1a095b92
[test] refactored with the new rerun decorator (#763)
3 years ago
Jiarui Fang 10ef8afdd2
[gemini] init genimi individual directory (#754)
3 years ago
ver217 dcca614eee
[hotfix] fix test_stateful_tensor_mgr (#762)
3 years ago
ver217 a93a7d7364
[hotfix] fix reuse_fp16_shard of sharded model (#756)
3 years ago
HELSON 84c6700b2a
[zero] refactor memstats_collector (#746)
3 years ago
ver217 e396bb71f2
[zero] add tensor placement policies (#743)
3 years ago
HELSON 22c4b88d56
[zero] refactor ShardedParamV2 for convenience (#742)
3 years ago
Frank Lee f4f42d4c3c
[bug] fixed DDP compatibility with torch 1.8 (#739)
3 years ago
Jiarui Fang 53cb584808
[utils] correct cpu memory used and capacity in the context of multi-process (#726)
3 years ago
HELSON b9b469ea50
[moe] add checkpoint for moe zero test (#729)
3 years ago
FrankLeeeee e88a498c9c [test] removed trivial outdated test
3 years ago
FrankLeeeee 62b4ce7326 [test] added missing decorators to model checkpointing tests
3 years ago
Jiarui Fang 4d90a7b513
[refactor] zero directory (#724)
3 years ago
Frank Lee 20ab1f5520
[bug] fixed broken test_found_inf (#725)
3 years ago
Jiarui Fang 193dc8dacb
[refactor] refactor the memory utils (#715)
3 years ago
HELSON dbd96fe90a
[zero] check whether gradients have inf and nan in gpu (#712)
3 years ago
HELSON a9b8300d54
[zero] improve adaptability for not-shard parameters (#708)
3 years ago
ver217 ab8c6b4a0e
[zero] refactor memstats collector (#706)
3 years ago
HELSON ee112fe1da
[zero] adapt zero hooks for unsharded module (#699)
3 years ago