Jiarui Fang
909211453b
[Tensor] Add some attributes to ColoTensor (#877)
...
* [Tensor] add some functions to ColoTensor
* torch.allclose
* rm torch.add
3 years ago
Jiarui Fang
e43f83aa5c
[Tensor] get named parameters for model using ColoTensors (#874)
3 years ago
Jiarui Fang
96211c2cc8
[tensor] customized op returns ColoTensor (#875)
...
* [tensor] customized op returns ColoTensor
* polish
* polish code
3 years ago
Ziyue Jiang
26d4ab8b03
[Tensor] Add function to spec and update linear 1Drow and unit tests (#869)
3 years ago
Jiarui Fang
1190b2c4a4
[tensor] add cross_entropy_loss (#868)
3 years ago
HELSON
3107817172
[gemini] add stateful tensor container (#867)
3 years ago
Jiarui Fang
d01d3b8cb0
colo init context: add device attr. (#866)
3 years ago
Jiarui Fang
126ba573a8
[Tensor] add layer norm Op (#852)
3 years ago
Frank Lee
1258af71cc
[ci] cache cuda extension (#860)
3 years ago
Ziyue Jiang
bcc8655021
[Tensor] Add 1Drow weight reshard by spec (#854)
3 years ago
Jiarui Fang
62f059251b
[Tensor] init a tp network training unittest (#849)
3 years ago
Ziyue Jiang
2a0a427e04
[tensor] add assert for colo_tensor 1Drow (#846)
3 years ago
Ziyue Jiang
05023ecfee
[Tensor] TP Linear 1D row (#843)
3 years ago
HELSON
e5ea3fdeef
[gemini] add GeminiMemoryManager (#832)
...
* refactor StatefulTensor, tensor utilities
* add unit test for GeminiMemoryManager
3 years ago
YuliangLiu0306
35ea6e1023
[pipelinable] use pipelinable context to initialize non-pipeline model (#816)
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [pipeline] add module lazy init feature to support large model initialization.
* [pipeline] add to_layer_list and partition method to support arbitrary non-pp model
* refactor the module structure
* polish
* [pipelinable] add unit test for pipelinable
* polish
* polish
* Fix CodeFactor issues.
3 years ago
Jiarui Fang
ea0a2ed25f
[hotfix] fix the bug of numel() in ColoTensor (#845)
3 years ago
Jiarui Fang
8789850eea
Init Context supports lazy allocation of model memory (#842)
3 years ago
Frank Lee
943982d29a
[unittest] refactored unit tests for change in dependency (#838)
3 years ago
Frank Lee
01e9f834f5
[dependency] removed torchvision (#833)
...
* [dependency] removed torchvision
* fixed transforms
3 years ago
Jiarui Fang
cb5a4778e1
Revert "[WIP] Applying ColoTensor on TP-1D-row Linear. ( #831 )" ( #835 )
...
This reverts commit ac88de6dfc
.
3 years ago
Jiarui Fang
ac88de6dfc
[WIP] Applying ColoTensor on TP-1D-row Linear. (#831)
...
* revert zero tensors back
* [tensor] init row 1d linear
3 years ago
Jiarui Fang
294a6060d0
[tensor] ZeRO uses ColoTensor as the base class. (#828)
...
* [refactor] moving InsertPostInitMethodToModuleSubClasses to utils.
* [tensor] ZeRO uses ColoTensor as the base class.
* polish
3 years ago
Ziyue Jiang
8e6fdb4f29
[tensor] fix test_linear (#826)
3 years ago
Ziyue Jiang
1a9e2c2dff
[tensor] fix kwargs in colo_tensor torch_function (#825)
3 years ago
Jiarui Fang
2ecc3d7a55
[tensor] lazy init (#823)
3 years ago
Jiarui Fang
660d2d1f1b
[Tensor] apply ColoTensor on Torch functions (#821)
...
* Revert "[zero] add ZeroTensorShardStrategy (#793)"
This reverts commit 88759e289e.
* [gemini] set cpu memory capacity
* [log] local throughput collecting
* polish
* polish
* polish
* polish code
* polish
* polish code
* add a new tensor structure and override linear for it
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* [tensor] rename and reorganize directory structure.
* rm useless dir
* polish
* polish
* [tensor] handle functions that are not wrapped
3 years ago
Jiarui Fang
0ce8924ceb
[tensor] reorganize files (#820)
3 years ago
Jiarui Fang
ab962b9735
[gemini] a new tensor structure (#818)
...
* Revert "[zero] add ZeroTensorShardStrategy (#793)"
This reverts commit 88759e289e.
* [gemini] set cpu memory capacity
* [log] local throughput collecting
* polish
* polish
* polish
* polish code
* polish
* polish code
* add a new tensor structure and override linear for it
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
3 years ago
Jiarui Fang
e761ad2cd7
Revert "[zero] add ZeroTensorShardStrategy ( #793 )" ( #806 )
3 years ago
HELSON
88759e289e
[zero] add ZeroTensorShardStrategy (#793)
3 years ago
Jiarui Fang
681addb512
[refactor] moving grad acc logic to engine (#804)
3 years ago
Jiarui Fang
4d9332b4c5
[refactor] moving memtracer to gemini (#801)
3 years ago
HELSON
4c4388c46e
[hotfix] fix memory leak in zero (#781)
3 years ago
Frank Lee
5a1a095b92
[test] refactored with the new rerun decorator (#763)
...
* [test] refactored with the new rerun decorator
* polish test case
3 years ago
Jiarui Fang
10ef8afdd2
[gemini] init gemini individual directory (#754)
3 years ago
ver217
dcca614eee
[hotfix] fix test_stateful_tensor_mgr (#762)
3 years ago
ver217
a93a7d7364
[hotfix] fix reuse_fp16_shard of sharded model (#756)
...
* fix reuse_fp16_shard
* disable test stm
* polish code
3 years ago
HELSON
84c6700b2a
[zero] refactor memstats_collector (#746)
3 years ago
ver217
e396bb71f2
[zero] add tensor placement policies (#743)
...
* add tensor placement policies
* polish comments
* polish comments
* update moe unit tests
3 years ago
HELSON
22c4b88d56
[zero] refactor ShardedParamV2 for convenience (#742)
3 years ago
Frank Lee
f4f42d4c3c
[bug] fixed DDP compatibility with torch 1.8 (#739)
3 years ago
Jiarui Fang
53cb584808
[utils] correct CPU memory used and capacity in a multi-process context (#726)
3 years ago
HELSON
b9b469ea50
[moe] add checkpoint for moe zero test (#729)
3 years ago
FrankLeeeee
e88a498c9c
[test] removed trivial outdated test
3 years ago
FrankLeeeee
62b4ce7326
[test] added missing decorators to model checkpointing tests
3 years ago
Jiarui Fang
4d90a7b513
[refactor] zero directory (#724)
3 years ago
Frank Lee
20ab1f5520
[bug] fixed broken test_found_inf (#725)
3 years ago
Jiarui Fang
193dc8dacb
[refactor] refactor the memory utils (#715)
3 years ago
HELSON
dbd96fe90a
[zero] check whether gradients have inf and nan on GPU (#712)
3 years ago
HELSON
a9b8300d54
[zero] improve adaptability for non-sharded parameters (#708)
...
* adapt post grad hooks for non-sharded parameters
* adapt optimizer for non-sharded parameters
* offload gradients for non-replicated parameters
3 years ago