Commit Graph

3359 Commits (1b76564e1607aa8cf24566c794977b260de44f6c)
| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Ziyue Jiang | 2a0a427e04 | [tensor]add assert for colo_tensor 1Drow (#846) | 3 years ago |
| Ziyue Jiang | 05023ecfee | [Tensor] TP Linear 1D row (#843) | 3 years ago |
| Frank Lee | cf6d1c9284 | [CLI] refactored the launch CLI and fixed bugs in multi-node launching (#844) | 3 years ago |
| HELSON | e5ea3fdeef | [gemini] add GeminiMemoryManger (#832) | 3 years ago |
| YuliangLiu0306 | 35ea6e1023 | [pipelinable]use pipelinable context to initialize non-pipeline model (#816) | 3 years ago |
| Jiarui Fang | ea0a2ed25f | [hotfix] the bug of numel() in ColoTensor (#845) | 3 years ago |
| LuGY | c1e8d2001e | modefied the pp build for ckpt adaptation (#803) | 3 years ago |
| Jiarui Fang | 8789850eea | Init Conext supports lazy allocate model memory (#842) | 3 years ago |
| Jiarui Fang | 4575a3298b | [hotfix] ColoTensor pin_memory (#840) | 3 years ago |
| Frank Lee | 9f6f656952 | [setup] use env var instead of option for cuda ext (#839) | 3 years ago |
| Frank Lee | 943982d29a | [unittest] refactored unit tests for change in dependency (#838) | 3 years ago |
| github-actions[bot] | f271f34716 | Automated submodule synchronization (#827) | 3 years ago |
| Frank Lee | 01e9f834f5 | [dependency] removed torchvision (#833) | 3 years ago |
| Jiarui Fang | cb5a4778e1 | Revert "[WIP] Applying ColoTensor on TP-1D-row Linear. (#831)" (#835) | 3 years ago |
| Frank Lee | 5e00e6cf23 | [setup] allow installation with python 3.6 (#834) | 3 years ago |
| Jiarui Fang | ac88de6dfc | [WIP] Applying ColoTensor on TP-1D-row Linear. (#831) | 3 years ago |
| Jiarui Fang | 595bedf767 | revert zero tensors back (#829) | 3 years ago |
| Jiarui Fang | 294a6060d0 | [tensor] ZeRO use ColoTensor as the base class. (#828) | 3 years ago |
| Ziyue Jiang | 8e6fdb4f29 | [tensor]fix test_linear (#826) | 3 years ago |
| Ziyue Jiang | 1a9e2c2dff | [tensor] fix kwargs in colo_tensor torch_funtion (#825) | 3 years ago |
| Jiarui Fang | eb1b89908c | [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. (#824) | 3 years ago |
| Jiarui Fang | 2ecc3d7a55 | [tensor] lazy init (#823) | 3 years ago |
| Jiarui Fang | 68dcd51d41 | [Tensor] update ColoTensor torch_function (#822) | 3 years ago |
| Jiarui Fang | 660d2d1f1b | [Tensor] apply ColoTensor on Torch functions (#821) | 3 years ago |
| Jiarui Fang | 0ce8924ceb | [tensor] reorganize files (#820) | 3 years ago |
| Jiarui Fang | ab962b9735 | [gemini] a new tensor structure (#818) | 3 years ago |
| github-actions[bot] | 413ce30c45 | Automated submodule synchronization (#819) | 3 years ago |
| github-actions[bot] | 9aae4197bb | Automated submodule synchronization (#810) | 3 years ago |
| YuliangLiu0306 | e1b3899824 | Merge pull request #815 from FrankLeeeee/feature/check-cli | 3 years ago |
| FrankLeeeee | 70ed11d07e | [cli] added check installation cli | 3 years ago |
| YuliangLiu0306 | c7eca40f51 | Merge pull request #812 from FrankLeeeee/feature/cli | 3 years ago |
| Jiarui Fang | 3ddbd1bce1 | [gemini] collect cpu-gpu moving volume in each iteration (#813) | 3 years ago |
| FrankLeeeee | d522cb704e | [cli] fixed single-node process launching | 3 years ago |
| Jiarui Fang | 61c20b44bc | [log] local throughput metrics (#811) | 3 years ago |
| ver217 | dd92b90a68 | [DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext (#808) | 3 years ago |
| Jiarui Fang | 227d1cd4b3 | [gemini] APIs to set cpu memory capacity (#809) | 3 years ago |
| YuliangLiu0306 | f6dcd23fb9 | Merge pull request #807 from FrankLeeeee/feature/cli | 3 years ago |
| FrankLeeeee | f63e91d280 | [cli] fixed a bug in user args and refactored the module structure | 3 years ago |
| Jiarui Fang | e761ad2cd7 | Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806) | 3 years ago |
| HELSON | 88759e289e | [zero] add ZeroTensorShardStrategy (#793) | 3 years ago |
| Jiarui Fang | 681addb512 | [refactor] moving grad acc logic to engine (#804) | 3 years ago |
| Frank Lee | 05d9ae5999 | [cli] add missing requirement (#805) | 3 years ago |
| YuliangLiu0306 | de2f581d43 | [cli] added micro benchmarking for tp (#789) | 3 years ago |
| YuliangLiu0306 | cfadc9df8e | [cli] added distributed launcher command (#791) | 3 years ago |
| Jiarui Fang | 97cd9b03b3 | [log] display tflops if available (#802) | 3 years ago |
| Jiarui Fang | 4d9332b4c5 | [refactor] moving memtracer to gemini (#801) | 3 years ago |
| Jiarui Fang | 8711c706f4 | [hotfix] fix grad offload when enabling reuse_fp16_shard | 3 years ago |
| ver217 | f1fa1a675f | fix grad offload when enabling reuse_fp16_shard | 3 years ago |
| HELSON | 4c4388c46e | [hotfix] fix memory leak in zero (#781) | 3 years ago |
| Ziyue Jiang | 4b01da24cd | [TP] change the check assert in split batch 2d (#772) | 3 years ago |