351 Commits (ee222dfbf37a0e77ccd69f13aa8c93e3f6dad868)

Author SHA1 Message Date
Frank Lee ee222dfbf3
[usability] added assertion message in registry (#864)
3 years ago
HELSON f0e654558f
[gemini] polish code (#855)
3 years ago
Jiarui Fang 29159d9b5b
hotfix tensor unittest bugs (#862)
3 years ago
YuliangLiu0306 c6930d8ddf
[pipelinable]use ColoTensor to replace dummy tensor. (#853)
3 years ago
Ziyue Jiang bcc8655021
[Tensor ] Add 1Drow weight reshard by spec (#854)
3 years ago
ver217 d7e0303d1e
[zero] use GeminiMemoryManager when sampling model data (#850)
3 years ago
ver217 232142f402
[utils] refactor profiler (#837)
3 years ago
Jiarui Fang 62f059251b
[Tensor] init a tp network training unittest (#849)
3 years ago
ver217 0dea140760
[hotfix] add deconstructor for stateful tensor (#848)
3 years ago
ver217 0f7ed8c192
fix _post_init_method of zero init ctx (#847)
3 years ago
Ziyue Jiang 2a0a427e04
[tensor]add assert for colo_tensor 1Drow (#846)
3 years ago
Ziyue Jiang 05023ecfee
[Tensor] TP Linear 1D row (#843)
3 years ago
Frank Lee cf6d1c9284
[CLI] refactored the launch CLI and fixed bugs in multi-node launching (#844)
3 years ago
HELSON e5ea3fdeef
[gemini] add GeminiMemoryManger (#832)
3 years ago
YuliangLiu0306 35ea6e1023
[pipelinable]use pipelinable context to initialize non-pipeline model (#816)
3 years ago
Jiarui Fang ea0a2ed25f
[hotfix] the bug of numel() in ColoTensor (#845)
3 years ago
LuGY c1e8d2001e
modefied the pp build for ckpt adaptation (#803)
3 years ago
Jiarui Fang 8789850eea
Init Conext supports lazy allocate model memory (#842)
3 years ago
Jiarui Fang 4575a3298b
[hotfix] ColoTensor pin_memory (#840)
3 years ago
Frank Lee 01e9f834f5
[dependency] removed torchvision (#833)
3 years ago
Jiarui Fang cb5a4778e1
Revert "[WIP] Applying ColoTensor on TP-1D-row Linear. (#831)" (#835)
3 years ago
Jiarui Fang ac88de6dfc
[WIP] Applying ColoTensor on TP-1D-row Linear. (#831)
3 years ago
Jiarui Fang 595bedf767
revert zero tensors back (#829)
3 years ago
Jiarui Fang 294a6060d0
[tensor] ZeRO use ColoTensor as the base class. (#828)
3 years ago
Ziyue Jiang 8e6fdb4f29
[tensor]fix test_linear (#826)
3 years ago
Ziyue Jiang 1a9e2c2dff
[tensor] fix kwargs in colo_tensor torch_funtion (#825)
3 years ago
Jiarui Fang eb1b89908c
[refactor] moving InsertPostInitMethodToModuleSubClasses to utils. (#824)
3 years ago
Jiarui Fang 2ecc3d7a55
[tensor] lazy init (#823)
3 years ago
Jiarui Fang 68dcd51d41
[Tensor] update ColoTensor torch_function (#822)
3 years ago
Jiarui Fang 0ce8924ceb
[tensor] reorganize files (#820)
3 years ago
Jiarui Fang ab962b9735
[gemini] a new tensor structure (#818)
3 years ago
FrankLeeeee 70ed11d07e
[cli] added check installation cli
3 years ago
YuliangLiu0306 c7eca40f51
Merge pull request #812 from FrankLeeeee/feature/cli
3 years ago
Jiarui Fang 3ddbd1bce1
[gemini] collect cpu-gpu moving volume in each iteration (#813)
3 years ago
FrankLeeeee d522cb704e
[cli] fixed single-node process launching
3 years ago
Jiarui Fang 61c20b44bc
[log] local throughput metrics (#811)
3 years ago
ver217 dd92b90a68
[DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext (#808)
3 years ago
Jiarui Fang 227d1cd4b3
[gemini] APIs to set cpu memory capacity (#809)
3 years ago
FrankLeeeee f63e91d280
[cli] fixed a bug in user args and refactored the module structure
3 years ago
Jiarui Fang e761ad2cd7
Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806)
3 years ago
HELSON 88759e289e
[zero] add ZeroTensorShardStrategy (#793)
3 years ago
Jiarui Fang 681addb512
[refactor] moving grad acc logic to engine (#804)
3 years ago
Frank Lee 05d9ae5999
[cli] add missing requirement (#805)
3 years ago
YuliangLiu0306 de2f581d43
[cli] added micro benchmarking for tp (#789)
3 years ago
YuliangLiu0306 cfadc9df8e
[cli] added distributed launcher command (#791)
3 years ago
Jiarui Fang 4d9332b4c5
[refactor] moving memtracer to gemini (#801)
3 years ago
Jiarui Fang 8711c706f4
[hotfix] fix grad offload when enabling reuse_fp16_shard
3 years ago
ver217 f1fa1a675f
fix grad offload when enabling reuse_fp16_shard
3 years ago
HELSON 4c4388c46e
[hotfix] fix memory leak in zero (#781)
3 years ago
Ziyue Jiang 4b01da24cd
[TP] change the check assert in split batch 2d (#772)
3 years ago