Baizhou Zhang
44eab2b27f
[shardformer] support sharded checkpoint IO for models of HybridParallelPlugin ( #4506 )
...
* add APIs
* implement save_sharded_model
* add test for hybrid checkpointio
* implement naive loading for sharded model
* implement efficient sharded model loading
* open a new file for hybrid checkpoint_io
* small fix
* fix circular importing
* fix docstring
* arrange arguments and apis
* small fix
2023-08-25 22:04:57 +08:00
LuGY
1a49a5ea00
[zero] support shard optimizer state dict of zero ( #4194 )
...
* support shard optimizer of zero
* polish code
* support sync grad manually
2023-07-31 22:13:29 +08:00
Baizhou Zhang
c6f6005990
[checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin ( #4302 )
...
* sharded optimizer checkpoint for gemini plugin
* modify test to reduce testing time
* update doc
* fix bug when keep_gathered is true under GeminiPlugin
2023-07-21 14:39:01 +08:00
Baizhou Zhang
58913441a1
[checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin ( #4141 )
...
* [checkpointio] unsharded optimizer checkpoint for Gemini plugin
* [checkpointio] unsharded optimizer checkpoint for Gemini using all_gather
2023-07-07 16:33:06 +08:00
Frank Lee
c4b1b65931
[test] fixed tests failed due to dtensor change ( #4082 )
...
* [test] fixed tests failed due to dtensor change
* polish code
2023-07-04 16:05:01 +08:00
Baizhou Zhang
0bb0b481b4
[gemini] fix argument naming during chunk configuration searching
2023-06-25 13:34:15 +08:00
Frank Lee
a5883aa790
[test] fixed codefactor format report ( #4026 )
2023-06-16 18:23:02 +08:00
Baizhou Zhang
822c3d4d66
[checkpointio] sharded optimizer checkpoint for DDP plugin ( #4002 )
2023-06-16 14:14:05 +08:00
Baizhou Zhang
c9cff7e7fa
[checkpointio] General Checkpointing of Sharded Optimizers ( #3984 )
2023-06-15 15:21:26 +08:00
wukong1992
6b305a99d6
[booster] torch fsdp fix ckpt ( #3788 )
2023-05-23 16:58:45 +08:00
Hongxin Liu
3c07a2846e
[plugin] a workaround for zero plugins' optimizer checkpoint ( #3780 )
...
* [test] refactor torch ddp checkpoint test
* [plugin] update low level zero optim checkpoint
* [plugin] update gemini optim checkpoint
2023-05-19 19:42:31 +08:00
Hongxin Liu
5452df63c5
[plugin] torch ddp plugin supports sharded model checkpoint ( #3775 )
...
* [plugin] torch ddp plugin add save sharded model
* [test] fix torch ddp ckpt io test
* [test] fix torch ddp ckpt io test
* [test] fix low level zero plugin test
* [test] fix low level zero plugin test
* [test] add debug info
* [test] add debug info
* [test] add debug info
* [test] add debug info
* [test] add debug info
* [test] fix low level zero plugin test
* [test] fix low level zero plugin test
* [test] remove debug info
2023-05-18 20:05:59 +08:00
Hongxin Liu
afb239bbf8
[devops] update torch version of CI ( #3725 )
...
* [test] fix flop tensor test
* [test] fix autochunk test
* [test] fix lazyinit test
* [devops] update torch version of CI
* [devops] enable testmon
* [devops] fix ci
* [devops] fix ci
* [test] fix checkpoint io test
* [test] fix cluster test
* [test] fix timm test
* [devops] fix ci
* [devops] fix ci
* [devops] fix ci
* [devops] fix ci
* [devops] force sync to test ci
* [test] skip fsdp test
2023-05-15 17:20:56 +08:00
jiangmingyan
20068ba188
[booster] add tests for ddp and low level zero's checkpointio ( #3715 )
...
* [booster] update tests for booster
* [booster] update booster tutorials#3717, fix recursive check
2023-05-10 12:17:02 +08:00
jiangmingyan
307894f74d
[booster] gemini plugin support shard checkpoint ( #3610 )
...
* gemini plugin add shard checkpoint save/load
* gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
---------
Co-authored-by: luchen <luchen@luchendeMBP.lan>
Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>
2023-05-05 14:37:21 +08:00
jiangmingyan
52a933e175
[checkpoint] support huggingface style sharded checkpoint ( #3461 )
...
* [checkpoint] support huggingface style sharded checkpoint
---------
Co-authored-by: luchen <luchen@luchendeMBP.lan>
2023-04-06 16:23:39 +08:00
Frank Lee
80eba05b0a
[test] refactor tests with spawn ( #3452 )
...
* [test] added spawn decorator
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
2023-04-06 14:51:35 +08:00
Frank Lee
1beb85cc25
[checkpoint] refactored the API and added safetensors support ( #3427 )
...
* [checkpoint] refactored the API and added safetensors support
* polish code
2023-04-04 15:23:01 +08:00
Frank Lee
73d3e4d309
[booster] implemented the torch ddp + resnet example ( #3232 )
...
* [booster] implemented the torch ddp + resnet example
* polish code
2023-03-27 10:24:14 +08:00
Frank Lee
cd142fbefa
[api] implemented the checkpoint io module ( #3205 )
...
* [api] implemented the checkpoint io module
* polish code
* polish code
2023-03-23 10:53:17 +08:00