Commit Graph

19 Commits (1edc9b5fb3f8a9c9ec5d71a62bb33914a0d5f0c4)

Author SHA1 Message Date
LuGY 1a49a5ea00 [zero] support shard optimizer state dict of zero (#4194)
* support shard optimizer of zero

* polish code

* support sync grad manually
2023-07-31 22:13:29 +08:00
Baizhou Zhang c6f6005990
[checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin (#4302)
* sharded optimizer checkpoint for gemini plugin

* modify test to reduce testing time

* update doc

* fix bug when keep_gatherd is true under GeminiPlugin
2023-07-21 14:39:01 +08:00
Baizhou Zhang 58913441a1
Next commit [checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin (#4141)
* [checkpointio] unsharded optimizer checkpoint for Gemini plugin

* [checkpointio] unsharded optimizer checkpoint for Gemini using all_gather
2023-07-07 16:33:06 +08:00
Frank Lee c4b1b65931 [test] fixed tests failed due to dtensor change (#4082)
* [test] fixed tests failed due to dtensor change

* polish code
2023-07-04 16:05:01 +08:00
Baizhou Zhang 0bb0b481b4 [gemini] fix argument naming during chunk configuration searching 2023-06-25 13:34:15 +08:00
Frank Lee a5883aa790
[test] fixed codefactor format report (#4026) 2023-06-16 18:23:02 +08:00
Baizhou Zhang 822c3d4d66
[checkpointio] sharded optimizer checkpoint for DDP plugin (#4002) 2023-06-16 14:14:05 +08:00
Baizhou Zhang c9cff7e7fa
[checkpointio] General Checkpointing of Sharded Optimizers (#3984) 2023-06-15 15:21:26 +08:00
wukong1992 6b305a99d6
[booster] torch fsdp fix ckpt (#3788) 2023-05-23 16:58:45 +08:00
Hongxin Liu 3c07a2846e
[plugin] a workaround for zero plugins' optimizer checkpoint (#3780)
* [test] refactor torch ddp checkpoint test

* [plugin] update low level zero optim checkpoint

* [plugin] update gemini optim checkpoint
2023-05-19 19:42:31 +08:00
Hongxin Liu 5452df63c5
[plugin] torch ddp plugin supports sharded model checkpoint (#3775)
* [plugin] torch ddp plugin add save sharded model

* [test] fix torch ddp ckpt io test

* [test] fix torch ddp ckpt io test

* [test] fix low level zero plugin test

* [test] fix low level zero plugin test

* [test] add debug info

* [test] add debug info

* [test] add debug info

* [test] add debug info

* [test] add debug info

* [test] fix low level zero plugin test

* [test] fix low level zero plugin test

* [test] remove debug info
2023-05-18 20:05:59 +08:00
Hongxin Liu afb239bbf8
[devops] update torch version of CI (#3725)
* [test] fix flop tensor test

* [test] fix autochunk test

* [test] fix lazyinit test

* [devops] update torch version of CI

* [devops] enable testmon

* [devops] fix ci

* [devops] fix ci

* [test] fix checkpoint io test

* [test] fix cluster test

* [test] fix timm test

* [devops] fix ci

* [devops] fix ci

* [devops] fix ci

* [devops] fix ci

* [devops] force sync to test ci

* [test] skip fsdp test
2023-05-15 17:20:56 +08:00
jiangmingyan 20068ba188
[booster] add tests for ddp and low level zero's checkpointio (#3715)
* [booster] update tests for booster

* [booster] update tests for booster

* [booster] update tests for booster

* [booster] update tests for booster

* [booster] update tests for booster

* [booster] update booster tutorials#3717, fix recursive check
2023-05-10 12:17:02 +08:00
jiangmingyan 307894f74d
[booster] gemini plugin support shard checkpoint (#3610)
* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

---------

Co-authored-by: luchen <luchen@luchendeMBP.lan>
Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>
2023-05-05 14:37:21 +08:00
jiangmingyan 52a933e175
[checkpoint] support huggingface style sharded checkpoint (#3461)
* [checkpoint] support huggingface style sharded checkpoint

* [checkpoint] support huggingface style sharded checkpoint

* [checkpoint] support huggingface style sharded checkpoint

* [checkpoint] support huggingface style sharded checkpoint

* [checkpoint] support huggingface style sharded checkpoint

---------

Co-authored-by: luchen <luchen@luchendeMBP.lan>
2023-04-06 16:23:39 +08:00
Frank Lee 80eba05b0a
[test] refactor tests with spawn (#3452)
* [test] added spawn decorator

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code
2023-04-06 14:51:35 +08:00
Frank Lee 1beb85cc25
[checkpoint] refactored the API and added safetensors support (#3427)
* [checkpoint] refactored the API and added safetensors support

* polish code
2023-04-04 15:23:01 +08:00
Frank Lee 73d3e4d309
[booster] implemented the torch ddd + resnet example (#3232)
* [booster] implemented the torch ddd + resnet example

* polish code
2023-03-27 10:24:14 +08:00
Frank Lee cd142fbefa
[api] implemented the checkpoint io module (#3205)
* [api] implemented the checkpoint io module

* polish code

* polish code
2023-03-23 10:53:17 +08:00