Baizhou Zhang
822c3d4d66
[checkpointio] sharded optimizer checkpoint for DDP plugin (#4002)
1 year ago
Wenhao Chen
725af3eeeb
[booster] make optimizer argument optional for boost (#3993)
* feat: make optimizer optional in Booster.boost
* test: skip unet test if diffusers version > 0.10.2
1 year ago
Baizhou Zhang
c9cff7e7fa
[checkpointio] General Checkpointing of Sharded Optimizers (#3984)
1 year ago
digger yu
d4fb7bfda7
fix typo applications/Chat/coati/ (#3947)
1 year ago
Baizhou Zhang
e8ad3c88f5
[doc] add a note about unit-testing to CONTRIBUTING.md (#3970)
1 year ago
Yuanchen
2925f47399
[evaluate] support gpt evaluation with reference (#3972)
Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
1 year ago
Frank Lee
8bcad73677
[workflow] fixed the directory check in build (#3980)
1 year ago
Wenhao Chen
9d02590c9a
[chat] refactor actor class (#3968)
* refactor: separate log_probs fn from Actor forward fn
* refactor: separate generate fn from Actor class
* feat: update unwrap_model and get_base_model
* unwrap_model returns model not wrapped by Strategy
* get_base_model returns HF model for Actor, Critic and RewardModel
* feat: simplify Strategy.prepare
* style: remove get_base_model method of Actor
* perf: tokenize text in batches
* refactor: move calc_action_log_probs to utils of model
* test: update test with new forward fn
* style: rename forward fn args
* fix: do not unwrap model in save_model fn of naive strategy
* test: add gemini test for train_prompts
* fix: fix _set_default_generate_kwargs
1 year ago
Frank Lee
2bf6547ad7
Merge pull request #3967 from ver217/update-develop
[sync] update develop branch with main
1 year ago
Frank Lee
6718a2f285
[workflow] cancel duplicated workflow jobs (#3960)
1 year ago
Frank Lee
71fe52769c
[gemini] fixed the gemini checkpoint io (#3934)
1 year ago
Baizhou Zhang
b3ab7fbabf
[example] update ViT example using booster api (#3940)
1 year ago
Frank Lee
4110d1f0d4
[workflow] cancel duplicated workflow jobs (#3960)
1 year ago
digger yu
1aadeedeea
fix typo .github/workflows/scripts/ (#3946)
1 year ago
digger yu
e61ffc77c6
fix typo tests/ (#3936)
1 year ago
Frank Lee
bd1ab98158
[gemini] fixed the gemini checkpoint io (#3934)
1 year ago
FoolPlayer
bd2c7c3297
Merge pull request #3942 from hpcaitech/revert-3931-sync/develop-to-shardformer
Revert "[sync] sync feature/shardformer with develop"
1 year ago
Frank Lee
ddcf58cacf
Revert "[sync] sync feature/shardformer with develop"
1 year ago
FoolPlayer
24651fdd4f
Merge pull request #3931 from FrankLeeeee/sync/develop-to-shardformer
[sync] sync feature/shardformer with develop
1 year ago
Liu Ziming
e277534a18
Merge pull request #3905 from MaruyamaAya/dreambooth
[example] Adding an example of training dreambooth with the new booster API
1 year ago
Yuanchen
21c4c0b1a0
support UniEval and add CHRF metric (#3924)
Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
1 year ago
digger yu
33eef714db
fix typo examples and docs (#3932)
1 year ago
FoolPlayer
ef1537759c
[shardformer] add gpt2 policy and modify shard and slicer to support (#3883)
* add gpt2 policy and modify shard and slicer to support
* remove unused code
* polish code
1 year ago
FoolPlayer
6370a935f6
update README (#3909)
1 year ago
FoolPlayer
21a3915c98
[shardformer] add Dropout layer support different dropout pattern (#3856)
* add dropout layer, add dropout test
* modify seed manager as context manager
* add a copy of col_nn.layer
* add dist_crossentropy loss; separate module test
* polish the code
* fix dist crossentropy loss
1 year ago
FoolPlayer
997544c1f9
[shardformer] update readme with modules implement doc (#3834)
* update readme with modules content
* remove img
1 year ago
Frank Lee
537a52b7a2
[shardformer] refactored the user api (#3828)
* [shardformer] refactored the user api
* polish code
1 year ago
Frank Lee
bc19024bf9
[shardformer] updated readme (#3827)
1 year ago
FoolPlayer
58f6432416
[shardformer]: Feature/shardformer, add some docstring and readme (#3816)
* init shardformer code structure
* add implement of sharder (inject and replace)
* add implement of replace layer to colossal layer
* separate different layer policy, add some notion
* implement 1d and 2d slicer, can tell col or row
* fix bug when slicing and inject model
* fix some bug; add inference test example
* add share weight and train example
* add train
* add docstring and readme
* add docstring for other files
* pre-commit
1 year ago
FoolPlayer
6a69b44dfc
[shardformer] init shardformer code structure (#3731)
* init shardformer code structure
* add implement of sharder (inject and replace)
* add implement of replace layer to colossal layer
* separate different layer policy, add some notion
* implement 1d and 2d slicer, can tell col or row
* fix bug when slicing and inject model
* fix some bug; add inference test example
1 year ago
Maruyama_Aya
9b5e7ce21f
modify shell for check
1 year ago
Frank Lee
a98e16ed07
Merge pull request #3926 from hpcaitech/feature/dtensor
[feature] updated device mesh and dtensor
1 year ago
digger yu
407aa48461
fix typo examples/community/roberta (#3925)
1 year ago
Maruyama_Aya
730a092ba2
modify shell for check
1 year ago
Maruyama_Aya
49567d56d1
modify shell for check
1 year ago
Maruyama_Aya
039854b391
modify shell for check
1 year ago
Baizhou Zhang
e417dd004e
[example] update opt example using booster api (#3918)
1 year ago
Maruyama_Aya
cf4792c975
modify shell for check
1 year ago
Frank Lee
eb39154d40
[dtensor] updated api and doc (#3845)
1 year ago
Hongxin Liu
9166988d9b
[devops] update torch version in compability test (#3919)
1 year ago
digger yu
de0d7df33f
[nfc] fix typo colossalai/zero (#3923)
1 year ago
Hongxin Liu
12c90db3f3
[doc] add lazy init tutorial (#3922)
* [doc] add lazy init en doc
* [doc] add lazy init zh doc
* [doc] add lazy init doc in sidebar
* [doc] add lazy init doc test
* [doc] fix lazy init doc link
1 year ago
Maruyama_Aya
c94a33579b
modify shell for check
1 year ago
digger yu
a9d1cadc49
fix typo with colossalai/trainer utils zero (#3908)
1 year ago
Liu Ziming
b306cecf28
[example] Modify palm example with the new booster API (#3913)
* Modify torch version requirement to adapt torch 2.0
* modify palm example using new booster API
* roll back
* fix port
* polish
* polish
1 year ago
wukong1992
a55fb00c18
[booster] update bert example, using booster api (#3885)
1 year ago
Frank Lee
5e2132dcff
[workflow] added docker latest tag for release (#3920)
1 year ago
Hongxin Liu
c25d421f3e
[devops] hotfix testmon cache clean logic (#3917)
1 year ago
Frank Lee
d51e83d642
Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop
[sync] sync feature/dtensor with develop
1 year ago
Frank Lee
c622bb3630
Merge pull request #3915 from FrankLeeeee/update/develop
[sync] update develop with main
1 year ago