digger yu
1aadeedeea
fix typo .github/workflows/scripts/ ( #3946 )
2 years ago
digger yu
e61ffc77c6
fix typo tests/ ( #3936 )
2 years ago
Frank Lee
bd1ab98158
[gemini] fixed the gemini checkpoint io ( #3934 )
2 years ago
FoolPlayer
bd2c7c3297
Merge pull request #3942 from hpcaitech/revert-3931-sync/develop-to-shardformer
Revert "[sync] sync feature/shardformer with develop"
2 years ago
Frank Lee
ddcf58cacf
Revert "[sync] sync feature/shardformer with develop"
2 years ago
FoolPlayer
24651fdd4f
Merge pull request #3931 from FrankLeeeee/sync/develop-to-shardformer
[sync] sync feature/shardformer with develop
2 years ago
Liu Ziming
e277534a18
Merge pull request #3905 from MaruyamaAya/dreambooth
[example] Adding an example of training dreambooth with the new booster API
2 years ago
Yuanchen
21c4c0b1a0
support UniEval and add CHRF metric ( #3924 )
Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
2 years ago
digger yu
33eef714db
fix typo examples and docs ( #3932 )
2 years ago
FoolPlayer
ef1537759c
[shardformer] add gpt2 policy and modify shard and slicer to support ( #3883 )
* add gpt2 policy and modify shard and slicer to support
* remove unused code
* polish code
2 years ago
FoolPlayer
6370a935f6
update README ( #3909 )
2 years ago
FoolPlayer
21a3915c98
[shardformer] add Dropout layer support different dropout pattern ( #3856 )
* add dropout layer, add dropout test
* modify seed manager as context manager
* add a copy of col_nn.layer
* add dist_crossentropy loss; separate module test
* polish the code
* fix dist crossentropy loss
2 years ago
FoolPlayer
997544c1f9
[shardformer] update readme with modules implement doc ( #3834 )
* update readme with modules content
* remove img
2 years ago
Frank Lee
537a52b7a2
[shardformer] refactored the user api ( #3828 )
* [shardformer] refactored the user api
* polish code
2 years ago
Frank Lee
bc19024bf9
[shardformer] updated readme ( #3827 )
2 years ago
FoolPlayer
58f6432416
[shardformer]: Feature/shardformer, add some docstring and readme ( #3816 )
* init shardformer code structure
* add implement of sharder (inject and replace)
* add implement of replace layer to colossal layer
* separate different layer policy, add some notion
* implement 1d and 2d slicer, can tell col or row
* fix bug when slicing and inject model
* fix some bug; add inference test example
* add share weight and train example
* add train
* add docstring and readme
* add docstring for other files
* pre-commit
2 years ago
FoolPlayer
6a69b44dfc
[shardformer] init shardformer code structure ( #3731 )
* init shardformer code structure
* add implement of sharder (inject and replace)
* add implement of replace layer to colossal layer
* separate different layer policy, add some notion
* implement 1d and 2d slicer, can tell col or row
* fix bug when slicing and inject model
* fix some bug; add inference test example
2 years ago
Maruyama_Aya
9b5e7ce21f
modify shell for check
2 years ago
Frank Lee
a98e16ed07
Merge pull request #3926 from hpcaitech/feature/dtensor
[feature] updated device mesh and dtensor
2 years ago
digger yu
407aa48461
fix typo examples/community/roberta ( #3925 )
2 years ago
Maruyama_Aya
730a092ba2
modify shell for check
2 years ago
Maruyama_Aya
49567d56d1
modify shell for check
2 years ago
Maruyama_Aya
039854b391
modify shell for check
2 years ago
Baizhou Zhang
e417dd004e
[example] update opt example using booster api ( #3918 )
2 years ago
Maruyama_Aya
cf4792c975
modify shell for check
2 years ago
Frank Lee
eb39154d40
[dtensor] updated api and doc ( #3845 )
2 years ago
Hongxin Liu
9166988d9b
[devops] update torch version in compatibility test ( #3919 )
2 years ago
digger yu
de0d7df33f
[nfc] fix typo colossalai/zero ( #3923 )
2 years ago
Hongxin Liu
12c90db3f3
[doc] add lazy init tutorial ( #3922 )
* [doc] add lazy init en doc
* [doc] add lazy init zh doc
* [doc] add lazy init doc in sidebar
* [doc] add lazy init doc test
* [doc] fix lazy init doc link
2 years ago
Maruyama_Aya
c94a33579b
modify shell for check
2 years ago
digger yu
a9d1cadc49
fix typo with colossalai/trainer utils zero ( #3908 )
2 years ago
Liu Ziming
b306cecf28
[example] Modify palm example with the new booster API ( #3913 )
* Modify torch version requirement to adapt torch 2.0
* modify palm example using new booster API
* roll back
* fix port
* polish
* polish
2 years ago
wukong1992
a55fb00c18
[booster] update bert example, using booster api ( #3885 )
2 years ago
Frank Lee
5e2132dcff
[workflow] added docker latest tag for release ( #3920 )
2 years ago
Hongxin Liu
c25d421f3e
[devops] hotfix testmon cache clean logic ( #3917 )
2 years ago
Frank Lee
d51e83d642
Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop
[sync] sync feature/dtensor with develop
2 years ago
Frank Lee
c622bb3630
Merge pull request #3915 from FrankLeeeee/update/develop
[sync] update develop with main
2 years ago
Hongxin Liu
9c88b6cbd1
[lazy] fix compatibility problem on torch 1.13 ( #3911 )
2 years ago
Maruyama_Aya
4fc8bc68ac
modify file path
2 years ago
Hongxin Liu
b5f0566363
[chat] add distributed PPO trainer ( #3740 )
* Detached ppo (#9 )
* run the base
* working on dist ppo
* sync
* detached trainer
* update detached trainer. no maker update function
* facing init problem
* 1 maker 1 trainer detached run. but no model update
* facing cuda problem
* fix save functions
* verified maker update
* nothing
* add ignore
* analyze loss issue
* remove some debug codes
* facing 2m1t stuck issue
* 2m1t verified
* do not use torchrun
* working on 2m2t
* working on 2m2t
* initialize strategy in ray actor env
* facing actor's init order issue
* facing ddp model update issue (need unwrap ddp)
* unwrap ddp actor
* checking 1m2t stuck problem
* nothing
* set timeout for trainer choosing. It solves the stuck problem!
* delete some debug output
* rename to sync with upstream
* rename to sync with upstream
* coati rename
* nothing
* I am going to detach the replaybuffer from trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronous buffer operations
* experience_maker_holder performs target-revolving _send_experience() instead of length comparison.
* move code to ray subfolder
* working on pipeline inference
* apply comments
* working on pipeline strategy. in progress.
* remove pipeline code. clean this branch
* update remote parameters by state_dict. no test
* nothing
* state_dict sharding transfer
* merge debug branch
* gemini _unwrap_model fix
* simplify code
* simplify code & fix LoRALinear AttributeError
* critic unwrapped state_dict
---------
Co-authored-by: csric <richcsr256@gmail.com>
* [chat] add performance evaluator and fix bugs (#10 )
* [chat] add performance evaluator for ray
* [chat] refactor debug arg
* [chat] support hf config
* [chat] fix generation
* [chat] add 1mmt dummy example
* [chat] fix gemini ckpt
* split experience to send (#11 )
Co-authored-by: csric <richcsr256@gmail.com>
* [chat] refactor trainer and maker (#12 )
* [chat] refactor experience maker holder
* [chat] refactor model init
* [chat] refactor trainer args
* [chat] refactor model init
* [chat] refactor trainer
* [chat] refactor experience sending logic and training loop args (#13 )
* [chat] refactor experience send logic
* [chat] refactor trainer
* [chat] refactor trainer
* [chat] refactor experience maker
* [chat] refactor pbar
* [chat] refactor example folder (#14 )
* [chat] support quant (#15 )
* [chat] add quant
* [chat] add quant example
* prompt example (#16 )
* prompt example
* prompt load csv data
* remove legacy try
---------
Co-authored-by: csric <richcsr256@gmail.com>
* [chat] add mmmt dummy example and refactor experience sending (#17 )
* [chat] add mmmt dummy example
* [chat] refactor naive strategy
* [chat] fix stuck problem
* [chat] fix naive strategy
* [chat] optimize experience maker sending logic
* [chat] refactor sending assignment
* [chat] refactor performance evaluator (#18 )
* Prompt Example & requires_grad state_dict & sharding state_dict (#19 )
* prompt example
* prompt load csv data
* remove legacy try
* maker models require_grad set to False
* working on zero redundancy update
* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.
* remove legacy examples
* remove legacy examples
* remove replay buffer tp state. bad design
---------
Co-authored-by: csric <richcsr256@gmail.com>
* state_dict sending adapts to new unwrap function (#20 )
* prompt example
* prompt load csv data
* remove legacy try
* maker models require_grad set to False
* working on zero redundancy update
* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.
* remove legacy examples
* remove legacy examples
* remove replay buffer tp state. bad design
* opt benchmark
* better script
* nothing
* [chat] strategy refactor unwrap model
* [chat] strategy refactor save model
* [chat] add docstr
* [chat] refactor trainer save model
* [chat] fix strategy typing
* [chat] refactor trainer save model
* [chat] update readme
* [chat] fix unit test
* working on lora reconstruction
* state_dict sending adapts to new unwrap function
* remove comments
---------
Co-authored-by: csric <richcsr256@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
* [chat-ray] add readme (#21 )
* add readme
* transparent graph
* add note background
---------
Co-authored-by: csric <richcsr256@gmail.com>
* [chat] get images from url (#22 )
* Refactor/chat ray (#23 )
* [chat] lora add todo
* [chat] remove unused pipeline strategy
* [chat] refactor example structure
* [chat] setup ci for ray
* [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24 )
* lora support prototype
* lora support
* 1mmt lora & remove useless code
---------
Co-authored-by: csric <richcsr256@gmail.com>
* [chat] fix test ci for ray
* [chat] fix test ci requirements for ray
* [chat] fix ray runtime env
* [chat] fix ray runtime env
* [chat] fix example ci docker args
* [chat] add debug info in trainer
* [chat] add nccl debug info
* [chat] skip ray test
* [doc] fix typo
---------
Co-authored-by: csric <59389055+CsRic@users.noreply.github.com>
Co-authored-by: csric <richcsr256@gmail.com>
2 years ago
Hongxin Liu
41fb7236aa
[devops] hotfix CI about testmon cache ( #3910 )
* [devops] hotfix CI about testmon cache
* [devops] fix testmon cache on pr
2 years ago
Maruyama_Aya
b4437e88c3
fixed port
2 years ago
Maruyama_Aya
79c9f776a9
fixed port
2 years ago
Maruyama_Aya
d3379f0be7
fixed model saving bugs
2 years ago
Maruyama_Aya
b29e1f0722
change directory
2 years ago
Maruyama_Aya
1c1f71cbd2
fixing insecure hash function
2 years ago
Maruyama_Aya
b56c7f4283
update shell file
2 years ago
Maruyama_Aya
176010f289
update performance evaluation
2 years ago
digger yu
0e484e6201
[nfc] fix typo colossalai/pipeline tensor nn ( #3899 )
* fix typo colossalai/autochunk auto_parallel amp
* fix typo colossalai/auto_parallel nn utils etc.
* fix typo colossalai/auto_parallel autochunk fx/passes etc.
* fix typo docs/
* change placememt_policy to placement_policy in docs/ and examples/
* fix typo colossalai/ applications/
* fix typo colossalai/cli fx kernel
* fix typo colossalai/nn
* revert change warmuped
* fix typo colossalai/pipeline tensor nn
2 years ago
Baizhou Zhang
c1535ccbba
[doc] fix docs about booster api usage ( #3898 )
2 years ago