Hongxin Liu
c25d421f3e
[devops] hotfix testmon cache clean logic ( #3917 )
2023-06-07 12:39:12 +08:00
Frank Lee
d51e83d642
Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop
...
[sync] sync feature/dtensor with develop
2023-06-07 11:50:43 +08:00
Frank Lee
c622bb3630
Merge pull request #3915 from FrankLeeeee/update/develop
...
[sync] update develop with main
2023-06-07 11:45:11 +08:00
Hongxin Liu
9c88b6cbd1
[lazy] fix compatibility problem with torch 1.13 ( #3911 )
2023-06-07 11:10:12 +08:00
Maruyama_Aya
4fc8bc68ac
modify file path
2023-06-07 11:02:19 +08:00
Hongxin Liu
b5f0566363
[chat] add distributed PPO trainer ( #3740 )
...
* Detached ppo (#9 )
* run the base
* working on dist ppo
* sync
* detached trainer
* update detached trainer. no maker update function
* facing init problem
* 1 maker 1 trainer detached run. but no model update
* facing cuda problem
* fix save functions
* verified maker update
* nothing
* add ignore
* analyze loss issue
* remove some debug codes
* facing 2m1t stuck issue
* 2m1t verified
* do not use torchrun
* working on 2m2t
* working on 2m2t
* initialize strategy in ray actor env
* facing actor's init order issue
* facing ddp model update issue (need to unwrap ddp)
* unwrap ddp actor
* checking 1m2t stuck problem
* nothing
* set timeout for trainer choosing. It solves the stuck problem!
* delete some debug output
* rename to sync with upstream
* rename to sync with upstream
* coati rename
* nothing
* I am going to detach the replay buffer from the trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronous buffer operations (see the sketch after this entry)
* experience_maker_holder performs target-revolving _send_experience() instead of length comparison.
* move code to ray subfolder
* working on pipeline inference
* apply comments
* working on pipeline strategy. in progress.
* remove pipeline code. clean this branch
* update remote parameters by state_dict. no test
* nothing
* state_dict sharding transfer
* merge debug branch
* gemini _unwrap_model fix
* simplify code
* simplify code & fix LoRALinear AttributeError
* critic unwrapped state_dict
---------
Co-authored-by: csric <richcsr256@gmail.com>
* [chat] add performance evaluator and fix bugs (#10 )
* [chat] add performance evaluator for ray
* [chat] refactor debug arg
* [chat] support hf config
* [chat] fix generation
* [chat] add 1mmt dummy example
* [chat] fix gemini ckpt
* split experience to send (#11 )
Co-authored-by: csric <richcsr256@gmail.com>
* [chat] refactor trainer and maker (#12 )
* [chat] refactor experience maker holder
* [chat] refactor model init
* [chat] refactor trainer args
* [chat] refactor model init
* [chat] refactor trainer
* [chat] refactor experience sending logic and training loop args (#13 )
* [chat] refactor experience send logic
* [chat] refactor trainer
* [chat] refactor trainer
* [chat] refactor experience maker
* [chat] refactor pbar
* [chat] refactor example folder (#14 )
* [chat] support quant (#15 )
* [chat] add quant
* [chat] add quant example
* prompt example (#16 )
* prompt example
* prompt load csv data
* remove legacy try
---------
Co-authored-by: csric <richcsr256@gmail.com>
* [chat] add mmmt dummy example and refactor experience sending (#17 )
* [chat] add mmmt dummy example
* [chat] refactor naive strategy
* [chat] fix stuck problem
* [chat] fix naive strategy
* [chat] optimize experience maker sending logic
* [chat] refactor sending assignment
* [chat] refactor performance evaluator (#18 )
* Prompt Example & requires_grad state_dict & sharding state_dict (#19 )
* prompt example
* prompt load csv data
* remove legacy try
* maker models requires_grad set to False
* working on zero redundancy update
* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.
* remove legacy examples
* remove legacy examples
* remove replay buffer tp state. bad design
---------
Co-authored-by: csric <richcsr256@gmail.com>
* state_dict sending adapts to new unwrap function (#20 )
* prompt example
* prompt load csv data
* remove legacy try
* maker models requires_grad set to False
* working on zero redundancy update
* mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad.
* remove legacy examples
* remove legacy examples
* remove replay buffer tp state. bad design
* opt benchmark
* better script
* nothing
* [chat] strategy refactor unwrap model
* [chat] strategy refactor save model
* [chat] add docstr
* [chat] refactor trainer save model
* [chat] fix strategy typing
* [chat] refactor trainer save model
* [chat] update readme
* [chat] fix unit test
* working on lora reconstruction
* state_dict sending adapts to new unwrap function
* remove comments
---------
Co-authored-by: csric <richcsr256@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
* [chat-ray] add readme (#21 )
* add readme
* transparent graph
* add note background
---------
Co-authored-by: csric <richcsr256@gmail.com>
* [chat] get images from url (#22 )
* Refactor/chat ray (#23 )
* [chat] lora add todo
* [chat] remove unused pipeline strategy
* [chat] refactor example structure
* [chat] setup ci for ray
* [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24 )
* lora support prototype
* lora support
* 1mmt lora & remove useless code
---------
Co-authored-by: csric <richcsr256@gmail.com>
* [chat] fix test ci for ray
* [chat] fix test ci requirements for ray
* [chat] fix ray runtime env
* [chat] fix ray runtime env
* [chat] fix example ci docker args
* [chat] add debug info in trainer
* [chat] add nccl debug info
* [chat] skip ray test
* [doc] fix typo
---------
Co-authored-by: csric <59389055+CsRic@users.noreply.github.com>
Co-authored-by: csric <richcsr256@gmail.com>
2023-06-07 10:41:16 +08:00
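The most explanatory bullet in #3740 above is the one about detaching the replay buffer into a Ray Actor. Below is a minimal sketch of that idea, assuming only that ray is installed; DetachedReplayBuffer and its methods are hypothetical stand-ins, not coati's actual classes:

```python
# Minimal sketch of a replay buffer detached from the trainer and hosted
# as a Ray actor (the idea behind #3740). Names here are hypothetical.
import random
import ray

@ray.remote
class DetachedReplayBuffer:
    """Lives in its own process, so experience makers can append
    asynchronously while (possibly tensor-parallel) trainers sample."""

    def __init__(self, limit: int = 10_000):
        self.items = []
        self.limit = limit

    def append(self, experience) -> None:
        self.items.append(experience)
        if len(self.items) > self.limit:
            self.items.pop(0)  # drop the oldest experience when full

    def sample(self, batch_size: int):
        return random.sample(self.items, min(batch_size, len(self.items)))

ray.init()
buf = DetachedReplayBuffer.remote(limit=1024)
buf.append.remote({"prompt": "...", "reward": 0.5})  # maker: fire-and-forget
batch = ray.get(buf.sample.remote(8))                # trainer: blocking fetch
```

Because `append` is a fire-and-forget remote call, makers never block on trainers; that is the asynchronous buffer operation the commit message refers to, and a buffer living in its own process can also serve a tensor-parallel trainer group.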
Hongxin Liu
41fb7236aa
[devops] hotfix CI about testmon cache ( #3910 )
...
* [devops] hotfix CI about testmon cache
* [devops] fix testmon cache on pr
2023-06-06 18:58:58 +08:00
Maruyama_Aya
b4437e88c3
fixed port
2023-06-06 16:21:38 +08:00
Maruyama_Aya
79c9f776a9
fixed port
2023-06-06 16:20:45 +08:00
Maruyama_Aya
d3379f0be7
fixed model saving bugs
2023-06-06 16:07:34 +08:00
Maruyama_Aya
b29e1f0722
change directory
2023-06-06 15:50:03 +08:00
Maruyama_Aya
1c1f71cbd2
fixing insecure hash function
2023-06-06 14:51:11 +08:00
Maruyama_Aya
b56c7f4283
update shell file
2023-06-06 14:09:27 +08:00
Maruyama_Aya
176010f289
update performance evaluation
2023-06-06 14:08:22 +08:00
digger yu
0e484e6201
[nfc] fix typo colossalai/pipeline tensor nn ( #3899 )
...
* fix typo colossalai/autochunk auto_parallel amp
* fix typo colossalai/auto_parallel nn utils etc.
* fix typo colossalai/auto_parallel autochunk fx/passes etc.
* fix typo docs/
* change placememt_policy to placement_policy in docs/ and examples/
* fix typo colossalai/ applications/
* fix typo colossalai/cli fx kernel
* fix typo colossalai/nn
* revert change warmuped
* fix typo colossalai/pipeline tensor nn
2023-06-06 14:07:36 +08:00
Baizhou Zhang
c1535ccbba
[doc] fix docs about booster api usage ( #3898 )
2023-06-06 13:36:11 +08:00
Hongxin Liu
ec9bbc0094
[devops] improving testmon cache ( #3902 )
...
* [devops] improving testmon cache
* [devops] fix branch name with slash
* [devops] fix branch name with slash
* [devops] fix edit action
* [devops] fix edit action
* [devops] fix edit action
* [devops] fix edit action
* [devops] fix edit action
* [devops] fix edit action
* [devops] update readme
2023-06-06 11:32:31 +08:00
Yuanchen
57a6d7685c
support evaluation for English ( #3880 )
...
Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
2023-06-05 21:24:21 +08:00
digger yu
1878749753
[nfc] fix typo colossalai/nn ( #3887 )
...
* fix typo colossalai/autochunk auto_parallel amp
* fix typo colossalai/auto_parallel nn utils etc.
* fix typo colossalai/auto_parallel autochunk fx/passes etc.
* fix typo docs/
* change placememt_policy to placement_policy in docs/ and examples/
* fix typo colossalai/ applications/
* fix typo colossalai/cli fx kernel
* fix typo colossalai/nn
* revert change warmuped
2023-06-05 16:04:27 +08:00
Hongxin Liu
ae02d4e4f7
[bf16] add bf16 support ( #3882 )
...
* [bf16] add bf16 support for fused adam (#3844 )
* [bf16] fused adam kernel support bf16
* [test] update fused adam kernel test
* [test] update fused adam test
* [bf16] cpu adam and hybrid adam optimizers support bf16 (#3860 )
* [bf16] implement mixed precision mixin and add bf16 support for low level zero (#3869 ) (sketched after this entry)
* [bf16] add mixed precision mixin
* [bf16] low level zero optim support bf16
* [test] update low level zero test
* [test] fix low level zero grad acc test
* [bf16] add bf16 support for gemini (#3872 )
* [bf16] gemini support bf16
* [test] update gemini bf16 test
* [doc] update gemini docstring
* [bf16] add bf16 support for plugins (#3877 )
* [bf16] add bf16 support for legacy zero (#3879 )
* [zero] init context support bf16
* [zero] legacy zero support bf16
* [test] add zero bf16 test
* [doc] add bf16 related docstring for legacy zero
2023-06-05 15:58:31 +08:00
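The mixed precision mixin bullet above (#3869) is easiest to see in a sketch. This is a hedged illustration of the generic fp32-master/bf16-working-copy pattern in plain PyTorch; the class name and hook methods are illustrative, not ColossalAI's actual interface:

```python
import torch

class BF16MixedPrecisionMixin:
    """Keep fp32 master weights; run forward/backward in bf16."""

    def __init__(self, working_params):
        self.working_params = list(working_params)        # bf16, used by the model
        self.master_params = [p.detach().clone().float()  # fp32 copies for the update
                              for p in self.working_params]

    def pre_step(self):
        # move bf16 grads onto the fp32 masters before optimizer.step()
        for w, m in zip(self.working_params, self.master_params):
            if w.grad is not None:
                m.grad = w.grad.detach().float()

    def post_step(self):
        # copy the updated fp32 masters back into the bf16 working weights
        with torch.no_grad():
            for w, m in zip(self.working_params, self.master_params):
                w.copy_(m)
```

In this sketch the optimizer is built over `master_params`, and the training loop calls `pre_step()` before and `post_step()` after `optimizer.step()`. Unlike fp16, bf16 shares fp32's exponent range, so the pattern typically needs no loss scaler.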
jiangmingyan
07cb21142f
[doc] update MoE Chinese document ( #3890 )
...
* [doc]update-moe
* [doc]update-moe
* [doc]update-moe
* [doc]update-moe
* [doc]update-moe
2023-06-05 15:57:54 +08:00
Liu Ziming
8065cc5fba
Modify torch version requirement to adapt to torch 2.0 ( #3896 )
2023-06-05 15:57:35 +08:00
Hongxin Liu
dbb32692d2
[lazy] refactor lazy init ( #3891 )
...
* [lazy] remove old lazy init
* [lazy] refactor lazy init folder structure
* [lazy] fix lazy tensor deepcopy
* [test] update lazy init test
2023-06-05 14:20:47 +08:00
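For context on what the lazy init refactored in #3891 defers: the core idea is to record parameter shapes without allocating storage and to materialize tensors later. A minimal illustration with plain PyTorch meta tensors (not ColossalAI's actual LazyInitContext or lazy tensor):

```python
import torch
import torch.nn as nn

# Construct on the meta device: shapes and dtypes are recorded,
# but no parameter memory is allocated yet.
layer = nn.Linear(4096, 4096, device="meta")
assert layer.weight.is_meta

# Materialize later on a real device, then (re)initialize the weights,
# since to_empty() leaves the storage uninitialized.
layer = layer.to_empty(device="cpu")
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)
```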
Maruyama_Aya
25447d4407
modify path
2023-06-05 11:47:07 +08:00
Maruyama_Aya
42e3232bc0
roll back
2023-06-02 17:00:57 +08:00
Maruyama_Aya
60ec33bb18
Add a new example of Dreambooth training using the booster API
2023-06-02 16:50:51 +08:00
digger yu
70c8cdecf4
[nfc] fix typo colossalai/cli fx kernel ( #3847 )
...
* fix typo colossalai/autochunk auto_parallel amp
* fix typo colossalai/auto_parallel nn utils etc.
* fix typo colossalai/auto_parallel autochunk fx/passes etc.
* fix typo docs/
* change placememt_policy to placement_policy in docs/ and examples/
* fix typo colossalai/ applications/
* fix typo colossalai/cli fx kernel
2023-06-02 15:02:45 +08:00
Maruyama_Aya
46503c35dd
Modify torch version requirement to adapt to torch 2.0
2023-06-01 14:30:51 +08:00
jiangmingyan
281b33f362
[doc] update document of zero with chunk. ( #3855 )
...
* [doc] fix title of mixed precision
* [doc]update document of zero with chunk
* [doc] update document of zero with chunk, fix
* [doc] update document of zero with chunk, fix
* [doc] update document of zero with chunk, fix
* [doc] update document of zero with chunk, add doc test
* [doc] update document of zero with chunk, add doc test
* [doc] update document of zero with chunk, fix installation
* [doc] update document of zero with chunk, fix zero with chunk doc
* [doc] update document of zero with chunk, fix zero with chunk doc
2023-05-30 18:41:56 +08:00
jiangmingyan
5f79008c4a
[example] update gemini examples ( #3868 )
...
* [example]update gemini examples
* [example]update gemini examples
2023-05-30 18:41:41 +08:00
Yuanchen
2506e275b8
[evaluation] improvement on evaluation ( #3862 )
...
* fix a bug when the config file contains one category but the answer file doesn't contain that category
* fix Chinese prompt file
* support gpt-3.5-turbo and gpt-4 evaluation
* polish and update README
* resolve pr comments
---------
Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
2023-05-30 11:48:41 +08:00
jiangmingyan
b0474878bf
[doc] update nvme offload documents. ( #3850 )
2023-05-26 01:22:01 +08:00
Frank Lee
ae959a72a5
[workflow] fixed workflow check for docker build ( #3849 )
2023-05-25 16:42:34 +08:00
Frank Lee
d42b1be09d
[release] bump to v0.3.0 ( #3830 )
2023-05-25 16:20:07 +08:00
digger yu
e2d81eba0d
[nfc] fix typo colossalai/ applications/ ( #3831 )
...
* fix typo colossalai/autochunk auto_parallel amp
* fix typo colossalai/auto_parallel nn utils etc.
* fix typo colossalai/auto_parallel autochunk fx/passes etc.
* fix typo docs/
* change placememt_policy to placement_policy in docs/ and examples/
* fix typo colossalai/ applications/
2023-05-25 16:19:41 +08:00
jiangmingyan
a64df3fa97
[doc] update document of gemini instruction. ( #3842 )
...
* [doc] update meet_gemini.md
* [doc] update meet_gemini.md
* [doc] fix parentheses
* [doc] fix parentheses
* [doc] fix doc test
* [doc] fix doc test
* [doc] fix doc
2023-05-25 14:58:01 +08:00
Frank Lee
54e97ed7ea
[workflow] supported test on CUDA 10.2 ( #3841 )
2023-05-25 14:14:34 +08:00
wukong1992
3229f93e30
[booster] add warning for torch fsdp plugin doc ( #3833 )
2023-05-25 14:00:02 +08:00
Hongxin Liu
7c9f2ed6dd
[dtensor] polish sharding spec docstring ( #3838 )
...
* [dtensor] polish sharding spec docstring
* [dtensor] polish sharding spec example docstring
2023-05-25 13:09:42 +08:00
Frank Lee
84500b7799
[workflow] fixed testmon cache in build CI ( #3806 )
...
* [workflow] fixed testmon cache in build CI
* polish code
2023-05-24 14:59:40 +08:00
digger yu
518b31c059
[docs] change placememt_policy to placement_policy ( #3829 )
...
* fix typo colossalai/autochunk auto_parallel amp
* fix typo colossalai/auto_parallel nn utils etc.
* fix typo colossalai/auto_parallel autochunk fx/passes etc.
* fix typo docs/
* change placememt_policy to placement_policy in docs/ and examples/
2023-05-24 14:51:49 +08:00
digger yu
e90fdb1000
fix typo docs/
2023-05-24 13:57:43 +08:00
Yuanchen
34966378e8
[evaluation] add automatic evaluation pipeline ( #3821 )
...
* add functions for gpt evaluation
* add automatic eval
Update eval.py
* using jload and modify the type of answers1 and answers2
* Update eval.py
Update eval.py
* Update evaluator.py
* support gpt evaluation
* update readme.md
update README.md
update README.md
modify readme.md
* add Chinese example for config, battle prompt and evaluation prompt file
* remove GPT-4 config
* remove sample folder
---------
Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com>
2023-05-24 11:18:23 +08:00
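The GPT-evaluation commits (#3821 here and the follow-up #3862 above) boil down to asking a GPT model to grade answers. A hedged sketch of that idea using the openai 0.x SDK that was current in mid-2023; the prompt wording and helper name are illustrative, not the repository's eval.py:

```python
import openai  # the 0.x SDK, e.g. pip install openai==0.27.*

def gpt_evaluate(question: str, answer: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask a GPT model to grade one answer on a 1-10 scale."""
    prompt = (
        "You are a strict evaluator. Rate the following answer to the "
        "question on a scale of 1 to 10 and explain briefly.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic grading
    )
    return response["choices"][0]["message"]["content"]
```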
Frank Lee
05b8a8de58
[workflow] changed doc build to be on schedule and release ( #3825 )
...
* [workflow] changed doc build to be on schedule and release
* polish code
2023-05-24 10:50:19 +08:00
Yanming W
269150b6f4
[Docker] Fix a couple of build issues ( #3691 )
2023-05-24 10:22:51 +08:00
digger yu
7f8203af69
fix typo colossalai/auto_parallel autochunk fx/passes etc. ( #3808 )
2023-05-24 09:01:50 +08:00
jiangmingyan
725365f297
Merge pull request #3810 from jiangmingyan/amp
...
[doc] update amp document
2023-05-23 18:58:16 +08:00
jiangmingyan
278fcbc444
[doc] fix
2023-05-23 17:53:11 +08:00
jiangmingyan
8aa1fb2c7f
[doc] fix
2023-05-23 17:50:30 +08:00
Frank Lee
1e3b64f26c
[workflow] enabled doc build from a forked repo ( #3815 )
2023-05-23 17:49:53 +08:00