Commit Graph

2524 Commits (b1c2901530daa4f9e5a8008170b915a201e59415)

Author SHA1 Message Date
wukong1992 b37797ed3d
[booster] support torch fsdp plugin in booster (#3697)
Co-authored-by: 纪少敏 <jishaomin@jishaomindeMBP.lan>
2023-05-15 12:14:38 +08:00
digger-yu ad6460cf2c
[NFC] fix typo applications/ and colossalai/ (#3735) 2023-05-15 11:46:25 +08:00
digger-yu 1f73609adb
[CI] fix typo with tests/ etc. (#3727)
* fix spelling error with examples/comminity/

* fix spelling error with tests/

* fix some spelling error with tests/ colossalai/ etc.

* fix spelling error with tests/ etc. date:2023.5.10
2023-05-11 16:30:58 +08:00
digger-yu 899aa86368
[CI] fix typo with tests components (#3695)
* fix spelling error with examples/comminity/

* fix spelling error with tests/
2023-05-11 11:10:28 +08:00
digger-yu b7141c36dd
[CI] fix some spelling errors (#3707)
* fix spelling error with examples/comminity/

* fix spelling error with tests/

* fix some spelling error with tests/ colossalai/ etc.
2023-05-10 17:12:03 +08:00
MisterLin1995 f7361ee1bd
[chat] fix community example ray (#3719)
Co-authored-by: jiangwen <zxl265370@antgroup.com>
2023-05-10 13:36:09 +08:00
jiangmingyan 20068ba188
[booster] add tests for ddp and low level zero's checkpointio (#3715)
* [booster] update tests for booster

* [booster] update tests for booster

* [booster] update tests for booster

* [booster] update tests for booster

* [booster] update tests for booster

* [booster] update booster tutorials#3717, fix recursive check
2023-05-10 12:17:02 +08:00
Hongxin Liu 6552cbf8e1
[booster] fix no_sync method (#3709)
* [booster] fix no_sync method

* [booster] add test for ddp no_sync

* [booster] fix merge

* [booster] update unit test

* [booster] update unit test

* [booster] update unit test
2023-05-09 11:10:02 +08:00
Hongxin Liu 3bf09efe74
[booster] update prepare dataloader method for plugin (#3706)
* [booster] add prepare dataloader method for plug

* [booster] update examples and docstr
2023-05-08 15:44:03 +08:00
Hongxin Liu f83ea813f5
[example] add train resnet/vit with booster example (#3694)
* [example] add train vit with booster example

* [example] update readme

* [example] add train resnet with booster example

* [example] enable ci

* [example] enable ci

* [example] add requirements

* [hotfix] fix analyzer init

* [example] update requirements
2023-05-08 10:42:30 +08:00
YH 2629f9717d
[tensor] Refactor handle_trans_spec in DistSpecManager 2023-05-06 17:55:37 +08:00
zhang-yi-chi 2da5d81dec
[chat] fix train_prompts.py gemini strategy bug (#3666)
* fix gemini strategy bug

* add comment

* add comment

* better solution
2023-05-06 16:46:38 +08:00
Hongxin Liu d556648885
[example] add finetune bert with booster example (#3693) 2023-05-06 11:53:13 +08:00
digger-yu 65bdc3159f
fix some spelling error with applications/Chat/examples/ (#3692)
* fix spelling error with examples/comminity/

* fix spelling error with example/
2023-05-06 11:27:23 +08:00
Hongxin Liu d0915f54f4
[booster] refactor all dp fashion plugins (#3684)
* [booster] add dp plugin base

* [booster] inherit dp plugin base

* [booster] refactor unit tests
2023-05-05 19:36:10 +08:00
digger-yu b49020c1b1
[CI] Update test_sharded_optim_with_sync_bn.py (#3688)
fix spelling error in line23
change "cudnn_determinstic"=True to "cudnn_deterministic=True"
2023-05-05 18:57:27 +08:00
Tong Li b36e67cb2b
Merge pull request #3680 from digger-yu/digger-yu-patch-2
fix spelling error with applications/Chat/evaluate/
2023-05-05 16:26:04 +08:00
jiangmingyan 307894f74d
[booster] gemini plugin support shard checkpoint (#3610)
* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

---------

Co-authored-by: luchen <luchen@luchendeMBP.lan>
Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>
2023-05-05 14:37:21 +08:00
Camille Zhong 0f785cb1f3
[chat] PPO stage3 doc enhancement (#3679)
* Add RoBERTa for RLHF Stage 2 & 3 (test)

RoBERTa for RLHF Stage 2 & 3 (still in testing)

Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)"

This reverts commit 06741d894d.

Add RoBERTa for RLHF stage 2 & 3

1. add roberta folder under model folder
2. add  roberta option in train_reward_model.py
3. add some test in testci

Update test_ci.sh

Revert "Update test_ci.sh"

This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a.

Add RoBERTa for RLHF Stage 2 & 3 (test)

RoBERTa for RLHF Stage 2 & 3 (still in testing)

Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)"

This reverts commit 06741d894d.

Add RoBERTa for RLHF stage 2 & 3

1. add roberta folder under model folder
2. add  roberta option in train_reward_model.py
3. add some test in testci

Update test_ci.sh

Revert "Update test_ci.sh"

This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a.

update roberta with coati

chat ci update

Revert "chat ci update"

This reverts commit 17ae7ae01fa752bd3289fc39069868fde99cf846.

* Update README.md

Update README.md

* update readme

* Update test_ci.sh

* update readme and add a script

update readme and add a script

modify readme

Update README.md
2023-05-05 13:36:56 +08:00
digger-yu 6650daeb0a
[doc] fix chat spelling error (#3671)
* Update README.md

change "huggingaface" to "huggingface"

* Update README.md

change "Colossa-AI" to "Colossal-AI"
2023-05-05 11:37:35 +08:00
Hongxin Liu 7bd0bee8ea
[chat] add opt attn kernel (#3655)
* [chat] add opt attn kernel

* [chat] disable xformer during fwd
2023-05-04 16:03:33 +08:00
digger-yu 8ba7858753
Update generate_gpt35_answers.py
fix spelling error with generate_gpt35_answers.py
2023-05-04 15:34:16 +08:00
digger-yu bfbf650588
fix spelling error
fix spelling error with evaluate.py
2023-05-04 15:31:09 +08:00
tanitna 1a60dc07a8
[chat] typo accimulation_steps -> accumulation_steps (#3662) 2023-04-28 15:42:57 +08:00
Tong Li 816add7e7f
Merge pull request #3656 from TongLi3701/chat/update_eval
[Chat]: Remove unnecessary step and update documentation
2023-04-28 14:07:44 +08:00
binmakeswell 268b3cd80d
[chat] set default zero2 strategy (#3667)
* [chat] set default gemini strategy

* [chat] set default zero2 strategy

* [chat] set default zero2 strategy
2023-04-28 13:56:50 +08:00
Tong Li c1a355940e update readme 2023-04-28 11:56:35 +08:00
Tong Li ed3eaa6922 update documentation 2023-04-28 11:49:21 +08:00
Tong Li c419117329 update questions and readme 2023-04-27 19:04:26 +08:00
Tong Li aa77ddae33 remove unnecessary step and update readme 2023-04-27 18:51:58 +08:00
YH a22407cc02
[zero] Suggests a minor change to confusing variable names in the ZeRO optimizer. (#3173)
* Fix confusing variable name in zero opt

* Apply lint

* Fix util func

* Fix minor util func

* Fix zero param optimizer name
2023-04-27 18:43:14 +08:00
Hongxin Liu 842768a174
[chat] refactor model save/load logic (#3654)
* [chat] strategy refactor unwrap model

* [chat] strategy refactor save model

* [chat] add docstr

* [chat] refactor trainer save model

* [chat] fix strategy typing

* [chat] refactor trainer save model

* [chat] update readme

* [chat] fix unit test
2023-04-27 18:41:49 +08:00
Hongxin Liu 6ef7011462
[chat] remove lm model class (#3653)
* [chat] refactor lora

* [chat] remove lm class

* [chat] refactor save model

* [chat] refactor train sft

* [chat] fix ci

* [chat] fix ci
2023-04-27 15:37:38 +08:00
Camille Zhong 8bccb72c8d
[Doc] enhancement on README.md for chat examples (#3646)
* Add RoBERTa for RLHF Stage 2 & 3 (test)

RoBERTa for RLHF Stage 2 & 3 (still in testing)

Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)"

This reverts commit 06741d894d.

Add RoBERTa for RLHF stage 2 & 3

1. add roberta folder under model folder
2. add  roberta option in train_reward_model.py
3. add some test in testci

Update test_ci.sh

Revert "Update test_ci.sh"

This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a.

Add RoBERTa for RLHF Stage 2 & 3 (test)

RoBERTa for RLHF Stage 2 & 3 (still in testing)

Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)"

This reverts commit 06741d894d.

Add RoBERTa for RLHF stage 2 & 3

1. add roberta folder under model folder
2. add  roberta option in train_reward_model.py
3. add some test in testci

Update test_ci.sh

Revert "Update test_ci.sh"

This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a.

update roberta with coati

chat ci update

Revert "chat ci update"

This reverts commit 17ae7ae01fa752bd3289fc39069868fde99cf846.

* Update README.md

Update README.md

* update readme

* Update test_ci.sh
2023-04-27 14:26:19 +08:00
Hongxin Liu 2a951955ad
[chat] refactor trainer (#3648)
* [chat] ppo trainer remove useless args

* [chat] update examples

* [chat] update benchmark

* [chat] update examples

* [chat] fix sft training with wandb

* [chat] polish docstr
2023-04-26 18:11:49 +08:00
Hongxin Liu f8288315d9
[chat] polish performance evaluator (#3647) 2023-04-26 17:34:59 +08:00
Hongxin Liu 50793b35f4
[gemini] accelerate inference (#3641)
* [gemini] support don't scatter after inference

* [chat] update colossalai strategy

* [chat] fix opt benchmark

* [chat] update opt benchmark

* [gemini] optimize inference

* [test] add gemini inference test

* [chat] fix unit test ci

* [chat] fix ci

* [chat] fix ci

* [chat] skip checkpoint test
2023-04-26 16:32:40 +08:00
Hongxin Liu 4b3240cb59
[booster] add low level zero plugin (#3594)
* [booster] add low level zero plugin

* [booster] fix gemini plugin test

* [booster] fix precision

* [booster] add low level zero plugin test

* [test] fix booster plugin test oom

* [test] fix booster plugin test oom

* [test] fix googlenet and inception output trans

* [test] fix diffuser clip vision model

* [test] fix torchaudio_wav2vec2_base

* [test] fix low level zero plugin test
2023-04-26 14:37:25 +08:00
digger-yu b9a8dff7e5
[doc] Fix typo under colossalai and doc(#3618)
* Fixed several spelling errors under colossalai

* Fix the spelling error in colossalai and docs directory

* Cautious Changed the spelling error under the example folder

* Update runtime_preparation_pass.py

revert autograft to autograd

* Update search_chunk.py

utile to until

* Update check_installation.py

change misteach to mismatch in line 91

* Update 1D_tensor_parallel.md

revert to perceptron

* Update 2D_tensor_parallel.md

revert to perceptron in line 73

* Update 2p5D_tensor_parallel.md

revert to perceptron in line 71

* Update 3D_tensor_parallel.md

revert to perceptron in line 80

* Update README.md

revert to resnet in line 42

* Update reorder_graph.py

revert to indice in line 7

* Update p2p.py

revert to megatron in line 94

* Update initialize.py

revert to torchrun in line 198

* Update routers.py

change to detailed in line 63

* Update routers.py

change to detailed in line 146

* Update README.md

revert  random number in line 402
2023-04-26 11:38:43 +08:00
Tong Li e1b0a78afa
Merge pull request #3621 from zhang-yi-chi/fix/chat-train-prompts-single-gpu
[chat] fix single gpu training bug in examples/train_prompts.py
2023-04-24 22:13:54 +08:00
ddobokki df309fc6ab
[Chat] Remove duplicate functions (#3625) 2023-04-24 12:23:15 +08:00
Hongxin Liu 179558a87a
[devops] fix chat ci (#3628) 2023-04-24 10:55:14 +08:00
zhang-yi-chi 739cfe3360 [chat] fix enable single gpu training bug 2023-04-22 14:16:08 +08:00
digger-yu d7bf284706
[chat] polish code note typo (#3612) 2023-04-20 17:22:15 +08:00
Yuanchen c4709d34cf
Chat evaluate (#3608)
Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
2023-04-20 11:12:24 +08:00
digger-yu 633bac2f58
[doc] .github/workflows/README.md (#3605)
Fixed several word spelling errors
change "compatiblity" to "compatibility" etc.
2023-04-20 10:36:28 +08:00
digger-yu becd3b0f54
[doc] fix setup.py typo (#3603)
Optimization Code
change "vairable" to "variable"
2023-04-19 17:28:15 +08:00
digger-yu 7570d9ae3d
[doc] fix op_builder/README.md (#3597)
Optimization Code
change "requries" to "requires"
2023-04-19 15:56:01 +08:00
Hongxin Liu 12eff9eb4c
[gemini] state dict supports fp16 (#3590)
* [gemini] save state dict support fp16

* [gemini] save state dict shard support fp16

* [gemini] fix state dict

* [gemini] fix state dict
2023-04-19 11:01:48 +08:00
github-actions[bot] d544ed4345
[bot] Automated submodule synchronization (#3596)
Co-authored-by: github-actions <github-actions@github.com>
2023-04-19 10:38:12 +08:00