2382 Commits (5f79008c4a7a0e854f53d7f1c0b29ca7f411eeab)

Author SHA1 Message Date
jiangmingyan d449525acf
[doc] update booster tutorials (#3718)
* [booster] update booster tutorials#3717

* [booster] update booster tutorials#3717, fix

* [booster] update booster tutorials#3717, update setup doc

* [booster] update booster tutorials#3717, rename colossalai booster.md

* [booster] update booster tutorials#3717, fix

* [booster] update tutorials#3717, update booster api doc

* [booster] update tutorials#3717, modify file

* [booster] update tutorials#3717, fix reference link

* [booster] update tutorials#3713

* [booster] update tutorials#3713, modify file
2023-05-18 11:41:56 +08:00
Yuanchen 05759839bd
[chat] fix bugs in stage 3 training (#3759)
Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
2023-05-17 17:44:05 +08:00
Hongxin Liu 5dd573c6b6
[devops] fix ci for document check (#3751)
* [doc] add test info

* [devops] update doc check ci

* [devops] add debug info

* [devops] remove debug info and update invalid doc

* [devops] add essential comments
2023-05-17 11:24:22 +08:00
Hongxin Liu c03bd7c6b2
[devops] make build on PR run automatically (#3748)
* [devops] make build on PR run automatically

* [devops] update build on pr condition
2023-05-17 11:17:37 +08:00
digger yu 1baeb39c72
[NFC] fix typo with colossalai/auto_parallel/tensor_shard (#3742)
* fix typo applications/ and colossalai/ date 5.11

* fix typo colossalai/
2023-05-17 11:13:23 +08:00
Ziyue Jiang 7386c6669d
[fix] Add init to fix import error when importing _analyzer (#3668)
2023-05-16 16:56:35 +08:00
wukong1992 6050f37776
[booster] removed models that don't support fsdp (#3744)
Co-authored-by: 纪少敏 <jishaomin@jishaomindeMBP.lan>
2023-05-15 19:35:21 +08:00
Hongxin Liu afb239bbf8
[devops] update torch version of CI (#3725)
* [test] fix flop tensor test

* [test] fix autochunk test

* [test] fix lazyinit test

* [devops] update torch version of CI

* [devops] enable testmon

* [devops] fix ci

* [test] fix checkpoint io test

* [test] fix cluster test

* [test] fix timm test

* [devops] fix ci

* [devops] force sync to test ci

* [test] skip fsdp test
2023-05-15 17:20:56 +08:00
wukong1992 b37797ed3d
[booster] support torch fsdp plugin in booster (#3697)
Co-authored-by: 纪少敏 <jishaomin@jishaomindeMBP.lan>
2023-05-15 12:14:38 +08:00
digger-yu ad6460cf2c
[NFC] fix typo applications/ and colossalai/ (#3735)
2023-05-15 11:46:25 +08:00
digger-yu 1f73609adb
[CI] fix typo with tests/ etc. (#3727)
* fix spelling error with examples/comminity/

* fix spelling error with tests/

* fix some spelling error with tests/ colossalai/ etc.

* fix spelling error with tests/ etc. date:2023.5.10
2023-05-11 16:30:58 +08:00
digger-yu 899aa86368
[CI] fix typo with tests components (#3695)
* fix spelling error with examples/comminity/

* fix spelling error with tests/
2023-05-11 11:10:28 +08:00
digger-yu b7141c36dd
[CI] fix some spelling errors (#3707)
* fix spelling error with examples/comminity/

* fix spelling error with tests/

* fix some spelling error with tests/ colossalai/ etc.
2023-05-10 17:12:03 +08:00
MisterLin1995 f7361ee1bd
[chat] fix community example ray (#3719)
Co-authored-by: jiangwen <zxl265370@antgroup.com>
2023-05-10 13:36:09 +08:00
jiangmingyan 20068ba188
[booster] add tests for ddp and low level zero's checkpointio (#3715)
* [booster] update tests for booster

* [booster] update booster tutorials#3717, fix recursive check
2023-05-10 12:17:02 +08:00
Hongxin Liu 6552cbf8e1
[booster] fix no_sync method (#3709)
* [booster] fix no_sync method

* [booster] add test for ddp no_sync

* [booster] fix merge

* [booster] update unit test
2023-05-09 11:10:02 +08:00
Hongxin Liu 3bf09efe74
[booster] update prepare dataloader method for plugin (#3706)
* [booster] add prepare dataloader method for plug

* [booster] update examples and docstr
2023-05-08 15:44:03 +08:00
Hongxin Liu f83ea813f5
[example] add train resnet/vit with booster example (#3694)
* [example] add train vit with booster example

* [example] update readme

* [example] add train resnet with booster example

* [example] enable ci

* [example] add requirements

* [hotfix] fix analyzer init

* [example] update requirements
2023-05-08 10:42:30 +08:00
YH 2629f9717d
[tensor] Refactor handle_trans_spec in DistSpecManager
2023-05-06 17:55:37 +08:00
zhang-yi-chi 2da5d81dec
[chat] fix train_prompts.py gemini strategy bug (#3666)
* fix gemini strategy bug

* add comment

* better solution
2023-05-06 16:46:38 +08:00
Hongxin Liu d556648885
[example] add finetune bert with booster example (#3693)
2023-05-06 11:53:13 +08:00
digger-yu 65bdc3159f
fix some spelling error with applications/Chat/examples/ (#3692)
* fix spelling error with examples/comminity/

* fix spelling error with example/
2023-05-06 11:27:23 +08:00
Hongxin Liu d0915f54f4
[booster] refactor all dp fashion plugins (#3684)
* [booster] add dp plugin base

* [booster] inherit dp plugin base

* [booster] refactor unit tests
2023-05-05 19:36:10 +08:00
digger-yu b49020c1b1
[CI] Update test_sharded_optim_with_sync_bn.py (#3688)
fix spelling error in line 23
change "cudnn_determinstic=True" to "cudnn_deterministic=True"
2023-05-05 18:57:27 +08:00
Tong Li b36e67cb2b
Merge pull request #3680 from digger-yu/digger-yu-patch-2
fix spelling error with applications/Chat/evaluate/
2023-05-05 16:26:04 +08:00
jiangmingyan 307894f74d
[booster] gemini plugin support shard checkpoint (#3610)
* gemini plugin add shard checkpoint save/load

* gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

---------

Co-authored-by: luchen <luchen@luchendeMBP.lan>
Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>
2023-05-05 14:37:21 +08:00
Camille Zhong 0f785cb1f3
[chat] PPO stage3 doc enhancement (#3679)
* Add RoBERTa for RLHF Stage 2 & 3 (test)

RoBERTa for RLHF Stage 2 & 3 (still in testing)

Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)"

This reverts commit 06741d894d.

Add RoBERTa for RLHF stage 2 & 3

1. add roberta folder under model folder
2. add  roberta option in train_reward_model.py
3. add some test in testci

Update test_ci.sh

Revert "Update test_ci.sh"

This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a.

update roberta with coati

chat ci update

Revert "chat ci update"

This reverts commit 17ae7ae01fa752bd3289fc39069868fde99cf846.

* Update README.md

* update readme

* Update test_ci.sh

* update readme and add a script

modify readme

Update README.md
2023-05-05 13:36:56 +08:00
digger-yu 6650daeb0a
[doc] fix chat spelling error (#3671)
* Update README.md

change "huggingaface" to "huggingface"

* Update README.md

change "Colossa-AI" to "Colossal-AI"
2023-05-05 11:37:35 +08:00
Hongxin Liu 7bd0bee8ea
[chat] add opt attn kernel (#3655)
* [chat] add opt attn kernel

* [chat] disable xformer during fwd
2023-05-04 16:03:33 +08:00
digger-yu 8ba7858753
Update generate_gpt35_answers.py
fix spelling error with generate_gpt35_answers.py
2023-05-04 15:34:16 +08:00
digger-yu bfbf650588
fix spelling error
fix spelling error with evaluate.py
2023-05-04 15:31:09 +08:00
tanitna 1a60dc07a8
[chat] typo accimulation_steps -> accumulation_steps (#3662)
2023-04-28 15:42:57 +08:00
Tong Li 816add7e7f
Merge pull request #3656 from TongLi3701/chat/update_eval
[Chat]: Remove unnecessary step and update documentation
2023-04-28 14:07:44 +08:00
binmakeswell 268b3cd80d
[chat] set default zero2 strategy (#3667)
* [chat] set default gemini strategy

* [chat] set default zero2 strategy
2023-04-28 13:56:50 +08:00
Tong Li c1a355940e
update readme
2023-04-28 11:56:35 +08:00
Tong Li ed3eaa6922
update documentation
2023-04-28 11:49:21 +08:00
Tong Li c419117329
update questions and readme
2023-04-27 19:04:26 +08:00
Tong Li aa77ddae33
remove unnecessary step and update readme
2023-04-27 18:51:58 +08:00
YH a22407cc02
[zero] Suggests a minor change to confusing variable names in the ZeRO optimizer. (#3173)
* Fix confusing variable name in zero opt

* Apply lint

* Fix util func

* Fix minor util func

* Fix zero param optimizer name
2023-04-27 18:43:14 +08:00
Hongxin Liu 842768a174
[chat] refactor model save/load logic (#3654)
* [chat] strategy refactor unwrap model

* [chat] strategy refactor save model

* [chat] add docstr

* [chat] refactor trainer save model

* [chat] fix strategy typing

* [chat] refactor trainer save model

* [chat] update readme

* [chat] fix unit test
2023-04-27 18:41:49 +08:00
Hongxin Liu 6ef7011462
[chat] remove lm model class (#3653)
* [chat] refactor lora

* [chat] remove lm class

* [chat] refactor save model

* [chat] refactor train sft

* [chat] fix ci
2023-04-27 15:37:38 +08:00
Camille Zhong 8bccb72c8d
[Doc] enhancement on README.md for chat examples (#3646)
* Add RoBERTa for RLHF Stage 2 & 3 (test)

RoBERTa for RLHF Stage 2 & 3 (still in testing)

Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)"

This reverts commit 06741d894d.

Add RoBERTa for RLHF stage 2 & 3

1. add roberta folder under model folder
2. add  roberta option in train_reward_model.py
3. add some test in testci

Update test_ci.sh

Revert "Update test_ci.sh"

This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a.

update roberta with coati

chat ci update

Revert "chat ci update"

This reverts commit 17ae7ae01fa752bd3289fc39069868fde99cf846.

* Update README.md

* update readme

* Update test_ci.sh
2023-04-27 14:26:19 +08:00
Hongxin Liu 2a951955ad
[chat] refactor trainer (#3648)
* [chat] ppo trainer remove useless args

* [chat] update examples

* [chat] update benchmark

* [chat] update examples

* [chat] fix sft training with wandb

* [chat] polish docstr
2023-04-26 18:11:49 +08:00
Hongxin Liu f8288315d9
[chat] polish performance evaluator (#3647)
2023-04-26 17:34:59 +08:00
Hongxin Liu 50793b35f4
[gemini] accelerate inference (#3641)
* [gemini] support don't scatter after inference

* [chat] update colossalai strategy

* [chat] fix opt benchmark

* [chat] update opt benchmark

* [gemini] optimize inference

* [test] add gemini inference test

* [chat] fix unit test ci

* [chat] fix ci

* [chat] skip checkpoint test
2023-04-26 16:32:40 +08:00
Hongxin Liu 4b3240cb59
[booster] add low level zero plugin (#3594)
* [booster] add low level zero plugin

* [booster] fix gemini plugin test

* [booster] fix precision

* [booster] add low level zero plugin test

* [test] fix booster plugin test oom

* [test] fix googlenet and inception output trans

* [test] fix diffuser clip vision model

* [test] fix torchaudio_wav2vec2_base

* [test] fix low level zero plugin test
2023-04-26 14:37:25 +08:00
digger-yu b9a8dff7e5
[doc] Fix typo under colossalai and doc (#3618)
* Fixed several spelling errors under colossalai

* Fix the spelling error in colossalai and docs directory

* Cautiously changed the spelling errors under the example folder

* Update runtime_preparation_pass.py

revert autograft to autograd

* Update search_chunk.py

utile to until

* Update check_installation.py

change misteach to mismatch in line 91

* Update 1D_tensor_parallel.md

revert to perceptron

* Update 2D_tensor_parallel.md

revert to perceptron in line 73

* Update 2p5D_tensor_parallel.md

revert to perceptron in line 71

* Update 3D_tensor_parallel.md

revert to perceptron in line 80

* Update README.md

revert to resnet in line 42

* Update reorder_graph.py

revert to indice in line 7

* Update p2p.py

revert to megatron in line 94

* Update initialize.py

revert to torchrun in line 198

* Update routers.py

change to detailed in line 63

* Update routers.py

change to detailed in line 146

* Update README.md

revert  random number in line 402
2023-04-26 11:38:43 +08:00
Tong Li e1b0a78afa
Merge pull request #3621 from zhang-yi-chi/fix/chat-train-prompts-single-gpu
[chat] fix single gpu training bug in examples/train_prompts.py
2023-04-24 22:13:54 +08:00
ddobokki df309fc6ab
[Chat] Remove duplicate functions (#3625)
2023-04-24 12:23:15 +08:00
Hongxin Liu 179558a87a
[devops] fix chat ci (#3628)
2023-04-24 10:55:14 +08:00