ColossalAI

Commit Graph

Author	SHA1	Message	Date
Hongxin Liu	5ce6c9d86f	[doc] add tutorial for cluster utils (#3763 ) * [doc] add en cluster utils doc * [doc] add zh cluster utils doc * [doc] add cluster utils doc in sidebar	2023-05-19 12:12:20 +08:00
Hongxin Liu	5452df63c5	[plugin] torch ddp plugin supports sharded model checkpoint (#3775 ) * [plugin] torch ddp plugin add save sharded model * [test] fix torch ddp ckpt io test * [test] fix torch ddp ckpt io test * [test] fix low level zero plugin test * [test] fix low level zero plugin test * [test] add debug info * [test] add debug info * [test] add debug info * [test] add debug info * [test] add debug info * [test] fix low level zero plugin test * [test] fix low level zero plugin test * [test] remove debug info	2023-05-18 20:05:59 +08:00
jiangmingyan	2703a37ac9	[amp] Add naive amp demo (#3774 ) * [mixed_precison] add naive amp demo * [mixed_precison] add naive amp demo	2023-05-18 16:33:14 +08:00
jiangmingyan	48bd056761	[doc] update hybrid parallelism doc (#3770 )	2023-05-18 14:16:13 +08:00
binmakeswell	15024e40d9	[auto] fix install cmd (#3772 )	2023-05-18 13:33:01 +08:00
jiangmingyan	d449525acf	[doc] update booster tutorials (#3718 ) * [booster] update booster tutorials#3717 * [booster] update booster tutorials#3717, fix * [booster] update booster tutorials#3717, update setup doc * [booster] update booster tutorials#3717, update setup doc * [booster] update booster tutorials#3717, update setup doc * [booster] update booster tutorials#3717, update setup doc * [booster] update booster tutorials#3717, update setup doc * [booster] update booster tutorials#3717, update setup doc * [booster] update booster tutorials#3717, rename colossalai booster.md * [booster] update booster tutorials#3717, rename colossalai booster.md * [booster] update booster tutorials#3717, rename colossalai booster.md * [booster] update booster tutorials#3717, fix * [booster] update booster tutorials#3717, fix * [booster] update tutorials#3717, update booster api doc * [booster] update tutorials#3717, modify file * [booster] update tutorials#3717, modify file * [booster] update tutorials#3717, modify file * [booster] update tutorials#3717, modify file * [booster] update tutorials#3717, modify file * [booster] update tutorials#3717, modify file * [booster] update tutorials#3717, modify file * [booster] update tutorials#3717, fix reference link * [booster] update tutorials#3717, fix reference link * [booster] update tutorials#3717, fix reference link * [booster] update tutorials#3717, fix reference link * [booster] update tutorials#3717, fix reference link * [booster] update tutorials#3717, fix reference link * [booster] update tutorials#3717, fix reference link * [booster] update tutorials#3713 * [booster] update tutorials#3713, modify file	2023-05-18 11:41:56 +08:00
Yuanchen	05759839bd	[chat] fix bugs in stage 3 training (#3759 ) Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>	2023-05-17 17:44:05 +08:00
Hongxin Liu	5dd573c6b6	[devops] fix ci for document check (#3751 ) * [doc] add test info * [devops] update doc check ci * [devops] add debug info * [devops] add debug info * [devops] add debug info * [devops] add debug info * [devops] add debug info * [devops] add debug info * [devops] add debug info * [devops] add debug info * [devops] add debug info * [devops] add debug info * [devops] remove debug info and update invalid doc * [devops] add essential comments	2023-05-17 11:24:22 +08:00
Hongxin Liu	c03bd7c6b2	[devops] make build on PR run automatically (#3748 ) * [devops] make build on PR run automatically * [devops] update build on pr condition	2023-05-17 11:17:37 +08:00
digger yu	1baeb39c72	[NFC] fix typo with colossalai/auto_parallel/tensor_shard (#3742 ) * fix typo applications/ and colossalai/ date 5.11 * fix typo colossalai/	2023-05-17 11:13:23 +08:00
Ziyue Jiang	7386c6669d	[fix] Add init to fix import error when importing _analyzer (#3668 )	2023-05-16 16:56:35 +08:00
wukong1992	6050f37776	[booster] removed models that don't support fsdp (#3744 ) Co-authored-by: 纪少敏 <jishaomin@jishaomindeMBP.lan>	2023-05-15 19:35:21 +08:00
Hongxin Liu	afb239bbf8	[devops] update torch version of CI (#3725 ) * [test] fix flop tensor test * [test] fix autochunk test * [test] fix lazyinit test * [devops] update torch version of CI * [devops] enable testmon * [devops] fix ci * [devops] fix ci * [test] fix checkpoint io test * [test] fix cluster test * [test] fix timm test * [devops] fix ci * [devops] fix ci * [devops] fix ci * [devops] fix ci * [devops] force sync to test ci * [test] skip fsdp test	2023-05-15 17:20:56 +08:00
wukong1992	b37797ed3d	[booster] support torch fsdp plugin in booster (#3697 ) Co-authored-by: 纪少敏 <jishaomin@jishaomindeMBP.lan>	2023-05-15 12:14:38 +08:00
digger-yu	ad6460cf2c	[NFC] fix typo applications/ and colossalai/ (#3735 )	2023-05-15 11:46:25 +08:00
digger-yu	1f73609adb	[CI] fix typo with tests/ etc. (#3727 ) * fix spelling error with examples/comminity/ * fix spelling error with tests/ * fix some spelling error with tests/ colossalai/ etc. * fix spelling error with tests/ etc. date:2023.5.10	2023-05-11 16:30:58 +08:00
digger-yu	899aa86368	[CI] fix typo with tests components (#3695 ) * fix spelling error with examples/comminity/ * fix spelling error with tests/	2023-05-11 11:10:28 +08:00
digger-yu	b7141c36dd	[CI] fix some spelling errors (#3707 ) * fix spelling error with examples/comminity/ * fix spelling error with tests/ * fix some spelling error with tests/ colossalai/ etc.	2023-05-10 17:12:03 +08:00
MisterLin1995	f7361ee1bd	[chat] fix community example ray (#3719 ) Co-authored-by: jiangwen <zxl265370@antgroup.com>	2023-05-10 13:36:09 +08:00
jiangmingyan	20068ba188	[booster] add tests for ddp and low level zero's checkpointio (#3715 ) * [booster] update tests for booster * [booster] update tests for booster * [booster] update tests for booster * [booster] update tests for booster * [booster] update tests for booster * [booster] update booster tutorials#3717, fix recursive check	2023-05-10 12:17:02 +08:00
Hongxin Liu	6552cbf8e1	[booster] fix no_sync method (#3709 ) * [booster] fix no_sync method * [booster] add test for ddp no_sync * [booster] fix merge * [booster] update unit test * [booster] update unit test * [booster] update unit test	2023-05-09 11:10:02 +08:00
Hongxin Liu	3bf09efe74	[booster] update prepare dataloader method for plugin (#3706 ) * [booster] add prepare dataloader method for plug * [booster] update examples and docstr	2023-05-08 15:44:03 +08:00
Hongxin Liu	f83ea813f5	[example] add train resnet/vit with booster example (#3694 ) * [example] add train vit with booster example * [example] update readme * [example] add train resnet with booster example * [example] enable ci * [example] enable ci * [example] add requirements * [hotfix] fix analyzer init * [example] update requirements	2023-05-08 10:42:30 +08:00
YH	2629f9717d	[tensor] Refactor handle_trans_spec in DistSpecManager	2023-05-06 17:55:37 +08:00
zhang-yi-chi	2da5d81dec	[chat] fix train_prompts.py gemini strategy bug (#3666 ) * fix gemini strategy bug * add comment * add comment * better solution	2023-05-06 16:46:38 +08:00
Hongxin Liu	d556648885	[example] add finetune bert with booster example (#3693 )	2023-05-06 11:53:13 +08:00
digger-yu	65bdc3159f	fix some spelling error with applications/Chat/examples/ (#3692 ) * fix spelling error with examples/comminity/ * fix spelling error with example/	2023-05-06 11:27:23 +08:00
Hongxin Liu	d0915f54f4	[booster] refactor all dp fashion plugins (#3684 ) * [booster] add dp plugin base * [booster] inherit dp plugin base * [booster] refactor unit tests	2023-05-05 19:36:10 +08:00
digger-yu	b49020c1b1	[CI] Update test_sharded_optim_with_sync_bn.py (#3688 ) fix spelling error in line23 change "cudnn_determinstic"=True to "cudnn_deterministic=True"	2023-05-05 18:57:27 +08:00
Tong Li	b36e67cb2b	Merge pull request #3680 from digger-yu/digger-yu-patch-2 fix spelling error with applications/Chat/evaluate/	2023-05-05 16:26:04 +08:00
jiangmingyan	307894f74d	[booster] gemini plugin support shard checkpoint (#3610 ) * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint --------- Co-authored-by: luchen <luchen@luchendeMBP.lan> Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>	2023-05-05 14:37:21 +08:00
Camille Zhong	0f785cb1f3	[chat] PPO stage3 doc enhancement (#3679 ) * Add RoBERTa for RLHF Stage 2 & 3 (test) RoBERTa for RLHF Stage 2 & 3 (still in testing) Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)" This reverts commit `06741d894d`. Add RoBERTa for RLHF stage 2 & 3 1. add roberta folder under model folder 2. add roberta option in train_reward_model.py 3. add some test in testci Update test_ci.sh Revert "Update test_ci.sh" This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a. Add RoBERTa for RLHF Stage 2 & 3 (test) RoBERTa for RLHF Stage 2 & 3 (still in testing) Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)" This reverts commit `06741d894d`. Add RoBERTa for RLHF stage 2 & 3 1. add roberta folder under model folder 2. add roberta option in train_reward_model.py 3. add some test in testci Update test_ci.sh Revert "Update test_ci.sh" This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a. update roberta with coati chat ci update Revert "chat ci update" This reverts commit 17ae7ae01fa752bd3289fc39069868fde99cf846. * Update README.md Update README.md * update readme * Update test_ci.sh * update readme and add a script update readme and add a script modify readme Update README.md	2023-05-05 13:36:56 +08:00
digger-yu	6650daeb0a	[doc] fix chat spelling error (#3671 ) * Update README.md change "huggingaface" to "huggingface" * Update README.md change "Colossa-AI" to "Colossal-AI"	2023-05-05 11:37:35 +08:00
Hongxin Liu	7bd0bee8ea	[chat] add opt attn kernel (#3655 ) * [chat] add opt attn kernel * [chat] disable xformer during fwd	2023-05-04 16:03:33 +08:00
digger-yu	8ba7858753	Update generate_gpt35_answers.py fix spelling error with generate_gpt35_answers.py	2023-05-04 15:34:16 +08:00
digger-yu	bfbf650588	fix spelling error fix spelling error with evaluate.py	2023-05-04 15:31:09 +08:00
tanitna	1a60dc07a8	[chat] typo accimulation_steps -> accumulation_steps (#3662 )	2023-04-28 15:42:57 +08:00
Tong Li	816add7e7f	Merge pull request #3656 from TongLi3701/chat/update_eval [Chat]: Remove unnecessary step and update documentation	2023-04-28 14:07:44 +08:00
binmakeswell	268b3cd80d	[chat] set default zero2 strategy (#3667 ) * [chat] set default gemini strategy * [chat] set default zero2 strategy * [chat] set default zero2 strategy	2023-04-28 13:56:50 +08:00
Tong Li	c1a355940e	update readme	2023-04-28 11:56:35 +08:00
Tong Li	ed3eaa6922	update documentation	2023-04-28 11:49:21 +08:00
Tong Li	c419117329	update questions and readme	2023-04-27 19:04:26 +08:00
Tong Li	aa77ddae33	remove unnecessary step and update readme	2023-04-27 18:51:58 +08:00
YH	a22407cc02	[zero] Suggests a minor change to confusing variable names in the ZeRO optimizer. (#3173 ) * Fix confusing variable name in zero opt * Apply lint * Fix util func * Fix minor util func * Fix zero param optimizer name	2023-04-27 18:43:14 +08:00
Hongxin Liu	842768a174	[chat] refactor model save/load logic (#3654 ) * [chat] strategy refactor unwrap model * [chat] strategy refactor save model * [chat] add docstr * [chat] refactor trainer save model * [chat] fix strategy typing * [chat] refactor trainer save model * [chat] update readme * [chat] fix unit test	2023-04-27 18:41:49 +08:00
Hongxin Liu	6ef7011462	[chat] remove lm model class (#3653 ) * [chat] refactor lora * [chat] remove lm class * [chat] refactor save model * [chat] refactor train sft * [chat] fix ci * [chat] fix ci	2023-04-27 15:37:38 +08:00
Camille Zhong	8bccb72c8d	[Doc] enhancement on README.md for chat examples (#3646 ) * Add RoBERTa for RLHF Stage 2 & 3 (test) RoBERTa for RLHF Stage 2 & 3 (still in testing) Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)" This reverts commit `06741d894d`. Add RoBERTa for RLHF stage 2 & 3 1. add roberta folder under model folder 2. add roberta option in train_reward_model.py 3. add some test in testci Update test_ci.sh Revert "Update test_ci.sh" This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a. Add RoBERTa for RLHF Stage 2 & 3 (test) RoBERTa for RLHF Stage 2 & 3 (still in testing) Revert "Add RoBERTa for RLHF Stage 2 & 3 (test)" This reverts commit `06741d894d`. Add RoBERTa for RLHF stage 2 & 3 1. add roberta folder under model folder 2. add roberta option in train_reward_model.py 3. add some test in testci Update test_ci.sh Revert "Update test_ci.sh" This reverts commit 9c7352b81766f3177d31eeec0ec178a301df966a. update roberta with coati chat ci update Revert "chat ci update" This reverts commit 17ae7ae01fa752bd3289fc39069868fde99cf846. * Update README.md Update README.md * update readme * Update test_ci.sh	2023-04-27 14:26:19 +08:00
Hongxin Liu	2a951955ad	[chat] refactor trainer (#3648 ) * [chat] ppo trainer remove useless args * [chat] update examples * [chat] update benchmark * [chat] update examples * [chat] fix sft training with wandb * [chat] polish docstr	2023-04-26 18:11:49 +08:00
Hongxin Liu	f8288315d9	[chat] polish performance evaluator (#3647 )	2023-04-26 17:34:59 +08:00
Hongxin Liu	50793b35f4	[gemini] accelerate inference (#3641 ) * [gemini] support don't scatter after inference * [chat] update colossalai strategy * [chat] fix opt benchmark * [chat] update opt benchmark * [gemini] optimize inference * [test] add gemini inference test * [chat] fix unit test ci * [chat] fix ci * [chat] fix ci * [chat] skip checkpoint test	2023-04-26 16:32:40 +08:00

2337 Commits (5ce6c9d86fe667d7ef5cd70a106b88073b640c20)