ColossalAI

Commit Graph

Author	SHA1	Message	Date
Frank Lee	537a52b7a2	[shardformer] refactored the user api (#3828 ) * [shardformer] refactored the user api * polish code	2023-06-08 15:01:34 +08:00
Frank Lee	bc19024bf9	[shardformer] updated readme (#3827 )	2023-06-08 15:01:34 +08:00
FoolPlayer	58f6432416	[shardformer]: Feature/shardformer, add some docstring and readme (#3816 ) * init shardformer code structure * add implement of sharder (inject and replace) * add implement of replace layer to colossal layer * separate different layer policy, add some notion * implement 1d and 2d slicer, can tell col or row * fix bug when slicing and inject model * fix some bug; add inference test example * add share weight and train example * add train * add docstring and readme * add docstring for other files * pre-commit	2023-06-08 15:01:34 +08:00
FoolPlayer	6a69b44dfc	[shardformer] init shardformer code structure (#3731 ) * init shardformer code structure * add implement of sharder (inject and replace) * add implement of replace layer to colossal layer * separate different layer policy, add some notion * implement 1d and 2d slicer, can tell col or row * fix bug when slicing and inject model * fix some bug; add inference test example	2023-06-08 15:01:34 +08:00
Maruyama_Aya	9b5e7ce21f	modify shell for check	2023-06-08 14:56:56 +08:00
Frank Lee	a98e16ed07	Merge pull request #3926 from hpcaitech/feature/dtensor [feature] updated device mesh and dtensor	2023-06-08 14:39:40 +08:00
digger yu	407aa48461	fix typo examples/community/roberta (#3925 )	2023-06-08 14:28:34 +08:00
Maruyama_Aya	730a092ba2	modify shell for check	2023-06-08 13:38:18 +08:00
Maruyama_Aya	49567d56d1	modify shell for check	2023-06-08 13:36:05 +08:00
Maruyama_Aya	039854b391	modify shell for check	2023-06-08 13:17:58 +08:00
Baizhou Zhang	e417dd004e	[example] update opt example using booster api (#3918 )	2023-06-08 11:27:05 +08:00
Maruyama_Aya	cf4792c975	modify shell for check	2023-06-08 11:15:10 +08:00
Frank Lee	eb39154d40	[dtensor] updated api and doc (#3845 )	2023-06-08 10:18:17 +08:00
Hongxin Liu	9166988d9b	[devops] update torch version in compability test (#3919 )	2023-06-08 09:29:32 +08:00
digger yu	de0d7df33f	[nfc] fix typo colossalai/zero (#3923 )	2023-06-08 00:01:29 +08:00
Hongxin Liu	12c90db3f3	[doc] add lazy init tutorial (#3922 ) * [doc] add lazy init en doc * [doc] add lazy init zh doc * [doc] add lazy init doc in sidebar * [doc] add lazy init doc test * [doc] fix lazy init doc link	2023-06-07 17:59:58 +08:00
Maruyama_Aya	c94a33579b	modify shell for check	2023-06-07 17:23:01 +08:00
digger yu	a9d1cadc49	fix typo with colossalai/trainer utils zero (#3908 )	2023-06-07 16:08:37 +08:00
Liu Ziming	b306cecf28	[example] Modify palm example with the new booster API (#3913 ) * Modify torch version requirement to adapt torch 2.0 * modify palm example using new booster API * roll back * fix port * polish * polish	2023-06-07 16:05:00 +08:00
wukong1992	a55fb00c18	[booster] update bert example, using booster api (#3885 )	2023-06-07 15:51:00 +08:00
Frank Lee	5e2132dcff	[workflow] added docker latest tag for release (#3920 )	2023-06-07 15:37:37 +08:00
Hongxin Liu	c25d421f3e	[devops] hotfix testmon cache clean logic (#3917 )	2023-06-07 12:39:12 +08:00
Frank Lee	d51e83d642	Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop [sync] sync feature/dtensor with develop	2023-06-07 11:50:43 +08:00
Frank Lee	c622bb3630	Merge pull request #3915 from FrankLeeeee/update/develop [sync] update develop with main	2023-06-07 11:45:11 +08:00
Hongxin Liu	9c88b6cbd1	[lazy] fix compatibility problem on torch 1.13 (#3911 )	2023-06-07 11:10:12 +08:00
Maruyama_Aya	4fc8bc68ac	modify file path	2023-06-07 11:02:19 +08:00
Hongxin Liu	b5f0566363	[chat] add distributed PPO trainer (#3740 ) * Detached ppo (#9) * run the base * working on dist ppo * sync * detached trainer * update detached trainer. no maker update function * facing init problem * 1 maker 1 trainer detached run. but no model update * facing cuda problem * fix save functions * verified maker update * nothing * add ignore * analyize loss issue * remove some debug codes * facing 2m1t stuck issue * 2m1t verified * do not use torchrun * working on 2m2t * working on 2m2t * initialize strategy in ray actor env * facing actor's init order issue * facing ddp model update issue (need unwarp ddp) * unwrap ddp actor * checking 1m2t stuck problem * nothing * set timeout for trainer choosing. It solves the stuck problem! * delete some debug output * rename to sync with upstream * rename to sync with upstream * coati rename * nothing * I am going to detach the replaybuffer from trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronized buffer operations * experience_maker_holder performs target-revolving _send_experience() instead of length comparison. * move code to ray subfolder * working on pipeline inference * apply comments * working on pipeline strategy. in progress. * remove pipeline code. clean this branch * update remote parameters by state_dict. no test * nothing * state_dict sharding transfer * merge debug branch * gemini _unwrap_model fix * simplify code * simplify code & fix LoRALinear AttributeError * critic unwrapped state_dict --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] add perfomance evaluator and fix bugs (#10) * [chat] add performance evaluator for ray * [chat] refactor debug arg * [chat] support hf config * [chat] fix generation * [chat] add 1mmt dummy example * [chat] fix gemini ckpt * split experience to send (#11) Co-authored-by: csric <richcsr256@gmail.com> * [chat] refactor trainer and maker (#12) * [chat] refactor experience maker holder * [chat] refactor model init * [chat] refactor trainer args * [chat] refactor model init * [chat] refactor trainer * [chat] refactor experience sending logic and training loop args (#13) * [chat] refactor experience send logic * [chat] refactor trainer * [chat] refactor trainer * [chat] refactor experience maker * [chat] refactor pbar * [chat] refactor example folder (#14) * [chat] support quant (#15) * [chat] add quant * [chat] add quant example * prompt example (#16) * prompt example * prompt load csv data * remove legacy try --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] add mmmt dummy example and refactor experience sending (#17) * [chat] add mmmt dummy example * [chat] refactor naive strategy * [chat] fix struck problem * [chat] fix naive strategy * [chat] optimize experience maker sending logic * [chat] refactor sending assignment * [chat] refactor performance evaluator (#18) * Prompt Example & requires_grad state_dict & sharding state_dict (#19) * prompt example * prompt load csv data * remove legacy try * maker models require_grad set to False * working on zero redundancy update * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad. * remove legacy examples * remove legacy examples * remove replay buffer tp state. bad design --------- Co-authored-by: csric <richcsr256@gmail.com> * state_dict sending adapts to new unwrap function (#20) * prompt example * prompt load csv data * remove legacy try * maker models require_grad set to False * working on zero redundancy update * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad. * remove legacy examples * remove legacy examples * remove replay buffer tp state. bad design * opt benchmark * better script * nothing * [chat] strategy refactor unwrap model * [chat] strategy refactor save model * [chat] add docstr * [chat] refactor trainer save model * [chat] fix strategy typing * [chat] refactor trainer save model * [chat] update readme * [chat] fix unit test * working on lora reconstruction * state_dict sending adapts to new unwrap function * remove comments --------- Co-authored-by: csric <richcsr256@gmail.com> Co-authored-by: ver217 <lhx0217@gmail.com> * [chat-ray] add readme (#21) * add readme * transparent graph * add note background --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] get images from url (#22) * Refactor/chat ray (#23) * [chat] lora add todo * [chat] remove unused pipeline strategy * [chat] refactor example structure * [chat] setup ci for ray * [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24) * lora support prototype * lora support * 1mmt lora & remove useless code --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] fix test ci for ray * [chat] fix test ci requirements for ray * [chat] fix ray runtime env * [chat] fix ray runtime env * [chat] fix example ci docker args * [chat] add debug info in trainer * [chat] add nccl debug info * [chat] skip ray test * [doc] fix typo --------- Co-authored-by: csric <59389055+CsRic@users.noreply.github.com> Co-authored-by: csric <richcsr256@gmail.com>	2023-06-07 10:41:16 +08:00
Hongxin Liu	41fb7236aa	[devops] hotfix CI about testmon cache (#3910 ) * [devops] hotfix CI about testmon cache * [devops] fix testmon cahe on pr	2023-06-06 18:58:58 +08:00
Maruyama_Aya	b4437e88c3	fixed port	2023-06-06 16:21:38 +08:00
Maruyama_Aya	79c9f776a9	fixed port	2023-06-06 16:20:45 +08:00
Maruyama_Aya	d3379f0be7	fixed model saving bugs	2023-06-06 16:07:34 +08:00
Maruyama_Aya	b29e1f0722	change directory	2023-06-06 15:50:03 +08:00
Maruyama_Aya	1c1f71cbd2	fixing insecure hash function	2023-06-06 14:51:11 +08:00
Maruyama_Aya	b56c7f4283	update shell file	2023-06-06 14:09:27 +08:00
Maruyama_Aya	176010f289	update performance evaluation	2023-06-06 14:08:22 +08:00
digger yu	0e484e6201	[nfc]fix typo colossalai/pipeline tensor nn (#3899 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/ * fix typo colossalai/cli fx kernel * fix typo colossalai/nn * revert change warmuped * fix typo colossalai/pipeline tensor nn	2023-06-06 14:07:36 +08:00
Baizhou Zhang	c1535ccbba	[doc] fix docs about booster api usage (#3898 )	2023-06-06 13:36:11 +08:00
Hongxin Liu	ec9bbc0094	[devops] improving testmon cache (#3902 ) * [devops] improving testmon cache * [devops] fix branch name with slash * [devops] fix branch name with slash * [devops] fix edit action * [devops] fix edit action * [devops] fix edit action * [devops] fix edit action * [devops] fix edit action * [devops] fix edit action * [devops] update readme	2023-06-06 11:32:31 +08:00
Yuanchen	57a6d7685c	support evaluation for english (#3880 ) Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>	2023-06-05 21:24:21 +08:00
digger yu	1878749753	[nfc] fix typo colossalai/nn (#3887 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/ * fix typo colossalai/cli fx kernel * fix typo colossalai/nn * revert change warmuped	2023-06-05 16:04:27 +08:00
Hongxin Liu	ae02d4e4f7	[bf16] add bf16 support (#3882 ) * [bf16] add bf16 support for fused adam (#3844) * [bf16] fused adam kernel support bf16 * [test] update fused adam kernel test * [test] update fused adam test * [bf16] cpu adam and hybrid adam optimizers support bf16 (#3860) * [bf16] implement mixed precision mixin and add bf16 support for low level zero (#3869) * [bf16] add mixed precision mixin * [bf16] low level zero optim support bf16 * [text] update low level zero test * [text] fix low level zero grad acc test * [bf16] add bf16 support for gemini (#3872) * [bf16] gemini support bf16 * [test] update gemini bf16 test * [doc] update gemini docstring * [bf16] add bf16 support for plugins (#3877) * [bf16] add bf16 support for legacy zero (#3879) * [zero] init context support bf16 * [zero] legacy zero support bf16 * [test] add zero bf16 test * [doc] add bf16 related docstring for legacy zero	2023-06-05 15:58:31 +08:00
jiangmingyan	07cb21142f	[doc]update moe chinese document. (#3890 ) * [doc]update-moe * [doc]update-moe * [doc]update-moe * [doc]update-moe * [doc]update-moe	2023-06-05 15:57:54 +08:00
Liu Ziming	8065cc5fba	Modify torch version requirement to adapt torch 2.0 (#3896 )	2023-06-05 15:57:35 +08:00
Hongxin Liu	dbb32692d2	[lazy] refactor lazy init (#3891 ) * [lazy] remove old lazy init * [lazy] refactor lazy init folder structure * [lazy] fix lazy tensor deepcopy * [test] update lazy init test	2023-06-05 14:20:47 +08:00
Maruyama_Aya	25447d4407	modify path	2023-06-05 11:47:07 +08:00
Maruyama_Aya	42e3232bc0	roll back	2023-06-02 17:00:57 +08:00
Maruyama_Aya	60ec33bb18	Add a new example of Dreambooth training using the booster API	2023-06-02 16:50:51 +08:00
digger yu	70c8cdecf4	[nfc] fix typo colossalai/cli fx kernel (#3847 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/ * fix typo colossalai/cli fx kernel	2023-06-02 15:02:45 +08:00
Maruyama_Aya	46503c35dd	Modify torch version requirement to adapt torch 2.0	2023-06-01 14:30:51 +08:00
jiangmingyan	281b33f362	[doc] update document of zero with chunk. (#3855 ) * [doc] fix title of mixed precision * [doc]update document of zero with chunk * [doc] update document of zero with chunk, fix * [doc] update document of zero with chunk, fix * [doc] update document of zero with chunk, fix * [doc] update document of zero with chunk, add doc test * [doc] update document of zero with chunk, add doc test * [doc] update document of zero with chunk, fix installation * [doc] update document of zero with chunk, fix zero with chunk doc * [doc] update document of zero with chunk, fix zero with chunk doc	2023-05-30 18:41:56 +08:00

2533 Commits (f447ca18111c2e37a2f14e7aecc98876dc7e3216)