ColossalAI

Commit Graph

Author	SHA1	Message	Date
Baizhou Zhang	822c3d4d66	[checkpointio] sharded optimizer checkpoint for DDP plugin (#4002 )	2023-06-16 14:14:05 +08:00
Wenhao Chen	725af3eeeb	[booster] make optimizer argument optional for boost (#3993 ) * feat: make optimizer optional in Booster.boost * test: skip unet test if diffusers version > 0.10.2	2023-06-15 17:38:42 +08:00
Baizhou Zhang	c9cff7e7fa	[checkpointio] General Checkpointing of Sharded Optimizers (#3984 )	2023-06-15 15:21:26 +08:00
Frank Lee	8bcad73677	[workflow] fixed the directory check in build (#3980 )	2023-06-13 14:42:35 +08:00
Frank Lee	2bf6547ad7	Merge pull request #3967 from ver217/update-develop [sync] update develop branch with main	2023-06-12 16:39:43 +08:00
Frank Lee	6718a2f285	[workflow] cancel duplicated workflow jobs (#3960 )	2023-06-12 15:11:27 +08:00
Frank Lee	71fe52769c	[gemini] fixed the gemini checkpoint io (#3934 )	2023-06-12 15:11:27 +08:00
Baizhou Zhang	b3ab7fbabf	[example] update ViT example using booster api (#3940 )	2023-06-12 15:02:27 +08:00
Frank Lee	4110d1f0d4	[workflow] cancel duplicated workflow jobs (#3960 )	2023-06-12 09:50:57 +08:00
digger yu	1aadeedeea	fix typo .github/workflows/scripts/ (#3946 )	2023-06-09 10:30:50 +08:00
digger yu	e61ffc77c6	fix typo tests/ (#3936 )	2023-06-09 09:49:41 +08:00
Frank Lee	bd1ab98158	[gemini] fixed the gemini checkpoint io (#3934 )	2023-06-09 09:48:49 +08:00
FoolPlayer	bd2c7c3297	Merge pull request #3942 from hpcaitech/revert-3931-sync/develop-to-shardformer Revert "[sync] sync feature/shardformer with develop"	2023-06-09 09:42:28 +08:00
Frank Lee	ddcf58cacf	Revert "[sync] sync feature/shardformer with develop"	2023-06-09 09:41:27 +08:00
FoolPlayer	24651fdd4f	Merge pull request #3931 from FrankLeeeee/sync/develop-to-shardformer [sync] sync feature/shardformer with develop	2023-06-09 09:34:00 +08:00
Liu Ziming	e277534a18	Merge pull request #3905 from MaruyamaAya/dreambooth [example] Adding an example of training dreambooth with the new booster API	2023-06-09 08:44:18 +08:00
Yuanchen	21c4c0b1a0	support UniEval and add CHRF metric (#3924 ) Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>	2023-06-08 17:38:47 +08:00
digger yu	33eef714db	fix typo examples and docs (#3932 )	2023-06-08 16:09:32 +08:00
FoolPlayer	ef1537759c	[shardformer] add gpt2 policy and modify shard and slicer to support (#3883 ) * add gpt2 policy and modify shard and slicer to support * remove unused code * polish code	2023-06-08 15:01:34 +08:00
FoolPlayer	6370a935f6	update README (#3909 )	2023-06-08 15:01:34 +08:00
FoolPlayer	21a3915c98	[shardformer] add Dropout layer support different dropout pattern (#3856 ) * add dropout layer, add dropout test * modify seed manager as context manager * add a copy of col_nn.layer * add dist_crossentropy loss; separate module test * polish the code * fix dist crossentropy loss	2023-06-08 15:01:34 +08:00
FoolPlayer	997544c1f9	[shardformer] update readme with modules implement doc (#3834 ) * update readme with modules content * remove img	2023-06-08 15:01:34 +08:00
Frank Lee	537a52b7a2	[shardformer] refactored the user api (#3828 ) * [shardformer] refactored the user api * polish code	2023-06-08 15:01:34 +08:00
Frank Lee	bc19024bf9	[shardformer] updated readme (#3827 )	2023-06-08 15:01:34 +08:00
FoolPlayer	58f6432416	[shardformer]: Feature/shardformer, add some docstring and readme (#3816 ) * init shardformer code structure * add implement of sharder (inject and replace) * add implement of replace layer to colossal layer * separate different layer policy, add some notion * implement 1d and 2d slicer, can tell col or row * fix bug when slicing and inject model * fix some bug; add inference test example * add share weight and train example * add train * add docstring and readme * add docstring for other files * pre-commit	2023-06-08 15:01:34 +08:00
FoolPlayer	6a69b44dfc	[shardformer] init shardformer code structure (#3731 ) * init shardformer code structure * add implement of sharder (inject and replace) * add implement of replace layer to colossal layer * separate different layer policy, add some notion * implement 1d and 2d slicer, can tell col or row * fix bug when slicing and inject model * fix some bug; add inference test example	2023-06-08 15:01:34 +08:00
Maruyama_Aya	9b5e7ce21f	modify shell for check	2023-06-08 14:56:56 +08:00
Frank Lee	a98e16ed07	Merge pull request #3926 from hpcaitech/feature/dtensor [feature] updated device mesh and dtensor	2023-06-08 14:39:40 +08:00
digger yu	407aa48461	fix typo examples/community/roberta (#3925 )	2023-06-08 14:28:34 +08:00
Maruyama_Aya	730a092ba2	modify shell for check	2023-06-08 13:38:18 +08:00
Maruyama_Aya	49567d56d1	modify shell for check	2023-06-08 13:36:05 +08:00
Maruyama_Aya	039854b391	modify shell for check	2023-06-08 13:17:58 +08:00
Baizhou Zhang	e417dd004e	[example] update opt example using booster api (#3918 )	2023-06-08 11:27:05 +08:00
Maruyama_Aya	cf4792c975	modify shell for check	2023-06-08 11:15:10 +08:00
Frank Lee	eb39154d40	[dtensor] updated api and doc (#3845 )	2023-06-08 10:18:17 +08:00
Hongxin Liu	9166988d9b	[devops] update torch version in compability test (#3919 )	2023-06-08 09:29:32 +08:00
digger yu	de0d7df33f	[nfc] fix typo colossalai/zero (#3923 )	2023-06-08 00:01:29 +08:00
Hongxin Liu	12c90db3f3	[doc] add lazy init tutorial (#3922 ) * [doc] add lazy init en doc * [doc] add lazy init zh doc * [doc] add lazy init doc in sidebar * [doc] add lazy init doc test * [doc] fix lazy init doc link	2023-06-07 17:59:58 +08:00
Maruyama_Aya	c94a33579b	modify shell for check	2023-06-07 17:23:01 +08:00
digger yu	a9d1cadc49	fix typo with colossalai/trainer utils zero (#3908 )	2023-06-07 16:08:37 +08:00
Liu Ziming	b306cecf28	[example] Modify palm example with the new booster API (#3913 ) * Modify torch version requirement to adapt torch 2.0 * modify palm example using new booster API * roll back * fix port * polish * polish	2023-06-07 16:05:00 +08:00
wukong1992	a55fb00c18	[booster] update bert example, using booster api (#3885 )	2023-06-07 15:51:00 +08:00
Frank Lee	5e2132dcff	[workflow] added docker latest tag for release (#3920 )	2023-06-07 15:37:37 +08:00
Hongxin Liu	c25d421f3e	[devops] hotfix testmon cache clean logic (#3917 )	2023-06-07 12:39:12 +08:00
Frank Lee	d51e83d642	Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop [sync] sync feature/dtensor with develop	2023-06-07 11:50:43 +08:00
Frank Lee	c622bb3630	Merge pull request #3915 from FrankLeeeee/update/develop [sync] update develop with main	2023-06-07 11:45:11 +08:00
Hongxin Liu	9c88b6cbd1	[lazy] fix compatibility problem on torch 1.13 (#3911 )	2023-06-07 11:10:12 +08:00
Maruyama_Aya	4fc8bc68ac	modify file path	2023-06-07 11:02:19 +08:00
Hongxin Liu	b5f0566363	[chat] add distributed PPO trainer (#3740 ) * Detached ppo (#9) * run the base * working on dist ppo * sync * detached trainer * update detached trainer. no maker update function * facing init problem * 1 maker 1 trainer detached run. but no model update * facing cuda problem * fix save functions * verified maker update * nothing * add ignore * analyize loss issue * remove some debug codes * facing 2m1t stuck issue * 2m1t verified * do not use torchrun * working on 2m2t * working on 2m2t * initialize strategy in ray actor env * facing actor's init order issue * facing ddp model update issue (need unwarp ddp) * unwrap ddp actor * checking 1m2t stuck problem * nothing * set timeout for trainer choosing. It solves the stuck problem! * delete some debug output * rename to sync with upstream * rename to sync with upstream * coati rename * nothing * I am going to detach the replaybuffer from trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronized buffer operations * experience_maker_holder performs target-revolving _send_experience() instead of length comparison. * move code to ray subfolder * working on pipeline inference * apply comments * working on pipeline strategy. in progress. * remove pipeline code. clean this branch * update remote parameters by state_dict. no test * nothing * state_dict sharding transfer * merge debug branch * gemini _unwrap_model fix * simplify code * simplify code & fix LoRALinear AttributeError * critic unwrapped state_dict --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] add perfomance evaluator and fix bugs (#10) * [chat] add performance evaluator for ray * [chat] refactor debug arg * [chat] support hf config * [chat] fix generation * [chat] add 1mmt dummy example * [chat] fix gemini ckpt * split experience to send (#11) Co-authored-by: csric <richcsr256@gmail.com> * [chat] refactor trainer and maker (#12) * [chat] refactor experience maker holder * [chat] refactor model init * [chat] refactor trainer args * [chat] refactor model init * [chat] refactor trainer * [chat] refactor experience sending logic and training loop args (#13) * [chat] refactor experience send logic * [chat] refactor trainer * [chat] refactor trainer * [chat] refactor experience maker * [chat] refactor pbar * [chat] refactor example folder (#14) * [chat] support quant (#15) * [chat] add quant * [chat] add quant example * prompt example (#16) * prompt example * prompt load csv data * remove legacy try --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] add mmmt dummy example and refactor experience sending (#17) * [chat] add mmmt dummy example * [chat] refactor naive strategy * [chat] fix struck problem * [chat] fix naive strategy * [chat] optimize experience maker sending logic * [chat] refactor sending assignment * [chat] refactor performance evaluator (#18) * Prompt Example & requires_grad state_dict & sharding state_dict (#19) * prompt example * prompt load csv data * remove legacy try * maker models require_grad set to False * working on zero redundancy update * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad. * remove legacy examples * remove legacy examples * remove replay buffer tp state. bad design --------- Co-authored-by: csric <richcsr256@gmail.com> * state_dict sending adapts to new unwrap function (#20) * prompt example * prompt load csv data * remove legacy try * maker models require_grad set to False * working on zero redundancy update * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad. * remove legacy examples * remove legacy examples * remove replay buffer tp state. bad design * opt benchmark * better script * nothing * [chat] strategy refactor unwrap model * [chat] strategy refactor save model * [chat] add docstr * [chat] refactor trainer save model * [chat] fix strategy typing * [chat] refactor trainer save model * [chat] update readme * [chat] fix unit test * working on lora reconstruction * state_dict sending adapts to new unwrap function * remove comments --------- Co-authored-by: csric <richcsr256@gmail.com> Co-authored-by: ver217 <lhx0217@gmail.com> * [chat-ray] add readme (#21) * add readme * transparent graph * add note background --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] get images from url (#22) * Refactor/chat ray (#23) * [chat] lora add todo * [chat] remove unused pipeline strategy * [chat] refactor example structure * [chat] setup ci for ray * [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24) * lora support prototype * lora support * 1mmt lora & remove useless code --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] fix test ci for ray * [chat] fix test ci requirements for ray * [chat] fix ray runtime env * [chat] fix ray runtime env * [chat] fix example ci docker args * [chat] add debug info in trainer * [chat] add nccl debug info * [chat] skip ray test * [doc] fix typo --------- Co-authored-by: csric <59389055+CsRic@users.noreply.github.com> Co-authored-by: csric <richcsr256@gmail.com>	2023-06-07 10:41:16 +08:00
Hongxin Liu	41fb7236aa	[devops] hotfix CI about testmon cache (#3910 ) * [devops] hotfix CI about testmon cache * [devops] fix testmon cahe on pr	2023-06-06 18:58:58 +08:00

2455 Commits (822c3d4d66d2d74cb7c7080abed6a207602dddfd)