ColossalAI

Commit Graph

Author	SHA1	Message	Date
Frank Lee	8bcad73677	[workflow] fixed the directory check in build (#3980 )	1 year ago
Wenhao Chen	9d02590c9a	[chat] refactor actor class (#3968 ) * refactor: separate log_probs fn from Actor forward fn * refactor: separate generate fn from Actor class * feat: update unwrap_model and get_base_model * unwrap_model returns model not wrapped by Strategy * get_base_model returns HF model for Actor, Critic and RewardModel * feat: simplify Strategy.prepare * style: remove get_base_model method of Actor * perf: tokenize text in batches * refactor: move calc_action_log_probs to utils of model * test: update test with new forward fn * style: rename forward fn args * fix: do not unwrap model in save_model fn of naive strategy * test: add gemini test for train_prompts * fix: fix _set_default_generate_kwargs	1 year ago
Frank Lee	2bf6547ad7	Merge pull request #3967 from ver217/update-develop [sync] update develop branch with main	1 year ago
Frank Lee	6718a2f285	[workflow] cancel duplicated workflow jobs (#3960 )	1 year ago
Frank Lee	71fe52769c	[gemini] fixed the gemini checkpoint io (#3934 )	1 year ago
Baizhou Zhang	b3ab7fbabf	[example] update ViT example using booster api (#3940 )	1 year ago
Frank Lee	4110d1f0d4	[workflow] cancel duplicated workflow jobs (#3960 )	1 year ago
digger yu	1aadeedeea	fix typo .github/workflows/scripts/ (#3946 )	1 year ago
digger yu	e61ffc77c6	fix typo tests/ (#3936 )	1 year ago
Frank Lee	bd1ab98158	[gemini] fixed the gemini checkpoint io (#3934 )	1 year ago
FoolPlayer	bd2c7c3297	Merge pull request #3942 from hpcaitech/revert-3931-sync/develop-to-shardformer Revert "[sync] sync feature/shardformer with develop"	1 year ago
Frank Lee	ddcf58cacf	Revert "[sync] sync feature/shardformer with develop"	1 year ago
FoolPlayer	24651fdd4f	Merge pull request #3931 from FrankLeeeee/sync/develop-to-shardformer [sync] sync feature/shardformer with develop	1 year ago
Liu Ziming	e277534a18	Merge pull request #3905 from MaruyamaAya/dreambooth [example] Adding an example of training dreambooth with the new booster API	1 year ago
Yuanchen	21c4c0b1a0	support UniEval and add CHRF metric (#3924 ) Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>	1 year ago
digger yu	33eef714db	fix typo examples and docs (#3932 )	1 year ago
FoolPlayer	ef1537759c	[shardformer] add gpt2 policy and modify shard and slicer to support (#3883 ) * add gpt2 policy and modify shard and slicer to support * remove unused code * polish code	1 year ago
FoolPlayer	6370a935f6	update README (#3909 )	1 year ago
FoolPlayer	21a3915c98	[shardformer] add Dropout layer support different dropout pattern (#3856 ) * add dropout layer, add dropout test * modify seed manager as context manager * add a copy of col_nn.layer * add dist_crossentropy loss; separate module test * polish the code * fix dist crossentropy loss	1 year ago
FoolPlayer	997544c1f9	[shardformer] update readme with modules implement doc (#3834 ) * update readme with modules content * remove img	1 year ago
Frank Lee	537a52b7a2	[shardformer] refactored the user api (#3828 ) * [shardformer] refactored the user api * polish code	1 year ago
Frank Lee	bc19024bf9	[shardformer] updated readme (#3827 )	1 year ago
FoolPlayer	58f6432416	[shardformer]: Feature/shardformer, add some docstring and readme (#3816 ) * init shardformer code structure * add implement of sharder (inject and replace) * add implement of replace layer to colossal layer * separate different layer policy, add some notion * implement 1d and 2d slicer, can tell col or row * fix bug when slicing and inject model * fix some bug; add inference test example * add share weight and train example * add train * add docstring and readme * add docstring for other files * pre-commit	1 year ago
FoolPlayer	6a69b44dfc	[shardformer] init shardformer code structure (#3731 ) * init shardformer code structure * add implement of sharder (inject and replace) * add implement of replace layer to colossal layer * separate different layer policy, add some notion * implement 1d and 2d slicer, can tell col or row * fix bug when slicing and inject model * fix some bug; add inference test example	1 year ago
Maruyama_Aya	9b5e7ce21f	modify shell for check	1 year ago
Frank Lee	a98e16ed07	Merge pull request #3926 from hpcaitech/feature/dtensor [feature] updated device mesh and dtensor	1 year ago
digger yu	407aa48461	fix typo examples/community/roberta (#3925 )	1 year ago
Maruyama_Aya	730a092ba2	modify shell for check	1 year ago
Maruyama_Aya	49567d56d1	modify shell for check	1 year ago
Maruyama_Aya	039854b391	modify shell for check	1 year ago
Baizhou Zhang	e417dd004e	[example] update opt example using booster api (#3918 )	1 year ago
Maruyama_Aya	cf4792c975	modify shell for check	1 year ago
Frank Lee	eb39154d40	[dtensor] updated api and doc (#3845 )	1 year ago
Hongxin Liu	9166988d9b	[devops] update torch version in compability test (#3919 )	1 year ago
digger yu	de0d7df33f	[nfc] fix typo colossalai/zero (#3923 )	1 year ago
Hongxin Liu	12c90db3f3	[doc] add lazy init tutorial (#3922 ) * [doc] add lazy init en doc * [doc] add lazy init zh doc * [doc] add lazy init doc in sidebar * [doc] add lazy init doc test * [doc] fix lazy init doc link	1 year ago
Maruyama_Aya	c94a33579b	modify shell for check	1 year ago
digger yu	a9d1cadc49	fix typo with colossalai/trainer utils zero (#3908 )	1 year ago
Liu Ziming	b306cecf28	[example] Modify palm example with the new booster API (#3913 ) * Modify torch version requirement to adapt torch 2.0 * modify palm example using new booster API * roll back * fix port * polish * polish	1 year ago
wukong1992	a55fb00c18	[booster] update bert example, using booster api (#3885 )	1 year ago
Frank Lee	5e2132dcff	[workflow] added docker latest tag for release (#3920 )	1 year ago
Hongxin Liu	c25d421f3e	[devops] hotfix testmon cache clean logic (#3917 )	1 year ago
Frank Lee	d51e83d642	Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop [sync] sync feature/dtensor with develop	1 year ago
Frank Lee	c622bb3630	Merge pull request #3915 from FrankLeeeee/update/develop [sync] update develop with main	1 year ago
Hongxin Liu	9c88b6cbd1	[lazy] fix compatibility problem on torch 1.13 (#3911 )	1 year ago
Maruyama_Aya	4fc8bc68ac	modify file path	1 year ago
Hongxin Liu	b5f0566363	[chat] add distributed PPO trainer (#3740 ) * Detached ppo (#9) * run the base * working on dist ppo * sync * detached trainer * update detached trainer. no maker update function * facing init problem * 1 maker 1 trainer detached run. but no model update * facing cuda problem * fix save functions * verified maker update * nothing * add ignore * analyize loss issue * remove some debug codes * facing 2m1t stuck issue * 2m1t verified * do not use torchrun * working on 2m2t * working on 2m2t * initialize strategy in ray actor env * facing actor's init order issue * facing ddp model update issue (need unwarp ddp) * unwrap ddp actor * checking 1m2t stuck problem * nothing * set timeout for trainer choosing. It solves the stuck problem! * delete some debug output * rename to sync with upstream * rename to sync with upstream * coati rename * nothing * I am going to detach the replaybuffer from trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronized buffer operations * experience_maker_holder performs target-revolving _send_experience() instead of length comparison. * move code to ray subfolder * working on pipeline inference * apply comments * working on pipeline strategy. in progress. * remove pipeline code. clean this branch * update remote parameters by state_dict. 
no test * nothing * state_dict sharding transfer * merge debug branch * gemini _unwrap_model fix * simplify code * simplify code & fix LoRALinear AttributeError * critic unwrapped state_dict --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] add perfomance evaluator and fix bugs (#10) * [chat] add performance evaluator for ray * [chat] refactor debug arg * [chat] support hf config * [chat] fix generation * [chat] add 1mmt dummy example * [chat] fix gemini ckpt * split experience to send (#11) Co-authored-by: csric <richcsr256@gmail.com> * [chat] refactor trainer and maker (#12) * [chat] refactor experience maker holder * [chat] refactor model init * [chat] refactor trainer args * [chat] refactor model init * [chat] refactor trainer * [chat] refactor experience sending logic and training loop args (#13) * [chat] refactor experience send logic * [chat] refactor trainer * [chat] refactor trainer * [chat] refactor experience maker * [chat] refactor pbar * [chat] refactor example folder (#14) * [chat] support quant (#15) * [chat] add quant * [chat] add quant example * prompt example (#16) * prompt example * prompt load csv data * remove legacy try --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] add mmmt dummy example and refactor experience sending (#17) * [chat] add mmmt dummy example * [chat] refactor naive strategy * [chat] fix struck problem * [chat] fix naive strategy * [chat] optimize experience maker sending logic * [chat] refactor sending assignment * [chat] refactor performance evaluator (#18) * Prompt Example & requires_grad state_dict & sharding state_dict (#19) * prompt example * prompt load csv data * remove legacy try * maker models require_grad set to False * working on zero redundancy update * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad. * remove legacy examples * remove legacy examples * remove replay buffer tp state. 
bad design --------- Co-authored-by: csric <richcsr256@gmail.com> * state_dict sending adapts to new unwrap function (#20) * prompt example * prompt load csv data * remove legacy try * maker models require_grad set to False * working on zero redundancy update * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad. * remove legacy examples * remove legacy examples * remove replay buffer tp state. bad design * opt benchmark * better script * nothing * [chat] strategy refactor unwrap model * [chat] strategy refactor save model * [chat] add docstr * [chat] refactor trainer save model * [chat] fix strategy typing * [chat] refactor trainer save model * [chat] update readme * [chat] fix unit test * working on lora reconstruction * state_dict sending adapts to new unwrap function * remove comments --------- Co-authored-by: csric <richcsr256@gmail.com> Co-authored-by: ver217 <lhx0217@gmail.com> * [chat-ray] add readme (#21) * add readme * transparent graph * add note background --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] get images from url (#22) * Refactor/chat ray (#23) * [chat] lora add todo * [chat] remove unused pipeline strategy * [chat] refactor example structure * [chat] setup ci for ray * [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24) * lora support prototype * lora support * 1mmt lora & remove useless code --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] fix test ci for ray * [chat] fix test ci requirements for ray * [chat] fix ray runtime env * [chat] fix ray runtime env * [chat] fix example ci docker args * [chat] add debug info in trainer * [chat] add nccl debug info * [chat] skip ray test * [doc] fix typo --------- Co-authored-by: csric <59389055+CsRic@users.noreply.github.com> Co-authored-by: csric <richcsr256@gmail.com>	1 year ago
Hongxin Liu	41fb7236aa	[devops] hotfix CI about testmon cache (#3910 ) * [devops] hotfix CI about testmon cache * [devops] fix testmon cahe on pr	1 year ago
Maruyama_Aya	b4437e88c3	fixed port	1 year ago
Maruyama_Aya	79c9f776a9	fixed port	1 year ago

2653 Commits (726541afe2cde5c6f547968a3d232bbb8b3f5f14), All Branches