ColossalAI

Commit Graph

Author	SHA1	Message	Date
Hongxin Liu	12c90db3f3	[doc] add lazy init tutorial (#3922 ) * [doc] add lazy init en doc * [doc] add lazy init zh doc * [doc] add lazy init doc in sidebar * [doc] add lazy init doc test * [doc] fix lazy init doc link	2 years ago
Frank Lee	c622bb3630	Merge pull request #3915 from FrankLeeeee/update/develop [sync] update develop with main	2 years ago
Hongxin Liu	9c88b6cbd1	[lazy] fix compatibility problem on torch 1.13 (#3911 )	2 years ago
Hongxin Liu	b5f0566363	[chat] add distributed PPO trainer (#3740 ) * Detached ppo (#9) * run the base * working on dist ppo * sync * detached trainer * update detached trainer. no maker update function * facing init problem * 1 maker 1 trainer detached run. but no model update * facing cuda problem * fix save functions * verified maker update * nothing * add ignore * analyize loss issue * remove some debug codes * facing 2m1t stuck issue * 2m1t verified * do not use torchrun * working on 2m2t * working on 2m2t * initialize strategy in ray actor env * facing actor's init order issue * facing ddp model update issue (need unwarp ddp) * unwrap ddp actor * checking 1m2t stuck problem * nothing * set timeout for trainer choosing. It solves the stuck problem! * delete some debug output * rename to sync with upstream * rename to sync with upstream * coati rename * nothing * I am going to detach the replaybuffer from trainer and make it a Ray Actor. Two benefits: 1. support TP trainer. 2. asynchronized buffer operations * experience_maker_holder performs target-revolving _send_experience() instead of length comparison. * move code to ray subfolder * working on pipeline inference * apply comments * working on pipeline strategy. in progress. * remove pipeline code. clean this branch * update remote parameters by state_dict. no test * nothing * state_dict sharding transfer * merge debug branch * gemini _unwrap_model fix * simplify code * simplify code & fix LoRALinear AttributeError * critic unwrapped state_dict --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] add perfomance evaluator and fix bugs (#10) * [chat] add performance evaluator for ray * [chat] refactor debug arg * [chat] support hf config * [chat] fix generation * [chat] add 1mmt dummy example * [chat] fix gemini ckpt * split experience to send (#11) Co-authored-by: csric <richcsr256@gmail.com> * [chat] refactor trainer and maker (#12) * [chat] refactor experience maker holder * [chat] refactor model init * [chat] refactor trainer args * [chat] refactor model init * [chat] refactor trainer * [chat] refactor experience sending logic and training loop args (#13) * [chat] refactor experience send logic * [chat] refactor trainer * [chat] refactor trainer * [chat] refactor experience maker * [chat] refactor pbar * [chat] refactor example folder (#14) * [chat] support quant (#15) * [chat] add quant * [chat] add quant example * prompt example (#16) * prompt example * prompt load csv data * remove legacy try --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] add mmmt dummy example and refactor experience sending (#17) * [chat] add mmmt dummy example * [chat] refactor naive strategy * [chat] fix struck problem * [chat] fix naive strategy * [chat] optimize experience maker sending logic * [chat] refactor sending assignment * [chat] refactor performance evaluator (#18) * Prompt Example & requires_grad state_dict & sharding state_dict (#19) * prompt example * prompt load csv data * remove legacy try * maker models require_grad set to False * working on zero redundancy update * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad. * remove legacy examples * remove legacy examples * remove replay buffer tp state. bad design --------- Co-authored-by: csric <richcsr256@gmail.com> * state_dict sending adapts to new unwrap function (#20) * prompt example * prompt load csv data * remove legacy try * maker models require_grad set to False * working on zero redundancy update * mmmt_prompt example; naive strategy requires_grad state_dict & sharding; maker model requires_no_grad. * remove legacy examples * remove legacy examples * remove replay buffer tp state. bad design * opt benchmark * better script * nothing * [chat] strategy refactor unwrap model * [chat] strategy refactor save model * [chat] add docstr * [chat] refactor trainer save model * [chat] fix strategy typing * [chat] refactor trainer save model * [chat] update readme * [chat] fix unit test * working on lora reconstruction * state_dict sending adapts to new unwrap function * remove comments --------- Co-authored-by: csric <richcsr256@gmail.com> Co-authored-by: ver217 <lhx0217@gmail.com> * [chat-ray] add readme (#21) * add readme * transparent graph * add note background --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] get images from url (#22) * Refactor/chat ray (#23) * [chat] lora add todo * [chat] remove unused pipeline strategy * [chat] refactor example structure * [chat] setup ci for ray * [chat-ray] Support LoRA trainer. LoRA weights reconstruction. (#24) * lora support prototype * lora support * 1mmt lora & remove useless code --------- Co-authored-by: csric <richcsr256@gmail.com> * [chat] fix test ci for ray * [chat] fix test ci requirements for ray * [chat] fix ray runtime env * [chat] fix ray runtime env * [chat] fix example ci docker args * [chat] add debug info in trainer * [chat] add nccl debug info * [chat] skip ray test * [doc] fix typo --------- Co-authored-by: csric <59389055+CsRic@users.noreply.github.com> Co-authored-by: csric <richcsr256@gmail.com>	2 years ago
Hongxin Liu	41fb7236aa	[devops] hotfix CI about testmon cache (#3910 ) * [devops] hotfix CI about testmon cache * [devops] fix testmon cahe on pr	2 years ago
digger yu	0e484e6201	[nfc]fix typo colossalai/pipeline tensor nn (#3899 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/ * fix typo colossalai/cli fx kernel * fix typo colossalai/nn * revert change warmuped * fix typo colossalai/pipeline tensor nn	2 years ago
Baizhou Zhang	c1535ccbba	[doc] fix docs about booster api usage (#3898 )	2 years ago
Hongxin Liu	ec9bbc0094	[devops] improving testmon cache (#3902 ) * [devops] improving testmon cache * [devops] fix branch name with slash * [devops] fix branch name with slash * [devops] fix edit action * [devops] fix edit action * [devops] fix edit action * [devops] fix edit action * [devops] fix edit action * [devops] fix edit action * [devops] update readme	2 years ago
Yuanchen	57a6d7685c	support evaluation for english (#3880 ) Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>	2 years ago
digger yu	1878749753	[nfc] fix typo colossalai/nn (#3887 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/ * fix typo colossalai/cli fx kernel * fix typo colossalai/nn * revert change warmuped	2 years ago
Hongxin Liu	ae02d4e4f7	[bf16] add bf16 support (#3882 ) * [bf16] add bf16 support for fused adam (#3844) * [bf16] fused adam kernel support bf16 * [test] update fused adam kernel test * [test] update fused adam test * [bf16] cpu adam and hybrid adam optimizers support bf16 (#3860) * [bf16] implement mixed precision mixin and add bf16 support for low level zero (#3869) * [bf16] add mixed precision mixin * [bf16] low level zero optim support bf16 * [text] update low level zero test * [text] fix low level zero grad acc test * [bf16] add bf16 support for gemini (#3872) * [bf16] gemini support bf16 * [test] update gemini bf16 test * [doc] update gemini docstring * [bf16] add bf16 support for plugins (#3877) * [bf16] add bf16 support for legacy zero (#3879) * [zero] init context support bf16 * [zero] legacy zero support bf16 * [test] add zero bf16 test * [doc] add bf16 related docstring for legacy zero	2 years ago
jiangmingyan	07cb21142f	[doc]update moe chinese document. (#3890 ) * [doc]update-moe * [doc]update-moe * [doc]update-moe * [doc]update-moe * [doc]update-moe	2 years ago
Liu Ziming	8065cc5fba	Modify torch version requirement to adapt torch 2.0 (#3896 )	2 years ago
Hongxin Liu	dbb32692d2	[lazy] refactor lazy init (#3891 ) * [lazy] remove old lazy init * [lazy] refactor lazy init folder structure * [lazy] fix lazy tensor deepcopy * [test] update lazy init test	2 years ago
digger yu	70c8cdecf4	[nfc] fix typo colossalai/cli fx kernel (#3847 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/ * fix typo colossalai/cli fx kernel	2 years ago
jiangmingyan	281b33f362	[doc] update document of zero with chunk. (#3855 ) * [doc] fix title of mixed precision * [doc]update document of zero with chunk * [doc] update document of zero with chunk, fix * [doc] update document of zero with chunk, fix * [doc] update document of zero with chunk, fix * [doc] update document of zero with chunk, add doc test * [doc] update document of zero with chunk, add doc test * [doc] update document of zero with chunk, fix installation * [doc] update document of zero with chunk, fix zero with chunk doc * [doc] update document of zero with chunk, fix zero with chunk doc	2 years ago
jiangmingyan	5f79008c4a	[example] update gemini examples (#3868 ) * [example]update gemini examples * [example]update gemini examples	2 years ago
Yuanchen	2506e275b8	[evaluation] improvement on evaluation (#3862 ) * fix a bug when the config file contains one category but the answer file doesn't contains that category * fix Chinese prompt file * support gpt-3.5-turbo and gpt-4 evaluation * polish and update README * resolve pr comments --------- Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>	2 years ago
jiangmingyan	b0474878bf	[doc] update nvme offload documents. (#3850 )	2 years ago
Frank Lee	ae959a72a5	[workflow] fixed workflow check for docker build (#3849 )	2 years ago
Frank Lee	d42b1be09d	[release] bump to v0.3.0 (#3830 )	2 years ago
digger yu	e2d81eba0d	[nfc] fix typo colossalai/ applications/ (#3831 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/	2 years ago
jiangmingyan	a64df3fa97	[doc] update document of gemini instruction. (#3842 ) * [doc] update meet_gemini.md * [doc] update meet_gemini.md * [doc] fix parentheses * [doc] fix parentheses * [doc] fix doc test * [doc] fix doc test * [doc] fix doc	2 years ago
Frank Lee	54e97ed7ea	[workflow] supported test on CUDA 10.2 (#3841 )	2 years ago
wukong1992	3229f93e30	[booster] add warning for torch fsdp plugin doc (#3833 )	2 years ago
Frank Lee	84500b7799	[workflow] fixed testmon cache in build CI (#3806 ) * [workflow] fixed testmon cache in build CI * polish code	2 years ago
digger yu	518b31c059	[docs] change placememt_policy to placement_policy (#3829 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/	2 years ago
digger yu	e90fdb1000	fix typo docs/	2 years ago
Yuanchen	34966378e8	[evaluation] add automatic evaluation pipeline (#3821 ) * add functions for gpt evaluation * add automatic eval Update eval.py * using jload and modify the type of answers1 and answers2 * Update eval.py Update eval.py * Update evaluator.py * support gpt evaluation * update readme.md update README.md update READNE.md modify readme.md * add Chinese example for config, battle prompt and evaluation prompt file * remove GPT-4 config * remove sample folder --------- Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com> Co-authored-by: Camille Zhong <44392324+Camille7777@users.noreply.github.com>	2 years ago
Frank Lee	05b8a8de58	[workflow] changed to doc build to be on schedule and release (#3825 ) * [workflow] changed to doc build to be on schedule and release * polish code	2 years ago
Yanming W	269150b6f4	[Docker] Fix a couple of build issues (#3691 )	2 years ago
digger yu	7f8203af69	fix typo colossalai/auto_parallel autochunk fx/passes etc. (#3808 )	2 years ago
jiangmingyan	725365f297	Merge pull request #3810 from jiangmingyan/amp [doc] update amp document	2 years ago
jiangmingyan	278fcbc444	[doc]fix	2 years ago
jiangmingyan	8aa1fb2c7f	[doc]fix	2 years ago
Frank Lee	1e3b64f26c	[workflow] enblaed doc build from a forked repo (#3815 )	2 years ago
Hongxin Liu	19d153057e	[doc] add warning about fsdp plugin (#3813 )	2 years ago
wukong1992	6b305a99d6	[booster] torch fsdp fix ckpt (#3788 )	2 years ago
jiangmingyan	c425a69d52	[doc] add removed change of config.py	2 years ago
jiangmingyan	75272ef37b	[doc] add removed warning	2 years ago
Mingyan Jiang	a520610bd9	[doc] update amp document	2 years ago
Mingyan Jiang	1167bf5b10	[doc] update amp document	2 years ago
Mingyan Jiang	8c62e50dbb	[doc] update amp document	2 years ago
digger yu	9265f2d4d7	[NFC]fix typo colossalai/auto_parallel nn utils etc. (#3779 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc.	2 years ago
jiangmingyan	e871e342b3	[API] add docstrings and initialization to apex amp, naive amp (#3783 ) * [mixed_precison] add naive amp demo * [mixed_precison] add naive amp demo * [api] add docstrings and initialization to apex amp, naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] fix * [api] fix	2 years ago
Frank Lee	615e2e5fc1	[test] fixed lazy init test import error (#3799 )	2 years ago
Frank Lee	ad93c736ea	[workflow] enable testing for develop & feature branch (#3801 )	2 years ago
jiangmingyan	ef02d7ef6d	[doc] update gradient accumulation (#3771 ) * [doc]update gradient accumulation * [doc]update gradient accumulation * [doc]update gradient accumulation * [doc]update gradient accumulation * [doc]update gradient accumulation, fix * [doc]update gradient accumulation, fix * [doc]update gradient accumulation, fix * [doc]update gradient accumulation, add sidebars * [doc]update gradient accumulation, fix * [doc]update gradient accumulation, fix * [doc]update gradient accumulation, fix * [doc]update gradient accumulation, resolve comments * [doc]update gradient accumulation, resolve comments * fix	2 years ago
Frank Lee	f5c425c898	fixed the example docstring for booster (#3795 )	2 years ago
Frank Lee	788e07dbc5	[workflow] fixed the docker build workflow (#3794 ) * [workflow] fixed the docker build workflow * polish code	2 years ago

1 2 3 4 5 ...

2398 Commits (12c90db3f30b6d9013a32eee27ea04ec4d631ddc) All Branches Search

2398 Commits (12c90db3f30b6d9013a32eee27ea04ec4d631ddc)

All Branches