ColossalAI

Commit Graph

Author	SHA1	Message	Date
Wenhao Chen	e614aa34f3	[shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous shard policy for llama (#5508 ) * feat: add `GradientCheckpointConfig` and `PipelineGradientCheckpointConfig` * feat: apply `GradientCheckpointConfig` to policy and llama_forward * feat: move `distribute_layer` and `get_stage_index` to PipelineStageManager * fix: add optional args for `distribute_layer` and `get_stage_index` * fix: fix changed API calls * test: update llama tests * style: polish `GradientCheckpointConfig` * fix: fix pipeline utils tests	8 months ago
Insu Jang	00525f7772	[shardformer] fix pipeline forward error if custom layer distribution is used (#5189 ) * Use self.[distribute_layers\|get_stage_index] to exploit custom layer distribution * Change static methods for t5 layer distribution to member functions * Change static methods for whisper layer distribution to member functions * Replace whisper policy usage with self one * Fix test case to use non-static layer distribution methods * fix: fix typo --------- Co-authored-by: Wenhao Chen <cwher@outlook.com>	8 months ago
Wenhao Chen	bb0a668fee	[hotfix] set return_outputs=False in examples and polish code (#5404 ) * fix: simplify merge_batch * fix: use return_outputs=False to eliminate extra memory consumption * feat: add return_outputs warning * style: remove `return_outputs=False` as it is the default value	8 months ago
ver217	148469348a	Merge branch 'main' into sync/npu	10 months ago
Wenhao Chen	ef4f0ee854	[hotfix]: add pp sanity check and fix mbs arg (#5268 ) * fix: fix misleading mbs arg * feat: add pp sanity check * fix: fix 1f1b sanity check	10 months ago
Hongxin Liu	d202cc28c0	[npu] change device to accelerator api (#5239 ) * update accelerator * fix timer * fix amp * update * fix * update bug * add error raise * fix autocast * fix set device * remove doc accelerator * update doc * update doc * update doc * use nullcontext * update cpu * update null context * change time limit for example * udpate * update * update * update * [npu] polish accelerator code --------- Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com> Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com>	11 months ago
Elsa Granger	d565df3821	[pipeline] A more general _communicate in p2p (#5062 ) * A more general _communicate * feat: finish tree_flatten version p2p * fix: update p2p api calls --------- Co-authored-by: Wenhao Chen <cwher@outlook.com>	11 months ago
Wenhao Chen	d799a3088f	[pipeline]: add p2p fallback order and fix interleaved pp deadlock (#5214 ) * fix: add fallback order option and update 1f1b * fix: fix deadlock comm in interleaved pp * test: modify p2p test	11 months ago
Wenhao Chen	4fa689fca1	[pipeline]: fix p2p comm, add metadata cache and support llama interleaved pp (#5134 ) * test: add more p2p tests * fix: remove send_forward_recv_forward as p2p op list need to use the same group * fix: make send and receive atomic * feat: update P2PComm fn * feat: add metadata cache in 1f1b * feat: add metadata cache in interleaved pp * feat: modify is_xx_stage fn * revert: add _broadcast_object_list * feat: add interleaved pp in llama policy * feat: set NCCL_BUFFSIZE in HybridParallelPlugin	11 months ago
Wenhao Chen	7172459e74	[shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert (#5088 ) * [shardformer] implement policy for all GPT-J models and test * [shardformer] support interleaved pipeline parallel for bert finetune * [shardformer] shardformer support falcon (#4883) * [shardformer]: fix interleaved pipeline for bert model (#5048) * [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093) * Add Mistral support for Shardformer (#5103) * [shardformer] add tests to mistral (#5105) --------- Co-authored-by: Pengtai Xu <henryxu880@gmail.com> Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com> Co-authored-by: flybird11111 <1829166702@qq.com> Co-authored-by: eric8607242 <e0928021388@gmail.com>	1 year ago
Hongxin Liu	079bf3cb26	[misc] update pre-commit and run all files (#4752 ) * [misc] update pre-commit * [misc] run pre-commit * [misc] remove useless configuration files * [misc] ignore cuda for clang-format	1 year ago
Hongxin Liu	b5f9e37c70	[legacy] clean up legacy code (#4743 ) * [legacy] remove outdated codes of pipeline (#4692) * [legacy] remove cli of benchmark and update optim (#4690) * [legacy] remove cli of benchmark and update optim * [doc] fix cli doc test * [legacy] fix engine clip grad norm * [legacy] remove outdated colo tensor (#4694) * [legacy] remove outdated colo tensor * [test] fix test import * [legacy] move outdated zero to legacy (#4696) * [legacy] clean up utils (#4700) * [legacy] clean up utils * [example] update examples * [legacy] clean up amp * [legacy] fix amp module * [legacy] clean up gpc (#4742) * [legacy] clean up context * [legacy] clean core, constants and global vars * [legacy] refactor initialize * [example] fix examples ci * [example] fix examples ci * [legacy] fix tests * [example] fix gpt example * [example] fix examples ci * [devops] fix ci installation * [example] fix examples ci	1 year ago
Hongxin Liu	554aa9592e	[legacy] move communication and nn to legacy and refactor logger (#4671 ) * [legacy] move communication to legacy (#4640) * [legacy] refactor logger and clean up legacy codes (#4654) * [legacy] make logger independent to gpc * [legacy] make optim independent to registry * [legacy] move test engine to legacy * [legacy] move nn to legacy (#4656) * [legacy] move nn to legacy * [checkpointio] fix save hf config * [test] remove useledd rpc pp test * [legacy] fix nn init * [example] skip tutorial hybriad parallel example * [devops] test doc check * [devops] test doc check	1 year ago
Baizhou Zhang	660eed9124	[pipeline] set optimizer to optional in execute_pipeline (#4630 ) * set optimizer to optional in execute_pipeline * arrange device and mixed precision in booster init * fix execute_pipeline in booster.py	1 year ago
Hongxin Liu	fae6c92ead	Merge branch 'main' into feature/shardformer	1 year ago
Hongxin Liu	89fe027787	[legacy] move trainer to legacy (#4545 ) * [legacy] move trainer to legacy * [doc] update docs related to trainer * [test] ignore legacy test	1 year ago
Hongxin Liu	a39a5c66fe	Merge branch 'main' into feature/shardformer	1 year ago
Hongxin Liu	508ca36fe3	[pipeline] 1f1b schedule receive microbatch size (#4589 )	1 year ago
Hongxin Liu	27061426f7	[gemini] improve compatibility and add static placement policy (#4479 ) * [gemini] remove distributed-related part from colotensor (#4379) * [gemini] remove process group dependency * [gemini] remove tp part from colo tensor * [gemini] patch inplace op * [gemini] fix param op hook and update tests * [test] remove useless tests * [test] remove useless tests * [misc] fix requirements * [test] fix model zoo * [test] fix model zoo * [test] fix model zoo * [test] fix model zoo * [test] fix model zoo * [misc] update requirements * [gemini] refactor gemini optimizer and gemini ddp (#4398) * [gemini] update optimizer interface * [gemini] renaming gemini optimizer * [gemini] refactor gemini ddp class * [example] update gemini related example * [example] update gemini related example * [plugin] fix gemini plugin args * [test] update gemini ckpt tests * [gemini] fix checkpoint io * [example] fix opt example requirements * [example] fix opt example * [example] fix opt example * [example] fix opt example * [gemini] add static placement policy (#4443) * [gemini] add static placement policy * [gemini] fix param offload * [test] update gemini tests * [plugin] update gemini plugin * [plugin] update gemini plugin docstr * [misc] fix flash attn requirement * [test] fix gemini checkpoint io test * [example] update resnet example result (#4457) * [example] update bert example result (#4458) * [doc] update gemini doc (#4468) * [example] update gemini related examples (#4473) * [example] update gpt example * [example] update dreambooth example * [example] update vit * [example] update opt * [example] update palm * [example] update vit and opt benchmark * [hotfix] fix bert in model zoo (#4480) * [hotfix] fix bert in model zoo * [test] remove chatglm gemini test * [test] remove sam gemini test * [test] remove vit gemini test * [hotfix] fix opt tutorial example (#4497) * [hotfix] fix opt tutorial example * [hotfix] fix opt tutorial example	1 year ago
Jianghai	8739aa7fa0	[shardformer] Pipeline/whisper (#4456 ) * add some base tests and policies * finish whisper base model * add conditional generation * finish basic tests * whisper * finish whisper * finish whisper * del useless whisper test * fix * add argmin to replace * finish revision	1 year ago
LuGY	a78daf6180	[shardformer] support interleaved pipeline (#4448 ) * support interleaved pipeline * fix unit test * remove virtual stage test in stage mgr * add droped type hint and updated bwd	1 year ago
github-actions[bot]	d20dceb9a3	[format] applied code formatting on changed files in pull request 4441 (#4445 ) Co-authored-by: github-actions <github-actions@github.com>	1 year ago
Jianghai	a88e92251d	[pipeline] add chatglm (#4363 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt * add bloom model and policy ,revise the base class of policy * revise * revision * add bert_for_pretraining * add bert_for_pretraining forward and policy * fix typos * cancel warning * change the imediate output to default dict * change the default output of get_shared_params * add chatglm * add * chatglm * chatglm * finish chatglm * deletes * fix rmsnorm * chatglm * fix chatglm shard * init	1 year ago
Jianghai	f13954cd58	[pipeline] refactor test pipeline and remove useless utils in pipeline (#4324 ) * refactor tests * refactor bloom model * finish policy tests * refactor tests * fix test pure pipeline * remove test pipeline and cutdown launch process * refactor tests * refactor bloom model * finish policy tests * refactor tests * fix test pure pipeline * remove test pipeline and cutdown launch process	1 year ago
LuGY	d3c6cd66f3	[pipeline] add unit test for 1f1b (#4303 ) * add unit test for 1f1b * polish code * polish code and update ut version * fix	1 year ago
Baizhou Zhang	36e546b2cc	[pipeline] add pipeline support for T5Stack/T5EncoderModel (#4300 ) * modify t5 policy & add test * pipeline stage distribution for t5 * complete t5 base policy * t5 stack: halfway * modify gpt2 pipeline test * complete pipeline forward for T5Stack/T5EncoderModel * fix docstring * move t5 util tests to test_pipeline	1 year ago
Jianghai	d8408d185c	[pipeline] OPT model pipeline (#4258 ) * opt forward and test * pause * finish opt model pipeline * finish opt pipeline * opt forward and test * pause * finish opt model pipeline * finish opt pipeline * fix opt * set transformers version * refactor the test pipeline	1 year ago
Jianghai	e7cc62d735	[pipeline] All bert models (#4233 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * Revert "bloom policy" This reverts commit `8dee68a0a2`. This policy should be revert and copied to feature/bloom * revert the bloom changes * cancel unneeded inputs * gpt * finish llama * causal lm and sequence classification * revision * add pure pipeline test * finish some bert models * finish all bert models * finish bert tests * fix bugs * fix bugs * fix test pipeline * fix data gen for qa * update the set pipeline forward * shared params * fix bugs	1 year ago
Jianghai	f3bcc292c8	[pipeline] move bert related pipeline components to shardformer (#4187 ) * move bert related pipeline components to shardformer * fix bugs * revision * fix bert model tests * fix bert_lm_head model tests * fix tests * fix tests * done checks * skip bloom	1 year ago
Jianghai	c5ea728016	[pipeline] add bert_for_pretraining bert_lmhead forward and policy (#4172 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt * add bloom model and policy ,revise the base class of policy * revise * revision * add bert_for_pretraining * add bert_for_pretraining forward and policy * fix typos * cancel warning * change the imediate output to default dict * change the default output of get_shared_params	1 year ago
Jianghai	90a65ea682	[pipeline] build bloom model and policy , revise the base class of policy (#4161 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt * add bloom model and policy ,revise the base class of policy * revise * revision * add bert_for_pretraining	1 year ago
Jianghai	c552cefa93	[pipeline]add pipeline policy and bert forward (#4130 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt	1 year ago
Hongxin Liu	5c897ddb94	[pipeline] add stage manager (#4093 ) * [pipeline] add stage manager * [test] add pipeline stage manager test * [pipeline] add docstring for stage manager	1 year ago
Jianghai	e8e7e49243	[pipeline]add pipeline policy and bert forward (#4130 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt	1 year ago
Hongxin Liu	f51ce1bc8e	[pipeline] refactor 1f1b schedule (#4115 ) * [api] update optimizer wrapper to fit pipeline * [pipeline] add base schedule * [pipeline] add 1f1b schedule * [test] add pipeline schedule utils test * [pipeline] fix import	1 year ago
Hongxin Liu	45fdc9b42c	[pipeline] implement p2p communication (#4100 ) * [pipeline] add p2p communication * [test] add p2p communication test * [test] add rerun decorator * [test] rename to avoid conflict	1 year ago
Hongxin Liu	422544222f	[pipeline] add stage manager (#4093 ) * [pipeline] add stage manager * [test] add pipeline stage manager test * [pipeline] add docstring for stage manager	1 year ago
Frank Lee	80eba05b0a	[test] refactor tests with spawn (#3452 ) * [test] added spawn decorator * polish code * polish code * polish code * polish code * polish code * polish code	2 years ago
Ziyue Jiang	09d69e1c25	[PP Middleware] Add bwd and step for PP middleware (#2111 ) * add bwd and step for PP middleware * pre-commit Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>	2 years ago
Ziyue Jiang	e4705ba4e2	[Pipeline Middleware] fix data race in Pipeline Scheduler for DAG (#2087 ) * add DAG test case * fix datarace by adjusting theposition of lock * polish code * fix pytest for middleware * remove test Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>	2 years ago
Ziyue Jiang	597cdd3006	[Pipeline Middleware] Adapt scheduler for Topo (#2066 ) * adapt scheduler for Topo * remoove comment * fix set input Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>	2 years ago
Ziyue Jiang	b0936e4a44	[rpc] split with dag (#2028 ) * add DAG to split_module * add comment * add test case for DAG * remove print * add DAG middleware in scheduler * add test case for scheduler * remove break * recover old lifecycle Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>	2 years ago
Super Daniel	393f594051	[fx/meta/rpc] move _meta_registration.py to fx folder / register fx functions with compatibility checks / remove color debug (#1710 ) * [fx] move meta registration * [fx] fix tests. * [fx] fix test. * [fx] fix. * [meta] refactor meta registration.py. * [fx] add compatibility descriptions. * [fx] polish import. * [fx] add a decorator. * [fx] fix tests. * [fx] remove print. * [fx] edit raise error. * [fx] edit raise error. * [fx] add type hint. * [fx] fix import in experimental. * [rpc] remove color debug. * [meta] fix naming.	2 years ago
Kirigaya Kazuto	9708638ded	[pipeline/pytree] add pytree to process args and kwargs \| provide `data_process_func` to process args and kwargs after forward (#1642 ) * [pipeline/tuning] improve dispatch performance both time and space cost * [pipeline/converge] add interface for testing convergence * [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style * Update PipelineBase.py * [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule \| finish Chimera * [pipeline/chimera] test chimera \| fix bug of initializing * [pipeline/pytree] add pytree to process args and kwargs \| provide to process args and kwargs after forward	2 years ago
Kirigaya Kazuto	170fa81095	[pipeline/chimera] test chimera \| fix bug of initializing (#1615 ) * [pipeline/tuning] improve dispatch performance both time and space cost * [pipeline/converge] add interface for testing convergence * [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style * Update PipelineBase.py * [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule \| finish Chimera * [pipeline/chimera] test chimera \| fix bug of initializing	2 years ago
Kirigaya Kazuto	edc9e419ad	[pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule \| finish Chimera (#1595 ) * [pipeline/tuning] improve dispatch performance both time and space cost * [pipeline/converge] add interface for testing convergence * [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style * Update PipelineBase.py * [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule \| finish Chimera	2 years ago
Kirigaya Kazuto	6159d45417	[pipeline/tuning] improve dispatch performance both time and space cost (#1544 )	2 years ago
Kirigaya Kazuto	f1e1836218	[pipeline/pipleline_process_group] finish PipelineProcessGroup to manage local abd global rank in TP,DP and PP (#1508 ) * support p2p communication with any type of object \| pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) \| pass test * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [pipeline/rpc] implement a demo for PP with cuda rpc framework * [pipeline/rpc] support interleaving \| fix checkpoint bug \| change logic when dispatch data in work_list to ensure steady 1F1B * [pipeline/rpc] implement distributed optimizer \| test with assert_close * [pipeline/rpc] implement distributed optimizer \| test with assert_close * [pipeline/rpc] update outstanding mechanism \| optimize dispatching strategy * [pipeline/rpc] update outstanding mechanism \| optimize dispatching strategy * [pipeline/rpc] update outstanding mechanism \| optimize dispatching strategy * [pipeline/pipleline_process_group] finish PipelineProcessGroup to manage local abd global rank in TP,DP and PP * [pipeline/pipleline_process_group] remove comment * [pipeline/pipleline_process_group] remove comment * [pipeline/pipleline_process_group] skip process group test * [pipeline/pipleline_process_group] remove test named function	2 years ago
Kirigaya Kazuto	5a6fd71f90	[pipeline/rpc] update outstanding mechanism \| optimize dispatching strategy (#1497 ) * support p2p communication with any type of object \| pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) \| pass test * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [pipeline/rpc] implement a demo for PP with cuda rpc framework * [pipeline/rpc] support interleaving \| fix checkpoint bug \| change logic when dispatch data in work_list to ensure steady 1F1B * [pipeline/rpc] implement distributed optimizer \| test with assert_close * [pipeline/rpc] implement distributed optimizer \| test with assert_close * [pipeline/rpc] update outstanding mechanism \| optimize dispatching strategy * [pipeline/rpc] update outstanding mechanism \| optimize dispatching strategy * [pipeline/rpc] update outstanding mechanism \| optimize dispatching strategy	2 years ago
Kirigaya Kazuto	9145aef2b4	[pipeline/rpc] implement distributed optimizer \| test with assert_close (#1486 ) * support p2p communication with any type of object \| pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) \| pass test * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [pipeline/rpc] implement a demo for PP with cuda rpc framework * [pipeline/rpc] support interleaving \| fix checkpoint bug \| change logic when dispatch data in work_list to ensure steady 1F1B * [pipeline/rpc] implement distributed optimizer \| test with assert_close * [pipeline/rpc] implement distributed optimizer \| test with assert_close	2 years ago

53 Commits (a799ca343b13665661a5e95f5ad1523457bef2e2)