Commit Graph

92 Commits (6c2a120bed8658015f0f4e4ee95cbbe314b6ce5e)

Author SHA1 Message Date
duanjunwen 6c2a120bed [fix] add testcase with microbatch 4; 2024-09-09 10:16:03 +00:00
duanjunwen 8366a7855f [fix] update optim state dict assert (includes param group & state); fix mem assert after adding optim; 2024-09-09 09:27:13 +00:00
duanjunwen 7568b34626 [fix] fix redundant detach & clone; add buffer assertion at the end; 2024-09-09 08:04:28 +00:00
duanjunwen a5ec3d4285 [fix] fix mem; use a new model shape; only assert mem less than or equal to theoretical; 2024-09-09 06:38:31 +00:00
duanjunwen 35a7b636b3 [fix] fix mem assertion 2024-09-09 05:41:39 +00:00
duanjunwen 400e5e5b23 [fix] mem assertion 2024-09-09 02:58:06 +00:00
duanjunwen 4a358348c7 [fix] fix mem check; 2024-09-04 10:57:38 +00:00
duanjunwen 2f09c374f3 [feat] add memory assertion; 2024-09-04 06:34:18 +00:00
duanjunwen e6e1a97a6d [fix] fix requires_grad position, detach position, and input & output local buffer append position; 2024-09-04 03:31:08 +00:00
duanjunwen 4c1f81c683 [fix] fix bwd step if condition; remove useless comments and format info; 2024-09-03 08:56:08 +00:00
duanjunwen ab643c9af7 [fix] rm output.data after send fwd; 2024-09-03 14:12:17 +08:00
duanjunwen 591a13bf7e [fix] fix optim bwd; 2024-09-02 11:19:42 +00:00
duanjunwen 77fe44286c [fix] rm zbv in hybridplugin 2024-09-02 10:00:43 +00:00
duanjunwen 6d18d38d5c [feat] update test; rm comments; 2024-09-02 09:50:47 +00:00
duanjunwen 6af81d8c0d [feat] add fwd_bwd_step, run_fwd_only; 2024-08-30 02:47:52 +00:00
duanjunwen 48ba22dbfd [feat] fix optimizer bwd b & w; support return accum loss & output 2024-08-29 08:54:45 +00:00
duanjunwen 4c4b01b859 [feat] add optim backward_b_by_grad 2024-08-29 03:16:59 +00:00
duanjunwen b1419ef76a [fix] fix poc test; add comments in poc; 2024-08-28 05:47:53 +00:00
duanjunwen 582ba0d6ff [feat] fix func name & ci; add comments; 2024-08-28 03:40:50 +00:00
duanjunwen b5f7b4d228 [feat] fix poc format 2024-08-28 03:08:35 +00:00
duanjunwen d6e3d7d2a3 [feat] fix ci; add assert; 2024-08-28 02:41:05 +00:00
duanjunwen 29383b2de0 [fix] update 2024-08-28 02:33:42 +00:00
duanjunwen fe209164f1 [feat] add apply v_schedule graph; p & p.grad assert error still exists; 2024-08-27 10:29:39 +00:00
duanjunwen 8b37323f16 [feat] add run_fwd_bwd_with_microbatch (replace input) & test; add p&p.grad assert close test & all pass; 2024-08-27 09:31:38 +00:00
duanjunwen 9e0bd1af00 [fix] fix ci test; add pytest; 2024-08-27 08:00:23 +00:00
duanjunwen 283c9ff5d2 [fix] rm useless assign and comments; 2024-08-27 07:31:58 +00:00
duanjunwen f1c1a87246 [feat] add test for p & p grad; 2024-08-27 06:37:26 +00:00
duanjunwen 5e09c8b4e1 [feat] split communication and calculation; fix pop empty send_bwd_buffer error; 2024-08-27 06:29:13 +00:00
duanjunwen 1d75045c37 [feat] add test run_fwd_bwd automatic scheduling; 2024-08-26 11:21:56 +00:00
duanjunwen fd5526b76e Merge branch 'main' into dev/zero_bubble 2024-08-26 04:03:20 +00:00
duanjunwen 107230d27a [update] update text; 2024-08-26 04:00:51 +00:00
duanjunwen 203033ea16 [fix] fix weight not close; 2024-08-23 08:57:27 +00:00
duanjunwen c18ef060cf [feat] add dw test; 2024-08-23 06:04:12 +00:00
duanjunwen ee9baedadf [feat] add zerobubble pp (just a frame now); add POC test for dx_dw; add test for zerobubble; 2024-08-22 10:25:34 +00:00
Edenzzzz f5c84af0b0
[Feature] Zigzag Ring attention (#5905)
* halfway

* fix cross-PP-stage position id length diff bug

* fix typo

* fix typo

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* unified cross entropy func for all shardformer models

* remove redundant lines

* add basic ring attn; debug cross entropy

* fwd bwd logic complete

* fwd bwd logic complete; add experimental triton rescale

* precision tests passed

* precision tests passed

* fix typos and remove misc files

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add sp_mode to benchmark; fix varlen interface

* update softmax_lse shape by new interface

* change tester name

* remove buffer clone; support packed seq layout

* add varlen tests

* fix typo

* all tests passed

* add dkv_group; fix mask

* remove debug statements

---------

Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-08-16 13:56:38 +08:00
Edenzzzz 2a25a2aff7
[Feature] optimize PP overlap (#5735)
* update to fully overlap, still debugging

* improve interface

* fixed deadlock bug

* debug NaN loss

* (experimental) use one comm group for send_fw_recv_fw to fix NaN

* cleaned up interfaces; use one batch p2p for all

* clean up; removed the double p2p batch case

* p2p test passed

* improve overlap: send fwd before backward

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tentatively use 2 p2p batches

* remove two p2p batches

* fix typos

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove pp.sh

---------

Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: root <root@notebook-c55824c0-7742-45e8-9591-c855bb77ad29-0.notebook-c55824c0-7742-45e8-9591-c855bb77ad29.colossal-ai.svc.cluster.local>
2024-06-26 14:48:02 +08:00
Hongxin Liu 7f8b16635b
[misc] refactor launch API and tensor constructor (#5666)
* [misc] remove config arg from initialize

* [misc] remove old tensor constructor

* [plugin] add npu support for ddp

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [devops] fix doc test ci

* [test] fix test launch

* [doc] update launch doc

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-29 10:40:11 +08:00
Hongxin Liu 1b387ca9fe
[shardformer] refactor pipeline grad ckpt config (#5646)
* [shardformer] refactor pipeline grad ckpt config

* [shardformer] refactor pipeline grad ckpt config

* [pipeline] fix stage manager
2024-04-25 15:19:30 +08:00
Hongxin Liu 641b1ee71a
[devops] remove post commit ci (#5566)
* [devops] remove post commit ci

* [misc] run pre-commit on all files

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-08 15:09:40 +08:00
Wenhao Chen e614aa34f3
[shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogeneous shard policy for llama (#5508)
* feat: add `GradientCheckpointConfig` and `PipelineGradientCheckpointConfig`

* feat: apply `GradientCheckpointConfig` to policy and llama_forward

* feat: move `distribute_layer` and `get_stage_index` to PipelineStageManager

* fix: add optional args for `distribute_layer` and `get_stage_index`

* fix: fix changed API calls

* test: update llama tests

* style: polish `GradientCheckpointConfig`

* fix: fix pipeline utils tests
2024-04-01 11:34:58 +08:00
Insu Jang 00525f7772
[shardformer] fix pipeline forward error if custom layer distribution is used (#5189)
* Use self.[distribute_layers|get_stage_index] to exploit custom layer distribution

* Change static methods for t5 layer distribution to member functions

* Change static methods for whisper layer distribution to member functions

* Replace whisper policy usage with self one

* Fix test case to use non-static layer distribution methods

* fix: fix typo

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-03-27 13:57:00 +08:00
Wenhao Chen bb0a668fee
[hotfix] set return_outputs=False in examples and polish code (#5404)
* fix: simplify merge_batch

* fix: use return_outputs=False to eliminate extra memory consumption

* feat: add return_outputs warning

* style: remove `return_outputs=False` as it is the default value
2024-03-25 12:31:09 +08:00
ver217 148469348a Merge branch 'main' into sync/npu 2024-01-18 12:05:21 +08:00
Wenhao Chen ef4f0ee854
[hotfix]: add pp sanity check and fix mbs arg (#5268)
* fix: fix misleading mbs arg

* feat: add pp sanity check

* fix: fix 1f1b sanity check
2024-01-15 15:57:40 +08:00
Hongxin Liu d202cc28c0
[npu] change device to accelerator api (#5239)
* update accelerator

* fix timer

* fix amp

* update

* fix

* update bug

* add error raise

* fix autocast

* fix set device

* remove doc accelerator

* update doc

* update doc

* update doc

* use nullcontext

* update cpu

* update null context

* change time limit for example

* update

* update

* update

* update

* [npu] polish accelerator code

---------

Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com>
Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com>
2024-01-09 10:20:05 +08:00
Elsa Granger d565df3821
[pipeline] A more general _communicate in p2p (#5062)
* A more general _communicate

* feat: finish tree_flatten version p2p

* fix: update p2p api calls

---------

Co-authored-by: Wenhao Chen <cwher@outlook.com>
2024-01-08 15:37:27 +08:00
Wenhao Chen d799a3088f
[pipeline]: add p2p fallback order and fix interleaved pp deadlock (#5214)
* fix: add fallback order option and update 1f1b

* fix: fix deadlock comm in interleaved pp

* test: modify p2p test
2024-01-03 11:34:49 +08:00
Wenhao Chen 4fa689fca1
[pipeline]: fix p2p comm, add metadata cache and support llama interleaved pp (#5134)
* test: add more p2p tests

* fix: remove send_forward_recv_forward as p2p op list needs to use the same group

* fix: make send and receive atomic

* feat: update P2PComm fn

* feat: add metadata cache in 1f1b

* feat: add metadata cache in interleaved pp

* feat: modify is_xx_stage fn

* revert: add _broadcast_object_list

* feat: add interleaved pp in llama policy

* feat: set NCCL_BUFFSIZE in HybridParallelPlugin
2023-12-22 10:44:00 +08:00
Wenhao Chen 7172459e74
[shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert (#5088)
* [shardformer] implement policy for all GPT-J models and test

* [shardformer] support interleaved pipeline parallel for bert finetune

* [shardformer] shardformer support falcon (#4883)

* [shardformer]: fix interleaved pipeline for bert model (#5048)

* [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093)

* Add Mistral support for Shardformer (#5103)

* [shardformer] add tests to mistral (#5105)

---------

Co-authored-by: Pengtai Xu <henryxu880@gmail.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: eric8607242 <e0928021388@gmail.com>
2023-11-28 16:54:42 +08:00
Hongxin Liu 079bf3cb26
[misc] update pre-commit and run all files (#4752)
* [misc] update pre-commit

* [misc] run pre-commit

* [misc] remove useless configuration files

* [misc] ignore cuda for clang-format
2023-09-19 14:20:26 +08:00