duanjunwen
1739df423c
[fix] fix fwd branch, fwd pass both micro_batch & internal_inputs
2024-09-20 07:34:43 +00:00
duanjunwen
4753bf7add
[fix] fix mem assert;
2024-09-19 08:27:47 +00:00
duanjunwen
a115106f8d
[fix] fix bwd w input;
2024-09-19 08:10:05 +00:00
duanjunwen
349272c71f
[fix] update bwd b&w input; dict --> list[torch.Tensor]
2024-09-19 07:47:01 +00:00
duanjunwen
1f5c7258aa
Merge remote-tracking branch 'upstream/feature/zerobubble' into dev/zero_bubble
2024-09-19 03:52:13 +00:00
duanjunwen
af2c2f8092
[feat] add more test;
2024-09-18 07:51:54 +00:00
duanjunwen
3dbad102cf
[fix] fix zerobubble pp for shardformer type input;
2024-09-18 07:14:34 +00:00
duanjunwen
9bc3b6e220
[feat] moehybrid support zerobubble;
2024-09-12 02:51:46 +00:00
duanjunwen
11ae6848c6
[zerobubble]Support ZeroBubble Pipeline ( #6034 )
...
* [feat] add zerobubble pp (just a frame now); add POC test for dx_dw; add test for zerobubble;
* [feat] add dw test;
* [fix] fix weight not close;
* [update] update text;
* [feat] add test run_fwd_bwd automatic scheduling;
* [feat] split communication and calculation; fix pop empty send_bwd_buffer error;
* [feat] add test for p & p grad;
* [feat] add comments for ZBV func;
* [fix] rm useless assign and comments;
* [fix] fix ci test; add pytest;
* [feat] add run_fwd_bwd_with_microbatch (replace input) & test; add p&p.grad assert close test & all pass;
* [feat] add apply v_schedule graph; p & p.grad assert err exist;
* [fix] update
* [feat] fix ci; add assert;
* [feat] fix poc format
* [feat] fix func name & ci; add comments;
* [fix] fix poc test; add comments in poc;
* [feat] add optim backward_b_by_grad
* [feat] fix optimizer bwd b & w; support return accum loss & output
* [feat] add fwd_bwd_step, run_fwd_only;
* [fix] fix optim bwd; add license for v_schedule; remove redundant attributes; fix schedule loop "while"--> "for"; add communication dict;
* [fix] fix communication_map;
* [feat] update test; rm comments;
* [fix] rm zbv in hybridplugin
* [fix] fix optim bwd;
* [fix] fix optim bwd;
* [fix] rm output.data after send fwd;
* [fix] fix bwd step if condition; remove useless comments and format info;
* [fix] fix detach output & release output;
* [fix] rm requires_grad for output;
* [fix] fix requires_grad position and detach position and input&output local buffer append position;
* [feat] add memory assertion;
* [fix] fix mem check;
* [fix] mem assertion
* [fix] fix mem assertion
* [fix] fix mem; use a new model shape; only assert mem less than or equal to theoretical;
* [fix] fix model zoo import;
* [fix] fix redundant detach & clone; add buffer assertion at the end;
* [fix] add output_obj_grad assert None at bwd b step; replace input_obj.require_grad_ with treemap;
* [fix] update optim state dict assert (include param group & state); fix mem assert after add optim;
* [fix] add testcase with microbatch 4;
2024-09-10 17:33:09 +08:00
duanjunwen
6c2a120bed
[fix] add testcase with microbatch 4;
2024-09-09 10:16:03 +00:00
duanjunwen
8366a7855f
[fix] update optim state dict assert (include param group & state); fix mem assert after add optim;
2024-09-09 09:27:13 +00:00
duanjunwen
7568b34626
[fix] fix redundant detach & clone; add buffer assertion at the end;
2024-09-09 08:04:28 +00:00
duanjunwen
a5ec3d4285
[fix] fix mem; use a new model shape; only assert mem less than or equal to theoretical;
2024-09-09 06:38:31 +00:00
duanjunwen
35a7b636b3
[fix] fix mem assertion
2024-09-09 05:41:39 +00:00
duanjunwen
400e5e5b23
[fix] mem assertion
2024-09-09 02:58:06 +00:00
duanjunwen
4a358348c7
[fix] fix mem check;
2024-09-04 10:57:38 +00:00
duanjunwen
2f09c374f3
[feat] add memory assertion;
2024-09-04 06:34:18 +00:00
duanjunwen
e6e1a97a6d
[fix] fix requires_grad position and detach position and input&output local buffer append position;
2024-09-04 03:31:08 +00:00
duanjunwen
4c1f81c683
[fix] fix bwd step if condition; remove useless comments and format info;
2024-09-03 08:56:08 +00:00
duanjunwen
ab643c9af7
[fix] rm output.data after send fwd;
2024-09-03 14:12:17 +08:00
duanjunwen
591a13bf7e
[fix] fix optim bwd;
2024-09-02 11:19:42 +00:00
duanjunwen
77fe44286c
[fix] rm zbv in hybridplugin
2024-09-02 10:00:43 +00:00
duanjunwen
6d18d38d5c
[feat] update test; rm comments;
2024-09-02 09:50:47 +00:00
duanjunwen
6af81d8c0d
[feat] add fwd_bwd_step, run_fwd_only;
2024-08-30 02:47:52 +00:00
duanjunwen
48ba22dbfd
[feat] fix optimizer bwd b & w; support return accum loss & output
2024-08-29 08:54:45 +00:00
duanjunwen
4c4b01b859
[feat] add optim backward_b_by_grad
2024-08-29 03:16:59 +00:00
duanjunwen
b1419ef76a
[fix] fix poc test; add comments in poc;
2024-08-28 05:47:53 +00:00
duanjunwen
582ba0d6ff
[feat] fix func name & ci; add comments;
2024-08-28 03:40:50 +00:00
duanjunwen
b5f7b4d228
[feat] fix poc format
2024-08-28 03:08:35 +00:00
duanjunwen
d6e3d7d2a3
[feat] fix ci; add assert;
2024-08-28 02:41:05 +00:00
duanjunwen
29383b2de0
[fix] update
2024-08-28 02:33:42 +00:00
duanjunwen
fe209164f1
[feat] add apply v_schedule graph; p & p.grad assert err exist;
2024-08-27 10:29:39 +00:00
duanjunwen
8b37323f16
[feat] add run_fwd_bwd_with_microbatch (replace input) & test; add p&p.grad assert close test & all pass;
2024-08-27 09:31:38 +00:00
duanjunwen
9e0bd1af00
[fix] fix ci test; add pytest;
2024-08-27 08:00:23 +00:00
duanjunwen
283c9ff5d2
[fix] rm useless assign and comments;
2024-08-27 07:31:58 +00:00
duanjunwen
f1c1a87246
[feat] add test for p & p grad;
2024-08-27 06:37:26 +00:00
duanjunwen
5e09c8b4e1
[feat] split communication and calculation; fix pop empty send_bwd_buffer error;
2024-08-27 06:29:13 +00:00
duanjunwen
1d75045c37
[feat] add test run_fwd_bwd automatic scheduling;
2024-08-26 11:21:56 +00:00
duanjunwen
fd5526b76e
Merge branch 'main' into dev/zero_bubble
2024-08-26 04:03:20 +00:00
duanjunwen
107230d27a
[update] update text;
2024-08-26 04:00:51 +00:00
duanjunwen
203033ea16
[fix] fix weight not close;
2024-08-23 08:57:27 +00:00
duanjunwen
c18ef060cf
[feat] add dw test;
2024-08-23 06:04:12 +00:00
duanjunwen
ee9baedadf
[feat] add zerobubble pp (just a frame now); add POC test for dx_dw; add test for zerobubble;
2024-08-22 10:25:34 +00:00
Edenzzzz
f5c84af0b0
[Feature] Zigzag Ring attention ( #5905 )
...
* halfway
* fix cross-PP-stage position id length diff bug
* fix typo
* fix typo
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* unified cross entropy func for all shardformer models
* remove redundant lines
* add basic ring attn; debug cross entropy
* fwd bwd logic complete
* fwd bwd logic complete; add experimental triton rescale
* precision tests passed
* precision tests passed
* fix typos and remove misc files
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add sp_mode to benchmark; fix varlen interface
* update softmax_lse shape by new interface
* change tester name
* remove buffer clone; support packed seq layout
* add varlen tests
* fix typo
* all tests passed
* add dkv_group; fix mask
* remove debug statements
---------
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-08-16 13:56:38 +08:00
Hongxin Liu
7f8b16635b
[misc] refactor launch API and tensor constructor ( #5666 )
...
* [misc] remove config arg from initialize
* [misc] remove old tensor contrusctor
* [plugin] add npu support for ddp
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [devops] fix doc test ci
* [test] fix test launch
* [doc] update launch doc
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-29 10:40:11 +08:00
Hongxin Liu
641b1ee71a
[devops] remove post commit ci ( #5566 )
...
* [devops] remove post commit ci
* [misc] run pre-commit on all files
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-08 15:09:40 +08:00
Wenhao Chen
bb0a668fee
[hotfix] set return_outputs=False in examples and polish code ( #5404 )
...
* fix: simplify merge_batch
* fix: use return_outputs=False to eliminate extra memory consumption
* feat: add return_outputs warning
* style: remove `return_outputs=False` as it is the default value
2024-03-25 12:31:09 +08:00
Wenhao Chen
ef4f0ee854
[hotfix]: add pp sanity check and fix mbs arg ( #5268 )
...
* fix: fix misleading mbs arg
* feat: add pp sanity check
* fix: fix 1f1b sanity check
2024-01-15 15:57:40 +08:00
Wenhao Chen
4fa689fca1
[pipeline]: fix p2p comm, add metadata cache and support llama interleaved pp ( #5134 )
...
* test: add more p2p tests
* fix: remove send_forward_recv_forward as p2p op list need to use the same group
* fix: make send and receive atomic
* feat: update P2PComm fn
* feat: add metadata cache in 1f1b
* feat: add metadata cache in interleaved pp
* feat: modify is_xx_stage fn
* revert: add _broadcast_object_list
* feat: add interleaved pp in llama policy
* feat: set NCCL_BUFFSIZE in HybridParallelPlugin
2023-12-22 10:44:00 +08:00
Wenhao Chen
7172459e74
[shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert ( #5088 )
...
* [shardformer] implement policy for all GPT-J models and test
* [shardformer] support interleaved pipeline parallel for bert finetune
* [shardformer] shardformer support falcon (#4883 )
* [shardformer]: fix interleaved pipeline for bert model (#5048 )
* [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093 )
* Add Mistral support for Shardformer (#5103 )
* [shardformer] add tests to mistral (#5105 )
---------
Co-authored-by: Pengtai Xu <henryxu880@gmail.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: eric8607242 <e0928021388@gmail.com>
2023-11-28 16:54:42 +08:00