ColossalAI

Author	SHA1	Message	Date
Yuanheng Zhao	04863a9b14	[example] Update Llama Inference example (#5629 ) * [example] add infernece benchmark llama3 * revise inference config - arg * remove unused args * add llama generation demo script * fix init rope in llama policy * add benchmark-llama3 - cleanup	2024-04-23 22:23:07 +08:00
yuehuayingxueluo	12f10d5b0b	[Fix/Inference]Fix CUDA Rotary Rmbedding GQA (#5623 ) * fix rotary embedding GQA * change test_rotary_embdding_unpad.py KH	2024-04-23 13:44:49 +08:00
Yuanheng Zhao	5d4c1fe8f5	[Fix/Inference] Fix GQA Triton and Support Llama3 (#5624 ) * [fix] GQA calling of flash decoding triton * fix kv cache alloc shape * fix rotary triton - GQA * fix sequence max length assigning * Sequence max length logic * fix scheduling and spec-dec * skip without import error * fix pytest - skip without ImportError --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-04-23 13:09:55 +08:00
Steve Luo	ccf72797e3	feat baichuan2 rmsnorm whose hidden size equals to 5120 (#5611 )	2024-04-19 15:34:53 +08:00
Runyu Lu	e37ee2fb65	[Feat]Tensor Model Parallel Support For Inference (#5563 ) * tensor parallel support naive source * [fix]precision, model load and refactor the framework * add tp unit test * docstring * fix do_sample	2024-04-18 16:56:46 +08:00
Steve Luo	be396ad6cc	[Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531 ) * feat flash decoding for paged attention * refactor flashdecodingattention * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-04-18 16:45:07 +08:00
yuehuayingxueluo	56b222eff8	[inference/model]Adapted to the baichuan2-7B model (#5591 ) * Adapted to the baichuan2-7B model * modified according to the review comments. * Modified the method of obtaining random weights. * modified according to the review comments. * change mlp layewr 'NOTE'	2024-04-15 16:53:02 +08:00
傅剑寒	d4cb023b62	[Inference/Refactor] Delete Duplicated code and refactor vec_copy utils and reduce utils (#5593 ) * delete duplicated code and refactor vec_copy utils and reduce utils * delete unused header file	2024-04-15 10:57:51 +08:00
傅剑寒	a21912339a	refactor csrc (#5582 )	2024-04-11 15:41:36 +08:00
Yuanheng Zhao	25928d8496	[Inference/Spec-Dec] Merge pull request #5565 from hpcaitech/feat/speculative-decoding Add Speculative Decoding and GLIDE Spec-Dec	2024-04-10 18:39:27 +08:00
Yuanheng	f8598e3ec5	[Fix] Llama Modeling Control with Spec-Dec (#5580 ) - fix ref before asgmt - fall back to use triton kernels when using spec-dec	2024-04-10 18:19:44 +08:00
Yuanheng Zhao	e60d430cf5	[Fix] resolve conflicts of rebasing feat/speculative-decoding (#5557 ) - resolve conflicts of rebasing feat/speculative-decoding	2024-04-10 18:13:49 +08:00
Yuanheng Zhao	e1acb58423	[doc] Add inference/speculative-decoding README (#5552 ) * add README for spec-dec * update roadmap	2024-04-10 11:07:52 +08:00
Yuanheng Zhao	d85d91435a	[Inference/SpecDec] Support GLIDE Drafter Model (#5455 ) * add glide-llama policy and modeling * update glide modeling, compitable with transformers 4.36.2 * revise glide llama modeling/usage * fix issues of glimpsing large kv * revise the way re-loading params for glide drafter * fix drafter and engine tests * enable convert to glide strict=False * revise glide llama modeling * revise vicuna prompt template * revise drafter and tests * apply usage of glide model in engine	2024-04-10 11:07:52 +08:00
Yuanheng Zhao	912e24b2aa	[SpecDec] Fix inputs for speculation and revise past KV trimming (#5449 ) * fix drafter pastkv and usage of batch bucket	2024-04-10 11:07:52 +08:00
Yuanheng Zhao	a37f82629d	[Inference/SpecDec] Add Speculative Decoding Implementation (#5423 ) * fix flash decoding mask during verification * add spec-dec * add test for spec-dec * revise drafter init * remove drafter sampling * retire past kv in drafter * (trivial) rename attrs * (trivial) rename arg * revise how we enable/disable spec-dec	2024-04-10 11:07:52 +08:00
Yuanheng Zhao	5a9b05f7b2	[Inference/SpecDec] Add Basic Drafter Model Container (#5405 ) * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399) fix dependency in pytest * add drafter model container (basic ver)	2024-04-10 11:07:51 +08:00
Yuanheng Zhao	d63c469f45	[Infer] Revise and Adapt Triton Kernels for Spec-Dec (#5401 ) * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399) fix dependency in pytest * resolve conflicts for revising flash-attn * adapt kv cache copy kernel for spec-dec * fix seqlen-n kvcache copy kernel/tests * test kvcache copy - use torch.equal * add assertions * (trivial) comment out	2024-04-10 11:07:51 +08:00
Yuanheng Zhao	d56c96334e	Sync main to feature/colossal-infer [Sync] Merge feature/colossal-infer with main	2024-04-09 10:09:34 +08:00
Yuanheng	7ca1d1c545	remove outdated triton test	2024-04-08 17:00:55 +08:00
pre-commit-ci[bot]	d78817539e	[pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci	2024-04-08 08:41:09 +00:00
Yuanheng	ce9401ad52	remove unused triton kernels	2024-04-08 16:25:12 +08:00
Yuanheng	ed5ebd1735	[Fix] resolve conflicts of merging main	2024-04-08 16:21:47 +08:00
Hongxin Liu	641b1ee71a	[devops] remove post commit ci (#5566 ) * [devops] remove post commit ci * [misc] run pre-commit on all files * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2024-04-08 15:09:40 +08:00
傅剑寒	7ebdf48ac5	add cast and op_functor for cuda build-in types (#5546 )	2024-04-08 11:38:05 +08:00
digger yu	341263df48	[hotfix] fix typo s/get_defualt_parser /get_default_parser (#5548 )	2024-04-07 19:04:58 +08:00
digger yu	a799ca343b	[fix] fix typo s/muiti-node /multi-node etc. (#5448 )	2024-04-07 18:42:15 +08:00
Edenzzzz	15055f9a36	[hotfix] quick fixes to make legacy tutorials runnable (#5559 ) Co-authored-by: Edenzzzz <wtan45@wisc.edu>	2024-04-07 12:06:27 +08:00
Zhongkai Zhao	8e412a548e	[shardformer] Sequence Parallelism Optimization (#5533 ) * sequence parallel optimization * validate sequence parallel in llama (code to be polished) * shardformer api writing * integrate sequence parallel in ShardFormer * fix pp bugs and sp bugs for LlaMa model * integrating ring-based sequence parallelism into ShardFormer * [sequence parallelism]: Add fused megatron function * integrating ring-based sequence parallelism into ShardFormer --------- Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn> * fix bugs when useing sp and flashattention together * fix operation function name * support flash attention for ulysses-style sp * clarify sp process group * fix compatibility bugs in moe plugin * fix fused linear bugs * fix linear layer test * support gpt model all-to-all sp * modify shard data dimension (meant to be dim=-1) * support megtron-style sp and distributed attn for llama model * [shardformer] add megatron sp to llama * support llama7B 128k with distributed attention * [shardformer] robustness enhancement * add block attn * sp mode 1: keep input as a complete sequence * fix sp compatability * finish sp mode 3 support for gpt * using all_to_all_single when batch size is 1 * support mode 2 sp in gpt2 (#5) * [shardformer] add megatron sp to llama * support llama7B 128k with distributed attention * [shardformer] robustness enhancement * add block attn * sp mode 1: keep input as a complete sequence * fix sp compatability * refactor ring implementation * support mode 2 sp in gpt2 * polish code * enable distributed attn mask when using sp mode 2 and 3 in llama * automatically enable flash attn when using sp mode 2 and 3 in llama * inplace attn mask * add zero2 support for sequence parallel * polish code * fix bugs * fix gemini checkpoint io * loose tensor checking atol and rtol * add comment * fix llama layernorm grad * fix zero grad * fix zero grad * fix conflict * update split and gather auto grad func * sequence parallel: inside text split (#6) * polish code (part 1) * polish code (part 2) * polish code (part 2.5) * polish code (part 3) * sequence parallel: inside text split * miscellaneous minor fixes * polish code * fix ulysses style ZeRO * sequence parallel: inside text split * miscellaneous minor fixes * disaggregate sp group and dp group for sp * fix llama and gpt sp * polish code * move ulysses grad sync to ddp (#9) * remove zero_stage and unbind the grad sync for alltoall sp * add 2d group creation test * move ulysses grad sync to ddp * add 2d group creation test * remove useless code * change shard config not to enable sp when enable_all_optimizations * add sp warnings for several model * remove useless code --------- Co-authored-by: linsj20 <linsj20@mails.tsinghua.edu.cn>	2024-04-03 17:15:47 +08:00
Edenzzzz	7e0ec5a85c	fix incorrect sharding without zero (#5545 ) Co-authored-by: Edenzzzz <wtan45@wisc.edu>	2024-04-02 20:11:18 +08:00
Yuanheng Zhao	4bb5d8923a	[Fix/Inference] Remove unused and non-functional functions (#5543 ) * [fix] remove unused func * rm non-functional partial	2024-04-02 14:16:59 +08:00
傅剑寒	a2878e39f4	[Inference] Add Reduce Utils (#5537 ) * add reduce utils * add using to delele namespace prefix	2024-04-01 15:34:25 +08:00
yuehuayingxueluo	04aca9e55b	[Inference/Kernel]Add get_cos_and_sin Kernel (#5528 ) * Add get_cos_and_sin kernel * fix code comments * fix code typos * merge common codes of get_cos_and_sin kernel. * Fixed a typo * Changed 'asset allclose' to 'assert equal'.	2024-04-01 13:47:14 +08:00
Wenhao Chen	e614aa34f3	[shardformer, pipeline] add `gradient_checkpointing_ratio` and heterogenous shard policy for llama (#5508 ) * feat: add `GradientCheckpointConfig` and `PipelineGradientCheckpointConfig` * feat: apply `GradientCheckpointConfig` to policy and llama_forward * feat: move `distribute_layer` and `get_stage_index` to PipelineStageManager * fix: add optional args for `distribute_layer` and `get_stage_index` * fix: fix changed API calls * test: update llama tests * style: polish `GradientCheckpointConfig` * fix: fix pipeline utils tests	2024-04-01 11:34:58 +08:00
YeAnbang	df5e9c53cf	[ColossalChat] Update RLHF V2 (#5286 ) * Add dpo. Fix sft, ppo, lora. Refactor all * fix and tested ppo * 2 nd round refactor * add ci tests * fix ci * fix ci * fix readme, style * fix readme style * fix style, fix benchmark * reproduce benchmark result, remove useless files * rename to ColossalChat * use new image * fix ci workflow * fix ci * use local model/tokenizer for ci tests * fix ci * fix ci * fix ci * fix ci timeout * fix rm progress bar. fix ci timeout * fix ci * fix ci typo * remove 3d plugin from ci temporary * test environment * cannot save optimizer * support chat template * fix readme * fix path * test ci locally * restore build_or_pr * fix ci data path * fix benchmark * fix ci, move ci tests to 3080, disable fast tokenizer * move ci to 85 * support flash attention 2 * add all-in-one data preparation script. Fix colossal-llama2-chat chat template * add hardware requirements * move ci test data * fix save_model, add unwrap * fix missing bos * fix missing bos; support grad accumulation with gemini * fix ci * fix ci * fix ci * fix llama2 chat template config * debug sft * debug sft * fix colossalai version requirement * fix ci * add sanity check to prevent NaN loss * fix requirements * add dummy data generation script * add dummy data generation script * add dummy data generation script * add dummy data generation script * update readme * update readme * update readme and ignore * fix logger bug * support parallel_output * modify data preparation logic * fix tokenization * update lr * fix inference * run pre-commit --------- Co-authored-by: Tong Li <tong.li352711588@gmail.com>	2024-03-29 14:12:29 +08:00
Yuanheng Zhao	36c4bb2893	[Fix] Grok-1 use tokenizer from the same pretrained path (#5532 ) * [fix] use tokenizer from the same pretrained path * trust remote code	2024-03-28 16:30:04 +08:00
yuehuayingxueluo	934e31afb2	The writing style of tail processing and the logic related to macro definitions have been optimized. (#5519 )	2024-03-28 10:42:51 +08:00
Insu Jang	00525f7772	[shardformer] fix pipeline forward error if custom layer distribution is used (#5189 ) * Use self.[distribute_layers|get_stage_index] to exploit custom layer distribution * Change static methods for t5 layer distribution to member functions * Change static methods for whisper layer distribution to member functions * Replace whisper policy usage with self one * Fix test case to use non-static layer distribution methods * fix: fix typo --------- Co-authored-by: Wenhao Chen <cwher@outlook.com>	2024-03-27 13:57:00 +08:00
github-actions[bot]	e6707a6e8d	[format] applied code formatting on changed files in pull request 5510 (#5517 ) Co-authored-by: github-actions <github-actions@github.com>	2024-03-27 11:21:03 +08:00
Hongxin Liu	19e1a5cf16	[shardformer] update colo attention to support custom mask (#5510 ) * [feature] refactor colo attention (#5462) * [extension] update api * [feature] add colo attention * [feature] update sdpa * [feature] update npu attention * [feature] update flash-attn * [test] add flash attn test * [test] update flash attn test * [shardformer] update modeling to fit colo attention (#5465) * [misc] refactor folder structure * [shardformer] update llama flash-attn * [shardformer] fix llama policy * [devops] update tensornvme install * [test] update llama test * [shardformer] update colo attn kernel dispatch * [shardformer] update blip2 * [shardformer] update chatglm * [shardformer] update gpt2 * [shardformer] update gptj * [shardformer] update opt * [shardformer] update vit * [shardformer] update colo attention mask prep * [shardformer] update whisper * [test] fix shardformer tests (#5514) * [test] fix shardformer tests * [test] fix shardformer tests	2024-03-27 11:19:32 +08:00
Edenzzzz	9a3321e9f4	Merge pull request #5515 from Edenzzzz/fix_layout_convert Fix layout convertor caching	2024-03-26 19:51:02 +08:00
Edenzzzz	18edcd5368	Empty-Commit	2024-03-26 19:50:41 +08:00
Edenzzzz	61da3fbc52	fixed layout converter caching and updated tester	2024-03-26 17:22:27 +08:00
傅剑寒	e6496dd371	[Inference] Optimize request handler of llama (#5512 ) * optimize request_handler * fix ways of writing	2024-03-26 16:37:14 +08:00
Rocky Duan	cbe34c557c	Fix ColoTensorSpec for py11 (#5440 )	2024-03-26 15:56:49 +08:00
Hongxin Liu	a7790a92e8	[devops] fix example test ci (#5504 )	2024-03-26 15:09:05 +08:00
Yuanheng Zhao	131f32a076	[fix] fix grok-1 example typo (#5506 )	2024-03-26 10:19:42 +08:00
flybird11111	0688d92e2d	[shardformer]Fix lm parallel. (#5480 ) * fix * padding vocab_size when using pipeline parallellism padding vocab_size when using pipeline parallellism fix fix * fix * fix fix fix * fix gather output * fix * fix * fix fix resize embedding fix resize embedding * fix resize embedding fix * revert * revert * revert * fix lm forward distribution * fix * test ci * fix	2024-03-25 17:21:51 +08:00
Runyu Lu	6251d68dc9	[fix] PR #5354 (#5501 ) * [fix] * [fix] * Update config.py docstring * [fix] docstring align * [fix] docstring align * [fix] docstring align	2024-03-25 15:24:17 +08:00
Runyu Lu	1d626233ce	Merge pull request #5434 from LRY89757/colossal-infer-cuda-graph [feat] cuda graph support and refactor non-functional api	2024-03-25 14:55:59 +08:00

3199 Commits (04863a9b144fc7dd46a57d2c7b0cf2f4b351ffb6)