ColossalAI

Commit Graph

Author	SHA1	Message	Date
Yuanheng Zhao	537a3cbc4d	[kernel] Support New KCache Layout - Triton Kernel (#5677 ) * kvmemcpy triton for new kcache layout * revise tests for new kcache layout * naive triton flash decoding - new kcache layout * rotary triton kernel - new kcache layout * remove redundancy - triton decoding * remove redundancy - triton kvcache copy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	7 months ago
Yuanheng Zhao	5be590b99e	[kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658 ) * add context attn triton kernel - new kcache layout * add benchmark triton * tiny revise * trivial - code style, comment	7 months ago
yuehuayingxueluo	3c91e3f176	[Inference]Adapt to baichuan2 13B (#5614 ) * adapt to baichuan2 13B * adapt to baichuan2 13B * change BAICHUAN_MODEL_NAME_OR_PATH * fix test_decoding_attn.py * Modifications based on review comments. * change BAICHUAN_MODEL_NAME_OR_PATH * mv attn mask processes to test flash decoding * mv get_alibi_slopes baichuan modeling * fix bugs in test_baichuan.py	7 months ago
Yuanheng Zhao	5d4c1fe8f5	[Fix/Inference] Fix GQA Triton and Support Llama3 (#5624 ) * [fix] GQA calling of flash decoding triton * fix kv cache alloc shape * fix rotary triton - GQA * fix sequence max length assigning * Sequence max length logic * fix scheduling and spec-dec * skip without import error * fix pytest - skip without ImportError --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	7 months ago
Yuanheng Zhao	a37f82629d	[Inference/SpecDec] Add Speculative Decoding Implementation (#5423 ) * fix flash decoding mask during verification * add spec-dec * add test for spec-dec * revise drafter init * remove drafter sampling * retire past kv in drafter * (trivial) rename attrs * (trivial) rename arg * revise how we enable/disable spec-dec	8 months ago
Yuanheng Zhao	d63c469f45	[Infer] Revise and Adapt Triton Kernels for Spec-Dec (#5401 ) * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399) fix dependency in pytest * resolve conflicts for revising flash-attn * adapt kv cache copy kernel for spec-dec * fix seqlen-n kvcache copy kernel/tests * test kvcache copy - use torch.equal * add assertions * (trivial) comment out	8 months ago
Yuanheng	7ca1d1c545	remove outdated triton test	8 months ago
Yuanheng	ce9401ad52	remove unused triton kernels	8 months ago
Yuanheng	ed5ebd1735	[Fix] resolve conflicts of merging main	8 months ago
Hongxin Liu	641b1ee71a	[devops] remove post commit ci (#5566 ) * [devops] remove post commit ci * [misc] run pre-commit on all files * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	8 months ago
Runyu Lu	b2c0d9ff2b	[fix] multi graphs capture error	9 months ago
Runyu Lu	cefaeb5fdd	[feat] cuda graph support and refactor non-functional api	9 months ago
yuehuayingxueluo	2a718c8be8	Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390 ) * opt_view_and_memcopy * fix bugs in ci * fix ci bugs * update benchmark scripts * fix ci bugs	9 months ago
Jianghai	730103819d	[Inference]Fused kv copy into rotary calculation (#5383 ) * revise rotary embedding * remove useless print * adapt * fix * add * fix * modeling * fix * fix * fix * fused kv copy * fused copy * colossalai/kernel/triton/no_pad_rotary_embedding.py * del padding llama * del	9 months ago
yuehuayingxueluo	6fb4bcbb24	[Inference/opt] Fused KVCahce Memcopy (#5374 ) * fused kv memcopy * add TODO in test_kvcache_copy.py	10 months ago
Frank Lee	8106ede07f	Revert "[Inference] Adapt to Fused rotary (#5348 )" (#5373 ) This reverts commit `9f4ab2eb92`.	10 months ago
Jianghai	9f4ab2eb92	[Inference] Adapt to Fused rotary (#5348 ) * revise rotary embedding * remove useless print * adapt * fix * add * fix * modeling * fix * fix * fix	10 months ago
yuehuayingxueluo	35382a7fbf	[Inference]Fused the gate and up proj in mlp，and optimized the autograd process. (#5365 ) * fused the gate and up proj in mlp * fix code styles * opt auto_grad * rollback test_inference_engine.py * modifications based on the review feedback. * fix bugs in flash attn * Change reshape to view * fix test_rmsnorm_triton.py	10 months ago
yuehuayingxueluo	21ad4a27f9	[Inference/opt]Optimize the mid tensor of RMS Norm (#5350 ) * opt rms_norm * fix bugs in rms_layernorm	10 months ago
yuehuayingxueluo	249644c23b	[Inference]Repalce Attention layer and MLP layer by shardformer to optimize the weight transpose operation，add fused_qkv and fused linear_add (#5340 ) * add fused qkv * replace attn and mlp by shardformer * fix bugs in mlp * add docstrings * fix test_inference_engine.py * add optimize unbind * add fused_addmm * rm squeeze(1) * refactor codes * fix ci bugs * rename ShardFormerLlamaMLP and ShardFormerLlamaAttention * Removed the dependency on LlamaFlashAttention2 * rollback test_inference_engine.py	10 months ago
Jianghai	df0aa49585	[Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336 ) * revise rotary embedding * remove useless print * adapt	10 months ago
Yuanheng Zhao	5f98a9d68a	[Infer] Optimize Blocked KVCache And Kernels Using It (#5325 ) * revise shape of kvcache (context attn kernel) * revise shape of kvcache (flash decoding kernel) * revise shape of kvcache (kvcache copy) and attn func * init of kvcache in kvcache manager * revise llama modeling * revise block size retrieval * use torch for rms_norm benchmarking * revise block size retrieval	10 months ago
Jianghai	1f8a75d470	[Inference] Update rms norm kernel, benchmark with vLLM (#5315 ) * add * xi * del * del * fix	10 months ago
Jianghai	7ddd8b37f0	fix (#5311 )	10 months ago
yuehuayingxueluo	4f28cb43c0	[inference]Optimize the usage of the mid tensors space in flash attn (#5304 ) * opt flash attn * opt tmp tensor * fix benchmark_llama * fix code style * fix None logic for output tensor * fix adapted to get_xine_cache * add comment * fix ci bugs * fix some codes * rm duplicated codes * rm duplicated codes * fix code style * add _get_dtype in config.py	10 months ago
Yuanheng Zhao	af8359c430	[hotfix] fix boundary check in batch (#5306 )	10 months ago
Jianghai	c647e00e3c	[Inference]Add fused rotary kernel and get cos cache kernel (#5302 ) * add fused rotary and get cos cache func * staged * fix bugs * fix bugs	10 months ago
Yuanheng Zhao	3da9993b0d	[Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301 ) * fix decoding kernel pytest * revise and add triton context attn benchmark	10 months ago
yuehuayingxueluo	bfff9254ac	[inference] Adapted to Rotary Embedding and RMS Norm (#5283 ) * adapted to rotary_embedding * adapted to nopad rms norm * fix bugs in benchmark * fix flash_decoding.py	10 months ago
Yuanheng Zhao	6e487e7d3c	[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274 ) * prevent re-creating intermediate tensors * add singleton class holding intermediate values * fix triton kernel api * add benchmark in pytest * fix kernel api and add benchmark * revise flash decoding triton kernel in/out shapes * fix calling of triton kernel in modeling * fix pytest: extract to util functions	10 months ago
Yaozheng Fang	5ae9099f92	[kernel] Add RMSLayerNorm triton kernel (#5262 ) * add layerrmsnorm triton kernel * add layerrmsnorm kernel * modify the atol and rtol in test file * Remove the logics of mean computations, and update the name of ther kernel functions and files * add benchmark of rms norm	11 months ago
Yuanheng Zhao	0f2b46a41c	[kernel] Revise KVCache copy triton kernel API (#5273 ) * [kernel/fix] revise kvcache copy kernel api * fix benchmark	11 months ago
Yuanheng Zhao	fa85e02b3b	[kernel] Add KV cache copy kernel during decoding (#5261 ) * add kv copy triton kernel during decoding stage * add pytest and fix kernel * fix test utilities * revise kernel config * add benchmark for kvcache copy	11 months ago
Yuanheng Zhao	1513f20f4d	[kernel] Add flash decoding triton kernel for blocked kv cache (#5249 ) * add flash decoding unpad triton kernel * rename flash decoding kernel * add kernel testing (draft) * revise pytest * support kv group (GQA) * (trivial) fix api and pytest * (trivial) func renaming * (trivial) func/file renaming * refactor pytest for attention * (trivial) format and consistent vars of context/decode attn * (trivial) remove test redundancy	11 months ago
Jianghai	fded91d049	[Inference] Kernel: no pad rotary embedding (#5252 ) * fix bugs * comment * use more accurate atol * fix	11 months ago
Yuanheng Zhao	07b5283b6a	[kernel] Add triton kernel for context attention (FAv2) without padding (#5192 ) * add context attn unpadded triton kernel * test compatibility * kv cache copy (testing) * fix k/v cache copy * fix kv cache copy and test * fix boundary of block ptrs * add support for GQA/MQA and testing * fix import statement --------- Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>	11 months ago
Yuanheng Zhao	2bb92243d4	[Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159 ) * [inference/nfc] remove outdated inference tests * remove outdated kernel tests * remove deprecated triton kernels * remove imports from deprecated kernels	11 months ago
Cuiqing Li (李崔卿)	bce919708f	[Kernels]added flash-decoidng of triton (#5063 ) * added flash-decoidng of triton based on lightllm kernel * add req * clean * clean * delete build.sh --------- Co-authored-by: cuiqing.li <lixx336@gmail.com>	1 year ago
Cuiqing Li (李崔卿)	28052a71fb	[Kernels]Update triton kernels into 2.1.0 (#5046 ) * update flash-context-attention * adding kernels * fix * reset * add build script * add building process * add llama2 exmaple * add colossal-llama2 test * clean * fall back test setting * fix test file * clean * clean * clean --------- Co-authored-by: cuiqing.li <lixx336@gmail.com>	1 year ago
Xuanlei Zhao	dc003c304c	[moe] merge moe into main (#4978 ) * update moe module * support openmoe	1 year ago
Cuiqing Li	459a88c806	[Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention (#4965 ) * adding flash-decoding * clean * adding kernel * adding flash-decoding * add integration * add * adding kernel * adding kernel * adding triton 2.1.0 features for inference * update bloom triton kernel * remove useless vllm kernels * clean codes * fix * adding files * fix readme * update llama flash-decoding --------- Co-authored-by: cuiqing.li <lixx336@gmail.com>	1 year ago
Jianghai	cf579ff46d	[Inference] Dynamic Batching Inference, online and offline (#4953 ) * [inference] Dynamic Batching for Single and Multiple GPUs (#4831) * finish batch manager * 1 * first * fix * fix dynamic batching * llama infer * finish test * support different lengths generating * del prints * del prints * fix * fix bug --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [inference] Async dynamic batching (#4894) * finish input and output logic * add generate * test forward * 1 * [inference]Re push async dynamic batching (#4901) * adapt to ray server * finish async * finish test * del test --------- Co-authored-by: yuehuayingxueluo <867460659@qq.com> * Revert "[inference]Re push async dynamic batching (#4901)" (#4905) This reverts commit `fbf3c09e67`. * Revert "[inference] Async dynamic batching (#4894)" This reverts commit `fced140250`. * Revert "[inference] Async dynamic batching (#4894)" (#4909) This reverts commit `fced140250`. * Add Ray Distributed Environment Init Scripts * support DynamicBatchManager base function * revert _set_tokenizer version * add driver async generate * add async test * fix bugs in test_ray_dist.py * add get_tokenizer.py * fix code style * fix bugs about No module named 'pydantic' in ci test * fix bugs in ci test * fix bugs in ci test * fix bugs in ci test * [infer]Add Ray Distributed Environment Init Scripts (#4911) * Revert "[inference] Async dynamic batching (#4894)" This reverts commit `fced140250`. * Add Ray Distributed Environment Init Scripts * support DynamicBatchManager base function * revert _set_tokenizer version * add driver async generate * add async test * fix bugs in test_ray_dist.py * add get_tokenizer.py * fix code style * fix bugs about No module named 'pydantic' in ci test * fix bugs in ci test * fix bugs in ci test * fix bugs in ci test * support dynamic batch for bloom model and is_running function * [Inference]Test for new Async engine (#4935) * infer engine * infer engine * test engine * test engine * new manager * change step * add * test * fix * fix * finish test * finish test * finish test * finish test * add license --------- Co-authored-by: yuehuayingxueluo <867460659@qq.com> * add assertion for config (#4947) * [Inference] Finish dynamic batching offline test (#4948) * test * fix test * fix quant * add default * fix * fix some bugs * fix some bugs * fix * fix bug * fix bugs * reset param --------- Co-authored-by: yuehuayingxueluo <867460659@qq.com> Co-authored-by: Cuiqing Li <lixx3527@gmail.com> Co-authored-by: CjhHa1 <cjh18671720497outlook.com>	1 year ago
Xu Kai	785802e809	[inference] add reference and fix some bugs (#4937 ) * add reference and fix some bugs * update gptq init --------- Co-authored-by: Xu Kai <xukai16@foxamil.com>	1 year ago
Cuiqing Li	3a41e8304e	[Refactor] Integrated some lightllm kernels into token-attention (#4946 ) * add some req for inference * clean codes * add codes * add some lightllm deps * clean codes * hello * delete rms files * add some comments * add comments * add doc * add lightllm deps * add lightllm cahtglm2 kernels * add lightllm cahtglm2 kernels * replace rotary embedding with lightllm kernel * add some commnets * add some comments * add some comments * add * replace fwd kernel att1 * fix a arg * add * add * fix token attention * add some comments * clean codes * modify comments * fix readme * fix bug * fix bug --------- Co-authored-by: cuiqing.li <lixx336@gmail.com> Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>	1 year ago
Xu Kai	611a5a80ca	[inference] Add smmoothquant for llama (#4904 ) * [inference] add int8 rotary embedding kernel for smoothquant (#4843) * [inference] add smoothquant llama attention (#4850) * add smoothquant llama attention * remove uselss code * remove useless code * fix import error * rename file name * [inference] add silu linear fusion for smoothquant llama mlp (#4853) * add silu linear * update skip condition * catch smoothquant cuda lib exception * prcocess exception for tests * [inference] add llama mlp for smoothquant (#4854) * add llama mlp for smoothquant * fix down out scale * remove duplicate lines * add llama mlp check * delete useless code * [inference] add smoothquant llama (#4861) * add smoothquant llama * fix attention accuracy * fix accuracy * add kv cache and save pretrained * refactor example * delete smooth * refactor code * [inference] add smooth function and delete useless code for smoothquant (#4895) * add smooth function and delete useless code * update datasets * remove duplicate import * delete useless file * refactor codes (#4902) * rafactor code * add license * add torch-int and smoothquant license	1 year ago
Xu Kai	77a9328304	[inference] add llama2 support (#4898 ) * add llama2 support * fix multi group bug	1 year ago
Jianghai	013a4bedf0	[inference]fix import bug and delete down useless init (#4830 ) * fix import bug and release useless init * fix * fix * fix	1 year ago
Xu Kai	c3bef20478	add autotune (#4822 )	1 year ago
Jianghai	ce7ade3882	[inference] chatglm2 infer demo (#4724 ) * add chatglm2 * add * gather needed kernels * fix some bugs * finish context forward * finish context stage * fix * add * pause * add * fix bugs * finish chatglm * fix bug * change some logic * fix bugs * change some logics * add * add * add * fix * fix tests * fix	1 year ago
Xu Kai	946ab56c48	[feature] add gptq for inference (#4754 ) * [gptq] add gptq kernel (#4416) * add gptq * refactor code * fix tests * replace auto-gptq * rname inferance/quant * refactor test * add auto-gptq as an option * reset requirements * change assert and check auto-gptq * add import warnings * change test flash attn version * remove example * change requirements of flash_attn * modify tests * [skip ci] change requirements-test * [gptq] faster gptq cuda kernel (#4494) * [skip ci] add cuda kernels * add license * [skip ci] fix max_input_len * format files & change test size * [skip ci] * [gptq] add gptq tensor parallel (#4538) * add gptq tensor parallel * add gptq tp * delete print * add test gptq check * add test auto gptq check * [gptq] combine gptq and kv cache manager (#4706) * combine gptq and kv cache manager * add init bits * delete useless code * add model path * delete usless print and update test * delete usless import * move option gptq to shard config * change replace linear to shardformer * update bloom policy * delete useless code * fix import bug and delete uselss code * change colossalai/gptq to colossalai/quant/gptq * update import linear for tests * delete useless code and mv gptq_kernel to kernel directory * fix triton kernel * add triton import	1 year ago

55 Commits (f9afe0addd89303de4819debd93efe97d5618238)