56 Commits (457a0de79fd2d3602eba0ac78e606acb6401fc60)

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Yuanheng Zhao | bd38fe6b91 | [NFC] Fix code factors on inference triton kernels (#5743) | 6 months ago |
| CjhHa1 | bc9063adf1 | resolve rebase conflicts on Branch feat/online-serving | 7 months ago |
| Jianghai | de378cd2ab | [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432) | 7 months ago |
| Yuanheng Zhao | 537a3cbc4d | [kernel] Support New KCache Layout - Triton Kernel (#5677) | 7 months ago |
| Yuanheng Zhao | 5be590b99e | [kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658) | 7 months ago |
| yuehuayingxueluo | 3c91e3f176 | [Inference]Adapt to baichuan2 13B (#5614) | 7 months ago |
| Yuanheng Zhao | 5d4c1fe8f5 | [Fix/Inference] Fix GQA Triton and Support Llama3 (#5624) | 7 months ago |
| Yuanheng Zhao | a37f82629d | [Inference/SpecDec] Add Speculative Decoding Implementation (#5423) | 8 months ago |
| Yuanheng Zhao | d63c469f45 | [Infer] Revise and Adapt Triton Kernels for Spec-Dec (#5401) | 8 months ago |
| Yuanheng | 7ca1d1c545 | remove outdated triton test | 8 months ago |
| Yuanheng | ce9401ad52 | remove unused triton kernels | 8 months ago |
| Hongxin Liu | 641b1ee71a | [devops] remove post commit ci (#5566) | 8 months ago |
| Runyu Lu | b2c0d9ff2b | [fix] multi graphs capture error | 9 months ago |
| Runyu Lu | cefaeb5fdd | [feat] cuda graph support and refactor non-functional api | 9 months ago |
| yuehuayingxueluo | 2a718c8be8 | Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390) | 9 months ago |
| Jianghai | 730103819d | [Inference]Fused kv copy into rotary calculation (#5383) | 9 months ago |
| yuehuayingxueluo | 6fb4bcbb24 | [Inference/opt] Fused KVCahce Memcopy (#5374) | 10 months ago |
| Frank Lee | 8106ede07f | Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373) | 10 months ago |
| Jianghai | 9f4ab2eb92 | [Inference] Adapt to Fused rotary (#5348) | 10 months ago |
| yuehuayingxueluo | 35382a7fbf | [Inference]Fused the gate and up proj in mlp,and optimized the autograd process. (#5365) | 10 months ago |
| yuehuayingxueluo | 21ad4a27f9 | [Inference/opt]Optimize the mid tensor of RMS Norm (#5350) | 10 months ago |
| yuehuayingxueluo | 249644c23b | [Inference]Repalce Attention layer and MLP layer by shardformer to optimize the weight transpose operation,add fused_qkv and fused linear_add (#5340) | 10 months ago |
| Jianghai | df0aa49585 | [Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336) | 10 months ago |
| Yuanheng Zhao | 5f98a9d68a | [Infer] Optimize Blocked KVCache And Kernels Using It (#5325) | 10 months ago |
| Jianghai | 1f8a75d470 | [Inference] Update rms norm kernel, benchmark with vLLM (#5315) | 10 months ago |
| Jianghai | 7ddd8b37f0 | fix (#5311) | 10 months ago |
| yuehuayingxueluo | 4f28cb43c0 | [inference]Optimize the usage of the mid tensors space in flash attn (#5304) | 10 months ago |
| Yuanheng Zhao | af8359c430 | [hotfix] fix boundary check in batch (#5306) | 10 months ago |
| Jianghai | c647e00e3c | [Inference]Add fused rotary kernel and get cos cache kernel (#5302) | 10 months ago |
| Yuanheng Zhao | 3da9993b0d | [Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301) | 10 months ago |
| yuehuayingxueluo | bfff9254ac | [inference] Adapted to Rotary Embedding and RMS Norm (#5283) | 10 months ago |
| Yuanheng Zhao | 6e487e7d3c | [kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274) | 10 months ago |
| Yaozheng Fang | 5ae9099f92 | [kernel] Add RMSLayerNorm triton kernel (#5262) | 10 months ago |
| Yuanheng Zhao | 0f2b46a41c | [kernel] Revise KVCache copy triton kernel API (#5273) | 10 months ago |
| Yuanheng Zhao | fa85e02b3b | [kernel] Add KV cache copy kernel during decoding (#5261) | 10 months ago |
| Yuanheng Zhao | 1513f20f4d | [kernel] Add flash decoding triton kernel for blocked kv cache (#5249) | 11 months ago |
| Jianghai | fded91d049 | [Inference] Kernel: no pad rotary embedding (#5252) | 11 months ago |
| Yuanheng Zhao | 07b5283b6a | [kernel] Add triton kernel for context attention (FAv2) without padding (#5192) | 11 months ago |
| Yuanheng Zhao | 2bb92243d4 | [Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159) | 11 months ago |
| Cuiqing Li (李崔卿) | bce919708f | [Kernels]added flash-decoidng of triton (#5063) | 1 year ago |
| Cuiqing Li (李崔卿) | 28052a71fb | [Kernels]Update triton kernels into 2.1.0 (#5046) | 1 year ago |
| Xuanlei Zhao | dc003c304c | [moe] merge moe into main (#4978) | 1 year ago |
| Cuiqing Li | 459a88c806 | [Kernels]Updated Triton kernels into 2.1.0 and adding flash-decoding for llama token attention (#4965) | 1 year ago |
| Jianghai | cf579ff46d | [Inference] Dynamic Batching Inference, online and offline (#4953) | 1 year ago |
| Xu Kai | 785802e809 | [inference] add reference and fix some bugs (#4937) | 1 year ago |
| Cuiqing Li | 3a41e8304e | [Refactor] Integrated some lightllm kernels into token-attention (#4946) | 1 year ago |
| Xu Kai | 611a5a80ca | [inference] Add smmoothquant for llama (#4904) | 1 year ago |
| Xu Kai | 77a9328304 | [inference] add llama2 support (#4898) | 1 year ago |
| Jianghai | 013a4bedf0 | [inference]fix import bug and delete down useless init (#4830) | 1 year ago |
| Xu Kai | c3bef20478 | add autotune (#4822) | 1 year ago |