189 Commits (5f8c0a0ac3b52a71b664c3e36dd1a8cef40f428d)

Author SHA1 Message Date
Yuanheng Zhao bd38fe6b91 [NFC] Fix code factors on inference triton kernels (#5743) 6 months ago
CjhHa1 bc9063adf1 resolve rebase conflicts on branch feat/online-serving 7 months ago
Jianghai de378cd2ab [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432) 7 months ago
Yuanheng Zhao 537a3cbc4d [kernel] Support New KCache Layout - Triton Kernel (#5677) 7 months ago
Yuanheng Zhao 5be590b99e [kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658) 7 months ago
yuehuayingxueluo 3c91e3f176 [Inference] Adapt to baichuan2 13B (#5614) 7 months ago
flybird11111 148506c828 [coloattention] modify coloattention (#5627) 7 months ago
Yuanheng Zhao 5d4c1fe8f5 [Fix/Inference] Fix GQA Triton and Support Llama3 (#5624) 7 months ago
Yuanheng Zhao a37f82629d [Inference/SpecDec] Add Speculative Decoding Implementation (#5423) 8 months ago
Yuanheng Zhao d63c469f45 [Infer] Revise and Adapt Triton Kernels for Spec-Dec (#5401) 8 months ago
Yuanheng 7ca1d1c545 remove outdated triton test 8 months ago
Yuanheng ce9401ad52 remove unused triton kernels 8 months ago
Hongxin Liu 641b1ee71a [devops] remove post commit ci (#5566) 8 months ago
Hongxin Liu 19e1a5cf16 [shardformer] update colo attention to support custom mask (#5510) 8 months ago
Runyu Lu b2c0d9ff2b [fix] multi graphs capture error 9 months ago
Runyu Lu cefaeb5fdd [feat] cuda graph support and refactor non-functional api 9 months ago
yuehuayingxueluo 600881a8ea [Inference] Add CUDA KVCache Kernel (#5406) 9 months ago
yuehuayingxueluo 2a718c8be8 Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390) 9 months ago
Jianghai 730103819d [Inference] Fused kv copy into rotary calculation (#5383) 9 months ago
yuehuayingxueluo 6fb4bcbb24 [Inference/opt] Fused KVCache Memcopy (#5374) 10 months ago
Frank Lee 8106ede07f Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373) 10 months ago
Jianghai 9f4ab2eb92 [Inference] Adapt to Fused rotary (#5348) 10 months ago
yuehuayingxueluo 35382a7fbf [Inference] Fused the gate and up proj in MLP, and optimized the autograd process (#5365) 10 months ago
yuehuayingxueluo 21ad4a27f9 [Inference/opt] Optimize the mid tensor of RMS Norm (#5350) 10 months ago
yuehuayingxueluo 249644c23b [Inference] Replace Attention layer and MLP layer by shardformer to optimize the weight transpose operation; add fused_qkv and fused linear_add (#5340) 10 months ago
Jianghai df0aa49585 [Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336) 10 months ago
Yuanheng Zhao 5f98a9d68a [Infer] Optimize Blocked KVCache And Kernels Using It (#5325) 10 months ago
Jianghai 1f8a75d470 [Inference] Update rms norm kernel, benchmark with vLLM (#5315) 10 months ago
Jianghai 7ddd8b37f0 fix (#5311) 10 months ago
yuehuayingxueluo 4f28cb43c0 [inference] Optimize the usage of the mid tensors space in flash attn (#5304) 10 months ago
Frank Lee 7cfed5f076 [feat] refactored extension module (#5298) 10 months ago
Yuanheng Zhao af8359c430 [hotfix] fix boundary check in batch (#5306) 10 months ago
Jianghai c647e00e3c [Inference] Add fused rotary kernel and get cos cache kernel (#5302) 10 months ago
Yuanheng Zhao 3da9993b0d [Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301) 10 months ago
yuehuayingxueluo bfff9254ac [inference] Adapted to Rotary Embedding and RMS Norm (#5283) 10 months ago
Yuanheng Zhao 6e487e7d3c [kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274) 10 months ago
Yaozheng Fang 5ae9099f92 [kernel] Add RMSLayerNorm triton kernel (#5262) 10 months ago
Yuanheng Zhao 0f2b46a41c [kernel] Revise KVCache copy triton kernel API (#5273) 10 months ago
Yuanheng Zhao fa85e02b3b [kernel] Add KV cache copy kernel during decoding (#5261) 10 months ago
Yuanheng Zhao 1513f20f4d [kernel] Add flash decoding triton kernel for blocked kv cache (#5249) 11 months ago
Jianghai fded91d049 [Inference] Kernel: no pad rotary embedding (#5252) 11 months ago
Yuanheng Zhao 07b5283b6a [kernel] Add triton kernel for context attention (FAv2) without padding (#5192) 11 months ago
Yuanheng Zhao 2bb92243d4 [Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159) 11 months ago
Hongxin Liu d202cc28c0 [npu] change device to accelerator api (#5239) 11 months ago
Xuanlei Zhao dd2c28a323 [npu] use extension for op builder (#5172) 11 months ago
Xuanlei Zhao d6df19bae7 [npu] support triangle attention for llama (#5130) 12 months ago
Jun Gao dce05da535 fix thrust-transform-reduce error (#5078) 1 year ago
Hongxin Liu e5ce4c8ea6 [npu] add npu support for gemini and zero (#5067) 1 year ago
Cuiqing Li (李崔卿) bce919708f [Kernels] added flash-decoding of triton (#5063) 1 year ago
Cuiqing Li (李崔卿) 28052a71fb [Kernels] Update triton kernels to 2.1.0 (#5046) 1 year ago