23 Commits (f8598e3ec56bbe6bc6dd9fd84a1e0543adbd3073)

Author SHA1 Message Date
pre-commit-ci[bot] d78817539e [pre-commit.ci] auto fixes from pre-commit.com hooks 8 months ago
傅剑寒 7ebdf48ac5 add cast and op_functor for cuda build-in types (#5546) 8 months ago
傅剑寒 a2878e39f4 [Inference] Add Reduce Utils (#5537) 8 months ago
yuehuayingxueluo 04aca9e55b [Inference/Kernel]Add get_cos_and_sin Kernel (#5528) 8 months ago
yuehuayingxueluo 934e31afb2 The writing style of tail processing and the logic related to macro definitions have been optimized. (#5519) 8 months ago
yuehuayingxueluo 87079cffe8 [Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461) 8 months ago
傅剑寒 7ff42cc06d add vec_type_trait implementation (#5473) 8 months ago
xs_courtesy 48c4f29b27 refactor vector utils 8 months ago
xs_courtesy 5724b9e31e add some comments 8 months ago
xs_courtesy 388e043930 add implementatino for GetGPULaunchConfig1D 8 months ago
yuehuayingxueluo f366a5ea1f [Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418) 8 months ago
Steve Luo ed431de4e4 fix rmsnorm template function invocation problem(template function partial specialization is not allowed in Cpp) and luckily pass e2e precision test (#5454) 8 months ago
xs_courtesy c1c45e9d8e fix include path 9 months ago
Steve Luo b699f54007 optimize rmsnorm: add vectorized elementwise op, feat loop unrolling (#5441) 9 months ago
xs_courtesy 095c070a6e refactor code 9 months ago
Steve Luo f7aecc0c6b feat rmsnorm cuda kernel and add unittest, benchmark script (#5417) 9 months ago
xs_courtesy 5eb5ff1464 refactor code 9 months ago
xs_courtesy a46598ac59 add reusable utils for cuda 9 months ago
xs_courtesy 95c21498d4 add silu_and_mul for infer 9 months ago
yuehuayingxueluo 600881a8ea [Inference]Add CUDA KVCache Kernel (#5406) 9 months ago
Frank Lee 7cfed5f076 [feat] refactored extension module (#5298) 10 months ago