Yuanheng Zhao | 8754abae24 | 7 months ago
[Fix] Fix & Update Inference Tests (compatibility w/ main)
Yuanheng Zhao | 5be590b99e | 7 months ago
[kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658)
* add context attn triton kernel - new kcache layout
* add benchmark triton
* tiny revise
* trivial - code style, comment
yuehuayingxueluo | 0aa27f1961 | 9 months ago
[Inference] Move benchmark-related code to the example directory (#5408)
* move benchmark-related code to the example directory
* fix bugs in test_fused_rotary_embedding.py
yuehuayingxueluo | 2a718c8be8 | 9 months ago
Optimized the execution interval between CUDA kernels caused by view and memcopy (#5390)
* opt_view_and_memcopy
* fix bugs in CI
* fix CI bugs
* update benchmark scripts
* fix CI bugs
Frank Lee | e76acbb076 | 10 months ago
[inference] moved ops tests to test_infer (#5354)
Yuanheng Zhao | 5f98a9d68a | 10 months ago
[Infer] Optimize Blocked KVCache And Kernels Using It (#5325)
* revise shape of kvcache (context attn kernel)
* revise shape of kvcache (flash decoding kernel)
* revise shape of kvcache (kvcache copy) and attn func
* init of kvcache in kvcache manager
* revise llama modeling
* revise block size retrieval
* use torch for rms_norm benchmarking
* revise block size retrieval
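The blocked KVCache work in #5325 revolves around writing per-token K/V vectors into fixed-size physical blocks addressed through a block table. The following is a minimal sketch of that idea only; the cache layout `(num_blocks, num_kv_heads, block_size, head_dim)`, the helper name, and all shapes are illustrative assumptions, not the actual layout this commit settles on.

```python
import numpy as np

# Hypothetical blocked KV cache layout: (num_blocks, num_kv_heads, block_size, head_dim).
# These names and shapes are assumptions for illustration only.
NUM_BLOCKS, NUM_KV_HEADS, BLOCK_SIZE, HEAD_DIM = 8, 2, 4, 16

def copy_to_blocked_cache(k_cache, block_table, seq_len, new_k):
    """Write one new token's K vectors (num_kv_heads, head_dim) into the cache.

    block_table maps a sequence's logical block index -> physical block id.
    """
    token_pos = seq_len - 1                  # position of the token being written
    logical_block = token_pos // BLOCK_SIZE  # which logical block it falls into
    offset = token_pos % BLOCK_SIZE          # slot within that block
    physical_block = block_table[logical_block]
    k_cache[physical_block, :, offset, :] = new_k
    return physical_block, offset

k_cache = np.zeros((NUM_BLOCKS, NUM_KV_HEADS, BLOCK_SIZE, HEAD_DIM), dtype=np.float32)
block_table = [3, 5]                         # logical blocks 0, 1 -> physical blocks 3, 5
new_k = np.ones((NUM_KV_HEADS, HEAD_DIM), dtype=np.float32)
blk, off = copy_to_blocked_cache(k_cache, block_table, seq_len=6, new_k=new_k)
# token at position 5 falls in logical block 1 -> physical block 5, slot 1
```

Keeping the block size fixed is what lets the copy and attention kernels compute a token's physical address with a divide and a modulo, which is why several bullets above are "revise shape of kvcache" across kernels.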
Yuanheng Zhao | 3da9993b0d | 10 months ago
[Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301)
* fix decoding kernel pytest
* revise and add triton context attn benchmark
Yuanheng Zhao | 6e487e7d3c | 10 months ago
[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274)
* prevent re-creating intermediate tensors
* add singleton class holding intermediate values
* fix triton kernel api
* add benchmark in pytest
* fix kernel api and add benchmark
* revise flash decoding triton kernel in/out shapes
* fix calling of triton kernel in modeling
* fix pytest: extract to util functions
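The first two bullets of #5274 describe a common decoding optimization: keep intermediate buffers in a singleton and reuse them across steps instead of allocating fresh tensors per kernel launch. A minimal sketch of that pattern follows; the class name, method names, and the use of NumPy buffers are illustrative assumptions and do not reflect the commit's actual class.

```python
import numpy as np

class WorkspaceCache:
    """Hypothetical singleton that reuses intermediate buffers across decoding
    steps instead of re-allocating them on every kernel call (illustrative
    stand-in for the singleton class this commit adds)."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._buffers = {}
        return cls._instance

    def get(self, name, shape, dtype=np.float32):
        # Reuse the cached buffer when shape/dtype match; otherwise allocate once.
        buf = self._buffers.get(name)
        if buf is None or buf.shape != shape or buf.dtype != dtype:
            buf = np.empty(shape, dtype=dtype)
            self._buffers[name] = buf
        return buf

ws_a = WorkspaceCache()
ws_b = WorkspaceCache()                      # same instance: it is a singleton
mid = ws_a.get("mid_output", (4, 8))
mid_again = ws_b.get("mid_output", (4, 8))   # same backing array, no re-allocation
```

During autoregressive decoding the intermediate shapes are stable step to step, so the cache hit path dominates and per-step allocation (and the resulting gaps between kernel launches) disappears.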
Yuanheng Zhao | 1513f20f4d | 11 months ago
[kernel] Add flash decoding triton kernel for blocked kv cache (#5249)
* add flash decoding unpad triton kernel
* rename flash decoding kernel
* add kernel testing (draft)
* revise pytest
* support kv group (GQA)
* (trivial) fix api and pytest
* (trivial) func renaming
* (trivial) func/file renaming
* refactor pytest for attention
* (trivial) format and consistent vars of context/decode attn
* (trivial) remove test redundancy
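The "support kv group (GQA)" bullet refers to grouped-query attention, where several query heads share one KV head so the KV cache shrinks by the group factor. The head mapping it implies can be sketched in a few lines; the head counts and variable names below are assumptions for illustration.

```python
# Minimal sketch of the KV-group (GQA) head mapping: consecutive query heads
# share one KV head. Head counts here are illustrative, not the model's.
NUM_Q_HEADS = 8
NUM_KV_HEADS = 2
KV_GROUP_SIZE = NUM_Q_HEADS // NUM_KV_HEADS  # 4 query heads per KV head

def kv_head_for(q_head: int) -> int:
    """Map a query head index to the KV head it attends with."""
    return q_head // KV_GROUP_SIZE

mapping = [kv_head_for(h) for h in range(NUM_Q_HEADS)]
# → [0, 0, 0, 0, 1, 1, 1, 1]
```

Inside a decoding kernel this mapping lets each query-head program load K/V from the shared head's cache blocks, which is why GQA support touches the kernel's indexing rather than its math.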
Yuanheng Zhao | 07b5283b6a | 11 months ago
[kernel] Add triton kernel for context attention (FAv2) without padding (#5192)
* add context attn unpadded triton kernel
* test compatibility
* kv cache copy (testing)
* fix k/v cache copy
* fix kv cache copy and test
* fix boundary of block ptrs
* add support for GQA/MQA and testing
* fix import statement
Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>
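"Without padding" in #5192 refers to the varlen batching style used by FlashAttention-style kernels: variable-length sequences are packed into one flat tensor and addressed through cumulative sequence lengths, so the kernel never wastes work on padding tokens. A minimal sketch of that indexing scheme follows; the `cu_seqlens` name and the helper are illustrative assumptions, not this kernel's actual signature.

```python
import numpy as np

# Sketch of unpadded (varlen) batching: pack all sequences into one flat
# tensor and index per-sequence spans via cumulative lengths. Illustrative only.
seq_lens = [3, 1, 4]
cu_seqlens = np.concatenate([[0], np.cumsum(seq_lens)])  # [0, 3, 4, 8]
total_tokens = int(cu_seqlens[-1])
packed_q = np.arange(total_tokens)  # stand-in for the packed query rows

def rows_for_sequence(i: int):
    """Slice sequence i's rows out of the packed tensor."""
    return packed_q[cu_seqlens[i]:cu_seqlens[i + 1]]
```

Compared with a padded `(batch, max_len, ...)` layout, the packed layout does no compute on padding and keeps memory proportional to the real token count, at the cost of the extra `cu_seqlens` bookkeeping in the kernel.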