Yuanheng Zhao | 537a3cbc4d | 2024-05-03 17:20:45 +08:00
[kernel] Support New KCache Layout - Triton Kernel (#5677)
* kvmemcpy triton for new kcache layout
* revise tests for new kcache layout
* naive triton flash decoding - new kcache layout
* rotary triton kernel - new kcache layout
* remove redundancy - triton decoding
* remove redundancy - triton kvcache copy
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Yuanheng Zhao | 5be590b99e | 2024-04-26 17:51:49 +08:00
[kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658)
* add context attn triton kernel - new kcache layout
* add benchmark triton
* tiny revise
* trivial - code style, comment

yuehuayingxueluo | 3c91e3f176 | 2024-04-25 23:11:30 +08:00
[Inference] Adapt to baichuan2 13B (#5614)
* adapt to baichuan2 13B
* adapt to baichuan2 13B
* change BAICHUAN_MODEL_NAME_OR_PATH
* fix test_decoding_attn.py
* modifications based on review comments
* change BAICHUAN_MODEL_NAME_OR_PATH
* mv attn mask processing to flash decoding test
* mv get_alibi_slopes to baichuan modeling
* fix bugs in test_baichuan.py

yuehuayingxueluo | 2a718c8be8 | 2024-02-21 13:23:57 +08:00
Optimize the execution interval between CUDA kernels caused by view and memcopy (#5390)
* opt_view_and_memcopy
* fix bugs in ci
* fix ci bugs
* update benchmark scripts
* fix ci bugs

Yuanheng Zhao | 5f98a9d68a | 2024-01-30 16:06:09 +08:00
[Infer] Optimize Blocked KVCache And Kernels Using It (#5325)
* revise shape of kvcache (context attn kernel)
* revise shape of kvcache (flash decoding kernel)
* revise shape of kvcache (kvcache copy) and attn func
* init of kvcache in kvcache manager
* revise llama modeling
* revise block size retrieval
* use torch for rms_norm benchmarking
* revise block size retrieval

yuehuayingxueluo | 4f28cb43c0 | 2024-01-26 14:00:10 +08:00
[inference] Optimize the usage of intermediate tensor space in flash attn (#5304)
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adaptation to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py

Yuanheng Zhao | af8359c430 | 2024-01-25 10:23:12 +08:00
[hotfix] fix boundary check in batch (#5306)

Yuanheng Zhao | 3da9993b0d | 2024-01-23 17:16:02 +08:00
[Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301)
* fix decoding kernel pytest
* revise and add triton context attn benchmark

Yuanheng Zhao | 1513f20f4d | 2024-01-11 13:46:14 +00:00
[kernel] Add flash decoding triton kernel for blocked kv cache (#5249)
* add flash decoding unpad triton kernel
* rename flash decoding kernel
* add kernel testing (draft)
* revise pytest
* support kv group (GQA)
* (trivial) fix api and pytest
* (trivial) func renaming
* (trivial) func/file renaming
* refactor pytest for attention
* (trivial) format and consistent vars of context/decode attn
* (trivial) remove test redundancy

Yuanheng Zhao | 07b5283b6a | 2024-01-11 13:39:56 +00:00
[kernel] Add triton kernel for context attention (FAv2) without padding (#5192)
* add context attn unpadded triton kernel
* test compatibility
* kv cache copy (testing)
* fix k/v cache copy
* fix kv cache copy and test
* fix boundary of block ptrs
* add support for GQA/MQA and testing
* fix import statement
---------
Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>