yuehuayingxueluo | 249644c23b | 2024-02-01 15:49:39 +08:00
[Inference] Replace Attention layer and MLP layer with shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add (#5340)
* add fused qkv
* replace attn and mlp with shardformer
* fix bugs in mlp
* add docstrings
* fix test_inference_engine.py
* add optimized unbind
* add fused_addmm
* rm squeeze(1)
* refactor codes
* fix ci bugs
* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
* remove the dependency on LlamaFlashAttention2
* rollback test_inference_engine.py

Yuanheng Zhao | 5f98a9d68a | 2024-01-30 16:06:09 +08:00
[Infer] Optimize Blocked KVCache And Kernels Using It (#5325)
* revise shape of kvcache (context attn kernel)
* revise shape of kvcache (flash decoding kernel)
* revise shape of kvcache (kvcache copy) and attn func
* init of kvcache in kvcache manager
* revise llama modeling
* revise block size retrieval
* use torch for rms_norm benchmarking
* revise block size retrieval

yuehuayingxueluo | e8f0642f28 | 2024-01-30 10:31:46 +08:00
[Inference] Add Nopadding Llama Modeling (#5327)
* add nopadding llama modeling
* add nopadding_llama.py
* rm unused codes
* fix bugs in test_xine_copy.py
* fix code style

yuehuayingxueluo | 4f28cb43c0 | 2024-01-26 14:00:10 +08:00
[inference] Optimize the usage of the intermediate tensor space in flash attn (#5304)
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py

yuehuayingxueluo | bfff9254ac | 2024-01-22 10:55:34 +08:00
[inference] Adapted to Rotary Embedding and RMS Norm (#5283)
* adapted to rotary_embedding
* adapted to nopad rms norm
* fix bugs in benchmark
* fix flash_decoding.py

Yuanheng Zhao | 6e487e7d3c | 2024-01-19 15:47:16 +08:00
[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274)
* prevent re-creating intermediate tensors
* add singleton class holding intermediate values
* fix triton kernel api
* add benchmark in pytest
* fix kernel api and add benchmark
* revise flash decoding triton kernel in/out shapes
* fix calling of triton kernel in modeling
* fix pytest: extract to util functions

Jianghai | 9e2342bde2 | 2024-01-18 16:31:14 +08:00
[Hotfix] Fix bugs in testing continuous batching (#5270)
* fix bug
* fix bugs
* fix bugs
* fix bugs and add padding
* add funcs and fix bugs
* fix typos
* fix bugs
* add func

yuehuayingxueluo | 86b63f720c | 2024-01-17 16:03:10 +08:00
[Inference] Adapted to the triton attn kernels (#5264)
* adapted to the triton attn kernels
* fix pad input
* adapted to copy_kv_to_blocked_cache
* fix ci test
* update kv memcpy
* remove print

yuehuayingxueluo | 2a73e828eb | 2024-01-11 13:46:14 +00:00
fix bugs related to processing padding mask

yuehuayingxueluo | fa4fbdbffb | 2024-01-11 13:44:06 +00:00
adapted to pad_context_forward

yuehuayingxueluo | 47e53eaa1c | 2024-01-11 13:44:06 +00:00
fix bugs in attention.py and request_handler.py

yuehuayingxueluo | 3ad1f3b78b | 2024-01-11 13:39:56 +00:00
fix beam_width

yuehuayingxueluo | b2eb9cd186 | 2024-01-11 13:39:56 +00:00
Fixed a typo

yuehuayingxueluo | 02c1bf8b2a | 2024-01-11 13:39:56 +00:00
add context_attention_unpadded

yuehuayingxueluo | 9489dc64d8 | 2024-01-11 13:39:56 +00:00
precision alignment

yuehuayingxueluo | 62968588d1 | 2024-01-11 13:39:56 +00:00
fix bugs in request_handler

yuehuayingxueluo | 62fd08ee44 | 2024-01-11 13:39:56 +00:00
Fixed a bug in the inference framework

yuehuayingxueluo | 86853a37d5 | 2024-01-11 13:39:56 +00:00
Add padding llama model