ColossalAI/colossalai/kernel/triton
yuehuayingxueluo 249644c23b
[Inference] Replace Attention layer and MLP layer with shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add (#5340)
* add fused qkv (see the sketch after this commit summary)

* replace attn and mlp by shardformer

* fix bugs in mlp

* add docstrings

* fix test_inference_engine.py

* add optimize unbind

* add fused_addmm

* rm squeeze(1)

* refactor codes

* fix ci bugs

* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention

* Removed the dependency on LlamaFlashAttention2

* rollback test_inference_engine.py
2024-02-01 15:49:39 +08:00
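The bullets above mention fusing the Q/K/V projections and optimizing the `unbind` that splits the fused result. Below is a minimal, illustrative sketch of that general fused-QKV pattern in plain PyTorch; the class and parameter names are hypothetical and this is not the actual ColossalAI/shardformer implementation.

```python
# Minimal sketch of the fused-QKV pattern referenced in the commit bullets
# ("add fused qkv", "add optimize unbind"). Illustrative only; the module and
# parameter names here are placeholders, not ColossalAI's actual classes.
import torch
import torch.nn as nn


class FusedQKVProjection(nn.Module):
    """Compute Q, K, V with one matmul instead of three separate linears."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # A single weight of shape [hidden_size, 3 * hidden_size] replaces the
        # separate q_proj / k_proj / v_proj weights, so only one GEMM (and no
        # per-call weight transpose) runs per forward pass.
        self.qkv_weight = nn.Parameter(torch.empty(hidden_size, 3 * hidden_size))
        nn.init.normal_(self.qkv_weight, std=0.02)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: [num_tokens, hidden_size]
        qkv = hidden_states @ self.qkv_weight                # [num_tokens, 3 * hidden_size]
        qkv = qkv.view(-1, 3, self.num_heads, self.head_dim)
        # torch.unbind returns views without copying, which is the "optimize
        # unbind" idea mentioned in the commit message.
        q, k, v = torch.unbind(qkv, dim=1)                   # each [num_tokens, num_heads, head_dim]
        return q, k, v


if __name__ == "__main__":
    proj = FusedQKVProjection(hidden_size=64, num_heads=4)
    q, k, v = proj(torch.randn(10, 64))
    print(q.shape, k.shape, v.shape)  # torch.Size([10, 4, 16]) each
```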
__init__.py [inference] Optimize the usage of the mid tensors space in flash attn (#5304) 2024-01-26 14:00:10 +08:00
context_attn_unpad.py [Infer] Optimize Blocked KVCache And Kernels Using It (#5325) 2024-01-30 16:06:09 +08:00
custom_autotune.py add autotune (#4822) 2023-09-28 13:47:35 +08:00
flash_decoding.py [Inference] Replace Attention layer and MLP layer with shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add (#5340) 2024-02-01 15:49:39 +08:00
fused_rotary_embedding.py fix (#5311) 2024-01-26 15:02:12 +08:00
gptq_triton.py [inference] add reference and fix some bugs (#4937) 2023-10-20 13:39:34 +08:00
kvcache_copy.py [Infer] Optimize Blocked KVCache And Kernels Using It (#5325) 2024-01-30 16:06:09 +08:00
llama_act_combine_kernel.py [moe] merge moe into main (#4978) 2023-11-02 02:21:24 +00:00
no_pad_rotary_embedding.py [Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336) 2024-01-31 16:31:29 +08:00
qkv_matmul_kernel.py [misc] update pre-commit and run all files (#4752) 2023-09-19 14:20:26 +08:00
rms_layernorm.py [Inference] Update rms norm kernel, benchmark with vLLM (#5315) 2024-01-29 10:22:33 +08:00 (see the RMSNorm sketch after this listing)
rotary_cache_copy.py fix (#5311) 2024-01-26 15:02:12 +08:00
softmax.py [misc] update pre-commit and run all files (#4752) 2023-09-19 14:20:26 +08:00
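rms_layernorm.py in the listing above holds the inference RMS norm Triton kernel. As an illustration of what a row-wise Triton RMSNorm kernel generally looks like, here is a minimal sketch that launches one program instance per row and applies the learned weight after normalization; the kernel and wrapper names are placeholders and this is not the code in rms_layernorm.py.

```python
# Minimal Triton RMSNorm sketch, one program per row. Illustrative only;
# names are placeholders, not the actual kernel in rms_layernorm.py.
import torch
import triton
import triton.language as tl


@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, stride_row, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * stride_row + cols, mask=mask, other=0.0).to(tl.float32)
    # RMS normalization: x / sqrt(mean(x^2) + eps), then scale by the weight.
    var = tl.sum(x * x, axis=0) / n_cols
    rstd = 1.0 / tl.sqrt(var + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = x * rstd * w
    tl.store(out_ptr + row * stride_row + cols, y, mask=mask)


def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Expects a contiguous 2D CUDA tensor [n_rows, n_cols] and a 1D weight [n_cols].
    out = torch.empty_like(x)
    n_rows, n_cols = x.shape
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    rmsnorm_kernel[(n_rows,)](x, weight, out, x.stride(0), n_cols, eps, BLOCK_SIZE=BLOCK_SIZE)
    return out


# Example usage (requires a CUDA device):
# x = torch.randn(4, 128, device="cuda", dtype=torch.float16)
# w = torch.ones(128, device="cuda", dtype=torch.float16)
# y = rmsnorm(x, w)
```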