ColossalAI/colossalai/kernel/triton
yuehuayingxueluo 249644c23b
[Inference] Replace Attention layer and MLP layer with shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add (#5340)
* add fused qkv (see the sketch after this commit summary)

* replace attn and mlp by shardformer

* fix bugs in mlp

* add docstrings

* fix test_inference_engine.py

* add optimize unbind

* add fused_addmm

* rm squeeze(1)

* refactor codes

* fix ci bugs

* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention

* Removed the dependency on LlamaFlashAttention2

* rollback test_inference_engine.py
2024-02-01 15:49:39 +08:00
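The bullets above mention fusing the Q/K/V projections and optimizing the `unbind` that splits the fused result. Below is a minimal, illustrative sketch of that general fused-QKV pattern in plain PyTorch; the class and parameter names are hypothetical and this is not the actual ColossalAI/shardformer implementation.

```python
# Minimal sketch of the fused-QKV pattern referenced in the commit bullets
# ("add fused qkv", "add optimize unbind"). Illustrative only; the module and
# parameter names here are placeholders, not ColossalAI's actual classes.
import torch
import torch.nn as nn


class FusedQKVProjection(nn.Module):
    """Compute Q, K, V with one matmul instead of three separate linears."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # A single weight of shape [hidden_size, 3 * hidden_size] replaces the
        # separate q_proj / k_proj / v_proj weights, so only one GEMM (and no
        # per-call weight transpose) runs per forward pass.
        self.qkv_weight = nn.Parameter(torch.empty(hidden_size, 3 * hidden_size))
        nn.init.normal_(self.qkv_weight, std=0.02)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: [num_tokens, hidden_size]
        qkv = hidden_states @ self.qkv_weight                # [num_tokens, 3 * hidden_size]
        qkv = qkv.view(-1, 3, self.num_heads, self.head_dim)
        # torch.unbind returns views without copying, which is the "optimize
        # unbind" idea mentioned in the commit message.
        q, k, v = torch.unbind(qkv, dim=1)                   # each [num_tokens, num_heads, head_dim]
        return q, k, v


if __name__ == "__main__":
    proj = FusedQKVProjection(hidden_size=64, num_heads=4)
    q, k, v = proj(torch.randn(10, 64))
    print(q.shape, k.shape, v.shape)  # torch.Size([10, 4, 16]) each
```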
__init__.py [inference] Optimize the usage of the mid tensors space in flash attn (#5304) 2024-01-26 14:00:10 +08:00
context_attn_unpad.py [Infer] Optimize Blocked KVCache And Kernels Using It (#5325) 2024-01-30 16:06:09 +08:00
custom_autotune.py add autotune (#4822) 2023-09-28 13:47:35 +08:00
flash_decoding.py [Inference] Replace Attention layer and MLP layer with shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add (#5340) 2024-02-01 15:49:39 +08:00
fused_rotary_embedding.py fix (#5311) 2024-01-26 15:02:12 +08:00
gptq_triton.py [inference] add reference and fix some bugs (#4937) 2023-10-20 13:39:34 +08:00
kvcache_copy.py [Infer] Optimize Blocked KVCache And Kernels Using It (#5325) 2024-01-30 16:06:09 +08:00
llama_act_combine_kernel.py [moe] merge moe into main (#4978) 2023-11-02 02:21:24 +00:00
no_pad_rotary_embedding.py [Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336) 2024-01-31 16:31:29 +08:00
qkv_matmul_kernel.py [misc] update pre-commit and run all files (#4752) 2023-09-19 14:20:26 +08:00
rms_layernorm.py [Inference] Update rms norm kernel, benchmark with vLLM (#5315) 2024-01-29 10:22:33 +08:00 (see the RMSNorm sketch after this listing)
rotary_cache_copy.py fix (#5311) 2024-01-26 15:02:12 +08:00
softmax.py [misc] update pre-commit and run all files (#4752) 2023-09-19 14:20:26 +08:00
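rms_layernorm.py in the listing above holds the inference RMS norm Triton kernel. As an illustration of what a row-wise Triton RMSNorm kernel generally looks like, here is a minimal sketch that launches one program instance per row and applies the learned weight after normalization; the kernel and wrapper names are placeholders and this is not the code in rms_layernorm.py.

```python
# Minimal Triton RMSNorm sketch, one program per row. Illustrative only;
# names are placeholders, not the actual kernel in rms_layernorm.py.
import torch
import triton
import triton.language as tl


@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, stride_row, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * stride_row + cols, mask=mask, other=0.0).to(tl.float32)
    # RMS normalization: x / sqrt(mean(x^2) + eps), then scale by the weight.
    var = tl.sum(x * x, axis=0) / n_cols
    rstd = 1.0 / tl.sqrt(var + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = x * rstd * w
    tl.store(out_ptr + row * stride_row + cols, y, mask=mask)


def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Expects a contiguous 2D CUDA tensor [n_rows, n_cols] and a 1D weight [n_cols].
    out = torch.empty_like(x)
    n_rows, n_cols = x.shape
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    rmsnorm_kernel[(n_rows,)](x, weight, out, x.stride(0), n_cols, eps, BLOCK_SIZE=BLOCK_SIZE)
    return out


# Example usage (requires a CUDA device):
# x = torch.randn(4, 128, device="cuda", dtype=torch.float16)
# w = torch.ones(128, device="cuda", dtype=torch.float16)
# y = rmsnorm(x, w)
```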