ColossalAI/examples/inference
yuehuayingxueluo 249644c23b
[Inference] Replace Attention layer and MLP layer by shardformer to optimize the weight transpose operation; add fused_qkv and fused linear_add (#5340)
* add fused qkv

* replace attn and mlp by shardformer

* fix bugs in mlp

* add docstrings

* fix test_inference_engine.py

* optimize unbind

* add fused_addmm

* rm squeeze(1)

* refactor codes

* fix ci bugs

* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention

* remove the dependency on LlamaFlashAttention2

* rollback test_inference_engine.py
2024-02-01 15:49:39 +08:00
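
The commit above bundles two weight-layout optimizations: a single fused QKV projection split with `unbind`, and a fused linear + residual add. Below is a minimal sketch of both ideas, not the ColossalAI shardformer code itself; the class, function, and shape assumptions here are illustrative stand-ins for the layers the commit renames to ShardFormerLlamaAttention and ShardFormerLlamaMLP.

```python
# Sketch only: illustrates fused QKV + unbind and addmm-based linear+add,
# under assumed Llama-like shapes. Not the actual ColossalAI implementation.
import torch
import torch.nn as nn


class FusedQKVSketch(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # One weight matrix for Q, K, and V, so the (possibly transposed)
        # weight is loaded once per forward instead of three times.
        self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False)

    def forward(self, hidden_states: torch.Tensor):
        bsz, seq_len, _ = hidden_states.shape
        qkv = self.qkv_proj(hidden_states)  # [bsz, seq, 3 * hidden]
        qkv = qkv.view(bsz, seq_len, 3, self.num_heads, self.head_dim)
        # unbind along the fused dimension returns q, k, v as views,
        # avoiding three separate projections and extra copies
        q, k, v = torch.unbind(qkv, dim=2)
        return q, k, v


def fused_linear_add(x: torch.Tensor, weight: torch.Tensor,
                     residual: torch.Tensor) -> torch.Tensor:
    """Fold the residual add into the GEMM: residual + x @ weight.T,
    computed in one torch.addmm call instead of a matmul plus an add."""
    out = torch.addmm(residual.view(-1, residual.size(-1)),
                      x.view(-1, x.size(-1)),
                      weight.t())
    return out.view_as(residual)
```

The design point in both sketches is the same: replacing several small GEMMs (or a GEMM followed by an elementwise add) with one larger fused call reduces kernel launches and redundant reads of transposed weights during decoding.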
benchmark_llama.py merge commit 2024-01-31 10:41:47 +08:00
build_smoothquant_weight.py [inference] refactor examples and fix schedule (#5077) 2023-11-21 10:46:03 +08:00
run_benchmark.sh [Inference] Replace Attention layer and MLP layer by shardformer to optimize the weight transpose operation; add fused_qkv and fused linear_add (#5340) 2024-02-01 15:49:39 +08:00
run_llama_inference.py [npu] change device to accelerator api (#5239) 2024-01-09 10:20:05 +08:00