ColossalAI/examples/inference
Latest commit: be396ad6cc by Steve Luo, 2024-04-18 16:45:07 +08:00
[Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531)
* feat: add flash decoding for paged attention
* refactor: flash decoding attention implementation
* [pre-commit.ci] auto fixes from pre-commit.com hooks

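The commit above adds a decoding kernel for paged attention that splits the KV sequence into partitions within a single thread block, in the flash-decoding style: each partition computes partial softmax statistics, which are then merged online. Below is a minimal PyTorch sketch of that split-and-merge idea for a single query; all names are illustrative and this is not the repository's actual kernel API.

```python
# Minimal sketch of flash decoding: split the KV sequence into chunks,
# keep running softmax statistics per chunk, and merge them online.
import torch

def flash_decode_single_query(q: torch.Tensor, k: torch.Tensor,
                              v: torch.Tensor, chunk_size: int = 256) -> torch.Tensor:
    """q: (d,); k, v: (seq_len, d). Returns the attention output (d,)."""
    d = q.shape[0]
    scale = d ** -0.5
    m = torch.tensor(float("-inf"))  # running max of attention logits
    l = torch.tensor(0.0)            # running softmax denominator
    acc = torch.zeros(d)             # running numerator (weighted sum of V)
    for start in range(0, k.shape[0], chunk_size):
        # In the real kernel each chunk's K/V would be fetched through a
        # block table, since a paged KV cache is not contiguous in memory.
        k_c, v_c = k[start:start + chunk_size], v[start:start + chunk_size]
        s = (k_c @ q) * scale              # logits for this chunk, (chunk,)
        m_new = torch.maximum(m, s.max())
        alpha = torch.exp(m - m_new)       # rescale old stats to the new max
        p = torch.exp(s - m_new)
        l = l * alpha + p.sum()
        acc = acc * alpha + p @ v_c
        m = m_new
    return acc / l
```

The merged result matches a full softmax over the whole sequence, which is what makes the per-partition split safe:

```python
torch.manual_seed(0)
q, k, v = torch.randn(64), torch.randn(1000, 64), torch.randn(1000, 64)
ref = torch.softmax((k @ q) * 64 ** -0.5, dim=0) @ v
assert torch.allclose(flash_decode_single_query(q, k, v), ref, atol=1e-5)
```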
File                         Last commit                                                                                         Date
benchmark_ops                [Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531)   2024-04-18 16:45:07 +08:00
benchmark_llama.py           [inference/model] Adapted to the Baichuan2-7B model (#5591)                                         2024-04-15 16:53:02 +08:00
build_smoothquant_weight.py  [inference] refactor examples and fix schedule (#5077)                                              2023-11-21 10:46:03 +08:00
run_benchmark.sh             Optimized tail-processing code style and macro-definition logic (#5519)                             2024-03-28 10:42:51 +08:00
run_llama_inference.py       [npu] change device to accelerator api (#5239)                                                      2024-01-09 10:20:05 +08:00
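For background on the "paged" part of the kernel that benchmark_ops exercises: a paged KV cache stores keys and values in fixed-size blocks addressed through a per-sequence block table, rather than as one contiguous tensor. A hypothetical sketch of gathering one sequence's keys from such a cache; block_size and all names here are assumptions for illustration, not taken from the repository:

```python
import torch

def gather_paged_keys(k_cache: torch.Tensor, block_table: torch.Tensor,
                      seq_len: int, block_size: int = 16) -> torch.Tensor:
    """k_cache: (num_blocks, block_size, head_dim); block_table: (max_blocks,).
    Returns this sequence's first seq_len keys as one contiguous tensor."""
    n = (seq_len + block_size - 1) // block_size   # number of blocks in use
    blocks = k_cache[block_table[:n].long()]       # (n, block_size, head_dim)
    return blocks.reshape(-1, blocks.shape[-1])[:seq_len]
```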