ColossalAI/colossalai/inference
    core/
    kv_cache/
    modeling/
    README.md
    __init__.py
    config.py
    flash_decoding_utils.py
    logit_processors.py
    sampler.py
    struct.py
    utils.py

README.md

⚡️ ColossalAI-Inference

📚 Table of Contents

  • 📌 Introduction
  • 🛠 Design and Implementation
  • 🕹 Usage
  • 🪅 Support Matrix
  • 🗺 Roadmap
  • 🌟 Acknowledgement

📌 Introduction

ColossalAI-Inference is a library that accelerates inference for Transformer models, especially LLMs. It combines high-performance kernels, KV caching, paged attention, continuous batching, and other techniques to speed up LLM inference, and it exposes a unified interface so that users can adopt the library easily.
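
Since the Usage section below is still a placeholder, here is only a minimal sketch of the intended workflow. It assumes the InferenceConfig and InferenceEngine classes under config.py and core/ in this directory; the import paths, constructor arguments, and generation call shown are illustrative assumptions rather than the finalized API.

# Minimal usage sketch; names and arguments are assumptions, not the final API.
import colossalai
from transformers import AutoModelForCausalLM, AutoTokenizer

from colossalai.inference.config import InferenceConfig
from colossalai.inference.core.engine import InferenceEngine

colossalai.launch_from_torch(config={})  # initialize the distributed environment (e.g. via torchrun)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").half().cuda()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Hypothetical knobs controlling batching limits and KV-cache sizing.
inference_config = InferenceConfig(max_batch_size=4, max_input_len=256, max_output_len=128)

engine = InferenceEngine(model, tokenizer, inference_config, verbose=True)
outputs = engine.generate(prompts=["Introduce some landmarks in Beijing."])
print(outputs)

In this sketch the engine is expected to handle batching, KV-cache management, and sampling internally, so the caller only supplies prompts and size limits.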

🛠 Design and Implementation

To be added.

🕹 Usage

To be added.

🪅 Support Matrix

Model | KV Cache | Paged Attention | Kernels | Tensor Parallelism | Speculative Decoding
Llama |    ✅    |        ✅       |    ✅   |         🔜         |          🔜

Notations:

  • ✅: supported
  • ❌: not supported
  • 🔜: still in development, will be supported soon

🗺 Roadmap

  • KV Cache
  • Paged Attention (a conceptual sketch of the block-table bookkeeping follows this list)
  • High-Performance Kernels
  • Llama Modelling
  • Tensor Parallelism
  • Speculative Decoding
  • Continuous Batching
  • Online Inference
  • Benchmarking
  • User Documentation
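
As a rough illustration of the KV Cache and Paged Attention items above: in a paged KV cache, keys and values are stored in fixed-size physical blocks, and each sequence keeps a block table mapping its logical token positions to those blocks (see the PagedAttention paper cited below). The snippet below is a purely conceptual, framework-agnostic sketch of that bookkeeping, not ColossalAI's actual cache manager; all names and sizes are illustrative.

# Conceptual sketch of paged KV-cache bookkeeping; illustrative only.
from typing import Dict, List, Tuple

BLOCK_SIZE = 16  # tokens held per physical cache block (illustrative)

class NaiveBlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[int, List[int]] = {}  # seq_id -> physical block ids

    def allocate(self, seq_id: int, num_tokens: int) -> List[int]:
        """Reserve enough blocks to hold num_tokens keys/values for one sequence."""
        num_needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        table = self.block_tables.setdefault(seq_id, [])
        while len(table) < num_needed:
            table.append(self.free_blocks.pop())  # any free physical block will do
        return table

    def slot(self, seq_id: int, token_pos: int) -> Tuple[int, int]:
        """Map a logical token position to (physical_block, offset) in the cache."""
        table = self.block_tables[seq_id]
        return table[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

# A sequence with 20 cached tokens occupies two 16-token blocks; token 17 sits
# at offset 1 of its second physical block.
allocator = NaiveBlockAllocator(num_blocks=8)
allocator.allocate(seq_id=0, num_tokens=20)
print(allocator.slot(seq_id=0, token_pos=17))

Because blocks are allocated on demand and returned when a request finishes, sequences of very different lengths can share one cache pool without per-sequence padding, which is what makes continuous batching of requests practical.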

🌟 Acknowledgement

This project was written from scratch, but we learned a lot from several other great open-source projects during development. We therefore wish to fully acknowledge their contribution to the open-source community. These projects include vLLM, LightLLM, and FlashAttention.

If you wish to cite the relevant research papers, you can find the references below.

# vllm
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

# flash attention v1 & v2
@inproceedings{dao2022flashattention,
  title={Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
  author={Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}
@article{dao2023flashattention2,
  title={Flash{A}ttention-2: Faster Attention with Better Parallelism and Work Partitioning},
  author={Dao, Tri},
  year={2023}
}

# we did not find any research paper related to lightllm