ColossalAI/colossalai/inference
Latest commit: 86b63f720c, "[Inference]Adapted to the triton attn kernels (#5264)" by yuehuayingxueluo, 2024-01-17 16:03:10 +08:00

Commit body:

  • adapted to the triton attn kernels
  • fix pad input
  • adapted to copy_kv_to_blocked_cache
  • fix ci test
  • update kv memcpy
  • remove print
| File / Directory | Last commit | Date |
| --- | --- | --- |
| core | [Inference]Adapted to the triton attn kernels (#5264) | 2024-01-17 16:03:10 +08:00 |
| kv_cache | [Inference] Fix request handler and add recycle logic (#5260) | 2024-01-15 17:50:46 +08:00 |
| modeling | [Inference]Adapted to the triton attn kernels (#5264) | 2024-01-17 16:03:10 +08:00 |
| README.md | [doc] updated inference readme (#5269) | 2024-01-15 17:37:41 +08:00 |
| __init__.py | [Inference] First PR for rebuild colossal-infer (#5143) | 2024-01-11 13:39:29 +00:00 |
| config.py | fix bugs in request_handler.py and engine.py | 2024-01-11 13:46:14 +00:00 |
| logit_processors.py | [Inference] add logit processor and request handler (#5166) | 2024-01-11 13:39:56 +00:00 |
| sampler.py | adapted to pad_context_forward | 2024-01-11 13:44:06 +00:00 |
| struct.py | [Inference]Adapted to the triton attn kernels (#5264) | 2024-01-17 16:03:10 +08:00 |

README.md

โšก๏ธ ColossalAI-Inference

📚 Table of Contents

  • 📌 Introduction
  • 🛠 Design and Implementation
  • 🕹 Usage
  • 🪅 Support Matrix
  • 🗺 Roadmap
  • 🌟 Acknowledgement

📌 Introduction

ColossalAI-Inference is a library that accelerates inference for Transformer models, especially LLMs. It leverages high-performance kernels, KV caching, paged attention, continuous batching, and other techniques to speed up LLM inference, and it provides a unified interface so that users can adopt the library easily.
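As a concrete illustration of the paged KV cache idea mentioned above, the sketch below keeps a per-sequence block table that maps token positions onto fixed-size physical blocks. It is a minimal toy under assumed names and sizes (`ToyBlockedKVCache`, `slot_for`, `recycle`, a block size of 4); it is not the actual implementation under `kv_cache/`.

```python
# A toy blocked (paged) KV cache: logical token positions are mapped onto
# fixed-size physical blocks through a per-sequence block table, so requests
# can grow incrementally and finished requests can recycle their blocks.
# All names and sizes here are illustrative, not ColossalAI-Inference internals.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyBlockedKVCache:
    num_blocks: int                                  # total physical blocks
    block_size: int = 16                             # tokens stored per block
    free_blocks: List[int] = field(default_factory=list)
    block_tables: Dict[int, List[int]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.free_blocks = list(range(self.num_blocks))

    def slot_for(self, seq_id: int, position: int) -> Tuple[int, int]:
        """Return (physical_block, offset) for the token at `position`,
        allocating a fresh block whenever the previous one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % self.block_size == 0 and position // self.block_size == len(table):
            if not self.free_blocks:
                raise RuntimeError("KV cache exhausted; request must wait for recycling")
            table.append(self.free_blocks.pop())
        return table[position // self.block_size], position % self.block_size

    def recycle(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))


cache = ToyBlockedKVCache(num_blocks=8, block_size=4)
for pos in range(6):                                 # a 6-token sequence spans 2 blocks
    block, offset = cache.slot_for(seq_id=0, position=pos)
cache.recycle(seq_id=0)
```

Because blocks are returned to the free pool as soon as a request finishes, memory freed by short requests can immediately serve newly arriving ones, which is what makes techniques like continuous batching effective in practice.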

🛠 Design and Implementation

To be added.

🕹 Usage

To be added.
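Until this section is filled in, here is a speculative, minimal sketch of what driving the engine might look like. The import path and the `InferenceConfig` / `InferenceEngine` names, constructor arguments, and `generate()` call are assumptions inferred from the module layout (`config.py`, `core/`), not a documented API; check the source for the real interface.

```python
# Speculative usage sketch: class names, constructor arguments, and the
# generate() signature below are assumptions, NOT a documented API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed import path; verify against colossalai/inference/__init__.py.
from colossalai.inference import InferenceConfig, InferenceEngine  # hypothetical

model_path = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Hypothetical configuration knobs for batch size and sequence lengths.
config = InferenceConfig(max_batch_size=8, max_input_len=256, max_output_len=128)
engine = InferenceEngine(model, tokenizer, config)

outputs = engine.generate(["Introduce some landmarks in Beijing."])
print(outputs)
```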

🪅 Support Matrix

| Model | KV Cache | Paged Attention | Kernels | Tensor Parallelism | Speculative Decoding |
| ----- | -------- | --------------- | ------- | ------------------ | -------------------- |
| Llama | ✅       | ✅              | ✅      | 🔜                 | 🔜                   |

Notations:

  • ✅: supported
  • ❌: not supported
  • 🔜: under development, support coming soon

🗺 Roadmap

  • KV Cache
  • Paged Attention
  • High-Performance Kernels
  • Llama Modelling
  • Tensor Parallelism
  • Speculative Decoding
  • Continuous Batching
  • Online Inference
  • Benchmarking
  • User Documentation

🌟 Acknowledgement

This project was written from scratch, but we learned a lot from several other great open-source projects during development, and we wish to fully acknowledge their contribution to the open-source community. These projects include vLLM, FlashAttention, and LightLLM.

If you wish to cite the relevant research papers, you can find the references below.

# vllm
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

# flash attention v1 & v2
@inproceedings{dao2022flashattention,
  title={Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
  author={Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}
@article{dao2023flashattention2,
  title={Flash{A}ttention-2: Faster Attention with Better Parallelism and Work Partitioning},
  author={Dao, Tri},
  journal={arXiv preprint arXiv:2307.08691},
  year={2023}
}

# We did not find a published research paper associated with LightLLM.