ColossalAI/colossalai/inference
    core/
    kv_cache/
    modeling/
    README.md
    __init__.py
    config.py
    flash_decoding_utils.py
    logit_processors.py
    sampler.py
    struct.py
    utils.py

README.md

⚡️ ColossalAI-Inference

📚 Table of Contents

  • 📌 Introduction
  • 🛠 Design and Implementation
  • 🕹 Usage
  • 🪅 Support Matrix
  • 🗺 Roadmap
  • 🌟 Acknowledgement

📌 Introduction

ColossalAI-Inference is a library that accelerates inference for Transformer models, especially LLMs. It combines high-performance kernels, KV caching, paged attention, continuous batching, and other techniques to speed up LLM inference, and it exposes a unified interface so that users can adopt the library easily.
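
Since the Usage section below is still a placeholder, here is only a minimal sketch of the intended workflow. It assumes the InferenceConfig and InferenceEngine classes under config.py and core/ in this directory; the import paths, constructor arguments, and generation call shown are illustrative assumptions rather than the finalized API.

# Minimal usage sketch; names and arguments are assumptions, not the final API.
import colossalai
from transformers import AutoModelForCausalLM, AutoTokenizer

from colossalai.inference.config import InferenceConfig
from colossalai.inference.core.engine import InferenceEngine

colossalai.launch_from_torch(config={})  # initialize the distributed environment (e.g. via torchrun)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").half().cuda()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Hypothetical knobs controlling batching limits and KV-cache sizing.
inference_config = InferenceConfig(max_batch_size=4, max_input_len=256, max_output_len=128)

engine = InferenceEngine(model, tokenizer, inference_config, verbose=True)
outputs = engine.generate(prompts=["Introduce some landmarks in Beijing."])
print(outputs)

In this sketch the engine is expected to handle batching, KV-cache management, and sampling internally, so the caller only supplies prompts and size limits.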

🛠 Design and Implementation

To be added.

🕹 Usage

To be added.

🪅 Support Matrix

Model | KV Cache | Paged Attention | Kernels | Tensor Parallelism | Speculative Decoding
Llama |    ✅    |        ✅       |    ✅   |         🔜         |          🔜

Notations:

  • ✅: supported
  • ❌: not supported
  • 🔜: still in development, will be supported soon

🗺 Roadmap

  • KV Cache
  • Paged Attention (a conceptual sketch of the block-table bookkeeping follows this list)
  • High-Performance Kernels
  • Llama Modelling
  • Tensor Parallelism
  • Speculative Decoding
  • Continuous Batching
  • Online Inference
  • Benchmarking
  • User Documentation
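
As a rough illustration of the KV Cache and Paged Attention items above: in a paged KV cache, keys and values are stored in fixed-size physical blocks, and each sequence keeps a block table mapping its logical token positions to those blocks (see the PagedAttention paper cited below). The snippet below is a purely conceptual, framework-agnostic sketch of that bookkeeping, not ColossalAI's actual cache manager; all names and sizes are illustrative.

# Conceptual sketch of paged KV-cache bookkeeping; illustrative only.
from typing import Dict, List, Tuple

BLOCK_SIZE = 16  # tokens held per physical cache block (illustrative)

class NaiveBlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[int, List[int]] = {}  # seq_id -> physical block ids

    def allocate(self, seq_id: int, num_tokens: int) -> List[int]:
        """Reserve enough blocks to hold num_tokens keys/values for one sequence."""
        num_needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
        table = self.block_tables.setdefault(seq_id, [])
        while len(table) < num_needed:
            table.append(self.free_blocks.pop())  # any free physical block will do
        return table

    def slot(self, seq_id: int, token_pos: int) -> Tuple[int, int]:
        """Map a logical token position to (physical_block, offset) in the cache."""
        table = self.block_tables[seq_id]
        return table[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

# A sequence with 20 cached tokens occupies two 16-token blocks; token 17 sits
# at offset 1 of its second physical block.
allocator = NaiveBlockAllocator(num_blocks=8)
allocator.allocate(seq_id=0, num_tokens=20)
print(allocator.slot(seq_id=0, token_pos=17))

Because blocks are allocated on demand and returned when a request finishes, sequences of very different lengths can share one cache pool without per-sequence padding, which is what makes continuous batching of requests practical.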

🌟 Acknowledgement

This project was written from scratch, but we learned a lot from several other great open-source projects during development. We therefore wish to fully acknowledge their contribution to the open-source community. These projects include vLLM, LightLLM, and FlashAttention.

If you wish to cite the relevant research papers, you can find the references below.

# vllm
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

# flash attention v1 & v2
@inproceedings{dao2022flashattention,
  title={Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
  author={Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}
@article{dao2023flashattention2,
  title={Flash{A}ttention-2: Faster Attention with Better Parallelism and Work Partitioning},
  author={Dao, Tri},
  year={2023}
}

# we did not find any research paper related to lightllm