From c597678da475abd4ecc075c0b80996989f1bcdc0 Mon Sep 17 00:00:00 2001
From: Frank Lee
Date: Mon, 15 Jan 2024 17:37:41 +0800
Subject: [PATCH] [doc] updated inference readme (#5269)

---
 colossalai/inference/README.md | 87 ++++++++++++++++++++++++++++++++++
 colossalai/inference/readme.md | 18 -------
 2 files changed, 87 insertions(+), 18 deletions(-)
 create mode 100644 colossalai/inference/README.md
 delete mode 100644 colossalai/inference/readme.md

diff --git a/colossalai/inference/README.md b/colossalai/inference/README.md
new file mode 100644
index 000000000..2773a7ff4
--- /dev/null
+++ b/colossalai/inference/README.md
@@ -0,0 +1,87 @@
+# ⚡️ ColossalAI-Inference
+
+## 📚 Table of Contents
+
+- [⚡️ ColossalAI-Inference](#️-colossalai-inference)
+  - [📚 Table of Contents](#-table-of-contents)
+  - [📌 Introduction](#-introduction)
+  - [🛠 Design and Implementation](#-design-and-implementation)
+  - [🕹 Usage](#-usage)
+  - [🪅 Support Matrix](#-support-matrix)
+  - [🗺 Roadmap](#-roadmap)
+  - [🌟 Acknowledgement](#-acknowledgement)
+
+
+## 📌 Introduction
+
+ColossalAI-Inference is a library that accelerates the inference of Transformer models, especially LLMs. We leverage high-performance kernels, KV caching, paged attention, continuous batching, and other techniques to speed up LLM inference, and we provide a unified interface so that users can adopt the library easily.
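+
+For readers new to paged attention, below is a minimal sketch of the block-table bookkeeping that a paged KV cache relies on; every name, shape, and block size in it is an illustrative assumption, not this library's actual API.
+
+```python
+# Illustrative sketch only: block-table bookkeeping for a paged KV cache.
+# BLOCK_SIZE, the pool shapes, and SequenceKVCache are assumptions, not ColossalAI-Inference APIs.
+import torch
+
+BLOCK_SIZE = 16    # tokens stored per physical KV block (assumed)
+NUM_BLOCKS = 256   # physical blocks pre-allocated in the pool (assumed)
+NUM_HEADS, HEAD_DIM = 8, 64
+
+# One shared physical pool each for keys and values; sequences borrow blocks from it.
+key_pool = torch.empty(NUM_BLOCKS, BLOCK_SIZE, NUM_HEADS, HEAD_DIM)
+value_pool = torch.empty_like(key_pool)
+free_blocks = list(range(NUM_BLOCKS))
+
+class SequenceKVCache:
+    """Maps a sequence's logical token positions onto scattered physical blocks."""
+
+    def __init__(self):
+        self.block_table = []  # i-th entry: block holding tokens [i*BLOCK_SIZE, (i+1)*BLOCK_SIZE)
+        self.num_tokens = 0
+
+    def append(self, k: torch.Tensor, v: torch.Tensor):
+        # A new physical block is claimed only when the previous one is full, so the
+        # cache grows in fixed-size pages instead of one large contiguous buffer.
+        if self.num_tokens % BLOCK_SIZE == 0:
+            self.block_table.append(free_blocks.pop())
+        block, slot = self.block_table[-1], self.num_tokens % BLOCK_SIZE
+        key_pool[block, slot] = k
+        value_pool[block, slot] = v
+        self.num_tokens += 1
+
+cache = SequenceKVCache()
+for _ in range(40):  # caching 40 tokens claims ceil(40 / 16) = 3 blocks
+    cache.append(torch.randn(NUM_HEADS, HEAD_DIM), torch.randn(NUM_HEADS, HEAD_DIM))
+print(cache.block_table)  # prints [255, 254, 253]
+```
+
+This paged layout lets many requests share one fixed KV-cache pool without per-request contiguous allocation, which is the property that continuous batching builds on.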
+
+## 🛠 Design and Implementation
+
+To be added.
+
+## 🕹 Usage
+
+To be added.
+
+## 🪅 Support Matrix
+
+| Model | KV Cache | Paged Attention | Kernels | Tensor Parallelism | Speculative Decoding |
+| - | - | - | - | - | - |
+| Llama | ✅ | ✅ | ✅ | 🔜 | 🔜 |
+
+
+Notations:
+- ✅: supported
+- ❌: not supported
+- 🔜: under development, will be supported soon
+
+## 🗺 Roadmap
+
+- [x] KV Cache
+- [x] Paged Attention
+- [x] High-Performance Kernels
+- [x] Llama Modelling
+- [ ] Tensor Parallelism
+- [ ] Speculative Decoding
+- [ ] Continuous Batching
+- [ ] Online Inference
+- [ ] Benchmarking
+- [ ] User Documentation
+
+## 🌟 Acknowledgement
+
+This project was written from scratch, but we learned a lot from several other great open-source projects during development, and we wish to fully acknowledge their contribution to the open-source community. These projects include
+
+- [vLLM](https://github.com/vllm-project/vllm)
+- [LightLLM](https://github.com/ModelTC/lightllm)
+- [flash-attention](https://github.com/Dao-AILab/flash-attention)
+
+If you wish to cite the relevant research papers, you can find the references below.
+
+```bibtex
+# vllm
+@inproceedings{kwon2023efficient,
+  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
+  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
+  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
+  year={2023}
+}
+
+# flash attention v1 & v2
+@inproceedings{dao2022flashattention,
+  title={Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
+  author={Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
+  booktitle={Advances in Neural Information Processing Systems},
+  year={2022}
+}
+@article{dao2023flashattention2,
+  title={Flash{A}ttention-2: Faster Attention with Better Parallelism and Work Partitioning},
+  author={Dao, Tri},
+  year={2023}
+}
+
+# we did not find a published research paper for LightLLM
+
+```

diff --git a/colossalai/inference/readme.md b/colossalai/inference/readme.md
deleted file mode 100644
index e87e46f05..000000000
--- a/colossalai/inference/readme.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# Colossal-Infer
-## Introduction
-Colossal-Infer is a library for inference of LLMs and MLMs. It is built on top of Colossal AI.
-
-## Structures
-### Overview
-The main design will be released later on.
-## Roadmap
-- [] design of structures
-- [] Core components
-  - [] engine
-  - [] request handler
-  - [] kv cache manager
-  - [] modeling
-  - [] custom layers
-  - [] online server
-- [] supported models
-  - [] llama2