# ⚡️ ColossalAI-Inference
## 📚 Table of Contents

- Introduction
- Design and Implementation
- Usage
- Support Matrix
- Roadmap
- Acknowledgement
## 📌 Introduction
ColossalAI-Inference is a library that accelerates Transformer models, especially LLMs. It leverages high-performance kernels, KV cache, paged attention, continuous batching, and other techniques to speed up LLM inference, and it provides a unified interface so users can adopt the library with minimal effort.
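To give a rough picture of the paged-attention idea mentioned above, the sketch below shows one conventional way a KV cache can be carved into fixed-size blocks and indexed through a per-sequence block table. It is a minimal conceptual illustration, not ColossalAI-Inference's actual implementation; the `PagedKVCache` class, the `BLOCK_SIZE` value, and the method names are made up for this example.

```python
# Conceptual sketch only: a paged KV cache splits its storage into
# fixed-size blocks, and each sequence owns a "block table" that maps
# logical token positions to physical blocks. Names and sizes here are
# illustrative, not taken from ColossalAI-Inference.

BLOCK_SIZE = 16  # tokens stored per cache block (illustrative value)


class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> list of block ids

    def allocate(self, seq_id: int, num_tokens: int) -> None:
        """Reserve enough physical blocks to hold num_tokens KV entries."""
        needed = -(-num_tokens // BLOCK_SIZE)       # ceiling division
        table = self.block_tables.setdefault(seq_id, [])
        while len(table) < needed:
            table.append(self.free_blocks.pop())    # grab any free block

    def slot(self, seq_id: int, token_pos: int):
        """Translate a logical token position into (physical block, offset)."""
        block = self.block_tables[seq_id][token_pos // BLOCK_SIZE]
        return block, token_pos % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))


cache = PagedKVCache(num_blocks=1024)
cache.allocate(seq_id=0, num_tokens=40)    # 40 tokens -> 3 blocks of 16
print(cache.slot(seq_id=0, token_pos=37))  # -> (block_id, offset 5)
cache.free(seq_id=0)
```

Storing KV entries block by block lets a sequence grow without reserving a contiguous maximum-length buffer up front, which is what makes continuous batching of many concurrent requests memory-efficient.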
## 🛠 Design and Implementation
To be added.
## 🕹 Usage
To be added.
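Until the official guide is written, here is a hedged sketch of what driving the unified interface might look like. The import paths, class names (`InferenceConfig`, `InferenceEngine`), constructor signature, and keyword arguments below are assumptions inferred from the repository layout (`config.py`, `core/`) and may not match the actual API.

```python
# NOTE: every colossalai.inference import and call below is an assumption
# inferred from the repository layout, not a confirmed API.
from transformers import AutoModelForCausalLM, AutoTokenizer

from colossalai.inference.config import InferenceConfig       # assumed module path
from colossalai.inference.core.engine import InferenceEngine  # assumed module path

model_path = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Assumed configuration knobs controlling batching and sequence lengths.
inference_config = InferenceConfig(
    max_batch_size=8,
    max_input_len=256,
    max_output_len=128,
)

# Assumed constructor signature: model, tokenizer, then the config object.
engine = InferenceEngine(model, tokenizer, inference_config)

# Assumed generate() entry point taking a list of prompt strings.
outputs = engine.generate(prompts=["Introduce some landmarks in Beijing."])
print(outputs)
```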
## 🪅 Support Matrix
Model | KV Cache | Paged Attention | Kernels | Tensor Parallelism | Speculative Decoding |
---|---|---|---|---|---|
Llama | ✅ | ✅ | ✅ | 🔜 | 🔜 |
Notations:
- ✅: supported
- ❌: not supported
- 🔜: still developing, will support soon
## 🗺 Roadmap
- KV Cache
- Paged Attention
- High-Performance Kernels
- Llama Modelling
- Tensor Parallelism
- Speculative Decoding
- Continuous Batching
- Online Inference
- Benchmarking
- User Documentation
## 🌟 Acknowledgement
This project was written from scratch, but we learned a lot from several other great open-source projects during development, and we wish to fully acknowledge their contribution to the open-source community. These projects include vLLM, FlashAttention, and LightLLM.

If you wish to cite the relevant research papers, you can find the references below.
```bibtex
# vLLM
@inproceedings{kwon2023efficient,
title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
year={2023}
}
# flash attention v1 & v2
@inproceedings{dao2022flashattention,
title={Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
author={Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
booktitle={Advances in Neural Information Processing Systems},
year={2022}
}
@article{dao2023flashattention2,
title={Flash{A}ttention-2: Faster Attention with Better Parallelism and Work Partitioning},
author={Dao, Tri},
year={2023}
}
# we did not find a published research paper related to lightllm
```