ColossalAI/colossalai/inference
Xu Kai 611a5a80ca
[inference] Add smmoothquant for llama (#4904)
* [inference] add int8 rotary embedding kernel for smoothquant (#4843)

* [inference] add smoothquant llama attention (#4850)

* add smoothquant llama attention

* remove uselss code

* remove useless code

* fix import error

* rename file name

* [inference] add silu linear fusion for smoothquant llama mlp  (#4853)

* add silu linear

* update skip condition

* catch smoothquant cuda lib exception

* prcocess exception for tests

* [inference] add llama mlp for smoothquant (#4854)

* add llama mlp for smoothquant

* fix down out scale

* remove duplicate lines

* add llama mlp check

* delete useless code

* [inference] add smoothquant llama (#4861)

* add smoothquant llama

* fix attention accuracy

* fix accuracy

* add kv cache and save pretrained

* refactor example

* delete smooth

* refactor code

* [inference] add smooth function and delete useless code for smoothquant (#4895)

* add smooth function and delete useless code

* update datasets

* remove duplicate import

* delete useless file

* refactor codes (#4902)

* rafactor code

* add license

* add torch-int and smoothquant license
2023-10-16 11:28:44 +08:00
pipeline [Pipeline Inference] Sync pipeline inference branch to main (#4820) 2023-10-11 11:40:06 +08:00
quant [inference] Add smmoothquant for llama (#4904) 2023-10-16 11:28:44 +08:00
tensor_parallel [inference] add llama2 support (#4898) 2023-10-13 13:09:23 +08:00
README.md [Feature] The first PR to Add TP inference engine, kv-cache manager and related kernels for our inference system (#4577) 2023-09-12 01:22:56 +08:00
__init__.py [Pipeline Inference] Sync pipeline inference branch to main (#4820) 2023-10-11 11:40:06 +08:00

README.md

🚀 Colossal-Inference


Introduction

Colossal Inference is a module containing the inference framework designed by Colossal-AI, featuring high performance, stability, and ease of use. It incorporates the advantages of the latest open-source inference systems, including TGI, vLLM, FasterTransformer, LightLLM, and flash attention, while building on the design of Colossal-AI, especially Shardformer, to reduce the learning curve for users.

Design

Colossal Inference is composed of three main components:

  1. High-performance kernels and ops, which are inspired by existing libraries and adapted accordingly.
  2. An efficient memory management mechanism, which includes the key-value cache manager and allows for zero memory waste during inference.
    1. cache manager: a memory manager for the key-value cache that integrates memory allocation, indexing, and release.
    2. batch_infer_info: holds all essential elements of a batch inference and is updated every batch.
  3. A high-level inference engine combined with Shardformer, which allows our inference framework to easily invoke and use various parallel methods.
    1. engine.TPInferEngine: a high-level interface that integrates with Shardformer, especially for multi-card (tensor parallel) inference.
    2. modeling.llama.LlamaInferenceForwards: contains the forward methods for inference (in this case, for Llama).
    3. policies.llama.LlamaModelInferPolicy: contains the policies for Llama models, which are used to call Shardformer and split the model forward in a tensor-parallel way.
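To make the cache manager's three duties (allocation, indexing, release) concrete, here is a minimal sketch of a token-level slot allocator. It is not the Colossal Inference implementation: the class name `ToyKVCacheManager`, its methods, and the use of plain Python lists instead of GPU tensors are all assumptions made for illustration.

```python
class ToyKVCacheManager:
    """Hypothetical token-level allocator: one slot stores the key/value entry of one token."""

    def __init__(self, max_total_tokens: int):
        self.max_total_tokens = max_total_tokens
        # used[i] is True while slot i holds a live key/value entry.
        self.used = [False] * max_total_tokens
        self.free_count = max_total_tokens

    def alloc(self, num_tokens: int):
        """Return the indices of `num_tokens` free slots, or None if too few remain."""
        if num_tokens > self.free_count:
            return None
        free_slots = [slot for slot, in_use in enumerate(self.used) if not in_use]
        indices = free_slots[:num_tokens]
        for slot in indices:
            self.used[slot] = True
        self.free_count -= len(indices)
        return indices

    def free(self, indices):
        """Release slots so that later requests can reuse them."""
        for slot in indices:
            if self.used[slot]:
                self.used[slot] = False
                self.free_count += 1


manager = ToyKVCacheManager(max_total_tokens=8)
prompt_slots = manager.alloc(5)   # slots for a 5-token prompt
step_slot = manager.alloc(1)      # one more slot for the first generated token
too_large = manager.alloc(4)      # only 2 slots remain, so this request is refused
manager.free(prompt_slots + step_slot)
```

Because slots are handed out per token rather than per fixed-size sequence buffer, a request only occupies as many slots as it has tokens, which is the sense in which such a scheme avoids wasted memory.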

Pipeline of inference:

This section describes how Colossal Inference works and how it integrates with Shardformer. The details can be found in our code.

[Figure: Colossal-Inference]

Roadmap of our implementation

  • Design cache manager and batch infer state
  • Design the TP inference engine to integrate with Shardformer
  • Register corresponding high-performance kernel and ops
  • Design policies and forwards (e.g. Llama and Bloom)
    • policy
    • context forward
    • token forward
  • Replace the kernels with faster-transformer in token-forward stage
  • Support all models
    • Llama
    • Bloom
    • Chatglm2
  • Benchmarking for all models
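The roadmap separates a context forward from a token forward. The sketch below illustrates that split under stated assumptions: the context forward processes the whole prompt once and fills the cache, and each token forward then processes a single new token while reusing the cache. The function names, the list-based cache, and the arithmetic "model" in `toy_next_token` are hypothetical stand-ins, not the project's code.

```python
from typing import List

VOCAB_SIZE = 100  # hypothetical toy vocabulary size


def toy_next_token(cache: List[int]) -> int:
    """Stand-in for a real model: derive the next token from everything cached so far."""
    return (sum(cache) + len(cache)) % VOCAB_SIZE


def context_forward(prompt_ids: List[int], cache: List[int]) -> int:
    """Process the whole prompt in one pass and cache every prompt token."""
    cache.extend(prompt_ids)
    return toy_next_token(cache)


def token_forward(last_id: int, cache: List[int]) -> int:
    """Process one new token, reusing the cache instead of re-reading the prompt."""
    cache.append(last_id)
    return toy_next_token(cache)


def generate(prompt_ids: List[int], max_new_tokens: int) -> List[int]:
    if max_new_tokens <= 0:
        return []
    cache: List[int] = []
    # Stage 1: a single context forward over the full prompt.
    output = [context_forward(prompt_ids, cache)]
    # Stage 2: one token forward per remaining output token.
    while len(output) < max_new_tokens:
        output.append(token_forward(output[-1], cache))
    return output


tokens = generate([1, 2, 3], max_new_tokens=3)
```

The two stages have different cost profiles (many tokens at once versus one token per step), which is why they are listed as separate forwards and can use different kernels.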

Get started

Installation

pip install -e .

Requirements

Dependencies

pytorch==1.13.1 (gpu)
cuda>=11.6
transformers==4.30.2
triton==2.0.0.dev20221202
# to install vllm, please use this branch: https://github.com/tiandiao123/vllm/tree/setup_branch
vllm
# to install flash-attention, please use commit hash: 67ae6fd74b4bc99c36b2ce524cf139c35663793c
flash-attention
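A quick way to check an environment against the pinned versions is to query the installed package metadata. The sketch below uses only the standard library; the distribution names `torch`, `transformers`, and `triton` are the usual PyPI names and are an assumption here, and the helper `check_versions` is hypothetical.

```python
from importlib import metadata

# Pins copied from the dependency list; the PyPI distribution names are assumed.
EXPECTED = {
    "torch": "1.13.1",
    "transformers": "4.30.2",
    "triton": "2.0.0.dev20221202",
}


def check_versions(expected=EXPECTED):
    """Map each package to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Wheels may carry a local suffix such as "+cu116"; compare the public part only.
        public = installed.split("+")[0] if installed else None
        report[name] = (installed, public == wanted)
    return report


for package, (installed, ok) in check_versions().items():
    status = "ok" if ok else "mismatch or missing"
    print(f"{package}: installed={installed} -> {status}")
```

This only covers Python packages; the CUDA version, vllm branch, and flash-attention commit still need to be checked by hand.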

Docker

You can use docker run to set up the environment in a Docker container:

# env: python==3.8, cuda 11.6, pytorch==1.13.1, triton==2.0.0.dev20221202, vllm kernels support, flash-attention-2 kernels support
docker pull hpcaitech/colossalai-inference:v2
docker run -it --gpus all --name ANY_NAME -v $PWD:/workspace -w /workspace hpcaitech/colossalai-inference:v2 /bin/bash

Dive into fast-inference!

Example files are in:

cd colossalai.examples
python xx

Performance

Environment:

We ran multiple benchmarks to evaluate performance, comparing inference latency and throughput between colossal-inference and the original hugging-face torch fp16 implementation.

For each model, experiments used several batch sizes under a consistent configuration: 7 billion (7B) parameters, an input length of 1024, and an output length of 128. The results are shown below. Due to time constraints, the evaluation has so far covered only single-GPU performance on an A100; multi-GPU performance will be addressed in the future.

Single GPU Performance:

The stats below were measured on a single A100 GPU. Token latency is calculated from the average values of the context-forward and decoding-forward processes; that is, both processes are combined to calculate token generation time. We are actively developing new features and methods to further optimize the performance of LLMs. Please stay tuned.
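The exact formula is not given here, so the following is only one plausible reading of "combining both processes": add the context-forward time and the decoding time, then derive a per-token latency and a throughput from the total. The function `token_stats`, its parameters, and the sample timings are hypothetical.

```python
def token_stats(batch_size, output_len, context_seconds, decode_seconds):
    """Combine context-forward and decoding time into per-token statistics.

    Assumption: every sequence in the batch generates `output_len` tokens.
    """
    total_seconds = context_seconds + decode_seconds
    generated_tokens = batch_size * output_len
    return {
        # Average wall-clock time per generated position, prefill included.
        "latency_per_token_s": total_seconds / output_len,
        # Tokens produced across the whole batch per second.
        "throughput_tokens_per_s": generated_tokens / total_seconds,
    }


# Made-up timings, used only to show the arithmetic.
stats = token_stats(batch_size=8, output_len=128,
                    context_seconds=0.5, decode_seconds=1.5)
```

Folding the context forward into the total means long prompts raise the reported per-token latency even when the decoding steps themselves are fast.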

Llama

| batch_size | 8 | 16 | 32 |
| --- | --- | --- | --- |
| hugging-face torch fp16 | 199.12 | 246.56 | 278.4 |
| colossal-inference | 326.4 | 582.72 | 816.64 |

[Figure: Llama]

Bloom

| batch_size | 8 | 16 | 32 |
| --- | --- | --- | --- |
| hugging-face torch fp16 | 189.68 | 226.66 | 249.61 |
| colossal-inference | 323.28 | 538.52 | 611.64 |

[Figure: Bloom]
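To compare the two rows at a glance, the short script below divides each colossal-inference value by the corresponding hugging-face torch fp16 value. The numbers are copied from the Llama and Bloom results; the tables do not state a unit, so the ratio is reported as a plain quotient rather than as a claim about a specific metric.

```python
BATCH_SIZES = [8, 16, 32]

# Values copied from the Llama and Bloom results.
RESULTS = {
    "Llama": {
        "hugging-face torch fp16": [199.12, 246.56, 278.4],
        "colossal-inference": [326.4, 582.72, 816.64],
    },
    "Bloom": {
        "hugging-face torch fp16": [189.68, 226.66, 249.61],
        "colossal-inference": [323.28, 538.52, 611.64],
    },
}


def ratios(results=RESULTS, batch_sizes=BATCH_SIZES):
    """Return {model: {batch_size: colossal-inference / baseline}}."""
    table = {}
    for model, rows in results.items():
        baseline = rows["hugging-face torch fp16"]
        ours = rows["colossal-inference"]
        table[model] = {
            bs: o / b for bs, b, o in zip(batch_sizes, baseline, ours)
        }
    return table


ratio_table = ratios()
for model, per_batch in ratio_table.items():
    for bs, ratio in per_batch.items():
        print(f"{model} batch_size={bs}: {ratio:.2f}x")
```

For both models the ratio grows with batch size, from roughly 1.6x to 1.7x at batch size 8 up to roughly 2.5x to 2.9x at batch size 32.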

The results of more models are coming soon!