|
|
@ -13,39 +13,49 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## 📌 Introduction
|
|
|
|
## 📌 Introduction
|
|
|
|
ColossalAI-Inference is a library which offers acceleration to Transformers models, especially LLMs. In ColossalAI-Inference, we leverage high-performance kernels, KV cache, paged attention, continous batching and other techniques to accelerate the inference of LLMs. We also provide a unified interface for users to easily use our library.
|
|
|
|
ColossalAI-Inference is a module that accelerates the inference of Transformers models, especially LLMs. It leverages high-performance kernels, KV cache, paged attention, continuous batching and other techniques to speed up LLM inference. We also provide simple and unified APIs for ease of use.
|
|
|
|
|
|
|
|
|
|
|
|
## 🛠 Design and Implementation
|
|
|
|
## 🛠 Design and Implementation
|
|
|
|
|
|
|
|
|
|
|
|
### :book: Overview
|
|
|
|
### :book: Overview
|
|
|
|
We build ColossalAI-Inference based on **Four** core components: `engine`,`request handler`,`cache manager(block cached)`, `hand crafted modeling`. **Engine** controls inference step, it recives `requests`, calls `request handler` to schedule a decoding batch and runs `modeling` to perform a iteration and returns finished `requests`. **Cache manager** is bound with `request handler`, updates cache blocks and logical block tables during schedule.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The interaction between different components are shown below, you can also checkout detailed introduction below.:
|
|
|
|
ColossalAI-Inference has **4** major components, namely `engine`, `request handler`, `cache manager`, and `modeling`.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
- **Engine**: It orchestrates the inference step. During inference, it receives a request, calls the `request handler` to schedule a decoding batch, and executes the model forward pass to perform an iteration. It returns the inference results to the user at the end.
|
|
|
|
|
|
|
|
- **Request Handler**: It manages requests and schedules a proper batch from existing requests.
|
|
|
|
|
|
|
|
- **Cache manager**: It is bound to the `request handler` and updates cache blocks and logical block tables as scheduled by the `request handler`.
|
|
|
|
|
|
|
|
- **Modeling**: We rewrite the model and layers of LLMs to simplify and optimize the forward pass for inference.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A high-level view of the inter-component interaction is given below. More details are introduced in the next few sections.
|
|
|
|
|
|
|
|
|
|
|
|
<p align="center">
|
|
|
|
<p align="center">
|
|
|
|
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/Structure/Introduction.png" width="600"/>
|
|
|
|
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/Structure/Introduction.png" width="600"/>
|
|
|
|
<br/>
|
|
|
|
<br/>
|
|
|
|
</p>
|
|
|
|
</p>
|
|
|
|
|
|
|
|
|
|
|
|
### :mailbox_closed: Design of engine
|
|
|
|
### :mailbox_closed: Engine
|
|
|
|
Engine is designed as starter of inference loop. User can easily instantialize an infer engine with config and execute requests. We provids apis below in engine, you can refer to source code for more information:
|
|
|
|
Engine is designed as the entry point where the user kickstarts an inference loop. Users can easily instantiate an inference engine with the inference configuration and execute requests. The engine object exposes the following APIs for inference:
|
|
|
|
- `generate`: main function, handle inputs and return outputs
|
|
|
|
|
|
|
|
- `add_request`: add request to waitting list
|
|
|
|
- `generate`: main function which handles inputs, performs inference and returns outputs
|
|
|
|
- `step`: perform one decoding iteration
|
|
|
|
- `add_request`: add request to the waiting list
|
|
|
|
- first, `request handler` schedules a batch to do prefill/decode
|
|
|
|
- `step`: perform one decoding iteration. The `request handler` first schedules a batch to do prefill/decoding. Then, the engine invokes the model to generate a batch of tokens, does logit processing and sampling, and checks and decodes finished requests.
|
|
|
|
- then, invoke a model to generate a batch of token
|
|
|
|
|
|
|
|
- after that, do logit processing and sampling, check and decode finished requests
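
The loop described for `step` can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `ToyEngine`, `Request`, `_schedule` and `_forward_and_sample` are hypothetical names invented for this sketch, the "model" is a dummy function, and none of it is the actual ColossalAI-Inference implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class Request:
    prompt_ids: List[int]
    max_new_tokens: int
    output_ids: List[int] = field(default_factory=list)

    @property
    def finished(self) -> bool:
        return len(self.output_ids) >= self.max_new_tokens


class ToyEngine:
    """Hypothetical stand-in for the engine; not the library's class."""

    def __init__(self, max_batch_size: int = 2) -> None:
        self.max_batch_size = max_batch_size
        self.waiting: Deque[Request] = deque()
        self.running: List[Request] = []

    def add_request(self, request: Request) -> None:
        # Requests first land in the waiting list.
        self.waiting.append(request)

    def _schedule(self) -> List[Request]:
        # Stand-in for the request handler: fill free batch slots from the waiting list.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        return self.running

    def _forward_and_sample(self, request: Request) -> int:
        # Stand-in for model forward + logit processing + sampling (dummy token id).
        return len(request.prompt_ids) + len(request.output_ids)

    def step(self) -> List[Request]:
        # One iteration: schedule a batch, produce one token per request, collect finished ones.
        batch = self._schedule()
        for request in batch:
            request.output_ids.append(self._forward_and_sample(request))
        finished = [r for r in batch if r.finished]
        self.running = [r for r in batch if not r.finished]
        return finished

    def generate(self) -> List[Request]:
        # Keep stepping until every request has finished.
        done: List[Request] = []
        while self.waiting or self.running:
            done.extend(self.step())
        return done


engine = ToyEngine(max_batch_size=2)
engine.add_request(Request(prompt_ids=[1, 2, 3], max_new_tokens=2))
engine.add_request(Request(prompt_ids=[4, 5], max_new_tokens=1))
results = engine.generate()
```

Because finished requests leave the running list after every step, a free slot can be refilled from the waiting list on the next iteration, which is the basic idea behind continuous batching.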
|
|
|
|
### :game_die: Request Handler
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Request handler is responsible for managing requests and scheduling a proper batch from existing requests. According to existing work and experiments, we believe that it is beneficial to increase the length of decoding sequences. In our design, we partition requests into three priorities depending on their lengths, and longer sequences are considered first.
|
|
|
|
|
|
|
|
|
|
|
|
### :game_die: Design of request_handler
|
|
|
|
|
|
|
|
Request handler is responsible manage requests and schedule a proper batch from exisiting requests. According to existing work and experiments, we do believe that it is beneficial to increase the length of decoding sequences. In our design, we partition requests into three priorities depending on their lengths, the longer sequences are first considered.
|
|
|
|
|
|
|
|
<p align="center">
|
|
|
|
<p align="center">
|
|
|
|
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/Structure/Request_handler.svg" width="800"/>
|
|
|
|
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/Structure/Request_handler.svg" width="800"/>
|
|
|
|
<br/>
|
|
|
|
<br/>
|
|
|
|
</p>
|
|
|
|
</p>
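
As a rough illustration of length-based priorities, the sketch below buckets waiting requests into three groups and fills a batch from the longest group first. The thresholds `LONG_THRESHOLD` and `MEDIUM_THRESHOLD`, and the helper names, are assumptions made for this example; the actual values and data structures used by the request handler are not stated here.

```python
from typing import Dict, List

# Hypothetical length thresholds; the real values are not given in this document.
LONG_THRESHOLD = 512
MEDIUM_THRESHOLD = 128


def priority_of(seq_len: int) -> int:
    """Return 0 for long, 1 for medium, 2 for short sequences."""
    if seq_len >= LONG_THRESHOLD:
        return 0
    if seq_len >= MEDIUM_THRESHOLD:
        return 1
    return 2


def schedule_batch(seq_lens: Dict[str, int], max_batch_size: int) -> List[str]:
    """Pick request ids for the next batch, visiting the longest bucket first."""
    buckets: List[List[str]] = [[], [], []]
    for request_id, seq_len in seq_lens.items():
        buckets[priority_of(seq_len)].append(request_id)

    batch: List[str] = []
    for bucket in buckets:
        for request_id in bucket:
            if len(batch) == max_batch_size:
                return batch
            batch.append(request_id)
    return batch


waiting = {"a": 40, "b": 700, "c": 200, "d": 900}
batch = schedule_batch(waiting, max_batch_size=3)
```

With these assumed thresholds, the two long requests are picked before the medium one, and the short request waits for a later batch.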
|
|
|
|
|
|
|
|
|
|
|
|
### :radio: Design of KV cache and cache manager
|
|
|
|
### :radio: KV cache and cache manager
|
|
|
|
We design a unified blocked type cache and cache manager to distribute memory. The physical memory is allocated before decoding and represented by a logical block table. During decoding process, cache manager administrate physical memory through `block table` and other components(i.e. engine) can focus on the light-weighted `block table`. Their details are introduced below.
|
|
|
|
|
|
|
|
- `cache block` We group physical memory into different memory blocks. A typical cache block is shaped `(num_kv_heads, head_size, block_size)`. We decide block number beforehand. The memory allocation and computation are executed with the granularity of memory block.
|
|
|
|
We design a unified block cache and cache manager to allocate and manage memory. The physical memory is allocated before decoding and represented by a logical block table. During the decoding process, the cache manager administers the physical memory through the `block table`, so other components (e.g. the engine) can focus on the lightweight `block table`. More details are given below.
|
|
|
|
- `block table` Block table is the logical representation of cache blocks. Concretely, a block table of a single sequence is a 1D tensor, with each element holding a block id of allocated id or `-1` for non allocated. Each iteration we pass through a batch block table to the corresponding model. For more information, you can checkout the source code.
|
|
|
|
|
|
|
|
|
|
|
|
- `cache block`: We group physical memory into different memory blocks. A typical cache block is shaped `(num_kv_heads, head_size, block_size)`. We determine the number of blocks beforehand. Memory allocation and computation are executed at the granularity of a memory block.
|
|
|
|
|
|
|
|
- `block table`: The block table is the logical representation of cache blocks. Concretely, the block table of a single sequence is a 1D tensor, with each element holding a block ID. A block ID of `-1` means "Not Allocated". In each iteration, we pass a batch block table to the corresponding model.
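
To make the relationship between cache blocks and block tables concrete, here is a minimal sketch that uses plain Python lists instead of tensors. `ToyCacheManager`, its methods and the free-list policy are hypothetical and only illustrate the `-1` convention described above; they are not the library's cache manager.

```python
from typing import Dict, List

UNALLOCATED = -1  # block ID that marks a slot as "Not Allocated"


class ToyCacheManager:
    """Hypothetical sketch; names and methods are not the library's API."""

    def __init__(self, num_blocks: int, block_size: int, max_blocks_per_seq: int) -> None:
        self.block_size = block_size
        self.max_blocks_per_seq = max_blocks_per_seq
        # The number of physical blocks is fixed before decoding starts.
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[int, List[int]] = {}

    def allocate(self, seq_id: int, seq_len: int) -> List[int]:
        """Make sure `seq_id` owns enough blocks to hold `seq_len` tokens."""
        table = self.block_tables.setdefault(
            seq_id, [UNALLOCATED] * self.max_blocks_per_seq
        )
        needed = -(-seq_len // self.block_size)  # ceiling division
        if needed > self.max_blocks_per_seq:
            raise ValueError("sequence is longer than the block table can address")
        for slot in range(needed):
            if table[slot] == UNALLOCATED:
                if not self.free_blocks:
                    raise RuntimeError("out of cache blocks")
                table[slot] = self.free_blocks.pop(0)
        return table

    def free(self, seq_id: int) -> None:
        """Return all blocks owned by a finished sequence to the free list."""
        table = self.block_tables.pop(seq_id)
        self.free_blocks.extend(b for b in table if b != UNALLOCATED)


manager = ToyCacheManager(num_blocks=8, block_size=16, max_blocks_per_seq=4)
table_a = list(manager.allocate(seq_id=0, seq_len=20))        # needs 2 blocks
table_b = list(manager.allocate(seq_id=1, seq_len=5))         # needs 1 block
table_a_grown = list(manager.allocate(seq_id=0, seq_len=33))  # grows to 3 blocks
```

Other components only need the small per-sequence table of block IDs, while the mapping to physical memory stays inside the manager.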
|
|
|
|
|
|
|
|
|
|
|
|
<figure>
|
|
|
|
<figure>
|
|
|
|
<p align="center">
|
|
|
|
<p align="center">
|
|
|
@ -57,48 +67,71 @@ We design a unified blocked type cache and cache manager to distribute memory. T
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
### :railway_car: Modeling
|
|
|
|
### :railway_car: Modeling
|
|
|
|
|
|
|
|
|
|
|
|
Modeling contains models and layers, which are hand-crafted for better performance and easier usage. Deeply integrated with `shardformer`, we also construct policies for our models. In order to minimize users' learning costs, our models are aligned with [Transformers](https://github.com/huggingface/transformers).
|
|
|
|
Modeling contains models and layers, which are hand-crafted for better performance and easier usage. Deeply integrated with `shardformer`, we also construct policies for our models. In order to minimize users' learning costs, our models are aligned with [Transformers](https://github.com/huggingface/transformers).
|
|
|
|
|
|
|
|
|
|
|
|
## 🕹 Usage
|
|
|
|
## 🕹 Usage
|
|
|
|
|
|
|
|
|
|
|
|
### :arrow_right: Quick Start
|
|
|
|
### :arrow_right: Quick Start
|
|
|
|
You can enjoy your fast generation journey within three step
|
|
|
|
|
|
|
|
```python
|
|
|
|
```python
|
|
|
|
# First, create a model in "transformers" way, you can provide a model config or use the default one.
|
|
|
|
import torch
|
|
|
|
model = transformers.LlamaForCausalLM(config).cuda()
|
|
|
|
import transformers
|
|
|
|
# Second, create an inference_config
|
|
|
|
import colossalai
|
|
|
|
|
|
|
|
from colossalai.inference import InferenceEngine, InferenceConfig
|
|
|
|
|
|
|
|
from pprint import pprint
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
colossalai.launch_from_torch(config={})
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# Step 1: create a model in "transformers" way
|
|
|
|
|
|
|
|
model_path = "lmsys/vicuna-7b-v1.3"
|
|
|
|
|
|
|
|
model = transformers.LlamaForCausalLM.from_pretrained(model_path).cuda()
|
|
|
|
|
|
|
|
tokenizer = transformers.LlamaTokenizer.from_pretrained(model_path)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# Step 2: create an inference_config
|
|
|
|
inference_config = InferenceConfig(
|
|
|
|
inference_config = InferenceConfig(
|
|
|
|
    dtype=args.dtype,
|
|
|
|
    dtype=torch.float16,
|
|
|
|
    max_batch_size=args.max_batch_size,
|
|
|
|
    max_batch_size=4,
|
|
|
|
    max_input_len=args.seq_len,
|
|
|
|
    max_input_len=1024,
|
|
|
|
    max_output_len=args.output_len,
|
|
|
|
    max_output_len=512,
|
|
|
|
)
|
|
|
|
)
|
|
|
|
# Third, create an engine with model and config
|
|
|
|
|
|
|
|
|
|
|
|
# Step 3: create an engine with model and config
|
|
|
|
engine = InferenceEngine(model, tokenizer, inference_config, verbose=True)
|
|
|
|
engine = InferenceEngine(model, tokenizer, inference_config, verbose=True)
|
|
|
|
|
|
|
|
|
|
|
|
# Try fast infrence now!
|
|
|
|
# Step 4: try inference
|
|
|
|
prompts = {'Nice to meet you, Colossal-Inference!'}
|
|
|
|
generation_config = transformers.GenerationConfig(
|
|
|
|
engine.generate(prompts)
|
|
|
|
    pad_token_id=tokenizer.pad_token_id,
|
|
|
|
|
|
|
|
    max_new_tokens=512,
|
|
|
|
|
|
|
|
)
|
|
|
|
|
|
|
|
prompts = ['Who is the best player in the history of NBA?']
|
|
|
|
|
|
|
|
engine.add_request(prompts=prompts)
|
|
|
|
|
|
|
|
response = engine.generate(generation_config)
|
|
|
|
|
|
|
|
pprint(response)
|
|
|
|
```
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
### :bookmark: Customize your inference engine
|
|
|
|
### :bookmark: Customize your inference engine
|
|
|
|
Besides the basic fast-start inference, you can also customize your inference engine via modifying config or upload your own model or decoding components (logit processors or sampling strategies).
|
|
|
|
Besides the basic quick-start inference, you can also customize your inference engine by modifying the config or by supplying your own model or decoding components (logit processors or sampling strategies).
|
|
|
|
|
|
|
|
|
|
|
|
#### Inference Config
|
|
|
|
#### Inference Config
|
|
|
|
Inference Config is a unified API for the generation process. You can set arguments such as `max_batch_size`, `max_output_len` and `dtype` to control the generation, for example how many sequences can be handled at a time and how many tokens to output. Refer to the source code for more details.
|
|
|
|
Inference Config is a unified API for the generation process. You can set arguments such as `max_batch_size`, `max_output_len` and `dtype` to control the generation, for example how many sequences can be handled at a time and how many tokens to output. Refer to the source code for more details.
|
|
|
|
|
|
|
|
|
|
|
|
#### Generation Config
|
|
|
|
#### Generation Config
|
|
|
|
In colossal-inference, the Generation Config API is inherited from [Transformers](https://github.com/huggingface/transformers), and its usage is aligned. By default, it is generated automatically by our system, so you do not need to construct one. If you need to, you can also create your own and pass it to your engine.
|
|
|
|
In colossal-inference, the Generation Config API is inherited from [Transformers](https://github.com/huggingface/transformers), and its usage is aligned. By default, it is generated automatically by our system, so you do not need to construct one. If you need to, you can also create your own and pass it to your engine.
|
|
|
|
|
|
|
|
|
|
|
|
#### Logit Processors
|
|
|
|
#### Logit Processors
|
|
|
|
Logit Processosr receives logits and return processed ones, take the following step to make your own.
|
|
|
|
The `Logit Processor` receives logits and returns processed results. You can take the following steps to make your own.
|
|
|
|
|
|
|
|
|
|
|
|
```python
|
|
|
|
```python
|
|
|
|
@register_logit_processor("name")
|
|
|
|
@register_logit_processor("name")
|
|
|
|
def xx_logit_processor(logits, args):
|
|
|
|
def xx_logit_processor(logits, args):
|
|
|
|
    logits = do_some_process(logits)
|
|
|
|
    logits = do_some_process(logits)
|
|
|
|
    return logits
|
|
|
|
    return logits
|
|
|
|
```
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
#### Sampling Strategies
|
|
|
|
#### Sampling Strategies
|
|
|
|
We offer 3 main sampling strategies now (i.e. `greedy sample`, `multinomial sample`, `beam_search sample`); you can refer to [sampler](/ColossalAI/colossalai/inference/sampler.py) for more details. We would strongly appreciate it if you could contribute your own varieties.
|
|
|
|
We offer 3 main sampling strategies now (i.e. `greedy sample`, `multinomial sample`, `beam_search sample`); you can refer to [sampler](/ColossalAI/colossalai/inference/sampler.py) for more details. We would strongly appreciate it if you could contribute your own varieties.
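
The difference between greedy and multinomial sampling can be illustrated with a small, dependency-free sketch. The function names below are chosen for this example and operate on plain Python lists; they are not the functions defined in the sampler module, and beam search is left out for brevity.

```python
import math
import random
from typing import List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def greedy_sample(logits: List[float]) -> int:
    """Always pick the token with the highest logit."""
    return max(range(len(logits)), key=lambda index: logits[index])


def multinomial_sample(logits: List[float], rng: Optional[random.Random] = None) -> int:
    """Draw one token id at random, weighted by the softmax probabilities."""
    rng = rng or random.Random()
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [0.1, 2.5, -1.0, 0.3]
greedy_token = greedy_sample(logits)
sampled_token = multinomial_sample(logits, rng=random.Random(0))
```

Greedy sampling is deterministic, while multinomial sampling can return any token with non-zero probability, which is why logit processors such as temperature or top-k are usually applied before it.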
|
|
|
|
|
|
|
|
|
|
|
|
## 🪅 Support Matrix
|
|
|
|
## 🪅 Support Matrix
|
|
|
|
|
|
|
|
|
|
|
|
| Model | KV Cache | Paged Attention | Kernels | Tensor Parallelism | Speculative Decoding |
|
|
|
|
| Model | KV Cache | Paged Attention | Kernels | Tensor Parallelism | Speculative Decoding |
|
|
|
@ -158,5 +191,4 @@ If you wish to cite relevant research papars, you can find the reference below.
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
# we do not find any research work related to lightllm
|
|
|
|
# we do not find any research work related to lightllm
|
|
|
|
|
|
|
|
|
|
|
|
```
|
|
|
|
```
|
|
|
|