ColossalAI/examples/inference/llama

Run Inference

The example script llama_generation.py shows how to configure and initialize the inference engine and run inference on a given model. It uses AutoModelForCausalLM as the model class and NoPaddingLlamaModelInferPolicy as the policy class, so the script can also run inference with Llama 3.

For a basic single-GPU setup, run the example with:

colossalai run --nproc_per_node 1 llama_generation.py -m PATH_MODEL --max_length 128

To run multi-GPU inference with tensor parallelism, pass --tp_size, as in the following example using 2 GPUs:

colossalai run --nproc_per_node 2 llama_generation.py -m PATH_MODEL --max_length 128 --tp_size 2

Run Speculative Decoding

Colossal-Inference supports speculative decoding using the inference engine, with optimized kernels and cache management for the main model.

Speculative decoding uses both a drafter model (small model) and a main model (large model). The drafter model generates a few candidate tokens sequentially, and the main model then validates those candidates in parallel and accepts the ones that pass. This speeds up decoding because the drafter model speculates multiple tokens with lower latency than the main model would need to generate them.
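The draft-then-verify loop described above can be sketched with toy models. This is a minimal illustration under stated assumptions, not the Colossal-Inference implementation: the "models" are plain Python functions that return the next token greedily, `speculative_decode`, `greedy_decode`, and the toy models are hypothetical names, and verification is written as a loop where a real engine would use a single batched forward pass of the main model.

```python
from typing import Callable, List

# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain autoregressive decoding with a single model (the baseline)."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    main_model: NextToken,
    drafter_model: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    n_draft: int = 4,
) -> List[int]:
    """Greedy speculative decoding: draft n_draft tokens, then verify with the main model."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1. The drafter proposes n_draft candidate tokens sequentially.
        draft: List[int] = []
        for _ in range(n_draft):
            draft.append(drafter_model(tokens + draft))

        # 2. The main model checks every candidate position.
        #    (Looped here for clarity; conceptually one parallel pass.)
        accepted: List[int] = []
        for i, candidate in enumerate(draft):
            expected = main_model(tokens + draft[:i])
            if expected == candidate:
                accepted.append(candidate)
            else:
                # First mismatch: keep the main model's token, drop the rest of the draft.
                accepted.append(expected)
                break
        else:
            # Every candidate was accepted, so the main model's next token comes for free.
            accepted.append(main_model(tokens + draft))

        tokens.extend(accepted)
    return tokens[:target_len]


def toy_main_model(prefix: List[int]) -> int:
    # Toy rule: count upwards modulo 10.
    return (prefix[-1] + 1) % 10


def toy_drafter_model(prefix: List[int]) -> int:
    # Agrees with the main model except after token 4, where it guesses wrong.
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


if __name__ == "__main__":
    out = speculative_decode(toy_main_model, toy_drafter_model, [0], 8, n_draft=3)
    # With greedy verification the result matches the main model's own greedy output.
    assert out == greedy_decode(toy_main_model, [0], 8)
    print(out)
```

Because rejected candidates are replaced by the main model's own token, the greedy variant sketched here produces the same sequence as decoding with the main model alone; the gain comes only from needing fewer main-model steps.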

Colossal-Inference also supports GLIDE, a modified draft model architecture that reuses key and value caches from the main model, which improves the acceptance rate and increases the speed-up ratio. Details can be found in the research paper "GLIDE with a CAPE - A Low-Hassle Method to Accelerate Speculative Decoding" on arXiv.
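To see why a higher acceptance rate translates into a larger speed-up, a commonly used simplified model helps: if each draft token is assumed to be accepted independently with probability alpha, and the drafter proposes n tokens per step, the expected number of tokens produced per main-model verification step is (1 - alpha^(n+1)) / (1 - alpha). The sketch below only evaluates that formula. The function name is hypothetical, the independence assumption is a simplification, and the drafter's own cost is ignored, so this is not how the benchmark figures in this README were obtained.

```python
def expected_tokens_per_step(acceptance_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per main-model verification step.

    Assumes each draft token is accepted independently with probability
    `acceptance_rate` (a simplification), with `n_draft` tokens drafted per step.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    if acceptance_rate == 1.0:
        # Limit of the geometric sum: all drafts accepted plus one bonus token.
        return float(n_draft + 1)
    return (1.0 - acceptance_rate ** (n_draft + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, expected_tokens_per_step(alpha, n_draft=4))
```

Under this model, raising the acceptance rate increases the tokens gained per verification step, which is the effect a better-aligned drafter such as GLIDE aims for.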

Currently, Colossal-Inference offers a GLIDE model compatible with Vicuna-7B (https://huggingface.co/lmsys/vicuna-7b-v1.5). The fine-tuned GLIDE drafter model cxdu/glide-vicuna7b is available on the HuggingFace Hub: https://huggingface.co/cxdu/glide-vicuna7b.

In benchmarks on the gsm8k and MT-Bench datasets with batch size 1 on an H800, speculative decoding gives a speed-up of around 1.28x, and speculative decoding with the GLIDE model as the drafter model gives a speed-up of around 1.5x.

Usage

For the main model, you can use the model card lmsys/vicuna-7b-v1.5 on the HuggingFace Hub. For a regular drafter model, you can use JackFram/llama-68m. For the GLIDE drafter model, use cxdu/glide-vicuna7b.

Run speculative decoding with:

colossalai run --nproc_per_node 1 llama_generation.py -m PATH_MODEL --drafter_model PATH_DRAFTER_MODEL --max_length 128

To run speculative decoding on multiple GPUs with tensor parallelism, pass --tp_size, as in the following example using 2 GPUs:

colossalai run --nproc_per_node 2 llama_generation.py -m PATH_MODEL --drafter_model PATH_DRAFTER_MODEL --max_length 128 --tp_size 2
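When these runs are scripted, the command variants above can be assembled from one helper. The sketch below is a hypothetical convenience wrapper, not part of ColossalAI: `build_command` is an invented name, it uses only the flags shown in this README, and it assumes that --nproc_per_node should equal --tp_size, as in the examples.

```python
import shlex
from typing import List, Optional


def build_command(
    model_path: str,
    drafter_model_path: Optional[str] = None,
    tp_size: int = 1,
    max_length: int = 128,
) -> List[str]:
    """Build the `colossalai run` argument list for llama_generation.py."""
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    # Assumption: one process per tensor-parallel rank, as in the README examples.
    cmd = ["colossalai", "run", "--nproc_per_node", str(tp_size),
           "llama_generation.py", "-m", model_path]
    if drafter_model_path is not None:
        # Passing a drafter model selects speculative decoding.
        cmd += ["--drafter_model", drafter_model_path]
    cmd += ["--max_length", str(max_length)]
    if tp_size > 1:
        cmd += ["--tp_size", str(tp_size)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to subprocess.run to execute.
    print(shlex.join(build_command("PATH_MODEL", "PATH_DRAFTER_MODEL", tp_size=2)))
```

Returning a list rather than a single string keeps paths with spaces intact when the result is handed to `subprocess.run`.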

To use the GLIDE model (glide-vicuna7b) as the drafter model with vicuna-7B, load it from the GLIDE model path or model card and enable GLIDE when turning on speculative decoding:

from colossalai.inference.modeling.models.glide_llama import GlideLlamaForCausalLM
drafter_model = GlideLlamaForCausalLM.from_pretrained(drafter_model_path_or_name)
...
engine.enable_spec_dec(drafter_model, use_glide_drafter=True)