# Colossal-Inference

## Table of Contents

- [📚 Introduction](#-introduction)
- [🔨 Installation](#-installation)
- [🚀 Quick Start](#-quick-start)
- [💡 Usage](#-usage)

## 📚 Introduction
This example shows you how to set up and quickly try out Colossal-Inference.
## 🔨 Installation

### Install From Source

Prerequisites:

- Python == 3.9
- PyTorch >= 2.1.0
- CUDA == 11.8
- Linux OS
We strongly recommend using Anaconda to create a new environment (Python >= 3.9) to run our examples:

```bash
# Create a new conda environment
conda create -n inference python=3.9 -y
conda activate inference
```
Install the latest PyTorch (with CUDA == 11.8) using conda:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
```
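To confirm the environment matches the prerequisites above, a quick check like the following can help. This snippet is illustrative and not part of the official setup steps:

```python
# Illustrative sanity check (not part of the official setup steps):
# confirm PyTorch was built against CUDA 11.8 and can see a GPU.
import torch

print(torch.__version__)          # expect >= 2.1.0
print(torch.version.cuda)         # expect "11.8"
print(torch.cuda.is_available())  # expect True on a CUDA-capable machine
```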
Install Colossal-AI from source:

```bash
# Clone the Colossal-AI repository to your workspace
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI

# Install Colossal-AI from source
pip install .
```
Install the inference dependencies:

```bash
pip install -r requirements/requirements-infer.txt
```
(Optional) If you want to use SmoothQuant quantization, you also need to install torch-int by following the instructions in its repository.
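Once the steps above are done, a minimal smoke test (illustrative, not from the official instructions) is to import the package and print its version:

```python
# Illustrative smoke test: confirm Colossal-AI imports cleanly after installation.
import colossalai

print(colossalai.__version__)
```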
### Use Colossal-Inference in Docker

#### Pull from DockerHub

You can pull the docker image directly from our DockerHub page. The image is uploaded automatically upon each release.

```bash
docker pull hpcaitech/colossal-inference:latest
```
#### Build On Your Own

Run the following command to build a docker image from the Dockerfile provided:

```bash
cd ColossalAI/inference/docker
# Tag the image so the run command below can find it
docker build -t hpcaitech/colossal-inference:latest .
```
Run the following command to start the docker container in interactive mode:

```bash
docker run -it --gpus all --name Colossal-Inference -v $PWD:/workspace -w /workspace hpcaitech/colossal-inference:latest /bin/bash
```
## 🚀 Quick Start

You can try the inference example using Colossal-LLaMA-2-7B by following the instructions below:

```bash
cd ColossalAI/examples/inference
python example.py -m hpcai-tech/Colossal-LLaMA-2-7b-base -b 4 --max_input_len 128 --max_output_len 64 --dtype fp16
```

Here `-m` specifies the model to load, `-b` the batch size, `--max_input_len` and `--max_output_len` the maximum input and output lengths, and `--dtype` the precision used for inference.
Examples for quantized inference are coming soon!
## 💡 Usage

A general way to use Colossal-Inference is:

```python
# Import required modules
import ...

# Prepare your model
model = ...

# Declare configurations
tp_size = ...
pp_size = ...
...

# Create an inference engine
engine = InferenceEngine(model, [tp_size, pp_size, ...])

# Tokenize the input
inputs = ...

# Perform inference based on the inputs
outputs = engine.generate(inputs)
```
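As a concrete (but hypothetical) rendering of the skeleton above, the sketch below loads a Hugging Face causal LM and wires it into the engine. The `InferenceEngine` import path, its constructor arguments, and the exact shape of `generate`'s inputs and outputs are assumptions that may differ across Colossal-AI releases; consult the library documentation for the authoritative API.

```python
# Hypothetical end-to-end sketch of the skeleton above. The InferenceEngine
# import path and argument names are assumptions; check the Colossal-AI docs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from colossalai.inference import InferenceEngine  # assumed import path

# Prepare your model and tokenizer
model_path = "hpcai-tech/Colossal-LLaMA-2-7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)

# Declare configurations (illustrative values)
tp_size = 1  # tensor parallel degree
pp_size = 1  # pipeline parallel degree

# Create an inference engine (argument names are assumptions)
engine = InferenceEngine(model, tp_size=tp_size, pp_size=pp_size)

# Tokenize the input
inputs = tokenizer("Introduce some landmarks in Beijing", return_tensors="pt")

# Perform inference based on the inputs
outputs = engine.generate(inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```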