Colossal-Inference with TorchServe
Overview
This demo is used for testing and demonstrating the usage of Colossal Inference from colossalai.inference deployed with TorchServe. It imports inference modules from colossalai and is based on commit 3e05c07bb8. For now, single-GPU inference serving is supported.
Environment for testing
Option #1: Use Conda Env
Records of creating a conda env for local testing are as follows. We might want to use Docker or configure the env on a cloud platform later.
NOTE: It requires installing a JDK and setting JAVA_HOME. We recommend installing OpenJDK 17 (please refer to https://openjdk.org/projects/jdk/17/).
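As an example, on Ubuntu the JDK setup could look like the sketch below; the package name and JVM path are assumptions and may differ on your system. With the JDK in place, proceed with the environment setup that follows.
# install OpenJDK 17 and point JAVA_HOME at it (Ubuntu example; the JVM path is an assumption)
sudo apt-get install -y openjdk-17-jdk
export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH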
# use python 3.8 or 3.9
conda create -n infer python=3.9
# use torch 1.13+cuda11.6 for inference
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
# conda cuda toolkit (e.g. nvcc, etc)
conda install -c "nvidia/label/cuda-11.6.2" cuda-toolkit
# install colossalai with PyTorch extensions
cd <path_to_ColossalAI_repo>
pip install -r requirements/requirements.txt
pip install -r requirements/requirements-test.txt
CUDA_EXT=1 pip install -e .
# install torchserve
cd <path_to_torch_serve_repo>
python ./ts_scripts/install_dependencies.py --cuda=cu116
pip install torchserve torch-model-archiver torch-workflow-archiver
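As an optional sanity check (not part of the original instructions), the following commands should run without errors once the environment is set up:
# optional sanity check of the environment
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import colossalai; print(colossalai.__version__)"
torchserve --version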
Option #2: Use Docker
You can build the Docker image for this example using the provided Dockerfile.
# build from dockerfile
cd ColossalAI/examples/inference/serving/torch_serve/docker
docker build -t hpcaitech/colossal-infer-ts:0.2.0 .
Once you have the image ready, you can launch the image with the following command
cd ColossalAI/examples/inference/serving/torch_serve
# run the docker container
docker run --rm \
-it --gpus all \
--name <name_you_assign> \
-v <your-data-dir>:/data/scratch \
-w <ColossalAI_dir> \
hpcaitech/colossal-infer-ts:0.2.0 \
/bin/bash
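For example, a concrete invocation might look like the following; the container name, mounted data directory, and working directory are illustrative placeholders rather than values required by the image.
# an illustrative run command; name, mounted path, and workdir are placeholders
docker run --rm -it --gpus all \
    --name colossal-infer \
    -v /home/user/data:/data/scratch \
    -w /workspace/ColossalAI \
    hpcaitech/colossal-infer-ts:0.2.0 \
    /bin/bash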
Steps to deploy a model
1. Download/prepare a model
We will download a BLOOM model and then zip the downloaded model. You could download the model from Hugging Face manually, or refer to the download_model.py script provided by the pytorch/serve team to help you download a snapshot of the model.
# download snapshots
cd <path_to_torch_serve>/examples/large_models/utils/
huggingface-cli login
python download_model.py --model_name bigscience/bloom-560m -o <path_to_store_downloaded_model>
# zip the model repo
cd <path_to_store_downloaded_model>/models--bigscience--bloom-560m/snapshots/<specific_revision>
zip -r <path_to_place_zipped_model>/model.zip *
NOTE: The torch archiver and the server will use the /tmp/ folder. Depending on the disk quota limit, using torch-model-archiver might cause an OSError "Disk quota exceeded". To prevent this, set the tmp dir environment variables, i.e. export TMPDIR=<dir_with_enough_space>/tmp and export TEMP=<dir_with_enough_space>/tmp, or use relatively small models (as we did) for local testing.
2. Archive the model
With the torch archiver, we pack the model file (.zip) together with the handler file (.py) into a .mar file, and TorchServe unpacks these files during serving. Relevant model configs and inference configs can be set in model-config.yaml.
cd ./ColossalAI/examples/inference/serving/torch_serve
# create a folder under the current directory to store the packed model created by torch archiver
mkdir model_store
torch-model-archiver --model-name bloom --version 0.1 --handler Colossal_Inference_Handler.py --config-file model-config.yaml --extra-files <dir_zipped_model>/model.zip --export-path ./model_store/
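If archiving succeeds, the packed archive should appear in the model store; the check below is optional, and the file name follows the --model-name flag above.
# optional: confirm the archive was created
ls ./model_store/
# expected: bloom.mar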
3. Launch serving
Modify load_models in config.properties to select the model(s) stored in the <model_store> directory to be deployed. By default we use load_models=all to load and deploy all the models (.mar) we have.
torchserve --start --ncs --ts-config config.properties
We could set the inference, management, and metrics addresses and other TorchServe settings in config.properties.
TorchServe will create a logs/ folder under the current directory to store ts, model, and metrics logs.
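For reference, a minimal config.properties might look like the sketch below; the ports are illustrative assumptions (the inference port 8084 matches the curl commands in the next step), so adjust them to your deployment.
# a minimal sketch of config.properties; ports are illustrative assumptions
inference_address=http://0.0.0.0:8084
management_address=http://0.0.0.0:8085
metrics_address=http://0.0.0.0:8086
model_store=./model_store
load_models=all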
4. Run inference
# check inference status
curl http://0.0.0.0:8084/ping
curl -X POST http://localhost:8084/predictions/bloom -T sample_text.txt
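To try your own prompt, you can write it to a file and reuse the same request pattern; the file name and prompt text below are just an illustration.
# send a custom prompt (illustrative file name and text)
echo "Introduce some landmarks in Beijing" > my_prompt.txt
curl -X POST http://localhost:8084/predictions/bloom -T my_prompt.txt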
To stop TorchServe, run torchserve --stop