# Colossal-Inference with TorchServe

## Overview
This demo is used for testing and demonstrating the usage of Colossal Inference from `colossalai.inference` deployed with TorchServe. It imports inference modules from colossalai and is based on commit `3e05c07bb8`. For now, single-GPU inference serving is supported.
## Environment for testing

### Option #1: Use Conda Env
The following records how to create a conda env for local testing. You might also want to use Docker (see Option #2) or configure the environment on a cloud platform later.

NOTE: TorchServe requires a JDK installation and the `JAVA_HOME` environment variable to be set. We recommend installing open-jdk-17 (please refer to https://openjdk.org/projects/jdk/17/).
```bash
# use python 3.8 or 3.9
conda create -n infer python=3.9

# use torch 1.13+cuda11.6 for inference
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116

# conda cuda toolkit (e.g. nvcc, etc.)
conda install -c "nvidia/label/cuda-11.6.2" cuda-toolkit

# install colossalai with PyTorch extensions
cd <path_to_ColossalAI_repo>
pip install -r requirements/requirements.txt
pip install -r requirements/requirements-test.txt
CUDA_EXT=1 pip install -e .

# install torchserve
cd <path_to_torch_serve_repo>
python ./ts_scripts/install_dependencies.py --cuda=cu116
pip install torchserve torch-model-archiver torch-workflow-archiver
```
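After installation, a quick sanity check such as the following can confirm that PyTorch sees the GPU and that colossalai imports correctly. This is a minimal sketch limited to the packages installed above, not part of the example itself:

```python
# sanity_check.py -- minimal sketch to verify the environment set up above
import torch
import colossalai  # built from source with CUDA_EXT=1

print("torch:", torch.__version__)                   # expect 1.13.1+cu116
print("cuda available:", torch.cuda.is_available())  # expect True on a GPU node
print("colossalai:", colossalai.__version__)
```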
### Option #2: Use Docker

You can build the Docker image for this example using the provided Dockerfile.
```bash
# build from dockerfile
cd ColossalAI/examples/inference/serving/torch_serve/docker
docker build -t hpcaitech/colossal-infer-ts:0.2.0 .
```
Once the image is ready, you can launch a container with the following command:
```bash
cd ColossalAI/examples/inference/serving/torch_serve

# run the docker container
docker run --rm \
  -it --gpus all \
  --name <name_you_assign> \
  -v <your-data-dir>:/data/scratch \
  -w <ColossalAI_dir> \
  hpcaitech/colossal-infer-ts:0.2.0 \
  /bin/bash
```
## Steps to deploy a model

### 1. Download/prepare a model

We will download a bloom model and then zip the downloaded model. You can download the model from HuggingFace manually, or refer to the script download_model.py provided by the TorchServe team to download a snapshot of the model.
```bash
# download snapshots
cd <path_to_torch_serve>/examples/large_models/utils/
huggingface-cli login
python download_model.py --model_name bigscience/bloom-560m -o <path_to_store_downloaded_model>

# zip the model repo
cd <path_to_store_downloaded_model>/models--bigscience--bloom-560m/snapshots/<specific_revision>
zip -r <path_to_place_zipped_model>/model.zip *
```
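If you prefer not to use download_model.py, a roughly equivalent sketch using `huggingface_hub` (the package that also provides the `huggingface-cli` used above) is shown below. The paths are placeholders, and you still need to zip the snapshot folder as shown above.

```python
# download_snapshot.py -- sketch of fetching the bloom-560m snapshot with
# huggingface_hub instead of download_model.py; paths are placeholders
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="bigscience/bloom-560m",
    cache_dir="<path_to_store_downloaded_model>",  # same placeholder as above
)
print("snapshot downloaded to:", local_path)  # zip the contents of this folder
```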
NOTE: The torch archiver and the server use the `/tmp/` folder. Depending on the disk quota limit, using torch-model-archiver might cause `OSError: Disk quota exceeded`. To prevent this, set the tmp dir environment variables as follows: `export TMPDIR=<dir_with_enough_space>/tmp` and `export TEMP=<dir_with_enough_space>/tmp`, or use relatively small models (as we did) for local testing.
### 2. Archive the model

With torch-model-archiver, we pack the model file (.zip) and the handler file (.py) together into a .mar file, which TorchServe unpacks during serving. Relevant model configs and inference configs can be set in `model-config.yaml`.
```bash
cd ./ColossalAI/examples/inference/serving/torch_serve

# create a folder under the current directory to store the packed model created by torch archiver
mkdir model_store
torch-model-archiver --model-name bloom --version 0.1 --handler Colossal_Inference_Handler.py --config-file model-config.yaml --extra-files <dir_zipped_model>/model.zip --export-path ./model_store/
```
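For reference, TorchServe parses `model-config.yaml` and exposes it to the handler through the request context, which is how the settings mentioned above reach the inference code. The sketch below is purely illustrative: the key names are hypothetical, and the real logic lives in `Colossal_Inference_Handler.py`.

```python
# handler_config_sketch.py -- illustrative only; key names below are
# hypothetical examples, the actual ones are defined in model-config.yaml
# and consumed by Colossal_Inference_Handler.py
from ts.torch_handler.base_handler import BaseHandler


class SketchHandler(BaseHandler):
    def initialize(self, context):
        # TorchServe exposes the parsed model-config.yaml here (recent versions)
        cfg = getattr(context, "model_yaml_config", None) or {}
        handler_cfg = cfg.get("handler", {})
        model_name = handler_cfg.get("model_name", "bigscience/bloom-560m")
        max_length = handler_cfg.get("max_length", 64)
        print(f"would load {model_name} with max_length={max_length}")
```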
### 3. Launch serving

Modify `load_models` in `config.properties` to select which model(s) in the `model_store` directory should be deployed. By default we use `load_models=all` to load and deploy all the models (.mar) we have.
```bash
torchserve --start --ncs --ts-config config.properties
```
Inference, management, and metrics addresses, as well as other TorchServe settings, can be configured in `config.properties`. TorchServe will create a `logs/` folder under the current directory to store ts, model, and metrics logs.
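Once the server is up, you can optionally verify that the model registered by querying the TorchServe management API. The sketch below assumes `requests` is available and that the management address matches the one set in your `config.properties` (8081 is only TorchServe's default; adjust the port accordingly).

```python
# list_models.py -- sketch: ask the TorchServe management API which models
# are registered; the port must match management_address in config.properties
import requests

MANAGEMENT_URL = "http://localhost:8081"  # adjust to your config.properties

resp = requests.get(f"{MANAGEMENT_URL}/models")
resp.raise_for_status()
print(resp.json())  # should list the "bloom" archive created above
```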
### 4. Run inference
```bash
# check inference status
curl http://0.0.0.0:8084/ping

# send a prediction request
curl -X POST http://localhost:8084/predictions/bloom -T sample_text.txt
```
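The same request can also be sent from Python, e.g. with `requests` (a small sketch; port 8084 matches the curl commands above and the inference address in `config.properties`):

```python
# run_inference.py -- sketch of calling the prediction endpoint from Python
import requests

url = "http://localhost:8084/predictions/bloom"  # inference address/port from config.properties
with open("sample_text.txt", "rb") as f:
    resp = requests.post(url, data=f.read())
resp.raise_for_status()
print(resp.text)  # generated text returned by the handler
```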
To stop TorchServe, run `torchserve --stop`.