Overview
This is an example showing how to run OPT generation. The OPT model is implemented using ColossalAI.
It supports tensor parallelism, batching and caching.
🚀Quick Start
- Run inference with OPT 125M
docker pull hpcaitech/tutorial:opt-inference
docker run -it --rm --gpus all --ipc host -p 7070:7070 hpcaitech/tutorial:opt-inference
- Start the HTTP server inside the Docker container with tensor parallel size 2
python opt_fastapi.py opt-125m --tp 2 --checkpoint /data/opt-125m
How to run
Run OPT-125M:
python opt_fastapi.py opt-125m
It launches an HTTP server on 0.0.0.0:7070 by default; you can customize the host and port. Open localhost:7070/docs in your browser to see the OpenAPI docs.
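To check which routes the running server exposes without opening the browser, you can read the OpenAPI schema directly. The sketch below assumes the server keeps FastAPI's default schema location (`/openapi.json`); `list_endpoints` and `fetch_schema` are illustrative helper names, not part of this example's code.

```python
import json
from typing import List, Tuple
from urllib.request import urlopen


def list_endpoints(schema: dict) -> List[Tuple[str, str]]:
    """Return sorted (METHOD, path) pairs found in an OpenAPI schema dict."""
    http_methods = {"get", "post", "put", "patch", "delete", "head", "options"}
    endpoints = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Path items may also hold non-method keys such as "parameters".
            if method in http_methods:
                endpoints.append((method.upper(), path))
    return sorted(endpoints)


def fetch_schema(base_url: str = "http://localhost:7070") -> dict:
    """Download the schema; assumes the default FastAPI /openapi.json route."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)


# Usage, with the server from this README already running:
#     for method, path in list_endpoints(fetch_schema()):
#         print(method, path)
```

The printed list tells you which path and method to use for generation requests, so nothing about the request format has to be guessed.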
Configure
Configure model
python opt_fastapi.py <model>
Available models: opt-125m, opt-6.7b, opt-30b, opt-175b.
Configure tensor parallelism
python opt_fastapi.py <model> --tp <TensorParallelismWorldSize>
The <TensorParallelismWorldSize> can be an integer in [1, #GPUs]. The default is 1.
Configure checkpoint
python opt_fastapi.py <model> --checkpoint <CheckpointPath>
The <CheckpointPath> can be a file path or a directory path. If it is a directory path, all files under the directory will be loaded.
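The file-or-directory rule can be pictured with a small helper. This is only a sketch of the described behavior, not the loader used by opt_fastapi.py: the name `resolve_checkpoint_files` is hypothetical, and it assumes that only files directly inside the directory are loaded (no recursion into subdirectories).

```python
from pathlib import Path
from typing import List


def resolve_checkpoint_files(checkpoint_path: str) -> List[Path]:
    """Expand a checkpoint path into the list of files that would be loaded."""
    path = Path(checkpoint_path)
    if path.is_file():
        # A single checkpoint file is used as-is.
        return [path]
    if path.is_dir():
        # Assumption: take every regular file directly under the directory.
        return sorted(p for p in path.iterdir() if p.is_file())
    raise FileNotFoundError(f"checkpoint path does not exist: {checkpoint_path}")
```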
Configure queue
python opt_fastapi.py <model> --queue_size <QueueSize>
The <QueueSize> can be an integer in [0, MAXINT]. If it is 0, the request queue is unbounded. If it is a positive integer, incoming requests are dropped when the queue is full, and the response has HTTP status code 406.
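The queue semantics map closely onto Python's standard `queue.Queue`, where `maxsize=0` also means "no limit". The following is a minimal sketch of the described behavior, not the server's actual code; `RequestQueue` and `submit` are hypothetical names.

```python
import queue

HTTP_OK = 200
HTTP_NOT_ACCEPTABLE = 406  # status the README says is returned for dropped requests


class RequestQueue:
    """Bounded request queue; a queue_size of 0 means unbounded."""

    def __init__(self, queue_size: int = 0) -> None:
        if queue_size < 0:
            raise ValueError("queue_size must be >= 0")
        # queue.Queue treats maxsize=0 as an infinite queue.
        self._queue = queue.Queue(maxsize=queue_size)

    def submit(self, request) -> int:
        """Try to enqueue without blocking; return the HTTP status to send."""
        try:
            self._queue.put_nowait(request)
        except queue.Full:
            return HTTP_NOT_ACCEPTABLE
        return HTTP_OK
```

With `RequestQueue(2)`, a third request submitted before any is consumed gets 406, so clients should be prepared to retry.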
Configure batching
python opt_fastapi.py <model> --max_batch_size <MaxBatchSize>
The <MaxBatchSize> can be an integer in [1, MAXINT]. The engine builds batches whose size is less than or equal to this value.
Note that the batch size is not always equal to <MaxBatchSize>, as some consecutive requests may not be batched.
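One way to see why a batch can be smaller than the limit is the sketch below. It assumes, purely for illustration, that consecutive requests can share a batch only when a caller-supplied `batch_key` returns the same value for them (for example, the same generation parameters); the real criterion lives in batch.py and may differ. `make_batches` is a hypothetical name.

```python
from typing import Callable, Hashable, List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(
    requests: Sequence[T],
    max_batch_size: int,
    batch_key: Callable[[T], Hashable],
) -> List[List[T]]:
    """Group consecutive compatible requests into batches of bounded size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be >= 1")
    batches: List[List[T]] = []
    for request in requests:
        current = batches[-1] if batches else None
        if (
            current is not None
            and len(current) < max_batch_size
            and batch_key(current[0]) == batch_key(request)
        ):
            # Same key as the open batch and still room: extend it.
            current.append(request)
        else:
            # Batch full or incompatible request: start a new batch.
            batches.append([request])
    return batches
```

Because only consecutive requests are considered, an incompatible request in the middle of the queue splits what could otherwise have been one batch.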
Configure caching
python opt_fastapi.py <model> --cache_size <CacheSize> --cache_list_size <CacheListSize>
This caches up to <CacheSize> unique requests and, for each unique request, up to <CacheListSize> different results. On a cache hit, one of the cached results is returned at random.
The <CacheSize> can be an integer in [0, MAXINT]. If it is 0, caching is disabled. The <CacheListSize> can be an integer in [1, MAXINT].
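The two parameters describe a cache of lists: one entry per unique request, each holding several distinct results. The sketch below illustrates that shape under stated assumptions and is not the implementation in cache.py: `ResultCache` is a hypothetical name, eviction is assumed to be oldest-first, and a lookup is assumed to hit as soon as one result is stored.

```python
import random
from collections import OrderedDict
from typing import Hashable, List, Optional


class ResultCache:
    """Cache up to cache_size requests, each with up to cache_list_size results."""

    def __init__(self, cache_size: int, cache_list_size: int) -> None:
        if cache_size < 0 or cache_list_size < 1:
            raise ValueError("need cache_size >= 0 and cache_list_size >= 1")
        self.cache_size = cache_size
        self.cache_list_size = cache_list_size
        self._entries: "OrderedDict[Hashable, List[str]]" = OrderedDict()

    def get(self, request_key: Hashable) -> Optional[str]:
        """Return a random cached result, or None on a miss."""
        results = self._entries.get(request_key)
        if not results:
            return None
        return random.choice(results)

    def add(self, request_key: Hashable, result: str) -> None:
        """Store a result; does nothing when caching is disabled (size 0)."""
        if self.cache_size == 0:
            return
        if request_key not in self._entries:
            if len(self._entries) >= self.cache_size:
                # Assumption: evict the oldest request first.
                self._entries.popitem(last=False)
            self._entries[request_key] = []
        results = self._entries[request_key]
        # Keep only distinct results, up to the per-request limit.
        if len(results) < self.cache_list_size and result not in results:
            results.append(result)
```

Returning a random element keeps repeated identical prompts from always producing the same text while still avoiding a model call.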
Other configurations
python opt_fastapi.py -h
How to benchmark
cd benchmark
locust
Then open the web interface link that Locust prints in your console.
Pre-process pre-trained weights
OPT-66B
See script/processing_ckpt_66b.py.