introduce how to deploy 4-bit quantized internlm model (#207)

Lyu Han 2023-08-22 11:31:01 +08:00 committed by GitHub
parent 075648cd70
commit 716131e477
2 changed files with 64 additions and 29 deletions


@@ -122,26 +122,44 @@ streamlit run web_demo.py
We use [LMDeploy](https://github.com/InternLM/LMDeploy) to complete the one-click deployment of InternLM.
1. First, install LMDeploy:
```bash
python3 -m pip install lmdeploy
```
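Optionally, a quick sanity check that the package landed in the active environment (a minimal sketch; the version shown depends on the release you installed):
```bash
# print the installed lmdeploy package metadata, including its version
python3 -m pip show lmdeploy
```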
2. Run the following commands to chat with the `internlm-chat-7b` model interactively in the terminal, or talk to it through the WebUI:
```bash
# convert the weight layout
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-7b/model

# interactive chat in the terminal
python3 -m lmdeploy.turbomind.chat ./workspace

# launch the gradio service
python3 -m lmdeploy.serve.gradio.app ./workspace
```
In the steps above, LMDeploy performs inference in FP16 precision.
3. After exporting the model, you can start a server and chat with the deployed model from a client with the following commands:
```bash
bash workspace/service_docker_up.sh
python3 -m lmdeploy.serve.client {server_ip_address}:33337
```
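If the client cannot connect, a minimal way to inspect the serving side is the Docker CLI; `CONTAINER_ID` below is a placeholder for whatever id `docker ps` reports for the inference server container:
```bash
# confirm the inference server container is running
docker ps
# follow its logs; replace CONTAINER_ID with the id shown by `docker ps`
docker logs -f CONTAINER_ID
```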
Besides FP16 precision, LMDeploy also supports 4-bit weight-quantized inference for `internlm-chat-7b`. It not only cuts the model's GPU memory down to about 6 GB, roughly 40% of the FP16 footprint, but, thanks to heavily optimized kernels, also delivers more than 2.4x the FP16 inference performance on an A100-80G.
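As a rough, back-of-the-envelope check on that figure (illustrative only, not an exact accounting): 7B parameters stored in 4 bits come to roughly 3.3 GiB of weights, with runtime buffers and the KV cache accounting for most of the rest of the ~6 GB quoted above.
```bash
# 7e9 parameters * 4 bits / 8 bits-per-byte, expressed in GiB
python3 -c "print(round(7e9 * 4 / 8 / 1024**3, 2), 'GiB of 4-bit weights')"
```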
The following shows how to deploy the 4-bit `internlm-chat-7b` model. For the inference speed benchmark, please refer to [this page](https://github.com/InternLM/lmdeploy/blob/main/docs/zh_cn/w4a16.md#%E6%8E%A8%E7%90%86%E9%80%9F%E5%BA%A6).
```bash
# download the prequantized internlm-chat-7b model from huggingface
git-lfs install
git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4
# Convert the model's layout and store it in the default path, ./workspace.
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b ./llama2-chat-7b-w4 awq --group-size 128
# inference lmdeploy's turbomind engine
python3 -m lmdeploy.turbomind.chat ./workspace
# serving with gradio
python3 -m lmdeploy.serve.gradio.app ./workspace
```
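If `git-lfs` is not available, an alternative is to fetch the same repository with the `huggingface_hub` Python package (a sketch; it assumes `huggingface_hub` is installed, which may require a separate `pip install huggingface_hub`):
```bash
# download lmdeploy/llama2-chat-7b-w4 into ./llama2-chat-7b-w4 without git-lfs
python3 -c "from huggingface_hub import snapshot_download; snapshot_download('lmdeploy/llama2-chat-7b-w4', local_dir='./llama2-chat-7b-w4')"
```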
LMDeploy is a full-featured toolkit covering compression, deployment, and serving of LLMs. Please refer to the [deployment tutorial](https://github.com/InternLM/LMDeploy) for more details on deploying InternLM.
## Fine-tuning & Training


@@ -123,28 +123,45 @@ The effect is as follows
### Deployment
We use [LMDeploy](https://github.com/InternLM/LMDeploy) to complete the full deployment workflow of InternLM.
1. First, install LMDeploy:
```bash
python3 -m pip install lmdeploy
```
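As a quick sanity check, you can import the package from Python (a minimal sketch; it assumes the package exposes a `__version__` attribute, which recent releases do):
```bash
# verify the package imports cleanly and print its version
python3 -c "import lmdeploy; print(lmdeploy.__version__)"
```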
2. Use the following commands to run `internlm-chat-7b` FP16 inference, serve it, and interact with the AI assistant via the WebUI:
```bash
# convert weight layout
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b/model

# inference with lmdeploy's turbomind engine
python3 -m lmdeploy.turbomind.chat ./workspace

# serving with gradio
python3 -m lmdeploy.serve.gradio.app ./workspace
```
3. After exporting the model, you can start a server and have a conversation with the deployed model using the following commands:
```bash
bash workspace/service_docker_up.sh
python3 -m lmdeploy.serve.client {server_ip_address}:33337
```
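For example, when the service runs on the same machine as the client, the placeholder can simply point at the local host (a sketch; `33337` is the port used above, and `localhost` is an assumption about where the service is listening):
```bash
# connect the terminal client to a locally running service
python3 -m lmdeploy.serve.client localhost:33337
```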
You can also deploy the 4-bit quantized `internlm-chat-7b` model via LMDeploy. It trims the model's GPU memory footprint down to about 6 GB, just 40% of what FP16 inference would take. More importantly, with heavily optimized kernels, inference runs more than 2.4x faster than FP16 on an A100-80G.
Try the following steps to run 4-bit `internlm-chat-7b` on a GeForce RTX 30-series GPU. You can find the inference benchmark [here](https://github.com/InternLM/lmdeploy/blob/main/docs/en/w4a16.md#inference-performance).
```bash
# download the prequantized internlm-chat-7b model from huggingface
git-lfs install
git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4
# Convert the model's layout and store it in the default path, ./workspace.
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b ./llama2-chat-7b-w4 awq --group-size 128
# inference lmdeploy's turbomind engine
python3 -m lmdeploy.turbomind.chat ./workspace
# serving with gradio
python3 -m lmdeploy.serve.gradio.app ./workspace
```
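While the 4-bit model is loaded (for example during the `lmdeploy.turbomind.chat` session above), you can check the memory claim yourself from another terminal; `nvidia-smi` ships with the NVIDIA driver, and the exact figure will vary with context length and runtime buffers:
```bash
# report per-GPU memory usage while the 4-bit model is running
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv
```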
LMDeploy is an efficient toolkit for compressing, deploying, and serving LLMs. Please refer to the [deployment tutorial](https://github.com/InternLM/LMDeploy) for more details on deploying InternLM.
## Fine-tuning & Training