mirror of https://github.com/InternLM/InternLM
introduce how to deploy 4-bit quantized internlm model (#207)
parent 075648cd70 · commit 716131e477
@@ -122,26 +122,44 @@ streamlit run web_demo.py
We use [LMDeploy](https://github.com/InternLM/LMDeploy) for one-click deployment of InternLM.

1. First, install LMDeploy:

```bash
python3 -m pip install lmdeploy
```
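Optionally, you can sanity-check the installation before moving on; this quick check uses only standard pip/python tooling:

```bash
# optional: confirm that lmdeploy installed correctly
python3 -m pip show lmdeploy
```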
2. Run the following commands to chat with the `internlm-chat-7b` model interactively in the terminal, or talk to it through the WebUI (if the model runs on a remote machine, see the SSH-tunnel sketch after step 3):

```bash
# convert the weight format
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-7b/model

# interactive chat in the terminal
python3 -m lmdeploy.turbomind.chat ./workspace

# start the gradio service
python3 -m lmdeploy.serve.gradio.app ./workspace
```
3. After exporting the model, you can start the service directly with the following commands and chat with the AI from a client:

```bash
bash workspace/service_docker_up.sh
python3 -m lmdeploy.serve.client {server_ip_address}:33337
```
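If the model is served on a remote machine, one generic way to reach the gradio WebUI from step 2 (or the service above) is an SSH tunnel. This is a general sketch, not an LMDeploy feature; the port value and `USER@SERVER` are placeholders you must replace with your own settings (use the port the gradio app prints when it starts):

```bash
# placeholders: GRADIO_PORT is the port reported by the gradio app at startup,
# USER@SERVER is your own remote login
GRADIO_PORT=7860   # hypothetical value; use the port shown in the app's log
ssh -N -L "${GRADIO_PORT}:localhost:${GRADIO_PORT}" USER@SERVER
# then open http://localhost:<GRADIO_PORT> in a local browser
```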
In the workflow above, LMDeploy performs its computation in FP16 precision. Besides FP16, LMDeploy also supports 4-bit weight inference for `internlm-chat-7b`. It not only cuts the model's GPU memory down to 6 GB, roughly 40% of FP16, but, more importantly, thanks to extensive kernel-level optimization, reaches more than 2.4x the FP16 inference performance on an A100-80G.

The 4-bit `internlm-chat-7b` model is deployed as follows. For the inference-speed benchmark, please refer to [here](https://github.com/InternLM/lmdeploy/blob/main/docs/zh_cn/w4a16.md#%E6%8E%A8%E7%90%86%E9%80%9F%E5%BA%A6).
```bash
# download the pre-quantized internlm-chat-7b model from huggingface
git-lfs install
git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4

# convert the model's layout and store it in the default path, ./workspace
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b ./llama2-chat-7b-w4 awq --group-size 128

# inference with lmdeploy's turbomind engine
python3 -m lmdeploy.turbomind.chat ./workspace

# serving with gradio
python3 -m lmdeploy.serve.gradio.app ./workspace
```
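The 6 GB figure is consistent with a back-of-the-envelope estimate: 4-bit weights for a 7B model are about 3.5 GB (versus roughly 14 GB in FP16), and the remainder is mostly KV cache and activations. A generic way to watch the real footprint while the 4-bit model is chatting, independent of LMDeploy, is:

```bash
# watch GPU memory usage (refreshes every 5 seconds) while the model serves requests
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5
```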
LMDeploy is a complete toolbox for LLM compression, deployment, and serving, and it supports the full InternLM deployment workflow. Please refer to the [deployment tutorial](https://github.com/InternLM/LMDeploy) for more details on deploying InternLM.
## Fine-tuning & Training
README.md
@@ -123,28 +123,45 @@ The effect is as follows
### Deployment

We use [LMDeploy](https://github.com/InternLM/LMDeploy) to complete the workflow of InternLM deployment.

1. First, install LMDeploy:

```bash
python3 -m pip install lmdeploy
```
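If you want to confirm the installation succeeded before continuing, a minimal check with standard tooling is:

```bash
# optional: confirm the package is installed and importable
python3 -m pip show lmdeploy
python3 -c "import lmdeploy"
```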
2. You can use the following commands to run `internlm-chat-7b` FP16 inference, serve it, and interact with the AI assistant via the WebUI (for a remote server, see the SSH-tunnel sketch after step 3):

```bash
# convert weight layout
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b/model

# inference with lmdeploy's turbomind engine
python3 -m lmdeploy.turbomind.chat ./workspace

# serving with gradio
python3 -m lmdeploy.serve.gradio.app ./workspace
```
3. After exporting the model, you can also start a server and have a conversation with the deployed model from a client:

```bash
bash workspace/service_docker_up.sh
python3 -m lmdeploy.serve.client {server_ip_address}:33337
```
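When the server runs on a remote machine, an SSH tunnel is one common way to reach the gradio WebUI from step 2 (or the service above) on your local browser. This is a generic sketch rather than part of LMDeploy; the port and `USER@SERVER` below are placeholders (use the port the gradio app reports at startup):

```bash
# forward the WebUI port from the remote server to your local machine
GRADIO_PORT=7860   # hypothetical value; replace with the port printed by the gradio app
ssh -N -L "${GRADIO_PORT}:localhost:${GRADIO_PORT}" USER@SERVER
# then browse to http://localhost:<GRADIO_PORT> locally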
You can also deploy the 4-bit quantized `internlm-chat-7b` model via LMDeploy. It greatly trims the model's memory overhead down to 6 GB, just 40% of what FP16 inference would take. More importantly, with highly optimized kernels, inference runs more than 2.4x faster than FP16 on an A100-80G.

Try the following to run 4-bit `internlm-chat-7b` on a GeForce RTX 30 series GPU card. You can find the inference benchmark [here](https://github.com/InternLM/lmdeploy/blob/main/docs/en/w4a16.md#inference-performance).
```bash
# download the pre-quantized internlm-chat-7b model from huggingface
git-lfs install
git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4

# convert the model's layout and store it in the default path, ./workspace
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b ./llama2-chat-7b-w4 awq --group-size 128

# inference with lmdeploy's turbomind engine
python3 -m lmdeploy.turbomind.chat ./workspace

# serving with gradio
python3 -m lmdeploy.serve.gradio.app ./workspace
```
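As a rough plausibility check on the 6 GB figure: 7B parameters take about 14 GB as FP16 weights but only about 3.5 GB as 4-bit weights, with the rest of the footprint coming mostly from the KV cache and activations. To observe the actual usage while `lmdeploy.turbomind.chat` is running, a generic query such as the following works on any NVIDIA GPU:

```bash
# report current GPU memory usage while the chat session is active
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
```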
LMDeploy is an efficient toolkit for compressing, deploying, and serving LLM models, and it supports the complete InternLM deployment workflow. Please refer to the [deployment tutorial](https://github.com/InternLM/LMDeploy) for more details on deploying InternLM.
## Fine-tuning & Training