From 716131e47751f5075549e9c629b6c89dd562b0a8 Mon Sep 17 00:00:00 2001
From: Lyu Han
Date: Tue, 22 Aug 2023 11:31:01 +0800
Subject: [PATCH] introduce how to deploy 4-bit quantized internlm model (#207)

---
 README-zh-Hans.md | 46 ++++++++++++++++++++++++++++++++--------------
 README.md         | 47 ++++++++++++++++++++++++++++++++---------------
 2 files changed, 64 insertions(+), 29 deletions(-)

diff --git a/README-zh-Hans.md b/README-zh-Hans.md
index 75c362b..c4e4115 100644
--- a/README-zh-Hans.md
+++ b/README-zh-Hans.md
@@ -122,26 +122,44 @@ streamlit run web_demo.py
 
 We use [LMDeploy](https://github.com/InternLM/LMDeploy) to complete the one-click deployment of InternLM.
 
-1. First, install LMDeploy:
+```bash
+python3 -m pip install lmdeploy
+```
 
-   ```bash
-   python3 -m pip install lmdeploy
-   ```
+Run the following commands to chat with the `internlm-chat-7b` model interactively in the terminal, or to talk to it through a WebUI.
 
-2. Use the following command for quick deployment:
+```bash
+# convert the weight format
+python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b
 
-   ```bash
-   python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-7b/model
-   ```
+
+# interactive chat in the terminal
+python3 -m lmdeploy.turbomind.chat ./workspace
 
-3. After exporting the model, you can start the service directly with the following commands and chat with the AI from the client
+
+# launch the gradio service
+python3 -m lmdeploy.serve.gradio.app ./workspace
+```
+In the steps above, LMDeploy uses FP16 compute precision.
 
-   ```bash
-   bash workspace/service_docker_up.sh
-   python3 -m lmdeploy.serve.client {server_ip_addresss}:33337
-   ```
+
+Besides FP16, LMDeploy also supports inference with the 4-bit weight `internlm-chat-7b` model. It not only cuts the model's GPU memory footprint down to about 6 GB, roughly 40% of FP16, but, more importantly, with heavily optimized kernels its inference performance on an A100-80G reaches more than 2.4x that of FP16.
+
+The deployment steps for the 4-bit weight `internlm-chat-7b` model are listed below. For the inference speed benchmark, please refer to [this page](https://github.com/InternLM/lmdeploy/blob/main/docs/zh_cn/w4a16.md#%E6%8E%A8%E7%90%86%E9%80%9F%E5%BA%A6).
+
+```bash
+# download the pre-quantized 4-bit weight model from huggingface
+git-lfs install
+git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4
+
+# Convert the model's layout and store it in the default path, ./workspace.
+python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b ./llama2-chat-7b-w4 awq --group-size 128
+
+# run inference with lmdeploy's turbomind engine
+python3 -m lmdeploy.turbomind.chat ./workspace
+
+# serving with gradio
+python3 -m lmdeploy.serve.gradio.app ./workspace
+```
+LMDeploy is a complete toolbox for compressing, deploying, and serving LLMs. Please refer to the [deployment tutorial](https://github.com/InternLM/LMDeploy) for more details on deploying InternLM.
 
-[LMDeploy](https://github.com/InternLM/LMDeploy) supports the complete InternLM deployment workflow. Please refer to the [deployment tutorial](https://github.com/InternLM/LMDeploy) for more details on deploying InternLM.
 
 ## Fine-tuning & Training
 
diff --git a/README.md b/README.md
index 78116f8..92f926f 100644
--- a/README.md
+++ b/README.md
@@ -123,28 +123,45 @@ The effect is as follows
 
 ### Deployment
 
-We use [LMDeploy](https://github.com/InternLM/LMDeploy) to complete the one-click deployment of InternLM.
+We use [LMDeploy](https://github.com/InternLM/LMDeploy) for the complete InternLM deployment workflow.
 
-1. First, install LMDeploy:
+```bash
+python3 -m pip install lmdeploy
+```
 
-   ```bash
-   python3 -m pip install lmdeploy
-   ```
+You can use the following commands to run `internlm-chat-7b` FP16 inference, serve the model, and interact with the AI assistant via a WebUI:
 
-2. Use the following command for quick deployment:
+```bash
+# convert weight layout
+python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b
 
-   ```bash
-   python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b/model
-   ```
+
+# run inference with lmdeploy's turbomind engine
+python3 -m lmdeploy.turbomind.chat ./workspace
 
-3. After exporting the model, you can start a server and have a conversation with the deployed model using the following command:
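+# (optional) while the chat session above or the gradio app below is running, you can
+# check GPU memory use from a second terminal; the 4-bit deployment described later in
+# this README should need roughly 40% of the FP16 figure reported here
+nvidia-smi --query-gpu=memory.used,memory.total --format=csv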
+
+# serving with gradio
+python3 -m lmdeploy.serve.gradio.app ./workspace
+```
 
-   ```bash
-   bash workspace/service_docker_up.sh
-   python3 -m lmdeploy.serve.client {server_ip_addresss}:33337
-   ```
+
+You can also deploy the 4-bit quantized `internlm-chat-7b` model via LMDeploy. It cuts the model's memory overhead down to about 6 GB, only around 40% of what FP16 inference requires. More importantly, with highly optimized kernels, inference runs more than 2.4x faster than FP16 on an A100-80G.
 
-[LMDeploy](https://github.com/InternLM/LMDeploy) provides a complete workflow for deploying InternLM. Please refer to the [deployment tutorial](https://github.com/InternLM/LMDeploy) for more details on deploying InternLM.
+Try the following commands to run the 4-bit `internlm-chat-7b` model on a GeForce RTX 30-series GPU. You can find the inference benchmark [here](https://github.com/InternLM/lmdeploy/blob/main/docs/en/w4a16.md#inference-performance).
+
+```bash
+# download the pre-quantized 4-bit weight model from huggingface
+git-lfs install
+git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4
+
+# Convert the model's layout and store it in the default path, ./workspace.
+python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b ./llama2-chat-7b-w4 awq --group-size 128
+
+# run inference with lmdeploy's turbomind engine
+python3 -m lmdeploy.turbomind.chat ./workspace
+
+# serving with gradio
+python3 -m lmdeploy.serve.gradio.app ./workspace
+```
+
+LMDeploy is an efficient toolkit for compressing, deploying, and serving LLMs. Please refer to the [deployment tutorial](https://github.com/InternLM/LMDeploy) for more details on deploying InternLM.
 
 ## Fine-tuning & Training
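As an optional aside to the 4-bit download step above: if git-lfs is unavailable, the same pre-quantized weights can also be fetched with the `huggingface_hub` Python package instead of `git clone`. This is only a sketch of an alternative download path, not part of the LMDeploy tooling, and it assumes `huggingface_hub` is installed.

```bash
# alternative to git-lfs + git clone (assumes: python3 -m pip install huggingface_hub)
python3 -c "from huggingface_hub import snapshot_download; snapshot_download('lmdeploy/llama2-chat-7b-w4', local_dir='./llama2-chat-7b-w4')"
```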
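The memory figure quoted for the 4-bit model, about 6 GB or roughly 40% of FP16, is consistent with a quick back-of-the-envelope estimate for a 7B-parameter model. The sketch below counts weight storage only; the remaining few GB at runtime come from the KV cache, activations, and workspace buffers, so it is a rough consistency check rather than a measurement.

```bash
# weight storage only, 7e9 parameters: FP16 = 2 bytes/weight, 4-bit = 0.5 bytes/weight
python3 -c "p = 7e9; print(f'fp16 ~{p*2/2**30:.1f} GiB, 4bit ~{p*0.5/2**30:.1f} GiB')"
```

With runtime overhead on top of the ~3.3 GiB of 4-bit weights, the quoted ~6 GB lines up with the roughly 15 GB that FP16 inference implies.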