From 716131e47751f5075549e9c629b6c89dd562b0a8 Mon Sep 17 00:00:00 2001
From: Lyu Han
Date: Tue, 22 Aug 2023 11:31:01 +0800
Subject: [PATCH] introduce how to deploy 4-bit quantized internlm model (#207)

---
 README-zh-Hans.md | 46 ++++++++++++++++++++++++++++++++--------------
 README.md         | 47 ++++++++++++++++++++++++++++++++---------------
 2 files changed, 64 insertions(+), 29 deletions(-)

diff --git a/README-zh-Hans.md b/README-zh-Hans.md
index 75c362b..c4e4115 100644
--- a/README-zh-Hans.md
+++ b/README-zh-Hans.md
@@ -122,26 +122,44 @@ streamlit run web_demo.py
 
 We use [LMDeploy](https://github.com/InternLM/LMDeploy) to complete the one-click deployment of InternLM.
 
-1. First, install LMDeploy:
+```bash
+python3 -m pip install lmdeploy
+```
 
-   ```bash
-   python3 -m pip install lmdeploy
-   ```
+Run the following commands to chat with the `internlm-chat-7b` model interactively in the terminal, or to talk to it through a WebUI.
 
-2. Use the following command for quick deployment:
+```bash
+# convert the weight format
+python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b
 
-   ```bash
-   python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-7b/model
-   ```
+
+# interactive chat in the terminal
+python3 -m lmdeploy.turbomind.chat ./workspace
 
-3. After exporting the model, you can start the service directly with the following commands and chat with the AI from the client
+
+# launch the gradio service
+python3 -m lmdeploy.serve.gradio.app ./workspace
+```
+In the steps above, LMDeploy uses FP16 compute precision.
 
-   ```bash
-   bash workspace/service_docker_up.sh
-   python3 -m lmdeploy.serve.client {server_ip_addresss}:33337
-   ```
+
+Besides FP16, LMDeploy also supports inference with the 4-bit weight `internlm-chat-7b` model. It not only cuts the model's GPU memory footprint down to about 6 GB, roughly 40% of FP16, but, more importantly, with heavily optimized kernels its inference performance on an A100-80G reaches more than 2.4x that of FP16.
+
+The deployment steps for the 4-bit weight `internlm-chat-7b` model are listed below. For the inference speed benchmark, please refer to [this page](https://github.com/InternLM/lmdeploy/blob/main/docs/zh_cn/w4a16.md#%E6%8E%A8%E7%90%86%E9%80%9F%E5%BA%A6).
+
+```bash
+# download the pre-quantized 4-bit weight model from huggingface
+git-lfs install
+git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4
+
+# Convert the model's layout and store it in the default path, ./workspace.
+python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b ./llama2-chat-7b-w4 awq --group-size 128
+
+# run inference with lmdeploy's turbomind engine
+python3 -m lmdeploy.turbomind.chat ./workspace
+
+# serving with gradio
+python3 -m lmdeploy.serve.gradio.app ./workspace
+```
+LMDeploy is a complete toolbox for compressing, deploying, and serving LLMs. Please refer to the [deployment tutorial](https://github.com/InternLM/LMDeploy) for more details on deploying InternLM.
 
-[LMDeploy](https://github.com/InternLM/LMDeploy) supports the complete InternLM deployment workflow. Please refer to the [deployment tutorial](https://github.com/InternLM/LMDeploy) for more details on deploying InternLM.
 
 ## Fine-tuning & Training
 
diff --git a/README.md b/README.md
index 78116f8..92f926f 100644
--- a/README.md
+++ b/README.md
@@ -123,28 +123,45 @@ The effect is as follows
 
 ### Deployment
 
-We use [LMDeploy](https://github.com/InternLM/LMDeploy) to complete the one-click deployment of InternLM.
+We use [LMDeploy](https://github.com/InternLM/LMDeploy) for the complete InternLM deployment workflow.
 
-1. First, install LMDeploy:
+```bash
+python3 -m pip install lmdeploy
+```
 
-   ```bash
-   python3 -m pip install lmdeploy
-   ```
+You can use the following commands to run `internlm-chat-7b` FP16 inference, serve the model, and interact with the AI assistant via a WebUI:
 
-2. Use the following command for quick deployment:
+```bash
+# convert weight layout
+python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b
 
-   ```bash
-   python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b/model
-   ```
+
+# run inference with lmdeploy's turbomind engine
+python3 -m lmdeploy.turbomind.chat ./workspace
 
-3. After exporting the model, you can start a server and have a conversation with the deployed model using the following command:
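+# (optional) while the chat session above or the gradio app below is running, you can
+# check GPU memory use from a second terminal; the 4-bit deployment described later in
+# this README should need roughly 40% of the FP16 figure reported here
+nvidia-smi --query-gpu=memory.used,memory.total --format=csv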
+
+# serving with gradio
+python3 -m lmdeploy.serve.gradio.app ./workspace
+```
 
-   ```bash
-   bash workspace/service_docker_up.sh
-   python3 -m lmdeploy.serve.client {server_ip_addresss}:33337
-   ```
+
+You can also deploy the 4-bit quantized `internlm-chat-7b` model via LMDeploy. It cuts the model's memory overhead down to about 6 GB, only around 40% of what FP16 inference requires. More importantly, with highly optimized kernels, inference runs more than 2.4x faster than FP16 on an A100-80G.
 
-[LMDeploy](https://github.com/InternLM/LMDeploy) provides a complete workflow for deploying InternLM. Please refer to the [deployment tutorial](https://github.com/InternLM/LMDeploy) for more details on deploying InternLM.
+Try the following commands to run the 4-bit `internlm-chat-7b` model on a GeForce RTX 30-series GPU. You can find the inference benchmark [here](https://github.com/InternLM/lmdeploy/blob/main/docs/en/w4a16.md#inference-performance).
+
+```bash
+# download the pre-quantized 4-bit weight model from huggingface
+git-lfs install
+git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4
+
+# Convert the model's layout and store it in the default path, ./workspace.
+python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b ./llama2-chat-7b-w4 awq --group-size 128
+
+# run inference with lmdeploy's turbomind engine
+python3 -m lmdeploy.turbomind.chat ./workspace
+
+# serving with gradio
+python3 -m lmdeploy.serve.gradio.app ./workspace
+```
+
+LMDeploy is an efficient toolkit for compressing, deploying, and serving LLMs. Please refer to the [deployment tutorial](https://github.com/InternLM/LMDeploy) for more details on deploying InternLM.
 
 ## Fine-tuning & Training
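As an optional aside to the 4-bit download step above: if git-lfs is unavailable, the same pre-quantized weights can also be fetched with the `huggingface_hub` Python package instead of `git clone`. This is only a sketch of an alternative download path, not part of the LMDeploy tooling, and it assumes `huggingface_hub` is installed.

```bash
# alternative to git-lfs + git clone (assumes: python3 -m pip install huggingface_hub)
python3 -c "from huggingface_hub import snapshot_download; snapshot_download('lmdeploy/llama2-chat-7b-w4', local_dir='./llama2-chat-7b-w4')"
```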
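The memory figure quoted for the 4-bit model, about 6 GB or roughly 40% of FP16, is consistent with a quick back-of-the-envelope estimate for a 7B-parameter model. The sketch below counts weight storage only; the remaining few GB at runtime come from the KV cache, activations, and workspace buffers, so it is a rough consistency check rather than a measurement.

```bash
# weight storage only, 7e9 parameters: FP16 = 2 bytes/weight, 4-bit = 0.5 bytes/weight
python3 -c "p = 7e9; print(f'fp16 ~{p*2/2**30:.1f} GiB, 4bit ~{p*0.5/2**30:.1f} GiB')"
```

With runtime overhead on top of the ~3.3 GiB of 4-bit weights, the quoted ~6 GB lines up with the roughly 15 GB that FP16 inference implies.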