mirror of https://github.com/InternLM/InternLM

Add guidance about 4bit quantized model deployment (#754)

parent e6bb587ebd
commit 2ebfdb900f

README.md (33 changed lines)
@@ -112,7 +112,7 @@ We have evaluated InternLM2.5 on several important benchmarks using the open-sou

 ### Base Model

 | Benchmark      | InternLM2.5-7B | Llama3-8B | Yi-1.5-9B |
-| -------------- | ------------------ | ---------- | --------- |
+| -------------- | -------------- | --------- | --------- |
 | MMLU (5-shot)  | **71.6**       | 66.4      | 71.6      |
 | CMMLU (5-shot) | **79.1**       | 51.0      | 74.1      |
 | BBH (3-shot)   | 70.1           | 59.7      | 71.1      |
@@ -123,7 +123,7 @@ We have evaluated InternLM2.5 on several important benchmarks using the open-sou

 ### Chat Model

 | Benchmark          | InternLM2.5-7B-Chat | Llama3-8B-Instruct | Gemma2-9B-IT | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen2-7B-Instruct |
-| ------------------ | ----------------------- | ------------------- | ------------ | -------------- | ------------- | ----------------- |
+| ------------------ | ------------------- | ------------------ | ------------ | -------------- | ------------- | ----------------- |
 | MMLU (5-shot)      | **72.8**            | 68.4               | 70.9         | 71.0           | 71.4          | 70.8              |
 | CMMLU (5-shot)     | 78.0                | 53.3               | 60.3         | 74.5           | 74.5          | 80.9              |
 | BBH (3-shot CoT)   | **71.6**            | 54.4               | 68.2\*       | 69.6           | 69.6          | 65.0              |
@@ -144,7 +144,9 @@ We have evaluated InternLM2.5 on several important benchmarks using the open-sou

 ## Usages

-We briefly show the usages with [Transformers](#import-from-transformers), [ModelScope](#import-from-modelscope), and [Web demos](#dialogue).
+InternLM supports a diverse range of well-known upstream and downstream projects, such as LLaMA-Factory, vLLM, llama.cpp, and more. This support enables a broad spectrum of users to utilize the InternLM series models more efficiently and conveniently. Tutorials for selected ecosystem projects are available [here](./ecosystem/README.md) for your convenience.
+
+In the following chapters, we will focus on usage with [Transformers](#import-from-transformers), [ModelScope](#import-from-modelscope), and [Web demos](#dialogue).
 The chat models adopt the [chatml format](./chat/chat_format.md) to support both chat and agent applications.
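For illustration, a chatml-style exchange looks roughly like this (a sketch based on the linked chat format doc; the system prompt wording below is our own placeholder, not mandated by the spec):

```
<|im_start|>system
You are InternLM, a helpful AI assistant.<|im_end|>
<|im_start|>user
Hi, pls intro yourself<|im_end|>
<|im_start|>assistant
```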
 To ensure good results, please make sure the installed transformers library meets the following version requirement before performing inference with [Transformers](#import-from-transformers) or [ModelScope](#import-from-modelscope):
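With a matching transformers version installed (the next hunk pins `transformers>=4.38`), a minimal inference sketch could look like the following; it assumes the `chat` helper that InternLM chat checkpoints ship via `trust_remote_code`, so treat it as an illustration rather than the project's canonical snippet:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/internlm2_5-7b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# torch_dtype="auto" keeps the checkpoint's native precision;
# add device_map or .cuda() as your hardware allows
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", trust_remote_code=True
).eval()

# `chat` is a convenience method provided by the remote modeling code
response, history = model.chat(tokenizer, "Hi, pls intro yourself", history=[])
print(response)
```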
@@ -208,11 +210,13 @@ pip install transformers>=4.38
 streamlit run ./chat/web_demo.py
 ```

-### Deployment
+## Deployment by LMDeploy

 We use [LMDeploy](https://github.com/InternLM/LMDeploy) for fast deployment of InternLM.

-With only 4 lines of codes, you can perform `internlm2_5-7b-chat` inference after `pip install lmdeploy>=0.2.1`.
+### Inference
+
+With only 4 lines of code, you can perform [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) inference after `pip install lmdeploy`.

 ```python
 from lmdeploy import pipeline
@@ -221,6 +225,25 @@ response = pipe(["Hi, pls intro yourself", "Shanghai is"])
 print(response)
 ```

+To reduce the memory footprint, we offer the 4-bit quantized model [internlm2_5-7b-chat-4bit](https://huggingface.co/internlm/internlm2_5-7b-chat-4bit), with which inference can be conducted as follows:
+
+```python
+from lmdeploy import pipeline
+pipe = pipeline("internlm/internlm2_5-7b-chat-4bit")
+response = pipe(["Hi, pls intro yourself", "Shanghai is"])
+print(response)
+```
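If you prefer not to rely on auto-detection of the quantized weights, the backend format can, to our knowledge, be stated explicitly via `TurbomindEngineConfig(model_format="awq")`; a sketch:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# explicitly mark the checkpoint as 4-bit AWQ weights for the TurboMind backend
pipe = pipeline("internlm/internlm2_5-7b-chat-4bit",
                backend_config=TurbomindEngineConfig(model_format="awq"))
print(pipe(["Hi, pls intro yourself"]))
```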
+
+Moreover, you can independently activate the 8-bit/4-bit KV cache feature:
+
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+pipe = pipeline("internlm/internlm2_5-7b-chat-4bit",
+                backend_config=TurbomindEngineConfig(quant_policy=8))
+response = pipe(["Hi, pls intro yourself", "Shanghai is"])
+print(response)
+```
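Here `quant_policy=8` selects the 8-bit KV cache; to the best of our knowledge, `quant_policy=4` selects the 4-bit variant and `0` disables the feature, e.g.:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# 4 -> 4-bit online KV cache quantization (8 -> 8-bit, 0 -> off)
pipe = pipeline("internlm/internlm2_5-7b-chat-4bit",
                backend_config=TurbomindEngineConfig(quant_policy=4))
print(pipe(["Shanghai is"]))
```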

 Please refer to the [guidance](./chat/lmdeploy.md) for more usage examples of model deployment. For additional deployment tutorials, feel free to explore [here](https://github.com/InternLM/LMDeploy).

 ### 1M-long-context Inference
README_zh-CN.md (33 changed lines)

@@ -110,7 +110,7 @@ The InternLM2.5 series models are officially released in this repository, with the following features:

 ### Base Model

 | Benchmark      | InternLM2.5-7B | Llama3-8B | Yi-1.5-9B |
-| -------------- | ------------------ | ---------- | --------- |
+| -------------- | -------------- | --------- | --------- |
 | MMLU (5-shot)  | **71.6**       | 66.4      | 71.6      |
 | CMMLU (5-shot) | **79.1**       | 51.0      | 74.1      |
 | BBH (3-shot)   | 70.1           | 59.7      | 71.1      |
@@ -121,7 +121,7 @@ The InternLM2.5 series models are officially released in this repository, with the following features:

 ### Chat Model

 | Benchmark          | InternLM2.5-7B-Chat | Llama3-8B-Instruct | Gemma2-9B-IT | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen2-7B-Instruct |
-| ------------------ | ----------------------- | ------------------- | ------------ | -------------- | ------------- | ----------------- |
+| ------------------ | ------------------- | ------------------ | ------------ | -------------- | ------------- | ----------------- |
 | MMLU (5-shot)      | **72.8**            | 68.4               | 70.9         | 71.0           | 71.4          | 70.8              |
 | CMMLU (5-shot)     | 78.0                | 53.3               | 60.3         | 74.5           | 74.5          | 80.9              |
 | BBH (3-shot CoT)   | **71.6**            | 54.4               | 68.2\*       | 69.6           | 69.6          | 65.0              |
@@ -142,6 +142,8 @@ The InternLM2.5 series models are officially released in this repository, with the following features:

 ## Usage Examples

+InternLM supports numerous well-known upstream and downstream projects, such as LLaMA-Factory, vLLM, llama.cpp, and more, enabling a wide range of users to work with the InternLM series models more efficiently and conveniently. For ease of use, tutorials for selected ecosystem projects are available [here](./ecosystem/README_zh-CN.md).
+
 Next, we demonstrate inference with [Transformers](#import-from-transformers), [ModelScope](#import-from-modelscope), and the [Web demo](#dialogue).
 The chat models adopt the [chatml format](./chat/chat_format.md) to support both general dialogue and agent applications.
 To ensure good results, before running inference with [Transformers](#import-from-transformers) or [ModelScope](#import-from-modelscope), please make sure the installed transformers library meets the following version requirement:
@@ -205,11 +207,13 @@ pip install transformers>=4.38
 streamlit run ./chat/web_demo.py
 ```

-### High-Performance Deployment Based on InternLM
+## High-Performance Deployment of InternLM

 We use [LMDeploy](https://github.com/InternLM/LMDeploy) for one-click deployment of InternLM.

-After installing LMDeploy with `pip install lmdeploy>=0.2.1`, offline batch inference takes only 4 lines of code:
+### Inference
+
+After installing LMDeploy with `pip install lmdeploy`, offline batch inference takes only 4 lines of code:

 ```python
 from lmdeploy import pipeline
@@ -218,7 +222,26 @@ response = pipe(["Hi, pls intro yourself", "Shanghai is"])
 print(response)
 ```

-Please refer to the [deployment guide](./chat/lmdeploy.md) for more usage examples; more deployment tutorials can be found [here](https://github.com/InternLM/LMDeploy).
+To reduce memory usage, we provide the 4-bit quantized model [internlm2_5-7b-chat-4bit](https://huggingface.co/internlm/internlm2_5-7b-chat-4bit), which can be used for inference as follows:
+
+```python
+from lmdeploy import pipeline
+pipe = pipeline("internlm/internlm2_5-7b-chat-4bit")
+response = pipe(["Hi, pls intro yourself", "Shanghai is"])
+print(response)
+```
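If you would rather quantize a checkpoint yourself than download the prebuilt 4-bit weights, LMDeploy ships an AWQ quantization command; a sketch, assuming the `lmdeploy lite auto_awq` subcommand and `--work-dir` flag as documented in the linked project (the output path is illustrative):

```bash
# quantize the fp16 chat model to 4-bit AWQ weights
lmdeploy lite auto_awq internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-4bit
```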
+
+In addition, online 8-bit or 4-bit KV cache quantization can be enabled on top of this:
+
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+pipe = pipeline("internlm/internlm2_5-7b-chat-4bit",
+                backend_config=TurbomindEngineConfig(quant_policy=8))
+response = pipe(["Hi, pls intro yourself", "Shanghai is"])
+print(response)
+```
+
+More usage examples can be found in the [deployment guide](./chat/lmdeploy.md); detailed deployment tutorials are available [here](https://github.com/InternLM/LMDeploy).

 ### 1M Ultra-Long-Context Inference
@@ -42,7 +42,7 @@ We have evaluated InternLM2.5 on several important benchmarks using the open-sou

 ### Chat Model

 | Benchmark          | InternLM2.5-7B-Chat | Llama3-8B-Instruct | Gemma2-9B-IT | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen2-7B-Instruct |
-| ------------------ | ----------------------- | ------------------- | ------------ | -------------- | ------------- | ----------------- |
+| ------------------ | ------------------- | ------------------ | ------------ | -------------- | ------------- | ----------------- |
 | MMLU (5-shot)      | **72.8**            | 68.4               | 70.9         | 71.0           | 71.4          | 70.8              |
 | CMMLU (5-shot)     | 78.0                | 53.3               | 60.3         | 74.5           | 74.5          | 80.9              |
 | BBH (3-shot CoT)   | **71.6**            | 54.4               | 68.2\*       | 69.6           | 69.6          | 65.0              |