mirror of https://github.com/InternLM/InternLM

Add guidance about 4bit quantized model deployment (#754)

parent e6bb587ebd
commit 2ebfdb900f

README.md (33 changed lines)
@@ -112,7 +112,7 @@ We have evaluated InternLM2.5 on several important benchmarks using the open-sou

 ### Base Model

 | Benchmark      | InternLM2.5-7B | Llama3-8B | Yi-1.5-9B |
-| -------------- | ------------------ | ---------- | --------- |
+| -------------- | -------------- | --------- | --------- |
 | MMLU (5-shot)  | **71.6**       | 66.4      | 71.6      |
 | CMMLU (5-shot) | **79.1**       | 51.0      | 74.1      |
 | BBH (3-shot)   | 70.1           | 59.7      | 71.1      |
@@ -123,7 +123,7 @@ We have evaluated InternLM2.5 on several important benchmarks using the open-sou

 ### Chat Model

 | Benchmark          | InternLM2.5-7B-Chat | Llama3-8B-Instruct | Gemma2-9B-IT | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen2-7B-Instruct |
-| ------------------ | ----------------------- | ------------------- | ------------ | -------------- | ------------- | ----------------- |
+| ------------------ | ------------------- | ------------------ | ------------ | -------------- | ------------- | ----------------- |
 | MMLU (5-shot)      | **72.8**            | 68.4               | 70.9         | 71.0           | 71.4          | 70.8              |
 | CMMLU (5-shot)     | 78.0                | 53.3               | 60.3         | 74.5           | 74.5          | 80.9              |
 | BBH (3-shot CoT)   | **71.6**            | 54.4               | 68.2\*       | 69.6           | 69.6          | 65.0              |
@@ -144,7 +144,9 @@ We have evaluated InternLM2.5 on several important benchmarks using the open-sou

 ## Usages

-We briefly show the usages with [Transformers](#import-from-transformers), [ModelScope](#import-from-modelscope), and [Web demos](#dialogue).
+InternLM supports a diverse range of well-known upstream and downstream projects, such as LLaMA-Factory, vLLM, llama.cpp, and more. This support enables a broad spectrum of users to utilize the InternLM series models more efficiently and conveniently. Tutorials for selected ecosystem projects are available [here](./ecosystem/README.md) for your convenience.
+
+In the following chapters, we will focus on usage with [Transformers](#import-from-transformers), [ModelScope](#import-from-modelscope), and [Web demos](#dialogue).
 The chat models adopt the [chatml format](./chat/chat_format.md) to support both chat and agent applications.
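For illustration, a chatml-style exchange looks roughly like this (a sketch based on the linked chat format doc; the system prompt wording below is our own placeholder, not mandated by the spec):

```
<|im_start|>system
You are InternLM, a helpful AI assistant.<|im_end|>
<|im_start|>user
Hi, pls intro yourself<|im_end|>
<|im_start|>assistant
```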
 To ensure good results, please make sure the installed transformers library meets the following version requirement before performing inference with [Transformers](#import-from-transformers) or [ModelScope](#import-from-modelscope):
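With a matching transformers version installed (the next hunk pins `transformers>=4.38`), a minimal inference sketch could look like the following; it assumes the `chat` helper that InternLM chat checkpoints ship via `trust_remote_code`, so treat it as an illustration rather than the project's canonical snippet:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "internlm/internlm2_5-7b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# torch_dtype="auto" keeps the checkpoint's native precision;
# add device_map or .cuda() as your hardware allows
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", trust_remote_code=True
).eval()

# `chat` is a convenience method provided by the remote modeling code
response, history = model.chat(tokenizer, "Hi, pls intro yourself", history=[])
print(response)
```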
@@ -208,11 +210,13 @@ pip install transformers>=4.38
 streamlit run ./chat/web_demo.py
 ```

-### Deployment
+## Deployment by LMDeploy

 We use [LMDeploy](https://github.com/InternLM/LMDeploy) for fast deployment of InternLM.

-With only 4 lines of codes, you can perform `internlm2_5-7b-chat` inference after `pip install lmdeploy>=0.2.1`.
+### Inference
+
+With only 4 lines of code, you can perform [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) inference after `pip install lmdeploy`.

 ```python
 from lmdeploy import pipeline
@@ -221,6 +225,25 @@ response = pipe(["Hi, pls intro yourself", "Shanghai is"])
 print(response)
 ```

+To reduce the memory footprint, we offer the 4-bit quantized model [internlm2_5-7b-chat-4bit](https://huggingface.co/internlm/internlm2_5-7b-chat-4bit), with which inference can be conducted as follows:
+
+```python
+from lmdeploy import pipeline
+pipe = pipeline("internlm/internlm2_5-7b-chat-4bit")
+response = pipe(["Hi, pls intro yourself", "Shanghai is"])
+print(response)
+```
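If you prefer not to rely on auto-detection of the quantized weights, the backend format can, to our knowledge, be stated explicitly via `TurbomindEngineConfig(model_format="awq")`; a sketch:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# explicitly mark the checkpoint as 4-bit AWQ weights for the TurboMind backend
pipe = pipeline("internlm/internlm2_5-7b-chat-4bit",
                backend_config=TurbomindEngineConfig(model_format="awq"))
print(pipe(["Hi, pls intro yourself"]))
```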
+
+Moreover, you can independently activate the 8-bit/4-bit KV cache feature:
+
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+pipe = pipeline("internlm/internlm2_5-7b-chat-4bit",
+                backend_config=TurbomindEngineConfig(quant_policy=8))
+response = pipe(["Hi, pls intro yourself", "Shanghai is"])
+print(response)
+```
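Here `quant_policy=8` selects the 8-bit KV cache; to the best of our knowledge, `quant_policy=4` selects the 4-bit variant and `0` disables the feature, e.g.:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# 4 -> 4-bit online KV cache quantization (8 -> 8-bit, 0 -> off)
pipe = pipeline("internlm/internlm2_5-7b-chat-4bit",
                backend_config=TurbomindEngineConfig(quant_policy=4))
print(pipe(["Shanghai is"]))
```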

 Please refer to the [guidance](./chat/lmdeploy.md) for more usage examples of model deployment. For additional deployment tutorials, feel free to explore [here](https://github.com/InternLM/LMDeploy).

 ### 1M-long-context Inference
README_zh-CN.md (33 changed lines)

@@ -110,7 +110,7 @@ The InternLM2.5 series models are officially released in this repository, with the following features:

 ### Base Model

 | Benchmark      | InternLM2.5-7B | Llama3-8B | Yi-1.5-9B |
-| -------------- | ------------------ | ---------- | --------- |
+| -------------- | -------------- | --------- | --------- |
 | MMLU (5-shot)  | **71.6**       | 66.4      | 71.6      |
 | CMMLU (5-shot) | **79.1**       | 51.0      | 74.1      |
 | BBH (3-shot)   | 70.1           | 59.7      | 71.1      |
@@ -121,7 +121,7 @@ The InternLM2.5 series models are officially released in this repository, with the following features:

 ### Chat Model

 | Benchmark          | InternLM2.5-7B-Chat | Llama3-8B-Instruct | Gemma2-9B-IT | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen2-7B-Instruct |
-| ------------------ | ----------------------- | ------------------- | ------------ | -------------- | ------------- | ----------------- |
+| ------------------ | ------------------- | ------------------ | ------------ | -------------- | ------------- | ----------------- |
 | MMLU (5-shot)      | **72.8**            | 68.4               | 70.9         | 71.0           | 71.4          | 70.8              |
 | CMMLU (5-shot)     | 78.0                | 53.3               | 60.3         | 74.5           | 74.5          | 80.9              |
 | BBH (3-shot CoT)   | **71.6**            | 54.4               | 68.2\*       | 69.6           | 69.6          | 65.0              |
@@ -142,6 +142,8 @@ The InternLM2.5 series models are officially released in this repository, with the following features:

 ## Usage Examples

+InternLM supports numerous well-known upstream and downstream projects, such as LLaMA-Factory, vLLM, llama.cpp, and more, enabling a wide range of users to work with the InternLM series models more efficiently and conveniently. For ease of use, tutorials for selected ecosystem projects are available [here](./ecosystem/README_zh-CN.md).
+
 Next, we demonstrate inference with [Transformers](#import-from-transformers), [ModelScope](#import-from-modelscope), and the [Web demo](#dialogue).
 The chat models adopt the [chatml format](./chat/chat_format.md) to support both general dialogue and agent applications.
 To ensure good results, before running inference with [Transformers](#import-from-transformers) or [ModelScope](#import-from-modelscope), please make sure the installed transformers library meets the following version requirement:
@@ -205,11 +207,13 @@ pip install transformers>=4.38
 streamlit run ./chat/web_demo.py
 ```

-### High-Performance Deployment Based on InternLM
+## High-Performance Deployment of InternLM

 We use [LMDeploy](https://github.com/InternLM/LMDeploy) for one-click deployment of InternLM.

-After installing LMDeploy with `pip install lmdeploy>=0.2.1`, offline batch inference takes only 4 lines of code:
+### Inference
+
+After installing LMDeploy with `pip install lmdeploy`, offline batch inference takes only 4 lines of code:

 ```python
 from lmdeploy import pipeline
@@ -218,7 +222,26 @@ response = pipe(["Hi, pls intro yourself", "Shanghai is"])
 print(response)
 ```

-Please refer to the [deployment guide](./chat/lmdeploy.md) for more usage examples; more deployment tutorials can be found [here](https://github.com/InternLM/LMDeploy).
+To reduce memory usage, we provide the 4-bit quantized model [internlm2_5-7b-chat-4bit](https://huggingface.co/internlm/internlm2_5-7b-chat-4bit), which can be used for inference as follows:
+
+```python
+from lmdeploy import pipeline
+pipe = pipeline("internlm/internlm2_5-7b-chat-4bit")
+response = pipe(["Hi, pls intro yourself", "Shanghai is"])
+print(response)
+```
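If you would rather quantize a checkpoint yourself than download the prebuilt 4-bit weights, LMDeploy ships an AWQ quantization command; a sketch, assuming the `lmdeploy lite auto_awq` subcommand and `--work-dir` flag as documented in the linked project (the output path is illustrative):

```bash
# quantize the fp16 chat model to 4-bit AWQ weights
lmdeploy lite auto_awq internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-4bit
```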
+
+In addition, online 8-bit or 4-bit KV cache quantization can be enabled on top of this:
+
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+pipe = pipeline("internlm/internlm2_5-7b-chat-4bit",
+                backend_config=TurbomindEngineConfig(quant_policy=8))
+response = pipe(["Hi, pls intro yourself", "Shanghai is"])
+print(response)
+```
+
+More usage examples can be found in the [deployment guide](./chat/lmdeploy.md); detailed deployment tutorials are available [here](https://github.com/InternLM/LMDeploy).

 ### 1M Ultra-Long-Context Inference
@@ -42,7 +42,7 @@ We have evaluated InternLM2.5 on several important benchmarks using the open-sou

 ### Chat Model

 | Benchmark          | InternLM2.5-7B-Chat | Llama3-8B-Instruct | Gemma2-9B-IT | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen2-7B-Instruct |
-| ------------------ | ----------------------- | ------------------- | ------------ | -------------- | ------------- | ----------------- |
+| ------------------ | ------------------- | ------------------ | ------------ | -------------- | ------------- | ----------------- |
 | MMLU (5-shot)      | **72.8**            | 68.4               | 70.9         | 71.0           | 71.4          | 70.8              |
 | CMMLU (5-shot)     | 78.0                | 53.3               | 60.3         | 74.5           | 74.5          | 80.9              |
 | BBH (3-shot CoT)   | **71.6**            | 54.4               | 68.2\*       | 69.6           | 69.6          | 65.0              |