@@ -7,7 +7,7 @@ ChatGLM-6B is an open bilingual language model based on [General Language Model
ChatGLM-6B uses technology similar to ChatGPT, optimized for Chinese QA and dialogue. The model is trained on about 1T tokens of Chinese and English corpus, supplemented by supervised fine-tuning, feedback bootstrap, and reinforcement learning with human feedback. With only about 6.2 billion parameters, the model is able to generate answers that are in line with human preferences.
## Update
**[2023/03/19]** Add streaming output function `stream_chat`, already applied in the web and CLI demos. Fix Chinese punctuation in the output. Add the quantized model [ChatGLM-6B-INT4](https://huggingface.co/THUDM/chatglm-6b-int4).
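For reference, a minimal sketch of how the streaming interface can be called, reusing the `tokenizer` and `model` objects from the Getting Started snippet below; the exact `stream_chat` signature is defined by the model's remote code and may differ between revisions:
```python
# Hedged sketch: stream_chat yields intermediate results instead of waiting for the
# full reply; each yielded `response` is treated here as the text generated so far.
for response, history in model.stream_chat(tokenizer, "你好", history=[]):
    print(response)
```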
## Getting Started
@@ -31,6 +31,7 @@ Generate dialogue with the following code
>>> from transformers import AutoTokenizer, AutoModel
>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
>>> model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
>>> model = model.eval()
>>> response, history = model.chat(tokenizer, "你好", history=[])
>>> print(response)
你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。
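>>> # Illustrative follow-up turn (not from the original README): passing the returned
>>> # `history` back into chat() keeps the conversational context across turns.
>>> response, history = model.chat(tokenizer, "Can you introduce yourself?", history=history)
>>> print(response)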
@@ -98,24 +99,24 @@ After 2 to 3 rounds of dialogue, the GPU memory usage is about 10GB under 8-bit
Model quantization brings some performance decline. After testing, ChatGLM-6B can still generate naturally and smoothly under 4-bit quantization. Quantization schemes such as [GPT-Q](https://arxiv.org/abs/2210.17323) could further reduce the quantization precision, or improve model performance at the same precision; you are welcome to submit corresponding Pull Requests.
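For context, loading the model with quantization enabled typically looks like the following sketch; the `quantize()` helper is provided by the model's remote code, so treat the exact call and argument as assumptions that may vary across revisions:
```python
from transformers import AutoModel

# Load in FP16 first, then quantize the weights on the fly; quantize(8) corresponds to
# the 8-bit setting referenced above, quantize(4) to the 4-bit setting.
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().quantize(4).cuda()
model = model.eval()
```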
**[2023/03/19]** The quantization procedure above first loads the full FP16 model, which costs about 13GB of CPU memory. If your CPU memory is limited, you can directly load the quantized model instead, which costs only about 5.2GB of CPU memory:
```python
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).half().cuda()
```
### CPU Deployment
If your computer is not equipped with a GPU, you can also run inference on the CPU, but the inference speed will be relatively slow and about 32GB of memory is required:
```python
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).float()
```
If you only have about 16GB of memory, you can try loading the model in bfloat16 instead; make sure there is close to 16GB of free memory, and expect very slow inference:
```python
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).bfloat16()
```

**[2023/03/19]** If your CPU memory is limited, you can directly load the quantized model:
```python
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).float()
```
**For Mac users**: if you encounter the error `RuntimeError: Unknown platform: darwin`, please refer to this [Issue](https://github.com/THUDM/ChatGLM-6B/issues/6#issuecomment-1470060041).