@ -17,10 +17,13 @@ ChatGLM-6B uses technology similar to ChatGPT, optimized for Chinese QA and dial
## Getting Started
### Environment Setup
Install the requirements with pip: `pip install -r requirements.txt`. Version `4.26.1` of the `transformers` library is recommended, but in theory any version no lower than `4.23.1` should work.
### Usage
Generate dialogue with the following code:
```python
>>> from transformers import AutoTokenizer, AutoModel
The program runs a web server and outputs the URL. Open the URL in the browser to use the web demo.
#### CLI Demo
@ -70,27 +77,42 @@ Run [cli_demo.py](cli_demo.py) in the repo:
```shell
python cli_demo.py
```
The command runs an interactive program in the shell. Type your instruction in the shell and hit enter to generate the response. Type `clear` to clear the dialogue history and `stop` to terminate the program.
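The `clear` and `stop` commands described above can be illustrated with a minimal sketch of such a loop. This is not the code of `cli_demo.py`: the names `run_dialogue` and `echo_respond` are hypothetical, and `echo_respond` is a stand-in for the model so that the control flow can be run without a GPU.

```python
from typing import Callable, Iterable, List, Tuple

History = List[Tuple[str, str]]  # (user query, model response) pairs


def run_dialogue(lines: Iterable[str],
                 respond: Callable[[str, History], str]) -> List[str]:
    """Process user lines until `stop`; `clear` resets the dialogue history."""
    history: History = []
    outputs: List[str] = []
    for line in lines:
        query = line.strip()
        if query == "stop":
            break  # terminate the program
        if query == "clear":
            history = []  # forget all previous rounds
            continue
        response = respond(query, history)
        history.append((query, response))
        outputs.append(response)
    return outputs


def echo_respond(query: str, history: History) -> str:
    # Hypothetical stand-in for the model: reports the current round number.
    return f"[round {len(history) + 1}] {query}"


print(run_dialogue(["hi", "more", "clear", "again", "stop", "ignored"],
                   echo_respond))
```

In an interactive setting, `lines` could be fed from standard input, and `respond` would wrap the actual model call.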
## Deployment
### Quantization
By default, the model parameters are loaded in FP16 precision, which requires about 13GB of GPU memory. If your GPU memory is limited, you can try loading the model parameters with quantization:
```python
# Adjust to your hardware. Only 4-bit and 8-bit quantization are currently supported.
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().quantize(4).cuda()
```
After 2 to 3 rounds of dialogue, GPU memory usage is about 10GB under 8-bit quantization and only 6GB under 4-bit quantization. As the number of dialogue rounds increases, GPU memory consumption increases as well. Because it uses relative position encoding, ChatGLM-6B theoretically supports an unlimited context length, but performance gradually declines once the total length exceeds 2048 (the training length).
Model quantization causes some loss of performance. In our tests, ChatGLM-6B can still generate natural and fluent text under 4-bit quantization. Quantization schemes such as [GPT-Q](https://arxiv.org/abs/2210.17323) can compress the model to even lower precision, or improve model performance at the same precision. You are welcome to submit corresponding Pull Requests.
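As a rough sanity check on these figures, the memory taken by the weights alone can be estimated from the parameter count and the bits stored per parameter. The sketch below assumes roughly 6.2 billion parameters (an approximation, not a measured value); `weight_memory_gb` is a hypothetical helper, not part of the repository.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 6.2e9  # assumption: approximate parameter count of ChatGLM-6B

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

The measured usage quoted above is higher than these weight-only estimates. A plausible reason is that activations and the growing dialogue history also occupy GPU memory, and that some parts of the model may not be quantized.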
### CPU Deployment
If your computer does not have a GPU, you can also run inference on the CPU:
```python
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).float()
```
The inference speed will be relatively slow on CPU.
The above method requires 32GB of memory. If you only have 16GB of memory, you can try:
```python
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).bfloat16()
```
Make sure that nearly 16GB of memory is free. Inference will be very slow in this mode.
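The choice between the two CPU loading paths can be written as a small helper. This is only a sketch: `pick_cpu_precision` is a hypothetical function, and the 32GB and 16GB thresholds are taken directly from the figures above rather than from measurement.

```python
def pick_cpu_precision(free_memory_gb: float) -> str:
    """Return the name of the conversion method to call on the loaded model."""
    if free_memory_gb >= 32:
        return "float"      # FP32 path, needs about 32GB of memory
    if free_memory_gb >= 16:
        return "bfloat16"   # reduced-precision path, needs nearly 16GB free
    raise MemoryError(
        f"{free_memory_gb}GB is below the 16GB assumed for CPU inference"
    )


print(pick_cpu_precision(64))
print(pick_cpu_precision(16))
```

The returned name corresponds to the `.float()` or `.bfloat16()` call appended to `AutoModel.from_pretrained(...)` in the snippets above.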
## ChatGLM-6B Examples
The following are some Chinese examples generated with `web_demo.py`. Feel free to explore more possibilities with ChatGLM-6B.
@ -165,6 +187,7 @@ If you find our work useful, please consider citing the following papers:
url={https://openreview.net/forum?id=-Aw0rrrPUF}
}
```
```
@inproceedings{du2022glm,
title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},