diff --git a/README.md b/README.md
index 9150dc9..53966e1 100644
--- a/README.md
+++ b/README.md
@@ -90,7 +90,7 @@ model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).ha
 
 进行 2 至 3 轮对话后,8-bit 量化下 GPU 显存占用约为 10GB,4-bit 量化下仅需 6GB 占用。随着对话轮数的增多,对应消耗显存也随之增长,由于采用了相对位置编码,理论上 ChatGLM-6B 支持无限长的 context-length,但总长度超过 2048(训练长度)后性能会逐渐下降。
 
-模型量化会带来一定的性能损失,经过测试,ChatGLM-6B 在 4-bit 量化下仍然能够进行自然流畅的生成,使用 [GPT-Q](https://arxiv.org/abs/2210.17323) 等量化方案可以进一步压缩量化精度/提升相同量化精度下的模型性能,欢迎大家提出对应的 Pull Request。
+模型量化会带来一定的性能损失,经过测试,ChatGLM-6B 在 4-bit 量化下仍然能够进行自然流畅的生成。使用 [GPT-Q](https://arxiv.org/abs/2210.17323) 等量化方案可以进一步压缩量化精度/提升相同量化精度下的模型性能,欢迎大家提出对应的 Pull Request。
 
 ### CPU部署
 如果你没有GPU硬件的话,也可以在CPU上进行推理。使用方法如下
diff --git a/README_en.md b/README_en.md
new file mode 100644
index 0000000..d087486
--- /dev/null
+++ b/README_en.md
@@ -0,0 +1,176 @@
+# ChatGLM-6B
+
+## Introduction
+
+ChatGLM-6B is an open bilingual language model based on the [General Language Model (GLM)](https://github.com/THUDM/GLM) framework, with 6.2 billion parameters. With quantization, users can deploy the model locally on consumer-grade graphics cards (only 6GB of GPU memory is required at the INT4 quantization level).
+
+ChatGLM-6B uses technology similar to ChatGPT, optimized for Chinese QA and dialogue. The model is trained on about 1T tokens of Chinese and English corpus, supplemented by supervised fine-tuning, feedback bootstrap, and reinforcement learning with human feedback. With only about 6.2 billion parameters, the model is able to generate answers that are in line with human preference.
+
+## Hardware Requirements
+
+| **Quantization Level** | **GPU Memory** |
+|------------------------|----------------|
+| FP16 (no quantization) | 13 GB          |
+| INT8                   | 10 GB          |
+| INT4                   | 6 GB           |
+
+## Getting Started
+
+### Environment Setup
+Install the requirements with pip: `pip install -r requirements.txt`. We recommend `transformers` version `4.26.1`, but in theory any version no lower than `4.23.1` is acceptable.
+
+### Usage
+Generate dialogue with the following code:
+```python
+>>> from transformers import AutoTokenizer, AutoModel
+>>> tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
+>>> model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
+>>> response, history = model.chat(tokenizer, "你好", history=[])
+>>> print(response)
+你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。
+>>> response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
+>>> print(response)
+晚上睡不着可能会让你感到焦虑或不舒服,但以下是一些可以帮助你入睡的方法:
+
+1. 制定规律的睡眠时间表:保持规律的睡眠时间表可以帮助你建立健康的睡眠习惯,使你更容易入睡。尽量在每天的相同时间上床,并在同一时间起床。
+2. 创造一个舒适的睡眠环境:确保睡眠环境舒适,安静,黑暗且温度适宜。可以使用舒适的床上用品,并保持房间通风。
+3. 放松身心:在睡前做些放松的活动,例如泡个热水澡,听些轻柔的音乐,阅读一些有趣的书籍等,有助于缓解紧张和焦虑,使你更容易入睡。
+4. 避免饮用含有咖啡因的饮料:咖啡因是一种刺激性物质,会影响你的睡眠质量。尽量避免在睡前饮用含有咖啡因的饮料,例如咖啡,茶和可乐。
+5. 避免在床上做与睡眠无关的事情:在床上做些与睡眠无关的事情,例如看电影,玩游戏或工作等,可能会干扰你的睡眠。
+6. 尝试呼吸技巧:深呼吸是一种放松技巧,可以帮助你缓解紧张和焦虑,使你更容易入睡。试着慢慢吸气,保持几秒钟,然后缓慢呼气。
+
+如果这些方法无法帮助你入睡,你可以考虑咨询医生或睡眠专家,寻求进一步的建议。
+```
+The full model implementation is on [HuggingFace Hub](https://huggingface.co/THUDM/chatglm-6b).
+
+### Demo
+We provide a Web demo based on [Gradio](https://gradio.app) and a command line demo in the repo. First clone our repo with:
+```shell
+git clone https://github.com/THUDM/ChatGLM-6B
+cd ChatGLM-6B
+```
+
+#### Web Demo
+
+![web-demo](resources/web-demo.png)
+
+Install Gradio with `pip install gradio`, and run [web_demo.py](web_demo.py):
+
+```shell
+python web_demo.py
+```
+The program runs a web server and outputs the URL. Open the URL in the browser to use the web demo.
+
+#### CLI Demo
+
+![cli-demo](resources/cli-demo.png)
+
+Run [cli_demo.py](cli_demo.py) in the repo:
+
+```shell
+python cli_demo.py
+```
+The command runs an interactive program in the shell. Type your instruction in the shell and hit enter to generate the response. Type `clear` to clear the dialogue history and `stop` to terminate the program.
+
+
+## Deployment
+### Quantization
+By default, the model parameters are loaded with FP16 precision, which requires about 13GB of GPU memory. If your GPU memory is limited, you can try to load the model parameters with quantization:
+```python
+# Change according to your hardware. Only 4-bit and 8-bit quantization are supported for now.
+model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().quantize(4).cuda()
+```
+After 2 to 3 rounds of dialogue, the GPU memory usage is about 10GB under 8-bit quantization, and only 6GB under 4-bit quantization. As the number of dialogue rounds increases, the GPU memory consumption also increases. Because it uses relative position encoding, ChatGLM-6B theoretically supports an unlimited context length, but performance gradually declines once the total length exceeds 2048 (the training length).
+
+Model quantization causes some loss in performance. In our tests, ChatGLM-6B can still generate natural and fluent text under 4-bit quantization. Quantization schemes such as [GPT-Q](https://arxiv.org/abs/2210.17323) can further reduce the quantization precision or improve model performance at the same quantization precision. You are welcome to submit corresponding Pull Requests.
+
+### CPU Deployment
+If you do not have a GPU, you can also run inference on the CPU:
+```python
+model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).float()
+```
+The inference speed will be relatively slow on CPU.
+
+## ChatGLM-6B Examples
+
+The following are some Chinese examples with `web_demo.py`. You are welcome to explore more possibilities with ChatGLM-6B.
+
+Self Cognition
+
+![](examples/self-introduction.png)
+
+
+
+Outline
+
+![](examples/blog-outline.png)
+
+
+
+Ad
+
+![](examples/ad-writing-2.png)
+
+![](examples/comments-writing.png)
+
+
+
+Email
+
+![](examples/email-writing-1.png)
+
+![](examples/email-writing-2.png)
+
+
+
+Information Extraction
+
+![](examples/information-extraction.png)
+
+
+
+Role Play
+
+![](examples/role-play.png)
+
+
+
+Comparison
+
+![](examples/sport.png)
+
+
+
+Travel Guide
+
+![](examples/tour-guide.png)
+
+
+
+## License
+
+This repository is licensed under the [Apache-2.0 License](LICENSE). The use of ChatGLM-6B model weights is subject to the [Model License](MODEL_LICENSE).
+
+## Citation
+
+If you find our work useful, please consider citing the following papers:
+
+```
+@inproceedings{
+  zeng2023glm-130b,
+  title={{GLM}-130B: An Open Bilingual Pre-trained Model},
+  author={Aohan Zeng and Xiao Liu and Zhengxiao Du and Zihan Wang and Hanyu Lai and Ming Ding and Zhuoyi Yang and Yifan Xu and Wendi Zheng and Xiao Xia and Weng Lam Tam and Zixuan Ma and Yufei Xue and Jidong Zhai and Wenguang Chen and Zhiyuan Liu and Peng Zhang and Yuxiao Dong and Jie Tang},
+  booktitle={The Eleventh International Conference on Learning Representations (ICLR)},
+  year={2023},
+  url={https://openreview.net/forum?id=-Aw0rrrPUF}
+}
+```
+```
+@inproceedings{du2022glm,
+  title={GLM: General Language Model Pretraining with Autoregressive Blank Infilling},
+  author={Du, Zhengxiao and Qian, Yujie and Liu, Xiao and Ding, Ming and Qiu, Jiezhong and Yang, Zhilin and Tang, Jie},
+  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+  pages={320--335},
+  year={2022}
+}
+```