diff --git a/README.md b/README.md
index f226c47..adec904 100644
--- a/README.md
+++ b/README.md
@@ -7,6 +7,8 @@
 👋 加入我们的 Slack 和 WeChat
+*Read this in [English](README_EN.md)*
+
 ## 介绍
 ChatGLM**2**-6B 是开源中英双语对话模型 [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B) 的第二代版本,在保留了初代模型对话流畅、部署门槛较低等众多优秀特性的基础之上,ChatGLM**2**-6B 引入了如下新特性:
diff --git a/README_EN.md b/README_EN.md
new file mode 100644
index 0000000..6101561
--- /dev/null
+++ b/README_EN.md
@@ -0,0 +1,252 @@
+
+🤗 HF Repo • 🐦 Twitter • 📃 [GLM@ACL 22] [GitHub] • 📃 [GLM-130B@ICLR 23] [GitHub]
+
+ 👋 Join our Slack and WeChat +
+
+## Introduction
+
+ChatGLM**2**-6B is the second-generation version of the open-source bilingual (Chinese-English) chat model [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B). It retains the smooth conversation flow and low deployment threshold of the first-generation model, while introducing the following new features:
+
+1. **Stronger Performance**: Based on the development experience of the first-generation ChatGLM model, we have fully upgraded the base model of ChatGLM2-6B. ChatGLM2-6B uses the hybrid objective function of [GLM](https://github.com/THUDM/GLM), and has undergone pre-training with 1.4T bilingual tokens and human preference alignment training. The [evaluation results](README.md#evaluation-results) show that, compared to the first-generation model, ChatGLM2-6B has achieved substantial improvements on datasets such as MMLU (+23%), C-Eval (+33%), GSM8K (+571%), and BBH (+60%), showing strong competitiveness among models of the same size.
+2. **Longer Context**: Based on the [FlashAttention](https://github.com/HazyResearch/flash-attention) technique, we have extended the context length of the base model from 2K in ChatGLM-6B to 32K, and trained with a context length of 8K during dialogue alignment, allowing for more rounds of dialogue. However, the current version of ChatGLM2-6B has limited understanding of single-round ultra-long documents, which we will focus on optimizing in future iterations.
+3. **More Efficient Inference**: Based on the [Multi-Query Attention](http://arxiv.org/abs/1911.02150) technique, ChatGLM2-6B has faster inference and lower GPU memory usage: under the official implementation, inference speed has increased by 42% compared to the first generation; under INT4 quantization, the dialogue length supported by 6 GB of GPU memory has increased from 1K to 8K.
+4. **More Open License**: The weights of ChatGLM2-6B are **fully open** to academic research, and with our official written permission, the weights of ChatGLM2-6B are also **permitted for commercial use**. If you find our open-source model useful for your business, we welcome your donation towards the development of the next-generation model ChatGLM3.
+
+-----
+
+The open-source ChatGLM2-6B is intended to promote the development of LLMs together with the open-source community. We earnestly ask developers and everyone else to abide by the [open-source license](MODEL_LICENSE). Do not use the open-source model, code, or any derivatives from the open-source project for any purposes that may harm nations or societies, or for any services that have not undergone safety assessments and legal approval. **At present, our project team has not developed any applications based on ChatGLM2-6B, including web, Android, Apple iOS, and Windows applications.**
+
+Although the model strives to ensure the compliance and accuracy of data at each stage of training, because of the relatively small scale of the ChatGLM2-6B model and its susceptibility to probabilistic randomness, the accuracy of output content cannot be guaranteed, and the model can easily be misled. **Our project does not assume any risks or responsibilities arising from data security, public opinion risks, or any instances of the model being misled, abused, disseminated, or improperly used due to the open-source model and code.**
+
+## Evaluation
+We selected some typical Chinese and English datasets for evaluation. Below are the evaluation results of the ChatGLM2-6B model on [MMLU](https://github.com/hendrycks/test) (English), [C-Eval](https://cevalbenchmark.com/static/leaderboard.html) (Chinese), [GSM8K](https://github.com/openai/grade-school-math) (Mathematics), [BBH](https://github.com/suzgunmirac/BIG-Bench-Hard) (English).
+
+### MMLU
+
+| Model | Average | STEM | Social Sciences | Humanities | Others |
+| ----- | ----- | ---- | ----- | ----- | ----- |
+| ChatGLM-6B | 40.63 | 33.89 | 44.84 | 39.02 | 45.71 |
+| ChatGLM2-6B (base) | 47.86 | 41.20 | 54.44 | 43.66 | 54.46 |
+| ChatGLM2-6B | 45.46 | 40.06 | 51.61 | 41.23 | 51.24 |
+
+> Chat-aligned version is evaluated under zero-shot CoT (Chain-of-Thought), and Base version is evaluated under few-shot answer-only
+
+### C-Eval
+
+| Model | Average | STEM | Social Sciences | Humanities | Others |
+| ----- | ---- | ---- | ----- | ----- | ----- |
+| ChatGLM-6B | 38.9 | 33.3 | 48.3 | 41.3 | 38.0 |
+| ChatGLM2-6B (base) | 51.7 | 48.6 | 60.5 | 51.3 | 49.8 |
+| ChatGLM2-6B | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
+
+> Chat-aligned version is evaluated under zero-shot CoT (Chain-of-Thought), and Base version is evaluated under few-shot answer-only
+
+### GSM8K
+
+| Model | Accuracy | Accuracy (Chinese)* |
+| ----- | ----- | ----- |
+| ChatGLM-6B | 4.82 | 5.85 |
+| ChatGLM2-6B (base) | 32.37 | 28.95 |
+| ChatGLM2-6B | 28.05 | 20.45 |
+
+> All model versions are evaluated under few-shot CoT, and CoT prompts are from http://arxiv.org/abs/2201.11903
+> \* We translate a 500-query subset of GSM8K and its corresponding CoT prompts using machine translation API and subsequent human proofreading.
+
+
+### BBH
+
+| Model | Accuracy |
+| ----- | ----- |
+| ChatGLM-6B | 18.73 |
+| ChatGLM2-6B (base) | 33.68 |
+| ChatGLM2-6B | 30.00 |
+
+> All model versions are evaluated under few-shot CoT, and CoT prompts are from https://github.com/suzgunmirac/BIG-Bench-Hard/tree/main/cot-prompts
+
+## Inference Efficiency
+ChatGLM2-6B employs [Multi-Query Attention](http://arxiv.org/abs/1911.02150) to improve inference speed. Here is a comparison of the average speed for generating 2000 tokens.
+
+
+| Model | Inference Speed (tokens/s) |
+| ---- | ----- |
+| ChatGLM-6B | 31.49 |
+| ChatGLM2-6B | 44.62 |
+
+> Under our official implementation, Batch size = 1, tested with an A100-SXM-80G and PyTorch 2.0 environment
+
+Multi-Query Attention also reduces the GPU memory usage of the KV Cache during inference. Additionally, ChatGLM2-6B uses Causal Mask for dialogue training, which allows the reuse of the KV Cache from previous rounds in continuous dialogues, further optimizing GPU memory usage. Therefore, when performing INT4 quantization inference with a 6 GB GPU, the first-generation ChatGLM-6B can generate at most 1119 tokens before running out of memory, while ChatGLM2-6B can generate at least 8192 tokens.
+
+| **Quantization** | **Encoding 2048 Tokens** | **Decoding 8192 Tokens** |
+| -------------- | --------------------- | --------------- |
+| FP16 / BF16 | 13.1 GB | 14 GB |
+| INT8 | 9 GB | 9 GB |
+| INT4 | 6 GB | 6 GB |
+
+> ChatGLM2-6B takes advantage of `torch.nn.functional.scaled_dot_product_attention` introduced in PyTorch 2.0 for efficient Attention computation. If the PyTorch version is lower, it will fall back to the naive Attention implementation, which may result in higher GPU memory usage than shown in the table above.
+
+We also tested the impact of quantization on model performance. The results show that the impact of quantization on model performance is within an acceptable range.
+
+| Quantization | Accuracy (MMLU) | Accuracy (C-Eval dev) |
+| ----- | ----- |-----------------------|
+| BF16 | 45.47 | 53.57 |
+| INT4 | 43.13 | 50.30 |
+
+
+## ChatGLM2-6B Examples
+
+Compared to the first-generation model, ChatGLM2-6B has made improvements in multiple dimensions. Below are some comparison examples. More possibilities with ChatGLM2-6B are waiting for you to explore and discover!
+