diff --git a/PROJECT.md b/PROJECT.md
index c529b1e..ce9a91c 100644
--- a/PROJECT.md
+++ b/PROJECT.md
@@ -10,6 +10,8 @@
 * [JittorLLMs](https://github.com/Jittor/JittorLLMs): runs ChatGLM-6B FP16 with as little as 3 GB of GPU memory, or even with no GPU at all; supports deployment on Linux, Windows, and Mac
 * [ChatGLM-Finetuning](https://github.com/liucongg/ChatGLM-Finetuning): fine-tunes ChatGLM-6B on specific downstream tasks, covering Freeze, LoRA, P-tuning, etc., and compares the experimental results
 * [InstructGLM](https://github.com/yanqiangmiffy/InstructGLM): instruction tuning of ChatGLM-6B; aggregates open-source Chinese and English instruction data, fine-tunes on it with LoRA, releases the LoRA weights fine-tuned on Alpaca and Belle data, and fixes the repetition issue in web_demo
+* [ChatGLM-web](https://github.com/NCZkevin/chatglm-web): a ChatGLM demo website built with FastAPI and Vue3 (supports streaming output, adjusting model parameters from the front end, context selection, saving images, knowledge-base Q&A, and more)
+* [glm-bot](https://github.com/initialencounter/glm-bot): connects ChatGLM to Koishi so that ChatGLM can be called from all major chat platforms
 
 The following are some tutorials/documents for this project:
-* [Windows deployment guide](https://github.com/ZhangErling/ChatGLM-6B/blob/main/deployment_windows.md)
\ No newline at end of file
+* [Windows deployment guide](https://github.com/ZhangErling/ChatGLM-6B/blob/main/deployment_windows.md)
diff --git a/README.md b/README.md
index 5c25575..baf7d97 100644
--- a/README.md
+++ b/README.md
@@ -150,11 +150,6 @@ model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).qu
 model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).half().cuda()
 ```
 
-We further provide a model whose embeddings are quantized as well; its parameters take up only 4.3 GB of GPU memory:
-```python
-model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True).half().cuda()
-```
-
 ### CPU Deployment
 If you don't have GPU hardware, you can also run inference on the CPU, though it will be slower. Usage is as follows (requires roughly 32 GB of RAM)
 ```python
diff --git a/README_en.md b/README_en.md
index 1a56c39..632a22a 100644
--- a/README_en.md
+++ b/README_en.md
@@ -140,11 +140,6 @@ Model quantization brings a certain performance decline. After testing, ChatGLM-
 model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).half().cuda()
 ```
 
-**[2023/03/24]** We further provide an embedding-quantized model whose model parameters only cost 4.3GB GPU memory
-```python
-model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True).half().cuda()
-```
-
 ### CPU Deployment
 
 If your computer is not equipped with GPU, you can also conduct inference on CPU, but the inference speed is slow (and taking about 32GB of memory):
diff --git a/ptuning/README.md b/ptuning/README.md
index a86db16..ab91468 100644
--- a/ptuning/README.md
+++ b/ptuning/README.md
@@ -155,11 +155,11 @@
 for k, v in prefix_state_dict.items():
     new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
 model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)
 ```
+Note that you may need to change `pre_seq_len` to the actual value used in your training.
 
 (2) If you need to load an old checkpoint (containing both the ChatGLM-6B and PrefixEncoder parameters), load the whole checkpoint directly:
 
 ```python
-config = AutoConfig.from_pretrained(CHECKPOINT_PATH, trust_remote_code=True, pre_seq_len=128)
 model = AutoModel.from_pretrained(CHECKPOINT_PATH, config=config, trust_remote_code=True)
 ```
diff --git a/ptuning/main.py b/ptuning/main.py
index 2aa5ac3..193a60d 100644
--- a/ptuning/main.py
+++ b/ptuning/main.py
@@ -166,8 +166,8 @@ def main():
                 else:
                     prompt = ""
                     history = examples[history_column][i]
-                    for i, (old_query, response) in enumerate(history):
-                        prompt += "[Round {}]\n问:{}\n答:{}\n".format(i, old_query, response)
+                    for turn_idx, (old_query, response) in enumerate(history):
+                        prompt += "[Round {}]\n问:{}\n答:{}\n".format(turn_idx, old_query, response)
                     prompt += "[Round {}]\n问:{}\n答:".format(len(history), query)
                 inputs.append(prompt)
                 targets.append(examples[response_column][i])
@@ -200,8 +200,8 @@ def main():
                 else:
                     prompt = ""
                     history = examples[history_column][i]
-                    for i, (old_query, response) in enumerate(history):
-                        prompt += "[Round {}]\n问:{}\n答:{}\n".format(i, old_query, response)
+                    for turn_idx, (old_query, response) in enumerate(history):
+                        prompt += "[Round {}]\n问:{}\n答:{}\n".format(turn_idx, old_query, response)
                     prompt += "[Round {}]\n问:{}\n答:".format(len(history), query)
                     prompt = prefix + prompt
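
For reference, the "new checkpoint" loading flow documented in the `ptuning/README.md` hunk above can be sketched end to end as follows. This is a minimal sketch, not the repository's exact script: the model path, checkpoint directory, and `PRE_SEQ_LEN` value are placeholders, and `pre_seq_len` must match the value actually used when the prefix encoder was trained.

```python
import os
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

MODEL_PATH = "THUDM/chatglm-6b"             # base model
CHECKPOINT_PATH = "output/checkpoint-3000"  # hypothetical P-tuning checkpoint directory
PRE_SEQ_LEN = 128                           # must equal the value used during training

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True, pre_seq_len=PRE_SEQ_LEN)
model = AutoModel.from_pretrained(MODEL_PATH, config=config, trust_remote_code=True)

# Load only the PrefixEncoder weights from the checkpoint and strip the
# "transformer.prefix_encoder." prefix from their parameter names.
prefix_state_dict = torch.load(os.path.join(CHECKPOINT_PATH, "pytorch_model.bin"))
new_prefix_state_dict = {}
for k, v in prefix_state_dict.items():
    if k.startswith("transformer.prefix_encoder."):
        new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)

model = model.half().cuda()
model.transformer.prefix_encoder.float()  # keep the prefix encoder in fp32
model = model.eval()
```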