mirror of https://github.com/THUDM/ChatGLM-6B
Merge branch 'main' of https://github.com/THUDM/ChatGLM-6B into main
commit 6fb0380847
@ -10,6 +10,8 @@
* [JittorLLMs](https://github.com/Jittor/JittorLLMs): runs ChatGLM-6B FP16 with as little as 3 GB of GPU memory, or even without a GPU; supports deployment on Linux, Windows, and Mac
* [ChatGLM-Finetuning](https://github.com/liucongg/ChatGLM-Finetuning): fine-tunes ChatGLM-6B on concrete downstream tasks, covering Freeze, LoRA, P-tuning, and more, with experimental comparisons of the results
* [InstructGLM](https://github.com/yanqiangmiffy/InstructGLM): instruction tuning on top of ChatGLM-6B; aggregates open-source Chinese and English instruction data, fine-tunes on it with LoRA, releases the LoRA weights fine-tuned on Alpaca and BELLE, and fixes the repetition issue in web_demo
* [ChatGLM-web](https://github.com/NCZkevin/chatglm-web): a ChatGLM demo website built with FastAPI and Vue3 (supports streaming output, adjusting model parameters from the frontend, context selection, saving images, knowledge-base QA, and more)
* [glm-bot](https://github.com/initialencounter/glm-bot): connects ChatGLM to Koishi so that ChatGLM can be called from major chat platforms

Below are some tutorials and documents for this project:

* [Windows Deployment Guide](https://github.com/ZhangErling/ChatGLM-6B/blob/main/deployment_windows.md)
@ -150,11 +150,6 @@ model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).qu
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).half().cuda()
```

We further provide a model with quantized embeddings; its parameters occupy only 4.3 GB of GPU memory:

```python
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True).half().cuda()
```

### CPU Deployment

If you do not have GPU hardware, you can also run inference on the CPU, though it will be slower. Usage is as follows (about 32 GB of memory is required):
```python
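# A hedged completion of the block truncated in this diff view: the CPU path keeps
# the weights in full precision via .float() instead of .half().cuda().
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).float()
```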
@ -140,11 +140,6 @@ Model quantization brings a certain performance decline. After testing, ChatGLM-
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=True).half().cuda()
```

**[2023/03/24]** We further provide an embedding-quantized model whose parameters take up only 4.3 GB of GPU memory:
```python
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True).half().cuda()
```
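To sanity-check the claimed footprint after loading, here is a minimal sketch using PyTorch's memory counters (a CUDA device is assumed; the exact number will vary with driver and allocator overhead):

```python
import torch
from transformers import AutoModel

# Load the embedding-quantized checkpoint as in the snippet above.
model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4-qe", trust_remote_code=True).half().cuda()

# Report how much GPU memory the weights actually occupy.
allocated_gb = torch.cuda.memory_allocated() / 1024 ** 3
print(f"GPU memory allocated after loading: {allocated_gb:.1f} GB")
```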
### CPU Deployment

If your computer is not equipped with a GPU, you can still run inference on the CPU, though it will be slower (about 32 GB of memory is required):
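The corresponding code is cut off in this diff view; a minimal sketch of the CPU loading call, mirroring the Chinese section above (the standard `THUDM/chatglm-6b` checkpoint id is assumed):

```python
from transformers import AutoModel

# Keep the weights in full precision on the CPU; no .half()/.cuda() calls.
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).float()
```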
@ -155,11 +155,11 @@ for k, v in prefix_state_dict.items():
new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)
```
Note that you may need to change `pre_seq_len` to the actual value used in your training.
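For context, the snippet above shows only the tail of the new-format checkpoint loading flow. A minimal sketch of the whole sequence follows; the imports, the `pytorch_model.bin` file name, and the placeholder `CHECKPOINT_PATH` value are assumptions rather than lines from this diff:

```python
import os
import torch
from transformers import AutoConfig, AutoModel

CHECKPOINT_PATH = "path/to/your/ptuning/checkpoint"  # placeholder

# Load the base model with a prefix encoder sized to the training-time pre_seq_len.
config = AutoConfig.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True, pre_seq_len=128)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", config=config, trust_remote_code=True)

# Keep only the PrefixEncoder weights from the P-tuning checkpoint and load them.
prefix_state_dict = torch.load(os.path.join(CHECKPOINT_PATH, "pytorch_model.bin"))
new_prefix_state_dict = {}
for k, v in prefix_state_dict.items():
    if k.startswith("transformer.prefix_encoder."):
        new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)
```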

(2) If you need to load an old checkpoint (containing both the ChatGLM-6B and PrefixEncoder parameters), load the entire checkpoint directly:

```python
config = AutoConfig.from_pretrained(CHECKPOINT_PATH, trust_remote_code=True, pre_seq_len=128)
model = AutoModel.from_pretrained(CHECKPOINT_PATH, config=config, trust_remote_code=True)
```
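Whichever path is used, the loaded model still has to be moved to the target device and put into eval mode before inference. A brief sketch of that final step (the optional quantize call is an assumption for memory-constrained setups, not part of this diff):

```python
# After loading via path (1) or (2) above:
model = model.half().cuda()  # or model.quantize(4).half().cuda() on small GPUs (assumption)
model = model.eval()
```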
@ -166,8 +166,8 @@ def main():
         else:
             prompt = ""
             history = examples[history_column][i]
-            for i, (old_query, response) in enumerate(history):
-                prompt += "[Round {}]\n问:{}\n答:{}\n".format(i, old_query, response)
+            for turn_idx, (old_query, response) in enumerate(history):
+                prompt += "[Round {}]\n问:{}\n答:{}\n".format(turn_idx, old_query, response)
             prompt += "[Round {}]\n问:{}\n答:".format(len(history), query)
         inputs.append(prompt)
         targets.append(examples[response_column][i])
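The rename matters because the inner loop reused `i`, which is also the outer example index that the later `examples[response_column][i]` lookup depends on. A standalone toy sketch of the shadowing bug (hypothetical data, not the project's actual preprocessing code):

```python
# Hypothetical toy data standing in for the dataset columns.
examples = {
    "prompt": ["q0", "q1", "q2"],
    "response": ["a0", "a1", "a2"],
    "history": [[("h0", "r0"), ("h1", "r1")], [], []],
}

targets = []
for i in range(len(examples["prompt"])):
    history = examples["history"][i]
    # Buggy version: `i` is reused as the turn counter, so after the inner loop it
    # holds the last turn index instead of the example index.
    for i, (old_query, response) in enumerate(history):
        pass
    targets.append(examples["response"][i])

print(targets)  # ['a1', 'a1', 'a2'] instead of the expected ['a0', 'a1', 'a2']
```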
@ -200,8 +200,8 @@ def main():
         else:
             prompt = ""
             history = examples[history_column][i]
-            for i, (old_query, response) in enumerate(history):
-                prompt += "[Round {}]\n问:{}\n答:{}\n".format(i, old_query, response)
+            for turn_idx, (old_query, response) in enumerate(history):
+                prompt += "[Round {}]\n问:{}\n答:{}\n".format(turn_idx, old_query, response)
             prompt += "[Round {}]\n问:{}\n答:".format(len(history), query)

         prompt = prefix + prompt