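diff --git a/PROJECT.md b/PROJECT.md
index 21733b5..f27d7e7 100644
--- a/PROJECT.md
+++ b/PROJECT.md
@@ -20,7 +20,7 @@ Open-source projects that fine-tune ChatGLM-6B:
 * [InstructGLM](https://github.com/yanqiangmiffy/InstructGLM): Instruction learning based on ChatGLM-6B; collects open-source Chinese and English instruction data, fine-tunes on it with LoRA, releases the LoRA weights fine-tuned on Alpaca and Belle, and fixes the repetition problem in web_demo
-* [ChatGLM-Efficient-Tuning](https://github.com/hiyouga/ChatGLM-Efficient-Tuning): Customized fine-tuning of the ChatGLM-6B model; collects more than 10 instruction datasets and 3 fine-tuning methods, implements 4/8-bit quantization and model weight merging, and provides a way to quickly deploy the fine-tuned model.
+* [ChatGLM-Efficient-Tuning](https://github.com/hiyouga/ChatGLM-Efficient-Tuning): Implements supervised fine-tuning and full RLHF training for the ChatGLM-6B model; collects more than 10 instruction datasets and 3 fine-tuning methods, implements 4/8-bit quantization and model weight merging, and provides a way to quickly deploy the fine-tuned model.
 * [ChatGLM-Finetuning](https://github.com/liucongg/ChatGLM-Finetuning): Fine-tunes the ChatGLM-6B model on specific downstream tasks using Freeze, LoRA, P-tuning, and other methods, and compares the experimental results.
 * [ChatGLM-Tuning](https://github.com/mymusise/ChatGLM-Tuning): Fine-tunes ChatGLM-6B with LoRA. Similar projects include [Humanable ChatGLM/GPT Fine-tuning | ChatGLM 微调](https://github.com/hscspring/hcgf)

diff --git a/README.md b/README.md
index d456f2a..d5a2d71 100644
--- a/README.md
+++ b/README.md
@@ -191,14 +191,48 @@ model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4",trust_remote_code=True

 If you encounter the error `Could not find module 'nvcuda.dll'` or `RuntimeError: Unknown platform: darwin` (MacOS), please [load the model locally](README.md#从本地加载模型)

+### CPU Deployment and Acceleration on Mac
+
+Loading the quantized model directly on a Mac causes problems such as `clang: error: unsupported option '-fopenmp'`. This happens because the Mac lacks OpenMP by default; the model can still run in this case, but only on a single core.
+
+Taking the quantized [chatglm-6b-int4](https://huggingface.co/THUDM/chatglm-6b-int4) model as an example, the following configuration enables OpenMP on a Mac:
+
+#### Step 1: Install `libomp`
+
+```bash
+# Step 1: see `https://mac.r-project.org/openmp/`
+## Assumption: gcc (clang) is version 14.x; for other versions, see the table provided by R-Project
+curl -O https://mac.r-project.org/openmp/openmp-14.0.6-darwin20-Release.tar.gz
+sudo tar fvxz openmp-14.0.6-darwin20-Release.tar.gz -C /
+```
+This installs the following files: `/usr/local/lib/libomp.dylib`, `/usr/local/include/ompt.h`, `/usr/local/include/omp.h`, `/usr/local/include/omp-tools.h`.
+
+#### Step 2: Configure the `gcc` compile options
+
+Then, for `chatglm-6b-int4`, modify [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py): change the hard-coded command `gcc -O3 -fPIC -pthread -fopenmp -std=c99` to `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99`. The [corresponding code](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168) is shown below:
+
+```python
+# Step 2: find the line containing `gcc -O3 -fPIC -pthread -fopenmp -std=c99` and change it to
+compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 {} -shared -o {}".format(source_code, kernel_file)
+```
+
+> Additional note: `platform.uname()[0] == 'Darwin'` can be used to detect the OS, so that [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py) stays compatible across platforms.
+
+> Note: If a previous run of the `ChatGLM` project failed, it is best to clear the Huggingface cache; with the default location, that is `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Since this uses the `rm` command, make sure you know exactly what you are deleting.
+
 ### GPU Acceleration on Mac
 For Macs (and MacBooks) with Apple Silicon, the MPS backend can be used to run ChatGLM-6B on the GPU. Follow Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly.

-Currently, only [loading the model locally](README.md#从本地加载模型) is supported on MacOS. Change the model loading in the code to load from a local path, and use the mps backend
+Currently, only [loading the model locally](README.md#从本地加载模型) is supported on MacOS. Change the model loading in the code to load from a local path, and use the mps backend:
 ```python
 model = AutoModel.from_pretrained("your local path", trust_remote_code=True).half().to('mps')
 ```
-You can then use the GPU on a Mac to accelerate model inference.
+You can then use the GPU on a Mac to accelerate model inference. If an error related to `half` occurs (for example, on MacOS 13.3.x), change it to:
+```python
+model = AutoModel.from_pretrained("your local path", trust_remote_code=True).float().to('mps')
+```
+
+> Note: The method above runs without problems on the non-quantized model. For running the quantized model on an MPS device, follow [this issue](https://github.com/THUDM/ChatGLM-6B/issues/462). The cause is mainly the [kernel](https://huggingface.co/THUDM/chatglm-6b/blob/658202d88ac4bb782b99e99ac3adff58b4d0b813/quantization.py#L27): unpacking this `ELF` file shows that it is a CUDA implementation.

 ### Multi-GPU Deployment
 If you have multiple GPUs, but none of them has enough memory to hold the complete model, you can split the model across multiple GPUs. First install accelerate: `pip install accelerate`, then load the model as follows:
diff --git a/README_en.md b/README_en.md
index cd60478..d2b6d68 100644
--- a/README_en.md
+++ b/README_en.md
@@ -188,6 +188,36 @@ model =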
AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=Tru

 If your encounter the error `Could not find module 'nvcuda.dll'` or `RuntimeError: Unknown platform: darwin`(MacOS), please [load the model locally](README_en.md#load-the-model-locally).

+
+### CPU Deployment on Mac
+
+The default Mac environment does not support OpenMP. When running `AutoModel.from_pretrained(...)`, you may see a warning or error such as `clang: error: unsupported option '-fopenmp'`.
+
+Taking the quantized int4 model [chatglm-6b-int4](https://huggingface.co/THUDM/chatglm-6b-int4) as an example, two extra steps are needed.
+
+#### STEP 1: Install `libomp`
+
+```bash
+# STEP 1: install libomp; see `https://mac.r-project.org/openmp/` for details.
+# Assumption: `gcc(clang) >= 14.x`; check the R-Project page before running these commands:
+curl -O https://mac.r-project.org/openmp/openmp-14.0.6-darwin20-Release.tar.gz
+sudo tar fvxz openmp-14.0.6-darwin20-Release.tar.gz -C /
+```
+This installs four files: `/usr/local/lib/libomp.dylib`, `/usr/local/include/ompt.h`, `/usr/local/include/omp.h`, and `/usr/local/include/omp-tools.h`.
+
+#### STEP 2: Configure `gcc` with `-fopenmp`
+
+Next, modify the [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py) file of the `chatglm-6b-int4` project. In that file, change the hard-coded `gcc -O3 -fPIC -pthread -fopenmp -std=c99` command to `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99` (see the corresponding Python code [here](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168)), i.e.:
+
+```python
+# STEP 2: Change the line that contains `gcc -O3 -fPIC -pthread -fopenmp -std=c99` to:
+compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 {} -shared -o {}".format(source_code, kernel_file)
+```
+
+> Notice: `platform.uname()[0] == 'Darwin'` can be used to detect the OS type, so that the Python script stays compatible across platforms.
+
+> Notice: If a previous run of the `ChatGLM` project failed, you may want to clear the Huggingface cache before your next try, i.e. `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Since `rm` is used, please MAKE SURE that the command deletes the right files.
+
 ### GPU Inference on Mac
 For Macs (and MacBooks) with Apple Silicon, it is possible to use the MPS backend to run ChatGLM-6B on the GPU. First, you need to refer to Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly.

@@ -195,8 +225,17 @@ Currently you must [load the model locally](README_en.md#load-the-model-locally)
 ```python
 model = AutoModel.from_pretrained("your local path", trust_remote_code=True).half().to('mps')
 ```
+
+On Mac OS >= 13.3, you may encounter errors related to the `half()` method. If so, use the `float()` method instead:
+
+```python
+model = AutoModel.from_pretrained("your local path", trust_remote_code=True).float().to('mps')
+```
+
 Then you can use GPU-accelerated model inference on Mac.

+> Notice: The non-quantized version of ChatGLM runs with MPS without problems. Running the quantized version with MPS as the backend currently fails; see [this issue](https://github.com/THUDM/ChatGLM-6B/issues/462). Unpacking the [kernel](https://huggingface.co/THUDM/chatglm-6b/blob/658202d88ac4bb782b99e99ac3adff58b4d0b813/quantization.py#L27) as an `ELF` file shows that its backend is `cuda`, which currently fails on MPS (`torch 2.1.0.dev20230502`).
+
 ### Multi-GPU Deployment
 If you have multiple GPUs, but the memory size of each GPU is not sufficient to accommodate the entire model, you can split the model across multiple GPUs.
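
A note on the `platform.uname()[0] == 'Darwin'` suggestion in the CPU deployment section: the OS check could be folded into the compile step roughly as follows. This is a minimal sketch under stated assumptions, not the actual code in quantization.py. The helper name `build_compile_command` and its `system` parameter are hypothetical; only the two flag strings and the `platform.uname()[0]` check come from the text of the diff.

```python
import platform
from typing import Optional


def build_compile_command(source_code: str, kernel_file: str,
                          system: Optional[str] = None) -> str:
    """Build the gcc command line used to compile the CPU kernel.

    Hypothetical helper for illustration only. `system` exists to make the
    platform branch easy to exercise; it defaults to the current OS name.
    """
    if system is None:
        system = platform.uname()[0]  # e.g. 'Darwin', 'Linux', 'Windows'

    if system == "Darwin":
        # Flags suggested in the README change for a Mac with libomp installed.
        flags = "-O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99"
    else:
        # Flags from the original hard-coded command.
        flags = "-O3 -fPIC -pthread -fopenmp -std=c99"

    return "gcc {} {} -shared -o {}".format(flags, source_code, kernel_file)


if __name__ == "__main__":
    # File names here are placeholders, not paths used by the project.
    print(build_compile_command("kernel.c", "kernel.so"))
```

Passing `system="Darwin"` explicitly lets the Mac branch be checked on any machine, without editing the script per platform.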