From b13f1a63f3282a5e14242d9ed46d9ddbd4c487be Mon Sep 17 00:00:00 2001
From: Yifan
Date: Wed, 3 May 2023 14:17:37 +0800
Subject: [PATCH] [Document] Update Mac deployment
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

[Document] Update Mac deployment
- FILE: README.md; README_en.md
- ADD: OPENMP; MPS

# Details
Using the quantized model [chatglm-6b-int4](https://huggingface.co/THUDM/chatglm-6b-int4) as an example, the following is covered:
- steps to install libomp;
- gcc compile options for the quantized model;
- notes on enabling MPS for the quantized model.
---
 README.md    |  4 ++--
 README_en.md | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 27e41d3..25676af 100644
--- a/README.md
+++ b/README.md
@@ -240,7 +240,7 @@ else:
         source_code, kernel_file)
 ```
 
-> !注意,如果你之前运行过失败过,最好清一下Huggingface对缓存,i.e. `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`。由于使用了`rm`命令,请明确知道自己在删除什么。
+> 注意:如果你之前运行失败过,最好清一下Huggingface的缓存,i.e. `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`。由于使用了`rm`命令,请明确知道自己在删除什么。
 
 ### Mac 上的 GPU 加速
 对于搭载了Apple Silicon的Mac(以及MacBook),可以使用 MPS 后端来在 GPU 上运行 ChatGLM-6B。需要参考 Apple 的 [官方说明](https://developer.apple.com/metal/pytorch) 安装 PyTorch-Nightly。
@@ -254,7 +254,7 @@ model = AutoModel.from_pretrained("your local path", trust_remote_code=True).hal
 model = AutoModel.from_pretrained("your local path", trust_remote_code=True).float().to('mps')
 ```
 
-> 注意上述方法在非量化版中,运行没有问题。量化版模型在MPS设备运行可以关注[这个](https://github.com/THUDM/ChatGLM-6B/issues/462)ISSUE,这主要是[kernel](https://huggingface.co/THUDM/chatglm-6b/blob/658202d88ac4bb782b99e99ac3adff58b4d0b813/quantization.py#L27)的原因,可以解包这个`ELF`文件看到是CUDA的实现。
+> 注意:上述方法在非量化版模型中运行没有问题。量化版模型在MPS设备上的运行问题可以关注[这个](https://github.com/THUDM/ChatGLM-6B/issues/462)ISSUE,这主要是[kernel](https://huggingface.co/THUDM/chatglm-6b/blob/658202d88ac4bb782b99e99ac3adff58b4d0b813/quantization.py#L27)的原因,解包这个`ELF`文件可以看到是CUDA的实现。
 
 ### 多卡部署
 如果你有多张 GPU,但是每张 GPU 的显存大小都不足以容纳完整的模型,那么可以将模型切分在多张GPU上。首先安装 accelerate: `pip install accelerate`,然后通过如下方法加载模型:

diff --git a/README_en.md b/README_en.md
index cd60478..2f256dd 100644
--- a/README_en.md
+++ b/README_en.md
@@ -188,6 +188,58 @@ model = AutoModel.from_pretrained("THUDM/chatglm-6b-int4", trust_remote_code=Tru
 If your encounter the error `Could not find module 'nvcuda.dll'` or `RuntimeError: Unknown platform: darwin`(MacOS), please [load the model locally](README_en.md#load-the-model-locally).
 
+
+### CPU Deployment on Mac
+
+The default Mac environment does not support OpenMP. You may encounter warnings/errors like the following when executing `AutoModel.from_pretrained(...)`:
+
+```sh
+clang: error: unsupported option '-fopenmp'
+clang: error: unsupported option '-fopenmp'
+```
+
+Taking the quantized int4 model [chatglm-6b-int4](https://huggingface.co/THUDM/chatglm-6b-int4) as an example, the following extra steps are needed:
+
+1. Install `libomp`;
+2. Configure the `gcc` compile flags.
+
+```bash
+# STEP 1: install libomp; check `https://mac.r-project.org/openmp/` for details
+## Assumption: `gcc -v` reports version >= 14.x; read the R-Project page before running the code:
+curl -O https://mac.r-project.org/openmp/openmp-14.0.6-darwin20-Release.tar.gz
+sudo tar fvxz openmp-14.0.6-darwin20-Release.tar.gz -C /
+## Four files are installed:
+# usr/local/lib/libomp.dylib
+# usr/local/include/ompt.h
+# usr/local/include/omp.h
+# usr/local/include/omp-tools.h
+```
+
+For `chatglm-6b-int4`, modify the [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py) file: change the `gcc -O3 -fPIC -pthread -fopenmp -std=c99` configuration to `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99` in the [corresponding Python code](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168), i.e.:
+
+```python
+# STEP 2: configure gcc
+## Change the line containing `gcc -O3 -fPIC -pthread -fopenmp -std=c99` to:
+compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 {} -shared -o {}".format(source_code, kernel_file)
+```
+
+For production code, you could use the `platform` library to make this neater:
+
+```python
+## add `import platform` just after `import os`
+import platform
+## ...
+## then change the corresponding lines to:
+if platform.uname()[0] == 'Darwin':
+    compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 {} -shared -o {}".format(
+        source_code, kernel_file)
+else:
+    compile_command = "gcc -O3 -fPIC -pthread -fopenmp -std=c99 {} -shared -o {}".format(
+        source_code, kernel_file)
+```
+
+> Notice: if a previous run failed, you may want to clean the Huggingface cache before your next try, i.e. `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Since `rm` is used, please MAKE SURE that the command deletes the right files.
+
 ### GPU Inference on Mac
 For Macs (and MacBooks) with Apple Silicon, it is possible to use the MPS backend to run ChatGLM-6B on the GPU. First, you need to refer to Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly.
 
@@ -195,8 +247,17 @@ Currently you must [load the model locally](README_en.md#load-the-model-locally)
 ```python
 model = AutoModel.from_pretrained("your local path", trust_remote_code=True).half().to('mps')
 ```
+
+For Mac users on macOS >= 13.3, the `half()` method may raise errors; use `float()` instead:
+
+```python
+model = AutoModel.from_pretrained("your local path", trust_remote_code=True).float().to('mps')
+```
+
 Then you can use GPU-accelerated model inference on Mac.
 
+> Notice: the non-quantized version of ChatGLM runs without problems on MPS. Running the quantized version with MPS as the backend currently raises errors; see [this issue](https://github.com/THUDM/ChatGLM-6B/issues/462). This is mainly due to the [kernel](https://huggingface.co/THUDM/chatglm-6b/blob/658202d88ac4bb782b99e99ac3adff58b4d0b813/quantization.py#L27): unpacking it as an `ELF` file shows it is a CUDA implementation.
+
 ### Multi-GPU Deployment
 If you have multiple GPUs, but the memory size of each GPU is not sufficient to accommodate the entire model, you can split the model across multiple GPUs.
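
After installing `libomp`, it can be worth confirming that Apple clang actually compiles and links OpenMP code with the `-Xclang -fopenmp -pthread -lomp` flags used in the patch above, before rebuilding the quantization kernel. This is a minimal sketch, assuming the four files landed under `/usr/local` as listed; the path `/tmp/omp_check.c` is only a placeholder:

```bash
# Illustrative sanity check (not part of the patch): build and run a trivial OpenMP
# program with the same flags the patched compile_command uses.
cat > /tmp/omp_check.c <<'EOF'
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    return 0;
}
EOF
gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp /tmp/omp_check.c -o /tmp/omp_check
/tmp/omp_check   # should print one line per OpenMP thread if libomp is found
```

If this fails to compile or link, clang is likely not picking up the `libomp` headers and library from `/usr/local`, and the quantization kernel build will fail in the same way.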