Browse Source

[Document] 更新Mac部署

[Document] 更新Mac部署
- FILE: README.md; README_en.md
- ADD: OPENMP; MPS

# 具体内容

以[chatglm-6b-int4](https://huggingface.co/THUDM/chatglm-6b-int4)量化模型为例,做如下配置:

- 安装libomp的步骤;
- 对量化后模型等配置gcc编译项;
- 量化后模型启用MPS的解释。
pull/899/head
Yifan 2 years ago
parent
commit
2629a55222
  1. 15
      README.md
  2. 24
      README_en.md

15
README.md

@ -207,22 +207,17 @@ clang: error: unsupported option '-fopenmp'
```bash
# 第一步: 参考`https://mac.r-project.org/openmp/`
## 假设gcc -v是14.x版本,其他版本见R-Project提供的表格
## 假设: gcc(clang)是14.x版本,其他版本见R-Project提供的表格
curl -O https://mac.r-project.org/openmp/openmp-14.0.6-darwin20-Release.tar.gz
sudo tar fvxz openmp-14.0.6-darwin20-Release.tar.gz -C /
## 此时会安装下面几个文件:
# usr/local/lib/libomp.dylib
# usr/local/include/ompt.h
# usr/local/include/omp.h
# usr/local/include/omp-tools.h
```
针对`chatglm-6b-int4`, 修改[quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py),主要是把硬编码的`gcc -O3 -fPIC -pthread -fopenmp -std=c99`命令修改成`gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99`,[对应代码](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168)见下:
此时会安装下面几个文件:`/usr/local/lib/libomp.dylib`, `/usr/local/include/ompt.h`, `/usr/local/include/omp.h`, `/usr/local/include/omp-tools.h`
然后针对`chatglm-6b-int4`, 修改[quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py),主要是把硬编码的`gcc -O3 -fPIC -pthread -fopenmp -std=c99`命令修改成`gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99`,[对应代码](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168)见下:
```python
# 第二步
## 找到包含`gcc -O3 -fPIC -pthread -fopenmp -std=c99`的这一行
## 修改成
# 第二步: 找到包含`gcc -O3 -fPIC -pthread -fopenmp -std=c99`的这一行,并修改成
compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 {} -shared -o {}".format(source_code, kernel_file)
```

24
README_en.md

@ -200,26 +200,22 @@ clang: error: unsupported option '-fopenmp'
Take the quantified int4 version [chatglm-6b-int4](https://huggingface.co/THUDM/chatglm-6b-int4) for example, the following extra steps are needed:
1. Install `libomp`;
2. Configure `gcc`.
#### Install `libomp`
```bash
# STEP 1: install libopenmp, check `https://mac.r-project.org/openmp/` for details
## Assumption: `gcc -v >= 14.x`, read the R-Poject before run the code:
## Assumption: `gcc(clang) >= 14.x`, read the R-Poject before run the code:
curl -O https://mac.r-project.org/openmp/openmp-14.0.6-darwin20-Release.tar.gz
sudo tar fvxz openmp-14.0.6-darwin20-Release.tar.gz -C /
## Four files are installed:
# usr/local/lib/libomp.dylib
# usr/local/include/ompt.h
# usr/local/include/omp.h
# usr/local/include/omp-tools.h
```
Four files (`/usr/local/lib/libomp.dylib`, `/usr/local/include/ompt.h`, `/usr/local/include/omp.h`, `/usr/local/include/omp-tools.h`) are installed accordingly.
For `chatglm-6b-int4`, modify the [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py)file. In the file, change the `gcc -O3 -fPIC -pthread -fopenmp -std=c99` configuration to `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99`,[corresponding python code](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168), i.e.:
#### Configure `gcc` with `-fopenmp`
Next, modify the [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py) file of the `chatglm-6b-int4` project. In the file, change the `gcc -O3 -fPIC -pthread -fopenmp -std=c99` configuration to `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99` (check the corresponding python code [HERE](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168)), i.e.:
```python
# STEP
## Change the line contains `gcc -O3 -fPIC -pthread -fopenmp -std=c99` to:
# STEP 2: Change the line contains `gcc -O3 -fPIC -pthread -fopenmp -std=c99` to:
compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 {} -shared -o {}".format(source_code, kernel_file)
```
@ -238,7 +234,7 @@ else:
source_code, kernel_file)
```
> Notice: If you have run the `chatglm` project and failed, you may want to clean the cache of Huggingface before your next try, i.e. `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Since `rm` is used, please MAKE SURE that the command deletes the right files.
> Notice: If you have run the `ChatGLM` project and failed, you may want to clean the cache of Huggingface before your next try, i.e. `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Since `rm` is used, please MAKE SURE that the command deletes the right files.
### GPU Inference on Mac
For Macs (and MacBooks) with Apple Silicon, it is possible to use the MPS backend to run ChatGLM-6B on the GPU. First, you need to refer to Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly.
@ -248,7 +244,7 @@ Currently you must [load the model locally](README_en.md#load-the-model-locally)
model = AutoModel.from_pretrained("your local path", trust_remote_code=True).half().to('mps')
```
For Mac users with Mac >= 13.3, one may encounter errors related to `half()` method. Use `float()` instead:
For Mac users with Mac OS >= 13.3, one may encounter errors related to the `half()` method. Use the `float()` method instead:
```python
model = AutoModel.from_pretrained("your local path", trust_remote_code=True).float().to('mps')
@ -256,7 +252,7 @@ model = AutoModel.from_pretrained("your local path", trust_remote_code=True).flo
Then you can use GPU-accelerated model inference on Mac.
> Notice: there is no promblem to run the non-quantified version of ChatGLM with MPS. One could check [this issue](https://github.com/THUDM/ChatGLM-6B/issues/462) to run the quantified version with MPS as the backend (and get `ERRORS`). Unzip/unpack [kernel](https://huggingface.co/THUDM/chatglm-6b/blob/658202d88ac4bb782b99e99ac3adff58b4d0b813/quantization.py#L27) as an `ELF` file shows its backend is `cuda`.
> Notice: there is no promblem to run the non-quantified version of ChatGLM with MPS. One could check [this issue](https://github.com/THUDM/ChatGLM-6B/issues/462) to run the quantified version with MPS as the backend (and get `ERRORS`). Unpacking [kernel](https://huggingface.co/THUDM/chatglm-6b/blob/658202d88ac4bb782b99e99ac3adff58b4d0b813/quantization.py#L27) as an `ELF` file shows its backend is `cuda`, which fails on MPS currently (`torch 2.1.0.dev20230502`).
### Multi-GPU Deployment
If you have multiple GPUs, but the memory size of each GPU is not sufficient to accommodate the entire model, you can split the model across multiple GPUs.

Loading…
Cancel
Save