Merge pull request #2 from yfyang86/mac-m1-dev

[Document] Update Mac deployment
pull/899/head
Yifan 2 years ago committed by GitHub
commit 8983d4b7bc
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -207,22 +207,17 @@ clang: error: unsupported option '-fopenmp'
```bash
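# Step 1: see `https://mac.r-project.org/openmp/`
## Assumption: `gcc -v` reports version 14.x; for other versions, see the table provided by R-Project
## Assumption: gcc (clang) is version 14.x; for other versions, see the table provided by R-Project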
curl -O https://mac.r-project.org/openmp/openmp-14.0.6-darwin20-Release.tar.gz
sudo tar fvxz openmp-14.0.6-darwin20-Release.tar.gz -C /
## This installs the following files:
# usr/local/lib/libomp.dylib
# usr/local/include/ompt.h
# usr/local/include/omp.h
# usr/local/include/omp-tools.h
```
For `chatglm-6b-int4`, modify [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py): the main change is to replace the hard-coded command `gcc -O3 -fPIC -pthread -fopenmp -std=c99` with `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99`. The [corresponding code](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168) is shown below:
This installs the following files: `/usr/local/lib/libomp.dylib`, `/usr/local/include/ompt.h`, `/usr/local/include/omp.h`, `/usr/local/include/omp-tools.h`
Then, for `chatglm-6b-int4`, modify [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py): the main change is to replace the hard-coded command `gcc -O3 -fPIC -pthread -fopenmp -std=c99` with `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99`. The [corresponding code](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168) is shown below:
```python
# Step 2
## Find the line containing `gcc -O3 -fPIC -pthread -fopenmp -std=c99`
## Change it to
# Step 2: find the line containing `gcc -O3 -fPIC -pthread -fopenmp -std=c99` and change it to
compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 {} -shared -o {}".format(source_code, kernel_file)
```
@ -233,7 +228,7 @@ import platform
## ...
## Change the corresponding part above to (adjust the indentation yourself):
if platform.uname()[0] == 'Darwin':
compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99-o {}".format(
compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 -o {}".format(
source_code, kernel_file)
else:
compile_command = "gcc -O3 -fPIC -pthread -fopenmp -std=c99 {} -shared -o {}".format(

@ -200,26 +200,22 @@ clang: error: unsupported option '-fopenmp'
Taking the quantized int4 version [chatglm-6b-int4](https://huggingface.co/THUDM/chatglm-6b-int4) as an example, the following extra steps are needed:
1. Install `libomp`;
2. Configure `gcc`.
#### Install `libomp`
```bash
# STEP 1: install libopenmp, check `https://mac.r-project.org/openmp/` for details
## Assumption: `gcc -v >= 14.x`, check the R-Project page before running the code:
## Assumption: `gcc(clang) >= 14.x`, check the R-Project page before running the code:
curl -O https://mac.r-project.org/openmp/openmp-14.0.6-darwin20-Release.tar.gz
sudo tar fvxz openmp-14.0.6-darwin20-Release.tar.gz -C /
## Four files are installed:
# usr/local/lib/libomp.dylib
# usr/local/include/ompt.h
# usr/local/include/omp.h
# usr/local/include/omp-tools.h
```
Four files (`/usr/local/lib/libomp.dylib`, `/usr/local/include/ompt.h`, `/usr/local/include/omp.h`, `/usr/local/include/omp-tools.h`) are installed accordingly.
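Before editing `quantization.py`, it may help to confirm that the four files listed above actually landed where expected. The following is a minimal sketch, not part of the project: the helper name `missing_openmp_files` and its `root` parameter are hypothetical, and the file list is copied from the tarball contents described above.

```python
from pathlib import Path

# Files the tarball is described as installing, relative to the extraction root.
EXPECTED_FILES = (
    "usr/local/lib/libomp.dylib",
    "usr/local/include/ompt.h",
    "usr/local/include/omp.h",
    "usr/local/include/omp-tools.h",
)


def missing_openmp_files(root="/"):
    """Return the expected OpenMP files that are absent under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]
```

Calling `missing_openmp_files()` after the `tar` step should return an empty list if the extraction succeeded; any paths it returns are the ones still missing.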
For `chatglm-6b-int4`, modify the [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py)file. In the file, change the `gcc -O3 -fPIC -pthread -fopenmp -std=c99` configuration to `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99`[corresponding python code](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168), i.e.:
#### Configure `gcc` with `-fopenmp`
Next, modify the [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py) file of the `chatglm-6b-int4` project. In the file, change the `gcc -O3 -fPIC -pthread -fopenmp -std=c99` command to `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99` (see the corresponding Python code [HERE](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168)), i.e.:
```python
# STEP
## Change the line containing `gcc -O3 -fPIC -pthread -fopenmp -std=c99` to:
# STEP 2: Change the line containing `gcc -O3 -fPIC -pthread -fopenmp -std=c99` to:
compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 {} -shared -o {}".format(source_code, kernel_file)
```
@ -231,14 +227,14 @@ import platform
## ...
## change the corresponding lines to:
if platform.uname()[0] == 'Darwin':
compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99-o {}".format(
compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 -o {}".format(
source_code, kernel_file)
else:
compile_command = "gcc -O3 -fPIC -pthread -fopenmp -std=c99 {} -shared -o {}".format(
source_code, kernel_file)
```
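The platform switch can also be written as a small standalone function, which makes it easy to inspect the resulting command before compiling. This is a minimal sketch under one assumption: the Darwin branch takes the same `{} -shared -o {}` placeholders as the STEP 2 command, so that both `source_code` and `kernel_file` are substituted. The names `build_compile_command`, `DARWIN_FLAGS`, `DEFAULT_FLAGS` and the `system` parameter are hypothetical and do not appear in `quantization.py`.

```python
import platform

# Flags taken from the two commands quoted in this section.
DARWIN_FLAGS = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99"
DEFAULT_FLAGS = "gcc -O3 -fPIC -pthread -fopenmp -std=c99"


def build_compile_command(source_code, kernel_file, system=None):
    """Build the shared-library compile command for the given OS name.

    `system` defaults to the current platform; pass "Darwin" or "Linux"
    explicitly to preview the command for another OS.
    """
    if system is None:
        system = platform.uname()[0]
    flags = DARWIN_FLAGS if system == "Darwin" else DEFAULT_FLAGS
    return "{} {} -shared -o {}".format(flags, source_code, kernel_file)
```

Printing `build_compile_command("kernel.c", "kernel.so", system="Darwin")` shows the exact string that would be handed to the shell, which is a quick way to catch a missing space or placeholder.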
> Notice: If you have run the `chatglm` project and failed, you may want to clean the cache of Huggingface before your next try, i.e. `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Since `rm` is used, please MAKE SURE that the command deletes the right files.
> Notice: If you have run the `ChatGLM` project and failed, you may want to clean the cache of Huggingface before your next try, i.e. `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Since `rm` is used, please MAKE SURE that the command deletes the right files.
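For readers wary of running `rm -rf` by hand, the same cleanup can be sketched in Python with a dry-run default. This is an illustrative helper only: `clear_model_cache` and its parameters are hypothetical names, and the cache layout is taken from the path in the notice above.

```python
import shutil
from pathlib import Path


def clear_model_cache(model_name="chatglm-6b-int4", home=None, dry_run=True):
    """Remove the cached remote-code module for `model_name`, if present.

    Returns the path that was (or, in dry-run mode, would be) removed,
    or None when no such directory exists.
    """
    home = Path(home) if home is not None else Path.home()
    modules = home / ".cache" / "huggingface" / "modules" / "transformers_modules"
    target = (modules / model_name).resolve()
    # Refuse anything that is not a direct child of the modules directory.
    if target.parent != modules.resolve():
        raise ValueError(f"refusing to delete outside {modules}: {target}")
    if not target.is_dir():
        return None
    if not dry_run:
        shutil.rmtree(target)
    return target
```

Calling `clear_model_cache()` first only reports the directory; `clear_model_cache(dry_run=False)` deletes it once the reported path looks right.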
### GPU Inference on Mac
For Macs (and MacBooks) with Apple Silicon, it is possible to use the MPS backend to run ChatGLM-6B on the GPU. First, you need to refer to Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly.
@ -248,7 +244,7 @@ Currently you must [load the model locally](README_en.md#load-the-model-locally)
model = AutoModel.from_pretrained("your local path", trust_remote_code=True).half().to('mps')
```
For Mac users with Mac >= 13.3, one may encounter errors related to `half()` method. Use `float()` instead:
For Mac users with Mac OS >= 13.3, one may encounter errors related to the `half()` method. Use the `float()` method instead:
```python
model = AutoModel.from_pretrained("your local path", trust_remote_code=True).float().to('mps')
@ -256,7 +252,7 @@ model = AutoModel.from_pretrained("your local path", trust_remote_code=True).flo
Then you can use GPU-accelerated model inference on Mac.
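The choice between `half()` and `float()` can be made from the macOS version at run time. The sketch below is a conservative reading of the note above (it picks `float` on 13.3 and later, although the errors are only described as possible); the helper name `mps_cast_method` is hypothetical.

```python
import platform


def mps_cast_method(mac_version=None):
    """Return "float" for macOS >= 13.3 and "half" otherwise.

    `mac_version` is a string such as "13.3.1"; it defaults to the value
    reported by the current system (an empty string on non-macOS hosts).
    """
    if mac_version is None:
        mac_version = platform.mac_ver()[0]
    # Keep only the numeric major and minor components, padding with zeros.
    parts = [int(p) for p in mac_version.split(".")[:2] if p.isdigit()]
    parts += [0] * (2 - len(parts))
    return "float" if tuple(parts) >= (13, 3) else "half"
```

With a loaded model, `getattr(model, mps_cast_method())().to('mps')` would then apply whichever cast the helper selects.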
> Notice: there is no problem running the non-quantized version of ChatGLM with MPS. One could check [this issue](https://github.com/THUDM/ChatGLM-6B/issues/462) to run the quantized version with MPS as the backend (and get `ERRORS`). Unzip/unpack [kernel](https://huggingface.co/THUDM/chatglm-6b/blob/658202d88ac4bb782b99e99ac3adff58b4d0b813/quantization.py#L27) as an `ELF` file shows its backend is `cuda`.
> Notice: there is no problem running the non-quantized version of ChatGLM with MPS. One could check [this issue](https://github.com/THUDM/ChatGLM-6B/issues/462) to run the quantized version with MPS as the backend (and get `ERRORS`). Unpacking [kernel](https://huggingface.co/THUDM/chatglm-6b/blob/658202d88ac4bb782b99e99ac3adff58b4d0b813/quantization.py#L27) as an `ELF` file shows its backend is `cuda`, which fails on MPS currently (`torch 2.1.0.dev20230502`).
### Multi-GPU Deployment
If you have multiple GPUs, but the memory size of each GPU is not sufficient to accommodate the entire model, you can split the model across multiple GPUs.
