From 2629a55222bd448609c5bceba79a2a24abdd859a Mon Sep 17 00:00:00 2001
From: Yifan
Date: Wed, 3 May 2023 14:35:11 +0800
Subject: [PATCH] [Document] Update Mac deployment
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

[Document] Update Mac deployment
- FILE: README.md; README_en.md
- ADD: OPENMP; MPS

# Details
Using the quantized model [chatglm-6b-int4](https://huggingface.co/THUDM/chatglm-6b-int4) as an example, the following is covered:
- steps to install libomp;
- gcc compile options for the quantized model;
- an explanation of enabling MPS for the quantized model.
---
 README.md    | 17 ++++++-----------
 README_en.md | 26 +++++++++++---------------
 2 files changed, 17 insertions(+), 26 deletions(-)

diff --git a/README.md b/README.md
index 25676af..2fd5ef6 100644
--- a/README.md
+++ b/README.md
@@ -207,22 +207,17 @@ clang: error: unsupported option '-fopenmp'
 
 ```bash
 # Step 1: see `https://mac.r-project.org/openmp/`
-## Assume `gcc -v` reports version 14.x; see the table provided by the R-Project for other versions
+## Assumption: gcc (clang) is version 14.x; see the table provided by the R-Project for other versions
 curl -O https://mac.r-project.org/openmp/openmp-14.0.6-darwin20-Release.tar.gz
 sudo tar fvxz openmp-14.0.6-darwin20-Release.tar.gz -C /
-## This installs the following files:
-# usr/local/lib/libomp.dylib
-# usr/local/include/ompt.h
-# usr/local/include/omp.h
-# usr/local/include/omp-tools.h
 ```
 
-For `chatglm-6b-int4`, modify [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py), mainly by changing the hard-coded `gcc -O3 -fPIC -pthread -fopenmp -std=c99` command to `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99`; the [corresponding code](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168) is shown below:
+This installs the following files: `/usr/local/lib/libomp.dylib`, `/usr/local/include/ompt.h`, `/usr/local/include/omp.h`, `/usr/local/include/omp-tools.h`.
+
+Then, for `chatglm-6b-int4`, modify [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py), mainly by changing the hard-coded `gcc -O3 -fPIC -pthread -fopenmp -std=c99` command to `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99`; the [corresponding code](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168) is shown below:
 
 ```python
-# Step 2
-## Find the line containing `gcc -O3 -fPIC -pthread -fopenmp -std=c99`
-## and change it to
+# Step 2: find the line containing `gcc -O3 -fPIC -pthread -fopenmp -std=c99` and change it to
 compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 {} -shared -o {}".format(source_code, kernel_file)
 ```
 
@@ -233,7 +228,7 @@ import platform
 ## ...
 ## Change the corresponding part above as follows (adjust the indentation yourself):
 if platform.uname()[0] == 'Darwin':
-    compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99-o {}".format(
+    compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 {} -shared -o {}".format(
        source_code, kernel_file)
 else:
     compile_command = "gcc -O3 -fPIC -pthread -fopenmp -std=c99 {} -shared -o {}".format(
diff --git a/README_en.md b/README_en.md
index 2f256dd..f63bc01 100644
--- a/README_en.md
+++ b/README_en.md
@@ -200,26 +200,22 @@ clang: error: unsupported option '-fopenmp'
 
 Take the quantified int4 version [chatglm-6b-int4](https://huggingface.co/THUDM/chatglm-6b-int4) for example, the following extra steps are needed:
 
-1. Install `libomp`;
-2. Configure `gcc`.
+#### Install `libomp`
 ```bash
 # STEP 1: install libomp, check `https://mac.r-project.org/openmp/` for details
-## Assumption: `gcc -v >= 14.x`, read the R-Poject before run the code:
+## Assumption: `gcc (clang) >= 14.x`; read the R-Project page before running the code:
 curl -O https://mac.r-project.org/openmp/openmp-14.0.6-darwin20-Release.tar.gz
 sudo tar fvxz openmp-14.0.6-darwin20-Release.tar.gz -C /
-## Four files are installed:
-# usr/local/lib/libomp.dylib
-# usr/local/include/ompt.h
-# usr/local/include/omp.h
-# usr/local/include/omp-tools.h
 ```
+Four files (`/usr/local/lib/libomp.dylib`, `/usr/local/include/ompt.h`, `/usr/local/include/omp.h`, `/usr/local/include/omp-tools.h`) are installed as a result.
 
-For `chatglm-6b-int4`, modify the [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py)file. In the file, change the `gcc -O3 -fPIC -pthread -fopenmp -std=c99` configuration to `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99`,[corresponding python code](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168), i.e.:
+#### Configure `gcc` with `-fopenmp`
+
+Next, modify the [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py) file of the `chatglm-6b-int4` project. In the file, change the `gcc -O3 -fPIC -pthread -fopenmp -std=c99` configuration to `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99` (see the corresponding Python code [here](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168)), i.e.:
 
 ```python
-# STEP
-## Change the line contains `gcc -O3 -fPIC -pthread -fopenmp -std=c99` to:
+# STEP 2: change the line that contains `gcc -O3 -fPIC -pthread -fopenmp -std=c99` to:
 compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 {} -shared -o {}".format(source_code, kernel_file)
 ```
 
@@ -231,14 +227,14 @@ import platform
 ## ...
 ## change the corresponding lines to:
 if platform.uname()[0] == 'Darwin':
-    compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99-o {}".format(
+    compile_command = "gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99 {} -shared -o {}".format(
        source_code, kernel_file)
 else:
     compile_command = "gcc -O3 -fPIC -pthread -fopenmp -std=c99 {} -shared -o {}".format(
        source_code, kernel_file)
 ```
 
-> Notice: If you have run the `chatglm` project and failed, you may want to clean the cache of Huggingface before your next try, i.e. `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Since `rm` is used, please MAKE SURE that the command deletes the right files.
+> Notice: If you have run the `ChatGLM` project and it failed, you may want to clean the Hugging Face cache before your next try, i.e. `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Since `rm` is used, please MAKE SURE that the command deletes the right files.
 
 ### GPU Inference on Mac
 For Macs (and MacBooks) with Apple Silicon, it is possible to use the MPS backend to run ChatGLM-6B on the GPU. First, you need to refer to Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly.
 
@@ -248,7 +244,7 @@ Currently you must [load the model locally](README_en.md#load-the-model-locally)
 model = AutoModel.from_pretrained("your local path", trust_remote_code=True).half().to('mps')
 ```
 
-For Mac users with Mac >= 13.3, one may encounter errors related to `half()` method. Use `float()` instead:
+For Mac users on macOS >= 13.3, errors related to the `half()` method may occur. Use the `float()` method instead:
 
 ```python
 model = AutoModel.from_pretrained("your local path", trust_remote_code=True).float().to('mps')
@@ -256,7 +252,7 @@ model = AutoModel.from_pretrained("your local path", trust_remote_code=True).flo
 
 Then you can use GPU-accelerated model inference on Mac.
 
-> Notice: there is no promblem to run the non-quantified version of ChatGLM with MPS. One could check [this issue](https://github.com/THUDM/ChatGLM-6B/issues/462) to run the quantified version with MPS as the backend (and get `ERRORS`). Unzip/unpack [kernel](https://huggingface.co/THUDM/chatglm-6b/blob/658202d88ac4bb782b99e99ac3adff58b4d0b813/quantization.py#L27) as an `ELF` file shows its backend is `cuda`.
+> Notice: there is no problem running the non-quantified version of ChatGLM with MPS. See [this issue](https://github.com/THUDM/ChatGLM-6B/issues/462) for attempts to run the quantified version with MPS as the backend, which currently fail with errors. Unpacking the [kernel](https://huggingface.co/THUDM/chatglm-6b/blob/658202d88ac4bb782b99e99ac3adff58b4d0b813/quantization.py#L27) as an `ELF` file shows that its backend is `cuda`, which currently fails on MPS (`torch 2.1.0.dev20230502`).
 
 ### Multi-GPU Deployment
 If you have multiple GPUs, but the memory size of each GPU is not sufficient to accommodate the entire model, you can split the model across multiple GPUs.
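
For readers who want to try the MPS path end to end, a minimal sketch is shown below. The local model path, the use of a non-quantified checkpoint, and the `chat()` call are illustrative assumptions based on the repository's quick-start usage, not part of this patch; adjust them to your setup.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative placeholder: path to a local copy of the non-quantified chatglm-6b checkpoint.
MODEL_PATH = "./chatglm-6b"

# Prefer the MPS backend when the PyTorch-Nightly build exposes it; otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# On macOS >= 13.3, half() may fail on MPS, so load the weights with float() as described above.
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True).float().to(device)
model = model.eval()

# chat() follows the repository's quick-start interface (an assumption here, not shown in this patch).
response, history = model.chat(tokenizer, "Hello", history=[])
print(response)
```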