sudo tar fvxz openmp-14.0.6-darwin20-Release.tar.gz -C /
## Four files are installed:
# usr/local/lib/libomp.dylib
# usr/local/include/ompt.h
# usr/local/include/omp.h
# usr/local/include/omp-tools.h
```
Four files (`/usr/local/lib/libomp.dylib`, `/usr/local/include/ompt.h`, `/usr/local/include/omp.h`, `/usr/local/include/omp-tools.h`) are installed accordingly.
For `chatglm-6b-int4`, modify the [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py)file. In the file, change the `gcc -O3 -fPIC -pthread -fopenmp -std=c99` configuration to `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99`,[corresponding python code](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168), i.e.:
#### Configure `gcc` with `-fopenmp`
Next, modify the [quantization.py](https://huggingface.co/THUDM/chatglm-6b-int4/blob/main/quantization.py) file of the `chatglm-6b-int4` project. In the file, change the `gcc -O3 -fPIC -pthread -fopenmp -std=c99` configuration to `gcc -O3 -fPIC -Xclang -fopenmp -pthread -lomp -std=c99` (check the corresponding python code [HERE](https://huggingface.co/THUDM/chatglm-6b-int4/blob/63d66b0572d11cedd5574b38da720299599539b3/quantization.py#L168)), i.e.:
```python
# STEP
## Change the line contains `gcc -O3 -fPIC -pthread -fopenmp -std=c99` to:
# STEP 2: Change the line contains `gcc -O3 -fPIC -pthread -fopenmp -std=c99` to:
> Notice: If you have run the `chatglm` project and failed, you may want to clean the cache of Huggingface before your next try, i.e. `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Since `rm` is used, please MAKE SURE that the command deletes the right files.
> Notice: If you have run the `ChatGLM` project and failed, you may want to clean the cache of Huggingface before your next try, i.e. `rm -rf ${HOME}/.cache/huggingface/modules/transformers_modules/chatglm-6b-int4`. Since `rm` is used, please MAKE SURE that the command deletes the right files.
### GPU Inference on Mac
For Macs (and MacBooks) with Apple Silicon, it is possible to use the MPS backend to run ChatGLM-6B on the GPU. First, you need to refer to Apple's [official instructions](https://developer.apple.com/metal/pytorch) to install PyTorch-Nightly.
@ -248,7 +244,7 @@ Currently you must [load the model locally](README_en.md#load-the-model-locally)
model = AutoModel.from_pretrained("your local path", trust_remote_code=True).half().to('mps')
```
For Mac users with Mac >= 13.3, one may encounter errors related to `half()` method. Use `float()` instead:
For Mac users with Mac OS >= 13.3, one may encounter errors related to the `half()` method. Use the `float()` method instead:
```python
model = AutoModel.from_pretrained("your local path", trust_remote_code=True).float().to('mps')
@ -256,7 +252,7 @@ model = AutoModel.from_pretrained("your local path", trust_remote_code=True).flo
Then you can use GPU-accelerated model inference on Mac.
> Notice: there is no promblem to run the non-quantified version of ChatGLM with MPS. One could check [this issue](https://github.com/THUDM/ChatGLM-6B/issues/462) to run the quantified version with MPS as the backend (and get `ERRORS`). Unzip/unpack [kernel](https://huggingface.co/THUDM/chatglm-6b/blob/658202d88ac4bb782b99e99ac3adff58b4d0b813/quantization.py#L27) as an `ELF` file shows its backend is `cuda`.
> Notice: there is no promblem to run the non-quantified version of ChatGLM with MPS. One could check [this issue](https://github.com/THUDM/ChatGLM-6B/issues/462) to run the quantified version with MPS as the backend (and get `ERRORS`). Unpacking [kernel](https://huggingface.co/THUDM/chatglm-6b/blob/658202d88ac4bb782b99e99ac3adff58b4d0b813/quantization.py#L27) as an `ELF` file shows its backend is `cuda`, which fails on MPS currently (`torch 2.1.0.dev20230502`).
### Multi-GPU Deployment
If you have multiple GPUs, but the memory size of each GPU is not sufficient to accommodate the entire model, you can split the model across multiple GPUs.