[Doc]: Add README_npu (#816)

Co-authored-by: liutongtong27 <liutongtong15@h-partners.com>
Co-authored-by: xinyanhe <1073376478@qq.com>
Co-authored-by: 郑特驹 <zhengteju@h-partners.com>
Co-authored-by: baymax591 <cbai@mail.nwpu.edu.cn>
YumiMom 2025-01-26 17:00:55 +08:00 committed by GitHub
parent fb14f9b60a
commit 6dd4af4bd1
8 changed files with 745 additions and 0 deletions

Six binary image assets (363 KiB, 41 KiB, 8.9 KiB, 212 KiB, 6.9 KiB, and 315 KiB) are added in this commit; previews are not shown.

ecosystem/README_npu.md (new file, +371 lines)
# InternLM-NPU
<div align="center">
<img src="./assets/logo.svg" width="200"/>
<div> </div>
<div align="center">
<b><font size="5">InternLM</font></b>
<sup>
<a href="https://internlm.intern-ai.org.cn/">
<i><font size="4">HOT</font></i>
</a>
</sup>
<div> </div>
</div>
[![license](../assets/license.svg)](../LICENSE)
[![evaluation](../assets/compass_support.svg)](https://github.com/internLM/OpenCompass/)
<!-- [![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest) -->
[📘Commercial Application](#license) |
[🤗HuggingFace](https://huggingface.co/internlm) |
[🆕Update News](#news) |
[🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new) |
[📜Technical Report](https://arxiv.org/abs/2403.17297)<br>
[💬Chat Web](https://internlm-chat.intern-ai.org.cn/) |
[🔗API](https://internlm.intern-ai.org.cn/api/document) |
[🧩Modelers](https://modelers.cn/spaces/MindSpore-Lab/INTERNLM2-20B-PLAN)
[English](README_npu.md) |
[简体中文](README_npu_zh-CN.md)
</div>
## Introduction
This is a guide to training and running inference with InternLM series models on Ascend NPUs.
## News
\[2025.01.15\] InternLM3-8B-Instruct is now supported in Xtuner, LLaMA-Factory, transformers, and openMind.
## Model Zoo
### InternLM3
| Model | Transformers | ModelScope | Modelers | Release Date |
| ------------------------- | ---------------------------------------------------- | -------------------------------------------------- | ------------------------------------------------- | ------------ |
| **InternLM3-8B-Instruct** | [🤗internlm3_8B_instruct](https://huggingface.co/internlm/internlm3-8b-instruct) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm3_8b_instruct](https://www.modelscope.cn/models/Shanghai_AI_Laboratory/internlm3-8b-instruct/summary) | [![Open in Modelers](https://modelers.cn/assets/logo1-1bf58310.svg)](https://modelers.cn/models/Intern/internlm3-8b-instruct) | 2025-01-15 |
## Environment Setup
### Installing Ascend CANN Toolkit and Kernels
For installation details, see the [installation guide](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2FCANNCommunityEdition%2F80RC2alpha002%2Fquickstart%2Fquickstart%2Fquickstart_18_0004.html) or run the following commands:
```shell
# Replace the URL with the one matching your CANN version and device model.
# Install CANN Toolkit.
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run
bash Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run --install
# Install CANN Kernels.
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run
bash Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run --install
# Set environment variables.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
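Before moving on, it can help to verify that the toolkit and the Ascend PyTorch adapter can see the device. The check below assumes `torch` and `torch_npu` are already installed (they are set up in the framework-specific sections that follow):
```shell
# Sanity check: confirm that PyTorch can see the Ascend NPU (requires torch and torch_npu).
python -c "import torch, torch_npu; print(torch_npu.npu.is_available(), torch_npu.npu.device_count())"
```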
## Xtuner
### Installing Xtuner
```shell
git clone https://github.com/InternLM/xtuner.git
cd xtuner
```
Modify `requirements/runtime.txt` with the following changes:
```text
bitsandbytes==0.42.0
torchvision==0.19.0
numpy==1.26.4
```
Use the following command for installation:
```shell
pip install -e '.[all]'
```
**Note**:
- The latest `torch` is installed by default; make sure it matches the version of your `torch_npu` (an illustrative pairing is shown below).
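For example, a compatible pair can be pinned explicitly. The version numbers below are purely illustrative; consult the torch_npu release notes for the pairing that matches your CANN version:
```shell
# Example only: pin torch and torch_npu to a matching pair (versions are illustrative).
pip install torch==2.1.0 torch-npu==2.1.0.post3
```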
### LoRA Fine-tuning
Use the following commands to copy and rename the file to `internlm3_8b_instruct_lora_oasst1_e10.py`:
```shell
xtuner copy-cfg internlm2_5_chat_7b_qlora_oasst1_e3 .
mv internlm2_5_chat_7b_qlora_oasst1_e3_copy.py internlm3_8b_instruct_lora_oasst1_e10.py
```
The modifications to the configuration file `internlm3_8b_instruct_lora_oasst1_e10.py` are as follows:
```python
pretrained_model_name_or_path = 'internlm/internlm3-8b-instruct'
max_epochs = 10
model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16),
        # quantization_config=dict(
        #     type=BitsAndBytesConfig,
        #     load_in_4bit=True,
        #     load_in_8bit=False,
        #     llm_int8_threshold=6.0,
        #     llm_int8_has_fp16_weight=False,
        #     bnb_4bit_compute_dtype=torch.float16,
        #     bnb_4bit_use_double_quant=True,
        #     bnb_4bit_quant_type='nf4')),
    # ... keep the remaining `model` fields (lora, tokenizer, etc.) from the copied config
)

randomness = dict(seed=123, deterministic=True)
```
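Since xtuner configs are plain mmengine config files, a quick parse check can catch typos in the edited file before launching a long job (a minimal sketch; run it from the directory containing the config):
```shell
python -c "from mmengine.config import Config; print(Config.fromfile('internlm3_8b_instruct_lora_oasst1_e10.py').max_epochs)"
```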
Run the following commands to start single-machine eight-card fine-tuning:
```shell
NPROC_PER_NODE=8 xtuner train internlm3_8b_instruct_lora_oasst1_e10.py --deepspeed deepspeed_zero2
```
The fine-tuned weights are saved as `./work_dirs/internlm3_8b_instruct_lora_oasst1_e10/iter_xxx.pth`.
The loss comparison between NPU and GPU is as follows:
![xtuner_training_loss](../assets/npu/xtuner_training_loss_compare.png)
### Model Conversion
Convert the model weight file obtained from fine-tuning into the Hugging Face format, which facilitates subsequent deployment and usage.
Use the following command for the conversion:
```shell
xtuner convert pth_to_hf internlm3_8b_instruct_lora_oasst1_e10.py ./work_dirs/internlm3_8b_instruct_lora_oasst1_e10/iter_xxx.pth ./work_dirs/convert_output
```
### Model Merging
LoRA or QLoRA fine-tuning produces an additional adapter layer, which must be merged with the original model to
obtain a complete model. Use the following command for model merging, where `$model_path` is the local path of the
original model and `--max-shard-size 2GB` caps each weight file at 2 GB:
```shell
xtuner convert merge $model_path ./work_dirs/convert_output ./work_dirs/merge_output --max-shard-size 2GB
```
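To sanity-check the merged checkpoint before chatting or deploying, it can be loaded back with `transformers`. This is a minimal sketch; it assumes `torch_npu` is installed and uses the merge output directory from the previous step:
```python
import torch
import torch_npu  # registers the NPU backend so .npu() is available
from transformers import AutoModelForCausalLM, AutoTokenizer

merged_dir = "./work_dirs/merge_output"
tokenizer = AutoTokenizer.from_pretrained(merged_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    merged_dir, trust_remote_code=True, torch_dtype=torch.float16
).npu()
print(model.config)  # the merged checkpoint loads and reports its configuration
```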
### Chat
Chat with the merged model weights:
```shell
cp path_to_your_model/modeling_internlm3.py ./work_dirs/merge_output
xtuner chat ./work_dirs/merge_output --prompt-template internlm2_chat
```
## LLaMA-Factory
### Installing LLaMA-Factory
```shell
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch-npu,metrics]"
```
### Inference
Create the `examples/inference/internlm3_8b_instruct.yaml` inference configuration file in the LLaMA-Factory directory:
```yaml
model_name_or_path: xxx # Only local loading is supported; set this to the local weight path of InternLM3-8B-Instruct.
trust_remote_code: true
template: intern3
```
Run the following command to interact with the model:
```shell
llamafactory-cli chat examples/inference/internlm3_8b_instruct.yaml
```
### Fine-tuning
Create the `examples/train_full/internlm3_8b_instruct_full_sft.yaml` configuration file in the LLaMA-Factory directory. The fine-tuning configuration file is as follows:
```yaml
### model
model_name_or_path: xxx # Only local loading is supported; set this to the local weight path of InternLM3-8B-Instruct.
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]
### dataset
dataset: alpaca_data
template: intern3
cutoff_len: 4096
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/internlm3/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-6
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 5000000000
```
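If full-parameter SFT exceeds the available memory, the same recipe can be switched to LoRA by changing only the `### method` block; the values below are illustrative and follow the LLaMA-Factory LoRA examples:
```yaml
### method (LoRA variant, illustrative)
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
lora_rank: 8
```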
Run the following commands to start fine-tuning:
```shell
llamafactory-cli train examples/train_full/internlm3_8b_instruct_full_sft.yaml
```
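By default the launcher uses all visible NPUs. To restrict the run to specific cards, the Ascend runtime honors the `ASCEND_RT_VISIBLE_DEVICES` environment variable; the example below (cards 0-3) is illustrative:
```shell
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 llamafactory-cli train examples/train_full/internlm3_8b_instruct_full_sft.yaml
```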
### Accuracy
The loss curve after fine-tuning is as follows:
![training_loss](../assets/npu/lf_training_loss_npu.png)
The loss curve compared with GPU is as follows:
![training_loss_compare](../assets/npu/lf_training_loss_compare.png)
## Transformers
### Inference
Create the inference script `inference_internlm3_instruct_8b.py`:
```python
import torch
import torch_npu  # required so that the .npu() device methods are available
from transformers import AutoTokenizer, AutoModelForCausalLM
model_dir = "internlm/internlm3-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error.
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16).npu()
# (Optional) On low-resource devices, you can load the model in 4-bit or 8-bit via bitsandbytes to further save memory.
# InternLM3-8B in 4-bit takes roughly 8 GB of memory.
# pip install -U bitsandbytes
# 8-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, load_in_8bit=True).npu()
# 4-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, load_in_4bit=True).npu()
model = model.eval()
system_prompt = """You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文."""
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Please tell me five scenic spots in Shanghai"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").npu()
generated_ids = model.generate(tokenized_chat, max_new_tokens=1024, temperature=1, repetition_penalty=1.005, top_k=40, top_p=0.8)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
]
prompt = tokenizer.batch_decode(tokenized_chat)[0]
print(prompt)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
```
Execute the inference script:
```shell
python inference_internlm3_instruct_8b.py
```
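For interactive use it is often more convenient to stream tokens as they are generated. The sketch below reuses the `model`, `tokenizer`, and `tokenized_chat` objects from the script above and relies on the standard `TextStreamer` helper from transformers:
```python
from transformers import TextStreamer

# Stream generated tokens to stdout instead of decoding everything at the end.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    tokenized_chat,
    max_new_tokens=1024,
    temperature=1,
    repetition_penalty=1.005,
    top_k=40,
    top_p=0.8,
    streamer=streamer,
)
```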
## openMind Library
### Introduction to openMind
The openMind Library is an open-source suite for large models that natively supports fine-tuning, inference, evaluation, and deployment on Ascend NPUs. It provides easy-to-use interfaces that fully leverage the performance of Ascend NPUs to quickly support and enhance cutting-edge industry models.
### Fine-Tuning
The openMind Library provides a one-click model fine-tuning solution on Ascend NPUs, covering data processing, weight loading from multiple model hubs, low-rank adaptation (LoRA), and quantization adaptation (QLoRA). It also supports Ascend NPU fused-operator optimization to improve model training performance.
#### Installing the openMind Library
```shell
git clone -b dev https://gitee.com/ascend/openmind.git
cd openmind
pip install -e .[pt]
```
#### Initiating Fine-Tuning
Within the openMind directory, fine-tuning can be initiated using the following command line:
```shell
openmind-cli train examples/internlm3/train_sft_full_internlm3.yaml
```
#### Training Results and Advantages
As shown in the figure below, the training loss of the openMind Library converges normally, and the average relative error compared with GPU is within 2%.
<div align=center>
<img src="./assets/openmind_train_loss_compare.png" width="600px">
</div>
<p align="center"><strong>Accuracy Comparison</strong> (npu=8, per_device_train_batch_size=6, max_length=1024)</p>
The openMind Library supports LoRA and QLoRA fine-tuning on Ascend NPUs, which significantly reduces device memory usage. As shown in the figure below, QLoRA fine-tuning reduces device memory consumption by roughly 40%.
<div align=center>
<img src="./assets/openmind_train_memory.png" width="400px">
</div>
<p align="center"><strong>Memory Consumption</strong> (npu=8, per_device_train_batch_size=6, max_length=1024)</p>
The openMind Library automatically loads Ascend NPU fused operators during training, with no manual code or configuration changes required, improving training performance while preserving ease of use. The figure below shows the performance gains obtained when openMind enables Ascend NPU fused operators by default.
<div align=center>
<img src="./assets/openmind_fused_ops.png" width="300px">
</div>
<p align="center"><strong>Training Samples per Second</strong></p>
For more features, please refer to the [openMind Fine-tuning Documentation](https://modelers.cn/docs/en/openmind-library/1.0.0/basic_tutorial/finetune/finetune_pt.html).
### Inference
In addition to fine-tuning, the openMind Library can also be utilized for model inference. After installing the openMind Library, a single round of inference can be conducted using the following command line:
```shell
openmind-cli run Intern/internlm3-8b-instruct --task text-generation --input '{"text_inputs":"What is AI?","max_length":512}' --trust_remote_code 1
```
For more features, please refer to the [openMind Inference Documentation](https://modelers.cn/docs/en/openmind-library/1.0.0/basic_tutorial/pipeline.html).
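The pipeline tutorial linked above suggests the same inference can also be driven from Python. The sketch below is hypothetical: it assumes openMind exposes a transformers-style `pipeline` API, and the exact import path, task name, and device argument should be verified against the linked documentation:
```python
# Hypothetical sketch based on the openMind pipeline tutorial; verify the exact API in the docs above.
from openmind import pipeline

generator = pipeline(
    "text-generation",
    model="Intern/internlm3-8b-instruct",
    trust_remote_code=True,
    device="npu:0",
)
print(generator("What is AI?", max_length=512))
```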
## License
Code and model weights are licensed under Apache-2.0.

View File

@ -0,0 +1,374 @@
# InternLM-NPU
<div align="center">
<img src="./assets/logo.svg" width="200"/>
<div>&nbsp;</div>
<div align="center">
<b><font size="5">InternLM Official Website</font></b>
<sup>
<a href="https://internlm.intern-ai.org.cn/">
<i><font size="4">HOT</font></i>
</a>
</sup>
<div>&nbsp;</div>
</div>
[![license](./assets/license.svg)](https://github.com/open-mmlab/mmdetection/blob/main/LICENSE)
[![evaluation](./assets/compass_support.svg)](https://github.com/internLM/OpenCompass/)
<!-- [![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest) -->
[📘Commercial Application](#license) |
[🤗HuggingFace](https://huggingface.co/internlm) |
[🆕Update News](#news) |
[🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new) |
[📜Technical Report](https://arxiv.org/abs/2403.17297)<br>
[💬Chat Web](https://internlm-chat.intern-ai.org.cn/) |
[🔗API](https://internlm.intern-ai.org.cn/api/document) |
[🧩Modelers](https://modelers.cn/spaces/MindSpore-Lab/INTERNLM2-20B-PLAN)
[English](README_npu.md) |
[简体中文](README_npu_zh-CN.md)
</div>
## Introduction
This is a guide to training and running inference with InternLM series models on Ascend NPUs.
## News
\[2025.01.15\] InternLM3-8B-Instruct is now supported in Xtuner, LLaMA-Factory, transformers, and openMind.
## Model Zoo
### InternLM3
| Model | Transformers | ModelScope | Modelers | Release Date |
| ------------------------- | ---------------------------------------------------- | -------------------------------------------------- | ------------------------------------------------- | ------------ |
| **InternLM3-8B-Instruct** | [🤗internlm3_8B_instruct](https://huggingface.co/internlm/internlm3-8b-instruct) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm3_8b_instruct](https://www.modelscope.cn/models/Shanghai_AI_Laboratory/internlm3-8b-instruct/summary) | [![Open in Modelers](https://modelers.cn/assets/logo1-1bf58310.svg)](https://modelers.cn/models/Intern/internlm3-8b-instruct) | 2025-01-15 |
## Environment Setup
### Installing Ascend CANN Toolkit and Kernels
For installation details, see the [installation guide](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2FCANNCommunityEdition%2F80RC2alpha002%2Fquickstart%2Fquickstart%2Fquickstart_18_0004.html) or run the following commands:
```shell
# Replace the URL with the one matching your CANN version and device model.
# Install CANN Toolkit.
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run
bash Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run --install
# Install CANN Kernels.
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run
bash Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run --install
# Set environment variables.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
## Xtuner
### Installing Xtuner
```shell
git clone https://github.com/InternLM/xtuner.git
cd xtuner
```
Modify `requirements/runtime.txt` as follows:
```text
bitsandbytes==0.42.0
torchvision==0.19.0
numpy==1.26.4
```
Install with the following command:
```shell
pip install -e '.[all]'
```
**Note**:
- The latest `torch` is installed by default; make sure it matches the version of your `torch_npu`.
### LoRA Fine-tuning
Copy and rename the config file to `internlm3_8b_instruct_lora_oasst1_e10.py` with the following commands:
```shell
xtuner copy-cfg internlm2_5_chat_7b_qlora_oasst1_e3 .
mv internlm2_5_chat_7b_qlora_oasst1_e3_copy.py internlm3_8b_instruct_lora_oasst1_e10.py
```
Modify the configuration file `internlm3_8b_instruct_lora_oasst1_e10.py` as follows:
```python
pretrained_model_name_or_path = 'internlm/internlm3-8b-instruct'
max_epochs = 10
model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16),
        # quantization_config=dict(
        #     type=BitsAndBytesConfig,
        #     load_in_4bit=True,
        #     load_in_8bit=False,
        #     llm_int8_threshold=6.0,
        #     llm_int8_has_fp16_weight=False,
        #     bnb_4bit_compute_dtype=torch.float16,
        #     bnb_4bit_use_double_quant=True,
        #     bnb_4bit_quant_type='nf4')),
    # ... keep the remaining `model` fields (lora, tokenizer, etc.) from the copied config
)

randomness = dict(seed=123, deterministic=True)
```
Run the following command to start single-machine eight-card fine-tuning:
```shell
NPROC_PER_NODE=8 xtuner train internlm3_8b_instruct_lora_oasst1_e10.py --deepspeed deepspeed_zero2
```
The fine-tuned weights are saved as `./work_dirs/internlm3_8b_instruct_lora_oasst1_e10/iter_xxx.pth`. The loss comparison between NPU and GPU is as follows:
![xtuner_training_loss](../assets/npu/xtuner_training_loss_compare.png)
### Model Conversion
Convert the fine-tuned model weights to the Hugging Face format, which facilitates subsequent deployment and use. Use the following command for the conversion:
```shell
xtuner convert pth_to_hf internlm3_8b_instruct_lora_oasst1_e10.py ./work_dirs/internlm3_8b_instruct_lora_oasst1_e10/iter_xxx.pth ./work_dirs/convert_output
```
### Model Merging
LoRA or QLoRA fine-tuning produces an additional adapter layer, which must be merged with the original model to obtain a complete model.
Use the following command for model merging, where `$model_path` is the local path of the original model and `--max-shard-size 2GB` caps each weight file at 2 GB:
```shell
xtuner convert merge $model_path ./work_dirs/convert_output ./work_dirs/merge_output --max-shard-size 2GB
```
### Chat
Chat with the merged model weights:
```shell
cp path_to_your_model/modeling_internlm3.py ./work_dirs/merge_output
xtuner chat ./work_dirs/merge_output --prompt-template internlm2_chat
```
## LLaMA-Factory
### Installing LLaMA-Factory
```shell
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch-npu,metrics]"
```
### Inference
Create the inference configuration file `examples/inference/internlm3_8b_instruct.yaml` in the LLaMA-Factory directory:
```yaml
model_name_or_path: xxx # Only local loading is supported; set this to the local weight path of InternLM3-8B-Instruct.
trust_remote_code: true
template: intern3
```
Run the following command to interact with the model:
```shell
llamafactory-cli chat examples/inference/internlm3_8b_instruct.yaml
```
### Fine-tuning
Create the fine-tuning configuration file `examples/train_full/internlm3_8b_instruct_full_sft.yaml` in the LLaMA-Factory directory:
```yaml
### model
model_name_or_path: xxx # Only local loading is supported; set this to the local weight path of InternLM3-8B-Instruct.
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]
### dataset
dataset: alpaca_data
template: intern3
cutoff_len: 4096
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/internlm3/full/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-6
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 5000000000
```
Start fine-tuning with the following command:
```shell
llamafactory-cli train examples/train_full/internlm3_8b_instruct_full_sft.yaml
```
### Accuracy
The loss curve after fine-tuning is as follows:
![training_loss](../assets/npu/lf_training_loss_npu.png)
The loss curve compared with GPU is as follows:
![training_loss_compare](../assets/npu/lf_training_loss_compare.png)
## Transformers
### Inference
Create the inference script `inference_internlm3_instruct_8b.py`:
```python
import torch
import torch_npu  # required so that the .npu() device methods are available
from transformers import AutoTokenizer, AutoModelForCausalLM
model_dir = "internlm/internlm3-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load the model in float16; otherwise transformers loads it in float32, which may run out of memory.
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16).npu()
# (Optional) On low-resource devices, you can load the model in 4-bit or 8-bit via bitsandbytes to further save memory.
# InternLM3-8B in 4-bit takes roughly 8 GB of memory.
# pip install -U bitsandbytes
# 8-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, load_in_8bit=True).npu()
# 4-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, load_in_4bit=True).npu()
model = model.eval()
system_prompt = """You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文."""
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Please tell me five scenic spots in Shanghai"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").npu()
generated_ids = model.generate(tokenized_chat, max_new_tokens=1024, temperature=1, repetition_penalty=1.005, top_k=40, top_p=0.8)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
]
prompt = tokenizer.batch_decode(tokenized_chat)[0]
print(prompt)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
```
Run the inference script:
```shell
python inference_internlm3_instruct_8b.py
```
## openMind Library
### Introduction to openMind
The openMind Library is an open-source suite for large models that natively supports fine-tuning, inference, evaluation, and deployment on Ascend NPUs.
It provides easy-to-use interfaces that fully leverage the performance of Ascend NPUs to quickly support and enhance cutting-edge industry models.
### Fine-Tuning
The openMind Library provides a one-click model fine-tuning solution on Ascend NPUs, covering data processing, weight loading from multiple model hubs,
low-rank adaptation (LoRA), and quantization adaptation (QLoRA). It also supports Ascend NPU fused-operator optimization to improve model training performance.
#### Installing the openMind Library
```shell
git clone -b dev https://gitee.com/ascend/openmind.git
cd openmind
pip install -e .[pt]
```
#### Initiating Fine-Tuning
Within the openMind directory, fine-tuning can be initiated using the following command:
```shell
openmind-cli train examples/internlm3/train_sft_full_internlm3.yaml
```
#### Training Results and Advantages
As shown in the figure below, the training loss of the openMind Library converges normally, and the average relative error compared with GPU is within 2%.
<div align=center>
<img src="./assets/openmind_train_loss_compare.png" width="600px">
</div>
<p align="center"><strong>Accuracy Comparison</strong> (npu=8, per_device_train_batch_size=6, max_length=1024)</p>
The openMind Library supports LoRA and QLoRA fine-tuning on Ascend NPUs, which significantly reduces device memory usage.
As shown in the figure below, QLoRA fine-tuning reduces device memory consumption by roughly 40%.
<div align=center>
<img src="./assets/openmind_train_memory.png" width="400px">
</div>
<p align="center"><strong>Full/LoRA/QLoRA Memory Consumption</strong> (npu=8, per_device_train_batch_size=6, max_length=1024)</p>
The openMind Library automatically loads Ascend NPU fused operators during training, with no manual code or configuration changes required,
improving training performance while preserving ease of use. The figure below shows the performance gains obtained when openMind enables Ascend NPU fused operators by default.
<div align=center>
<img src="./assets/openmind_fused_ops.png" width="300px">
</div>
<p align="center"><strong>Training Samples per Second</strong></p>
For more features, see the [openMind fine-tuning documentation](https://modelers.cn/docs/zh/openmind-library/1.0.0/basic_tutorial/finetune/finetune_pt.html).
### Inference
Besides fine-tuning, the openMind Library can also be used for model inference. After installing the openMind Library, run a single round of inference with the following command:
```shell
openmind-cli run Intern/internlm3-8b-instruct --task text-generation --input '{"text_inputs":"What is AI?","max_length":512}' --trust_remote_code 1
```
For more features, see the [openMind inference documentation](https://modelers.cn/docs/zh/openmind-library/1.0.0/basic_tutorial/pipeline.html).
## License
The code and model weights in this repository are open-sourced under the Apache-2.0 license.