add README_npu

pull/816/head
liutongtong27 2025-01-11 19:42:01 +08:00
parent 1759c4b9b4
commit ad035eb8bd
3 changed files with 593 additions and 0 deletions

README_npu.md (new file, 298 lines)

@@ -0,0 +1,298 @@
# InternLM-NPU
<div align="center">
<img src="./assets/logo.svg" width="200"/>
<div> </div>
<div align="center">
<b><font size="5">InternLM</font></b>
<sup>
<a href="https://internlm.intern-ai.org.cn/">
<i><font size="4">HOT</font></i>
</a>
</sup>
<div> </div>
</div>
[![license](./assets/license.svg)](./LICENSE)
[![evaluation](./assets/compass_support.svg)](https://github.com/internLM/OpenCompass/)
<!-- [![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest) -->
[📘Commercial Application](#license) |
[🤗HuggingFace](https://huggingface.co/internlm) |
[🆕Update News](#news) |
[🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new) |
[📜Technical Report](https://arxiv.org/abs/2403.17297)<br>
[💬Chat Web](https://internlm-chat.intern-ai.org.cn/) |
[🔗API](https://internlm.intern-ai.org.cn/api/document) |
[🧩Modelers](https://modelers.cn/spaces/MindSpore-Lab/INTERNLM2-20B-PLAN)
[English](./README_npu.md) |
[简体中文](./README_npu_zh-CN.md)
</div>
## Introduction
This guide describes how to train and run inference with the InternLM series models on Ascend NPUs.
## News
\[2025.01.15\] InternLM3-8B-Instruct is now supported in XTuner, LLaMA-Factory, and Transformers.
## Model Zoo
### InternLM3
| Model | Transformers (HF) | ModelScope | Release Date |
|---------------------------| ------------------------------------------ | ---------------------------------------- |--------------|
| **InternLM3-8B-Instruct** | [🤗internlm3-8b-instruct](https://huggingface.co/internlm/internlm3-8b-instruct) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm3-8b-instruct](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm3-8b-instruct) | 2025-01-15 |
## Environment Setup
### Installing Ascend CANN Toolkit and Kernels
For installation details, see the [installation guide](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2FCANNCommunityEdition%2F80RC2alpha002%2Fquickstart%2Fquickstart%2Fquickstart_18_0004.html) or run the following commands:
```shell
# Replace the URLs below with those matching your CANN version and device model.
# Install CANN Toolkit.
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run
bash Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run --install
# Install CANN Kernels.
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run
bash Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run --install
# Set environment variables.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
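To verify that the NPU stack is visible to PyTorch, you can run a quick sanity check (this assumes `torch` and `torch_npu` are already installed in the current environment):

```shell
# Should print "True" when CANN, the driver, and torch_npu are set up correctly.
python -c "import torch, torch_npu; print(torch.npu.is_available())"
```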
## XTuner
### Installing XTuner
```shell
git clone https://github.com/InternLM/xtuner.git
cd xtuner
```
Modify `requirements/runtime.txt` with the following changes:
```text
bitsandbytes==0.42.0
mmengine==0.10.5
torchvision==0.19.0
numpy==1.26.4
```
Use the following command for installation:
```shell
pip install -e '.[all]'
```
**Note**:
- By default the latest `torch` is installed; make sure its version matches your `torch_npu` version (see the example below).
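As an illustration (the version numbers below are placeholders, not a recommendation; consult the `torch_npu` release notes for the pairing that matches your CANN version):

```shell
# Illustrative only -- the torch and torch-npu releases must match each other.
pip install torch==2.1.0
pip install torch-npu==2.1.0
```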
### LoRA Fine-tuning
Copy the `internlm2_5_chat_7b_qlora_oasst1_e3` configuration and rename the copy to `internlm3_8b_instruct_lora_oasst1_e10.py`:
```shell
xtuner copy-cfg internlm2_5_chat_7b_qlora_oasst1_e3 .
mv internlm2_5_chat_7b_qlora_oasst1_e3_copy.py internlm3_8b_instruct_lora_oasst1_e10.py
```
The modifications to the configuration file `internlm3_8b_instruct_lora_oasst1_e10.py` are as follows:
```python
pretrained_model_name_or_path = 'internlm/internlm3-8b-instruct'
max_epochs = 10
model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16),
        # quantization_config=dict(
        #     type=BitsAndBytesConfig,
        #     load_in_4bit=True,
        #     load_in_8bit=False,
        #     llm_int8_threshold=6.0,
        #     llm_int8_has_fp16_weight=False,
        #     bnb_4bit_compute_dtype=torch.float16,
        #     bnb_4bit_use_double_quant=True,
        #     bnb_4bit_quant_type='nf4')),
    lora=dict(
        type=LoraConfig,
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        bias='none',
        task_type='CAUSAL_LM'))
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    # dict(
    #     type=EvaluateChatHook,
    #     tokenizer=tokenizer,
    #     every_n_iters=evaluation_freq,
    #     evaluation_inputs=evaluation_inputs,
    #     system=SYSTEM,
    #     prompt_template=prompt_template)
]
randomness = dict(seed=123, deterministic=True)
```
Run the following command to start fine-tuning on a single machine with eight NPUs:
```shell
NPROC_PER_NODE=8 xtuner train internlm3_8b_instruct_lora_oasst1_e10.py --deepspeed deepspeed_zero2
```
The fine-tuned weights are saved to `./work_dirs/internlm3_8b_instruct_lora_oasst1_e10/iter_xxx.pth`.
### Model Conversion
Convert the fine-tuned weights into the Hugging Face format to facilitate subsequent deployment and use:
```shell
xtuner convert pth_to_hf internlm3_8b_instruct_lora_oasst1_e10.py ./work_dirs/internlm3_8b_instruct_lora_oasst1_e10/iter_xxx.pth ./work_dirs/convert_output
```
### Model Merging
LoRA and QLoRA fine-tuning produce an additional adapter, which must be merged with the original model to obtain a complete model. Use the following command for model merging, where `$model_path` is the local path of the original model and `--max-shard-size 2GB` caps each weight file at 2 GB:
```shell
xtuner convert merge $model_path ./work_dirs/convert_output ./work_dirs/merge_output --max-shard-size 2GB
```
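Before chatting, you can optionally verify that the merge output is a loadable Hugging Face model directory; a minimal check, assuming the paths used above:

```shell
# Loads only the model config as a lightweight validity check.
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('./work_dirs/merge_output', trust_remote_code=True))"
```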
### Chat
Chat with the merged model weights (copy `modeling_internlm3.py` from the original model directory first so that the custom model code can be loaded):
```shell
cp path_to_your_model/modeling_internlm3.py ./work_dirs/merge_output
xtuner chat ./work_dirs/merge_output --prompt-template internlm2_chat
```
## LLaMA-Factory
### Installing LLaMA-Factory
```shell
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch-npu,metrics]"
```
### Inference
Create the inference configuration file `examples/inference/internlm2_5_7b_chat.yaml` in the LLaMA-Factory directory:
```yaml
model_name_or_path: xxx  # Only local loading is supported; set this to the local weight path of InternLM2.5-7B-Chat.
template: intern2
```
Run the following command to interact with the model:
```shell
llamafactory-cli chat examples/inference/internlm2_5_7b_chat.yaml
```
### Fine-tuning
Create the fine-tuning configuration file `examples/train_lora/internlm2_5_7b_chat_lora_sft.yaml` in the LLaMA-Factory directory:
```yaml
### model
model_name_or_path: xxx  # Only local loading is supported; set this to the local weight path of InternLM2.5-7B-Chat.
### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
### dataset
dataset: identity
template: intern2
cutoff_len: 128
preprocessing_num_workers: 16
### output
output_dir: saves/internlm2_5_7b_chat/lora/sft
logging_steps: 5
save_steps: 20
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 8
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
```
Run the following commands to start fine-tuning on a single NPU:
```shell
export ASCEND_RT_VISIBLE_DEVICES=0
llamafactory-cli train examples/train_lora/internlm2_5_7b_chat_lora_sft.yaml
```
### Accuracy
The loss curve obtained after fine-tuning is as follows:
![training_loss](assets/training_loss.png)
### Performance
| Chip Type | train_samples_per_second |
|-------------------|--------------------------|
| Atlas 900 A2 PODc | 49.662 |
## Transformers
### Inference
Create the inference script `inference_internlm2_5_7b_chat.py`:
```python
import torch
import torch_npu  # registers the Ascend NPU backend with PyTorch
from transformers import AutoTokenizer, AutoModelForCausalLM
# If the model has already been downloaded, this can be replaced with the local model path.
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-7b-chat", trust_remote_code=True)
# `torch_dtype=torch.float16` loads the model in float16 precision; otherwise transformers loads it in float32, which can exhaust device memory.
model = AutoModelForCausalLM.from_pretrained("internlm/internlm2_5-7b-chat", torch_dtype=torch.float16, trust_remote_code=True).npu()
model = model.eval()
response, history = model.chat(tokenizer, "Hello, please give me three suggestions for managing my time.", history=[])
print(response)
```
Execute the inference script:
```shell
python inference_internlm2_5_7b_chat.py
```
## License
The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow **free** commercial usage. To apply for a commercial license, please fill in the [application form (English)](https://wj.qq.com/s2/12727483/5dba/)/[申请表(中文)](https://wj.qq.com/s2/12725412/f7c1/). For other questions or collaborations, please contact <internlm@pjlab.org.cn>.

README_npu_zh-CN.md (new file, 295 lines)

@@ -0,0 +1,295 @@
# InternLM-NPU
<div align="center">
<img src="./assets//logo.svg" width="200"/>
<div>&nbsp;</div>
<div align="center">
<b><font size="5">书生·浦语 官网</font></b>
<sup>
<a href="https://internlm.intern-ai.org.cn/">
<i><font size="4">HOT</font></i>
</a>
</sup>
<div>&nbsp;</div>
</div>
[![license](./assets/license.svg)](./LICENSE)
[![evaluation](./assets/compass_support.svg)](https://github.com/internLM/OpenCompass/)
<!-- [![Documentation Status](https://readthedocs.org/projects/internlm/badge/?version=latest)](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest) -->
[📘Commercial Application](#license) |
[🤗HuggingFace](https://huggingface.co/internlm) |
[🆕Update News](#news) |
[🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new) |
[📜Technical Report](https://arxiv.org/abs/2403.17297)<br>
[💬Chat Web](https://internlm-chat.intern-ai.org.cn/) |
[🔗API](https://internlm.intern-ai.org.cn/api/document) |
[🧩Modelers](https://modelers.cn/spaces/MindSpore-Lab/INTERNLM2-20B-PLAN)
[English](./README_npu.md) |
[简体中文](./README_npu_zh-CN.md)
</div>
## Introduction
This guide describes how to train and run inference with the InternLM series models on Ascend NPUs.
## News
\[2025.01.15\] InternLM3-8B-Instruct is now supported in XTuner, LLaMA-Factory, and Transformers.
## Model Zoo
### InternLM3
| Model | Transformers (HF) | ModelScope | Release Date |
|---------------------------| ------------------------------------------ | ---------------------------------------- |--------------|
| **InternLM3-8B-Instruct** | [🤗internlm3-8b-instruct](https://huggingface.co/internlm/internlm3-8b-instruct) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm3-8b-instruct](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm3-8b-instruct) | 2025-01-15 |
## Environment Setup
### Installing Ascend CANN Toolkit and Kernels
For installation details, see the [installation guide](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2FCANNCommunityEdition%2F80RC2alpha002%2Fquickstart%2Fquickstart%2Fquickstart_18_0004.html) or run the following commands:
```shell
# Replace the URLs below with those matching your CANN version and device model.
# Install CANN Toolkit.
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run
bash Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run --install
# Install CANN Kernels.
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run
bash Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run --install
# Set environment variables.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
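To verify that the NPU stack is visible to PyTorch, you can run a quick sanity check (this assumes `torch` and `torch_npu` are already installed in the current environment):

```shell
# Should print "True" when CANN, the driver, and torch_npu are set up correctly.
python -c "import torch, torch_npu; print(torch.npu.is_available())"
```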
## XTuner
### Installing XTuner
```shell
git clone https://github.com/InternLM/xtuner.git
cd xtuner
```
Modify `requirements/runtime.txt` with the following changes:
```text
bitsandbytes==0.42.0
mmengine==0.10.5
torchvision==0.19.0
numpy==1.26.4
```
Use the following command for installation:
```shell
pip install -e '.[all]'
```
**Note**:
- By default the latest `torch` is installed; make sure its version matches your `torch_npu` version (see the example below).
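As an illustration (the version numbers below are placeholders, not a recommendation; consult the `torch_npu` release notes for the pairing that matches your CANN version):

```shell
# Illustrative only -- the torch and torch-npu releases must match each other.
pip install torch==2.1.0
pip install torch-npu==2.1.0
```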
### LoRA Fine-tuning
Copy the `internlm2_5_chat_7b_qlora_oasst1_e3` configuration and rename the copy to `internlm3_8b_instruct_lora_oasst1_e10.py`:
```shell
xtuner copy-cfg internlm2_5_chat_7b_qlora_oasst1_e3 .
mv internlm2_5_chat_7b_qlora_oasst1_e3_copy.py internlm3_8b_instruct_lora_oasst1_e10.py
```
The modifications to the configuration file `internlm3_8b_instruct_lora_oasst1_e10.py` are as follows:
```python
pretrained_model_name_or_path = 'internlm/internlm3-8b-instruct'
max_epochs = 10
model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16),
        # quantization_config=dict(
        #     type=BitsAndBytesConfig,
        #     load_in_4bit=True,
        #     load_in_8bit=False,
        #     llm_int8_threshold=6.0,
        #     llm_int8_has_fp16_weight=False,
        #     bnb_4bit_compute_dtype=torch.float16,
        #     bnb_4bit_use_double_quant=True,
        #     bnb_4bit_quant_type='nf4')),
    lora=dict(
        type=LoraConfig,
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        bias='none',
        task_type='CAUSAL_LM'))
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    # dict(
    #     type=EvaluateChatHook,
    #     tokenizer=tokenizer,
    #     every_n_iters=evaluation_freq,
    #     evaluation_inputs=evaluation_inputs,
    #     system=SYSTEM,
    #     prompt_template=prompt_template)
]
randomness = dict(seed=123, deterministic=True)
```
Run the following command to start fine-tuning on a single machine with eight NPUs:
```shell
NPROC_PER_NODE=8 xtuner train internlm3_8b_instruct_lora_oasst1_e10.py --deepspeed deepspeed_zero2
```
The fine-tuned weights are saved to `./work_dirs/internlm3_8b_instruct_lora_oasst1_e10/iter_xxx.pth`.
### Model Conversion
Convert the fine-tuned weights into the Hugging Face format to facilitate subsequent deployment and use:
```shell
xtuner convert pth_to_hf internlm3_8b_instruct_lora_oasst1_e10.py ./work_dirs/internlm3_8b_instruct_lora_oasst1_e10/iter_xxx.pth ./work_dirs/convert_output
```
### Model Merging
LoRA and QLoRA fine-tuning produce an additional adapter, which must be merged with the original model to obtain a complete model. Use the following command for model merging, where `$model_path` is the local path of the original model and `--max-shard-size 2GB` caps each weight file at 2 GB:
```shell
xtuner convert merge $model_path ./work_dirs/convert_output ./work_dirs/merge_output --max-shard-size 2GB
```
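Before chatting, you can optionally verify that the merge output is a loadable Hugging Face model directory; a minimal check, assuming the paths used above:

```shell
# Loads only the model config as a lightweight validity check.
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('./work_dirs/merge_output', trust_remote_code=True))"
```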
### Chat
Chat with the merged model weights (copy `modeling_internlm3.py` from the original model directory first so that the custom model code can be loaded):
```shell
cp path_to_your_model/modeling_internlm3.py ./work_dirs/merge_output
xtuner chat ./work_dirs/merge_output --prompt-template internlm2_chat
```
## LLaMA-Factory
### Installing LLaMA-Factory
```shell
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch-npu,metrics]"
```
### Inference
Create the inference configuration file `examples/inference/internlm2_5_7b_chat.yaml` in the LLaMA-Factory directory:
```yaml
model_name_or_path: xxx  # Only local loading is supported; set this to the local weight path of InternLM2.5-7B-Chat.
template: intern2
```
Run the following command to interact with the model:
```shell
llamafactory-cli chat examples/inference/internlm2_5_7b_chat.yaml
```
### Fine-tuning
Create the fine-tuning configuration file `examples/train_lora/internlm2_5_7b_chat_lora_sft.yaml` in the LLaMA-Factory directory:
```yaml
### model
model_name_or_path: xxx  # Only local loading is supported; set this to the local weight path of InternLM2.5-7B-Chat.
### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
### dataset
dataset: identity
template: intern2
cutoff_len: 128
preprocessing_num_workers: 16
### output
output_dir: saves/internlm2_5_7b_chat/lora/sft
logging_steps: 5
save_steps: 20
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 8
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
```
Run the following commands to start fine-tuning on a single NPU:
```shell
export ASCEND_RT_VISIBLE_DEVICES=0
llamafactory-cli train examples/train_lora/internlm2_5_7b_chat_lora_sft.yaml
```
### Accuracy
The loss curve obtained after fine-tuning is as follows:
![training_loss](assets/training_loss.png)
### Performance
| Chip Type | train_samples_per_second |
|-------------------|--------------------------|
| Atlas 900 A2 PODc | 49.662 |
## Transformers
### Inference
Create the inference script `inference_internlm2_5_7b_chat.py`:
```python
import torch
import torch_npu  # registers the Ascend NPU backend with PyTorch
from transformers import AutoTokenizer, AutoModelForCausalLM
# If the model has already been downloaded, this can be replaced with the local model path.
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-7b-chat", trust_remote_code=True)
# `torch_dtype=torch.float16` loads the model in float16 precision; otherwise transformers loads it in float32, which can exhaust device memory.
model = AutoModelForCausalLM.from_pretrained("internlm/internlm2_5-7b-chat", torch_dtype=torch.float16, trust_remote_code=True).npu()
model = model.eval()
response, history = model.chat(tokenizer, "Hello, please give me three suggestions for managing my time.", history=[])
print(response)
```
Execute the inference script:
```shell
python inference_internlm2_5_7b_chat.py
```
## License
The code in this repository is licensed under Apache-2.0. Model weights are fully open for academic research and also allow **free** commercial usage; to apply for a commercial license, please fill in the [application form](https://wj.qq.com/s2/12725412/f7c1/). For other questions or collaborations, please contact <internlm@pjlab.org.cn>.

assets/training_loss.png (new binary file, 33 KiB; binary not shown)