mirror of https://github.com/InternLM/InternLM
add README_npu
parent 1759c4b9b4
commit ad035eb8bd
@ -0,0 +1,298 @@
# InternLM-NPU

<div align="center">

<img src="./assets/logo.svg" width="200"/>
<div> </div>
<div align="center">
<b><font size="5">InternLM</font></b>
<sup>
<a href="https://internlm.intern-ai.org.cn/">
<i><font size="4">HOT</font></i>
</a>
</sup>
<div> </div>
</div>

[](./LICENSE)
[](https://github.com/internLM/OpenCompass/)

<!-- [](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest) -->

[📘Commercial Application](#license) |
[🤗HuggingFace](https://huggingface.co/internlm) |
[🆕Update News](#news) |
[🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new) |
[📜Technical Report](https://arxiv.org/abs/2403.17297)<br>
[💬Chat Web](https://internlm-chat.intern-ai.org.cn/) |
[🔗API](https://internlm.intern-ai.org.cn/api/document) |
[🧩Modelers](https://modelers.cn/spaces/MindSpore-Lab/INTERNLM2-20B-PLAN)

[English](./README_npu.md) |
[简体中文](./README_npu_zh-CN.md)

</div>

## Introduction

This is a guide to training and inference with the InternLM series models on Ascend NPU.

## News

\[2025.01.15\] InternLM3-8B-Instruct can be used with Xtuner, LLaMA-Factory, and transformers.

## Model Zoo

### InternLM3

| Model                     | Transformers (HF)                                                                 | ModelScope                                                                                                                                                 | Release Date |
|---------------------------|-----------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|
| **InternLM3-8B-Instruct** | [🤗internlm3-8b-instruct](https://huggingface.co/internlm/internlm3-8b-instruct) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm3-8b-instruct](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm3-8b-instruct) | 2025-01-15 |

## Environment Setup

### Installing Ascend CANN Toolkit and Kernels

For installation details, see the [installation guide](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2FCANNCommunityEdition%2F80RC2alpha002%2Fquickstart%2Fquickstart%2Fquickstart_18_0004.html) or run the following commands:

```shell
# Replace the URLs below with the ones matching your CANN version and device model.
# Install the CANN Toolkit.
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run
bash Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run --install

# Install the CANN kernels (here: the 910B variant).
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run
bash Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run --install

# Set the environment variables.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```
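
After installation, you can sanity-check the setup with `npu-smi`, the device management tool that ships with the Ascend driver (this assumes the driver and firmware are already installed on the host):

```shell
# Should list each NPU along with its health status and memory usage.
npu-smi info
```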

## Xtuner

### Installing Xtuner

```shell
git clone https://github.com/InternLM/xtuner.git
cd xtuner
```

Modify `requirements/runtime.txt` with the following changes:

```text
bitsandbytes==0.42.0
mmengine==0.10.5
torchvision==0.19.0
numpy==1.26.4
```

Use the following command for installation:

```shell
pip install -e '.[all]'
```

**Note**:

- By default, the latest version of `torch` is installed. Make sure it matches the version of `torch_npu`; a quick check is sketched below.
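
A minimal sketch of such a check, assuming both packages are already installed and the CANN environment variables have been sourced:

```shell
# Both imports must succeed with matching versions, and the NPU backend
# should report as available.
python -c "import torch, torch_npu; print(torch.__version__, torch.npu.is_available())"
```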

### LoRA Fine-tuning

Use the following commands to copy the reference configuration and rename it to `internlm3_8b_instruct_lora_oasst1_e10.py`:

```shell
xtuner copy-cfg internlm2_5_chat_7b_qlora_oasst1_e3 .
mv internlm2_5_chat_7b_qlora_oasst1_e3_copy.py internlm3_8b_instruct_lora_oasst1_e10.py
```

The modifications to the configuration file `internlm3_8b_instruct_lora_oasst1_e10.py` are as follows:

```python
pretrained_model_name_or_path = 'internlm/internlm3-8b-instruct'

max_epochs = 10

model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16),
        # quantization_config=dict(
        #     type=BitsAndBytesConfig,
        #     load_in_4bit=True,
        #     load_in_8bit=False,
        #     llm_int8_threshold=6.0,
        #     llm_int8_has_fp16_weight=False,
        #     bnb_4bit_compute_dtype=torch.float16,
        #     bnb_4bit_use_double_quant=True,
        #     bnb_4bit_quant_type='nf4')),
    lora=dict(
        type=LoraConfig,
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        bias='none',
        task_type='CAUSAL_LM'))

custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    # dict(
    #     type=EvaluateChatHook,
    #     tokenizer=tokenizer,
    #     every_n_iters=evaluation_freq,
    #     evaluation_inputs=evaluation_inputs,
    #     system=SYSTEM,
    #     prompt_template=prompt_template)
]

randomness = dict(seed=123, deterministic=True)
```

Run the following command to start single-machine, eight-card fine-tuning:

```shell
NPROC_PER_NODE=8 xtuner train internlm3_8b_instruct_lora_oasst1_e10.py --deepspeed deepspeed_zero2
```

The fine-tuning results are saved as `./work_dirs/internlm3_8b_instruct_lora_oasst1_e10/iter_xxx.pth`.

### Model Conversion

Convert the model weight file obtained from fine-tuning into the Hugging Face format, which facilitates subsequent deployment and usage. Use the following command for the conversion:

```shell
xtuner convert pth_to_hf internlm3_8b_instruct_lora_oasst1_e10.py ./work_dirs/internlm3_8b_instruct_lora_oasst1_e10/iter_xxx.pth ./work_dirs/convert_output
```

### Model Merge

LoRA or QLoRA fine-tuning produces an additional `Adapter` layer, which needs to be merged with the original model to create a complete model. Use the following command for model merging, where `$model_path` is the local path of the original model and `--max-shard-size 2GB` limits each weight file to at most 2 GB:

```shell
xtuner convert merge $model_path ./work_dirs/convert_output ./work_dirs/merge_output --max-shard-size 2GB
```

### Chat

Chat with the merged model weights:

```shell
cp path_to_your_model/modeling_internlm3.py ./work_dirs/merge_output
xtuner chat ./work_dirs/merge_output --prompt-template internlm2_chat
```

## LLaMA-Factory

### Installing LLaMA-Factory

```shell
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch-npu,metrics]"
```

### Inference

Create the inference configuration file `examples/inference/internlm2_5_7b_chat.yaml` in the LLaMA-Factory directory:

```yaml
model_name_or_path: xxx  # Only local loading is supported. Set this to the local weight path of InternLM2.5-7B-Chat.
template: intern2
```
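
Since the configuration accepts only a local path, the weights need to be downloaded in advance. One way to do this (a sketch; the target directory `./internlm2_5-7b-chat` is an arbitrary choice, and `huggingface_hub` must be installed for the CLI to be available):

```shell
# Download the weights locally, then point model_name_or_path at this directory.
huggingface-cli download internlm/internlm2_5-7b-chat --local-dir ./internlm2_5-7b-chat
```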

Run the following command to interact with the model:

```shell
llamafactory-cli chat examples/inference/internlm2_5_7b_chat.yaml
```

### Fine-tuning

Create the fine-tuning configuration file `examples/train_lora/internlm2_5_7b_chat_lora_sft.yaml` in the LLaMA-Factory directory:

```yaml
### model
model_name_or_path: xxx  # Only local loading is supported. Set this to the local weight path of InternLM2.5-7B-Chat.

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: identity
template: intern2
cutoff_len: 128
preprocessing_num_workers: 16

### output
output_dir: saves/internlm2_5_7b_chat/lora/sft
logging_steps: 5
save_steps: 20
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 8
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
```

Run the following commands to start fine-tuning:

```shell
export ASCEND_RT_VISIBLE_DEVICES=0
llamafactory-cli train examples/train_lora/internlm2_5_7b_chat_lora_sft.yaml
```
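
The example above pins training to a single NPU. `ASCEND_RT_VISIBLE_DEVICES` accepts a comma-separated device list, so a multi-card run would look like the following sketch (device IDs depend on your machine):

```shell
# Expose four NPUs to the trainer instead of one.
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
llamafactory-cli train examples/train_lora/internlm2_5_7b_chat_lora_sft.yaml
```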

### Accuracy

The loss curve obtained after fine-tuning is as follows:

![]()

### Performance

| Chip Type         | train_samples_per_second |
|-------------------|--------------------------|
| Atlas 900 A2 PODc | 49.662                   |

## Transformers

### Inference

Create the inference script `inference_internlm2_5_7b_chat.py`:

```python
import torch
import torch_npu  # registers the NPU backend so that .npu() is available
from transformers import AutoTokenizer, AutoModelForCausalLM

# If the model has already been downloaded, replace this with the local model path.
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-7b-chat", trust_remote_code=True)
# `torch_dtype=torch.float16` loads the model in float16; otherwise transformers
# loads it in float32, which can run out of device memory.
model = AutoModelForCausalLM.from_pretrained("internlm/internlm2_5-7b-chat", torch_dtype=torch.float16, trust_remote_code=True).npu()
model = model.eval()
response, history = model.chat(tokenizer, "你好,请提供三个管理时间的建议。", history=[])
print(response)
```

Execute the inference script:

```shell
python inference_internlm2_5_7b_chat.py
```

## License

The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow **free** commercial usage. To apply for a commercial license, please fill in the [application form (English)](https://wj.qq.com/s2/12727483/5dba/)/[申请表(中文)](https://wj.qq.com/s2/12725412/f7c1/). For other questions or collaborations, please contact <internlm@pjlab.org.cn>.
@ -0,0 +1,295 @@

# InternLM-NPU

<div align="center">

<img src="./assets/logo.svg" width="200"/>
<div> </div>
<div align="center">
<b><font size="5">InternLM official site</font></b>
<sup>
<a href="https://internlm.intern-ai.org.cn/">
<i><font size="4">HOT</font></i>
</a>
</sup>
<div> </div>
</div>

[](https://github.com/open-mmlab/mmdetection/blob/main/LICENSE)
[](https://github.com/internLM/OpenCompass/)

<!-- [](https://internlm.readthedocs.io/zh_CN/latest/?badge=latest) -->

[📘Commercial Application](#license) |
[🤗HuggingFace](https://huggingface.co/internlm) |
[🆕Update News](#news) |
[🤔Reporting Issues](https://github.com/InternLM/InternLM/issues/new) |
[📜Technical Report](https://arxiv.org/abs/2403.17297)<br>
[💬Chat Web](https://internlm-chat.intern-ai.org.cn/) |
[🔗API](https://internlm.intern-ai.org.cn/api/document) |
[🧩Modelers](https://modelers.cn/spaces/MindSpore-Lab/INTERNLM2-20B-PLAN)

[English](./README_npu.md) |
[简体中文](./README_npu_zh-CN.md)

</div>

## Introduction

This is a guide to training and inference with the InternLM series models on Ascend NPU.

## News

\[2025.01.15\] InternLM3-8B-Instruct can be used with Xtuner, LLaMA-Factory, and transformers.

## Model Zoo

### InternLM3

| Model                     | Transformers (HF)                                                                 | ModelScope                                                                                                                                                 | Release Date |
|---------------------------|-----------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|
| **InternLM3-8B-Instruct** | [🤗internlm3-8b-instruct](https://huggingface.co/internlm/internlm3-8b-instruct) | [<img src="./assets/modelscope_logo.png" width="20px" /> internlm3-8b-instruct](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm3-8b-instruct) | 2025-01-15 |

## Environment Setup

### Installing Ascend CANN Toolkit and Kernels

For installation details, see the [installation guide](https://gitee.com/link?target=https%3A%2F%2Fwww.hiascend.com%2Fdocument%2Fdetail%2Fzh%2FCANNCommunityEdition%2F80RC2alpha002%2Fquickstart%2Fquickstart%2Fquickstart_18_0004.html) or run the following commands:

```shell
# Replace the URLs below with the ones matching your CANN version and device model.
# Install the CANN Toolkit.
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run
bash Ascend-cann-toolkit_8.0.RC1.alpha001_linux-"$(uname -i)".run --install

# Install the CANN kernels (here: the 910B variant).
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/Milan-ASL/Milan-ASL%20V100R001C17SPC701/Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run
bash Ascend-cann-kernels-910b_8.0.RC1.alpha001_linux.run --install

# Set the environment variables.
source /usr/local/Ascend/ascend-toolkit/set_env.sh
```

## Xtuner

### Installing Xtuner

```shell
git clone https://github.com/InternLM/xtuner.git
cd xtuner
```

Modify `requirements/runtime.txt` with the following changes:

```text
bitsandbytes==0.42.0
mmengine==0.10.5
torchvision==0.19.0
numpy==1.26.4
```

Use the following command for installation:

```shell
pip install -e '.[all]'
```

**Note**:

- By default, the latest version of `torch` is installed. Make sure it matches the version of `torch_npu`.

### LoRA Fine-tuning

Use the following commands to copy the reference configuration and rename it to `internlm3_8b_instruct_lora_oasst1_e10.py`:

```shell
xtuner copy-cfg internlm2_5_chat_7b_qlora_oasst1_e3 .
mv internlm2_5_chat_7b_qlora_oasst1_e3_copy.py internlm3_8b_instruct_lora_oasst1_e10.py
```

The modifications to the configuration file `internlm3_8b_instruct_lora_oasst1_e10.py` are as follows:

```python
pretrained_model_name_or_path = 'internlm/internlm3-8b-instruct'

max_epochs = 10

model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16),
        # quantization_config=dict(
        #     type=BitsAndBytesConfig,
        #     load_in_4bit=True,
        #     load_in_8bit=False,
        #     llm_int8_threshold=6.0,
        #     llm_int8_has_fp16_weight=False,
        #     bnb_4bit_compute_dtype=torch.float16,
        #     bnb_4bit_use_double_quant=True,
        #     bnb_4bit_quant_type='nf4')),
    lora=dict(
        type=LoraConfig,
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        bias='none',
        task_type='CAUSAL_LM'))

custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    # dict(
    #     type=EvaluateChatHook,
    #     tokenizer=tokenizer,
    #     every_n_iters=evaluation_freq,
    #     evaluation_inputs=evaluation_inputs,
    #     system=SYSTEM,
    #     prompt_template=prompt_template)
]

randomness = dict(seed=123, deterministic=True)
```

Run the following command to start single-machine, eight-card fine-tuning:

```shell
NPROC_PER_NODE=8 xtuner train internlm3_8b_instruct_lora_oasst1_e10.py --deepspeed deepspeed_zero2
```

The fine-tuning results are saved as `./work_dirs/internlm3_8b_instruct_lora_oasst1_e10/iter_xxx.pth`.

### Model Conversion

Convert the model weight file obtained from fine-tuning into the Hugging Face format, which facilitates subsequent deployment and usage. Use the following command for the conversion:

```shell
xtuner convert pth_to_hf internlm3_8b_instruct_lora_oasst1_e10.py ./work_dirs/internlm3_8b_instruct_lora_oasst1_e10/iter_xxx.pth ./work_dirs/convert_output
```

### Model Merge

LoRA or QLoRA fine-tuning produces an additional `Adapter` layer, which needs to be merged with the original model to create a complete model. Use the following command for model merging, where `$model_path` is the local path of the original model and `--max-shard-size 2GB` limits each weight file to at most 2 GB:

```shell
xtuner convert merge $model_path ./work_dirs/convert_output ./work_dirs/merge_output --max-shard-size 2GB
```

### Chat

Chat with the merged model weights:

```shell
cp path_to_your_model/modeling_internlm3.py ./work_dirs/merge_output
xtuner chat ./work_dirs/merge_output --prompt-template internlm2_chat
```

## LLaMA-Factory

### Installing LLaMA-Factory

```shell
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch-npu,metrics]"
```

### Inference

Create the inference configuration file `examples/inference/internlm2_5_7b_chat.yaml` in the LLaMA-Factory directory:

```yaml
model_name_or_path: xxx  # Only local loading is supported. Set this to the local weight path of InternLM2.5-7B-Chat.
template: intern2
```

Run the following command to interact with the model:

```shell
llamafactory-cli chat examples/inference/internlm2_5_7b_chat.yaml
```

### Fine-tuning

Create the fine-tuning configuration file `examples/train_lora/internlm2_5_7b_chat_lora_sft.yaml` in the LLaMA-Factory directory:

```yaml
### model
model_name_or_path: xxx  # Only local loading is supported. Set this to the local weight path of InternLM2.5-7B-Chat.

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: identity
template: intern2
cutoff_len: 128
preprocessing_num_workers: 16

### output
output_dir: saves/internlm2_5_7b_chat/lora/sft
logging_steps: 5
save_steps: 20
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 8
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
```

Run the following commands to start fine-tuning:

```shell
export ASCEND_RT_VISIBLE_DEVICES=0
llamafactory-cli train examples/train_lora/internlm2_5_7b_chat_lora_sft.yaml
```

### Accuracy

The loss curve obtained after fine-tuning is as follows:

![]()

### Performance

| Chip Type         | train_samples_per_second |
|-------------------|--------------------------|
| Atlas 900 A2 PODc | 49.662                   |

## Transformers

### Inference

Create the inference script `inference_internlm2_5_7b_chat.py`:

```python
import torch
import torch_npu  # registers the NPU backend so that .npu() is available
from transformers import AutoTokenizer, AutoModelForCausalLM

# If the model has already been downloaded, replace this with the local model path.
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2_5-7b-chat", trust_remote_code=True)
# `torch_dtype=torch.float16` loads the model in float16; otherwise transformers
# loads it in float32, which can run out of device memory.
model = AutoModelForCausalLM.from_pretrained("internlm/internlm2_5-7b-chat", torch_dtype=torch.float16, trust_remote_code=True).npu()
model = model.eval()
response, history = model.chat(tokenizer, "你好,请提供三个管理时间的建议。", history=[])
print(response)
```

Execute the inference script:

```shell
python inference_internlm2_5_7b_chat.py
```

## License

The code in this repository is open-sourced under the Apache-2.0 license. The model weights are fully open for academic research, and free commercial use may be requested via the [application form](https://wj.qq.com/s2/12725412/f7c1/). For other questions or collaborations, please contact <internlm@pjlab.org.cn>.
Binary file (image, 33 KiB) not shown.