Doc(moe): add documentation for moe training (#411)

* add doc for moe

* fix moe and zero1 check in args_sanity_check

* restore moe config file
Wenwen Qu 2023-10-19 10:01:12 +08:00 committed by GitHub
parent 3ea94f2e2a
commit 2c5395fdfd
9 changed files with 308 additions and 20 deletions

View File

@@ -7,7 +7,7 @@ msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-09-07 10:56+0800\n"
"POT-Creation-Date: 2023-10-10 17:48+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
@@ -16,7 +16,7 @@ msgstr ""
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
"Generated-By: Babel 2.13.0\n"
#: ../../source/index.rst:8 11e029810acf410180311a3c63eb01f4
msgid "InternLM"
@@ -46,38 +46,42 @@ msgstr "Parallel Training"
msgid "混合精度"
msgstr "Mixed Precision"
#: ../../source/index.rst:59 9234725f3c464731993d73607608c874
#: ../../source/index.rst:59
msgid "混合专家模型"
msgstr "Mixture-of-Experts"
#: ../../source/index.rst:67 9234725f3c464731993d73607608c874
msgid "模型备份"
msgstr "Model Checkpointing"
#: ../../source/index.rst:67 8e4ce037017f4510b2892a66003877fa
#: ../../source/index.rst:75 8e4ce037017f4510b2892a66003877fa
msgid "性能分析"
msgstr "Profiler"
#: ../../source/index.rst:75 a36e02819ecd4b448a8cb4ebbecb6600
#: ../../source/index.rst:83 a36e02819ecd4b448a8cb4ebbecb6600
msgid "训练监控"
msgstr "Monitor"
#: ../../source/index.rst:83 b912e292486f455c8b5cdd75962e8ac2
#: ../../source/index.rst:91 b912e292486f455c8b5cdd75962e8ac2
msgid "训练样例"
msgstr "Example"
#: ../../source/index.rst:91 ea9e9281720941a1830e5df7a2badf7a
#: ../../source/index.rst:99 ea9e9281720941a1830e5df7a2badf7a
msgid "常见问题"
msgstr "Q&A"
#: ../../source/index.rst:99 e08edc5aa1c74965b10084b393b88fae
#: ../../source/index.rst:107 e08edc5aa1c74965b10084b393b88fae
msgid "索引和表格"
msgstr "Indices and tables"
#: ../../source/index.rst:101 f3fdca059caa49dcad09aa44be7f02d6
#: ../../source/index.rst:109 f3fdca059caa49dcad09aa44be7f02d6
msgid ":ref:`genindex`"
msgstr ""
#: ../../source/index.rst:102 b3791e811315435097bb507edc3f4b9b
#: ../../source/index.rst:110 b3791e811315435097bb507edc3f4b9b
msgid ":ref:`modindex`"
msgstr ""
#: ../../source/index.rst:103 a164b772960f4ab8b18c7e8820f69f55
#: ../../source/index.rst:111 a164b772960f4ab8b18c7e8820f69f55
msgid ":ref:`search`"
msgstr ""

View File

@@ -0,0 +1,208 @@
# SOME DESCRIPTIVE TITLE.
# Copyright (C) 2023, InternLM Team
# This file is distributed under the same license as the InternLM package.
# FIRST AUTHOR <EMAIL@ADDRESS>, 2023.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: InternLM \n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2023-10-10 19:25+0800\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language: en\n"
"Language-Team: en <LL@li.org>\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.12.1\n"
#: ../../source/moe.rst:2
msgid "混合专家模型"
msgstr "Mixture-of-Experts"
#: ../../source/moe.rst:3
msgid ""
"混合专家模型Mixture-of-Experts, MoE是一种特殊的模型结构。 "
"混合专家模型将模型拆分为一系列称为“专家”的子模型,每个“专家” 具有唯一的权重。 "
"混合专家模型可以针对每个输入标记仅激活一个或少量的专家参与运算。 例如,图 :ref:`switch_transformer` 是 `Switch"
" Transformer <https://arxiv.org/pdf/2101.03961.pdf>`_ "
"提出的稀疏混合专家模型结构其中的前向神经网络FFN被分解为多个子网络在计算时仅有少部分的模型参数参与计算以实现更有效的计算和资源分配。"
msgstr ""
"Mixture-of-Experts (MoE) is a special model structure. MoE partitions the model into a series of sub-models called \"experts\", "
"each with unique parameters. MoE only activates one or a small number of experts for each input token. For example, the figure :ref:`switch_transformer` "
" shows the sparse MoE architecture proposed by `Switch Transformer <https://arxiv.org/pdf/2101.03961.pdf>`_ . "
"The Forward Neural Network (FFN) is decomposed into multiple sub-networks, and only a small number of model parameters "
"are involved in the calculation to achieve more efficient calculation and resource allocation. "
#: ../../source/moe.rst:8
msgid ""
"稀疏混合专家模型通常还包含一个门控gating机制例如图 :ref:`switch_transformer` "
"中的Router网络。门控网络负责选择激活哪些专家参与计算并组合不同专家的预测结果。"
msgstr ""
"Sparse MoE usually also includes a gating mechanism, such as the Router in Figure :ref:`switch_transformer` . "
"The gating network is responsible for selecting which experts to activate and combining the prediction results of "
"different experts."
#: ../../source/moe.rst:16
msgid "switch transformer"
msgstr "switch transformer"
#: ../../source/moe.rst:19
msgid "参数配置"
msgstr "Parameter Settings"
#: ../../source/moe.rst:20
msgid "如果在启动训练时要使用混合专家模型,可进行如下相关配置:"
msgstr ""
"If MoE is expected to be used in the training, please make the following settings in the configuration file:"
#: ../../source/moe.rst:22
msgid "模型相关配置"
msgstr "Model related settings"
#: ../../source/moe.rst:31
msgid "num_experts专家网络个数。在InternLM中每个专家有着相同的网络结构但维护着不同的训练参数。"
msgstr ""
"num_experts: The number of expert networks. In InternLM, each expert has "
"the same network structure but maintains different training parameters."
#: ../../source/moe.rst:32
msgid ""
"moe_gate_k门控策略。决定如何将输入标记路由到不同的专家进行计算。目前InternLM支持top1gating和top2gating两种门控策略。关于这些门控策略的详细的信息可以参考"
" `GShard <https://arxiv.org/pdf/2006.16668.pdf>`_。"
msgstr ""
"moe_gate_k: Gating strategy. Determines how to route input tokens to "
"different experts for calculation. Currently, InternLM supports top1gating"
" and top2gating strategies. For detailed information about "
"these gating strategies, please refer to `GShard <https://arxiv.org/pdf/2006.16668.pdf>`_."
#: ../../source/moe.rst:34
msgid ""
"注意在目前的InternLM中每个专家都是根据配置文件中HIDDEN_SIZE和MLP_RATIO构造的一个 `SwiGLU网络 <https://arxiv.org/pdf/2002.05202.pdf>`_同时支持张量并行。用户可以根据需要构造自己的专家网络。"
msgstr ""
"Note: In the current version of InternLM, each expert is a `SwiGLU network <https://arxiv.org/pdf/2002.05202.pdf>`_ based on "
"HIDDEN_SIZE and MLP_RATIO in the configuration file, and supports tensor parallelism. Users can construct their own expert networks as needed."
#: ../../source/moe.rst:37
msgid "损失相关配置"
msgstr "Loss related settings"
#: ../../source/moe.rst:46
msgid ""
"在top1gating和top2gating门控策略中不同的专家处理的标记数量存在差异。为了提高模型效果应尽量保证输入标记被均匀地路由到不同的专家上。"
"InternLM采用 `GShard <https://arxiv.org/pdf/2006.16668.pdf>`_ 提出的负载平衡损失优化门控策略。 "
"Moe_loss_coeff项决定着负载平衡损失项将如何添加到最终的损失项中 :math:`l=l_{nll}+k·l_{moe}` )。"
"关于该部分的详细信息可以进一步参考 `GShard <https://arxiv.org/pdf/2006.16668.pdf>`_。"
msgstr ""
"In top1gating and top2gating strategies, the number of tokens to process may be different for different experts. "
"In order to improve the model effect, the input tokens should be evenly routed to different experts. "
"InternLM adopts the balancing loss to optimize the gating network proposed by GShard. "
"The moe_loss_coeff determines how the balancing loss should be added to the final loss ( :math:`l=l_{nll}+k·l_{moe}` ). "
"The details can be found in `GShard <https://arxiv.org/pdf/2006.16668.pdf>`_. "
#: ../../source/moe.rst:49
msgid "注意:这些参数需要和其他参数一起使用,具体请参考 :doc:`/usage` “训练配置”相关章节的内容。"
msgstr "Note: These parameters need to be used together with other parameters, please refer to :doc:`/usage`: Training Configuration"
#: ../../source/moe.rst:52
msgid "模型训练"
msgstr "Model Training"
#: ../../source/moe.rst:54
msgid ""
"internlm.model.modeling_moe提供了一个标准的混合专家模型的实现该模型的网络结构和图 :ref:`switch_transformer` "
"一致其中使用到internlm.model.moe.MoE实现MoE网络。用户在配置文件中指定模型类型"
msgstr ""
"internlm.model.modeling_moe provides an implementation of a standard MoE. "
"The model structure is consistent with Figure :ref:`switch_transformer` ,"
" which uses internlm.model.moe.MoE to implement the MoE network. "
"To use moe model, specify the model type in the configuration file:"
#: ../../source/moe.rst:60
msgid "并配置好稀疏专家网络的相关参数后就可以像正常启动InternLM一样进行混合专家模型的分布式训练具体请参考 :doc:`/usage` “启动训练”相关章节的内容。"
msgstr ""
"After configuring the relevant parameters of the sparse MoE, "
"the distributed training can start as the normal training process. please refer to :doc:`/usage`: Start Training"
#: internlm.model.moe.MoE:1 of
msgid "Initialize an MoE layer."
msgstr ""
#: internlm.model.moe.MoE of
msgid "参数"
msgstr "parameter"
#: internlm.model.moe.MoE:3 of
msgid ""
"the hidden dimension of the model, importantly this is also the input and"
" output dimension."
msgstr ""
#: internlm.model.moe.MoE:5 of
msgid "default=1, the total number of experts per layer."
msgstr ""
#: internlm.model.moe.MoE:7 of
msgid "default=1, number of ranks in the expert parallel world or group."
msgstr ""
#: internlm.model.moe.MoE:9 of
msgid "default=1, top-k gating value, only supports k=1 or k=2."
msgstr ""
#: internlm.model.moe.MoE:11 of
msgid "default=1.0, the capacity of the expert at training time."
msgstr ""
#: internlm.model.moe.MoE:13 of
msgid "default=1.0, the capacity of the expert at eval time."
msgstr ""
#: internlm.model.moe.MoE:15 of
msgid ""
"default=4, the minimum capacity per expert regardless of the "
"capacity_factor."
msgstr ""
#: internlm.model.moe.MoE:17 of
msgid ""
"default=None, noisy gate policy, valid options are 'Jitter', 'RSample' or"
" 'None'."
msgstr ""
#: internlm.model.moe.MoE:20 of
msgid "default=True, whether to use the default MoE layer."
msgstr ""
#: internlm.model.moe.MoE:22 of
msgid ""
"default=True, whether to drop tokens - (setting to False is equivalent to"
" infinite capacity)."
msgstr ""
#: internlm.model.moe.MoE:25 of
msgid "default=True, whether to use Random Token Selection."
msgstr ""
#: internlm.model.moe.MoE:27 of
msgid ""
"default=False, make this MoE layer a Residual MoE "
"(https://arxiv.org/abs/2201.05596) layer."
msgstr ""
#: internlm.model.moe.MoE:30 of
msgid "default=None, the torch module that defines the residual MLP."
msgstr ""
#: ../../source/moe.rst:64
msgid "注意InternLM支持用户定义自己的MoE结构。internlm.model.moe.MoE是定义MoE网络的接口目前使用SwiGLU网络实现了专家模型并支持top1gating和top2gating两种门控策略。用户可以在MoE接口中对专家网络和门控策略进行扩展。"
msgstr ""
"Note: InternLM supports users to define their own MoE structure. "
"internlm.model.moe.MoE is the interface that defines the MoE network. "
"Currently, the SwiGLU network is used to implement the experts and "
"supports two gating strategies: top1gating and top2gating. Users can "
"extend the expert network and gating strategy in the MoE interface as needed."

View File

@@ -9,6 +9,8 @@
import os
import sys
import torch # noqa # pylint: disable=unused-import
project = "InternLM"
copyright = "2023, InternLM Team"
author = "InternLM Team"
@@ -94,6 +96,10 @@ autodoc_mock_imports = [
"apex",
"torch",
"numpy",
"flash_attn",
"rotary_emb",
"einops",
"torch_scatter",
]
# support multi-language docs

View File

@@ -55,6 +55,14 @@ InternLM
mixed_precision
混合专家模型
-------------------
.. toctree::
    :maxdepth: 2

    moe
模型备份
--------------------

View File

@@ -1,5 +1,5 @@
混合精度
-----------------
============
混合精度是指在模型训练的过程中同时使用16位和32位浮点数类型是一种在最小化精度损失的前提下加速模型训练的方法。
混合精度通过让模型的某些部分使用32位浮点数以保持数值稳定性并在其余部分利用半精度浮点数加速训练并可以减少内存使用在评估指标如准确率方面仍可以获得同等的训练效果。
@@ -22,10 +22,10 @@ InternLM默认将模型转换为16位浮点数类型进行训练(在配置文
super().__init__()
self.linear1 = nn.Linear(4, 1, bias=False)
self.linear2 = nn.Linear(1, 4, bias=False)
# set model.linear2 as fp32 module
set_fp32_attr_to_module(model.linear2)
model = MlpModel()
# set model.linear2 as fp32 module
set_fp32_attr_to_module(model.linear2)
# apply mixed precision
model = NaiveAMPModel(
@@ -78,4 +78,3 @@ InternLM支持使用TF32训练模型,允许用户在config文件中将 ``dtype
torch.backends.cudnn.allow_tf32 = True
torch.backends.cuda.matmul.allow_tf32 = True

View File

@@ -0,0 +1,65 @@
混合专家模型
==============
混合专家模型Mixture-of-Experts, MoE是一种特殊的模型结构。
混合专家模型将模型拆分为一系列称为“专家”的子模型,每个“专家” 具有唯一的权重。
混合专家模型可以针对每个输入标记仅激活一个或少量的专家参与运算。
例如,图 :ref:`switch_transformer` 是 `Switch Transformer <https://arxiv.org/pdf/2101.03961.pdf>`_ 提出的稀疏混合专家模型结构,其中的前向神经网络(FFN)被分解为多个子网络,在计算时仅有少部分的模型参数参与计算,以实现更有效的计算和资源分配。
稀疏混合专家模型通常还包含一个门控gating机制例如图 :ref:`switch_transformer` 中的Router网络。门控网络负责选择激活哪些专家参与计算并组合不同专家的预测结果。
.. _switch_transformer:

.. figure:: ../../imgs/switch_transformer.png
    :scale: 40%
    :class: with-border
    :align: center

    switch transformer
参数配置
----------------
如果在启动训练时要使用混合专家模型,可进行如下相关配置:
1. 模型相关配置
.. code-block:: python

    model = dict(
        num_experts=16,
        moe_gate_k=1,
    )
* num_experts专家网络个数。在InternLM中每个专家有着相同的网络结构但维护着不同的训练参数。
* moe_gate_k门控策略。决定如何将输入标记路由到不同的专家进行计算。目前InternLM支持top1gating和top2gating两种门控策略。关于这些门控策略的详细的信息可以参考 `GShard <https://arxiv.org/pdf/2006.16668.pdf>`_
注意在目前的InternLM中每个专家都是根据配置文件中HIDDEN_SIZE和MLP_RATIO构造的一个 `SwiGLU网络 <https://arxiv.org/pdf/2002.05202.pdf>`_,同时支持张量并行。用户可以根据需要构造自己的专家网络。
2. 损失相关配置
.. code-block:: python

    loss = dict(
        moe_loss_coeff=0.1,
    )
在top1gating和top2gating门控策略中不同的专家处理的标记数量存在差异。为了提高模型效果应尽量保证输入标记被均匀地路由到不同的专家上。InternLM采用 `GShard <https://arxiv.org/pdf/2006.16668.pdf>`_ 提出的负载平衡损失优化门控策略。
moe_loss_coeff项决定着负载平衡损失项将如何添加到最终的损失项中( :math:`l=l_{nll}+k·l_{moe}` )。关于该部分的详细信息可以进一步参考 `GShard <https://arxiv.org/pdf/2006.16668.pdf>`_。
注意:这些参数需要和其他参数一起使用,具体请参考 :doc:`/usage` “训练配置”相关章节的内容。
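One combination worth calling out explicitly (an assumption drawn from the MoE check in args_sanity_check, shown later in this commit): MoE currently requires the ZeRO-1 group to span the whole data-parallel group, so the parallel section of the config would include something like the sketch below.

.. code-block:: python

    # Assumed sketch: with MoE enabled, the ZeRO-1 group must cover the whole
    # data-parallel group (see the check in args_sanity_check).
    parallel = dict(
        zero1=dict(size=-1),
    )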
模型训练
----------------
internlm.model.modeling_moe提供了一个标准的混合专家模型的实现该模型的网络结构和图 :ref:`switch_transformer` 一致其中使用到internlm.model.moe.MoE实现MoE网络。用户在配置文件中指定模型类型
.. code-block:: python

    model_type = "INTERNLM_MoE"
并配置好稀疏专家网络的相关参数后就可以像正常启动InternLM一样进行混合专家模型的分布式训练具体请参考 :doc:`/usage` “启动训练”相关章节的内容。
.. autoclass:: internlm.model.moe.MoE
注意InternLM支持用户定义自己的MoE结构。internlm.model.moe.MoE是定义MoE网络的接口目前使用SwiGLU网络实现了专家模型并支持top1gating和top2gating两种门控策略。用户可以在MoE接口中对专家网络和门控策略进行扩展。

Binary image file added (not shown), 59 KiB

View File

@@ -349,7 +349,7 @@ def args_sanity_check():
assert (
not optim_ckpt.overlap_sync_grad & optim_ckpt.overlap_sync_param
), "not support overlap and moe at the same time"
assert gpc.config.parallel.zero1.size == -1, "moe only support zero1, set zero1=-1 can fix this"
assert gpc.config.parallel.zero1.size == -1, "moe only support zero1, set zero1=dict(size=-1,...) can fix this"
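# Note (assumed reading of the check above): MoE training currently requires the
# ZeRO-1 process group to span the entire data-parallel group, i.e. parallel.zero1
# must be configured with size=-1.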
def launch(

View File

@@ -18,7 +18,6 @@ class MoE(torch.nn.Module):
Arguments:
hidden_size (int): the hidden dimension of the model, importantly this is also the input and output dimension.
expert (torch.nn.Module): the torch module that defines the expert (e.g., MLP, torch.linear).
num_experts (int, optional): default=1, the total number of experts per layer.
ep_size (int, optional): default=1, number of ranks in the expert parallel world or group.
k (int, optional): default=1, top-k gating value, only supports k=1 or k=2.
@@ -26,10 +25,10 @@
eval_capacity_factor (float, optional): default=1.0, the capacity of the expert at eval time.
min_capacity (int, optional): default=4, the minimum capacity per expert regardless of the capacity_factor.
noisy_gate_policy (str, optional): default=None, noisy gate policy, valid options are 'Jitter', 'RSample'
or 'None'.
or 'None'.
using_default_moe (bool, optional): default=True, whether to use the default MoE layer.
drop_tokens (bool, optional): default=True, whether to drop tokens - (setting to False is equivalent to
infinite capacity).
infinite capacity).
use_rts (bool, optional): default=True, whether to use Random Token Selection.
moe_use_residual (bool, optional): default=False, make this MoE layer a Residual MoE
(https://arxiv.org/abs/2201.05596) layer.
@@ -73,7 +72,6 @@ class MoE(torch.nn.Module):
gpc.expert_parallel_group_names.append(expert_group_name)
experts = torch.nn.ModuleList(
[
# TODO have trouble when use internlm.model.linear.FeedForward
FeedForward(
hidden_size,
int(hidden_size * gpc.config.model.mlp_ratio),