polish moe docstring (#618)

pull/621/head^2
ver217 2022-04-01 16:15:36 +08:00 committed by GitHub
parent c5b488edf8
commit 8432dc7080
2 changed files with 13 additions and 6 deletions


@@ -320,15 +320,22 @@ class MoeModule(nn.Module):
         capacity_factor_eval (float, optional): Capacity factor in routing during evaluation
         min_capacity (int, optional): The minimum capacity of each expert
         noisy_policy (str, optional): The policy of the noisy function. Currently 'Jitter' and 'Gaussian' are supported.
-            'Jitter' can be found in the Switch Transformer paper (https://arxiv.org/abs/2101.03961).
-            'Gaussian' can be found in the ViT-MoE paper (https://arxiv.org/abs/2106.05974).
+            'Jitter' can be found in the `Switch Transformer paper`_.
+            'Gaussian' can be found in the `ViT-MoE paper`_.
         drop_tks (bool, optional): Whether to drop tokens during evaluation
         use_residual (bool, optional): Makes this MoE layer a Residual MoE.
-            More information can be found in the Microsoft paper (https://arxiv.org/abs/2201.05596).
+            More information can be found in the `Microsoft paper`_.
         residual_instance (nn.Module, optional): The instance of the residual module in Residual MoE
         expert_instance (MoeExperts, optional): The instance of the experts module in MoeLayer
         expert_cls (Type[nn.Module], optional): The class of each expert when no instance is given
         expert_args (optional): The args of the expert when no instance is given
+
+    .. _Switch Transformer paper:
+        https://arxiv.org/abs/2101.03961
+    .. _ViT-MoE paper:
+        https://arxiv.org/abs/2106.05974
+    .. _Microsoft paper:
+        https://arxiv.org/abs/2201.05596
     """

     def __init__(self,
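The capacity parameters documented above (capacity_factor_train, capacity_factor_eval, min_capacity) bound how many tokens each expert may receive. A minimal sketch of the usual capacity formula, assuming the common ceil/max convention suggested by the parameter names (the library's exact computation may differ):

```python
import math

def expert_capacity(num_tokens: int, num_experts: int,
                    capacity_factor: float, min_capacity: int) -> int:
    # Hypothetical helper: each expert accepts at most `capacity` tokens
    # per batch; with drop_tks enabled, tokens routed beyond this limit
    # are dropped. The ceil/max formula is a common convention, not
    # necessarily the exact MoeModule implementation.
    capacity = math.ceil(capacity_factor * num_tokens / num_experts)
    return max(capacity, min_capacity)
```

For example, with 1024 tokens, 8 experts, and a capacity factor of 1.25, each expert would accept up to 160 tokens; min_capacity acts as a floor for very small batches.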


@@ -14,8 +14,8 @@ class ForceFP32Parameter(torch.nn.Parameter):
 class NormalNoiseGenerator:
     """Generates a random noisy mask for the logits tensor.
-    All noise is generated from a normal distribution (0, 1 / E^2), where
-    E = the number of experts.
+    All noise is generated from a normal distribution :math:`(0, 1 / E^2)`, where
+    `E = the number of experts`.

     Args:
         num_experts (int): The number of experts.

@@ -34,7 +34,7 @@ class NormalNoiseGenerator:
 class UniformNoiseGenerator:
     """Generates a random noisy mask for the logits tensor.
     copied from mesh tensorflow:
-    Multiply values by a random number between 1-epsilon and 1+epsilon.
+    Multiply values by a random number between :math:`1-\epsilon` and :math:`1+\epsilon`.
     Makes models more resilient to rounding errors introduced by bfloat16.
     This seems particularly important for logits.
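The two noise generators described above can be sketched in plain Python. This is a simplified, hypothetical version operating on lists of floats rather than torch tensors, and it reads the docstring's (0, 1 / E^2) as (mean, variance), i.e. standard deviation 1 / E; both are assumptions, not the library's exact code:

```python
import random

class NormalNoiseGenerator:
    """Sketch: additive noise for router logits.

    Reads (0, 1 / E^2) as (mean, variance), so the standard
    deviation is 1 / num_experts (an assumption).
    """

    def __init__(self, num_experts: int, rng=random):
        self.std = 1.0 / num_experts
        self.rng = rng

    def __call__(self, logits):
        # Add independent Gaussian noise to each logit.
        return [x + self.rng.gauss(0.0, self.std) for x in logits]


class UniformNoiseGenerator:
    """Sketch: multiplicative jitter from mesh tensorflow.

    Multiplies each logit by a value drawn uniformly from
    [1 - epsilon, 1 + epsilon].
    """

    def __init__(self, epsilon: float = 1e-2, rng=random):
        self.epsilon = epsilon
        self.rng = rng

    def __call__(self, logits):
        return [x * self.rng.uniform(1.0 - self.epsilon, 1.0 + self.epsilon)
                for x in logits]
```

Either generator is applied to the router logits before top-k expert selection, so repeated forward passes break routing ties differently and tokens spread more evenly across experts.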