Develop/experiments (#59)
* Add gradient accumulation, fix lr scheduler
* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* fixed trainer
* Revert "fixed trainer"
This reverts commit 2e0b0b76990e8d4e337add483d878c0f61cf5097.
* improved consistency between trainer, engine and schedule (#23)
Co-authored-by: 1SAA <c2h214748@gmail.com>
* Split conv2d, class token, positional embedding in 2d, Fix random number in ddp
Fix convergence in cifar10, Imagenet1000
* Integrate 1d tensor parallel in Colossal-AI (#39)
* fixed 1D and 2D convergence (#38)
* optimized 2D operations
* fixed 1D ViT convergence problem
* Feature/ddp (#49)
* remove redundancy func in setup (#19) (#20)
* use env to control the language of doc (#24) (#25)
* Support TP-compatible Torch AMP and Update trainer API (#27)
* Add gradient accumulation, fix lr scheduler
* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* fixed trainer
* Revert "fixed trainer"
This reverts commit 2e0b0b76990e8d4e337add483d878c0f61cf5097.
* improved consistency between trainer, engine and schedule (#23)
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)
* add explanation for ViT example (#35) (#36)
* support torch ddp
* fix loss accumulation
* add log for ddp
* change seed
* modify timing hook
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
* Feature/pipeline (#40)
* remove redundancy func in setup (#19) (#20)
* use env to control the language of doc (#24) (#25)
* Support TP-compatible Torch AMP and Update trainer API (#27)
* Add gradient accumulation, fix lr scheduler
* fix FP16 optimizer and adapted torch amp with tensor parallel (#18)
* fixed bugs in compatibility between torch amp and tensor parallel and performed some minor fixes
* fixed trainer
* Revert "fixed trainer"
This reverts commit 2e0b0b76990e8d4e337add483d878c0f61cf5097.
* improved consistency between trainer, engine and schedule (#23)
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
* add an example of ViT-B/16 and remove w_norm clipping in LAMB (#29)
* add explanation for ViT example (#35) (#36)
* optimize communication of pipeline parallel
* fix grad clip for pipeline
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
* optimized 3d layer to fix slow computation ; tested imagenet performance with 3d; reworked lr_scheduler config definition; fixed launch args; fixed some printing issues; simplified apis of 3d layers (#51)
* Update 2.5d layer code to get a similar accuracy on imagenet-1k dataset
* update api for better usability (#58)
update api for better usability
Co-authored-by: 1SAA <c2h214748@gmail.com>
Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: puck_WCR <46049915+WANG-CR@users.noreply.github.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: アマデウス <kurisusnowdeng@users.noreply.github.com>
Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>
2021-12-09 07:08:29 +00:00

import math
import warnings

from torch import Tensor
import torch.nn as nn


def zeros_():
    """Return the initializer filling the input Tensor with the scalar zeros"""

    def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
        return nn.init.zeros_(tensor)

    return initializer


def ones_():
    """Return the initializer filling the input Tensor with the scalar ones"""

    def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
        return nn.init.ones_(tensor)

    return initializer


def uniform_(a: float = 0., b: float = 1.):
    r"""Return the initializer filling the input Tensor with values drawn from the uniform
    distribution :math:`\mathcal{U}(a, b)`.

    Args:
        a (float): the lower bound of the uniform distribution. Defaults to 0.0.
        b (float): the upper bound of the uniform distribution. Defaults to 1.0.
    """

    def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
        return nn.init.uniform_(tensor, a, b)

    return initializer
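

# Usage sketch (not part of the original module): every function in this file is a
# factory that captures its hyperparameters and returns a closure with the common
# signature ``initializer(tensor, fan_in=None, fan_out=None)``. The block below
# re-declares ``uniform_`` so it runs on its own; the tensor shape, the bounds and the
# names ``weight``, ``init_fn`` and ``in_range`` are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
from torch import Tensor


def uniform_(a: float = 0., b: float = 1.):
    # Same shape as the factory above: capture (a, b) and return a closure.
    def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
        return nn.init.uniform_(tensor, a, b)

    return initializer


torch.manual_seed(0)
weight = torch.empty(4, 8)             # hypothetical weight of shape (out, in)
init_fn = uniform_(-0.1, 0.1)          # configure once ...
init_fn(weight, fan_in=8, fan_out=4)   # ... apply later; the fans are ignored by uniform_
in_range = bool(((weight >= -0.1) & (weight <= 0.1)).all())
```

# Keeping one call signature for all initializers lets a layer pass ``fan_in`` and
# ``fan_out`` unconditionally, whether or not the chosen initializer needs them.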


def normal_(mean: float = 0., std: float = 1.):
    r"""Return the initializer filling the input Tensor with values drawn from the normal distribution

    .. math::
        \mathcal{N}(\text{mean}, \text{std}^2)

    Args:
        mean (float): the mean of the normal distribution. Defaults to 0.0.
        std (float): the standard deviation of the normal distribution. Defaults to 1.0.
    """

    def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
        return nn.init.normal_(tensor, mean, std)

    return initializer


def trunc_normal_(mean: float = 0., std: float = 1., a: float = -2., b: float = 2.):
    r"""Return the initializer filling the input Tensor with values drawn from a truncated
    normal distribution. The values are effectively drawn from the
    normal distribution :math:`\mathcal{N}(\text{mean}, \text{std}^2)`
    with values outside :math:`[a, b]` redrawn until they are within
    the bounds. The method used for generating the random values works
    best when :math:`a \leq \text{mean} \leq b`.

    Args:
        mean (float): the mean of the normal distribution. Defaults to 0.0.
        std (float): the standard deviation of the normal distribution. Defaults to 1.0.
        a (float): the minimum cutoff value. Defaults to -2.0.
        b (float): the maximum cutoff value. Defaults to 2.0.
    """

    def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
        return nn.init.trunc_normal_(tensor, mean, std, a, b)

    return initializer


def kaiming_uniform_(a=0, mode='fan_in', nonlinearity='leaky_relu'):
    r"""Return the initializer filling the input `Tensor` with values according to the method
    described in `Delving deep into rectifiers: Surpassing human-level
    performance on ImageNet classification` - He, K. et al. (2015), using a
    uniform distribution. The resulting tensor will have values sampled from
    :math:`\mathcal{U}(-\text{bound}, \text{bound})` where

    .. math::
        \text{bound} = \text{gain} \times \sqrt{\frac{3}{\text{fan\_mode}}}

    Also known as 'He initialization'.

    Args:
        a (float): the negative slope of the rectifier used after this layer (only used with ``'leaky_relu'``).
        mode (str, optional): either ``'fan_in'`` (default) or ``'fan_out'``. Choosing ``'fan_in'``
                preserves the magnitude of the variance of the weights in the
                forward pass. Choosing ``'fan_out'`` preserves the magnitudes in the
                backward pass.
        nonlinearity (str, optional): the non-linear function (`nn.functional` name),
                recommended to use only with ``'relu'`` or ``'leaky_relu'`` (default).
    """

    # adapted from torch.nn.init
    def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
        if 0 in tensor.shape:
            warnings.warn("Initializing zero-element tensors is a no-op")
            return tensor

        if mode == 'fan_in':
            assert fan_in is not None, 'Fan_in is not provided.'
            fan = fan_in
        elif mode == 'fan_out':
            assert fan_out is not None, 'Fan_out is not provided.'
            fan = fan_out
        else:
            raise ValueError(f'Invalid initialization mode \'{mode}\'')

        std = nn.init.calculate_gain(nonlinearity, a) / math.sqrt(fan)
        bound = math.sqrt(3.) * std
        return nn.init.uniform_(tensor, -bound, bound)

    return initializer
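

# Worked example (a sketch, not part of the original module): the bound used above is
# sqrt(3) * gain / sqrt(fan). The helper names ``leaky_relu_gain`` and
# ``kaiming_uniform_bound`` are hypothetical; the gain formula
# sqrt(2 / (1 + negative_slope ** 2)) is the one documented for
# ``nn.init.calculate_gain('leaky_relu', negative_slope)``. The fan of 600 is an
# arbitrary value picked so the arithmetic comes out round.

```python
import math


def leaky_relu_gain(negative_slope: float = 0.0) -> float:
    # Gain documented for 'leaky_relu': sqrt(2 / (1 + negative_slope ** 2)).
    return math.sqrt(2.0 / (1.0 + negative_slope ** 2))


def kaiming_uniform_bound(fan: int, negative_slope: float = 0.0) -> float:
    # Mirrors the two arithmetic lines of the initializer above.
    std = leaky_relu_gain(negative_slope) / math.sqrt(fan)
    return math.sqrt(3.0) * std


bound = kaiming_uniform_bound(fan=600)      # sqrt(2) / sqrt(600) * sqrt(3) = sqrt(6 / 600)
implied_std = bound / math.sqrt(3.0)        # std of U(-bound, bound) is bound / sqrt(3)
```

# The factor sqrt(3) appears because a uniform distribution on (-bound, bound) has
# standard deviation bound / sqrt(3), so the samples end up with the intended ``std``.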


def kaiming_normal_(a=0, mode='fan_in', nonlinearity='leaky_relu'):
    r"""Return the initializer filling the input `Tensor` with values according to the method
    described in `Delving deep into rectifiers: Surpassing human-level
    performance on ImageNet classification` - He, K. et al. (2015), using a
    normal distribution. The resulting tensor will have values sampled from
    :math:`\mathcal{N}(0, \text{std}^2)` where

    .. math::
        \text{std} = \frac{\text{gain}}{\sqrt{\text{fan\_mode}}}

    Also known as 'He initialization'.

    Args:
        a (float): the negative slope of the rectifier used after this layer (only used with ``'leaky_relu'``).
        mode (str, optional): either ``'fan_in'`` (default) or ``'fan_out'``. Choosing ``'fan_in'``
                preserves the magnitude of the variance of the weights in the
                forward pass. Choosing ``'fan_out'`` preserves the magnitudes in the
                backward pass.
        nonlinearity (str, optional): the non-linear function (`nn.functional` name),
                recommended to use only with ``'relu'`` or ``'leaky_relu'`` (default).
    """

    # adapted from torch.nn.init
    def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
        if 0 in tensor.shape:
            warnings.warn("Initializing zero-element tensors is a no-op")
            return tensor

        if mode == 'fan_in':
            assert fan_in is not None, 'Fan_in is not provided.'
            fan = fan_in
        elif mode == 'fan_out':
            assert fan_out is not None, 'Fan_out is not provided.'
            fan = fan_out
        else:
            raise ValueError(f'Invalid initialization mode \'{mode}\'')

        std = nn.init.calculate_gain(nonlinearity, a) / math.sqrt(fan)
        return nn.init.normal_(tensor, 0, std)

    return initializer


def xavier_uniform_(a: float = math.sqrt(3.), scale: float = 2., gain: float = 1.):
    r"""Return the initializer filling the input `Tensor` with values according to the method
    described in `Understanding the difficulty of training deep feedforward
    neural networks` - Glorot, X. & Bengio, Y. (2010), using a uniform
    distribution. The resulting tensor will have values sampled from
    :math:`\mathcal{U}(-\text{bound}, \text{bound})` where

    .. math::
        \text{bound} = a \times \text{gain} \times \sqrt{\frac{\text{scale}}{\text{fan\_in} + \text{fan\_out}}}

    With the default ``a`` and ``scale`` this reduces to
    :math:`\text{gain} \times \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}`.

    Also known as 'Glorot initialization'.

    Args:
        a (float, optional): an optional scaling factor used to calculate uniform
            bounds from standard deviation. Defaults to ``math.sqrt(3.)``.
        scale (float, optional): an optional scaling factor used to calculate standard deviation. Defaults to 2.0.
        gain (float, optional): an optional scaling factor. Defaults to 1.0.
    """

    # adapted from torch.nn.init
    def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
        assert fan_in is not None, 'Fan_in is not provided.'

        fan = fan_in
        if fan_out is not None:
            fan += fan_out

        std = gain * math.sqrt(scale / float(fan))
        bound = a * std
        return nn.init.uniform_(tensor, -bound, bound)

    return initializer
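

# Worked example (a sketch, not part of the original module): with the default
# ``a = sqrt(3)`` and ``scale = 2``, the bound computed above equals the classic Glorot
# bound gain * sqrt(6 / (fan_in + fan_out)). The helper name ``xavier_uniform_bound``
# and the fan values 100 and 50 are assumptions chosen for illustration.

```python
import math


def xavier_uniform_bound(fan_in, fan_out=None, a=math.sqrt(3.), scale=2., gain=1.):
    # Mirrors the arithmetic of the initializer above, including the
    # fallback to fan_in alone when fan_out is not given.
    fan = fan_in if fan_out is None else fan_in + fan_out
    std = gain * math.sqrt(scale / float(fan))
    return a * std


bound = xavier_uniform_bound(fan_in=100, fan_out=50)
glorot = math.sqrt(6.0 / (100 + 50))               # textbook Glorot bound for the same fans
fan_in_only = xavier_uniform_bound(fan_in=100)     # fan_out omitted: only fan_in is used
```

# When ``fan_out`` is omitted the denominator is ``fan_in`` alone, so the bound is
# larger than the two-sided Glorot value for the same layer.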


def xavier_normal_(scale: float = 2., gain: float = 1.):
    r"""Return the initializer filling the input `Tensor` with values according to the method
    described in `Understanding the difficulty of training deep feedforward
    neural networks` - Glorot, X. & Bengio, Y. (2010), using a normal
    distribution. The resulting tensor will have values sampled from
    :math:`\mathcal{N}(0, \text{std}^2)` where

    .. math::
        \text{std} = \text{gain} \times \sqrt{\frac{\text{scale}}{\text{fan\_in} + \text{fan\_out}}}

    Also known as 'Glorot initialization'.

    Args:
        scale (float, optional): an optional scaling factor used to calculate standard deviation. Defaults to 2.0.
        gain (float, optional): an optional scaling factor. Defaults to 1.0.
    """

    # adapted from torch.nn.init
    def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
        assert fan_in is not None, 'Fan_in is not provided.'

        fan = fan_in
        if fan_out is not None:
            fan += fan_out

        std = gain * math.sqrt(scale / float(fan))

        return nn.init.normal_(tensor, 0., std)

    return initializer


def lecun_uniform_():
    r"""Return the initializer filling the input Tensor with values drawn from
    :math:`\mathcal{U}(-\text{bound}, \text{bound})` where
    :math:`\text{bound} = \sqrt{3 / \text{fan\_in}}`.
    """

    # adapted from jax.nn.initializers
    def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
        assert fan_in is not None, 'Fan_in is not provided.'

        var = 1.0 / fan_in
        bound = math.sqrt(3 * var)
        return nn.init.uniform_(tensor, -bound, bound)

    return initializer


def lecun_normal_():
    r"""Return the initializer filling the input Tensor with values drawn from a truncated
    normal distribution whose ``std`` argument is
    :math:`\sqrt{1 / \text{fan\_in}} / 0.87962566103423978`.
    """

    # adapted from jax.nn.initializers
    def initializer(tensor: Tensor, fan_in: int = None, fan_out: int = None):
        assert fan_in is not None, 'Fan_in is not provided.'

        std = math.sqrt(1.0 / fan_in)
        return nn.init.trunc_normal_(tensor, std=std / .87962566103423978)

    return initializer
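

# Worked example (a sketch, not part of the original module): the constant
# .87962566103423978 in ``lecun_normal_`` matches the standard deviation of a standard
# normal distribution truncated to [-2, 2], which the block below recomputes from the
# closed-form variance of a truncated normal. The helper names are hypothetical.
# One observation, stated as an assumption about intent rather than something the
# source says: ``nn.init.trunc_normal_`` takes its cutoffs ``a`` and ``b`` as absolute
# values (defaults -2 and 2), not in units of ``std``.

```python
import math


def std_normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def truncated_std(a: float, b: float) -> float:
    # Standard deviation of N(0, 1) conditioned on a <= x <= b.
    z = std_normal_cdf(b) - std_normal_cdf(a)
    mean = (std_normal_pdf(a) - std_normal_pdf(b)) / z
    var = 1.0 + (a * std_normal_pdf(a) - b * std_normal_pdf(b)) / z - mean ** 2
    return math.sqrt(var)


correction = truncated_std(-2.0, 2.0)
```

# Dividing the target deviation by this factor compensates for the variance lost when
# samples beyond two standard deviations are cut off.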