Mixed precision training
In Colossal-AI, we have incorporated different implementations of mixed precision training:
- torch.cuda.amp
- apex.amp
- tensor-parallel amp
The first two rely on the original implementations in PyTorch (version 1.6 and above) and NVIDIA Apex. However, these two methods are not compatible with tensor parallelism: because tensors are split across devices in tensor parallelism, the processes must communicate with each other to check whether inf or nan occurs anywhere in the model weights. For mixed precision training with tensor parallelism, we adapted this feature from Megatron-LM.
To use mixed precision training, you simply specify the fp16 field in the config file. Currently, PyTorch AMP and Apex AMP are not guaranteed to work with tensor and pipeline parallelism, so only the last mode (tensor-parallel AMP) is recommended if you are using hybrid parallelism.
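As a quick sketch of how such a config is consumed, the snippet below launches Colossal-AI with a config file and wraps the model in an engine that applies the fp16 settings. The calls colossalai.launch_from_torch and colossalai.initialize, as well as the file name config.py, are assumptions for illustration; the exact initialization API may differ between Colossal-AI versions.

# config.py -- enable PyTorch AMP (details in the PyTorch AMP section below)
from colossalai.engine import AMP_TYPE
fp16 = dict(mode=AMP_TYPE.TORCH)

# train.py -- hypothetical usage sketch; the API names are assumptions
import colossalai
import torch
import torch.nn as nn

colossalai.launch_from_torch(config='./config.py')

model = nn.Linear(16, 16).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

# the returned engine applies fp16 casting and loss scaling according to the config
engine, *_ = colossalai.initialize(model, optimizer, criterion)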
PyTorch AMP
PyTorch has provided native mixed precision training since version 1.6. It offers an easy way to cast data to fp16 while keeping some operations, such as reductions, in fp32. You can configure the gradient scaler in the config file.
from colossalai.engine import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.TORCH,
    # below are the default values for the grad scaler
    init_scale=2.**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
    enabled=True
)
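Under the hood, this mode relies on PyTorch's native torch.cuda.amp machinery. The stand-alone sketch below (plain PyTorch, not Colossal-AI code) shows what the grad scaler parameters above control in an ordinary training step.

import torch

model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# same keyword arguments as in the fp16 config above
scaler = torch.cuda.amp.GradScaler(init_scale=2.**16, growth_factor=2.0,
                                   backoff_factor=0.5, growth_interval=2000,
                                   enabled=True)

x = torch.randn(8, 16, device='cuda')
y = torch.randn(8, 16, device='cuda')

with torch.cuda.amp.autocast():   # forward pass runs in fp16 where it is safe
    loss = torch.nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()     # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)            # unscale gradients; skip the step if inf/nan is found
scaler.update()                   # grow or back off the scale for the next iteration
optimizer.zero_grad()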
Apex AMP
For this mode, we rely on the Apex implementation of mixed precision training. We support this plugin because it allows finer control over the granularity of mixed precision. For example, the O2 level (optimization level 2) keeps batch normalization in fp32.
The following code block shows a config file for Apex AMP.
from colossalai.engine import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.APEX,
    # below are the default values
    enabled=True,
    opt_level='O1',
    cast_model_type=None,
    patch_torch_functions=None,
    keep_batchnorm_fp32=None,
    master_weights=None,
    loss_scale=None,
    cast_model_outputs=None,
    num_losses=1,
    verbosity=1,
    min_loss_scale=None,
    max_loss_scale=16777216.0
)
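For reference, this mode builds on NVIDIA Apex's amp API, which looks roughly like the following when used directly (a plain Apex sketch, not Colossal-AI code).

import torch
from apex import amp

model = torch.nn.Linear(16, 16).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# opt_level 'O2' keeps batch normalization in fp32, as described above
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')

x = torch.randn(8, 16, device='cuda')
loss = model(x).sum()

# scale the loss before backward so small gradients do not underflow in fp16
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()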
Tensor Parallel AMP
We leveraged the Megatron-LM implementation to achieve mixed precision training while maintaining compatibility with complex tensor and pipeline parallelism.
The following code block shows a config file for this mode.
from colossalai.engine import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.PARALLEL,
    # below are the default values
    clip_grad=0,
    log_num_zeros_in_grad=False,
    initial_scale=2 ** 32,
    min_scale=1,
    growth_factor=2,
    backoff_factor=0.5,
    growth_interval=1000,
    hysteresis=2
)
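To make the loss-scaling parameters concrete, the snippet below is an illustrative re-implementation of the usual dynamic loss-scaling rule (not the actual Colossal-AI or Megatron-LM class): the scale is multiplied by backoff_factor once overflows exhaust the hysteresis budget, and multiplied by growth_factor after growth_interval consecutive overflow-free steps.

# illustrative sketch following the parameter names in the config above
class DynamicScalerSketch:
    def __init__(self, initial_scale=2 ** 32, min_scale=1, growth_factor=2,
                 backoff_factor=0.5, growth_interval=1000, hysteresis=2):
        self.scale = initial_scale
        self.min_scale = min_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.hysteresis = hysteresis
        self._good_steps = 0               # consecutive steps without inf/nan
        self._hysteresis_left = hysteresis

    def update(self, found_inf_or_nan: bool):
        if found_inf_or_nan:
            # tolerate up to `hysteresis` overflows before backing off the scale
            self._good_steps = 0
            self._hysteresis_left -= 1
            if self._hysteresis_left <= 0:
                self.scale = max(self.scale * self.backoff_factor, self.min_scale)
                self._hysteresis_left = self.hysteresis
        else:
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= self.growth_factor
                self._good_steps = 0
                self._hysteresis_left = self.hysteresis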