Merge pull request #3810 from jiangmingyan/amp

[doc] update amp document
2023-05-23 18:58:16 +08:00 · 2023-05-23 18:58:16 +08:00 · 725365f297
parent 1e3b64f26c 278fcbc444
commit 725365f297
9 changed files with 497 additions and 8 deletions
--- a/docs/sidebars.json
+++ b/docs/sidebars.json
@@ -43,6 +43,7 @@
      "label": "Features",
      "collapsed": true,
      "items": [
        "features/mixed_precision_training_with_booster",
        "features/mixed_precision_training",
        "features/gradient_accumulation_with_booster",
        "features/gradient_accumulation",
--- a/docs/source/en/features/gradient_accumulation_with_booster.md
+++ b/docs/source/en/features/gradient_accumulation_with_booster.md
@@ -128,7 +128,7 @@ for idx, (img, label) in enumerate(train_dataloader):
 ### Step 6. Invoke Training Scripts
 To verify gradient accumulation, we can just check the change of parameter values. When gradient accumulation is set, parameters are only updated in the last step. You can run the script using this command:
 ```shell
-colossalai run --nproc_per_node 1 train.py --config config.py
+colossalai run --nproc_per_node 1 train.py
 ```
 You will see output similar to the text below. This shows gradient is indeed accumulated as the parameter is not updated
--- a/docs/source/en/features/gradient_clipping_with_booster.md
+++ b/docs/source/en/features/gradient_clipping_with_booster.md
@@ -136,7 +136,7 @@ for idx, (img, label) in enumerate(train_dataloader):
 You can run the script using this command:
 ```shell
-colossalai run --nproc_per_node 1 train.py --config config/config.py
+colossalai run --nproc_per_node 1 train.py
 ```
 <!-- doc-test-command: torchrun --standalone --nproc_per_node=1 gradient_clipping_with_booster.py  -->
--- a/docs/source/en/features/mixed_precision_training.md
+++ b/docs/source/en/features/mixed_precision_training.md
@@ -1,4 +1,4 @@
-# Auto Mixed Precision Training
+# Auto Mixed Precision Training (Outdated)
 Author: Chuanrui Wang, Shenggui Li, Yongbin Li
@@ -362,6 +362,7 @@ for epoch in range(gpc.config.NUM_EPOCHS):
 Use the following command to start the training scripts. You can change `--nproc_per_node` to use a different number of GPUs.
-```python
+```shell
 python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py --config config/config_AMP_torch.py
 ```
 <!-- doc-test-command: torchrun --standalone --nproc_per_node=1 mixed_precision_training.py  -->
--- a/docs/source/en/features/mixed_precision_training_with_booster.md
+++ b/docs/source/en/features/mixed_precision_training_with_booster.md
@@ -0,0 +1,251 @@
 # Auto Mixed Precision Training (Latest)
 Author: [Mingyan Jiang](https://github.com/jiangmingyan)
 **Prerequisite**
 - [Define Your Configuration](../basics/define_your_config.md)
 - [Training Booster](../basics/booster_api.md)
 **Related Paper**
 - [Accelerating Scientific Computations with Mixed Precision Algorithms](https://arxiv.org/abs/0808.2794)
 ## Introduction
 AMP stands for automatic mixed precision training.
 In Colossal-AI, we have incorporated different implementations of mixed precision training:
 1. torch.cuda.amp
 2. apex.amp
 3. naive amp
 | Colossal-AI | support tensor parallel | support pipeline parallel | fp16 extent |
 | ----------- | ----------------------- | ------------------------- | ----------- |
 | AMP_TYPE.TORCH | ✅ | ❌ | Model parameters, activation, gradients are downcast to fp16 during forward and backward propagation |
 | AMP_TYPE.APEX | ❌ | ❌ | More fine-grained, we can choose opt_level O0, O1, O2, O3 |
 | AMP_TYPE.NAIVE | ✅ | ✅ | Model parameters, forward and backward operations are all downcast to fp16 |
 The first two rely on the original implementation of PyTorch (version 1.6 and above) and NVIDIA Apex.
 The last method is similar to Apex O2 level.
 Among these methods, Apex AMP is not compatible with tensor parallelism.
 This is because tensors are split across devices in tensor parallelism, so the processes must communicate with each other to check whether inf or nan occurs anywhere in the model weights.
 We modified the torch AMP implementation so that it is now compatible with tensor parallelism.
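To illustrate why a cross-process check is needed, here is a minimal pure-Python sketch that simulates tensor-parallel ranks, each holding one shard of the values. The helper names (`local_has_overflow`, `global_has_overflow`) and the list-based shards are hypothetical and are not part of Colossal-AI; in a real run the per-rank flags would be combined with a collective such as an all-reduce rather than Python's `any`.

```python
import math


def local_has_overflow(shard):
    """Return True if this rank's shard contains inf or nan."""
    return any(math.isinf(x) or math.isnan(x) for x in shard)


def global_has_overflow(shards):
    """Combine per-rank flags; stands in for an all-reduce over all ranks."""
    return any(local_has_overflow(shard) for shard in shards)


# Two simulated ranks: only rank 1 holds the overflowing value.
shards = [[0.1, -2.0], [3.5, float("inf")]]
rank0_flag = local_has_overflow(shards[0])  # rank 0 alone sees no problem
global_flag = global_has_overflow(shards)   # the combined check does
print(rank0_flag, global_flag)
```

If each rank decided on its own local flag, rank 0 would apply the update while rank 1 skipped it, and the shards would drift apart. Sharing the flag keeps every rank's decision identical.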
 > ❌️ fp16 and ZeRO are not compatible
 >
 > ⚠️ Pipeline parallelism currently supports only naive AMP
 We recommend torch AMP when pipeline parallelism is not used, as it generally gives better accuracy than naive AMP.
 ## Table of Contents
 In this tutorial we will cover:
 1. [AMP introduction](#amp-introduction)
 2. [AMP in Colossal-AI](#amp-in-colossal-ai)
 3. [Hands-on Practice](#hands-on-practice)
 ## AMP Introduction
 Automatic mixed precision training combines FP16 and FP32 computation.
 The half-precision floating-point format (FP16) is cheaper to compute with and offers higher throughput. In addition, FP16 needs half the storage of FP32, which saves memory and network bandwidth and leaves more memory available for larger batch sizes and models.
 However, some operations, such as reductions, need the dynamic range of FP32 to avoid numeric overflow or underflow. Automatic mixed precision therefore tries to match each operation to an appropriate data type, which reduces the memory footprint and improves training efficiency.
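The range and precision limits of FP16 can be seen without a GPU. The sketch below uses the half-precision format code (`"e"`) of Python's standard `struct` module to round values to FP16; the helper `to_fp16` is a hypothetical name introduced only for this illustration.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses to pack values beyond the FP16 range; treat them as inf.
        return math.copysign(float("inf"), x)


# Very small values underflow to zero in FP16.
tiny = to_fp16(1e-8)
# Values far beyond the FP16 maximum (about 65504) overflow.
huge = to_fp16(70000.0)

# A long reduction stalls in FP16: adding 1.0 eventually stops changing the sum.
fp16_total = 0.0
for _ in range(5000):
    fp16_total = to_fp16(fp16_total + 1.0)
exact_total = sum(1.0 for _ in range(5000))

print(tiny, huge, fp16_total, exact_total)
```

The stalled running sum is the kind of error that motivates keeping reductions in FP32 while running other operations in FP16.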
 <figure style={{textAlign: "center"}}>
 <img src="https://s2.loli.net/2022/01/28/URzLJ3MPeDQbtck.png"/>
 <figcaption>Illustration of an ordinary AMP (figure from <a href="https://arxiv.org/abs/2108.05818">PatrickStar paper</a>)</figcaption>
 </figure>
 ## AMP in Colossal-AI
 Colossal-AI supports three AMP training methods and lets you enable AMP without changing your training code. To train with AMP, pass `mixed_precision="fp16"` when you instantiate the `Booster`. The booster currently supports torch AMP; the other two (Apex AMP and naive AMP) are still started through `colossalai.initialize`. If you need them, please refer to [the older tutorial](./mixed_precision_training.md). Support for `bf16` and `fp8` is planned.
 ### Start with Booster
 Instantiate `Booster` with `mixed_precision="fp16"` to train with torch AMP.
 <!--- doc-test-ignore-start -->
 ```python
 """
    Mapping:
    'fp16': torch amp
    'fp16_apex': apex amp,
    'bf16': bf16,
    'fp8': fp8,
    'fp16_naive': naive amp
 """
 from colossalai.booster import Booster
 booster = Booster(mixed_precision='fp16',...)
 ```
 <!--- doc-test-ignore-end -->
 Alternatively, you can create a `FP16TorchMixedPrecision` object yourself, for example:
 <!--- doc-test-ignore-start -->
 ```python
 from colossalai.booster import Booster
 from colossalai.booster.mixed_precision import FP16TorchMixedPrecision
 mixed_precision = FP16TorchMixedPrecision(
    init_scale=2.**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000)
 booster = Booster(mixed_precision=mixed_precision,...)
 ```
 <!--- doc-test-ignore-end -->
 Other AMP types are used in the same way.
 ### Torch AMP Configuration
 {{ autodoc:colossalai.booster.mixed_precision.FP16TorchMixedPrecision }}
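The four arguments shown earlier (`init_scale`, `growth_factor`, `backoff_factor`, `growth_interval`) control dynamic loss scaling. The following is a minimal sketch of that update rule, assuming the arguments carry the same meaning as the identically named arguments of `torch.cuda.amp.GradScaler`. `ToyLossScaler` is a hypothetical class written for illustration; it is not the Colossal-AI or PyTorch implementation.

```python
class ToyLossScaler:
    """Illustrative dynamic loss scaler (hypothetical, not a library class)."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0  # consecutive steps without inf/nan gradients

    def update(self, found_inf: bool) -> bool:
        """Adjust the scale; return True if the optimizer step should be applied."""
        if found_inf:
            # Overflow: shrink the scale and skip this optimizer step.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps == self.growth_interval:
            # A long run of clean steps: try a larger scale again.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True


scaler = ToyLossScaler(growth_interval=3)
history = []
for found_inf in [True, False, False, False]:
    applied = scaler.update(found_inf)
    history.append((applied, scaler.scale))
print(history)
```

A larger `growth_interval` makes the scale recover more slowly after an overflow, while a smaller `backoff_factor` shrinks it more aggressively.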
 ### Apex AMP Configuration
 This mode relies on the Apex implementation of mixed precision training.
 We support this plugin because it allows finer control over the granularity of mixed precision.
 For example, the O2 level (optimization level 2) keeps batch normalization in fp32.
 For more details, please refer to the [Apex Documentation](https://nvidia.github.io/apex/).
 {{ autodoc:colossalai.booster.mixed_precision.FP16ApexMixedPrecision }}
 ### Naive AMP Configuration
 Naive AMP mode achieves mixed precision training while remaining compatible with complex tensor and pipeline parallelism.
 This AMP mode casts all operations to fp16.
 The API reference below shows the mixed precision API for this mode.
 {{ autodoc:colossalai.booster.mixed_precision.FP16NaiveMixedPrecision }}
 When using `colossalai.booster`, you must first instantiate a model, an optimizer and a criterion.
 The output model is converted to an AMP model with lower memory consumption.
 If your input model is already too large to fit on a GPU, instantiate your model weights with `dtype=torch.float16`.
 Otherwise, try a smaller model or check out other parallel training techniques!
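As a rough guide to whether the weights alone fit on a device, the arithmetic can be sketched as follows. The parameter count is an assumed, approximate figure for a ViT-Base sized model, and the estimate deliberately ignores activations, gradients and optimizer states, which usually add considerably more.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2}


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Memory needed to store the weights only, in MiB."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 2**20


num_params = 86_000_000  # assumption: approximate size of a ViT-Base model
fp32_mib = weight_memory_mib(num_params, "fp32")
fp16_mib = weight_memory_mib(num_params, "fp16")
print(f"fp32 weights: {fp32_mib:.0f} MiB, fp16 weights: {fp16_mib:.0f} MiB")
```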
 ## Hands-on Practice
 Now we will introduce the use of AMP with Colossal-AI. In this practice, we will use Torch AMP as an example.
 ### Step 1. Import libraries in train.py
 Create a `train.py` and import the necessary dependencies. Remember to install `scipy` and `timm` by running
 `pip install timm scipy`.
 ```python
 import os
 from pathlib import Path
 import torch
 from timm.models import vit_base_patch16_224
 from titans.utils import barrier_context
 from torchvision import datasets, transforms
 import colossalai
 from colossalai.booster import Booster
 from colossalai.booster.plugin import TorchDDPPlugin
 from colossalai.logging import get_dist_logger
 from colossalai.nn.lr_scheduler import LinearWarmupLR
 ```
 ### Step 2. Initialize Distributed Environment
 Next, initialize the distributed environment. For demo purposes, we use `launch_from_torch`. You can refer to [Launch Colossal-AI](../basics/launch_colossalai.md)
 for other initialization methods.
 ```python
 # initialize distributed setting
 parser = colossalai.get_default_parser()
 args = parser.parse_args()
 # launch from torch
 colossalai.launch_from_torch(config=dict())
 ```
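Launchers in the `torchrun` family pass the distributed settings to each process through environment variables such as `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT`. The sketch below only shows how such variables can be read; `read_torch_launch_env` and its single-process defaults are hypothetical and are not what `launch_from_torch` itself does internally.

```python
import os


def read_torch_launch_env(env=None):
    """Collect the launcher-provided settings, with single-process defaults."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
        "master_addr": env.get("MASTER_ADDR", "localhost"),
        "master_port": int(env.get("MASTER_PORT", 29500)),
    }


# Simulate what process 1 of 4 would see; a real run would read os.environ.
info = read_torch_launch_env({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "4"})
print(info)
```

Printing these values at the top of `train.py` is a quick way to confirm that the script was started through a distributed launcher rather than with plain `python train.py`.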
 ### Step 3. Create training components
 Build your model, optimizer, loss function, lr scheduler and dataloaders. Note that the root path of the dataset is
 obtained from the environment variable `DATA`. You may `export DATA=/path/to/data` or change `Path(os.environ['DATA'])`
 to a path on your machine. Data will be automatically downloaded to the root path.
 ```python
 # define the constants
 NUM_EPOCHS = 2
 BATCH_SIZE = 128
 # build model
 model = vit_base_patch16_224(drop_rate=0.1)
 # build dataloader
 train_dataset = datasets.Caltech101(
    root=Path(os.environ['DATA']),
    download=True,
    transform=transforms.Compose([
        transforms.Resize(256),
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        Gray2RGB(),
        transforms.Normalize([0.5, 0.5, 0.5],
                                [0.5, 0.5, 0.5])
    ]))
 # build optimizer
 optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=0.1)
 # build loss
 criterion = torch.nn.CrossEntropyLoss()
 # lr_scheduler
 lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=NUM_EPOCHS)
 ```
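`Gray2RGB()` appears in the transform pipeline above but is neither defined nor imported in this snippet. A minimal sketch follows, assuming its purpose is to turn single-channel (grayscale) image tensors into three-channel tensors so that the three-value `Normalize` and the ViT model always receive RGB-shaped input. This definition is an assumption, not the author's code, and it would need to be placed before the dataset is built.

```python
class Gray2RGB:
    """Expand a single-channel (1, H, W) image tensor to three channels."""

    def __call__(self, img):
        # After ToTensor(), images are laid out as (C, H, W).
        if img.shape[0] == 1:
            # expand() repeats the channel dimension; -1 keeps a dimension unchanged.
            img = img.expand(3, -1, -1)
        return img
```

Images that already have three channels pass through unchanged, so the transform is safe to apply to every sample.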
 ### Step 4. Inject AMP Feature
 Create a `MixedPrecision` object (if needed) and a `TorchDDPPlugin` object, then call `booster.boost` to convert the training components to run with FP16.
 ```python
 plugin = TorchDDPPlugin()
 train_dataloader = plugin.prepare_dataloader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
 booster = Booster(mixed_precision='fp16', plugin=plugin)
 # if you need to customize the config, do it like this
 # >>> from colossalai.booster.mixed_precision import FP16TorchMixedPrecision
 # >>> mixed_precision = FP16TorchMixedPrecision(
 # >>>     init_scale=2.**16,
 # >>>     growth_factor=2.0,
 # >>>     backoff_factor=0.5,
 # >>>     growth_interval=2000)
 # >>> plugin = TorchDDPPlugin()
 # >>> booster = Booster(mixed_precision=mixed_precision, plugin=plugin)
 # boost model, optimizer, criterion, dataloader, lr_scheduler
 model, optimizer, criterion, train_dataloader, lr_scheduler = booster.boost(model, optimizer, criterion, train_dataloader, lr_scheduler)
 ```
 ### Step 5. Train with Booster
 Use the booster in a normal training loop.
 ```python
 model.train()
 for epoch in range(NUM_EPOCHS):
    for img, label in train_dataloader:
        img = img.cuda()
        label = label.cuda()
        optimizer.zero_grad()
        output = model(img)
        loss = criterion(output, label)
        booster.backward(loss, optimizer)
        optimizer.step()
    lr_scheduler.step()
 ```
 ### Step 6. Invoke Training Scripts
 Use the following command to start the training scripts. You can change `--nproc_per_node` to use a different number of GPUs.
 ```shell
 colossalai run --nproc_per_node 1 train.py
 ```
 <!-- doc-test-command: torchrun --standalone --nproc_per_node=1 mixed_precision_training_with_booster.py  -->
--- a/docs/source/zh-Hans/features/gradient_accumulation_with_booster.md
+++ b/docs/source/zh-Hans/features/gradient_accumulation_with_booster.md
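@@ -131,7 +131,7 @@ for idx, (img, label) in enumerate(train_dataloader):
 ### Step 6. Invoke Training Scripts
 To verify gradient accumulation, we can simply check the change in parameter values. When gradient accumulation is enabled, parameters are updated only in the last step. You can run the script with the following command: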
 ```shell
-colossalai run --nproc_per_node 1 train.py --config config.py
+colossalai run --nproc_per_node 1 train.py
 ```
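 You will see output similar to the text below. It shows that although gradients are computed in the first 3 iterations, the parameters are not updated until the last iteration.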
--- a/docs/source/zh-Hans/features/gradient_clipping_with_booster.md
+++ b/docs/source/zh-Hans/features/gradient_clipping_with_booster.md
@@ -135,6 +135,6 @@ for idx, (img, label) in enumerate(train_dataloader):
 You can run the script with the following command:
 ```shell
-colossalai run --nproc_per_node 1 train.py --config config/config.py
+colossalai run --nproc_per_node 1 train.py
 ```
 <!-- doc-test-command: torchrun --standalone --nproc_per_node=1 gradient_clipping_with_booster.py  -->
--- a/docs/source/zh-Hans/features/mixed_precision_training.md
+++ b/docs/source/zh-Hans/features/mixed_precision_training.md
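@@ -1,4 +1,4 @@
-# Auto Mixed Precision Training (AMP)
+# Auto Mixed Precision Training (Outdated)
 Author: Chuanrui Wang, Shenggui Li, Yongbin Li
@@ -339,6 +339,7 @@ for epoch in range(gpc.config.NUM_EPOCHS):
 Use the following command to start the training scripts. You can change `--nproc_per_node` to use a different number of GPUs.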
-```python
+```shell
 python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py --config config/config_AMP_torch.py
 ```
 <!-- doc-test-command: torchrun --standalone --nproc_per_node=1 mixed_precision_training.py  -->
--- a/docs/source/zh-Hans/features/mixed_precision_training_with_booster.md
+++ b/docs/source/zh-Hans/features/mixed_precision_training_with_booster.md
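@@ -0,0 +1,235 @@
 # Auto Mixed Precision Training (Latest)
 Author: [Mingyan Jiang](https://github.com/jiangmingyan)
 **Prerequisite**
 - [Define Your Configuration](../basics/define_your_config.md)
 - [Training Booster](../basics/booster_api.md)
 **Related Paper**
 - [Accelerating Scientific Computations with Mixed Precision Algorithms](https://arxiv.org/abs/0808.2794)
 ## Introduction
 AMP stands for automatic mixed precision training.
 In Colossal-AI, we have incorporated different implementations of mixed precision training:
 1. torch.cuda.amp
 2. apex.amp
 3. naive amp
 | Colossal-AI | Supports tensor parallelism | Supports pipeline parallelism | fp16 extent |
 | ----------- | ----------------------- | ------------------------- | ----------- |
 | AMP_TYPE.TORCH | ✅ | ❌ | Model parameters, activations and gradients are downcast to fp16 during forward and backward propagation |
 | AMP_TYPE.APEX | ❌ | ❌ | More fine-grained; you can choose opt_level O0, O1, O2 or O3 |
 | AMP_TYPE.NAIVE | ✅ | ✅ | Model parameters and forward and backward operations are all downcast to fp16 |
 The first two rely on the original implementations in PyTorch (version 1.6 and above) and NVIDIA Apex. The last method is similar to Apex O2. Among these methods, Apex AMP is not compatible with tensor parallelism. This is because tensors are split across devices in tensor parallelism, so the processes must communicate with each other to check whether inf or nan occurs anywhere in the model weights. We modified the torch AMP implementation so that it is now compatible with tensor parallelism.
 > ❌️ fp16 and ZeRO are not compatible
 >
 > ⚠️ Pipeline parallelism currently supports only naive AMP
 We recommend torch AMP when pipeline parallelism is not used, as it generally gives better accuracy than naive AMP.
 ## Table of Contents
 In this tutorial we will cover:
 1. [AMP Introduction](#amp-introduction)
 2. [AMP in Colossal-AI](#amp-in-colossal-ai)
 3. [Hands-on Practice](#hands-on-practice)
 ## AMP Introduction
 Automatic mixed precision training combines FP16 and FP32 training.
 The half-precision floating-point format (FP16) is cheaper to compute with and offers higher computational efficiency. In addition, FP16 needs only half the storage of FP32, which saves memory and network bandwidth and leaves more memory for large batch sizes and large models.
 However, some operations, such as reductions, need the dynamic range of FP32 to avoid numeric overflow or underflow. We therefore introduce automatic mixed precision, which tries to match each operation to an appropriate data type, reducing the memory footprint and improving training efficiency.
 <figure style={{textAlign: "center"}}>
 <img src="https://s2.loli.net/2022/01/28/URzLJ3MPeDQbtck.png"/>
 <figcaption>Illustration of AMP (figure from <a href="https://arxiv.org/abs/2108.05818">PatrickStar paper</a>)</figcaption>
 </figure>
 ## AMP in Colossal-AI
 We support three AMP training methods and let you train with AMP without changing your code. The booster supports injecting the AMP feature: to use mixed precision training, specify the `mixed_precision` argument when creating the booster instance. We currently support torch AMP, Apex AMP and naive AMP (torch AMP has been ported to the booster; Apex AMP and naive AMP are still started through `colossalai.initialize`; if you need them, please refer to [this tutorial](./mixed_precision_training.md)). Mixed precision training with `bf16` and `fp8` will be added later.
 ### Start with Booster
 When creating the booster instance, specify `mixed_precision="fp16"` to use torch AMP.
 <!--- doc-test-ignore-start -->
 ```python
 """
    Mapping:
    'fp16': torch amp
    'fp16_apex': apex amp,
    'bf16': bf16,
    'fp8': fp8,
    'fp16_naive': naive amp
 """
 from colossalai.booster import Booster
 booster = Booster(mixed_precision='fp16',...)
 ```
 <!--- doc-test-ignore-end -->
 Alternatively, you can create a custom `FP16TorchMixedPrecision` object, for example:
 <!--- doc-test-ignore-start -->
 ```python
 from colossalai.booster import Booster
 from colossalai.booster.mixed_precision import FP16TorchMixedPrecision
 mixed_precision = FP16TorchMixedPrecision(
    init_scale=2.**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000)
 booster = Booster(mixed_precision=mixed_precision,...)
 ```
 <!--- doc-test-ignore-end -->
 Other AMP types are used in the same way.
 ### Torch AMP Configuration
 {{ autodoc:colossalai.booster.mixed_precision.FP16TorchMixedPrecision }}
 ### Apex AMP Configuration
 This mode relies on Apex to implement mixed precision training. We support this plugin because it allows finer control over the granularity of mixed precision.
 For example, the O2 level (optimization level 2) keeps batch normalization in FP32.
 For more details, please refer to the [Apex Documentation](https://nvidia.github.io/apex/).
 {{ autodoc:colossalai.booster.mixed_precision.FP16ApexMixedPrecision }}
 ### Naive AMP Configuration
 Naive AMP mode achieves mixed precision training while remaining compatible with complex tensor and pipeline parallelism. This AMP mode casts all operations to FP16. The reference below shows how this mode is started with the booster.
 {{ autodoc:colossalai.booster.mixed_precision.FP16NaiveMixedPrecision }}
 When using `colossalai.booster`, you must first instantiate a model, an optimizer and a criterion. The output model is converted to an AMP model with lower memory consumption. If your input model is already too large to fit on a GPU, instantiate your model with `dtype=torch.float16`. Otherwise, try a smaller model or more parallel training techniques!
 ## Hands-on Practice
 Next we show how to use AMP in Colossal-AI. This example uses torch AMP.
 ### Step 1. Import libraries in train.py
 Create `train.py` and import the necessary dependencies. Remember to install `scipy` and `timm` with `pip install timm scipy`.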
 ```python
 import os
 from pathlib import Path
 import torch
 from timm.models import vit_base_patch16_224
 from titans.utils import barrier_context
 from torchvision import datasets, transforms
 import colossalai
 from colossalai.booster import Booster
 from colossalai.booster.plugin import TorchDDPPlugin
 from colossalai.logging import get_dist_logger
 from colossalai.nn.lr_scheduler import LinearWarmupLR
 ```
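 ### Step 2. Initialize Distributed Environment
 We need to initialize the distributed environment. For a quick demo, we use `launch_from_torch`. You can refer to [Launch Colossal-AI](../basics/launch_colossalai.md)
 for other initialization methods.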
 ```python
 # initialize distributed setting
 parser = colossalai.get_default_parser()
 args = parser.parse_args()
 # launch from torch
 colossalai.launch_from_torch(config=dict())
 ```
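 ### Step 3. Create training components
 Build your model, optimizer, loss function, learning rate scheduler and dataloader. Note that the dataset path is taken from the environment variable `DATA`. You can `export DATA=/path/to/data` or change `Path(os.environ['DATA'])`
 to a path on your machine. The data will be downloaded to that path automatically.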
 ```python
 # define the constants
 NUM_EPOCHS = 2
 BATCH_SIZE = 128
 # build model
 model = vit_base_patch16_224(drop_rate=0.1)
 # build dataloader
 train_dataset = datasets.Caltech101(
    root=Path(os.environ['DATA']),
    download=True,
    transform=transforms.Compose([
        transforms.Resize(256),
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        Gray2RGB(),
        transforms.Normalize([0.5, 0.5, 0.5],
                                [0.5, 0.5, 0.5])
    ]))
 # build optimizer
 optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=0.1)
 # build loss
 criterion = torch.nn.CrossEntropyLoss()
 # lr_scheduler
 lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=NUM_EPOCHS)
 ```
 ### Step 4. Inject AMP Feature
 Create a `MixedPrecision` object (if needed) and a `TorchDDPPlugin` object, then call `booster.boost` to convert all training components to FP16 mode.
 ```python
 plugin = TorchDDPPlugin()
 train_dataloader = plugin.prepare_dataloader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
 booster = Booster(mixed_precision='fp16', plugin=plugin)
 # if you need to customize the config, do it like this
 # >>> from colossalai.booster.mixed_precision import FP16TorchMixedPrecision
 # >>> mixed_precision = FP16TorchMixedPrecision(
 # >>>     init_scale=2.**16,
 # >>>     growth_factor=2.0,
 # >>>     backoff_factor=0.5,
 # >>>     growth_interval=2000)
 # >>> plugin = TorchDDPPlugin()
 # >>> booster = Booster(mixed_precision=mixed_precision, plugin=plugin)
 # boost model, optimizer, criterion, dataloader, lr_scheduler
 model, optimizer, criterion, train_dataloader, lr_scheduler = booster.boost(model, optimizer, criterion, train_dataloader, lr_scheduler)
 ```
 ### Step 5. Train with Booster
 Use the booster in a normal training loop.
 ```python
 model.train()
 for epoch in range(NUM_EPOCHS):
    for img, label in train_dataloader:
        img = img.cuda()
        label = label.cuda()
        optimizer.zero_grad()
        output = model(img)
        loss = criterion(output, label)
        booster.backward(loss, optimizer)
        optimizer.step()
    lr_scheduler.step()
 ```
 ### Step 6. Invoke Training Scripts
 Use the following command to start the training scripts. You can change `--nproc_per_node` to use a different number of GPUs.
 ```shell
 colossalai run --nproc_per_node 1 train.py
 ```
 <!-- doc-test-command: torchrun --standalone --nproc_per_node=1 mixed_precision_training_with_booster.py  -->