[doc] update gradient accumulation (#3771)

* [doc]update gradient accumulation * [doc]update gradient accumulation * [doc]update gradient accumulation * [doc]update gradient accumulation * [doc]update gradient accumulation, fix * [doc]update gradient accumulation, fix * [doc]update gradient accumulation, fix * [doc]update gradient accumulation, add sidebars * [doc]update gradient accumulation, fix * [doc]update gradient accumulation, fix * [doc]update gradient accumulation, fix * [doc]update gradient accumulation, resolve comments * [doc]update gradient accumulation, resolve comments * fix
2023-05-23 10:52:30 +08:00 · 2023-05-23 10:52:30 +08:00 · ef02d7ef6d
parent f5c425c898
commit ef02d7ef6d
5 changed files with 296 additions and 2 deletions
--- a/docs/sidebars.json
+++ b/docs/sidebars.json
@ -44,6 +44,7 @@
      "collapsed": true,
      "items": [
        "features/mixed_precision_training",
+        "features/gradient_accumulation_with_booster",
        "features/gradient_accumulation",
        "features/gradient_clipping_with_booster",
        "features/gradient_clipping",
--- a/docs/source/en/features/gradient_accumulation.md
+++ b/docs/source/en/features/gradient_accumulation.md
@ -1,4 +1,4 @@
-# Gradient Accumulation
+# Gradient Accumulation (Outdated)

 Author: Shenggui Li, Yongbin Li

@ -43,3 +43,5 @@ iteration 1, first 10 elements of param: tensor([-0.0208,  0.0189,  0.0234,  0.0
 iteration 2, first 10 elements of param: tensor([-0.0208,  0.0189,  0.0234,  0.0047,  0.0116, -0.0283,  0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
 iteration 3, first 10 elements of param: tensor([-0.0141,  0.0464,  0.0507,  0.0321,  0.0356, -0.0150,  0.0172, -0.0118, 0.0222,  0.0473], device='cuda:0', grad_fn=<SliceBackward0>)
 ```
+
+<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 gradient_accumulation.py  -->
--- a/docs/source/en/features/gradient_accumulation_with_booster.md
+++ b/docs/source/en/features/gradient_accumulation_with_booster.md
@ -0,0 +1,144 @@
+# Gradient Accumulation (Latest)
+
+Author: [Mingyan Jiang](https://github.com/jiangmingyan)
+
+**Prerequisite**
+- [Define Your Configuration](../basics/define_your_config.md)
+- [Training Booster](../basics/booster_api.md)
+
+## Introduction
+
+Gradient accumulation is a common way to enlarge your batch size for training. When training large-scale models, memory can easily become the bottleneck and the batch size can be very small, (e.g. 2), leading to unsatisfactory convergence. Gradient accumulation works by adding up the gradients calculated in multiple iterations, and only update the parameters in the preset iteration.
+
+## Usage
+
+It is simple to use gradient accumulation in Colossal-AI. Just call `booster.no_sync()` which returns a context manager. It accumulate gradients without synchronization, meanwhile you should not update the weights.
+
+## Hands-on Practice
+
+We now demonstrate gradient accumulation. In this example, we let the gradient accumulation size to be 4.
+
+### Step 1. Import libraries in train.py
+Create a `train.py` and import the necessary dependencies. The version of `torch` should not be lower than 1.8.1.
+
+```python
+import os
+from pathlib import Path
+
+import torch
+from torchvision import transforms
+from torchvision.datasets import CIFAR10
+from torchvision.models import resnet18
+from torch.utils.data import DataLoader
+
+import colossalai
+from colossalai.booster import Booster
+from colossalai.booster.plugin import TorchDDPPlugin
+from colossalai.logging import get_dist_logger
+from colossalai.cluster.dist_coordinator import priority_execution
+```
+
+### Step 2. Initialize Distributed Environment
+We then need to initialize distributed environment. For demo purpose, we uses `launch_from_torch`. You can refer to [Launch Colossal-AI](../basics/launch_colossalai.md) for other initialization methods.
+
+```python
+# initialize distributed setting
+parser = colossalai.get_default_parser()
+args = parser.parse_args()
+# launch from torch
+colossalai.launch_from_torch(config=dict())
+```
+
+### Step 3. Create training components
+Build your model, optimizer, loss function, lr scheduler and dataloaders. Note that the root path of the dataset is obtained from the environment variable `DATA`. You may `export DATA=/path/to/data` or change `Path(os.environ['DATA'])` to a path on your machine. Data will be automatically downloaded to the root path.
+
+```python
+# define the training hyperparameters
+BATCH_SIZE = 128
+GRADIENT_ACCUMULATION = 4
+
+# build resnet
+model = resnet18(num_classes=10)
+
+# build dataloaders
+with priority_execution():
+    train_dataset = CIFAR10(root=Path(os.environ.get('DATA', './data')),
+                            download=True,
+                            transform=transforms.Compose([
+                                transforms.RandomCrop(size=32, padding=4),
+                                transforms.RandomHorizontalFlip(),
+                                transforms.ToTensor(),
+                                transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[0.2023, 0.1994, 0.2010]),
+                            ]))
+
+# build criterion
+criterion = torch.nn.CrossEntropyLoss()
+
+# optimizer
+optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
+```
+
+### Step 4. Inject Feature
+Create a `TorchDDPPlugin` object to instantiate a `Booster`, and boost these training components.
+
+```python
+plugin = TorchDDPPlugin()
+booster = Booster(plugin=plugin)
+train_dataloader = plugin.prepare_dataloader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
+model, optimizer, criterion, train_dataloader, _ = booster.boost(model=model,
+                                                                    optimizer=optimizer,
+                                                                    criterion=criterion,
+                                                                    dataloader=train_dataloader)
+```
+
+### Step 5. Train with Booster
+Use booster in a normal training loops, and verify gradient accumulation. `param_by_iter` is to record the distributed training information.
+```python
+optimizer.zero_grad()
+for idx, (img, label) in enumerate(train_dataloader):
+        sync_context = booster.no_sync(model)
+        img = img.cuda()
+        label = label.cuda()
+        if idx % (GRADIENT_ACCUMULATION - 1) != 0:
+            with sync_context:
+                output = model(img)
+                train_loss = criterion(output, label)
+                booster.backward(train_loss, optimizer)
+        else:
+            output = model(img)
+            train_loss = criterion(output, label)
+            booster.backward(train_loss, optimizer)
+            optimizer.step()
+            optimizer.zero_grad()
+
+        ele_1st = next(model.parameters()).flatten()[0]
+        param_by_iter.append(str(ele_1st.item()))
+
+        if idx != 0 and idx % (GRADIENT_ACCUMULATION - 1) == 0:
+            break
+
+    for iteration, val in enumerate(param_by_iter):
+        print(f'iteration {iteration} - value: {val}')
+
+    if param_by_iter[-1] != param_by_iter[0]:
+        print('The parameter is only updated in the last iteration')
+
+```
+
+### Step 6. Invoke Training Scripts
+To verify gradient accumulation, we can just check the change of parameter values. When gradient accumulation is set, parameters are only updated in the last step. You can run the script using this command:
+```shell
+colossalai run --nproc_per_node 1 train.py --config config.py
+```
+
+You will see output similar to the text below. This shows gradient is indeed accumulated as the parameter is not updated
+in the first 3 steps, but only updated in the last step.
+
+```text
+iteration 0, first 10 elements of param: tensor([-0.0208,  0.0189,  0.0234,  0.0047,  0.0116, -0.0283,  0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
+iteration 1, first 10 elements of param: tensor([-0.0208,  0.0189,  0.0234,  0.0047,  0.0116, -0.0283,  0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
+iteration 2, first 10 elements of param: tensor([-0.0208,  0.0189,  0.0234,  0.0047,  0.0116, -0.0283,  0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
+iteration 3, first 10 elements of param: tensor([-0.0141,  0.0464,  0.0507,  0.0321,  0.0356, -0.0150,  0.0172, -0.0118, 0.0222,  0.0473], device='cuda:0', grad_fn=<SliceBackward0>)
+```
+
+<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 gradient_accumulation_with_booster.py  -->
--- a/docs/source/zh-Hans/features/gradient_accumulation.md
+++ b/docs/source/zh-Hans/features/gradient_accumulation.md
@ -1,4 +1,4 @@
-# 梯度累积
+# 梯度累积 (旧版本)

 作者: Shenggui Li, Yongbin Li

@ -38,3 +38,4 @@ iteration 1, first 10 elements of param: tensor([-0.0208,  0.0189,  0.0234,  0.0
 iteration 2, first 10 elements of param: tensor([-0.0208,  0.0189,  0.0234,  0.0047,  0.0116, -0.0283,  0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
 iteration 3, first 10 elements of param: tensor([-0.0141,  0.0464,  0.0507,  0.0321,  0.0356, -0.0150,  0.0172, -0.0118, 0.0222,  0.0473], device='cuda:0', grad_fn=<SliceBackward0>)
 ```
+<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 gradient_accumulation.py  -->
--- a/docs/source/zh-Hans/features/gradient_accumulation_with_booster.md
+++ b/docs/source/zh-Hans/features/gradient_accumulation_with_booster.md
@ -0,0 +1,146 @@
+# 梯度累积 (最新版本)
+
+作者: [Mingyan Jiang](https://github.com/jiangmingyan)
+
+**前置教程**
+- [定义配置文件](../basics/define_your_config.md)
+- [训练中使用Booster](../basics/booster_api.md)
+
+## 引言
+
+梯度累积是一种常见的增大训练 batch size 的方式。 在训练大模型时，内存经常会成为瓶颈，并且 batch size 通常会很小（如2），这导致收敛性无法保证。梯度累积将多次迭代的梯度累加，并仅在达到预设迭代次数时更新参数。
+
+## 使用
+
+在 Colossal-AI 中使用梯度累积非常简单，booster提供no_sync返回一个上下文管理器，在该上下文管理器下取消同步并且累积梯度。
+
+## 实例
+
+我们将介绍如何使用梯度累积。在这个例子中，梯度累积次数被设置为4。
+
+### 步骤 1. 在 train.py 导入相关库
+创建train.py并导入必要依赖。 `torch` 的版本应不低于1.8.1。
+
+```python
+import os
+from pathlib import Path
+
+import torch
+from torchvision import transforms
+from torchvision.datasets import CIFAR10
+from torchvision.models import resnet18
+
+import colossalai
+from colossalai.booster import Booster
+from colossalai.booster.plugin import TorchDDPPlugin
+from colossalai.logging import get_dist_logger
+from colossalai.cluster.dist_coordinator import priority_execution
+```
+
+### 步骤 2. 初始化分布式环境
+
+我们需要初始化分布式环境。为了快速演示，我们使用`launch_from_torch`。你可以参考 [Launch Colossal-AI](../basics/launch_colossalai.md)使用其他初始化方法。
+
+```python
+# initialize distributed setting
+parser = colossalai.get_default_parser()
+args = parser.parse_args()
+
+# launch from torch
+colossalai.launch_from_torch(config=dict())
+
+```
+
+### 步骤 3. 创建训练组件
+
+构建你的模型、优化器、损失函数、学习率调整器和数据加载器。注意数据集的路径从环境变量`DATA`获得。你可以通过 `export DATA=/path/to/data` 或 `Path(os.environ['DATA'])`，在你的机器上设置路径。数据将会被自动下载到该路径。
+
+```python
+# define the training hyperparameters
+BATCH_SIZE = 128
+GRADIENT_ACCUMULATION = 4
+
+# build resnet
+model = resnet18(num_classes=10)
+
+# build dataloaders
+with priority_execution():
+    train_dataset = CIFAR10(root=Path(os.environ.get('DATA', './data')),
+                            download=True,
+                            transform=transforms.Compose([
+                                transforms.RandomCrop(size=32, padding=4),
+                                transforms.RandomHorizontalFlip(),
+                                transforms.ToTensor(),
+                                transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[0.2023, 0.1994, 0.2010]),
+                            ]))
+
+# build criterion
+criterion = torch.nn.CrossEntropyLoss()
+
+# optimizer
+optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
+```
+
+### 步骤 4. 注入特性
+创建一个`TorchDDPPlugin`对象，并作为参实例化`Booster`， 调用`booster.boost`注入特性。
+
+```python
+plugin = TorchDDPPlugin()
+booster = Booster(plugin=plugin)
+train_dataloader = plugin.prepare_dataloader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
+model, optimizer, criterion, train_dataloader, _ = booster.boost(model=model,
+                                                                    optimizer=optimizer,
+                                                                    criterion=criterion,
+                                                                    dataloader=train_dataloader)
+```
+
+### 步骤 5. 使用booster训练
+使用booster构建一个普通的训练循环，验证梯度累积。 `param_by_iter` 记录分布训练的信息。
+```python
+optimizer.zero_grad()
+for idx, (img, label) in enumerate(train_dataloader):
+        sync_context = booster.no_sync(model)
+        img = img.cuda()
+        label = label.cuda()
+        if idx % (GRADIENT_ACCUMULATION - 1) != 0:
+            with sync_context:
+                output = model(img)
+                train_loss = criterion(output, label)
+                booster.backward(train_loss, optimizer)
+        else:
+            output = model(img)
+            train_loss = criterion(output, label)
+            booster.backward(train_loss, optimizer)
+            optimizer.step()
+            optimizer.zero_grad()
+
+        ele_1st = next(model.parameters()).flatten()[0]
+        param_by_iter.append(str(ele_1st.item()))
+
+        if idx != 0 and idx % (GRADIENT_ACCUMULATION - 1) == 0:
+            break
+
+    for iteration, val in enumerate(param_by_iter):
+        print(f'iteration {iteration} - value: {val}')
+
+    if param_by_iter[-1] != param_by_iter[0]:
+        print('The parameter is only updated in the last iteration')
+
+```
+
+### 步骤 6. 启动训练脚本
+为了验证梯度累积，我们可以只检查参数值的变化。当设置梯度累加时，仅在最后一步更新参数。您可以使用以下命令运行脚本：
+```shell
+colossalai run --nproc_per_node 1 train.py --config config.py
+```
+
+你将会看到类似下方的文本输出。这展现了梯度虽然在前3个迭代中被计算，但直到最后一次迭代，参数才被更新。
+
+```text
+iteration 0, first 10 elements of param: tensor([-0.0208,  0.0189,  0.0234,  0.0047,  0.0116, -0.0283,  0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
+iteration 1, first 10 elements of param: tensor([-0.0208,  0.0189,  0.0234,  0.0047,  0.0116, -0.0283,  0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
+iteration 2, first 10 elements of param: tensor([-0.0208,  0.0189,  0.0234,  0.0047,  0.0116, -0.0283,  0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
+iteration 3, first 10 elements of param: tensor([-0.0141,  0.0464,  0.0507,  0.0321,  0.0356, -0.0150,  0.0172, -0.0118, 0.0222,  0.0473], device='cuda:0', grad_fn=<SliceBackward0>)
+```
+
+<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 gradient_accumulation_with_booster.py  -->