ColossalAI/docs/source/en/features/gradient_accumulation.md

# Gradient Accumulation

Author: Shenggui Li, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)

**Example Code**
- [ColossalAI-Examples Gradient Accumulation](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)

## Introduction

Gradient accumulation is a common way to enlarge your batch size for training.
When training large-scale models, memory can easily become the bottleneck and the batch size can be very small, (e.g. 2),
leading to unsatisfactory convergence. Gradient accumulation works by adding up the gradients calculated in multiple iterations,
and only update the parameters in the preset iteration.

## Usage

It is simple to use gradient accumulation in Colossal-AI. Just add this following configuration into your config file.
The integer represents the number of iterations to accumulate gradients.

```python
gradient_accumulation = <int>
```

## Hands-on Practice

We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
to demonstrate gradient accumulation. In this example, we set the gradient accumulation size to be 4. You can run the script using this command:

```shell
python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500  run_resnet_cifar10_with_engine.py
```

You will see output similar to the text below. This shows gradient is indeed accumulated as the parameter is not updated
in the first 3 steps, but only updated in the last step.

```text
iteration 0, first 10 elements of param: tensor([-0.0208,  0.0189,  0.0234,  0.0047,  0.0116, -0.0283,  0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 1, first 10 elements of param: tensor([-0.0208,  0.0189,  0.0234,  0.0047,  0.0116, -0.0283,  0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 2, first 10 elements of param: tensor([-0.0208,  0.0189,  0.0234,  0.0047,  0.0116, -0.0283,  0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
iteration 3, first 10 elements of param: tensor([-0.0141,  0.0464,  0.0507,  0.0321,  0.0356, -0.0150,  0.0172, -0.0118, 0.0222,  0.0473], device='cuda:0', grad_fn=<SliceBackward0>)
```
[doc] migrate the markdown files (#2652) 2023-02-09 06:21:38 +00:00			`# Gradient Accumulation`

			`Author: Shenggui Li, Yongbin Li`

			`Prerequisite`
			`- [Define Your Configuration](../basics/define_your_config.md)`
			`- [Use Engine and Trainer in Training](../basics/engine_trainer.md)`

			`Example Code`
			`- [ColossalAI-Examples Gradient Accumulation](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)`

			`## Introduction`

			`Gradient accumulation is a common way to enlarge your batch size for training.`
			`When training large-scale models, memory can easily become the bottleneck and the batch size can be very small, (e.g. 2),`
			`leading to unsatisfactory convergence. Gradient accumulation works by adding up the gradients calculated in multiple iterations,`
			`and only update the parameters in the preset iteration.`

			`## Usage`

			`It is simple to use gradient accumulation in Colossal-AI. Just add this following configuration into your config file.`
			`The integer represents the number of iterations to accumulate gradients.`

			```python
			`gradient_accumulation = <int>`
			```

			`## Hands-on Practice`

			`We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)`
[doc] Fix typo under colossalai and doc(#3618) * Fixed several spelling errors under colossalai * Fix the spelling error in colossalai and docs directory * Cautious Changed the spelling error under the example folder * Update runtime_preparation_pass.py revert autograft to autograd * Update search_chunk.py utile to until * Update check_installation.py change misteach to mismatch in line 91 * Update 1D_tensor_parallel.md revert to perceptron * Update 2D_tensor_parallel.md revert to perceptron in line 73 * Update 2p5D_tensor_parallel.md revert to perceptron in line 71 * Update 3D_tensor_parallel.md revert to perceptron in line 80 * Update README.md revert to resnet in line 42 * Update reorder_graph.py revert to indice in line 7 * Update p2p.py revert to megatron in line 94 * Update initialize.py revert to torchrun in line 198 * Update routers.py change to detailed in line 63 * Update routers.py change to detailed in line 146 * Update README.md revert random number in line 402 2023-04-26 03:38:43 +00:00			`to demonstrate gradient accumulation. In this example, we set the gradient accumulation size to be 4. You can run the script using this command:`
[doc] migrate the markdown files (#2652) 2023-02-09 06:21:38 +00:00
			```shell
			`python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py`
			```

			`You will see output similar to the text below. This shows gradient is indeed accumulated as the parameter is not updated`
			`in the first 3 steps, but only updated in the last step.`

			```text
			`iteration 0, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)`
			`iteration 1, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)`
			`iteration 2, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)`
			`iteration 3, first 10 elements of param: tensor([-0.0141, 0.0464, 0.0507, 0.0321, 0.0356, -0.0150, 0.0172, -0.0118, 0.0222, 0.0473], device='cuda:0', grad_fn=<SliceBackward0>)`
			```