# Gradient Handler

Author: Shenggui Li, Yongbin Li

**Prerequisite**
- [Define Your Configuration](../basics/define_your_config.md)
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)

**Example Code**
- [ColossalAI-Examples Gradient Handler](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)

## Introduction

In distributed training, gradient synchronization is required at the end of each iteration. This is important because we
need to make sure the parameters are updated with the same gradients on different machines, so that the resulting
parameters stay identical. This is most commonly seen in data parallel training, where the model is replicated across
data parallel ranks.

In Colossal-AI, we provide an interface for users to customize how they want to handle the synchronization. This brings
flexibility in cases such as implementing a new parallelism method.

When gradient handlers are used, PyTorch `DistributedDataParallel` will not be used, as it already synchronizes gradients
automatically.
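
To make the synchronization step concrete, the snippet below shows what a gradient handler typically does under the hood:
averaging each parameter's gradient across all ranks with `torch.distributed`. This is a minimal, self-contained sketch for
illustration only, and it assumes the default process group has already been initialized.

```python
import torch
import torch.distributed as dist


def sync_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all ranks, as a data parallel handler would."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # sum the gradient over all ranks, then divide to get the average
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data.div_(world_size)
```
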
## Customize Your Gradient Handlers

To implement a customized gradient handler, you need to follow these steps.
1. Inherit `BaseGradientHandler` in Colossal-AI.
2. Register the gradient handler into the `GRADIENT_HANDLER` registry.
3. Implement the `handle_gradient` method.

```python
from colossalai.legacy.registry import GRADIENT_HANDLER
from colossalai.legacy.engine.gradient_handler import BaseGradientHandler


@GRADIENT_HANDLER.register_module
class MyGradientHandler(BaseGradientHandler):

    def handle_gradient(self):
        # perform your gradient synchronization logic here
        do_something()
```
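
For example, a handler for plain data parallel training could all-reduce every parameter gradient over all ranks. The
sketch below is illustrative rather than the built-in `DataParallelGradientHandler`; it assumes `torch.distributed` has
been initialized and that `BaseGradientHandler` exposes the wrapped model as `self._model`, as in the legacy
implementation.

```python
import torch.distributed as dist

from colossalai.legacy.registry import GRADIENT_HANDLER
from colossalai.legacy.engine.gradient_handler import BaseGradientHandler


@GRADIENT_HANDLER.register_module
class AllReduceGradientHandler(BaseGradientHandler):
    """Illustrative handler that averages gradients across all ranks."""

    def handle_gradient(self):
        world_size = dist.get_world_size()
        # `self._model` is assumed to be stored by BaseGradientHandler's constructor,
        # as in the legacy implementation
        for param in self._model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
                param.grad.data.div_(world_size)
```
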
## Usage

To use a gradient handler, specify it in the config file. The gradient handler will then be automatically built and
attached to the engine.

```python
gradient_handler = [dict(type='MyGradientHandler')]
```
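
For context, this entry lives in an ordinary Colossal-AI config file, which is plain Python. The sketch below is a minimal,
assumed example: `BATCH_SIZE` and `NUM_EPOCHS` are placeholder hyperparameters unrelated to the handler, and since
`gradient_handler` is a list, more than one handler may be listed.

```python
# config.py -- a minimal sketch of a Colossal-AI config file
# BATCH_SIZE and NUM_EPOCHS are placeholder hyperparameters for illustration only
BATCH_SIZE = 128
NUM_EPOCHS = 2

# every handler listed here is built and attached to the engine
gradient_handler = [dict(type='MyGradientHandler')]
```
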
### Hands-On Practice

We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)
to demonstrate the use of gradient handlers. In this example, we use `DataParallelGradientHandler` instead of PyTorch's
`DistributedDataParallel` for data parallel training.

```shell
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py
```
<!-- doc-test-command: echo -->