ColossalAI/colossalai/engine/gradient_handler/_pipeline_parallel_gradient...

#!/usr/bin/env python

from collections import defaultdict

import torch
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

from colossalai.core import global_context as gpc
from colossalai.registry import GRADIENT_HANDLER

from ._base_gradient_handler import BaseGradientHandler


@GRADIENT_HANDLER.register_module
class PipelineSharedModuleGradientHandler(BaseGradientHandler):
    """A helper class to handle all-reduce operations in sub parallel groups.
    A all-reduce collective communication will be operated in
    :func:`handle_gradient` among all sub pipeline parallel groups.
    For better performance, it bucketizes the gradients of all parameters that are
    the same type to improve the efficiency of communication.

    Args:
        model (Module): Model where the gradients accumulate.
        optimizer (Optimizer): Optimizer for updating the parameters.
    """

    def handle_gradient(self):
        """A method running a all-reduce operation in sub pipeline parallel groups.
        """
        if gpc.pipeline_parallel_size > 1:
            # bucketize and all-reduce
            buckets = defaultdict(lambda: defaultdict(list))
            # Pack the buckets.
            for param in self._model.parameters():
                group = getattr(param, 'pipeline_shared_module_pg', None)
                if param.requires_grad and group is not None and (
                    (hasattr(param, 'colo_attr') and not param.colo_attr.saved_grad.is_null())
                        or param.grad is not None):
                    tp = param.data.type()
                    buckets[group][tp].append(param)

            # For each bucket, all-reduce and copy all-reduced grads.
            for group, group_buckets in buckets.items():
                for tp, bucket in group_buckets.items():
                    grads = [
                        param.colo_attr.grad_payload if hasattr(param, 'colo_attr') else param.grad.data
                        for param in bucket
                    ]
                    coalesced = _flatten_dense_tensors(grads).to(torch.cuda.current_device())
                    dist.all_reduce(coalesced, op=dist.ReduceOp.SUM, group=group)
                    for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)):
                        buf.copy_(synced)
Optimize pipeline schedule (#94) * add pipeline shared module wrapper and update load batch * added model parallel process group for amp and clip grad (#86) * added model parallel process group for amp and clip grad * update amp and clip with model parallel process group * remove pipeline_prev/next group (#88) * micro batch offload * optimize pipeline gpu memory usage * pipeline can receive tensor shape (#93) * optimize pipeline gpu memory usage * fix grad accumulation step counter * rename classes and functions Co-authored-by: Frank Lee <somerlee.9@gmail.com> 2021-12-30 07:56:46 +00:00			`#!/usr/bin/env python`

[zero] ZeRO supports pipeline parallel (#477) 2022-03-21 08:55:37 +00:00			`from collections import defaultdict`
Optimize pipeline schedule (#94) * add pipeline shared module wrapper and update load batch * added model parallel process group for amp and clip grad (#86) * added model parallel process group for amp and clip grad * update amp and clip with model parallel process group * remove pipeline_prev/next group (#88) * micro batch offload * optimize pipeline gpu memory usage * pipeline can receive tensor shape (#93) * optimize pipeline gpu memory usage * fix grad accumulation step counter * rename classes and functions Co-authored-by: Frank Lee <somerlee.9@gmail.com> 2021-12-30 07:56:46 +00:00
[zero] ZeRO supports pipeline parallel (#477) 2022-03-21 08:55:37 +00:00			`import torch`
			`import torch.distributed as dist`
[format] Run lint on colossalai.engine (#3367) 2023-04-05 15:24:43 +00:00			`from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors`

Optimize pipeline schedule (#94) * add pipeline shared module wrapper and update load batch * added model parallel process group for amp and clip grad (#86) * added model parallel process group for amp and clip grad * update amp and clip with model parallel process group * remove pipeline_prev/next group (#88) * micro batch offload * optimize pipeline gpu memory usage * pipeline can receive tensor shape (#93) * optimize pipeline gpu memory usage * fix grad accumulation step counter * rename classes and functions Co-authored-by: Frank Lee <somerlee.9@gmail.com> 2021-12-30 07:56:46 +00:00			`from colossalai.core import global_context as gpc`
			`from colossalai.registry import GRADIENT_HANDLER`
[zero] ZeRO supports pipeline parallel (#477) 2022-03-21 08:55:37 +00:00
Optimize pipeline schedule (#94) * add pipeline shared module wrapper and update load batch * added model parallel process group for amp and clip grad (#86) * added model parallel process group for amp and clip grad * update amp and clip with model parallel process group * remove pipeline_prev/next group (#88) * micro batch offload * optimize pipeline gpu memory usage * pipeline can receive tensor shape (#93) * optimize pipeline gpu memory usage * fix grad accumulation step counter * rename classes and functions Co-authored-by: Frank Lee <somerlee.9@gmail.com> 2021-12-30 07:56:46 +00:00			`from ._base_gradient_handler import BaseGradientHandler`


			`@GRADIENT_HANDLER.register_module`
			`class PipelineSharedModuleGradientHandler(BaseGradientHandler):`
			`"""A helper class to handle all-reduce operations in sub parallel groups.`
[format] Run lint on colossalai.engine (#3367) 2023-04-05 15:24:43 +00:00			`A all-reduce collective communication will be operated in`
Optimize pipeline schedule (#94) * add pipeline shared module wrapper and update load batch * added model parallel process group for amp and clip grad (#86) * added model parallel process group for amp and clip grad * update amp and clip with model parallel process group * remove pipeline_prev/next group (#88) * micro batch offload * optimize pipeline gpu memory usage * pipeline can receive tensor shape (#93) * optimize pipeline gpu memory usage * fix grad accumulation step counter * rename classes and functions Co-authored-by: Frank Lee <somerlee.9@gmail.com> 2021-12-30 07:56:46 +00:00			:func:`handle_gradient` among all sub pipeline parallel groups.
[format] Run lint on colossalai.engine (#3367) 2023-04-05 15:24:43 +00:00			`For better performance, it bucketizes the gradients of all parameters that are`
Optimize pipeline schedule (#94) * add pipeline shared module wrapper and update load batch * added model parallel process group for amp and clip grad (#86) * added model parallel process group for amp and clip grad * update amp and clip with model parallel process group * remove pipeline_prev/next group (#88) * micro batch offload * optimize pipeline gpu memory usage * pipeline can receive tensor shape (#93) * optimize pipeline gpu memory usage * fix grad accumulation step counter * rename classes and functions Co-authored-by: Frank Lee <somerlee.9@gmail.com> 2021-12-30 07:56:46 +00:00			`the same type to improve the efficiency of communication.`
[doc] improved docstring and assertion messages for the engine module (#871) 2022-04-26 02:00:18 +00:00
			`Args:`
			`model (Module): Model where the gradients accumulate.`
			`optimizer (Optimizer): Optimizer for updating the parameters.`
Optimize pipeline schedule (#94) * add pipeline shared module wrapper and update load batch * added model parallel process group for amp and clip grad (#86) * added model parallel process group for amp and clip grad * update amp and clip with model parallel process group * remove pipeline_prev/next group (#88) * micro batch offload * optimize pipeline gpu memory usage * pipeline can receive tensor shape (#93) * optimize pipeline gpu memory usage * fix grad accumulation step counter * rename classes and functions Co-authored-by: Frank Lee <somerlee.9@gmail.com> 2021-12-30 07:56:46 +00:00			`"""`

			`def handle_gradient(self):`
			`"""A method running a all-reduce operation in sub pipeline parallel groups.`
			`"""`
			`if gpc.pipeline_parallel_size > 1:`
			`# bucketize and all-reduce`
			`buckets = defaultdict(lambda: defaultdict(list))`
			`# Pack the buckets.`
			`for param in self._model.parameters():`
			`group = getattr(param, 'pipeline_shared_module_pg', None)`
[hotfix] fix PipelineSharedModuleGradientHandler (#1314) 2022-07-14 09:31:13 +00:00			`if param.requires_grad and group is not None and (`
			`(hasattr(param, 'colo_attr') and not param.colo_attr.saved_grad.is_null())`
			`or param.grad is not None):`
Optimize pipeline schedule (#94) * add pipeline shared module wrapper and update load batch * added model parallel process group for amp and clip grad (#86) * added model parallel process group for amp and clip grad * update amp and clip with model parallel process group * remove pipeline_prev/next group (#88) * micro batch offload * optimize pipeline gpu memory usage * pipeline can receive tensor shape (#93) * optimize pipeline gpu memory usage * fix grad accumulation step counter * rename classes and functions Co-authored-by: Frank Lee <somerlee.9@gmail.com> 2021-12-30 07:56:46 +00:00			`tp = param.data.type()`
			`buckets[group][tp].append(param)`

			`# For each bucket, all-reduce and copy all-reduced grads.`
			`for group, group_buckets in buckets.items():`
			`for tp, bucket in group_buckets.items():`
[hotfix] fix PipelineSharedModuleGradientHandler (#1314) 2022-07-14 09:31:13 +00:00			`grads = [`
			`param.colo_attr.grad_payload if hasattr(param, 'colo_attr') else param.grad.data`
			`for param in bucket`
			`]`
[zero] ZeRO supports pipeline parallel (#477) 2022-03-21 08:55:37 +00:00			`coalesced = _flatten_dense_tensors(grads).to(torch.cuda.current_device())`
Optimize pipeline schedule (#94) * add pipeline shared module wrapper and update load batch * added model parallel process group for amp and clip grad (#86) * added model parallel process group for amp and clip grad * update amp and clip with model parallel process group * remove pipeline_prev/next group (#88) * micro batch offload * optimize pipeline gpu memory usage * pipeline can receive tensor shape (#93) * optimize pipeline gpu memory usage * fix grad accumulation step counter * rename classes and functions Co-authored-by: Frank Lee <somerlee.9@gmail.com> 2021-12-30 07:56:46 +00:00			`dist.all_reduce(coalesced, op=dist.ReduceOp.SUM, group=group)`
			`for buf, synced in zip(grads, _unflatten_dense_tensors(coalesced, grads)):`
			`buf.copy_(synced)`