ColossalAI/colossalai/zero/__init__.py

from typing import Tuple

import torch
import torch.nn as nn
from colossalai.logging import get_dist_logger
from colossalai.zero.sharded_model.sharded_model_v2 import ShardedModelV2
from colossalai.zero.sharded_optim.sharded_optim_v2 import ShardedOptimizerV2
from .zero_optimizer import ZeroOptimizer


def convert_to_zero_v2(model: nn.Module, optimizer: torch.optim.Optimizer, model_config,
                       optimizer_config) -> Tuple[ShardedModelV2, ShardedOptimizerV2]:
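    """
    A helper function that wraps the model and optimizer with ZeRO sharding and offloading.

    :param model: Your model object
    :type model: :class:`torch.nn.Module`
    :param optimizer: Your optimizer object
    :type optimizer: :class:`torch.optim.Optimizer`
    :param model_config: Keyword arguments passed to :class:`ShardedModelV2`; ``None`` means no extra arguments
    :type model_config: :class:`dict`
    :param optimizer_config: Keyword arguments passed to :class:`ShardedOptimizerV2`; ``None`` means no extra arguments
    :type optimizer_config: :class:`dict`

    :return: (model, optimizer)
    :rtype: Tuple
    """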

    logger = get_dist_logger('convert_to_zero_v2')

    logger.info(f'optimizer_config is {optimizer_config}', ranks=[0])
    if optimizer_config is None:
        optimizer_config = dict()
    logger.info(f'model_config is {model_config}', ranks=[0])
    if model_config is None:
        model_config = dict()

    zero_model = ShardedModelV2(model, **model_config)
    zero_optimizer = ShardedOptimizerV2(zero_model, optimizer, **optimizer_config)
    return zero_model, zero_optimizer


__all__ = ['convert_to_zero_v2', 'ShardedModelV2', 'ShardedOptimizerV2', 'ZeroOptimizer']
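
The config handling in `convert_to_zero_v2` can be tried without a distributed setup. The following is a minimal sketch, not the library's implementation: `FakeShardedModel`, `FakeShardedOptimizer`, `convert_sketch`, and the `example_option` key are hypothetical stand-ins introduced only for illustration. It shows how a `None` config falls back to an empty dict and how each dict is unpacked into constructor keyword arguments.

```python
from typing import Any, Dict, Optional, Tuple


class FakeShardedModel:
    """Hypothetical stand-in for ShardedModelV2; it only records its keyword arguments."""

    def __init__(self, module: Any, **kwargs: Any) -> None:
        self.module = module
        self.kwargs = kwargs


class FakeShardedOptimizer:
    """Hypothetical stand-in for ShardedOptimizerV2; it only records its keyword arguments."""

    def __init__(self, sharded_model: FakeShardedModel, optimizer: Any, **kwargs: Any) -> None:
        self.model = sharded_model
        self.optimizer = optimizer
        self.kwargs = kwargs


def convert_sketch(
    model: Any,
    optimizer: Any,
    model_config: Optional[Dict[str, Any]] = None,
    optimizer_config: Optional[Dict[str, Any]] = None,
) -> Tuple[FakeShardedModel, FakeShardedOptimizer]:
    # Mirror the flow of convert_to_zero_v2: a None config becomes an empty dict,
    # then each dict is unpacked into the constructor's keyword arguments.
    if model_config is None:
        model_config = {}
    if optimizer_config is None:
        optimizer_config = {}
    sharded_model = FakeShardedModel(model, **model_config)
    sharded_optim = FakeShardedOptimizer(sharded_model, optimizer, **optimizer_config)
    return sharded_model, sharded_optim


# 'example_option' is a made-up key, not a real ShardedOptimizerV2 argument.
sketch_model, sketch_optim = convert_sketch("model", "optimizer", None, {"example_option": 1})
```

In the real function, the keys of each dict must match keyword arguments that `ShardedModelV2` and `ShardedOptimizerV2` actually accept, since an unknown key would raise a `TypeError` at construction time.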