ColossalAI/tests/test_legacy/test_data/test_data_parallel_sampler.py

#!/usr/bin/env python
# -*- encoding: utf-8 -*-

import os
from pathlib import Path

import torch
import torch.distributed as dist
from torchvision import datasets, transforms

import colossalai
from colossalai.context import Config
from colossalai.legacy.context import ParallelMode
from colossalai.legacy.core import global_context as gpc
from colossalai.legacy.utils import get_dataloader
from colossalai.testing import rerun_if_address_is_in_use, spawn

CONFIG = Config(
    dict(
        parallel=dict(
            pipeline=dict(size=1),
            tensor=dict(size=1, mode=None),
        ),
        seed=1024,
    )
)


def run_data_sampler(rank, world_size, port):
    dist_args = dict(config=CONFIG, rank=rank, world_size=world_size, backend="gloo", port=port, host="localhost")
    colossalai.legacy.launch(**dist_args)
    print("finished initialization")

    # build dataset
    transform_pipeline = [transforms.ToTensor()]
    transform_pipeline = transforms.Compose(transform_pipeline)
    dataset = datasets.CIFAR10(root=Path(os.environ["DATA"]), train=True, download=True, transform=transform_pipeline)

    # build dataloader
    dataloader = get_dataloader(dataset, batch_size=8, add_sampler=True)

    # fetch the first batch and keep only its first image
    data_iter = iter(dataloader)
    img, label = next(data_iter)
    img = img[0]

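    # Rank 0 broadcasts its image in place; the other ranks receive it into a copy,
    # so their own `img` stays intact for the comparison below.
    if gpc.get_local_rank(ParallelMode.DATA) != 0:
        img_to_compare = img.clone()
    else:
        img_to_compare = img
    dist.broadcast(img_to_compare, src=0, group=gpc.get_group(ParallelMode.DATA))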

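    # With a data-parallel sampler, each rank should draw a different sample,
    # so non-zero ranks must not hold the same image as rank 0.
    if gpc.get_local_rank(ParallelMode.DATA) != 0:
        assert not torch.equal(
            img, img_to_compare
        ), "Same image was distributed across ranks but expected it to be different"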
    torch.cuda.empty_cache()


@rerun_if_address_is_in_use()
def test_data_sampler():
    spawn(run_data_sampler, 4)


if __name__ == "__main__":
    test_data_sampler()