You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
ColossalAI/docs/add_your_parallel.md

4.5 KiB

Add Your Own Parallelism

Overview

To enable researchers and engineers to extend our framework to other novel large-scale distributed training algorithm with less effort, we have decoupled the various components in the training lifecycle. You can implement your own parallelism by simply inheriting from the base class.

The main components are

  1. ProcessGroupInitializer
  2. GradientHandler
  3. Schedule

Process Group Initializer

Parallelism is often managed by process groups where processes involved in parallel computing are placed in the same process group. For different parallel algorithms, different process groups need to be created. ColossalAI provides a global context for the user to easily manage their process groups. If you wish to add new process group, you can easily define a new class and set it in your configuration file. To define your own way of creating process groups, you can follow the steps below to create new distributed initialization.

  1. Add your parallel mode in colossalai.context.parallel_mode.ParallelMode

    class ParallelMode(Enum):
        GLOBAL = 'global'
        DATA = 'data'
        PIPELINE = 'pipe'
        PIPELINE_PREV = 'pipe_prev'
        PIPELINE_NEXT = 'pipe_next'
        ...
    
        NEW_MODE = 'new_mode'  # define your mode here
    
  2. Create a ProcessGroupInitializer. You can refer to examples given in colossal.context.dist_group_initializer. The first six arguments are fixed. ParallelContext will pass in these arguments for you. If you need to set other arguments, you can add it behind like the arg1, arg2 in the example below. Lastly, register your initializer to the registry by adding the decorator @DIST_GROUP_INITIALIZER.register_module.

    # sample initializer class
    @DIST_GROUP_INITIALIZER.register_module
    class MyParallelInitializer(ProcessGroupInitializer):
    
        def __init__(self,
                    rank: int,
                    world_size: int,
                    config: Config,
                    data_parallel_size: int,
                    pipeline_parlalel_size: int,
                    tensor_parallel_size: int,
                    arg1,
                    arg2):
            super().__init__(rank, world_size, config)
            self.arg1 = arg1
            self.arg2 = arg2
            # ... your variable init
    
        def init_parallel_groups(self):
            # initialize your process groups
            pass
    
    

    Then, you can insert your new initializer to the current mode-to-initialize mapping in colossalai.constants.INITIALIZER_MAPPING. You can modify the file or insert new key-value pair dynamically.

    colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
    
  3. Set your initializer in your config file. You can pass in your own arguments if there is any. This allows the ParallelContext to create your initializer and initialize your desired process groups.

    parallel = dict(
        pipeline=dict(size=1),
        tensor=dict(size=x, mode='new_mode')  # this is where you enable your new parallel mode
    )
    

Gradient Handler

Gradient handlers are objects which execute the all-reduce operations on parameters' gradients. As different all-reduce strategies may be executed for different kinds of parallelism, the user can inherit colossal.engine.gradient_handler.BaseGradientHandler to implement their strategies. Currently, the library uses the normal data parallel gradient handler which all-reduces the gradients across data parallel ranks. The data parallel gradient handler is added to the engine automatically if data parallel is detected. You can add your own gradient handler like below:


from colossalai.registry import GRADIENT_HANDLER
from colossalai.engine import BaseGradientHandler


@GRADIENT_HANDLER.register_module
class YourGradientHandler(BaseGradientHandler):

    def handle_gradient(self):
        do_something()

Afterwards, you can specify the gradient handler you want to use in your configuration file.

dist_initializer = [
    dict(type='YourGradientHandler'),
]

Schedule

Schedule entails how to execute a forward and backward pass. Currently, ColossalAI provides pipeline and non-pipeline schedules. If you want to modify how the forward and backward passes are executed, you can inherit colossalai.engine.BaseSchedule and implement your idea. You can add your schedule to the engine before training.