You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
ColossalAI/docs/zero.md

3.0 KiB

Zero Redundancy Optimizer and Zero Offload

The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning the three model states (optimizer states, gradients, and parameters) across data-parallel processes instead of replicating them. By doing this, it boosts memory efficiency compared to classic data-parallelism while retaining its computational granularity and communication efficiency.

  1. ZeRO Level 1: The optimizer states (e.g., for Adam optimizer, 32-bit weights, and the first, and second moment estimates) are partitioned across the processes, so that each process updates only its partition.
  2. ZeRO Level 2: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process retains only the gradients corresponding to its portion of the optimizer states.
  3. ZeRO Level 3: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes.

Getting Started

Once you are training with ColossalAI, enabling ZeRO-3 offload is as simple as enabling it in your ColossalAI configuration! Below are a few examples of ZeRO-3 configurations.

Example ZeRO-3 Configurations

Here we use Adam as the initial optimizer.

  1. Use ZeRO to partition the optimizer states (level 1), gradients (level 2), and parameters (level 3).
    optimizer = dict(
        type='Adam',
        lr=0.001,
        weight_decay=0
    )
    
    zero = dict(
        type='ZeroRedundancyOptimizer_Level_3',
        dynamic_loss_scale=True,
        clip_grad=1.0
    )
    
  2. Additionally offload the optimizer states and computations to the CPU.
    zero = dict(
        offload_optimizer_config=dict(
            device='cpu',
            pin_memory=True,
            fast_init=True
        ),
        ...
    )
    
  3. Save even more memory by offloading parameters to the CPU memory.
    zero = dict(
        offload_optimizer_config=dict(
            device='cpu',
            pin_memory=True,
            fast_init=True
        ),
        offload_param_config=dict(
            device='cpu',
            pin_memory=True,
            fast_init=OFFLOAD_PARAM_MAX_IN_CPU
        ),
        ...
    )
    
  4. Save even MORE memory by offloading to NVMe (if available on your system):
    zero = dict(
        offload_optimizer_config=dict(
            device='nvme',
            pin_memory=True,
            fast_init=True,
            nvme_path='/nvme_data'
        ),
        offload_param_config=dict(
            device='nvme',
            pin_memory=True,
            max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU,
            nvme_path='/nvme_data'
        ),
        ...
    )
    

Note that fp16 is automatically enabled when using ZeRO.

Training

Once you complete your configuration, just use colossalai.initialize() to initialize your training. All you need to do is to write your configuration.