Zero Redundancy Optimizer and ZeRO Offload
The Zero Redundancy Optimizer (ZeRO) removes memory redundancies in data-parallel training by partitioning the three model states (optimizer states, gradients, and parameters) across the data-parallel processes instead of replicating them. This boosts memory efficiency drastically compared with classic data parallelism while retaining computational granularity and communication efficiency (a rough per-GPU memory estimate follows the list of levels below).
- ZeRO Level 1: The optimizer states (e.g., for the Adam optimizer, the 32-bit weights and the first and second moment estimates) are partitioned across the processes, so that each process updates only its own partition.
- ZeRO Level 2: The reduced 32-bit gradients used to update the model weights are also partitioned, so that each process stores only the gradients corresponding to its partition of the optimizer states.
- ZeRO Level 3: The 16-bit model parameters are partitioned across the processes. ZeRO-3 automatically collects and partitions them during the forward and backward passes.
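To make these savings concrete, here is a rough back-of-the-envelope sketch (ours, not part of Colossal-AI) of the per-GPU memory consumed by the three model states at each ZeRO level. It assumes mixed-precision Adam training as in the ZeRO paper: 2 bytes per fp16 parameter, 2 bytes per fp16 gradient, and 12 bytes per parameter of fp32 optimizer state (master weights plus the two moment estimates).

PARAM_BYTES = 2   # fp16 parameters
GRAD_BYTES  = 2   # fp16 gradients
OPTIM_BYTES = 12  # fp32 master weights + Adam moment estimates

def model_state_gib(num_params: float, num_gpus: int, level: int) -> float:
    """Per-GPU model-state memory in GiB for ZeRO level 0 (plain data parallelism) to 3."""
    p, g, o = PARAM_BYTES, GRAD_BYTES, OPTIM_BYTES
    if level >= 1:
        o /= num_gpus  # level 1 partitions the optimizer states
    if level >= 2:
        g /= num_gpus  # level 2 also partitions the gradients
    if level >= 3:
        p /= num_gpus  # level 3 also partitions the parameters
    return num_params * (p + g + o) / 2**30

# Example: a 7.5B-parameter model on 64 GPUs.
for level in range(4):
    print(f"ZeRO level {level}: {model_state_gib(7.5e9, 64, level):.1f} GiB per GPU")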
Getting Started with ZeRO
If you are training models with Colossal-AI, enabling ZeRO-3 with offload is as simple as turning it on in your Colossal-AI configuration file. Below are a few examples of ZeRO-3 configurations.
Examples of ZeRO-3 Configurations
Here we use Adam as the initial optimizer.
- Use ZeRO to partition the optimizer states (level 1), gradients (level 2), and parameters (level 3).
optimizer = dict(
    type='Adam',
    lr=0.001,
    weight_decay=0
)

zero = dict(
    type='ZeroRedundancyOptimizer_Level_3',
    dynamic_loss_scale=True,
    clip_grad=1.0
)
- Additionally offload the optimizer states and computations to the CPU.
zero = dict(
    offload_optimizer_config=dict(
        device='cpu',
        pin_memory=True,
        fast_init=True
    ),
    ...
)
- Save even more memory by also offloading the parameters to CPU memory.
zero = dict(
    offload_optimizer_config=dict(
        device='cpu',
        pin_memory=True,
        fast_init=True
    ),
    offload_param_config=dict(
        device='cpu',
        pin_memory=True,
        max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU
    ),
    ...
)
- Save even MORE memory by offloading to NVMe (if available on your system):
zero = dict(
    offload_optimizer_config=dict(
        device='nvme',
        pin_memory=True,
        fast_init=True,
        nvme_path='/nvme_data'
    ),
    offload_param_config=dict(
        device='nvme',
        pin_memory=True,
        max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU,
        nvme_path='/nvme_data'
    ),
    ...
)
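Putting the snippets above together, a complete configuration file might look like the sketch below. It combines the pieces shown so far for ZeRO-3 with CPU offload; the file name config.py, the value assigned to OFFLOAD_PARAM_MAX_IN_CPU, and the exact placement of the offload sub-dicts inside the zero dict are our assumptions, filled in from the elided parts of the snippets.

# config.py -- hypothetical complete configuration assembled from the
# snippets above: Adam as the initial optimizer plus ZeRO-3 with CPU offload.

# Assumption: cap on the number of parameter elements kept in CPU memory;
# tune this for your host RAM.
OFFLOAD_PARAM_MAX_IN_CPU = 1e9

optimizer = dict(
    type='Adam',
    lr=0.001,
    weight_decay=0
)

zero = dict(
    type='ZeroRedundancyOptimizer_Level_3',
    dynamic_loss_scale=True,
    clip_grad=1.0,
    offload_optimizer_config=dict(
        device='cpu',
        pin_memory=True,
        fast_init=True
    ),
    offload_param_config=dict(
        device='cpu',
        pin_memory=True,
        max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU
    )
)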
Note that fp16 is automatically enabled when using ZeRO.
Training
Once you have completed your configuration, just use colossalai.initialize() to initialize your training. A minimal sketch follows.
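The sketch below shows a training loop built on the legacy Colossal-AI engine API; exact signatures vary across Colossal-AI versions, and the toy model and synthetic dataset are stand-ins for your own.

import colossalai
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Launch distributed training and read the ZeRO settings from config.py
# (assumption: the script is started via torchrun / torch.distributed).
colossalai.launch_from_torch(config='config.py')

model = nn.Linear(1024, 10)  # stand-in for your real model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Synthetic dataset so the sketch is self-contained.
features = torch.randn(256, 1024)
labels = torch.randint(0, 10, (256,))
train_loader = DataLoader(TensorDataset(features, labels), batch_size=32)

# colossalai.initialize wraps the model, optimizer, and dataloader into an
# engine according to the configuration, including the ZeRO/offload settings.
engine, train_loader, _, _ = colossalai.initialize(
    model, optimizer, criterion, train_dataloader=train_loader
)

engine.train()
for data, target in train_loader:
    engine.zero_grad()
    output = engine(data)
    loss = engine.criterion(output, target)
    engine.backward(loss)
    engine.step()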