# Zero Redundancy Optimizer and ZeRO Offload
The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning three
model states (optimizer states, gradients, and parameters) instead of replicating them.
By doing so, memory efficiency is boosted drastically compared to classic data parallelism while the computational granularity
and communication efficiency are retained.
1. **ZeRO Level 1**: The optimizer states (e.g., for the [Adam optimizer](https://arxiv.org/abs/1412.6980), the 32-bit weights and the first and second moment estimates) are partitioned across the processes, so that each process updates only its partition.
2. **ZeRO Level 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process
only stores the gradients corresponding to its partition of the optimizer states.
3. **ZeRO Level 3**: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and
partition them during the forward and backward passes.
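To make the savings concrete, here is a back-of-the-envelope estimate of per-GPU memory for the model states under mixed-precision training with Adam, following the accounting in the ZeRO paper (2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and roughly 12 bytes for fp32 optimizer states). The helper function and the 7.5B-parameter example are illustrative only.

```python
def zero_model_state_memory_gb(num_params: float, num_gpus: int) -> dict:
    """Rough per-GPU memory (GB, 1e9 bytes) for model states at each ZeRO level."""
    # Mixed-precision Adam: fp16 params (2 B), fp16 grads (2 B), fp32 optimizer
    # states (~12 B: master weights plus first and second moments).
    params, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    gb = 1e9
    return {
        'no ZeRO':                (params + grads + optim) / gb,
        'level 1 (optim states)': (params + grads + optim / num_gpus) / gb,
        'level 2 (+ gradients)':  (params + (grads + optim) / num_gpus) / gb,
        'level 3 (+ parameters)': ((params + grads + optim) / num_gpus) / gb,
    }

# A 7.5B-parameter model on 64 GPUs: ~120 GB -> ~31.4 GB -> ~16.6 GB -> ~1.9 GB per GPU.
print(zero_model_state_memory_gb(7.5e9, 64))
```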
## Getting Started with ZeRO
If you are training models with Colossal-AI, enabling ZeRO DP and offloading only takes a few extra lines in your configuration file. We support configurations for levels 2 and 3. For level 1, use the [PyTorch native implementation](https://pytorch.org/tutorials/recipes/zero_redundancy_optimizer.html) of the ZeRO optimizer, as sketched below.
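For reference, a minimal sketch of the level-1 path with PyTorch's native `ZeroRedundancyOptimizer` (this assumes the default process group has already been initialized, and `MyModel` is a placeholder for your own module):

```python
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed.init_process_group(...) has already been called.
model = DDP(MyModel().cuda())  # MyModel is a placeholder for your own module

# Shard Adam's optimizer states (ZeRO level 1) across the data-parallel ranks.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=1e-3,
)
```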
Below are a few examples of ZeRO-3 configurations.
### Example of ZeRO-3 Configurations
You can refer to the [DeepSpeed configuration](https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training) for details.
Here we use `Adam` as the initial optimizer.
1. Use ZeRO to partition the optimizer states, gradients (level 2), and parameters (level 3).
```python
zero = dict(
    level=3,
    dynamic_loss_scale=True,
    clip_grad=1.0
)
```
2. Additionally offload the optimizer states and computations to the CPU.
```python
zero = dict(
    level=3,
    offload_optimizer_config=dict(
        device='cpu',
        pin_memory=True,
        fast_init=True
    ),
    ...
)
```
3. Save even more memory by offloading parameters to the CPU memory.
```python
zero = dict(
    level=3,
    offload_optimizer_config=dict(
        device='cpu',
        pin_memory=True,
        fast_init=True
    ),
    offload_param_config=dict(
        device='cpu',
        pin_memory=True,
        max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU
    ),
    ...
)
```
4. Save even MORE memory by offloading to NVMe (if available on your system):
```python
zero = dict(
    level=3,
    offload_optimizer_config=dict(
        device='nvme',
        pin_memory=True,
        fast_init=True,
        nvme_path='/nvme_data'
    ),
    offload_param_config=dict(
        device='nvme',
        pin_memory=True,
        max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU,
        nvme_path='/nvme_data'
    ),
    ...
)
```
Note that `fp16` is automatically enabled when using ZeRO; it relies on `AMP_TYPE.NAIVE` in the Colossal-AI AMP module.
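For context only, `AMP_TYPE.NAIVE` is the mode you would otherwise select through an explicit `fp16` block in the config; with ZeRO enabled you do not add this yourself, and the exact import path may vary between Colossal-AI versions:

```python
from colossalai.amp import AMP_TYPE

# Not needed when ZeRO is enabled; ZeRO applies naive AMP automatically.
fp16 = dict(
    mode=AMP_TYPE.NAIVE
)
```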
### Training
Note that if your model is too large to fit in memory when using ZeRO-3, you should use `colossalai.zero.zero3_model_context` to construct your model:
```python
from colossalai.zero import zero3_model_context

with zero3_model_context():
    model = Model()
```
Once you have completed your configuration, just use `colossalai.initialize()` to initialize your training.
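Putting it together, here is a rough end-to-end sketch; the config path, `Model` class, and dataloader are placeholders, and the exact launch/`colossalai.initialize` signatures may differ slightly between Colossal-AI versions:

```python
import colossalai
import torch
from colossalai.zero import zero3_model_context

# Launch the distributed environment; the config file contains the
# `zero = dict(...)` block described above (the path is a placeholder).
colossalai.launch_from_torch(config='./config.py')

# Build the model inside the ZeRO-3 context if it cannot fit in device memory.
with zero3_model_context():
    model = Model()  # placeholder for your own architecture

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()
train_dataloader = build_dataloader()  # placeholder for your own DataLoader

# Wrap everything into an engine that applies ZeRO (and AMP) per the config.
engine, train_dataloader, *_ = colossalai.initialize(
    model=model,
    optimizer=optimizer,
    criterion=criterion,
    train_dataloader=train_dataloader,
)

# Standard engine-based training loop.
engine.train()
for data, label in train_dataloader:
    engine.zero_grad()
    output = engine(data)
    loss = engine.criterion(output, label)
    engine.backward(loss)
    engine.step()
```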