# Zero Redundancy Optimizer and ZeRO-Offload
The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning three
model states (optimizer states, gradients, and parameters) instead of replicating them.
By doing so, memory efficiency is boosted drastically compared to classic data parallelism while the computational granularity
and communication efficiency are retained.
1. **ZeRO Level 1**: The optimizer states (e.g., for the [Adam optimizer](https://arxiv.org/abs/1412.6980), the 32-bit weights and the first and second moment estimates) are partitioned across the processes, so that each process updates only its partition.
2. **ZeRO Level 2**: The reduced 32-bit gradients for updating the model weights are also partitioned such that each process only stores the gradients corresponding to its partition of the optimizer states.
3. **ZeRO Level 3**: The 16-bit model parameters are partitioned across the processes. ZeRO-3 will automatically collect and partition them during the forward and backward passes.
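
To make the savings concrete, the per-GPU model-state memory under mixed-precision Adam training can be estimated with the formulas from the ZeRO paper. The sketch below is only a back-of-the-envelope helper (the function name and the 7.5B-parameter / 64-GPU example are illustrative; activations, buffers, and fragmentation are not counted):

```python
def zero_model_state_memory_gb(num_params, num_gpus, level):
    """Rough per-GPU model-state memory (GB) for mixed-precision Adam training,
    following the estimates in the ZeRO paper: 2 bytes/param for fp16 weights,
    2 bytes/param for fp16 gradients, and 12 bytes/param for the fp32
    optimizer states (master weights, momentum, variance)."""
    fp16_params = 2 * num_params
    fp16_grads = 2 * num_params
    optim_states = 12 * num_params
    if level == 1:    # partition optimizer states only
        per_gpu = fp16_params + fp16_grads + optim_states / num_gpus
    elif level == 2:  # also partition gradients
        per_gpu = fp16_params + (fp16_grads + optim_states) / num_gpus
    elif level == 3:  # also partition parameters
        per_gpu = (fp16_params + fp16_grads + optim_states) / num_gpus
    else:             # classic data parallelism: everything replicated
        per_gpu = fp16_params + fp16_grads + optim_states
    return per_gpu / 1e9

# A 7.5B-parameter model on 64 GPUs: ~120 GB replicated vs. ~31.4 / ~16.6 / ~1.9 GB
for level in (0, 1, 2, 3):
    print(f"level {level}: {zero_model_state_memory_gb(7.5e9, 64, level):.1f} GB")
```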
## Getting Started with ZeRO
If you are training models with Colossal-AI, enabling ZeRO data parallelism and offloading is as simple as adding a few lines to your configuration file. We support configuration for levels 2 and 3; for a level 1 optimizer, use the [PyTorch native implementation](https://pytorch.org/tutorials/recipes/zero_redundancy_optimizer.html).
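
For a level 1 setup with the PyTorch native optimizer, a minimal sketch (assuming the process group has already been initialized, e.g. by `torchrun`, and using a toy model as a placeholder) looks like:

```python
import torch
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed is already initialized and one GPU per process.
model = DDP(torch.nn.Linear(32, 4).cuda())

# Wraps Adam so that its optimizer states are sharded across the ranks
# (the level 1 partitioning described above).
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=1e-3
)
```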
Below are a few examples of ZeRO-3 configurations.
### Example of ZeRO-3 Configurations
You can refer to the [DeepSpeed configuration](https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training) for details.
Here we use `Adam` as the initial optimizer.
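
The `zero` settings below only control partitioning and offloading; the `Adam` instance itself is a plain PyTorch optimizer built in your training script and later handed to `colossalai.initialize()`. A minimal sketch with a placeholder model:

```python
import torch

# Stand-in for your own model.
model = torch.nn.Linear(32, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```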
1. Use ZeRO to partition the optimizer states, gradients (level 2), and parameters (level 3).
```python
zero = dict(
    level=3,
    dynamic_loss_scale=True,
    clip_grad=1.0
)
```
2. Additionally offload the optimizer states and computations to the CPU.
```python
zero = dict(
    level=3,
    offload_optimizer_config=dict(
        device='cpu',
        pin_memory=True,
        fast_init=True
    ),
    ...
)
```
3. Save even more memory by offloading parameters to CPU memory.
```python
zero = dict(
    level=3,
    offload_optimizer_config=dict(
        device='cpu',
        pin_memory=True,
        fast_init=True
    ),
    offload_param_config=dict(
        device='cpu',
        pin_memory=True,
        max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU
    ),
    ...
)
```
4. Save even MORE memory by offloading to NVMe (if available on your system):
```python
zero = dict(
    level=3,
    offload_optimizer_config=dict(
        device='nvme',
        pin_memory=True,
        fast_init=True,
        nvme_path='/nvme_data'
    ),
    offload_param_config=dict(
        device='nvme',
        pin_memory=True,
        max_in_cpu=OFFLOAD_PARAM_MAX_IN_CPU,
        nvme_path='/nvme_data'
    ),
    ...
)
```
Note that `fp16` is automatically enabled when using ZeRO. This relies on `AMP_TYPE.NAIVE` in the Colossal-AI AMP module.
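
For reference, `AMP_TYPE.NAIVE` is the same mode you would otherwise select yourself through the `fp16` field of the configuration file. A minimal sketch, assuming the usual Colossal-AI AMP configuration convention (you do not need to add this when ZeRO is enabled, since it is applied automatically):

```python
from colossalai.amp import AMP_TYPE

# For reference only: ZeRO already enables this mixed-precision mode for you.
fp16 = dict(
    mode=AMP_TYPE.NAIVE
)
```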
### Training
Note that if your model is too large to fit in memory when using ZeRO-3, you should use `colossalai.zero.zero3_model_context` to construct your model:
```python
from colossalai.zero import zero3_model_context

with zero3_model_context():
    model = Model()
```
Once you have completed your configuration, just use `colossalai.initialize()` to initialize your training.
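
A rough end-to-end sketch of that flow is shown below, assuming the usual Colossal-AI launch and engine APIs; the configuration path, toy model, and synthetic data are placeholders for your own setup:

```python
import colossalai
import torch
from torch.utils.data import DataLoader, TensorDataset

# `./config.py` is assumed to contain the `zero = dict(...)` settings above.
colossalai.launch_from_torch(config='./config.py')

# Toy stand-ins for your own model, optimizer, loss, and data.
model = torch.nn.Linear(32, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()
train_dataloader = DataLoader(
    TensorDataset(torch.randn(64, 32), torch.randint(0, 4, (64,))),
    batch_size=16
)

engine, train_dataloader, _, _ = colossalai.initialize(
    model=model,
    optimizer=optimizer,
    criterion=criterion,
    train_dataloader=train_dataloader
)

engine.train()
for data, label in train_dataloader:
    data, label = data.cuda(), label.cuda()
    engine.zero_grad()
    output = engine(data)
    loss = engine.criterion(output, label)
    engine.backward(loss)
    engine.step()
```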