mirror of https://github.com/hpcaitech/ColossalAI
[doc] update document of gemini instruction. (#3842)
* [doc] update meet_gemini.md
* [doc] update meet_gemini.md
* [doc] fix parentheses
* [doc] fix parentheses
* [doc] fix doc test
* [doc] fix doc test
* [doc] fix doc
parent 54e97ed7ea
commit a64df3fa97

@@ -9,16 +9,21 @@ When you only have a few GPUs for large model training tasks, **heterogeneous tr

## Usage

-At present, Gemini supports compatibility with ZeRO parallel mode, and it is really simple to use Gemini. Set the attribute of the zero model_config, i.e., tensor_placement_policy='auto'.
+At present, Gemini supports compatibility with ZeRO parallel mode, and it is really simple to use Gemini: inject the features of `GeminiPlugin` into the training components with `booster`. For more instructions on `booster`, please refer to [**usage of booster**](../basics/booster_api.md).
-```
-zero = dict(
-    model_config=dict(
-        tensor_placement_policy='auto',
-        shard_strategy=BucketTensorShardStrategy()
-    ),
-    optimizer_config=dict(
-        ...)
-```
+```python
+from torchvision.models import resnet18
+
+from colossalai.booster import Booster
+from colossalai.booster.plugin import GeminiPlugin
+from colossalai.nn.optimizer import HybridAdam
+from colossalai.zero import ColoInitContext
+
+plugin = GeminiPlugin(placement_policy='cuda', strict_ddp_mode=True, max_norm=1.0, initial_scale=2**5)
+booster = Booster(plugin=plugin)
+
+# Build the model under ColoInitContext so its parameters are created
+# as Gemini-managed tensors.
+ctx = ColoInitContext()
+with ctx:
+    model = resnet18()
+
+optimizer = HybridAdam(model.parameters(), lr=1e-3)
+criterion = lambda x: x.mean()
+model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)
+```
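
After `boost`, training proceeds as usual, except that the backward pass should go through the booster (e.g. `booster.backward(loss, optimizer)` followed by `optimizer.step()`) so that Gemini can manage gradients; the two discarded return values are the boosted dataloader and lr scheduler, which this snippet does not use.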

@@ -86,3 +91,5 @@ The important duty of MSC is to adjust the tensor layout position. For example,

In the warmup stage, since we haven't finished a complete iteration yet, we don't know the actual memory occupation. At this time, we limit the upper bound of memory usage of the model data; for example, only 30% of the GPU memory can be used. This ensures that we can successfully complete the warmup stage.
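
As a rough illustration of this cap (a minimal sketch, not Gemini's internal code; the helper name and `cap_ratio` parameter are made up here), the warmup budget for model data can be derived from the total device memory:

```python
import torch

def warmup_model_data_budget(cap_ratio: float = 0.3) -> int:
    """Upper bound, in bytes, for model data during the warmup stage.

    The actual non-model footprint is still unknown during warmup, so
    model data is capped at a fixed fraction of device memory;
    cap_ratio=0.3 mirrors the 30% example in the text.
    """
    total = torch.cuda.get_device_properties(0).total_memory
    return int(total * cap_ratio)
```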

In the non-warmup stage, we need to use the memory information of non-model data collected in the warmup stage to reserve the peak memory required by the computing device for the next period, which requires us to move out some model tensors. To avoid frequently swapping the same tensor in and out between CPU and GPU, which would cause a phenomenon similar to [cache thrashing](https://en.wikipedia.org/wiki/Thrashing_(computer_science)), we exploit the iterative characteristics of DNN training and design an OPT cache swap-out strategy. Specifically, in the warmup stage, we record the moment at which each tensor is needed by its computing device. If we need to evict some HOLD tensors, we choose the tensor needed latest on this device as the victim.
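
A minimal sketch of this OPT-style (Belady) eviction, assuming the next-use moments recorded during warmup are available in a map (`next_use` and `pick_victim` are hypothetical names, not Gemini's API):

```python
from typing import Dict, Set

def pick_victim(hold_tensors: Set[int], next_use: Dict[int, int]) -> int:
    """Choose the HOLD tensor to evict: the one whose next recorded use
    on this device is furthest in the future. Tensors with no recorded
    future use are treated as never needed again (infinitely far)."""
    return max(hold_tensors, key=lambda t: next_use.get(t, float("inf")))
```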
+
+<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 meet_gemini.py -->

@@ -8,21 +8,21 @@

## Usage

-At present, Gemini supports compatibility with the ZeRO parallel mode, and it is really simple to use: in the training strategy's configuration file, set the zero model_config attribute tensor_placement_policy='auto'.
+At present, Gemini supports compatibility with the ZeRO parallel mode, and it is really simple to use: inject the features of `GeminiPlugin` into the training components with `booster`. For more instructions on `booster`, please refer to [usage of booster](../basics/booster_api.md).
-```
-zero = dict(
-    model_config=dict(
-        reduce_scatter_bucket_size_mb=25,
-        fp32_reduce_scatter=False,
-        gradient_predivide_factor=1.0,
-        tensor_placement_policy="auto",
-        shard_strategy=TensorShardStrategy(),
-        ...
-    ),
-    optimizer_config=dict(
-        ...
-    )
-```
+```python
+from torchvision.models import resnet18
+
+from colossalai.booster import Booster
+from colossalai.booster.plugin import GeminiPlugin
+from colossalai.nn.optimizer import HybridAdam
+from colossalai.zero import ColoInitContext
+
+plugin = GeminiPlugin(placement_policy='cuda', strict_ddp_mode=True, max_norm=1.0, initial_scale=2**5)
+booster = Booster(plugin=plugin)
+
+# Build the model under ColoInitContext so its parameters are created
+# as Gemini-managed tensors.
+ctx = ColoInitContext()
+with ctx:
+    model = resnet18()
+
+optimizer = HybridAdam(model.parameters(), lr=1e-3)
+criterion = lambda x: x.mean()
+model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)
+```

@@ -94,3 +94,5 @@ The important duty of the MSC is to adjust the tensor layout position; for example, at moment S2 in the figure above,

In the non-warmup stage, we need to use the memory information of non-model data collected in the warmup stage to reserve the peak memory required by the computing device for the next period, which requires us to move out some model tensors.
To avoid frequently swapping the same tensor in and out between CPU and GPU, which would cause a phenomenon similar to [cache thrashing](https://en.wikipedia.org/wiki/Thrashing_(computer_science)), we exploit the iterative characteristics of DNN training and design an OPT cache swap-out strategy. Specifically, in the warmup stage, we record the moment at which each tensor is needed by the computing device. If we need to evict some HOLD tensors, we choose the tensor needed latest on this device as the victim.
+
+<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 meet_gemini.py -->