
fix typo docs/

pull/3829/head
digger yu, 2 years ago, committed by Frank Lee
commit e90fdb1000
Changed files:
  1. docs/README.md (2 changed lines)
  2. docs/REFERENCE.md (2 changed lines)
  3. docs/source/en/advanced_tutorials/add_your_parallel.md (2 changed lines)
  4. docs/source/en/advanced_tutorials/integrate_mixture_of_experts_into_your_model.md (2 changed lines)
  5. docs/source/en/advanced_tutorials/opt_service.md (2 changed lines)
  6. docs/source/en/advanced_tutorials/parallelize_your_training_like_Megatron.md (2 changed lines)
  7. docs/source/en/advanced_tutorials/train_vit_using_pipeline_parallelism.md (2 changed lines)
  8. docs/source/en/advanced_tutorials/train_vit_with_hybrid_parallelism.md (10 changed lines)
  9. docs/source/en/basics/booster_api.md (4 changed lines)
  10. docs/source/en/basics/colotensor_concept.md (2 changed lines)
  11. docs/source/en/features/3D_tensor_parallel.md (2 changed lines)
  12. docs/source/en/features/gradient_clipping_with_booster.md (2 changed lines)
  13. docs/source/en/features/nvme_offload.md (2 changed lines)
  14. docs/source/en/features/pipeline_parallel.md (2 changed lines)
  15. docs/source/en/features/zero_with_chunk.md (2 changed lines)
  16. docs/source/zh-Hans/advanced_tutorials/add_your_parallel.md (2 changed lines)
  17. docs/source/zh-Hans/advanced_tutorials/integrate_mixture_of_experts_into_your_model.md (2 changed lines)
  18. docs/source/zh-Hans/advanced_tutorials/meet_gemini.md (4 changed lines)
  19. docs/source/zh-Hans/advanced_tutorials/opt_service.md (2 changed lines)
  20. docs/source/zh-Hans/advanced_tutorials/train_vit_with_hybrid_parallelism.md (4 changed lines)
  21. docs/source/zh-Hans/basics/colotensor_concept.md (2 changed lines)
  22. docs/source/zh-Hans/features/mixed_precision_training.md (2 changed lines)
  23. docs/source/zh-Hans/features/nvme_offload.md (2 changed lines)

docs/README.md

@@ -98,7 +98,7 @@ Lastly, if you want to skip some code, you just need to add the following annota
 <!--- doc-test-ignore-end -->
 ```
-If you have any dependency required, please add it to `requriements-doc-test.txt` for pip and `conda-doc-test-deps.yml` for Conda.
+If you have any dependency required, please add it to `requirements-doc-test.txt` for pip and `conda-doc-test-deps.yml` for Conda.
 ### 💉 Auto Documentation

docs/REFERENCE.md

@@ -1,6 +1,6 @@
 # References
-The Colossal-AI project aims to provide a wide array of parallelism techniques for the machine learning community in the big-model era. This project is inspired by quite a few reserach works, some are conducted by some of our developers and the others are research projects open-sourced by other organizations. We would like to credit these amazing projects below in the IEEE citation format.
+The Colossal-AI project aims to provide a wide array of parallelism techniques for the machine learning community in the big-model era. This project is inspired by quite a few research works, some are conducted by some of our developers and the others are research projects open-sourced by other organizations. We would like to credit these amazing projects below in the IEEE citation format.
 ## By Our Team

docs/source/en/advanced_tutorials/add_your_parallel.md

@@ -56,7 +56,7 @@ follow the steps below to create a new distributed initialization.
 world_size: int,
 config: Config,
 data_parallel_size: int,
-pipeline_parlalel_size: int,
+pipeline_parallel_size: int,
 tensor_parallel_size: int,
 arg1,
 arg2):

docs/source/en/advanced_tutorials/integrate_mixture_of_experts_into_your_model.md

@@ -121,7 +121,7 @@ Inside the initialization of Experts, the local expert number of each GPU will b
 ## Train Your Model
-Do not to forget to use `colossalai.initialize` function in `colosalai` to add gradient handler for the engine.
+Do not to forget to use `colossalai.initialize` function in `colossalai` to add gradient handler for the engine.
 We handle the back-propagation of MoE models for you.
 In `colossalai.initialize`, we will automatically create a `MoeGradientHandler` object to process gradients.
 You can find more information about the handler `MoeGradientHandler` in colossal directory.

docs/source/en/advanced_tutorials/opt_service.md

@@ -53,7 +53,7 @@ export CHECKPOINT_DIR="your_opt_checkpoint_path"
 # the ${CONFIG_DIR} must contain a server.sh file as the entry of service
 export CONFIG_DIR="config_file_path"
-docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config --ipc=host energonai:lastest
+docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config --ipc=host energonai:latest
 ```
 Then open `https://[IP-ADDRESS]:8020/docs#` in your browser to try out!

docs/source/en/advanced_tutorials/parallelize_your_training_like_Megatron.md

@@ -69,7 +69,7 @@ After the forward operation of the embedding module, each word in all sequences
 <figcaption>The embedding module</figcaption>
 </figure>
-Each transformer layer contains two blocks. The self-attention operation is called in the first block and a two-layer percepton is located in the second block.
+Each transformer layer contains two blocks. The self-attention operation is called in the first block and a two-layer perception is located in the second block.
 <figure style={{textAlign: "center"}}>
 <img src="https://s2.loli.net/2022/08/17/LAVzDlpRcj4dYeb.png"/>

docs/source/en/advanced_tutorials/train_vit_using_pipeline_parallelism.md

@@ -195,7 +195,7 @@ def build_cifar(batch_size):
 ## Training ViT using pipeline
-You can set the size of pipeline parallel and number of microbatches in config. `NUM_CHUNKS` is useful when using interleved-pipeline (for more details see [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) ). The original batch will be split into `num_microbatches`, and each stage will load a micro batch each time. Then we will generate an approriate schedule for you to execute the pipeline training. If you don't need the output and label of model, you can set `return_output_label` to `False` when calling `trainer.fit()` which can further reduce GPU memory usage.
+You can set the size of pipeline parallel and number of microbatches in config. `NUM_CHUNKS` is useful when using interleaved-pipeline (for more details see [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473) ). The original batch will be split into `num_microbatches`, and each stage will load a micro batch each time. Then we will generate an appropriate schedule for you to execute the pipeline training. If you don't need the output and label of model, you can set `return_output_label` to `False` when calling `trainer.fit()` which can further reduce GPU memory usage.
 You should `export DATA=/path/to/cifar`.

docs/source/en/advanced_tutorials/train_vit_with_hybrid_parallelism.md

@@ -16,14 +16,14 @@ In this example for ViT model, Colossal-AI provides three different parallelism
 We will show you how to train ViT on CIFAR-10 dataset with these parallelism techniques. To run this example, you will need 2-4 GPUs.
-## Tabel of Contents
+## Table of Contents
 1. Colossal-AI installation
 2. Steps to train ViT with data parallelism
 3. Steps to train ViT with pipeline parallelism
 4. Steps to train ViT with tensor parallelism or hybrid parallelism
 ## Colossal-AI Installation
-You can install Colossal-AI pacakage and its dependencies with PyPI.
+You can install Colossal-AI package and its dependencies with PyPI.
 ```bash
 pip install colossalai
 ```
@@ -31,7 +31,7 @@ pip install colossalai
 ## Data Parallelism
-Data parallism is one basic way to accelerate model training process. You can apply data parallelism to training by only two steps:
+Data parallelism is one basic way to accelerate model training process. You can apply data parallelism to training by only two steps:
 1. Define a configuration file
 2. Change a few lines of code in train script
@@ -94,7 +94,7 @@ from torchvision import transforms
 from torchvision.datasets import CIFAR10
 ```
-#### Lauch Colossal-AI
+#### Launch Colossal-AI
 In train script, you need to initialize the distributed environment for Colossal-AI after your config file is prepared. We call this process `launch`. In Colossal-AI, we provided several launch methods to initialize the distributed backend. In most cases, you can use `colossalai.launch` and `colossalai.get_default_parser` to pass the parameters via command line. Besides, Colossal-AI can utilize the existing launch tool provided by PyTorch as many users are familiar with by using `colossalai.launch_from_torch`. For more details, you can view the related [documents](https://www.colossalai.org/docs/basics/launch_colossalai).
@@ -613,7 +613,7 @@ NUM_MICRO_BATCHES = parallel['pipeline']
 TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LENGTH, HIDDEN_SIZE)
 ```
-Ohter configs:
+Other configs:
 ```python
 # hyper parameters
 # BATCH_SIZE is as per GPU

docs/source/en/basics/booster_api.md

@@ -14,9 +14,9 @@ In our new design, `colossalai.booster` replaces the role of `colossalai.initial
 ### Plugin
 Plugin is an important component that manages parallel configuration (eg: The gemini plugin encapsulates the gemini acceleration solution). Currently supported plugins are as follows:
-***GeminiPlugin:*** This plugin wrapps the Gemini acceleration solution, that ZeRO with chunk-based memory management.
-***TorchDDPPlugin:*** This plugin wrapps the DDP acceleration solution, it implements data parallelism at the module level which can run across multiple machines.
+***GeminiPlugin:*** This plugin wraps the Gemini acceleration solution, that ZeRO with chunk-based memory management.
+***TorchDDPPlugin:*** This plugin wraps the DDP acceleration solution, it implements data parallelism at the module level which can run across multiple machines.
 ***LowLevelZeroPlugin:*** This plugin wraps the 1/2 stage of Zero Redundancy Optimizer. Stage 1 : Shards optimizer states across data parallel workers/GPUs. Stage 2 : Shards optimizer states + gradients across data parallel workers/GPUs.

docs/source/en/basics/colotensor_concept.md

@@ -52,7 +52,7 @@ An instance of class [ComputeSpec](https://colossalai.readthedocs.io/en/latest/c
 ## Example
-Let's see an example. A ColoTensor is initialized and sharded on 8 GPUs using tp_degree=4, dp_dgree=2. And then the tensor is sharded along the last dim among the TP process groups. Finally, we reshard it along the first dim (0 dim) among the TP process groups. We encourage users to run the code and observe the shape of each tensor.
+Let's see an example. A ColoTensor is initialized and sharded on 8 GPUs using tp_degree=4, dp_degree=2. And then the tensor is sharded along the last dim among the TP process groups. Finally, we reshard it along the first dim (0 dim) among the TP process groups. We encourage users to run the code and observe the shape of each tensor.
 ```python

docs/source/en/features/3D_tensor_parallel.md

@@ -67,7 +67,7 @@ Given $P=q \times q \times q$ processors, we present the theoretical computation
 ## Usage
-To enable 3D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallism setting as below.
+To enable 3D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below.
 ```python
 CONFIG = dict(parallel=dict(
 data=1,

docs/source/en/features/gradient_clipping_with_booster.md

@@ -75,7 +75,7 @@ Build your model, optimizer, loss function, lr scheduler and dataloaders. Note t
 NUM_EPOCHS = 200
 BATCH_SIZE = 128
 GRADIENT_CLIPPING = 0.1
-# build resnetå
+# build resnet
 model = resnet34(num_classes=10)
 # build dataloaders
 train_dataset = CIFAR10(root=Path(os.environ.get('DATA', './data')),

docs/source/en/features/nvme_offload.md

@@ -53,7 +53,7 @@ It's compatible with all parallel methods in ColossalAI.
 > ⚠ It only offloads optimizer states on CPU. This means it only affects CPU training or Zero/Gemini with offloading.
-## Exampls
+## Examples
 Let's start from two simple examples -- training GPT with different methods. These examples relies on `transformers`.

docs/source/en/features/pipeline_parallel.md

@@ -156,4 +156,4 @@ trainer.fit(train_dataloader=train_dataloader,
 display_progress=True)
 ```
-We use `2` pipeline stages and the batch will be splitted into `4` micro batches.
+We use `2` pipeline stages and the batch will be split into `4` micro batches.

docs/source/en/features/zero_with_chunk.md

@@ -72,7 +72,7 @@ chunk_manager = init_chunk_manager(model=module,
 gemini_manager = GeminiManager(placement_policy, chunk_manager)
 ```
-`hidden_dim` is the hidden dimension of DNN. Users can provide this argument to speed up searching. If users do not know this argument before training, it is ok. We will use a default value 1024. `min_chunk_size_mb` is the the minimum chunk size in MegaByte. If the aggregate size of parameters is still samller than the minimum chunk size, all parameters will be compacted into one small chunk.
+`hidden_dim` is the hidden dimension of DNN. Users can provide this argument to speed up searching. If users do not know this argument before training, it is ok. We will use a default value 1024. `min_chunk_size_mb` is the the minimum chunk size in MegaByte. If the aggregate size of parameters is still smaller than the minimum chunk size, all parameters will be compacted into one small chunk.
 Initialization of the optimizer.
 ```python

docs/source/zh-Hans/advanced_tutorials/add_your_parallel.md

@@ -48,7 +48,7 @@ Colossal-AI 为用户提供了一个全局 context,使他们能够轻松地管
 world_size: int,
 config: Config,
 data_parallel_size: int,
-pipeline_parlalel_size: int,
+pipeline_parallel_size: int,
 tensor_parallel_size: int,
 arg1,
 arg2):

docs/source/zh-Hans/advanced_tutorials/integrate_mixture_of_experts_into_your_model.md

@@ -122,7 +122,7 @@ Inside the initialization of Experts, the local expert number of each GPU will b
 ## Train Your Model
-Do not to forget to use `colossalai.initialize` function in `colosalai` to add gradient handler for the engine.
+Do not to forget to use `colossalai.initialize` function in `colossalai` to add gradient handler for the engine.
 We handle the back-propagation of MoE models for you.
 In `colossalai.initialize`, we will automatically create a `MoeGradientHandler` object to process gradients.
 You can find more information about the handler `MoeGradientHandler` in colossal directory.

docs/source/zh-Hans/advanced_tutorials/meet_gemini.md

@@ -48,7 +48,7 @@ zero = dict(
 </figure>
-ColossalAI设计了Gemini,就像双子星一样,它管理CPU和GPU二者内存空间。它可以让张量在训练过程中动态分布在CPU-GPU的存储空间内,从而让模型训练突破GPU的内存墙。内存管理器由两部分组成,分别是MemStatsCollector(MSC)和StatefuleTensorMgr(STM)。
+ColossalAI设计了Gemini,就像双子星一样,它管理CPU和GPU二者内存空间。它可以让张量在训练过程中动态分布在CPU-GPU的存储空间内,从而让模型训练突破GPU的内存墙。内存管理器由两部分组成,分别是MemStatsCollector(MSC)和StatefulTensorMgr(STM)。
 我们利用了深度学习网络训练过程的迭代特性。我们将迭代分为warmup和non-warmup两个阶段,开始时的一个或若干迭代步属于预热阶段,其余的迭代步属于正式阶段。在warmup阶段我们为MSC收集信息,而在non-warmup阶段STM入去MSC收集的信息来移动tensor,以达到最小化CPU-GPU数据移动volume的目的。
@@ -75,7 +75,7 @@ STM管理所有model data tensor的信息。在模型的构造过程中,Coloss
 我们在算子的开始和结束计算时,触发内存采样操作,我们称这个时间点为**采样时刻(sampling moment)**,两个采样时刻之间的时间我们称为**period**。计算过程是一个黑盒,由于可能分配临时buffer,内存使用情况很复杂。但是,我们可以较准确的获取period的系统最大内存使用。非模型数据的使用可以通过两个统计时刻之间系统最大内存使用-模型内存使用获得。
-我们如何设计采样时刻呢。我们选择preOp的model data layout adjust之前。如下图所示。我们采样获得上一个period的system memory used,和下一个period的model data memoy used。并行策略会给MSC的工作造成障碍。如图所示,比如对于ZeRO或者Tensor Parallel,由于Op计算前需要gather模型数据,会带来额外的内存需求。因此,我们要求在模型数据变化前进行采样系统内存,这样在一个period内,MSC会把preOp的模型变化内存捕捉。比如在period 2-3内,我们考虑的tensor gather和shard带来的内存变化。
+我们如何设计采样时刻呢。我们选择preOp的model data layout adjust之前。如下图所示。我们采样获得上一个period的system memory used,和下一个period的model data memory used。并行策略会给MSC的工作造成障碍。如图所示,比如对于ZeRO或者Tensor Parallel,由于Op计算前需要gather模型数据,会带来额外的内存需求。因此,我们要求在模型数据变化前进行采样系统内存,这样在一个period内,MSC会把preOp的模型变化内存捕捉。比如在period 2-3内,我们考虑的tensor gather和shard带来的内存变化。
 尽管可以将采样时刻放在其他位置,比如排除gather buffer的变动新信息,但是会给造成麻烦。不同并行方式Op的实现有差异,比如对于Linear Op,Tensor Parallel中gather buffer的分配在Op中。而对于ZeRO,gather buffer的分配是在PreOp中。将放在PreOp开始时采样有利于将两种情况统一。

docs/source/zh-Hans/advanced_tutorials/opt_service.md

@@ -52,7 +52,7 @@ export CHECKPOINT_DIR="your_opt_checkpoint_path"
 # the ${CONFIG_DIR} must contain a server.sh file as the entry of service
 export CONFIG_DIR="config_file_path"
-docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config --ipc=host energonai:lastest
+docker run --gpus all --rm -it -p 8020:8020 -v ${CHECKPOINT_DIR}:/model_checkpoint -v ${CONFIG_DIR}:/config --ipc=host energonai:latest
 ```
 接下来,您就可以在您的浏览器中打开 `https://[IP-ADDRESS]:8020/docs#` 进行测试。

docs/source/zh-Hans/advanced_tutorials/train_vit_with_hybrid_parallelism.md

@@ -477,7 +477,7 @@ def build_cifar(batch_size):
 return train_dataloader, test_dataloader
-# craete dataloaders
+# create dataloaders
 train_dataloader , test_dataloader = build_cifar()
 # create loss function
 criterion = CrossEntropyLoss(label_smoothing=0.1)
@@ -492,7 +492,7 @@ lr_scheduler = CosineAnnealingWarmupLR(optimizer=optimizer,
 #### 启动 Colossal-AI 引擎
 ```python
-# intiailize
+# initialize
 engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model=model,
 optimizer=optimizer,
 criterion=criterion,

docs/source/zh-Hans/basics/colotensor_concept.md

@@ -53,7 +53,7 @@ ColoTensor 包含额外的属性[ColoTensorSpec](https://colossalai.readthedocs.
 ## Example
-让我们看一个例子。 使用 tp_degree=4, dp_dgree=2 在 8 个 GPU 上初始化并Shard一个ColoTensor。 然后tensor被沿着 TP 进程组中的最后一个维度进行分片。 最后,我们沿着 TP 进程组中的第一个维度(dim 0)对其进行重新Shard。 我们鼓励用户运行代码并观察每个张量的形状。
+让我们看一个例子。 使用 tp_degree=4, dp_degree=2 在 8 个 GPU 上初始化并Shard一个ColoTensor。 然后tensor被沿着 TP 进程组中的最后一个维度进行分片。 最后,我们沿着 TP 进程组中的第一个维度(dim 0)对其进行重新Shard。 我们鼓励用户运行代码并观察每个张量的形状。
 ```python

docs/source/zh-Hans/features/mixed_precision_training.md

@@ -203,7 +203,7 @@ Naive AMP 的默认参数:
 - initial_scale(int): gradient scaler 的初始值
 - growth_factor(int): loss scale 的增长率
 - backoff_factor(float): loss scale 的下降率
-- hysterisis(int): 动态 loss scaling 的延迟偏移
+- hysteresis(int): 动态 loss scaling 的延迟偏移
 - max_scale(int): loss scale 的最大允许值
 - verbose(bool): 如果被设为`True`,将打印调试信息

docs/source/zh-Hans/features/nvme_offload.md

@@ -53,7 +53,7 @@ optimizer = HybridAdam(model.parameters(), lr=1e-3, nvme_offload_fraction=1.0, n
 > ⚠ 它只会卸载在 CPU 上的优化器状态。这意味着它只会影响 CPU 训练或者使用卸载的 Zero/Gemini。
-## Exampls
+## Examples
 Let's start from two simple examples -- training GPT with different methods. These examples relies on `transformers`.
 首先让我们从两个简单的例子开始 -- 用不同的方法训练 GPT。这些例子依赖`transformers`。
