mirror of https://github.com/hpcaitech/ColossalAI
[doc] clean up outdated docs (#4765)
* [doc] clean up outdated docs * [doc] fix linking * [doc] fix linking

pull/4809/head
parent df66741f77
commit 66f3926019
@ -29,13 +29,7 @@
|
|||
"basics/launch_colossalai",
|
||||
"basics/booster_api",
|
||||
"basics/booster_plugins",
|
||||
"basics/booster_checkpoint",
|
||||
"basics/define_your_config",
|
||||
"basics/initialize_features",
|
||||
"basics/engine_trainer",
|
||||
"basics/configure_parallelization",
|
||||
"basics/model_checkpoint",
|
||||
"basics/colotensor_concept"
|
||||
"basics/booster_checkpoint"
|
||||
]
|
||||
},
|
||||
{
@ -44,12 +38,8 @@
|
|||
"collapsed": true,
|
||||
"items": [
|
||||
"features/mixed_precision_training_with_booster",
|
||||
"features/mixed_precision_training",
|
||||
"features/gradient_accumulation_with_booster",
|
||||
"features/gradient_accumulation",
|
||||
"features/gradient_clipping_with_booster",
|
||||
"features/gradient_clipping",
|
||||
"features/gradient_handler",
|
||||
"features/zero_with_chunk",
|
||||
{
|
||||
"type": "category",
|
||||
|
@ -75,10 +65,7 @@
|
|||
"advanced_tutorials/train_vit_using_pipeline_parallelism",
|
||||
"advanced_tutorials/train_vit_with_hybrid_parallelism",
|
||||
"advanced_tutorials/train_gpt_using_hybrid_parallelism",
|
||||
"advanced_tutorials/define_your_own_parallel_model",
|
||||
"advanced_tutorials/add_your_parallel",
|
||||
"advanced_tutorials/meet_gemini",
|
||||
"advanced_tutorials/parallelize_your_training_like_Megatron",
|
||||
"advanced_tutorials/integrate_mixture_of_experts_into_your_model",
|
||||
"advanced_tutorials/opt_service"
|
||||
]
@ -1,125 +0,0 @@
|
|||
# Add Your Own Parallel Mode
|
||||
|
||||
Author: Shenggui Li, Yongbin Li
|
||||
|
||||
**Prerequisite:**
|
||||
- [Define Your Configuration](../basics/define_your_config.md)
|
||||
- [Configure Parallelization](../basics/configure_parallelization.md)
|
||||
|
||||
## Introduction
|
||||
|
||||
To enable researchers and engineers to extend our system to other novel large-scale distributed training algorithms with less effort, we have decoupled the various components in the training lifecycle. You can implement your own parallelism simply by inheriting from the base class.
|
||||
|
||||
The main components are:
|
||||
|
||||
1. `ProcessGroupInitializer`
|
||||
2. `GradientHandler`
|
||||
3. `Schedule`
|
||||
|
||||
**This currently requires some changes to the source code, so we recommend that you install from source with the `-e` flag. The `-e` flag makes the installation editable, so your code changes will be reflected in your Python runtime. We will work on this to avoid changes to the source code in future releases.**
|
||||
|
||||
|
||||
## Process Group Initializer
|
||||
|
||||
Parallelism is often managed by process groups, where processes involved in the same parallel algorithm are placed in the same process group. Different parallel algorithms require different process groups to be created. Colossal-AI provides a global context for users to easily manage their process groups. If you wish to add a new process group, you can easily define a new class and set it in your configuration file. To define your own way of creating process groups, you can follow the steps below to create a new distributed initialization.
|
||||
|
||||
1. Add your parallel mode in `colossalai.legacy.context.parallel_mode.ParallelMode`.
|
||||
```python
|
||||
class ParallelMode(Enum):
|
||||
GLOBAL = 'global'
|
||||
DATA = 'data'
|
||||
PIPELINE = 'pipe'
|
||||
...
|
||||
|
||||
NEW_MODE = 'new_mode' # define your mode here
|
||||
```
|
||||
|
||||
2. Create a `ProcessGroupInitializer`. You can refer to the examples given in `colossalai.context.dist_group_initializer`. The first six arguments are fixed; `ParallelContext` will pass them in for you. If you need to set other arguments, you can add them afterwards, like `arg1` and `arg2` in the example below. Lastly, register your initializer to the registry by adding the decorator `@DIST_GROUP_INITIALIZER.register_module`.
|
||||
```python
|
||||
# sample initializer class
|
||||
@DIST_GROUP_INITIALIZER.register_module
|
||||
class MyParallelInitializer(ProcessGroupInitializer):
|
||||
|
||||
def __init__(self,
|
||||
rank: int,
|
||||
world_size: int,
|
||||
config: Config,
|
||||
data_parallel_size: int,
|
||||
pipeline_parallel_size: int,
|
||||
tensor_parallel_size: int,
|
||||
arg1,
|
||||
arg2):
|
||||
super().__init__(rank, world_size, config)
|
||||
self.arg1 = arg1
|
||||
self.arg2 = arg2
|
||||
# ... your variable init
|
||||
|
||||
def init_parallel_groups(self):
|
||||
# initialize your process groups
|
||||
pass
|
||||
|
||||
```
|
||||
|
||||
Then, you can insert your new initializer into the current mode-to-initializer mapping in `colossalai.constants.INITIALIZER_MAPPING`. You can modify the file or insert a new key-value pair dynamically.
|
||||
|
||||
```python
|
||||
colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
|
||||
```
|
||||
|
||||
3. Set your initializer in your config file. You can pass in your own arguments if there are any. This allows the `ParallelContext` to create your initializer and initialize your desired process groups.
|
||||
|
||||
```python
|
||||
parallel = dict(
|
||||
pipeline=dict(size=1),
|
||||
tensor=dict(size=x, mode='new_mode') # this is where you enable your new parallel mode
|
||||
)
|
||||
```
|
||||
|
||||
## Gradient Handler
|
||||
|
||||
Gradient handlers are objects that execute the all-reduce operations on parameters' gradients. As different all-reduce strategies may be needed for different kinds of parallelism, users can inherit `colossalai.legacy.engine.gradient_handler.BaseGradientHandler` to implement their own strategies. Currently, the library uses the normal data parallel gradient handler, which all-reduces the gradients across data parallel ranks. The data parallel gradient handler is added to the engine automatically if data parallelism is detected. You can add your own gradient handler as below:
|
||||
|
||||
```python
|
||||
from colossalai.legacy.registry import GRADIENT_HANDLER
|
||||
from colossalai.legacy.engine import BaseGradientHandler
|
||||
|
||||
@GRADIENT_HANDLER.register_module
|
||||
class YourGradientHandler(BaseGradientHandler):
|
||||
|
||||
def handle_gradient(self):
|
||||
do_something()
|
||||
|
||||
```
|
||||
|
||||
Afterwards, you can specify the gradient handler you want to use in your configuration file.
|
||||
|
||||
```python
|
||||
gradient_handlers = [
|
||||
dict(type='YourGradientHandler'),
|
||||
]
|
||||
```
|
||||
|
||||
## Schedule
|
||||
|
||||
A schedule defines how to execute a forward and backward pass. Currently, Colossal-AI provides pipeline and non-pipeline schedules. If you want to modify how the forward and backward passes are executed, you can inherit `colossalai.legacy.engine.schedule.BaseSchedule` and implement the `forward_backward_step` function, as sketched below.
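Below is a minimal sketch of what such a subclass could look like. The argument list and the `load_batch` helper are assumptions about the legacy interface rather than a verified signature, so treat this as an illustration, not a drop-in implementation.

```python
# A minimal sketch of a custom schedule; interface details are assumptions, not the exact API.
from colossalai.legacy.engine.schedule import BaseSchedule


class MyStepSchedule(BaseSchedule):

    def forward_backward_step(self, engine, data_iter, forward_only=False, return_loss=True, return_output_label=True):
        # fetch one batch from the iterator (`load_batch` is assumed to be provided by BaseSchedule)
        data, label = self.load_batch(data_iter)

        # forward pass through the engine-wrapped model
        output = engine(data)
        loss = engine.criterion(output, label) if return_loss else None

        # run the backward pass unless we are only evaluating
        if not forward_only:
            engine.backward(loss)

        return (output, label, loss) if return_output_label else (None, None, loss)
```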
|
||||
<!-- doc-test-command: echo -->
|
|
@ -1,36 +0,0 @@
|
|||
# Define your own parallel model
|
||||
|
||||
Author: Zhengda Bian, Yongbin Li
|
||||
|
||||
> ⚠️ We are working on this documentation to make it more detailed. We will introduce the mechanism of different parallelism
|
||||
> and how to use them to write a model.
|
||||
|
||||
Let's say that you have a huge MLP model with billions of parameters whose extremely large hidden layer size makes it impossible to fit into a single GPU directly. Don't worry, Colossal-AI is here to help you sort things out. With the help of Colossal-AI, you can write your model the same familiar way you write models for a single GPU, while Colossal-AI automatically splits your model weights and fits them perfectly into a set of GPUs. Below is a simple example showing how to write a 2D parallel model in the Colossal-AI context.
|
||||
|
||||
## Write a simple 2D parallel model
|
||||
|
||||
```python
|
||||
from colossalai.nn import Linear2D
|
||||
import torch.nn as nn
|
||||
|
||||
class MLP_2D(nn.Module):
|
||||
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self.linear_1 = Linear2D(in_features=1024, out_features=16384)
|
||||
self.linear_2 = Linear2D(in_features=16384, out_features=1024)
|
||||
|
||||
def forward(self, x):
|
||||
x = self.linear_1(x)
|
||||
x = self.linear_2(x)
|
||||
return x
|
||||
```
|
||||
|
||||
## Use pre-defined model
|
||||
|
||||
For your convenience, our Model Zoo provides some prevalent models such as *BERT*, *ViT*, *MoE*, and *GPT*. Feel free to customize them into different sizes to fit your specific needs.
|
|
@ -1,194 +0,0 @@
|
|||
# Parallelize Your Training like Megatron-LM via ColoTensor
|
||||
|
||||
Author: [Haichen Huang](https://github.com/1SAA) and [Jiarui Fang](https://github.com/feifeibear)
|
||||
|
||||
**Prerequisite:**
|
||||
- [ColoTensor Concepts](../basics/colotensor_concept.md)
|
||||
|
||||
## Introduction
|
||||
|
||||
Thanks to the convenience offered by ColoTensor, users can apply parallelism with minimal edits to their serial code.
In this tutorial, we will illustrate how to modify the training model so that the code automatically adapts to parallel training like Megatron-LM.
We take the GPT-2 model offered by HuggingFace as an example and provide a way for you to pre-train the GPT-2 model on a single GPU.

Megatron-LM provides a well-established paradigm for parallelizing large transformer language models.
However, in order to train large transformer language models at scale, users have to build their models with the modules provided by Megatron.
This imposes several difficult tasks on users, such as loading the weights from pre-trained models and constructing the parallelized models.
To mitigate this trouble, we offer ColoTensor to enable tensor model parallelism automatically.
|
||||
|
||||
## Definitions of the model and the loss function
|
||||
|
||||
First we use the GPTModel and GPTLoss directly from the HuggingFace library.
|
||||
|
||||
```python
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
from transformers import GPT2Config, GPT2LMHeadModel
|
||||
|
||||
class GPTLMModel(nn.Module):
|
||||
def __init__(self, hidden_size=768, num_layers=12, num_attention_heads=12, max_seq_len=1024, vocab_size=50257, checkpoint=False):
|
||||
super().__init__()
|
||||
self.checkpoint = checkpoint
|
||||
self.model = GPT2LMHeadModel(GPT2Config(n_embd=hidden_size, n_layer=num_layers,
|
||||
n_head=num_attention_heads, n_positions=max_seq_len, n_ctx=max_seq_len, vocab_size=vocab_size))
|
||||
if checkpoint:
|
||||
self.model.gradient_checkpointing_enable()
|
||||
|
||||
def forward(self, input_ids, attention_mask):
|
||||
# Only return lm_logits
|
||||
return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
|
||||
|
||||
|
||||
class GPTLMLoss(nn.Module):
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self.loss_fn = nn.CrossEntropyLoss()
|
||||
|
||||
def forward(self, logits, labels):
|
||||
shift_logits = logits[..., :-1, :].contiguous()
|
||||
shift_labels = labels[..., 1:].contiguous()
|
||||
# Flatten the tokens
|
||||
return self.loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
|
||||
```
|
||||
|
||||
## Brief Review of GPT-2
|
||||
|
||||
Now, let's recall the structure of the GPT-2 model.
Every GPT-2 model can be represented as a DAG.
As shown in the pictures below, each circle represents an operator and each square represents a weight.
An arrow indicates the flow of the input data, and the notation alongside the arrow gives the shape of the input data.

Let's take a closer look at this GPT-2 model. It consists of three parts:
the **embedding module**, the **transformer layers**, and the **classification head**.

The embedding module contains two weights, the token embedding weight and the position embedding weight.
After the forward pass of the embedding module, each word in every sequence of the raw input data is embedded into a hidden state.
|
||||
|
||||
<figure style={{textAlign: "center"}}>
|
||||
<img src="https://s2.loli.net/2022/08/17/omfkIEN6ui5jcL3.png"/>
|
||||
<figcaption>The embedding module</figcaption>
|
||||
</figure>
|
||||
|
||||
Each transformer layer contains two blocks. The self-attention operation is called in the first block, and a two-layer perceptron (MLP) is located in the second block.
|
||||
|
||||
<figure style={{textAlign: "center"}}>
|
||||
<img src="https://s2.loli.net/2022/08/17/LAVzDlpRcj4dYeb.png"/>
|
||||
<figcaption>The transformer layer</figcaption>
|
||||
</figure>
|
||||
|
||||
In the end, the classification head is just a linear module without bias, which only has a weight inside.
|
||||
|
||||
## Applied with ColoTensor
|
||||
|
||||
Two steps adapt your serial code to the Megatron-LM tensor parallel style:
1. Initialize the model in the context of ColoInitContext.
2. Set a ColoTensorSpec for each parameter.
|
||||
|
||||
### Initialize with ColoInitContext
|
||||
|
||||
We should build the model within the ColoInitContext. In this context, any parameter initialized will be transformed to a ColoParameter and moved to the corresponding device automatically.
|
||||
|
||||
```python
|
||||
from colossalai.utils.model.colo_init_context import ColoInitContext
|
||||
|
||||
with ColoInitContext(device=torch.device('cpu')):
|
||||
model = GPTLMModel()
|
||||
```
|
||||
|
||||
### Setting ColoTensorSpec for each parameter
|
||||
|
||||
After the creation of the model, we establish the distributed environment through ProcessGroup. Here, we set the degree of tensor parallelism equal to the total number of GPUs, which means the degree of data parallelism is 1.
|
||||
|
||||
```python
|
||||
import torch.distributed as dist
|
||||
from colossalai.tensor import ProcessGroup
|
||||
|
||||
pg = ProcessGroup(tp_degree=dist.get_world_size())
|
||||
```
|
||||
|
||||
Now, some auxiliary functions are necessary for the next step. We define two functions to split a parameter.
|
||||
Megatron-LM-like tensor parallelism requires splitting a parameter tensor along its first dimension or its last dimension.
|
||||
|
||||
```python
|
||||
from colossalai.tensor import ShardSpec, ComputeSpec, ComputePattern, ColoParameter, ProcessGroup
|
||||
|
||||
def split_param_single_dim_tp1d(dim: int, param: ColoParameter, pg: ProcessGroup):
|
||||
spec = (ShardSpec([dim], [pg.tp_world_size()]), ComputeSpec(ComputePattern.TP1D))
|
||||
if param.process_group.tp_world_size() == 1:
|
||||
param.set_process_group(pg)
|
||||
param.set_tensor_spec(*spec)
|
||||
|
||||
|
||||
def split_param_row_tp1d(param: ColoParameter, pg: ProcessGroup):
|
||||
split_param_single_dim_tp1d(0, param, pg)
|
||||
|
||||
|
||||
def split_param_col_tp1d(param: ColoParameter, pg: ProcessGroup):
|
||||
split_param_single_dim_tp1d(-1, param, pg)
|
||||
```
|
||||
|
||||
Then we adapt the model to tensor parallelism. Following the tensor parallelism applied in Megatron, tensors should be sharded along their last dimension, including the weights of the token embedding, the position embedding, all linear weights and biases in the self-attention blocks, and the first linear weight and bias in each MLP. The second linear weight in each MLP is sharded along its first dimension.
|
||||
|
||||
```python
|
||||
for mn, module in model.named_modules():
|
||||
for pn, param in module.named_parameters(recurse=False):
|
||||
# set process group for all parameters
|
||||
param.set_process_group(pg)
|
||||
|
||||
if 'mlp.c_fc' in mn:
|
||||
if 'weight' in pn or 'bias' in pn:
|
||||
split_param_col_tp1d(param, pg) # column slice
|
||||
# keep the shape of the output from c_fc
|
||||
param.compute_spec.set_output_replicate(False)
|
||||
elif 'mlp.c_proj' in mn:
|
||||
if 'weight' in pn:
|
||||
split_param_row_tp1d(param, pg) # row slice
|
||||
elif 'wte' in mn or 'wpe' in mn:
|
||||
split_param_col_tp1d(param, pg) # column slice
|
||||
elif 'c_attn' in mn or 'c_proj' in mn:
|
||||
split_param_col_tp1d(param, pg) # column slice
|
||||
```
|
||||
|
||||
The modified model is illustrated below.
|
||||
|
||||
The embedding module:
|
||||
|
||||
<figure style={{textAlign: "center"}}>
|
||||
<img src="https://s2.loli.net/2022/08/17/Yu2xzXEabHV7pwe.png"/>
|
||||
<figcaption>The modified embedding module</figcaption>
|
||||
</figure>
|
||||
|
||||
The transformer layers:
|
||||
|
||||
<figure style={{textAlign: "center"}}>
|
||||
<img src="https://s2.loli.net/2022/08/17/4HWsA2xz51IhPFO.png"/>
|
||||
<figcaption>The modified transformer layer</figcaption>
|
||||
</figure>
|
||||
|
||||
Once users have specified the distributed pattern of each parameter, ColoTensor is capable of inferring the computation patterns of all operators, including matrix multiplication, the linear function, other elementwise functions in torch.nn.functional, etc.
|
||||
In this way, users can train their models as usual.
|
||||
|
||||
In our latest example, a Gemini + ZeRO DDP model is also defined to reduce overhead and improve efficiency. For the details of this part, please refer to [ZeRO](../features/zero_with_chunk.md). You can combine these two parts to understand our entire training process:
|
||||
|
||||
```python
|
||||
def gemini_zero_dpp(model: torch.nn.Module, pg: ProcessGroup, placement_policy: str = "auto"):
|
||||
from colossalai.zero import GeminiDDP
|
||||
model = GeminiDDP(model,
|
||||
device=get_current_device(),
|
||||
placement_policy=placement_policy,
|
||||
pin_memory=True,
|
||||
search_range_m=32)
|
||||
return model
|
||||
```
|
||||
|
||||
## Pretrain GPT-2 On Single GPU
|
||||
|
||||
The above optimizations allow us to pretrain the GPT-2 model on a single GPU. We only need to set the parameter `GPUNUM`=1 in `run.sh`, and then we can complete the model training on a single GPU when running the file.
|
||||
|
||||
The GPT-2 example is accessible at [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt).
|
||||
|
||||
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 parallelize_your_training_like_Megatron.py -->
|
|
@ -1,98 +0,0 @@
|
|||
# ColoTensor Concepts
|
||||
|
||||
Author: [Jiarui Fang](https://github.com/feifeibear), [Hongxin Liu](https://github.com/ver217) and [Haichen Huang](https://github.com/1SAA)
|
||||
|
||||
> ⚠️ The information on this page is outdated and will be deprecated.
|
||||
|
||||
**Prerequisite:**
|
||||
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
|
||||
- [Distributed Training](../concepts/distributed_training.md)
|
||||
- [Paradigms of Parallelism](../concepts/paradigms_of_parallelism.md)
|
||||
|
||||
## Introduction
|
||||
|
||||
After ColossalAI version 0.1.8, [ColoTensor](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.html#colossalai.tensor.ColoTensor) becomes the basic data structure for tensors in ColossalAI. It is a subclass of torch.Tensor and can be used as a PyTorch Tensor. Additionally, some unique features make it possible to represent a global tensor whose payload is distributed across multiple GPU devices. With the help of ColoTensor, users can write a distributed DNN training program in much the same way as a serial one.
|
||||
|
||||
ColoTensor carries extra attributes encapsulated in a [ColoTensorSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.tensor_spec.html#colossalai.tensor.tensor_spec.ColoTensorSpec) instance that describe the tensor's payload distribution and compute pattern:

- ProcessGroup: how processes are organized into communication groups.
- Distributed Spec: how the tensor is distributed among process groups.
- Compute Spec: how the tensor is used during computation.
|
||||
|
||||
We elaborate on them one by one.
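As a quick preview, the snippet below mirrors the example at the end of this page and shows how the three pieces combine into a single `ColoTensorSpec`; the tensor shape and parallel degrees here are illustrative only, and the code assumes a distributed environment already launched on 4 ranks.

```python
# Compact preview: combine ProcessGroup, a distributed spec and a compute spec into one ColoTensorSpec.
import torch
from colossalai.tensor import ProcessGroup, ColoTensor, ColoTensorSpec, ShardSpec, ComputeSpec, ComputePattern

pg = ProcessGroup(tp_degree=2, dp_degree=2)                             # process organization
dist_spec = ShardSpec(dims=[-1], num_partitions=[pg.tp_world_size()])   # shard along the last dim
comp_spec = ComputeSpec(ComputePattern.TP1D)                            # 1D tensor-parallel compute

t = ColoTensor.from_torch_tensor(torch.randn(4, 8).cuda(), ColoTensorSpec(pg, dist_spec, comp_spec))
```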
|
||||
|
||||
## ProcessGroup
|
||||
|
||||
An instance of the class [ProcessGroup](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.html#colossalai.tensor.ProcessGroup) describes how processes are organized into process groups. Processes in a process group can participate in the same collective communication operations, such as all-gather, all-reduce, etc. How the process group is organized is dictated by the tensor's parallelism strategy. For example, if the user defines the tensor parallel (TP) and data parallel (DP) modes of a tensor, the process organization of the process group will be deduced automatically. The process group settings can vary among different tensors, which enables us to support more complicated hybrid parallelism. The pipeline parallel (PP) definition is not part of the ProcessGroup; it needs another set of mechanisms. We will supplement the related content of ColoTensor applied to PP in the future.
|
||||
|
||||
Currently, a process group of ColoTensor is defined by two configurations, i.e. tp_degree and dp_degree. In the case of DP+TP hybrid parallelism, the devices can be viewed as a 2D mesh. We place TP communication groups on the leading low dimension of the device mesh and the data parallel groups along the high dimension of the device mesh. The reason is that tensor parallelism has a larger communication overhead than data parallelism, so neighboring devices are placed inside the same TP process group and are often placed in the same node.

Consider 8 processes configured as tp_degree=4 and dp_degree=2; the layout is shown below. Process group tp0 contains GPUs 0, 1, 2, 3. Process group dp1 contains GPUs 1 and 5.
|
||||
|
||||
<figure style={{textAlign: "center"}}>
|
||||
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/ColoTensor_layout_demo.PNG"/>
|
||||
<figcaption>Process Group using tp_degree=4, dp_degree=2</figcaption>
|
||||
</figure>
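A minimal sketch of this 4×2 layout is shown below. It assumes 8 processes have already been launched with `colossalai.launch`; the value in the comment is what we expect, not guaranteed output.

```python
# Sketch of the layout described above: 8 ranks, tp_degree=4, dp_degree=2.
from colossalai.tensor import ProcessGroup

# Ranks 0-3 form one TP group and ranks 4-7 another; DP groups pair ranks across
# the two TP groups, e.g. rank 1 with rank 5 (the "dp1" group in the figure).
pg = ProcessGroup(tp_degree=4, dp_degree=2)
print(pg.tp_world_size())  # expected: 4
```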
|
||||
|
||||
## Distributed Spec
|
||||
|
||||
An instance of [Distributed Spec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.distspec.html) describes how a ColoTensor is distributed among the ProcessGroup.
|
||||
|
||||
How tensors are distributed among DP process groups is derived automatically and does not need to be specified by the user. If the tensor is a model parameter, it is replicated within the DP process group. If it is an activation tensor, it is split along its highest dimension and its payload is evenly distributed among the processes in the DP process group.

Therefore, when using a Distributed Spec, we only need to describe how the tensor is distributed among TP process groups. There are currently two ways to distribute a tensor among a TP process group, i.e. [ShardSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.distspec.html#colossalai.tensor.distspec.ShardSpec) and [ReplicaSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.distspec.html#colossalai.tensor.distspec.ReplicaSpec). A ShardSpec needs to specify the dimension index `dim` of the partition and the number of partitions `num_partitions`. Currently, we only support splitting on a single dim. Different dist specs on the TP process groups can be converted to each other through the `set_dist_spec()` interface. These spec conversions are recorded by the autograd mechanism and will trigger the corresponding reverse operations during backward propagation.
|
||||
|
||||
## Compute Spec
|
||||
|
||||
An instance of the class [ComputeSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.compute_spec.html#colossalai.tensor.compute_spec.ComputeSpec) describes how a ColoTensor is used in DNN training. Currently, we set the correct compute pattern for ColoTensors that serve as module parameters. The specific application scenarios are shown in the next document.
|
||||
|
||||
## ColoParameter
|
||||
|
||||
[ColoParameter](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.colo_parameter.html#colossalai.tensor.colo_parameter.ColoParameter) is a subclass of ColoTensor used to define a global parameter tensor. Its relationship to ColoTensor is the same as that of torch.nn.Parameter to torch.Tensor: the latter allows the tensor to appear in the return values of a module's `parameters()` and `named_parameters()` methods.
|
||||
|
||||
## Example
|
||||
|
||||
Let's see an example. A ColoTensor is initialized and sharded on 4 GPUs using tp_degree=2 and dp_degree=2, matching the code below. The tensor is first sharded along its last dim among the TP process groups, and then we reshard it along the first dim (dim 0) among the TP process groups. We encourage users to run the code and observe the shape of each tensor.
|
||||
|
||||
|
||||
```python
|
||||
import torch
|
||||
import torch.multiprocessing as mp
|
||||
from colossalai.utils import print_rank_0
|
||||
from functools import partial
|
||||
|
||||
import colossalai
|
||||
from colossalai.tensor import ProcessGroup, ColoTensor, ColoTensorSpec, ShardSpec, ComputeSpec, ComputePattern
|
||||
from colossalai.testing import spawn
|
||||
|
||||
import torch
|
||||
|
||||
def run_dist_tests(rank, world_size, port):
|
||||
colossalai.launch(config={}, rank=rank, world_size=world_size, host='localhost', port=port, backend='nccl')
|
||||
pg = ProcessGroup(tp_degree=2, dp_degree=2)
|
||||
|
||||
torch.manual_seed(0)
|
||||
local_tensor = torch.randn(2, 3, 1).cuda()
|
||||
print_rank_0(f"shape {local_tensor.shape}, {local_tensor.data}")
|
||||
|
||||
spec = ColoTensorSpec(pg, ShardSpec(dims=[-1], num_partitions=[pg.tp_world_size()]), ComputeSpec(ComputePattern.TP1D))
|
||||
t1 = ColoTensor.from_torch_tensor(local_tensor, spec)
|
||||
t1 = t1.to_replicate()
|
||||
print_rank_0(f"shape {t1.shape}, {t1.data}")
|
||||
|
||||
spec2 = ShardSpec([0], [pg.tp_world_size()])
|
||||
t1.set_dist_spec(spec2)
|
||||
print_rank_0(f"shape {t1.shape}, {t1.data}")
|
||||
|
||||
def test_dist_cases(world_size):
|
||||
spawn(run_dist_tests, world_size)
|
||||
|
||||
if __name__ == '__main__':
|
||||
test_dist_cases(4)
|
||||
```
|
||||
|
||||
:::caution
|
||||
|
||||
The ColoTensor is an experimental feature and may be updated.
|
||||
|
||||
:::
|
|
@ -1,158 +0,0 @@
|
|||
# Configure Parallelization
|
||||
|
||||
Author: Shenggui Li, Siqi Mai
|
||||
|
||||
> ⚠️ The information on this page is outdated and will be deprecated. Please check [Booster Plugins](../basics/booster_plugins.md) for more information.
|
||||
|
||||
**Prerequisite:**
|
||||
- [Distributed Training](../concepts/distributed_training.md)
|
||||
- [Paradigms of Parallelism](../concepts/paradigms_of_parallelism.md)
|
||||
- [Define Your Configuration](./define_your_config.md)
|
||||
|
||||
|
||||
## Introduction
|
||||
|
||||
We support multiple parallelization strategies in Colossal-AI. Hybrid parallelism in our codebase refers to the combination of data parallelism, pipeline parallelism and tensor parallelism (1D, 2D, 2.5D, 3D).
||||
|
||||
Each form of parallelism requires a different network topology and thus initializes different process groups.
You can initialize the corresponding process group by setting `parallel` in the config file.
The configuration for `parallel` must obey the following format. The data parallel size will be
inferred automatically based on your inputs for pipeline parallelism and tensor parallelism.
`colossalai.launch` will initialize these distributed process groups automatically based on your configuration.
|
||||
|
||||
Some sample configurations are shown below:
|
||||
|
||||
```python
|
||||
# sampler format
|
||||
parallel = dict(
|
||||
pipeline=dict("size": int),
|
||||
tensor=dict("size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any)
|
||||
)
|
||||
|
||||
# this is ok
|
||||
parallel = dict(
|
||||
pipeline=dict(size=2),
|
||||
tensor=dict(size=4, mode='2d')
|
||||
)
|
||||
|
||||
# this is ok
|
||||
parallel = dict(
|
||||
pipeline=2,
|
||||
tensor=dict(size=4, mode='2d')
|
||||
)
|
||||
|
||||
# this is not ok
|
||||
# as you need to specify the mode for tensor parallelism
|
||||
parallel = dict(
|
||||
pipeline=2,
|
||||
tensor=4
|
||||
)
|
||||
|
||||
# this is ok as well as tensor will be default to size 1
|
||||
# and mode None
|
||||
parallel = dict(
|
||||
pipeline=2
|
||||
)
|
||||
|
||||
# this is ok as well as pipeline will default to size 1
|
||||
parallel = dict(
|
||||
tensor=dict(size=4, mode='2d')
|
||||
)
|
||||
|
||||
```
|
||||
|
||||
The key name `size` refers to the parallel size of that parallelism dimension. For example, pipeline size 2 means there will be 2 pipeline stages. The key name `mode` in the tensor parallel config specifies which tensor parallel mode will be initialized. For example, with 16 GPUs, `pipeline=dict(size=2)` and `tensor=dict(size=4, ...)`, the data parallel size is inferred to be 2.
|
||||
|
||||
**You can choose to not have 'parallel' in your configuration and both pipeline and tensor will default to size 1.**
|
||||
|
||||
**Total number of GPUs must be equal to `data parallel size * tensor parallel size * pipeline parallel size`**
|
||||
|
||||
## Data Parallel
|
||||
|
||||
Data parallelism is the most common way to distribute your training task: data is split into several shards and each device trains on a single shard. The configuration for data parallelism is detected and set for you automatically; you do not have to set it explicitly in your configuration. There are two ways to handle the all-reduce in data parallelism in Colossal-AI.
|
||||
|
||||
1. If you specify gradient handlers, gradients will be all-reduced according to the gradient handlers
|
||||
2. Otherwise, PyTorch DistributedDataParallel will be used
|
||||
|
||||
In most cases, you will be using the second mode unless you need complex handling of the gradients; a sample configuration for the first mode is sketched below.
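The first mode names the handler in the config file in the same format shown in the gradient handler tutorial above. The built-in handler name used here is an assumption based on the legacy API and may differ in your version.

```python
# Hypothetical config entry selecting the first mode; `DataParallelGradientHandler`
# is an assumed built-in name, and any handler registered to GRADIENT_HANDLER works.
gradient_handlers = [
    dict(type='DataParallelGradientHandler'),
]
```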
|
||||
|
||||
## 1D, 2D, 2.5D and 3D Parallel
|
||||
|
||||
To enable hybrid parallelism, we provide an array of tensor parallelism methods, together with the paper each method is based on. These parallel modes need to work with the distributed layers provided by Colossal-AI.
|
||||
|
||||
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
|
||||
|
||||
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
|
||||
2D parallel relies on the SUMMA matrix multiplication algorithm and splits the input data, model weights and layer
|
||||
outputs along two different dimensions. The tensor chunks are distributed over a 2D mesh of `P = N^2` devices where
|
||||
`N` is the number of tensor chunks in a single dimension.
|
||||
|
||||
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
|
||||
Inspired by the 2.5D matrix multiplication algorithm, 2.5D parallel introduces a novel tensor parallelism which
|
||||
further parallelizes 2D tensor parallelism. An amount of `P = N^2 ∗ d` processors are arranged into `d` layers, where
|
||||
each layer performs matrix multiplication operations independently with a dimension `N`.
|
||||
|
||||
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
|
||||
We also introduce a 3D tensor parallelism that parallelizes neural networks on a 3D processor cube. This method achieves the optimal `O(P^{1/3})` communication overhead on `P` processors, while both computation and memory usage are evenly distributed through optimized load balancing of parameters as well as activations.
|
||||
|
||||
```python
|
||||
# 1D parallel
|
||||
parallel = dict(
|
||||
tensor=dict(size=4, mode='1d')
|
||||
)
|
||||
|
||||
# 2D parallel
|
||||
parallel = dict(
|
||||
tensor=dict(size=4, mode='2d')
|
||||
)
|
||||
|
||||
# 2.5D parallel
|
||||
parallel = dict(
|
||||
tensor=dict(size=8, mode='2.5d', depth=2)
|
||||
)
|
||||
|
||||
# 3D parallel
|
||||
parallel = dict(
|
||||
tensor=dict(size=8, mode='3d')
|
||||
)
|
||||
```
|
||||
|
||||
Once you specify the tensor parallel mode in your configuration, you can proceed to use its corresponding distributed operators. For example, if your mode is '2d', you can use `colossalai.nn.Linear2D` in your model construction.
|
||||
|
||||
|
||||
## Pipeline Parallel
|
||||
|
||||
Pipeline parallelism splits the model into several partitions by layer. For example, let's assume we have a simple model which consists of two linear layers. We have two GPUs, and we can allocate the first linear layer to the first GPU and the second layer to the second GPU.
|
||||
|
||||
You can set the number of pipeline stages in your configuration file. When the pipeline size is larger than 1, Colossal-AI will automatically create the pipeline schedule, which defines the forward and backward steps.
|
||||
|
||||
```python
|
||||
parallel = dict(
|
||||
pipeline=dict(size=4), # number of pipeline stages
|
||||
)
|
||||
```
|
||||
|
||||
## Sequence Parallel
|
||||
|
||||
Sequence parallelism supports long-sequence modelling such as document-level text understanding and medical imaging. This method is proposed in [Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120). You can specify the mode as `sequence` to initialize its process group.
|
||||
|
||||
|
||||
```python
|
||||
parallel = dict(
|
||||
tensor=dict(size=4, mode='sequence')
|
||||
)
|
||||
```
|
|
@ -1,85 +0,0 @@
|
|||
# Define Your Configuration
|
||||
|
||||
Author: Guangyang Lu, Shenggui Li, Siqi Mai
|
||||
|
||||
> ⚠️ The information on this page is outdated and will be deprecated. Please check [Booster API](../basics/booster_api.md) for more information.
|
||||
|
||||
|
||||
**Prerequisite:**
|
||||
- [Distributed Training](../concepts/distributed_training.md)
|
||||
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
|
||||
|
||||
|
||||
## Introduction
|
||||
|
||||
In Colossal-AI, a configuration file is required to specify the features the system will inject into the training process. In this tutorial, we will introduce how to construct your configuration file and how this config file will be used. Using a configuration file has several advantages:
|
||||
|
||||
1. You can store your feature configuration and training hyper-parameters in different configuration files
|
||||
2. New features released in the future can be specified in the configuration without code change in the training script
|
||||
|
||||
In this tutorial, we will cover how to define your configuration file.
|
||||
|
||||
## Configuration Definition
|
||||
|
||||
In a configuration file, there are two types of variables. One serves as feature specification and the other serves
|
||||
as hyper-parameters. All feature-related variables are reserved keywords. For example, if you want to use mixed precision
|
||||
training, you need to use the variable name `fp16` in the config file and follow a pre-defined format.
|
||||
|
||||
### Feature Specification
|
||||
|
||||
There is an array of features Colossal-AI provides to speed up training. Each feature is defined by a corresponding field
|
||||
in the config file. In this tutorial, we are not giving the config details for all the features, but rather we are providing
|
||||
an illustration of how to specify a feature. **The details of each feature can be found in its respective tutorial.**
|
||||
|
||||
To illustrate the use of config file, we use mixed precision training as an example here. In order to do so, you need to
|
||||
follow the steps below.
|
||||
|
||||
1. create a configuration file (e.g. `config.py`, the file name can be anything)
|
||||
2. define the mixed precision configuration in the config file. For example, in order to use mixed precision training
|
||||
natively provided by PyTorch, you can just write these lines of code below into your config file.
|
||||
|
||||
```python
|
||||
from colossalai.amp import AMP_TYPE
|
||||
|
||||
fp16 = dict(
|
||||
mode=AMP_TYPE.TORCH
|
||||
)
|
||||
```
|
||||
|
||||
3. Tell Colossal-AI where your config file is when launching the distributed environment. For example, here the config file is in the current directory.
|
||||
|
||||
```python
|
||||
import colossalai
|
||||
|
||||
colossalai.launch(config='./config.py', ...)
|
||||
```
|
||||
|
||||
In this way, Colossal-AI knows what features you want to use and will inject this feature during `colossalai.initialize`.
|
||||
|
||||
### Global Hyper-parameters
|
||||
|
||||
Besides feature specification, the config file can also serve as a place to define your training hyper-parameters. This comes in handy when you want to perform multiple experiments: the details of each experiment can be put into its own config file to avoid confusion. These parameters will be stored in the global parallel context and can be accessed in the training script.
|
||||
|
||||
For example, you can specify the batch size in your config file.
|
||||
|
||||
```python
|
||||
BATCH_SIZE = 32
|
||||
```
|
||||
|
||||
After launch, you are able to access your hyper-parameters through global parallel context.
|
||||
|
||||
```python
|
||||
import colossalai
|
||||
from colossalai.core import global_context as gpc
|
||||
|
||||
colossalai.launch(config='./config.py', ...)
|
||||
|
||||
# access your parameter
|
||||
print(gpc.config.BATCH_SIZE)
|
||||
|
||||
```
|
|
@ -1,390 +0,0 @@
|
|||
# Use Engine and Trainer in Training
|
||||
|
||||
Author: Shenggui Li, Siqi Mai
|
||||
|
||||
> ⚠️ The information on this page is outdated and will be deprecated. Please check [Booster API](../basics/booster_api.md) for more information.
|
||||
|
||||
**Prerequisite:**
|
||||
- [Initialize Features](./initialize_features.md)
|
||||
|
||||
## Introduction
|
||||
|
||||
In this tutorial, you will learn how to use the engine and trainer provided in Colossal-AI to train your model.
|
||||
Before we delve into the details, we would like to first explain the concept of engine and trainer.
|
||||
|
||||
### Engine
|
||||
|
||||
Engine is essentially a wrapper class for model, optimizer and loss function.
|
||||
When we call `colossalai.initialize`, an engine object will be returned, and it has already been equipped with
|
||||
functionalities such as gradient clipping, gradient accumulation and zero optimizer as specified in your configuration file.
|
||||
An engine object will use similar APIs to those of PyTorch training components such that the user has minimum change
|
||||
to their code.
|
||||
|
||||
Below is a table which shows the commonly used APIs for the engine object.
|
||||
|
||||
| Component | Function | PyTorch | Colossal-AI |
|
||||
| ------------------------------------- | --------------------------------------------- | ------------------------------- | -------------------------------------- |
|
||||
| optimizer | Set all gradients to zero before an iteration | optimizer.zero_grad() | engine.zero_grad() |
|
||||
| optimizer | Update the parameters | optimizer.step() | engine.step() |
|
||||
| model | Run a forward pass | outputs = model(inputs) | outputs = engine(inputs) |
|
||||
| criterion | Calculate the loss value | loss = criterion(output, label) | loss = engine.criterion(output, label) |
|
||||
| criterion | Execute back-propagation on the model | loss.backward() | engine.backward(loss) |
|
||||
|
||||
The reason why we need such an engine class is that we can add more functionalities while hiding the implementations in the `colossalai.initialize` function. Imagine we are going to add a new feature: we can manipulate the model, optimizer, dataloader and loss function in the `colossalai.initialize` function and only expose an engine object to the user. The user only needs to modify their code to the minimum extent, adapting the normal PyTorch APIs to the Colossal-AI engine APIs. In this way, they can enjoy more features for efficient training.
|
||||
|
||||
A normal training iteration using engine can be:
|
||||
|
||||
```python
|
||||
import colossalai
|
||||
|
||||
# build your model, optimizer, criterion, dataloaders
|
||||
...
|
||||
|
||||
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
|
||||
optimizer,
|
||||
criterion,
|
||||
train_dataloader,
|
||||
test_dataloader)
|
||||
for img, label in train_dataloader:
|
||||
engine.zero_grad()
|
||||
output = engine(img)
|
||||
loss = engine.criterion(output, label)
|
||||
engine.backward(loss)
|
||||
engine.step()
|
||||
```
|
||||
|
||||
### Trainer
|
||||
|
||||
Trainer is a more high-level wrapper for the user to execute training with fewer lines of code. However, in pursuit of more abstraction, it loses some flexibility compared to engine. The trainer is designed to execute a forward and backward step to perform model weight update. It is easy to create a trainer object by passing the engine object. The trainer has a default value `None` for the argument `schedule`. In most cases, we leave this value to `None` unless we want to use pipeline parallelism. If you wish to explore more about this parameter, you can go to the tutorial on pipeline parallelism.
|
||||
|
||||
```python
|
||||
from colossalai.logging import get_dist_logger
|
||||
from colossalai.legacy.trainer import Trainer, hooks
|
||||
|
||||
# build components and initialize with colossalai.initialize
|
||||
...
|
||||
|
||||
# create a logger so that trainer can log on the console
|
||||
logger = get_dist_logger()
|
||||
|
||||
# create a trainer object
|
||||
trainer = Trainer(
|
||||
engine=engine,
|
||||
logger=logger
|
||||
)
|
||||
```
|
||||
|
||||
|
||||
|
||||
In the trainer, the user can customize hooks and attach them to the trainer object. A hook object executes life-cycle methods periodically based on the training scheme. For example, the `LRSchedulerHook` will execute `lr_scheduler.step()` to update the learning rate of the model during either the `after_train_iter` or `after_train_epoch` stage, depending on whether the user wants to update the learning rate after each training iteration or only after the entire training epoch. You can store the hook objects in a list and pass it to the `trainer.fit` method. The `trainer.fit` method will execute training and testing based on your parameters. If `display_progress` is True, a progress bar will be displayed on your console to show the training progress.
|
||||
|
||||
```python
|
||||
# define the hooks to attach to the trainer
|
||||
hook_list = [
|
||||
hooks.LossHook(),
|
||||
hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True),
|
||||
hooks.AccuracyHook(accuracy_func=Accuracy()),
|
||||
hooks.LogMetricByEpochHook(logger),
|
||||
]
|
||||
|
||||
# start training
|
||||
trainer.fit(
|
||||
train_dataloader=train_dataloader,
|
||||
epochs=NUM_EPOCHS,
|
||||
test_dataloader=test_dataloader,
|
||||
test_interval=1,
|
||||
hooks=hook_list,
|
||||
display_progress=True
|
||||
)
|
||||
```
|
||||
|
||||
If you want to customize your own hook class, you can inherit `hooks.BaseHook` and override the life-cycle methods of your interest. A dummy example to demonstrate how to create a simple log message hook is provided below for your reference.
|
||||
|
||||
```python
|
||||
from colossalai.logging import get_dist_logger
|
||||
from colossalai.legacy.trainer import hooks
|
||||
|
||||
class LogMessageHook(hooks.BaseHook):
|
||||
|
||||
def __init__(self, priority=10):
|
||||
self._logger = get_dist_logger()
|
||||
|
||||
def before_train(self, trainer):
|
||||
self._logger.info('training starts')
|
||||
|
||||
def after_train(self, trainer):
|
||||
self._logger.info('training finished')
|
||||
|
||||
|
||||
...
|
||||
|
||||
# then in your training script
|
||||
hook_list.append(LogMessageHook())
|
||||
```
|
||||
|
||||
|
||||
|
||||
In the sections below, I will guide you through the steps required to train a ResNet model with both engine and trainer.
|
||||
|
||||
|
||||
|
||||
## Explain with ResNet
|
||||
|
||||
### Overview
|
||||
|
||||
In this section we will cover:
|
||||
|
||||
1. Use an engine object to train a ResNet34 model on CIFAR10 dataset
|
||||
2. Use a trainer object to train a ResNet34 model on CIFAR10 dataset
|
||||
|
||||
The project structure will be like:
|
||||
|
||||
```bash
|
||||
-- config.py
|
||||
-- run_resnet_cifar10_with_engine.py
|
||||
-- run_resnet_cifar10_with_trainer.py
|
||||
```
|
||||
|
||||
Steps 1-4 below are commonly used regardless of using engine or trainer. Thus, steps 1-4 + step 5 will be your `run_resnet_cifar10_with_engine.py` and steps 1-4 + step 6 will form `run_resnet_cifar10_with_trainer.py`.
|
||||
|
||||
### Hands-on Practice
|
||||
|
||||
#### Step 1. Create a Config File
|
||||
|
||||
In your project folder, create a `config.py`. This file is to specify some features you may want to use to train your model. A sample config file is as below:
|
||||
|
||||
```python
|
||||
from colossalai.amp import AMP_TYPE
|
||||
|
||||
BATCH_SIZE = 128
|
||||
NUM_EPOCHS = 200
|
||||
|
||||
fp16=dict(
|
||||
mode=AMP_TYPE.TORCH
|
||||
)
|
||||
```
|
||||
|
||||
In this config file, we specify that we want to use batch size 128 per GPU and run for 200 epochs. These two parameters are exposed by `gpc.config`. For example, you can use `gpc.config.BATCH_SIZE` to access the value you store in your config file. The `fp16` configuration tells `colossalai.initialize` to use mixed precision training provided by PyTorch to train the model with better speed and lower memory consumption.
|
||||
|
||||
#### Step 2. Initialize Distributed Environment
|
||||
|
||||
We need to initialize the distributed training environment. This has been introduced in the tutorial on how to
|
||||
[launch Colossal-AI](./launch_colossalai.md). For this demonstration, we use `launch_from_torch` and PyTorch launch utility.
|
||||
|
||||
```python
|
||||
import colossalai
|
||||
|
||||
# ./config.py refers to the config file we just created in step 1
|
||||
colossalai.launch_from_torch(config='./config.py')
|
||||
```
|
||||
|
||||
#### Step 3. Create all the training components
|
||||
|
||||
In this step, we can create all the components used for training. These components include:
|
||||
|
||||
1. Model
|
||||
2. Optimizer
|
||||
3. Criterion/loss function
|
||||
4. Training/Testing dataloaders
|
||||
5. Learning rate Scheduler
|
||||
6. Logger
|
||||
|
||||
|
||||
|
||||
To build these components, you need to import the following modules:
|
||||
|
||||
```python
|
||||
from pathlib import Path
|
||||
from colossalai.logging import get_dist_logger
|
||||
import torch
|
||||
import os
|
||||
from colossalai.core import global_context as gpc
|
||||
from colossalai.utils import get_dataloader
|
||||
from torchvision import transforms
|
||||
from colossalai.nn.lr_scheduler import CosineAnnealingLR
|
||||
from torchvision.datasets import CIFAR10
|
||||
from torchvision.models import resnet34
|
||||
```
|
||||
|
||||
|
||||
|
||||
Then build your components in the same way as how to normally build them in your PyTorch scripts. In the script below, we set the root path for CIFAR10 dataset as an environment variable `DATA`. You can change it to any path you like, for example, you can change `root=Path(os.environ['DATA'])` to `root='./data'` so that there is no need to set the environment variable.
|
||||
|
||||
```python
|
||||
# build logger
|
||||
logger = get_dist_logger()
|
||||
|
||||
# build resnet
|
||||
model = resnet34(num_classes=10)
|
||||
|
||||
# build datasets
|
||||
train_dataset = CIFAR10(
|
||||
root='./data',
|
||||
download=True,
|
||||
transform=transforms.Compose(
|
||||
[
|
||||
transforms.RandomCrop(size=32, padding=4),
|
||||
transforms.RandomHorizontalFlip(),
|
||||
transforms.ToTensor(),
|
||||
transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[
|
||||
0.2023, 0.1994, 0.2010]),
|
||||
]
|
||||
)
|
||||
)
|
||||
|
||||
test_dataset = CIFAR10(
|
||||
root='./data',
|
||||
train=False,
|
||||
transform=transforms.Compose(
|
||||
[
|
||||
transforms.ToTensor(),
|
||||
transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[
|
||||
0.2023, 0.1994, 0.2010]),
|
||||
]
|
||||
)
|
||||
)
|
||||
|
||||
# build dataloaders
|
||||
train_dataloader = get_dataloader(dataset=train_dataset,
|
||||
shuffle=True,
|
||||
batch_size=gpc.config.BATCH_SIZE,
|
||||
num_workers=1,
|
||||
pin_memory=True,
|
||||
)
|
||||
|
||||
test_dataloader = get_dataloader(dataset=test_dataset,
|
||||
add_sampler=False,
|
||||
batch_size=gpc.config.BATCH_SIZE,
|
||||
num_workers=1,
|
||||
pin_memory=True,
|
||||
)
|
||||
|
||||
# build criterion
|
||||
criterion = torch.nn.CrossEntropyLoss()
|
||||
|
||||
# optimizer
|
||||
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
|
||||
|
||||
# lr_scheduler
|
||||
lr_scheduler = CosineAnnealingLR(optimizer, total_steps=gpc.config.NUM_EPOCHS)
|
||||
```
|
||||
|
||||
#### Step 4. Initialize with Colossal-AI
|
||||
|
||||
Next, the essential step is to obtain the engine class by calling `colossalai.initialize`. As stated in `config.py`, we will be using mixed precision training for training ResNet34 model. `colossalai.initialize` will automatically check your config file and assign relevant features to your training components. In this way, our engine object has already been able to train with mixed precision, but you do not have to explicitly take care of it.
|
||||
|
||||
```python
|
||||
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
|
||||
optimizer,
|
||||
criterion,
|
||||
train_dataloader,
|
||||
test_dataloader,
|
||||
)
|
||||
```
|
||||
|
||||
|
||||
|
||||
#### Step 5. Train with engine
|
||||
|
||||
With all the training components ready, we can train ResNet34 just as we normally would with PyTorch.
|
||||
|
||||
```python
|
||||
for epoch in range(gpc.config.NUM_EPOCHS):
|
||||
# execute a training iteration
|
||||
engine.train()
|
||||
for img, label in train_dataloader:
|
||||
img = img.cuda()
|
||||
label = label.cuda()
|
||||
|
||||
# set gradients to zero
|
||||
engine.zero_grad()
|
||||
|
||||
# run forward pass
|
||||
output = engine(img)
|
||||
|
||||
# compute loss value and run backward pass
|
||||
train_loss = engine.criterion(output, label)
|
||||
engine.backward(train_loss)
|
||||
|
||||
# update parameters
|
||||
engine.step()
|
||||
|
||||
# update learning rate
|
||||
lr_scheduler.step()
|
||||
|
||||
# execute a testing iteration
|
||||
engine.eval()
|
||||
correct = 0
|
||||
total = 0
|
||||
for img, label in test_dataloader:
|
||||
img = img.cuda()
|
||||
label = label.cuda()
|
||||
|
||||
# run prediction without back-propagation
|
||||
with torch.no_grad():
|
||||
output = engine(img)
|
||||
test_loss = engine.criterion(output, label)
|
||||
|
||||
# compute the number of correct prediction
|
||||
pred = torch.argmax(output, dim=-1)
|
||||
correct += torch.sum(pred == label)
|
||||
total += img.size(0)
|
||||
|
||||
logger.info(
|
||||
f"Epoch {epoch} - train loss: {train_loss:.5}, test loss: {test_loss:.5}, acc: {correct / total:.5}, lr: {lr_scheduler.get_last_lr()[0]:.5g}", ranks=[0])
|
||||
```
|
||||
|
||||
#### Step 6. Train with trainer
|
||||
|
||||
If you wish to train with a trainer object, you can follow the code snippet below:
|
||||
|
||||
```python
|
||||
from colossalai.legacy.nn.metric import Accuracy
|
||||
from colossalai.legacy.trainer import Trainer, hooks
|
||||
|
||||
|
||||
# create a trainer object
|
||||
trainer = Trainer(
|
||||
engine=engine,
|
||||
logger=logger
|
||||
)
|
||||
|
||||
# define the hooks to attach to the trainer
|
||||
hook_list = [
|
||||
hooks.LossHook(),
|
||||
hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True),
|
||||
hooks.AccuracyHook(accuracy_func=Accuracy()),
|
||||
hooks.LogMetricByEpochHook(logger),
|
||||
hooks.LogMemoryByEpochHook(logger)
|
||||
]
|
||||
|
||||
# start training
|
||||
# run testing every 1 epoch
|
||||
trainer.fit(
|
||||
train_dataloader=train_dataloader,
|
||||
epochs=gpc.config.NUM_EPOCHS,
|
||||
test_dataloader=test_dataloader,
|
||||
test_interval=1,
|
||||
hooks=hook_list,
|
||||
display_progress=True
|
||||
)
|
||||
```
|
||||
|
||||
|
||||
|
||||
#### Step 7. Start Distributed Training
|
||||
|
||||
Lastly, we can invoke the scripts using the distributed launcher provided by PyTorch as we used `launch_from_torch` in Step 2. You need to replace `<num_gpus>` with the number of GPUs available on your machine. This number can be 1 if you only want to use 1 GPU. If you wish to use other launchers, you can refer to the tutorial on How to Launch Colossal-AI.
|
||||
|
||||
```bash
|
||||
# with engine
|
||||
python -m torch.distributed.launch --nproc_per_node <num_gpus> --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py
|
||||
# with trainer
|
||||
python -m torch.distributed.launch --nproc_per_node <num_gpus> --master_addr localhost --master_port 29500 run_resnet_cifar10_with_trainer.py
|
||||
```
|
||||
<!-- doc-test-command: echo -->
|
|
@ -1,51 +0,0 @@
|
|||
# Initialize Features
|
||||
|
||||
Author: Shenggui Li, Siqi Mai
|
||||
|
||||
> ⚠️ The information on this page is outdated and will be deprecated. Please check [Booster API](../basics/booster_api.md) for more information.
|
||||
|
||||
**Prerequisite:**
|
||||
- [Distributed Training](../concepts/distributed_training.md)
|
||||
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
|
||||
|
||||
## Introduction
|
||||
|
||||
In this tutorial, we will cover the use of `colossalai.initialize`, which injects features into your training components (e.g. model, optimizer, dataloader) seamlessly. Calling `colossalai.initialize` is the standard procedure before you run your training loops.
|
||||
|
||||
In the section below, I will cover how `colossalai.initialize` works and what we should take note of.
|
||||
|
||||
## Usage
|
||||
|
||||
In a typical workflow, we will launch distributed environment at the beginning of our training script.
|
||||
Afterwards, we will instantiate our objects such as model, optimizer, loss function, dataloader etc. At this moment, `colossalai.initialize`
|
||||
can come in to inject features into these objects. A pseudo-code example is like below:
|
||||
|
||||
```python
|
||||
import colossalai
|
||||
import torch
|
||||
...
|
||||
|
||||
|
||||
# launch distributed environment
|
||||
colossalai.launch(config='./config.py', ...)
|
||||
|
||||
# create your objects
|
||||
model = MyModel()
|
||||
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
|
||||
criterion = torch.nn.CrossEntropyLoss()
|
||||
train_dataloader = MyTrainDataloader()
|
||||
test_dataloader = MyTestDataloader()
|
||||
|
||||
# initialize features
|
||||
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
|
||||
optimizer,
|
||||
criterion,
|
||||
train_dataloader,
|
||||
test_dataloader)
|
||||
```
|
||||
|
||||
The `colossalai.initialize` function will return an `Engine` object. The engine object is a wrapper
|
||||
for model, optimizer and loss function. **The engine object will run with features specified in the config file.**
|
||||
More details about the engine can be found in the [Use Engine and Trainer in Training](./engine_trainer.md).
|
|
@ -1,64 +0,0 @@
|
|||
# Model Checkpoint
|
||||
|
||||
Author: Guangyang Lu
|
||||
|
||||
> ⚠️ The information on this page is outdated and will be deprecated. Please check [Booster Checkpoint](../basics/booster_checkpoint.md) for more information.
|
||||
|
||||
**Prerequisite:**
|
||||
- [Launch Colossal-AI](./launch_colossalai.md)
|
||||
- [Initialize Colossal-AI](./initialize_features.md)
|
||||
|
||||
**Example Code:**
|
||||
- [ColossalAI-Examples Model Checkpoint](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/utils/checkpoint)
|
||||
|
||||
**This function is experimental.**
|
||||
|
||||
## Introduction
|
||||
|
||||
In this tutorial, you will learn how to save and load model checkpoints.
|
||||
|
||||
To leverage the power of parallel strategies in Colossal-AI, modifications to models and tensors are needed, so you cannot directly use `torch.save` or `torch.load` to save or load model checkpoints. Therefore, we provide an API that achieves the same thing.

Moreover, when loading, you are not required to use the same parallel strategy as when saving.
|
||||
|
||||
## How to use
|
||||
|
||||
### Save
|
||||
|
||||
There are two ways to train a model in Colossal-AI, by engine or by trainer.
|
||||
**Be aware that we only save the `state_dict`.** Therefore, when loading the checkpoints, you need to define the model first.
|
||||
|
||||
#### Save when using engine
|
||||
|
||||
```python
|
||||
from colossalai.utils import save_checkpoint
|
||||
model = ...
|
||||
engine, _, _, _ = colossalai.initialize(model=model, ...)
|
||||
for epoch in range(num_epochs):
|
||||
... # do some training
|
||||
save_checkpoint('xxx.pt', epoch, model)
|
||||
```
|
||||
|
||||
#### Save when using trainer
|
||||
```python
|
||||
from colossalai.legacy.trainer import Trainer, hooks
|
||||
model = ...
|
||||
engine, _, _, _ = colossalai.initialize(model=model, ...)
|
||||
trainer = Trainer(engine, ...)
|
||||
hook_list = [
|
||||
hooks.SaveCheckpointHook(1, 'xxx.pt', model),
|
||||
...]
|
||||
|
||||
trainer.fit(...
|
||||
hooks=hook_list)
|
||||
```
|
||||
|
||||
### Load
|
||||
|
||||
```python
|
||||
from colossalai.utils import load_checkpoint
|
||||
model = ...
|
||||
load_checkpoint('xxx.pt', model)
|
||||
... # train or test
|
||||
```
|
||||
<!-- doc-test-command: echo -->
|
|
@ -2,10 +2,6 @@
|
|||
|
||||
Author: Zhengda Bian, Yongbin Li
|
||||
|
||||
**Prerequisite**
|
||||
- [Define Your Configuration](../basics/define_your_config.md)
|
||||
- [Configure Parallelization](../basics/configure_parallelization.md)
|
||||
|
||||
**Example Code**
|
||||
- [Tensor Parallelism with Shardformer](https://github.com/hpcaitech/ColossalAI/tree/main/colossalai/shardformer/examples)
|
||||
|
||||
|
|
|
@ -3,8 +3,6 @@
|
|||
Author: Zhengda Bian, Yongbin Li
|
||||
|
||||
**Prerequisite**
|
||||
- [Define Your Configuration](../basics/define_your_config.md)
|
||||
- [Configure Parallelization](../basics/configure_parallelization.md)
|
||||
- [1D Tensor Parallelism](./1D_tensor_parallel.md)
|
||||
|
||||
**Example Code**
|
||||
|
|
|
@ -3,8 +3,6 @@
|
|||
Author: Zhengda Bian, Yongbin Li
|
||||
|
||||
**Prerequisite**
|
||||
- [Define Your Configuration](../basics/define_your_config.md)
|
||||
- [Configure Parallelization](../basics/configure_parallelization.md)
|
||||
- [1D Tensor Parallelism](./1D_tensor_parallel.md)
|
||||
- [2D Tensor Parallelism](./2D_tensor_parallel.md)
|
||||
|
||||
|
|
|
@ -3,8 +3,6 @@
|
|||
Author: Zhengda Bian, Yongbin Li
|
||||
|
||||
**Prerequisite**
|
||||
- [Define Your Configuration](../basics/define_your_config.md)
|
||||
- [Configure Parallelization](../basics/configure_parallelization.md)
|
||||
- [1D Tensor Parallelism](./1D_tensor_parallel.md)
|
||||
- [2D Tensor Parallelism](./2D_tensor_parallel.md)
|
||||
|
||||
|
|
|
@ -1,47 +0,0 @@
|
|||
# Gradient Accumulation (Outdated)
|
||||
|
||||
Author: Shenggui Li, Yongbin Li
|
||||
|
||||
**Prerequisite**
|
||||
- [Define Your Configuration](../basics/define_your_config.md)
|
||||
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
|
||||
|
||||
**Example Code**
|
||||
- [ColossalAI-Examples Gradient Accumulation](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
|
||||
|
||||
## Introduction
|
||||
|
||||
Gradient accumulation is a common way to enlarge your batch size for training.
|
||||
When training large-scale models, memory can easily become the bottleneck and the batch size can be very small (e.g. 2),
|
||||
leading to unsatisfactory convergence. Gradient accumulation works by adding up the gradients calculated in multiple iterations,
|
||||
and only updating the parameters once every preset number of iterations.
|
||||
|
||||
## Usage
|
||||
|
||||
It is simple to use gradient accumulation in Colossal-AI. Just add the following configuration to your config file.
|
||||
The integer represents the number of iterations to accumulate gradients.
|
||||
|
||||
```python
|
||||
gradient_accumulation = <int>
|
||||
```
|
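Conceptually, this setting makes the engine behave like the plain-PyTorch sketch below. The sketch uses hypothetical `model`, `optimizer`, `criterion` and `train_dataloader` objects and is only meant to illustrate the accumulate-then-step pattern, not Colossal-AI's internal implementation.

```python
accumulation_size = 4  # corresponds to `gradient_accumulation = 4` in the config

optimizer.zero_grad()
for i, (inputs, labels) in enumerate(train_dataloader):
    loss = criterion(model(inputs), labels)
    (loss / accumulation_size).backward()      # scale the loss so gradients average over the window
    if (i + 1) % accumulation_size == 0:
        optimizer.step()                       # parameters are updated only every 4 iterations
        optimizer.zero_grad()
```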
||||
|
||||
## Hands-on Practice
|
||||
|
||||
We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
|
||||
to demonstrate gradient accumulation. In this example, we set the gradient accumulation size to be 4. You can run the script using this command:
|
||||
|
||||
```shell
|
||||
python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py
|
||||
```
|
||||
|
||||
You will see output similar to the text below. This shows that the gradient is indeed accumulated, as the parameters are not updated
|
||||
in the first 3 steps, but only updated in the last step.
|
||||
|
||||
```text
|
||||
iteration 0, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
|
||||
iteration 1, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
|
||||
iteration 2, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
|
||||
iteration 3, first 10 elements of param: tensor([-0.0141, 0.0464, 0.0507, 0.0321, 0.0356, -0.0150, 0.0172, -0.0118, 0.0222, 0.0473], device='cuda:0', grad_fn=<SliceBackward0>)
|
||||
```
|
||||
|
||||
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 gradient_accumulation.py -->
|
|
@ -1,9 +1,8 @@
|
|||
# Gradient Accumulation (Latest)
|
||||
# Gradient Accumulation
|
||||
|
||||
Author: [Mingyan Jiang](https://github.com/jiangmingyan)
|
||||
|
||||
**Prerequisite**
|
||||
- [Define Your Configuration](../basics/define_your_config.md)
|
||||
- [Training Booster](../basics/booster_api.md)
|
||||
|
||||
## Introduction
|
||||
|
|
|
@ -1,64 +0,0 @@
|
|||
# Gradient Clipping (Outdated)
|
||||
|
||||
Author: Boxiang Wang, Haichen Huang, Yongbin Li
|
||||
|
||||
**Prerequisite**
|
||||
- [Define Your Configuration](../basics/define_your_config.md)
|
||||
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
|
||||
|
||||
**Example Code**
|
||||
- [ColossalAI-Examples Gradient Clipping](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_clipping)
|
||||
|
||||
**Related Paper**
|
||||
- [On the difficulty of training Recurrent Neural Networks](https://arxiv.org/abs/1211.5063)
|
||||
|
||||
## Introduction
|
||||
|
||||
In order to speed up training and seek a better global optimum, more and more learning rate schedulers have been
|
||||
proposed. They control the learning rate to adjust the descent pace during training, which keeps gradient updates
|
||||
at a more uniform scale in every step, so the descent pace can be controlled as expected. In this context,
|
||||
gradient clipping, a technique that rescales the gradient vector so that its norm stays within a fixed bound,
|
||||
has become indispensable for those who want better performance from their models.
|
||||
|
||||
You do not have to worry about implementing gradient clipping when using Colossal-AI, we support gradient
|
||||
clipping in a powerful and convenient way. All you need is just an additional command in your configuration
|
||||
file.
|
||||
|
||||
## Why you should use gradient clipping provided by Colossal-AI
|
||||
|
||||
The reason we do not recommend users to implement gradient clipping themselves is that naive gradient clipping
|
||||
may fail when applying tensor parallelism, pipeline parallelism or MoE.
|
||||
|
||||
As shown in the illustration below, each GPU only owns a portion of the weight of a linear layer.
|
||||
To get the correct gradient norm for the weight of the linear layer, the partial norms computed on each GPU
|
||||
should be summed together.
|
||||
To complicate things further, the bias is distributed differently from the weight.
|
||||
As a result, the communication group used for the sum operation is different.
|
||||
|
||||
(PS: This describes an old version of 2D parallelism; the implementation in the code is not the same,
|
||||
but it is a good example of the difficulty of unifying all communication in gradient clipping.)
|
||||
|
||||
<figure style={{textAlign: "center"}}>
|
||||
<img src="https://s2.loli.net/2022/01/28/KXiJPHt3Dum82cA.png"/>
|
||||
<figcaption>Layout of parameters</figcaption>
|
||||
</figure>
|
||||
|
||||
Do not worry about it, since Colossal-AI has handled it for you.
|
||||
|
||||
### Usage
|
||||
To use gradient clipping, simply add the gradient clipping norm to your configuration file.
|
||||
```python
|
||||
clip_grad_norm = 1.0
|
||||
```
|
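For reference, in a plain single-GPU PyTorch script the same effect is obtained by calling `torch.nn.utils.clip_grad_norm_` between the backward pass and the optimizer step. The sketch below only illustrates the non-parallel case and assumes `model`, `loss` and `optimizer` already exist; it is not what Colossal-AI does internally under parallelism.

```python
import torch

loss.backward()
# clip the total gradient norm to 1.0, matching `clip_grad_norm = 1.0` in the config
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```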
||||
|
||||
### Hands-On Practice
|
||||
|
||||
We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_clipping)
|
||||
to demonstrate gradient clipping. In this example, we set the gradient clipping vector norm to be 1.0. You can run the script using this command:
|
||||
|
||||
```shell
|
||||
python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 train_with_engine.py
|
||||
```
|
||||
|
||||
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 gradient_clipping.py -->
|
|
@ -1,9 +1,8 @@
|
|||
# Gradient Clipping (Latest)
|
||||
# Gradient Clipping
|
||||
|
||||
Author: [Mingyan Jiang](https://github.com/jiangmingyan)
|
||||
|
||||
**Prerequisite**
|
||||
- [Define Your Configuration](../basics/define_your_config.md)
|
||||
- [Training Booster](../basics/booster_api.md)
|
||||
|
||||
**Related Paper**
|
||||
|
|
|
@ -1,64 +0,0 @@
|
|||
# Gradient Handler
|
||||
|
||||
Author: Shenggui Li, Yongbin Li
|
||||
|
||||
**Prerequisite**
|
||||
- [Define Your Configuration](../basics/define_your_config.md)
|
||||
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
|
||||
|
||||
**Example Code**
|
||||
- [ColossalAI-Examples Gradient Handler](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)
|
||||
|
||||
## Introduction
|
||||
|
||||
In distributed training, gradient synchronization is required at the end of each iteration. This is important because we
|
||||
need to make sure the parameters are updated with the same gradients in different machines so that the resulting parameters
|
||||
are the same. This is typically the case in data parallel training, as the model is replicated across data parallel ranks.
|
||||
|
||||
In Colossal-AI, we provide an interface for users to customize how they want to handle the synchronization. This brings
|
||||
flexibility in cases such as implementing a new parallelism method.
|
||||
|
||||
When gradient handlers are used, PyTorch `DistributedDataParallel` will not be used, as it synchronizes gradients automatically.
|
||||
|
||||
## Customize Your Gradient Handlers
|
||||
|
||||
To implement a customized gradient handler, you need to follow these steps:
|
||||
1. Inherit `BaseGradientHandler` in Colossal-AI.
|
||||
2. Register the gradient handler into the `GRADIENT_HANDLER` registry.
|
||||
3. Implement the `handle_gradient` method.
|
||||
|
||||
```python
|
||||
from colossalai.legacy.registry import GRADIENT_HANDLER
|
||||
from colossalai.legacy.engine.gradient_handler import BaseGradientHandler
|
||||
|
||||
|
||||
@GRADIENT_HANDLER.register_module
|
||||
class MyGradientHandler(BaseGradientHandler):
|
||||
|
||||
def handle_gradient(self):
|
||||
do_something()
|
||||
|
||||
|
||||
```
|
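As a slightly more concrete sketch, a data-parallel style handler could all-reduce every parameter gradient over the default process group. The `self._model` attribute below is an assumption about what the base class stores and may differ from the actual implementation.

```python
import torch.distributed as dist

from colossalai.legacy.registry import GRADIENT_HANDLER
from colossalai.legacy.engine.gradient_handler import BaseGradientHandler


@GRADIENT_HANDLER.register_module
class AllReduceGradientHandler(BaseGradientHandler):
    """Hypothetical handler that averages gradients across all ranks."""

    def handle_gradient(self):
        world_size = dist.get_world_size()
        for param in self._model.parameters():  # `_model` is an assumed attribute name
            if param.grad is not None:
                dist.all_reduce(param.grad)      # sum gradients from all ranks
                param.grad.div_(world_size)      # average them
```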
||||
|
||||
|
||||
## Usage
|
||||
|
||||
To use a gradient handler, you need to specify your gradient handler in the config file. The gradient handler
|
||||
will be automatically built and attached to the engine.
|
||||
|
||||
```python
|
||||
gradient_handler = [dict(type='MyGradientHandler')]
|
||||
```
|
||||
|
||||
|
||||
### Hands-On Practice
|
||||
|
||||
We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)
|
||||
to demonstrate the use of a gradient handler. In this example, we use `DataParallelGradientHandler` instead of PyTorch's
|
||||
`DistributedDataParallel` for data parallel training.
|
||||
|
||||
```shell
|
||||
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py
|
||||
```
|
||||
<!-- doc-test-command: echo -->
|
|
@ -1,368 +0,0 @@
|
|||
# Auto Mixed Precision Training (Outdated)
|
||||
|
||||
Author: Chuanrui Wang, Shenggui Li, Yongbin Li
|
||||
|
||||
**Prerequisite**
|
||||
- [Define Your Configuration](../basics/define_your_config.md)
|
||||
- [Use Engine and Trainer in Training](../basics/engine_trainer.md)
|
||||
|
||||
**Example Code**
|
||||
- [ColossalAI-Examples AMP](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp)
|
||||
|
||||
**Related Paper**
|
||||
- [Accelerating Scientific Computations with Mixed Precision Algorithms](https://arxiv.org/abs/0808.2794)
|
||||
|
||||
|
||||
## Introduction
|
||||
|
||||
AMP stands for automatic mixed precision training.
|
||||
In Colossal-AI, we have incorporated different implementations of mixed precision training:
|
||||
|
||||
1. torch.cuda.amp
|
||||
2. apex.amp
|
||||
3. naive amp
|
||||
|
||||
|
||||
| Colossal-AI | Tensor parallel support | Pipeline parallel support | FP16 scope |
|
||||
| ----------- | ----------------------- | ------------------------- | ----------- |
|
||||
| AMP_TYPE.TORCH | ✅ | ❌ | Model parameters, activation, gradients are downcast to fp16 during forward and backward propagation |
|
||||
| AMP_TYPE.APEX | ❌ | ❌ | More fine-grained, we can choose opt_level O0, O1, O2, O3 |
|
||||
| AMP_TYPE.NAIVE | ✅ | ✅ | Model parameters, forward and backward operations are all downcast to fp16 |
|
||||
|
||||
The first two rely on the original implementation of PyTorch (version 1.6 and above) and NVIDIA Apex.
|
||||
The last method is similar to Apex O2 level.
|
||||
Among these methods, apex AMP is not compatible with tensor parallelism.
|
||||
This is because tensors are split across devices in tensor parallelism; thus, communication among different processes is required to check whether inf or nan occurs in the whole model's weights.
|
||||
We have modified the torch AMP implementation so that it is now compatible with tensor parallelism.
|
||||
|
||||
> ❌️ The fp16 and zero configurations are not compatible with each other
|
||||
>
|
||||
> ⚠️ Pipeline parallelism currently only supports naive AMP
|
||||
|
||||
We recommend using torch AMP, as it generally gives better accuracy than naive AMP when no pipeline is used.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
In this tutorial we will cover:
|
||||
|
||||
1. AMP introduction
|
||||
2. AMP in Colossal-AI
|
||||
3. Hands-on Practice
|
||||
|
||||
## AMP Introduction
|
||||
|
||||
Automatic Mixed Precision training is a mixture of FP16 and FP32 training.
|
||||
|
||||
The half-precision floating point format (FP16) has lower arithmetic complexity and higher compute efficiency.
|
||||
Besides, fp16 requires half of the storage needed by fp32 and saves memory & network bandwidth, which makes more memory
|
||||
available for larger batch sizes and model sizes.
|
||||
|
||||
However, there are other operations, like reductions, which require the dynamic range of fp32 to avoid numeric overflow/underflow. That is why we introduce automatic mixed precision, which attempts to match each operation to its appropriate data type, reducing the memory footprint and improving training efficiency.
|
||||
|
||||
<figure style={{textAlign: "center"}}>
|
||||
<img src="https://s2.loli.net/2022/01/28/URzLJ3MPeDQbtck.png"/>
|
||||
<figcaption>Illustration of an ordinary AMP (figure from <a href="https://arxiv.org/abs/2108.05818">PatrickStar paper</a>)</figcaption>
|
||||
</figure>
|
||||
|
||||
## AMP in Colossal-AI
|
||||
|
||||
We support three AMP training methods and allow the user to train with AMP with no code changes. You can simply add an `fp16`
|
||||
configuration to your configuration file to use AMP.
|
||||
|
||||
|
||||
```python
|
||||
from colossalai.amp import AMP_TYPE
|
||||
|
||||
# use Torch AMP
|
||||
fp16=dict(
|
||||
mode = AMP_TYPE.TORCH
|
||||
)
|
||||
|
||||
# use naive AMP
|
||||
fp16=dict(
|
||||
mode = AMP_TYPE.NAIVE
|
||||
)
|
||||
|
||||
# use NVIDIA Apex AMP
|
||||
fp16=dict(
|
||||
mode = AMP_TYPE.APEX
|
||||
)
|
||||
|
||||
```
|
||||
|
||||
> This is the minimal configuration; the full configuration is described in the sections below
|
||||
|
||||
### AMP Modularity
|
||||
|
||||
The AMP module is designed to be completely modular and can be used independently.
|
||||
If you wish to only use AMP in your code base without `colossalai.initialize`,
|
||||
you can use `colossalai.amp.convert_to_amp`.
|
||||
|
||||
```python
|
||||
from colossalai.amp import AMP_TYPE
|
||||
|
||||
# example of using torch amp
|
||||
model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
|
||||
optimizer,
|
||||
criterion,
|
||||
AMP_TYPE.TORCH)
|
||||
```
|
||||
|
||||
### Torch AMP Configuration
|
||||
|
||||
```python
|
||||
from colossalai.amp import AMP_TYPE
|
||||
|
||||
fp16=dict(
|
||||
mode=AMP_TYPE.TORCH,
|
||||
|
||||
# below are default values for grad scaler
|
||||
init_scale=2.**16,
|
||||
growth_factor=2.0,
|
||||
backoff_factor=0.5,
|
||||
growth_interval=2000,
|
||||
enabled=True
|
||||
)
|
||||
```
|
||||
|
||||
With optional arguments:
|
||||
- init_scale(float, optional, default=2.**16): Initial scale factor
|
||||
- growth_factor(float, optional, default=2.0): Factor by which the scale is multiplied during `update` if no inf/NaN gradients occur for ``growth_interval`` consecutive iterations.
|
||||
- backoff_factor(float, optional, default=0.5): Factor by which the scale is multiplied during `update` if inf/NaN gradients occur in an iteration.
|
||||
- growth_interval(int, optional, default=2000): Number of consecutive iterations without inf/NaN gradients that must occur for the scale to be multiplied by ``growth_factor``.
|
||||
- enabled(bool, optional, default=True): If ``False``, disables gradient scaling. `step` simply invokes the underlying ``optimizer.step()``, and other methods become no-ops.
|
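These options mirror the arguments of PyTorch's own `torch.cuda.amp.GradScaler`. For reference, a torch-AMP training step in plain PyTorch (without Colossal-AI) roughly follows the standard pattern below; the `model`, `optimizer`, `criterion` and `train_dataloader` objects are assumed to be defined already.

```python
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=2.**16, growth_factor=2.0,
                                   backoff_factor=0.5, growth_interval=2000)

for inputs, labels in train_dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # run the forward pass in mixed precision
        loss = criterion(model(inputs), labels)
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                # unscales gradients and skips the step on inf/NaN
    scaler.update()                       # adjust the loss scale for the next iteration
```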
||||
|
||||
### Apex AMP Configuration
|
||||
|
||||
For this mode, we rely on the Apex implementation for mixed precision training.
|
||||
We support this plugin because it allows for finer control over the granularity of mixed precision.
|
||||
For example, O2 level (optimization level 2) will keep batch normalization in fp32.
|
||||
|
||||
If you are looking for more details, please refer to the [Apex Documentation](https://nvidia.github.io/apex/).
|
||||
|
||||
```python
|
||||
from colossalai.amp import AMP_TYPE
|
||||
|
||||
fp16 = dict(
|
||||
mode=AMP_TYPE.APEX,
|
||||
|
||||
# below are the default values
|
||||
enabled=True,
|
||||
opt_level='O1',
|
||||
cast_model_type=None,
|
||||
patch_torch_functions=None,
|
||||
keep_batchnorm_fp32=None,
|
||||
master_weights=None,
|
||||
loss_scale=None,
|
||||
cast_model_outputs=None,
|
||||
num_losses=1,
|
||||
verbosity=1,
|
||||
min_loss_scale=None,
|
||||
max_loss_scale=16777216.0
|
||||
)
|
||||
```
|
||||
|
||||
Parameters:
|
||||
- enabled(bool, optional, default=True): If False, renders all AMP calls no-ops, so your script should run as if Amp were not present.
|
||||
|
||||
- opt_level(str, optional, default="O1" ): Pure or mixed precision optimization level.
|
||||
Accepted values are “O0”, “O1”, “O2”, and “O3”, explained in detail in the Apex AMP documentation.
|
||||
|
||||
- num_losses(int, optional, default=1): Option to tell AMP in advance how many losses/backward passes you plan to use.
|
||||
When used in conjunction with the loss_id argument to `amp.scale_loss`, enables Amp to use a different loss scale per
|
||||
loss/backward pass, which can improve stability. If num_losses is left to 1, Amp will still support multiple
|
||||
losses/backward passes, but use a single global loss scale for all of them.
|
||||
|
||||
- verbosity(int, default=1): Set to 0 to suppress Amp-related output.
|
||||
|
||||
- min_loss_scale(float, default=None): Sets a floor for the loss scale values that can be chosen by dynamic loss scaling.
|
||||
The default value of None means that no floor is imposed. If dynamic loss scaling is not used, min_loss_scale is ignored.
|
||||
|
||||
- max_loss_scale(float, default=2.**24 ): Sets a ceiling for the loss scale values that can be chosen by dynamic loss
|
||||
scaling. If dynamic loss scaling is not used, max_loss_scale is ignored.
|
||||
|
||||
Currently, the under-the-hood properties that govern pure or mixed precision training are the following:
|
||||
cast_model_type, patch_torch_functions, keep_batchnorm_fp32, master_weights, loss_scale.
|
||||
They are optional properties that are overridden once opt_level is determined.
|
||||
|
||||
- cast_model_type: Casts your model’s parameters and buffers to the desired type.
|
||||
- patch_torch_functions: Patch all Torch functions and Tensor methods to perform Tensor Core-friendly ops like GEMMs and convolutions in FP16, and any ops that benefit from FP32 precision in FP32.
|
||||
- keep_batchnorm_fp32: To enhance precision and enable cudnn batchnorm (which improves performance), it’s often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.
|
||||
- master_weights: Maintain FP32 master weights to accompany any FP16 model weights. FP32 master weights are stepped by the optimizer to enhance precision and capture small gradients.
|
||||
- loss_scale: If loss_scale is a float value, use this value as the static (fixed) loss scale. If loss_scale is the string "dynamic", adaptively adjust the loss scale over time. Dynamic loss scale adjustments are performed by Amp automatically.
|
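If you use Apex directly (outside of `colossalai.initialize`), the corresponding raw API is `apex.amp.initialize` together with `amp.scale_loss`. The sketch below shows the O2 level mentioned above; it assumes NVIDIA Apex is installed and that `model`, `optimizer`, `criterion`, `inputs` and `labels` already exist.

```python
from apex import amp

# O2: fp16 model weights with fp32 master weights kept by the optimizer
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')

loss = criterion(model(inputs), labels)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()   # backward on the scaled loss
optimizer.step()
```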
||||
|
||||
|
||||
### Naive AMP Configuration
|
||||
|
||||
In Naive AMP mode, we achieve mixed precision training while maintaining compatibility with complex tensor and pipeline parallelism.
|
||||
This AMP mode will cast all operations into fp16.
|
||||
The following code block shows the `config.py` file for this mode.
|
||||
|
||||
```python
|
||||
from colossalai.amp import AMP_TYPE
|
||||
|
||||
fp16 = dict(
|
||||
mode=AMP_TYPE.NAIVE,
|
||||
|
||||
# below are the default values
|
||||
log_num_zeros_in_grad=False,
|
||||
initial_scale=2 ** 32,
|
||||
min_scale=1,
|
||||
growth_factor=2,
|
||||
backoff_factor=0.5,
|
||||
growth_interval=1000,
|
||||
hysteresis=2
|
||||
)
|
||||
```
|
||||
|
||||
The default parameters of Naive AMP:
|
||||
- log_num_zeros_in_grad(bool): return number of zeros in the gradients.
|
||||
- initial_scale(int): initial scale of gradient scaler
|
||||
- growth_factor(int): the growth rate of loss scale
|
||||
- backoff_factor(float): the decrease rate of loss scale
|
||||
- hysteresis(int): delay shift in dynamic loss scaling
|
||||
- max_scale(int): maximum loss scale allowed
|
||||
- verbose(bool): if set to `True`, will print debug info
|
||||
|
||||
When using `colossalai.initialize`, you are required to first instantiate a model, an optimizer and a criterion.
|
||||
The output model is converted to an AMP model with smaller memory consumption.
|
||||
If your input model is already too large to fit in a GPU, please instantiate your model weights in `dtype=torch.float16`.
|
||||
Otherwise, try smaller models or check out more parallel training techniques!
|
||||
|
||||
|
||||
## Hands-on Practice
|
||||
|
||||
We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp) which demonstrates
|
||||
the use of AMP with Colossal-AI. In this practice, we will use Torch AMP as an example, but do note that config files are provided for all AMP modes.
|
||||
|
||||
### Step 1. Create a config file
|
||||
|
||||
Create a `config.py` and add the `fp16` configuration.
|
||||
|
||||
```python
|
||||
# in config.py
|
||||
from colossalai.amp import AMP_TYPE
|
||||
|
||||
BATCH_SIZE = 128
|
||||
DROP_RATE = 0.1
|
||||
NUM_EPOCHS = 300
|
||||
|
||||
fp16 = dict(
|
||||
mode=AMP_TYPE.TORCH,
|
||||
)
|
||||
|
||||
clip_grad_norm = 1.0
|
||||
```
|
||||
|
||||
### Step 2. Import libraries in train_with_engine.py
|
||||
|
||||
Create a `train_with_engine.py` and import the necessary dependencies. Remember to install `scipy` and `timm` by running
|
||||
`pip install timm scipy`.
|
||||
|
||||
```python
|
||||
import os
|
||||
import colossalai
|
||||
import torch
|
||||
from pathlib import Path
|
||||
from colossalai.core import global_context as gpc
|
||||
from colossalai.logging import get_dist_logger
|
||||
from colossalai.utils import get_dataloader
|
||||
from colossalai.legacy.trainer import Trainer, hooks
|
||||
from colossalai.nn.lr_scheduler import LinearWarmupLR
|
||||
from timm.models import vit_base_patch16_224
|
||||
from torchvision import datasets, transforms
|
||||
|
||||
```
|
||||
|
||||
### Step 3. Initialize Distributed Environment
|
||||
|
||||
We then need to initialize the distributed environment. For demo purposes, we use `launch_from_torch`. You can refer to [Launch Colossal-AI](../basics/launch_colossalai.md)
|
||||
for other initialization methods.
|
||||
|
||||
```python
|
||||
# initialize distributed setting
|
||||
parser = colossalai.get_default_parser()
|
||||
args = parser.parse_args()
|
||||
|
||||
# launch from torch
|
||||
colossalai.launch_from_torch(config=args.config)
|
||||
|
||||
```
|
||||
|
||||
### Step 4. Create training components
|
||||
|
||||
Build your model, optimizer, loss function, lr scheduler and dataloaders. Note that the root path of the dataset is
|
||||
obtained from the environment variable `DATA`. You may `export DATA=/path/to/data` or change `Path(os.environ['DATA'])`
|
||||
to a path on your machine. Data will be automatically downloaded to the root path.
|
||||
|
||||
```python
|
||||
# build model
|
||||
model = vit_base_patch16_224(drop_rate=0.1)
|
||||
|
||||
# build dataloader
|
||||
train_dataset = datasets.Caltech101(
|
||||
root=Path(os.environ['DATA']),
|
||||
download=True,
|
||||
transform=transforms.Compose([
|
||||
transforms.Resize(256),
|
||||
transforms.RandomResizedCrop(224),
|
||||
transforms.RandomHorizontalFlip(),
|
||||
transforms.ToTensor(),
|
||||
Gray2RGB(),
|
||||
transforms.Normalize([0.5, 0.5, 0.5],
|
||||
[0.5, 0.5, 0.5])
|
||||
]))
|
||||
|
||||
train_dataloader = get_dataloader(dataset=train_dataset,
|
||||
shuffle=True,
|
||||
batch_size=gpc.config.BATCH_SIZE,
|
||||
num_workers=1,
|
||||
pin_memory=True,
|
||||
)
|
||||
|
||||
# build optimizer
|
||||
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=0.1)
|
||||
|
||||
# build loss
|
||||
criterion = torch.nn.CrossEntropyLoss()
|
||||
|
||||
# lr_scheduler
|
||||
lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=gpc.config.NUM_EPOCHS)
|
||||
```
|
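`Gray2RGB` above is a small custom transform (some Caltech101 images are grayscale) and is not part of torchvision. Since its definition is not shown in the example, a minimal sketch of such a transform might look like this:

```python
class Gray2RGB:
    """Hypothetical helper: expand single-channel image tensors to three channels."""

    def __call__(self, img):
        if img.size(0) == 1:
            img = img.repeat(3, 1, 1)   # copy the gray channel into R, G and B
        return img
```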
||||
|
||||
### Step 5. Inject AMP Feature
|
||||
|
||||
Call `colossalai.initialize` to convert the training components so that they run with FP16.
|
||||
|
||||
```python
|
||||
engine, train_dataloader, _, _ = colossalai.initialize(
|
||||
model, optimizer, criterion, train_dataloader,
|
||||
)
|
||||
```
|
||||
|
||||
### Step 6. Train with Engine
|
||||
|
||||
Use the engine in a normal training loop.
|
||||
|
||||
```python
|
||||
engine.train()
|
||||
for epoch in range(gpc.config.NUM_EPOCHS):
|
||||
for img, label in train_dataloader:
|
||||
img = img.cuda()
|
||||
label = label.cuda()
|
||||
engine.zero_grad()
|
||||
output = engine(img)
|
||||
loss = engine.criterion(output, label)
|
||||
engine.backward(loss)
|
||||
engine.step()
|
||||
lr_scheduler.step()
|
||||
```
|
||||
|
||||
### Step 7. Invoke Training Scripts
|
||||
|
||||
Use the following command to start the training scripts. You can change `--nproc_per_node` to use a different number of GPUs.
|
||||
|
||||
```shell
|
||||
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py --config config/config_AMP_torch.py
|
||||
```
|
||||
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 mixed_precision_training.py -->
|
|
@ -1,10 +1,9 @@
|
|||
# Auto Mixed Precision Training (Latest)
|
||||
# Auto Mixed Precision Training
|
||||
|
||||
Author: [Mingyan Jiang](https://github.com/jiangmingyan)
|
||||
|
||||
**Prerequisite**
|
||||
|
||||
- [Define Your Configuration](../basics/define_your_config.md)
|
||||
- [Training Booster](../basics/booster_api.md)
|
||||
|
||||
**Related Paper**
|
||||
|
@ -61,7 +60,7 @@ However, there are other operations, like reductions, which require the dynamic
|
|||
|
||||
## AMP in Colossal-AI
|
||||
|
||||
We supported three AMP training methods and allowed the user to train with AMP with no code. If you want to train with amp, just assign `mixed_precision` with `fp16` when you instantiate the `Booster`. Now booster support torch amp, the other two(apex amp, naive amp) are still started by `colossalai.initialize`, if needed, please refer to [this](./mixed_precision_training.md). Next we will support `bf16`, `fp8`.
|
||||
We support three AMP training methods and allow the user to train with AMP with no code changes. If you want to train with AMP, just assign `fp16` to `mixed_precision` when you instantiate the `Booster`. Next we will support `bf16` and `fp8`.
|
||||
|
||||
### Start with Booster
|
||||
|
||||
|
|
|
@ -204,7 +204,7 @@ def main():
|
|||
|
||||
torch.cuda.synchronize()
|
||||
```
|
||||
> ⚠️ Note: If you want to use the Gemini module, please do not use the [Gradient Accumulation](../features/gradient_accumulation.md) we mentioned before。
|
||||
> ⚠️ Note: If you want to use the Gemini module, please do not use the [Gradient Accumulation](../features/gradient_accumulation_with_booster.md) we mentioned before.
|
||||
The complete example can be found on [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt).
|
||||
|
||||
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 zero_with_chunk.py -->
|
||||
|
|
|
@ -1,113 +0,0 @@
|
|||
# 添加你自己的并行模式
|
||||
|
||||
作者: Shenggui Li, Yongbin Li
|
||||
|
||||
**前置教程**
|
||||
- [定义配置文件](../basics/define_your_config.md)
|
||||
- [并行配置](../basics/configure_parallelization.md)
|
||||
|
||||
## 引言
|
||||
|
||||
为了使研究人员和工程师能够以更少的努力将我们的系统扩展到其他新颖的大规模分布式训练算法,我们已经将训练生命周期中的各种组件解耦。你可以通过简单地继承基类来实现你自己的并行模式。
|
||||
|
||||
主要组件有:
|
||||
|
||||
1. `ProcessGroupInitializer`
|
||||
2. `GradientHandler`
|
||||
3. `Schedule`
|
||||
|
||||
**目前这需要对源代码进行一些改动,因此我们建议你用`-e`标志从源代码安装。`-e`标志使得安装是可编辑的,因此,你的代码变化将反映在你的Python运行时中。我们将在这方面努力,以避免在未来的版本中改变源代码。**
|
||||
|
||||
|
||||
## 进程组初始化器
|
||||
|
||||
并行通常由进程组来管理,参与相同并行算法的进程被置于同一进程组。对于不同的并行算法,需要创建不同的进程组。
|
||||
Colossal-AI 为用户提供了一个全局 context,使他们能够轻松地管理进程组。如果你想添加新的进程组,你可以很容易地定义一个新的类并在你的配置文件中设置它。为了定义你自己的进程组创建方式,你可以按照下面的步骤来创建一个新的分布式初始化。
|
||||
|
||||
1. 在 `colossalai.legacy.context.parallel_mode.ParallelMode` 中添加你自己的并行模式。
|
||||
```python
|
||||
class ParallelMode(Enum):
|
||||
GLOBAL = 'global'
|
||||
DATA = 'data'
|
||||
PIPELINE = 'pipe'
|
||||
...
|
||||
|
||||
NEW_MODE = 'new_mode' # define your mode here
|
||||
```
|
||||
|
||||
2. 创建一个 `ProcessGroupInitializer`。 你可以参考 `colossalai.context.dist_group_initializer` 中给出的例子,前六个参数是固定的。
|
||||
`ParallelContext` 将为你传入这些参数。如果你需要设置其他参数,可以像下面的例子中的 `arg1, arg2` 一样,在后面添加它。
|
||||
最后,通过添加装饰器 `@DIST_GROUP_INITIALIZER.register_module` 将你的初始化程序注册到注册表。
|
||||
```python
|
||||
# sample initializer class
|
||||
@DIST_GROUP_INITIALIZER.register_module
|
||||
class MyParallelInitializer(ProcessGroupInitializer):
|
||||
|
||||
def __init__(self,
|
||||
rank: int,
|
||||
world_size: int,
|
||||
config: Config,
|
||||
data_parallel_size: int,
|
||||
pipeline_parallel_size: int,
|
||||
tensor_parallel_size: int,
|
||||
arg1,
|
||||
arg2):
|
||||
super().__init__(rank, world_size, config)
|
||||
self.arg1 = arg1
|
||||
self.arg2 = arg2
|
||||
# ... your variable init
|
||||
|
||||
def init_parallel_groups(self):
|
||||
# initialize your process groups
|
||||
pass
|
||||
|
||||
```
|
||||
然后,你可以将你的新初始化器插入到 `colossalai.constants.INITIALIZER_MAPPING` 当前的模式与初始化映射中。你可以修改该文件或动态插入新的键值对。
|
||||
|
||||
```python
|
||||
colossalai.constants.INITIALIZER_MAPPING['new_mode'] = 'MyParallelInitializer'
|
||||
```
|
||||
|
||||
3. 在你的配置文件中设置你的初始化器。你可以传入你的自定义参数。这允许
|
||||
`ParallelContext` 创建你的初始化器并初始化你期望的进程组。
|
||||
|
||||
```python
|
||||
parallel = dict(
|
||||
pipeline=dict(size=1),
|
||||
tensor=dict(size=x, mode='new_mode') # this is where you enable your new parallel mode
|
||||
)
|
||||
```
|
||||
|
||||
## 梯度 Handler
|
||||
|
||||
梯度 handler 是对参数的梯度执行 all-reduce 操作的对象。由于不同的 all-reduce 策略或许在不同的并行中被执行,用户可以继承
|
||||
`colossalai.legacy.engine.gradient_handler.BaseGradientHandler` 来实现其策略。目前,Colossal-AI 使用普通的数据并行梯度 handler 在数据并行的 rank 间 all-reduce 梯度。
|
||||
如果数据并行被检测到,梯度 handler 会被自动添加进 engine。
|
||||
|
||||
你可以添加你自己的梯度 handler,如下所示:
|
||||
|
||||
```python
|
||||
from colossalai.legacy.registry import GRADIENT_HANDLER
|
||||
from colossalai.legacy.engine import BaseGradientHandler
|
||||
|
||||
@GRADIENT_HANDLER.register_module
|
||||
class YourGradientHandler(BaseGradientHandler):
|
||||
|
||||
def handle_gradient(self):
|
||||
do_something()
|
||||
|
||||
```
|
||||
|
||||
之后,你可以在配置文件中指定你要使用的梯度 handler。
|
||||
|
||||
```python
|
||||
gradient_handlers = [
|
||||
dict(type='YourGradientHandler'),
|
||||
]
|
||||
```
|
||||
|
||||
## Schedule
|
||||
|
||||
Schedule 包含了如何执行前向和后向计算。目前, Colossal-AI 提供了流水和非流水的 schedule。
|
||||
如果你想修改前向和后向计算的执行方式,你可以继承 `colossalai.legacy.engine.schedule.BaseSchedule` 并实现 `forward_back_step` 函数。
|
||||
<!-- doc-test-command: echo -->
|
|
@ -1,31 +0,0 @@
|
|||
# 定义你自己的并行模型
|
||||
|
||||
作者: Zhengda Bian, Yongbin Li
|
||||
|
||||
> ⚠️ 我们正在编写此文档以使其更加详细。 我们将介绍不同并行的机制以及如何使用它们来编写模型。
|
||||
|
||||
假设您有一个具有数十亿参数的巨大 MLP 模型,其极大的隐藏层大小使其无法直接被单个 GPU 容纳。别担心,Colossal-AI 可以帮你解决这个问题。
|
||||
在 Colossal-AI 的帮助下,您可以用所熟悉的为单个 GPU 编写模型的方式编写大模型,而 Colossal-AI 会自动拆分您的模型权重,并将它们完美地分配到一组 GPU 中。我们给出一个简单的示例,展示如何在 Colossal-AI 中编写简单的 2D 并行模型。
|
||||
|
||||
## 写一个简单的2D并行模型
|
||||
|
||||
```python
|
||||
from colossalai.nn import Linear2D
|
||||
import torch.nn as nn
|
||||
|
||||
class MLP_2D(nn.Module):
|
||||
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self.linear_1 = Linear2D(in_features=1024, out_features=16384)
|
||||
self.linear_2 = Linear2D(in_features=16384, out_features=1024)
|
||||
|
||||
def forward(self, x):
|
||||
x = self.linear_1(x)
|
||||
x = self.linear_2(x)
|
||||
return x
|
||||
```
|
||||
|
||||
## 使用预定义的模型
|
||||
|
||||
为了方便您的使用,我们在 Colossal-AI 的 Model Zoo 中提供一些流行的模型,如*BERT*, *ViT*, *MoE* 和 *GPT*,请自由地将它们定制为不同的尺寸,以满足您的特殊需求。
|
|
@ -1,179 +0,0 @@
|
|||
# 使用ColoTensor让串行程序像Megatron-LM一样并行
|
||||
|
||||
Author: [Haichen Huang](https://github.com/1SAA) and [Jiarui Fang](https://github.com/feifeibear)
|
||||
|
||||
**Prerequisite:**
|
||||
- [ColoTensor Concepts](../basics/colotensor_concept.md)
|
||||
|
||||
## 介绍
|
||||
|
||||
在新版本中,我们引入了ColoTensor。ColoTensor为用户使用并行训练提供了极大的便利,使得用户可以在原本的串行代码上,通过较小的修改将训练改为并行。在本教程中,我们将说明如何修改训练模型以自动使代码采取像 Megatron-LM 一样的方式并行训练。我们以 HuggingFace 提供的 GPT-2 模型为例,并提供一种方式让你可以在单个GPU上预训练GPT-2模型。
|
||||
|
||||
Megatron-LM 提供了一个具有影响力的并行化范式,这个范式主要应用于Transformer大模型的训练。然而,为了大规模训练 Transformer 语言大模型,用户必须使用Megatron-LM提供的特殊模块来构建他们的模型。这给用户带来了一些困难的工作,例如从预先训练的模型中加载权重,或是构建自己的并行训练模型。为了减轻用户的麻烦,我们提供 ColoTensor 类,以完成自动启用张量模型并行。
|
||||
|
||||
## 定义模型和损失函数
|
||||
|
||||
首先,我们直接调用 HuggingFace 库中的 GPTModel 和 GPTLoss。
|
||||
|
||||
```python
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
from transformers import GPT2Config, GPT2LMHeadModel
|
||||
|
||||
class GPTLMModel(nn.Module):
|
||||
def __init__(self, hidden_size=768, num_layers=12, num_attention_heads=12, max_seq_len=1024, vocab_size=50257, checkpoint=False):
|
||||
super().__init__()
|
||||
self.checkpoint = checkpoint
|
||||
self.model = GPT2LMHeadModel(GPT2Config(n_embd=hidden_size, n_layer=num_layers,
|
||||
n_head=num_attention_heads, n_positions=max_seq_len, n_ctx=max_seq_len, vocab_size=vocab_size))
|
||||
if checkpoint:
|
||||
self.model.gradient_checkpointing_enable()
|
||||
|
||||
def forward(self, input_ids, attention_mask):
|
||||
# Only return lm_logits
|
||||
return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
|
||||
|
||||
|
||||
class GPTLMLoss(nn.Module):
|
||||
def __init__(self):
|
||||
super().__init__()
|
||||
self.loss_fn = nn.CrossEntropyLoss()
|
||||
|
||||
def forward(self, logits, labels):
|
||||
shift_logits = logits[..., :-1, :].contiguous()
|
||||
shift_labels = labels[..., 1:].contiguous()
|
||||
# Flatten the tokens
|
||||
return self.loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
|
||||
```
|
||||
|
||||
## 对GPT-2的简短回顾
|
||||
|
||||
现在,我们回顾一下 GPT-2 模型的结构。每个 GPT-2 模型都可以表示为一个 DAG。如下图所示,每个圆圈代表一个算子,每个方块代表一个权重。每个箭头表示输入数据的流向,而箭头旁边的符号表示输入数据的形状。
|
||||
|
||||
然后,让我们深入了解一下这个 GPT-2 模型。它由三部分组成,分别是**嵌入模块**、**转换器层**和**分类头**。
|
||||
|
||||
嵌入模块包含两个权重,符号嵌入权重和位置嵌入权重。在嵌入模块的前向操作之后,原始输入数据的所有序列中的每个单词都会被嵌入到隐藏状态。
|
||||
|
||||
<figure style={{textAlign: "center"}}>
|
||||
<img src="https://s2.loli.net/2022/08/17/omfkIEN6ui5jcL3.png"/>
|
||||
<figcaption>嵌入模块</figcaption>
|
||||
</figure>
|
||||
|
||||
每个转换器层包含两个块。自注意操作在第一个块中调用,同时一个双层感知器位于第二个块中。
|
||||
|
||||
<figure style={{textAlign: "center"}}>
|
||||
<img src="https://s2.loli.net/2022/08/17/LAVzDlpRcj4dYeb.png"/>
|
||||
<figcaption>转换器层</figcaption>
|
||||
</figure>
|
||||
|
||||
最后,分类头只是一个不加偏差的线性模块,里面只有一个线性权重。
|
||||
|
||||
## 应用ColoTensor
|
||||
|
||||
两个步骤使您的串行代码采取 Megatron-LM 张量并行风格。
|
||||
1. 在ColoInitContext的上下文中初始化模型。
|
||||
2. 为每个参数设置 ColoTensorSpec。
|
||||
|
||||
### 使用 ColoInitContext 初始化
|
||||
|
||||
我们应该在 ColoInitContext 中构建模型。在该种上下文中,任何初始化的参数都将转换为 ColoParameter 并自动移动到相应的设备上。
|
||||
|
||||
```python
|
||||
from colossalai.utils.model.colo_init_context import ColoInitContext
|
||||
|
||||
with ColoInitContext(device=torch.device('cpu')):
|
||||
model = GPTLMModel()
|
||||
```
|
||||
|
||||
### 为每个参数设置 ColoTensorSpec
|
||||
|
||||
模型创建完成后,我们通过ProcessGroup建立分布式环境。这里,我们将张量并行度指定为所有GPU的数量,即数据并行度为一。
|
||||
|
||||
```python
|
||||
import torch.distributed as dist
|
||||
from colossalai.tensor import ProcessGroup
|
||||
|
||||
pg = ProcessGroup(tp_degree=dist.get_world_size())
|
||||
```
|
||||
|
||||
现在,我们需要一些辅助函数为下一步做准备。我们定义了两个函数来切分参数。Megatron-LM张量并行需要沿参数的第一维或最后一维切分参数张量。
|
||||
|
||||
```python
|
||||
from colossalai.tensor import ShardSpec, ComputeSpec, ComputePattern, ColoParameter, ProcessGroup
|
||||
|
||||
def split_param_single_dim_tp1d(dim: int, param: ColoParameter, pg: ProcessGroup):
|
||||
spec = (ShardSpec([dim], [pg.tp_world_size()]), ComputeSpec(ComputePattern.TP1D))
|
||||
if param.process_group.tp_world_size() == 1:
|
||||
param.set_process_group(pg)
|
||||
param.set_tensor_spec(*spec)
|
||||
|
||||
|
||||
def split_param_row_tp1d(param: ColoParameter, pg: ProcessGroup):
|
||||
split_param_single_dim_tp1d(0, param, pg)
|
||||
|
||||
|
||||
def split_param_col_tp1d(param: ColoParameter, pg: ProcessGroup):
|
||||
split_param_single_dim_tp1d(-1, param, pg)
|
||||
```
|
||||
|
||||
然后我们使模型采用张量并行。根据 Megatron 中使用的张量并行,应该沿着张量的最后一个维度进行切片,包括符号嵌入的权重,位置嵌入的权重,自注意力块中的所有线性权重和偏差,以及每个双层感知器中的第一个线性权重和偏差。且需要沿第一个维度切分双层感知器中的第二个线性权重。
|
||||
|
||||
```python
|
||||
for mn, module in model.named_modules():
|
||||
for pn, param in module.named_parameters(recurse=False):
|
||||
# set process group for all parameters
|
||||
param.set_process_group(pg)
|
||||
|
||||
if 'mlp.c_fc' in mn:
|
||||
if 'weight' in pn or 'bias' in pn:
|
||||
split_param_col_tp1d(param, pg) # column slice
|
||||
# keep the shape of the output from c_fc
|
||||
param.compute_spec.set_output_replicate(False)
|
||||
elif 'mlp.c_proj' in mn:
|
||||
if 'weight' in pn:
|
||||
split_param_row_tp1d(param, pg) # row slice
|
||||
elif 'wte' in mn or 'wpe' in mn:
|
||||
split_param_col_tp1d(param, pg) # column slice
|
||||
elif 'c_attn' in mn or 'c_proj' in mn:
|
||||
split_param_col_tp1d(param, pg) # column slice
|
||||
```
|
||||
|
||||
修改后的模型如下图所示。
|
||||
|
||||
嵌入模块:
|
||||
|
||||
<figure style={{textAlign: "center"}}>
|
||||
<img src="https://s2.loli.net/2022/08/17/Yu2xzXEabHV7pwe.png"/>
|
||||
<figcaption>修改后的嵌入模块</figcaption>
|
||||
</figure>
|
||||
|
||||
转换器层:
|
||||
|
||||
<figure style={{textAlign: "center"}}>
|
||||
<img src="https://s2.loli.net/2022/08/17/4HWsA2xz51IhPFO.png"/>
|
||||
<figcaption>修改后的转换器层</figcaption>
|
||||
</figure>
|
||||
|
||||
一旦用户指定了每个参数的在并行中的分布模式,ColoTensor 就能够推断出所有算子的计算模式,包括矩阵乘法、线性函数、torch.nn.functional 中的其他逐元素函数,以及其他的一些常用函数。这样,用户可以像往常一样训练他们的模型。
|
||||
|
||||
在我们最新示例中还定义了一个Gemini + ZeRO DDP 的模型从而减小开销,提升效率。这一部分的详细内容可以参考[ZeRO](../features/zero_with_chunk.md),你可以将这两部分内容结合起来看从而理解我们整个训练流程:
|
||||
|
||||
```python
|
||||
def gemini_zero_dpp(model: torch.nn.Module, pg: ProcessGroup, placement_policy: str = "auto"):
|
||||
from colossalai.zero import GeminiDDP
|
||||
model = GeminiDDP(model,
|
||||
device=get_current_device(),
|
||||
placement_policy=placement_policy,
|
||||
pin_memory=True,
|
||||
search_range_m=32)
|
||||
return model
|
||||
```
|
||||
|
||||
## 在单个GPU上预训练GPT-2
|
||||
|
||||
我们做的上述优化让我们可以在单GPU上训练GPT-2模型,只需要将`run.sh`中设置参数`GPUNUM`=1,再运行文件时就可以在单个GPU上完成模型的训练。
|
||||
|
||||
GPT-2 示例在[Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt). 获得。
|
||||
|
||||
|
||||
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 parallelize_your_training_like_Megatron.py -->
|
|
@ -1,99 +0,0 @@
|
|||
# ColoTensor Concepts
|
||||
|
||||
Author: [Jiarui Fang](https://github.com/feifeibear), [Hongxin Liu](https://github.com/ver217) and [Haichen Huang](https://github.com/1SAA)
|
||||
|
||||
> ⚠️ 此页面上的信息已经过时并将被废弃。
|
||||
|
||||
**Prerequisite:**
|
||||
- [Colossal-AI Overview](../concepts/colossalai_overview.md)
|
||||
- [Distributed Training](../concepts/distributed_training.md)
|
||||
- [Paradigms of Parallelism](../concepts/paradigms_of_parallelism.md)
|
||||
|
||||
## Introduction
|
||||
|
||||
在ColossalAI 0.1.8 版本之后,[ColoTensor](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.html#colossalai.tensor.ColoTensor) 成为 ColossalAI 中张量的基本数据结构。 它是 torch.Tensor 的子类,可以当做 PyTorch Tensor使用。 此外,一些独特的功能使其能够表示一个payload分布在多个 GPU 设备上的Global Tensor,并提供一些列方式操作这个Global Tensor。 在 ColoTensor 的帮助下,用户可以以类似编写串行程序方式,编写的分布式 DNN 训练程序。
|
||||
|
||||
ColoTensor 包含额外的属性[ColoTensorSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.tensor_spec.html#colossalai.tensor.tensor_spec.ColoTensorSpec)
|
||||
来描述张量的payload分布和计算模式。
|
||||
|
||||
- ProcessGroup:如何将进程组织为通信组。
|
||||
- Distributed Spec:张量如何在进程组之间分布。
|
||||
- Compute Spec:计算过程中如何使用张量。
|
||||
|
||||
我们一一详述。
|
||||
|
||||
## ProcessGroup
|
||||
|
||||
[ProcessGroup](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.html#colossalai.tensor.ProcessGroup) 类的一个实例描述了如何在进程组中组织进程。进程组内的进程可以一起参与同一个集合通信,比如allgather, allreduce等。进程组组织方式被张量的并行策略支配。比如,如果用户定义了Tensor的张量并行(TP),数据并行(DP)方式,那么进程组的进程组织方式将被自动推导出来。 进程组设置可能因不同的张量而异。 因此,它使我们能够支持更复杂的混合并行。流水线并行(PP)定义不在ProcessGroup中描述,它需要另一套机制,我们将在未来补充ColoTensor应用于PP的相关内容。
|
||||
|
||||
目前,ColoTensor 的一个进程组由 tp_degree 和 dp_degree 两种配置定义。 在 DP+TP 混合并行的情况下,可以将设备视为 2D 网格。 我们将 TP 通信组放置在设备网格的前导低维上,然后将数据并行组放置在设备网格的高维上。 原因是张量并行比数据并行具有更大的通信开销。 相邻设备放置在一个 TP 进程组内,并且通常放置在同一个节点中。
|
||||
|
||||
考虑到8个进程配置为tp_degree=4,dp_degree=2,布局如下图。 进程组 tp0 包含 gpu 0,1,2,3。 进程 dp1 包含 gpu 1 和 5。
|
||||
|
||||
<figure style={{textAlign: "center"}}>
|
||||
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/ColoTensor_layout_demo.PNG"/>
|
||||
<figcaption>Process Group using tp_degree=4, dp_degree=2</figcaption>
|
||||
</figure>
|
||||
|
||||
## Distributed Spec
|
||||
|
||||
[Distributed Spec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.distspec.html)描述了 ColoTensor 如何在 ProcessGroup 中分布。
|
||||
|
||||
张量在 DP 进程组之间的分布方式是自动导出的,不需要用户手动指定。 如果这个张量是一个模型参数,它会在 DP 进程组中被复制。 如果是activation张量,则沿tensor最高维度在DP进程组中进行平均分割。
|
||||
|
||||
因此,在使用 Distributed Spec 时,我们只需要描述张量在 TP 进程组之间的分布方式即可。 TP 进程组目前有两种分布式规范,即 [ShardSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.distspec.html#colossalai.tensor.distspec.ShardSpec)和[ReplicaSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.distspec.html#colossalai.tensor.distspec.ReplicaSpec)。 ShardSpec 需要指定分区的维度索引 dim 和分区个数 num_partitions。 目前,我们仅支持在单个dim上进行拆分。 TP进程组上不同的dist spec可以通过set_dist_spec()接口相互转换。这些转化操作可以被记录在PyTorch的自动求导机制中,并在反向传播时候触发对应的反向操作。
|
||||
|
||||
## Compute Spec
|
||||
|
||||
[ComputeSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.compute_spec.html#colossalai.tensor.compute_spec.ComputeSpec)类描述Tensor如何参与计算。目前,我们将作为module parameter的ColoTensor设置正确的Compute Pattern,可以触发正确的计算模式。具体应用方式我们会在接下来的文档中展示。
|
||||
|
||||
## ColoParameter
|
||||
|
||||
[ColoParameter](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.colo_parameter.html#colossalai.tensor.colo_parameter.ColoParameter)是ColoTensor的子类,用来声明Parameter。它和ColoTensor的关系与torch.Tensor和torch.nn.Parameter的关系一致。后者可以让tensor出现在module的parameters()和named_parameters() 的返回值中。
|
||||
|
||||
## Example
|
||||
|
||||
让我们看一个例子。 使用 tp_degree=4, dp_degree=2 在 8 个 GPU 上初始化并Shard一个ColoTensor。 然后tensor被沿着 TP 进程组中的最后一个维度进行分片。 最后,我们沿着 TP 进程组中的第一个维度(dim 0)对其进行重新Shard。 我们鼓励用户运行代码并观察每个张量的形状。
|
||||
|
||||
|
||||
```python
|
||||
import torch
|
||||
import torch.multiprocessing as mp
|
||||
from colossalai.utils import print_rank_0
|
||||
from functools import partial
|
||||
|
||||
import colossalai
|
||||
from colossalai.tensor import ProcessGroup, ColoTensor, ColoTensorSpec, ShardSpec, ComputeSpec, ComputePattern
|
||||
from colossalai.testing import spawn
|
||||
|
||||
import torch
|
||||
|
||||
def run_dist_tests(rank, world_size, port):
|
||||
colossalai.launch(config={}, rank=rank, world_size=world_size, host='localhost', port=port, backend='nccl')
|
||||
pg = ProcessGroup(tp_degree=2, dp_degree=2)
|
||||
|
||||
torch.manual_seed(0)
|
||||
local_tensor = torch.randn(2, 3, 1).cuda()
|
||||
print_rank_0(f"shape {local_tensor.shape}, {local_tensor.data}")
|
||||
|
||||
spec = ColoTensorSpec(pg, ShardSpec(dims=[-1], num_partitions=[pg.tp_world_size()]), ComputeSpec(ComputePattern.TP1D))
|
||||
t1 = ColoTensor.from_torch_tensor(local_tensor, spec)
|
||||
t1 = t1.to_replicate()
|
||||
print_rank_0(f"shape {t1.shape}, {t1.data}")
|
||||
|
||||
spec2 = ShardSpec([0], [pg.tp_world_size()])
|
||||
t1.set_dist_spec(spec2)
|
||||
print_rank_0(f"shape {t1.shape}, {t1.data}")
|
||||
|
||||
def test_dist_cases(world_size):
|
||||
spawn(run_dist_tests, world_size)
|
||||
|
||||
if __name__ == '__main__':
|
||||
test_dist_cases(4)
|
||||
```
|
||||
|
||||
:::caution
|
||||
|
||||
The ColoTensor is an experimental feature and may be updated.
|
||||
|
||||
:::
|
|
@ -1,138 +0,0 @@
|
|||
# 并行配置
|
||||
|
||||
作者: Shenggui Li, Siqi Mai
|
||||
|
||||
> ⚠️ 此页面上的信息已经过时并将被废弃。请在[Booster插件](../basics/booster_plugins.md)页面查阅更新。
|
||||
|
||||
**预备知识:**
|
||||
- [分布式训练](../concepts/distributed_training.md)
|
||||
- [并行技术](../concepts/paradigms_of_parallelism.md)
|
||||
- [构建配置文件](./define_your_config.md)
|
||||
|
||||
|
||||
## 简介
|
||||
|
||||
我们在 Colossal-AI 中支持多种并行技术。代码库中的混合并行是指您可以轻松地结合数据并行、流水线并行和张量并行(1D、2D、2.5D、3D)的优势共同来进行并行训练。
|
||||
|
||||
每种并行方式需要不同的网络拓扑结构,因此要初始化不同的进程组。您可以通过在配置文件中设置 `parallel` 来初始化相应的进程组。 `parallel` 的配置必须遵从以下格式。数据并行度的大小将被根据您对流水线并行和张量并行的输入自动推断。`colossalai.launch` 将根据您的配置自动初始化这些分布式进程组。
|
||||
|
||||
我们为您提供了一些配置的例子以供参考。
|
||||
|
||||
```python
|
||||
# sampler format
|
||||
parallel = dict(
|
||||
pipeline=dict("size": int),
|
||||
tensor=dict("size": int, "mode": '1d' or '2d' or '2.5d' or '3d', "kwargs": Any)
|
||||
)
|
||||
|
||||
# this is ok
|
||||
parallel = dict(
|
||||
pipeline=dict(size=2),
|
||||
tensor=dict(size=4, mode='2d')
|
||||
)
|
||||
|
||||
# this is ok
|
||||
parallel = dict(
|
||||
pipeline=2,
|
||||
tensor=dict(size=4, mode='2d')
|
||||
)
|
||||
|
||||
# this is not ok
|
||||
# as you need to specify the mode for tensor parallelism
|
||||
parallel = dict(
|
||||
pipeline=2,
|
||||
tensor=4
|
||||
)
|
||||
|
||||
# this is ok as well as tensor will be default to size 1
|
||||
# and mode None
|
||||
parallel = dict(
|
||||
pipeline=2
|
||||
)
|
||||
|
||||
# this is ok as well as pipeline will default to size 1
|
||||
parallel = dict(
|
||||
tensor=dict(size=4, mode='2d')
|
||||
)
|
||||
|
||||
```
|
||||
|
||||
关键字 `size` 指的是并行维度的并行大小。 例如,流水线大小为2意味着有
|
||||
将有2个流水线阶段。张量并行配置中的关键字 `mode` 意味着相应的张量并行技术
|
||||
将被初始化,如1D、2D、2.5D、3D。
|
||||
|
||||
**您也可以选择不在您的配置中使用 "并行",此时流水线和张量的并行度都将默认为大小1。**
|
||||
|
||||
**GPU的总数量必须等于` 数据并行大小 x 张量并行大小 x 流水线并行大小` 。**
|
||||
|
||||
## 数据并行
|
||||
|
||||
数据并行是最常见的分布式训练方式。它将数据分割成几个碎片分别在每个设备上进行训练。数据并行的配置会自动检测并为您设置。您不需要在您的配置中明确地设置它们。在Colossal-AI 中,有两种方法来处理数据并行的 all-reduce。
|
||||
|
||||
1. 如果您设置了梯度handler,梯度handler将会all-reduce梯度。
|
||||
2. 若没有指定相应的配置,Colossal-AI 将会使用 PyTorch 的 DistributedDataParallel。
|
||||
|
||||
在大多数情况下,若您对梯度没有复杂的处理的需求,您将会使用第二种模式。
|
||||
|
||||
## 1D, 2D, 2.5D 和 3D 并行
|
||||
|
||||
为了实现混合并行,我们提供了一系列张量并行方法。您可以阅读相应的学术论文进行深入的了解。这些并行模式需要和 Colossal-AI 提供的分布式层一同工作。
|
||||
|
||||
- 1D: [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053)
|
||||
|
||||
- 2D: [An Efficient 2D Method for Training Super-Large Deep Learning Models](https://arxiv.org/abs/2104.05343)
|
||||
2D 并行基于 SUMMA 矩阵乘法,它将输入数据、模型权重和层输出切分成两个不同的维度。 这些张量块分布在 `P = N^2` 设备的二维网格上,其中 `N` 是单一维度上张量块的数量。
|
||||
|
||||
- 2.5D: [2.5-dimensional distributed model training](https://arxiv.org/abs/2105.14500)
|
||||
在 2.5D 矩阵乘法的启发下,2.5D 并行引入了一种新的张量并行,进一步将2D张量并行化。其中,`P = N^2 ∗ d` 个处理器被分配到 `d` 层, 每层独立进行矩阵乘法运算,维度为 `N`。
|
||||
|
||||
- 3D: [Maximizing Parallelism in Distributed Training for Huge Neural Networks](https://arxiv.org/abs/2105.14450)
|
||||
我们还介绍了一种 3D 张量并行方法,在三维处理器立方体上并行化神经网络。这种方法在数量为 `P` 的处理器上实现了最佳的 `O(P^{1/3})` 通信开销,而计算和内存的使用都是通过优化的参数和激活的负载平衡来实现的。同时,通过优化参数和 activations 的负载平衡,计算和内存的使用都是均匀分布的。
|
||||
|
||||
```python
|
||||
# 1D parallel
|
||||
parallel = dict(
|
||||
tensor=dict(size=4, mode='1d')
|
||||
)
|
||||
|
||||
# 2D parallel
|
||||
parallel = dict(
|
||||
tensor=dict(size=4, mode='2d')
|
||||
)
|
||||
|
||||
# 2.5D parallel
|
||||
parallel = dict(
|
||||
tensor=dict(size=8, mode='2.5d', depth=2)
|
||||
)
|
||||
|
||||
# 3D parallel
|
||||
parallel = dict(
|
||||
tensor=dict(size=8, mode='3d')
|
||||
)
|
||||
```
|
||||
|
||||
当您在配置中指定了张量并行模式,您就可以使用其相应的分布式算子。例如,若您设置模式为 `2d`,那么在模型构建中就能使用 `colossalai.nn.Linear2D` 了。
|
||||
|
||||
|
||||
## 流水线并行
|
||||
|
||||
流水线并行是将模型按层分成几个部分。例如,假设我们有一个简单的模型,它由两个线性层组成。我们有两个 GPU,我们可以将第一个线性层分配给第一个 GPU 而第二层则分配给第二个 GPU。
|
||||
|
||||
您可以在您的配置文件中设置流水线并行度的大小。当流水线并行度大于1,Colossal-AI 将会自动地创建流水线并行的 schedule,这将会为您定义好模型训练的 `forward` 和 `backward`。
|
||||
|
||||
```python
|
||||
parallel = dict(
|
||||
pipeline=dict(size=4), # number of pipeline stages
|
||||
)
|
||||
```
|
||||
|
||||
## 序列并行
|
||||
|
||||
针对处理大图片、视频、长文本、长时间医疗监控等数据的需要,Colossal-AI 还提供了序列并行的方法。该方法是在论文[Sequence Parallelism: Making 4D Parallelism Possible](https://arxiv.org/abs/2105.13120)中提出的。您可以指定模式为 `sequence` 来初始化进程组。
|
||||
|
||||
|
||||
```python
|
||||
parallel = dict(
|
||||
tensor=dict(size=4, mode='sequence')
|
||||
)
|
||||
```
|
|
@ -1,73 +0,0 @@
|
|||
# 构建配置文件
|
||||
|
||||
作者: Guangyang Lu, Shenggui Li, Siqi Mai
|
||||
|
||||
> ⚠️ 此页面上的信息已经过时并将被废弃。请在[Booster API](../basics/booster_api.md)页面查阅更新。
|
||||
|
||||
**预备知识:**
|
||||
- [分布式训练](../concepts/distributed_training.md)
|
||||
- [Colossal-AI 总览](../concepts/colossalai_overview.md)
|
||||
|
||||
|
||||
## 简介
|
||||
|
||||
在 Colossal-AI 中,我们需要一个配置文件来指定系统在训练过程中要注入的特征。在本教程中,我们将向您介绍如何构建您的配置文件以及如何使用这个配置文件。使用配置文件有以下一些好处:
|
||||
|
||||
1. 您可以在不同的配置文件中存储您的特征配置和训练超参数。
|
||||
2. 对于我们未来发布的新功能,您亦可以在配置中指定,而无需改变训练脚本的代码。
|
||||
|
||||
在本教程中,我们将向您介绍如何构建您的配置文件。
|
||||
|
||||
## 配置定义
|
||||
|
||||
在一个配置文件中,有两种类型的变量。一种是作为特征说明,另一种是作为超参数。所有与特征相关的变量都是保留关键字。例如,如果您想使用混合精度训练,需要在 config 文件中使用变量名`fp16`,并遵循预先定义的格式。
|
||||
|
||||
### 功能配置
|
||||
|
||||
Colossal-AI 提供了一系列的功能来加快训练速度。每个功能都是由配置文件中的相应字段定义的。在本教程中,我们不会给出所有功能的配置细节,而是提供一个如何指定一个功能的说明。**每个功能的细节可以在其各自的教程中找到。**
|
||||
|
||||
为了说明配置文件的使用,我们在这里使用混合精度训练作为例子。您需要遵循以下步骤。
|
||||
|
||||
1. 创建一个配置文件(例如 `config.py`,您可以指定任意的文件名)。
|
||||
2. 在配置文件中定义混合精度的配置。例如,为了使用 PyTorch 提供的原始混合精度训练,您只需将下面这几行代码写入您的配置文件中。
|
||||
|
||||
```python
|
||||
from colossalai.amp import AMP_TYPE
|
||||
|
||||
fp16 = dict(
|
||||
mode=AMP_TYPE.TORCH
|
||||
)
|
||||
```
|
||||
|
||||
3. 当启动分布式环境时,向 Colossal-AI 指定您的配置文件的位置。比如下面的例子是配置文件在当前目录下。
|
||||
|
||||
```python
|
||||
import colossalai
|
||||
|
||||
colossalai.launch(config='./config.py', ...)
|
||||
```
|
||||
|
||||
这样,Colossal-AI 便知道您想使用什么功能,并会在 `colossalai.initialize` 期间注入您所需要的功能。
|
||||
|
||||
### 全局超参数
|
||||
|
||||
除了功能的配置,您还可以在配置文件中定义训练的超参数。当您想进行多个实验时,这将会变得非常方便。每个实验的细节都可以放在独立的配置文件中,以避免混乱。这些参数将被存储在全局并行环境中,可以在训练脚本中访问。
|
||||
|
||||
例如,您可以在配置文件中指定批量大小。
|
||||
|
||||
```python
|
||||
BATCH_SIZE = 32
|
||||
```
|
||||
|
||||
启动后,您能够通过全局并行上下文访问您的超参数。
|
||||
|
||||
```python
|
||||
import colossalai
|
||||
from colossalai.core import global_context as gpc
|
||||
|
||||
colossalai.launch(config='./config.py', ...)
|
||||
|
||||
# access your parameter
|
||||
print(gpc.config.BATCH_SIZE)
|
||||
|
||||
```
|
|
@ -1,387 +0,0 @@
|
|||
# 如何在训练中使用 Engine 和 Trainer
|
||||
|
||||
作者: Shenggui Li, Siqi Mai
|
||||
|
||||
> ⚠️ 此页面上的信息已经过时并将被废弃。请在[Booster API](../basics/booster_api.md)页面查阅更新。
|
||||
|
||||
**预备知识:**
|
||||
- [初始化功能](./initialize_features.md)
|
||||
|
||||
## 简介
|
||||
|
||||
在本教程中,您将学习如何使用 Colossal-AI 中提供的 Engine 和 Trainer 来训练您的模型。在深入研究细节之前,我们想先解释一下 Engine 和 Trainer 的概念。
|
||||
|
||||
### Engine
|
||||
|
||||
Engine 本质上是一个模型、优化器和损失函数的封装类。当我们调用 `colossalai.initialize` 时,一个 Engine 对象将被返回,并且配备了在您的配置文件中指定的梯度剪裁、梯度累计和 ZeRO 优化器等功能。
|
||||
|
||||
Engine 将使用与 PyTorch 训练组件类似的 API,因此您只需对代码进行微小的修改即可。
|
||||
|
||||
下表展示了Engine的常用API。
|
||||
|
||||
| 组件 | 功能 | PyTorch | Colossal-AI |
|
||||
| ------------------------------------- | --------------------------------------------- | ------------------------------- | -------------------------------------- |
|
||||
| optimizer | 迭代前将所有梯度设置为零 | optimizer.zero_grad() | engine.zero_grad() |
|
||||
| optimizer | 更新参数 | optimizer.step() | engine.step() |
|
||||
| model | 进行一次前向计算 | outputs = model(inputs) | outputs = engine(inputs) |
|
||||
| criterion | 计算loss值 | loss = criterion(output, label) | loss = engine.criterion(output, label) |
|
||||
| criterion | 反向计算 | loss.backward() | engine.backward(loss) |
|
||||
|
||||
我们需要这样一个 Engine 类的原因是,我们可以添加更多的功能,同时将实现隐藏在
|
||||
`colossalai.initialize` 函数中实现。
|
||||
假如我们要添加一个新的功能,我们可以在 `colossalai.initialize` 函数中完成对于模型、优化器、数据加载器和损失函数的功能诠释。不管中间的过程有多复杂,最终我们呈现的以及用户需要使用的只有一个 Engine 类,这将十分便捷。
|
||||
用户只需要在最小范围内修改他们的代码,将普通的 PyTorch APIs 调整为 Colossal-AI
|
||||
Engine 的 API。通过这种方式,他们可以享受更多的功能来进行有效的训练。
|
||||
|
||||
以下是一个简单的例子:
|
||||
|
||||
```python
|
||||
import colossalai
|
||||
|
||||
# build your model, optimizer, criterion, dataloaders
|
||||
...
|
||||
|
||||
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
|
||||
optimizer,
|
||||
criterion,
|
||||
train_dataloader,
|
||||
test_dataloader)
|
||||
for img, label in train_dataloader:
|
||||
engine.zero_grad()
|
||||
output = engine(img)
|
||||
loss = engine.criterion(output, label)
|
||||
engine.backward(loss)
|
||||
engine.step()
|
||||
```
|
||||
|
||||
### Trainer
|
||||
|
||||
Trainer 是一个更高级的封装器,用户可以用更少的代码行来执行训练。 由于 Trainer 的使用会更加简单,相较于 Engine,它会缺少一点灵活性。 Trainer 被设计为进行前向和反向计算来进行模型权重的更新。通过传递 Engine 对象,我们可以很容易地创建一个 Trainer。
|
||||
Trainer 的参数 `schedule` 默认值是 `None` 。在大多数情况下,除非我们想使用流水线并行,否则我们把这个值设为 `None`。如果您想探索更多关于这个参数的内容,您可以前往流水线并行的相关教程。
|
||||
|
||||
```python
|
||||
from colossalai.logging import get_dist_logger
|
||||
from colossalai.legacy.trainer import Trainer, hooks
|
||||
|
||||
# build components and initialize with colossalai.initialize
|
||||
...
|
||||
|
||||
# create a logger so that trainer can log on the console
|
||||
logger = get_dist_logger()
|
||||
|
||||
# create a trainer object
|
||||
trainer = Trainer(
|
||||
engine=engine,
|
||||
logger=logger
|
||||
)
|
||||
```
|
||||
|
||||
在 Trainer 中,用户可以定制一些 hooks,并将这些 hooks 附加到 Trainer 上。hook 将根据训练方案定期地执行生命周期函数。例如,基于用户是想在每次训练迭代后还是只在整个训练周期后更新学习率,
|
||||
`LRSchedulerHook` 将会在 `after_train_iter` 或 `after_train_epoch` 阶段执行 `lr_scheduler.step()` 去为用户更新学习率。您可以将 hook 存储在一个列表中并将其传递给 `trainer.fit` 方法。`trainer.fit` 方法将根据您的参数执行训练和测试。如果 `display_process` 为 True,将在您的控制台显示一个进度条,以显示训练的过程。
|
||||
|
||||
|
||||
```python
|
||||
# define the hooks to attach to the trainer
|
||||
hook_list = [
|
||||
hooks.LossHook(),
|
||||
hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True),
|
||||
hooks.AccuracyHook(accuracy_func=Accuracy()),
|
||||
hooks.LogMetricByEpochHook(logger),
|
||||
]
|
||||
|
||||
# start training
|
||||
trainer.fit(
|
||||
train_dataloader=train_dataloader,
|
||||
epochs=NUM_EPOCHS,
|
||||
test_dataloader=test_dataloader,
|
||||
test_interval=1,
|
||||
hooks=hook_list,
|
||||
display_progress=True
|
||||
)
|
||||
```
|
||||
|
||||
如果您想定制您的 hook 类,您可以继承 `hooks.BaseHook` 并重写您想要的生命周期方法。下面提供了一个例子来演示如何创建一个简单的关于日志信息的 hook,以供您参考。
|
||||
|
||||
```python
|
||||
from colossalai.logging import get_dist_logger
|
||||
from colossalai.legacy.trainer import hooks
|
||||
|
||||
class LogMessageHook(hooks.BaseHook):
|
||||
|
||||
def __init__(self, priority=10):
|
||||
self._logger = get_dist_logger()
|
||||
|
||||
def before_train(self, trainer):
|
||||
self._logger.info('training starts')
|
||||
|
||||
def after_train(self, trainer):
|
||||
self._logger.info('training finished')
|
||||
|
||||
|
||||
...
|
||||
|
||||
# then in your training script
|
||||
hook_list.append(LogMessageHook())
|
||||
```
|
||||
|
||||
|
||||
|
||||
在下面的章节中,您将会详细地了解到如何用 Engine 和 Trainer 来训练 ResNet 模型。
|
||||
|
||||
|
||||
## ResNet
|
||||
|
||||
### 总览
|
||||
|
||||
在本节中,我们将介绍:
|
||||
|
||||
1. 使用一个 Engine 在 CIFAR10 数据集上训练 ResNet34 模型
|
||||
2. 使用一个 Trainer 在 CIFAR10 数据集上训练 ResNet34 模型
|
||||
|
||||
项目结构如下:
|
||||
|
||||
```bash
|
||||
-- config.py
|
||||
-- run_resnet_cifar10_with_engine.py
|
||||
-- run_resnet_cifar10_with_trainer.py
|
||||
```
|
||||
|
||||
无论使用 Engine 还是 Trainer,步骤 1-4 都是通用的。因此,步骤 1-4 加上步骤 5 对应 `run_resnet_cifar10_with_engine.py`,而步骤 1-4 加上步骤 6 则对应 `run_resnet_cifar10_with_trainer.py`。
|
||||
|
||||
### 牛刀小试
|
||||
|
||||
#### 步骤 1. 创建配置文件
|
||||
|
||||
在你的项目文件夹中,创建一个 `config.py`。这个文件用来指定一些您在训练模型时想要使用的特性。下面是一个配置文件的例子。
|
||||
|
||||
```python
|
||||
from colossalai.amp import AMP_TYPE
|
||||
|
||||
BATCH_SIZE = 128
|
||||
NUM_EPOCHS = 200
|
||||
|
||||
fp16=dict(
|
||||
mode=AMP_TYPE.TORCH
|
||||
)
|
||||
```
|
||||
|
||||
在这个配置文件中,我们指定要在每个 GPU 上使用批大小为128,并运行200个 epoch。这两个参数是在 `gpc.config` 中体现的。例如,您可以使用 `gpc.config.BATCH_SIZE` 来访问您存储在配置文件中的批大小值。而 `fp16` 配置则会告诉 `colossalai.initialize` 使用 PyTorch 提供的混合精度训练,以更好的速度和更低的内存消耗来训练模型。
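
作为补充,这些配置项在完成 `colossalai.launch*` 初始化之后,可以在训练脚本中像下面这样读取(示意):

```python
from colossalai.core import global_context as gpc

# 配置文件中的常量会被挂载到 gpc.config 上(需要在 colossalai.launch* 之后访问)
batch_size = gpc.config.BATCH_SIZE     # 128
num_epochs = gpc.config.NUM_EPOCHS     # 200
amp_mode = gpc.config.fp16['mode']     # AMP_TYPE.TORCH
```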
|
||||
|
||||
#### 步骤 2. 初始化分布式环境
|
||||
|
||||
我们需要初始化分布式训练环境。这在 [启动 Colossal-AI](./launch_colossalai.md) 中有相应的教程。在当前的演示中,我们使用 `launch_from_torch` 和 PyTorch 启用工具。
|
||||
|
||||
```python
|
||||
import colossalai
|
||||
|
||||
# ./config.py refers to the config file we just created in step 1
|
||||
colossalai.launch_from_torch(config='./config.py')
|
||||
```
|
||||
|
||||
#### 步骤 3. 创建所有的训练组件
|
||||
|
||||
这时,我们可以创建用于训练的所有组件,包括:
|
||||
|
||||
1. 模型
|
||||
2. 优化器
|
||||
3. 损失函数
|
||||
4. 训练/测试数据加载器
|
||||
5. 学习率调度器
|
||||
6. 日志记录器
|
||||
|
||||
|
||||
|
||||
为了构建这些组件,您需要导入以下模块。
|
||||
|
||||
```python
|
||||
from pathlib import Path
|
||||
from colossalai.logging import get_dist_logger
|
||||
import torch
|
||||
import os
|
||||
from colossalai.core import global_context as gpc
|
||||
from colossalai.utils import get_dataloader
|
||||
from torchvision import transforms
|
||||
from colossalai.nn.lr_scheduler import CosineAnnealingLR
|
||||
from torchvision.datasets import CIFAR10
|
||||
from torchvision.models import resnet34
|
||||
```
|
||||
|
||||
|
||||
|
||||
然后,按照通常在 PyTorch 脚本中构建组件的方式来构建各个组件。在下面的脚本中,我们直接把 CIFAR10 数据集的根路径设置为 `./data`。您也可以把它改为任何您想要的路径,例如把 `root='./data'` 改为 `root=Path(os.environ['DATA'])`,通过环境变量 `DATA` 来指定数据集所在的目录。
|
||||
|
||||
```python
|
||||
# build logger
|
||||
logger = get_dist_logger()
|
||||
|
||||
# build resnet
|
||||
model = resnet34(num_classes=10)
|
||||
|
||||
# build datasets
|
||||
train_dataset = CIFAR10(
|
||||
root='./data',
|
||||
download=True,
|
||||
transform=transforms.Compose(
|
||||
[
|
||||
transforms.RandomCrop(size=32, padding=4),
|
||||
transforms.RandomHorizontalFlip(),
|
||||
transforms.ToTensor(),
|
||||
transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[
|
||||
0.2023, 0.1994, 0.2010]),
|
||||
]
|
||||
)
|
||||
)
|
||||
|
||||
test_dataset = CIFAR10(
|
||||
root='./data',
|
||||
train=False,
|
||||
transform=transforms.Compose(
|
||||
[
|
||||
transforms.ToTensor(),
|
||||
transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[
|
||||
0.2023, 0.1994, 0.2010]),
|
||||
]
|
||||
)
|
||||
)
|
||||
|
||||
# build dataloaders
|
||||
train_dataloader = get_dataloader(dataset=train_dataset,
|
||||
shuffle=True,
|
||||
batch_size=gpc.config.BATCH_SIZE,
|
||||
num_workers=1,
|
||||
pin_memory=True,
|
||||
)
|
||||
|
||||
test_dataloader = get_dataloader(dataset=test_dataset,
|
||||
add_sampler=False,
|
||||
batch_size=gpc.config.BATCH_SIZE,
|
||||
num_workers=1,
|
||||
pin_memory=True,
|
||||
)
|
||||
|
||||
# build criterion
|
||||
criterion = torch.nn.CrossEntropyLoss()
|
||||
|
||||
# optimizer
|
||||
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
|
||||
|
||||
# lr_scheduler
|
||||
lr_scheduler = CosineAnnealingLR(optimizer, total_steps=gpc.config.NUM_EPOCHS)
|
||||
```
|
||||
|
||||
#### 步骤 4. 用 Colossal-AI 进行初始化
|
||||
|
||||
接下来,重要的一步是通过调用 `colossalai.initialize` 获得 Engine。正如 `config.py` 中所述,我们将使用混合精度训练来训练 ResNet34 模型。`colossalai.initialize` 将自动检查您的配置文件,并将相关特征分配给您的训练组件。这样一来,我们的 Engine 已经能够进行混合精度训练,而您不需要进行额外的处理。
|
||||
|
||||
```python
|
||||
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
|
||||
optimizer,
|
||||
criterion,
|
||||
train_dataloader,
|
||||
test_dataloader,
|
||||
)
|
||||
```
|
||||
|
||||
|
||||
|
||||
#### 步骤 5. 用 Engine 进行训练
|
||||
|
||||
当所有的训练组件都准备好后,我们就可以像使用 PyTorch 一样训练 ResNet34 了。
|
||||
|
||||
```python
|
||||
for epoch in range(gpc.config.NUM_EPOCHS):
|
||||
# execute a training iteration
|
||||
engine.train()
|
||||
for img, label in train_dataloader:
|
||||
img = img.cuda()
|
||||
label = label.cuda()
|
||||
|
||||
# set gradients to zero
|
||||
engine.zero_grad()
|
||||
|
||||
# run forward pass
|
||||
output = engine(img)
|
||||
|
||||
# compute loss value and run backward pass
|
||||
train_loss = engine.criterion(output, label)
|
||||
engine.backward(train_loss)
|
||||
|
||||
# update parameters
|
||||
engine.step()
|
||||
|
||||
# update learning rate
|
||||
lr_scheduler.step()
|
||||
|
||||
# execute a testing iteration
|
||||
engine.eval()
|
||||
correct = 0
|
||||
total = 0
|
||||
for img, label in test_dataloader:
|
||||
img = img.cuda()
|
||||
label = label.cuda()
|
||||
|
||||
# run prediction without back-propagation
|
||||
with torch.no_grad():
|
||||
output = engine(img)
|
||||
test_loss = engine.criterion(output, label)
|
||||
|
||||
# compute the number of correct prediction
|
||||
pred = torch.argmax(output, dim=-1)
|
||||
correct += torch.sum(pred == label)
|
||||
total += img.size(0)
|
||||
|
||||
logger.info(
|
||||
f"Epoch {epoch} - train loss: {train_loss:.5}, test loss: {test_loss:.5}, acc: {correct / total:.5}, lr: {lr_scheduler.get_last_lr()[0]:.5g}", ranks=[0])
|
||||
```
|
||||
|
||||
#### 步骤 6. 用 Trainer 进行训练
|
||||
|
||||
如果您想用 Trainer 进行训练,您可以参考下面的代码进行您的实验。
|
||||
|
||||
|
||||
```python
|
||||
from colossalai.legacy.nn.metric import Accuracy
|
||||
from colossalai.legacy.trainer import Trainer, hooks
|
||||
|
||||
|
||||
# create a trainer object
|
||||
trainer = Trainer(
|
||||
engine=engine,
|
||||
logger=logger
|
||||
)
|
||||
|
||||
# define the hooks to attach to the trainer
|
||||
hook_list = [
|
||||
hooks.LossHook(),
|
||||
hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True),
|
||||
hooks.AccuracyHook(accuracy_func=Accuracy()),
|
||||
hooks.LogMetricByEpochHook(logger),
|
||||
hooks.LogMemoryByEpochHook(logger)
|
||||
]
|
||||
|
||||
# start training
|
||||
# run testing every 1 epoch
|
||||
trainer.fit(
|
||||
train_dataloader=train_dataloader,
|
||||
epochs=gpc.config.NUM_EPOCHS,
|
||||
test_dataloader=test_dataloader,
|
||||
test_interval=1,
|
||||
hooks=hook_list,
|
||||
display_progress=True
|
||||
)
|
||||
```
|
||||
|
||||
|
||||
|
||||
#### 步骤 7. 开始分布式训练
|
||||
|
||||
最后,我们可以使用 PyTorch 提供的分布式启动器来调用脚本,因为我们在步骤2中使用了 `launch_from_torch`。您需要把`<num_gpus>` 替换成您机器上可用的GPU数量。如果您只想使用一个 GPU,您可以把这个数字设为1。如果您想使用其他的启动器,请您参考如何启动 Colossal-AI 的教程。
|
||||
|
||||
|
||||
```bash
|
||||
# with engine
|
||||
python -m torch.distributed.launch --nproc_per_node <num_gpus> --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py
|
||||
# with trainer
|
||||
python -m torch.distributed.launch --nproc_per_node <num_gpus> --master_addr localhost --master_port 29500 run_resnet_cifar10_with_trainer.py
|
||||
```
|
||||
<!-- doc-test-command: echo -->
|
|
@ -1,48 +0,0 @@
|
|||
# 初始化功能
|
||||
|
||||
作者: Shenggui Li, Siqi Mai
|
||||
|
||||
> ⚠️ 此页面上的信息已经过时并将被废弃。请在[Booster API](../basics/booster_api.md)页面查阅更新。
|
||||
|
||||
**预备知识:**
|
||||
- [分布式训练](../concepts/distributed_training.md)
|
||||
- [Colossal-AI 总览](../concepts/colossalai_overview.md)
|
||||
|
||||
## 简介
|
||||
|
||||
在本教程中,我们将介绍 `colossalai.initialize` 的使用。 它包含了如何将特征(例如,模型、优化器、数据加载器)无缝注入您的训练组件中。 调用 `colossalai.initialize` 是您进入训练循环前的基本操作。
|
||||
|
||||
在下面一节中,我们将介绍 `colossalai.initialize` 是如何工作的以及使用中我们要注意的细节。
|
||||
|
||||
## 使用
|
||||
|
||||
在一个典型的工作流程中,我们将在训练脚本的开始启动分布式环境。
|
||||
之后,我们将实例化我们的对象,如模型、优化器、损失函数、数据加载器等。此时,我们可以使用 `colossalai.initialize` 便捷地为这些对象注入特征。
|
||||
具体细节请看以下的伪代码例子。
|
||||
|
||||
```python
|
||||
import colossalai
|
||||
import torch
|
||||
...
|
||||
|
||||
|
||||
# launch distributed environment
|
||||
colossalai.launch(config='./config.py', ...)
|
||||
|
||||
# create your objects
|
||||
model = MyModel()
|
||||
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
|
||||
criterion = torch.nn.CrossEntropyLoss()
|
||||
train_dataloader = MyTrainDataloader()
|
||||
test_dataloader = MyTrainDataloader()
|
||||
|
||||
# initialize features
|
||||
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model,
|
||||
optimizer,
|
||||
criterion,
|
||||
train_dataloader,
|
||||
test_dataloader)
|
||||
```
|
||||
|
||||
`colossalai.initialize` 将返回一个 `Engine` 对象。 该对象把模型、优化器和损失函数封装起来。 **`Engine` 对象会以配置文件中指定的特征运行。**
|
||||
关于 `Engine` 的更多使用细节可以在 [在训练中使用Engine和Trainer](./engine_trainer.md) 中获取。
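
作为衔接,下面给出一段使用上面返回的 `engine` 对象的最简训练循环示意(具体用法以 Engine 教程为准):

```python
# 接上文伪代码:拿到 engine 之后的一个最简训练循环
engine.train()
for img, label in train_dataloader:
    engine.zero_grad()
    output = engine(img)
    loss = engine.criterion(output, label)
    engine.backward(loss)
    engine.step()
```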
|
|
@ -1,64 +0,0 @@
|
|||
# 模型Checkpoint
|
||||
|
||||
作者 : Guangyang Lu
|
||||
|
||||
> ⚠️ 此页面上的信息已经过时并将被废弃。请在[Booster Checkpoint](../basics/booster_checkpoint.md)页面查阅更新。
|
||||
|
||||
**预备知识:**
|
||||
- [Launch Colossal-AI](./launch_colossalai.md)
|
||||
- [Initialize Colossal-AI](./initialize_features.md)
|
||||
|
||||
**示例代码:**
|
||||
- [ColossalAI-Examples Model Checkpoint](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/utils/checkpoint)
|
||||
|
||||
**注意:以下函数目前仍是实验性功能。**
|
||||
|
||||
## 简介
|
||||
|
||||
本教程将介绍如何保存和加载模型Checkpoint。
|
||||
|
||||
为了充分利用 Colossal-AI 强大的并行策略,我们需要对模型和张量做相应的修改,因此不能直接使用 `torch.save` 或者 `torch.load` 来保存或加载模型 Checkpoint。为此,Colossal-AI 提供了相应的应用程序接口来实现同样的效果。
|
||||
|
||||
但是,在加载时,你不需要使用与存储相同的保存策略。
|
||||
|
||||
## 使用方法
|
||||
|
||||
### 保存
|
||||
|
||||
有两种方法可以使用Colossal-AI训练模型,即使用engine或使用trainer。
|
||||
**注意我们只保存 `state_dict`.** 因此,在加载Checkpoint时,需要首先定义模型。
|
||||
|
||||
#### 同 engine 保存
|
||||
|
||||
```python
|
||||
from colossalai.utils import save_checkpoint
|
||||
model = ...
|
||||
engine, _, _, _ = colossalai.initialize(model=model, ...)
|
||||
for epoch in range(num_epochs):
|
||||
... # do some training
|
||||
save_checkpoint('xxx.pt', epoch, model)
|
||||
```
|
||||
|
||||
#### 用 trainer 保存
|
||||
```python
|
||||
from colossalai.legacy.trainer import Trainer, hooks
|
||||
model = ...
|
||||
engine, _, _, _ = colossalai.initialize(model=model, ...)
|
||||
trainer = Trainer(engine, ...)
|
||||
hook_list = [
|
||||
hooks.SaveCheckpointHook(1, 'xxx.pt', model)
|
||||
...]
|
||||
|
||||
trainer.fit(...
|
||||
hook=hook_list)
|
||||
```
|
||||
|
||||
### 加载
|
||||
|
||||
```python
|
||||
from colossalai.utils import load_checkpoint
|
||||
model = ...
|
||||
load_checkpoint('xxx.pt', model)
|
||||
... # train or test
|
||||
```
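
如页首提示所述,这些接口已被 Booster Checkpoint 取代。下面补充一段基于 Booster 的大致等价写法,仅作示意;其中 `TorchDDPPlugin` 只是假设选用的插件,`save_model`/`load_model` 的具体参数请以 Booster Checkpoint 文档为准。

```python
# 仅为示意:用 Booster 保存 / 加载模型权重
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin

colossalai.launch_from_torch(config={})

model = ...       # 按常规方式构建模型
booster = Booster(plugin=TorchDDPPlugin())
model, _, _, _, _ = booster.boost(model)

booster.save_model(model, 'xxx.pt')   # 保存(只保存 state_dict)
booster.load_model(model, 'xxx.pt')   # 加载
```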
|
||||
<!-- doc-test-command: echo -->
|
|
@ -2,11 +2,8 @@
|
|||
|
||||
作者: Zhengda Bian, Yongbin Li
|
||||
|
||||
**前置教程**
|
||||
- [定义配置文件](../basics/define_your_config.md)
|
||||
- [并行配置](../basics/configure_parallelization.md)
|
||||
|
||||
**示例代码**xw
|
||||
**示例代码**
|
||||
- [Tensor Parallelism with Shardformer](https://github.com/hpcaitech/ColossalAI/tree/main/colossalai/shardformer/examples)
|
||||
|
||||
**相关论文**
|
||||
|
|
|
@ -3,8 +3,6 @@
|
|||
作者: Zhengda Bian, Yongbin Li
|
||||
|
||||
**前置教程**
|
||||
- [定义配置文件](../basics/define_your_config.md)
|
||||
- [并行配置](../basics/configure_parallelization.md)
|
||||
- [1D 张量并行](./1D_tensor_parallel.md)
|
||||
|
||||
**示例代码**
|
||||
|
|
|
@ -3,8 +3,6 @@
|
|||
作者: Zhengda Bian, Yongbin Li
|
||||
|
||||
**前置教程**
|
||||
- [定义配置文件](../basics/define_your_config.md)
|
||||
- [并行配置](../basics/configure_parallelization.md)
|
||||
- [1D 张量并行](./1D_tensor_parallel.md)
|
||||
- [2D 张量并行](./2D_tensor_parallel.md)
|
||||
|
||||
|
|
|
@ -3,8 +3,6 @@
|
|||
作者: Zhengda Bian, Yongbin Li
|
||||
|
||||
**前置教程**
|
||||
- [定义配置文件](../basics/define_your_config.md)
|
||||
- [并行配置](../basics/configure_parallelization.md)
|
||||
- [1D 张量并行](./1D_tensor_parallel.md)
|
||||
- [2D 张量并行](./2D_tensor_parallel.md)
|
||||
|
||||
|
|
|
@ -1,41 +0,0 @@
|
|||
# 梯度累积 (旧版本)
|
||||
|
||||
作者: Shenggui Li, Yongbin Li
|
||||
|
||||
**前置教程**
|
||||
- [定义配置文件](../basics/define_your_config.md)
|
||||
- [在训练中使用Engine和Trainer](../basics/engine_trainer.md)
|
||||
|
||||
**示例代码**
|
||||
- [ColossalAI-Examples Gradient Accumulation](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
|
||||
|
||||
## 引言
|
||||
|
||||
梯度累积是一种常见的增大训练 batch size 的方式。 在训练大模型时,内存经常会成为瓶颈,并且 batch size 通常会很小(如2),这导致收敛性无法保证。梯度累积将多次迭代的梯度累加,并仅在达到预设迭代次数时更新参数。
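
为了更直观地说明这一概念,下面给出一段与 Colossal-AI 无关的原生 PyTorch 示意代码(假设累积次数为 4,与后文示例一致):

```python
import torch

model = torch.nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
accum_steps = 4   # 梯度累积次数

optimizer.zero_grad()
for step in range(8):
    data = torch.randn(2, 16)
    label = torch.randint(0, 2, (2,))
    loss = criterion(model(data), label) / accum_steps   # 按累积次数缩放损失
    loss.backward()                                       # 梯度在 .grad 中累加
    if (step + 1) % accum_steps == 0:
        optimizer.step()        # 仅每 accum_steps 次迭代更新一次参数
        optimizer.zero_grad()
```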
|
||||
|
||||
## 使用
|
||||
|
||||
在 Colossal-AI 中使用梯度累积非常简单,仅需将下列配置添加进 config 文件。其中,整数值代表期望梯度累积的次数。
|
||||
|
||||
```python
|
||||
gradient_accumulation = <int>
|
||||
```
|
||||
|
||||
## 实例
|
||||
|
||||
我们提供了一个 [运行实例](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
|
||||
来展现梯度累积。在这个例子中,梯度累积次数被设置为4,你可以通过以下命令启动脚本:
|
||||
|
||||
```shell
|
||||
python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py
|
||||
```
|
||||
|
||||
你将会看到类似下方的文本输出。这展现了梯度虽然在前3个迭代中被计算,但直到最后一次迭代,参数才被更新。
|
||||
|
||||
```text
|
||||
iteration 0, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
|
||||
iteration 1, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
|
||||
iteration 2, first 10 elements of param: tensor([-0.0208, 0.0189, 0.0234, 0.0047, 0.0116, -0.0283, 0.0071, -0.0359, -0.0267, -0.0006], device='cuda:0', grad_fn=<SliceBackward0>)
|
||||
iteration 3, first 10 elements of param: tensor([-0.0141, 0.0464, 0.0507, 0.0321, 0.0356, -0.0150, 0.0172, -0.0118, 0.0222, 0.0473], device='cuda:0', grad_fn=<SliceBackward0>)
|
||||
```
|
||||
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 gradient_accumulation.py -->
|
|
@ -1,9 +1,8 @@
|
|||
# 梯度累积 (新版本)
|
||||
# 梯度累积
|
||||
|
||||
作者: [Mingyan Jiang](https://github.com/jiangmingyan)
|
||||
|
||||
**前置教程**
|
||||
- [定义配置文件](../basics/define_your_config.md)
|
||||
- [训练中使用Booster](../basics/booster_api.md)
|
||||
|
||||
## 引言
|
||||
|
|
|
@ -1,53 +0,0 @@
|
|||
# 梯度裁剪(旧版本)
|
||||
|
||||
作者: Boxiang Wang, Haichen Huang, Yongbin Li
|
||||
|
||||
**前置教程**
|
||||
- [定义配置文件](../basics/define_your_config.md)
|
||||
- [在训练中使用Engine和Trainer](../basics/engine_trainer.md)
|
||||
|
||||
**示例代码**
|
||||
- [ColossalAI-Examples Gradient Clipping](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_clipping)
|
||||
|
||||
**相关论文**
|
||||
- [On the difficulty of training Recurrent Neural Networks](https://arxiv.org/abs/1211.5063)
|
||||
|
||||
## 引言
|
||||
|
||||
为了加快训练过程并寻求全局最优以获得更好的性能,越来越多的学习率调度器被提出。人们通过控制学习率来调整训练中参数更新的步长,使每一步的更新幅度更加一致、下降速度可以按预期被控制。
因此,梯度裁剪,即把梯度向量的范数限制在统一上限之内的技术,对于希望获得更好模型性能的人来说是不可或缺的。
|
||||
|
||||
在使用 Colossal-AI 时,你不必担心实现梯度剪裁,我们以一种有效而方便的方式支持梯度剪裁。你所需要的只是在你的配置文件中增加一个命令。
|
||||
|
||||
## 为什么应该使用 Colossal-AI 中的梯度裁剪
|
||||
|
||||
我们不建议用户自己编写梯度剪裁,因为朴素的梯度剪裁在应用张量并行、流水线并行、MoE 等功能时可能会失败。
|
||||
|
||||
根据下图,每个 GPU 只拥有线性层中权重的一部分参数。为了得到线性层权重的梯度向量的正确范数,每个 GPU 中的每个梯度向量的范数应该相加。更复杂的是,偏置的分布不同于权重的分布。通信组在求和运算中有所不同。
|
||||
|
||||
(注: 这种情况是旧版本的 2D 并行,在代码中的实现是不一样的。但这是一个很好的例子,能够说明在梯度剪裁中统一所有通信的困难。)
|
||||
|
||||
<figure style={{textAlign: "center"}}>
|
||||
<img src="https://s2.loli.net/2022/01/28/KXiJPHt3Dum82cA.png"/>
|
||||
<figcaption>参数分布</figcaption>
|
||||
</figure>
|
||||
|
||||
不过您不必担心,Colossal-AI 已经为您处理好了这些细节。
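
为了更直观地说明上述困难,下面是一段概念性的示意代码,展示当参数被切分到多个进程时,想得到正确的全局梯度范数大致需要哪些通信(仅为演示,并非 Colossal-AI 的实际实现,假设进程组已初始化):

```python
import torch
import torch.distributed as dist

def global_grad_norm(local_params, group=None):
    """计算被切分到多个进程上的参数的全局 L2 梯度范数(示意)。"""
    local_sq_sum = torch.zeros(1, device='cuda')
    for p in local_params:
        if p.grad is not None:
            local_sq_sum += p.grad.float().pow(2).sum()
    # 每个进程只持有一部分参数,必须把各自的平方和相加才能得到正确的全局范数;
    # 正如上文所说,权重与偏置的切分方式不同时,还需要在不同的通信组中分别求和。
    dist.all_reduce(local_sq_sum, op=dist.ReduceOp.SUM, group=group)
    return local_sq_sum.sqrt()
```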
|
||||
|
||||
### 使用
|
||||
要使用梯度裁剪,只需在配置文件中添加梯度裁剪范数即可。
|
||||
|
||||
```python
|
||||
clip_grad_norm = 1.0
|
||||
```
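
在单卡、没有参数切分的场景下,这个配置的效果大致等价于在反向传播之后、参数更新之前调用 PyTorch 自带的梯度裁剪(示意):

```python
import torch

# loss.backward() 之后、optimizer.step() 之前:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```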
|
||||
|
||||
### 实例
|
||||
|
||||
我们提供了一个展现梯度裁剪的[运行实例](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_clipping)
|
||||
。在本例中,我们将梯度裁剪范数设置为1.0,你可以使用以下命令运行脚本:
|
||||
|
||||
```shell
|
||||
python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 train_with_engine.py
|
||||
```
|
||||
|
||||
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 gradient_clipping.py -->
|
|
@ -1,9 +1,8 @@
|
|||
# 梯度裁剪 (新版本)
|
||||
# 梯度裁剪
|
||||
|
||||
作者: [Mingyan Jiang](https://github.com/jiangmingyan)
|
||||
|
||||
**前置教程**
|
||||
- [定义配置文件](../basics/define_your_config.md)
|
||||
- [booster使用](../basics/booster_api.md)
|
||||
|
||||
**相关论文**
|
||||
|
|
|
@ -1,60 +0,0 @@
|
|||
# 梯度 Handler
|
||||
|
||||
作者: Shenggui Li, Yongbin Li
|
||||
|
||||
**前置教程**
|
||||
- [定义配置文件](../basics/define_your_config.md)
|
||||
- [在训练中使用Engine和Trainer](../basics/engine_trainer.md)
|
||||
|
||||
**示例代码**
|
||||
- [ColossalAI-Examples Gradient Handler](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)
|
||||
|
||||
## 引言
|
||||
|
||||
在分布式训练中,每次迭代结束时都需要梯度同步。这很重要,因为我们需要确保在不同的机器中使用相同的梯度更新参数,以便生成的参数都一样。这通常在数据并行中看到,因为在数据并行中的模型是直接复制的。
|
||||
|
||||
在 Colossal-AI 中,我们为用户提供了一个接口来定制他们想要如何处理同步。这为实现新的并行方法等情况带来了灵活性。
|
||||
|
||||
当使用梯度 Handler 时,将不再使用 PyTorch 的 `DistributedDataParallel`,因为后者本身会自动同步梯度。
|
||||
|
||||
## 定制你的梯度 Handler
|
||||
|
||||
要实现定制的梯度Handler,需要遵循以下步骤。
|
||||
1. 继承Colossal-AI中的 `BaseGradientHandler`
|
||||
2. 将梯度Handler注册进 `GRADIENT_HANDLER`
|
||||
3. 实现 `handle_gradient`
|
||||
|
||||
```python
|
||||
from colossalai.legacy.registry import GRADIENT_HANDLER
|
||||
from colossalai.legacy.engine.gradient_handler import BaseGradientHandler
|
||||
|
||||
|
||||
@GRADIENT_HANDLER.register_module
|
||||
class MyGradientHandler(BaseGradientHandler):
|
||||
|
||||
def handle_gradient(self):
|
||||
do_something()
|
||||
|
||||
|
||||
```
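
上面的 `do_something()` 只是一个占位符。作为参考,下面给出一个假设的 Handler 实现示意:在所有进程间对梯度做 all-reduce 并取平均。实际实现应当只在数据并行进程组内通信(可通过 `gpc` 获取对应的 group),导入路径也请以您安装的版本为准。

```python
# 仅为示意:一个假设的梯度 all-reduce Handler
import torch.distributed as dist

from colossalai.legacy.engine.gradient_handler import BaseGradientHandler
from colossalai.legacy.registry import GRADIENT_HANDLER


@GRADIENT_HANDLER.register_module
class AllReduceGradientHandler(BaseGradientHandler):

    def handle_gradient(self):
        # BaseGradientHandler 在构造时保存了 self._model 与 self._optimizer
        world_size = dist.get_world_size()
        for param in self._model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad)
                param.grad.div_(world_size)
```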
|
||||
|
||||
|
||||
## 使用
|
||||
|
||||
要使用梯度 Handler,需要在配置文件中指定梯度 Handler。梯度 Handler 将自动构建并连接到 Engine。
|
||||
|
||||
```python
|
||||
gradient_handler = [dict(type='MyGradientHandler')]
|
||||
```
|
||||
|
||||
|
||||
### 实例
|
||||
|
||||
我们提供了一个 [运行实例](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_handler)
|
||||
展现梯度 Handler 的使用. 在这个例子中,我们使用 `DataParallelGradientHandler` 而不是 PyTorch 的
|
||||
`DistributedDataParallel` 实现数据并行.
|
||||
|
||||
```shell
|
||||
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py
|
||||
```
|
||||
<!-- doc-test-command: echo -->
|
|
@ -1,345 +0,0 @@
|
|||
# 自动混合精度训练 (旧版本)
|
||||
|
||||
作者: Chuanrui Wang, Shenggui Li, Yongbin Li
|
||||
|
||||
**前置教程**
|
||||
- [定义配置文件](../basics/define_your_config.md)
|
||||
- [在训练中使用Engine和Trainer](../basics/engine_trainer.md)
|
||||
|
||||
**示例代码**
|
||||
- [ColossalAI-Examples AMP](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp)
|
||||
|
||||
**相关论文**
|
||||
- [Accelerating Scientific Computations with Mixed Precision Algorithms](https://arxiv.org/abs/0808.2794)
|
||||
|
||||
|
||||
## 引言
|
||||
|
||||
AMP 代表自动混合精度训练。
|
||||
在 Colossal-AI 中, 我们结合了混合精度训练的不同实现:
|
||||
|
||||
1. torch.cuda.amp
|
||||
2. apex.amp
|
||||
3. naive amp
|
||||
|
||||
|
||||
| Colossal-AI | 支持张量并行 | 支持流水并行 | fp16范围 |
|
||||
| ----------- | ----------------------- | ------------------------- | ----------- |
|
||||
| AMP_TYPE.TORCH | ✅ | ❌ | 在前向和反向传播期间,模型参数、激活和梯度向下转换至fp16 |
|
||||
| AMP_TYPE.APEX | ❌ | ❌ | 更细粒度,我们可以选择 opt_level O0, O1, O2, O3 |
|
||||
| AMP_TYPE.NAIVE | ✅ | ✅ | 模型参数、前向和反向操作,全都向下转换至fp16 |
|
||||
|
||||
前两个依赖于 PyTorch (1.6及以上) 和 NVIDIA Apex 的原始实现。最后一种方法类似 Apex O2。在这些方法中,Apex-AMP 与张量并行不兼容。这是因为张量是以张量并行的方式在设备之间拆分的,因此,需要在不同的进程之间进行通信,以检查整个模型权重中是否出现inf或nan。我们修改了torch amp实现,使其现在与张量并行兼容。
|
||||
|
||||
> ❌️ fp16与ZeRO配置不兼容
|
||||
>
|
||||
> ⚠️ 流水并行目前仅支持naive amp
|
||||
|
||||
我们建议使用 torch AMP,因为在不使用流水并行时,它通常比 NVIDIA AMP 提供更好的准确性。
|
||||
|
||||
## 目录
|
||||
|
||||
在本教程中,我们将介绍:
|
||||
|
||||
1. AMP 介绍
|
||||
2. Colossal-AI 中的 AMP
|
||||
3. 练习实例
|
||||
|
||||
## AMP 介绍
|
||||
|
||||
自动混合精度训练是混合 FP16 和 FP32 训练。
|
||||
|
||||
半精度浮点格式(FP16)具有较低的算法复杂度和较高的计算效率。此外,FP16 仅需要 FP32 所需的一半存储空间,并节省了内存和网络带宽,从而为大 batch size 和大模型提供了更多内存。
|
||||
|
||||
然而,还有其他操作,如缩减,需要 FP32 的动态范围,以避免数值溢出/下溢。因此,我们引入自动混合精度,尝试将每个操作与其相应的数据类型相匹配,这可以减少内存占用并提高训练效率。
|
||||
|
||||
<figure style={{textAlign: "center"}}>
|
||||
<img src="https://s2.loli.net/2022/01/28/URzLJ3MPeDQbtck.png"/>
|
||||
<figcaption>AMP 示意图 (图片来自 <a href="https://arxiv.org/abs/2108.05818">PatrickStar 论文</a>)</figcaption>
|
||||
</figure>
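
在介绍 Colossal-AI 的配置方式之前,先用原生 PyTorch 的 `torch.cuda.amp` 给出一个最小示意,帮助理解“为每个操作匹配合适的数据类型”与损失缩放(loss scaling)的含义:

```python
import torch

model = torch.nn.Linear(64, 64).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()        # 用 loss scaling 防止 fp16 梯度下溢

data = torch.randn(8, 64, device='cuda')
target = torch.randn(8, 64, device='cuda')

with torch.cuda.amp.autocast():             # 前向计算中自动为各算子选择 fp16 / fp32
    loss = torch.nn.functional.mse_loss(model(data), target)

scaler.scale(loss).backward()               # 对缩放后的 loss 反向传播
scaler.step(optimizer)                      # 反缩放梯度并更新参数
scaler.update()
```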
|
||||
|
||||
## Colossal-AI 中的 AMP
|
||||
|
||||
我们支持三种 AMP 训练方法,并允许用户在没有改变代码的情况下使用 AMP 进行训练。只需在配置文件中添加'fp16'配置即可使用 AMP。
|
||||
|
||||
```python
|
||||
from colossalai.amp import AMP_TYPE
|
||||
|
||||
# 使用 Torch AMP
|
||||
fp16=dict(
|
||||
mode = AMP_TYPE.TORCH
|
||||
)
|
||||
|
||||
# 使用 naive AMP
|
||||
fp16=dict(
|
||||
mode = AMP_TYPE.NAIVE
|
||||
)
|
||||
|
||||
# 使用 Nvidia Apex AMP
|
||||
fp16=dict(
|
||||
mode = AMP_TYPE.APEX
|
||||
)
|
||||
|
||||
```
|
||||
|
||||
> 这些是最低配置,完整配置将在后面的部分中说明
|
||||
|
||||
### AMP 模块化
|
||||
|
||||
AMP 模块设计为完全模块化,可以独立使用。如果你想在你的代码库中只使用 AMP 而不使用`colossalai.initialize`,你可以导入`colossalai.amp.convert_to_amp`。
|
||||
|
||||
```python
|
||||
from colossalai.amp import AMP_TYPE
|
||||
|
||||
# 使用torch amp的例子
|
||||
model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
|
||||
optimizer,
|
||||
criterion,
|
||||
AMP_TYPE.TORCH)
|
||||
```
|
||||
|
||||
### Torch AMP 配置
|
||||
|
||||
```python
|
||||
from colossalai.amp import AMP_TYPE
|
||||
|
||||
fp16=dict(
|
||||
mode=AMP_TYPE.TORCH,
|
||||
|
||||
# 下列是grad scaler的默认值
|
||||
init_scale=2.**16,
|
||||
growth_factor=2.0,
|
||||
backoff_factor=0.5,
|
||||
growth_interval=2000,
|
||||
enabled=True
|
||||
)
|
||||
```
|
||||
|
||||
可选参数:
|
||||
- init_scale(float, optional, default=2.**16): 初始缩放因子;
|
||||
- growth_factor(float, optional, default=2.0): 如果在``growth_interval``连续迭代过程中没有出现 inf/NaN 梯度,则在`update`中乘以比例系数;
|
||||
- backoff_factor(float, optional, default=0.5): 如果在迭代中出现 inf/NaN 梯度,则在`update`中乘以比例系数;
|
||||
- growth_interval(int, optional, default=2000): 在指定次数的连续迭代中,若没有出现 inf/NaN 梯度,则乘以``growth_factor``.
|
||||
- enabled(bool, optional, default=True): ``False``则使梯度缩放无效,`step` 仅调用底层的 ``optimizer.step()``, 其他方法成为空操作。
|
||||
|
||||
### Apex AMP 配置
|
||||
|
||||
对于这种模式,我们依靠 Apex 实现混合精度训练。我们支持这个插件,因为它允许对混合精度的粒度进行更精细的控制。
|
||||
例如, O2 水平 (优化器水平2) 将保持 batch normalization 为 FP32。
|
||||
|
||||
如果你想了解更多细节,请参考 [Apex Documentation](https://nvidia.github.io/apex/)。
|
||||
|
||||
```python
|
||||
from colossalai.amp import AMP_TYPE
|
||||
|
||||
fp16 = dict(
|
||||
mode=AMP_TYPE.APEX,
|
||||
|
||||
# 下列是默认值
|
||||
enabled=True,
|
||||
opt_level='O1',
|
||||
cast_model_type=None,
|
||||
patch_torch_functions=None,
|
||||
keep_batchnorm_fp32=None,
|
||||
master_weights=None,
|
||||
loss_scale=None,
|
||||
cast_model_outputs=None,
|
||||
num_losses=1,
|
||||
verbosity=1,
|
||||
min_loss_scale=None,
|
||||
max_loss_scale=16777216.0
|
||||
)
|
||||
```
|
||||
|
||||
参数:
|
||||
- enabled(bool, optional, default=True): False 会使所有 AMP 调用成为空操作, 程序将会像没有使用 AMP 一样运行。
|
||||
|
||||
- opt_level(str, optional, default="O1" ): 纯精度或混合精度优化水平。可选值 “O0”, “O1”, “O2”, and “O3”, 详细解释见上方 Apex AMP 文档。
|
||||
|
||||
- num_losses(int, optional, default=1): 选择提前告知 AMP 您计划使用多少次损失/反向计算。
|
||||
当`amp.scale_loss`与 loss_id 参数一起使用时,使 AMP 在每次损失/反向计算时使用不同的损失比例,这可以提高稳定性。如果 num_losses 被设置为1,AMP 仍支持多次损失/反向计算,但对他们都使用同一个全局损失比例。
|
||||
|
||||
- verbosity(int, default=1): 设置为0抑制 AMP 相关输出。
|
||||
|
||||
- min_loss_scale(float, default=None): 为动态损失比例(loss scale)可选择的值设置下限。默认值 None 意味着不设置任何下限。如果不使用动态损失比例,则忽略 min_loss_scale。
- max_loss_scale(float, default=2.**24): 为动态损失比例可选择的值设置上限。如果不使用动态损失比例,则忽略 max_loss_scale。
|
||||
|
||||
目前,管理纯精度或混合精度训练的幕后属性有以下几种:
|
||||
cast_model_type, patch_torch_functions, keep_batchnorm_fp32, master_weights, loss_scale.
|
||||
一旦 opt_level 被确定,它们是可选的可覆盖属性
|
||||
|
||||
- cast_model_type: 将模型的参数和缓冲区强制转换为所需的类型。
|
||||
- patch_torch_functions: 补全所有的 Torch 函数和张量方法,以便在FP16中执行张量核心友好的操作,如 GEMMs 和卷积,以及在 FP32 中执行任何受益于 FP32 精度的操作。
|
||||
- keep_batchnorm_fp32: 为了提高精度并启用 cudnn batchnorm (这会提高性能),在 FP32 中保留 batchnorm 权重通常是有益的,即使模型的其余部分是 FP16。
|
||||
- master_weights: 保持 FP32 主权重以配合任何 FP16 模型权重。 FP32 主权重由优化器分级,以提高精度和捕捉小梯度。
|
||||
- loss_scale: 如果 loss_scale 是一个浮点数,则使用这个值作为静态(固定)的损失比例。如果 loss_scale 是字符串 "dynamic",则随着时间的推移自适应地调整损失比例。动态损失比例调整由 AMP 自动执行。
|
||||
|
||||
|
||||
### Naive AMP 配置
|
||||
|
||||
在 Naive AMP 模式中, 我们实现了混合精度训练,同时保持了与复杂张量和流水并行的兼容性。该 AMP 模式将所有操作转为 FP16 。下列代码块展示了该模式的`config.py`。
|
||||
|
||||
```python
|
||||
from colossalai.amp import AMP_TYPE
|
||||
|
||||
fp16 = dict(
|
||||
mode=AMP_TYPE.NAIVE,
|
||||
|
||||
# below are the default values
|
||||
log_num_zeros_in_grad=False,
|
||||
initial_scale=2 ** 32,
|
||||
min_scale=1,
|
||||
growth_factor=2,
|
||||
backoff_factor=0.5,
|
||||
growth_interval=1000,
|
||||
hysteresis=2
|
||||
)
|
||||
```
|
||||
|
||||
Naive AMP 的默认参数:
|
||||
- log_num_zeros_in_grad(bool): 返回0值梯度的个数.
|
||||
- initial_scale(int): gradient scaler 的初始值
|
||||
- growth_factor(int): loss scale 的增长率
|
||||
- backoff_factor(float): loss scale 的下降率
|
||||
- hysteresis(int): 动态 loss scaling 的延迟偏移
|
||||
- max_scale(int): loss scale 的最大允许值
|
||||
- verbose(bool): 如果被设为`True`,将打印调试信息
|
||||
|
||||
当使用 `colossalai.initialize` 时,首先需要实例化一个模型、一个优化器和一个损失函数(criterion)。返回的模型会被转换为内存消耗更小的 AMP 模型。如果您的输入模型已经大到无法放入 GPU,请使用 `dtype=torch.float16` 来实例化您的模型;或者尝试更小的模型,或使用更多的并行化训练技术!
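
如果您的模型实现不接受 `dtype` 参数,也可以在构建后整体转换为半精度,效果类似(示意,这里用一个普通的 `torch.nn.Linear` 代替真实模型):

```python
import torch

# 构建后整体转为 fp16,参数与缓冲区都会变成 torch.float16
model = torch.nn.Linear(1024, 1024).half()
```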
|
||||
|
||||
## 实例
|
||||
|
||||
我们提供了一个 [运行实例](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/amp)
|
||||
展现如何在 Colossal-AI 使用 AMP。在该例程中,我们使用 Torch AMP, 但提供的配置文件也适用于所有 AMP 模式.
|
||||
|
||||
### 步骤 1. 创建配置文件
|
||||
|
||||
创建一个`config.py`文件并添加`fp16`配置.
|
||||
|
||||
```python
|
||||
# in config.py
|
||||
from colossalai.amp import AMP_TYPE
|
||||
|
||||
BATCH_SIZE = 128
|
||||
DROP_RATE = 0.1
|
||||
NUM_EPOCHS = 300
|
||||
|
||||
fp16 = dict(
|
||||
mode=AMP_TYPE.TORCH,
|
||||
)
|
||||
|
||||
clip_grad_norm = 1.0
|
||||
```
|
||||
|
||||
### 步骤 2. 在 train_with_engine.py 导入相关库
|
||||
|
||||
创建`train_with_engine.py`并导入必要依赖. 请记得通过命令`pip install timm scipy`安装`scipy`和`timm`。
|
||||
|
||||
```python
|
||||
import os
|
||||
import colossalai
|
||||
import torch
|
||||
from pathlib import Path
|
||||
from colossalai.core import global_context as gpc
|
||||
from colossalai.logging import get_dist_logger
|
||||
from colossalai.utils import get_dataloader
|
||||
from colossalai.legacy.trainer import Trainer, hooks
|
||||
from colossalai.nn.lr_scheduler import LinearWarmupLR
|
||||
from timm.models import vit_base_patch16_224
|
||||
from torchvision import datasets, transforms
|
||||
|
||||
```
|
||||
|
||||
### 步骤 3. 初始化分布式环境
|
||||
|
||||
我们需要初始化分布式环境。为了快速演示,我们使用`launch_from_torch`。你可以参考 [Launch Colossal-AI](../basics/launch_colossalai.md)
|
||||
使用其他初始化方法。
|
||||
|
||||
```python
|
||||
# 初始化分布式设置
|
||||
parser = colossalai.get_default_parser()
|
||||
args = parser.parse_args()
|
||||
|
||||
# launch from torch
|
||||
colossalai.launch_from_torch(config=args.config)
|
||||
|
||||
```
|
||||
|
||||
### 步骤 4. 创建训练组件
|
||||
|
||||
构建你的模型、优化器、损失函数、学习率调整器和数据加载器。注意数据集的路径从环境变量`DATA`获得。你可以通过 `export DATA=/path/to/data` 或 `Path(os.environ['DATA'])`
|
||||
在你的机器上设置路径。数据将会被自动下载到该路径。
|
||||
|
||||
```python
|
||||
# build model
|
||||
model = vit_base_patch16_224(drop_rate=0.1)
|
||||
|
||||
# build dataloader
|
||||
train_dataset = datasets.Caltech101(
|
||||
root=Path(os.environ['DATA']),
|
||||
download=True,
|
||||
transform=transforms.Compose([
|
||||
transforms.Resize(256),
|
||||
transforms.RandomResizedCrop(224),
|
||||
transforms.RandomHorizontalFlip(),
|
||||
transforms.ToTensor(),
|
||||
Gray2RGB(),
|
||||
transforms.Normalize([0.5, 0.5, 0.5],
|
||||
[0.5, 0.5, 0.5])
|
||||
]))
|
||||
|
||||
train_dataloader = get_dataloader(dataset=train_dataset,
|
||||
shuffle=True,
|
||||
batch_size=gpc.config.BATCH_SIZE,
|
||||
num_workers=1,
|
||||
pin_memory=True,
|
||||
)
|
||||
|
||||
# build optimizer
|
||||
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=0.1)
|
||||
|
||||
# build loss
|
||||
criterion = torch.nn.CrossEntropyLoss()
|
||||
|
||||
# lr_scheduler
|
||||
lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=gpc.config.NUM_EPOCHS)
|
||||
```
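
注意上面的 transform 中用到了 `Gray2RGB`(Caltech101 中包含灰度图,需要转成 3 通道)。本文没有给出它的定义,下面是一种可能的实现,仅供参考:

```python
class Gray2RGB:
    """把单通道灰度图张量复制为 3 通道(假设的实现,需放在 ToTensor 之后使用)。"""

    def __call__(self, img):
        if img.size(0) == 1:
            img = img.repeat(3, 1, 1)
        return img
```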
|
||||
|
||||
### 步骤 5. 插入 AMP
|
||||
|
||||
调用 `colossalai.initialize` 将所有训练组件转为 FP16 模式。
|
||||
|
||||
```python
|
||||
engine, train_dataloader, _, _ = colossalai.initialize(
|
||||
model, optimizer, criterion, train_dataloader,
|
||||
)
|
||||
```
|
||||
|
||||
### 步骤 6. 使用 Engine 训练
|
||||
|
||||
使用Engine构建一个普通的训练循环
|
||||
|
||||
```python
|
||||
engine.train()
|
||||
for epoch in range(gpc.config.NUM_EPOCHS):
|
||||
    for img, label in train_dataloader:
|
||||
img = img.cuda()
|
||||
label = label.cuda()
|
||||
engine.zero_grad()
|
||||
output = engine(img)
|
||||
loss = engine.criterion(output, label)
|
||||
engine.backward(loss)
|
||||
engine.step()
|
||||
lr_scheduler.step()
|
||||
```
|
||||
|
||||
### 步骤 7. 启动训练脚本
|
||||
|
||||
使用下列命令启动训练脚本,你可以改变 `--nproc_per_node` 以使用不同数量的 GPU。
|
||||
|
||||
```shell
|
||||
python -m torch.distributed.launch --nproc_per_node 4 --master_addr localhost --master_port 29500 train_with_engine.py --config config/config_AMP_torch.py
|
||||
```
|
||||
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 mixed_precision_training.py -->
|
|
@ -1,10 +1,9 @@
|
|||
# 自动混合精度训练 (新版本)
|
||||
# 自动混合精度训练
|
||||
|
||||
作者: [Mingyan Jiang](https://github.com/jiangmingyan)
|
||||
|
||||
**前置教程**
|
||||
|
||||
- [定义配置文件](../basics/define_your_config.md)
|
||||
- [booster 使用](../basics/booster_api.md)
|
||||
|
||||
**相关论文**
|
||||
|
@ -57,7 +56,7 @@ AMP 代表自动混合精度训练。
|
|||
|
||||
## Colossal-AI 中的 AMP
|
||||
|
||||
我们支持三种 AMP 训练方法,并允许用户在没有改变代码的情况下使用 AMP 进行训练。booster 支持 amp 特性注入,如果您要使用混合精度训练,则在创建 booster 实例时指定`mixed_precision`参数,我们现已支持 torch amp,apex amp, naive amp(现已移植 torch amp 至 booster,apex amp, naive amp 仍由`colossalai.initialize`方式启动,如您需使用,请[参考](./mixed_precision_training.md);后续将会拓展`bf16`,`pf8`的混合精度训练.
|
||||
我们支持三种 AMP 训练方法,并允许用户在没有改变代码的情况下使用 AMP 进行训练。booster 支持 amp 特性注入,如果您要使用混合精度训练,则在创建 booster 实例时指定`mixed_precision`参数;后续将会拓展`bf16`、`fp8`的混合精度训练。
|
||||
|
||||
#### booster 启动方式
|
||||
|
||||
|
|
|
@ -204,7 +204,7 @@ def main():
|
|||
|
||||
torch.cuda.synchronize()
|
||||
```
|
||||
> ⚠️ 注意:如果你使用Gemini模块的话,请不要使用我们之前提到过的[梯度累加](../features/gradient_accumulation.md)。
|
||||
> ⚠️ 注意:如果你使用Gemini模块的话,请不要使用我们之前提到过的[梯度累加](../features/gradient_accumulation_with_booster.md)。
|
||||
完整的例子代码可以在 [Train GPT with Colossal-AI](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt) 获得。
|
||||
|
||||
<!-- doc-test-command: torchrun --standalone --nproc_per_node=1 zero_with_chunk.py -->
|
||||
|
|