mirror of https://github.com/hpcaitech/ColossalAI
[hotfix] add copyright for solver and device mesh (#2803)
* [hotfix] add copyright for solver and device mesh
* add readme
* add alpa license
* polish

pull/2826/head
parent
dbd0fd1522
commit
2059fdd6b0
15
LICENSE
@ -200,3 +200,18 @@ Copyright 2021- HPC-AI Technology Inc. All rights reserved.
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

## Some of colossal-ai's code is derived from Alpa, which is subject to the following copyright notice:

Copyright 2021 The Alpa team.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://github.com/alpa-projects/alpa/blob/979a45a3e6187df941ef4a4c4c6eea664527d68d/LICENSE

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

@ -0,0 +1,23 @@

# Colossal-AUTO

## Challenges

Recently, large models have achieved state-of-the-art performance in various fields. Training them requires distributed techniques, yet finding an efficient distributed execution plan not only requires fine-grained model statistics, such as the memory and computing overhead of each operator, but is also a labor-intensive task even for an expert in distributed training.

## Our solution

To simplify distributed training for foundation models, recent advances in machine learning systems have led to the emergence of automatic parallel systems. We investigated a number of current automatic parallel systems (<a href="https://arxiv.org/abs/1807.08887">Tofu</a>, <a href="https://arxiv.org/abs/1807.05358">FlexFlow</a>, <a href="https://arxiv.org/abs/2201.12023">Alpa</a>) as well as automatic activation checkpointing algorithms (<a href="https://hal.inria.fr/hal-02352969">Rotor</a>, <a href="https://arxiv.org/abs/1604.06174">Sublinear</a>). Inspired by these systems, we built an automatic parallel system on top of the PyTorch framework. Its input is serial PyTorch code, and its output is a PyTorch program with an optimized distributed execution plan. It is worth emphasizing that the output is a regular PyTorch program, so it remains compatible with runtime optimization methods such as ZeRO-Offload and PatrickStar.
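The analyze / solve / generate flow can be sketched in plain Python. Everything below is illustrative only: the function names, the toy graph, and the cost numbers are invented for the sketch and are not the real Colossal-AI API.

```python
# Illustrative sketch of an automatic-parallel pipeline (hypothetical names,
# NOT the real Colossal-AI interface). The "program" stays an ordinary
# sequence of ops, only annotated with a distributed execution plan.

def analyze(graph):
    # Symbolic profiling: attach per-node compute/memory estimates.
    return {node: {"flops": 2 * n, "mem": n} for node, n in graph}

def solve(profile, mem_budget):
    # Toy solver: raise each node's sharding degree until its memory
    # footprint fits the budget.
    plan = {}
    for node, cost in profile.items():
        degree = 1
        while cost["mem"] // degree > mem_budget:
            degree *= 2
        plan[node] = degree
    return plan

def generate(graph, plan):
    # Recompile: output is still a regular (serial-looking) program,
    # each op paired with its chosen sharding degree.
    return [(node, plan[node]) for node, _ in graph]

graph = [("matmul1", 1024), ("relu", 64), ("matmul2", 2048)]
program = generate(graph, solve(analyze(graph), mem_budget=512))
```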

## Key modules

### Analyzer

**Analyzer** is a static analysis system consisting of three parts:
a *symbolic profiler* that collects the computing and memory overhead of the static computation graph, a *cluster detector* that collects hardware characteristics and detects the cluster topology, and a *tensor layout manager* that finds efficient tensor layout conversion paths between different sharding specs and records the conversion cost.
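The tensor layout manager's job can be modeled as a shortest-path search over sharding specs. The specs, the available conversions, and the edge costs below are assumptions made for this toy sketch, not the real manager's rules or profiled numbers.

```python
import heapq

# Toy sharding-spec conversion search (illustrative, not the real Colossal-AI
# tensor layout manager). "R" = replicated, "S0"/"S1" = sharded along dim 0/1.
# Each edge carries an assumed cost for the converting collective.
CONVERSIONS = {
    "S0": [("R", 4.0)],                 # all-gather along dim 0
    "S1": [("R", 4.0)],                 # all-gather along dim 1
    "R":  [("S0", 0.5), ("S1", 0.5)],   # local slicing is cheap
}

def conversion_path(src, dst):
    """Dijkstra over sharding specs; returns (total_cost, path)."""
    queue = [(0.0, src, [src])]
    seen = set()
    while queue:
        cost, spec, path = heapq.heappop(queue)
        if spec == dst:
            return cost, path
        if spec in seen:
            continue
        seen.add(spec)
        for nxt, c in CONVERSIONS.get(spec, []):
            heapq.heappush(queue, (cost + c, nxt, path + [nxt]))
    return float("inf"), []
```

For example, converting a dim-0-sharded tensor to dim-1-sharded goes through the replicated spec under these assumed edges.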

### Solver

**Solver** is designed to find the optimal execution plan for a given computation graph and cluster in two stages:
1) The *intra-op parallelism stage* finds the plan with the minimum total execution time over all nodes, subject to the memory-budget constraint. The optimization goal of the intra-op parallelism solver is modified from <a href="https://arxiv.org/abs/2201.12023">Alpa</a>'s intra-op parallelism ILP solver.
2) The *activation checkpoint stage* searches for the fastest execution plan that meets the memory budget on the computation graph after the intra-op parallelism stage has inserted its communication nodes. The algorithm for finding the optimal activation checkpoint is modified from <a href="https://hal.inria.fr/hal-02352969">Rotor</a>. We use two-stage optimization because formulating the two tasks together would significantly increase the solving time and hurt the user experience of the system. Solving at two hierarchical levels also has advantages: the original graph has fewer nodes than the graph with activation checkpointing, which reduces the solving cost of the intra-op parallelism solver, and a better solution can be found by adding the communication overhead into the activation checkpoint modeling.
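To make the intra-op stage concrete, here is a deliberately tiny stand-in: the real solver formulates an ILP following Alpa, while this sketch simply enumerates strategy combinations. The nodes, strategies, and (time, memory) costs are invented for illustration.

```python
from itertools import product

# Brute-force stand-in for the intra-op parallelism solver (the real system
# solves an ILP; the numbers below are made up). Each node offers several
# candidate strategies, each with an estimated (time, memory) cost.
NODES = {
    "matmul": [(10.0, 8), (6.0, 4), (4.0, 2)],   # more sharding: faster, less memory
    "linear": [(8.0, 6), (5.0, 3)],
}

def solve_intra_op(nodes, mem_budget):
    """Minimize total time subject to a total memory budget."""
    best = None
    names = list(nodes)
    for choice in product(*(range(len(nodes[n])) for n in names)):
        time = sum(nodes[n][i][0] for n, i in zip(names, choice))
        mem = sum(nodes[n][i][1] for n, i in zip(names, choice))
        if mem <= mem_budget and (best is None or time < best[0]):
            best = (time, dict(zip(names, choice)))
    return best
```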

### Generator

**Generator** applies the searched execution plan to the computation graph and recompiles it into optimized PyTorch code. It runs *a series of compile passes* that insert communication nodes or substitute kernels as required by the intra-op parallelism solver. Additionally, a *code generation* feature recognizes the annotations from the activation checkpoint solver and injects activation checkpoint blocks following those instructions.
@ -1,3 +1,7 @@
"""This code is adapted from Alpa
https://github.com/alpa-projects/alpa/
with some changes. """

import multiprocessing
import time
import warnings

@ -1,3 +1,7 @@
"""This code is adapted from Alpa
https://github.com/alpa-projects/alpa/
with some changes. """

import operator
from functools import reduce
from typing import List, Tuple

@ -6,13 +10,10 @@ import torch
import torch.distributed as dist


# modified from alpa LogicalDeviceMesh(https://github.com/alpa-projects/alpa/blob/main/alpa/shard_parallel/auto_sharding.py)
class DeviceMesh:
    """A logical view of a physical mesh. The logical view is used in the
    search process.
    A physical mesh can have multiple logical views. (e.g., a 2x8 physical mesh
    can be viewed as a 1x16 or a 4x4 logical mesh). Each mesh dimension has its
    own latency and bandwidth. We use alpha-beta model to model the
    communication cost.
    """A logical view of a physical cluster. For example, we could view a physical cluster
    with 16 devices as a device mesh with shape (2, 2, 4) or (4, 4).

    Arguments:
        physical_mesh_id (torch.Tensor): physical view of the devices in global rank.
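The docstring above mentions the alpha-beta model for per-dimension communication cost. A minimal sketch of that model follows; the constants and the ring all-gather formula are generic textbook assumptions, not Colossal-AI's profiled values or actual implementation.

```python
# Alpha-beta model sketch: time = alpha (link latency) + beta (per-byte
# transfer time) * bytes moved. Constants are placeholders, not measurements.
def allgather_cost(num_bytes, num_devices, alpha=1e-5, beta=1e-10):
    # A ring all-gather moves (n - 1) / n of the full tensor over each link.
    return alpha + beta * num_bytes * (num_devices - 1) / num_devices
```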
@ -37,9 +37,6 @@ Colossal-AI’s auto-parallelism searches for strategies in regard to each opera
## Distributed Tensor and Shape-Consistency System

The Colossal-AI system uses a device mesh, similar to PyTorch's latest DTensor release, to manage its cluster, and uses a sharding spec to annotate the storage status of each tensor and facilitate its distribution across the cluster. A shape-consistency manager automatically transforms tensors between different sharding specs, allowing seamless slicing and dicing of tensors: it ensures that the output of an upstream operand is consistently stored in the cluster regardless of how the input of the downstream operand is stored. This makes Colossal-AI highly versatile and easy to use, since users need not worry about the storage status of tensors when performing operations on them.
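The idea of the shape-consistency manager can be sketched as a tiny graph pass: wherever a producer's output sharding spec differs from what a consumer expects, an explicit conversion is inserted. All names and the data layout below are hypothetical, not the real Colossal-AI data structures.

```python
# Toy shape-consistency pass (illustrative names only). "R" = replicated,
# "S0" = sharded along dim 0. Whenever the producer's output spec differs
# from the consumer's expected input spec, insert a conversion node so
# downstream ops always see the layout they were compiled for.
def insert_conversions(edges, specs):
    """edges: list of (producer, consumer) pairs.
    specs: node -> (output_spec, expected_input_spec)."""
    program = []
    for prod, cons in edges:
        out_spec = specs[prod][0]
        in_spec = specs[cons][1]
        if out_spec != in_spec:
            program.append(("convert", prod, out_spec, in_spec))
        program.append(("edge", prod, cons))
    return program
```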
<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/auto_parallel/shape_consistency.png"/>
</figure>

Here are some key advantages of Colossal-AI compared to PyTorch DTensor:
Colossal-AI's device-mesh uses cluster performance metrics and profiling results to estimate the time consumption of different communication operators. This helps Colossal-AI optimize communication between nodes and improve overall system efficiency.
@ -11,7 +11,3 @@ Detailed instructions can be found in its `README.md`.
Colossal-Auto's automatic search function for activation checkpointing finds the most efficient checkpoint within a given memory budget, rather than just aiming for maximum memory compression. To avoid a lengthy search process for an optimal activation checkpoint, Colossal-Auto has implemented a two-stage search process. This allows the system to find a feasible distributed training solution in a reasonable amount of time while still benefiting from activation checkpointing for memory management. The integration of activation checkpointing in Colossal-AI improves the efficiency and effectiveness of large model training. You can follow the [Resnet example](https://github.com/hpcaitech/ColossalAI/tree/main/examples/tutorial/auto_parallel).
Detailed instructions can be found in its `README.md`.
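The trade-off behind checkpoint selection can be reduced to a toy knapsack: checkpointing a segment frees its activation memory at the cost of recompute time. The real algorithm is a Rotor-style dynamic program over the sequentialized graph; the segments and costs below are invented for this sketch.

```python
# Toy checkpoint selection (NOT the Rotor algorithm itself): choose which
# segments to checkpoint so the memory of kept activations fits the budget,
# while paying the least recompute time. Brute force over subsets.
def choose_checkpoints(segments, mem_budget):
    """segments: list of (keep_mem, recompute_time) per segment.
    Returns (extra_time, bitmask) where bit i set => segment i checkpointed."""
    best = (float("inf"), None)
    n = len(segments)
    for mask in range(1 << n):
        mem = sum(m for i, (m, _) in enumerate(segments) if not mask >> i & 1)
        extra = sum(t for i, (_, t) in enumerate(segments) if mask >> i & 1)
        if mem <= mem_budget and extra < best[0]:
            best = (extra, mask)
    return best
```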

<figure style={{textAlign: "center"}}>
<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/auto_parallel/auto_ckpt.jpg"/>
</figure>