# Distributed Training

Author: Shenggui Li, Siqi Mai

## What is a distributed system?

<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/sE5daHf2ohIy9wX.png"/>
<figcaption>Image source: <a href="https://towardsdatascience.com/distributed-training-in-the-cloud-cloud-machine-learning-engine-9e264ddde27f">Towards Data Science</a></figcaption>
</figure>

A distributed system consists of multiple software components that run on multiple machines. For example, a traditional
database runs on a single machine. As the amount of data grows extremely large, a single machine can no longer deliver acceptable
performance, especially in situations such as Black Friday when network traffic can be unexpectedly high.
To handle such pressure, modern high-performance databases are designed to run on multiple machines that work together to provide
high throughput and low latency to users.

One important evaluation metric for a distributed system is scalability. For example, when we run an application on 4 machines,
we naturally expect it to run 4 times faster. However, due to communication overhead and differences in
hardware performance, it is difficult to achieve linear speedup. Thus, it is important to consider how to make the application
faster when we implement it. Well-designed algorithms and system optimizations can help deliver good performance. Sometimes,
it is even possible to achieve linear or super-linear speedup.

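As a rough, illustrative example (the timings below are hypothetical, not measurements from this text), speedup is commonly computed as the single-device time divided by the multi-device time, and scaling efficiency as the speedup divided by the number of devices:

```python
# Hypothetical timings for one training epoch (illustration only).
t_1_gpu = 400.0   # seconds per epoch on 1 GPU
t_4_gpu = 110.0   # seconds per epoch on 4 GPUs

speedup = t_1_gpu / t_4_gpu   # ~3.6x, below the ideal 4x linear speedup
efficiency = speedup / 4      # ~0.91, i.e. about 91% scaling efficiency
print(f"speedup = {speedup:.2f}x, efficiency = {efficiency:.0%}")
```
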
## Why do we need distributed training for machine learning?

Back in 2012, [AlexNet](https://arxiv.org/abs/1404.5997) won the ImageNet competition, and it was trained
on two GTX 580 3GB GPUs.
Today, most models that appear at top AI conferences are trained on multiple GPUs. Distributed training is definitely
common practice when researchers and engineers develop AI models. There are several reasons behind this trend.

1. Model size increases rapidly. [ResNet50](https://arxiv.org/abs/1512.03385) had 20 million parameters in 2015,
[BERT-Large](https://arxiv.org/abs/1810.04805) had 345 million parameters in 2018,
[GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
had 1.5 billion parameters in 2019, and [GPT-3](https://arxiv.org/abs/2005.14165) had 175 billion parameters in 2020.
It is obvious that model size grows exponentially with time. The largest current models have exceeded 1000
billion parameters. Super-large models generally deliver superior performance compared to their smaller counterparts.

<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/sCyreJ9PF1EdZYf.jpg"/>
<figcaption>Image source: <a href="https://huggingface.co/blog/large-language-models">HuggingFace</a></figcaption>
</figure>

2. Dataset size increases rapidly. For most machine learning developers, MNIST and CIFAR10 are often the first few
datasets on which they train their models. However, these datasets are very small compared to the well-known ImageNet dataset.
Google even has its own (unpublished) JFT-300M dataset, which has around 300 million images, close to 300 times
larger than the ImageNet-1k dataset.

3. Computing power gets stronger. With the advancement of the semiconductor industry, graphics cards become more and more
powerful. Due to their large number of cores, GPUs are the most common compute platform for deep learning.
From the K10 GPU in 2012 to the A100 GPU in 2020, computing power has increased several hundred times. This allows us to perform
compute-intensive tasks faster, and deep learning is exactly such a task.

Nowadays, a model can be too large to fit into a single GPU, and a dataset can be large enough to take a hundred
days to train on a single GPU. Only by training our models on multiple GPUs with different parallelization techniques
can we speed up the training process and obtain results in a reasonable amount of time.

## Basic Concepts in Distributed Training

Distributed training requires multiple machines/GPUs. During training, there will be communication among these devices.
To understand distributed training better, there are several important terms that need to be made clear; the short sketch
after the list below shows how they typically appear when initializing the distributed environment.

- host: the host is the main device in the communication network. It is often required as an argument when initializing the
distributed environment.
- port: port here mainly refers to the master port on the host used for communication.
- rank: the unique ID given to a device in the network.
- world size: the number of devices in the network.
- process group: a process group is a communication network that includes a subset of the devices. There is always a default
process group which contains all the devices. A subset of the devices can form a process group so that they only communicate among
the devices within the group.

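Below is a minimal, illustrative sketch of how these terms map onto an initialization call with PyTorch's `torch.distributed`
(which ColossalAI builds on). The concrete address, port, and environment-variable handling are assumptions made for the sake
of the example; in practice a launcher usually sets them for you.

```python
import os
import torch.distributed as dist

# host (master address) and port (master port): where all processes rendezvous.
# The IP below is a made-up example.
os.environ.setdefault("MASTER_ADDR", "10.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# rank and world size are usually injected by the launcher via environment variables.
rank = int(os.environ["RANK"])              # unique ID of this process
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes

# This creates the default process group containing all devices.
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

print(f"rank {dist.get_rank()} out of world size {dist.get_world_size()}")
```
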
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/qnNBKh8AjzgM5sY.png"/>
<figcaption>A distributed system example</figcaption>
</figure>

To illustrate these concepts, let's assume we have 2 machines (also called nodes), and each machine has 4 GPUs. When we
initialize the distributed environment over these two machines, we essentially launch 8 processes (4 processes on each machine),
and each process is bound to a GPU.

Before initializing the distributed environment, we need to specify the host (master address) and the port (master port). In
this example, we can let the host be node 0 and the port be a number such as 29500. All 8 processes will then look for this
address and port and connect to one another.
The default process group will then be created. The default process group has a world size of 8, and its details are as follows
(see the sketch after the table for how each rank maps to a node and a GPU):

| process ID | rank | Node index | GPU index |
| ---------- | ---- | ---------- | --------- |
| 0          | 0    | 0          | 0         |
| 1          | 1    | 0          | 1         |
| 2          | 2    | 0          | 2         |
| 3          | 3    | 0          | 3         |
| 4          | 4    | 1          | 0         |
| 5          | 5    | 1          | 1         |
| 6          | 6    | 1          | 2         |
| 7          | 7    | 1          | 3         |

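As an aside, the node index and GPU index in the table can be derived from the rank. The mapping below is an assumption that
matches this 2-node, 4-GPU-per-node example (not a general rule), and it assumes the default process group has already been
initialized:

```python
import torch
import torch.distributed as dist

GPUS_PER_NODE = 4  # matches the example above

# Assumes dist.init_process_group(...) has already been called.
rank = dist.get_rank()              # global rank in the default process group, 0..7
node_index = rank // GPUS_PER_NODE  # 0 for ranks 0-3, 1 for ranks 4-7
gpu_index = rank % GPUS_PER_NODE    # local GPU index on that node

# Bind this process to its GPU so later CUDA operations run on the right device.
torch.cuda.set_device(gpu_index)
```
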
We can also create a new process group. This new process group can contain any subset of the processes.
For example, we can create one containing only the even-numbered processes, and the details of this new group will be:

| process ID | rank | Node index | GPU index |
| ---------- | ---- | ---------- | --------- |
| 0          | 0    | 0          | 0         |
| 2          | 1    | 0          | 2         |
| 4          | 2    | 1          | 0         |
| 6          | 3    | 1          | 2         |

**Please note that the rank is relative to the process group, and a process can have different ranks in different process
groups. The maximum rank is always `world size of the process group - 1`.**

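A minimal sketch of creating such a subgroup with PyTorch's `torch.distributed` (assuming the default group of 8 processes above
has already been initialized) could look like this; note that `new_group` must be called by every process, even those not in the
subgroup:

```python
import torch.distributed as dist

# Every process in the default group must call new_group with the same ranks.
even_group = dist.new_group(ranks=[0, 2, 4, 6])

global_rank = dist.get_rank()  # rank in the default process group, 0..7
if global_rank in (0, 2, 4, 6):
    # The rank inside the subgroup differs from the global rank.
    group_rank = dist.get_rank(group=even_group)  # 0..3
    print(f"global rank {global_rank} -> rank {group_rank} in the even group")
```
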
In a process group, the processes can communicate in two ways, as the sketch after this list illustrates:
1. peer-to-peer: one process sends data to another process.
2. collective: a group of processes perform operations such as scatter, gather, all-reduce, and broadcast together.

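The sketch below illustrates both patterns with `torch.distributed`; it assumes a default process group with at least two
processes has been initialized with the nccl backend (recent versions support point-to-point send/recv on GPU tensors) and
that each process is already bound to its own GPU:

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
tensor = torch.ones(1, device="cuda") * rank  # each process holds a different value

# Peer-to-peer: rank 0 sends its tensor to rank 1, which receives it.
if rank == 0:
    dist.send(tensor, dst=1)
elif rank == 1:
    dist.recv(tensor, src=0)

# Collective: every process contributes its tensor and receives the sum.
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
```
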
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/zTmlxgc3oeAdn97.png"/>
<figcaption>Collective communication, source: <a href="https://pytorch.org/tutorials/intermediate/dist_tuto.html">PyTorch distributed tutorial</a></figcaption>
</figure>