mirror of https://github.com/hpcaitech/ColossalAI
[elixir] updated readme (#3944)
parent 1ee247a51c
commit 3b58ff5c73
# ⚡️ Elixir (Gemini2.0)

## 📚 Table of Contents

- [⚡️ Elixir (Gemini2.0)](#️-elixir-gemini20)
  - [📚 Table of Contents](#-table-of-contents)
  - [🔗 Introduction](#-introduction)
  - [💡 Design and Implementation](#-design-and-implementation)
  - [🔨 API Usage](#-api-usage)
    - [General Usage](#general-usage)
    - [Advanced Usage](#advanced-usage)

## 🔗 Introduction

Elixir, also known as Gemini 2.0, is a distributed training technique designed to facilitate large-scale model training on a small GPU cluster.
Its goal is to eliminate data redundancy and leverage CPU memory to accommodate really large models.
Elixir automatically profiles each training step before execution and selects the optimal configuration of memory redundancy (tensor sharding) and device placement (tensor offloading) for each parameter.

Please note the following before you try this feature:

- **This feature is still in its experimental stage and the API is subject to future changes.**
- **We have only tested this feature with PyTorch 1.13.**

## 💡 Design and Implementation

Existing methods such as DeepSpeed and FSDP often lead to suboptimal efficiency because of the large number of hyperparameters to tune; only experienced experts can unleash the full potential of the hardware by carefully tuning the distributed configuration.
We therefore present a novel solution, Elixir, which automates efficient large-model training based on pre-runtime model profiling.
Elixir aims to identify the optimal combination of partitioning and offloading techniques to maximize training throughput.

Some contributions of Elixir are listed below:

- We build a pre-runtime profiler designed for large models. It is capable of obtaining the computation graph and the memory usage of the model before training. We bring this powerful tool to support large model profiling.
- We introduce rCache to control the degree of memory redundancy. Moreover, we build a search engine to find the optimal configuration, maximizing training efficiency automatically. Different from previous works, our optimal configuration considers both memory partitioning and memory offloading.
- We conduct evaluations on a large scale by testing various model sizes, GPU capacities, numbers of GPUs, and batch sizes. When compared to current SOTA solutions, we observe that Elixir achieves up to 3.4× acceleration without manual tuning.

You can find more details about this system in our paper [Elixir: Train a Large Language Model on a Small GPU Cluster](https://arxiv.org/abs/2212.05339).
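
To make the search idea above concrete, here is a minimal conceptual sketch of pre-runtime configuration search: enumerate candidate redundancy ratios and offloading choices, estimate peak memory and step time from profiled statistics, and keep the fastest feasible plan. This is only an illustration under simplified assumptions; every name in it is hypothetical and it is not Elixir's actual implementation.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Plan:
    redundancy_ratio: float   # fraction of parameters kept replicated across GPUs
    offload_to_cpu: bool      # whether parameters/optimizer states spill to CPU memory
    est_step_time: float      # estimated seconds per training step


def search_plan(profiled_peak_mem, gpu_mem_budget, estimate):
    """Pick the fastest hypothetical plan whose estimated peak memory fits the GPU budget.

    `estimate(profiled_peak_mem, ratio, offload)` stands in for a cost model built from
    pre-runtime profiling; it returns (estimated_peak_mem, estimated_step_time).
    """
    best = None
    for ratio, offload in product([0.0, 0.25, 0.5, 1.0], [False, True]):
        mem, step_time = estimate(profiled_peak_mem, ratio, offload)
        if mem <= gpu_mem_budget and (best is None or step_time < best.est_step_time):
            best = Plan(ratio, offload, step_time)
    return best
```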
## 🔨 API Usage

Below are the APIs of the Elixir module. These APIs are experimental and subject to future changes.

### General Usage

```python
import torch
import torch.distributed as dist
import transformers

from colossalai.elixir import ElixirModule, ElixirOptimizer, minimum_waste_search

# initialize your distributed backend
...

# create your model and optimizer
model = transformers.BertForSequenceClassification.from_pretrained('bert-base-uncased')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, eps=1e-8)

# search for a configuration
world_group = dist.GroupMember.WORLD
world_size = dist.get_world_size()
search_result = minimum_waste_search(model, world_size)

# wrap the model and optimizer
model = ElixirModule(model, search_result, world_group)
optimizer = ElixirOptimizer(model, optimizer)
```
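
The `# initialize your distributed backend` placeholder can be filled in whichever way you normally launch distributed PyTorch; ColossalAI also provides its own launch helpers (e.g. `colossalai.launch_from_torch`). As one hedged example, assuming the script is started with `torchrun` and NCCL is available, the bare PyTorch route would look like this:

```python
import os

import torch
import torch.distributed as dist

# One possible backend initialization when the script is launched with torchrun
# (an assumption, not a requirement of Elixir): create the default process group
# over NCCL and bind this rank to its local GPU.
dist.init_process_group(backend='nccl')
torch.cuda.set_device(int(os.environ.get('LOCAL_RANK', 0)))
```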

### Advanced Usage

Here is a more advanced example tuned for performance, which is used in our [benchmark](https://github.com/hpcaitech/Elixir/blob/main/example/common/elx.py).

```python
import torch
import torch.distributed as dist

from colossalai.elixir import ElixirModule, ElixirOptimizer, optimal_search
from colossalai.nn.optimizer import HybridAdam

# initialize your distributed backend
...

# get the world communication group
global_group = dist.GroupMember.WORLD
# get the communication world size
global_size = dist.get_world_size()

# initialize the model in CPU memory (get_model / model_name come from the surrounding benchmark script)
model = get_model(model_name)

# HybridAdam allows some parameters to be updated on the CPU and others on the GPU
optimizer = HybridAdam(model.parameters(), lr=1e-3)

# search for the optimal configuration (the leading arguments of this call are collapsed in this diff view)
sr = optimal_search(
    ...,
    inp=data,             # provide an example input batch in dictionary format
    step_fn=train_step    # provide an example step function
)

# wrap your model with ElixirModule and your optimizer with ElixirOptimizer
model = ElixirModule(
    model,
    sr,
    ...                   # further arguments are collapsed in this diff view
)

optimizer = ElixirOptimizer(
    model,
    optimizer,
    initial_scale=1024,   # loss scale used in AMP
    init_step=True        # enable for the stability of training
)
```
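
`data` and `train_step` are defined elsewhere in the benchmark script and are not shown in this diff. Judging only from the comments on the `optimal_search` call (`inp` is an example input in dictionary format, `step_fn` is an example step function), they could look roughly like the sketch below for a BERT-style classification model; the keys, shapes, loss convention, and the `(model, inp)` signature are all illustrative assumptions, not the benchmark's actual code.

```python
import torch

# Hypothetical example batch passed to optimal_search as `inp=data`.
# The keys must match the model's forward() signature; the tensors are dummies.
data = dict(
    input_ids=torch.randint(0, 30522, (2, 128), dtype=torch.long),
    attention_mask=torch.ones(2, 128, dtype=torch.long),
    labels=torch.randint(0, 2, (2,), dtype=torch.long),
)

# Hypothetical step function passed as `step_fn=train_step`: one forward/backward
# pass, enough for the pre-runtime profiler to observe the memory footprint.
def train_step(model, inp):
    loss = model(**inp).loss
    loss.backward()
```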

The commit also updates the package's public exports:
from .search import minimum_waste_search, optimal_search
from .wrapper import ElixirModule, ElixirOptimizer

__all__ = ['ElixirModule', 'ElixirOptimizer', 'minimum_waste_search', 'optimal_search']
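
These exports are what make the top-level imports in the examples above (`from colossalai.elixir import ElixirModule, ElixirOptimizer, minimum_waste_search, optimal_search`) resolve, assuming this block is the `elixir` package's `__init__.py`.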