Making large AI models cheaper, faster and more accessible

Train GPT with Colossal-AI

This example shows how to use Colossal-AI to run distributed training of a Hugging Face GPT model.

GPT

We use the GPT-2 model from Hugging Face transformers. The key learning goal of GPT-2 is to use an unsupervised pre-trained model to perform supervised tasks. GPT-2 performs strongly in text generation, and the generated text exceeds people's expectations in terms of contextual coherence and emotional expression.

Requirements

Before you can launch training, you need to install the following requirements.

Install PyTorch

#conda
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch
#pip
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113

Install Colossal-AI

Install requirements

pip install -r requirements.txt

This is just an example in which we install PyTorch 1.12.0 with CUDA 11.3 and Colossal-AI. You can install another version of PyTorch and the corresponding Colossal-AI version. Just make sure that Colossal-AI is at least 0.1.10, PyTorch is at least 1.8.1, and transformers is at least 4.23.1. To test ZeRO1 and ZeRO2 in Colossal-AI, you need Colossal-AI>=0.1.12.
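The minimum versions above can be checked programmatically. The following is a minimal sketch, not part of this example's scripts: the helper names `parse_version` and `meets_minimum` and the `MINIMUMS` table are hypothetical, and only the first three numeric components of a version string are compared.

```python
import re

# Minimum versions quoted in this README (hypothetical lookup table).
MINIMUMS = {"colossalai": "0.1.10", "torch": "1.8.1"}
# Minimum Colossal-AI version quoted for testing ZeRO1 and ZeRO2.
ZERO_MINIMUM = "0.1.12"


def parse_version(text):
    """Turn '1.12.0+cu113' into (1, 12, 0), ignoring any local '+...' suffix."""
    public = text.split("+", 1)[0]
    numbers = [int(part) for part in re.findall(r"\d+", public)][:3]
    # Pad short versions such as '1.12' so tuples compare component by component.
    numbers += [0] * (3 - len(numbers))
    return tuple(numbers)


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)
```

To check an actual environment, the installed version string can be read with `importlib.metadata.version("torch")` (Python 3.8+) and passed to `meets_minimum` together with `MINIMUMS["torch"]`.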

Dataset

For simplicity, the input data is randomly generated here.
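As an illustration of what randomly generated input for a GPT-style model can look like, here is a minimal pure-Python sketch. The function name `random_batch`, its parameters, and the use of nested lists are assumptions made for this illustration; a training script would typically build the same batch as tensors on the target device.

```python
import random

GPT2_VOCAB_SIZE = 50257  # size of the GPT-2 tokenizer vocabulary


def random_batch(batch_size, seq_len, vocab_size=GPT2_VOCAB_SIZE, seed=None):
    """Return (input_ids, attention_mask) filled with random token ids.

    Hypothetical helper: every position holds a uniformly drawn token id,
    and the mask is all ones because no position is padding.
    """
    rng = random.Random(seed)
    input_ids = [
        [rng.randrange(vocab_size) for _ in range(seq_len)]
        for _ in range(batch_size)
    ]
    attention_mask = [[1] * seq_len for _ in range(batch_size)]
    return input_ids, attention_mask
```

For causal language modeling, the same `input_ids` can also serve as the labels, since the model is trained to predict each next token of its own input.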

Training

We provide two stable solutions. One uses Gemini to implement a hybrid parallel strategy combining Gemini, DDP/ZeRO, and Tensor Parallelism for a Hugging Face GPT model. The other uses Titans, a model zoo for distributed execution maintained by Colossal-AI, to implement a hybrid parallel strategy of TP + ZeRO + PP.

We recommend using Gemini to quickly run your model in a distributed manner. It doesn't require significant changes to the model structure, so you can apply it to a new model easily. Use Titans as a more advanced option when you want to push performance further. Titans already includes some typical models, such as ViT and GPT. However, it takes some effort to get started with a new model structure.

Gemini DDP/ZeRO + Tensor Parallelism

bash run_gemini.sh

train_gpt_demo.py provides three distributed plans in addition to those already provided by PyTorch; you can choose the plan you want in run_gemini.sh. CAI_Gemini leverages Tensor Parallelism and Gemini + ZeRO DDP. For the differences between the plans, you may check out the answer to the issue here.

ZeRO1 (CAI_ZeRO1)
ZeRO2 (CAI_ZeRO2)
Gemini + ZeRO DDP (CAI_Gemini)
PyTorch DDP (Pytorch_DDP)
PyTorch ZeRO (Pytorch_ZeRO)
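Selecting one of the plans listed above by name can be sketched with a small argument parser. This is an illustration only: the flag name `--distplan`, the default value, and the helper `uses_colossalai` are assumptions, not a description of how train_gpt_demo.py or run_gemini.sh are actually written. Only the plan identifiers come from the list above.

```python
import argparse

# Plan identifiers exactly as listed in this README.
COLOSSALAI_PLANS = ("CAI_ZeRO1", "CAI_ZeRO2", "CAI_Gemini")
PYTORCH_PLANS = ("Pytorch_DDP", "Pytorch_ZeRO")


def build_parser():
    """Build a parser that accepts only the known plan names."""
    parser = argparse.ArgumentParser(description="Select a distributed plan")
    parser.add_argument(
        "--distplan",  # assumed flag name
        choices=COLOSSALAI_PLANS + PYTORCH_PLANS,
        default="CAI_Gemini",  # assumed default
        help="distributed plan to run",
    )
    return parser


def uses_colossalai(plan):
    """Return True for the plans implemented by Colossal-AI rather than PyTorch."""
    return plan in COLOSSALAI_PLANS
```

Restricting the flag with `choices` makes a misspelled plan name fail at startup instead of partway through a distributed launch.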

Titans (Tensor Parallelism) + ZeRO + Pipeline Parallelism

Titans provides a customized GPT model, which uses distributed operators as building blocks. In [./titans/README.md], we provide an example of hybrid parallelism that combines ZeRO, TP and PP. You can switch parallel strategies using a config file.
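When combining TP and PP on a fixed number of GPUs, the remaining degree of data parallelism follows from simple arithmetic. The sketch below assumes the common convention that the world size equals the product of the data, tensor, and pipeline parallel sizes; the function name `data_parallel_size` is hypothetical and the Titans config format itself is not shown here.

```python
def data_parallel_size(world_size, tp_size=1, pp_size=1):
    """Return the data-parallel degree left after tensor and pipeline parallelism.

    Assumes world_size == dp_size * tp_size * pp_size.
    """
    if tp_size < 1 or pp_size < 1:
        raise ValueError("tp_size and pp_size must be positive")
    model_parallel = tp_size * pp_size
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tp_size * pp_size = {model_parallel}"
        )
    return world_size // model_parallel
```

For example, on 8 GPUs with `tp_size=2` and `pp_size=2`, two data-parallel replicas remain, and ZeRO partitions its state across that data-parallel group.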

Hybridparallelism

Hybridparallelism provides a user-friendly plugin for configuring multiple parallelism methods for training and inference. In [./hybridparallelism], we provide an example of fine-tuning GPT-2 using Hybridparallelism.

Quick run

cd ./hybridparallelism
bash run.sh

Performance

Testbed: a cluster of 8xA100 (80GB) and 1xAMD EPYC 7543 32-Core Processor (512 GB). GPUs are connected via PCI-e. ColossalAI version 0.1.13.