ColossalAI/examples/language/llama
Directory contents: scripts/, README.md, benchmark.py, data_utils.py, model_utils.py, performance_evaluator.py, requirements.txt, test_ci.sh

README.md

Pretraining LLaMA-1/2/3: best practices for building LLaMA-1/2/3-like base models

LLaMA3

  • 70-billion-parameter LLaMA3 model training accelerated by 18%

LLaMA2

  • 70-billion-parameter LLaMA2 model training accelerated by 195% [blog]

LLaMA1

  • 65-billion-parameter large model pretraining accelerated by 38% [blog]

Usage

⚠ This example only contains a benchmarking script. For training/finetuning, please refer to applications/Colossal-LLaMA.

1. Installation

Please install the latest ColossalAI from source.

BUILD_EXT=1 pip install -U git+https://github.com/hpcaitech/ColossalAI

Then install other dependencies.

pip install -r requirements.txt
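
To verify the installation, you can import the package and, if your version ships the CLI checker, run it as well. This is an optional sanity check, and the check command may differ across ColossalAI versions:

python -c "import colossalai; print(colossalai.__version__)"  # should print the installed version
colossalai check -i  # optional: reports environment and CUDA extension status if supported by your version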

2. Shell Script Examples

For your convenience, we provide shell scripts to run benchmarks with various configurations.

You can find them in the scripts/benchmark_7B and scripts/benchmark_70B directories. The main command follows this format:

colossalai run --nproc_per_node YOUR_GPU_PER_NODE --hostfile YOUR_HOST_FILE \
benchmark.py --OTHER_CONFIGURATIONS

Below is an example of running LLaMA pretraining with the Gemini plugin, batch_size=16, sequence_length=4096, gradient_checkpoint=True, and flash_attn=True.

a. Running environment

This experiment was performed on 4 compute nodes with 32 A800/H800 80GB GPUs in total (for LLaMA-1 65B or LLaMA-2 70B). The nodes are connected with RDMA, and the GPUs within each node are fully connected with NVLink.
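
You can confirm that your cluster provides a similar topology before launching. The two commands below are standard tools (nvidia-smi ships with the NVIDIA driver; ibstat comes with the InfiniBand diagnostics package) and are only a suggested sanity check, not part of the example scripts:

nvidia-smi topo -m   # intra-node view: NV# entries indicate NVLink between GPU pairs
ibstat               # inter-node view: lists RDMA-capable HCAs and their link state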

b. Running command

cd scripts/benchmark_7B

First, put your host file (hosts.txt) in this directory, listing your real host IPs or host names.

Here is a sample hosts.txt:

hostname1
hostname2
hostname3
hostname4

Then add environment variables to the script if needed.
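
For example, on an RDMA cluster you might export NCCL settings near the top of the script. The variable names below are standard NCCL options, but the values are placeholders that must match your own NIC/HCA names:

export NCCL_SOCKET_IFNAME=eth0   # placeholder: interface used for NCCL bootstrap/TCP traffic
export NCCL_IB_HCA=mlx5          # placeholder: InfiniBand HCAs used for RDMA traffic
export NCCL_DEBUG=WARN           # set to INFO when debugging communication issues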

Finally, run the following command to start training:

bash gemini.sh
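
gemini.sh is essentially a thin wrapper around the colossalai run command format shown above. The sketch below assumes 8 GPUs per node; the benchmark.py flag names are assumptions inferred from the configuration listed earlier (plugin, batch size, sequence length, gradient checkpointing, flash attention), so check the script itself or benchmark.py --help for the exact interface:

# Sketch only: flag names are assumptions, not the verified benchmark.py interface.
colossalai run --nproc_per_node 8 --hostfile hosts.txt \
    benchmark.py --plugin gemini --batch_size 16 --max_length 4096 \
    --grad_checkpoint --xformers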

If you encounter an out-of-memory (OOM) error during training with gemini.sh, switching to gemini_auto.sh may help, since gemini_auto sets an upper limit on GPU memory usage by offloading part of the model parameters and optimizer states to CPU memory. The trade-off is that gemini_auto.sh is a bit slower, since more data is transferred between CPU and GPU.
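
To see whether the memory cap is taking effect (or how close the default gemini.sh run gets to the 80 GB limit), you can poll per-GPU memory from another shell while the benchmark runs. This is only a monitoring suggestion, not part of the provided scripts:

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5   # refresh every 5 seconds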

c. Results

If you run the above command successfully, you will get the following results: max memory usage: 55491.10 MB, throughput: 24.26 samples/s, TFLOPS/GPU: 167.43.

Reference

@article{bian2021colossal,
  title={Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training},
  author={Bian, Zhengda and Liu, Hongxin and Wang, Boxiang and Huang, Haichen and Li, Yongbin and Wang, Chuanrui and Cui, Fan and You, Yang},
  journal={arXiv preprint arXiv:2110.14883},
  year={2021}
}
@software{openlm2023openllama,
  author = {Geng, Xinyang and Liu, Hao},
  title = {OpenLLaMA: An Open Reproduction of LLaMA},
  month = May,
  year = 2023,
  url = {https://github.com/openlm-research/open_llama}
}
@software{together2023redpajama,
  author = {Together Computer},
  title = {RedPajama-Data: An Open Source Recipe to Reproduce LLaMA training dataset},
  month = April,
  year = 2023,
  url = {https://github.com/togethercomputer/RedPajama-Data}
}
@article{touvron2023llama,
  title={Llama: Open and efficient foundation language models},
  author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{\'e}e and Rozi{\`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and others},
  journal={arXiv preprint arXiv:2302.13971},
  year={2023}
}