ColossalAI/examples/language/bert

README.md

Overview

This directory has two parts: fine-tuning Hugging Face Bert and AlBert models with the Booster API, and benchmarking Bert and AlBert models with different Booster plugins.

Finetune

bash test_ci.sh

Bert-Finetune Results

| Plugin | Accuracy | F1-score | GPU number |
| --- | --- | --- | --- |
| torch_ddp | 84.4% | 88.6% | 2 |
| torch_ddp_fp16 | 84.7% | 88.8% | 2 |
| gemini | 84.0% | 88.4% | 2 |
| hybrid_parallel | 84.5% | 88.6% | 4 |
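The table reports accuracy and F1-score, but this README does not say how they are computed. As a rough illustration only, the sketch below computes both for a binary classification task, assuming label `1` is the positive class. The function name `accuracy_and_f1` is hypothetical and is not part of `finetune.py`.

```python
def accuracy_and_f1(preds, labels, positive=1):
    """Return (accuracy, F1) for binary predictions.

    Assumption: binary task where `positive` marks the positive class.
    """
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equally long")

    pairs = list(zip(preds, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == positive and y == positive for p, y in pairs)  # true positives
    fp = sum(p == positive and y != positive for p, y in pairs)  # false positives
    fn = sum(p != positive and y == positive for p, y in pairs)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return correct / len(labels), f1
```

F1 can sit above accuracy, as it does in every row of the table, because it ignores true negatives and rewards recall on the positive class.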

Benchmark

bash benchmark.sh

The benchmark currently reports these metrics: peak CUDA memory usage, throughput, and the number of model parameters. To track custom metrics, add them to benchmark_utils.py.
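For readers adding their own metric, here is a minimal sketch of a throughput measurement in samples per second. It is not the code in the benchmark scripts: `measure_throughput`, `step_fn`, and the `clock` parameter are hypothetical names used for illustration. On a GPU one would probably also call `torch.cuda.synchronize()` before each clock read, and could read peak memory with `torch.cuda.max_memory_allocated()`.

```python
import time


def measure_throughput(step_fn, batch_size, num_steps=10, warmup=2,
                       clock=time.perf_counter):
    """Return samples per second for `step_fn`, a zero-argument training step.

    Assumption: every call to `step_fn` processes exactly `batch_size` samples.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")

    # Warm-up steps are run but not timed (e.g. to exclude one-off setup cost).
    for _ in range(warmup):
        step_fn()

    start = clock()
    for _ in range(num_steps):
        step_fn()
    elapsed = clock() - start

    if elapsed <= 0:
        raise RuntimeError("elapsed time is not positive; increase num_steps")
    return batch_size * num_steps / elapsed
```

The clock is injectable so the helper can be checked with a fake timer; in normal use the default `time.perf_counter` is enough.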

Results

Bert

| Plugin | Max CUDA mem | Throughput (sample/s) | Params |
| --- | --- | --- | --- |
| ddp | 21.44 GB | 3.0 | 82M |
| ddp_fp16 | 16.26 GB | 11.3 | 82M |
| gemini | 11.0 GB | 12.9 | 82M |
| low_level_zero | 11.29 GB | 14.7 | 82M |

AlBert

| Plugin | Max CUDA mem | Throughput (sample/s) | Params |
| --- | --- | --- | --- |
| ddp | OOM | | |
| ddp_fp16 | OOM | | |
| gemini | 69.39 GB | 1.3 | 208M |
| low_level_zero | 56.89 GB | 1.4 | 208M |