ColossalAI


Overview

This directory contains two parts: finetuning Hugging Face BERT and ALBERT models with the Booster API, and benchmarking BERT and ALBERT models with different Booster plugins.

Finetune

bash test_ci.sh

Results on 2 GPUs

Plugin	Accuracy	F1-score
torch_ddp	84.4%	88.6%
torch_ddp_fp16	84.7%	88.8%
gemini	84.0%	88.4%
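The finetuning results report both accuracy and F1-score. To show how the two metrics differ, here is a minimal pure-Python sketch for binary classification. It is not the code in finetune.py. The function name `accuracy_and_f1` and the choice of label 1 as the positive class are assumptions for illustration.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(preds: Sequence[int], labels: Sequence[int], positive: int = 1) -> Tuple[float, float]:
    """Hypothetical helper: accuracy and binary F1 w.r.t. the `positive` label."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)  # true positives
    fp = sum(1 for p, y in pairs if p == positive and y != positive)  # false positives
    fn = sum(1 for p, y in pairs if p != positive and y == positive)  # false negatives
    accuracy = sum(1 for p, y in pairs if p == y) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


acc, f1 = accuracy_and_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(f"accuracy={acc:.3f} f1={f1:.3f}")
```

In this toy input, 3 of 5 predictions are correct (accuracy 0.6), while precision and recall on the positive class are both 2/3, so F1 is about 0.667. F1 ignores true negatives, which is why it can move independently of accuracy.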

Benchmark

bash benchmark.sh

The benchmark currently reports three metrics: peak CUDA memory usage, throughput, and the number of model parameters. Custom metrics can be added in benchmark_utils.py.
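Here is a minimal sketch of a metric collector that supports registering custom metrics. It assumes throughput is computed as total samples divided by total step time. The class `BenchmarkMetrics` and its methods are hypothetical and are not the interface of benchmark_utils.py. In a real PyTorch run, peak memory would typically come from `torch.cuda.max_memory_allocated()` and the parameter count from `sum(p.numel() for p in model.parameters())`. Whether this benchmark uses those exact calls is not stated here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BenchmarkMetrics:
    """Hypothetical collector; not the actual API of benchmark_utils.py."""
    num_samples: int = 0
    elapsed_s: float = 0.0
    num_params: int = 0  # e.g. sum(p.numel() for p in model.parameters()) in PyTorch
    # Custom metrics: name -> function computing a value from the collector.
    custom: Dict[str, Callable[["BenchmarkMetrics"], float]] = field(default_factory=dict)

    def record_step(self, batch_size: int, step_time_s: float) -> None:
        """Accumulate one training step's sample count and wall-clock time."""
        self.num_samples += batch_size
        self.elapsed_s += step_time_s

    def throughput(self) -> float:
        """Samples per second over all recorded steps."""
        return self.num_samples / self.elapsed_s if self.elapsed_s > 0 else 0.0

    def report(self) -> Dict[str, float]:
        result = {
            "throughput(sample/s)": self.throughput(),
            "params(M)": self.num_params / 1e6,
        }
        for name, fn in self.custom.items():  # evaluate user-registered metrics
            result[name] = fn(self)
        return result


metrics = BenchmarkMetrics(num_params=82_000_000)
metrics.custom["sec_per_sample"] = lambda m: m.elapsed_s / m.num_samples
metrics.record_step(batch_size=32, step_time_s=2.0)
metrics.record_step(batch_size=32, step_time_s=2.0)
print(metrics.report())
```

Because custom metrics are plain callables over the collector, a new metric can be added without changing the reporting code.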

Results

BERT

Plugin	Max CUDA memory	Throughput (samples/s)	Params
ddp	21.44 GB	3.0	82M
ddp_fp16	16.26 GB	11.3	82M
gemini	11.0 GB	12.9	82M
low_level_zero	11.29 GB	14.7	82M
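To compare plugins against the plain ddp baseline, here is a small sketch that computes the peak-memory saving and throughput speedup from the BERT table values. It is only arithmetic on the reported numbers. The README does not state batch size or other run settings, so treat the ratios as indicative. The function name `relative_to_baseline` is hypothetical.

```python
from typing import Dict, Tuple

# Values copied from the BERT table: (peak CUDA memory in GB, throughput in samples/s).
BERT_RESULTS: Dict[str, Tuple[float, float]] = {
    "ddp": (21.44, 3.0),
    "ddp_fp16": (16.26, 11.3),
    "gemini": (11.0, 12.9),
    "low_level_zero": (11.29, 14.7),
}


def relative_to_baseline(results: Dict[str, Tuple[float, float]], baseline: str = "ddp"):
    """Return {plugin: (fraction of memory saved, throughput speedup)} vs. the baseline."""
    base_mem, base_tput = results[baseline]
    return {
        name: (1.0 - mem / base_mem, tput / base_tput)
        for name, (mem, tput) in results.items()
    }


for name, (saving, speedup) in relative_to_baseline(BERT_RESULTS).items():
    print(f"{name:>15}: {saving:6.1%} less peak memory, {speedup:.1f}x throughput")
```

By these numbers, gemini uses about 49% less peak memory than ddp at 4.3x the throughput.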

ALBERT

Plugin	Max CUDA memory	Throughput (samples/s)	Params
ddp	OOM	-	-
ddp_fp16	OOM	-	-
gemini	69.39 GB	1.3	208M
low_level_zero	56.89 GB	1.4	208M