ColossalAI/colossalai
flybird11111 0a94fcd351
[shardformer] update bert finetune example with HybridParallelPlugin (#4584)
* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] add bert finetune example

* [shardformer] fix epoch change

* [shardformer] broadcast add pp group

* [shardformer] fix opt test hanging

* fix

* test

* test

* [shardformer] zero1+pp and the corresponding tests (#4517)

* pause

* finish pp+zero1

* Update test_shard_vit.py

* [shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardconfig (#4516)

* fix overlap bug and support bert, add overlap as an option in shardconfig

* support overlap for chatglm and bloom

* [shardformer] fix bugs that emerged after updating transformers (#4526)

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] Add overlap support for gpt2 (#4535)

* add overlap support for gpt2

* remove unused code

* remove unused code

* [shardformer] support pp+tp+zero1 tests (#4531)

* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] pp+tp+zero1

* [shardformer] fix submodule replacement bug when enabling pp (#4544)

* [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540)

* implement sharded optimizer saving

* add more param info

* finish implementation of sharded optimizer saving

* fix bugs in optimizer sharded saving

* add pp+zero test

* param group loading

* greedy loading of optimizer

* fix bug when loading

* implement optimizer sharded saving

* add optimizer test & arrange checkpointIO utils

* fix gemini sharding state_dict

* add verbose option

* add loading of master params

* fix typehint

* fix master/working mapping in fp16 amp

* [shardformer] add bert finetune example

* [shardformer] fix epoch change

* [shardformer] broadcast add pp group

* rebase feature/shardformer

* update pipeline

* [shardformer] fix

* [shardformer] fix

* [shardformer] bert finetune fix

* [shardformer] add all_reduce operation to loss

* [shardformer] make compatible with pytree.

* [shardformer] disable tp

* [shardformer] add 3d plugin to ci test

* [shardformer] update num_microbatches to None

* [shardformer] update microbatch_size

* [shardformer] update assert

* update scheduler

* update scheduler

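For reference, a minimal sketch of how the updated fine-tuning example drives training through the Booster API with HybridParallelPlugin. The parallel sizes, learning rate, `criterion`, and the `train_step` helper below are illustrative assumptions, not the exact code shipped in the example:

```python
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin
from transformers import BertForSequenceClassification

colossalai.launch_from_torch(config={})

# Illustrative hybrid setup: no tensor parallelism, a 2-stage pipeline,
# and ZeRO-1 over the remaining data-parallel ranks.
pp_size = 2
plugin = HybridParallelPlugin(
    tp_size=1,
    pp_size=pp_size,
    zero_stage=1,
    num_microbatches=None,   # let the schedule be derived from microbatch_size
    microbatch_size=1,
    precision="fp16",
    initial_scale=1,
)
booster = Booster(plugin=plugin)

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2.4e-5, weight_decay=0.01)

def criterion(outputs, inputs):
    # HF models compute the loss themselves when labels are part of the batch
    return outputs.loss

model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion=criterion)

def train_step(batch_iter):
    """One optimizer step; with pp_size > 1 the booster drives the pipeline schedule."""
    if pp_size > 1:
        # Runs forward and backward over the microbatches internally.
        out = booster.execute_pipeline(batch_iter, model, criterion, optimizer, return_loss=True)
        loss = out["loss"]  # only materialized on the last pipeline stage
    else:
        batch = next(batch_iter)
        loss = criterion(model(**batch), batch)
        booster.backward(loss, optimizer)
    optimizer.step()
    optimizer.zero_grad()
    return loss
```
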
---------

Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>
2023-09-04 21:46:29 +08:00
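The sharded optimizer checkpointIO sub-series (#4540) folded into this commit pairs with the example above. A rough sketch of the save/resume flow, assuming the boosted `model`/`optimizer` pair from the previous snippet; the paths and the `size_per_shard` value (in MB) are illustrative:

```python
# Each rank writes only the parameter / optimizer-state shards it owns,
# plus an index file describing the mapping.
booster.save_model(model, "./ckpt/model", shard=True, size_per_shard=1024)
booster.save_optimizer(optimizer, "./ckpt/optimizer", shard=True, size_per_shard=1024)

# Resume later: loading follows the index files, so each rank reads back
# only the shards holding its own states.
booster.load_model(model, "./ckpt/model")
booster.load_optimizer(optimizer, "./ckpt/optimizer")
```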
_C [setup] support pre-build and jit-build of cuda kernels (#2374) 2023-01-06 20:50:26 +08:00
_analyzer [example] add train resnet/vit with booster example (#3694) 2023-05-08 10:42:30 +08:00
amp [pipeline] support fp32 for HybridPlugin/merge shardformer test and pipeline test into one file (#4354) 2023-08-15 23:25:14 +08:00
auto_parallel [NFC] polish runtime_preparation_pass style (#4266) 2023-07-26 14:12:57 +08:00
autochunk fix typo colossalai/auto_parallel autochunk fx/passes etc. (#3808) 2023-05-24 09:01:50 +08:00
booster [shardformer] update bert finetune example with HybridParallelPlugin (#4584) 2023-09-04 21:46:29 +08:00
builder [NFC] polish colossalai/builder/__init__.py code style (#1560) 2022-09-08 22:11:04 +08:00
checkpoint_io [shardformer] support from_pretrained when loading model with HybridParallelPlugin (#4575) 2023-09-01 17:40:01 +08:00
cli fix localhost measurement (#4320) 2023-08-01 10:14:00 +08:00
cluster [shardformer] support interleaved pipeline (#4448) 2023-08-16 19:29:03 +08:00
communication [NFC] fix: format (#4270) 2023-07-26 14:12:57 +08:00
context [CI] fix some spelling errors (#3707) 2023-05-10 17:12:03 +08:00
device [format] applied code formatting on changed files in pull request 4152 (#4157) 2023-07-04 16:07:47 +08:00
engine [nfc]fix ColossalaiOptimizer is not defined (#4122) 2023-06-30 17:23:22 +08:00
fx [nfc] fix typo colossalai/cli fx kernel (#3847) 2023-06-02 15:02:45 +08:00
interface [pipeline] refactor 1f1b schedule (#4115) 2023-08-15 23:25:14 +08:00
kernel [shardformer] update shardformer to use flash attention 2 (#4392) 2023-08-15 23:25:14 +08:00
lazy [shardformer] support lazy init (#4202) 2023-08-15 23:25:14 +08:00
logging [logger] hotfix, missing _FORMAT (#2231) 2022-12-29 22:59:39 +08:00
nn [doc] add Series A Funding and NeurIPS news (#4377) 2023-08-04 17:42:07 +08:00
pipeline [shardformer] update bert finetune example with HybridParallelPlugin (#4584) 2023-09-04 21:46:29 +08:00
registry Remove duplication registry (#1078) 2022-06-08 07:47:24 +08:00
shardformer [shardformer] Pytree fix (#4533) 2023-09-04 17:52:23 +08:00
tensor [pipeline] support fp32 for HybridPlugin/merge shardformer test and pipeline test into one file (#4354) 2023-08-15 23:25:14 +08:00
testing [checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin (#4141) 2023-07-07 16:33:06 +08:00
trainer fix typo with colossalai/trainer utils zero (#3908) 2023-06-07 16:08:37 +08:00
utils [test] remove useless tests (#4359) 2023-08-01 18:52:14 +08:00
zero [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540) 2023-08-31 14:50:47 +08:00
__init__.py [setup] supported conda-installed torch (#2048) 2022-11-30 16:45:15 +08:00
constants.py updated tp layers 2022-11-02 12:19:38 +08:00
core.py
global_variables.py [NFC] polish colossalai/global_variables.py code style (#3259) 2023-03-29 15:22:21 +08:00
initialize.py [nfc] fix typo colossalai/zero (#3923) 2023-06-08 00:01:29 +08:00