Commit Graph

309 Commits (536397cc951cea648ded9b1052dfac1d4cc3f91c)

Author SHA1 Message Date
flybird11111 7486ed7d3a
[shardformer] update llama2/opt finetune example and fix llama2 policy (#4645)
* [shardformer] update shardformer readme

[shardformer] update shardformer readme

[shardformer] update shardformer readme

* [shardformer] update llama2/opt finetune example and shardformer update to llama2

* [shardformer] update llama2/opt finetune example and shardformer update to llama2

* [shardformer] update llama2/opt finetune example and shardformer update to llama2

* [shardformer] change dataset

* [shardformer] change dataset

* [shardformer] fix CI

* [shardformer] fix

* [shardformer] fix

* [shardformer] fix

* [shardformer] fix

* [shardformer] fix

[example] update opt example

[example] resolve comments

fix

fix
2023-09-09 22:45:36 +08:00
Baizhou Zhang 295b38fecf
[example] update vit example for hybrid parallel plugin (#4641)
* update vit example for hybrid plugin

* reset tp/pp size

* fix dataloader iteration bug

* update optimizer passing in evaluation/add grad_accum

* change criterion

* wrap tqdm

* change grad_accum to grad_checkpoint

* fix pbar
2023-09-07 17:38:45 +08:00
Baizhou Zhang 660eed9124
[pipeline] set optimizer to optional in execute_pipeline (#4630)
* set optimizer to optional in execute_pipeline

* arrange device and mixed precision in booster init

* fix execute_pipeline in booster.py
2023-09-07 10:42:59 +08:00
Hongxin Liu fae6c92ead
Merge branch 'main' into feature/shardformer 2023-09-05 21:54:08 +08:00
Hongxin Liu ac178ca5c1 [legacy] move builder and registry to legacy (#4603) 2023-09-05 21:53:10 +08:00
Hongxin Liu 8accecd55b [legacy] move engine to legacy (#4560)
* [legacy] move engine to legacy

* [example] fix seq parallel example

* [example] fix seq parallel example

* [test] test gemini plugin hang

* [test] test gemini plugin hang

* [test] test gemini plugin hang

* [test] test gemini plugin hang

* [test] test gemini plugin hang

* [example] update seq parallel requirements
2023-09-05 21:53:10 +08:00
Hongxin Liu 89fe027787 [legacy] move trainer to legacy (#4545)
* [legacy] move trainer to legacy

* [doc] update docs related to trainer

* [test] ignore legacy test
2023-09-05 21:53:10 +08:00
flybird11111 ec0866804c
[shardformer] update shardformer readme (#4617)
[shardformer] update shardformer readme

[shardformer] update shardformer readme
2023-09-05 13:14:41 +08:00
Hongxin Liu a39a5c66fe
Merge branch 'main' into feature/shardformer 2023-09-04 23:43:13 +08:00
flybird11111 0a94fcd351
[shardformer] update bert finetune example with HybridParallelPlugin (#4584)
* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] fix epoch change

* [shardformer] broadcast add pp group

* [shardformer] fix opt test hanging

* fix

* test

* test

* [shardformer] zero1+pp and the corresponding tests (#4517)

* pause

* finish pp+zero1

* Update test_shard_vit.py

* [shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516)

* fix overlap bug and support bert, add overlap as an option in shardconfig

* support overlap for chatglm and bloom

* [shardformer] fix emerged bugs after updating transformers (#4526)

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] Add overlap support for gpt2 (#4535)

* add overlap support for gpt2

* remove unused code

* remove unused code

* [shardformer] support pp+tp+zero1 tests (#4531)

* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] fix submodule replacement bug when enabling pp (#4544)

* [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540)

* implement sharded optimizer saving

* add more param info

* finish implementation of sharded optimizer saving

* fix bugs in optimizer sharded saving

* add pp+zero test

* param group loading

* greedy loading of optimizer

* fix bug when loading

* implement optimizer sharded saving

* add optimizer test & arrange checkpointIO utils

* fix gemini sharding state_dict

* add verbose option

* add loading of master params

* fix typehint

* fix master/working mapping in fp16 amp

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] add bert finetune example

* [shardformer] fix epoch change

* [shardformer] broadcast add pp group

* rebase feature/shardformer

* update pipeline

* [shardformer] fix

* [shardformer] fix

* [shardformer] bert finetune fix

* [shardformer] add all_reduce operation to loss

add all_reduce operation to loss

* [shardformer] make compatible with pytree.

make compatible with pytree.

* [shardformer] disable tp

disable tp

* [shardformer] add 3d plugin to ci test

* [shardformer] update num_microbatches to None

* [shardformer] update microbatchsize

* [shardformer] update assert

* update scheduler

* update scheduler

---------

Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>
2023-09-04 21:46:29 +08:00
binmakeswell 8d7b02290f
[doc] add llama2 benchmark (#4604)
* [doc] add llama2 benchmark

* [doc] add llama2 benchmark
2023-09-04 13:49:33 +08:00
Tian Siyuan f1ae8c9104
[example] change accelerate version (#4431)
Co-authored-by: Siyuan Tian <siyuant@vmware.com>
Co-authored-by: Hongxin Liu <lhx0217@gmail.com>
2023-08-30 22:56:13 +08:00
ChengDaqi2023 8e2e1992b8
[example] update streamlit 0.73.1 to 1.11.1 (#4386) 2023-08-30 22:54:45 +08:00
Hongxin Liu 0b00def881
[example] add llama2 example (#4527)
* [example] transfer llama-1 example

* [example] fit llama-2

* [example] refactor scripts folder

* [example] fit new gemini plugin

* [cli] fix multinode runner

* [example] fit gemini optim checkpoint

* [example] refactor scripts

* [example] update requirements

* [example] update requirements

* [example] rename llama to llama2

* [example] update readme and pretrain script

* [example] refactor scripts
2023-08-28 17:59:11 +08:00
Hongxin Liu 27061426f7
[gemini] improve compatibility and add static placement policy (#4479)
* [gemini] remove distributed-related part from colotensor (#4379)

* [gemini] remove process group dependency

* [gemini] remove tp part from colo tensor

* [gemini] patch inplace op

* [gemini] fix param op hook and update tests

* [test] remove useless tests

* [test] remove useless tests

* [misc] fix requirements

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [misc] update requirements

* [gemini] refactor gemini optimizer and gemini ddp (#4398)

* [gemini] update optimizer interface

* [gemini] renaming gemini optimizer

* [gemini] refactor gemini ddp class

* [example] update gemini related example

* [example] update gemini related example

* [plugin] fix gemini plugin args

* [test] update gemini ckpt tests

* [gemini] fix checkpoint io

* [example] fix opt example requirements

* [example] fix opt example

* [example] fix opt example

* [example] fix opt example

* [gemini] add static placement policy (#4443)

* [gemini] add static placement policy

* [gemini] fix param offload

* [test] update gemini tests

* [plugin] update gemini plugin

* [plugin] update gemini plugin docstr

* [misc] fix flash attn requirement

* [test] fix gemini checkpoint io test

* [example] update resnet example result (#4457)

* [example] update bert example result (#4458)

* [doc] update gemini doc (#4468)

* [example] update gemini related examples (#4473)

* [example] update gpt example

* [example] update dreambooth example

* [example] update vit

* [example] update opt

* [example] update palm

* [example] update vit and opt benchmark

* [hotfix] fix bert in model zoo (#4480)

* [hotfix] fix bert in model zoo

* [test] remove chatglm gemini test

* [test] remove sam gemini test

* [test] remove vit gemini test

* [hotfix] fix opt tutorial example (#4497)

* [hotfix] fix opt tutorial example

* [hotfix] fix opt tutorial example
2023-08-24 09:29:25 +08:00
Tian Siyuan ff836790ae
[doc] fix a typo in examples/tutorial/auto_parallel/README.md (#4430)
Co-authored-by: Siyuan Tian <siyuant@vmware.com>
2023-08-15 00:22:57 +08:00
binmakeswell 089c365fa0
[doc] add Series A Funding and NeurIPS news (#4377)
* [doc] add Series A Funding and NeurIPS news

* [kernel] fix mha kernel

* [CI] skip moe

* [CI] fix requirements
2023-08-04 17:42:07 +08:00
caption 16c0acc01b
[hotfix] update gradio 3.11 to 3.34.0 (#4329) 2023-08-01 16:25:25 +08:00
binmakeswell ef4b99ebcd add llama example CI 2023-07-26 14:12:57 +08:00
binmakeswell 7ff11b5537
[example] add llama pretraining (#4257) 2023-07-17 21:07:44 +08:00
github-actions[bot] 4e9b09c222
Automated submodule synchronization (#4217)
Co-authored-by: github-actions <github-actions@github.com>
2023-07-12 17:35:58 +08:00
digger yu 2d40759a53
fix #3852 path error (#4058) 2023-06-28 15:29:44 +08:00
Jianghai 31dc302017
[examples] copy resnet example to image (#4090)
* copy resnet example

* add pytest package

* skip test_ci

* skip test_ci

* skip test_ci
2023-06-27 16:40:46 +08:00
Baizhou Zhang 4da324cd60
[hotfix] fix argument naming in docs and examples (#4083) 2023-06-26 23:50:04 +08:00
LuGY 160c64c645
[example] fix bucket size in example of gpt gemini (#4028) 2023-06-19 11:22:42 +08:00
Baizhou Zhang b3ab7fbabf
[example] update ViT example using booster api (#3940) 2023-06-12 15:02:27 +08:00
Liu Ziming e277534a18
Merge pull request #3905 from MaruyamaAya/dreambooth
[example] Adding an example of training dreambooth with the new booster API
2023-06-09 08:44:18 +08:00
digger yu 33eef714db
fix typo examples and docs (#3932) 2023-06-08 16:09:32 +08:00
Maruyama_Aya 9b5e7ce21f modify shell for check 2023-06-08 14:56:56 +08:00
digger yu 407aa48461
fix typo examples/community/roberta (#3925) 2023-06-08 14:28:34 +08:00
Maruyama_Aya 730a092ba2 modify shell for check 2023-06-08 13:38:18 +08:00
Maruyama_Aya 49567d56d1 modify shell for check 2023-06-08 13:36:05 +08:00
Maruyama_Aya 039854b391 modify shell for check 2023-06-08 13:17:58 +08:00
Baizhou Zhang e417dd004e
[example] update opt example using booster api (#3918) 2023-06-08 11:27:05 +08:00
Maruyama_Aya cf4792c975 modify shell for check 2023-06-08 11:15:10 +08:00
Maruyama_Aya c94a33579b modify shell for check 2023-06-07 17:23:01 +08:00
Liu Ziming b306cecf28
[example] Modify palm example with the new booster API (#3913)
* Modify torch version requirement to adapt torch 2.0

* modify palm example using new booster API

* roll back

* fix port

* polish

* polish
2023-06-07 16:05:00 +08:00
wukong1992 a55fb00c18
[booster] update bert example, using booster api (#3885) 2023-06-07 15:51:00 +08:00
Maruyama_Aya 4fc8bc68ac modify file path 2023-06-07 11:02:19 +08:00
Maruyama_Aya b4437e88c3 fixed port 2023-06-06 16:21:38 +08:00
Maruyama_Aya 79c9f776a9 fixed port 2023-06-06 16:20:45 +08:00
Maruyama_Aya d3379f0be7 fixed model saving bugs 2023-06-06 16:07:34 +08:00
Maruyama_Aya b29e1f0722 change directory 2023-06-06 15:50:03 +08:00
Maruyama_Aya 1c1f71cbd2 fixing insecure hash function 2023-06-06 14:51:11 +08:00
Maruyama_Aya b56c7f4283 update shell file 2023-06-06 14:09:27 +08:00
Maruyama_Aya 176010f289 update performance evaluation 2023-06-06 14:08:22 +08:00
Maruyama_Aya 25447d4407 modify path 2023-06-05 11:47:07 +08:00
Maruyama_Aya 60ec33bb18 Add a new example of Dreambooth training using the booster API 2023-06-02 16:50:51 +08:00
jiangmingyan 5f79008c4a
[example] update gemini examples (#3868)
* [example] update gemini examples

* [example] update gemini examples
2023-05-30 18:41:41 +08:00
digger yu 518b31c059
[docs] change placememt_policy to placement_policy (#3829)
* fix typo colossalai/autochunk auto_parallel amp

* fix typo colossalai/auto_parallel nn utils etc.

* fix typo colossalai/auto_parallel autochunk fx/passes etc.

* fix typo docs/

* change placememt_policy to placement_policy in docs/ and examples/
2023-05-24 14:51:49 +08:00