Commit Graph

934 Commits (eedaa3e1ef991d9f9a274d10c046877ba2b10467)

Author SHA1 Message Date
flybird11111 eedaa3e1ef
[shardformer]fix gpt2 double head (#4663)
* [shardformer]fix gpt2 test

[shardformer]fix gpt2 test

[shardformer]fix gpt2 test

* fix

* [shardformer] add todo

* [shardformer] add todo
2023-09-11 18:35:03 +08:00
Hongxin Liu 554aa9592e
[legacy] move communication and nn to legacy and refactor logger (#4671)
* [legacy] move communication to legacy (#4640)

* [legacy] refactor logger and clean up legacy codes (#4654)

* [legacy] make logger independent to gpc

* [legacy] make optim independent to registry

* [legacy] move test engine to legacy

* [legacy] move nn to legacy (#4656)

* [legacy] move nn to legacy

* [checkpointio] fix save hf config

* [test] remove useledd rpc pp test

* [legacy] fix nn init

* [example] skip tutorial hybriad parallel example

* [devops] test doc check

* [devops] test doc check
2023-09-11 16:24:28 +08:00
flybird11111 7486ed7d3a
[shardformer] update llama2/opt finetune example and fix llama2 policy (#4645)
* [shardformer] update shardformer readme

[shardformer] update shardformer readme

[shardformer] update shardformer readme

* [shardformer] update llama2/opt finetune example and shardformer update to llama2

* [shardformer] update llama2/opt finetune example and shardformer update to llama2

* [shardformer] update llama2/opt finetune example and shardformer update to llama2

* [shardformer] change dataset

* [shardformer] change dataset

* [shardformer] fix CI

* [shardformer] fix

* [shardformer] fix

* [shardformer] fix

* [shardformer] fix

* [shardformer] fix

[example] update opt example

[example] resolve comments

fix

fix
2023-09-09 22:45:36 +08:00
Baizhou Zhang 660eed9124
[pipeline] set optimizer to optional in execute_pipeline (#4630)
* set optimizer to optional in execute_pipeline

* arrange device and mixed precision in booster init

* fix execute_pipeline in booster.py
2023-09-07 10:42:59 +08:00
Hongxin Liu fae6c92ead
Merge branch 'main' into feature/shardformer 2023-09-05 21:54:08 +08:00
Hongxin Liu 8accecd55b [legacy] move engine to legacy (#4560)
* [legacy] move engine to legacy

* [example] fix seq parallel example

* [example] fix seq parallel example

* [test] test gemini pluging hang

* [test] test gemini pluging hang

* [test] test gemini pluging hang

* [test] test gemini pluging hang

* [test] test gemini pluging hang

* [example] update seq parallel requirements
2023-09-05 21:53:10 +08:00
Hongxin Liu 89fe027787 [legacy] move trainer to legacy (#4545)
* [legacy] move trainer to legacy

* [doc] update docs related to trainer

* [test] ignore legacy test
2023-09-05 21:53:10 +08:00
Hongxin Liu bd18678478
[test] fix gemini checkpoint and gpt test (#4620) 2023-09-05 16:02:23 +08:00
Hongxin Liu 807e01a4ba
[zero] hotfix master param sync (#4618)
* [zero] add method to update master params

* [zero] update zero plugin

* [plugin] update low level zero plugin
2023-09-05 15:04:02 +08:00
Hongxin Liu e71d245293
[test] ignore gpt2 shardformer test (#4619) 2023-09-05 14:21:31 +08:00
Hongxin Liu a39a5c66fe
Merge branch 'main' into feature/shardformer 2023-09-04 23:43:13 +08:00
Baizhou Zhang e79b1e80e2
[checkpointio] support huggingface from_pretrained for all plugins (#4606) 2023-09-04 23:25:01 +08:00
Jianghai 24c0768795
[shardformer] Pytree fix (#4533)
* pytree test

* test bert

* test bert

* test bert

* revise

* add register

* add register
2023-09-04 17:52:23 +08:00
Hongxin Liu 508ca36fe3
[pipeline] 1f1b schedule receive microbatch size (#4589) 2023-09-01 21:45:14 +08:00
LuGY cbac782254
[zero]fix zero ckptIO with offload (#4529)
* fix zero ckptio with offload

* fix load device

* saved tensors in ckpt should be on CPU

* fix unit test

* fix unit test

* add clear cache

* save memory for CI
2023-09-01 17:41:19 +08:00
Baizhou Zhang 38ccb8b1a3
[shardformer] support from_pretrained when loading model with HybridParallelPlugin (#4575)
* hybrid plugin support huggingface from_pretrained

* add huggingface compatibility tests

* add folder cleaning

* fix bugs
2023-09-01 17:40:01 +08:00
Baizhou Zhang c9625dbb63
[shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540)
* implement sharded optimizer saving

* add more param info

* finish implementation of sharded optimizer saving

* fix bugs in optimizer sharded saving

* add pp+zero test

* param group loading

* greedy loading of optimizer

* fix bug when loading

* implement optimizer sharded saving

* add optimizer test & arrange checkpointIO utils

* fix gemini sharding state_dict

* add verbose option

* add loading of master params

* fix typehint

* fix master/working mapping in fp16 amp
2023-08-31 14:50:47 +08:00
Baizhou Zhang 2c787d7f47
[shardformer] fix submodule replacement bug when enabling pp (#4544) 2023-08-31 09:57:18 +08:00
flybird11111 ec18fc7340
[shardformer] support pp+tp+zero1 tests (#4531)
* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix

* [shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

[shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1

* [shardformer] pp+tp+zero1
2023-08-30 21:29:18 +08:00
flybird11111 d367b88785
[shardformer] fix opt test hanging (#4521)
* [shardformer] fix opt test hanging

* fix

* test

* test

* test

* fix test

* fix test

* remove print

* add fix
2023-08-30 14:50:34 +08:00
Bin Jia e241b74f24
[shardformer] Add overlap support for gpt2 (#4535)
* add overlap support for gpt2

* remove unused code

* remove unused code
2023-08-29 18:30:50 +08:00
Baizhou Zhang 0387a47e63
[shardformer] fix emerged bugs after updating transformers (#4526) 2023-08-29 11:25:05 +08:00
Bin Jia c554b7f559
[shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516)
* fix overlap bug and support bert, add overlap as an option in shardconfig

* support overlap for chatglm and bloom
2023-08-28 17:16:40 +08:00
Jianghai 376533a564
[shardformer] zero1+pp and the corresponding tests (#4517)
* pause

* finish pp+zero1

* Update test_shard_vit.py
2023-08-28 10:51:16 +08:00
Baizhou Zhang 44eab2b27f
[shardformer] support sharded checkpoint IO for models of HybridParallelPlugin (#4506)
* add APIs

* implement save_sharded_model

* add test for hybrid checkpointio

* implement naive loading for sharded model

* implement efficient sharded model loading

* open a new file for hybrid checkpoint_io

* small fix

* fix circular importing

* fix docstring

* arrange arguments and apis

* small fix
2023-08-25 22:04:57 +08:00
flybird11111 de8a65babc
[shardformer] opt fix. (#4514)
* [shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

* fix

fix

fix

fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* activate checks

* [Test] test ci

* test ci

* test ci

* test ci

* test ci

* test ci

* test ci

* fix
2023-08-25 19:41:24 +08:00
LuGY 839847b7d7
[zero]support zero2 with gradient accumulation (#4511)
* support gradient accumulation with zero2

* fix type
2023-08-25 13:44:07 +08:00
flybird11111 3353e55c80
[shardformer] vit/llama/t5 ignore the sequence parallelism flag and some fix. (#4498)
* [shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

* fix

fix

fix

fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* [shardformer] jit fused fix

* activate checks
2023-08-24 15:50:02 +08:00
Hongxin Liu 27061426f7
[gemini] improve compatibility and add static placement policy (#4479)
* [gemini] remove distributed-related part from colotensor (#4379)

* [gemini] remove process group dependency

* [gemini] remove tp part from colo tensor

* [gemini] patch inplace op

* [gemini] fix param op hook and update tests

* [test] remove useless tests

* [test] remove useless tests

* [misc] fix requirements

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [test] fix model zoo

* [misc] update requirements

* [gemini] refactor gemini optimizer and gemini ddp (#4398)

* [gemini] update optimizer interface

* [gemini] renaming gemini optimizer

* [gemini] refactor gemini ddp class

* [example] update gemini related example

* [example] update gemini related example

* [plugin] fix gemini plugin args

* [test] update gemini ckpt tests

* [gemini] fix checkpoint io

* [example] fix opt example requirements

* [example] fix opt example

* [example] fix opt example

* [example] fix opt example

* [gemini] add static placement policy (#4443)

* [gemini] add static placement policy

* [gemini] fix param offload

* [test] update gemini tests

* [plugin] update gemini plugin

* [plugin] update gemini plugin docstr

* [misc] fix flash attn requirement

* [test] fix gemini checkpoint io test

* [example] update resnet example result (#4457)

* [example] update bert example result (#4458)

* [doc] update gemini doc (#4468)

* [example] update gemini related examples (#4473)

* [example] update gpt example

* [example] update dreambooth example

* [example] update vit

* [example] update opt

* [example] update palm

* [example] update vit and opt benchmark

* [hotfix] fix bert in model zoo (#4480)

* [hotfix] fix bert in model zoo

* [test] remove chatglm gemini test

* [test] remove sam gemini test

* [test] remove vit gemini test

* [hotfix] fix opt tutorial example (#4497)

* [hotfix] fix opt tutorial example

* [hotfix] fix opt tutorial example
2023-08-24 09:29:25 +08:00
Jianghai e04436a82a
[shardformer] tests for 3d parallel (#4493) 2023-08-23 15:05:24 +08:00
flybird11111 59e252ecdb
[shardformer] chatglm support sequence parallel (#4482)
* [shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

[shardformer] chatglm support sequence parallel

* fix

fix

fix

fix
2023-08-22 23:59:31 +08:00
Jianghai 5545114fd8
rename chatglm to chatglm2 (#4484) 2023-08-22 14:13:31 +08:00
Baizhou Zhang 1c7df566e2
[shardformer] support tp+zero for shardformer (#4472)
* support tp+zero/input type cast for hybridplugin

* add tp+zero tests

* fix bucket arguments
2023-08-21 12:04:52 +08:00
Jianghai 8739aa7fa0
[shardformer] Pipeline/whisper (#4456)
* add some base tests and policies

* finish whisper base model

* add conditional generation

* finish basic tests

* whisper

* finish whisper

* finish whisper

* del useless  whisper test

* fix

* add argmin to replace

* finish revision
2023-08-18 21:29:25 +08:00
Bin Jia 7c8be77081
[shardformer/sequence parallel] support gpt2 seq parallel with pp/dp/tp (#4460)
* support gpt2 seq parallel with pp/dp/tp

* fix a bug when waiting for stream done

* delete unused gpt2_seq file
2023-08-18 11:21:53 +08:00
LuGY a78daf6180
[shardformer] support interleaved pipeline (#4448)
* support interleaved pipeline

* fix unit test

* remove virtual stage test in stage mgr

* add droped type hint and updated bwd
2023-08-16 19:29:03 +08:00
Hongxin Liu 26e29d58f0
[devops] add large-scale distributed test marker (#4452)
* [test] remove cpu marker

* [test] remove gpu marker

* [test] update pytest markers

* [ci] update unit test ci
2023-08-16 18:56:52 +08:00
Baizhou Zhang 6ef33f75aa
[shardformer] support DDP in HybridPlugin/add tp+dp tests (#4446)
* support DDP for HybridPlugin/add tp+dp tests

* add docstring for HybridParallelPlugin
2023-08-16 16:11:57 +08:00
Bin Jia 424629fea0
[shardformer/sequence parallel] Cherry pick commit to new branch (#4450)
* [shardformer/sequence parallel] Support sequence parallel for gpt2 (#4384)

* [sequence parallel] add sequence parallel linear col/row support (#4336)

* add sequence parallel linear col/row support

* add annotation

* add annotation

* add support for gpt2 fused qkv linear layer

* support sequence parallel in GPT2

* add docstring and note

* add requirments

* remove unused flash-attb

* modify flash attn test

* modify flash attn setting

* modify flash attn code

* add assert before divide, rename forward function

* [shardformer/test] fix gpt2 test with seq-parallel

* [shardformer/sequence parallel] Overlap input gather and grad computation during col backward (#4401)

* overlap gather input / grad computing during col backward

* modify test for overlap

* simplify code

* fix code and modify cuda stream synchronize

* [shardformer/sequence parallel] polish code
2023-08-16 15:41:20 +08:00
github-actions[bot] d20dceb9a3
[format] applied code formatting on changed files in pull request 4441 (#4445)
Co-authored-by: github-actions <github-actions@github.com>
2023-08-16 10:47:23 +08:00
Hongxin Liu 172f7fa3cf [misc] resolve code factor issues (#4433) 2023-08-15 23:25:14 +08:00
flybird11111 328a791d10 [shardformer] update bloom/llama/vit/chatglm tests (#4420)
[shardformer] update bloom/llama/vit/chatglm tests

[shardformer] update opt tests

[shardformer] update opt tests

[shardformer] update bloom/llama/vit/chatglm tests

[shardformer] update bloom/llama/vit/chatglm tests

[shardformer] update bloom/llama/vit/chatglm tests
2023-08-15 23:25:14 +08:00
flybird11111 108e54a0b4 [shardformer]update t5 tests for using all optimizations. (#4407)
* [shardformer] gpt2 tests fix

[shardformer] test all optimizations (#4399)

[shardformer] test all optimizations

[shardformer] test all optimizations

[shardformer] test all optimizations

[shardformer] gpt2 tests fix

* [shardformer]update t5 to use all optimizations
2023-08-15 23:25:14 +08:00
flybird11111 1edc9b5fb3 [shardformer] update tests for all optimization (#4413)
[shardformer] update tests for all optimization
2023-08-15 23:25:14 +08:00
Baizhou Zhang 7711bd524a [shardformer] rewrite tests for opt/bloom/llama/vit/chatglm (#4395)
* rewrite opt tests

* rewrite llama tests

* rewrite bloom & vit tests

* rewrite chatglm tests

* fix LinearCol for classfiers

* add judge for other tp layers, fix lazy init in util
2023-08-15 23:25:14 +08:00
flybird11111 21e0a42fd1 [shardformer]fix, test gpt2 for AMP+TP (#4403)
* [shardformer] gpt2 tests fix

[shardformer] test all optimizations (#4399)

[shardformer] test all optimizations

[shardformer] test all optimizations

[shardformer] test all optimizations

[shardformer] gpt2 tests fix

* [shardformer] gpt2 tests fix
2023-08-15 23:25:14 +08:00
Jianghai 7596e9ae08 [pipeline] rewrite bert tests and fix some bugs (#4409)
* add pipeline policy and bert forward to be done

* add bertmodel pipeline forward and make tests

* add Bert_Policy and test for policy

* update formatting

* update formatting

* update the code

* fix bugs

* fix name confilt

* add bloom model and policy ,revise the base class of policy

* revise

* revision

* add bert_for_pretraining

* add bert_for_pretraining forward and policy

* fix typos

* cancel warning

* change the imediate output to default dict

* change the default output of get_shared_params

* rewrite bert test

* rewrite bert test

* fix some bugs

* del pipeline tests

* del pipeline tests

* del useless print

* del useless print

* rewrite data repeats
2023-08-15 23:25:14 +08:00
flybird1111 d2cd48e0be [shardformer] test all optimizations (#4399)
[shardformer] test all optimizations

[shardformer] test all optimizations

[shardformer] test all optimizations
2023-08-15 23:25:14 +08:00
flybird1111 7a3dfd0c64 [shardformer] update shardformer to use flash attention 2 (#4392)
* cherry-pick flash attention 2

cherry-pick flash attention 2

* [shardformer] update shardformer to use flash attention 2

[shardformer] update shardformer to use flash attention 2, fix

[shardformer] update shardformer to use flash attention 2, fix

[shardformer] update shardformer to use flash attention 2, fix
2023-08-15 23:25:14 +08:00
Baizhou Zhang ed4c448488 [pipeline] rewrite t5 tests & support multi-tensor transmitting in pipeline (#4388)
* fix remaining t5 bugs/rewrite t5 tests

* fix multi-tensor communication in pipeline

* rearrange test_config

* fix keyerror in sync_shared_params

* fix get_held_layers & Randomnizer, complete t5 tests

* erase printing

* fix get_held_layers through modifying _release_unheld_layers

* fix _get_recursive_held_layers bug
2023-08-15 23:25:14 +08:00