Frank Lee
ddcf58cacf
Revert "[sync] sync feature/shardformer with develop"
2023-06-09 09:41:27 +08:00
FoolPlayer
24651fdd4f
Merge pull request #3931 from FrankLeeeee/sync/develop-to-shardformer
...
[sync] sync feature/shardformer with develop
2023-06-09 09:34:00 +08:00
FoolPlayer
ef1537759c
[shardformer] add gpt2 policy and modify shard and slicer to support ( #3883 )
...
* add gpt2 policy and modify shard and slicer to support
* remove unused code
* polish code
2023-06-08 15:01:34 +08:00
FoolPlayer
6370a935f6
update README ( #3909 )
2023-06-08 15:01:34 +08:00
FoolPlayer
21a3915c98
[shardformer] add Dropout layer support different dropout pattern ( #3856 )
...
* add dropout layer, add dropout test
* modify seed manager as context manager
* add a copy of col_nn.layer
* add dist_crossentropy loss; separate module test
* polish the code
* fix dist crossentropy loss
2023-06-08 15:01:34 +08:00
FoolPlayer
997544c1f9
[shardformer] update readme with modules implement doc ( #3834 )
...
* update readme with modules content
* remove img
2023-06-08 15:01:34 +08:00
Frank Lee
537a52b7a2
[shardformer] refactored the user api ( #3828 )
...
* [shardformer] refactored the user api
* polish code
2023-06-08 15:01:34 +08:00
Frank Lee
bc19024bf9
[shardformer] updated readme ( #3827 )
2023-06-08 15:01:34 +08:00
FoolPlayer
58f6432416
[shardformer]: Feature/shardformer, add some docstring and readme ( #3816 )
...
* init shardformer code structure
* add implement of sharder (inject and replace)
* add implement of replace layer to colossal layer
* separate different layer policy, add some notion
* implement 1d and 2d slicer, can tell col or row
* fix bug when slicing and inject model
* fix some bug; add inference test example
* add share weight and train example
* add train
* add docstring and readme
* add docstring for other files
* pre-commit
2023-06-08 15:01:34 +08:00
FoolPlayer
6a69b44dfc
[shardformer] init shardformer code structure ( #3731 )
...
* init shardformer code structure
* add implement of sharder (inject and replace)
* add implement of replace layer to colossal layer
* separate different layer policy, add some notion
* implement 1d and 2d slicer, can tell col or row
* fix bug when slicing and inject model
* fix some bug; add inference test example
2023-06-08 15:01:34 +08:00
Frank Lee
eb39154d40
[dtensor] updated api and doc ( #3845 )
2023-06-08 10:18:17 +08:00
digger yu
de0d7df33f
[nfc] fix typo colossalai/zero ( #3923 )
2023-06-08 00:01:29 +08:00
digger yu
a9d1cadc49
fix typo with colossalai/trainer utils zero ( #3908 )
2023-06-07 16:08:37 +08:00
Frank Lee
d51e83d642
Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop
...
[sync] sync feature/dtensor with develop
2023-06-07 11:50:43 +08:00
Hongxin Liu
9c88b6cbd1
[lazy] fix compatibility problem on torch 1.13 ( #3911 )
2023-06-07 11:10:12 +08:00
digger yu
0e484e6201
[nfc]fix typo colossalai/pipeline tensor nn ( #3899 )
...
* fix typo colossalai/autochunk auto_parallel amp
* fix typo colossalai/auto_parallel nn utils etc.
* fix typo colossalai/auto_parallel autochunk fx/passes etc.
* fix typo docs/
* change placememt_policy to placement_policy in docs/ and examples/
* fix typo colossalai/ applications/
* fix typo colossalai/cli fx kernel
* fix typo colossalai/nn
* revert change warmuped
* fix typo colossalai/pipeline tensor nn
2023-06-06 14:07:36 +08:00
Baizhou Zhang
c1535ccbba
[doc] fix docs about booster api usage ( #3898 )
2023-06-06 13:36:11 +08:00
digger yu
1878749753
[nfc] fix typo colossalai/nn ( #3887 )
...
* fix typo colossalai/autochunk auto_parallel amp
* fix typo colossalai/auto_parallel nn utils etc.
* fix typo colossalai/auto_parallel autochunk fx/passes etc.
* fix typo docs/
* change placememt_policy to placement_policy in docs/ and examples/
* fix typo colossalai/ applications/
* fix typo colossalai/cli fx kernel
* fix typo colossalai/nn
* revert change warmuped
2023-06-05 16:04:27 +08:00
Hongxin Liu
ae02d4e4f7
[bf16] add bf16 support ( #3882 )
...
* [bf16] add bf16 support for fused adam (#3844 )
* [bf16] fused adam kernel support bf16
* [test] update fused adam kernel test
* [test] update fused adam test
* [bf16] cpu adam and hybrid adam optimizers support bf16 (#3860 )
* [bf16] implement mixed precision mixin and add bf16 support for low level zero (#3869 )
* [bf16] add mixed precision mixin
* [bf16] low level zero optim support bf16
* [test] update low level zero test
* [test] fix low level zero grad acc test
* [bf16] add bf16 support for gemini (#3872 )
* [bf16] gemini support bf16
* [test] update gemini bf16 test
* [doc] update gemini docstring
* [bf16] add bf16 support for plugins (#3877 )
* [bf16] add bf16 support for legacy zero (#3879 )
* [zero] init context support bf16
* [zero] legacy zero support bf16
* [test] add zero bf16 test
* [doc] add bf16 related docstring for legacy zero
2023-06-05 15:58:31 +08:00
Liu Ziming
8065cc5fba
Modify torch version requirement to adapt torch 2.0 ( #3896 )
2023-06-05 15:57:35 +08:00
Hongxin Liu
dbb32692d2
[lazy] refactor lazy init ( #3891 )
...
* [lazy] remove old lazy init
* [lazy] refactor lazy init folder structure
* [lazy] fix lazy tensor deepcopy
* [test] update lazy init test
2023-06-05 14:20:47 +08:00
digger yu
70c8cdecf4
[nfc] fix typo colossalai/cli fx kernel ( #3847 )
...
* fix typo colossalai/autochunk auto_parallel amp
* fix typo colossalai/auto_parallel nn utils etc.
* fix typo colossalai/auto_parallel autochunk fx/passes etc.
* fix typo docs/
* change placememt_policy to placement_policy in docs/ and examples/
* fix typo colossalai/ applications/
* fix typo colossalai/cli fx kernel
2023-06-02 15:02:45 +08:00
digger yu
e2d81eba0d
[nfc] fix typo colossalai/ applications/ ( #3831 )
...
* fix typo colossalai/autochunk auto_parallel amp
* fix typo colossalai/auto_parallel nn utils etc.
* fix typo colossalai/auto_parallel autochunk fx/passes etc.
* fix typo docs/
* change placememt_policy to placement_policy in docs/ and examples/
* fix typo colossalai/ applications/
2023-05-25 16:19:41 +08:00
wukong1992
3229f93e30
[booster] add warning for torch fsdp plugin doc ( #3833 )
2023-05-25 14:00:02 +08:00
Hongxin Liu
7c9f2ed6dd
[dtensor] polish sharding spec docstring ( #3838 )
...
* [dtensor] polish sharding spec docstring
* [dtensor] polish sharding spec example docstring
2023-05-25 13:09:42 +08:00
digger yu
7f8203af69
fix typo colossalai/auto_parallel autochunk fx/passes etc. ( #3808 )
2023-05-24 09:01:50 +08:00
wukong1992
6b305a99d6
[booster] torch fsdp fix ckpt ( #3788 )
2023-05-23 16:58:45 +08:00
digger yu
9265f2d4d7
[NFC]fix typo colossalai/auto_parallel nn utils etc. ( #3779 )
...
* fix typo colossalai/autochunk auto_parallel amp
* fix typo colossalai/auto_parallel nn utils etc.
2023-05-23 15:28:20 +08:00
jiangmingyan
e871e342b3
[API] add docstrings and initialization to apex amp, naive amp ( #3783 )
...
* [mixed_precison] add naive amp demo
* [mixed_precison] add naive amp demo
* [api] add docstrings and initialization to apex amp, naive amp
* [api] add docstring to apex amp/ naive amp
* [api] fix
* [api] fix
2023-05-23 15:17:24 +08:00
Frank Lee
f5c425c898
fixed the example docstring for booster ( #3795 )
2023-05-22 18:10:06 +08:00
Hongxin Liu
72688adb2f
[doc] add booster docstring and fix autodoc ( #3789 )
...
* [doc] add docstr for booster methods
* [doc] fix autodoc
2023-05-22 10:56:47 +08:00
Hongxin Liu
3c07a2846e
[plugin] a workaround for zero plugins' optimizer checkpoint ( #3780 )
...
* [test] refactor torch ddp checkpoint test
* [plugin] update low level zero optim checkpoint
* [plugin] update gemini optim checkpoint
2023-05-19 19:42:31 +08:00
Hongxin Liu
60e6a154bc
[doc] add tutorial for booster checkpoint ( #3785 )
...
* [doc] add checkpoint related docstr for booster
* [doc] add en checkpoint doc
* [doc] add zh checkpoint doc
* [doc] add booster checkpoint doc in sidebar
* [doc] add caution about ckpt for plugins
* [doc] add doctest placeholder
2023-05-19 18:05:08 +08:00
digger yu
32f81f14d4
[NFC] fix typo colossalai/amp auto_parallel autochunk ( #3756 )
2023-05-19 13:50:00 +08:00
Hongxin Liu
5452df63c5
[plugin] torch ddp plugin supports sharded model checkpoint ( #3775 )
...
* [plugin] torch ddp plugin add save sharded model
* [test] fix torch ddp ckpt io test
* [test] fix torch ddp ckpt io test
* [test] fix low level zero plugin test
* [test] fix low level zero plugin test
* [test] add debug info
* [test] fix low level zero plugin test
* [test] fix low level zero plugin test
* [test] remove debug info
2023-05-18 20:05:59 +08:00
jiangmingyan
2703a37ac9
[amp] Add naive amp demo ( #3774 )
...
* [mixed_precison] add naive amp demo
* [mixed_precison] add naive amp demo
2023-05-18 16:33:14 +08:00
digger yu
1baeb39c72
[NFC] fix typo with colossalai/auto_parallel/tensor_shard ( #3742 )
...
* fix typo applications/ and colossalai/ date 5.11
* fix typo colossalai/
2023-05-17 11:13:23 +08:00
wukong1992
b37797ed3d
[booster] support torch fsdp plugin in booster ( #3697 )
...
Co-authored-by: 纪少敏 <jishaomin@jishaomindeMBP.lan>
2023-05-15 12:14:38 +08:00
digger-yu
ad6460cf2c
[NFC] fix typo applications/ and colossalai/ ( #3735 )
2023-05-15 11:46:25 +08:00
digger-yu
b7141c36dd
[CI] fix some spelling errors ( #3707 )
...
* fix spelling error with examples/comminity/
* fix spelling error with tests/
* fix some spelling error with tests/ colossalai/ etc.
2023-05-10 17:12:03 +08:00
jiangmingyan
20068ba188
[booster] add tests for ddp and low level zero's checkpointio ( #3715 )
...
* [booster] update tests for booster
* [booster] update booster tutorials#3717, fix recursive check
2023-05-10 12:17:02 +08:00
Hongxin Liu
6552cbf8e1
[booster] fix no_sync method ( #3709 )
...
* [booster] fix no_sync method
* [booster] add test for ddp no_sync
* [booster] fix merge
* [booster] update unit test
2023-05-09 11:10:02 +08:00
Hongxin Liu
3bf09efe74
[booster] update prepare dataloader method for plugin ( #3706 )
...
* [booster] add prepare dataloader method for plug
* [booster] update examples and docstr
2023-05-08 15:44:03 +08:00
Hongxin Liu
f83ea813f5
[example] add train resnet/vit with booster example ( #3694 )
...
* [example] add train vit with booster example
* [example] update readme
* [example] add train resnet with booster example
* [example] enable ci
* [example] enable ci
* [example] add requirements
* [hotfix] fix analyzer init
* [example] update requirements
2023-05-08 10:42:30 +08:00
YH
2629f9717d
[tensor] Refactor handle_trans_spec in DistSpecManager
2023-05-06 17:55:37 +08:00
Hongxin Liu
d0915f54f4
[booster] refactor all dp fashion plugins ( #3684 )
...
* [booster] add dp plugin base
* [booster] inherit dp plugin base
* [booster] refactor unit tests
2023-05-05 19:36:10 +08:00
jiangmingyan
307894f74d
[booster] gemini plugin support shard checkpoint ( #3610 )
...
* gemini plugin add shard checkpoint save/load
* gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
---------
Co-authored-by: luchen <luchen@luchendeMBP.lan>
Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>
2023-05-05 14:37:21 +08:00
YH
a22407cc02
[zero] Suggests a minor change to confusing variable names in the ZeRO optimizer. ( #3173 )
...
* Fix confusing variable name in zero opt
* Apply lint
* Fix util func
* Fix minor util func
* Fix zero param optimizer name
2023-04-27 18:43:14 +08:00
Hongxin Liu
50793b35f4
[gemini] accelerate inference ( #3641 )
...
* [gemini] support don't scatter after inference
* [chat] update colossalai strategy
* [chat] fix opt benchmark
* [chat] update opt benchmark
* [gemini] optimize inference
* [test] add gemini inference test
* [chat] fix unit test ci
* [chat] fix ci
* [chat] fix ci
* [chat] skip checkpoint test
2023-04-26 16:32:40 +08:00
Hongxin Liu
4b3240cb59
[booster] add low level zero plugin ( #3594 )
...
* [booster] add low level zero plugin
* [booster] fix gemini plugin test
* [booster] fix precision
* [booster] add low level zero plugin test
* [test] fix booster plugin test oom
* [test] fix booster plugin test oom
* [test] fix googlenet and inception output trans
* [test] fix diffuser clip vision model
* [test] fix torchaudio_wav2vec2_base
* [test] fix low level zero plugin test
2023-04-26 14:37:25 +08:00