Commit Graph

1414 Commits (e8ad3c88f570a24180103eee8c2af6cd1c5f92d5)

Author SHA1 Message Date
Frank Lee ddcf58cacf
Revert "[sync] sync feature/shardformer with develop" 2023-06-09 09:41:27 +08:00
FoolPlayer 24651fdd4f
Merge pull request #3931 from FrankLeeeee/sync/develop-to-shardformer
[sync] sync feature/shardformer with develop
2023-06-09 09:34:00 +08:00
FoolPlayer ef1537759c [shardformer] add gpt2 policy and modify shard and slicer to support (#3883)
* add gpt2 policy and modify shard and slicer to support

* remove unused code

* polish code
2023-06-08 15:01:34 +08:00
FoolPlayer 6370a935f6 update README (#3909) 2023-06-08 15:01:34 +08:00
FoolPlayer 21a3915c98 [shardformer] add Dropout layer support different dropout pattern (#3856)
* add dropout layer, add dropout test

* modify seed manager as context manager

* add a copy of col_nn.layer

* add dist_crossentropy loss; separate module test

* polish the code

* fix dist crossentropy loss
2023-06-08 15:01:34 +08:00
FoolPlayer 997544c1f9 [shardformer] update readme with modules implement doc (#3834)
* update readme with modules content

* remove img
2023-06-08 15:01:34 +08:00
Frank Lee 537a52b7a2 [shardformer] refactored the user api (#3828)
* [shardformer] refactored the user api

* polish code
2023-06-08 15:01:34 +08:00
Frank Lee bc19024bf9 [shardformer] updated readme (#3827) 2023-06-08 15:01:34 +08:00
FoolPlayer 58f6432416 [shardformer]: Feature/shardformer, add some docstring and readme (#3816)
* init shardformer code structure

* add implement of sharder (inject and replace)

* add implement of replace layer to colossal layer

* separate different layer policy, add some notion

* implement 1d and 2d slicer, can tell col or row

* fix bug when slicing and inject model

* fix some bug; add inference test example

* add share weight and train example

* add train

* add docstring and readme

* add docstring for other files

* pre-commit
2023-06-08 15:01:34 +08:00
FoolPlayer 6a69b44dfc [shardformer] init shardformer code structure (#3731)
* init shardformer code structure

* add implement of sharder (inject and replace)

* add implement of replace layer to colossal layer

* separate different layer policy, add some notion

* implement 1d and 2d slicer, can tell col or row

* fix bug when slicing and inject model

* fix some bug; add inference test example
2023-06-08 15:01:34 +08:00
Frank Lee eb39154d40
[dtensor] updated api and doc (#3845) 2023-06-08 10:18:17 +08:00
digger yu de0d7df33f
[nfc] fix typo colossalai/zero (#3923) 2023-06-08 00:01:29 +08:00
digger yu a9d1cadc49
fix typo with colossalai/trainer utils zero (#3908) 2023-06-07 16:08:37 +08:00
Frank Lee d51e83d642
Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop
[sync] sync feature/dtensor with develop
2023-06-07 11:50:43 +08:00
Hongxin Liu 9c88b6cbd1
[lazy] fix compatibility problem on torch 1.13 (#3911) 2023-06-07 11:10:12 +08:00
digger yu 0e484e6201
[nfc]fix typo colossalai/pipeline tensor nn (#3899)
* fix typo colossalai/autochunk auto_parallel amp

* fix typo colossalai/auto_parallel nn utils etc.

* fix typo colossalai/auto_parallel autochunk fx/passes  etc.

* fix typo docs/

* change placememt_policy to placement_policy in docs/ and examples/

* fix typo colossalai/ applications/

* fix typo colossalai/cli fx kernel

* fix typo colossalai/nn

* revert change warmuped

* fix typo colossalai/pipeline tensor nn
2023-06-06 14:07:36 +08:00
Baizhou Zhang c1535ccbba
[doc] fix docs about booster api usage (#3898) 2023-06-06 13:36:11 +08:00
digger yu 1878749753
[nfc] fix typo colossalai/nn (#3887)
* fix typo colossalai/autochunk auto_parallel amp

* fix typo colossalai/auto_parallel nn utils etc.

* fix typo colossalai/auto_parallel autochunk fx/passes  etc.

* fix typo docs/

* change placememt_policy to placement_policy in docs/ and examples/

* fix typo colossalai/ applications/

* fix typo colossalai/cli fx kernel

* fix typo colossalai/nn

* revert change warmuped
2023-06-05 16:04:27 +08:00
Hongxin Liu ae02d4e4f7
[bf16] add bf16 support (#3882)
* [bf16] add bf16 support for fused adam (#3844)

* [bf16] fused adam kernel support bf16

* [test] update fused adam kernel test

* [test] update fused adam test

* [bf16] cpu adam and hybrid adam optimizers support bf16 (#3860)

* [bf16] implement mixed precision mixin and add bf16 support for low level zero (#3869)

* [bf16] add mixed precision mixin

* [bf16] low level zero optim support bf16

* [text] update low level zero test

* [text] fix low level zero grad acc test

* [bf16] add bf16 support for gemini (#3872)

* [bf16] gemini support bf16

* [test] update gemini bf16 test

* [doc] update gemini docstring

* [bf16] add bf16 support for plugins (#3877)

* [bf16] add bf16 support for legacy zero (#3879)

* [zero] init context support bf16

* [zero] legacy zero support bf16

* [test] add zero bf16 test

* [doc] add bf16 related docstring for legacy zero
2023-06-05 15:58:31 +08:00
Liu Ziming 8065cc5fba
Modify torch version requirement to adapt torch 2.0 (#3896) 2023-06-05 15:57:35 +08:00
Hongxin Liu dbb32692d2
[lazy] refactor lazy init (#3891)
* [lazy] remove old lazy init

* [lazy] refactor lazy init folder structure

* [lazy] fix lazy tensor deepcopy

* [test] update lazy init test
2023-06-05 14:20:47 +08:00
digger yu 70c8cdecf4
[nfc] fix typo colossalai/cli fx kernel (#3847)
* fix typo colossalai/autochunk auto_parallel amp

* fix typo colossalai/auto_parallel nn utils etc.

* fix typo colossalai/auto_parallel autochunk fx/passes  etc.

* fix typo docs/

* change placememt_policy to placement_policy in docs/ and examples/

* fix typo colossalai/ applications/

* fix typo colossalai/cli fx kernel
2023-06-02 15:02:45 +08:00
digger yu e2d81eba0d
[nfc] fix typo colossalai/ applications/ (#3831)
* fix typo colossalai/autochunk auto_parallel amp

* fix typo colossalai/auto_parallel nn utils etc.

* fix typo colossalai/auto_parallel autochunk fx/passes  etc.

* fix typo docs/

* change placememt_policy to placement_policy in docs/ and examples/

* fix typo colossalai/ applications/
2023-05-25 16:19:41 +08:00
wukong1992 3229f93e30
[booster] add warning for torch fsdp plugin doc (#3833) 2023-05-25 14:00:02 +08:00
Hongxin Liu 7c9f2ed6dd
[dtensor] polish sharding spec docstring (#3838)
* [dtensor] polish sharding spec docstring

* [dtensor] polish sharding spec example docstring
2023-05-25 13:09:42 +08:00
digger yu 7f8203af69
fix typo colossalai/auto_parallel autochunk fx/passes etc. (#3808) 2023-05-24 09:01:50 +08:00
wukong1992 6b305a99d6
[booster] torch fsdp fix ckpt (#3788) 2023-05-23 16:58:45 +08:00
digger yu 9265f2d4d7
[NFC]fix typo colossalai/auto_parallel nn utils etc. (#3779)
* fix typo colossalai/autochunk auto_parallel amp

* fix typo colossalai/auto_parallel nn utils etc.
2023-05-23 15:28:20 +08:00
jiangmingyan e871e342b3
[API] add docstrings and initialization to apex amp, naive amp (#3783)
* [mixed_precison] add naive amp demo

* [mixed_precison] add naive amp demo

* [api] add docstrings and initialization to apex amp, naive amp

* [api] add docstring to apex amp/ naive amp

* [api] add docstring to apex amp/ naive amp

* [api] add docstring to apex amp/ naive amp

* [api] add docstring to apex amp/ naive amp

* [api] add docstring to apex amp/ naive amp

* [api] add docstring to apex amp/ naive amp

* [api] fix

* [api] fix
2023-05-23 15:17:24 +08:00
Frank Lee f5c425c898
fixed the example docstring for booster (#3795) 2023-05-22 18:10:06 +08:00
Hongxin Liu 72688adb2f
[doc] add booster docstring and fix autodoc (#3789)
* [doc] add docstr for booster methods

* [doc] fix autodoc
2023-05-22 10:56:47 +08:00
Hongxin Liu 3c07a2846e
[plugin] a workaround for zero plugins' optimizer checkpoint (#3780)
* [test] refactor torch ddp checkpoint test

* [plugin] update low level zero optim checkpoint

* [plugin] update gemini optim checkpoint
2023-05-19 19:42:31 +08:00
Hongxin Liu 60e6a154bc
[doc] add tutorial for booster checkpoint (#3785)
* [doc] add checkpoint related docstr for booster

* [doc] add en checkpoint doc

* [doc] add zh checkpoint doc

* [doc] add booster checkpoint doc in sidebar

* [doc] add cuation about ckpt for plugins

* [doc] add doctest placeholder

* [doc] add doctest placeholder

* [doc] add doctest placeholder
2023-05-19 18:05:08 +08:00
digger yu 32f81f14d4
[NFC] fix typo colossalai/amp auto_parallel autochunk (#3756) 2023-05-19 13:50:00 +08:00
Hongxin Liu 5452df63c5
[plugin] torch ddp plugin supports sharded model checkpoint (#3775)
* [plugin] torch ddp plugin add save sharded model

* [test] fix torch ddp ckpt io test

* [test] fix torch ddp ckpt io test

* [test] fix low level zero plugin test

* [test] fix low level zero plugin test

* [test] add debug info

* [test] add debug info

* [test] add debug info

* [test] add debug info

* [test] add debug info

* [test] fix low level zero plugin test

* [test] fix low level zero plugin test

* [test] remove debug info
2023-05-18 20:05:59 +08:00
jiangmingyan 2703a37ac9
[amp] Add naive amp demo (#3774)
* [mixed_precison] add naive amp demo

* [mixed_precison] add naive amp demo
2023-05-18 16:33:14 +08:00
digger yu 1baeb39c72
[NFC] fix typo with colossalai/auto_parallel/tensor_shard (#3742)
* fix typo applications/ and colossalai/ date 5.11

* fix typo colossalai/
2023-05-17 11:13:23 +08:00
wukong1992 b37797ed3d
[booster] support torch fsdp plugin in booster (#3697)
Co-authored-by: 纪少敏 <jishaomin@jishaomindeMBP.lan>
2023-05-15 12:14:38 +08:00
digger-yu ad6460cf2c
[NFC] fix typo applications/ and colossalai/ (#3735) 2023-05-15 11:46:25 +08:00
digger-yu b7141c36dd
[CI] fix some spelling errors (#3707)
* fix spelling error with examples/comminity/

* fix spelling error with tests/

* fix some spelling error with tests/ colossalai/ etc.
2023-05-10 17:12:03 +08:00
jiangmingyan 20068ba188
[booster] add tests for ddp and low level zero's checkpointio (#3715)
* [booster] update tests for booster

* [booster] update tests for booster

* [booster] update tests for booster

* [booster] update tests for booster

* [booster] update tests for booster

* [booster] update booster tutorials#3717, fix recursive check
2023-05-10 12:17:02 +08:00
Hongxin Liu 6552cbf8e1
[booster] fix no_sync method (#3709)
* [booster] fix no_sync method

* [booster] add test for ddp no_sync

* [booster] fix merge

* [booster] update unit test

* [booster] update unit test

* [booster] update unit test
2023-05-09 11:10:02 +08:00
Hongxin Liu 3bf09efe74
[booster] update prepare dataloader method for plugin (#3706)
* [booster] add prepare dataloader method for plug

* [booster] update examples and docstr
2023-05-08 15:44:03 +08:00
Hongxin Liu f83ea813f5
[example] add train resnet/vit with booster example (#3694)
* [example] add train vit with booster example

* [example] update readme

* [example] add train resnet with booster example

* [example] enable ci

* [example] enable ci

* [example] add requirements

* [hotfix] fix analyzer init

* [example] update requirements
2023-05-08 10:42:30 +08:00
YH 2629f9717d
[tensor] Refactor handle_trans_spec in DistSpecManager 2023-05-06 17:55:37 +08:00
Hongxin Liu d0915f54f4
[booster] refactor all dp fashion plugins (#3684)
* [booster] add dp plugin base

* [booster] inherit dp plugin base

* [booster] refactor unit tests
2023-05-05 19:36:10 +08:00
jiangmingyan 307894f74d
[booster] gemini plugin support shard checkpoint (#3610)
* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

---------

Co-authored-by: luchen <luchen@luchendeMBP.lan>
Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>
2023-05-05 14:37:21 +08:00
YH a22407cc02
[zero] Suggests a minor change to confusing variable names in the ZeRO optimizer. (#3173)
* Fix confusing variable name in zero opt

* Apply lint

* Fix util func

* Fix minor util func

* Fix zero param optimizer name
2023-04-27 18:43:14 +08:00
Hongxin Liu 50793b35f4
[gemini] accelerate inference (#3641)
* [gemini] support don't scatter after inference

* [chat] update colossalai strategy

* [chat] fix opt benchmark

* [chat] update opt benchmark

* [gemini] optimize inference

* [test] add gemini inference test

* [chat] fix unit test ci

* [chat] fix ci

* [chat] fix ci

* [chat] skip checkpoint test
2023-04-26 16:32:40 +08:00
Hongxin Liu 4b3240cb59
[booster] add low level zero plugin (#3594)
* [booster] add low level zero plugin

* [booster] fix gemini plugin test

* [booster] fix precision

* [booster] add low level zero plugin test

* [test] fix booster plugin test oom

* [test] fix booster plugin test oom

* [test] fix googlenet and inception output trans

* [test] fix diffuser clip vision model

* [test] fix torchaudio_wav2vec2_base

* [test] fix low level zero plugin test
2023-04-26 14:37:25 +08:00