Commit Graph

1797 Commits (785cd9a9c971aa58e6f8c76575111a4aa4d9513b)

Author SHA1 Message Date
digger yu 1878749753
[nfc] fix typo colossalai/nn (#3887)
* fix typo colossalai/autochunk auto_parallel amp

* fix typo colossalai/auto_parallel nn utils etc.

* fix typo colossalai/auto_parallel autochunk fx/passes  etc.

* fix typo docs/

* change placememt_policy to placement_policy in docs/ and examples/

* fix typo colossalai/ applications/

* fix typo colossalai/cli fx kernel

* fix typo colossalai/nn

* revert change warmuped
2023-06-05 16:04:27 +08:00
Hongxin Liu ae02d4e4f7
[bf16] add bf16 support (#3882)
* [bf16] add bf16 support for fused adam (#3844)

* [bf16] fused adam kernel support bf16

* [test] update fused adam kernel test

* [test] update fused adam test

* [bf16] cpu adam and hybrid adam optimizers support bf16 (#3860)

* [bf16] implement mixed precision mixin and add bf16 support for low level zero (#3869)

* [bf16] add mixed precision mixin

* [bf16] low level zero optim support bf16

* [test] update low level zero test

* [test] fix low level zero grad acc test

* [bf16] add bf16 support for gemini (#3872)

* [bf16] gemini support bf16

* [test] update gemini bf16 test

* [doc] update gemini docstring

* [bf16] add bf16 support for plugins (#3877)

* [bf16] add bf16 support for legacy zero (#3879)

* [zero] init context support bf16

* [zero] legacy zero support bf16

* [test] add zero bf16 test

* [doc] add bf16 related docstring for legacy zero
2023-06-05 15:58:31 +08:00
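The bf16 support summarized above amounts to running forward/backward in bfloat16 while the optimizers and plugins accept bf16 gradients. A minimal sketch of the underlying precision mode in plain PyTorch (illustrative only, not ColossalAI's API):

```python
# Minimal bf16 mixed-precision step in plain PyTorch. This illustrates the
# precision mode the commit wires into the optimizers/plugins, not the
# ColossalAI interface itself.
import torch

model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(8, 16, device="cuda")
target = torch.randn(8, 4, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), target)

# bf16 keeps fp32's exponent range, so no GradScaler is needed (unlike fp16).
loss.backward()
optimizer.step()
optimizer.zero_grad()
```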
Liu Ziming 8065cc5fba
Modify torch version requirement to adapt torch 2.0 (#3896) 2023-06-05 15:57:35 +08:00
Hongxin Liu dbb32692d2
[lazy] refactor lazy init (#3891)
* [lazy] remove old lazy init

* [lazy] refactor lazy init folder structure

* [lazy] fix lazy tensor deepcopy

* [test] update lazy init test
2023-06-05 14:20:47 +08:00
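Lazy init defers parameter allocation until the model is placed on its target device. A rough sketch of the idea using PyTorch's meta device (PyTorch 2.x APIs; this illustrates the concept only, not the refactored module):

```python
# Sketch of the "lazy init" idea: build the module without allocating real
# storage, then materialize and initialize it later on the target device.
import torch

with torch.device("meta"):
    model = torch.nn.Linear(1024, 1024)   # no parameter memory allocated yet

# Materialize uninitialized storage on the real device, then initialize.
model = model.to_empty(device="cuda")
for p in model.parameters():
    torch.nn.init.normal_(p)
```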
digger yu 70c8cdecf4
[nfc] fix typo colossalai/cli fx kernel (#3847)
* fix typo colossalai/autochunk auto_parallel amp

* fix typo colossalai/auto_parallel nn utils etc.

* fix typo colossalai/auto_parallel autochunk fx/passes  etc.

* fix typo docs/

* change placememt_policy to placement_policy in docs/ and examples/

* fix typo colossalai/ applications/

* fix typo colossalai/cli fx kernel
2023-06-02 15:02:45 +08:00
digger yu e2d81eba0d
[nfc] fix typo colossalai/ applications/ (#3831)
* fix typo colossalai/autochunk auto_parallel amp

* fix typo colossalai/auto_parallel nn utils etc.

* fix typo colossalai/auto_parallel autochunk fx/passes  etc.

* fix typo docs/

* change placememt_policy to placement_policy in docs/ and examples/

* fix typo colossalai/ applications/
2023-05-25 16:19:41 +08:00
wukong1992 3229f93e30
[booster] add warning for torch fsdp plugin doc (#3833) 2023-05-25 14:00:02 +08:00
Hongxin Liu 7c9f2ed6dd
[dtensor] polish sharding spec docstring (#3838)
* [dtensor] polish sharding spec docstring

* [dtensor] polish sharding spec example docstring
2023-05-25 13:09:42 +08:00
digger yu 7f8203af69
fix typo colossalai/auto_parallel autochunk fx/passes etc. (#3808) 2023-05-24 09:01:50 +08:00
wukong1992 6b305a99d6
[booster] torch fsdp fix ckpt (#3788) 2023-05-23 16:58:45 +08:00
digger yu 9265f2d4d7
[NFC]fix typo colossalai/auto_parallel nn utils etc. (#3779)
* fix typo colossalai/autochunk auto_parallel amp

* fix typo colossalai/auto_parallel nn utils etc.
2023-05-23 15:28:20 +08:00
jiangmingyan e871e342b3
[API] add docstrings and initialization to apex amp, naive amp (#3783)
* [mixed_precision] add naive amp demo

* [mixed_precision] add naive amp demo

* [api] add docstrings and initialization to apex amp, naive amp

* [api] add docstring to apex amp/ naive amp

* [api] add docstring to apex amp/ naive amp

* [api] add docstring to apex amp/ naive amp

* [api] add docstring to apex amp/ naive amp

* [api] add docstring to apex amp/ naive amp

* [api] add docstring to apex amp/ naive amp

* [api] fix

* [api] fix
2023-05-23 15:17:24 +08:00
Frank Lee f5c425c898
fixed the example docstring for booster (#3795) 2023-05-22 18:10:06 +08:00
Hongxin Liu 72688adb2f
[doc] add booster docstring and fix autodoc (#3789)
* [doc] add docstr for booster methods

* [doc] fix autodoc
2023-05-22 10:56:47 +08:00
Hongxin Liu 3c07a2846e
[plugin] a workaround for zero plugins' optimizer checkpoint (#3780)
* [test] refactor torch ddp checkpoint test

* [plugin] update low level zero optim checkpoint

* [plugin] update gemini optim checkpoint
2023-05-19 19:42:31 +08:00
Hongxin Liu 60e6a154bc
[doc] add tutorial for booster checkpoint (#3785)
* [doc] add checkpoint related docstr for booster

* [doc] add en checkpoint doc

* [doc] add zh checkpoint doc

* [doc] add booster checkpoint doc in sidebar

* [doc] add caution about ckpt for plugins

* [doc] add doctest placeholder

* [doc] add doctest placeholder

* [doc] add doctest placeholder
2023-05-19 18:05:08 +08:00
digger yu 32f81f14d4
[NFC] fix typo colossalai/amp auto_parallel autochunk (#3756) 2023-05-19 13:50:00 +08:00
Hongxin Liu 5452df63c5
[plugin] torch ddp plugin supports sharded model checkpoint (#3775)
* [plugin] torch ddp plugin add save sharded model

* [test] fix torch ddp ckpt io test

* [test] fix torch ddp ckpt io test

* [test] fix low level zero plugin test

* [test] fix low level zero plugin test

* [test] add debug info

* [test] add debug info

* [test] add debug info

* [test] add debug info

* [test] add debug info

* [test] fix low level zero plugin test

* [test] fix low level zero plugin test

* [test] remove debug info
2023-05-18 20:05:59 +08:00
jiangmingyan 2703a37ac9
[amp] Add naive amp demo (#3774)
* [mixed_precision] add naive amp demo

* [mixed_precision] add naive amp demo
2023-05-18 16:33:14 +08:00
digger yu 1baeb39c72
[NFC] fix typo with colossalai/auto_parallel/tensor_shard (#3742)
* fix typo applications/ and colossalai/ date 5.11

* fix typo colossalai/
2023-05-17 11:13:23 +08:00
wukong1992 b37797ed3d
[booster] support torch fsdp plugin in booster (#3697)
Co-authored-by: 纪少敏 <jishaomin@jishaomindeMBP.lan>
2023-05-15 12:14:38 +08:00
digger-yu ad6460cf2c
[NFC] fix typo applications/ and colossalai/ (#3735) 2023-05-15 11:46:25 +08:00
digger-yu b7141c36dd
[CI] fix some spelling errors (#3707)
* fix spelling error with examples/comminity/

* fix spelling error with tests/

* fix some spelling error with tests/ colossalai/ etc.
2023-05-10 17:12:03 +08:00
jiangmingyan 20068ba188
[booster] add tests for ddp and low level zero's checkpointio (#3715)
* [booster] update tests for booster

* [booster] update tests for booster

* [booster] update tests for booster

* [booster] update tests for booster

* [booster] update tests for booster

* [booster] update booster tutorials#3717, fix recursive check
2023-05-10 12:17:02 +08:00
Hongxin Liu 6552cbf8e1
[booster] fix no_sync method (#3709)
* [booster] fix no_sync method

* [booster] add test for ddp no_sync

* [booster] fix merge

* [booster] update unit test

* [booster] update unit test

* [booster] update unit test
2023-05-09 11:10:02 +08:00
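The no_sync method fixed above follows the standard DDP gradient-accumulation pattern: skip the gradient all-reduce on intermediate micro-batches and synchronize only on the last one. A sketch of that plain-PyTorch pattern (not the booster API itself):

```python
# Gradient accumulation with DistributedDataParallel.no_sync(): all-reduce is
# skipped inside the context and only performed on the final micro-batch.
import contextlib
import torch

def train_step(ddp_model, optimizer, micro_batches, loss_fn):
    optimizer.zero_grad()
    for i, (x, y) in enumerate(micro_batches):
        is_last = i == len(micro_batches) - 1
        # Synchronize gradients only on the last micro-batch.
        ctx = contextlib.nullcontext() if is_last else ddp_model.no_sync()
        with ctx:
            loss = loss_fn(ddp_model(x), y)
            loss.backward()
    optimizer.step()
```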
Hongxin Liu 3bf09efe74
[booster] update prepare dataloader method for plugin (#3706)
* [booster] add prepare dataloader method for plug

* [booster] update examples and docstr
2023-05-08 15:44:03 +08:00
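A prepare-dataloader helper for data parallelism usually just attaches a DistributedSampler so every rank reads a distinct shard of the dataset. A sketch of that standard recipe (argument names here are illustrative, not the plugin's signature):

```python
# Standard recipe such a helper typically implements: one DistributedSampler
# per rank, plus set_epoch() each epoch for proper shuffling.
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def prepare_dataloader(dataset, batch_size, shuffle=True, seed=42, **kwargs):
    sampler = DistributedSampler(
        dataset,
        num_replicas=torch.distributed.get_world_size(),
        rank=torch.distributed.get_rank(),
        shuffle=shuffle,
        seed=seed,
    )
    # Callers should invoke sampler.set_epoch(epoch) at the start of each epoch.
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, **kwargs)
```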
Hongxin Liu f83ea813f5
[example] add train resnet/vit with booster example (#3694)
* [example] add train vit with booster example

* [example] update readme

* [example] add train resnet with booster example

* [example] enable ci

* [example] enable ci

* [example] add requirements

* [hotfix] fix analyzer init

* [example] update requirements
2023-05-08 10:42:30 +08:00
YH 2629f9717d
[tensor] Refactor handle_trans_spec in DistSpecManager 2023-05-06 17:55:37 +08:00
Hongxin Liu d0915f54f4
[booster] refactor all dp fashion plugins (#3684)
* [booster] add dp plugin base

* [booster] inherit dp plugin base

* [booster] refactor unit tests
2023-05-05 19:36:10 +08:00
jiangmingyan 307894f74d
[booster] gemini plugin support shard checkpoint (#3610)
* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin add shard checkpoint save/load

* gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

* [API Refactoring]gemini plugin support shard checkpoint

---------

Co-authored-by: luchen <luchen@luchendeMBP.lan>
Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>
2023-05-05 14:37:21 +08:00
YH a22407cc02
[zero] Suggests a minor change to confusing variable names in the ZeRO optimizer. (#3173)
* Fix confusing variable name in zero opt

* Apply lint

* Fix util func

* Fix minor util func

* Fix zero param optimizer name
2023-04-27 18:43:14 +08:00
Hongxin Liu 50793b35f4
[gemini] accelerate inference (#3641)
* [gemini] support don't scatter after inference

* [chat] update colossalai strategy

* [chat] fix opt benchmark

* [chat] update opt benchmark

* [gemini] optimize inference

* [test] add gemini inference test

* [chat] fix unit test ci

* [chat] fix ci

* [chat] fix ci

* [chat] skip checkpoint test
2023-04-26 16:32:40 +08:00
Hongxin Liu 4b3240cb59
[booster] add low level zero plugin (#3594)
* [booster] add low level zero plugin

* [booster] fix gemini plugin test

* [booster] fix precision

* [booster] add low level zero plugin test

* [test] fix booster plugin test oom

* [test] fix booster plugin test oom

* [test] fix googlenet and inception output trans

* [test] fix diffuser clip vision model

* [test] fix torchaudio_wav2vec2_base

* [test] fix low level zero plugin test
2023-04-26 14:37:25 +08:00
digger-yu b9a8dff7e5
[doc] Fix typo under colossalai and doc(#3618)
* Fixed several spelling errors under colossalai

* Fix the spelling error in colossalai and docs directory

* Cautiously changed the spelling errors under the example folder

* Update runtime_preparation_pass.py

revert autograft to autograd

* Update search_chunk.py

utile to until

* Update check_installation.py

change misteach to mismatch in line 91

* Update 1D_tensor_parallel.md

revert to perceptron

* Update 2D_tensor_parallel.md

revert to perceptron in line 73

* Update 2p5D_tensor_parallel.md

revert to perceptron in line 71

* Update 3D_tensor_parallel.md

revert to perceptron in line 80

* Update README.md

revert to resnet in line 42

* Update reorder_graph.py

revert to indice in line 7

* Update p2p.py

revert to megatron in line 94

* Update initialize.py

revert to torchrun in line 198

* Update routers.py

change to detailed in line 63

* Update routers.py

change to detailed in line 146

* Update README.md

revert  random number in line 402
2023-04-26 11:38:43 +08:00
Hongxin Liu 12eff9eb4c
[gemini] state dict supports fp16 (#3590)
* [gemini] save state dict support fp16

* [gemini] save state dict shard support fp16

* [gemini] fix state dict

* [gemini] fix state dict
2023-04-19 11:01:48 +08:00
Hongxin Liu dac127d0ee
[fx] fix meta tensor registration (#3589)
* [meta] fix torch 1.13.1

* [meta] fix torch 2.0.0

* [meta] fix torch 1.13.0

* [meta] polish code
2023-04-18 16:20:36 +08:00
Hongxin Liu f313babd11
[gemini] support save state dict in shards (#3581)
* [gemini] support state dict shard

* [gemini] add test state dict shard

* [gemini] polish docstr

* [gemini] fix merge

* [gemini] polish code
2023-04-17 17:11:09 +08:00
YH d329c294ec
Add docstr for zero3 chunk search utils (#3572) 2023-04-17 12:44:17 +08:00
Hongxin Liu 173dad0562
[misc] add verbose arg for zero and op builder (#3552)
* [misc] add print verbose

* [gemini] add print verbose

* [zero] add print verbose for low level

* [misc] add print verbose for op builder
2023-04-17 11:25:35 +08:00
Hongxin Liu 4341f5e8e6
[lazyinit] fix clone and deepcopy (#3553) 2023-04-17 11:25:13 +08:00
Hongxin Liu 152239bbfa
[gemini] gemini supports lazy init (#3379)
* [gemini] fix nvme optimizer init

* [gemini] gemini supports lazy init

* [gemini] add init example

* [gemini] add fool model

* [zero] update gemini ddp

* [zero] update init example

* add chunk method

* add chunk method

* [lazyinit] fix lazy tensor tolist

* [gemini] fix buffer materialization

* [misc] remove useless file

* [booster] update gemini plugin

* [test] update gemini plugin test

* [test] fix gemini plugin test

* [gemini] fix import

* [gemini] fix import

* [lazyinit] use new metatensor

* [lazyinit] use new metatensor

* [lazyinit] fix __set__ method
2023-04-12 16:03:25 +08:00
jiangmingyan 366a035552
[checkpoint] Shard saved checkpoint need to be compatible with the naming format of hf checkpoint files (#3479)
* [checkpoint] support huggingface style sharded checkpoint, to be compatible with hf file naming format

* [checkpoint] support huggingface style sharded checkpoint, to be compatible with hf file naming format

* [checkpoint] Shard saved checkpoint add 'variant' field to customize filename

* [checkpoint] Shard saved checkpoint add 'variant' field to customize filename

* [checkpoint] Shard saved checkpoint add 'variant' field to customize filename

* [checkpoint] Shard saved checkpoint add 'variant' field to customize filename

---------

Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>
Co-authored-by: luchen <luchen@luchendeMBP.lan>
2023-04-12 16:02:17 +08:00
YH bcf0cbcbe7
[doc] Add docs for clip args in zero optim (#3504) 2023-04-10 11:11:28 +08:00
jiangmingyan 52a933e175
[checkpoint] support huggingface style sharded checkpoint (#3461)
* [checkpoint] support huggingface style sharded checkpoint

* [checkpoint] support huggingface style sharded checkpoint

* [checkpoint] support huggingface style sharded checkpoint

* [checkpoint] support huggingface style sharded checkpoint

* [checkpoint] support huggingface style sharded checkpoint

---------

Co-authored-by: luchen <luchen@luchendeMBP.lan>
2023-04-06 16:23:39 +08:00
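Hugging Face-style sharding splits the state dict into size-bounded shards named pytorch_model-00001-of-0000N.bin plus a pytorch_model.bin.index.json that maps every tensor to its shard. A simplified sketch of that layout (not the actual checkpoint IO implementation):

```python
# Simplified sketch of the Hugging Face sharded-checkpoint layout adopted here:
# size-bounded shards plus an index JSON with a tensor-name -> shard-file map.
import json
import torch

def save_sharded(state_dict, max_shard_bytes, prefix="pytorch_model"):
    shards, current, size = [], {}, 0
    for name, tensor in state_dict.items():
        nbytes = tensor.numel() * tensor.element_size()
        if current and size + nbytes > max_shard_bytes:
            shards.append(current)
            current, size = {}, 0
        current[name] = tensor
        size += nbytes
    if current:
        shards.append(current)

    weight_map = {}
    for i, shard in enumerate(shards, start=1):
        fname = f"{prefix}-{i:05d}-of-{len(shards):05d}.bin"
        torch.save(shard, fname)
        weight_map.update({k: fname for k in shard})

    with open(f"{prefix}.bin.index.json", "w") as f:
        json.dump({"metadata": {}, "weight_map": weight_map}, f, indent=2)
```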
Frank Lee 80eba05b0a
[test] refactor tests with spawn (#3452)
* [test] added spawn decorator

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code
2023-04-06 14:51:35 +08:00
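Spawn-based distributed tests ultimately rest on torch.multiprocessing.spawn: launch one process per rank, form a process group, run the check. A bare-bones sketch of that mechanism (the decorator added in this PR wraps this pattern for reuse across tests):

```python
# Bare-bones spawn-based distributed test: one process per rank joins a gloo
# process group and verifies an all-reduce result.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _worker(rank, world_size, port):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    x = torch.ones(1) * rank
    dist.all_reduce(x)                     # sums the ranks across processes
    assert x.item() == sum(range(world_size))
    dist.destroy_process_group()

def test_all_reduce(world_size=4, port=29500):
    mp.spawn(_worker, args=(world_size, port), nprocs=world_size)
```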
Frank Lee 7d8d825681
[booster] fixed the torch ddp plugin with the new checkpoint api (#3442) 2023-04-06 09:43:51 +08:00
YH 8f740deb53
Fix typo (#3448) 2023-04-06 09:43:31 +08:00
Hakjin Lee 46c009dba4
[format] Run lint on colossalai.engine (#3367) 2023-04-05 23:24:43 +08:00
YuliangLiu0306 ffcdbf0f65
[autoparallel]integrate auto parallel feature with new tracer (#3408)
* [autoparallel] integrate new analyzer in module level

* unify the profiling method

* polish

* fix no codegen bug

* fix pass bug

* fix liveness test

* polish
2023-04-04 17:40:45 +08:00
ver217 573af84184
[example] update examples related to zero/gemini (#3431)
* [zero] update legacy import

* [zero] update examples

* [example] fix opt tutorial

* [example] fix opt tutorial

* [example] fix opt tutorial

* [example] fix opt tutorial

* [example] fix import
2023-04-04 17:32:51 +08:00
Frank Lee 1beb85cc25
[checkpoint] refactored the API and added safetensors support (#3427)
* [checkpoint] refactored the API and added safetensors support

* polish code
2023-04-04 15:23:01 +08:00
ver217 26b7aac0be
[zero] reorganize zero/gemini folder structure (#3424)
* [zero] refactor low-level zero folder structure

* [zero] fix legacy zero import path

* [zero] fix legacy zero import path

* [zero] remove useless import

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor legacy zero import path

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor legacy zero import path

* [zero] fix test import path

* [zero] fix test

* [zero] fix circular import

* [zero] update import
2023-04-04 13:48:16 +08:00
Frank Lee 638a07a7f9
[test] fixed gemini plugin test (#3411)
* [test] fixed gemini plugin test

* polish code

* polish code
2023-04-03 17:12:22 +08:00
ver217 5f2e34e6c9
[booster] implement Gemini plugin (#3352)
* [booster] add gemini plugin

* [booster] update docstr

* [booster] gemini plugin add coloparam convertor

* [booster] fix coloparam convertor

* [booster] fix gemini plugin device

* [booster] add gemini plugin test

* [booster] gemini plugin ignore sync bn

* [booster] skip some model

* [booster] skip some model

* [booster] modify test world size

* [booster] modify test world size

* [booster] skip test
2023-03-31 16:06:13 +08:00
HELSON 1a1d68b053
[moe] add checkpoint for moe models (#3354)
* [moe] add checkpoint for moe models

* [hotfix] fix bugs in unit test
2023-03-31 09:20:33 +08:00
YuliangLiu0306 fee2af8610
[autoparallel] adapt autoparallel with new analyzer (#3261)
* [autoparallel] adapt autoparallel with new analyzer

* fix all node handler tests

* polish

* polish
2023-03-30 17:47:24 +08:00
Ofey Chan 8706a8c66c
[NFC] polish colossalai/engine/gradient_handler/__init__.py code style (#3329) 2023-03-30 14:19:39 +08:00
yuxuan-lou 198a74b9fd
[NFC] polish colossalai/context/random/__init__.py code style (#3327) 2023-03-30 14:19:26 +08:00
YuliangLiu0306 fbd2a9e05b [hotfix] meta_tensor_compatibility_with_torch2 2023-03-30 13:43:01 +08:00
Michelle ad285e1656
[NFC] polish colossalai/fx/tracer/_tracer_utils.py (#3323)
* [NFC] polish colossalai/engine/schedule/_pipeline_schedule.py code style

* [NFC] polish colossalai/fx/tracer/_tracer_utils.py  code style

---------

Co-authored-by: Qianran Ma <qianranm@luchentech.com>
2023-03-29 17:53:32 +08:00
Xu Kai 64350029fe [NFC] polish colossalai/gemini/paramhooks/_param_hookmgr.py code style 2023-03-29 15:47:42 +08:00
RichardoLuo 1ce9d0c531 [NFC] polish initializer_data.py code style (#3287) 2023-03-29 15:22:21 +08:00
Ziheng Qin 1bed38ef37 [NFC] polish colossalai/cli/benchmark/models.py code style (#3290) 2023-03-29 15:22:21 +08:00
Kai Wang (Victor Kai) 964a28678f [NFC] polish initializer_3d.py code style (#3279) 2023-03-29 15:22:21 +08:00
Sze-qq 94eec1c5ad [NFC] polish colossalai/engine/gradient_accumulation/_gradient_accumulation.py code style (#3277)
Co-authored-by: siqi <siqi@siqis-MacBook-Pro.local>
2023-03-29 15:22:21 +08:00
Arsmart1 8af977f223 [NFC] polish colossalai/context/parallel_context.py code style (#3276) 2023-03-29 15:22:21 +08:00
Zirui Zhu 1168b50e33 [NFC] polish colossalai/engine/schedule/_pipeline_schedule_v2.py code style (#3275) 2023-03-29 15:22:21 +08:00
Tong Li 196d4696d0 [NFC] polish colossalai/nn/_ops/addmm.py code style (#3274) 2023-03-29 15:22:21 +08:00
lucasliunju 4b95464994 [NFC] polish colossalai/amp/__init__.py code style (#3272) 2023-03-29 15:22:21 +08:00
Xuanlei Zhao 6b3bb2c249 [NFC] polish code style (#3273) 2023-03-29 15:22:21 +08:00
CZYCW 4cadb25b96 [NFC] polish colossalai/fx/proxy.py code style (#3269) 2023-03-29 15:22:21 +08:00
Yuanchen d58fa705b2 [NFC] polish code style (#3268)
Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
2023-03-29 15:22:21 +08:00
Camille Zhong c4a226b729 [NFC] polish tensor_placement_policy.py code style (#3265) 2023-03-29 15:22:21 +08:00
CsRic 00778abc48 [NFC] polish colossalai/fx/passes/split_module.py code style (#3263)
Co-authored-by: csric <richcsr256@gmail.com>
2023-03-29 15:22:21 +08:00
jiangmingyan 488f37048c [NFC] polish colossalai/global_variables.py code style (#3259)
Co-authored-by: luchen <luchen@luchendeMBP.lan>
2023-03-29 15:22:21 +08:00
LuGY 1ff7d5bfa5 [NFC] polish colossalai/engine/gradient_handler/_moe_gradient_handler.py (#3260) 2023-03-29 15:22:21 +08:00
dayellow 204ca2f09a [NFC] polish colossalai/fx/profiler/experimental/profiler_module/embedding.py code style (#3256)
Co-authored-by: Minghao Huang <huangminghao@luchentech.com>
2023-03-29 15:22:21 +08:00
HELSON 02b058032d
[fx] meta registration compatibility (#3253)
* [fx] meta registration compatibility

* fix error
2023-03-27 15:22:17 +08:00
Frank Lee 73d3e4d309
[booster] implemented the torch ddp + resnet example (#3232)
* [booster] implemented the torch ddp + resnet example

* polish code
2023-03-27 10:24:14 +08:00
YH 1a229045af
Add interface for colo tesnor dp size (#3227) 2023-03-27 09:42:21 +08:00
YuliangLiu0306 4d5d8f98a4
[API] implement device mesh manager (#3221)
* [API] implement  device mesh manager

* polish
2023-03-24 13:39:12 +08:00
Frank Lee cd142fbefa
[api] implemented the checkpoint io module (#3205)
* [api] implemented the checkpoint io module

* polish code

* polish code
2023-03-23 10:53:17 +08:00
ver217 f8289d4221
[lazyinit] combine lazy tensor with dtensor (#3204)
* [lazyinit] lazy tensor add distribute

* [lazyinit] refactor distribute

* [lazyinit] add test dist lazy init

* [lazyinit] add verbose info for dist lazy init

* [lazyinit] fix rnn flatten weight op

* [lazyinit] polish test

* [lazyinit] polish test

* [lazyinit] fix lazy tensor data setter

* [lazyinit] polish test

* [lazyinit] fix clean

* [lazyinit] make materialize inplace

* [lazyinit] refactor materialize

* [lazyinit] refactor test distribute

* [lazyinit] fix requires_grad

* [lazyinit] fix tolist after materialization

* [lazyinit] refactor distribute module

* [lazyinit] polish docstr

* [lazyinit] polish lazy init context

* [lazyinit] temporarily skip test

* [lazyinit] polish test

* [lazyinit] add docstr
2023-03-23 10:53:06 +08:00
Frank Lee e3ad88fb48
[booster] implemented the cluster module (#3191)
* [booster] implemented the cluster module

* polish code
2023-03-22 14:11:54 +08:00
YuliangLiu0306 f57d34958b
[FX] refactor experimental tracer and adapt it with hf models (#3157)
* pass gpt trace and meta_prop

* pass t5 trace and meta_prop

* [FX] refactor experimental tracer and adapt it with hf models

* pass all mainstream model zoo

* fix CI

* fix CI

* fix CI

* fix CI

* fix CI

* fix CI

* fix CI

* fix CI

* skip tests

* fix CI

* using packaging version

* polish
2023-03-22 10:40:33 +08:00
Frank Lee e7f3bed2d3
[booster] added the plugin base and torch ddp plugin (#3180)
* [booster] added the plugin base and torch ddp plugin

* polish code

* polish code

* polish code
2023-03-21 17:39:30 +08:00
Zihao 18dbe76cae
[auto-parallel] add auto-offload feature (#3154)
* add auto-offload feature

* polish code

* fix syn offload runtime pass bug

* add offload example

* fix offload testing bug

* fix example testing bug
2023-03-21 14:17:41 +08:00
YuliangLiu0306 258b43317c
[hotfix] layout converting issue (#3188) 2023-03-21 13:24:18 +08:00
YH 80aed29cd3
[zero] Refactor ZeroContextConfig class using dataclass (#3186) 2023-03-21 12:36:47 +08:00
YH 9d644ff09f
Fix docstr for zero statedict (#3185) 2023-03-21 11:48:21 +08:00
zbian 7bc0afc901 updated flash attention usage 2023-03-20 17:57:04 +08:00
Frank Lee a9b8402d93
[booster] added the accelerator implementation (#3159) 2023-03-20 13:59:24 +08:00
ver217 6ae8ed0407
[lazyinit] add correctness verification (#3147)
* [lazyinit] fix shared module

* [tests] add lazy init test utils

* [tests] add torchvision for lazy init

* [lazyinit] fix pre op fn

* [lazyinit] handle legacy constructor

* [tests] refactor lazy init test models

* [tests] refactor lazy init test utils

* [lazyinit] fix ops don't support meta

* [tests] lazy init test timm models

* [lazyinit] fix set data

* [lazyinit] handle apex layers

* [tests] lazy init test transformers models

* [tests] lazy init test torchaudio models

* [lazyinit] fix import path

* [tests] lazy init test torchrec models

* [tests] update torch version in CI

* [tests] revert torch version in CI

* [tests] skip lazy init test
2023-03-17 13:49:04 +08:00
Frank Lee ed19290560
[booster] implemented mixed precision class (#3151)
* [booster] implemented mixed precision class

* polish code
2023-03-17 11:00:15 +08:00
YuliangLiu0306 2eca4cd376
[DTensor] refactor dtensor with new components (#3089)
* [DTensor] refactor dtensor with new components

* polish
2023-03-14 16:25:47 +08:00
ver217 ed8f60b93b
[lazyinit] refactor lazy tensor and lazy init ctx (#3131)
* [lazyinit] refactor lazy tensor and lazy init ctx

* [lazyinit] polish docstr

* [lazyinit] polish docstr
2023-03-14 15:37:12 +08:00
Frank Lee 95a36eae63
[kernel] added kernel loader to softmax autograd function (#3093)
* [kernel] added kernel loader to softmax autograd function

* [release] v0.2.6
2023-03-10 14:27:09 +08:00
Super Daniel fff98f06ed
[analyzer] a minimal implementation of static graph analyzer (#2852)
* [hotfix] meta tensor default device.

* [siu] add experimental submodules to main branch.

* [siu]

* [siu]

* [analyzer] init.

* [analyzer] readme.

* [analyzer] readme.

* [analyzer] readme.

* [analyzer] readme.

* [test] add test.

* Update symbolic_trace.py

* mark skip tests.

* try except.

* try except.

* try except.

* s

* init

* init

* fix

* skip

* skip

---------

Co-authored-by: Daniel Shao <superdainiu@MININT-PVARVID.fareast.corp.microsoft.com>
Co-authored-by: Daniel Shao <superdainiu@Daniels-Mac.local>
2023-03-10 13:21:05 +08:00
Xuanlei Zhao 10c61de2f7
[autochunk] support vit (#3084)
support vit for autochunk
* support some new ops for vit
* fix some bugs
* add test for vit
2023-03-10 10:23:26 +08:00
YuliangLiu0306 8e4e8601b7
[DTensor] implement layout converter (#3055)
* [DTensor] refactor LayoutConverter for DTensor

* polish code

* polish docstring
2023-03-10 09:53:52 +08:00
Frank Lee f19b49e164
[booster] init module structure and definition (#3056) 2023-03-09 11:27:46 +08:00
Xuanlei Zhao 2ca9728cbb
[autochunk] refactor chunk memory estimation (#2762)
* refact memory code

* dont log free var memory

* add memory align

* update chunk target

* update setting for new memory

* finish test

* update tracer

* update typo

* update test
2023-03-08 16:22:30 +08:00
YuliangLiu0306 29386a54e6
[DTensor] refactor CommSpec (#3034) 2023-03-08 10:45:31 +08:00
YuliangLiu0306 cd2b0eaa8d
[DTensor] refactor sharding spec (#2987)
* [autoparallel] refactor sharding spec

* rename function name
2023-03-07 11:08:11 +08:00
Ziyue Jiang 400f63012e
[pipeline] Add Simplified Alpa DP Partition (#2507)
* add alpa dp split

* add alpa dp split

* use fwd+bwd instead of fwd only

---------

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2023-03-07 10:34:31 +08:00
Super Daniel b42d3d28ed
[fx] remove depreciated algorithms. (#2312) (#2313) 2023-03-07 10:30:35 +08:00
github-actions[bot] 82503a96f2
[format] applied code formatting on changed files in pull request 2997 (#3008)
Co-authored-by: github-actions <github-actions@github.com>
2023-03-06 10:42:22 +08:00
binmakeswell 52a5078988
[doc] add ISC tutorial (#2997)
* [doc] add ISC tutorial

* [doc] add ISC tutorial

* [doc] add ISC tutorial

* [doc] add ISC tutorial
2023-03-06 10:36:38 +08:00
ver217 823f3b9cf4
[doc] add deepspeed citation and copyright (#2996)
* [doc] add deepspeed citation and copyright

* [doc] add deepspeed citation and copyright

* [doc] add deepspeed citation and copyright
2023-03-04 20:08:11 +08:00
YuliangLiu0306 e414e4092b
[DTensor] implementation of dtensor (#2946)
* [DTensor] implementation of dtensor

* test layout convert

* polish
2023-03-01 16:34:58 +08:00
YuliangLiu0306 47fb214b3b
[hotfix] add shard dim to avoid backward communication error (#2954) 2023-03-01 11:41:53 +08:00
ver217 090f14fd6b
[misc] add reference (#2930)
* [misc] add reference

* [misc] add license
2023-02-28 18:07:24 +08:00
YuliangLiu0306 197d0bf4ed
[autoparallel] apply repeat block to reduce solving time (#2912) 2023-02-28 11:03:30 +08:00
YH a848091141
Fix port exception type (#2925) 2023-02-28 11:00:43 +08:00
zbian 61e687831d fixed an issue where weights could not be accessed correctly when using zero with tp 2023-02-28 10:52:30 +08:00
YH 7b13f7db18
[zero] trivial zero optimizer refactoring (#2869)
* Fix minor grad store interface

* Apply lint
2023-02-27 14:04:53 +08:00
Jiatong (Julius) Han 8c8a39be95
[hotfix]: Remove math.prod dependency (#2837)
* Remove math.prod dependency

* Fix style

* Fix style

---------

Co-authored-by: Jiatong Han <jiatong.han@u.nus.edu>
2023-02-23 23:56:15 +08:00
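math.prod only exists on Python >= 3.8, so one way to drop the dependency is an explicit reduce, for example:

```python
# Typical replacement for math.prod (Python >= 3.8 only): an explicit reduce,
# which also works on older Python versions.
from functools import reduce
import operator

def prod(iterable, start=1):
    return reduce(operator.mul, iterable, start)

assert prod([2, 3, 4]) == 24   # same result as math.prod([2, 3, 4])
```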
YuliangLiu0306 819e25d8b1
[hotfix] fix autoparallel compatibility test issues (#2754) 2023-02-23 17:28:36 +08:00
YuliangLiu0306 0f392d7403
[autoparallel] find repeat blocks (#2854)
* [autoparallel] find repeat blocks

* polish

* polish

* polish
2023-02-23 17:28:19 +08:00
junxu c52edcf0eb
Rename class method of ZeroDDP (#2692) 2023-02-22 15:05:53 +08:00
HELSON 6e4ac08172
[hotfix] fix chunk size can not be divided (#2867)
* [hotfix] fix chunk size can not be divided

* [hotfix] use numpy for python3.8
2023-02-22 15:04:46 +08:00
Boyuan Yao eae77c831d
[autoparallel] Patch meta information for nodes that will not be handled by SPMD solver (#2823)
* [autoparallel] non spmd meta information generator

* [autoparallel] patch meta information for non spmd nodes
2023-02-22 10:28:56 +08:00
Boyuan Yao c7764d3f22
[autoparallel] Patch meta information of `torch.where` (#2822)
* [autoparallel] patch meta information of torch.where

* [autoparallel] pre-commit modified
2023-02-22 10:28:21 +08:00
Boyuan Yao fcc4097efa
[autoparallel] Patch meta information of `torch.tanh()` and `torch.nn.Dropout` (#2773)
* [autoparallel] tanh meta information

* [autoparallel] remove redundant code

* [autoparallel] patch meta information of torch.nn.Dropout
2023-02-22 10:27:59 +08:00
Frank Lee 935346430f
[cli] handled version check exceptions (#2848)
* [cli] handled version check exceptions

* polish code
2023-02-21 17:04:49 +08:00
Frank Lee 918bc94b6b
[triton] added copyright information for flash attention (#2835)
* [triton] added copyright information for flash attention

* polish code
2023-02-21 11:25:57 +08:00
Boyuan Yao 7ea6bc7f69
[autoparallel] Patch tensor related operations meta information (#2789)
* [autoparallel] tensor related meta information prototype

* [autoparallel] tensor related meta information

* [autoparallel] tensor related meta information

* [autoparallel] tensor related meta information

* [autoparallel] tensor related meta information
2023-02-20 17:38:55 +08:00
Michelle c008d4ad0c
[NFC] polish colossalai/engine/schedule/_pipeline_schedule.py code style (#2744) 2023-02-20 10:38:40 +08:00
YuliangLiu0306 2059fdd6b0
[hotfix] add copyright for solver and device mesh (#2803)
* [hotfix] add copyright for solver and device mesh

* add readme

* add alpa license

* polish
2023-02-18 21:14:38 +08:00
Boyuan Yao 8593ae1a3f
[autoparallel] rotor solver refactor (#2813)
* [autoparallel] rotor solver refactor

* [autoparallel] rotor solver refactor
2023-02-18 11:30:15 +08:00
HELSON 56ddc9ca7a
[hotfix] add correct device for fake_param (#2796) 2023-02-17 15:29:07 +08:00
Boyuan Yao a2b43e393d
[autoparallel] Patch meta information of `torch.nn.Embedding` (#2760)
* [autoparallel] embedding metainfo

* [autoparallel] fix function name in test_activation_metainfo

* [autoparallel] undo changes in activation metainfo and related tests
2023-02-17 10:39:48 +08:00
Boyuan Yao 8e3f66a0d1
[zero] fix wrong import (#2777) 2023-02-17 10:26:07 +08:00
Nikita Shulga 01066152f1
Don't use `torch._six` (#2775)
* Don't use `torch._six`

This is a private API which is gone after https://github.com/pytorch/pytorch/pull/94709

* Update common.py
2023-02-17 09:22:45 +08:00
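Since torch._six is gone, the usual fix is to import the equivalents from the standard library. Which symbols need replacing depends on the call site; a couple of common cases might look like:

```python
# torch._six was a private Python-2/3 compatibility shim removed upstream in
# pytorch/pytorch#94709; replacements come from the standard library.

# Before:
#   from torch._six import inf, string_classes
# After:
from math import inf

string_classes = (str, bytes)   # roughly what torch._six.string_classes provided

print(inf > 1e308, isinstance("grad_norm", string_classes))
```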
binmakeswell 93b788b95a Merge branch 'main' into fix/format 2023-02-15 20:23:51 +08:00
xyupeng 2fd528b9f4
[NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/graph_analysis.py code style (#2737) 2023-02-15 22:57:45 +08:00
YuliangLiu0306 1dc003c169
[autoparallel] distinguish different parallel strategies (#2699) 2023-02-15 22:28:28 +08:00
YH ae86a29e23
Refactor method of grad store (#2687) 2023-02-15 22:27:58 +08:00
Zirui Zhu c9e3ee389e
[NFC] polish colossalai/context/process_group_initializer/initializer_2d.py code style (#2726) 2023-02-15 22:27:13 +08:00
Zangwei Zheng 1819373e5c
[NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/batch_norm_handler.py code style (#2728) 2023-02-15 22:26:13 +08:00
Wangbo Zhao(黑色枷锁) 8331420520
[NFC] polish colossalai/cli/cli.py code style (#2734) 2023-02-15 22:25:28 +08:00
ziyuhuang123 d344313533
[NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/embedding_handler.py code style (#2725) 2023-02-15 16:31:40 +08:00
Xue Fuzhao e81caeb4bc
[NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/cost_graph.py code style (#2720)
Co-authored-by: Fuzhao Xue <fuzhao@login2.ls6.tacc.utexas.edu>
2023-02-15 16:12:45 +08:00
yuxuan-lou 51c45c2460
[NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/where_handler.py code style (#2723) 2023-02-15 16:12:24 +08:00
YuliangLiu0306 21d6a48f4d
[autoparallel] add shard option (#2696)
* [autoparallel] add shard option

* polish
2023-02-15 13:48:28 +08:00
YuliangLiu0306 5b24987fa7
[autoparallel] fix parameters sharding bug (#2716) 2023-02-15 12:25:50 +08:00
Ziyue Jiang 4603538ddd
[NFC] polish colossalai/context/process_group_initializer/initializer_sequence.py code style (#2712)
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2023-02-15 10:53:38 +08:00
YuliangLiu0306 cb2c6a2415
[autoparallel] refactor runtime pass (#2644)
* [autoparallel] refactor runtime pass

* add unit test

* polish
2023-02-15 10:36:19 +08:00
Zihao b3d10db5f1
[NFC] polish colossalai/cli/launcher/__init__.py code style (#2709) 2023-02-15 09:57:22 +08:00
YuliangLiu0306 0b2a738393
[autoparallel] remove deprecated codes (#2664) 2023-02-15 09:54:32 +08:00
YuliangLiu0306 7fa6be49d2
[autoparallel] test compatibility for gemini and auto parallel (#2700) 2023-02-15 09:43:29 +08:00
CZYCW 4ac8bfb072
[NFC] polish colossalai/engine/gradient_handler/utils.py code style (#2708) 2023-02-15 09:40:08 +08:00
Liu Ziming 6427c406cf
[NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/strategy_generator.py code style (#2695)
Co-authored-by: shenggan <csg19971016@gmail.com>
2023-02-14 21:30:25 +08:00
アマデウス 534f68c83c
[NFC] polish pipeline process group code style (#2694) 2023-02-14 18:12:01 +08:00
LuGY 56ff1921e9
[NFC] polish colossalai/context/moe_context.py code style (#2693) 2023-02-14 18:02:45 +08:00
Shawn-Kong 1712da2800
[NFC] polish colossalai/gemini/gemini_context.py code style (#2690) 2023-02-14 11:55:23 +08:00
HELSON df4f020ee3
[zero1&2] only append parameters with gradients (#2681) 2023-02-13 18:00:16 +08:00
ver217 f0aa191f51
[gemini] fix colo_init_context (#2683) 2023-02-13 17:53:15 +08:00
Boyuan Yao 40c916b192
[autoparallel] Patch meta information of `torch.nn.functional.softmax` and `torch.nn.Softmax` (#2674)
* [autoparallel] softmax metainfo

* [autoparallel] softmax metainfo
2023-02-13 16:09:22 +08:00
HELSON 8213f89fd2
[gemini] add fake_release_chunk for keep-gathered chunk in the inference mode (#2671) 2023-02-13 14:35:32 +08:00
binmakeswell 9ab14b20b5
[doc] add CVPR tutorial (#2666) 2023-02-10 20:43:34 +08:00
Boyuan Yao 0385b26ebf
[autoparallel] Patch meta information of `torch.nn.LayerNorm` (#2647)
* [autoparallel] layernorm metainfo patch

* [autoparallel] polish test
2023-02-10 14:29:24 +08:00
YuliangLiu0306 37df666f38
[autoparallel] refactor handlers which reshape input tensors (#2615)
* [autoparallel] refactor handlers which reshape input tensors

* polish
2023-02-08 15:02:49 +08:00
YuliangLiu0306 28398f1c70
add overlap option (#2613) 2023-02-08 15:02:31 +08:00
YuliangLiu0306 cb3d1bef62
[autoparallel] adapt autoparallel tests with latest api (#2626) 2023-02-08 15:02:12 +08:00
Boyuan Yao 90a9fdd91d
[autoparallel] Patch meta information of `torch.matmul` (#2584)
* [autoparallel] matmul metainfo

* [auto_parallel] remove unused print

* [tests] skip test_matmul_handler when torch version is lower than 1.12.0
2023-02-08 11:05:31 +08:00
oahzxl 6ba8364881
[autochunk] support diffusion for autochunk (#2621)
* add alphafold benchmark

* rename alphafold test

* rename tests

* rename diffuser

* rename

* rename

* update transformer

* update benchmark

* update benchmark

* update bench memory

* update transformer benchmark

* rename

* support diffuser

* support unet metainfo prop

* fix bug and simplify code

* update linear and support some op

* optimize max region search, support conv

* update unet test

* support some op

* support groupnorm and interpolate

* update flow search

* add fix dim in node flow

* fix utils

* rename

* support diffusion

* update diffuser

* update chunk search

* optimize imports

* import

* finish autochunk
2023-02-07 16:32:45 +08:00
Frank Lee 8518263b80
[test] fixed the triton version for testing (#2608) 2023-02-07 13:49:38 +08:00
HELSON 552183bb74
[polish] polish ColoTensor and its submodules (#2537) 2023-02-03 11:44:10 +08:00
Frank Lee dd14783f75
[kernel] fixed repeated loading of kernels (#2549)
* [kernel] fixed repeated loading of kernels

* polish code

* polish code
2023-02-03 09:47:13 +08:00
ver217 5b1854309a
[hotfix] fix zero ddp warmup check (#2545) 2023-02-02 16:42:38 +08:00
oahzxl fa3d66feb9
support unet metainfo prop (#2544) 2023-02-02 16:19:26 +08:00
oahzxl 05671fcb42
[autochunk] support multi outputs chunk search (#2538)
Support multi-output chunk search. Previously only single-output chunk search was supported. The new strategy is more flexible and improves performance by a large margin; for transformers, it reduces memory by 40% compared with the previous search strategy.

1. rewrite search strategy to support multi outputs chunk search
2. fix many, many bugs
3. update tests
2023-02-01 13:18:51 +08:00
oahzxl 63199c6687
[autochunk] support transformer (#2526) 2023-01-31 16:00:06 +08:00
HELSON a4ed9125ac
[hotfix] fix lightning error (#2529) 2023-01-31 10:40:39 +08:00
HELSON 66dfcf5281
[gemini] update the gpt example (#2527) 2023-01-30 17:58:05 +08:00
HELSON b528eea0f0
[zero] add zero wrappers (#2523)
* [zero] add zero wrappers

* change names

* add wrapper functions to init
2023-01-29 17:52:58 +08:00
Super Daniel c198c7c0b0
[hotfix] meta tensor default device. (#2510) 2023-01-29 16:28:10 +08:00
HELSON 077a5cdde4
[zero] fix gradient clipping in hybrid parallelism (#2521)
* [zero] fix gradient clipping in hybrid parallelism

* [testing] change model name to avoid pytest warning

* [hotfix] fix unit testing
2023-01-29 15:09:57 +08:00
YuliangLiu0306 aa0f6686f9
[autoparallel] accelerate gpt2 training (#2495) 2023-01-29 11:13:15 +08:00
HELSON 707b11d4a0
[gemini] update ddp strict mode (#2518)
* [zero] add strict ddp mode for chunk init

* [gemini] update gpt example
2023-01-28 14:35:25 +08:00
HELSON 2d1a7dfe5f
[zero] add strict ddp mode (#2508)
* [zero] add strict ddp mode

* [polish] add comments for strict ddp mode

* [zero] fix test error
2023-01-20 14:04:38 +08:00
oahzxl c04f183237
[autochunk] support parsing blocks (#2506) 2023-01-20 11:18:17 +08:00
Super Daniel 35c0c0006e
[utils] lazy init. (#2148)
* [utils] lazy init.

* [utils] remove description.

* [utils] complete.

* [utils] finalize.

* [utils] fix names.
2023-01-20 10:49:00 +08:00
oahzxl 72341e65f4
[auto-chunk] support extramsa (#3) (#2504) 2023-01-20 10:13:03 +08:00
Ziyue Jiang 0f02b8c6e6
add avg partition (#2483)
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2023-01-19 13:54:50 +08:00
アマデウス 99d9713b02 Revert "Update parallel_context.py (#2408)"
This reverts commit 7d5640b9db.
2023-01-19 12:27:48 +08:00
oahzxl ecccc91f21
[autochunk] support autochunk on evoformer (#2497) 2023-01-19 11:41:00 +08:00
oahzxl 5db3a5bf42
[fx] allow control of ckpt_codegen init (#2498)
* [fx] allow control of ckpt_codegen init

Currently in ColoGraphModule, ActivationCheckpointCodeGen is set automatically in __init__, which prevents any other codegen from being used.
So an arg is added to control whether ActivationCheckpointCodeGen is set in __init__.

* code style
2023-01-18 17:02:46 +08:00
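A hypothetical, dependency-free sketch of the shape of this change (names are assumed for illustration and do not match the real ColoGraphModule signature):

```python
# Hypothetical sketch only: an __init__ flag decides whether the
# activation-checkpoint codegen is installed by default, so callers can supply
# a different codegen instead.
class GraphModuleSketch:
    def __init__(self, graph, use_ckpt_codegen: bool = True):
        self.graph = graph
        # Default behaviour: install the activation-checkpoint codegen.
        self.codegen = "ActivationCheckpointCodeGen" if use_ckpt_codegen else None

    def set_codegen(self, codegen):
        # With use_ckpt_codegen=False, a custom codegen can be set here.
        self.codegen = codegen
```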
HELSON d565a24849
[zero] add unit testings for hybrid parallelism (#2486) 2023-01-18 10:36:10 +08:00
oahzxl 4953b4ace1
[autochunk] support evoformer tracer (#2485)
Support the full Evoformer tracer, a main module of AlphaFold; previously only a simplified version was supported.
1. support some evoformer's op in fx
2. support evoformer test
3. add repos for test code
2023-01-16 19:25:05 +08:00
YuliangLiu0306 67e1912b59
[autoparallel] support origin activation ckpt on autoprallel system (#2468) 2023-01-16 16:25:13 +08:00
Ziyue Jiang fef5c949c3
polish pp middleware (#2476)
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2023-01-13 16:56:01 +08:00
HELSON a5dc4253c6
[zero] polish low level optimizer (#2473) 2023-01-13 14:56:17 +08:00
Frank Lee 8b7495dd54
[example] integrate seq-parallel tutorial with CI (#2463) 2023-01-13 14:40:05 +08:00
Jiarui Fang 867c8c2d3a
[zero] low level optim supports ProcessGroup (#2464) 2023-01-13 10:05:58 +08:00
Frank Lee 14d9299360
[cli] fixed hostname mismatch error (#2465) 2023-01-12 14:52:09 +08:00
Haofan Wang 9358262992
Fix False warning in initialize.py (#2456)
* Update initialize.py

* pre-commit run check
2023-01-12 13:49:01 +08:00
YuliangLiu0306 8221fd7485
[autoparallel] update binary elementwise handler (#2451)
* [autoparallel] update binary elementwise handler

* polish
2023-01-12 09:35:10 +08:00
HELSON 2bfeb24308
[zero] add warning for ignored parameters (#2446) 2023-01-11 15:30:09 +08:00
Frank Lee 39163417a1
[example] updated the hybrid parallel tutorial (#2444)
* [example] updated the hybrid parallel tutorial

* polish code
2023-01-11 15:17:17 +08:00
HELSON 5521af7877
[zero] fix state_dict and load_state_dict for ddp ignored parameters (#2443)
* [ddp] add is_ddp_ignored

[ddp] rename to is_ddp_ignored

* [zero] fix state_dict and load_state_dict

* fix bugs

* [zero] update unit test for ZeroDDP
2023-01-11 14:55:41 +08:00
YuliangLiu0306 2731531bc2
[autoparallel] integrate device mesh initialization into autoparallelize (#2393)
* [autoparallel] integrate device mesh initialization into autoparallelize

* add megatron solution

* update gpt autoparallel examples with latest api

* adapt beta value to fit the current computation cost
2023-01-11 14:03:49 +08:00
Frank Lee c72c827e95
[cli] provided more details if colossalai run fail (#2442) 2023-01-11 13:56:42 +08:00
Super Daniel c41e59e5ad
[fx] allow native ckpt trace and codegen. (#2438) 2023-01-11 13:49:59 +08:00
YuliangLiu0306 41429b9b28
[autoparallel] add shard option (#2423) 2023-01-11 13:40:33 +08:00
HELSON 7829aa094e
[ddp] add is_ddp_ignored (#2434)
[ddp] rename to is_ddp_ignored
2023-01-11 12:22:45 +08:00
HELSON bb4e9a311a
[zero] add inference mode and its unit test (#2418) 2023-01-11 10:07:37 +08:00
Jiarui Fang 93f62dd152
[autochunk] add autochunk feature 2023-01-10 16:04:42 +08:00
HELSON dddacd2d2c
[hotfix] add norm clearing for the overflow step (#2416) 2023-01-10 15:43:06 +08:00
oahzxl 7ab2db206f adapt new fx 2023-01-10 11:56:00 +08:00
oahzxl e532679c95 Merge branch 'main' of https://github.com/oahzxl/ColossalAI into chunk 2023-01-10 11:29:01 +08:00
Haofan Wang 7d5640b9db
Update parallel_context.py (#2408) 2023-01-10 11:27:23 +08:00
oahzxl fd818cf144 change imports 2023-01-10 11:10:45 +08:00
oahzxl a591d45b29 add available 2023-01-10 10:56:39 +08:00
oahzxl 615e7e68d9 update doc 2023-01-10 10:44:07 +08:00
oahzxl 7d4abaa525 add doc 2023-01-10 09:59:47 +08:00
oahzxl 1be0ac3cbf add doc for trace indice 2023-01-09 17:59:52 +08:00
oahzxl 0b6af554df remove useless function 2023-01-09 17:46:43 +08:00
oahzxl d914a21d64 rename 2023-01-09 17:45:36 +08:00
oahzxl 865f2e0196 rename 2023-01-09 17:42:25 +08:00
HELSON ea13a201bb
[polish] polish code for get_static_torch_model (#2405)
* [gemini] polish code

* [testing] remove code

* [gemini] make more robust
2023-01-09 17:41:38 +08:00
oahzxl a4ed5b0d0d rename in doc 2023-01-09 17:41:26 +08:00
oahzxl 1bb1f2ad89 rename 2023-01-09 17:38:16 +08:00
oahzxl cb9817f75d rename function from index to indice 2023-01-09 17:34:30 +08:00
oahzxl 0ea903b94e rename trace_index to trace_indice 2023-01-09 17:25:13 +08:00
Frank Lee 551cafec14
[doc] updated kernel-related optimisers' docstring (#2385)
* [doc] updated kernel-related optimisers' docstring

* polish doc
2023-01-09 17:13:53 +08:00
oahzxl 065f0b4c27 add doc for search 2023-01-09 17:11:51 +08:00
oahzxl a68d240ed5 add doc for search chunk 2023-01-09 16:54:08 +08:00
oahzxl 1951f7fa87 code style 2023-01-09 16:30:16 +08:00
oahzxl 212b5b1b5f add comments 2023-01-09 16:29:33 +08:00
oahzxl 19cc64b1d3 remove autochunk_available 2023-01-09 16:06:58 +08:00
eric8607242 9880fd2cd8
Fix state_dict key missing issue of the ZeroDDP (#2363)
* Fix state_dict output for ZeroDDP duplicated parameters

* Rewrite state_dict based on get_static_torch_model

* Modify get_static_torch_model to be compatible with the lower version (ZeroDDP)
2023-01-09 14:35:14 +08:00
oahzxl 4d223e18a2 fix typo 2023-01-09 13:46:17 +08:00
Frank Lee ce08661eb1
[cli] updated installation check cli for aot/jit build (#2395) 2023-01-09 11:05:27 +08:00
jiaruifang 69d9180c4b [hotfix] issue #2388 2023-01-07 18:23:02 +08:00
Jiarui Fang 4e96039649
[device] find best logical mesh 2023-01-07 14:04:30 +08:00
Jiarui Fang 8f72b6f8fb
[hotfix] fix implement error in diffusers 2023-01-07 07:56:39 +08:00
Frank Lee 40d376c566
[setup] support pre-build and jit-build of cuda kernels (#2374)
* [setup] support pre-build and jit-build of cuda kernels

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code
2023-01-06 20:50:26 +08:00
1SAA 33f3023e19 [hotfix] fix implement error in diffusers 2023-01-06 18:37:18 +08:00
Jiarui Fang 12c8bf38d7
[Pipeline] Refine GPT PP Example 2023-01-06 18:03:45 +08:00
oahzxl 8a989a0d89 code style 2023-01-06 17:55:22 +08:00
oahzxl c3a2bf48b4 code style 2023-01-06 17:31:59 +08:00
oahzxl a6cdbf9161 separate trace flow 2023-01-06 17:24:23 +08:00
oahzxl 4748967fb1 add reorder graph 2023-01-06 17:13:18 +08:00
oahzxl da4076846d rename 2023-01-06 17:09:37 +08:00
oahzxl c3d72f7db9 separate reorder 2023-01-06 16:53:01 +08:00
binmakeswell a881d6d000
Revert "[NFC] polish code format" (#2372) 2023-01-06 16:01:09 +08:00
Ziyue Jiang 9ae9e74017 fix diff device in some partition 2023-01-06 15:59:06 +08:00
Jiarui Fang 0dcc410f57
[NFC] polish code format 2023-01-06 15:54:06 +08:00
oahzxl 6685a9d022 separate non chunk input 2023-01-06 15:53:24 +08:00
binmakeswell d634eae05b
Revert "[NFC] polish code format (#2367)" (#2371)
This reverts commit 1f8ab6f1f5.
2023-01-06 15:52:16 +08:00
oahzxl f856611d21 separate prepose_nodes 2023-01-06 15:47:17 +08:00
Shawn-Kong d42aecdda1
[NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/embedding_handler.py code style (#2368) 2023-01-06 15:47:10 +08:00
Jiarui Fang 1aaeb596c6
[example] gpt, shard init on all processes (#2366) 2023-01-06 15:44:50 +08:00
oahzxl f4a1607e56 separate input node dim search 2023-01-06 15:36:17 +08:00
binmakeswell 1f8ab6f1f5
[NFC] polish code format (#2367) 2023-01-06 15:34:48 +08:00
oahzxl ae27a8b26d separate flow tracer 2023-01-06 14:57:33 +08:00
oahzxl fd87d78a28 rename ambiguous variable 2023-01-06 14:28:04 +08:00
oahzxl 2bde9d2b7f code format 2023-01-06 14:21:49 +08:00
oahzxl 8a634af2f5 close mem and code print 2023-01-06 14:19:45 +08:00
oahzxl 1a6d2a740b take apart chunk code gen 2023-01-06 14:14:45 +08:00
ExtremeViscent ac0d30fe2e
[NFC] polish batch_norm_handler.py code style (#2359) 2023-01-06 13:41:38 +08:00
HELSON 48d33b1b17
[gemini] add get static torch model (#2356) 2023-01-06 13:41:19 +08:00
oahzxl efb1c64c30 restruct dir 2023-01-06 11:39:26 +08:00
ziyuhuang123 7080a8edb0
[workflow]New version: Create workflow files for examples' auto check (#2298)
* [workflows]bug_repair

* [workflow]new_pr_fixing_bugs

Co-authored-by: binmakeswell <binmakeswell@gmail.com>
2023-01-06 09:26:49 +08:00
LuGY e11a005c02
[NFC] polish colossalai/auto_parallel/tensor_shard/utils/factory.py code style (#2349) 2023-01-05 21:17:42 +08:00
YuliangLiu0306 b5a3a4a65f [device] find best logical mesh 2023-01-05 17:21:29 +08:00
yuxuan-lou 28e2d16794
[NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/graph_analysis.py code style (#2340) 2023-01-05 16:53:24 +08:00
YuliangLiu0306 9c9246c0d9
[device] alpha beta profiler (#2311)
* [device] alpha beta profiler

* add usage

* fix variable name
2023-01-05 16:39:55 +08:00
Maruyama_Aya bd12a49e2a
[NFC] polish <colossalai/auto_parallel/tensor_shard/deprecated/constants.py> code style (#2339) 2023-01-05 16:20:54 +08:00
Zihao 35427bcab4
[NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/unary_elementwise_handler.py code style (#2326) 2023-01-05 12:18:08 +08:00
Jiarui Fang db6eea3583
[builder] reconfig op_builder for pypi install (#2314) 2023-01-04 16:32:32 +08:00
Junming Wu 4a79c10750 [NFC] polish colossalai/cli/benchmark/__init__.py code style (#2308) 2023-01-04 15:09:57 +08:00
Ofey Chan 87d2defda6 [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/layer_norm_handler.py code style (#2305) 2023-01-04 15:09:57 +08:00
ver217 116e3d0b8f [NFC] polish communication/p2p_v2.py code style (#2303) 2023-01-04 15:09:57 +08:00
xyupeng b965585d05 [NFC] polish colossalai/amp/torch_amp/torch_amp.py code style (#2290) 2023-01-04 15:09:57 +08:00
Zangwei Zheng d1e5bafcd4 [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/__init__.py code style (#2291) 2023-01-04 15:09:57 +08:00
shenggan 950685873f [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/reshape_handler.py code style (#2292) 2023-01-04 15:09:57 +08:00
Ziheng Qin 3041014089 [NFC] polish colossalai/amp/naive_amp/grad_scaler/dynamic_grad_scaler.py code style (#2299)
Co-authored-by: henryqin1997 <henryqin1997@gamil.com>
2023-01-04 15:09:57 +08:00
アマデウス 49715a78f0 [NFC] polish colossalai/cli/benchmark/benchmark.py code style (#2287) 2023-01-04 15:09:57 +08:00
Zirui Zhu 1c29b173c9 [NFC] polish colossalai/auto_parallel/tensor_shard/node_handler/getitem_handler.py code style (#2289) 2023-01-04 15:09:57 +08:00
Zihao 3a02b46447
[auto-parallel] refactoring ColoTracer (#2118)
* add meta_data_computing

* add checkpoint_annotation

* rename proxy.data to proxy.meta_data and add bias addition pass

* polish code

* delete meta_prop_pass invoke and rename ori_node to orig_node

* add TracerType

* unify meta data computing

* delete TracerType

* handle setitem operation

* operator.setitem
2023-01-04 14:44:22 +08:00
HELSON 5d3a2be3af
[amp] add gradient clipping for unit tests (#2283)
* [amp] add gradient clipping in unit tests

* fix bugs
2023-01-04 11:59:56 +08:00
Boyuan Yao d45695d94e
Merge pull request #2258 from hpcaitech/debug/ckpt-autoparallel
[autockpt] provide option for activation checkpoint search in SPMD solver
2023-01-04 11:37:28 +08:00
Jiarui Fang 16cc8e6aa7
[builder] MOE builder (#2277) 2023-01-03 20:29:39 +08:00
Boyuan Yao b904748210
[autoparallel] bypass MetaInfo when unavailable and modify BCAST_FUNC_OP metainfo (#2293)
* [autoparallel] align the data_ptr with the old version of auto activation checkpoint pipeline

* [autoparallel] using fwd_time and bwd_time instead of fwd_flop and bwd_flop

* [autoparallel] specify comm nodes' memory cost in construct chain

* [autoparallel] fix wrong runtime apply calculation

* [autoparallel] fix wrong runtime apply calculation

* [autoparallel] fix wrong runtime apply calculation

* [autoparallel] bypass metainfo when available and modify BCAST_FUNC_OP
2023-01-03 20:28:01 +08:00
Super Daniel 8ea50d999e
[hotfix] pass a parameter. (#2288)
* [autockpt] make it work.

* [autockpt] linearize / merge shape-consistency nodes.

* [autockpt] considering parameter and optimizer weights.

* [hotfix] pass a parameter.
2023-01-03 18:05:06 +08:00
zbian e94c79f15b improved allgather & reducescatter for 3d 2023-01-03 17:46:08 +08:00
HELSON 62c38e3330
[zero] polish low level zero optimizer (#2275) 2023-01-03 17:22:34 +08:00
Ziyue Jiang ac863a01d6
[example] add benchmark (#2276)
* add benchmark

* merge common func

* add total and avg tflops

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2023-01-03 17:20:59 +08:00
Boyuan Yao 22e947f982
[autoparallel] fix runtime apply memory estimation (#2281)
* [autoparallel] align the data_ptr with the old version of auto activation checkpoint pipeline

* [autoparallel] using fwd_time and bwd_time instead of fwd_flop and bwd_flop

* [autoparallel] specify comm nodes' memory cost in construct chain

* [autoparallel] fix wrong runtime apply calculation

* [autoparallel] fix wrong runtime apply calculation

* [autoparallel] fix wrong runtime apply calculation
2023-01-03 17:18:07 +08:00
Super Daniel 8e8900ff3f
[autockpt] considering parameter and optimizer weights. (#2279)
* [autockpt] make it work.

* [autockpt] linearize / merge shape-consistency nodes.

* [autockpt] considering parameter and optimizer weights.
2023-01-03 16:55:49 +08:00
YuliangLiu0306 f027ef7913
[hotfix] fix fp16 optimzier bug (#2273) 2023-01-03 16:53:43 +08:00
YuliangLiu0306 fb87322773
[autoparallel] fix spelling error (#2270) 2023-01-03 16:13:00 +08:00
Jiarui Fang af32022f74
[Gemini] fix the convert_to_torch_module bug (#2269) 2023-01-03 15:55:35 +08:00
Super Daniel b0d21d0c4f
[autockpt] linearize / merge shape-consistency nodes. (#2271)
* [autockpt] make it work.

* [autockpt] linearize / merge shape-consistency nodes.
2023-01-03 14:54:22 +08:00
YuliangLiu0306 4b29112ab2
[autoparallel] gpt2 autoparallel examples (#2267)
* [autoparallel] gpt2 autoparallel examples

* polish code

* polish code
2023-01-03 14:23:33 +08:00
Ziyue Jiang 8b045b3c1f
[Pipeline Middleware] Reduce comm redundancy by getting accurate output (#2232)
* move to cpu to avoid dead lock

* get output by offsets

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2023-01-03 13:43:57 +08:00
Boyuan Yao 5c2ef9fc76
[autoparallel] modify comm nodes' memory cost in construct chain (#2263)
* [autoparallel] align the data_ptr with the old version of auto activation checkpoint pipeline

* [autoparallel] using fwd_time and bwd_time instead of fwd_flop and bwd_flop

* [autoparallel] specify comm nodes' memory cost in construct chain
2023-01-03 11:38:48 +08:00
Boyuan Yao 1ea99b869e
[autoparallel] align the data_ptr with the old version of auto activation checkpoint pipeline (#2261) 2023-01-03 10:30:15 +08:00
Super Daniel 3ccf58aa76
[autockpt] make it work. (#2257) 2023-01-02 23:37:45 +08:00
Boyuan Yao ac3739930d
[autoparallel] modify construct chain in rotor solver (#2254) 2023-01-02 16:26:12 +08:00
Boyuan Yao ab38aebace
[autoparallel] Hook all meta information on ResNet nodes for auto activation checkpoint (#2248)
* [autoparallel] hook node meta on graph nodes for checkpoint solver

* [autoparallel] polish code

* [autoparallel] restore some node handlers

* colossalai/auto_parallel/passes/meta_info_prop.py

* [autoparallel] remove some unused import

* [autoparallel] hook bwd_mem_out
2023-01-02 16:25:18 +08:00
Boyuan Yao c8c79102f0
[autoparallel] patch torch.flatten metainfo for autoparallel (#2247)
* [autoparallel] patch torch.flatten
2023-01-02 15:51:03 +08:00
YuliangLiu0306 8897b8f753
[autoparallel] autoparallel initialize (#2238) 2022-12-31 01:02:14 +08:00
xcnick 85178a397a
[hotfix] fix error for torch 2.0 (#2243) 2022-12-30 23:11:55 +08:00
Super Daniel b7d0990c61
[autoparallel] fix construct meta info. (#2245) 2022-12-30 19:56:44 +08:00
Ziyue Jiang 57929a6210
fix type of num_worker_threads (#2237)
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-30 11:04:01 +08:00
Jiarui Fang db4cbdc7fb
[builder] builder for scaled_upper_triang_masked_softmax (#2234) 2022-12-30 09:58:00 +08:00
Super Daniel 78483a9fdd
[logger] hotfix, missing _FORMAT (#2231) 2022-12-29 22:59:39 +08:00
Jiarui Fang 54de05da5d
[builder] polish builder with better base class (#2216)
* [builder] polish builder

* remove print
2022-12-28 19:45:49 +08:00
YuliangLiu0306 3b1b91eaf4
[autoparallel] record parameter attribute in colotracer (#2217)
* [autoparallel] record parameter attribute in colotracer

* [autoparallel] fix construct_meta_info bug
2022-12-28 19:29:08 +08:00
Jiarui Fang 7675792100
[builder] raise Error when CUDA_HOME is not set (#2213) 2022-12-28 16:07:08 +08:00
Jiarui Fang d5e3e3ec01
[example] update gpt example for larger model scale (#2211) 2022-12-28 13:54:08 +08:00
Boyuan Yao 24246f7aa5
[autoparallel] Attach input, buffer and output tensor to MetaInfo class (#2162)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo

* [autoparallel] add F.linear metainfo generator

* [autoparallel] add binary elementwise metainfo

* [fx] recover profiler

* [autoparallel] fix forward memory calculation

* [autoparallel] modify constants.py

* [autoparallel] remove redundant print

* [autoparallel] add F.conv metainfo

* [autoparallel] linear fix

* [autoparallel] memory estimation for communication actions

* [autoparallel] fix docstring

* [autoparallel] fix variables name

* [autoparallel] attach tensor to metainfo class

* [autoparallel] fix dangerous try except

* [autoparallel] attach memory cost to shape consistency node

* [autoparallel] attach shape consistency node's metainfo to the node

* [autoparallel] remove todo in shape consistency memory estimation

* [autoparallel] fix the annotation
2022-12-28 13:37:40 +08:00
Boyuan Yao d0bc5a1b34
[autoparallel] new metainfoprop based on metainfo class (#2179)
* [autoparallel] new metainfoprop to combine SPMD solver and checkpoint solver

* [autoparallel] new metainfoprop to combine SPMD solver and checkpoint solver

* [autoparallel] modify placeholder handler

* [autoparallel] modify metainfoprop

* [autoparallel] fix function typo

* [autoparallel] fix placeholder handler
2022-12-28 13:35:08 +08:00
YuliangLiu0306 78509124d3
[autoparallel] update getitem handler (#2207) 2022-12-27 19:58:32 +08:00
Jiarui Fang 1cb532ffec
[builder] multihead attn runtime building (#2203)
* [hotfix] correct cpu_optim runtime compilation

* [builder] multihead attn

* fix bug

* fix a bug
2022-12-27 16:06:09 +08:00
Tongping Liu 8e22c38b89
[hotfix] Fix the bug related to IPv6 support
Co-authored-by: ByteDance <tongping.liu@bytedance.com>
2022-12-27 12:42:46 +08:00
YuliangLiu0306 4851f2d607
[autoparallel] update_getattr_handler (#2193) 2022-12-26 21:57:39 +08:00
Jiarui Fang 5682e6d346
[hotfix] correct cpu_optim runtime compilation (#2197) 2022-12-26 16:45:14 +08:00
HELSON 2458659919
[zero] fix error for BEiT models (#2169)
* [zero] fix error for BEiT models

* [ColoParameter] add unpack operation for tuple arguments

* fix bugs

* fix chunkv2 unit testing

* add assertion for gradient state
2022-12-26 15:03:54 +08:00
Jiarui Fang 355ffb386e
[builder] unified cpu_optim fused_optim interface (#2190) 2022-12-23 20:57:41 +08:00
Jiarui Fang 9587b080ba
[builder] use runtime builder for fused_optim (#2189) 2022-12-23 17:07:03 +08:00
Jiarui Fang bc0e271e71
[builder] use builder() for cpu adam and fused optim in setup.py (#2187) 2022-12-23 16:05:13 +08:00
Jiarui Fang d42afd30f8
[builder] runtime adam and fused_optim builder (#2184) 2022-12-23 14:14:21 +08:00
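The runtime-builder commits in this stretch move kernel compilation from setup.py to first use. A generic sketch of the underlying technique using torch.utils.cpp_extension.load; the extension name and source paths below are placeholders, not the repository's actual files:

    from torch.utils.cpp_extension import load

    # JIT-compile a fused optimizer kernel on first use instead of at install time.
    # Source paths are hypothetical; a real builder would point at its own csrc files.
    fused_optim = load(
        name="fused_optim",
        sources=["csrc/fused_optim.cpp", "csrc/fused_optim_kernel.cu"],
        extra_cuda_cflags=["-O3"],
        verbose=True,
    )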
YuliangLiu0306 550f8f8905
[autoparallel] integrate_gpt_related_tests (#2134)
* [autoparallel] integrate_gpt_related_tests

* polish code

* polish code

* add GPT2Model into runtime test
2022-12-23 12:36:59 +08:00
Ziyue Jiang 59e343328d
[Pipeline Middleware ] Fix deadlock when num_microbatch=num_stage (#2156)
* add splitter

* polish code

* remove comment

* fix async nan by moving to cpu first

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-23 11:38:43 +08:00
Tongping Liu ab54fed292
[hotfix] add kwargs for colo_addmm (#2171) 2022-12-22 13:25:30 +08:00
アマデウス 622f863291
[hotfix] Jit type hint #2161 (#2164) 2022-12-22 10:17:03 +08:00
Zihao 12e7bcd720
register meta func for rnn (#2159) 2022-12-21 23:06:18 +08:00
Boyuan Yao cfe2a9bd90
[autoparallel] memory estimation for shape consistency (#2144)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo

* [autoparallel] add F.linear metainfo generator

* [autoparallel] add binary elementwise metainfo

* [fx] recover profiler

* [autoparallel] fix forward memory calculation

* [autoparallel] modify constants.py

* [autoparallel] remove redundant print

* [autoparallel] add F.conv metainfo

* [autoparallel] linear fix

* [autoparallel] memory estimation for communication actions

* [autoparallel] fix docstring

* [autoparallel] fix variables name
2022-12-21 10:39:37 +08:00
Jiarui Fang b87496a66b
[hotfix] fix auto policy of test_sharded_optim_v2 (#2157) 2022-12-20 23:03:18 +08:00
YuliangLiu0306 16335cb537
[hotfix] fix aten default bug (#2158) 2022-12-20 22:40:46 +08:00
HELSON a7d95b7024
[example] add zero1, zero2 example in GPT examples (#2146)
* [example] add zero1 and zero2 for GPT

* update readme in gpt example

* polish code

* change init value

* update readme
2022-12-20 14:30:27 +08:00
YuliangLiu0306 1cce6e36ca
[autoparallel] use metainfo in handler (#2149) 2022-12-20 10:31:22 +08:00
Jiarui Fang 2827f41898
[Gemini] GeminiDDP convert to PyTorch Module. (#2151) 2022-12-20 10:19:36 +08:00
Jiarui Fang bdef9dfdbe
[NFC] remove useless graph node code (#2150) 2022-12-20 00:33:58 +08:00
BlueRum b3f73ce1c8
[Gemini] Update coloinit_ctx to support meta_tensor (#2147) 2022-12-19 22:37:07 +08:00
Zihao a128eec9d5
register aten._convolution.default (#2137) 2022-12-18 19:27:01 +08:00
Jiarui Fang ee287620f0
[Gemini] revert ZeROInitCtx related tracer (#2138) 2022-12-16 12:37:06 +08:00
アマデウス 077a66dd81
updated attention kernel (#2133) 2022-12-16 10:54:03 +08:00
YuliangLiu0306 a3c6924deb
[autoparallel] process size nodes in runtime pass (#2130)
* [autoparallel] process size nodes in runtime pass

* polish code
2022-12-14 16:10:50 +08:00
YuliangLiu0306 536560ccc0
[autoparallel] implement softmax handler (#2132) 2022-12-14 16:09:53 +08:00
Jiarui Fang c89c66a858
[Gemini] update API of the chunkmemstatscollector. (#2129) 2022-12-14 00:47:06 +08:00
Jiarui Fang 2938edf446
[Gemini] update the non model data record method in runtime memory tracer (#2128) 2022-12-13 17:11:31 +08:00
Jiarui Fang 8fac837679
[Gemini] update non model data calculation method (#2126) 2022-12-13 15:44:07 +08:00
Jiarui Fang 5efda69735
[Gemini] hotfix the unittest bugs (#2125) 2022-12-13 14:14:55 +08:00
Jiarui Fang 05bb28aacf
[Gemini] mapping of preop timestep and param (#2124) 2022-12-13 12:50:24 +08:00
YuliangLiu0306 cd0af9f7f6
[autoparallel] gpt2lp runtime test (#2113) 2022-12-12 18:06:40 +08:00
Jiarui Fang 9214d1fe28
[Gemini] chunk init using runtime visited param order (#2115) 2022-12-12 18:06:16 +08:00
HELSON e7d3afc9cc
[optimizer] add div_scale for optimizers (#2117)
* [optimizer] add div_scale for optimizers

* [zero] use div_scale in zero optimizer

* fix testing error
2022-12-12 17:58:57 +08:00
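Conceptually, a div_scale argument lets an optimizer fold gradient unscaling into its update instead of running a separate pass over all gradients first. A plain-PyTorch sketch of the idea, not the fused kernel interface added by the commit:

    import torch

    @torch.no_grad()
    def sgd_step_with_div_scale(params, lr: float, div_scale: float = 1.0):
        # Divide each gradient by the loss scale while applying the update,
        # avoiding an extra unscaling traversal of the parameter list.
        for p in params:
            if p.grad is not None:
                p.add_(p.grad, alpha=-lr / div_scale)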
Jiarui Fang e5aa8333e4
[NFC] update chunk manager API (#2119) 2022-12-12 16:57:22 +08:00
Jiarui Fang e99edfcb51
[NFC] polish comments for Chunk class (#2116) 2022-12-12 15:39:31 +08:00
Ziyue Jiang 09d69e1c25
[PP Middleware] Add bwd and step for PP middleware (#2111)
* add bwd and step for PP middleware

* pre-commit

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-12 12:40:03 +08:00
Jiarui Fang 8afc001f4f
[Gemini] chunk init use OrderedParamGenerator (#2110) 2022-12-11 21:41:13 +08:00
HELSON 63fbba3c19
[zero] add L2 gradient clipping for ZeRO (#2112)
* [zero] add L2 gradient clipping

* [testing] add MlpModel

* [zero] add unit test for grad clipping

* fix atol
2022-12-09 18:09:17 +08:00
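For reference, norm-based (L2) gradient clipping in plain PyTorch follows the pattern below; a ZeRO implementation additionally has to reduce the per-shard norms across ranks before scaling. This is the generic pattern only, not the code added by the commit:

    import torch

    def clip_grad_l2_(parameters, max_norm: float) -> torch.Tensor:
        # Compute the global L2 norm over all gradients, then scale them down
        # if it exceeds max_norm. A sharded optimizer would all-reduce the
        # squared partial norms across ranks before taking the square root.
        grads = [p.grad for p in parameters if p.grad is not None]
        total_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2)
        clip_coef = max_norm / (total_norm + 1e-6)
        if clip_coef < 1:
            for g in grads:
                g.mul_(clip_coef)
        return total_norm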
Jiarui Fang 70a8556946
[gemini] get the param visited order during runtime (#2108) 2022-12-09 16:13:03 +08:00
Jiarui Fang 61f31c3cf0
[Gemini] NFC, polish search_chunk_configuration (#2107) 2022-12-09 15:00:39 +08:00
Jiarui Fang 8e14344ec9
[hotfix] fix a type in ColoInitContext (#2106) 2022-12-09 11:44:39 +08:00
Jiarui Fang 05545bfee9
[ColoTensor] throw error when ColoInitContext meets meta parameter. (#2105) 2022-12-09 11:39:46 +08:00
YuliangLiu0306 d87baa85d9
[autoparallel] support linear function bias addition (#2104) 2022-12-09 10:31:36 +08:00
YuliangLiu0306 0fecbb9e20
[autoparallel] support addbmm computation (#2102) 2022-12-08 21:15:11 +08:00
YuliangLiu0306 d3d4630495
[autoparallel] add sum handler (#2101) 2022-12-08 17:02:54 +08:00
Ziyue Jiang e4705ba4e2
[Pipeline Middleware] fix data race in Pipeline Scheduler for DAG (#2087)
* add DAG test case

* fix data race by adjusting the position of the lock

* polish code

* fix pytest for middleware

* remove test

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-08 13:32:27 +08:00
YuliangLiu0306 b175e6d58e
[autoparallel] add bias addition function class (#2098)
* [autoparallel] add bias addition function class

* polish code

* polish
2022-12-08 11:31:51 +08:00
YuliangLiu0306 3af7e65dea
[autoparallel] complete gpt related module search (#2097) 2022-12-08 10:04:09 +08:00
Jiarui Fang 85efb7ac2e
[Gemini] gemini use the runtime memory tracer (RMT) (#2099) 2022-12-07 23:04:02 +08:00
Super Daniel 2bf2d1cd3b
[fx] An experimental version of ColoTracer. (#2002)
* [fx] add a symbolic_trace api.

* [fx] fix import errors.

* [fx] ColoTracer experimental.
2022-12-07 18:36:17 +08:00
Jiarui Fang 4b055351b0
[Gemini] make RuntimeMemTracer work correctly (#2096) 2022-12-07 16:59:59 +08:00
YuliangLiu0306 7f72eb0510
[autoparallel]add embedding handler (#2089)
* [autoparallel] add embedding handler

* fix bugs
2022-12-07 09:41:46 +08:00
Jiarui Fang 1fca5d79ea
[Gemini] remove GLOBAL_MODEL_DATA_TRACER (#2091) 2022-12-06 22:30:16 +08:00
Jiarui Fang 28e55c2530
[Gemini] remove GLOBAL_CUDA_MEM_INFO (#2090) 2022-12-06 22:10:47 +08:00
Jiarui Fang 25abae6d7f
[Gemini] use MemStats in Runtime Memory tracer (#2088) 2022-12-06 19:48:20 +08:00
Jiarui Fang 33f4412102
[Gemini] use MemStats to store the tracing data. Separate it from Collector. (#2084) 2022-12-06 16:43:06 +08:00
Jiarui Fang 1f99205827
[Gemini] remove static tracer (#2083) 2022-12-06 12:53:58 +08:00
YuliangLiu0306 0e9db368ef
[autoparallel] add tensor constructor handler (#2082) 2022-12-06 10:20:10 +08:00
YuliangLiu0306 cdf537a648
[autoparallel] add non_split linear strategy (#2078)
* [autoparallel] add non_split linear strategy

* polish
2022-12-06 10:19:33 +08:00
Boyuan Yao cf0268da93
[autoparallel] Add F.conv metainfo (#2069)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo

* [autoparallel] add F.linear metainfo generator

* [autoparallel] add binary elementwise metainfo

* [fx] recover profiler

* [autoparallel] fix forward memory calculation

* [autoparallel] modify constants.py

* [autoparallel] remove redundant print

* [autoparallel] add F.conv metainfo

* [autoparallel] linear fix
2022-12-06 10:17:57 +08:00
YuliangLiu0306 f123476666
[autoparallel] complete gpt block searching (#2065)
* [autoparallel] complete gpt block searching

* fix test
2022-12-06 10:17:10 +08:00
Ziyue Jiang 597cdd3006
[Pipeline Middleware] Adapt scheduler for Topo (#2066)
* adapt scheduler for Topo

* remove comment

* fix set input

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-05 20:23:41 +08:00
Jiarui Fang b3b89865e2
[Gemini] ParamOpHook -> ColoParamOpHook (#2080) 2022-12-05 17:11:06 +08:00
YuliangLiu0306 677e1e20d4
[device] update flatten device mesh usage (#2079) 2022-12-05 16:16:07 +08:00
Jiarui Fang a7adad9ccb
[Gemini] rename hooks related to runtime mem tracer (#2076) 2022-12-05 15:00:03 +08:00
Jiarui Fang 223332ff7e
[Gemini] rename ParamTracerWrapper -> RuntimeMemTracer (#2073) 2022-12-05 12:45:11 +08:00
Jiarui Fang 9f828ef36f
[Gemini] remove not used MemtracerWrapper (#2072) 2022-12-05 11:57:59 +08:00
Boyuan Yao 616da17fab
[autoparallel] add binary elementwise metainfo for auto parallel (#2058)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo

* [autoparallel] add F.linear metainfo generator

* [autoparallel] add binary elementwise metainfo

* [fx] recover profiler

* [autoparallel] fix forward memory calculation

* [autoparallel] modify constants.py

* [autoparallel] remove redundant print
2022-12-04 15:18:51 +08:00
Boyuan Yao 4b40fbd743
[autoparallel] fix forward memory calculation (#2062) 2022-12-04 15:00:16 +08:00
Ziyue Jiang 44ea461890
[Pipeline] Add Topo Class (#2059)
* use Topo class to rewrite DAG

* polish code

* polish code

* polish code

* add comment

* add else to unended if

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-02 18:13:20 +08:00
YuliangLiu0306 e4293e5077
[hotfix] update test for latest version (#2060) 2022-12-02 18:12:30 +08:00
Zihao 38ea4ba1bd
[Gemini] fix grad unreleased issue and param recovery issue (#2052) 2022-12-02 16:04:19 +08:00
YuliangLiu0306 1c1fe44305
[autoparallel] adapt solver with self attention (#2037)
* [autoparallel] adapt solver with self attention

* polish code
2022-12-01 17:53:15 +08:00
Frank Lee ea74a3b9cc
[cli] updated installation check with more information (#2050)
* [cli] updated installation check with more information

* polish code

* polish code
2022-11-30 17:53:55 +08:00
HELSON f6178728a0
[gemini] fix init bugs for modules (#2047)
* [gemini] fix init bugs for modules

* fix bugs
2022-11-30 17:06:10 +08:00
Frank Lee 81e0da7fa8
[setup] supported conda-installed torch (#2048)
* [setup] supported conda-installed torch

* polish code
2022-11-30 16:45:15 +08:00
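Supporting a conda-installed torch typically means not relying on a system CUDA_HOME alone, since conda ships its cudatoolkit inside the environment prefix. A hedged sketch of that fallback; the CONDA_PREFIX layout check is an assumption, not the actual setup.py logic:

    import os
    from torch.utils.cpp_extension import CUDA_HOME

    def resolve_cuda_home() -> str:
        # Prefer the toolkit PyTorch was built against, then fall back to the
        # conda environment prefix where a conda-installed cudatoolkit lives.
        if CUDA_HOME is not None:
            return CUDA_HOME
        conda_prefix = os.environ.get("CONDA_PREFIX")
        if conda_prefix and os.path.isdir(os.path.join(conda_prefix, "include")):
            return conda_prefix
        raise RuntimeError("No CUDA toolkit found: set CUDA_HOME or install cudatoolkit via conda.")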
HELSON e37f3db40c
[gemini] add arguments (#2046)
* [zero] fix testing parameters

* [gemini] add arguments

* add docstrings
2022-11-30 16:40:13 +08:00
Zihao 6a9158f1fa
[Gemini] free and allocate cuda memory by tensor.storage, add grad hook (#2040) 2022-11-30 15:57:45 +08:00
Jiarui Fang 31c644027b
[hotfix] hotfix Gemini for no leaf modules bug (#2043) 2022-11-30 14:53:41 +08:00
HELSON a1ce02d740
[zero] test gradient accumulation (#1964)
* [zero] fix memory leak for zero2

* [zero] test gradient accumulation

* [zero] remove grad clip test
2022-11-29 13:00:30 +08:00
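The property such a test checks is that several accumulated micro-batch steps match one large-batch step. A generic sketch of the accumulation loop being exercised, in plain PyTorch rather than the ZeRO test harness:

    import torch
    import torch.nn.functional as F

    def train_with_accumulation(model, optimizer, batches, accumulation_steps: int):
        # Sum gradients over several micro-batches before a single optimizer step,
        # scaling each loss so the result matches one large-batch update.
        optimizer.zero_grad()
        for i, (x, y) in enumerate(batches):
            loss = F.mse_loss(model(x), y) / accumulation_steps
            loss.backward()
            if (i + 1) % accumulation_steps == 0:
                optimizer.step()
                optimizer.zero_grad()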
Ziyue Jiang b0936e4a44
[rpc] split with dag (#2028)
* add DAG to split_module

* add comment

* add test case for DAG

* remove print

* add DAG middleware in scheduler

* add test case for scheduler

* remove break

* recover old lifecycle

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-11-29 11:36:28 +08:00
Jiarui Fang 96134e7be3
[hotfix] add bert test for gemini fwd bwd (#2035) 2022-11-29 11:19:52 +08:00
YuliangLiu0306 0dbcd4a6f5
[autoparallel] add split handler (#2032)
* [autoparallel] add split handler

* add numerical test and runtime passes
2022-11-29 11:03:51 +08:00
Jiarui Fang 28aa9a4294
[Gemini] more rigorous unit tests for run_fwd_bwd (#2034) 2022-11-29 09:26:06 +08:00
YuliangLiu0306 81330b0352
[autoparallel] add experimental permute handler (#2029) 2022-11-27 20:26:52 +08:00
Zihao 95c4532fff
[Gemini] paramWrapper paramTracerHook unit test (#2030) 2022-11-26 13:30:24 +08:00
Jiarui Fang 8daf1b4db1
[Gemini] patch for supporting torch.add_ function for ColoTensor (#2003) 2022-11-25 20:06:35 +08:00
Ziyue Jiang 632753abbc
[fx]Split partition with DAG information (#2025)
* add DAG to split_module

* add comment

* add test case for DAG

* remove print

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-11-25 17:42:48 +08:00
YuliangLiu0306 ea0f6b8df9
[autoparallel] add runtime pass and numerical test for view handler (#2018) 2022-11-25 15:50:16 +08:00
Zihao a719b89a41
[gemini] param_trace_hook (#2020) 2022-11-24 18:08:36 +08:00
Jiarui Fang 0b0d8f9e17
[hotfix] revert bug PRs (#2016) 2022-11-24 15:28:58 +08:00
Zihao aba3db464d
[Gemini] ParamMemHook (#2008) 2022-11-24 15:22:51 +08:00
Zihao 0160a62a3c
[Gemini] param_tracer_wrapper and test case (#2009) 2022-11-24 14:40:33 +08:00
YuliangLiu0306 1438993113
[autoparallel] add experimental view handler (#2011)
* [autoparallel] add experimental view handler

* polish

* polish

* polish code

* rename variables
2022-11-24 11:34:41 +08:00
Genghan Zhang d655eea515
[autoparallel] mix gather (#1977)
* Add mix-gather

* Add comments

* Add comments

* Polish comments

* Change the global rank assumption

* Add tests

* Add two-step tests

* Fix 10 and 01

* Skip test because of the number of GPUs
2022-11-23 21:49:17 +08:00
Frank Lee 2bab6f512c
[release] release v0.1.11rc4 (#2007) 2022-11-23 17:14:32 +08:00
Boyuan Yao 6cd784ffee
[autoparallel] Add metainfo support for F.linear (#1987)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo

* [autoparallel] add F.linear metainfo generator
2022-11-23 14:12:34 +08:00
Super Daniel 2edbef13cc
[fx] add more meta_registry for MetaTensor execution. (#2000)
* [sc] add examples for auto checkpoint.

* merge upstream

* [fx] add more meta_registry for MetaTensor execution.
2022-11-23 10:55:46 +08:00
Jiarui Fang a2d3266648
[hotfix] make Gemini work for conv DNN (#1998) 2022-11-22 14:52:36 +08:00
YuliangLiu0306 155891113e
[autoparallel] use pytree map style to process data (#1989) 2022-11-21 10:44:22 +08:00
YuliangLiu0306 35e6b9ec82
[autoparallel] adapt handlers with attention block (#1990)
* [autoparallel] adapt handlers with attention block

* polish
2022-11-21 10:44:11 +08:00
YuliangLiu0306 05020e50d0
[autoparallel] support more flexible data type (#1967) 2022-11-18 17:01:06 +08:00
Boyuan Yao c26f21d365
[autoparallel] add pooling metainfo (#1968)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo
2022-11-18 15:13:03 +08:00
Jiarui Fang 3712ac7f90
[Gemini] add bert for MemtracerWrapper unittests (#1982) 2022-11-18 14:58:28 +08:00
Jiarui Fang e481489aa6
[Gemini] MemtracerWrapper unittests (#1981) 2022-11-18 14:19:40 +08:00
Jiarui Fang 31922110ad
[Gemini] memory trace hook (#1978) 2022-11-18 11:52:55 +08:00
Jiarui Fang 0529fcde06
[Gemini] independent runtime tracer (#1974) 2022-11-18 10:53:42 +08:00
YuliangLiu0306 0da1d00399
[autoparallel] support distributed dataloader option (#1906)
* [autoparallel] support distributed dataloader option

* update output handler to support ddp dataloader

* polish code
2022-11-17 20:11:53 +08:00
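The distributed dataloader option presumably maps onto the standard per-rank sampler pattern; a sketch of that pattern in plain PyTorch, which is an assumption about what the option wires up rather than the handler code itself:

    import torch.distributed as dist
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    def build_ddp_dataloader(dataset, batch_size: int) -> DataLoader:
        # Each rank reads a disjoint shard; call sampler.set_epoch(epoch) every
        # epoch so shuffling differs across epochs.
        sampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank())
        return DataLoader(dataset, batch_size=batch_size, sampler=sampler)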
Genghan Zhang 6630d45546
[autoparallel] Add alpha beta (#1973)
* Add alpha beta

* Fix test

* Fix test
2022-11-17 16:01:14 +08:00
Jiarui Fang cc0ed7cf33
[Gemini] ZeROHookV2 -> GeminiZeROHook (#1972) 2022-11-17 14:43:49 +08:00
ver217 f8a7148dec
[kernel] move all symlinks of kernel to `colossalai._C` (#1971) 2022-11-17 13:42:33 +08:00
Jiarui Fang 7e24b9b9ee
[Gemini] clean no used MemTraceOp (#1970) 2022-11-17 13:41:54 +08:00
Boyuan Yao 7c7921f71b
[autoparallel] add torch.nn.ReLU metainfo (#1868)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input
2022-11-16 23:12:31 +08:00
Jiarui Fang 8c66a1d0aa
[polish] remove useless file _mem_tracer_hook.py (#1963) 2022-11-16 15:55:10 +08:00
Jiarui Fang c4739a725a
[Gemini] polish memstats collector (#1962) 2022-11-16 15:45:57 +08:00
YuliangLiu0306 fea3cb661c
[autoparallel] support addmm in tracer and solver (#1961)
* [fx] patch addmm

* [autoparallel] support addmm in tracer and solver
2022-11-16 14:59:18 +08:00
Jiarui Fang f7e276fa71
[Gemini] add GeminiAdamOptimizer (#1960) 2022-11-16 14:44:28 +08:00
HELSON 7066dfbf82
[zero] fix memory leak for zero2 (#1955) 2022-11-16 11:43:24 +08:00
Jiarui Fang 52c6ad26e0
[ColoTensor] reconfig ColoInitContext, decouple default_pg and default_dist_spec. (#1953) 2022-11-15 16:24:16 +08:00
zbian 598d456d0e fixed logger 2022-11-15 16:00:07 +08:00
zbian 6877121377 updated flash attention api 2022-11-15 15:25:39 +08:00
YuliangLiu0306 36c0f3ea5b
[autoparallel] remove redundant comm node (#1893) 2022-11-15 10:53:41 +08:00
アマデウス e52f9d9109
[tensorparallel] fixed tp layers (#1938) 2022-11-14 17:34:03 +08:00
Jiarui Fang 9f4fb3f28a
[ColoTensor] ColoInitContext initialize parameters in shard mode. (#1937) 2022-11-14 16:05:09 +08:00
Boyuan Yao d5c5bc219e
[SC] add GPT example for auto checkpoint (#1889)
* [sc] SC tutorial for auto checkpoint

* [sc] polish examples

* [sc] polish readme

* [sc] polish readme and help information

* [sc] polish readme and help information
2022-11-11 23:17:25 +08:00
Junming Wu 14a0b18305
[NFC] polish colossalai/amp/naive_amp/__init__.py code style (#1905) 2022-11-11 17:49:18 +08:00
HELSON 6e51d296f0
[zero] migrate zero1&2 (#1878)
* add zero1&2 optimizer

* rename test directory

* rename test files

* change tolerance in test
2022-11-11 09:26:40 +08:00
Super Daniel cc55ff0aa4
[autoparallel] user-friendly API for CheckpointSolver. (#1879)
Merge for SC tutorial
2022-11-10 20:59:28 +08:00
Super Daniel 448248b27c
[fx] metainfo_trace as an API. (#1873)
* [fx] metainfo_trace as an API.

* [fx] add return.
2022-11-10 20:58:37 +08:00
Jiarui Fang 986f8cbaa7
[inference] overlap comm and compute in Linear1D_Row when stream_chunk_num > 1 (#1876) 2022-11-10 17:36:42 +08:00