Commit Graph

1366 Commits (8bccb72c8d6b4ff21d3d596f0188c6280d8b29f6)

Author SHA1 Message Date
Hongxin Liu 50793b35f4
[gemini] accelerate inference (#3641)
* [gemini] support don't scatter after inference

* [chat] update colossalai strategy

* [chat] fix opt benchmark

* [chat] update opt benchmark

* [gemini] optimize inference

* [test] add gemini inference test

* [chat] fix unit test ci

* [chat] fix ci

* [chat] fix ci

* [chat] skip checkpoint test
2023-04-26 16:32:40 +08:00
Hongxin Liu 4b3240cb59
[booster] add low level zero plugin (#3594)
* [booster] add low level zero plugin

* [booster] fix gemini plugin test

* [booster] fix precision

* [booster] add low level zero plugin test

* [test] fix booster plugin test oom

* [test] fix booster plugin test oom

* [test] fix googlenet and inception output trans

* [test] fix diffuser clip vision model

* [test] fix torchaudio_wav2vec2_base

* [test] fix low level zero plugin test
2023-04-26 14:37:25 +08:00
digger-yu b9a8dff7e5
[doc] Fix typo under colossalai and doc(#3618)
* Fixed several spelling errors under colossalai

* Fix the spelling error in colossalai and docs directory

* Cautious Changed the spelling error under the example folder

* Update runtime_preparation_pass.py

revert autograft to autograd

* Update search_chunk.py

utile to until

* Update check_installation.py

change misteach to mismatch in line 91

* Update 1D_tensor_parallel.md

revert to perceptron

* Update 2D_tensor_parallel.md

revert to perceptron in line 73

* Update 2p5D_tensor_parallel.md

revert to perceptron in line 71

* Update 3D_tensor_parallel.md

revert to perceptron in line 80

* Update README.md

revert to resnet in line 42

* Update reorder_graph.py

revert to indice in line 7

* Update p2p.py

revert to megatron in line 94

* Update initialize.py

revert to torchrun in line 198

* Update routers.py

change to detailed in line 63

* Update routers.py

change to detailed in line 146

* Update README.md

revert  random number in line 402
2023-04-26 11:38:43 +08:00
Hongxin Liu 12eff9eb4c
[gemini] state dict supports fp16 (#3590)
* [gemini] save state dict support fp16

* [gemini] save state dict shard support fp16

* [gemini] fix state dict

* [gemini] fix state dict
2023-04-19 11:01:48 +08:00
Hongxin Liu dac127d0ee
[fx] fix meta tensor registration (#3589)
* [meta] fix torch 1.13.1

* [meta] fix torch 2.0.0

* [meta] fix torch 1.13.0

* [meta] polish code
2023-04-18 16:20:36 +08:00
Hongxin Liu f313babd11
[gemini] support save state dict in shards (#3581)
* [gemini] support state dict shard

* [gemini] add test state dict shard

* [gemini] polish docstr

* [gemini] fix merge

* [gemini] polish code
2023-04-17 17:11:09 +08:00
YH d329c294ec
Add docstr for zero3 chunk search utils (#3572) 2023-04-17 12:44:17 +08:00
Hongxin Liu 173dad0562
[misc] add verbose arg for zero and op builder (#3552)
* [misc] add print verbose

* [gemini] add print verbose

* [zero] add print verbose for low level

* [misc] add print verbose for op builder
2023-04-17 11:25:35 +08:00
Hongxin Liu 4341f5e8e6
[lazyinit] fix clone and deepcopy (#3553) 2023-04-17 11:25:13 +08:00
Hongxin Liu 152239bbfa
[gemini] gemini supports lazy init (#3379)
* [gemini] fix nvme optimizer init

* [gemini] gemini supports lazy init

* [gemini] add init example

* [gemini] add fool model

* [zero] update gemini ddp

* [zero] update init example

* add chunk method

* add chunk method

* [lazyinit] fix lazy tensor tolist

* [gemini] fix buffer materialization

* [misc] remove useless file

* [booster] update gemini plugin

* [test] update gemini plugin test

* [test] fix gemini plugin test

* [gemini] fix import

* [gemini] fix import

* [lazyinit] use new metatensor

* [lazyinit] use new metatensor

* [lazyinit] fix __set__ method
2023-04-12 16:03:25 +08:00
jiangmingyan 366a035552
[checkpoint] Shard saved checkpoint need to be compatible with the naming format of hf checkpoint files (#3479)
* [checkpoint] support huggingface style sharded checkpoint, to be compatible with hf file naming format

* [checkpoint] support huggingface style sharded checkpoint, to be compatible with hf file naming format

* [checkpoint] Shard saved checkpoint add 'variant' field to customize filename

* [checkpoint] Shard saved checkpoint add 'variant' field to customize filename

* [checkpoint] Shard saved checkpoint add 'variant' field to customize filename

* [checkpoint] Shard saved checkpoint add 'variant' field to customize filename

---------

Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>
Co-authored-by: luchen <luchen@luchendeMBP.lan>
2023-04-12 16:02:17 +08:00
YH bcf0cbcbe7
[doc] Add docs for clip args in zero optim (#3504) 2023-04-10 11:11:28 +08:00
jiangmingyan 52a933e175
[checkpoint] support huggingface style sharded checkpoint (#3461)
* [checkpoint] support huggingface style sharded checkpoint

* [checkpoint] support huggingface style sharded checkpoint

* [checkpoint] support huggingface style sharded checkpoint

* [checkpoint] support huggingface style sharded checkpoint

* [checkpoint] support huggingface style sharded checkpoint

---------

Co-authored-by: luchen <luchen@luchendeMBP.lan>
2023-04-06 16:23:39 +08:00
Frank Lee 80eba05b0a
[test] refactor tests with spawn (#3452)
* [test] added spawn decorator

* polish code

* polish code

* polish code

* polish code

* polish code

* polish code
2023-04-06 14:51:35 +08:00
Frank Lee 7d8d825681
[booster] fixed the torch ddp plugin with the new checkpoint api (#3442) 2023-04-06 09:43:51 +08:00
YH 8f740deb53
Fix typo (#3448) 2023-04-06 09:43:31 +08:00
Hakjin Lee 46c009dba4
[format] Run lint on colossalai.engine (#3367) 2023-04-05 23:24:43 +08:00
YuliangLiu0306 ffcdbf0f65
[autoparallel]integrate auto parallel feature with new tracer (#3408)
* [autoparallel] integrate new analyzer in module level

* unify the profiling method

* polish

* fix no codegen bug

* fix pass bug

* fix liveness test

* polish
2023-04-04 17:40:45 +08:00
ver217 573af84184
[example] update examples related to zero/gemini (#3431)
* [zero] update legacy import

* [zero] update examples

* [example] fix opt tutorial

* [example] fix opt tutorial

* [example] fix opt tutorial

* [example] fix opt tutorial

* [example] fix import
2023-04-04 17:32:51 +08:00
Frank Lee 1beb85cc25
[checkpoint] refactored the API and added safetensors support (#3427)
* [checkpoint] refactored the API and added safetensors support

* polish code
2023-04-04 15:23:01 +08:00
ver217 26b7aac0be
[zero] reorganize zero/gemini folder structure (#3424)
* [zero] refactor low-level zero folder structure

* [zero] fix legacy zero import path

* [zero] fix legacy zero import path

* [zero] remove useless import

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor legacy zero import path

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor gemini folder structure

* [zero] refactor legacy zero import path

* [zero] fix test import path

* [zero] fix test

* [zero] fix circular import

* [zero] update import
2023-04-04 13:48:16 +08:00
Frank Lee 638a07a7f9
[test] fixed gemini plugin test (#3411)
* [test] fixed gemini plugin test

* polish code

* polish code
2023-04-03 17:12:22 +08:00
ver217 5f2e34e6c9
[booster] implement Gemini plugin (#3352)
* [booster] add gemini plugin

* [booster] update docstr

* [booster] gemini plugin add coloparam convertor

* [booster] fix coloparam convertor

* [booster] fix gemini plugin device

* [booster] add gemini plugin test

* [booster] gemini plugin ignore sync bn

* [booster] skip some model

* [booster] skip some model

* [booster] modify test world size

* [booster] modify test world size

* [booster] skip test
2023-03-31 16:06:13 +08:00
HELSON 1a1d68b053
[moe] add checkpoint for moe models (#3354)
* [moe] add checkpoint for moe models

* [hotfix] fix bugs in unit test
2023-03-31 09:20:33 +08:00
YuliangLiu0306 fee2af8610
[autoparallel] adapt autoparallel with new analyzer (#3261)
* [autoparallel] adapt autoparallel with new analyzer

* fix all node handler tests

* polish

* polish
2023-03-30 17:47:24 +08:00
Ofey Chan 8706a8c66c
[NFC] polish colossalai/engine/gradient_handler/__init__.py code style (#3329) 2023-03-30 14:19:39 +08:00
yuxuan-lou 198a74b9fd
[NFC] polish colossalai/context/random/__init__.py code style (#3327) 2023-03-30 14:19:26 +08:00
YuliangLiu0306 fbd2a9e05b [hotfix] meta_tensor_compatibility_with_torch2 2023-03-30 13:43:01 +08:00
Michelle ad285e1656
[NFC] polish colossalai/fx/tracer/_tracer_utils.py (#3323)
* [NFC] polish colossalai/engine/schedule/_pipeline_schedule.py code style

* [NFC] polish colossalai/fx/tracer/_tracer_utils.py  code style

---------

Co-authored-by: Qianran Ma <qianranm@luchentech.com>
2023-03-29 17:53:32 +08:00
Xu Kai 64350029fe [NFC] polish colossalai/gemini/paramhooks/_param_hookmgr.py code style 2023-03-29 15:47:42 +08:00
RichardoLuo 1ce9d0c531 [NFC] polish initializer_data.py code style (#3287) 2023-03-29 15:22:21 +08:00
Ziheng Qin 1bed38ef37 [NFC] polish colossalai/cli/benchmark/models.py code style (#3290) 2023-03-29 15:22:21 +08:00
Kai Wang (Victor Kai) 964a28678f [NFC] polish initializer_3d.py code style (#3279) 2023-03-29 15:22:21 +08:00
Sze-qq 94eec1c5ad [NFC] polish colossalai/engine/gradient_accumulation/_gradient_accumulation.py code style (#3277)
Co-authored-by: siqi <siqi@siqis-MacBook-Pro.local>
2023-03-29 15:22:21 +08:00
Arsmart1 8af977f223 [NFC] polish colossalai/context/parallel_context.py code style (#3276) 2023-03-29 15:22:21 +08:00
Zirui Zhu 1168b50e33 [NFC] polish colossalai/engine/schedule/_pipeline_schedule_v2.py code style (#3275) 2023-03-29 15:22:21 +08:00
Tong Li 196d4696d0 [NFC] polish colossalai/nn/_ops/addmm.py code style (#3274) 2023-03-29 15:22:21 +08:00
lucasliunju 4b95464994 [NFC] polish colossalai/amp/__init__.py code style (#3272) 2023-03-29 15:22:21 +08:00
Xuanlei Zhao 6b3bb2c249 [NFC] polish code style (#3273) 2023-03-29 15:22:21 +08:00
CZYCW 4cadb25b96 [NFC] policy colossalai/fx/proxy.py code style (#3269) 2023-03-29 15:22:21 +08:00
Yuanchen d58fa705b2 [NFC] polish code style (#3268)
Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
2023-03-29 15:22:21 +08:00
Camille Zhong c4a226b729 [NFC] polish tensor_placement_policy.py code style (#3265) 2023-03-29 15:22:21 +08:00
CsRic 00778abc48 [NFC] polish colossalai/fx/passes/split_module.py code style (#3263)
Co-authored-by: csric <richcsr256@gmail.com>
2023-03-29 15:22:21 +08:00
jiangmingyan 488f37048c [NFC] polish colossalai/global_variables.py code style (#3259)
Co-authored-by: luchen <luchen@luchendeMBP.lan>
2023-03-29 15:22:21 +08:00
LuGY 1ff7d5bfa5 [NFC] polish colossalai/engine/gradient_handler/_moe_gradient_handler.py (#3260) 2023-03-29 15:22:21 +08:00
dayellow 204ca2f09a [NFC] polish colossalai/fx/profiler/experimental/profiler_module/embedding.py code style (#3256)
Co-authored-by: Minghao Huang <huangminghao@luchentech.com>
2023-03-29 15:22:21 +08:00
HELSON 02b058032d
[fx] meta registration compatibility (#3253)
* [fx] meta registration compatibility

* fix error
2023-03-27 15:22:17 +08:00
Frank Lee 73d3e4d309
[booster] implemented the torch ddd + resnet example (#3232)
* [booster] implemented the torch ddd + resnet example

* polish code
2023-03-27 10:24:14 +08:00
YH 1a229045af
Add interface for colo tesnor dp size (#3227) 2023-03-27 09:42:21 +08:00
YuliangLiu0306 4d5d8f98a4
[API] implement device mesh manager (#3221)
* [API] implement  device mesh manager

* polish
2023-03-24 13:39:12 +08:00