Hongxin Liu
50793b35f4
[gemini] accelerate inference ( #3641 )
...
* [gemini] support not scattering after inference
* [chat] update colossalai strategy
* [chat] fix opt benchmark
* [chat] update opt benchmark
* [gemini] optimize inference
* [test] add gemini inference test
* [chat] fix unit test ci
* [chat] fix ci
* [chat] skip checkpoint test
2023-04-26 16:32:40 +08:00
Hongxin Liu
4b3240cb59
[booster] add low level zero plugin ( #3594 )
...
* [booster] add low level zero plugin
* [booster] fix gemini plugin test
* [booster] fix precision
* [booster] add low level zero plugin test
* [test] fix booster plugin test oom
* [test] fix googlenet and inception output trans
* [test] fix diffuser clip vision model
* [test] fix torchaudio_wav2vec2_base
* [test] fix low level zero plugin test
2023-04-26 14:37:25 +08:00
digger-yu
b9a8dff7e5
[doc] Fix typo under colossalai and doc ( #3618 )
...
* Fixed several spelling errors under colossalai
* Fix the spelling errors in the colossalai and docs directories
* Cautiously changed the spelling errors under the example folder
* Update runtime_preparation_pass.py
revert autograft to autograd
* Update search_chunk.py
utile to until
* Update check_installation.py
change misteach to mismatch in line 91
* Update 1D_tensor_parallel.md
revert to perceptron
* Update 2D_tensor_parallel.md
revert to perceptron in line 73
* Update 2p5D_tensor_parallel.md
revert to perceptron in line 71
* Update 3D_tensor_parallel.md
revert to perceptron in line 80
* Update README.md
revert to resnet in line 42
* Update reorder_graph.py
revert to indice in line 7
* Update p2p.py
revert to megatron in line 94
* Update initialize.py
revert to torchrun in line 198
* Update routers.py
change to detailed in line 63
* Update routers.py
change to detailed in line 146
* Update README.md
revert random number in line 402
2023-04-26 11:38:43 +08:00
Hongxin Liu
12eff9eb4c
[gemini] state dict supports fp16 ( #3590 )
...
* [gemini] save state dict support fp16
* [gemini] save state dict shard support fp16
* [gemini] fix state dict
2023-04-19 11:01:48 +08:00
Hongxin Liu
dac127d0ee
[fx] fix meta tensor registration ( #3589 )
...
* [meta] fix torch 1.13.1
* [meta] fix torch 2.0.0
* [meta] fix torch 1.13.0
* [meta] polish code
2023-04-18 16:20:36 +08:00
Hongxin Liu
f313babd11
[gemini] support save state dict in shards ( #3581 )
...
* [gemini] support state dict shard
* [gemini] add test state dict shard
* [gemini] polish docstr
* [gemini] fix merge
* [gemini] polish code
2023-04-17 17:11:09 +08:00
YH
d329c294ec
Add docstr for zero3 chunk search utils ( #3572 )
2023-04-17 12:44:17 +08:00
Hongxin Liu
173dad0562
[misc] add verbose arg for zero and op builder ( #3552 )
...
* [misc] add print verbose
* [gemini] add print verbose
* [zero] add print verbose for low level
* [misc] add print verbose for op builder
2023-04-17 11:25:35 +08:00
Hongxin Liu
4341f5e8e6
[lazyinit] fix clone and deepcopy ( #3553 )
2023-04-17 11:25:13 +08:00
Hongxin Liu
152239bbfa
[gemini] gemini supports lazy init ( #3379 )
...
* [gemini] fix nvme optimizer init
* [gemini] gemini supports lazy init
* [gemini] add init example
* [gemini] add fool model
* [zero] update gemini ddp
* [zero] update init example
* add chunk method
* [lazyinit] fix lazy tensor tolist
* [gemini] fix buffer materialization
* [misc] remove useless file
* [booster] update gemini plugin
* [test] update gemini plugin test
* [test] fix gemini plugin test
* [gemini] fix import
* [lazyinit] use new metatensor
* [lazyinit] fix __set__ method
2023-04-12 16:03:25 +08:00
jiangmingyan
366a035552
[checkpoint] Sharded checkpoints need to be compatible with the naming format of hf checkpoint files ( #3479 )
...
* [checkpoint] support huggingface style sharded checkpoint, to be compatible with hf file naming format
* [checkpoint] add 'variant' field to sharded checkpoint saving to customize filename
---------
Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>
Co-authored-by: luchen <luchen@luchendeMBP.lan>
2023-04-12 16:02:17 +08:00
YH
bcf0cbcbe7
[doc] Add docs for clip args in zero optim ( #3504 )
2023-04-10 11:11:28 +08:00
jiangmingyan
52a933e175
[checkpoint] support huggingface style sharded checkpoint ( #3461 )
...
* [checkpoint] support huggingface style sharded checkpoint
---------
Co-authored-by: luchen <luchen@luchendeMBP.lan>
2023-04-06 16:23:39 +08:00
Frank Lee
80eba05b0a
[test] refactor tests with spawn ( #3452 )
...
* [test] added spawn decorator
* polish code
2023-04-06 14:51:35 +08:00
Frank Lee
7d8d825681
[booster] fixed the torch ddp plugin with the new checkpoint api ( #3442 )
2023-04-06 09:43:51 +08:00
YH
8f740deb53
Fix typo ( #3448 )
2023-04-06 09:43:31 +08:00
Hakjin Lee
46c009dba4
[format] Run lint on colossalai.engine ( #3367 )
2023-04-05 23:24:43 +08:00
YuliangLiu0306
ffcdbf0f65
[autoparallel] integrate auto parallel feature with new tracer ( #3408 )
...
* [autoparallel] integrate new analyzer in module level
* unify the profiling method
* polish
* fix no codegen bug
* fix pass bug
* fix liveness test
* polish
2023-04-04 17:40:45 +08:00
ver217
573af84184
[example] update examples related to zero/gemini ( #3431 )
...
* [zero] update legacy import
* [zero] update examples
* [example] fix opt tutorial
* [example] fix import
2023-04-04 17:32:51 +08:00
Frank Lee
1beb85cc25
[checkpoint] refactored the API and added safetensors support ( #3427 )
...
* [checkpoint] refactored the API and added safetensors support
* polish code
2023-04-04 15:23:01 +08:00
ver217
26b7aac0be
[zero] reorganize zero/gemini folder structure ( #3424 )
...
* [zero] refactor low-level zero folder structure
* [zero] fix legacy zero import path
* [zero] fix legacy zero import path
* [zero] remove useless import
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] fix test import path
* [zero] fix test
* [zero] fix circular import
* [zero] update import
2023-04-04 13:48:16 +08:00
Frank Lee
638a07a7f9
[test] fixed gemini plugin test ( #3411 )
...
* [test] fixed gemini plugin test
* polish code
2023-04-03 17:12:22 +08:00
ver217
5f2e34e6c9
[booster] implement Gemini plugin ( #3352 )
...
* [booster] add gemini plugin
* [booster] update docstr
* [booster] gemini plugin add coloparam convertor
* [booster] fix coloparam convertor
* [booster] fix gemini plugin device
* [booster] add gemini plugin test
* [booster] gemini plugin ignore sync bn
* [booster] skip some model
* [booster] modify test world size
* [booster] skip test
2023-03-31 16:06:13 +08:00
HELSON
1a1d68b053
[moe] add checkpoint for moe models ( #3354 )
...
* [moe] add checkpoint for moe models
* [hotfix] fix bugs in unit test
2023-03-31 09:20:33 +08:00
YuliangLiu0306
fee2af8610
[autoparallel] adapt autoparallel with new analyzer ( #3261 )
...
* [autoparallel] adapt autoparallel with new analyzer
* fix all node handler tests
* polish
2023-03-30 17:47:24 +08:00
Ofey Chan
8706a8c66c
[NFC] polish colossalai/engine/gradient_handler/__init__.py code style ( #3329 )
2023-03-30 14:19:39 +08:00
yuxuan-lou
198a74b9fd
[NFC] polish colossalai/context/random/__init__.py code style ( #3327 )
2023-03-30 14:19:26 +08:00
YuliangLiu0306
fbd2a9e05b
[hotfix] meta_tensor_compatibility_with_torch2
2023-03-30 13:43:01 +08:00
Michelle
ad285e1656
[NFC] polish colossalai/fx/tracer/_tracer_utils.py ( #3323 )
...
* [NFC] polish colossalai/engine/schedule/_pipeline_schedule.py code style
* [NFC] polish colossalai/fx/tracer/_tracer_utils.py code style
---------
Co-authored-by: Qianran Ma <qianranm@luchentech.com>
2023-03-29 17:53:32 +08:00
Xu Kai
64350029fe
[NFC] polish colossalai/gemini/paramhooks/_param_hookmgr.py code style
2023-03-29 15:47:42 +08:00
RichardoLuo
1ce9d0c531
[NFC] polish initializer_data.py code style ( #3287 )
2023-03-29 15:22:21 +08:00
Ziheng Qin
1bed38ef37
[NFC] polish colossalai/cli/benchmark/models.py code style ( #3290 )
2023-03-29 15:22:21 +08:00
Kai Wang (Victor Kai)
964a28678f
[NFC] polish initializer_3d.py code style ( #3279 )
2023-03-29 15:22:21 +08:00
Sze-qq
94eec1c5ad
[NFC] polish colossalai/engine/gradient_accumulation/_gradient_accumulation.py code style ( #3277 )
...
Co-authored-by: siqi <siqi@siqis-MacBook-Pro.local>
2023-03-29 15:22:21 +08:00
Arsmart1
8af977f223
[NFC] polish colossalai/context/parallel_context.py code style ( #3276 )
2023-03-29 15:22:21 +08:00
Zirui Zhu
1168b50e33
[NFC] polish colossalai/engine/schedule/_pipeline_schedule_v2.py code style ( #3275 )
2023-03-29 15:22:21 +08:00
Tong Li
196d4696d0
[NFC] polish colossalai/nn/_ops/addmm.py code style ( #3274 )
2023-03-29 15:22:21 +08:00
lucasliunju
4b95464994
[NFC] polish colossalai/amp/__init__.py code style ( #3272 )
2023-03-29 15:22:21 +08:00
Xuanlei Zhao
6b3bb2c249
[NFC] polish code style ( #3273 )
2023-03-29 15:22:21 +08:00
CZYCW
4cadb25b96
[NFC] polish colossalai/fx/proxy.py code style ( #3269 )
2023-03-29 15:22:21 +08:00
Yuanchen
d58fa705b2
[NFC] polish code style ( #3268 )
...
Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
2023-03-29 15:22:21 +08:00
Camille Zhong
c4a226b729
[NFC] polish tensor_placement_policy.py code style ( #3265 )
2023-03-29 15:22:21 +08:00
CsRic
00778abc48
[NFC] polish colossalai/fx/passes/split_module.py code style ( #3263 )
...
Co-authored-by: csric <richcsr256@gmail.com>
2023-03-29 15:22:21 +08:00
jiangmingyan
488f37048c
[NFC] polish colossalai/global_variables.py code style ( #3259 )
...
Co-authored-by: luchen <luchen@luchendeMBP.lan>
2023-03-29 15:22:21 +08:00
LuGY
1ff7d5bfa5
[NFC] polish colossalai/engine/gradient_handler/_moe_gradient_handler.py ( #3260 )
2023-03-29 15:22:21 +08:00
dayellow
204ca2f09a
[NFC] polish colossalai/fx/profiler/experimental/profiler_module/embedding.py code style ( #3256 )
...
Co-authored-by: Minghao Huang <huangminghao@luchentech.com>
2023-03-29 15:22:21 +08:00
HELSON
02b058032d
[fx] meta registration compatibility ( #3253 )
...
* [fx] meta registration compatibility
* fix error
2023-03-27 15:22:17 +08:00
Frank Lee
73d3e4d309
[booster] implemented the torch ddp + resnet example ( #3232 )
...
* [booster] implemented the torch ddp + resnet example
* polish code
2023-03-27 10:24:14 +08:00
YH
1a229045af
Add interface for colo tensor dp size ( #3227 )
2023-03-27 09:42:21 +08:00
YuliangLiu0306
4d5d8f98a4
[API] implement device mesh manager ( #3221 )
...
* [API] implement device mesh manager
* polish
2023-03-24 13:39:12 +08:00