Commit Graph

1622 Commits (b5f9e37c709656b286940f1b5e05abddfa257e3d)

Author SHA1 Message Date
Frank Lee 1fb0d95df0 [shardformer] made tensor parallelism configurable (#4144)
1 year ago
Frank Lee 74257cb446 [shardformer] refactored some doc and api (#4137)
1 year ago
jiangmingyan 7f9b30335b [shardformer] write an shardformer example with bert finetuning (#4126)
1 year ago
Frank Lee ae035d305d [shardformer] added embedding gradient check (#4124)
1 year ago
Frank Lee 44a190e6ac [shardformer] import huggingface implicitly (#4101)
1 year ago
Frank Lee 6a88bae4ec [shardformer] integrate with data parallelism (#4103)
1 year ago
Frank Lee f3b6aaa6b7 [shardformer] supported fused normalization (#4112)
1 year ago
Frank Lee b1c2901530 [shardformer] supported bloom model (#4098)
1 year ago
Kun Lin 8af29ee47a [shardformer] support vision transformer (#4096)
1 year ago
jiangmingyan ac80937138 [shardformer] shardformer support opt models (#4091)
1 year ago
Frank Lee d33a44e8c3 [shardformer] refactored layernorm (#4086)
1 year ago
Frank Lee c4b1b65931 [test] fixed tests failed due to dtensor change (#4082)
1 year ago
FoolPlayer 92f6791095 [shardformer] Add layernorm (#4072)
1 year ago
Frank Lee 70c58cfd4f [shardformer] supported fused qkv checkpoint (#4073)
1 year ago
FoolPlayer 0803a61412 [shardformer] add linearconv1d test (#4067)
1 year ago
Frank Lee 8eb09a4c69 [shardformer] support module saving and loading (#4062)
1 year ago
FoolPlayer 7740c55c55 support kit use for bert/gpt test (#4055)
1 year ago
Frank Lee f22ddacef0 [shardformer] refactored the shardformer layer structure (#4053)
1 year ago
Frank Lee 58df720570 [shardformer] adapted T5 and LLaMa test to use kit (#4049)
1 year ago
FoolPlayer 4021b9a8a2 [shardformer] add gpt2 test and layer class refactor (#4041)
1 year ago
Frank Lee d857f3dbba [shardformer] supported T5 and its variants (#4045)
1 year ago
Frank Lee c1d5453e9f [shardformer] adapted llama to the new API (#4036)
1 year ago
FoolPlayer 74d176c8d8 [shardformer] fix bert and gpt downstream with new api (#4024)
1 year ago
Frank Lee e253a07007 [shardformer] updated doc (#4016)
1 year ago
FoolPlayer df018fc305 support bert with new api
1 year ago
FoolPlayer 507c0ad368 add vocabembedding layer
1 year ago
Frank Lee 45d9384346 [shardformer] removed inplace tensor sharding (#4018)
1 year ago
Frank Lee 3893fa1a8d [shardformer] refactored embedding and dropout to parallel module (#4013)
1 year ago
FoolPlayer dfca9678fa integrate with dist layer (#4011)
1 year ago
Frank Lee 015af592f8 [shardformer] integrated linear 1D with dtensor (#3996)
1 year ago
FoolPlayer d3bc530849 [shardformer] Refactor shardformer api (#4001)
1 year ago
Frank Lee 611971248c [device] support init device mesh from process group (#3990)
1 year ago
FoolPlayer a2f9af810d [shardformer] fix an error in readme (#3988)
1 year ago
FoolPlayer f7774ec0f3 [Shardformer] Downstream bert (#3979)
1 year ago
wukong1992 c1c672d0f0 [shardformer] shardformer support t5 model (#3994)
1 year ago
wukong1992 6b30dfb7ce [shardformer] support llama model using shardformer (#3969)
1 year ago
FoolPlayer 45927d5527 [shardformer] Add dropout layer in shard model and refactor policy api (#3949)
1 year ago
FoolPlayer a73130482d [shardformer] Unit test (#3928)
1 year ago
FoolPlayer f1cb5ac6bf [shardformer] Align bert value (#3907)
1 year ago
FoolPlayer 79f8d5d54b [shardformer] add gpt2 policy and modify shard and slicer to support (#3883)
1 year ago
FoolPlayer 70173e3123 update README (#3909)
1 year ago
FoolPlayer ab8a47f830 [shardformer] add Dropout layer support different dropout pattern (#3856)
1 year ago
FoolPlayer c594dc2f1c [shardformer] update readme with modules implement doc (#3834)
1 year ago
Frank Lee 4972e1f40e [shardformer] refactored the user api (#3828)
1 year ago
Frank Lee 235792f170 [shardformer] updated readme (#3827)
1 year ago
FoolPlayer 8cc11235c0 [shardformer]: Feature/shardformer, add some docstring and readme (#3816)
1 year ago
FoolPlayer 8d68de767d [shardformer] init shardformer code structure (#3731)
1 year ago
Baizhou Zhang 1350ece492 [hotfix] fix import bug in checkpoint_io (#4142)
1 year ago
digger yu 8abc87798f fix Tensor is not defined (#4129)
1 year ago
digger yu 7e46bc87b6 fix CheckpointIndexFile is not defined (#4109)
1 year ago
digger yu 09fe9dc704 [nfc]fix ColossalaiOptimizer is not defined (#4122)
1 year ago
Frank Lee 95e95b6d58 [testing] move pytest to be inside the function (#4087)
1 year ago
Baizhou Zhang 0bb0b481b4 [gemini] fix argument naming during chunk configuration searching
1 year ago
github-actions[bot] a52f62082d [format] applied code formatting on changed files in pull request 4021 (#4022)
1 year ago
Baizhou Zhang 822c3d4d66 [checkpointio] sharded optimizer checkpoint for DDP plugin (#4002)
1 year ago
Wenhao Chen 725af3eeeb [booster] make optimizer argument optional for boost (#3993)
1 year ago
Baizhou Zhang c9cff7e7fa [checkpointio] General Checkpointing of Sharded Optimizers (#3984)
1 year ago
Frank Lee 71fe52769c [gemini] fixed the gemini checkpoint io (#3934)
1 year ago
Frank Lee ddcf58cacf Revert "[sync] sync feature/shardformer with develop"
1 year ago
FoolPlayer 24651fdd4f Merge pull request #3931 from FrankLeeeee/sync/develop-to-shardformer
1 year ago
FoolPlayer ef1537759c [shardformer] add gpt2 policy and modify shard and slicer to support (#3883)
1 year ago
FoolPlayer 6370a935f6 update README (#3909)
1 year ago
FoolPlayer 21a3915c98 [shardformer] add Dropout layer support different dropout pattern (#3856)
1 year ago
FoolPlayer 997544c1f9 [shardformer] update readme with modules implement doc (#3834)
1 year ago
Frank Lee 537a52b7a2 [shardformer] refactored the user api (#3828)
1 year ago
Frank Lee bc19024bf9 [shardformer] updated readme (#3827)
1 year ago
FoolPlayer 58f6432416 [shardformer]: Feature/shardformer, add some docstring and readme (#3816)
1 year ago
FoolPlayer 6a69b44dfc [shardformer] init shardformer code structure (#3731)
1 year ago
Frank Lee eb39154d40 [dtensor] updated api and doc (#3845)
1 year ago
digger yu de0d7df33f [nfc] fix typo colossalai/zero (#3923)
1 year ago
digger yu a9d1cadc49 fix typo with colossalai/trainer utils zero (#3908)
1 year ago
Frank Lee d51e83d642 Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop
1 year ago
Hongxin Liu 9c88b6cbd1 [lazy] fix compatibility problem on torch 1.13 (#3911)
1 year ago
digger yu 0e484e6201 [nfc]fix typo colossalai/pipeline tensor nn (#3899)
2 years ago
Baizhou Zhang c1535ccbba [doc] fix docs about booster api usage (#3898)
2 years ago
digger yu 1878749753 [nfc] fix typo colossalai/nn (#3887)
2 years ago
Hongxin Liu ae02d4e4f7 [bf16] add bf16 support (#3882)
2 years ago
Liu Ziming 8065cc5fba Modify torch version requirement to adapt torch 2.0 (#3896)
2 years ago
Hongxin Liu dbb32692d2 [lazy] refactor lazy init (#3891)
2 years ago
digger yu 70c8cdecf4 [nfc] fix typo colossalai/cli fx kernel (#3847)
2 years ago
digger yu e2d81eba0d [nfc] fix typo colossalai/ applications/ (#3831)
2 years ago
wukong1992 3229f93e30 [booster] add warning for torch fsdp plugin doc (#3833)
2 years ago
Hongxin Liu 7c9f2ed6dd [dtensor] polish sharding spec docstring (#3838)
2 years ago
digger yu 7f8203af69 fix typo colossalai/auto_parallel autochunk fx/passes etc. (#3808)
2 years ago
wukong1992 6b305a99d6 [booster] torch fsdp fix ckpt (#3788)
2 years ago
digger yu 9265f2d4d7 [NFC]fix typo colossalai/auto_parallel nn utils etc. (#3779)
2 years ago
jiangmingyan e871e342b3 [API] add docstrings and initialization to apex amp, naive amp (#3783)
2 years ago
Frank Lee f5c425c898 fixed the example docstring for booster (#3795)
2 years ago
Hongxin Liu 72688adb2f [doc] add booster docstring and fix autodoc (#3789)
2 years ago
Hongxin Liu 3c07a2846e [plugin] a workaround for zero plugins' optimizer checkpoint (#3780)
2 years ago
Hongxin Liu 60e6a154bc [doc] add tutorial for booster checkpoint (#3785)
2 years ago
digger yu 32f81f14d4 [NFC] fix typo colossalai/amp auto_parallel autochunk (#3756)
2 years ago
Hongxin Liu 5452df63c5 [plugin] torch ddp plugin supports sharded model checkpoint (#3775)
2 years ago
jiangmingyan 2703a37ac9 [amp] Add naive amp demo (#3774)
2 years ago
digger yu 1baeb39c72 [NFC] fix typo with colossalai/auto_parallel/tensor_shard (#3742)
2 years ago
wukong1992 b37797ed3d [booster] support torch fsdp plugin in booster (#3697)
2 years ago
digger-yu ad6460cf2c [NFC] fix typo applications/ and colossalai/ (#3735)
2 years ago
digger-yu b7141c36dd [CI] fix some spelling errors (#3707)
2 years ago
jiangmingyan 20068ba188 [booster] add tests for ddp and low level zero's checkpointio (#3715)
2 years ago
Hongxin Liu 6552cbf8e1 [booster] fix no_sync method (#3709)
2 years ago
Hongxin Liu 3bf09efe74 [booster] update prepare dataloader method for plugin (#3706)
2 years ago
Hongxin Liu f83ea813f5 [example] add train resnet/vit with booster example (#3694)
2 years ago
YH 2629f9717d [tensor] Refactor handle_trans_spec in DistSpecManager
2 years ago
Hongxin Liu d0915f54f4 [booster] refactor all dp fashion plugins (#3684)
2 years ago
jiangmingyan 307894f74d [booster] gemini plugin support shard checkpoint (#3610)
2 years ago
YH a22407cc02 [zero] Suggests a minor change to confusing variable names in the ZeRO optimizer. (#3173)
2 years ago
Hongxin Liu 50793b35f4 [gemini] accelerate inference (#3641)
2 years ago
Hongxin Liu 4b3240cb59 [booster] add low level zero plugin (#3594)
2 years ago
digger-yu b9a8dff7e5 [doc] Fix typo under colossalai and doc(#3618)
2 years ago
Hongxin Liu 12eff9eb4c [gemini] state dict supports fp16 (#3590)
2 years ago
Hongxin Liu dac127d0ee [fx] fix meta tensor registration (#3589)
2 years ago
Hongxin Liu f313babd11 [gemini] support save state dict in shards (#3581)
2 years ago
YH d329c294ec Add docstr for zero3 chunk search utils (#3572)
2 years ago
Hongxin Liu 173dad0562 [misc] add verbose arg for zero and op builder (#3552)
2 years ago
Hongxin Liu 4341f5e8e6 [lazyinit] fix clone and deepcopy (#3553)
2 years ago
Hongxin Liu 152239bbfa [gemini] gemini supports lazy init (#3379)
2 years ago
jiangmingyan 366a035552 [checkpoint] Shard saved checkpoint need to be compatible with the naming format of hf checkpoint files (#3479)
2 years ago
YH bcf0cbcbe7 [doc] Add docs for clip args in zero optim (#3504)
2 years ago
jiangmingyan 52a933e175 [checkpoint] support huggingface style sharded checkpoint (#3461)
2 years ago
Frank Lee 80eba05b0a [test] refactor tests with spawn (#3452)
2 years ago
Frank Lee 7d8d825681 [booster] fixed the torch ddp plugin with the new checkpoint api (#3442)
2 years ago
YH 8f740deb53 Fix typo (#3448)
2 years ago
Hakjin Lee 46c009dba4 [format] Run lint on colossalai.engine (#3367)
2 years ago
YuliangLiu0306 ffcdbf0f65 [autoparallel]integrate auto parallel feature with new tracer (#3408)
2 years ago
ver217 573af84184 [example] update examples related to zero/gemini (#3431)
2 years ago
Frank Lee 1beb85cc25 [checkpoint] refactored the API and added safetensors support (#3427)
2 years ago
ver217 26b7aac0be [zero] reorganize zero/gemini folder structure (#3424)
2 years ago
Frank Lee 638a07a7f9 [test] fixed gemini plugin test (#3411)
2 years ago
ver217 5f2e34e6c9 [booster] implement Gemini plugin (#3352)
2 years ago
HELSON 1a1d68b053 [moe] add checkpoint for moe models (#3354)
2 years ago
YuliangLiu0306 fee2af8610 [autoparallel] adapt autoparallel with new analyzer (#3261)
2 years ago
Ofey Chan 8706a8c66c [NFC] polish colossalai/engine/gradient_handler/__init__.py code style (#3329)
2 years ago
yuxuan-lou 198a74b9fd [NFC] polish colossalai/context/random/__init__.py code style (#3327)
2 years ago
YuliangLiu0306 fbd2a9e05b [hotfix] meta_tensor_compatibility_with_torch2
2 years ago
Michelle ad285e1656 [NFC] polish colossalai/fx/tracer/_tracer_utils.py (#3323)
2 years ago
Xu Kai 64350029fe [NFC] polish colossalai/gemini/paramhooks/_param_hookmgr.py code style
2 years ago
RichardoLuo 1ce9d0c531 [NFC] polish initializer_data.py code style (#3287)
2 years ago
Ziheng Qin 1bed38ef37 [NFC] polish colossalai/cli/benchmark/models.py code style (#3290)
2 years ago
Kai Wang (Victor Kai) 964a28678f [NFC] polish initializer_3d.py code style (#3279)
2 years ago
Sze-qq 94eec1c5ad [NFC] polish colossalai/engine/gradient_accumulation/_gradient_accumulation.py code style (#3277)
2 years ago
Arsmart1 8af977f223 [NFC] polish colossalai/context/parallel_context.py code style (#3276)
2 years ago
Zirui Zhu 1168b50e33 [NFC] polish colossalai/engine/schedule/_pipeline_schedule_v2.py code style (#3275)
2 years ago
Tong Li 196d4696d0 [NFC] polish colossalai/nn/_ops/addmm.py code style (#3274)
2 years ago
lucasliunju 4b95464994 [NFC] polish colossalai/amp/__init__.py code style (#3272)
2 years ago
Xuanlei Zhao 6b3bb2c249 [NFC] polish code style (#3273)
2 years ago
CZYCW 4cadb25b96 [NFC] policy colossalai/fx/proxy.py code style (#3269)
2 years ago
Yuanchen d58fa705b2 [NFC] polish code style (#3268)
2 years ago
Camille Zhong c4a226b729 [NFC] polish tensor_placement_policy.py code style (#3265)
2 years ago
CsRic 00778abc48 [NFC] polish colossalai/fx/passes/split_module.py code style (#3263)
2 years ago
jiangmingyan 488f37048c [NFC] polish colossalai/global_variables.py code style (#3259)
2 years ago
LuGY 1ff7d5bfa5 [NFC] polish colossalai/engine/gradient_handler/_moe_gradient_handler.py (#3260)
2 years ago
dayellow 204ca2f09a [NFC] polish colossalai/fx/profiler/experimental/profiler_module/embedding.py code style (#3256)
2 years ago
HELSON 02b058032d [fx] meta registration compatibility (#3253)
2 years ago
Frank Lee 73d3e4d309 [booster] implemented the torch ddd + resnet example (#3232)
2 years ago
YH 1a229045af Add interface for colo tesnor dp size (#3227)
2 years ago
YuliangLiu0306 4d5d8f98a4 [API] implement device mesh manager (#3221)
2 years ago
Frank Lee cd142fbefa [api] implemented the checkpoint io module (#3205)
2 years ago
ver217 f8289d4221 [lazyinit] combine lazy tensor with dtensor (#3204)
2 years ago
Frank Lee e3ad88fb48 [booster] implemented the cluster module (#3191)
2 years ago
YuliangLiu0306 f57d34958b [FX] refactor experimental tracer and adapt it with hf models (#3157)
2 years ago
Frank Lee e7f3bed2d3 [booster] added the plugin base and torch ddp plugin (#3180)
2 years ago
Zihao 18dbe76cae [auto-parallel] add auto-offload feature (#3154)
2 years ago
YuliangLiu0306 258b43317c [hotfix] layout converting issue (#3188)
2 years ago
YH 80aed29cd3 [zero] Refactor ZeroContextConfig class using dataclass (#3186)
2 years ago
YH 9d644ff09f Fix docstr for zero statedict (#3185)
2 years ago
zbian 7bc0afc901 updated flash attention usage
2 years ago
Frank Lee a9b8402d93 [booster] added the accelerator implementation (#3159)
2 years ago
ver217 6ae8ed0407 [lazyinit] add correctness verification (#3147)
2 years ago
Frank Lee ed19290560 [booster] implemented mixed precision class (#3151)
2 years ago
YuliangLiu0306 2eca4cd376 [DTensor] refactor dtensor with new components (#3089)
2 years ago
ver217 ed8f60b93b [lazyinit] refactor lazy tensor and lazy init ctx (#3131)
2 years ago
Frank Lee 95a36eae63 [kernel] added kernel loader to softmax autograd function (#3093)
2 years ago
Super Daniel fff98f06ed [analyzer] a minimal implementation of static graph analyzer (#2852)
2 years ago
Xuanlei Zhao 10c61de2f7 [autochunk] support vit (#3084)
2 years ago
YuliangLiu0306 8e4e8601b7 [DTensor] implement layout converter (#3055)
2 years ago
Frank Lee f19b49e164 [booster] init module structure and definition (#3056)
2 years ago
Xuanlei Zhao 2ca9728cbb [autochunk] refactor chunk memory estimation (#2762)
2 years ago
YuliangLiu0306 29386a54e6 [DTensor] refactor CommSpec (#3034)
2 years ago
YuliangLiu0306 cd2b0eaa8d [DTensor] refactor sharding spec (#2987)
2 years ago
Ziyue Jiang 400f63012e [pipeline] Add Simplified Alpa DP Partition (#2507)
2 years ago
Super Daniel b42d3d28ed [fx] remove depreciated algorithms. (#2312) (#2313)
2 years ago
github-actions[bot] 82503a96f2 [format] applied code formatting on changed files in pull request 2997 (#3008)
2 years ago
binmakeswell 52a5078988 [doc] add ISC tutorial (#2997)
2 years ago
ver217 823f3b9cf4 [doc] add deepspeed citation and copyright (#2996)
2 years ago
YuliangLiu0306 e414e4092b [DTensor] implementation of dtensor (#2946)
2 years ago
YuliangLiu0306 47fb214b3b [hotfix] add shard dim to aviod backward communication error (#2954)
2 years ago
ver217 090f14fd6b [misc] add reference (#2930)
2 years ago
YuliangLiu0306 197d0bf4ed [autoparallel] apply repeat block to reduce solving time (#2912)
2 years ago
YH a848091141 Fix port exception type (#2925)
2 years ago
zbian 61e687831d fixed using zero with tp cannot access weight correctly
2 years ago
YH 7b13f7db18 [zero] trivial zero optimizer refactoring (#2869)
2 years ago
Jiatong (Julius) Han 8c8a39be95 [hotfix]: Remove math.prod dependency (#2837)
2 years ago
YuliangLiu0306 819e25d8b1 [hotfix] fix autoparallel compatibility test issues (#2754)
2 years ago
YuliangLiu0306 0f392d7403 [autoparallel] find repeat blocks (#2854)
2 years ago
junxu c52edcf0eb Rename class method of ZeroDDP (#2692)
2 years ago
HELSON 6e4ac08172 [hotfix] fix chunk size can not be divided (#2867)
2 years ago
Boyuan Yao eae77c831d [autoparallel] Patch meta information for nodes that will not be handled by SPMD solver (#2823)
2 years ago
Boyuan Yao c7764d3f22 [autoparallel] Patch meta information of `torch.where` (#2822)
2 years ago
Boyuan Yao fcc4097efa [autoparallel] Patch meta information of `torch.tanh()` and `torch.nn.Dropout` (#2773)
2 years ago
Frank Lee 935346430f [cli] handled version check exceptions (#2848)
2 years ago