ColossalAI

Commit Graph

Author	SHA1	Message	Date
Frank Lee	58df720570	[shardformer] adapted T5 and LLaMa test to use kit (#4049 ) * [shardformer] adapted T5 and LLaMa test to use kit * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	4021b9a8a2	[shardformer] add gpt2 test and layer class refactor (#4041 ) * add gpt2 test and layer class refactor * add dropout in gpt2 policy	2023-07-04 16:05:01 +08:00
Frank Lee	d857f3dbba	[shardformer] supported T5 and its variants (#4045 )	2023-07-04 16:05:01 +08:00
Frank Lee	c1d5453e9f	[shardformer] adapted llama to the new API (#4036 )	2023-07-04 16:05:01 +08:00
FoolPlayer	74d176c8d8	[shardformer] fix bert and gpt downstream with new api (#4024 ) * fix bert downstream with new api * remove comment line	2023-07-04 16:05:01 +08:00
Frank Lee	e253a07007	[shardformer] updated doc (#4016 )	2023-07-04 16:05:01 +08:00
FoolPlayer	df018fc305	support bert with new api	2023-07-04 16:05:01 +08:00
FoolPlayer	507c0ad368	add vocabembedding layer	2023-07-04 16:05:01 +08:00
Frank Lee	45d9384346	[shardformer] removed inplace tensor sharding (#4018 )	2023-07-04 16:05:01 +08:00
Frank Lee	3893fa1a8d	[shardformer] refactored embedding and dropout to parallel module (#4013 ) * [shardformer] refactored embedding and dropout to parallel module * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	dfca9678fa	integrate with dist layer (#4011 )	2023-07-04 16:05:01 +08:00
Frank Lee	015af592f8	[shardformer] integrated linear 1D with dtensor (#3996 ) * [shardformer] integrated linear 1D with dtensor * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	d3bc530849	[shardformer] Refactor shardformer api (#4001 ) * fix an error in readme * simplify code * refactor shardformer * add todo * remove slicer * resolve code review	2023-07-04 16:05:01 +08:00
Frank Lee	611971248c	[device] support init device mesh from process group (#3990 )	2023-07-04 16:05:01 +08:00
FoolPlayer	a2f9af810d	[shardformer] fix an error in readme (#3988 ) * fix an error in readme * simplify code	2023-07-04 16:05:01 +08:00
FoolPlayer	f7774ec0f3	[Shardformer] Downstream bert (#3979 ) * add dist dropout in model * update docstring and bert policy with dropout * refactor basepolicy and sharded, update bert * update format * update gpt2 policy * update bert policy * remove unused code * update readme for new policy usage * add downstream model of bert * remove unused code	2023-07-04 16:05:01 +08:00
wukong1992	c1c672d0f0	[shardformer] shardformer support t5 model (#3994 ) test t5	2023-07-04 16:05:01 +08:00
wukong1992	6b30dfb7ce	[shardformer] support llama model using shardformer (#3969 ) adjust layer attr	2023-07-04 16:05:01 +08:00
FoolPlayer	45927d5527	[shardformer] Add dropout layer in shard model and refactor policy api (#3949 ) * add dist dropout in model * update docstring and bert policy with dropout * refactor basepolicy and sharded, update bert * update format * update gpt2 policy * update bert policy * remove unused code * update readme for new policy usage	2023-07-04 16:05:01 +08:00
FoolPlayer	a73130482d	[shardformer] Unit test (#3928 ) * fix bug in slicer, add slicer unit test * add dropout test * use pid as dropout seed * updata dropout test with local pattern * ad todo	2023-07-04 16:05:01 +08:00
FoolPlayer	f1cb5ac6bf	[shardformer] Align bert value (#3907 ) * add bert align test, fix dist loss bug * forward and backward align * add ignore index * add shardformer CI * add gather_output optional for user in shardconfig * update readme with optional gather_ouput * add dist crossentropy loss test, remove unused files * remove unused file * remove unused file * rename the file * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	79f8d5d54b	[shardformer] add gpt2 policy and modify shard and slicer to support (#3883 ) * add gpt2 policy and modify shard and slicer to support * remove unused code * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	70173e3123	update README (#3909 )	2023-07-04 16:05:01 +08:00
FoolPlayer	ab8a47f830	[shardformer] add Dropout layer support different dropout pattern (#3856 ) * add dropout layer, add dropout test * modify seed manager as context manager * add a copy of col_nn.layer * add dist_crossentropy loss; separate module test * polish the code * fix dist crossentropy loss	2023-07-04 16:05:01 +08:00
FoolPlayer	c594dc2f1c	[shardformer] update readme with modules implement doc (#3834 ) * update readme with modules content * remove img	2023-07-04 16:05:01 +08:00
Frank Lee	4972e1f40e	[shardformer] refactored the user api (#3828 ) * [shardformer] refactored the user api * polish code	2023-07-04 16:05:01 +08:00
Frank Lee	235792f170	[shardformer] updated readme (#3827 )	2023-07-04 16:05:01 +08:00
FoolPlayer	8cc11235c0	[shardformer]: Feature/shardformer, add some docstring and readme (#3816 ) * init shardformer code structure * add implement of sharder (inject and replace) * add implement of replace layer to colossal layer * separate different layer policy, add some notion * implement 1d and 2d slicer, can tell col or row * fix bug when slicing and inject model * fix some bug; add inference test example * add share weight and train example * add train * add docstring and readme * add docstring for other files * pre-commit	2023-07-04 16:05:01 +08:00
FoolPlayer	8d68de767d	[shardformer] init shardformer code structure (#3731 ) * init shardformer code structure * add implement of sharder (inject and replace) * add implement of replace layer to colossal layer * separate different layer policy, add some notion * implement 1d and 2d slicer, can tell col or row * fix bug when slicing and inject model * fix some bug; add inference test example	2023-07-04 16:05:01 +08:00
Baizhou Zhang	1350ece492	[hotfix] fix import bug in checkpoint_io (#4142 )	2023-07-03 22:14:37 +08:00
digger yu	8abc87798f	fix Tensor is not defined (#4129 )	2023-07-03 17:10:18 +08:00
digger yu	7e46bc87b6	fix CheckpointIndexFile is not defined (#4109 )	2023-07-03 17:09:06 +08:00
digger yu	09fe9dc704	[nfc]fix ColossalaiOptimizer is not defined (#4122 )	2023-06-30 17:23:22 +08:00
Frank Lee	95e95b6d58	[testing] move pytest to be inside the function (#4087 )	2023-06-27 11:02:25 +08:00
Baizhou Zhang	0bb0b481b4	[gemini] fix argument naming during chunk configuration searching	2023-06-25 13:34:15 +08:00
github-actions[bot]	a52f62082d	[format] applied code formatting on changed files in pull request 4021 (#4022 ) Co-authored-by: github-actions <github-actions@github.com>	2023-06-19 11:23:24 +08:00
Baizhou Zhang	822c3d4d66	[checkpointio] sharded optimizer checkpoint for DDP plugin (#4002 )	2023-06-16 14:14:05 +08:00
Wenhao Chen	725af3eeeb	[booster] make optimizer argument optional for boost (#3993 ) * feat: make optimizer optional in Booster.boost * test: skip unet test if diffusers version > 0.10.2	2023-06-15 17:38:42 +08:00
Baizhou Zhang	c9cff7e7fa	[checkpointio] General Checkpointing of Sharded Optimizers (#3984 )	2023-06-15 15:21:26 +08:00
Frank Lee	71fe52769c	[gemini] fixed the gemini checkpoint io (#3934 )	2023-06-12 15:11:27 +08:00
Frank Lee	ddcf58cacf	Revert "[sync] sync feature/shardformer with develop"	2023-06-09 09:41:27 +08:00
FoolPlayer	24651fdd4f	Merge pull request #3931 from FrankLeeeee/sync/develop-to-shardformer [sync] sync feature/shardformer with develop	2023-06-09 09:34:00 +08:00
FoolPlayer	ef1537759c	[shardformer] add gpt2 policy and modify shard and slicer to support (#3883 ) * add gpt2 policy and modify shard and slicer to support * remove unused code * polish code	2023-06-08 15:01:34 +08:00
FoolPlayer	6370a935f6	update README (#3909 )	2023-06-08 15:01:34 +08:00
FoolPlayer	21a3915c98	[shardformer] add Dropout layer support different dropout pattern (#3856 ) * add dropout layer, add dropout test * modify seed manager as context manager * add a copy of col_nn.layer * add dist_crossentropy loss; separate module test * polish the code * fix dist crossentropy loss	2023-06-08 15:01:34 +08:00
FoolPlayer	997544c1f9	[shardformer] update readme with modules implement doc (#3834 ) * update readme with modules content * remove img	2023-06-08 15:01:34 +08:00
Frank Lee	537a52b7a2	[shardformer] refactored the user api (#3828 ) * [shardformer] refactored the user api * polish code	2023-06-08 15:01:34 +08:00
Frank Lee	bc19024bf9	[shardformer] updated readme (#3827 )	2023-06-08 15:01:34 +08:00
FoolPlayer	58f6432416	[shardformer]: Feature/shardformer, add some docstring and readme (#3816 ) * init shardformer code structure * add implement of sharder (inject and replace) * add implement of replace layer to colossal layer * separate different layer policy, add some notion * implement 1d and 2d slicer, can tell col or row * fix bug when slicing and inject model * fix some bug; add inference test example * add share weight and train example * add train * add docstring and readme * add docstring for other files * pre-commit	2023-06-08 15:01:34 +08:00
FoolPlayer	6a69b44dfc	[shardformer] init shardformer code structure (#3731 ) * init shardformer code structure * add implement of sharder (inject and replace) * add implement of replace layer to colossal layer * separate different layer policy, add some notion * implement 1d and 2d slicer, can tell col or row * fix bug when slicing and inject model * fix some bug; add inference test example	2023-06-08 15:01:34 +08:00
Frank Lee	eb39154d40	[dtensor] updated api and doc (#3845 )	2023-06-08 10:18:17 +08:00
digger yu	de0d7df33f	[nfc] fix typo colossalai/zero (#3923 )	2023-06-08 00:01:29 +08:00
digger yu	a9d1cadc49	fix typo with colossalai/trainer utils zero (#3908 )	2023-06-07 16:08:37 +08:00
Frank Lee	d51e83d642	Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop [sync] sync feature/dtensor with develop	2023-06-07 11:50:43 +08:00
Hongxin Liu	9c88b6cbd1	[lazy] fix compatibility problem on torch 1.13 (#3911 )	2023-06-07 11:10:12 +08:00
digger yu	0e484e6201	[nfc]fix typo colossalai/pipeline tensor nn (#3899 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/ * fix typo colossalai/cli fx kernel * fix typo colossalai/nn * revert change warmuped * fix typo colossalai/pipeline tensor nn	2023-06-06 14:07:36 +08:00
Baizhou Zhang	c1535ccbba	[doc] fix docs about booster api usage (#3898 )	2023-06-06 13:36:11 +08:00
digger yu	1878749753	[nfc] fix typo colossalai/nn (#3887 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/ * fix typo colossalai/cli fx kernel * fix typo colossalai/nn * revert change warmuped	2023-06-05 16:04:27 +08:00
Hongxin Liu	ae02d4e4f7	[bf16] add bf16 support (#3882 ) * [bf16] add bf16 support for fused adam (#3844) * [bf16] fused adam kernel support bf16 * [test] update fused adam kernel test * [test] update fused adam test * [bf16] cpu adam and hybrid adam optimizers support bf16 (#3860) * [bf16] implement mixed precision mixin and add bf16 support for low level zero (#3869) * [bf16] add mixed precision mixin * [bf16] low level zero optim support bf16 * [text] update low level zero test * [text] fix low level zero grad acc test * [bf16] add bf16 support for gemini (#3872) * [bf16] gemini support bf16 * [test] update gemini bf16 test * [doc] update gemini docstring * [bf16] add bf16 support for plugins (#3877) * [bf16] add bf16 support for legacy zero (#3879) * [zero] init context support bf16 * [zero] legacy zero support bf16 * [test] add zero bf16 test * [doc] add bf16 related docstring for legacy zero	2023-06-05 15:58:31 +08:00
Liu Ziming	8065cc5fba	Modify torch version requirement to adapt torch 2.0 (#3896 )	2023-06-05 15:57:35 +08:00
Hongxin Liu	dbb32692d2	[lazy] refactor lazy init (#3891 ) * [lazy] remove old lazy init * [lazy] refactor lazy init folder structure * [lazy] fix lazy tensor deepcopy * [test] update lazy init test	2023-06-05 14:20:47 +08:00
digger yu	70c8cdecf4	[nfc] fix typo colossalai/cli fx kernel (#3847 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/ * fix typo colossalai/cli fx kernel	2023-06-02 15:02:45 +08:00
digger yu	e2d81eba0d	[nfc] fix typo colossalai/ applications/ (#3831 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/	2023-05-25 16:19:41 +08:00
wukong1992	3229f93e30	[booster] add warning for torch fsdp plugin doc (#3833 )	2023-05-25 14:00:02 +08:00
Hongxin Liu	7c9f2ed6dd	[dtensor] polish sharding spec docstring (#3838 ) * [dtensor] polish sharding spec docstring * [dtensor] polish sharding spec example docstring	2023-05-25 13:09:42 +08:00
digger yu	7f8203af69	fix typo colossalai/auto_parallel autochunk fx/passes etc. (#3808 )	2023-05-24 09:01:50 +08:00
wukong1992	6b305a99d6	[booster] torch fsdp fix ckpt (#3788 )	2023-05-23 16:58:45 +08:00
digger yu	9265f2d4d7	[NFC]fix typo colossalai/auto_parallel nn utils etc. (#3779 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc.	2023-05-23 15:28:20 +08:00
jiangmingyan	e871e342b3	[API] add docstrings and initialization to apex amp, naive amp (#3783 ) * [mixed_precison] add naive amp demo * [mixed_precison] add naive amp demo * [api] add docstrings and initialization to apex amp, naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] fix * [api] fix	2023-05-23 15:17:24 +08:00
Frank Lee	f5c425c898	fixed the example docstring for booster (#3795 )	2023-05-22 18:10:06 +08:00
Hongxin Liu	72688adb2f	[doc] add booster docstring and fix autodoc (#3789 ) * [doc] add docstr for booster methods * [doc] fix autodoc	2023-05-22 10:56:47 +08:00
Hongxin Liu	3c07a2846e	[plugin] a workaround for zero plugins' optimizer checkpoint (#3780 ) * [test] refactor torch ddp checkpoint test * [plugin] update low level zero optim checkpoint * [plugin] update gemini optim checkpoint	2023-05-19 19:42:31 +08:00
Hongxin Liu	60e6a154bc	[doc] add tutorial for booster checkpoint (#3785 ) * [doc] add checkpoint related docstr for booster * [doc] add en checkpoint doc * [doc] add zh checkpoint doc * [doc] add booster checkpoint doc in sidebar * [doc] add cuation about ckpt for plugins * [doc] add doctest placeholder * [doc] add doctest placeholder * [doc] add doctest placeholder	2023-05-19 18:05:08 +08:00
digger yu	32f81f14d4	[NFC] fix typo colossalai/amp auto_parallel autochunk (#3756 )	2023-05-19 13:50:00 +08:00
Hongxin Liu	5452df63c5	[plugin] torch ddp plugin supports sharded model checkpoint (#3775 ) * [plugin] torch ddp plugin add save sharded model * [test] fix torch ddp ckpt io test * [test] fix torch ddp ckpt io test * [test] fix low level zero plugin test * [test] fix low level zero plugin test * [test] add debug info * [test] add debug info * [test] add debug info * [test] add debug info * [test] add debug info * [test] fix low level zero plugin test * [test] fix low level zero plugin test * [test] remove debug info	2023-05-18 20:05:59 +08:00
jiangmingyan	2703a37ac9	[amp] Add naive amp demo (#3774 ) * [mixed_precison] add naive amp demo * [mixed_precison] add naive amp demo	2023-05-18 16:33:14 +08:00
digger yu	1baeb39c72	[NFC] fix typo with colossalai/auto_parallel/tensor_shard (#3742 ) * fix typo applications/ and colossalai/ date 5.11 * fix typo colossalai/	2023-05-17 11:13:23 +08:00
wukong1992	b37797ed3d	[booster] support torch fsdp plugin in booster (#3697 ) Co-authored-by: 纪少敏 <jishaomin@jishaomindeMBP.lan>	2023-05-15 12:14:38 +08:00
digger-yu	ad6460cf2c	[NFC] fix typo applications/ and colossalai/ (#3735 )	2023-05-15 11:46:25 +08:00
digger-yu	b7141c36dd	[CI] fix some spelling errors (#3707 ) * fix spelling error with examples/comminity/ * fix spelling error with tests/ * fix some spelling error with tests/ colossalai/ etc.	2023-05-10 17:12:03 +08:00
jiangmingyan	20068ba188	[booster] add tests for ddp and low level zero's checkpointio (#3715 ) * [booster] update tests for booster * [booster] update tests for booster * [booster] update tests for booster * [booster] update tests for booster * [booster] update tests for booster * [booster] update booster tutorials#3717, fix recursive check	2023-05-10 12:17:02 +08:00
Hongxin Liu	6552cbf8e1	[booster] fix no_sync method (#3709 ) * [booster] fix no_sync method * [booster] add test for ddp no_sync * [booster] fix merge * [booster] update unit test * [booster] update unit test * [booster] update unit test	2023-05-09 11:10:02 +08:00
Hongxin Liu	3bf09efe74	[booster] update prepare dataloader method for plugin (#3706 ) * [booster] add prepare dataloader method for plug * [booster] update examples and docstr	2023-05-08 15:44:03 +08:00
Hongxin Liu	f83ea813f5	[example] add train resnet/vit with booster example (#3694 ) * [example] add train vit with booster example * [example] update readme * [example] add train resnet with booster example * [example] enable ci * [example] enable ci * [example] add requirements * [hotfix] fix analyzer init * [example] update requirements	2023-05-08 10:42:30 +08:00
YH	2629f9717d	[tensor] Refactor handle_trans_spec in DistSpecManager	2023-05-06 17:55:37 +08:00
Hongxin Liu	d0915f54f4	[booster] refactor all dp fashion plugins (#3684 ) * [booster] add dp plugin base * [booster] inherit dp plugin base * [booster] refactor unit tests	2023-05-05 19:36:10 +08:00
jiangmingyan	307894f74d	[booster] gemini plugin support shard checkpoint (#3610 ) * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint --------- Co-authored-by: luchen <luchen@luchendeMBP.lan> Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>	2023-05-05 14:37:21 +08:00
YH	a22407cc02	[zero] Suggests a minor change to confusing variable names in the ZeRO optimizer. (#3173 ) * Fix confusing variable name in zero opt * Apply lint * Fix util func * Fix minor util func * Fix zero param optimizer name	2023-04-27 18:43:14 +08:00
Hongxin Liu	50793b35f4	[gemini] accelerate inference (#3641 ) * [gemini] support don't scatter after inference * [chat] update colossalai strategy * [chat] fix opt benchmark * [chat] update opt benchmark * [gemini] optimize inference * [test] add gemini inference test * [chat] fix unit test ci * [chat] fix ci * [chat] fix ci * [chat] skip checkpoint test	2023-04-26 16:32:40 +08:00
Hongxin Liu	4b3240cb59	[booster] add low level zero plugin (#3594 ) * [booster] add low level zero plugin * [booster] fix gemini plugin test * [booster] fix precision * [booster] add low level zero plugin test * [test] fix booster plugin test oom * [test] fix booster plugin test oom * [test] fix googlenet and inception output trans * [test] fix diffuser clip vision model * [test] fix torchaudio_wav2vec2_base * [test] fix low level zero plugin test	2023-04-26 14:37:25 +08:00
digger-yu	b9a8dff7e5	[doc] Fix typo under colossalai and doc(#3618 ) * Fixed several spelling errors under colossalai * Fix the spelling error in colossalai and docs directory * Cautious Changed the spelling error under the example folder * Update runtime_preparation_pass.py revert autograft to autograd * Update search_chunk.py utile to until * Update check_installation.py change misteach to mismatch in line 91 * Update 1D_tensor_parallel.md revert to perceptron * Update 2D_tensor_parallel.md revert to perceptron in line 73 * Update 2p5D_tensor_parallel.md revert to perceptron in line 71 * Update 3D_tensor_parallel.md revert to perceptron in line 80 * Update README.md revert to resnet in line 42 * Update reorder_graph.py revert to indice in line 7 * Update p2p.py revert to megatron in line 94 * Update initialize.py revert to torchrun in line 198 * Update routers.py change to detailed in line 63 * Update routers.py change to detailed in line 146 * Update README.md revert random number in line 402	2023-04-26 11:38:43 +08:00
Hongxin Liu	12eff9eb4c	[gemini] state dict supports fp16 (#3590 ) * [gemini] save state dict support fp16 * [gemini] save state dict shard support fp16 * [gemini] fix state dict * [gemini] fix state dict	2023-04-19 11:01:48 +08:00
Hongxin Liu	dac127d0ee	[fx] fix meta tensor registration (#3589 ) * [meta] fix torch 1.13.1 * [meta] fix torch 2.0.0 * [meta] fix torch 1.13.0 * [meta] polish code	2023-04-18 16:20:36 +08:00
Hongxin Liu	f313babd11	[gemini] support save state dict in shards (#3581 ) * [gemini] support state dict shard * [gemini] add test state dict shard * [gemini] polish docstr * [gemini] fix merge * [gemini] polish code	2023-04-17 17:11:09 +08:00
YH	d329c294ec	Add docstr for zero3 chunk search utils (#3572 )	2023-04-17 12:44:17 +08:00
Hongxin Liu	173dad0562	[misc] add verbose arg for zero and op builder (#3552 ) * [misc] add print verbose * [gemini] add print verbose * [zero] add print verbose for low level * [misc] add print verbose for op builder	2023-04-17 11:25:35 +08:00
Hongxin Liu	4341f5e8e6	[lazyinit] fix clone and deepcopy (#3553 )	2023-04-17 11:25:13 +08:00
Hongxin Liu	152239bbfa	[gemini] gemini supports lazy init (#3379 ) * [gemini] fix nvme optimizer init * [gemini] gemini supports lazy init * [gemini] add init example * [gemini] add fool model * [zero] update gemini ddp * [zero] update init example * add chunk method * add chunk method * [lazyinit] fix lazy tensor tolist * [gemini] fix buffer materialization * [misc] remove useless file * [booster] update gemini plugin * [test] update gemini plugin test * [test] fix gemini plugin test * [gemini] fix import * [gemini] fix import * [lazyinit] use new metatensor * [lazyinit] use new metatensor * [lazyinit] fix __set__ method	2023-04-12 16:03:25 +08:00
jiangmingyan	366a035552	[checkpoint] Shard saved checkpoint need to be compatible with the naming format of hf checkpoint files (#3479 ) * [checkpoint] support huggingface style sharded checkpoint, to be compatible with hf file naming format * [checkpoint] support huggingface style sharded checkpoint, to be compatible with hf file naming format * [checkpoint] Shard saved checkpoint add 'variant' field to customize filename * [checkpoint] Shard saved checkpoint add 'variant' field to customize filename * [checkpoint] Shard saved checkpoint add 'variant' field to customize filename * [checkpoint] Shard saved checkpoint add 'variant' field to customize filename --------- Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local> Co-authored-by: luchen <luchen@luchendeMBP.lan>	2023-04-12 16:02:17 +08:00
YH	bcf0cbcbe7	[doc] Add docs for clip args in zero optim (#3504 )	2023-04-10 11:11:28 +08:00
jiangmingyan	52a933e175	[checkpoint] support huggingface style sharded checkpoint (#3461 ) * [checkpoint] support huggingface style sharded checkpoint * [checkpoint] support huggingface style sharded checkpoint * [checkpoint] support huggingface style sharded checkpoint * [checkpoint] support huggingface style sharded checkpoint * [checkpoint] support huggingface style sharded checkpoint --------- Co-authored-by: luchen <luchen@luchendeMBP.lan>	2023-04-06 16:23:39 +08:00
Frank Lee	80eba05b0a	[test] refactor tests with spawn (#3452 ) * [test] added spawn decorator * polish code * polish code * polish code * polish code * polish code * polish code	2023-04-06 14:51:35 +08:00
Frank Lee	7d8d825681	[booster] fixed the torch ddp plugin with the new checkpoint api (#3442 )	2023-04-06 09:43:51 +08:00
YH	8f740deb53	Fix typo (#3448 )	2023-04-06 09:43:31 +08:00
Hakjin Lee	46c009dba4	[format] Run lint on colossalai.engine (#3367 )	2023-04-05 23:24:43 +08:00
YuliangLiu0306	ffcdbf0f65	[autoparallel]integrate auto parallel feature with new tracer (#3408 ) * [autoparallel] integrate new analyzer in module level * unify the profiling method * polish * fix no codegen bug * fix pass bug * fix liveness test * polish	2023-04-04 17:40:45 +08:00
ver217	573af84184	[example] update examples related to zero/gemini (#3431 ) * [zero] update legacy import * [zero] update examples * [example] fix opt tutorial * [example] fix opt tutorial * [example] fix opt tutorial * [example] fix opt tutorial * [example] fix import	2023-04-04 17:32:51 +08:00
Frank Lee	1beb85cc25	[checkpoint] refactored the API and added safetensors support (#3427 ) * [checkpoint] refactored the API and added safetensors support * polish code	2023-04-04 15:23:01 +08:00
ver217	26b7aac0be	[zero] reorganize zero/gemini folder structure (#3424 ) * [zero] refactor low-level zero folder structure * [zero] fix legacy zero import path * [zero] fix legacy zero import path * [zero] remove useless import * [zero] refactor gemini folder structure * [zero] refactor gemini folder structure * [zero] refactor legacy zero import path * [zero] refactor gemini folder structure * [zero] refactor gemini folder structure * [zero] refactor gemini folder structure * [zero] refactor legacy zero import path * [zero] fix test import path * [zero] fix test * [zero] fix circular import * [zero] update import	2023-04-04 13:48:16 +08:00
Frank Lee	638a07a7f9	[test] fixed gemini plugin test (#3411 ) * [test] fixed gemini plugin test * polish code * polish code	2023-04-03 17:12:22 +08:00
ver217	5f2e34e6c9	[booster] implement Gemini plugin (#3352 ) * [booster] add gemini plugin * [booster] update docstr * [booster] gemini plugin add coloparam convertor * [booster] fix coloparam convertor * [booster] fix gemini plugin device * [booster] add gemini plugin test * [booster] gemini plugin ignore sync bn * [booster] skip some model * [booster] skip some model * [booster] modify test world size * [booster] modify test world size * [booster] skip test	2023-03-31 16:06:13 +08:00
HELSON	1a1d68b053	[moe] add checkpoint for moe models (#3354 ) * [moe] add checkpoint for moe models * [hotfix] fix bugs in unit test	2023-03-31 09:20:33 +08:00
YuliangLiu0306	fee2af8610	[autoparallel] adapt autoparallel with new analyzer (#3261 ) * [autoparallel] adapt autoparallel with new analyzer * fix all node handler tests * polish * polish	2023-03-30 17:47:24 +08:00
Ofey Chan	8706a8c66c	[NFC] polish colossalai/engine/gradient_handler/__init__.py code style (#3329 )	2023-03-30 14:19:39 +08:00
yuxuan-lou	198a74b9fd	[NFC] polish colossalai/context/random/__init__.py code style (#3327 )	2023-03-30 14:19:26 +08:00
YuliangLiu0306	fbd2a9e05b	[hotfix] meta_tensor_compatibility_with_torch2	2023-03-30 13:43:01 +08:00
Michelle	ad285e1656	[NFC] polish colossalai/fx/tracer/_tracer_utils.py (#3323 ) * [NFC] polish colossalai/engine/schedule/_pipeline_schedule.py code style * [NFC] polish colossalai/fx/tracer/_tracer_utils.py code style --------- Co-authored-by: Qianran Ma <qianranm@luchentech.com>	2023-03-29 17:53:32 +08:00
Xu Kai	64350029fe	[NFC] polish colossalai/gemini/paramhooks/_param_hookmgr.py code style	2023-03-29 15:47:42 +08:00
RichardoLuo	1ce9d0c531	[NFC] polish initializer_data.py code style (#3287 )	2023-03-29 15:22:21 +08:00
Ziheng Qin	1bed38ef37	[NFC] polish colossalai/cli/benchmark/models.py code style (#3290 )	2023-03-29 15:22:21 +08:00
Kai Wang (Victor Kai)	964a28678f	[NFC] polish initializer_3d.py code style (#3279 )	2023-03-29 15:22:21 +08:00
Sze-qq	94eec1c5ad	[NFC] polish colossalai/engine/gradient_accumulation/_gradient_accumulation.py code style (#3277 ) Co-authored-by: siqi <siqi@siqis-MacBook-Pro.local>	2023-03-29 15:22:21 +08:00
Arsmart1	8af977f223	[NFC] polish colossalai/context/parallel_context.py code style (#3276 )	2023-03-29 15:22:21 +08:00
Zirui Zhu	1168b50e33	[NFC] polish colossalai/engine/schedule/_pipeline_schedule_v2.py code style (#3275 )	2023-03-29 15:22:21 +08:00
Tong Li	196d4696d0	[NFC] polish colossalai/nn/_ops/addmm.py code style (#3274 )	2023-03-29 15:22:21 +08:00
lucasliunju	4b95464994	[NFC] polish colossalai/amp/__init__.py code style (#3272 )	2023-03-29 15:22:21 +08:00
Xuanlei Zhao	6b3bb2c249	[NFC] polish code style (#3273 )	2023-03-29 15:22:21 +08:00
CZYCW	4cadb25b96	[NFC] policy colossalai/fx/proxy.py code style (#3269 )	2023-03-29 15:22:21 +08:00
Yuanchen	d58fa705b2	[NFC] polish code style (#3268 ) Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>	2023-03-29 15:22:21 +08:00
Camille Zhong	c4a226b729	[NFC] polish tensor_placement_policy.py code style (#3265 )	2023-03-29 15:22:21 +08:00
CsRic	00778abc48	[NFC] polish colossalai/fx/passes/split_module.py code style (#3263 ) Co-authored-by: csric <richcsr256@gmail.com>	2023-03-29 15:22:21 +08:00
jiangmingyan	488f37048c	[NFC] polish colossalai/global_variables.py code style (#3259 ) Co-authored-by: luchen <luchen@luchendeMBP.lan>	2023-03-29 15:22:21 +08:00
LuGY	1ff7d5bfa5	[NFC] polish colossalai/engine/gradient_handler/_moe_gradient_handler.py (#3260 )	2023-03-29 15:22:21 +08:00
dayellow	204ca2f09a	[NFC] polish colossalai/fx/profiler/experimental/profiler_module/embedding.py code style (#3256 ) Co-authored-by: Minghao Huang <huangminghao@luchentech.com>	2023-03-29 15:22:21 +08:00
HELSON	02b058032d	[fx] meta registration compatibility (#3253 ) * [fx] meta registration compatibility * fix error	2023-03-27 15:22:17 +08:00
Frank Lee	73d3e4d309	[booster] implemented the torch ddd + resnet example (#3232 ) * [booster] implemented the torch ddd + resnet example * polish code	2023-03-27 10:24:14 +08:00
YH	1a229045af	Add interface for colo tesnor dp size (#3227 )	2023-03-27 09:42:21 +08:00
YuliangLiu0306	4d5d8f98a4	[API] implement device mesh manager (#3221 ) * [API] implement device mesh manager * polish	2023-03-24 13:39:12 +08:00
Frank Lee	cd142fbefa	[api] implemented the checkpoint io module (#3205 ) * [api] implemented the checkpoint io module * polish code * polish code	2023-03-23 10:53:17 +08:00
ver217	f8289d4221	[lazyinit] combine lazy tensor with dtensor (#3204 ) * [lazyinit] lazy tensor add distribute * [lazyinit] refactor distribute * [lazyinit] add test dist lazy init * [lazyinit] add verbose info for dist lazy init * [lazyinit] fix rnn flatten weight op * [lazyinit] polish test * [lazyinit] polish test * [lazyinit] fix lazy tensor data setter * [lazyinit] polish test * [lazyinit] fix clean * [lazyinit] make materialize inplace * [lazyinit] refactor materialize * [lazyinit] refactor test distribute * [lazyinit] fix requires_grad * [lazyinit] fix tolist after materialization * [lazyinit] refactor distribute module * [lazyinit] polish docstr * [lazyinit] polish lazy init context * [lazyinit] temporarily skip test * [lazyinit] polish test * [lazyinit] add docstr	2023-03-23 10:53:06 +08:00
Frank Lee	e3ad88fb48	[booster] implemented the cluster module (#3191 ) * [booster] implemented the cluster module * polish code	2023-03-22 14:11:54 +08:00
YuliangLiu0306	f57d34958b	[FX] refactor experimental tracer and adapt it with hf models (#3157 ) * pass gpt trace and meta_prop * pass t5 trace and meta_prop * [FX] refactor experimental tracer and adapt it with hf models * pass all mainstream model zoo * fix CI * fix CI * fix CI * fix CI * fix CI * fix CI * fix CI * fix CI * skip tests * fix CI * using packaging version * polish	2023-03-22 10:40:33 +08:00
Frank Lee	e7f3bed2d3	[booster] added the plugin base and torch ddp plugin (#3180 ) * [booster] added the plugin base and torch ddp plugin * polish code * polish code * polish code	2023-03-21 17:39:30 +08:00
Zihao	18dbe76cae	[auto-parallel] add auto-offload feature (#3154 ) * add auto-offload feature * polish code * fix syn offload runtime pass bug * add offload example * fix offload testing bug * fix example testing bug	2023-03-21 14:17:41 +08:00
YuliangLiu0306	258b43317c	[hotfix] layout converting issue (#3188 )	2023-03-21 13:24:18 +08:00
YH	80aed29cd3	[zero] Refactor ZeroContextConfig class using dataclass (#3186 )	2023-03-21 12:36:47 +08:00
YH	9d644ff09f	Fix docstr for zero statedict (#3185 )	2023-03-21 11:48:21 +08:00
zbian	7bc0afc901	updated flash attention usage	2023-03-20 17:57:04 +08:00
Frank Lee	a9b8402d93	[booster] added the accelerator implementation (#3159 )	2023-03-20 13:59:24 +08:00
ver217	6ae8ed0407	[lazyinit] add correctness verification (#3147 ) * [lazyinit] fix shared module * [tests] add lazy init test utils * [tests] add torchvision for lazy init * [lazyinit] fix pre op fn * [lazyinit] handle legacy constructor * [tests] refactor lazy init test models * [tests] refactor lazy init test utils * [lazyinit] fix ops don't support meta * [tests] lazy init test timm models * [lazyinit] fix set data * [lazyinit] handle apex layers * [tests] lazy init test transformers models * [tests] lazy init test torchaudio models * [lazyinit] fix import path * [tests] lazy init test torchrec models * [tests] update torch version in CI * [tests] revert torch version in CI * [tests] skip lazy init test	2023-03-17 13:49:04 +08:00

1554 Commits (7a3dfd0c645fba51a02eb3c6ac88b4f09160ea7d)