Baizhou Zhang
1c7df566e2
[shardformer] support tp+zero for shardformer ( #4472 )
...
* support tp+zero/input type cast for hybridplugin
* add tp+zero tests
* fix bucket arguments
2023-08-21 12:04:52 +08:00
Bin Jia
7c8be77081
[shardformer/sequence parallel] support gpt2 seq parallel with pp/dp/tp ( #4460 )
...
* support gpt2 seq parallel with pp/dp/tp
* fix a bug when waiting for stream done
* delete unused gpt2_seq file
2023-08-18 11:21:53 +08:00
Baizhou Zhang
6ef33f75aa
[shardformer] support DDP in HybridPlugin/add tp+dp tests ( #4446 )
...
* support DDP for HybridPlugin/add tp+dp tests
* add docstring for HybridParallelPlugin
2023-08-16 16:11:57 +08:00
Bin Jia
424629fea0
[shardformer/sequence parallel] Cherry pick commit to new branch ( #4450 )
...
* [shardformer/sequence parallel] Support sequence parallel for gpt2 (#4384 )
* [sequence parallel] add sequence parallel linear col/row support (#4336 )
* add sequence parallel linear col/row support
* add annotation
* add annotation
* add support for gpt2 fused qkv linear layer
* support sequence parallel in GPT2
* add docstring and note
* add requirments
* remove unused flash-attb
* modify flash attn test
* modify flash attn setting
* modify flash attn code
* add assert before divide, rename forward function
* [shardformer/test] fix gpt2 test with seq-parallel
* [shardformer/sequence parallel] Overlap input gather and grad computation during col backward (#4401 )
* overlap gather input / grad computing during col backward
* modify test for overlap
* simplify code
* fix code and modify cuda stream synchronize
* [shardformer/sequence parallel] polish code
2023-08-16 15:41:20 +08:00
flybird1111
d2cd48e0be
[shardformer] test all optimizations ( #4399 )
...
[shardformer] test all optimizations
[shardformer] test all optimizations
[shardformer] test all optimizations
2023-08-15 23:25:14 +08:00
Baizhou Zhang
ed4c448488
[pipeline] rewrite t5 tests & support multi-tensor transmitting in pipeline ( #4388 )
...
* fix remaining t5 bugs/rewrite t5 tests
* fix multi-tensor communication in pipeline
* rearrange test_config
* fix keyerror in sync_shared_params
* fix get_held_layers & Randomnizer, complete t5 tests
* erase printing
* fix get_held_layers through modifying _release_unheld_layers
* fix _get_recursive_held_layers bug
2023-08-15 23:25:14 +08:00
Baizhou Zhang
b1feeced8e
[shardformer] add util functions for shardformer tests/fix sync_shared_param ( #4366 )
...
* add util functions for shardformer tests & rewrite gpt2 test
* fix shared_params & embedding/merging
* fix precision
2023-08-15 23:25:14 +08:00
Baizhou Zhang
0ceec8f9a9
[pipeline] support fp32 for HybridPlugin/merge shardformer test and pipeline test into one file ( #4354 )
...
* add naive optimizer for 3DPlugin/refactor gpt2 shardformer test
* merge tests of PP/DP/TP combinations into one test file
* fix bug when sync grad for dp in HybridPlugin
* update supported precisions for 3DPlugin/fix bug when shifting tp_degree
* improve the passing of lazy_init
* modify lazy_init/use sync_shared_params
2023-08-15 23:25:14 +08:00
Hongxin Liu
261eab02fb
[plugin] add 3d parallel plugin ( #4295 )
...
* [amp] add mixed precision optimizer
* [plugin] add 3d parallel plugin
* [booster] support pipeline
* [plugin] 3d parallel plugin support clip grad norm
* [shardformer] fix sharder and add plugin test
* [plugin] rename 3d parallel plugin
* [ci] support testmon core pkg change detection (#4305 )
* [hotfix] debug testmon
* [hotfix] fix llama
* [hotfix] fix p2p bugs
* [hotfix] fix requirements
2023-08-15 23:25:14 +08:00
LuGY
1a49a5ea00
[zero] support shard optimizer state dict of zero ( #4194 )
...
* support shard optimizer of zero
* polish code
* support sync grad manually
2023-07-31 22:13:29 +08:00
LuGY
79cf1b5f33
[zero]support no_sync method for zero1 plugin ( #4138 )
...
* support no sync for zero1 plugin
* polish
* polish
2023-07-31 22:13:29 +08:00
梁爽
abe4f971e0
[NFC] polish colossalai/booster/plugin/low_level_zero_plugin.py code style ( #4256 )
...
Co-authored-by: supercooledith <893754954@qq.com>
2023-07-26 14:12:57 +08:00
Baizhou Zhang
c6f6005990
[checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin ( #4302 )
...
* sharded optimizer checkpoint for gemini plugin
* modify test to reduce testing time
* update doc
* fix bug when keep_gatherd is true under GeminiPlugin
2023-07-21 14:39:01 +08:00
Baizhou Zhang
58913441a1
Next commit [checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin ( #4141 )
...
* [checkpointio] unsharded optimizer checkpoint for Gemini plugin
* [checkpointio] unsharded optimizer checkpoint for Gemini using all_gather
2023-07-07 16:33:06 +08:00
Baizhou Zhang
0bb0b481b4
[gemini] fix argument naming during chunk configuration searching
2023-06-25 13:34:15 +08:00
Baizhou Zhang
822c3d4d66
[checkpointio] sharded optimizer checkpoint for DDP plugin ( #4002 )
2023-06-16 14:14:05 +08:00
Wenhao Chen
725af3eeeb
[booster] make optimizer argument optional for boost ( #3993 )
...
* feat: make optimizer optional in Booster.boost
* test: skip unet test if diffusers version > 0.10.2
2023-06-15 17:38:42 +08:00
Baizhou Zhang
c9cff7e7fa
[checkpointio] General Checkpointing of Sharded Optimizers ( #3984 )
2023-06-15 15:21:26 +08:00
Frank Lee
bd1ab98158
[gemini] fixed the gemini checkpoint io ( #3934 )
2023-06-09 09:48:49 +08:00
Hongxin Liu
ae02d4e4f7
[bf16] add bf16 support ( #3882 )
...
* [bf16] add bf16 support for fused adam (#3844 )
* [bf16] fused adam kernel support bf16
* [test] update fused adam kernel test
* [test] update fused adam test
* [bf16] cpu adam and hybrid adam optimizers support bf16 (#3860 )
* [bf16] implement mixed precision mixin and add bf16 support for low level zero (#3869 )
* [bf16] add mixed precision mixin
* [bf16] low level zero optim support bf16
* [text] update low level zero test
* [text] fix low level zero grad acc test
* [bf16] add bf16 support for gemini (#3872 )
* [bf16] gemini support bf16
* [test] update gemini bf16 test
* [doc] update gemini docstring
* [bf16] add bf16 support for plugins (#3877 )
* [bf16] add bf16 support for legacy zero (#3879 )
* [zero] init context support bf16
* [zero] legacy zero support bf16
* [test] add zero bf16 test
* [doc] add bf16 related docstring for legacy zero
2023-06-05 15:58:31 +08:00
wukong1992
3229f93e30
[booster] add warning for torch fsdp plugin doc ( #3833 )
2023-05-25 14:00:02 +08:00
digger yu
7f8203af69
fix typo colossalai/auto_parallel autochunk fx/passes etc. ( #3808 )
2023-05-24 09:01:50 +08:00
wukong1992
6b305a99d6
[booster] torch fsdp fix ckpt ( #3788 )
2023-05-23 16:58:45 +08:00
Hongxin Liu
3c07a2846e
[plugin] a workaround for zero plugins' optimizer checkpoint ( #3780 )
...
* [test] refactor torch ddp checkpoint test
* [plugin] update low level zero optim checkpoint
* [plugin] update gemini optim checkpoint
2023-05-19 19:42:31 +08:00
Hongxin Liu
5452df63c5
[plugin] torch ddp plugin supports sharded model checkpoint ( #3775 )
...
* [plugin] torch ddp plugin add save sharded model
* [test] fix torch ddp ckpt io test
* [test] fix torch ddp ckpt io test
* [test] fix low level zero plugin test
* [test] fix low level zero plugin test
* [test] add debug info
* [test] add debug info
* [test] add debug info
* [test] add debug info
* [test] add debug info
* [test] fix low level zero plugin test
* [test] fix low level zero plugin test
* [test] remove debug info
2023-05-18 20:05:59 +08:00
wukong1992
b37797ed3d
[booster] support torch fsdp plugin in booster ( #3697 )
...
Co-authored-by: 纪少敏 <jishaomin@jishaomindeMBP.lan>
2023-05-15 12:14:38 +08:00
Hongxin Liu
6552cbf8e1
[booster] fix no_sync method ( #3709 )
...
* [booster] fix no_sync method
* [booster] add test for ddp no_sync
* [booster] fix merge
* [booster] update unit test
* [booster] update unit test
* [booster] update unit test
2023-05-09 11:10:02 +08:00
Hongxin Liu
3bf09efe74
[booster] update prepare dataloader method for plugin ( #3706 )
...
* [booster] add prepare dataloader method for plug
* [booster] update examples and docstr
2023-05-08 15:44:03 +08:00
Hongxin Liu
d0915f54f4
[booster] refactor all dp fashion plugins ( #3684 )
...
* [booster] add dp plugin base
* [booster] inherit dp plugin base
* [booster] refactor unit tests
2023-05-05 19:36:10 +08:00
jiangmingyan
307894f74d
[booster] gemini plugin support shard checkpoint ( #3610 )
...
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
---------
Co-authored-by: luchen <luchen@luchendeMBP.lan>
Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>
2023-05-05 14:37:21 +08:00
Hongxin Liu
4b3240cb59
[booster] add low level zero plugin ( #3594 )
...
* [booster] add low level zero plugin
* [booster] fix gemini plugin test
* [booster] fix precision
* [booster] add low level zero plugin test
* [test] fix booster plugin test oom
* [test] fix booster plugin test oom
* [test] fix googlenet and inception output trans
* [test] fix diffuser clip vision model
* [test] fix torchaudio_wav2vec2_base
* [test] fix low level zero plugin test
2023-04-26 14:37:25 +08:00
Hongxin Liu
173dad0562
[misc] add verbose arg for zero and op builder ( #3552 )
...
* [misc] add print verbose
* [gemini] add print verbose
* [zero] add print verbose for low level
* [misc] add print verbose for op builder
2023-04-17 11:25:35 +08:00
Hongxin Liu
152239bbfa
[gemini] gemini supports lazy init ( #3379 )
...
* [gemini] fix nvme optimizer init
* [gemini] gemini supports lazy init
* [gemini] add init example
* [gemini] add fool model
* [zero] update gemini ddp
* [zero] update init example
* add chunk method
* add chunk method
* [lazyinit] fix lazy tensor tolist
* [gemini] fix buffer materialization
* [misc] remove useless file
* [booster] update gemini plugin
* [test] update gemini plugin test
* [test] fix gemini plugin test
* [gemini] fix import
* [gemini] fix import
* [lazyinit] use new metatensor
* [lazyinit] use new metatensor
* [lazyinit] fix __set__ method
2023-04-12 16:03:25 +08:00
Frank Lee
7d8d825681
[booster] fixed the torch ddp plugin with the new checkpoint api ( #3442 )
2023-04-06 09:43:51 +08:00
Frank Lee
1beb85cc25
[checkpoint] refactored the API and added safetensors support ( #3427 )
...
* [checkpoint] refactored the API and added safetensors support
* polish code
2023-04-04 15:23:01 +08:00
ver217
26b7aac0be
[zero] reorganize zero/gemini folder structure ( #3424 )
...
* [zero] refactor low-level zero folder structure
* [zero] fix legacy zero import path
* [zero] fix legacy zero import path
* [zero] remove useless import
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] fix test import path
* [zero] fix test
* [zero] fix circular import
* [zero] update import
2023-04-04 13:48:16 +08:00
ver217
5f2e34e6c9
[booster] implement Gemini plugin ( #3352 )
...
* [booster] add gemini plugin
* [booster] update docstr
* [booster] gemini plugin add coloparam convertor
* [booster] fix coloparam convertor
* [booster] fix gemini plugin device
* [booster] add gemini plugin test
* [booster] gemini plugin ignore sync bn
* [booster] skip some model
* [booster] skip some model
* [booster] modify test world size
* [booster] modify test world size
* [booster] skip test
2023-03-31 16:06:13 +08:00
Frank Lee
73d3e4d309
[booster] implemented the torch ddd + resnet example ( #3232 )
...
* [booster] implemented the torch ddd + resnet example
* polish code
2023-03-27 10:24:14 +08:00
Frank Lee
e7f3bed2d3
[booster] added the plugin base and torch ddp plugin ( #3180 )
...
* [booster] added the plugin base and torch ddp plugin
* polish code
* polish code
* polish code
2023-03-21 17:39:30 +08:00