klhhhhh
c49286985d
[shardformer] vit test finish and support
1 year ago
klhhhhh
f60162b265
[shardformer] added tests
1 year ago
Kun Lin
ed34bb1310
Feature/chatglm ( #4240 )
...
* [shardformer] added tests
* [shardformer] vit test finish and support
* [shardformer] chatglm ready
* import chatglm
* [shardformer] add test kit in model zoo for chatglm
* [shardformer] add first version of policy of chatglm
* [shardformer] polish chatglm code
* [shardformer] polish code
* [shardformer] support chatglm without layernorm
* [shardformer] chatglm shard without mlp sharding
* [shardformer] delete some file
* [shardformer] ChatGLM support layernorm sharding
* [shardformer] register without auto policy
* [shardformer] pre-commit check files
* [shardformer] fix chatglm configuration with pre-commit
1 year ago
FoolPlayer
9ee4ebea83
[shardformer] support whisper ( #4212 )
...
* support whisper
* fix bug in vocabembedding
* support downstream model of whisper
* update readme
1 year ago
FoolPlayer
dd2bf02679
[shardformer] support SAM ( #4231 )
...
* 1.support sam 2.add fused qkv for nn.Linear
* update utils support set element in list
* overwrite SamVisionAttention forward to use DropoutForParallelInput
* remove unused code
1 year ago
Baizhou Zhang
0ceec8f9a9
[pipeline] support fp32 for HybridPlugin/merge shardformer test and pipeline test into one file ( #4354 )
...
* add naive optimizer for 3DPlugin/refactor gpt2 shardformer test
* merge tests of PP/DP/TP combinations into one test file
* fix bug when sync grad for dp in HybridPlugin
* update supported precisions for 3DPlugin/fix bug when shifting tp_degree
* improve the passing of lazy_init
* modify lazy_init/use sync_shared_params
1 year ago
Jianghai
f13954cd58
[pipeline] refactor test pipeline and remove useless utils in pipeline ( #4324 )
...
* refactor tests
* refactor bloom model
* finish policy tests
* refactor tests
* fix test pure pipeline
* remove test pipeline and cutdown launch process
* refactor tests
* refactor bloom model
* finish policy tests
* refactor tests
* fix test pure pipeline
* remove test pipeline and cutdown launch process
1 year ago
LuGY
d3c6cd66f3
[pipeline] add unit test for 1f1b ( #4303 )
...
* add unit test for 1f1b
* polish code
* polish code and update ut version
* fix
1 year ago
Baizhou Zhang
da3cef27ad
[pipeline] fix return_dict/fix pure_pipeline_test ( #4331 )
1 year ago
Hongxin Liu
411cf1d2db
[hotfix] fix gemini and zero test ( #4333 )
...
* [hotfix] fix gemini and zero test
* [hotfix] fix lazy init test
* [hotfix] fix lazy init test
1 year ago
Hongxin Liu
261eab02fb
[plugin] add 3d parallel plugin ( #4295 )
...
* [amp] add mixed precision optimizer
* [plugin] add 3d parallel plugin
* [booster] support pipeline
* [plugin] 3d parallel plugin support clip grad norm
* [shardformer] fix sharder and add plugin test
* [plugin] rename 3d parallel plugin
* [ci] support testmon core pkg change detection (#4305 )
* [hotfix] debug testmon
* [hotfix] fix llama
* [hotfix] fix p2p bugs
* [hotfix] fix requirements
1 year ago
FoolPlayer
b3f5d7a3ba
[shardformer] support pipeline base vit model ( #4284 )
...
* Feature/vit support (#4182 )
* [shardformer] added tests
* [shardformer] vit test finish and support
* fix attention dropout
* support base vit pipeline
* support vit downstream model
* fix vit shard test
* modify hidden states return type
---------
Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com>
1 year ago
Baizhou Zhang
083d7da33d
[pipeline] add pipeline support for all T5 models ( #4310 )
...
* complete policy for T5Model & T5ForConditionalGeneration
* modify function signature in forwards
* add forward for T5model
* add forward for T5ForConditionalGeneration
* fix a bug
* fix hidden_states transporting in decoder
* fix the passing of encoder_outputs
1 year ago
Jianghai
d0807122e2
[pipeline] test pure pipeline process using llama ( #4218 )
...
* bloom policy
* llama pipeline forward and tests
* fix the output and attention_mask
* fix name
* bind argument to policy
* Revert "bloom policy"
This reverts commit 8dee68a0a2.
This policy should be reverted and copied to feature/bloom
* revert the bloom changes
* cancel unneeded inputs
* gpt
* finish llama
* causal lm and sequence classification
* revision
* add pure pipeline test
* fixed version
* fixed version
* pure pipeline
1 year ago
Baizhou Zhang
36e546b2cc
[pipeline] add pipeline support for T5Stack/T5EncoderModel ( #4300 )
...
* modify t5 policy & add test
* pipeline stage distribution for t5
* complete t5 base policy
* t5 stack: halfway
* modify gpt2 pipeline test
* complete pipeline forward for T5Stack/T5EncoderModel
* fix docstring
* move t5 util tests to test_pipeline
1 year ago
Jianghai
d8408d185c
[pipeline] OPT model pipeline ( #4258 )
...
* opt forward and test
* pause
* finish opt model pipeline
* finish opt pipeline
* opt forward and test
* pause
* finish opt model pipeline
* finish opt pipeline
* fix opt
* set transformers version
* refactor the test pipeline
1 year ago
Hongxin Liu
d921ce8391
[shardformer] support inplace sharding ( #4251 )
...
* [shardformer] embedding support inplace sharding
* [shardformer] linear support inplace sharding
* [shardformer] layernorm support inplace sharding
* [shardformer] qkv support inplace sharding
* [test] update shardformer layer test
* [shardformer] fix shared param sharding
* [shardformer] fix bert policy
* [shardformer] fix bloom policy
* [shardformer] fix llama policy
* [shardformer] fix opt policy
* [shardformer] fix t5 policy
* [shardformer] fix fused qkv linear
* [shardformer] fix bugs
* force sync
* [test] fix bugs
* [test] fix transformer version
1 year ago
Baizhou Zhang
2a2eacfaf1
[pipeline] support shardformer for GPT2ForQuestionAnswering & complete pipeline support for GPT2 ( #4245 )
...
* change for transformers loggers
* add forward for GPT2ForQuestionAnswering
* fix assert
* fix torchrec test
1 year ago
Jianghai
d9be0472ef
[bugs] hot fix some testing bugs for new models ( #4268 )
...
* hot fix
* hot fx tracer
1 year ago
Jianghai
34f0e34a4c
[pipeline] finish bloom models pipeline and tests ( #4223 )
...
* bloom policy
* llama pipeline forward and tests
* fix the output and attention_mask
* fix name
* bind argument to policy
* finish bloom model
* test shard gpt2
* clear cache
* support all bloom models
* add bloom models policies
* finish bloom pipeline and tests
* add set pipeline
* finish bloom
1 year ago
Jianghai
e7cc62d735
[pipeline] All bert models ( #4233 )
...
* bloom policy
* llama pipeline forward and tests
* fix the output and attention_mask
* fix name
* bind argument to policy
* Revert "bloom policy"
This reverts commit 8dee68a0a2.
This policy should be reverted and copied to feature/bloom
* revert the bloom changes
* cancel unneeded inputs
* gpt
* finish llama
* causal lm and sequence classification
* revision
* add pure pipeline test
* finish some bert models
* finish all bert models
* finish bert tests
* fix bugs
* fix bugs
* fix test pipeline
* fix data gen for qa
* update the set pipeline forward
* shared params
* fix bugs
1 year ago
Baizhou Zhang
a14d352088
[pipeline] add pipeline forward for variants of gpt2 ( #4238 )
...
* add forward for GPTLMHeadModel
* add test for gpt_lm
* arranging get_held_layers method
* arrange forward replacement
* add forward for GPT2ForTokenClassification
* add forward for GPT2ForSequenceClassification
* fix test_shard_gpt2.py
* add GPT2DoubleHeadsmodel & fix bugs
* add id checking in get_shared_params
1 year ago
Baizhou Zhang
208ac8f2ba
[pipeline] Add Pipeline Forward for GPT2Model Shardformer ( #4224 )
...
* * fix typehint & docstring in sharder.py
* * update pipeline forward for GPT2Model
* * add test for pipeline forward of GPT2Model
* * add cache cleaning in gpt2 test
* * change assert to raise command
1 year ago
Jianghai
37d22f6878
[pipeline] add bloom model pipeline ( #4210 )
...
* bloom policy
* llama pipeline forward and tests
* fix the output and attention_mask
* fix name
* bind argument to policy
* finish bloom model
* test shard gpt2
* clear cache
1 year ago
Jianghai
31bcf867ae
[pipeline] Llama causal lm and llama for sequence classification pipeline ( #4208 )
...
* bloom policy
* llama pipeline forward and tests
* fix the output and attention_mask
* fix name
* bind argument to policy
* Revert "bloom policy"
This reverts commit 8dee68a0a2.
This policy should be reverted and copied to feature/bloom
* revert the bloom changes
* cancel unneeded inputs
* gpt
* finish llama
* causal lm and sequence classification
* revision
1 year ago
Jianghai
1622031058
[pipeline] Llama pipeline ( #4205 )
...
* bloom policy
* llama pipeline forward and tests
* fix the output and attention_mask
* fix name
* bind argument to policy
* Revert "bloom policy"
This reverts commit 8dee68a0a2.
This policy should be reverted and copied to feature/bloom
* revert the bloom changes
* cancel unneeded inputs
* gpt
1 year ago
Jianghai
1094e0f0d3
[pipeline] Bert pipeline for shardformer and its tests ( #4197 )
...
* add pipeline forward
* complete pipeline forward check
* fix bert forward without pipeline
* fix comments
* discard useless line
* add todo
* clean prints
* fix distribute layers
1 year ago
Hongxin Liu
890774b2fb
[shardformer] support lazy init ( #4202 )
...
* [shardformer] support lazy init
* [shardformer] linear support lazy init
* [shardformer] embedding support lazy init
* [shardformer] norm support lazy init
* [shardformer] fused linear support lazy init
* [test] update shardformer test layer
* [test] shardformer with lazy init fit ddp
* [lazy] hotfix deepcopy of param
* [shardformer] fix bert policy and update test
* [shardformer] fix bloom policy and update test
* [shardformer] fix opt policy and update test
* [shardformer] fix t5 policy and update test
* [shardformer] fix gpt2 policy and update test
* [shardformer] fix llama policy and update test
1 year ago
Jianghai
f3bcc292c8
[pipeline] move bert related pipeline components to shardformer ( #4187 )
...
* move bert related pipeline components to shardformer
* fix bugs
* revision
* fix bert model tests
* fix bert_lm_head model tests
* fix tests
* fix tests
* done checks
* skip bloom
1 year ago
Jianghai
c5ea728016
[pipeline] add bert_for_pretraining bert_lmhead forward and policy ( #4172 )
...
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name conflict
* add bloom model and policy, revise the base class of policy
* revise
* revision
* add bert_for_pretraining
* add bert_for_pretraining forward and policy
* fix typos
* cancel warning
* change the immediate output to default dict
* change the default output of get_shared_params
1 year ago
ver217
5fc60a3a04
[test] add shard util tests
1 year ago
ver217
2d6cc07feb
[test] update shardformer tests
1 year ago
Jianghai
90a65ea682
[pipeline] build bloom model and policy , revise the base class of policy ( #4161 )
...
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name conflict
* add bloom model and policy, revise the base class of policy
* revise
* revision
* add bert_for_pretraining
1 year ago
Jianghai
c552cefa93
[pipeline]add pipeline policy and bert forward ( #4130 )
...
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name conflict
1 year ago
Hongxin Liu
5c897ddb94
[pipeline] add stage manager ( #4093 )
...
* [pipeline] add stage manager
* [test] add pipeline stage manager test
* [pipeline] add docstring for stage manager
1 year ago
Jianghai
e8e7e49243
[pipeline]add pipeline policy and bert forward ( #4130 )
...
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name conflict
1 year ago
Hongxin Liu
f51ce1bc8e
[pipeline] refactor 1f1b schedule ( #4115 )
...
* [api] update optimizer wrapper to fit pipeline
* [pipeline] add base schedule
* [pipeline] add 1f1b schedule
* [test] add pipeline schedule utils test
* [pipeline] fix import
1 year ago
Hongxin Liu
45fdc9b42c
[pipeline] implement p2p communication ( #4100 )
...
* [pipeline] add p2p communication
* [test] add p2p communication test
* [test] add rerun decorator
* [test] rename to avoid conflict
1 year ago
Hongxin Liu
422544222f
[pipeline] add stage manager ( #4093 )
...
* [pipeline] add stage manager
* [test] add pipeline stage manager test
* [pipeline] add docstring for stage manager
1 year ago
Hongxin Liu
5e1a9d48dd
[cluster] add process group mesh ( #4039 )
...
* [cluster] add process group mesh
* [test] add process group mesh test
* force sync
1 year ago
LuGY
d86ddd9b29
[hotfix] fix unsafe async comm in zero ( #4404 )
...
* improve stability of zero
* fix wrong index
* add record stream
1 year ago
flybird1111
458ae331ad
[kernel] updated unittests for coloattention ( #4389 )
...
Updated coloattention tests of checking outputs and gradients
1 year ago
flybird1111
38b792aab2
[coloattention] fix import error ( #4380 )
...
fixed an import error
1 year ago
flybird1111
25c57b9fb4
[fix] coloattention support flash attention 2 ( #4347 )
...
Improved ColoAttention interface to support flash attention 2. Solved #4322
1 year ago
Hongxin Liu
16bf4c0221
[test] remove useless tests ( #4359 )
...
* [test] remove legacy zero test
* [test] remove lazy distribute test
* [test] remove outdated checkpoint io
1 year ago
LuGY
1a49a5ea00
[zero] support shard optimizer state dict of zero ( #4194 )
...
* support shard optimizer of zero
* polish code
* support sync grad manually
1 year ago
LuGY
dd7cc58299
[zero] add state dict for low level zero ( #4179 )
...
* add state dict for zero
* fix unit test
* polish
1 year ago
LuGY
c668801d36
[zero] allow passing process group to zero12 ( #4153 )
...
* allow passing process group to zero12
* union tp-zero and normal-zero
* polish code
1 year ago
LuGY
79cf1b5f33
[zero]support no_sync method for zero1 plugin ( #4138 )
...
* support no sync for zero1 plugin
* polish
* polish
1 year ago
LuGY
c6ab96983a
[zero] refactor low level zero for shard evenly ( #4030 )
...
* refactor low level zero
* fix zero2 and support cpu offload
* avg gradient and modify unit test
* refactor grad store, support layer drop
* refactor bucket store, support grad accumulation
* fix and update unit test of zero and ddp
* compatible with tp, ga and unit test
* fix memory leak and polish
* add zero layer drop unittest
* polish code
* fix import err in unit test
* support different comm dtype, modify docstring style
* polish code
* test padding and fix
* fix unit test of low level zero
* fix pad recording in bucket store
* support some models
* polish
1 year ago
Baizhou Zhang
c6f6005990
[checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin ( #4302 )
...
* sharded optimizer checkpoint for gemini plugin
* modify test to reduce testing time
* update doc
* fix bug when keep_gatherd is true under GeminiPlugin
1 year ago
Hongxin Liu
fc5cef2c79
[lazy] support init on cuda ( #4269 )
...
* [lazy] support init on cuda
* [test] update lazy init test
* [test] fix transformer version
1 year ago
Cuiqing Li
4b977541a8
[Kernels] added triton-implemented of self attention for colossal-ai ( #4241 )
...
* added softmax kernel
* added qkv_kernel
* added ops
* adding tests
* upload tests
* fix tests
* debugging
* debugging tests
* debugging
* added
* fixed errors
* added softmax kernel
* clean codes
* added tests
* update tests
* update tests
* added attention
* add
* fixed pytest checking
* add cuda check
* fix cuda version
* fix typo
1 year ago
Baizhou Zhang
58913441a1
[checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin ( #4141 )
...
* [checkpointio] unsharded optimizer checkpoint for Gemini plugin
* [checkpointio] unsharded optimizer checkpoint for Gemini using all_gather
1 year ago
github-actions[bot]
c77b3b19be
[format] applied code formatting on changed files in pull request 4152 ( #4157 )
...
Co-authored-by: github-actions <github-actions@github.com>
1 year ago
Frank Lee
1fb0d95df0
[shardformer] made tensor parallelism configurable ( #4144 )
...
* [shardformer] made tensor parallelism configurable
* polish code
1 year ago
Frank Lee
74257cb446
[shardformer] refactored some doc and api ( #4137 )
...
* [shardformer] refactored some doc and api
* polish code
1 year ago
Frank Lee
ae035d305d
[shardformer] added embedding gradient check ( #4124 )
1 year ago
Frank Lee
6a88bae4ec
[shardformer] integrate with data parallelism ( #4103 )
1 year ago
Frank Lee
f3b6aaa6b7
[shardformer] supported fused normalization ( #4112 )
1 year ago
Frank Lee
b1c2901530
[shardformer] supported bloom model ( #4098 )
1 year ago
Kun Lin
8af29ee47a
[shardformer] support vision transformer ( #4096 )
...
* first v of vit shardformer
* keep vit
* update
* vit shard add vitattention vitlayer
* update num head shard para
* finish test for vit
* add new_model_class & postprocess
* add vit readme
* delete old files & fix the conflict
* fix sth
1 year ago
jiangmingyan
ac80937138
[shardformer] shardformer support opt models ( #4091 )
...
* [shardformer] shardformer support opt models
* [shardformer] shardformer support opt models, fix
* [shardformer] shardformer support opt models, fix
* [shardformer] shardformer support opt models, fix
1 year ago
Frank Lee
d33a44e8c3
[shardformer] refactored layernorm ( #4086 )
1 year ago
Frank Lee
c4b1b65931
[test] fixed tests failed due to dtensor change ( #4082 )
...
* [test] fixed tests failed due to dtensor change
* polish code
1 year ago
FoolPlayer
92f6791095
[shardformer] Add layernorm ( #4072 )
...
* add layernorm to bert
* add layernorm test
* add layernorm test with load state dict
* add use_mixedfusedLN in shard config
* refactor policy to support fused_layernorm
1 year ago
Frank Lee
70c58cfd4f
[shardformer] supported fused qkv checkpoint ( #4073 )
1 year ago
FoolPlayer
0803a61412
[shardformer] add linearconv1d test ( #4067 )
...
* add linearconv1d test
* add linearconv1d test
1 year ago
Frank Lee
8eb09a4c69
[shardformer] support module saving and loading ( #4062 )
...
* [shardformer] support module saving and loading
* polish code
1 year ago
FoolPlayer
7740c55c55
support kit use for bert/gpt test ( #4055 )
...
* support kit use for bert test
* support kit test for gpt2
1 year ago
Frank Lee
f22ddacef0
[shardformer] refactored the shardformer layer structure ( #4053 )
1 year ago
Frank Lee
58df720570
[shardformer] adapted T5 and LLaMa test to use kit ( #4049 )
...
* [shardformer] adapted T5 and LLaMa test to use kit
* polish code
1 year ago
FoolPlayer
4021b9a8a2
[shardformer] add gpt2 test and layer class refactor ( #4041 )
...
* add gpt2 test and layer class refactor
* add dropout in gpt2 policy
1 year ago
Frank Lee
d857f3dbba
[shardformer] supported T5 and its variants ( #4045 )
1 year ago
Frank Lee
c1d5453e9f
[shardformer] adapted llama to the new API ( #4036 )
1 year ago
FoolPlayer
74d176c8d8
[shardformer] fix bert and gpt downstream with new api ( #4024 )
...
* fix bert downstream with new api
* remove comment line
1 year ago
FoolPlayer
507c0ad368
add vocabembedding layer
1 year ago
Frank Lee
3893fa1a8d
[shardformer] refactored embedding and dropout to parallel module ( #4013 )
...
* [shardformer] refactored embedding and dropout to parallel module
* polish code
1 year ago
FoolPlayer
dfca9678fa
integrate with dist layer ( #4011 )
1 year ago
Frank Lee
015af592f8
[shardformer] integrated linear 1D with dtensor ( #3996 )
...
* [shardformer] integrated linear 1D with dtensor
* polish code
1 year ago
Frank Lee
611971248c
[device] support init device mesh from process group ( #3990 )
1 year ago
FoolPlayer
f7774ec0f3
[Shardformer] Downstream bert ( #3979 )
...
* add dist dropout in model
* update docstring and bert policy with dropout
* refactor basepolicy and sharded, update bert
* update format
* update gpt2 policy
* update bert policy
* remove unused code
* update readme for new policy usage
* add downstream model of bert
* remove unused code
1 year ago
wukong1992
c1c672d0f0
[shardformer] shardformer support t5 model ( #3994 )
...
test t5
1 year ago
wukong1992
6b30dfb7ce
[shardformer] support llama model using shardformer ( #3969 )
...
adjust layer attr
1 year ago
FoolPlayer
a73130482d
[shardformer] Unit test ( #3928 )
...
* fix bug in slicer, add slicer unit test
* add dropout test
* use pid as dropout seed
* update dropout test with local pattern
* add todo
1 year ago
FoolPlayer
f1cb5ac6bf
[shardformer] Align bert value ( #3907 )
...
* add bert align test, fix dist loss bug
* forward and backward align
* add ignore index
* add shardformer CI
* add gather_output optional for user in shardconfig
* update readme with optional gather_output
* add dist crossentropy loss test, remove unused files
* remove unused file
* remove unused file
* rename the file
* polish code
1 year ago
Baizhou Zhang
0bb0b481b4
[gemini] fix argument naming during chunk configuration searching
1 year ago
github-actions[bot]
a52f62082d
[format] applied code formatting on changed files in pull request 4021 ( #4022 )
...
Co-authored-by: github-actions <github-actions@github.com>
1 year ago
Frank Lee
a5883aa790
[test] fixed codefactor format report ( #4026 )
1 year ago
Baizhou Zhang
822c3d4d66
[checkpointio] sharded optimizer checkpoint for DDP plugin ( #4002 )
1 year ago
Wenhao Chen
725af3eeeb
[booster] make optimizer argument optional for boost ( #3993 )
...
* feat: make optimizer optional in Booster.boost
* test: skip unet test if diffusers version > 0.10.2
1 year ago
Baizhou Zhang
c9cff7e7fa
[checkpointio] General Checkpointing of Sharded Optimizers ( #3984 )
1 year ago
digger yu
e61ffc77c6
fix typo tests/ ( #3936 )
1 year ago
Frank Lee
ddcf58cacf
Revert "[sync] sync feature/shardformer with develop"
1 year ago
Frank Lee
eb39154d40
[dtensor] updated api and doc ( #3845 )
1 year ago
Hongxin Liu
ae02d4e4f7
[bf16] add bf16 support ( #3882 )
...
* [bf16] add bf16 support for fused adam (#3844 )
* [bf16] fused adam kernel support bf16
* [test] update fused adam kernel test
* [test] update fused adam test
* [bf16] cpu adam and hybrid adam optimizers support bf16 (#3860 )
* [bf16] implement mixed precision mixin and add bf16 support for low level zero (#3869 )
* [bf16] add mixed precision mixin
* [bf16] low level zero optim support bf16
* [test] update low level zero test
* [test] fix low level zero grad acc test
* [bf16] add bf16 support for gemini (#3872 )
* [bf16] gemini support bf16
* [test] update gemini bf16 test
* [doc] update gemini docstring
* [bf16] add bf16 support for plugins (#3877 )
* [bf16] add bf16 support for legacy zero (#3879 )
* [zero] init context support bf16
* [zero] legacy zero support bf16
* [test] add zero bf16 test
* [doc] add bf16 related docstring for legacy zero
2 years ago
Hongxin Liu
dbb32692d2
[lazy] refactor lazy init ( #3891 )
...
* [lazy] remove old lazy init
* [lazy] refactor lazy init folder structure
* [lazy] fix lazy tensor deepcopy
* [test] update lazy init test
2 years ago
wukong1992
6b305a99d6
[booster] torch fsdp fix ckpt ( #3788 )
2 years ago
Frank Lee
615e2e5fc1
[test] fixed lazy init test import error ( #3799 )
2 years ago
Hongxin Liu
3c07a2846e
[plugin] a workaround for zero plugins' optimizer checkpoint ( #3780 )
...
* [test] refactor torch ddp checkpoint test
* [plugin] update low level zero optim checkpoint
* [plugin] update gemini optim checkpoint
2 years ago
Hongxin Liu
5452df63c5
[plugin] torch ddp plugin supports sharded model checkpoint ( #3775 )
...
* [plugin] torch ddp plugin add save sharded model
* [test] fix torch ddp ckpt io test
* [test] fix torch ddp ckpt io test
* [test] fix low level zero plugin test
* [test] fix low level zero plugin test
* [test] add debug info
* [test] add debug info
* [test] add debug info
* [test] add debug info
* [test] add debug info
* [test] fix low level zero plugin test
* [test] fix low level zero plugin test
* [test] remove debug info
2 years ago
wukong1992
6050f37776
[booster] removed models that don't support fsdp ( #3744 )
...
Co-authored-by: 纪少敏 <jishaomin@jishaomindeMBP.lan>
2 years ago
Hongxin Liu
afb239bbf8
[devops] update torch version of CI ( #3725 )
...
* [test] fix flop tensor test
* [test] fix autochunk test
* [test] fix lazyinit test
* [devops] update torch version of CI
* [devops] enable testmon
* [devops] fix ci
* [devops] fix ci
* [test] fix checkpoint io test
* [test] fix cluster test
* [test] fix timm test
* [devops] fix ci
* [devops] fix ci
* [devops] fix ci
* [devops] fix ci
* [devops] force sync to test ci
* [test] skip fsdp test
2 years ago
wukong1992
b37797ed3d
[booster] support torch fsdp plugin in booster ( #3697 )
...
Co-authored-by: 纪少敏 <jishaomin@jishaomindeMBP.lan>
2 years ago
digger-yu
1f73609adb
[CI] fix typo with tests/ etc. ( #3727 )
...
* fix spelling error with examples/comminity/
* fix spelling error with tests/
* fix some spelling error with tests/ colossalai/ etc.
* fix spelling error with tests/ etc. date:2023.5.10
2 years ago
digger-yu
b7141c36dd
[CI] fix some spelling errors ( #3707 )
...
* fix spelling error with examples/comminity/
* fix spelling error with tests/
* fix some spelling error with tests/ colossalai/ etc.
2 years ago
jiangmingyan
20068ba188
[booster] add tests for ddp and low level zero's checkpointio ( #3715 )
...
* [booster] update tests for booster
* [booster] update tests for booster
* [booster] update tests for booster
* [booster] update tests for booster
* [booster] update tests for booster
* [booster] update booster tutorials#3717, fix recursive check
2 years ago
Hongxin Liu
6552cbf8e1
[booster] fix no_sync method ( #3709 )
...
* [booster] fix no_sync method
* [booster] add test for ddp no_sync
* [booster] fix merge
* [booster] update unit test
* [booster] update unit test
* [booster] update unit test
2 years ago
Hongxin Liu
3bf09efe74
[booster] update prepare dataloader method for plugin ( #3706 )
...
* [booster] add prepare dataloader method for plug
* [booster] update examples and docstr
2 years ago
Hongxin Liu
d0915f54f4
[booster] refactor all dp fashion plugins ( #3684 )
...
* [booster] add dp plugin base
* [booster] inherit dp plugin base
* [booster] refactor unit tests
2 years ago
digger-yu
b49020c1b1
[CI] Update test_sharded_optim_with_sync_bn.py ( #3688 )
...
fix spelling error in line23
change "cudnn_determinstic"=True to "cudnn_deterministic=True"
2 years ago
jiangmingyan
307894f74d
[booster] gemini plugin support shard checkpoint ( #3610 )
...
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
---------
Co-authored-by: luchen <luchen@luchendeMBP.lan>
Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>
2 years ago
Hongxin Liu
50793b35f4
[gemini] accelerate inference ( #3641 )
...
* [gemini] support don't scatter after inference
* [chat] update colossalai strategy
* [chat] fix opt benchmark
* [chat] update opt benchmark
* [gemini] optimize inference
* [test] add gemini inference test
* [chat] fix unit test ci
* [chat] fix ci
* [chat] fix ci
* [chat] skip checkpoint test
2 years ago
Hongxin Liu
4b3240cb59
[booster] add low level zero plugin ( #3594 )
...
* [booster] add low level zero plugin
* [booster] fix gemini plugin test
* [booster] fix precision
* [booster] add low level zero plugin test
* [test] fix booster plugin test oom
* [test] fix booster plugin test oom
* [test] fix googlenet and inception output trans
* [test] fix diffuser clip vision model
* [test] fix torchaudio_wav2vec2_base
* [test] fix low level zero plugin test
2 years ago
Hongxin Liu
f313babd11
[gemini] support save state dict in shards ( #3581 )
...
* [gemini] support state dict shard
* [gemini] add test state dict shard
* [gemini] polish docstr
* [gemini] fix merge
* [gemini] polish code
2 years ago
Hongxin Liu
152239bbfa
[gemini] gemini supports lazy init ( #3379 )
...
* [gemini] fix nvme optimizer init
* [gemini] gemini supports lazy init
* [gemini] add init example
* [gemini] add fool model
* [zero] update gemini ddp
* [zero] update init example
* add chunk method
* add chunk method
* [lazyinit] fix lazy tensor tolist
* [gemini] fix buffer materialization
* [misc] remove useless file
* [booster] update gemini plugin
* [test] update gemini plugin test
* [test] fix gemini plugin test
* [gemini] fix import
* [gemini] fix import
* [lazyinit] use new metatensor
* [lazyinit] use new metatensor
* [lazyinit] fix __set__ method
2 years ago
jiangmingyan
52a933e175
[checkpoint] support huggingface style sharded checkpoint ( #3461 )
...
* [checkpoint] support huggingface style sharded checkpoint
* [checkpoint] support huggingface style sharded checkpoint
* [checkpoint] support huggingface style sharded checkpoint
* [checkpoint] support huggingface style sharded checkpoint
* [checkpoint] support huggingface style sharded checkpoint
---------
Co-authored-by: luchen <luchen@luchendeMBP.lan>
2 years ago
Frank Lee
80eba05b0a
[test] refactor tests with spawn ( #3452 )
...
* [test] added spawn decorator
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
2 years ago
ver217
933048ad3e
[test] reorganize zero/gemini tests ( #3445 )
2 years ago
YuliangLiu0306
ffcdbf0f65
[autoparallel]integrate auto parallel feature with new tracer ( #3408 )
...
* [autoparallel] integrate new analyzer in module level
* unify the profiling method
* polish
* fix no codegen bug
* fix pass bug
* fix liveness test
* polish
2 years ago
Frank Lee
1beb85cc25
[checkpoint] refactored the API and added safetensors support ( #3427 )
...
* [checkpoint] refactored the API and added safetensors support
* polish code
2 years ago
ver217
26b7aac0be
[zero] reorganize zero/gemini folder structure ( #3424 )
...
* [zero] refactor low-level zero folder structure
* [zero] fix legacy zero import path
* [zero] fix legacy zero import path
* [zero] remove useless import
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] fix test import path
* [zero] fix test
* [zero] fix circular import
* [zero] update import
2 years ago
Frank Lee
638a07a7f9
[test] fixed gemini plugin test ( #3411 )
...
* [test] fixed gemini plugin test
* polish code
* polish code
2 years ago
ver217
5f2e34e6c9
[booster] implement Gemini plugin ( #3352 )
...
* [booster] add gemini plugin
* [booster] update docstr
* [booster] gemini plugin add coloparam convertor
* [booster] fix coloparam convertor
* [booster] fix gemini plugin device
* [booster] add gemini plugin test
* [booster] gemini plugin ignore sync bn
* [booster] skip some model
* [booster] skip some model
* [booster] modify test world size
* [booster] modify test world size
* [booster] skip test
2 years ago
HELSON
1a1d68b053
[moe] add checkpoint for moe models ( #3354 )
...
* [moe] add checkpoint for moe models
* [hotfix] fix bugs in unit test
2 years ago
YuliangLiu0306
fee2af8610
[autoparallel] adapt autoparallel with new analyzer ( #3261 )
...
* [autoparallel] adapt autoparallel with new analyzer
* fix all node handler tests
* polish
* polish
2 years ago
Frank Lee
73d3e4d309
[booster] implemented the torch ddp + resnet example ( #3232 )
...
* [booster] implemented the torch ddp + resnet example
* polish code
2 years ago
YuliangLiu0306
4d5d8f98a4
[API] implement device mesh manager ( #3221 )
...
* [API] implement device mesh manager
* polish
2 years ago
YuliangLiu0306
045afa3ea2
[hotfix] skip torchaudio tracing test ( #3211 )
...
* [hotfix] skip torchaudio tracing test
* fix lazy init test issue
2 years ago
Frank Lee
cd142fbefa
[api] implemented the checkpoint io module ( #3205 )
...
* [api] implemented the checkpoint io module
* polish code
* polish code
2 years ago
ver217
f8289d4221
[lazyinit] combine lazy tensor with dtensor ( #3204 )
...
* [lazyinit] lazy tensor add distribute
* [lazyinit] refactor distribute
* [lazyinit] add test dist lazy init
* [lazyinit] add verbose info for dist lazy init
* [lazyinit] fix rnn flatten weight op
* [lazyinit] polish test
* [lazyinit] polish test
* [lazyinit] fix lazy tensor data setter
* [lazyinit] polish test
* [lazyinit] fix clean
* [lazyinit] make materialize inplace
* [lazyinit] refactor materialize
* [lazyinit] refactor test distribute
* [lazyinit] fix requires_grad
* [lazyinit] fix tolist after materialization
* [lazyinit] refactor distribute module
* [lazyinit] polish docstr
* [lazyinit] polish lazy init context
* [lazyinit] temporarily skip test
* [lazyinit] polish test
* [lazyinit] add docstr
2 years ago
YuliangLiu0306
019a847432
[Analyzer] fix analyzer tests ( #3197 )
2 years ago
YuliangLiu0306
f57d34958b
[FX] refactor experimental tracer and adapt it with hf models ( #3157 )
...
* pass gpt trace and meta_prop
* pass t5 trace and meta_prop
* [FX] refactor experimental tracer and adapt it with hf models
* pass all mainstream model zoo
* fix CI
* fix CI
* fix CI
* fix CI
* fix CI
* fix CI
* fix CI
* fix CI
* skip tests
* fix CI
* using packaging version
* polish
2 years ago
Frank Lee
e7f3bed2d3
[booster] added the plugin base and torch ddp plugin ( #3180 )
...
* [booster] added the plugin base and torch ddp plugin
* polish code
* polish code
* polish code
2 years ago
Zihao
18dbe76cae
[auto-parallel] add auto-offload feature ( #3154 )
...
* add auto-offload feature
* polish code
* fix syn offload runtime pass bug
* add offload example
* fix offload testing bug
* fix example testing bug
2 years ago
zbian
7bc0afc901
updated flash attention usage
2 years ago
Frank Lee
085e7f4eff
[test] fixed torchrec registration in model zoo ( #3177 )
...
* [test] fixed torchrec registration in model zoo
* polish code
* polish code
* polish code
2 years ago
Frank Lee
a9b8402d93
[booster] added the accelerator implementation ( #3159 )
2 years ago
Frank Lee
1ad3a636b1
[test] fixed torchrec model test ( #3167 )
...
* [test] fixed torchrec model test
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
2 years ago
ver217
6ae8ed0407
[lazyinit] add correctness verification ( #3147 )
...
* [lazyinit] fix shared module
* [tests] add lazy init test utils
* [tests] add torchvision for lazy init
* [lazyinit] fix pre op fn
* [lazyinit] handle legacy constructor
* [tests] refactor lazy init test models
* [tests] refactor lazy init test utils
* [lazyinit] fix ops don't support meta
* [tests] lazy init test timm models
* [lazyinit] fix set data
* [lazyinit] handle apex layers
* [tests] lazy init test transformers models
* [tests] lazy init test torchaudio models
* [lazyinit] fix import path
* [tests] lazy init test torchrec models
* [tests] update torch version in CI
* [tests] revert torch version in CI
* [tests] skip lazy init test
2 years ago
Frank Lee
ed19290560
[booster] implemented mixed precision class ( #3151 )
...
* [booster] implemented mixed precision class
* polish code
2 years ago
YuliangLiu0306
ecd643f1e4
[test] add torchrec models to test model zoo ( #3139 )
2 years ago
ver217
14a115000b
[tests] model zoo add torchaudio models ( #3138 )
...
* [tests] model zoo add torchaudio models
* [tests] refactor torchaudio wavernn
* [tests] refactor fx torchaudio tests
2 years ago
Frank Lee
6d48eb0560
[test] added transformers models to test model zoo ( #3135 )
2 years ago
Frank Lee
a674c63348
[test] added torchvision models to test model zoo ( #3132 )
...
* [test] added torchvision models to test model zoo
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
2 years ago
HELSON
1216d1e7bd
[tests] diffuser models in model zoo ( #3136 )
...
* [tests] diffuser models in model zoo
* remove useless code
* [tests] add diffusers to requirement-test
2 years ago
YuliangLiu0306
2eca4cd376
[DTensor] refactor dtensor with new components ( #3089 )
...
* [DTensor] refactor dtensor with new components
* polish
2 years ago
Frank Lee
86ac782d7c
[test] added timm models to test model zoo ( #3129 )
...
* [test] added timm models to test model zoo
* polish code
* polish code
* polish code
* polish code
* polish code
2 years ago
Xuanlei Zhao
30dd13c450
[autochunk] support complete benchmark ( #3121 )
...
* refactor memory code
* don't log free var memory
* add memory align
* update chunk target
* update setting for new memory
* finish test
* update tracer
* update typo
* update test
* add unet test
* add bench
* update bench
* update bench
* init
* support vit
* move to cpu
* add cpu benchmark
2 years ago
Super Daniel
fff98f06ed
[analyzer] a minimal implementation of static graph analyzer ( #2852 )
...
* [hotfix] meta tensor default device.
* [siu] add experimental submodules to main branch.
* [siu]
* [siu]
* [analyzer] init.
* [analyzer] readme.
* [analyzer] readme.
* [analyzer] readme.
* [analyzer] readme.
* [test] add test.
* Update symbolic_trace.py
* mark skip tests.
* try except.
* try except.
* try except.
* s
* init
* init
* fix
* skip
* skip
---------
Co-authored-by: Daniel Shao <superdainiu@MININT-PVARVID.fareast.corp.microsoft.com>
Co-authored-by: Daniel Shao <superdainiu@Daniels-Mac.local>
2 years ago
Xuanlei Zhao
10c61de2f7
[autochunk] support vit ( #3084 )
...
support vit for autochunk
* support some new ops for vit
* fix some bugs
* add test for vit
2 years ago
YuliangLiu0306
8e4e8601b7
[DTensor] implement layout converter ( #3055 )
...
* [DTensor] refactor LayoutConverter for DTensor
* polish code
* polish docstring
2 years ago
Xuanlei Zhao
2ca9728cbb
[autochunk] refactor chunk memory estimation ( #2762 )
...
* refactor memory code
* don't log free var memory
* add memory align
* update chunk target
* update setting for new memory
* finish test
* update tracer
* update typo
* update test
2 years ago
YuliangLiu0306
29386a54e6
[DTensor] refactor CommSpec ( #3034 )
2 years ago
YuliangLiu0306
4269196c79
[hotfix] skip auto checkpointing tests ( #3029 )
...
* [hotfix] skip auto checkpointing tests
* fix test name issue
2 years ago
YuliangLiu0306
cd2b0eaa8d
[DTensor] refactor sharding spec ( #2987 )
...
* [autoparallel] refactor sharding spec
* rename function name
2 years ago
YuliangLiu0306
e414e4092b
[DTensor] implementation of dtensor ( #2946 )
...
* [DTensor] implementation of dtensor
* test layout convert
* polish
2 years ago
YuliangLiu0306
197d0bf4ed
[autoparallel] apply repeat block to reduce solving time ( #2912 )
2 years ago
YuliangLiu0306
819e25d8b1
[hotfix] fix autoparallel compatibility test issues ( #2754 )
2 years ago
YuliangLiu0306
0f392d7403
[autoparallel] find repeat blocks ( #2854 )
...
* [autoparallel] find repeat blocks
* polish
* polish
* polish
2 years ago
Boyuan Yao
c7764d3f22
[autoparallel] Patch meta information of `torch.where` ( #2822 )
...
* [autoparallel] patch meta information of torch.where
* [autoparallel] pre-commit modified
2 years ago
Boyuan Yao
fcc4097efa
[autoparallel] Patch meta information of `torch.tanh()` and `torch.nn.Dropout` ( #2773 )
...
* [autoparallel] tanh meta information
* [autoparallel] remove redundant code
* [autoparallel] patch meta information of torch.nn.Dropout
2 years ago
Boyuan Yao
7ea6bc7f69
[autoparallel] Patch tensor related operations meta information ( #2789 )
...
* [autoparallel] tensor related meta information prototype
* [autoparallel] tensor related meta information
* [autoparallel] tensor related meta information
* [autoparallel] tensor related meta information
* [autoparallel] tensor related meta information
2 years ago
HELSON
56ddc9ca7a
[hotfix] add correct device for fake_param ( #2796 )
2 years ago
Boyuan Yao
a2b43e393d
[autoparallel] Patch meta information of `torch.nn.Embedding` ( #2760 )
...
* [autoparallel] embedding metainfo
* [autoparallel] fix function name in test_activation_metainfo
* [autoparallel] undo changes in activation metainfo and related tests
2 years ago
YuliangLiu0306
1dc003c169
[autoparallel] distinguish different parallel strategies ( #2699 )
2 years ago
YuliangLiu0306
21d6a48f4d
[autoparallel] add shard option ( #2696 )
...
* [autoparallel] add shard option
* polish
2 years ago
YuliangLiu0306
cb2c6a2415
[autoparallel] refactor runtime pass ( #2644 )
...
* [autoparallel] refactor runtime pass
* add unit test
* polish
2 years ago
YuliangLiu0306
0b2a738393
[autoparallel] remove deprecated codes ( #2664 )
2 years ago
YuliangLiu0306
7fa6be49d2
[autoparallel] test compatibility for gemini and auto parallel ( #2700 )
2 years ago
Boyuan Yao
40c916b192
[autoparallel] Patch meta information of `torch.nn.functional.softmax` and `torch.nn.Softmax` ( #2674 )
...
* [autoparallel] softmax metainfo
* [autoparallel] softmax metainfo
2 years ago
HELSON
8213f89fd2
[gemini] add fake_release_chunk for keep-gathered chunk in the inference mode ( #2671 )
2 years ago
Boyuan Yao
0385b26ebf
[autoparallel] Patch meta information of `torch.nn.LayerNorm` ( #2647 )
...
* [autoparallel] layernorm metainfo patch
* [autoparallel] polish test
2 years ago
YuliangLiu0306
37df666f38
[autoparallel] refactor handlers which reshape input tensors ( #2615 )
...
* [autoparallel] refactor handlers which reshape input tensors
* polish
2 years ago
YuliangLiu0306
cb3d1bef62
[autoparallel] adapt autoparallel tests with latest api ( #2626 )
2 years ago
Boyuan Yao
90a9fdd91d
[autoparallel] Patch meta information of `torch.matmul` ( #2584 )
...
* [autoparallel] matmul metainfo
* [auto_parallel] remove unused print
* [tests] skip test_matmul_handler when torch version is lower than 1.12.0
2 years ago
oahzxl
6ba8364881
[autochunk] support diffusion for autochunk ( #2621 )
...
* add alphafold benchmark
* rename alphafold test
* rename tests
* rename diffuser
* rename
* rename
* update transformer
* update benchmark
* update benchmark
* update bench memory
* update transformer benchmark
* rename
* support diffuser
* support unet metainfo prop
* fix bug and simplify code
* update linear and support some op
* optimize max region search, support conv
* update unet test
* support some op
* support groupnorm and interpolate
* update flow search
* add fix dim in node flow
* fix utils
* rename
* support diffusion
* update diffuser
* update chunk search
* optimize imports
* import
* finish autochunk
2 years ago
oahzxl
c4b15661d7
[autochunk] add benchmark for transformer and alphafold ( #2543 )
2 years ago
oahzxl
05671fcb42
[autochunk] support multi outputs chunk search ( #2538 )
...
Support multi-output chunk search. Previously only single-output chunk search was supported. The new search is more flexible and improves performance by a large margin. For transformer, it reduces memory by 40% compared with the previous search strategy.
1. rewrite search strategy to support multi outputs chunk search
2. fix many, many bugs
3. update tests
2 years ago
oahzxl
63199c6687
[autochunk] support transformer ( #2526 )
2 years ago
Frank Lee
b55deb0662
[workflow] only report coverage for changed files ( #2524 )
...
* [workflow] only report coverage for changed files
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
* polish file
2 years ago
HELSON
b528eea0f0
[zero] add zero wrappers ( #2523 )
...
* [zero] add zero wrappers
* change names
* add wrapper functions to init
2 years ago
HELSON
077a5cdde4
[zero] fix gradient clipping in hybrid parallelism ( #2521 )
...
* [zero] fix gradient clipping in hybrid parallelism
* [testing] change model name to avoid pytest warning
* [hotfix] fix unit testing
2 years ago
HELSON
707b11d4a0
[gemini] update ddp strict mode ( #2518 )
...
* [zero] add strict ddp mode for chunk init
* [gemini] update gpt example
2 years ago
HELSON
2d1a7dfe5f
[zero] add strict ddp mode ( #2508 )
...
* [zero] add strict ddp mode
* [polish] add comments for strict ddp mode
* [zero] fix test error
2 years ago
oahzxl
c04f183237
[autochunk] support parsing blocks ( #2506 )
2 years ago
oahzxl
72341e65f4
[auto-chunk] support extramsa ( #3 ) ( #2504 )
2 years ago
oahzxl
ecccc91f21
[autochunk] support autochunk on evoformer ( #2497 )
2 years ago
HELSON
d565a24849
[zero] add unit testings for hybrid parallelism ( #2486 )
2 years ago
oahzxl
4953b4ace1
[autochunk] support evoformer tracer ( #2485 )
...
Support the full evoformer tracer; evoformer is a main module of alphafold. Previously only a simplified version of it was supported.
1. support some of evoformer's ops in fx
2. support evoformer test
3. add repos for test code
2 years ago
YuliangLiu0306
67e1912b59
[autoparallel] support origin activation ckpt on autoparallel system ( #2468 )
2 years ago
HELSON
21c88220ce
[zero] add unit test for low-level zero init ( #2474 )
2 years ago
HELSON
a5dc4253c6
[zero] polish low level optimizer ( #2473 )
2 years ago
Jiarui Fang
867c8c2d3a
[zero] low level optim supports ProcessGroup ( #2464 )
2 years ago
YuliangLiu0306
8221fd7485
[autoparallel] update binary elementwise handler ( #2451 )
...
* [autoparallel] update binary elementwise handler
* polish
2 years ago
HELSON
5521af7877
[zero] fix state_dict and load_state_dict for ddp ignored parameters ( #2443 )
...
* [ddp] add is_ddp_ignored
[ddp] rename to is_ddp_ignored
* [zero] fix state_dict and load_state_dict
* fix bugs
* [zero] update unit test for ZeroDDP
2 years ago
YuliangLiu0306
41429b9b28
[autoparallel] add shard option ( #2423 )
2 years ago
HELSON
bb4e9a311a
[zero] add inference mode and its unit test ( #2418 )
2 years ago
oahzxl
61fdd3464a
update doc
2 years ago
oahzxl
36ab2cb783
change import
2 years ago
oahzxl
7ab2db206f
adapt new fx
2 years ago
oahzxl
e532679c95
Merge branch 'main' of https://github.com/oahzxl/ColossalAI into chunk
2 years ago
oahzxl
c1492e5013
add test in import
2 years ago
HELSON
ea13a201bb
[polish] polish code for get_static_torch_model ( #2405 )
...
* [gemini] polish code
* [testing] remove code
* [gemini] make more robust
2 years ago
oahzxl
212b5b1b5f
add comments
2 years ago
oahzxl
aafc3516a5
add available
2 years ago
oahzxl
d5c4f0bf95
code style
2 years ago
oahzxl
d106b271f8
add chunk search test
2 years ago
oahzxl
a005965d2d
update codegen test
2 years ago
oahzxl
3abbaf8bc6
update codegen test
2 years ago
oahzxl
74b81395a2
update codegen test
2 years ago
oahzxl
18a51c87fe
rename test
2 years ago
oahzxl
cb68ee864a
set benchmark
2 years ago
Jiarui Fang
4e96039649
[device] find best logical mesh
2 years ago
Frank Lee
40d376c566
[setup] support pre-build and jit-build of cuda kernels ( #2374 )
...
* [setup] support pre-build and jit-build of cuda kernels
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
2 years ago
oahzxl
a6cdbf9161
separate trace flow
2 years ago
oahzxl
da4076846d
rename
2 years ago
oahzxl
fd87d78a28
rename ambiguous variable
2 years ago
oahzxl
8a634af2f5
close mem and code print
2 years ago
oahzxl
1a6d2a740b
take apart chunk code gen
2 years ago
HELSON
48d33b1b17
[gemini] add get static torch model ( #2356 )
2 years ago
oahzxl
d1f0773182
rename
2 years ago
oahzxl
06a5355d98
update test
2 years ago
oahzxl
efb1c64c30
restructure dir
2 years ago
YuliangLiu0306
b5a3a4a65f
[device] find best logical mesh
2 years ago
YuliangLiu0306
9c9246c0d9
[device] alpha beta profiler ( #2311 )
...
* [device] alpha beta profiler
* add usage
* fix variable name
2 years ago
Jiarui Fang
db6eea3583
[builder] reconfig op_builder for pypi install ( #2314 )
2 years ago
HELSON
5d3a2be3af
[amp] add gradient clipping for unit tests ( #2283 )
...
* [amp] add gradient clipping in unit tests
* fix bugs
2 years ago
zbian
e94c79f15b
improved allgather & reducescatter for 3d
2 years ago
YuliangLiu0306
fb87322773
[autoparallel] fix spelling error ( #2270 )
2 years ago
YuliangLiu0306
8897b8f753
[autoparallel] autoparallel initialize ( #2238 )
2 years ago
YuliangLiu0306
3b1b91eaf4
[autoparallel] record parameter attribute in colotracer ( #2217 )
...
* [autoparallel] record parameter attribute in colotracer
* [autoparallel] fix construct_meta_info bug
2 years ago
Boyuan Yao
24246f7aa5
[autoparallel] Attach input, buffer and output tensor to MetaInfo class ( #2162 )
...
* [fx] metainfo class for auto parallel
* [fx] add unit test for linear metainfo
* [fx] fix bwd param for linear
* [fx] modify unit test
* [fx] modify unit test
* [fx] modify import
* [fx] modify import
* [fx] modify import
* [fx] move meta profiler to auto parallel
* [fx] add conv metainfo class
* [fx] restore profiler
* [fx] restore meta profiler
* [autoparallel] modify unit test
* [fx] modify unit test
* [autoparallel] add batchnorm metainfo class
* [autoparallel] fix batchnorm unit test function declaration
* [fx] restore profiler
* [fx] add relu metainfo class
* [fx] restore profiler
* [autoparallel] modify metainfo input
* [autoparallel] add pooling metainfo
* [autoparallel] add F.linear metainfo generator
* [autoparallel] add binary elementwise metainfo
* [fx] recover profiler
* [autoparallel] fix forward memory calculation
* [autoparallel] modify constants.py
* [autoparallel] remove redundant print
* [autoparallel] add F.conv metainfo
* [autoparallel] linear fix
* [autoparallel] memory estimation for communication actions
* [autoparallel] fix docstring
* [autoparallel] fix variables name
* [autoparallel] attach tensor to metainfo class
* [autoparallel] fix dangerous try except
* [autoparallel] attach memory cost to shape consistency node
* [autoparallel] attach shape consistency node's metainfo to the node
* [autoparallel] remove todo in shape consistency memory estimation
* [autoparallel] fix the annotation
2 years ago
YuliangLiu0306
78509124d3
[autoparallel] update getitem handler ( #2207 )
2 years ago
YuliangLiu0306
4851f2d607
[autoparallel] update_getattr_handler ( #2193 )
2 years ago
YuliangLiu0306
f10ce01e31
[autoparallel] add gpt2 performance test code ( #2194 )
2 years ago
HELSON
a3100bd50d
[testing] add beit model for unit testings ( #2196 )
...
* [testing] add beit model
* [beit] fix bugs
* [beit] fix bugs
* [testing] fix bugs
2 years ago
HELSON
2458659919
[zero] fix error for BEiT models ( #2169 )
...
* [zero] fix error for BEiT models
* [ColoParameter] add unpack operation for tuple arguments
* fix bugs
* fix chunkv2 unit testing
* add assertion for gradient state
2 years ago
Jiarui Fang
355ffb386e
[builder] unified cpu_optim fused_optim interface ( #2190 )
2 years ago
Jiarui Fang
9587b080ba
[builder] use runtime builder for fused_optim ( #2189 )
2 years ago
Jiarui Fang
bc0e271e71
[builder] use builder() for cpu adam and fused optim in setup.py ( #2187 )
2 years ago
Jiarui Fang
d42afd30f8
[builder] runtime adam and fused_optim builder ( #2184 )
2 years ago
YuliangLiu0306
550f8f8905
[autoparallel] integrate_gpt_related_tests ( #2134 )
...
* [autoparallel] integrate_gpt_related_tests
* polish code
* polish code
* add GPT2Model into runtime test
2 years ago
Jiarui Fang
27327a4c90
[example] add palm pytorch version ( #2172 )
2 years ago
Jiarui Fang
b87496a66b
[hotfix] fix auto policy of test_sharded_optim_v2 ( #2157 )
2 years ago
YuliangLiu0306
16335cb537
[hotfix] fix aten default bug ( #2158 )
2 years ago
Jiarui Fang
2827f41898
[Gemini] GeminiDDP convert to PyTorch Module. ( #2151 )
2 years ago
アマデウス
077a66dd81
updated attention kernel ( #2133 )
2 years ago
YuliangLiu0306
536560ccc0
[autoparallel] implement softmax handler ( #2132 )
2 years ago
Jiarui Fang
c89c66a858
[Gemini] update API of the chunkmemstatscollector. ( #2129 )
2 years ago