Hongxin Liu
5c897ddb94
[pipeline] add stage manager ( #4093 )
...
* [pipeline] add stage manager
* [test] add pipeline stage manager test
* [pipeline] add docstring for stage manager
2023-08-15 23:25:14 +08:00
Jianghai
e8e7e49243
[pipeline]add pipeline policy and bert forward ( #4130 )
...
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name confilt
2023-08-15 23:25:14 +08:00
Hongxin Liu
f51ce1bc8e
[pipeline] refactor 1f1b schedule ( #4115 )
...
* [api] update optimizer wrapper to fit pipeline
* [pipeline] add base schedule
* [pipeline] add 1f1b schedule
* [test] add pipeline schedule utils test
* [pipeline] fix import
2023-08-15 23:25:14 +08:00
Hongxin Liu
45fdc9b42c
[pipeline] implement p2p communication ( #4100 )
...
* [pipeline] add p2p communication
* [test] add p2p communication test
* [test] add rerun decorator
* [test] rename to avoid conflict
2023-08-15 23:25:14 +08:00
Hongxin Liu
422544222f
[pipeline] add stage manager ( #4093 )
...
* [pipeline] add stage manager
* [test] add pipeline stage manager test
* [pipeline] add docstring for stage manager
2023-08-15 23:25:14 +08:00
Hongxin Liu
5e1a9d48dd
[cluster] add process group mesh ( #4039 )
...
* [cluster] add process group mesh
* [test] add process group mesh test
* force sync
2023-08-15 23:25:14 +08:00
LuGY
d86ddd9b29
[hotfix] fix unsafe async comm in zero ( #4404 )
...
* improve stablility of zero
* fix wrong index
* add record stream
2023-08-11 15:09:24 +08:00
flybird1111
458ae331ad
[kernel] updated unittests for coloattention ( #4389 )
...
Updated coloattention tests of checking outputs and gradients
2023-08-09 14:24:45 +08:00
flybird1111
38b792aab2
[coloattention] fix import error ( #4380 )
...
fixed an import error
2023-08-04 16:28:41 +08:00
flybird1111
25c57b9fb4
[fix] coloattention support flash attention 2 ( #4347 )
...
Improved ColoAttention interface to support flash attention 2. Solved #4322
2023-08-04 13:46:22 +08:00
Hongxin Liu
16bf4c0221
[test] remove useless tests ( #4359 )
...
* [test] remove legacy zero test
* [test] remove lazy distribute test
* [test] remove outdated checkpoint io
2023-08-01 18:52:14 +08:00
LuGY
1a49a5ea00
[zero] support shard optimizer state dict of zero ( #4194 )
...
* support shard optimizer of zero
* polish code
* support sync grad manually
2023-07-31 22:13:29 +08:00
LuGY
dd7cc58299
[zero] add state dict for low level zero ( #4179 )
...
* add state dict for zero
* fix unit test
* polish
2023-07-31 22:13:29 +08:00
LuGY
c668801d36
[zero] allow passing process group to zero12 ( #4153 )
...
* allow passing process group to zero12
* union tp-zero and normal-zero
* polish code
2023-07-31 22:13:29 +08:00
LuGY
79cf1b5f33
[zero]support no_sync method for zero1 plugin ( #4138 )
...
* support no sync for zero1 plugin
* polish
* polish
2023-07-31 22:13:29 +08:00
LuGY
c6ab96983a
[zero] refactor low level zero for shard evenly ( #4030 )
...
* refactor low level zero
* fix zero2 and support cpu offload
* avg gradient and modify unit test
* refactor grad store, support layer drop
* refactor bucket store, support grad accumulation
* fix and update unit test of zero and ddp
* compatible with tp, ga and unit test
* fix memory leak and polish
* add zero layer drop unittest
* polish code
* fix import err in unit test
* support diffenert comm dtype, modify docstring style
* polish code
* test padding and fix
* fix unit test of low level zero
* fix pad recording in bucket store
* support some models
* polish
2023-07-31 22:13:29 +08:00
Baizhou Zhang
c6f6005990
[checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin ( #4302 )
...
* sharded optimizer checkpoint for gemini plugin
* modify test to reduce testing time
* update doc
* fix bug when keep_gatherd is true under GeminiPlugin
2023-07-21 14:39:01 +08:00
Hongxin Liu
fc5cef2c79
[lazy] support init on cuda ( #4269 )
...
* [lazy] support init on cuda
* [test] update lazy init test
* [test] fix transformer version
2023-07-19 16:43:01 +08:00
Cuiqing Li
4b977541a8
[Kernels] added triton-implemented of self attention for colossal-ai ( #4241 )
...
* added softmax kernel
* added qkv_kernel
* added ops
* adding tests
* upload tets
* fix tests
* debugging
* debugging tests
* debugging
* added
* fixed errors
* added softmax kernel
* clean codes
* added tests
* update tests
* update tests
* added attention
* add
* fixed pytest checking
* add cuda check
* fix cuda version
* fix typo
2023-07-18 23:53:38 +08:00
Baizhou Zhang
58913441a1
Next commit [checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin ( #4141 )
...
* [checkpointio] unsharded optimizer checkpoint for Gemini plugin
* [checkpointio] unsharded optimizer checkpoint for Gemini using all_gather
2023-07-07 16:33:06 +08:00
github-actions[bot]
c77b3b19be
[format] applied code formatting on changed files in pull request 4152 ( #4157 )
...
Co-authored-by: github-actions <github-actions@github.com>
2023-07-04 16:07:47 +08:00
Frank Lee
1fb0d95df0
[shardformer] made tensor parallelism configurable ( #4144 )
...
* [shardformer] made tensor parallelism configurable
* polish code
2023-07-04 16:05:01 +08:00
Frank Lee
74257cb446
[shardformer] refactored some doc and api ( #4137 )
...
* [shardformer] refactored some doc and api
* polish code
2023-07-04 16:05:01 +08:00
Frank Lee
ae035d305d
[shardformer] added embedding gradient check ( #4124 )
2023-07-04 16:05:01 +08:00
Frank Lee
6a88bae4ec
[shardformer] integrate with data parallelism ( #4103 )
2023-07-04 16:05:01 +08:00
Frank Lee
f3b6aaa6b7
[shardformer] supported fused normalization ( #4112 )
2023-07-04 16:05:01 +08:00
Frank Lee
b1c2901530
[shardformer] supported bloom model ( #4098 )
2023-07-04 16:05:01 +08:00
Kun Lin
8af29ee47a
[shardformer] support vision transformer ( #4096 )
...
* first v of vit shardformer
* keep vit
* update
* vit shard add vitattention vitlayer
* update num head shard para
* finish test for vit
* add new_model_class & postprocess
* add vit readme
* delete old files & fix the conflict
* fix sth
2023-07-04 16:05:01 +08:00
jiangmingyan
ac80937138
[shardformer] shardformer support opt models ( #4091 )
...
* [shardformer] shardformer support opt models
* [shardformer] shardformer support opt models, fix
* [shardformer] shardformer support opt models, fix
* [shardformer] shardformer support opt models, fix
2023-07-04 16:05:01 +08:00
Frank Lee
d33a44e8c3
[shardformer] refactored layernorm ( #4086 )
2023-07-04 16:05:01 +08:00
Frank Lee
c4b1b65931
[test] fixed tests failed due to dtensor change ( #4082 )
...
* [test] fixed tests failed due to dtensor change
* polish code
2023-07-04 16:05:01 +08:00
FoolPlayer
92f6791095
[shardformer] Add layernorm ( #4072 )
...
* add layernorm to bert
* add layernorm test
* add layernorm test with load state dict
* add use_mixedfusedLN in shard config
* refactor policy to support fused_layernorm
2023-07-04 16:05:01 +08:00
Frank Lee
70c58cfd4f
[shardformer] supported fused qkv checkpoint ( #4073 )
2023-07-04 16:05:01 +08:00
FoolPlayer
0803a61412
[shardformer] add linearconv1d test ( #4067 )
...
* add linearconv1d test
* add linearconv1d test
2023-07-04 16:05:01 +08:00
Frank Lee
8eb09a4c69
[shardformer] support module saving and loading ( #4062 )
...
* [shardformer] support module saving and loading
* polish code
2023-07-04 16:05:01 +08:00
FoolPlayer
7740c55c55
support kit use for bert/gpt test ( #4055 )
...
* support kit use for bert test
* support kit test for gpt2
2023-07-04 16:05:01 +08:00
Frank Lee
f22ddacef0
[shardformer] refactored the shardformer layer structure ( #4053 )
2023-07-04 16:05:01 +08:00
Frank Lee
58df720570
[shardformer] adapted T5 and LLaMa test to use kit ( #4049 )
...
* [shardformer] adapted T5 and LLaMa test to use kit
* polish code
2023-07-04 16:05:01 +08:00
FoolPlayer
4021b9a8a2
[shardformer] add gpt2 test and layer class refactor ( #4041 )
...
* add gpt2 test and layer class refactor
* add dropout in gpt2 policy
2023-07-04 16:05:01 +08:00
Frank Lee
d857f3dbba
[shardformer] supported T5 and its variants ( #4045 )
2023-07-04 16:05:01 +08:00
Frank Lee
c1d5453e9f
[shardformer] adapted llama to the new API ( #4036 )
2023-07-04 16:05:01 +08:00
FoolPlayer
74d176c8d8
[shardformer] fix bert and gpt downstream with new api ( #4024 )
...
* fix bert downstream with new api
* remove comment line
2023-07-04 16:05:01 +08:00
FoolPlayer
507c0ad368
add vocabembedding layer
2023-07-04 16:05:01 +08:00
Frank Lee
3893fa1a8d
[shardformer] refactored embedding and dropout to parallel module ( #4013 )
...
* [shardformer] refactored embedding and dropout to parallel module
* polish code
2023-07-04 16:05:01 +08:00
FoolPlayer
dfca9678fa
integrate with dist layer ( #4011 )
2023-07-04 16:05:01 +08:00
Frank Lee
015af592f8
[shardformer] integrated linear 1D with dtensor ( #3996 )
...
* [shardformer] integrated linear 1D with dtensor
* polish code
2023-07-04 16:05:01 +08:00
Frank Lee
611971248c
[device] support init device mesh from process group ( #3990 )
2023-07-04 16:05:01 +08:00
FoolPlayer
f7774ec0f3
[Shardformer] Downstream bert ( #3979 )
...
* add dist dropout in model
* update docstring and bert policy with dropout
* refactor basepolicy and sharded, update bert
* update format
* update gpt2 policy
* update bert policy
* remove unused code
* update readme for new policy usage
* add downstream model of bert
* remove unused code
2023-07-04 16:05:01 +08:00
wukong1992
c1c672d0f0
[shardformer] shardformer support t5 model ( #3994 )
...
test t5
2023-07-04 16:05:01 +08:00
wukong1992
6b30dfb7ce
[shardformer] support llama model using shardformer ( #3969 )
...
adjust layer attr
2023-07-04 16:05:01 +08:00