Jianghai
1094e0f0d3
[pipeline] Bert pipeline for shardformer and its tests ( #4197 )
...
* add pipeline forward
* complete pipeline forward check
* fix bert forward without pipeline
* fix comments
* discard useless line
* add todo
* clean prints
* fix distribute layers
2023-08-15 23:25:14 +08:00
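A rough sketch of what the per-stage pipeline forward and the layer-distribution fix boil down to: each stage owns a contiguous slice of the encoder layers and only runs that slice, passing hidden states on to the next stage. Helper names here are hypothetical, not the actual ColossalAI code.

```python
# Illustrative sketch only; not ColossalAI's actual implementation.
import torch.nn as nn

def distribute_layers(num_layers: int, num_stages: int) -> list[int]:
    """Split num_layers as evenly as possible across num_stages."""
    base, rem = divmod(num_layers, num_stages)
    # Earlier stages take one extra layer when the split is uneven.
    return [base + (1 if s < rem else 0) for s in range(num_stages)]

def stage_forward(layers: nn.ModuleList, hidden_states, stage: int, num_stages: int):
    """Run only the layers owned by `stage`; return hidden states for the next stage."""
    counts = distribute_layers(len(layers), num_stages)
    start = sum(counts[:stage])
    end = start + counts[stage]
    for layer in layers[start:end]:
        hidden_states = layer(hidden_states)
    return hidden_states
```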
Hongxin Liu
890774b2fb
[shardformer] support lazy init ( #4202 )
...
* [shardformer] support lazy init
* [shardformer] linear support lazy init
* [shardformer] embedding support lazy init
* [shardformer] norm support lazy init
* [shardformer] fused linear support lazy init
* [test] update shardformer test layer
* [test] shardformer with lazy init fit ddp
* [lazy] hotfix deepcopy of param
* [shardformer] fix bert policy and update test
* [shardformer] fix bloom policy and update test
* [shardformer] fix opt policy and update test
* [shardformer] fix t5 policy and update test
* [shardformer] fix gpt2 policy and update test
* [shardformer] fix llama policy and update test
2023-08-15 23:25:14 +08:00
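The idea behind lazy init is to defer allocating parameter storage until the model is sharded or placed. Below is a minimal sketch of the concept using PyTorch's meta device (assuming PyTorch >= 2.0); ColossalAI's actual lazy init additionally records how each tensor was constructed so it can be materialized or sharded later.

```python
# Minimal sketch of the lazy-init idea using PyTorch's meta device.
import torch
import torch.nn as nn

with torch.device("meta"):
    # Parameters are created without allocating real storage.
    model = nn.Linear(4096, 4096)

# Later, materialize the parameters on the target device.
model = model.to_empty(device="cuda" if torch.cuda.is_available() else "cpu")
for p in model.parameters():
    nn.init.normal_(p)  # re-initialize: to_empty() leaves memory uninitialized
```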
Jianghai
f3bcc292c8
[pipeline] move bert related pipeline components to shardformer ( #4187 )
...
* move bert related pipeline components to shardformer
* fix bugs
* revision
* fix bert model tests
* fix bert_lm_head model tests
* fix tests
* fix tests
* done checks
* skip bloom
2023-08-15 23:25:14 +08:00
Jianghai
c5ea728016
[pipeline] add bert_for_pretraining and bert_lmhead forward and policy ( #4172 )
...
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name conflict
* add bloom model and policy, revise the base class of policy
* revise
* revision
* add bert_for_pretraining
* add bert_for_pretraining forward and policy
* fix typos
* cancel warning
* change the intermediate output to default dict
* change the default output of get_shared_params
2023-08-15 23:25:14 +08:00
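Conceptually, a "policy" in these commits is a declarative description of how a model's submodules are sharded and, for pipelining, placed on stages. The mapping below is purely hypothetical, to illustrate the shape of such a description; it is not ColossalAI's actual Policy API.

```python
# Hypothetical sketch of what a BERT policy expresses; not the real Policy class.
BERT_POLICY = {
    "embeddings": {"pipeline_stage": 0},
    "encoder.layer.*.attention.self.query": {"shard": "column"},
    "encoder.layer.*.attention.self.key": {"shard": "column"},
    "encoder.layer.*.attention.self.value": {"shard": "column"},
    "encoder.layer.*.attention.output.dense": {"shard": "row"},
    "encoder.layer.*.intermediate.dense": {"shard": "column"},
    "encoder.layer.*.output.dense": {"shard": "row"},
    "pooler": {"pipeline_stage": -1},  # last stage only
}

def get_shared_params() -> list:
    # Parameters tied across stages (e.g. word embeddings / LM head).
    # Returning an empty list here is just an illustrative default.
    return []
```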
ver217
d35bd7d0e6
[shardformer] fix type hint
2023-08-15 23:25:14 +08:00
ver217
1ed3f8a24f
[shardformer] rename policy file name
2023-08-15 23:25:14 +08:00
ver217
b0b8ad2823
[pipeline] update shardformer docstring
2023-08-15 23:25:14 +08:00
ver217
59f6f573f1
[pipeline] update shardformer policy
2023-08-15 23:25:14 +08:00
Jianghai
90a65ea682
[pipeline] build bloom model and policy, revise the base class of policy ( #4161 )
...
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name conflict
* add bloom model and policy, revise the base class of policy
* revise
* revision
* add bert_for_pretraining
2023-08-15 23:25:14 +08:00
Jianghai
e8e7e49243
[pipeline] add pipeline policy and bert forward ( #4130 )
...
* add pipeline policy and bert forward to be done
* add bertmodel pipeline forward and make tests
* add Bert_Policy and test for policy
* update formatting
* update formatting
* update the code
* fix bugs
* fix name conflict
2023-08-15 23:25:14 +08:00
Hongxin Liu
f51ce1bc8e
[pipeline] refactor 1f1b schedule ( #4115 )
...
* [api] update optimizer wrapper to fit pipeline
* [pipeline] add base schedule
* [pipeline] add 1f1b schedule
* [test] add pipeline schedule utils test
* [pipeline] fix import
2023-08-15 23:25:14 +08:00
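As background for the 1F1B refactor: in a one-forward-one-backward schedule each stage runs a few warmup forwards, then alternates one forward with one backward, then drains the remaining backwards, which keeps at most about `num_stages` microbatches in flight. A small sketch of that ordering (illustrative, not the refactored schedule itself):

```python
# Illustrative ordering of a non-interleaved 1F1B schedule for one pipeline stage.
def one_f_one_b_order(num_microbatches: int, stage: int, num_stages: int) -> list[str]:
    # Warmup: later stages do fewer warmup forwards.
    num_warmup = min(num_stages - stage - 1, num_microbatches)
    steps = ["F"] * num_warmup
    # Steady state: alternate one forward with one backward.
    for _ in range(num_microbatches - num_warmup):
        steps += ["F", "B"]
    # Cooldown: drain the remaining backwards.
    steps += ["B"] * num_warmup
    return steps

# e.g. 4 stages, 8 microbatches, stage 0:
# ['F', 'F', 'F', 'F', 'B', 'F', 'B', ..., 'B', 'B', 'B']
print(one_f_one_b_order(8, 0, 4))
```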
Hongxin Liu
45fdc9b42c
[pipeline] implement p2p communication ( #4100 )
...
* [pipeline] add p2p communication
* [test] add p2p communication test
* [test] add rerun decorator
* [test] rename to avoid conflict
2023-08-15 23:25:14 +08:00
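P2P communication between adjacent pipeline stages reduces to point-to-point sends and receives of the intermediate tensors. A minimal sketch with torch.distributed, assuming the process group is already initialized and the receiver knows the tensor's shape and dtype (the real implementation also exchanges that metadata):

```python
# Minimal sketch of stage-to-stage tensor exchange with torch.distributed.
import torch
import torch.distributed as dist

def send_forward(tensor: torch.Tensor, next_rank: int) -> None:
    dist.send(tensor.contiguous(), dst=next_rank)

def recv_forward(shape, dtype, prev_rank: int, device) -> torch.Tensor:
    buf = torch.empty(shape, dtype=dtype, device=device)
    dist.recv(buf, src=prev_rank)
    return buf
```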
Hongxin Liu
422544222f
[pipeline] add stage manager ( #4093 )
...
* [pipeline] add stage manager
* [test] add pipeline stage manager test
* [pipeline] add docstring for stage manager
2023-08-15 23:25:14 +08:00
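A stage manager is, at its core, bookkeeping on top of the topology: which stage this rank belongs to and which stages are its neighbors. A hypothetical minimal version (the real one presumably also owns the p2p process groups):

```python
# Hypothetical minimal stage manager; illustrative only.
from dataclasses import dataclass

@dataclass
class StageManager:
    stage: int
    num_stages: int

    @property
    def is_first_stage(self) -> bool:
        return self.stage == 0

    @property
    def is_last_stage(self) -> bool:
        return self.stage == self.num_stages - 1

    @property
    def prev_stage(self) -> int:
        return (self.stage - 1) % self.num_stages

    @property
    def next_stage(self) -> int:
        return (self.stage + 1) % self.num_stages
```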
Hongxin Liu
5e1a9d48dd
[cluster] add process group mesh ( #4039 )
...
* [cluster] add process group mesh
* [test] add process group mesh test
* force sync
2023-08-15 23:25:14 +08:00
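A process group mesh arranges the world into an n-dimensional grid and builds a communicator along each axis. A rough 2D sketch (pipeline x data axes) on top of torch.distributed; note that `new_group` must be called collectively by every rank, in the same order:

```python
# Rough sketch of a 2D process-group mesh (pp x dp); illustrative names only.
import torch.distributed as dist

def build_mesh_groups(world_size: int, pp_size: int, dp_size: int):
    assert world_size == pp_size * dp_size
    rank = dist.get_rank()
    my_dp_group = my_pp_group = None
    # Groups along the data-parallel axis: ranks sharing the same pipeline stage.
    for pp in range(pp_size):
        ranks = [pp * dp_size + dp for dp in range(dp_size)]
        group = dist.new_group(ranks)
        if rank in ranks:
            my_dp_group = group
    # Groups along the pipeline axis: ranks sharing the same data-parallel index.
    for dp in range(dp_size):
        ranks = [pp * dp_size + dp for pp in range(pp_size)]
        group = dist.new_group(ranks)
        if rank in ranks:
            my_pp_group = group
    return my_pp_group, my_dp_group
```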
LuGY
d86ddd9b29
[hotfix] fix unsafe async comm in zero ( #4404 )
...
* improve stability of zero
* fix wrong index
* add record stream
2023-08-11 15:09:24 +08:00
Baizhou Zhang
6ccecc0c69
[gemini] fix tensor storage cleaning in state dict collection ( #4396 )
2023-08-10 15:36:46 +08:00
binmakeswell
089c365fa0
[doc] add Series A Funding and NeurIPS news ( #4377 )
...
* [doc] add Series A Funding and NeurIPS news
* [kernel] fix mha kernel
* [CI] skip moe
* [CI] fix requirements
2023-08-04 17:42:07 +08:00
flybird1111
38b792aab2
[coloattention] fix import error ( #4380 )
...
fixed an import error
2023-08-04 16:28:41 +08:00
flybird1111
25c57b9fb4
[fix] coloattention support flash attention 2 ( #4347 )
...
Improved ColoAttention interface to support flash attention 2. Solved #4322
2023-08-04 13:46:22 +08:00
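From the caller's perspective the flash-attention-2 path is still just (q, k, v) -> output, computed without materializing the full attention matrix. A hedged sketch of that interface using PyTorch's built-in scaled_dot_product_attention (which can dispatch to a flash kernel); this is a stand-in, not ColoAttention itself:

```python
# Sketch of the attention interface, using PyTorch SDPA as a stand-in for a
# flash-attention-2 backend; not ColossalAI's ColoAttention implementation.
import torch
import torch.nn.functional as F

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
              causal: bool = False, dropout_p: float = 0.0) -> torch.Tensor:
    # q, k, v: (batch, num_heads, seq_len, head_dim)
    return F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p, is_causal=causal)
```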
Hongxin Liu
16bf4c0221
[test] remove useless tests ( #4359 )
...
* [test] remove legacy zero test
* [test] remove lazy distribute test
* [test] remove outdated checkpoint io
2023-08-01 18:52:14 +08:00
LuGY
03654c0ce2
fix localhost measurement ( #4320 )
2023-08-01 10:14:00 +08:00
LuGY
45b08f08cb
[zero] optimize the optimizer step time ( #4221 )
...
* optimize the optimizer step time
* fix corner case
* polish
* replace all-reduce with all-gather
* set comm device to cuda
2023-07-31 22:13:29 +08:00
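The "replace all-reduce with all-gather" bullet refers to the ZeRO-style update pattern: each rank steps only its own parameter shard, and the updated shards are then all-gathered into the full parameters. A sketch under the assumption of a flat, contiguous, evenly divisible parameter buffer (illustrative names, not the plugin's API):

```python
# Sketch of the ZeRO-style step-then-gather pattern; illustrative only.
import torch
import torch.distributed as dist

def step_and_gather(flat_param: torch.Tensor, my_shard: torch.Tensor,
                    optimizer: torch.optim.Optimizer) -> None:
    # Assumes flat_param is 1D, contiguous, and evenly divisible by world size,
    # and that `optimizer` was built over my_shard only.
    optimizer.step()  # update this rank's shard and its optimizer states
    out_shards = list(flat_param.chunk(dist.get_world_size()))
    dist.all_gather(out_shards, my_shard)  # every rank receives all updated shards
```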
LuGY
1a49a5ea00
[zero] support shard optimizer state dict of zero ( #4194 )
...
* support shard optimizer of zero
* polish code
* support sync grad manually
2023-07-31 22:13:29 +08:00
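A sharded optimizer state dict keeps, per rank, only that rank's slice of each per-parameter state tensor, plus enough metadata to reassemble it. A hypothetical sketch of that layout (not the actual checkpoint format):

```python
# Hypothetical layout of a ZeRO sharded optimizer state dict; illustrative only.
import torch
import torch.distributed as dist

def shard_state_dict(full_states: dict[int, dict[str, torch.Tensor]]) -> dict:
    rank, world_size = dist.get_rank(), dist.get_world_size()
    sharded = {}
    for param_id, states in full_states.items():
        sharded[param_id] = {
            # Keep only this rank's slice of each state tensor (e.g. Adam's exp_avg).
            name: t.flatten().chunk(world_size)[rank].clone()
            for name, t in states.items()
        }
    return {"rank": rank, "world_size": world_size, "states": sharded}
```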
LuGY
dd7cc58299
[zero] add state dict for low level zero ( #4179 )
...
* add state dict for zero
* fix unit test
* polish
2023-07-31 22:13:29 +08:00
LuGY
c668801d36
[zero] allow passing process group to zero12 ( #4153 )
...
* allow passing process group to zero12
* unify tp-zero and normal-zero
* polish code
2023-07-31 22:13:29 +08:00
LuGY
79cf1b5f33
[zero] support no_sync method for zero1 plugin ( #4138 )
...
* support no sync for zero1 plugin
* polish
* polish
2023-07-31 22:13:29 +08:00
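`no_sync` exists for gradient accumulation: inside the context, backward passes keep gradients local and skip the reduce; only the final backward, outside the context, triggers communication. A minimal sketch of the switch (hypothetical wrapper, not the plugin's interface):

```python
# Sketch of the no_sync idea for gradient accumulation; illustrative only.
from contextlib import contextmanager

class GradSyncSwitch:
    def __init__(self):
        self.require_grad_sync = True  # checked before reducing gradients

    @contextmanager
    def no_sync(self):
        old = self.require_grad_sync
        self.require_grad_sync = False
        try:
            yield
        finally:
            self.require_grad_sync = old

# Usage: accumulate without communication, then sync on the final backward.
# switch = GradSyncSwitch()
# with switch.no_sync():
#     loss_1.backward()   # gradients stay local
# loss_2.backward()       # gradients are reduced across ranks
```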
LuGY
c6ab96983a
[zero] refactor low level zero to shard evenly ( #4030 )
...
* refactor low level zero
* fix zero2 and support cpu offload
* avg gradient and modify unit test
* refactor grad store, support layer drop
* refactor bucket store, support grad accumulation
* fix and update unit test of zero and ddp
* compatible with tp, ga and unit test
* fix memory leak and polish
* add zero layer drop unittest
* polish code
* fix import err in unit test
* support different comm dtype, modify docstring style
* polish code
* test padding and fix
* fix unit test of low level zero
* fix pad recording in bucket store
* support some models
* polish
2023-07-31 22:13:29 +08:00
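"Shard evenly" here means padding the flattened gradients or parameters so their length divides exactly by the world size before each rank takes its slice, as sketched below; the real bucket store also records the padding so it can be stripped when gradients are reconstructed.

```python
# Sketch of even sharding via padding; illustrative only.
import torch

def pad_and_split(flat: torch.Tensor, world_size: int) -> tuple[list[torch.Tensor], int]:
    pad = (-flat.numel()) % world_size
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    # Each rank takes one equal-sized shard; `pad` is returned so it can be stripped later.
    return list(flat.chunk(world_size)), pad
```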
dayellow
a50d39a143
[NFC] fix: format ( #4270 )
...
* [NFC] polish colossalai/fx/profiler/experimental/profiler_module/embedding.py code style
* [NFC] polish colossalai/communication/utils.py code style
---------
Co-authored-by: Minghao Huang <huangminghao@luchentech.com>
2023-07-26 14:12:57 +08:00
Wenhao Chen
fee553288b
[NFC] polish runtime_preparation_pass style ( #4266 )
2023-07-26 14:12:57 +08:00
YeAnbang
3883db452c
[NFC] polish unary_elementwise_generator.py code style ( #4267 )
...
Co-authored-by: aye42 <aye42@gatech.edu>
2023-07-26 14:12:57 +08:00
梁爽
abe4f971e0
[NFC] polish colossalai/booster/plugin/low_level_zero_plugin.py code style ( #4256 )
...
Co-authored-by: supercooledith <893754954@qq.com>
2023-07-26 14:12:57 +08:00
Yanjia0
c614a99d28
[NFC] polish colossalai/auto_parallel/offload/amp_optimizer.py code style ( #4255 )
2023-07-26 14:12:57 +08:00
ocd_with_naming
85774f0c1f
[NFC] polish colossalai/cli/benchmark/utils.py code style ( #4254 )
2023-07-26 14:12:57 +08:00
Michelle
86cf6aed5b
Fix/format ( #4261 )
...
* revise shardformer readme (#4246 )
* [example] add llama pretraining (#4257 )
* [NFC] polish colossalai/communication/p2p.py code style
---------
Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
Co-authored-by: Qianran Ma <qianranm@luchentech.com>
2023-07-26 14:12:57 +08:00
Jianghai
b366f1d99f
[NFC] Fix format for mixed precision ( #4253 )
...
* [NFC] polish colossalai/booster/mixed_precision/mixed_precision_base.py code style
2023-07-26 14:12:57 +08:00
Baizhou Zhang
c6f6005990
[checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin ( #4302 )
...
* sharded optimizer checkpoint for gemini plugin
* modify test to reduce testing time
* update doc
* fix bug when keep_gathered is true under GeminiPlugin
2023-07-21 14:39:01 +08:00
Hongxin Liu
fc5cef2c79
[lazy] support init on cuda ( #4269 )
...
* [lazy] support init on cuda
* [test] update lazy init test
* [test] fix transformer version
2023-07-19 16:43:01 +08:00
Cuiqing Li
4b977541a8
[Kernels] added triton implementation of self attention for colossal-ai ( #4241 )
...
* added softmax kernel
* added qkv_kernel
* added ops
* adding tests
* upload tests
* fix tests
* debugging
* debugging tests
* debugging
* added
* fixed errors
* added softmax kernel
* clean codes
* added tests
* update tests
* update tests
* added attention
* add
* fixed pytest checking
* add cuda check
* fix cuda version
* fix typo
2023-07-18 23:53:38 +08:00
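For a flavor of the Triton kernels this PR adds, here is a minimal row-wise softmax following the standard Triton tutorial pattern; it is not the PR's code, only an illustration of the kernel style.

```python
# Minimal row-wise softmax in Triton (tutorial-style); illustrative only.
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(in_ptr + row * n_cols + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)  # subtract row max for numerical stability
    num = tl.exp(x)
    tl.store(out_ptr + row * n_cols + cols, num / tl.sum(num, axis=0), mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    assert x.ndim == 2 and x.is_contiguous() and x.is_cuda
    out = torch.empty_like(x)
    softmax_kernel[(x.shape[0],)](out, x, x.shape[1],
                                  BLOCK_SIZE=triton.next_power_of_2(x.shape[1]))
    return out
```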
Jianghai
9a4842c571
revise shardformer readme ( #4246 )
2023-07-17 17:30:57 +08:00
Baizhou Zhang
58913441a1
[checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin ( #4141 )
...
* [checkpointio] unsharded optimizer checkpoint for Gemini plugin
* [checkpointio] unsharded optimizer checkpoint for Gemini using all_gather
2023-07-07 16:33:06 +08:00
Frank Lee
190a6ea9c2
[dtensor] fixed readme file name and removed deprecated file ( #4162 )
2023-07-04 18:21:11 +08:00
Hongxin Liu
1908caad38
[cli] hotfix launch command for multi-nodes ( #4165 )
2023-07-04 17:54:40 +08:00
digger yu
2ac24040eb
fix some typos in colossalai/shardformer ( #4160 )
2023-07-04 17:53:39 +08:00
github-actions[bot]
c77b3b19be
[format] applied code formatting on changed files in pull request 4152 ( #4157 )
...
Co-authored-by: github-actions <github-actions@github.com>
2023-07-04 16:07:47 +08:00
Frank Lee
89f45eda5a
[shardformer] added development protocol for standardization ( #4149 )
2023-07-04 16:05:01 +08:00
Frank Lee
1fb0d95df0
[shardformer] made tensor parallelism configurable ( #4144 )
...
* [shardformer] made tensor parallelism configurable
* polish code
2023-07-04 16:05:01 +08:00
Frank Lee
74257cb446
[shardformer] refactored some doc and api ( #4137 )
...
* [shardformer] refactored some doc and api
* polish code
2023-07-04 16:05:01 +08:00
jiangmingyan
7f9b30335b
[shardformer] write a shardformer example with bert finetuning ( #4126 )
...
* [shardformer] add benchmark of shardformer
* [shardformer] add benchmark of shardformer
2023-07-04 16:05:01 +08:00
Frank Lee
ae035d305d
[shardformer] added embedding gradient check ( #4124 )
2023-07-04 16:05:01 +08:00
Frank Lee
44a190e6ac
[shardformer] import huggingface implicitly ( #4101 )
2023-07-04 16:05:01 +08:00