ColossalAI

Commit Graph

Author	SHA1	Message	Date
Jianghai	c5ea728016	[pipeline] add bert_for_pretraining bert_lmhead forward and policy (#4172 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt * add bloom model and policy ,revise the base class of policy * revise * revision * add bert_for_pretraining * add bert_for_pretraining forward and policy * fix typos * cancel warning * change the imediate output to default dict * change the default output of get_shared_params	2023-08-15 23:25:14 +08:00
ver217	d35bd7d0e6	[shardformer] fix type hint	2023-08-15 23:25:14 +08:00
ver217	1ed3f8a24f	[shardformer] rename policy file name	2023-08-15 23:25:14 +08:00
ver217	b0b8ad2823	[pipeline] update shardformer docstring	2023-08-15 23:25:14 +08:00
ver217	59f6f573f1	[pipeline] update shardformer policy	2023-08-15 23:25:14 +08:00
Jianghai	90a65ea682	[pipeline] build bloom model and policy , revise the base class of policy (#4161 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt * add bloom model and policy ,revise the base class of policy * revise * revision * add bert_for_pretraining	2023-08-15 23:25:14 +08:00
Jianghai	e8e7e49243	[pipeline]add pipeline policy and bert forward (#4130 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt	2023-08-15 23:25:14 +08:00
Hongxin Liu	f51ce1bc8e	[pipeline] refactor 1f1b schedule (#4115 ) * [api] update optimizer wrapper to fit pipeline * [pipeline] add base schedule * [pipeline] add 1f1b schedule * [test] add pipeline schedule utils test * [pipeline] fix import	2023-08-15 23:25:14 +08:00
Hongxin Liu	45fdc9b42c	[pipeline] implement p2p communication (#4100 ) * [pipeline] add p2p communication * [test] add p2p communication test * [test] add rerun decorator * [test] rename to avoid conflict	2023-08-15 23:25:14 +08:00
Hongxin Liu	422544222f	[pipeline] add stage manager (#4093 ) * [pipeline] add stage manager * [test] add pipeline stage manager test * [pipeline] add docstring for stage manager	2023-08-15 23:25:14 +08:00
Hongxin Liu	5e1a9d48dd	[cluster] add process group mesh (#4039 ) * [cluster] add process group mesh * [test] add process group mesh test * force sync	2023-08-15 23:25:14 +08:00
LuGY	d86ddd9b29	[hotfix] fix unsafe async comm in zero (#4404 ) * improve stablility of zero * fix wrong index * add record stream	2023-08-11 15:09:24 +08:00
Baizhou Zhang	6ccecc0c69	[gemini] fix tensor storage cleaning in state dict collection (#4396 )	2023-08-10 15:36:46 +08:00
binmakeswell	089c365fa0	[doc] add Series A Funding and NeurIPS news (#4377 ) * [doc] add Series A Funding and NeurIPS news * [kernal] fix mha kernal * [CI] skip moe * [CI] fix requirements	2023-08-04 17:42:07 +08:00
flybird1111	38b792aab2	[coloattention] fix import error (#4380 ) fixed an import error	2023-08-04 16:28:41 +08:00
flybird1111	25c57b9fb4	[fix] coloattention support flash attention 2 (#4347 ) Improved ColoAttention interface to support flash attention 2. Solved #4322	2023-08-04 13:46:22 +08:00
Hongxin Liu	16bf4c0221	[test] remove useless tests (#4359 ) * [test] remove legacy zero test * [test] remove lazy distribute test * [test] remove outdated checkpoint io	2023-08-01 18:52:14 +08:00
LuGY	03654c0ce2	fix localhost measurement (#4320 )	2023-08-01 10:14:00 +08:00
LuGY	45b08f08cb	[zero] optimize the optimizer step time (#4221 ) * optimize the optimizer step time * fix corner case * polish * replace all-reduce with all-gather * set comm device to cuda	2023-07-31 22:13:29 +08:00
LuGY	1a49a5ea00	[zero] support shard optimizer state dict of zero (#4194 ) * support shard optimizer of zero * polish code * support sync grad manually	2023-07-31 22:13:29 +08:00
LuGY	dd7cc58299	[zero] add state dict for low level zero (#4179 ) * add state dict for zero * fix unit test * polish	2023-07-31 22:13:29 +08:00
LuGY	c668801d36	[zero] allow passing process group to zero12 (#4153 ) * allow passing process group to zero12 * union tp-zero and normal-zero * polish code	2023-07-31 22:13:29 +08:00
LuGY	79cf1b5f33	[zero]support no_sync method for zero1 plugin (#4138 ) * support no sync for zero1 plugin * polish * polish	2023-07-31 22:13:29 +08:00
LuGY	c6ab96983a	[zero] refactor low level zero for shard evenly (#4030 ) * refactor low level zero * fix zero2 and support cpu offload * avg gradient and modify unit test * refactor grad store, support layer drop * refactor bucket store, support grad accumulation * fix and update unit test of zero and ddp * compatible with tp, ga and unit test * fix memory leak and polish * add zero layer drop unittest * polish code * fix import err in unit test * support diffenert comm dtype, modify docstring style * polish code * test padding and fix * fix unit test of low level zero * fix pad recording in bucket store * support some models * polish	2023-07-31 22:13:29 +08:00
dayellow	a50d39a143	[NFC] fix: format (#4270 ) * [NFC] polish colossalai/fx/profiler/experimental/profiler_module/embedding.py code style * [NFC] polish colossalai/communication/utils.py code style --------- Co-authored-by: Minghao Huang <huangminghao@luchentech.com>	2023-07-26 14:12:57 +08:00
Wenhao Chen	fee553288b	[NFC] polish runtime_preparation_pass style (#4266 )	2023-07-26 14:12:57 +08:00
YeAnbang	3883db452c	[NFC] polish unary_elementwise_generator.py code style (#4267 ) Co-authored-by: aye42 <aye42@gatech.edu>	2023-07-26 14:12:57 +08:00
梁爽	abe4f971e0	[NFC] polish colossalai/booster/plugin/low_level_zero_plugin.py code style (#4256 ) Co-authored-by: supercooledith <893754954@qq.com>	2023-07-26 14:12:57 +08:00
Yanjia0	c614a99d28	[NFC] polish colossalai/auto_parallel/offload/amp_optimizer.py code style (#4255 )	2023-07-26 14:12:57 +08:00
ocd_with_naming	85774f0c1f	[NFC] polish colossalai/cli/benchmark/utils.py code style (#4254 )	2023-07-26 14:12:57 +08:00
Michelle	86cf6aed5b	Fix/format (#4261 ) * revise shardformer readme (#4246) * [example] add llama pretraining (#4257) * [NFC] polish colossalai/communication/p2p.py code style --------- Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com> Co-authored-by: binmakeswell <binmakeswell@gmail.com> Co-authored-by: Qianran Ma <qianranm@luchentech.com>	2023-07-26 14:12:57 +08:00
Jianghai	b366f1d99f	[NFC] Fix format for mixed precision (#4253 ) * [NFC] polish colossalai/booster/mixed_precision/mixed_precision_base.py code style	2023-07-26 14:12:57 +08:00
Baizhou Zhang	c6f6005990	[checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin (#4302 ) * sharded optimizer checkpoint for gemini plugin * modify test to reduce testing time * update doc * fix bug when keep_gatherd is true under GeminiPlugin	2023-07-21 14:39:01 +08:00
Hongxin Liu	fc5cef2c79	[lazy] support init on cuda (#4269 ) * [lazy] support init on cuda * [test] update lazy init test * [test] fix transformer version	2023-07-19 16:43:01 +08:00
Cuiqing Li	4b977541a8	[Kernels] added triton-implemented of self attention for colossal-ai (#4241 ) * added softmax kernel * added qkv_kernel * added ops * adding tests * upload tets * fix tests * debugging * debugging tests * debugging * added * fixed errors * added softmax kernel * clean codes * added tests * update tests * update tests * added attention * add * fixed pytest checking * add cuda check * fix cuda version * fix typo	2023-07-18 23:53:38 +08:00
Jianghai	9a4842c571	revise shardformer readme (#4246 )	2023-07-17 17:30:57 +08:00
Baizhou Zhang	58913441a1	Next commit [checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin (#4141 ) * [checkpointio] unsharded optimizer checkpoint for Gemini plugin * [checkpointio] unsharded optimizer checkpoint for Gemini using all_gather	2023-07-07 16:33:06 +08:00
Frank Lee	190a6ea9c2	[dtensor] fixed readme file name and removed deprecated file (#4162 )	2023-07-04 18:21:11 +08:00
Hongxin Liu	1908caad38	[cli] hotfix launch command for multi-nodes (#4165 )	2023-07-04 17:54:40 +08:00
digger yu	2ac24040eb	fix some typo colossalai/shardformer (#4160 )	2023-07-04 17:53:39 +08:00
github-actions[bot]	c77b3b19be	[format] applied code formatting on changed files in pull request 4152 (#4157 ) Co-authored-by: github-actions <github-actions@github.com>	2023-07-04 16:07:47 +08:00
Frank Lee	89f45eda5a	[shardformer] added development protocol for standardization (#4149 )	2023-07-04 16:05:01 +08:00
Frank Lee	1fb0d95df0	[shardformer] made tensor parallelism configurable (#4144 ) * [shardformer] made tensor parallelism configurable * polish code	2023-07-04 16:05:01 +08:00
Frank Lee	74257cb446	[shardformer] refactored some doc and api (#4137 ) * [shardformer] refactored some doc and api * polish code	2023-07-04 16:05:01 +08:00
jiangmingyan	7f9b30335b	[shardformer] write an shardformer example with bert finetuning (#4126 ) * [shardformer] add benchmark of shardformer * [shardformer] add benchmark of shardformer	2023-07-04 16:05:01 +08:00
Frank Lee	ae035d305d	[shardformer] added embedding gradient check (#4124 )	2023-07-04 16:05:01 +08:00
Frank Lee	44a190e6ac	[shardformer] import huggingface implicitly (#4101 )	2023-07-04 16:05:01 +08:00
Frank Lee	6a88bae4ec	[shardformer] integrate with data parallelism (#4103 )	2023-07-04 16:05:01 +08:00
Frank Lee	f3b6aaa6b7	[shardformer] supported fused normalization (#4112 )	2023-07-04 16:05:01 +08:00
Frank Lee	b1c2901530	[shardformer] supported bloom model (#4098 )	2023-07-04 16:05:01 +08:00
Kun Lin	8af29ee47a	[shardformer] support vision transformer (#4096 ) * first v of vit shardformer * keep vit * update * vit shard add vitattention vitlayer * update num head shard para * finish test for vit * add new_model_class & postprocess * add vit readme * delete old files & fix the conflict * fix sth	2023-07-04 16:05:01 +08:00
jiangmingyan	ac80937138	[shardformer] shardformer support opt models (#4091 ) * [shardformer] shardformer support opt models * [shardformer] shardformer support opt models, fix * [shardformer] shardformer support opt models, fix * [shardformer] shardformer support opt models, fix	2023-07-04 16:05:01 +08:00
Frank Lee	d33a44e8c3	[shardformer] refactored layernorm (#4086 )	2023-07-04 16:05:01 +08:00
Frank Lee	c4b1b65931	[test] fixed tests failed due to dtensor change (#4082 ) * [test] fixed tests failed due to dtensor change * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	92f6791095	[shardformer] Add layernorm (#4072 ) * add layernorm to bert * add layernorm test * add layernorm test with load state dict * add use_mixedfusedLN in shard config * refactor policy to support fused_layernorm	2023-07-04 16:05:01 +08:00
Frank Lee	70c58cfd4f	[shardformer] supported fused qkv checkpoint (#4073 )	2023-07-04 16:05:01 +08:00
FoolPlayer	0803a61412	[shardformer] add linearconv1d test (#4067 ) * add linearconv1d test * add linearconv1d test	2023-07-04 16:05:01 +08:00
Frank Lee	8eb09a4c69	[shardformer] support module saving and loading (#4062 ) * [shardformer] support module saving and loading * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	7740c55c55	support kit use for bert/gpt test (#4055 ) * support kit use for bert test * support kit test for gpt2	2023-07-04 16:05:01 +08:00
Frank Lee	f22ddacef0	[shardformer] refactored the shardformer layer structure (#4053 )	2023-07-04 16:05:01 +08:00
Frank Lee	58df720570	[shardformer] adapted T5 and LLaMa test to use kit (#4049 ) * [shardformer] adapted T5 and LLaMa test to use kit * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	4021b9a8a2	[shardformer] add gpt2 test and layer class refactor (#4041 ) * add gpt2 test and layer class refactor * add dropout in gpt2 policy	2023-07-04 16:05:01 +08:00
Frank Lee	d857f3dbba	[shardformer] supported T5 and its variants (#4045 )	2023-07-04 16:05:01 +08:00
Frank Lee	c1d5453e9f	[shardformer] adapted llama to the new API (#4036 )	2023-07-04 16:05:01 +08:00
FoolPlayer	74d176c8d8	[shardformer] fix bert and gpt downstream with new api (#4024 ) * fix bert downstream with new api * remove comment line	2023-07-04 16:05:01 +08:00
Frank Lee	e253a07007	[shardformer] updated doc (#4016 )	2023-07-04 16:05:01 +08:00
FoolPlayer	df018fc305	support bert with new api	2023-07-04 16:05:01 +08:00
FoolPlayer	507c0ad368	add vocabembedding layer	2023-07-04 16:05:01 +08:00
Frank Lee	45d9384346	[shardformer] removed inplace tensor sharding (#4018 )	2023-07-04 16:05:01 +08:00
Frank Lee	3893fa1a8d	[shardformer] refactored embedding and dropout to parallel module (#4013 ) * [shardformer] refactored embedding and dropout to parallel module * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	dfca9678fa	integrate with dist layer (#4011 )	2023-07-04 16:05:01 +08:00
Frank Lee	015af592f8	[shardformer] integrated linear 1D with dtensor (#3996 ) * [shardformer] integrated linear 1D with dtensor * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	d3bc530849	[shardformer] Refactor shardformer api (#4001 ) * fix an error in readme * simplify code * refactor shardformer * add todo * remove slicer * resolve code review	2023-07-04 16:05:01 +08:00
Frank Lee	611971248c	[device] support init device mesh from process group (#3990 )	2023-07-04 16:05:01 +08:00
FoolPlayer	a2f9af810d	[shardformer] fix an error in readme (#3988 ) * fix an error in readme * simplify code	2023-07-04 16:05:01 +08:00
FoolPlayer	f7774ec0f3	[Shardformer] Downstream bert (#3979 ) * add dist dropout in model * update docstring and bert policy with dropout * refactor basepolicy and sharded, update bert * update format * update gpt2 policy * update bert policy * remove unused code * update readme for new policy usage * add downstream model of bert * remove unused code	2023-07-04 16:05:01 +08:00
wukong1992	c1c672d0f0	[shardformer] shardformer support t5 model (#3994 ) test t5	2023-07-04 16:05:01 +08:00
wukong1992	6b30dfb7ce	[shardformer] support llama model using shardformer (#3969 ) adjust layer attr	2023-07-04 16:05:01 +08:00
FoolPlayer	45927d5527	[shardformer] Add dropout layer in shard model and refactor policy api (#3949 ) * add dist dropout in model * update docstring and bert policy with dropout * refactor basepolicy and sharded, update bert * update format * update gpt2 policy * update bert policy * remove unused code * update readme for new policy usage	2023-07-04 16:05:01 +08:00
FoolPlayer	a73130482d	[shardformer] Unit test (#3928 ) * fix bug in slicer, add slicer unit test * add dropout test * use pid as dropout seed * updata dropout test with local pattern * ad todo	2023-07-04 16:05:01 +08:00
FoolPlayer	f1cb5ac6bf	[shardformer] Align bert value (#3907 ) * add bert align test, fix dist loss bug * forward and backward align * add ignore index * add shardformer CI * add gather_output optional for user in shardconfig * update readme with optional gather_ouput * add dist crossentropy loss test, remove unused files * remove unused file * remove unused file * rename the file * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	79f8d5d54b	[shardformer] add gpt2 policy and modify shard and slicer to support (#3883 ) * add gpt2 policy and modify shard and slicer to support * remove unused code * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	70173e3123	update README (#3909 )	2023-07-04 16:05:01 +08:00
FoolPlayer	ab8a47f830	[shardformer] add Dropout layer support different dropout pattern (#3856 ) * add dropout layer, add dropout test * modify seed manager as context manager * add a copy of col_nn.layer * add dist_crossentropy loss; separate module test * polish the code * fix dist crossentropy loss	2023-07-04 16:05:01 +08:00
FoolPlayer	c594dc2f1c	[shardformer] update readme with modules implement doc (#3834 ) * update readme with modules content * remove img	2023-07-04 16:05:01 +08:00
Frank Lee	4972e1f40e	[shardformer] refactored the user api (#3828 ) * [shardformer] refactored the user api * polish code	2023-07-04 16:05:01 +08:00
Frank Lee	235792f170	[shardformer] updated readme (#3827 )	2023-07-04 16:05:01 +08:00
FoolPlayer	8cc11235c0	[shardformer]: Feature/shardformer, add some docstring and readme (#3816 ) * init shardformer code structure * add implement of sharder (inject and replace) * add implement of replace layer to colossal layer * separate different layer policy, add some notion * implement 1d and 2d slicer, can tell col or row * fix bug when slicing and inject model * fix some bug; add inference test example * add share weight and train example * add train * add docstring and readme * add docstring for other files * pre-commit	2023-07-04 16:05:01 +08:00
FoolPlayer	8d68de767d	[shardformer] init shardformer code structure (#3731 ) * init shardformer code structure * add implement of sharder (inject and replace) * add implement of replace layer to colossal layer * separate different layer policy, add some notion * implement 1d and 2d slicer, can tell col or row * fix bug when slicing and inject model * fix some bug; add inference test example	2023-07-04 16:05:01 +08:00
Baizhou Zhang	1350ece492	[hotfix] fix import bug in checkpoint_io (#4142 )	2023-07-03 22:14:37 +08:00
digger yu	8abc87798f	fix Tensor is not defined (#4129 )	2023-07-03 17:10:18 +08:00
digger yu	7e46bc87b6	fix CheckpointIndexFile is not defined (#4109 )	2023-07-03 17:09:06 +08:00
digger yu	09fe9dc704	[nfc]fix ColossalaiOptimizer is not defined (#4122 )	2023-06-30 17:23:22 +08:00
Frank Lee	95e95b6d58	[testing] move pytest to be inside the function (#4087 )	2023-06-27 11:02:25 +08:00
Baizhou Zhang	0bb0b481b4	[gemini] fix argument naming during chunk configuration searching	2023-06-25 13:34:15 +08:00
github-actions[bot]	a52f62082d	[format] applied code formatting on changed files in pull request 4021 (#4022 ) Co-authored-by: github-actions <github-actions@github.com>	2023-06-19 11:23:24 +08:00
Baizhou Zhang	822c3d4d66	[checkpointio] sharded optimizer checkpoint for DDP plugin (#4002 )	2023-06-16 14:14:05 +08:00
Wenhao Chen	725af3eeeb	[booster] make optimizer argument optional for boost (#3993 ) * feat: make optimizer optional in Booster.boost * test: skip unet test if diffusers version > 0.10.2	2023-06-15 17:38:42 +08:00
Baizhou Zhang	c9cff7e7fa	[checkpointio] General Checkpointing of Sharded Optimizers (#3984 )	2023-06-15 15:21:26 +08:00
Frank Lee	71fe52769c	[gemini] fixed the gemini checkpoint io (#3934 )	2023-06-12 15:11:27 +08:00

1564 Commits (839847b7d78bce6af5dfe58d27b5ce2c74a3619b)