ColossalAI

Commit Graph

Author	SHA1	Message	Date
Bin Jia	86d22581e4	[shardformer] Add overlap optional for HybridParallelPlugin (#4615 ) * add optional overlap for plugin * remove fixed todo	2023-09-05 11:52:23 +08:00
Hongxin Liu	a39a5c66fe	Merge branch 'main' into feature/shardformer	2023-09-04 23:43:13 +08:00
Baizhou Zhang	e79b1e80e2	[checkpointio] support huggingface from_pretrained for all plugins (#4606 )	2023-09-04 23:25:01 +08:00
flybird11111	0a94fcd351	[shardformer] update bert finetune example with HybridParallelPlugin (#4584 ) * [shardformer] fix opt test hanging * fix * test * test * test * fix test * fix test * remove print * add fix * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] fix epoch change * [shardformer] broadcast add pp group * [shardformer] fix opt test hanging * fix * test * test * [shardformer] zero1+pp and the corresponding tests (#4517) * pause * finish pp+zero1 * Update test_shard_vit.py * [shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516) * fix overlap bug and support bert, add overlap as an option in shardconfig * support overlap for chatglm and bloom * [shardformer] fix emerged bugs after updating transformers (#4526) * test * fix test * fix test * remove print * add fix * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] Add overlap support for gpt2 (#4535) * add overlap support for gpt2 * remove unused code * remove unused code * [shardformer] support pp+tp+zero1 tests (#4531) * [shardformer] fix opt test hanging * fix * test * test * test * fix test * fix test * remove print * add fix * [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] fix submodule replacement bug when enabling pp (#4544) * [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540) * implement sharded optimizer saving * add more param info * finish implementation of sharded optimizer saving * fix bugs in optimizer sharded saving * add pp+zero test * param group loading * greedy loading of optimizer * fix bug when loading * implement optimizer sharded saving * add optimizer test & arrange checkpointIO utils * fix gemini sharding state_dict * add verbose option * add loading of master params * fix typehint * fix master/working mapping in fp16 amp * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] fix epoch change * [shardformer] broadcast add pp group * rebase feature/shardformer * update pipeline * [shardformer] fix * [shardformer] fix * [shardformer] bert finetune fix * [shardformer] add all_reduce operation to loss add all_reduce operation to loss * [shardformer] make compatible with pytree. make compatible with pytree. * [shardformer] disable tp disable tp * [shardformer] add 3d plugin to ci test * [shardformer] update num_microbatches to None * [shardformer] update microbatchsize * [shardformer] update assert * update scheduler * update scheduler --------- Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com> Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com> Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>	2023-09-04 21:46:29 +08:00
Jianghai	24c0768795	[shardformer] Pytree fix (#4533 ) * pytree test * test bert * test bert * test bert * revise * add register * add register	2023-09-04 17:52:23 +08:00
Hongxin Liu	63ecafb1fb	[checkpointio] optimize zero optim checkpoint io (#4591 ) * [zero] update checkpoint io to save memory * [checkpointio] add device map to save memory	2023-09-04 11:26:45 +08:00
Hongxin Liu	508ca36fe3	[pipeline] 1f1b schedule receive microbatch size (#4589 )	2023-09-01 21:45:14 +08:00
LuGY	cbac782254	[zero]fix zero ckptIO with offload (#4529 ) * fix zero ckptio with offload * fix load device * saved tensors in ckpt should be on CPU * fix unit test * fix unit test * add clear cache * save memory for CI	2023-09-01 17:41:19 +08:00
Baizhou Zhang	38ccb8b1a3	[shardformer] support from_pretrained when loading model with HybridParallelPlugin (#4575 ) * hybrid plugin support huggingface from_pretrained * add huggingface compatibility tests * add folder cleaning * fix bugs	2023-09-01 17:40:01 +08:00
Baizhou Zhang	c9625dbb63	[shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540 ) * implement sharded optimizer saving * add more param info * finish implementation of sharded optimizer saving * fix bugs in optimizer sharded saving * add pp+zero test * param group loading * greedy loading of optimizer * fix bug when loading * implement optimizer sharded saving * add optimizer test & arrange checkpointIO utils * fix gemini sharding state_dict * add verbose option * add loading of master params * fix typehint * fix master/working mapping in fp16 amp	2023-08-31 14:50:47 +08:00
Baizhou Zhang	2c787d7f47	[shardformer] fix submodule replacement bug when enabling pp (#4544 )	2023-08-31 09:57:18 +08:00
flybird11111	ec18fc7340	[shardformer] support pp+tp+zero1 tests (#4531 ) * [shardformer] fix opt test hanging * fix * test * test * test * fix test * fix test * remove print * add fix * [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1	2023-08-30 21:29:18 +08:00
Lufang Chen	12c95a9fed	fix runtime prepare pass (#4502 ) Co-authored-by: lufang.chen <lufang.chen@nio.com>	2023-08-30 17:29:38 +08:00
flybird11111	d367b88785	[shardformer] fix opt test hanging (#4521 ) * [shardformer] fix opt test hanging * fix * test * test * test * fix test * fix test * remove print * add fix	2023-08-30 14:50:34 +08:00
Bin Jia	e241b74f24	[shardformer] Add overlap support for gpt2 (#4535 ) * add overlap support for gpt2 * remove unused code * remove unused code	2023-08-29 18:30:50 +08:00
Baizhou Zhang	0387a47e63	[shardformer] fix emerged bugs after updating transformers (#4526 )	2023-08-29 11:25:05 +08:00
Hongxin Liu	0b00def881	[example] add llama2 example (#4527 ) * [example] transfer llama-1 example * [example] fit llama-2 * [example] refactor scripts folder * [example] fit new gemini plugin * [cli] fix multinode runner * [example] fit gemini optim checkpoint * [example] refactor scripts * [example] update requirements * [example] update requirements * [example] rename llama to llama2 * [example] update readme and pretrain script * [example] refactor scripts	2023-08-28 17:59:11 +08:00
Bin Jia	c554b7f559	[shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516 ) * fix overlap bug and support bert, add overlap as an option in shardconfig * support overlap for chatglm and bloom	2023-08-28 17:16:40 +08:00
Jianghai	376533a564	[shardformer] zero1+pp and the corresponding tests (#4517 ) * pause * finish pp+zero1 * Update test_shard_vit.py	2023-08-28 10:51:16 +08:00
Baizhou Zhang	44eab2b27f	[shardformer] support sharded checkpoint IO for models of HybridParallelPlugin (#4506 ) * add APIs * implement save_sharded_model * add test for hybrid checkpointio * implement naive loading for sharded model * implement efficient sharded model loading * open a new file for hybrid checkpoint_io * small fix * fix circular importing * fix docstring * arrange arguments and apis * small fix	2023-08-25 22:04:57 +08:00
flybird11111	de8a65babc	[shardformer] opt fix. (#4514 ) * [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel * fix fix fix fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * activate checks * [Test] test ci * test ci * test ci * test ci * test ci * test ci * test ci * fix	2023-08-25 19:41:24 +08:00
LuGY	839847b7d7	[zero]support zero2 with gradient accumulation (#4511 ) * support gradient accumulation with zero2 * fix type	2023-08-25 13:44:07 +08:00
flybird11111	3353e55c80	[shardformer] vit/llama/t5 ignore the sequence parallelism flag and some fix. (#4498 ) * [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel * fix fix fix fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * activate checks	2023-08-24 15:50:02 +08:00
Hongxin Liu	27061426f7	[gemini] improve compatibility and add static placement policy (#4479 ) * [gemini] remove distributed-related part from colotensor (#4379) * [gemini] remove process group dependency * [gemini] remove tp part from colo tensor * [gemini] patch inplace op * [gemini] fix param op hook and update tests * [test] remove useless tests * [test] remove useless tests * [misc] fix requirements * [test] fix model zoo * [test] fix model zoo * [test] fix model zoo * [test] fix model zoo * [test] fix model zoo * [misc] update requirements * [gemini] refactor gemini optimizer and gemini ddp (#4398) * [gemini] update optimizer interface * [gemini] renaming gemini optimizer * [gemini] refactor gemini ddp class * [example] update gemini related example * [example] update gemini related example * [plugin] fix gemini plugin args * [test] update gemini ckpt tests * [gemini] fix checkpoint io * [example] fix opt example requirements * [example] fix opt example * [example] fix opt example * [example] fix opt example * [gemini] add static placement policy (#4443) * [gemini] add static placement policy * [gemini] fix param offload * [test] update gemini tests * [plugin] update gemini plugin * [plugin] update gemini plugin docstr * [misc] fix flash attn requirement * [test] fix gemini checkpoint io test * [example] update resnet example result (#4457) * [example] update bert example result (#4458) * [doc] update gemini doc (#4468) * [example] update gemini related examples (#4473) * [example] update gpt example * [example] update dreambooth example * [example] update vit * [example] update opt * [example] update palm * [example] update vit and opt benchmark * [hotfix] fix bert in model zoo (#4480) * [hotfix] fix bert in model zoo * [test] remove chatglm gemini test * [test] remove sam gemini test * [test] remove vit gemini test * [hotfix] fix opt tutorial example (#4497) * [hotfix] fix opt tutorial example * [hotfix] fix opt tutorial example	2023-08-24 09:29:25 +08:00
flybird11111	59e252ecdb	[shardformer] chatglm support sequence parallel (#4482 ) * [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel * fix fix fix fix	2023-08-22 23:59:31 +08:00
Bin Jia	351351a36e	[shardformer/sequence parallel] not support opt of seq-parallel, add warning and fix a bug in gpt2 pp (#4488 )	2023-08-22 17:35:35 +08:00
Jianghai	5545114fd8	rename chatglm to chatglm2 (#4484 )	2023-08-22 14:13:31 +08:00
Baizhou Zhang	1c7df566e2	[shardformer] support tp+zero for shardformer (#4472 ) * support tp+zero/input type cast for hybridplugin * add tp+zero tests * fix bucket arguments	2023-08-21 12:04:52 +08:00
Jianghai	8739aa7fa0	[shardformer] Pipeline/whisper (#4456 ) * add some base tests and policies * finish whisper base model * add conditional generation * finish basic tests * whisper * finish whisper * finish whisper * del useless whisper test * fix * add argmin to replace * finish revision	2023-08-18 21:29:25 +08:00
flybird11111	a27e0bb494	[shardformer] bert support sequence parallel. (#4455 ) * [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel * [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel * [shardformer] bert support sequence parallel	2023-08-18 18:04:55 +08:00
flybird11111	0ecd71e041	[shardformer] bloom support sequence parallel (#4465 ) [shardformer] bloom support sequence parallel	2023-08-18 15:34:18 +08:00
Bin Jia	7c8be77081	[shardformer/sequence parallel] support gpt2 seq parallel with pp/dp/tp (#4460 ) * support gpt2 seq parallel with pp/dp/tp * fix a bug when waiting for stream done * delete unused gpt2_seq file	2023-08-18 11:21:53 +08:00
LuGY	a78daf6180	[shardformer] support interleaved pipeline (#4448 ) * support interleaved pipeline * fix unit test * remove virtual stage test in stage mgr * add droped type hint and updated bwd	2023-08-16 19:29:03 +08:00
Baizhou Zhang	6ef33f75aa	[shardformer] support DDP in HybridPlugin/add tp+dp tests (#4446 ) * support DDP for HybridPlugin/add tp+dp tests * add docstring for HybridParallelPlugin	2023-08-16 16:11:57 +08:00
Bin Jia	424629fea0	[shardformer/sequence parallel] Cherry pick commit to new branch (#4450 ) * [shardformer/sequence parallel] Support sequence parallel for gpt2 (#4384) * [sequence parallel] add sequence parallel linear col/row support (#4336) * add sequence parallel linear col/row support * add annotation * add annotation * add support for gpt2 fused qkv linear layer * support sequence parallel in GPT2 * add docstring and note * add requirments * remove unused flash-attb * modify flash attn test * modify flash attn setting * modify flash attn code * add assert before divide, rename forward function * [shardformer/test] fix gpt2 test with seq-parallel * [shardformer/sequence parallel] Overlap input gather and grad computation during col backward (#4401) * overlap gather input / grad computing during col backward * modify test for overlap * simplify code * fix code and modify cuda stream synchronize * [shardformer/sequence parallel] polish code	2023-08-16 15:41:20 +08:00
github-actions[bot]	d20dceb9a3	[format] applied code formatting on changed files in pull request 4441 (#4445 ) Co-authored-by: github-actions <github-actions@github.com>	2023-08-16 10:47:23 +08:00
ver217	5d4efdf58f	[shardformer] fix import	2023-08-15 23:25:14 +08:00
ver217	73a4144b91	[shardformer] fix embedding	2023-08-15 23:25:14 +08:00
Hongxin Liu	172f7fa3cf	[misc] resolve code factor issues (#4433 )	2023-08-15 23:25:14 +08:00
flybird11111	108e54a0b4	[shardformer]update t5 tests for using all optimizations. (#4407 ) * [shardformer] gpt2 tests fix [shardformer] test all optimizations (#4399) [shardformer] test all optimizations [shardformer] test all optimizations [shardformer] test all optimizations [shardformer] gpt2 tests fix * [shardformer]update t5 to use all optimizations	2023-08-15 23:25:14 +08:00
flybird11111	1edc9b5fb3	[shardformer] update tests for all optimization (#4413 ) [shardformer] update tests for all optimization	2023-08-15 23:25:14 +08:00
Baizhou Zhang	7711bd524a	[shardformer] rewrite tests for opt/bloom/llama/vit/chatglm (#4395 ) * rewrite opt tests * rewrite llama tests * rewrite bloom & vit tests * rewrite chatglm tests * fix LinearCol for classfiers * add judge for other tp layers, fix lazy init in util	2023-08-15 23:25:14 +08:00
flybird1111	d2cd48e0be	[shardformer] test all optimizations (#4399 ) [shardformer] test all optimizations [shardformer] test all optimizations [shardformer] test all optimizations	2023-08-15 23:25:14 +08:00
flybird1111	7a3dfd0c64	[shardformer] update shardformer to use flash attention 2 (#4392 ) * cherry-pick flash attention 2 cherry-pick flash attention 2 * [shardformer] update shardformer to use flash attention 2 [shardformer] update shardformer to use flash attention 2, fix [shardformer] update shardformer to use flash attention 2, fix [shardformer] update shardformer to use flash attention 2, fix	2023-08-15 23:25:14 +08:00
Baizhou Zhang	ed4c448488	[pipeline] rewrite t5 tests & support multi-tensor transmitting in pipeline (#4388 ) * fix remaining t5 bugs/rewrite t5 tests * fix multi-tensor communication in pipeline * rearrange test_config * fix keyerror in sync_shared_params * fix get_held_layers & Randomnizer, complete t5 tests * erase printing * fix get_held_layers through modifying _release_unheld_layers * fix _get_recursive_held_layers bug	2023-08-15 23:25:14 +08:00
flybird1111	906426cb44	[Shardformer] Merge flash attention branch to pipeline branch (#4362 ) * [shardformer] supported flash attention test dependency (#4158) * [shardformer] fix flash attention utils test (#4180) * [shardformer] opt support flash attention (#4163) * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] move to modeling * [shardformer] move to modeling * [shardformer] add performance benchmark of shardformer (#4175) * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] benchmark fix * [shardformer] benchmark fix * [shardformer] llama support flash attention (#4185) * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] move to modeling * [shardformer] move to modeling * [shardformer] llama support flash attention * [shardformer] llama support flash attention * [shardformer] Move the import statement for xformer outside the forward function. * [shardformer] gpt2 support flash attention. (#4191) * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] move to modeling * [shardformer] move to modeling * [shardformer] gpt2 support flash attention * [shardformer] gpt2 support flash attention * [shardformer] gpt2 support flash attention * [shardformer] bloom support flash attention (#4188) * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] move to modeling * [shardformer] move to modeling * [shardformer] bloom suport flash attention * [shardformer] add assert to sequence length * [shardformer] fix * [shardformer] fix * [shardformer] fix * [shardformer] bert support flash attention. (#4206) * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] move to modeling * [shardformer] move to modeling * [shardformer] bert support flash attention * [shardformer] t5 support flash attention. (#4216) * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] move to modeling * [shardformer] move to modeling * [shardformer] t5 support flash attention * [shardformer] t5 support flash attention * fix typo * fix typo * fix typo * fix typo * fix typo * fix typo * [shardformer] support 'paddedcausal' type of attention mask in Coloattention. (#4215) * added padded causal attn mask type for ColoAttention * [shardformer]t5 flash attention fix (#4239) * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] move to modeling * [shardformer] move to modeling * [shardformer] t5 flash attention fix * [shardformer] update gpt2 to use coloattention. (#4234) * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] move to modeling * [shardformer] move to modeling * [shardformer] update gpt2 to use coloattention * [shardformer] update gpt2 to use coloattention * [shardformer] update gpt2 to use coloattention * [shardformer] update gpt2 to use coloattention * [shardformer] update gpt2 * [shardformer] update opt and llama to use coloattention. (#4226) * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] move to modeling * [shardformer] move to modeling * update opt to use coloattention * [shardformer]update opt to use coloattention * [shardformer]update opt to use coloattention * [shardformer]update opt to use coloattention * [shardformer]update opt to use coloattention * [shardformer]update opt to use coloattention * [shardformer]update opt to use coloattention * [shardformer]update opt * [shardformer] shardformer support jit fused operator. (#4236) * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] opt support flash attention * [shardformer] move to modeling * [shardformer] move to modeling * [shardformer] bloom support jit fused operator * [shardformer] bloom support jit fused operator * [shardformer] bloom support jit fused operator * [shardformer] t5 support jit fused operator * [shardformer] t5 support jit fused operator * [shardformer] t5 support jit fused operator * [shardformer] add roadmap of flash attention * [shardformer] add roadmap of flash attention * [shardformer] add roadmap of flash attention * [shardformer] add type hint to 'self' param of forward * [shardformer] merge feature/shardformer-models branch to feature/flash-attention-shardformer branch. (#4290) * Feature/vit support (#4182) * [shardformer] added tests * [shardformer] vit test finish and support * fix attention dropout * [shardformer] support SAM (#4231) * 1.support sam 2.add fused qkv for nn.Linear * update utils support set element in list * overtwrite SamVisionAttention foward to use DropoutForParallelInput * remove unused code * [shardformer] support whisper (#4212) * support whisper * fix bug in vocabembedding * support downstream model of whisper * update readme * Feature/chatglm (#4240) * [shardformer] added tests * [shardformer] vit test finish and support * [shardformer] chatglm ready * import chatglm * [shardformer] add test kit in model zoo for chatglm * [sharformer] add first version of policy of chatglm * [shardformer] polish chatglm code * [shardformer] polish code * [shardformer] support chatglm without layernorm * [shardformer] chatglm shard without mlp sharding * [shardformer] delete some file * [shardformer] ChatGLM support layernorm sharding * [shardformer] register without auto policy * [shardformer] pre-commit check files * [shardformer] fix chatglm configuration with pre-commit --------- Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com> Co-authored-by: FoolPlayer <45593998+FoolPlayer@users.noreply.github.com> * [shardformer] whisper support flash attention (#4301) * Feature/vit support (#4182) * [shardformer] added tests * [shardformer] vit test finish and support * fix attention dropout * [shardformer] support SAM (#4231) * 1.support sam 2.add fused qkv for nn.Linear * update utils support set element in list * overtwrite SamVisionAttention foward to use DropoutForParallelInput * remove unused code * [shardformer] support whisper (#4212) * support whisper * fix bug in vocabembedding * support downstream model of whisper * update readme * Feature/chatglm (#4240) * [shardformer] added tests * [shardformer] vit test finish and support * [shardformer] chatglm ready * import chatglm * [shardformer] add test kit in model zoo for chatglm * [sharformer] add first version of policy of chatglm * [shardformer] polish chatglm code * [shardformer] polish code * [shardformer] support chatglm without layernorm * [shardformer] chatglm shard without mlp sharding * [shardformer] delete some file * [shardformer] ChatGLM support layernorm sharding * [shardformer] register without auto policy * [shardformer] pre-commit check files * [shardformer] fix chatglm configuration with pre-commit * [shardformer] whisper support flash attention * [shardformer] whisper support flash attention * [shardformer]whisper support jit operator --------- Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com> Co-authored-by: FoolPlayer <45593998+FoolPlayer@users.noreply.github.com> * [shardformer] sam support flash attention (#4316) * Feature/vit support (#4182) * [shardformer] added tests * [shardformer] vit test finish and support * fix attention dropout * [shardformer] support SAM (#4231) * 1.support sam 2.add fused qkv for nn.Linear * update utils support set element in list * overtwrite SamVisionAttention foward to use DropoutForParallelInput * remove unused code * [shardformer] support whisper (#4212) * support whisper * fix bug in vocabembedding * support downstream model of whisper * update readme * Feature/chatglm (#4240) * [shardformer] added tests * [shardformer] vit test finish and support * [shardformer] chatglm ready * import chatglm * [shardformer] add test kit in model zoo for chatglm * [sharformer] add first version of policy of chatglm * [shardformer] polish chatglm code * [shardformer] polish code * [shardformer] support chatglm without layernorm * [shardformer] chatglm shard without mlp sharding * [shardformer] delete some file * [shardformer] ChatGLM support layernorm sharding * [shardformer] register without auto policy * [shardformer] pre-commit check files * [shardformer] fix chatglm configuration with pre-commit * [shardformer] sam support flash attention --------- Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com> Co-authored-by: FoolPlayer <45593998+FoolPlayer@users.noreply.github.com> * [shardformer] merge blip2/chatglm (#4321) * Feature/vit support (#4182) * [shardformer] added tests * [shardformer] vit test finish and support * fix attention dropout * [shardformer] support SAM (#4231) * 1.support sam 2.add fused qkv for nn.Linear * update utils support set element in list * overtwrite SamVisionAttention foward to use DropoutForParallelInput * remove unused code * [shardformer] support whisper (#4212) * support whisper * fix bug in vocabembedding * support downstream model of whisper * update readme * Feature/chatglm (#4240) * [shardformer] added tests * [shardformer] vit test finish and support * [shardformer] chatglm ready * import chatglm * [shardformer] add test kit in model zoo for chatglm * [sharformer] add first version of policy of chatglm * [shardformer] polish chatglm code * [shardformer] polish code * [shardformer] support chatglm without layernorm * [shardformer] chatglm shard without mlp sharding * [shardformer] delete some file * [shardformer] ChatGLM support layernorm sharding * [shardformer] register without auto policy * [shardformer] pre-commit check files * [shardformer] fix chatglm configuration with pre-commit * [shardformer] added tests * [shardformer] vit test finish and support * import chatglm * [shardformer] add test kit in model zoo for chatglm * [sharformer] add first version of policy of chatglm * [shardformer] polish chatglm code * [shardformer] polish code * [shardformer] support chatglm without layernorm * [shardformer] delete some file * [shardformer] ChatGLM support layernorm sharding * [shardformer] register without auto policy * [shardformer] pre-commit check files * [shardformer] support ChatGLMForConditionalGeneration & add fusedlayernorm for vit * [shardformer] support Blip2 (#4243) * support base blip2 * add support for downstream blip2 model * update readme * add forward injection * skip not compatible models test * fix test for gemini and low_level_zero_pugin --------- Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com> Co-authored-by: FoolPlayer <45593998+FoolPlayer@users.noreply.github.com> Co-authored-by: klhhhhh <1412841649@qq.com> * [shardformer] blip2 support flash attention and jit operator (#4325) * Feature/vit support (#4182) * [shardformer] added tests * [shardformer] vit test finish and support * fix attention dropout * [shardformer] support SAM (#4231) * 1.support sam 2.add fused qkv for nn.Linear * update utils support set element in list * overtwrite SamVisionAttention foward to use DropoutForParallelInput * remove unused code * [shardformer] support whisper (#4212) * support whisper * fix bug in vocabembedding * support downstream model of whisper * update readme * Feature/chatglm (#4240) * [shardformer] added tests * [shardformer] vit test finish and support * [shardformer] chatglm ready * import chatglm * [shardformer] add test kit in model zoo for chatglm * [sharformer] add first version of policy of chatglm * [shardformer] polish chatglm code * [shardformer] polish code * [shardformer] support chatglm without layernorm * [shardformer] chatglm shard without mlp sharding * [shardformer] delete some file * [shardformer] ChatGLM support layernorm sharding * [shardformer] register without auto policy * [shardformer] pre-commit check files * [shardformer] fix chatglm configuration with pre-commit * [shardformer] added tests * [shardformer] vit test finish and support * import chatglm * [shardformer] add test kit in model zoo for chatglm * [sharformer] add first version of policy of chatglm * [shardformer] polish chatglm code * [shardformer] polish code * [shardformer] support chatglm without layernorm * [shardformer] delete some file * [shardformer] ChatGLM support layernorm sharding * [shardformer] register without auto policy * [shardformer] pre-commit check files * [shardformer] support ChatGLMForConditionalGeneration & add fusedlayernorm for vit * [shardformer] support Blip2 (#4243) * support base blip2 * add support for downstream blip2 model * update readme * add forward injection * skip not compatible models test * fix test for gemini and low_level_zero_pugin * [shardformer] blip2 support flash attention and jit operator * [shardformer] blip2 support flash attention and jit operator * [shardformer] blip2 support flash attention and jit operator --------- Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com> Co-authored-by: FoolPlayer <45593998+FoolPlayer@users.noreply.github.com> Co-authored-by: klhhhhh <1412841649@qq.com> * [shardformer] chatglm support flash attention and jit operator (#4330) * Feature/vit support (#4182) * [shardformer] added tests * [shardformer] vit test finish and support * fix attention dropout * [shardformer] support SAM (#4231) * 1.support sam 2.add fused qkv for nn.Linear * update utils support set element in list * overtwrite SamVisionAttention foward to use DropoutForParallelInput * remove unused code * [shardformer] support whisper (#4212) * support whisper * fix bug in vocabembedding * support downstream model of whisper * update readme * Feature/chatglm (#4240) * [shardformer] added tests * [shardformer] vit test finish and support * [shardformer] chatglm ready * import chatglm * [shardformer] add test kit in model zoo for chatglm * [sharformer] add first version of policy of chatglm * [shardformer] polish chatglm code * [shardformer] polish code * [shardformer] support chatglm without layernorm * [shardformer] chatglm shard without mlp sharding * [shardformer] delete some file * [shardformer] ChatGLM support layernorm sharding * [shardformer] register without auto policy * [shardformer] pre-commit check files * [shardformer] fix chatglm configuration with pre-commit * [shardformer] added tests * [shardformer] vit test finish and support * import chatglm * [shardformer] add test kit in model zoo for chatglm * [sharformer] add first version of policy of chatglm * [shardformer] polish chatglm code * [shardformer] polish code * [shardformer] support chatglm without layernorm * [shardformer] delete some file * [shardformer] ChatGLM support layernorm sharding * [shardformer] register without auto policy * [shardformer] pre-commit check files * [shardformer] support ChatGLMForConditionalGeneration & add fusedlayernorm for vit * [shardformer] support Blip2 (#4243) * support base blip2 * add support for downstream blip2 model * update readme * add forward injection * skip not compatible models test * fix test for gemini and low_level_zero_pugin * [shardformer] chatglm support flash attention and jit operator * [shardformer] chatglm support flash attention and jit operator * [shardformer] chatglm support flash attention and jit operator * [shardformer] chatglm support flash attention and jit operator --------- Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com> Co-authored-by: FoolPlayer <45593998+FoolPlayer@users.noreply.github.com> Co-authored-by: klhhhhh <1412841649@qq.com> * [shardformer] vit support flash attention and jit operator (#4334) * Feature/vit support (#4182) * [shardformer] added tests * [shardformer] vit test finish and support * fix attention dropout * [shardformer] support SAM (#4231) * 1.support sam 2.add fused qkv for nn.Linear * update utils support set element in list * overtwrite SamVisionAttention foward to use DropoutForParallelInput * remove unused code * [shardformer] support whisper (#4212) * support whisper * fix bug in vocabembedding * support downstream model of whisper * update readme * Feature/chatglm (#4240) * [shardformer] added tests * [shardformer] vit test finish and support * [shardformer] chatglm ready * import chatglm * [shardformer] add test kit in model zoo for chatglm * [sharformer] add first version of policy of chatglm * [shardformer] polish chatglm code * [shardformer] polish code * [shardformer] support chatglm without layernorm * [shardformer] chatglm shard without mlp sharding * [shardformer] delete some file * [shardformer] ChatGLM support layernorm sharding * [shardformer] register without auto policy * [shardformer] pre-commit check files * [shardformer] fix chatglm configuration with pre-commit * [shardformer] added tests * [shardformer] vit test finish and support * import chatglm * [shardformer] add test kit in model zoo for chatglm * [sharformer] add first version of policy of chatglm * [shardformer] polish chatglm code * [shardformer] polish code * [shardformer] support chatglm without layernorm * [shardformer] delete some file * [shardformer] ChatGLM support layernorm sharding * [shardformer] register without auto policy * [shardformer] pre-commit check files * [shardformer] support ChatGLMForConditionalGeneration & add fusedlayernorm for vit * [shardformer] support Blip2 (#4243) * support base blip2 * add support for downstream blip2 model * update readme * add forward injection * skip not compatible models test * fix test for gemini and low_level_zero_pugin * [shardformer] vit support flash attention and jit operator * [shardformer] vit support flash attention and jit operator --------- Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com> Co-authored-by: FoolPlayer <45593998+FoolPlayer@users.noreply.github.com> Co-authored-by: klhhhhh <1412841649@qq.com> * [pipeline] merge flash attention branch * [pipeline] merge flash attention branch * [pipeline] merge flash attention branch * [pipeline] fix conflict * [pipeline] fix conflict * Merge branch 'feature/pipeline' into feature/pipeline * Merge branch 'feature/pipeline' into feature/pipeline * Merge branch 'feature/pipeline' into feature/pipeline * activate checks * activate checks * activate checks * activate checks * activate checks * activate checks * activate checks * activate checks * fix flash attention tests * gemini ignore whisper * fix vit * fix xformers import handle --------- Co-authored-by: Frank Lee <somerlee.9@gmail.com> Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com> Co-authored-by: FoolPlayer <45593998+FoolPlayer@users.noreply.github.com> Co-authored-by: klhhhhh <1412841649@qq.com>	2023-08-15 23:25:14 +08:00
Jianghai	a88e92251d	[pipeline] add chatglm (#4363 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt * add bloom model and policy ,revise the base class of policy * revise * revision * add bert_for_pretraining * add bert_for_pretraining forward and policy * fix typos * cancel warning * change the imediate output to default dict * change the default output of get_shared_params * add chatglm * add * chatglm * chatglm * finish chatglm * deletes * fix rmsnorm * chatglm * fix chatglm shard * init	2023-08-15 23:25:14 +08:00
Baizhou Zhang	b1feeced8e	[shardformer] add util functions for shardformer tests/fix sync_shared_param (#4366 ) * add util functions for shardformer tests & rewrite gpt2 test * fix shared_params & embedding/merging * fix precision	2023-08-15 23:25:14 +08:00
FoolPlayer	726541afe2	update some module with new api version	2023-08-15 23:25:14 +08:00
FoolPlayer	879301d0da	[shardformer] support Blip2 (#4243 ) * support base blip2 * add support for downstream blip2 model * update readme * add forward injection * skip not compatible models test * fix test for gemini and low_level_zero_pugin	2023-08-15 23:25:14 +08:00
klhhhhh	8120eca0c0	[shardformer] support ChatGLMForConditionalGeneration & add fusedlayernorm for vit	2023-08-15 23:25:14 +08:00
klhhhhh	91850fe984	[shardformer] register without auto policy	2023-08-15 23:25:14 +08:00
klhhhhh	1a29e8fc29	[shardformer] polish chatglm code	2023-08-15 23:25:14 +08:00
klhhhhh	8620009dd7	[sharformer] add first version of policy of chatglm	2023-08-15 23:25:14 +08:00
Kun Lin	ed34bb1310	Feature/chatglm (#4240 ) * [shardformer] added tests * [shardformer] vit test finish and support * [shardformer] chatglm ready * import chatglm * [shardformer] add test kit in model zoo for chatglm * [sharformer] add first version of policy of chatglm * [shardformer] polish chatglm code * [shardformer] polish code * [shardformer] support chatglm without layernorm * [shardformer] chatglm shard without mlp sharding * [shardformer] delete some file * [shardformer] ChatGLM support layernorm sharding * [shardformer] register without auto policy * [shardformer] pre-commit check files * [shardformer] fix chatglm configuration with pre-commit	2023-08-15 23:25:14 +08:00
FoolPlayer	9ee4ebea83	[shardformer] support whisper (#4212 ) * support whisper * fix bug in vocabembedding * support downstream model of whisper * update readme	2023-08-15 23:25:14 +08:00
FoolPlayer	dd2bf02679	[shardformer] support SAM (#4231 ) * 1.support sam 2.add fused qkv for nn.Linear * update utils support set element in list * overtwrite SamVisionAttention foward to use DropoutForParallelInput * remove unused code	2023-08-15 23:25:14 +08:00
Kun Lin	c59d7aca09	Feature/vit support (#4182 ) * [shardformer] added tests * [shardformer] vit test finish and support * fix attention dropout	2023-08-15 23:25:14 +08:00
Baizhou Zhang	0ceec8f9a9	[pipeline] support fp32 for HybridPlugin/merge shardformer test and pipeline test into one file (#4354 ) * add naive optimizer for 3DPlugin/refactor gpt2 shardformer test * merge tests of PP/DP/TP combinations into one test file * fix bug when sync grad for dp in HybridPlugin * update supported precisions for 3DPlugin/fix bug when shifting tp_degree * improve the passing of lazy_init * modify lazy_init/use sync_shared_params	2023-08-15 23:25:14 +08:00
Jianghai	f13954cd58	[pipeline] refactor test pipeline and remove useless utils in pipeline (#4324 ) * refactor tests * refactor bloom model * finish policy tests * refactor tests * fix test pure pipeline * remove test pipeline and cutdown launch process * refactor tests * refactor bloom model * finish policy tests * refactor tests * fix test pure pipeline * remove test pipeline and cutdown launch process	2023-08-15 23:25:14 +08:00
Baizhou Zhang	da3cef27ad	[pipeline] fix return_dict/fix pure_pipeline_test (#4331 )	2023-08-15 23:25:14 +08:00
Hongxin Liu	261eab02fb	[plugin] add 3d parallel plugin (#4295 ) * [amp] add mixed precision optimizer * [plugin] add 3d parallel plugin * [booster] support pipeline * [plugin] 3d parallel plugin support clip grad norm * [shardformer] fix sharder and add plugin test * [plugin] rename 3d parallel plugin * [ci] support testmon core pkg change detection (#4305) * [hotfix] debug testmon * [hotfix] fix llama * [hotfix] fix p2p bugs * [hotfix] fix requirements	2023-08-15 23:25:14 +08:00
FoolPlayer	b3f5d7a3ba	[shardformer] support pipeline base vit model (#4284 ) * Feature/vit support (#4182) * [shardformer] added tests * [shardformer] vit test finish and support * fix attention dropout * support base vit pipeline * support vit downstream model * fix vit shard test * modify hidden states return type --------- Co-authored-by: Kun Lin <81014421+klhhhhh@users.noreply.github.com>	2023-08-15 23:25:14 +08:00
Baizhou Zhang	083d7da33d	[pipeline] add pipeline support for all T5 models (#4310 ) * complete policy for T5Model & T5ForConditionalGeneration * modify function signature in forwards * add forward for T5model * add forward for T5ForConditionalGeneration * fix a bug * fix hidden_states transporting in decoder * fix the passing of encoder_outputs	2023-08-15 23:25:14 +08:00
Jianghai	d0807122e2	[pipeline] test pure pipeline process using llama (#4218 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * Revert "bloom policy" This reverts commit `8dee68a0a2`. This policy should be revert and copied to feature/bloom * revert the bloom changes * cancel unneeded inputs * gpt * finish llama * causal lm and sequence classification * revision * add pure pipeline test * fixed version * fixed version * pure pipeline	2023-08-15 23:25:14 +08:00
Baizhou Zhang	36e546b2cc	[pipeline] add pipeline support for T5Stack/T5EncoderModel (#4300 ) * modify t5 policy & add test * pipeline stage distribution for t5 * complete t5 base policy * t5 stack: halfway * modify gpt2 pipeline test * complete pipeline forward for T5Stack/T5EncoderModel * fix docstring * move t5 util tests to test_pipeline	2023-08-15 23:25:14 +08:00
Jianghai	18ebcf406a	[pipeline] reformat for unified design (#4283 ) * bert_reformat * reformat * reformat * fix a typo * format * format * fix bug	2023-08-15 23:25:14 +08:00
Jianghai	0a8f3c851a	[hotfix] fix opt pipeline (#4293 ) * opt forward and test * pause * finish opt model pipeline * finish opt pipeline * opt forward and test * pause * finish opt model pipeline * finish opt pipeline * fix opt * set transformers version * refactor the test pipeline * fix bug	2023-08-15 23:25:14 +08:00
Jianghai	d8408d185c	[pipeline] OPT model pipeline (#4258 ) * opt forward and test * pause * finish opt model pipeline * finish opt pipeline * opt forward and test * pause * finish opt model pipeline * finish opt pipeline * fix opt * set transformers version * refactor the test pipeline	2023-08-15 23:25:14 +08:00
Baizhou Zhang	b774d5ea0f	[pipeline] refactor gpt2 pipeline forwards (#4287 ) * move gpt2 pipeline forwards to modeling folder * check pipeline status when adding replacing policy * fix typehint * fix arguments processing in gpt2_model_forward	2023-08-15 23:25:14 +08:00
Hongxin Liu	d921ce8391	[shardformer] support inplace sharding (#4251 ) * [shardformer] embedding support inplace sharding * [shardformer] linear support inplace sharding * [shardformer] layernorm support inplace sharding * [shardformer] qkv support inplace sharding * [test] update shardformer layer test * [shardformer] fix shared param sharding * [shardformer] fix bert policy * [shardformer] fix bloom policy * [shardformer] fix llama policy * [shardformer] fix opt policy * [shardformer] fix t5 policy * [shardformer] fix fused qkv linear * [shardformer] fix bugs * force sync * [test] fix bugs * [test] fix transformer version	2023-08-15 23:25:14 +08:00
Baizhou Zhang	2a2eacfaf1	[pipeline] support shardformer for GPT2ForQuestionAnswering & complete pipeline support for GPT2 (#4245 ) * change for transformers loggers * add forward for GPT2ForQuestionAnswering * fix assert * fix torchrec test	2023-08-15 23:25:14 +08:00
Jianghai	34f0e34a4c	[pipeline] finish bloom models pipeline and tests (#4223 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * finish bloom model * test shard gpt2 * clear cache * support all bloom models * add bloom models policies * finish bloom pipeline and tests * add set pipeline * finish bloom	2023-08-15 23:25:14 +08:00
Jianghai	e7cc62d735	[pipeline] All bert models (#4233 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * Revert "bloom policy" This reverts commit `8dee68a0a2`. This policy should be revert and copied to feature/bloom * revert the bloom changes * cancel unneeded inputs * gpt * finish llama * causal lm and sequence classification * revision * add pure pipeline test * finish some bert models * finish all bert models * finish bert tests * fix bugs * fix bugs * fix test pipeline * fix data gen for qa * update the set pipeline forward * shared params * fix bugs	2023-08-15 23:25:14 +08:00
Baizhou Zhang	a14d352088	[pipeline] add pipeline forward for variants of gpt2 (#4238 ) * add forward for GPTLMHeadModel * add test for gpt_lm * arranging get_held_layers method * arrange forward replacement * add forward for GPT2ForTokenClassification * add forward for GPT2ForSequenceClassification * fix test_shard_gpt2.py * add GPT2DoubleHeadsmodel & fix bugs * add id checking in get_shared_params	2023-08-15 23:25:14 +08:00
Hongxin Liu	7e4de520e1	[shardformer] fix base policy (#4229 )	2023-08-15 23:25:14 +08:00
Baizhou Zhang	208ac8f2ba	[pipeline] Add Pipeline Forward for GPT2Model Shardformer (#4224 ) * * fix typehint & docstring in sharder.py * * update pipeline forward for GPT2Model * * add test for pipeline forward of GPT2Model * * add cache cleaning in gpt2 test * * change assert to raise command	2023-08-15 23:25:14 +08:00
Jianghai	37d22f6878	[pipeline] add bloom model pipeline (#4210 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * finish bloom model * test shard gpt2 * clear cache	2023-08-15 23:25:14 +08:00
Jianghai	31bcf867ae	[pipeline] Llama causal lm and llama for sequence classification pipeline (#4208 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * Revert "bloom policy" This reverts commit `8dee68a0a2`. This policy should be revert and copied to feature/bloom * revert the bloom changes * cancel unneeded inputs * gpt * finish llama * causal lm and sequence classification * revision	2023-08-15 23:25:14 +08:00
Jianghai	1622031058	[pipeline] Llama pipeline (#4205 ) * bloom policy * llama pipeline forward and tests * fix the output and attention_mask * fix name * bind argument to policy * Revert "bloom policy" This reverts commit `8dee68a0a2`. This policy should be revert and copied to feature/bloom * revert the bloom changes * cancel unneeded inputs * gpt	2023-08-15 23:25:14 +08:00
Jianghai	1094e0f0d3	[pipeline] Bert pipeline for shardformer and its tests (#4197 ) * add pipeline forward * complete pipeline forward check * fix bert forward without pipeline * fix comments * discard useless line * add todo * clean prints * fix distribute layers	2023-08-15 23:25:14 +08:00
Hongxin Liu	890774b2fb	[shardformer] support lazy init (#4202 ) * [shardformer] support lazy init * [shardformer] linear support lazy init * [shardformer] embedding support lazy init * [shardformer] norm support lazy init * [shardformer] fused linear support lazy init * [test] update shardformer test layer * [test] shardformer with lazy init fit ddp * [lazy] hotfix deepcopy of param * [shardformer] fix bert policy and update test * [shardformer] fix bloom policy and update test * [shardformer] fix opt policy and update test * [shardformer] fix t5 policy and update test * [shardformer] fix gpt2 policy and update test * [shardformer] fix llama policy and update test	2023-08-15 23:25:14 +08:00
Jianghai	f3bcc292c8	[pipeline] move bert related pipeline components to shardformer (#4187 ) * move bert related pipeline components to shardformer * fix bugs * revision * fix bert model tests * fix bert_lm_head model tests * fix tests * fix tests * done checks * skip bloom	2023-08-15 23:25:14 +08:00
Jianghai	c5ea728016	[pipeline] add bert_for_pretraining bert_lmhead forward and policy (#4172 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt * add bloom model and policy ,revise the base class of policy * revise * revision * add bert_for_pretraining * add bert_for_pretraining forward and policy * fix typos * cancel warning * change the imediate output to default dict * change the default output of get_shared_params	2023-08-15 23:25:14 +08:00
ver217	d35bd7d0e6	[shardformer] fix type hint	2023-08-15 23:25:14 +08:00
ver217	1ed3f8a24f	[shardformer] rename policy file name	2023-08-15 23:25:14 +08:00
ver217	b0b8ad2823	[pipeline] update shardformer docstring	2023-08-15 23:25:14 +08:00
ver217	59f6f573f1	[pipeline] update shardformer policy	2023-08-15 23:25:14 +08:00
Jianghai	90a65ea682	[pipeline] build bloom model and policy , revise the base class of policy (#4161 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt * add bloom model and policy ,revise the base class of policy * revise * revision * add bert_for_pretraining	2023-08-15 23:25:14 +08:00
Jianghai	e8e7e49243	[pipeline]add pipeline policy and bert forward (#4130 ) * add pipeline policy and bert forward to be done * add bertmodel pipeline forward and make tests * add Bert_Policy and test for policy * update formatting * update formatting * update the code * fix bugs * fix name confilt	2023-08-15 23:25:14 +08:00
Hongxin Liu	f51ce1bc8e	[pipeline] refactor 1f1b schedule (#4115 ) * [api] update optimizer wrapper to fit pipeline * [pipeline] add base schedule * [pipeline] add 1f1b schedule * [test] add pipeline schedule utils test * [pipeline] fix import	2023-08-15 23:25:14 +08:00
Hongxin Liu	45fdc9b42c	[pipeline] implement p2p communication (#4100 ) * [pipeline] add p2p communication * [test] add p2p communication test * [test] add rerun decorator * [test] rename to avoid conflict	2023-08-15 23:25:14 +08:00
Hongxin Liu	422544222f	[pipeline] add stage manager (#4093 ) * [pipeline] add stage manager * [test] add pipeline stage manager test * [pipeline] add docstring for stage manager	2023-08-15 23:25:14 +08:00
Hongxin Liu	5e1a9d48dd	[cluster] add process group mesh (#4039 ) * [cluster] add process group mesh * [test] add process group mesh test * force sync	2023-08-15 23:25:14 +08:00
LuGY	d86ddd9b29	[hotfix] fix unsafe async comm in zero (#4404 ) * improve stablility of zero * fix wrong index * add record stream	2023-08-11 15:09:24 +08:00
Baizhou Zhang	6ccecc0c69	[gemini] fix tensor storage cleaning in state dict collection (#4396 )	2023-08-10 15:36:46 +08:00
binmakeswell	089c365fa0	[doc] add Series A Funding and NeurIPS news (#4377 ) * [doc] add Series A Funding and NeurIPS news * [kernal] fix mha kernal * [CI] skip moe * [CI] fix requirements	2023-08-04 17:42:07 +08:00
flybird1111	38b792aab2	[coloattention] fix import error (#4380 ) fixed an import error	2023-08-04 16:28:41 +08:00
flybird1111	25c57b9fb4	[fix] coloattention support flash attention 2 (#4347 ) Improved ColoAttention interface to support flash attention 2. Solved #4322	2023-08-04 13:46:22 +08:00
Hongxin Liu	16bf4c0221	[test] remove useless tests (#4359 ) * [test] remove legacy zero test * [test] remove lazy distribute test * [test] remove outdated checkpoint io	2023-08-01 18:52:14 +08:00
LuGY	03654c0ce2	fix localhost measurement (#4320 )	2023-08-01 10:14:00 +08:00
LuGY	45b08f08cb	[zero] optimize the optimizer step time (#4221 ) * optimize the optimizer step time * fix corner case * polish * replace all-reduce with all-gather * set comm device to cuda	2023-07-31 22:13:29 +08:00
LuGY	1a49a5ea00	[zero] support shard optimizer state dict of zero (#4194 ) * support shard optimizer of zero * polish code * support sync grad manually	2023-07-31 22:13:29 +08:00
LuGY	dd7cc58299	[zero] add state dict for low level zero (#4179 ) * add state dict for zero * fix unit test * polish	2023-07-31 22:13:29 +08:00
LuGY	c668801d36	[zero] allow passing process group to zero12 (#4153 ) * allow passing process group to zero12 * union tp-zero and normal-zero * polish code	2023-07-31 22:13:29 +08:00
LuGY	79cf1b5f33	[zero]support no_sync method for zero1 plugin (#4138 ) * support no sync for zero1 plugin * polish * polish	2023-07-31 22:13:29 +08:00
LuGY	c6ab96983a	[zero] refactor low level zero for shard evenly (#4030 ) * refactor low level zero * fix zero2 and support cpu offload * avg gradient and modify unit test * refactor grad store, support layer drop * refactor bucket store, support grad accumulation * fix and update unit test of zero and ddp * compatible with tp, ga and unit test * fix memory leak and polish * add zero layer drop unittest * polish code * fix import err in unit test * support diffenert comm dtype, modify docstring style * polish code * test padding and fix * fix unit test of low level zero * fix pad recording in bucket store * support some models * polish	2023-07-31 22:13:29 +08:00
dayellow	a50d39a143	[NFC] fix: format (#4270 ) * [NFC] polish colossalai/fx/profiler/experimental/profiler_module/embedding.py code style * [NFC] polish colossalai/communication/utils.py code style --------- Co-authored-by: Minghao Huang <huangminghao@luchentech.com>	2023-07-26 14:12:57 +08:00
Wenhao Chen	fee553288b	[NFC] polish runtime_preparation_pass style (#4266 )	2023-07-26 14:12:57 +08:00
YeAnbang	3883db452c	[NFC] polish unary_elementwise_generator.py code style (#4267 ) Co-authored-by: aye42 <aye42@gatech.edu>	2023-07-26 14:12:57 +08:00
梁爽	abe4f971e0	[NFC] polish colossalai/booster/plugin/low_level_zero_plugin.py code style (#4256 ) Co-authored-by: supercooledith <893754954@qq.com>	2023-07-26 14:12:57 +08:00
Yanjia0	c614a99d28	[NFC] polish colossalai/auto_parallel/offload/amp_optimizer.py code style (#4255 )	2023-07-26 14:12:57 +08:00
ocd_with_naming	85774f0c1f	[NFC] polish colossalai/cli/benchmark/utils.py code style (#4254 )	2023-07-26 14:12:57 +08:00
Michelle	86cf6aed5b	Fix/format (#4261 ) * revise shardformer readme (#4246) * [example] add llama pretraining (#4257) * [NFC] polish colossalai/communication/p2p.py code style --------- Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com> Co-authored-by: binmakeswell <binmakeswell@gmail.com> Co-authored-by: Qianran Ma <qianranm@luchentech.com>	2023-07-26 14:12:57 +08:00
Jianghai	b366f1d99f	[NFC] Fix format for mixed precision (#4253 ) * [NFC] polish colossalai/booster/mixed_precision/mixed_precision_base.py code style	2023-07-26 14:12:57 +08:00
Baizhou Zhang	c6f6005990	[checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin (#4302 ) * sharded optimizer checkpoint for gemini plugin * modify test to reduce testing time * update doc * fix bug when keep_gatherd is true under GeminiPlugin	2023-07-21 14:39:01 +08:00
Hongxin Liu	fc5cef2c79	[lazy] support init on cuda (#4269 ) * [lazy] support init on cuda * [test] update lazy init test * [test] fix transformer version	2023-07-19 16:43:01 +08:00
Cuiqing Li	4b977541a8	[Kernels] added triton-implemented of self attention for colossal-ai (#4241 ) * added softmax kernel * added qkv_kernel * added ops * adding tests * upload tets * fix tests * debugging * debugging tests * debugging * added * fixed errors * added softmax kernel * clean codes * added tests * update tests * update tests * added attention * add * fixed pytest checking * add cuda check * fix cuda version * fix typo	2023-07-18 23:53:38 +08:00
Jianghai	9a4842c571	revise shardformer readme (#4246 )	2023-07-17 17:30:57 +08:00
Baizhou Zhang	58913441a1	Next commit [checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin (#4141 ) * [checkpointio] unsharded optimizer checkpoint for Gemini plugin * [checkpointio] unsharded optimizer checkpoint for Gemini using all_gather	2023-07-07 16:33:06 +08:00
Frank Lee	190a6ea9c2	[dtensor] fixed readme file name and removed deprecated file (#4162 )	2023-07-04 18:21:11 +08:00
Hongxin Liu	1908caad38	[cli] hotfix launch command for multi-nodes (#4165 )	2023-07-04 17:54:40 +08:00
digger yu	2ac24040eb	fix some typo colossalai/shardformer (#4160 )	2023-07-04 17:53:39 +08:00
github-actions[bot]	c77b3b19be	[format] applied code formatting on changed files in pull request 4152 (#4157 ) Co-authored-by: github-actions <github-actions@github.com>	2023-07-04 16:07:47 +08:00
Frank Lee	89f45eda5a	[shardformer] added development protocol for standardization (#4149 )	2023-07-04 16:05:01 +08:00
Frank Lee	1fb0d95df0	[shardformer] made tensor parallelism configurable (#4144 ) * [shardformer] made tensor parallelism configurable * polish code	2023-07-04 16:05:01 +08:00
Frank Lee	74257cb446	[shardformer] refactored some doc and api (#4137 ) * [shardformer] refactored some doc and api * polish code	2023-07-04 16:05:01 +08:00
jiangmingyan	7f9b30335b	[shardformer] write an shardformer example with bert finetuning (#4126 ) * [shardformer] add benchmark of shardformer * [shardformer] add benchmark of shardformer	2023-07-04 16:05:01 +08:00
Frank Lee	ae035d305d	[shardformer] added embedding gradient check (#4124 )	2023-07-04 16:05:01 +08:00
Frank Lee	44a190e6ac	[shardformer] import huggingface implicitly (#4101 )	2023-07-04 16:05:01 +08:00
Frank Lee	6a88bae4ec	[shardformer] integrate with data parallelism (#4103 )	2023-07-04 16:05:01 +08:00
Frank Lee	f3b6aaa6b7	[shardformer] supported fused normalization (#4112 )	2023-07-04 16:05:01 +08:00
Frank Lee	b1c2901530	[shardformer] supported bloom model (#4098 )	2023-07-04 16:05:01 +08:00
Kun Lin	8af29ee47a	[shardformer] support vision transformer (#4096 ) * first v of vit shardformer * keep vit * update * vit shard add vitattention vitlayer * update num head shard para * finish test for vit * add new_model_class & postprocess * add vit readme * delete old files & fix the conflict * fix sth	2023-07-04 16:05:01 +08:00
jiangmingyan	ac80937138	[shardformer] shardformer support opt models (#4091 ) * [shardformer] shardformer support opt models * [shardformer] shardformer support opt models, fix * [shardformer] shardformer support opt models, fix * [shardformer] shardformer support opt models, fix	2023-07-04 16:05:01 +08:00
Frank Lee	d33a44e8c3	[shardformer] refactored layernorm (#4086 )	2023-07-04 16:05:01 +08:00
Frank Lee	c4b1b65931	[test] fixed tests failed due to dtensor change (#4082 ) * [test] fixed tests failed due to dtensor change * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	92f6791095	[shardformer] Add layernorm (#4072 ) * add layernorm to bert * add layernorm test * add layernorm test with load state dict * add use_mixedfusedLN in shard config * refactor policy to support fused_layernorm	2023-07-04 16:05:01 +08:00
Frank Lee	70c58cfd4f	[shardformer] supported fused qkv checkpoint (#4073 )	2023-07-04 16:05:01 +08:00
FoolPlayer	0803a61412	[shardformer] add linearconv1d test (#4067 ) * add linearconv1d test * add linearconv1d test	2023-07-04 16:05:01 +08:00
Frank Lee	8eb09a4c69	[shardformer] support module saving and loading (#4062 ) * [shardformer] support module saving and loading * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	7740c55c55	support kit use for bert/gpt test (#4055 ) * support kit use for bert test * support kit test for gpt2	2023-07-04 16:05:01 +08:00
Frank Lee	f22ddacef0	[shardformer] refactored the shardformer layer structure (#4053 )	2023-07-04 16:05:01 +08:00
Frank Lee	58df720570	[shardformer] adapted T5 and LLaMa test to use kit (#4049 ) * [shardformer] adapted T5 and LLaMa test to use kit * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	4021b9a8a2	[shardformer] add gpt2 test and layer class refactor (#4041 ) * add gpt2 test and layer class refactor * add dropout in gpt2 policy	2023-07-04 16:05:01 +08:00
Frank Lee	d857f3dbba	[shardformer] supported T5 and its variants (#4045 )	2023-07-04 16:05:01 +08:00
Frank Lee	c1d5453e9f	[shardformer] adapted llama to the new API (#4036 )	2023-07-04 16:05:01 +08:00
FoolPlayer	74d176c8d8	[shardformer] fix bert and gpt downstream with new api (#4024 ) * fix bert downstream with new api * remove comment line	2023-07-04 16:05:01 +08:00
Frank Lee	e253a07007	[shardformer] updated doc (#4016 )	2023-07-04 16:05:01 +08:00
FoolPlayer	df018fc305	support bert with new api	2023-07-04 16:05:01 +08:00
FoolPlayer	507c0ad368	add vocabembedding layer	2023-07-04 16:05:01 +08:00
Frank Lee	45d9384346	[shardformer] removed inplace tensor sharding (#4018 )	2023-07-04 16:05:01 +08:00
Frank Lee	3893fa1a8d	[shardformer] refactored embedding and dropout to parallel module (#4013 ) * [shardformer] refactored embedding and dropout to parallel module * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	dfca9678fa	integrate with dist layer (#4011 )	2023-07-04 16:05:01 +08:00
Frank Lee	015af592f8	[shardformer] integrated linear 1D with dtensor (#3996 ) * [shardformer] integrated linear 1D with dtensor * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	d3bc530849	[shardformer] Refactor shardformer api (#4001 ) * fix an error in readme * simplify code * refactor shardformer * add todo * remove slicer * resolve code review	2023-07-04 16:05:01 +08:00
Frank Lee	611971248c	[device] support init device mesh from process group (#3990 )	2023-07-04 16:05:01 +08:00
FoolPlayer	a2f9af810d	[shardformer] fix an error in readme (#3988 ) * fix an error in readme * simplify code	2023-07-04 16:05:01 +08:00
FoolPlayer	f7774ec0f3	[Shardformer] Downstream bert (#3979 ) * add dist dropout in model * update docstring and bert policy with dropout * refactor basepolicy and sharded, update bert * update format * update gpt2 policy * update bert policy * remove unused code * update readme for new policy usage * add downstream model of bert * remove unused code	2023-07-04 16:05:01 +08:00
wukong1992	c1c672d0f0	[shardformer] shardformer support t5 model (#3994 ) test t5	2023-07-04 16:05:01 +08:00
wukong1992	6b30dfb7ce	[shardformer] support llama model using shardformer (#3969 ) adjust layer attr	2023-07-04 16:05:01 +08:00
FoolPlayer	45927d5527	[shardformer] Add dropout layer in shard model and refactor policy api (#3949 ) * add dist dropout in model * update docstring and bert policy with dropout * refactor basepolicy and sharded, update bert * update format * update gpt2 policy * update bert policy * remove unused code * update readme for new policy usage	2023-07-04 16:05:01 +08:00
FoolPlayer	a73130482d	[shardformer] Unit test (#3928 ) * fix bug in slicer, add slicer unit test * add dropout test * use pid as dropout seed * updata dropout test with local pattern * ad todo	2023-07-04 16:05:01 +08:00
FoolPlayer	f1cb5ac6bf	[shardformer] Align bert value (#3907 ) * add bert align test, fix dist loss bug * forward and backward align * add ignore index * add shardformer CI * add gather_output optional for user in shardconfig * update readme with optional gather_ouput * add dist crossentropy loss test, remove unused files * remove unused file * remove unused file * rename the file * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	79f8d5d54b	[shardformer] add gpt2 policy and modify shard and slicer to support (#3883 ) * add gpt2 policy and modify shard and slicer to support * remove unused code * polish code	2023-07-04 16:05:01 +08:00
FoolPlayer	70173e3123	update README (#3909 )	2023-07-04 16:05:01 +08:00
FoolPlayer	ab8a47f830	[shardformer] add Dropout layer support different dropout pattern (#3856 ) * add dropout layer, add dropout test * modify seed manager as context manager * add a copy of col_nn.layer * add dist_crossentropy loss; separate module test * polish the code * fix dist crossentropy loss	2023-07-04 16:05:01 +08:00
FoolPlayer	c594dc2f1c	[shardformer] update readme with modules implement doc (#3834 ) * update readme with modules content * remove img	2023-07-04 16:05:01 +08:00
Frank Lee	4972e1f40e	[shardformer] refactored the user api (#3828 ) * [shardformer] refactored the user api * polish code	2023-07-04 16:05:01 +08:00
Frank Lee	235792f170	[shardformer] updated readme (#3827 )	2023-07-04 16:05:01 +08:00
FoolPlayer	8cc11235c0	[shardformer]: Feature/shardformer, add some docstring and readme (#3816 ) * init shardformer code structure * add implement of sharder (inject and replace) * add implement of replace layer to colossal layer * separate different layer policy, add some notion * implement 1d and 2d slicer, can tell col or row * fix bug when slicing and inject model * fix some bug; add inference test example * add share weight and train example * add train * add docstring and readme * add docstring for other files * pre-commit	2023-07-04 16:05:01 +08:00
FoolPlayer	8d68de767d	[shardformer] init shardformer code structure (#3731 ) * init shardformer code structure * add implement of sharder (inject and replace) * add implement of replace layer to colossal layer * separate different layer policy, add some notion * implement 1d and 2d slicer, can tell col or row * fix bug when slicing and inject model * fix some bug; add inference test example	2023-07-04 16:05:01 +08:00
Baizhou Zhang	1350ece492	[hotfix] fix import bug in checkpoint_io (#4142 )	2023-07-03 22:14:37 +08:00
digger yu	8abc87798f	fix Tensor is not defined (#4129 )	2023-07-03 17:10:18 +08:00
digger yu	7e46bc87b6	fix CheckpointIndexFile is not defined (#4109 )	2023-07-03 17:09:06 +08:00
digger yu	09fe9dc704	[nfc]fix ColossalaiOptimizer is not defined (#4122 )	2023-06-30 17:23:22 +08:00
Frank Lee	95e95b6d58	[testing] move pytest to be inside the function (#4087 )	2023-06-27 11:02:25 +08:00
Baizhou Zhang	0bb0b481b4	[gemini] fix argument naming during chunk configuration searching	2023-06-25 13:34:15 +08:00
github-actions[bot]	a52f62082d	[format] applied code formatting on changed files in pull request 4021 (#4022 ) Co-authored-by: github-actions <github-actions@github.com>	2023-06-19 11:23:24 +08:00
Baizhou Zhang	822c3d4d66	[checkpointio] sharded optimizer checkpoint for DDP plugin (#4002 )	2023-06-16 14:14:05 +08:00
Wenhao Chen	725af3eeeb	[booster] make optimizer argument optional for boost (#3993 ) * feat: make optimizer optional in Booster.boost * test: skip unet test if diffusers version > 0.10.2	2023-06-15 17:38:42 +08:00
Baizhou Zhang	c9cff7e7fa	[checkpointio] General Checkpointing of Sharded Optimizers (#3984 )	2023-06-15 15:21:26 +08:00
Frank Lee	71fe52769c	[gemini] fixed the gemini checkpoint io (#3934 )	2023-06-12 15:11:27 +08:00
Frank Lee	ddcf58cacf	Revert "[sync] sync feature/shardformer with develop"	2023-06-09 09:41:27 +08:00
FoolPlayer	24651fdd4f	Merge pull request #3931 from FrankLeeeee/sync/develop-to-shardformer [sync] sync feature/shardformer with develop	2023-06-09 09:34:00 +08:00
FoolPlayer	ef1537759c	[shardformer] add gpt2 policy and modify shard and slicer to support (#3883 ) * add gpt2 policy and modify shard and slicer to support * remove unused code * polish code	2023-06-08 15:01:34 +08:00
FoolPlayer	6370a935f6	update README (#3909 )	2023-06-08 15:01:34 +08:00
FoolPlayer	21a3915c98	[shardformer] add Dropout layer support different dropout pattern (#3856 ) * add dropout layer, add dropout test * modify seed manager as context manager * add a copy of col_nn.layer * add dist_crossentropy loss; separate module test * polish the code * fix dist crossentropy loss	2023-06-08 15:01:34 +08:00
FoolPlayer	997544c1f9	[shardformer] update readme with modules implement doc (#3834 ) * update readme with modules content * remove img	2023-06-08 15:01:34 +08:00
Frank Lee	537a52b7a2	[shardformer] refactored the user api (#3828 ) * [shardformer] refactored the user api * polish code	2023-06-08 15:01:34 +08:00
Frank Lee	bc19024bf9	[shardformer] updated readme (#3827 )	2023-06-08 15:01:34 +08:00
FoolPlayer	58f6432416	[shardformer]: Feature/shardformer, add some docstring and readme (#3816 ) * init shardformer code structure * add implement of sharder (inject and replace) * add implement of replace layer to colossal layer * separate different layer policy, add some notion * implement 1d and 2d slicer, can tell col or row * fix bug when slicing and inject model * fix some bug; add inference test example * add share weight and train example * add train * add docstring and readme * add docstring for other files * pre-commit	2023-06-08 15:01:34 +08:00
FoolPlayer	6a69b44dfc	[shardformer] init shardformer code structure (#3731 ) * init shardformer code structure * add implement of sharder (inject and replace) * add implement of replace layer to colossal layer * separate different layer policy, add some notion * implement 1d and 2d slicer, can tell col or row * fix bug when slicing and inject model * fix some bug; add inference test example	2023-06-08 15:01:34 +08:00
Frank Lee	eb39154d40	[dtensor] updated api and doc (#3845 )	2023-06-08 10:18:17 +08:00
digger yu	de0d7df33f	[nfc] fix typo colossalai/zero (#3923 )	2023-06-08 00:01:29 +08:00
digger yu	a9d1cadc49	fix typo with colossalai/trainer utils zero (#3908 )	2023-06-07 16:08:37 +08:00
Frank Lee	d51e83d642	Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop [sync] sync feature/dtensor with develop	2023-06-07 11:50:43 +08:00
Hongxin Liu	9c88b6cbd1	[lazy] fix compatibility problem on torch 1.13 (#3911 )	2023-06-07 11:10:12 +08:00
digger yu	0e484e6201	[nfc]fix typo colossalai/pipeline tensor nn (#3899 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/ * fix typo colossalai/cli fx kernel * fix typo colossalai/nn * revert change warmuped * fix typo colossalai/pipeline tensor nn	2023-06-06 14:07:36 +08:00
Baizhou Zhang	c1535ccbba	[doc] fix docs about booster api usage (#3898 )	2023-06-06 13:36:11 +08:00
digger yu	1878749753	[nfc] fix typo colossalai/nn (#3887 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/ * fix typo colossalai/cli fx kernel * fix typo colossalai/nn * revert change warmuped	2023-06-05 16:04:27 +08:00
Hongxin Liu	ae02d4e4f7	[bf16] add bf16 support (#3882 ) * [bf16] add bf16 support for fused adam (#3844) * [bf16] fused adam kernel support bf16 * [test] update fused adam kernel test * [test] update fused adam test * [bf16] cpu adam and hybrid adam optimizers support bf16 (#3860) * [bf16] implement mixed precision mixin and add bf16 support for low level zero (#3869) * [bf16] add mixed precision mixin * [bf16] low level zero optim support bf16 * [text] update low level zero test * [text] fix low level zero grad acc test * [bf16] add bf16 support for gemini (#3872) * [bf16] gemini support bf16 * [test] update gemini bf16 test * [doc] update gemini docstring * [bf16] add bf16 support for plugins (#3877) * [bf16] add bf16 support for legacy zero (#3879) * [zero] init context support bf16 * [zero] legacy zero support bf16 * [test] add zero bf16 test * [doc] add bf16 related docstring for legacy zero	2023-06-05 15:58:31 +08:00
Liu Ziming	8065cc5fba	Modify torch version requirement to adapt torch 2.0 (#3896 )	2023-06-05 15:57:35 +08:00
Hongxin Liu	dbb32692d2	[lazy] refactor lazy init (#3891 ) * [lazy] remove old lazy init * [lazy] refactor lazy init folder structure * [lazy] fix lazy tensor deepcopy * [test] update lazy init test	2023-06-05 14:20:47 +08:00
digger yu	70c8cdecf4	[nfc] fix typo colossalai/cli fx kernel (#3847 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/ * fix typo colossalai/cli fx kernel	2023-06-02 15:02:45 +08:00
digger yu	e2d81eba0d	[nfc] fix typo colossalai/ applications/ (#3831 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc. * fix typo colossalai/auto_parallel autochunk fx/passes etc. * fix typo docs/ * change placememt_policy to placement_policy in docs/ and examples/ * fix typo colossalai/ applications/	2023-05-25 16:19:41 +08:00
wukong1992	3229f93e30	[booster] add warning for torch fsdp plugin doc (#3833 )	2023-05-25 14:00:02 +08:00
Hongxin Liu	7c9f2ed6dd	[dtensor] polish sharding spec docstring (#3838 ) * [dtensor] polish sharding spec docstring * [dtensor] polish sharding spec example docstring	2023-05-25 13:09:42 +08:00
digger yu	7f8203af69	fix typo colossalai/auto_parallel autochunk fx/passes etc. (#3808 )	2023-05-24 09:01:50 +08:00
wukong1992	6b305a99d6	[booster] torch fsdp fix ckpt (#3788 )	2023-05-23 16:58:45 +08:00
digger yu	9265f2d4d7	[NFC]fix typo colossalai/auto_parallel nn utils etc. (#3779 ) * fix typo colossalai/autochunk auto_parallel amp * fix typo colossalai/auto_parallel nn utils etc.	2023-05-23 15:28:20 +08:00
jiangmingyan	e871e342b3	[API] add docstrings and initialization to apex amp, naive amp (#3783 ) * [mixed_precison] add naive amp demo * [mixed_precison] add naive amp demo * [api] add docstrings and initialization to apex amp, naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] fix * [api] fix	2023-05-23 15:17:24 +08:00
Frank Lee	f5c425c898	fixed the example docstring for booster (#3795 )	2023-05-22 18:10:06 +08:00
Hongxin Liu	72688adb2f	[doc] add booster docstring and fix autodoc (#3789 ) * [doc] add docstr for booster methods * [doc] fix autodoc	2023-05-22 10:56:47 +08:00
Hongxin Liu	3c07a2846e	[plugin] a workaround for zero plugins' optimizer checkpoint (#3780 ) * [test] refactor torch ddp checkpoint test * [plugin] update low level zero optim checkpoint * [plugin] update gemini optim checkpoint	2023-05-19 19:42:31 +08:00
Hongxin Liu	60e6a154bc	[doc] add tutorial for booster checkpoint (#3785 ) * [doc] add checkpoint related docstr for booster * [doc] add en checkpoint doc * [doc] add zh checkpoint doc * [doc] add booster checkpoint doc in sidebar * [doc] add cuation about ckpt for plugins * [doc] add doctest placeholder * [doc] add doctest placeholder * [doc] add doctest placeholder	2023-05-19 18:05:08 +08:00
digger yu	32f81f14d4	[NFC] fix typo colossalai/amp auto_parallel autochunk (#3756 )	2023-05-19 13:50:00 +08:00
Hongxin Liu	5452df63c5	[plugin] torch ddp plugin supports sharded model checkpoint (#3775 ) * [plugin] torch ddp plugin add save sharded model * [test] fix torch ddp ckpt io test * [test] fix torch ddp ckpt io test * [test] fix low level zero plugin test * [test] fix low level zero plugin test * [test] add debug info * [test] add debug info * [test] add debug info * [test] add debug info * [test] add debug info * [test] fix low level zero plugin test * [test] fix low level zero plugin test * [test] remove debug info	2023-05-18 20:05:59 +08:00
jiangmingyan	2703a37ac9	[amp] Add naive amp demo (#3774 ) * [mixed_precison] add naive amp demo * [mixed_precison] add naive amp demo	2023-05-18 16:33:14 +08:00
digger yu	1baeb39c72	[NFC] fix typo with colossalai/auto_parallel/tensor_shard (#3742 ) * fix typo applications/ and colossalai/ date 5.11 * fix typo colossalai/	2023-05-17 11:13:23 +08:00
wukong1992	b37797ed3d	[booster] support torch fsdp plugin in booster (#3697 ) Co-authored-by: 纪少敏 <jishaomin@jishaomindeMBP.lan>	2023-05-15 12:14:38 +08:00
digger-yu	ad6460cf2c	[NFC] fix typo applications/ and colossalai/ (#3735 )	2023-05-15 11:46:25 +08:00
digger-yu	b7141c36dd	[CI] fix some spelling errors (#3707 ) * fix spelling error with examples/comminity/ * fix spelling error with tests/ * fix some spelling error with tests/ colossalai/ etc.	2023-05-10 17:12:03 +08:00
jiangmingyan	20068ba188	[booster] add tests for ddp and low level zero's checkpointio (#3715 ) * [booster] update tests for booster * [booster] update tests for booster * [booster] update tests for booster * [booster] update tests for booster * [booster] update tests for booster * [booster] update booster tutorials#3717, fix recursive check	2023-05-10 12:17:02 +08:00
Hongxin Liu	6552cbf8e1	[booster] fix no_sync method (#3709 ) * [booster] fix no_sync method * [booster] add test for ddp no_sync * [booster] fix merge * [booster] update unit test * [booster] update unit test * [booster] update unit test	2023-05-09 11:10:02 +08:00
Hongxin Liu	3bf09efe74	[booster] update prepare dataloader method for plugin (#3706 ) * [booster] add prepare dataloader method for plug * [booster] update examples and docstr	2023-05-08 15:44:03 +08:00
Hongxin Liu	f83ea813f5	[example] add train resnet/vit with booster example (#3694 ) * [example] add train vit with booster example * [example] update readme * [example] add train resnet with booster example * [example] enable ci * [example] enable ci * [example] add requirements * [hotfix] fix analyzer init * [example] update requirements	2023-05-08 10:42:30 +08:00
YH	2629f9717d	[tensor] Refactor handle_trans_spec in DistSpecManager	2023-05-06 17:55:37 +08:00
Hongxin Liu	d0915f54f4	[booster] refactor all dp fashion plugins (#3684 ) * [booster] add dp plugin base * [booster] inherit dp plugin base * [booster] refactor unit tests	2023-05-05 19:36:10 +08:00
jiangmingyan	307894f74d	[booster] gemini plugin support shard checkpoint (#3610 ) * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint --------- Co-authored-by: luchen <luchen@luchendeMBP.lan> Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>	2023-05-05 14:37:21 +08:00
YH	a22407cc02	[zero] Suggests a minor change to confusing variable names in the ZeRO optimizer. (#3173 ) * Fix confusing variable name in zero opt * Apply lint * Fix util func * Fix minor util func * Fix zero param optimizer name	2023-04-27 18:43:14 +08:00
Hongxin Liu	50793b35f4	[gemini] accelerate inference (#3641 ) * [gemini] support don't scatter after inference * [chat] update colossalai strategy * [chat] fix opt benchmark * [chat] update opt benchmark * [gemini] optimize inference * [test] add gemini inference test * [chat] fix unit test ci * [chat] fix ci * [chat] fix ci * [chat] skip checkpoint test	2023-04-26 16:32:40 +08:00
Hongxin Liu	4b3240cb59	[booster] add low level zero plugin (#3594 ) * [booster] add low level zero plugin * [booster] fix gemini plugin test * [booster] fix precision * [booster] add low level zero plugin test * [test] fix booster plugin test oom * [test] fix booster plugin test oom * [test] fix googlenet and inception output trans * [test] fix diffuser clip vision model * [test] fix torchaudio_wav2vec2_base * [test] fix low level zero plugin test	2023-04-26 14:37:25 +08:00
digger-yu	b9a8dff7e5	[doc] Fix typo under colossalai and doc(#3618 ) * Fixed several spelling errors under colossalai * Fix the spelling error in colossalai and docs directory * Cautious Changed the spelling error under the example folder * Update runtime_preparation_pass.py revert autograft to autograd * Update search_chunk.py utile to until * Update check_installation.py change misteach to mismatch in line 91 * Update 1D_tensor_parallel.md revert to perceptron * Update 2D_tensor_parallel.md revert to perceptron in line 73 * Update 2p5D_tensor_parallel.md revert to perceptron in line 71 * Update 3D_tensor_parallel.md revert to perceptron in line 80 * Update README.md revert to resnet in line 42 * Update reorder_graph.py revert to indice in line 7 * Update p2p.py revert to megatron in line 94 * Update initialize.py revert to torchrun in line 198 * Update routers.py change to detailed in line 63 * Update routers.py change to detailed in line 146 * Update README.md revert random number in line 402	2023-04-26 11:38:43 +08:00
Hongxin Liu	12eff9eb4c	[gemini] state dict supports fp16 (#3590 ) * [gemini] save state dict support fp16 * [gemini] save state dict shard support fp16 * [gemini] fix state dict * [gemini] fix state dict	2023-04-19 11:01:48 +08:00
Hongxin Liu	dac127d0ee	[fx] fix meta tensor registration (#3589 ) * [meta] fix torch 1.13.1 * [meta] fix torch 2.0.0 * [meta] fix torch 1.13.0 * [meta] polish code	2023-04-18 16:20:36 +08:00
Hongxin Liu	f313babd11	[gemini] support save state dict in shards (#3581 ) * [gemini] support state dict shard * [gemini] add test state dict shard * [gemini] polish docstr * [gemini] fix merge * [gemini] polish code	2023-04-17 17:11:09 +08:00
YH	d329c294ec	Add docstr for zero3 chunk search utils (#3572 )	2023-04-17 12:44:17 +08:00
Hongxin Liu	173dad0562	[misc] add verbose arg for zero and op builder (#3552 ) * [misc] add print verbose * [gemini] add print verbose * [zero] add print verbose for low level * [misc] add print verbose for op builder	2023-04-17 11:25:35 +08:00
Hongxin Liu	4341f5e8e6	[lazyinit] fix clone and deepcopy (#3553 )	2023-04-17 11:25:13 +08:00
Hongxin Liu	152239bbfa	[gemini] gemini supports lazy init (#3379 ) * [gemini] fix nvme optimizer init * [gemini] gemini supports lazy init * [gemini] add init example * [gemini] add fool model * [zero] update gemini ddp * [zero] update init example * add chunk method * add chunk method * [lazyinit] fix lazy tensor tolist * [gemini] fix buffer materialization * [misc] remove useless file * [booster] update gemini plugin * [test] update gemini plugin test * [test] fix gemini plugin test * [gemini] fix import * [gemini] fix import * [lazyinit] use new metatensor * [lazyinit] use new metatensor * [lazyinit] fix __set__ method	2023-04-12 16:03:25 +08:00
jiangmingyan	366a035552	[checkpoint] Shard saved checkpoint need to be compatible with the naming format of hf checkpoint files (#3479 ) * [checkpoint] support huggingface style sharded checkpoint, to be compatible with hf file naming format * [checkpoint] support huggingface style sharded checkpoint, to be compatible with hf file naming format * [checkpoint] Shard saved checkpoint add 'variant' field to customize filename * [checkpoint] Shard saved checkpoint add 'variant' field to customize filename * [checkpoint] Shard saved checkpoint add 'variant' field to customize filename * [checkpoint] Shard saved checkpoint add 'variant' field to customize filename --------- Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local> Co-authored-by: luchen <luchen@luchendeMBP.lan>	2023-04-12 16:02:17 +08:00
YH	bcf0cbcbe7	[doc] Add docs for clip args in zero optim (#3504 )	2023-04-10 11:11:28 +08:00
jiangmingyan	52a933e175	[checkpoint] support huggingface style sharded checkpoint (#3461 ) * [checkpoint] support huggingface style sharded checkpoint * [checkpoint] support huggingface style sharded checkpoint * [checkpoint] support huggingface style sharded checkpoint * [checkpoint] support huggingface style sharded checkpoint * [checkpoint] support huggingface style sharded checkpoint --------- Co-authored-by: luchen <luchen@luchendeMBP.lan>	2023-04-06 16:23:39 +08:00
Frank Lee	80eba05b0a	[test] refactor tests with spawn (#3452 ) * [test] added spawn decorator * polish code * polish code * polish code * polish code * polish code * polish code	2023-04-06 14:51:35 +08:00
Frank Lee	7d8d825681	[booster] fixed the torch ddp plugin with the new checkpoint api (#3442 )	2023-04-06 09:43:51 +08:00
YH	8f740deb53	Fix typo (#3448 )	2023-04-06 09:43:31 +08:00
Hakjin Lee	46c009dba4	[format] Run lint on colossalai.engine (#3367 )	2023-04-05 23:24:43 +08:00
YuliangLiu0306	ffcdbf0f65	[autoparallel]integrate auto parallel feature with new tracer (#3408 ) * [autoparallel] integrate new analyzer in module level * unify the profiling method * polish * fix no codegen bug * fix pass bug * fix liveness test * polish	2023-04-04 17:40:45 +08:00
ver217	573af84184	[example] update examples related to zero/gemini (#3431 ) * [zero] update legacy import * [zero] update examples * [example] fix opt tutorial * [example] fix opt tutorial * [example] fix opt tutorial * [example] fix opt tutorial * [example] fix import	2023-04-04 17:32:51 +08:00

... 3 4 5 6 7 ...

1797 Commits (785cd9a9c971aa58e6f8c76575111a4aa4d9513b)