ColossalAI

Commit Graph

Author	SHA1	Message	Date
flybird11111	7486ed7d3a	[shardformer] update llama2/opt finetune example and fix llama2 policy (#4645 ) * [shardformer] update shardformer readme [shardformer] update shardformer readme [shardformer] update shardformer readme * [shardformer] update llama2/opt finetune example and shardformer update to llama2 * [shardformer] update llama2/opt finetune example and shardformer update to llama2 * [shardformer] update llama2/opt finetune example and shardformer update to llama2 * [shardformer] change dataset * [shardformer] change dataset * [shardformer] fix CI * [shardformer] fix * [shardformer] fix * [shardformer] fix * [shardformer] fix * [shardformer] fix [example] update opt example [example] resolve comments fix fix	2023-09-09 22:45:36 +08:00
Baizhou Zhang	295b38fecf	[example] update vit example for hybrid parallel plugin (#4641 ) * update vit example for hybrid plugin * reset tp/pp size * fix dataloader iteration bug * update optimizer passing in evaluation/add grad_accum * change criterion * wrap tqdm * change grad_accum to grad_checkpoint * fix pbar	2023-09-07 17:38:45 +08:00
Baizhou Zhang	660eed9124	[pipeline] set optimizer to optional in execute_pipeline (#4630 ) * set optimizer to optional in execute_pipeline * arrange device and mixed precision in booster init * fix execute_pipeline in booster.py	2023-09-07 10:42:59 +08:00
eric8607242	c3d5fa3bac	[shardformer] Support customized policy for llamav2 based model with HybridParallelPlugin (#4624 ) * Enable policy assignment in HybridPlugin and enable llama policy for llamav2 * Remove Policy from Plugin * revert changes of plugin HybridParallelModule * revert changes in plugin * upgrade transformers * revert transformers version --------- Co-authored-by: flybird11111 <1829166702@qq.com>	2023-09-07 10:15:13 +08:00
Hongxin Liu	fae6c92ead	Merge branch 'main' into feature/shardformer	2023-09-05 21:54:08 +08:00
Hongxin Liu	ac178ca5c1	[legacy] move builder and registry to legacy (#4603 )	2023-09-05 21:53:10 +08:00
Hongxin Liu	8accecd55b	[legacy] move engine to legacy (#4560 ) * [legacy] move engine to legacy * [example] fix seq parallel example * [example] fix seq parallel example * [test] test gemini pluging hang * [test] test gemini pluging hang * [test] test gemini pluging hang * [test] test gemini pluging hang * [test] test gemini pluging hang * [example] update seq parallel requirements	2023-09-05 21:53:10 +08:00
Hongxin Liu	89fe027787	[legacy] move trainer to legacy (#4545 ) * [legacy] move trainer to legacy * [doc] update docs related to trainer * [test] ignore legacy test	2023-09-05 21:53:10 +08:00
Hongxin Liu	807e01a4ba	[zero] hotfix master param sync (#4618 ) * [zero] add method to update master params * [zero] update zero plugin * [plugin] update low level zero plugin	2023-09-05 15:04:02 +08:00
flybird11111	ec0866804c	[shardformer] update shardformer readme (#4617 ) [shardformer] update shardformer readme [shardformer] update shardformer readme	2023-09-05 13:14:41 +08:00
Bin Jia	86d22581e4	[shardformer] Add overlap optional for HybridParallelPlugin (#4615 ) * add optional overlap for plugin * remove fixed todo	2023-09-05 11:52:23 +08:00
Hongxin Liu	a39a5c66fe	Merge branch 'main' into feature/shardformer	2023-09-04 23:43:13 +08:00
Baizhou Zhang	e79b1e80e2	[checkpointio] support huggingface from_pretrained for all plugins (#4606 )	2023-09-04 23:25:01 +08:00
flybird11111	0a94fcd351	[shardformer] update bert finetune example with HybridParallelPlugin (#4584 ) * [shardformer] fix opt test hanging * fix * test * test * test * fix test * fix test * remove print * add fix * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] fix epoch change * [shardformer] broadcast add pp group * [shardformer] fix opt test hanging * fix * test * test * [shardformer] zero1+pp and the corresponding tests (#4517) * pause * finish pp+zero1 * Update test_shard_vit.py * [shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516) * fix overlap bug and support bert, add overlap as an option in shardconfig * support overlap for chatglm and bloom * [shardformer] fix emerged bugs after updating transformers (#4526) * test * fix test * fix test * remove print * add fix * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] Add overlap support for gpt2 (#4535) * add overlap support for gpt2 * remove unused code * remove unused code * [shardformer] support pp+tp+zero1 tests (#4531) * [shardformer] fix opt test hanging * fix * test * test * test * fix test * fix test * remove print * add fix * [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] fix submodule replacement bug when enabling pp (#4544) * [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540) * implement sharded optimizer saving * add more param info * finish implementation of sharded optimizer saving * fix bugs in optimizer sharded saving * add pp+zero test * param group loading * greedy loading of optimizer * fix bug when loading * implement optimizer sharded saving * add optimizer test & arrange checkpointIO utils * fix gemini sharding state_dict * add verbose option * add loading of master params * fix typehint * fix master/working mapping in fp16 amp * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] fix epoch change * [shardformer] broadcast add pp group * rebase feature/shardformer * update pipeline * [shardformer] fix * [shardformer] fix * [shardformer] bert finetune fix * [shardformer] add all_reduce operation to loss add all_reduce operation to loss * [shardformer] make compatible with pytree. make compatible with pytree. * [shardformer] disable tp disable tp * [shardformer] add 3d plugin to ci test * [shardformer] update num_microbatches to None * [shardformer] update microbatchsize * [shardformer] update assert * update scheduler * update scheduler --------- Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com> Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com> Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>	2023-09-04 21:46:29 +08:00
Jianghai	24c0768795	[shardformer] Pytree fix (#4533 ) * pytree test * test bert * test bert * test bert * revise * add register * add register	2023-09-04 17:52:23 +08:00
Hongxin Liu	63ecafb1fb	[checkpointio] optimize zero optim checkpoint io (#4591 ) * [zero] update checkpoint io to save memory * [checkpointio] add device map to save memory	2023-09-04 11:26:45 +08:00
Hongxin Liu	508ca36fe3	[pipeline] 1f1b schedule receive microbatch size (#4589 )	2023-09-01 21:45:14 +08:00
LuGY	cbac782254	[zero]fix zero ckptIO with offload (#4529 ) * fix zero ckptio with offload * fix load device * saved tensors in ckpt should be on CPU * fix unit test * fix unit test * add clear cache * save memory for CI	2023-09-01 17:41:19 +08:00
Baizhou Zhang	38ccb8b1a3	[shardformer] support from_pretrained when loading model with HybridParallelPlugin (#4575 ) * hybrid plugin support huggingface from_pretrained * add huggingface compatibility tests * add folder cleaning * fix bugs	2023-09-01 17:40:01 +08:00
Baizhou Zhang	c9625dbb63	[shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540 ) * implement sharded optimizer saving * add more param info * finish implementation of sharded optimizer saving * fix bugs in optimizer sharded saving * add pp+zero test * param group loading * greedy loading of optimizer * fix bug when loading * implement optimizer sharded saving * add optimizer test & arrange checkpointIO utils * fix gemini sharding state_dict * add verbose option * add loading of master params * fix typehint * fix master/working mapping in fp16 amp	2023-08-31 14:50:47 +08:00
Baizhou Zhang	2c787d7f47	[shardformer] fix submodule replacement bug when enabling pp (#4544 )	2023-08-31 09:57:18 +08:00
flybird11111	ec18fc7340	[shardformer] support pp+tp+zero1 tests (#4531 ) * [shardformer] fix opt test hanging * fix * test * test * test * fix test * fix test * remove print * add fix * [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1	2023-08-30 21:29:18 +08:00
Lufang Chen	12c95a9fed	fix runtime prepare pass (#4502 ) Co-authored-by: lufang.chen <lufang.chen@nio.com>	2023-08-30 17:29:38 +08:00
flybird11111	d367b88785	[shardformer] fix opt test hanging (#4521 ) * [shardformer] fix opt test hanging * fix * test * test * test * fix test * fix test * remove print * add fix	2023-08-30 14:50:34 +08:00
Bin Jia	e241b74f24	[shardformer] Add overlap support for gpt2 (#4535 ) * add overlap support for gpt2 * remove unused code * remove unused code	2023-08-29 18:30:50 +08:00
Baizhou Zhang	0387a47e63	[shardformer] fix emerged bugs after updating transformers (#4526 )	2023-08-29 11:25:05 +08:00
Hongxin Liu	0b00def881	[example] add llama2 example (#4527 ) * [example] transfer llama-1 example * [example] fit llama-2 * [example] refactor scripts folder * [example] fit new gemini plugin * [cli] fix multinode runner * [example] fit gemini optim checkpoint * [example] refactor scripts * [example] update requirements * [example] update requirements * [example] rename llama to llama2 * [example] update readme and pretrain script * [example] refactor scripts	2023-08-28 17:59:11 +08:00
Bin Jia	c554b7f559	[shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516 ) * fix overlap bug and support bert, add overlap as an option in shardconfig * support overlap for chatglm and bloom	2023-08-28 17:16:40 +08:00
Jianghai	376533a564	[shardformer] zero1+pp and the corresponding tests (#4517 ) * pause * finish pp+zero1 * Update test_shard_vit.py	2023-08-28 10:51:16 +08:00
Baizhou Zhang	44eab2b27f	[shardformer] support sharded checkpoint IO for models of HybridParallelPlugin (#4506 ) * add APIs * implement save_sharded_model * add test for hybrid checkpointio * implement naive loading for sharded model * implement efficient sharded model loading * open a new file for hybrid checkpoint_io * small fix * fix circular importing * fix docstring * arrange arguments and apis * small fix	2023-08-25 22:04:57 +08:00
flybird11111	de8a65babc	[shardformer] opt fix. (#4514 ) * [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel * fix fix fix fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * activate checks * [Test] test ci * test ci * test ci * test ci * test ci * test ci * test ci * fix	2023-08-25 19:41:24 +08:00
LuGY	839847b7d7	[zero]support zero2 with gradient accumulation (#4511 ) * support gradient accumulation with zero2 * fix type	2023-08-25 13:44:07 +08:00
flybird11111	3353e55c80	[shardformer] vit/llama/t5 ignore the sequence parallelism flag and some fix. (#4498 ) * [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel * fix fix fix fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * [shardformer] jit fused fix * activate checks	2023-08-24 15:50:02 +08:00
Hongxin Liu	27061426f7	[gemini] improve compatibility and add static placement policy (#4479 ) * [gemini] remove distributed-related part from colotensor (#4379) * [gemini] remove process group dependency * [gemini] remove tp part from colo tensor * [gemini] patch inplace op * [gemini] fix param op hook and update tests * [test] remove useless tests * [test] remove useless tests * [misc] fix requirements * [test] fix model zoo * [test] fix model zoo * [test] fix model zoo * [test] fix model zoo * [test] fix model zoo * [misc] update requirements * [gemini] refactor gemini optimizer and gemini ddp (#4398) * [gemini] update optimizer interface * [gemini] renaming gemini optimizer * [gemini] refactor gemini ddp class * [example] update gemini related example * [example] update gemini related example * [plugin] fix gemini plugin args * [test] update gemini ckpt tests * [gemini] fix checkpoint io * [example] fix opt example requirements * [example] fix opt example * [example] fix opt example * [example] fix opt example * [gemini] add static placement policy (#4443) * [gemini] add static placement policy * [gemini] fix param offload * [test] update gemini tests * [plugin] update gemini plugin * [plugin] update gemini plugin docstr * [misc] fix flash attn requirement * [test] fix gemini checkpoint io test * [example] update resnet example result (#4457) * [example] update bert example result (#4458) * [doc] update gemini doc (#4468) * [example] update gemini related examples (#4473) * [example] update gpt example * [example] update dreambooth example * [example] update vit * [example] update opt * [example] update palm * [example] update vit and opt benchmark * [hotfix] fix bert in model zoo (#4480) * [hotfix] fix bert in model zoo * [test] remove chatglm gemini test * [test] remove sam gemini test * [test] remove vit gemini test * [hotfix] fix opt tutorial example (#4497) * [hotfix] fix opt tutorial example * [hotfix] fix opt tutorial example	2023-08-24 09:29:25 +08:00
flybird11111	59e252ecdb	[shardformer] chatglm support sequence parallel (#4482 ) * [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel [shardformer] chatglm support sequence parallel * fix fix fix fix	2023-08-22 23:59:31 +08:00
Bin Jia	351351a36e	[shardformer/sequence parallel] not support opt of seq-parallel, add warning and fix a bug in gpt2 pp (#4488 )	2023-08-22 17:35:35 +08:00
Jianghai	5545114fd8	rename chatglm to chatglm2 (#4484 )	2023-08-22 14:13:31 +08:00
Baizhou Zhang	1c7df566e2	[shardformer] support tp+zero for shardformer (#4472 ) * support tp+zero/input type cast for hybridplugin * add tp+zero tests * fix bucket arguments	2023-08-21 12:04:52 +08:00
Jianghai	8739aa7fa0	[shardformer] Pipeline/whisper (#4456 ) * add some base tests and policies * finish whisper base model * add conditional generation * finish basic tests * whisper * finish whisper * finish whisper * del useless whisper test * fix * add argmin to replace * finish revision	2023-08-18 21:29:25 +08:00
flybird11111	a27e0bb494	[shardformer] bert support sequence parallel. (#4455 ) * [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel * [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel [shardformer] bert support sequence parallel * [shardformer] bert support sequence parallel	2023-08-18 18:04:55 +08:00
flybird11111	0ecd71e041	[shardformer] bloom support sequence parallel (#4465 ) [shardformer] bloom support sequence parallel	2023-08-18 15:34:18 +08:00
Bin Jia	7c8be77081	[shardformer/sequence parallel] support gpt2 seq parallel with pp/dp/tp (#4460 ) * support gpt2 seq parallel with pp/dp/tp * fix a bug when waiting for stream done * delete unused gpt2_seq file	2023-08-18 11:21:53 +08:00
LuGY	a78daf6180	[shardformer] support interleaved pipeline (#4448 ) * support interleaved pipeline * fix unit test * remove virtual stage test in stage mgr * add droped type hint and updated bwd	2023-08-16 19:29:03 +08:00
Baizhou Zhang	6ef33f75aa	[shardformer] support DDP in HybridPlugin/add tp+dp tests (#4446 ) * support DDP for HybridPlugin/add tp+dp tests * add docstring for HybridParallelPlugin	2023-08-16 16:11:57 +08:00
Bin Jia	424629fea0	[shardformer/sequence parallel] Cherry pick commit to new branch (#4450 ) * [shardformer/sequence parallel] Support sequence parallel for gpt2 (#4384) * [sequence parallel] add sequence parallel linear col/row support (#4336) * add sequence parallel linear col/row support * add annotation * add annotation * add support for gpt2 fused qkv linear layer * support sequence parallel in GPT2 * add docstring and note * add requirments * remove unused flash-attb * modify flash attn test * modify flash attn setting * modify flash attn code * add assert before divide, rename forward function * [shardformer/test] fix gpt2 test with seq-parallel * [shardformer/sequence parallel] Overlap input gather and grad computation during col backward (#4401) * overlap gather input / grad computing during col backward * modify test for overlap * simplify code * fix code and modify cuda stream synchronize * [shardformer/sequence parallel] polish code	2023-08-16 15:41:20 +08:00
github-actions[bot]	d20dceb9a3	[format] applied code formatting on changed files in pull request 4441 (#4445 ) Co-authored-by: github-actions <github-actions@github.com>	2023-08-16 10:47:23 +08:00
ver217	5d4efdf58f	[shardformer] fix import	2023-08-15 23:25:14 +08:00
ver217	73a4144b91	[shardformer] fix embedding	2023-08-15 23:25:14 +08:00
Hongxin Liu	172f7fa3cf	[misc] resolve code factor issues (#4433 )	2023-08-15 23:25:14 +08:00
flybird11111	108e54a0b4	[shardformer]update t5 tests for using all optimizations. (#4407 ) * [shardformer] gpt2 tests fix [shardformer] test all optimizations (#4399) [shardformer] test all optimizations [shardformer] test all optimizations [shardformer] test all optimizations [shardformer] gpt2 tests fix * [shardformer]update t5 to use all optimizations	2023-08-15 23:25:14 +08:00

1607 Commits (536397cc951cea648ded9b1052dfac1d4cc3f91c)