ColossalAI

Commit Graph

Author	SHA1	Message	Date
Hongxin Liu	807e01a4ba	[zero] hotfix master param sync (#4618 ) * [zero] add method to update master params * [zero] update zero plugin * [plugin] update low level zero plugin	1 year ago
Bin Jia	86d22581e4	[shardformer] Add overlap optional for HybridParallelPlugin (#4615 ) * add optional overlap for plugin * remove fixed todo	1 year ago
Hongxin Liu	a39a5c66fe	Merge branch 'main' into feature/shardformer	1 year ago
Baizhou Zhang	e79b1e80e2	[checkpointio] support huggingface from_pretrained for all plugins (#4606 )	1 year ago
flybird11111	0a94fcd351	[shardformer] update bert finetune example with HybridParallelPlugin (#4584 ) * [shardformer] fix opt test hanging * fix * test * test * test * fix test * fix test * remove print * add fix * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] fix epoch change * [shardformer] broadcast add pp group * [shardformer] fix opt test hanging * fix * test * test * [shardformer] zero1+pp and the corresponding tests (#4517) * pause * finish pp+zero1 * Update test_shard_vit.py * [shardformer/fix overlap bug] fix overlap bug, add overlap as an option in shardco… (#4516) * fix overlap bug and support bert, add overlap as an option in shardconfig * support overlap for chatglm and bloom * [shardformer] fix emerged bugs after updating transformers (#4526) * test * fix test * fix test * remove print * add fix * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] Add overlap support for gpt2 (#4535) * add overlap support for gpt2 * remove unused code * remove unused code * [shardformer] support pp+tp+zero1 tests (#4531) * [shardformer] fix opt test hanging * fix * test * test * test * fix test * fix test * remove print * add fix * [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] pp+tp+zero1 * [shardformer] fix submodule replacement bug when enabling pp (#4544) * [shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540) * implement sharded optimizer saving * add more param info * finish implementation of sharded optimizer saving * fix bugs in optimizer sharded saving * add pp+zero test * param group loading * greedy loading of optimizer * fix bug when loading * implement optimizer sharded saving * add optimizer test & arrange checkpointIO utils * fix gemini sharding state_dict * add verbose option * add loading of master params * fix typehint * fix master/working mapping in fp16 amp * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] add bert finetune example * [shardformer] fix epoch change * [shardformer] broadcast add pp group * rebase feature/shardformer * update pipeline * [shardformer] fix * [shardformer] fix * [shardformer] bert finetune fix * [shardformer] add all_reduce operation to loss add all_reduce operation to loss * [shardformer] make compatible with pytree. make compatible with pytree. * [shardformer] disable tp disable tp * [shardformer] add 3d plugin to ci test * [shardformer] update num_microbatches to None * [shardformer] update microbatchsize * [shardformer] update assert * update scheduler * update scheduler --------- Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com> Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com> Co-authored-by: Baizhou Zhang <eddiezhang@pku.edu.cn>	1 year ago
Hongxin Liu	63ecafb1fb	[checkpointio] optimize zero optim checkpoint io (#4591 ) * [zero] update checkpoint io to save memory * [checkpointio] add device map to save memory	1 year ago
Hongxin Liu	508ca36fe3	[pipeline] 1f1b schedule receive microbatch size (#4589 )	1 year ago
Baizhou Zhang	38ccb8b1a3	[shardformer] support from_pretrained when loading model with HybridParallelPlugin (#4575 ) * hybrid plugin support huggingface from_pretrained * add huggingface compatibility tests * add folder cleaning * fix bugs	1 year ago
Baizhou Zhang	c9625dbb63	[shardformer] support sharded optimizer checkpointIO of HybridParallelPlugin (#4540 ) * implement sharded optimizer saving * add more param info * finish implementation of sharded optimizer saving * fix bugs in optimizer sharded saving * add pp+zero test * param group loading * greedy loading of optimizer * fix bug when loading * implement optimizer sharded saving * add optimizer test & arrange checkpointIO utils * fix gemini sharding state_dict * add verbose option * add loading of master params * fix typehint * fix master/working mapping in fp16 amp	1 year ago
Baizhou Zhang	44eab2b27f	[shardformer] support sharded checkpoint IO for models of HybridParallelPlugin (#4506 ) * add APIs * implement save_sharded_model * add test for hybrid checkpointio * implement naive loading for sharded model * implement efficient sharded model loading * open a new file for hybrid checkpoint_io * small fix * fix circular importing * fix docstring * arrange arguments and apis * small fix	1 year ago
Hongxin Liu	27061426f7	[gemini] improve compatibility and add static placement policy (#4479 ) * [gemini] remove distributed-related part from colotensor (#4379) * [gemini] remove process group dependency * [gemini] remove tp part from colo tensor * [gemini] patch inplace op * [gemini] fix param op hook and update tests * [test] remove useless tests * [test] remove useless tests * [misc] fix requirements * [test] fix model zoo * [test] fix model zoo * [test] fix model zoo * [test] fix model zoo * [test] fix model zoo * [misc] update requirements * [gemini] refactor gemini optimizer and gemini ddp (#4398) * [gemini] update optimizer interface * [gemini] renaming gemini optimizer * [gemini] refactor gemini ddp class * [example] update gemini related example * [example] update gemini related example * [plugin] fix gemini plugin args * [test] update gemini ckpt tests * [gemini] fix checkpoint io * [example] fix opt example requirements * [example] fix opt example * [example] fix opt example * [example] fix opt example * [gemini] add static placement policy (#4443) * [gemini] add static placement policy * [gemini] fix param offload * [test] update gemini tests * [plugin] update gemini plugin * [plugin] update gemini plugin docstr * [misc] fix flash attn requirement * [test] fix gemini checkpoint io test * [example] update resnet example result (#4457) * [example] update bert example result (#4458) * [doc] update gemini doc (#4468) * [example] update gemini related examples (#4473) * [example] update gpt example * [example] update dreambooth example * [example] update vit * [example] update opt * [example] update palm * [example] update vit and opt benchmark * [hotfix] fix bert in model zoo (#4480) * [hotfix] fix bert in model zoo * [test] remove chatglm gemini test * [test] remove sam gemini test * [test] remove vit gemini test * [hotfix] fix opt tutorial example (#4497) * [hotfix] fix opt tutorial example * [hotfix] fix opt tutorial example	1 year ago
Baizhou Zhang	1c7df566e2	[shardformer] support tp+zero for shardformer (#4472 ) * support tp+zero/input type cast for hybridplugin * add tp+zero tests * fix bucket arguments	1 year ago
Bin Jia	7c8be77081	[shardformer/sequence parallel] support gpt2 seq parallel with pp/dp/tp (#4460 ) * support gpt2 seq parallel with pp/dp/tp * fix a bug when waiting for stream done * delete unused gpt2_seq file	1 year ago
Baizhou Zhang	6ef33f75aa	[shardformer] support DDP in HybridPlugin/add tp+dp tests (#4446 ) * support DDP for HybridPlugin/add tp+dp tests * add docstring for HybridParallelPlugin	1 year ago
Bin Jia	424629fea0	[shardformer/sequence parallel] Cherry pick commit to new branch (#4450 ) * [shardformer/sequence parallel] Support sequence parallel for gpt2 (#4384) * [sequence parallel] add sequence parallel linear col/row support (#4336) * add sequence parallel linear col/row support * add annotation * add annotation * add support for gpt2 fused qkv linear layer * support sequence parallel in GPT2 * add docstring and note * add requirments * remove unused flash-attb * modify flash attn test * modify flash attn setting * modify flash attn code * add assert before divide, rename forward function * [shardformer/test] fix gpt2 test with seq-parallel * [shardformer/sequence parallel] Overlap input gather and grad computation during col backward (#4401) * overlap gather input / grad computing during col backward * modify test for overlap * simplify code * fix code and modify cuda stream synchronize * [shardformer/sequence parallel] polish code	1 year ago
Hongxin Liu	172f7fa3cf	[misc] resolve code factor issues (#4433 )	1 year ago
flybird1111	d2cd48e0be	[shardformer] test all optimizations (#4399 ) [shardformer] test all optimizations [shardformer] test all optimizations [shardformer] test all optimizations	1 year ago
Baizhou Zhang	ed4c448488	[pipeline] rewrite t5 tests & support multi-tensor transmitting in pipeline (#4388 ) * fix remaining t5 bugs/rewrite t5 tests * fix multi-tensor communication in pipeline * rearrange test_config * fix keyerror in sync_shared_params * fix get_held_layers & Randomnizer, complete t5 tests * erase printing * fix get_held_layers through modifying _release_unheld_layers * fix _get_recursive_held_layers bug	1 year ago
Baizhou Zhang	b1feeced8e	[shardformer] add util functions for shardformer tests/fix sync_shared_param (#4366 ) * add util functions for shardformer tests & rewrite gpt2 test * fix shared_params & embedding/merging * fix precision	1 year ago
Baizhou Zhang	0ceec8f9a9	[pipeline] support fp32 for HybridPlugin/merge shardformer test and pipeline test into one file (#4354 ) * add naive optimizer for 3DPlugin/refactor gpt2 shardformer test * merge tests of PP/DP/TP combinations into one test file * fix bug when sync grad for dp in HybridPlugin * update supported precisions for 3DPlugin/fix bug when shifting tp_degree * improve the passing of lazy_init * modify lazy_init/use sync_shared_params	1 year ago
Hongxin Liu	261eab02fb	[plugin] add 3d parallel plugin (#4295 ) * [amp] add mixed precision optimizer * [plugin] add 3d parallel plugin * [booster] support pipeline * [plugin] 3d parallel plugin support clip grad norm * [shardformer] fix sharder and add plugin test * [plugin] rename 3d parallel plugin * [ci] support testmon core pkg change detection (#4305) * [hotfix] debug testmon * [hotfix] fix llama * [hotfix] fix p2p bugs * [hotfix] fix requirements	1 year ago
LuGY	1a49a5ea00	[zero] support shard optimizer state dict of zero (#4194 ) * support shard optimizer of zero * polish code * support sync grad manually	1 year ago
LuGY	79cf1b5f33	[zero]support no_sync method for zero1 plugin (#4138 ) * support no sync for zero1 plugin * polish * polish	1 year ago
梁爽	abe4f971e0	[NFC] polish colossalai/booster/plugin/low_level_zero_plugin.py code style (#4256 ) Co-authored-by: supercooledith <893754954@qq.com>	1 year ago
Jianghai	b366f1d99f	[NFC] Fix format for mixed precision (#4253 ) * [NFC] polish colossalai/booster/mixed_precision/mixed_precision_base.py code style	1 year ago
Baizhou Zhang	c6f6005990	[checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin (#4302 ) * sharded optimizer checkpoint for gemini plugin * modify test to reduce testing time * update doc * fix bug when keep_gatherd is true under GeminiPlugin	1 year ago
Baizhou Zhang	58913441a1	Next commit [checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin (#4141 ) * [checkpointio] unsharded optimizer checkpoint for Gemini plugin * [checkpointio] unsharded optimizer checkpoint for Gemini using all_gather	1 year ago
Baizhou Zhang	0bb0b481b4	[gemini] fix argument naming during chunk configuration searching	1 year ago
Baizhou Zhang	822c3d4d66	[checkpointio] sharded optimizer checkpoint for DDP plugin (#4002 )	1 year ago
Wenhao Chen	725af3eeeb	[booster] make optimizer argument optional for boost (#3993 ) * feat: make optimizer optional in Booster.boost * test: skip unet test if diffusers version > 0.10.2	1 year ago
Baizhou Zhang	c9cff7e7fa	[checkpointio] General Checkpointing of Sharded Optimizers (#3984 )	1 year ago
Frank Lee	bd1ab98158	[gemini] fixed the gemini checkpoint io (#3934 )	1 year ago
Baizhou Zhang	c1535ccbba	[doc] fix docs about booster api usage (#3898 )	1 year ago
Hongxin Liu	ae02d4e4f7	[bf16] add bf16 support (#3882 ) * [bf16] add bf16 support for fused adam (#3844) * [bf16] fused adam kernel support bf16 * [test] update fused adam kernel test * [test] update fused adam test * [bf16] cpu adam and hybrid adam optimizers support bf16 (#3860) * [bf16] implement mixed precision mixin and add bf16 support for low level zero (#3869) * [bf16] add mixed precision mixin * [bf16] low level zero optim support bf16 * [text] update low level zero test * [text] fix low level zero grad acc test * [bf16] add bf16 support for gemini (#3872) * [bf16] gemini support bf16 * [test] update gemini bf16 test * [doc] update gemini docstring * [bf16] add bf16 support for plugins (#3877) * [bf16] add bf16 support for legacy zero (#3879) * [zero] init context support bf16 * [zero] legacy zero support bf16 * [test] add zero bf16 test * [doc] add bf16 related docstring for legacy zero	1 year ago
wukong1992	3229f93e30	[booster] add warning for torch fsdp plugin doc (#3833 )	2 years ago
digger yu	7f8203af69	fix typo colossalai/auto_parallel autochunk fx/passes etc. (#3808 )	2 years ago
wukong1992	6b305a99d6	[booster] torch fsdp fix ckpt (#3788 )	2 years ago
jiangmingyan	e871e342b3	[API] add docstrings and initialization to apex amp, naive amp (#3783 ) * [mixed_precison] add naive amp demo * [mixed_precison] add naive amp demo * [api] add docstrings and initialization to apex amp, naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] add docstring to apex amp/ naive amp * [api] fix * [api] fix	2 years ago
Frank Lee	f5c425c898	fixed the example docstring for booster (#3795 )	2 years ago
Hongxin Liu	72688adb2f	[doc] add booster docstring and fix autodoc (#3789 ) * [doc] add docstr for booster methods * [doc] fix autodoc	2 years ago
Hongxin Liu	3c07a2846e	[plugin] a workaround for zero plugins' optimizer checkpoint (#3780 ) * [test] refactor torch ddp checkpoint test * [plugin] update low level zero optim checkpoint * [plugin] update gemini optim checkpoint	2 years ago
Hongxin Liu	60e6a154bc	[doc] add tutorial for booster checkpoint (#3785 ) * [doc] add checkpoint related docstr for booster * [doc] add en checkpoint doc * [doc] add zh checkpoint doc * [doc] add booster checkpoint doc in sidebar * [doc] add cuation about ckpt for plugins * [doc] add doctest placeholder * [doc] add doctest placeholder * [doc] add doctest placeholder	2 years ago
Hongxin Liu	5452df63c5	[plugin] torch ddp plugin supports sharded model checkpoint (#3775 ) * [plugin] torch ddp plugin add save sharded model * [test] fix torch ddp ckpt io test * [test] fix torch ddp ckpt io test * [test] fix low level zero plugin test * [test] fix low level zero plugin test * [test] add debug info * [test] add debug info * [test] add debug info * [test] add debug info * [test] add debug info * [test] fix low level zero plugin test * [test] fix low level zero plugin test * [test] remove debug info	2 years ago
jiangmingyan	2703a37ac9	[amp] Add naive amp demo (#3774 ) * [mixed_precison] add naive amp demo * [mixed_precison] add naive amp demo	2 years ago
wukong1992	b37797ed3d	[booster] support torch fsdp plugin in booster (#3697 ) Co-authored-by: 纪少敏 <jishaomin@jishaomindeMBP.lan>	2 years ago
Hongxin Liu	6552cbf8e1	[booster] fix no_sync method (#3709 ) * [booster] fix no_sync method * [booster] add test for ddp no_sync * [booster] fix merge * [booster] update unit test * [booster] update unit test * [booster] update unit test	2 years ago
Hongxin Liu	3bf09efe74	[booster] update prepare dataloader method for plugin (#3706 ) * [booster] add prepare dataloader method for plug * [booster] update examples and docstr	2 years ago
Hongxin Liu	d0915f54f4	[booster] refactor all dp fashion plugins (#3684 ) * [booster] add dp plugin base * [booster] inherit dp plugin base * [booster] refactor unit tests	2 years ago
jiangmingyan	307894f74d	[booster] gemini plugin support shard checkpoint (#3610 ) * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin add shard checkpoint save/load * gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint * [API Refactoring]gemini plugin support shard checkpoint --------- Co-authored-by: luchen <luchen@luchendeMBP.lan> Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>	2 years ago
Hongxin Liu	4b3240cb59	[booster] add low level zero plugin (#3594 ) * [booster] add low level zero plugin * [booster] fix gemini plugin test * [booster] fix precision * [booster] add low level zero plugin test * [test] fix booster plugin test oom * [test] fix booster plugin test oom * [test] fix googlenet and inception output trans * [test] fix diffuser clip vision model * [test] fix torchaudio_wav2vec2_base * [test] fix low level zero plugin test	2 years ago
digger-yu	b9a8dff7e5	[doc] Fix typo under colossalai and doc(#3618 ) * Fixed several spelling errors under colossalai * Fix the spelling error in colossalai and docs directory * Cautious Changed the spelling error under the example folder * Update runtime_preparation_pass.py revert autograft to autograd * Update search_chunk.py utile to until * Update check_installation.py change misteach to mismatch in line 91 * Update 1D_tensor_parallel.md revert to perceptron * Update 2D_tensor_parallel.md revert to perceptron in line 73 * Update 2p5D_tensor_parallel.md revert to perceptron in line 71 * Update 3D_tensor_parallel.md revert to perceptron in line 80 * Update README.md revert to resnet in line 42 * Update reorder_graph.py revert to indice in line 7 * Update p2p.py revert to megatron in line 94 * Update initialize.py revert to torchrun in line 198 * Update routers.py change to detailed in line 63 * Update routers.py change to detailed in line 146 * Update README.md revert random number in line 402	2 years ago
Hongxin Liu	173dad0562	[misc] add verbose arg for zero and op builder (#3552 ) * [misc] add print verbose * [gemini] add print verbose * [zero] add print verbose for low level * [misc] add print verbose for op builder	2 years ago
Hongxin Liu	152239bbfa	[gemini] gemini supports lazy init (#3379 ) * [gemini] fix nvme optimizer init * [gemini] gemini supports lazy init * [gemini] add init example * [gemini] add fool model * [zero] update gemini ddp * [zero] update init example * add chunk method * add chunk method * [lazyinit] fix lazy tensor tolist * [gemini] fix buffer materialization * [misc] remove useless file * [booster] update gemini plugin * [test] update gemini plugin test * [test] fix gemini plugin test * [gemini] fix import * [gemini] fix import * [lazyinit] use new metatensor * [lazyinit] use new metatensor * [lazyinit] fix __set__ method	2 years ago
Frank Lee	7d8d825681	[booster] fixed the torch ddp plugin with the new checkpoint api (#3442 )	2 years ago
Frank Lee	1beb85cc25	[checkpoint] refactored the API and added safetensors support (#3427 ) * [checkpoint] refactored the API and added safetensors support * polish code	2 years ago
ver217	26b7aac0be	[zero] reorganize zero/gemini folder structure (#3424 ) * [zero] refactor low-level zero folder structure * [zero] fix legacy zero import path * [zero] fix legacy zero import path * [zero] remove useless import * [zero] refactor gemini folder structure * [zero] refactor gemini folder structure * [zero] refactor legacy zero import path * [zero] refactor gemini folder structure * [zero] refactor gemini folder structure * [zero] refactor gemini folder structure * [zero] refactor legacy zero import path * [zero] fix test import path * [zero] fix test * [zero] fix circular import * [zero] update import	2 years ago
ver217	5f2e34e6c9	[booster] implement Gemini plugin (#3352 ) * [booster] add gemini plugin * [booster] update docstr * [booster] gemini plugin add coloparam convertor * [booster] fix coloparam convertor * [booster] fix gemini plugin device * [booster] add gemini plugin test * [booster] gemini plugin ignore sync bn * [booster] skip some model * [booster] skip some model * [booster] modify test world size * [booster] modify test world size * [booster] skip test	2 years ago
Frank Lee	73d3e4d309	[booster] implemented the torch ddd + resnet example (#3232 ) * [booster] implemented the torch ddd + resnet example * polish code	2 years ago
Frank Lee	e7f3bed2d3	[booster] added the plugin base and torch ddp plugin (#3180 ) * [booster] added the plugin base and torch ddp plugin * polish code * polish code * polish code	2 years ago
Frank Lee	a9b8402d93	[booster] added the accelerator implementation (#3159 )	2 years ago
Frank Lee	ed19290560	[booster] implemented mixed precision class (#3151 ) * [booster] implemented mixed precision class * polish code	2 years ago
Frank Lee	f19b49e164	[booster] init module structure and definition (#3056 )	2 years ago

112 Commits (725fbd2ed067f9c58ac04670377d3e6f2a96fe00)