Commit Graph

96 Commits (4148ceed9f17446a6c247b49c33805b5abd17984)

Author SHA1 Message Date
Edenzzzz 785cd9a9c9
[misc] Update PyTorch version in docs (#5711)
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
2024-05-13 12:02:52 +08:00
Hongxin Liu 7f8b16635b
[misc] refactor launch API and tensor constructor (#5666)
* [misc] remove config arg from initialize

* [misc] remove old tensor constructor

* [plugin] add npu support for ddp

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [devops] fix doc test ci

* [test] fix test launch

* [doc] update launch doc

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2024-04-29 10:40:11 +08:00
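
For reference, a minimal sketch of the launch call after this refactor; per the commit, the `config` argument was removed from the initialize/launch entry points:

```python
# Minimal sketch of the refactored launch (per #5666, the `config`
# argument was removed from the initialize/launch entry points).
import colossalai

# Before: colossalai.launch_from_torch(config={})
# After:  no config dict is needed.
colossalai.launch_from_torch()
```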
Hongxin Liu bbb2c21f16
[shardformer] fix chatglm implementation (#5644)
* [shardformer] fix chatglm policy

* [shardformer] fix chatglm flash attn

* [shardformer] update readme

* [shardformer] fix chatglm init

* [shardformer] fix chatglm test

* [pipeline] fix chatglm merge batch
2024-04-25 14:41:17 +08:00
Wenhao Chen bb0a668fee
[hotfix] set return_outputs=False in examples and polish code (#5404)
* fix: simplify merge_batch

* fix: use return_outputs=False to eliminate extra memory consumption

* feat: add return_outputs warning

* style: remove `return_outputs=False` as it is the default value
2024-03-25 12:31:09 +08:00
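
A hedged sketch of the pattern this hotfix promotes, assuming the Booster pipeline interface; `train_iter`, `model`, `criterion`, and `optimizer` are assumed to come from a prior `booster.boost(...)` setup:

```python
# Sketch only. return_outputs defaults to False; leaving it that way avoids
# retaining every micro-batch's outputs, which is the extra memory the fix
# eliminates. Passing return_outputs=False explicitly is redundant, which is
# why the final commit strips it from the examples.
result = booster.execute_pipeline(
    train_iter, model, criterion, optimizer,
    return_loss=True,
)
loss = result["loss"]  # assumed result layout: {"loss": ..., "outputs": ...}
optimizer.step()
optimizer.zero_grad()
```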
Hongxin Liu 070df689e6
[devops] fix extension building (#5427) 2024-03-05 15:35:54 +08:00
Frank Lee 705a62a565
[doc] updated installation command (#5389) 2024-02-19 16:54:03 +08:00
Frank Lee 8823cc4831
Merge pull request #5310 from hpcaitech/feature/npu
Feature/npu
2024-01-29 13:49:39 +08:00
digger yu bce9499ed3
fix some typo (#5307) 2024-01-25 13:56:27 +08:00
ver217 148469348a Merge branch 'main' into sync/npu 2024-01-18 12:05:21 +08:00
Hongxin Liu d202cc28c0
[npu] change device to accelerator api (#5239)
* update accelerator

* fix timer

* fix amp

* update

* fix

* fix bug

* add error raise

* fix autocast

* fix set device

* remove doc accelerator

* update doc

* use nullcontext

* update cpu

* update null context

* change time limit for example

* update

* [npu] polish accelerator code

---------

Co-authored-by: Xuanlei Zhao <xuanlei.zhao@gmail.com>
Co-authored-by: zxl <43881818+oahzxl@users.noreply.github.com>
2024-01-09 10:20:05 +08:00
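
A small sketch of the device-agnostic pattern this change enables; the method names on the accelerator object are best-effort assumptions:

```python
# Sketch: one code path for CUDA and NPU via the accelerator API (#5239).
import torch
from colossalai.accelerator import get_accelerator

acc = get_accelerator()               # resolves to the available backend
device = acc.get_current_device()     # assumed accessor, e.g. cuda:0 or npu:0
x = torch.randn(4, 4, device=device)
acc.synchronize()                     # replaces torch.cuda.synchronize()
acc.empty_cache()                     # replaces torch.cuda.empty_cache()
```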
flybird11111 681d9b12ef
[doc] update pytorch version in documents (#5177)
* assorted fixes

* test ci

* fix ci

* update pytorch version in documents
2023-12-15 18:16:48 +08:00
Wenhao Chen 7172459e74
[shardformer]: support gpt-j, falcon, Mistral and add interleaved pipeline for bert (#5088)
* [shardformer] implement policy for all GPT-J models and test

* [shardformer] support interleaved pipeline parallel for bert finetune

* [shardformer] shardformer support falcon (#4883)

* [shardformer]: fix interleaved pipeline for bert model (#5048)

* [hotfix]: disable seq parallel for gptj and falcon, and polish code (#5093)

* Add Mistral support for Shardformer (#5103)

* [shardformer] add tests to mistral (#5105)

---------

Co-authored-by: Pengtai Xu <henryxu880@gmail.com>
Co-authored-by: ppt0011 <143150326+ppt0011@users.noreply.github.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: eric8607242 <e0928021388@gmail.com>
2023-11-28 16:54:42 +08:00
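
A hedged sketch of enabling the interleaved schedule this PR adds for BERT; `pp_style` and `num_model_chunks` are assumed parameter names, not verified here:

```python
# Sketch: interleaved pipeline parallelism (assumed HybridParallelPlugin flags).
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

plugin = HybridParallelPlugin(
    tp_size=1,
    pp_size=2,
    pp_style="interleaved",  # assumed flag selecting the interleaved schedule
    num_model_chunks=2,      # each pipeline stage holds two model chunks
    num_microbatches=8,
)
booster = Booster(plugin=plugin)
```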
digger yu 2bdf76f1f2
fix typo: change lazy_iniy to lazy_init (#5099) 2023-11-24 19:15:59 +08:00
digger yu 0d482302a1
[nfc] fix typo and author name (#5089) 2023-11-22 10:39:01 +08:00
digger yu fd3567e089
[nfc] fix typo in docs/ (#4972) 2023-11-21 22:06:20 +08:00
ppt0011 335cb105e2
[doc] add supported feature diagram for hybrid parallel plugin (#4996) 2023-10-31 19:56:42 +08:00
digger yu 11009103be
[nfc] fix some typo with colossalai/ docs/ etc. (#4920) 2023-10-18 15:44:04 +08:00
Baizhou Zhang 21ba89cab6
[gemini] support gradient accumulation (#4869)
* add test

* fix no_sync bug in low level zero plugin

* fix test

* add argument for grad accum

* add grad accum in backward hook for gemini

* finish implementation, rewrite tests

* fix test

* skip stuck model in low level zero test

* update doc

* optimize communication & fix gradient checkpoint

* modify doc

* cleaning codes

* update cpu adam fp16 case
2023-10-17 14:07:21 +08:00
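
A sketch of gradient accumulation under the Gemini plugin; the commit says an argument was added for grad accum, and `enable_gradient_accumulation` is an assumed name for it:

```python
# Sketch: gradient accumulation with GeminiPlugin (#4869). The plugin flag
# name is an assumption; the loop structure is the standard pattern.
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

ACCUM_STEPS = 4
plugin = GeminiPlugin(enable_gradient_accumulation=True)  # assumed argument
booster = Booster(plugin=plugin)
# model, optimizer, criterion, train_loader assumed boosted via booster.boost(...)

for step, (inputs, labels) in enumerate(train_loader):
    loss = criterion(model(inputs), labels) / ACCUM_STEPS  # scale per micro-batch
    booster.backward(loss, optimizer)  # grads accumulate in the backward hook
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```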
flybird11111 6a21f96a87
[doc] update advanced tutorials, training gpt with hybrid parallelism (#4866)
* [doc] update advanced tutorials, training gpt with hybrid parallelism

* update vit tutorials

* update en/train_vit_with_hybrid_parallel.py

* fix

* resolve comments

* fix
2023-10-10 08:18:55 +00:00
Zhongkai Zhao db40e086c8 [test] modify model supporting part of low_level_zero plugin (including corresponding docs) 2023-10-05 15:10:31 +08:00
Hongxin Liu da15fdb9ca
[doc] add lazy init docs (#4808) 2023-09-27 10:24:04 +08:00
Baizhou Zhang 64a08b2dc3
[checkpointio] support unsharded checkpointIO for hybrid parallel (#4774)
* support unsharded saving/loading for model

* support optimizer unsharded saving

* update doc

* support unsharded loading for optimizer

* small fix
2023-09-26 10:58:03 +08:00
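
A sketch of the unsharded checkpoint path this adds, assuming the Booster checkpoint IO with a `shard` switch:

```python
# Sketch: unsharded (single-file) checkpointing under hybrid parallelism
# (#4774); shard=False is assumed to gather a consolidated checkpoint even
# though the live model/optimizer states are sharded across ranks.
booster.save_model(model, "model.pt", shard=False)
booster.save_optimizer(optimizer, "optimizer.pt", shard=False)

booster.load_model(model, "model.pt")
booster.load_optimizer(optimizer, "optimizer.pt")
```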
Baizhou Zhang a2db75546d
[doc] polish shardformer doc (#4779)
* fix example format in docstring

* polish shardformer doc
2023-09-26 10:57:47 +08:00
Hongxin Liu 66f3926019
[doc] clean up outdated docs (#4765)
* [doc] clean up outdated docs

* [doc] fix linking

* [doc] fix linking
2023-09-21 11:36:20 +08:00
Pengtai Xu 4d7537ba25 [doc] put native colossalai plugins first in description section 2023-09-20 09:24:10 +08:00
Pengtai Xu e10d9f087e [doc] add model examples for each plugin 2023-09-19 18:01:23 +08:00
Pengtai Xu a04337bfc3 [doc] put individual plugin explanation in front 2023-09-19 16:27:37 +08:00
Pengtai Xu 10513f203c [doc] explain suitable use case for each plugin 2023-09-19 15:50:14 +08:00
Hongxin Liu b5f9e37c70
[legacy] clean up legacy code (#4743)
* [legacy] remove outdated codes of pipeline (#4692)

* [legacy] remove cli of benchmark and update optim (#4690)

* [legacy] remove cli of benchmark and update optim

* [doc] fix cli doc test

* [legacy] fix engine clip grad norm

* [legacy] remove outdated colo tensor (#4694)

* [legacy] remove outdated colo tensor

* [test] fix test import

* [legacy] move outdated zero to legacy (#4696)

* [legacy] clean up utils (#4700)

* [legacy] clean up utils

* [example] update examples

* [legacy] clean up amp

* [legacy] fix amp module

* [legacy] clean up gpc (#4742)

* [legacy] clean up context

* [legacy] clean core, constants and global vars

* [legacy] refactor initialize

* [example] fix examples ci

* [legacy] fix tests

* [example] fix gpt example

* [example] fix examples ci

* [devops] fix ci installation

* [example] fix examples ci
2023-09-18 16:31:06 +08:00
Baizhou Zhang d151dcab74
[doc] explanation of loading large pretrained models (#4741) 2023-09-15 21:04:07 +08:00
Baizhou Zhang 451c3465fb
[doc] polish shardformer doc (#4735)
* arrange position of chapters

* fix typos in seq parallel doc
2023-09-15 17:39:10 +08:00
Bin Jia 6a03c933a0
[shardformer] update seq parallel document (#4730)
* update doc of seq parallel

* fix typo
2023-09-15 16:09:32 +08:00
flybird11111 46162632e5
[shardformer] update pipeline parallel document (#4725)
* [shardformer] update pipeline parallel document
2023-09-15 14:32:04 +08:00
Baizhou Zhang 50e5602c2d
[doc] add shardformer support matrix/update tensor parallel documents (#4728)
* add compatibility matrix for shardformer doc

* update tp doc
2023-09-15 13:52:30 +08:00
Baizhou Zhang f911d5b09d
[doc] Add user document for Shardformer (#4702)
* create shardformer doc files

* add docstring for seq-parallel

* update ShardConfig docstring

* add links to llama example

* add outdated message

* finish introduction & supporting information

* finish 'how shardformer works'

* finish shardformer.md English doc

* fix doctest fail

* add Chinese document
2023-09-15 10:56:39 +08:00
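
A hedged sketch of the core flow the new Shardformer document describes; the `ShardConfig` fields shown are assumptions:

```python
# Sketch: sharding a Hugging Face model with Shardformer (#4702).
from transformers import BertForSequenceClassification
from colossalai.shardformer import ShardConfig, ShardFormer

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

shard_config = ShardConfig(
    enable_tensor_parallelism=True,    # assumed flag names
    enable_sequence_parallelism=False,
)
shard_former = ShardFormer(shard_config=shard_config)
model, shared_params = shard_former.optimize(model)  # assumed return pair
```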
Baizhou Zhang 1d454733c4
[doc] Update booster user documents. (#4669)
* update booster_api.md

* update booster_checkpoint.md

* update booster_plugins.md

* move transformers importing inside function

* fix Dict typing

* fix autodoc bug

* small fix
2023-09-12 10:47:23 +08:00
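
For orientation, a minimal template of the Booster workflow these documents cover; the plugin choice here is illustrative:

```python
# Sketch: the basic Booster flow from booster_api.md / booster_plugins.md.
import colossalai
import torch
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchDDPPlugin

colossalai.launch_from_torch()  # post-#5666 form; earlier code passed config={}
model = torch.nn.Linear(32, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()

booster = Booster(plugin=TorchDDPPlugin())
# boost() wraps the objects for the chosen plugin and returns them in order:
model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)
```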
Hongxin Liu 554aa9592e
[legacy] move communication and nn to legacy and refactor logger (#4671)
* [legacy] move communication to legacy (#4640)

* [legacy] refactor logger and clean up legacy codes (#4654)

* [legacy] make logger independent to gpc

* [legacy] make optim independent to registry

* [legacy] move test engine to legacy

* [legacy] move nn to legacy (#4656)

* [legacy] move nn to legacy

* [checkpointio] fix save hf config

* [test] remove useless rpc pp test

* [legacy] fix nn init

* [example] skip tutorial hybrid parallel example

* [devops] test doc check
2023-09-11 16:24:28 +08:00
Hongxin Liu ac178ca5c1 [legacy] move builder and registry to legacy (#4603) 2023-09-05 21:53:10 +08:00
Hongxin Liu 8accecd55b [legacy] move engine to legacy (#4560)
* [legacy] move engine to legacy

* [example] fix seq parallel example

* [example] fix seq parallel example

* [test] test gemini plugin hang

* [example] update seq parallel requirements
2023-09-05 21:53:10 +08:00
Hongxin Liu 89fe027787 [legacy] move trainer to legacy (#4545)
* [legacy] move trainer to legacy

* [doc] update docs related to trainer

* [test] ignore legacy test
2023-09-05 21:53:10 +08:00
Hongxin Liu 27061426f7
[gemini] improve compatibility and add static placement policy (#4479)
* [gemini] remove distributed-related part from colotensor (#4379)

* [gemini] remove process group dependency

* [gemini] remove tp part from colo tensor

* [gemini] patch inplace op

* [gemini] fix param op hook and update tests

* [test] remove useless tests

* [misc] fix requirements

* [test] fix model zoo

* [misc] update requirements

* [gemini] refactor gemini optimizer and gemini ddp (#4398)

* [gemini] update optimizer interface

* [gemini] renaming gemini optimizer

* [gemini] refactor gemini ddp class

* [example] update gemini related example

* [example] update gemini related example

* [plugin] fix gemini plugin args

* [test] update gemini ckpt tests

* [gemini] fix checkpoint io

* [example] fix opt example requirements

* [example] fix opt example

* [gemini] add static placement policy (#4443)

* [gemini] add static placement policy

* [gemini] fix param offload

* [test] update gemini tests

* [plugin] update gemini plugin

* [plugin] update gemini plugin docstr

* [misc] fix flash attn requirement

* [test] fix gemini checkpoint io test

* [example] update resnet example result (#4457)

* [example] update bert example result (#4458)

* [doc] update gemini doc (#4468)

* [example] update gemini related examples (#4473)

* [example] update gpt example

* [example] update dreambooth example

* [example] update vit

* [example] update opt

* [example] update palm

* [example] update vit and opt benchmark

* [hotfix] fix bert in model zoo (#4480)

* [hotfix] fix bert in model zoo

* [test] remove chatglm gemini test

* [test] remove sam gemini test

* [test] remove vit gemini test

* [hotfix] fix opt tutorial example (#4497)

* [hotfix] fix opt tutorial example

* [hotfix] fix opt tutorial example
2023-08-24 09:29:25 +08:00
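
A sketch of the static placement policy added here; the offload-fraction parameter names are assumptions based on the commit messages:

```python
# Sketch: static placement for Gemini (#4443). Instead of the dynamic "auto"
# policy, a fixed fraction of states is pinned to CPU up front.
from colossalai.booster.plugin import GeminiPlugin

plugin = GeminiPlugin(
    placement_policy="static",
    offload_optim_frac=0.5,  # assumed: keep half the optimizer states on CPU
    offload_param_frac=0.0,  # assumed: keep all parameters on device
)
```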
flybird1111 f40b718959
[doc] Fix gradient accumulation doc. (#4349)
* [doc] fix gradient accumulation doc

* [doc] fix gradient accumulation doc
2023-08-04 17:24:35 +08:00
Baizhou Zhang c6f6005990
[checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin (#4302)
* sharded optimizer checkpoint for gemini plugin

* modify test to reduce testing time

* update doc

* fix bug when keep_gathered is true under GeminiPlugin
2023-07-21 14:39:01 +08:00
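
A sketch of the sharded optimizer checkpoint this enables for Gemini; `size_per_shard` is an assumed parameter capping shard size in MB:

```python
# Sketch: sharded optimizer checkpointing under GeminiPlugin (#4302).
booster.save_optimizer(optimizer, "optim_ckpt/", shard=True,
                       size_per_shard=1024)  # assumed MB cap per shard file
booster.load_optimizer(optimizer, "optim_ckpt/")
```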
Jianghai 711e2b4c00
[doc] update and revise some typos and errs in docs (#4107)
* fix some typos and problems in doc

* add doc test
2023-06-28 19:30:37 +08:00
digger yu 769cddcb2c
fix typo docs/ (#4033) 2023-06-28 15:30:30 +08:00
Baizhou Zhang 4da324cd60
[hotfix]fix argument naming in docs and examples (#4083) 2023-06-26 23:50:04 +08:00
Frank Lee ddcf58cacf
Revert "[sync] sync feature/shardformer with develop" 2023-06-09 09:41:27 +08:00
FoolPlayer 24651fdd4f
Merge pull request #3931 from FrankLeeeee/sync/develop-to-shardformer
[sync] sync feature/shardformer with develop
2023-06-09 09:34:00 +08:00
digger yu 33eef714db
fix typo examples and docs (#3932) 2023-06-08 16:09:32 +08:00
Hongxin Liu 12c90db3f3
[doc] add lazy init tutorial (#3922)
* [doc] add lazy init en doc

* [doc] add lazy init zh doc

* [doc] add lazy init doc in sidebar

* [doc] add lazy init doc test

* [doc] fix lazy init doc link
2023-06-07 17:59:58 +08:00
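
Finally, a hedged sketch of the lazy-init pattern this tutorial introduces, assuming `LazyInitContext` defers allocation until materialization:

```python
# Sketch: lazy initialization (#3922). Parameters are recorded, not allocated,
# inside the context, so very large models can be constructed cheaply; the
# real allocation is assumed to happen at materialization (e.g. booster.boost).
from colossalai.lazy import LazyInitContext
from transformers import GPT2Config, GPT2LMHeadModel

with LazyInitContext():
    model = GPT2LMHeadModel(GPT2Config())  # no device memory consumed yet
# model, *_ = booster.boost(model, optimizer, ...)  # materializes the weights
```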