Baizhou Zhang
0ceec8f9a9
[pipeline] support fp32 for HybridPlugin/merge shardformer test and pipeline test into one file ( #4354 )
* add naive optimizer for 3DPlugin/refactor gpt2 shardformer test
* merge tests of PP/DP/TP combinations into one test file
* fix bug when sync grad for dp in HybridPlugin
* update supported precisions for 3DPlugin/fix bug when shifting tp_degree
* improve the passing of lazy_init
* modify lazy_init/use sync_shared_params
1 year ago
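The HybridPlugin entry above mentions fixing gradient synchronization for data parallelism. As a minimal sketch only (not the plugin's actual code), DP grad sync boils down to an all-reduce followed by averaging over the DP process group; `sync_grads_over_dp` and `dp_group` are hypothetical names used for illustration.

```python
import torch
import torch.distributed as dist

def sync_grads_over_dp(model: torch.nn.Module, dp_group: dist.ProcessGroup) -> None:
    """Average gradients across a data-parallel group (illustrative only)."""
    world_size = dist.get_world_size(group=dp_group)
    for param in model.parameters():
        if param.grad is not None:
            # Sum gradients from all DP ranks, then rescale to the mean.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=dp_group)
            param.grad.div_(world_size)
```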
Hongxin Liu
d921ce8391
[shardformer] support inplace sharding ( #4251 )
* [shardformer] embedding support inplace sharding
* [shardformer] linear support inplace sharding
* [shardformer] layernorm support inplace sharding
* [shardformer] qkv support inplace sharding
* [test] update shardformer layer test
* [shardformer] fix shared param sharding
* [shardformer] fix bert policy
* [shardformer] fix bloom policy
* [shardformer] fix llama policy
* [shardformer] fix opt policy
* [shardformer] fix t5 policy
* [shardformer] fix fused qkv linear
* [shardformer] fix bugs
* force sync
* [test] fix bugs
* [test] fix transformer version
1 year ago
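The inplace-sharding commit above converts shardformer layers to shard their parameters in place. A minimal sketch of the idea for a plain `nn.Linear`, assuming a split along dim 0 that divides evenly: the original `nn.Parameter` object is kept and only its storage is swapped for the local shard, so references held elsewhere (e.g. in optimizer param groups) stay valid. The function and group names are hypothetical, not shardformer's API.

```python
import torch
import torch.distributed as dist

def shard_linear_weight_inplace(linear: torch.nn.Linear,
                                tp_group: dist.ProcessGroup,
                                dim: int = 0) -> None:
    """Replace the weight's storage with its local shard, in place (sketch only)."""
    rank = dist.get_rank(group=tp_group)
    world_size = dist.get_world_size(group=tp_group)
    # Assumes linear.weight.size(dim) is divisible by world_size.
    shard = linear.weight.data.chunk(world_size, dim=dim)[rank].contiguous()
    # In-place: keep the original nn.Parameter, swap only its data,
    # so the module and any optimizer still point at the same object.
    linear.weight.data = shard
```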
Frank Lee
190a6ea9c2
[dtensor] fixed readme file name and removed deprecated file ( #4162 )
1 year ago
Frank Lee
c4b1b65931
[test] fixed tests failed due to dtensor change ( #4082 )
* [test] fixed tests failed due to dtensor change
* polish code
1 year ago
Frank Lee
70c58cfd4f
[shardformer] supported fused qkv checkpoint ( #4073 )
1 year ago
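For the fused QKV checkpoint support above, the point is that a tensor-parallel shard cannot be taken by naively chunking the fused weight along the output dim, since that would hand rank 0 the whole Q block; each of the Q, K and V blocks has to be split separately and the local slices re-fused. A rough sketch, assuming a fused layout of shape [3 * hidden, in_features] (actual layouts differ per model, e.g. GPT-2's transposed Conv1D):

```python
import torch

def shard_fused_qkv_weight(weight: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """Return this rank's TP shard of a fused QKV weight (illustrative only)."""
    q, k, v = weight.chunk(3, dim=0)             # split fused weight into Q, K, V blocks
    local = [t.chunk(world_size, dim=0)[rank]    # take this rank's slice of each block
             for t in (q, k, v)]
    return torch.cat(local, dim=0).contiguous()  # re-fuse the three local slices
```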
Frank Lee
8eb09a4c69
[shardformer] support module saving and loading ( #4062 )
* [shardformer] support module saving and loading
* polish code
1 year ago
Frank Lee
45d9384346
[shardformer] removed inplace tensor sharding ( #4018 )
1 year ago
Frank Lee
015af592f8
[shardformer] integrated linear 1D with dtensor ( #3996 )
* [shardformer] integrated linear 1D with dtensor
* polish code
1 year ago
FoolPlayer
a2f9af810d
[shardformer] fix an error in readme ( #3988 )
* fix an error in readme
* simplify code
1 year ago
Frank Lee
ddcf58cacf
Revert "[sync] sync feature/shardformer with develop"
1 year ago
Frank Lee
eb39154d40
[dtensor] updated api and doc ( #3845 )
1 year ago
Frank Lee
d51e83d642
Merge pull request #3916 from FrankLeeeee/sync/dtensor-with-develop
[sync] sync feature/dtensor with develop
2 years ago
digger yu
0e484e6201
[nfc] fix typo colossalai/pipeline tensor nn ( #3899 )
* fix typo colossalai/autochunk auto_parallel amp
* fix typo colossalai/auto_parallel nn utils etc.
* fix typo colossalai/auto_parallel autochunk fx/passes etc.
* fix typo docs/
* change placememt_policy to placement_policy in docs/ and examples/
* fix typo colossalai/ applications/
* fix typo colossalai/cli fx kernel
* fix typo colossalai/nn
* revert change warmuped
* fix typo colossalai/pipeline tensor nn
2 years ago
Hongxin Liu
7c9f2ed6dd
[dtensor] polish sharding spec docstring ( #3838 )
* [dtensor] polish sharding spec docstring
* [dtensor] polish sharding spec example docstring
2 years ago
YH
2629f9717d
[tensor] Refactor handle_trans_spec in DistSpecManager
2 years ago
digger-yu
b9a8dff7e5
[doc] Fix typo under colossalai and doc ( #3618 )
* Fixed several spelling errors under colossalai
* Fix the spelling error in colossalai and docs directory
* Cautiously changed the spelling error under the example folder
* Update runtime_preparation_pass.py
revert autograft to autograd
* Update search_chunk.py
utile to until
* Update check_installation.py
change misteach to mismatch in line 91
* Update 1D_tensor_parallel.md
revert to perceptron
* Update 2D_tensor_parallel.md
revert to perceptron in line 73
* Update 2p5D_tensor_parallel.md
revert to perceptron in line 71
* Update 3D_tensor_parallel.md
revert to perceptron in line 80
* Update README.md
revert to resnet in line 42
* Update reorder_graph.py
revert to indice in line 7
* Update p2p.py
revert to megatron in line 94
* Update initialize.py
revert to torchrun in line 198
* Update routers.py
change to detailed in line 63
* Update routers.py
change to detailed in line 146
* Update README.md
revert random number in line 402
2 years ago
YH
8f740deb53
Fix typo ( #3448 )
2 years ago
YH
1a229045af
Add interface for colo tensor dp size ( #3227 )
2 years ago
YuliangLiu0306
258b43317c
[hotfix] layout conversion issue ( #3188 )
2 years ago
YuliangLiu0306
2eca4cd376
[DTensor] refactor dtensor with new components ( #3089 )
* [DTensor] refactor dtensor with new components
* polish
2 years ago
YuliangLiu0306
8e4e8601b7
[DTensor] implement layout converter ( #3055 )
* [DTensor] refactor LayoutConverter for DTensor
* polish code
* polish docstring
2 years ago
YuliangLiu0306
29386a54e6
[DTensor] refactor CommSpec ( #3034 )
2 years ago
YuliangLiu0306
cd2b0eaa8d
[DTensor] refactor sharding spec ( #2987 )
* [autoparallel] refactor sharding spec
* rename function name
2 years ago
YuliangLiu0306
e414e4092b
[DTensor] implementation of dtensor ( #2946 )
* [DTensor] implementation of dtensor
* test layout convert
* polish
2 years ago
YuliangLiu0306
47fb214b3b
[hotfix] add shard dim to avoid backward communication error ( #2954 )
2 years ago
Jiatong (Julius) Han
8c8a39be95
[hotfix]: Remove math.prod dependency ( #2837 )
* Remove math.prod dependency
* Fix style
* Fix style
Co-authored-by: Jiatong Han <jiatong.han@u.nus.edu>
2 years ago
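`math.prod` only exists on Python >= 3.8, so removing the dependency typically means falling back to a reduce-based product. A minimal drop-in sketch (not necessarily the exact replacement used in the PR):

```python
from functools import reduce
from operator import mul
from typing import Iterable

def prod(values: Iterable[int], start: int = 1) -> int:
    """Product of an iterable; substitute for math.prod on Python < 3.8."""
    return reduce(mul, values, start)
```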
HELSON
552183bb74
[polish] polish ColoTensor and its submodules ( #2537 )
2 years ago
YuliangLiu0306
aa0f6686f9
[autoparallel] accelerate gpt2 training ( #2495 )
2 years ago
HELSON
707b11d4a0
[gemini] update ddp strict mode ( #2518 )
* [zero] add strict ddp mode for chunk init
* [gemini] update gpt example
2 years ago
Jiarui Fang
8f72b6f8fb
[hotfix] fix implementation error in diffusers
2 years ago
1SAA
33f3023e19
[hotfix] fix implementation error in diffusers
2 years ago
Jiarui Fang
1aaeb596c6
[example] gpt, shard init on all processes ( #2366 )
2 years ago
Boyuan Yao
22e947f982
[autoparallel] fix runtime apply memory estimation ( #2281 )
* [autoparallel] align the data_ptr with the old version of auto activation checkpoint pipeline
* [autoparallel] using fwd_time and bwd_time instead of fwd_flop and bwd_flop
* [autoparallel] specify comm nodes' memory cost in construct chain
* [autoparallel] fix wrong runtime apply calculation
* [autoparallel] fix wrong runtime apply calculation
* [autoparallel] fix wrong runtime apply calculation
2 years ago
xcnick
85178a397a
[hotfix] fix error for torch 2.0 ( #2243 )
2 years ago
Boyuan Yao
24246f7aa5
[autoparallel] Attach input, buffer and output tensor to MetaInfo class ( #2162 )
* [fx] metainfo class for auto parallel
* [fx] add unit test for linear metainfo
* [fx] fix bwd param for linear
* [fx] modify unit test
* [fx] modify unit test
* [fx] modify import
* [fx] modify import
* [fx] modify import
* [fx] move meta profiler to auto parallel
* [fx] add conv metainfo class
* [fx] restore profiler
* [fx] restore meta profiler
* [autoparallel] modify unit test
* [fx] modify unit test
* [autoparallel] add batchnorm metainfo class
* [autoparallel] fix batchnorm unit test function declaration
* [fx] restore profiler
* [fx] add relu metainfo class
* [fx] restore profiler
* [autoparallel] modify metainfo input
* [autoparallel] add pooling metainfo
* [autoparallel] add F.linear metainfo generator
* [autoparallel] add binary elementwise metainfo
* [fx] recover profiler
* [autoparallel] fix forward memory calculation
* [autoparallel] modify constants.py
* [autoparallel] remove redundant print
* [autoparallel] add F.conv metainfo
* [autoparallel] linear fix
* [autoparallel] memory estimation for communication actions
* [autoparallel] fix docstring
* [autoparallel] fix variables name
* [autoparallel] attach tensor to metainfo class
* [autoparallel] fix dangerous try except
* [autoparallel] attach memory cost to shape consistency node
* [autoparallel] attach shape consistency node's metainfo to the node
* [autoparallel] remove todo in shape consistency memory estimation
* [autoparallel] fix the annotation
2 years ago
HELSON
2458659919
[zero] fix error for BEiT models ( #2169 )
* [zero] fix error for BEiT models
* [ColoParameter] add unpack operation for tuple arguments
* fix bugs
* fix chunkv2 unit testing
* add assertion for gradient state
2 years ago
Boyuan Yao
cfe2a9bd90
[autoparallel] memory estimation for shape consistency ( #2144 )
* [fx] metainfo class for auto parallel
* [fx] add unit test for linear metainfo
* [fx] fix bwd param for linear
* [fx] modify unit test
* [fx] modify unit test
* [fx] modify import
* [fx] modify import
* [fx] modify import
* [fx] move meta profiler to auto parallel
* [fx] add conv metainfo class
* [fx] restore profiler
* [fx] restore meta profiler
* [autoparallel] modify unit test
* [fx] modify unit test
* [autoparallel] add batchnorm metainfo class
* [autoparallel] fix batchnorm unit test function declaration
* [fx] restore profiler
* [fx] add relu metainfo class
* [fx] restore profiler
* [autoparallel] modify metainfo input
* [autoparallel] add pooling metainfo
* [autoparallel] add F.linear metainfo generator
* [autoparallel] add binary elementwise metainfo
* [fx] recover profiler
* [autoparallel] fix forward memory calculation
* [autoparallel] modify constants.py
* [autoparallel] remove redundant print
* [autoparallel] add F.conv metainfo
* [autoparallel] linear fix
* [autoparallel] memory estimation for communication actions
* [autoparallel] fix docstring
* [autoparallel] fix variables name
2 years ago
Jiarui Fang
2827f41898
[Gemini] convert GeminiDDP to a PyTorch Module. ( #2151 )
2 years ago
Jiarui Fang
e99edfcb51
[NFC] polish comments for Chunk class ( #2116 )
2 years ago
Jiarui Fang
b3b89865e2
[Gemini] ParamOpHook -> ColoParamOpHook ( #2080 )
2 years ago
YuliangLiu0306
81330b0352
[autoparallel] add experimental permute handler ( #2029 )
2 years ago
Genghan Zhang
d655eea515
[autoparallel] mix gather ( #1977 )
* Add mix-gather
* Add comments
* Add comments
* Polish comments
* Change the global rank assumption
* Add tests
* Add two-step tests
* Fix 10 and 01
* Skip test because of the number of GPUs
2 years ago
YuliangLiu0306
36c0f3ea5b
[autoparallel] remove redundant comm node ( #1893 )
2 years ago
YuliangLiu0306
49216d7ab1
[autoparallel] fix bugs caused by negative dim key ( #1808 )
* [autoparallel] fix bugs caused by negative dim key
* fix import error
* fix matmul test issue
* fix unit test issue
2 years ago
Jiarui Fang
218c75fd9d
[NFC] polish type hint for shape consistency ( #1801 )
* [NFC] polish type hint for shape consistency
* polish code
* polish code
2 years ago
HELSON
c6a1a62636
[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 ( #1786 )
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12
* [zero] add cpu shard init
* [zero] add tiny example test
* [colo_tensor] fix bugs for torch-1.11
2 years ago
Frank Lee
f3f19a5c47
[autoparallel] added matmul handler ( #1763 )
* [autoparallel] added matmul handler
* polish code
2 years ago
YuliangLiu0306
b0f7c8bde8
[autoparallel] update CommSpec to CommActions ( #1768 )
* [autoparallel] update CommSpec to CommActions
* polish code
2 years ago
YuliangLiu0306
b4cc59b61e
[autoparallel] add numerical test for node strategies ( #1760 )
* [autoparallel] add numerical test for node strategies
* polish code
* polish code
2 years ago
YuliangLiu0306
980ed21723
[autoparallel] shard param and buffer as expected ( #1753 )
* [autoparallel] shard param and buffer as expected
* fix unit test issue
2 years ago