YH
2629f9717d
[tensor] Refactor handle_trans_spec in DistSpecManager
2 years ago
digger-yu
b9a8dff7e5
[doc] Fix typo under colossalai and doc( #3618 )
...
* Fixed several spelling errors under colossalai
* Fix the spelling error in colossalai and docs directory
* Cautious Changed the spelling error under the example folder
* Update runtime_preparation_pass.py
revert autograft to autograd
* Update search_chunk.py
utile to until
* Update check_installation.py
change misteach to mismatch in line 91
* Update 1D_tensor_parallel.md
revert to perceptron
* Update 2D_tensor_parallel.md
revert to perceptron in line 73
* Update 2p5D_tensor_parallel.md
revert to perceptron in line 71
* Update 3D_tensor_parallel.md
revert to perceptron in line 80
* Update README.md
revert to resnet in line 42
* Update reorder_graph.py
revert to indice in line 7
* Update p2p.py
revert to megatron in line 94
* Update initialize.py
revert to torchrun in line 198
* Update routers.py
change to detailed in line 63
* Update routers.py
change to detailed in line 146
* Update README.md
revert random number in line 402
2 years ago
YH
8f740deb53
Fix typo ( #3448 )
2 years ago
YH
1a229045af
Add interface for colo tesnor dp size ( #3227 )
2 years ago
YuliangLiu0306
258b43317c
[hotfix] layout converting issue ( #3188 )
2 years ago
YuliangLiu0306
2eca4cd376
[DTensor] refactor dtensor with new components ( #3089 )
...
* [DTensor] refactor dtensor with new components
* polish
2 years ago
YuliangLiu0306
8e4e8601b7
[DTensor] implement layout converter ( #3055 )
...
* [DTensor] refactor LayoutConverter for DTensor
* polish code
* polish docstring
2 years ago
YuliangLiu0306
29386a54e6
[DTensor] refactor CommSpec ( #3034 )
2 years ago
YuliangLiu0306
cd2b0eaa8d
[DTensor] refactor sharding spec ( #2987 )
...
* [autoparallel] refactor sharding spec
* rename function name
2 years ago
YuliangLiu0306
e414e4092b
[DTensor] implementation of dtensor ( #2946 )
...
* [DTensor] implementation of dtensor
* test layout convert
* polish
2 years ago
YuliangLiu0306
47fb214b3b
[hotfix] add shard dim to aviod backward communication error ( #2954 )
2 years ago
Jiatong (Julius) Han
8c8a39be95
[hotfix]: Remove math.prod dependency ( #2837 )
...
* Remove math.prod dependency
* Fix style
* Fix style
---------
Co-authored-by: Jiatong Han <jiatong.han@u.nus.edu>
2 years ago
HELSON
552183bb74
[polish] polish ColoTensor and its submodules ( #2537 )
2 years ago
YuliangLiu0306
aa0f6686f9
[autoparallel] accelerate gpt2 training ( #2495 )
2 years ago
HELSON
707b11d4a0
[gemini] update ddp strict mode ( #2518 )
...
* [zero] add strict ddp mode for chunk init
* [gemini] update gpt example
2 years ago
Jiarui Fang
8f72b6f8fb
[hotfix] fix implement error in diffusers
2 years ago
1SAA
33f3023e19
[hotfix] fix implement error in diffusers
2 years ago
Jiarui Fang
1aaeb596c6
[example] gpt, shard init on all processes ( #2366 )
2 years ago
Boyuan Yao
22e947f982
[autoparallel] fix runtime apply memory estimation ( #2281 )
...
* [autoparallel] align the data_ptr with the old version of auto activation checkpoint pipeline
* [autoparallel] using fwd_time and bwd_time instead of fwd_flop and bwd_flop
* [autoparallel] specifycomm nodes' memory cost in construct chain
* [autoparallel] fix wrong runtime apply calculation
* [autoparallel] fix wrong runtime apply calculation
* [autoparallel] fix wrong runtime apply calculation
2 years ago
xcnick
85178a397a
[hotfix] fix error for torch 2.0 ( #2243 )
2 years ago
Boyuan Yao
24246f7aa5
[autoparallel] Attach input, buffer and output tensor to MetaInfo class ( #2162 )
...
* [fx] metainfo class for auto parallel
* [fx] add unit test for linear metainfo
* [fx] fix bwd param for linear
* [fx] modify unit test
* [fx] modify unit test
* [fx] modify import
* [fx] modify import
* [fx] modify import
* [fx] move meta profiler to auto parallel
* [fx] add conv metainfo class
* [fx] restore profiler
* [fx] restore meta profiler
* [autoparallel] modify unit test
* [fx] modify unit test
* [autoparallel] add batchnorm metainfo class
* [autoparallel] fix batchnorm unit test function declaration
* [fx] restore profiler
* [fx] add relu metainfo class
* [fx] restore profiler
* [autoparallel] modify metainfo input
* [autoparallel] add pooling metainfo
* [autoparallel] add F.linear metainfo generator
* [autoparallel] add binary elementwise metainfo
* [fx] recover profiler
* [autoparallel] fix forward memory calculation
* [autoparallel] modify constants.py
* [autoparallel] remove redundant print
* [autoparallel] add F.conv metainfo
* [autoparallel] linear fix
* [autoparallel] memory estimation for communication actions
* [autoparallel] fix docstring
* [autoparallel] fix variables name
* [autoparallel] attach tensor to metainfo class
* [autoparallel] fix dangerous try except
* [autoparallel] attach memory cost to shape consistency node
* [autoparallel] attach shape consistency node's metainfo to the node
* [autoparallel] remove todo in shape consistency memory estimation
* [autoparallel] fix the annotation
2 years ago
HELSON
2458659919
[zero] fix error for BEiT models ( #2169 )
...
* [zero] fix error for BEiT models
* [ColoParameter] add unpack operation for tuple arguments
* fix bugs
* fix chunkv2 unit testing
* add assertion for gradient state
2 years ago
Boyuan Yao
cfe2a9bd90
[autoparallel] memory estimation for shape consistency ( #2144 )
...
* [fx] metainfo class for auto parallel
* [fx] add unit test for linear metainfo
* [fx] fix bwd param for linear
* [fx] modify unit test
* [fx] modify unit test
* [fx] modify import
* [fx] modify import
* [fx] modify import
* [fx] move meta profiler to auto parallel
* [fx] add conv metainfo class
* [fx] restore profiler
* [fx] restore meta profiler
* [autoparallel] modify unit test
* [fx] modify unit test
* [autoparallel] add batchnorm metainfo class
* [autoparallel] fix batchnorm unit test function declaration
* [fx] restore profiler
* [fx] add relu metainfo class
* [fx] restore profiler
* [autoparallel] modify metainfo input
* [autoparallel] add pooling metainfo
* [autoparallel] add F.linear metainfo generator
* [autoparallel] add binary elementwise metainfo
* [fx] recover profiler
* [autoparallel] fix forward memory calculation
* [autoparallel] modify constants.py
* [autoparallel] remove redundant print
* [autoparallel] add F.conv metainfo
* [autoparallel] linear fix
* [autoparallel] memory estimation for communication actions
* [autoparallel] fix docstring
* [autoparallel] fix variables name
2 years ago
Jiarui Fang
2827f41898
[Gemini] GeminiDPP convert to PyTorch Module. ( #2151 )
2 years ago
Jiarui Fang
e99edfcb51
[NFC] polish comments for Chunk class ( #2116 )
2 years ago
Jiarui Fang
b3b89865e2
[Gemini] ParamOpHook -> ColoParamOpHook ( #2080 )
2 years ago
YuliangLiu0306
81330b0352
[autoparallel] add experimental permute handler ( #2029 )
2 years ago
Genghan Zhang
d655eea515
[autoparallel] mix gather ( #1977 )
...
* Add mix-gather
* Add comments
* Add comments
* Polish comments
* Change the global rank assumption
* Add tests
* Add two-step tests
* Fix 10 and 01
* Skip test becasue the number of GPUs
2 years ago
YuliangLiu0306
36c0f3ea5b
[autoparallel] remove redundancy comm node ( #1893 )
2 years ago
YuliangLiu0306
49216d7ab1
[autoparallel] fix bugs caused by negative dim key ( #1808 )
...
* [autoparallel] fix bugs caused by negative dim key
* fix import error
* fix matmul test issue
* fix unit test issue
2 years ago
Jiarui Fang
218c75fd9d
[NFC] polish type hint for shape consistency ( #1801 )
...
* [NFC] polish type hint for shape consistency
* polish code
* polish code
2 years ago
HELSON
c6a1a62636
[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 ( #1786 )
...
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12
* [zero] add cpu shard init
* [zero] add tiny example test
* [colo_tensor] fix bugs for torch-1.11
2 years ago
Frank Lee
f3f19a5c47
[autoparallel] added matmul handler ( #1763 )
...
* [autoparallel] added matmul handler
* polish code
2 years ago
YuliangLiu0306
b0f7c8bde8
[autoparallel] update CommSpec to CommActions ( #1768 )
...
* [autoparallel] update CommSpec to CommActions
* polish code
2 years ago
YuliangLiu0306
b4cc59b61e
[autoparallel] add numerical test for node strategies ( #1760 )
...
* [autoparallel] add numerical test for node strategies
* polish code
* polish code
2 years ago
YuliangLiu0306
980ed21723
[autoparallel] shard param and buffer as expected ( #1753 )
...
* [autoparallel] shard param and buffer as expected
* fix unit test issue
2 years ago
YuliangLiu0306
a4ce180e85
[autoparallel] add sequential order to communication actions ( #1735 )
2 years ago
Frank Lee
993b8875b6
[autoparallel] handled illegal sharding strategy in shape consistency ( #1744 )
...
* [autoparallel] handled illegal sharding strategy in shape consistency
* polish code
2 years ago
Frank Lee
eee84908d4
[autoparallel] handled illegal sharding strategy ( #1728 )
...
* [autoparallel] handled illegal sharding strategy
* polish code
2 years ago
YuliangLiu0306
51b89d2202
[autoparallel] runtime_backward_apply ( #1720 )
2 years ago
Frank Lee
4973157ad7
[autoparallel] added sharding spec conversion for linear handler ( #1687 )
2 years ago
YuliangLiu0306
3f068d1409
[autoparallel] update CommSpec ( #1667 )
2 years ago
Frank Lee
154d3ef432
[fix] fixed the collective pattern name for consistency ( #1649 )
...
* [fix] fixed the collective pattern name for consistency
* polish code
2 years ago
YuliangLiu0306
702dbc5288
[tensor] use communication autograd func ( #1617 )
...
* [tensor] use communication autograd func
* change all to all comm spec info
* rename pattern and distinguish fwd/bwd
* polish code
2 years ago
Frank Lee
27fe8af60c
[autoparallel] refactored shape consistency to remove redundancy ( #1591 )
...
* [autoparallel] refactored shape consistency to remove redundancy
* polish code
* polish code
* polish code
2 years ago
YuliangLiu0306
44c866a3e3
[autoparallel] change the merge node logic ( #1533 )
2 years ago
YuliangLiu0306
4b03c25f85
[tensor]add 1D device mesh ( #1492 )
2 years ago
YuliangLiu0306
26a37b5cd5
[autoparallel] Add conv handler to generate strategies and costs info for conv ( #1467 )
2 years ago
Jiarui Fang
1b491ad7de
[doc] update docstring in ProcessGroup ( #1468 )
2 years ago
YuliangLiu0306
b73fb7a077
[tensor] support runtime ShardingSpec apply ( #1453 )
...
* [tensor] support runtime ShardingSpec apply
* polish code
* polish code
2 years ago