Hongxin Liu
079bf3cb26
[misc] update pre-commit and run all files ( #4752 )
* [misc] update pre-commit
* [misc] run pre-commit
* [misc] remove useless configuration files
* [misc] ignore cuda for clang-format
1 year ago
Hongxin Liu
27061426f7
[gemini] improve compatibility and add static placement policy ( #4479 )
* [gemini] remove distributed-related part from colotensor (#4379 )
* [gemini] remove process group dependency
* [gemini] remove tp part from colo tensor
* [gemini] patch inplace op
* [gemini] fix param op hook and update tests
* [test] remove useless tests
* [test] remove useless tests
* [misc] fix requirements
* [test] fix model zoo
* [test] fix model zoo
* [test] fix model zoo
* [test] fix model zoo
* [test] fix model zoo
* [misc] update requirements
* [gemini] refactor gemini optimizer and gemini ddp (#4398 )
* [gemini] update optimizer interface
* [gemini] renaming gemini optimizer
* [gemini] refactor gemini ddp class
* [example] update gemini related example
* [example] update gemini related example
* [plugin] fix gemini plugin args
* [test] update gemini ckpt tests
* [gemini] fix checkpoint io
* [example] fix opt example requirements
* [example] fix opt example
* [example] fix opt example
* [example] fix opt example
* [gemini] add static placement policy (#4443 )
* [gemini] add static placement policy
* [gemini] fix param offload
* [test] update gemini tests
* [plugin] update gemini plugin
* [plugin] update gemini plugin docstr
* [misc] fix flash attn requirement
* [test] fix gemini checkpoint io test
* [example] update resnet example result (#4457 )
* [example] update bert example result (#4458 )
* [doc] update gemini doc (#4468 )
* [example] update gemini related examples (#4473 )
* [example] update gpt example
* [example] update dreambooth example
* [example] update vit
* [example] update opt
* [example] update palm
* [example] update vit and opt benchmark
* [hotfix] fix bert in model zoo (#4480 )
* [hotfix] fix bert in model zoo
* [test] remove chatglm gemini test
* [test] remove sam gemini test
* [test] remove vit gemini test
* [hotfix] fix opt tutorial example (#4497 )
* [hotfix] fix opt tutorial example
* [hotfix] fix opt tutorial example
1 year ago
digger-yu
b9a8dff7e5
[doc] Fix typo under colossalai and doc( #3618 )
* Fixed several spelling errors under colossalai
* Fix the spelling error in colossalai and docs directory
* Cautiously changed the spelling errors under the example folder
* Update runtime_preparation_pass.py
revert autograft to autograd
* Update search_chunk.py
utile to until
* Update check_installation.py
change misteach to mismatch in line 91
* Update 1D_tensor_parallel.md
revert to perceptron
* Update 2D_tensor_parallel.md
revert to perceptron in line 73
* Update 2p5D_tensor_parallel.md
revert to perceptron in line 71
* Update 3D_tensor_parallel.md
revert to perceptron in line 80
* Update README.md
revert to resnet in line 42
* Update reorder_graph.py
revert to indice in line 7
* Update p2p.py
revert to megatron in line 94
* Update initialize.py
revert to torchrun in line 198
* Update routers.py
change to detailed in line 63
* Update routers.py
change to detailed in line 146
* Update README.md
revert random number in line 402
2 years ago
YH
1a229045af
Add interface for colo tensor dp size ( #3227 )
2 years ago
Jiatong (Julius) Han
8c8a39be95
[hotfix]: Remove math.prod dependency ( #2837 )
* Remove math.prod dependency
* Fix style
* Fix style
---------
Co-authored-by: Jiatong Han <jiatong.han@u.nus.edu>
2 years ago
HELSON
552183bb74
[polish] polish ColoTensor and its submodules ( #2537 )
2 years ago
HELSON
707b11d4a0
[gemini] update ddp strict mode ( #2518 )
* [zero] add strict ddp mode for chunk init
* [gemini] update gpt example
2 years ago
Jiarui Fang
1aaeb596c6
[example] gpt, shard init on all processes ( #2366 )
2 years ago
xcnick
85178a397a
[hotfix] fix error for torch 2.0 ( #2243 )
2 years ago
HELSON
2458659919
[zero] fix error for BEiT models ( #2169 )
* [zero] fix error for BEiT models
* [ColoParameter] add unpack operation for tuple arguments
* fix bugs
* fix chunkv2 unit testing
* add assertion for gradient state
2 years ago
Jiarui Fang
2827f41898
[Gemini] GeminiDPP convert to PyTorch Module. ( #2151 )
2 years ago
YuliangLiu0306
49216d7ab1
[autoparallel] fix bugs caused by negative dim key ( #1808 )
* [autoparallel] fix bugs caused by negative dim key
* fix import error
* fix matmul test issue
* fix unit test issue
2 years ago
HELSON
c6a1a62636
[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 ( #1786 )
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12
* [zero] add cpu shard init
* [zero] add tiny example test
* [colo_tensor] fix bugs for torch-1.11
2 years ago
Jiarui Fang
a1476ea882
[NFC] polish doc style for ColoTensor ( #1457 )
2 years ago
HELSON
7a8702c06d
[colotensor] add Tensor.view op and its unit test ( #1343 )
[colotensor] add megatron initialization for gpt2
2 years ago
HELSON
f92c100ddd
[checkpoint] use gather_tensor in checkpoint and update its unit test ( #1339 )
2 years ago
HELSON
1b41686461
[hotfix] fix unit test test_module_spec ( #1321 )
2 years ago
HELSON
260a55804a
[hotfix] fix shape error in backward when using ColoTensor ( #1298 )
2 years ago
ver217
7aadcbd070
hotfix colotensor _scan_for_pg_from_args ( #1276 )
2 years ago
Jiarui Fang
c92f84fcdb
[tensor] distributed checkpointing for parameters ( #1240 )
2 years ago
Jiarui Fang
1aad903c15
[tensor] redistribute among different process groups ( #1247 )
* make it faster
* [tensor] rename convert_to_dist -> redistribute
* [tensor] ShardSpec and ReplicaSpec
* [tensor] redistribute among diff pgs
* polish code
2 years ago
Jiarui Fang
9bcd2fd4af
[tensor] a shorter shard and replicate spec ( #1245 )
2 years ago
Jiarui Fang
2699dfbbfd
[rename] convert_to_dist -> redistribute ( #1243 )
2 years ago
HELSON
f6add9b720
[tensor] redirect .data.__get__ to a tensor instance ( #1239 )
2 years ago
Jiarui Fang
4a76084dc9
[tensor] add zero_like colo op, important for Optimizer ( #1236 )
2 years ago
Jiarui Fang
3b500984b1
[tensor] fix some unittests ( #1234 )
2 years ago
HELSON
f071b500b6
[polish] polish __repr__ for ColoTensor, DistSpec, ProcessGroup ( #1235 )
2 years ago
Yi Zhao
04537bf83e
[checkpoint]support generalized scheduler ( #1222 )
2 years ago
Jiarui Fang
a98319f023
[tensor] torch function return colotensor ( #1229 )
2 years ago
Jiarui Fang
ae7d3f4927
[refactor] move process group from _DistSpec to ColoTensor. ( #1203 )
2 years ago
Jiarui Fang
060b917daf
[refactor] remove gpc dependency in colotensor's _ops ( #1189 )
2 years ago
Jiarui Fang
c463f8adf9
[tensor] remove gpc in tensor tests ( #1186 )
2 years ago
Jiarui Fang
1b657f9ce1
[tensor] revert local view back ( #1178 )
2 years ago
Jiarui Fang
aa7bef73d4
[Tensor] distributed view supports inter-process hybrid parallel ( #1169 )
2 years ago
Jiarui Fang
4b9bba8116
[ColoTensor] rename APIs and add output_replicate to ComputeSpec ( #1168 )
2 years ago
Jiarui Fang
f4ef224358
[Tensor] remove ParallelAction, use ComputeSpec instead ( #1166 )
2 years ago
Jiarui Fang
177c374401
remove gather out in parallel action ( #1163 )
2 years ago
Jiarui Fang
8cdce0399c
[ColoTensor] improves init functions. ( #1150 )
2 years ago
ver217
a3b66f6def
[tensor] refactor parallel action ( #1007 )
* refactor parallel action
* polish unit tests
3 years ago
ver217
ad536e308e
[tensor] refactor colo-tensor ( #992 )
* refactor colo-tensor and update linear op
* polish code
* polish code
* update ops and unit tests
* update unit tests
* polish code
* rename dist_spec module
* polish code
* polish code
* remove unneeded import
* fix pipelinable
3 years ago
Jiarui Fang
802ac297cc
[Tensor] remove useless import in tensor dir ( #997 )
3 years ago
ver217
67c33f57eb
[tensor] design DistSpec and DistSpecManager for ColoTensor ( #934 )
* add dist spec
* update linear op
* polish code
* polish code
* update embedding op
* polish unit tests
* polish unit tests
* polish comments
* polish code
* add test_dist_spec_mgr
* polish code
* refactor folder structure
* polish unit tests
* add get_process_group() for TensorSpec
* polish code
3 years ago
ver217
4ca732349e
[tensor] colo tensor overrides mul ( #927 )
* colo tensor overrides mul
* polish code
3 years ago
ver217
45b9124df4
[tensor] hijack addmm for colo tensor ( #923 )
* hijack addmm for colo tensor
* fix bugs
* polish unit test
* polish comments
3 years ago
Ziyue Jiang
c195d2814c
[Tensor] add from_pretrained support and bert pretrained test ( #921 )
* add from_pretrained support and test
* polish
* polish
* polish
* polish
3 years ago
Jiarui Fang
845856ea29
[Graph] building computing graph with ColoTensor, Linear only ( #917 )
3 years ago
Jiarui Fang
ab95ec9aea
[Tensor] init ColoParameter ( #914 )
3 years ago
Ziyue Jiang
f593a5637e
[Tensor] add embedding tp1d row ( #904 )
3 years ago
Ziyue Jiang
2c0d19d755
[Tensor] add ColoTensor TP1Dcol Embedding ( #899 )
3 years ago
Jiarui Fang
676f191532
[Tensor] activation is an attr of ColoTensor ( #897 )
3 years ago