Hongxin Liu
dbb32692d2
[lazy] refactor lazy init ( #3891 )
* [lazy] remove old lazy init
* [lazy] refactor lazy init folder structure
* [lazy] fix lazy tensor deepcopy
* [test] update lazy init test
1 year ago
digger yu
9265f2d4d7
[NFC] fix typo colossalai/auto_parallel nn utils etc. ( #3779 )
* fix typo colossalai/autochunk auto_parallel amp
* fix typo colossalai/auto_parallel nn utils etc.
2 years ago
digger-yu
b9a8dff7e5
[doc] Fix typo under colossalai and doc ( #3618 )
* Fixed several spelling errors under colossalai
* Fix the spelling error in colossalai and docs directory
* Cautiously changed the spelling errors under the example folder
* Update runtime_preparation_pass.py: revert autograft to autograd
* Update search_chunk.py: utile to until
* Update check_installation.py: change misteach to mismatch in line 91
* Update 1D_tensor_parallel.md: revert to perceptron
* Update 2D_tensor_parallel.md: revert to perceptron in line 73
* Update 2p5D_tensor_parallel.md: revert to perceptron in line 71
* Update 3D_tensor_parallel.md: revert to perceptron in line 80
* Update README.md: revert to resnet in line 42
* Update reorder_graph.py: revert to indice in line 7
* Update p2p.py: revert to megatron in line 94
* Update initialize.py: revert to torchrun in line 198
* Update routers.py: change to detailed in line 63
* Update routers.py: change to detailed in line 146
* Update README.md: revert random number in line 402
2 years ago
Hongxin Liu
4341f5e8e6
[lazyinit] fix clone and deepcopy ( #3553 )
2 years ago
Hongxin Liu
152239bbfa
[gemini] gemini supports lazy init ( #3379 )
* [gemini] fix nvme optimizer init
* [gemini] gemini supports lazy init
* [gemini] add init example
* [gemini] add fool model
* [zero] update gemini ddp
* [zero] update init example
* add chunk method
* add chunk method
* [lazyinit] fix lazy tensor tolist
* [gemini] fix buffer materialization
* [misc] remove useless file
* [booster] update gemini plugin
* [test] update gemini plugin test
* [test] fix gemini plugin test
* [gemini] fix import
* [gemini] fix import
* [lazyinit] use new metatensor
* [lazyinit] use new metatensor
* [lazyinit] fix __set__ method
2 years ago
Frank Lee
80eba05b0a
[test] refactor tests with spawn ( #3452 )
* [test] added spawn decorator
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
2 years ago
ver217
26b7aac0be
[zero] reorganize zero/gemini folder structure ( #3424 )
* [zero] refactor low-level zero folder structure
* [zero] fix legacy zero import path
* [zero] fix legacy zero import path
* [zero] remove useless import
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] fix test import path
* [zero] fix test
* [zero] fix circular import
* [zero] update import
2 years ago
ver217
f8289d4221
[lazyinit] combine lazy tensor with dtensor ( #3204 )
* [lazyinit] lazy tensor add distribute
* [lazyinit] refactor distribute
* [lazyinit] add test dist lazy init
* [lazyinit] add verbose info for dist lazy init
* [lazyinit] fix rnn flatten weight op
* [lazyinit] polish test
* [lazyinit] polish test
* [lazyinit] fix lazy tensor data setter
* [lazyinit] polish test
* [lazyinit] fix clean
* [lazyinit] make materialize inplace
* [lazyinit] refactor materialize
* [lazyinit] refactor test distribute
* [lazyinit] fix requires_grad
* [lazyinit] fix tolist after materialization
* [lazyinit] refactor distribute module
* [lazyinit] polish docstr
* [lazyinit] polish lazy init context
* [lazyinit] temporarily skip test
* [lazyinit] polish test
* [lazyinit] add docstr
2 years ago
ver217
6ae8ed0407
[lazyinit] add correctness verification ( #3147 )
* [lazyinit] fix shared module
* [tests] add lazy init test utils
* [tests] add torchvision for lazy init
* [lazyinit] fix pre op fn
* [lazyinit] handle legacy constructor
* [tests] refactor lazy init test models
* [tests] refactor lazy init test utils
* [lazyinit] fix ops don't support meta
* [tests] lazy init test timm models
* [lazyinit] fix set data
* [lazyinit] handle apex layers
* [tests] lazy init test transformers models
* [tests] lazy init test torchaudio models
* [lazyinit] fix import path
* [tests] lazy init test torchrec models
* [tests] update torch version in CI
* [tests] revert torch version in CI
* [tests] skip lazy init test
2 years ago
ver217
ed8f60b93b
[lazyinit] refactor lazy tensor and lazy init ctx ( #3131 )
* [lazyinit] refactor lazy tensor and lazy init ctx
* [lazyinit] polish docstr
* [lazyinit] polish docstr
2 years ago
ver217
823f3b9cf4
[doc] add deepspeed citation and copyright ( #2996 )
* [doc] add deepspeed citation and copyright
* [doc] add deepspeed citation and copyright
* [doc] add deepspeed citation and copyright
2 years ago
YH
a848091141
Fix port exception type ( #2925 )
2 years ago
Nikita Shulga
01066152f1
Don't use `torch._six` ( #2775 )
* Don't use `torch._six`
This is a private API which is gone after https://github.com/pytorch/pytorch/pull/94709
* Update common.py
2 years ago
ver217
f0aa191f51
[gemini] fix colo_init_context ( #2683 )
2 years ago
HELSON
552183bb74
[polish] polish ColoTensor and its submodules ( #2537 )
2 years ago
Super Daniel
35c0c0006e
[utils] lazy init. ( #2148 )
* [utils] lazy init.
* [utils] remove description.
* [utils] complete.
* [utils] finalize.
* [utils] fix names.
2 years ago
HELSON
7829aa094e
[ddp] add is_ddp_ignored ( #2434 )
[ddp] rename to is_ddp_ignored
2 years ago
Frank Lee
40d376c566
[setup] support pre-build and jit-build of cuda kernels ( #2374 )
* [setup] support pre-build and jit-build of cuda kernels
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
2 years ago
Jiarui Fang
355ffb386e
[builder] unified cpu_optim fused_optim interface ( #2190 )
2 years ago
Jiarui Fang
9587b080ba
[builder] use runtime builder for fused_optim ( #2189 )
2 years ago
BlueRum
b3f73ce1c8
[Gemini] Update coloinit_ctx to support meta_tensor ( #2147 )
2 years ago
Jiarui Fang
8e14344ec9
[hotfix] fix a type in ColoInitContext ( #2106 )
2 years ago
Jiarui Fang
05545bfee9
[ColoTensor] throw error when ColoInitContext meets meta parameter. ( #2105 )
2 years ago
HELSON
f6178728a0
[gemini] fix init bugs for modules ( #2047 )
* [gemini] fix init bugs for modules
* fix bugs
2 years ago
Jiarui Fang
31c644027b
[hotfix] hotfix Gemini for no leaf modules bug ( #2043 )
2 years ago
ver217
f8a7148dec
[kernel] move all symlinks of kernel to `colossalai._C` ( #1971 )
2 years ago
Jiarui Fang
7e24b9b9ee
[Gemini] clean unused MemTraceOp ( #1970 )
2 years ago
Jiarui Fang
52c6ad26e0
[ColoTensor] reconfig ColoInitContext, decouple default_pg and default_dist_spec. ( #1953 )
2 years ago
Jiarui Fang
9f4fb3f28a
[ColoTensor] ColoInitContext initialize parameters in shard mode. ( #1937 )
2 years ago
Frank Lee
e6ec99d389
[utils] fixed lazy init context ( #1867 )
2 years ago
Jiarui Fang
3ce4463fe6
[utils] remove lazy_memory_allocate from ColoInitContext ( #1844 )
2 years ago
ver217
99870726b1
[CheckpointIO] a uniform checkpoint I/O module ( #1689 )
2 years ago
HELSON
1468e4bcfc
[zero] add constant placement policy ( #1705 )
* fixes memory leak when parameter is in fp16 in ZeroDDP init.
* bans chunk release in CUDA. A chunk is allowed to be released only when it is about to be offloaded.
* adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.
2 years ago
Kirigaya Kazuto
3b2a59b0ba
[pipeline/rank_recorder] fix bug when process data before backward | add a tool for multiple ranks debug ( #1681 )
* [pipeline/tuning] improve dispatch performance both time and space cost
* [pipeline/converge] add interface for testing convergence
* [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style
* Update PipelineBase.py
* [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera
* [pipeline/chimera] test chimera | fix bug of initializing
* [pipeline/pytree] add pytree to process args and kwargs | provide to process args and kwargs after forward
2 years ago
CsRic
2ac46f7be4
[NFC] polish utils/tensor_detector/__init__.py code style ( #1573 )
Co-authored-by: ric <mkkt_bkkt@mail.ustc.edu.cn>
2 years ago
LuGY
c7d4932956
[NFC] polish colossalai/utils/tensor_detector/tensor_detector.py code style ( #1566 )
2 years ago
Kirigaya Kazuto
318fbf1145
[NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style ( #1559 )
2 years ago
ver217
ae71036cd2
[utils] refactor parallel layers checkpoint and bcast model on loading checkpoint ( #1548 )
* refactor parallel layer
* broadcast rank0 model after load ckpt
2 years ago
ver217
2bed096848
[utils] optimize partition_tensor_parallel_state_dict ( #1546 )
2 years ago
ver217
a203b709d5
[hotfix] fix init context ( #1543 )
* fix init context
* fix lazy init ctx
2 years ago
Boyuan Yao
47fd8e4a02
[utils] Add use_reentrant=False in utils.activation_checkpoint ( #1460 )
* [utils] Add use_reentrant=False into colossalai checkpoint
* [utils] add some annotation in utils.activation_checkpoint
* [test] add reset_seed at the beginning of tests in test_activation_checkpointing.py
* [test] modify test_activation_checkpoint.py
* [test] modify test for reentrant=False
2 years ago
Frank Lee
5a52e21fe3
[test] fixed the activation codegen test ( #1447 )
* [test] fixed the activation codegen test
* polish code
2 years ago
ver217
821c6172e2
[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer ( #1442 )
2 years ago
HELSON
527758b2ae
[hotfix] fix a running error in test_colo_checkpoint.py ( #1387 )
2 years ago
HELSON
b6fd165f66
[checkpoint] add kwargs for load_state_dict ( #1374 )
2 years ago
Frank Lee
0c1a16ea5b
[util] standard checkpoint function naming ( #1377 )
2 years ago
Super Daniel
be229217ce
[fx] add torchaudio test ( #1369 )
* [fx]add torchaudio test
* [fx]add torchaudio test
* [fx] add torchaudio test
* [fx] add torchaudio test
* [fx] add torchaudio test
* [fx] add torchaudio test
* [fx] add torchaudio test
* [fx] add torchaudio test and test patches
* Delete ~
* [fx] add patches and patches test
* [fx] add patches and patches test
* [fx] fix patches
* [fx] fix rnn patches
* [fx] fix rnn patches
* [fx] fix rnn patches
* [fx] fix rnn patches
* [fx] merge upstream
* [fx] fix import errors
2 years ago
HELSON
8463290642
[checkpoint] use args, kwargs in save_checkpoint, load_checkpoint ( #1368 )
2 years ago
HELSON
87775a0682
[colotensor] use cpu memory to store state_dict ( #1367 )
2 years ago
HELSON
943a96323e
[hotfix] fix no optimizer in save/load ( #1363 )
2 years ago