Nikita Shulga
01066152f1
Don't use `torch._six` ( #2775 )
* Don't use `torch._six`
This is a private API which is gone after https://github.com/pytorch/pytorch/pull/94709
* Update common.py
2023-02-17 09:22:45 +08:00
ver217
f0aa191f51
[gemini] fix colo_init_context ( #2683 )
2023-02-13 17:53:15 +08:00
HELSON
552183bb74
[polish] polish ColoTensor and its submodules ( #2537 )
2023-02-03 11:44:10 +08:00
Super Daniel
35c0c0006e
[utils] lazy init. ( #2148 )
* [utils] lazy init.
* [utils] remove description.
* [utils] complete.
* [utils] finalize.
* [utils] fix names.
2023-01-20 10:49:00 +08:00
HELSON
7829aa094e
[ddp] add is_ddp_ignored ( #2434 )
[ddp] rename to is_ddp_ignored
2023-01-11 12:22:45 +08:00
Frank Lee
40d376c566
[setup] support pre-build and jit-build of cuda kernels ( #2374 )
* [setup] support pre-build and jit-build of cuda kernels
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
2023-01-06 20:50:26 +08:00
Jiarui Fang
355ffb386e
[builder] unified cpu_optim fused_optim interface ( #2190 )
2022-12-23 20:57:41 +08:00
Jiarui Fang
9587b080ba
[builder] use runtime builder for fused_optim ( #2189 )
2022-12-23 17:07:03 +08:00
BlueRum
b3f73ce1c8
[Gemini] Update coloinit_ctx to support meta_tensor ( #2147 )
2022-12-19 22:37:07 +08:00
Jiarui Fang
8e14344ec9
[hotfix] fix a type in ColoInitContext ( #2106 )
2022-12-09 11:44:39 +08:00
Jiarui Fang
05545bfee9
[ColoTensor] throw error when ColoInitContext meets meta parameter. ( #2105 )
2022-12-09 11:39:46 +08:00
HELSON
f6178728a0
[gemini] fix init bugs for modules ( #2047 )
* [gemini] fix init bugs for modules
* fix bugs
2022-11-30 17:06:10 +08:00
Jiarui Fang
31c644027b
[hotfix] hotfix Gemini for no leaf modules bug ( #2043 )
2022-11-30 14:53:41 +08:00
ver217
f8a7148dec
[kernel] move all symlinks of kernel to `colossalai._C` ( #1971 )
2022-11-17 13:42:33 +08:00
Jiarui Fang
7e24b9b9ee
[Gemini] clean up unused MemTraceOp ( #1970 )
2022-11-17 13:41:54 +08:00
Jiarui Fang
52c6ad26e0
[ColoTensor] reconfig ColoInitContext, decouple default_pg and default_dist_spec. ( #1953 )
2022-11-15 16:24:16 +08:00
Jiarui Fang
9f4fb3f28a
[ColoTensor] ColoInitContext initialize parameters in shard mode. ( #1937 )
2022-11-14 16:05:09 +08:00
Frank Lee
e6ec99d389
[utils] fixed lazy init context ( #1867 )
2022-11-10 15:17:20 +08:00
Jiarui Fang
3ce4463fe6
[utils] remove lazy_memory_allocate from ColoInitContext ( #1844 )
2022-11-09 11:50:33 +08:00
ver217
99870726b1
[CheckpointIO] a uniform checkpoint I/O module ( #1689 )
2022-11-08 15:15:13 +08:00
HELSON
1468e4bcfc
[zero] add constant placement policy ( #1705 )
* fixes a memory leak when a parameter is in fp16 during ZeroDDP init.
* bans chunk release in CUDA. A chunk may be released only when it is about to be offloaded.
* adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.
2022-10-14 17:53:16 +08:00
Kirigaya Kazuto
3b2a59b0ba
[pipeline/rank_recorder] fix bug when process data before backward | add a tool for multiple ranks debug ( #1681 )
* [pipeline/tuning] improve dispatch performance in both time and space cost
* [pipeline/converge] add interface for testing convergence
* [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style
* Update PipelineBase.py
* [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera
* [pipeline/chimera] test chimera | fix bug of initializing
* [pipeline/pytree] add pytree to process args and kwargs | provide to process args and kwargs after forward
2022-10-09 17:32:57 +08:00
CsRic
2ac46f7be4
[NFC] polish utils/tensor_detector/__init__.py code style ( #1573 )
Co-authored-by: ric <mkkt_bkkt@mail.ustc.edu.cn>
2022-09-08 22:11:04 +08:00
LuGY
c7d4932956
[NFC] polish colossalai/utils/tensor_detector/tensor_detector.py code style ( #1566 )
2022-09-08 22:11:04 +08:00
Kirigaya Kazuto
318fbf1145
[NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style ( #1559 )
2022-09-08 22:04:34 +08:00
ver217
ae71036cd2
[utils] refactor parallel layers checkpoint and bcast model on loading checkpoint ( #1548 )
* refactor parallel layer
* broadcast rank0 model after load ckpt
2022-09-06 20:18:35 +08:00
ver217
2bed096848
[utils] optimize partition_tensor_parallel_state_dict ( #1546 )
2022-09-06 17:45:31 +08:00
ver217
a203b709d5
[hotfix] fix init context ( #1543 )
* fix init context
* fix lazy init ctx
2022-09-06 11:45:08 +08:00
Boyuan Yao
47fd8e4a02
[utils] Add use_reetrant=False in utils.activation_checkpoint ( #1460 )
* [utils] Add use_reetrant=False into colossalai checkpoint
* [utils] add some annotations in utils.activation_checkpoint
* [test] add reset_seed at the beginning of tests in test_actiavion_checkpointing.py
* [test] modify test_activation_checkpoint.py
* [test] modify test for reentrant=False
2022-08-16 15:39:20 +08:00
Frank Lee
5a52e21fe3
[test] fixed the activation codegen test ( #1447 )
* [test] fixed the activation codegen test
* polish code
2022-08-12 14:52:31 +08:00
ver217
821c6172e2
[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer ( #1442 )
2022-08-11 22:58:58 +08:00
HELSON
527758b2ae
[hotfix] fix a running error in test_colo_checkpoint.py ( #1387 )
2022-07-29 15:58:06 +08:00
HELSON
b6fd165f66
[checkpoint] add kwargs for load_state_dict ( #1374 )
2022-07-28 15:56:52 +08:00
Frank Lee
0c1a16ea5b
[util] standard checkpoint function naming ( #1377 )
2022-07-28 09:29:30 +08:00
Super Daniel
be229217ce
[fx] add torchaudio test ( #1369 )
* [fx] add torchaudio test
* [fx] add torchaudio test
* [fx] add torchaudio test
* [fx] add torchaudio test
* [fx] add torchaudio test
* [fx] add torchaudio test
* [fx] add torchaudio test
* [fx] add torchaudio test and test patches
* Delete ~
* [fx] add patches and patches test
* [fx] add patches and patches test
* [fx] fix patches
* [fx] fix rnn patches
* [fx] fix rnn patches
* [fx] fix rnn patches
* [fx] fix rnn patches
* [fx] merge upstream
* [fx] fix import errors
2022-07-27 11:03:14 +08:00
HELSON
8463290642
[checkpoint] use args, kwargs in save_checkpoint, load_checkpoint ( #1368 )
2022-07-26 14:41:53 +08:00
HELSON
87775a0682
[colotensor] use cpu memory to store state_dict ( #1367 )
2022-07-26 14:13:38 +08:00
HELSON
943a96323e
[hotfix] fix no optimizer in save/load ( #1363 )
2022-07-26 10:53:53 +08:00
HELSON
7a8702c06d
[colotensor] add Tensor.view op and its unit test ( #1343 )
[colotensor] add megatron initialization for gpt2
2022-07-21 10:53:15 +08:00
Frank Lee
2cc1175c76
[fx] tested the complete workflow for auto-parallel ( #1336 )
* [fx] tested the complete workflow for auto-parallel
* polish code
* polish code
* polish code
2022-07-20 10:45:17 +08:00
HELSON
f92c100ddd
[checkpoint] use gather_tensor in checkpoint and update its unit test ( #1339 )
2022-07-19 14:15:28 +08:00
Frank Lee
250be4d31e
[utils] integrated colotensor with lazy init context ( #1324 )
* [utils] integrated colotensor with lazy init context
* polish code
* polish code
* polish code
2022-07-15 17:47:12 +08:00
Jiarui Fang
9e4c6449b0
[checkpoint] add ColoOptimizer checkpointing ( #1316 )
2022-07-15 09:52:55 +08:00
Jiarui Fang
3ef3791a3b
[checkpoint] add test for bert and hotfix save bugs ( #1297 )
2022-07-14 15:38:18 +08:00
Jiarui Fang
4165eabb1e
[hotfix] remove potential circular import ( #1307 )
* make it faster
* [hotfix] remove circular import
2022-07-14 13:44:26 +08:00
Jiarui Fang
c92f84fcdb
[tensor] distributed checkpointing for parameters ( #1240 )
2022-07-12 15:51:06 +08:00
Jiarui Fang
9bcd2fd4af
[tensor] a shorter shard and replicate spec ( #1245 )
2022-07-11 15:51:48 +08:00
Jiarui Fang
20da6e48c8
[checkpoint] save sharded optimizer states ( #1237 )
2022-07-08 16:33:13 +08:00
Jiarui Fang
3b500984b1
[tensor] fix some unittests ( #1234 )
2022-07-08 14:18:30 +08:00
ver217
a45ddf2d5f
[hotfix] fix sharded optim step and clip_grad_norm ( #1226 )
2022-07-08 13:34:48 +08:00