Commit Graph

65 Commits (4b03c25f85cd2dbe91a07c8febff3b2452ea39c2)

Author SHA1 Message Date
Boyuan Yao 47fd8e4a02
[utils] Add use_reetrant=False in utils.activation_checkpoint (#1460)
* [utils] Add use_reetrant=False into colossalai checkpoint

* [utils] add some annotation in utils.activaion_checkpoint

* [test] add reset_seed at the beginning of tests in test_actiavion_checkpointing.py

* [test] modify test_activation_checkpoint.py

* [test] modify test for reentrant=False
2022-08-16 15:39:20 +08:00
Jiarui Fang 36824a304c
[Doc] add more doc for ColoTensor. (#1458) 2022-08-16 10:38:41 +08:00
ver217 821c6172e2
[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442) 2022-08-11 22:58:58 +08:00
HELSON 527758b2ae
[hotfix] fix a running error in test_colo_checkpoint.py (#1387) 2022-07-29 15:58:06 +08:00
HELSON 7a8702c06d
[colotensor] add Tensor.view op and its unit test (#1343)
[colotensor] add megatron initialization for gpt2
2022-07-21 10:53:15 +08:00
HELSON f92c100ddd
[checkpoint] use gather_tensor in checkpoint and update its unit test (#1339) 2022-07-19 14:15:28 +08:00
Frank Lee 169954f87e
[test] removed outdated unit test for meta context (#1329) 2022-07-15 23:16:23 +08:00
Frank Lee 250be4d31e
[utils] integrated colotensor with lazy init context (#1324)
* [utils] integrated colotensor with lazy init context

* polish code

* polish code

* polish code
2022-07-15 17:47:12 +08:00
Jiarui Fang 9e4c6449b0
[checkpoint] add ColoOptimizer checkpointing (#1316) 2022-07-15 09:52:55 +08:00
Jiarui Fang 85f933b58b
[Optimizer] Remove useless ColoOptimizer (#1312) 2022-07-14 16:57:48 +08:00
Jiarui Fang 9f10524313
[Optimizer] polish the init method of ColoOptimizer (#1310) 2022-07-14 16:37:33 +08:00
Jiarui Fang 3ef3791a3b
[checkpoint] add test for bert and hotfix save bugs (#1297) 2022-07-14 15:38:18 +08:00
Frank Lee 7e8114a8dd
[hotfix] skipped unsafe test cases (#1282) 2022-07-13 00:08:59 +08:00
Jiarui Fang c92f84fcdb
[tensor] distributed checkpointing for parameters (#1240) 2022-07-12 15:51:06 +08:00
Jiarui Fang 9bcd2fd4af
[tensor] a shorter shard and replicate spec (#1245) 2022-07-11 15:51:48 +08:00
Jiarui Fang 20da6e48c8
[checkpoint] save sharded optimizer states (#1237) 2022-07-08 16:33:13 +08:00
Jiarui Fang 3b500984b1
[tensor] fix some unittests (#1234) 2022-07-08 14:18:30 +08:00
Yi Zhao 04537bf83e
[checkpoint]support generalized scheduler (#1222) 2022-07-07 18:16:38 +08:00
Jiarui Fang 52736205d9
[checkpoint] make unitest faster (#1217) 2022-07-06 17:39:46 +08:00
Jiarui Fang f38006ea83
[checkpoint] checkpoint for ColoTensor Model (#1196) 2022-07-06 17:22:03 +08:00
YuliangLiu0306 63d2a93878
[context]support arbitary module materialization. (#1193)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [context]support arbitary module materialization.

* [test]add numerical check for lazy init context.
2022-07-04 10:12:02 +08:00
YuliangLiu0306 2053e138a2
[context]use meta tensor to init model lazily. (#1187)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [context]use meta tensor to init model lazily.

* polish

* make module with device kwargs bypass the normal init.

* change unit test to adapt updated context.
2022-06-29 21:02:30 +08:00
ver217 d26902645e
[ddp] add save/load state dict for ColoDDP (#1127)
* add save/load state dict for ColoDDP

* add unit test

* refactor unit test folder

* polish unit test

* rename unit test
2022-06-20 10:51:47 +08:00
ver217 f0a954f16d
[ddp] add set_params_to_ignore for ColoDDP (#1122)
* add set_params_to_ignore for ColoDDP

* polish code

* fix zero hook v2

* add unit test

* polish docstr
2022-06-16 12:54:46 +08:00
Frank Lee 2b2dc1c86b
[pipeline] refactor the pipeline module (#1087)
* [pipeline] refactor the pipeline module

* polish code
2022-06-10 11:27:38 +08:00
Frank Lee bad5d4c0a1
[context] support lazy init of module (#1088)
* [context] support lazy init of module

* polish code
2022-06-10 10:09:48 +08:00
Frank Lee 50ec3a7e06
[test] skip tests when not enough GPUs are detected (#1090)
* [test] skip tests when not enough GPUs are detected

* polish code

* polish code
2022-06-09 17:19:13 +08:00
Frank Lee 65ee6dcc20
[test] ignore 8 gpu test (#1080)
* [test] ignore 8 gpu test

* polish code

* polish workflow

* polish workflow
2022-06-08 23:14:18 +08:00
Jiarui Fang 49832b2344
[refactory] add nn.parallel module (#1068) 2022-06-06 15:34:41 +08:00
HELSON e5ea3fdeef
[gemini] add GeminiMemoryManger (#832)
* refactor StatefulTensor, tensor utilities

* add unitest for GeminiMemoryManager
2022-04-24 13:08:48 +08:00
YuliangLiu0306 35ea6e1023
[pipelinable]use pipelinable context to initialize non-pipeline model (#816)
* [CLI] add CLI launcher

* Revert "[CLI] add CLI launcher"

This reverts commit df7e6506d4.

* [pipeline]add module lazy init feature to support large model initization.

* [pipeline]add to_layer_list and partition method to support arbitrary non-pp model

* refactor the module structure

* polish

* [pipelinable]add unit test for pipelinable

* polish

* polish

* Fix CodeFactor issues.
2022-04-24 13:03:12 +08:00
Jiarui Fang 681addb512
[refactor] moving grad acc logic to engine (#804) 2022-04-19 14:03:21 +08:00
Frank Lee 5a1a095b92
[test] refactored with the new rerun decorator (#763)
* [test] refactored with the new rerun decorator

* polish test case
2022-04-15 00:33:04 +08:00
Jiarui Fang 53cb584808
[utils] correct cpu memory used and capacity in the context of multi-process (#726) 2022-04-12 14:57:54 +08:00
FrankLeeeee 62b4ce7326 [test] added missing decorators to model checkpointing tests 2022-04-12 11:08:15 +08:00
Jiarui Fang 4d90a7b513
[refactor] zero directory (#724) 2022-04-11 23:13:02 +08:00
Jiarui Fang 193dc8dacb
[refactor] refactor the memory utils (#715) 2022-04-11 16:47:57 +08:00
HELSON e5d615aeee
[hotfix] fix bugs in testing (#659)
* remove hybrid adam in test_moe_zero_optim

* fix activation checkpointing and its unitest
2022-04-02 21:58:47 +08:00
アマデウス 354b7954d1
[model checkpoint] added unit tests for checkpoint save/load (#599) 2022-04-01 16:53:32 +08:00
FredHuang99 93f14d2a33
[zero] test zero tensor utils (#609) 2022-04-01 15:16:59 +08:00
Jiarui Fang e956d93ac2
[refactor] memory utils (#577) 2022-04-01 09:22:33 +08:00
Jiarui Fang 705f56107c
[zero] refactor model data tracing (#537) 2022-03-28 16:38:18 +08:00
Jiarui Fang 8d8c5407c0
[zero] refactor model data tracing (#522) 2022-03-25 18:03:32 +08:00
Frank Lee 3601b2bad0
[test] fixed rerun_on_exception and adapted test cases (#487) 2022-03-25 17:25:12 +08:00
Jiarui Fang 4d322b79da
[refactor] remove old zero code (#517) 2022-03-25 14:54:39 +08:00
Jiarui Fang 920c5889a7
[zero] add colo move inline (#521) 2022-03-25 14:02:55 +08:00
Jiarui Fang 7ef3507ace
[zero] show model data cuda memory usage after zero context init. (#515) 2022-03-25 11:23:35 +08:00
Jiarui Fang 9330be0f3c
[memory] set cuda mem frac (#506) 2022-03-24 16:57:13 +08:00
Jiarui Fang 0035b7be07
[memory] add model data tensor moving api (#503) 2022-03-24 14:29:41 +08:00
Jiarui Fang b334822163
[zero] polish sharded param name (#484)
* [zero] polish sharded param name

* polish code

* polish

* polish code

* polish

* polsih

* polish
2022-03-22 14:36:16 +08:00