Commit Graph

66 Commits (cb5a587e9aa545a41980ee68e88bf5edf59c44cb)

Author SHA1 Message Date
oahzxl 25952b67d7
[feat] add flash attention (#1762)
2 years ago
Boyuan Yao 47fd8e4a02
[utils] Add use_reetrant=False in utils.activation_checkpoint (#1460)
2 years ago
Jiarui Fang 36824a304c
[Doc] add more doc for ColoTensor. (#1458)
2 years ago
ver217 821c6172e2
[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442)
2 years ago
HELSON 527758b2ae
[hotfix] fix a running error in test_colo_checkpoint.py (#1387)
2 years ago
HELSON 7a8702c06d
[colotensor] add Tensor.view op and its unit test (#1343)
2 years ago
HELSON f92c100ddd
[checkpoint] use gather_tensor in checkpoint and update its unit test (#1339)
2 years ago
Frank Lee 169954f87e
[test] removed outdated unit test for meta context (#1329)
2 years ago
Frank Lee 250be4d31e
[utils] integrated colotensor with lazy init context (#1324)
2 years ago
Jiarui Fang 9e4c6449b0
[checkpoint] add ColoOptimizer checkpointing (#1316)
2 years ago
Jiarui Fang 85f933b58b
[Optimizer] Remove useless ColoOptimizer (#1312)
2 years ago
Jiarui Fang 9f10524313
[Optimizer] polish the init method of ColoOptimizer (#1310)
2 years ago
Jiarui Fang 3ef3791a3b
[checkpoint] add test for bert and hotfix save bugs (#1297)
2 years ago
Frank Lee 7e8114a8dd
[hotfix] skipped unsafe test cases (#1282)
2 years ago
Jiarui Fang c92f84fcdb
[tensor] distributed checkpointing for parameters (#1240)
2 years ago
Jiarui Fang 9bcd2fd4af
[tensor] a shorter shard and replicate spec (#1245)
2 years ago
Jiarui Fang 20da6e48c8
[checkpoint] save sharded optimizer states (#1237)
2 years ago
Jiarui Fang 3b500984b1
[tensor] fix some unittests (#1234)
2 years ago
Yi Zhao 04537bf83e
[checkpoint]support generalized scheduler (#1222)
2 years ago
Jiarui Fang 52736205d9
[checkpoint] make unitest faster (#1217)
2 years ago
Jiarui Fang f38006ea83
[checkpoint] checkpoint for ColoTensor Model (#1196)
2 years ago
YuliangLiu0306 63d2a93878
[context]support arbitary module materialization. (#1193)
2 years ago
YuliangLiu0306 2053e138a2
[context]use meta tensor to init model lazily. (#1187)
2 years ago
ver217 d26902645e
[ddp] add save/load state dict for ColoDDP (#1127)
2 years ago
ver217 f0a954f16d
[ddp] add set_params_to_ignore for ColoDDP (#1122)
2 years ago
Frank Lee 2b2dc1c86b
[pipeline] refactor the pipeline module (#1087)
2 years ago
Frank Lee bad5d4c0a1
[context] support lazy init of module (#1088)
2 years ago
Frank Lee 50ec3a7e06
[test] skip tests when not enough GPUs are detected (#1090)
3 years ago
Frank Lee 65ee6dcc20
[test] ignore 8 gpu test (#1080)
3 years ago
Jiarui Fang 49832b2344
[refactory] add nn.parallel module (#1068)
3 years ago
HELSON e5ea3fdeef
[gemini] add GeminiMemoryManger (#832)
3 years ago
YuliangLiu0306 35ea6e1023
[pipelinable]use pipelinable context to initialize non-pipeline model (#816)
3 years ago
Jiarui Fang 681addb512
[refactor] moving grad acc logic to engine (#804)
3 years ago
Frank Lee 5a1a095b92
[test] refactored with the new rerun decorator (#763)
3 years ago
Jiarui Fang 53cb584808
[utils] correct cpu memory used and capacity in the context of multi-process (#726)
3 years ago
FrankLeeeee 62b4ce7326
[test] added missing decorators to model checkpointing tests
3 years ago
Jiarui Fang 4d90a7b513
[refactor] zero directory (#724)
3 years ago
Jiarui Fang 193dc8dacb
[refactor] refactor the memory utils (#715)
3 years ago
HELSON e5d615aeee
[hotfix] fix bugs in testing (#659)
3 years ago
アマデウス 354b7954d1
[model checkpoint] added unit tests for checkpoint save/load (#599)
3 years ago
FredHuang99 93f14d2a33
[zero] test zero tensor utils (#609)
3 years ago
Jiarui Fang e956d93ac2
[refactor] memory utils (#577)
3 years ago
Jiarui Fang 705f56107c
[zero] refactor model data tracing (#537)
3 years ago
Jiarui Fang 8d8c5407c0
[zero] refactor model data tracing (#522)
3 years ago
Frank Lee 3601b2bad0
[test] fixed rerun_on_exception and adapted test cases (#487)
3 years ago
Jiarui Fang 4d322b79da
[refactor] remove old zero code (#517)
3 years ago
Jiarui Fang 920c5889a7
[zero] add colo move inline (#521)
3 years ago
Jiarui Fang 7ef3507ace
[zero] show model data cuda memory usage after zero context init. (#515)
3 years ago
Jiarui Fang 9330be0f3c
[memory] set cuda mem frac (#506)
3 years ago
Jiarui Fang 0035b7be07
[memory] add model data tensor moving api (#503)
3 years ago