oahzxl
25952b67d7
[feat] add flash attention ( #1762 )
2 years ago
Boyuan Yao
47fd8e4a02
[utils] Add use_reetrant=False in utils.activation_checkpoint ( #1460 )
...
* [utils] Add use_reetrant=False into colossalai checkpoint
* [utils] add some annotation in utils.activaion_checkpoint
* [test] add reset_seed at the beginning of tests in test_actiavion_checkpointing.py
* [test] modify test_activation_checkpoint.py
* [test] modify test for reentrant=False
2 years ago
Jiarui Fang
36824a304c
[Doc] add more doc for ColoTensor. ( #1458 )
2 years ago
ver217
821c6172e2
[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer ( #1442 )
2 years ago
HELSON
527758b2ae
[hotfix] fix a running error in test_colo_checkpoint.py ( #1387 )
2 years ago
HELSON
7a8702c06d
[colotensor] add Tensor.view op and its unit test ( #1343 )
...
[colotensor] add megatron initialization for gpt2
2 years ago
HELSON
f92c100ddd
[checkpoint] use gather_tensor in checkpoint and update its unit test ( #1339 )
2 years ago
Frank Lee
169954f87e
[test] removed outdated unit test for meta context ( #1329 )
2 years ago
Frank Lee
250be4d31e
[utils] integrated colotensor with lazy init context ( #1324 )
...
* [utils] integrated colotensor with lazy init context
* polish code
* polish code
* polish code
2 years ago
Jiarui Fang
9e4c6449b0
[checkpoint] add ColoOptimizer checkpointing ( #1316 )
2 years ago
Jiarui Fang
85f933b58b
[Optimizer] Remove useless ColoOptimizer ( #1312 )
2 years ago
Jiarui Fang
9f10524313
[Optimizer] polish the init method of ColoOptimizer ( #1310 )
2 years ago
Jiarui Fang
3ef3791a3b
[checkpoint] add test for bert and hotfix save bugs ( #1297 )
2 years ago
Frank Lee
7e8114a8dd
[hotfix] skipped unsafe test cases ( #1282 )
2 years ago
Jiarui Fang
c92f84fcdb
[tensor] distributed checkpointing for parameters ( #1240 )
2 years ago
Jiarui Fang
9bcd2fd4af
[tensor] a shorter shard and replicate spec ( #1245 )
2 years ago
Jiarui Fang
20da6e48c8
[checkpoint] save sharded optimizer states ( #1237 )
2 years ago
Jiarui Fang
3b500984b1
[tensor] fix some unittests ( #1234 )
2 years ago
Yi Zhao
04537bf83e
[checkpoint]support generalized scheduler ( #1222 )
2 years ago
Jiarui Fang
52736205d9
[checkpoint] make unitest faster ( #1217 )
2 years ago
Jiarui Fang
f38006ea83
[checkpoint] checkpoint for ColoTensor Model ( #1196 )
2 years ago
YuliangLiu0306
63d2a93878
[context]support arbitary module materialization. ( #1193 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4
.
* [context]support arbitary module materialization.
* [test]add numerical check for lazy init context.
2 years ago
YuliangLiu0306
2053e138a2
[context]use meta tensor to init model lazily. ( #1187 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4
.
* [context]use meta tensor to init model lazily.
* polish
* make module with device kwargs bypass the normal init.
* change unit test to adapt updated context.
2 years ago
ver217
d26902645e
[ddp] add save/load state dict for ColoDDP ( #1127 )
...
* add save/load state dict for ColoDDP
* add unit test
* refactor unit test folder
* polish unit test
* rename unit test
2 years ago
ver217
f0a954f16d
[ddp] add set_params_to_ignore for ColoDDP ( #1122 )
...
* add set_params_to_ignore for ColoDDP
* polish code
* fix zero hook v2
* add unit test
* polish docstr
2 years ago
Frank Lee
2b2dc1c86b
[pipeline] refactor the pipeline module ( #1087 )
...
* [pipeline] refactor the pipeline module
* polish code
3 years ago
Frank Lee
bad5d4c0a1
[context] support lazy init of module ( #1088 )
...
* [context] support lazy init of module
* polish code
3 years ago
Frank Lee
50ec3a7e06
[test] skip tests when not enough GPUs are detected ( #1090 )
...
* [test] skip tests when not enough GPUs are detected
* polish code
* polish code
3 years ago
Frank Lee
65ee6dcc20
[test] ignore 8 gpu test ( #1080 )
...
* [test] ignore 8 gpu test
* polish code
* polish workflow
* polish workflow
3 years ago
Jiarui Fang
49832b2344
[refactory] add nn.parallel module ( #1068 )
3 years ago
HELSON
e5ea3fdeef
[gemini] add GeminiMemoryManger ( #832 )
...
* refactor StatefulTensor, tensor utilities
* add unitest for GeminiMemoryManager
3 years ago
YuliangLiu0306
35ea6e1023
[pipelinable]use pipelinable context to initialize non-pipeline model ( #816 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4
.
* [pipeline]add module lazy init feature to support large model initization.
* [pipeline]add to_layer_list and partition method to support arbitrary non-pp model
* refactor the module structure
* polish
* [pipelinable]add unit test for pipelinable
* polish
* polish
* Fix CodeFactor issues.
3 years ago
Jiarui Fang
681addb512
[refactor] moving grad acc logic to engine ( #804 )
3 years ago
Frank Lee
5a1a095b92
[test] refactored with the new rerun decorator ( #763 )
...
* [test] refactored with the new rerun decorator
* polish test case
3 years ago
Jiarui Fang
53cb584808
[utils] correct cpu memory used and capacity in the context of multi-process ( #726 )
3 years ago
FrankLeeeee
62b4ce7326
[test] added missing decorators to model checkpointing tests
3 years ago
Jiarui Fang
4d90a7b513
[refactor] zero directory ( #724 )
3 years ago
Jiarui Fang
193dc8dacb
[refactor] refactor the memory utils ( #715 )
3 years ago
HELSON
e5d615aeee
[hotfix] fix bugs in testing ( #659 )
...
* remove hybrid adam in test_moe_zero_optim
* fix activation checkpointing and its unitest
3 years ago
アマデウス
354b7954d1
[model checkpoint] added unit tests for checkpoint save/load ( #599 )
3 years ago
FredHuang99
93f14d2a33
[zero] test zero tensor utils ( #609 )
3 years ago
Jiarui Fang
e956d93ac2
[refactor] memory utils ( #577 )
3 years ago
Jiarui Fang
705f56107c
[zero] refactor model data tracing ( #537 )
3 years ago
Jiarui Fang
8d8c5407c0
[zero] refactor model data tracing ( #522 )
3 years ago
Frank Lee
3601b2bad0
[test] fixed rerun_on_exception and adapted test cases ( #487 )
3 years ago
Jiarui Fang
4d322b79da
[refactor] remove old zero code ( #517 )
3 years ago
Jiarui Fang
920c5889a7
[zero] add colo move inline ( #521 )
3 years ago
Jiarui Fang
7ef3507ace
[zero] show model data cuda memory usage after zero context init. ( #515 )
3 years ago
Jiarui Fang
9330be0f3c
[memory] set cuda mem frac ( #506 )
3 years ago
Jiarui Fang
0035b7be07
[memory] add model data tensor moving api ( #503 )
3 years ago