Frank Lee
615e2e5fc1
[test] fixed lazy init test import error ( #3799 )
2 years ago
Hongxin Liu
afb239bbf8
[devops] update torch version of CI ( #3725 )
* [test] fix flop tensor test
* [test] fix autochunk test
* [test] fix lazyinit test
* [devops] update torch version of CI
* [devops] enable testmon
* [devops] fix ci
* [devops] fix ci
* [test] fix checkpoint io test
* [test] fix cluster test
* [test] fix timm test
* [devops] fix ci
* [devops] fix ci
* [devops] fix ci
* [devops] fix ci
* [devops] force sync to test ci
* [test] skip fsdp test
2 years ago
digger-yu
b7141c36dd
[CI] fix some spelling errors ( #3707 )
* fix spelling error with examples/comminity/
* fix spelling error with tests/
* fix some spelling error with tests/ colossalai/ etc.
2 years ago
Frank Lee
80eba05b0a
[test] refactor tests with spawn ( #3452 )
* [test] added spawn decorator
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
2 years ago
ver217
26b7aac0be
[zero] reorganize zero/gemini folder structure ( #3424 )
* [zero] refactor low-level zero folder structure
* [zero] fix legacy zero import path
* [zero] fix legacy zero import path
* [zero] remove useless import
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] fix test import path
* [zero] fix test
* [zero] fix circular import
* [zero] update import
2 years ago
YuliangLiu0306
045afa3ea2
[hotfix] skip torchaudio tracing test ( #3211 )
* [hotfix] skip torchaudio tracing test
* fix lazy init test issue
2 years ago
ver217
f8289d4221
[lazyinit] combine lazy tensor with dtensor ( #3204 )
* [lazyinit] lazy tensor add distribute
* [lazyinit] refactor distribute
* [lazyinit] add test dist lazy init
* [lazyinit] add verbose info for dist lazy init
* [lazyinit] fix rnn flatten weight op
* [lazyinit] polish test
* [lazyinit] polish test
* [lazyinit] fix lazy tensor data setter
* [lazyinit] polish test
* [lazyinit] fix clean
* [lazyinit] make materialize inplace
* [lazyinit] refactor materialize
* [lazyinit] refactor test distribute
* [lazyinit] fix requires_grad
* [lazyinit] fix tolist after materialization
* [lazyinit] refactor distribute module
* [lazyinit] polish docstr
* [lazyinit] polish lazy init context
* [lazyinit] temporarily skip test
* [lazyinit] polish test
* [lazyinit] add docstr
2 years ago
zbian
7bc0afc901
updated flash attention usage
2 years ago
ver217
6ae8ed0407
[lazyinit] add correctness verification ( #3147 )
* [lazyinit] fix shared module
* [tests] add lazy init test utils
* [tests] add torchvision for lazy init
* [lazyinit] fix pre op fn
* [lazyinit] handle legacy constructor
* [tests] refactor lazy init test models
* [tests] refactor lazy init test utils
* [lazyinit] fix ops don't support meta
* [tests] lazy init test timm models
* [lazyinit] fix set data
* [lazyinit] handle apex layers
* [tests] lazy init test transformers models
* [tests] lazy init test torchaudio models
* [lazyinit] fix import path
* [tests] lazy init test torchrec models
* [tests] update torch version in CI
* [tests] revert torch version in CI
* [tests] skip lazy init test
2 years ago
アマデウス
077a66dd81
updated attention kernel ( #2133 )
2 years ago
zbian
6877121377
updated flash attention api
2 years ago
ver217
99870726b1
[CheckpointIO] a uniform checkpoint I/O module ( #1689 )
2 years ago
oahzxl
9639ea88fc
[kernel] more flexible flashatt interface ( #1804 )
2 years ago
oahzxl
501a9e9cd2
[hotfix] polish flash attention ( #1802 )
2 years ago
Jiarui Fang
c248800359
[kernel] skip tests of flash_attn and triton when they are not available ( #1798 )
2 years ago
oahzxl
25952b67d7
[feat] add flash attention ( #1762 )
2 years ago
Boyuan Yao
47fd8e4a02
[utils] Add use_reetrant=False in utils.activation_checkpoint ( #1460 )
* [utils] Add use_reetrant=False into colossalai checkpoint
* [utils] add some annotation in utils.activaion_checkpoint
* [test] add reset_seed at the beginning of tests in test_actiavion_checkpointing.py
* [test] modify test_activation_checkpoint.py
* [test] modify test for reentrant=False
2 years ago
Jiarui Fang
36824a304c
[Doc] add more doc for ColoTensor. ( #1458 )
2 years ago
ver217
821c6172e2
[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer ( #1442 )
2 years ago
HELSON
527758b2ae
[hotfix] fix a running error in test_colo_checkpoint.py ( #1387 )
2 years ago
HELSON
7a8702c06d
[colotensor] add Tensor.view op and its unit test ( #1343 )
[colotensor] add megatron initialization for gpt2
2 years ago
HELSON
f92c100ddd
[checkpoint] use gather_tensor in checkpoint and update its unit test ( #1339 )
2 years ago
Frank Lee
169954f87e
[test] removed outdated unit test for meta context ( #1329 )
2 years ago
Frank Lee
250be4d31e
[utils] integrated colotensor with lazy init context ( #1324 )
* [utils] integrated colotensor with lazy init context
* polish code
* polish code
* polish code
2 years ago
Jiarui Fang
9e4c6449b0
[checkpoint] add ColoOptimizer checkpointing ( #1316 )
2 years ago
Jiarui Fang
85f933b58b
[Optimizer] Remove useless ColoOptimizer ( #1312 )
2 years ago
Jiarui Fang
9f10524313
[Optimizer] polish the init method of ColoOptimizer ( #1310 )
2 years ago
Jiarui Fang
3ef3791a3b
[checkpoint] add test for bert and hotfix save bugs ( #1297 )
2 years ago
Frank Lee
7e8114a8dd
[hotfix] skipped unsafe test cases ( #1282 )
2 years ago
Jiarui Fang
c92f84fcdb
[tensor] distributed checkpointing for parameters ( #1240 )
2 years ago
Jiarui Fang
9bcd2fd4af
[tensor] a shorter shard and replicate spec ( #1245 )
2 years ago
Jiarui Fang
20da6e48c8
[checkpoint] save sharded optimizer states ( #1237 )
2 years ago
Jiarui Fang
3b500984b1
[tensor] fix some unittests ( #1234 )
2 years ago
Yi Zhao
04537bf83e
[checkpoint]support generalized scheduler ( #1222 )
2 years ago
Jiarui Fang
52736205d9
[checkpoint] make unitest faster ( #1217 )
2 years ago
Jiarui Fang
f38006ea83
[checkpoint] checkpoint for ColoTensor Model ( #1196 )
2 years ago
YuliangLiu0306
63d2a93878
[context]support arbitary module materialization. ( #1193 )
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [context]support arbitary module materialization.
* [test]add numerical check for lazy init context.
2 years ago
YuliangLiu0306
2053e138a2
[context]use meta tensor to init model lazily. ( #1187 )
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [context]use meta tensor to init model lazily.
* polish
* make module with device kwargs bypass the normal init.
* change unit test to adapt updated context.
2 years ago
ver217
d26902645e
[ddp] add save/load state dict for ColoDDP ( #1127 )
* add save/load state dict for ColoDDP
* add unit test
* refactor unit test folder
* polish unit test
* rename unit test
2 years ago
ver217
f0a954f16d
[ddp] add set_params_to_ignore for ColoDDP ( #1122 )
* add set_params_to_ignore for ColoDDP
* polish code
* fix zero hook v2
* add unit test
* polish docstr
2 years ago
Frank Lee
2b2dc1c86b
[pipeline] refactor the pipeline module ( #1087 )
* [pipeline] refactor the pipeline module
* polish code
3 years ago
Frank Lee
bad5d4c0a1
[context] support lazy init of module ( #1088 )
* [context] support lazy init of module
* polish code
3 years ago
Frank Lee
50ec3a7e06
[test] skip tests when not enough GPUs are detected ( #1090 )
* [test] skip tests when not enough GPUs are detected
* polish code
* polish code
3 years ago
Frank Lee
65ee6dcc20
[test] ignore 8 gpu test ( #1080 )
* [test] ignore 8 gpu test
* polish code
* polish workflow
* polish workflow
3 years ago
Jiarui Fang
49832b2344
[refactory] add nn.parallel module ( #1068 )
3 years ago
HELSON
e5ea3fdeef
[gemini] add GeminiMemoryManger ( #832 )
* refactor StatefulTensor, tensor utilities
* add unitest for GeminiMemoryManager
3 years ago
YuliangLiu0306
35ea6e1023
[pipelinable]use pipelinable context to initialize non-pipeline model ( #816 )
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [pipeline]add module lazy init feature to support large model initization.
* [pipeline]add to_layer_list and partition method to support arbitrary non-pp model
* refactor the module structure
* polish
* [pipelinable]add unit test for pipelinable
* polish
* polish
* Fix CodeFactor issues.
3 years ago
Jiarui Fang
681addb512
[refactor] moving grad acc logic to engine ( #804 )
3 years ago
Frank Lee
5a1a095b92
[test] refactored with the new rerun decorator ( #763 )
* [test] refactored with the new rerun decorator
* polish test case
3 years ago
Jiarui Fang
53cb584808
[utils] correct cpu memory used and capacity in the context of multi-process ( #726 )
3 years ago