Jiarui Fang
c92f84fcdb
[tensor] distributed checkpointing for parameters ( #1240 )
2022-07-12 15:51:06 +08:00
Jiarui Fang
9bcd2fd4af
[tensor] a shorter shard and replicate spec ( #1245 )
2022-07-11 15:51:48 +08:00
Jiarui Fang
20da6e48c8
[checkpoint] save sharded optimizer states ( #1237 )
2022-07-08 16:33:13 +08:00
Jiarui Fang
3b500984b1
[tensor] fix some unittests ( #1234 )
2022-07-08 14:18:30 +08:00
Yi Zhao
04537bf83e
[checkpoint]support generalized scheduler ( #1222 )
2022-07-07 18:16:38 +08:00
Jiarui Fang
52736205d9
[checkpoint] make unitest faster ( #1217 )
2022-07-06 17:39:46 +08:00
Jiarui Fang
f38006ea83
[checkpoint] checkpoint for ColoTensor Model ( #1196 )
2022-07-06 17:22:03 +08:00
YuliangLiu0306
63d2a93878
[context]support arbitary module materialization. ( #1193 )
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [context]support arbitary module materialization.
* [test]add numerical check for lazy init context.
2022-07-04 10:12:02 +08:00
YuliangLiu0306
2053e138a2
[context]use meta tensor to init model lazily. ( #1187 )
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [context]use meta tensor to init model lazily.
* polish
* make module with device kwargs bypass the normal init.
* change unit test to adapt updated context.
2022-06-29 21:02:30 +08:00
ver217
d26902645e
[ddp] add save/load state dict for ColoDDP ( #1127 )
* add save/load state dict for ColoDDP
* add unit test
* refactor unit test folder
* polish unit test
* rename unit test
2022-06-20 10:51:47 +08:00
ver217
f0a954f16d
[ddp] add set_params_to_ignore for ColoDDP ( #1122 )
* add set_params_to_ignore for ColoDDP
* polish code
* fix zero hook v2
* add unit test
* polish docstr
2022-06-16 12:54:46 +08:00
Frank Lee
2b2dc1c86b
[pipeline] refactor the pipeline module ( #1087 )
* [pipeline] refactor the pipeline module
* polish code
2022-06-10 11:27:38 +08:00
Frank Lee
bad5d4c0a1
[context] support lazy init of module ( #1088 )
* [context] support lazy init of module
* polish code
2022-06-10 10:09:48 +08:00
Frank Lee
50ec3a7e06
[test] skip tests when not enough GPUs are detected ( #1090 )
* [test] skip tests when not enough GPUs are detected
* polish code
* polish code
2022-06-09 17:19:13 +08:00
Frank Lee
65ee6dcc20
[test] ignore 8 gpu test ( #1080 )
* [test] ignore 8 gpu test
* polish code
* polish workflow
* polish workflow
2022-06-08 23:14:18 +08:00
Jiarui Fang
49832b2344
[refactory] add nn.parallel module ( #1068 )
2022-06-06 15:34:41 +08:00
HELSON
e5ea3fdeef
[gemini] add GeminiMemoryManger ( #832 )
* refactor StatefulTensor, tensor utilities
* add unitest for GeminiMemoryManager
2022-04-24 13:08:48 +08:00
YuliangLiu0306
35ea6e1023
[pipelinable]use pipelinable context to initialize non-pipeline model ( #816 )
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [pipeline]add module lazy init feature to support large model initization.
* [pipeline]add to_layer_list and partition method to support arbitrary non-pp model
* refactor the module structure
* polish
* [pipelinable]add unit test for pipelinable
* polish
* polish
* Fix CodeFactor issues.
2022-04-24 13:03:12 +08:00
Jiarui Fang
681addb512
[refactor] moving grad acc logic to engine ( #804 )
2022-04-19 14:03:21 +08:00
Frank Lee
5a1a095b92
[test] refactored with the new rerun decorator ( #763 )
* [test] refactored with the new rerun decorator
* polish test case
2022-04-15 00:33:04 +08:00
Jiarui Fang
53cb584808
[utils] correct cpu memory used and capacity in the context of multi-process ( #726 )
2022-04-12 14:57:54 +08:00
FrankLeeeee
62b4ce7326
[test] added missing decorators to model checkpointing tests
2022-04-12 11:08:15 +08:00
Jiarui Fang
4d90a7b513
[refactor] zero directory ( #724 )
2022-04-11 23:13:02 +08:00
Jiarui Fang
193dc8dacb
[refactor] refactor the memory utils ( #715 )
2022-04-11 16:47:57 +08:00
HELSON
e5d615aeee
[hotfix] fix bugs in testing ( #659 )
* remove hybrid adam in test_moe_zero_optim
* fix activation checkpointing and its unitest
2022-04-02 21:58:47 +08:00
アマデウス
354b7954d1
[model checkpoint] added unit tests for checkpoint save/load ( #599 )
2022-04-01 16:53:32 +08:00
FredHuang99
93f14d2a33
[zero] test zero tensor utils ( #609 )
2022-04-01 15:16:59 +08:00
Jiarui Fang
e956d93ac2
[refactor] memory utils ( #577 )
2022-04-01 09:22:33 +08:00
Jiarui Fang
705f56107c
[zero] refactor model data tracing ( #537 )
2022-03-28 16:38:18 +08:00
Jiarui Fang
8d8c5407c0
[zero] refactor model data tracing ( #522 )
2022-03-25 18:03:32 +08:00
Frank Lee
3601b2bad0
[test] fixed rerun_on_exception and adapted test cases ( #487 )
2022-03-25 17:25:12 +08:00
Jiarui Fang
4d322b79da
[refactor] remove old zero code ( #517 )
2022-03-25 14:54:39 +08:00
Jiarui Fang
920c5889a7
[zero] add colo move inline ( #521 )
2022-03-25 14:02:55 +08:00
Jiarui Fang
7ef3507ace
[zero] show model data cuda memory usage after zero context init. ( #515 )
2022-03-25 11:23:35 +08:00
Jiarui Fang
9330be0f3c
[memory] set cuda mem frac ( #506 )
2022-03-24 16:57:13 +08:00
Jiarui Fang
0035b7be07
[memory] add model data tensor moving api ( #503 )
2022-03-24 14:29:41 +08:00
Jiarui Fang
b334822163
[zero] polish sharded param name ( #484 )
* [zero] polish sharded param name
* polish code
* polish
* polish code
* polish
* polsih
* polish
2022-03-22 14:36:16 +08:00
Frank Lee
f27d801a13
[test] optimized zero data parallel test ( #452 )
2022-03-18 11:35:54 +08:00
Jiarui Fang
21dc54e019
[zero] memtracer to record cuda memory usage of model data and overall system ( #395 )
2022-03-14 22:05:30 +08:00
Jiarui Fang
a37bf1bc42
[hotfix] rm test_tensor_detector.py ( #413 )
2022-03-14 21:39:48 +08:00
LuGY
a9c27be42e
Added tensor detector ( #393 )
* Added tensor detector
* Added the - states
* Allowed change include_cpu when detect()
2022-03-14 18:01:46 +08:00
Frank Lee
1e4bf85cdb
fixed bug in activation checkpointing test ( #387 )
2022-03-11 15:50:28 +08:00
Frank Lee
526a318032
[unit test] Refactored test cases with component func ( #339 )
* refactored test with component func
* fixed bug
2022-03-11 15:50:28 +08:00
LuGY
de46450461
Added activation offload ( #331 )
* Added activation offload
* Fixed the import bug, used the pytest
2022-03-11 15:50:28 +08:00
Jiarui Fang
b5f43acee3
[zero] find miss code ( #378 )
2022-03-11 15:50:28 +08:00
jiaruifang
d9217e1960
Revert "[zero] bucketized tensor cpu gpu copy ( #368 )"
This reverts commit bef05489b6.
2022-03-11 15:50:28 +08:00
Jiarui Fang
00670c870e
[zero] bucketized tensor cpu gpu copy ( #368 )
2022-03-11 15:50:28 +08:00
Jiarui Fang
5a560a060a
Feature/zero ( #279 )
* add zero1 (#209 )
* add zero1
* add test zero1
* update zero stage 1 develop (#212 )
* Implement naive zero3 (#240 )
* naive zero3 works well
* add zero3 param manager
* add TODOs in comments
* add gather full param ctx
* fix sub module streams
* add offload
* fix bugs of hook and add unit tests
* fix bugs of hook and add unit tests (#252 )
* add gather full param ctx
* fix sub module streams
* add offload
* fix bugs of hook and add unit tests
* polish code and add state dict hook
* fix bug
* update unit test
* refactor reconstructed zero code
* clip_grad support zero3 and add unit test
* add unit test for Zero3ParameterManager
* [WIP] initialize the shard param class
* [WIP] Yet another sharded model implementation (#274 )
* [WIP] initialize the shard param class
* [WIP] Yes another implementation of shardModel. Using a better hook method.
* torch.concat -> torch.cat
* fix test_zero_level_1.py::test_zero_level_1 unitest
* remove deepspeed implementation and refactor for the reconstructed zero module
* polish zero dp unittests
Co-authored-by: ver217 <lhx0217@gmail.com>
Co-authored-by: Frank Lee <somerlee.9@gmail.com>
2022-03-11 15:50:28 +08:00
アマデウス
01a80cd86d
Hotfix/Colossalai layers ( #92 )
* optimized 1d layer apis; reorganized nn.layer modules; fixed tests
* fixed 2.5d runtime issue
* reworked split batch, now called in trainer.schedule.load_batch
Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>
2021-12-29 23:32:10 +08:00
Frank Lee
cd9c28e055
added CI for unit testing ( #69 )
2021-12-16 10:32:08 +08:00