33 Commits (38102cf61aece2ce5974dbb09abe2aa298de7f8d)

Author SHA1 Message Date
Frank Lee 5a1a095b92
[test] refactored with the new rerun decorator (#763)
3 years ago
Jiarui Fang 53cb584808
[utils] correct cpu memory used and capacity in the context of multi-process (#726)
3 years ago
FrankLeeeee 62b4ce7326
[test] added missing decorators to model checkpointing tests
3 years ago
Jiarui Fang 4d90a7b513
[refactor] zero directory (#724)
3 years ago
Jiarui Fang 193dc8dacb
[refactor] refactor the memory utils (#715)
3 years ago
HELSON e5d615aeee
[hotfix] fix bugs in testing (#659)
3 years ago
アマデウス 354b7954d1
[model checkpoint] added unit tests for checkpoint save/load (#599)
3 years ago
FredHuang99 93f14d2a33
[zero] test zero tensor utils (#609)
3 years ago
Jiarui Fang e956d93ac2
[refactor] memory utils (#577)
3 years ago
Jiarui Fang 705f56107c
[zero] refactor model data tracing (#537)
3 years ago
Jiarui Fang 8d8c5407c0
[zero] refactor model data tracing (#522)
3 years ago
Frank Lee 3601b2bad0
[test] fixed rerun_on_exception and adapted test cases (#487)
3 years ago
Jiarui Fang 4d322b79da
[refactor] remove old zero code (#517)
3 years ago
Jiarui Fang 920c5889a7
[zero] add colo move inline (#521)
3 years ago
Jiarui Fang 7ef3507ace
[zero] show model data cuda memory usage after zero context init. (#515)
3 years ago
Jiarui Fang 9330be0f3c
[memory] set cuda mem frac (#506)
3 years ago
Jiarui Fang 0035b7be07
[memory] add model data tensor moving api (#503)
3 years ago
Jiarui Fang b334822163
[zero] polish sharded param name (#484)
3 years ago
Frank Lee f27d801a13
[test] optimized zero data parallel test (#452)
3 years ago
Jiarui Fang 21dc54e019
[zero] memtracer to record cuda memory usage of model data and overall system (#395)
3 years ago
Jiarui Fang a37bf1bc42
[hotfix] rm test_tensor_detector.py (#413)
3 years ago
LuGY a9c27be42e
Added tensor detector (#393)
3 years ago
Frank Lee 1e4bf85cdb
fixed bug in activation checkpointing test (#387)
3 years ago
Frank Lee 526a318032
[unit test] Refactored test cases with component func (#339)
3 years ago
LuGY de46450461
Added activation offload (#331)
3 years ago
Jiarui Fang b5f43acee3
[zero] find miss code (#378)
3 years ago
jiaruifang d9217e1960
Revert "[zero] bucketized tensor cpu gpu copy (#368)"
3 years ago
Jiarui Fang 00670c870e
[zero] bucketized tensor cpu gpu copy (#368)
3 years ago
Jiarui Fang 5a560a060a
Feature/zero (#279)
3 years ago
アマデウス 01a80cd86d
Hotfix/Colossalai layers (#92)
3 years ago
Frank Lee cd9c28e055
added CI for unit testing (#69)
3 years ago
Frank Lee da01c234e1
Develop/experiments (#59)
3 years ago
zbian 404ecbdcc6
Migrated project
3 years ago