Commit Graph

14 Commits (846406a07a7d16a3d4b7f045c4c03fcb2e53c0cc)

Author SHA1 Message Date
HELSON 84c6700b2a
[zero] refactor memstats_collector (#746) 2022-04-14 12:01:12 +08:00
HELSON 340e59f968
[utils] add synchronized cuda memory monitor (#740) 2022-04-13 10:50:54 +08:00
Jiarui Fang 193dc8dacb
[refactor] refactor the memory utils (#715) 2022-04-11 16:47:57 +08:00
ver217 ab8c6b4a0e
[zero] refactor memstats collector (#706)
* refactor memstats collector

* fix disposable

* polish code
2022-04-11 10:46:08 +08:00
ver217 3c9cd5bb5e
[zero] stateful tensor manager (#687)
* [WIP] stateful tensor manager

* add eviction strategy

* polish code

* polish code

* polish comment

* add unit test

* fix sampler bug

* polish code

* fix max sampling cnt resetting bug

* fix sampler bug

* polish code

* fix bug

* fix unit test

Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
2022-04-08 17:51:34 +08:00
Jiarui Fang 59bf2dc590
[zero] initialize a stateful tensor manager (#614) 2022-04-06 16:18:49 +08:00
Jiarui Fang 0aab52301e
[hotfix] fix a bug in model data stats tracing (#655) 2022-04-03 21:48:06 +08:00
LuGY 02b187c14f
[zero] add sampling time for memstats collector (#610) 2022-04-01 14:03:00 +08:00
Jiarui Fang e956d93ac2
[refactor] memory utils (#577) 2022-04-01 09:22:33 +08:00
Jiarui Fang 107b99ddb1
[zero] dump memory stats for sharded model (#548) 2022-03-30 09:38:44 +08:00
Jiarui Fang 705f56107c
[zero] refactor model data tracing (#537) 2022-03-28 16:38:18 +08:00
Jiarui Fang 0bebda6ea5
[zero] fix init device bug in zero init context unittest (#516) 2022-03-25 12:24:18 +08:00
Jiarui Fang 56bb412e72
[polish] use GLOBAL_MODEL_DATA_TRACER (#417) 2022-03-15 11:29:46 +08:00
Jiarui Fang 21dc54e019
[zero] memtracer to record cuda memory usage of model data and overall system (#395) 2022-03-14 22:05:30 +08:00