Commit Graph

21 Commits (e619a651fb27dc9db9ca0b97a10e00b3516243b0)

Author SHA1 Message Date
LuGY 02b187c14f
[zero] add sampling time for memstats collector (#610) 2022-04-01 14:03:00 +08:00
Jiarui Fang e956d93ac2
[refactor] memory utils (#577) 2022-04-01 09:22:33 +08:00
HELSON e6d50ec107
[zero] adapt zero for unsharded parameters (#561)
* support existing sharded and unsharded parameters in zero

* add unitest for moe-zero model init

* polish moe gradient handler
2022-03-31 18:34:11 +08:00
Jiarui Fang 7675366fce
[polish] rename col_attr -> colo_attr (#558) 2022-03-31 12:25:45 +08:00
Liang Bowen 2c45efc398
html refactor (#555) 2022-03-31 11:36:56 +08:00
Jiarui Fang 107b99ddb1
[zero] dump memory stats for sharded model (#548) 2022-03-30 09:38:44 +08:00
Jiarui Fang 53b1b6e340
[zero] non model data tracing (#545) 2022-03-29 15:45:48 +08:00
Jie Zhu 73d36618a6
[profiler] add MemProfiler (#356)
* add memory trainer hook

* fix bug

* add memory trainer hook

* fix import bug

* fix import bug

* add trainer hook

* fix #370 git log bug

* modify `to_tensorboard` function to support better output

* remove useless output

* change the name of `MemProfiler`

* complete memory profiler

* replace error with warning

* finish trainer hook

* modify interface of MemProfiler

* modify `__init__.py` in profiler

* remove unnecessary pass statement

* add usage to doc string

* add usage to trainer hook

* new location to store temp data file
2022-03-29 12:48:34 +08:00
Jiarui Fang c11ff81b15
[zero] get memory usage of sharded optim v2. (#542) 2022-03-29 09:08:18 +08:00
Jiarui Fang 705f56107c
[zero] refactor model data tracing (#537) 2022-03-28 16:38:18 +08:00
Jiarui Fang 8d8c5407c0
[zero] refactor model data tracing (#522) 2022-03-25 18:03:32 +08:00
Jiarui Fang 0bebda6ea5
[zero] fix init device bug in zero init context unittest (#516) 2022-03-25 12:24:18 +08:00
Jiarui Fang 7ef3507ace
[zero] show model data cuda memory usage after zero context init. (#515) 2022-03-25 11:23:35 +08:00
Jiarui Fang 9330be0f3c
[memory] set cuda mem frac (#506) 2022-03-24 16:57:13 +08:00
Jiarui Fang 0035b7be07
[memory] add model data tensor moving api (#503) 2022-03-24 14:29:41 +08:00
Jiarui Fang a445e118cf
[polish] polish singleton and global context (#500) 2022-03-23 18:03:39 +08:00
Frank Lee b03b3ae99c
fixed mem monitor device (#433)
fixed mem monitor device
2022-03-16 15:25:02 +08:00
Jiarui Fang 56bb412e72
[polish] use GLOBAL_MODEL_DATA_TRACER (#417) 2022-03-15 11:29:46 +08:00
Jiarui Fang 21dc54e019
[zero] memtracer to record cuda memory usage of model data and overall system (#395) 2022-03-14 22:05:30 +08:00
Jiarui Fang ea2872073f [zero] global model data memory tracer (#360) 2022-03-11 15:50:28 +08:00
Jiarui Fang 10e2826426 move async memory to an individual directory (#345) 2022-03-11 15:50:28 +08:00