LuGY
|
02b187c14f
|
[zero] add sampling time for memstats collector (#610)
|
2022-04-01 14:03:00 +08:00 |
アマデウス
|
54e688b623
|
moved ensure_path_exists to utils.common (#591)
|
2022-04-01 09:46:33 +08:00 |
Jiarui Fang
|
e956d93ac2
|
[refactor] memory utils (#577)
|
2022-04-01 09:22:33 +08:00 |
HELSON
|
e6d50ec107
|
[zero] adapt zero for unsharded parameters (#561)
* support existing sharded and unsharded parameters in zero
* add unit test for moe-zero model init
* polish moe gradient handler
|
2022-03-31 18:34:11 +08:00 |
ver217
|
7c6c427db1
|
[zero] trace states of fp16/32 grad and fp32 param (#571)
|
2022-03-31 16:26:54 +08:00 |
Jiarui Fang
|
7675366fce
|
[polish] rename col_attr -> colo_attr (#558)
|
2022-03-31 12:25:45 +08:00 |
Liang Bowen
|
2c45efc398
|
html refactor (#555)
|
2022-03-31 11:36:56 +08:00 |
Jiarui Fang
|
d1211148a7
|
[utils] update colo tensor moving APIs (#553)
|
2022-03-30 23:13:24 +08:00 |
Jiarui Fang
|
107b99ddb1
|
[zero] dump memory stats for sharded model (#548)
|
2022-03-30 09:38:44 +08:00 |
Liang Bowen
|
ec5086c49c
|
Refactored docstring to google style
|
2022-03-29 17:17:47 +08:00 |
Jiarui Fang
|
53b1b6e340
|
[zero] non model data tracing (#545)
|
2022-03-29 15:45:48 +08:00 |
Jie Zhu
|
73d36618a6
|
[profiler] add MemProfiler (#356)
* add memory trainer hook
* fix bug
* add memory trainer hook
* fix import bug
* fix import bug
* add trainer hook
* fix #370 git log bug
* modify `to_tensorboard` function to support better output
* remove useless output
* change the name of `MemProfiler`
* complete memory profiler
* replace error with warning
* finish trainer hook
* modify interface of MemProfiler
* modify `__init__.py` in profiler
* remove unnecessary pass statement
* add usage to doc string
* add usage to trainer hook
* new location to store temp data file
|
2022-03-29 12:48:34 +08:00 |
Jiarui Fang
|
c11ff81b15
|
[zero] get memory usage of sharded optim v2. (#542)
|
2022-03-29 09:08:18 +08:00 |
Jiarui Fang
|
705f56107c
|
[zero] refactor model data tracing (#537)
|
2022-03-28 16:38:18 +08:00 |
Jiarui Fang
|
05e33b2578
|
[zero] fix grad offload (#528)
* [zero] fix grad offload
* polish code
|
2022-03-25 18:23:25 +08:00 |
Jiarui Fang
|
8d8c5407c0
|
[zero] refactor model data tracing (#522)
|
2022-03-25 18:03:32 +08:00 |
Jiarui Fang
|
920c5889a7
|
[zero] add colo move inline (#521)
|
2022-03-25 14:02:55 +08:00 |
Jiarui Fang
|
0bebda6ea5
|
[zero] fix init device bug in zero init context unittest (#516)
|
2022-03-25 12:24:18 +08:00 |
Jiarui Fang
|
7ef3507ace
|
[zero] show model data cuda memory usage after zero context init. (#515)
|
2022-03-25 11:23:35 +08:00 |
Jiarui Fang
|
9330be0f3c
|
[memory] set cuda mem frac (#506)
|
2022-03-24 16:57:13 +08:00 |
Jiarui Fang
|
0035b7be07
|
[memory] add model data tensor moving api (#503)
|
2022-03-24 14:29:41 +08:00 |
Jiarui Fang
|
a445e118cf
|
[polish] polish singleton and global context (#500)
|
2022-03-23 18:03:39 +08:00 |
HELSON
|
f24b5ed201
|
[MOE] remove old MoE legacy (#493)
|
2022-03-22 17:37:16 +08:00 |
Jiarui Fang
|
b334822163
|
[zero] polish sharded param name (#484)
* [zero] polish sharded param name
* polish code
* polish
* polish code
* polish
* polish
* polish
|
2022-03-22 14:36:16 +08:00 |
Jiarui Fang
|
65c0f380c2
|
[format] polish name format for MOE (#481)
|
2022-03-21 23:19:47 +08:00 |
HELSON
|
7544347145
|
[MOE] add unit test for MOE experts layout, gradient handler and kernel (#469)
|
2022-03-21 13:35:04 +08:00 |
HELSON
|
aff9d354f7
|
[MOE] polish moe_env (#467)
|
2022-03-19 15:36:25 +08:00 |
HELSON
|
84fd7c1d4d
|
add moe context, moe utilities and refactor gradient handler (#455)
|
2022-03-18 16:38:32 +08:00 |
Frank Lee
|
b72b8445c6
|
optimized context test time consumption (#446)
|
2022-03-17 14:40:52 +08:00 |
Jiarui Fang
|
496cbb0760
|
[hotfix] fix initialize bug with zero (#442)
|
2022-03-17 13:16:22 +08:00 |
Frank Lee
|
b03b3ae99c
|
fixed mem monitor device (#433)
|
2022-03-16 15:25:02 +08:00 |
Jiarui Fang
|
adebb3e041
|
[zero] cuda margin space for OS (#418)
|
2022-03-15 12:02:19 +08:00 |
Jiarui Fang
|
56bb412e72
|
[polish] use GLOBAL_MODEL_DATA_TRACER (#417)
|
2022-03-15 11:29:46 +08:00 |
Jiarui Fang
|
21dc54e019
|
[zero] memtracer to record cuda memory usage of model data and overall system (#395)
|
2022-03-14 22:05:30 +08:00 |
LuGY
|
a9c27be42e
|
Added tensor detector (#393)
* Added tensor detector
* Added the - states
* Allowed changing include_cpu when calling detect()
|
2022-03-14 18:01:46 +08:00 |
1SAA
|
907ac4a2dc
|
fixed error when no collective communication in CommProfiler
|
2022-03-14 17:21:00 +08:00 |
HELSON
|
dfd0363f68
|
polished output format for communication profiler and pcie profiler (#404)
fixed typing error
|
2022-03-14 16:07:45 +08:00 |
HELSON
|
7c079d9c33
|
[hotfix] fixed bugs in ShardStrategy and PcieProfiler (#394)
|
2022-03-11 18:12:46 +08:00 |
LuGY
|
de46450461
|
Added activation offload (#331)
* Added activation offload
* Fixed the import bug, used the pytest
|
2022-03-11 15:50:28 +08:00 |
HELSON
|
8c18eb0998
|
[profiler] Fixed bugs in CommProfiler and PcieProfiler (#377)
|
2022-03-11 15:50:28 +08:00 |
Jiarui Fang
|
b5f43acee3
|
[zero] find miss code (#378)
|
2022-03-11 15:50:28 +08:00 |
HELSON
|
1ed7c24c02
|
Added PCIE profiler to detect data transmission (#373)
|
2022-03-11 15:50:28 +08:00 |
jiaruifang
|
d9217e1960
|
Revert "[zero] bucketized tensor cpu gpu copy (#368)"
This reverts commit bef05489b6.
|
2022-03-11 15:50:28 +08:00 |
Jiarui Fang
|
00670c870e
|
[zero] bucketized tensor cpu gpu copy (#368)
|
2022-03-11 15:50:28 +08:00 |
Jiarui Fang
|
ea2872073f
|
[zero] global model data memory tracer (#360)
|
2022-03-11 15:50:28 +08:00 |
HELSON
|
534e0bb118
|
Fixed import bug for no-tensorboard environment (#354)
|
2022-03-11 15:50:28 +08:00 |
HELSON
|
c57e089824
|
[profile] added example for ProfilerContext (#349)
|
2022-03-11 15:50:28 +08:00 |
Jiarui Fang
|
10e2826426
|
move async memory to an individual directory (#345)
|
2022-03-11 15:50:28 +08:00 |
HELSON
|
425bb0df3f
|
Added Profiler Context to manage all profilers (#340)
|
2022-03-11 15:50:28 +08:00 |
HELSON
|
4f26fabe4f
|
fixed strings in profiler outputs (#325)
|
2022-03-11 15:50:28 +08:00 |