Commit Graph

84 Commits (80e37eec4215cd81a54a525af14a1030dd6177a1)

Author SHA1 Message Date
HELSON 84c6700b2a
[zero] refactor memstats_collector (#746) 2022-04-14 12:01:12 +08:00
HELSON 340e59f968
[utils] add synchronized cuda memory monitor (#740) 2022-04-13 10:50:54 +08:00
Jiarui Fang 53cb584808
[utils] correct cpu memory used and capacity in the context of multi-process (#726) 2022-04-12 14:57:54 +08:00
Frank Lee 2412429d54
[util] fixed activation checkpointing on torch 1.9 (#719) 2022-04-12 09:35:45 +08:00
Jiarui Fang 193dc8dacb
[refactor] refactor the memory utils (#715) 2022-04-11 16:47:57 +08:00
LuGY 140263a394
[hotfix]fixed bugs of assigning grad states to non leaf nodes (#711)
* fixed bugs of assigning grad states to non leaf nodes

* use detach()
2022-04-11 14:04:58 +08:00
ver217 ab8c6b4a0e
[zero] refactor memstats collector (#706)
* refactor memstats collector

* fix disposable

* polish code
2022-04-11 10:46:08 +08:00
ver217 3c9cd5bb5e
[zero] stateful tensor manager (#687)
* [WIP] stateful tensor manager

* add eviction strategy

* polish code

* polish code

* polish comment

* add unit test

* fix sampler bug

* polish code

* fix max sampling cnt resetting bug

* fix sampler bug

* polish code

* fix bug

* fix unit test

Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
2022-04-08 17:51:34 +08:00
Jiarui Fang 59bf2dc590
[zero] initialize a stateful tensor manager (#614) 2022-04-06 16:18:49 +08:00
Jiarui Fang 0aab52301e
[hotfix] fix a bug in model data stats tracing (#655) 2022-04-03 21:48:06 +08:00
HELSON e5d615aeee
[hotfix] fix bugs in testing (#659)
* remove hybrid adam in test_moe_zero_optim

* fix activation checkpointing and its unitest
2022-04-02 21:58:47 +08:00
LuGY 1e2557e801
[zero] fixed the activation offload (#647)
* fixed the activation offload

* polish
2022-04-02 16:21:32 +08:00
ver217 f5d3a9c2b0
polish checkpoint docstring (#637) 2022-04-02 13:34:33 +08:00
HELSON 055fbf5be6
[zero] adapt zero for unsharded paramters (Optimizer part) (#601) 2022-04-01 20:10:47 +08:00
アマデウス acae68eb04
[model checkpoint] updated checkpoint save/load utils (#592) 2022-04-01 16:49:21 +08:00
ver217 369a288bf3
polish utils docstring (#620) 2022-04-01 16:36:47 +08:00
LuGY 02b187c14f
[zero] add sampling time for memstats collector (#610) 2022-04-01 14:03:00 +08:00
アマデウス 54e688b623
moved ensure_path_exists to utils.common (#591) 2022-04-01 09:46:33 +08:00
Jiarui Fang e956d93ac2
[refactor] memory utils (#577) 2022-04-01 09:22:33 +08:00
HELSON e6d50ec107
[zero] adapt zero for unsharded parameters (#561)
* support existing sharded and unsharded parameters in zero

* add unitest for moe-zero model init

* polish moe gradient handler
2022-03-31 18:34:11 +08:00
ver217 7c6c427db1
[zero] trace states of fp16/32 grad and fp32 param (#571) 2022-03-31 16:26:54 +08:00
Jiarui Fang 7675366fce
[polish] rename col_attr -> colo_attr (#558) 2022-03-31 12:25:45 +08:00
Liang Bowen 2c45efc398
html refactor (#555) 2022-03-31 11:36:56 +08:00
Jiarui Fang d1211148a7
[utils] update colo tensor moving APIs (#553) 2022-03-30 23:13:24 +08:00
Jiarui Fang 107b99ddb1
[zero] dump memory stats for sharded model (#548) 2022-03-30 09:38:44 +08:00
Liang Bowen ec5086c49c Refactored docstring to google style 2022-03-29 17:17:47 +08:00
Jiarui Fang 53b1b6e340
[zero] non model data tracing (#545) 2022-03-29 15:45:48 +08:00
Jie Zhu 73d36618a6
[profiler] add MemProfiler (#356)
* add memory trainer hook

* fix bug

* add memory trainer hook

* fix import bug

* fix import bug

* add trainer hook

* fix #370 git log bug

* modify `to_tensorboard` function to support better output

* remove useless output

* change the name of `MemProfiler`

* complete memory profiler

* replace error with warning

* finish trainer hook

* modify interface of MemProfiler

* modify `__init__.py` in profiler

* remove unnecessary pass statement

* add usage to doc string

* add usage to trainer hook

* new location to store temp data file
2022-03-29 12:48:34 +08:00
Jiarui Fang c11ff81b15
[zero] get memory usage of sharded optim v2. (#542) 2022-03-29 09:08:18 +08:00
Jiarui Fang 705f56107c
[zero] refactor model data tracing (#537) 2022-03-28 16:38:18 +08:00
Jiarui Fang 05e33b2578
[zero] fix grad offload (#528)
* [zero] fix grad offload

* polish code
2022-03-25 18:23:25 +08:00
Jiarui Fang 8d8c5407c0
[zero] refactor model data tracing (#522) 2022-03-25 18:03:32 +08:00
Jiarui Fang 920c5889a7
[zero] add colo move inline (#521) 2022-03-25 14:02:55 +08:00
Jiarui Fang 0bebda6ea5
[zero] fix init device bug in zero init context unittest (#516) 2022-03-25 12:24:18 +08:00
Jiarui Fang 7ef3507ace
[zero] show model data cuda memory usage after zero context init. (#515) 2022-03-25 11:23:35 +08:00
Jiarui Fang 9330be0f3c
[memory] set cuda mem frac (#506) 2022-03-24 16:57:13 +08:00
Jiarui Fang 0035b7be07
[memory] add model data tensor moving api (#503) 2022-03-24 14:29:41 +08:00
Jiarui Fang a445e118cf
[polish] polish singleton and global context (#500) 2022-03-23 18:03:39 +08:00
HELSON f24b5ed201
[MOE] remove old MoE legacy (#493) 2022-03-22 17:37:16 +08:00
Jiarui Fang b334822163
[zero] polish sharded param name (#484)
* [zero] polish sharded param name

* polish code

* polish

* polish code

* polish

* polsih

* polish
2022-03-22 14:36:16 +08:00
Jiarui Fang 65c0f380c2
[format] polish name format for MOE (#481) 2022-03-21 23:19:47 +08:00
HELSON 7544347145
[MOE] add unitest for MOE experts layout, gradient handler and kernel (#469) 2022-03-21 13:35:04 +08:00
HELSON aff9d354f7
[MOE] polish moe_env (#467) 2022-03-19 15:36:25 +08:00
HELSON 84fd7c1d4d
add moe context, moe utilities and refactor gradient handler (#455) 2022-03-18 16:38:32 +08:00
Frank Lee b72b8445c6
optimized context test time consumption (#446) 2022-03-17 14:40:52 +08:00
Jiarui Fang 496cbb0760
[hotfix] fix initialize bug with zero (#442) 2022-03-17 13:16:22 +08:00
Frank Lee b03b3ae99c
fixed mem monitor device (#433)
fixed mem monitor device
2022-03-16 15:25:02 +08:00
Jiarui Fang adebb3e041
[zero] cuda margin space for OS (#418) 2022-03-15 12:02:19 +08:00
Jiarui Fang 56bb412e72
[polish] use GLOBAL_MODEL_DATA_TRACER (#417) 2022-03-15 11:29:46 +08:00
Jiarui Fang 21dc54e019
[zero] memtracer to record cuda memory usage of model data and overall system (#395) 2022-03-14 22:05:30 +08:00