HELSON
|
425b4a96b8
|
[gemini] polish stateful_tensor_mgr (#876)
|
2022-04-26 15:05:03 +08:00 |
ver217
|
d7e0303d1e
|
[zero] use GeminiMemoryManager when sampling model data (#850)
|
2022-04-24 17:17:22 +08:00 |
HELSON
|
e5ea3fdeef
|
[gemini] add GeminiMemoryManager (#832)
* refactor StatefulTensor, tensor utilities
* add unit test for GeminiMemoryManager
|
2022-04-24 13:08:48 +08:00 |
Jiarui Fang
|
4d9332b4c5
|
[refactor] moving memtracer to gemini (#801)
|
2022-04-19 10:13:08 +08:00 |
HELSON
|
4c4388c46e
|
[hotfix] fix memory leak in zero (#781)
|
2022-04-18 13:57:03 +08:00 |
Jiarui Fang
|
10ef8afdd2
|
[gemini] init gemini individual directory (#754)
|
2022-04-14 16:40:26 +08:00 |
ver217
|
a93a7d7364
|
[hotfix] fix reuse_fp16_shard of sharded model (#756)
* fix reuse_fp16_shard
* disable test stm
* polish code
|
2022-04-14 14:56:46 +08:00 |
ver217
|
8f7ce94b8e
|
[hotfix] fix auto tensor placement policy (#753)
|
2022-04-14 12:04:45 +08:00 |
Jiarui Fang
|
3d7dc46d33
|
[zero] use factory pattern for tensor_placement_policy (#752)
|
2022-04-14 11:07:29 +08:00 |
ver217
|
e396bb71f2
|
[zero] add tensor placement policies (#743)
* add tensor placement policies
* polish comments
* polish comments
* update moe unit tests
|
2022-04-13 15:00:48 +08:00 |
HELSON
|
22c4b88d56
|
[zero] refactor ShardedParamV2 for convenience (#742)
|
2022-04-13 14:54:26 +08:00 |
ver217
|
e6212f56cd
|
[hotfix] fix memory leak in backward of sharded model (#741)
|
2022-04-13 09:59:05 +08:00 |
Jiarui Fang
|
7db3ccc79b
|
[hotfix] remove duplicated param register to stateful tensor manager (#728)
|
2022-04-12 13:55:25 +08:00 |
Jiarui Fang
|
4d90a7b513
|
[refactor] zero directory (#724)
|
2022-04-11 23:13:02 +08:00 |
Jiarui Fang
|
193dc8dacb
|
[refactor] refactor the memory utils (#715)
|
2022-04-11 16:47:57 +08:00 |
HELSON
|
dbd96fe90a
|
[zero] check whether gradients have inf and nan in gpu (#712)
|
2022-04-11 15:40:13 +08:00 |
HELSON
|
a9b8300d54
|
[zero] improve adaptability for not-shard parameters (#708)
* adapt post grad hooks for not-shard parameters
* adapt optimizer for not-shard parameters
* offload gradients for not-replicated parameters
|
2022-04-11 13:38:51 +08:00 |
ver217
|
ab8c6b4a0e
|
[zero] refactor memstats collector (#706)
* refactor memstats collector
* fix disposable
* polish code
|
2022-04-11 10:46:08 +08:00 |
HELSON
|
ee112fe1da
|
[zero] adapt zero hooks for unsharded module (#699)
|
2022-04-08 20:23:26 +08:00 |
ver217
|
3c9cd5bb5e
|
[zero] stateful tensor manager (#687)
* [WIP] stateful tensor manager
* add eviction strategy
* polish code
* polish code
* polish comment
* add unit test
* fix sampler bug
* polish code
* fix max sampling cnt resetting bug
* fix sampler bug
* polish code
* fix bug
* fix unit test
Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
|
2022-04-08 17:51:34 +08:00 |
ver217
|
0ef8819c67
|
polish docstring of zero (#612)
|
2022-04-01 14:50:56 +08:00 |
Jiarui Fang
|
e956d93ac2
|
[refactor] memory utils (#577)
|
2022-04-01 09:22:33 +08:00 |
HELSON
|
e6d50ec107
|
[zero] adapt zero for unsharded parameters (#561)
* support existing sharded and unsharded parameters in zero
* add unit test for moe-zero model init
* polish moe gradient handler
|
2022-03-31 18:34:11 +08:00 |
ver217
|
7c6c427db1
|
[zero] trace states of fp16/32 grad and fp32 param (#571)
|
2022-03-31 16:26:54 +08:00 |
Jiarui Fang
|
7675366fce
|
[polish] rename col_attr -> colo_attr (#558)
|
2022-03-31 12:25:45 +08:00 |
ver217
|
014bac0c49
|
[zero] hijack p.grad in sharded model (#554)
* hijack p.grad in sharded model
* polish comments
* polish comments
|
2022-03-30 18:14:50 +08:00 |
Jiarui Fang
|
f552b11294
|
[zero] label state for param fp16 and grad (#551)
|
2022-03-30 15:57:46 +08:00 |
Jiarui Fang
|
214da761d4
|
[zero] add stateful tensor (#549)
|
2022-03-30 13:51:37 +08:00 |
Jiarui Fang
|
107b99ddb1
|
[zero] dump memory stats for sharded model (#548)
|
2022-03-30 09:38:44 +08:00 |
Jiarui Fang
|
53b1b6e340
|
[zero] non model data tracing (#545)
|
2022-03-29 15:45:48 +08:00 |
ver217
|
fb841dd5c5
|
[zero] optimize grad offload (#539)
* optimize grad offload
* polish code
* polish code
|
2022-03-29 12:48:00 +08:00 |
Jiarui Fang
|
705f56107c
|
[zero] refactor model data tracing (#537)
|
2022-03-28 16:38:18 +08:00 |
Jiarui Fang
|
05e33b2578
|
[zero] fix grad offload (#528)
* [zero] fix grad offload
* polish code
|
2022-03-25 18:23:25 +08:00 |
Jiarui Fang
|
4d322b79da
|
[refactor] remove old zero code (#517)
|
2022-03-25 14:54:39 +08:00 |
Jiarui Fang
|
7ef3507ace
|
[zero] show model data cuda memory usage after zero context init. (#515)
|
2022-03-25 11:23:35 +08:00 |
Jiarui Fang
|
0035b7be07
|
[memory] add model data tensor moving api (#503)
|
2022-03-24 14:29:41 +08:00 |
ver217
|
9ec1ce6ab1
|
[zero] sharded model support the reuse of fp16 shard (#495)
* sharded model supports reuse fp16 shard
* rename variable
* polish code
* polish code
* polish code
|
2022-03-23 14:59:59 +08:00 |
ver217
|
c4c02424f3
|
[zero] sharded model manages ophooks individually (#492)
|
2022-03-22 17:33:20 +08:00 |
ver217
|
62b0a8d644
|
[zero] sharded optim support hybrid cpu adam (#486)
* sharded optim support hybrid cpu adam
* update unit test
* polish docstring
|
2022-03-22 14:56:59 +08:00 |
Jiarui Fang
|
b334822163
|
[zero] polish sharded param name (#484)
* [zero] polish sharded param name
* polish code
* polish
* polish code
* polish
* polish
* polish
|
2022-03-22 14:36:16 +08:00 |
ver217
|
8d3250d74b
|
[zero] ZeRO supports pipeline parallel (#477)
|
2022-03-21 16:55:37 +08:00 |
ver217
|
fc8e6db005
|
[doc] Update docstring for ZeRO (#459)
* polish sharded model docstr
* polish sharded optim docstr
* polish zero docstr
* polish shard strategy docstr
|
2022-03-18 16:48:20 +08:00 |
ver217
|
a241f61b34
|
[zero] Update initialize for ZeRO (#458)
* polish code
* shard strategy receive pg in shard() / gather()
* update zero engine
* polish code
|
2022-03-18 16:18:31 +08:00 |
ver217
|
642846d6f9
|
update sharded optim and fix zero init ctx (#457)
|
2022-03-18 15:44:47 +08:00 |
Jiarui Fang
|
e2e9f82588
|
Revert "[zero] update sharded optim and fix zero init ctx" (#456)
* Revert "polish code"
This reverts commit 8cf7ff08cf .
* Revert "rename variables"
This reverts commit e99af94ab8 .
* Revert "remove surplus imports"
This reverts commit 46add4a5c5 .
* Revert "update sharded optim and fix zero init ctx"
This reverts commit 57567ee768 .
|
2022-03-18 15:22:43 +08:00 |
ver217
|
57567ee768
|
update sharded optim and fix zero init ctx
|
2022-03-18 14:25:25 +08:00 |
Jiarui Fang
|
496cbb0760
|
[hotfix] fix initialize bug with zero (#442)
|
2022-03-17 13:16:22 +08:00 |
ver217
|
fce9432f08
|
sync before creating empty grad
|
2022-03-16 14:24:09 +08:00 |
ver217
|
ea6905a898
|
free param.grad
|
2022-03-16 14:24:09 +08:00 |
ver217
|
9506a8beb2
|
use double buffer to handle grad
|
2022-03-16 14:24:09 +08:00 |