27 Commits (f3ce7b8336978414a95481ef76c7a0987c6f0cda)

Author SHA1 Message Date
HELSON e5ea3fdeef
[gemini] add GeminiMemoryManager (#832)
* refactor StatefulTensor, tensor utilities

* add unit test for GeminiMemoryManager
2022-04-24 13:08:48 +08:00
Jiarui Fang e761ad2cd7
Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806) 2022-04-19 14:40:02 +08:00
HELSON 88759e289e
[zero] add ZeroTensorShardStrategy (#793) 2022-04-19 14:32:45 +08:00
HELSON a65cbb7e4e
[zero] refactor shard and gather operation (#773) 2022-04-15 14:41:31 +08:00
Jiarui Fang 4d90a7b513
[refactor] zero directory (#724) 2022-04-11 23:13:02 +08:00
Jiarui Fang 193dc8dacb
[refactor] refactor the memory utils (#715) 2022-04-11 16:47:57 +08:00
ver217 715b86eadd
[hotfix] fix stm cuda model data size (#710) 2022-04-11 15:10:39 +08:00
ver217 ab8c6b4a0e
[zero] refactor memstats collector (#706)
* refactor memstats collector

* fix disposable

* polish code
2022-04-11 10:46:08 +08:00
ver217 3c9cd5bb5e
[zero] stateful tensor manager (#687)
* [WIP] stateful tensor manager

* add eviction strategy

* polish code

* polish code

* polish comment

* add unit test

* fix sampler bug

* polish code

* fix max sampling cnt resetting bug

* fix sampler bug

* polish code

* fix bug

* fix unit test

Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
2022-04-08 17:51:34 +08:00
Jiarui Fang 59bf2dc590
[zero] initialize a stateful tensor manager (#614) 2022-04-06 16:18:49 +08:00
Jiarui Fang e956d93ac2
[refactor] memory utils (#577) 2022-04-01 09:22:33 +08:00
Jiarui Fang 53b1b6e340
[zero] non model data tracing (#545) 2022-03-29 15:45:48 +08:00
Jiarui Fang 705f56107c
[zero] refactor model data tracing (#537) 2022-03-28 16:38:18 +08:00
Jiarui Fang 8d8c5407c0
[zero] refactor model data tracing (#522) 2022-03-25 18:03:32 +08:00
Jiarui Fang 4d322b79da
[refactor] remove old zero code (#517) 2022-03-25 14:54:39 +08:00
Jiarui Fang 0bebda6ea5
[zero] fix init device bug in zero init context unittest (#516) 2022-03-25 12:24:18 +08:00
ver217 fc8e6db005
[doc] Update docstring for ZeRO (#459)
* polish sharded model docstr

* polish sharded optim docstr

* polish zero docstr

* polish shard strategy docstr
2022-03-18 16:48:20 +08:00
ver217 a241f61b34
[zero] Update initialize for ZeRO (#458)
* polish code

* shard strategy receive pg in shard() / gather()

* update zero engine

* polish code
2022-03-18 16:18:31 +08:00
Jiarui Fang 0fcfb1e00d
[test] make zero engine test really work (#447) 2022-03-17 17:24:25 +08:00
ver217 63469c0f91 polish code 2022-03-14 15:48:55 +08:00
ver217 88804aee49 add bucket tensor shard strategy 2022-03-14 14:48:32 +08:00
HELSON 7c079d9c33
[hotfix] fixed bugs in ShardStrategy and PcieProfiler (#394) 2022-03-11 18:12:46 +08:00
Jiarui Fang 44e4891f57 [zero] able to place params on cpu after zero init context (#365)
* place params on cpu after zero init context

* polish code
2022-03-11 15:50:28 +08:00
ver217 1388671699 [zero] Update sharded model v2 using sharded param v2 (#323) 2022-03-11 15:50:28 +08:00
Jiarui Fang 11bddb6e55 [zero] update zero context init with the updated test utils (#327) 2022-03-11 15:50:28 +08:00
Jiarui Fang c9e7d9582d [zero] polish shard strategy (#310)
* init shard param from shape tuple

* add more unit tests for shard param

* add set_payload method for ShardedParam

* [zero] add sharded tensor class

* polish code

* add shard strategy

* move shard and gather logic to shard strategy from shard tensor.

* polish code
2022-03-11 15:50:28 +08:00
Jiarui Fang 74f77e314b [zero] a shard strategy in granularity of tensor (#307) 2022-03-11 15:50:28 +08:00