Commit Graph

16 Commits (2e1dbfb463b48612f337f33534a90f225ec91845)

Author SHA1 Message Date
HELSON 1468e4bcfc
[zero] add constant placement policy (#1705)
* fixes memory leak when paramter is in fp16 in ZeroDDP init.
* bans chunk releasement in CUDA. Only when a chunk is about to offload, it is allowed to release.
* adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.
2022-10-14 17:53:16 +08:00
HELSON b28991dd0a
[feature] A new ZeRO implementation (#1644) 2022-10-09 09:18:51 +08:00
Jiarui Fang c5d39215f6
Revert "[feature] new zero implementation (#1623)" (#1643)
This reverts commit 5be118f405.
2022-09-26 10:06:03 +08:00
HELSON 5be118f405
[feature] new zero implementation (#1623) 2022-09-24 19:58:18 +08:00
HELSON b80340168e
[zero] add chunk_managerV2 for all-gather chunk (#1441) 2022-08-11 19:17:24 +08:00
HELSON 9056677b13
[zero] add chunk size searching algorithm for parameters in different groups (#1436) 2022-08-11 13:32:19 +08:00
HELSON 039b7ed3bc
[polish] add update directory in gemini; rename AgChunk to ChunkV2 (#1432) 2022-08-10 16:40:29 +08:00
HELSON 0d212183c4
[zero] add has_inf_or_nan in AgChunk; enhance the unit test of AgChunk (#1426) 2022-08-10 11:37:28 +08:00
HELSON 4fb3c52cf0
[zero] add unit test for AgChunk's append, close, access (#1423) 2022-08-09 18:03:10 +08:00
Jiarui Fang bd71e2a88b
[hotfix] add missing file (#1308) 2022-07-14 14:43:15 +08:00
ver217 c4d903e64a
[gemini] accelerate adjust_layout() (#878)
* add lru cache

* polish code

* update unit test

* fix sharded optim
2022-04-26 18:08:31 +08:00
HELSON 3107817172
[gemini] add stateful tensor container (#867) 2022-04-25 14:58:16 +08:00
HELSON e5ea3fdeef
[gemini] add GeminiMemoryManger (#832)
* refactor StatefulTensor, tensor utilities

* add unitest for GeminiMemoryManager
2022-04-24 13:08:48 +08:00
Jiarui Fang 0ce8924ceb
[tensor] reorganize files (#820) 2022-04-21 14:15:48 +08:00
Jiarui Fang ab962b9735
[gemini] a new tensor structure (#818)
* Revert "[zero] add ZeroTensorShardStrategy (#793)"

This reverts commit 88759e289e.

* [gemini] set cpu memory capacity

* [log] local throughput collecting

* polish

* polish

* polish

* polish code

* polish

* polish code

* add a new tensor structure and override linear for it

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish
2022-04-21 11:42:37 +08:00
Jiarui Fang 4d9332b4c5
[refactor] moving memtracer to gemini (#801) 2022-04-19 10:13:08 +08:00