Jiarui Fang
4f21c9e8d9
[Gemini] polish runtime tracer tests ( #2077 )
2022-12-05 16:22:49 +08:00
Jiarui Fang
a7adad9ccb
[Gemini] rename hooks related to runtime mem tracer ( #2076 )
2022-12-05 15:00:03 +08:00
Jiarui Fang
40b7d55bf3
[Gemini] add albert in test models. ( #2075 )
2022-12-05 14:09:34 +08:00
Jiarui Fang
616ed91ecd
[test] bert test in non-distributed way ( #2074 )
2022-12-05 13:32:16 +08:00
Jiarui Fang
223332ff7e
[Gemini] rename ParamTracerWrapper -> RuntimeMemTracer ( #2073 )
2022-12-05 12:45:11 +08:00
Jiarui Fang
9f828ef36f
[Gemini] remove unused MemtracerWrapper ( #2072 )
2022-12-05 11:57:59 +08:00
Zihao
38ea4ba1bd
[Gemini] fix grad unreleased issue and param recovery issue ( #2052 )
2022-12-02 16:04:19 +08:00
HELSON
f6178728a0
[gemini] fix init bugs for modules ( #2047 )
* [gemini] fix init bugs for modules
* fix bugs
2022-11-30 17:06:10 +08:00
Zihao
6a9158f1fa
[Gemini] free and allocate cuda memory by tensor.storage, add grad hook ( #2040 )
2022-11-30 15:57:45 +08:00
Jiarui Fang
1e885329f4
[test] align model name with the file name. ( #2045 )
2022-11-30 15:45:26 +08:00
Jiarui Fang
31c644027b
[hotfix] hotfix Gemini for no leaf modules bug ( #2043 )
2022-11-30 14:53:41 +08:00
HELSON
384cd26314
[zero] fix testing parameters ( #2042 )
2022-11-30 12:09:32 +08:00
HELSON
17a3c685b0
[zero] fix unit-tests ( #2039 )
2022-11-30 10:40:31 +08:00
Jiarui Fang
eb7742a4bb
[Gemini] more tests for Gemini ( #2038 )
* [Gemini] more tests for Gemini
* polish code
2022-11-29 17:13:10 +08:00
HELSON
537e181705
[testing] fix testing models ( #2036 )
* [testing] fix testing models
* roll back
2022-11-29 13:42:06 +08:00
Jiarui Fang
96134e7be3
[hotfix] add bert test for gemini fwd bwd ( #2035 )
2022-11-29 11:19:52 +08:00
Jiarui Fang
28aa9a4294
[Gemini] more rigorous unit tests for run_fwd_bwd ( #2034 )
2022-11-29 09:26:06 +08:00
Zihao
95c4532fff
[Gemini] paramWrapper paramTracerHook unit test ( #2030 )
2022-11-26 13:30:24 +08:00
Jiarui Fang
8daf1b4db1
[Gemini] patch for supporting torch.add_ function for ColoTensor ( #2003 )
2022-11-25 20:06:35 +08:00
Jiarui Fang
2e9cbfca12
[Gemini] add unit tests to check gemini correctness ( #2015 )
2022-11-24 16:51:45 +08:00
Jiarui Fang
0b0d8f9e17
[hotfix] revert bug PRs ( #2016 )
2022-11-24 15:28:58 +08:00
Zihao
0160a62a3c
[Gemini] param_tracer_wrapper and test case ( #2009 )
2022-11-24 14:40:33 +08:00
Jiarui Fang
3d907faede
[Gemini] add an inline_op_module to common test models and polish unit tests. ( #2004 )
2022-11-23 16:55:54 +08:00
Jiarui Fang
5bec3b2168
[Gemini] open grad checkpoint when model building ( #1984 )
2022-11-18 16:32:54 +08:00
Jiarui Fang
3712ac7f90
[Gemini] add bert for MemtracerWrapper unit tests ( #1982 )
2022-11-18 14:58:28 +08:00
Jiarui Fang
e481489aa6
[Gemini] MemtracerWrapper unittests ( #1981 )
2022-11-18 14:19:40 +08:00
Jiarui Fang
f7e276fa71
[Gemini] add GeminiAdamOptimizer ( #1960 )
2022-11-16 14:44:28 +08:00
HELSON
c6a1a62636
[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 ( #1786 )
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12
* [zero] add cpu shard init
* [zero] add tiny example test
* [colo_tensor] fix bugs for torch-1.11
2022-11-02 16:11:34 +08:00
HELSON
f69f9bf223
[zero] add chunk init function for users ( #1729 )
* add chunk manager init function
* fix unit tests
* add comment
* add flush=True
2022-10-18 16:31:22 +08:00
HELSON
1468e4bcfc
[zero] add constant placement policy ( #1705 )
* fixes a memory leak when a parameter is in fp16 in ZeroDDP init.
* bans chunk release in CUDA; a chunk may be released only when it is about to be offloaded.
* adds a constant placement policy. With it, users can reserve a cache memory space for parameters.
2022-10-14 17:53:16 +08:00
HELSON
b28991dd0a
[feature] A new ZeRO implementation ( #1644 )
2022-10-09 09:18:51 +08:00
Jiarui Fang
c5d39215f6
Revert "[feature] new zero implementation ( #1623 )" ( #1643 )
This reverts commit 5be118f405.
2022-09-26 10:06:03 +08:00
HELSON
5be118f405
[feature] new zero implementation ( #1623 )
2022-09-24 19:58:18 +08:00
HELSON
b80340168e
[zero] add chunk_managerV2 for all-gather chunk ( #1441 )
2022-08-11 19:17:24 +08:00
HELSON
9056677b13
[zero] add chunk size searching algorithm for parameters in different groups ( #1436 )
2022-08-11 13:32:19 +08:00
HELSON
039b7ed3bc
[polish] add update directory in gemini; rename AgChunk to ChunkV2 ( #1432 )
2022-08-10 16:40:29 +08:00
HELSON
0d212183c4
[zero] add has_inf_or_nan in AgChunk; enhance the unit test of AgChunk ( #1426 )
2022-08-10 11:37:28 +08:00
HELSON
4fb3c52cf0
[zero] add unit test for AgChunk's append, close, access ( #1423 )
2022-08-09 18:03:10 +08:00
Jiarui Fang
bd71e2a88b
[hotfix] add missing file ( #1308 )
2022-07-14 14:43:15 +08:00
ver217
c4d903e64a
[gemini] accelerate adjust_layout() ( #878 )
* add lru cache
* polish code
* update unit test
* fix sharded optim
2022-04-26 18:08:31 +08:00
HELSON
3107817172
[gemini] add stateful tensor container ( #867 )
2022-04-25 14:58:16 +08:00
HELSON
e5ea3fdeef
[gemini] add GeminiMemoryManager ( #832 )
* refactor StatefulTensor, tensor utilities
* add unit test for GeminiMemoryManager
2022-04-24 13:08:48 +08:00
Jiarui Fang
0ce8924ceb
[tensor] reorganize files ( #820 )
2022-04-21 14:15:48 +08:00
Jiarui Fang
ab962b9735
[gemini] a new tensor structure ( #818 )
* Revert "[zero] add ZeroTensorShardStrategy (#793)"
This reverts commit 88759e289e.
* [gemini] set cpu memory capacity
* [log] local throughput collecting
* polish
* polish
* polish
* polish code
* polish
* polish code
* add a new tensor structure and override linear for it
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
2022-04-21 11:42:37 +08:00
Jiarui Fang
4d9332b4c5
[refactor] moving memtracer to gemini ( #801 )
2022-04-19 10:13:08 +08:00