Jiarui Fang
d5e3e3ec01
[example] update gpt example for larger model scale ( #2211 )
2022-12-28 13:54:08 +08:00
Jiarui Fang
b87496a66b
[hotfix] fix auto policy of test_sharded_optim_v2 ( #2157 )
2022-12-20 23:03:18 +08:00
Jiarui Fang
ee287620f0
[Gemini] revert ZeROInitCtx related tracer ( #2138 )
2022-12-16 12:37:06 +08:00
Jiarui Fang
c89c66a858
[Gemini] update API of the chunkmemstatscollector. ( #2129 )
2022-12-14 00:47:06 +08:00
Jiarui Fang
2938edf446
[Gemini] update the non model data record method in runtime memory tracer ( #2128 )
2022-12-13 17:11:31 +08:00
Jiarui Fang
8fac837679
[Gemini] update non model data calculation method ( #2126 )
2022-12-13 15:44:07 +08:00
Jiarui Fang
5efda69735
[Gemini] hotfix the unittest bugs ( #2125 )
2022-12-13 14:14:55 +08:00
Jiarui Fang
05bb28aacf
[Gemini] mapping of preop timestep and param ( #2124 )
2022-12-13 12:50:24 +08:00
Jiarui Fang
9214d1fe28
[Gemini] chunk init using runtime visited param order ( #2115 )
2022-12-12 18:06:16 +08:00
Jiarui Fang
8afc001f4f
[Gemini] chunk init use OrderedParamGenerator ( #2110 )
2022-12-11 21:41:13 +08:00
Jiarui Fang
70a8556946
[gemini] get the param visited order during runtime ( #2108 )
2022-12-09 16:13:03 +08:00
Jiarui Fang
85efb7ac2e
[Gemini] gemini use the runtime memory tracer (RMT) ( #2099 )
2022-12-07 23:04:02 +08:00
Jiarui Fang
4b055351b0
[Gemini] make RuntimeMemTracer work correctly ( #2096 )
2022-12-07 16:59:59 +08:00
Jiarui Fang
1fca5d79ea
[Gemini] remove GLOBAL_MODEL_DATA_TRACER ( #2091 )
2022-12-06 22:30:16 +08:00
Jiarui Fang
28e55c2530
[Gemini] remove GLOBAL_CUDA_MEM_INFO ( #2090 )
2022-12-06 22:10:47 +08:00
Jiarui Fang
25abae6d7f
[Gemini] use MemStats in Runtime Memory tracer ( #2088 )
2022-12-06 19:48:20 +08:00
Jiarui Fang
33f4412102
[Gemini] use MemStats to store the tracing data. Seperate it from Collector. ( #2084 )
2022-12-06 16:43:06 +08:00
Jiarui Fang
1f99205827
[Gemini] remove static tracer ( #2083 )
2022-12-06 12:53:58 +08:00
Jiarui Fang
b3b89865e2
[Gemini] ParamOpHook -> ColoParamOpHook ( #2080 )
2022-12-05 17:11:06 +08:00
Jiarui Fang
a7adad9ccb
[Gemini] rename hooks related to runtime mem tracer ( #2076 )
2022-12-05 15:00:03 +08:00
Jiarui Fang
223332ff7e
[Gemini] rename ParamTracerWrapper -> RuntimeMemTracer ( #2073 )
2022-12-05 12:45:11 +08:00
Jiarui Fang
9f828ef36f
[Gemini] remove not used MemtracerWrapper ( #2072 )
2022-12-05 11:57:59 +08:00
Zihao
38ea4ba1bd
[Gemini] fix grad unreleased issue and param recovery issue ( #2052 )
2022-12-02 16:04:19 +08:00
Zihao
6a9158f1fa
[Gemini] free and allocate cuda memory by tensor.storage, add grad hook ( #2040 )
2022-11-30 15:57:45 +08:00
Zihao
95c4532fff
[Gemini] paramWrapper paramTracerHook unitest ( #2030 )
2022-11-26 13:30:24 +08:00
Jiarui Fang
0b0d8f9e17
[hotfix] revert bug PRs ( #2016 )
2022-11-24 15:28:58 +08:00
Zihao
0160a62a3c
[Gemini] param_tracer_wrapper and test case ( #2009 )
2022-11-24 14:40:33 +08:00
Jiarui Fang
3712ac7f90
[Gemini] add bert for MemtracerWrapper unintests ( #1982 )
2022-11-18 14:58:28 +08:00
Jiarui Fang
0529fcde06
[Gemini] independent runtime tracer ( #1974 )
2022-11-18 10:53:42 +08:00
Jiarui Fang
c4739a725a
[Gemini] polish memstats collector ( #1962 )
2022-11-16 15:45:57 +08:00
Zihao
20e255d4e8
MemStatsCollectorStatic ( #1765 )
2022-11-07 16:49:03 +08:00
HELSON
b28991dd0a
[feature] A new ZeRO implementation ( #1644 )
2022-10-09 09:18:51 +08:00
Jiarui Fang
c5d39215f6
Revert "[feature] new zero implementation ( #1623 )" ( #1643 )
...
This reverts commit 5be118f405
.
2022-09-26 10:06:03 +08:00
HELSON
5be118f405
[feature] new zero implementation ( #1623 )
2022-09-24 19:58:18 +08:00
Jiarui Fang
372f791444
[refactor] move chunk and chunkmgr to directory gemini ( #1182 )
2022-06-29 13:31:02 +08:00
ver217
1f894e033f
[gemini] zero supports gemini ( #1093 )
...
* add placement policy
* add gemini mgr
* update mem stats collector
* update zero
* update zero optim
* fix bugs
* zero optim monitor os
* polish unit test
* polish unit test
* add assert
2022-06-10 14:48:28 +08:00
ver217
be01db37c8
[tensor] refactor chunk mgr and impl MemStatsCollectorV2 ( #1077 )
...
* polish chunk manager
* polish unit test
* impl add_extern_static_tensor for chunk mgr
* add mem stats collector v2
* polish code
* polish unit test
* polish code
* polish get chunks
2022-06-09 20:56:34 +08:00
ver217
d7e0303d1e
[zero] use GeminiMemoryManager when sampling model data ( #850 )
2022-04-24 17:17:22 +08:00
Jiarui Fang
4d9332b4c5
[refactor] moving memtracer to gemini ( #801 )
2022-04-19 10:13:08 +08:00