85 Commits (ce3c4eca7bc2c5b148dfe5db1ddb702558af4831)

Author | SHA1 | Message | Date
Jiarui Fang | b87496a66b | [hotfix] fix auto policy of test_sharded_optim_v2 (#2157) | 2 years ago
Jiarui Fang | ee287620f0 | [Gemini] revert ZeROInitCtx related tracer (#2138) | 2 years ago
Jiarui Fang | c89c66a858 | [Gemini] update API of the chunkmemstatscollector. (#2129) | 2 years ago
Jiarui Fang | 2938edf446 | [Gemini] update the non model data record method in runtime memory tracer (#2128) | 2 years ago
Jiarui Fang | 8fac837679 | [Gemini] update non model data calculation method (#2126) | 2 years ago
Jiarui Fang | 5efda69735 | [Gemini] hotfix the unittest bugs (#2125) | 2 years ago
Jiarui Fang | 05bb28aacf | [Gemini] mapping of preop timestep and param (#2124) | 2 years ago
Jiarui Fang | 9214d1fe28 | [Gemini] chunk init using runtime visited param order (#2115) | 2 years ago
Jiarui Fang | e5aa8333e4 | [NFC] update chunk manager API (#2119) | 2 years ago
Jiarui Fang | e99edfcb51 | [NFC] polish comments for Chunk class (#2116) | 2 years ago
Jiarui Fang | 8afc001f4f | [Gemini] chunk init use OrderedParamGenerator (#2110) | 2 years ago
HELSON | 63fbba3c19 | [zero] add L2 gradient clipping for ZeRO (#2112) | 2 years ago
Jiarui Fang | 70a8556946 | [gemini] get the param visited order during runtime (#2108) | 2 years ago
Jiarui Fang | 61f31c3cf0 | [Gemini] NFC, polish search_chunk_configuration (#2107) | 2 years ago
Jiarui Fang | 85efb7ac2e | [Gemini] gemini use the runtime memory tracer (RMT) (#2099) | 2 years ago
Jiarui Fang | 4b055351b0 | [Gemini] make RuntimeMemTracer work correctly (#2096) | 2 years ago
Jiarui Fang | 1fca5d79ea | [Gemini] remove GLOBAL_MODEL_DATA_TRACER (#2091) | 2 years ago
Jiarui Fang | 28e55c2530 | [Gemini] remove GLOBAL_CUDA_MEM_INFO (#2090) | 2 years ago
Jiarui Fang | 25abae6d7f | [Gemini] use MemStats in Runtime Memory tracer (#2088) | 2 years ago
Jiarui Fang | 33f4412102 | [Gemini] use MemStats to store the tracing data. Seperate it from Collector. (#2084) | 2 years ago
Jiarui Fang | 1f99205827 | [Gemini] remove static tracer (#2083) | 2 years ago
Jiarui Fang | b3b89865e2 | [Gemini] ParamOpHook -> ColoParamOpHook (#2080) | 2 years ago
Jiarui Fang | a7adad9ccb | [Gemini] rename hooks related to runtime mem tracer (#2076) | 2 years ago
Jiarui Fang | 223332ff7e | [Gemini] rename ParamTracerWrapper -> RuntimeMemTracer (#2073) | 2 years ago
Jiarui Fang | 9f828ef36f | [Gemini] remove not used MemtracerWrapper (#2072) | 2 years ago
Zihao | 38ea4ba1bd | [Gemini] fix grad unreleased issue and param recovery issue (#2052) | 2 years ago
Zihao | 6a9158f1fa | [Gemini] free and allocate cuda memory by tensor.storage, add grad hook (#2040) | 2 years ago
Jiarui Fang | 28aa9a4294 | [Gemini] more rigorous unit tests for run_fwd_bwd (#2034) | 2 years ago
Zihao | 95c4532fff | [Gemini] paramWrapper paramTracerHook unitest (#2030) | 2 years ago
Jiarui Fang | 8daf1b4db1 | [Gemini] patch for supporting orch.add_ function for ColoTensor (#2003) | 2 years ago
Zihao | a719b89a41 | [gemini] param_trace_hook (#2020) | 2 years ago
Jiarui Fang | 0b0d8f9e17 | [hotfix] revert bug PRs (#2016) | 2 years ago
Zihao | aba3db464d | [Gemini] ParamMemHook (#2008) | 2 years ago
Zihao | 0160a62a3c | [Gemini] param_tracer_wrapper and test case (#2009) | 2 years ago
Jiarui Fang | 3712ac7f90 | [Gemini] add bert for MemtracerWrapper unintests (#1982) | 2 years ago
Jiarui Fang | e481489aa6 | [Gemini] MemtracerWrapper unittests (#1981) | 2 years ago
Jiarui Fang | 31922110ad | [Gemini] memory trace hook (#1978) | 2 years ago
Jiarui Fang | 0529fcde06 | [Gemini] independent runtime tracer (#1974) | 2 years ago
Jiarui Fang | 7e24b9b9ee | [Gemini] clean no used MemTraceOp (#1970) | 2 years ago
Jiarui Fang | c4739a725a | [Gemini] polish memstats collector (#1962) | 2 years ago
Zihao | 20e255d4e8 | MemStatsCollectorStatic (#1765) | 2 years ago
Jiarui Fang | c248800359 | [kernel] skip tests of flash_attn and triton when they are not available (#1798) | 2 years ago
HELSON | c6a1a62636 | [hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786) | 2 years ago
Jiarui Fang | cb5a587e9a | [hotfix] polish chunk import (#1787) | 2 years ago
Jiarui Fang | f34dab4270 | [compatibility] ChunkMgr import error (#1772) | 2 years ago
HELSON | f69f9bf223 | [zero] add chunk init function for users (#1729) | 2 years ago
HELSON | 1468e4bcfc | [zero] add constant placement policy (#1705) | 2 years ago
HELSON | b28991dd0a | [feature] A new ZeRO implementation (#1644) | 2 years ago
Jiarui Fang | c5d39215f6 | Revert "[feature] new zero implementation (#1623)" (#1643) | 2 years ago
HELSON | 5be118f405 | [feature] new zero implementation (#1623) | 2 years ago