Commit Graph

55 Commits (ed4c4484880b733894e6088e681f7cca32afe0b4)

Author SHA1 Message Date
ver217 26b7aac0be
[zero] reorganize zero/gemini folder structure (#3424)
2 years ago
junxu c52edcf0eb
Rename class method of ZeroDDP (#2692)
2 years ago
HELSON 8213f89fd2
[gemini] add fake_release_chunk for keep-gathered chunk in the inference mode (#2671)
2 years ago
ver217 5b1854309a
[hotfix] fix zero ddp warmup check (#2545)
2 years ago
HELSON a4ed9125ac
[hotfix] fix lightning error (#2529)
2 years ago
HELSON 707b11d4a0
[gemini] update ddp strict mode (#2518)
2 years ago
HELSON 2d1a7dfe5f
[zero] add strict ddp mode (#2508)
2 years ago
HELSON 5521af7877
[zero] fix state_dict and load_state_dict for ddp ignored parameters (#2443)
2 years ago
HELSON 7829aa094e
[ddp] add is_ddp_ignored (#2434)
2 years ago
HELSON bb4e9a311a
[zero] add inference mode and its unit test (#2418)
2 years ago
HELSON ea13a201bb
[polish] polish code for get_static_torch_model (#2405)
2 years ago
eric8607242 9880fd2cd8
Fix state_dict key missing issue of the ZeroDDP (#2363)
2 years ago
HELSON 48d33b1b17
[gemini] add get static torch model (#2356)
2 years ago
Jiarui Fang af32022f74
[Gemini] fix the convert_to_torch_module bug (#2269)
2 years ago
HELSON 2458659919
[zero] fix error for BEiT models (#2169)
2 years ago
Jiarui Fang 9214d1fe28
[Gemini] chunk init using runtime visited param order (#2115)
2 years ago
Jiarui Fang e5aa8333e4
[NFC] update chunk manager API (#2119)
2 years ago
Jiarui Fang e99edfcb51
[NFC] polish comments for Chunk class (#2116)
2 years ago
HELSON 63fbba3c19
[zero] add L2 gradient clipping for ZeRO (#2112)
2 years ago
Jiarui Fang b3b89865e2
[Gemini] ParamOpHook -> ColoParamOpHook (#2080)
2 years ago
Jiarui Fang 96134e7be3
[hotfix] add bert test for gemini fwd bwd (#2035)
2 years ago
Jiarui Fang 8daf1b4db1
[Gemini] patch for supporting orch.add_ function for ColoTensor (#2003)
2 years ago
Jiarui Fang cc0ed7cf33
[Gemini] ZeROHookV2 -> GeminiZeROHook (#1972)
2 years ago
Jiarui Fang cd5a0d56fa
[Gemini] make gemini usage simple (#1821)
2 years ago
Zihao 20e255d4e8
MemStatsCollectorStatic (#1765)
2 years ago
HELSON c6a1a62636
[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786)
2 years ago
HELSON 1468e4bcfc
[zero] add constant placement policy (#1705)
2 years ago
HELSON b28991dd0a
[feature] A new ZeRO implementation (#1644)
2 years ago
Jiarui Fang c5d39215f6
Revert "[feature] new zero implementation (#1623)" (#1643)
2 years ago
HELSON 5be118f405
[feature] new zero implementation (#1623)
2 years ago
Jiatong Han 3263cdf57f [NFC] polish colossalai/nn/parallel/data_parallel.py code style (#1570)
2 years ago
ver217 04c9a86af8
[zero] ZeroDDP supports controlling outputs' dtype (#1399)
2 years ago
HELSON 4e98e938ce
[zero] alleviate memory usage in ZeRODDP state_dict (#1398)
2 years ago
ver217 83328329dd
[hotfix] fix zero ddp buffer cast (#1376)
2 years ago
ver217 5d5031e946
fix zero ddp state dict (#1378)
2 years ago
HELSON 87775a0682
[colotensor] use cpu memory to store state_dict (#1367)
2 years ago
ver217 d068af81a3
[doc] update rst and docstring (#1351)
2 years ago
ver217 0c51ff2c13
[hotfix] ZeroDDP use new process group (#1333)
2 years ago
Jiarui Fang b5f25eb32a
[Tensor] add cpu group to ddp (#1200)
2 years ago
Jiarui Fang 060b917daf
[refactor] remove gpc dependency in colotensor's _ops (#1189)
2 years ago
Jiarui Fang 372f791444
[refactor] move chunk and chunkmgr to directory gemini (#1182)
2 years ago
ver217 6b2f2ab9bb
[ddp] ColoDDP uses bucket all-reduce (#1177)
2 years ago
ver217 54aabb8da4
[gemini] refactor gemini mgr (#1151)
2 years ago
ver217 8106d7b8c7
[ddp] refactor ColoDDP and ZeroDDP (#1146)
2 years ago
Frank Lee 15aab1476e
[zero] avoid zero hook spam by changing log to debug level (#1137)
2 years ago
ver217 d26902645e
[ddp] add save/load state dict for ColoDDP (#1127)
2 years ago
ver217 f0a954f16d
[ddp] add set_params_to_ignore for ColoDDP (#1122)
2 years ago
ver217 e127b4375b
cast colo ddp v2 inputs/outputs (#1120)
2 years ago
ver217 7d14b473f0
[gemini] gemini mgr supports "cpu" placement policy (#1118)
2 years ago
ver217 895c1c5ee7
[tensor] refactor param op hook (#1097)
2 years ago