Commit Graph

40 Commits (937f4042533c2e30313d79b8490f9f65a57d94df)

Author SHA1 Message Date
Jiarui Fang 9214d1fe28
[Gemini] chunk init using runtime visited param order (#2115)
2 years ago
Jiarui Fang e5aa8333e4
[NFC] update chunk manager API (#2119)
2 years ago
Jiarui Fang e99edfcb51
[NFC] polish comments for Chunk class (#2116)
2 years ago
HELSON 63fbba3c19
[zero] add L2 gradient clipping for ZeRO (#2112)
2 years ago
Jiarui Fang b3b89865e2
[Gemini] ParamOpHook -> ColoParamOpHook (#2080)
2 years ago
Jiarui Fang 96134e7be3
[hotfix] add bert test for gemini fwd bwd (#2035)
2 years ago
Jiarui Fang 8daf1b4db1
[Gemini] patch for supporting orch.add_ function for ColoTensor (#2003)
2 years ago
Jiarui Fang cc0ed7cf33
[Gemini] ZeROHookV2 -> GeminiZeROHook (#1972)
2 years ago
Jiarui Fang cd5a0d56fa
[Gemini] make gemini usage simple (#1821)
2 years ago
Zihao 20e255d4e8
MemStatsCollectorStatic (#1765)
2 years ago
HELSON c6a1a62636
[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786)
2 years ago
HELSON 1468e4bcfc
[zero] add constant placement policy (#1705)
2 years ago
HELSON b28991dd0a
[feature] A new ZeRO implementation (#1644)
2 years ago
Jiarui Fang c5d39215f6
Revert "[feature] new zero implementation (#1623)" (#1643)
2 years ago
HELSON 5be118f405
[feature] new zero implementation (#1623)
2 years ago
Jiatong Han 3263cdf57f [NFC] polish colossalai/nn/parallel/data_parallel.py code style (#1570)
2 years ago
ver217 04c9a86af8
[zero] ZeroDDP supports controlling outputs' dtype (#1399)
2 years ago
HELSON 4e98e938ce
[zero] alleviate memory usage in ZeRODDP state_dict (#1398)
2 years ago
ver217 83328329dd
[hotfix] fix zero ddp buffer cast (#1376)
2 years ago
ver217 5d5031e946
fix zero ddp state dict (#1378)
2 years ago
HELSON 87775a0682
[colotensor] use cpu memory to store state_dict (#1367)
2 years ago
ver217 d068af81a3
[doc] update rst and docstring (#1351)
2 years ago
ver217 0c51ff2c13
[hotfix] ZeroDDP use new process group (#1333)
2 years ago
Jiarui Fang b5f25eb32a
[Tensor] add cpu group to ddp (#1200)
2 years ago
Jiarui Fang 060b917daf
[refactor] remove gpc dependency in colotensor's _ops (#1189)
2 years ago
Jiarui Fang 372f791444
[refactor] move chunk and chunkmgr to directory gemini (#1182)
2 years ago
ver217 6b2f2ab9bb
[ddp] ColoDDP uses bucket all-reduce (#1177)
2 years ago
ver217 54aabb8da4
[gemini] refactor gemini mgr (#1151)
2 years ago
ver217 8106d7b8c7
[ddp] refactor ColoDDP and ZeroDDP (#1146)
2 years ago
Frank Lee 15aab1476e
[zero] avoid zero hook spam by changing log to debug level (#1137)
2 years ago
ver217 d26902645e
[ddp] add save/load state dict for ColoDDP (#1127)
2 years ago
ver217 f0a954f16d
[ddp] add set_params_to_ignore for ColoDDP (#1122)
2 years ago
ver217 e127b4375b
cast colo ddp v2 inputs/outputs (#1120)
2 years ago
ver217 7d14b473f0
[gemini] gemini mgr supports "cpu" placement policy (#1118)
2 years ago
ver217 895c1c5ee7
[tensor] refactor param op hook (#1097)
2 years ago
Frank Lee cb18922c47
[doc] added documentation to chunk and chunk manager (#1094)
2 years ago
ver217 1f894e033f
[gemini] zero supports gemini (#1093)
2 years ago
ver217 be01db37c8
[tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077)
2 years ago
Ziyue Jiang 4fc748f69b
[Tensor] fix optimizer for CPU parallel (#1069)
3 years ago
Jiarui Fang 49832b2344
[refactory] add nn.parallel module (#1068)
3 years ago