Commit Graph

290 Commits (fix-setup)

Author  SHA1  Message  Date
Baizhou Zhang  44eab2b27f  [shardformer] support sharded checkpoint IO for models of HybridParallelPlugin (#4506)  1 year ago
LuGY  839847b7d7  [zero]support zero2 with gradient accumulation (#4511)  1 year ago
Hongxin Liu  27061426f7  [gemini] improve compatibility and add static placement policy (#4479)  1 year ago
LuGY  d86ddd9b29  [hotfix] fix unsafe async comm in zero (#4404)  1 year ago
Baizhou Zhang  6ccecc0c69  [gemini] fix tensor storage cleaning in state dict collection (#4396)  1 year ago
LuGY  45b08f08cb  [zero] optimize the optimizer step time (#4221)  1 year ago
LuGY  1a49a5ea00  [zero] support shard optimizer state dict of zero (#4194)  1 year ago
LuGY  dd7cc58299  [zero] add state dict for low level zero (#4179)  1 year ago
LuGY  c668801d36  [zero] allow passing process group to zero12 (#4153)  1 year ago
LuGY  79cf1b5f33  [zero]support no_sync method for zero1 plugin (#4138)  1 year ago
LuGY  c6ab96983a  [zero] refactor low level zero for shard evenly (#4030)  1 year ago
Baizhou Zhang  c6f6005990  [checkpointio] Sharded Optimizer Checkpoint for Gemini Plugin (#4302)  1 year ago
Baizhou Zhang  58913441a1  [checkpointio] Unsharded Optimizer Checkpoint for Gemini Plugin (#4141)  1 year ago
Baizhou Zhang  0bb0b481b4  [gemini] fix argument naming during chunk configuration searching  1 year ago
Frank Lee  71fe52769c  [gemini] fixed the gemini checkpoint io (#3934)  1 year ago
digger yu  de0d7df33f  [nfc] fix typo colossalai/zero (#3923)  2 years ago
digger yu  a9d1cadc49  fix typo with colossalai/trainer utils zero (#3908)  2 years ago
Hongxin Liu  ae02d4e4f7  [bf16] add bf16 support (#3882)  2 years ago
Hongxin Liu  dbb32692d2  [lazy] refactor lazy init (#3891)  2 years ago
digger yu  9265f2d4d7  [NFC]fix typo colossalai/auto_parallel nn utils etc. (#3779)  2 years ago
jiangmingyan  307894f74d  [booster] gemini plugin support shard checkpoint (#3610)  2 years ago
YH  a22407cc02  [zero] Suggests a minor change to confusing variable names in the ZeRO optimizer. (#3173)  2 years ago
Hongxin Liu  50793b35f4  [gemini] accelerate inference (#3641)  2 years ago
Hongxin Liu  4b3240cb59  [booster] add low level zero plugin (#3594)  2 years ago
digger-yu  b9a8dff7e5  [doc] Fix typo under colossalai and doc(#3618)  2 years ago
Hongxin Liu  12eff9eb4c  [gemini] state dict supports fp16 (#3590)  2 years ago
Hongxin Liu  f313babd11  [gemini] support save state dict in shards (#3581)  2 years ago
YH  d329c294ec  Add docstr for zero3 chunk search utils (#3572)  2 years ago
Hongxin Liu  173dad0562  [misc] add verbose arg for zero and op builder (#3552)  2 years ago
Hongxin Liu  152239bbfa  [gemini] gemini supports lazy init (#3379)  2 years ago
YH  bcf0cbcbe7  [doc] Add docs for clip args in zero optim (#3504)  2 years ago
ver217  573af84184  [example] update examples related to zero/gemini (#3431)  2 years ago
ver217  26b7aac0be  [zero] reorganize zero/gemini folder structure (#3424)  2 years ago
YH  80aed29cd3  [zero] Refactor ZeroContextConfig class using dataclass (#3186)  2 years ago
YH  9d644ff09f  Fix docstr for zero statedict (#3185)  2 years ago
ver217  823f3b9cf4  [doc] add deepspeed citation and copyright (#2996)  2 years ago
YH  7b13f7db18  [zero] trivial zero optimizer refactoring (#2869)  2 years ago
Boyuan Yao  8e3f66a0d1  [zero] fix wrong import (#2777)  2 years ago
Nikita Shulga  01066152f1  Don't use `torch._six` (#2775)  2 years ago
YH  ae86a29e23  Refact method of grad store (#2687)  2 years ago
HELSON  df4f020ee3  [zero1&2] only append parameters with gradients (#2681)  2 years ago
HELSON  b528eea0f0  [zero] add zero wrappers (#2523)  2 years ago
HELSON  077a5cdde4  [zero] fix gradient clipping in hybrid parallelism (#2521)  2 years ago
HELSON  d565a24849  [zero] add unit testings for hybrid parallelism (#2486)  2 years ago
HELSON  a5dc4253c6  [zero] polish low level optimizer (#2473)  2 years ago
Jiarui Fang  867c8c2d3a  [zero] low level optim supports ProcessGroup (#2464)  2 years ago
HELSON  7829aa094e  [ddp] add is_ddp_ignored (#2434)  2 years ago
HELSON  62c38e3330  [zero] polish low level zero optimizer (#2275)  2 years ago
HELSON  a7d95b7024  [example] add zero1, zero2 example in GPT examples (#2146)  2 years ago
Jiarui Fang  c89c66a858  [Gemini] update API of the chunkmemstatscollector. (#2129)  2 years ago
Jiarui Fang  2938edf446  [Gemini] update the non model data record method in runtime memory tracer (#2128)  2 years ago
Jiarui Fang  e99edfcb51  [NFC] polish comments for Chunk class (#2116)  2 years ago
Jiarui Fang  33f4412102  [Gemini] use MemStats to store the tracing data. Seperate it from Collector. (#2084)  2 years ago
Jiarui Fang  b3b89865e2  [Gemini] ParamOpHook -> ColoParamOpHook (#2080)  2 years ago
HELSON  a1ce02d740  [zero] test gradient accumulation (#1964)  2 years ago
Jiarui Fang  cc0ed7cf33  [Gemini] ZeROHookV2 -> GeminiZeROHook (#1972)  2 years ago
Jiarui Fang  c4739a725a  [Gemini] polish memstats collector (#1962)  2 years ago
Jiarui Fang  f7e276fa71  [Gemini] add GeminiAdamOptimizer (#1960)  2 years ago
HELSON  7066dfbf82  [zero] fix memory leak for zero2 (#1955)  2 years ago
HELSON  6e51d296f0  [zero] migrate zero1&2 (#1878)  2 years ago
Zihao  20e255d4e8  MemStatsCollectorStatic (#1765)  2 years ago
HELSON  c6a1a62636  [hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786)  2 years ago
CsRic  ea961d8fd1  [NFC] polish colossalai/zero/sharded_param/__init__.py code style (#1717)  2 years ago
HELSON  1468e4bcfc  [zero] add constant placement policy (#1705)  2 years ago
HELSON  b28991dd0a  [feature] A new ZeRO implementation (#1644)  2 years ago
Jiarui Fang  c5d39215f6  Revert "[feature] new zero implementation (#1623)" (#1643)  2 years ago
HELSON  5be118f405  [feature] new zero implementation (#1623)  2 years ago
HELSON  f7f2248771  [moe] fix MoE bugs (#1628)  2 years ago
ver217  c9e8ce67b8  fix move fp32 shards (#1604)  2 years ago
Fazzie-Maqianli  06dccdde44  [NFC] polish colossalai/zero/sharded_model/reduce_scatter.py code style (#1554)  2 years ago
ver217  821c6172e2  [utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442)  2 years ago
ver217  6df3e19be9  [hotfix] zero optim prevents calling inner optim.zero_grad (#1422)  2 years ago
ver217  8dced41ad0  [zero] zero optim state_dict takes only_rank_0 (#1384)  2 years ago
ver217  828b9e5e0d  [hotfix] fix zero optim save/load state dict (#1381)  2 years ago
ver217  6b43c789fd  fix zero optim backward_by_grad and save/load (#1353)  2 years ago
ver217  d068af81a3  [doc] update rst and docstring (#1351)  2 years ago
ver217  ce470ba37e  [checkpoint] sharded optim save/load grad scaler (#1350)  2 years ago
ver217  7a05367101  [hotfix] shared model returns cpu state_dict (#1328)  2 years ago
Jiarui Fang  4165eabb1e  [hotfix] remove potiential circle import (#1307)  2 years ago
ver217  a45ddf2d5f  [hotfix] fix sharded optim step and clip_grad_norm (#1226)  2 years ago
Jiarui Fang  a444633d13  warmup ratio configration (#1192)  2 years ago
Jiarui Fang  372f791444  [refactor] move chunk and chunkmgr to directory gemini (#1182)  2 years ago
ver217  9e1daa63d2  [zero] sharded optim supports loading local state dict (#1170)  2 years ago
ver217  561e90493f  [zero] zero optim supports loading local state dict (#1171)  2 years ago
ver217  8106d7b8c7  [ddp] refactor ColoDDP and ZeroDDP (#1146)  2 years ago
ver217  6690a61b4d  [hotfix] prevent nested ZeRO (#1140)  2 years ago
Frank Lee  15aab1476e  [zero] avoid zero hook spam by changing log to debug level (#1137)  2 years ago
ver217  a1a7899cae  [hotfix] fix zero init ctx numel (#1128)  2 years ago
ver217  f0a954f16d  [ddp] add set_params_to_ignore for ColoDDP (#1122)  2 years ago
Frank Lee  14e5b11d7f  [zero] fixed api consistency (#1098)  3 years ago
Frank Lee  cb18922c47  [doc] added documentation to chunk and chunk manager (#1094)  3 years ago
ver217  1f894e033f  [gemini] zero supports gemini (#1093)  3 years ago
ver217  be01db37c8  [tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077)  3 years ago
ver217  c5cd3b0f35  [zero] zero optim copy chunk rather than copy tensor (#1070)  3 years ago
Jiarui Fang  49832b2344  [refactory] add nn.parallel module (#1068)  3 years ago
ver217  e3fde4ee6b  fix import error in sharded model v2 (#1053)  3 years ago
ver217  51b9a49655  [zero] add zero optimizer for ColoTensor (#1046)  3 years ago
ver217  9492a561c3  [tensor] ColoTensor supports ZeRo (#1015)  3 years ago
ver217  7cfd6c827e  [zero] add load_state_dict for sharded model (#894)  3 years ago
ver217  c4d903e64a  [gemini] accelerate adjust_layout() (#878)  3 years ago