Commit Graph

178 Commits (a4ed5b0d0d926f9e3f84711799e21db795a339e9)

Author SHA1 Message Date
CsRic ea961d8fd1 [NFC] polish colossalai/zero/sharded_param/__init__.py code style (#1717)
2 years ago
HELSON 1468e4bcfc
[zero] add constant placement policy (#1705)
2 years ago
HELSON b28991dd0a
[feature] A new ZeRO implementation (#1644)
2 years ago
Jiarui Fang c5d39215f6
Revert "[feature] new zero implementation (#1623)" (#1643)
2 years ago
HELSON 5be118f405
[feature] new zero implementation (#1623)
2 years ago
HELSON f7f2248771
[moe] fix MoE bugs (#1628)
2 years ago
ver217 c9e8ce67b8
fix move fp32 shards (#1604)
2 years ago
Fazzie-Maqianli 06dccdde44 [NFC] polish colossalai/zero/sharded_model/reduce_scatter.py code style (#1554)
2 years ago
ver217 821c6172e2
[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442)
2 years ago
ver217 6df3e19be9
[hotfix] zero optim prevents calling inner optim.zero_grad (#1422)
2 years ago
ver217 8dced41ad0
[zero] zero optim state_dict takes only_rank_0 (#1384)
2 years ago
ver217 828b9e5e0d
[hotfix] fix zero optim save/load state dict (#1381)
2 years ago
ver217 6b43c789fd
fix zero optim backward_by_grad and save/load (#1353)
2 years ago
ver217 d068af81a3
[doc] update rst and docstring (#1351)
2 years ago
ver217 ce470ba37e
[checkpoint] sharded optim save/load grad scaler (#1350)
2 years ago
ver217 7a05367101
[hotfix] shared model returns cpu state_dict (#1328)
2 years ago
Jiarui Fang 4165eabb1e
[hotfix] remove potiential circle import (#1307)
2 years ago
ver217 a45ddf2d5f
[hotfix] fix sharded optim step and clip_grad_norm (#1226)
2 years ago
Jiarui Fang a444633d13
warmup ratio configration (#1192)
2 years ago
Jiarui Fang 372f791444
[refactor] move chunk and chunkmgr to directory gemini (#1182)
2 years ago
ver217 9e1daa63d2
[zero] sharded optim supports loading local state dict (#1170)
2 years ago
ver217 561e90493f
[zero] zero optim supports loading local state dict (#1171)
2 years ago
ver217 8106d7b8c7
[ddp] refactor ColoDDP and ZeroDDP (#1146)
2 years ago
ver217 6690a61b4d
[hotfix] prevent nested ZeRO (#1140)
2 years ago
Frank Lee 15aab1476e
[zero] avoid zero hook spam by changing log to debug level (#1137)
2 years ago
ver217 a1a7899cae
[hotfix] fix zero init ctx numel (#1128)
2 years ago
ver217 f0a954f16d
[ddp] add set_params_to_ignore for ColoDDP (#1122)
2 years ago
Frank Lee 14e5b11d7f
[zero] fixed api consistency (#1098)
3 years ago
Frank Lee cb18922c47
[doc] added documentation to chunk and chunk manager (#1094)
3 years ago
ver217 1f894e033f
[gemini] zero supports gemini (#1093)
3 years ago
ver217 be01db37c8
[tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077)
3 years ago
ver217 c5cd3b0f35
[zero] zero optim copy chunk rather than copy tensor (#1070)
3 years ago
Jiarui Fang 49832b2344
[refactory] add nn.parallel module (#1068)
3 years ago
ver217 e3fde4ee6b
fix import error in sharded model v2 (#1053)
3 years ago
ver217 51b9a49655
[zero] add zero optimizer for ColoTensor (#1046)
3 years ago
ver217 9492a561c3
[tensor] ColoTensor supports ZeRo (#1015)
3 years ago
ver217 7cfd6c827e
[zero] add load_state_dict for sharded model (#894)
3 years ago
ver217 c4d903e64a
[gemini] accelerate adjust_layout() (#878)
3 years ago
HELSON 425b4a96b8
[gemini] polish stateful_tensor_mgr (#876)
3 years ago
ver217 d7e0303d1e
[zero] use GeminiMemoryManager when sampling model data (#850)
3 years ago
ver217 0f7ed8c192
fix _post_init_method of zero init ctx (#847)
3 years ago
HELSON e5ea3fdeef
[gemini] add GeminiMemoryManger (#832)
3 years ago
Jiarui Fang 595bedf767
revert zero tensors back (#829)
3 years ago
Jiarui Fang 294a6060d0
[tensor] ZeRO use ColoTensor as the base class. (#828)
3 years ago
Jiarui Fang eb1b89908c
[refactor] moving InsertPostInitMethodToModuleSubClasses to utils. (#824)
3 years ago
Jiarui Fang 3ddbd1bce1
[gemini] collect cpu-gpu moving volume in each iteration (#813)
3 years ago
Jiarui Fang 61c20b44bc
[log] local throughput metrics (#811)
3 years ago
ver217 dd92b90a68
[DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext (#808)
3 years ago
Jiarui Fang e761ad2cd7
Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806)
3 years ago
HELSON 88759e289e
[zero] add ZeroTensorShardStrategy (#793)
3 years ago