180 Commits (7d49e7b2dbdb4b966496475654a4154b92aeaa7b)

Author SHA1 Message Date
Geng Zhang 0e06f62160
[NFC] polish colossalai/nn/layer/parallel_sequence/_operation.py code style (#1266)
2 years ago
superhao1995 f660152c73
[NFC] polish colossalai/nn/layer/parallel_3d/_operation.py code style (#1258)
2 years ago
Thunderbeee 9738fb0f78
[NFC] polish colossalai/nn/lr_scheduler/__init__.py (#1255)
2 years ago
Ofey Chan 2dd4d556fb
[NFC] polish colossalai/nn/init.py code style (#1292)
2 years ago
HELSON abba4d84e1
[hotfix] fix bert model test in unitests (#1272)
2 years ago
oahzxl 0cf8e8e91c
[NFC] polish <colossalai/nn/lr_scheduler/poly.py> code style (#1267)
2 years ago
Jiarui Fang 1aad903c15
[tensor] redistribute among different process groups (#1247)
2 years ago
Jiarui Fang 9bcd2fd4af
[tensor] a shorter shard and replicate spec (#1245)
2 years ago
Jiarui Fang 2699dfbbfd
[rename] convert_to_dist -> redistribute (#1243)
2 years ago
Jiarui Fang 4a76084dc9
[tensor] add zero_like colo op, important for Optimizer (#1236)
2 years ago
Jiarui Fang 3b500984b1
[tensor] fix some unittests (#1234)
2 years ago
HELSON 0453776def
[tensor] fix a assertion in colo_tensor cross_entropy (#1232)
2 years ago
HELSON 42ab36b762
[tensor] add unitest for colo_tensor 1DTP cross_entropy (#1230)
2 years ago
Yi Zhao 04537bf83e
[checkpoint]support generalized scheduler (#1222)
2 years ago
Jiarui Fang a98319f023
[tensor] torch function return colotensor (#1229)
2 years ago
Jiarui Fang ae7d3f4927
[refactor] move process group from _DistSpec to ColoTensor. (#1203)
2 years ago
Jiarui Fang b5f25eb32a
[Tensor] add cpu group to ddp (#1200)
2 years ago
Jiarui Fang 060b917daf
[refactor] remove gpc dependency in colotensor's _ops (#1189)
2 years ago
Jiarui Fang 372f791444
[refactor] move chunk and chunkmgr to directory gemini (#1182)
2 years ago
ver217 6b2f2ab9bb
[ddp] ColoDDP uses bucket all-reduce (#1177)
2 years ago
Jiarui Fang 1b657f9ce1
[tensor] revert local view back (#1178)
2 years ago
Jiarui Fang 0dd4e2bbfb
[Tensor] rename some APIs in TensorSpec and Polish view unittest (#1176)
2 years ago
Ziyue Jiang dd0420909f
[Tensor] rename parallel_action (#1174)
2 years ago
Jiarui Fang aa7bef73d4
[Tensor] distributed view supports inter-process hybrid parallel (#1169)
2 years ago
Jiarui Fang 4b9bba8116
[ColoTensor] rename APIs and add output_replicate to ComputeSpec (#1168)
2 years ago
Jiarui Fang f4ef224358
[Tensor] remove ParallelAction, use ComputeSpec instread (#1166)
2 years ago
Jiarui Fang 177c374401
remove gather out in parallel action (#1163)
2 years ago
Ziyue Jiang 955ac912de
remove log (#1160)
2 years ago
Jiarui Fang 07f9c781f9
[graph] improve the graph building. (#1157)
2 years ago
ver217 22717a856f
[tensor] add embedding bag op (#1156)
2 years ago
ver217 ae86151968
[tensor] add more element-wise ops (#1155)
2 years ago
ver217 54aabb8da4
[gemini] refactor gemini mgr (#1151)
2 years ago
ver217 8106d7b8c7
[ddp] refactor ColoDDP and ZeroDDP (#1146)
2 years ago
ver217 ccf3c58c89
embedding op use gather_out (#1143)
2 years ago
Frank Lee 15aab1476e
[zero] avoid zero hook spam by changing log to debug level (#1137)
2 years ago
ver217 e4f555f29a
[optim] refactor fused sgd (#1134)
2 years ago
ver217 d26902645e
[ddp] add save/load state dict for ColoDDP (#1127)
2 years ago
ver217 f0a954f16d
[ddp] add set_params_to_ignore for ColoDDP (#1122)
3 years ago
ver217 e127b4375b
cast colo ddp v2 inputs/outputs (#1120)
3 years ago
ver217 7d14b473f0
[gemini] gemini mgr supports "cpu" placement policy (#1118)
3 years ago
ver217 895c1c5ee7
[tensor] refactor param op hook (#1097)
3 years ago
Frank Lee cb18922c47
[doc] added documentation to chunk and chunk manager (#1094)
3 years ago
ver217 1f894e033f
[gemini] zero supports gemini (#1093)
3 years ago
Frank Lee 2b2dc1c86b
[pipeline] refactor the pipeline module (#1087)
3 years ago
ver217 be01db37c8
[tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077)
3 years ago
Ziyue Jiang 0653c63eaa
[Tensor] 1d row embedding (#1075)
3 years ago
Ziyue Jiang 4fc748f69b
[Tensor] fix optimizer for CPU parallel (#1069)
3 years ago
Jiarui Fang 49832b2344
[refactory] add nn.parallel module (#1068)
3 years ago
Ziyue Jiang 6754f1b77f
fix module utils bug (#1066)
3 years ago
Jiarui Fang a00644079e
reorgnize colotensor directory (#1062)
3 years ago
Ziyue Jiang df9dcbbff6
[Tensor] add hybrid device demo and fix bugs (#1059)
3 years ago
ver217 51b9a49655
[zero] add zero optimizer for ColoTensor (#1046)
3 years ago
ver217 9492a561c3
[tensor] ColoTensor supports ZeRo (#1015)
3 years ago
ver217 cefc29ff06
[tensor] impl ColoDDP for ColoTensor (#1009)
3 years ago
Ziheng Qin 571f12eff3
[NFC] polish colossalai/nn/layer/utils/common.py code style (#983)
3 years ago
shenggan 18542b47fc
[NFC] polish colossalai/nn/layer/parallel_2d/layers.py code style (#976)
3 years ago
Zirui Zhu 598cde4a0f
[NFC] polish colossalai/nn/layer/parallel_2p5d/layers.py code style (#972)
3 years ago
LuGY fb5bc6cb28
[NFC] polish colossalai/nn/layer/parallel_3d/layers.py code style (#966)
3 years ago
ver217 58580b50fe
Revert "[NFC] Hotfix/format (#984)" (#986)
3 years ago
binmakeswell 0772828fba
[NFC] Hotfix/format (#984)
3 years ago
HELSON e5ea3fdeef
[gemini] add GeminiMemoryManger (#832)
3 years ago
Ziyue Jiang 4b01da24cd
[TP] change the check assert in split batch 2d (#772)
3 years ago
アマデウス b8899e0905
[TP] allow layernorm without bias (#750)
3 years ago
Frank Lee eda30a058e
[compatibility] fixed tensor parallel compatibility with torch 1.9 (#700)
3 years ago
HELSON a9b8300d54
[zero] improve adaptability for not-shard parameters (#708)
3 years ago
アマデウス 3fc8a204dc
[]Corrected 3d vocab parallel embedding (#707)
3 years ago
HELSON b31daed4cf
fix bugs in CPU adam (#633)
3 years ago
Liang Bowen 828e465622
[hotfix] Raise messages for indivisible batch sizes with tensor parallelism (#622)
3 years ago
アマデウス 77ad24bf94
[model checkpoint] updated saving/loading for 3d layers (#597)
3 years ago
アマデウス 93089ed708
[model checkpoint] updated saving/loading for 2.5d layers (#596)
3 years ago
アマデウス c50bfb807b
[model checkpoint] updated saving/loading for 1d layers (#594)
3 years ago
アマデウス 7636d518e1
[model checkpoint] updated saving/loading for 2d layers (#595)
3 years ago
アマデウス cd13b63832
[model checkpoint] reworked unified layers for ease of save/load states (#593)
3 years ago
Ziyue Jiang 1c40ee8749
[TP] add assert for tp1d (#621)
3 years ago
ver217 e619a651fb
polish optimizer docstring (#619)
3 years ago
ver217 8432dc7080
polish moe docsrting (#618)
3 years ago
ver217 104cbbb313
[hotfix] add hybrid adam to __init__ (#584)
3 years ago
HELSON e6d50ec107
[zero] adapt zero for unsharded parameters (#561)
3 years ago
Wesley 46c9ba33da
update code format
3 years ago
Wesley 666cfd094a
fix parallel_input flag for Linear1D_Col gather_output
3 years ago
Liang Bowen 2c45efc398
html refactor (#555)
3 years ago
LuGY c44d797072
[docs] updatad docs of hybrid adam and cpu adam (#552)
3 years ago
Ziyue Jiang 763dc325f1
[TP] Add gather_out arg to Linear (#541)
3 years ago
HELSON 8c90d4df54
[zero] add zero context manager to change config during initialization (#546)
3 years ago
Liang Bowen ec5086c49c
Refactored docstring to google style
3 years ago
LuGY 105c5301c3
[zero]added hybrid adam, removed loss scale in adam (#527)
3 years ago
LuGY 6a3f9fda83
[cuda] modify the fused adam, support hybrid of fp16 and fp32 (#497)
3 years ago
Jiarui Fang a445e118cf
[polish] polish singleton and global context (#500)
3 years ago
ver217 9ec1ce6ab1
[zero] sharded model support the reuse of fp16 shard (#495)
3 years ago
HELSON c9023d4078
[MOE] support PR-MOE (#488)
3 years ago
ver217 62b0a8d644
[zero] sharded optim support hybrid cpu adam (#486)
3 years ago
HELSON d7ea63992b
[MOE] add FP32LinearGate for MOE in NaiveAMP context (#480)
3 years ago
Jiarui Fang 65c0f380c2
[format] polish name format for MOE (#481)
3 years ago
HELSON 7544347145
[MOE] add unitest for MOE experts layout, gradient handler and kernel (#469)
3 years ago
HELSON aff9d354f7
[MOE] polish moe_env (#467)
3 years ago
HELSON bccbc15861
[MOE] changed parallelmode to dist process group (#460)
3 years ago
Jiarui Fang 0fcfb1e00d
[test] make zero engine test really work (#447)
3 years ago
Jiarui Fang 237d08e7ee
[zero] hybrid cpu adam (#445)
3 years ago
HELSON dbdc9a7783
added Multiply Jitter and capacity factor eval for MOE (#434)
3 years ago
HELSON 3f70a2b12f
removed noisy function during evaluation of MoE router (#419)
3 years ago