300 Commits (b80e6cdcf30c4acf73d5b910437396839ff8a816)

Author SHA1 Message Date
yingtongxiong b80e6cdcf3 merge origin 2023-11-06 12:05:53 +08:00
yingtongxiong 7c6d2936b3 reset the sp allreduce in optimizer 2023-11-06 12:04:01 +08:00
huangting4201 c517ec5b8c feat(model/overlap_handler.py): delete reduce_scatter_overlap switch 2023-11-06 11:57:14 +08:00
yingtongxiong 9b1265c591 modify the sp allreduce and support tf32 for fstp linear 2023-11-06 10:45:08 +08:00
huangting4201 5a18b3b651 fix(model/overlap_handler.py): fix last block hook when pp with activation 2023-11-02 16:05:07 +08:00
huangting4201 4851291356 fix(optimizer/hybrid_zero_optim.py): fix bucket size full judge condition when reduce scatter overlap 2023-11-02 10:30:16 +08:00
yingtongxiong 10b5056e1e fix all-gather overlap the model_checkpoint is 0 2023-11-01 12:31:52 +08:00
huangting4201 b3def4c162 fix(optimizer/hybrid_zero_optim.py): add reduce_scatter_overlap switch 2023-10-31 20:40:58 +08:00
huangting4201 6b843253eb fix(optimizer/hybrid_zero_optim.py): remove redundant _accum_grad_buckets 2023-10-31 20:26:36 +08:00
mwiacx 4c1cd5d49b fix async reduce scatter 2023-10-31 19:39:24 +08:00
ytxiong bc5a85c624 Merge pull request #6 from yingtongxiong/fstp/overlap-support-pp 2023-10-27 20:32:44 +08:00
    feat(model/overlap_handler.py): fix overlap hander to support pp(non-…
huangting4201 3778c66660 feat(model/overlap_handler.py): fix overlap hander to support pp(non-interleaved) 2023-10-27 20:04:23 +08:00
yingtongxiong aa3840fc38 fix some bugs 2023-10-26 20:42:24 +08:00
yingtongxiong 8aefb74e02 add flash tflops 2023-10-26 20:33:12 +08:00
yingtongxiong 4d83e1021b Merge branch 'feat/fstp_refactor' of https://github.com/yingtongxiong/InternLM into feat/fstp_refactor 2023-10-26 20:25:02 +08:00
    merge origin
mwiacx 3253cbf48e add a new get_tflops_func 2023-10-26 20:21:46 +08:00
yingtongxiong cbd4f04244 add synchronize 2023-10-26 20:04:01 +08:00
yingtongxiong 1aae39b667 Merge remote-tracking branch 'upstream/develop' into feat/fstp_refactor 2023-10-26 17:41:42 +08:00
    merge develop
yingtongxiong d831ddcc1d modify the config 2023-10-26 17:41:17 +08:00
ytxiong aeee9fd2a9 fix broadcast synchronize() (#450) 2023-10-26 17:33:00 +08:00
yingtongxiong cc20fa271a reset print memory 2023-10-25 16:48:02 +08:00
yingtongxiong 985465c96a merge upstream 2023-10-25 14:46:45 +08:00
yingtongxiong 363275b500 add memory print 2023-10-25 14:31:00 +08:00
ytxiong 1d7e2d04ec fix(*)/all-reduce for norm in sequence parallel (#443) 2023-10-25 14:16:32 +08:00
    * fix all-reduce norm grad
    * change the order of dp and sp all-reduce
    * fix lint
yingtongxiong 918dff7257 reset moe 2023-10-25 13:47:19 +08:00
yingtongxiong 0bac166b7a add test 2023-10-25 13:44:15 +08:00
huangting4201 41cfa1a10a feat(model/overlap_handler.py): fix overlap handler None bug 2023-10-24 18:47:27 +08:00
yingtongxiong 0d3592a53f Merge branch 'feat/fstp_refactor' of https://github.com/yingtongxiong/InternLM into feat/fstp_refactor 2023-10-24 17:54:50 +08:00
    merge origin
yingtongxiong 262de4b796 support tflops computation and generate test py files 2023-10-24 17:54:26 +08:00
huangting4201 5d8313693b feat(model/overlap_handler.py): fix head post backward hook when activation 2023-10-24 17:29:09 +08:00
yingtongxiong 97dcefc389 support model activation checkpoint 2023-10-24 16:13:52 +08:00
jiaopenglong 949a0a1d55 feat(optimizer): add layer norm to tensorboard (#429) 2023-10-23 17:07:04 +08:00
    * add layer norm to tensorboard
    * test moe layer norm
    * add function: reduce grads
chenxun.p 0996c47e49 fix accumulate grads bug 2023-10-23 16:17:57 +08:00
huangting4201 b48687a7ff Merge pull request #5 from yingtongxiong/fstp/refactor-hook-handle 2023-10-23 15:35:34 +08:00
    feat(model/overlap_handler.py): refactor overlap hook handle
huangting4201 b2c1a70477 feat(train/training_internlm.py): fix lint error 2023-10-23 15:34:24 +08:00
huangting4201 9cf1ff0f6e feat(solver/optimizer/hybrid_zero_optim.py): minor update 2023-10-23 15:31:41 +08:00
huangting4201 03cc7f9b80 feat(model/overlap_handler.py): fix lint error 2023-10-23 15:28:34 +08:00
huangting4201 0d693cf3a1 feat(model/overlap_handler.py): fix lint error 2023-10-23 15:22:03 +08:00
yingtongxiong f6a5086fe4 support bias 2023-10-23 14:51:27 +08:00
huangting4201 e7f9f1d208 feat(model/overlap_handler.py): optimize reduce scatter mem pool 2023-10-23 13:31:23 +08:00
huangting4201 b20f47a1fe feat(model/overlap_handler.py): move handler to gpc 2023-10-23 12:02:32 +08:00
huangting4201 85ad917ae4 feat(model/overlap_handler.py): refactor overlap hook handle 2023-10-20 21:50:32 +08:00
yingtongxiong 1804d01bb3 merge reduce-scatter 2023-10-20 18:11:00 +08:00
yingtongxiong dcd89ed304 refactor linear 2023-10-20 17:50:56 +08:00
ytxiong f22e5b3b28 Merge pull request #4 from yingtongxiong/fstp/refactor-config 2023-10-20 17:48:20 +08:00
    feat(initialize/launch.py): refactor config for fstp
huangting4201 2acf9b817f feat(utils/gputest.py): fix lint error 2023-10-20 16:25:08 +08:00
huangting4201 eac382ad0a feat(optimizer/hybrid_zero_optim.py): fix lint error 2023-10-20 16:22:29 +08:00
huangting4201 3c6925499f feat(optimizer/hybrid_zero_optim.py): resolve conflicts 2023-10-20 16:18:01 +08:00
huangting4201 d91a5d9d9e feat(initialize/launch.py): refactor config for fstp 2023-10-20 15:59:40 +08:00
chenxun.p 95488d8e8f update optimizer accumulate grad impl when fstp 2023-10-20 15:58:06 +08:00