Commit Graph

  • 1e8f5ebef9 Merge branch 'develop' into feat/init_light_monitoring JiaoPL 2023-11-02 13:31:02 +0800
  • 6b2bff421c change slurm partition (#464) kkscilife 2023-11-02 13:25:46 +0800
  • 354e49f388 change slurm partition wangmengke 2023-11-02 12:27:13 +0800
  • 4851291356 fix(optimizer/hybrid_zero_optim.py): fix bucket size full judge condition when reduce scatter overlap huangting4201 2023-11-02 10:30:16 +0800
  • ad21459ce4 init light monitoring on all ranks JiaoPL 2023-11-01 22:29:32 +0800
  • 3028f07cb7 fix(readme): update README with original weight download link (#460) Pryest 2023-11-01 14:49:48 +0800
  • 0cec3d5607 edit typo and polish Pryest 2023-11-01 14:32:52 +0800
  • 83713bacf4 add extra info for original model weights. Pryest 2023-11-01 14:23:33 +0800
  • 10b5056e1e fix all-gather overlap the model_checkpoint is 0 yingtongxiong 2023-11-01 12:31:52 +0800
  • 21624f6f81 fix(moe): remove norm&gate force sync (#448) Wenwen Qu 2023-11-01 11:29:55 +0800
  • 7aced82ec7 remove some unused function (is norm/gate group) Qu Wenwen 2023-11-01 11:20:16 +0800
  • b3def4c162 fix(optimizer/hybrid_zero_optim.py): add reduce_scatter_overlap switch huangting4201 2023-10-31 20:40:58 +0800
  • 6b843253eb fix(optimizer/hybrid_zero_optim.py): remove redundant _accum_grad_buckets huangting4201 2023-10-31 20:26:36 +0800
  • 4c1cd5d49b fix async reduce scatter mwiacx 2023-10-31 19:39:24 +0800
  • 64b89e8e06 update README with original weight download link. Pryest 2023-10-31 15:30:06 +0800
  • 2c6cfde332 fix(web_demo): remove <eoh> in user prompt (#440) vansin 2023-10-27 22:44:30 +0800
  • f77f376fd6 fix(os): fix FileNotFoundError in storage_manager (#455) Yang Gao 2023-10-27 22:32:46 +0800
  • 4995060d84 feat(storage): support ali oss ckpt saving (#439) jiaxingli 2023-10-27 22:32:10 +0800
  • bc5a85c624 Merge pull request #6 from yingtongxiong/fstp/overlap-support-pp ytxiong 2023-10-27 20:32:44 +0800
  • 3778c66660 feat(model/overlap_handler.py): fix overlap hander to support pp(non-interleaved) huangting4201 2023-10-27 20:04:23 +0800
  • e6d8ebc3e5 volc_path (#454) jiaxingli 2023-10-27 18:53:06 +0800
  • 62e5c2c650 Merge upstream/develop into fix/add_zero_broadcast_sync Qu Wenwen 2023-10-27 17:45:34 +0800
  • 87a3c5c374 feat(optimizer): zero gradient count (#449) jiaopenglong 2023-10-27 16:26:55 +0800
  • 4df2c47472 refactor code Qu Wenwen 2023-10-27 15:34:48 +0800
  • 739a308c82 fix merged error Qu Wenwen 2023-10-27 15:10:16 +0800
  • 950f2de833 Merge upstream/develop into fix/add_zero_broadcast_sync Qu Wenwen 2023-10-27 11:05:53 +0800
  • a81a4a4ba8 delete old sync logic Qu Wenwen 2023-10-27 11:05:17 +0800
  • c46861f986 fix ci gaoyang07 2023-10-27 10:25:34 +0800
  • 2520edb795 use try-except to handle file error gaoyang07 2023-10-27 10:20:43 +0800
  • a862f503b6 use rank0 to makedirs gaoyang07 2023-10-26 22:48:38 +0800
  • aa3840fc38 fix some bugs yingtongxiong 2023-10-26 20:42:24 +0800
  • 87bbef1e0d volc_path li126com 2023-10-26 20:40:56 +0800
  • 8aefb74e02 add flash tflops yingtongxiong 2023-10-26 20:33:12 +0800
  • 4d83e1021b Merge branch 'feat/fstp_refactor' of https://github.com/yingtongxiong/InternLM into feat/fstp_refactor merge origin yingtongxiong 2023-10-26 20:25:02 +0800
  • 3253cbf48e add a new get_tflops_func mwiacx 2023-10-26 20:21:46 +0800
  • cbd4f04244 add synchronize yingtongxiong 2023-10-26 20:04:01 +0800
  • 325b549707 fix param_metrics is not a tensor JiaoPL 2023-10-26 18:41:13 +0800
  • ad70e323eb fix(optimizer):broadcast (#453) ytxiong 2023-10-26 17:54:54 +0800
  • 42e5f6f8a9 fix(optimizer):broadcast main (#452) ytxiong 2023-10-26 17:54:48 +0800
  • a151615847 Merge branch 'develop' into fix/broadcast ytxiong 2023-10-26 17:51:48 +0800
  • a26481132d fix synchronize yingtongxiong 2023-10-26 17:50:18 +0800
  • 4cd803985b Merge branch 'main' into fix/broadcast_main ytxiong 2023-10-26 17:48:42 +0800
  • c78844626b add synchronize yingtongxiong 2023-10-26 17:46:07 +0800
  • 1aae39b667 Merge remote-tracking branch 'upstream/develop' into feat/fstp_refactor merge develop yingtongxiong 2023-10-26 17:41:42 +0800
  • d831ddcc1d modify the config yingtongxiong 2023-10-26 17:41:17 +0800
  • f653e5af01 add broadcast synchronize (#451) ytxiong 2023-10-26 17:38:51 +0800
  • e4e68ff0ff add broadcast synchronize yingtongxiong 2023-10-26 17:37:08 +0800
  • aeee9fd2a9 fix broadcast synchronize() (#450) ytxiong 2023-10-26 17:33:00 +0800
  • cd53d90db9 fix broadcast synchronize() yingtongxiong 2023-10-26 17:29:08 +0800
  • 83cb7036a7 add zero_grad_profiling option JiaoPL 2023-10-26 17:20:44 +0800
  • 15ff413362 add zero broadcast_sync Qu Wenwen 2023-10-26 16:27:03 +0800
  • a6051335b7 fix layer norm with pp JiaoPL 2023-10-26 14:50:54 +0800
  • 9ac5ab3101 fix layer norm with pp JiaoPL 2023-10-26 10:07:43 +0800
  • e900a1e45f add zero grad count JiaoPL 2023-10-25 18:02:00 +0800
  • cc20fa271a reset print memory yingtongxiong 2023-10-25 16:48:02 +0800
  • 985465c96a merge upstream yingtongxiong 2023-10-25 14:46:45 +0800
  • 363275b500 add memory print yingtongxiong 2023-10-25 14:31:00 +0800
  • 1d7e2d04ec fix(*)/all-reduce for norm in sequence parallel (#443) ytxiong 2023-10-25 14:16:32 +0800
  • 918dff7257 reset moe yingtongxiong 2023-10-25 13:47:19 +0800
  • 0bac166b7a add test yingtongxiong 2023-10-25 13:44:15 +0800
  • 476a24bd9b fix lint yingtongxiong 2023-10-25 13:38:46 +0800
  • 1bc3c33b75 change the order of dp and sp all-reduce yingtongxiong 2023-10-25 13:27:47 +0800
  • 1655a90f34 fix all-reduce norm grad yingtongxiong 2023-10-25 13:11:59 +0800
  • 41cfa1a10a feat(model/overlap_handler.py): fix overlap handler None bug huangting4201 2023-10-24 18:47:27 +0800
  • 0d3592a53f Merge branch 'feat/fstp_refactor' of https://github.com/yingtongxiong/InternLM into feat/fstp_refactor merge origin yingtongxiong 2023-10-24 17:54:50 +0800
  • 262de4b796 support tflops computation and generate test py files yingtongxiong 2023-10-24 17:54:26 +0800
  • 5d8313693b feat(model/overlap_handler.py): fix head post backward hook when activation huangting4201 2023-10-24 17:29:09 +0800
  • 97dcefc389 support model activation checkpoint yingtongxiong 2023-10-24 16:13:52 +0800
  • 24d5f05df6 Update web_demo.py vansin 2023-10-24 07:38:38 +0800
  • 12b1001b5e new oss li126com 2023-10-23 20:43:26 +0800
  • 7b1b892084 fix(tools): fix InternLMTokenizer to fit transformers==4.34.0 x54-729 2023-10-23 18:35:30 +0800
  • 949a0a1d55 feat(optimizer): add layer norm to tensorboard (#429) jiaopenglong 2023-10-23 17:07:04 +0800
  • 0996c47e49 fix accumulate grads bug chenxun.p 2023-10-23 16:17:57 +0800
  • 1e263988cf add function: reduce grads JiaoPL 2023-10-23 15:59:37 +0800
  • b48687a7ff Merge pull request #5 from yingtongxiong/fstp/refactor-hook-handle huangting4201 2023-10-23 15:35:34 +0800
  • b2c1a70477 feat(train/training_internlm.py): fix lint error huangting4201 2023-10-23 15:34:24 +0800
  • 9cf1ff0f6e feat(solver/optimizer/hybrid_zero_optim.py): minor update huangting4201 2023-10-23 15:31:41 +0800
  • 03cc7f9b80 feat(model/overlap_handler.py): fix lint error huangting4201 2023-10-23 15:28:34 +0800
  • 0d693cf3a1 feat(model/overlap_handler.py): fix lint error huangting4201 2023-10-23 15:22:03 +0800
  • f6a5086fe4 support bias yingtongxiong 2023-10-23 14:51:27 +0800
  • e7f9f1d208 feat(model/overlap_handler.py): optimize reduce scatter mem pool huangting4201 2023-10-23 13:31:23 +0800
  • b20f47a1fe feat(model/overlap_handler.py): move handler to gpc huangting4201 2023-10-23 12:02:32 +0800
  • 85ad917ae4 feat(model/overlap_handler.py): refactor overlap hook handle huangting4201 2023-10-20 21:50:32 +0800
  • 1804d01bb3 merge reduce-scatter yingtongxiong 2023-10-20 18:11:00 +0800
  • dcd89ed304 refactor linear yingtongxiong 2023-10-20 17:50:56 +0800
  • f22e5b3b28 Merge pull request #4 from yingtongxiong/fstp/refactor-config ytxiong 2023-10-20 17:48:20 +0800
  • 45fd0ec86b test moe layer norm JiaoPL 2023-10-20 16:27:28 +0800
  • 2acf9b817f feat(utils/gputest.py): fix lint error huangting4201 2023-10-20 16:25:08 +0800
  • eac382ad0a feat(optimizer/hybrid_zero_optim.py): fix lint error huangting4201 2023-10-20 16:22:29 +0800
  • 3c6925499f feat(optimizer/hybrid_zero_optim.py): resolve conflicts huangting4201 2023-10-20 16:18:01 +0800
  • d91a5d9d9e feat(initialize/launch.py): refactor config for fstp huangting4201 2023-10-20 15:59:40 +0800
  • 95488d8e8f update optimizer accumulate grad impl when fstp chenxun.p 2023-10-20 15:58:06 +0800
  • 140be20511 test(workflow): add unit test yaml (#427) kkscilife 2023-10-20 14:22:58 +0800
  • 1ac7e4b489 add layer norm to tensorboard JiaoPL 2023-10-20 13:15:59 +0800
  • 815a584930 feat(model/linear.py): remove useless code huangting4201 2023-10-20 11:27:59 +0800
  • ed7232777a support reduce scatter memory pool yingtongxiong 2023-10-20 10:35:45 +0800
  • 68b6b6c454 add main branch changxiaodongTHU 2023-10-19 19:09:01 +0800
  • 5084697657 Merge branch 'develop' into ci/unit_test changxiaodongTHU 2023-10-19 18:39:23 +0800
  • 3c992a2101 fix(pipeline): fix interleave type assert and metrics error (#423) Wenwen Qu 2023-10-19 17:29:30 +0800
  • b278e28202 add unit test yaml changxiaodongTHU 2023-10-19 17:16:28 +0800