Commit Graph

  • 69e0d63b33 Update passkey_retrieval.py tpoisonooo 2023-10-19 16:30:22 +0800
  • d2cc5f61e3 feat(tools): add passkey retrieval test tpoisonooo 2023-10-19 16:18:42 +0800
  • 3ea46324dd fix: unitest (#424) jiaxingli 2023-10-19 15:19:40 +0800
  • 19cddd5060 fix: unitest li126com 2023-10-19 14:41:25 +0800
  • 4742271154 add memory pool yingtongxiong 2023-10-19 13:21:33 +0800
  • 83c47d07d1 fix is_no_pp_or_last_stage logic Wenwen Qu 2023-10-19 10:59:54 +0800
  • 2c5395fdfd Doc(moe): add documentation for moe training (#411) Wenwen Qu 2023-10-19 10:01:12 +0800
  • 3ea94f2e2a fix(utils): disable bench_net in gputest.py (#421) Guoteng 2023-10-19 10:00:57 +0800
  • 4b5bdedff2 feat(monitor): send exception to light monitor (#420) jiaopenglong 2023-10-18 21:00:21 +0800
  • 5d0151d7b0 update trainer_result in ci JiaoPL 2023-10-18 19:37:15 +0800
  • 6480e03949 refactor code for assert Wenwen Qu 2023-10-18 19:22:33 +0800
  • e3aff2c23e update try_import_send_exception JiaoPL 2023-10-18 19:19:55 +0800
  • 30f610b1fa Test(pp): test pipeline parallel (#413) jiaxingli 2023-10-18 17:53:08 +0800
  • c0d9063a8d change code comments Wenwen Qu 2023-10-18 17:37:20 +0800
  • e1db83899b restore moe config file Qu Wenwen 2023-10-18 14:08:48 +0800
  • 3421d1197a Merge 'upstream/develop' into doc/add_moe__doc Qu Wenwen 2023-10-18 14:06:49 +0800
  • 12f897f553 fix interleave type assert bug Wenwen Qu 2023-10-18 13:56:42 +0800
  • e3d128230b fix(utils): disable bench_net in gputest.py 877825076@qq.com 2023-10-18 12:26:24 +0800
  • bf6dbf07fa add share embedding weight support for moe Wenwen Qu 2023-10-18 11:39:04 +0800
  • a5aeab2a3f memory profiling test yingtongxiong 2023-10-17 19:54:21 +0800
  • aa5e34d815 compatible with old ckpt (#418) Wenwen Qu 2023-10-17 17:25:36 +0800
  • 16ef7b7889 add test yingtongxiong 2023-10-17 17:16:39 +0800
  • 2538a19927 fix InternLMTokenizer to fit transformers==4.34.0 x54-729 2023-10-17 16:54:51 +0800
  • 5abe519c4c remove full weight for block 0 yingtongxiong 2023-10-17 16:37:06 +0800
  • 9f08c95541 compatible with old ckpt Qu Wenwen 2023-10-17 16:22:32 +0800
  • 44f5c51747 send exception to light monitor JiaoPL 2023-10-17 15:49:17 +0800
  • 5c38cb6409 add head overlap yingtongxiong 2023-10-17 15:38:24 +0800
  • a5c6e457b9 Merge branch 'feat/fstp' of https://github.com/yingtongxiong/InternLM into feat/fstp yingtongxiong 2023-10-17 15:17:03 +0800
  • 6408b944c2 support fine grained yingtongxiong 2023-10-17 15:14:39 +0800
  • b51cf4ebc3 Merge branch 'feat/fstp' of github.com:yingtongxiong/InternLM into feat/fstp chenxun.p 2023-10-17 15:10:27 +0800
  • 6682f5d92a fix reduce scatter async bug chenxun.p 2023-10-17 15:10:07 +0800
  • eeef07934a fix(moe): fix moe compatibility for fsdp and memory profiling (#417) Wenwen Qu 2023-10-17 14:13:48 +0800
  • 666dabd0a8 update moe config Qu Wenwen 2023-10-17 11:36:44 +0800
  • 4e99a7fdbc feat(train/training_internlm.py): remove abnormal tgs when calculating avg tgs huangting4201 2023-10-17 11:30:44 +0800
  • 74d6c71ad9 fix moe compatibility for fsdp and memory profiling Qu Wenwen 2023-10-17 11:26:29 +0800
  • 229cc5c68c impl reduce scatter async chenxun.p 2023-10-17 11:15:54 +0800
  • d1af0d6aee feat(model/linear.py): block-grained backward huangting4201 2023-10-17 10:13:56 +0800
  • 0d1fa037dd feat(model/linear.py): set block 0 full weight huangting4201 2023-10-16 20:13:59 +0800
  • 6ce78a4e09 fix layer grad_norm with pp JiaoPL 2023-10-16 19:43:30 +0800
  • 82204eea59 support hybrid overlap yingtongxiong 2023-10-16 16:35:14 +0800
  • 7920168179 fix set layer name JiaoPL 2023-10-14 22:45:35 +0800
  • 7d68509c4f set layer name to parameters after init_model JiaoPL 2023-10-14 22:32:10 +0800
  • 37e0c86e5a fix(init): allow resume_tb_folder is an empty string (#391) Guoteng 2023-10-13 16:46:14 +0800
  • 71a0388b87 feat(storage): support volc oss ckpt saving (#397) jiaxingli 2023-10-13 16:44:29 +0800
  • 646f1b45fa rm debug log JiaoPL 2023-10-13 12:25:46 +0800
  • f2358b9432 Merge branch 'develop' into feat/layer_grad_norm JiaoPL 2023-10-13 12:12:24 +0800
  • 641ee14bbf update layer norm to tensorboard JiaoPL 2023-10-13 12:07:58 +0800
  • d0f0c22cac feat(model/linear.py): change pre backward from wqkv to block huangting4201 2023-10-13 11:10:23 +0800
  • a94f429a67 compute layer norms and replace total_norm with it JiaoPL 2023-10-12 21:25:30 +0800
  • d0b1346993 feat(model/linear.py): support block allgather overlap huangting4201 2023-10-12 19:42:08 +0800
  • 816ecf8e04 fix moe and zero1 check in args_sanity_check Qu Wenwen 2023-10-12 10:56:59 +0800
  • 93bb5c2760 add doc for moe Qu Wenwen 2023-10-12 10:42:16 +0800
  • 5fd5a8a32b support fine-grained overlap yingtongxiong 2023-10-11 17:36:41 +0800
  • 792b066f15 communication overlap yingtongxiong 2023-10-11 10:57:12 +0800
  • 9a731b6e9b fix(optimizer/fsdp_optimizer.py): fsdp process empty params group (#408) huangting4201 2023-10-10 20:06:04 +0800
  • a63d7773db fix(optimizer/fsdp_optimizer.py): fsdp process empty params group huangting4201 2023-10-10 19:59:53 +0800
  • f6ff8e61c6 Merge remote-tracking branch 'upstream/develop' into develop huangting4201 2023-10-10 19:53:18 +0800
  • c94be64fd2 merge origin yingtongxiong 2023-10-10 17:13:46 +0800
  • 0fac845c36 overlap grad_input computation and grad_weight reduce_scatter yingtongxiong 2023-10-10 17:06:13 +0800
  • 5fb6d99c11 feat(configs/7B_sft.py): update parallel config comment huangting4201 2023-10-10 11:45:11 +0800
  • db637542a6 fix lint yingtongxiong 2023-10-09 22:19:21 +0800
  • dd67ab948d merge develop yingtongxiong 2023-10-09 21:40:02 +0800
  • 1b7935dd98 merge upstream develop yingtongxiong 2023-10-09 21:35:52 +0800
  • a8dea6313f fix the ci incompatible in config yingtongxiong 2023-10-09 21:33:26 +0800
  • b3645b0244 fix(model): fix errant inference_forward (#396) Pryest 2023-10-09 21:29:11 +0800
  • 66eba48c9f Fit to flash attention 1.0.5. Pryest 2023-10-09 21:15:40 +0800
  • b38ba5dad2 Fit to flash attention 1.0.5. Pryest 2023-10-09 21:03:16 +0800
  • 007e58a4af merge upstream develop yingtongxiong 2023-10-09 20:54:26 +0800
  • a3580acb6c Fit to flash attention 1.0 Pryest 2023-10-09 20:46:17 +0800
  • a35ce4c888 Fit to flash attention 1.0 Pryest 2023-10-09 20:43:21 +0800
  • f191853bf4 fix lint yingtongxiong 2023-10-09 20:39:57 +0800
  • 78353e12cf Fix bugs. Pryest 2023-10-09 20:27:03 +0800
  • 29df765f65 refactor code yingtongxiong 2023-10-09 20:23:32 +0800
  • 5d39c332fe restore train.py yingtongxiong 2023-10-09 20:08:49 +0800
  • ef9e7cc622 modify the config yingtongxiong 2023-10-09 20:05:39 +0800
  • 144731c35c fix evaluation bug in pp yingtongxiong 2023-10-09 20:04:27 +0800
  • a075153adf feat(train): add fsdp training option (#293) zaglc 2023-10-09 18:59:31 +0800
  • 45c846f7df feat(configs/7B_sft.py): adapt to old version config huangting4201 2023-10-09 18:28:41 +0800
  • 54e561665e remove useless code for no-pp yingtongxiong 2023-10-09 18:08:15 +0800
  • 0fa1083780 Merge remote-tracking branch 'upstream/develop' into feat/fstp yingtongxiong 2023-10-09 18:06:57 +0800
  • 949431f228 modify the config yingtongxiong 2023-10-09 18:06:22 +0800
  • 21c1a7fa47 support evaluation with fstp yingtongxiong 2023-10-09 18:01:06 +0800
  • 582ee000bd feat(moe):support zero for expert local dp (#404) Wenwen Qu 2023-10-09 17:45:26 +0800
  • edd7f9e8e1 feat(configs/7B_sft.py): move fsdp config to parallel zero1 huangting4201 2023-10-09 17:39:52 +0800
  • 189a313da6 support fstp and refactor code yingtongxiong 2023-10-09 17:26:20 +0800
  • e8fcbb1ad5 fix above codes: treat optim.zero_world_size and optim.zero_local_rank as list in model_checkpoint.py and test_model_checkpoint.py; add overlap and zero check for moe in args_sanity_check() Qu Wenwen 2023-10-09 16:05:43 +0800
  • 67fad5c894 feat: support volc oss li126com 2023-10-09 14:52:14 +0800
  • bd809a61f2 fix(internlm/model): reset dropout_selective_checkpoint=True huangting4201 2023-10-09 14:47:10 +0800
  • 916647c0a1 fix(pipeline): fix bugs for pipeline when enable mixed precision (#382) Wenwen Qu 2023-10-09 14:01:15 +0800
  • 9aef11e89c make seed in different tensor rank different (#405) ytxiong 2023-10-09 13:53:52 +0800
  • c69481daef merge upstream develop yingtongxiong 2023-10-09 13:32:41 +0800
  • 0e26f52a89 make seed in different tensor rank different yingtongxiong 2023-10-09 13:26:08 +0800
  • 856f88e97b move optim.dtype to each param group Qu Wenwen 2023-10-09 12:39:03 +0800
  • c018e9216f support zero for expert local dp Qu Wenwen 2023-10-09 11:20:01 +0800
  • 4ebe6715bd restore logic for empty fp32 group Qu Wenwen 2023-10-09 11:20:01 +0800
  • 5bca32e4dc fix(internlm/train/training_internlm.py): update wrap class and fix lint error huangting4201 2023-10-09 11:11:04 +0800
  • b444264e89 doc:gpu num li126com 2023-10-08 20:46:13 +0800
  • 2e94870967 fix(internlm/train/training_internlm.py): remove set IS_TENSOR_PARALLEL attr huangting4201 2023-10-08 20:22:40 +0800
  • 1b71b19e23 fix(internlm/utils/parallel.py): fix circular import huangting4201 2023-10-08 17:23:29 +0800
  • bd4af3a31f modify the all2all yingtongxiong 2023-10-08 17:21:17 +0800