Commit Graph

  • 8ec20e9862 [format] polish name format for MOE jiaruifang 2022-03-21 16:44:36 +0800
  • 7f5e4592eb Update Experiment result about Colossal-AI with ZeRO (#479) Sze-qq 2022-03-21 16:34:07 +0800
  • 8ec76dd22e Hotfix/readme (#478) Sze-qq 2022-03-21 16:24:33 +0800
  • 48669455c4 Merge branch 'hotfix/readme' into hotfix/readme Sze-qq 2022-03-21 16:21:50 +0800
  • f870d15dbd polish code ver217 2022-03-21 16:17:51 +0800
  • 57c283f696 adjust newly-added figure size Sze-qq 2022-03-21 15:58:15 +0800
  • 83a847d058 [test] added rerun on exception for testing (#475) Frank Lee 2022-03-21 15:51:57 +0800
  • a361623905 zero support interleaved pipeline ver217 2022-03-21 15:50:13 +0800
  • 740dd75631 [readme] add experimental visualisation regarding ColossalAI with ZeRO (#476) Sze-qq 2022-03-21 15:49:41 +0800
  • 8e48b03f44 polish code FrankLeeeee 2022-03-21 15:41:21 +0800
  • 8f242ad448 [test] added rerun on exception function FrankLeeeee 2022-03-21 15:16:34 +0800
  • df84f5830d add experimental visualisation regarding ColossalAI with ZeRO Sze-qq 2022-03-21 15:31:17 +0800
  • 0e074bf3d8 zero support non-interleaved pipeline ver217 2022-03-21 15:30:06 +0800
  • d70f43dd7a embedding remove attn mask (#474) ver217 2022-03-21 14:53:23 +0800
  • 487c64eccc embedding remove attn mask ver217 2022-03-21 14:36:50 +0800
  • 7544347145 [MOE] add unitest for MOE experts layout, gradient handler and kernel (#469) HELSON 2022-03-21 13:35:04 +0800
  • 0b04c00095 . number1roy 2022-03-21 13:15:21 +0800
  • eced031de1 . number1roy 2022-03-21 12:59:51 +0800
  • be7e66c6dd . number1roy 2022-03-21 12:40:46 +0800
  • e7e8df0754 . number1roy 2022-03-21 12:30:12 +0800
  • d5cddac155 add unitest for MOE experts layout, gradient handler and kernel 1SAA 2022-03-19 15:57:23 +0800
  • ac563a69df extra content number1roy 2022-03-21 12:16:18 +0800
  • 1559c0df41 fix attn mask shape of gpt (#472) ver217 2022-03-21 12:01:31 +0800
  • 95e87971bd extra content number1roy 2022-03-21 12:01:27 +0800
  • 4c9f4a7c0e extra content number1roy 2022-03-21 11:52:15 +0800
  • 238905f109 fix attn mask shape of gpt ver217 2022-03-21 11:49:02 +0800
  • 3cb3fc275e zero init ctx receives a dp process group (#471) ver217 2022-03-21 11:18:55 +0800
  • 7e30068a22 [doc] update rst (#470) ver217 2022-03-21 10:52:45 +0800
  • 53cd41ef3c zero init ctx receives a dp process group ver217 2022-03-21 10:48:08 +0800
  • 4c4bcfa34c new location to store temp data file Jie Zhu 2022-03-21 10:11:49 +0800
  • 70fed3f2c5 add usage to trainer hook Jie Zhu 2022-03-17 16:44:13 +0800
  • 471c2a2683 add usage to doc string Jie Zhu 2022-03-17 16:07:25 +0800
  • b6cb0cfbca remove unnecessary pass statement Jie Zhu 2022-03-17 15:53:03 +0800
  • 2f08b58b16 modify `__init__.py` in profiler Jie Zhu 2022-03-17 15:50:29 +0800
  • 11d29ce76a modify interface of MemProfiler Jie Zhu 2022-03-17 14:05:23 +0800
  • 7f0bb89646 finish trainer hook Jie Zhu 2022-03-16 16:08:37 +0800
  • 5294e1a00e replace error with warning Jie Zhu 2022-03-16 14:05:02 +0800
  • 29879fe858 complete memory profiler Jie Zhu 2022-03-16 13:54:48 +0800
  • 14fbab59d7 change the name of `MemProfiler` Jie Zhu 2022-03-14 16:47:12 +0800
  • 17e5fd9f52 remove useless output Jie Zhu 2022-03-14 16:24:16 +0800
  • 8cf98780ec modify `to_tensorboard` function to support better output Jie Zhu 2022-03-14 14:05:20 +0800
  • 3ca9b0416e fix #370 git log bug Jie Zhu 2022-03-14 13:55:17 +0800
  • 61cfab1d92 add trainer hook Jie Zhu 2022-03-14 13:09:49 +0800
  • 53feafa355 fix import bug Jie Zhu 2022-03-11 13:35:36 +0800
  • 44b1b022e0 fix import bug Jie Zhu 2022-03-11 13:14:50 +0800
  • e88f94d51d add memory trainer hook Jie Zhu 2022-03-09 14:02:38 +0800
  • 1f041b1896 fix bug Jie Zhu 2022-03-10 17:17:36 +0800
  • 94022485bf add memory trainer hook Jie Zhu 2022-03-09 13:57:54 +0800
  • fa8316c732 . number1roy 2022-03-21 10:09:41 +0800
  • 2e52a6da7d remove empty rst ver217 2022-03-21 09:42:19 +0800
  • 81487bcc5b update rst ver217 2022-03-21 09:29:00 +0800
  • be7ba42aad overview number1roy 2022-03-21 00:44:59 +0800
  • c0dbbf4474 overview number1roy 2022-03-21 00:23:43 +0800
  • e55ec69c0c overview number1roy 2022-03-21 00:06:46 +0800
  • 0797ba78b0 overview number1roy 2022-03-20 23:55:20 +0800
  • 1edc89b796 structure number1roy 2022-03-20 23:20:00 +0800
  • 15b1ae3b69 structure number1roy 2022-03-20 22:42:28 +0800
  • 40194f3124 structure number1roy 2022-03-20 22:34:45 +0800
  • 42af84ed5d structure number1roy 2022-03-20 22:27:15 +0800
  • 3cb1db36de structure number1roy 2022-03-20 22:18:23 +0800
  • e88d64d683 structure number1roy 2022-03-20 22:12:35 +0800
  • 4919517977 structure number1roy 2022-03-20 17:58:36 +0800
  • 6d4d9b134a structure number1roy 2022-03-20 17:53:54 +0800
  • 713a4dc214 structure number1roy 2022-03-20 17:41:47 +0800
  • 62177017d2 structure number1roy 2022-03-20 17:38:26 +0800
  • 0cc3a8846c rst file number1roy 2022-03-20 17:20:36 +0800
  • db08931e74 rst file number1roy 2022-03-20 16:24:35 +0800
  • 7de36f0396 rst file number1roy 2022-03-20 15:56:44 +0800
  • f154ccd923 rst file number1roy 2022-03-20 14:17:05 +0800
  • a18dade47a rst file number1roy 2022-03-20 13:53:58 +0800
  • c339870f50 rst file number1roy 2022-03-20 13:33:11 +0800
  • 987d74f427 import subprocess chenjunejie 2022-03-19 17:42:45 +0800
  • 5114952516 Add a function which about check multiple gpu communicate by peer to peer available in single machine. And add a file which can test multiple gpus support p2p or not. Reason: nvidia is not support peer to peer by PCIe, hence, multiple gpus contact with each other by peer to peer need to fix NvLink. However, this hardware environment is not supported by every user, so tips are needed. Add log info and modify some Unnecessary operations. os.popen change to subprocess.Popen chenjunejie 2022-03-19 17:39:47 +0800
  • 1849b173df Add a function which about check multiple gpu communicate by peer to peer available in single machine. And add a file which can test multiple gpus support p2p or not. Reason: nvidia is not support peer to peer by PCIe, hence, multiple gpus contact with each other by peer to peer need to fix NvLink. However, this hardware environment is not supported by every user, so tips are needed. Add log info and modify some Unnecessary operations chenjunejie 2022-03-19 16:31:00 +0800
  • aff9d354f7 [MOE] polish moe_env (#467) HELSON 2022-03-19 15:36:25 +0800
  • d8a0827465 redirect moe_env from global_variables to core 1SAA 2022-03-18 16:44:04 +0800
  • bccbc15861 [MOE] changed parallelmode to dist process group (#460) HELSON 2022-03-19 13:46:29 +0800
  • 2f1fa84960 changed parallelmode to dist process group 1SAA 2022-03-18 16:44:04 +0800
  • 1526637e43 docs number1roy 2022-03-19 03:58:30 +0800
  • 8e898c2913 docs number1roy 2022-03-19 03:02:39 +0800
  • 0337244a5c docs number1roy 2022-03-19 02:57:05 +0800
  • 7863a3d917 docs number1roy 2022-03-19 02:39:23 +0800
  • b3c89c9208 docs number1roy 2022-03-19 02:29:08 +0800
  • 8e5fb3746c Merge remote-tracking branch 'upstream/main' number1roy 2022-03-19 02:26:26 +0800
  • 8f9617c313 [release] update version (#465) v0.1.0 Frank Lee 2022-03-18 19:26:07 +0800
  • 77b4b8bafc [release] update version FrankLeeeee 2022-03-18 19:19:42 +0800
  • 2963565ff8 [test] fixed release workflow step (#464) Frank Lee 2022-03-18 19:17:13 +0800
  • 2eaec7e344 [test] fixed release workflow step FrankLeeeee 2022-03-18 17:47:44 +0800
  • a48911b0ab add link number1roy 2022-03-18 19:08:19 +0800
  • 292590e0fa [test] fixed release workflow condition (#463) Frank Lee 2022-03-18 17:42:33 +0800
  • 96d09db30f [test] fixed release workflow condition FrankLeeeee 2022-03-18 17:40:24 +0800
  • 90bd97b9c0 [devops] fixed workflow bug (#462) Frank Lee 2022-03-18 17:26:24 +0800
  • 7d127b1659 [devops] fixed workflow bug FrankLeeeee 2022-03-18 17:23:04 +0800
  • 304263c2ce fix gpt attention mask (#461) ver217 2022-03-18 17:24:19 +0800
  • 46c807beaa fix gpt attention mask ver217 2022-03-18 17:22:07 +0800
  • fc8e6db005 [doc] Update docstring for ZeRO (#459) ver217 2022-03-18 16:48:20 +0800
  • a6fd8f95d1 polish shard strategy docstr ver217 2022-03-18 16:42:30 +0800
  • 438ee93bc9 polish zero docstr ver217 2022-03-18 16:41:09 +0800
  • 84fd7c1d4d add moe context, moe utilities and refactor gradient handler (#455) HELSON 2022-03-18 16:38:32 +0800
  • 13696868f1 polish sharded optim docstr ver217 2022-03-18 16:38:21 +0800