Commit Graph

  • 0c15ec10b6 Update README.md Jiarui Fang 2022-04-14 17:38:53 +0800
  • be5429d475 Merge pull request #1 from hpcaitech/main binmakeswell 2022-04-14 17:37:23 +0800
  • 1f698f4406 [readme] polish readme (#764) Jiarui Fang 2022-04-14 17:34:08 +0800
  • 05887ca218 centering image jiaruifang 2022-04-14 17:31:39 +0800
  • 73ed79c2b3 [readme] polish readme jiaruifang 2022-04-14 17:29:38 +0800
  • 920fe31526 [compatibility] used backward-compatible API for global process group (#758) Frank Lee 2022-04-14 17:20:35 +0800
  • 92fb4ee220 [test] refactored with the new rerun decorator FrankLeeeee 2022-04-14 15:29:02 +0800
  • 4ea49cb536 [test] added a decorator for address already in use error with backward compatibility (#760) Frank Lee 2022-04-14 16:48:44 +0800
  • 10ef8afdd2 [gemini] init genimi individual directory (#754) Jiarui Fang 2022-04-14 16:40:26 +0800
  • d9163a7752 Merge branch 'main' of https://github.com/hpcaitech/ColossalAI into gemini/init2 jiaruifang 2022-04-14 16:38:56 +0800
  • dcca614eee [hotfix] fix test_stateful_tensor_mgr (#762) ver217 2022-04-14 15:50:09 +0800
  • 38084dcbf7 [compatibility] used backward-compatible API for global process group FrankLeeeee 2022-04-14 14:46:51 +0800
  • eb7cef7d24 [test] added a decorator for address already in use error with backward compatibility FrankLeeeee 2022-04-14 14:59:28 +0800
  • a7913486d9 polish code ver217 2022-04-14 15:39:49 +0800
  • 0a3b2503a8 [test] added a decorator for address already in use error with backward compatibility FrankLeeeee 2022-04-14 14:59:28 +0800
  • 2a45a92fdb fix test_stateful_tensor_mgr ver217 2022-04-14 15:34:47 +0800
  • 6978980f6d Automated submodule synchronization (#751) github-actions[bot] 2022-04-14 15:34:01 +0800
  • 33ab27743e Merge branch 'main' into gemini/init2 jiaruifang 2022-04-14 15:15:15 +0800
  • a93a7d7364 [hotfix] fix reuse_fp16_shard of sharded model (#756) ver217 2022-04-14 14:56:46 +0800
  • f10c738f1f polish code ver217 2022-04-14 14:07:33 +0800
  • 97fd9630f9 disable test stm ver217 2022-04-14 13:51:00 +0800
  • 0e93fd0766 add missing files jiaruifang 2022-04-14 13:32:32 +0800
  • e5532403d4 fix reuse_fp16_shard ver217 2022-04-14 12:45:18 +0800
  • ec56d7413f polish code jiaruifang 2022-04-14 12:13:03 +0800
  • 473672549a [gemini] init the gemini directory jiaruifang 2022-04-14 12:10:11 +0800
  • 8f7ce94b8e [hotfix] fix auto tensor placement policy (#753) ver217 2022-04-14 12:04:45 +0800
  • 84c6700b2a [zero] refactor memstats_collector (#746) HELSON 2022-04-14 12:01:12 +0800
  • dc3394494f Merge branch 'hpcaitech:main' into feature/cli YuliangLiu0306 2022-04-14 11:58:53 +0800
  • 7c0fb127f6 add SLURM and MPI launcher liuyuliang 2022-04-14 11:57:56 +0800
  • b8899e0905 [TP] allow layernorm without bias (#750) アマデウス 2022-04-14 11:43:56 +0800
  • 14fada713d polish docstr ver217 2022-04-14 11:25:34 +0800
  • 5fab9bb690 polich docstr ver217 2022-04-14 11:21:44 +0800
  • 79969f39c6 refactor memstats_collector 1SAA 2022-04-13 16:35:09 +0800
  • 7a34d6833a fix auto tensor placement policy ver217 2022-04-14 11:17:16 +0800
  • 3d7dc46d33 [zero] use factory pattern for tensor_placement_policy (#752) Jiarui Fang 2022-04-14 11:07:29 +0800
  • ec5f50b336 rm useless code jiaruifang 2022-04-14 10:17:08 +0800
  • 1de8e8a706 [zero] remove global var and use factory pattern for tensor_placement_policy jiaruifang 2022-04-14 10:04:58 +0800
  • 765d8160df Automated submodule synchronization github-actions 2022-04-14 00:01:24 +0000
  • 4b048a8728 fix prepare grads in sharded optim (#749) ver217 2022-04-13 22:36:11 +0800
  • 097772546e fix initialize about zero ver217 2022-04-13 18:22:26 +0800
  • 72c8f9f55b allow layernorm without bias zbian 2022-04-13 18:50:26 +0800
  • e87ada3ffb fix prepare grads in sharded optim ver217 2022-04-13 18:24:41 +0800
  • cf98c4e97a fix initialize about zero ver217 2022-04-13 18:22:26 +0800
  • c53774403b Merge pull request #2 from YuliangLiu0306/main YuliangLiu0306 2022-04-13 17:54:53 +0800
  • 73753aa3e2 Merge branch 'feature/cli' into main YuliangLiu0306 2022-04-13 17:54:29 +0800
  • df7e6506d4 [CLI] add CLI launcher liuyuliang 2022-04-13 17:30:47 +0800
  • e396bb71f2 [zero] add tensor placement policies (#743) ver217 2022-04-13 15:00:48 +0800
  • 22c4b88d56 [zero] refactor ShardedParamV2 for convenience (#742) HELSON 2022-04-13 14:54:26 +0800
  • 06f645a1e3 update moe unit tests ver217 2022-04-13 14:01:26 +0800
  • baa3103209 polish comments ver217 2022-04-13 13:50:44 +0800
  • 6971dd2731 refactor ShardedParamV2 for convenience 1SAA 2022-04-13 13:14:03 +0800
  • 19b6ca2f44 polish comments ver217 2022-04-13 13:27:35 +0800
  • b0c2d93b73 add tensor placement policies ver217 2022-04-13 13:16:19 +0800
  • 340e59f968 [utils] add synchronized cuda memory monitor (#740) HELSON 2022-04-13 10:50:54 +0800
  • e6212f56cd [hotfix] fix memory leak in backward of sharded model (#741) ver217 2022-04-13 09:59:05 +0800
  • f4f42d4c3c [bug] fixed DDP compatibility with torch 1.8 (#739) Frank Lee 2022-04-13 00:08:46 +0800
  • ae9c92c7e3 polish code ver217 2022-04-12 20:02:45 +0800
  • db9b7fc2b8 fix memory leak in backward ver217 2022-04-12 20:01:25 +0800
  • 815199509c add synchronized cuda memory monitor 1SAA 2022-04-12 17:04:00 +0800
  • 025ee31bce [bug] fixed DDP compatibility with torch 1.8 FrankLeeeee 2022-04-12 16:57:29 +0800
  • a4e91bc87f [bug] fixed grad scaler compatibility with torch 1.8 (#735) Frank Lee 2022-04-12 16:04:21 +0800
  • 089e432ee5 [bug] fixed grad scaler compatibility with torch 1.8 FrankLeeeee 2022-04-12 15:10:53 +0800
  • 53cb584808 [utils] correct cpu memory used and capacity in the context of multi-process (#726) Jiarui Fang 2022-04-12 14:57:54 +0800
  • 29b78a8c07 Merge branch 'main' of https://github.com/hpcaitech/ColossalAI into utils/correct_cpu_used_capacity jiaruifang 2022-04-12 13:56:09 +0800
  • 7db3ccc79b [hotfix] remove duplicated param register to stateful tensor manager (#728) Jiarui Fang 2022-04-12 13:55:25 +0800
  • 600e769a42 add video (#732) binmakeswell 2022-04-12 13:41:56 +0800
  • b983dcd80e add video binmakeswell 2022-04-12 11:23:37 +0800
  • a5c3f072f6 [bug] removed zero installation requirements (#731) Frank Lee 2022-04-12 13:27:25 +0800
  • dc2b621c19 [bug] removed zero installation requirements FrankLeeeee 2022-04-12 13:20:34 +0800
  • b9b469ea50 [moe] add checkpoint for moe zero test (#729) HELSON 2022-04-12 12:11:54 +0800
  • 6f7d1362c9 [doc] removed outdated installation command (#730) Frank Lee 2022-04-12 11:56:45 +0800
  • 98cecc00e6 polish jiaruifang 2022-04-12 11:56:39 +0800
  • a2d67e752b [doc] removed outdated installation command FrankLeeeee 2022-04-12 11:53:25 +0800
  • fd437cbc02 Merge branch 'main' of https://github.com/hpcaitech/ColossalAI into hotfix/shardmodel jiaruifang 2022-04-12 11:48:52 +0800
  • f61e901864 add checkpoint for moe zero test 1SAA 2022-04-12 11:33:29 +0800
  • e88a498c9c [test] removed trivial outdated test FrankLeeeee 2022-04-12 10:35:31 +0800
  • 62b4ce7326 [test] added missing decorators to model checkpointing tests FrankLeeeee 2022-04-12 10:00:52 +0800
  • 865d8ad7d1 [hotfix] remove duplicated param register to stateful tensor manager jiaruifang 2022-04-12 10:50:23 +0800
  • 3957141dbf [test] removed trivial outdated test FrankLeeeee 2022-04-12 10:35:31 +0800
  • 17b0bf25c9 polish jiaruifang 2022-04-12 10:12:34 +0800
  • 35eb68807b fix bugs jiaruifang 2022-04-12 10:11:13 +0800
  • ab18502898 [test] added missing decorators to model checkpointing tests FrankLeeeee 2022-04-12 10:00:52 +0800
  • 784af79a5d rename zero test directory jiaruifang 2022-04-12 09:45:43 +0800
  • a1d25ce788 polish jiaruifang 2022-04-12 09:45:18 +0800
  • 1cb7bdad3b [util] fixed communication API depth with PyTorch 1.9 (#721) Frank Lee 2022-04-12 09:44:40 +0800
  • f1ae367322 [utils] fix cpu capacity and usage in the context of multi-process jiaruifang 2022-04-12 09:40:12 +0800
  • 2412429d54 [util] fixed activation checkpointing on torch 1.9 (#719) Frank Lee 2022-04-12 09:35:45 +0800
  • 04ff5ea546 [utils] support detection of number of processes on current node (#723) Frank Lee 2022-04-12 09:28:19 +0800
  • 4d90a7b513 [refactor] zero directory (#724) Jiarui Fang 2022-04-11 23:13:02 +0800
  • 7cffb4a485 Merge branch 'main' into refactor/zero_dir Jiarui Fang 2022-04-11 23:12:53 +0800
  • 470a2850e4 [util] fixed activation checkpointing on torch 1.9 FrankLeeeee 2022-04-11 16:23:15 +0800
  • 6efe12d7fe [util] fixed communication API depth with PyTorch 1.9 FrankLeeeee 2022-04-11 17:08:01 +0800
  • 852fb8efa8 [util] support detection of number of processes on current node FrankLeeeee 2022-04-11 17:50:39 +0800
  • 4361b0c4d1 polish jiaruifang 2022-04-11 22:01:34 +0800
  • 37d66775bc polish code jiaruifang 2022-04-11 22:01:07 +0800
  • 20ab1f5520 [bug] fixed broken test_found_inf (#725) Frank Lee 2022-04-11 22:00:27 +0800
  • 973488201d [bug] fixed broken test_found_inf FrankLeeeee 2022-04-11 19:50:55 +0800
  • 4795ef2761 polish jiaruifang 2022-04-11 18:11:06 +0800
  • 00c92e54a7 polish jiaruifang 2022-04-11 18:08:52 +0800
  • 83abb0b86e polish jiaruifang 2022-04-11 18:06:52 +0800