Commit Graph

  • 945f1f2feb polish code jiaruifang 2022-04-11 18:06:27 +0800
  • 28022c207d Merge branch 'main' of https://github.com/hpcaitech/ColossalAI into refactor/zero_dir jiaruifang 2022-04-11 18:02:02 +0800
  • a0fcce0b45 add missing files jiaruifang 2022-04-11 17:59:38 +0800
  • 21f7bc7d73 use gpc to get rank Jie Zhu 2022-04-11 17:03:57 +0800
  • a1d7ab041d use rank-based JSON file to avoid inconsistency Jie Zhu 2022-04-11 17:00:47 +0800
  • 193dc8dacb [refactor] refactor the memory utils (#715) Jiarui Fang 2022-04-11 16:47:57 +0800
  • 9ac531aba5 change memory profiler `add_tensorboard` Jie Zhu 2022-04-11 16:06:58 +0800
  • 768516abfc polish jiaruifang 2022-04-11 15:49:39 +0800
  • a37728483d Merge branch 'main' of https://github.com/hpcaitech/ColossalAI into refactor/polish_utils jiaruifang 2022-04-11 15:46:58 +0800
  • 7c4deb7858 add missing files. jiaruifang 2022-04-11 15:45:28 +0800
  • a40053591c [refactor] polish utils jiaruifang 2022-04-11 15:45:03 +0800
  • dbd96fe90a [zero] check whether gradients have inf and nan in gpu (#712) HELSON 2022-04-11 15:40:13 +0800
  • 715b86eadd [hotfix] fix stm cuda model data size (#710) ver217 2022-04-11 15:10:39 +0800
  • 35fd07156e check whether gradients have inf and nan in gpu 1SAA 2022-04-11 14:48:30 +0800
  • cd9492854b chen pcie profiler `to_tensorboard` Jie Zhu 2022-04-11 14:28:02 +0800
  • 140263a394 [hotfix]fixed bugs of assigning grad states to non leaf nodes (#711) LuGY 2022-04-11 14:04:58 +0800
  • 1e62130ce2 change communication profiler `to_tensorboard` Jie Zhu 2022-04-11 13:57:03 +0800
  • eda30a058e [compatibility] fixed tensor parallel compatibility with torch 1.9 (#700) Frank Lee 2022-04-11 13:44:50 +0800
  • 51eb86ee7f use detach() lclgy 2022-04-11 13:42:29 +0800
  • a9b8300d54 [zero] improve adaptability for not-shard parameters (#708) HELSON 2022-04-11 13:38:51 +0800
  • 432cdd84e6 fixed bugs of assigning grad states to non leaf nodes lclgy 2022-04-11 13:00:15 +0800
  • de74211eca adapt post grad hooks for unshard paramters 1SAA 2022-04-09 12:16:07 +0800
  • d875ae689a change behavior of profiler context manager Jie Zhu 2022-04-11 12:34:01 +0800
  • d96fcc0d2f fix stm cuda model data size ver217 2022-04-11 11:35:25 +0800
  • ab8c6b4a0e [zero] refactor memstats collector (#706) ver217 2022-04-11 10:46:08 +0800
  • 3fc8a204dc []Corrected 3d vocab parallel embedding (#707) アマデウス 2022-04-11 10:17:55 +0800
  • c362ad553d Corrected 3d vocab parallel embedding アマデウス 2022-04-08 21:59:48 +0800
  • ee112fe1da [zero] adapt zero hooks for unsharded module (#699) HELSON 2022-04-08 20:23:26 +0800
  • 0b443cb86e polish code ver217 2022-04-08 19:57:24 +0800
  • 20e4b4c5e4 adapt zero hooks for unsharded module 1SAA 2022-04-08 13:56:50 +0800
  • 628b7b6c95 fix disposable ver217 2022-04-08 19:54:14 +0800
  • 4d459c2d4f refactor memstats collector ver217 2022-04-08 18:47:51 +0800
  • 896ade15d6 add PaLM link (#704) (#705) binmakeswell 2022-04-08 18:42:12 +0800
  • 3aeb556602 Merge branch 'main' into hotfix/readme binmakeswell 2022-04-08 18:41:05 +0800
  • 0c595a1bc5 add PaLM link (#704) binmakeswell 2022-04-08 18:26:59 +0800
  • 270157e9e7 add PaLM link (#704) binmakeswell 2022-04-08 18:26:59 +0800
  • d95a3bffc7 add PaLM link binmakeswell 2022-04-08 18:24:51 +0800
  • ec527e550f add PaLM link binmakeswell 2022-04-08 18:17:33 +0800
  • 3c9cd5bb5e [zero] stateful tensor manager (#687) ver217 2022-04-08 17:51:34 +0800
  • 6f2aca4cab fix unit test ver217 2022-04-08 16:57:50 +0800
  • 70e8dd418b [hotfix] update requirements-test (#701) ver217 2022-04-08 16:52:36 +0800
  • f58bef2e3d Merge branch 'zero/stm' of https://github.com/hpcaitech/ColossalAI into zero/stm jiaruifang 2022-04-08 15:34:28 +0800
  • 7afe2d7163 fix bug jiaruifang 2022-04-08 15:33:47 +0800
  • 36b73d3200 polish code ver217 2022-04-08 15:26:47 +0800
  • de3c9e420d fix sampler bug jiaruifang 2022-04-08 15:13:40 +0800
  • 651bacfe32 Merge branch 'zero/stm' of https://github.com/hpcaitech/ColossalAI into zero/stm jiaruifang 2022-04-08 15:04:05 +0800
  • c6d1516a56 fix max sampling cnt resetting bug jiaruifang 2022-04-08 15:03:01 +0800
  • f761a30085 polish code ver217 2022-04-08 14:24:53 +0800
  • 6af00554ee fix sampler bug jiaruifang 2022-04-08 14:10:44 +0800
  • d1fdcb81e9 update requirements-test ver217 2022-04-08 14:05:37 +0800
  • 0b4ec93fac [compatibility] fixed tensor parallel compatibility with torch 1.9 FrankLeeeee 2022-04-08 14:01:41 +0800
  • 30e9e36efa add unit test ver217 2022-04-08 13:42:01 +0800
  • 1ae94ea85a [ci] remove ipc config for rootless docker (#694) Frank Lee 2022-04-08 10:15:52 +0800
  • d878d843ad Automated submodule synchronization (#695) github-actions[bot] 2022-04-08 10:03:53 +0800
  • f9cd0bb7b9 Automated submodule synchronization github-actions 2022-04-08 00:01:25 +0000
  • c027924ac7 [ci] remove ipc config for rootless docker FrankLeeeee 2022-04-08 01:45:44 +0800
  • d50cdabbc9 Automated submodule synchronization (#556) github-actions[bot] 2022-04-07 22:11:00 +0800
  • 267f6ce6bc polish comment ver217 2022-04-07 18:24:01 +0800
  • eaaa3c1fbe polish code ver217 2022-04-07 18:16:45 +0800
  • 5a31a818c7 polish code ver217 2022-04-07 18:07:23 +0800
  • dbe8e030fb [ci] added missing field in workflow (#692) Frank Lee 2022-04-07 18:07:15 +0800
  • 5df346a075 [ci] added missing field in workflow FrankLeeeee 2022-04-07 17:55:11 +0800
  • 0372ed7951 [ci] update workflow trigger condition and support options (#691) Frank Lee 2022-04-07 17:53:03 +0800
  • 912ff14fb2 [ci] update workflow trigger condition and support options FrankLeeeee 2022-04-07 15:02:50 +0800
  • d7ecaf362b [zero] fix init bugs in zero context (#686) HELSON 2022-04-07 17:38:45 +0800
  • 034061d3fb add eviction strategy ver217 2022-04-07 16:05:11 +0800
  • 0ed7042f42 [pipeline] refactor pipeline (#679) YuliangLiu0306 2022-04-07 15:54:14 +0800
  • db3832bcd3 fix init bugs in zero context 1SAA 2022-04-07 15:49:06 +0800
  • 189f0fcff2 [WIP] stateful tensor manager jiaruifang 2022-04-07 11:12:25 +0800
  • 365faa122b Automated submodule synchronization github-actions 2022-04-07 00:01:21 +0000
  • e3d44a9b81 make os stateful tensor ver217 2022-04-06 18:16:27 +0800
  • eace69387d [ci] fixed compatibility workflow (#678) Frank Lee 2022-04-06 16:19:34 +0800
  • 59bf2dc590 [zero] initialize a stateful tensor manager (#614) Jiarui Fang 2022-04-06 16:18:49 +0800
  • 2a5e23dd6a Merge branch 'main' of github.com:hpcaitech/ColossalAI into zero/stateful_tensor_mgr jiaruifang 2022-04-06 15:21:16 +0800
  • f23078ff63 Merge branch 'feature/refactor_pipeline' of github.com:YuliangLiu0306/ColossalAI into feature/refactor_pipeline liuyuliang 2022-04-06 15:17:41 +0800
  • 5c7ae0ef0c Merge branch 'hpcaitech:main' into feature/refactor_pipeline YuliangLiu0306 2022-04-06 15:15:03 +0800
  • c99c77af8a infer pipeline schedule params from config liuyuliang 2022-04-06 15:14:54 +0800
  • 3de8424714 [ci] fixed compatibility workflow FrankLeeeee 2022-04-06 14:28:36 +0800
  • cc236916c6 [ci] replace the dngc ocker image with self-built pytorch image (#672) Frank Lee 2022-04-06 14:10:17 +0800
  • 03e1d35931 [release] update version (#673) v0.1.2 ver217 2022-04-06 12:03:23 +0800
  • 4e74decd3c update version ver217 2022-04-06 12:01:00 +0800
  • 79ccfa4310 [NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_adam.cu code style (#667) encmps 2022-04-05 09:12:01 +0800
  • e4bcff9b0f [NFC] polish colossalai/builder/builder.py code style (#662) lucasliunju 2022-04-03 11:59:57 +0800
  • 331683bf82 [NFC] polish colossalai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu code style (#661) shenggan 2022-04-02 21:28:47 +0800
  • c336cd3066 [NFC] polish colossalai/communication/utils.py code style (#656) FredHuang99 2022-04-02 17:28:58 +0800
  • 5ab9a71299 [NFC] polish colossalai/kernel/cuda_native/csrc/moe_cuda.cpp code style (#642) MaxT 2022-04-02 15:23:01 +0800
  • 10afec728f [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/cuda_util.h code style (#641) Xue Fuzhao 2022-04-02 14:38:40 +0800
  • 055d0270c8 [NFC] polish colossalai/context/process_group_initializer/initializer_sequence.py colossalai/context/process_group_initializer initializer_tensor.py code style (#639) Cautiousss 2022-04-02 14:30:04 +0800
  • c7c224ee17 [NFC] polish colossalai/builder/pipeline.py code style (#638) Ziheng Qin 2022-04-02 14:22:41 +0800
  • 10591ecdf9 [NFC] polish colossalai/kernel/cuda_native/csrc/cpu_adam.cpp code style (#636) Sze-qq 2022-04-02 13:28:57 +0800
  • 6fcb381801 [NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu code style (#635) Wangbo Zhao 2022-04-02 10:45:04 +0800
  • 8a5d526e95 [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/dropout_kernels.cu and cross_entropy.cu code style (#634) ExtremeViscent 2022-04-02 02:29:45 +0100
  • ad1e7ab2b2 '[NFC] polish <colossalai/engine/_base_engine.py> code style' (#631) RichardoLuo 2022-04-01 23:41:09 +0800
  • 2e11853d04 [NFC] polish colossalai/communication/ring.py code style (#630) Zangwei 2022-04-01 21:38:05 +0800
  • 01cc941e1d [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/transform_kernels.cu code stype (#629) puck_WCR 2022-04-01 20:20:54 +0800
  • c1bed0d998 [NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu code stype (#628) superhao1995 2022-04-01 19:03:01 +0800
  • 0a96338b13 [NFC] polish <colossalai/context/process_group_initializer/initializer_data.py> code stype (#626) Jiang Zhuo 2022-04-01 17:55:06 +0800
  • 701bad439b [NFC] polish colossalai/context/process_group_initializer/process_group_initializer.py code stype (#617) ziyu huang 2022-04-01 16:13:29 +0800
  • db54419409 fix format (#613) Shawn-Kong 2022-03-31 23:57:19 -0700
  • 5ecef13c16 fix format (#611) Yuer867 2022-04-01 14:19:27 +0800