1437 Commits (6b30dfb7ce002be3acc0668d3fa44c4d4ebb4108)

Author SHA1 Message Date
Xu Kai 64350029fe [NFC] polish colossalai/gemini/paramhooks/_param_hookmgr.py code style 2023-03-29 15:47:42 +08:00
RichardoLuo 1ce9d0c531 [NFC] polish initializer_data.py code style (#3287) 2023-03-29 15:22:21 +08:00
Ziheng Qin 1bed38ef37 [NFC] polish colossalai/cli/benchmark/models.py code style (#3290) 2023-03-29 15:22:21 +08:00
Kai Wang (Victor Kai) 964a28678f [NFC] polish initializer_3d.py code style (#3279) 2023-03-29 15:22:21 +08:00
Sze-qq 94eec1c5ad [NFC] polish colossalai/engine/gradient_accumulation/_gradient_accumulation.py code style (#3277)
Co-authored-by: siqi <siqi@siqis-MacBook-Pro.local>
2023-03-29 15:22:21 +08:00
Arsmart1 8af977f223 [NFC] polish colossalai/context/parallel_context.py code style (#3276) 2023-03-29 15:22:21 +08:00
Zirui Zhu 1168b50e33 [NFC] polish colossalai/engine/schedule/_pipeline_schedule_v2.py code style (#3275) 2023-03-29 15:22:21 +08:00
Tong Li 196d4696d0 [NFC] polish colossalai/nn/_ops/addmm.py code style (#3274) 2023-03-29 15:22:21 +08:00
lucasliunju 4b95464994 [NFC] polish colossalai/amp/__init__.py code style (#3272) 2023-03-29 15:22:21 +08:00
Xuanlei Zhao 6b3bb2c249 [NFC] polish code style (#3273) 2023-03-29 15:22:21 +08:00
CZYCW 4cadb25b96 [NFC] polish colossalai/fx/proxy.py code style (#3269) 2023-03-29 15:22:21 +08:00
Yuanchen d58fa705b2 [NFC] polish code style (#3268)
Co-authored-by: Yuanchen Xu <yuanchen.xu00@gmail.com>
2023-03-29 15:22:21 +08:00
Camille Zhong c4a226b729 [NFC] polish tensor_placement_policy.py code style (#3265) 2023-03-29 15:22:21 +08:00
CsRic 00778abc48 [NFC] polish colossalai/fx/passes/split_module.py code style (#3263)
Co-authored-by: csric <richcsr256@gmail.com>
2023-03-29 15:22:21 +08:00
jiangmingyan 488f37048c [NFC] polish colossalai/global_variables.py code style (#3259)
Co-authored-by: luchen <luchen@luchendeMBP.lan>
2023-03-29 15:22:21 +08:00
LuGY 1ff7d5bfa5 [NFC] polish colossalai/engine/gradient_handler/_moe_gradient_handler.py (#3260) 2023-03-29 15:22:21 +08:00
dayellow 204ca2f09a [NFC] polish colossalai/fx/profiler/experimental/profiler_module/embedding.py code style (#3256)
Co-authored-by: Minghao Huang <huangminghao@luchentech.com>
2023-03-29 15:22:21 +08:00
HELSON 02b058032d
[fx] meta registration compatibility (#3253)
* [fx] meta registration compatibility

* fix error
2023-03-27 15:22:17 +08:00
Frank Lee 73d3e4d309
[booster] implemented the torch ddp + resnet example (#3232)
* [booster] implemented the torch ddp + resnet example

* polish code
2023-03-27 10:24:14 +08:00
YH 1a229045af
Add interface for colo tensor dp size (#3227) 2023-03-27 09:42:21 +08:00
YuliangLiu0306 4d5d8f98a4
[API] implement device mesh manager (#3221)
* [API] implement device mesh manager

* polish
2023-03-24 13:39:12 +08:00
Frank Lee cd142fbefa
[api] implemented the checkpoint io module (#3205)
* [api] implemented the checkpoint io module

* polish code

* polish code
2023-03-23 10:53:17 +08:00
ver217 f8289d4221
[lazyinit] combine lazy tensor with dtensor (#3204)
* [lazyinit] lazy tensor add distribute

* [lazyinit] refactor distribute

* [lazyinit] add test dist lazy init

* [lazyinit] add verbose info for dist lazy init

* [lazyinit] fix rnn flatten weight op

* [lazyinit] polish test

* [lazyinit] polish test

* [lazyinit] fix lazy tensor data setter

* [lazyinit] polish test

* [lazyinit] fix clean

* [lazyinit] make materialize inplace

* [lazyinit] refactor materialize

* [lazyinit] refactor test distribute

* [lazyinit] fix requires_grad

* [lazyinit] fix tolist after materialization

* [lazyinit] refactor distribute module

* [lazyinit] polish docstr

* [lazyinit] polish lazy init context

* [lazyinit] temporarily skip test

* [lazyinit] polish test

* [lazyinit] add docstr
2023-03-23 10:53:06 +08:00
Frank Lee e3ad88fb48
[booster] implemented the cluster module (#3191)
* [booster] implemented the cluster module

* polish code
2023-03-22 14:11:54 +08:00
YuliangLiu0306 f57d34958b
[FX] refactor experimental tracer and adapt it with hf models (#3157)
* pass gpt trace and meta_prop

* pass t5 trace and meta_prop

* [FX] refactor experimental tracer and adapt it with hf models

* pass all mainstream model zoo

* fix CI

* fix CI

* fix CI

* fix CI

* fix CI

* fix CI

* fix CI

* fix CI

* skip tests

* fix CI

* using packaging version

* polish
2023-03-22 10:40:33 +08:00
Frank Lee e7f3bed2d3
[booster] added the plugin base and torch ddp plugin (#3180)
* [booster] added the plugin base and torch ddp plugin

* polish code

* polish code

* polish code
2023-03-21 17:39:30 +08:00
Zihao 18dbe76cae
[auto-parallel] add auto-offload feature (#3154)
* add auto-offload feature

* polish code

* fix syn offload runtime pass bug

* add offload example

* fix offload testing bug

* fix example testing bug
2023-03-21 14:17:41 +08:00
YuliangLiu0306 258b43317c
[hotfix] layout converting issue (#3188) 2023-03-21 13:24:18 +08:00
YH 80aed29cd3
[zero] Refactor ZeroContextConfig class using dataclass (#3186) 2023-03-21 12:36:47 +08:00
YH 9d644ff09f
Fix docstr for zero statedict (#3185) 2023-03-21 11:48:21 +08:00
zbian 7bc0afc901 updated flash attention usage 2023-03-20 17:57:04 +08:00
Frank Lee a9b8402d93
[booster] added the accelerator implementation (#3159) 2023-03-20 13:59:24 +08:00
ver217 6ae8ed0407
[lazyinit] add correctness verification (#3147)
* [lazyinit] fix shared module

* [tests] add lazy init test utils

* [tests] add torchvision for lazy init

* [lazyinit] fix pre op fn

* [lazyinit] handle legacy constructor

* [tests] refactor lazy init test models

* [tests] refactor lazy init test utils

* [lazyinit] fix ops don't support meta

* [tests] lazy init test timm models

* [lazyinit] fix set data

* [lazyinit] handle apex layers

* [tests] lazy init test transformers models

* [tests] lazy init test torchaudio models

* [lazyinit] fix import path

* [tests] lazy init test torchrec models

* [tests] update torch version in CI

* [tests] revert torch version in CI

* [tests] skip lazy init test
2023-03-17 13:49:04 +08:00
Frank Lee ed19290560
[booster] implemented mixed precision class (#3151)
* [booster] implemented mixed precision class

* polish code
2023-03-17 11:00:15 +08:00
YuliangLiu0306 2eca4cd376
[DTensor] refactor dtensor with new components (#3089)
* [DTensor] refactor dtensor with new components

* polish
2023-03-14 16:25:47 +08:00
ver217 ed8f60b93b
[lazyinit] refactor lazy tensor and lazy init ctx (#3131)
* [lazyinit] refactor lazy tensor and lazy init ctx

* [lazyinit] polish docstr

* [lazyinit] polish docstr
2023-03-14 15:37:12 +08:00
Frank Lee 95a36eae63
[kernel] added kernel loader to softmax autograd function (#3093)
* [kernel] added kernel loader to softmax autograd function

* [release] v0.2.6
2023-03-10 14:27:09 +08:00
Super Daniel fff98f06ed
[analyzer] a minimal implementation of static graph analyzer (#2852)
* [hotfix] meta tensor default device.

* [siu] add experimental submodules to main branch.

* [siu]

* [siu]

* [analyzer] init.

* [analyzer] readme.

* [analyzer] readme.

* [analyzer] readme.

* [analyzer] readme.

* [test] add test.

* Update symbolic_trace.py

* mark skip tests.

* try except.

* try except.

* try except.

* s

* init

* init

* fix

* skip

* skip

---------

Co-authored-by: Daniel Shao <superdainiu@MININT-PVARVID.fareast.corp.microsoft.com>
Co-authored-by: Daniel Shao <superdainiu@Daniels-Mac.local>
2023-03-10 13:21:05 +08:00
Xuanlei Zhao 10c61de2f7
[autochunk] support vit (#3084)
support vit for autochunk
* support some new ops for vit
* fix some bugs
* add test for vit
2023-03-10 10:23:26 +08:00
YuliangLiu0306 8e4e8601b7
[DTensor] implement layout converter (#3055)
* [DTensor] refactor LayoutConverter for DTensor

* polish code

* polish docstring
2023-03-10 09:53:52 +08:00
Frank Lee f19b49e164
[booster] init module structure and definition (#3056) 2023-03-09 11:27:46 +08:00
Xuanlei Zhao 2ca9728cbb
[autochunk] refactor chunk memory estimation (#2762)
* refactor memory code

* don't log free var memory

* add memory align

* update chunk target

* update setting for new memory

* finish test

* update tracer

* update typo

* update test
2023-03-08 16:22:30 +08:00
YuliangLiu0306 29386a54e6
[DTensor] refactor CommSpec (#3034) 2023-03-08 10:45:31 +08:00
YuliangLiu0306 cd2b0eaa8d
[DTensor] refactor sharding spec (#2987)
* [autoparallel] refactor sharding spec

* rename function name
2023-03-07 11:08:11 +08:00
Ziyue Jiang 400f63012e
[pipeline] Add Simplified Alpa DP Partition (#2507)
* add alpa dp split

* add alpa dp split

* use fwd+bwd instead of fwd only

---------

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2023-03-07 10:34:31 +08:00
Super Daniel b42d3d28ed
[fx] remove deprecated algorithms. (#2312) (#2313) 2023-03-07 10:30:35 +08:00
github-actions[bot] 82503a96f2
[format] applied code formatting on changed files in pull request 2997 (#3008)
Co-authored-by: github-actions <github-actions@github.com>
2023-03-06 10:42:22 +08:00
binmakeswell 52a5078988
[doc] add ISC tutorial (#2997)
* [doc] add ISC tutorial

* [doc] add ISC tutorial

* [doc] add ISC tutorial

* [doc] add ISC tutorial
2023-03-06 10:36:38 +08:00
ver217 823f3b9cf4
[doc] add deepspeed citation and copyright (#2996)
* [doc] add deepspeed citation and copyright

* [doc] add deepspeed citation and copyright

* [doc] add deepspeed citation and copyright
2023-03-04 20:08:11 +08:00
YuliangLiu0306 e414e4092b
[DTensor] implementation of dtensor (#2946)
* [DTensor] implementation of dtensor

* test layout convert

* polish
2023-03-01 16:34:58 +08:00
YuliangLiu0306 47fb214b3b
[hotfix] add shard dim to avoid backward communication error (#2954) 2023-03-01 11:41:53 +08:00
ver217 090f14fd6b
[misc] add reference (#2930)
* [misc] add reference

* [misc] add license
2023-02-28 18:07:24 +08:00
YuliangLiu0306 197d0bf4ed
[autoparallel] apply repeat block to reduce solving time (#2912) 2023-02-28 11:03:30 +08:00
YH a848091141
Fix port exception type (#2925) 2023-02-28 11:00:43 +08:00
zbian 61e687831d fixed using zero with tp cannot access weight correctly 2023-02-28 10:52:30 +08:00
YH 7b13f7db18
[zero] trivial zero optimizer refactoring (#2869)
* Fix minor grad store interface

* Apply lint
2023-02-27 14:04:53 +08:00
Jiatong (Julius) Han 8c8a39be95
[hotfix]: Remove math.prod dependency (#2837)
* Remove math.prod dependency

* Fix style

* Fix style

---------

Co-authored-by: Jiatong Han <jiatong.han@u.nus.edu>
2023-02-23 23:56:15 +08:00
YuliangLiu0306 819e25d8b1
[hotfix] fix autoparallel compatibility test issues (#2754) 2023-02-23 17:28:36 +08:00
YuliangLiu0306 0f392d7403
[autoparallel] find repeat blocks (#2854)
* [autoparallel] find repeat blocks

* polish

* polish

* polish
2023-02-23 17:28:19 +08:00
junxu c52edcf0eb
Rename class method of ZeroDDP (#2692) 2023-02-22 15:05:53 +08:00
HELSON 6e4ac08172
[hotfix] fix chunk size can not be divided (#2867)
* [hotfix] fix chunk size can not be divided

* [hotfix] use numpy for python3.8
2023-02-22 15:04:46 +08:00
Boyuan Yao eae77c831d
[autoparallel] Patch meta information for nodes that will not be handled by SPMD solver (#2823)
* [autoparallel] non spmd meta information generator

* [autoparallel] patch meta information for non spmd nodes
2023-02-22 10:28:56 +08:00
Boyuan Yao c7764d3f22
[autoparallel] Patch meta information of `torch.where` (#2822)
* [autoparallel] patch meta information of torch.where

* [autoparallel] pre-commit modified
2023-02-22 10:28:21 +08:00
Boyuan Yao fcc4097efa
[autoparallel] Patch meta information of `torch.tanh()` and `torch.nn.Dropout` (#2773)
* [autoparallel] tanh meta information

* [autoparallel] remove redundant code

* [autoparallel] patch meta information of torch.nn.Dropout
2023-02-22 10:27:59 +08:00
Frank Lee 935346430f
[cli] handled version check exceptions (#2848)
* [cli] handled version check exceptions

* polish code
2023-02-21 17:04:49 +08:00
Frank Lee 918bc94b6b
[triton] added copyright information for flash attention (#2835)
* [triton] added copyright information for flash attention

* polish code
2023-02-21 11:25:57 +08:00
Boyuan Yao 7ea6bc7f69
[autoparallel] Patch tensor related operations meta information (#2789)
* [autoparallel] tensor related meta information prototype

* [autoparallel] tensor related meta information

* [autoparallel] tensor related meta information

* [autoparallel] tensor related meta information

* [autoparallel] tensor related meta information
2023-02-20 17:38:55 +08:00
Michelle c008d4ad0c
[NFC] polish colossalai/engine/schedule/_pipeline_schedule.py code style (#2744) 2023-02-20 10:38:40 +08:00
YuliangLiu0306 2059fdd6b0
[hotfix] add copyright for solver and device mesh (#2803)
* [hotfix] add copyright for solver and device mesh

* add readme

* add alpa license

* polish
2023-02-18 21:14:38 +08:00
Boyuan Yao 8593ae1a3f
[autoparallel] rotor solver refactor (#2813)
* [autoparallel] rotor solver refactor

* [autoparallel] rotor solver refactor
2023-02-18 11:30:15 +08:00
HELSON 56ddc9ca7a
[hotfix] add correct device for fake_param (#2796) 2023-02-17 15:29:07 +08:00
Boyuan Yao a2b43e393d
[autoparallel] Patch meta information of `torch.nn.Embedding` (#2760)
* [autoparallel] embedding metainfo

* [autoparallel] fix function name in test_activation_metainfo

* [autoparallel] undo changes in activation metainfo and related tests
2023-02-17 10:39:48 +08:00
Boyuan Yao 8e3f66a0d1
[zero] fix wrong import (#2777) 2023-02-17 10:26:07 +08:00
Nikita Shulga 01066152f1
Don't use `torch._six` (#2775)
* Don't use `torch._six`

This is a private API which is gone after https://github.com/pytorch/pytorch/pull/94709

* Update common.py
2023-02-17 09:22:45 +08:00
binmakeswell 93b788b95a Merge branch 'main' into fix/format 2023-02-15 20:23:51 +08:00
xyupeng 2fd528b9f4
[NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/graph_analysis.py code style (#2737) 2023-02-15 22:57:45 +08:00
YuliangLiu0306 1dc003c169
[autoparallel] distinguish different parallel strategies (#2699) 2023-02-15 22:28:28 +08:00
YH ae86a29e23
Refactor method of grad store (#2687) 2023-02-15 22:27:58 +08:00
Zirui Zhu c9e3ee389e
[NFC] polish colossalai/context/process_group_initializer/initializer_2d.py code style (#2726) 2023-02-15 22:27:13 +08:00
Zangwei Zheng 1819373e5c
[NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/batch_norm_handler.py code style (#2728) 2023-02-15 22:26:13 +08:00
Wangbo Zhao(黑色枷锁) 8331420520
[NFC] polish colossalai/cli/cli.py code style (#2734) 2023-02-15 22:25:28 +08:00
ziyuhuang123 d344313533
[NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/embedding_handler.py code style (#2725) 2023-02-15 16:31:40 +08:00
Xue Fuzhao e81caeb4bc
[NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/cost_graph.py code style (#2720)
Co-authored-by: Fuzhao Xue <fuzhao@login2.ls6.tacc.utexas.edu>
2023-02-15 16:12:45 +08:00
yuxuan-lou 51c45c2460
[NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/where_handler.py code style (#2723) 2023-02-15 16:12:24 +08:00
YuliangLiu0306 21d6a48f4d
[autoparallel] add shard option (#2696)
* [autoparallel] add shard option

* polish
2023-02-15 13:48:28 +08:00
YuliangLiu0306 5b24987fa7
[autoparallel] fix parameters sharding bug (#2716) 2023-02-15 12:25:50 +08:00
Ziyue Jiang 4603538ddd
[NFC] polish colossalai/context/process_group_initializer/initializer_sequence.py code style (#2712)
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2023-02-15 10:53:38 +08:00
YuliangLiu0306 cb2c6a2415
[autoparallel] refactor runtime pass (#2644)
* [autoparallel] refactor runtime pass

* add unit test

* polish
2023-02-15 10:36:19 +08:00
Zihao b3d10db5f1
[NFC] polish colossalai/cli/launcher/__init__.py code style (#2709) 2023-02-15 09:57:22 +08:00
YuliangLiu0306 0b2a738393
[autoparallel] remove deprecated codes (#2664) 2023-02-15 09:54:32 +08:00
YuliangLiu0306 7fa6be49d2
[autoparallel] test compatibility for gemini and auto parallel (#2700) 2023-02-15 09:43:29 +08:00
CZYCW 4ac8bfb072
[NFC] polish colossalai/engine/gradient_handler/utils.py code style (#2708) 2023-02-15 09:40:08 +08:00
Liu Ziming 6427c406cf
[NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/strategy_generator.py code style (#2695)
Co-authored-by: shenggan <csg19971016@gmail.com>
2023-02-14 21:30:25 +08:00
アマデウス 534f68c83c
[NFC] polish pipeline process group code style (#2694) 2023-02-14 18:12:01 +08:00
LuGY 56ff1921e9
[NFC] polish colossalai/context/moe_context.py code style (#2693) 2023-02-14 18:02:45 +08:00
Shawn-Kong 1712da2800
[NFC] polish colossalai/gemini/gemini_context.py code style (#2690) 2023-02-14 11:55:23 +08:00
HELSON df4f020ee3
[zero1&2] only append parameters with gradients (#2681) 2023-02-13 18:00:16 +08:00
ver217 f0aa191f51
[gemini] fix colo_init_context (#2683) 2023-02-13 17:53:15 +08:00
Boyuan Yao 40c916b192
[autoparallel] Patch meta information of `torch.nn.functional.softmax` and `torch.nn.Softmax` (#2674)
* [autoparallel] softmax metainfo

* [autoparallel] softmax metainfo
2023-02-13 16:09:22 +08:00
HELSON 8213f89fd2
[gemini] add fake_release_chunk for keep-gathered chunk in the inference mode (#2671) 2023-02-13 14:35:32 +08:00
binmakeswell 9ab14b20b5
[doc] add CVPR tutorial (#2666) 2023-02-10 20:43:34 +08:00
Boyuan Yao 0385b26ebf
[autoparallel] Patch meta information of `torch.nn.LayerNorm` (#2647)
* [autoparallel] layernorm metainfo patch

* [autoparallel] polish test
2023-02-10 14:29:24 +08:00
YuliangLiu0306 37df666f38
[autoparallel] refactor handlers which reshape input tensors (#2615)
* [autoparallel] refactor handlers which reshape input tensors

* polish
2023-02-08 15:02:49 +08:00
YuliangLiu0306 28398f1c70
add overlap option (#2613) 2023-02-08 15:02:31 +08:00
YuliangLiu0306 cb3d1bef62
[autoparallel] adapt autoparallel tests with latest api (#2626) 2023-02-08 15:02:12 +08:00
Boyuan Yao 90a9fdd91d
[autoparallel] Patch meta information of `torch.matmul` (#2584)
* [autoparallel] matmul metainfo

* [auto_parallel] remove unused print

* [tests] skip test_matmul_handler when torch version is lower than 1.12.0
2023-02-08 11:05:31 +08:00
oahzxl 6ba8364881
[autochunk] support diffusion for autochunk (#2621)
* add alphafold benchmark

* rename alphafold test

* rename tests

* rename diffuser

* rename

* rename

* update transformer

* update benchmark

* update benchmark

* update bench memory

* update transformer benchmark

* rename

* support diffuser

* support unet metainfo prop

* fix bug and simplify code

* update linear and support some op

* optimize max region search, support conv

* update unet test

* support some op

* support groupnorm and interpolate

* update flow search

* add fix dim in node flow

* fix utils

* rename

* support diffusion

* update diffuser

* update chunk search

* optimize imports

* import

* finish autochunk
2023-02-07 16:32:45 +08:00
Frank Lee 8518263b80
[test] fixed the triton version for testing (#2608) 2023-02-07 13:49:38 +08:00
HELSON 552183bb74
[polish] polish ColoTensor and its submodules (#2537) 2023-02-03 11:44:10 +08:00
Frank Lee dd14783f75
[kernel] fixed repeated loading of kernels (#2549)
* [kernel] fixed repeated loading of kernels

* polish code

* polish code
2023-02-03 09:47:13 +08:00
ver217 5b1854309a
[hotfix] fix zero ddp warmup check (#2545) 2023-02-02 16:42:38 +08:00
oahzxl fa3d66feb9
support unet metainfo prop (#2544) 2023-02-02 16:19:26 +08:00
oahzxl 05671fcb42
[autochunk] support multi outputs chunk search (#2538)
Support multi-output chunk search. Previously we only supported single-output chunk search. The new search is more flexible and improves performance by a large margin. For transformer, it reduces memory by 40% compared with the previous search strategy.

1. rewrite search strategy to support multi outputs chunk search
2. fix many, many bugs
3. update tests
2023-02-01 13:18:51 +08:00
oahzxl 63199c6687
[autochunk] support transformer (#2526) 2023-01-31 16:00:06 +08:00
HELSON a4ed9125ac
[hotfix] fix lightning error (#2529) 2023-01-31 10:40:39 +08:00
HELSON 66dfcf5281
[gemini] update the gpt example (#2527) 2023-01-30 17:58:05 +08:00
HELSON b528eea0f0
[zero] add zero wrappers (#2523)
* [zero] add zero wrappers

* change names

* add wrapper functions to init
2023-01-29 17:52:58 +08:00
Super Daniel c198c7c0b0
[hotfix] meta tensor default device. (#2510) 2023-01-29 16:28:10 +08:00
HELSON 077a5cdde4
[zero] fix gradient clipping in hybrid parallelism (#2521)
* [zero] fix gradient clipping in hybrid parallelism

* [testing] change model name to avoid pytest warning

* [hotfix] fix unit testing
2023-01-29 15:09:57 +08:00
YuliangLiu0306 aa0f6686f9
[autoparallel] accelerate gpt2 training (#2495) 2023-01-29 11:13:15 +08:00
HELSON 707b11d4a0
[gemini] update ddp strict mode (#2518)
* [zero] add strict ddp mode for chunk init

* [gemini] update gpt example
2023-01-28 14:35:25 +08:00
HELSON 2d1a7dfe5f
[zero] add strict ddp mode (#2508)
* [zero] add strict ddp mode

* [polish] add comments for strict ddp mode

* [zero] fix test error
2023-01-20 14:04:38 +08:00
oahzxl c04f183237
[autochunk] support parsing blocks (#2506) 2023-01-20 11:18:17 +08:00
Super Daniel 35c0c0006e
[utils] lazy init. (#2148)
* [utils] lazy init.

* [utils] remove description.

* [utils] complete.

* [utils] finalize.

* [utils] fix names.
2023-01-20 10:49:00 +08:00
oahzxl 72341e65f4
[auto-chunk] support extramsa (#3) (#2504) 2023-01-20 10:13:03 +08:00
Ziyue Jiang 0f02b8c6e6
add avg partition (#2483)
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2023-01-19 13:54:50 +08:00
アマデウス 99d9713b02 Revert "Update parallel_context.py (#2408)"
This reverts commit 7d5640b9db.
2023-01-19 12:27:48 +08:00
oahzxl ecccc91f21
[autochunk] support autochunk on evoformer (#2497) 2023-01-19 11:41:00 +08:00
oahzxl 5db3a5bf42
[fx] allow control of ckpt_codegen init (#2498)
* [fx] allow control of ckpt_codegen init

Currently in ColoGraphModule, ActivationCheckpointCodeGen is set automatically in __init__, which prevents any other codegen from being set.
So this adds an arg to control whether to set ActivationCheckpointCodeGen in __init__.

* code style
2023-01-18 17:02:46 +08:00
HELSON d565a24849
[zero] add unit testings for hybrid parallelism (#2486) 2023-01-18 10:36:10 +08:00
oahzxl 4953b4ace1
[autochunk] support evoformer tracer (#2485)
Support the full evoformer tracer; evoformer is a main module of alphafold. Previously we only supported a simplified version of it.
1. support some evoformer ops in fx
2. support evoformer test
3. add repos for test code
2023-01-16 19:25:05 +08:00
YuliangLiu0306 67e1912b59
[autoparallel] support origin activation ckpt on autoparallel system (#2468) 2023-01-16 16:25:13 +08:00
Ziyue Jiang fef5c949c3
polish pp middleware (#2476)
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2023-01-13 16:56:01 +08:00
HELSON a5dc4253c6
[zero] polish low level optimizer (#2473) 2023-01-13 14:56:17 +08:00
Frank Lee 8b7495dd54
[example] integrate seq-parallel tutorial with CI (#2463) 2023-01-13 14:40:05 +08:00
Jiarui Fang 867c8c2d3a
[zero] low level optim supports ProcessGroup (#2464) 2023-01-13 10:05:58 +08:00
Frank Lee 14d9299360
[cli] fixed hostname mismatch error (#2465) 2023-01-12 14:52:09 +08:00
Haofan Wang 9358262992
Fix False warning in initialize.py (#2456)
* Update initialize.py

* pre-commit run check
2023-01-12 13:49:01 +08:00
YuliangLiu0306 8221fd7485
[autoparallel] update binary elementwise handler (#2451)
* [autoparallel] update binary elementwise handler

* polish
2023-01-12 09:35:10 +08:00
HELSON 2bfeb24308
[zero] add warning for ignored parameters (#2446) 2023-01-11 15:30:09 +08:00
Frank Lee 39163417a1
[example] updated the hybrid parallel tutorial (#2444)
* [example] updated the hybrid parallel tutorial

* polish code
2023-01-11 15:17:17 +08:00
HELSON 5521af7877
[zero] fix state_dict and load_state_dict for ddp ignored parameters (#2443)
* [ddp] add is_ddp_ignored

[ddp] rename to is_ddp_ignored

* [zero] fix state_dict and load_state_dict

* fix bugs

* [zero] update unit test for ZeroDDP
2023-01-11 14:55:41 +08:00
YuliangLiu0306 2731531bc2
[autoparallel] integrate device mesh initialization into autoparallelize (#2393)
* [autoparallel] integrate device mesh initialization into autoparallelize

* add megatron solution

* update gpt autoparallel examples with latest api

* adapt beta value to fit the current computation cost
2023-01-11 14:03:49 +08:00
Frank Lee c72c827e95
[cli] provided more details if colossalai run fail (#2442) 2023-01-11 13:56:42 +08:00
Super Daniel c41e59e5ad
[fx] allow native ckpt trace and codegen. (#2438) 2023-01-11 13:49:59 +08:00
YuliangLiu0306 41429b9b28
[autoparallel] add shard option (#2423) 2023-01-11 13:40:33 +08:00
HELSON 7829aa094e
[ddp] add is_ddp_ignored (#2434)
[ddp] rename to is_ddp_ignored
2023-01-11 12:22:45 +08:00
HELSON bb4e9a311a
[zero] add inference mode and its unit test (#2418) 2023-01-11 10:07:37 +08:00
Jiarui Fang 93f62dd152
[autochunk] add autochunk feature 2023-01-10 16:04:42 +08:00
HELSON dddacd2d2c
[hotfix] add norm clearing for the overflow step (#2416) 2023-01-10 15:43:06 +08:00