YuliangLiu0306
4851f2d607
[autoparallel] update_getattr_handler ( #2193 )
2022-12-26 21:57:39 +08:00
Jiarui Fang
5682e6d346
[hotfix] correct cpu_optim runtime compilation ( #2197 )
2022-12-26 16:45:14 +08:00
HELSON
2458659919
[zero] fix error for BEiT models ( #2169 )
...
* [zero] fix error for BEiT models
* [ColoParameter] add unpack operation for tuple arguments
* fix bugs
* fix chunkv2 unit testing
* add assertion for gradient state
2022-12-26 15:03:54 +08:00
Jiarui Fang
355ffb386e
[builder] unified cpu_optim fused_optim interface ( #2190 )
2022-12-23 20:57:41 +08:00
Jiarui Fang
9587b080ba
[builder] use runtime builder for fused_optim ( #2189 )
2022-12-23 17:07:03 +08:00
Jiarui Fang
bc0e271e71
[builder] use builder() for cpu adam and fused optim in setup.py ( #2187 )
2022-12-23 16:05:13 +08:00
Jiarui Fang
d42afd30f8
[builder] runtime adam and fused_optim builder ( #2184 )
2022-12-23 14:14:21 +08:00
YuliangLiu0306
550f8f8905
[autoparallel] integrate_gpt_related_tests ( #2134 )
...
* [autoparallel] integrate_gpt_related_tests
* polish code
* polish code
* add GPT2Model into runtime test
2022-12-23 12:36:59 +08:00
Ziyue Jiang
59e343328d
[Pipeline Middleware] Fix deadlock when num_microbatch=num_stage ( #2156 )
...
* add splitter
* polish code
* remove comment
* fix async nan by moving to cpu first
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-23 11:38:43 +08:00
Tongping Liu
ab54fed292
[hotfix] add kwargs for colo_addmm ( #2171 )
2022-12-22 13:25:30 +08:00
アマデウス
622f863291
[hotfix] Jit type hint #2161 ( #2164 )
2022-12-22 10:17:03 +08:00
Zihao
12e7bcd720
register meta func for rnn ( #2159 )
2022-12-21 23:06:18 +08:00
Boyuan Yao
cfe2a9bd90
[autoparallel] memory estimation for shape consistency ( #2144 )
...
* [fx] metainfo class for auto parallel
* [fx] add unit test for linear metainfo
* [fx] fix bwd param for linear
* [fx] modify unit test
* [fx] modify unit test
* [fx] modify import
* [fx] modify import
* [fx] modify import
* [fx] move meta profiler to auto parallel
* [fx] add conv metainfo class
* [fx] restore profiler
* [fx] restore meta profiler
* [autoparallel] modify unit test
* [fx] modify unit test
* [autoparallel] add batchnorm metainfo class
* [autoparallel] fix batchnorm unit test function declaration
* [fx] restore profiler
* [fx] add relu metainfo class
* [fx] restore profiler
* [autoparallel] modify metainfo input
* [autoparallel] add pooling metainfo
* [autoparallel] add F.linear metainfo generator
* [autoparallel] add binary elementwise metainfo
* [fx] recover profiler
* [autoparallel] fix forward memory calculation
* [autoparallel] modify constants.py
* [autoparallel] remove redundant print
* [autoparallel] add F.conv metainfo
* [autoparallel] linear fix
* [autoparallel] memory estimation for communication actions
* [autoparallel] fix docstring
* [autoparallel] fix variables name
2022-12-21 10:39:37 +08:00
Jiarui Fang
b87496a66b
[hotfix] fix auto policy of test_sharded_optim_v2 ( #2157 )
2022-12-20 23:03:18 +08:00
YuliangLiu0306
16335cb537
[hotfix] fix aten default bug ( #2158 )
2022-12-20 22:40:46 +08:00
HELSON
a7d95b7024
[example] add zero1, zero2 example in GPT examples ( #2146 )
...
* [example] add zero1 and zero2 for GPT
* update readme in gpt example
* polish code
* change init value
* update readme
2022-12-20 14:30:27 +08:00
YuliangLiu0306
1cce6e36ca
[autoparallel] use metainfo in handler ( #2149 )
2022-12-20 10:31:22 +08:00
Jiarui Fang
2827f41898
[Gemini] GeminiDDP convert to PyTorch Module. ( #2151 )
2022-12-20 10:19:36 +08:00
Jiarui Fang
bdef9dfdbe
[NFC] remove useless graph node code ( #2150 )
2022-12-20 00:33:58 +08:00
BlueRum
b3f73ce1c8
[Gemini] Update coloinit_ctx to support meta_tensor ( #2147 )
2022-12-19 22:37:07 +08:00
Zihao
a128eec9d5
register aten._convolution.default ( #2137 )
2022-12-18 19:27:01 +08:00
Jiarui Fang
ee287620f0
[Gemini] revert ZeROInitCtx related tracer ( #2138 )
2022-12-16 12:37:06 +08:00
アマデウス
077a66dd81
updated attention kernel ( #2133 )
2022-12-16 10:54:03 +08:00
YuliangLiu0306
a3c6924deb
[autoparallel] process size nodes in runtime pass ( #2130 )
...
* [autoparallel] process size nodes in runtime pass
* polish code
2022-12-14 16:10:50 +08:00
YuliangLiu0306
536560ccc0
[autoparallel] implement softmax handler ( #2132 )
2022-12-14 16:09:53 +08:00
Jiarui Fang
c89c66a858
[Gemini] update API of the chunkmemstatscollector. ( #2129 )
2022-12-14 00:47:06 +08:00
Jiarui Fang
2938edf446
[Gemini] update the non model data record method in runtime memory tracer ( #2128 )
2022-12-13 17:11:31 +08:00
Jiarui Fang
8fac837679
[Gemini] update non model data calculation method ( #2126 )
2022-12-13 15:44:07 +08:00
Jiarui Fang
5efda69735
[Gemini] hotfix the unittest bugs ( #2125 )
2022-12-13 14:14:55 +08:00
Jiarui Fang
05bb28aacf
[Gemini] mapping of preop timestep and param ( #2124 )
2022-12-13 12:50:24 +08:00
YuliangLiu0306
cd0af9f7f6
[autoparallel] gpt2lp runtime test ( #2113 )
2022-12-12 18:06:40 +08:00
Jiarui Fang
9214d1fe28
[Gemini] chunk init using runtime visited param order ( #2115 )
2022-12-12 18:06:16 +08:00
HELSON
e7d3afc9cc
[optimizer] add div_scale for optimizers ( #2117 )
...
* [optimizer] add div_scale for optimizers
* [zero] use div_scale in zero optimizer
* fix testing error
2022-12-12 17:58:57 +08:00
Jiarui Fang
e5aa8333e4
[NFC] update chunk manager API ( #2119 )
2022-12-12 16:57:22 +08:00
Jiarui Fang
e99edfcb51
[NFC] polish comments for Chunk class ( #2116 )
2022-12-12 15:39:31 +08:00
Ziyue Jiang
09d69e1c25
[PP Middleware] Add bwd and step for PP middleware ( #2111 )
...
* add bwd and step for PP middleware
* pre-commit
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-12 12:40:03 +08:00
Jiarui Fang
8afc001f4f
[Gemini] chunk init use OrderedParamGenerator ( #2110 )
2022-12-11 21:41:13 +08:00
HELSON
63fbba3c19
[zero] add L2 gradient clipping for ZeRO ( #2112 )
...
* [zero] add L2 gradient clipping
* [testing] add MlpModel
* [zero] add unit test for grad clipping
* fix atol
2022-12-09 18:09:17 +08:00
Jiarui Fang
70a8556946
[gemini] get the param visited order during runtime ( #2108 )
2022-12-09 16:13:03 +08:00
Jiarui Fang
61f31c3cf0
[Gemini] NFC, polish search_chunk_configuration ( #2107 )
2022-12-09 15:00:39 +08:00
Jiarui Fang
8e14344ec9
[hotfix] fix a typo in ColoInitContext ( #2106 )
2022-12-09 11:44:39 +08:00
Jiarui Fang
05545bfee9
[ColoTensor] throw error when ColoInitContext meets meta parameter. ( #2105 )
2022-12-09 11:39:46 +08:00
YuliangLiu0306
d87baa85d9
[autoparallel] support linear function bias addition ( #2104 )
2022-12-09 10:31:36 +08:00
YuliangLiu0306
0fecbb9e20
[autoparallel] support addbmm computation ( #2102 )
2022-12-08 21:15:11 +08:00
YuliangLiu0306
d3d4630495
[autoparallel] add sum handler ( #2101 )
2022-12-08 17:02:54 +08:00
Ziyue Jiang
e4705ba4e2
[Pipeline Middleware] fix data race in Pipeline Scheduler for DAG ( #2087 )
...
* add DAG test case
* fix data race by adjusting the position of lock
* polish code
* fix pytest for middleware
* remove test
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-08 13:32:27 +08:00
YuliangLiu0306
b175e6d58e
[autoparallel] add bias addition function class ( #2098 )
...
* [autoparallel] add bias addition function class
* polish code
* polish
2022-12-08 11:31:51 +08:00
YuliangLiu0306
3af7e65dea
[autoparallel] complete gpt related module search ( #2097 )
2022-12-08 10:04:09 +08:00
Jiarui Fang
85efb7ac2e
[Gemini] gemini use the runtime memory tracer (RMT) ( #2099 )
2022-12-07 23:04:02 +08:00
Super Daniel
2bf2d1cd3b
[fx] An experimental version of ColoTracer. ( #2002 )
...
* [fx] add a symbolic_trace api.
* [fx] fix import errors.
* [fx] ColoTracer experimental.
2022-12-07 18:36:17 +08:00
Jiarui Fang
4b055351b0
[Gemini] make RuntimeMemTracer work correctly ( #2096 )
2022-12-07 16:59:59 +08:00
YuliangLiu0306
7f72eb0510
[autoparallel] add embedding handler ( #2089 )
...
* [autoparallel] add embedding handler
* fix bugs
2022-12-07 09:41:46 +08:00
Jiarui Fang
1fca5d79ea
[Gemini] remove GLOBAL_MODEL_DATA_TRACER ( #2091 )
2022-12-06 22:30:16 +08:00
Jiarui Fang
28e55c2530
[Gemini] remove GLOBAL_CUDA_MEM_INFO ( #2090 )
2022-12-06 22:10:47 +08:00
Jiarui Fang
25abae6d7f
[Gemini] use MemStats in Runtime Memory tracer ( #2088 )
2022-12-06 19:48:20 +08:00
Jiarui Fang
33f4412102
[Gemini] use MemStats to store the tracing data. Separate it from Collector. ( #2084 )
2022-12-06 16:43:06 +08:00
Jiarui Fang
1f99205827
[Gemini] remove static tracer ( #2083 )
2022-12-06 12:53:58 +08:00
YuliangLiu0306
0e9db368ef
[autoparallel] add tensor constructor handler ( #2082 )
2022-12-06 10:20:10 +08:00
YuliangLiu0306
cdf537a648
[autoparallel] add non_split linear strategy ( #2078 )
...
* [autoparallel] add non_split linear strategy
* polish
2022-12-06 10:19:33 +08:00
Boyuan Yao
cf0268da93
[autoparallel] Add F.conv metainfo ( #2069 )
...
* [fx] metainfo class for auto parallel
* [fx] add unit test for linear metainfo
* [fx] fix bwd param for linear
* [fx] modify unit test
* [fx] modify unit test
* [fx] modify import
* [fx] modify import
* [fx] modify import
* [fx] move meta profiler to auto parallel
* [fx] add conv metainfo class
* [fx] restore profiler
* [fx] restore meta profiler
* [autoparallel] modify unit test
* [fx] modify unit test
* [autoparallel] add batchnorm metainfo class
* [autoparallel] fix batchnorm unit test function declaration
* [fx] restore profiler
* [fx] add relu metainfo class
* [fx] restore profiler
* [autoparallel] modify metainfo input
* [autoparallel] add pooling metainfo
* [autoparallel] add F.linear metainfo generator
* [autoparallel] add binary elementwise metainfo
* [fx] recover profiler
* [autoparallel] fix forward memory calculation
* [autoparallel] modify constants.py
* [autoparallel] remove redundant print
* [autoparallel] add F.conv metainfo
* [autoparallel] linear fix
2022-12-06 10:17:57 +08:00
YuliangLiu0306
f123476666
[autoparallel] complete gpt block searching ( #2065 )
...
* [autoparallel] complete gpt block searching
* fix test
2022-12-06 10:17:10 +08:00
Ziyue Jiang
597cdd3006
[Pipeline Middleware] Adapt scheduler for Topo ( #2066 )
...
* adapt scheduler for Topo
* remove comment
* fix set input
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-05 20:23:41 +08:00
Jiarui Fang
b3b89865e2
[Gemini] ParamOpHook -> ColoParamOpHook ( #2080 )
2022-12-05 17:11:06 +08:00
YuliangLiu0306
677e1e20d4
[device] update flatten device mesh usage ( #2079 )
2022-12-05 16:16:07 +08:00
Jiarui Fang
a7adad9ccb
[Gemini] rename hooks related to runtime mem tracer ( #2076 )
2022-12-05 15:00:03 +08:00
Jiarui Fang
223332ff7e
[Gemini] rename ParamTracerWrapper -> RuntimeMemTracer ( #2073 )
2022-12-05 12:45:11 +08:00
Jiarui Fang
9f828ef36f
[Gemini] remove not used MemtracerWrapper ( #2072 )
2022-12-05 11:57:59 +08:00
Boyuan Yao
616da17fab
[autoparallel] add binary elementwise metainfo for auto parallel ( #2058 )
...
* [fx] metainfo class for auto parallel
* [fx] add unit test for linear metainfo
* [fx] fix bwd param for linear
* [fx] modify unit test
* [fx] modify unit test
* [fx] modify import
* [fx] modify import
* [fx] modify import
* [fx] move meta profiler to auto parallel
* [fx] add conv metainfo class
* [fx] restore profiler
* [fx] restore meta profiler
* [autoparallel] modify unit test
* [fx] modify unit test
* [autoparallel] add batchnorm metainfo class
* [autoparallel] fix batchnorm unit test function declaration
* [fx] restore profiler
* [fx] add relu metainfo class
* [fx] restore profiler
* [autoparallel] modify metainfo input
* [autoparallel] add pooling metainfo
* [autoparallel] add F.linear metainfo generator
* [autoparallel] add binary elementwise metainfo
* [fx] recover profiler
* [autoparallel] fix forward memory calculation
* [autoparallel] modify constants.py
* [autoparallel] remove redundant print
2022-12-04 15:18:51 +08:00
Boyuan Yao
4b40fbd743
[autoparallel] fix forward memory calculation ( #2062 )
2022-12-04 15:00:16 +08:00
Ziyue Jiang
44ea461890
[Pipeline] Add Topo Class ( #2059 )
...
* use Topo class to rewrite DAG
* polish code
* polish code
* polish code
* add comment
* add else to unended if
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-02 18:13:20 +08:00
YuliangLiu0306
e4293e5077
[hotfix] update test for latest version ( #2060 )
2022-12-02 18:12:30 +08:00
Zihao
38ea4ba1bd
[Gemini] fix grad unreleased issue and param recovery issue ( #2052 )
2022-12-02 16:04:19 +08:00
YuliangLiu0306
1c1fe44305
[autoparallel] adapt solver with self attention ( #2037 )
...
* [autoparallel] adapt solver with self attention
* polish code
2022-12-01 17:53:15 +08:00
Frank Lee
ea74a3b9cc
[cli] updated installation check with more information ( #2050 )
...
* [cli] updated installation check with more information
* polish code
* polish code
2022-11-30 17:53:55 +08:00
HELSON
f6178728a0
[gemini] fix init bugs for modules ( #2047 )
...
* [gemini] fix init bugs for modules
* fix bugs
2022-11-30 17:06:10 +08:00
Frank Lee
81e0da7fa8
[setup] supported conda-installed torch ( #2048 )
...
* [setup] supported conda-installed torch
* polish code
2022-11-30 16:45:15 +08:00
HELSON
e37f3db40c
[gemini] add arguments ( #2046 )
...
* [zero] fix testing parameters
* [gemini] add arguments
* add docstrings
2022-11-30 16:40:13 +08:00
Zihao
6a9158f1fa
[Gemini] free and allocate cuda memory by tensor.storage, add grad hook ( #2040 )
2022-11-30 15:57:45 +08:00
Jiarui Fang
31c644027b
[hotfix] hotfix Gemini for no leaf modules bug ( #2043 )
2022-11-30 14:53:41 +08:00
HELSON
a1ce02d740
[zero] test gradient accumulation ( #1964 )
...
* [zero] fix memory leak for zero2
* [zero] test gradient accumulation
* [zero] remove grad clip test
2022-11-29 13:00:30 +08:00
Ziyue Jiang
b0936e4a44
[rpc] split with dag ( #2028 )
...
* add DAG to split_module
* add comment
* add test case for DAG
* remove print
* add DAG middleware in scheduler
* add test case for scheduler
* remove break
* recover old lifecycle
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-11-29 11:36:28 +08:00
Jiarui Fang
96134e7be3
[hotfix] add bert test for gemini fwd bwd ( #2035 )
2022-11-29 11:19:52 +08:00
YuliangLiu0306
0dbcd4a6f5
[autoparallel] add split handler ( #2032 )
...
* [autoparallel] add split handler
* add numerical test and runtime passes
2022-11-29 11:03:51 +08:00
Jiarui Fang
28aa9a4294
[Gemini] more rigorous unit tests for run_fwd_bwd ( #2034 )
2022-11-29 09:26:06 +08:00
YuliangLiu0306
81330b0352
[autoparallel] add experimental permute handler ( #2029 )
2022-11-27 20:26:52 +08:00
Zihao
95c4532fff
[Gemini] paramWrapper paramTracerHook unit test ( #2030 )
2022-11-26 13:30:24 +08:00
Jiarui Fang
8daf1b4db1
[Gemini] patch for supporting torch.add_ function for ColoTensor ( #2003 )
2022-11-25 20:06:35 +08:00
Ziyue Jiang
632753abbc
[fx] Split partition with DAG information ( #2025 )
...
* add DAG to split_module
* add comment
* add test case for DAG
* remove print
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-11-25 17:42:48 +08:00
YuliangLiu0306
ea0f6b8df9
[autoparallel] add runtime pass and numerical test for view handler ( #2018 )
2022-11-25 15:50:16 +08:00
Zihao
a719b89a41
[gemini] param_trace_hook ( #2020 )
2022-11-24 18:08:36 +08:00
Jiarui Fang
0b0d8f9e17
[hotfix] revert bug PRs ( #2016 )
2022-11-24 15:28:58 +08:00
Zihao
aba3db464d
[Gemini] ParamMemHook ( #2008 )
2022-11-24 15:22:51 +08:00
Zihao
0160a62a3c
[Gemini] param_tracer_wrapper and test case ( #2009 )
2022-11-24 14:40:33 +08:00
YuliangLiu0306
1438993113
[autoparallel] add experimental view handler ( #2011 )
...
* [autoparallel] add experimental view handler
* polish
* polish
* polish code
* rename variables
2022-11-24 11:34:41 +08:00
Genghan Zhang
d655eea515
[autoparallel] mix gather ( #1977 )
...
* Add mix-gather
* Add comments
* Add comments
* Polish comments
* Change the global rank assumption
* Add tests
* Add two-step tests
* Fix 10 and 01
* Skip test because of the number of GPUs
2022-11-23 21:49:17 +08:00
Frank Lee
2bab6f512c
[release] release v0.1.11rc4 ( #2007 )
2022-11-23 17:14:32 +08:00
Boyuan Yao
6cd784ffee
[autoparallel] Add metainfo support for F.linear ( #1987 )
...
* [fx] metainfo class for auto parallel
* [fx] add unit test for linear metainfo
* [fx] fix bwd param for linear
* [fx] modify unit test
* [fx] modify unit test
* [fx] modify import
* [fx] modify import
* [fx] modify import
* [fx] move meta profiler to auto parallel
* [fx] add conv metainfo class
* [fx] restore profiler
* [fx] restore meta profiler
* [autoparallel] modify unit test
* [fx] modify unit test
* [autoparallel] add batchnorm metainfo class
* [autoparallel] fix batchnorm unit test function declaration
* [fx] restore profiler
* [fx] add relu metainfo class
* [fx] restore profiler
* [autoparallel] modify metainfo input
* [autoparallel] add pooling metainfo
* [autoparallel] add F.linear metainfo generator
2022-11-23 14:12:34 +08:00
Super Daniel
2edbef13cc
[fx] add more meta_registry for MetaTensor execution. ( #2000 )
...
* [sc] add examples for auto checkpoint.
* merge upstream
* [fx] add more meta_registry for MetaTensor execution.
2022-11-23 10:55:46 +08:00
Jiarui Fang
a2d3266648
[hotfix] make Gemini work for conv DNN ( #1998 )
2022-11-22 14:52:36 +08:00
YuliangLiu0306
155891113e
[autoparallel] use pytree map style to process data ( #1989 )
2022-11-21 10:44:22 +08:00