Commit Graph

674 Commits (8b7495dd541ea12e1af84b3a3a0e24abc1e847d1)

Author SHA1 Message Date
Jiarui Fang 27327a4c90
[example] add palm pytorch version (#2172) 2022-12-22 10:15:34 +08:00
Jiarui Fang b87496a66b
[hotfix] fix auto policy of test_sharded_optim_v2 (#2157) 2022-12-20 23:03:18 +08:00
YuliangLiu0306 16335cb537
[hotfix] fix aten default bug (#2158) 2022-12-20 22:40:46 +08:00
Jiarui Fang 2827f41898
[Gemini] GeminiDPP convert to PyTorch Module. (#2151) 2022-12-20 10:19:36 +08:00
アマデウス 077a66dd81
updated attention kernel (#2133) 2022-12-16 10:54:03 +08:00
YuliangLiu0306 536560ccc0
[autoparallel] implement softmax handler (#2132) 2022-12-14 16:09:53 +08:00
Jiarui Fang c89c66a858
[Gemini] update API of the chunkmemstatscollector. (#2129) 2022-12-14 00:47:06 +08:00
Jiarui Fang 2938edf446
[Gemini] update the non model data record method in runtime memory tracer (#2128) 2022-12-13 17:11:31 +08:00
Jiarui Fang deee317b0f
[Gemini] test step-tensor mapping using repeated_computed_layers.py (#2127) 2022-12-13 16:34:10 +08:00
Jiarui Fang 8fac837679
[Gemini] update non model data calculation method (#2126) 2022-12-13 15:44:07 +08:00
Jiarui Fang 5efda69735
[Gemini] hotfix the unittest bugs (#2125) 2022-12-13 14:14:55 +08:00
Jiarui Fang 05bb28aacf
[Gemini] mapping of preop timestep and param (#2124) 2022-12-13 12:50:24 +08:00
YuliangLiu0306 cd0af9f7f6
[autoparallel] gpt2lp runtimee test (#2113) 2022-12-12 18:06:40 +08:00
Jiarui Fang 9214d1fe28
[Gemini] chunk init using runtime visited param order (#2115) 2022-12-12 18:06:16 +08:00
HELSON e7d3afc9cc
[optimizer] add div_scale for optimizers (#2117)
* [optimizer] add div_scale for optimizers

* [zero] use div_scale in zero optimizer

* fix testing error
2022-12-12 17:58:57 +08:00
Jiarui Fang e5aa8333e4
[NFC] update chunk manager API (#2119) 2022-12-12 16:57:22 +08:00
Jiarui Fang e99edfcb51
[NFC] polish comments for Chunk class (#2116) 2022-12-12 15:39:31 +08:00
Ziyue Jiang 09d69e1c25
[PP Middleware] Add bwd and step for PP middleware (#2111)
* add bwd and step for PP middleware

* pre-commit

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-12 12:40:03 +08:00
HELSON 63fbba3c19
[zero] add L2 gradient clipping for ZeRO (#2112)
* [zero] add L2 gradient clipping

* [testing] add MlpModel

* [zero] add unit test for grad clipping

* fix atol
2022-12-09 18:09:17 +08:00
Jiarui Fang 70a8556946
[gemini] get the param visited order during runtime (#2108) 2022-12-09 16:13:03 +08:00
YuliangLiu0306 d87baa85d9
[autoparallel] support linear function bias addition (#2104) 2022-12-09 10:31:36 +08:00
YuliangLiu0306 0fecbb9e20
[autoparallel] support addbmm computation (#2102) 2022-12-08 21:15:11 +08:00
YuliangLiu0306 d3d4630495
[autoparallel] add sum handler (#2101) 2022-12-08 17:02:54 +08:00
Ziyue Jiang e4705ba4e2
[Pipeline Middleware] fix data race in Pipeline Scheduler for DAG (#2087)
* add DAG test case

* fix datarace by adjusting theposition of lock

* polish code

* fix pytest for middleware

* remove test

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-08 13:32:27 +08:00
YuliangLiu0306 b175e6d58e
[autoparallel] add bias addtion function class (#2098)
* [autoparallel] add bias addtion function class

* polish code

* polish
2022-12-08 11:31:51 +08:00
YuliangLiu0306 3af7e65dea
[autoparallel] complete gpt related module search (#2097) 2022-12-08 10:04:09 +08:00
Jiarui Fang 85efb7ac2e
[Gemini] gemini use the runtime memory tracer (RMT) (#2099) 2022-12-07 23:04:02 +08:00
Jiarui Fang 978242326a
[Gemini] remove eval in gemini unittests! (#2092) 2022-12-07 11:58:37 +08:00
YuliangLiu0306 7f72eb0510
[autoparallel]add embedding handler (#2089)
* [autoparallel] add embedding handler

* fix bugs
2022-12-07 09:41:46 +08:00
Jiarui Fang 1fca5d79ea
[Gemini] remove GLOBAL_MODEL_DATA_TRACER (#2091) 2022-12-06 22:30:16 +08:00
Jiarui Fang 25abae6d7f
[Gemini] use MemStats in Runtime Memory tracer (#2088) 2022-12-06 19:48:20 +08:00
Jiarui Fang 33f4412102
[Gemini] use MemStats to store the tracing data. Seperate it from Collector. (#2084) 2022-12-06 16:43:06 +08:00
Jiarui Fang 1f99205827
[Gemini] remove static tracer (#2083) 2022-12-06 12:53:58 +08:00
YuliangLiu0306 0e9db368ef
[autoparallel] add tensor constructor handler (#2082) 2022-12-06 10:20:10 +08:00
YuliangLiu0306 cdf537a648
[autoparallel] add non_split linear strategy (#2078)
* [autoparallel] add non_split linear stategy

* polish
2022-12-06 10:19:33 +08:00
Boyuan Yao cf0268da93
[autoparallel] Add F.conv metainfo (#2069)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo

* [autoparallel] add F.linear metainfo generator

* [autoparallel] add binary elementwise metainfo

* [fx] recover profiler

* [autoparallel] fix forward memory calculation

* [autoparallel] modify constants.py

* [autoparallel] remove redundant print

* [autoparallel] add F.conv metainfo

* [autoparallel] linear fix
2022-12-06 10:17:57 +08:00
YuliangLiu0306 f123476666
[autoparallel] complete gpt block searching (#2065)
* [autoparallel] complete gpt block searching

* fix test
2022-12-06 10:17:10 +08:00
Ziyue Jiang 597cdd3006
[Pipeline Middleware] Adapt scheduler for Topo (#2066)
* adapt scheduler for Topo

* remoove comment

* fix set input

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-05 20:23:41 +08:00
Jiarui Fang 4f21c9e8d9
[Gemini] polish runtime tracer tests (#2077) 2022-12-05 16:22:49 +08:00
Jiarui Fang a7adad9ccb
[Gemini] rename hooks related to runtime mem tracer (#2076) 2022-12-05 15:00:03 +08:00
Jiarui Fang 40b7d55bf3
[Gemini] add albert in test models. (#2075) 2022-12-05 14:09:34 +08:00
Jiarui Fang 616ed91ecd
[test] bert test in non-distributed way (#2074) 2022-12-05 13:32:16 +08:00
Jiarui Fang 223332ff7e
[Gemini] rename ParamTracerWrapper -> RuntimeMemTracer (#2073) 2022-12-05 12:45:11 +08:00
Jiarui Fang 9f828ef36f
[Gemini] remove not used MemtracerWrapper (#2072) 2022-12-05 11:57:59 +08:00
Boyuan Yao 616da17fab
[autoparallel] add binary elementwise metainfo for auto parallel (#2058)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo

* [autoparallel] add F.linear metainfo generator

* [autoparallel] add binary elementwise metainfo

* [fx] recover profiler

* [autoparallel] fix forward memory calculation

* [autoparallel] modify constants.py

* [autoparallel] remove redundant print
2022-12-04 15:18:51 +08:00
Ziyue Jiang 44ea461890
[Pipeline] Add Topo Class (#2059)
* use Topo class to rewrite DAG

* polish code

* polish code

* polish code

* add comment

* add else to unended if

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-02 18:13:20 +08:00
YuliangLiu0306 e4293e5077
[hotfix] update test for latest version (#2060) 2022-12-02 18:12:30 +08:00
YuliangLiu0306 19438ea0ef
[hotfix] skip gpt tracing test (#2064) 2022-12-02 16:48:28 +08:00
Zihao 38ea4ba1bd
[Gemini] fix grad unreleased issue and param recovery issue (#2052) 2022-12-02 16:04:19 +08:00
YuliangLiu0306 1c1fe44305
[autoparallel] adapt solver with self attention (#2037)
* [autoparallel] adapt solver with self attention

* polish code
2022-12-01 17:53:15 +08:00
HELSON f6178728a0
[gemini] fix init bugs for modules (#2047)
* [gemini] fix init bugs for modules

* fix bugs
2022-11-30 17:06:10 +08:00
Zihao 6a9158f1fa
[Gemini] free and allocate cuda memory by tensor.storage, add grad hook (#2040) 2022-11-30 15:57:45 +08:00
Jiarui Fang 1e885329f4
[test] align model name with the file name. (#2045) 2022-11-30 15:45:26 +08:00
Jiarui Fang 31c644027b
[hotfix] hotfix Gemini for no leaf modules bug (#2043) 2022-11-30 14:53:41 +08:00
HELSON 384cd26314
[zero] fix testing parameters (#2042) 2022-11-30 12:09:32 +08:00
HELSON 17a3c685b0
[zero] fix unit-tests (#2039) 2022-11-30 10:40:31 +08:00
Jiarui Fang eb7742a4bb
[Gemini] more tests for Gemini (#2038)
* [Gemini] more tests for Gemini

* polish code
2022-11-29 17:13:10 +08:00
HELSON 537e181705
[testing] fix testing models (#2036)
* [testing] fix testing models

* roll back
2022-11-29 13:42:06 +08:00
HELSON a1ce02d740
[zero] test gradient accumulation (#1964)
* [zero] fix memory leak for zero2

* [zero] test gradient accumulation

* [zero] remove grad clip test
2022-11-29 13:00:30 +08:00
Ziyue Jiang b0936e4a44
[rpc] split with dag (#2028)
* add DAG to split_module

* add comment

* add test case for DAG

* remove print

* add DAG middleware in scheduler

* add test case for scheduler

* remove break

* recover old lifecycle

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-11-29 11:36:28 +08:00
Jiarui Fang 96134e7be3
[hotfix] add bert test for gemini fwd bwd (#2035) 2022-11-29 11:19:52 +08:00
YuliangLiu0306 0dbcd4a6f5
[autoparallel] add split handler (#2032)
* [autoparallel] add split handler

* add numerical test and runtime passes
2022-11-29 11:03:51 +08:00
Jiarui Fang 28aa9a4294
[Gemini] more rigorous unit tests for run_fwd_bwd (#2034) 2022-11-29 09:26:06 +08:00
YuliangLiu0306 81330b0352
[autoparallel] add experimental permute handler (#2029) 2022-11-27 20:26:52 +08:00
Zihao 95c4532fff
[Gemini] paramWrapper paramTracerHook unitest (#2030) 2022-11-26 13:30:24 +08:00
Jiarui Fang 8daf1b4db1
[Gemini] patch for supporting orch.add_ function for ColoTensor (#2003) 2022-11-25 20:06:35 +08:00
Ziyue Jiang 632753abbc
[fx]Split partition with DAG information (#2025)
* add DAG to split_module

* add comment

* add test case for DAG

* remove print

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-11-25 17:42:48 +08:00
YuliangLiu0306 ea0f6b8df9
[autoparallel] add runtime pass and numerical test for view handler (#2018) 2022-11-25 15:50:16 +08:00
Jiarui Fang 2e9cbfca12
[Gemini] add unitests to check gemini correctness (#2015) 2022-11-24 16:51:45 +08:00
Jiarui Fang 0b0d8f9e17
[hotfix] revert bug PRs (#2016) 2022-11-24 15:28:58 +08:00
Zihao 0160a62a3c
[Gemini] param_tracer_wrapper and test case (#2009) 2022-11-24 14:40:33 +08:00
YuliangLiu0306 1438993113
[autoparallel] add experimental view handler (#2011)
* [autoparallel] add experimental view handler

* polish

* polish

* polish code

* rename variables
2022-11-24 11:34:41 +08:00
Genghan Zhang d655eea515
[autoparallel] mix gather (#1977)
* Add mix-gather

* Add comments

* Add comments

* Polish comments

* Change the global rank assumption

* Add tests

* Add two-step tests

* Fix 10 and 01

* Skip test becasue the number of GPUs
2022-11-23 21:49:17 +08:00
Jiarui Fang 3d907faede
[Gemini] add an inline_op_module to common test models and polish unitests. (#2004) 2022-11-23 16:55:54 +08:00
Boyuan Yao 6cd784ffee
[autoparallel] Add metainfo support for F.linear (#1987)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo

* [autoparallel] add F.linear metainfo generator
2022-11-23 14:12:34 +08:00
YuliangLiu0306 35e6b9ec82
[autoparallel] adapt handlers with attention block (#1990)
* [autoparallel] adapt handlers with attention block

* polish
2022-11-21 10:44:11 +08:00
Jiarui Fang 5bec3b2168
[Gemini] open grad checkpoint when model building (#1984) 2022-11-18 16:32:54 +08:00
Boyuan Yao c26f21d365
[autoparallel] add pooling metainfo (#1968)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo
2022-11-18 15:13:03 +08:00
Jiarui Fang 3712ac7f90
[Gemini] add bert for MemtracerWrapper unintests (#1982) 2022-11-18 14:58:28 +08:00
Jiarui Fang e481489aa6
[Gemini] MemtracerWrapper unittests (#1981) 2022-11-18 14:19:40 +08:00
YuliangLiu0306 0da1d00399
[autoparallel] support distributed dataloader option (#1906)
* [autoparallel] support distributed dataloader option

* update output handler to support ddp dataloader

* poish code
2022-11-17 20:11:53 +08:00
Genghan Zhang 6630d45546
[autoparallel] Add alpha beta (#1973)
* Add alpha beta

* Fix test

* Fix test
2022-11-17 16:01:14 +08:00
ver217 f8a7148dec
[kernel] move all symlinks of kernel to `colossalai._C` (#1971) 2022-11-17 13:42:33 +08:00
Boyuan Yao 7c7921f71b
[autoparallel] add torch.nn.ReLU metainfo (#1868)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input
2022-11-16 23:12:31 +08:00
YuliangLiu0306 fea3cb661c
[autoparallel] support addmm in tracer and solver (#1961)
* [fx] patch addmm

* [autoparallel] support addmm in tracer and solver
2022-11-16 14:59:18 +08:00
Jiarui Fang f7e276fa71
[Gemini] add GeminiAdamOptimizer (#1960) 2022-11-16 14:44:28 +08:00
HELSON 7066dfbf82
[zero] fix memory leak for zero2 (#1955) 2022-11-16 11:43:24 +08:00
Jiarui Fang 52c6ad26e0
[ColoTensor] reconfig ColoInitContext, decouple default_pg and default_dist_spec. (#1953) 2022-11-15 16:24:16 +08:00
zbian 6877121377 updated flash attention api 2022-11-15 15:25:39 +08:00
Jiarui Fang 9f4fb3f28a
[ColoTensor] ColoInitContext initialize parameters in shard mode. (#1937) 2022-11-14 16:05:09 +08:00
HELSON 6e51d296f0
[zero] migrate zero1&2 (#1878)
* add zero1&2 optimizer

* rename test ditectory

* rename test files

* change tolerance in test
2022-11-11 09:26:40 +08:00
Jiarui Fang 51597f6a28
[hotfix] pass test_complete_workflow (#1877) 2022-11-10 17:53:39 +08:00
Jiarui Fang 986f8cbaa7
[inference] overlap comm and compute in Linear1D_Row when stream_chunk_num > 1 (#1876) 2022-11-10 17:36:42 +08:00
YuliangLiu0306 1b494ad73c
[autoparallel] fix linear logical convert issue (#1857) 2022-11-10 17:19:22 +08:00
Jiarui Fang c2947dadf1
[inference] streaming Linear 1D Row inference (#1874) 2022-11-10 17:03:21 +08:00
xcnick a141681260
[amp] add torch amp test (#1860) 2022-11-10 16:40:26 +08:00
Frank Lee e6ec99d389
[utils] fixed lazy init context (#1867) 2022-11-10 15:17:20 +08:00
Jiarui Fang 3ce4463fe6
[utils] remove lazy_memory_allocate from ColoInitContext (#1844) 2022-11-09 11:50:33 +08:00
YuliangLiu0306 f6032ddb17
[autoparallel] fix bias addition module (#1800) 2022-11-08 16:21:25 +08:00
ver217 99870726b1
[CheckpointIO] a uniform checkpoint I/O module (#1689) 2022-11-08 15:15:13 +08:00