Commit Graph

1127 Commits (7080a8edb08400d97ba4c31458f532e4ceeacf4b)

Author SHA1 Message Date
Jiarui Fang 4b055351b0
[Gemini] make RuntimeMemTracer work correctly (#2096) 2022-12-07 16:59:59 +08:00
YuliangLiu0306 7f72eb0510
[autoparallel]add embedding handler (#2089)
* [autoparallel] add embedding handler

* fix bugs
2022-12-07 09:41:46 +08:00
Jiarui Fang 1fca5d79ea
[Gemini] remove GLOBAL_MODEL_DATA_TRACER (#2091) 2022-12-06 22:30:16 +08:00
Jiarui Fang 28e55c2530
[Gemini] remove GLOBAL_CUDA_MEM_INFO (#2090) 2022-12-06 22:10:47 +08:00
Jiarui Fang 25abae6d7f
[Gemini] use MemStats in Runtime Memory tracer (#2088) 2022-12-06 19:48:20 +08:00
Jiarui Fang 33f4412102
[Gemini] use MemStats to store the tracing data. Seperate it from Collector. (#2084) 2022-12-06 16:43:06 +08:00
Jiarui Fang 1f99205827
[Gemini] remove static tracer (#2083) 2022-12-06 12:53:58 +08:00
YuliangLiu0306 0e9db368ef
[autoparallel] add tensor constructor handler (#2082) 2022-12-06 10:20:10 +08:00
YuliangLiu0306 cdf537a648
[autoparallel] add non_split linear strategy (#2078)
* [autoparallel] add non_split linear stategy

* polish
2022-12-06 10:19:33 +08:00
Boyuan Yao cf0268da93
[autoparallel] Add F.conv metainfo (#2069)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo

* [autoparallel] add F.linear metainfo generator

* [autoparallel] add binary elementwise metainfo

* [fx] recover profiler

* [autoparallel] fix forward memory calculation

* [autoparallel] modify constants.py

* [autoparallel] remove redundant print

* [autoparallel] add F.conv metainfo

* [autoparallel] linear fix
2022-12-06 10:17:57 +08:00
YuliangLiu0306 f123476666
[autoparallel] complete gpt block searching (#2065)
* [autoparallel] complete gpt block searching

* fix test
2022-12-06 10:17:10 +08:00
Ziyue Jiang 597cdd3006
[Pipeline Middleware] Adapt scheduler for Topo (#2066)
* adapt scheduler for Topo

* remoove comment

* fix set input

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-05 20:23:41 +08:00
Jiarui Fang b3b89865e2
[Gemini] ParamOpHook -> ColoParamOpHook (#2080) 2022-12-05 17:11:06 +08:00
YuliangLiu0306 677e1e20d4
[device] update flatten device mesh usage (#2079) 2022-12-05 16:16:07 +08:00
Jiarui Fang a7adad9ccb
[Gemini] rename hooks related to runtime mem tracer (#2076) 2022-12-05 15:00:03 +08:00
Jiarui Fang 223332ff7e
[Gemini] rename ParamTracerWrapper -> RuntimeMemTracer (#2073) 2022-12-05 12:45:11 +08:00
Jiarui Fang 9f828ef36f
[Gemini] remove not used MemtracerWrapper (#2072) 2022-12-05 11:57:59 +08:00
Boyuan Yao 616da17fab
[autoparallel] add binary elementwise metainfo for auto parallel (#2058)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo

* [autoparallel] add F.linear metainfo generator

* [autoparallel] add binary elementwise metainfo

* [fx] recover profiler

* [autoparallel] fix forward memory calculation

* [autoparallel] modify constants.py

* [autoparallel] remove redundant print
2022-12-04 15:18:51 +08:00
Boyuan Yao 4b40fbd743
[autoparallel] fix forward memory calculation (#2062) 2022-12-04 15:00:16 +08:00
Ziyue Jiang 44ea461890
[Pipeline] Add Topo Class (#2059)
* use Topo class to rewrite DAG

* polish code

* polish code

* polish code

* add comment

* add else to unended if

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-12-02 18:13:20 +08:00
YuliangLiu0306 e4293e5077
[hotfix] update test for latest version (#2060) 2022-12-02 18:12:30 +08:00
Zihao 38ea4ba1bd
[Gemini] fix grad unreleased issue and param recovery issue (#2052) 2022-12-02 16:04:19 +08:00
YuliangLiu0306 1c1fe44305
[autoparallel] adapt solver with self attention (#2037)
* [autoparallel] adapt solver with self attention

* polish code
2022-12-01 17:53:15 +08:00
Frank Lee ea74a3b9cc
[cli] updated installation cheheck with more inforamtion (#2050)
* [cli] updated installation cheheck with more inforamtion

* polish code

* polish code
2022-11-30 17:53:55 +08:00
HELSON f6178728a0
[gemini] fix init bugs for modules (#2047)
* [gemini] fix init bugs for modules

* fix bugs
2022-11-30 17:06:10 +08:00
Frank Lee 81e0da7fa8
[setup] supported conda-installed torch (#2048)
* [setup] supported conda-installed torch

* polish code
2022-11-30 16:45:15 +08:00
HELSON e37f3db40c
[gemini] add arguments (#2046)
* [zero] fix testing parameters

* [gemini] add arguments

* add docstrings
2022-11-30 16:40:13 +08:00
Zihao 6a9158f1fa
[Gemini] free and allocate cuda memory by tensor.storage, add grad hook (#2040) 2022-11-30 15:57:45 +08:00
Jiarui Fang 31c644027b
[hotfix] hotfix Gemini for no leaf modules bug (#2043) 2022-11-30 14:53:41 +08:00
HELSON a1ce02d740
[zero] test gradient accumulation (#1964)
* [zero] fix memory leak for zero2

* [zero] test gradient accumulation

* [zero] remove grad clip test
2022-11-29 13:00:30 +08:00
Ziyue Jiang b0936e4a44
[rpc] split with dag (#2028)
* add DAG to split_module

* add comment

* add test case for DAG

* remove print

* add DAG middleware in scheduler

* add test case for scheduler

* remove break

* recover old lifecycle

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-11-29 11:36:28 +08:00
Jiarui Fang 96134e7be3
[hotfix] add bert test for gemini fwd bwd (#2035) 2022-11-29 11:19:52 +08:00
YuliangLiu0306 0dbcd4a6f5
[autoparallel] add split handler (#2032)
* [autoparallel] add split handler

* add numerical test and runtime passes
2022-11-29 11:03:51 +08:00
Jiarui Fang 28aa9a4294
[Gemini] more rigorous unit tests for run_fwd_bwd (#2034) 2022-11-29 09:26:06 +08:00
YuliangLiu0306 81330b0352
[autoparallel] add experimental permute handler (#2029) 2022-11-27 20:26:52 +08:00
Zihao 95c4532fff
[Gemini] paramWrapper paramTracerHook unitest (#2030) 2022-11-26 13:30:24 +08:00
Jiarui Fang 8daf1b4db1
[Gemini] patch for supporting orch.add_ function for ColoTensor (#2003) 2022-11-25 20:06:35 +08:00
Ziyue Jiang 632753abbc
[fx]Split partition with DAG information (#2025)
* add DAG to split_module

* add comment

* add test case for DAG

* remove print

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-11-25 17:42:48 +08:00
YuliangLiu0306 ea0f6b8df9
[autoparallel] add runtime pass and numerical test for view handler (#2018) 2022-11-25 15:50:16 +08:00
Zihao a719b89a41
[gemini] param_trace_hook (#2020) 2022-11-24 18:08:36 +08:00
Jiarui Fang 0b0d8f9e17
[hotfix] revert bug PRs (#2016) 2022-11-24 15:28:58 +08:00
Zihao aba3db464d
[Gemini] ParamMemHook (#2008) 2022-11-24 15:22:51 +08:00
Zihao 0160a62a3c
[Gemini] param_tracer_wrapper and test case (#2009) 2022-11-24 14:40:33 +08:00
YuliangLiu0306 1438993113
[autoparallel] add experimental view handler (#2011)
* [autoparallel] add experimental view handler

* polish

* polish

* polish code

* rename variables
2022-11-24 11:34:41 +08:00
Genghan Zhang d655eea515
[autoparallel] mix gather (#1977)
* Add mix-gather

* Add comments

* Add comments

* Polish comments

* Change the global rank assumption

* Add tests

* Add two-step tests

* Fix 10 and 01

* Skip test becasue the number of GPUs
2022-11-23 21:49:17 +08:00
Frank Lee 2bab6f512c
[release] release v0.1.11rc4 (#2007) 2022-11-23 17:14:32 +08:00
Boyuan Yao 6cd784ffee
[autoparallel] Add metainfo support for F.linear (#1987)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo

* [autoparallel] add F.linear metainfo generator
2022-11-23 14:12:34 +08:00
Super Daniel 2edbef13cc
[fx] add more meta_registry for MetaTensor execution. (#2000)
* [sc] add examples for auto checkpoint.

* merge upstream

* [fx] add more meta_registry for MetaTensor execution.
2022-11-23 10:55:46 +08:00
Jiarui Fang a2d3266648
[hotfix] make Gemini work for conv DNN (#1998) 2022-11-22 14:52:36 +08:00
YuliangLiu0306 155891113e
[autoparallel] use pytree map style to process data (#1989) 2022-11-21 10:44:22 +08:00
YuliangLiu0306 35e6b9ec82
[autoparallel] adapt handlers with attention block (#1990)
* [autoparallel] adapt handlers with attention block

* polish
2022-11-21 10:44:11 +08:00
YuliangLiu0306 05020e50d0
[autoparallel] support more flexible data type (#1967) 2022-11-18 17:01:06 +08:00
Boyuan Yao c26f21d365
[autoparallel] add pooling metainfo (#1968)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo
2022-11-18 15:13:03 +08:00
Jiarui Fang 3712ac7f90
[Gemini] add bert for MemtracerWrapper unintests (#1982) 2022-11-18 14:58:28 +08:00
Jiarui Fang e481489aa6
[Gemini] MemtracerWrapper unittests (#1981) 2022-11-18 14:19:40 +08:00
Jiarui Fang 31922110ad
[Gemini] memory trace hook (#1978) 2022-11-18 11:52:55 +08:00
Jiarui Fang 0529fcde06
[Gemini] independent runtime tracer (#1974) 2022-11-18 10:53:42 +08:00
YuliangLiu0306 0da1d00399
[autoparallel] support distributed dataloader option (#1906)
* [autoparallel] support distributed dataloader option

* update output handler to support ddp dataloader

* poish code
2022-11-17 20:11:53 +08:00
Genghan Zhang 6630d45546
[autoparallel] Add alpha beta (#1973)
* Add alpha beta

* Fix test

* Fix test
2022-11-17 16:01:14 +08:00
Jiarui Fang cc0ed7cf33
[Gemini] ZeROHookV2 -> GeminiZeROHook (#1972) 2022-11-17 14:43:49 +08:00
ver217 f8a7148dec
[kernel] move all symlinks of kernel to `colossalai._C` (#1971) 2022-11-17 13:42:33 +08:00
Jiarui Fang 7e24b9b9ee
[Gemini] clean no used MemTraceOp (#1970) 2022-11-17 13:41:54 +08:00
Boyuan Yao 7c7921f71b
[autoparallel] add torch.nn.ReLU metainfo (#1868)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input
2022-11-16 23:12:31 +08:00
Jiarui Fang 8c66a1d0aa
[polish] remove useless file _mem_tracer_hook.py (#1963) 2022-11-16 15:55:10 +08:00
Jiarui Fang c4739a725a
[Gemini] polish memstats collector (#1962) 2022-11-16 15:45:57 +08:00
YuliangLiu0306 fea3cb661c
[autoparallel] support addmm in tracer and solver (#1961)
* [fx] patch addmm

* [autoparallel] support addmm in tracer and solver
2022-11-16 14:59:18 +08:00
Jiarui Fang f7e276fa71
[Gemini] add GeminiAdamOptimizer (#1960) 2022-11-16 14:44:28 +08:00
HELSON 7066dfbf82
[zero] fix memory leak for zero2 (#1955) 2022-11-16 11:43:24 +08:00
Jiarui Fang 52c6ad26e0
[ColoTensor] reconfig ColoInitContext, decouple default_pg and default_dist_spec. (#1953) 2022-11-15 16:24:16 +08:00
zbian 598d456d0e fixed logger 2022-11-15 16:00:07 +08:00
zbian 6877121377 updated flash attention api 2022-11-15 15:25:39 +08:00
YuliangLiu0306 36c0f3ea5b
[autoparallel] remove redundancy comm node (#1893) 2022-11-15 10:53:41 +08:00
アマデウス e52f9d9109
[tensorparallel] fixed tp layers (#1938) 2022-11-14 17:34:03 +08:00
Jiarui Fang 9f4fb3f28a
[ColoTensor] ColoInitContext initialize parameters in shard mode. (#1937) 2022-11-14 16:05:09 +08:00
Boyuan Yao d5c5bc219e
[SC] add GPT example for auto checkpoint (#1889)
* [sc] SC tutorial for auto checkpoint

* [sc] polish examples

* [sc] polish readme

* [sc] polish readme and help information

* [sc] polish readme and help information
2022-11-11 23:17:25 +08:00
Junming Wu 14a0b18305
[NFC] polish colossalai/amp/naive_amp/__init__.py code style (#1905) 2022-11-11 17:49:18 +08:00
HELSON 6e51d296f0
[zero] migrate zero1&2 (#1878)
* add zero1&2 optimizer

* rename test ditectory

* rename test files

* change tolerance in test
2022-11-11 09:26:40 +08:00
Super Daniel cc55ff0aa4
[autoparallel] user-friendly API for CheckpointSolver. (#1879)
Merge for SC tutorial
2022-11-10 20:59:28 +08:00
Super Daniel 448248b27c
[fx] metainfo_trace as an API. (#1873)
* [fx] metainfo_trace as an API.

* [fx] add return.
2022-11-10 20:58:37 +08:00
Jiarui Fang 986f8cbaa7
[inference] overlap comm and compute in Linear1D_Row when stream_chunk_num > 1 (#1876) 2022-11-10 17:36:42 +08:00
YuliangLiu0306 1b494ad73c
[autoparallel] fix linear logical convert issue (#1857) 2022-11-10 17:19:22 +08:00
Jiarui Fang c2947dadf1
[inference] streaming Linear 1D Row inference (#1874) 2022-11-10 17:03:21 +08:00
Frank Lee e6ec99d389
[utils] fixed lazy init context (#1867) 2022-11-10 15:17:20 +08:00
zbian 653b0a620e added skip_bias_add for non-tp linear 2022-11-09 15:41:08 +08:00
LuGY 94329fc139
[NFC] polish colossalai/amp/apex_amp/__init__.py code style (#1853) 2022-11-09 14:49:42 +08:00
zbian 1559a09fb7 [NFC] polish amp.naive_amp.grad_scaler code style 2022-11-09 13:38:15 +08:00
HELSON 72c9448920 [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/operator_handler.py code style (#1845) 2022-11-09 12:08:47 +08:00
Genghan Zhang b25030cc07 [NFC] polish ./colossalai/amp/torch_amp/__init__.py code style (#1836) 2022-11-09 12:08:47 +08:00
Sze-qq 95ac4f88ea [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/conv_handler.py code style (#1829)
Co-authored-by: siqi <siqi@siqis-MacBook-Pro.local>
2022-11-09 12:08:47 +08:00
Ziyue Jiang 5da03c936d [NFC] polish colossalai/amp/torch_amp/_grad_scaler.py code style (#1823)
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-11-09 12:08:47 +08:00
Fazzie-Maqianli 399f84d8f6 [NFC] polish colossalai/amp/naive_amp/_fp16_optimizer.py code style (#1819) 2022-11-09 12:08:47 +08:00
CsRic 9623ec1b02 [NFC] polish colossalai/amp/naive_amp/_utils.py code style (#1816)
* [NFC] polish colossalai/nn/metric/accuracy_2p5d.py code style (#1714)

* [NFC] polish colossalai/zero/sharded_param/__init__.py code style

* [NFC] polish colossalai/amp/naive_amp/_utils.py code style

Co-authored-by: shenggan <csg19971016@gmail.com>
Co-authored-by: ric <mkkt_bkkt@mail.ustc.edu.cn>
2022-11-09 12:08:47 +08:00
binmakeswell 3c3714fc2a [NFC] polish strategies_constructor.py code style (#1806) 2022-11-09 12:08:47 +08:00
Jiarui Fang 3ce4463fe6
[utils] remove lazy_memory_allocate from ColoInitContext (#1844) 2022-11-09 11:50:33 +08:00
Jiarui Fang fba34efb5a
version to 0.1.11rc2 (#1832) 2022-11-08 17:25:15 +08:00
YuliangLiu0306 49216d7ab1
[autoparallel] fix bugs caused by negative dim key (#1808)
* [autoparallel] fix bugs caused by negative dim key

* fix import error

* fix matmul test issue

* fix unit test issue
2022-11-08 17:03:50 +08:00
アマデウス 4268ae017b
[kernel] added jit warmup (#1792) 2022-11-08 16:22:23 +08:00
YuliangLiu0306 f6032ddb17
[autoparallel] fix bias addition module (#1800) 2022-11-08 16:21:25 +08:00
Jiarui Fang cd5a0d56fa
[Gemini] make gemini usage simple (#1821) 2022-11-08 15:53:13 +08:00
ver217 99870726b1
[CheckpointIO] a uniform checkpoint I/O module (#1689) 2022-11-08 15:15:13 +08:00
Boyuan Yao 629172b319
[autoparallel] add batch norm metainfo (#1815)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler
2022-11-08 15:05:26 +08:00
Super Daniel 441d584e4a
[fx] add a symbolic_trace api. (#1812)
* [fx] add a symbolic_trace api.

* [fx] fix import errors.
2022-11-08 13:59:20 +08:00
xcnick e0da01ea71
[hotfix] fix build error when torch version >= 1.13 (#1803) 2022-11-08 09:40:24 +08:00
oahzxl 9639ea88fc
[kernel] more flexible flashatt interface (#1804) 2022-11-07 17:02:09 +08:00
Zihao 20e255d4e8
MemStatsCollectorStatic (#1765) 2022-11-07 16:49:03 +08:00
Boyuan Yao 327d07c44a
[autoparallel] add conv metainfo class for auto parallel (#1796)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test
2022-11-07 16:15:35 +08:00
oahzxl 501a9e9cd2
[hotfix] polish flash attention (#1802) 2022-11-07 14:30:22 +08:00
Jiarui Fang 218c75fd9d
[NFC] polish type hint for shape consistency (#1801)
* [NFC] polish type hint for shape consistency

* polish code

* polish code
2022-11-07 14:13:03 +08:00
Jiarui Fang c248800359
[kernel] skip tests of flash_attn and triton when they are not available (#1798) 2022-11-07 13:41:13 +08:00
YuliangLiu0306 e34e850a4c
[autoparallel]add essential CommActions for broadcast oprands (#1793) 2022-11-04 18:36:42 +08:00
Boyuan Yao 05ce3d369f
[fx] Add linear metainfo class for auto parallel (#1783)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel
2022-11-04 10:55:09 +08:00
Super Daniel e8a9bebc87
[autoparallel] refactor and add rotorc. (#1789)
* [autoparallel] refactor and add rotorc.

* [autoparallel] refactor and add rotorc.
2022-11-03 12:32:51 +08:00
YuliangLiu0306 2c4c7b3618
[autoparallel] add getattr handler (#1767)
* [autoparallel] add getattr haandler

* polish code

* add extra processes for Parameters

* add unit test for param resharding cost

* add docstring and polish test
2022-11-03 12:31:33 +08:00
HELSON c6a1a62636
[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786)
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12

* [zero] add cpu shard init

* [zero] add tiny example test

* [colo_tensor] fix bugs for torch-1.11
2022-11-02 16:11:34 +08:00
kurisusnowdeng 0b8161fab8 updated tp layers 2022-11-02 12:19:38 +08:00
Jiarui Fang cb5a587e9a
[hotfix] polish chunk import (#1787) 2022-11-02 12:10:52 +08:00
YuliangLiu0306 e859380bf7
[fx] support module with bias addition (#1780)
* [autoparallel] refactor tracer to fix bias addition issue

* [fx] support module with bias addition

* create bias_addition_module

* refactor file structure

* polish code

* fix unit test
2022-11-01 22:53:51 +08:00
Frank Lee f3f19a5c47
[autoparallel] added matmul handler (#1763)
* [autoparallel] added matmul handler

* polish code
2022-11-01 15:14:53 +08:00
Ziyue Jiang 4df0194976
[Pipeline]Adapt to Pipelinable OPT (#1782) 2022-11-01 14:18:50 +08:00
YuliangLiu0306 27de252334
[autoparallel] fix conv handler numerical test (#1771) 2022-11-01 10:43:44 +08:00
Super Daniel 1e88811c7a
[autoparallel] move ckpt solvers to autoparallel folder / refactor code (#1764)
* [autoparallel] first move.

* [autoparallel] add solver rotor.

* [autoparallel] add ckpt solvers.

* [autoparallel] modify codegen.

* [fx] fix annotation in test.

* [fx] remove check.

* [autoparallel] polish docstring.

* [fx] refactor MetaTensor.
2022-11-01 10:43:15 +08:00
Jiarui Fang f34dab4270
[compatibility] ChunkMgr import error (#1772) 2022-10-28 14:48:54 +08:00
YuliangLiu0306 b0f7c8bde8
[autoparallel] update CommSpec to CommActions (#1768)
* [autoparallel] update CommSpec to CommActions

* polish code
2022-10-28 09:57:43 +08:00
YuliangLiu0306 b4cc59b61e
[autoparallel] add numerical test for node strategies (#1760)
* [autoparallel] add numerical test for node strategies

* polish code

* polish code
2022-10-27 10:42:54 +08:00
oahzxl 25952b67d7
[feat] add flash attention (#1762) 2022-10-26 16:15:52 +08:00
Super Daniel 0584654c79
[fx] refactor memory utils and extend shard utils. (#1754)
* [fx] change memory.py to memory_utils.py.

* [fx] add shard utils.

* [fx] fix import.

* [fx] check code style.

* [fx] add comment.

* [autoparallel] first move.

* [fx] add time computations.
2022-10-26 14:24:41 +08:00
Ziyue Jiang 63f250bbd4
fix file name (#1759)
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-10-25 16:48:48 +08:00
YuliangLiu0306 314d8c497f
[autoparallel] refactor the runtime apply pass and add docstring to passes (#1757)
* [autoparallel] refactor the runtime apply pass and add doc string to passes

* fix unit test

* polish
2022-10-25 14:32:22 +08:00
Frank Lee f9a613d660
[autoparallel] added binary elementwise node handler (#1758)
* [autoparallel] added binary elementwise node handler

* polish code
2022-10-25 14:32:01 +08:00
YuliangLiu0306 d2fc067231
[autoparallel] fix param hook issue in transform pass (#1755) 2022-10-24 13:13:38 +08:00
Frank Lee 262652c8bc
[autoparallel] added addbmm handler (#1751) 2022-10-21 18:55:48 +08:00
YuliangLiu0306 980ed21723
[autoparallel] shard param and buffer as expected (#1753)
* [autoparallel] shard param and buffer as expected

* fix unit test issue
2022-10-21 15:45:13 +08:00
YuliangLiu0306 cdb7d5e7d2
[hotfix] autoparallel unit test (#1752) 2022-10-20 19:51:38 +08:00
YuliangLiu0306 a4ce180e85
[autoparallel] add sequential order to communication actions (#1735) 2022-10-20 18:48:18 +08:00
Frank Lee 474111ecb5
[autoparallel] fixed wrong sharding strategy in conv handler (#1747)
* [autoparallel] fixed wrong sharding strategy in conv handler

* polish code
2022-10-20 16:12:39 +08:00
Frank Lee 8b8937d901
[autoparallel] fixed wrong generated strategy for dot op (#1746)
* [autoparallel] fixed wrong generated strategy for dot op

* polish code
2022-10-20 15:18:16 +08:00
Frank Lee 993b8875b6
[autoparallel] handled illegal sharding strategy in shape consistency (#1744)
* [autoparallel] handled illegal sharding strategy in shape consistency

* polish code
2022-10-20 12:06:25 +08:00
Frank Lee 88a79814fb
[autoparallel] handled illegal strategy in node handler (#1743)
* [autoparallel] handled illegal strategy in node handler

* polish code
2022-10-19 17:08:52 +08:00
Super Daniel 30874f1692
[fx/profiler] debug the fx.profiler / add an example test script for fx.profiler (#1730)
* [fx/profiler] add test.

* [fx] fix file names.

* [fx] add docstring and comment.

* [fx] polish profiler.py.

* [fx] fix import errors.

* [fx] fix profiler.

* [fx] fix names.
2022-10-19 14:24:51 +08:00
Frank Lee eee84908d4
[autoparallel] handled illegal sharding strategy (#1728)
* [autoparallel] handled illegal sharding strategy

* polish code
2022-10-19 12:53:06 +08:00
Sze-qq 23703c9dd6 [NFC] polish colossalai/nn/metric/_utils.py code style (#1727) 2022-10-19 12:20:51 +08:00
Ofey Chan 7e62af28a0 [NFC] polish accuracy_2d.py code style (#1719) 2022-10-19 12:20:51 +08:00
LuGY 730f88f8e1 [NFC] polish _checkpoint_hook.py code style (#1722) 2022-10-19 12:20:51 +08:00
CsRic ea961d8fd1 [NFC] polish colossalai/zero/sharded_param/__init__.py code style (#1717)
Co-authored-by: ric <mkkt_bkkt@mail.ustc.edu.cn>
2022-10-19 12:20:51 +08:00
yuxuan-lou 2b49ca80a3 [NFC] polish colossalai/nn/lr_scheduler/linear.py code style (#1716) 2022-10-19 12:20:51 +08:00
shenggan e1d780030d [NFC] polish colossalai/nn/metric/accuracy_2p5d.py code style (#1714) 2022-10-19 12:20:51 +08:00
YuliangLiu0306 d373e67b99
[hotfix] resharding cost issue (#1742) 2022-10-19 11:33:43 +08:00
Jiarui Fang 24e84eba60
upgrade version to 0.1.11rc1 (#1739) 2022-10-19 11:26:00 +08:00
Frank Lee d2e0e39c9d
[release] update to v0.1.11 (#1736) 2022-10-19 00:28:00 +08:00
HELSON f69f9bf223
[zero] add chunk init function for users (#1729)
* add chunk manager init function

* fix unit tests

* add comment

* add flush=True
2022-10-18 16:31:22 +08:00