Commit Graph

610 Commits (e7d3afc9cc5b923123d4cb20c420bb3bda906764)

Author SHA1 Message Date
Zihao 95c4532fff
[Gemini] paramWrapper paramTracerHook unittest (#2030) 2022-11-26 13:30:24 +08:00
Jiarui Fang 8daf1b4db1
[Gemini] patch for supporting torch.add_ function for ColoTensor (#2003) 2022-11-25 20:06:35 +08:00
Ziyue Jiang 632753abbc
[fx] Split partition with DAG information (#2025)
* add DAG to split_module

* add comment

* add test case for DAG

* remove print

Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-11-25 17:42:48 +08:00
YuliangLiu0306 ea0f6b8df9
[autoparallel] add runtime pass and numerical test for view handler (#2018) 2022-11-25 15:50:16 +08:00
Jiarui Fang 2e9cbfca12
[Gemini] add unittests to check gemini correctness (#2015) 2022-11-24 16:51:45 +08:00
Jiarui Fang 0b0d8f9e17
[hotfix] revert bug PRs (#2016) 2022-11-24 15:28:58 +08:00
Zihao 0160a62a3c
[Gemini] param_tracer_wrapper and test case (#2009) 2022-11-24 14:40:33 +08:00
YuliangLiu0306 1438993113
[autoparallel] add experimental view handler (#2011)
* [autoparallel] add experimental view handler

* polish

* polish

* polish code

* rename variables
2022-11-24 11:34:41 +08:00
Genghan Zhang d655eea515
[autoparallel] mix gather (#1977)
* Add mix-gather

* Add comments

* Add comments

* Polish comments

* Change the global rank assumption

* Add tests

* Add two-step tests

* Fix 10 and 01

* Skip test because of the number of GPUs
2022-11-23 21:49:17 +08:00
Jiarui Fang 3d907faede
[Gemini] add an inline_op_module to common test models and polish unittests. (#2004) 2022-11-23 16:55:54 +08:00
Boyuan Yao 6cd784ffee
[autoparallel] Add metainfo support for F.linear (#1987)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo

* [autoparallel] add F.linear metainfo generator
2022-11-23 14:12:34 +08:00
YuliangLiu0306 35e6b9ec82
[autoparallel] adapt handlers with attention block (#1990)
* [autoparallel] adapt handlers with attention block

* polish
2022-11-21 10:44:11 +08:00
Jiarui Fang 5bec3b2168
[Gemini] open grad checkpoint when model building (#1984) 2022-11-18 16:32:54 +08:00
Boyuan Yao c26f21d365
[autoparallel] add pooling metainfo (#1968)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input

* [autoparallel] add pooling metainfo
2022-11-18 15:13:03 +08:00
Jiarui Fang 3712ac7f90
[Gemini] add bert for MemtracerWrapper unittests (#1982) 2022-11-18 14:58:28 +08:00
Jiarui Fang e481489aa6
[Gemini] MemtracerWrapper unittests (#1981) 2022-11-18 14:19:40 +08:00
YuliangLiu0306 0da1d00399
[autoparallel] support distributed dataloader option (#1906)
* [autoparallel] support distributed dataloader option

* update output handler to support ddp dataloader

* polish code
2022-11-17 20:11:53 +08:00
Genghan Zhang 6630d45546
[autoparallel] Add alpha beta (#1973)
* Add alpha beta

* Fix test

* Fix test
2022-11-17 16:01:14 +08:00
ver217 f8a7148dec
[kernel] move all symlinks of kernel to `colossalai._C` (#1971) 2022-11-17 13:42:33 +08:00
Boyuan Yao 7c7921f71b
[autoparallel] add torch.nn.ReLU metainfo (#1868)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler

* [fx] add relu metainfo class

* [fx] restore profiler

* [autoparallel] modify metainfo input
2022-11-16 23:12:31 +08:00
YuliangLiu0306 fea3cb661c
[autoparallel] support addmm in tracer and solver (#1961)
* [fx] patch addmm

* [autoparallel] support addmm in tracer and solver
2022-11-16 14:59:18 +08:00
Jiarui Fang f7e276fa71
[Gemini] add GeminiAdamOptimizer (#1960) 2022-11-16 14:44:28 +08:00
HELSON 7066dfbf82
[zero] fix memory leak for zero2 (#1955) 2022-11-16 11:43:24 +08:00
Jiarui Fang 52c6ad26e0
[ColoTensor] reconfig ColoInitContext, decouple default_pg and default_dist_spec. (#1953) 2022-11-15 16:24:16 +08:00
zbian 6877121377 updated flash attention api 2022-11-15 15:25:39 +08:00
Jiarui Fang 9f4fb3f28a
[ColoTensor] ColoInitContext initialize parameters in shard mode. (#1937) 2022-11-14 16:05:09 +08:00
HELSON 6e51d296f0
[zero] migrate zero1&2 (#1878)
* add zero1&2 optimizer

* rename test directory

* rename test files

* change tolerance in test
2022-11-11 09:26:40 +08:00
Jiarui Fang 51597f6a28
[hotfix] pass test_complete_workflow (#1877) 2022-11-10 17:53:39 +08:00
Jiarui Fang 986f8cbaa7
[inference] overlap comm and compute in Linear1D_Row when stream_chunk_num > 1 (#1876) 2022-11-10 17:36:42 +08:00
YuliangLiu0306 1b494ad73c
[autoparallel] fix linear logical convert issue (#1857) 2022-11-10 17:19:22 +08:00
Jiarui Fang c2947dadf1
[inference] streaming Linear 1D Row inference (#1874) 2022-11-10 17:03:21 +08:00
xcnick a141681260
[amp] add torch amp test (#1860) 2022-11-10 16:40:26 +08:00
Frank Lee e6ec99d389
[utils] fixed lazy init context (#1867) 2022-11-10 15:17:20 +08:00
Jiarui Fang 3ce4463fe6
[utils] remove lazy_memory_allocate from ColoInitContext (#1844) 2022-11-09 11:50:33 +08:00
YuliangLiu0306 f6032ddb17
[autoparallel] fix bias addition module (#1800) 2022-11-08 16:21:25 +08:00
ver217 99870726b1
[CheckpointIO] a uniform checkpoint I/O module (#1689) 2022-11-08 15:15:13 +08:00
Boyuan Yao 629172b319
[autoparallel] add batch norm metainfo (#1815)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test

* [autoparallel] add batchnorm metainfo class

* [autoparallel] fix batchnorm unit test function declaration

* [fx] restore profiler
2022-11-08 15:05:26 +08:00
Super Daniel 441d584e4a
[fx] add a symbolic_trace api. (#1812)
* [fx] add a symbolic_trace api.

* [fx] fix import errors.
2022-11-08 13:59:20 +08:00
Jiarui Fang 6fa71d65d3
[fx] skip diffusers unittest if it is not installed (#1799) 2022-11-08 11:45:23 +08:00
oahzxl 9639ea88fc
[kernel] more flexible flashatt interface (#1804) 2022-11-07 17:02:09 +08:00
Boyuan Yao 327d07c44a
[autoparallel] add conv metainfo class for auto parallel (#1796)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel

* [fx] add conv metainfo class

* [fx] restore profiler

* [fx] restore meta profiler

* [autoparallel] modify unit test

* [fx] modify unit test
2022-11-07 16:15:35 +08:00
oahzxl 501a9e9cd2
[hotfix] polish flash attention (#1802) 2022-11-07 14:30:22 +08:00
Jiarui Fang c248800359
[kernel] skip tests of flash_attn and triton when they are not available (#1798) 2022-11-07 13:41:13 +08:00
YuliangLiu0306 e34e850a4c
[autoparallel] add essential CommActions for broadcast operands (#1793) 2022-11-04 18:36:42 +08:00
Boyuan Yao 05ce3d369f
[fx] Add linear metainfo class for auto parallel (#1783)
* [fx] metainfo class for auto parallel

* [fx] add unit test for linear metainfo

* [fx] fix bwd param for linear

* [fx] modify unit test

* [fx] modify unit test

* [fx] modify import

* [fx] modify import

* [fx] modify import

* [fx] move meta profiler to auto parallel
2022-11-04 10:55:09 +08:00
YuliangLiu0306 2c4c7b3618
[autoparallel] add getattr handler (#1767)
* [autoparallel] add getattr handler

* polish code

* add extra processes for Parameters

* add unit test for param resharding cost

* add docstring and polish test
2022-11-03 12:31:33 +08:00
HELSON c6a1a62636
[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786)
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12

* [zero] add cpu shard init

* [zero] add tiny example test

* [colo_tensor] fix bugs for torch-1.11
2022-11-02 16:11:34 +08:00
Jiarui Fang 32c1b843a9
skip torchrec unittests if not installed (#1790) 2022-11-02 14:44:32 +08:00
kurisusnowdeng 0b8161fab8 updated tp layers 2022-11-02 12:19:38 +08:00
YuliangLiu0306 e859380bf7
[fx] support module with bias addition (#1780)
* [autoparallel] refactor tracer to fix bias addition issue

* [fx] support module with bias addition

* create bias_addition_module

* refactor file structure

* polish code

* fix unit test
2022-11-01 22:53:51 +08:00
Frank Lee f3f19a5c47
[autoparallel] added matmul handler (#1763)
* [autoparallel] added matmul handler

* polish code
2022-11-01 15:14:53 +08:00
YuliangLiu0306 27de252334
[autoparallel] fix conv handler numerical test (#1771) 2022-11-01 10:43:44 +08:00
Super Daniel 1e88811c7a
[autoparallel] move ckpt solvers to autoparallel folder / refactor code (#1764)
* [autoparallel] first move.

* [autoparallel] add solver rotor.

* [autoparallel] add ckpt solvers.

* [autoparallel] modify codegen.

* [fx] fix annotation in test.

* [fx] remove check.

* [autoparallel] polish docstring.

* [fx] refactor MetaTensor.
2022-11-01 10:43:15 +08:00
YuliangLiu0306 a4d1f59c78
[autoparallel] add numerical test for handlers (#1769) 2022-10-28 10:59:59 +08:00
YuliangLiu0306 b0f7c8bde8
[autoparallel] update CommSpec to CommActions (#1768)
* [autoparallel] update CommSpec to CommActions

* polish code
2022-10-28 09:57:43 +08:00
YuliangLiu0306 b4cc59b61e
[autoparallel] add numerical test for node strategies (#1760)
* [autoparallel] add numerical test for node strategies

* polish code

* polish code
2022-10-27 10:42:54 +08:00
oahzxl 25952b67d7
[feat] add flash attention (#1762) 2022-10-26 16:15:52 +08:00
Super Daniel 0584654c79
[fx] refactor memory utils and extend shard utils. (#1754)
* [fx] change memory.py to memory_utils.py.

* [fx] add shard utils.

* [fx] fix import.

* [fx] check code style.

* [fx] add comment.

* [autoparallel] first move.

* [fx] add time computations.
2022-10-26 14:24:41 +08:00
YuliangLiu0306 314d8c497f
[autoparallel] refactor the runtime apply pass and add docstring to passes (#1757)
* [autoparallel] refactor the runtime apply pass and add doc string to passes

* fix unit test

* polish
2022-10-25 14:32:22 +08:00
Frank Lee f9a613d660
[autoparallel] added binary elementwise node handler (#1758)
* [autoparallel] added binary elementwise node handler

* polish code
2022-10-25 14:32:01 +08:00
YuliangLiu0306 d2fc067231
[autoparallel] fix param hook issue in transform pass (#1755) 2022-10-24 13:13:38 +08:00
Frank Lee 262652c8bc
[autoparallel] added addbmm handler (#1751) 2022-10-21 18:55:48 +08:00
YuliangLiu0306 980ed21723
[autoparallel] shard param and buffer as expected (#1753)
* [autoparallel] shard param and buffer as expected

* fix unit test issue
2022-10-21 15:45:13 +08:00
YuliangLiu0306 cdb7d5e7d2
[hotfix] autoparallel unit test (#1752) 2022-10-20 19:51:38 +08:00
YuliangLiu0306 a4ce180e85
[autoparallel] add sequential order to communication actions (#1735) 2022-10-20 18:48:18 +08:00
Super Daniel b893342f95
[fx] test tracer on diffuser modules. (#1750)
* [fx] test tracer on diffuser modules.

* [fx] shorter seq_len.

* Update requirements-test.txt
2022-10-20 18:25:05 +08:00
Frank Lee b80b6eaa88
[autoparallel] recovered skipped test cases (#1748) 2022-10-20 16:37:33 +08:00
Frank Lee 474111ecb5
[autoparallel] fixed wrong sharding strategy in conv handler (#1747)
* [autoparallel] fixed wrong sharding strategy in conv handler

* polish code
2022-10-20 16:12:39 +08:00
Frank Lee 8b8937d901
[autoparallel] fixed wrong generated strategy for dot op (#1746)
* [autoparallel] fixed wrong generated strategy for dot op

* polish code
2022-10-20 15:18:16 +08:00
Frank Lee 88a79814fb
[autoparallel] handled illegal strategy in node handler (#1743)
* [autoparallel] handled illegal strategy in node handler

* polish code
2022-10-19 17:08:52 +08:00
Super Daniel 30874f1692
[fx/profiler] debug the fx.profiler / add an example test script for fx.profiler (#1730)
* [fx/profiler] add test.

* [fx] fix file names.

* [fx] add docstring and comment.

* [fx] polish profiler.py.

* [fx] fix import errors.

* [fx] fix profiler.

* [fx] fix names.
2022-10-19 14:24:51 +08:00
Frank Lee eee84908d4
[autoparallel] handled illegal sharding strategy (#1728)
* [autoparallel] handled illegal sharding strategy

* polish code
2022-10-19 12:53:06 +08:00
Ziheng Qin cbe9a4cb45 [NFC] polish tests/test_layers/test_3d/test_3d.py code style (#1740) 2022-10-19 12:20:51 +08:00
lucasliunju 912eb58ea0 [NFC] polish tests/test_layers/test_3d/checks_3d/common.py code style (#1733) 2022-10-19 12:20:51 +08:00
Xue Fuzhao 754aa7c81f [NFC] polish tests/test_layers/test_3d/checks_3d/check_layer_3d.py code style (#1731) 2022-10-19 12:20:51 +08:00
xyupeng ff373a11eb [NFC] polish tests/test_layers/test_sequence/checks_seq/check_layer_seq.py code style (#1723) 2022-10-19 12:20:51 +08:00
Kai Wang (Victor Kai) b38efe4e8a [NFC] polish test_2p5d/checks_2p5d/check_operation_2p5d.py code style (#1718) 2022-10-19 12:20:51 +08:00
binmakeswell f6389d0813 [NFC] polish tests/test_layers/test_2d/checks_2d/check_operation_2d.py code style (#1715) 2022-10-19 12:20:51 +08:00
HELSON f69f9bf223
[zero] add chunk init function for users (#1729)
* add chunk manager init function

* fix unit tests

* add comment

* add flush=True
2022-10-18 16:31:22 +08:00
Super Daniel 393f594051
[fx/meta/rpc] move _meta_registration.py to fx folder / register fx functions with compatibility checks / remove color debug (#1710)
* [fx] move meta registration

* [fx] fix tests.

* [fx] fix test.

* [fx] fix.

* [meta] refactor meta registration.py.

* [fx] add compatibility descriptions.

* [fx] polish import.

* [fx] add a decorator.

* [fx] fix tests.

* [fx] remove print.

* [fx] edit raise error.

* [fx] edit raise error.

* [fx] add type hint.

* [fx] fix import in experimental.

* [rpc] remove color debug.

* [meta] fix naming.
2022-10-18 10:44:23 +08:00
Frank Lee e8d8eda5e7
[autoparallel] moved tests to test_tensor_shard (#1713) 2022-10-17 13:54:20 +08:00
YuliangLiu0306 845ff4a47a
[autoparallel] resnet block runtime apply (#1709)
* [autoparallel] resnet block runtime apply

* separate buffer and parameter in MemoryCost

* polish code

* add comments and todos

* fix test issue
2022-10-17 13:37:38 +08:00
Frank Lee 22a115406b
[autoparallel] fixed broken node handler tests (#1708) 2022-10-14 18:25:59 +08:00
HELSON 1468e4bcfc
[zero] add constant placement policy (#1705)
* fixes memory leak when parameter is in fp16 in ZeroDDP init.
* bans chunk release in CUDA. A chunk may be released only when it is about to be offloaded.
* adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.
2022-10-14 17:53:16 +08:00
Frank Lee 6c331a5a09
[autoparallel] refactored the autoparallel module for organization (#1706)
* [autoparallel] refactored the autoparallel module for organization

* polish code
2022-10-14 13:27:00 +08:00
Frank Lee 91cd34e6e0
[unittest] added doc for the pytest wrapper (#1704) 2022-10-14 10:56:17 +08:00
YuliangLiu0306 451cd72dea
[autoparallel] adapt runtime passes (#1703)
* [autoparallel] adapt runtime passes v2

* polish code
2022-10-14 10:14:07 +08:00
Jiarui Fang 21962e1593
[embedding] rename FreqAwareEmbedding -> CachedEmbedding (#1699) 2022-10-13 22:22:27 +08:00
Frank Lee 0e52f3d3d5
[unittest] supported conditional testing based on env var (#1701)
polish code
2022-10-13 19:38:45 +08:00
Frank Lee 8283e95db3
[autoparallel] collated all deprecated files (#1700)
* [autoparallel] collated all deprecated files

* polish code
2022-10-13 18:24:11 +08:00
YuliangLiu0306 81f7530ee7
[autoparallel] adapt solver and CostGraph with new handler (#1695)
* [autoparallel] adapt solver and CostGraph with new handler

* fix test issue
2022-10-13 14:04:15 +08:00
YuliangLiu0306 42b882ef06
[autoparallel] add output handler and placeholder handler (#1694)
* [autoparallel] add output handler and placeholder handler

* Delete test_solver_with_resnet.py

* fix test bugs
2022-10-13 13:42:36 +08:00
YuliangLiu0306 56088e6d98
[autoparallel] add pooling handler (#1690)
* [autoparallel] add pooling handler

* polish code
2022-10-13 13:42:13 +08:00
YuliangLiu0306 319d654f79
[autoparallel] where_handler_v2 (#1688)
* where generator

* [autoparallel] where_handler_v2
2022-10-13 11:02:22 +08:00
Boyuan Yao 31d2f03d27
[autoparallel] fix C version rotor inconsistency (#1691) 2022-10-12 15:21:58 +08:00
Frank Lee 4973157ad7
[autoparallel] added sharding spec conversion for linear handler (#1687) 2022-10-12 11:16:18 +08:00
YuliangLiu0306 af718e83f2
[autoparallel] add reshape handler v2 and fix some previous bug (#1683) 2022-10-11 18:12:59 +08:00
Super Daniel 3dd6994427
[fx/profiler] assigned UUID to each unrecorded tensor/ improved performance on GPT-2 (#1679)
* [fx/profiler] modify data_ptr into uuid for all tensors.

* [fx] modify uuid.

* [fx/profiler] tune performance on GPT-2.

* [fx] updates.

* [fx] debug.

* [fx] debug.

* [fx] cuda.
2022-10-11 11:03:35 +08:00
YuliangLiu0306 517b63939a
[autoparallel] add unary element wise handler v2 (#1674) 2022-10-09 17:30:42 +08:00
YuliangLiu0306 f6c6a932b8
[autoparallel] add following node generator (#1673)
* [autoparallel] add following node generator

* polish code

* polish code

* update name of arguments
2022-10-09 14:49:18 +08:00