ColossalAI

Author	SHA1	Message	Date
HELSON	384cd26314	[zero] fix testing parameters (#2042 )	2022-11-30 12:09:32 +08:00
HELSON	17a3c685b0	[zero] fix unit-tests (#2039 )	2022-11-30 10:40:31 +08:00
Jiarui Fang	eb7742a4bb	[Gemini] more tests for Gemini (#2038 ) * [Gemini] more tests for Gemini * polish code	2022-11-29 17:13:10 +08:00
HELSON	537e181705	[testing] fix testing models (#2036 ) * [testing] fix testing models * roll back	2022-11-29 13:42:06 +08:00
HELSON	a1ce02d740	[zero] test gradient accumulation (#1964 ) * [zero] fix memory leak for zero2 * [zero] test gradient accumulation * [zero] remove grad clip test	2022-11-29 13:00:30 +08:00
Ziyue Jiang	b0936e4a44	[rpc] split with dag (#2028 ) * add DAG to split_module * add comment * add test case for DAG * remove print * add DAG middleware in scheduler * add test case for scheduler * remove break * recover old lifecycle Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>	2022-11-29 11:36:28 +08:00
Jiarui Fang	96134e7be3	[hotfix] add bert test for gemini fwd bwd (#2035 )	2022-11-29 11:19:52 +08:00
YuliangLiu0306	0dbcd4a6f5	[autoparallel] add split handler (#2032 ) * [autoparallel] add split handler * add numerical test and runtime passes	2022-11-29 11:03:51 +08:00
Jiarui Fang	28aa9a4294	[Gemini] more rigorous unit tests for run_fwd_bwd (#2034 )	2022-11-29 09:26:06 +08:00
YuliangLiu0306	81330b0352	[autoparallel] add experimental permute handler (#2029 )	2022-11-27 20:26:52 +08:00
Zihao	95c4532fff	[Gemini] paramWrapper paramTracerHook unittest (#2030 )	2022-11-26 13:30:24 +08:00
Jiarui Fang	8daf1b4db1	[Gemini] patch for supporting torch.add_ function for ColoTensor (#2003 )	2022-11-25 20:06:35 +08:00
Ziyue Jiang	632753abbc	[fx]Split partition with DAG information (#2025 ) * add DAG to split_module * add comment * add test case for DAG * remove print Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>	2022-11-25 17:42:48 +08:00
YuliangLiu0306	ea0f6b8df9	[autoparallel] add runtime pass and numerical test for view handler (#2018 )	2022-11-25 15:50:16 +08:00
Jiarui Fang	2e9cbfca12	[Gemini] add unittests to check gemini correctness (#2015 )	2022-11-24 16:51:45 +08:00
Jiarui Fang	0b0d8f9e17	[hotfix] revert bug PRs (#2016 )	2022-11-24 15:28:58 +08:00
Zihao	0160a62a3c	[Gemini] param_tracer_wrapper and test case (#2009 )	2022-11-24 14:40:33 +08:00
YuliangLiu0306	1438993113	[autoparallel] add experimental view handler (#2011 ) * [autoparallel] add experimental view handler * polish * polish * polish code * rename variables	2022-11-24 11:34:41 +08:00
Genghan Zhang	d655eea515	[autoparallel] mix gather (#1977 ) * Add mix-gather * Add comments * Add comments * Polish comments * Change the global rank assumption * Add tests * Add two-step tests * Fix 10 and 01 * Skip test because of the number of GPUs	2022-11-23 21:49:17 +08:00
Jiarui Fang	3d907faede	[Gemini] add an inline_op_module to common test models and polish unittests. (#2004 )	2022-11-23 16:55:54 +08:00
Boyuan Yao	6cd784ffee	[autoparallel] Add metainfo support for F.linear (#1987 ) * [fx] metainfo class for auto parallel * [fx] add unit test for linear metainfo * [fx] fix bwd param for linear * [fx] modify unit test * [fx] modify unit test * [fx] modify import * [fx] modify import * [fx] modify import * [fx] move meta profiler to auto parallel * [fx] add conv metainfo class * [fx] restore profiler * [fx] restore meta profiler * [autoparallel] modify unit test * [fx] modify unit test * [autoparallel] add batchnorm metainfo class * [autoparallel] fix batchnorm unit test function declaration * [fx] restore profiler * [fx] add relu metainfo class * [fx] restore profiler * [autoparallel] modify metainfo input * [autoparallel] add pooling metainfo * [autoparallel] add F.linear metainfo generator	2022-11-23 14:12:34 +08:00
YuliangLiu0306	35e6b9ec82	[autoparallel] adapt handlers with attention block (#1990 ) * [autoparallel] adapt handlers with attention block * polish	2022-11-21 10:44:11 +08:00
Jiarui Fang	5bec3b2168	[Gemini] open grad checkpoint when model building (#1984 )	2022-11-18 16:32:54 +08:00
Boyuan Yao	c26f21d365	[autoparallel] add pooling metainfo (#1968 ) * [fx] metainfo class for auto parallel * [fx] add unit test for linear metainfo * [fx] fix bwd param for linear * [fx] modify unit test * [fx] modify unit test * [fx] modify import * [fx] modify import * [fx] modify import * [fx] move meta profiler to auto parallel * [fx] add conv metainfo class * [fx] restore profiler * [fx] restore meta profiler * [autoparallel] modify unit test * [fx] modify unit test * [autoparallel] add batchnorm metainfo class * [autoparallel] fix batchnorm unit test function declaration * [fx] restore profiler * [fx] add relu metainfo class * [fx] restore profiler * [autoparallel] modify metainfo input * [autoparallel] add pooling metainfo	2022-11-18 15:13:03 +08:00
Jiarui Fang	3712ac7f90	[Gemini] add bert for MemtracerWrapper unittests (#1982 )	2022-11-18 14:58:28 +08:00
Jiarui Fang	e481489aa6	[Gemini] MemtracerWrapper unittests (#1981 )	2022-11-18 14:19:40 +08:00
YuliangLiu0306	0da1d00399	[autoparallel] support distributed dataloader option (#1906 ) * [autoparallel] support distributed dataloader option * update output handler to support ddp dataloader * polish code	2022-11-17 20:11:53 +08:00
Genghan Zhang	6630d45546	[autoparallel] Add alpha beta (#1973 ) * Add alpha beta * Fix test * Fix test	2022-11-17 16:01:14 +08:00
ver217	f8a7148dec	[kernel] move all symlinks of kernel to `colossalai._C` (#1971 )	2022-11-17 13:42:33 +08:00
Boyuan Yao	7c7921f71b	[autoparallel] add torch.nn.ReLU metainfo (#1868 ) * [fx] metainfo class for auto parallel * [fx] add unit test for linear metainfo * [fx] fix bwd param for linear * [fx] modify unit test * [fx] modify unit test * [fx] modify import * [fx] modify import * [fx] modify import * [fx] move meta profiler to auto parallel * [fx] add conv metainfo class * [fx] restore profiler * [fx] restore meta profiler * [autoparallel] modify unit test * [fx] modify unit test * [autoparallel] add batchnorm metainfo class * [autoparallel] fix batchnorm unit test function declaration * [fx] restore profiler * [fx] add relu metainfo class * [fx] restore profiler * [autoparallel] modify metainfo input	2022-11-16 23:12:31 +08:00
YuliangLiu0306	fea3cb661c	[autoparallel] support addmm in tracer and solver (#1961 ) * [fx] patch addmm * [autoparallel] support addmm in tracer and solver	2022-11-16 14:59:18 +08:00
Jiarui Fang	f7e276fa71	[Gemini] add GeminiAdamOptimizer (#1960 )	2022-11-16 14:44:28 +08:00
HELSON	7066dfbf82	[zero] fix memory leak for zero2 (#1955 )	2022-11-16 11:43:24 +08:00
Jiarui Fang	52c6ad26e0	[ColoTensor] reconfig ColoInitContext, decouple default_pg and default_dist_spec. (#1953 )	2022-11-15 16:24:16 +08:00
zbian	6877121377	updated flash attention api	2022-11-15 15:25:39 +08:00
Jiarui Fang	9f4fb3f28a	[ColoTensor] ColoInitContext initialize parameters in shard mode. (#1937 )	2022-11-14 16:05:09 +08:00
HELSON	6e51d296f0	[zero] migrate zero1&2 (#1878 ) * add zero1&2 optimizer * rename test directory * rename test files * change tolerance in test	2022-11-11 09:26:40 +08:00
Jiarui Fang	51597f6a28	[hotfix] pass test_complete_workflow (#1877 )	2022-11-10 17:53:39 +08:00
Jiarui Fang	986f8cbaa7	[inference] overlap comm and compute in Linear1D_Row when stream_chunk_num > 1 (#1876 )	2022-11-10 17:36:42 +08:00
YuliangLiu0306	1b494ad73c	[autoparallel] fix linear logical convert issue (#1857 )	2022-11-10 17:19:22 +08:00
Jiarui Fang	c2947dadf1	[inference] streaming Linear 1D Row inference (#1874 )	2022-11-10 17:03:21 +08:00
xcnick	a141681260	[amp] add torch amp test (#1860 )	2022-11-10 16:40:26 +08:00
Frank Lee	e6ec99d389	[utils] fixed lazy init context (#1867 )	2022-11-10 15:17:20 +08:00
Jiarui Fang	3ce4463fe6	[utils] remove lazy_memory_allocate from ColoInitContext (#1844 )	2022-11-09 11:50:33 +08:00
YuliangLiu0306	f6032ddb17	[autoparallel] fix bias addition module (#1800 )	2022-11-08 16:21:25 +08:00
ver217	99870726b1	[CheckpointIO] a uniform checkpoint I/O module (#1689 )	2022-11-08 15:15:13 +08:00
Boyuan Yao	629172b319	[autoparallel] add batch norm metainfo (#1815 ) * [fx] metainfo class for auto parallel * [fx] add unit test for linear metainfo * [fx] fix bwd param for linear * [fx] modify unit test * [fx] modify unit test * [fx] modify import * [fx] modify import * [fx] modify import * [fx] move meta profiler to auto parallel * [fx] add conv metainfo class * [fx] restore profiler * [fx] restore meta profiler * [autoparallel] modify unit test * [fx] modify unit test * [autoparallel] add batchnorm metainfo class * [autoparallel] fix batchnorm unit test function declaration * [fx] restore profiler	2022-11-08 15:05:26 +08:00
Super Daniel	441d584e4a	[fx] add a symbolic_trace api. (#1812 ) * [fx] add a symbolic_trace api. * [fx] fix import errors.	2022-11-08 13:59:20 +08:00
Jiarui Fang	6fa71d65d3	[fx] skip diffusers unittest if it is not installed (#1799 )	2022-11-08 11:45:23 +08:00
oahzxl	9639ea88fc	[kernel] more flexible flashatt interface (#1804 )	2022-11-07 17:02:09 +08:00
Boyuan Yao	327d07c44a	[autoparallel] add conv metainfo class for auto parallel (#1796 ) * [fx] metainfo class for auto parallel * [fx] add unit test for linear metainfo * [fx] fix bwd param for linear * [fx] modify unit test * [fx] modify unit test * [fx] modify import * [fx] modify import * [fx] modify import * [fx] move meta profiler to auto parallel * [fx] add conv metainfo class * [fx] restore profiler * [fx] restore meta profiler * [autoparallel] modify unit test * [fx] modify unit test	2022-11-07 16:15:35 +08:00
oahzxl	501a9e9cd2	[hotfix] polish flash attention (#1802 )	2022-11-07 14:30:22 +08:00
Jiarui Fang	c248800359	[kernel] skip tests of flash_attn and triton when they are not available (#1798 )	2022-11-07 13:41:13 +08:00
YuliangLiu0306	e34e850a4c	[autoparallel] add essential CommActions for broadcast operands (#1793 )	2022-11-04 18:36:42 +08:00
Boyuan Yao	05ce3d369f	[fx] Add linear metainfo class for auto parallel (#1783 ) * [fx] metainfo class for auto parallel * [fx] add unit test for linear metainfo * [fx] fix bwd param for linear * [fx] modify unit test * [fx] modify unit test * [fx] modify import * [fx] modify import * [fx] modify import * [fx] move meta profiler to auto parallel	2022-11-04 10:55:09 +08:00
YuliangLiu0306	2c4c7b3618	[autoparallel] add getattr handler (#1767 ) * [autoparallel] add getattr handler * polish code * add extra processes for Parameters * add unit test for param resharding cost * add docstring and polish test	2022-11-03 12:31:33 +08:00
HELSON	c6a1a62636	[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786 ) * [hotfix] fix zero's incompatibility with checkpoint in torch-1.12 * [zero] add cpu shard init * [zero] add tiny example test * [colo_tensor] fix bugs for torch-1.11	2022-11-02 16:11:34 +08:00
Jiarui Fang	32c1b843a9	skip torchrec unittests if not installed (#1790 )	2022-11-02 14:44:32 +08:00
kurisusnowdeng	0b8161fab8	updated tp layers	2022-11-02 12:19:38 +08:00
YuliangLiu0306	e859380bf7	[fx] support module with bias addition (#1780 ) * [autoparallel] refactor tracer to fix bias addition issue * [fx] support module with bias addition * create bias_addition_module * refactor file structure * polish code * fix unit test	2022-11-01 22:53:51 +08:00
Frank Lee	f3f19a5c47	[autoparallel] added matmul handler (#1763 ) * [autoparallel] added matmul handler * polish code	2022-11-01 15:14:53 +08:00
YuliangLiu0306	27de252334	[autoparallel] fix conv handler numerical test (#1771 )	2022-11-01 10:43:44 +08:00
Super Daniel	1e88811c7a	[autoparallel] move ckpt solvers to autoparallel folder / refactor code (#1764 ) * [autoparallel] first move. * [autoparallel] add solver rotor. * [autoparallel] add ckpt solvers. * [autoparallel] modify codegen. * [fx] fix annotation in test. * [fx] remove check. * [autoparallel] polish docstring. * [fx] refactor MetaTensor.	2022-11-01 10:43:15 +08:00
YuliangLiu0306	a4d1f59c78	[autoparallel] add numerical test for handlers (#1769 )	2022-10-28 10:59:59 +08:00
YuliangLiu0306	b0f7c8bde8	[autoparallel] update CommSpec to CommActions (#1768 ) * [autoparallel] update CommSpec to CommActions * polish code	2022-10-28 09:57:43 +08:00
YuliangLiu0306	b4cc59b61e	[autoparallel] add numerical test for node strategies (#1760 ) * [autoparallel] add numerical test for node strategies * polish code * polish code	2022-10-27 10:42:54 +08:00
oahzxl	25952b67d7	[feat] add flash attention (#1762 )	2022-10-26 16:15:52 +08:00
Super Daniel	0584654c79	[fx] refactor memory utils and extend shard utils. (#1754 ) * [fx] change memory.py to memory_utils.py. * [fx] add shard utils. * [fx] fix import. * [fx] check code style. * [fx] add comment. * [autoparallel] first move. * [fx] add time computations.	2022-10-26 14:24:41 +08:00
YuliangLiu0306	314d8c497f	[autoparallel] refactor the runtime apply pass and add docstring to passes (#1757 ) * [autoparallel] refactor the runtime apply pass and add doc string to passes * fix unit test * polish	2022-10-25 14:32:22 +08:00
Frank Lee	f9a613d660	[autoparallel] added binary elementwise node handler (#1758 ) * [autoparallel] added binary elementwise node handler * polish code	2022-10-25 14:32:01 +08:00
YuliangLiu0306	d2fc067231	[autoparallel] fix param hook issue in transform pass (#1755 )	2022-10-24 13:13:38 +08:00
Frank Lee	262652c8bc	[autoparallel] added addbmm handler (#1751 )	2022-10-21 18:55:48 +08:00
YuliangLiu0306	980ed21723	[autoparallel] shard param and buffer as expected (#1753 ) * [autoparallel] shard param and buffer as expected * fix unit test issue	2022-10-21 15:45:13 +08:00
YuliangLiu0306	cdb7d5e7d2	[hotfix] autoparallel unit test (#1752 )	2022-10-20 19:51:38 +08:00
YuliangLiu0306	a4ce180e85	[autoparallel] add sequential order to communication actions (#1735 )	2022-10-20 18:48:18 +08:00
Super Daniel	b893342f95	[fx] test tracer on diffuser modules. (#1750 ) * [fx] test tracer on diffuser modules. * [fx] shorter seq_len. * Update requirements-test.txt	2022-10-20 18:25:05 +08:00
Frank Lee	b80b6eaa88	[autoparallel] recovered skipped test cases (#1748 )	2022-10-20 16:37:33 +08:00
Frank Lee	474111ecb5	[autoparallel] fixed wrong sharding strategy in conv handler (#1747 ) * [autoparallel] fixed wrong sharding strategy in conv handler * polish code	2022-10-20 16:12:39 +08:00
Frank Lee	8b8937d901	[autoparallel] fixed wrong generated strategy for dot op (#1746 ) * [autoparallel] fixed wrong generated strategy for dot op * polish code	2022-10-20 15:18:16 +08:00
Frank Lee	88a79814fb	[autoparallel] handled illegal strategy in node handler (#1743 ) * [autoparallel] handled illegal strategy in node handler * polish code	2022-10-19 17:08:52 +08:00
Super Daniel	30874f1692	[fx/profiler] debug the fx.profiler / add an example test script for fx.profiler (#1730 ) * [fx/profiler] add test. * [fx] fix file names. * [fx] add docstring and comment. * [fx] polish profiler.py. * [fx] fix import errors. * [fx] fix profiler. * [fx] fix names.	2022-10-19 14:24:51 +08:00
Frank Lee	eee84908d4	[autoparallel] handled illegal sharding strategy (#1728 ) * [autoparallel] handled illegal sharding strategy * polish code	2022-10-19 12:53:06 +08:00
Ziheng Qin	cbe9a4cb45	[NFC] polish tests/test_layers/test_3d/test_3d.py code style (#1740 )	2022-10-19 12:20:51 +08:00
lucasliunju	912eb58ea0	[NFC] polish tests/test_layers/test_3d/checks_3d/common.py code style (#1733 )	2022-10-19 12:20:51 +08:00
Xue Fuzhao	754aa7c81f	[NFC] polish tests/test_layers/test_3d/checks_3d/check_layer_3d.py code style (#1731 )	2022-10-19 12:20:51 +08:00
xyupeng	ff373a11eb	[NFC] polish tests/test_layers/test_sequence/checks_seq/check_layer_seq.py code style (#1723 )	2022-10-19 12:20:51 +08:00
Kai Wang (Victor Kai)	b38efe4e8a	[NFC] polish test_2p5d/checks_2p5d/check_operation_2p5d.py code style (#1718 )	2022-10-19 12:20:51 +08:00
binmakeswell	f6389d0813	[NFC] polish tests/test_layers/test_2d/checks_2d/check_operation_2d.py code style (#1715 )	2022-10-19 12:20:51 +08:00
HELSON	f69f9bf223	[zero] add chunk init function for users (#1729 ) * add chunk manager init function * fix unit tests * add comment * add flush=True	2022-10-18 16:31:22 +08:00
Super Daniel	393f594051	[fx/meta/rpc] move _meta_registration.py to fx folder / register fx functions with compatibility checks / remove color debug (#1710 ) * [fx] move meta registration * [fx] fix tests. * [fx] fix test. * [fx] fix. * [meta] refactor meta registration.py. * [fx] add compatibility descriptions. * [fx] polish import. * [fx] add a decorator. * [fx] fix tests. * [fx] remove print. * [fx] edit raise error. * [fx] edit raise error. * [fx] add type hint. * [fx] fix import in experimental. * [rpc] remove color debug. * [meta] fix naming.	2022-10-18 10:44:23 +08:00
Frank Lee	e8d8eda5e7	[autoparallel] moved tests to test_tensor_shard (#1713 )	2022-10-17 13:54:20 +08:00
YuliangLiu0306	845ff4a47a	[autoparallel] resnet block runtime apply (#1709 ) * [autoparallel] resnet block runtime apply * separate buffer and parameter in MemoryCost * polish code * add comments and todos * fix test issue	2022-10-17 13:37:38 +08:00
Frank Lee	22a115406b	[autoparallel] fixed broken node handler tests (#1708 )	2022-10-14 18:25:59 +08:00
HELSON	1468e4bcfc	[zero] add constant placement policy (#1705 ) * fixes memory leak when parameter is in fp16 in ZeroDDP init. * bans chunk release in CUDA. A chunk may be released only when it is about to be offloaded. * adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.	2022-10-14 17:53:16 +08:00
Frank Lee	6c331a5a09	[autoparallel] refactored the autoparallel module for organization (#1706 ) * [autoparallel] refactored the autoparallel module for organization * polish code	2022-10-14 13:27:00 +08:00
Frank Lee	91cd34e6e0	[unittest] added doc for the pytest wrapper (#1704 )	2022-10-14 10:56:17 +08:00
YuliangLiu0306	451cd72dea	[autoparallel] adapt runtime passes (#1703 ) * [autoparallel] adapt runtime passes v2 * polish code	2022-10-14 10:14:07 +08:00
Jiarui Fang	21962e1593	[embedding] rename FreqAwareEmbedding -> CachedEmbedding (#1699 )	2022-10-13 22:22:27 +08:00
Frank Lee	0e52f3d3d5	[unittest] supported conditional testing based on env var (#1701 ) polish code	2022-10-13 19:38:45 +08:00
Frank Lee	8283e95db3	[autoparallel] collated all deprecated files (#1700 ) * [autoparallel] collated all deprecated files * polish code	2022-10-13 18:24:11 +08:00
YuliangLiu0306	81f7530ee7	[autoparallel] adapt solver and CostGraph with new handler (#1695 ) * [autoparallel] adapt solver and CostGraph with new handler * fix test issue	2022-10-13 14:04:15 +08:00
YuliangLiu0306	42b882ef06	[autoparallel] add output handler and placeholder handler (#1694 ) * [autoparallel] add output handler and placeholder handler * Delete test_solver_with_resnet.py * fix test bugs	2022-10-13 13:42:36 +08:00
YuliangLiu0306	56088e6d98	[autoparallel] add pooling handler (#1690 ) * [autoparallel] add pooling handler * polish code	2022-10-13 13:42:13 +08:00
YuliangLiu0306	319d654f79	[autoparallel] where_handler_v2 (#1688 ) * where generator * [autoparallel] where_handler_v2	2022-10-13 11:02:22 +08:00
Boyuan Yao	31d2f03d27	[autoparallel] fix C version rotor inconsistency (#1691 )	2022-10-12 15:21:58 +08:00
Frank Lee	4973157ad7	[autoparallel] added sharding spec conversion for linear handler (#1687 )	2022-10-12 11:16:18 +08:00
YuliangLiu0306	af718e83f2	[autoparallel] add reshape handler v2 and fix some previous bug (#1683 )	2022-10-11 18:12:59 +08:00
Super Daniel	3dd6994427	[fx/profiler] assigned UUID to each unrecorded tensor/ improved performance on GPT-2 (#1679 ) * [fx/profiler] modify data_ptr into uuid for all tensors. * [fx] modify uuid. * [fx/profiler] tune performance on GPT-2. * [fx] updates. * [fx] debug. * [fx] debug. * [fx] cuda.	2022-10-11 11:03:35 +08:00
YuliangLiu0306	517b63939a	[autoparallel] add unary element wise handler v2 (#1674 )	2022-10-09 17:30:42 +08:00
YuliangLiu0306	f6c6a932b8	[autoparallel] add following node generator (#1673 ) * [autoparallel] add following node generator * polish code * polish code * update name of arguments	2022-10-09 14:49:18 +08:00
YuliangLiu0306	52fda88796	[autoparallel] add layer norm handler v2 (#1671 ) * [autoparallel] add layer norm handler v2 * polish code * polish code	2022-10-09 14:23:22 +08:00
HELSON	b28991dd0a	[feature] A new ZeRO implementation (#1644 )	2022-10-09 09:18:51 +08:00
Boyuan Yao	1df98d5b66	[autoparallel] add rotor C version (#1658 ) * [autoparallel] add rotor c version * [fx] remove metainfoprop in rotor solver * [autoparallel] modify C code format * [autoparallel] remove build.py * [autoparallel] fix C extension build * [autoparallel] add C solver consistency test * [autoparallel] remove some unused imports * [autoparallel] refactor rotor solver code * [autoparallel] replace print with colossalai logger * [autoparallel] ranks fixed	2022-10-03 17:13:30 +08:00
YuliangLiu0306	11ec070e53	[hotfix]unit test (#1670 )	2022-09-29 12:49:28 +08:00
Frank Lee	a60024e77a	[autoparallel] added utils for broadcast operation (#1665 ) * [autoparallel] added utils for broadcast operation * polish code	2022-09-29 11:22:29 +08:00
YuliangLiu0306	3f068d1409	[autoparallel] update CommSpec (#1667 )	2022-09-29 11:20:59 +08:00
YuliangLiu0306	746f8f979d	[autoparallel] add batch norm handler v2 (#1666 )	2022-09-29 11:02:49 +08:00
Kirigaya Kazuto	9708638ded	[pipeline/pytree] add pytree to process args and kwargs \| provide `data_process_func` to process args and kwargs after forward (#1642 ) * [pipeline/tuning] improve dispatch performance both time and space cost * [pipeline/converge] add interface for testing convergence * [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style * Update PipelineBase.py * [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule \| finish Chimera * [pipeline/chimera] test chimera \| fix bug of initializing * [pipeline/pytree] add pytree to process args and kwargs \| provide to process args and kwargs after forward	2022-09-29 10:58:58 +08:00
Frank Lee	3a4d6f63a8	[autoparallel] added node handler for bmm (#1655 )	2022-09-28 11:32:16 +08:00
YuliangLiu0306	095854477f	[autoparallel] add conv handler v2 (#1663 )	2022-09-28 11:24:59 +08:00
YuliangLiu0306	1e7816a460	[autoparallel] adapt solver with gpt (#1653 )	2022-09-28 11:17:26 +08:00
Frank Lee	30e50c8b4a	[autoparallel] implemented all matmul strategy generator (#1650 )	2022-09-27 12:06:25 +08:00
YuliangLiu0306	03978aad45	[autoparallel] change the following nodes strategies generation logic (#1636 ) * [autoparallel] change the following nodes strategies generation logic * fix unit test	2022-09-27 11:20:52 +08:00
YuliangLiu0306	59f100510a	[autoparallel] where handler (#1651 ) * [autoparallel] where handler * fix unit test	2022-09-27 11:20:43 +08:00
Boyuan Yao	5d0fdb9cb4	[fx] fix offload codegen test (#1648 ) * [fx] fix offload codegen test * [fx] modify typing	2022-09-27 10:25:27 +08:00
Frank Lee	45b39a692a	[autoparallel] implemented linear projection strategy generator (#1639 )	2022-09-26 16:58:14 +08:00
Frank Lee	154d3ef432	[fix] fixed the collective pattern name for consistency (#1649 ) * [fix] fixed the collective pattern name for consistency * polish code	2022-09-26 16:39:37 +08:00
YuliangLiu0306	b2b2a4af98	[autoparallel] adapt solver with mlp (#1638 )	2022-09-26 15:26:14 +08:00
Jiarui Fang	c5d39215f6	Revert "[feature] new zero implementation (#1623 )" (#1643 ) This reverts commit `5be118f405`.	2022-09-26 10:06:03 +08:00
HELSON	5be118f405	[feature] new zero implementation (#1623 )	2022-09-24 19:58:18 +08:00
HELSON	95c35f73bd	[moe] initialize MoE groups by ProcessGroup (#1640 )	2022-09-23 17:20:41 +08:00
HELSON	a088022efc	[moe] fix moe bugs (#1633 )	2022-09-23 15:33:57 +08:00
YuliangLiu0306	702dbc5288	[tensor] use communication autograd func (#1617 ) * [tensor] use communication autograd func * change all to all comm spec info * rename pattern and distinguish fwd/bwd * polish code	2022-09-23 13:31:15 +08:00
YuliangLiu0306	0c703189b9	[autoparallel] add layernorm handler (#1629 )	2022-09-23 12:00:25 +08:00
YuliangLiu0306	bf77d3ab65	[autoparallel] recover the merged node strategy index (#1613 )	2022-09-23 11:52:42 +08:00
Boyuan Yao	d6b01feb66	[fx] Modify offload codegen (#1618 ) * [fx] modify offload codegen * [fx] remove repeated hook definitions * [fx] modify offload test	2022-09-23 11:04:52 +08:00
YuliangLiu0306	9eae855408	[hotfix] add recompile after graph manipulation (#1621 )	2022-09-23 11:00:33 +08:00
Super Daniel	d967779a32	[fx/profiler] tuned the calculation of memory estimation (#1619 ) * [fx] tuned the meta info and rotor solver. * [fx] remove import. * [fx] remove import. * [fx] remove import. * [fx] tune the meta calculations. * [fx] polish comments. * [fx] remove assertions. * [fx] modify test cases. * [fx] modify test cases. * [fx] optimize import. * [fx	2022-09-23 10:59:47 +08:00
HELSON	f7f2248771	[moe] fix MoE bugs (#1628 ) * remove forced FP32 modules * correct no_shard-contexts' positions	2022-09-22 13:56:30 +08:00
Jiarui Fang	38c68b5b9a	[embedding] rollback for better FAW performance (#1625 )	2022-09-22 11:16:25 +08:00
Frank Lee	d925122020	[autoparallel] added new linear module handler (#1616 )	2022-09-21 12:23:21 +08:00
Kirigaya Kazuto	170fa81095	[pipeline/chimera] test chimera \| fix bug of initializing (#1615 ) * [pipeline/tuning] improve dispatch performance both time and space cost * [pipeline/converge] add interface for testing convergence * [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style * Update PipelineBase.py * [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule \| finish Chimera * [pipeline/chimera] test chimera \| fix bug of initializing	2022-09-20 18:00:39 +08:00
Jiarui Fang	504ff1d101	[embeddings] use cache_ratio instead of cuda_row_num (#1611 )	2022-09-20 14:33:04 +08:00
YuliangLiu0306	7d1bb71d5d	[fx] PoC of runtime shape consistency application (#1607 ) * [fx] PoC of runtime shape consistency application * polish code	2022-09-20 14:00:04 +08:00
YuliangLiu0306	47b11c432c	[autoparallel]add bcast matmul strategies (#1605 )	2022-09-20 11:26:21 +08:00
Boyuan Yao	933b6c6367	[fx] Add pofo solver (#1608 ) * [fx] add pofo algorithm * [fx] Add pofo solver * [fx] code refactor * [fx] fix test_linearize import	2022-09-20 11:20:48 +08:00
Kirigaya Kazuto	edc9e419ad	[pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule \| finish Chimera (#1595 ) * [pipeline/tuning] improve dispatch performance both time and space cost * [pipeline/converge] add interface for testing convergence * [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style * Update PipelineBase.py * [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule \| finish Chimera	2022-09-19 11:44:18 +08:00
YuliangLiu0306	eac1b79371	[autoparallel] add bcast op handler (#1600 ) * [autoparallel] add bcast op handler * polish code * add more BCAST FUNC OP * polish code * add exception handler * polish	2022-09-16 11:33:01 +08:00
Boyuan Yao	a7cda6f57d	[fx] Add offload codegen (#1598 ) * [fx] add input activation offload to codegen * [fx] modify unit test * [fx] remove two skips in torch11 * [fx] use all_input_nodes instead of _input_nodes	2022-09-14 15:49:06 +08:00
Super Daniel	c8e9b2ad78	[hotfix/rotor] fix variable names (#1597 ) * [fx] add some comment and docstrings. * [fx] add dataflow analysis for an autograd graph. * add interpretation for graph analysis. * [fx] before doing save_tensor_hooks. * [fx] provide an accurate estimation of memory except for GPT-2. * [fx] provide an accurate estimation of memory except for GPT-2. * [fx] provide an accurate estimation of memory except for GPT-2. * [fx] a very accurate version on GPT-2. * [fx] refactor code. * [fx] remove redundant inplace=True. * [fx] refactor code. * [fx] refactor code. * [fx] refactor code. * [fx] dive into backward memory. * [fx] fix variable names in ckpt_solvers and unskip tests. * [fx] commit my changes. * [fx] restore skips. * [fx] restore skips. * [fx] change stage into phase. * [fx] change stage into phase. * [fx] change stage into phase.	2022-09-14 14:27:04 +08:00
YuliangLiu0306	faa23b9d9a	[autoparallel] add reshape handler (#1594 ) * [autoparallel] add reshape handler * polish code	2022-09-14 10:25:45 +08:00
Frank Lee	27fe8af60c	[autoparallel] refactored shape consistency to remove redundancy (#1591 ) * [autoparallel] refactored shape consistency to remove redundancy * polish code * polish code * polish code	2022-09-13 18:30:18 +08:00
YuliangLiu0306	d164449d00	[autoparallel] add resnet autoparallel unit test and add backward weight communication cost (#1589 )	2022-09-13 18:05:05 +08:00
Frank Lee	219f66c571	[autoparallel] added solver option dataclass (#1588 )	2022-09-13 14:47:09 +08:00
YuliangLiu0306	82d4376c23	[autoparallel] adapt solver with resnet (#1583 ) * [autoparallel]adapt solver with resnet * polish code * polish code	2022-09-13 12:07:09 +08:00
CsRic	f3403ff98e	[embeddings] add already_split_along_rank flag for tablewise mode (#1584 )	2022-09-13 10:50:34 +08:00
Boyuan Yao	f3687e4ee2	[fx] Add nested checkpoint in activation checkpoint codegen (#1585 ) * [fx] add nested activation_checkpoint codegen * undo algorithms commits * solver * undo some commits * [fx] torch11 add nested activation checkpoint codegen * remove some imports * [fx] add some comments in activation codegen * [fx] codegen instance error fix	2022-09-12 20:00:48 +08:00
アマデウス	e615cfc3a8	[NFC] polish test component gpt code style (#1567 )	2022-09-08 16:34:09 +08:00
Kirigaya Kazuto	6159d45417	[pipeline/tuning] improve dispatch performance both time and space cost (#1544 )	2022-09-07 19:01:06 +08:00
Super Daniel	4f59693207	[fx] provide a stable but not accurate enough version of profiler. (#1547 ) * [fx] compute memory stat and flop count for MetaInfoProp. * [fx] modify node attribute. * [fx] modify ckpt_chen. * [fx] fix compatibility. * [fx] fix import error. * [fx] skip test for MetaInfoProp. * [fx] skip test for MetaInfoProp. * [fx] skip test for MetaInfoProp. * [fx] skip test for MetaInfoProp. * [fx] skip if torch 1.11.0. * [fx] recover MetaInfoProp support for PyTorch 1.11. * [fx] provide a stable but not accurate enough version of profiler. * [fx] provide a stable but not accurate enough version of profiler. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix import error.	2022-09-07 11:21:04 +08:00
YuliangLiu0306	0908d0fc61	[autoparallel]add backward cost info into strategies (#1524 )	2022-09-07 11:19:00 +08:00
YuliangLiu0306	44c866a3e3	[autoparallel] change the merge node logic (#1533 )	2022-09-07 11:18:19 +08:00
Jiarui Fang	64169f3e8f	[embedding] polish parallel embedding tablewise (#1545 )	2022-09-06 10:41:20 +08:00
CsRic	964123ae0f	[embedding] freq_aware_embedding: add small functions for caller application (#1537 )	2022-09-05 15:12:53 +08:00
Boyuan Yao	56159049e8	[fx] Modify solver linearize and add corresponding test (#1531 ) * [fx] modify solver linearize and add test * [fx] add torch11 test of linearize but skip it * [fx] remove some unused imports	2022-09-02 10:24:41 +08:00
Super Daniel	7dc53237c3	[fx] add test for meta tensor. (#1527 ) * [fx] add test for meta tensor. * [fx] add test for meta tensor. * [fx] add test for meta tensor. * [fx] add test for meta tensor. * [fx] fix error.	2022-09-01 19:30:05 +08:00
YuliangLiu0306	4b3d6caeb3	[fx]patch nn.functional convolution (#1528 )	2022-09-01 19:05:07 +08:00
CsRic	5156d5b4f8	[embedding] add tablewise sharding for FAW (#1526 )	2022-09-01 17:55:41 +08:00
Kirigaya Kazuto	f1e1836218	[pipeline/pipleline_process_group] finish PipelineProcessGroup to manage local abd global rank in TP,DP and PP (#1508 ) * support p2p communication with any type of object | pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [pipeline/rpc] implement a demo for PP with cuda rpc framework * [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B * [pipeline/rpc] implement distributed optimizer | test with assert_close * [pipeline/rpc] implement distributed optimizer | test with assert_close * [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy * [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy * [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy * [pipeline/pipleline_process_group] finish PipelineProcessGroup to manage local abd global rank in TP,DP and PP * [pipeline/pipleline_process_group] remove comment * [pipeline/pipleline_process_group] remove comment * [pipeline/pipleline_process_group] skip process group test * [pipeline/pipleline_process_group] remove test named function	2022-09-01 17:45:47 +08:00
Boyuan Yao	b231430bcb	[fx] Fix wrong index in annotation and minimal flops in ckpt solver (#1521 ) * [fx] fix wrong variable name in solver rotor * [fx] fix wrong variable name in solver rotor * [fx] fix the discretize bug * [fx] fix the first op in activation checkpoint codegen * [fx] fix some bugs of ckpt solver * [fx] modify test_ckpt_torchvision * [fx] set sequence to __sequence__ attr of GraphModule * [fx] docstring modification * [fx] remove performance test	2022-08-31 18:10:48 +08:00
YuliangLiu0306	3345c6d352	[autoparellel]add strategies constructor (#1505 ) * [autoparellel]add strategies constructor * remove duplicated strategies * polish code * adapt cost graph with StrategiesConstructor * polish	2022-08-30 16:32:09 +08:00
Frank Lee	a0436a62ee	[autoparallel] added liveness analysis (#1516 ) * [autoparallel] added liveness analysis * remove memory cost	2022-08-30 15:54:37 +08:00
Jiarui Fang	9a9ef65313	[FAW] cpu caching operations (#1520 )	2022-08-30 14:50:02 +08:00
Jiarui Fang	af5438caa2	[FAW] refactor reorder() for CachedParamMgr (#1514 )	2022-08-29 14:22:07 +08:00
CsRic	1b8fee8e9c	[FAW] shrink freq_cnter size (#1509 )	2022-08-29 11:44:55 +08:00
Boyuan Yao	4acc58ee20	[fx] Fix activation codegen dealing with checkpointing first op (#1510 )	2022-08-27 19:39:21 +08:00
Kirigaya Kazuto	5a6fd71f90	[pipeline/rpc] update outstanding mechanism | optimize dispatching strategy (#1497 ) * support p2p communication with any type of object | pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [pipeline/rpc] implement a demo for PP with cuda rpc framework * [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B * [pipeline/rpc] implement distributed optimizer | test with assert_close * [pipeline/rpc] implement distributed optimizer | test with assert_close * [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy * [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy * [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy	2022-08-26 14:04:23 +08:00
CsRic	0ed2f46131	[FAW] FAW embedding use LRU as eviction strategy intialized with dataset stats (#1494 )	2022-08-26 11:24:12 +08:00
YuliangLiu0306	8b7d6bd5be	[autoparallel] add more sharding strategies to conv (#1487 )	2022-08-26 11:17:56 +08:00
Boyuan Yao	de1e716dc4	[fx] Add activation checkpoint solver rotor (#1496 ) * [fx] fix defining ckpt functions inside forward * [fx] Modify activation checkpoint codegen and add ColoGraphModule * [fx] some modification * some modifications * some modifications * some modifications * some modifications * some code modifications * [automatic_parallel] ckpt solver rotor * [fx] add ckpt_solver_rotor * [fx] modification * code refactor * code refactor	2022-08-26 10:34:21 +08:00
YuliangLiu0306	413c053453	[autoparallel] add cost graph class (#1481 ) * [autoparallel] add cost graph class * polish code	2022-08-25 17:19:59 +08:00
YuliangLiu0306	4b03c25f85	[tensor]add 1D device mesh (#1492 )	2022-08-25 16:48:12 +08:00
CsRic	b8d0e39eaf	[FAW] LFU cache for the FAW	2022-08-25 13:08:46 +08:00
Kirigaya Kazuto	9145aef2b4	[pipeline/rpc] implement distributed optimizer | test with assert_close (#1486 ) * support p2p communication with any type of object | pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [pipeline/rpc] implement a demo for PP with cuda rpc framework * [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B * [pipeline/rpc] implement distributed optimizer | test with assert_close * [pipeline/rpc] implement distributed optimizer | test with assert_close	2022-08-25 10:49:01 +08:00
Frank Lee	3da68d6b1b	[fx] fixed adapative pooling size concatenation error (#1489 )	2022-08-25 09:05:07 +08:00
Jiarui Fang	cde7b8a5b8	[FAW] init an LFU implementation for FAW (#1488 )	2022-08-24 17:37:22 +08:00
Super Daniel	32efe8e740	[fx] add profiler for fx nodes. (#1480 ) * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] merge development into main (#1) * [fx] activation checkpointing using Chen strategies. * [fx] add test for ckpt_solver_chen * [fx] add vanilla activation checkpoint search with test on resnet and densenet * [fx] add a namespace code for solver_chen. * [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174. * [fx] fix lowercase naming conventions. * [fx] simplify test for ckpt. * [fx] add rules to linearize computation graphs for searching. (#2) * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] merge development into main (#1) * [fx] activation checkpointing using Chen strategies. * [fx] add test for ckpt_solver_chen * [fx] add vanilla activation checkpoint search with test on resnet and densenet * [fx] add a namespace code for solver_chen. * [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174. * [fx] fix lowercase naming conventions. * [fx] simplify test for ckpt. * [fx] fix test and algorithm bugs in activation checkpointing. * [fx] polish ckpt_test. * [fx] add rules to linearize computation graphs for searching. * [fx] remove chen_sqrt for sake of simplicity * [fx] remove chen_sqrt for sake of simplicity * [fx] remove chen_sqrt for sake of simplicity * [fx] remove chen_sqrt for sake of simplicity * [fx] fix inconsistencies. * [fx] fix MetaInfoProp. * [fx] fix MetaInfoProp. * [fx] consider MetaInfoProp for inplace operands. * [fx] consider MetaInfoProp for inplace operands. * [fx] consider MetaInfoProp for inplace operands. * [fx] consider MetaInfoProp for inplace operands. * [fx] consider MetaInfoProp for inplace operands. * [fx] add profiler for fx nodes. * [fx] add profiler for fx nodes. * [fx] add profiler for fx nodes. * [fx] add profiler for fx nodes. * [fx] add profiler for fx nodes. * [fx] add profiler for fx nodes. * [fx] add profiler for fx nodes. * [fx] fix error in tests. * [fx] unfix bug. * [fx] unfix bug.	2022-08-24 16:22:44 +08:00
Kirigaya Kazuto	a6c8749198	[pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B (#1483 ) * support p2p communication with any type of object | pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [pipeline/rpc] implement a demo for PP with cuda rpc framework * [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B	2022-08-24 11:19:46 +08:00
Geng Zhang	0aad53c62b	[FCE] update interface for frequency statistics in FreqCacheEmbedding (#1462 )	2022-08-23 17:38:24 +08:00
Frank Lee	ede326298b	[autoparallel] integrate auto parallel with torch fx (#1479 )	2022-08-23 14:23:08 +08:00
Boyuan Yao	1f2e547f7a	[fx] Fix ckpt functions' definitions in forward (#1476 ) * [fx] fix defining ckpt functions inside forward * [fx] Modify activation checkpoint codegen and add ColoGraphModule * [fx] some modification * some modifications * some modifications * some modifications * some modifications * some code modifications	2022-08-22 16:59:54 +08:00
Kirigaya Kazuto	bb5f5289e0	[pipeline/rpc] implement a demo for PP with cuda rpc framework (#1470 ) * support p2p communication with any type of object | pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [pipeline/rpc] implement a demo for PP with cuda rpc framework * Delete p2p_v2.py * Delete _pipeline_schedule_v2.py * Delete test_object_list_p2p_v2.py * Delete test_boardcast_send_recv_v2.py * Delete test_cifar_with_data_pipeline_tensor_v2.py	2022-08-22 10:50:51 +08:00
Frank Lee	628c7e3fc8	[autoparallel] added dot handler (#1475 )	2022-08-22 10:32:17 +08:00
YuliangLiu0306	26a37b5cd5	[autoparallel] Add conv handler to generate strategies and costs info for conv (#1467 )	2022-08-19 14:57:23 +08:00
YuliangLiu0306	b73fb7a077	[tensor] support runtime ShardingSpec apply (#1453 ) * [tensor] support runtime ShardingSpec apply * polish code * polish code	2022-08-19 13:39:51 +08:00
Super Daniel	e7383f578b	[fx] add rules to linearize computation graphs for searching. (#1461 ) * [fx] polish ckpt_test. * [fx] add rules to linearize computation graphs for searching. * [fx] remove chen_sqrt for sake of simplicity * [fx] fix inconsistencies.	2022-08-17 14:47:12 +08:00
Boyuan Yao	092b9c8f49	[fx] Add use_reentrant=False to checkpoint in codegen (#1463 ) * [utils] Add use_reetrant=False into colossalai checkpoint * [utils] add some annotation in utils.activaion_checkpoint * [test] add reset_seed at the beginning of tests in test_actiavion_checkpointing.py * [test] modify test_activation_checkpoint.py * [test] modify test for reentrant=False * [fx] Add use_reentrant=False of checkpoint into codegen	2022-08-17 10:34:50 +08:00
Boyuan Yao	47fd8e4a02	[utils] Add use_reetrant=False in utils.activation_checkpoint (#1460 ) * [utils] Add use_reetrant=False into colossalai checkpoint * [utils] add some annotation in utils.activaion_checkpoint * [test] add reset_seed at the beginning of tests in test_actiavion_checkpointing.py * [test] modify test_activation_checkpoint.py * [test] modify test for reentrant=False	2022-08-16 15:39:20 +08:00
Jiarui Fang	36824a304c	[Doc] add more doc for ColoTensor. (#1458 )	2022-08-16 10:38:41 +08:00
Super Daniel	0dbd61c29b	[fx] fix test and algorithm bugs in activation checkpointing. (#1451 ) * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] merge development into main (#1) * [fx] activation checkpointing using Chen strategies. * [fx] add test for ckpt_solver_chen * [fx] add vanilla activation checkpoint search with test on resnet and densenet * [fx] add a namespace code for solver_chen. * [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174. * [fx] fix lowercase naming conventions. * [fx] simplify test for ckpt. * [fx] fix test and algorithm bugs in activation checkpointing. * mend [fx] fix test and algorithm bugs in activation checkpointing. * mend [fx] fix test and algorithm bugs in activation checkpointing. * mend [fx] fix test and algorithm bugs in activation checkpointing. * mend [fx] fix test and algorithm bugs in activation checkpointing. * [fx] polish ckpt_test. * [fx] polish ckpt_test. * [fx] polish ckpt_test.	2022-08-15 19:09:19 +08:00
Geng Zhang	9f3eed66eb	[FAW] reorganize the inheritance struct of FreqCacheEmbedding (#1448 )	2022-08-12 15:55:46 +08:00
Frank Lee	5a52e21fe3	[test] fixed the activation codegen test (#1447 ) * [test] fixed the activation codegen test * polish code	2022-08-12 14:52:31 +08:00
YuliangLiu0306	0f3042363c	[tensor] shape consistency generate transform path and communication cost (#1435 ) * [tensor] shape consistency output transform path and communication cost * polish code	2022-08-12 14:02:32 +08:00
Boyuan Yao	5774fe0270	[fx] Use colossalai checkpoint and add offload recognition in codegen (#1439 ) * [fx] Use colossalai.utils.checkpoint to replace torch.utils.checkpoint for offload activation and add offload annotation recognition in codegen * [fx] Use colossalai.utils.checkpoint to replace torch.utils.checkpoint for offload activation and add offload annotation recognition in codegen * Modification of test and add TODO in codegen * [fx] Modification of colossal ckpt usage * [fx] add gpc.destroy() to test_codegen	2022-08-12 12:23:30 +08:00
Kirigaya Kazuto	e9460b45c8	[engin/schedule] use p2p_v2 to recontruct pipeline_schedule (#1408 ) * support p2p communication with any type of object | pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test * [communication] add p2p_v2.py to support communication with List[Any] * Delete _pipeline_schedule_v2.py * Delete test_cifar_with_data_pipeline_tensor_v2.py * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * Delete p2p_v2.py * Delete test_boardcast_send_recv_v2.py * Delete test_object_list_p2p_v2.py * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [communication] remove print code * [communication] remove print code * [engin/schedule] shorten the running time of testing file to prevent cancelling in CI	2022-08-12 11:33:26 +08:00
Frank Lee	ae1b58cd16	[tensor] added linear implementation for the new sharding spec (#1416 ) * [tensor] added linear implementation for the new sharding spec * polish code	2022-08-12 11:33:09 +08:00
Super Daniel	d40a9392ba	[fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174 . (#1446 ) * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] activation checkpointing using Chen strategies. * [fx] add test for ckpt_solver_chen * mend * [fx] add vanilla activation checkpoint search with test on resnet and densenet * [fx] add vanilla activation checkpoint search with test on resnet and densenet * [fx] add a namespace code for solver_chen. * [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174. * [fx] fix lowercase naming conventions.	2022-08-12 11:28:50 +08:00
ver217	821c6172e2	[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442 )	2022-08-11 22:58:58 +08:00
HELSON	b80340168e	[zero] add chunk_managerV2 for all-gather chunk (#1441 )	2022-08-11 19:17:24 +08:00
Super Daniel	3b26516c69	[fx] add vanilla activation checkpoint search with test on resnet and densenet (#1433 ) * [fx] activation checkpointing using Chen strategies. * [fx] add test for ckpt_solver_chen * [fx] add vanilla activation checkpoint search with test on resnet and densenet * [fx] add vanilla activation checkpoint search with test on resnet and densenet * [fx] add a namespace code for solver_chen.	2022-08-11 15:46:39 +08:00
Jiarui Fang	30b4dd17c0	[FAW] export FAW in _ops (#1438 )	2022-08-11 13:43:24 +08:00
HELSON	9056677b13	[zero] add chunk size searching algorithm for parameters in different groups (#1436 )	2022-08-11 13:32:19 +08:00
HELSON	039b7ed3bc	[polish] add update directory in gemini; rename AgChunk to ChunkV2 (#1432 )	2022-08-10 16:40:29 +08:00
Super Daniel	f20cb4e893	[fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages (#1425 ) * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages	2022-08-10 16:36:35 +08:00
Jiarui Fang	89c434a0a6	[polish] add test_ops directory (#1431 )	2022-08-10 15:35:26 +08:00
Jiarui Fang	10b3df65c8	[FAW] move coloparam setting in test code. (#1429 )	2022-08-10 14:31:53 +08:00
Jiarui Fang	cb98cf5558	[FAW] parallel FreqAwareEmbedding (#1424 )	2022-08-10 13:44:30 +08:00
HELSON	0d212183c4	[zero] add has_inf_or_nan in AgChunk; enhance the unit test of AgChunk (#1426 )	2022-08-10 11:37:28 +08:00
YuliangLiu0306	33f0744d51	[tensor] add shape consistency feature to support auto spec transform (#1418 ) * [tensor] add shape consistency feature to supportauto sharding spec transform. * [tensor] remove unused argument in simulator, add doc string for target pair.	2022-08-10 11:29:17 +08:00
HELSON	4fb3c52cf0	[zero] add unit test for AgChunk's append, close, access (#1423 )	2022-08-09 18:03:10 +08:00
Jiarui Fang	d209aff684	Add FreqAwareEmbeddingBag (#1421 )	2022-08-09 16:26:12 +08:00
Jiarui Fang	504419d261	[FAW] add cache manager for the cached embedding (#1419 )	2022-08-09 15:17:17 +08:00
Kirigaya Kazuto	44fd3c83ab	[communication] add p2p_v2.py to support communication with List[Any] (#1407 ) * support p2p communication with any type of object | pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test * [communication] add p2p_v2.py to support communication with List[Any] * Delete _pipeline_schedule_v2.py * Delete test_cifar_with_data_pipeline_tensor_v2.py * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [communication] remove print code * [communication] remove print code	2022-08-09 11:40:04 +08:00
YuliangLiu0306	7c96055c68	[tensor]build sharding spec to replace distspec in future. (#1405 )	2022-08-08 11:15:57 +08:00
ver217	12b4887097	[hotfix] fix CPUAdam kernel nullptr (#1410 )	2022-08-05 19:45:45 +08:00
YuliangLiu0306	0442f940f0	[device] add DeviceMesh class to support logical device layout (#1394 ) * [device] add DeviceMesh class to support logical device layout * polish code * add doc string	2022-08-02 19:23:48 +08:00
HELSON	4e98e938ce	[zero] alleviate memory usage in ZeRODDP state_dict (#1398 )	2022-08-02 15:49:13 +08:00
Frank Lee	adf5054ff8	[fx] fixed torchaudio conformer tracing (#1392 )	2022-08-01 16:08:28 +08:00
Frank Lee	7d6293927f	[fx] patched torch.max and data movement operator (#1391 ) * [fx] patched torch.max and data movement operator * polish code	2022-08-01 15:31:50 +08:00
HELSON	527758b2ae	[hotfix] fix a running error in test_colo_checkpoint.py (#1387 )	2022-07-29 15:58:06 +08:00
ver217	8dced41ad0	[zero] zero optim state_dict takes only_rank_0 (#1384 ) * zero optim state_dict takes only_rank_0 * fix unit test	2022-07-29 13:22:50 +08:00
ver217	7d5d628e07	[DDP] test ddp state dict uses more strict threshold (#1382 )	2022-07-28 17:29:04 +08:00
ver217	828b9e5e0d	[hotfix] fix zero optim save/load state dict (#1381 )	2022-07-28 17:19:39 +08:00
Super Daniel	be229217ce	[fx] add torchaudio test (#1369 ) * [fx]add torchaudio test * [fx]add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test and test patches * Delete ~ * [fx] add patches and patches test * [fx] add patches and patches test * [fx] fix patches * [fx] fix rnn patches * [fx] fix rnn patches * [fx] fix rnn patches * [fx] fix rnn patches * [fx] merge upstream * [fx] fix import errors	2022-07-27 11:03:14 +08:00
Boyuan Yao	bb640ec728	[fx] Add colotracer compatibility test on torchrec (#1370 )	2022-07-26 17:54:39 +08:00
ver217	c415240db6	[nvme] CPUAdam and HybridAdam support NVMe offload (#1360 ) * impl nvme optimizer * update cpu adam * add unit test * update hybrid adam * update docstr * add TODOs * update CI * fix CI * fix CI * fix CI path * fix CI path * fix CI path * fix install tensornvme * fix CI * fix CI path * fix CI env variables * test CI * test CI * fix CI * fix nvme optim __del__ * fix adam __del__ * fix nvme optim * fix CI env variables * fix nvme optim import * test CI * test CI * fix CI	2022-07-26 17:25:24 +08:00
HELSON	87775a0682	[colotensor] use cpu memory to store state_dict (#1367 )	2022-07-26 14:13:38 +08:00
Frank Lee	cd063ac37f	[fx] added activation checkpoint codegen support for torch < 1.12 (#1359 )	2022-07-25 23:35:31 +08:00
HELSON	4417804129	[unit test] add megatron init test in zero_optim (#1358 )	2022-07-25 11:18:08 +08:00
HELSON	7a065dc9f6	[hotfix] fix megatron_init in test_gpt2.py (#1357 )	2022-07-25 10:28:19 +08:00
Frank Lee	644582eee9	[fx] added activation checkpoint codegen (#1355 )	2022-07-25 09:39:10 +08:00
Frank Lee	05fae1fd56	[fx] added activation checkpointing annotation (#1349 ) * [fx] added activation checkpointing annotation * polish code * polish code	2022-07-21 11:14:28 +08:00
HELSON	7a8702c06d	[colotensor] add Tensor.view op and its unit test (#1343 ) [colotensor] add megatron initialization for gpt2	2022-07-21 10:53:15 +08:00
YuliangLiu0306	942c8cd1fb	[fx] refactor tracer to trace complete graph (#1342 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [fx] refactor tracer to trace complete graph * add comments and solve conflicts.	2022-07-20 11:20:38 +08:00
Frank Lee	2cc1175c76	[fx] tested the complete workflow for auto-parallel (#1336 ) * [fx] tested the complete workflow for auto-parallel * polish code * polish code * polish code	2022-07-20 10:45:17 +08:00
YuliangLiu0306	4631fef8a0	[fx]refactor tracer (#1335 )	2022-07-19 15:50:42 +08:00
HELSON	bf5066fba7	[refactor] refactor ColoTensor's unit tests (#1340 )	2022-07-19 15:46:24 +08:00
HELSON	f92c100ddd	[checkpoint] use gather_tensor in checkpoint and update its unit test (#1339 )	2022-07-19 14:15:28 +08:00
Frank Lee	f3ce7b8336	[fx] recovered skipped pipeline tests (#1338 )	2022-07-19 09:49:50 +08:00
ver217	0c51ff2c13	[hotfix] ZeroDDP use new process group (#1333 ) * process group supports getting ranks in group * chunk mgr receives a process group * update unit test * fix unit tests	2022-07-18 14:14:52 +08:00

770 Commits (b29e1f07224298aea35aab7ee83284beac28e0d8)