Commit Graph

733 Commits (7dc53237c3956d7696564f3e61f04d20ff0aef9a)

Author SHA1 Message Date
YuliangLiu0306 4b3d6caeb3
[fx]patch nn.functional convolution (#1528) 2022-09-01 19:05:07 +08:00
CsRic 5156d5b4f8
[embedding] add tablewise sharding for FAW (#1526) 2022-09-01 17:55:41 +08:00
Kirigaya Kazuto f1e1836218
[pipeline/pipleline_process_group] finish PipelineProcessGroup to manage local abd global rank in TP,DP and PP (#1508)
* support p2p communication with any type of object | pass test

* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test

* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule

* [pipeline/rpc] implement a demo for PP with cuda rpc framework

* [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B

* [pipeline/rpc] implement distributed optimizer | test with assert_close

* [pipeline/rpc] implement distributed optimizer | test with assert_close

* [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy

* [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy

* [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy

* [pipeline/pipleline_process_group] finish PipelineProcessGroup to manage local abd global rank in TP,DP and PP

* [pipeline/pipleline_process_group] remove comment

* [pipeline/pipleline_process_group] remove comment

* [pipeline/pipleline_process_group] skip process group test

* [pipeline/pipleline_process_group] remove test named function
2022-09-01 17:45:47 +08:00
Super Daniel 112a1f0a8f
[hotfix] avoid conflict of meta registry with torch 1.13.0. (#1530)
* [hotfix] avoid conflict of meta registry with torch 1.13.0.

* [hotfix] avoid conflict of meta registry with torch 1.13.0.
2022-09-01 15:31:21 +08:00
Boyuan Yao b231430bcb
[fx] Fix wrong index in annotation and minimal flops in ckpt solver (#1521)
* [fx] fix wrong variable name in solver rotor

* [fx] fix wrong variable name in solver rotor

* [fx] fix the discretize bug

* [fx] fix the first op in activation checkpoint codegen

* [fx] fix some bugs of ckpt solver

* [fx] modify test_ckpt_torchvision

* [fx] set sequence to __sequence__ attr of GraphModule

* [fx] docstring modification

* [fx] remove performance test
2022-08-31 18:10:48 +08:00
Super Daniel 5cc849f6ce
[fx] hack __torch_dispatch__ for meta tensor and autograd. (#1515)
* [fx] hack __torch_dispatch__ for meta tensor and autograd.

* [fx] hack __torch_dispatch__ for meta tensor and autograd.

* [fx] hack __torch_dispatch__ for meta tensor and autograd.

* [fx] hack __torch_dispatch__ for meta tensor and autograd.

* [fx] hack __torch_dispatch__ for meta tensor and autograd.

* [fx] add bad case detections.

* [fx] add bad case detections.

* [fx] rename MetaTensor attributes.

* [fx] fix unexpected error.

* [fx] fix unexpected error.

* [fx] fix unexpected error.

* [fx] fix unexpected error.

* [fx] fix unexpected error.

* [fx] add register backward for native_batch_norm_backward.

* [fx] add more meta backend support for nn.Modules.

* [fx] add meta backend to support timm and torchvision models.

* [fx] add meta hardswish for timm models.
2022-08-31 16:30:16 +08:00
Jiarui Fang 4537d39df9
[doc] docstring for FreqAwareEmbeddingBag (#1525) 2022-08-31 13:52:30 +08:00
YuliangLiu0306 3345c6d352
[autoparellel]add strategies constructor (#1505)
* [autoparellel]add strategies constructor

* remove duplicated strategies

* polish code

* adapt cost graph with StrategiesConstructor

* polish
2022-08-30 16:32:09 +08:00
Frank Lee a0436a62ee
[autoparallel] added liveness analysis (#1516)
* [autoparallel] added liveness analysis

* remove memory cost
2022-08-30 15:54:37 +08:00
Jiarui Fang 9a9ef65313
[FAW] cpu caching operations (#1520) 2022-08-30 14:50:02 +08:00
Super Daniel ea1a95b8b9
[hotfix] fix coloproxy typos. (#1519) 2022-08-30 11:39:03 +08:00
Jiarui Fang af5438caa2
[FAW] refactor reorder() for CachedParamMgr (#1514) 2022-08-29 14:22:07 +08:00
Jiarui Fang 9feee6d06b
[FAW] LFU initialize with dataset freq (#1513) 2022-08-29 12:52:53 +08:00
CsRic 1b8fee8e9c
[FAW] shrink freq_cnter size (#1509) 2022-08-29 11:44:55 +08:00
Boyuan Yao 4acc58ee20
[fx] Fix activation codegen dealing with checkpointing first op (#1510) 2022-08-27 19:39:21 +08:00
Boyuan Yao ac3a453a50
[fx] fix the discretize bug (#1506)
* [fx] fix wrong variable name in solver rotor

* [fx] fix wrong variable name in solver rotor

* code modification

* [fx] fix the discretize bug
2022-08-26 17:15:52 +08:00
Boyuan Yao 31fffd3fc5
[fx] fix wrong variable name in solver rotor (#1502)
* [fx] fix wrong variable name in solver rotor

* [fx] fix wrong variable name in solver rotor

* code modification
2022-08-26 15:47:08 +08:00
Jiarui Fang ba61109b6c
[FAW] remove code related to chunk (#1501) 2022-08-26 14:23:30 +08:00
Jiarui Fang d5085bb317
[FAW] add more docs and fix a warning (#1500) 2022-08-26 14:10:21 +08:00
Kirigaya Kazuto 5a6fd71f90
[pipeline/rpc] update outstanding mechanism | optimize dispatching strategy (#1497)
* support p2p communication with any type of object | pass test

* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test

* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule

* [pipeline/rpc] implement a demo for PP with cuda rpc framework

* [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B

* [pipeline/rpc] implement distributed optimizer | test with assert_close

* [pipeline/rpc] implement distributed optimizer | test with assert_close

* [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy

* [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy

* [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy
2022-08-26 14:04:23 +08:00
CsRic 0ed2f46131
[FAW] FAW embedding use LRU as eviction strategy intialized with dataset stats (#1494) 2022-08-26 11:24:12 +08:00
YuliangLiu0306 8b7d6bd5be
[autoparallel] add more sharding strategies to conv (#1487) 2022-08-26 11:17:56 +08:00
Boyuan Yao de1e716dc4
[fx] Add activation checkpoint solver rotor (#1496)
* [fx] fix defining ckpt functions inside forward

* [fx] Modify activation checkpoint codegen and add ColoGraphModule

* [fx] some modification

* some modifications

* some modifications

* some modifications

* some modifications

* some code modifications

* [automatic_parallel] ckpt solver rotor

* [fx] add ckpt_solver_rotor

* [fx] modification

* code refactor

* code refactor
2022-08-26 10:34:21 +08:00
Super Daniel 09c023bee2
[fx] add more op patches for profiler and error message for unsupported ops. (#1495)
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] merge development into main (#1)

* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.

* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.

* [fx] fix lowercase naming conventions.

* [fx] simplify test for ckpt.

* [fx] add rules to linearize computation graphs for searching. (#2)

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] merge development into main (#1)

* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.

* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.

* [fx] fix lowercase naming conventions.

* [fx] simplify test for ckpt.

* [fx] fix test and algorithm bugs in activation checkpointing.

* [fx] polish ckpt_test.

* [fx] add rules to linearize computation graphs for searching.

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] fix inconsistencies.

* [fx] fix MetaInfoProp.

* [fx] fix MetaInfoProp.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] fix error in tests.

* [fx] unfix bug.

* [fx] unfix bug.

* [fx] patch more modules and functions.

* [fx] change name of utils.py to profiler.py

* [fx] add profiler for rnn.

* [fx] add profiler for rnn.

* [fx] polish and add more patch for profiler.

* [fx] polish and add more patch for profiler.
2022-08-25 23:11:13 +08:00
YuliangLiu0306 413c053453
[autoparallel] add cost graph class (#1481)
* [autoparallel] add cost graph class

* polish code
2022-08-25 17:19:59 +08:00
YuliangLiu0306 4b03c25f85
[tensor]add 1D device mesh (#1492) 2022-08-25 16:48:12 +08:00
CsRic b8d0e39eaf
[FAW] LFU cache for the FAW 2022-08-25 13:08:46 +08:00
Kirigaya Kazuto 9145aef2b4
[pipeline/rpc] implement distributed optimizer | test with assert_close (#1486)
* support p2p communication with any type of object | pass test

* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test

* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule

* [pipeline/rpc] implement a demo for PP with cuda rpc framework

* [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B

* [pipeline/rpc] implement distributed optimizer | test with assert_close

* [pipeline/rpc] implement distributed optimizer | test with assert_close
2022-08-25 10:49:01 +08:00
Frank Lee 3da68d6b1b
[fx] fixed adapative pooling size concatenation error (#1489) 2022-08-25 09:05:07 +08:00
Jiarui Fang cde7b8a5b8
[FAW] init an LFU implementation for FAW (#1488) 2022-08-24 17:37:22 +08:00
Super Daniel 32efe8e740
[fx] add profiler for fx nodes. (#1480)
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] merge development into main (#1)

* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.

* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.

* [fx] fix lowercase naming conventions.

* [fx] simplify test for ckpt.

* [fx] add rules to linearize computation graphs for searching. (#2)

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] merge development into main (#1)

* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.

* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.

* [fx] fix lowercase naming conventions.

* [fx] simplify test for ckpt.

* [fx] fix test and algorithm bugs in activation checkpointing.

* [fx] polish ckpt_test.

* [fx] add rules to linearize computation graphs for searching.

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] fix inconsistencies.

* [fx] fix MetaInfoProp.

* [fx] fix MetaInfoProp.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] add profiler for fx nodes.

* [fx] fix error in tests.

* [fx] unfix bug.

* [fx] unfix bug.
2022-08-24 16:22:44 +08:00
Frank Lee d39e11dffb
[autoparallel] added namespace constraints (#1490) 2022-08-24 15:44:07 +08:00
Kirigaya Kazuto a6c8749198
[pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B (#1483)
* support p2p communication with any type of object | pass test

* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test

* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule

* [pipeline/rpc] implement a demo for PP with cuda rpc framework

* [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B
2022-08-24 11:19:46 +08:00
Geng Zhang 0aad53c62b
[FCE] update interface for frequency statistics in FreqCacheEmbedding (#1462) 2022-08-23 17:38:24 +08:00
Frank Lee ede326298b
[autoparallel] integrate auto parallel with torch fx (#1479) 2022-08-23 14:23:08 +08:00
Boyuan Yao 1f2e547f7a
[fx] Fix ckpt functions' definitions in forward (#1476)
* [fx] fix defining ckpt functions inside forward

* [fx] Modify activation checkpoint codegen and add ColoGraphModule

* [fx] some modification

* some modifications

* some modifications

* some modifications

* some modifications

* some code modifications
2022-08-22 16:59:54 +08:00
Kirigaya Kazuto bb5f5289e0
[pipeline/rpc] implement a demo for PP with cuda rpc framework (#1470)
* support p2p communication with any type of object | pass test

* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test

* [engin/schedule] use p2p_v2 to recontruct pipeline_schedule

* [pipeline/rpc] implement a demo for PP with cuda rpc framework

* Delete p2p_v2.py

* Delete _pipeline_schedule_v2.py

* Delete test_object_list_p2p_v2.py

* Delete test_boardcast_send_recv_v2.py

* Delete test_cifar_with_data_pipeline_tensor_v2.py
2022-08-22 10:50:51 +08:00
Frank Lee 628c7e3fc8
[autoparallel] added dot handler (#1475) 2022-08-22 10:32:17 +08:00
Frank Lee 9dae9bb2bc
[autoparallel] introduced baseclass for op handler and reduced code redundancy (#1471)
* [autoparallel] introduced baseclass for op handler and reduced code redundancy

* polish code
2022-08-19 16:51:38 +08:00
Frank Lee 3a54e1c9b7
[autoparallel] standardize the code structure (#1469) 2022-08-19 15:51:54 +08:00
YuliangLiu0306 26a37b5cd5
[autoparallel] Add conv handler to generate strategies and costs info for conv (#1467) 2022-08-19 14:57:23 +08:00
Jiarui Fang 1b491ad7de
[doc] update docstring in ProcessGroup (#1468) 2022-08-19 13:41:57 +08:00
YuliangLiu0306 b73fb7a077
[tensor] support runtime ShardingSpec apply (#1453)
* [tensor] support runtime ShardingSpec apply

* polish code

* polish code
2022-08-19 13:39:51 +08:00
Super Daniel bbc58d881b
[fx] fix MetaInfoProp for incorrect calculations and add detections for inplace op. (#1466)
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] merge development into main (#1)

* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.

* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.

* [fx] fix lowercase naming conventions.

* [fx] simplify test for ckpt.

* [fx] add rules to linearize computation graphs for searching. (#2)

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] merge development into main (#1)

* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.

* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.

* [fx] fix lowercase naming conventions.

* [fx] simplify test for ckpt.

* [fx] fix test and algorithm bugs in activation checkpointing.

* [fx] polish ckpt_test.

* [fx] add rules to linearize computation graphs for searching.

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] remove chen_sqrt for sake of simplicity

* [fx] fix inconsistencies.

* [fx] fix MetaInfoProp.

* [fx] fix MetaInfoProp.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.

* [fx] consider MetaInfoProp for inplace operands.
2022-08-18 11:27:06 +08:00
Super Daniel e7383f578b
[fx] add rules to linearize computation graphs for searching. (#1461)
* [fx] polish ckpt_test.

* [fx] add rules to linearize computation graphs for searching.

* [fx] remove chen_sqrt for sake of simplicity

* [fx] fix inconsistencies.
2022-08-17 14:47:12 +08:00
Boyuan Yao 092b9c8f49
[fx] Add use_reentrant=False to checkpoint in codegen (#1463)
* [utils] Add use_reetrant=False into colossalai checkpoint

* [utils] add some annotation in utils.activaion_checkpoint

* [test] add reset_seed at the beginning of tests in test_actiavion_checkpointing.py

* [test] modify test_activation_checkpoint.py

* [test] modify test for reentrant=False

* [fx] Add use_reentrant=False of checkpoint into codegen
2022-08-17 10:34:50 +08:00
Boyuan Yao 47fd8e4a02
[utils] Add use_reetrant=False in utils.activation_checkpoint (#1460)
* [utils] Add use_reetrant=False into colossalai checkpoint

* [utils] add some annotation in utils.activaion_checkpoint

* [test] add reset_seed at the beginning of tests in test_actiavion_checkpointing.py

* [test] modify test_activation_checkpoint.py

* [test] modify test for reentrant=False
2022-08-16 15:39:20 +08:00
Jiarui Fang 36824a304c
[Doc] add more doc for ColoTensor. (#1458) 2022-08-16 10:38:41 +08:00
Jiarui Fang a1476ea882
[NFC] polish doc style for ColoTensor (#1457) 2022-08-16 09:21:05 +08:00
Super Daniel 0dbd61c29b
[fx] fix test and algorithm bugs in activation checkpointing. (#1451)
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages

* [fx] merge development into main (#1)

* [fx] activation checkpointing using Chen strategies.

* [fx] add test for ckpt_solver_chen

* [fx] add vanilla activation checkpoint search with test on resnet and densenet

* [fx] add a namespace code for solver_chen.

* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174.

* [fx] fix lowercase naming conventions.

* [fx] simplify test for ckpt.

* [fx] fix test and algorithm bugs in activation checkpointing.

* mend

[fx] fix test and algorithm bugs in activation checkpointing.

* mend

[fx] fix test and algorithm bugs in activation checkpointing.

* mend

[fx] fix test and algorithm bugs in activation checkpointing.

* mend

[fx] fix test and algorithm bugs in activation checkpointing.

* [fx] polish ckpt_test.

* [fx] polish ckpt_test.

* [fx] polish ckpt_test.
2022-08-15 19:09:19 +08:00