ColossalAI

Commit Graph

Author	SHA1	Message	Date
YuliangLiu0306	81f7530ee7	[autoparallel] adapt solver and CostGraph with new handler (#1695 ) * [autoparallel] adapt solver and CostGraph with new handler * fix test issue	2022-10-13 14:04:15 +08:00
YuliangLiu0306	42b882ef06	[autoparallel] add output handler and placeholder handler (#1694 ) * [autoparallel] add output handler and placeholder handler * Delete test_solver_with_resnet.py * fix test bugs	2022-10-13 13:42:36 +08:00
YuliangLiu0306	56088e6d98	[autoparallel] add pooling handler (#1690 ) * [autoparallel] add pooling handler * polish code	2022-10-13 13:42:13 +08:00
YuliangLiu0306	319d654f79	[autoparallel] where_handler_v2 (#1688 ) * where generator * [autoparallel] where_handler_v2	2022-10-13 11:02:22 +08:00
Boyuan Yao	31d2f03d27	[autoparallel] fix C version rotor inconsistency (#1691 )	2022-10-12 15:21:58 +08:00
Frank Lee	4973157ad7	[autoparallel] added sharding spec conversion for linear handler (#1687 )	2022-10-12 11:16:18 +08:00
YuliangLiu0306	af718e83f2	[autoparallel] add reshape handler v2 and fix some previous bug (#1683 )	2022-10-11 18:12:59 +08:00
Super Daniel	3dd6994427	[fx/profiler] assigned UUID to each unrecorded tensor/ improved performance on GPT-2 (#1679 ) * [fx/profiler] modify data_ptr into uuid for all tensors. * [fx] modify uuid. * [fx/profiler] tune performance on GPT-2. * [fx] updates. * [fx] debug. * [fx] debug. * [fx] cuda.	2022-10-11 11:03:35 +08:00
YuliangLiu0306	517b63939a	[autoparallel] add unary element wise handler v2 (#1674 )	2022-10-09 17:30:42 +08:00
YuliangLiu0306	f6c6a932b8	[autoparallel] add following node generator (#1673 ) * [autoparallel] add following node generator * polish code * polish code * update name of arguments	2022-10-09 14:49:18 +08:00
YuliangLiu0306	52fda88796	[autoparallel] add layer norm handler v2 (#1671 ) * [autoparallel] add layer norm handler v2 * polish code * polish code	2022-10-09 14:23:22 +08:00
HELSON	b28991dd0a	[feature] A new ZeRO implementation (#1644 )	2022-10-09 09:18:51 +08:00
Boyuan Yao	1df98d5b66	[autoparallel] add rotor C version (#1658 ) * [autoparallel] add rotor c version * [fx] remove metainfoprop in rotor solver * [autoparallel] modify C code format * [autoparallel] remove build.py * [autoparallel] fix C extension build * [autoparallel] add C solver consistency test * [autoparallel] remove some unused imports * [autoparallel] refactor rotor solver code * [autoparallel] replace print with colossalai logger * [autoparallel] ranks fixed	2022-10-03 17:13:30 +08:00
YuliangLiu0306	11ec070e53	[hotfix]unit test (#1670 )	2022-09-29 12:49:28 +08:00
Frank Lee	a60024e77a	[autoparallel] added utils for broadcast operation (#1665 ) * [autoparallel] added utils for broadcast operation * polish code	2022-09-29 11:22:29 +08:00
YuliangLiu0306	3f068d1409	[autoparallel] update CommSpec (#1667 )	2022-09-29 11:20:59 +08:00
YuliangLiu0306	746f8f979d	[autoparallel] add batch norm handler v2 (#1666 )	2022-09-29 11:02:49 +08:00
Kirigaya Kazuto	9708638ded	[pipeline/pytree] add pytree to process args and kwargs \| provide `data_process_func` to process args and kwargs after forward (#1642 ) * [pipeline/tuning] improve dispatch performance both time and space cost * [pipeline/converge] add interface for testing convergence * [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style * Update PipelineBase.py * [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule \| finish Chimera * [pipeline/chimera] test chimera \| fix bug of initializing * [pipeline/pytree] add pytree to process args and kwargs \| provide to process args and kwargs after forward	2022-09-29 10:58:58 +08:00
Frank Lee	3a4d6f63a8	[autoparallel] added node handler for bmm (#1655 )	2022-09-28 11:32:16 +08:00
YuliangLiu0306	095854477f	[autoparallel] add conv handler v2 (#1663 )	2022-09-28 11:24:59 +08:00
YuliangLiu0306	1e7816a460	[autoparallel] adapt solver with gpt (#1653 )	2022-09-28 11:17:26 +08:00
Frank Lee	30e50c8b4a	[autoparallel] implemented all matmul strategy generator (#1650 )	2022-09-27 12:06:25 +08:00
YuliangLiu0306	03978aad45	[autoparallel] change the following nodes strategies generation logic (#1636 ) * [autoparallel] change the following nodes strategies generation logic * fix unit test	2022-09-27 11:20:52 +08:00
YuliangLiu0306	59f100510a	[autoparallel] where handler (#1651 ) * [autoparallel] where handler * fix unit test	2022-09-27 11:20:43 +08:00
Boyuan Yao	5d0fdb9cb4	[fx] fix offload codegen test (#1648 ) * [fx] fix offload codegen test * [fx] modify typing	2022-09-27 10:25:27 +08:00
Frank Lee	45b39a692a	[autoparallel] implemented linear projection strategy generator (#1639 )	2022-09-26 16:58:14 +08:00
Frank Lee	154d3ef432	[fix] fixed the collective pattern name for consistency (#1649 ) * [fix] fixed the collective pattern name for consistency * polish code	2022-09-26 16:39:37 +08:00
YuliangLiu0306	b2b2a4af98	[autoparallel] adapt solver with mlp (#1638 )	2022-09-26 15:26:14 +08:00
Jiarui Fang	c5d39215f6	Revert "[feature] new zero implementation (#1623 )" (#1643 ) This reverts commit `5be118f405`.	2022-09-26 10:06:03 +08:00
HELSON	5be118f405	[feature] new zero implementation (#1623 )	2022-09-24 19:58:18 +08:00
HELSON	95c35f73bd	[moe] initialize MoE groups by ProcessGroup (#1640 )	2022-09-23 17:20:41 +08:00
HELSON	a088022efc	[moe] fix moe bugs (#1633 )	2022-09-23 15:33:57 +08:00
YuliangLiu0306	702dbc5288	[tensor] use communication autograd func (#1617 ) * [tensor] use communication autograd func * change all to all comm spec info * rename pattern and distinguish fwd/bwd * polish code	2022-09-23 13:31:15 +08:00
YuliangLiu0306	0c703189b9	[autoparallel] add layernorm handler (#1629 )	2022-09-23 12:00:25 +08:00
YuliangLiu0306	bf77d3ab65	[autoparallel] recover the merged node strategy index (#1613 )	2022-09-23 11:52:42 +08:00
Boyuan Yao	d6b01feb66	[fx] Modify offload codegen (#1618 ) * [fx] modify offload codegen * [fx] remove repeated hook definitions * [fx] modify offload test	2022-09-23 11:04:52 +08:00
YuliangLiu0306	9eae855408	[hotfix] add recompile after graph manipulatation (#1621 )	2022-09-23 11:00:33 +08:00
Super Daniel	d967779a32	[fx/profiler] tuned the calculation of memory estimation (#1619 ) * [fx] tuned the meta info and rotor solver. * [fx] remove import. * [fx] remove import. * [fx] remove import. * [fx] tune the meta calculations. * [fx] polish comments. * [fx] remove assertions. * [fx] modify test cases. * [fx] modify test cases. * [fx] optimize import. * [fx	2022-09-23 10:59:47 +08:00
HELSON	f7f2248771	[moe] fix MoE bugs (#1628 ) * remove forced FP32 modules * correct no_shard-contexts' positions	2022-09-22 13:56:30 +08:00
Jiarui Fang	38c68b5b9a	[embedding] rollback for better FAW performance (#1625 )	2022-09-22 11:16:25 +08:00
Frank Lee	d925122020	[autoparallel] added new linear module handler (#1616 )	2022-09-21 12:23:21 +08:00
Kirigaya Kazuto	170fa81095	[pipeline/chimera] test chimera \| fix bug of initializing (#1615 ) * [pipeline/tuning] improve dispatch performance both time and space cost * [pipeline/converge] add interface for testing convergence * [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style * Update PipelineBase.py * [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule \| finish Chimera * [pipeline/chimera] test chimera \| fix bug of initializing	2022-09-20 18:00:39 +08:00
Jiarui Fang	504ff1d101	[embeddings] use cache_ratio instead of cuda_row_num (#1611 )	2022-09-20 14:33:04 +08:00
YuliangLiu0306	7d1bb71d5d	[fx] PoC of runtime shape consistency application (#1607 ) * [fx] PoC of runtime shape consistency application * polish code	2022-09-20 14:00:04 +08:00
YuliangLiu0306	47b11c432c	[autoparallel]add bcast matmul strategies (#1605 )	2022-09-20 11:26:21 +08:00
Boyuan Yao	933b6c6367	[fx] Add pofo solver (#1608 ) * [fx] add pofo algorithm * [fx] Add pofo solver * [fx] code refactor * [fx] fix test_linearize import	2022-09-20 11:20:48 +08:00
Kirigaya Kazuto	edc9e419ad	[pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule \| finish Chimera (#1595 ) * [pipeline/tuning] improve dispatch performance both time and space cost * [pipeline/converge] add interface for testing convergence * [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style * Update PipelineBase.py * [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule \| finish Chimera	2022-09-19 11:44:18 +08:00
YuliangLiu0306	eac1b79371	[autoparallel] add bcast op handler (#1600 ) * [autoparallel] add bcast op handler * polish code * add more BCAST FUNC OP * polish code * add exception handler * polish	2022-09-16 11:33:01 +08:00
Boyuan Yao	a7cda6f57d	[fx] Add offload codegen (#1598 ) * [fx] add input activation offload to codegen * [fx] modify unit test * [fx] remove two skips in torch11 * [fx] use all_input_nodes instead of _input_nodes	2022-09-14 15:49:06 +08:00
Super Daniel	c8e9b2ad78	[hotfix/rotor] fix variable names (#1597 ) * [fx] add some comment and docstrings. * [fx] add dataflow analysis for an autograd graph. * add intepretation for graph analysis. * [fx] before doing save_tensor_hooks. * [fx] provide an accurate estimation of memory except for GPT-2. * [fx] provide an accurate estimation of memory except for GPT-2. * [fx] provide an accurate estimation of memory except for GPT-2. * [fx] a very accurate version on GPT-2. * [fx] refactor code. * [fx] remove redundant inplace=True. * [fx] refactor code. * [fx] refactor code. * [fx] refactor code. * [fx] dive into backward memory. * [fx] fix variable names in ckpt_solvers and unskip tests. * [fx] commit my changes. * [fx] restore skips. * [fx] restore skips. * [fx] chaange stage into phase. * [fx] chaange stage into phase. * [fx] chaange stage into phase.	2022-09-14 14:27:04 +08:00
YuliangLiu0306	faa23b9d9a	[autoparallel] add reshape handler (#1594 ) * [autoparallel] add reshape handler * polish code	2022-09-14 10:25:45 +08:00
Frank Lee	27fe8af60c	[autoparallel] refactored shape consistency to remove redundancy (#1591 ) * [autoparallel] refactored shape consistency to remove redundancy * polish code * polish code * polish code	2022-09-13 18:30:18 +08:00
YuliangLiu0306	d164449d00	[autoparallel] add resnet autoparallel unit test and add backward weight communication cost (#1589 )	2022-09-13 18:05:05 +08:00
Frank Lee	219f66c571	[autoparallel] added solver option dataclass (#1588 )	2022-09-13 14:47:09 +08:00
YuliangLiu0306	82d4376c23	[autoparallel] adapt solver with resnet (#1583 ) * [autoparallel]adapt solver with resnet * polish code * polish code	2022-09-13 12:07:09 +08:00
CsRic	f3403ff98e	[embeddings] add already_split_along_rank flag for tablewise mode (#1584 )	2022-09-13 10:50:34 +08:00
Boyuan Yao	f3687e4ee2	[fx] Add nested checkpoint in activation checkpoint codegen (#1585 ) * [fx] add nested activation_checkpoint codegen * undo algorithms commits * solver * undo some commits * [fx] torch11 add nested activation checkpoint codegen * remove some imports * [fx] add some comments in activation codegen * [fx] codegen instance error fix	2022-09-12 20:00:48 +08:00
アマデウス	e615cfc3a8	[NFC] polish test component gpt code style (#1567 )	2022-09-08 16:34:09 +08:00
Kirigaya Kazuto	6159d45417	[pipeline/tuning] improve dispatch performance both time and space cost (#1544 )	2022-09-07 19:01:06 +08:00
Super Daniel	4f59693207	[fx] provide a stable but not accurate enough version of profiler. (#1547 ) * [fx] compute memory stat and flop count for MetaInfoProp. * [fx] modify node attribute. * [fx] modify ckpt_chen. * [fx] fix compatibility. * [fx] fix import error. * [fx] skip test for MetaInfoProp. * [fx] skip test for MetaInfoProp. * [fx] skip test for MetaInfoProp. * [fx] skip test for MetaInfoProp. * [fx] skip if torch 1.11.0. * [fx] recover MetaInfoProp support for PyTorch 1.11. * [fx] provide a stable but not accurate enough version of profiler. * [fx] provide a stable but not accurate enough version of profiler. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix compatibility in tests. * [fx] fix import error.	2022-09-07 11:21:04 +08:00
YuliangLiu0306	0908d0fc61	[autoparallel]add backward cost info into strategies (#1524 )	2022-09-07 11:19:00 +08:00
YuliangLiu0306	44c866a3e3	[autoparallel] change the merge node logic (#1533 )	2022-09-07 11:18:19 +08:00
Jiarui Fang	64169f3e8f	[embedding] polish parallel embedding tablewise (#1545 )	2022-09-06 10:41:20 +08:00
CsRic	964123ae0f	[embedding] freq_aware_embedding: add small functions for caller application (#1537 )	2022-09-05 15:12:53 +08:00
Boyuan Yao	56159049e8	[fx] Modify solver linearize and add corresponding test (#1531 ) * [fx] modify solver linearize and add test * [fx] add torch11 test of linearize but skip it * [fx] remove some unused imports	2022-09-02 10:24:41 +08:00
Super Daniel	7dc53237c3	[fx] add test for meta tensor. (#1527 ) * [fx] add test for meta tensor. * [fx] add test for meta tensor. * [fx] add test for meta tensor. * [fx] add test for meta tensor. * [fx] fix error.	2022-09-01 19:30:05 +08:00
YuliangLiu0306	4b3d6caeb3	[fx]patch nn.functional convolution (#1528 )	2022-09-01 19:05:07 +08:00
CsRic	5156d5b4f8	[embedding] add tablewise sharding for FAW (#1526 )	2022-09-01 17:55:41 +08:00
Kirigaya Kazuto	f1e1836218	[pipeline/pipleline_process_group] finish PipelineProcessGroup to manage local abd global rank in TP,DP and PP (#1508 ) * support p2p communication with any type of object \| pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) \| pass test * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [pipeline/rpc] implement a demo for PP with cuda rpc framework * [pipeline/rpc] support interleaving \| fix checkpoint bug \| change logic when dispatch data in work_list to ensure steady 1F1B * [pipeline/rpc] implement distributed optimizer \| test with assert_close * [pipeline/rpc] implement distributed optimizer \| test with assert_close * [pipeline/rpc] update outstanding mechanism \| optimize dispatching strategy * [pipeline/rpc] update outstanding mechanism \| optimize dispatching strategy * [pipeline/rpc] update outstanding mechanism \| optimize dispatching strategy * [pipeline/pipleline_process_group] finish PipelineProcessGroup to manage local abd global rank in TP,DP and PP * [pipeline/pipleline_process_group] remove comment * [pipeline/pipleline_process_group] remove comment * [pipeline/pipleline_process_group] skip process group test * [pipeline/pipleline_process_group] remove test named function	2022-09-01 17:45:47 +08:00
Boyuan Yao	b231430bcb	[fx] Fix wrong index in annotation and minimal flops in ckpt solver (#1521 ) * [fx] fix wrong variable name in solver rotor * [fx] fix wrong variable name in solver rotor * [fx] fix the discretize bug * [fx] fix the first op in activation checkpoint codegen * [fx] fix some bugs of ckpt solver * [fx] modify test_ckpt_torchvision * [fx] set sequence to __sequence__ attr of GraphModule * [fx] docstring modification * [fx] remove performance test	2022-08-31 18:10:48 +08:00
YuliangLiu0306	3345c6d352	[autoparellel]add strategies constructor (#1505 ) * [autoparellel]add strategies constructor * remove duplicated strategies * polish code * adapt cost graph with StrategiesConstructor * polish	2022-08-30 16:32:09 +08:00
Frank Lee	a0436a62ee	[autoparallel] added liveness analysis (#1516 ) * [autoparallel] added liveness analysis * remove memory cost	2022-08-30 15:54:37 +08:00
Jiarui Fang	9a9ef65313	[FAW] cpu caching operations (#1520 )	2022-08-30 14:50:02 +08:00
Jiarui Fang	af5438caa2	[FAW] refactor reorder() for CachedParamMgr (#1514 )	2022-08-29 14:22:07 +08:00
CsRic	1b8fee8e9c	[FAW] shrink freq_cnter size (#1509 )	2022-08-29 11:44:55 +08:00
Boyuan Yao	4acc58ee20	[fx] Fix activation codegen dealing with checkpointing first op (#1510 )	2022-08-27 19:39:21 +08:00
Kirigaya Kazuto	5a6fd71f90	[pipeline/rpc] update outstanding mechanism \| optimize dispatching strategy (#1497 ) * support p2p communication with any type of object \| pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) \| pass test * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [pipeline/rpc] implement a demo for PP with cuda rpc framework * [pipeline/rpc] support interleaving \| fix checkpoint bug \| change logic when dispatch data in work_list to ensure steady 1F1B * [pipeline/rpc] implement distributed optimizer \| test with assert_close * [pipeline/rpc] implement distributed optimizer \| test with assert_close * [pipeline/rpc] update outstanding mechanism \| optimize dispatching strategy * [pipeline/rpc] update outstanding mechanism \| optimize dispatching strategy * [pipeline/rpc] update outstanding mechanism \| optimize dispatching strategy	2022-08-26 14:04:23 +08:00
CsRic	0ed2f46131	[FAW] FAW embedding use LRU as eviction strategy intialized with dataset stats (#1494 )	2022-08-26 11:24:12 +08:00
YuliangLiu0306	8b7d6bd5be	[autoparallel] add more sharding strategies to conv (#1487 )	2022-08-26 11:17:56 +08:00
Boyuan Yao	de1e716dc4	[fx] Add activation checkpoint solver rotor (#1496 ) * [fx] fix defining ckpt functions inside forward * [fx] Modify activation checkpoint codegen and add ColoGraphModule * [fx] some modification * some modifications * some modifications * some modifications * some modifications * some code modifications * [automatic_parallel] ckpt solver rotor * [fx] add ckpt_solver_rotor * [fx] modification * code refactor * code refactor	2022-08-26 10:34:21 +08:00
YuliangLiu0306	413c053453	[autoparallel] add cost graph class (#1481 ) * [autoparallel] add cost graph class * polish code	2022-08-25 17:19:59 +08:00
YuliangLiu0306	4b03c25f85	[tensor]add 1D device mesh (#1492 )	2022-08-25 16:48:12 +08:00
CsRic	b8d0e39eaf	[FAW] LFU cache for the FAW	2022-08-25 13:08:46 +08:00
Kirigaya Kazuto	9145aef2b4	[pipeline/rpc] implement distributed optimizer \| test with assert_close (#1486 ) * support p2p communication with any type of object \| pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) \| pass test * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [pipeline/rpc] implement a demo for PP with cuda rpc framework * [pipeline/rpc] support interleaving \| fix checkpoint bug \| change logic when dispatch data in work_list to ensure steady 1F1B * [pipeline/rpc] implement distributed optimizer \| test with assert_close * [pipeline/rpc] implement distributed optimizer \| test with assert_close	2022-08-25 10:49:01 +08:00
Frank Lee	3da68d6b1b	[fx] fixed adapative pooling size concatenation error (#1489 )	2022-08-25 09:05:07 +08:00
Jiarui Fang	cde7b8a5b8	[FAW] init an LFU implementation for FAW (#1488 )	2022-08-24 17:37:22 +08:00
Super Daniel	32efe8e740	[fx] add profiler for fx nodes. (#1480 ) * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] merge development into main (#1) * [fx] activation checkpointing using Chen strategies. * [fx] add test for ckpt_solver_chen * [fx] add vanilla activation checkpoint search with test on resnet and densenet * [fx] add a namespace code for solver_chen. * [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174. * [fx] fix lowercase naming conventions. * [fx] simplify test for ckpt. * [fx] add rules to linearize computation graphs for searching. (#2) * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] merge development into main (#1) * [fx] activation checkpointing using Chen strategies. * [fx] add test for ckpt_solver_chen * [fx] add vanilla activation checkpoint search with test on resnet and densenet * [fx] add a namespace code for solver_chen. * [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174. * [fx] fix lowercase naming conventions. * [fx] simplify test for ckpt. * [fx] fix test and algorithm bugs in activation checkpointing. * [fx] polish ckpt_test. * [fx] add rules to linearize computation graphs for searching. * [fx] remove chen_sqrt for sake of simplicity * [fx] remove chen_sqrt for sake of simplicity * [fx] remove chen_sqrt for sake of simplicity * [fx] remove chen_sqrt for sake of simplicity * [fx] fix inconsistencies. * [fx] fix MetaInfoProp. * [fx] fix MetaInfoProp. * [fx] consider MetaInfoProp for inplace operands. * [fx] consider MetaInfoProp for inplace operands. * [fx] consider MetaInfoProp for inplace operands. * [fx] consider MetaInfoProp for inplace operands. * [fx] consider MetaInfoProp for inplace operands. * [fx] add profiler for fx nodes. * [fx] add profiler for fx nodes. * [fx] add profiler for fx nodes. * [fx] add profiler for fx nodes. * [fx] add profiler for fx nodes. * [fx] add profiler for fx nodes. * [fx] add profiler for fx nodes. * [fx] fix error in tests. * [fx] unfix bug. * [fx] unfix bug.	2022-08-24 16:22:44 +08:00
Kirigaya Kazuto	a6c8749198	[pipeline/rpc] support interleaving \| fix checkpoint bug \| change logic when dispatch data in work_list to ensure steady 1F1B (#1483 ) * support p2p communication with any type of object \| pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) \| pass test * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [pipeline/rpc] implement a demo for PP with cuda rpc framework * [pipeline/rpc] support interleaving \| fix checkpoint bug \| change logic when dispatch data in work_list to ensure steady 1F1B	2022-08-24 11:19:46 +08:00
Geng Zhang	0aad53c62b	[FCE] update interface for frequency statistics in FreqCacheEmbedding (#1462 )	2022-08-23 17:38:24 +08:00
Frank Lee	ede326298b	[autoparallel] integrate auto parallel with torch fx (#1479 )	2022-08-23 14:23:08 +08:00
Boyuan Yao	1f2e547f7a	[fx] Fix ckpt functions' definitions in forward (#1476 ) * [fx] fix defining ckpt functions inside forward * [fx] Modify activation checkpoint codegen and add ColoGraphModule * [fx] some modification * some modifications * some modifications * some modifications * some modifications * some code modifications	2022-08-22 16:59:54 +08:00
Kirigaya Kazuto	bb5f5289e0	[pipeline/rpc] implement a demo for PP with cuda rpc framework (#1470 ) * support p2p communication with any type of object \| pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) \| pass test * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [pipeline/rpc] implement a demo for PP with cuda rpc framework * Delete p2p_v2.py * Delete _pipeline_schedule_v2.py * Delete test_object_list_p2p_v2.py * Delete test_boardcast_send_recv_v2.py * Delete test_cifar_with_data_pipeline_tensor_v2.py	2022-08-22 10:50:51 +08:00
Frank Lee	628c7e3fc8	[autoparallel] added dot handler (#1475 )	2022-08-22 10:32:17 +08:00
YuliangLiu0306	26a37b5cd5	[autoparallel] Add conv handler to generate strategies and costs info for conv (#1467 )	2022-08-19 14:57:23 +08:00
YuliangLiu0306	b73fb7a077	[tensor] support runtime ShardingSpec apply (#1453 ) * [tensor] support runtime ShardingSpec apply * polish code * polish code	2022-08-19 13:39:51 +08:00
Super Daniel	e7383f578b	[fx] add rules to linearize computation graphs for searching. (#1461 ) * [fx] polish ckpt_test. * [fx] add rules to linearize computation graphs for searching. * [fx] remove chen_sqrt for sake of simplicity * [fx] fix inconsistencies.	2022-08-17 14:47:12 +08:00
Boyuan Yao	092b9c8f49	[fx] Add use_reentrant=False to checkpoint in codegen (#1463 ) * [utils] Add use_reetrant=False into colossalai checkpoint * [utils] add some annotation in utils.activaion_checkpoint * [test] add reset_seed at the beginning of tests in test_actiavion_checkpointing.py * [test] modify test_activation_checkpoint.py * [test] modify test for reentrant=False * [fx] Add use_reentrant=False of checkpoint into codegen	2022-08-17 10:34:50 +08:00
Boyuan Yao	47fd8e4a02	[utils] Add use_reetrant=False in utils.activation_checkpoint (#1460 ) * [utils] Add use_reetrant=False into colossalai checkpoint * [utils] add some annotation in utils.activaion_checkpoint * [test] add reset_seed at the beginning of tests in test_actiavion_checkpointing.py * [test] modify test_activation_checkpoint.py * [test] modify test for reentrant=False	2022-08-16 15:39:20 +08:00
Jiarui Fang	36824a304c	[Doc] add more doc for ColoTensor. (#1458 )	2022-08-16 10:38:41 +08:00
Super Daniel	0dbd61c29b	[fx] fix test and algorithm bugs in activation checkpointing. (#1451 ) * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] merge development into main (#1) * [fx] activation checkpointing using Chen strategies. * [fx] add test for ckpt_solver_chen * [fx] add vanilla activation checkpoint search with test on resnet and densenet * [fx] add a namespace code for solver_chen. * [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174. * [fx] fix lowercase naming conventions. * [fx] simplify test for ckpt. * [fx] fix test and algorithm bugs in activation checkpointing. * mend [fx] fix test and algorithm bugs in activation checkpointing. * mend [fx] fix test and algorithm bugs in activation checkpointing. * mend [fx] fix test and algorithm bugs in activation checkpointing. * mend [fx] fix test and algorithm bugs in activation checkpointing. * [fx] polish ckpt_test. * [fx] polish ckpt_test. * [fx] polish ckpt_test.	2022-08-15 19:09:19 +08:00
Geng Zhang	9f3eed66eb	[FAW] reorganize the inheritance struct of FreqCacheEmbedding (#1448 )	2022-08-12 15:55:46 +08:00
Frank Lee	5a52e21fe3	[test] fixed the activation codegen test (#1447 ) * [test] fixed the activation codegen test * polish code	2022-08-12 14:52:31 +08:00
YuliangLiu0306	0f3042363c	[tensor] shape consistency generate transform path and communication cost (#1435 ) * [tensor] shape consistency output transform path and communication cost * polish code	2022-08-12 14:02:32 +08:00
Boyuan Yao	5774fe0270	[fx] Use colossalai checkpoint and add offload recognition in codegen (#1439 ) * [fx] Use colossalai.utils.checkpoint to replace torch.utils.checkpoint for offload activation and add offload annotation recognition in codegen * [fx] Use colossalai.utils.checkpoint to replace torch.utils.checkpoint for offload activation and add offload annotation recognition in codegen * Modification of test and add TODO in codegen * [fx] Modification of colossal ckpt usage * [fx] add gpc.destroy() to test_codegen	2022-08-12 12:23:30 +08:00
Kirigaya Kazuto	e9460b45c8	[engin/schedule] use p2p_v2 to recontruct pipeline_schedule (#1408 ) * support p2p communication with any type of object \| pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) \| pass test * [communication] add p2p_v2.py to support communication with List[Any] * Delete _pipeline_schedule_v2.py * Delete test_cifar_with_data_pipeline_tensor_v2.py * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * Delete p2p_v2.py * Delete test_boardcast_send_recv_v2.py * Delete test_object_list_p2p_v2.py * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [communication] remove print code * [communication] remove print code * [engin/schedule] shorten the running time of testing file to prevent cancelling in CI	2022-08-12 11:33:26 +08:00
Frank Lee	ae1b58cd16	[tensor] added linear implementation for the new sharding spec (#1416 ) * [tensor] added linear implementation for the new sharding spec * polish code	2022-08-12 11:33:09 +08:00
Super Daniel	d40a9392ba	[fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174 . (#1446 ) * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] activation checkpointing using Chen strategies. * [fx] add test for ckpt_solver_chen * mend * [fx] add vanilla activation checkpoint search with test on resnet and densenet * [fx] add vanilla activation checkpoint search with test on resnet and densenet * [fx] add a namespace code for solver_chen. * [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174. * [fx] fix lowercase naming conventions.	2022-08-12 11:28:50 +08:00
ver217	821c6172e2	[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442 )	2022-08-11 22:58:58 +08:00
HELSON	b80340168e	[zero] add chunk_managerV2 for all-gather chunk (#1441 )	2022-08-11 19:17:24 +08:00
Super Daniel	3b26516c69	[fx] add vanilla activation checkpoint search with test on resnet and densenet (#1433 ) * [fx] activation checkpointing using Chen strategies. * [fx] add test for ckpt_solver_chen * [fx] add vanilla activation checkpoint search with test on resnet and densenet * [fx] add vanilla activation checkpoint search with test on resnet and densenet * [fx] add a namespace code for solver_chen.	2022-08-11 15:46:39 +08:00
Jiarui Fang	30b4dd17c0	[FAW] export FAW in _ops (#1438 )	2022-08-11 13:43:24 +08:00
HELSON	9056677b13	[zero] add chunk size searching algorithm for parameters in different groups (#1436 )	2022-08-11 13:32:19 +08:00
HELSON	039b7ed3bc	[polish] add update directory in gemini; rename AgChunk to ChunkV2 (#1432 )	2022-08-10 16:40:29 +08:00
Super Daniel	f20cb4e893	[fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages (#1425 ) * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages * [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages	2022-08-10 16:36:35 +08:00
Jiarui Fang	89c434a0a6	[polish] add test_ops directory (#1431 )	2022-08-10 15:35:26 +08:00
Jiarui Fang	10b3df65c8	[FAW] move coloparam setting in test code. (#1429 )	2022-08-10 14:31:53 +08:00
Jiarui Fang	cb98cf5558	[FAW] parallel FreqAwareEmbedding (#1424 )	2022-08-10 13:44:30 +08:00
HELSON	0d212183c4	[zero] add has_inf_or_nan in AgChunk; enhance the unit test of AgChunk (#1426 )	2022-08-10 11:37:28 +08:00
YuliangLiu0306	33f0744d51	[tensor] add shape consistency feature to support auto spec transform (#1418 ) * [tensor] add shape consistency feature to supportauto sharding spec transform. * [tensor] remove unused argument in simulator, add doc string for target pair.	2022-08-10 11:29:17 +08:00
HELSON	4fb3c52cf0	[zero] add unit test for AgChunk's append, close, access (#1423 )	2022-08-09 18:03:10 +08:00
Jiarui Fang	d209aff684	Add FreqAwareEmbeddingBag (#1421 )	2022-08-09 16:26:12 +08:00
Jiarui Fang	504419d261	[FAW] add cache manager for the cached embedding (#1419 )	2022-08-09 15:17:17 +08:00
Kirigaya Kazuto	44fd3c83ab	[communication] add p2p_v2.py to support communication with List[Any] (#1407 ) * support p2p communication with any type of object | pass test * reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test * [communication] add p2p_v2.py to support communication with List[Any] * Delete _pipeline_schedule_v2.py * Delete test_cifar_with_data_pipeline_tensor_v2.py * [engin/schedule] use p2p_v2 to recontruct pipeline_schedule * [communication] remove print code * [communication] remove print code	2022-08-09 11:40:04 +08:00
YuliangLiu0306	7c96055c68	[tensor]build sharding spec to replace distspec in future. (#1405 )	2022-08-08 11:15:57 +08:00
ver217	12b4887097	[hotfix] fix CPUAdam kernel nullptr (#1410 )	2022-08-05 19:45:45 +08:00
YuliangLiu0306	0442f940f0	[device] add DeviceMesh class to support logical device layout (#1394 ) * [device] add DeviceMesh class to support logical device layout * polish code * add doc string	2022-08-02 19:23:48 +08:00
HELSON	4e98e938ce	[zero] alleviate memory usage in ZeRODDP state_dict (#1398 )	2022-08-02 15:49:13 +08:00
Frank Lee	adf5054ff8	[fx] fixed torchaudio conformer tracing (#1392 )	2022-08-01 16:08:28 +08:00
Frank Lee	7d6293927f	[fx] patched torch.max and data movement operator (#1391 ) * [fx] patched torch.max and data movement operator * polish code	2022-08-01 15:31:50 +08:00
HELSON	527758b2ae	[hotfix] fix a running error in test_colo_checkpoint.py (#1387 )	2022-07-29 15:58:06 +08:00
ver217	8dced41ad0	[zero] zero optim state_dict takes only_rank_0 (#1384 ) * zero optim state_dict takes only_rank_0 * fix unit test	2022-07-29 13:22:50 +08:00
ver217	7d5d628e07	[DDP] test ddp state dict uses more strict threshold (#1382 )	2022-07-28 17:29:04 +08:00
ver217	828b9e5e0d	[hotfix] fix zero optim save/load state dict (#1381 )	2022-07-28 17:19:39 +08:00
Super Daniel	be229217ce	[fx] add torchaudio test (#1369 ) * [fx]add torchaudio test * [fx]add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test * [fx] add torchaudio test and test patches * Delete ~ * [fx] add patches and patches test * [fx] add patches and patches test * [fx] fix patches * [fx] fix rnn patches * [fx] fix rnn patches * [fx] fix rnn patches * [fx] fix rnn patches * [fx] merge upstream * [fx] fix import errors	2022-07-27 11:03:14 +08:00
Boyuan Yao	bb640ec728	[fx] Add colotracer compatibility test on torchrec (#1370 )	2022-07-26 17:54:39 +08:00
ver217	c415240db6	[nvme] CPUAdam and HybridAdam support NVMe offload (#1360 ) * impl nvme optimizer * update cpu adam * add unit test * update hybrid adam * update docstr * add TODOs * update CI * fix CI * fix CI * fix CI path * fix CI path * fix CI path * fix install tensornvme * fix CI * fix CI path * fix CI env variables * test CI * test CI * fix CI * fix nvme optim __del__ * fix adam __del__ * fix nvme optim * fix CI env variables * fix nvme optim import * test CI * test CI * fix CI	2022-07-26 17:25:24 +08:00
HELSON	87775a0682	[colotensor] use cpu memory to store state_dict (#1367 )	2022-07-26 14:13:38 +08:00
Frank Lee	cd063ac37f	[fx] added activation checkpoint codegen support for torch < 1.12 (#1359 )	2022-07-25 23:35:31 +08:00
HELSON	4417804129	[unit test] add megatron init test in zero_optim (#1358 )	2022-07-25 11:18:08 +08:00
HELSON	7a065dc9f6	[hotfix] fix megatron_init in test_gpt2.py (#1357 )	2022-07-25 10:28:19 +08:00
Frank Lee	644582eee9	[fx] added activation checkpoint codegen (#1355 )	2022-07-25 09:39:10 +08:00
Frank Lee	05fae1fd56	[fx] added activation checkpointing annotation (#1349 ) * [fx] added activation checkpointing annotation * polish code * polish code	2022-07-21 11:14:28 +08:00
HELSON	7a8702c06d	[colotensor] add Tensor.view op and its unit test (#1343 ) [colotensor] add megatron initialization for gpt2	2022-07-21 10:53:15 +08:00
YuliangLiu0306	942c8cd1fb	[fx] refactor tracer to trace complete graph (#1342 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [fx] refactor tracer to trace complete graph * add comments and solve conflicts.	2022-07-20 11:20:38 +08:00
Frank Lee	2cc1175c76	[fx] tested the complete workflow for auto-parallel (#1336 ) * [fx] tested the complete workflow for auto-parallel * polish code * polish code * polish code	2022-07-20 10:45:17 +08:00
YuliangLiu0306	4631fef8a0	[fx]refactor tracer (#1335 )	2022-07-19 15:50:42 +08:00
HELSON	bf5066fba7	[refactor] refactor ColoTensor's unit tests (#1340 )	2022-07-19 15:46:24 +08:00
HELSON	f92c100ddd	[checkpoint] use gather_tensor in checkpoint and update its unit test (#1339 )	2022-07-19 14:15:28 +08:00
Frank Lee	f3ce7b8336	[fx] recovered skipped pipeline tests (#1338 )	2022-07-19 09:49:50 +08:00
ver217	0c51ff2c13	[hotfix] ZeroDDP use new process group (#1333 ) * process group supports getting ranks in group * chunk mgr receives a process group * update unit test * fix unit tests	2022-07-18 14:14:52 +08:00
Frank Lee	75abc75c15	[fx] fixed compatiblity issue with torch 1.10 (#1331 )	2022-07-18 11:41:27 +08:00
Frank Lee	169954f87e	[test] removed outdated unit test for meta context (#1329 )	2022-07-15 23:16:23 +08:00
ver217	7a05367101	[hotfix] shared model returns cpu state_dict (#1328 )	2022-07-15 22:11:37 +08:00
Frank Lee	b2475d8c5c	[fx] fixed unit tests for torch 1.12 (#1327 )	2022-07-15 18:22:15 +08:00
HELSON	d49708ae43	[hotfix] fix ddp for unit test test_gpt2 (#1326 )	2022-07-15 18:19:52 +08:00
Frank Lee	250be4d31e	[utils] integrated colotensor with lazy init context (#1324 ) * [utils] integrated colotensor with lazy init context * polish code * polish code * polish code	2022-07-15 17:47:12 +08:00
YuliangLiu0306	e8acf55e8b	[fx] add balanced policy v2 (#1251 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [fx] add balanced policy v2 * add unittest	2022-07-15 14:54:26 +08:00
XYE	ca2d3f284f	[fx] Add unit test and fix bugs for transform_mlp_pass (#1299 ) * add test and fix bugs * add functions back * add comments	2022-07-15 14:37:58 +08:00
HELSON	1b41686461	[hotfix] fix unit test test_module_spec (#1321 )	2022-07-15 14:02:32 +08:00
Jiarui Fang	9e4c6449b0	[checkpoint] add ColoOptimizer checkpointing (#1316 )	2022-07-15 09:52:55 +08:00
Jiarui Fang	85f933b58b	[Optimizer] Remove useless ColoOptimizer (#1312 )	2022-07-14 16:57:48 +08:00
Jiarui Fang	9f10524313	[Optimizer] polish the init method of ColoOptimizer (#1310 )	2022-07-14 16:37:33 +08:00
HELSON	36086927e1	[hotfix] fix ColoTensor GPT2 unitest (#1309 )	2022-07-14 16:37:20 +08:00
Jiarui Fang	3ef3791a3b	[checkpoint] add test for bert and hotfix save bugs (#1297 )	2022-07-14 15:38:18 +08:00
Jiarui Fang	bd71e2a88b	[hotfix] add missing file (#1308 )	2022-07-14 14:43:15 +08:00
Frank Lee	4f4d8c3656	[fx] added apex normalization to patched modules (#1300 ) * [fx] added apex normalization to patched modules * remove unused imports	2022-07-14 14:24:13 +08:00
Jiarui Fang	4165eabb1e	[hotfix] remove potiential circle import (#1307 ) * make it faster * [hotfix] remove circle import	2022-07-14 13:44:26 +08:00
YuliangLiu0306	93a75433df	[hotfix] skip some unittest due to CI environment. (#1301 )	2022-07-14 10:55:18 +08:00
HELSON	260a55804a	[hotfix] fix shape error in backward when using ColoTensor (#1298 )	2022-07-13 23:06:12 +08:00
Frank Lee	7e8114a8dd	[hotfix] skipped unsafe test cases (#1282 )	2022-07-13 00:08:59 +08:00
Jiarui Fang	79fe7b027a	[hotfix] test model unittest hotfix (#1281 )	2022-07-12 23:45:29 +08:00
Jiarui Fang	e56731e916	[hotfix] test_gpt.py duplicated (#1279 ) * make it faster * [hotfix] torchvison fx tests * [hotfix] rename duplicated named test_gpt.py	2022-07-12 23:29:17 +08:00
HELSON	abba4d84e1	[hotfix] fix bert model test in unitests (#1272 )	2022-07-12 23:26:45 +08:00
YuliangLiu0306	01ea68b2e6	[tests] remove T5 test skip decorator (#1271 )	2022-07-12 23:25:30 +08:00
Jiarui Fang	ca9d5ee91c	[hotfix] torchvison fx unittests miss import pytest (#1277 )	2022-07-12 23:04:06 +08:00
Jiarui Fang	c92f84fcdb	[tensor] distributed checkpointing for parameters (#1240 )	2022-07-12 15:51:06 +08:00
Frank Lee	4a09fc0947	[fx] fixed tracing with apex-based T5 model (#1252 ) * [fx] fixed tracing with apex-based T5 model * polish code * polish code	2022-07-12 15:19:25 +08:00
YuliangLiu0306	97d713855a	[fx] methods to get fx graph property. (#1246 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * manipulation * [fx]add graph manipulation methods. * [fx]methods to get fx graph property. * add unit test * add docstring to explain top node and leaf node in this context	2022-07-12 14:10:37 +08:00
YuliangLiu0306	30b4fc0eb0	[fx]add split module pass and unit test from pipeline passes (#1242 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [fx]add split module pass and unit test from pipeline passes * fix MNASNet bug * polish	2022-07-12 13:45:01 +08:00
Jiarui Fang	1aad903c15	[tensor] redistribute among different process groups (#1247 ) * make it faster * [tensor] rename convert_to_dist -> redistribute * [tensor] ShardSpec and ReplicaSpec * [tensor] redistribute among diff pgs * polish code	2022-07-12 10:24:05 +08:00
Jiarui Fang	9bcd2fd4af	[tensor] a shorter shard and replicate spec (#1245 )	2022-07-11 15:51:48 +08:00
Jiarui Fang	2699dfbbfd	[rename] convert_to_dist -> redistribute (#1243 )	2022-07-11 13:05:44 +08:00
HELSON	f6add9b720	[tensor] redirect .data.__get__ to a tensor instance (#1239 )	2022-07-11 11:41:29 +08:00
Jiarui Fang	20da6e48c8	[checkpoint] save sharded optimizer states (#1237 )	2022-07-08 16:33:13 +08:00
Jiarui Fang	4a76084dc9	[tensor] add zero_like colo op, important for Optimizer (#1236 )	2022-07-08 14:55:27 +08:00
Jiarui Fang	3b500984b1	[tensor] fix some unittests (#1234 )	2022-07-08 14:18:30 +08:00
HELSON	0453776def	[tensor] fix a assertion in colo_tensor cross_entropy (#1232 )	2022-07-08 11:18:00 +08:00
Jiarui Fang	0e199d71e8	[hotfix] fx get comm size bugs (#1233 ) * init a checkpoint dir * [checkpoint]support resume for cosinewarmuplr * [checkpoint]add unit test * fix some bugs but still not OK * fix bugs * make it faster * [checkpoint]support generalized scheduler * polish * [tensor] torch function return colotensor * polish * fix bugs * remove debug info * polish * polish * [tensor] test_model pass unittests * polish * [hotfix] fx get comm size bug Co-authored-by: ZhaoYi1222 <zhaoyi9499@gmail.com>	2022-07-08 10:54:41 +08:00
HELSON	42ab36b762	[tensor] add unitest for colo_tensor 1DTP cross_entropy (#1230 )	2022-07-07 19:17:23 +08:00
Yi Zhao	04537bf83e	[checkpoint]support generalized scheduler (#1222 )	2022-07-07 18:16:38 +08:00
Jiarui Fang	a98319f023	[tensor] torch function return colotensor (#1229 )	2022-07-07 18:09:18 +08:00
Frank Lee	5581170890	[fx] fixed huggingface OPT and T5 results misalignment (#1227 )	2022-07-07 16:29:58 +08:00
YuliangLiu0306	2b7dca44b5	[fx]get communication size between partitions (#1224 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [fx]get communication size between partitions. * polish	2022-07-07 16:22:00 +08:00
Frank Lee	84f2298a96	[fx] added patches for tracing swin transformer (#1228 )	2022-07-07 15:20:13 +08:00
Frank Lee	37fcf96b7f	[fx] fixed timm tracing result misalignment (#1225 )	2022-07-07 14:45:15 +08:00
Frank Lee	b6cb5a47ad	[fx] added timm model tracing testing (#1221 )	2022-07-07 14:02:17 +08:00
Jiarui Fang	15d988f954	[tensor] sharded global process group (#1219 )	2022-07-07 13:38:48 +08:00
Frank Lee	11973d892d	[fx] added torchvision model tracing testing (#1216 ) * [fx] added torchvision model tracing testing * remove unused imports	2022-07-06 21:37:56 +08:00
Jiarui Fang	52736205d9	[checkpoint] make unitest faster (#1217 )	2022-07-06 17:39:46 +08:00
Jiarui Fang	f38006ea83	[checkpoint] checkpoint for ColoTensor Model (#1196 )	2022-07-06 17:22:03 +08:00
Jiarui Fang	ae7d3f4927	[refactor] move process group from _DistSpec to ColoTensor. (#1203 )	2022-07-06 16:15:16 +08:00
Frank Lee	5da87ce35d	[fx] added testing for all albert variants (#1211 )	2022-07-06 15:11:08 +08:00
Frank Lee	2d13a45a3b	[fx] added testing for all gpt variants (#1210 ) * [fx] added testing for all gpt variants * polish code * polish code	2022-07-06 14:03:13 +08:00
YuliangLiu0306	189946c5c4	[fx]add uniform policy (#1208 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [fx]add uniform policy	2022-07-06 13:48:11 +08:00
Frank Lee	426a279ce7	[fx] added testing for all bert variants (#1207 ) * [fx] added testing for all bert variants * polish code	2022-07-06 10:50:49 +08:00
Frank Lee	f7878f465c	[fx] supported model tracing for huggingface bert (#1201 ) * [fx] supported model tracing for huggingface bert * polish test	2022-07-05 13:19:57 +08:00
Jiarui Fang	060b917daf	[refactor] remove gpc dependency in colotensor's _ops (#1189 )	2022-07-04 18:54:37 +08:00
Frank Lee	abf6a262dc	[fx] added module patch for pooling layers (#1197 )	2022-07-04 15:21:26 +08:00
YuliangLiu0306	63d2a93878	[context]support arbitary module materialization. (#1193 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [context]support arbitary module materialization. * [test]add numerical check for lazy init context.	2022-07-04 10:12:02 +08:00
YuliangLiu0306	2053e138a2	[context]use meta tensor to init model lazily. (#1187 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [context]use meta tensor to init model lazily. * polish * make module with device kwargs bypass the normal init. * change unit test to adapt updated context.	2022-06-29 21:02:30 +08:00
Frank Lee	2c8c05675d	[fx] patched conv and normalization (#1188 )	2022-06-29 18:58:38 +08:00
Frank Lee	6d86f1bc91	[fx] supported data-dependent control flow in model tracing (#1185 ) * [fx] supported data-dependent control flow in model tracing * polish code	2022-06-29 15:05:25 +08:00
Jiarui Fang	c463f8adf9	[tensor] remove gpc in tensor tests (#1186 )	2022-06-29 14:08:40 +08:00
Jiarui Fang	372f791444	[refactor] move chunk and chunkmgr to directory gemini (#1182 )	2022-06-29 13:31:02 +08:00
ver217	6b2f2ab9bb	[ddp] ColoDDP uses bucket all-reduce (#1177 ) * add reducer * update colo ddp with reducer * polish unit test * polish unit test	2022-06-29 10:34:13 +08:00
Jiarui Fang	7487215b95	[ColoTensor] add independent process group (#1179 )	2022-06-29 10:03:09 +08:00
Jiarui Fang	1b657f9ce1	[tensor] revert local view back (#1178 )	2022-06-27 18:38:34 +08:00
Jiarui Fang	0dd4e2bbfb	[Tensor] rename some APIs in TensorSpec and Polish view unittest (#1176 )	2022-06-27 15:56:11 +08:00
Jiarui Fang	aa7bef73d4	[Tensor] distributed view supports inter-process hybrid parallel (#1169 )	2022-06-27 09:45:26 +08:00
ver217	9e1daa63d2	[zero] sharded optim supports loading local state dict (#1170 ) * sharded optim supports loading local state dict * polish code * add unit test	2022-06-24 18:05:16 +08:00
ver217	561e90493f	[zero] zero optim supports loading local state dict (#1171 ) * zero optim supports loading local state dict * polish code * add unit test	2022-06-24 17:25:57 +08:00
Jiarui Fang	4b9bba8116	[ColoTensor] rename APIs and add output_replicate to ComputeSpec (#1168 )	2022-06-24 13:08:54 +08:00
Jiarui Fang	f4ef224358	[Tensor] remove ParallelAction, use ComputeSpec instread (#1166 )	2022-06-23 17:34:59 +08:00
Jiarui Fang	177c374401	remove gather out in parallel action (#1163 )	2022-06-23 16:35:05 +08:00
Jiarui Fang	07f9c781f9	[graph] improve the graph building. (#1157 )	2022-06-22 16:47:20 +08:00
ver217	22717a856f	[tensor] add embedding bag op (#1156 )	2022-06-22 15:54:03 +08:00
ver217	ae86151968	[tensor] add more element-wise ops (#1155 ) * add more element-wise ops * update test_op * polish unit test	2022-06-22 15:16:47 +08:00
ver217	ffa025e120	[tensor] dist spec s2s uses all-to-all (#1136 ) * dist spec s2s uses all-to-all * update unit test * add sanity check * polish unitest test with titans * add sanity check for DistMgr * add sanity check Co-authored-by: jiaruifang <fangjiarui123@gmail.com>	2022-06-22 11:32:38 +08:00
Jiarui Fang	ff644ee5e4	polish unitest test with titans (#1152 )	2022-06-22 09:58:02 +08:00
Jiarui Fang	8cdce0399c	[ColoTensor] improves init functions. (#1150 )	2022-06-21 18:28:38 +08:00
ver217	8106d7b8c7	[ddp] refactor ColoDDP and ZeroDDP (#1146 ) * ColoDDP supports overwriting default process group * rename ColoDDPV2 to ZeroDDP * add docstr for ZeroDDP * polish docstr	2022-06-21 16:35:23 +08:00
ver217	d26902645e	[ddp] add save/load state dict for ColoDDP (#1127 ) * add save/load state dict for ColoDDP * add unit test * refactor unit test folder * polish unit test * rename unit test	2022-06-20 10:51:47 +08:00
ver217	789cad301b	[hotfix] fix param op hook (#1131 ) * fix param op hook * update zero tp test * fix bugs	2022-06-17 16:12:05 +08:00
ver217	f0a954f16d	[ddp] add set_params_to_ignore for ColoDDP (#1122 ) * add set_params_to_ignore for ColoDDP * polish code * fix zero hook v2 * add unit test * polish docstr	2022-06-16 12:54:46 +08:00
YuliangLiu0306	fcf55777dd	[fx]add autoparallel passes (#1121 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * feature/add autoparallel passes	2022-06-15 16:36:46 +08:00
Frank Lee	16302a5359	[fx] added unit test for coloproxy (#1119 ) * [fx] added unit test for coloproxy * polish code * polish code	2022-06-15 15:27:51 +08:00
ver217	7d14b473f0	[gemini] gemini mgr supports "cpu" placement policy (#1118 ) * update gemini mgr * update chunk * add docstr * polish placement policy * update test chunk * update test zero * polish unit test * remove useless unit test	2022-06-15 15:05:19 +08:00
Frank Lee	53297330c0	[test] fixed hybrid parallel test case on 8 GPUs (#1106 )	2022-06-14 10:30:54 +08:00
ver217	1f894e033f	[gemini] zero supports gemini (#1093 ) * add placement policy * add gemini mgr * update mem stats collector * update zero * update zero optim * fix bugs * zero optim monitor os * polish unit test * polish unit test * add assert	2022-06-10 14:48:28 +08:00
Frank Lee	2b2dc1c86b	[pipeline] refactor the pipeline module (#1087 ) * [pipeline] refactor the pipeline module * polish code	2022-06-10 11:27:38 +08:00
Frank Lee	bad5d4c0a1	[context] support lazy init of module (#1088 ) * [context] support lazy init of module * polish code	2022-06-10 10:09:48 +08:00
ver217	be01db37c8	[tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077 ) * polish chunk manager * polish unit test * impl add_extern_static_tensor for chunk mgr * add mem stats collector v2 * polish code * polish unit test * polish code * polish get chunks	2022-06-09 20:56:34 +08:00
Ziyue Jiang	b3a03e4bfd	[Tensor] fix equal assert (#1091 ) * fix equal assert * polish	2022-06-09 17:36:15 +08:00
Frank Lee	50ec3a7e06	[test] skip tests when not enough GPUs are detected (#1090 ) * [test] skip tests when not enough GPUs are detected * polish code * polish code	2022-06-09 17:19:13 +08:00
Frank Lee	65ee6dcc20	[test] ignore 8 gpu test (#1080 ) * [test] ignore 8 gpu test * polish code * polish workflow * polish workflow	2022-06-08 23:14:18 +08:00
Ziyue Jiang	0653c63eaa	[Tensor] 1d row embedding (#1075 ) * Add CPU 1d row embedding * polish	2022-06-08 12:04:59 +08:00
ver217	1b17859328	[tensor] chunk manager monitor mem usage (#1076 )	2022-06-07 15:00:00 +08:00
Ziyue Jiang	4fc748f69b	[Tensor] fix optimizer for CPU parallel (#1069 )	2022-06-06 17:36:11 +08:00
Jiarui Fang	49832b2344	[refactory] add nn.parallel module (#1068 )	2022-06-06 15:34:41 +08:00
Jiarui Fang	a00644079e	reorgnize colotensor directory (#1062 ) * reorgnize colotensor directory * polish code	2022-06-03 18:04:22 +08:00
Ziyue Jiang	df9dcbbff6	[Tensor] add hybrid device demo and fix bugs (#1059 )	2022-06-03 12:09:49 +08:00
YuliangLiu0306	b167258b6a	[pipeline]refactor ppschedule to support tensor list (#1050 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * refactor ppschedule to support tensor list * polish	2022-06-02 13:48:59 +08:00
ver217	51b9a49655	[zero] add zero optimizer for ColoTensor (#1046 ) * add zero optimizer * torch ok * unit test ok * polish code * fix bugs * polish unit test * polish zero optim * polish colo ddp v2 * refactor folder structure * add comment * polish unit test * polish zero optim * polish unit test	2022-06-02 12:13:15 +08:00
ver217	7faef93326	fix dist spec mgr (#1045 )	2022-05-31 12:14:39 +08:00
ver217	9492a561c3	[tensor] ColoTensor supports ZeRo (#1015 ) * impl chunk manager * impl param op hook * add reduce_chunk * add zero hook v2 * add zero dp * fix TensorInfo * impl load balancing when using zero without chunk * fix zero hook * polish chunk * fix bugs * ddp ok * zero ok * polish code * fix bugs about load balancing * polish code * polish code * add ene-to-end test * polish code * polish code * polish code * fix typo * add test_chunk * fix bugs * fix bugs * polish code	2022-05-31 12:00:12 +08:00
YuliangLiu0306	9feff0f760	[titans]remove model zoo (#1042 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * rm model zoo	2022-05-31 10:40:47 +08:00
Ziyue Jiang	7c530b9de2	[Tensor] add Parameter inheritance for ColoParameter (#1041 ) * add Parameter inheritance for ColoParameter * remove tricks * remove tricks * polish * polish	2022-05-30 17:23:44 +08:00
Ziyue Jiang	6c5996a56e	[Tensor] add module check and bert test (#1031 ) * add Embedding * Add bert test * polish * add check module test * polish * polish * polish * polish	2022-05-26 18:15:42 +08:00
YuliangLiu0306	7106bd671d	[p2p]add object list send/recv (#1024 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [p2p]add object list send recv * refactor for code reusability * polish	2022-05-26 14:28:46 +08:00
Ziyue Jiang	32291dd73f	[Tensor] add module handler for linear (#1021 ) * add module spec for linear * polish * polish * polish	2022-05-26 11:50:44 +08:00
ver217	cefc29ff06	[tensor] impl ColoDDP for ColoTensor (#1009 ) * impl ColoDDP for ColoTensor * polish code	2022-05-21 13:52:04 +08:00
ver217	a3b66f6def	[tensor] refactor parallel action (#1007 ) * refactor parallel action * polish unit tests	2022-05-20 20:19:58 +08:00
ver217	8e3d0ad8f1	[unit test] refactor test tensor (#1005 ) * polish test_gpt * update op unit tests * update test model	2022-05-19 18:57:56 +08:00
ver217	ad536e308e	[tensor] refactor colo-tensor (#992 ) * refactor colo-tensor and update linear op * polish code * polish code * update ops and unit tests * update unit tests * polish code * rename dist_spec module * polish code * polish code * remove unneeded import * fix pipelinable	2022-05-19 12:44:59 +08:00
ver217	c2fdc6a011	[tensor] derive compute pattern from dist spec (#971 ) * derive compute pattern from dist spec * polish code	2022-05-16 14:58:08 +08:00
Ziyue Jiang	797a9dc5a9	add DistSpec for loss and test_model (#947 )	2022-05-13 20:29:50 +08:00
ver217	67c33f57eb	[tensor] design DistSpec and DistSpecManager for ColoTensor (#934 ) * add dist spec * update linear op * polish code * polish code * update embedding op * polish unit tests * polish unit tests * polish comments * polish code * add test_dist_spec_mgr * polish code * refactor folder structure * polish unit tests * add get_process_group() for TensorSpec * polish code	2022-05-13 15:13:52 +08:00
Ziyue Jiang	830d3bca26	[Tensor] add optimizer to bert test (#933 ) * add optimizer to bert test * polish	2022-05-13 11:37:23 +08:00
Ziyue Jiang	d73c2b1d79	[Tensor] fix init context (#931 ) * change torch.Parameter to ColoParameter * fix post assignment for init context * polish * polish	2022-05-11 15:48:12 +08:00
Ziyue Jiang	dfc88b85ea	[Tensor] simplify named param (#928 ) * simplify ColoModulize * simplify ColoModulize * polish * polish	2022-05-11 10:54:19 +08:00
ver217	45b9124df4	[tensor] hijack addmm for colo tensor (#923 ) * hijack addmm for colo tensor * fix bugs * polish unit test * polish comments	2022-05-09 18:55:49 +08:00
Jiarui Fang	534afb018a	test pretrain loading on multi-process (#922 )	2022-05-09 17:07:35 +08:00
Ziyue Jiang	c195d2814c	[Tensor] add from_pretrained support and bert pretrained test (#921 ) * add from_pretrained support and test * polish * polish * polish * polish	2022-05-09 16:11:47 +08:00
Jiarui Fang	845856ea29	[Graph] building computing graph with ColoTensor, Linear only (#917 )	2022-05-07 17:10:37 +08:00
Ziyue Jiang	75d221918a	[Tensor] add 1d vocab loss (#918 ) * add 1d vocab loss * polish	2022-05-07 15:49:14 +08:00
Ziyue Jiang	dfaff4e243	[Tensor] fix test_model (#916 ) * polish test_model * polish	2022-05-06 18:06:22 +08:00
Jiarui Fang	ed6426c300	[Tensor] polish model test (#915 )	2022-05-06 17:07:56 +08:00
Ziyue Jiang	0fab86b12a	[Tensor] add a basic bert. (#911 ) * add base bert test * Add bert test * polish * remove test_bert * polish	2022-05-06 15:03:43 +08:00
Jiarui Fang	ab95ec9aea	[Tensor] init ColoParameter (#914 )	2022-05-06 12:57:14 +08:00
Ziyue Jiang	193d629311	update pytest.mark.parametrize in tensor tests (#913 )	2022-05-06 11:16:40 +08:00
Ziyue Jiang	f593a5637e	[Tensor] add embedding tp1d row (#904 )	2022-04-29 14:10:05 +08:00
Ziyue Jiang	2c0d19d755	[Tensor] add ColoTensor TP1Dcol Embedding (#899 )	2022-04-28 17:45:06 +08:00
Jiarui Fang	d16671da75	[Tensor] initialize the ColoOptimizer (#898 ) * [Tensor] activation is an attr of ColoTensor * [Tensor] add optimizer * only detach parameters in context * polish code	2022-04-28 15:23:40 +08:00
Jiarui Fang	e76f76c08b	[Tensor] test parameters() as member function (#896 )	2022-04-28 10:57:14 +08:00
Ziyue Jiang	cb182da7c5	[tensor] refine linear and add gather for laynorm (#893 ) * refine linear and add function to ColoTensor * add gather for layernorm * polish * polish	2022-04-28 10:55:40 +08:00
Jiarui Fang	26c49639d8	[Tensor] overriding paramters() for Module using ColoTensor (#889 )	2022-04-27 15:28:59 +08:00
Ziyue Jiang	1d0aba4153	[tensor] add ColoTensor 1Dcol (#888 )	2022-04-27 14:13:55 +08:00
Jiarui Fang	a0e5971692	[Tensor] test model check results for a simple net (#887 )	2022-04-27 12:00:18 +08:00
Jiarui Fang	72cdc06875	[Tensor] make ColoTensor more robust for getattr (#886 ) * [Tensor] make ColoTensor more robust for getattr * polish * polish	2022-04-27 10:57:49 +08:00
Ziyue Jiang	9bc5a77c31	[tensor] wrap function in the torch_tensor to ColoTensor (#881 )	2022-04-26 20:13:56 +08:00
Jiarui Fang	7f76517a85	[Tensor] make a simple net works with 1D row TP (#879 )	2022-04-26 18:11:47 +08:00
ver217	c4d903e64a	[gemini] accelerate adjust_layout() (#878 ) * add lru cache * polish code * update unit test * fix sharded optim	2022-04-26 18:08:31 +08:00
Jiarui Fang	909211453b	[Tensor] Add some attributes to ColoTensor (#877 ) * [Tensor] add some function to ColoTensor * torch.allclose * rm torch.add	2022-04-26 15:10:47 +08:00
Jiarui Fang	e43f83aa5c	[Tensor] get named parameters for model using ColoTensors (#874 )	2022-04-26 14:08:01 +08:00
Jiarui Fang	96211c2cc8	[tensor] customized op returns ColoTensor (#875 ) * [tensor] customized op returns ColoTensor * polish * polish code	2022-04-26 13:23:59 +08:00
Ziyue Jiang	26d4ab8b03	[Tensor] Add function to spec and update linear 1Drow and unit tests (#869 )	2022-04-26 10:15:26 +08:00
Jiarui Fang	1190b2c4a4	[tensor] add cross_entrophy_loss (#868 )	2022-04-25 16:01:52 +08:00
HELSON	3107817172	[gemini] add stateful tensor container (#867 )	2022-04-25 14:58:16 +08:00
Jiarui Fang	d01d3b8cb0	colo init context add device attr. (#866 )	2022-04-25 14:24:26 +08:00
Jiarui Fang	126ba573a8	[Tensor] add layer norm Op (#852 )	2022-04-25 11:49:20 +08:00
Frank Lee	1258af71cc	[ci] cache cuda extension (#860 )	2022-04-25 10:03:47 +08:00
Ziyue Jiang	bcc8655021	[Tensor ] Add 1Drow weight reshard by spec (#854 )	2022-04-24 18:30:20 +08:00
Jiarui Fang	62f059251b	[Tensor] init a tp network training unittest (#849 )	2022-04-24 16:43:44 +08:00
Ziyue Jiang	2a0a427e04	[tensor]add assert for colo_tensor 1Drow (#846 )	2022-04-24 14:12:45 +08:00
Ziyue Jiang	05023ecfee	[Tensor] TP Linear 1D row (#843 )	2022-04-24 13:43:12 +08:00
HELSON	e5ea3fdeef	[gemini] add GeminiMemoryManger (#832 ) * refactor StatefulTensor, tensor utilities * add unitest for GeminiMemoryManager	2022-04-24 13:08:48 +08:00
YuliangLiu0306	35ea6e1023	[pipelinable]use pipelinable context to initialize non-pipeline model (#816 ) * [CLI] add CLI launcher * Revert "[CLI] add CLI launcher" This reverts commit `df7e6506d4`. * [pipeline]add module lazy init feature to support large model initization. * [pipeline]add to_layer_list and partition method to support arbitrary non-pp model * refactor the module structure * polish * [pipelinable]add unit test for pipelinable * polish * polish * Fix CodeFactor issues.	2022-04-24 13:03:12 +08:00
Jiarui Fang	ea0a2ed25f	[hotfix] the bug of numel() in ColoTensor (#845 )	2022-04-24 12:32:10 +08:00
Jiarui Fang	8789850eea	Init Conext supports lazy allocate model memory (#842 )	2022-04-22 18:03:35 +08:00
Frank Lee	943982d29a	[unittest] refactored unit tests for change in dependency (#838 )	2022-04-22 15:39:07 +08:00
Frank Lee	01e9f834f5	[dependency] removed torchvision (#833 ) * [dependency] removed torchvision * fixed transforms	2022-04-22 15:24:35 +08:00
Jiarui Fang	cb5a4778e1	Revert "[WIP] Applying ColoTensor on TP-1D-row Linear. (#831 )" (#835 ) This reverts commit `ac88de6dfc`.	2022-04-22 14:45:57 +08:00
Jiarui Fang	ac88de6dfc	[WIP] Applying ColoTensor on TP-1D-row Linear. (#831 ) * revert zero tensors back * [tensor] init row 1d linear	2022-04-22 14:03:26 +08:00
Jiarui Fang	294a6060d0	[tensor] ZeRO use ColoTensor as the base class. (#828 ) * [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. * [tensor] ZeRO use ColoTensor as the base class. * polish	2022-04-22 12:00:48 +08:00
Ziyue Jiang	8e6fdb4f29	[tensor]fix test_linear (#826 )	2022-04-21 17:18:56 +08:00
Ziyue Jiang	1a9e2c2dff	[tensor] fix kwargs in colo_tensor torch_funtion (#825 )	2022-04-21 16:47:35 +08:00
Jiarui Fang	2ecc3d7a55	[tensor] lazy init (#823 )	2022-04-21 15:40:23 +08:00
Jiarui Fang	660d2d1f1b	[Tensor] apply ColoTensor on Torch functions (#821 ) * Revert "[zero] add ZeroTensorShardStrategy (#793)" This reverts commit `88759e289e`. * [gemini] set cpu memory capacity * [log] local throughput collecting * polish * polish * polish * polish code * polish * polish code * add a new tensor structure and override linear for it * polish * polish * polish * polish * polish * polish * polish * polish * polish * polish * polish * [tensor] renaming and reorganize directory structure. * rm useless dir * polish * polish * [tensor] hander the function not wrapped	2022-04-21 14:21:10 +08:00
Jiarui Fang	0ce8924ceb	[tensor] reorganize files (#820 )	2022-04-21 14:15:48 +08:00
Jiarui Fang	ab962b9735	[gemini] a new tensor structure (#818 ) * Revert "[zero] add ZeroTensorShardStrategy (#793)" This reverts commit `88759e289e`. * [gemini] set cpu memory capacity * [log] local throughput collecting * polish * polish * polish * polish code * polish * polish code * add a new tensor structure and override linear for it * polish * polish * polish * polish * polish * polish * polish * polish * polish * polish * polish	2022-04-21 11:42:37 +08:00
Jiarui Fang	e761ad2cd7	Revert "[zero] add ZeroTensorShardStrategy (#793 )" (#806 )	2022-04-19 14:40:02 +08:00
HELSON	88759e289e	[zero] add ZeroTensorShardStrategy (#793 )	2022-04-19 14:32:45 +08:00
Jiarui Fang	681addb512	[refactor] moving grad acc logic to engine (#804 )	2022-04-19 14:03:21 +08:00
Jiarui Fang	4d9332b4c5	[refactor] moving memtracer to gemini (#801 )	2022-04-19 10:13:08 +08:00
HELSON	4c4388c46e	[hotfix] fix memory leak in zero (#781 )	2022-04-18 13:57:03 +08:00
Frank Lee	5a1a095b92	[test] refactored with the new rerun decorator (#763 ) * [test] refactored with the new rerun decorator * polish test case	2022-04-15 00:33:04 +08:00
Jiarui Fang	10ef8afdd2	[gemini] init genimi individual directory (#754 )	2022-04-14 16:40:26 +08:00
ver217	dcca614eee	[hotfix] fix test_stateful_tensor_mgr (#762 )	2022-04-14 15:50:09 +08:00
ver217	a93a7d7364	[hotfix] fix reuse_fp16_shard of sharded model (#756 ) * fix reuse_fp16_shard * disable test stm * polish code	2022-04-14 14:56:46 +08:00
HELSON	84c6700b2a	[zero] refactor memstats_collector (#746 )	2022-04-14 12:01:12 +08:00
ver217	e396bb71f2	[zero] add tensor placement policies (#743 ) * add tensor placement policies * polish comments * polish comments * update moe unit tests	2022-04-13 15:00:48 +08:00
HELSON	22c4b88d56	[zero] refactor ShardedParamV2 for convenience (#742 )	2022-04-13 14:54:26 +08:00
Frank Lee	f4f42d4c3c	[bug] fixed DDP compatibility with torch 1.8 (#739 )	2022-04-13 00:08:46 +08:00
Jiarui Fang	53cb584808	[utils] correct cpu memory used and capacity in the context of multi-process (#726 )	2022-04-12 14:57:54 +08:00
HELSON	b9b469ea50	[moe] add checkpoint for moe zero test (#729 )	2022-04-12 12:11:54 +08:00
FrankLeeeee	e88a498c9c	[test] removed trivial outdated test	2022-04-12 11:08:15 +08:00
FrankLeeeee	62b4ce7326	[test] added missing decorators to model checkpointing tests	2022-04-12 11:08:15 +08:00
Jiarui Fang	4d90a7b513	[refactor] zero directory (#724 )	2022-04-11 23:13:02 +08:00
Frank Lee	20ab1f5520	[bug] fixed broken test_found_inf (#725 )	2022-04-11 22:00:27 +08:00
Jiarui Fang	193dc8dacb	[refactor] refactor the memory utils (#715 )	2022-04-11 16:47:57 +08:00
HELSON	dbd96fe90a	[zero] check whether gradients have inf and nan in gpu (#712 )	2022-04-11 15:40:13 +08:00
HELSON	a9b8300d54	[zero] improve adaptability for not-shard parameters (#708 ) * adapt post grad hooks for not-shard parameters * adapt optimizer for not-shard parameters * offload gradients for not-replicated parameters	2022-04-11 13:38:51 +08:00
ver217	ab8c6b4a0e	[zero] refactor memstats collector (#706 ) * refactor memstats collector * fix disposable * polish code	2022-04-11 10:46:08 +08:00
HELSON	ee112fe1da	[zero] adapt zero hooks for unsharded module (#699 )	2022-04-08 20:23:26 +08:00
ver217	3c9cd5bb5e	[zero] stateful tensor manager (#687 ) * [WIP] stateful tensor manager * add eviction strategy * polish code * polish code * polish comment * add unit test * fix sampler bug * polish code * fix max sampling cnt resetting bug * fix sampler bug * polish code * fix bug * fix unit test Co-authored-by: jiaruifang <fangjiarui123@gmail.com>	2022-04-08 17:51:34 +08:00
HELSON	d7ecaf362b	[zero] fix init bugs in zero context (#686 ) * adapt model weight initialization for methods in Pytorch nn.init	2022-04-07 17:38:45 +08:00
Jiarui Fang	0aab52301e	[hotfix] fix a bug in model data stats tracing (#655 )	2022-04-03 21:48:06 +08:00
YuliangLiu0306	ade05a5d83	[refactor] pipeline, put runtime schedule into engine. (#627 )	2022-04-03 20:46:45 +08:00
HELSON	e5d615aeee	[hotfix] fix bugs in testing (#659 ) * remove hybrid adam in test_moe_zero_optim * fix activation checkpointing and its unitest	2022-04-02 21:58:47 +08:00
HELSON	b31daed4cf	fix bugs in CPU adam (#633 ) * add cpu adam counter for all cpu adam * fixed updating error in adam kernel	2022-04-02 17:04:05 +08:00

770 Commits (b29e1f07224298aea35aab7ee83284beac28e0d8)