HELSON
c6a1a62636
[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 ( #1786 )
...
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12
* [zero] add cpu shard init
* [zero] add tiny example test
* [colo_tensor] fix bugs for torch-1.11
2022-11-02 16:11:34 +08:00
kurisusnowdeng
0b8161fab8
updated tp layers
2022-11-02 12:19:38 +08:00
Jiarui Fang
cb5a587e9a
[hotfix] polish chunk import ( #1787 )
2022-11-02 12:10:52 +08:00
YuliangLiu0306
e859380bf7
[fx] support module with bias addition ( #1780 )
...
* [autoparallel] refactor tracer to fix bias addition issue
* [fx] support module with bias addition
* create bias_addition_module
* refactor file structure
* polish code
* fix unit test
2022-11-01 22:53:51 +08:00
Frank Lee
f3f19a5c47
[autoparallel] added matmul handler ( #1763 )
...
* [autoparallel] added matmul handler
* polish code
2022-11-01 15:14:53 +08:00
Ziyue Jiang
4df0194976
[Pipeline]Adapt to Pipelinable OPT ( #1782 )
2022-11-01 14:18:50 +08:00
YuliangLiu0306
27de252334
[autoparallel] fix conv handler numerical test ( #1771 )
2022-11-01 10:43:44 +08:00
Super Daniel
1e88811c7a
[autoparallel] move ckpt solvers to autoparallel folder / refactor code ( #1764 )
...
* [autoparallel] first move.
* [autoparallel] add solver rotor.
* [autoparallel] add ckpt solvers.
* [autoparallel] modify codegen.
* [fx] fix annotation in test.
* [fx] remove check.
* [autoparallel] polish docstring.
* [fx] refactor MetaTensor.
2022-11-01 10:43:15 +08:00
Jiarui Fang
f34dab4270
[compatibility] ChunkMgr import error ( #1772 )
2022-10-28 14:48:54 +08:00
YuliangLiu0306
b0f7c8bde8
[autoparallel] update CommSpec to CommActions ( #1768 )
...
* [autoparallel] update CommSpec to CommActions
* polish code
2022-10-28 09:57:43 +08:00
YuliangLiu0306
b4cc59b61e
[autoparallel] add numerical test for node strategies ( #1760 )
...
* [autoparallel] add numerical test for node strategies
* polish code
* polish code
2022-10-27 10:42:54 +08:00
oahzxl
25952b67d7
[feat] add flash attention ( #1762 )
2022-10-26 16:15:52 +08:00
Super Daniel
0584654c79
[fx] refactor memory utils and extend shard utils. ( #1754 )
...
* [fx] change memory.py to memory_utils.py.
* [fx] add shard utils.
* [fx] fix import.
* [fx] check code style.
* [fx] add comment.
* [autoparallel] first move.
* [fx] add time computations.
2022-10-26 14:24:41 +08:00
Ziyue Jiang
63f250bbd4
fix file name ( #1759 )
...
Co-authored-by: Ziyue Jiang <ziyue.jiang@gmail.com>
2022-10-25 16:48:48 +08:00
YuliangLiu0306
314d8c497f
[autoparallel] refactor the runtime apply pass and add docstring to passes ( #1757 )
...
* [autoparallel] refactor the runtime apply pass and add doc string to passes
* fix unit test
* polish
2022-10-25 14:32:22 +08:00
Frank Lee
f9a613d660
[autoparallel] added binary elementwise node handler ( #1758 )
...
* [autoparallel] added binary elementwise node handler
* polish code
2022-10-25 14:32:01 +08:00
YuliangLiu0306
d2fc067231
[autoparallel] fix param hook issue in transform pass ( #1755 )
2022-10-24 13:13:38 +08:00
Frank Lee
262652c8bc
[autoparallel] added addbmm handler ( #1751 )
2022-10-21 18:55:48 +08:00
YuliangLiu0306
980ed21723
[autoparallel] shard param and buffer as expected ( #1753 )
...
* [autoparallel] shard param and buffer as expected
* fix unit test issue
2022-10-21 15:45:13 +08:00
YuliangLiu0306
cdb7d5e7d2
[hotfix] autoparallel unit test ( #1752 )
2022-10-20 19:51:38 +08:00
YuliangLiu0306
a4ce180e85
[autoparallel] add sequential order to communication actions ( #1735 )
2022-10-20 18:48:18 +08:00
Frank Lee
474111ecb5
[autoparallel] fixed wrong sharding strategy in conv handler ( #1747 )
...
* [autoparallel] fixed wrong sharding strategy in conv handler
* polish code
2022-10-20 16:12:39 +08:00
Frank Lee
8b8937d901
[autoparallel] fixed wrong generated strategy for dot op ( #1746 )
...
* [autoparallel] fixed wrong generated strategy for dot op
* polish code
2022-10-20 15:18:16 +08:00
Frank Lee
993b8875b6
[autoparallel] handled illegal sharding strategy in shape consistency ( #1744 )
...
* [autoparallel] handled illegal sharding strategy in shape consistency
* polish code
2022-10-20 12:06:25 +08:00
Frank Lee
88a79814fb
[autoparallel] handled illegal strategy in node handler ( #1743 )
...
* [autoparallel] handled illegal strategy in node handler
* polish code
2022-10-19 17:08:52 +08:00
Super Daniel
30874f1692
[fx/profiler] debug the fx.profiler / add an example test script for fx.profiler ( #1730 )
...
* [fx/profiler] add test.
* [fx] fix file names.
* [fx] add docstring and comment.
* [fx] polish profiler.py.
* [fx] fix import errors.
* [fx] fix profiler.
* [fx] fix names.
2022-10-19 14:24:51 +08:00
Frank Lee
eee84908d4
[autoparallel] handled illegal sharding strategy ( #1728 )
...
* [autoparallel] handled illegal sharding strategy
* polish code
2022-10-19 12:53:06 +08:00
Sze-qq
23703c9dd6
[NFC] polish colossalai/nn/metric/_utils.py code style ( #1727 )
2022-10-19 12:20:51 +08:00
Ofey Chan
7e62af28a0
[NFC] polish accuracy_2d.py code style ( #1719 )
2022-10-19 12:20:51 +08:00
LuGY
730f88f8e1
[NFC] polish _checkpoint_hook.py code style ( #1722 )
2022-10-19 12:20:51 +08:00
CsRic
ea961d8fd1
[NFC] polish colossalai/zero/sharded_param/__init__.py code style ( #1717 )
...
Co-authored-by: ric <mkkt_bkkt@mail.ustc.edu.cn>
2022-10-19 12:20:51 +08:00
yuxuan-lou
2b49ca80a3
[NFC] polish colossalai/nn/lr_scheduler/linear.py code style ( #1716 )
2022-10-19 12:20:51 +08:00
shenggan
e1d780030d
[NFC] polish colossalai/nn/metric/accuracy_2p5d.py code style ( #1714 )
2022-10-19 12:20:51 +08:00
YuliangLiu0306
d373e67b99
[hotfix] resharding cost issue ( #1742 )
2022-10-19 11:33:43 +08:00
Jiarui Fang
24e84eba60
upgrade version to 0.1.11rc1 ( #1739 )
2022-10-19 11:26:00 +08:00
Frank Lee
d2e0e39c9d
[release] update to v0.1.11 ( #1736 )
2022-10-19 00:28:00 +08:00
HELSON
f69f9bf223
[zero] add chunk init function for users ( #1729 )
...
* add chunk manager init function
* fix unit tests
* add comment
* add flush=True
2022-10-18 16:31:22 +08:00
YuliangLiu0306
51b89d2202
[autoparallel] runtime_backward_apply ( #1720 )
2022-10-18 10:44:58 +08:00
Super Daniel
393f594051
[fx/meta/rpc] move _meta_registration.py to fx folder / register fx functions with compatibility checks / remove color debug ( #1710 )
...
* [fx] move meta registration
* [fx] fix tests.
* [fx] fix test.
* [fx] fix.
* [meta] refactor meta registration.py.
* [fx] add compatibility descriptions.
* [fx] polish import.
* [fx] add a decorator.
* [fx] fix tests.
* [fx] remove print.
* [fx] edit raise error.
* [fx] edit raise error.
* [fx] add type hint.
* [fx] fix import in experimental.
* [rpc] remove color debug.
* [meta] fix naming.
2022-10-18 10:44:23 +08:00
YuliangLiu0306
845ff4a47a
[autoparallel] resnet block runtime apply ( #1709 )
...
* [autoparallel] resnet block runtime apply
* separate buffer and parameter in MemoryCost
* polish code
* add comments and todos
* fix test issue
2022-10-17 13:37:38 +08:00
Frank Lee
22a115406b
[autoparallel] fixed broken node handler tests ( #1708 )
2022-10-14 18:25:59 +08:00
HELSON
1468e4bcfc
[zero] add constant placement policy ( #1705 )
...
* fixes memory leak when parameter is in fp16 in ZeroDDP init.
* bans chunk release in CUDA. A chunk may be released only when it is about to be offloaded.
* adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.
2022-10-14 17:53:16 +08:00
binmakeswell
5f41463a76
add optimizer README for tutorials ( #1707 )
2022-10-14 09:10:18 +00:00
Frank Lee
6c331a5a09
[autoparallel] refactored the autoparallel module for organization ( #1706 )
...
* [autoparallel] refactored the autoparallel module for organization
* polish code
2022-10-14 13:27:00 +08:00
Frank Lee
91cd34e6e0
[unittest] added doc for the pytest wrapper ( #1704 )
2022-10-14 10:56:17 +08:00
YuliangLiu0306
451cd72dea
[autoparallel] adapt runtime passes ( #1703 )
...
* [autoparallel] adapt runtime passes v2
* polish code
2022-10-14 10:14:07 +08:00
Jiarui Fang
21962e1593
[embedding] rename FreqAwareEmbedding -> CachedEmbedding ( #1699 )
2022-10-13 22:22:27 +08:00
Frank Lee
0e52f3d3d5
[unittest] supported conditional testing based on env var ( #1701 )
...
polish code
2022-10-13 19:38:45 +08:00
Frank Lee
8283e95db3
[autoparallel] collated all deprecated files ( #1700 )
...
* [autoparallel] collated all deprecated files
* polish code
2022-10-13 18:24:11 +08:00
Frank Lee
e2355d01b9
[autoparallel] init new folder structure ( #1696 )
2022-10-13 14:18:55 +08:00
YuliangLiu0306
81f7530ee7
[autoparallel] adapt solver and CostGraph with new handler ( #1695 )
...
* [autoparallel] adapt solver and CostGraph with new handler
* fix test issue
2022-10-13 14:04:15 +08:00
YuliangLiu0306
42b882ef06
[autoparallel] add output handler and placeholder handler ( #1694 )
...
* [autoparallel] add output handler and placeholder handler
* Delete test_solver_with_resnet.py
* fix test bugs
2022-10-13 13:42:36 +08:00
YuliangLiu0306
56088e6d98
[autoparallel] add pooling handler ( #1690 )
...
* [autoparallel] add pooling handler
* polish code
2022-10-13 13:42:13 +08:00
YuliangLiu0306
319d654f79
[autoparallel] where_handler_v2 ( #1688 )
...
* where generator
* [autoparallel] where_handler_v2
2022-10-13 11:02:22 +08:00
Boyuan Yao
31d2f03d27
[autoparallel] fix C version rotor inconsistency ( #1691 )
2022-10-12 15:21:58 +08:00
Jiarui Fang
363fc2861a
[embeddings] more detailed timer ( #1692 )
2022-10-12 12:01:21 +08:00
Frank Lee
4973157ad7
[autoparallel] added sharding spec conversion for linear handler ( #1687 )
2022-10-12 11:16:18 +08:00
YuliangLiu0306
af718e83f2
[autoparallel] add reshape handler v2 and fix some previous bug ( #1683 )
2022-10-11 18:12:59 +08:00
YuliangLiu0306
6878e42248
[hotfix] solver bug caused by dict type comm cost ( #1686 )
2022-10-11 17:57:03 +08:00
Super Daniel
3dd6994427
[fx/profiler] assigned UUID to each unrecorded tensor/ improved performance on GPT-2 ( #1679 )
...
* [fx/profiler] modify data_ptr into uuid for all tensors.
* [fx] modify uuid.
* [fx/profiler] tune performance on GPT-2.
* [fx] updates.
* [fx] debug.
* [fx] debug.
* [fx] cuda.
2022-10-11 11:03:35 +08:00
Kirigaya Kazuto
0df5034a36
[pipeline/fix-bug] num_microbatches support any integer | stable chimera | launch tool for rpc pp framework ( #1684 )
...
* [pipeline/tuning] improve dispatch performance both time and space cost
* [pipeline/converge] add interface for testing convergence
* [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style
* Update PipelineBase.py
* [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera
* [pipeline/chimera] test chimera | fix bug of initializing
* [pipeline/pytree] add pytree to process args and kwargs | provide `data_process_func` to process args and kwargs after forward
* [pipeline/fix-bug] num_microbatches support any integer | stable chimera | launch tool for rpc pp framework
2022-10-10 16:01:02 +08:00
jim
e5ab6be72e
[hotfix] fix colotensor.type() raise NotImplementedError ( #1682 )
2022-10-10 10:13:31 +08:00
Kirigaya Kazuto
3b2a59b0ba
[pipeline/rank_recorder] fix bug when process data before backward | add a tool for multiple ranks debug ( #1681 )
...
* [pipeline/tuning] improve dispatch performance both time and space cost
* [pipeline/converge] add interface for testing convergence
* [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style
* Update PipelineBase.py
* [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera
* [pipeline/chimera] test chimera | fix bug of initializing
* [pipeline/pytree] add pytree to process args and kwargs | provide `data_process_func` to process args and kwargs after forward
2022-10-09 17:32:57 +08:00
YuliangLiu0306
517b63939a
[autoparallel] add unary element wise handler v2 ( #1674 )
2022-10-09 17:30:42 +08:00
YuliangLiu0306
f6c6a932b8
[autoparallel] add following node generator ( #1673 )
...
* [autoparallel] add following node generator
* polish code
* polish code
* update name of arguments
2022-10-09 14:49:18 +08:00
YuliangLiu0306
52fda88796
[autoparallel] add layer norm handler v2 ( #1671 )
...
* [autoparallel] add layer norm handler v2
* polish code
* polish code
2022-10-09 14:23:22 +08:00
Fazzie-Maqianli
87c5ad352a
update version to 0.1.10 ( #1676 )
2022-10-09 10:43:29 +08:00
HELSON
b28991dd0a
[feature] A new ZeRO implementation ( #1644 )
2022-10-09 09:18:51 +08:00
Boyuan Yao
b1be5b88bd
[autoparallel] fix insecure subprocess ( #1680 )
...
* [autoparallel] fix insecure subprocess
* [fx] fix insecure subprocess
2022-10-06 15:07:03 +08:00
Boyuan Yao
d8420f81a4
[hotfix] fix wrong type name in profiler ( #1678 )
2022-10-05 21:59:05 +08:00
Boyuan Yao
132b4306b7
[fx] Add concrete info prop ( #1677 )
...
* [fx] concreteinfoprop
* [fx] add concreteinfoprop
* [fx] modify docstring of ConcreteInfoProp
* [fx] fix device error
* [fx] modify parameter calculation
* [fx] modify parameters calculation
2022-10-04 16:48:24 +08:00
Boyuan Yao
1df98d5b66
[autoparallel] add rotor C version ( #1658 )
...
* [autoparallel] add rotor c version
* [fx] remove metainfoprop in rotor solver
* [autoparallel] modify C code format
* [autoparallel] remove build.py
* [autoparallel] fix C extension build
* [autoparallel] add C solver consistency test
* [autoparallel] remove some unused imports
* [autoparallel] refactor rotor solver code
* [autoparallel] replace print with colossalai logger
* [autoparallel] ranks fixed
2022-10-03 17:13:30 +08:00
YuliangLiu0306
11ec070e53
[hotfix]unit test ( #1670 )
2022-09-29 12:49:28 +08:00
Frank Lee
a60024e77a
[autoparallel] added utils for broadcast operation ( #1665 )
...
* [autoparallel] added utils for broadcast operation
* polish code
2022-09-29 11:22:29 +08:00
YuliangLiu0306
3f068d1409
[autoparallel] update CommSpec ( #1667 )
2022-09-29 11:20:59 +08:00
Frank Lee
247a9dbca9
[autoparallel] added bias comm spec to matmul strategy ( #1664 )
2022-09-29 11:08:05 +08:00
YuliangLiu0306
746f8f979d
[autoparallel] add batch norm handler v2 ( #1666 )
2022-09-29 11:02:49 +08:00
Kirigaya Kazuto
9708638ded
[pipeline/pytree] add pytree to process args and kwargs | provide `data_process_func` to process args and kwargs after forward ( #1642 )
...
* [pipeline/tuning] improve dispatch performance both time and space cost
* [pipeline/converge] add interface for testing convergence
* [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style
* Update PipelineBase.py
* [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera
* [pipeline/chimera] test chimera | fix bug of initializing
* [pipeline/pytree] add pytree to process args and kwargs | provide `data_process_func` to process args and kwargs after forward
2022-09-29 10:58:58 +08:00
YuliangLiu0306
c27e701cb2
[autoparallel] remove no strategy nodes ( #1652 )
...
* [autoparallel] remove no strategy nodes
* fix none object iteration issue
2022-09-29 10:43:25 +08:00
Frank Lee
50f16a2850
[autoparallel] added compute resharding costs for node handler ( #1662 )
2022-09-28 19:55:44 +08:00
Frank Lee
9ec401a722
[autoparallel] added new strategy constructor template ( #1661 )
...
* [autoparallel] added new strategy constructor template
* polish code
2022-09-28 14:01:36 +08:00
Frank Lee
3a4d6f63a8
[autoparallel] added node handler for bmm ( #1655 )
2022-09-28 11:32:16 +08:00
YuliangLiu0306
095854477f
[autoparallel] add conv handler v2 ( #1663 )
2022-09-28 11:24:59 +08:00
YuliangLiu0306
1e7816a460
[autoparallel] adapt solver with gpt ( #1653 )
2022-09-28 11:17:26 +08:00
Jiarui Fang
c638bec028
[embedding] polish async copy ( #1657 )
2022-09-27 14:37:03 +08:00
Jiarui Fang
988570e4a6
[embedding] add more detail profiling ( #1656 )
2022-09-27 13:43:59 +08:00
Jiarui Fang
e1f97fd2b8
[embedding] print profiling results ( #1654 )
2022-09-27 12:50:33 +08:00
Frank Lee
30e50c8b4a
[autoparallel] implemented all matmul strategy generator ( #1650 )
2022-09-27 12:06:25 +08:00
YuliangLiu0306
03978aad45
[autoparallel] change the following nodes strategies generation logic ( #1636 )
...
* [autoparallel] change the following nodes strategies generation logic
* fix unit test
2022-09-27 11:20:52 +08:00
YuliangLiu0306
59f100510a
[autoparallel] where handler ( #1651 )
...
* [autoparallel] where handler
* fix unit test
2022-09-27 11:20:43 +08:00
Super Daniel
6135e178b3
[fx] refactor code for profiler / enable fake tensor movement. ( #1646 )
...
* [fx/profiling] provide summary for MetaInfoProp.
* [fx/profiler] provide a table of summary.
* [fx/profiler] provide a table of summary.
* [fx/profiler] provide a table of summary.
* [fx/profiler] provide a table of summary.
* [fx] optimize table repr.
* [fx] optimize table repr.
* [fx] refactor code for profiler.
* [fx] add docstring.
* [fx] add docstring.
* [fx] skip test.
* [fx] redo.
* [fx] redo.
* [fx] fix import error for torch11.
* [fx] fix import error for torch11.
2022-09-27 10:26:52 +08:00
Boyuan Yao
5d0fdb9cb4
[fx] fix offload codegen test ( #1648 )
...
* [fx] fix offload codegen test
* [fx] modify typing
2022-09-27 10:25:27 +08:00
Frank Lee
45b39a692a
[autoparallel] implemented linear projection strategy generator ( #1639 )
2022-09-26 16:58:14 +08:00
Frank Lee
154d3ef432
[fix] fixed the collective pattern name for consistency ( #1649 )
...
* [fix] fixed the collective pattern name for consistency
* polish code
2022-09-26 16:39:37 +08:00
YuliangLiu0306
b2b2a4af98
[autoparallel] adapt solver with mlp ( #1638 )
2022-09-26 15:26:14 +08:00
Jiarui Fang
04443605a5
[embedding] non-blocking cpu-gpu copy ( #1647 )
2022-09-26 14:57:57 +08:00
CsRic
0767f67a0f
[embedding] isolate cache_op from forward ( #1645 )
...
Co-authored-by: ric <mkkt_bkkt@mail.ustc.edu.cn>
2022-09-26 11:18:59 +08:00
Jiarui Fang
c5d39215f6
Revert "[feature] new zero implementation ( #1623 )" ( #1643 )
...
This reverts commit 5be118f405.
2022-09-26 10:06:03 +08:00
HELSON
5be118f405
[feature] new zero implementation ( #1623 )
2022-09-24 19:58:18 +08:00
Boyuan Yao
f921733621
[autoparallel] Add pofo sequence annotation ( #1637 )
...
* [autoparallel] annotate pofo sequence
* [autoparallel] remove unused print
* [autoparallel] fix some code
2022-09-24 01:52:57 +08:00
Super Daniel
04bbabeea8
[fx/profiler] provide a table of summary. ( #1634 )
...
* [fx/profiling] provide summary for MetaInfoProp.
* [fx/profiler] provide a table of summary.
* [fx] optimize table repr.
2022-09-23 18:12:43 +08:00
HELSON
95c35f73bd
[moe] initialize MoE groups by ProcessGroup ( #1640 )
2022-09-23 17:20:41 +08:00
Jiarui Fang
e57df80325
[embeddings] cache option ( #1635 )
2022-09-23 16:40:18 +08:00
HELSON
a088022efc
[moe] fix moe bugs ( #1633 )
2022-09-23 15:33:57 +08:00
YuliangLiu0306
702dbc5288
[tensor] use communication autograd func ( #1617 )
...
* [tensor] use communication autograd func
* change all to all comm spec info
* rename pattern and distinguish fwd/bwd
* polish code
2022-09-23 13:31:15 +08:00
YuliangLiu0306
c7ac0f4ab2
[autoparallel] add elementwise handler ( #1622 )
...
* [autoparallel] add elementwise handler
* polish code
* polish code
* reduce skipped strategies range
* polish code
2022-09-23 13:27:31 +08:00
YuliangLiu0306
3a46215135
[autoparallel] add embedding handler ( #1620 )
2022-09-23 12:34:30 +08:00
YuliangLiu0306
69448f64c4
[autoparallel] protect bcast handler from invalid strategies ( #1631 )
2022-09-23 12:12:49 +08:00
YuliangLiu0306
0c703189b9
[autoparallel] add layernorm handler ( #1629 )
2022-09-23 12:00:25 +08:00
YuliangLiu0306
bf77d3ab65
[autoparallel] recover the merged node strategy index ( #1613 )
2022-09-23 11:52:42 +08:00
Boyuan Yao
d6b01feb66
[fx] Modify offload codegen ( #1618 )
...
* [fx] modify offload codegen
* [fx] remove repeated hook definitions
* [fx] modify offload test
2022-09-23 11:04:52 +08:00
Super Daniel
d967779a32
[fx/profiler] tuned the calculation of memory estimation ( #1619 )
...
* [fx] tuned the meta info and rotor solver.
* [fx] remove import.
* [fx] remove import.
* [fx] remove import.
* [fx] tune the meta calculations.
* [fx] polish comments.
* [fx] remove assertions.
* [fx] modify test cases.
* [fx] modify test cases.
* [fx] optimize import.
* [fx
2022-09-23 10:59:47 +08:00
HELSON
f7f2248771
[moe] fix MoE bugs ( #1628 )
...
* remove forced FP32 modules
* correct no_shard-contexts' positions
2022-09-22 13:56:30 +08:00
Jiarui Fang
38c68b5b9a
[embedding] rollback for better FAW performance ( #1625 )
2022-09-22 11:16:25 +08:00
Frank Lee
d925122020
[autoparallel] added new linear module handler ( #1616 )
2022-09-21 12:23:21 +08:00
Kirigaya Kazuto
170fa81095
[pipeline/chimera] test chimera | fix bug of initializing ( #1615 )
...
* [pipeline/tuning] improve dispatch performance both time and space cost
* [pipeline/converge] add interface for testing convergence
* [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style
* Update PipelineBase.py
* [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera
* [pipeline/chimera] test chimera | fix bug of initializing
2022-09-20 18:00:39 +08:00
Jiarui Fang
504ff1d101
[embeddings] use cache_ratio instead of cuda_row_num ( #1611 )
2022-09-20 14:33:04 +08:00
YuliangLiu0306
6a8f8cc05e
[hotfix] got sliced types ( #1614 )
2022-09-20 14:32:42 +08:00
Frank Lee
d397842fa8
[autoparallel] added new node handler ( #1612 )
2022-09-20 14:17:21 +08:00
YuliangLiu0306
7d1bb71d5d
[fx] PoC of runtime shape consistency application ( #1607 )
...
* [fx] PoC of runtime shape consistency application
* polish code
2022-09-20 14:00:04 +08:00
YuliangLiu0306
47b11c432c
[autoparallel]add bcast matmul strategies ( #1605 )
2022-09-20 11:26:21 +08:00
Frank Lee
edb67cb378
[autoparallel] refactored the data structure for sharding strategy ( #1610 )
2022-09-20 11:20:54 +08:00
Boyuan Yao
933b6c6367
[fx] Add pofo solver ( #1608 )
...
* [fx] add pofo algorithm
* [fx] Add pofo solver
* [fx] code refactor
* [fx] fix test_linearize import
2022-09-20 11:20:48 +08:00
Kirigaya Kazuto
edc9e419ad
[pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera ( #1595 )
...
* [pipeline/tuning] improve dispatch performance both time and space cost
* [pipeline/converge] add interface for testing convergence
* [NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style
* Update PipelineBase.py
* [pipeline/chimera] reconstruct PipelineBase and Worker to support more feasible custom schedule | finish Chimera
2022-09-19 11:44:18 +08:00
ver217
c9e8ce67b8
fix move fp32 shards ( #1604 )
2022-09-16 17:33:16 +08:00
YuliangLiu0306
eac1b79371
[autoparallel] add bcast op handler ( #1600 )
...
* [autoparallel] add bcast op handler
* polish code
* add more BCAST FUNC OP
* polish code
* add exception handler
* polish
2022-09-16 11:33:01 +08:00
Frank Lee
3abf98a633
[autoparallel] added all non-bcast matmul strategies ( #1603 )
2022-09-16 10:47:32 +08:00
Frank Lee
db98b695b2
[autoparallel] added strategy generator and bmm strategies ( #1602 )
2022-09-15 16:57:07 +08:00
Jiarui Fang
a19eb80998
[embedding] updates some default parameters
2022-09-15 15:45:17 +08:00
Super Daniel
cd5cf2bcc9
[fx/tuning] tune performance on rotor with meta info. ( #1599 )
2022-09-15 14:46:36 +08:00
Boyuan Yao
a7cda6f57d
[fx] Add offload codegen ( #1598 )
...
* [fx] add input activation offload to codegen
* [fx] modify unit test
* [fx] remove two skips in torch11
* [fx] use all_input_nodes instead of _input_nodes
2022-09-14 15:49:06 +08:00
Super Daniel
c8e9b2ad78
[hotfix/rotor] fix variable names ( #1597 )
...
* [fx] add some comment and docstrings.
* [fx] add dataflow analysis for an autograd graph.
* add interpretation for graph analysis.
* [fx] before doing save_tensor_hooks.
* [fx] provide an accurate estimation of memory except for GPT-2.
* [fx] provide an accurate estimation of memory except for GPT-2.
* [fx] provide an accurate estimation of memory except for GPT-2.
* [fx] a very accurate version on GPT-2.
* [fx] refactor code.
* [fx] remove redundant inplace=True.
* [fx] refactor code.
* [fx] refactor code.
* [fx] refactor code.
* [fx] dive into backward memory.
* [fx] fix variable names in ckpt_solvers and unskip tests.
* [fx] commit my changes.
* [fx] restore skips.
* [fx] restore skips.
* [fx] change stage into phase.
* [fx] change stage into phase.
* [fx] change stage into phase.
2022-09-14 14:27:04 +08:00
YuliangLiu0306
faa23b9d9a
[autoparallel] add reshape handler ( #1594 )
...
* [autoparallel] add reshape handler
* polish code
2022-09-14 10:25:45 +08:00
Super Daniel
5c494d4540
[fx] provide an accurate estimation of memory. ( #1587 )
...
* [fx] add some comment and docstrings.
* [fx] add dataflow analysis for an autograd graph.
* add interpretation for graph analysis.
* [fx] before doing save_tensor_hooks.
* [fx] provide an accurate estimation of memory except for GPT-2.
* [fx] provide an accurate estimation of memory except for GPT-2.
* [fx] provide an accurate estimation of memory except for GPT-2.
* [fx] a very accurate version on GPT-2.
* [fx] refactor code.
* [fx] remove redundant inplace=True.
* [fx] refactor code.
* [fx] refactor code.
* [fx] refactor code.
* [fx] dive into backward memory.
2022-09-14 09:36:43 +08:00
Frank Lee
27fe8af60c
[autoparallel] refactored shape consistency to remove redundancy ( #1591 )
...
* [autoparallel] refactored shape consistency to remove redundancy
* polish code
* polish code
* polish code
2022-09-13 18:30:18 +08:00
YuliangLiu0306
d164449d00
[autoparallel] add resnet autoparallel unit test and add backward weight communication cost ( #1589 )
2022-09-13 18:05:05 +08:00
Frank Lee
7c18a588c8
[autoparallel] added generate_sharding_spec to utils ( #1590 )
2022-09-13 15:43:22 +08:00
Boyuan Yao
49ccf8b5f8
[fx] Improve linearize and rotor solver ( #1586 )
...
* [fx] add nested activation_checkpoint codegen
* undo algorithms commits
* solver
* undo some commits
* [fx] torch11 add nested activation checkpoint codegen
* remove some imports
* [fx] add some comments in activation codegen
* [fx] codegen instance error fix
* [fx] improve linearize and rotor solver
* [fx] some comments and format modification
2022-09-13 14:50:04 +08:00
Frank Lee
219f66c571
[autoparallel] added solver option dataclass ( #1588 )
2022-09-13 14:47:09 +08:00
YuliangLiu0306
82d4376c23
[autoparallel] adapt solver with resnet ( #1583 )
...
* [autoparallel]adapt solver with resnet
* polish code
* polish code
2022-09-13 12:07:09 +08:00
CsRic
f3403ff98e
[embeddings] add already_split_along_rank flag for tablewise mode ( #1584 )
2022-09-13 10:50:34 +08:00
Boyuan Yao
f3687e4ee2
[fx] Add nested checkpoint in activation checkpoint codegen ( #1585 )
...
* [fx] add nested activation_checkpoint codegen
* undo algorithms commits
* solver
* undo some commits
* [fx] torch11 add nested activation checkpoint codegen
* remove some imports
* [fx] add some comments in activation codegen
* [fx] codegen instance error fix
2022-09-12 20:00:48 +08:00
Boyuan Yao
20e466527b
[NFC] polish ./colossalai/trainer/hooks/_lr_scheduler_hook.py code style ( #1576 )
2022-09-08 22:11:04 +08:00
Fazzie-Maqianli
06dccdde44
[NFC] polish colossalai/zero/sharded_model/reduce_scatter.py code style ( #1554 )
2022-09-08 22:11:04 +08:00
CsRic
2ac46f7be4
[NFC] polish utils/tensor_detector/__init__.py code style ( #1573 )
...
Co-authored-by: ric <mkkt_bkkt@mail.ustc.edu.cn>
2022-09-08 22:11:04 +08:00
Sze-qq
2144cbae8c
[NFC] polish colossalai/nn/lr_scheduler/multistep.py code style ( #1572 )
2022-09-08 22:11:04 +08:00
superhao1995
e4bf7ae667
[NFC] polish colossalai/nn/lr_scheduler/torch.py code style ( #1571 )
...
Co-authored-by: Research <research@soccf-snr3-017.comp.nus.edu.sg>
2022-09-08 22:11:04 +08:00
Jiatong Han
3263cdf57f
[NFC] polish colossalai/nn/parallel/data_parallel.py code style ( #1570 )
...
Co-authored-by: JThh <jiatong.han@u.nus.edu>
2022-09-08 22:11:04 +08:00
Zirui Zhu
f566c9b98d
[NFC] polish colossalai/pipeline/utils.py code style ( #1562 )
2022-09-08 22:11:04 +08:00
Xue Fuzhao
e070ca45c6
[NFC] polish colossalai/fx/tracer/meta_patch/patched_module/convolution.py code style ( #1563 )
2022-09-08 22:11:04 +08:00
Zangwei Zheng
9823cbf24b
[NFC] polish colossalai/gemini/update/chunkv2.py code style ( #1565 )
2022-09-08 22:11:04 +08:00
DouJS
f586887a90
[NFC] polish colossalai/nn/layer/colossalai_layer/dropout.py code style ( #1568 )
2022-09-08 22:11:04 +08:00
LuGY
c7d4932956
[NFC] polish colossalai/utils/tensor_detector/tensor_detector.py code style ( #1566 )
2022-09-08 22:11:04 +08:00
BigOneLiXiaoMing
0c4c9aa6e0
[NFC] polish colossalai/nn/_ops/embedding.py code style ( #1561 )
2022-09-08 22:11:04 +08:00
Ziheng Qin
08815f0e72
[NFC] polish colossalai/builder/__init__.py code style ( #1560 )
...
Co-authored-by: henryqin1997 <henryqin1997@gamil.com>
2022-09-08 22:11:04 +08:00
Super Daniel
8328917348
[NFC] polish colossalai/testing/comparison.py code style. ( #1558 )
2022-09-08 22:11:04 +08:00
Ofey Chan
7cc052f6c0
[NFC] polish colossalai/nn/layer/colossalai_layer/linear.py ( #1556 )
2022-09-08 22:11:04 +08:00
Kai Wang (Victor Kai)
46931e3c32
[NFC] polish code colossalai/gemini/update/search_utils.py ( #1557 )
2022-09-08 22:11:04 +08:00
yuxuan-lou
413f9c19f4
[NFC] polish colossalai/nn/_ops/layernorm.py code style ( #1555 )
2022-09-08 22:11:04 +08:00
shenggan
8edb777cc2
[NFC] polish colossalai/nn/loss/loss_2p5d.py code style ( #1553 )
2022-09-08 22:11:04 +08:00
Maruyama_Aya
bd2d789832
[NFC] polish colossalai/nn/_ops/embedding_bag.py code style ( #1552 )
2022-09-08 22:11:04 +08:00
binmakeswell
73e9eb13b7
[NFC] polish colossalai/nn/lr_scheduler/cosine.py code style
2022-09-08 22:11:04 +08:00
Kirigaya Kazuto
318fbf1145
[NFC] polish colossalai/utils/multi_tensor_apply/multi_tensor_apply.py code style ( #1559 )
2022-09-08 22:04:34 +08:00
CsRic
a389ac4ec9
[embedding] cache_embedding small improvement ( #1564 )
2022-09-08 16:41:19 +08:00
ver217
10dd8226b1
add gather_output for VocabParallelClassifier1D ( #1569 )
2022-09-08 16:40:56 +08:00
Kirigaya Kazuto
6159d45417
[pipeline/tuning] improve dispatch performance both time and space cost ( #1544 )
2022-09-07 19:01:06 +08:00
Super Daniel
4f59693207
[fx] provide a stable but not accurate enough version of profiler. ( #1547 )
...
* [fx] compute memory stat and flop count for MetaInfoProp.
* [fx] modify node attribute.
* [fx] modify ckpt_chen.
* [fx] fix compatibility.
* [fx] fix import error.
* [fx] skip test for MetaInfoProp.
* [fx] skip test for MetaInfoProp.
* [fx] skip test for MetaInfoProp.
* [fx] skip test for MetaInfoProp.
* [fx] skip if torch 1.11.0.
* [fx] recover MetaInfoProp support for PyTorch 1.11.
* [fx] provide a stable but not accurate enough version of profiler.
* [fx] provide a stable but not accurate enough version of profiler.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix compatibility in tests.
* [fx] fix import error.
2022-09-07 11:21:04 +08:00
YuliangLiu0306
0908d0fc61
[autoparallel]add backward cost info into strategies ( #1524 )
2022-09-07 11:19:00 +08:00
YuliangLiu0306
1a3599410d
[autoparallel] support function in operator handler ( #1529 )
2022-09-07 11:18:41 +08:00
YuliangLiu0306
44c866a3e3
[autoparallel] change the merge node logic ( #1533 )
2022-09-07 11:18:19 +08:00
ver217
ae71036cd2
[utils] refactor parallel layers checkpoint and bcast model on loading checkpoint ( #1548 )
...
* refactor parallel layer
* broadcast rank0 model after load ckpt
2022-09-06 20:18:35 +08:00
ver217
2bed096848
[utils] optimize partition_tensor_parallel_state_dict ( #1546 )
2022-09-06 17:45:31 +08:00
Super Daniel
d8a5aded19
[hotfix] change namespace for meta_trace. ( #1541 )
2022-09-06 11:46:12 +08:00
ver217
a203b709d5
[hotfix] fix init context ( #1543 )
...
* fix init context
* fix lazy init ctx
2022-09-06 11:45:08 +08:00
Jiarui Fang
64169f3e8f
[embedding] polish parallel embedding tablewise ( #1545 )
2022-09-06 10:41:20 +08:00
Boyuan Yao
46c6cc79a9
[fx] Add common node in model linearize ( #1542 )
...
* [fx] Add common node into linearize
* [fx] Add common node to solver
2022-09-05 18:35:05 +08:00
CsRic
964123ae0f
[embedding] freq_aware_embedding: add small functions for caller application ( #1537 )
2022-09-05 15:12:53 +08:00
Super Daniel
70129603aa
[fx] support meta tracing for aten level computation graphs like functorch. ( #1536 )
...
* [fx] support meta tracing for aten level computation graphs like functorch.
* [fx] remove redundant import.
* [fx] add docstring.
2022-09-05 12:10:09 +08:00
Jiarui Fang
521078ffc9
[embedding] fix a bug in table wise sharding ( #1538 )
2022-09-02 15:48:35 +08:00
Jiarui Fang
87134524fd
[embedding] tablewise sharding polish ( #1535 )
2022-09-02 11:09:37 +08:00
Boyuan Yao
56159049e8
[fx] Modify solver linearize and add corresponding test ( #1531 )
...
* [fx] modify solver linearize and add test
* [fx] add torch11 test of linearize but skip it
* [fx] remove some unused imports
2022-09-02 10:24:41 +08:00
YuliangLiu0306
4b3d6caeb3
[fx] patch nn.functional convolution ( #1528 )
2022-09-01 19:05:07 +08:00
CsRic
5156d5b4f8
[embedding] add tablewise sharding for FAW ( #1526 )
2022-09-01 17:55:41 +08:00
Kirigaya Kazuto
f1e1836218
[pipeline/pipeline_process_group] finish PipelineProcessGroup to manage local and global rank in TP, DP and PP ( #1508 )
...
* support p2p communication with any type of object | pass test
* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test
* [engine/schedule] use p2p_v2 to reconstruct pipeline_schedule
* [pipeline/rpc] implement a demo for PP with cuda rpc framework
* [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B
* [pipeline/rpc] implement distributed optimizer | test with assert_close
* [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy
* [pipeline/pipeline_process_group] finish PipelineProcessGroup to manage local and global rank in TP, DP and PP
* [pipeline/pipeline_process_group] remove comment
* [pipeline/pipeline_process_group] skip process group test
* [pipeline/pipeline_process_group] remove test named function
2022-09-01 17:45:47 +08:00
Super Daniel
112a1f0a8f
[hotfix] avoid conflict of meta registry with torch 1.13.0. ( #1530 )
...
* [hotfix] avoid conflict of meta registry with torch 1.13.0.
2022-09-01 15:31:21 +08:00
Boyuan Yao
b231430bcb
[fx] Fix wrong index in annotation and minimal flops in ckpt solver ( #1521 )
...
* [fx] fix wrong variable name in solver rotor
* [fx] fix the discretize bug
* [fx] fix the first op in activation checkpoint codegen
* [fx] fix some bugs of ckpt solver
* [fx] modify test_ckpt_torchvision
* [fx] set sequence to __sequence__ attr of GraphModule
* [fx] docstring modification
* [fx] remove performance test
2022-08-31 18:10:48 +08:00
Super Daniel
5cc849f6ce
[fx] hack __torch_dispatch__ for meta tensor and autograd. ( #1515 )
...
* [fx] hack __torch_dispatch__ for meta tensor and autograd.
* [fx] add bad case detections.
* [fx] rename MetaTensor attributes.
* [fx] fix unexpected error.
* [fx] add register backward for native_batch_norm_backward.
* [fx] add more meta backend support for nn.Modules.
* [fx] add meta backend to support timm and torchvision models.
* [fx] add meta hardswish for timm models.
2022-08-31 16:30:16 +08:00
Jiarui Fang
4537d39df9
[doc] docstring for FreqAwareEmbeddingBag ( #1525 )
2022-08-31 13:52:30 +08:00
YuliangLiu0306
3345c6d352
[autoparallel] add strategies constructor ( #1505 )
...
* [autoparallel] add strategies constructor
* remove duplicated strategies
* polish code
* adapt cost graph with StrategiesConstructor
* polish
2022-08-30 16:32:09 +08:00
Frank Lee
a0436a62ee
[autoparallel] added liveness analysis ( #1516 )
...
* [autoparallel] added liveness analysis
* remove memory cost
2022-08-30 15:54:37 +08:00
Jiarui Fang
9a9ef65313
[FAW] cpu caching operations ( #1520 )
2022-08-30 14:50:02 +08:00
Super Daniel
ea1a95b8b9
[hotfix] fix coloproxy typos. ( #1519 )
2022-08-30 11:39:03 +08:00
Jiarui Fang
af5438caa2
[FAW] refactor reorder() for CachedParamMgr ( #1514 )
2022-08-29 14:22:07 +08:00
Jiarui Fang
9feee6d06b
[FAW] LFU initialize with dataset freq ( #1513 )
2022-08-29 12:52:53 +08:00
CsRic
1b8fee8e9c
[FAW] shrink freq_cnter size ( #1509 )
2022-08-29 11:44:55 +08:00
Boyuan Yao
4acc58ee20
[fx] Fix activation codegen dealing with checkpointing first op ( #1510 )
2022-08-27 19:39:21 +08:00
Boyuan Yao
ac3a453a50
[fx] fix the discretize bug ( #1506 )
...
* [fx] fix wrong variable name in solver rotor
* code modification
* [fx] fix the discretize bug
2022-08-26 17:15:52 +08:00
Boyuan Yao
31fffd3fc5
[fx] fix wrong variable name in solver rotor ( #1502 )
...
* [fx] fix wrong variable name in solver rotor
* code modification
2022-08-26 15:47:08 +08:00
Jiarui Fang
ba61109b6c
[FAW] remove code related to chunk ( #1501 )
2022-08-26 14:23:30 +08:00
Jiarui Fang
d5085bb317
[FAW] add more docs and fix a warning ( #1500 )
2022-08-26 14:10:21 +08:00
Kirigaya Kazuto
5a6fd71f90
[pipeline/rpc] update outstanding mechanism | optimize dispatching strategy ( #1497 )
...
* support p2p communication with any type of object | pass test
* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test
* [engine/schedule] use p2p_v2 to reconstruct pipeline_schedule
* [pipeline/rpc] implement a demo for PP with cuda rpc framework
* [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B
* [pipeline/rpc] implement distributed optimizer | test with assert_close
* [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy
2022-08-26 14:04:23 +08:00
CsRic
0ed2f46131
[FAW] FAW embedding uses LRU as eviction strategy initialized with dataset stats ( #1494 )
2022-08-26 11:24:12 +08:00
YuliangLiu0306
8b7d6bd5be
[autoparallel] add more sharding strategies to conv ( #1487 )
2022-08-26 11:17:56 +08:00
Boyuan Yao
de1e716dc4
[fx] Add activation checkpoint solver rotor ( #1496 )
...
* [fx] fix defining ckpt functions inside forward
* [fx] Modify activation checkpoint codegen and add ColoGraphModule
* [fx] some modification
* some modifications
* some code modifications
* [automatic_parallel] ckpt solver rotor
* [fx] add ckpt_solver_rotor
* [fx] modification
* code refactor
2022-08-26 10:34:21 +08:00
Super Daniel
09c023bee2
[fx] add more op patches for profiler and error message for unsupported ops. ( #1495 )
...
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] merge development into main (#1 )
* [fx] activation checkpointing using Chen strategies.
* [fx] add test for ckpt_solver_chen
* [fx] add vanilla activation checkpoint search with test on resnet and densenet
* [fx] add a namespace code for solver_chen.
* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174 .
* [fx] fix lowercase naming conventions.
* [fx] simplify test for ckpt.
* [fx] add rules to linearize computation graphs for searching. (#2 )
* [fx] fix test and algorithm bugs in activation checkpointing.
* [fx] polish ckpt_test.
* [fx] add rules to linearize computation graphs for searching.
* [fx] remove chen_sqrt for sake of simplicity
* [fx] fix inconsistencies.
* [fx] fix MetaInfoProp.
* [fx] consider MetaInfoProp for inplace operands.
* [fx] add profiler for fx nodes.
* [fx] fix error in tests.
* [fx] unfix bug.
* [fx] patch more modules and functions.
* [fx] change name of utils.py to profiler.py
* [fx] add profiler for rnn.
* [fx] polish and add more patch for profiler.
2022-08-25 23:11:13 +08:00
YuliangLiu0306
413c053453
[autoparallel] add cost graph class ( #1481 )
...
* [autoparallel] add cost graph class
* polish code
2022-08-25 17:19:59 +08:00
YuliangLiu0306
4b03c25f85
[tensor] add 1D device mesh ( #1492 )
2022-08-25 16:48:12 +08:00
CsRic
b8d0e39eaf
[FAW] LFU cache for the FAW
2022-08-25 13:08:46 +08:00
Kirigaya Kazuto
9145aef2b4
[pipeline/rpc] implement distributed optimizer | test with assert_close ( #1486 )
...
* support p2p communication with any type of object | pass test
* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test
* [engine/schedule] use p2p_v2 to reconstruct pipeline_schedule
* [pipeline/rpc] implement a demo for PP with cuda rpc framework
* [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B
* [pipeline/rpc] implement distributed optimizer | test with assert_close
2022-08-25 10:49:01 +08:00
Frank Lee
3da68d6b1b
[fx] fixed adaptive pooling size concatenation error ( #1489 )
2022-08-25 09:05:07 +08:00
Jiarui Fang
cde7b8a5b8
[FAW] init an LFU implementation for FAW ( #1488 )
2022-08-24 17:37:22 +08:00
Super Daniel
32efe8e740
[fx] add profiler for fx nodes. ( #1480 )
...
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] merge development into main (#1 )
* [fx] activation checkpointing using Chen strategies.
* [fx] add test for ckpt_solver_chen
* [fx] add vanilla activation checkpoint search with test on resnet and densenet
* [fx] add a namespace code for solver_chen.
* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174 .
* [fx] fix lowercase naming conventions.
* [fx] simplify test for ckpt.
* [fx] add rules to linearize computation graphs for searching. (#2 )
* [fx] fix test and algorithm bugs in activation checkpointing.
* [fx] polish ckpt_test.
* [fx] add rules to linearize computation graphs for searching.
* [fx] remove chen_sqrt for sake of simplicity
* [fx] fix inconsistencies.
* [fx] fix MetaInfoProp.
* [fx] consider MetaInfoProp for inplace operands.
* [fx] add profiler for fx nodes.
* [fx] fix error in tests.
* [fx] unfix bug.
2022-08-24 16:22:44 +08:00
Frank Lee
d39e11dffb
[autoparallel] added namespace constraints ( #1490 )
2022-08-24 15:44:07 +08:00
Kirigaya Kazuto
a6c8749198
[pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B ( #1483 )
...
* support p2p communication with any type of object | pass test
* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test
* [engine/schedule] use p2p_v2 to reconstruct pipeline_schedule
* [pipeline/rpc] implement a demo for PP with cuda rpc framework
* [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B
2022-08-24 11:19:46 +08:00
Geng Zhang
0aad53c62b
[FCE] update interface for frequency statistics in FreqCacheEmbedding ( #1462 )
2022-08-23 17:38:24 +08:00
Frank Lee
ede326298b
[autoparallel] integrate auto parallel with torch fx ( #1479 )
2022-08-23 14:23:08 +08:00
Boyuan Yao
1f2e547f7a
[fx] Fix ckpt functions' definitions in forward ( #1476 )
...
* [fx] fix defining ckpt functions inside forward
* [fx] Modify activation checkpoint codegen and add ColoGraphModule
* [fx] some modification
* some modifications
* some code modifications
2022-08-22 16:59:54 +08:00
Kirigaya Kazuto
bb5f5289e0
[pipeline/rpc] implement a demo for PP with cuda rpc framework ( #1470 )
...
* support p2p communication with any type of object | pass test
* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test
* [engine/schedule] use p2p_v2 to reconstruct pipeline_schedule
* [pipeline/rpc] implement a demo for PP with cuda rpc framework
* Delete p2p_v2.py
* Delete _pipeline_schedule_v2.py
* Delete test_object_list_p2p_v2.py
* Delete test_boardcast_send_recv_v2.py
* Delete test_cifar_with_data_pipeline_tensor_v2.py
2022-08-22 10:50:51 +08:00
Frank Lee
628c7e3fc8
[autoparallel] added dot handler ( #1475 )
2022-08-22 10:32:17 +08:00
Frank Lee
9dae9bb2bc
[autoparallel] introduced baseclass for op handler and reduced code redundancy ( #1471 )
...
* [autoparallel] introduced baseclass for op handler and reduced code redundancy
* polish code
2022-08-19 16:51:38 +08:00
Frank Lee
3a54e1c9b7
[autoparallel] standardize the code structure ( #1469 )
2022-08-19 15:51:54 +08:00
YuliangLiu0306
26a37b5cd5
[autoparallel] Add conv handler to generate strategies and costs info for conv ( #1467 )
2022-08-19 14:57:23 +08:00
Jiarui Fang
1b491ad7de
[doc] update docstring in ProcessGroup ( #1468 )
2022-08-19 13:41:57 +08:00
YuliangLiu0306
b73fb7a077
[tensor] support runtime ShardingSpec apply ( #1453 )
...
* [tensor] support runtime ShardingSpec apply
* polish code
2022-08-19 13:39:51 +08:00
Super Daniel
bbc58d881b
[fx] fix MetaInfoProp for incorrect calculations and add detections for inplace op. ( #1466 )
...
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] merge development into main (#1 )
* [fx] activation checkpointing using Chen strategies.
* [fx] add test for ckpt_solver_chen
* [fx] add vanilla activation checkpoint search with test on resnet and densenet
* [fx] add a namespace code for solver_chen.
* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174 .
* [fx] fix lowercase naming conventions.
* [fx] simplify test for ckpt.
* [fx] add rules to linearize computation graphs for searching. (#2 )
* [fx] fix test and algorithm bugs in activation checkpointing.
* [fx] polish ckpt_test.
* [fx] add rules to linearize computation graphs for searching.
* [fx] remove chen_sqrt for sake of simplicity
* [fx] fix inconsistencies.
* [fx] fix MetaInfoProp.
* [fx] consider MetaInfoProp for inplace operands.
2022-08-18 11:27:06 +08:00
Super Daniel
e7383f578b
[fx] add rules to linearize computation graphs for searching. ( #1461 )
...
* [fx] polish ckpt_test.
* [fx] add rules to linearize computation graphs for searching.
* [fx] remove chen_sqrt for sake of simplicity
* [fx] fix inconsistencies.
2022-08-17 14:47:12 +08:00
Boyuan Yao
092b9c8f49
[fx] Add use_reentrant=False to checkpoint in codegen ( #1463 )
...
* [utils] Add use_reentrant=False into colossalai checkpoint
* [utils] add some annotation in utils.activation_checkpoint
* [test] add reset_seed at the beginning of tests in test_activation_checkpointing.py
* [test] modify test_activation_checkpoint.py
* [test] modify test for reentrant=False
* [fx] Add use_reentrant=False of checkpoint into codegen
2022-08-17 10:34:50 +08:00
Boyuan Yao
47fd8e4a02
[utils] Add use_reentrant=False in utils.activation_checkpoint ( #1460 )
...
* [utils] Add use_reentrant=False into colossalai checkpoint
* [utils] add some annotation in utils.activation_checkpoint
* [test] add reset_seed at the beginning of tests in test_activation_checkpointing.py
* [test] modify test_activation_checkpoint.py
* [test] modify test for reentrant=False
2022-08-16 15:39:20 +08:00
Jiarui Fang
36824a304c
[Doc] add more doc for ColoTensor. ( #1458 )
2022-08-16 10:38:41 +08:00
Jiarui Fang
a1476ea882
[NFC] polish doc style for ColoTensor ( #1457 )
2022-08-16 09:21:05 +08:00
Super Daniel
0dbd61c29b
[fx] fix test and algorithm bugs in activation checkpointing. ( #1451 )
...
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] merge development into main (#1 )
* [fx] activation checkpointing using Chen strategies.
* [fx] add test for ckpt_solver_chen
* [fx] add vanilla activation checkpoint search with test on resnet and densenet
* [fx] add a namespace code for solver_chen.
* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174 .
* [fx] fix lowercase naming conventions.
* [fx] simplify test for ckpt.
* [fx] fix test and algorithm bugs in activation checkpointing.
* mend [fx] fix test and algorithm bugs in activation checkpointing.
* [fx] polish ckpt_test.
2022-08-15 19:09:19 +08:00
Jiarui Fang
b1553fdf96
[NFC] global vars should be upper case ( #1456 )
2022-08-15 09:50:29 +08:00
ver217
367c615818
fix nvme docstring ( #1450 )
2022-08-12 18:01:02 +08:00
Geng Zhang
9f3eed66eb
[FAW] reorganize the inheritance struct of FreqCacheEmbedding ( #1448 )
2022-08-12 15:55:46 +08:00
Frank Lee
5a52e21fe3
[test] fixed the activation codegen test ( #1447 )
...
* [test] fixed the activation codegen test
* polish code
2022-08-12 14:52:31 +08:00
YuliangLiu0306
0f3042363c
[tensor] shape consistency generate transform path and communication cost ( #1435 )
...
* [tensor] shape consistency output transform path and communication cost
* polish code
2022-08-12 14:02:32 +08:00
Boyuan Yao
5774fe0270
[fx] Use colossalai checkpoint and add offload recognition in codegen ( #1439 )
...
* [fx] Use colossalai.utils.checkpoint to replace torch.utils.checkpoint for offload activation and add offload annotation recognition in codegen
* Modification of test and add TODO in codegen
* [fx] Modification of colossal ckpt usage
* [fx] add gpc.destroy() to test_codegen
2022-08-12 12:23:30 +08:00
Kirigaya Kazuto
e9460b45c8
[engine/schedule] use p2p_v2 to reconstruct pipeline_schedule ( #1408 )
...
* support p2p communication with any type of object | pass test
* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test
* [communication] add p2p_v2.py to support communication with List[Any]
* Delete _pipeline_schedule_v2.py
* Delete test_cifar_with_data_pipeline_tensor_v2.py
* [engine/schedule] use p2p_v2 to reconstruct pipeline_schedule
* Delete p2p_v2.py
* Delete test_boardcast_send_recv_v2.py
* Delete test_object_list_p2p_v2.py
* [engine/schedule] use p2p_v2 to reconstruct pipeline_schedule
* [communication] remove print code
* [engine/schedule] shorten the running time of testing file to prevent cancelling in CI
2022-08-12 11:33:26 +08:00
Frank Lee
ae1b58cd16
[tensor] added linear implementation for the new sharding spec ( #1416 )
...
* [tensor] added linear implementation for the new sharding spec
* polish code
2022-08-12 11:33:09 +08:00
Super Daniel
d40a9392ba
[fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174 . ( #1446 )
...
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
* [fx] activation checkpointing using Chen strategies.
* [fx] add test for ckpt_solver_chen
* mend
* [fx] add vanilla activation checkpoint search with test on resnet and densenet
* [fx] add a namespace code for solver_chen.
* [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174 .
* [fx] fix lowercase naming conventions.
2022-08-12 11:28:50 +08:00
ver217
821c6172e2
[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer ( #1442 )
2022-08-11 22:58:58 +08:00
HELSON
b80340168e
[zero] add chunk_managerV2 for all-gather chunk ( #1441 )
2022-08-11 19:17:24 +08:00
Super Daniel
3b26516c69
[fx] add vanilla activation checkpoint search with test on resnet and densenet ( #1433 )
...
* [fx] activation checkpointing using Chen strategies.
* [fx] add test for ckpt_solver_chen
* [fx] add vanilla activation checkpoint search with test on resnet and densenet
* [fx] add a namespace code for solver_chen.
2022-08-11 15:46:39 +08:00
Jiarui Fang
30b4dd17c0
[FAW] export FAW in _ops ( #1438 )
2022-08-11 13:43:24 +08:00
HELSON
9056677b13
[zero] add chunk size searching algorithm for parameters in different groups ( #1436 )
2022-08-11 13:32:19 +08:00
Jiarui Fang
c9427a323f
hotfix #1434 ( #1437 )
2022-08-11 13:14:25 +08:00
HELSON
039b7ed3bc
[polish] add update directory in gemini; rename AgChunk to ChunkV2 ( #1432 )
2022-08-10 16:40:29 +08:00
Super Daniel
f20cb4e893
[fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages ( #1425 )
...
* [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages
2022-08-10 16:36:35 +08:00
Jiarui Fang
10b3df65c8
[FAW] move coloparam setting in test code. ( #1429 )
2022-08-10 14:31:53 +08:00
Jiarui Fang
cb98cf5558
[FAW] parallel FreqAwareEmbedding ( #1424 )
2022-08-10 13:44:30 +08:00
HELSON
0d212183c4
[zero] add has_inf_or_nan in AgChunk; enhance the unit test of AgChunk ( #1426 )
2022-08-10 11:37:28 +08:00
YuliangLiu0306
33f0744d51
[tensor] add shape consistency feature to support auto spec transform ( #1418 )
...
* [tensor] add shape consistency feature to support auto sharding spec transform.
* [tensor] remove unused argument in simulator, add doc string for target pair.
2022-08-10 11:29:17 +08:00
HELSON
4fb3c52cf0
[zero] add unit test for AgChunk's append, close, access ( #1423 )
2022-08-09 18:03:10 +08:00
HELSON
c577ed016e
[zero] add AgChunk ( #1417 )
2022-08-09 16:39:48 +08:00
Jiarui Fang
d209aff684
Add FreqAwareEmbeddingBag ( #1421 )
2022-08-09 16:26:12 +08:00
ver217
6df3e19be9
[hotfix] zero optim prevents calling inner optim.zero_grad ( #1422 )
2022-08-09 16:08:12 +08:00
Jiarui Fang
504419d261
[FAW] add cache manager for the cached embedding ( #1419 )
2022-08-09 15:17:17 +08:00
Kirigaya Kazuto
44fd3c83ab
[communication] add p2p_v2.py to support communication with List[Any] ( #1407 )
...
* support p2p communication with any type of object | pass test
* reconstruct pipeline schedule with p2p_v2.py(support communication with List[Any]) | pass test
* [communication] add p2p_v2.py to support communication with List[Any]
* Delete _pipeline_schedule_v2.py
* Delete test_cifar_with_data_pipeline_tensor_v2.py
* [engine/schedule] use p2p_v2 to reconstruct pipeline_schedule
* [communication] remove print code
2022-08-09 11:40:04 +08:00
YuliangLiu0306
7c96055c68
[tensor] build sharding spec to replace distspec in future. ( #1405 )
2022-08-08 11:15:57 +08:00
ver217
12b4887097
[hotfix] fix CPUAdam kernel nullptr ( #1410 )
2022-08-05 19:45:45 +08:00
YuliangLiu0306
0442f940f0
[device] add DeviceMesh class to support logical device layout ( #1394 )
...
* [device] add DeviceMesh class to support logical device layout
* polish code
* add doc string
2022-08-02 19:23:48 +08:00
ver217
04c9a86af8
[zero] ZeroDDP supports controlling outputs' dtype ( #1399 )
2022-08-02 17:49:11 +08:00
HELSON
4e98e938ce
[zero] alleviate memory usage in ZeRODDP state_dict ( #1398 )
2022-08-02 15:49:13 +08:00
ver217
56b8863b87
[zero] chunk manager allows filtering ex-large params ( #1393 )
2022-08-02 10:40:27 +08:00
Frank Lee
7d6293927f
[fx] patched torch.max and data movement operator ( #1391 )
...
* [fx] patched torch.max and data movement operator
* polish code
2022-08-01 15:31:50 +08:00
Frank Lee
89e60d1505
[fx] fixed indentation error in checkpointing codegen ( #1385 )
2022-07-30 00:27:12 +08:00
HELSON
c7221cb2d4
[hotfix] adapt ProcessGroup and Optimizer to ColoTensor ( #1388 )
2022-07-29 19:33:24 +08:00
Frank Lee
ad678921db
[fx] patched torch.full for huggingface opt ( #1386 )
2022-07-29 17:56:28 +08:00
HELSON
527758b2ae
[hotfix] fix a running error in test_colo_checkpoint.py ( #1387 )
2022-07-29 15:58:06 +08:00
Jiarui Fang
f792507ff3
[chunk] add PG check for tensor appending ( #1383 )
2022-07-29 13:27:05 +08:00
ver217
8dced41ad0
[zero] zero optim state_dict takes only_rank_0 ( #1384 )
...
* zero optim state_dict takes only_rank_0
* fix unit test
2022-07-29 13:22:50 +08:00
YuliangLiu0306
df54481473
[hotfix] fix some bugs during gpt2 testing ( #1379 )
2022-07-28 17:21:07 +08:00
ver217
828b9e5e0d
[hotfix] fix zero optim save/load state dict ( #1381 )
2022-07-28 17:19:39 +08:00
HELSON
b6fd165f66
[checkpoint] add kwargs for load_state_dict ( #1374 )
2022-07-28 15:56:52 +08:00
ver217
83328329dd
[hotfix] fix zero ddp buffer cast ( #1376 )
...
* fix zero ddp buffer cast
* fix zero ddp ignore params
2022-07-28 10:54:44 +08:00
ver217
5d5031e946
fix zero ddp state dict ( #1378 )
2022-07-28 09:31:42 +08:00
Frank Lee
0c1a16ea5b
[util] standard checkpoint function naming ( #1377 )
2022-07-28 09:29:30 +08:00
YuliangLiu0306
52bc2dc271
[fx] update split module pass and add customized policy ( #1373 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [fx] update split module pass and add customized policy
2022-07-27 13:40:54 +08:00
Super Daniel
be229217ce
[fx] add torchaudio test ( #1369 )
...
* [fx] add torchaudio test
* [fx] add torchaudio test and test patches
* Delete ~
* [fx] add patches and patches test
* [fx] fix patches
* [fx] fix rnn patches
* [fx] merge upstream
* [fx] fix import errors
2022-07-27 11:03:14 +08:00
ver217
c415240db6
[nvme] CPUAdam and HybridAdam support NVMe offload ( #1360 )
...
* impl nvme optimizer
* update cpu adam
* add unit test
* update hybrid adam
* update docstr
* add TODOs
* update CI
* fix CI
* fix CI
* fix CI path
* fix CI path
* fix CI path
* fix install tensornvme
* fix CI
* fix CI path
* fix CI env variables
* test CI
* test CI
* fix CI
* fix nvme optim __del__
* fix adam __del__
* fix nvme optim
* fix CI env variables
* fix nvme optim import
* test CI
* test CI
* fix CI
2022-07-26 17:25:24 +08:00
HELSON
8463290642
[checkpoint] use args, kwargs in save_checkpoint, load_checkpoint ( #1368 )
2022-07-26 14:41:53 +08:00
YuliangLiu0306
5542816690
[fx]add gpt2 passes for pipeline performance test ( #1366 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [fx]add gpt2 passes for pipeline performance test
2022-07-26 14:31:00 +08:00
HELSON
87775a0682
[colotensor] use cpu memory to store state_dict ( #1367 )
2022-07-26 14:13:38 +08:00
HELSON
943a96323e
[hotfix] fix no optimizer in save/load ( #1363 )
2022-07-26 10:53:53 +08:00
Frank Lee
cd063ac37f
[fx] added activation checkpoint codegen support for torch < 1.12 ( #1359 )
2022-07-25 23:35:31 +08:00
Frank Lee
644582eee9
[fx] added activation checkpoint codegen ( #1355 )
2022-07-25 09:39:10 +08:00
ver217
6b43c789fd
fix zero optim backward_by_grad and save/load ( #1353 )
2022-07-21 16:43:58 +08:00
ver217
d068af81a3
[doc] update rst and docstring ( #1351 )
...
* update rst
* add zero docstr
* fix docstr
* remove fx.tracer.meta_patch
* fix docstr
* fix docstr
* update fx rst
* fix fx docstr
* remove useless rst
2022-07-21 15:54:53 +08:00
Frank Lee
274c1a3b5f
[fx] fixed apex normalization patch exception ( #1352 )
2022-07-21 15:29:11 +08:00
ver217
ce470ba37e
[checkpoint] sharded optim save/load grad scaler ( #1350 )
2022-07-21 15:21:21 +08:00
Frank Lee
05fae1fd56
[fx] added activation checkpointing annotation ( #1349 )
...
* [fx] added activation checkpointing annotation
* polish code
* polish code
2022-07-21 11:14:28 +08:00
YuliangLiu0306
051592c64e
[fx] update MetaInforProp pass to process more complex node.meta ( #1344 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [fx] update MetaInforProp pass to process more complex node.meta
2022-07-21 10:57:52 +08:00
HELSON
7a8702c06d
[colotensor] add Tensor.view op and its unit test ( #1343 )
...
[colotensor] add megatron initialization for gpt2
2022-07-21 10:53:15 +08:00
YuliangLiu0306
942c8cd1fb
[fx] refactor tracer to trace complete graph ( #1342 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [fx] refactor tracer to trace complete graph
* add comments and solve conflicts.
2022-07-20 11:20:38 +08:00
Frank Lee
2cc1175c76
[fx] tested the complete workflow for auto-parallel ( #1336 )
...
* [fx] tested the complete workflow for auto-parallel
* polish code
* polish code
* polish code
2022-07-20 10:45:17 +08:00
YuliangLiu0306
4631fef8a0
[fx]refactor tracer ( #1335 )
2022-07-19 15:50:42 +08:00
HELSON
f92c100ddd
[checkpoint] use gather_tensor in checkpoint and update its unit test ( #1339 )
2022-07-19 14:15:28 +08:00
ver217
0c51ff2c13
[hotfix] ZeroDDP use new process group ( #1333 )
...
* process group supports getting ranks in group
* chunk mgr receives a process group
* update unit test
* fix unit tests
2022-07-18 14:14:52 +08:00
Frank Lee
75abc75c15
[fx] fixed compatibility issue with torch 1.10 ( #1331 )
2022-07-18 11:41:27 +08:00
ver217
7a05367101
[hotfix] shared model returns cpu state_dict ( #1328 )
2022-07-15 22:11:37 +08:00
Frank Lee
b2475d8c5c
[fx] fixed unit tests for torch 1.12 ( #1327 )
2022-07-15 18:22:15 +08:00
HELSON
d49708ae43
[hotfix] fix ddp for unit test test_gpt2 ( #1326 )
2022-07-15 18:19:52 +08:00
Frank Lee
250be4d31e
[utils] integrated colotensor with lazy init context ( #1324 )
...
* [utils] integrated colotensor with lazy init context
* polish code
* polish code
* polish code
2022-07-15 17:47:12 +08:00
YuliangLiu0306
e8acf55e8b
[fx] add balanced policy v2 ( #1251 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [fx] add balanced policy v2
* add unittest
2022-07-15 14:54:26 +08:00
XYE
ca2d3f284f
[fx] Add unit test and fix bugs for transform_mlp_pass ( #1299 )
...
* add test and fix bugs
* add functions back
* add comments
2022-07-15 14:37:58 +08:00
HELSON
1b41686461
[hotfix] fix unit test test_module_spec ( #1321 )
2022-07-15 14:02:32 +08:00
Jiarui Fang
9e4c6449b0
[checkpoint] add ColoOptimizer checkpointing ( #1316 )
2022-07-15 09:52:55 +08:00
ver217
7c70bfbefa
[hotfix] fix PipelineSharedModuleGradientHandler ( #1314 )
2022-07-14 17:31:13 +08:00
Jiarui Fang
85f933b58b
[Optimizer] Remove useless ColoOptimizer ( #1312 )
2022-07-14 16:57:48 +08:00
Jiarui Fang
9f10524313
[Optimizer] polish the init method of ColoOptimizer ( #1310 )
2022-07-14 16:37:33 +08:00
Jiarui Fang
3ef3791a3b
[checkpoint] add test for bert and hotfix save bugs ( #1297 )
2022-07-14 15:38:18 +08:00
Frank Lee
4f4d8c3656
[fx] added apex normalization to patched modules ( #1300 )
...
* [fx] added apex normalization to patched modules
* remove unused imports
2022-07-14 14:24:13 +08:00
Jiarui Fang
4165eabb1e
[hotfix] remove potential circular import ( #1307 )
...
* make it faster
* [hotfix] remove circle import
2022-07-14 13:44:26 +08:00
HELSON
260a55804a
[hotfix] fix shape error in backward when using ColoTensor ( #1298 )
2022-07-13 23:06:12 +08:00
runluo
f83c4d6597
[NFC] polish colossalai/nn/layer/wrapper/pipeline_wrapper.py code style ( #1303 )
2022-07-13 19:01:07 +08:00
binmakeswell
7696cead8d
Recover kernel files
2022-07-13 12:08:21 +08:00
XYE
e83b2ce853
[NFC] polish colossalai/nn/layer/vanilla/layers.py code style ( #1295 )
2022-07-13 12:08:21 +08:00
Liping233
1000a41fd5
[NFC] polish colossalai/nn/layer/vanilla/__init__.py code style ( #1293 )
2022-07-13 12:08:21 +08:00
Maruyama_Aya
87f679aeae
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/kernels.h code style ( #1291 )
2022-07-13 12:08:21 +08:00
Wangbo Zhao(黑色枷锁)
552667825b
[NFC] polish colossalai/nn/layer/parallel_1d/layers.py code style ( #1290 )
2022-07-13 12:08:21 +08:00
doubleHU
d6f5ef8860
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/transform_kernels.cu code style ( #1286 )
2022-07-13 12:08:21 +08:00
Ziheng Qin
6d6c01e94d
[NFC] polish colossalai/__init__.py code style ( #1285 )
2022-07-13 12:08:21 +08:00
Jiatong Han
38e3ccd1e9
[NFC] polish colossalai/nn/layer/parallel_sequence/layers.py code style ( #1280 )
...
Co-authored-by: JThh <jiatong.han@u.nus.edu>
2022-07-13 12:08:21 +08:00
Boyuan Yao
b414eaa5db
[NFC] polish colossalai/nn/optimizer/lamb.py code style ( #1275 )
2022-07-13 12:08:21 +08:00
yuxuan-lou
5f6ab35d25
Hotfix/format ( #1274 )
...
* [NFC] Polish colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu code style. (#937 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/cuda_util.h code style
* [NFC] polish colossalai/kernel/cuda_native/csrc/scaled_masked_softmax.cpp code style
Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>
2022-07-13 12:08:21 +08:00
Super Daniel
52d145a342
[NFC] polish colossalai/nn/lr_scheduler/onecycle.py code style ( #1269 )
2022-07-13 12:08:21 +08:00
Geng Zhang
0e06f62160
[NFC] polish colossalai/nn/layer/parallel_sequence/_operation.py code style ( #1266 )
2022-07-13 12:08:21 +08:00
binmakeswell
c95e18cdb9
[NFC] polish colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.h code style ( #1270 )
2022-07-13 12:08:21 +08:00
xyupeng
94bfd35184
[NFC] polish colossalai/builder/builder.py code style ( #1265 )
2022-07-13 12:08:21 +08:00
DouJS
db13f96333
[NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh code style ( #1264 )
2022-07-13 12:08:21 +08:00
shenggan
5d7366b144
[NFC] polish colossalai/kernel/cuda_native/csrc/scaled_masked_softmax.h code style ( #1263 )
2022-07-13 12:08:21 +08:00
Zangwei Zheng
197a2c89e2
[NFC] polish colossalai/communication/collective.py ( #1262 )
2022-07-13 12:08:21 +08:00
ziyu huang
f1cafcc73a
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/dropout_kernels.cu code style ( #1261 )
...
Co-authored-by: “Arsmart123 <202476410arsmart@gmail.com>
2022-07-13 12:08:21 +08:00
Sze-qq
f8b9aaef47
[NFC] polish colossalai/kernel/cuda_native/csrc/type_shim.h code style ( #1260 )
2022-07-13 12:08:21 +08:00
superhao1995
f660152c73
[NFC] polish colossalai/nn/layer/parallel_3d/_operation.py code style ( #1258 )
...
Co-authored-by: Research <research@soccf-snr3-017.comp.nus.edu.sg>
2022-07-13 12:08:21 +08:00
Thunderbeee
9738fb0f78
[NFC] polish colossalai/nn/lr_scheduler/__init__.py ( #1255 )
...
code style
2022-07-13 12:08:21 +08:00
Kai Wang (Victor Kai)
50f2ad213f
[NFC] polish colossalai/engine/ophooks/utils.py code style ( #1256 )
2022-07-13 12:08:21 +08:00
Ofey Chan
2dd4d556fb
[NFC] polish colossalai/nn/init.py code style ( #1292 )
2022-07-13 10:51:55 +08:00
Jiarui Fang
556b9b7e1a
[hotfix] Dist Mgr gather torch version ( #1284 )
...
* make it faster
* [hotfix] torchvision fx tests
* [hotfix] rename duplicated named test_gpt.py
* [hotfix] dist mgr torch version
2022-07-13 00:18:56 +08:00
HELSON
abba4d84e1
[hotfix] fix bert model test in unitests ( #1272 )
2022-07-12 23:26:45 +08:00
ver217
7aadcbd070
hotfix colotensor _scan_for_pg_from_args ( #1276 )
2022-07-12 20:46:31 +08:00
oahzxl
0cf8e8e91c
[NFC] polish <colossalai/nn/lr_scheduler/poly.py> code style ( #1267 )
2022-07-12 18:18:14 +08:00
Jiarui Fang
c92f84fcdb
[tensor] distributed checkpointing for parameters ( #1240 )
2022-07-12 15:51:06 +08:00
Frank Lee
fb35460595
[fx] added ndim property to proxy ( #1253 )
2022-07-12 15:27:13 +08:00
Frank Lee
4a09fc0947
[fx] fixed tracing with apex-based T5 model ( #1252 )
...
* [fx] fixed tracing with apex-based T5 model
* polish code
* polish code
2022-07-12 15:19:25 +08:00
Frank Lee
7531c6271f
[fx] refactored the file structure of patched function and module ( #1238 )
...
* [fx] refactored the file structure of patched function and module
* polish code
2022-07-12 15:01:58 +08:00
YuliangLiu0306
17ed33350b
[hotfix] fix an assertion bug in base schedule. ( #1250 )
2022-07-12 14:20:02 +08:00
YuliangLiu0306
97d713855a
[fx] methods to get fx graph property. ( #1246 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* manipulation
* [fx]add graph manipulation methods.
* [fx]methods to get fx graph property.
* add unit test
* add docstring to explain top node and leaf node in this context
2022-07-12 14:10:37 +08:00
YuliangLiu0306
30b4fc0eb0
[fx]add split module pass and unit test from pipeline passes ( #1242 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [fx]add split module pass and unit test from pipeline passes
* fix MNASNet bug
* polish
2022-07-12 13:45:01 +08:00
Jiarui Fang
1aad903c15
[tensor] redistribute among different process groups ( #1247 )
...
* make it faster
* [tensor] rename convert_to_dist -> redistribute
* [tensor] ShardSpec and ReplicaSpec
* [tensor] redistribute among diff pgs
* polish code
2022-07-12 10:24:05 +08:00
Jiarui Fang
9bcd2fd4af
[tensor] a shorter shard and replicate spec ( #1245 )
2022-07-11 15:51:48 +08:00
Jiarui Fang
2699dfbbfd
[rename] convert_to_dist -> redistribute ( #1243 )
2022-07-11 13:05:44 +08:00
HELSON
f6add9b720
[tensor] redirect .data.__get__ to a tensor instance ( #1239 )
2022-07-11 11:41:29 +08:00
Jiarui Fang
20da6e48c8
[checkpoint] save sharded optimizer states ( #1237 )
2022-07-08 16:33:13 +08:00
Jiarui Fang
4a76084dc9
[tensor] add zero_like colo op, important for Optimizer ( #1236 )
2022-07-08 14:55:27 +08:00
Jiarui Fang
3b500984b1
[tensor] fix some unittests ( #1234 )
2022-07-08 14:18:30 +08:00
ver217
a45ddf2d5f
[hotfix] fix sharded optim step and clip_grad_norm ( #1226 )
2022-07-08 13:34:48 +08:00
HELSON
f071b500b6
[polish] polish __repr__ for ColoTensor, DistSpec, ProcessGroup ( #1235 )
2022-07-08 13:25:57 +08:00
HELSON
0453776def
[tensor] fix an assertion in colo_tensor cross_entropy ( #1232 )
2022-07-08 11:18:00 +08:00
Jiarui Fang
0e199d71e8
[hotfix] fx get comm size bugs ( #1233 )
...
* init a checkpoint dir
* [checkpoint]support resume for cosinewarmuplr
* [checkpoint]add unit test
* fix some bugs but still not OK
* fix bugs
* make it faster
* [checkpoint]support generalized scheduler
* polish
* [tensor] torch function return colotensor
* polish
* fix bugs
* remove debug info
* polish
* polish
* [tensor] test_model pass unittests
* polish
* [hotfix] fx get comm size bug
Co-authored-by: ZhaoYi1222 <zhaoyi9499@gmail.com>
2022-07-08 10:54:41 +08:00
HELSON
42ab36b762
[tensor] add unitest for colo_tensor 1DTP cross_entropy ( #1230 )
2022-07-07 19:17:23 +08:00
Yi Zhao
04537bf83e
[checkpoint]support generalized scheduler ( #1222 )
2022-07-07 18:16:38 +08:00
Jiarui Fang
a98319f023
[tensor] torch function return colotensor ( #1229 )
2022-07-07 18:09:18 +08:00
YuliangLiu0306
2b7dca44b5
[fx]get communication size between partitions ( #1224 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [fx]get communication size between partitions.
* polish
2022-07-07 16:22:00 +08:00
Frank Lee
84f2298a96
[fx] added patches for tracing swin transformer ( #1228 )
2022-07-07 15:20:13 +08:00
Frank Lee
b6cb5a47ad
[fx] added timm model tracing testing ( #1221 )
2022-07-07 14:02:17 +08:00
HELSON
280a81243d
[tensor] improve robustness of class 'ProcessGroup' ( #1223 )
2022-07-07 13:55:24 +08:00
Jiarui Fang
15d988f954
[tensor] sharded global process group ( #1219 )
2022-07-07 13:38:48 +08:00
Jiarui Fang
db1bef9032
[hotfix] fx shard 1d pass bug fixing ( #1220 )
2022-07-07 13:37:31 +08:00
Frank Lee
11973d892d
[fx] added torchvision model tracing testing ( #1216 )
...
* [fx] added torchvision model tracing testing
* remove unused imports
2022-07-06 21:37:56 +08:00
Jiarui Fang
52736205d9
[checkpoint] make unitest faster ( #1217 )
2022-07-06 17:39:46 +08:00
Jiarui Fang
f38006ea83
[checkpoint] checkpoint for ColoTensor Model ( #1196 )
2022-07-06 17:22:03 +08:00
XYE
291e22aac6
[fx] temporarily used ( #1215 )
2022-07-06 17:19:26 +08:00
Jiarui Fang
ae7d3f4927
[refactor] move process group from _DistSpec to ColoTensor. ( #1203 )
2022-07-06 16:15:16 +08:00
Frank Lee
5da87ce35d
[fx] added testing for all albert variants ( #1211 )
2022-07-06 15:11:08 +08:00
Frank Lee
2d13a45a3b
[fx] added testing for all gpt variants ( #1210 )
...
* [fx] added testing for all gpt variants
* polish code
* polish code
2022-07-06 14:03:13 +08:00
YuliangLiu0306
189946c5c4
[fx]add uniform policy ( #1208 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [fx]add uniform policy
2022-07-06 13:48:11 +08:00
Frank Lee
426a279ce7
[fx] added testing for all bert variants ( #1207 )
...
* [fx] added testing for all bert variants
* polish code
2022-07-06 10:50:49 +08:00
Jiarui Fang
b5f25eb32a
[Tensor] add cpu group to ddp ( #1200 )
2022-07-05 14:58:28 +08:00
Frank Lee
f7878f465c
[fx] supported model tracing for huggingface bert ( #1201 )
...
* [fx] supported model tracing for huggingface bert
* polish test
2022-07-05 13:19:57 +08:00
Jiarui Fang
060b917daf
[refactor] remove gpc dependency in colotensor's _ops ( #1189 )
2022-07-04 18:54:37 +08:00
Frank Lee
abf6a262dc
[fx] added module patch for pooling layers ( #1197 )
2022-07-04 15:21:26 +08:00
YuliangLiu0306
63d2a93878
[context]support arbitrary module materialization. ( #1193 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [context]support arbitrary module materialization.
* [test]add numerical check for lazy init context.
2022-07-04 10:12:02 +08:00
Jiarui Fang
a444633d13
warmup ratio configuration ( #1192 )
2022-06-30 15:23:50 +08:00
ver217
dba7e0cfb4
make AutoPlacementPolicy configurable ( #1191 )
2022-06-30 15:18:30 +08:00
YuliangLiu0306
2053e138a2
[context]use meta tensor to init model lazily. ( #1187 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [context]use meta tensor to init model lazily.
* polish
* make module with device kwargs bypass the normal init.
* change unit test to adapt updated context.
2022-06-29 21:02:30 +08:00
Frank Lee
2c8c05675d
[fx] patched conv and normalization ( #1188 )
2022-06-29 18:58:38 +08:00
Frank Lee
6d86f1bc91
[fx] supported data-dependent control flow in model tracing ( #1185 )
...
* [fx] supported data-dependent control flow in model tracing
* polish code
2022-06-29 15:05:25 +08:00
Jiarui Fang
c463f8adf9
[tensor] remove gpc in tensor tests ( #1186 )
2022-06-29 14:08:40 +08:00
Jiarui Fang
372f791444
[refactor] move chunk and chunkmgr to directory gemini ( #1182 )
2022-06-29 13:31:02 +08:00
ver217
6b2f2ab9bb
[ddp] ColoDDP uses bucket all-reduce ( #1177 )
...
* add reducer
* update colo ddp with reducer
* polish unit test
* polish unit test
2022-06-29 10:34:13 +08:00
Jiarui Fang
7487215b95
[ColoTensor] add independent process group ( #1179 )
2022-06-29 10:03:09 +08:00
YuliangLiu0306
26ba87272d
[hotfix]fixed p2p process send stuck ( #1181 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [hotfix]fixed p2p process send stuck
2022-06-28 14:41:11 +08:00
Jiarui Fang
1b657f9ce1
[tensor] revert local view back ( #1178 )
2022-06-27 18:38:34 +08:00
Jiarui Fang
0dd4e2bbfb
[Tensor] rename some APIs in TensorSpec and Polish view unittest ( #1176 )
2022-06-27 15:56:11 +08:00
Ziyue Jiang
dd0420909f
[Tensor] rename parallel_action ( #1174 )
...
* rename parallel_action
* polish
2022-06-27 10:04:45 +08:00
YuliangLiu0306
e27645376d
[hotfix]different overflow status lead to communication stuck. ( #1175 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [hotfix]fix some bugs caused by refactored schedule.
* [hotfix]different overflow status lead to communication stuck.
2022-06-27 09:53:57 +08:00
Jiarui Fang
aa7bef73d4
[Tensor] distributed view supports inter-process hybrid parallel ( #1169 )
2022-06-27 09:45:26 +08:00
ver217
9e1daa63d2
[zero] sharded optim supports loading local state dict ( #1170 )
...
* sharded optim supports loading local state dict
* polish code
* add unit test
2022-06-24 18:05:16 +08:00
ver217
561e90493f
[zero] zero optim supports loading local state dict ( #1171 )
...
* zero optim supports loading local state dict
* polish code
* add unit test
2022-06-24 17:25:57 +08:00
Jiarui Fang
4b9bba8116
[ColoTensor] rename APIs and add output_replicate to ComputeSpec ( #1168 )
2022-06-24 13:08:54 +08:00
Jiarui Fang
f4ef224358
[Tensor] remove ParallelAction, use ComputeSpec instead ( #1166 )
2022-06-23 17:34:59 +08:00
Jiarui Fang
177c374401
remove gather out in parallel action ( #1163 )
2022-06-23 16:35:05 +08:00
ver217
634eecb98e
mark sanity_check of dist_spec_mgr as staticmethod ( #1161 )
2022-06-23 11:35:25 +08:00
Ziyue Jiang
955ac912de
remove log ( #1160 )
2022-06-23 10:32:42 +08:00
ver217
4e67b2a890
fix chunk move device ( #1158 )
2022-06-22 18:07:10 +08:00
Jiarui Fang
07f9c781f9
[graph] improve the graph building. ( #1157 )
2022-06-22 16:47:20 +08:00
ver217
22717a856f
[tensor] add embedding bag op ( #1156 )
2022-06-22 15:54:03 +08:00
ver217
ae86151968
[tensor] add more element-wise ops ( #1155 )
...
* add more element-wise ops
* update test_op
* polish unit test
2022-06-22 15:16:47 +08:00
ver217
54aabb8da4
[gemini] refactor gemini mgr ( #1151 )
...
* refactor gemini mgr
* update __init__
2022-06-22 11:54:36 +08:00
Frank Lee
f8eec98ff5
[tensor] fixed non-serializable colo parameter during model checkpointing ( #1153 )
2022-06-22 11:43:38 +08:00
ver217
ffa025e120
[tensor] dist spec s2s uses all-to-all ( #1136 )
...
* dist spec s2s uses all-to-all
* update unit test
* add sanity check
* polish unitest test with titans
* add sanity check for DistMgr
* add sanity check
Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
2022-06-22 11:32:38 +08:00
YuliangLiu0306
f1f51990b9
[hotfix]fix some bugs caused by refactored schedule. ( #1148 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [hotfix]fix some bugs caused by refactored schedule.
2022-06-21 22:46:30 +08:00
Jiarui Fang
8cdce0399c
[ColoTensor] improves init functions. ( #1150 )
2022-06-21 18:28:38 +08:00
ver217
8106d7b8c7
[ddp] refactor ColoDDP and ZeroDDP ( #1146 )
...
* ColoDDP supports overwriting default process group
* rename ColoDDPV2 to ZeroDDP
* add docstr for ZeroDDP
* polish docstr
2022-06-21 16:35:23 +08:00
Frank Lee
0e4e62d30d
[tensor] added __repr__ to spec ( #1147 )
2022-06-21 15:38:05 +08:00
YuliangLiu0306
70dd88e2ee
[pipeline]add customized policy ( #1139 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [pipeline]add customized policy
2022-06-21 15:23:41 +08:00
YuliangLiu0306
18091581c0
[pipeline]support more flexible pipeline ( #1138 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [pipeline]support more flexible pipeline
2022-06-21 14:40:50 +08:00
ver217
ccf3c58c89
embedding op use gather_out ( #1143 )
2022-06-21 13:21:20 +08:00
ver217
6690a61b4d
[hotfix] prevent nested ZeRO ( #1140 )
2022-06-21 11:33:53 +08:00
Frank Lee
15aab1476e
[zero] avoid zero hook spam by changing log to debug level ( #1137 )
2022-06-21 10:44:01 +08:00
Frank Lee
73ad05fc8c
[zero] added error message to handle on-the-fly import of torch Module class ( #1135 )
...
* [zero] added error message to handle on-the-fly import of torch Module class
* polish code
2022-06-20 11:24:27 +08:00
ver217
e4f555f29a
[optim] refactor fused sgd ( #1134 )
2022-06-20 11:19:38 +08:00
ver217
d26902645e
[ddp] add save/load state dict for ColoDDP ( #1127 )
...
* add save/load state dict for ColoDDP
* add unit test
* refactor unit test folder
* polish unit test
* rename unit test
2022-06-20 10:51:47 +08:00
YuliangLiu0306
946dbd629d
[hotfix]fix bugs caused by refactored pipeline ( #1133 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [hotfix]fix bugs caused by refactored pipeline
2022-06-17 17:54:15 +08:00
ver217
789cad301b
[hotfix] fix param op hook ( #1131 )
...
* fix param op hook
* update zero tp test
* fix bugs
2022-06-17 16:12:05 +08:00
ver217
a1a7899cae
[hotfix] fix zero init ctx numel ( #1128 )
2022-06-16 17:17:27 +08:00
ver217
f0a954f16d
[ddp] add set_params_to_ignore for ColoDDP ( #1122 )
...
* add set_params_to_ignore for ColoDDP
* polish code
* fix zero hook v2
* add unit test
* polish docstr
2022-06-16 12:54:46 +08:00
YuliangLiu0306
3175bcb4d8
[pipeline]support List of Dict data ( #1125 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [pipeline]support List of Dict data
* polish
2022-06-16 11:19:48 +08:00
Frank Lee
91a5999825
[ddp] supported customized torch ddp configuration ( #1123 )
2022-06-15 18:11:53 +08:00
YuliangLiu0306
fcf55777dd
[fx]add autoparallel passes ( #1121 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* feature/add autoparallel passes
2022-06-15 16:36:46 +08:00
ver217
e127b4375b
cast colo ddp v2 inputs/outputs ( #1120 )
2022-06-15 15:57:04 +08:00
Frank Lee
16302a5359
[fx] added unit test for coloproxy ( #1119 )
...
* [fx] added unit test for coloproxy
* polish code
* polish code
2022-06-15 15:27:51 +08:00
ver217
7d14b473f0
[gemini] gemini mgr supports "cpu" placement policy ( #1118 )
...
* update gemini mgr
* update chunk
* add docstr
* polish placement policy
* update test chunk
* update test zero
* polish unit test
* remove useless unit test
2022-06-15 15:05:19 +08:00
ver217
f99f56dff4
fix colo parameter torch function ( #1117 )
2022-06-15 14:23:27 +08:00
Frank Lee
e1620ddac2
[fx] added coloproxy ( #1115 )
2022-06-15 10:47:57 +08:00
Frank Lee
6f82ac9bcb
[pipeline] supported more flexible dataflow control for pipeline parallel training ( #1108 )
...
* [pipeline] supported more flexible dataflow control for pipeline parallel training
* polish code
* polish code
* polish code
2022-06-15 10:41:28 +08:00
ver217
895c1c5ee7
[tensor] refactor param op hook ( #1097 )
...
* refactor param op hook
* add docstr
* fix bug
2022-06-13 16:11:53 +08:00
YuliangLiu0306
1e9f9c227f
[hotfix]change to fit latest p2p ( #1100 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [hotfix]change to fit latest p2p
* polish
* polish
2022-06-13 14:57:25 +08:00
Frank Lee
72bd7c696b
[amp] included dict for type casting of model output ( #1102 )
2022-06-13 14:18:04 +08:00
Frank Lee
7f2d2b2b5b
[engine] fixed empty op hook check ( #1096 )
...
* [engine] fixed empty op hook check
* polish code
2022-06-10 17:27:27 +08:00
Frank Lee
14e5b11d7f
[zero] fixed api consistency ( #1098 )
2022-06-10 16:59:59 +08:00
Frank Lee
cb18922c47
[doc] added documentation to chunk and chunk manager ( #1094 )
...
* [doc] added documentation to chunk and chunk manager
* polish code
* polish code
* polish code
2022-06-10 15:33:06 +08:00
ver217
1f894e033f
[gemini] zero supports gemini ( #1093 )
...
* add placement policy
* add gemini mgr
* update mem stats collector
* update zero
* update zero optim
* fix bugs
* zero optim monitor os
* polish unit test
* polish unit test
* add assert
2022-06-10 14:48:28 +08:00
Frank Lee
2b2dc1c86b
[pipeline] refactor the pipeline module ( #1087 )
...
* [pipeline] refactor the pipeline module
* polish code
2022-06-10 11:27:38 +08:00
Frank Lee
bad5d4c0a1
[context] support lazy init of module ( #1088 )
...
* [context] support lazy init of module
* polish code
2022-06-10 10:09:48 +08:00
ver217
be01db37c8
[tensor] refactor chunk mgr and impl MemStatsCollectorV2 ( #1077 )
...
* polish chunk manager
* polish unit test
* impl add_extern_static_tensor for chunk mgr
* add mem stats collector v2
* polish code
* polish unit test
* polish code
* polish get chunks
2022-06-09 20:56:34 +08:00
Frank Lee
50ec3a7e06
[test] skip tests when not enough GPUs are detected ( #1090 )
...
* [test] skip tests when not enough GPUs are detected
* polish code
* polish code
2022-06-09 17:19:13 +08:00
Ziyue Jiang
0653c63eaa
[Tensor] 1d row embedding ( #1075 )
...
* Add CPU 1d row embedding
* polish
2022-06-08 12:04:59 +08:00
junxu
d66ffb4df4
Remove duplication registry ( #1078 )
2022-06-08 07:47:24 +08:00
Jiarui Fang
bcab249565
fix issue #1080 ( #1071 )
2022-06-07 17:21:11 +08:00
ver217
1b17859328
[tensor] chunk manager monitor mem usage ( #1076 )
2022-06-07 15:00:00 +08:00
ver217
98cdbf49c6
[hotfix] fix chunk comm src rank ( #1072 )
2022-06-07 11:54:56 +08:00
Frank Lee
bfdc5ccb7b
[context] maintain the context object in with statement ( #1073 )
2022-06-07 10:48:45 +08:00
ver217
c5cd3b0f35
[zero] zero optim copy chunk rather than copy tensor ( #1070 )
2022-06-07 10:30:46 +08:00
Ziyue Jiang
4fc748f69b
[Tensor] fix optimizer for CPU parallel ( #1069 )
2022-06-06 17:36:11 +08:00
Jiarui Fang
49832b2344
[refactory] add nn.parallel module ( #1068 )
2022-06-06 15:34:41 +08:00
Ziyue Jiang
6754f1b77f
fix module utils bug ( #1066 )
2022-06-06 12:11:48 +08:00
Jiarui Fang
a00644079e
reorganize colotensor directory ( #1062 )
...
* reorganize colotensor directory
* polish code
2022-06-03 18:04:22 +08:00
Frank Lee
3d10be33bd
[cudnn] set False to cudnn benchmark by default ( #1063 )
2022-06-03 17:58:06 +08:00
Ziyue Jiang
df9dcbbff6
[Tensor] add hybrid device demo and fix bugs ( #1059 )
2022-06-03 12:09:49 +08:00
YuliangLiu0306
b167258b6a
[pipeline]refactor ppschedule to support tensor list ( #1050 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* refactor ppschedule to support tensor list
* polish
2022-06-02 13:48:59 +08:00
ver217
e3fde4ee6b
fix import error in sharded model v2 ( #1053 )
2022-06-02 13:48:22 +08:00
ver217
e1922ea4f6
[zero] add chunk size search for chunk manager ( #1052 )
2022-06-02 13:20:20 +08:00
アマデウス
2c42b230f3
updated collective ops api ( #1054 )
2022-06-02 12:52:27 +08:00
ver217
51b9a49655
[zero] add zero optimizer for ColoTensor ( #1046 )
...
* add zero optimizer
* torch ok
* unit test ok
* polish code
* fix bugs
* polish unit test
* polish zero optim
* polish colo ddp v2
* refactor folder structure
* add comment
* polish unit test
* polish zero optim
* polish unit test
2022-06-02 12:13:15 +08:00
ver217
7faef93326
fix dist spec mgr ( #1045 )
2022-05-31 12:14:39 +08:00
ver217
9492a561c3
[tensor] ColoTensor supports ZeRo ( #1015 )
...
* impl chunk manager
* impl param op hook
* add reduce_chunk
* add zero hook v2
* add zero dp
* fix TensorInfo
* impl load balancing when using zero without chunk
* fix zero hook
* polish chunk
* fix bugs
* ddp ok
* zero ok
* polish code
* fix bugs about load balancing
* polish code
* polish code
* add end-to-end test
* polish code
* polish code
* polish code
* fix typo
* add test_chunk
* fix bugs
* fix bugs
* polish code
2022-05-31 12:00:12 +08:00
Ziyue Jiang
7c530b9de2
[Tensor] add Parameter inheritance for ColoParameter ( #1041 )
...
* add Parameter inheritance for ColoParameter
* remove tricks
* remove tricks
* polish
* polish
2022-05-30 17:23:44 +08:00
ver217
7cfd6c827e
[zero] add load_state_dict for sharded model ( #894 )
...
* add load_state_dict for sharded model
* fix bug
* fix bug
* fix ckpt dtype and device
* support load state dict in zero init ctx
* fix bugs
2022-05-27 10:25:08 +08:00
Ziyue Jiang
6c5996a56e
[Tensor] add module check and bert test ( #1031 )
...
* add Embedding
* Add bert test
* polish
* add check module test
* polish
* polish
* polish
* polish
2022-05-26 18:15:42 +08:00
YuliangLiu0306
7106bd671d
[p2p]add object list send/recv ( #1024 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [p2p]add object list send recv
* refactor for code reusability
* polish
2022-05-26 14:28:46 +08:00
Frank Lee
e4685832f8
[engine] fixed bug in gradient accumulation dataloader to keep the last step ( #1030 )
2022-05-26 14:28:23 +08:00
Ziyue Jiang
32291dd73f
[Tensor] add module handler for linear ( #1021 )
...
* add module spec for linear
* polish
* polish
* polish
2022-05-26 11:50:44 +08:00
Ryan Russell
9b0c037027
fix typo in constants ( #1027 )
2022-05-26 08:45:08 +08:00
ver217
007ca0df92
fix colo init context ( #1026 )
2022-05-25 20:41:58 +08:00
YuliangLiu0306
d182b0bd47
[hotfix] fix some bugs caused by size mismatch. ( #1011 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [hotfix]fix some bugs caused by size mismatch.
* add warning logs
* polish
2022-05-23 14:02:28 +08:00
ver217
cefc29ff06
[tensor] impl ColoDDP for ColoTensor ( #1009 )
...
* impl ColoDDP for ColoTensor
* polish code
2022-05-21 13:52:04 +08:00
zhengzangw
ae7c338105
[NFC] polish colossalai/kernel/cuda_native/csrc/colossal_C_frontend.cpp code style
2022-05-20 23:57:38 +08:00
ver217
a3b66f6def
[tensor] refactor parallel action ( #1007 )
...
* refactor parallel action
* polish unit tests
2022-05-20 20:19:58 +08:00
ver217
ad536e308e
[tensor] refactor colo-tensor ( #992 )
...
* refactor colo-tensor and update linear op
* polish code
* polish code
* update ops and unit tests
* update unit tests
* polish code
* rename dist_spec module
* polish code
* polish code
* remove unneeded import
* fix pipelinable
2022-05-19 12:44:59 +08:00
Frank Lee
1467d83edf
[cli] remove unused imports ( #1001 )
2022-05-18 23:27:18 +08:00
Frank Lee
533d0c46d8
[kernel] fixed the include bug in dropout kernel ( #999 )
2022-05-18 21:43:18 +08:00
Jiarui Fang
802ac297cc
[Tensor] remove useless import in tensor dir ( #997 )
2022-05-18 14:54:51 +08:00
Ziheng Qin
571f12eff3
[NFC] polish colossalai/nn/layer/utils/common.py code style ( #983 )
2022-05-17 10:25:06 +08:00
puck_WCR
bda70b4b66
[NFC] polish colossalai/kernel/cuda_native/layer_norm.py code style ( #980 )
2022-05-17 10:25:06 +08:00
Kai Wang (Victor Kai)
c50c08dcbb
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/dropout_kernels.cu code style ( #979 )
2022-05-17 10:25:06 +08:00
binmakeswell
f28c021376
[NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu code style ( #978 )
2022-05-17 10:25:06 +08:00
shenggan
18542b47fc
[NFC] polish colossalai/nn/layer/parallel_2d/layers.py code style ( #976 )
2022-05-17 10:25:06 +08:00
Jie Zhu
b67eebd20f
[NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu code style ( #977 )
2022-05-17 10:25:06 +08:00
DouJS
52705ec5c5
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/normalize_kernels.cu code style ( #974 )
2022-05-17 10:25:06 +08:00
Ofey Chan
136946422b
[NFC] polish colossalai/kernel/cuda_native/csrc/layer_norm_cuda.cpp code style ( #973 )
2022-05-17 10:25:06 +08:00
Zirui Zhu
598cde4a0f
[NFC] polish colossalai/nn/layer/parallel_2p5d/layers.py code style ( #972 )
2022-05-17 10:25:06 +08:00
Xu Kai
632e94abde
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/dropout.h code style ( #970 )
2022-05-17 10:25:06 +08:00
ExtremeViscent
22d1df224d
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/feed_forward.h ( #968 )
...
code style
2022-05-17 10:25:06 +08:00
LuGY
fb5bc6cb28
[NFC] polish colossalai/nn/layer/parallel_3d/layers.py code style ( #966 )
2022-05-17 10:25:06 +08:00
lucasliunju
955463e542
[NFC] polish __init__.py code style ( #965 )
2022-05-17 10:25:06 +08:00
Yuer867
7106a399fc
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/softmax.h code style ( #964 )
2022-05-17 10:25:06 +08:00
ziyu huang
5bd80b7dd1
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/general_kernels.cu code style ( #963 )
...
Co-authored-by: “Arsmart123 <202476410arsmart@gmail.com>
2022-05-17 10:25:06 +08:00
superhao1995
48c4a180c7
[NFC] polish colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.cpp code style ( #959 )
2022-05-17 10:25:06 +08:00
MaxT
442a2975ab
[NFC] polish colossalai/kernel/cuda_native/csrc/multihead_attention_1d.h code style ( #962 )
2022-05-17 10:25:06 +08:00
runluo
89e2767a92
[NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu code style ( #958 )
2022-05-17 10:25:06 +08:00
doubleHU
1dc1b6fa00
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/cross_entropy_layer.h code style ( #957 )
2022-05-17 10:25:06 +08:00
RichardoLuo
0e922da874
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/context.h code style ( #956 )
...
Co-authored-by: RichardoLuo <14049555596@qq.com>
2022-05-17 10:25:06 +08:00
Wangbo Zhao(黑色枷锁)
8ca2a85682
[NFC] polish colossalai/kernel/cuda_native/scaled_softmax.py code style ( #955 )
2022-05-17 10:25:06 +08:00
Luxios22
f6970ef8b1
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/softmax_kernels.cu code style ( #954 )
2022-05-17 10:25:06 +08:00
Cautiousss
0b86a6345e
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/cross_entropy.cu code style ( #953 )
...
Co-authored-by: 何晓昕 <cautious@hexiaoxins-MacBook-Pro.local>
2022-05-17 10:25:06 +08:00
Sze-qq
d8d07b0e2b
[NFC] polish colossalai/kernel/cuda_native/csrc/multihead_attention_1d.cpp code style ( #952 )
2022-05-17 10:25:06 +08:00
xyupeng
fa43bb216d
[NFC] polish colossalai/builder/pipeline.py code style ( #951 )
2022-05-17 10:25:06 +08:00
JT.Han
c3e423c8be
[NFC] polish colossalai/kernel/cuda_native/csrc/scaled_masked_softmax_cuda.cu code style ( #949 )
...
Co-authored-by: Jiatong <jiatong.han@u.nus.edu>
2022-05-17 10:25:06 +08:00
luoling-LC
72c71b67ec
[NFC] polish colossalai/kernel/jit/bias_gelu.py code style ( #946 )
...
Co-authored-by: jnbai <897086360@qq.com>
2022-05-17 10:25:06 +08:00
bajiaoyu517
eb9a81d72a
[NFC] polish colossalai/kernel/cuda_native/csrc/cpu_adam.h code style ( #945 )
2022-05-17 10:25:06 +08:00
wky
8ffdc38376
[NFC] polish colossalai/kernel/cuda_native/csrc/moe_cuda.cpp code style ( #942 )
2022-05-17 10:25:06 +08:00
HaoyuQin
c0f373db5d
[NFC] polish pre-commit run --files colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax_cuda.cu code style ( #943 )
2022-05-17 10:25:06 +08:00
XYE
5bbefeb06a
[NFC] polish moe_cuda_kernel.cu code style ( #940 )
...
Co-authored-by: Xiao Ye <xiaoye2@illinois.edu>
2022-05-17 10:25:06 +08:00
Maruyama_Aya
7aa35eae6a
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/block_reduce.h code style ( #938 )
2022-05-17 10:25:06 +08:00
Geng Zhang
b6cc9313ef
[NFC] polish colossalai/kernel/cuda_native/csrc/cpu_adam.cpp code style ( #936 )
2022-05-17 10:25:06 +08:00
yuxuan-lou
44b6f8947b
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/cuda_util.h code style ( #939 )
2022-05-17 10:25:06 +08:00
BoxiangW
872aa413c2
[NFC] Polish colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu code style. ( #937 )
2022-05-17 10:25:06 +08:00
ver217
58580b50fe
Revert "[NFC] Hotfix/format ( #984 )" ( #986 )
...
This reverts commit 0772828fba.
2022-05-17 10:23:38 +08:00
binmakeswell
0772828fba
[NFC] Hotfix/format ( #984 )
...
* [NFC] Polish colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu code style. (#937 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/cuda_util.h code style (#939 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/cpu_adam.cpp code style (#936 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/block_reduce.h code style (#938 )
* [NFC] polish moe_cuda_kernel.cu code style (#940 )
Co-authored-by: Xiao Ye <xiaoye2@illinois.edu>
* [NFC] polish pre-commit run --files colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax_cuda.cu code style (#943 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/moe_cuda.cpp code style (#942 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/cpu_adam.h code style (#945 )
* [NFC] polish colossalai/kernel/jit/bias_gelu.py code style (#946 )
Co-authored-by: jnbai <897086360@qq.com>
* [NFC] polish colossalai/kernel/cuda_native/csrc/scaled_masked_softmax_cuda.cu code style (#949 )
Co-authored-by: Jiatong <jiatong.han@u.nus.edu>
* [NFC] polish colossalai/builder/pipeline.py code style (#951 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/multihead_attention_1d.cpp code style (#952 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/cross_entropy.cu code style (#953 )
Co-authored-by: 何晓昕 <cautious@hexiaoxins-MacBook-Pro.local>
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/softmax_kernels.cu code style (#954 )
* [NFC] polish colossalai/kernel/cuda_native/scaled_softmax.py code style (#955 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/context.h code style (#956 )
Co-authored-by: RichardoLuo <14049555596@qq.com>
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/cross_entropy_layer.h code style (#957 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu code style (#958 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/multihead_attention_1d.h code style (#962 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/scaled_upper_triang_masked_softmax.cpp code style (#959 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/general_kernels.cu code style (#963 )
Co-authored-by: “Arsmart123 <202476410arsmart@gmail.com>
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/softmax.h code style (#964 )
* [NFC] polish __init__.py code style (#965 )
* [NFC] polish colossalai/nn/layer/parallel_3d/layers.py code style (#966 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/feed_forward.h (#968 )
code style
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/dropout.h code style (#970 )
* [NFC] polish colossalai/nn/layer/parallel_2p5d/layers.py code style (#972 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/layer_norm_cuda.cpp code style (#973 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/normalize_kernels.cu code style (#974 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_scale_kernel.cu code style (#977 )
* [NFC] polish colossalai/nn/layer/parallel_2d/layers.py code style (#976 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_sgd_kernel.cu code style (#978 )
* [NFC] polish colossalai/kernel/cuda_native/csrc/kernels/dropout_kernels.cu code style (#979 )
* [NFC] polish colossalai/kernel/cuda_native/layer_norm.py code style (#980 )
* [NFC] polish colossalai/nn/layer/utils/common.py code style (#983 )
Co-authored-by: BoxiangW <45734921+BoxiangW@users.noreply.github.com>
Co-authored-by: yuxuan-lou <83441848+yuxuan-lou@users.noreply.github.com>
Co-authored-by: Geng Zhang <34452939+zxgx@users.noreply.github.com>
Co-authored-by: Maruyama_Aya <38985202+MaruyamaAya@users.noreply.github.com>
Co-authored-by: XYE <92607131+Itok2000u@users.noreply.github.com>
Co-authored-by: Xiao Ye <xiaoye2@illinois.edu>
Co-authored-by: HaoyuQin <79465534+coder-chin@users.noreply.github.com>
Co-authored-by: wky <64853922+wangkuangyi@users.noreply.github.com>
Co-authored-by: bajiaoyu517 <59548007+bajiaoyu517@users.noreply.github.com>
Co-authored-by: luoling-LC <105470086+luoling-LC@users.noreply.github.com>
Co-authored-by: jnbai <897086360@qq.com>
Co-authored-by: JT.Han <59948448+JThh@users.noreply.github.com>
Co-authored-by: Jiatong <jiatong.han@u.nus.edu>
Co-authored-by: xyupeng <99191637+xyupeng@users.noreply.github.com>
Co-authored-by: Sze-qq <68757353+Sze-qq@users.noreply.github.com>
Co-authored-by: Cautiousss <48676630+Cautiousss@users.noreply.github.com>
Co-authored-by: 何晓昕 <cautious@hexiaoxins-MacBook-Pro.local>
Co-authored-by: Luxios22 <67457897+Luxios22@users.noreply.github.com>
Co-authored-by: Wangbo Zhao(黑色枷锁) <56866854+wangbo-zhao@users.noreply.github.com>
Co-authored-by: RichardoLuo <50363844+RichardoLuo@users.noreply.github.com>
Co-authored-by: RichardoLuo <14049555596@qq.com>
Co-authored-by: doubleHU <98150031+huxin711@users.noreply.github.com>
Co-authored-by: runluo <68489000+run-qiao@users.noreply.github.com>
Co-authored-by: MaxT <854721132@qq.com>
Co-authored-by: superhao1995 <804673818@qq.com>
Co-authored-by: ziyu huang <huang0ziyu@gmail.com>
Co-authored-by: “Arsmart123 <202476410arsmart@gmail.com>
Co-authored-by: Yuer867 <62204893+Yuer867@users.noreply.github.com>
Co-authored-by: lucasliunju <lucasliunju@gmail.com>
Co-authored-by: LuGY <74758262+Gy-Lu@users.noreply.github.com>
Co-authored-by: ExtremeViscent <zhangyiqi55732@sina.com>
Co-authored-by: Xu Kai <xukai16@foxmail.com>
Co-authored-by: Zirui Zhu <zhuzr21@gmail.com>
Co-authored-by: Ofey Chan <ofey206@gmail.com>
Co-authored-by: DouJS <dujiangsu@163.com>
Co-authored-by: Jie Zhu <chore.08-protist@icloud.com>
Co-authored-by: shenggan <csg19971016@gmail.com>
Co-authored-by: Kai Wang (Victor Kai) <37533040+kaiwang960112@users.noreply.github.com>
Co-authored-by: puck_WCR <46049915+WANG-CR@users.noreply.github.com>
Co-authored-by: Ziheng Qin <37519855+henryqin1997@users.noreply.github.com>
2022-05-17 09:54:49 +08:00
ver217
c2fdc6a011
[tensor] derive compute pattern from dist spec ( #971 )
...
* derive compute pattern from dist spec
* polish code
2022-05-16 14:58:08 +08:00
Ziyue Jiang
797a9dc5a9
add DistSpec for loss and test_model ( #947 )
2022-05-13 20:29:50 +08:00
ver217
67c33f57eb
[tensor] design DistSpec and DistSpecManager for ColoTensor ( #934 )
...
* add dist spec
* update linear op
* polish code
* polish code
* update embedding op
* polish unit tests
* polish unit tests
* polish comments
* polish code
* add test_dist_spec_mgr
* polish code
* refactor folder structure
* polish unit tests
* add get_process_group() for TensorSpec
* polish code
2022-05-13 15:13:52 +08:00
Ziyue Jiang
d73c2b1d79
[Tensor] fix init context ( #931 )
...
* change torch.Parameter to ColoParameter
* fix post assignment for init context
* polish
* polish
2022-05-11 15:48:12 +08:00
Ziyue Jiang
dfc88b85ea
[Tensor] simplify named param ( #928 )
...
* simplify ColoModulize
* simplify ColoModulize
* polish
* polish
2022-05-11 10:54:19 +08:00
YuliangLiu0306
32a45cd7ef
[pipelinable]use pipelinable to support GPT model. ( #903 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [pipelinable]use pipelinable to support GPT model.
* fix a bug caused by ShardedModel
* polish
* fix front func list
2022-05-11 09:23:58 +08:00
ver217
4ca732349e
[tensor] colo tensor overrides mul ( #927 )
...
* colo tensor overrides mul
* polish code
2022-05-10 16:04:08 +08:00
ver217
45b9124df4
[tensor] hijack addmm for colo tensor ( #923 )
...
* hijack addmm for colo tensor
* fix bugs
* polish unit test
* polish comments
2022-05-09 18:55:49 +08:00
Ziyue Jiang
c195d2814c
[Tensor] add from_pretrained support and bert pretrained test ( #921 )
...
* add from_pretrained support and test
* polish
* polish
* polish
* polish
2022-05-09 16:11:47 +08:00
Jiarui Fang
845856ea29
[Graph] building computing graph with ColoTensor, Linear only ( #917 )
2022-05-07 17:10:37 +08:00
Ziyue Jiang
75d221918a
[Tensor] add 1d vocab loss ( #918 )
...
* add 1d vocab loss
* polish
2022-05-07 15:49:14 +08:00
Jiarui Fang
ab95ec9aea
[Tensor] init ColoParameter ( #914 )
2022-05-06 12:57:14 +08:00
Ziyue Jiang
f593a5637e
[Tensor] add embedding tp1d row ( #904 )
2022-04-29 14:10:05 +08:00
Ziyue Jiang
2c0d19d755
[Tensor] add ColoTensor TP1Dcol Embedding ( #899 )
2022-04-28 17:45:06 +08:00
Jiarui Fang
d16671da75
[Tensor] initialize the ColoOptimizer ( #898 )
...
* [Tensor] activation is an attr of ColoTensor
* [Tensor] add optimizer
* only detach parameters in context
* polish code
2022-04-28 15:23:40 +08:00
Jiarui Fang
676f191532
[Tensor] activation is an attr of ColoTensor ( #897 )
2022-04-28 14:43:22 +08:00
Ziyue Jiang
cb182da7c5
[tensor] refine linear and add gather for laynorm ( #893 )
...
* refine linear and add function to ColoTensor
* add gather for layernorm
* polish
* polish
2022-04-28 10:55:40 +08:00
Jiarui Fang
26c49639d8
[Tensor] overriding paramters() for Module using ColoTensor ( #889 )
2022-04-27 15:28:59 +08:00
Ziyue Jiang
1d0aba4153
[tensor] add ColoTensor 1Dcol ( #888 )
2022-04-27 14:13:55 +08:00
Jiarui Fang
72cdc06875
[Tensor] make ColoTensor more robust for getattr ( #886 )
...
* [Tensor] make ColoTensor more robust for getattr
* polish
* polish
2022-04-27 10:57:49 +08:00
Ziyue Jiang
9bc5a77c31
[tensor] wrap function in the torch_tensor to ColoTensor ( #881 )
2022-04-26 20:13:56 +08:00
ver217
4df6471f5d
fix import error ( #880 )
2022-04-26 19:28:40 +08:00
Jiarui Fang
7f76517a85
[Tensor] make a simple net works with 1D row TP ( #879 )
2022-04-26 18:11:47 +08:00
ver217
c4d903e64a
[gemini] accelerate adjust_layout() ( #878 )
...
* add lru cache
* polish code
* update unit test
* fix sharded optim
2022-04-26 18:08:31 +08:00
Jiarui Fang
909211453b
[Tensor] Add some attributes to ColoTensor ( #877 )
...
* [Tensor] add some function to ColoTensor
* torch.allclose
* rm torch.add
2022-04-26 15:10:47 +08:00
HELSON
425b4a96b8
[gemini] polish stateful_tensor_mgr ( #876 )
2022-04-26 15:05:03 +08:00
Jiarui Fang
e43f83aa5c
[Tensor] get named parameters for model using ColoTensors ( #874 )
2022-04-26 14:08:01 +08:00
Jiarui Fang
96211c2cc8
[tensor] customized op returns ColoTensor ( #875 )
...
* [tensor] customized op returns ColoTensor
* polish
* polish code
2022-04-26 13:23:59 +08:00
Ziyue Jiang
26d4ab8b03
[Tensor] Add function to spec and update linear 1Drow and unit tests ( #869 )
2022-04-26 10:15:26 +08:00
Frank Lee
11f54c7b6b
[doc] improved docstring and assertion messages for the engine module ( #871 )
2022-04-26 10:00:18 +08:00
Frank Lee
1c34382678
[doc] improved assertion messages in trainer ( #873 )
2022-04-26 10:00:12 +08:00
Frank Lee
7a64fae33a
[doc] improved error messages in initialize ( #872 )
2022-04-26 10:00:03 +08:00
Jiarui Fang
1190b2c4a4
[tensor] add cross_entrophy_loss ( #868 )
2022-04-25 16:01:52 +08:00
HELSON
3107817172
[gemini] add stateful tensor container ( #867 )
2022-04-25 14:58:16 +08:00
Jiarui Fang
d01d3b8cb0
colo init context add device attr. ( #866 )
2022-04-25 14:24:26 +08:00
Frank Lee
2238758c2e
[usability] improved error messages in the context module ( #856 )
2022-04-25 13:42:31 +08:00
Frank Lee
9fdebadd69
[doc] improved docstring in the amp module ( #857 )
2022-04-25 13:42:17 +08:00
Frank Lee
b862d89d00
[doc] improved docstring in the logging module ( #861 )
2022-04-25 13:42:00 +08:00
Frank Lee
8004c8e938
[doc] improved docstring in the communication module ( #863 )
2022-04-25 13:41:43 +08:00
Jiarui Fang
8af5f7423d
[tensor] an initial dea of tensor spec ( #865 )
...
* a initial dea of tensor spec
* polish
* polish
2022-04-25 13:33:52 +08:00
Jiarui Fang
126ba573a8
[Tensor] add layer norm Op ( #852 )
2022-04-25 11:49:20 +08:00
Frank Lee
a82da26f7e
[cli] refactored micro-benchmarking cli and added more metrics ( #858 )
2022-04-25 11:48:07 +08:00
Frank Lee
ee222dfbf3
[usability] added assertion message in registry ( #864 )
2022-04-25 11:45:15 +08:00
HELSON
f0e654558f
[gemini] polish code ( #855 )
2022-04-25 10:40:14 +08:00
Jiarui Fang
29159d9b5b
hotfix tensor unittest bugs ( #862 )
2022-04-25 10:06:53 +08:00
YuliangLiu0306
c6930d8ddf
[pipelinable]use ColoTensor to replace dummy tensor. ( #853 )
2022-04-24 18:31:22 +08:00
Ziyue Jiang
bcc8655021
[Tensor ] Add 1Drow weight reshard by spec ( #854 )
2022-04-24 18:30:20 +08:00
ver217
d7e0303d1e
[zero] use GeminiMemoryManager when sampling model data ( #850 )
2022-04-24 17:17:22 +08:00
ver217
232142f402
[utils] refactor profiler ( #837 )
...
* add model data profiler
* add a subclass of torch.profiler.profile
* refactor folder structure
* remove redundant codes
* polish code
* use GeminiMemoryManager
* fix import path
* fix stm profiler ext
* polish comments
* remove useless file
2022-04-24 17:03:59 +08:00
Jiarui Fang
62f059251b
[Tensor] init a tp network training unittest ( #849 )
2022-04-24 16:43:44 +08:00
ver217
0dea140760
[hotfix] add deconstructor for stateful tensor ( #848 )
...
* add deconstructor for stateful tensor
* fix colo init context
2022-04-24 15:03:04 +08:00
ver217
0f7ed8c192
fix _post_init_method of zero init ctx ( #847 )
2022-04-24 14:16:50 +08:00
Ziyue Jiang
2a0a427e04
[tensor]add assert for colo_tensor 1Drow ( #846 )
2022-04-24 14:12:45 +08:00
Ziyue Jiang
05023ecfee
[Tensor] TP Linear 1D row ( #843 )
2022-04-24 13:43:12 +08:00
Frank Lee
cf6d1c9284
[CLI] refactored the launch CLI and fixed bugs in multi-node launching ( #844 )
...
* [cli] fixed multi-node job launching
* [cli] fixed a bug in version comparison
* [cli] support launching with env var
* [cli] fixed multi-node job launching
* [cli] fixed a bug in version comparison
* [cli] support launching with env var
* added docstring
* [cli] added extra launch arguments
* [cli] added default launch rdzv args
* [cli] fixed version comparison
* [cli] added docstring examples and requierment
* polish docstring
* polish code
* polish code
2022-04-24 13:26:26 +08:00
HELSON
e5ea3fdeef
[gemini] add GeminiMemoryManger ( #832 )
...
* refactor StatefulTensor, tensor utilities
* add unitest for GeminiMemoryManager
2022-04-24 13:08:48 +08:00
YuliangLiu0306
35ea6e1023
[pipelinable]use pipelinable context to initialize non-pipeline model ( #816 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [pipeline]add module lazy init feature to support large model initization.
* [pipeline]add to_layer_list and partition method to support arbitrary non-pp model
* refactor the module structure
* polish
* [pipelinable]add unit test for pipelinable
* polish
* polish
* Fix CodeFactor issues.
2022-04-24 13:03:12 +08:00
Jiarui Fang
ea0a2ed25f
[hotfix] the bug of numel() in ColoTensor ( #845 )
2022-04-24 12:32:10 +08:00
LuGY
c1e8d2001e
modefied the pp build for ckpt adaptation ( #803 )
2022-04-24 12:23:16 +08:00
Jiarui Fang
8789850eea
Init Conext supports lazy allocate model memory ( #842 )
2022-04-22 18:03:35 +08:00
Jiarui Fang
4575a3298b
[hotfix] ColoTensor pin_memory ( #840 )
2022-04-22 17:07:46 +08:00
Frank Lee
01e9f834f5
[dependency] removed torchvision ( #833 )
...
* [dependency] removed torchvision
* fixed transforms
2022-04-22 15:24:35 +08:00
Jiarui Fang
cb5a4778e1
Revert "[WIP] Applying ColoTensor on TP-1D-row Linear. ( #831 )" ( #835 )
...
This reverts commit ac88de6dfc.
2022-04-22 14:45:57 +08:00
Jiarui Fang
ac88de6dfc
[WIP] Applying ColoTensor on TP-1D-row Linear. ( #831 )
...
* revert zero tensors back
* [tensor] init row 1d linear
2022-04-22 14:03:26 +08:00
Jiarui Fang
595bedf767
revert zero tensors back ( #829 )
2022-04-22 12:12:35 +08:00
Jiarui Fang
294a6060d0
[tensor] ZeRO use ColoTensor as the base class. ( #828 )
...
* [refactor] moving InsertPostInitMethodToModuleSubClasses to utils.
* [tensor] ZeRO use ColoTensor as the base class.
* polish
2022-04-22 12:00:48 +08:00
Ziyue Jiang
8e6fdb4f29
[tensor]fix test_linear ( #826 )
2022-04-21 17:18:56 +08:00
Ziyue Jiang
1a9e2c2dff
[tensor] fix kwargs in colo_tensor torch_funtion ( #825 )
2022-04-21 16:47:35 +08:00
Jiarui Fang
eb1b89908c
[refactor] moving InsertPostInitMethodToModuleSubClasses to utils. ( #824 )
2022-04-21 16:03:18 +08:00
Jiarui Fang
2ecc3d7a55
[tensor] lazy init ( #823 )
2022-04-21 15:40:23 +08:00
Jiarui Fang
68dcd51d41
[Tensor] update ColoTensor torch_function ( #822 )
...
* Revert "[zero] add ZeroTensorShardStrategy (#793 )"
This reverts commit 88759e289e.
* [gemini] set cpu memory capacity
* [log] local throughput collecting
* polish
* polish
* polish
* polish code
* polish
* polish code
* add a new tensor structure and override linear for it
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* [tensor] renaming and reorganize directory structure.
* rm useless dir
* polish
* polish
* [tensor] hander the function not wrapped
* polish
2022-04-21 14:25:27 +08:00
Jiarui Fang
0ce8924ceb
[tensor] reorganize files ( #820 )
2022-04-21 14:15:48 +08:00
Jiarui Fang
ab962b9735
[gemini] a new tensor structure ( #818 )
...
* Revert "[zero] add ZeroTensorShardStrategy (#793 )"
This reverts commit 88759e289e.
* [gemini] set cpu memory capacity
* [log] local throughput collecting
* polish
* polish
* polish
* polish code
* polish
* polish code
* add a new tensor structure and override linear for it
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
2022-04-21 11:42:37 +08:00
FrankLeeeee
70ed11d07e
[cli] added check installation cli
2022-04-20 12:13:27 +08:00
YuliangLiu0306
c7eca40f51
Merge pull request #812 from FrankLeeeee/feature/cli
...
[cli] fixed single-node process launching
2022-04-20 11:40:07 +08:00
Jiarui Fang
3ddbd1bce1
[gemini] collect cpu-gpu moving volume in each iteration ( #813 )
2022-04-20 11:29:48 +08:00
FrankLeeeee
d522cb704e
[cli] fixed single-node process launching
2022-04-20 10:46:51 +08:00
Jiarui Fang
61c20b44bc
[log] local throughput metrics ( #811 )
...
* Revert "[zero] add ZeroTensorShardStrategy (#793 )"
This reverts commit 88759e289e.
* [gemini] set cpu memory capacity
* [log] local throughput collecting
* polish
* polish
* polish
* polish code
* polish
2022-04-20 10:05:39 +08:00
ver217
dd92b90a68
[DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext ( #808 )
...
* init fp16 param directly
* polish code
2022-04-19 16:16:48 +08:00
Jiarui Fang
227d1cd4b3
[gemini] APIs to set cpu memory capacity ( #809 )
2022-04-19 16:05:22 +08:00
FrankLeeeee
f63e91d280
[cli] fixed a bug in user args and refactored the module structure
2022-04-19 15:15:16 +08:00
Jiarui Fang
e761ad2cd7
Revert "[zero] add ZeroTensorShardStrategy ( #793 )" ( #806 )
2022-04-19 14:40:02 +08:00
HELSON
88759e289e
[zero] add ZeroTensorShardStrategy ( #793 )
2022-04-19 14:32:45 +08:00
Jiarui Fang
681addb512
[refactor] moving grad acc logic to engine ( #804 )
2022-04-19 14:03:21 +08:00
Frank Lee
05d9ae5999
[cli] add missing requirement ( #805 )
2022-04-19 13:56:59 +08:00
YuliangLiu0306
de2f581d43
[cli] added micro benchmarking for tp ( #789 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [CLI]add cli benchmark feature
* fix CodeFactor issues.
* refactor the module structure.
2022-04-19 12:08:28 +08:00
YuliangLiu0306
cfadc9df8e
[cli] added distributed launcher command ( #791 )
...
* [CLI] add CLI launcher
* Revert "[CLI] add CLI launcher"
This reverts commit df7e6506d4.
* [CLI]add cli launcher feature
* remove testing message used during developing
* refactor the module structure.
2022-04-19 10:59:44 +08:00
Jiarui Fang
4d9332b4c5
[refactor] moving memtracer to gemini ( #801 )
2022-04-19 10:13:08 +08:00
Jiarui Fang
8711c706f4
[hotfix] fix grad offload when enabling reuse_fp16_shard
2022-04-18 14:58:21 +08:00
ver217
f1fa1a675f
fix grad offload when enabling reuse_fp16_shard
2022-04-18 14:07:39 +08:00
HELSON
4c4388c46e
[hotfix] fix memory leak in zero ( #781 )
2022-04-18 13:57:03 +08:00
Ziyue Jiang
4b01da24cd
[TP] change the check assert in split batch 2d ( #772 )
2022-04-16 21:29:57 +08:00
ver217
846406a07a
[gemini] fix auto tensor placement policy ( #775 )
2022-04-16 21:29:31 +08:00
HELSON
a65cbb7e4e
[zero] refactor shard and gather operation ( #773 )
2022-04-15 14:41:31 +08:00
ver217
6e553748a7
polish sharded optim docstr and warning ( #770 )
2022-04-14 21:03:59 +08:00
LuGY
80e37eec42
fix the ckpt bugs when using DDP ( #769 )
2022-04-14 21:03:24 +08:00
Frank Lee
920fe31526
[compatibility] used backward-compatible API for global process group ( #758 )
2022-04-14 17:20:35 +08:00
Frank Lee
4ea49cb536
[test] added a decorator for address already in use error with backward compatibility ( #760 )
...
* [test] added a decorator for address already in use error with backward compatibility
* [test] added a decorator for address already in use error with backward compatibility
2022-04-14 16:48:44 +08:00
Jiarui Fang
10ef8afdd2
[gemini] init genimi individual directory ( #754 )
2022-04-14 16:40:26 +08:00
ver217
dcca614eee
[hotfix] fix test_stateful_tensor_mgr ( #762 )
2022-04-14 15:50:09 +08:00
ver217
a93a7d7364
[hotfix] fix reuse_fp16_shard of sharded model ( #756 )
...
* fix reuse_fp16_shard
* disable test stm
* polish code
2022-04-14 14:56:46 +08:00
ver217
8f7ce94b8e
[hotfix] fix auto tensor placement policy ( #753 )
2022-04-14 12:04:45 +08:00
HELSON
84c6700b2a
[zero] refactor memstats_collector ( #746 )
2022-04-14 12:01:12 +08:00
アマデウス
b8899e0905
[TP] allow layernorm without bias ( #750 )
2022-04-14 11:43:56 +08:00
Jiarui Fang
3d7dc46d33
[zero] use factory pattern for tensor_placement_policy ( #752 )
2022-04-14 11:07:29 +08:00
ver217
4b048a8728
fix prepare grads in sharded optim ( #749 )
2022-04-13 22:36:11 +08:00
ver217
097772546e
fix initialize about zero
2022-04-13 19:10:21 +08:00
ver217
e396bb71f2
[zero] add tensor placement policies ( #743 )
...
* add tensor placement policies
* polish comments
* polish comments
* update moe unit tests
2022-04-13 15:00:48 +08:00
HELSON
22c4b88d56
[zero] refactor ShardedParamV2 for convenience ( #742 )
2022-04-13 14:54:26 +08:00
HELSON
340e59f968
[utils] add synchronized cuda memory monitor ( #740 )
2022-04-13 10:50:54 +08:00
ver217
e6212f56cd
[hotfix] fix memory leak in backward of sharded model ( #741 )
2022-04-13 09:59:05 +08:00
Frank Lee
a4e91bc87f
[bug] fixed grad scaler compatibility with torch 1.8 ( #735 )
2022-04-12 16:04:21 +08:00
Jiarui Fang
53cb584808
[utils] correct cpu memory used and capacity in the context of multi-process ( #726 )
2022-04-12 14:57:54 +08:00
Jiarui Fang
7db3ccc79b
[hotfix] remove duplicated param register to stateful tensor manager ( #728 )
2022-04-12 13:55:25 +08:00
Frank Lee
1cb7bdad3b
[util] fixed communication API depth with PyTorch 1.9 ( #721 )
2022-04-12 09:44:40 +08:00
Frank Lee
2412429d54
[util] fixed activation checkpointing on torch 1.9 ( #719 )
2022-04-12 09:35:45 +08:00
Frank Lee
04ff5ea546
[utils] support detection of number of processes on current node ( #723 )
2022-04-12 09:28:19 +08:00
Jiarui Fang
4d90a7b513
[refactor] zero directory ( #724 )
2022-04-11 23:13:02 +08:00
Jiarui Fang
193dc8dacb
[refactor] refactor the memory utils ( #715 )
2022-04-11 16:47:57 +08:00
HELSON
dbd96fe90a
[zero] check whether gradients have inf and nan in gpu ( #712 )
2022-04-11 15:40:13 +08:00
ver217
715b86eadd
[hotfix] fix stm cuda model data size ( #710 )
2022-04-11 15:10:39 +08:00
LuGY
140263a394
[hotfix]fixed bugs of assigning grad states to non leaf nodes ( #711 )
...
* fixed bugs of assigning grad states to non leaf nodes
* use detach()
2022-04-11 14:04:58 +08:00
Frank Lee
eda30a058e
[compatibility] fixed tensor parallel compatibility with torch 1.9 ( #700 )
2022-04-11 13:44:50 +08:00
HELSON
a9b8300d54
[zero] improve adaptability for not-shard parameters ( #708 )
...
* adapt post grad hooks for not-shard parameters
* adapt optimizer for not-shard parameters
* offload gradients for not-replicated parameters
2022-04-11 13:38:51 +08:00
ver217
ab8c6b4a0e
[zero] refactor memstats collector ( #706 )
...
* refactor memstats collector
* fix disposable
* polish code
2022-04-11 10:46:08 +08:00
アマデウス
3fc8a204dc
[]Corrected 3d vocab parallel embedding ( #707 )
2022-04-11 10:17:55 +08:00
HELSON
ee112fe1da
[zero] adapt zero hooks for unsharded module ( #699 )
2022-04-08 20:23:26 +08:00
ver217
3c9cd5bb5e
[zero] stateful tensor manager ( #687 )
...
* [WIP] stateful tensor manager
* add eviction strategy
* polish code
* polish code
* polish comment
* add unit test
* fix sampler bug
* polish code
* fix max sampling cnt resetting bug
* fix sampler bug
* polish code
* fix bug
* fix unit test
Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
2022-04-08 17:51:34 +08:00
HELSON
d7ecaf362b
[zero] fix init bugs in zero context ( #686 )
...
* adapt model weight initialization for methods in Pytorch nn.init
2022-04-07 17:38:45 +08:00
YuliangLiu0306
0ed7042f42
[pipeline] refactor pipeline ( #679 )
...
* refactor pipeline---put runtime schedule into engine.
* add type hint for schedule Optional[BaseSchedule]
* preprocess schedule during engine initializing
* infer pipeline schedule params from config
2022-04-07 15:54:14 +08:00
Jiarui Fang
59bf2dc590
[zero] initialize a stateful tensor manager ( #614 )
2022-04-06 16:18:49 +08:00
encmps
79ccfa4310
[NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_adam.cu code style ( #667 )
2022-04-06 11:40:59 +08:00
lucasliunju
e4bcff9b0f
[NFC] polish colossalai/builder/builder.py code style ( #662 )
2022-04-06 11:40:59 +08:00
shenggan
331683bf82
[NFC] polish colossalai/kernel/cuda_native/csrc/layer_norm_cuda_kernel.cu code style ( #661 )
2022-04-06 11:40:59 +08:00
FredHuang99
c336cd3066
[NFC] polish colossalai/communication/utils.py code style ( #656 )
2022-04-06 11:40:59 +08:00
MaxT
5ab9a71299
[NFC] polish colossalai/kernel/cuda_native/csrc/moe_cuda.cpp code style ( #642 )
2022-04-06 11:40:59 +08:00
Xue Fuzhao
10afec728f
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/include/cuda_util.h code style ( #641 )
2022-04-06 11:40:59 +08:00
Cautiousss
055d0270c8
[NFC] polish colossalai/context/process_group_initializer/initializer_sequence.py colossalai/context/process_group_initializer initializer_tensor.py code style ( #639 )
...
Co-authored-by: 何晓昕 <cautious@r-236-100-25-172.comp.nus.edu.sg>
2022-04-06 11:40:59 +08:00
Ziheng Qin
c7c224ee17
[NFC] polish colossalai/builder/pipeline.py code style ( #638 )
2022-04-06 11:40:59 +08:00
Sze-qq
10591ecdf9
[NFC] polish colossalai/kernel/cuda_native/csrc/cpu_adam.cpp code style ( #636 )
2022-04-06 11:40:59 +08:00
Wangbo Zhao
6fcb381801
[NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_l2norm_kernel.cu code style ( #635 )
2022-04-06 11:40:59 +08:00
ExtremeViscent
8a5d526e95
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/dropout_kernels.cu and cross_entropy.cu code style ( #634 )
2022-04-06 11:40:59 +08:00
RichardoLuo
ad1e7ab2b2
'[NFC] polish <colossalai/engine/_base_engine.py> code style' ( #631 )
...
Co-authored-by: RichardoLuo <14049555596@qq.com>
2022-04-06 11:40:59 +08:00
Zangwei
2e11853d04
[NFC] polish colossalai/communication/ring.py code style ( #630 )
2022-04-06 11:40:59 +08:00
puck_WCR
01cc941e1d
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/transform_kernels.cu code stype ( #629 )
2022-04-06 11:40:59 +08:00
superhao1995
c1bed0d998
[NFC] polish colossalai/kernel/cuda_native/csrc/multi_tensor_lamb.cu code style ( #628 )
2022-04-06 11:40:59 +08:00
Jiang Zhuo
0a96338b13
[NFC] polish <colossalai/context/process_group_initializer/initializer_data.py> code style ( #626 )
...
Co-authored-by: 姜卓 <jiangzhuo@jiangzhuodeMacBook-Pro.local>
2022-04-06 11:40:59 +08:00
ziyu huang
701bad439b
[NFC] polish colossalai/context/process_group_initializer/process_group_initializer.py code style ( #617 )
...
Co-authored-by: “Arsmart123 <202476410arsmart@gmail.com>
2022-04-06 11:40:59 +08:00
Shawn-Kong
db54419409
fix format ( #613 )
...
Co-authored-by: evin K <evink@evins-MacBook-Air.local>
2022-04-06 11:40:59 +08:00
Yuer867
5ecef13c16
fix format ( #611 )
2022-04-06 11:40:59 +08:00
xyupeng
d3d5bedc65
fix format ( #607 )
2022-04-06 11:40:59 +08:00
xuqifan897
f2d2a1597a
fix format ( #608 )
2022-04-06 11:40:59 +08:00
doubleHU
f2da21a827
fix format ( #586 )
2022-04-06 11:40:59 +08:00
fanjinfucool
ffad81e1d1
fix format ( #585 )
...
Co-authored-by: fanjifu <FAN>
2022-04-06 11:40:59 +08:00
binmakeswell
6582aedc94
fix format ( #583 )
2022-04-06 11:40:59 +08:00
DouJS
f08fc17f2b
block_reduce.h fix format ( #581 )
2022-04-06 11:40:59 +08:00
Maruyama_Aya
d2dc6049b5
fix format ( #580 )
2022-04-06 11:40:59 +08:00
wky
174b9c1d85
fix format ( #574 )
2022-04-06 11:40:59 +08:00
BoxiangW
dfe423ae42
fix format ( #572 )
2022-04-06 11:40:59 +08:00
yuxuan-lou
cfb41297ff
'fix/format' ( #573 )
2022-04-06 11:40:59 +08:00
Kai Wang (Victor Kai)
b0f708dfc1
fix format ( #570 )
2022-04-06 11:40:59 +08:00
Xu Kai
2a915a8b62
fix format ( #568 )
2022-04-06 11:40:59 +08:00
YuliangLiu0306
9420d3ae31
fix format ( #567 )
2022-04-06 11:40:59 +08:00
Jie Zhu
0f1da44e5e
[format]colossalai/kernel/cuda_native/csrc/layer_norm_cuda.cpp ( #566 )
2022-04-06 11:40:59 +08:00
coder-chin
5835631218
fix format ( #564 )
2022-04-06 11:40:59 +08:00
Luxios22
e014144c44
fix format ( #565 )
2022-04-06 11:40:59 +08:00
Ziyue Jiang
1762ba14ab
fix format ( #563 )
2022-04-06 11:40:59 +08:00
HELSON
17e73e62cc
[hotfix] fix bugs for unsharded parameters when restore data ( #664 )
2022-04-03 22:02:11 +08:00
Jiarui Fang
0aab52301e
[hotfix] fix a bug in model data stats tracing ( #655 )
2022-04-03 21:48:06 +08:00
YuliangLiu0306
ade05a5d83
[refactor] pipeline, put runtime schedule into engine. ( #627 )
2022-04-03 20:46:45 +08:00
HELSON
e5d615aeee
[hotfix] fix bugs in testing ( #659 )
...
* remove hybrid adam in test_moe_zero_optim
* fix activation checkpointing and its unit test
2022-04-02 21:58:47 +08:00
Jiarui Fang
036404ca8a
Revert "[zero] polish init context ( #645 )" ( #657 )
2022-04-02 18:30:06 +08:00
HELSON
b31daed4cf
fix bugs in CPU adam ( #633 )
...
* add cpu adam counter for all cpu adam
* fixed updating error in adam kernel
2022-04-02 17:04:05 +08:00
LuGY
1e2557e801
[zero] fixed the activation offload ( #647 )
...
* fixed the activation offload
* polish
2022-04-02 16:21:32 +08:00
Liang Bowen
828e465622
[hotfix] Raise messages for indivisible batch sizes with tensor parallelism ( #622 )
2022-04-02 16:12:04 +08:00
Jiarui Fang
67b4928244
[zero] polish init context ( #645 )
2022-04-02 15:52:04 +08:00
ver217
f5d3a9c2b0
polish checkpoint docstring ( #637 )
2022-04-02 13:34:33 +08:00
HELSON
055fbf5be6
[zero] adapt zero for unsharded parameters (Optimizer part) ( #601 )
2022-04-01 20:10:47 +08:00
KAIYUAN GAN
229382c844
[NFC] polish colossalai/kernel/cuda_native/csrc/kernels/cuda_util.cu code style ( #625 )
2022-04-01 17:45:53 +08:00