Commit Graph

36 Commits (2bf2d1cd3b2af434b9d4d9b20efeddb471c702e0)

Author SHA1 Message Date
Jiarui Fang b3b89865e2
[Gemini] ParamOpHook -> ColoParamOpHook (#2080) 2022-12-05 17:11:06 +08:00
Jiarui Fang 96134e7be3
[hotfix] add bert test for gemini fwd bwd (#2035) 2022-11-29 11:19:52 +08:00
Jiarui Fang 8daf1b4db1
[Gemini] patch for supporting orch.add_ function for ColoTensor (#2003) 2022-11-25 20:06:35 +08:00
Jiarui Fang cc0ed7cf33
[Gemini] ZeROHookV2 -> GeminiZeROHook (#1972) 2022-11-17 14:43:49 +08:00
Jiarui Fang cd5a0d56fa
[Gemini] make gemini usage simple (#1821) 2022-11-08 15:53:13 +08:00
Zihao 20e255d4e8
MemStatsCollectorStatic (#1765) 2022-11-07 16:49:03 +08:00
HELSON c6a1a62636
[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786)
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12

* [zero] add cpu shard init

* [zero] add tiny example test

* [colo_tensor] fix bugs for torch-1.11
2022-11-02 16:11:34 +08:00
HELSON 1468e4bcfc
[zero] add constant placement policy (#1705)
* fixes memory leak when paramter is in fp16 in ZeroDDP init.
* bans chunk releasement in CUDA. Only when a chunk is about to offload, it is allowed to release.
* adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.
2022-10-14 17:53:16 +08:00
HELSON b28991dd0a
[feature] A new ZeRO implementation (#1644) 2022-10-09 09:18:51 +08:00
Jiarui Fang c5d39215f6
Revert "[feature] new zero implementation (#1623)" (#1643)
This reverts commit 5be118f405.
2022-09-26 10:06:03 +08:00
HELSON 5be118f405
[feature] new zero implementation (#1623) 2022-09-24 19:58:18 +08:00
Jiatong Han 3263cdf57f [NFC] polish colossalai/nn/parallel/data_parallel.py code style (#1570)
Co-authored-by: JThh <jiatong.han@u.nus.edu>
2022-09-08 22:11:04 +08:00
ver217 04c9a86af8
[zero] ZeroDDP supports controlling outputs' dtype (#1399) 2022-08-02 17:49:11 +08:00
HELSON 4e98e938ce
[zero] alleviate memory usage in ZeRODDP state_dict (#1398) 2022-08-02 15:49:13 +08:00
ver217 83328329dd
[hotfix] fix zero ddp buffer cast (#1376)
* fix zero ddp buffer cast

* fix zero ddp ignore params
2022-07-28 10:54:44 +08:00
ver217 5d5031e946
fix zero ddp state dict (#1378) 2022-07-28 09:31:42 +08:00
HELSON 87775a0682
[colotensor] use cpu memory to store state_dict (#1367) 2022-07-26 14:13:38 +08:00
ver217 d068af81a3
[doc] update rst and docstring (#1351)
* update rst

* add zero docstr

* fix docstr

* remove fx.tracer.meta_patch

* fix docstr

* fix docstr

* update fx rst

* fix fx docstr

* remove useless rst
2022-07-21 15:54:53 +08:00
ver217 0c51ff2c13
[hotfix] ZeroDDP use new process group (#1333)
* process group supports getting ranks in group

* chunk mgr receives a process group

* update unit test

* fix unit tests
2022-07-18 14:14:52 +08:00
Jiarui Fang b5f25eb32a
[Tensor] add cpu group to ddp (#1200) 2022-07-05 14:58:28 +08:00
Jiarui Fang 060b917daf
[refactor] remove gpc dependency in colotensor's _ops (#1189) 2022-07-04 18:54:37 +08:00
Jiarui Fang 372f791444
[refactor] move chunk and chunkmgr to directory gemini (#1182) 2022-06-29 13:31:02 +08:00
ver217 6b2f2ab9bb
[ddp] ColoDDP uses bucket all-reduce (#1177)
* add reducer

* update colo ddp with reducer

* polish unit test

* polish unit test
2022-06-29 10:34:13 +08:00
ver217 54aabb8da4
[gemini] refactor gemini mgr (#1151)
* refactor gemini mgr

* udpate __init__
2022-06-22 11:54:36 +08:00
ver217 8106d7b8c7
[ddp] refactor ColoDDP and ZeroDDP (#1146)
* ColoDDP supports overwriting default process group

* rename ColoDDPV2 to ZeroDDP

* add docstr for ZeroDDP

* polish docstr
2022-06-21 16:35:23 +08:00
Frank Lee 15aab1476e
[zero] avoid zero hook spam by changing log to debug level (#1137) 2022-06-21 10:44:01 +08:00
ver217 d26902645e
[ddp] add save/load state dict for ColoDDP (#1127)
* add save/load state dict for ColoDDP

* add unit test

* refactor unit test folder

* polish unit test

* rename unit test
2022-06-20 10:51:47 +08:00
ver217 f0a954f16d
[ddp] add set_params_to_ignore for ColoDDP (#1122)
* add set_params_to_ignore for ColoDDP

* polish code

* fix zero hook v2

* add unit test

* polish docstr
2022-06-16 12:54:46 +08:00
ver217 e127b4375b
cast colo ddp v2 inputs/outputs (#1120) 2022-06-15 15:57:04 +08:00
ver217 7d14b473f0
[gemini] gemini mgr supports "cpu" placement policy (#1118)
* update gemini mgr

* update chunk

* add docstr

* polish placement policy

* update test chunk

* update test zero

* polish unit test

* remove useless unit test
2022-06-15 15:05:19 +08:00
ver217 895c1c5ee7
[tensor] refactor param op hook (#1097)
* refactor param op hook

* add docstr

* fix bug
2022-06-13 16:11:53 +08:00
Frank Lee cb18922c47
[doc] added documentation to chunk and chunk manager (#1094)
* [doc] added documentation to chunk and chunk manager

* polish code

* polish code

* polish code
2022-06-10 15:33:06 +08:00
ver217 1f894e033f
[gemini] zero supports gemini (#1093)
* add placement policy

* add gemini mgr

* update mem stats collector

* update zero

* update zero optim

* fix bugs

* zero optim monitor os

* polish unit test

* polish unit test

* add assert
2022-06-10 14:48:28 +08:00
ver217 be01db37c8
[tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077)
* polish chunk manager

* polish unit test

* impl add_extern_static_tensor for chunk mgr

* add mem stats collector v2

* polish code

* polish unit test

* polish code

* polish get chunks
2022-06-09 20:56:34 +08:00
Ziyue Jiang 4fc748f69b
[Tensor] fix optimizer for CPU parallel (#1069) 2022-06-06 17:36:11 +08:00
Jiarui Fang 49832b2344
[refactory] add nn.parallel module (#1068) 2022-06-06 15:34:41 +08:00