Commit Graph

220 Commits (e871e342b33d8538bee3604c49021731981b1249)

Author SHA1 Message Date
ver217 821c6172e2
[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442) 2022-08-11 22:58:58 +08:00
ver217 6df3e19be9
[hotfix] zero optim prevents calling inner optim.zero_grad (#1422) 2022-08-09 16:08:12 +08:00
ver217 8dced41ad0
[zero] zero optim state_dict takes only_rank_0 (#1384)
* zero optim state_dict takes only_rank_0

* fix unit test
2022-07-29 13:22:50 +08:00
ver217 828b9e5e0d
[hotfix] fix zero optim save/load state dict (#1381) 2022-07-28 17:19:39 +08:00
ver217 6b43c789fd
fix zero optim backward_by_grad and save/load (#1353) 2022-07-21 16:43:58 +08:00
ver217 d068af81a3
[doc] update rst and docstring (#1351)
* update rst

* add zero docstr

* fix docstr

* remove fx.tracer.meta_patch

* fix docstr

* fix docstr

* update fx rst

* fix fx docstr

* remove useless rst
2022-07-21 15:54:53 +08:00
ver217 ce470ba37e
[checkpoint] sharded optim save/load grad scaler (#1350) 2022-07-21 15:21:21 +08:00
ver217 7a05367101
[hotfix] shared model returns cpu state_dict (#1328) 2022-07-15 22:11:37 +08:00
Jiarui Fang 4165eabb1e
[hotfix] remove potiential circle import (#1307)
* make it faster

* [hotfix] remove circle import
2022-07-14 13:44:26 +08:00
ver217 a45ddf2d5f
[hotfix] fix sharded optim step and clip_grad_norm (#1226) 2022-07-08 13:34:48 +08:00
Jiarui Fang a444633d13
warmup ratio configration (#1192) 2022-06-30 15:23:50 +08:00
Jiarui Fang 372f791444
[refactor] move chunk and chunkmgr to directory gemini (#1182) 2022-06-29 13:31:02 +08:00
ver217 9e1daa63d2
[zero] sharded optim supports loading local state dict (#1170)
* sharded optim supports loading local state dict

* polish code

* add unit test
2022-06-24 18:05:16 +08:00
ver217 561e90493f
[zero] zero optim supports loading local state dict (#1171)
* zero optim supports loading local state dict

* polish code

* add unit test
2022-06-24 17:25:57 +08:00
ver217 8106d7b8c7
[ddp] refactor ColoDDP and ZeroDDP (#1146)
* ColoDDP supports overwriting default process group

* rename ColoDDPV2 to ZeroDDP

* add docstr for ZeroDDP

* polish docstr
2022-06-21 16:35:23 +08:00
ver217 6690a61b4d
[hotfix] prevent nested ZeRO (#1140) 2022-06-21 11:33:53 +08:00
Frank Lee 15aab1476e
[zero] avoid zero hook spam by changing log to debug level (#1137) 2022-06-21 10:44:01 +08:00
ver217 a1a7899cae
[hotfix] fix zero init ctx numel (#1128) 2022-06-16 17:17:27 +08:00
ver217 f0a954f16d
[ddp] add set_params_to_ignore for ColoDDP (#1122)
* add set_params_to_ignore for ColoDDP

* polish code

* fix zero hook v2

* add unit test

* polish docstr
2022-06-16 12:54:46 +08:00
Frank Lee 14e5b11d7f
[zero] fixed api consistency (#1098) 2022-06-10 16:59:59 +08:00
Frank Lee cb18922c47
[doc] added documentation to chunk and chunk manager (#1094)
* [doc] added documentation to chunk and chunk manager

* polish code

* polish code

* polish code
2022-06-10 15:33:06 +08:00
ver217 1f894e033f
[gemini] zero supports gemini (#1093)
* add placement policy

* add gemini mgr

* update mem stats collector

* update zero

* update zero optim

* fix bugs

* zero optim monitor os

* polish unit test

* polish unit test

* add assert
2022-06-10 14:48:28 +08:00
ver217 be01db37c8
[tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077)
* polish chunk manager

* polish unit test

* impl add_extern_static_tensor for chunk mgr

* add mem stats collector v2

* polish code

* polish unit test

* polish code

* polish get chunks
2022-06-09 20:56:34 +08:00
ver217 c5cd3b0f35
[zero] zero optim copy chunk rather than copy tensor (#1070) 2022-06-07 10:30:46 +08:00
Jiarui Fang 49832b2344
[refactory] add nn.parallel module (#1068) 2022-06-06 15:34:41 +08:00
ver217 e3fde4ee6b
fix import error in sharded model v2 (#1053) 2022-06-02 13:48:22 +08:00
ver217 51b9a49655
[zero] add zero optimizer for ColoTensor (#1046)
* add zero optimizer

* torch ok

* unit test ok

* polish code

* fix bugs

* polish unit test

* polish zero optim

* polish colo ddp v2

* refactor folder structure

* add comment

* polish unit test

* polish zero optim

* polish unit test
2022-06-02 12:13:15 +08:00
ver217 9492a561c3
[tensor] ColoTensor supports ZeRo (#1015)
* impl chunk manager

* impl param op hook

* add reduce_chunk

* add zero hook v2

* add zero dp

* fix TensorInfo

* impl load balancing when using zero without chunk

* fix zero hook

* polish chunk

* fix bugs

* ddp ok

* zero ok

* polish code

* fix bugs about load balancing

* polish code

* polish code

* add ene-to-end test

* polish code

* polish code

* polish code

* fix typo

* add test_chunk

* fix bugs

* fix bugs

* polish code
2022-05-31 12:00:12 +08:00
ver217 7cfd6c827e
[zero] add load_state_dict for sharded model (#894)
* add load_state_dict for sharded model

* fix bug

* fix bug

* fix ckpt dtype and device

* support load state dict in zero init ctx

* fix bugs
2022-05-27 10:25:08 +08:00
ver217 c4d903e64a
[gemini] accelerate adjust_layout() (#878)
* add lru cache

* polish code

* update unit test

* fix sharded optim
2022-04-26 18:08:31 +08:00
HELSON 425b4a96b8
[gemini] polish stateful_tensor_mgr (#876) 2022-04-26 15:05:03 +08:00
ver217 d7e0303d1e
[zero] use GeminiMemoryManager when sampling model data (#850) 2022-04-24 17:17:22 +08:00
ver217 0f7ed8c192
fix _post_init_method of zero init ctx (#847) 2022-04-24 14:16:50 +08:00
HELSON e5ea3fdeef
[gemini] add GeminiMemoryManger (#832)
* refactor StatefulTensor, tensor utilities

* add unitest for GeminiMemoryManager
2022-04-24 13:08:48 +08:00
Jiarui Fang 595bedf767
revert zero tensors back (#829) 2022-04-22 12:12:35 +08:00
Jiarui Fang 294a6060d0
[tensor] ZeRO use ColoTensor as the base class. (#828)
* [refactor] moving InsertPostInitMethodToModuleSubClasses to utils.

* [tensor] ZeRO use ColoTensor as the base class.

* polish
2022-04-22 12:00:48 +08:00
Jiarui Fang eb1b89908c
[refactor] moving InsertPostInitMethodToModuleSubClasses to utils. (#824) 2022-04-21 16:03:18 +08:00
Jiarui Fang 3ddbd1bce1
[gemini] collect cpu-gpu moving volume in each iteration (#813) 2022-04-20 11:29:48 +08:00
Jiarui Fang 61c20b44bc
[log] local throughput metrics (#811)
* Revert "[zero] add ZeroTensorShardStrategy (#793)"

This reverts commit 88759e289e.

* [gemini] set cpu memory capacity

* [log] local throughput collecting

* polish

* polish

* polish

* polish code

* polish
2022-04-20 10:05:39 +08:00
ver217 dd92b90a68
[DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext (#808)
* init fp16 param directly

* polish code
2022-04-19 16:16:48 +08:00
Jiarui Fang e761ad2cd7
Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806) 2022-04-19 14:40:02 +08:00
HELSON 88759e289e
[zero] add ZeroTensorShardStrategy (#793) 2022-04-19 14:32:45 +08:00
Jiarui Fang 4d9332b4c5
[refactor] moving memtracer to gemini (#801) 2022-04-19 10:13:08 +08:00
Jiarui Fang 8711c706f4
[hotfix] fix grad offload when enabling reuse_fp16_shard 2022-04-18 14:58:21 +08:00
ver217 f1fa1a675f fix grad offload when enabling reuse_fp16_shard 2022-04-18 14:07:39 +08:00
HELSON 4c4388c46e
[hotfix] fix memory leak in zero (#781) 2022-04-18 13:57:03 +08:00
HELSON a65cbb7e4e
[zero] refactor shard and gather operation (#773) 2022-04-15 14:41:31 +08:00
ver217 6e553748a7
polish sharded optim docstr and warning (#770) 2022-04-14 21:03:59 +08:00
Jiarui Fang 10ef8afdd2
[gemini] init genimi individual directory (#754) 2022-04-14 16:40:26 +08:00
ver217 dcca614eee
[hotfix] fix test_stateful_tensor_mgr (#762) 2022-04-14 15:50:09 +08:00
ver217 a93a7d7364
[hotfix] fix reuse_fp16_shard of sharded model (#756)
* fix reuse_fp16_shard

* disable test stm

* polish code
2022-04-14 14:56:46 +08:00
ver217 8f7ce94b8e
[hotfix] fix auto tensor placement policy (#753) 2022-04-14 12:04:45 +08:00
HELSON 84c6700b2a
[zero] refactor memstats_collector (#746) 2022-04-14 12:01:12 +08:00
Jiarui Fang 3d7dc46d33
[zero] use factory pattern for tensor_placement_policy (#752) 2022-04-14 11:07:29 +08:00
ver217 4b048a8728
fix prepare grads in sharded optim (#749) 2022-04-13 22:36:11 +08:00
ver217 e396bb71f2
[zero] add tensor placement policies (#743)
* add tensor placement policies

* polish comments

* polish comments

* update moe unit tests
2022-04-13 15:00:48 +08:00
HELSON 22c4b88d56
[zero] refactor ShardedParamV2 for convenience (#742) 2022-04-13 14:54:26 +08:00
ver217 e6212f56cd
[hotfix] fix memory leak in backward of sharded model (#741) 2022-04-13 09:59:05 +08:00
Jiarui Fang 7db3ccc79b
[hotfix] remove duplicated param register to stateful tensor manager (#728) 2022-04-12 13:55:25 +08:00
Jiarui Fang 4d90a7b513
[refactor] zero directory (#724) 2022-04-11 23:13:02 +08:00
Jiarui Fang 193dc8dacb
[refactor] refactor the memory utils (#715) 2022-04-11 16:47:57 +08:00
HELSON dbd96fe90a
[zero] check whether gradients have inf and nan in gpu (#712) 2022-04-11 15:40:13 +08:00
ver217 715b86eadd
[hotfix] fix stm cuda model data size (#710) 2022-04-11 15:10:39 +08:00
HELSON a9b8300d54
[zero] improve adaptability for not-shard parameters (#708)
* adapt post grad hooks for not-shard parameters
* adapt optimizer for not-shard parameters
* offload gradients for not-replicated parameters
2022-04-11 13:38:51 +08:00
ver217 ab8c6b4a0e
[zero] refactor memstats collector (#706)
* refactor memstats collector

* fix disposable

* polish code
2022-04-11 10:46:08 +08:00
HELSON ee112fe1da
[zero] adapt zero hooks for unsharded module (#699) 2022-04-08 20:23:26 +08:00
ver217 3c9cd5bb5e
[zero] stateful tensor manager (#687)
* [WIP] stateful tensor manager

* add eviction strategy

* polish code

* polish code

* polish comment

* add unit test

* fix sampler bug

* polish code

* fix max sampling cnt resetting bug

* fix sampler bug

* polish code

* fix bug

* fix unit test

Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
2022-04-08 17:51:34 +08:00
HELSON d7ecaf362b
[zero] fix init bugs in zero context (#686)
* adapt model weight initialization for methods in Pytorch nn.init
2022-04-07 17:38:45 +08:00
Jiarui Fang 59bf2dc590
[zero] initialize a stateful tensor manager (#614) 2022-04-06 16:18:49 +08:00
HELSON 17e73e62cc
[hotfix] fix bugs for unsharded parameters when restore data (#664) 2022-04-03 22:02:11 +08:00
Jiarui Fang 0aab52301e
[hotfix] fix a bug in model data stats tracing (#655) 2022-04-03 21:48:06 +08:00
Jiarui Fang 036404ca8a
Revert "[zero] polish init context (#645)" (#657) 2022-04-02 18:30:06 +08:00
Jiarui Fang 67b4928244
[zero] polish init context (#645) 2022-04-02 15:52:04 +08:00
HELSON 055fbf5be6
[zero] adapt zero for unsharded paramters (Optimizer part) (#601) 2022-04-01 20:10:47 +08:00
ver217 0ef8819c67
polish docstring of zero (#612) 2022-04-01 14:50:56 +08:00
ver217 9bee119104
[hotfix] fix sharded optim zero grad (#604)
* fix sharded optim zero grad

* polish comments
2022-04-01 12:41:20 +08:00
Jiarui Fang e956d93ac2
[refactor] memory utils (#577) 2022-04-01 09:22:33 +08:00
HELSON e6d50ec107
[zero] adapt zero for unsharded parameters (#561)
* support existing sharded and unsharded parameters in zero

* add unitest for moe-zero model init

* polish moe gradient handler
2022-03-31 18:34:11 +08:00
ver217 7c6c427db1
[zero] trace states of fp16/32 grad and fp32 param (#571) 2022-03-31 16:26:54 +08:00
Jiarui Fang 7675366fce
[polish] rename col_attr -> colo_attr (#558) 2022-03-31 12:25:45 +08:00
ver217 014bac0c49
[zero] hijack p.grad in sharded model (#554)
* hijack p.grad in sharded model

* polish comments

* polish comments
2022-03-30 18:14:50 +08:00
Jiarui Fang f552b11294
[zero] label state for param fp16 and grad (#551) 2022-03-30 15:57:46 +08:00
Jiarui Fang 214da761d4
[zero] add stateful tensor (#549) 2022-03-30 13:51:37 +08:00
Jiarui Fang 107b99ddb1
[zero] dump memory stats for sharded model (#548) 2022-03-30 09:38:44 +08:00
HELSON 8c90d4df54
[zero] add zero context manager to change config during initialization (#546) 2022-03-29 17:57:59 +08:00
Jiarui Fang 53b1b6e340
[zero] non model data tracing (#545) 2022-03-29 15:45:48 +08:00
ver217 fb841dd5c5
[zero] optimize grad offload (#539)
* optimize grad offload

* polish code

* polish code
2022-03-29 12:48:00 +08:00
ver217 1f90a3b129
[zero] polish ZeroInitContext (#540) 2022-03-29 09:09:04 +08:00
Jiarui Fang c11ff81b15
[zero] get memory usage of sharded optim v2. (#542) 2022-03-29 09:08:18 +08:00
HELSON a30e2b4c24
[zero] adapt for no-leaf module in zero (#535)
only process module's own parameters in Zero context

add zero hooks for all modules that contrain parameters

gather parameters only belonging to module itself
2022-03-28 17:42:18 +08:00
Jiarui Fang 705f56107c
[zero] refactor model data tracing (#537) 2022-03-28 16:38:18 +08:00
Jiarui Fang a590ed0ba3
[zero] improve the accuracy of get_memory_usage of sharded param (#538) 2022-03-28 16:19:19 +08:00
Jiarui Fang 37cb70feec
[zero] get memory usage for sharded param (#536) 2022-03-28 15:01:21 +08:00
Jiarui Fang 05e33b2578
[zero] fix grad offload (#528)
* [zero] fix grad offload

* polish code
2022-03-25 18:23:25 +08:00
Jiarui Fang 8d8c5407c0
[zero] refactor model data tracing (#522) 2022-03-25 18:03:32 +08:00
Jiarui Fang 4d322b79da
[refactor] remove old zero code (#517) 2022-03-25 14:54:39 +08:00
Jiarui Fang 920c5889a7
[zero] add colo move inline (#521) 2022-03-25 14:02:55 +08:00
Jiarui Fang 0bebda6ea5
[zero] fix init device bug in zero init context unittest (#516) 2022-03-25 12:24:18 +08:00
Jiarui Fang 7ef3507ace
[zero] show model data cuda memory usage after zero context init. (#515) 2022-03-25 11:23:35 +08:00
ver217 a2e61d61d4
[zero] zero init ctx enable rm_torch_payload_on_the_fly (#512)
* enable rm_torch_payload_on_the_fly

* polish docstr
2022-03-24 23:44:00 +08:00