ver217
|
d7e0303d1e
|
[zero] use GeminiMemoryManager when sampling model data (#850)
|
3 years ago |
ver217
|
0f7ed8c192
|
fix _post_init_method of zero init ctx (#847)
|
3 years ago |
HELSON
|
e5ea3fdeef
|
[gemini] add GeminiMemoryManager (#832)
* refactor StatefulTensor, tensor utilities
* add unit test for GeminiMemoryManager
|
3 years ago |
Jiarui Fang
|
595bedf767
|
revert zero tensors back (#829)
|
3 years ago |
Jiarui Fang
|
294a6060d0
|
[tensor] ZeRO use ColoTensor as the base class. (#828)
* [refactor] moving InsertPostInitMethodToModuleSubClasses to utils.
* [tensor] ZeRO use ColoTensor as the base class.
* polish
|
3 years ago |
Jiarui Fang
|
eb1b89908c
|
[refactor] moving InsertPostInitMethodToModuleSubClasses to utils. (#824)
|
3 years ago |
Jiarui Fang
|
3ddbd1bce1
|
[gemini] collect cpu-gpu moving volume in each iteration (#813)
|
3 years ago |
Jiarui Fang
|
61c20b44bc
|
[log] local throughput metrics (#811)
* Revert "[zero] add ZeroTensorShardStrategy (#793)"
This reverts commit 88759e289e.
* [gemini] set cpu memory capacity
* [log] local throughput collecting
* polish
* polish
* polish
* polish code
* polish
|
3 years ago |
ver217
|
dd92b90a68
|
[DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext (#808)
* init fp16 param directly
* polish code
|
3 years ago |
Jiarui Fang
|
e761ad2cd7
|
Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806)
|
3 years ago |
HELSON
|
88759e289e
|
[zero] add ZeroTensorShardStrategy (#793)
|
3 years ago |
Jiarui Fang
|
4d9332b4c5
|
[refactor] moving memtracer to gemini (#801)
|
3 years ago |
Jiarui Fang
|
8711c706f4
|
[hotfix] fix grad offload when enabling reuse_fp16_shard
|
3 years ago |
ver217
|
f1fa1a675f
|
fix grad offload when enabling reuse_fp16_shard
|
3 years ago |
HELSON
|
4c4388c46e
|
[hotfix] fix memory leak in zero (#781)
|
3 years ago |
HELSON
|
a65cbb7e4e
|
[zero] refactor shard and gather operation (#773)
|
3 years ago |
ver217
|
6e553748a7
|
polish sharded optim docstr and warning (#770)
|
3 years ago |
Jiarui Fang
|
10ef8afdd2
|
[gemini] init gemini individual directory (#754)
|
3 years ago |
ver217
|
dcca614eee
|
[hotfix] fix test_stateful_tensor_mgr (#762)
|
3 years ago |
ver217
|
a93a7d7364
|
[hotfix] fix reuse_fp16_shard of sharded model (#756)
* fix reuse_fp16_shard
* disable test stm
* polish code
|
3 years ago |
ver217
|
8f7ce94b8e
|
[hotfix] fix auto tensor placement policy (#753)
|
3 years ago |
HELSON
|
84c6700b2a
|
[zero] refactor memstats_collector (#746)
|
3 years ago |
Jiarui Fang
|
3d7dc46d33
|
[zero] use factory pattern for tensor_placement_policy (#752)
|
3 years ago |
ver217
|
4b048a8728
|
fix prepare grads in sharded optim (#749)
|
3 years ago |
ver217
|
e396bb71f2
|
[zero] add tensor placement policies (#743)
* add tensor placement policies
* polish comments
* polish comments
* update moe unit tests
|
3 years ago |
HELSON
|
22c4b88d56
|
[zero] refactor ShardedParamV2 for convenience (#742)
|
3 years ago |
ver217
|
e6212f56cd
|
[hotfix] fix memory leak in backward of sharded model (#741)
|
3 years ago |
Jiarui Fang
|
7db3ccc79b
|
[hotfix] remove duplicated param register to stateful tensor manager (#728)
|
3 years ago |
Jiarui Fang
|
4d90a7b513
|
[refactor] zero directory (#724)
|
3 years ago |
Jiarui Fang
|
193dc8dacb
|
[refactor] refactor the memory utils (#715)
|
3 years ago |
HELSON
|
dbd96fe90a
|
[zero] check whether gradients have inf and nan in gpu (#712)
|
3 years ago |
ver217
|
715b86eadd
|
[hotfix] fix stm cuda model data size (#710)
|
3 years ago |
HELSON
|
a9b8300d54
|
[zero] improve adaptability for not-shard parameters (#708)
* adapt post grad hooks for not-shard parameters
* adapt optimizer for not-shard parameters
* offload gradients for not-replicated parameters
|
3 years ago |
ver217
|
ab8c6b4a0e
|
[zero] refactor memstats collector (#706)
* refactor memstats collector
* fix disposable
* polish code
|
3 years ago |
HELSON
|
ee112fe1da
|
[zero] adapt zero hooks for unsharded module (#699)
|
3 years ago |
ver217
|
3c9cd5bb5e
|
[zero] stateful tensor manager (#687)
* [WIP] stateful tensor manager
* add eviction strategy
* polish code
* polish code
* polish comment
* add unit test
* fix sampler bug
* polish code
* fix max sampling cnt resetting bug
* fix sampler bug
* polish code
* fix bug
* fix unit test
Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
|
3 years ago |
HELSON
|
d7ecaf362b
|
[zero] fix init bugs in zero context (#686)
* adapt model weight initialization for methods in PyTorch nn.init
|
3 years ago |
Jiarui Fang
|
59bf2dc590
|
[zero] initialize a stateful tensor manager (#614)
|
3 years ago |
HELSON
|
17e73e62cc
|
[hotfix] fix bugs for unsharded parameters when restore data (#664)
|
3 years ago |
Jiarui Fang
|
0aab52301e
|
[hotfix] fix a bug in model data stats tracing (#655)
|
3 years ago |
Jiarui Fang
|
036404ca8a
|
Revert "[zero] polish init context (#645)" (#657)
|
3 years ago |
Jiarui Fang
|
67b4928244
|
[zero] polish init context (#645)
|
3 years ago |
HELSON
|
055fbf5be6
|
[zero] adapt zero for unsharded parameters (Optimizer part) (#601)
|
3 years ago |
ver217
|
0ef8819c67
|
polish docstring of zero (#612)
|
3 years ago |
ver217
|
9bee119104
|
[hotfix] fix sharded optim zero grad (#604)
* fix sharded optim zero grad
* polish comments
|
3 years ago |
Jiarui Fang
|
e956d93ac2
|
[refactor] memory utils (#577)
|
3 years ago |
HELSON
|
e6d50ec107
|
[zero] adapt zero for unsharded parameters (#561)
* support existing sharded and unsharded parameters in zero
* add unit test for moe-zero model init
* polish moe gradient handler
|
3 years ago |
ver217
|
7c6c427db1
|
[zero] trace states of fp16/32 grad and fp32 param (#571)
|
3 years ago |
Jiarui Fang
|
7675366fce
|
[polish] rename col_attr -> colo_attr (#558)
|
3 years ago |
ver217
|
014bac0c49
|
[zero] hijack p.grad in sharded model (#554)
* hijack p.grad in sharded model
* polish comments
* polish comments
|
3 years ago |