139 Commits (288304028645f545b1eb0a6ffda46143ec92c422)

Author SHA1 Message Date
ver217 d7e0303d1e
[zero] use GeminiMemoryManager when sampling model data (#850)
3 years ago
ver217 0f7ed8c192
fix _post_init_method of zero init ctx (#847)
3 years ago
HELSON e5ea3fdeef
[gemini] add GeminiMemoryManger (#832)
3 years ago
Jiarui Fang 595bedf767
revert zero tensors back (#829)
3 years ago
Jiarui Fang 294a6060d0
[tensor] ZeRO use ColoTensor as the base class. (#828)
3 years ago
Jiarui Fang eb1b89908c
[refactor] moving InsertPostInitMethodToModuleSubClasses to utils. (#824)
3 years ago
Jiarui Fang 3ddbd1bce1
[gemini] collect cpu-gpu moving volume in each iteration (#813)
3 years ago
Jiarui Fang 61c20b44bc
[log] local throughput metrics (#811)
3 years ago
ver217 dd92b90a68
[DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext (#808)
3 years ago
Jiarui Fang e761ad2cd7
Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806)
3 years ago
HELSON 88759e289e
[zero] add ZeroTensorShardStrategy (#793)
3 years ago
Jiarui Fang 4d9332b4c5
[refactor] moving memtracer to gemini (#801)
3 years ago
Jiarui Fang 8711c706f4
[hotfix] fix grad offload when enabling reuse_fp16_shard
3 years ago
ver217 f1fa1a675f
fix grad offload when enabling reuse_fp16_shard
3 years ago
HELSON 4c4388c46e
[hotfix] fix memory leak in zero (#781)
3 years ago
HELSON a65cbb7e4e
[zero] refactor shard and gather operation (#773)
3 years ago
ver217 6e553748a7
polish sharded optim docstr and warning (#770)
3 years ago
Jiarui Fang 10ef8afdd2
[gemini] init genimi individual directory (#754)
3 years ago
ver217 dcca614eee
[hotfix] fix test_stateful_tensor_mgr (#762)
3 years ago
ver217 a93a7d7364
[hotfix] fix reuse_fp16_shard of sharded model (#756)
3 years ago
ver217 8f7ce94b8e
[hotfix] fix auto tensor placement policy (#753)
3 years ago
HELSON 84c6700b2a
[zero] refactor memstats_collector (#746)
3 years ago
Jiarui Fang 3d7dc46d33
[zero] use factory pattern for tensor_placement_policy (#752)
3 years ago
ver217 4b048a8728
fix prepare grads in sharded optim (#749)
3 years ago
ver217 e396bb71f2
[zero] add tensor placement policies (#743)
3 years ago
HELSON 22c4b88d56
[zero] refactor ShardedParamV2 for convenience (#742)
3 years ago
ver217 e6212f56cd
[hotfix] fix memory leak in backward of sharded model (#741)
3 years ago
Jiarui Fang 7db3ccc79b
[hotfix] remove duplicated param register to stateful tensor manager (#728)
3 years ago
Jiarui Fang 4d90a7b513
[refactor] zero directory (#724)
3 years ago
Jiarui Fang 193dc8dacb
[refactor] refactor the memory utils (#715)
3 years ago
HELSON dbd96fe90a
[zero] check whether gradients have inf and nan in gpu (#712)
3 years ago
ver217 715b86eadd
[hotfix] fix stm cuda model data size (#710)
3 years ago
HELSON a9b8300d54
[zero] improve adaptability for not-shard parameters (#708)
3 years ago
ver217 ab8c6b4a0e
[zero] refactor memstats collector (#706)
3 years ago
HELSON ee112fe1da
[zero] adapt zero hooks for unsharded module (#699)
3 years ago
ver217 3c9cd5bb5e
[zero] stateful tensor manager (#687)
3 years ago
HELSON d7ecaf362b
[zero] fix init bugs in zero context (#686)
3 years ago
Jiarui Fang 59bf2dc590
[zero] initialize a stateful tensor manager (#614)
3 years ago
HELSON 17e73e62cc
[hotfix] fix bugs for unsharded parameters when restore data (#664)
3 years ago
Jiarui Fang 0aab52301e
[hotfix] fix a bug in model data stats tracing (#655)
3 years ago
Jiarui Fang 036404ca8a
Revert "[zero] polish init context (#645)" (#657)
3 years ago
Jiarui Fang 67b4928244
[zero] polish init context (#645)
3 years ago
HELSON 055fbf5be6
[zero] adapt zero for unsharded paramters (Optimizer part) (#601)
3 years ago
ver217 0ef8819c67
polish docstring of zero (#612)
3 years ago
ver217 9bee119104
[hotfix] fix sharded optim zero grad (#604)
3 years ago
Jiarui Fang e956d93ac2
[refactor] memory utils (#577)
3 years ago
HELSON e6d50ec107
[zero] adapt zero for unsharded parameters (#561)
3 years ago
ver217 7c6c427db1
[zero] trace states of fp16/32 grad and fp32 param (#571)
3 years ago
Jiarui Fang 7675366fce
[polish] rename col_attr -> colo_attr (#558)
3 years ago
ver217 014bac0c49
[zero] hijack p.grad in sharded model (#554)
3 years ago