Commit Graph

71 Commits (7d49e7b2dbdb4b966496475654a4154b92aeaa7b); the 50 most recent are listed below.
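
A listing like the table below can be reproduced from a local clone of the repository. The following is a minimal sketch, not the exact query the hosting site runs: the git pretty-format fields (%an, %h, %s, %cr) are standard, but the commit count and column layout here are illustrative.

```python
# Sketch: print an Author | SHA1 | Message | Date table from a local clone.
# Assumes the script is run inside the repository's working directory.
import subprocess

# git pretty-format fields: %an = author name, %h = abbreviated hash,
# %s = subject line, %cr = committer date, relative ("2 years ago");
# %x09 emits a tab so the fields can be split reliably.
FMT = "%an%x09%h%x09%s%x09%cr"

log = subprocess.run(
    ["git", "log", "-n", "50", f"--pretty=format:{FMT}"],
    capture_output=True, text=True, check=True,
).stdout

for row in log.splitlines():
    author, sha, subject, date = row.split("\t")
    print(f"{author:<11} | {sha:<10} | {subject} | {date}")
```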

Author      | SHA1       | Message | Date
ver217      | ce470ba37e | [checkpoint] sharded optim save/load grad scaler (#1350) | 2 years ago
ver217      | a45ddf2d5f | [hotfix] fix sharded optim step and clip_grad_norm (#1226) | 2 years ago
ver217      | 9e1daa63d2 | [zero] sharded optim supports loading local state dict (#1170) | 2 years ago
ver217      | 6690a61b4d | [hotfix] prevent nested ZeRO (#1140) | 2 years ago
ver217      | c4d903e64a | [gemini] accelerate adjust_layout() (#878) | 3 years ago
ver217      | d7e0303d1e | [zero] use GeminiMemoryManager when sampling model data (#850) | 3 years ago
HELSON      | e5ea3fdeef | [gemini] add GeminiMemoryManger (#832) | 3 years ago
Jiarui Fang | 61c20b44bc | [log] local throughput metrics (#811) | 3 years ago
Jiarui Fang | 4d9332b4c5 | [refactor] moving memtracer to gemini (#801) | 3 years ago
Jiarui Fang | 8711c706f4 | [hotfix] fix grad offload when enabling reuse_fp16_shard | 3 years ago
ver217      | f1fa1a675f | fix grad offload when enabling reuse_fp16_shard | 3 years ago
HELSON      | 4c4388c46e | [hotfix] fix memory leak in zero (#781) | 3 years ago
ver217      | 6e553748a7 | polish sharded optim docstr and warning (#770) | 3 years ago
Jiarui Fang | 10ef8afdd2 | [gemini] init genimi individual directory (#754) | 3 years ago
ver217      | 4b048a8728 | fix prepare grads in sharded optim (#749) | 3 years ago
ver217      | e396bb71f2 | [zero] add tensor placement policies (#743) | 3 years ago
HELSON      | 22c4b88d56 | [zero] refactor ShardedParamV2 for convenience (#742) | 3 years ago
Jiarui Fang | 4d90a7b513 | [refactor] zero directory (#724) | 3 years ago
HELSON      | dbd96fe90a | [zero] check whether gradients have inf and nan in gpu (#712) | 3 years ago
HELSON      | a9b8300d54 | [zero] improve adaptability for not-shard parameters (#708) | 3 years ago
HELSON      | ee112fe1da | [zero] adapt zero hooks for unsharded module (#699) | 3 years ago
ver217      | 3c9cd5bb5e | [zero] stateful tensor manager (#687) | 3 years ago
HELSON      | 17e73e62cc | [hotfix] fix bugs for unsharded parameters when restore data (#664) | 3 years ago
Jiarui Fang | 0aab52301e | [hotfix] fix a bug in model data stats tracing (#655) | 3 years ago
HELSON      | 055fbf5be6 | [zero] adapt zero for unsharded paramters (Optimizer part) (#601) | 3 years ago
ver217      | 0ef8819c67 | polish docstring of zero (#612) | 3 years ago
ver217      | 9bee119104 | [hotfix] fix sharded optim zero grad (#604) | 3 years ago
Jiarui Fang | e956d93ac2 | [refactor] memory utils (#577) | 3 years ago
ver217      | 7c6c427db1 | [zero] trace states of fp16/32 grad and fp32 param (#571) | 3 years ago
Jiarui Fang | 7675366fce | [polish] rename col_attr -> colo_attr (#558) | 3 years ago
ver217      | 014bac0c49 | [zero] hijack p.grad in sharded model (#554) | 3 years ago
Jiarui Fang | f552b11294 | [zero] label state for param fp16 and grad (#551) | 3 years ago
Jiarui Fang | 107b99ddb1 | [zero] dump memory stats for sharded model (#548) | 3 years ago
Jiarui Fang | 53b1b6e340 | [zero] non model data tracing (#545) | 3 years ago
ver217      | fb841dd5c5 | [zero] optimize grad offload (#539) | 3 years ago
Jiarui Fang | c11ff81b15 | [zero] get memory usage of sharded optim v2. (#542) | 3 years ago
Jiarui Fang | 705f56107c | [zero] refactor model data tracing (#537) | 3 years ago
Jiarui Fang | 05e33b2578 | [zero] fix grad offload (#528) | 3 years ago
Jiarui Fang | 4d322b79da | [refactor] remove old zero code (#517) | 3 years ago
Jiarui Fang | bca0c49a9d | [zero] use colo model data api in optimv2 (#511) | 3 years ago
Jiarui Fang | 0035b7be07 | [memory] add model data tensor moving api (#503) | 3 years ago
ver217      | 9ec1ce6ab1 | [zero] sharded model support the reuse of fp16 shard (#495) | 3 years ago
ver217      | a9ecb4b244 | [zero] polish sharded optimizer v2 (#490) | 3 years ago
ver217      | 62b0a8d644 | [zero] sharded optim support hybrid cpu adam (#486) | 3 years ago
Jiarui Fang | b334822163 | [zero] polish sharded param name (#484) | 3 years ago
ver217      | fc8e6db005 | [doc] Update docstring for ZeRO (#459) | 3 years ago
ver217      | a241f61b34 | [zero] Update initialize for ZeRO (#458) | 3 years ago
ver217      | 642846d6f9 | update sharded optim and fix zero init ctx (#457) | 3 years ago
Jiarui Fang | e2e9f82588 | Revert "[zero] update sharded optim and fix zero init ctx" (#456) | 3 years ago
ver217      | e99af94ab8 | rename variables | 3 years ago