Commit Graph

82 Commits (578374d0dec613d1d1f6595c01a925168c26d4fc)

Author SHA1 Message Date
HELSON b528eea0f0
[zero] add zero wrappers (#2523)
2 years ago
HELSON 077a5cdde4
[zero] fix gradient clipping in hybrid parallelism (#2521)
2 years ago
HELSON d565a24849
[zero] add unit testings for hybrid parallelism (#2486)
2 years ago
HELSON a5dc4253c6
[zero] polish low level optimizer (#2473)
2 years ago
Jiarui Fang 867c8c2d3a
[zero] low level optim supports ProcessGroup (#2464)
2 years ago
HELSON 62c38e3330
[zero] polish low level zero optimizer (#2275)
2 years ago
HELSON a7d95b7024
[example] add zero1, zero2 example in GPT examples (#2146)
2 years ago
HELSON a1ce02d740
[zero] test gradient accumulation (#1964)
2 years ago
HELSON 7066dfbf82
[zero] fix memory leak for zero2 (#1955)
2 years ago
HELSON 6e51d296f0
[zero] migrate zero1&2 (#1878)
2 years ago
ver217 c9e8ce67b8
fix move fp32 shards (#1604)
2 years ago
ver217 ce470ba37e
[checkpoint] sharded optim save/load grad scaler (#1350)
2 years ago
ver217 a45ddf2d5f
[hotfix] fix sharded optim step and clip_grad_norm (#1226)
2 years ago
ver217 9e1daa63d2
[zero] sharded optim supports loading local state dict (#1170)
2 years ago
ver217 6690a61b4d
[hotfix] prevent nested ZeRO (#1140)
2 years ago
ver217 c4d903e64a
[gemini] accelerate adjust_layout() (#878)
3 years ago
ver217 d7e0303d1e
[zero] use GeminiMemoryManager when sampling model data (#850)
3 years ago
HELSON e5ea3fdeef
[gemini] add GeminiMemoryManger (#832)
3 years ago
Jiarui Fang 61c20b44bc
[log] local throughput metrics (#811)
3 years ago
Jiarui Fang 4d9332b4c5
[refactor] moving memtracer to gemini (#801)
3 years ago
Jiarui Fang 8711c706f4
[hotfix] fix grad offload when enabling reuse_fp16_shard
3 years ago
ver217 f1fa1a675f fix grad offload when enabling reuse_fp16_shard
3 years ago
HELSON 4c4388c46e
[hotfix] fix memory leak in zero (#781)
3 years ago
ver217 6e553748a7
polish sharded optim docstr and warning (#770)
3 years ago
Jiarui Fang 10ef8afdd2
[gemini] init genimi individual directory (#754)
3 years ago
ver217 4b048a8728
fix prepare grads in sharded optim (#749)
3 years ago
ver217 e396bb71f2
[zero] add tensor placement policies (#743)
3 years ago
HELSON 22c4b88d56
[zero] refactor ShardedParamV2 for convenience (#742)
3 years ago
Jiarui Fang 4d90a7b513
[refactor] zero directory (#724)
3 years ago
HELSON dbd96fe90a
[zero] check whether gradients have inf and nan in gpu (#712)
3 years ago
HELSON a9b8300d54
[zero] improve adaptability for not-shard parameters (#708)
3 years ago
HELSON ee112fe1da
[zero] adapt zero hooks for unsharded module (#699)
3 years ago
ver217 3c9cd5bb5e
[zero] stateful tensor manager (#687)
3 years ago
HELSON 17e73e62cc
[hotfix] fix bugs for unsharded parameters when restore data (#664)
3 years ago
Jiarui Fang 0aab52301e
[hotfix] fix a bug in model data stats tracing (#655)
3 years ago
HELSON 055fbf5be6
[zero] adapt zero for unsharded paramters (Optimizer part) (#601)
3 years ago
ver217 0ef8819c67
polish docstring of zero (#612)
3 years ago
ver217 9bee119104
[hotfix] fix sharded optim zero grad (#604)
3 years ago
Jiarui Fang e956d93ac2
[refactor] memory utils (#577)
3 years ago
ver217 7c6c427db1
[zero] trace states of fp16/32 grad and fp32 param (#571)
3 years ago
Jiarui Fang 7675366fce
[polish] rename col_attr -> colo_attr (#558)
3 years ago
ver217 014bac0c49
[zero] hijack p.grad in sharded model (#554)
3 years ago
Jiarui Fang f552b11294
[zero] label state for param fp16 and grad (#551)
3 years ago
Jiarui Fang 107b99ddb1
[zero] dump memory stats for sharded model (#548)
3 years ago
Jiarui Fang 53b1b6e340
[zero] non model data tracing (#545)
3 years ago
ver217 fb841dd5c5
[zero] optimize grad offload (#539)
3 years ago
Jiarui Fang c11ff81b15
[zero] get memory usage of sharded optim v2. (#542)
3 years ago
Jiarui Fang 705f56107c
[zero] refactor model data tracing (#537)
3 years ago
Jiarui Fang 05e33b2578
[zero] fix grad offload (#528)
3 years ago
Jiarui Fang 4d322b79da
[refactor] remove old zero code (#517)
3 years ago