Boyuan Yao | 8e3f66a0d1 | [zero] fix wrong import (#2777) | 2023-02-17 10:26:07 +08:00
Nikita Shulga | 01066152f1 | Don't use `torch._six` (#2775) | 2023-02-17 09:22:45 +08:00
  * Don't use `torch._six`: this is a private API which is gone after https://github.com/pytorch/pytorch/pull/94709
  * Update common.py
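The log doesn't show the diff itself, but the migration is typically mechanical. A minimal sketch, assuming the affected code imported `inf` from `torch._six` (other names the shim exposed, e.g. `string_classes`, map to plain `str`):

```python
import math

try:
    # Old private alias; torch._six was removed in PyTorch 2.0
    # (https://github.com/pytorch/pytorch/pull/94709).
    from torch._six import inf
except ImportError:
    # Public equivalent: torch._six.inf was simply math.inf.
    inf = math.inf
```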
YH | ae86a29e23 | Refactor method of grad store (#2687) | 2023-02-15 22:27:58 +08:00
HELSON | df4f020ee3 | [zero1&2] only append parameters with gradients (#2681) | 2023-02-13 18:00:16 +08:00
HELSON | b528eea0f0 | [zero] add zero wrappers (#2523) | 2023-01-29 17:52:58 +08:00
  * [zero] add zero wrappers
  * change names
  * add wrapper functions to init
HELSON | 077a5cdde4 | [zero] fix gradient clipping in hybrid parallelism (#2521) | 2023-01-29 15:09:57 +08:00
  * [zero] fix gradient clipping in hybrid parallelism
  * [testing] change model name to avoid pytest warning
  * [hotfix] fix unit testing
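For context on the fix above: under ZeRO each rank owns only a shard of the gradients, and tensor parallelism splits some gradients further, so global-norm clipping must reduce the squared norm over both process groups before scaling. A minimal sketch, with hypothetical `dp_group`/`tp_group` handles rather than the actual ColossalAI API (replicated parameters would additionally need to be counted only once per group):

```python
import torch
import torch.distributed as dist

def clip_grad_norm_hybrid(local_grads, max_norm, dp_group, tp_group):
    # Square-sum only the gradient shards this rank owns.
    sq_norm = sum(g.detach().float().pow(2).sum() for g in local_grads)
    # Accumulate across ZeRO (data-parallel) and tensor-parallel ranks
    # so every rank agrees on the same global norm.
    dist.all_reduce(sq_norm, group=dp_group)
    dist.all_reduce(sq_norm, group=tp_group)
    total_norm = sq_norm.sqrt()
    scale = max_norm / (total_norm + 1e-6)
    if scale < 1.0:
        for g in local_grads:
            g.mul_(scale)
    return total_norm
```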
HELSON | d565a24849 | [zero] add unit tests for hybrid parallelism (#2486) | 2023-01-18 10:36:10 +08:00
HELSON | a5dc4253c6 | [zero] polish low level optimizer (#2473) | 2023-01-13 14:56:17 +08:00
Jiarui Fang | 867c8c2d3a | [zero] low level optim supports ProcessGroup (#2464) | 2023-01-13 10:05:58 +08:00
HELSON | 62c38e3330 | [zero] polish low level zero optimizer (#2275) | 2023-01-03 17:22:34 +08:00
HELSON | a7d95b7024 | [example] add zero1, zero2 example in GPT examples (#2146) | 2022-12-20 14:30:27 +08:00
  * [example] add zero1 and zero2 for GPT
  * update readme in gpt example
  * polish code
  * change init value
  * update readme
HELSON | a1ce02d740 | [zero] test gradient accumulation (#1964) | 2022-11-29 13:00:30 +08:00
  * [zero] fix memory leak for zero2
  * [zero] test gradient accumulation
  * [zero] remove grad clip test
HELSON | 7066dfbf82 | [zero] fix memory leak for zero2 (#1955) | 2022-11-16 11:43:24 +08:00
HELSON | 6e51d296f0 | [zero] migrate zero1&2 (#1878) | 2022-11-11 09:26:40 +08:00
  * add zero1&2 optimizer
  * rename test directory
  * rename test files
  * change tolerance in test
ver217 | c9e8ce67b8 | fix moving fp32 shards (#1604) | 2022-09-16 17:33:16 +08:00
ver217 | ce470ba37e | [checkpoint] sharded optim save/load grad scaler (#1350) | 2022-07-21 15:21:21 +08:00
ver217 | a45ddf2d5f | [hotfix] fix sharded optim step and clip_grad_norm (#1226) | 2022-07-08 13:34:48 +08:00
ver217 | 9e1daa63d2 | [zero] sharded optim supports loading local state dict (#1170) | 2022-06-24 18:05:16 +08:00
  * sharded optim supports loading local state dict
  * polish code
  * add unit test
ver217 | 6690a61b4d | [hotfix] prevent nested ZeRO (#1140) | 2022-06-21 11:33:53 +08:00
ver217 | c4d903e64a | [gemini] accelerate adjust_layout() (#878) | 2022-04-26 18:08:31 +08:00
  * add lru cache
  * polish code
  * update unit test
  * fix sharded optim
ver217 | d7e0303d1e | [zero] use GeminiMemoryManager when sampling model data (#850) | 2022-04-24 17:17:22 +08:00
HELSON | e5ea3fdeef | [gemini] add GeminiMemoryManager (#832) | 2022-04-24 13:08:48 +08:00
  * refactor StatefulTensor, tensor utilities
  * add unit test for GeminiMemoryManager
Jiarui Fang | 61c20b44bc | [log] local throughput metrics (#811) | 2022-04-20 10:05:39 +08:00
  * Revert "[zero] add ZeroTensorShardStrategy (#793)"; this reverts commit 88759e289e.
  * [gemini] set cpu memory capacity
  * [log] local throughput collecting
  * polish
  * polish
  * polish
  * polish code
  * polish
Jiarui Fang | 4d9332b4c5 | [refactor] moving memtracer to gemini (#801) | 2022-04-19 10:13:08 +08:00
Jiarui Fang | 8711c706f4 | [hotfix] fix grad offload when enabling reuse_fp16_shard | 2022-04-18 14:58:21 +08:00
ver217 | f1fa1a675f | fix grad offload when enabling reuse_fp16_shard | 2022-04-18 14:07:39 +08:00
HELSON | 4c4388c46e | [hotfix] fix memory leak in zero (#781) | 2022-04-18 13:57:03 +08:00
ver217 | 6e553748a7 | polish sharded optim docstring and warning (#770) | 2022-04-14 21:03:59 +08:00
Jiarui Fang | 10ef8afdd2 | [gemini] init gemini individual directory (#754) | 2022-04-14 16:40:26 +08:00
ver217 | 4b048a8728 | fix prepare grads in sharded optim (#749) | 2022-04-13 22:36:11 +08:00
ver217 | e396bb71f2 | [zero] add tensor placement policies (#743) | 2022-04-13 15:00:48 +08:00
  * add tensor placement policies
  * polish comments
  * polish comments
  * update moe unit tests
HELSON | 22c4b88d56 | [zero] refactor ShardedParamV2 for convenience (#742) | 2022-04-13 14:54:26 +08:00
Jiarui Fang | 4d90a7b513 | [refactor] zero directory (#724) | 2022-04-11 23:13:02 +08:00
HELSON | dbd96fe90a | [zero] check whether gradients have inf or nan on GPU (#712) | 2022-04-11 15:40:13 +08:00
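A note on the technique in #712: doing the finiteness check on-device avoids a host-device synchronization per tensor; only one flag crosses back at the end. A minimal sketch, not the actual ColossalAI code:

```python
import torch

def grads_have_inf_or_nan(grads):
    # Accumulate a single boolean flag on the GPU instead of calling
    # .item() per gradient, which would force a sync for every tensor.
    flag = torch.zeros(1, dtype=torch.bool, device=grads[0].device)
    for g in grads:
        flag |= ~torch.isfinite(g).all()
    # One device-to-host copy for the final verdict.
    return bool(flag.item())
```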
HELSON | a9b8300d54 | [zero] improve adaptability for unsharded parameters (#708) | 2022-04-11 13:38:51 +08:00
  * adapt post-grad hooks for unsharded parameters
  * adapt optimizer for unsharded parameters
  * offload gradients for non-replicated parameters
HELSON | ee112fe1da | [zero] adapt zero hooks for unsharded module (#699) | 2022-04-08 20:23:26 +08:00
ver217 | 3c9cd5bb5e | [zero] stateful tensor manager (#687) | 2022-04-08 17:51:34 +08:00
  * [WIP] stateful tensor manager
  * add eviction strategy
  * polish code
  * polish code
  * polish comment
  * add unit test
  * fix sampler bug
  * polish code
  * fix max sampling cnt resetting bug
  * fix sampler bug
  * polish code
  * fix bug
  * fix unit test
  Co-authored-by: jiaruifang <fangjiarui123@gmail.com>
HELSON | 17e73e62cc | [hotfix] fix bugs for unsharded parameters when restoring data (#664) | 2022-04-03 22:02:11 +08:00
Jiarui Fang | 0aab52301e | [hotfix] fix a bug in model data stats tracing (#655) | 2022-04-03 21:48:06 +08:00
HELSON | 055fbf5be6 | [zero] adapt zero for unsharded parameters (Optimizer part) (#601) | 2022-04-01 20:10:47 +08:00
ver217 | 0ef8819c67 | polish docstring of zero (#612) | 2022-04-01 14:50:56 +08:00
ver217 | 9bee119104 | [hotfix] fix sharded optim zero grad (#604) | 2022-04-01 12:41:20 +08:00
  * fix sharded optim zero grad
  * polish comments
Jiarui Fang | e956d93ac2 | [refactor] memory utils (#577) | 2022-04-01 09:22:33 +08:00
ver217 | 7c6c427db1 | [zero] trace states of fp16/32 grad and fp32 param (#571) | 2022-03-31 16:26:54 +08:00
Jiarui Fang | 7675366fce | [polish] rename col_attr -> colo_attr (#558) | 2022-03-31 12:25:45 +08:00
ver217 | 014bac0c49 | [zero] hijack p.grad in sharded model (#554) | 2022-03-30 18:14:50 +08:00
  * hijack p.grad in sharded model
  * polish comments
  * polish comments
Jiarui Fang | f552b11294 | [zero] label state for param fp16 and grad (#551) | 2022-03-30 15:57:46 +08:00
Jiarui Fang | 107b99ddb1 | [zero] dump memory stats for sharded model (#548) | 2022-03-30 09:38:44 +08:00
Jiarui Fang | 53b1b6e340 | [zero] non-model data tracing (#545) | 2022-03-29 15:45:48 +08:00
ver217 | fb841dd5c5 | [zero] optimize grad offload (#539) | 2022-03-29 12:48:00 +08:00
  * optimize grad offload
  * polish code
  * polish code