Jiarui Fang
|
1fca5d79ea
|
[Gemini] remove GLOBAL_MODEL_DATA_TRACER (#2091)
|
2 years ago |
Jiarui Fang
|
33f4412102
|
[Gemini] use MemStats to store the tracing data. Separate it from Collector. (#2084)
|
2 years ago |
Jiarui Fang
|
1e885329f4
|
[test] align model name with the file name. (#2045)
|
2 years ago |
HELSON
|
a1ce02d740
|
[zero] test gradient accumulation (#1964)
* [zero] fix memory leak for zero2
* [zero] test gradient accumulation
* [zero] remove grad clip test
|
2 years ago |
Jiarui Fang
|
3712ac7f90
|
[Gemini] add bert for MemtracerWrapper unit tests (#1982)
|
2 years ago |
HELSON
|
7066dfbf82
|
[zero] fix memory leak for zero2 (#1955)
|
2 years ago |
HELSON
|
6e51d296f0
|
[zero] migrate zero1&2 (#1878)
* add zero1&2 optimizer
* rename test directory
* rename test files
* change tolerance in test
|
2 years ago |
HELSON
|
b28991dd0a
|
[feature] A new ZeRO implementation (#1644)
|
2 years ago |
Jiarui Fang
|
c5d39215f6
|
Revert "[feature] new zero implementation (#1623)" (#1643)
This reverts commit 5be118f405.
|
2 years ago |
HELSON
|
5be118f405
|
[feature] new zero implementation (#1623)
|
2 years ago |
HELSON
|
f7f2248771
|
[moe] fix MoE bugs (#1628)
* remove forced FP32 modules
* correct no_shard-contexts' positions
|
2 years ago |
ver217
|
8dced41ad0
|
[zero] zero optim state_dict takes only_rank_0 (#1384)
* zero optim state_dict takes only_rank_0
* fix unit test
|
2 years ago |
ver217
|
828b9e5e0d
|
[hotfix] fix zero optim save/load state dict (#1381)
|
2 years ago |
HELSON
|
7a8702c06d
|
[colotensor] add Tensor.view op and its unit test (#1343)
[colotensor] add megatron initialization for gpt2
|
2 years ago |
ver217
|
0c51ff2c13
|
[hotfix] ZeroDDP use new process group (#1333)
* process group supports getting ranks in group
* chunk mgr receives a process group
* update unit test
* fix unit tests
|
2 years ago |
ver217
|
7a05367101
|
[hotfix] shared model returns cpu state_dict (#1328)
|
2 years ago |
Jiarui Fang
|
060b917daf
|
[refactor] remove gpc dependency in colotensor's _ops (#1189)
|
2 years ago |
Jiarui Fang
|
372f791444
|
[refactor] move chunk and chunkmgr to directory gemini (#1182)
|
2 years ago |
ver217
|
9e1daa63d2
|
[zero] sharded optim supports loading local state dict (#1170)
* sharded optim supports loading local state dict
* polish code
* add unit test
|
2 years ago |
ver217
|
561e90493f
|
[zero] zero optim supports loading local state dict (#1171)
* zero optim supports loading local state dict
* polish code
* add unit test
|
2 years ago |
Frank Lee
|
65ee6dcc20
|
[test] ignore 8 gpu test (#1080)
* [test] ignore 8 gpu test
* polish code
* polish workflow
* polish workflow
|
3 years ago |
HELSON
|
e5ea3fdeef
|
[gemini] add GeminiMemoryManager (#832)
* refactor StatefulTensor, tensor utilities
* add unit test for GeminiMemoryManager
|
3 years ago |
Jiarui Fang
|
e761ad2cd7
|
Revert "[zero] add ZeroTensorShardStrategy (#793)" (#806)
|
3 years ago |
HELSON
|
88759e289e
|
[zero] add ZeroTensorShardStrategy (#793)
|
3 years ago |
Jiarui Fang
|
4d9332b4c5
|
[refactor] moving memtracer to gemini (#801)
|
3 years ago |
HELSON
|
4c4388c46e
|
[hotfix] fix memory leak in zero (#781)
|
3 years ago |
Frank Lee
|
5a1a095b92
|
[test] refactored with the new rerun decorator (#763)
* [test] refactored with the new rerun decorator
* polish test case
|
3 years ago |
Jiarui Fang
|
10ef8afdd2
|
[gemini] init gemini individual directory (#754)
|
3 years ago |
ver217
|
dcca614eee
|
[hotfix] fix test_stateful_tensor_mgr (#762)
|
3 years ago |
ver217
|
a93a7d7364
|
[hotfix] fix reuse_fp16_shard of sharded model (#756)
* fix reuse_fp16_shard
* disable test stm
* polish code
|
3 years ago |
HELSON
|
84c6700b2a
|
[zero] refactor memstats_collector (#746)
|
3 years ago |
ver217
|
e396bb71f2
|
[zero] add tensor placement policies (#743)
* add tensor placement policies
* polish comments
* polish comments
* update moe unit tests
|
3 years ago |
HELSON
|
22c4b88d56
|
[zero] refactor ShardedParamV2 for convenience (#742)
|
3 years ago |
Frank Lee
|
f4f42d4c3c
|
[bug] fixed DDP compatibility with torch 1.8 (#739)
|
3 years ago |
Jiarui Fang
|
53cb584808
|
[utils] correct cpu memory used and capacity in the context of multi-process (#726)
|
3 years ago |