HELSON
c6a1a62636
[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 ( #1786 )
...
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12
* [zero] add cpu shard init
* [zero] add tiny example test
* [colo_tensor] fix bugs for torch-1.11
2 years ago
HELSON
f69f9bf223
[zero] add chunk init function for users ( #1729 )
...
* add chunk manager init function
* fix unit tests
* add comment
* add flush=True
2 years ago
HELSON
1468e4bcfc
[zero] add constant placement policy ( #1705 )
...
* fixes memory leak when paramter is in fp16 in ZeroDDP init.
* bans chunk releasement in CUDA. Only when a chunk is about to offload, it is allowed to release.
* adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.
2 years ago
HELSON
b28991dd0a
[feature] A new ZeRO implementation ( #1644 )
2 years ago
Jiarui Fang
c5d39215f6
Revert "[feature] new zero implementation ( #1623 )" ( #1643 )
...
This reverts commit 5be118f405
.
2 years ago
HELSON
5be118f405
[feature] new zero implementation ( #1623 )
2 years ago
HELSON
b80340168e
[zero] add chunk_managerV2 for all-gather chunk ( #1441 )
2 years ago
HELSON
9056677b13
[zero] add chunk size searching algorithm for parameters in different groups ( #1436 )
2 years ago
HELSON
039b7ed3bc
[polish] add update directory in gemini; rename AgChunk to ChunkV2 ( #1432 )
2 years ago
HELSON
0d212183c4
[zero] add has_inf_or_nan in AgChunk; enhance the unit test of AgChunk ( #1426 )
2 years ago
HELSON
4fb3c52cf0
[zero] add unit test for AgChunk's append, close, access ( #1423 )
2 years ago
Jiarui Fang
bd71e2a88b
[hotfix] add missing file ( #1308 )
2 years ago
ver217
c4d903e64a
[gemini] accelerate adjust_layout() ( #878 )
...
* add lru cache
* polish code
* update unit test
* fix sharded optim
3 years ago
HELSON
3107817172
[gemini] add stateful tensor container ( #867 )
3 years ago
HELSON
e5ea3fdeef
[gemini] add GeminiMemoryManger ( #832 )
...
* refactor StatefulTensor, tensor utilities
* add unitest for GeminiMemoryManager
3 years ago
Jiarui Fang
0ce8924ceb
[tensor] reorganize files ( #820 )
3 years ago
Jiarui Fang
ab962b9735
[gemini] a new tensor structure ( #818 )
...
* Revert "[zero] add ZeroTensorShardStrategy (#793 )"
This reverts commit 88759e289e
.
* [gemini] set cpu memory capacity
* [log] local throughput collecting
* polish
* polish
* polish
* polish code
* polish
* polish code
* add a new tensor structure and override linear for it
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
3 years ago
Jiarui Fang
4d9332b4c5
[refactor] moving memtracer to gemini ( #801 )
3 years ago