HELSON
7829aa094e
[ddp] add is_ddp_ignored ( #2434 )
...
[ddp] rename to is_ddp_ignored
2023-01-11 12:22:45 +08:00
Jiarui Fang
c89c66a858
[Gemini] update API of the chunkmemstatscollector. ( #2129 )
2022-12-14 00:47:06 +08:00
Jiarui Fang
2938edf446
[Gemini] update the non model data record method in runtime memory tracer ( #2128 )
2022-12-13 17:11:31 +08:00
Jiarui Fang
e99edfcb51
[NFC] polish comments for Chunk class ( #2116 )
2022-12-12 15:39:31 +08:00
Jiarui Fang
b3b89865e2
[Gemini] ParamOpHook -> ColoParamOpHook ( #2080 )
2022-12-05 17:11:06 +08:00
Jiarui Fang
cc0ed7cf33
[Gemini] ZeROHookV2 -> GeminiZeROHook ( #1972 )
2022-11-17 14:43:49 +08:00
HELSON
1468e4bcfc
[zero] add constant placement policy ( #1705 )
...
* fixes memory leak when paramter is in fp16 in ZeroDDP init.
* bans chunk releasement in CUDA. Only when a chunk is about to offload, it is allowed to release.
* adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.
2022-10-14 17:53:16 +08:00
HELSON
b28991dd0a
[feature] A new ZeRO implementation ( #1644 )
2022-10-09 09:18:51 +08:00
Jiarui Fang
c5d39215f6
Revert "[feature] new zero implementation ( #1623 )" ( #1643 )
...
This reverts commit 5be118f405
.
2022-09-26 10:06:03 +08:00
HELSON
5be118f405
[feature] new zero implementation ( #1623 )
2022-09-24 19:58:18 +08:00
Jiarui Fang
4165eabb1e
[hotfix] remove potiential circle import ( #1307 )
...
* make it faster
* [hotfix] remove circle import
2022-07-14 13:44:26 +08:00
Jiarui Fang
372f791444
[refactor] move chunk and chunkmgr to directory gemini ( #1182 )
2022-06-29 13:31:02 +08:00
Frank Lee
15aab1476e
[zero] avoid zero hook spam by changing log to debug level ( #1137 )
2022-06-21 10:44:01 +08:00
ver217
f0a954f16d
[ddp] add set_params_to_ignore for ColoDDP ( #1122 )
...
* add set_params_to_ignore for ColoDDP
* polish code
* fix zero hook v2
* add unit test
* polish docstr
2022-06-16 12:54:46 +08:00
ver217
1f894e033f
[gemini] zero supports gemini ( #1093 )
...
* add placement policy
* add gemini mgr
* update mem stats collector
* update zero
* update zero optim
* fix bugs
* zero optim monitor os
* polish unit test
* polish unit test
* add assert
2022-06-10 14:48:28 +08:00
ver217
be01db37c8
[tensor] refactor chunk mgr and impl MemStatsCollectorV2 ( #1077 )
...
* polish chunk manager
* polish unit test
* impl add_extern_static_tensor for chunk mgr
* add mem stats collector v2
* polish code
* polish unit test
* polish code
* polish get chunks
2022-06-09 20:56:34 +08:00
ver217
51b9a49655
[zero] add zero optimizer for ColoTensor ( #1046 )
...
* add zero optimizer
* torch ok
* unit test ok
* polish code
* fix bugs
* polish unit test
* polish zero optim
* polish colo ddp v2
* refactor folder structure
* add comment
* polish unit test
* polish zero optim
* polish unit test
2022-06-02 12:13:15 +08:00
ver217
9492a561c3
[tensor] ColoTensor supports ZeRo ( #1015 )
...
* impl chunk manager
* impl param op hook
* add reduce_chunk
* add zero hook v2
* add zero dp
* fix TensorInfo
* impl load balancing when using zero without chunk
* fix zero hook
* polish chunk
* fix bugs
* ddp ok
* zero ok
* polish code
* fix bugs about load balancing
* polish code
* polish code
* add ene-to-end test
* polish code
* polish code
* polish code
* fix typo
* add test_chunk
* fix bugs
* fix bugs
* polish code
2022-05-31 12:00:12 +08:00
ver217
c4d903e64a
[gemini] accelerate adjust_layout() ( #878 )
...
* add lru cache
* polish code
* update unit test
* fix sharded optim
2022-04-26 18:08:31 +08:00
HELSON
425b4a96b8
[gemini] polish stateful_tensor_mgr ( #876 )
2022-04-26 15:05:03 +08:00
HELSON
e5ea3fdeef
[gemini] add GeminiMemoryManger ( #832 )
...
* refactor StatefulTensor, tensor utilities
* add unitest for GeminiMemoryManager
2022-04-24 13:08:48 +08:00
Jiarui Fang
3ddbd1bce1
[gemini] collect cpu-gpu moving volume in each iteration ( #813 )
2022-04-20 11:29:48 +08:00
Jiarui Fang
4d9332b4c5
[refactor] moving memtracer to gemini ( #801 )
2022-04-19 10:13:08 +08:00
Jiarui Fang
10ef8afdd2
[gemini] init genimi individual directory ( #754 )
2022-04-14 16:40:26 +08:00
ver217
dcca614eee
[hotfix] fix test_stateful_tensor_mgr ( #762 )
2022-04-14 15:50:09 +08:00
ver217
8f7ce94b8e
[hotfix] fix auto tensor placement policy ( #753 )
2022-04-14 12:04:45 +08:00
HELSON
84c6700b2a
[zero] refactor memstats_collector ( #746 )
2022-04-14 12:01:12 +08:00
Jiarui Fang
3d7dc46d33
[zero] use factory pattern for tensor_placement_policy ( #752 )
2022-04-14 11:07:29 +08:00
ver217
e396bb71f2
[zero] add tensor placement policies ( #743 )
...
* add tensor placement policies
* polish comments
* polish comments
* update moe unit tests
2022-04-13 15:00:48 +08:00
HELSON
22c4b88d56
[zero] refactor ShardedParamV2 for convenience ( #742 )
2022-04-13 14:54:26 +08:00
Jiarui Fang
4d90a7b513
[refactor] zero directory ( #724 )
2022-04-11 23:13:02 +08:00