Zihao
20e255d4e8
MemStatsCollectorStatic ( #1765 )
2022-11-07 16:49:03 +08:00
Jiarui Fang
c248800359
[kernel] skip tests of flash_attn and triton when they are not available ( #1798 )
2022-11-07 13:41:13 +08:00
HELSON
c6a1a62636
[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 ( #1786 )
...
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12
* [zero] add cpu shard init
* [zero] add tiny example test
* [colo_tensor] fix bugs for torch-1.11
2022-11-02 16:11:34 +08:00
Jiarui Fang
cb5a587e9a
[hotfix] polish chunk import ( #1787 )
2022-11-02 12:10:52 +08:00
Jiarui Fang
f34dab4270
[compatibility] ChunkMgr import error ( #1772 )
2022-10-28 14:48:54 +08:00
HELSON
f69f9bf223
[zero] add chunk init function for users ( #1729 )
...
* add chunk manager init function
* fix unit tests
* add comment
* add flush=True
2022-10-18 16:31:22 +08:00
HELSON
1468e4bcfc
[zero] add constant placement policy ( #1705 )
...
* fixes memory leak when paramter is in fp16 in ZeroDDP init.
* bans chunk releasement in CUDA. Only when a chunk is about to offload, it is allowed to release.
* adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.
2022-10-14 17:53:16 +08:00
HELSON
b28991dd0a
[feature] A new ZeRO implementation ( #1644 )
2022-10-09 09:18:51 +08:00
Jiarui Fang
c5d39215f6
Revert "[feature] new zero implementation ( #1623 )" ( #1643 )
...
This reverts commit 5be118f405
.
2022-09-26 10:06:03 +08:00
HELSON
5be118f405
[feature] new zero implementation ( #1623 )
2022-09-24 19:58:18 +08:00
Zangwei Zheng
9823cbf24b
[NFC] polish colossalai/gemini/update/chunkv2.py code style ( #1565 )
2022-09-08 22:11:04 +08:00
Kai Wang (Victor Kai)
46931e3c32
[NFC] polish code colossalai/gemini/update/search_utils.py ( #1557 )
2022-09-08 22:11:04 +08:00
HELSON
b80340168e
[zero] add chunk_managerV2 for all-gather chunk ( #1441 )
2022-08-11 19:17:24 +08:00
HELSON
9056677b13
[zero] add chunk size searching algorithm for parameters in different groups ( #1436 )
2022-08-11 13:32:19 +08:00
HELSON
039b7ed3bc
[polish] add update directory in gemini; rename AgChunk to ChunkV2 ( #1432 )
2022-08-10 16:40:29 +08:00
HELSON
0d212183c4
[zero] add has_inf_or_nan in AgChunk; enhance the unit test of AgChunk ( #1426 )
2022-08-10 11:37:28 +08:00
HELSON
4fb3c52cf0
[zero] add unit test for AgChunk's append, close, access ( #1423 )
2022-08-09 18:03:10 +08:00
HELSON
c577ed016e
[zero] add AgChunk ( #1417 )
2022-08-09 16:39:48 +08:00
ver217
56b8863b87
[zero] chunk manager allows filtering ex-large params ( #1393 )
2022-08-02 10:40:27 +08:00
HELSON
527758b2ae
[hotfix] fix a running error in test_colo_checkpoint.py ( #1387 )
2022-07-29 15:58:06 +08:00
Jiarui Fang
f792507ff3
[chunk] add PG check for tensor appending ( #1383 )
2022-07-29 13:27:05 +08:00
ver217
d068af81a3
[doc] update rst and docstring ( #1351 )
...
* update rst
* add zero docstr
* fix docstr
* remove fx.tracer.meta_patch
* fix docstr
* fix docstr
* update fx rst
* fix fx docstr
* remove useless rst
2022-07-21 15:54:53 +08:00
ver217
0c51ff2c13
[hotfix] ZeroDDP use new process group ( #1333 )
...
* process group supports getting ranks in group
* chunk mgr receives a process group
* update unit test
* fix unit tests
2022-07-18 14:14:52 +08:00
Jiarui Fang
4165eabb1e
[hotfix] remove potiential circle import ( #1307 )
...
* make it faster
* [hotfix] remove circle import
2022-07-14 13:44:26 +08:00
ver217
dba7e0cfb4
make AutoPlacementPolicy configurable ( #1191 )
2022-06-30 15:18:30 +08:00
Jiarui Fang
372f791444
[refactor] move chunk and chunkmgr to directory gemini ( #1182 )
2022-06-29 13:31:02 +08:00
ver217
54aabb8da4
[gemini] refactor gemini mgr ( #1151 )
...
* refactor gemini mgr
* udpate __init__
2022-06-22 11:54:36 +08:00
ver217
7d14b473f0
[gemini] gemini mgr supports "cpu" placement policy ( #1118 )
...
* update gemini mgr
* update chunk
* add docstr
* polish placement policy
* update test chunk
* update test zero
* polish unit test
* remove useless unit test
2022-06-15 15:05:19 +08:00
Frank Lee
14e5b11d7f
[zero] fixed api consistency ( #1098 )
2022-06-10 16:59:59 +08:00
ver217
1f894e033f
[gemini] zero supports gemini ( #1093 )
...
* add placement policy
* add gemini mgr
* update mem stats collector
* update zero
* update zero optim
* fix bugs
* zero optim monitor os
* polish unit test
* polish unit test
* add assert
2022-06-10 14:48:28 +08:00
ver217
be01db37c8
[tensor] refactor chunk mgr and impl MemStatsCollectorV2 ( #1077 )
...
* polish chunk manager
* polish unit test
* impl add_extern_static_tensor for chunk mgr
* add mem stats collector v2
* polish code
* polish unit test
* polish code
* polish get chunks
2022-06-09 20:56:34 +08:00
ver217
c4d903e64a
[gemini] accelerate adjust_layout() ( #878 )
...
* add lru cache
* polish code
* update unit test
* fix sharded optim
2022-04-26 18:08:31 +08:00
HELSON
425b4a96b8
[gemini] polish stateful_tensor_mgr ( #876 )
2022-04-26 15:05:03 +08:00
HELSON
3107817172
[gemini] add stateful tensor container ( #867 )
2022-04-25 14:58:16 +08:00
HELSON
f0e654558f
[gemini] polish code ( #855 )
2022-04-25 10:40:14 +08:00
ver217
d7e0303d1e
[zero] use GeminiMemoryManager when sampling model data ( #850 )
2022-04-24 17:17:22 +08:00
ver217
0dea140760
[hotfix] add deconstructor for stateful tensor ( #848 )
...
* add deconstructor for stateful tensor
* fix colo init context
2022-04-24 15:03:04 +08:00
HELSON
e5ea3fdeef
[gemini] add GeminiMemoryManger ( #832 )
...
* refactor StatefulTensor, tensor utilities
* add unitest for GeminiMemoryManager
2022-04-24 13:08:48 +08:00
Jiarui Fang
0ce8924ceb
[tensor] reorganize files ( #820 )
2022-04-21 14:15:48 +08:00
Jiarui Fang
ab962b9735
[gemini] a new tensor structure ( #818 )
...
* Revert "[zero] add ZeroTensorShardStrategy (#793 )"
This reverts commit 88759e289e
.
* [gemini] set cpu memory capacity
* [log] local throughput collecting
* polish
* polish
* polish
* polish code
* polish
* polish code
* add a new tensor structure and override linear for it
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
* polish
2022-04-21 11:42:37 +08:00
Jiarui Fang
3ddbd1bce1
[gemini] collect cpu-gpu moving volume in each iteration ( #813 )
2022-04-20 11:29:48 +08:00
Jiarui Fang
681addb512
[refactor] moving grad acc logic to engine ( #804 )
2022-04-19 14:03:21 +08:00
Jiarui Fang
4d9332b4c5
[refactor] moving memtracer to gemini ( #801 )
2022-04-19 10:13:08 +08:00
ver217
846406a07a
[gemini] fix auto tensor placement policy ( #775 )
2022-04-16 21:29:31 +08:00
Jiarui Fang
10ef8afdd2
[gemini] init genimi individual directory ( #754 )
2022-04-14 16:40:26 +08:00