Commit Graph

95 Commits (dbd0fd1522c8b779a6d26da3c4cce52442ce494d)

Author SHA1 Message Date
Zihao 20e255d4e8
MemStatsCollectorStatic (#1765) 2022-11-07 16:49:03 +08:00
Jiarui Fang c248800359
[kernel] skip tests of flash_attn and triton when they are not available (#1798) 2022-11-07 13:41:13 +08:00
HELSON c6a1a62636
[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786)
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12

* [zero] add cpu shard init

* [zero] add tiny example test

* [colo_tensor] fix bugs for torch-1.11
2022-11-02 16:11:34 +08:00
Jiarui Fang cb5a587e9a
[hotfix] polish chunk import (#1787) 2022-11-02 12:10:52 +08:00
Jiarui Fang f34dab4270
[compatibility] ChunkMgr import error (#1772) 2022-10-28 14:48:54 +08:00
HELSON f69f9bf223
[zero] add chunk init function for users (#1729)
* add chunk manager init function

* fix unit tests

* add comment

* add flush=True
2022-10-18 16:31:22 +08:00
HELSON 1468e4bcfc
[zero] add constant placement policy (#1705)
* fixes memory leak when paramter is in fp16 in ZeroDDP init.
* bans chunk releasement in CUDA. Only when a chunk is about to offload, it is allowed to release.
* adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.
2022-10-14 17:53:16 +08:00
HELSON b28991dd0a
[feature] A new ZeRO implementation (#1644) 2022-10-09 09:18:51 +08:00
Jiarui Fang c5d39215f6
Revert "[feature] new zero implementation (#1623)" (#1643)
This reverts commit 5be118f405.
2022-09-26 10:06:03 +08:00
HELSON 5be118f405
[feature] new zero implementation (#1623) 2022-09-24 19:58:18 +08:00
Zangwei Zheng 9823cbf24b [NFC] polish colossalai/gemini/update/chunkv2.py code style (#1565) 2022-09-08 22:11:04 +08:00
Kai Wang (Victor Kai) 46931e3c32 [NFC] polish code colossalai/gemini/update/search_utils.py (#1557) 2022-09-08 22:11:04 +08:00
HELSON b80340168e
[zero] add chunk_managerV2 for all-gather chunk (#1441) 2022-08-11 19:17:24 +08:00
HELSON 9056677b13
[zero] add chunk size searching algorithm for parameters in different groups (#1436) 2022-08-11 13:32:19 +08:00
HELSON 039b7ed3bc
[polish] add update directory in gemini; rename AgChunk to ChunkV2 (#1432) 2022-08-10 16:40:29 +08:00
HELSON 0d212183c4
[zero] add has_inf_or_nan in AgChunk; enhance the unit test of AgChunk (#1426) 2022-08-10 11:37:28 +08:00
HELSON 4fb3c52cf0
[zero] add unit test for AgChunk's append, close, access (#1423) 2022-08-09 18:03:10 +08:00
HELSON c577ed016e
[zero] add AgChunk (#1417) 2022-08-09 16:39:48 +08:00
ver217 56b8863b87
[zero] chunk manager allows filtering ex-large params (#1393) 2022-08-02 10:40:27 +08:00
HELSON 527758b2ae
[hotfix] fix a running error in test_colo_checkpoint.py (#1387) 2022-07-29 15:58:06 +08:00
Jiarui Fang f792507ff3
[chunk] add PG check for tensor appending (#1383) 2022-07-29 13:27:05 +08:00
ver217 d068af81a3
[doc] update rst and docstring (#1351)
* update rst

* add zero docstr

* fix docstr

* remove fx.tracer.meta_patch

* fix docstr

* fix docstr

* update fx rst

* fix fx docstr

* remove useless rst
2022-07-21 15:54:53 +08:00
ver217 0c51ff2c13
[hotfix] ZeroDDP use new process group (#1333)
* process group supports getting ranks in group

* chunk mgr receives a process group

* update unit test

* fix unit tests
2022-07-18 14:14:52 +08:00
Jiarui Fang 4165eabb1e
[hotfix] remove potiential circle import (#1307)
* make it faster

* [hotfix] remove circle import
2022-07-14 13:44:26 +08:00
ver217 dba7e0cfb4
make AutoPlacementPolicy configurable (#1191) 2022-06-30 15:18:30 +08:00
Jiarui Fang 372f791444
[refactor] move chunk and chunkmgr to directory gemini (#1182) 2022-06-29 13:31:02 +08:00
ver217 54aabb8da4
[gemini] refactor gemini mgr (#1151)
* refactor gemini mgr

* udpate __init__
2022-06-22 11:54:36 +08:00
ver217 7d14b473f0
[gemini] gemini mgr supports "cpu" placement policy (#1118)
* update gemini mgr

* update chunk

* add docstr

* polish placement policy

* update test chunk

* update test zero

* polish unit test

* remove useless unit test
2022-06-15 15:05:19 +08:00
Frank Lee 14e5b11d7f
[zero] fixed api consistency (#1098) 2022-06-10 16:59:59 +08:00
ver217 1f894e033f
[gemini] zero supports gemini (#1093)
* add placement policy

* add gemini mgr

* update mem stats collector

* update zero

* update zero optim

* fix bugs

* zero optim monitor os

* polish unit test

* polish unit test

* add assert
2022-06-10 14:48:28 +08:00
ver217 be01db37c8
[tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077)
* polish chunk manager

* polish unit test

* impl add_extern_static_tensor for chunk mgr

* add mem stats collector v2

* polish code

* polish unit test

* polish code

* polish get chunks
2022-06-09 20:56:34 +08:00
ver217 c4d903e64a
[gemini] accelerate adjust_layout() (#878)
* add lru cache

* polish code

* update unit test

* fix sharded optim
2022-04-26 18:08:31 +08:00
HELSON 425b4a96b8
[gemini] polish stateful_tensor_mgr (#876) 2022-04-26 15:05:03 +08:00
HELSON 3107817172
[gemini] add stateful tensor container (#867) 2022-04-25 14:58:16 +08:00
HELSON f0e654558f
[gemini] polish code (#855) 2022-04-25 10:40:14 +08:00
ver217 d7e0303d1e
[zero] use GeminiMemoryManager when sampling model data (#850) 2022-04-24 17:17:22 +08:00
ver217 0dea140760
[hotfix] add deconstructor for stateful tensor (#848)
* add deconstructor for stateful tensor

* fix colo init context
2022-04-24 15:03:04 +08:00
HELSON e5ea3fdeef
[gemini] add GeminiMemoryManger (#832)
* refactor StatefulTensor, tensor utilities

* add unitest for GeminiMemoryManager
2022-04-24 13:08:48 +08:00
Jiarui Fang 0ce8924ceb
[tensor] reorganize files (#820) 2022-04-21 14:15:48 +08:00
Jiarui Fang ab962b9735
[gemini] a new tensor structure (#818)
* Revert "[zero] add ZeroTensorShardStrategy (#793)"

This reverts commit 88759e289e.

* [gemini] set cpu memory capacity

* [log] local throughput collecting

* polish

* polish

* polish

* polish code

* polish

* polish code

* add a new tensor structure and override linear for it

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish

* polish
2022-04-21 11:42:37 +08:00
Jiarui Fang 3ddbd1bce1
[gemini] collect cpu-gpu moving volume in each iteration (#813) 2022-04-20 11:29:48 +08:00
Jiarui Fang 681addb512
[refactor] moving grad acc logic to engine (#804) 2022-04-19 14:03:21 +08:00
Jiarui Fang 4d9332b4c5
[refactor] moving memtracer to gemini (#801) 2022-04-19 10:13:08 +08:00
ver217 846406a07a
[gemini] fix auto tensor placement policy (#775) 2022-04-16 21:29:31 +08:00
Jiarui Fang 10ef8afdd2
[gemini] init genimi individual directory (#754) 2022-04-14 16:40:26 +08:00