junxu
c52edcf0eb
Rename class method of ZeroDDP ( #2692 )
2023-02-22 15:05:53 +08:00
HELSON
8213f89fd2
[gemini] add fake_release_chunk for keep-gathered chunk in the inference mode ( #2671 )
2023-02-13 14:35:32 +08:00
ver217
5b1854309a
[hotfix] fix zero ddp warmup check ( #2545 )
2023-02-02 16:42:38 +08:00
HELSON
a4ed9125ac
[hotfix] fix lightning error ( #2529 )
2023-01-31 10:40:39 +08:00
HELSON
66dfcf5281
[gemini] update the gpt example ( #2527 )
2023-01-30 17:58:05 +08:00
HELSON
b528eea0f0
[zero] add zero wrappers ( #2523 )
* [zero] add zero wrappers
* change names
* add wrapper functions to init
2023-01-29 17:52:58 +08:00
HELSON
707b11d4a0
[gemini] update ddp strict mode ( #2518 )
* [zero] add strict ddp mode for chunk init
* [gemini] update gpt example
2023-01-28 14:35:25 +08:00
HELSON
2d1a7dfe5f
[zero] add strict ddp mode ( #2508 )
* [zero] add strict ddp mode
* [polish] add comments for strict ddp mode
* [zero] fix test error
2023-01-20 14:04:38 +08:00
HELSON
5521af7877
[zero] fix state_dict and load_state_dict for ddp ignored parameters ( #2443 )
* [ddp] add is_ddp_ignored
[ddp] rename to is_ddp_ignored
* [zero] fix state_dict and load_state_dict
* fix bugs
* [zero] update unit test for ZeroDDP
2023-01-11 14:55:41 +08:00
HELSON
7829aa094e
[ddp] add is_ddp_ignored ( #2434 )
[ddp] rename to is_ddp_ignored
2023-01-11 12:22:45 +08:00
HELSON
bb4e9a311a
[zero] add inference mode and its unit test ( #2418 )
2023-01-11 10:07:37 +08:00
HELSON
ea13a201bb
[polish] polish code for get_static_torch_model ( #2405 )
* [gemini] polish code
* [testing] remove code
* [gemini] make more robust
2023-01-09 17:41:38 +08:00
eric8607242
9880fd2cd8
Fix state_dict key missing issue of the ZeroDDP ( #2363 )
* Fix state_dict output for ZeroDDP duplicated parameters
* Rewrite state_dict based on get_static_torch_model
* Modify get_static_torch_model to be compatible with the lower version (ZeroDDP)
2023-01-09 14:35:14 +08:00
HELSON
48d33b1b17
[gemini] add get static torch model ( #2356 )
2023-01-06 13:41:19 +08:00
Jiarui Fang
af32022f74
[Gemini] fix the convert_to_torch_module bug ( #2269 )
2023-01-03 15:55:35 +08:00
HELSON
2458659919
[zero] fix error for BEiT models ( #2169 )
* [zero] fix error for BEiT models
* [ColoParameter] add unpack operation for tuple arguments
* fix bugs
* fix chunkv2 unit testing
* add assertion for gradient state
2022-12-26 15:03:54 +08:00
Jiarui Fang
2827f41898
[Gemini] GeminiDDP convert to PyTorch Module. ( #2151 )
2022-12-20 10:19:36 +08:00
Jiarui Fang
9214d1fe28
[Gemini] chunk init using runtime visited param order ( #2115 )
2022-12-12 18:06:16 +08:00
Jiarui Fang
e5aa8333e4
[NFC] update chunk manager API ( #2119 )
2022-12-12 16:57:22 +08:00
Jiarui Fang
e99edfcb51
[NFC] polish comments for Chunk class ( #2116 )
2022-12-12 15:39:31 +08:00
HELSON
63fbba3c19
[zero] add L2 gradient clipping for ZeRO ( #2112 )
* [zero] add L2 gradient clipping
* [testing] add MlpModel
* [zero] add unit test for grad clipping
* fix atol
2022-12-09 18:09:17 +08:00
Jiarui Fang
1f99205827
[Gemini] remove static tracer ( #2083 )
2022-12-06 12:53:58 +08:00
Jiarui Fang
b3b89865e2
[Gemini] ParamOpHook -> ColoParamOpHook ( #2080 )
2022-12-05 17:11:06 +08:00
HELSON
e37f3db40c
[gemini] add arguments ( #2046 )
* [zero] fix testing parameters
* [gemini] add arguments
* add docstrings
2022-11-30 16:40:13 +08:00
Jiarui Fang
96134e7be3
[hotfix] add bert test for gemini fwd bwd ( #2035 )
2022-11-29 11:19:52 +08:00
Jiarui Fang
8daf1b4db1
[Gemini] patch for supporting torch.add_ function for ColoTensor ( #2003 )
2022-11-25 20:06:35 +08:00
Jiarui Fang
cc0ed7cf33
[Gemini] ZeROHookV2 -> GeminiZeROHook ( #1972 )
2022-11-17 14:43:49 +08:00
Jiarui Fang
f7e276fa71
[Gemini] add GeminiAdamOptimizer ( #1960 )
2022-11-16 14:44:28 +08:00
Jiarui Fang
cd5a0d56fa
[Gemini] make gemini usage simple ( #1821 )
2022-11-08 15:53:13 +08:00
Zihao
20e255d4e8
MemStatsCollectorStatic ( #1765 )
2022-11-07 16:49:03 +08:00
HELSON
c6a1a62636
[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 ( #1786 )
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12
* [zero] add cpu shard init
* [zero] add tiny example test
* [colo_tensor] fix bugs for torch-1.11
2022-11-02 16:11:34 +08:00
HELSON
1468e4bcfc
[zero] add constant placement policy ( #1705 )
* fixes memory leak when parameter is in fp16 in ZeroDDP init.
* bans chunk release in CUDA. A chunk may be released only when it is about to be offloaded.
* adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.
2022-10-14 17:53:16 +08:00
Jiarui Fang
21962e1593
[embedding] rename FreqAwareEmbedding -> CachedEmbedding ( #1699 )
2022-10-13 22:22:27 +08:00
Jiarui Fang
363fc2861a
[embeddings] more detailed timer ( #1692 )
2022-10-12 12:01:21 +08:00
HELSON
b28991dd0a
[feature] A new ZeRO implementation ( #1644 )
2022-10-09 09:18:51 +08:00
Jiarui Fang
c638bec028
[embedding] polish async copy ( #1657 )
2022-09-27 14:37:03 +08:00
Jiarui Fang
988570e4a6
[embedding] add more detail profiling ( #1656 )
2022-09-27 13:43:59 +08:00
Jiarui Fang
e1f97fd2b8
[embedding] print profiling results ( #1654 )
2022-09-27 12:50:33 +08:00
Jiarui Fang
04443605a5
[embedding] non-blocking cpu-gpu copy ( #1647 )
2022-09-26 14:57:57 +08:00
CsRic
0767f67a0f
[embedding] isolate cache_op from forward ( #1645 )
Co-authored-by: ric <mkkt_bkkt@mail.ustc.edu.cn>
2022-09-26 11:18:59 +08:00
Jiarui Fang
c5d39215f6
Revert "[feature] new zero implementation ( #1623 )" ( #1643 )
This reverts commit 5be118f405.
2022-09-26 10:06:03 +08:00
HELSON
5be118f405
[feature] new zero implementation ( #1623 )
2022-09-24 19:58:18 +08:00
Jiarui Fang
e57df80325
[embeddings] cache option ( #1635 )
2022-09-23 16:40:18 +08:00
Jiarui Fang
38c68b5b9a
[embedding] rollback for better FAW performance ( #1625 )
2022-09-22 11:16:25 +08:00
Jiarui Fang
504ff1d101
[embeddings] use cache_ratio instead of cuda_row_num ( #1611 )
2022-09-20 14:33:04 +08:00
Jiarui Fang
a19eb80998
[embedding] updates some default parameters
2022-09-15 15:45:17 +08:00
CsRic
f3403ff98e
[embeddings] add already_split_along_rank flag for tablewise mode ( #1584 )
2022-09-13 10:50:34 +08:00
Jiatong Han
3263cdf57f
[NFC] polish colossalai/nn/parallel/data_parallel.py code style ( #1570 )
Co-authored-by: JThh <jiatong.han@u.nus.edu>
2022-09-08 22:11:04 +08:00
CsRic
a389ac4ec9
[embedding] cache_embedding small improvement ( #1564 )
2022-09-08 16:41:19 +08:00
Jiarui Fang
64169f3e8f
[embedding] polish parallel embedding tablewise ( #1545 )
2022-09-06 10:41:20 +08:00