HELSON
b528eea0f0
[zero] add zero wrappers ( #2523 )
...
* [zero] add zero wrappers
* change names
* add wrapper functions to init
2 years ago
HELSON
077a5cdde4
[zero] fix gradient clipping in hybrid parallelism ( #2521 )
...
* [zero] fix gradient clipping in hybrid parallelism
* [testing] change model name to avoid pytest warning
* [hotfix] fix unit testing
2 years ago
HELSON
d565a24849
[zero] add unit testings for hybrid parallelism ( #2486 )
2 years ago
HELSON
a5dc4253c6
[zero] polish low level optimizer ( #2473 )
2 years ago
Jiarui Fang
867c8c2d3a
[zero] low level optim supports ProcessGroup ( #2464 )
2 years ago
HELSON
7829aa094e
[ddp] add is_ddp_ignored ( #2434 )
...
[ddp] rename to is_ddp_ignored
2 years ago
HELSON
62c38e3330
[zero] polish low level zero optimizer ( #2275 )
2 years ago
HELSON
a7d95b7024
[example] add zero1, zero2 example in GPT examples ( #2146 )
...
* [example] add zero1 and zero2 for GPT
* update readme in gpt example
* polish code
* change init value
* update readme
2 years ago
Jiarui Fang
c89c66a858
[Gemini] update API of the chunkmemstatscollector. ( #2129 )
2 years ago
Jiarui Fang
2938edf446
[Gemini] update the non model data record method in runtime memory tracer ( #2128 )
2 years ago
Jiarui Fang
e99edfcb51
[NFC] polish comments for Chunk class ( #2116 )
2 years ago
Jiarui Fang
33f4412102
[Gemini] use MemStats to store the tracing data. Separate it from Collector. ( #2084 )
2 years ago
Jiarui Fang
b3b89865e2
[Gemini] ParamOpHook -> ColoParamOpHook ( #2080 )
2 years ago
HELSON
a1ce02d740
[zero] test gradient accumulation ( #1964 )
...
* [zero] fix memory leak for zero2
* [zero] test gradient accumulation
* [zero] remove grad clip test
2 years ago
Jiarui Fang
cc0ed7cf33
[Gemini] ZeROHookV2 -> GeminiZeROHook ( #1972 )
2 years ago
Jiarui Fang
c4739a725a
[Gemini] polish memstats collector ( #1962 )
2 years ago
Jiarui Fang
f7e276fa71
[Gemini] add GeminiAdamOptimizer ( #1960 )
2 years ago
HELSON
7066dfbf82
[zero] fix memory leak for zero2 ( #1955 )
2 years ago
HELSON
6e51d296f0
[zero] migrate zero1&2 ( #1878 )
...
* add zero1&2 optimizer
* rename test directory
* rename test files
* change tolerance in test
2 years ago
Zihao
20e255d4e8
MemStatsCollectorStatic ( #1765 )
2 years ago
HELSON
c6a1a62636
[hotfix] fix zero's incompatibility with checkpoint in torch-1.12 ( #1786 )
...
* [hotfix] fix zero's incompatibility with checkpoint in torch-1.12
* [zero] add cpu shard init
* [zero] add tiny example test
* [colo_tensor] fix bugs for torch-1.11
2 years ago
CsRic
ea961d8fd1
[NFC] polish colossalai/zero/sharded_param/__init__.py code style ( #1717 )
...
Co-authored-by: ric <mkkt_bkkt@mail.ustc.edu.cn>
2 years ago
HELSON
1468e4bcfc
[zero] add constant placement policy ( #1705 )
...
* fixes memory leak when parameter is in fp16 in ZeroDDP init.
* bans chunk release in CUDA. A chunk may be released only when it is about to be offloaded.
* adds a constant placement policy. With it, users can allocate a reserved caching memory space for parameters.
2 years ago
HELSON
b28991dd0a
[feature] A new ZeRO implementation ( #1644 )
2 years ago
Jiarui Fang
c5d39215f6
Revert "[feature] new zero implementation ( #1623 )" ( #1643 )
...
This reverts commit 5be118f405.
2 years ago
HELSON
5be118f405
[feature] new zero implementation ( #1623 )
2 years ago
HELSON
f7f2248771
[moe] fix MoE bugs ( #1628 )
...
* remove forced FP32 modules
* correct no_shard-contexts' positions
2 years ago
ver217
c9e8ce67b8
fix move fp32 shards ( #1604 )
2 years ago
Fazzie-Maqianli
06dccdde44
[NFC] polish colossalai/zero/sharded_model/reduce_scatter.py code style ( #1554 )
2 years ago
ver217
821c6172e2
[utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer ( #1442 )
2 years ago
ver217
6df3e19be9
[hotfix] zero optim prevents calling inner optim.zero_grad ( #1422 )
2 years ago
ver217
8dced41ad0
[zero] zero optim state_dict takes only_rank_0 ( #1384 )
...
* zero optim state_dict takes only_rank_0
* fix unit test
2 years ago
ver217
828b9e5e0d
[hotfix] fix zero optim save/load state dict ( #1381 )
2 years ago
ver217
6b43c789fd
fix zero optim backward_by_grad and save/load ( #1353 )
2 years ago
ver217
d068af81a3
[doc] update rst and docstring ( #1351 )
...
* update rst
* add zero docstr
* fix docstr
* remove fx.tracer.meta_patch
* fix docstr
* fix docstr
* update fx rst
* fix fx docstr
* remove useless rst
2 years ago
ver217
ce470ba37e
[checkpoint] sharded optim save/load grad scaler ( #1350 )
2 years ago
ver217
7a05367101
[hotfix] shared model returns cpu state_dict ( #1328 )
2 years ago
Jiarui Fang
4165eabb1e
[hotfix] remove potential circle import ( #1307 )
...
* make it faster
* [hotfix] remove circle import
2 years ago
ver217
a45ddf2d5f
[hotfix] fix sharded optim step and clip_grad_norm ( #1226 )
2 years ago
Jiarui Fang
a444633d13
warmup ratio configuration ( #1192 )
2 years ago
Jiarui Fang
372f791444
[refactor] move chunk and chunkmgr to directory gemini ( #1182 )
2 years ago
ver217
9e1daa63d2
[zero] sharded optim supports loading local state dict ( #1170 )
...
* sharded optim supports loading local state dict
* polish code
* add unit test
2 years ago
ver217
561e90493f
[zero] zero optim supports loading local state dict ( #1171 )
...
* zero optim supports loading local state dict
* polish code
* add unit test
2 years ago
ver217
8106d7b8c7
[ddp] refactor ColoDDP and ZeroDDP ( #1146 )
...
* ColoDDP supports overwriting default process group
* rename ColoDDPV2 to ZeroDDP
* add docstr for ZeroDDP
* polish docstr
2 years ago
ver217
6690a61b4d
[hotfix] prevent nested ZeRO ( #1140 )
2 years ago
Frank Lee
15aab1476e
[zero] avoid zero hook spam by changing log to debug level ( #1137 )
2 years ago
ver217
a1a7899cae
[hotfix] fix zero init ctx numel ( #1128 )
2 years ago
ver217
f0a954f16d
[ddp] add set_params_to_ignore for ColoDDP ( #1122 )
...
* add set_params_to_ignore for ColoDDP
* polish code
* fix zero hook v2
* add unit test
* polish docstr
2 years ago
Frank Lee
14e5b11d7f
[zero] fixed api consistency ( #1098 )
3 years ago
Frank Lee
cb18922c47
[doc] added documentation to chunk and chunk manager ( #1094 )
...
* [doc] added documentation to chunk and chunk manager
* polish code
* polish code
* polish code
3 years ago