LuGY
cbac782254
[zero]fix zero ckptIO with offload ( #4529 )
...
* fix zero ckptio with offload
* fix load device
* saved tensors in ckpt should be on CPU
* fix unit test
* fix unit test
* add clear cache
* save memory for CI
2023-09-01 17:41:19 +08:00
LuGY
839847b7d7
[zero]support zero2 with gradient accumulation ( #4511 )
...
* support gradient accumulation with zero2
* fix type
2023-08-25 13:44:07 +08:00
Hongxin Liu
27061426f7
[gemini] improve compatibility and add static placement policy ( #4479 )
...
* [gemini] remove distributed-related part from colotensor (#4379 )
* [gemini] remove process group dependency
* [gemini] remove tp part from colo tensor
* [gemini] patch inplace op
* [gemini] fix param op hook and update tests
* [test] remove useless tests
* [test] remove useless tests
* [misc] fix requirements
* [test] fix model zoo
* [test] fix model zoo
* [test] fix model zoo
* [test] fix model zoo
* [test] fix model zoo
* [misc] update requirements
* [gemini] refactor gemini optimizer and gemini ddp (#4398 )
* [gemini] update optimizer interface
* [gemini] renaming gemini optimizer
* [gemini] refactor gemini ddp class
* [example] update gemini related example
* [example] update gemini related example
* [plugin] fix gemini plugin args
* [test] update gemini ckpt tests
* [gemini] fix checkpoint io
* [example] fix opt example requirements
* [example] fix opt example
* [example] fix opt example
* [example] fix opt example
* [gemini] add static placement policy (#4443 )
* [gemini] add static placement policy
* [gemini] fix param offload
* [test] update gemini tests
* [plugin] update gemini plugin
* [plugin] update gemini plugin docstr
* [misc] fix flash attn requirement
* [test] fix gemini checkpoint io test
* [example] update resnet example result (#4457 )
* [example] update bert example result (#4458 )
* [doc] update gemini doc (#4468 )
* [example] update gemini related examples (#4473 )
* [example] update gpt example
* [example] update dreambooth example
* [example] update vit
* [example] update opt
* [example] update palm
* [example] update vit and opt benchmark
* [hotfix] fix bert in model zoo (#4480 )
* [hotfix] fix bert in model zoo
* [test] remove chatglm gemini test
* [test] remove sam gemini test
* [test] remove vit gemini test
* [hotfix] fix opt tutorial example (#4497 )
* [hotfix] fix opt tutorial example
* [hotfix] fix opt tutorial example
2023-08-24 09:29:25 +08:00
LuGY
d86ddd9b29
[hotfix] fix unsafe async comm in zero ( #4404 )
...
* improve stablility of zero
* fix wrong index
* add record stream
2023-08-11 15:09:24 +08:00
Hongxin Liu
16bf4c0221
[test] remove useless tests ( #4359 )
...
* [test] remove legacy zero test
* [test] remove lazy distribute test
* [test] remove outdated checkpoint io
2023-08-01 18:52:14 +08:00
LuGY
dd7cc58299
[zero] add state dict for low level zero ( #4179 )
...
* add state dict for zero
* fix unit test
* polish
2023-07-31 22:13:29 +08:00
LuGY
c668801d36
[zero] allow passing process group to zero12 ( #4153 )
...
* allow passing process group to zero12
* union tp-zero and normal-zero
* polish code
2023-07-31 22:13:29 +08:00
LuGY
79cf1b5f33
[zero]support no_sync method for zero1 plugin ( #4138 )
...
* support no sync for zero1 plugin
* polish
* polish
2023-07-31 22:13:29 +08:00
LuGY
c6ab96983a
[zero] refactor low level zero for shard evenly ( #4030 )
...
* refactor low level zero
* fix zero2 and support cpu offload
* avg gradient and modify unit test
* refactor grad store, support layer drop
* refactor bucket store, support grad accumulation
* fix and update unit test of zero and ddp
* compatible with tp, ga and unit test
* fix memory leak and polish
* add zero layer drop unittest
* polish code
* fix import err in unit test
* support diffenert comm dtype, modify docstring style
* polish code
* test padding and fix
* fix unit test of low level zero
* fix pad recording in bucket store
* support some models
* polish
2023-07-31 22:13:29 +08:00
Baizhou Zhang
0bb0b481b4
[gemini] fix argument naming during chunk configuration searching
2023-06-25 13:34:15 +08:00
Hongxin Liu
ae02d4e4f7
[bf16] add bf16 support ( #3882 )
...
* [bf16] add bf16 support for fused adam (#3844 )
* [bf16] fused adam kernel support bf16
* [test] update fused adam kernel test
* [test] update fused adam test
* [bf16] cpu adam and hybrid adam optimizers support bf16 (#3860 )
* [bf16] implement mixed precision mixin and add bf16 support for low level zero (#3869 )
* [bf16] add mixed precision mixin
* [bf16] low level zero optim support bf16
* [text] update low level zero test
* [text] fix low level zero grad acc test
* [bf16] add bf16 support for gemini (#3872 )
* [bf16] gemini support bf16
* [test] update gemini bf16 test
* [doc] update gemini docstring
* [bf16] add bf16 support for plugins (#3877 )
* [bf16] add bf16 support for legacy zero (#3879 )
* [zero] init context support bf16
* [zero] legacy zero support bf16
* [test] add zero bf16 test
* [doc] add bf16 related docstring for legacy zero
2023-06-05 15:58:31 +08:00
digger-yu
1f73609adb
[CI] fix typo with tests/ etc. ( #3727 )
...
* fix spelling error with examples/comminity/
* fix spelling error with tests/
* fix some spelling error with tests/ colossalai/ etc.
* fix spelling error with tests/ etc. date:2023.5.10
2023-05-11 16:30:58 +08:00
digger-yu
b49020c1b1
[CI] Update test_sharded_optim_with_sync_bn.py ( #3688 )
...
fix spelling error in line23
change "cudnn_determinstic"=True to "cudnn_deterministic=True"
2023-05-05 18:57:27 +08:00
jiangmingyan
307894f74d
[booster] gemini plugin support shard checkpoint ( #3610 )
...
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
---------
Co-authored-by: luchen <luchen@luchendeMBP.lan>
Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>
2023-05-05 14:37:21 +08:00
Hongxin Liu
50793b35f4
[gemini] accelerate inference ( #3641 )
...
* [gemini] support don't scatter after inference
* [chat] update colossalai strategy
* [chat] fix opt benchmark
* [chat] update opt benchmark
* [gemini] optimize inference
* [test] add gemini inference test
* [chat] fix unit test ci
* [chat] fix ci
* [chat] fix ci
* [chat] skip checkpoint test
2023-04-26 16:32:40 +08:00
Hongxin Liu
f313babd11
[gemini] support save state dict in shards ( #3581 )
...
* [gemini] support state dict shard
* [gemini] add test state dict shard
* [gemini] polish docstr
* [gemini] fix merge
* [gemini] polish code
2023-04-17 17:11:09 +08:00
Frank Lee
80eba05b0a
[test] refactor tests with spawn ( #3452 )
...
* [test] added spawn decorator
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
2023-04-06 14:51:35 +08:00
ver217
933048ad3e
[test] reorganize zero/gemini tests ( #3445 )
2023-04-06 09:38:25 +08:00
ver217
26b7aac0be
[zero] reorganize zero/gemini folder structure ( #3424 )
...
* [zero] refactor low-level zero folder structure
* [zero] fix legacy zero import path
* [zero] fix legacy zero import path
* [zero] remove useless import
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] fix test import path
* [zero] fix test
* [zero] fix circular import
* [zero] update import
2023-04-04 13:48:16 +08:00
Frank Lee
638a07a7f9
[test] fixed gemini plugin test ( #3411 )
...
* [test] fixed gemini plugin test
* polish code
* polish code
2023-04-03 17:12:22 +08:00
HELSON
b528eea0f0
[zero] add zero wrappers ( #2523 )
...
* [zero] add zero wrappers
* change names
* add wrapper functions to init
2023-01-29 17:52:58 +08:00
HELSON
077a5cdde4
[zero] fix gradient clipping in hybrid parallelism ( #2521 )
...
* [zero] fix gradient clipping in hybrid parallelism
* [testing] change model name to avoid pytest warning
* [hotfix] fix unit testing
2023-01-29 15:09:57 +08:00
HELSON
d565a24849
[zero] add unit testings for hybrid parallelism ( #2486 )
2023-01-18 10:36:10 +08:00
HELSON
21c88220ce
[zero] add unit test for low-level zero init ( #2474 )
2023-01-15 10:42:01 +08:00
HELSON
a5dc4253c6
[zero] polish low level optimizer ( #2473 )
2023-01-13 14:56:17 +08:00
Jiarui Fang
867c8c2d3a
[zero] low level optim supports ProcessGroup ( #2464 )
2023-01-13 10:05:58 +08:00
HELSON
a3100bd50d
[testing] add beit model for unit testings ( #2196 )
...
* [testing] add beit model
* [beit] fix bugs
* [beit] fix bugs
* [testing] fix bugs
2022-12-26 17:35:36 +08:00
Jiarui Fang
b87496a66b
[hotfix] fix auto policy of test_sharded_optim_v2 ( #2157 )
2022-12-20 23:03:18 +08:00
Jiarui Fang
c89c66a858
[Gemini] update API of the chunkmemstatscollector. ( #2129 )
2022-12-14 00:47:06 +08:00
Jiarui Fang
1fca5d79ea
[Gemini] remove GLOBAL_MODEL_DATA_TRACER ( #2091 )
2022-12-06 22:30:16 +08:00
Jiarui Fang
33f4412102
[Gemini] use MemStats to store the tracing data. Seperate it from Collector. ( #2084 )
2022-12-06 16:43:06 +08:00
Jiarui Fang
1e885329f4
[test] align model name with the file name. ( #2045 )
2022-11-30 15:45:26 +08:00
HELSON
a1ce02d740
[zero] test gradient accumulation ( #1964 )
...
* [zero] fix memory leak for zero2
* [zero] test gradient accumulation
* [zero] remove grad clip test
2022-11-29 13:00:30 +08:00
Jiarui Fang
3712ac7f90
[Gemini] add bert for MemtracerWrapper unintests ( #1982 )
2022-11-18 14:58:28 +08:00
HELSON
7066dfbf82
[zero] fix memory leak for zero2 ( #1955 )
2022-11-16 11:43:24 +08:00
HELSON
6e51d296f0
[zero] migrate zero1&2 ( #1878 )
...
* add zero1&2 optimizer
* rename test ditectory
* rename test files
* change tolerance in test
2022-11-11 09:26:40 +08:00
HELSON
b28991dd0a
[feature] A new ZeRO implementation ( #1644 )
2022-10-09 09:18:51 +08:00
Jiarui Fang
c5d39215f6
Revert "[feature] new zero implementation ( #1623 )" ( #1643 )
...
This reverts commit 5be118f405
.
2022-09-26 10:06:03 +08:00
HELSON
5be118f405
[feature] new zero implementation ( #1623 )
2022-09-24 19:58:18 +08:00
HELSON
f7f2248771
[moe] fix MoE bugs ( #1628 )
...
* remove forced FP32 modules
* correct no_shard-contexts' positions
2022-09-22 13:56:30 +08:00
ver217
8dced41ad0
[zero] zero optim state_dict takes only_rank_0 ( #1384 )
...
* zero optim state_dict takes only_rank_0
* fix unit test
2022-07-29 13:22:50 +08:00
ver217
828b9e5e0d
[hotfix] fix zero optim save/load state dict ( #1381 )
2022-07-28 17:19:39 +08:00
HELSON
7a8702c06d
[colotensor] add Tensor.view op and its unit test ( #1343 )
...
[colotensor] add megatron initialization for gpt2
2022-07-21 10:53:15 +08:00
ver217
0c51ff2c13
[hotfix] ZeroDDP use new process group ( #1333 )
...
* process group supports getting ranks in group
* chunk mgr receives a process group
* update unit test
* fix unit tests
2022-07-18 14:14:52 +08:00
ver217
7a05367101
[hotfix] shared model returns cpu state_dict ( #1328 )
2022-07-15 22:11:37 +08:00
Jiarui Fang
060b917daf
[refactor] remove gpc dependency in colotensor's _ops ( #1189 )
2022-07-04 18:54:37 +08:00
Jiarui Fang
372f791444
[refactor] move chunk and chunkmgr to directory gemini ( #1182 )
2022-06-29 13:31:02 +08:00
ver217
9e1daa63d2
[zero] sharded optim supports loading local state dict ( #1170 )
...
* sharded optim supports loading local state dict
* polish code
* add unit test
2022-06-24 18:05:16 +08:00
ver217
561e90493f
[zero] zero optim supports loading local state dict ( #1171 )
...
* zero optim supports loading local state dict
* polish code
* add unit test
2022-06-24 17:25:57 +08:00
Frank Lee
65ee6dcc20
[test] ignore 8 gpu test ( #1080 )
...
* [test] ignore 8 gpu test
* polish code
* polish workflow
* polish workflow
2022-06-08 23:14:18 +08:00