Hongxin Liu
ae02d4e4f7
[bf16] add bf16 support ( #3882 )
...
* [bf16] add bf16 support for fused adam (#3844 )
* [bf16] fused adam kernel support bf16
* [test] update fused adam kernel test
* [test] update fused adam test
* [bf16] cpu adam and hybrid adam optimizers support bf16 (#3860 )
* [bf16] implement mixed precision mixin and add bf16 support for low level zero (#3869 )
* [bf16] add mixed precision mixin
* [bf16] low level zero optim support bf16
* [text] update low level zero test
* [text] fix low level zero grad acc test
* [bf16] add bf16 support for gemini (#3872 )
* [bf16] gemini support bf16
* [test] update gemini bf16 test
* [doc] update gemini docstring
* [bf16] add bf16 support for plugins (#3877 )
* [bf16] add bf16 support for legacy zero (#3879 )
* [zero] init context support bf16
* [zero] legacy zero support bf16
* [test] add zero bf16 test
* [doc] add bf16 related docstring for legacy zero
2 years ago
digger-yu
1f73609adb
[CI] fix typo with tests/ etc. ( #3727 )
...
* fix spelling error with examples/comminity/
* fix spelling error with tests/
* fix some spelling error with tests/ colossalai/ etc.
* fix spelling error with tests/ etc. date:2023.5.10
2 years ago
digger-yu
b49020c1b1
[CI] Update test_sharded_optim_with_sync_bn.py ( #3688 )
...
fix spelling error in line23
change "cudnn_determinstic"=True to "cudnn_deterministic=True"
2 years ago
jiangmingyan
307894f74d
[booster] gemini plugin support shard checkpoint ( #3610 )
...
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin add shard checkpoint save/load
* gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
* [API Refactoring]gemini plugin support shard checkpoint
---------
Co-authored-by: luchen <luchen@luchendeMBP.lan>
Co-authored-by: luchen <luchen@luchendeMacBook-Pro.local>
2 years ago
Hongxin Liu
50793b35f4
[gemini] accelerate inference ( #3641 )
...
* [gemini] support don't scatter after inference
* [chat] update colossalai strategy
* [chat] fix opt benchmark
* [chat] update opt benchmark
* [gemini] optimize inference
* [test] add gemini inference test
* [chat] fix unit test ci
* [chat] fix ci
* [chat] fix ci
* [chat] skip checkpoint test
2 years ago
Hongxin Liu
f313babd11
[gemini] support save state dict in shards ( #3581 )
...
* [gemini] support state dict shard
* [gemini] add test state dict shard
* [gemini] polish docstr
* [gemini] fix merge
* [gemini] polish code
2 years ago
Frank Lee
80eba05b0a
[test] refactor tests with spawn ( #3452 )
...
* [test] added spawn decorator
* polish code
* polish code
* polish code
* polish code
* polish code
* polish code
2 years ago
ver217
933048ad3e
[test] reorganize zero/gemini tests ( #3445 )
2 years ago
ver217
26b7aac0be
[zero] reorganize zero/gemini folder structure ( #3424 )
...
* [zero] refactor low-level zero folder structure
* [zero] fix legacy zero import path
* [zero] fix legacy zero import path
* [zero] remove useless import
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor gemini folder structure
* [zero] refactor legacy zero import path
* [zero] fix test import path
* [zero] fix test
* [zero] fix circular import
* [zero] update import
2 years ago
Frank Lee
638a07a7f9
[test] fixed gemini plugin test ( #3411 )
...
* [test] fixed gemini plugin test
* polish code
* polish code
2 years ago
HELSON
b528eea0f0
[zero] add zero wrappers ( #2523 )
...
* [zero] add zero wrappers
* change names
* add wrapper functions to init
2 years ago
HELSON
077a5cdde4
[zero] fix gradient clipping in hybrid parallelism ( #2521 )
...
* [zero] fix gradient clipping in hybrid parallelism
* [testing] change model name to avoid pytest warning
* [hotfix] fix unit testing
2 years ago
HELSON
d565a24849
[zero] add unit testings for hybrid parallelism ( #2486 )
2 years ago
HELSON
21c88220ce
[zero] add unit test for low-level zero init ( #2474 )
2 years ago
HELSON
a5dc4253c6
[zero] polish low level optimizer ( #2473 )
2 years ago
Jiarui Fang
867c8c2d3a
[zero] low level optim supports ProcessGroup ( #2464 )
2 years ago
HELSON
a3100bd50d
[testing] add beit model for unit testings ( #2196 )
...
* [testing] add beit model
* [beit] fix bugs
* [beit] fix bugs
* [testing] fix bugs
2 years ago
Jiarui Fang
b87496a66b
[hotfix] fix auto policy of test_sharded_optim_v2 ( #2157 )
2 years ago
Jiarui Fang
c89c66a858
[Gemini] update API of the chunkmemstatscollector. ( #2129 )
2 years ago
Jiarui Fang
1fca5d79ea
[Gemini] remove GLOBAL_MODEL_DATA_TRACER ( #2091 )
2 years ago
Jiarui Fang
33f4412102
[Gemini] use MemStats to store the tracing data. Seperate it from Collector. ( #2084 )
2 years ago
Jiarui Fang
1e885329f4
[test] align model name with the file name. ( #2045 )
2 years ago
HELSON
a1ce02d740
[zero] test gradient accumulation ( #1964 )
...
* [zero] fix memory leak for zero2
* [zero] test gradient accumulation
* [zero] remove grad clip test
2 years ago
Jiarui Fang
3712ac7f90
[Gemini] add bert for MemtracerWrapper unintests ( #1982 )
2 years ago
HELSON
7066dfbf82
[zero] fix memory leak for zero2 ( #1955 )
2 years ago
HELSON
6e51d296f0
[zero] migrate zero1&2 ( #1878 )
...
* add zero1&2 optimizer
* rename test ditectory
* rename test files
* change tolerance in test
2 years ago
HELSON
b28991dd0a
[feature] A new ZeRO implementation ( #1644 )
2 years ago
Jiarui Fang
c5d39215f6
Revert "[feature] new zero implementation ( #1623 )" ( #1643 )
...
This reverts commit 5be118f405
.
2 years ago
HELSON
5be118f405
[feature] new zero implementation ( #1623 )
2 years ago
HELSON
f7f2248771
[moe] fix MoE bugs ( #1628 )
...
* remove forced FP32 modules
* correct no_shard-contexts' positions
2 years ago
ver217
8dced41ad0
[zero] zero optim state_dict takes only_rank_0 ( #1384 )
...
* zero optim state_dict takes only_rank_0
* fix unit test
2 years ago
ver217
828b9e5e0d
[hotfix] fix zero optim save/load state dict ( #1381 )
2 years ago
HELSON
7a8702c06d
[colotensor] add Tensor.view op and its unit test ( #1343 )
...
[colotensor] add megatron initialization for gpt2
2 years ago
ver217
0c51ff2c13
[hotfix] ZeroDDP use new process group ( #1333 )
...
* process group supports getting ranks in group
* chunk mgr receives a process group
* update unit test
* fix unit tests
2 years ago
ver217
7a05367101
[hotfix] shared model returns cpu state_dict ( #1328 )
2 years ago
Jiarui Fang
060b917daf
[refactor] remove gpc dependency in colotensor's _ops ( #1189 )
2 years ago
Jiarui Fang
372f791444
[refactor] move chunk and chunkmgr to directory gemini ( #1182 )
2 years ago
ver217
9e1daa63d2
[zero] sharded optim supports loading local state dict ( #1170 )
...
* sharded optim supports loading local state dict
* polish code
* add unit test
2 years ago
ver217
561e90493f
[zero] zero optim supports loading local state dict ( #1171 )
...
* zero optim supports loading local state dict
* polish code
* add unit test
2 years ago
Frank Lee
65ee6dcc20
[test] ignore 8 gpu test ( #1080 )
...
* [test] ignore 8 gpu test
* polish code
* polish workflow
* polish workflow
3 years ago
HELSON
e5ea3fdeef
[gemini] add GeminiMemoryManger ( #832 )
...
* refactor StatefulTensor, tensor utilities
* add unitest for GeminiMemoryManager
3 years ago
Jiarui Fang
e761ad2cd7
Revert "[zero] add ZeroTensorShardStrategy ( #793 )" ( #806 )
3 years ago
HELSON
88759e289e
[zero] add ZeroTensorShardStrategy ( #793 )
3 years ago
Jiarui Fang
4d9332b4c5
[refactor] moving memtracer to gemini ( #801 )
3 years ago
HELSON
4c4388c46e
[hotfix] fix memory leak in zero ( #781 )
3 years ago
Frank Lee
5a1a095b92
[test] refactored with the new rerun decorator ( #763 )
...
* [test] refactored with the new rerun decorator
* polish test case
3 years ago
Jiarui Fang
10ef8afdd2
[gemini] init genimi individual directory ( #754 )
3 years ago
ver217
dcca614eee
[hotfix] fix test_stateful_tensor_mgr ( #762 )
3 years ago
ver217
a93a7d7364
[hotfix] fix reuse_fp16_shard of sharded model ( #756 )
...
* fix reuse_fp16_shard
* disable test stm
* polish code
3 years ago
HELSON
84c6700b2a
[zero] refactor memstats_collector ( #746 )
3 years ago