InternLM/internlm/initialize
zaglc a075153adf
feat(train): add fsdp training option (#293)
* feat(fsdp): add training option for fsdp

* fix(fsdp): add mix-precision training

* fix failure in lint-check

* fix format problem

* restore 7B_sft

* fix load ckpt bug

* fix load ckpt bug2

* feat(solver/optimizer): add new file fsdp_optimizer.py

* fix(train.py): fix ci lint error

* fix(fsdp_optimizer.py): wait grad async

* fix bug for loading ckpts when zero1 < dp_size

* fix(context/parallel_context.py): only log warning for fsdp

* change ckpt name

* fix(model/modeling_internlm.py): fix checkpoint=False runtime error

* more wrap

* add support for FSDP with tp

* modify args_sanity_check for fsdp with pipeline and fsdp with moe

* fix(internlm/utils/parallel.py): fix circular import

* fix(internlm/train/training_internlm.py): remove set IS_TENSOR_PARALLEL attr

* fix(internlm/train/training_internlm.py): update wrap class and fix lint error

* fix(internlm/model): reset dropout_selective_checkpoint=True

* feat(configs/7B_sft.py): move fsdp config to parallel zero1

* feat(configs/7B_sft.py): adapt to old version config

---------

Co-authored-by: huangting4201 <1538303371@qq.com>
2023-10-09 18:59:31 +08:00
legacy                 feat(ckpt): fix checkpoint bugs and add feature enhancements. (#259)  2023-09-05 17:40:48 +08:00
__init__.py            feat(numa): bind numa if possible (#320)                              2023-09-25 19:34:52 +08:00
initialize_tensor.py   feat(model): implement uniform_init for tensor. (#252)                2023-09-01 01:12:53 +08:00
initialize_trainer.py  docs(*): add documentation and reST files for readthedocs (#272)      2023-09-06 15:36:03 +08:00
launch.py              feat(train): add fsdp training option (#293)                          2023-10-09 18:59:31 +08:00