* Fix incorrect inference_forward.
* Restore use_dynamic_ntk_rope.
* Fix bugs.
* Adapt to flash attention 1.0.
* Adapt to flash attention 1.0.5.
* feat(fsdp): add training option for fsdp
* fix(fsdp): add mixed-precision training (see the FSDP sketch after this list)
* fix lint-check failure
* fix formatting issue
* restore 7B_sft
* fix checkpoint loading bug
* fix another checkpoint loading bug
* feat(solver/optimizer): add new file fsdp_optimizer.py
* fix(train.py): fix ci lint error
* fix(fsdp_optimizer.py): wait for async grads
* fix bug for loading ckpts when zero1 < dp_size
* fix(context/parallel_context.py): only log warning for fsdp
* change ckpt name
* fix(model/modeling_internlm.py): fix checkpoint=False runtime error
* wrap more modules with FSDP
* add support for FSDP with tp
* modify args_sanity_check for fsdp with pipeline and fsdp with moe
* fix(internlm/utils/parallel.py): fix circular import
* fix(internlm/train/training_internlm.py): remove set IS_TENSOR_PARALLEL attr
* fix(internlm/train/training_internlm.py): update wrap class and fix lint error
* fix(internlm/model): reset dropout_selective_checkpoint=True
* feat(configs/7B_sft.py): move fsdp config to parallel zero1 (see the config sketch below)
* feat(configs/7B_sft.py): adapt to old-version config
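
For reference, a minimal sketch of FSDP training with a mixed-precision policy, as referenced in the fsdp items above; the toy module and bf16 dtypes are assumptions, not the actual InternLM wiring.

```python
# Minimal sketch: wrap a module with FSDP and a bf16 mixed-precision policy.
# Assumes torch.distributed is already initialized, e.g. via
# dist.init_process_group(backend="nccl"), and one GPU per rank.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,   # parameters computed/communicated in bf16
    reduce_dtype=torch.bfloat16,  # gradient reduction in bf16
    buffer_dtype=torch.bfloat16,  # buffers (e.g. norm stats) kept in bf16
)

model = torch.nn.Linear(4096, 4096).cuda()  # toy stand-in for transformer blocks
fsdp_model = FSDP(model, mixed_precision=bf16_policy)
```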
---------
Co-authored-by: huangting4201 <1538303371@qq.com>
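
A hedged sketch of how the fsdp switch might sit under `parallel.zero1` in `configs/7B_sft.py` after this change; the exact key names and default values are assumptions.

```python
# Sketch of the parallel section of configs/7B_sft.py (assumed layout):
# zero1.size is the ZeRO-1 shard group size, zero1.fsdp toggles FSDP training.
parallel = dict(
    zero1=dict(size=8, fsdp=False),
    tensor=1,
    pipeline=dict(size=1, interleaved_overlap=True),
    sequence_parallel=False,
)
```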
* support ZeRO for expert-local data parallelism
* fix the above code:
  * treat optim.zero_world_size and optim.zero_local_rank as lists in model_checkpoint.py and test_model_checkpoint.py (see the sketch after this list)
  * add overlap and zero check for moe in args_sanity_check()
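
A hedged sketch of the list-style access implied by the checkpoint fix above: `optim.zero_world_size` and `optim.zero_local_rank` hold one entry per parameter group, so checkpoint code indexes them by group; the helper name and filename format here are hypothetical.

```python
# Sketch: per-group ZeRO rank/size lookup when saving optimizer shards.
# Only optim.zero_world_size / optim.zero_local_rank come from the change above;
# the function name and filename pattern are illustrative.
def zero_ckpt_name(optim, group_id: int) -> str:
    zero_rank = optim.zero_local_rank[group_id]   # list: one rank per param group
    zero_size = optim.zero_world_size[group_id]   # list: one size per param group
    return f"optimizer_group{group_id}_zo{zero_rank}-of-{zero_size}.pt"
```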