InternLM/internlm/utils
zaglc a075153adf
feat(train): add fsdp training option (#293)
* feat(fsdp): add training option for fsdp

* fix(fsdp): add mixed-precision training

* fix failure in lint-check

* fix format problem

* restore 7B_sft

* fix load ckpt bug

* fix load ckpt bug2

* feat(solver/optimizer): add new file fsdp_optimizer.py

* fix(train.py): fix ci lint error

* fix(fsdp_optimizer.py): wait grad async

* fix bug for loading ckpts when zero1 < dp_size

* fix(context/parallel_context.py): only log warning for fsdp

* change ckpt name

* fix(model/modeling_internlm.py): fix checkpoint=False runtime error

* more wrap

* add support for FSDP with tp

* modify args_sanity_check for fsdp with pipeline and fsdp with moe

* fix(internlm/utils/parallel.py): fix circular import

* fix(internlm/train/training_internlm.py): remove set IS_TENSOR_PARALLEL attr

* fix(internlm/train/training_internlm.py): update wrap class and fix lint error

* fix(internlm/model): reset dropout_selective_checkpoint=True

* feat(configs/7B_sft.py): move fsdp config to parallel zero1

* feat(configs/7B_sft.py): adapt to old version config

---------

Co-authored-by: huangting4201 <1538303371@qq.com>
2023-10-09 18:59:31 +08:00
__init__.py initial commit 2023-07-06 12:55:23 +08:00
checkpoint.py initial commit 2023-07-06 12:55:23 +08:00
common.py Merge develop to main (#233) 2023-08-24 22:03:04 +08:00
evaluation.py feat(moe): add moe module (#182) 2023-09-27 15:54:53 +08:00
gputest.py Feat(PythonGC): Do garbage collection manually (#326) 2023-09-22 13:52:25 +08:00
logger.py feat(utils): add timeout wrapper for key functions (#286) 2023-09-07 17:26:17 +08:00
megatron_timers.py feat: add runtime diag (#297) 2023-09-08 17:56:46 +08:00
model_checkpoint.py feat(train): add fsdp training option (#293) 2023-10-09 18:59:31 +08:00
parallel.py feat(train): add fsdp training option (#293) 2023-10-09 18:59:31 +08:00
registry.py Merge develop to main (#233) 2023-08-24 22:03:04 +08:00
simple_memory_profiler.py Merge develop to main (#233) 2023-08-24 22:03:04 +08:00
storage_manager.py fix(storage): fix try_get_storage_backend (#359) 2023-09-25 15:16:25 +08:00
timeout.py feat(moe): add moe module (#182) 2023-09-27 15:54:53 +08:00
writer.py feat(utils/writer.py): support writer add_scalars for writing dict data (#257) 2023-09-01 13:24:46 +08:00