* feat(fsdp): add training option for fsdp
* fix(fsdp): add mixed-precision training
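For reference, this is roughly how FSDP mixed-precision training is enabled through `torch.distributed.fsdp`; the bf16 dtypes and toy model below are illustrative assumptions, not necessarily what this PR ships.

```python
# Minimal sketch: wrap a model in FSDP with a mixed-precision policy.
# Assumes a torchrun launch; the bf16 choices are illustrative only.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()
policy = MixedPrecision(
    param_dtype=torch.bfloat16,   # params gathered/computed in bf16
    reduce_dtype=torch.bfloat16,  # gradient reduce-scatter in bf16
    buffer_dtype=torch.bfloat16,  # buffers (e.g. norm stats) in bf16
)
model = FSDP(model, mixed_precision=policy)
```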
* fix lint-check failure
* fix formatting problem
* restore 7B_sft
* fix ckpt loading bug
* fix second ckpt loading bug
* feat(solver/optimizer): add new file fsdp_optimizer.py
* fix(train.py): fix ci lint error
* fix(fsdp_optimizer.py): wait for async grad reduction
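The "wait for async grad reduction" fix suggests the optimizer must block on in-flight asynchronous gradient reductions before stepping. A sketch of that pattern follows; the function names are hypothetical, not the actual `fsdp_optimizer.py` API.

```python
# Hypothetical wait-before-step pattern; gradient averaging is
# omitted for brevity.
import torch
import torch.distributed as dist

def reduce_grads_async(params):
    """Launch non-blocking all-reduces and collect their work handles."""
    return [
        dist.all_reduce(p.grad, async_op=True)
        for p in params
        if p.grad is not None
    ]

def step_when_ready(optimizer, handles):
    for handle in handles:  # block until every pending reduction lands
        handle.wait()
    optimizer.step()
```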
* fix ckpt loading bug when zero1 < dp_size
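When `zero1 < dp_size`, each zero shard is replicated across several data-parallel ranks, so the loader has to map ranks onto shard files many-to-one. A sketch under that assumption; the contiguous-block grouping is a guess, not the repo's actual layout.

```python
# Hypothetical rank-to-shard mapping for zero1 < dp_size; the
# contiguous-block grouping here is an assumption for illustration.
def shard_index_for_rank(dp_rank: int, dp_size: int, zero1_size: int) -> int:
    ranks_per_shard = dp_size // zero1_size  # > 1 when zero1 < dp_size
    return dp_rank // ranks_per_shard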
* fix(context/parallel_context.py): only log warning for fsdp
* change ckpt name
* fix(model/modeling_internlm.py): fix checkpoint=False runtime error
* wrap more module classes for FSDP
* add support for FSDP with tp
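Combining FSDP with tensor parallelism typically means sharding parameters only across the data-parallel group so each tp slice stays intact. A sketch of the group construction; the rank layout and `tp_size` are assumptions, not the repo's actual wiring.

```python
# Sketch: build per-tp-slice data-parallel groups and shard with FSDP
# across those groups only (rank layout is an assumption).
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def build_dp_group(tp_size: int):
    world, rank = dist.get_world_size(), dist.get_rank()
    dp_group = None
    for tp_idx in range(tp_size):
        ranks = list(range(tp_idx, world, tp_size))
        group = dist.new_group(ranks=ranks)  # every rank must make this call
        if rank in ranks:
            dp_group = group
    return dp_group

# model = FSDP(model, process_group=build_dp_group(tp_size=2))
```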
* modify args_sanity_check for fsdp with pipeline and fsdp with moe
* fix(internlm/utils/parallel.py): fix circular import
* fix(internlm/train/training_internlm.py): remove set IS_TENSOR_PARALLEL attr
* fix(internlm/train/training_internlm.py): update wrap class and fix lint error
* fix(internlm/model): reset dropout_selective_checkpoint=True
* feat(configs/7B_sft.py): move fsdp config to parallel zero1
* feat(configs/7B_sft.py): adapt to old version config
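A hedged guess at what the relocated setting looks like; the exact field names in `configs/7B_sft.py` may differ.

```python
# Sketch: fsdp now lives under parallel.zero1 (field names assumed).
parallel = dict(
    zero1=dict(size=8, fsdp=True),
    tensor=1,
    pipeline=dict(size=1, interleaved_overlap=True),
)

def normalize_zero1(zero1):
    """Accept the old bare-int form (zero1=8) for backward compatibility."""
    return dict(size=zero1, fsdp=False) if isinstance(zero1, int) else zero1
```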
---------
Co-authored-by: huangting4201 <1538303371@qq.com>
* fix(ckpt): ckpt bug fixes and API refactor
1. fix latest ckpt query bug (see the sketch after this list)
2. add ckpt unit test
3. fix storage manager boto3/local client get_fns bug
4. fix bug where the zero fp32 buffer overwrote model weights in the model-only load case
5. add ckpt_type and a zero-reload CI test
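For item 1, a sketch of what a fixed latest-ckpt query can look like; the step-numbered directory layout is an assumption.

```python
# Sketch for item 1: pick the newest checkpoint by numeric step folder.
import os

def query_latest_ckpt(ckpt_root: str):
    steps = [int(name) for name in os.listdir(ckpt_root) if name.isdigit()]
    return os.path.join(ckpt_root, str(max(steps))) if steps else None
```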
* fix(ckpt): fix ckpt and trainer bug
* fix and refactor
* fix based on review comments
* feat: add legacy api