yingtongxiong
2b28923949
remove comments
2023-12-06 10:35:40 +08:00
yingtongxiong
e6c0d7bf62
fix lint
2023-12-05 21:03:00 +08:00
yingtongxiong
62d193c763
add IS_SEQUENCE_PARALLEL check for norm module
2023-12-05 20:58:26 +08:00
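A minimal sketch of what an IS_SEQUENCE_PARALLEL check for norm modules could look like: norm-layer parameters are tagged only when sequence parallelism is enabled, so the optimizer can later all-reduce their gradients across the sequence-parallel group. The helper name, attribute name, and module set below are illustrative assumptions, not the repository's exact API.

```python
from torch import nn

IS_SEQUENCE_PARALLEL = "is_sequence_parallel"  # assumed attribute name

def mark_norm_params(model: nn.Module, sequence_parallel: bool) -> None:
    """Tag norm-layer parameters only if sequence parallelism is on (hypothetical helper)."""
    if not sequence_parallel:
        # the IS_SEQUENCE_PARALLEL check: skip tagging entirely when SP is disabled
        return
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):  # an RMSNorm class would be added here as well
            for param in module.parameters():
                setattr(param, IS_SEQUENCE_PARALLEL, True)
```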
jiaopenglong
87a3c5c374
feat(optimizer): zero gradient count ( #449 )
* add zero grad count
* fix layer norm with pp
* fix layer norm with pp
* add zero_grad_profiling option
* fix param_metrics not being a tensor
2023-10-27 16:26:55 +08:00
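A hedged sketch of the zero-gradient count idea behind this PR: when a zero_grad_profiling option is on, count the zero-valued entries in each parameter's gradient and keep the result as a tensor so later reduction/logging code can consume it (the last bullet above fixes exactly that point). The helper name and return format are assumptions.

```python
from torch import nn

def count_zero_grads(model: nn.Module, zero_grad_profiling: bool = True) -> dict:
    """Per-parameter count of zero-valued gradient entries (illustrative helper)."""
    if not zero_grad_profiling:
        return {}
    param_metrics = {}
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        # keep the metric as a tensor (not a Python int) so it can be all-reduced later
        param_metrics[name] = (param.grad == 0).sum().float()
    return param_metrics
```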
jiaopenglong
949a0a1d55
feat(optimizer): add layer norm to tensorboard ( #429 )
* add layer norm to tensorboard
* test moe layer norm
* add function: reduce grads
2023-10-23 17:07:04 +08:00
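A minimal sketch of logging layer-norm gradients to TensorBoard with a reduce step, in the spirit of the bullets above; the tag scheme and the choice of summing across ranks are assumptions, not the repository's implementation.

```python
import torch.distributed as dist
from torch import nn
from torch.utils.tensorboard import SummaryWriter

def log_norm_grads(model: nn.Module, writer: SummaryWriter, step: int) -> None:
    """Reduce layer-norm gradient norms across ranks, then write them to TensorBoard."""
    for name, param in model.named_parameters():
        if "norm" not in name or param.grad is None:
            continue
        grad_norm = param.grad.detach().norm(2)
        if dist.is_available() and dist.is_initialized():
            # reduce so every rank logs the same aggregated value
            dist.all_reduce(grad_norm, op=dist.ReduceOp.SUM)
        writer.add_scalar(f"grad_norm/{name}", grad_norm.item(), step)
```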
zaglc
a075153adf
feat(train): add fsdp training option ( #293 )
* feat(fsdp): add training option for fsdp
* fix(fsdp): add mix-precision training
* fix failure in lint-check
* fix format problem
* restore 7B_sft
* fix load ckpt bug
* fix load ckpt bug2
* feat(solver/optimizer): add new file fsdp_optimizer.py
* fix(train.py): fix ci lint error
* fix(fsdp_optimizer.py): wait grad async
* fix bug for loading ckpts when zero1 < dp_size
* fix(context/parallel_context.py): only log warning for fsdp
* change ckpt name
* fix(model/modeling_internlm.py): fix checkpoint=False runtime error
* more wrap
* add support for FSDP with tp
* modify args_sanity_check for fsdp with pipeline and fsdp with moe
* fix(internlm/utils/parallel.py): fix circular import
* fix(internlm/train/training_internlm.py): remove set IS_TENSOR_PARALLEL attr
* fix(internlm/train/training_internlm.py): update wrap class and fix lint error
* fix(internlm/model): reset dropout_selective_checkpoint=True
* feat(configs/7B_sft.py): move fsdp config to parallel zero1
* feat(configs/7B_sft.py): adapt to old version config
---------
Co-authored-by: huangting4201 <1538303371@qq.com>
2023-10-09 18:59:31 +08:00
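The last two bullets move the FSDP switch under parallel.zero1 in configs/7B_sft.py. Below is a sketch of what that section could look like; the exact key names and defaults are assumptions based on the commit titles, not a verbatim copy of the config.

```python
# configs/7B_sft.py (excerpt, assumed shape after moving fsdp into parallel.zero1)
parallel = dict(
    zero1=dict(size=-1, fsdp=True),  # size=-1: shard over the full dp group; fsdp toggles FSDP wrapping
    tensor=1,
    pipeline=dict(size=1, interleaved_overlap=True),
    sequence_parallel=False,
)
```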
Wenwen Qu
582ee000bd
feat(moe): support zero for expert local dp ( #404 )
* support zero for expert local dp
* fix the above code:
  * treat optim.zero_world_size and optim.zero_local_rank as lists in model_checkpoint.py and test_model_checkpoint.py
  * add overlap and zero checks for moe in args_sanity_check()
2023-10-09 17:45:26 +08:00
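The first sub-bullet above treats optim.zero_world_size and optim.zero_local_rank as lists, one entry per parameter group, so expert parameters can be sharded over their smaller local data-parallel group. A hedged sketch of that bookkeeping; the class and constructor below are hypothetical.

```python
import torch.distributed as dist

class ZeroGroupState:
    """Illustrative per-param-group ZeRO bookkeeping (one process group per param group)."""
    def __init__(self, process_groups):
        # e.g. [global dp group, expert local dp group, ...]
        self.zero_world_size = [dist.get_world_size(group=pg) for pg in process_groups]
        self.zero_local_rank = [dist.get_rank(group=pg) for pg in process_groups]
```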
Wenwen Qu
375240e039
feat(moe): add local data parallel support for experts ( #376 )
* add local data parallel support for experts
* fix model checkpoint for local dp mode of expert
* do not set ep size from config
2023-09-28 13:38:02 +08:00
Wenwen Qu
136d55ec30
feat(moe): add moe module ( #182 )
* feat(XXX): add moe
* reformat code
* modified: .pre-commit-config.yaml
modified: internlm/model/moe.py
modified: internlm/model/modeling_internlm.py
* modified: internlm/model/modeling_internlm.py
* modified: internlm/core/context/process_group_initializer.py
modified: internlm/core/scheduler/no_pipeline_scheduler.py
modified: internlm/solver/optimizer/hybrid_zero_optim.py
* modified: internlm/model/moe.py
modified: internlm/moe/sharded_moe.py
modified: internlm/utils/parallel.py
* rollback .pre-commit-config.yaml
* add residual and other moe features
* modify grad clipping due to moe
* add param arguments
* reformat code
* add expert data support and fix bugs
* Update .pre-commit-config.yaml
* modified: internlm/model/modeling_internlm.py
* add no-interleaved & no-overlapped moe pp support
* support zero_overlap_communication
* avoid moe parameter partition in zero optimizer
* fix the moe_loss_coeff bug
* support interleaved pp
* fix moe bugs in zero optimizer
* fix more moe bugs in zero optimizer
* fix moe bugs in zero optimizer
* add logger for moe_loss
* fix bugs with merge
* fix the pp moe bugs
* fix bug on logger
* update moe training cfg on real-dataset
* refactor code
* refactor code
* fix bugs with compute moe norm
* optimize code with moe norm computing
* fix the bug where the latent moe loss was not scaled
* refactor code
* fix moe loss logger for the interleaved pp
* change the scale position for latent moe_loss
* Update 7B_sft.py
* add support for moe checkpoint
* add comments for moe
* reformat code
* fix bugs
* fix bugs
* Update .pre-commit-config.yaml
* remove moe_loss_coeff parameter passing
* fix group_norms computing in hybrid_zero_optim
* use dummy mode to generate random numbers in model construction
* replace flashatten experts by feedforward experts
* fix bugs with _compute_norm_with_moe_group
* merge upstream/develop into feature_add_moe
* merge upstream/develop into feature_add_moe
* change float16 to bfloat16
* fix interface for dense pipeline
* refactor split_moe_group code
* fix precision inconsistency
* refactor code
* Update 7B_sft.py
* refactor code
* refactor code
* refactor code
* refactor code
* refactor code for split group
* refactor code for log
* fix logger for moe
* refactor code for split param group
* fix the moe_loss for ci and val
* refactor
* fix bugs with split group
* fix bugs in save/load moe checkpoint
* add moe module to `__init__.py`
* add compatible code for old version
* update moe config file
* modify moe config file
* fix merge bugs
* update moe config file
* change condition for compatibility
---------
Co-authored-by: zhanglei <ryancheung98@163.com>
Co-authored-by: Ryan (张磊) <leizhang.real@gmail.com>
2023-09-27 15:54:53 +08:00
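For orientation, a heavily simplified top-1 gated MoE layer is sketched below. It illustrates only the routing idea; the module added in this PR additionally covers expert (local data) parallelism, residual experts, the latent moe_loss and its scaling, checkpointing, and norm computation in the ZeRO optimizer.

```python
import torch
from torch import nn
import torch.nn.functional as F

class TinyTop1MoE(nn.Module):
    """Minimal top-1 gated mixture-of-experts block (illustrative, not the repo's MoE)."""
    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden)
        probs = F.softmax(self.gate(x), dim=-1)  # (tokens, num_experts)
        top_p, top_idx = probs.max(dim=-1)       # top-1 routing decision per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # scale each expert's output by its gate probability
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        # a real MoE also returns an auxiliary load-balancing (moe) loss here
        return out
```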
huangting4201
1f7304a8bb
feat(utils/logger.py): support uniscale logger ( #152 )
* style(internlm): fix lint error
* feat(utils/logger.py): support uniscale logger
* fix(utils/logger.py): fix import circular error
* feat(train.py): support dashboard metric panel and fix ci train config
* fix(ci_scripts/train/slurm_train.sh): fix ci train error
* fix(ci_scripts/train/torchrun.sh): fix ci train error
* fix(ci_scripts/train): restore ci update
* fix(config.json): delete alert webhook
* feat(train.py): optimize func init logger
* feat(config.json): delete config.json
---------
Co-authored-by: 黄婷 <huangting3@CN0014010744M.local>
Co-authored-by: huangting.p <huangting@sensetime.com>
2023-08-01 17:37:32 +08:00
Sun Peng
fa7337b37b
initial commit
2023-07-06 12:55:23 +08:00