Commit Graph

28 Commits (ebf294274604fba08d164481666bf4591f564235)

Author SHA1 Message Date
Guoteng 5ecb6aa712
fix(pp): fix no-packed dataset load micro batch error (#538)
* fix(pp): fix no-packed dataset load micro batch error

* fix based on comment
2023-12-13 14:48:32 +08:00
Guoteng 81ffb3d824
fix(test): fix type_ids unpack bug (#530) 2023-12-07 18:47:19 +08:00
jiaxingli 2dbbab7418
fix test_checkpoint (#526) 2023-12-04 15:38:13 +08:00
jiaxingli 1738bee002
feat(storage): use multipart upload when using oss (#520)
* multipart upload

* upload

* storage

* storage

* storage

* storage
2023-12-01 17:05:58 +08:00
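
The multipart change in #520 uploads large checkpoint files in fixed-size parts so a failed part can be retried without resending the whole object. A minimal sketch of that pattern with the Aliyun `oss2` SDK (endpoint, bucket, and part size below are illustrative, not the repo's actual values):

```python
import oss2

# Hypothetical credentials/endpoint; not taken from the repo.
auth = oss2.Auth("ACCESS_KEY_ID", "ACCESS_KEY_SECRET")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-ckpt-bucket")

def multipart_upload(local_path: str, key: str, part_size: int = 100 * 1024 * 1024):
    """Upload `local_path` to `key` in fixed-size parts, then commit them."""
    upload_id = bucket.init_multipart_upload(key).upload_id
    parts = []
    with open(local_path, "rb") as f:
        part_number = 1
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            result = bucket.upload_part(key, upload_id, part_number, chunk)
            parts.append(oss2.models.PartInfo(part_number, result.etag))
            part_number += 1
    bucket.complete_multipart_upload(key, upload_id, parts)
```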
Guoteng b3be333aa2
fix(ci): fix test model ckpt ci test (#518) 2023-11-30 19:16:35 +08:00
Guoteng 757e19e01a
1. fix(config): rampup_batch_size default value BC. (#515)
2. fix(config): standardize config parameter access.
3. feat(launch): add warmup_process_group
4. feat(memory): add cuda_memory_analyze
2023-11-28 19:33:46 +08:00
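
The `rampup_batch_size` work in #493/#515 grows the effective batch size over early training steps. A hedged sketch of the schedule arithmetic — `start`, `increment`, and `interval` are illustrative names, not the repo's actual config fields:

```python
def batch_size_at(step: int, start: int, target: int, increment: int, interval: int) -> int:
    """Linear batch-size ramp-up: begin at `start` and add `increment`
    every `interval` steps until `target` is reached."""
    ramped = start + (step // interval) * increment
    return min(ramped, target)

# Example: ramp 4 -> 32 in steps of 4, growing every 100 steps.
assert batch_size_at(0, 4, 32, 4, 100) == 4
assert batch_size_at(250, 4, 32, 4, 100) == 12
assert batch_size_at(10_000, 4, 32, 4, 100) == 32
```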
jiaxingli 06e8301861
name (#514) 2023-11-24 18:24:54 +08:00
jiaxingli b59641715a
Feat(QA): Check loss when swapping micro_num and micro_bsz && Check grad norm (#510)
* unitest_only_forward

* memory_test

* doc fix

* doc fix

* check loss

* check grad norm

* check grad norm
2023-11-24 12:05:14 +08:00
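
The check in #510 leans on the identity tokens-per-step = micro_num × micro_bsz × seq_len: swapping `micro_num` and `micro_bsz` keeps the token count constant, so the loss curves should agree closely. A sketch of such an invariance test (the `train_steps` helper and tolerance are hypothetical):

```python
import math

def check_loss_swap(train_steps, micro_num=4, micro_bsz=2, steps=20, rtol=1e-3):
    """`train_steps` is assumed to run `steps` iterations with the given
    micro settings and return the final loss; both runs consume the same
    number of tokens per step, so their losses should nearly match."""
    loss_a = train_steps(micro_num=micro_num, micro_bsz=micro_bsz, steps=steps)
    loss_b = train_steps(micro_num=micro_bsz, micro_bsz=micro_num, steps=steps)
    assert math.isclose(loss_a, loss_b, rel_tol=rtol), (loss_a, loss_b)
```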
Guoteng 0bfc86205e
feat(train): support_rampup_batch_size and fix bugs (#493) 2023-11-16 19:51:01 +08:00
jiaxingli 4a6987d5e7
unitest_only_forward (#484) 2023-11-16 15:30:57 +08:00
jiaxingli e8cf27b8c0
Feat(QA): Check init model weights (#502)
* check_init

* check_init

* check_init

* check_init
2023-11-16 11:03:19 +08:00
jiaxingli bd7e501b69
Feat(QA): Check model weights for acc (#476)
* check_weights

* check_weights
2023-11-09 16:16:29 +08:00
jiaxingli 4995060d84
feat(storage): support ali oss ckpt saving (#439) 2023-10-27 22:32:10 +08:00
jiaxingli 3ea46324dd
fix: unit test (#424) 2023-10-19 15:19:40 +08:00
jiaxingli 30f610b1fa
Test(pp): test pipeline parallel (#413)
* test: pp

* feat: add pp test

* test pp

* pp test

* pp test

* test pp
2023-10-18 17:53:08 +08:00
jiaxingli 71a0388b87
feat(storage): support volc oss ckpt saving (#397)
* feat: support volc tos

* feat: support volc oss
2023-10-13 03:44:29 -05:00
Wenwen Qu 582ee000bd
feat(moe): support zero for expert local dp (#404)
* support zero for expert local dp

* fix the above code:
    * treat optim.zero_world_size and optim.zero_local_rank as lists in model_checkpoint.py and test_model_checkpoint.py
    * add overlap and zero checks for moe in args_sanity_check()
2023-10-09 17:45:26 +08:00
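
Treating `optim.zero_world_size` and `optim.zero_local_rank` as lists makes sense once expert parameters shard over a smaller group than dense parameters. A hypothetical sketch of that per-param-group bookkeeping (the names and the `moe` group flag are illustrative, not the repo's actual structures):

```python
from dataclasses import dataclass

@dataclass
class GroupInfo:
    size: int  # number of ranks sharing this ZeRO shard group
    rank: int  # this process's rank within that group

def zero_meta_per_param_group(param_groups, dense_dp: GroupInfo, expert_dp: GroupInfo):
    """Expert params shard only across the expert-local DP group, dense params
    across the full DP group, so zero_world_size and zero_local_rank become
    per-param-group lists instead of scalars."""
    zero_world_size, zero_local_rank = [], []
    for group in param_groups:
        info = expert_dp if group.get("moe") else dense_dp
        zero_world_size.append(info.size)
        zero_local_rank.append(info.rank)
    return zero_world_size, zero_local_rank
```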
Wenwen Qu 136d55ec30
feat(moe): add moe module (#182)
* feat(XXX): add moe

* reformat code

* modified:   .pre-commit-config.yaml
	modified:   internlm/model/moe.py
	modified:   internlm/model/modeling_internlm.py

* modified:   internlm/model/modeling_internlm.py

* modified:   internlm/core/context/process_group_initializer.py
	modified:   internlm/core/scheduler/no_pipeline_scheduler.py
	modified:   internlm/solver/optimizer/hybrid_zero_optim.py

* modified:   internlm/model/moe.py
	modified:   internlm/moe/sharded_moe.py
	modified:   internlm/utils/parallel.py

* rollback .pre-commit-config.yaml

* add residual and other moe features

* modify grad clipping due to moe

* add param arguments

* reformat code

* add expert data support and fix bugs

* Update .pre-commit-config.yaml

* modified:   internlm/model/modeling_internlm.py

* add no-interleaved & no-overlapped moe pp support

* support zero_overlap_communication

* avoid moe parameter partition in zero optimizer

* fix the moe_loss_coeff bug

* support interleaved pp

* fix moe bugs in zero optimizer

* fix more moe bugs in zero optimizer

* fix moe bugs in zero optimizer

* add logger for moe_loss

* fix bugs with merge

* fix the pp moe bugs

* fix bug on logger

* update moe training cfg on real-dataset

* refactor code

* refactor code

* fix bugs with compute moe norm

* optimize code with moe norm computing

* fix the bug that missing scale the latent moe loss

* refactor code

* fix moe loss logger for the interleaved pp

* change the scale position for latent moe_loss

* Update 7B_sft.py

* add support for moe checkpoint

* add comments for moe

* reformat code

* fix bugs

* fix bugs

* Update .pre-commit-config.yaml

* remove moe_loss_coeff parameter passing

* fix group_norms computing in hybrid_zero_optim

* use dummy mode to generate random numbers in model construction

* replace flashatten experts by feedforward experts

* fix bugs with _compute_norm_with_moe_group

* merge upstream/develop into feature_add_moe

* merge upstream/develop into feature_add_moe

* change float16 to bfloat16

* fix interface for dense pipeline

* refactor split_moe_group code

* fix precision inconsistency

* refactor code

* Update 7B_sft.py

* refactor code

* refactor code

* refactor code

* refactor code

* refactor code for split group

* refactor code for log

* fix logger for moe

* refactor code for split param group

* fix the moe_loss for ci and val

* refactor

* fix bugs with split group

* fix bugs in save/load moe checkpoint

* add moe module to `__init__.py`

* add compatible code for old version

* update moe config file

* modify moe config file

* fix merge bugs

* update moe config file

* change condition for compatibility

---------

Co-authored-by: zhanglei <ryancheung98@163.com>
Co-authored-by: Ryan (张磊) <leizhang.real@gmail.com>
2023-09-27 15:54:53 +08:00
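
Several bullets in #182 concern scaling and logging `moe_loss`. The core pattern, sketched under the assumption of a standard auxiliary load-balancing loss (only the name `moe_loss_coeff` comes from the log; the rest is illustrative):

```python
import torch

def total_loss(ce_loss: torch.Tensor, moe_losses: list[torch.Tensor],
               moe_loss_coeff: float = 0.01) -> torch.Tensor:
    """Combine the language-model loss with per-layer MoE auxiliary
    (load-balancing) losses; the coefficient keeps the gating term
    from dominating the objective."""
    moe_loss = torch.stack(moe_losses).sum()
    return ce_loss + moe_loss_coeff * moe_loss
```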
Wenwen Qu 655e9dae40
Feat(norm)/support fused precision (#319)
* add fused precision support for norm

* refactor code

* refactor code

* change the granularity of hook

* fix bugs if self.model is ModuleList

* add dtype condition for post hook

* refactor code for split group

* refactor code for pre/post hook

* refactor code for split group

* remove fp32 hook for norm

* unit tests for fused precision

* add doc for fused precision

* add doc for En. version

* reformat docs

* Update mixed_precision.rst

* Update mixed_precision.po

* update mixed_precision.po
2023-09-26 20:39:55 +08:00
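
The fused-precision feature in #319 keeps the numerically sensitive norm layers in fp32 while the rest of the model runs in lower precision, via pre/post forward hooks — which matches the "granularity of hook" and "dtype condition for post hook" bullets. A sketch of that hook pattern (the module and dtype choices are assumptions):

```python
import torch
from torch import nn

def apply_fp32_norm_hooks(model: nn.Module, compute_dtype=torch.bfloat16):
    """Cast inputs of every norm module up to fp32 before forward and
    cast outputs back to the compute dtype afterwards."""
    def pre_hook(module, args):
        return tuple(a.float() if torch.is_tensor(a) else a for a in args)

    def post_hook(module, args, output):
        # Only downcast if the output actually came out in fp32.
        if torch.is_tensor(output) and output.dtype == torch.float32:
            return output.to(compute_dtype)
        return output

    for m in model.modules():
        if isinstance(m, nn.LayerNorm):
            m.float()  # keep the norm's own weights in fp32
            m.register_forward_pre_hook(pre_hook)
            m.register_forward_hook(post_hook)
```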
Guoteng 26a7397752
fix(storage): fix try_get_storage_backend (#359)
* fix(storage): fix try_get_storage_backend

* fix typo and print infos only in log rank

* fix typo and print infos only in log rank

---------

Co-authored-by: gaoyang07 <Gary1546308416AL@gmail.com>
2023-09-25 15:16:25 +08:00
huangting4201 1ed36754df
feat(.github/workflows): update ci e2e tests and add ci unit tests (#324)
* feat(.github/workflows/e2e_test.yaml): update e2e yaml

* feat(.github/workflows/e2e_test.yaml): update e2e yaml

* test e2e

* test e2e

* test e2e

* test e2e

* test e2e

* fix(ci): test ci

* fix(ci): test ci

* fix(ci): test ci

* fix(ci): test ci

* fix(ci): test ci

* fix(ci): add weekly tests

---------

Co-authored-by: huangting4201 <huangting3@sensetime.com>
2023-09-22 14:07:14 +08:00
huangting4201 025ca55dfe
test(tests/test_training): add training e2e tests for loss spike and loss accuracy (#304)
* tests(test_training): add test case for loss accuracy

* tests(test_training): update test cases

* ci(.github/workflows/e2e_test.yaml): remove pull submodule

* ci(.github/workflows/e2e_test.yaml): update ci env and remove useless env var

* test(tests/test_training): add 16 GPUs test cases

* test(tests/test_training): fix training_16GPU_8DP2PP test case error

* test(tests/test_training): add new case for interleaved pp

* test(tests/test_training): remove redundant code

* test(tests/test_training): update ci job timeout minutes to 30m

* feat(initialize/launch.py): check num_chunks and interleaved_overlap

---------

Co-authored-by: huangting4201 <huangting3@sensetime.com>
2023-09-19 14:55:40 +08:00
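
The e2e tests in #304 guard against loss spikes and loss-accuracy drift. At its simplest, such a check compares each step's loss to a recorded baseline within a tolerance — a sketch with made-up baseline numbers:

```python
BASELINE_LOSS = [11.64, 7.98, 6.52, 5.47, 5.05]  # hypothetical reference curve
SPIKE_FACTOR = 1.5

def check_loss_curve(losses):
    for step, (got, ref) in enumerate(zip(losses, BASELINE_LOSS)):
        # Accuracy: stay close to the recorded baseline.
        assert abs(got - ref) / ref < 0.05, f"step {step}: {got} vs {ref}"
        # Spike: never jump far above the previous step's loss.
        if step > 0:
            assert got < losses[step - 1] * SPIKE_FACTOR, f"loss spike at step {step}"
```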
jiaxingli ab513e1ddd
feat: add optimizer_unitest (#303)
* feat: add optimizer_unitest

* feat: add optimizer test

* feat: add optimizer test

* feat: add optimizer test

* final change

* feat: add optimizer test

* feat: add optimizer test

* feat: add optimizer test
2023-09-15 18:56:56 +08:00
jiaxingli 882a07011c
feat: add unitest for model (#300)
* feat: add unitest for model

* feat: add model test
2023-09-14 13:18:34 +08:00
Guoteng 85e39aae67
fix(ckpt): fix snapshot none load error and remove file lock (#298) 2023-09-08 20:41:53 +08:00
Guoteng 37b8c6684e
feat(utils): add timeout wrapper for key functions (#286) 2023-09-07 17:26:17 +08:00
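
A timeout wrapper like the one in #286 typically guards blocking calls (storage I/O, collectives) so a hung call fails fast instead of stalling training. A minimal standard-library sketch; the decorator name and default are assumptions, not the repo's API:

```python
import functools
from concurrent.futures import ThreadPoolExecutor

def with_timeout(seconds: float = 300.0):
    """Raise concurrent.futures.TimeoutError if the wrapped call does not
    finish in time. Note: the worker thread is abandoned, not killed."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            pool = ThreadPoolExecutor(max_workers=1)
            try:
                return pool.submit(fn, *args, **kwargs).result(timeout=seconds)
            finally:
                pool.shutdown(wait=False)
        return wrapper
    return decorator

@with_timeout(seconds=60.0)
def save_checkpoint(path: str) -> None:
    ...  # blocking storage call goes here
```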
Guoteng 8acf823a04
fix(storage): fix and refactor storage api (#281) 2023-09-06 01:15:09 +08:00
Guoteng f6e007f95b
feat(ckpt): fix checkpoint bugs and add feature enhancements. (#259)
* fix(ckpt): ckpt bug fix and api refactor
1. fix latest ckpt query bug
2. add ckpt unit test
3. fix storage manager boto3/local client get_fns bug
4. fix a bug where the zero fp32 buffer overwrote model weights in the model-only load case.
5. add ckpt_type and add zero reload ci-test

* fix(ckpt): fix ckpt and trainer bug

* fix and refactor

* fix based on comment

* feat: add legacy api
2023-09-05 17:40:48 +08:00
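
On the "latest ckpt query bug" in #259: one classic failure mode is sorting step directories as strings, which ranks "99" above "100". A hypothetical reconstruction of the numeric fix (not necessarily the actual bug fixed here):

```python
import re

def query_latest_ckpt(step_dirs: list[str]) -> str:
    """Pick the snapshot with the largest *numeric* step; a plain string
    sort would rank 'ckpt/99' above 'ckpt/100'."""
    def step_of(d: str) -> int:
        m = re.search(r"(\d+)/?$", d)
        return int(m.group(1)) if m else -1
    return max(step_dirs, key=step_of)

assert query_latest_ckpt(["ckpt/99", "ckpt/100", "ckpt/40"]) == "ckpt/100"
```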