InternLM

Commit Graph

Author	SHA1	Message	Date
zhanglei	ccdaf8ec45	fix the moe_loss for ci and val	2023-09-22 15:45:36 +08:00
zhanglei	3df0a51555	fix logger for moe	2023-09-22 14:56:43 +08:00
zhanglei	36d1bd2f41	Merge branch 'develop' of github.com:InternLM/InternLM into feature_add_moe	2023-09-22 14:36:58 +08:00
Ryan (张磊)	aa7645a831	Merge pull request #4 from blankde/feature_add_moe_refactor_zl refactor code	2023-09-22 14:22:45 +08:00
Wenwen Qu	9e6e7986b6	refactor code for log	2023-09-22 14:14:58 +08:00
jiaxingli	f5337f6e02	Feat(PythonGC): Do garbage collection manually (#326 ) * feat:add gc control * feat:add gc control * feat:add gc control * feat:add gc * re-lint	2023-09-22 13:52:25 +08:00
Wenwen Qu	3607548265	refactor code for split group	2023-09-22 13:00:26 +08:00
zhanglei	548d1bd7af	refactor code	2023-09-22 12:30:02 +08:00
zhanglei	80972ff314	refactor code	2023-09-22 11:47:05 +08:00
Qu Wenwen	17bc5f562b	refactor code	2023-09-21 15:00:28 +08:00
huangting4201	3b0eff0c8a	fix(model/embedding.py): ci lint check error (#345 ) * fix(ci): fix ci lint error * fix(ci): fix ci lint error	2023-09-21 14:46:22 +08:00
Qu Wenwen	9665321745	refactor code	2023-09-21 11:51:17 +08:00
YWMditto	8464425a7b	feat(mdoel): add DynamicNTKScalingRotaryEmbedding (#339 ) * add dynamic ntk rope * update dynamic ntk rope * fix lint check * fix lint check * add more desc --------- Co-authored-by: YWMditto <862779238@qq.com>	2023-09-20 23:31:47 +08:00
Qu Wenwen	b7ddc42dcd	merge Internlm/develop into feature_add_moe	2023-09-19 17:44:12 +08:00
ytxiong	6a5915bf0d	feat(linear): optimize mlp by using jit (#321 ) * fuse silu op * refactor code * fix lint * fix lint	2023-09-19 14:57:43 +08:00
huangting4201	025ca55dfe	test(tests/test_training): add training e2e tests for loss spike and loss accuracy (#304 ) * tests(test_training): add test case for loss accuracy * tests(test_training): update test cases * ci(.github/workflows/e2e_test.yaml): remove pull submodule * ci(.github/workflows/e2e_test.yaml): update ci env and remove useless env var * test(tests/test_training): add 16 GPUs test cases * test(tests/test_training): fix training_16GPU_8DP2PP test case error * test(tests/test_training): add new case for interleaved pp * test(tests/test_training): remove redundant code * test(tests/test_training): update ci job timeout minutes to 30m * feat(initialize/launch.py): check num_chunks and interleaved_overlap --------- Co-authored-by: huangting4201 <huangting3@sensetime.com>	2023-09-19 14:55:40 +08:00
Qu Wenwen	0af5175073	merge internlm/develop into feature_add_moe	2023-09-19 13:27:43 +08:00
Qu Wenwen	4a47872382	refactor code	2023-09-19 12:30:40 +08:00
zhanglei	edc18bcddd	fix precision inconsistency	2023-09-18 20:54:52 +08:00
jiaxingli	794a484666	feat: more tgs (#310 ) * feat:more tgs * feat:add more tgs * feat:more tgs	2023-09-15 18:56:11 +08:00
Qu Wenwen	5aa5c96ec8	refactor split_moe_group code	2023-09-15 16:55:16 +08:00
huangting4201	607f691e16	Merge main to develop (#312 ) * fix(chat): fix stream_chat to return generator (#123) * fix(configs/7B_sft.py): model dtype float16 to bfloat16 (#302) * fix(convert2hf.py): fix the rotary_emb.inv_freq KeyError (#299) * docs(doc/code-docs): update quickstart usage (#301) * docs(usage.md): update usage.md * docs(doc/code-docs): update en usage --------- Co-authored-by: huangting4201 <huangting3@sensetime.com> * docs(doc/code-docs): update en usage --------- Co-authored-by: yingtongxiong <974106207@qq.com> Co-authored-by: zhjunqin <zhjunqin@users.noreply.github.com> Co-authored-by: jiangtann <39088437+jiangtann@users.noreply.github.com> Co-authored-by: huangting4201 <huangting3@sensetime.com>	2023-09-15 16:19:26 +08:00
Wenwen Qu	d13f5d3048	Merge branch 'feature_add_moe' of https://github.com/blankde/InternLM into feature_add_moe	2023-09-15 12:26:42 +08:00
Wenwen Qu	462b849942	fix interface for dense pipeline	2023-09-15 12:12:45 +08:00
zhanglei	d218a62b79	Merge branch 'develop' of github.com:InternLM/InternLM into feature_add_moe Conflicts: internlm/core/context/parallel_context.py internlm/core/context/process_group_initializer.py internlm/model/modeling_internlm.py internlm/solver/optimizer/hybrid_zero_optim.py internlm/train/training_internlm.py internlm/utils/model_checkpoint.py train.py	2023-09-12 18:04:48 +08:00
Wenwen Qu	8a595837fc	merge upstream/develop into feature_add_moe	2023-09-11 16:20:08 +08:00
Guoteng	85e39aae67	fix(ckpt): fix snapshot none load error and remove file lock (#298 )	2023-09-08 20:41:53 +08:00
Wenwen Qu	b10e5132fe	fix bugs with _compute_norm_with_moe_group	2023-09-08 18:09:13 +08:00
Wenwen Qu	6cf0fec314	replace flashatten experts by feedforward experts	2023-09-08 18:04:57 +08:00
Sun Peng	1ee31ff9b1	feat: add runtime diag (#297 ) * feat: add runtime diag * add diag_outlier_ratio --------- Co-authored-by: yingtongxiong <974106207@qq.com>	2023-09-08 17:56:46 +08:00
Wenwen Qu	cd6b28b073	use dummy mode to generate random numbers in model construction	2023-09-08 17:56:42 +08:00
Sun Peng	0423426c4c	fix: fix the bug to do bcast in a stream (#294 ) * fix: fix the bug to do bcast in a stream * fix: fix the bug to do bcast in a stream --------- Co-authored-by: yingtongxiong <974106207@qq.com>	2023-09-08 13:53:40 +08:00
yingtongxiong	0c276d8de2	Merge remote-tracking branch 'origin/main' into develop	2023-09-08 10:19:54 +08:00
Sun Peng	b7a8af8133	Feat/sync grad use async op (#277 ) * fix/brocast should not in commu stream * fix/brocast should not in commu stream * feat: support allreduce grad using async op * fix bug of async op * use reduceop.avg * use torch flat * delete unused stream * delete unused stream * feat: overap allreduce with memcapy --------- Co-authored-by: yingtongxiong <974106207@qq.com>	2023-09-07 21:51:30 +08:00
jiaopenglong	7c99e01ca7	fix(monitor): add alert switch and refactor monitor config (#285 ) * add monitor switch * add switch to light monitor * fix alert_address is empty * fix light monitor heartbeat * init light_monitor on rank_log only * add comments to the monitoring config * optimize config	2023-09-07 21:49:05 +08:00
Guoteng	37b8c6684e	feat(utils): add timeout warpper for key functions (#286 )	2023-09-07 17:26:17 +08:00
Season	b6d909d43e	docs(): add documentation and reST files for readthedocs (#272 ) add initial reST files for readthedocs * fix typos * docs refine and minor fix * add references for parallel training section * fix reST format * fix reST format * fix reST format * add comments for trainer API * add link to step-by-step quickstart guide * docs(code-docs/source/parallel.rst): add paper link url * docs(code-docs/source/parallel.rst): add paper link url * use MyST to render markdown * docs(code-docs/source/initialize.rst): update model init * add requirements for myst-parser * reuse install and usage markdown * docs(code-docs/source/index.rst): add example and q&a * docs(doc/code-docs/): docs refine docs(code-docs/source/parallel.rst): update docs for zero config * docs(code-docs/source/example.rst): fix typos for example.rst * docs(code-docs/source/example.rst): refine docs * docs(code-docs/source/example): update example * docs(code-docs/source/example): delete useless example * docs(code-docs/source/): fix image display issue docs(code-docs/source/parallel.rst): add docs for communication overlap * docs(code-docs/source/conf.py): update conf.py * docs(code-docs/source/example): update example 30B demo * docs(code-docs/source/parallel.rst): update pipeline parallel * docs(code-docs/source/parallel.rst): update pipeline parallel * docs(code-docs/source/parallel.rst): update pipeline parallel * docs(code-docs/source/parallel.rst): update pipeline parallel * docs(code-docs/source/parallel.rst): update ZeRO1.5 * docs(code-docs/source/parallel.rst): update ZeRO1.5 * docs(code-docs/source): fix word spelling error --------- Co-authored-by: huangting4201 <huangting3@sensetime.com>	2023-09-06 15:36:03 +08:00
Wenwen Qu	7f687bf4b3	fix(core/context): use dummy mode to generate random numbers in model construction (#266 ) * change mode to dummy in model construction and restore to data when done * add comments * move set_mode(.DATA) to initialize_model(.)	2023-09-06 14:34:11 +08:00
Guoteng	ff181bc5f8	fix(ckpt): fix checkpoint reload bug (#282 ) 1. fix only_load tuple convert bug. 2. fix reload_zero_fp32_buff copy bug	2023-09-06 04:05:04 +08:00
Guoteng	8acf823a04	fix(storage): fix and refactor storage api (#281 )	2023-09-06 01:15:09 +08:00
jiaopenglong	8d8d811e10	feat(monitor): add light monitor (#275 ) * add light monitor * filter key of metrics dict * test no light_monitor case * mv init_light_monitor to initialize_distributed_env	2023-09-05 19:24:01 +08:00
ytxiong	9445faf5be	fix(model): set tensor parallel attribute for mlp (#271 ) * set is_tensor_parallel attribute for mlp * fix lint	2023-09-05 19:03:02 +08:00
yingtongxiong	0fb8d4141f	Merge remote-tracking branch 'origin/main' into develop	2023-09-05 17:50:35 +08:00
Sun Peng	7f61505fa0	fix/broadcast should not in commu stream (#276 ) * fix/brocast should not in commu stream * fix/brocast should not in commu stream --------- Co-authored-by: yingtongxiong <974106207@qq.com>	2023-09-05 17:47:50 +08:00
yingtongxiong	3f07d414e7	Merge branch 'develop' of github.com:InternLM/InternLM into develop	2023-09-05 17:46:27 +08:00
Guoteng	f6e007f95b	feat(ckpt): fix checkpoint bugs and add feature enhancements. (#259 ) * fix(ckpt): ckpt bug fix and api refactor 1. fix latest ckpt query bug 2. add ckpt unit test 3. fix storage manager boto3/local client get_fns bug 4. fix only model load case zero fp32 buffer overwrite model weights bug. 5. add ckpt_type and add zero reload ci-test * fix(ckpt): fix ckpt and trainer bug * fix and refactor * fix base on comment * feat: add legacy api	2023-09-05 17:40:48 +08:00
Shuo Zhang	5238f15e2d	fix(eval): no need to check length of valid_dl when using streaming dataset (#274 ) * fix(eval): StreamingDataset does not have an __len__ method. * fix(eval): StreamingDataset has no len method	2023-09-04 23:14:07 +08:00
Sun Peng	860de0aa46	Feat/add runntime gpu test (#254 ) * feat: add gpu bench * feat/add allreduce runtime bench --------- Co-authored-by: sunpengsdu <sunpengsdu@gmail.com>	2023-09-01 13:38:01 +08:00
huangting4201	b9202b12bc	feat(utils/writer.py): support writer add_scalars for writing dict data (#257 ) * feat(utils/writer.py): support writer add_scalars interface for writing dict data * feat(hybrid_zero_optim.py): change grad_norm_groups list to dict	2023-09-01 13:24:46 +08:00
Pryest	f79586b0c6	feat(model): implement uniform_init for tensor. (#252 ) * Implement uniform_init for tensor. * Fix functinal calling bugs: normal->uniform. * Format editting: remove unused torch importing.	2023-09-01 01:12:53 +08:00

152 Commits (ccdaf8ec45ad45dffb23e37a4cdf89d3b4842469)