yingtongxiong
ed7232777a
support reduce scatter memory pool
2023-10-20 10:35:45 +08:00
Wenwen Qu
3c992a2101
fix(pipeline): fix interleave type assert and metrics error (#423)
* fix interleave type assert bug
* refactor code for assert
* fix is_no_pp_or_last_stage logic
2023-10-19 17:29:30 +08:00
jiaxingli
3ea46324dd
fix: unittest (#424)
2023-10-19 15:19:40 +08:00
yingtongxiong
4742271154
add memory pool
2023-10-19 13:21:33 +08:00
Wenwen Qu
2c5395fdfd
Doc(moe): add documentation for moe training (#411)
* add doc for moe
* fix moe and zero1 check in args_sanity_check
* restore moe config file
2023-10-19 10:01:12 +08:00
Guoteng
3ea94f2e2a
fix(utils): disable bench_net in gputest.py (#421)
2023-10-19 10:00:57 +08:00
jiaopenglong
4b5bdedff2
feat(monitor): send exception to light monitor (#420)
* send exception to light monitor
* update try_import_send_exception
2023-10-18 21:00:21 +08:00
jiaxingli
30f610b1fa
Test(pp): test pipeline parallel (#413)
* test: pp
* feat: add pp test
* test pp
* pp test
* pp test
* test pp
2023-10-18 17:53:08 +08:00
yingtongxiong
a5aeab2a3f
memory profiling test
2023-10-17 19:54:21 +08:00
Wenwen Qu
aa5e34d815
compatible with old ckpt (#418)
2023-10-17 17:25:36 +08:00
yingtongxiong
16ef7b7889
add test
2023-10-17 17:16:39 +08:00
yingtongxiong
5abe519c4c
remove full weight for block 0
2023-10-17 16:37:06 +08:00
yingtongxiong
5c38cb6409
add head overlap
2023-10-17 15:38:24 +08:00
yingtongxiong
a5c6e457b9
Merge branch 'feat/fstp' of https://github.com/yingtongxiong/InternLM into feat/fstp
2023-10-17 15:17:03 +08:00
yingtongxiong
6408b944c2
support fine-grained
2023-10-17 15:14:39 +08:00
chenxun.p
b51cf4ebc3
Merge branch 'feat/fstp' of github.com:yingtongxiong/InternLM into feat/fstp
2023-10-17 15:10:27 +08:00
chenxun.p
6682f5d92a
fix reduce scatter async bug
2023-10-17 15:10:07 +08:00
Wenwen Qu
eeef07934a
fix(moe): fix moe compatibility for fsdp and memory profiling (#417)
* fix moe compatibility for fsdp and memory profiling
* update moe config
2023-10-17 14:13:48 +08:00
huangting4201
4e99a7fdbc
feat(train/training_internlm.py): remove abnormal tgs when calculating avg tgs
2023-10-17 11:30:44 +08:00
chenxun.p
229cc5c68c
impl reduce scatter async
2023-10-17 11:15:54 +08:00
huangting4201
d1af0d6aee
feat(model/linear.py): block-grained backward
2023-10-17 10:13:56 +08:00
huangting4201
0d1fa037dd
feat(model/linear.py): set block 0 full weight
2023-10-16 20:13:59 +08:00
yingtongxiong
82204eea59
support hybrid overlap
2023-10-16 16:35:14 +08:00
Guoteng
37e0c86e5a
fix(init): allow resume_tb_folder to be an empty string (#391)
2023-10-13 03:46:14 -05:00
jiaxingli
71a0388b87
feat(storage): support volc oss ckpt saving (#397)
* feat: support volc tos
* feat: support volc oss
2023-10-13 03:44:29 -05:00
huangting4201
d0f0c22cac
feat(model/linear.py): change pre backward from wqkv to block
2023-10-13 11:10:23 +08:00
huangting4201
d0b1346993
feat(model/linear.py): support block allgather overlap
2023-10-12 19:42:08 +08:00
yingtongxiong
5fd5a8a32b
support fine-grained overlap
2023-10-11 17:36:41 +08:00
yingtongxiong
792b066f15
communication overlap
2023-10-11 10:57:12 +08:00
huangting4201
9a731b6e9b
fix(optimizer/fsdp_optimizer.py): fsdp process empty params group (#408)
Co-authored-by: huangting4201 <huangting3@sensetime.com>
2023-10-10 20:06:04 +08:00
yingtongxiong
c94be64fd2
merge origin
2023-10-10 17:13:46 +08:00
yingtongxiong
0fac845c36
overlap grad_input computation and grad_weight reduce_scatter
2023-10-10 17:06:13 +08:00
huangting4201
5fb6d99c11
feat(configs/7B_sft.py): update parallel config comment
2023-10-10 11:45:11 +08:00
yingtongxiong
db637542a6
fix lint
2023-10-09 22:19:21 +08:00
yingtongxiong
dd67ab948d
merge develop
2023-10-09 21:40:02 +08:00
yingtongxiong
1b7935dd98
merge upstream develop
2023-10-09 21:35:52 +08:00
yingtongxiong
a8dea6313f
fix the CI incompatibility in the config
2023-10-09 21:33:26 +08:00
Pryest
b3645b0244
fix(model): fix errant inference_forward (#396)
* Fix errant inference_forward.
* Recover use_dynamic_ntk_rope.
* Fix bugs.
* Fit to flash attention 1.0
* Fit to flash attention 1.0
* Fit to flash attention 1.0.5.
* Fit to flash attention 1.0.5.
2023-10-09 08:29:11 -05:00
yingtongxiong
007e58a4af
merge upstream develop
2023-10-09 20:54:26 +08:00
yingtongxiong
f191853bf4
fix lint
2023-10-09 20:39:57 +08:00
yingtongxiong
29df765f65
refactor code
2023-10-09 20:23:32 +08:00
yingtongxiong
5d39c332fe
restore train.py
2023-10-09 20:08:49 +08:00
yingtongxiong
ef9e7cc622
modify the config
2023-10-09 20:05:39 +08:00
yingtongxiong
144731c35c
fix evaluation bug in pp
2023-10-09 20:04:27 +08:00
zaglc
a075153adf
feat(train): add fsdp training option (#293)
* feat(fsdp): add training option for fsdp
* fix(fsdp): add mix-precision training
* fix failure in lint-check
* fix format problem
* restore 7B_sft
* fix load ckpt bug
* fix load ckpt bug2
* feat(solver/optimizer): add new file fsdp_optimizer.py
* fix(train.py): fix ci lint error
* fix(fsdp_optimizer.py): wait grad async
* fix bug for loading ckpts when zero1 < dp_size
* fix(context/parallel_context.py): only log warning for fsdp
* change ckpt name
* fix(model/modeling_internlm.py): fix checkpoint=False runtime error
* more wrap
* add support for FSDP with tp
* modify args_sanity_check for fsdp with pipeline and fsdp with moe
* fix(internlm/utils/parallel.py): fix circular import
* fix(internlm/train/training_internlm.py): remove set IS_TENSOR_PARALLEL attr
* fix(internlm/train/training_internlm.py): update wrap class and fix lint error
* fix(internlm/model): reset dropout_selective_checkpoint=True
* feat(configs/7B_sft.py): move fsdp config to parallel zero1
* feat(configs/7B_sft.py): adapt to old version config
---------
Co-authored-by: huangting4201 <1538303371@qq.com>
2023-10-09 18:59:31 +08:00
yingtongxiong
54e561665e
remove useless code for no-pp
2023-10-09 18:08:15 +08:00
yingtongxiong
0fa1083780
Merge remote-tracking branch 'upstream/develop' into feat/fstp
merge upstream develop
2023-10-09 18:06:57 +08:00
yingtongxiong
949431f228
modify the config
2023-10-09 18:06:22 +08:00
yingtongxiong
21c1a7fa47
support evaluation with fstp
2023-10-09 18:01:06 +08:00
Wenwen Qu
582ee000bd
feat(moe): support zero for expert local dp (#404)
* support zero for expert local dp
* fix the above code:
  * treat optim.zero_world_size and optim.zero_local_rank as lists in model_checkpoint.py and test_model_checkpoint.py
  * add overlap and zero checks for moe in args_sanity_check
2023-10-09 17:45:26 +08:00