yingtongxiong
ed7232777a
support reduce scatter memory pool
2023-10-20 10:35:45 +08:00
Wenwen Qu
3c992a2101
fix(pipeline): fix interleave type assert and metrics error (#423)
* fix interleave type assert bug
* refactor code for assert
* fix is_no_pp_or_last_stage logic
2023-10-19 17:29:30 +08:00
jiaxingli
3ea46324dd
fix: unittest (#424)
2023-10-19 15:19:40 +08:00
yingtongxiong
4742271154
add memory pool
2023-10-19 13:21:33 +08:00
Wenwen Qu
2c5395fdfd
Doc(moe): add documentation for moe training (#411)
* add doc for moe
* fix moe and zero1 check in args_sanity_check
* restore moe config file
2023-10-19 10:01:12 +08:00
Guoteng
3ea94f2e2a
fix(utils): disable bench_net in gputest.py (#421)
2023-10-19 10:00:57 +08:00
jiaopenglong
4b5bdedff2
feat(monitor): send exception to light monitor (#420)
* send exception to light monitor
* update try_import_send_exception
2023-10-18 21:00:21 +08:00
jiaxingli
30f610b1fa
Test(pp): test pipeline parallel (#413)
* test: pp
* feat: add pp test
* test pp
* pp test
* pp test
* test pp
2023-10-18 17:53:08 +08:00
yingtongxiong
a5aeab2a3f
memory profiling test
2023-10-17 19:54:21 +08:00
Wenwen Qu
aa5e34d815
compatible with old ckpt (#418)
2023-10-17 17:25:36 +08:00
yingtongxiong
16ef7b7889
add test
2023-10-17 17:16:39 +08:00
yingtongxiong
5abe519c4c
remove full weight for block 0
2023-10-17 16:37:06 +08:00
yingtongxiong
5c38cb6409
add head overlap
2023-10-17 15:38:24 +08:00
yingtongxiong
a5c6e457b9
Merge branch 'feat/fstp' of https://github.com/yingtongxiong/InternLM into feat/fstp
2023-10-17 15:17:03 +08:00
yingtongxiong
6408b944c2
support fine-grained
2023-10-17 15:14:39 +08:00
chenxun.p
b51cf4ebc3
Merge branch 'feat/fstp' of github.com:yingtongxiong/InternLM into feat/fstp
2023-10-17 15:10:27 +08:00
chenxun.p
6682f5d92a
fix reduce scatter async bug
2023-10-17 15:10:07 +08:00
Wenwen Qu
eeef07934a
fix(moe): fix moe compatibility for fsdp and memory profiling (#417)
* fix moe compatibility for fsdp and memory profiling
* update moe config
2023-10-17 14:13:48 +08:00
huangting4201
4e99a7fdbc
feat(train/training_internlm.py): remove abnormal tgs when calculating avg tgs
2023-10-17 11:30:44 +08:00
chenxun.p
229cc5c68c
impl reduce scatter async
2023-10-17 11:15:54 +08:00
huangting4201
d1af0d6aee
feat(model/linear.py): block-grained backward
2023-10-17 10:13:56 +08:00
huangting4201
0d1fa037dd
feat(model/linear.py): set block 0 full weight
2023-10-16 20:13:59 +08:00
yingtongxiong
82204eea59
support hybrid overlap
2023-10-16 16:35:14 +08:00
Guoteng
37e0c86e5a
fix(init): allow resume_tb_folder to be an empty string (#391)
2023-10-13 03:46:14 -05:00
jiaxingli
71a0388b87
feat(storage): support volc oss ckpt saving (#397)
* feat: support volc tos
* feat: support volc oss
2023-10-13 03:44:29 -05:00
huangting4201
d0f0c22cac
feat(model/linear.py): change pre backward from wqkv to block
2023-10-13 11:10:23 +08:00
huangting4201
d0b1346993
feat(model/linear.py): support block allgather overlap
2023-10-12 19:42:08 +08:00
yingtongxiong
5fd5a8a32b
support fine-grained overlap
2023-10-11 17:36:41 +08:00
yingtongxiong
792b066f15
communication overlap
2023-10-11 10:57:12 +08:00
huangting4201
9a731b6e9b
fix(optimizer/fsdp_optimizer.py): fsdp process empty params group (#408)
Co-authored-by: huangting4201 <huangting3@sensetime.com>
2023-10-10 20:06:04 +08:00
yingtongxiong
c94be64fd2
merge origin
2023-10-10 17:13:46 +08:00
yingtongxiong
0fac845c36
overlap grad_input computation and grad_weight reduce_scatter
2023-10-10 17:06:13 +08:00
huangting4201
5fb6d99c11
feat(configs/7B_sft.py): update parallel config comment
2023-10-10 11:45:11 +08:00
yingtongxiong
db637542a6
fix lint
2023-10-09 22:19:21 +08:00
yingtongxiong
dd67ab948d
merge develop
2023-10-09 21:40:02 +08:00
yingtongxiong
1b7935dd98
merge upstream develop
2023-10-09 21:35:52 +08:00
yingtongxiong
a8dea6313f
fix the CI incompatibility in the config
2023-10-09 21:33:26 +08:00
Pryest
b3645b0244
fix(model): fix errant inference_forward (#396)
* Fix errant inference_forward.
* Recover use_dynamic_ntk_rope.
* Fix bugs.
* Fit to flash attention 1.0
* Fit to flash attention 1.0
* Fit to flash attention 1.0.5.
* Fit to flash attention 1.0.5.
2023-10-09 08:29:11 -05:00
yingtongxiong
007e58a4af
merge upstream develop
2023-10-09 20:54:26 +08:00
yingtongxiong
f191853bf4
fix lint
2023-10-09 20:39:57 +08:00
yingtongxiong
29df765f65
refactor code
2023-10-09 20:23:32 +08:00
yingtongxiong
5d39c332fe
restore train.py
2023-10-09 20:08:49 +08:00
yingtongxiong
ef9e7cc622
modify the config
2023-10-09 20:05:39 +08:00
yingtongxiong
144731c35c
fix evaluation bug in pp
2023-10-09 20:04:27 +08:00
zaglc
a075153adf
feat(train): add fsdp training option (#293)
* feat(fsdp): add training option for fsdp
* fix(fsdp): add mix-precision training
* fix failure in lint-check
* fix format problem
* restore 7B_sft
* fix load ckpt bug
* fix load ckpt bug2
* feat(solver/optimizer): add new file fsdp_optimizer.py
* fix(train.py): fix ci lint error
* fix(fsdp_optimizer.py): wait grad async
* fix bug for loading ckpts when zero1 < dp_size
* fix(context/parallel_context.py): only log warning for fsdp
* change ckpt name
* fix(model/modeling_internlm.py): fix checkpoint=False runtime error
* more wrap
* add support for FSDP with tp
* modify args_sanity_check for fsdp with pipeline and fsdp with moe
* fix(internlm/utils/parallel.py): fix circular import
* fix(internlm/train/training_internlm.py): remove set IS_TENSOR_PARALLEL attr
* fix(internlm/train/training_internlm.py): update wrap class and fix lint error
* fix(internlm/model): reset dropout_selective_checkpoint=True
* feat(configs/7B_sft.py): move fsdp config to parallel zero1
* feat(configs/7B_sft.py): adapt to old version config
---------
Co-authored-by: huangting4201 <1538303371@qq.com>
2023-10-09 18:59:31 +08:00
yingtongxiong
54e561665e
remove useless code for no-pp
2023-10-09 18:08:15 +08:00
yingtongxiong
0fa1083780
Merge remote-tracking branch 'upstream/develop' into feat/fstp
merge upstream develop
2023-10-09 18:06:57 +08:00
yingtongxiong
949431f228
modify the config
2023-10-09 18:06:22 +08:00
yingtongxiong
21c1a7fa47
support evaluation with fstp
2023-10-09 18:01:06 +08:00
Wenwen Qu
582ee000bd
feat(moe): support zero for expert local dp (#404)
* support zero for expert local dp
* fix the above code:
  * treat optim.zero_world_size and optim.zero_local_rank as lists in model_checkpoint.py and test_model_checkpoint.py
  * add overlap and zero checks for moe in args_sanity_check
2023-10-09 17:45:26 +08:00