Commit Graph

303 Commits (3c07423151924f7350d8e7f7b93d8150721c61df)

Author SHA1 Message Date
huangting4201 3c6925499f feat(optimizer/hybrid_zero_optim.py): resolve conflicts 2023-10-20 16:18:01 +08:00
huangting4201 d91a5d9d9e feat(initialize/launch.py): refactor config for fstp 2023-10-20 15:59:40 +08:00
chenxun.p 95488d8e8f update optimizer accumulate grad impl when fstp 2023-10-20 15:58:06 +08:00
kkscilife 140be20511 test(workflow): add unit test yaml (#427) 2023-10-20 14:22:58 +08:00
  * add unit test yaml
  * add main branch
  Co-authored-by: changxiaodongTHU <2437105032@qq.com>
huangting4201 815a584930 feat(model/linear.py): remove useless code 2023-10-20 11:27:59 +08:00
yingtongxiong ed7232777a support reduce scatter memory pool 2023-10-20 10:35:45 +08:00
Wenwen Qu 3c992a2101 fix(pipeline): fix interleave type assert and metrics error (#423) 2023-10-19 17:29:30 +08:00
  * fix interleave type assert bug
  * refactor code for assert
  * fix is_no_pp_or_last_stage logic
jiaxingli 3ea46324dd fix: unitest (#424) 2023-10-19 15:19:40 +08:00
yingtongxiong 4742271154 add memory pool 2023-10-19 13:21:33 +08:00
Wenwen Qu 2c5395fdfd Doc(moe): add documentation for moe training (#411) 2023-10-19 10:01:12 +08:00
  * add doc for moe
  * fix moe and zero1 check in args_sanity_check
  * restore moe config file
Guoteng 3ea94f2e2a fix(utils): disable bench_net in gputest.py (#421) 2023-10-19 10:00:57 +08:00
jiaopenglong 4b5bdedff2 feat(monitor): send exception to light monitor (#420) 2023-10-18 21:00:21 +08:00
  * send exception to light monitor
  * update try_import_send_exception
jiaxingli 30f610b1fa Test(pp): test pipeline parallel (#413) 2023-10-18 17:53:08 +08:00
  * test: pp
  * feat: add pp test
  * test pp
  * pp test
  * pp test
  * test pp
yingtongxiong a5aeab2a3f memory profiling test 2023-10-17 19:54:21 +08:00
Wenwen Qu aa5e34d815 compatible with old ckpt (#418) 2023-10-17 17:25:36 +08:00
yingtongxiong 16ef7b7889 add test 2023-10-17 17:16:39 +08:00
yingtongxiong 5abe519c4c remove full weight for block 0 2023-10-17 16:37:06 +08:00
yingtongxiong 5c38cb6409 add head overlap 2023-10-17 15:38:24 +08:00
yingtongxiong a5c6e457b9 Merge branch 'feat/fstp' of https://github.com/yingtongxiong/InternLM into feat/fstp 2023-10-17 15:17:03 +08:00
yingtongxiong 6408b944c2 support fine grained 2023-10-17 15:14:39 +08:00
chenxun.p b51cf4ebc3 Merge branch 'feat/fstp' of github.com:yingtongxiong/InternLM into feat/fstp 2023-10-17 15:10:27 +08:00
chenxun.p 6682f5d92a fix reduce scatter async bug 2023-10-17 15:10:07 +08:00
Wenwen Qu eeef07934a fix(moe): fix moe compatibility for fsdp and memory profiling (#417) 2023-10-17 14:13:48 +08:00
  * fix moe compatibility for fsdp and memory profiling
  * update moe config
huangting4201 4e99a7fdbc feat(train/training_internlm.py): remove abnormal tgs when calculating avg tgs 2023-10-17 11:30:44 +08:00
chenxun.p 229cc5c68c impl reduce scatter async 2023-10-17 11:15:54 +08:00
huangting4201 d1af0d6aee feat(model/linear.py): block-grained backward 2023-10-17 10:13:56 +08:00
huangting4201 0d1fa037dd feat(model/linear.py): set block 0 full weight 2023-10-16 20:13:59 +08:00
yingtongxiong 82204eea59 support hybrid overlap 2023-10-16 16:35:14 +08:00
Guoteng 37e0c86e5a fix(init): allow resume_tb_folder is an empty string (#391) 2023-10-13 03:46:14 -05:00
jiaxingli 71a0388b87 feat(storage): support volc oss ckpt saving (#397) 2023-10-13 03:44:29 -05:00
  * feat: support volc tos
  * feat: support volc oss
huangting4201 d0f0c22cac feat(model/linear.py): change pre backward from wqkv to block 2023-10-13 11:10:23 +08:00
huangting4201 d0b1346993 feat(model/linear.py): support block allgather overlap 2023-10-12 19:42:08 +08:00
yingtongxiong 5fd5a8a32b support fine-grained overlap 2023-10-11 17:36:41 +08:00
yingtongxiong 792b066f15 communication overlap 2023-10-11 10:57:12 +08:00
huangting4201 9a731b6e9b fix(optimizer/fsdp_optimizer.py): fsdp process empty params group (#408) 2023-10-10 20:06:04 +08:00
  Co-authored-by: huangting4201 <huangting3@sensetime.com>
yingtongxiong c94be64fd2 merge origin 2023-10-10 17:13:46 +08:00
yingtongxiong 0fac845c36 overlap grad_input computation and grad_weight reduce_scatter 2023-10-10 17:06:13 +08:00
huangting4201 5fb6d99c11 feat(configs/7B_sft.py): update parallel config comment 2023-10-10 11:45:11 +08:00
yingtongxiong db637542a6 fix lint 2023-10-09 22:19:21 +08:00
yingtongxiong dd67ab948d merge develop 2023-10-09 21:40:02 +08:00
yingtongxiong 1b7935dd98 merge upstream develop 2023-10-09 21:35:52 +08:00
yingtongxiong a8dea6313f fix the ci incompatible in config 2023-10-09 21:33:26 +08:00
Pryest b3645b0244 fix(model): fix errant inference_forward (#396) 2023-10-09 08:29:11 -05:00
  * Fix errant inference_forward.
  * Recover use_dynamic_ntk_rope.
  * Fix bugs.
  * Fit to flash attention 1.0
  * Fit to flash attention 1.0
  * Fit to flash attention 1.0.5.
  * Fit to flash attention 1.0.5.
yingtongxiong 007e58a4af merge upstream develop 2023-10-09 20:54:26 +08:00
yingtongxiong f191853bf4 fix lint 2023-10-09 20:39:57 +08:00
yingtongxiong 29df765f65 refactor code 2023-10-09 20:23:32 +08:00
yingtongxiong 5d39c332fe restore train.py 2023-10-09 20:08:49 +08:00
yingtongxiong ef9e7cc622 modify the config 2023-10-09 20:05:39 +08:00
yingtongxiong 144731c35c fix evaluation bug in pp 2023-10-09 20:04:27 +08:00
zaglc a075153adf feat(train): add fsdp training option (#293) 2023-10-09 18:59:31 +08:00
  * feat(fsdp): add training option for fsdp
  * fix(fsdp): add mix-precision training
  * fix failure in lint-check
  * fix format problem
  * restore 7B_sft
  * fix load ckpt bug
  * fix load ckpt bug2
  * feat(solver/optimizer): add new file fsdp_optimizer.py
  * fix(train.py): fix ci lint error
  * fix(fsdp_optimizer.py): wait grad async
  * fix bug for loading ckpts when zero1 < dp_size
  * fix(context/parallel_context.py): only log warning for fsdp
  * change ckpt name
  * fix(model/modeling_internlm.py): fix checkpoint=False runtime error
  * more wrap
  * add support for FSDP with tp
  * modify args_sanity_check for fsdp with pipeline and fsdp with moe
  * fix(internlm/utils/parallel.py): fix circular import
  * fix(internlm/train/training_internlm.py): remove set IS_TENSOR_PARALLEL attr
  * fix(internlm/train/training_internlm.py): update wrap class and fix lint error
  * fix(internlm/model): reset dropout_selective_checkpoint=True
  * feat(configs/7B_sft.py): move fsdp config to parallel zero1
  * feat(configs/7B_sft.py): adapt to old version config
  Co-authored-by: huangting4201 <1538303371@qq.com>