Commit Graph

65 Commits (10b5056e1ebfe540f1008c97f4b3bcdafe8b22da)

Author SHA1 Message Date
yingtongxiong 10b5056e1e fix all-gather overlap when the model_checkpoint is 0 2023-11-01 12:31:52 +08:00
huangting4201 b3def4c162 fix(optimizer/hybrid_zero_optim.py): add reduce_scatter_overlap switch 2023-10-31 20:40:58 +08:00
mwiacx 4c1cd5d49b fix async reduce scatter 2023-10-31 19:39:24 +08:00
huangting4201 3778c66660 feat(model/overlap_handler.py): fix overlap handler to support pp (non-interleaved) 2023-10-27 20:04:23 +08:00
yingtongxiong cc20fa271a reset print memory 2023-10-25 16:48:02 +08:00
yingtongxiong 985465c96a merge upstream 2023-10-25 14:46:45 +08:00
yingtongxiong 363275b500 add memory print 2023-10-25 14:31:00 +08:00
ytxiong 1d7e2d04ec fix(*)/all-reduce for norm in sequence parallel (#443) 2023-10-25 14:16:32 +08:00
    * fix all-reduce norm grad
    * change the order of dp and sp all-reduce
    * fix lint
yingtongxiong 918dff7257 reset moe 2023-10-25 13:47:19 +08:00
huangting4201 41cfa1a10a feat(model/overlap_handler.py): fix overlap handler None bug 2023-10-24 18:47:27 +08:00
huangting4201 5d8313693b feat(model/overlap_handler.py): fix head post backward hook when activation 2023-10-24 17:29:09 +08:00
yingtongxiong 97dcefc389 support model activation checkpoint 2023-10-24 16:13:52 +08:00
huangting4201 03cc7f9b80 feat(model/overlap_handler.py): fix lint error 2023-10-23 15:28:34 +08:00
huangting4201 0d693cf3a1 feat(model/overlap_handler.py): fix lint error 2023-10-23 15:22:03 +08:00
yingtongxiong f6a5086fe4 support bias 2023-10-23 14:51:27 +08:00
huangting4201 e7f9f1d208 feat(model/overlap_handler.py): optimize reduce scatter mem pool 2023-10-23 13:31:23 +08:00
huangting4201 b20f47a1fe feat(model/overlap_handler.py): move handler to gpc 2023-10-23 12:02:32 +08:00
huangting4201 85ad917ae4 feat(model/overlap_handler.py): refactor overlap hook handle 2023-10-20 21:50:32 +08:00
yingtongxiong 1804d01bb3 merge reduce-scatter 2023-10-20 18:11:00 +08:00
yingtongxiong dcd89ed304 refactor linear 2023-10-20 17:50:56 +08:00
huangting4201 eac382ad0a feat(optimizer/hybrid_zero_optim.py): fix lint error 2023-10-20 16:22:29 +08:00
huangting4201 d91a5d9d9e feat(initialize/launch.py): refactor config for fstp 2023-10-20 15:59:40 +08:00
huangting4201 815a584930 feat(model/linear.py): remove useless code 2023-10-20 11:27:59 +08:00
yingtongxiong ed7232777a support reduce scatter memory pool 2023-10-20 10:35:45 +08:00
yingtongxiong 4742271154 add memory pool 2023-10-19 13:21:33 +08:00
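Note: the two "memory pool" commits above refer to reusing reduce-scatter output buffers instead of allocating a fresh tensor on every call. A minimal, hypothetical sketch of that idea follows; the names (ReduceScatterMemPool, reduce_scatter_grad) are illustrative and not the repository's API, and it assumes a recent PyTorch with dist.reduce_scatter_tensor.

```python
import torch
import torch.distributed as dist


class ReduceScatterMemPool:
    """Hands out reusable output buffers keyed by (shape, dtype)."""

    def __init__(self):
        self._free = {}  # (shape, dtype) -> list of idle tensors

    def get(self, shape, dtype, device):
        key = (tuple(shape), dtype)
        bucket = self._free.setdefault(key, [])
        return bucket.pop() if bucket else torch.empty(*shape, dtype=dtype, device=device)

    def release(self, tensor):
        # Return a buffer to the pool once the optimizer is done with it.
        self._free.setdefault((tuple(tensor.shape), tensor.dtype), []).append(tensor)


def reduce_scatter_grad(pool, full_grad, group):
    # Output shard is 1/world_size of the flattened gradient (assumes divisibility).
    world_size = dist.get_world_size(group)
    shard = pool.get((full_grad.numel() // world_size,), full_grad.dtype, full_grad.device)
    handle = dist.reduce_scatter_tensor(shard, full_grad.flatten(), group=group, async_op=True)
    # Caller waits on handle, consumes the shard, then calls pool.release(shard).
    return shard, handle
```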
Wenwen Qu 2c5395fdfd Doc(moe): add documentation for moe training (#411) 2023-10-19 10:01:12 +08:00
    * add doc for moe
    * fix moe and zero1 check in args_sanity_check
    * restore moe config file
yingtongxiong a5aeab2a3f memory profiling test 2023-10-17 19:54:21 +08:00
yingtongxiong 5abe519c4c remove full weight for block 0 2023-10-17 16:37:06 +08:00
yingtongxiong 5c38cb6409 add head overlap 2023-10-17 15:38:24 +08:00
yingtongxiong a5c6e457b9 Merge branch 'feat/fstp' of https://github.com/yingtongxiong/InternLM into feat/fstp 2023-10-17 15:17:03 +08:00
yingtongxiong 6408b944c2 support fine grained 2023-10-17 15:14:39 +08:00
chenxun.p 6682f5d92a fix reduce scatter async bug 2023-10-17 15:10:07 +08:00
chenxun.p 229cc5c68c impl reduce scatter async 2023-10-17 11:15:54 +08:00
huangting4201 d1af0d6aee feat(model/linear.py): block-grained backward 2023-10-17 10:13:56 +08:00
huangting4201 0d1fa037dd feat(model/linear.py): set block 0 full weight 2023-10-16 20:13:59 +08:00
yingtongxiong 82204eea59 support hybrid overlap 2023-10-16 16:35:14 +08:00
huangting4201 d0f0c22cac feat(model/linear.py): change pre backward from wqkv to block 2023-10-13 11:10:23 +08:00
huangting4201 d0b1346993 feat(model/linear.py): support block allgather overlap 2023-10-12 19:42:08 +08:00
yingtongxiong 5fd5a8a32b support fine-grained overlap 2023-10-11 17:36:41 +08:00
yingtongxiong 792b066f15 communication overlap 2023-10-11 10:57:12 +08:00
yingtongxiong 0fac845c36 overlap grad_input computation and grad_weight reduce_scatter 2023-10-10 17:06:13 +08:00
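Note: the overlap commits above (async reduce-scatter, overlapping grad_input computation with the grad_weight reduce-scatter) share one pattern: launch the weight-gradient communication asynchronously so it runs while the input gradient is still being computed. A hedged sketch of that pattern, with illustrative names rather than InternLM's actual functions, assuming 2-D activations:

```python
import torch
import torch.distributed as dist


def linear_backward_overlapped(grad_output, x, weight, group):
    """Backward for y = x @ weight.T with the weight-grad reduce-scatter overlapped.

    Illustrative only; assumes grad_output is [batch, out] and x is [batch, in].
    """
    # Compute the (unreduced) weight gradient first and launch its reduce-scatter
    # asynchronously, so the communication overlaps the grad_input matmul below.
    grad_weight = grad_output.t() @ x                      # [out, in]
    world_size = dist.get_world_size(group)
    shard = torch.empty(grad_weight.numel() // world_size,
                        dtype=grad_weight.dtype, device=grad_weight.device)
    handle = dist.reduce_scatter_tensor(shard, grad_weight.flatten(),
                                        group=group, async_op=True)

    grad_input = grad_output @ weight                      # [batch, in], runs while comm is in flight

    handle.wait()                                          # sync before the shard is consumed
    return grad_input, shard
```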
yingtongxiong dd67ab948d merge develop 2023-10-09 21:40:02 +08:00
yingtongxiong 1b7935dd98 merge upstream develop 2023-10-09 21:35:52 +08:00
Pryest b3645b0244 fix(model): fix errant inference_forward (#396) 2023-10-09 08:29:11 -05:00
    * Fix errant inference_forward.
    * Recover use_dynamic_ntk_rope.
    * Fix bugs.
    * Fit to flash attention 1.0
    * Fit to flash attention 1.0
    * Fit to flash attention 1.0.5.
    * Fit to flash attention 1.0.5.
yingtongxiong 007e58a4af merge upstream develop 2023-10-09 20:54:26 +08:00
yingtongxiong f191853bf4 fix lint 2023-10-09 20:39:57 +08:00
yingtongxiong 29df765f65 refactor code 2023-10-09 20:23:32 +08:00
zaglc a075153adf feat(train): add fsdp training option (#293) 2023-10-09 18:59:31 +08:00
    * feat(fsdp): add training option for fsdp
    * fix(fsdp): add mix-precision training
    * fix failure in lint-check
    * fix format problem
    * restore 7B_sft
    * fix load ckpt bug
    * fix load ckpt bug2
    * feat(solver/optimizer): add new file fsdp_optimizer.py
    * fix(train.py): fix ci lint error
    * fix(fsdp_optimizer.py): wait grad async
    * fix bug for loading ckpts when zero1 < dp_size
    * fix(context/parallel_context.py): only log warning for fsdp
    * change ckpt name
    * fix(model/modeling_internlm.py): fix checkpoint=False runtime error
    * more wrap
    * add support for FSDP with tp
    * modify args_sanity_check for fsdp with pipeline and fsdp with moe
    * fix(internlm/utils/parallel.py): fix circular import
    * fix(internlm/train/training_internlm.py): remove set IS_TENSOR_PARALLEL attr
    * fix(internlm/train/training_internlm.py): update wrap class and fix lint error
    * fix(internlm/model): reset dropout_selective_checkpoint=True
    * feat(configs/7B_sft.py): move fsdp config to parallel zero1
    * feat(configs/7B_sft.py): adapt to old version config
    Co-authored-by: huangting4201 <1538303371@qq.com>
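Note: the last two items of the FSDP PR above move the fsdp switch under the parallel.zero1 config. A hedged sketch of what the parallel section of a config like configs/7B_sft.py might look like after that change; the exact keys and defaults are assumptions, not copied from the repository:

```python
# Assumed layout of the parallel config after "move fsdp config to parallel zero1".
parallel = dict(
    zero1=dict(size=8, fsdp=True),  # ZeRO-1 partition group size; fsdp toggles the FSDP training path
    tensor=1,
    pipeline=dict(size=1, interleaved_overlap=True),
    sequence_parallel=False,
)
```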
yingtongxiong 21c1a7fa47 support evaluation with fstp 2023-10-09 18:01:06 +08:00
yingtongxiong 189a313da6 support fstp and refactor code 2023-10-09 17:26:20 +08:00