Commit Graph

28 Commits (ebf294274604fba08d164481666bf4591f564235)

Author SHA1 Message Date
Guoteng 5ecb6aa712
fix(pp): fix no-packed dataset load micro batch error (#538)
* fix(pp): fix no-packed dataset load micro batch error

* fix based on comment
2023-12-13 14:48:32 +08:00
Guoteng 81ffb3d824
fix(test): fix type_ids unpack bug (#530) 2023-12-07 18:47:19 +08:00
jiaxingli 2dbbab7418
fix test_checkpoint (#526) 2023-12-04 15:38:13 +08:00
jiaxingli 1738bee002
feat(storage): use multipart upload when using oss (#520)
* multipart upload

* upload

* storage

* storage

* storage

* storage
2023-12-01 17:05:58 +08:00
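
The multipart change in #520 uploads large checkpoint files in fixed-size parts so a failed part can be retried without resending the whole object. A minimal sketch of that pattern with the Aliyun `oss2` SDK (endpoint, bucket, and part size below are illustrative, not the repo's actual values):

```python
import oss2

# Hypothetical credentials/endpoint; not taken from the repo.
auth = oss2.Auth("ACCESS_KEY_ID", "ACCESS_KEY_SECRET")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-ckpt-bucket")

def multipart_upload(local_path: str, key: str, part_size: int = 100 * 1024 * 1024):
    """Upload `local_path` to `key` in fixed-size parts, then commit them."""
    upload_id = bucket.init_multipart_upload(key).upload_id
    parts = []
    with open(local_path, "rb") as f:
        part_number = 1
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            result = bucket.upload_part(key, upload_id, part_number, chunk)
            parts.append(oss2.models.PartInfo(part_number, result.etag))
            part_number += 1
    bucket.complete_multipart_upload(key, upload_id, parts)
```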
Guoteng b3be333aa2
fix(ci): fix test model ckpt ci test (#518) 2023-11-30 19:16:35 +08:00
Guoteng 757e19e01a
1. fix(config): rampup_batch_size default value BC. (#515)
2. fix(config): standardize config parameter access.
3. feat(launch): add warmup_process_group
4. feat(memory): add cuda_memory_analyze
2023-11-28 19:33:46 +08:00
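
The `rampup_batch_size` work in #493/#515 grows the effective batch size over early training steps. A hedged sketch of the schedule arithmetic — `start`, `increment`, and `interval` are illustrative names, not the repo's actual config fields:

```python
def batch_size_at(step: int, start: int, target: int, increment: int, interval: int) -> int:
    """Linear batch-size ramp-up: begin at `start` and add `increment`
    every `interval` steps until `target` is reached."""
    ramped = start + (step // interval) * increment
    return min(ramped, target)

# Example: ramp 4 -> 32 in steps of 4, growing every 100 steps.
assert batch_size_at(0, 4, 32, 4, 100) == 4
assert batch_size_at(250, 4, 32, 4, 100) == 12
assert batch_size_at(10_000, 4, 32, 4, 100) == 32
```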
jiaxingli 06e8301861
name (#514) 2023-11-24 18:24:54 +08:00
jiaxingli b59641715a
Feat(QA): Check loss when swapping micro_num and micro_bsz && Check grad norm (#510)
* unitest_only_forward

* memory_test

* doc fix

* doc fix

* check loss

* check grad norm

* check grad norm
2023-11-24 12:05:14 +08:00
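
The check in #510 leans on the identity tokens-per-step = micro_num × micro_bsz × seq_len: swapping `micro_num` and `micro_bsz` keeps the token count constant, so the loss curves should agree closely. A sketch of such an invariance test (the `train_steps` helper and tolerance are hypothetical):

```python
import math

def check_loss_swap(train_steps, micro_num=4, micro_bsz=2, steps=20, rtol=1e-3):
    """`train_steps` is assumed to run `steps` iterations with the given
    micro settings and return the final loss; both runs consume the same
    number of tokens per step, so their losses should nearly match."""
    loss_a = train_steps(micro_num=micro_num, micro_bsz=micro_bsz, steps=steps)
    loss_b = train_steps(micro_num=micro_bsz, micro_bsz=micro_num, steps=steps)
    assert math.isclose(loss_a, loss_b, rel_tol=rtol), (loss_a, loss_b)
```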
Guoteng 0bfc86205e
feat(train): support_rampup_batch_size and fix bugs (#493) 2023-11-16 19:51:01 +08:00
jiaxingli 4a6987d5e7
unitest_only_forward (#484) 2023-11-16 15:30:57 +08:00
jiaxingli e8cf27b8c0
Feat(QA): Check init model weights (#502)
* check_init

* check_init

* check_init

* check_init
2023-11-16 11:03:19 +08:00
jiaxingli bd7e501b69
Feat(QA): Check model weights for acc (#476)
* check_weights

* check_weights
2023-11-09 16:16:29 +08:00
jiaxingli 4995060d84
feat(storage): support ali oss ckpt saving (#439) 2023-10-27 22:32:10 +08:00
jiaxingli 3ea46324dd
fix: unit test (#424) 2023-10-19 15:19:40 +08:00
jiaxingli 30f610b1fa
Test(pp): test pipeline parallel (#413)
* test: pp

* feat: add pp test

* test pp

* pp test

* pp test

* test pp
2023-10-18 17:53:08 +08:00
jiaxingli 71a0388b87
feat(storage): support volc oss ckpt saving (#397)
* feat: support volc tos

* feat: support volc oss
2023-10-13 03:44:29 -05:00
Wenwen Qu 582ee000bd
feat(moe): support zero for expert local dp (#404)
* support zero for expert local dp

* fix the above code:
    * treat optim.zero_world_size and optim.zero_local_rank as lists in model_checkpoint.py and test_model_checkpoint.py
    * add overlap and zero checks for moe in args_sanity_check()
2023-10-09 17:45:26 +08:00
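
Treating `optim.zero_world_size` and `optim.zero_local_rank` as lists makes sense once expert parameters shard over a smaller group than dense parameters. A hypothetical sketch of that per-param-group bookkeeping (the names and the `moe` group flag are illustrative, not the repo's actual structures):

```python
from dataclasses import dataclass

@dataclass
class GroupInfo:
    size: int  # number of ranks sharing this ZeRO shard group
    rank: int  # this process's rank within that group

def zero_meta_per_param_group(param_groups, dense_dp: GroupInfo, expert_dp: GroupInfo):
    """Expert params shard only across the expert-local DP group, dense params
    across the full DP group, so zero_world_size and zero_local_rank become
    per-param-group lists instead of scalars."""
    zero_world_size, zero_local_rank = [], []
    for group in param_groups:
        info = expert_dp if group.get("moe") else dense_dp
        zero_world_size.append(info.size)
        zero_local_rank.append(info.rank)
    return zero_world_size, zero_local_rank
```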
Wenwen Qu 136d55ec30
feat(moe): add moe module (#182)
* feat(XXX): add moe

* reformat code

* modified:   .pre-commit-config.yaml
	modified:   internlm/model/moe.py
	modified:   internlm/model/modeling_internlm.py

* modified:   internlm/model/modeling_internlm.py

* modified:   internlm/core/context/process_group_initializer.py
	modified:   internlm/core/scheduler/no_pipeline_scheduler.py
	modified:   internlm/solver/optimizer/hybrid_zero_optim.py

* modified:   internlm/model/moe.py
	modified:   internlm/moe/sharded_moe.py
	modified:   internlm/utils/parallel.py

* rollback .pre-commit-config.yaml

* add residual and other moe features

* modify grad clipping due to moe

* add param arguments

* reformat code

* add expert data support and fix bugs

* Update .pre-commit-config.yaml

* modified:   internlm/model/modeling_internlm.py

* add no-interleaved & no-overlapped moe pp support

* support zero_overlap_communication

* avoid moe parameter partition in zero optimizer

* fix the moe_loss_coeff bug

* support interleaved pp

* fix moe bugs in zero optimizer

* fix more moe bugs in zero optimizer

* fix moe bugs in zero optimizer

* add logger for moe_loss

* fix bugs with merge

* fix the pp moe bugs

* fix bug on logger

* update moe training cfg on real-dataset

* refactor code

* refactor code

* fix bugs with compute moe norm

* optimize code with moe norm computing

* fix the bug that missing scale the latent moe loss

* refactor code

* fix moe loss logger for the interleaved pp

* change the scale position for latent moe_loss

* Update 7B_sft.py

* add support for moe checkpoint

* add comments for moe

* reformat code

* fix bugs

* fix bugs

* Update .pre-commit-config.yaml

* remove moe_loss_coeff parameter passing

* fix group_norms computing in hybrid_zero_optim

* use dummy mode to generate random numbers in model construction

* replace flashatten experts by feedforward experts

* fix bugs with _compute_norm_with_moe_group

* merge upstream/develop into feature_add_moe

* merge upstream/develop into feature_add_moe

* change float16 to bfloat16

* fix interface for dense pipeline

* refactor split_moe_group code

* fix precision inconsistency

* refactor code

* Update 7B_sft.py

* refactor code

* refactor code

* refactor code

* refactor code

* refactor code for split group

* refactor code for log

* fix logger for moe

* refactor code for split param group

* fix the moe_loss for ci and val

* refactor

* fix bugs with split group

* fix bugs in save/load moe checkpoint

* add moe module to `__init__.py`

* add compatible code for old version

* update moe config file

* modify moe config file

* fix merge bugs

* update moe config file

* change condition for compatibility

---------

Co-authored-by: zhanglei <ryancheung98@163.com>
Co-authored-by: Ryan (张磊) <leizhang.real@gmail.com>
2023-09-27 15:54:53 +08:00
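
Several bullets in #182 concern scaling and logging `moe_loss`. The core pattern, sketched under the assumption of a standard auxiliary load-balancing loss (only the name `moe_loss_coeff` comes from the log; the rest is illustrative):

```python
import torch

def total_loss(ce_loss: torch.Tensor, moe_losses: list[torch.Tensor],
               moe_loss_coeff: float = 0.01) -> torch.Tensor:
    """Combine the language-model loss with per-layer MoE auxiliary
    (load-balancing) losses; the coefficient keeps the gating term
    from dominating the objective."""
    moe_loss = torch.stack(moe_losses).sum()
    return ce_loss + moe_loss_coeff * moe_loss
```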
Wenwen Qu 655e9dae40
Feat(norm)/support fused precision (#319)
* add fused precision support for norm

* refactor code

* refactor code

* change the granularity of hook

* fix bugs if self.model is ModuleList

* add dtype condition for post hook

* refactor code for split group

* refactor code for pre/post hook

* refactor code for split group

* remove fp32 hook for norm

* unit tests for fused precision

* add doc for fused precision

* add doc for En. version

* reformat docs

* Update mixed_precision.rst

* Update mixed_precision.po

* update mixed_precision.po
2023-09-26 20:39:55 +08:00
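
The fused-precision feature in #319 keeps the numerically sensitive norm layers in fp32 while the rest of the model runs in lower precision, via pre/post forward hooks — which matches the "granularity of hook" and "dtype condition for post hook" bullets. A sketch of that hook pattern (the module and dtype choices are assumptions):

```python
import torch
from torch import nn

def apply_fp32_norm_hooks(model: nn.Module, compute_dtype=torch.bfloat16):
    """Cast inputs of every norm module up to fp32 before forward and
    cast outputs back to the compute dtype afterwards."""
    def pre_hook(module, args):
        return tuple(a.float() if torch.is_tensor(a) else a for a in args)

    def post_hook(module, args, output):
        # Only downcast if the output actually came out in fp32.
        if torch.is_tensor(output) and output.dtype == torch.float32:
            return output.to(compute_dtype)
        return output

    for m in model.modules():
        if isinstance(m, nn.LayerNorm):
            m.float()  # keep the norm's own weights in fp32
            m.register_forward_pre_hook(pre_hook)
            m.register_forward_hook(post_hook)
```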
Guoteng 26a7397752
fix(storage): fix try_get_storage_backend (#359)
* fix(storage): fix try_get_storage_backend

* fix typo and print infos only in log rank

* fix typo and print infos only in log rank

---------

Co-authored-by: gaoyang07 <Gary1546308416AL@gmail.com>
2023-09-25 15:16:25 +08:00
huangting4201 1ed36754df
feat(.github/workflows): update ci e2e tests and add ci unit tests (#324)
* feat(.github/workflows/e2e_test.yaml): update e2e yaml

* feat(.github/workflows/e2e_test.yaml): update e2e yaml

* test e2e

* test e2e

* test e2e

* test e2e

* test e2e

* fix(ci): test ci

* fix(ci): test ci

* fix(ci): test ci

* fix(ci): test ci

* fix(ci): test ci

* fix(ci): add weekly tests

---------

Co-authored-by: huangting4201 <huangting3@sensetime.com>
2023-09-22 14:07:14 +08:00
huangting4201 025ca55dfe
test(tests/test_training): add training e2e tests for loss spike and loss accuracy (#304)
* tests(test_training): add test case for loss accuracy

* tests(test_training): update test cases

* ci(.github/workflows/e2e_test.yaml): remove pull submodule

* ci(.github/workflows/e2e_test.yaml): update ci env and remove useless env var

* test(tests/test_training): add 16 GPUs test cases

* test(tests/test_training): fix training_16GPU_8DP2PP test case error

* test(tests/test_training): add new case for interleaved pp

* test(tests/test_training): remove redundant code

* test(tests/test_training): update ci job timeout minutes to 30m

* feat(initialize/launch.py): check num_chunks and interleaved_overlap

---------

Co-authored-by: huangting4201 <huangting3@sensetime.com>
2023-09-19 14:55:40 +08:00
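
The e2e tests in #304 guard against loss spikes and loss-accuracy drift. At its simplest, such a check compares each step's loss to a recorded baseline within a tolerance — a sketch with made-up baseline numbers:

```python
BASELINE_LOSS = [11.64, 7.98, 6.52, 5.47, 5.05]  # hypothetical reference curve
SPIKE_FACTOR = 1.5

def check_loss_curve(losses):
    for step, (got, ref) in enumerate(zip(losses, BASELINE_LOSS)):
        # Accuracy: stay close to the recorded baseline.
        assert abs(got - ref) / ref < 0.05, f"step {step}: {got} vs {ref}"
        # Spike: never jump far above the previous step's loss.
        if step > 0:
            assert got < losses[step - 1] * SPIKE_FACTOR, f"loss spike at step {step}"
```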
jiaxingli ab513e1ddd
feat: add optimizer_unitest (#303)
* feat: add optimizer_unitest

* feat: add optimizer test

* feat: add optimizer test

* feat: add optimizer test

* final change

* feat: add optimizer test

* feat: add optimizer test

* feat: add optimizer test
2023-09-15 18:56:56 +08:00
jiaxingli 882a07011c
feat: add unitest for model (#300)
* feat: add unitest for model

* feat: add model test
2023-09-14 13:18:34 +08:00
Guoteng 85e39aae67
fix(ckpt): fix snapshot none load error and remove file lock (#298) 2023-09-08 20:41:53 +08:00
Guoteng 37b8c6684e
feat(utils): add timeout wrapper for key functions (#286) 2023-09-07 17:26:17 +08:00
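
A timeout wrapper like the one in #286 typically guards blocking calls (storage I/O, collectives) so a hung call fails fast instead of stalling training. A minimal standard-library sketch; the decorator name and default are assumptions, not the repo's API:

```python
import functools
from concurrent.futures import ThreadPoolExecutor

def with_timeout(seconds: float = 300.0):
    """Raise concurrent.futures.TimeoutError if the wrapped call does not
    finish in time. Note: the worker thread is abandoned, not killed."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            pool = ThreadPoolExecutor(max_workers=1)
            try:
                return pool.submit(fn, *args, **kwargs).result(timeout=seconds)
            finally:
                pool.shutdown(wait=False)
        return wrapper
    return decorator

@with_timeout(seconds=60.0)
def save_checkpoint(path: str) -> None:
    ...  # blocking storage call goes here
```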
Guoteng 8acf823a04
fix(storage): fix and refactor storage api (#281) 2023-09-06 01:15:09 +08:00
Guoteng f6e007f95b
feat(ckpt): fix checkpoint bugs and add feature enhancements. (#259)
* fix(ckpt): ckpt bug fix and api refactor
1. fix latest ckpt query bug
2. add ckpt unit test
3. fix storage manager boto3/local client get_fns bug
4. fix a bug where the zero fp32 buffer overwrote model weights in the model-only load case.
5. add ckpt_type and add zero reload ci-test

* fix(ckpt): fix ckpt and trainer bug

* fix and refactor

* fix based on comment

* feat: add legacy api
2023-09-05 17:40:48 +08:00
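
On the "latest ckpt query bug" in #259: one classic failure mode is sorting step directories as strings, which ranks "99" above "100". A hypothetical reconstruction of the numeric fix (not necessarily the actual bug fixed here):

```python
import re

def query_latest_ckpt(step_dirs: list[str]) -> str:
    """Pick the snapshot with the largest *numeric* step; a plain string
    sort would rank 'ckpt/99' above 'ckpt/100'."""
    def step_of(d: str) -> int:
        m = re.search(r"(\d+)/?$", d)
        return int(m.group(1)) if m else -1
    return max(step_dirs, key=step_of)

assert query_latest_ckpt(["ckpt/99", "ckpt/100", "ckpt/40"]) == "ckpt/100"
```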