jiaxingli  06e8301861  2023-11-24 18:24:54 +08:00
name (#514)

jiaxingli  b59641715a  2023-11-24 12:05:14 +08:00
Feat(QA): Check loss when swapping micro_num and micro_bsz && Check grad norm (#510)

* unitest_only_forward
* memory_test
* doc fix
* doc fix
* check loss
* check grad norm
* check grad norm

jiaxingli  e8cf27b8c0  2023-11-16 11:03:19 +08:00
Feat(QA): Check init model weights (#502)

* check_init
* check_init
* check_init
* check_init

jiaxingli  bd7e501b69  2023-11-09 16:16:29 +08:00
Feat(QA): Check model weights for acc (#476)

* check_weights
* check_weights

Wenwen Qu  136d55ec30  2023-09-27 15:54:53 +08:00
feat(moe): add moe module (#182)

* feat(XXX): add moe
* reformat code
* modified:   .pre-commit-config.yaml
  modified:   internlm/model/moe.py
  modified:   internlm/model/modeling_internlm.py
* modified:   internlm/model/modeling_internlm.py
* modified:   internlm/core/context/process_group_initializer.py
  modified:   internlm/core/scheduler/no_pipeline_scheduler.py
  modified:   internlm/solver/optimizer/hybrid_zero_optim.py
* modified:   internlm/model/moe.py
  modified:   internlm/moe/sharded_moe.py
  modified:   internlm/utils/parallel.py
* rollback .pre-commit-config.yaml
* add residual and other moe features
* modify grad clipping due to moe
* add param arguments
* reformat code
* add expert data support and fix bugs
* Update .pre-commit-config.yaml
* modified:   internlm/model/modeling_internlm.py
* add no-interleaved & no-overlapped moe pp support
* support zero_overlap_communication
* avoid moe parameter partition in zero optimizer
* fix the moe_loss_coeff bug
* suppport interleaved pp
* fix moe bugs in zero optimizer
* fix more moe bugs in zero optimizer
* fix moe bugs in zero optimizer
* add logger for moe_loss
* fix bugs with merge
* fix the pp moe bugs
* fix bug on logger
* update moe training cfg on real-dataset
* refactor code
* refactor code
* fix bugs with compute moe norm
* optimize code with moe norm computing
* fix the bug that missing scale the latent moe loss
* refactor code
* fix moe loss logger for the interleaved pp
* change the scale position for latent moe_loss
* Update 7B_sft.py
* add support for moe checkpoint
* add comments for moe
* reformat code
* fix bugs
* fix bugs
* Update .pre-commit-config.yaml
* remove moe_loss_coeff parameter passing
* fix group_norms computing in hybrid_zero_optim
* use dummy mode to generate random numbers in model construction
* replace flashatten experts by feedforward experts
* fix bugs with _compute_norm_with_moe_group
* merge upstream/develop into feature_add_moe
* merge upstream/develop into feature_add_moe
* change float16 to bfloat16
* fix interface for dense pipeline
* refactor split_moe_group code
* fix precision inconsistency
* refactor code
* Update 7B_sft.py
* refactor code
* refactor code
* refactor code
* refactor code
* refactor code for split group
* refactor code for log
* fix logger for moe
* refactor code for split param group
* fix the moe_loss for ci and val
* refactor
* fix bugs with split group
* fix bugs in save/load moe checkpoint
* add moe module to `__init__.py`
* add compatible code for old version
* update moe config file
* modify moe config file
* fix merge bugs
* update moe config file
* change condition for compatibility
---------
Co-authored-by: zhanglei <ryancheung98@163.com>
Co-authored-by: Ryan (张磊) <leizhang.real@gmail.com>

huangting4201  1ed36754df  2023-09-22 14:07:14 +08:00
feat(.github/workflows): update ci e2e tests and add ci unit tests (#324)

* feat(.github/workflows/e2e_test.yaml): update e2e yaml
* feat(.github/workflows/e2e_test.yaml): update e2e yaml
* test e2e
* test e2e
* test e2e
* test e2e
* test e2e
* fix(ci): test ci
* fix(ci): test ci
* fix(ci): test ci
* fix(ci): test ci
* fix(ci): test ci
* fix(ci): add weekly tests
---------
Co-authored-by: huangting4201 <huangting3@sensetime.com>

huangting4201  025ca55dfe  2023-09-19 14:55:40 +08:00
test(tests/test_training): add training e2e tests for loss spike and loss accuracy (#304)

* tests(test_training): add test case for loss accuracy
* tests(test_training): update test cases
* ci(.github/workflows/e2e_test.yaml): remove pull submodule
* ci(.github/workflows/e2e_test.yaml): update ci env and remove useless env var
* test(tests/test_training): add 16 GPUs test cases
* test(tests/test_training): fix training_16GPU_8DP2PP test case error
* test(tests/test_training): add new case for interleaved pp
* test(tests/test_training): remove redundant code
* test(tests/test_training): update ci job timeout minutes to 30m
* feat(initialize/launch.py): check num_chunks and interleaved_overlap
---------
Co-authored-by: huangting4201 <huangting3@sensetime.com>