fix(docs): fix 20B demo log (#490)

* feat(docs): change 30B demo to 20B

* feat(docs): change 30B demo to 20B

* feat(docs): fix demo log
huangting4201 2023-11-10 15:57:11 +08:00 committed by GitHub
parent 07026d1821
commit 8ada074cfd
1 changed files with 24 additions and 25 deletions

@@ -176,28 +176,27 @@
.. code-block:: bash
2023-11-10 11:45:20,248 INFO parallel_context.py:555 in set_device -- process rank 0 is bound to host:HOST-10-140-60-69 device: 0
2023-11-10 11:45:20,287 INFO parallel_context.py:555 in set_device -- process rank 10 is bound to host:HOST-10-140-60-95 device: 2
2023-11-10 11:45:20,289 INFO parallel_context.py:555 in set_device -- process rank 12 is bound to host:HOST-10-140-60-95 device: 4
2023-11-10 11:45:20,291 INFO parallel_context.py:555 in set_device -- process rank 9 is bound to host:HOST-10-140-60-95 device: 1
2023-11-10 11:45:20,291 INFO parallel_context.py:555 in set_device -- process rank 13 is bound to host:HOST-10-140-60-95 device: 5
2023-11-10 11:45:20,292 INFO parallel_context.py:555 in set_device -- process rank 8 is bound to host:HOST-10-140-60-95 device: 0
2023-11-10 11:45:20,292 INFO parallel_context.py:555 in set_device -- process rank 15 is bound to host:HOST-10-140-60-95 device: 7
2023-11-10 11:45:20,292 INFO parallel_context.py:555 in set_device -- process rank 14 is bound to host:HOST-10-140-60-95 device: 6
2023-11-10 11:45:20,292 INFO parallel_context.py:555 in set_device -- process rank 11 is bound to host:HOST-10-140-60-95 device: 3
2023-11-10 11:45:20,298 INFO parallel_context.py:555 in set_device -- process rank 6 is bound to host:HOST-10-140-60-69 device: 6
2023-11-10 11:45:20,340 INFO parallel_context.py:555 in set_device -- process rank 7 is bound to host:HOST-10-140-60-69 device: 7
2023-11-10 11:45:20,387 INFO parallel_context.py:555 in set_device -- process rank 2 is bound to host:HOST-10-140-60-69 device: 2
2023-11-10 11:45:20,387 INFO parallel_context.py:555 in set_device -- process rank 5 is bound to host:HOST-10-140-60-69 device: 5
2023-11-10 11:45:20,388 INFO parallel_context.py:555 in set_device -- process rank 1 is bound to host:HOST-10-140-60-69 device: 1
2023-11-10 11:45:20,390 INFO parallel_context.py:555 in set_device -- process rank 4 is bound to host:HOST-10-140-60-69 device: 4
2023-11-10 11:45:20,463 INFO parallel_context.py:555 in set_device -- process rank 3 is bound to host:HOST-10-140-60-69 device: 3
2023-11-10 11:45:25,162 INFO launch.py:409 in launch -- Distributed environment is initialized, data parallel size: 4, pipeline parallel size: 1, tensor parallel size: 4
2023-11-10 11:45:40,621 INFO hybrid_zero_optim.py:268 in _partition_param_list -- Number of elements on ranks: [1262168320, 1269084160, 1269084160, 1222844160], rank:0
2023-11-10T11:46:16.409+08:00 INFO [training_internlm.py, line 600, in record_current_batch_training_metrics] - pid=117775 : tflops=30.535171880622176 step=0 loss=11.542577743530273 tgs (tokens/gpu/second)=246.32 tgs/last_tgs_1=246.32 tgs/tgs_all=246.32 tgs/tgs_avg=246.32 tgs/tgs_SMA=246.32 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=4.0000000000000003e-07 loss_scale=65536.0 grad_norm={'0_default': 87.3189924662012, '1_fp32': 0.0} micro_num=4 num_consumed_tokens=65536 inf_nan_skip_batches=0 num_samples_in_batch=18 largest_length=2048 largest_batch=6 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=13.47 acc=0.0 perplexity=104321.0312 acc/.git=0.0 acc/.github=0.0 acc/.gitignore=0.0 acc/.gitmodules=0.0 acc/.owners.yml=0.0 acc/.pre-commit-config.yaml=0.0 acc/.pylintrc=0.0 acc/.pytest_cache=0.0 acc/.readthedocs.yml=0.0 acc/.vscode=0.0 acc/7b_train=0.0 acc/CHANGE_LOG.md=0.0 acc/LICENSE=0.0 acc/README-ja-JP.md=0.0 acc/README-zh-Hans.md=0.0 acc/README.md=0.0 acc/RUN=0.0 acc/ci_scripts=0.0 acc/configs=0.0 acc/doc=0.0 acc/docker=0.0 acc/docker.Makefile=0.0 acc/experiment=0.0 acc/internlm=0.0 acc/requirements=0.0 acc/sonar-project.properties=0.0 acc/tests=0.0 acc/third_party=0.0 acc/tools=0.0 acc/train.py=0.0 acc/version.txt=0.0 acc/web_demo.py=0.0 acc/web_demo_internlm.py=0.0 tokens/.git=60571 tokens/.github=0 tokens/.gitignore=0 tokens/.gitmodules=0 tokens/.owners.yml=0 tokens/.pre-commit-config.yaml=0 tokens/.pylintrc=0 tokens/.pytest_cache=0 tokens/.readthedocs.yml=0 tokens/.vscode=0 tokens/7b_train=0 tokens/CHANGE_LOG.md=0 tokens/LICENSE=0 tokens/README-ja-JP.md=0 tokens/README-zh-Hans.md=0 tokens/README.md=0 tokens/RUN=0 tokens/ci_scripts=0 tokens/configs=0 tokens/doc=0 tokens/docker=0 tokens/docker.Makefile=0 tokens/experiment=0 tokens/internlm=0 tokens/requirements=0 tokens/sonar-project.properties=0 tokens/tests=0 tokens/third_party=0 tokens/tools=0 tokens/train.py=0 tokens/version.txt=0 tokens/web_demo.py=0 tokens/web_demo_internlm.py=0 loss_from_metric=11.5552 
loss/.git=11.5552 loss/.github=nan loss/.gitignore=nan loss/.gitmodules=nan loss/.owners.yml=nan loss/.pre-commit-config.yaml=nan loss/.pylintrc=nan loss/.pytest_cache=nan loss/.readthedocs.yml=nan loss/.vscode=nan loss/7b_train=nan loss/CHANGE_LOG.md=nan loss/LICENSE=nan loss/README-ja-JP.md=nan loss/README-zh-Hans.md=nan loss/README.md=nan loss/RUN=nan loss/ci_scripts=nan loss/configs=nan loss/doc=nan loss/docker=nan loss/docker.Makefile=nan loss/experiment=nan loss/internlm=nan loss/requirements=nan loss/sonar-project.properties=nan loss/tests=nan loss/third_party=nan loss/tools=nan loss/train.py=nan loss/version.txt=nan loss/web_demo.py=nan loss/web_demo_internlm.py=nan
2023-11-10T11:46:20.794+08:00 INFO [training_internlm.py, line 600, in record_current_batch_training_metrics] - pid=117775 : tflops=119.67196090960911 step=1 loss=11.337997436523438 tgs (tokens/gpu/second)=965.36 tgs/last_tgs_1=965.37 tgs/tgs_all=392.49 tgs/tgs_avg=605.85 tgs/tgs_SMA=392.49 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=6.000000000000001e-07 loss_scale=65536.0 grad_norm={'0_default': 90.85007610412333, '1_fp32': 0.0} micro_num=4 num_consumed_tokens=131072 inf_nan_skip_batches=0 num_samples_in_batch=19 largest_length=2048 largest_batch=6 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=3.7 acc=0.0 perplexity=81555.5 acc/.git=0.0 acc/.github=0.0 acc/.gitignore=0.0 acc/.gitmodules=0.0 acc/.owners.yml=0.0 acc/.pre-commit-config.yaml=0.0 acc/.pylintrc=0.0 acc/.pytest_cache=0.0 acc/.readthedocs.yml=0.0 acc/.vscode=0.0 acc/7b_train=0.0 acc/CHANGE_LOG.md=0.0 acc/LICENSE=0.0 acc/README-ja-JP.md=0.0 acc/README-zh-Hans.md=0.0 acc/README.md=0.0 acc/RUN=0.0 acc/ci_scripts=0.0 acc/configs=0.0 acc/doc=0.0 acc/docker=0.0 acc/docker.Makefile=0.0 acc/experiment=0.0 acc/internlm=0.0 acc/requirements=0.0 acc/sonar-project.properties=0.0 acc/tests=0.0 acc/third_party=0.0 acc/tools=0.0 acc/train.py=0.0 acc/version.txt=0.0 acc/web_demo.py=0.0 acc/web_demo_internlm.py=0.0 tokens/.git=60265 tokens/.github=0 tokens/.gitignore=0 tokens/.gitmodules=0 tokens/.owners.yml=0 tokens/.pre-commit-config.yaml=0 tokens/.pylintrc=0 tokens/.pytest_cache=0 tokens/.readthedocs.yml=0 tokens/.vscode=0 tokens/7b_train=0 tokens/CHANGE_LOG.md=0 tokens/LICENSE=0 tokens/README-ja-JP.md=0 tokens/README-zh-Hans.md=0 tokens/README.md=0 tokens/RUN=0 tokens/ci_scripts=0 tokens/configs=0 tokens/doc=0 tokens/docker=0 tokens/docker.Makefile=0 tokens/experiment=0 tokens/internlm=0 tokens/requirements=0 tokens/sonar-project.properties=0 tokens/tests=0 tokens/third_party=0 tokens/tools=0 tokens/train.py=0 tokens/version.txt=0 tokens/web_demo.py=0 tokens/web_demo_internlm.py=0 loss_from_metric=11.309 loss/.git=11.309 
loss/.github=nan loss/.gitignore=nan loss/.gitmodules=nan loss/.owners.yml=nan loss/.pre-commit-config.yaml=nan loss/.pylintrc=nan loss/.pytest_cache=nan loss/.readthedocs.yml=nan loss/.vscode=nan loss/7b_train=nan loss/CHANGE_LOG.md=nan loss/LICENSE=nan loss/README-ja-JP.md=nan loss/README-zh-Hans.md=nan loss/README.md=nan loss/RUN=nan loss/ci_scripts=nan loss/configs=nan loss/doc=nan loss/docker=nan loss/docker.Makefile=nan loss/experiment=nan loss/internlm=nan loss/requirements=nan loss/sonar-project.properties=nan loss/tests=nan loss/third_party=nan loss/tools=nan loss/train.py=nan loss/version.txt=nan loss/web_demo.py=nan loss/web_demo_internlm.py=nan
2023-11-10T11:46:24.921+08:00 INFO [training_internlm.py, line 600, in record_current_batch_training_metrics] - pid=117775 : tflops=127.02177898638753 step=2 loss=10.111491203308105 tgs (tokens/gpu/second)=1024.65 tgs/last_tgs_1=1024.66 tgs/tgs_all=494.11 tgs/tgs_avg=745.45 tgs/tgs_SMA=494.11 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=8.000000000000001e-07 loss_scale=65536.0 grad_norm={'0_default': 76.99316692997016, '1_fp32': 0.0} micro_num=4 num_consumed_tokens=196608 inf_nan_skip_batches=0 num_samples_in_batch=17 largest_length=2048 largest_batch=5 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=3.43 acc=0.0704 perplexity=25907.498 acc/.git=0.0704 acc/.github=0.0 acc/.gitignore=0.0 acc/.gitmodules=0.0 acc/.owners.yml=0.0 acc/.pre-commit-config.yaml=0.0 acc/.pylintrc=0.0 acc/.pytest_cache=0.0 acc/.readthedocs.yml=0.0 acc/.vscode=0.0 acc/7b_train=0.0 acc/CHANGE_LOG.md=0.0 acc/LICENSE=0.0 acc/README-ja-JP.md=0.0 acc/README-zh-Hans.md=0.0 acc/README.md=0.0 acc/RUN=0.0 acc/ci_scripts=0.0 acc/configs=0.0 acc/doc=0.0 acc/docker=0.0 acc/docker.Makefile=0.0 acc/experiment=0.0 acc/internlm=0.0 acc/requirements=0.0 acc/sonar-project.properties=0.0 acc/tests=0.0 acc/third_party=0.0 acc/tools=0.0 acc/train.py=0.0 acc/version.txt=0.0 acc/web_demo.py=0.0 acc/web_demo_internlm.py=0.0 tokens/.git=60244 tokens/.github=0 tokens/.gitignore=0 tokens/.gitmodules=0 tokens/.owners.yml=0 tokens/.pre-commit-config.yaml=0 tokens/.pylintrc=0 tokens/.pytest_cache=0 tokens/.readthedocs.yml=0 tokens/.vscode=0 tokens/7b_train=0 tokens/CHANGE_LOG.md=0 tokens/LICENSE=0 tokens/README-ja-JP.md=0 tokens/README-zh-Hans.md=0 tokens/README.md=0 tokens/RUN=0 tokens/ci_scripts=0 tokens/configs=0 tokens/doc=0 tokens/docker=0 tokens/docker.Makefile=0 tokens/experiment=0 tokens/internlm=0 tokens/requirements=0 tokens/sonar-project.properties=0 tokens/tests=0 tokens/third_party=0 tokens/tools=0 tokens/train.py=0 tokens/version.txt=0 tokens/web_demo.py=0 tokens/web_demo_internlm.py=0 loss_from_metric=10.1623 
loss/.git=10.1623 loss/.github=nan loss/.gitignore=nan loss/.gitmodules=nan loss/.owners.yml=nan loss/.pre-commit-config.yaml=nan loss/.pylintrc=nan loss/.pytest_cache=nan loss/.readthedocs.yml=nan loss/.vscode=nan loss/7b_train=nan loss/CHANGE_LOG.md=nan loss/LICENSE=nan loss/README-ja-JP.md=nan loss/README-zh-Hans.md=nan loss/README.md=nan loss/RUN=nan loss/ci_scripts=nan loss/configs=nan loss/doc=nan loss/docker=nan loss/docker.Makefile=nan loss/experiment=nan loss/internlm=nan loss/requirements=nan loss/sonar-project.properties=nan loss/tests=nan loss/third_party=nan loss/tools=nan loss/train.py=nan loss/version.txt=nan loss/web_demo.py=nan loss/web_demo_internlm.py=nan
2023-11-10T11:46:29.389+08:00 INFO [training_internlm.py, line 600, in record_current_batch_training_metrics] - pid=117775 : tflops=127.11695859262743 step=3 loss=8.848428726196289 tgs (tokens/gpu/second)=1025.42 tgs/last_tgs_1=1025.43 tgs/tgs_all=567.64 tgs/tgs_avg=815.45 tgs/tgs_SMA=567.64 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=1.0000000000000002e-06 loss_scale=65536.0 grad_norm={'0_default': 60.47096249182329, '1_fp32': 0.0} micro_num=4 num_consumed_tokens=262144 inf_nan_skip_batches=0 num_samples_in_batch=17 largest_length=2048 largest_batch=5 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=3.44 acc=0.0782 perplexity=7380.2217 acc/.git=0.0782 acc/.github=0.0 acc/.gitignore=0.0 acc/.gitmodules=0.0 acc/.owners.yml=0.0 acc/.pre-commit-config.yaml=0.0 acc/.pylintrc=0.0 acc/.pytest_cache=0.0 acc/.readthedocs.yml=0.0 acc/.vscode=0.0 acc/7b_train=0.0 acc/CHANGE_LOG.md=0.0 acc/LICENSE=0.0 acc/README-ja-JP.md=0.0 acc/README-zh-Hans.md=0.0 acc/README.md=0.0 acc/RUN=0.0 acc/ci_scripts=0.0 acc/configs=0.0 acc/doc=0.0 acc/docker=0.0 acc/docker.Makefile=0.0 acc/experiment=0.0 acc/internlm=0.0 acc/requirements=0.0 acc/sonar-project.properties=0.0 acc/tests=0.0 acc/third_party=0.0 acc/tools=0.0 acc/train.py=0.0 acc/version.txt=0.0 acc/web_demo.py=0.0 acc/web_demo_internlm.py=0.0 tokens/.git=60328 tokens/.github=0 tokens/.gitignore=0 tokens/.gitmodules=0 tokens/.owners.yml=0 tokens/.pre-commit-config.yaml=0 tokens/.pylintrc=0 tokens/.pytest_cache=0 tokens/.readthedocs.yml=0 tokens/.vscode=0 tokens/7b_train=0 tokens/CHANGE_LOG.md=0 tokens/LICENSE=0 tokens/README-ja-JP.md=0 tokens/README-zh-Hans.md=0 tokens/README.md=0 tokens/RUN=0 tokens/ci_scripts=0 tokens/configs=0 tokens/doc=0 tokens/docker=0 tokens/docker.Makefile=0 tokens/experiment=0 tokens/internlm=0 tokens/requirements=0 tokens/sonar-project.properties=0 tokens/tests=0 tokens/third_party=0 tokens/tools=0 tokens/train.py=0 tokens/version.txt=0 tokens/web_demo.py=0 tokens/web_demo_internlm.py=0 loss_from_metric=8.9066 
loss/.git=8.9066 loss/.github=nan loss/.gitignore=nan loss/.gitmodules=nan loss/.owners.yml=nan loss/.pre-commit-config.yaml=nan loss/.pylintrc=nan loss/.pytest_cache=nan loss/.readthedocs.yml=nan loss/.vscode=nan loss/7b_train=nan loss/CHANGE_LOG.md=nan loss/LICENSE=nan loss/README-ja-JP.md=nan loss/README-zh-Hans.md=nan loss/README.md=nan loss/RUN=nan loss/ci_scripts=nan loss/configs=nan loss/doc=nan loss/docker=nan loss/docker.Makefile=nan loss/experiment=nan loss/internlm=nan loss/requirements=nan loss/sonar-project.properties=nan loss/tests=nan loss/third_party=nan loss/tools=nan loss/train.py=nan loss/version.txt=nan loss/web_demo.py=nan loss/web_demo_internlm.py=nan
2023-11-10T11:46:33.512+08:00 INFO [training_internlm.py, line 600, in record_current_batch_training_metrics] - pid=117775 : tflops=127.046731454726 step=4 loss=7.509818077087402 tgs (tokens/gpu/second)=1024.85 tgs/last_tgs_1=1024.86 tgs/tgs_all=623.25 tgs/tgs_avg=857.33 tgs/tgs_SMA=623.25 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=1.2000000000000002e-06 loss_scale=65536.0 grad_norm={'0_default': 42.36598096083032, '1_fp32': 0.0} micro_num=4 num_consumed_tokens=327680 inf_nan_skip_batches=0 num_samples_in_batch=22 largest_length=1893 largest_batch=8 smallest_batch=4 adam_beta2=0.95 fwd_bwd_time=3.44 acc=0.0706 perplexity=2728.5999 acc/.git=0.0706 acc/.github=0.0 acc/.gitignore=0.0 acc/.gitmodules=0.0 acc/.owners.yml=0.0 acc/.pre-commit-config.yaml=0.0 acc/.pylintrc=0.0 acc/.pytest_cache=0.0 acc/.readthedocs.yml=0.0 acc/.vscode=0.0 acc/7b_train=0.0 acc/CHANGE_LOG.md=0.0 acc/LICENSE=0.0 acc/README-ja-JP.md=0.0 acc/README-zh-Hans.md=0.0 acc/README.md=0.0 acc/RUN=0.0 acc/ci_scripts=0.0 acc/configs=0.0 acc/doc=0.0 acc/docker=0.0 acc/docker.Makefile=0.0 acc/experiment=0.0 acc/internlm=0.0 acc/requirements=0.0 acc/sonar-project.properties=0.0 acc/tests=0.0 acc/third_party=0.0 acc/tools=0.0 acc/train.py=0.0 acc/version.txt=0.0 acc/web_demo.py=0.0 acc/web_demo_internlm.py=0.0 tokens/.git=61028 tokens/.github=0 tokens/.gitignore=0 tokens/.gitmodules=0 tokens/.owners.yml=0 tokens/.pre-commit-config.yaml=0 tokens/.pylintrc=0 tokens/.pytest_cache=0 tokens/.readthedocs.yml=0 tokens/.vscode=0 tokens/7b_train=0 tokens/CHANGE_LOG.md=0 tokens/LICENSE=0 tokens/README-ja-JP.md=0 tokens/README-zh-Hans.md=0 tokens/README.md=0 tokens/RUN=0 tokens/ci_scripts=0 tokens/configs=0 tokens/doc=0 tokens/docker=0 tokens/docker.Makefile=0 tokens/experiment=0 tokens/internlm=0 tokens/requirements=0 tokens/sonar-project.properties=0 tokens/tests=0 tokens/third_party=0 tokens/tools=0 tokens/train.py=0 tokens/version.txt=0 tokens/web_demo.py=0 tokens/web_demo_internlm.py=0 loss_from_metric=7.9115 
loss/.git=7.9115 loss/.github=nan loss/.gitignore=nan loss/.gitmodules=nan loss/.owners.yml=nan loss/.pre-commit-config.yaml=nan loss/.pylintrc=nan loss/.pytest_cache=nan loss/.readthedocs.yml=nan loss/.vscode=nan loss/7b_train=nan loss/CHANGE_LOG.md=nan loss/LICENSE=nan loss/README-ja-JP.md=nan loss/README-zh-Hans.md=nan loss/README.md=nan loss/RUN=nan loss/ci_scripts=nan loss/configs=nan loss/doc=nan loss/docker=nan loss/docker.Makefile=nan loss/experiment=nan loss/internlm=nan loss/requirements=nan loss/sonar-project.properties=nan loss/tests=nan loss/third_party=nan loss/tools=nan loss/train.py=nan loss/version.txt=nan loss/web_demo.py=nan loss/web_demo_internlm.py=nan
2023-11-10T11:46:37.686+08:00 INFO [training_internlm.py, line 600, in record_current_batch_training_metrics] - pid=117775 : tflops=125.95244539756375 step=5 loss=7.049615859985352 tgs (tokens/gpu/second)=1016.03 tgs/last_tgs_1=1016.04 tgs/tgs_all=666.17 tgs/tgs_avg=883.78 tgs/tgs_SMA=666.17 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=1.4000000000000001e-06 loss_scale=65536.0 grad_norm={'0_default': 32.49300931426443, '1_fp32': 0.0} micro_num=4 num_consumed_tokens=393216 inf_nan_skip_batches=0 num_samples_in_batch=13 largest_length=2048 largest_batch=4 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=3.48 acc=0.0726 perplexity=1169.7832 acc/.git=0.0726 acc/.github=0.0 acc/.gitignore=0.0 acc/.gitmodules=0.0 acc/.owners.yml=0.0 acc/.pre-commit-config.yaml=0.0 acc/.pylintrc=0.0 acc/.pytest_cache=0.0 acc/.readthedocs.yml=0.0 acc/.vscode=0.0 acc/7b_train=0.0 acc/CHANGE_LOG.md=0.0 acc/LICENSE=0.0 acc/README-ja-JP.md=0.0 acc/README-zh-Hans.md=0.0 acc/README.md=0.0 acc/RUN=0.0 acc/ci_scripts=0.0 acc/configs=0.0 acc/doc=0.0 acc/docker=0.0 acc/docker.Makefile=0.0 acc/experiment=0.0 acc/internlm=0.0 acc/requirements=0.0 acc/sonar-project.properties=0.0 acc/tests=0.0 acc/third_party=0.0 acc/tools=0.0 acc/train.py=0.0 acc/version.txt=0.0 acc/web_demo.py=0.0 acc/web_demo_internlm.py=0.0 tokens/.git=61004 tokens/.github=0 tokens/.gitignore=0 tokens/.gitmodules=0 tokens/.owners.yml=0 tokens/.pre-commit-config.yaml=0 tokens/.pylintrc=0 tokens/.pytest_cache=0 tokens/.readthedocs.yml=0 tokens/.vscode=0 tokens/7b_train=0 tokens/CHANGE_LOG.md=0 tokens/LICENSE=0 tokens/README-ja-JP.md=0 tokens/README-zh-Hans.md=0 tokens/README.md=0 tokens/RUN=0 tokens/ci_scripts=0 tokens/configs=0 tokens/doc=0 tokens/docker=0 tokens/docker.Makefile=0 tokens/experiment=0 tokens/internlm=0 tokens/requirements=0 tokens/sonar-project.properties=0 tokens/tests=0 tokens/third_party=0 tokens/tools=0 tokens/train.py=0 tokens/version.txt=0 tokens/web_demo.py=0 tokens/web_demo_internlm.py=0 loss_from_metric=7.0646 
loss/.git=7.0646 loss/.github=nan loss/.gitignore=nan loss/.gitmodules=nan loss/.owners.yml=nan loss/.pre-commit-config.yaml=nan loss/.pylintrc=nan loss/.pytest_cache=nan loss/.readthedocs.yml=nan loss/.vscode=nan loss/7b_train=nan loss/CHANGE_LOG.md=nan loss/LICENSE=nan loss/README-ja-JP.md=nan loss/README-zh-Hans.md=nan loss/README.md=nan loss/RUN=nan loss/ci_scripts=nan loss/configs=nan loss/doc=nan loss/docker=nan loss/docker.Makefile=nan loss/experiment=nan loss/internlm=nan loss/requirements=nan loss/sonar-project.properties=nan loss/tests=nan loss/third_party=nan loss/tools=nan loss/train.py=nan loss/version.txt=nan loss/web_demo.py=nan loss/web_demo_internlm.py=nan
2023-11-10 15:05:04,535 INFO parallel_context.py:555 in set_device -- process rank 9 is bound to host:HOST-10-140-60-90 device: 1
2023-11-10 15:05:04,518 INFO parallel_context.py:555 in set_device -- process rank 6 is bound to host:HOST-10-140-60-14 device: 6
2023-11-10 15:05:04,523 INFO parallel_context.py:555 in set_device -- process rank 0 is bound to host:HOST-10-140-60-14 device: 0
2023-11-10 15:05:04,524 INFO parallel_context.py:555 in set_device -- process rank 3 is bound to host:HOST-10-140-60-14 device: 3
2023-11-10 15:05:04,575 INFO parallel_context.py:555 in set_device -- process rank 15 is bound to host:HOST-10-140-60-90 device: 7
2023-11-10 15:05:04,576 INFO parallel_context.py:555 in set_device -- process rank 12 is bound to host:HOST-10-140-60-90 device: 4
2023-11-10 15:05:04,577 INFO parallel_context.py:555 in set_device -- process rank 11 is bound to host:HOST-10-140-60-90 device: 3
2023-11-10 15:05:04,582 INFO parallel_context.py:555 in set_device -- process rank 10 is bound to host:HOST-10-140-60-90 device: 2
2023-11-10 15:05:04,560 INFO parallel_context.py:555 in set_device -- process rank 5 is bound to host:HOST-10-140-60-14 device: 5
2023-11-10 15:05:04,592 INFO parallel_context.py:555 in set_device -- process rank 4 is bound to host:HOST-10-140-60-14 device: 4
2023-11-10 15:05:04,593 INFO parallel_context.py:555 in set_device -- process rank 7 is bound to host:HOST-10-140-60-14 device: 7
2023-11-10 15:05:04,624 INFO parallel_context.py:555 in set_device -- process rank 1 is bound to host:HOST-10-140-60-14 device: 1
2023-11-10 15:05:04,683 INFO parallel_context.py:555 in set_device -- process rank 8 is bound to host:HOST-10-140-60-90 device: 0
2023-11-10 15:05:04,718 INFO parallel_context.py:555 in set_device -- process rank 14 is bound to host:HOST-10-140-60-90 device: 6
2023-11-10 15:05:04,718 INFO parallel_context.py:555 in set_device -- process rank 13 is bound to host:HOST-10-140-60-90 device: 5
2023-11-10 15:05:04,723 INFO parallel_context.py:555 in set_device -- process rank 2 is bound to host:HOST-10-140-60-14 device: 2
2023-11-10 15:05:07,912 INFO launch.py:409 in launch -- Distributed environment is initialized, data parallel size: 4, pipeline parallel size: 1, tensor parallel size: 4
2023-11-10 15:05:24,106 INFO hybrid_zero_optim.py:268 in _partition_param_list -- Number of elements on ranks: [1262168320, 1269084160, 1269084160, 1222844160], rank:0
2023-11-10T15:05:58.540+08:00 INFO [training_internlm.py, line 601, in record_current_batch_training_metrics] - pid=78690 : tflops=41.977599404049684 step=0 loss=11.542577743530273 tgs (tokens/gpu/second)=338.62 tgs/last_tgs_1=338.62 tgs/tgs_all=338.62 tgs/tgs_avg=338.62 tgs/tgs_SMA=338.62 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=4.0000000000000003e-07 loss_scale=65536.0 grad_norm={'0_default': 87.3189617106087, '1_fp32': 0.0} micro_num=4 num_consumed_tokens=65536 inf_nan_skip_batches=0 num_samples_in_batch=18 largest_length=2048 largest_batch=6 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=8.94 acc=0.0 perplexity=104321.0312 acc/en=0.0 acc/cn=0.0 acc/code=0.0 tokens/en=60571 tokens/cn=0 tokens/code=0 loss_from_metric=11.5552 loss/en=11.5552 loss/cn=nan loss/code=nan
2023-11-10T15:06:02.978+08:00 INFO [training_internlm.py, line 601, in record_current_batch_training_metrics] - pid=78690 : tflops=115.41412094522278 step=1 loss=11.33798599243164 tgs (tokens/gpu/second)=931.02 tgs/last_tgs_1=931.03 tgs/tgs_all=496.62 tgs/tgs_avg=634.83 tgs/tgs_SMA=496.62 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=6.000000000000001e-07 loss_scale=65536.0 grad_norm={'0_default': 90.85008685328815, '1_fp32': 0.0} micro_num=4 num_consumed_tokens=131072 inf_nan_skip_batches=0 num_samples_in_batch=19 largest_length=2048 largest_batch=6 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=3.85 acc=0.0 perplexity=81555.5 acc/en=0.0 acc/cn=0.0 acc/code=0.0 tokens/en=60265 tokens/cn=0 tokens/code=0 loss_from_metric=11.309 loss/en=11.309 loss/cn=nan loss/code=nan
2023-11-10T15:06:06.988+08:00 INFO [training_internlm.py, line 601, in record_current_batch_training_metrics] - pid=78690 : tflops=127.89743136367036 step=2 loss=10.111495971679688 tgs (tokens/gpu/second)=1031.72 tgs/last_tgs_1=1031.72 tgs/tgs_all=600.43 tgs/tgs_avg=767.12 tgs/tgs_SMA=600.43 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=8.000000000000001e-07 loss_scale=65536.0 grad_norm={'0_default': 76.99318912653898, '1_fp32': 0.0} micro_num=4 num_consumed_tokens=196608 inf_nan_skip_batches=0 num_samples_in_batch=17 largest_length=2048 largest_batch=5 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=3.4 acc=0.0704 perplexity=25907.623 acc/en=0.0704 acc/cn=0.0 acc/code=0.0 tokens/en=60244 tokens/cn=0 tokens/code=0 loss_from_metric=10.1623 loss/en=10.1623 loss/cn=nan loss/code=nan
2023-11-10T15:06:10.994+08:00 INFO [training_internlm.py, line 601, in record_current_batch_training_metrics] - pid=78690 : tflops=127.89845291183941 step=3 loss=8.848427772521973 tgs (tokens/gpu/second)=1031.73 tgs/last_tgs_1=1031.73 tgs/tgs_all=670.5 tgs/tgs_avg=833.27 tgs/tgs_SMA=670.5 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=1.0000000000000002e-06 loss_scale=65536.0 grad_norm={'0_default': 60.47092413727133, '1_fp32': 0.0} micro_num=4 num_consumed_tokens=262144 inf_nan_skip_batches=0 num_samples_in_batch=17 largest_length=2048 largest_batch=5 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=3.41 acc=0.0783 perplexity=7380.229 acc/en=0.0783 acc/cn=0.0 acc/code=0.0 tokens/en=60328 tokens/cn=0 tokens/code=0 loss_from_metric=8.9066 loss/en=8.9066 loss/cn=nan loss/code=nan
2023-11-10T15:06:15.041+08:00 INFO [training_internlm.py, line 601, in record_current_batch_training_metrics] - pid=78690 : tflops=126.55593705224216 step=4 loss=7.509810924530029 tgs (tokens/gpu/second)=1020.9 tgs/last_tgs_1=1020.9 tgs/tgs_all=719.92 tgs/tgs_avg=870.8 tgs/tgs_SMA=719.92 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=1.2000000000000002e-06 loss_scale=65536.0 grad_norm={'0_default': 42.36608180721121, '1_fp32': 0.0} micro_num=4 num_consumed_tokens=327680 inf_nan_skip_batches=0 num_samples_in_batch=22 largest_length=1893 largest_batch=8 smallest_batch=4 adam_beta2=0.95 fwd_bwd_time=3.43 acc=0.0706 perplexity=2728.5764 acc/en=0.0706 acc/cn=0.0 acc/code=0.0 tokens/en=61028 tokens/cn=0 tokens/code=0 loss_from_metric=7.9115 loss/en=7.9115 loss/cn=nan loss/code=nan
2023-11-10T15:06:19.051+08:00 INFO [training_internlm.py, line 601, in record_current_batch_training_metrics] - pid=78690 : tflops=127.79902453659938 step=5 loss=7.049621105194092 tgs (tokens/gpu/second)=1030.92 tgs/last_tgs_1=1030.93 tgs/tgs_all=758.03 tgs/tgs_avg=897.49 tgs/tgs_SMA=758.03 tgs/last_tgs_10=0 tgs/last_tgs_50=0 lr=1.4000000000000001e-06 loss_scale=65536.0 grad_norm={'0_default': 32.49298677335042, '1_fp32': 0.0} micro_num=4 num_consumed_tokens=393216 inf_nan_skip_batches=0 num_samples_in_batch=13 largest_length=2048 largest_batch=4 smallest_batch=3 adam_beta2=0.95 fwd_bwd_time=3.42 acc=0.0726 perplexity=1169.7916 acc/en=0.0726 acc/cn=0.0 acc/code=0.0 tokens/en=61004 tokens/cn=0 tokens/code=0 loss_from_metric=7.0646 loss/en=7.0646 loss/cn=nan loss/code=nan
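
Each metrics line in the log above is a flat run of `key=value` pairs, so it can be read mechanically when comparing runs. The following is a minimal sketch, not part of InternLM: `parse_metrics` is a hypothetical helper, and the sample string is shortened from the step 0 line of the new log. It keeps only numeric and `nan` values, skipping entries such as `grad_norm={...}` and `tgs (tokens/gpu/second)`. In the lines shown, `perplexity` appears to equal `exp(loss_from_metric)` up to rounding; the final check relies on that observation rather than on any documented guarantee.

```python
import math
import re

# Hypothetical helper (not an InternLM API): matches key=value pairs whose
# value is a plain number, a number in scientific notation, or "nan".
_PAIR = re.compile(r"([\w./-]+)=(-?[\d.]+(?:e[+-]?\d+)?|nan)")


def parse_metrics(line: str) -> dict:
    """Return the numeric key=value pairs found in one metrics log line."""
    return {key: float(value) for key, value in _PAIR.findall(line)}


# Shortened from the step 0 line of the log above; field order is unchanged.
sample = (
    "pid=78690 : tflops=41.977599404049684 step=0 loss=11.542577743530273 "
    "lr=4.0000000000000003e-07 num_consumed_tokens=65536 "
    "perplexity=104321.0312 tokens/en=60571 loss_from_metric=11.5552 "
    "loss/en=11.5552 loss/cn=nan"
)

metrics = parse_metrics(sample)

# Assumption drawn from the logged values: perplexity ~= exp(loss_from_metric).
consistent = math.isclose(
    math.exp(metrics["loss_from_metric"]), metrics["perplexity"], rel_tol=1e-3
)
print(metrics["step"], metrics["num_consumed_tokens"], consistent)
```

The same function can be applied to every line containing `record_current_batch_training_metrics` to build a per-step table of `tflops`, `loss`, and `tgs/*` values.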