fix[performance]: fix the performance evaluation mistakes (#40)

* fix(no_pp_scheduler): drop out and label if not used

* Update train_performance.md

* Update readme with new tested data

* update some typos

* doc(performance): fix some typos
Sun Peng 2023-07-08 20:42:34 +08:00 committed by GitHub
parent 4a3d15650e
commit c18bec9361
4 changed files with 15 additions and 14 deletions


@@ -162,12 +162,12 @@ python convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer
InternLM deeply integrates Flash-Attention, Apex, and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternLM supports scaling the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens processed per GPU per second. The table below shows InternLM's scalability test data under different configurations:
-| GPU数量 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
+| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
-| TKS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
-TKS denotes the average number of tokens processed per GPU per second. For more performance test data, please refer to the [training performance document](./doc/train_performance.md) for further details.
+TGS denotes the average number of tokens processed per GPU per second. For more performance test data, please refer to the [training performance document](./doc/train_performance.md) for further details.
## Contributing


@@ -165,10 +165,10 @@ Please refer to the [System Architecture document](./doc/en/structure.md) for fu
InternLM deeply integrates Flash-Attention, Apex, and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternLM supports scaling the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens processed per GPU per second. The table below shows InternLM's scalability test data under different configurations:
-| Number of GPUs | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
-| -------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
+| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
+| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
TGS represents the average number of tokens processed per GPU per second. For more performance test data, please refer to the [Training Performance document](./doc/en/train_performance.md) for further details.
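
As a quick sanity check, here is a minimal Python sketch of how the roughly 90% figure at 1024 GPUs follows from the TGS row above, assuming acceleration efficiency means per-GPU throughput relative to the 8-GPU baseline (a definition the document does not state explicitly):

```python
# Scaling-efficiency check based on the updated TGS column above.
# Assumption (not stated in the docs): efficiency = TGS at N GPUs / TGS at 8 GPUs.
tgs = {8: 4078, 16: 3939, 32: 3919, 64: 3944,
       128: 3928, 256: 3920, 512: 3835, 1024: 3625}

baseline = tgs[8]
for gpus, value in tgs.items():
    efficiency = value / baseline
    # e.g. 1024 GPUs: 3625 / 4078 ≈ 0.889, close to the quoted ~90%
    print(f"{gpus:>4} GPUs: TGS={value}, cluster tokens/s={gpus * value}, efficiency={efficiency:.1%}")
```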


@@ -6,7 +6,7 @@ InternLM deeply integrates Flash-Attention, Apex, and other high-performance mod
| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
| TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
We tested the performance of training the 7B model in InternLM using various parallel configurations on a GPU cluster. In each test group, the number of tokens processed per GPU in a single iteration remained consistent. The hardware and parameter configurations used in the tests are shown in the table below:
@@ -53,7 +53,7 @@ When `Activation Ckpt` is enabled, the test results are shown in the table belo
- TGS: Tokens per GPU per Second
-- Global Bsz: 一个step中所有GPU处理的token数量
+- Global Bsz: The total number of processed tokens with all GPUs in a step
| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
|-|-|-|-|-|-|-|-|-|-|-|
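
For readers of this table, a minimal sketch of how `Global Bsz` and `TGS` could be derived from the other columns, under assumptions the document does not spell out (no pipeline parallelism, data-parallel size = GPU Num / TP, Global Bsz counted in tokens, and TGS spreading each step's tokens evenly over all GPUs); the example numbers are illustrative, not taken from the test results:

```python
# Hedged sketch relating the table columns: Seq Len, Micro Bsz, Micro Num,
# GPU Num, TP -> Global Bsz (tokens per step) and TGS (tokens/GPU/second).

def global_bsz(seq_len: int, micro_bsz: int, micro_num: int, gpu_num: int, tp: int) -> int:
    """Tokens processed by all GPUs in one step (assumes pipeline parallelism of 1)."""
    dp = gpu_num // tp  # data-parallel ranks
    return seq_len * micro_bsz * micro_num * dp

def tgs(seq_len: int, micro_bsz: int, micro_num: int, tp: int, step_time_s: float) -> float:
    """Tokens per GPU per second; TP ranks share the same tokens, so divide by TP."""
    tokens_per_gpu = seq_len * micro_bsz * micro_num / tp
    return tokens_per_gpu / step_time_s

# Hypothetical configuration: 2048-token sequences, Micro Bsz 4, Micro Num 4,
# 128 GPUs with TP=1, and an assumed 8.2 s step time.
print(global_bsz(2048, 4, 4, 128, 1))  # 4194304 tokens per step
print(tgs(2048, 4, 4, 1, 8.2))         # ~3996 tokens per GPU per second
```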


@@ -2,10 +2,10 @@
InternLM deeply integrates Flash-Attention, Apex, and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternLM supports scaling the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens processed per GPU per second. The table below shows InternLM's scalability test data under different configurations:
-| InternLM | 8卡 | 16卡 | 32卡 | 64卡 | 128卡 | 256卡 | 512卡 | 1024卡 |
+| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
-| TKS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
We tested the performance of training the 7B model in InternLM with various parallel configurations on a GPU cluster. In each test group, the number of tokens processed per GPU in a single iteration was kept consistent. The hardware and parameter configurations used in the tests are shown in the table below:
@@ -51,7 +51,8 @@ In InternLM, the `zero1` configuration determines the partitioning scope of optimizer states.
- TGS: Tokens per GPU per Second
- Global Bsz: The total number of processed tokens with all GPUs in a step
+- Global Bsz: 一个step中所有GPU处理的token数量
| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
|-|-|-|-|-|-|-|-|-|-|-|