fix[performance]: fix the performance evaluation mistakes (#40)

* fix(no_pp_scheduler): drop out and label if not used

* Update train_performance.md

* Update readme with new tested data

* update some typos

* doc(performance): fix some typos
Sun Peng 2023-07-08 20:42:34 +08:00 committed by GitHub
parent 4a3d15650e
commit c18bec9361
4 changed files with 15 additions and 14 deletions


@@ -162,12 +162,12 @@ python convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer
InternLM deeply integrates Flash-Attention, Apex, and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternLM supports scaling the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens processed per GPU per second. The table below shows InternLM's scalability test data under different configurations:
-| GPU数量 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
+| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
-| TKS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
-TKS denotes the average number of tokens processed per GPU per second. For more performance test data, please refer to the [training performance document](./doc/train_performance.md) for further details.
+TGS denotes the average number of tokens processed per GPU per second. For more performance test data, please refer to the [training performance document](./doc/train_performance.md) for further details.
## Contributing


@@ -165,10 +165,10 @@ Please refer to the [System Architecture document](./doc/en/structure.md) for fu
InternLM deeply integrates Flash-Attention, Apex, and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternLM supports scaling the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens processed per GPU per second. The table below shows InternLM's scalability test data under different configurations:
-| Number of GPUs | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
-| -------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
+| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
+| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
TGS represents the average number of tokens processed per GPU per second. For more performance test data, please refer to the [Training Performance document](./doc/en/train_performance.md) for further details.
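
As a quick sanity check, here is a minimal Python sketch of how the roughly 90% figure at 1024 GPUs follows from the TGS row above, assuming acceleration efficiency means per-GPU throughput relative to the 8-GPU baseline (a definition the document does not state explicitly):

```python
# Scaling-efficiency check based on the updated TGS column above.
# Assumption (not stated in the docs): efficiency = TGS at N GPUs / TGS at 8 GPUs.
tgs = {8: 4078, 16: 3939, 32: 3919, 64: 3944,
       128: 3928, 256: 3920, 512: 3835, 1024: 3625}

baseline = tgs[8]
for gpus, value in tgs.items():
    efficiency = value / baseline
    # e.g. 1024 GPUs: 3625 / 4078 ≈ 0.889, close to the quoted ~90%
    print(f"{gpus:>4} GPUs: TGS={value}, cluster tokens/s={gpus * value}, efficiency={efficiency:.1%}")
```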


@@ -6,7 +6,7 @@ InternLM deeply integrates Flash-Attention, Apex, and other high-performance mod
| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
| TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
We tested the performance of training the 7B model in InternLM using various parallel configurations on a GPU cluster. In each test group, the number of tokens processed per GPU in a single iteration remained consistent. The hardware and parameter configurations used in the tests are shown in the table below:
@@ -53,7 +53,7 @@ When `Activation Ckpt` is enabled, the test results are shown in the table belo
- TGS: Tokens per GPU per Second
-- Global Bsz: 一个step中所有GPU处理的token数量
+- Global Bsz: The total number of processed tokens with all GPUs in a step
| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
|-|-|-|-|-|-|-|-|-|-|-|
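
For readers of this table, a minimal sketch of how `Global Bsz` and `TGS` could be derived from the other columns, under assumptions the document does not spell out (no pipeline parallelism, data-parallel size = GPU Num / TP, Global Bsz counted in tokens, and TGS spreading each step's tokens evenly over all GPUs); the example numbers are illustrative, not taken from the test results:

```python
# Hedged sketch relating the table columns: Seq Len, Micro Bsz, Micro Num,
# GPU Num, TP -> Global Bsz (tokens per step) and TGS (tokens/GPU/second).

def global_bsz(seq_len: int, micro_bsz: int, micro_num: int, gpu_num: int, tp: int) -> int:
    """Tokens processed by all GPUs in one step (assumes pipeline parallelism of 1)."""
    dp = gpu_num // tp  # data-parallel ranks
    return seq_len * micro_bsz * micro_num * dp

def tgs(seq_len: int, micro_bsz: int, micro_num: int, tp: int, step_time_s: float) -> float:
    """Tokens per GPU per second; TP ranks share the same tokens, so divide by TP."""
    tokens_per_gpu = seq_len * micro_bsz * micro_num / tp
    return tokens_per_gpu / step_time_s

# Hypothetical configuration: 2048-token sequences, Micro Bsz 4, Micro Num 4,
# 128 GPUs with TP=1, and an assumed 8.2 s step time.
print(global_bsz(2048, 4, 4, 128, 1))  # 4194304 tokens per step
print(tgs(2048, 4, 4, 1, 8.2))         # ~3996 tokens per GPU per second
```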


@@ -2,10 +2,10 @@
InternLM deeply integrates Flash-Attention, Apex, and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternLM supports scaling the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens processed per GPU per second. The table below shows InternLM's scalability test data under different configurations:
-| InternLM | 8卡 | 16卡 | 32卡 | 64卡 | 128卡 | 256卡 | 512卡 | 1024卡 |
+| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
-| TKS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
We tested the performance of training the 7B model in InternLM with various parallel configurations on a GPU cluster. In each test group, the number of tokens processed per GPU in a single iteration was kept consistent. The hardware and parameter configurations used in the tests are shown in the table below:
@@ -51,7 +51,8 @@ In InternLM, the `zero1` configuration determines the partitioning scope of optimizer states.
- TGS: Tokens per GPU per Second
- Global Bsz: The total number of processed tokens with all GPUs in a step
+- Global Bsz: 一个step中所有GPU处理的token数量
| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
|-|-|-|-|-|-|-|-|-|-|-|