From 85706ef02fe65bc699b0f5334a8af87b1071579c Mon Sep 17 00:00:00 2001
From: Sun Peng
Date: Sat, 8 Jul 2023 20:36:02 +0800
Subject: [PATCH] doc(performance): fix some typos

---
 README-zh-Hans.md           | 8 ++++----
 README.md                   | 8 ++++----
 doc/en/train_performance.md | 4 ++--
 doc/train_performance.md    | 9 +++++----
 4 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/README-zh-Hans.md b/README-zh-Hans.md
index 1d669ba..f1d3d8a 100644
--- a/README-zh-Hans.md
+++ b/README-zh-Hans.md
@@ -162,12 +162,12 @@ python convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer

 InternLM 深度整合了 Flash-Attention, Apex 等高性能模型算子,提高了训练效率。通过构建 Hybrid Zero 技术,实现计算和通信的高效重叠,大幅降低了训练过程中的跨节点通信流量。InternLM 支持 7B 模型从 8 卡扩展到 1024 卡,千卡规模下加速效率可高达 90%,训练吞吐超过 180TFLOPS,平均单卡每秒处理的 token 数量超过3600。下表为 InternLM 在不同配置下的扩展性测试数据:

-| GPU数量 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
+| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
 | ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
-| TKS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |

-TKS 代表平均每GPU每秒可以处理的 Token 数量。更多的性能测试数据可参考[训练性能文档](./doc/train_performance.md)进一步了解。
+TGS 代表平均每GPU每秒可以处理的 Token 数量。更多的性能测试数据可参考[训练性能文档](./doc/train_performance.md)进一步了解。

 ## 贡献

diff --git a/README.md b/README.md
index a942468..8e546ee 100644
--- a/README.md
+++ b/README.md
@@ -165,10 +165,10 @@ Please refer to the [System Architecture document](./doc/en/structure.md) for fu

 InternLM deeply integrates Flash-Attention, Apex and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternLM supports expanding the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternLM's scalability test data at different configurations:

-| Number of GPUs | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
-| -------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
+| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
+| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |

 TGS represents the average number of tokens processed per GPU per second. For more performance test data, please refer to the [Training Performance document](./doc/en/train_performance.md) for further details.
diff --git a/doc/en/train_performance.md b/doc/en/train_performance.md
index bd916c4..9c77d9e 100644
--- a/doc/en/train_performance.md
+++ b/doc/en/train_performance.md
@@ -6,7 +6,7 @@ InternLM deeply integrates Flash-Attention, Apex, and other high-performance mod
 | GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
 | ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
 | TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |

 We tested the performance of training the 7B model in InternLM using various parallel configurations on a GPU cluster. In each test group, the number of tokens processed per GPU in a single iteration remained consistent. The hardware and parameter configurations used in the tests are shown in the table below:

@@ -53,7 +53,7 @@ When `Activation Ckpt` is enabled,the test results are shown in the table belo

 - TGS: Tokens per GPU per Second

-- Global Bsz: 一个step中所有GPU处理的token数量
+- Global Bsz: The total number of processed tokens with all GPUs in a step

 | TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
 |-|-|-|-|-|-|-|-|-|-|-|
diff --git a/doc/train_performance.md b/doc/train_performance.md
index f5ff0bf..239e20f 100644
--- a/doc/train_performance.md
+++ b/doc/train_performance.md
@@ -2,10 +2,10 @@

 InternLM 深度整合了 Flash-Attention, Apex 等高性能模型算子,提高了训练效率。通过构建 Hybrid Zero 技术,实现计算和通信的高效重叠,大幅降低了训练过程中的跨节点通信流量。InternLM 支持 7B 模型从 8 卡扩展到 1024 卡,千卡规模下加速效率可高达 90%,训练吞吐超过 180TFLOPS,平均单卡每秒处理的 token 数量超过3600。下表为 InternLM 在不同配置下的扩展性测试数据:

-| InternLM | 8卡 | 16卡 | 32卡 | 64卡 | 128卡 | 256卡 | 512卡 | 1024卡 |
+| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
 | ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
-| TKS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |

 我们在GPU集群上测试了多种并行配置下,InternLM训练7B模型的性能。在每组测试中,每张GPU在单次迭代中处理的token数量一致。测试使用的硬件和参数配置如下表所示:

@@ -51,7 +51,8 @@ InternLM中`zero1`的配置决定了优化器状态的分配范围。

 - TGS: Tokens per GPU per Second

-- Global Bsz: The total number of processed tokens with all GPUs in a step
+- Global Bsz: 一个step中所有GPU处理的token数量
+
 | TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
 |-|-|-|-|-|-|-|-|-|-|-|
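Outside the patch itself, a quick sanity check on the corrected figures: the "acceleration efficiency of up to 90% at the thousand-GPU scale" quoted in both READMEs follows directly from the TGS row, since per-GPU throughput at N GPUs divided by per-GPU throughput at 8 GPUs gives the scaling efficiency. A minimal Python sketch (variable names are illustrative, not from the repository):

```python
# TGS (tokens per GPU per second) values copied from the corrected tables above.
tgs = {8: 4078, 16: 3939, 32: 3919, 64: 3944, 128: 3928, 256: 3920, 512: 3835, 1024: 3625}

baseline = tgs[8]
for gpus, value in tgs.items():
    efficiency = value / baseline        # per-GPU throughput relative to the 8-GPU run
    cluster_tokens_per_s = value * gpus  # aggregate throughput across the whole cluster
    print(f"{gpus:>4} GPUs: {efficiency:6.1%} scaling efficiency, {cluster_tokens_per_s:,} tokens/s")
```

At 1024 GPUs this gives 3625 / 4078 ≈ 88.9%, consistent with the roughly 90% figure, and an aggregate throughput of over 3.7M tokens per second.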