From 85706ef02fe65bc699b0f5334a8af87b1071579c Mon Sep 17 00:00:00 2001
From: Sun Peng
Date: Sat, 8 Jul 2023 20:36:02 +0800
Subject: [PATCH] doc(performance): fix some typos

---
 README-zh-Hans.md           | 8 ++++----
 README.md                   | 8 ++++----
 doc/en/train_performance.md | 4 ++--
 doc/train_performance.md    | 9 +++++----
 4 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/README-zh-Hans.md b/README-zh-Hans.md
index 1d669ba..f1d3d8a 100644
--- a/README-zh-Hans.md
+++ b/README-zh-Hans.md
@@ -162,12 +162,12 @@ python convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer

 InternLM 深度整合了 Flash-Attention, Apex 等高性能模型算子,提高了训练效率。通过构建 Hybrid Zero 技术,实现计算和通信的高效重叠,大幅降低了训练过程中的跨节点通信流量。InternLM 支持 7B 模型从 8 卡扩展到 1024 卡,千卡规模下加速效率可高达 90%,训练吞吐超过 180TFLOPS,平均单卡每秒处理的 token 数量超过3600。下表为 InternLM 在不同配置下的扩展性测试数据:

-| GPU数量 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
+| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
 | ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
-| TKS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |

-TKS 代表平均每GPU每秒可以处理的 Token 数量。更多的性能测试数据可参考[训练性能文档](./doc/train_performance.md)进一步了解。
+TGS 代表平均每GPU每秒可以处理的 Token 数量。更多的性能测试数据可参考[训练性能文档](./doc/train_performance.md)进一步了解。

 ## 贡献

diff --git a/README.md b/README.md
index a942468..8e546ee 100644
--- a/README.md
+++ b/README.md
@@ -165,10 +165,10 @@ Please refer to the [System Architecture document](./doc/en/structure.md) for fu

 InternLM deeply integrates Flash-Attention, Apex and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternLM supports expanding the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternLM's scalability test data at different configurations:

-| Number of GPUs | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
-| -------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
+| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
+| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |

 TGS represents the average number of tokens processed per GPU per second. For more performance test data, please refer to the [Training Performance document](./doc/en/train_performance.md) for further details.
diff --git a/doc/en/train_performance.md b/doc/en/train_performance.md
index bd916c4..9c77d9e 100644
--- a/doc/en/train_performance.md
+++ b/doc/en/train_performance.md
@@ -6,7 +6,7 @@ InternLM deeply integrates Flash-Attention, Apex, and other high-performance mod
 | GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
 | ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
 | TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |

 We tested the performance of training the 7B model in InternLM using various parallel configurations on a GPU cluster. In each test group, the number of tokens processed per GPU in a single iteration remained consistent. The hardware and parameter configurations used in the tests are shown in the table below:

@@ -53,7 +53,7 @@ When `Activation Ckpt` is enabled,the test results are shown in the table belo

 - TGS: Tokens per GPU per Second

-- Global Bsz: 一个step中所有GPU处理的token数量
+- Global Bsz: The total number of processed tokens with all GPUs in a step

 | TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
 |-|-|-|-|-|-|-|-|-|-|-|
diff --git a/doc/train_performance.md b/doc/train_performance.md
index f5ff0bf..239e20f 100644
--- a/doc/train_performance.md
+++ b/doc/train_performance.md
@@ -2,10 +2,10 @@

 InternLM 深度整合了 Flash-Attention, Apex 等高性能模型算子,提高了训练效率。通过构建 Hybrid Zero 技术,实现计算和通信的高效重叠,大幅降低了训练过程中的跨节点通信流量。InternLM 支持 7B 模型从 8 卡扩展到 1024 卡,千卡规模下加速效率可高达 90%,训练吞吐超过 180TFLOPS,平均单卡每秒处理的 token 数量超过3600。下表为 InternLM 在不同配置下的扩展性测试数据:

-| InternLM | 8卡 | 16卡 | 32卡 | 64卡 | 128卡 | 256卡 | 512卡 | 1024卡 |
+| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
 | ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
-| TKS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |

 我们在GPU集群上测试了多种并行配置下,InternLM训练7B模型的性能。在每组测试中,每张GPU在单次迭代中处理的token数量一致。测试使用的硬件和参数配置如下表所示:

@@ -51,7 +51,8 @@ InternLM中`zero1`的配置决定了优化器状态的分配范围。

 - TGS: Tokens per GPU per Second

-- Global Bsz: The total number of processed tokens with all GPUs in a step
+- Global Bsz: 一个step中所有GPU处理的token数量
+
 | TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
 |-|-|-|-|-|-|-|-|-|-|-|
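Outside the patch itself, a quick sanity check on the corrected figures: the "acceleration efficiency of up to 90% at the thousand-GPU scale" quoted in both READMEs follows directly from the TGS row, since per-GPU throughput at N GPUs divided by per-GPU throughput at 8 GPUs gives the scaling efficiency. A minimal Python sketch (variable names are illustrative, not from the repository):

```python
# TGS (tokens per GPU per second) values copied from the corrected tables above.
tgs = {8: 4078, 16: 3939, 32: 3919, 64: 3944, 128: 3928, 256: 3920, 512: 3835, 1024: 3625}

baseline = tgs[8]
for gpus, value in tgs.items():
    efficiency = value / baseline        # per-GPU throughput relative to the 8-GPU run
    cluster_tokens_per_s = value * gpus  # aggregate throughput across the whole cluster
    print(f"{gpus:>4} GPUs: {efficiency:6.1%} scaling efficiency, {cluster_tokens_per_s:,} tokens/s")
```

At 1024 GPUs this gives 3625 / 4078 ≈ 88.9%, consistent with the roughly 90% figure, and an aggregate throughput of over 3.7M tokens per second.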