mirror of https://github.com/InternLM/InternLM
fix[performance]: fix the performance evaluation mistakes (#40)
* fix(no_pp_scheduler): drop out and label if not used
* Update train_performance.md
* Update readme with new tested data
* update some typos
* doc(performance): fix some typos

pull/52/head
parent 4a3d15650e
commit c18bec9361
@@ -162,12 +162,12 @@ python convert2hf.py --src_folder origin_ckpt/ --tgt_folder hf_ckpt/ --tokenizer
InternLM deeply integrates Flash-Attention, Apex and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternLM supports expanding the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternLM's scalability test data at different configurations:

-| GPU数量 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
+| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
-| TKS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |

-TKS represents the average number of tokens processed per GPU per second. For more performance test data, please refer to the [training performance document](./doc/train_performance.md) for further details.
+TGS represents the average number of tokens processed per GPU per second. For more performance test data, please refer to the [training performance document](./doc/train_performance.md) for further details.

## Contribution
@@ -165,10 +165,10 @@ Please refer to the [System Architecture document](./doc/en/structure.md) for fu
InternLM deeply integrates Flash-Attention, Apex and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternLM supports expanding the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternLM's scalability test data at different configurations:

-| Number of GPUs | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
-| -------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
-| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
+| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
+| TGS | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
TGS represents the average number of tokens processed per GPU per second. For more performance test data, please refer to the [Training Performance document](./doc/en/train_performance.md) for further details.
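
As a quick sanity check on the figures above, the quoted ~90% acceleration efficiency at 1024 GPUs follows directly from the TGS column (3625 / 4078 ≈ 88.9%). A minimal sketch, not code from the repository, with illustrative variable names:

```python
# Scaling efficiency and aggregate throughput derived from the TGS column above.
tgs = {8: 4078, 16: 3939, 32: 3919, 64: 3944,
       128: 3928, 256: 3920, 512: 3835, 1024: 3625}

baseline = tgs[8]  # per-GPU throughput of the smallest (8-GPU) run
for gpus, per_gpu in tgs.items():
    efficiency = per_gpu / baseline      # per-GPU throughput relative to the 8-GPU run
    cluster_tokens = per_gpu * gpus      # tokens processed per second by the whole job
    print(f"{gpus:>4} GPUs  efficiency {efficiency:6.1%}  {cluster_tokens / 1e6:5.2f}M tokens/s")

# 1024 GPUs: 3625 / 4078 ≈ 88.9%, i.e. the ~90% acceleration efficiency quoted above.
```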
@@ -6,7 +6,7 @@ InternLM deeply integrates Flash-Attention, Apex, and other high-performance mod
| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
| TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |
We tested the performance of training the 7B model in InternLM using various parallel configurations on a GPU cluster. In each test group, the number of tokens processed per GPU in a single iteration remained consistent. The hardware and parameter configurations used in the tests are shown in the table below:
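
To make the "consistent number of tokens processed per GPU per iteration" constraint concrete, here is a minimal sketch; the configuration numbers are invented for illustration and are not rows from the tables in this document:

```python
# Tokens a single GPU consumes in one training iteration:
# micro_num micro-batches, each with micro_bsz sequences of seq_len tokens.
def tokens_per_gpu_per_iter(seq_len: int, micro_bsz: int, micro_num: int) -> int:
    return seq_len * micro_bsz * micro_num

# Two hypothetical test groups that keep the per-GPU token count identical:
assert tokens_per_gpu_per_iter(seq_len=2048, micro_bsz=2, micro_num=4) == 16384
assert tokens_per_gpu_per_iter(seq_len=4096, micro_bsz=1, micro_num=4) == 16384
```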
@@ -53,7 +53,7 @@ When `Activation Ckpt` is enabled, the test results are shown in the table belo
- TGS: Tokens per GPU per Second
-- Global Bsz: 一个step中所有GPU处理的token数量
+- Global Bsz: The total number of processed tokens with all GPUs in a step
| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
|-|-|-|-|-|-|-|-|-|-|-|
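
The columns of this table can be related to one another with a small amount of token accounting. The sketch below assumes the data-parallel size is `GPU Num / TP` (i.e. no pipeline parallelism for the rows shown) and uses made-up numbers; it is an illustration, not the script that produced the measurements:

```python
# Relate the table's configuration columns to Global Bsz and TGS.
def global_bsz(seq_len: int, micro_bsz: int, micro_num: int, gpu_num: int, tp: int) -> int:
    dp = gpu_num // tp                              # data-parallel replicas (assumed: no PP)
    tokens_per_replica = seq_len * micro_bsz * micro_num
    return tokens_per_replica * dp                  # tokens processed with all GPUs in a step

def tgs(global_tokens: int, gpu_num: int, step_time_s: float) -> float:
    """TGS: tokens per GPU per second."""
    return global_tokens / gpu_num / step_time_s

# Hypothetical row: 128 GPUs, TP=2, Seq Len=2048, Micro Bsz=2, Micro Num=4, 4.0 s per step.
gb = global_bsz(2048, 2, 4, gpu_num=128, tp=2)      # 16384 * 64 = 1,048,576 tokens per step
print(gb, tgs(gb, gpu_num=128, step_time_s=4.0))    # -> 1048576 2048.0
```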
@@ -2,10 +2,10 @@
InternLM deeply integrates Flash-Attention, Apex and other high-performance model operators to improve training efficiency. By building the Hybrid Zero technique, it achieves efficient overlap of computation and communication, significantly reducing cross-node communication traffic during training. InternLM supports expanding the 7B model from 8 GPUs to 1024 GPUs, with an acceleration efficiency of up to 90% at the thousand-GPU scale, a training throughput of over 180 TFLOPS, and an average of over 3600 tokens per GPU per second. The following table shows InternLM's scalability test data at different configurations:

-| InternLM | 8卡 | 16卡 | 32卡 | 64卡 | 128卡 | 256卡 | 512卡 | 1024卡 |
+| GPU Number | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| ---------------- | ---- | ---- | ---- | ---- | ----- | ----- | ----- | ------ |
-| TKS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
-| TFLOPS | 192 | 192 | 186 | 186 | 185 | 185 | 186 | 182 |
+| TGS (Tokens/GPU/Second) | 4078 | 3939 | 3919 | 3944 | 3928 | 3920 | 3835 | 3625 |
+| TFLOPS | 193 | 191 | 188 | 188 | 187 | 185 | 186 | 184 |

We tested the performance of training the 7B model in InternLM using various parallel configurations on a GPU cluster. In each test group, the number of tokens processed per GPU in a single iteration remained consistent. The hardware and parameter configurations used in the tests are shown in the table below:
@@ -51,7 +51,8 @@ The `zero1` configuration in InternLM determines the allocation scope of the optimizer states.
- TGS: Tokens per GPU per Second
- Global Bsz: The total number of processed tokens with all GPUs in a step
+- Global Bsz: 一个step中所有GPU处理的token数量
| TP | Zero1 | Pack Sample Into One | Activation Ckpt | GPU Num | Seq Len | Micro Bsz | Micro Num | Global Bsz | TGS | TFLOPS |
|-|-|-|-|-|-|-|-|-|-|-|