pull/6073/head
wangbluo 2024-10-07 10:29:58 +08:00
parent 765db38e48
commit ceadef35d5
2 changed files with 2 additions and 2 deletions

View File

@ -157,7 +157,7 @@ Currently, the `MoeHybridParallelPlugin` only supports DeepSpeed-Ulysses sequenc
### Conclusion
Among the sequence parallelism methods mentioned, both ring attention and Ulysses have their pros and cons, and we need to choose the appropriate sequence parallelism method based on the situation:
Communication: Ulysses has lower communication overhead compared to ring attention, as it primarily involves three All-to-All communication ops, whereas the communication cost of ring attention grows quadratically with the sequence length. However, on the other hand, All-to-All op also demands more bandwidth from the hardware.
Communication: Ulysses has lower communication overhead compared to ring attention, as it primarily involves three All-to-All communication ops, whereas the communication cost of ring attention grows quadratically with the sequence length. However, on the other hand, All-to-All op also demands dense network topologies, e.g. NVLink + NVSwitch, so it doesn't scale well across multiple nodes.
Memory usage: Both are similar in terms of memory consumption.

View File

@ -157,7 +157,7 @@ for step, batch in enumerate(tqdm(dataloader, desc="Step", disable=not dist.get_
### 结论
在上述序列并行方法中ring attention和Ulysses各有优劣我们需要根据情况来选择合适的序列并行方法
通信方面Ulysses通信量优于ring attentionUlysess主要包含三次All2All通信量,而ring attention的通信会随着序列长度增长而平方增长。不过另一方面all2all对底层硬件的要求也会更高
通信方面Ulysses通信量优于ring attentionUlysess主要包含三次All2All通信量,而ring attention的通信会随着序列长度增长而平方增长。不过另一方面all2all op由于需要更复杂的网络拓扑例如NVLink和NVSwitch因此在多机情况时并不会随着机器数量增加而有较好的性能提升
内存占用:二者类似。