Amdahl's law and scaling experiments

Whether in the experiments section of a paper or in the performance analysis of various parallel algorithms, scaling efficiency is always an important property, and Amdahl's law is its foundation.

Weak scaling

We increase the problem size proportionally to the number of processors so that the work per processor stays the same. For example, assume the work per processor is 50 tasks; when testing the scaling ability, you could choose the following combinations of (threadnum, tasknum): (1,50), (2,100), (3,150), (4,200). The key idea behind weak scaling is that the work per process is a constant value. The ideal weak-scaling curve is a horizontal line.
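As a concrete illustration, here is a minimal Python sketch of how such a weak-scaling sweep could be set up. The `run_tasks(n_threads, n_tasks)` timing function is a hypothetical stand-in for whatever workload is actually being measured, not part of any real library.

```python
# Minimal weak-scaling sketch: keep tasks-per-thread fixed at 50 and
# compare run times. `run_tasks(n_threads, n_tasks)` is a hypothetical
# timing function standing in for the real workload.

def weak_scaling_results(run_tasks, tasks_per_thread=50, thread_counts=(1, 2, 3, 4)):
    # Ideal weak scaling: run time stays flat, so efficiency t1 / tN stays near 1.
    t1 = run_tasks(1, tasks_per_thread)
    results = []
    for n in thread_counts:
        tn = run_tasks(n, n * tasks_per_thread)  # (threadnum, tasknum) = (n, 50*n)
        results.append((n, tn, t1 / tn))         # (threads, run time, weak-scaling efficiency)
    return results
```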

Strong scaling

We keep the problem size constant but increase the number of processors. The question is how much the execution time improves as more resources are used. For example, the task number is fixed at 500, and we increase the thread number from 1, 2, 4, 8 to 16 to measure the improvement in performance. The key idea is that the task size is fixed.
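A matching sketch for the strong-scaling case, again assuming the same hypothetical `run_tasks` timing function, could look like this:

```python
# Minimal strong-scaling sketch: fix the problem size (e.g. 500 tasks) and
# sweep the thread count. `run_tasks` is again a hypothetical timing function.

def strong_scaling_speedup(run_tasks, n_tasks=500, thread_counts=(1, 2, 4, 8, 16)):
    t1 = run_tasks(1, n_tasks)
    # Speed-up for n threads is t1 / tN; ideal (linear) scaling gives speed-up == n.
    return [(n, t1 / run_tasks(n, n_tasks)) for n in thread_counts]
```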

Amdahl's law

Amdahl's law is more focused on the strong-scaling case; the x-axis is the number of resources or threads used for solving the problem, and the y-axis is the speed-up.

Understanding where Amdahl's law comes from might be a better way to internalize it than just memorizing the final equation.

Assume computing a workload on a serial machine takes time T. In this computation, a fraction p (between 0 and 1) of the work can benefit from parallel computing, so T = (1-p)T + pT. Let s be the speed-up factor of the parallel part; then T(s) = (1-p)T + (p/s)T, where T(s) is the new time after applying parallel computation. The first term is the serial part, which does not change with parallel execution; the second term is the part that changes with parallel computation (it becomes 1/s of the original execution time).

Then, dividing T by T(s) (old time divided by new time), we get the speed-up of the program in the parallel case: T/T(s) = 1/[(1-p) + p/s]. In the ideal case, when s approaches infinity, the speed-up of the whole program approaches 1/(1-p), which is limited by the serial portion of the program. This is the upper limit on how much parallelism can improve the execution time.
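The same formula can be written as a tiny helper, which makes the 1/(1-p) ceiling easy to see numerically. The value p = 0.9 below is only an illustrative assumption:

```python
# Amdahl's law as a function: speed-up of the whole program given the
# parallel fraction p and the speed-up s of the parallel part.

def amdahl_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# Example: with p = 0.9, even an infinitely fast parallel part is capped at
# 1 / (1 - 0.9) = 10x overall speed-up.
for s in (2, 4, 8, 16, 1_000_000):
    print(s, amdahl_speedup(0.9, s))
```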

Parallel efficiency

Another related metric is parallel efficiency. We usually divide the speed-up by the number of cores to get the parallel efficiency; this number is easier to compare since it lies between 0 and 1, and it becomes even more intuitive when expressed as a percentage. Typically, a good number for a parallel streamline visualization algorithm is 50-70% (according to ChatGPT).
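In code the conversion is a one-liner; the 16-core, 9.1x speed-up numbers below are only an example, not measured data:

```python
# Parallel efficiency: measured speed-up divided by the number of cores,
# a value between 0 and 1 (or 0-100% as a percentage).

def parallel_efficiency(speedup, n_cores):
    return speedup / n_cores

# Example: 16 cores giving a 9.1x speed-up -> ~57% efficiency,
# roughly in the 50-70% range quoted above for streamline algorithms.
print(parallel_efficiency(9.1, 16))
```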

Parallelism of typical scientific visualization algorithms

The following results come from ChatGPT. The algorithms are roughly grouped by how hard they are to parallelize: easy, medium, and hard. Hard ones reach roughly 50% efficiency, easy ones can reach about 90%, and medium ones are around 60-70%. If a supervisor or a domain expert asks, a summary like this counts as a reasonably sensible answer.

Easy to parallelize

Isosurface extraction: a typical data-parallel case with static domain decomposition. Load balancing for large-scale volume data is relatively easy to control, so high parallel efficiency is easy to achieve (e.g., marching cubes: just partition the data directly, and the communication pattern is simple).

Slice/Clip: the computation and communication patterns are simple, so these are usually easy to parallelize and the efficiency can be very high.

Medium difficulty

Volume rendering: the core rendering process is parallelizable, but the final image compositing requires global communication. The more nodes there are, the more significant the compositing overhead becomes, so communication and load balancing must be optimized to maintain high efficiency.

Relatively hard to parallelize

Streamline visualization: it requires frequent cross-domain communication, the risk of load imbalance is high, and different flow fields or seed distributions cause large performance fluctuations, so the overall parallel efficiency is usually low. The imbalance of the dataset itself also has a large influence on the results.

In real projects, the parallel efficiency of different algorithms is also affected by the hardware (CPU/GPU compute capability, interconnect bandwidth and latency, the I/O subsystem, etc.), the software stack (MPI, threading model, scheduling strategy), and the data distribution. To fully exploit the performance potential of visualization algorithms in large-scale parallel environments, targeted optimization is needed in load balancing, communication patterns, data layout, and other aspects.

This reminds me of an earlier example about optimizing particle advection. Some of the experimental results at that time showed that when the parallel number was large, the core time was actually worse than when the parallel number was small. This result is easy to explain from the perspective of Amdahl's law and parallel efficiency: once the core number increases beyond a certain point, the speed-up no longer grows with the core count, so the core time keeps increasing. In this situation, adding more resources does not save much.

So parallel efficiency should be viewed as a dynamic curve. For an algorithm like particle advection, with core number on the x-axis, the efficiency gradually declines. If our method delays this decline, or even keeps the curve at a high level, then it is a good method, and the gap in between is the region closed by the method. A figure like this would be a good way to demonstrate the effectiveness of the method.
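A small sketch using Amdahl's law with an assumed parallel fraction (p = 0.95 and T = 100, chosen purely for illustration, not from any measurement) shows why the core time keeps growing once the speed-up levels off:

```python
# Core time (cores x wall-clock time) under Amdahl's law, using a made-up
# parallel fraction p = 0.95 and serial time T = 100 for illustration only.

def core_time(n_cores, p=0.95, serial_time=100.0):
    wall_time = serial_time * ((1.0 - p) + p / n_cores)
    return n_cores * wall_time

# Once the speed-up saturates near 1 / (1 - p), adding cores mostly adds
# core time without reducing wall-clock time much.
for n in (1, 4, 16, 64, 256):
    print(n, round(core_time(n), 1))
```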

A different explanation of what is fixed (for the data I/O service)

What if we increase the number of staging services proportionally with the amount of data?

Yeah, that’s a kind of weak scaling, too, sort of, but since we don’t care about I/O times, what really matters is the data size, so we should have had an amount of data proportional to the number of servers.

Meaning behind the scaling experiments

Ideal scaling is linear; for example, when the number of processors increases from 1 to 2, the number of tasks we can solve increases from 100 to 200. But in real cases, there is overhead for adding threads; parallel threads also increase the overhead of communication between threads.

In the following parts, the contexts of linear weak scaling and strong scaling are discussed.

In the case of weak scaling, linear scaling is achieved if the run time stays constant while the workload is increased in direct proportion to the number of processors. Most programs running in this mode (note this assumption) should scale well to larger core counts, as they typically employ nearest-neighbour communication patterns where the communication overhead is relatively constant regardless of the number of processes used; exceptions include algorithms that make heavy use of global communication patterns, e.g., FFTs and transposes.

In general, it is harder to achieve good strong scaling at larger process counts, since the communication overhead of many or most algorithms increases in proportion to the number of processes used.

References

https://www.sharcnet.ca/help/index.php/Measuring_Parallel_Scaling_Performance
