Time it

Before reaching for profiling tools, a more flexible and simpler approach is to set up timers.

Concepts about time and performance

CPU bound vs IO bound

Refer to the explanation here, which is fairly direct; the question is essentially where the bottleneck of the whole system lies.

A program is CPU bound if it would go faster if the CPU were faster, i.e. it spends the majority of its time simply using the CPU (doing calculations). A program that computes new digits of π will typically be CPU-bound, it’s just crunching numbers.

To solve a CPU-bound problem, one approach is to optimize the algorithm itself, and another is to adopt a parallel algorithm.

A program is I/O bound if it would go faster if the I/O subsystem was faster. Which exact I/O system is meant can vary; I typically associate it with disk, but of course networking or communication in general is common too. A program that looks through a huge file for some data might become I/O bound, since the bottleneck is then the reading of the data from disk (actually, this example is perhaps kind of old-fashioned these days with hundreds of MB/s coming in from SSDs).

Note that because of the memory hierarchy, besides disk I/O there are also memory and cache. For concrete optimizations, besides improving the I/O performance of the hardware itself, the algorithmic side is to use the memory hierarchy sensibly; see the Wikipedia article on cache-oblivious algorithms for more details.

The goal of the optimization is to turn a CPU-bound program into an I/O-bound one, i.e. to make the computation fast enough that the CPU is no longer the bottleneck.

real time (wall time) vs user time vs sys time

Elapsed real time, real time, wall-clock time, or wall time is the actual time taken from the start of a computer program to its end (wall time and real time mean the same thing). On Linux there is a command called time; running time <executable> reports the real, user, and sys time. The term CPU time usually refers to the time the CPU actually spends on this program. Be careful with the terminology here; I have mixed up CPU time and wall time several times.

Real time is simply the elapsed time: if a program starts running at 1:00 and finishes at 2:00, its real time is 1 h. But during that hour the program may not have been running on the CPU the whole time; it might spend 0.5 h running on the CPU and the other 0.5 h sleeping, in which case the CPU time is not 1 h. User time is the time spent in user mode, and sys time is the time the CPU spends in kernel mode on behalf of the program. For example, consider the following case:

#include <unistd.h>

int main()
{
    // do some I/O-like operation, here just sleep for one second
    sleep(1);
    return 0;
}

/*
real 0m1.011s
user 0m0.001s
sys 0m0.002s
*/

The real time is about 1 s, while user and sys are tiny and their sum is far smaller than real. This means that although the program takes 1 s from start to finish, the CPU utilization during that period is quite low (0.003 s in total).

Timer in C

// sample code
#include <time.h>
#include <stdio.h>
#include <unistd.h>
#define BILLION 1000000000L

int main()
{
    struct timespec start, end;
    double diff;
    clock_gettime(CLOCK_REALTIME, &start); /* mark start time */
    sleep(1);
    // some functions here
    clock_gettime(CLOCK_REALTIME, &end); /* mark end time */
    diff = (end.tv_sec - start.tv_sec) * 1.0 + (end.tv_nsec - start.tv_nsec) * 1.0 / BILLION;
    printf("time is %lf seconds \n", diff);
    return 0;
}

The timer in Python is quite straightforward:

import timeit

start = timeit.default_timer()
# do some work here
end = timeit.default_timer()
print("time span ", end - start)

Timer in C++

Note that when using duration_cast, first casting to microseconds and then converting to seconds gives better precision than casting directly to seconds, which truncates.

#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>
using namespace std;
using namespace std::chrono;

// For demonstration purposes, we fill up a vector with
// random integers and then sort them using the sort
// function. We record and print the time required by sort.
int main()
{
    vector<int> values(10000);
    // Generate random values
    auto f = []() -> int { return rand() % 10000; };
    // Fill up the vector
    generate(values.begin(), values.end(), f);
    // Get starting timepoint
    auto start = high_resolution_clock::now();
    // Call the function, here sort()
    sort(values.begin(), values.end());
    // Get ending timepoint
    auto stop = high_resolution_clock::now();
    // Subtract timepoints to get the duration; use
    // duration_cast to cast it to the proper unit
    auto duration_micro = duration_cast<microseconds>(stop - start);
    auto duration_seconds = duration_cast<seconds>(stop - start);

    cout << "Time taken by function: "
         << duration_micro.count() << " microseconds, "
         << duration_micro.count() * 1.0 / 1000000 << " seconds" << endl;

    cout << "Time taken by function: "
         << duration_seconds.count() << " seconds" << endl;

    return 0;
}

Another way in C++ to add a timer:

auto startT = std::chrono::steady_clock::now();
// do some work here
auto stopT = std::chrono::steady_clock::now();
auto duration_milli = std::chrono::duration<double, std::milli>(stopT - startT);

In practice there is not much difference between steady_clock and high_resolution_clock (the latter is usually just an alias for steady_clock or system_clock), but steady_clock is monotonic and therefore the safer choice for measuring intervals.

The timer in parallel situations

In the general case, when we need to time a parallel execution, we just print the timer of the leader process, for example using rank 0 to control it. It is unnecessary to print the time spent on each specific process. If we do need to time each process, writing the performance data of each process into a separate file on the parallel file system (a common pattern for data checkpointing) can be more efficient than writing all the information into a single file.

One issue with parallel I/O is that some specific process may need a comparatively long time to open a file and write into it, as described in this article.

The common scenario is to write things into a buffer first, which can be a vector or a stringstream. The stringstream is more convenient for processing input and output if the contents are meant to be read by humans in the end, while the vector is the more general buffer. There is a discussion comparing these two containers here. We can simply use the << operator to put data into the stringstream and then write the contents of the stringstream object into a file:

std::ofstream fileHandle;
fileHandle.open(filename, std::ofstream::out);
fileHandle << stringstreamObject.rdbuf();
fileHandle.close();
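
A slightly fuller sketch of this buffering pattern (the rank variable, the per-rank file name, and the logged events are placeholders for illustration): accumulate the timer lines in a stringstream during the run and write them into a per-rank file once at the end.

#include <fstream>
#include <sstream>
#include <string>

int main()
{
    int rank = 0; // placeholder; in an MPI program this would come from MPI_Comm_rank
    std::stringstream timerLog;

    // during the run, accumulate timer records in memory instead of writing immediately
    timerLog << "compute_start " << 0.0 << "\n";
    timerLog << "compute_end " << 1.234 << "\n";

    // flush everything once at the end, one file per rank (hypothetical naming scheme)
    std::ofstream fileHandle("timer_rank_" + std::to_string(rank) + ".log");
    fileHandle << timerLog.rdbuf();
    fileHandle.close();
    return 0;
}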

Timer for distributed components

One case is that the interval to be measured spans different components: for example, a workflow starts at component a and ends at component z, and we want to measure the execution time from a to z.

In this case we need a dedicated middleware to handle it. A single-node prototype used before is placed here; the message communication is based on gRPC, and a single node can cover cases with ordinary load. The advantage of gRPC is that whether the client code is C++, Python, or another mainstream language, there are good client libraries that can be called directly.

The timer provides two modes: one is tick recording, where a timestamp is recorded every time a request is sent; the other is the time-span form, which only records the start and end times.

In addition, this tool provides two message communication patterns for testing: one is a shared space, the other is a message queue. The details are described in the README and are not repeated here.

In short, before using a profiler, setting up targeted timers and running benchmarks is also a way to measure performance. If this approach cannot locate the bottleneck, then turn to profiling tools.

timer in parallel processing

There are two points to note here. First, a barrier primitive should be used before timing, so that the timers of different threads/processes are accurate. Second, for computing an average, reduction primitives are commonly used; for example, MPI_Reduce gathers the timers of the different processes onto one process, which then computes the average.
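
A minimal sketch of this pattern, assuming an MPI program where compute() is a placeholder for the measured work:

#include <mpi.h>
#include <cstdio>

// placeholder for the work being measured
void compute() { /* ... */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // barrier before timing so that every process starts from the same point
    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    compute();
    double local = MPI_Wtime() - start;

    // gather the per-process timers onto rank 0 and average them
    double sum = 0.0;
    MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
    {
        printf("avg compute time %lf seconds\n", sum / nprocs);
    }
    MPI_Finalize();
    return 0;
}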

Although we can measure a period of time with the examples introduced above, for example by placing a time start and a time end around the compute phase and computing the difference, this is not flexible enough, for instance in this example:

while (sth) {
    compute();
    communicate();
}

Compared with measuring periods of time (for example, how much time each iteration of the while loop spends on compute and communicate), a more thorough approach is to record the absolute time at each point, with the program entry as the start point. This makes it possible to plot a fairly complete Gantt chart, but it requires some extra processing of the printed logs, for example comparing the different timestamps against the timestamp of the start moment.

auto ts = std::chrono::duration_cast<std::chrono::milliseconds>(
              std::chrono::system_clock::now().time_since_epoch())
              .count();

This ts is an absolute value. Record it before and after the key operations; when processing the log afterwards you have more flexibility, and displaying the result as a Gantt chart becomes easier, since it shows the relative positions of the different processes.
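
A small sketch of this idea (the epochMillis helper and the event name are made up for illustration): record the absolute timestamp around key operations and subtract the timestamp of the program start when post-processing the log.

#include <chrono>
#include <iostream>

// helper returning milliseconds since the epoch (illustrative name)
long long epochMillis()
{
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::system_clock::now().time_since_epoch())
        .count();
}

int main()
{
    long long tsStart = epochMillis(); // program entry, the origin of the Gantt chart

    long long tsBefore = epochMillis();
    // ... some key operation here ...
    long long tsAfter = epochMillis();

    // log offsets relative to the start point, which are easy to map onto a Gantt chart
    std::cout << "key_operation " << (tsBefore - tsStart) << " " << (tsAfter - tsStart) << std::endl;
    return 0;
}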

Here is a previous example. When mapping onto the Gantt plot, note that the length of the Gantt chart is the length chosen when creating the figure, while the measured time uses a different unit, so the time unit has to be converted into the unit of the figure size used by the Gantt chart.

Here are some other examples of getting the current time.

An important point in the parallel case is to give all processes a common starting time point, so that comparing the timings is meaningful. Usually placing an MPI_Barrier at the beginning of the function to be measured is enough, but sometimes a single barrier still does not give the program the same starting point; some experienced people say that placing two barriers in a row achieves the desired effect.

timer in distributed cases with slow IO

When we have a timer, we usually want to write its value out. In a distributed case such as an MPI program, we often let each rank write into its own file. However, the file I/O might influence the accuracy of the timer, especially when there is a large number of processes. Instead of writing the value each time and issuing a lot of file I/O operations, the good practice is to record all timer information in memory (for example, using a map to store the associated timer values). When the targeted program completes its execution, write the timers from memory into the file once. In this case, even if the file I/O is slow, we do not need to worry about its influence on the accuracy of the timer.
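
A minimal sketch of this practice (the TimerStore name and its record/flush interface are hypothetical, not from an existing library): keep all timer values in a map in memory and write them out only once after the measured program finishes.

#include <fstream>
#include <map>
#include <string>
#include <vector>

// hypothetical in-memory store for timer values, keyed by a label
struct TimerStore
{
    std::map<std::string, std::vector<double>> timers;

    void record(const std::string &label, double seconds)
    {
        timers[label].push_back(seconds);
    }

    // called once at the end, so the slow file I/O does not disturb the measurements
    void flush(const std::string &filename)
    {
        std::ofstream out(filename);
        for (const auto &entry : timers)
        {
            out << entry.first;
            for (double t : entry.second)
            {
                out << " " << t;
            }
            out << "\n";
        }
    }
};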

Misconceptions about timers

A timer is only meaningful (in the sense of algorithm optimization) at the right order of magnitude; by the definition of time complexity, O(n) only makes sense when n is large. Several times I measured tiny durations of a fraction of a second, which is not very meaningful, at least not for comparing algorithms. The strategy in that case is to amplify the difference: for example, execute the operation 100 or 1000 times, or make the input data much larger; the measured time then becomes meaningful.
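
As a sketch of this amplification strategy (tinyOperation is a placeholder for whatever small operation is being measured): run the operation many times and report both the total time and the average per call.

#include <chrono>
#include <iostream>

// placeholder for the small operation being measured
void tinyOperation()
{
    volatile double x = 1.0;
    x = x * 1.000001;
}

int main()
{
    const int repeats = 1000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; i++)
    {
        tinyOperation();
    }
    auto stop = std::chrono::steady_clock::now();
    double total = std::chrono::duration<double>(stop - start).count();
    std::cout << "total " << total << " s, per call " << total / repeats << " s" << std::endl;
    return 0;
}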

Performance model

An important purpose of timing is to build a performance model, or to understand the current quality of a service. Sometimes, to judge whether a piece of research is interesting, one may look at how much Python is used in it; in other words, how many mathematical tools or related algorithms are involved.

I used to have the feeling that everything should happen in real time; now the approach is more a combination of offline and online. The model is trained offline first, and then the online prediction or classification is performed. Do not try to avoid the offline learning or training process; what matters is the quality of the model, not whether it is online or offline. A model trained offline can always be moved into the online path. Training a model on offline data is the process of learning knowledge, and that knowledge can then be used in the actual online processing.

Take care of the error and uncertainty

Always try to ask what the associated errors are when timing something or using the timer for a specific performance model (the type of the error, the bound of the error, the source of the error, etc.). Figuring out these things gives you a better understanding of the model. An error bar is always needed to describe the figure. Testing one group of results or doing one experiment is not enough; you need enough results to show the uncertainty. The error is always a good aspect for discussing the quality of the model, or how to move forward based on the current modeling approach.

Therefore, when writing a performance test, a good habit is to use a for loop, let it run three or five times, and then print out the average and the standard deviation of the time. This is not a big deal from the programming perspective, just a simple for loop, but it means a lot for processing the results: you do not have to copy and paste the results back into Excel and process them manually (which saves a lot of energy in post-processing), and you can confidently say in the paper that you executed multiple runs instead of one. Reviewers will not complain about your results.
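
A small sketch of this habit (runBenchmark stands in for the actual test): run the benchmark a few times and print the average and standard deviation so the post-processing is already done.

#include <chrono>
#include <cmath>
#include <iostream>
#include <vector>

// placeholder for the benchmark being measured
double runBenchmark()
{
    auto start = std::chrono::steady_clock::now();
    // ... work under test ...
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main()
{
    const int runs = 5;
    std::vector<double> times;
    for (int i = 0; i < runs; i++)
    {
        times.push_back(runBenchmark());
    }
    double sum = 0.0;
    for (double t : times) { sum += t; }
    double avg = sum / runs;
    double var = 0.0;
    for (double t : times) { var += (t - avg) * (t - avg); }
    double stdev = std::sqrt(var / runs);
    std::cout << "avg " << avg << " s, stdev " << stdev << " s" << std::endl;
    return 0;
}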

Reference

https://stackoverflow.com/questions/27618032/i-o-bound-performance-speedup/27618217
