# Bug story

### Debugging methodology

gdb and print

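gdb is always the first tool that comes to mind: `gdb --args <commands>`. If you are not fluent with the specific gdb commands, print more information instead. For memory problems, use valgrind and similar tools.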

test case

single node testing

go back to the last successful point

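### IP configuration

I mixed up 127.0.0.1 and the eth1 IP. The stage environment has no network restrictions, so the swarm service is reachable through 127.0.0.1. The production environment does have network restrictions, and there the swarm service is reachable through the eth1 IP. I did not notice this when inserting data and at first assumed that swarm itself was at fault. It turned out that the IP configured for the database was wrong.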

### Segmentation faults in C

The general approach is to compile with the DEBUG option, run the program with `gdb --args`, use `bt` to get a backtrace, and print the stack information.

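### GCC version issues

HPC systems often provide several versions of gcc, and different software has different gcc version requirements. If cmake is run without the -DCMAKE_CXX_COMPILER=g++ -DCMAKE_C_COMPILER=gcc parameters, it may well pick the default gcc, which can be an old version. Several times I built the dependencies with an old gcc and then built the new project with a newer gcc, which produced many strange errors that were very hard to debug. So the tip is: whatever software you build with cmake, always specify the gcc/g++ version explicitly.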
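One way to confirm afterwards which compiler actually built a dependency or a project is to have the code report it. The following is a minimal sketch, assuming GCC or Clang (both predefine the `__VERSION__` macro); `compiler_description` is a hypothetical helper name, not part of any project mentioned here.

```cpp
#include <string>

// Describes the compiler that built this translation unit.
// __VERSION__ is predefined by GCC and Clang; other compilers
// fall back to a placeholder string.
std::string compiler_description() {
#if defined(__VERSION__)
    return std::string("compiler version: ") + __VERSION__;
#else
    return "compiler version: unknown";
#endif
}
```

Printing this string once at startup from both the dependency and the main project makes a version mismatch visible quickly.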

### wait one and wait any

Imagine you have 2 processes, and each posts one irecv operation. The operation blocks until the message appears in the list of pending requests. When it does, on_p2p_request calls notify_one and wakes up the thread that is blocked on the condition variable in irecv.
Now imagine you have more than 2 processes (say 4). When a message arrives, for instance from rank 3, notify_one wakes up one irecv operation. If it wakes up the irecv for rank 3, that's great. If it wakes up the irecv for another rank, bad luck: that thread can't do anything, and the correct irecv operation stays blocked. Using notify_all ensures that ALL the threads are woken up, giving each of them a chance to check whether its operation has completed.
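The situation can be sketched with plain C++ threads. This is a simplified model, not the actual implementation: `Mailbox`, `recv`, `deliver` and `run_demo` are hypothetical names, `recv` stands in for the blocking wait inside irecv, and `deliver` stands in for what on_p2p_request does when a message arrives.

```cpp
#include <condition_variable>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

class Mailbox {
public:
    // Blocks until a message from `rank` is pending, then removes and returns it.
    int recv(int rank) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return pending_.count(rank) > 0; });
        int value = pending_[rank];
        pending_.erase(rank);
        return value;
    }

    // Records an incoming message and wakes up every waiting receiver,
    // so the one whose rank matches can make progress.
    void deliver(int rank, int value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_[rank] = value;
        }
        cv_.notify_all();  // notify_one could wake a receiver waiting for another rank
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::map<int, int> pending_;  // rank -> message payload
};

// Starts one receiver per rank (1..3) and delivers messages out of order.
std::vector<int> run_demo() {
    Mailbox box;
    std::vector<int> received(3, -1);
    std::vector<std::thread> receivers;
    for (int rank = 1; rank <= 3; ++rank) {
        receivers.emplace_back([&box, &received, rank] {
            received[rank - 1] = box.recv(rank);
        });
    }
    box.deliver(3, 30);
    box.deliver(1, 10);
    box.deliver(2, 20);
    for (auto& t : receivers) t.join();
    return received;
}
```

Because every receiver shares one condition variable but waits for a different predicate, a single notify_one may be consumed by a thread whose predicate is still false. With notify_all each thread rechecks its own predicate and goes back to sleep if nothing changed for it.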

### Relocation truncated to fit

This issue looks like a very low-level one. This is how it occurs in general (https://www.technovelty.org/c/relocation-truncated-to-fit-wtf.html), but in my case I just used spack to load a particular software package and then the error occurred. More details can be found here (https://discourse.paraview.org/t/issues-of-compiling-v5-8-0-on-nersc-cori-by-cc-and-cc/5278/3). I suspected that something was introduced when I executed spack load mesa. Given how the problem arose, it seemed related to the linker, so I checked which linker was used before and after the load operation, and there are some interesting results:

Since mesa also depends on binutils, it looks like cc picks up the wrong linker, the one that comes with that binutils. The problem can be fixed by modifying PATH so that /global/common/cori/software/altd/2.0/bin/ld comes first again: `PATH="/global/common/cori/software/altd/2.0/bin:$PATH"`. It looked hard to solve, but the reason behind it is actually quite simple.

Besides, if Python is linked into a .so file but the Python library was not compiled with the -fPIC flag, this error can also be generated, like this:

Check more details here for this case.

### about the auto keyword in c++

In my example there is a vector, and I used auto x : vectorInstance to get each element and then executed the update operation of that class to update the data value. But I found that the data was not updated afterwards. With plain auto, the copy constructor of the element class is executed, which means x is just a copy of the original instance instead of the original instance itself. Therefore, the data in the original instance is not updated.

This is a little bit tricky (easy to overlook); check this answer for more details. If you want a reference instead of a copy of the object, remember to use auto & x.

Here is a recap:

- Choose auto x when you want to work with copies.
- Choose auto &x when you want to work with original items and may modify them.
- Choose auto const &x when you want to work with original items and will not modify them.
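The difference can be reproduced in a few lines. This is a minimal sketch; `Counter` and the helper functions are hypothetical stand-ins for the class and the update operation described above.

```cpp
#include <vector>

struct Counter {
    int value = 0;
    void update() { ++value; }
};

// Sums the values without modifying anything: auto const & avoids copies.
int total(const std::vector<Counter>& items) {
    int sum = 0;
    for (auto const& x : items) sum += x.value;
    return sum;
}

// Plain auto: each x is a copy, so the elements in the vector stay unchanged.
void update_copies(std::vector<Counter>& items) {
    for (auto x : items) x.update();
}

// auto &: each x refers to the element itself, so the update is visible.
void update_originals(std::vector<Counter>& items) {
    for (auto& x : items) x.update();
}
```

Calling `update_copies` on a vector of three counters leaves `total` at 0, while `update_originals` raises it to 3.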

On a separate linking issue (undefined Python symbols), more details are recorded here (https://discourse.paraview.org/t/undefined-symbol-pyexc-valueerror/5494/4). At first I found that some Python-related functions were not linked, but I did not know why. Actually, this was caused by the static version of Python. In this situation, trying python2 solves the problem: the python 2 on that cluster allows dynamic linking and is compiled with the .so file. It is important to look for the root cause or ask other people for more insight instead of just worrying about the issue. It looks like a good practice is to install Anaconda and avoid those issues.

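### Disciplined programming in C

After writing a lot of code in high-level languages, it is easy to neglect some programming discipline. Simple examples with C arrays and pointers: after allocating a contiguous block of memory, was it initialized with suitable values? Is a pointer checked for null before it is used? Is the space properly released once it is no longer needed? These are all well-worn topics, yet I recently wasted a lot of time on exactly these mistakes. They are small errors, but because of all kinds of error propagation they can cost a lot of time in repeated adjustments. To stress it once more: whenever you come across this kind of code, give yourself a mental reminder.

As for locating the error, if there are really no good clues, print the input variables of the key functions, and with multiple ranks print the rank information as well. It sounds simple, but it feels like a fairly universal method. It has indeed solved quite a few problems for me, and it is also convincing when communicating with others.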
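A small sketch of these habits, written as C++ with C-style allocation so that it matches the other examples in these notes. `checked_sum` and `allocate_and_sum` are hypothetical names; the log format is only an illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Logs the inputs of a key function, tagged with the rank, before using them.
long checked_sum(int rank, const int* data, std::size_t n) {
    std::fprintf(stderr, "[rank %d] checked_sum(data=%p, n=%zu)\n",
                 rank, static_cast<const void*>(data), n);
    if (data == nullptr) return 0;  // check the pointer before dereferencing
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

// Allocates, initializes, uses and releases a buffer of n integers.
long allocate_and_sum(int rank, std::size_t n) {
    // calloc zero-initializes the block; malloc would leave it indeterminate.
    int* buffer = static_cast<int*>(std::calloc(n, sizeof(int)));
    if (buffer == nullptr) return -1;  // allocation can fail
    for (std::size_t i = 0; i < n; ++i) buffer[i] = static_cast<int>(i) + 1;
    long total = checked_sum(rank, buffer, n);
    std::free(buffer);  // release the space once it is no longer needed
    return total;
}
```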

### about the length of the int

There are differences between int and int32: https://www.quora.com/Is-using-int32_t-preferred-to-using-int-in-modern-C-C++-programming

On mainstream 32-bit and 64-bit platforms int is currently 4 bytes, but the standard only guarantees at least 16 bits, and on some 16-bit systems it is 2 bytes. I mistakenly used int64 to parse an int, which caused the error to propagate through the program. It took some time to find this issue. Be careful about data lengths in any case. Mixing up int32 and int64 also introduces extra padding, which causes unexpected behaviour.

One reason this is hard to debug is that such mistakes show up as unexpected errors elsewhere, so the actual error is hard to trace. The one approach that works in general is to go back to the last point where things succeeded, maybe the original implementation. For example, if you are trying to provide a new communicator, go back to the original communicator and check that it works first. Once you have something that works well, start adding new things or replacing a particular subfunction, and add your modifications step by step to narrow down the problem.
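The effect of parsing with the wrong width can be shown on a small byte buffer. This is a minimal sketch, not the original code; `read_at` and `wrong_width_changes_value` are hypothetical helpers.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reads a value of type T from raw bytes at the given offset.
// memcpy is used so that the read does not depend on alignment.
template <typename T>
T read_at(const unsigned char* bytes, std::size_t offset) {
    T value;
    std::memcpy(&value, bytes + offset, sizeof(T));
    return value;
}

// Two 32-bit fields are serialized back to back, then the first one is
// read once with the right width and once with a 64-bit type.
bool wrong_width_changes_value() {
    const std::int32_t fields[2] = {7, 9};
    unsigned char bytes[sizeof(fields)];
    std::memcpy(bytes, fields, sizeof(fields));

    std::int32_t right = read_at<std::int32_t>(bytes, 0);  // reads only the first field
    std::int64_t wrong = read_at<std::int64_t>(bytes, 0);  // swallows both fields
    return right == 7 && wrong != 7;
}
```

The 64-bit read consumes the neighbouring field as well, so the parsed value is wrong and every later field is read from a shifted offset.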

Some other discussions cover how an integer is read from a memory address, alignment issues, and how alignment avoids multiple memory accesses at the cost of data padding.
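Padding can be inspected directly with sizeof, alignof and offsetof. The struct below is a hypothetical example; the exact numbers are implementation-defined, so the sketch only computes them rather than assuming a particular layout.

```cpp
#include <cstddef>
#include <cstdint>

struct Mixed {
    std::int32_t small;  // 4 bytes
    std::int64_t large;  // 8 bytes, typically aligned to an 8-byte boundary
};

// Bytes the compiler inserted between `small` and `large`
// so that `large` starts at a suitably aligned address.
constexpr std::size_t padding_before_large() {
    return offsetof(Mixed, large) - sizeof(std::int32_t);
}
```

On many 64-bit platforms `sizeof(Mixed)` is 16 rather than 12, which is why code that assumes tightly packed fields reads the wrong bytes.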

### virtual for the destructor

I forgot to add the virtual keyword to the destructor in the base class definition when there was inheritance. In this case, the destructor of the child class is not called when the object is deleted through a pointer to the base class. Refer to this for more details:

https://www.quantstart.com/articles/C-Virtual-Destructors-How-to-Avoid-Memory-Leaks/

This can cause a memory leak.
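A minimal sketch of the correct version, with hypothetical class names. The derived destructor increments a counter so that its execution can be observed.

```cpp
#include <memory>

struct Base {
    // virtual: deleting through a Base pointer also runs the derived destructor.
    virtual ~Base() = default;
};

struct Derived : Base {
    explicit Derived(int* counter) : destroyed_counter(counter) {}
    ~Derived() override { ++*destroyed_counter; }
    int* destroyed_counter;  // counts how often ~Derived has run
};

// Creates a Derived, destroys it through a Base pointer,
// and reports how many times ~Derived was executed.
int derived_destructor_calls() {
    int calls = 0;
    {
        std::unique_ptr<Base> owner(new Derived(&calls));
    }  // owner goes out of scope here and deletes through Base*
    return calls;
}
```

Without the virtual keyword on `~Base`, deleting through the base pointer is undefined behaviour, and in practice the derived destructor is usually skipped.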

### variable initialization

Refer to this about variable initialization (https://www.learncpp.com/cpp-tutorial/uninitialized-variables-and-undefined-behavior/). The lesson is: do not assume that the compiler initializes a particular variable for you. Starting from the assumption that a variable is uninitialized by default helps you avoid potential issues and write more cautious code. One recent bug: in one program we used an eof flag to indicate the end of the file. This variable was not initialized (we assumed it was initialized to 0), so the file could not be parsed correctly when we read data from it. This kind of error may be more common in C-related projects, since they need a lot of manual implementation for file I/O.
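A sketch of the safe pattern, assuming a simple line-based reader. `LineReader` and `read_all` are hypothetical names and an in-memory string stands in for the file; the point is only that the eof flag gets an explicit initial value.

```cpp
#include <sstream>
#include <string>
#include <vector>

struct LineReader {
    bool eof = false;  // explicit initial value, never rely on an implicit zero
    std::istringstream input;

    explicit LineReader(const std::string& text) : input(text) {}

    // Reads the next line; sets eof and returns false once the input is exhausted.
    bool next(std::string& line) {
        if (eof) return false;
        if (!std::getline(input, line)) {
            eof = true;
            return false;
        }
        return true;
    }
};

// Collects every line of `text` using the reader above.
std::vector<std::string> read_all(const std::string& text) {
    LineReader reader(text);
    std::vector<std::string> lines;
    std::string line;
    while (reader.next(line)) lines.push_back(line);
    return lines;
}
```

If `eof` were left uninitialized, `next` could return false on the very first call, which matches the symptom of a file that cannot be parsed.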

### the evidence to show something

An imperfect part of being human is that we easily make all kinds of mistakes. With this assumption, you can adopt a more flexible strategy to validate decisions and results. I made several small mistakes here and there because of a lack of checking. For example, sometimes we need to register for a class before a particular date, and missing that date can cause a lot of trouble. I can remember many things like this, and I even missed an important exam because I mixed up the location of the exam (two locations had similar names in spelling). So double checking is important, and when you check, you had better make sure the method of checking is solid. This is a habit that helps you avoid small mistakes. Make sure the evidence or the results fully support your decision. Try to think of yourself as a judge: you need direct evidence, not oral statements from others (confirm such statements with more direct evidence such as the code or execution results) and not ideas that start with "maybe" in your mind.

### gdb cannot be used directly in some cases

In one particular case we cannot use gdb in the traditional way (open a terminal and wait there), for example a long-running server that somehow crashes with a segfault. In this case, if we want a stack trace, run `ulimit -c unlimited`, then start the server and do other things. You should end up with a file named core if it segfaults. Open gdb (`gdb mbserver`, for instance, if it is the mbserver that crashed), then run `core core`, which loads the state of the program at the moment it crashed. You can then run `backtrace` and things like that.

### wrong static parameter

In a for loop, we accidentally put the static keyword on a variable declaration, and the code did not work as expected. Since the variable is declared static, it is initialized only once, so in the second iteration its value has still not changed. In our case we tried to create a file name that uses the iteration number as the suffix, but every iteration produced the same name because of the accidental static keyword.
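The bug can be reproduced with a short sketch. The function names and the `output_` prefix are hypothetical; only the placement of static matters.

```cpp
#include <string>
#include <vector>

// Buggy version: the static variable is initialized once, on the first
// iteration of the first call, and keeps that value afterwards.
std::vector<std::string> names_with_static(int iterations) {
    std::vector<std::string> names;
    for (int i = 0; i < iterations; ++i) {
        static std::string name = "output_" + std::to_string(i);
        names.push_back(name);
    }
    return names;
}

// Fixed version: a normal local variable is rebuilt on every iteration.
std::vector<std::string> names_without_static(int iterations) {
    std::vector<std::string> names;
    for (int i = 0; i < iterations; ++i) {
        std::string name = "output_" + std::to_string(i);
        names.push_back(name);
    }
    return names;
}
```

With three iterations the static version yields `output_0` three times, while the fixed version yields `output_0`, `output_1` and `output_2`.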