Bug story

This post records the bugs from my own projects that left the deepest impression on me.

Debugging methodology

Although debugging is mostly an engineering matter, good debugging skills and tools can speed up troubleshooting considerably. Sometimes the bug comes from your own software, sometimes from software you depend on; being able to locate and solve the problem cleanly is a big part of working efficiently. People always make mistakes and can hardly consider every case, so having fast ways to pinpoint a problem becomes critical.

gdb and print

gdb is always the first tool that comes to mind: gdb --args <command>. If you are not familiar with the exact invocation, just print more information; for memory problems use valgrind, and so on. Some commonly needed details are worth a little attention: build with Debug mode in cmake (essentially gcc's -g flag), and on Linux run ulimit -c unlimited to allow core dump files to be written to disk, after which debugging with gdb <binary-file> <core-dump-file> is quite convenient. Sometimes print messages and gdb have to be combined to discover and locate a problem properly.

When using MPI with multiple processes, one small gdb trick is to use xterm, for example mpirun -np <NP> xterm -e gdb --args ./program. This starts N xterms, and each xterm runs a gdb debugger so you can inspect each process separately.

Compiling issue

Do not forget about compiling issues: maybe you forgot to add some macro or specific compiler parameter and that causes the bug, or you simply used the wrong compiler. Sometimes you may link to an old version of a shared library when you skip part of the build process. In any case, take a look at the compiling process when something weird happens and the source code itself seems to make sense. For example, if the code links to the shared lib in the install dir, and you update the code and build it but do not install it properly, your code still finds an old version of the shared library, which may confuse you and make you think your update does not work.

Commonly used tools include ldd (and otool on macOS) to check the dynamically linked libraries, and nm to check the names of the functions declared in a binary file.

test case and sanity check

For larger projects, test cases are crucial: every time a small module is updated, find a way to test the related code with test cases; mixing all features into one big module is dangerous. One thing to watch out for: unit tests are easy to think of, but integration tests are often neglected. Mock servers matter a lot here, for example abstracting away the network and focusing on the results. In short, splitting a large feature into smaller ones and testing them separately is very important.

The simplest test is to check your work once more after coding, the so-called dry run. Recently, for example, while helping a colleague with something, I filled in para2 where para1 should have gone, which caused a large error in the results. That is bad, because a little time spent checking would have avoided it; if the computation is wrong from the start, going back and forth to fix the data later is painful.

When it is hard to write test cases, you can use substitutes: ask someone else to take a look, compare whether the Python and C++ implementations produce matching results (especially when prototyping in Python and accelerating with C++), or compare your own library's output against a classic library's (for example a linear algebra library you implement yourself).

reproducer

A reproducer is an upgraded test case. When you hit a bug in a complex workflow, before diving into details, try to build a reproducer, especially if you want someone else to help figure out where the problem is. Pay attention to things like software version information, pinning down the test data that triggers the problem, stripping out unused dependencies, and so on. Essentially you want the software stack to be as stable as possible, the problem to be repeatable (others can reproduce it easily), and the dependencies minimized; that makes the problem much easier to solve.

single node testing

Many bugs only show up in a highly parallel environment that needs a lot of resources, and getting that many resources at once is hard. In that case, abstract out the core functionality, simulate the scenario, strip the redundant dependencies, and let the test iterate quickly.

come back to a successful point

Many times I added a big module or many features at once, the code broke, and I could not find the problem for a long time. The approach then is to go back to the last point where things worked and add code back incrementally. Think about what the core functionality is, split the new changes out and test them separately. Remove the modules that are stable, then isolate the added parts piece by piece. Or find a version that runs correctly and compare against it bit by bit to locate the problem.

The same applies to code migration, for example porting Python code to C++, or moving from a naive implementation to an optimized version. In these cases you need a stable implementation first; then even if the new code breaks somewhere, you can locate the problem by comparing the result of every step between the stable version and the new one.

start from a small scale dataset

In visualization or graphics-related programming, it is error-prone to generate something that does not really make sense. Debugging these cases can be hard, since there are so many points and cells in a figure. A good strategy is to generate a small-scale dataset that you can control yourself, such as 3 by 3 or 5 by 5 datasets (it may cost some time, but it is definitely worth it). These datasets may contain some problematic input data, and then it is much easier to trace the data in detail and find what is wrong with your algorithm.

IP configuration

Description
127.0.0.1 was mixed up with the eth1 IP. In the stage environment there were no network restrictions, and the swarm service could be reached via 127.0.0.1. In the production environment there were network restrictions, and the swarm service could only be reached via the eth1 IP. When inserting data I did not notice this and first kept assuming it was a swarm problem; later it turned out to be a database IP configuration problem.

Surface cause: wrong IP configuration.

Deeper cause: connection information should be passed in via environment variables, for example database connection info, the swarm client, and the etcd client. This is a coding-discipline problem: connection info must be loaded from environment variables or from a config file; with a config file it is effectively a few parameters passed in at startup.

There was also a missing mental model of inside vs. outside the container. The applications are started inside containers, so how could 127.0.0.1 reach the network outside the container, unless everything runs in net=host mode?

On the stage environment, net=host had actually been added, but on production it probably had not; in the end it was an inconsistency between the production and stage configurations.

Environment variable configuration

Description

I added an alerting feature to an old project that is started via a docker compose file. Some new environment variables had to be added, and I was not careful enough while editing: a key environment variable (originally true) got set to truei, so the API returned an error every time the program started.

Surface cause: careless work. Deeper cause: the program's validation of environment variables was inadequate. There was no input checking (for example, what happens if the value is neither true nor false), and the program's own logging was too sparse, which made the problem hard to locate. Even if an error is returned to the API caller, a corresponding record should be kept in the program's own log.

The old program did mention the environment variable problem in the message returned to the API, but produced no log output of its own, which made troubleshooting difficult. The environment variable is clearly not something passed in from the API; it is set when the program starts, and regardless of what the API does, this parameter must be set, so the program should print a corresponding log line at startup.

JSON parsing

When parsing JSON in Golang, I naively defined the struct fields with lowercase first letters (lowercase fields are unexported in Go, so the json package cannot see them). A senior colleague could spot it at a glance, and once reminded I understood immediately, but when writing it myself I just got an exception and the JSON failed to parse. Because I was changing a large existing project, it was hard to narrow the problem down to a small scope, so I was often unsure which part caused it. A field-capitalization mistake like this really should not happen.

Typos and letter case when entering passwords

I have run into similar problems many times. Once I was helping my grandfather set up Wi-Fi: everything was configured, but the final connection step kept failing. Another relative suggested the password might have been typed wrong. But I had already typed it many times, so surely not. It turned out the caps-lock key had been toggled, so what was actually entered had mixed-up case. These are tiny bugs that are easy to spot, but sometimes your thinking gets stuck in a local rut and you cannot step back; you keep operating under a wrong premise and drift further and further from the actual solution.

Dead batteries in the remote control

Another time the TV at home turned on and froze on one screen: the set-top box's boot screen plus a promotional ad. We all assumed the set-top box was broken, rebooted it several times without success, and were discussing whether to call a repair technician. It turned out the set-top box's remote control had dead batteries, so the home button no longer worked; after replacing the batteries everything was fine. Like the previous story, we first fell into a wrong assumption and searched for problems under that assumption, when the assumption itself was flawed. The causes are simple, just not the first thing that comes to mind.

The listkeys problem

Our project has an interface that looks up a volumeid from an input volumename. The logic: the function takes searchName, lists all current volume info, iterates over it, increments count whenever volumename == searchName, and reports an error if count ends up 0 or greater than 1. This logic silently assumes the listed info contains no duplicates, i.e. that it is a set or a map. In reality, listing all volume info returns a list, and that list contained many duplicated elements, e.g. [volumeida:volumenamea, volumeida:volumenamea], so if searchName matched the volumeName, the count became 2 even though there was really only one volume. This is a failure to consider the input thoroughly. With multiple layers of calls involved, it is best to assume the other side's interface is unreliable, consider all the cases before processing, and at least leave debug logging even if you cannot fully guard against them. This particular problem was only found after adding a print of the list, rebuilding, and running in production, which revealed that the interface's returned data was itself problematic (had the interface been defined to return a map or set, there would have been no such problem). The whole process wasted a lot of time and energy.
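A minimal defensive sketch of the idea (the names here are hypothetical, not the project's actual code): deduplicate the listed (id, name) pairs before counting matches, so a backend that returns duplicates does not make a unique volume look ambiguous.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical lookup: collapse duplicated entries such as
// [volumeida:volumenamea, volumeida:volumenamea] before counting.
std::string find_volume_id(
    const std::vector<std::pair<std::string, std::string>>& listed,
    const std::string& search_name) {
    std::set<std::pair<std::string, std::string>> unique(listed.begin(),
                                                         listed.end());
    std::string found;
    int count = 0;
    for (const auto& entry : unique) {
        if (entry.second == search_name) {
            found = entry.first;
            count++;
        }
    }
    if (count != 1) return "";  // not found, or genuinely ambiguous
    return found;
}
```

The real fix is of course for the listing interface to return a set or map, but deduplicating on the caller side keeps the code safe either way.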

Writing = instead of ==

This kind of problem is basic, but I trip over it all the time. Sometimes I marvel: how can this be, it makes no sense; then a careful second look shows == was written as =. For problems like this, honestly solving them with unit tests is the way to go. Do not trust to luck and assume your program is fine; let the tests speak. Prove with tests that your function works, instead of just vouching for it.
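A minimal sketch of the mistake: `status = 1` assigns and is always truthy, while `status == 1` compares. Compiling with -Wall makes gcc/clang warn about an assignment used as a truth value, which is how a compiler (or a unit test, as above) catches this early.

```cpp
#include <cassert>

bool is_ready_buggy(int status) {
    if (status = 1) {  // BUG: assigns 1 to status, condition is always true
        return true;
    }
    return false;
}

bool is_ready_fixed(int status) {
    if (status == 1) {  // correct comparison
        return true;
    }
    return false;
}
```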

Segmentation faults in C

Once the number inside malloc() was wrong: it should have been malloc(rindex-lindex) but was written as malloc(sizeof(rindex-lindex)).
Once the allocated space was simply too small.

One tip is to look at the context, that is, the inputs and the outputs.

If nothing else works, use the brute-force approach: roughly locate the failing function, then step forward with print statements and see at which step the program errors out. It is slow, but it usually finds the faulty spot in the end.

The main issue is that the code was often incomplete: function return values were not checked, so even when malloc failed there was no clue why, and the error only surfaced inside the malloc call itself.

For raw memory operations like malloc, think the input parameters and boundary conditions through carefully. Partly a matter of habit, partly of experience.
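A sketch of both points above (the function name and indices are made up): malloc(sizeof(rindex - lindex)) allocates sizeof(int) bytes, the size of the expression's type, not (rindex - lindex) elements; and the return value should always be checked so a failed malloc reports why instead of crashing later.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Allocate space for the elements in [lindex, rindex).
int* alloc_range(int lindex, int rindex) {
    if (rindex <= lindex) return NULL;  // boundary condition first
    // Correct sizing: (rindex - lindex) elements, each sizeof(int) bytes.
    int* buf = (int*)malloc((size_t)(rindex - lindex) * sizeof(int));
    if (buf == NULL) {  // always check the return value of malloc
        fprintf(stderr, "malloc failed for %d elements\n", rindex - lindex);
        return NULL;
    }
    return buf;
}
```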

After programming in Go for a while, other languages feel loose, or call it flexible, by comparison, because Go forces errors to be returned in the code and the program must handle them. It looks verbose at times, but it lets every possible problem be located precisely: the probability of bugs drops in the first place, and even when errors occur they can be found quickly.

The general approach is to compile with the DEBUG flag, run with gdb --args, then use bt and print the stack information.

Installing the NVIDIA driver

A while back I helped a classmate install the NVIDIA driver. It started as a dual-boot system; every install attempt failed with a script that could not run, and nvidia-smi could not communicate with the driver. After removing Windows, installing only Ubuntu 14.04, and following this tutorial https://gist.github.com/wangruohui/df039f0dc434d6486f5d4d098aa52d07, it finally worked. The key was a Secure Boot option in the BIOS that had not been changed, so some kernel modules could not be loaded into the system. Driver problems are not purely software problems; treat hardware and software as one whole. I am writing this down in case I ever need to install a driver myself. From this angle, dual boot is risky (the drivers may also conflict with each other), and a single system plus a virtual machine is more manageable.

A classmate wanted to replace CUDA 9 with CUDA 10 on a server, but following the official tutorial, every apt-get update still showed the old CUDA 9 packages. After carefully checking the apt-get source files under /etc, we found the machine had a default source configuration containing a default CUDA list; presumably the defaults in that list overrode the newly added source. After commenting out the default list, apt-get update recognized CUDA 10. Configuration problems like this are baffling: a tiny detail can eat a lot of time, so I am marking it down here.

upper case and lower case

Thinking back a long way, I was helping a relative set up broadband: wiring and configuring everything. At the last step, back in the dial-up days, entering the password always failed and the connection would not come up. After fiddling for ages and checking each part, it turned out the password had a letter-case problem when typed. So often a tiny configuration detail costs hours of back and forth. It's life.

polling frequency and the scheduler strategy

These two problems are quite instructive: you have to care about every layer of the software stack. First the scheduler strategy problem, using srun on a supercomputer. I had always used the default parameters without paying attention. Then I noticed that when running several tasks at once, the later tasks were not scheduled even though resources were still free. After much debugging it turned out that in the default case a task occupies all the memory on a node: with exactly 6 nodes and 6 processes, the resources of all 6 nodes are taken, even though each process may only use one core. After limiting with --mem-per-cpu=1000, the problem went away. Not setting a parameter does not mean it does not exist, or that it will be fine if I do not care about it; watch the default values.

Another pitfall is srun's --ntasks=<task number>. A few times I started a single-process program with srun but without appending --ntasks=1. Sometimes this parameter then follows #SBATCH --ntasks-per-node=<task limits on node>, which is meant to cap the number of tasks per node; the two parameters are different things. A few times a single-node program got started once on every node, which is tricky. The better strategy is to declare these important parameters explicitly, whatever the defaults are.

Another default-value problem came with an I/O library. If you implemented it yourself, there would obviously be a polling mechanism: check every so often whether things are ok. I had overlooked this, but on closer inspection the library had a related setting, controllable from outside via a parameter. The lesson: before using a library, understand its key behaviors and key parameters, so that when baffling problems appear after integrating it into your program, you are not left flailing with nowhere to start.

On the include order of .h files

In object-oriented programming, a recurring question is which class should have which responsibilities, and some mistakes happen exactly here. Sometimes you find classes referencing each other; at that point, reconsider whether the modules or the defined variables are reasonable. What counts as reasonable? For example, a topological sort of the dependencies finds no cycle.

If circular references really cannot be avoided, use forward declarations; note that both mutually referencing classes need a forward declaration.
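A minimal sketch of breaking such a cycle: A and B refer to each other, so each would need the other's full definition; declaring the peer class first and holding it by pointer breaks the cycle (in a real two-header setup, each header forward-declares the other's class).

```cpp
#include <cassert>

class B;  // forward declaration: B is defined below

class A {
public:
    B* peer = nullptr;  // a pointer to an incomplete type is allowed
};

class B {
public:
    A* peer = nullptr;
};
```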

On gcc version issues

HPC systems often ship several gcc versions, and different software requires different versions. If you run cmake without -DCMAKE_CXX_COMPILER=g++ -DCMAKE_C_COMPILER=gcc, cmake may well pick the default gcc, which can be an old version. A few times the dependencies were built with an old gcc while the new project was built with a new gcc, which produced many strange, hard-to-debug errors. The tip: whatever software you build with cmake, specify the gcc/g++ version explicitly.

On serialization. There was a distributed timer library using the grpc library, but grpc is best built with bazel; building it with cmake still has problems on some HPC systems, and it was unclear how to solve some of those build problems. So I planned to replace grpc with thallium, which I had been using, and then hit an issue: I had always assumed strings are serialized automatically, and passed a string directly as an RPC argument. In reality, thallium's serialization functions must be invoked explicitly for it to parse correctly. Take special care to include thallium's serialization support, or to customize it for your data structure. I had overlooked this for a long time, causing strange serialization problems that took many rounds to sort out.

A few wild-goose-chase problems

With C++, a cascade of strange compiler errors often has a tiny cause that the error messages do not reveal.
For example, once I included a .h file whose functions used a namespace; while editing that .h file, I accidentally deleted one of the namespace's braces without noticing. Compilation then produced a chain of errors, even errors in other .h files included afterwards, though those files were fine. It took a while of careful comparison to find the problem. Two lessons here. First, be test-driven: each module must pass its tests before being integrated into the whole server platform. Second, build order: build the tests before the whole server, which isolates errors. A small mistake like this is easy to spot when building a test case, since it includes only one .h file, but my habit of building test cases last (not a good habit) kept the problem from surfacing early.

Recently I also hit error: invalid use of non-static member function. Searching online shows it is a common error, but my call site used no function pointer. I spent a long time changing the class's static functions and static member variables, with no luck. A careful look then showed a function call missing its (), so the compiler interpreted the name as a function pointer. Unluckily, that position expected a void* to raw data, so the compiler could not flag the type mismatch either. Two takeaways: using void* as a function parameter is dangerous and should be done sparingly, and it is better to wrap it in a custom type, since a type with semantics is much harder to misuse.

Three levels of debugging

Basic bugs can be handled with unit tests, e.g. checking that inputs and outputs match expectations. The next level is thread safety: for a server, some thread-safety problems only show up once stress testing reaches a certain scale. This stage is hard to simulate; ideally you test in as realistic an environment as possible, and a clever approach is a mock server, or abstracting out the critical data flow for testing. An easily missed issue at this stage is locking: rewrite statements so the awkward lock placements disappear. For example, when accessing a map in an if condition and returning if the element is absent, you can use the count function: compute the count of the key inside the locked region, then compare the count afterwards without holding the lock. The third level is memory leaks, especially for C/C++ and other pointer-heavy languages with manual memory management; the question is whether your server can run robustly for a long time, months without going down. For C/C++, valgrind is the usual detection tool and can kill problems in the bud; of course, practice is the best test of all.

wait one and wait any

One project provides blocking send and blocking recv operations on top of an RPC service. There are two main cases: case 1, the send req arrives before the recv req; case 2, the recv req arrives before the send req. For case 1, the send req is placed in a buffer, and when the recv req arrives it looks up the matching req in the buffer and pulls the data to finish the operation. For case 2, a condition variable wait is used, the wait condition being whether the matching send req is present in the buffer. Putting these together, at the end of handling a send req, a notification has to wake up a sleeping thread. At first, waking a single waiter (notify_one) was used; at small scale the program was fine, but at large scale it always hung somewhere in the middle. Only after a hint did I see the situation below; the correct operation is to wake all waiters (notify_all) rather than just one:

Imagine you have 2 processes, each posting 1 irecv operation. This operation blocks until the message appears in the list of pending requests. When it does, on_p2p_request calls notify_one and wakes up the thread blocking on the condition variable in irecv.
Now imagine you have more than 2 processes (say 4). When a message arrives, for instance from rank 3, notify_one wakes up one irecv operation. If it wakes up the irecv for rank 3, that's great; but if it wakes up the irecv for another rank, bad luck: that one can't do anything, and the correct irecv operation stays blocked. Using notify_all ensures that ALL the threads are woken up, giving each of them a chance to check whether its operation has completed.
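A minimal sketch of the same pitfall (simplified names, not the project's code): several threads each wait for THEIR id to appear in a shared set. The producer must use notify_all, because notify_one may wake a waiter whose id has not arrived yet, leaving the right waiter blocked.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

std::mutex m;
std::condition_variable cv;
std::set<int> arrived;

// Block until our id has been announced.
void wait_for(int id) {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [id] { return arrived.count(id) > 0; });
}

// Record an arrival and wake every waiter so each re-checks its predicate.
void announce(int id) {
    {
        std::lock_guard<std::mutex> lk(m);
        arrived.insert(id);
    }
    cv.notify_all();  // notify_one here can wake the wrong waiter and hang
}
```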

Relocation truncated to fit

This issue looks very low-level; this is how it occurs in general (https://www.technovelty.org/c/relocation-truncated-to-fit-wtf.html), but in my case I just used spack to load a particular software and then the error occurred; more details can be found here (https://discourse.paraview.org/t/issues-of-compiling-v5-8-0-on-nersc-cori-by-cc-and-cc/5278/3). I suspected something was introduced after I executed spack load mesa. Given how the problem appears, it seemed related to the linker, so I checked the linker used before and after the load operation, with some interesting results:

zw241@cori11:~> which ld
/global/common/cori/software/altd/2.0/bin/ld
zw241@cori11:~>
zw241@cori11:~> module load spack
zw241@cori11:~>
zw241@cori11:~> spack load mesa/qozjngg
zw241@cori11:~> which ld
/global/common/sw/cray/sles15/x86_64/binutils/2.32/gcc/8.2.0/sl7nxhi/bin/ld

Since mesa also depends on binutils, it looks like cc found the wrong linker, the one associated with that binutils. The problem can be fixed by modifying PATH so that /global/common/cori/software/altd/2.0/bin/ld comes back to the first position: PATH="/global/common/cori/software/altd/2.0/bin:$PATH". It looks hard to solve, but the reason behind it is quite simple.

Besides, if Python is linked into a .so file but the Python library was not compiled with the -fPIC flag, the same error can also appear, like this:

/usr/bin/ld: /projects/community/python/3.8.5/gc563/lib/libpython3.8.a(abstract.o): relocation R_X86_64_32S against symbol `_Py_NotImplementedStruct' can not be used when making a shared object; recompile with -fPIC

Check this case for more details.

about the auto keyword in c++

In my example there is a vector; I used auto x : vectorInstance to get each value and then called that class's update operation to change its data. But the data was not updated afterwards. Plain auto invokes the copy constructor of the original class, so x is just a copy of the original instance rather than the instance itself; therefore the data in the original instance is never updated.

It is a little bit tricky (easy to ignore) here; check this answer for more details. If you want a reference instead of a copy of the object, remember to use auto &x.

Here is a recap:

Choose auto x when you want to work with copies.
Choose auto &x when you want to work with original items and may modify them.
Choose auto const &x when you want to work with original items and will not modify them.
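The recap can be sketched in a few lines: with plain `auto`, each x is a copy of the element, so the update is lost; with `auto&`, x refers to the element itself and the vector really changes.

```cpp
#include <cassert>
#include <vector>

int sum_after_doubling_with_copy(std::vector<int> v) {
    for (auto x : v) x *= 2;   // x is a COPY; this loop changes nothing
    int s = 0;
    for (int x : v) s += x;
    return s;
}

int sum_after_doubling_with_ref(std::vector<int> v) {
    for (auto& x : v) x *= 2;  // x refers to the element; updates stick
    int s = 0;
    for (int x : v) s += x;
    return s;
}
```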

More details are recorded here (https://discourse.paraview.org/t/undefined-symbol-pyexc-valueerror/5494/4). At first I found that some Python-related functions were not linked, but did not know why. It was actually caused by a static build of Python. In that situation, trying Python 2 solved the problem: the Python 2 on that cluster allows dynamic linking and ships a .so file. It is important to dig for the root cause, or ask other people for more insight, instead of just worrying at the issue. The good practice seems to be installing an anaconda and avoiding those issues altogether.

On disciplined C programming

After writing a lot of high-level languages, it is easy to neglect basic discipline. Simple examples with C arrays: after requesting a contiguous block, did you initialize it to sensible values; after getting a pointer, did you check for NULL; after using the space, did you free it properly. All old chestnuts, yet I recently lost plenty of time to exactly such mistakes. They are small errors, but error propagation makes them expensive to chase. So, once more: whenever touching such code, give yourself a mental reminder. As for locating the error, if there is really no good lead, print the key functions' input variables, and with multiple ranks print the rank information too. It sounds trivial, but it is a fairly universal method that has solved many problems for me, and it is also convincing when communicating with others.

about the length of the int

There are differences between int and int32_t: https://www.quora.com/Is-using-int32_t-preferred-to-using-int-in-modern-C-C++-programming
On most modern platforms int is 4 bytes (the standard only guarantees at least 16 bits). I mistakenly used int64 to parse an int, and the error propagated through the program; it took some time to find. Be careful about data lengths in any case: mixing up int32 and int64 introduces extra padding and unexpected behaviours. One reason this is hard to debug is that the visible errors are downstream of the mistake, so the actual error is hard to trace. The generally workable approach is to come back to the last point where things succeeded, maybe the original implementation. For example, if you are trying to provide a new communicator, go back to the original communicator and see whether that works first. Once you have something that works well, start adding new things or replacing particular subfunctions, adding your modifications step by step to narrow down the problem.

Some other discussions cover how an integer is read out from a memory address, alignment issues, and how alignment can avoid multiple memory accesses at the cost of data padding:
(https://softwareengineering.stackexchange.com/questions/363370/how-does-a-cpu-load-multiple-bytes-at-once-if-memory-is-byte-addressed)

virtual for the destructor

I forgot to add the virtual keyword to the destructor in a class definition that had inheritance. In this case, the destructor of the child class is not called when deleting through a base pointer. Refer to this for more details:

https://www.quantstart.com/articles/C-Virtual-Destructors-How-to-Avoid-Memory-Leaks/

This can cause a memory leak.
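A minimal sketch of the rule: deleting a derived object through a base pointer only runs the derived destructor when the base destructor is virtual; without `virtual` here, ~Derived would be skipped and its resources leaked.

```cpp
#include <cassert>

static bool derived_destroyed = false;

struct Base {
    virtual ~Base() {}  // remove `virtual` to reproduce the leak
};

struct Derived : Base {
    ~Derived() override { derived_destroyed = true; }
};
```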

correct compiler and linker

Many unexpected problems come from the compiler and linker; in short, from not explicitly specifying the compiler or the C++ version, for example code compiled by different gcc versions being linked together, or using a package manager like spack without explicitly specifying which gcc to install with. These problems are annoying because you often do not know where to start. At that point, think about the current runtime environment and whether there could be a compiler or linker issue, for example an unexpected compiler or linker being picked up because the env changed. That matters.

variable initialization

Refer to this about variable initialization (https://www.learncpp.com/cpp-tutorial/uninitialized-variables-and-undefined-behavior/). The lesson: do not assume the compiler initializes a particular variable for you; starting from the assumption that every variable is uninitialized by default helps you avoid potential issues and write more cautious code. One recent bug: in one program, we use an eof flag to indicate the end of the file. This variable was not initialized (but we assumed it was initialized to 0), so the file could not be parsed correctly when reading data from it (since we use this flag to detect the end of the file). This kind of error may be more common in C-related projects, since a lot of the file I/O has to be implemented manually.
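A sketch with a hypothetical reader (the struct and loop are invented for illustration): initialize the eof flag explicitly at the definition instead of assuming the compiler zeroes it; an uninitialized local int holds an indeterminate value.

```cpp
#include <cassert>

struct Reader {
    int eof = 0;  // explicit default, not an assumption about the compiler
    int pos = 0;
};

// Reads until the end of a pretend file of `len` records.
int read_all(Reader& r, int len) {
    int count = 0;
    while (!r.eof) {  // safe only because eof starts at a known value
        r.pos++;
        if (r.pos >= len) r.eof = 1;
        count++;
    }
    return count;
}
```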

the evidence to show something

An imperfect part of being human is that we easily make all kinds of mistakes. Starting from that assumption, you can adopt a more flexible strategy for validating decisions and results. I have made several small mistakes here and there from a lack of checking. Say we need to register for a class before a particular date; missing that date can cause a lot of trouble. I can remember many things like this; I even missed an important exam because I mixed up the location (two places with similar spellings). So double-checking is important, and when you check, make sure the method of checking is solid. This is a habit that protects you from small mistakes. Make sure the evidence or the results fully support your decision. Consider yourself a judge: you need direct evidence, not someone's verbal account (confirm it against more direct evidence such as the code or execution results), and not the ideas that start with "maybe" in your mind.

gdb can not be used directly in some cases

In one particular case, we cannot use gdb in the traditional way (open a terminal and wait there); this might be a long-running server that somehow crashes with a segfault. In this case, if we want a stack trace, do ulimit -c unlimited, then run the server and go do other things. You should end up with a file named core if it segfaults. Open gdb (gdb mbserver for instance, if mbserver is what crashed), then do core core, which loads the state of the process when it crashed. You can then do backtrace and things like that.

Wrong static parameter

In a for loop, we accidentally put the static keyword on the variable declaration and initialization:

static std::string filename = "test" + std::to_string(i);

Then the code does not work as expected. Since the variable is declared static, it is only initialized once. If we separate the definition from the assignment, it works; otherwise, from the second iteration on, the initialization is not executed, because a static variable is not destroyed when the scope exits.
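A small sketch of both versions: the static local is initialized only on the first pass through the loop, so every later iteration reuses the value from i == 0; dropping `static` gives a fresh value each iteration.

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<std::string> names_buggy(int n) {
    std::vector<std::string> out;
    for (int i = 0; i < n; i++) {
        static std::string filename = "test" + std::to_string(i);  // runs once
        out.push_back(filename);
    }
    return out;
}

std::vector<std::string> names_fixed(int n) {
    std::vector<std::string> out;
    for (int i = 0; i < n; i++) {
        std::string filename = "test" + std::to_string(i);  // runs every time
        out.push_back(filename);
    }
    return out;
}
```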

We will discuss the static keyword in a separate blog.

Undefined reference, undefined symbol

First, do not panic when getting this error; just calm down and consider the possible reasons from scratch. It means we call a function that is declared, but the linker cannot find its definition. There are several possible reasons; we can check this list when the error appears again. For what it is worth, on mac m1 it tends to say undefined symbol, while on linux it tends to be undefined reference.

  • Forgetting to link the associated library: the linking issue might be triggered by the compiling part. For this error, we actually forgot to add the associated .so file to the cmake command. One time, I simply forgot to add the associated header and cxx file to the library, in the CMakeLists, that the test case links against. Looking only at the source code level, it is easy to miss this. Forgetting to link the proper library may be the most common reason.

  • The linking path is set to the wrong one. This is tricky to detect and solve, especially when you mix two base install environments, such as homebrew and anaconda. When the linker tries to find a specific low-level library, something cpython-related for example, it may have been compiled under homebrew while the link resolves to the one under anaconda, which can cause the undefined symbol issue. I remember that compiling ParaView with Python enabled always had some issue; then I noticed that my python interpreter was under the anaconda env while I used a lot of packages under the homebrew env. After deactivating the conda env, the compile worked as expected. (It is not the root cause of the problem, but it at least makes things work.) Maybe one package system is outdated, for example one still targets the x86 architecture while another decides to compile for arm64.

  • We linked the correct library, but the function name does not match; for example, if a C program links a C++ program, there can be a name mangling issue if we forget to use extern "C".

  • Function visibility issues during compiling. A recent one: we did not set the function's visibility correctly, and the function was hidden from the outside library. (This one is a little bit tricky.)

  • Forgetting to add the proper function (constructor or destructor) to the class. For example, "undefined symbols for ... vtable referenced from ..." might mean we forgot to define the destructor or did not apply the rule of zero/three/five properly. One time, we defined the header file in a wrong way or forgot a specific virtual function. Declaring a constructor but not implementing it also triggers this: the undefined symbol or reference appears when the object is created.
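The name-mangling item on the list can be shown in a few lines: a C++ function compiled without extern "C" is exported under a mangled name (something like _Z8add_intsii), so a C caller linking against "add_ints" gets an undefined reference; extern "C" keeps the unmangled name.

```cpp
#include <cassert>

// Visible to C code as the plain symbol "add_ints".
extern "C" int add_ints(int a, int b) {
    return a + b;
}
```

You can confirm the exported symbol with nm on the object file, which connects back to the tooling mentioned earlier.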

MPI free issue

In our MPI program, lib A calls lib B; they are compiled separately. Lib B assumes the MPI communicator can either be created by itself or set from lib A, but in particular cases lib B tries to free an MPI comm that does not belong to it, and then there is an error like this:

internal_Comm_free(87): MPI_Comm_free(comm=0x7ffe48d7bde0) failed
internal_Comm_free(61): Cannot free permanent communicator MPI_COMM_WORLD

The essential question is how to manage the MPI communicator between dependent libraries. The important rule: do not delete or free a resource you did not create yourself. Keep that in mind and this kind of issue can be found quickly.

From 2d to 3d

Another interesting bug I met recently is a typo. In computer graphics or visualization programs especially, we often develop an algorithm in 2d first and then move to 3d. This typo used the same condition for the 2d and 3d cases: where the algorithm should have compared against the z dimension, it compared against the y dimension. The error survived quite a long time, since in the test data the y and z dimensions of the mesh happened to always be the same. Only recently, with datasets where y and z differ, did we find the error.

Although it is a small issue, it took a senior staff member a long time to find, since the whole algorithm is really complicated. The way he found it is inspiring. The issue appeared in a complete pipeline, where all kinds of dependencies make reproducing the error difficult. We first extracted the specific data and then split off all the dependencies (such as data transfer etc.). Next, since the data contains multiple blocks, we selected the one specific block that generates the issue. At this step the data was still comparatively large and reproduction still difficult, so he continued extracting the key data segment that generates the error. At last, we found the place that generates the issue. Even for a specific algorithm running on a single node, a failure triggered only by a large dataset can still be hard to debug.

Cell field and point field

For scientific data on a mesh, the typical property is the association of a field: it can be associated with the data points, or with the cells. It is important to ask what the association of a specific field is (point, cell, or some metadata for the whole domain). If you visualize the data and mistakenly set a point association as a cell association or vice versa, you will not get the expected results (I have set this association wrongly several times). In ParaView, there may be a warning message that the size of the loaded field does not match the metadata, which can be an indicator that you set the field association wrongly.

Similarly, when processing the data with a specific filter, a wrong field association will not produce correct execution results.

Mesh-based programming is hard to debug; basically we need to process all kinds of cells or points. One common case: when the input dataset becomes large, edge cases appear, and those edge cases can trigger errors. In a recent example, there were differences in a specific region between the results generated by my program (which uses a different math library) and the standard output from another program. The solution went as follows: we compared the two results, found the cell with the largest difference, and output that cell id. Then, in our program, we used that cell id as the condition to trigger the printf operation. Comparing our values with the standard library's, we found that our program's results had low precision (some eigenvalues were treated as 0 in our program). After fixing this, the results looked good.

MPI all reduce parameters

In the MPI all-reduce API, the count in the recv position is the size of the received struct, i.e. the number of elements received from any one process, not the size of the whole recv buffer. I seem to have gotten this wrong several times. When it is wrong, the function call throws baffling malloc-related bugs, and then it is easy to flail around like a headless fly, suspecting MPI itself, or linking to the wrong MPI, and so on. The way I found it is similar to before: first find code that runs correctly, then narrow down bit by bit to see where things break. After ruling out the init memory buffer and MPI itself, it finally turned out to be the element count of the recv buffer. A simple problem, but it took a long time of back and forth to find the cause.

Restart can solve everything

A simple case: the old computer can connect to the printer, but it runs slowly and takes several minutes to open a pdf.

The old mac has the usb adaptor (which can connect the printer), but it does not have the driver for the printer. The new mac can connect to the internet and download the driver freely, but being a new mac, it does not have the adaptor for the usb cable.

I tried to transfer the file and the associated drivers from the new mac to the old mac, then print everything out. Transferring the file by airdrop did not work, and I do not know why. Everything worked after I restarted the computer.

minus operation for the uint number

I got an issue when writing a for loop like this:

for(uint i=k;i>=0;i--)

The issue is that at the iteration when i is 0, it still executes the decrement. If i were signed, the loop would exit, since -1 is a negative number. However, we declared i as a uint, so by modular arithmetic it wraps around to a really large number, goes into the loop again, and causes an index error; it may trigger a segfault, or even an effectively enormous loop. So be careful to avoid unsigned types when there is a comparison such as i>=0.
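A sketch of a downward loop that stays safe with an unsigned index: `i >= 0` is always true for unsigned types, so test first and decrement inside the condition instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

int sum_reverse(const std::vector<int>& v) {
    int total = 0;
    for (std::size_t i = v.size(); i-- > 0; ) {  // visits size-1 ... 0, then stops
        total += v[i];
    }
    return total;
}
```

The post-decrement in the condition means that when i is 0 the test fails before the body runs again, so the wrap-around never matters.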

MPI Isend/Irecv

Complicated isend and irecv scenarios are where mistakes happen most easily. Since everything is asynchronous, debugging is inconvenient; the most direct approach is to print various messages and check whether the sends and recvs match. A mismatch can make a process hang, e.g. the code gets stuck in certain special cases.

type checking when copy between two arrays

I was recently working on an array copy operation that copies a subrange of the input array into the dest array, starting from a specific offset. However, I did not notice the types of the two arrays, and the input array's type did not match the output array's. One consequence is that the element sizes differ between the two arrays. So even if the offset is a correct number (it usually counts elements), the results can be wrong: we assume the input and output arrays have the same element type, but they do not, and this triggers issues that are hard to debug.
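A sketch of a guard against this (a hypothetical wrapper, not the project's code): a small template that refuses at compile time to copy between arrays whose element types differ, since offsets counted in elements silently go wrong otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Copy n elements from src into dst starting at element dst_offset.
template <typename T, typename U>
void copy_range(T* dst, std::size_t dst_offset, const U* src, std::size_t n) {
    static_assert(std::is_same<T, U>::value,
                  "source and destination element types must match");
    std::memcpy(dst + dst_offset, src, n * sizeof(T));
}
```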

Other cases

Problems triggered by large scale in parallel cases

This problem bothered me for several days, and in the end it came down to a few small oversights. Scale-related problems like this are the hardest class to debug: intermittent and hard to localize.

Description: at the last step of a while loop inside a simulation, a bcast is called to tell all processes the relevant information from the master proc. The bcast ran correctly for the first few iterations, and then at some point started producing wrong results. The communicator was written by someone else and is not standard MPI, so it was unclear whether the communicator or my own code was at fault.

Approach:

(1) Simplify the current problem, find a version that works, then check bit by bit which place triggers the problem.

(2) Ask the communicator's maintainers, and try to build a reproducer so the maintainers can reproduce the problem easily. It also makes the conversation concrete.

(3) Avoid using the function, and see whether the problem can be bypassed at the implementation level with an alternative.

(4) Fork the relevant code, add logging, and debug step by step to see where it goes wrong.

How it was finally solved: first I examined my own code, e.g. whether the step was modified somewhere, or whether bcast was called multiple times (I also realized it might be a tag problem and switched to different tags, but carelessly the different tags were still set to the same value, which I only discovered later). Then I tried different scales: small scales worked, large scales did not. Then I tried to build a reproducer and found the problem disappeared once the code was simplified. Still clueless, I asked the maintainer, who suggested trying the latest version of the package and, if the problem persisted, reproducing it with a reproducer. Out of ideas, I looked for a way to bypass the problem, and then discovered that in the code the tags were in fact the same number; after changing them, everything was normal.

In fact there were multiple bcast calls inside the while loop, all using the same tag. When the program logic was simple, all processes stayed fairly synchronized; once the logic grew complex with various if conditions, the shared tag broke the bcasts, and using different tags fixed it. There is more discussion of message tags elsewhere. In short, different messages need different tags.

Very often these are small, hard-to-notice bugs caused by my own carelessness. The more debugging techniques you have, the more efficient discovery becomes; try every approach, and one of them may work. When there is no progress, quiet down, read your own code, and straighten out the whole flow; that also helps locate the problem.

Flush printed log results

On HPC platforms, we often use a bash script to control job execution and redirect the output into a particular log file. Be careful about log flushing. In Python, a plain print does not flush the buffer by default, which means you see output when executing the binary from a terminal, but find nothing in the log file. Always set the flush flag to True, or use a dedicated logging library.

For a C program using printf, the buffer is likewise not flushed immediately; call the flush function manually. With C++ std::cout, writing std::endl flushes the stream, so the output reaches the log file as expected.
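A small sketch of the printf case: flush explicitly after important log lines so the output reaches a redirected log file even if the job is killed before a normal exit. The function name here is made up for illustration.

```cpp
#include <cassert>
#include <cstdio>

// Logs one line and forces it out of the stdio buffer immediately.
// Returns the number of characters written (negative on error).
int log_step(int step) {
    int n = std::printf("finished step %d\n", step);
    std::fflush(stdout);  // printf buffers; push the line out now
    return n;
}
```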

Remember to process the warning message

When we compile a program, just turn on all compiler warnings. It is a really good habit: it makes the code cleaner and catches potential issues (such as casting problems or ijk index slips) and bad design.

Remember to add the following cmake options when compiling a project with cmake in the future.

# enable all compiler warnings (append rather than overwrite existing flags)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -Wpedantic")

Random number generator in parallel

When we use a random number generator, the result is usually fixed once we fix the seed value. Be careful when generating an array of random numbers in parallel: each thread's seed should be set according to its threadId, otherwise every thread produces the identical sequence of samples, which is not what we want.
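A sketch of the per-thread seeding (the function is a hypothetical illustration): each thread seeds its own generator from a base seed plus its thread id; sharing one seed across threads would make every thread reproduce the same sequence.

```cpp
#include <cassert>
#include <random>
#include <thread>
#include <vector>

// Fill a vector with nthreads * per_thread samples in parallel.
std::vector<double> parallel_samples(int nthreads, int per_thread,
                                     unsigned base_seed) {
    std::vector<double> out(nthreads * per_thread);
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; t++) {
        pool.emplace_back([&out, per_thread, base_seed, t] {
            std::mt19937 gen(base_seed + t);  // distinct seed per thread
            std::uniform_real_distribution<double> dist(0.0, 1.0);
            for (int i = 0; i < per_thread; i++) {
                out[t * per_thread + i] = dist(gen);
            }
        });
    }
    for (auto& th : pool) th.join();
    return out;
}
```

Each thread writes to a disjoint slice of the vector, so no lock is needed around the output either.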

get confused about the swap and the copy

This may not be a bug; it is more like a mistake. During an interview, when I was asked how to do dynamic reallocation of an array, I answered with a swap-style procedure: copy the data to a newly allocated temp space, assign the larger space, then copy back. That answer sounds silly, since a third allocated space is only necessary when exchanging the contents of two spaces. For a copy we simply do not need the old content of the destination anymore, so there are just two steps: allocate the new space, and copy the contents from the old space into the new one. That's all. Be careful about these small mistakes; copy operations and swap operations are two different things.

assert and debug

One time, a program (a math-heavy one in particular) used a lot of assert operations, and I forgot that it was compiled in Debug mode. This caused the associated math function to run for a really long time to compute the expected results. Essentially, assert is a kind of macro: once you use Release mode (which defines NDEBUG), these checks are not compiled into the final program.

Recommended articles