Debugging parallel/distributed programs

Some tips for debugging parallel or distributed programs

Debugging a parallel or distributed program is daunting, mainly because output from different processes is mixed together in the same log files, and you may not be sure which event happened first. Here are some tips.

Use the process ID or thread ID

The processes of a distributed program often execute similar instructions, so a practical way to tell them apart is to add an ID to every debug message, for example through a helper such as printwithrank. This ID can then be used to trace the messages printed by a particular process or thread.
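
A minimal sketch of such a helper is shown here. The names format_with_rank and print_with_rank are hypothetical, and the rank is passed in explicitly on the assumption that the program already knows it (for example from its MPI communicator or a launcher environment variable). The process ID and thread ID come from the Python standard library.

```python
import os
import sys
import threading


def format_with_rank(rank, message):
    """Prefix a message with the rank, process ID, and thread ID."""
    return f"[rank {rank} pid {os.getpid()} tid {threading.get_ident()}] {message}"


def print_with_rank(rank, message, stream=sys.stderr):
    """Write one tagged line and flush it right away."""
    # Building the full line first and writing it in a single call
    # lowers the chance that two processes interleave inside one line.
    stream.write(format_with_rank(rank, message) + "\n")
    stream.flush()


# Usage: the rank value 3 is only an illustration.
print_with_rank(3, "SEND to=1 tag=7")
```

Flushing after each message also means the last lines before a crash or a hang are more likely to reach the log.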

Do not read the log file directly

It is hard to find the valuable information by reading the raw log file. A wiser strategy is to use grep to filter for the key messages you printed, and then check whether they match your expectations. For example, when debugging send/recv primitives, print a message at each key place, then grep the log files for those keywords and check whether the number of send messages matches the number of recv messages.
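
The send/recv check can be sketched as a small script. This is only an illustration: the log format ("[rank N] SEND ..." and "[rank N] RECV ...") and the function name count_events are assumptions, and the sample lines are made up for the example, so adapt the pattern to whatever keywords your program actually prints. With grep, a rough equivalent would be to run grep -c once for each keyword on the log file and compare the two counts.

```python
import re
from collections import Counter

# Assumed log format: "[rank N] SEND ..." or "[rank N] RECV ..."
LINE_RE = re.compile(r"\[rank (\d+)\] (SEND|RECV)\b")


def count_events(lines):
    """Count SEND and RECV messages per rank, ignoring all other lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            rank, kind = int(match.group(1)), match.group(2)
            counts[(rank, kind)] += 1
    return counts


# Made-up sample; in practice, pass an open log file instead of this list.
sample_log = [
    "[rank 0] SEND to=1 tag=7",
    "[rank 1] RECV from=0 tag=7",
    "[rank 0] SEND to=1 tag=8",
    "[rank 1] init done",
]

counts = count_events(sample_log)
total_send = sum(n for (_, kind), n in counts.items() if kind == "SEND")
total_recv = sum(n for (_, kind), n in counts.items() if kind == "RECV")

print(f"SEND={total_send} RECV={total_recv}")
if total_send != total_recv:
    print("mismatch: the send and recv counts differ")
```

Keeping the counts per rank makes it possible to see which process is missing a message, not just that the totals differ.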