Tips for debugging parallel or distributed programs
Debugging a parallel or distributed program is daunting, mainly because output from different processes is mixed together in the same log files, and you may not be sure which event happened first. Here are some tips.
The processes of a distributed program often execute similar instructions, so a practical way to tell them apart is to add an id (for example the rank) to every debug message, using a helper such as printwithrank. This id can then be used to trace the messages printed by a particular process or thread.
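A minimal sketch of such a helper, to make the idea concrete. The names format_with_rank and print_with_rank, and the MY_RANK environment variable, are hypothetical and chosen only for illustration; in a real program the rank would come from whatever the runtime provides (for example the communicator in an MPI program).

```python
import os


def format_with_rank(rank, message):
    """Return the message prefixed with the rank and the OS process id."""
    return f"[rank {rank} pid {os.getpid()}] {message}"


def print_with_rank(rank, message):
    # Flush immediately so the line is not left in a buffer if the process crashes.
    print(format_with_rank(rank, message), flush=True)


if __name__ == "__main__":
    # MY_RANK is an assumed variable name; fall back to 0 when it is not set.
    rank = int(os.environ.get("MY_RANK", "0"))
    print_with_rank(rank, "entering exchange step")
```

Because every line starts with the same fixed prefix, all messages from one process can later be pulled out of a mixed log by searching for that prefix.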
It is hard to find the valuable information if you read the log file directly. A better strategy is to use grep to filter the key messages you printed and check whether they match your expectation. For example, when debugging send/recv primitives, print a message at the key places, then grep the log files for the keywords and check whether the number of send messages matches the number of recv messages.
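The send/recv check can also be scripted. The sketch below counts the lines containing each keyword, which is what running grep -c once per keyword would report. The keywords SEND and RECV and the sample log lines are assumptions made for this example, not a format the program is required to use.

```python
def count_events(log_lines, keywords=("SEND", "RECV")):
    """Count how many lines contain each keyword, like `grep -c KEYWORD logfile`."""
    counts = {keyword: 0 for keyword in keywords}
    for line in log_lines:
        for keyword in keywords:
            if keyword in line:
                counts[keyword] += 1
    return counts


# Hypothetical log content; in practice read the lines from the real log file.
sample_log = [
    "[rank 0] SEND tag=1 to=1",
    "[rank 1] RECV tag=1 from=0",
    "[rank 0] SEND tag=2 to=1",
    "[rank 1] computing",
]

counts = count_events(sample_log)
if counts["SEND"] != counts["RECV"]:
    print(f"mismatch: {counts['SEND']} sends vs {counts['RECV']} recvs")
else:
    print("send and recv counts match")
```

In the sample, the second send has no matching recv, so the script reports a mismatch. That is the kind of signal to look for before reading the full log.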