MPI advanced

Some ideas about using the advanced MPI API, such as probe, non-blocking calls, etc.

The basic operations and APIs are covered in MPI tutorials.

More details of MPI send

MPI_Send is actually a wrapper around several kinds of send operation: in standard mode the implementation is free to buffer the message or to wait for the receiver. The key questions are (1) whether a buffer is used by the send, and (2) whether the message has been received at the receiving end by the time the send returns.

According to this blog and the communication modes described in the MPI standard, a synchronous send (MPI_Ssend) is guaranteed to return only after the matching recv has been posted (the recv API has been called). In the buffered case (MPI_Bsend), the send can return quickly even if the matching recv has not been called, because the message is copied into a user-attached buffer.

Probe and get count

MPI_Probe fetches the metadata of a pending message and stores it in an MPI_Status structure. This blog provides a really good and straightforward explanation of probe and the get count operation that follows it. The main idea is that we can allocate space dynamically based on the result of the probe instead of using a fixed amount of memory, which gives the MPI recv call a lot of flexibility. The simple implementation, by contrast, just uses the same fixed buffer size for sending and receiving.

This sentence gives a really good explanation of the MPI_Probe operation: In fact, you can think of MPI_Probe as an MPI_Recv that does everything but receive the message. Similar to MPI_Recv, MPI_Probe will block for a message with a matching tag and sender. When the message is available, it will fill the status structure with information (only the metadata here).

Non-blocking and all kinds of Wait

High-quality software built on the MPI library often needs non-blocking send and recv, which can overlap multiple communications, or overlap communication with computation.

This is the simple version of the send and recv

Some notes on blocking, non-blocking, synchronous, and asynchronous communication at the Linux level are recorded in this article.

Let’s see how to convert it to the isend and irecv version.

The basic version uses isend/irecv together with wait. MPI_Wait waits for the completion of a specific request handle. After the irecv call, we can do other things and put the wait at the end; the wait returns once the data transfer has completed.

A more advanced version uses waitsome/waitany; for example, here is an explanation of waitsome. It is used when there are multiple requests: instead of waiting for every request in the list to complete, it returns as soon as at least one has completed. This gives finer-grained control over how the received requests are processed. As soon as a request can be processed, for example when a send operation completes, we can move on to the subsequent operations.

MPI_Wait and MPI_Test

MPI_Test is like a non-blocking version of MPI_Wait. Instead of waiting, we can call MPI_Test inside a while loop: if the message has been received successfully, we process it; otherwise, we do some other computation tasks. This example shows some ideas.

Similarly, there are MPI_Testsome and MPI_Waitsome. MPI_Testsome looks at all the associated requests and returns immediately with the number of requests that have completed, which may be zero. MPI_Waitsome blocks until at least one of the requests completes.

Design for an actual use case

There are several ways to design an isend and irecv pattern. Let’s add several assumptions:

Assume that each MPI isend/irecv message carries one packet. Each rank can be both a sender and a receiver, and the communication pattern is known (for example, 0 sends to 1, 1 sends to 2, …, n-1 sends to 0). Assume that the size (length) of each message is known beforehand. Assume that isend is used for sending and the common blocking recv for receiving.

Compute for current iteration
MPI_Recv (any source)
CheckRequests to make sure requests are ready to be consumed

There is one design for handling messages of unknown size. Use a fixed number of receivers and a fixed buffer size. When sending a message, break it into several packets of fixed length. When receiving, just iterate over the fixed set of irecv calls. When a receiver receives a packet, post another receiver (an MPI irecv call) after it completes processing the packet; the goal is to keep the total number of irecv calls fixed. When all the work is done, that is, when the termination message is received, cancel all the remaining irecv calls.

This pattern can be further improved by using iprobe/recv; the example can be found here, where it is used for particle advection. The assumptions are the same as in the previous design. When sending, a process can send both the actual data and the metadata, and both are sent with isend.

On the receiving side there are two nested while loops. The outer loop contains the computation call, the communication call, and the metadata update call, and it can detect whether there is work available. The inner loop is in the communication call. If the blockAndWait flag is true, it goes to the Probe call, which means there is no work available and the current process needs to wait for new work to arrive. Once a message is received, it is processed; the message can be either metadata or actual data. The blockAndWait flag is then set to false, and the process goes to the IProbe call, which can receive data from any source. If nothing arrives there, or the sending process has sent the termination message, it exits the inner while loop.

Back in the outer while loop, if there is still work available, the blockAndWait flag is set to true and the code enters Probe again. Otherwise, it just exits the outer loop. Whenever IProbe or Probe reports a message, the common recv is used to receive it.

Anyway, there are two key things in this design: the first is the nested loop, and the second is the probe with any source.


Good references containing detailed MPI examples