MPI Basics

Notes on Parallel Programming with MPI. Even though the book was published 20 years ago, some of the material is still inspiring to me and explains the concepts clearly. For every point, the authors also spell out the limitations, which is a really good way to make the ideas clear.

Background Knowledge

hardware issues:

basic classification for parallel computing
Flynn's Taxonomy: SISD, SIMD, MISD, MIMD

von Neumann model and bottleneck
The bottleneck is the latency between the memory and the CPU. One strategy is caching at different levels; the other is pipelining.

Pipeline architecture

If the execution of instructions is standardized into uniform stages, the program can proceed in pipelined mode, which is common knowledge in computer architecture. Another strategy is to distribute the data across multiple memory banks properly, to eliminate the time spent waiting on I/O.

Drawback: pipelines do not work well for programs with irregular structures or many branches. They also do not scale well to complex problems, because the logic units are not standardized for large systems.

SIMD

In every SM (streaming multiprocessor), the work mode is SIMD. The most naive way is to store the data in the global area and have every thread in the process apply the same specific logic to its piece.

The important thing about this mode is that on divergent or conditional instructions the threads wait for each other, which means the different processing units step through a conditional instruction synchronously, each one either active or idle.

The drawback is also obvious: in a program with many conditional branches, the processors can sit idle for a long period of time.

MIMD
Multiple CPUs and multiple memory units, and the instructions can be executed asynchronously.
There are two types: shared-memory MIMD and distributed-memory MIMD (this part is a little complicated, since it is mostly about hardware design).

software issues

shared memory programming

Memory that can be accessed by all the processes.

Common terms from the OS world: critical section, mutual exclusion, atomic operation…
Pay attention to the barrier: a barrier is usually implemented as a function; once a process has called it, the call does not return until every other process has called it as well.
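
As a concrete illustration, MPI provides the same primitive as MPI_Barrier. A minimal sketch (my own example, not from the book; the sleep is only there to simulate uneven amounts of work):

#include <stdio.h>
#include <unistd.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    sleep(rank); //simulate uneven work: higher ranks arrive later
    printf("process %d reached the barrier\n", rank);
    //no process returns from this call until every process has entered it
    MPI_Barrier(MPI_COMM_WORLD);
    printf("process %d passed the barrier\n", rank);
    MPI_Finalize();
    return 0;
}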

message passing

This model focuses on the communication between processes. MPI supports several network types, such as InfiniBand for HPC, and it defines a clean protocol for data transmission, for example specifying the type of the data to be transferred. Even though there are all kinds of fancy frameworks for HPC now, MPI still plays an important role as the underlying communication layer.

The Message Passing Interface is the typical example here, and the common programming model is SPMD (this is more about the programming level):

if (my_process_rank == 0) {
    MPI_Send(...);
} else if (my_process_rank == 1) {
    MPI_Recv(...);
}

buffering

It seems we often confuse the terms "buffer" and "cache". Here, buffering means staging the data in main memory; when we say cache, we usually refer to the cache at the hardware level.

One function of a buffer is to turn synchronous communication into asynchronous, buffered communication. Because the message is buffered in the memory of the sending or receiving process, the sending process can go on to execute other logic after posting the send. Without the buffer, the sending process (process 0) would have to send a "request to send" to process 1 and wait until it receives a "ready to receive" back from process 1 (something like a three-way handshake) before it begins to transfer the data. With a buffering strategy, the contents are copied into the buffer and the transmission can start from there.

The disadvantage is that system resources are used to buffer the contents, and the whole transfer takes longer because it involves copying the data from the buffer into the program's memory location.
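
A minimal sketch of buffered communication using MPI_Bsend with a user-attached buffer (my own example; the 100-byte buffer plus overhead is only illustrative):

#include <stdlib.h>
#include <mpi.h>
//sketch: process 0 does a buffered send, process 1 a plain receive
void buffered_send_example(int rank)
{
    char msg[100] = "hello";
    if (rank == 0) {
        //attach a user buffer; MPI copies the outgoing message into it,
        //so MPI_Bsend can return before the receiver is ready
        int size = 100 + MPI_BSEND_OVERHEAD;
        char *buf = malloc(size);
        MPI_Buffer_attach(buf, size);
        MPI_Bsend(msg, 100, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        //detach waits until the buffered message has actually been delivered
        MPI_Buffer_detach(&buf, &size);
        free(buf);
    } else if (rank == 1) {
        MPI_Recv(msg, 100, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}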

two approaches for programming parallel systems

data parallel: divide the data among the processors, and each processor applies the same operations to its portion of the data

message passing: a message passing function is simply a function that explicitly transmits data from one process to another.

blocking and nonblocking communication

With blocking communication, the sending process (process 0) will block if the receiving process (process 1) is not ready to receive the message.

It is important to distinguish blocking from synchronous communication here.

The concept of synchronous communication is clear: process 0 will not send the message until it receives something like "OK to send, I'm ready" from process 1.

Blocking on the receive end: if process 1 calls MPI_Recv but process 0 has not prepared the data yet, process 1 will stay there, idle and doing nothing. This is blocking at the receive end.

Blocking on the send end: even if process 0 has put the message into the buffer and process 1 is ready to receive, the communication line might be busy, so process 0 will block and wait for the line to become available.

A nonblocking call returns immediately, so the process can do operations that do not depend on the message being sent, and check back later to make sure all the data has arrived.
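
A minimal sketch of nonblocking communication with MPI_Isend / MPI_Irecv (my own example): both sides post the operation, do unrelated work, and only then wait for completion:

#include <mpi.h>
//sketch: overlap communication with work that does not depend on the message
void nonblocking_example(int rank)
{
    int data = 42, recv_data = 0;
    MPI_Request req;
    if (rank == 0) {
        //returns immediately; the transfer completes in the background
        MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {
        MPI_Irecv(&recv_data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
    } else {
        return; //other ranks do not participate in this sketch
    }
    //... do independent work here ...
    //check back later: block only now, until the transfer has finished
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}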

Data Parallel

RPC, active messages

Data Mapping

The core issue of the data mapping problem is to assign data elements to processors so that communication is minimized, namely how to map the data onto multiple nodes to minimize the data communication time. There are lots of trade-offs behind this. Memory should correspond with the processor; the naive way is to use one process per node (like Redis). The drawback of this single-thread model is the waste of computational resources. Because it always depends on the specific situation and use case, there are many open issues in this area, such as whether to balance computational resources against storage resources, or to maximize one of them at the cost of the other.

MPI Basics

If you come to a parallel computing class straight from an OS class, one natural question is: "what is the difference between pthread or fork based parallel computing and the MPI programming model?" There is an answer here.

If you want to run MPI on multiple servers, refer to this; it mainly depends on NFS, passwordless SSH, and adding the corresponding parameters to the mpirun command (it is better to use the same MPI implementation everywhere; the common ones are OpenMPI and MPICH2). For the scheduling strategy of MPI, please refer to this.

This is the simplest case of message sending and receiving:

#include <stdio.h>
#include <string.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int p_rank;
    int p_num;
    int source;
    int dest;
    int tag = 0;
    char message[100];
    char process_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Status status;
    //MPI_Init(&argc, &argv);
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &p_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p_num);
    if (p_rank != 0)
    {
        //create message and send it to process 0 with tag 0
        sprintf(message, "Greetings from the process %d", p_rank);
        dest = 0;
        MPI_Send(message, strlen(message) + 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        MPI_Get_processor_name(process_name, &name_len);
        printf("%s send message\n", process_name);
    }
    else
    {   //the process with rank 0 receives from every other process
        for (source = 1; source < p_num; source++)
        {
            MPI_Get_processor_name(process_name, &name_len);
            MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
            printf("process id %d in processor %s get message : %s\n", p_rank, process_name, message);
        }
    }
    //shut down MPI
    MPI_Finalize();
    return 0;
}

This code follows the typical SPMD model; the tricky part is to keep in mind that the same program will be executed on different processors and operate on different data.

The typical structure of an MPI program starts with MPI_Init(NULL, NULL); and ends with MPI_Finalize();. The two parameters can also be pointers to the arguments of the main function, as in MPI_Init(&argc, &argv).

MPI_Init means that no MPI function can be called before it, and MPI_Finalize means that no MPI function can be called after it.

MPI_COMM_WORLD, Comm_rank

MPI_COMM_WORLD is a communicator, which represents a collection of processes that can send messages to each other; the rank is the index of a process within that communicator. The functions MPI_Comm_rank and MPI_Comm_size are used to get the rank (index) of the current process in a specific communicator and the number of processes in that communicator, respectively. The default global communicator is MPI_COMM_WORLD. Basically, the rank gives each process a unique identity for sending and receiving: in the send and receive functions, the source and destination endpoints are also specified by this rank parameter.

tag

The tag is a finer-grained partition between a specific sender and receiver. The communication can be divided into several types, and we want each type to be kept separate from the others; for example, only a receive call with tag = 0 can receive a message that was sent with tag = 0. Basically, the communicator, the rank, and the tag are abstractions of the sending and receiving endpoints at different levels.

If the tags of the send and the receive do not match, the receive will hang there. This case is similar to blocking communication: the receiving end is ready to receive a message with tag = 1, but the sender sends the message with tag = 0, so the receiver never gets it. In the following case, only the MPI_Recv with tag = 0 can receive a message:

if (p_rank != 0)
{
    //create message and send it with tag 0
    sprintf(message, "Greetings from the process %d", p_rank);
    dest = 0;
    MPI_Send(message, strlen(message) + 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    MPI_Get_processor_name(process_name, &name_len);
    printf("%s send message\n", process_name);
}
else
{   //the process with rank 0
    for (source = 1; source < p_num; source++)
    {
        MPI_Get_processor_name(process_name, &name_len);
        //the senders used tag = 0, so this only matches for source == 1;
        //the later receives never match and the program hangs
        tag = source - 1;
        MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
        printf("process id %d in processor %s get message : %s\n", p_rank, process_name, message);
    }
}


//output
...
e1c087 send message
e1c086 send message
e1c087 send message
e1c086 send message
e1c086 send message
process id 0 in processor e1c085 get message : Greetings from the process 1
//hang there forever

sending parameters

The element type (which determines the size of each element in memory) and the number of elements must be specified when sending and receiving.

There are two wildcards on the receive end, one for the source and one for the tag. In the following program, we use the wildcards to receive messages with any tag, no matter which sender they come from.

if (p_rank != 0)
{
    //create message and send it with tag = rank
    sprintf(message, "Greetings from the process %d", p_rank);
    dest = 0;
    tag = p_rank;
    MPI_Send(message, strlen(message) + 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    MPI_Get_processor_name(process_name, &name_len);
    printf("%s send message\n", process_name);
}
else
{   //the process with rank 0
    for (source = 1; source < p_num; source++)
    {
        MPI_Get_processor_name(process_name, &name_len);
        //tag = source - 1;
        //MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
        MPI_Recv(message, 100, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        printf("process id %d in processor %s get message : %s, comm src rank %d comm tag %d\n",
               p_rank, process_name, message, status.MPI_SOURCE, status.MPI_TAG);
    }
}

/*
output:
elf-login2 send message
elf-login2 send message
elf-login2 send message
process id 0 in processor elf-login2 get message : Greetings from the process 1, comm src rank 1 comm tag 1
process id 0 in processor elf-login2 get message : Greetings from the process 3, comm src rank 3 comm tag 3
process id 0 in processor elf-login2 get message : Greetings from the process 4, comm src rank 4 comm tag 4
process id 0 in processor elf-login2 get message : Greetings from the process 2, comm src rank 2 comm tag 2
elf-login2 send message
*/

collective communication

In my opinion, one drawback of the MPI programming model is the lack of shared storage. For the same matrix multiplication, if you use a GPU the steps are straightforward: just copy the data between the GPU threads and global memory, refer to this (https://www.shodor.org/media/content/petascale/materials/UPModules/matrixMultiplication/moduleDocument.pdf). For MPI, it is not so easy to understand the "Fox" algorithm the first time, because the data transmission between processes increases the complexity of the programming.

For grouped data transmission:

  • If the data have the same type and are stored in contiguous memory locations, use the type and count parameters to express how much data you want to transmit, or use MPI_Type_contiguous and MPI_Type_vector to build a derived type.
  • If the data have the same type but sit at irregular locations, use MPI_Type_indexed to build a new MPI datatype.
  • If the struct mixes several types of data, use MPI_Type_struct (MPI_Type_create_struct in current MPI) to build a new datatype; see the sketch after this list.
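
A minimal sketch of the mixed-type case, using a hypothetical particle struct of my own: it builds a matching derived type with MPI_Type_create_struct and broadcasts one instance:

#include <stddef.h>
#include <mpi.h>
//hypothetical struct mixing an int and doubles
typedef struct {
    int    id;
    double position[3];
} particle_t;
//build an MPI datatype matching particle_t so it can be sent as one unit
MPI_Datatype build_particle_type(void)
{
    int          blocklens[2] = {1, 3};
    MPI_Datatype types[2]     = {MPI_INT, MPI_DOUBLE};
    MPI_Aint     displs[2]    = {offsetof(particle_t, id),
                                 offsetof(particle_t, position)};
    MPI_Datatype particle_type;
    MPI_Type_create_struct(2, blocklens, displs, types, &particle_type);
    MPI_Type_commit(&particle_type);
    return particle_type;
}
//usage sketch: rank 0 broadcasts one particle to every process
void send_particle(int rank)
{
    particle_t p = {0};
    if (rank == 0) { p.id = 7; p.position[0] = 1.5; }
    MPI_Datatype particle_type = build_particle_type();
    MPI_Bcast(&p, 1, particle_type, 0, MPI_COMM_WORLD);
    MPI_Type_free(&particle_type);
}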

About groups and communicators, this (https://mpitutorial.com/tutorials/introduction-to-groups-and-communicators/) is the preferred tutorial. The parameters of MPI_Comm_split need to be stressed here: the first parameter is the original communicator; the second parameter is the color, i.e. the index of the new communicator the calling process joins. For example, if you want to split the original communicator into four new ones, this parameter takes a value among 0, 1, 2, 3, and a value of 0 means the current process belongs to the new communicator with index 0. The third parameter is an integer key used to order the ranks inside the new communicator (the rank in the original communicator is commonly passed here). The last parameter is a pointer that receives the new communicator split off from the original one.
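
A minimal sketch of MPI_Comm_split (my own example): split MPI_COMM_WORLD into two new communicators by rank parity, using the old rank as the ordering key:

#include <stdio.h>
#include <mpi.h>
//sketch: split MPI_COMM_WORLD into sub-communicators by rank parity
void split_example(void)
{
    int world_rank, new_rank;
    MPI_Comm new_comm;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    //color selects which new communicator this process joins (0 or 1);
    //the key (here the old rank) decides the rank ordering inside it
    int color = world_rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &new_comm);
    MPI_Comm_rank(new_comm, &new_rank);
    printf("world rank %d -> color %d, new rank %d\n", world_rank, color, new_rank);
    MPI_Comm_free(&new_comm);
}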

Another way to create a new communicator is to construct it directly. A communicator has two major parts: one is the context (or group id), which is used to distinguish different communicators; the other is a set of processes, also called a group. Therefore, the first step is to create a process group, and the second step is to create a communicator based on that group.
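
A minimal sketch of this two-step construction (my own example), taking the first half of the ranks in MPI_COMM_WORLD as the new group:

#include <stdlib.h>
#include <mpi.h>
//sketch: build a communicator from an explicitly constructed group
void group_example(void)
{
    int world_size;
    MPI_Group world_group, half_group;
    MPI_Comm half_comm;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    //step 1: create a process group holding the first half of the ranks
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    int n = world_size / 2;
    int *ranks = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++)
        ranks[i] = i;
    MPI_Group_incl(world_group, n, ranks, &half_group);
    //step 2: create a communicator based on that group;
    //processes outside the group get MPI_COMM_NULL back
    MPI_Comm_create(MPI_COMM_WORLD, half_group, &half_comm);
    if (half_comm != MPI_COMM_NULL)
        MPI_Comm_free(&half_comm);
    MPI_Group_free(&half_group);
    MPI_Group_free(&world_group);
    free(ranks);
}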

Creating a communicator is similar to the concept of an overlay network: a new network is layered on top of the old one, and the endpoints of the old network can communicate with each other through the new network (if they belong to it) or through the original one.

MPI_Gather / MPI_Allgather

When calling MPI_Gather or MPI_Allgather, pay attention to the meaning of the parameters. For example, the receive count parameter is the number of elements received from each process, not the total number of elements in the receive buffer. Refer to this for more information.
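
A minimal sketch with MPI_Gather (my own example): every process contributes 2 integers, so the receive count is 2 per sender, even though the root's buffer holds 2 * p_num elements:

#include <stdlib.h>
#include <mpi.h>
//sketch: every process sends 2 ints, the root gathers all of them
void gather_example(void)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int sendbuf[2] = {rank, rank * 10};
    int *recvbuf = NULL;
    if (rank == 0)
        recvbuf = malloc(2 * size * sizeof(int)); //the root needs room for everyone
    //recvcount = 2: the number of elements received from EACH process,
    //not the total size of the receive buffer
    MPI_Gather(sendbuf, 2, MPI_INT, recvbuf, 2, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0)
        free(recvbuf);
}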

reference

mpi transfer struct
https://stackoverflow.com/questions/9864510/struct-serialization-in-c-and-transfer-over-mpi
MPI_Type_create_struct

book web
http://www.cs.usfca.edu/~peter/ppmpi/

run mpi within multiple machine
http://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/
