MPI_Cart

This article introduces how to use MPI and its Cartesian topology tools to solve a Poisson equation.

We introduce the MPI Cartesian tools and show how to apply them to real-world problems such as numerical equations. We first discuss the MPI Cartesian topology, then show how to use it to solve the 2D Poisson equation and the 3D Laplace equation from scratch.

These are the basic scripts that use the MPI Cartesian tools to solve the Poisson equation, and the related videos are listed as follows:

generate the grid mesh by MPI Cartesian
https://www.youtube.com/watch?v=R9qoLC95SzU

get the neighbour cell id by MPI Cartesian
https://www.youtube.com/watch?v=SmoIFYBWW0E

solving the Poisson equation by serial program
https://www.youtube.com/watch?v=najbwGUSZO0

solving the Poisson equation by parallel program (one cell for one process)
https://youtu.be/EnNoj0WKteM

solving the Poisson equation by parallel program (multiple cells for one process)
https://youtu.be/mUeFWnntbC4

source code
https://github.com/wangzhezhe/5MPPT/tree/master/mpi_example/possionhybrid

Prerequisite

We assume that you already understand the basics of MPI such as send/receive, broadcast, and reduce.

If you are unfamiliar with these concepts, this tutorial is a good place to start: here

construct the Cartesian grid

Besides the basic APIs used for message communication such as send and recv, MPI also provides several practical tools; a typical one is the Cartesian topology.

The mesh is a basic tool for solving numerical problems. If we want to run an MPI program on data in a specific domain, we need to assign each MPI process to a specific domain/partition/cell and let that process calculate and update the data in the corresponding domain. This is called single program, multiple data.

Here is the question:

The most basic function used in an MPI program is to get an id.
For example, after calling MPI_Comm_rank(comm, &rank) every process has a unique id. This rank id lives in a 1D space. What if we want to associate this id with spatial information, namely the mesh information in 2D or 3D? What should we do?

For example, say we have 4 MPI processes and the coordinates of every process are (x,y) in 2D space, and we want rank = y*2 + x. How do we do that?

MPI_Cart_create solves this problem. It generates a communicator that carries a Cartesian grid topology over all the processes, and then we can use MPI_Cart_coords to calculate the coordinates associated with each MPI process.

This is an example of creating the Cartesian topology with predefined dims:

//the example comes from https://www.youtube.com/watch?v=eMnjwohg-sg

#include "mpi.h"
#include <iostream>
int main(int argc, char *argv[])
{

int rank, procs;
MPI_Comm cart_comm;
int reorder;
int coord[2], id;
int up, down, left, right;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &procs);
if (procs != 12)
{
std::cout << "the process number should be 12" << std::endl;
exit(0);
}

int dim[2] = {4, 3};

int periodical[2] = {1, 1};
reorder = 0;

MPI_Cart_create(MPI_COMM_WORLD, 2, dim, periodical, reorder, &cart_comm);
MPI_Cart_coords(cart_comm, rank, 2, coord);

printf("rank %d x %d y %d\n", rank, coord[0], coord[1]);

if (rank == 7)
{
MPI_Cart_shift(cart_comm, 0, 1, &up, &down);
MPI_Cart_shift(cart_comm, 1, 1, &left, &right);
printf("neighbors for rank7 left %d right %d up %d down %d\n",
left,right,up,down);
}
}

The periodical flag indicates whether a dimension is periodic; if it is true, the position after the last element wraps around to the first element. This can be seen with MPI_Cart_shift, which returns the rank ids of the neighbors of a cell along a specific dimension; with it we can see how the neighbor ids differ for cells at the boundary positions.

Some information about reorder: it allows MPI to determine an optimal process ordering. See these references for details:

http://www.bu.edu/tech/support/research/training-consulting/online-tutorials/mpi/more/mpi_cart_create/

https://stackoverflow.com/questions/32125267/mpi-reordering-of-processes-using-mpi-cart-create

One crucial part is how to relate the coordinates to the domain information; we will also discuss how to use the shift API to get the neighbors of a specific cell.

calculate the rank id of the neighbor cells

In this part, we discuss how to calculate the neighbor cells of a specific cell in a mesh.

Why is this operation necessary? One typical operation in numerical computing is the stencil operation, which we will use to solve a PDE (Partial Differential Equation). It doesn't matter if this is a new term to you. The important thing is that, in numerical computing, the value in a specific cell depends on the neighbor cells. We assume the value is located at the cell; in real use cases, the value can also live on a point, on an edge of the cell, and so on.

Assume each MPI process holds the value of a specific cell, so each cell is associated with one MPI process. For the five-point stencil shown in [figure], we need to know the rank ids of the neighbor cells in the up, down, left, and right directions; with them we can index the neighbor cells, fetch the variable values stored there, and calculate the new value at the center position.

If we index the mesh by coordinates, it is easy to get all the neighbor cells: for a 2D mesh, if the coordinates of the current cell are (x,y), the neighbor cells are (x-1,y), (x+1,y), (x,y-1), and (x,y+1). But the underlying index is the rank id, which lives in a 1D space. Therefore, in order to get the value from a specific MPI process, we need to map the coordinates onto the rank id. This is the reverse of the mapping that MPI_Cart_coords gives us.

How do we convert the coordinates into the rank id? For example, if the dimension array is a 2D array [dim1, dim2], the rank id of the cell with coordinates (d1, d2) is assigned as follows:

for d1 in 0..dim1-1
    for d2 in 0..dim2-1
        rankid = d2 + d1*dim2

If we use x to represent the horizontal axis and y to represent the vertical axis, and the current rank id is cid, then the id of the up cell is cid+1, the down cell is cid-1, the left cell is cid-dim2, and the right cell is cid+dim2. We do not consider the boundary cells here.
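
MPI can also do this reverse mapping for us: MPI_Cart_rank converts a coordinate pair into a rank inside the Cartesian communicator, and for periodic dimensions it wraps out-of-range coordinates automatically. Below is a minimal sketch (assuming the same 4x3 periodic grid and 12 processes used in the example above) that computes the rank by hand with the formula from the loop and then asks MPI for the rank of a neighbor:

#include "mpi.h"
#include <cstdio>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank, procs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);
    if (procs != 12)
    {
        MPI_Finalize();
        return 0;
    }

    MPI_Comm cart_comm;
    int dim[2] = {4, 3};
    int periodical[2] = {1, 1};
    MPI_Cart_create(MPI_COMM_WORLD, 2, dim, periodical, 0, &cart_comm);

    int coord[2];
    MPI_Cart_coords(cart_comm, rank, 2, coord);

    // manual conversion with the formula rankid = d2 + d1 * dim2
    int manual_rank = coord[1] + coord[0] * dim[1];

    // the neighbor one step along dimension 0; MPI_Cart_rank wraps the
    // coordinate because that dimension is periodic
    int neighbor_coord[2] = {coord[0] + 1, coord[1]};
    int neighbor_rank;
    MPI_Cart_rank(cart_comm, neighbor_coord, &neighbor_rank);

    printf("rank %d manual %d neighbor along dim0 %d\n",
           rank, manual_rank, neighbor_rank);

    MPI_Finalize();
    return 0;
}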

The calculation gets more complicated when the mesh is 3D, and we may need to calculate the id with a function like this (size_x and size_y are the extents of the local block, and the +2 accounts for one layer of ghost cells on each side):

inline int l2i(int x, int y, int z) const
{
    // linearize the 3D position (x, y, z) into a 1D index, with x varying fastest
    return x + y * (size_x + 2) + z * (size_x + 2) * (size_y + 2);
}
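
As a rough illustration (size_x, size_y, and size_z here stand for the interior extents of the local block and are not taken from the repository), such an indexer is typically used to walk over the interior cells of the padded array, for instance in a Jacobi-style averaging sweep for the 3D Laplace equation:

// hypothetical block extents; the stored arrays are padded by one ghost layer per side
const int size_x = 4, size_y = 4, size_z = 4;

inline int l2i(int x, int y, int z)
{
    return x + y * (size_x + 2) + z * (size_x + 2) * (size_y + 2);
}

void update_interior(double *newv, const double *oldv)
{
    // indices 0 and size + 1 along each axis are ghost cells, so skip them
    for (int z = 1; z <= size_z; z++)
        for (int y = 1; y <= size_y; y++)
            for (int x = 1; x <= size_x; x++)
                newv[l2i(x, y, z)] = (oldv[l2i(x - 1, y, z)] + oldv[l2i(x + 1, y, z)] +
                                      oldv[l2i(x, y - 1, z)] + oldv[l2i(x, y + 1, z)] +
                                      oldv[l2i(x, y, z - 1)] + oldv[l2i(x, y, z + 1)]) / 6.0;
}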

The MPI Cartesian topology provides a simple interface that helps us calculate the ids of the neighbor cells directly. This interface is MPI_Cart_shift and its signature looks like this:

int MPI_Cart_shift(MPI_Comm comm, int direction, int disp, int *rank_source, int *rank_dest)

The first parameter is the communicator generated by MPI_Cart_create, the second parameter selects the direction or dimension we want to target, and the third parameter is the displacement between the neighbor cell and the current cell. rank_source and rank_dest are the outputs; they can be tricky to understand at first glance, so let us look at an example first.

#include "mpi.h"
#include <iostream>
int main(int argc, char *argv[])
{

int rank, procs;
MPI_Comm cart_comm;
int reorder;
int coord[2], id;
int up, down, left, right;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &procs);
if (procs != 12)
{
std::cout << "the process number should be 12" << std::endl;
exit(0);
}

int dim[2];
dim[0] = 4;
dim[1] = 3;
int periodical[2] = {1, 1};
reorder = 0;

MPI_Cart_create(MPI_COMM_WORLD, 2, dim, periodical, reorder, &cart_comm);
MPI_Cart_coords(cart_comm, rank, 2, coord);

printf("rank %d coor1 %d coor2 %d\n", rank, coord[0], coord[1]);

if (rank == 7)
{
MPI_Cart_shift(cart_comm, 0, 1, &dim1_src, &dim1_dest);
MPI_Cart_shift(cart_comm, 1, 1, &dim2_src, &dim2_dest);
printf("neighbors for rank 7 dim1_src %d dim1_dest %d dim2_src %d dim2_dest %d\n",
dim1_src, dim1_dest, dim2_src, dim2_dest);
}
}

the output is

rank 5 coor1 1 coor2 2
rank 0 coor1 0 coor2 0
rank 3 coor1 1 coor2 0
rank 6 coor1 2 coor2 0
rank 9 coor1 3 coor2 0
rank 1 coor1 0 coor2 1
rank 4 coor1 1 coor2 1
rank 11 coor1 3 coor2 2
rank 2 coor1 0 coor2 2
rank 10 coor1 3 coor2 1
rank 7 coor1 2 coor2 1
neighbors for rank 7 dim1_src 4 dim1_dest 10 dim2_src 6 dim2_dest 8
rank 8 coor1 2 coor2 2

if we use a displacement of -1 instead:

MPI_Cart_shift(cart_comm, 0, -1, &dim1_src, &dim1_dest);
MPI_Cart_shift(cart_comm, 1, -1, &dim2_src, &dim2_dest);

the output is

neighbors for rank 7 dim1_src 10 dim1_dest 4 dim2_src 8 dim2_dest 6

The shift is upward if the displacement is larger than 0; upward means the coordinate increases from a small value to a large value, and a downward shift means the coordinate decreases from a large value to a small value.

The second parameter labels which direction we target. For a 2D mesh, if the direction is 0, we are talking about the first dimension of the topology. This is an example of the 2D case:

If we make the z direction perpendicular to the paper plane, we get the following example of the neighbor cells for a 3D mesh:

MPI_Cart_shift(cart_comm, 0, 1, &west, &east);
MPI_Cart_shift(cart_comm, 1, 1, &down, &up);
MPI_Cart_shift(cart_comm, 2, 1, &south, &north);

The displacement can also be larger than 1; if it is 2, we get the cell that is 2 units away from the current cell.

Let’s go back to the following parameters:

int periodical[2] = {1, 1};
reorder = 0;
MPI_Cart_create(MPI_COMM_WORLD, 2, dim, periodical, reorder, &cart_comm);

These two parameters are used when we create the new communicator. reorder indicates whether you allow MPI to reorder the mapping between rank ids and actual processes, so that processes that are physically close may get close ids. If reorder = false then the rank of each process in the new group is identical to its rank in the old group; otherwise, the function may reorder the processes (possibly so as to choose a good embedding of the virtual topology onto the physical machine).

For the cell with rank id 8 (keeping the hard-coded label in the printf unchanged), the result is:

neighbors for rank 7 dim1_src 5 dim1_dest 11 dim2_src 7 dim2_dest 6

We can see that dim2_dest is 6, which is actually the first element along that direction, because the dimension is periodic.

https://mpi.deino.net/mpi_functions/MPI_Cart_create.html

https://stackoverflow.com/questions/32125267/mpi-reordering-of-processes-using-mpi-cart-create

periodical matters for the boundary case: if it is 1, the cell after the last one wraps back to the first one in the upward direction.

https://stackoverflow.com/questions/20813185/what-are-source-and-destination-parameters-in-mpi-cart-shift

One benefit of this is that it simplifies the programming when we combine MPI with other parallel tools such as OpenMP. In that case, one MPI process calculates multiple cells; each MPI process does not need to hold a global view, only a local view, and it exchanges data between different domains using MPI_Cart_shift and the usual send/receive calls. We discuss this in detail in the subsequent parts.

construct the dimensions from the number of processes

In the previous example, we specified the number of elements along each dimension manually, for example int dim[2]; dim[0] = 4; dim[1] = 3;. In some cases, we just want to control the total number of processes and let MPI do the rest automatically. MPI_Dims_create helps us here: it assigns suitable values to the dimension array.

In this example, we generate the dimension array automatically according to the number of MPI processes by calling MPI_Dims_create, and then query each process's coordinates with MPI_Cart_coords. This is an example:


#include "mpi.h"
#include <iostream>
int main(int argc, char *argv[])
{

int rank, procs;
MPI_Comm cart_comm;
int reorder;
int coord[3], id;
int dim=3;
int xl, xr, yl, yr, zl, zr;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &procs);

int periodical[3] = {1, 1, 1};
reorder = 0;

//update the dim according to nproc
int dims[3]={};
MPI_Dims_create(procs, 3, dims);

if (rank == 0)
{
printf("dims %d,%d,%d\n", dims[0], dims[1], dims[2]);
}

MPI_Cart_create(MPI_COMM_WORLD, dim, dims, periodical, reorder, &cart_comm);
MPI_Cart_coords(cart_comm, rank, dim, coord);

printf("rank %d x %d y %d z %d\n", rank, coord[0], coord[1], coord[2]);

}

If we run it with 12 MPI processes, the output is

dims 3,2,2
rank 10 x 2 y 1 z 0
rank 11 x 2 y 1 z 1
rank 6 x 1 y 1 z 0
rank 9 x 2 y 0 z 1
rank 0 x 0 y 0 z 0
rank 4 x 1 y 0 z 0
rank 2 x 0 y 1 z 0
rank 5 x 1 y 0 z 1
rank 7 x 1 y 1 z 1
rank 1 x 0 y 0 z 1
rank 3 x 0 y 1 z 1
rank 8 x 2 y 0 z 0

The only parameter we provide is the number of processes; both the dimension array and the coordinates that label each MPI process are created automatically by the MPI functions.

a real example, the Poisson equation

For the numerical background and the related solution, please refer to this link.

The Poisson equation is a fundamental and commonly used equation in engineering. If we understand the Poisson equation from the mathematical point of view, we will be much clearer when we write a program to solve it. Once we figure out how to write a serial program to solve it, we can turn it into a parallel version to reduce the execution time.

This is a good example of solving the Poisson equation in a sequential pattern.

Our task here is to take this sequential version and write a parallel version based on it. In particular, there are several models for mapping resources onto parallel solutions.

The basic idea is to consider the granularity of the mesh handled by one process. In the serial version, we have one process and one thread executing the whole task, so this single thread has a global view of the whole mesh and updates every cell step by step. If we have multiple processes, every process can update one cell, so we can accelerate the iteration; the overhead is the communication between processes, since each process only sees a single cell. Since we can also use multiple threads on one machine, we can instead let each process hold a portion of the mesh and update it with multiple threads running in parallel. Therefore, we get the following parallel patterns:

(1) MPI, every process processes one cell

(2) MPI, every process processes multiple cells in a serial pattern (a block)

(3) MPI + multithreading (such as OpenMP, or other accelerators such as a GPU), every process processes one data block

In this part, we start with the first one, namely one mesh cell per MPI process. It might be more efficient to use a GPU, but as a starting point, understanding the MPI version is crucial for parallel programming.

The source code for the parallel version of the Poisson equation can be found here.

Let's go through the sequential version quickly and then move to the parallel version.

If we ignore the mathematical part, the code for solving the Poisson equation is quite standard. There are three steps if we use object-oriented programming: let's say our class is called Possion, and the three functions are init, iterate, and check. The framework of the main function looks like this:

Possion p;
p.init();
for (int i = 0; i < N; i++) {
    p.iterate();
    error = p.check();
    if (error < threshold) {
        break;
    }
}

The framework is straightforward. In the init operation, we mainly initialize the mesh and all the necessary boundary conditions; in each iteration, we update the data structure according to the mathematical method; and if the error of the iteration result drops below the threshold value, we jump out of the loop. The sequential code listed here follows this framework. The thing that makes it a little difficult is setting the boundary values: both for the iterations and for the initialization, the boundary values are treated differently from the interior values, so be careful with this part.
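
To make the iterate step concrete, here is a minimal sketch of a Jacobi-style update for the 2D Poisson equation on a uniform grid with spacing h; the names u, unew, f, n, and h are illustrative and not taken from the linked code:

#include <vector>
#include <cmath>
#include <algorithm>

// one Jacobi sweep over an (n + 2) x (n + 2) grid stored row by row;
// the outermost ring (index 0 and n + 1) holds the fixed boundary values
double iterate(std::vector<double> &u, std::vector<double> &unew,
               const std::vector<double> &f, int n, double h)
{
    auto idx = [n](int i, int j) { return i * (n + 2) + j; };
    double maxdiff = 0.0;
    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= n; j++)
        {
            unew[idx(i, j)] = 0.25 * (u[idx(i - 1, j)] + u[idx(i + 1, j)] +
                                      u[idx(i, j - 1)] + u[idx(i, j + 1)] +
                                      h * h * f[idx(i, j)]);
            maxdiff = std::max(maxdiff, std::abs(unew[idx(i, j)] - u[idx(i, j)]));
        }
    }
    u.swap(unew);
    return maxdiff; // a simple convergence measure for check()
}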

For the parallel version, the extra operation is the exchange. Since multiple MPI processes work at the same time, each one needs to hold variables that carry its neighbors' values; those values are called ghost values. Before each iteration, every cell should tell its neighbors its own cell value. This operation is not needed in the sequential version, since the neighbor cells are accessible by the same process, but it is different in the parallel version: cells may run on different cores on different nodes, which hold separate memory spaces, so the data exchange operation is necessary here.
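
For the one-cell-per-process layout, the exchange can be written with MPI_Cart_shift plus MPI_Sendrecv: every process sends its own value to each neighbor and receives the neighbor's value into a ghost variable. A minimal sketch (assuming the 2D periodic Cartesian communicator from the earlier examples, with myval holding the cell value; the names are illustrative) might look like this:

#include "mpi.h"

// exchange the single cell value with the four neighbors before every iteration
void exchange_ghosts(MPI_Comm cart_comm, double myval,
                     double &ghost_up, double &ghost_down,
                     double &ghost_left, double &ghost_right)
{
    int up, down, left, right;
    MPI_Cart_shift(cart_comm, 0, 1, &up, &down);
    MPI_Cart_shift(cart_comm, 1, 1, &left, &right);

    // send my value to the "down" neighbor while receiving from the "up" neighbor,
    // and similarly for the other three directions
    MPI_Sendrecv(&myval, 1, MPI_DOUBLE, down, 0,
                 &ghost_up, 1, MPI_DOUBLE, up, 0, cart_comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&myval, 1, MPI_DOUBLE, up, 1,
                 &ghost_down, 1, MPI_DOUBLE, down, 1, cart_comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&myval, 1, MPI_DOUBLE, right, 2,
                 &ghost_left, 1, MPI_DOUBLE, left, 2, cart_comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&myval, 1, MPI_DOUBLE, left, 3,
                 &ghost_right, 1, MPI_DOUBLE, right, 3, cart_comm, MPI_STATUS_IGNORE);
}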

Frankly speaking, parallel code is hard to debug. The mathematical method itself is difficult, parallel code is not easy, and when these two challenging tasks come together, the program can become messy. My advice is to divide it into several small parts: for example, start from a small-scale case, print out the initial values and compare them with the sequential version, then check the result after one iteration, and so on.

the Poisson equation solver with hybrid parallelism

In the previous solution there is only one value per cell per process; although it runs in parallel, it is not efficient since the communication is heavy. We can upgrade it into a hybrid pattern: every process holds a block of the data and processes it serially or with thread-level parallelism.
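
As a rough sketch of the thread-level part (assuming, as in the tips below, that each process stores its block in a padded (n + 2) x (n + 2) array, and that u, unew, f, n, and h are illustrative names), the interior update of one block can be parallelized with a single OpenMP directive; compile with -fopenmp or the equivalent flag:

// update the interior cells of the local block with the threads of this MPI process;
// the ghost cells (index 0 and n + 1 along each axis) are filled by MPI beforehand
void update_block(double *unew, const double *u, const double *f, int n, double h)
{
    #pragma omp parallel for
    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= n; j++)
        {
            unew[i * (n + 2) + j] =
                0.25 * (u[(i - 1) * (n + 2) + j] + u[(i + 1) * (n + 2) + j] +
                        u[i * (n + 2) + (j - 1)] + u[i * (n + 2) + (j + 1)] +
                        h * h * f[i * (n + 2) + j]);
        }
    }
}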

Here are some tips for the process of upgrading it:

the data structure that contains the ghost values

If the local size is n times n, the array that stores the values should be (n+2) times (n+2), since every block needs a ghost area.

the index between the different views

Since we have both a local view and a global view, and the data vector is a 1D array, it is necessary to provide the following index mappings first (a sketch follows the two items below):

index from the local 2D position to the 1D position that stores the data

index from the local 2D position to the global mesh position; this global mesh is only a logical view
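
Under these assumptions (a local block of n x n interior cells padded by one ghost layer, with block coordinates bx, by coming from MPI_Cart_coords), the two mappings might be sketched as follows; the names are illustrative rather than taken from the repository:

// the padded local block stores (n + 2) * (n + 2) values in row-major order

// local (i, j) with 0 <= i, j <= n + 1 (ghost cells included)
// -> position in the 1D data vector
inline int local_to_storage(int i, int j, int n)
{
    return i * (n + 2) + j;
}

// local interior (i, j) with 1 <= i, j <= n plus the block coordinates (bx, by)
// of this process -> logical position in the global mesh
inline void local_to_global(int i, int j, int n, int bx, int by,
                            int &gi, int &gj)
{
    gi = bx * n + (i - 1);
    gj = by * n + (j - 1);
}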

the data exchange

Another important operation for parallel processing is the face exchange. The data to exchange might not be contiguous in storage. MPI_Type_vector can be used to define a vector type that describes this non-contiguous data: the user just needs to give the number of blocks, the length of every block, and the stride between the blocks. This customized type can then be used to transfer the data. The data in the ghost area is updated at the beginning of every iteration: logically every partition is connected with its neighbors, but in practice they are managed by different processes, and when a boundary element is updated it needs the values from the ghost area for the computation to be correct. The advantage of the customized vector type is the simplified interface for doing the face exchange between different processes.
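
As a sketch: for a row-major (n + 2) x (n + 2) block, a column is n values separated by a stride of n + 2, so a single MPI_Type_vector describes it and one MPI_Sendrecv per side moves the whole face. The variable names and tags here are illustrative, not taken from the repository:

#include "mpi.h"

// exchange the left and right ghost columns of a padded (n + 2) x (n + 2) block u
void exchange_columns(MPI_Comm cart_comm, double *u, int n)
{
    // a column of the padded block: n blocks of 1 value each, stride n + 2
    MPI_Datatype column_type;
    MPI_Type_vector(n, 1, n + 2, MPI_DOUBLE, &column_type);
    MPI_Type_commit(&column_type);

    int left, right;
    MPI_Cart_shift(cart_comm, 1, 1, &left, &right);

    // send my first interior column to the left neighbor,
    // receive the right neighbor's first column into my right ghost column
    MPI_Sendrecv(&u[(n + 2) * 1 + 1], 1, column_type, left, 0,
                 &u[(n + 2) * 1 + (n + 1)], 1, column_type, right, 0,
                 cart_comm, MPI_STATUS_IGNORE);

    // send my last interior column to the right neighbor,
    // receive the left neighbor's last column into my left ghost column
    MPI_Sendrecv(&u[(n + 2) * 1 + n], 1, column_type, right, 1,
                 &u[(n + 2) * 1 + 0], 1, column_type, left, 1,
                 cart_comm, MPI_STATUS_IGNORE);

    MPI_Type_free(&column_type);
    // the top and bottom ghost rows are contiguous in memory, so a plain
    // MPI_Sendrecv with count = n and MPI_DOUBLE is enough for them
}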

write the serial program first

It is easy to get lost when writing a parallel program, so a good tip is to always write a simplified serial program and use it as a checker for the parallel program. You could start with a small-scale case and use it to check whether steps such as initialization, data exchange, and update are correct. Writing a parallel program is not trivial work, and a minor bug may send the results in the wrong direction. Basically, the serial program solves the theoretical part: we need to make sure the math or the model works. The parallel version just uses more resources wisely to accelerate the computation. If we have not figured out the math before writing the parallel program, things can get messy easily.

The source code can be found here. For the configuration of an 800 times 800 square, if we use the hybrid parallel version with 4 processes where every process updates a 200 times 200 data block, it takes only around 1/5 of the time of the serial version. The program can be optimized further with all kinds of optimization strategies, and we will discuss these strategies in other blogs.

references

https://www.mpich.org/static/docs/latest/www3/MPI_Cart_create.html

https://www.mpich.org/static/docs/latest/www3/MPI_Cart_shift.html

MPI vector type
https://pages.tacc.utexas.edu/~eijkhout/pcse/html/mpi-data.html#Vectortype

Poisson equation discretization

https://www.youtube.com/watch?v=bLiaz-IHX_Y

MPI_Dims_create

https://www.mpich.org/static/docs/v3.1.x/www3/MPI_Dims_create.html

good resource for halo exchange
http://wgropp.cs.illinois.edu/courses/cs598-s15/lectures/lecture25.pdf

non-contiguous vector with constant stride (face exchange even if the original data is not contiguous)
https://cvw.cac.cornell.edu/MPIAdvTopics/vectortype
good examples

https://www.rookiehpc.com/mpi/docs/mpi_type_vector.php
