CUDA programming (1)

Key concepts and optimization strategies for CUDA-based GPU programming.

The related code is in the parallelBLK repo under cuda/MVM (matrix-vector multiplication).

This blog discusses how to use CUDA to compute Ax, covering the basic concepts of CUDA and the associated programming details.

Number of blocks per grid, number of threads per block

I have learned CUDA and GPU programming in different classes, but every time I start to program in CUDA I have to relearn concepts such as grid, block, warp, bank conflict, and so on. I still feel confused sometimes, so I am trying to clear everything up again. Here I start with several simple examples that can be a foundation for future learning. The first example is computing a matrix times a vector, namely matrix-vector multiplication.

We first construct the CPU version of the matrix-vector multiplication. This is simple: we just go through each row of the matrix and accumulate its products with the vector, following the rule of matrix-vector multiplication.
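
A minimal sketch of such a CPU reference, assuming A is stored row-major in a flat array with m rows and n columns (the function name cpuMVM is just a placeholder, not the repo code):

```cpp
// CPU reference: y = A * x, with A stored row-major as a flat array (m rows, n columns).
void cpuMVM(const double* A, const double* x, double* y, int m, int n) {
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++) {
            sum += A[i * n + j] * x[j];  // accumulate one row times the vector
        }
        y[i] = sum;
    }
}
```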

Then we want to execute things on the GPU. Forget about all the details of the GPU concepts and start from scratch: there are a bunch of threads on the GPU and several kinds of memory. We need to first copy the data onto the GPU and then do the work there.

We need to allocate memory space on the device and then transfer the data into that space. Here comes the first question: what kind of memory should we allocate? We start with the simple one, namely global memory.

Before doing that, we may ask: if there are multiple GPUs, how do we attach to a specific one?

According to the deviceQuery code here

We can first get the number of devices with cudaGetDeviceCount and then use cudaSetDevice to make sure that subsequent CUDA operations are executed on a specific device.
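
A minimal sketch of that device selection (picking device 0 is just an example choice):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Count the available GPUs, then bind this host thread to one of them.
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    printf("found %d CUDA device(s)\n", deviceCount);
    if (deviceCount > 0) {
        cudaSetDevice(0);  // subsequent CUDA calls from this thread go to device 0
    }
    return 0;
}
```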

One important API is cudaMalloc; we use it to ask the device to allocate memory. Be careful about the first parameter of this API: since cudaMalloc returns an error code, the allocated address has to come back through the first parameter, so we pass a pointer to our pointer, which makes it a double pointer. If you program in C, you are probably familiar with this detail: you need a double pointer in order to change the value of the original pointer.
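
A small sketch of the allocate-and-copy step, assuming the data lives in a host array h_vec of n doubles (allocAndCopy is a hypothetical helper, not part of the repo code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// cudaMalloc returns a cudaError_t, so the allocated address has to come back
// through the first argument: we pass the address of our pointer, i.e. a double pointer.
double* allocAndCopy(const double* h_vec, int n) {
    double* d_vec = nullptr;
    size_t bytes = n * sizeof(double);
    cudaError_t err = cudaMalloc((void**)&d_vec, bytes);
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return nullptr;
    }
    cudaMemcpy(d_vec, h_vec, bytes, cudaMemcpyHostToDevice);  // host -> device transfer
    return d_vec;
}
```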

After allocating memory on the GPU, the next step is to do some computation and get the results back to the CPU side. Ideally, for the SIMT architecture (single instruction, multiple threads), we may want every thread (if we have enough threads) to access one element of the matrix, do the multiplication, and write its result back, all simultaneously. Here comes the idea of the grid and the block; in CUDA there are more complex ideas behind this, and we need to consider grids and blocks because the GPU's physical threads are not organized in a flat manner.

It is important to distinguish the hardware level and the software level when we describe things. This blog puts it well: CUDA core, streaming multiprocessor, and GPU are hardware concepts; thread, CUDA thread block, and CUDA grid are software concepts.

For the block (thread block) dimension, we specify how many threads there are in each block along each dimension. For the grid dimension, we specify how many blocks there are along each dimension.

When we launch a CUDA kernel, we need to set these two parameters, kernelFunc<<<numBlocks, threadsPerBlock>>>, namely the number of blocks in the grid and the number of threads in each block. The confusing part is how to set these parameters based on your program or the scale of the problem.

The naive way is to flatten the block dimension and the grid dimension, i.e. use 1D for both; then we have

threadId = blockIdx.x * blockDim.x + threadIdx.x

This is called the global thread id. This blog describes several ways of computing the thread id.
Essentially, this operation maps the local index within a block to the global index within the grid.

One typical solution is to fix the number of threads per block, and then compute the number of blocks according to the number of elements in the problem you are trying to solve.

There are research results discussing how to choose proper parameters, which are beyond the topic of this blog; we just assume that we choose a reasonable number of blocks here.

The hard limit is that we must not exceed the physical limits of the device. For example, we can use deviceQuery to check the maximum number of threads per block, which is 1024 on an A100 GPU.
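
Instead of hard-coding these limits, we can also query them at runtime; a minimal sketch for device 0:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                               // query device 0
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);  // 1024 on A100
    printf("warp size: %d\n", prop.warpSize);                        // 32 on current devices
    return 0;
}
```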

Then we can launch the kernel like this:

function <<< (elementsNum + BLOCK_SIZE - 1)/BLOCK_SIZE, BLOCK_SIZE >>> (d_in, d_out);

The trick of (elementsNum + BLOCK_SIZE - 1)/BLOCK_SIZE handles the remainder. Because elementsNum may not be an integer multiple of BLOCK_SIZE, this expression rounds the division up, which adds one extra block when the element count is not an integer multiple of the block size. The extra threads in that last block simply stay idle when there is no data left for them to process, and this simplifies the programming a lot. We can also use ceil(N / (double)blockSize) to compute the number of blocks (the grid dimension).

The <<<>>> notation is called the execution configuration in CUDA terms; check this doc for more details. Although we only set one number for the first and second parameters here, according to the documentation these two positions are actually the dimensions of the grid and the block. For example, if we want to make use of both threadIdx.x and threadIdx.y, we need to make sure the block dimension is a 2D variable when we set the execution configuration.
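
For example, a 2D configuration could look like the following sketch; kernelFunc2D, m, n, d_in, and d_out are placeholders (not code from the repo), and the launch lines would sit inside host code:

```cpp
// Kernel that uses both threadIdx.x and threadIdx.y.
__global__ void kernelFunc2D(const double* d_in, double* d_out, int m, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < m && col < n) {
        d_out[row * n + col] = 2.0 * d_in[row * n + col];  // placeholder body
    }
}

// Host side: both positions of the execution configuration are dim3 values.
dim3 threadsPerBlock(16, 16);  // 256 threads per block, laid out 16 x 16
dim3 numBlocks((n + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (m + threadsPerBlock.y - 1) / threadsPerBlock.y);
kernelFunc2D<<<numBlocks, threadsPerBlock>>>(d_in, d_out, m, n);
```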

For our current configuration, the output of these built-in ids looks like this:

threadIdx.x 32 threadIdx.y 0 threadIdx.z 0 blockIdx.x 80 blockIdx.y 0 blockIdx.z 0 gridDim.x 313 gridDim.y 1 gridDim.z 1

We only use the x dimension of threadIdx; it starts at 0 and ends at 63 in our case (we fix the number of threads per block at 64). The blockIdx is determined by the number of blocks, which is also 1D here. gridDim shows the total number of blocks, which is 313 in this case. The meaning of these variables is described in this doc. We can now explain gridDim, blockIdx, blockDim, and threadIdx. We will explain the idea of the warp in the next section.

Anyway, around the kernel call, copying data back and forth between the CPU and GPU and initializing all kinds of data structures is time consuming and error prone. Although none of this is difficult, it still took me a little while to get everything correct.

  • The first issue is that when I initialized the matrix I mixed up the rows and columns: I use m and n, and where I needed to update the index at i*column+j (i.e. i*n+j), I had written i*m+j. I only found the issue when I printed the matrix element by element. The index computation is the flattened case; when we do the actual operation we need to convert it back to 2D: rowIndex = gindex / n and columnIndex = gindex % n (see the kernel sketch after this list).

  • Another issue is that if the size of the problem is smaller than the total number of allocated threads, some threads are superfluous. We just use an if condition (or an early return) so these extra threads do nothing, as shown in the sketch below.
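
A sketch of the element-per-thread mapping described above (not necessarily identical to main_naive.cu; it assumes y is zero-initialized and that the device supports atomicAdd on doubles, which the A100 does):

```cpp
__global__ void mvmElementWise(const double* A, const double* x, double* y, int m, int n) {
    int gindex = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    if (gindex >= m * n) return;       // extra threads in the last block do nothing
    int row = gindex / n;              // recover the 2D position from the flattened index
    int col = gindex % n;
    atomicAdd(&y[row], A[gindex] * x[col]);  // accumulate this element's contribution
}
```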

Anyway, the correctness of the computation is still a key issue: you implement something on the GPU, you try to make it run fast, and you also want it to be correct; that takes time.

At this point we have something workable on the GPU and can already see how the speed improves.

This is the naive implementation.

https://github.com/wangzhezhe/parallelBLK/blob/master/cuda/MVM/main_naive.cu

Warp

Here are some ideas about the built-in variables, the warp, and the SIMT architecture.

The concept of the warp comes up when we consider how the threads are actually executed; check the figure in this slide. Although logically we think that all the threads we specified (number of blocks times threads per block) execute in parallel, they actually do not: they execute concurrently. The number of threads that execute strictly in parallel is limited by the physical properties of the device, which is 32 for current devices, and we call this bundle of threads a warp (these 32 threads are essentially executed in parallel). These threads are executed by an SM (streaming multiprocessor).

It can be seen as a hybrid form of parallelism. If we know the idea of a thread pool, we know that is a kind of task-based parallelism: we create a series of tasks, put them into a queue, and the scheduler of the thread pool picks a task and executes it using the available resources.

The difference on the GPU is that every time the scheduler fetches work and executes it on a multiprocessor (a hardware concept of the GPU), it actually schedules a warp, and this warp executes 32 threads physically.

That might be one reason why CUDA feels a little complicated: it actually mixes two patterns. On one hand, it contains a parallel pattern like MPI, executing the same instructions on different resources; on the other hand, it contains task-based parallelism, like the strategy used when managing a thread pool.

It is worth noting that the threads in a warp come from the same block.

Strided load and memory coalescing

This video provides a really good explanation (at 1:07):
https://www.youtube.com/watch?v=FcCTHJO8_zo

This only matters when one thread accesses data at different positions multiple times, or when every thread accesses multiple elements. If there is no for loop in the GPU kernel function (the available threads are enough to cover the problem), there is no strided-load issue.

The video listed above shows a good figure. There are basically two typical ways to map data to threads. The first is to distribute the work evenly: compute the workload N that each thread needs to process, then thread 0 accesses elements 0 to N-1, thread 1 processes N to 2N-1, and so on. This pattern is fine for CPU-based algorithms that use pthreads. On the GPU, however, the 32 threads of a warp execute at the same time, so this layout causes strided loads, and we move useless data back and forth through the memory system. We can use memory coalescing to avoid this issue: thread 0 accesses element 0, thread 1 accesses element 1, and so forth; when we have used up all threads, we advance the base value, so the next element that thread 0 processes is the current position plus the total number of threads (gridDim.x times blockDim.x in the 1D case).
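
A generic grid-stride sketch of this coalesced mapping (a toy element-wise kernel for illustration, not the MVM code itself):

```cpp
// Consecutive threads touch consecutive elements (coalesced), then every thread
// jumps ahead by the total number of threads in the grid.
__global__ void scaleGridStride(const double* in, double* out, double factor, int nElements) {
    int stride = gridDim.x * blockDim.x;                  // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's global id
         i < nElements;
         i += stride) {
        out[i] = factor * in[i];
    }
}
```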

When considering the optimization in this part, it is good to think about the memory access pattern first; once we figure that out, the following optimization ideas become easy. Instead of only thinking about the thread id, we can think about the memory addresses and then decide how to assign the data entries to the threads.

This is the code example that uses the warp-friendly access pattern; it still took some time to make sure everything is correct. In order to make sure it works, we limit the block count so the total number of threads is small compared with the whole data array, and then we just move the base offset to map each thread id to a specific memory position.

https://github.com/wangzhezhe/parallelBLK/blob/master/cuda/MVM/main_warp.cu

This article about grid-stride loops is the best reference; it explains why the strided way is a good practice: thread reuse, easier debugging, and avoiding uncoalesced memory access.

The sample there also shows a lot of good coding practices, such as the tradeoff between thread reuse and thread creation overhead, debugging a GPU program, and processing problems that are larger than the theoretical number of GPU threads. It uses a monolithic kernel and a grid-stride loop to represent two versions of the code. This blog provides some good figures to explain strided access.

The optimization strategy is fairly fixed in typical CUDA programming. The difficult part is how to properly apply these ideas to different types of problems.

The naive way is the monolithic kernel, then the strided for loop, then vectorized data access, and then shared memory.

Using the shared memory

In the previous example we used global memory; another key optimization is to use the different GPU memories efficiently.

In terms of scope, global memory can be accessed by the whole GPU. From the hardware's perspective, shared memory lives on each SM, so different SMs access different shared memories. From the software's perspective, shared memory is associated with each thread block; the "shared" means the memory is shared between the threads within a block.

This question explains some details of shared memory versus global memory. According to the numbers listed there, there is roughly a 100x gap between shared memory access and global memory access. The scenario for using shared memory is optimizing repeated reads: if the kernel only needs to fetch a piece of data once from global memory, it is pointless to use shared memory; however, if we need to fetch it multiple times, instead of going to global memory each time we can put the associated data into shared memory and access it from there.

This is the official documentation about shared memory. By using the __shared__ keyword, we tell the kernel that this memory is allocated once for each block (even though we write the declaration inside the kernel).

In our case, when we compute A times x (a vector), the elements of the vector are accessed very frequently, so at the very least we can put the vector into shared memory first.

The next key issue is to decide what data should be put into shared memory. This obviously depends on the size of the problem. The data used repeatedly in the computation kernel should be loaded into shared memory first. If there is enough space, in our example the vector can be loaded into shared memory; for the matrix, we may just load the portion that will be accessed by the current block.

This blog lists several key issues for these CUDA use cases, such as synchronization and dynamic versus static shared memory. The difficult part is that we need to flatten what we need into a sequence and put it into shared memory.

Another question is: even after we have allocated shared memory, how do we load the data? It looks like we need to load the data manually, but if we do it naively in each thread, every thread performs the whole copy. What is the best practice in this case? Here is an example question about this.
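
One common answer is a cooperative load: each thread of the block copies a slice of the vector, and __syncthreads() guarantees the copy is complete before anyone uses it. A sketch for our MVM case, with one thread per row and assuming the whole vector fits into a fixed-size shared array (this is not the exact code in the repo):

```cpp
#define SHARED_LEN 1024  // assumed upper bound on the vector length n

__global__ void mvmShared(const double* A, const double* x, double* y, int m, int n) {
    __shared__ double xShared[SHARED_LEN];
    // Cooperative load: thread t copies elements t, t + blockDim.x, t + 2*blockDim.x, ...
    for (int j = threadIdx.x; j < n; j += blockDim.x) {
        xShared[j] = x[j];
    }
    __syncthreads();  // wait until every thread has finished its share of the load

    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;
    double sum = 0.0;
    for (int j = 0; j < n; j++) {
        sum += A[row * n + j] * xShared[j];  // repeated reads now hit shared memory
    }
    y[row] = sum;
}
```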

Let's look at an example first; it shows the general principles for the code here:

This is the scan operation (prefix-sum operation) in the CUDA samples. For the CPU version of the scan, the sample code breaks the whole very long array into several small arrays. The main function varies the array size gradually to test different combinations of array size and number of arrays, and does the exclusive scan operation within each array; there is no communication between the subarrays. It mixes pointer arithmetic and the bracket operator: when we use the [] operator on a pointer, we start from the base position, so even though the index in the code runs from 1 to arrayLength, the actual elements accessed start from the base position, which is k times the array length. This is the associated C++ code to compute the scan.

Their kernels are divided into two types. One handles small array sizes and calls the scanExclusiveShort function; it runs multiple iterations in order to compute the average kernel time. In the other case, when arrayLength falls into the large range, it calls the scanExclusiveLarge function. The idea is that we do not expect one kernel to run well for all cases; we just make sure that each kernel can process arrays with lengths in a specific range.

For the GPU kernel function, this sample code uses uint4, so every thread processes 4 uint numbers. It uses vectorized data access to improve performance; here are more related blogs. This can be a strategy to further improve performance on top of the strided loop. The core idea is that vectorized operations decrease the number of instructions needed to access the data. There is also some good sample code here as an example. However, there are limitations as well, such as putting more pressure on the registers. The terms used are "scalar load" (load one element each time) and "vectorized load".
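
A toy sketch of a vectorized load using uint4 (it assumes the buffers are 16-byte aligned, as cudaMalloc guarantees, and that nElements is a multiple of 4; a real kernel would add a scalar tail loop):

```cpp
__global__ void copyVectorized(const unsigned int* in, unsigned int* out, int nElements) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    const uint4* in4 = reinterpret_cast<const uint4*>(in);
    uint4* out4 = reinterpret_cast<uint4*>(out);
    if (i < nElements / 4) {
        out4[i] = in4[i];  // one vectorized load and one vectorized store move four values
    }
}
```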

Other related information:
Cooperative groups: https://developer.nvidia.com/blog/cooperative-groups/

This is the shared memory example for computing A times x.

We can further optimize it to let the result vector also use shared memory, and then sum the partial results from each block's shared memory back together.

Other issues

There are still some issues for the current project.

When doing the accumulation, why are there small errors between the CPU and GPU versions? The two sides add in a different order, and floating-point addition is not associative, so a different adding sequence can produce slightly different results.
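
A tiny host-side example of that non-associativity (unrelated to the repo code, just to show the effect):

```cpp
#include <cstdio>

int main() {
    // Floating-point addition is not associative: summing the same numbers in a
    // different order (as a GPU reduction does) can give a slightly different result.
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    printf("(a + b) + c = %f\n", (a + b) + c);  // exact cancellation first: 1.000000
    printf("a + (b + c) = %f\n", a + (b + c));  // c is lost in rounding:    0.000000
    return 0;
}
```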

Memory coalescing vs bank conflicts

When we discuss coalescing, we are talking about access to global memory: uncoalesced operations lead to more memory transactions to load the data. Check the figure listed here.

For bank conflicts, we discuss this concept in the context of shared memory. When threads in a warp access different words that map to the same bank (the segments that store data in shared memory), the accesses are serialized, which is a bank conflict. This blog lists several common bank conflict cases.
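
A classic illustration is the padded transpose tile (a sketch assuming a single 32x32 tile and a dim3(32, 32) thread block): shared memory is split into 32 four-byte banks, so reading a column of a 32-wide tile would make all threads of a warp hit the same bank; padding each row by one word spreads the accesses over different banks.

```cpp
#define TILE 32

__global__ void transposeTileSketch(const float* in, float* out) {
    __shared__ float tile[TILE][TILE + 1];  // the "+ 1" padding avoids 32-way bank conflicts
    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * TILE + x];          // coalesced global read, conflict-free shared write
    __syncthreads();
    out[y * TILE + x] = tile[x][y];         // column read from shared memory, conflict-free with padding
}
```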

Data representation for the sparse matrix

References

CUDA Fermi architecture whitepaper:

https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

This blog covers some compilation issues across different CUDA versions:

https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/

Good online tutorial series

CUDA Part A: GPU Architecture Overview and CUDA Basics

https://www.youtube.com/watch?v=nRSxp5ZKwhQ

CUDA Part B:

https://www.youtube.com/watch?v=irvhW7oSNeQ

CUDA Part C:

https://www.youtube.com/watch?v=kFnewLG-y2E

CUDA Part D:

https://www.youtube.com/watch?v=FcCTHJO8_zo

CUDA Part E:

https://www.youtube.com/watch?v=RGSoRSoHapY

CUDA Part F:

https://www.youtube.com/watch?v=qOCUQoF_-MM

CUDA samples (all kinds of commonly used sample code):
https://github.com/NVIDIA/cuda-samples

The CUDA Pro Tips are a series of good references (lots of good practice there):

https://developer.nvidia.com/blog/six-ways-saxpy/

Tips about using shared memory:

https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/
