Many faces of vtkm

I use vtkm a lot in all kinds of projects. This article collects some tips and thoughts about using vtkm from the perspective of software design.

The original paper on vtkm is this one:

https://www.kennethmoreland.com/documents/VTKmCGA2016.pdf

The patterns that the algorithms follow for parallel execution are described in this one:

https://sdm.lbl.gov/sdav/images/publications/Mor2013b/MassivelyThreadedVis.pdf

Most of the topics can be found under the change log directory of the repo.

Decoupling the execution environment from the control environment

One important thing is to figure out how to set the ControlSignature, the ExecutionSignature, and the parameters of the operator() function in the worklet type. All worklets in vtkm seem to follow this style.

There is a detailed explanation in the vtkm user guide, sections 17.1-17.4. The parameters specified by the ExecutionSignature determine which arguments are passed into the execution environment. The parameters of the operator() function should match the types of the parameters specified for the execution environment (including the types of the template parameters). The template parameters can be reused: if two parameters share the same type, we only need one template parameter, as in the sketch below.
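
A minimal sketch of such a worklet (the worklet name and the scaling operation are made up for illustration):

#include <vtkm/worklet/WorkletMapField.h>

struct ScaleWorklet : public vtkm::worklet::WorkletMapField
{
  // Control signature: how many arguments the Invoke call takes and what they are.
  using ControlSignature = void(FieldIn inValue, FieldOut outValue);
  // Execution signature: which of those arguments (by position) go to operator().
  using ExecutionSignature = void(_1, _2);

  // The template parameter is reused for both arguments because they share a type.
  template <typename T>
  VTKM_EXEC void operator()(const T& inValue, T& outValue) const
  {
    outValue = inValue * static_cast<T>(2);
  }
};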

Filter plus worklet style

If we just use a worklet on a specific data set or array, what is described above satisfies our requirements: we create the dispatcher separately and then call its Invoke function with the worklet.
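
A short sketch of both invocation styles, assuming the hypothetical ScaleWorklet defined in the sketch above:

#include <vtkm/cont/ArrayHandle.h>
#include <vtkm/cont/Invoker.h>
#include <vtkm/worklet/DispatcherMapField.h>

void RunScale(const vtkm::cont::ArrayHandle<vtkm::Float64>& input,
              vtkm::cont::ArrayHandle<vtkm::Float64>& output)
{
  // Option 1: create the dispatcher explicitly and call its Invoke method.
  vtkm::worklet::DispatcherMapField<ScaleWorklet> dispatcher;
  dispatcher.Invoke(input, output);

  // Option 2: the Invoker picks the matching dispatcher for us.
  vtkm::cont::Invoker invoke;
  invoke(ScaleWorklet{}, input, output);
}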

Another level of abstraction is the filter: we send the data in, get data back, and hide the details. There are multiple filters in vtkm; the base class is something like vtkm::filter::FilterField. If the filter inherits from that base class, we can call this->Invoke directly.

The filter base class provides simpler mechanisms for dispatching, invoking, and detecting the input and output types.
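
A hedged sketch of such a filter, again reusing the hypothetical ScaleWorklet. The exact base-class API (DoExecute, GetFieldFromDataSet, CastAndCallScalarField, CreateResultField) differs between VTK-m versions, so treat this as an assumption rather than the canonical form:

#include <vtkm/cont/UnknownArrayHandle.h>
#include <vtkm/filter/FilterField.h>
#include <type_traits>

class ScaleFilter : public vtkm::filter::FilterField
{
protected:
  VTKM_CONT vtkm::cont::DataSet DoExecute(const vtkm::cont::DataSet& input) override
  {
    const vtkm::cont::Field& field = this->GetFieldFromDataSet(input);
    vtkm::cont::UnknownArrayHandle outArray;

    auto resolveType = [&](const auto& concreteArray)
    {
      using ValueType = typename std::decay_t<decltype(concreteArray)>::ValueType;
      vtkm::cont::ArrayHandle<ValueType> result;
      // this->Invoke is provided by the filter base class and wraps the dispatcher.
      this->Invoke(ScaleWorklet{}, concreteArray, result);
      outArray = result;
    };
    this->CastAndCallScalarField(field.GetData(), resolveType);

    return this->CreateResultField(input, this->GetOutputFieldName(),
                                   field.GetAssociation(), outArray);
  }
};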

There is also the case of multiple inputs and multiple outputs.

Type detection and conversion

The core idea of the type transform here is to let the code work on different types. The strategy used in vtkm is a runtime transform: deduce the type of a particular field, and then use that type as the template parameter.

How do we deduce the type of a specific array at runtime? The idea is to specify a type list and compare your type against that list to find the matching one. The unknown array handle provides all kinds of functions starting with CastAndCallFor..., which can be used in different scenarios. The deduced array is usually passed as the input parameter of a lambda expression.

There are different versions of the CastAndCall functions, which cast the input into a specific type. There is a tradeoff between the known types and the actual type: if we do not put any limitation on the input type, too many template instantiations may be generated when building the actual implementation. The different CastAndCall variants encode these assumptions. For example, CastAndCallScalarField only works for the types in TypeListFieldScalar.

If you really want your function to work with all types, you can use a different version of CastAndCall that checks all types. It would look something like this:

inField.GetData().CastAndCallForTypes<vtkm::TypeListScalarAll, VTKM_DEFAULT_STORAGE_LIST>(resolveType2d);

We can always refer to the implementations of the vtkm filters to see how cast-and-call is used in different scenarios.

If there are multiple variables, the vtkm solution is to use the type of one specific variable to infer the types of the others. There is a function called ArrayCopyShallowIfPossible; the typical scenario is something like this:

// use decay_t to strip references and cv-qualifiers from the deduced type
using T = typename std::decay_t<decltype(inputArray)>::ValueType;
vtkm::cont::ArrayHandle<T> inFieldMin;
vtkm::cont::ArrayCopyShallowIfPossible(inData.GetField("ensemble_min").GetData(), inFieldMin);

If the source has the same type, a shallow copy is performed; otherwise the data is deep-copied into the array of the deduced type T.

Another commonly used function for converting the array type is AsArrayHandle. If we search for this keyword in the gitlab repository of vtkm, there are all kinds of examples and test cases.

The file that defines the unknown array handle defines the specific behavior of AsArrayHandle: the idea is to use decltype to get the target type, check whether the conversion is valid, and, if it is, use reinterpret_cast to do the conversion. For cell sets there are similar functions, such as AsCellSet, which cast an unknown cell set into a specific one.
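
A short sketch of this conversion; the field name "pointvar" is made up for illustration:

#include <vtkm/cont/DataSet.h>

void UseConcreteHandle(const vtkm::cont::DataSet& dataSet)
{
  vtkm::cont::UnknownArrayHandle unknown = dataSet.GetField("pointvar").GetData();

  if (unknown.CanConvert<vtkm::cont::ArrayHandle<vtkm::Float64>>())
  {
    vtkm::cont::ArrayHandle<vtkm::Float64> concrete;
    unknown.AsArrayHandle(concrete); // usually a zero-copy view of the same data
  }
}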

This question discusses some details about why this kind of general type is used. The interface is meant to be as general as possible. For example, instead of providing a separate get-cell-set function for each specific type of cell set, vtkm needs some kind of generality. We can think of the unknown cell set or unknown array type as a kind of interface over the concrete members.

CastAndCall and ListForEach

There are multiple options for doing this type conversion in VTK-m; which one is better depends on the situation.

CastAndCallForTypes needs to consider both the element type and the storage type of the array handle (see the section about customized array handles). An ArrayHandle is generalized in two aspects: one is the basic value type, such as int, double, etc.; the other is the underlying storage type, such as basic or SOA storage. StorageList.h in the vtkm repo contains the associated types.
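
A small sketch, assuming we only care about scalar field types with the default storage list (the function and lambda names are illustrative):

#include <vtkm/TypeList.h>
#include <vtkm/cont/UnknownArrayHandle.h>
#include <iostream>

void PrintSize(const vtkm::cont::UnknownArrayHandle& unknown)
{
  auto resolveArray = [](const auto& concreteArray)
  {
    // concreteArray is now a concrete ArrayHandle with a known value and storage type.
    std::cout << concreteArray.GetNumberOfValues() << std::endl;
  };
  unknown.CastAndCallForTypes<vtkm::TypeListFieldScalar, VTKM_DEFAULT_STORAGE_LIST>(resolveArray);
}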

ListForEach only needs the base component type of the array; we do not need to consider the storage type for this. It is used together with ExtractComponent. One specific piece of documentation can be found here (https://gitlab.kitware.com/vtk/vtk-m/-/blob/master/docs/changelog/1.6/release-notes.md).
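
A hedged sketch of the base-component approach (the function name is made up; a real implementation would loop over all base component types, for example with vtkm::ListForEach):

#include <vtkm/cont/UnknownArrayHandle.h>
#include <iostream>

void PrintFirstComponent(const vtkm::cont::UnknownArrayHandle& unknown)
{
  if (unknown.IsBaseComponentType<vtkm::Float32>())
  {
    // ArrayHandleStride gives strided access to one component, usually without a copy.
    auto component = unknown.ExtractComponent<vtkm::Float32>(0);
    auto portal = component.ReadPortal();
    for (vtkm::Id i = 0; i < portal.GetNumberOfValues(); ++i)
    {
      std::cout << portal.Get(i) << std::endl;
    }
  }
  // ... repeat for the other base component types ...
}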

CompositeVector is a good use case for this ExtractComponent approach. It is really flexible, and the output array can be put into a dynamic (unknown) array handle. But I am still not quite sure how to use these things properly.

Other magic of the array handle

One thing with a steep learning curve is the array handle. Since vtkm maintains its own version of the array handle, and all other data sets are built on top of it, getting familiar with its API is important for using vtkm operations flexibly.

Writing and reading data must go through the WritePortal and ReadPortal; presumably the goal is to hide the different underlying data structures, such as CPU arrays and GPU arrays. Be careful: it is time-consuming to acquire the read/write portal repeatedly inside a for loop; acquire it once outside the loop instead (see the sketch below).
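
A minimal sketch of portal access; note that the portals are acquired once, outside the loop:

#include <vtkm/cont/ArrayHandle.h>

void FillAndRead()
{
  vtkm::cont::ArrayHandle<vtkm::Float64> array;
  array.Allocate(10);

  auto writePortal = array.WritePortal();
  for (vtkm::Id i = 0; i < array.GetNumberOfValues(); ++i)
  {
    writePortal.Set(i, static_cast<vtkm::Float64>(i) * 0.5);
  }

  auto readPortal = array.ReadPortal();
  vtkm::Float64 value = readPortal.Get(3); // value == 1.5
  (void)value;
}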

The ArrayHandle is essentially just a pointer, so shallow copies are everywhere in vtkm (even for data sets built on array handles) unless an explicit copy operation is called.

Checking the type of the value stored in the unknown array handle : UnknownArrayHandle::ExtractComponent will return an ArrayHandleStride for a particular component. This allows you to get the data, usually without a copy, by just looking at the base component type (check UnknownArrayHandle::IsBaseComponentType<T>()). For this, you only have to check the basic C types of the components (i.e., float, double, int, etc. a.k.a. vtkm::Float32, vtkm::Float64, vtkm::Int32, etc.). That is a tractable set of types to check, which is why we have this feature.

AOS or SOA array.

Call the worklet

The idea is the combination of lambda expressions and type deduction.

This is a typical example: using CastAndCallVecField can get the correct type.

Another case is when we do not know the type; the concrete type is resolved inside the lambda expression. In this example, CastAndCallForTypesWithFloatFallback does the actual cast: it compares against the type list given as the first template parameter to see which inner type matches, because GetData returns an unknown type. A common scenario is to use the input data type to deduce the output type.

Inside the lambda expression, the code usually declares a dispatcher and then uses the dispatcher's Invoke function to call the worklet; the code in this part follows a similar structure.
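
A hedged sketch of this whole pattern, reusing the hypothetical ScaleWorklet from above; the field name "field1" is illustrative:

#include <vtkm/cont/DataSet.h>
#include <vtkm/cont/UnknownArrayHandle.h>
#include <vtkm/worklet/DispatcherMapField.h>
#include <type_traits>

vtkm::cont::UnknownArrayHandle RunScaleOnField(const vtkm::cont::DataSet& inData)
{
  vtkm::cont::UnknownArrayHandle outArray;

  auto resolveType = [&](const auto& concreteArray)
  {
    using ValueType = typename std::decay_t<decltype(concreteArray)>::ValueType;
    vtkm::cont::ArrayHandle<ValueType> result;
    vtkm::worklet::DispatcherMapField<ScaleWorklet> dispatcher;
    dispatcher.Invoke(concreteArray, result);
    outArray = result;
  };

  // If the stored type is not in the list, the array is copied to a float-based
  // fallback before the lambda runs.
  inData.GetField("field1")
    .GetData()
    .CastAndCallForTypesWithFloatFallback<vtkm::TypeListFieldScalar, VTKM_DEFAULT_STORAGE_LIST>(
      resolveType);

  return outArray;
}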

More details about the type cast are described in the previous section. Be careful about the dynamic type conversion here. In one earlier piece of code, I did things like this:

auto inField1 = inData.GetField("field1");
auto inField2 = inData.GetField("field2");
auto inField3 = inData.GetField("field3");
...
// declare the dispatcher
dispatcher.Invoke(inData.GetCellSet(), inField1, inField2, inField3, outArray);

This way, we never cast the fields to specific concrete types, so each field is internally compared against at least 10 types in vtkm. With three fields that is 10^3 combinations in total, which causes really long compile times. Good practice is to always use cast-and-call to convert the field into a fixed type based on some assumptions, or at least to compare one field against all possible types and then assume the other fields have the same type as the first one via ArrayCopyShallowIfPossible; an example can be found in a previous paragraph.

Create a data set for sample testing

Although there are flexible routines in vtkm for reading and writing data in the vtk format, sometimes it is convenient to build data directly for simple testing, such as building a structured data set. The associated code is in vtkm/cont/DataSetBuilderxx; there are all kinds of examples, such as building uniform and rectilinear data sets. We can refer to these examples to see how to add coordinates and cell sets when we need to create a new vtkm data set for a specific goal.
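
A small sketch of building a uniform data set for testing; the dimensions and the field name "pointvar" are made up:

#include <vtkm/cont/DataSetBuilderUniform.h>
#include <vector>

vtkm::cont::DataSet MakeTestData()
{
  const vtkm::Id3 dims(10, 10, 10);
  vtkm::cont::DataSet dataSet = vtkm::cont::DataSetBuilderUniform::Create(dims);

  // One value per point of the 10x10x10 grid.
  std::vector<vtkm::Float64> values(10 * 10 * 10, 1.0);
  dataSet.AddPointField("pointvar", values);
  return dataSet;
}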

Visualization with data parallel primitives

Or so-called parallel skeletons:

The general types of the gpu primitives:

Map (element-wise operations such as scale or add; a stencil is also a kind of map operator)
Reduce (the output is a single value, e.g., a sum)
Scan (stores intermediate results, e.g., prefix sums)
Sort (reordering operation)
Search (find an index)

One paper about the VTK-m design and the idea of using parallel primitives (https://cdux.cs.uoregon.edu/pubs/MorelandPARCO.pdf)

Flexible and zero-copy array handle

If you find yourself looking for an array-copy function in the vtkm data set, you might be doing the wrong thing. Most of this can be achieved with a shallow copy; copying an array handle is also a shallow-copy case. Always ask yourself whether a deep copy is really necessary when you want to use one.

Device primitives (this is still one key research direction)

I am still not sure how this works, or what the connections are between these parallel primitives and the upper-level operations.

Filter types

The classification of the different filter types into categories is given in this table:

field_transform: transforms fields from/to the same entity
field_conversion: does things like cell-to-point or point-to-cell conversion

Iterate over a specific array

There is no direct way to do this. We need to call GetData first, then use AsArrayHandle to convert to a dedicated handle, and then get the ReadPortal; based on this read portal (or a write portal), we can read or write the data. This may look a little redundant, but that is how a framework works: you need to do things following its conventions.
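
A sketch of the full chain; the field name "pointvar" is illustrative, and AsArrayHandle throws if the stored type does not match:

#include <vtkm/cont/DataSet.h>
#include <iostream>

void PrintField(const vtkm::cont::DataSet& dataSet)
{
  vtkm::cont::ArrayHandle<vtkm::Float64> array;
  dataSet.GetField("pointvar").GetData().AsArrayHandle(array);

  auto portal = array.ReadPortal();
  for (vtkm::Id i = 0; i < portal.GetNumberOfValues(); ++i)
  {
    std::cout << portal.Get(i) << std::endl;
  }
}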

Some tips for coding standards

Use DataSet::GetField with both the field name and the association, instead of GetPointField or GetField with only a name, when getting a specific array.

Initialization and logging

Initialization in vtkm is interesting: it needs to happen before the real operations so it can parse the program arguments. vtkm removes the arguments starting with --vtkm-..., so the initialization has to come before the rest of the argument parsing. The log level can be set with --vtkm-log-level; the different log names are specified there. One useful thing I learned recently is the Kern log level: it prints out a lot of kernel details, such as the cuda calls, which is helpful for understanding the parallel details.
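
A minimal sketch of parsing the --vtkm-* arguments before anything else runs:

#include <vtkm/cont/Initialize.h>

int main(int argc, char** argv)
{
  // Initialize strips arguments such as --vtkm-device and --vtkm-log-level,
  // so call it before parsing your own options.
  vtkm::cont::Initialize(argc, argv, vtkm::cont::InitializeOptions::DefaultAnyDevice);

  // ... the rest of the program ...
  return 0;
}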

Using a different device adaptor

There is currently no direct way to query which device something ran on. One way to do this is to turn the logging level up to Perf, but that will print out a lot of info that is difficult to discern.

What I usually do is force a particular device. That way VTK-m will run on that device or else raise an error. The easiest way to force a device is to add a --vtkm-device=openmp command line argument (that gets passed to vtkm::cont::Initialize). Alternatively, you can use ScopedRuntimeDeviceTracker to force a particular device (https://docs-m.vtk.org/latest/structvtkm_1_1cont_1_1ScopedRuntimeDeviceTracker.html).
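
A small sketch of forcing a device from code instead of the command line:

#include <vtkm/cont/DeviceAdapterTag.h>
#include <vtkm/cont/RuntimeDeviceTracker.h>

void RunOnSerial()
{
  // While this tracker is alive, VTK-m only uses the serial device (or raises an error).
  vtkm::cont::ScopedRuntimeDeviceTracker tracker(vtkm::cont::DeviceAdapterTagSerial{});

  // ... run filters/worklets here ...
}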

Essentially, vtkm is just a library for other dedicated services; its position is that of a service which can be integrated into other tools such as ParaView.

Another way is to use ForceDevice to set the backend to the expected one, as in this example (https://github.com/Kitware/VTK-m/blob/master/examples/multi_backend/MultiDeviceGradient.cxx#L60)

This is an example of using the initialization in vtkm (https://github.com/Kitware/VTK-m/blob/master/examples/cosmotools/CosmoHaloFinder.cxx#L109)

Pay attention: if we set the backend to openmp and use mpirun to run the program, by default there will be only one core per process even in the local environment. This is a good way to bind multiple physical cores to one process/task:

mpirun -np ncores --bind-to none -x OMP_NUM_THREADS=nthreads ./program

Timer

As with the device adaptor, it is recommended to use the vtkm timer when measuring the performance of a specific filter; otherwise the timing is not accurate. It may also be necessary to force the timer onto the expected device when initializing it, especially for the cuda device.
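
A minimal sketch of timing with the VTK-m timer:

#include <vtkm/cont/Timer.h>

void TimeFilter()
{
  vtkm::cont::Timer timer;
  timer.Start();

  // ... run the filter here ...

  timer.Stop();
  vtkm::Float64 elapsedSeconds = timer.GetElapsedTime();
  (void)elapsedSeconds;
}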

Debug strategies

In parallel programming, debugging is a little bit tricky. In vtkm, we can always start from the serial backend for debugging. If the data set is large, there are two strategies: one is to generate a small synthetic data set, and another is to print out the results of the communication for a specific thread. We can use the built-in WorkIndex to index the thread; each thread has a unique id. For the case of WorkletVisitCellsWithPoints, the WorkIndex labels the specific position of the cell; for example, the 0th id processes the first cell, and so on. Another convenient tag is InputIndex: it produces a vtkm::Id that identifies the index of the input element, which can differ from the WorkIndex in a worklet with a scatter.

In addition, if we want to access the coordinates of the points in the worklet, the point coordinates can be made available by having a FieldInPoint input and passing in the point-coordinates array.

The poor performance of the debug mode

When we build VTK-m in debug mode, performance may decrease because extra assertion operations are added for the array handle; check this place (https://github.com/Kitware/VTK-m/blob/master/CMakeLists.txt#L137) for more details.

Here we can see that performance can increase a lot if we do not use the extra VTKM asserts:

https://github.com/Kitware/VTK-m/blob/master/vtkm/Assert.h#L42

This is also really common in C programming: if we use assert operations in a program, these assertions only take effect when the program is compiled in DEBUG mode.
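
As a small illustration (SafeDivide is an assumed helper, not from the article), VTKM_ASSERT behaves the same way: it is active in debug builds and compiled out otherwise:

#include <vtkm/Assert.h>
#include <vtkm/Types.h>

VTKM_EXEC_CONT inline vtkm::Float64 SafeDivide(vtkm::Float64 numerator, vtkm::Float64 denominator)
{
  // Like the standard assert, this check disappears in release builds
  // (or when assertions are explicitly disabled), so it costs nothing there.
  VTKM_ASSERT(denominator != 0.0);
  return numerator / denominator;
}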

Function attribute or type

In CUDA, a function or method must be declared with the __device__ modifier to run on the CUDA device. This tells the compiler to generate the machine code that the GPU understands. This __device__ is hidden in the VTKM_EXEC or VTKM_EXEC_CONT in VTK-m. A function running on a device can only call other functions that run on the device. The worklet, which runs on the device, is trying to call functions/methods in the Eigen library that are not marked __device__.

For example, without the exec tag this hidden attribute is missing.

Set the log level and the backend

These configurations are recommended to be set in the initialization call of vtkm. There are other operations to set the backend and the log levels, such as the ForceDevice function.

CI/CD and management

The project management is maintained in this repo:

https://gitlab.kitware.com/kmorel/ecp-vtk-m-project-management

Since this project involves multiple organizations, they use open issues and associated labels to show the key functionality and the work in progress, which is really clear. However, it might be good to link each specific item to a specific gitlab issue.

Another good thing is the CI/CD setup: once there is a new commit, the cdash run is triggered and there are detailed result checks, such as compiler warnings and memory-leak checking (compilers on different platforms complain about different warnings). It is good to fix all these kinds of issues before the code is merged.

The position of vtkm, multiprocess programming

Originally, vtkm only focused on the parallel primitives on a single node. There is a vtkh library that calls the vtkm library and is in charge of the MPI-based multi-node parallelism. It turns out this introduced lots of dependency issues, and some cases such as streamlines need to handle all levels of parallelism. So for the current vtkm, when designing a filter, the parallelism from both the single-node and the multi-node perspectives needs to be considered.

Compiling based on modules

vtkm's compiling process adopts a pattern similar to vtk's: it uses a module system to assist the compiling and linking process. There are several keywords under vtkm.module. Details can be found here (https://gitlab.kitware.com/vtk/vtk-m/-/blob/master/docs/Modules.md)

Other small tricks

Once, I tried to use the Threshold and ThresholdPoints filters in vtkm, but the output data set was not what I expected. The reason is that I forgot to call SetCompactPoints(true); without setting this parameter, the results are not the ones we want.

Mechanisms in the filter base class

The filter base class provides a lot of common mechanisms used by filters. A common one is the active field, which sets which field the filter applies to. The field selection mechanism is also useful: it controls how fields of the original data set are mapped to the final data set, and it can control which fields should be skipped.

Coordinate systems

A coordinate system is just a common field that we label as coordinates. When we call AddCoordinateSystem, we just mark a specific field as a coordinate system.

Detailed understanding of the ArrayHandle

At some point, we need to understand the array handle in detail from scratch to write filter algorithms more efficiently. TODO

Tricks for adding a new filter

Remember to set the export label.

Remember to add the .cxx file to the CMakeLists of the current filter directory. Forgetting this may cause a vtable error saying that some virtual function is not implemented; the core issue is that the cxx file is not compiled by cmake!

Remember to update the CMakeLists in the testing dir.

Remember to update vtkm.module

Scalable design of VTKm

multiple filters
multiple worklets and dispatchers
multiple backends

Key interface of filters
DoExecute: the input is a data set and the return is a data set; or the input is multiple data sets and the return is a partitioned data set, which can be executed in a distributed manner.
Setting key properties of the filter, such as the field the filter is applied to or whether the field will be passed to the output results.

Key interface of the backends/device adaptors

  • Adding a tag that labels the associated device

  • Querying the associated device information, e.g., whether the device exists on the current system.

  • Memory management, such as allocating memory on the device, copying data from host to device, and copying data from device to host.

  • Setting associated parameters for the runtime, such as the number of threads and other information such as the number of blocks in a grid, the number of threads in a block, etc.

  • The algorithms may be the most important part, such as ScheduleKernel in 1d or 3d, etc., implemented as a strided for loop.

  • Other associated implementations such as the timer system, start time, stop time, etc.

The dispatcher is the core part that bridges the control environment with the execution environment. We have an array in the control environment, and then use the dispatcher to call the worklet so that the worklet code executes on the device with specific patterns.

Management of worklet (more details in the tutorial)

  • Type checking: checking whether the types of the input parameters of the worklet are compatible with the types specified by the ControlSignature defined in the worklet.

  • Transportation: the dispatch mechanism plays a role here; it needs to load the data into the execution environment before scheduling a job. This is usually part of the dispatcher code. The transport class must contain an operator that turns the associated control argument into an execution-environment object. The ExecObjectType is the type of the data after it is moved to the execution environment; it contains some information associated with the devices.

  • Fetching,

Typical workflows: MapField, VisitCellsWithPoints, VisitPointsWithCells, ReduceByKey

Key interface of the worklet and dispatcher (more details can be found in the tutorial):

checking parameters
creating data on the device
scheduling the corresponding algorithm

Some tricky points

Sequences of visiting cells
When visiting cells by points, if the result depends on the order in which the points of a cell are visited, we may need to map the index output by vtkm to the correct one. For example, if the sequence in which we want to visit the points of a cell differs from the one provided by VTK-m, we need to add another function to map the point id to the one we want.

Error messages based on templates
Since VTKm uses templates to define worklets, it can be tricky to parse the error messages; just try to go back to the first place where the error is generated. For instance, in this example I forgot to implement the constructor and got a large amount of output:

(base) zw1@MAC132276 UCV$ make
[ 50%] Linking CXX executable LoadEnsAndProcess
Undefined symbols for architecture arm64:
"CriticalPointWorklet::CriticalPointWorklet()", referenced from:
auto callCriticalPointWorklet(vtkm::cont::DataSet)::$_0::operator()<vtkm::cont::ArrayHandleRecombineVec<signed char> >(vtkm::cont::ArrayHandleRecombineVec<signed char>&) const in LoadEnsAndProcess.cpp.o
auto callCriticalPointWorklet(vtkm::cont::DataSet)::$_0::operator()<vtkm::cont::ArrayHandleRecombineVec<unsigned char> >(vtkm::cont::ArrayHandleRecombineVec<unsigned char>&) const in LoadEnsAndProcess.cpp.o
auto callCriticalPointWorklet(vtkm::cont::DataSet)::$_0::operator()<vtkm::cont::ArrayHandleRecombineVec<short> >(vtkm::cont::ArrayHandleRecombineVec<short>&) const in LoadEnsAndProcess.cpp.o
auto callCriticalPointWorklet(vtkm::cont::DataSet)::$_0::operator()<vtkm::cont::ArrayHandleRecombineVec<unsigned short> >(vtkm::cont::ArrayHandleRecombineVec<unsigned short>&) const in LoadEnsAndProcess.cpp.o
auto callCriticalPointWorklet(vtkm::cont::DataSet)::$_0::operator()<vtkm::cont::ArrayHandleRecombineVec<int> >(vtkm::cont::ArrayHandleRecombineVec<int>&) const in LoadEnsAndProcess.cpp.o
auto callCriticalPointWorklet(vtkm::cont::DataSet)::$_0::operator()<vtkm::cont::ArrayHandleRecombineVec<unsigned int> >(vtkm::cont::ArrayHandleRecombineVec<unsigned int>&) const in LoadEnsAndProcess.cpp.o
auto callCriticalPointWorklet(vtkm::cont::DataSet)::$_0::operator()<vtkm::cont::ArrayHandleRecombineVec<long long> >(vtkm::cont::ArrayHandleRecombineVec<long long>&) const in LoadEnsAndProcess.cpp.o
...
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [LoadEnsAndProcess] Error 1
make[1]: *** [CMakeFiles/LoadEnsAndProcess.dir/all] Error 2
make: *** [all] Error 2

Only the first line and the last line might be important; it is easy to get lost when the error is complicated.

Rendering system

Associated PR about parallel compositing:

https://gitlab.kitware.com/vtk/vtk-m/-/issues/802

VTKm dataset vs VTK dataset

Using data set merging as an example to analyze this.

References

VTK-m User Guide
