2022-07-28

Software design and code review experiences

Some high-level notes about software design and optimization. Attention to these points is often what distinguishes a senior software engineer from a junior one.

Code review

Two principles for writing elegant code

There are many principles for writing elegant code; this part discusses two important ones.

The first one is “let the compiler do more of the work”.

The second one is “reduce redundant code”.

Both principles are useful guides during code review.

One example I met recently involves a worklet abstraction class. Originally we created a separate functor for each actual function call. It turns out that a function passed as a template parameter does the same job more elegantly. A more traditional alternative is to store a function pointer in the worklet, but the template parameter is simpler than carrying a function pointer at run time.

#include <cmath>
#include <iostream>

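// fun is a non-type template parameter: the function is fixed at compile time,
// and the float(float) signature selects the matching std::log overload.
// Caveat: since C++20, forming a pointer to most standard library functions is
// unspecified, so a small wrapper function is the more portable choice.
template <float fun (float)>
struct LogFunWorklet
{
    float operator()(float f) const {
      return fun(f);
    }
    // other values needed for the computation can be stored here
};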

int main()
{
    auto logWorklet = LogFunWorklet<std::log>{};
    std::cout << logWorklet(10.0f) << std::endl;

    auto log10Worklet = LogFunWorklet<std::log10>{};
    std::cout << log10Worklet(10.0f) << std::endl;

    auto log2Worklet = LogFunWorklet<std::log2>{};
    std::cout << log2Worklet(10.0f) << std::endl;

}

A typical use of the second principle is to keep if and switch branches as DRY as possible and confined to the smallest possible scope: only the part that really differs belongs inside the branch. How far to go is decided case by case. Checking your code against these two principles is a helpful way to improve its quality.
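As a minimal sketch of keeping branches small, the switch below selects only the one thing that differs (which logarithm to call), while validation and the loop are written once outside it. The names LogKind, applyLog and logAll are hypothetical and chosen for illustration only.

```cpp
#include <cmath>
#include <stdexcept>
#include <vector>

// Hypothetical selector for the example; not taken from any library.
enum class LogKind { Natural, Base10, Base2 };

// The switch contains only the part that differs between the cases.
inline float applyLog(LogKind kind, float x)
{
    switch (kind) {
        case LogKind::Natural: return std::log(x);
        case LogKind::Base10:  return std::log10(x);
        case LogKind::Base2:   return std::log2(x);
    }
    throw std::invalid_argument("unknown LogKind");
}

// Input validation and iteration are shared, so they live outside any branch.
inline std::vector<float> logAll(LogKind kind, const std::vector<float>& in)
{
    std::vector<float> out;
    out.reserve(in.size());
    for (float x : in) {
        if (x <= 0.0f) {
            throw std::domain_error("logarithm of a non-positive value");
        }
        out.push_back(applyLog(kind, x));
    }
    return out;
}
```

The less DRY alternative would repeat the validation and the loop inside each case, so that every future change has to be made three times.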

const

The first chapter of Effective C++ discusses const. We do not dive into the details here; just keep it in mind and use const wherever it applies.

Several tools for processing different types

It is common for the same logic to be applied to different types of data. Which tool to use depends on how much code needs to be reused. The usual techniques are templates (both template parameters and function templates), function overloading (overload only the part that behaves differently), and abstract classes (interfaces). Sometimes a mixture works best, such as using a template parameter together with a function parameter to select a specific behaviour. One recent example: I needed to write a field into both a VTK-m data set and a VTK-m partitioned data set, and the two need different association types. So the template parameter is the type of the data set, and the function parameter carries the value of the association.
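The mixture described above can be sketched roughly as follows. The types here (Association, Field, DataSet, PartitionedDataSet) are simplified stand-ins written for this example and are not the real VTK-m classes or their API; only the shape of the idea is the point: the container type is a template parameter, and the association is an ordinary function argument.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; NOT the VTK-m types.
enum class Association { Points, Cells, Partitions };

struct Field
{
    std::string name;
    Association association;
    std::vector<double> values;
};

struct DataSet
{
    std::vector<Field> fields;
    void AddField(Field f) { fields.push_back(std::move(f)); }
};

struct PartitionedDataSet
{
    std::vector<Field> fields;
    void AddField(Field f) { fields.push_back(std::move(f)); }
};

// The type of the data set is the template parameter; the association is a
// run-time value, so one function body serves both containers.
template <typename DataSetType>
void writeField(DataSetType& ds,
                const std::string& name,
                Association association,
                std::vector<double> values)
{
    ds.AddField(Field{name, association, std::move(values)});
}
```

A caller would then write, for instance, writeField(dataSet, "density", Association::Points, values) for one container and writeField(partitions, "density", Association::Partitions, values) for the other, without duplicating the writing logic.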

Software Design

Elastic leader-worker

One project required elastic resource management, and I implemented essentially all of it from scratch. There are no fancy ideas behind it, but it took a lot of time to get everything right. Some of the principles may be useful for future work, in either a cloud or an HPC environment.

Background and dependencies.

We need at least an RPC framework that makes it convenient to define the API.
We need some collectives built on top of that API, such as barrier and broadcast.
We need a controller that acts as the component for adding processes. For example, a script can monitor a signal file on PVFS and start new processes when the signal file is detected.

The worker-leader identity. The leader process starts first and runs for a long time. Any process or node added to the service later takes the worker identity and sends its address to the leader process. When a worker process leaves the group, it calls the deregister API on the leader, which removes the worker's address.

How to design the sync operation in the leader: compare the expected number of processes with the number actually stored in the address manager. A pending counter may help to show how many processes are still in pending status.

Declarative management of the metadata. The metadata in the leader process is updated by two components, and it is important to be clear about this before designing the associated data structure. The first is the worker processes, which add or remove their addresses. The second is the policy, which sets the expected number of processes. If the expected number does not match the number actually registered, the rescaling operation has not finished and we are still in the sync operation.
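One possible shape for that metadata is sketched below. The class GroupMetadata and all of its member names are hypothetical, the RPC layer is left out, and a real leader would also need locking because RPC handlers run concurrently. The sketch only shows the two separate writers (policy and workers) and the comparison that defines the sync state.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical leader-side metadata; single-threaded for brevity.
class GroupMetadata
{
public:
    // Written by the policy only: the declared (expected) group size.
    void setExpected(std::size_t n) { expected_ = n; }

    // Written by workers only, through the register / deregister calls.
    // Each returns false if the call changed nothing.
    bool registerWorker(const std::string& address)
    {
        return addresses_.insert(address).second;
    }
    bool deregisterWorker(const std::string& address)
    {
        return addresses_.erase(address) > 0;
    }

    std::size_t registered() const { return addresses_.size(); }

    // Positive: still waiting for workers to join.
    // Negative: still waiting for workers to leave.
    long pending() const
    {
        return static_cast<long>(expected_) - static_cast<long>(addresses_.size());
    }

    // Rescaling is finished only when the declared and actual sizes agree.
    bool inSync() const { return expected_ == addresses_.size(); }

private:
    std::size_t expected_ = 0;
    std::set<std::string> addresses_;
};
```

Keeping the expected size and the registered addresses as separate fields with separate writers means neither component has to know about the other; the leader only compares them.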

The sequence of process ids: when processes are added or removed, how do we control their ids?

Issues. The leader process is a single point of failure. If the leader process fails, all the information about the view of the group may be lost.

Reproducibility and script-based management

When submitting work to a high-level conference such as SC, reproducibility is always an important aspect. After all, the whole merit of science is that a theory or algorithm is reproducible and solid.

Paying attention to reproducibility from the start keeps the project clean and makes it easy to collaborate with others. Managing and condensing complex configurations takes experience. Think about how Spack works: once all the details are configured correctly, a one-line command installs everything properly. We do not need to write another Spack, but we can make our project and scripts run in a Spack-like way.

Fixing the platform is always the first step. We can write specific install scripts for each kind of platform: Ubuntu, Windows, macOS and so on. This is also an important way to decouple complex tasks.

Using scripts to record what you have done is another important idea. You may run several experiments with different parameters and configurations: create this directory, update that config, and so on. If separate scripts capture all these details, your experiment results are reproducible and easy to maintain. Although it takes a little time to set everything up correctly the first time, you will save a lot of time on a long-term project.

Imagine a new member joining your project. They can get familiar with the experiments and the install process quickly and clearly just by reading these scripts. The previous colza-experiments project for the SC paper is a good example.

Even for a project in daily use, when you install the software from scratch, write the install.sh script at the same time. When the installation is finished, the install.sh script is finished as well, which saves much time and energy. When you want to migrate to another platform, you do not need to redo everything from scratch; you probably only need to update a few lines of the previous script.

Trace the operation based on configuration

Typical examples are ParaView and Gmsh. Each has two sets of interfaces: we can add a filter through the GUI, or through a programmable interface such as the Python code in ParaView or the scripts in Gmsh. After operating manually in the GUI, we can transfer those operations into script form. This is how ParaView Catalyst works. Generating the configuration script makes the operations persistent, and the script can then be integrated into more automated or more complicated pipelines. It is important to be able to generate scripts from manual operations.

How to let software talk to users

Software can broadly be divided into two types. One is the library: software designed to implement a specific operation and to be called by other software or libraries.

The intended user of this kind of software is the developer. The language for talking to this user is (1) standard install scripts, (2) the necessary examples, and (3) good API documentation.

If you want other people to use it, make sure they can use it easily (hence the importance of standard install scripts: good practice is a one-line install of the whole software, with a separate install script per platform) and correctly (hence the importance of examples).

The other type of software is designed for non-developers. Here a GUI or a web-browser frontend is necessary: the user only needs to click a specific button to get the results they want. Knowing only the frontend or only the backend is not enough. With a limited number of users, a GUI may matter more than a web-based service, since building a web service carries some burden (you need to rent a server with a public IP, and you also need a proxy service), whereas a GUI application runs wherever there is a device.

K8s controller and declarative management

The K8s controller and its declarative management are always inspiring when designing a controller that governs the dynamic behaviour of a system. We declare specific requirements in a configuration database such as etcd, and the controller continuously polls that status. When the status changes, the controller runs the associated behaviour registered with it. Essentially, the controller is a kind of trigger mechanism: we register conditions and actions with it, and it fires the right action when a specific event occurs.

Multiple controllers working together can compose a large system with complex dynamic behaviour in multiple aspects.
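The trigger idea can be sketched in a few lines. This is not how Kubernetes itself is implemented; State, Controller, registerRule and reconcileOnce are hypothetical names for a toy in-memory version, in which the declared and observed status sit in one struct and each rule is a (condition, action) pair.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical status record: what was declared versus what is observed.
struct State
{
    int desiredReplicas = 0;
    int runningReplicas = 0;
};

class Controller
{
public:
    using Condition = std::function<bool(const State&)>;
    using Action = std::function<void(State&)>;

    // Register a trigger: when the condition holds, the action runs.
    void registerRule(Condition condition, Action action)
    {
        rules_.emplace_back(std::move(condition), std::move(action));
    }

    // One polling pass over the status; returns how many actions fired.
    int reconcileOnce(State& state) const
    {
        int fired = 0;
        for (const auto& rule : rules_) {
            if (rule.first(state)) {
                rule.second(state);
                ++fired;
            }
        }
        return fired;
    }

private:
    std::vector<std::pair<Condition, Action>> rules_;
};
```

A caller registers, for example, a scale-up rule (condition: runningReplicas < desiredReplicas, action: start one more replica) and a matching scale-down rule, then calls reconcileOnce in a loop until no rule fires, at which point the observed state matches the declared one.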

Designing a distributed descriptive statistics filter

I had an interesting experience working on a distributed statistics filter. The first thing is the theoretical support; check this file:

https://gitlab.kitware.com/vtk/vtk-m/-/blob/master/vtkm/worklet/DescriptiveStatistics.h

The related paper is “Numerically stable, single-pass, parallel statistics algorithms”. Parallel and single-pass are the important properties.

It describes several operations from a mathematical perspective. These functions can be used to compute the statistics incrementally, which provides a good foundation for parallel computing. We then just use a customized operator to compute all kinds of statistics.

Then we need to consider the different parallel computing models: the shared-memory case and the distributed-memory case. Once all cases are handled properly, we can say the filter works for different data sets.

Also pay attention to edge cases, such as an empty input or a single value, and decide what statistics are returned for them.
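To make the customized operator concrete, here is a minimal sketch of a mergeable summary that uses the well-known pairwise update for count, mean, and sum of squared deviations. It is not the VTK-m implementation: the names Stats, merge, fromValue and summarize are hypothetical, and returning NaN for the variance of fewer than two values is one possible edge-case policy assumed for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical partial result; each thread or rank would hold one of these.
struct Stats
{
    std::size_t n = 0;   // number of values seen
    double mean = 0.0;   // running mean
    double m2 = 0.0;     // sum of squared deviations from the mean

    // Assumed policy: the sample variance is undefined for n < 2.
    double sampleVariance() const
    {
        if (n < 2) {
            return std::numeric_limits<double>::quiet_NaN();
        }
        return m2 / static_cast<double>(n - 1);
    }
};

// Combine two partial results without revisiting the data. The operation is
// associative, so it works both as a reduction within one node and as the
// combine step across nodes.
inline Stats merge(const Stats& a, const Stats& b)
{
    if (a.n == 0) return b;  // empty partitions must not disturb the result
    if (b.n == 0) return a;
    Stats r;
    r.n = a.n + b.n;
    const double total = static_cast<double>(r.n);
    const double delta = b.mean - a.mean;
    r.mean = a.mean + delta * static_cast<double>(b.n) / total;
    r.m2 = a.m2 + b.m2 +
           delta * delta * static_cast<double>(a.n) * static_cast<double>(b.n) / total;
    return r;
}

// A single value is a summary with n = 1 and zero spread.
inline Stats fromValue(double x)
{
    Stats s;
    s.n = 1;
    s.mean = x;
    return s;
}

// Single pass over local data, expressed with the same merge operator.
inline Stats summarize(const std::vector<double>& values)
{
    Stats acc;
    for (double x : values) {
        acc = merge(acc, fromValue(x));
    }
    return acc;
}
```

In the distributed case each rank would call summarize on its local block and the partial Stats would be combined with merge; ranks that own no data contribute an empty Stats, which merge passes through unchanged.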

AverageMind