What are you working on?

Some thoughts about the classification of CS-related topics, and why it is not enough to divide them into the system track, the algorithm track, and the middleware track.

Classification of CS domains

It is still hard for me to explain what I am working on to others, especially to people who are not in this field. Previously, I tried to explain the whole picture from the perspective of algorithms, systems, and applications, but it was still too obscure for other people. So here is a new framework I found.

Maybe I could explain it this way: the two aspects to focus on (or the two world views) in CS are data and resources, and almost every problem in CS ultimately comes down to a question about one or both of them.

The data aspect includes data generation, data transfer, data processing, and data storage or indexing.

The resource aspect includes computation, networking, and storage resources.
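To make the framing concrete, here is a minimal sketch that models the two aspects as Python enums. The names and the example labelling are my own, purely for illustration:

    from enum import Enum, auto

    class DataAspect(Enum):
        GENERATION = auto()   # where the data is produced
        TRANSFER = auto()     # how the data moves between devices
        PROCESSING = auto()   # how the data becomes insights or decisions
        STORAGE = auto()      # how the data is stored and indexed

    class ResourceAspect(Enum):
        COMPUTATION = auto()  # CPUs, GPUs, accelerators
        NETWORKING = auto()   # the links that move the data
        STORAGE = auto()      # disks, object stores, memory tiers

    # Example: labelling a piece of work by the aspects it touches
    my_project = {
        "data": [DataAspect.PROCESSING],
        "resource": [ResourceAspect.COMPUTATION],
    }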

With this view, it becomes easy to analyse particular topics. People usually work in different contexts or with different semantics, for example IoT, cloud computing, edge computing, HPC, robotics, etc. Let's see how to use this angle to analyse these domains.

Let's look at IoT and edge computing. The data comes from edge devices, and it might be transferred to a central machine, perhaps in a data center. After some processing, a decision is made: the output is either processed information that helps people decide, or a decision made directly by an algorithm (data processing). Once the decision is made, it is transferred back to the device (data transfer) and controls the resources (resource management). A simple case is lifting the barrier in a parking lot; a complicated case might be adjusting the engine of a spaceship.
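A rough sketch of that loop in Python, just to show where each data/resource step sits. All classes, thresholds, and actions below are hypothetical placeholders, not any real IoT API:

    import random

    class Sensor:
        def sample(self):
            return random.random()              # data generation at the edge device

    class Cloud:
        def process(self, reading):
            # data processing at the central machine; a trivial rule stands in for a real model
            return "open" if reading > 0.5 else "stay_closed"

    class Barrier:
        def apply(self, decision):
            # the decision is sent back to the device, which controls a physical resource
            print(f"barrier action: {decision}")

    def control_loop():
        sensor, cloud, barrier = Sensor(), Cloud(), Barrier()
        reading = sensor.sample()               # data generation
        decision = cloud.process(reading)       # data transfer + data processing
        barrier.apply(decision)                 # data transfer back + resource management

    if __name__ == "__main__":
        control_loop()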

You can see that a real, workable project includes both the data and the resource aspects. People may focus on and optimize different parts, which is why it is hard for them to explain what they do to others: once we have dived too deep into one particular aspect, it is not easy to keep the whole picture in view.

For other contexts such as HPC/cloud work, models from different scientific domains generate simulation data (data generation), or log messages come from user activity. Generating the data requires computing resources. To manage those resources, such as allocating or releasing them, we need resource management tools (such as Slurm or k8s); to use the resources properly, we need compilers and programming languages. We may then need to transfer data between devices for further processing, which relies on transfer media such as disks or the network, plus data indexing methods; that is where data I/O and storage sit. For data processing, we use different mathematical models or visualization tools to get insights. Once we have results, we can adjust the simulation based on those insights.
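As a sketch, that loop for a simulation campaign might look like the following. The simulate/analyze functions and parameter names are made-up stand-ins; in practice the compute allocation would go through a scheduler such as Slurm or k8s rather than plain Python:

    import json
    import statistics

    def simulate(step_size):
        # data generation: a toy "simulation" that just produces some numbers
        return [step_size * i for i in range(100)]

    def analyze(data):
        # data processing: extract a simple insight from the generated data
        return statistics.mean(data)

    def campaign():
        step_size = 0.1
        for round_id in range(3):
            data = simulate(step_size)                      # data generation (uses compute resources)
            with open(f"round_{round_id}.json", "w") as f:  # data I/O and storage
                json.dump(data, f)
            insight = analyze(data)                         # data processing / visualization step
            if insight > 5:                                 # adjust the simulation based on insights
                step_size = step_size / 2

    if __name__ == "__main__":
        campaign()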

It is easy to find a position for what you are working on with this framework, namely by considering things from both the data angle and the resource angle. Previously, we used the system, algorithm, and middleware tracks to classify CS topics, but that only considers things statically, without the high-level workflow, and it is hard for people to understand the field as a whole. Besides, it is not good from a project's point of view: if you scope your work only as the system part or the algorithm part, you may neglect the upstream and downstream, and you may lose sight of how your piece contributes to the whole project.

One typical case is a middleware-related project such as the storage part: if you neglect the idea of upstream and downstream, it is easy to build something that is not useful. The goal of the project is to support the loop of data generation, transfer, processing, and applying decisions to the resources/devices or to users. Without this goal, the work may lack motivation; you may become confused about what you are doing, get your paper rejected, or lose opportunities for potential funding. Making your work fit into the whole loop and serve the goal from the application's perspective (a user's requirements, a scientific problem, in short the motivation of the work) is really, really important.

Focus on the right problem

Sometimes there is a feeling of being overwhelmed: there are so many things you could learn, and I am confused about what the right things to learn are, or where to put my time.

The thing we need to remember is that we use CS to serve a specific domain, so CS and the related mathematical or statistical tools are what we need to understand well. The newly emerging CS techniques are also necessary to know, such as all kinds of accelerators and deep learning.

The domain is physics, biology, etc., since we use visualization and the associated data processing to handle the data generated by these simulations. The tricky part is: how much do we need to know about these other fields? Maybe more is better, but one person cannot know everything; if we are lucky, we can find some collaboration.

The current idea is that we do not need to learn everything about the other domain; it is enough to understand the parts that have a close connection with the CS side, since essentially we are contributing to the CS methods (making them faster, more efficient, more accurate, or achieving results that could not be obtained before).

There are several levels:

Know it, Use it, Implement/Modify/Optimize it

For the level of "know it", you may know the functions or their definitions and what they can do; that is enough. This applies to knowledge from other domains that is related to your work. For "use it", you need more understanding of the details, for example the Python statistical libraries or the related APIs, but you do not work on these things directly, so you do not need to modify or update them (you just need to know their APIs or understand their code in general). For the level of "implement it", you work directly on these algorithms or programs and know how they work in detail.
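A small Python illustration of the difference between "use it" and "implement it", using only the standard library (the sample data is made up):

    import statistics

    samples = [2.0, 3.5, 4.1, 5.0, 3.3]   # made-up sample data

    # "Use it": call the library and only need to know the API
    lib_mean = statistics.mean(samples)
    lib_stdev = statistics.stdev(samples)

    # "Implement it": write the same computation yourself and own every detail
    my_mean = sum(samples) / len(samples)
    my_var = sum((x - my_mean) ** 2 for x in samples) / (len(samples) - 1)  # sample variance
    my_stdev = my_var ** 0.5

    print(lib_mean, my_mean)    # the two means agree
    print(lib_stdev, my_stdev)  # and so do the standard deviations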

Imagine an inverted triangle. The top takes up the largest portion: these are the things you only need to know. A smaller portion is the content you need to use, and the innermost part contains the things you need to implement. Things get slow and error-prone once you start to implement something.

One issue that makes people feel overwhelmed is thinking about everything at the "implement it" level. However, this is almost never the truth. Most of the time, things belong to the "use it" level: there are all kinds of simulations that can generate the data you want, or you may only need to modify the current configuration of the software you already have to achieve the effect you want. So the right question is how to learn to update the configuration of the software you have and get the data you want, instead of writing something new from scratch. By thinking about the problem this way, life becomes much easier, and you know the right question to look at or the right direction to move forward.
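For example, a "use it"-level task might look like the sketch below: load an existing configuration file, change one field, and rerun the tool, rather than writing a new simulator. The file name and the keys here are hypothetical:

    import json

    # Hypothetical config tweak: the file name and keys are made up for illustration.
    with open("simulation_config.json") as f:
        config = json.load(f)

    config["output_interval"] = 10      # write data more frequently
    config["num_particles"] = 100000    # larger run to get the data we want

    with open("simulation_config.json", "w") as f:
        json.dump(config, f, indent=2)

    # Then rerun the existing simulator with the updated configuration,
    # e.g. through the job scheduler, instead of implementing anything from scratch.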

Some questions we may ask ourselves

1 What is the whole picture or pipeline? (We need to figure out the whole story even if we only work on a specific part.)

2 Without this piece, what cannot be achieved?
