Some thoughts on dynamic membership in this research
One related work explains dynamic process membership for a key-value store on HEC systems (https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6569861&tag=1). Section 3 of the paper gives good motivation for letting processes join and leave dynamically: (1) fault tolerance, e.g. handling process or node failures, and (2) better resource utilization to improve performance, e.g. using more processes when there is more data or more tasks. In an HEC environment, however, the resources are fixed when a job starts, so they do not change dynamically (at least at the process level). As a result, most research on dynamic membership in HEC environments falls under fault tolerance. Even this paper uses fault tolerance as an important use case. According to the paper, the second use case may be more useful in cloud environments.
For fault tolerance, however, previous work by Duan (https://dl.acm.org/doi/10.1145/3391448) already presents good solutions for process failure in a data staging service. It also seems that research on fault tolerance is more interested in how to checkpoint data and how to recover data or processes from the point of failure. Dynamic membership is just an important tool for supporting these features.
Based on this discussion, one main motivation for the HEC case is how to fully utilize the allocated resources. The resources available to you are like a large bag, and what you can do is schedule tasks smartly so that they use the resources fully or reduce the total workflow execution time as much as possible. For the cloud environment, the idea is that resources can join and leave frequently, and the question is how to handle the related issues. These are two different ways of looking at the problem. The difference comes mainly from the resource allocation tool used on HEC systems.
Based on these thoughts, there is a lot of work on how to schedule tasks wisely so that they fully utilize the process resources, such as GoldRush (Resource Efficient In Situ Scientific Data Analytics Using Fine-Grained Interference Aware Execution) and AMR data load balancing. The main idea is to make use of process idle periods.
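To make the "fixed bag of resources" view concrete, here is a minimal sketch in Python of placing tasks on a fixed pool of processes so that the total execution time (makespan) stays low. It uses a generic greedy longest-task-first heuristic, which is an assumption for illustration only and is not the method used by GoldRush or by any of the works cited above. The function name schedule_lpt, the task costs, and the pool size are all hypothetical.

```python
import heapq


def schedule_lpt(task_costs, num_procs):
    """Greedy longest-task-first placement onto a fixed pool of processes.

    task_costs: estimated cost of each task (hypothetical values).
    num_procs:  number of processes, fixed for the whole job as on HEC systems.
    Returns (assignment, makespan).
    """
    # Min-heap of (accumulated load, process id); the pool never grows or shrinks.
    heap = [(0.0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_procs)}

    # Place the most expensive tasks first, always on the least-loaded process.
    for task_id, cost in sorted(enumerate(task_costs), key=lambda t: -t[1]):
        load, proc = heapq.heappop(heap)
        assignment[proc].append(task_id)
        heapq.heappush(heap, (load + cost, proc))

    # The makespan is the load of the busiest process.
    makespan = max(load for load, _ in heap)
    return assignment, makespan


if __name__ == "__main__":
    assignment, makespan = schedule_lpt([5, 3, 3, 2, 2, 1], num_procs=2)
    print(assignment, makespan)
```

In this view the number of processes is an input that never changes, and the only freedom is where each task goes. In the cloud view, num_procs itself would change while the workflow runs, which is where dynamic membership becomes necessary.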