This article compares high-performance computing (HPC) with cloud computing. It is the script for an online video.
You may have heard of cloud computing and HPC (high-performance computing) here and there.
Are you curious about the differences between these two techniques?
After several years of elementary work in both areas, the answer to this question is becoming clear.
(In this video) We try to explain the differences between cloud computing and HPC from several aspects: the maintainers and machines, the users and their programs, how the systems are accessed, and the typical software stacks.
Both areas are still evolving quickly; we just try to share some key ideas and necessary information.
Generally speaking, what cloud computing and HPC have in common is that you try to run your program on multiple machines, such as hundreds or thousands of computers. Obviously, one person or a small institution does not have enough funds and capability to build a cluster containing a large number of machines, so these clusters are usually maintained by specialized institutions.
Let's first look at the differences between the institutions that maintain the machines, or computing nodes, used for cloud computing and HPC.
Cloud computing services are mainly provided by commercial companies.
The figure on the left shows the main cloud computing service providers.
You may know the names of these companies very well, such as Amazon, Microsoft, and Google, etc. /ɪt ˈsetərə/
In contrast, the figure on the right shows the main providers of HPC systems in the USA. We mainly list the national laboratories under the DOE (Department of Energy). These institutions are well known to people with a background in scientific computing. These government-funded institutions are the main force behind building HPC systems and have played an important role in history. For example, Oak Ridge and Los Alamos National Laboratories were two sites involved in the Manhattan Project.
The difference in maintainers influences the properties of each platform and the functions it serves. Let's look at the details of the machines and services they provide.
Cloud providers usually do not publish details such as the performance metrics of their clusters; instead, they care more about the types of services.
Machines with different configurations are created for users according to their requirements, based on virtualization techniques. The physical machine pool is also heterogeneous, unlike HPC systems, which are generally configured with homogeneous machines.
The figure on the left shows the featured products of the Google Cloud service. In addition to virtual machines, they also provide versatile /ˈvɜːrsətl/ data management and processing capabilities.
There is a ranking called the TOP500 list that compares the performance of different HPC systems in different aspects according to various benchmark programs. It is just like the Olympic Games of the HPC world, and the ranking changes a lot every year.
The figure on the right shows the current ranking.
Currently, the number one system on the list is the Fugaku supercomputer, which is maintained by RIKEN (Japan's largest comprehensive research institution, renowned for high-quality research in a diverse range of scientific disciplines).
The second one is the Summit system, which is maintained by Oak Ridge National Laboratory.
Actually, most of these systems are maintained by national institutions.
Rmax and Rpeak are two important metrics.
A system's Rmax score describes its maximal achieved performance (how many floating-point operations it executes per second), while the Rpeak score describes its theoretical peak performance (https://kb.iu.edu/d/bbzo).
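To make Rpeak concrete, here is a back-of-the-envelope sketch in Python. The node specification (32 cores at 3.0 GHz, 16 double-precision FLOPs per cycle) is made up for illustration, not taken from any real system.

```python
# Back-of-the-envelope Rpeak for a hypothetical cluster.
# Rpeak = nodes * cores_per_node * clock_rate * flops_per_cycle

def rpeak_flops(nodes, cores_per_node, clock_hz, flops_per_cycle):
    """Theoretical peak in floating-point operations per second."""
    return nodes * cores_per_node * clock_hz * flops_per_cycle

# Hypothetical node: 32 cores at 3.0 GHz, 16 double-precision FLOPs/cycle.
per_node = rpeak_flops(1, 32, 3.0e9, 16)
cluster = rpeak_flops(1000, 32, 3.0e9, 16)
print(f"per node: {per_node / 1e12:.3f} TFLOPS")   # 1.536 TFLOPS
print(f"cluster:  {cluster / 1e15:.3f} PFLOPS")    # 1.536 PFLOPS
```

Rmax, by contrast, must be measured by actually running a benchmark (the TOP500 uses LINPACK), so it is always lower than Rpeak.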
This figure shows the national labs in the USA.
Another interesting topic is the users of cloud computing and HPC.
Cloud computing mainly serves IT companies. A startup may not want to spend money buying its own machines; it can simply rent computing nodes from a cloud provider at low cost and in a more flexible way. The cloud provider has a professional team to maintain these services. Anyone can rent or buy machines from a cloud provider as long as they pay.
The figure on the left lists some key information about the customers they serve. Companies such as retailers and financial firms may need to rent many machines to maintain their user information or provide necessary online services, such as login systems, websites, or the electronic transaction systems that support their business.
HPC mainly serves domain scientists, who use it to solve numerical problems such as scientific simulations. They build models and run them on HPC systems at large scale with parallel computing; it is common to use thousands of machines and cores for a single simulation. Scientists or research teams at universities collaborate with the research institutions that own the HPC systems; if their research goals overlap, the maintainer will allocate a specific amount of core time to the project. In any case, the goal of HPC is research, so these are basically non-profit services.
The figure on the right shows some key scientific projects running on the Summit supercomputer.
You can see that the main areas are biology, physics, fusion or nuclear science, earth science, etc. These types of projects are quite different from the cloud computing workloads illustrated by the figure on the left.
In summary, the core value of cloud computing is to serve the customer's business, and different software is built to serve this goal. For example, we need databases to store and index user information, smart analytics to process the data as needed, and backends with associated security services to support websites.
The main value of HPC is to serve domain science; typical projects include scientific simulation, data visualization and analytics, and the associated I/O or high-speed networking services that transfer data between different stages of the workflow.
When we discuss cloud computing services, they are typically divided into different layers.
The first layer is called IaaS, infrastructure as a service: the cloud provider assigns a (possibly virtualized) computing node to the user. When you rent nodes at the IaaS level, these nodes belong entirely to you. You have root permission on your computing nodes and can configure (and debug) them as needed. The cloud provider only supplies the computing, networking, and storage resources you request; you pay more money if you want a more powerful node.
The second layer is called PaaS, platform as a service. In this layer, the user just needs to provide a configuration file that describes how to run their executables, and the platform is in charge of resource scheduling and keeping the program highly available.
The next layer is called SaaS, software as a service. In this case, the user just calls an API to interact with services provided by the cloud provider, such as a storage service or a data monitoring service.
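As a quick summary, the three layers can be sketched as a small table of who manages what. The exact split varies by provider, so treat this as a simplification:

```python
# Simplified division of responsibility at each cloud service layer.
LAYERS = {
    "IaaS": {"user_manages": ["OS", "runtime", "application"],
             "provider_manages": ["hardware", "virtualization", "network"]},
    "PaaS": {"user_manages": ["application"],
             "provider_manages": ["hardware", "virtualization", "network",
                                  "OS", "runtime", "scheduling"]},
    "SaaS": {"user_manages": [],  # the user only calls the API
             "provider_manages": ["everything, exposed through an API"]},
}

def user_responsibilities(layer):
    """What the customer still has to manage at a given layer."""
    return LAYERS[layer]["user_manages"]

print(user_responsibilities("IaaS"))   # ['OS', 'runtime', 'application']
```

Moving down the list, the user manages less and the provider manages more.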
A common HPC system contains login nodes and compute nodes; if you get an account on a particular HPC system, you can log in to a login node. Every user has the same view when they access the system, and theoretically they can use all the nodes on the machine (there are constraints on total core time, and different partitions may have different running-time limits). It is just like a membership: once you are given an account, you share the high-quality services provided, such as large disk space and high-speed nodes. You submit jobs to a scheduler queue, and the scheduler is responsible for assigning nodes to each user. HPC systems do not use virtualization techniques, since computing power is abundant, even luxurious, for most users. You do not have root permission; if you need to install software that requires root privileges, you can send a ticket to the maintainers for help, or consult them about any usage issues. You are charged (in core time) only when your job is scheduled to run.
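The submission workflow above can be sketched as building an sbatch command. The options --job-name, --nodes, --time, and --partition are standard Slurm flags, but the job name, script, and values here are made up for illustration:

```python
def sbatch_command(job_name, nodes, walltime, partition, script):
    """Build the argument list for a Slurm batch submission.

    --job-name, --nodes, --time and --partition are standard sbatch
    options; the concrete values are hypothetical.
    """
    return ["sbatch",
            f"--job-name={job_name}",
            f"--nodes={nodes}",
            f"--time={walltime}",
            f"--partition={partition}",
            script]

cmd = sbatch_command("my_sim", 4, "02:00:00", "batch", "run.sh")
print(" ".join(cmd))
# sbatch --job-name=my_sim --nodes=4 --time=02:00:00 --partition=batch run.sh
```

Once submitted, the job waits in the queue until the scheduler finds free nodes for it.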
For example, if you check a Summit node, it is really luxurious and powerful.
More parameters are shown on this page.
Core time is tied to your project. When the PI starts a project, the core time is usually fixed, and when the project finishes, you cannot access the HPC resources anymore. That is why researchers always need to write proposals and apply for new projects: only in this way can they get available computing resources and funding from the government.
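Allocations like this are usually accounted in core-hours (nodes × cores per node × wall-clock hours). A minimal bookkeeping sketch, with made-up numbers:

```python
# Core-hour bookkeeping sketch: a project allocation is drawn down
# each time a job runs (all numbers are made up for illustration).

def core_hours(nodes, cores_per_node, hours):
    """Core-hours charged for one job."""
    return nodes * cores_per_node * hours

allocation = 1_000_000            # core-hours granted to the project
job = core_hours(nodes=100, cores_per_node=32, hours=6)
remaining = allocation - job

print(f"job used {job} core-hours, {remaining} remaining")
# job used 19200 core-hours, 980800 remaining
```

When the remaining balance reaches zero, the project can no longer run jobs until a new allocation is granted.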
Unlike the 24/7 high availability of cloud computing services, most HPC systems have periodical /ˌpɪriˈɑːdɪkl/ maintenance, and you cannot access the system while it is offline. I think that is definitely a good reason to take a rest sometimes.
Let's quickly go through some typical software used in these two areas.
IaaS: OpenStack, KVM, virtualization
PaaS: Kubernetes, containers, Docker
SaaS: more versatile, customized cloud services, such as the Lambda service
There are also more flexible charging and scheduling policies: for example, you may get more resources on demand when you need them. This is hard to implement on HPC. HPC uses plan-driven scheduling: you plan everything when you submit a job, and you may need to accept the limitations of the queue scheduler, such as long waiting times, since HPC resources are shared by many people at the same time.
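The on-demand model can be illustrated with a toy billing calculation; the hourly price here is made up, not any provider's real rate:

```python
# On-demand cloud billing sketch: you pay per node-hour only while
# you hold the instances (the price is invented for illustration).

def on_demand_cost(nodes, hours, price_per_node_hour):
    """Total cost of holding `nodes` instances for `hours` hours."""
    return nodes * hours * price_per_node_hour

# Burst: rent 50 nodes for 3 hours at a hypothetical $2.50/node-hour,
# then release them and stop paying.
burst = on_demand_cost(50, 3, 2.5)
print(f"burst cost: ${burst:.2f}")   # burst cost: $375.00
```

On HPC, by contrast, the "budget" is the pre-approved core-time allocation, and you cannot simply pay more to jump the queue.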
Cloud computing may need all kinds of database software and big data services such as Spark, as well as backend services that provide RPC calls. Java, Python, and Go are also popular in this area.
Slurm: all users access the login node and share the same computation pool
module: to configure the environment
spack: to install packages
Simulations may use MPI and RDMA heavily; C/C++ and Fortran are the mainstream languages in this area.
In summary, cloud computing services and HPC share the same genes but have different souls. They are built on the same CPUs and GPUs but serve different users and projects.
Powerful computing: a user may access thousands of cores in a short period of time.
The latest HPC computing nodes are configured with GPUs on every node, which is a luxury for commercial companies.
They update the system maybe once or twice a month, which is fine for scientific projects, since those are mostly offline computation.
Basically, cloud providers need to offer 24/7 availability; high availability is an important issue for them, since even several minutes offline can cause huge losses for customers. There is a tradeoff: if you work with HPC, you may not need to be on call or update your system at midnight, but if you work with cloud computing, you may need to be on call 24/7 to handle any unexpected issues.
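How much do "several minutes offline" matter? A common way to quantify it is the downtime budget implied by an availability target (the "nines"):

```python
# Downtime budget implied by an availability target ("nines").

MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 (ignoring leap years)

def downtime_minutes_per_year(availability):
    """Maximum offline minutes per year allowed by the target."""
    return (1 - availability) * MINUTES_PER_YEAR

for target in (0.999, 0.9999):
    print(f"{target:.4%} -> {downtime_minutes_per_year(target):.1f} min/year")
# 99.9000% -> 525.6 min/year
# 99.9900% -> 52.6 min/year
```

A "four nines" service is allowed less than an hour of downtime per year, which is why cloud teams keep someone on call; a monthly HPC maintenance window alone would blow this budget many times over.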
The boundary is becoming less clear.
Scientists may also rent commercial nodes for their long-running services to satisfy high-quality requirements, and commercial companies may develop specialized HPC clusters for commercial use. The main cloud providers also set up HPC clusters for their users.
In summary, they are like parallel or distributed computing in two different worlds.