HPC vs Cloud Computing

This article compares HPC (high-performance computing) and cloud computing from several angles. It is also the script for an online video.

You may have heard about cloud computing and HPC (high-performance computing) here and there.

Are you curious about the differences between these two technologies?

After several years of hands-on work in both areas, the answer to this question has become clearer to me.

(In this video) we try to explain the differences between cloud computing and HPC from several aspects: the machines and their maintainers, the users and their typical programs, how the systems are accessed, and the typical software stacks.

Both areas are still evolving quickly; here we just try to share some key ideas and essential background.

Generally speaking, what cloud computing and HPC have in common is that you run your program on many machines, perhaps hundreds or thousands of computers. Obviously, one person or a small institution does not have the funds or expertise to build a cluster with that many machines, so these clusters are usually maintained by specialized institutions.

Let’s first look at the institutions that maintain the machines, or computing nodes, used for cloud computing and HPC.

Machines and Their Maintainers

Cloud Computing:

Cloud computing services are mainly provided by commercial companies.

The figure on the left shows the main cloud computing service providers.

You may know these companies very well: Amazon, Microsoft, Google, etc. /ɪt ˈsetərə/

https://cloud.google.com/gartner-cloud-infrastructure-as-a-service

HPC:

In contrast, the figure on the right shows the main providers of HPC systems in the USA. We mainly list the national laboratories under the DOE (Department of Energy). These institutions are well known to people with a scientific computing background. These government-funded institutions are the main force behind building HPC systems and have played an important role in history. For example, Oak Ridge and Los Alamos National Laboratories were two of the sites involved in the Manhattan Project.

The difference in maintainers influences the properties of each platform and the functions it serves. Let’s look at the details of the machines and the services they provide.

Cloud Computing

They usually do not publish details such as the performance metrics of their clusters; instead, they care more about the types of services they offer.

Machines with different configurations are created for users according to their requirements, based on virtualization techniques. The underlying physical machine pool is also heterogeneous, unlike HPC systems, which are generally built from homogeneous machines.

The left figure shows the featured products of the Google Cloud service. In addition to virtual machines, they also provide versatile /ˈvɜːrsətl/ data management and processing capabilities.

HPC:

There is a ranking called the TOP500 list that compares the performance of different HPC systems using various benchmark programs. It is like the Olympic Games of the HPC world, and the ranking changes a lot every year.

The right figure shows the current ranking.

https://www.top500.org/lists/top500/2020/11/

Currently, the number-one system on the list is the Fugaku supercomputer, which is maintained by RIKEN (Japan’s largest comprehensive research institution, renowned for high-quality research across a diverse range of scientific disciplines).

The second one is the Summit system, which is maintained by Oak Ridge National Laboratory.

Actually, most of these systems are maintained by national institutions.

Rmax and Rpeak are two important metrics.

A system’s Rmax score describes its maximal achieved performance (how many floating-point operations it executes per second); the Rpeak score describes its theoretical peak performance (https://kb.iu.edu/d/bbzo).
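To make the Rpeak idea concrete, here is a minimal back-of-the-envelope sketch. All the hardware parameters below (node count, cores per node, clock frequency, FLOPs per cycle) are illustrative assumptions, not the specs of any system on the list.

```python
# Back-of-the-envelope theoretical peak (Rpeak) for a hypothetical cluster.
# Every parameter below is an illustrative assumption, not a real system spec.

nodes = 1000                # number of compute nodes
cores_per_node = 64         # CPU cores per node
clock_ghz = 2.0             # clock frequency in GHz
flops_per_cycle = 16        # e.g. wide SIMD units doing fused multiply-adds

# Rpeak = nodes * cores per node * frequency * FLOPs per cycle
rpeak_gflops = nodes * cores_per_node * clock_ghz * flops_per_cycle
print(f"Theoretical peak: {rpeak_gflops / 1e6:.3f} PFlop/s")  # 2.048 PFlop/s
```

Rmax, in contrast, is whatever fraction of this peak the machine actually achieves on the benchmark, which is always lower in practice.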

This figure shows the national labs in the USA:

https://www.energy.gov/science/science-innovation/office-science-national-laboratories

Another interesting topic is the users of cloud computing and HPC.

Whom They Serve (Users)

Cloud Computing:

Cloud computing mainly serves IT companies. A startup may not want to spend money buying its own machines; instead, it can rent computing nodes from a cloud provider at low cost and in a more flexible way. The cloud provider has a professional team to maintain these services. Anyone can rent or buy machines from a cloud provider as long as they pay.

https://aws.amazon.com/what-is-aws/

The left figure lists some key information about the customers they serve. Companies such as retailers and financial firms may need to rent lots of machines to manage their user information or to provide essential online services, such as login systems, websites, or the electronic transaction systems that support their business.

HPC:

HPC mainly serves domain scientists, who use it to solve numerical problems such as scientific simulations. They build models and run them on HPC systems at large scale with parallel computing; it is common to use thousands of machines and cores for a single simulation. Scientists or research teams at universities collaborate with the research institutions that own the HPC systems; if their research goals overlap, the maintainer will allocate a specific amount of core time to a project. In any case, the goal of HPC is research; these are essentially non-profit services.

The scientific projects that use Summit:

https://www.olcf.ornl.gov/leadership-science/

The right figure shows some key projects running on the Summit supercomputer. You can see that the main areas are biology, physics, fusion, nuclear science, earth science, etc. The style of these projects is quite different from the cloud computing examples illustrated by the figure on the left.

In summary, the core value of cloud computing is to serve customers’ businesses, and different software is built to serve this goal. For example, we need databases to store and index user information, smart analytics to process the data as needed, and backends with associated security services to support websites.
The main value of HPC is to serve domain science; typical projects include scientific simulation, data visualization and analytics, and the associated I/O or high-speed networking services that transfer data between different stages of a workflow.

How to Access It from the User’s Perspective

Cloud Computing:

When we discuss cloud computing services, they are typically divided into different layers.

The first layer is called IaaS, infrastructure as a service: the cloud provider assigns a virtualized computing node to the user. When you rent nodes at the IaaS level, those nodes belong entirely to you, although they may be virtualized. You have root permission on your computing nodes and can configure (and debug) them as needed. The cloud provider only supplies the computing, networking, and storage resources you request, and you pay more if you want a more powerful node.
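As a concrete illustration of the IaaS layer, here is a minimal sketch that launches one virtual machine on AWS EC2 with the boto3 library. The AMI ID, instance type, and SSH key name are placeholder assumptions you would replace with your own values.

```python
# Minimal IaaS sketch: launch one virtual machine on AWS EC2 via boto3.
# The AMI ID, instance type, and key name below are placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

instances = ec2.create_instances(
    ImageId="ami-xxxxxxxx",      # placeholder machine image ID
    InstanceType="t2.micro",     # small, cheap node; pay more for bigger ones
    KeyName="my-ssh-key",        # placeholder SSH key pair name
    MinCount=1,
    MaxCount=1,
)
print("Launched instance:", instances[0].id)
```

Once the instance is up, you can SSH in with root-level access and configure it however you like, which is exactly the IaaS contract described above.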

The second layer is called PaaS, platform as a service. In this layer, the user just provides a configuration file that describes how to run their executables, and the platform takes charge of resource scheduling and keeping the program highly available.

The next layer is called SaaS, software as a service. In this case, the user just calls an API to interact with a service offered by the cloud provider, such as a storage service or a data monitoring service.
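At the SaaS end of the spectrum, the interaction shrinks to plain API calls. Here is a minimal sketch against an object storage service, using AWS S3 via boto3 as one example; the bucket name is a placeholder and the bucket is assumed to already exist.

```python
# Minimal SaaS-style sketch: store and fetch an object through a storage API.
# The bucket name is a placeholder; the bucket must already exist.
import boto3

s3 = boto3.client("s3")
s3.put_object(Bucket="my-example-bucket", Key="hello.txt", Body=b"hello cloud")

obj = s3.get_object(Bucket="my-example-bucket", Key="hello.txt")
print(obj["Body"].read())  # b'hello cloud'
```

Notice that no machine is visible at all here: the user never knows, or cares, which node stores the data.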

HPC:

A common HPC system contains login nodes and compute nodes; if you get an account on a particular HPC system, you can log in to a login node. Every user has the same view when they access the system, and theoretically they can use all the nodes on the machine (there are constraints on total core time, and different partitions may have different running-time limits). It is a bit like a membership: once you are given an account, you share the high-quality services the machine provides, such as large disk space and high-speed nodes.

You need to submit jobs to a scheduler queue, and the scheduler is responsible for assigning nodes to every user. HPC systems generally do not use virtualization techniques, since the computing power is abundant, even luxurious, for most users. You do not have root permission; if you need to install software that requires root privileges, you can send a ticket to the maintainers for help, or consult them about any usage issues. You are charged (in core time) only while your job is scheduled to run.
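To make the job-submission workflow concrete, here is a minimal sketch that writes a Slurm batch script and hands it to the sbatch command from Python. The partition name, account, node count, and executable are placeholder assumptions; every real system defines its own names and limits.

```python
# Minimal sketch of HPC job submission: write a Slurm batch script and
# submit it with sbatch. Partition, account, and executable are placeholders.
import subprocess

job_script = """#!/bin/bash
#SBATCH --job-name=demo-sim
#SBATCH --partition=batch        # placeholder partition name
#SBATCH --account=PROJ123        # placeholder project; core time is charged here
#SBATCH --nodes=4
#SBATCH --time=00:30:00          # wall-time limit enforced by the scheduler

srun ./my_simulation             # placeholder executable
"""

with open("job.sbatch", "w") as f:
    f.write(job_script)

# The scheduler queues the job; core time is charged only while it runs.
subprocess.run(["sbatch", "job.sbatch"], check=True)
```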

For example, if you check a Summit node, it is really luxurious and powerful:
https://docs.olcf.ornl.gov/systems/summit_user_guide.html#summit-user-guide

More parameters are listed on this page:
https://www.olcf.ornl.gov/summit/

Core time is tied to your project. When the PI starts a project, the core time is usually fixed, and when the project finishes, you cannot access the HPC resources anymore. That is why researchers constantly need to write proposals and apply for new projects; only in this way can they get available computing resources and funding from the government.

Unlike the 24/7 high availability of cloud computing services, most HPC systems have periodical /ˌpɪriˈɑːdɪkl/ maintenance, and you cannot access the HPC system while it is offline. I think that is definitely a good excuse to take a rest sometimes.

Software stacks

Let’s quickly go through some typical software used in these two areas.

Cloud Computing:

IaaS: OpenStack, KVM, virtualization
PaaS: Kubernetes, containers, Docker (see the sketch after this list)
SaaS: more versatile customized cloud services, such as the Lambda service
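As a small taste of the PaaS layer, here is a minimal sketch using the official Kubernetes Python client to list the pods running in a cluster. It assumes a valid kubeconfig and the kubernetes Python package, and is only an illustration, not tied to any particular provider.

```python
# Minimal PaaS-layer sketch: query a Kubernetes cluster for running pods.
# Assumes a valid kubeconfig (~/.kube/config) and `pip install kubernetes`.
from kubernetes import client, config

config.load_kube_config()            # read cluster credentials from kubeconfig
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)
```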

The cloud has more flexible charging and scheduling policies; for example, you can get more resources on demand when you need them. This is hard to implement on HPC. HPC uses plan-driven scheduling: you plan everything when you submit a job, and you may need to live with the limitations of the queue scheduler, such as long waiting times, since the HPC resources are shared by many people at the same time.

Cloud computing may need all kinds of database software and big data services such as Spark, or backend services that expose RPC calls. Java, Python, and Go are also popular in this area.
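For a flavor of such big data services, here is a minimal word-count sketch with PySpark. It assumes only a local Spark installation (pip install pyspark) and is not tied to any particular cloud offering.

```python
# Minimal big-data sketch: word count with PySpark on a local Spark session.
# Assumes `pip install pyspark`; no cluster or cloud account required.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-demo").getOrCreate()

lines = spark.sparkContext.parallelize(["hpc vs cloud", "cloud cloud hpc"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())  # e.g. [('hpc', 2), ('cloud', 3), ('vs', 1)]
spark.stop()
```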

HPC:

Slurm: all users access the login node and share the same pool of compute nodes
module: to configure the software environment
Spack: to install packages

Simulations may use MPI and RDMA heavily; C/C++ and Fortran are the mainstream languages in this area.
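Although C/C++ and Fortran dominate, the MPI programming model is easy to sketch from Python through the mpi4py bindings. This minimal example assumes mpi4py is installed and that the script is launched with mpirun.

```python
# Minimal MPI sketch via mpi4py: each rank reports itself, then rank 0
# sums a value contributed by every rank with a reduction.
# Run with: mpirun -n 4 python mpi_demo.py  (assumes `pip install mpi4py`)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

print(f"Hello from rank {rank} of {size}")

# Every rank contributes its rank number; rank 0 receives the total.
total = comm.reduce(rank, op=MPI.SUM, root=0)
if rank == 0:
    print("Sum of ranks:", total)  # 0+1+2+3 = 6 for 4 ranks
```

A real simulation would follow the same pattern (decompose the problem across ranks, exchange boundary data, reduce the results), just in C/C++ or Fortran and at far larger scale.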

In summary, cloud computing services and HPC share the same genes but have different souls. They are built on the same CPUs and GPUs but serve different users and projects.

Other Requirements

HPC:
Powerful computing: a user may access thousands of cores within a short period of time.

The latest HPC systems have GPUs configured on every compute node, which would be a luxury for a commercial company.

The system may be updated once or twice a month; this is acceptable for scientific projects, which are mostly offline computation.

Cloud Computing:

Basically, they need to be available 24/7; high availability is an important issue for a cloud provider, and even a few minutes offline can cause huge losses for customers. There is a tradeoff: if you work on HPC, you may not need to be on call or update the system at midnight, but if you work in cloud computing, you may need to be on call 24/7 to handle any unexpected issues.

New Trends

The boundary is becoming less clear.

Scientists may rent commercial nodes for their long-running services to satisfy high-availability requirements. Commercial companies may develop specialized HPC clusters for commercial use, and the main cloud providers have also set up HPC clusters for their users.

In summary, they are like parallel or distributed computing in two different worlds.

Types of HPC or Cloud

The following comes from a talk I attended. The talk roughly combined HPC and cloud and divided computing-power services into four markets, which I found somewhat inspiring.

Cutting-edge supercomputing: excellent compute, memory access, and I/O, used to run important simulation projects, for example at national supercomputing centers. The users are top research institutions and high-end HPC practitioners carrying out critical research tasks, with applications that need more than ten thousand cores. These supercomputers actually fail quite often; for example, the Llama 3 training run was interrupted every few hours.

General-purpose supercomputing: applications below ten thousand cores, the vast majority below one thousand cores, running on independently built small and medium-sized supercomputing systems, for example super cloud computing centers.

AI supercomputing: dominated by GPU computing power, with applications ranging from a single card to tens of thousands of cards. The investment in computing power is large and self-built systems are rare; users mostly rent AI computing resources. This covers the supercomputing-center model for large-model training, the cloud computing model for inference and similar needs, and the analysis of application runtime characteristics, for example typical AI computing centers.

Business supercomputing: applications from a single core to a few thousand cores, serving real industry workloads with a focus on service quality and cost-effectiveness. The business needs to move fully to the cloud, its stability and reliability must be guaranteed, and every stage of the workflow must run quickly, efficiently, and dynamically. Typical requirements include elasticity, high performance, high stability, high reliability, and high maintainability. This is usually a public cloud or supercomputing cloud plus specialized computing-power service providers.

Other references

EC2 type
https://aws.amazon.com/ec2/physicalcores/

Google solutions
https://cloud.google.com/compute#section-9

Google products
https://cloud.google.com/products

Recommended articles