HPC in the cloud

Cloud Play

Article from Issue 154/2013

Author(s): Jeff Layton

Cloud computing is most definitely here, but does it have a role in HPC? We discuss changes in HPC that could be solved effectively by cloud computing.

I'm not a big Bob Dylan fan, but when it comes to HPC, "The Times They Are a-Changin'." Some of this change is going to the cloud, not because it would be really cool, but rather because HPC has existing workloads that fit well into cloud computing and can save money over traditional solutions. Perhaps more importantly, I think HPC has evolved to include non-traditional workloads and is adapting to meet those workloads – in many cases using cloud computing to do so. I will explain by giving two examples.

1. Massively Concurrent Runs

At one HPC center that I'm familiar with, users periodically submit 25,000 to 30,000 jobs as part of a parameter sweep; that is, they run the same application with 25,000 to 30,000 different data sets. Many times the application is a Matlab script, a Python script, an R script, a Perl script, or something particularly serial (i.e., it runs well on a single core). The same script is run but with thousands of different input files, resulting in the need to run thousands of jobs at the same time. Many times these applications run fairly quickly – perhaps in a couple of minutes – and many times they do not produce a great deal of data.

A closely related set of researchers study operating systems and security, during which they run different simulations with different inputs. For example, they might run 20,000 instances of an OS, primarily a kernel, and explore exploits against that OS. As with the previous set of researchers, the goal is to run a huge number of simulations as quickly as possible to find new ideas about how to protect an OS and kernel. The run times are not very long, but they must run the tests against a single OS. Consequently, they run thousands of jobs at the same time and then look through the results and continue with their research.

What is important to both sets of researchers is to have all of the jobs run at nearly the same time so they can examine the results and either focus on a small subset of the data and run more granular input data sets or try yet more data sets (broaden the search space). Additionally, these users want to broaden the search space so they can get either more detail or examine more options. The result is the need for more cores. This same HPC center has users asking for 50,000 and 100,000 cores to run their applications. The coin of the realm for these researchers is core count and not per-core performance.

Another interesting aspect of these researchers is that they don't run these massive job sets all of the time. They create the input data sets and create the job arrays and then run the job array. Once the jobs are done, however, it takes time to process the output to understand it and to determine the next step. What is important to these researchers is to have all of the results before doing this post-processing. If this doesn't happen, the researcher has to wait days for the jobs to finish before post-processing can take place.

Getting more efficiency from the hardware is not the problem because faster hardware will only improve the research time a little bit. Reducing the run time from 120 seconds to 100 seconds wouldn't really improve their research productivity. What improves their productivity is to have all of the jobs run at the same time.

I originally thought this scenario was confined to my experience with a particular HPC center, but I was wrong. I've spoken to several people, and they all have similar workload characteristics with varying sizes (several hundred to 50,000 cores). Although this might not describe your particular workload, a number of centers fit this scenario, and this number is growing rapidly.

2. Web Services

Another popular scenario in HPC centers that I've seen is the increasing need for hosting servers for classes or training, for websites (internal and external), and for other general research-related computing in which the applications are not parallel or might not even be "scientific." I heard one person refer to this as "Ash and Trash computing," probably because it refers to running non-traditional HPC workloads; however, it's becoming fairly common.

Consider an HPC center with training courses or classes that need access to a number of systems. A simple example is a class in parallel computing with 30 students. The students might need many cores per person for their course, and they wouldn't be pushing the performance of the systems; however the data center will need a number of systems for the class. If they need 20 cores per student, that's 600 cores just for a single course.

The need for dedicated web servers for research is also increasing. The websites they host go beyond the classic personal websites. Researchers want, and need, to put their research on a website that allows them to share results, interact with other researchers, and show their research. An increasing number of web-based research tools are available, such as nanoHUB [1] and Galaxy [2]. I know of one HPC center that has close to 20 Galaxy servers, each tuned to a specific research project. HPC centers are discovering that it makes much more sense to handle these non-traditional workloads themselves. The reasons are varied, but in general, HPC centers understand research better than the departments that worry about mail servers, databases, and ERP applications. These enterprise computing functions are critical to the overall center, but research and HPC require a different kind of service. Moreover, HPC centers can react much more rapidly to requests than the enterprise IT department.

Time for a Change

HPC is being asked to adapt to new roles when it comes to the needs of researchers. These needs include applications that require a tremendous number of cores but not a great deal of performance, as well as applications such as web servers, classroom and training support, and web-based applications and tools that are not traditional HPC applications. These workloads fit into the HPC world much better than they fit into the enterprise world.

These changes are everywhere. They may not be a large force, and they might not be as pervasive in your particular HPC center, but they are happening and they are growing more rapidly than traditional workloads. Consequently, I've started to refer to this new generation of computing as Research Computing. If you like, research computing is a superset of traditional HPC, or traditional HPC is a subset of research computing. I also like to think of research computing as adding components, techniques, and technology for solving problems that traditional HPC cannot or might not solve. One of these technologies is cloud computing.

1 2 Next »

Buy this article as PDF

Express-Checkout as PDF

Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES

Print Issues

Digital Issues

SUBSCRIPTIONS

Print Subs

Digisubs

TABLET & SMARTPHONE APPS

US / Canada

UK / Australia

Support Our Work

Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

News

NVIDIA Released Driver for Upcoming NVIDIA 560 GPU for Linux

Community , Kernel

Not only has NVIDIA released the driver for its upcoming CPU series, it's the first release that defaults to using open-source GPU kernel modules.
OpenMandriva Lx 24.07 Released

If you’re into rolling release Linux distributions, OpenMandriva ROME has a new snapshot with a new kernel.
Kernel 6.10 Available for General Usage

graphics , Linux , Security

Linus Torvalds has released the 6.10 kernel and it includes significant performance increases for Intel Core hybrid systems and more.
TUXEDO Computers Releases InfinityBook Pro 14 Gen9 Laptop

Hardware , laptop , Linux

Sporting either AMD or Intel CPUs, the TUXEDO InfinityBook Pro 14 is an extremely compact, lightweight, sturdy powerhouse.
Google Extends Support for Linux Kernels Used for Android

Android , Linux , Security

Because the LTS Linux kernel releases are so important to Android, Google has decided to extend the support period beyond that offered by the kernel development team.
Linux Mint 22 Stable Delayed

Linux mint , Operating Systems , Security

If you're anxious about getting your hands on the stable release of Linux Mint 22, it looks as if you're going to have to wait a bit longer.
Nitrux 3.5.1 Available for Install

Linux , Nitrux , open source

The latest version of the immutable, systemd-free distribution includes an updated kernel and NVIDIA driver.
Debian 12.6 Released with Plenty of Bug Fixes and Updates

DEBIAN , Linux , Security

The sixth update to Debian "Bookworm" is all about security mitigations and making adjustments for some "serious problems."
Canonical Offers 12-Year LTS for Open Source Docker Images

Docker , images , open source , Ubuntu

Canonical is expanding its LTS offering to reach beyond the DEB packages with a new distro-less Docker image.
Plasma Desktop 6.1 Released with Several Enhancements

KDE , open source , Plasma

If you're a fan of Plasma Desktop, you should be excited about this new point release.

HPC in the cloud