Unlocking HPC Cloud: Migration & Cost Challenges

September 1, 2022

Introduction

Modern scientific research depends heavily on processing massive amounts of data which requires elastic, scalable, easy-to-use, and cost-effective computing resources. AWS Cloud provides such resources, but researchers still find it hard to navigate the AWS console. RLCatalyst Research Gateway simplifies access to HPC clusters using a self-service portal that takes care of all the nuts and bolts to provision an elastic cluster based on AWS ParallelCluster 3.0 within minutes. Researchers can leverage this for their scientific computing.

Relevance Lab has been collaborating with AWS Partnership teams over the last one year to simplify access to High Performance Computing across different fields like Genomics Analysis, Computational Fluid Dynamics, Molecular Biology, Earth Sciences, etc.

There is a growing need from customers to adopt the High Performance Computing capabilities in the public cloud. However this throws in key challenges related to right architecture, workload migration and cost management. Working closely with AWS HPC groups we have been enabling adoption of AWS HPC solutions with early adopters in Genomics and Fluid Dynamics with Higher Education and Healthcare customers. The primary ask is for a self-service Portal for planning, deploying and managing HPC workloads with security, cost management and automation. The figure below shows the key building blocks of HPC Architecture part of our solution.

AWS ParallelCluster 3.0

AWS ParallelCluster is an open source cluster management tool written using Python and is available via the standard python package index (PyPI). Version 3.0 also provides support for APIs and Research Gateway leverages this to integrate with the AWS Cloud to set up and use the HPC cluster for complex computational tasks. AWS ParallelCluster supports two different orchestrators, AWS Batch and Slurm, which cover a vast majority of the requirements in the field. ParallelCluster brings many benefits including easy scalability, manageability of clusters, and seamless migration to the cloud from on-premise HPC workloads.

FSx for Lustre

Amazon FSx for Lustre provides fully managed shared storage with the scalability and performance of the popular Lustre file system. This storage can be accessed with very low (sub-millisecond) latencies by the worker nodes in the HPC cluster and provides very high throughput.

NICE DCV

NICE DCV is a high performance remote display protocol used to deliver remote desktops and application streaming from resources in the cloud to any device. Users can leverage this for their visualization requirements.

Research Gateway Provides a Self-Service Portal for AWS PCluster 3.0 Launch with Automatic Cost Tracking

Using RLCatalyst Research Gateway, research teams are organized into projects with their own catalog of self-service workspaces that researchers can provision easily with minimum knowledge of AWS cloud setup. The standard catalog, included with RLCatalyst Research Gateway, now has a new item called PCluster which a Principal Investigator can add to the project catalog to make it available to their team. This product is based on AWS ParallelCluster 3.0 which is a command line tool that advanced users can work with. Research Gateway has wrapped this tool with an intuitive user interface.

To see how you can set up an HPC cluster within minutes, check this video.

The figure below shows a standard catalog inside Research Gateway for users to provision PCluster and FSx for Lustre with ease.

Setting Up a Shared Cluster for Use in the Project

The PCluster product on Research Gateway offers a lot of flexibility. While researchers can set up and use their own clusters, sometimes there is a need to use a shared cluster across collaborators within the same project. Towards this goal, we have also brought in a feature that allows a user to “share” the cluster with the entire project team. The other users can then connect to the same cluster and submit jobs. For example a Principal Investigator might set up the cluster and share it with the researchers in the project to use for their computations.

Large Datasets Storage and Access to Open Datasets

AWS cloud is leveraged to deal with the needs of large datasets for storage, processing, and analytics using the following key products.

Amazon S3 for high-throughput data ingestion, cost-effective storage options, secure access, and efficient searching.

AWS Datasync for secure, online service that automates and accelerates moving data between on-premises and AWS storage services.

AWS Open Datasets program houses openly available, with 200+ open data repositories.

Cost Analysis of Jobs

Research Gateway injects cost allocation tags into the ParallelCluster so that all resources created are tagged and the cost of the scalable cluster can easily be monitored from the Research Gateway UI.

Summary

AWS Cloud provides services like AWS ParallelCluster and FSx for Lustre that can help users with High Performance Computing for their scientific computing needs. Research Gateway makes it easy to provision these services with a 1-Click, self-service model and provides cost and governance to help manage your budget.

Tags
cloud
HPC
AWS solutions
Amazon