How Can You Optimize an HPC Cluster for Cancer Drug Discovery Workloads?


You can optimize an HPC cluster for cancer drug discovery workloads by designing hybrid and cloud-integrated architectures that ensure consistent GPU availability, using intelligent cluster management with workload managers like Slurm, right-sizing and right-typing compute resources, applying tiered storage strategies, and automating cost-aware scaling to balance performance and efficiency.

Executive Summary

A biotechnology company focused on cancer drug discovery partnered with PTP to modernize its High-Performance Computing (HPC) infrastructure. Facing fragmented environments, limited access to GPUs, and rising cloud costs, the company required a more scalable, efficient, and cost-effective platform to power its CRISPR-based and computational experiments.

Problem Statement

The biotech company engaged PTP to refine and enhance its existing HPC environments. The company was running several separate HPC environments, which was far from ideal: one instance ran on a single laptop, and others had been built by a member of the Computational Team who was no longer with the company. These environments were proving difficult to access and maintain.

In order to assess the current environment, PTP’s CloudOps Engineers:

  • Documented and analyzed applications in use and applications the company wanted to use in the future.
  • Evaluated the existing AWS account structure and VPN access to research-critical data sets.
  • Documented HPC account access and identified security gaps.
  • Gathered requirements for efficient data flow between Contract Research Organization (CRO) instruments and the company’s AWS S3 environment.

Solution Overview

With consistent Graphics Processing Unit (GPU) availability, cost optimization, and acceptable performance as the end goals, PTP set out to build a hybrid HPC cluster that combines physical infrastructure in a new datacenter with an integrated HPC cluster in AWS.

As a starting point, PTP proposed an AWS best-practice architecture with Control Tower as the foundation. Control Tower establishes a multi-account structure with governance and security built in throughout. Following discussions about cost optimization, PTP ran numerous scenarios through cost calculators, compared instance sizes, and tested the results to find the highest return on the company’s cloud investment.

Hybrid HPC Clusters

Because the company wanted committed GPUs in its own datacenter, PTP helped design and build the high-speed connection between a remote colocation datacenter and AWS.

HPC Clusters in Multiple Availability Zones

The biotech company uses Slurm (originally the Simple Linux Utility for Resource Management) for cluster management and job scheduling. With PTP’s expertise, the company runs Slurm in AWS to manage jobs across multiple Availability Zones and Regions using both Spot and On-Demand instances. This optimizes AWS costs by spinning resources up and down intelligently throughout the day and by increasing the company’s ability to access Spot capacity.
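As a rough illustration of routing jobs between Spot and On-Demand capacity, the sketch below builds a Slurm submission command in Python. The flags `--partition`, `--gres`, and `--requeue` are standard `sbatch` options, but the partition names `gpu-spot` and `gpu-ondemand` and the helper `build_sbatch_command` are hypothetical and are not taken from the company's actual configuration.

```python
def build_sbatch_command(script, gpus, interruptible):
    """Build an sbatch command line as a list of arguments.

    Partition names are hypothetical placeholders: jobs that tolerate
    interruption go to a Spot-backed partition, all others to On-Demand.
    """
    partition = "gpu-spot" if interruptible else "gpu-ondemand"
    cmd = ["sbatch", f"--partition={partition}", f"--gres=gpu:{gpus}"]
    if interruptible:
        # Mark the job as eligible for requeueing if its node is lost.
        cmd.append("--requeue")
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch needs no cluster.
    print(" ".join(build_sbatch_command("md_run.sh", gpus=1, interruptible=True)))
```

Keeping the routing decision in one small function means the rule (interruptible work goes to Spot) can be changed in one place as pricing or capacity shifts.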

Cluster Maintenance

PTP converted legacy Slurm clusters, which had been deployed via AWS ParallelCluster but were not actively maintained, and centralized their configurations to allow for efficient upgrades, changes, and ongoing administration. PTP then re-templated the clusters from ParallelCluster V2 to V3.

Cluster Expansion

Part of the effort to secure GPU availability while optimizing AWS costs was expanding the clusters and adding decision logic to determine which jobs should go to which resources. PTP helped define the server types for each kind of job and conducted right-typing and right-sizing. Right-typing analyzes and benchmarks GPU versus non-GPU instances for cost and performance. Right-sizing analyzes and benchmarks different GPU types for cost and performance. AWS Batch was used to quickly perform unit testing across different software versions and server hardware without the need for a dedicated cluster.
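The comparison behind right-typing and right-sizing can be sketched as a cost-per-job calculation: hourly price multiplied by measured runtime, with a runtime ceiling as the performance constraint. This is a minimal sketch, not PTP's actual tooling; the instance labels, prices, and runtimes are placeholder values, and the names `Benchmark`, `cost_per_job`, and `cheapest_within_deadline` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    """One benchmark run of the same job on one candidate instance type."""
    instance_type: str       # hypothetical label, not a recommendation
    hourly_price_usd: float  # placeholder price, not a real AWS rate
    runtime_hours: float     # measured wall-clock time for the job


def cost_per_job(b):
    """Cost of one job: hourly price multiplied by measured runtime."""
    return b.hourly_price_usd * b.runtime_hours


def cheapest_within_deadline(benchmarks, max_runtime_hours):
    """Return the lowest-cost candidate that meets the runtime limit, or None."""
    eligible = [b for b in benchmarks if b.runtime_hours <= max_runtime_hours]
    return min(eligible, key=cost_per_job, default=None)


# Placeholder numbers for illustration only.
candidates = [
    Benchmark("cpu-large", 2.00, 10.0),
    Benchmark("gpu-small", 4.00, 3.0),
    Benchmark("gpu-large", 12.00, 1.5),
]

if __name__ == "__main__":
    best = cheapest_within_deadline(candidates, max_runtime_hours=12)
    print(best.instance_type, cost_per_job(best))
```

With these placeholder figures, the cheapest hourly rate is not the cheapest per job, which is the reason for benchmarking rather than comparing list prices alone.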

Figure: AWS HPC Environment. Architecture diagram of PTP's optimized HPC environment for the biotech company, including EC2, ParallelCluster, FSx, and S3 across multiple Availability Zones.

Storage Optimization

The company’s existing clusters had been using only Amazon FSx for Lustre because of its high-performance file system capabilities. PTP worked with the company to implement newer data compression methods on FSx to decrease the total amount of data stored, and assisted with configuring and using both S3 and EFS for jobs that did not require FSx-level performance. Storage tiering is critical to cost containment, and the PTP team suggested ways to keep data on the right tier at the right time at the lowest cost.
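One way to picture the tiering decision is as a simple rule that maps a job's storage needs to a tier. The sketch below is an illustrative assumption rather than the rule PTP applied: the function `choose_storage_tier` and its two yes/no inputs are hypothetical, and a real policy would likely weigh data size, access frequency, and cost as well.

```python
def choose_storage_tier(needs_high_throughput, needs_shared_filesystem):
    """Pick a storage tier for a job from two coarse requirements.

    Hypothetical rule: reserve FSx for Lustre for jobs that need its
    performance, use EFS when a shared file system is enough, and
    default to S3 object storage otherwise.
    """
    if needs_high_throughput:
        return "FSx for Lustre"
    if needs_shared_filesystem:
        return "EFS"
    return "S3"


if __name__ == "__main__":
    # Example: an archival job with no file system requirement.
    print(choose_storage_tier(needs_high_throughput=False,
                              needs_shared_filesystem=False))
```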

Image Builds

PTP assisted with converting all of the company’s software tools into machine images and rebuilding AMIs, including GROMACS for high-performance molecular dynamics and output analysis. PTP then built specific versions of these images to benchmark their performance and cost against one another for further optimization.

Schrödinger Software

The company uses Schrödinger as a computational platform for predicting molecular behavior. PTP configured Slurm so that the HPC cluster’s compute nodes can check out licenses from Schrödinger via Nginx port forwarding.

Scripting

PTP worked with the biotech company to develop scripts that made it practical to run jobs on Spot Instances by enabling checkpointing and status checking. This allowed the company to achieve greater cost efficiency while retaining access to high-end GPU and CPU server environments.
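The checkpointing idea can be sketched in a few lines of Python: save progress after each unit of work, and on restart resume from the last saved step rather than from zero. This is a minimal sketch, not the scripts PTP developed; the function names, the JSON state layout, and the `stop_after` parameter (which simulates a Spot interruption) are all hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def run_job(path, n_steps, stop_after=None):
    """Run steps from the last checkpoint; stop_after simulates an interruption.

    Returns (state, finished).
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return state, False  # interrupted before finishing
        state["total"] += state["next_step"]  # stand-in for real work
        state["next_step"] += 1
        save_checkpoint(path, state)
        done_this_run += 1
    return state, True
```

Calling `run_job` a second time with the same checkpoint path picks up where the interrupted run stopped, so an interruption costs only the step in progress.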

AWS services implemented as part of the solution:

CodeCommit, CloudFormation, CloudWatch, CloudTrail, Lambda, AWS Config, Auto Scaling, IAM, DynamoDB, Route 53, VPC, S3, ParallelCluster, EC2, ECS, and WorkSpaces.

The Outcome

The primary objectives of the PTP engagement were to:

  • Improve overall manageability of HPC clusters
  • Determine how consistently GPU instances can be accessed in AWS
  • Address challenges with system access
  • Reduce administration time
  • Streamline workflow run time
  • Contain and lower AWS costs, particularly for GPU and FSx

Through PTP’s CloudOps Engineering services, the company gained a trusted strategic partner and an extension of its data science team, able to respond rapidly to evolving design, architecture, and cloud management needs and keep pace with cutting-edge cancer research.


Explore PTP’s CloudOps Services on AWS Marketplace

Accelerate your HPC performance, optimize costs, and scale securely with expert cloud management. Visit our AWS Marketplace listing to get started.

Unlock the Full Potential of Your HPC Environment

Partner with PTP to optimize performance, reduce cloud costs, and accelerate your research. Contact us today to learn how we can speed up your workflows and enhance productivity.
