Problem: Is your pipeline inefficient, slow, or prone to crashing?
As a computational biologist, you work on bioinformatics workflows that draw on different sequencers (Illumina, PacBio, 10x Genomics, or Vizgen, to name a few) and on the vast array of spectrometry and imaging instruments available, and all of that data is processed by your various pipelines.
You may have a pipeline that works, that works most of the time, or that used to work for you even though it took hours, if not days, to run. You’re wondering whether it could be faster or more reliable, or how easy it would be to automate. (Data generation / Storage of Data)
This often comes up when you are generating more genomics or bulk RNA-seq data. FASTQ, H5AD, and VCF files, for example, may be growing at a rapid pace because of a new instrument or research initiative, and you need faster, more efficient pipelines to keep up. Typically you’re working toward your next round of funding or clinical trials, and validating your research is the only way to keep your company moving forward.
AWS offers ways to split your pipeline into parallel processes and run them simultaneously, which data science folks may not be familiar with.
New AWS features
Certain pipelines, whether home-grown or built with existing tools, can be optimized by orchestrating them with Nextflow or Airflow and using AWS Batch to run multiple jobs that process these files at the same time.
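To make the "multiple jobs at the same time" idea concrete, here is a minimal sketch of one way to fan out work with an AWS Batch array job. It assumes a job queue and job definition already exist, and that a manifest file in S3 lists one input per line; the job name, the MANIFEST_URI variable, and the function names are hypothetical and chosen only for illustration. Each child job would read the AWS_BATCH_JOB_ARRAY_INDEX environment variable that Batch sets and pick its own line from the manifest.

```python
from typing import Any, Dict, List

MAX_ARRAY_SIZE = 10_000  # AWS Batch array jobs allow between 2 and 10,000 children


def build_array_job_request(
    sample_uris: List[str],
    job_queue: str,
    job_definition: str,
    manifest_uri: str,
) -> Dict[str, Any]:
    """Build keyword arguments for the AWS Batch SubmitJob call.

    One child job is requested per input file. The names used here
    (job name, MANIFEST_URI) are illustrative assumptions.
    """
    if not sample_uris:
        raise ValueError("need at least one input file")
    if len(sample_uris) > MAX_ARRAY_SIZE:
        raise ValueError("split the inputs across several array jobs")

    request: Dict[str, Any] = {
        "jobName": "fastq-fanout",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "environment": [{"name": "MANIFEST_URI", "value": manifest_uri}]
        },
    }
    # A single input is submitted as a plain job, since array size must be >= 2.
    if len(sample_uris) >= 2:
        request["arrayProperties"] = {"size": len(sample_uris)}
    return request


def submit(request: Dict[str, Any]) -> str:
    """Send the request to AWS Batch and return the job ID."""
    import boto3  # imported lazily so the builder above works without AWS access

    return boto3.client("batch").submit_job(**request)["jobId"]
```

Keeping the request builder separate from the call that talks to AWS means you can check the fan-out logic on a laptop before spending anything on compute.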
The end result uses optimized AWS compute instances that scale automatically, with jobs running autonomously from the moment an instrument generates data until that data is stored in its ‘converted’ format in S3.
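One common way to get that hands-off behavior is to let an S3 upload notification trigger a small Lambda function that submits the conversion job. The sketch below is only an illustration under stated assumptions: the file suffixes, the "convert-" job name prefix, and the INPUT_URI, JOB_QUEUE, and JOB_DEFINITION variables are hypothetical, and your own pipeline may wire this up differently (for example through Nextflow or Airflow).

```python
import os
import re
from typing import Any, Dict, List
from urllib.parse import unquote_plus

# Assumption: which uploads count as raw instrument output worth converting.
RAW_SUFFIXES = (".fastq.gz", ".fastq", ".bam")


def jobs_from_s3_event(
    event: Dict[str, Any], job_queue: str, job_definition: str
) -> List[Dict[str, Any]]:
    """Turn an S3 event notification into AWS Batch SubmitJob requests."""
    requests = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if not key.endswith(RAW_SUFFIXES):
            continue  # ignore files the pipeline does not process
        # Batch job names allow letters, digits, hyphens, and underscores (max 128).
        job_name = ("convert-" + re.sub(r"[^A-Za-z0-9_-]", "-", key))[:128]
        requests.append(
            {
                "jobName": job_name,
                "jobQueue": job_queue,
                "jobDefinition": job_definition,
                "containerOverrides": {
                    "environment": [
                        {"name": "INPUT_URI", "value": f"s3://{bucket}/{key}"}
                    ]
                },
            }
        )
    return requests


def handler(event: Dict[str, Any], context: Any) -> List[str]:
    """Lambda entry point: submit one Batch job per newly uploaded raw file."""
    import boto3  # imported lazily so the parsing logic can be tested offline

    batch = boto3.client("batch")
    requests = jobs_from_s3_event(
        event, os.environ["JOB_QUEUE"], os.environ["JOB_DEFINITION"]
    )
    return [batch.submit_job(**request)["jobId"] for request in requests]
```

With something like this in place, nobody has to remember to start a run after the sequencer finishes; the upload itself is the trigger.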
This isn’t only for human-readable data that you’d examine in visualizations. It also applies to JSON files, for example ones that pair gene expression levels with gene labels, which will ultimately feed AI/ML models in SageMaker.
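As a rough illustration of what such a JSON file might look like, the sketch below writes one JSON record per sample, pairing each gene label with its expression level. The record layout, the field names, and the function name are assumptions made for this example; the format your model actually needs depends on how you train it.

```python
import json
from typing import Dict, List


def expression_json_lines(
    gene_labels: List[str], samples: Dict[str, List[float]]
) -> str:
    """Serialize per-sample expression values as JSON Lines.

    `samples` maps a sample ID to expression values listed in the same
    order as `gene_labels`. The layout is hypothetical, not a standard.
    """
    lines = []
    for sample_id, values in samples.items():
        if len(values) != len(gene_labels):
            raise ValueError(f"{sample_id}: expected {len(gene_labels)} values")
        record = {
            "sample_id": sample_id,
            # Pair each gene label with its expression level.
            "expression": dict(zip(gene_labels, values)),
        }
        lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(lines)


if __name__ == "__main__":
    print(expression_json_lines(["BRCA1", "TP53"], {"S1": [12.5, 3.0]}))
```

Checking that every sample has a value for every gene label before writing the file catches misaligned matrices early, which is much cheaper than discovering them after a training run.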
I’ve seen companies reduce processing time by over 70 percent and, counterintuitively, this may also save your company a lot of money on cloud expenses.