1.1 HPC for bioinformatics workflows
Learning objectives
- Define high performance computing (HPC) and describe its main components
- Explain when bioinformatics workflows require HPC resources rather than local execution
- Summarise the main constraints imposed by HPC environments
1.1.1 What is an HPC?
High Performance Computing (HPC) systems are large clusters of computers with lots of CPUs, memory (RAM) and storage space. They are specially built to run large numbers of computational jobs efficiently and concurrently. Bioinformatics analysis often involves many steps, many tools, many samples, and large datasets, which can quickly overwhelm an average laptop or desktop computer. When that happens, an HPC can be the perfect way to scale up your workflows and run them efficiently and quickly. However, HPCs expect work to be submitted in a particular way, following specific rules. This means our workflows often need to be designed for HPC, not just moved to HPC.
1.1.2 When does a workflow need HPC?
In bioinformatics, a workflow is simply a defined series of steps that take data as input and transform it into processed data and/or analytical results. This is true whether you are doing whole genome variant calling, proteomics quantification, single-cell transcriptomics, or metagenomics assembly. Each step in the pipeline performs one job, and each job requires some amount of computation and storage.

Signs your workflow is ready for HPC
Not every workflow needs a supercomputer. Many analyses start on a laptop and stay there, especially during method development, when testing small datasets, or when turnaround is more important than throughput. HPC becomes necessary when your workflow starts to hit practical limits of time, memory, storage, reliability, or governance.
A workflow is usually ready for HPC when scale becomes a problem. This might be scale in data size (more gigabytes than your laptop can hold), compute time (weeks of serial runs), memory usage (jobs crash due to insufficient RAM), or workflow complexity (tens of jobs become too painful to run manually).
| Challenge | Example scenario |
|---|---|
| Runtime is too long | A single sample takes >12 hours to process |
| Data size is too big | Multiple large FASTQs need to be processed |
| Memory requirements are too large | R or Python sessions crash when loading large matrices |
| Scaling samples manually is painful | Manually running multiple scripts across multiple samples |
| Storage is a bottleneck | Local disk is constantly full due to raw and processed data size |
| Serial execution is too slow | Running samples one after another takes too long; multi-sample analyses need to run concurrently |
| Data governance, ethics, and security constraints | Legal and/or ethical requirements mean highly protected data must stay on secure institutional systems |
If any of these scenarios sound familiar, your workflow is likely ready to be moved to an HPC and configured to run there.
1.1.3 From your laptop to HPC
Before running a workflow, it is important to understand the system we are running it on. Running workloads on an HPC is very different from running them on your laptop or a local workstation. HPCs are not just bigger, they are also:
- Shared
- Scheduled
- Resource-constrained
This introduces an important trade-off. HPCs give you access to massive computational power, but at the cost of flexibility. On your laptop or a local workstation you can run whatever you like, whenever you like, so long as it fits within the resource limits of the system. On an HPC, you gain scale and speed, but you must work within system policies and limits.

Shared systems
HPCs are large-scale institutional computing clusters that are intended to be used by many people at once. Indeed, their size and available resources mean that dozens or even hundreds of users can be using them at the same time and still manage to run large-scale workflows concurrently and in a timely manner. However, this shared nature places significant constraints on how they can be used.
The primary constraint is that you don't have the freedom to install whatever software you want on the system. This means you need other ways of running the tools you need. We will explore this issue in the next section.
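As a small taste of what those solutions can look like (the next section covers them properly), workflow managers such as Nextflow can run every task inside a pre-built container image instead of relying on system-wide installations. The snippet below is a minimal, hypothetical sketch: the image name is a placeholder, and your HPC may use Apptainer/Singularity, Docker, or environment modules instead.

```groovy
// nextflow.config — hypothetical sketch: run all tasks in a container image
// rather than installing tools on the cluster. The image name is a placeholder.
singularity.enabled = true
process.container   = 'example.org/images/variant-calling:1.0'
```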
Another constraint is the file system. While HPCs typically have huge shared file systems, they are neither unlimited in size nor in speed. Workflows that generate lots of files, or that read and write to the file system too frequently, will degrade performance for all users. Therefore, we need to be conscious of what our workflow is doing and design it to use the system fairly and efficiently. We will discuss storage limitations in 1.3 HPC architecture.
Schedulers
On your own laptop you are used to running things whenever you like, but on shared systems like HPCs this is not the case. Instead, HPCs require you to submit jobs to a scheduler, which decides where and when to run each job based on its resource requirements and those of all the other jobs in the queue. This makes HPCs asynchronous and non-interactive: jobs don't execute immediately, and they won't necessarily execute in the order they were submitted. An HPC workflow therefore needs to be designed to handle this delayed and potentially out-of-order execution style. As we will see later today, Nextflow is perfectly suited to writing workflows that work this way.
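As a quick preview, the snippet below is a minimal sketch of how a Nextflow run can be pointed at a scheduler so that it submits each task as a job on your behalf. It assumes a SLURM scheduler and a placeholder partition name; the scheduler type and queue names at your site will differ.

```groovy
// nextflow.config — minimal sketch, assuming a SLURM scheduler.
process {
    executor = 'slurm'    // submit each task to SLURM instead of running it locally
    queue    = 'general'  // placeholder partition/queue name; check your site's docs
}
```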
Resource constraints
Finally, HPCs may have large amounts of computing resources, but those resources aren't infinite, and they need to be shared between many users. It is therefore vital when running jobs on an HPC to define exactly what resources you require: the number of CPUs, the amount of memory (RAM), and how much time your job needs. As you will see later in this workshop, it is important to optimise these requests as best you can, as both under- and over-requesting resources can negatively impact your jobs.
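In Nextflow, these requests are declared per process. The sketch below is illustrative only: the process name and the numbers are assumptions, not tuned recommendations for any particular tool or dataset.

```groovy
// A minimal sketch of declaring resource requests for one task type.
process SORT_BAM {
    cpus   4          // CPU cores to request from the scheduler
    memory '16 GB'    // RAM to request; exceeding it may get the job killed
    time   '4h'       // walltime to request; the job is stopped when it runs out

    input:
    tuple val(sample_id), path(bam)

    output:
    tuple val(sample_id), path("${sample_id}.sorted.bam")

    script:
    """
    samtools sort -@ ${task.cpus} -m 2G -o ${sample_id}.sorted.bam ${bam}
    """
}
```

Note how the script reuses `task.cpus`, so the command line always matches whatever was actually requested from the scheduler.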
1.1.4 Introducing our workshop scenario: WGS short variant calling
Don't worry if you don't have prior knowledge of this workflow
The focus of this workshop is on learning Nextflow; the experimental context we are using (WGS short variant calling) is just a practical example to help you understand workflow design principles for HPC and how Nextflow works. You are not expected to have prior knowledge of variant calling workflows or best practices.
For this workshop, we will be focussing on a common bioinformatics analysis workflow used in genomics to identify genetic variants (SNPs and indels) from short-read whole genome sequencing data. This workflow involves multiple processes and tools and is computationally intensive. At a high level, the general procedure is as follows (sketched in Nextflow after the list):
- Quality control of raw sequences, e.g. filtering & trimming reads
- Alignment of reads to a reference genome
- Post-alignment processing: sorting, marking duplicates, indexing
- Variant calling: call SNVs and indels for each sample against the reference
- Joint genotyping: combine per-sample calls from a cohort into a single callset
- Reporting
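To make the shape of the workflow concrete, here is a rough sketch of how these stages might be wired together in Nextflow. It is not a working pipeline: the module file and process names (FASTP, BWA_MEM, MARKDUP, HAPLOTYPECALLER, JOINT_GENOTYPE, REPORT) are hypothetical placeholders, assuming each stage has been written as its own process.

```groovy
// main.nf — rough sketch only; the processes are assumed to be defined elsewhere.
include { FASTP; BWA_MEM; MARKDUP; HAPLOTYPECALLER; JOINT_GENOTYPE; REPORT } from './modules/variant_calling'

workflow {
    // One channel element per sample: each per-sample stage becomes an
    // independent task, so samples are processed in parallel where possible.
    reads_ch = Channel.fromFilePairs(params.reads)

    trimmed = FASTP(reads_ch)                  // QC: filter & trim raw reads
    bams    = BWA_MEM(trimmed)                 // align reads to the reference genome
    dedup   = MARKDUP(bams)                    // sort, mark duplicates, index
    gvcfs   = HAPLOTYPECALLER(dedup)           // per-sample SNV/indel calling
    vcf     = JOINT_GENOTYPE(gvcfs.collect())  // combine the cohort into one callset
    REPORT(vcf)                                // summarise results across the run
}
```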

Running this workflow end-to-end captures many of the challenges that running on HPC with Nextflow can solve:
- Many independent jobs: each sample can be processed separately for many steps
- Resource diversity: the tools used at each step require different amounts of CPU, memory, and walltime (see the configuration sketch after this list)
- Large I/O demands: reading and writing of multi-gigabyte files benefits from parallel filesystems
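For example, Nextflow lets you match different resource requests to different groups of processes through labels in the configuration file. The labels and values below are illustrative assumptions, not recommendations for specific tools.

```groovy
// nextflow.config — sketch of matching resources to stages via labels.
// Each process opts in with a `label` directive (e.g. label 'align').
process {
    withLabel: io_light { cpus = 1; memory = '4 GB';  time = '2h'  }  // QC, reporting
    withLabel: align    { cpus = 8; memory = '32 GB'; time = '12h' }  // read alignment
    withLabel: calling  { cpus = 4; memory = '16 GB'; time = '24h' }  // variant calling
}
```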
Throughout the workshop we will implement and explore different parts of this workflow in slightly different ways in order to highlight the lessons being taught.
How does HPC help run this workflow?
Consider the workflow described above:
- How does each stage use computational resources? What is the limiting factor of each stage?
- What would happen if we tried to run this workflow on a personal computer?
Answer
Each stage of this workflow has different computational requirements, and many are quite intense:
| Stage | Limiting factor | Explanation |
|---|---|---|
| QC | Storage (I/O speed) | Low CPU & memory requirements, but needs fast access to large files |
| Alignment | CPU | CPU speed determines how quickly reads can be aligned. Memory requirements are variable: because reads can be aligned independently, data can be read in small chunks. |
| Post-alignment processing | CPU + memory | Both CPU and memory requirements are high as many reads need to be processed together |
| Variant calling | CPU + memory | CPU usage is high as lots of calculations need to be performed to determine how likely a variant is at each genomic position. All reads within a given region must be processed together, so memory use is also high. |
| Joint genotyping | CPU + memory | Need to read data from all samples into memory at once, so memory usage is high. CPU also high to make final variant calls for entire cohort. |
| Reporting | Storage (I/O speed) | Low CPU & memory requirements as we only need to summarise the dataset. Fast access to large files makes storage the bottleneck. |
On a standard laptop, this workflow would not get very far before failing due to running out of memory during the alignment or variant calling phases. Designing this workflow for HPC lets us take advantage of large numbers of CPUs, lots of memory, and parallel execution of tasks to considerably speed up each stage.
Short variant calling is just one example of where HPCs can be used to process bioinformatics data more efficiently. Many bioinformatics workflows, such as RNA sequencing and proteomics data analysis, involve similarly large datasets whose analysis is computationally expensive yet often parallelisable. If your workflows are starting to struggle on your laptop, or you find it difficult to scale them up to multiple samples and larger datasets, that is a good sign they need to be moved to an HPC and possibly re-designed to take advantage of parallel computation.