Optimising Nextflow for HPC
Learning objectives
- Identify the key factors that impact workflow performance on HPC systems
- Describe common workflow optimisation strategies in the context of Nextflow
Our workflow is now functional: it runs successfully with scheduler-specific settings on HPC and outputs useful trace files showing resource usage. In Lesson 1.4, we learnt about the importance of appropriate resource requests on HPC and the value of running an efficient and optimised workflow. We will now begin to apply these principles to our custom workflow.
2.3.1 Why optimise?
Workflow optimisation involves fine-tuning your workflow to make it more efficient. It can reduce runtime, enable higher throughput and larger, more powerful studies, free up hardware that would otherwise sit idle and could be used by other researchers, and decrease computing cost. Many HPC facilities charge usage per core hour, and even where the cost is not billed to the researcher directly, the resources must still be paid for by someone: the Faculty/School, institution, funding body, or federal budget. Demonstrating efficient use of compute funds can be important for grant success in projects requiring substantial compute.
Another important reason optimisation is worthwhile is its impact on the environment. Bioinformatics data processing on HPC has a significant carbon footprint due to the energy required to run our computations. Making our workflows complete faster while using less hardware contributes to sustainable computing practices.
This short editorial from Nature Computational Science highlights the challenges research computing faces in the context of the climate crisis.
Today we will apply workflow optimisation from two perspectives:
- Resource efficiency: benchmarking and adjusting resources for efficient process execution on HPC
- Speed-up: introducing code changes to make the workflow run faster without costing more compute hours
2.3.2 When to optimise
A small workflow run a handful of times might not benefit dramatically from optimisation. Many Nextflow workflows that employ good practices (e.g. nf-core) will run "out of the box" on HPC with default resources, but those defaults do not always fit your data (and therefore the behaviour of your processes) or the constraints of your cluster. Think back to Part 1 and the configuration customisations we implemented for our nf-core workflow.
Optimising workflows on HPC becomes especially important when:
- You are analysing large datasets or many samples
- You will execute your pipeline repeatedly
- You are operating on a system that uses a service unit or time-limited allocation model
- Your processes have data-dependent resource usage
Tip
The ideal time to optimise a workflow is while it is being developed. This is often simpler than back-tracking and adding improvements to an existing codebase. It may add development time before a finished workflow is produced, but the effort is rewarded with a resilient, scalable workflow that reduces the interval between data acquisition and final results for the lifetime of its use.
If you have an existing workflow that is reliable but inefficient, it is still worth taking the time to optimise it, particularly if you use it regularly. This endeavour also provides the opportunity to update tool versions and introduce other enhancements such as the use of Singularity containers.
When optimising a workflow, resource efficiency should always be considered. Optimisation through parallelisation, however, is not always possible or advisable. Always consider which components of the work can be parallelised or split up in a biologically valid way. If you are unsure, one way to test is to run the analysis with and without parallelism on a subset of data and observe any impact on the results. Because many tools rely on heuristic algorithms, a small difference in the final result is often expected and tolerated; in other cases, a tool should produce identical results irrespective of multi-threading or parallelisation. Check the tool documentation, consider the nature of the data and the underlying biology, test the effects of parallelisation, and only apply it when it makes biological sense to do so.
Consider the example of sequenced DNA fragments in whole genome sequencing. Each fragment is completely independent of every other fragment in the library, so it can be aligned to the reference sequence independently without affecting the mapping results. This is a perfect case of embarrassingly parallel processing, i.e. running the same analysis numerous times on slightly different input data. In reality, we would not align one read at a time, as the overhead of submitting millions of tiny jobs would overload the scheduler, impact performance, and likely result in stern reproach from the HPC system administrators! So in this case, we need to balance parallelisation with sound HPC practices.
When embarrassingly parallel processing with scatter-gather through data chunking or interval chunking (i.e. operating over different intervals of the data as separate parallel tasks) is not biologically valid, we can still apply parallelisation by sample. Parallelising by sample is logical for numerous common bioinformatics processing tasks, and is typically only invalid when all samples must be analysed together, for example when collating final results. Fortunately, Nextflow makes parallelisation by sample simple to apply, making workflows highly scalable.
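As a minimal sketch of this pattern (the FASTP process, file glob, and output names below are illustrative placeholders, not part of today's workflow), the snippet feeds a channel of per-sample read pairs into a process. Each tuple emitted by the channel becomes an independent task, so all samples run in parallel up to the limits imposed by the scheduler.

```groovy
nextflow.enable.dsl = 2

// Illustrative process: each [sample_id, [R1, R2]] tuple becomes one task
process FASTP {
    input:
    tuple val(sample_id), path(reads)

    output:
    tuple val(sample_id), path("${sample_id}_trimmed_*.fastq.gz")

    script:
    """
    fastp \\
        --in1 ${reads[0]} --in2 ${reads[1]} \\
        --out1 ${sample_id}_trimmed_1.fastq.gz \\
        --out2 ${sample_id}_trimmed_2.fastq.gz
    """
}

workflow {
    // One tuple per sample: [sample_id, [R1, R2]]
    samples_ch = Channel.fromFilePairs('data/*_R{1,2}.fastq.gz')

    // All samples are trimmed as independent, parallel tasks
    FASTP(samples_ch)
}
```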
2.3.3 What affects performance?
Efficiency of any workflow on HPC depends on the interaction of three factors:
2.3.3.1 Your HPC system
We have already witnessed many differences between Gadi and Setonix in previous lessons. A workflow that performs well on one cluster may perform poorly on another simply because the underlying architecture and scheduler rules differ.
Good optimisation respects the boundaries of the system you're working on. When planning an optimisation approach, consider the following (a configuration sketch follows the table):
| HPC characteristic | What it means | Why it matters for optimisation |
|---|---|---|
| Default scheduler behaviour | Policies set by administrators: fair-share, job priorities, backfill rules, default limits | Affects queue wait time, job placement efficiency, and how many tasks can run in parallel |
| Queue limits | Maximum walltime, cores, and memory allowed per queue or partition | Determines which queues you can use, how large each job can be, and whether your workflow gets delayed |
| Node architecture | Hardware layout: cores per node, memory per node, CPU type (Intel/AMD), GPUs, local scratch | Ensures you request resources that “fit” the node, avoid resource fragmentation, and maximise throughput |
| Charging model | How HPC usage is accounted (CPU proportion, memory proportion, or the maximum of both) | Guides you to request only what you need: over-requesting directly increases SU consumption without improving runtime |
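To show how these characteristics surface in practice, here is a hedged configuration sketch: the executor, queue name, resource values, and project/storage flags are all placeholders that must be replaced with the real limits and accounting options documented for your facility.

```groovy
// nextflow.config fragment: all values below are illustrative placeholders
process {
    executor = 'pbspro'        // or 'slurm', depending on your scheduler
    queue    = 'normal'        // must exist on your system and fit the request

    // Defaults kept well within a single node to avoid fragmentation
    cpus   = 4
    memory = 16.GB
    time   = 2.h

    // Site-specific accounting/storage flags passed straight to the scheduler
    clusterOptions = '-P <project_code> -l storage=scratch/<project_code>'
}
```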
2.3.3.2 The characteristics of your data
Data shapes the computational behaviour of bioinformatics workflows. Even two workflows with identical code can perform very differently depending on file sizes, sample numbers, and data complexity. Understanding these factors helps you anticipate bottlenecks and assign resources more accurately. When planning an optimisation approach, consider the following (a sketch of per-process resource overrides follows the table):
| Data characteristic | What it means | Why it matters for optimisation |
|---|---|---|
| File size | The total size of FASTQ, BAM/CRAM, reference genomes, annotation files, or intermediate outputs | Larger files increase memory requirements, disk I/O, runtime, and queue time; they also influence whether single-threading or multi-threading is more efficient |
| Sample number | Total number of samples in the analysis, including replicates or cohorts | More samples → more tasks → heavier scheduler load; the workflow may require scatter-gather to parallelise effectively and avoid bottlenecks |
| Data heterogeneity | Variability in file sizes, read depth, sample complexity, or quality across inputs | Highly variable samples produce uneven resource usage; some processes may require per-sample resource overrides to prevent memory kills or slowdowns |
| Data type | Whether data are short reads, long reads, single-cell, imaging derivatives, matrices, VCFs, etc. | Different data modalities have different computational profiles (I/O-heavy, CPU-heavy, memory-heavy); optimisation strategies should account for the modality’s behaviour |
| I/O intensity | Frequency and volume of read/write operations (large temporary files, sort steps, indexing, BAM ↔ FASTQ conversions) | I/O-heavy processes benefit from local SSD or node-local scratch; misconfigured I/O can add hours to runtime on shared filesystems |
| Parallelisability | Whether samples or chunks of data can be processed independently | Determines when scatter–gather is useful, how many jobs can run concurrently, and how well the workflow scales on HPC |
| Compression and indexing formats | gzip vs bgzip, BAM vs CRAM, presence of .bai/.crai/.fai, CCS vs raw reads | Impacts CPU time, memory, and I/O behaviour; inefficient formats slow down the entire workflow |
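Rather than sizing every task for the worst-case sample, heterogeneous inputs can be absorbed with dynamic resources that step up on retry. The sketch below assumes a hypothetical SORT_BAM process and illustrative starting values; the exit-status range follows the common convention of treating scheduler resource kills as retryable.

```groovy
// Per-process override with dynamic resources: values are illustrative
process {
    withName: 'SORT_BAM' {
        // Resources scale with the retry attempt (1x, 2x, 3x)
        memory = { 8.GB * task.attempt }
        time   = { 2.h  * task.attempt }

        // Retry when the task was killed for exceeding its request
        errorStrategy = { task.exitStatus in (130..145) ? 'retry' : 'finish' }
        maxRetries    = 2
    }
}
```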
2.3.3.3 The structure of your workflow
Even with the same tools and data, two workflows can behave differently depending on their structure:
- Number of processes
- Order of dependencies
- Opportunities for parallelism
- Whether steps are CPU-bound, memory-bound, or I/O-bound
- Whether the incorporated tools can multi-thread (see the sketch below)
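As a preview of the multi-threading we enable in Lesson 2.5, the sketch below shows the usual pattern for a thread-aware tool: the cpus directive is passed to the tool via task.cpus, so the scheduler request and the tool's thread count cannot drift apart. The resource values, reference layout, and file names are illustrative.

```groovy
process ALIGN {
    cpus   8
    memory 32.GB

    input:
    tuple val(sample_id), path(reads)
    path index_dir          // directory holding the reference FASTA and its BWA index

    output:
    tuple val(sample_id), path("${sample_id}.bam")

    script:
    """
    bwa mem -t ${task.cpus} ${index_dir}/genome.fa ${reads} \\
        | samtools sort -@ ${task.cpus} -o ${sample_id}.bam -
    """
}
```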
2.3.4 What will we optimise today?
For the remainder of Part 2 we will apply the strategies introduced in Part 1 to optimise and then scale our custom workflow.
In the next section, we will assign appropriate resources to each process, using trace files to fine-tune the cpus, memory, and time directives and align them with the resources available on the compute nodes of our HPCs.
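Trace reporting itself is requested through configuration; a minimal sketch is shown below, where the output file name and the selection of fields are just one reasonable choice based on Nextflow's documented trace fields.

```groovy
// nextflow.config: request a trace file with the fields used for resource tuning
trace {
    enabled   = true
    overwrite = true
    file      = 'runinfo/trace.txt'
    fields    = 'name,status,exit,realtime,cpus,%cpu,memory,%mem,peak_rss'
}
```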
In Lesson 2.5, we will introduce parallel processing firstly by enabling multi-threading in a thread-aware tool (BWA), and then by coding scatter-gather parallelism into the workflow.
In today's example workflow, we will be applying scatter-gather to run alignment with BWA. Note that the GENOTYPE process can also be parallelised in a biologically valid way using a parallelisation strategy known as "interval chunking", but for simplicity we will not be optimising that process today.
Finally, in Lesson 2.6, we will scale to multiple samples. This will consolidate all of the resource optimisations and parallelisation strategies (multi-threading; scatter-gather; parallel by sample) that we have built up today into one efficient, end-to-end run optimised for our respective HPCs.