2.4 Layering Nextflow configurations
Learning objectives
- Learn how Nextflow configurations can be layered on top of one another
- Understand the configuration priorities
- Learn how to fine-tune processes using the withName directive
- Understand when you would want to fine-tune process requirements
We have now managed to configure and run an nf-core pipeline, but we have also seen that for our test dataset it isn't very well optimised: by default, the sarek pipeline requests many more CPUs and much more RAM than necessary for our purposes. This problem isn't limited to small test datasets; when running a large dataset you might find that the pipeline hasn't been optimised quite as well as you would like and is requesting fewer resources than you need. In these cases, what we would like is fine-grained control over each process and the resources it requests.
We have already seen how we can define a custom configuration file and layer it over the top of the default configuration using the -c option to Nextflow. We have also seen how the sarek pipeline defines the resources required by many of its processes within the conf/base.config file using withLabel and withName directives. In this final section of today's workshop, we will try to optimise the processes that we are running to more efficiently use the HPC resources by defining a new custom configuration file.
2.4.1 Configuration priorities
Before we get started, it's important to understand how Nextflow prioritises configuration files. Because we can provide configuration information at various levels and using multiple files, it is possible for some options in these places to overlap, and Nextflow needs to know which ones to give precedence to. Referring to the Nextflow configuration documentation, configuration files are prioritised in the following order, from lowest to highest priority:
- The config file $HOME/.nextflow/config
- The config file nextflow.config in the project directory (i.e. at the same level as the main.nf script)
- The config file nextflow.config in the launch directory (i.e. the directory in which you run nextflow run ...)
- Config files specified using the -c <config-files> option
Furthermore, when using the -c option, multiple configuration files can be provided, separated by commas. These are prioritised from lowest to highest in the order they are specified, so settings in later files override the same settings in earlier ones.
Consider the following (very basic) Nextflow file:
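A minimal sketch of such a file, behaving as described below (printing the parameter with println inside the workflow block is one way to do it, not necessarily the only one):

// example.nf
params.value = 'hello'

workflow {
    println params.value
}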
If we run this 'workflow', it will print the contents of the value parameter, i.e. "hello":
N E X T F L O W ~ version 24.10.5
Launching `example.nf` [tender_kay] DSL2 - revision: 573919f401
hello
Now suppose we create a nextflow.config file and set value to something different:
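For instance, a nextflow.config in the launch directory (next to example.nf) containing just:

// nextflow.config
params.value = 'bye'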
Now, the workflow will print "bye"
N E X T F L O W ~ version 24.10.5
Launching `example.nf` [tender_kay] DSL2 - revision: 573919f401
bye
If we create another config file, define params.value in there, and layer it on top, that value will be used:
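For example, with a new config file (the name custom1.config is purely illustrative):

// custom1.config
params.value = 'seeya'

and the run command:

nextflow run example.nf -c custom1.config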
N E X T F L O W ~ version 24.10.5
Launching `example.nf` [tender_kay] DSL2 - revision: 573919f401
seeya
And if we create a second custom config, define yet another value for params.value, and layer it on top as well, that value will be used:
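For example, with a second file (again, an illustrative name):

// custom2.config
params.value = 'ciao'

passed to -c after the first, comma-separated as described above:

nextflow run example.nf -c custom1.config,custom2.config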
N E X T F L O W ~ version 24.10.5
Launching `example.nf` [tender_kay] DSL2 - revision: 573919f401
ciao
Process directives, such as CPU and memory requirements, can be configured in a number of ways, and these too are evaluated in a particular order by Nextflow. Briefly, they are prioritised in the following order, from lowest to highest priority:
- Default process configuration settings in the configuration files (e.g. process.cpus = 1)
- Process directives defined in the process definition
- Process configuration settings within a matching withLabel selector
- Process configuration settings within a matching withName selector
Consider the following process {} scope within a configuration file:
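A configuration along these lines (reconstructed from the description that follows) would be:

process {
    cpus = 4

    withLabel: 'hello' {
        cpus = 8
    }

    withName: 'bye' {
        cpus = 16
    }
}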
This configuration will have the following consequences:
- By default, all processes will be given 4 CPUs, unless their process definitions contain a cpus directive
- Any process given the label "hello" will instead be given 8 CPUs
- Any process named "bye" will be given 16 CPUs
2.4.2 Optimising nf-core/sarek for our data
Exercise: Fine-tune nf-core/sarek
Start by creating a new blank file within the config/ folder called custom.config and open it up in VSCode.
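If you prefer to do this from the terminal, something like the following should work (assuming your current working directory contains the config/ folder and the code command-line launcher for VSCode is installed):

touch config/custom.config
code config/custom.config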
We have four distinct processes that we want to fine-tune:
- TABIX_BGZIPTABIX_INTERVAL_COMBINED
- GATK4_CREATESEQUENCEDICTIONARY
- GATK4_MARKDUPLICATES
- MULTIQC
From the trace file we received from the previous run of sarek, we saw that the processes were requesting between 1 and 6 CPUs, and up to 36 GB of memory for the GATK4_CREATESEQUENCEDICTIONARY process. For our example dataset, these values are overkill. Instead, we can get away with just 1 CPU and 1 GB of memory for each task. We'll also give each task just 2 minutes to complete, which is more than enough time.
Some tools are greedy!
You might have seen from our previous trace file that the GATK4_MARKDUPLICATES processes used several gigabytes of memory each (as reported in the rss column). So how come we can give them just 1 GB now? GATK is a bit of a greedy tool and will often expand to use up lots of memory if it is allowed to, so the values reported by the trace file aren't necessarily representative of how much memory the tool really needs. Optimising resources for tools therefore requires a bit of trial and error.
Let's translate this into the Nextflow configuration format:
process {
withName: 'TABIX_BGZIPTABIX_INTERVAL_COMBINED' {
cpus = 1
memory = 1.GB
time = 2.min
}
withName: 'GATK4_CREATESEQUENCEDICTIONARY' {
cpus = 1
memory = 1.GB
time = 2.min
}
withName: 'GATK4_MARKDUPLICATES' {
cpus = 1
memory = 1.GB
time = 2.min
}
withName: 'MULTIQC' {
cpus = 1
memory = 1.GB
time = 2.min
}
}
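As an aside, because the settings are identical for all four processes, the withName selector's support for regular expressions means they could also be matched with a single block. This is just an alternative sketch, not something you need to do for the workshop:

process {
    withName: 'TABIX_BGZIPTABIX_INTERVAL_COMBINED|GATK4_CREATESEQUENCEDICTIONARY|GATK4_MARKDUPLICATES|MULTIQC' {
        cpus   = 1
        memory = 1.GB
        time   = 2.min
    }
}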
Now that we have our custom configuration file created, we need to update our run script one final time and add the new file to the -c option:
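The exact contents of your run script will depend on the options you added in the earlier sections; the only change needed here is appending config/custom.config to the -c list. A sketch only, in which the script name and the other options shown are placeholders:

#!/bin/bash
# run_sarek.sh: keep your existing options as they are and simply add
# config/custom.config to the comma-separated list given to -c
nextflow run nf-core/sarek \
    -profile singularity \
    -params-file params.yaml \
    -c config/institutional.config,config/custom.config \
    -resume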
And now we're ready to re-run the pipeline!
After a few minutes, the pipeline should finish. We can again inspect the trace file from the run to see how much memory was requested and used:
# Your trace file will have a unique name based on the time it was run
cat runInfo/trace-2025-11-18_14-15-16.txt
name status exit duration realtime cpus %cpu memory %mem rss
NFCORE_SAREK:PREPARE_INTERVALS:TABIX_BGZIPTABIX_INTERVAL_COMBINED (no_intervals) COMPLETED 0 58s 0ms 1 40.9% 1 GB 0.0% 3.1 MB
NFCORE_SAREK:PREPARE_GENOME:GATK4_CREATESEQUENCEDICTIONARY (Hg38.subsetchr20-22.fasta) COMPLETED 0 1m 12s 10s 1 78.7% 1 GB 0.2% 326.5 MB
NFCORE_SAREK:SAREK:BAM_MARKDUPLICATES:GATK4_MARKDUPLICATES (test_sample3) COMPLETED 0 1m 9s 9s 1 93.6% 1 GB 0.5% 602.8 MB
NFCORE_SAREK:SAREK:BAM_MARKDUPLICATES:GATK4_MARKDUPLICATES (test_sample1) COMPLETED 0 1m 19s 12s 1 81.9% 1 GB 0.4% 597.3 MB
NFCORE_SAREK:SAREK:BAM_MARKDUPLICATES:GATK4_MARKDUPLICATES (test_sample2) COMPLETED 0 1m 11s 8s 1 94.5% 1 GB 0.5% 600.2 MB
NFCORE_SAREK:SAREK:MULTIQC COMPLETED 0 1m 20s 8.7s 1 67.4% 1 GB 0.2% 420.9 MB
We can see that most of the processes are now using a larger proportion of the memory assigned to them (in this case 300-600 MB out of a total of 1 GB), so we are using our resources much more efficiently. We could probably fine-tune this even further, but we'd get diminishing returns and risk some samples failing.