2.4. Configurations to customise outputs
In the previous lesson, we applied a custom resource configuration file. In this lesson, we will create custom configuration files that alter the pipeline output in some specific way.
By keeping these customisations in configuration files, we can apply them to runs in a modular fashion, adding them to or dropping them from a pipeline execution as required.
This flexibility ensures that we preserve the portability and reproducibility of nf-core pipelines by not editing the core workflow code to make bespoke customisations.
In this lesson we will create four custom configuration files for different customisation scenarios, and apply them together in a single run command.
Unlike our institutional config custom-nimbus.config, these configs are optional and portable:
- We can apply them if and when needed
- We can apply them to nf-core runs on any machine
- We can apply them to any nf-core pipeline that uses the module that the custom config operates on
2.4.1. Customised resource tracing
nf-core pipelines by default create a trace report file within <outdir>/pipeline_info. A trace report can be customised to include any combination of available fields using the Nextflow -f,-fields option.
If there is a specific combination of fields you would like to regularly capture for your pipeline runs, perhaps for resource benchmarking, it would be useful to have a custom configuration file that can be optionally applied to any run, to save having to extract this information later.
➤ View the available fields using the Nextflow log command, or online
nextflow log -list-fields
Note that percent CPU and percent memory are %cpu and %mem when used as arguments to the trace option, yet pcpu and pmem for the Nextflow log command.
➤ Use the Nextflow log command with the fields option to extract some specific fields for a recent run. Choose any fields you like, and add them in a comma-separated list, for example:
nextflow log | tail -1
nextflow log <run_name> -f name,status,exit,realtime,cpus,pcpu,memory,pmem,rss
Let’s make a portable configuration file for resource tracing that we can share with collaborators and use across multiple pipelines 🤝
➤ Open a file for editing called custom-trace.config and add the following content. Feel free to list your own set of chosen trace fields rather than the example list below:
// Custom trace report config
// Define timestamp, to avoid overwriting existing trace
def trace_timestamp = new java.util.Date().format('yyyy-MM-dd_HH-mm-ss')
// Generate custom resource trace file
trace {
    enabled = true
    file = "${params.outdir}/custom-resource-trace-${trace_timestamp}.txt"
    fields = 'name,status,exit,realtime,cpus,%cpu,memory,%mem,rss'
}
➤ Before submitting your run, delete or comment out the withName: '.*:RNASEQ:ALIGN_STAR:STAR_ALIGN' block we applied in the custom-nimbus.config file
➤ Save both configs and resume your previous run, adding our second custom config custom-trace.config at -c and setting outdir to Lesson-2.4.1:
nextflow run nf-core-rnaseq-3.11.1/workflow/main.nf \
    -profile workshop \
    -c custom-nimbus.config,custom-trace.config \
    -params-file workshop-params.yaml \
    --outdir Lesson-2.4.1 \
    -resume
👀 Notice that your multiple custom configurations are listed under Core Nextflow options on the launch log printed to the screen:
➤ Once the workflow has run, take a look at the Lesson-2.4.1/custom-resource-trace-${trace_timestamp}.txt file:
👀 Notice that while the status is CACHED, the resources reported are those used when the process was actually executed, not the time taken to restore its results from cache
2.4.2. Configure MultiQC reports
Many nf-core pipelines use MultiQC to aggregate results and statistics output by various bioinformatics tools, helping to summarise experiments containing multiple samples and multiple analysis steps.
MultiQC allows users the flexibility of customising the output reports. nf-core has enabled users to take full advantage of this functionality through the inclusion of the --multiqc_config parameter.
We are going to create a custom MultiQC config to:
- Overlay a theoretical GC content track specific for our mm10 reference genome
- Add a custom report header
➤ Open a file for editing named custom-multiqc-config.yaml and add the following content:
# Project level information:
report_header_info:
  - Contact E-mail: "cali.willet@sydney.edu.au"
  - Workshop host: "Australian BioCommons"
  - Workshop title: "Unlocking nf-core - customising workflows for your research"
  - Workshop URL: https://sydney-informatics-hub.github.io/customising-nfcore-workshop/
  - Sample data: Mouse RNAseq from https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2801-4
# Theoretical GC Content:
fastqc_config:
  fastqc_theoretical_gc: "mm10_txome"
➤ To the workshop-params.yaml file, add:
multiqc_config: "custom-multiqc-config.yaml"
➤ Make sure both YAML files are saved, then re-run the workflow, applying our customisations for both trace reporting and MultiQC formatting. Use the same run command as in 2.4.1, with --outdir set to Lesson-2.4.2.
➤ View the html file Lesson-2.4.2/multiqc/star_salmon/multiqc_report.html and compare it to the report generated by a previous run:
Notice your custom report header info at the top of the page:
Navigate to the section titled FastQC: Per Sequence GC Content. Compare the two plots to observe the custom track has been successfully added.
👀 This track has shown that our sample transcriptomes follow a normal distribution consistent with the mm10 reference.
2.4.3. Configure your run to use a non-default container
For this example, let’s assume we want to test out the latest version of Trim Galore, which is newer than the default version of Trim Galore used by version 3.11.1 of nf-core/rnaseq.
As can often be the case in programming, TIMTOWTDI (there is more than one way to do it)!
Given what you have learnt about custom configuration files so far, take a moment to think about the optimal method to use for this customisation.
➤ Identify the default version of Trim Galore that has been used in our runs so far from the software_versions.yml file:
more Lesson-2.4.2/pipeline_info/software_versions.yml
TRIMGALORE:
  cutadapt: '3.4'
  trimgalore: 0.6.7
➤ Identify the container path for Trim Galore v. 0.6.10 hosted at CernVM-FS:
ls /cvmfs/singularity.galaxyproject.org/t/r/trim*
/cvmfs/singularity.galaxyproject.org/t/r/trim-galore:0.6.10--hdfd78af_0
➤ Next, identify the execution path for the TRIMGALORE module:
grep GALORE nf-core-rnaseq-3.11.1/workflow/conf/modules.config
withName: '.*:FASTQ_FASTQC_UMITOOLS_TRIMGALORE:FASTQC' {
withName: '.*:FASTQ_FASTQC_UMITOOLS_TRIMGALORE:TRIMGALORE' {
➤ Open a file for editing named custom-trimgalore-version.config, and start building your config with the process scope and withName selector that we covered in lesson 2.3.5:
// Config to test out newer trimgalore
process {
    withName: {
    }
}
➤ Then copy the TRIMGALORE module execution path from your terminal inside the withName block:
process {
    withName: '.*:FASTQ_FASTQC_UMITOOLS_TRIMGALORE:TRIMGALORE' {
    }
}
➤ Finally, add the container details inside the withName block:
process {
    withName: '.*:FASTQ_FASTQC_UMITOOLS_TRIMGALORE:TRIMGALORE' {
        container = '/cvmfs/singularity.galaxyproject.org/t/r/trim-galore:0.6.10--hdfd78af_0'
    }
}
➤ Save custom-trimgalore-version.config then resume the previous run, supplying our three configuration files at -c and specifying --outdir Lesson-2.4.3:
nextflow run nf-core-rnaseq-3.11.1/workflow/main.nf \
    -profile workshop \
    -c custom-nimbus.config,custom-trace.config,custom-trimgalore-version.config \
    -params-file workshop-params.yaml \
    --outdir Lesson-2.4.3 \
    -resume
➤ After your run has completed, check that the updated version of Trim Galore has been used:
grep -i galore Lesson-2.4.3/pipeline_info/software_versions.yml
TRIMGALORE:
  trimgalore: 0.6.10
grep -i galore Lesson-2.4.2/pipeline_info/software_versions.yml
TRIMGALORE:
  trimgalore: 0.6.7
2.4.4. Apply external arguments
Mandatory minimum arguments that a tool needs to run are hard-coded within the main.nf file of the nf-core module.
Optional arguments for a tool that are commonly changed are usually included as nf-core workflow parameters, so they can be easily customised on the command line or supplied within a parameters file.
As we are all aware, bioinformatics tools can have a lot of optional arguments. It is not feasible for nf-core to parameterise all of these arguments…
In this lesson, we will learn how to apply any argument for a tool that is not explicitly covered by an nf-core workflow parameter using Nextflow’s ext directive.
➤ Consider the diagram below:
- On the left of the diagram is an example of the standard main.nf format of a process named example stored in the modules directory
  - Above the script block, the $args variable is defined
  - The task.ext.args ?: '' expression checks if the ext.args parameter has already been defined
  - If it isn’t defined it will assign an empty string
  - Inside the script block, the $args variable is applied to the tool command
- In the middle of the diagram is a custom configuration file targeting the ‘EXAMPLE’ process
  - This configuration file gives the tool-specific argument --flag1 to ext.args using withName
- On the right of the diagram we see how the tool command is executed if -c example-custom.config was applied to the run
  - The $args variable is interpolated to --flag1 when modules/example/main.nf is executed
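The module side of this pattern can be sketched as follows. This is a minimal illustration rather than code from any real nf-core module: the tool name example_tool, the reads input and the output file name are all hypothetical, chosen only to show where $args is defined and where it is applied.

```groovy
// modules/example/main.nf (illustrative sketch; example_tool is a hypothetical command)
process EXAMPLE {
    input:
    path reads

    output:
    path "example_out.txt"

    script:
    // Use ext.args from a configuration file if it is set, otherwise an empty string
    def args = task.ext.args ?: ''
    """
    example_tool $args $reads > example_out.txt
    """
}
```

Under these assumptions, a configuration file that sets ext.args to '--flag1' inside a withName: 'EXAMPLE' selector would cause the command to be rendered with --flag1 in place of $args. Without such a config, $args is an empty string and the command runs with only its hard-coded arguments.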
👀 You can observe the empty $args variable at play within .command.sh scripts for a run:
nextflow log
nextflow log <run_name> -f script
We will now pass an optional flag to the Trim Galore process using ext.args in a custom configuration file.
For the sake of the lesson, let’s assume we want to apply a very stringent minimum Phred score of 40. This is overly stringent! Bear with us… 😊
➤ Follow the decision tree to work out how to apply this workflow customisation. Some useful links are below the flowchart:
🧰 nf-core/rnaseq trimming params v.3.11.1 vs v.3.10.1
From the above investigation, we have learnt that:
- The Trim Galore argument --quality 40 is required to over-ride the Trim Galore default of 20
- This parameter is absent from the nf-core/rnaseq parameters list (for v.3.10.1 that is!) 😊
- Only --cores, --gzip and input file are hard-coded in trimgalore/main.nf
- ext.args is defined in trimgalore/main.nf
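The contents of custom-trimgalore-args.config are not reproduced in this section. A minimal sketch consistent with the findings above, assuming the same withName selector used earlier for the TRIMGALORE module, might look like this:

```groovy
// custom-trimgalore-args.config
// Sketch only: pass a stricter quality cutoff to Trim Galore via ext.args
process {
    withName: '.*:FASTQ_FASTQC_UMITOOLS_TRIMGALORE:TRIMGALORE' {
        // Assumption: this value replaces any ext.args set for this process
        // by configs loaded earlier, rather than being appended to it
        ext.args = '--quality 40'
    }
}
```

The trim_galore command reported by nextflow log later in this lesson contains only --quality 40 alongside the hard-coded arguments, which is consistent with this value replacing, rather than extending, any earlier ext.args setting for the process.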
➤ Save custom-trimgalore-args.config then resume your run, adding our latest custom config file to the growing list at -c and setting the outdir param to Lesson-2.4.4:
nextflow run nf-core-rnaseq-3.11.1/workflow/main.nf \
    -profile workshop \
    -c custom-nimbus.config,custom-trace.config,custom-trimgalore-version.config,custom-trimgalore-args.config \
    -params-file workshop-params.yaml \
    --outdir Lesson-2.4.4 \
    -resume
😲 Pipeline completed successfully but 2/2 samples failed!
➤ Investigate the source of the failure:
- Following the advice of the warning message, check the MultiQC report by opening the file Lesson-2.4.4/multiqc/star_salmon/multiqc_report.html (with Live Server VS Code extension or taking a local copy)
- Click on the ‘WARNING: Fail Trimming Check’ on the navigation headings on the left
- Both of our samples “failed the minimum trimmed reads threshold specified via the --min_trimmed_reads parameter, and hence were ignored for the downstream processing steps”
- Looking at the nf-core/rnaseq parameter documentation, we see --min_trimmed_reads param with default value 10,000. Our samples both had < 5,000 reads
- Comparing the outputs in Lesson-2.4.4 to a previous run, notice there is no star_salmon directory. These modules did not have their input channels fulfilled by the previous modules in the workflow, so they were not executed.
In this example, it is clear that our ext.args customisation was successfully applied because of the catastrophic failure it produced. In less dramatic cases, how can we confirm that the external argument was actually applied? 🤔
➤ Confirm that the external argument was actually applied:
Within your MultiQC report, navigate to the section ‘nf-core/rnaseq Workflow Summary’. Unlike our custom options, profiles, and configs, the custom quality parameter is not documented!
Some tools write a handy log, in which case you might view the applied parameters from that file, eg:
head Lesson-2.4.4/trimgalore/SRR3473988.fastq.gz_trimming_report.txt
In the absence of an informative tool log, you can always rely on nextflow log 🥰
nextflow log <run_name> -f script | grep -A 10 trim_galore
trim_galore \
    --quality 40 \
    --cores 1 \
    --gzip \
    SRR3473989.fastq.gz
➤ Optional final run for the purists at heart 👼
If you would like to conclude today’s compute with a successful run, try changing the Trim Galore --quality threshold to 30.