2.4 Combining channels and multiple process outputs
Learning objectives
- Implement a channel that combines the contents of two channels.
- Implement a process with multiple output files.
- Improve execution logging with process directives and Groovy.
In this step we will transform the 03_multiqc.sh script into a process called MULTIQC.
This step focuses on the final step of our RNAseq data processing workflow: generating
a report that summarises the quality control and quantification steps.
To do this, we will run MultiQC, a popular tool for summarising the outputs of many different bioinformatics tools. It aggregates the results from all our analyses and renders them into a single report.
From the MultiQC docs
MultiQC doesn’t do any analysis for you - it just finds results from other tools that you have already run and generates nice reports. See here for a list of supported tools. You can also see an example report here.
Open the bash script 03_multiqc.sh.
This script is a lot simpler than the previous scripts we've worked with. It searches for the output files generated by the FASTQC and QUANTIFICATION processes saved to the results/ directory. As specified by --outdir results/, it will produce two MultiQC outputs:
- A directory called multiqc_data/
- A report file called multiqc_report.html
2.4.1 Building the process
1. Process directives, script, and input
Here is the empty process template with the container and publishDir directives we'll be using to get you started. Add this to your main.nf after the QUANTIFICATION process.
process MULTIQC {
    container "quay.io/biocontainers/multiqc:1.19--pyhdfd78af_0"
    publishDir params.outdir, mode: 'copy'
    input:
    path "*"
    output:
    < process outputs >
    script:
    """
    multiqc .
    """
}
The script and input follow the MultiQC Nextflow integration recommendations.
The script is written this way because of how Nextflow stages inputs, as covered in Part 1. The path "*" input stages every file or directory emitted by the input channel into the task's working directory, and multiqc . then scans that directory. Passing paths through a channel (gathered with the collect operator) is more stable than pointing the process at a directory.
From the information above we know that the input for multiqc is the results/ directory, specifically, the files and directories within results/. We will need to bring the outputs of the FASTQC (fastqc_gut_logs/) and QUANTIFICATION (gut/) processes into a single channel as input to MULTIQC.
Why you should NOT use the publishDir folder as a process input
It might make sense to have the results/ folder (set by publishDir) as the input to the process here, but it may not exist until the workflow finishes.
Using the publishDir as a process input can cause downstream processes to run prematurely, even if the directory is empty or incomplete. In this case, MultiQC might miss some inputs.
Use channels to pass data between processes. Channels enable Nextflow to track outputs and ensure that downstream processes only run when all required data is ready, maintaining proper workflow control.
More on this in the next section.
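To make the point about channels concrete, here is a minimal sketch that is separate from the RNAseq workflow. The process names SLOW_STEP and REPORT and the file result.txt are hypothetical. Because REPORT takes its input from the output channel of SLOW_STEP, Nextflow should only start REPORT once result.txt has actually been produced.

```groovy
// Hypothetical upstream process: takes a moment to produce its output
process SLOW_STEP {
    output:
    path "result.txt"

    script:
    """
    sleep 2
    echo "done" > result.txt
    """
}

// Hypothetical downstream process: reads the file it receives via the channel
process REPORT {
    input:
    path result

    output:
    stdout

    script:
    """
    cat ${result}
    """
}

workflow {
    SLOW_STEP()
    // The channel carries result.txt to REPORT only after SLOW_STEP has finished
    REPORT(SLOW_STEP.out)
    REPORT.out.view()
}
```

If REPORT instead read from a published folder, Nextflow would have no way of knowing that it depends on SLOW_STEP, and the two tasks could be scheduled in any order.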
Exercise
Think of something
2. Define the process output
Next, add the output definition to the MULTIQC process. The process has two outputs:
- An HTML report file called multiqc_report.html
- A directory called multiqc_data/ containing the data used to generate the report.
process MULTIQC {
    container "quay.io/biocontainers/multiqc:1.19--pyhdfd78af_0"
    publishDir params.outdir, mode: 'copy'
    input:
    path "*"
    output:
    path "multiqc_report.html"
    path "multiqc_data"
    script:
    """
    multiqc .
    """
}
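A process with more than one output exposes each output as its own channel. The sketch below assumes a hypothetical process MAKE_REPORT that mimics the shape of the MULTIQC outputs (one file and one directory) without running MultiQC. It shows positional access with .out[0] and .out[1], plus the optional emit: names, which are included here for illustration only and are not part of this workshop's main.nf.

```groovy
// Hypothetical process with two outputs: a file and a directory
process MAKE_REPORT {
    output:
    path "report.html", emit: report
    path "report_data", emit: data

    script:
    """
    mkdir report_data
    echo "<html></html>" > report.html
    echo "sample,value" > report_data/table.csv
    """
}

workflow {
    MAKE_REPORT()

    // Positional access: outputs are numbered in the order they are declared
    MAKE_REPORT.out[0].view { f -> "first output: ${f.name}" }
    MAKE_REPORT.out[1].view { f -> "second output: ${f.name}" }

    // Named access: the same channels, referred to by their emit names
    MAKE_REPORT.out.report.view { f -> "report: ${f.name}" }
    MAKE_REPORT.out.data.view { f -> "data: ${f.name}" }
}
```

Positional access is what the rest of this section uses. Named outputs can be easier to read once a process has several outputs.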
2.4.2 Combining channels with operators
The goal of this step is to combine outputs from the FASTQC and QUANTIFICATION processes into a single input channel for the MULTIQC process. These tools are both supported by MultiQC and their outputs can be detected automatically by MultiQC.
Exercise
Which channels output:
fastqc_gut_logs/
gut/
Solution
fastqc_ch and quant_ch.
The next few steps will involve chaining together Nextflow operators to correctly format inputs for the MULTIQC process.
In the workflow scope, use the mix operator to emit the contents of fastqc_ch and quant_ch in a single channel.
Add the following to the workflow block in your main.nf file, under the quantification process. View it using the view() operator:
// Define the workflow
workflow {
    // Run the index step with the transcriptome parameter
    INDEX(params.transcriptome_file)
    // Define the fastqc input channel
    reads_in = Channel.fromPath(params.reads)
        .splitCsv(header: true)
        .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }
    // Run the fastqc step with the reads_in channel
    FASTQC(reads_in)
    // Define the quantification channel for the index files
    transcriptome_index_in = INDEX.out[0]
    // Run the quantification step with the index and reads_in channels
    QUANTIFICATION(transcriptome_index_in, reads_in)
    // Define the multiqc input channel
    multiqc_in = FASTQC.out[0]
        .mix(QUANTIFICATION.out[0])
        .view()
}
This channel definition does the following:
- Takes the output of FASTQC, using .out[0] to refer to the first output of the process.
- Uses mix(QUANTIFICATION.out[0]) to combine the FASTQC output with the first output of QUANTIFICATION in a single channel.
- Uses view() to print the values emitted into the channel.
Run the workflow:
nextflow run main.nf -resume
The output should look something like:
Launching `main.nf` [stupefied_minsky] DSL2 - revision: 82245ce02b
[de/fef8c4] INDEX | 1 of 1, cached: 1 ✔
[bb/32a3aa] FASTQC (1) | 1 of 1, cached: 1 ✔
[a9/000f36] QUANTIFICATION (1) | 1 of 1, cached: 1 ✔
/home/setup2/hello-nextflow/part2/work/bb/32a3aaa5e5fd68265f0f34df1c87a5/fastqc_gut_logs
/home/setup2/hello-nextflow/part2/work/a9/000f3673536d98c8227b393a641871/gut
The outputs have been emitted one after the other, meaning that they would be processed separately. We need them to be processed together (included in the same MultiQC report), so we need to add one more step.
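The behaviour of mix can be tried in isolation. The following is a small sketch with hypothetical channel names (fastqc_like and quant_like) that carry plain strings instead of real process outputs. Each value should be printed on its own line, and the order in which they appear is not guaranteed.

```groovy
workflow {
    // Hypothetical stand-ins for the FASTQC and QUANTIFICATION output channels
    fastqc_like = Channel.of('fastqc_gut_logs')
    quant_like = Channel.of('gut')

    // mix merges the two channels; every item is still emitted separately
    fastqc_like
        .mix(quant_like)
        .view { item -> "emitted: ${item}" }
}
```

A process reading from this mixed channel would run once per item, which is why mix alone is not enough for MULTIQC.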
Add the collect operator to ensure all samples are processed together in the same process and view the output:
// Define the workflow
workflow {
    // Run the index step with the transcriptome parameter
    INDEX(params.transcriptome_file)
    // Define the fastqc input channel
    reads_in = Channel.fromPath(params.reads)
        .splitCsv(header: true)
        .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }
    // Run the fastqc step with the reads_in channel
    FASTQC(reads_in)
    // Define the quantification channel for the index files
    transcriptome_index_in = INDEX.out[0]
    // Run the quantification step with the index and reads_in channels
    QUANTIFICATION(transcriptome_index_in, reads_in)
    // Define the multiqc input channel
    multiqc_in = FASTQC.out[0]
        .mix(QUANTIFICATION.out[0])
        .collect()
        .view()
}
Run the workflow:
nextflow run main.nf -resume
The channel now emits a single list containing the two directories:
Launching `main.nf` [small_austin] DSL2 - revision: 6ab927f137
[de/fef8c4] INDEX | 1 of 1, cached: 1 ✔
[bb/32a3aa] FASTQC (1) | 1 of 1, cached: 1 ✔
[a9/000f36] QUANTIFICATION (1) | 1 of 1, cached: 1 ✔
[/home/setup2/hello-nextflow/part2/work/bb/32a3aaa5e5fd68265f0f34df1c87a5/fastqc_gut_logs, /home/setup2/hello-nextflow/part2/work/a9/000f3673536d98c8227b393a641871/gut]
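To see how a collected channel pairs with a path "*" input, here is a self-contained sketch that does not need the workshop data. The processes MAKE_DIR and LIST_STAGED are hypothetical: the first creates small directories named after the real outputs, and the second lists what Nextflow staged into its working directory. The assumption is that collect gathers both directories into one list, so LIST_STAGED runs a single task with both directories staged side by side.

```groovy
// Hypothetical process: creates a directory for each name it receives
process MAKE_DIR {
    input:
    val name

    output:
    path "${name}"

    script:
    """
    mkdir ${name}
    touch ${name}/example.txt
    """
}

// Hypothetical process: lists everything staged into its working directory
process LIST_STAGED {
    input:
    path "*"

    output:
    stdout

    script:
    """
    ls -1
    """
}

workflow {
    // Two tasks, one per directory name
    MAKE_DIR(Channel.of('fastqc_gut_logs', 'gut'))

    // collect gathers both directories into one list, so LIST_STAGED runs once
    LIST_STAGED(MAKE_DIR.out.collect())
    LIST_STAGED.out.view()
}
```

Removing .collect() from the sketch should make LIST_STAGED run twice, once per directory, which mirrors the problem described above for MULTIQC.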
Exercise
Now that we have a channel that emits the correct data, remove .view() and assign the channel to a variable called multiqc_in.
Solution
    // Define the quantification channel for the index files
    transcriptome_index_in = INDEX.out[0]
    // Run the quantification step with the index and reads_in channels
    QUANTIFICATION(transcriptome_index_in, reads_in)
    // Define the multiqc input channel
    multiqc_in = FASTQC.out[0]
        .mix(QUANTIFICATION.out[0])
        .collect()
We are now ready to call the MULTIQC process in the workflow.
Exercise
Add the MULTIQC process in the workflow scope with multiqc_in as input.
Solution
    // Define the quantification channel for the index files
    transcriptome_index_in = INDEX.out[0]
    // Run the quantification step with the index and reads_in channels
    QUANTIFICATION(transcriptome_index_in, reads_in)
    // Define the multiqc input channel
    multiqc_in = FASTQC.out[0]
        .mix(QUANTIFICATION.out[0])
        .collect()
    // Run the multiqc step with the multiqc_in channel
    MULTIQC(multiqc_in)
Run the workflow:
nextflow run main.nf -resume
The output should look something like:
Launching `main.nf` [hopeful_swanson] DSL2 - revision: a4304bbe73
[aa/3b8821] INDEX [100%] 1 of 1, cached: 1 ✔
[c2/baa069] QUANTIFICATION [100%] 1 of 1, cached: 1 ✔
[ad/e49b20] FASTQC [100%] 1 of 1, cached: 1 ✔
[a3/1f885c] MULTIQC [100%] 1 of 1 ✔
Inspect results/multiqc_report.html.
You have a working pipeline for a single paired-end sample!
Summary
In this step you have learned:
1. How to
1. How to
1. How to
1. How to
1. How to