2.4 Combining channels and multiple process outputs

Learning objectives

  1. Implement a channel that combines the contents of two channels.
  2. Implement a process with multiple output files.
  3. Improve execution logging with process directives and Groovy.

In this step we will transform 03_multiqc.sh into a process called MULTIQC. This is the final step of our RNAseq data processing workflow: generating a report that summarises the quality control and quantification steps.

To do this, we will run MultiQC, a popular tool for summarising the outputs of many different bioinformatics tools. It aggregates the results from all our analyses and renders them into a single report.

From the MultiQC docs

MultiQC doesn’t do any analysis for you - it just finds results from other tools that you have already run and generates nice reports. See here for a list of supported tools. You can also see an example report here.





Open the bash script 03_multiqc.sh.

03_multiqc.sh
multiqc --outdir results/ results/

This script is a lot simpler than the previous scripts we've worked with. It searches for the output files generated by the FASTQC and QUANTIFICATION processes that were saved to the results/ directory. As specified by --outdir results/, it will write two MultiQC outputs to results/:

  1. A directory called multiqc_data/
  2. A report file called multiqc_report.html

2.4.1 Building the process

1. Process directives, script, and input

Here is the empty process template with the container and publishDir directives we'll be using to get you started. Add this to your main.nf after the QUANTIFICATION process.

main.nf
process MULTIQC {

  container "quay.io/biocontainers/multiqc:1.19--pyhdfd78af_0"
  publishDir params.outdir, mode: 'copy'

  input:
  path "*"

  output:
    < process outputs >

  script:
  """
  multiqc .
  """
}

The script and input follow the MultiQC Nextflow integration recommendations.

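The input path "*" stages every file and directory emitted by the input channel into the task's working directory. Running multiqc . then scans that working directory, so MultiQC finds all of the staged results without needing to be told where each one is.

Recall file staging from Part 1: a process only sees the files that Nextflow stages for it. Passing paths through a channel (gathered with the collect operator, introduced below) is more stable than pointing the process at a directory.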

From the information above we know that the input for multiqc is the results/ directory, specifically, the files and directories within results/. We will need to bring the outputs of the FASTQC (fastqc_gut_logs/) and QUANTIFICATION (gut/) processes into a single channel as input to MULTIQC.

Why you should NOT use the publishDir folder as a process input

It might make sense to have the results/ folder (set by publishDir) as the input to the process here, but it may not exist until the workflow finishes.

Using the publishDir as a process input can cause downstream processes to run prematurely, even if the directory is empty or incomplete. In this case, MultiQC might miss some inputs.

Use channels to pass data between processes. Channels enable Nextflow to track outputs and ensure that downstream processes only run when all required data is ready, maintaining proper workflow control.

More on this in the next section.
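To illustrate the point about channels, here is a minimal sketch of one process feeding another. MAKE_FILE and COUNT_LINES are hypothetical processes, not part of this lesson's pipeline, and the sketch assumes a Unix shell with wc available.

```nextflow
// Hypothetical toy processes, for illustration only
process MAKE_FILE {
    output:
    path "hello.txt"

    script:
    """
    echo "hello" > hello.txt
    """
}

process COUNT_LINES {
    input:
    path infile

    output:
    stdout

    script:
    """
    wc -l < ${infile}
    """
}

workflow {
    MAKE_FILE()

    // COUNT_LINES receives its input from the MAKE_FILE output channel,
    // so Nextflow only starts it once hello.txt has actually been produced
    COUNT_LINES(MAKE_FILE.out)

    COUNT_LINES.out.view()
}
```

Because the dependency is expressed through the channel, nothing here relies on a published directory existing at a particular moment.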


2. Define the process output

Next, add the output definition to the MULTIQC process. MultiQC output consists of the following:

  • An HTML report file called multiqc_report.html
  • A directory called multiqc_data/ containing the data used to generate the report.

Declare each of these as a path output:

path "multiqc_report.html"
path "multiqc_data"

main.nf
process MULTIQC {

  container "quay.io/biocontainers/multiqc:1.19--pyhdfd78af_0"
  publishDir params.outdir, mode: 'copy'

  input:
  path "*"  

  output:
  path "multiqc_report.html"
  path "multiqc_data"

  script:
  """
  multiqc .
  """
}
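When a process declares more than one output, each output becomes its own channel. The sketch below uses a hypothetical MAKE_REPORT process (not part of this lesson's pipeline) to show two ways of referring to them: by position, as the lesson does with out[0], or by an optional emit name. The emit names report and data are assumptions made for this example.

```nextflow
// Hypothetical toy process with two outputs, for illustration only
process MAKE_REPORT {
    output:
    path "report.html", emit: report
    path "report_data", emit: data

    script:
    """
    mkdir report_data
    echo "<html></html>" > report.html
    echo "sample,value" > report_data/summary.csv
    """
}

workflow {
    MAKE_REPORT()

    // Access the first output by its position
    MAKE_REPORT.out[0].view { f -> "by index: ${f.name}" }

    // Access the same output, and the second one, by emit name
    MAKE_REPORT.out.report.view { f -> "by name: ${f.name}" }
    MAKE_REPORT.out.data.view { d -> "data dir: ${d.name}" }
}
```

Positional access keeps the process definition short, while named outputs tend to be easier to read once a process has several of them.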

2.4.2 Combining channels with operators

The goal of this step is to combine outputs from FASTQC and QUANTIFICATION processes into a single input channel for the MULTIQC process. These tools are both supported by MultiQC and their outputs can be detected automatically by MultiQC.

Exercise

Which channels output:

  1. fastqc_gut_logs/
  2. gut/
Solution

fastqc_ch and quant_ch.

The next few steps will involve chaining together Nextflow operators to correctly format inputs for the MULTIQC process.

In the workflow scope, use the mix operator to emit the contents of fastqc_ch and quant_ch in a single channel.

Add the following to the workflow block in your main.nf file, after the QUANTIFICATION call, and inspect the result with the view() operator:

multiqc_in = FASTQC.out[0]
        .mix(QUANTIFICATION.out[0])
        .view()
main.nf
// Define the workflow
workflow {

    // Run the index step with the transcriptome parameter
    INDEX(params.transcriptome_file)

    // Define the fastqc input channel
    reads_in = Channel.fromPath(params.reads)
        .splitCsv(header: true)
        .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }

    // Run the fastqc step with the reads_in channel
    FASTQC(reads_in)

    // Define the quantification channel for the index files
    transcriptome_index_in = INDEX.out[0]

    // Run the quantification step with the index and reads_in channels
    QUANTIFICATION(transcriptome_index_in, reads_in)

    // Define the multiqc input channel
    multiqc_in = FASTQC.out[0]
        .mix(QUANTIFICATION.out[0])
        .view()

}

This channel definition does the following:

  • Takes the output of FASTQC, using [0] to refer to the first output of the process.
  • Uses mix to merge it with the first output of QUANTIFICATION, so that items from both are emitted into a single channel.
  • Uses view() to print the values emitted into the channel.

Run the workflow:

nextflow run main.nf -resume  

The output should look something like:

Output
Launching `main.nf` [stupefied_minsky] DSL2 - revision: 82245ce02b

[de/fef8c4] INDEX              | 1 of 1, cached: 1 ✔
[bb/32a3aa] FASTQC (1)         | 1 of 1, cached: 1 ✔
[a9/000f36] QUANTIFICATION (1) | 1 of 1, cached: 1 ✔
/home/setup2/hello-nextflow/part2/work/bb/32a3aaa5e5fd68265f0f34df1c87a5/fastqc_gut_logs
/home/setup2/hello-nextflow/part2/work/a9/000f3673536d98c8227b393a641871/gut

The outputs have been emitted one after the other, meaning that they would be processed separately. We need them to be processed together (summarised in the same MultiQC report), so we need to add one more step.
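The behaviour of mix can be seen in isolation with a toy workflow. This is a sketch only: fastqc_like and quant_like are hypothetical stand-ins that emit plain strings rather than the real process outputs.

```nextflow
workflow {
    // Hypothetical stand-ins for the two process output channels
    fastqc_like = Channel.of('fastqc_gut_logs')
    quant_like  = Channel.of('gut')

    // mix merges the items of both channels into one channel;
    // each item is still emitted on its own, and the order is not guaranteed
    fastqc_like
        .mix(quant_like)
        .view { item -> "emitted: ${item}" }
}
```

Each item triggers its own view message, which mirrors how a downstream process would be run once per item.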

Add the collect operator to ensure all samples are processed together in the same process and view the output:

main.nf
// Define the workflow
workflow {

    // Run the index step with the transcriptome parameter
    INDEX(params.transcriptome_file)

    // Define the fastqc input channel
    reads_in = Channel.fromPath(params.reads)
        .splitCsv(header: true)
        .map { row -> [row.sample, file(row.fastq_1), file(row.fastq_2)] }

    // Run the fastqc step with the reads_in channel
    FASTQC(reads_in)

    // Define the quantification channel for the index files
    transcriptome_index_in = INDEX.out[0]

    // Run the quantification step with the index and reads_in channels
    QUANTIFICATION(transcriptome_index_in, reads_in)

    // Define the multiqc input channel
    multiqc_in = FASTQC.out[0]
        .mix(QUANTIFICATION.out[0])
        .collect()
        .view()

}

Run the workflow:

nextflow run main.nf -resume  

The channel now emits a single list containing the two directories.

Output
Launching `main.nf` [small_austin] DSL2 - revision: 6ab927f137

[de/fef8c4] INDEX              | 1 of 1, cached: 1 ✔
[bb/32a3aa] FASTQC (1)         | 1 of 1, cached: 1 ✔
[a9/000f36] QUANTIFICATION (1) | 1 of 1, cached: 1 ✔
[/home/setup2/hello-nextflow/part2/work/bb/32a3aaa5e5fd68265f0f34df1c87a5/fastqc_gut_logs, /home/setup2/hello-nextflow/part2/work/a9/000f3673536d98c8227b393a641871/gut]
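As a minimal sketch of what collect does on its own, the toy workflow below gathers every item of a channel into one list before emitting it. The string values are hypothetical placeholders for the real directories.

```nextflow
workflow {
    // Two separate items go in...
    Channel.of('fastqc_gut_logs', 'gut')
        // ...and collect gathers them into a single list emission
        .collect()
        .view { items -> "emitted ${items.size()} items together: ${items}" }
}
```

Because the channel now emits only once, a process that consumes it runs a single task that sees all of the collected items.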

Exercise

Now that we have a channel that emits the correct data, remove .view() and assign the channel to a variable called multiqc_in.

Solution
main.nf
    // Define the quantification channel for the index files
    transcriptome_index_in = INDEX.out[0]

    // Run the quantification step with the index and reads_in channels
    QUANTIFICATION(transcriptome_index_in, reads_in)

    // Define the multiqc input channel
    multiqc_in = FASTQC.out[0]
        .mix(QUANTIFICATION.out[0])
        .collect()

We are now ready to call the MULTIQC process in the workflow.

Exercise

Add the MULTIQC process in the workflow scope with multiqc_in as input.

Solution
main.nf
    // Define the quantification channel for the index files
    transcriptome_index_in = INDEX.out[0]

    // Run the quantification step with the index and reads_in channels
    QUANTIFICATION(transcriptome_index_in, reads_in)

    // Define the multiqc input channel
    multiqc_in = FASTQC.out[0]
        .mix(QUANTIFICATION.out[0])
        .collect()

    // Run the multiqc step with the multiqc_in channel
    MULTIQC(multiqc_in)

Run the workflow:

nextflow run main.nf -resume  

The output should look something like:

Output
Launching `main.nf` [hopeful_swanson] DSL2 - revision: a4304bbe73

[aa/3b8821] INDEX          [100%] 1 of 1, cached: 1 ✔
[c2/baa069] QUANTIFICATION [100%] 1 of 1, cached: 1 ✔
[ad/e49b20] FASTQC         [100%] 1 of 1, cached: 1 ✔
[a3/1f885c] MULTIQC        [100%] 1 of 1 ✔

Open results/multiqc_report.html in a browser to inspect the report.

You have a working pipeline for a single paired-end sample!

Summary

In this step you have learned:

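  1. How to implement a process with multiple output files.
  2. How to combine the outputs of two processes into a single channel with the mix operator.
  3. How to gather all items in a channel into a single emission with the collect operator.
  4. How to inspect the contents of a channel with the view operator.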