Part 2
Part 2 goals and scope
Part 2 replicates a "real-life" example of how a bioscientist would encounter and build a Nextflow pipeline for the first time. The key goals of Part 2 include:
- Demonstrating how the foundational Nextflow concepts introduced in Part 1 are applied in a practical scenario
- Reiterating the key Nextflow concepts and components required for a simple bioinformatics workflow
What Part 2 is not:
- A guide on bioinformatics or data analysis — avoid reviewing the inner workings of tools or the biology of each input and output file in detail
Structured and repeated development across lessons
Each lesson (2.1 to 2.5) focuses on converting a bash script into a modular Nextflow process. The repetition emphasises that some boilerplate stays the same across tools, scripts, and analyses in Nextflow pipelines, while the contents of each block must be tailored to what the tool takes as input and produces as output.
For each lesson, you will guide learners to:
- Review an existing bash script that conducts one bioinformatics data processing step
- Identify which components need to be implemented in a Nextflow process definition
- Build that Nextflow process step-by-step, following a similar order to Part 1
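The conversion pattern repeated in each lesson can be sketched as follows. This is a minimal illustration rather than one of the workshop's scripts: the one-line bash command, the process name COUNT_LINES, and the params.input parameter are all hypothetical.

```groovy
// Hypothetical bash original:
//   wc -l reads.txt > line_count.txt

process COUNT_LINES {
    input:
    path input_file            // the file the bash script read

    output:
    path "line_count.txt"      // the file the bash script wrote

    script:
    """
    wc -l ${input_file} > line_count.txt
    """
}

workflow {
    // params.input is an assumed parameter name, supplied as --input on the command line
    COUNT_LINES(channel.fromPath(params.input))
}
```

The block names (input:, output:, script:) are the part that stays fixed from lesson to lesson; only their contents change with the tool.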
2.0 Introduction
This section sets the context for Part 2: converting a set of individual bash scripts into a simple, end-to-end Nextflow pipeline for RNA-seq data preparation.
You will introduce:
- The files and data used throughout Part 2
- The bioinformatics tools being used
- The Nextflow files that learners will be modifying
- The high-level structure of the workflow that learners will have built by the end of the workshop
2.1 Our first process and container
This is the first occurrence of reviewing a bash script and converting it to a Nextflow process. Walk learners through the bash script before asking them to identify the components that map to Nextflow process blocks.
- Recall that there is boilerplate code unlikely to change across processes (input:, output:, script:), but the contents of each must be tailored to the tool being run
- Learners do not need to understand the inner workings of containers; focus on why they are used, where to find them, and how to specify one in a process
Why containers?
There are many ways to manage software (e.g. modules, conda) but using containers in Nextflow is best practice. Containers package a tool together with its software environment so that everyone runs the same tool, same version, and same dependencies, regardless of where the workflow is executed.
For bioinformatics, recommend learners use Singularity (widely supported on HPC systems) and BioContainers hosted on quay.io, which provides pre-built containers for thousands of tools and is supported by Nextflow, nf-core, and Galaxy.
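Specifying a container in a process could look like the sketch below. FastQC is used only as a stand-in tool, and the image tag is an assumption included to show the BioContainers naming scheme; check quay.io/biocontainers for the tags that actually exist before using it. The params.reads parameter is likewise hypothetical.

```groovy
process FASTQC {
    // Assumed image and tag, shown to illustrate the naming scheme only
    container 'quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0'

    input:
    path reads

    output:
    path "*_fastqc.html"

    script:
    """
    fastqc ${reads}
    """
}

workflow {
    // params.reads is an assumed parameter name pointing at a FASTQ file
    FASTQC(channel.fromPath(params.reads))
}
```

The container directive only takes effect once a container engine is enabled, for example with singularity.enabled = true in nextflow.config.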
2.2 Samplesheets, operators, and Groovy
This lesson introduces structured inputs via a samplesheet and the use of operators to reshape channel data. Operators are often one of the hardest concepts for Nextflow beginners — focus on the why, not the mechanics.
- Emphasise that operators exist to transform input data into the exact structure required by the next process
- Learners do not need to memorise operators — they need to understand that processes expect inputs in a specific format and operators are how you get data into that format
- Use .view() frequently to show learners what channel contents look like before and after an operator is applied
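A before-and-after demonstration with .view() might look like the following sketch. The samplesheet column names (sample, fastq_1, fastq_2) and the params.samplesheet parameter are assumptions; substitute the columns of the samplesheet actually used in the workshop.

```groovy
workflow {
    // Assumed samplesheet columns: sample, fastq_1, fastq_2
    reads_ch = channel.fromPath(params.samplesheet)
        .splitCsv(header: true)                      // one map per row, keyed by column name
        .view { row -> "before map: ${row}" }        // inspect the raw rows
        .map { row -> tuple(row.sample, [file(row.fastq_1), file(row.fastq_2)]) }
        .view { item -> "after map: ${item}" }       // inspect the reshaped tuples
}
```

Running this shows learners the same rows twice: first as parsed CSV records, then in the sample name plus file paths structure that a downstream process would expect.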
Tuples: grouping related data
Tuples are introduced to group related pieces of information together (e.g. sample name + file paths) and prevent accidental mixing of files and metadata. At this stage, focus on the why, not the mechanics.
Tuples are revisited in more depth in later lessons.
Development and debugging best practices
Use this lesson to model real-world Nextflow development habits:
- Encourage frequent use of .view() to inspect channel outputs
- Encourage regular use of -resume to avoid re-running completed steps
- Normalise frequent and fast iteration when learners develop their own pipelines
2.3 Multiple process inputs
This lesson applies the process-chaining concepts from Lesson 1.7 to a bioinformatics use case. Learners extend their workflow by defining a process that accepts multiple inputs and wiring it to upstream processes.
One key concept to clarify is when inputs should be grouped in a single tuple versus kept as separate inputs:
- Keep files, values, and metadata together in a tuple if they must travel together (e.g. sample ID + paired FASTQs)
- Keep inputs separate if they are shared across samples or constant (e.g. a reference index used by all samples)
- Grouping everything into a tuple introduces unnecessary data reshaping; splitting incorrectly can cause unexpected behaviour such as samples not running
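The tuple-versus-separate distinction can be illustrated with a sketch like the one below. The process name QUANTIFY, the parameter names, the samplesheet columns, and the placeholder echo command are all hypothetical; the echo stands in for a real tool invocation.

```groovy
process QUANTIFY {
    input:
    tuple val(sample_id), path(reads)   // travel together: one tuple per sample
    path index                          // shared: the same index for every sample

    output:
    tuple val(sample_id), path("${sample_id}.txt")

    script:
    """
    echo "processing ${sample_id} (${reads}) against ${index}" > ${sample_id}.txt
    """
}

workflow {
    // Assumed parameter names and samplesheet columns
    reads_ch = channel.fromPath(params.samplesheet)
        .splitCsv(header: true)
        .map { row -> tuple(row.sample, [file(row.fastq_1), file(row.fastq_2)]) }

    // A plain file is treated as a value and reused by every task.
    // A queue channel holding a single index would be used up by the first
    // sample, leaving the remaining samples without a task.
    index = file(params.index)

    QUANTIFY(reads_ch, index)
}
```

The comment on the index line is a useful talking point for the "samples not running" failure mode described above.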
Accessing process outputs
There are different valid ways to access process outputs:
- Adding .out to the end of a process name works for processes with a single output channel
- For a process with a single output channel, .out[0] is identical to .out
- emit can be used to name outputs and is preferred in more complex workflows
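The three access styles can be shown side by side with a toy process. SAY_HELLO and the greeting label are hypothetical names chosen for illustration.

```groovy
process SAY_HELLO {
    output:
    path "hello.txt", emit: greeting   // emit gives the output channel a name

    script:
    """
    echo hello > hello.txt
    """
}

workflow {
    SAY_HELLO()

    SAY_HELLO.out.view()            // works because there is only one output channel
    SAY_HELLO.out[0].view()         // same channel, accessed by position
    SAY_HELLO.out.greeting.view()   // same channel, accessed by name
}
```

Named access keeps working if another output is later added to the process, whereas positional access depends on the order of the output declarations.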
2.4 Combining channels and multiple process outputs
This lesson demonstrates how combining channels allows outputs from multiple processes to be collected and passed to a single downstream process.
MultiQC is the ideal example here: it requires outputs from all upstream tools to be aggregated before it can run.
- Note that the input: path "*" and script: multiqc . pattern follows the MultiQC Nextflow integration recommendations
- State clearly that this permissive pattern is generally bad practice; it is used here because MultiQC requires it, not as a model to follow
- Communicate that in most cases it is strongly preferred to be explicit about which files and folders a process receives, as this improves error handling, testing, and maintainability
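Collecting several upstream outputs into one MultiQC task could be sketched as below. The two input channels are placeholders built from assumed parameters (params.qc_reports, params.quant_logs) and stand in for the outputs of upstream processes; the container directive is omitted for brevity.

```groovy
process MULTIQC {
    input:
    path "*"                      // permissive on purpose: accept whatever upstream produced

    output:
    path "multiqc_report.html"

    script:
    """
    multiqc .
    """
}

workflow {
    // Placeholder channels standing in for upstream process outputs
    qc_ch    = channel.fromPath(params.qc_reports)
    quant_ch = channel.fromPath(params.quant_logs)

    // mix merges the channels into one stream; collect gathers every item
    // into a single list so that MULTIQC runs exactly once
    multiqc_in = qc_ch.mix(quant_ch).collect()

    MULTIQC(multiqc_in)
}
```

Without collect, MULTIQC would be launched once per file rather than once for the whole set, which is worth pointing out to learners.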
2.5 Upscaling to multiple samples and introspection
This final lesson demonstrates how a workflow built for a single sample scales to many samples without any changes to the pipeline code itself.
- Advise learners that scaling in Nextflow is primarily a configuration problem, not a coding problem — by updating the samplesheet, the same workflow logic runs tasks in parallel across all samples
- This section is only an introduction to configuration, benchmarking, and scaling concepts
- Direct learners to the Nextflow on HPC workshop, which was developed as a sequel to this introductory workshop, for a full treatment of these topics