2.1. Design and execute an nf-core run command
Before scaling the nf-core/rnaseq pipeline up to a full dataset, we’ll explore the functionality of the workflow and identify processes that may need to be adjusted or customised. While nf-core pipelines are designed to run with ‘sensible’ default settings, these may not always suit the needs of your experiment or compute environment. Designing a custom run command requires you to identify which parameters you need to specify to suit your circumstances and experimental design. In this lesson, we will download an nf-core pipeline, then design and execute a customised run command using various parameters.
2.1.1. Download the pipeline code
➤ In this session we are using Singularity containers to manage software installation for all nf-core/rnaseq tools. Confirm the Singularity cache directory we set in the previous session using the $NXF_SINGULARITY_CACHEDIR Nextflow environment variable:
echo $NXF_SINGULARITY_CACHEDIR
👀 This should match the directory you set in the previous session:
/home/training/singularity_cache
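If the variable comes back empty (for example, in a fresh shell), it can be re-exported before running the pipeline. A minimal sketch, assuming the cache path used in this training (adjust the default for your own system):

```shell
# Re-set the cache directory only if it is not already set in this shell.
# $HOME/singularity_cache is the path used in this training; adjust as needed.
export NXF_SINGULARITY_CACHEDIR="${NXF_SINGULARITY_CACHEDIR:-$HOME/singularity_cache}"

# Make sure the directory exists so Singularity can write images into it.
mkdir -p "$NXF_SINGULARITY_CACHEDIR"
echo "$NXF_SINGULARITY_CACHEDIR"
```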
The pipeline files and institutional configuration files from nf-core/configs will be downloaded to ~/session2/nf-core-rnaseq-3.11.1. Inside this directory you will see 2 subdirectories.
➤ Take a look at the directory structure:
ls -l nf-core-rnaseq-3.11.1/*
nf-core-rnaseq-3.11.1/configs:
total 64
-rwxrwxr-x 1 training training 1562 Apr 21 09:17 CITATION.cff
-rwxrwxr-x 1 training training 1064 Apr 21 09:17 LICENSE
-rwxrwxr-x 1 training training 17476 Apr 21 09:17 README.md
drwxrwxr-x 2 training training 4096 Apr 21 09:17 bin
drwxrwxr-x 3 training training 4096 Apr 21 09:17 conf
-rwxrwxr-x 1 training training 204 Apr 21 09:17 configtest.nf
drwxrwxr-x 4 training training 4096 Apr 21 09:17 docs
-rwxrwxr-x 1 training training 70 Apr 21 09:17 nextflow.config
-rwxrwxr-x 1 training training 8249 Apr 21 09:17 nfcore_custom.config
drwxrwxr-x 2 training training 4096 Apr 21 09:17 pipeline
nf-core-rnaseq-3.11.1/workflow:
total 216
-rwxrwxr-x 1 training training 58889 Apr 21 09:17 CHANGELOG.md
-rwxrwxr-x 1 training training 9681 Apr 21 09:17 CITATIONS.md
-rwxrwxr-x 1 training training 9078 Apr 21 09:17 CODE_OF_CONDUCT.md
-rwxrwxr-x 1 training training 1096 Apr 21 09:17 LICENSE
-rwxrwxr-x 1 training training 10002 Apr 21 09:17 README.md
drwxrwxr-x 3 training training 4096 Apr 21 09:17 assets
drwxrwxr-x 2 training training 4096 Apr 21 09:17 bin
drwxrwxr-x 2 training training 4096 Apr 21 09:17 conf
drwxrwxr-x 3 training training 4096 Apr 21 09:17 docs
drwxrwxr-x 2 training training 4096 Apr 21 09:17 lib
-rwxrwxr-x 1 training training 2736 Apr 21 09:17 main.nf
drwxrwxr-x 4 training training 4096 Apr 21 09:17 modules
-rwxrwxr-x 1 training training 13970 Apr 21 09:17 modules.json
-rwxrwxr-x 1 training training 10847 Apr 21 09:17 nextflow.config
-rwxrwxr-x 1 training training 42576 Apr 21 09:17 nextflow_schema.json
-rwxrwxr-x 1 training training 359 Apr 21 09:17 pyproject.toml
drwxrwxr-x 4 training training 4096 Apr 21 09:17 subworkflows
-rwxrwxr-x 1 training training 1684 Apr 21 09:17 tower.yml
drwxrwxr-x 2 training training 4096 Apr 21 09:17 workflows
The public institutional configs were downloaded to the configs directory. The code base for our pipeline is stored in the workflow directory. The files and directories we will be working with in this session are:
Feature | Importance |
---|---|
conf/ | Contains files, default configuration settings and optional profiles that build on global settings set by nextflow.config
main.nf | The executable Nextflow script that defines the structure and flow of the workflow. It calls workflows/rnaseq.nf
modules/ | Contains Nextflow processes used by the workflow. Each process is split into a module with its own main.nf file
workflows/rnaseq.nf | The complete rnaseq pipeline, containing modules and subworkflows that are connected by channels
2.1.2. Design your run command
As we learnt in lesson 1.3.3, all nf-core pipelines have a unique set of pipeline-specific parameters that can be used in conjunction with Nextflow parameters to configure the workflow. Generally, nf-core pipelines can be customised at a few different levels:
Level of effect | Customisation feature |
---|---|
The workflow | Where diverging methods are available for a pipeline, you may choose a path to follow |
A process | Where more than one tool is available for a single step, you may choose which to use |
A tool | Apply specific thresholds or optional flags for a tool on top of the default run command |
Compute resources | Specify resource thresholds or software execution methods for the workflow or a process |
All nf-core pipelines are provided with comprehensive documentation that explains what the default workflow structure entails and the options for customising it based on your needs. It is important to remember that nf-core pipelines typically do not expose every possible tool parameter, so you will often need to piece together information from the pipeline documentation, the parameter listings, and the tool documentation to determine which parameters you should be using.
The following sections of the documentation can be used to understand what the pipeline is doing and inform your choices about aspects of pipeline-specific customisations:
Docs | Description | Customisation level |
---|---|---|
Introduction | Workflow summary |
Usage | Inputs and options |
Parameters | Available flags |
Output | Files from all processes |
Pipeline structure
👀 Looking at the nf-core/rnaseq pipeline structure provided in the introduction, we can see that the developers have:
- Organised the workflow into 5 stages based on the type of work that is being done
- Provided a choice of multiple methods and specified defaults
- Provided a choice of tool for some steps
Default pipeline usage
The number and type of default and optional parameters an nf-core pipeline accepts is at the discretion of its developers. However, at a minimum, nf-core pipelines typically:
- Require users to specify a sample sheet (--input) detailing sample data and relevant metadata
- Autogenerate or acquire missing reference files from iGenomes (using the --genome flag) if not provided by the user
➤ You can see the recommended (typical) run command and all the parameters available for the nf-core/rnaseq pipeline by running:
nextflow run nf-core-rnaseq-3.11.1/workflow/main.nf --help
The typical or recommended run command for this pipeline is provided at the top of the help output. It outlines a requirement for a few basic things:
- An input samplesheet
- A location to store outputs
- Relevant reference data
- A software management method
Required input: --input
Most of us will need to adjust the default run command for our experiments. Today we’ll be adjusting the typical nf-core/rnaseq run command by:
- Providing our own reference files
- Using the Singularity software management profile, instead of Docker
- Customising the execution of some processes
- Specifying the computing resource limitations of our instances (2 CPUs, 8 GB RAM)
➤ Our input fastq files (fastqs/), reference data (mm10_reference/), and full sample sheet (samplesheet.csv) are already available on an external file system called CernVM-FS that we can access from our Nimbus instances. Take a look at the files:
ls -l /cvmfs/data.biocommons.aarnet.edu.au/training_materials/SIH_training/UnlockNfcore_0523
drwxrwxr-x 2 training training 4096 Feb 14 05:36 fastqs
drwxrwxr-x 3 training training 4096 Feb 14 05:46 mm10_reference
-rw-rw-r-- 1 training training 641 Feb 16 05:57 samplesheet.csv
➤ Our CVMFS path is very long. For the sake of tidiness, store it in a variable for use in our run command:
materials=/cvmfs/data.biocommons.aarnet.edu.au/training_materials/SIH_training/UnlockNfcore_0523
➤ Given we are only testing the pipeline in this session, we only need to work with a couple of samples. Copy the first two samples from the full prepared sample sheet to a local version of the file:
head -n 3 $materials/samplesheet.csv > samplesheet.csv
sample,fastq_1,fastq_2,strandedness
SRR3473989,/cvmfs/data.biocommons.aarnet.edu.au/training_materials/SIH_training/UnlockNfcore_0523/fastqs/SRR3473989_selected.fastq.gz,,forward
SRR3473988,/cvmfs/data.biocommons.aarnet.edu.au/training_materials/SIH_training/UnlockNfcore_0523/fastqs/SRR3473988_selected.fastq.gz,,forward
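If you build or trim a samplesheet by hand, a quick field check can catch malformed rows before the pipeline does. A sketch using a toy sheet with the same column layout (the awk one-liner is a hypothetical helper, not part of nf-core, and the accepted strandedness values should be checked against the pipeline docs):

```shell
# Write a toy sheet with the same columns as our samplesheet.csv.
cat > /tmp/toy_samplesheet.csv <<'EOF'
sample,fastq_1,fastq_2,strandedness
SRR3473989,fastqs/SRR3473989_selected.fastq.gz,,forward
SRR3473988,fastqs/SRR3473988_selected.fastq.gz,,forward
EOF

# Flag data rows that don't have 4 comma-separated fields or a
# recognised strandedness value (empty fastq_2 still counts as a field).
bad=$(awk -F',' 'NR > 1 && (NF != 4 || $4 !~ /^(forward|reverse|unstranded)$/)' /tmp/toy_samplesheet.csv | wc -l)
echo "malformed rows: $bad"
```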
Required input: reference data
Many nf-core pipelines have a minimum requirement for reference data inputs. The input reference data requirements for this pipeline are provided in the usage documentation. We can replace the --genome flag in the typical run command with our own files. To see which reference files we can specify using parameters, rerun the pipeline’s help command to view all the available parameters:
nextflow run nf-core-rnaseq-3.11.1/workflow/main.nf --help --show_hidden_params
From the Reference genome options parameters, we will provide our own files using:
--fasta $materials/mm10_reference/mm10_chr18.fa
--gtf $materials/mm10_reference/mm10_chr18.gtf
--star_index $materials/mm10_reference/STAR
--salmon_index $materials/mm10_reference/salmon-index
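Before launching, it can save time to confirm that every reference input actually exists at the path you will pass. A minimal pre-flight sketch, using a stand-in directory in place of the CVMFS path so it runs anywhere:

```shell
# Stand-in for $materials/mm10_reference, created here so the check is self-contained.
refdir=/tmp/mm10_reference_demo
mkdir -p "$refdir/STAR" "$refdir/salmon-index"
touch "$refdir/mm10_chr18.fa" "$refdir/mm10_chr18.gtf"

# Check each reference input the run command will use.
missing=0
for ref in mm10_chr18.fa mm10_chr18.gtf STAR salmon-index; do
    [ -e "$refdir/$ref" ] || { echo "MISSING: $ref"; missing=1; }
done
echo "missing=$missing"
```

On the real data you would point `refdir` at `$materials/mm10_reference` instead of creating stand-in files.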
Optional parameters
Now that we have prepared our input and reference data, we will customise the typical run command by:
- Using Nextflow’s -profile parameter to specify that we will be running the Singularity profile instead of the Docker profile
- Adding additional process-specific flags to skip duplicate read marking, save trimmed reads, and save unaligned reads
- Adding additional max resource flags to specify the number of CPUs and amount of memory available to the pipeline
The parameters we will use are:
-profile singularity
--skip_markduplicates true
--save_trimmed true
--save_unaligned true
--max_memory '6.GB'
--max_cpus 2
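As a run command accumulates flags, it can be tidier to collect the pipeline parameters into a file and pass it with Nextflow’s -params-file option (Nextflow-level options such as -profile still go on the command line). A sketch of an equivalent params file, written here via a heredoc:

```shell
# Collect the pipeline parameters above into a YAML params file.
cat > params.yaml <<'EOF'
skip_markduplicates: true
save_trimmed: true
save_unaligned: true
max_memory: '6.GB'
max_cpus: 2
EOF

# The run command then shortens to (not executed here):
#   nextflow run nf-core-rnaseq-3.11.1/workflow/main.nf \
#       -profile singularity -params-file params.yaml ...
cat params.yaml
```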
You can see how we’ve customised the typical run command in the diagram below:
2.1.3. Run the pipeline
➤ Now that we have prepared our data and chosen which parameters to apply, run the pipeline:
nextflow run nf-core-rnaseq-3.11.1/workflow/main.nf \
--input samplesheet.csv \
--outdir Lesson-2.1 \
--fasta $materials/mm10_reference/mm10_chr18.fa \
--gtf $materials/mm10_reference/mm10_chr18.gtf \
--star_index $materials/mm10_reference/STAR \
--salmon_index $materials/mm10_reference/salmon-index \
-profile singularity \
--skip_markduplicates \
--save_trimmed true \
--save_unaligned true \
--max_memory '6.GB' \
--max_cpus 2
👀 Take a look at the stdout printed to the screen. Your workflow configuration and parameter customisations are all documented here. You can use this to confirm that your parameters have been correctly passed to the run command:
As the workflow starts, you will also see a number of processes spawn out underneath this. Recall from session 1 that processes are executed independently and can run in parallel. Nextflow manages the data dependencies between processes, ensuring that each process is executed only when its input data is available and all of its dependencies have been satisfied.
To understand how this is coordinated, consider the STAR_ALIGN process that is being run.
👀 You’ll notice a few things:
- We can see which inputs are being processed by looking at the end of the process name
- When a process starts, it progressively spawns tasks for all inputs to be processed
- A single TRIMGALORE process is run across both samples in our samplesheet.csv before STAR_ALIGN begins
- Once a TRIMGALORE task is completed for a sample, the STAR_ALIGN task for that sample begins
- When the STAR_ALIGN process starts, it spawns 2 tasks
2.1.4. Examine the outputs
Once your pipeline has completed, you should see this message printed to your terminal:
-[nf-core/rnaseq] Pipeline completed successfully with skipped sampl(es)-
-[nf-core/rnaseq] Please check MultiQC report: 2/2 samples failed strandedness check.-
Completed at: 21-Apr-2023 03:58:56
Duration : 9m 16s
CPU hours : 0.3
Succeeded : 66
The pipeline ran successfully; however, note the warning that all samples failed the strandedness check. We’ll explore that in the next lesson.
➤ In the meantime, list (ls -la) the contents of your directory. You’ll see a few new directories (and a hidden directory and log file) have been created:
total 416
drwxrwxr-x 7 training training 4.0K Apr 21 03:44 .
drwxr-x--- 15 training training 4.0K Apr 21 01:56 ..
drwxrwxr-x 4 training training 4.0K Apr 21 03:58 .nextflow
-rw-rw-r-- 1 training training 371K Apr 21 03:58 .nextflow.log
-rw-rw-r-- 1 training training 17K Apr 21 03:50 .nextflow.log.1
drwxrwxr-x 7 training training 4.0K Apr 21 03:58 Lesson-2.1
drwxrwxr-x 4 training training 4.0K Apr 21 02:08 nf-core-rnaseq-3.11.1
-rw-rw-r-- 1 training training 563 Apr 21 03:14 samplesheet.csv
drwxrwxr-x 143 training training 4.0K Apr 21 03:58 work
👀 Nextflow has created 2 new output directories, work and Lesson-2.1 in the current directory.
The work directory
As each job is run, a unique subdirectory is created inside the work directory. These directories house temporary files and the various command logs created by a process. Here we can find all the information we need to troubleshoot a failed process.
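Each task subdirectory is named by a two-character hash prefix plus a longer hash, and inside it Nextflow writes hidden files such as .command.sh (the script that was run), .command.log, and .exitcode. A sketch of how to locate task scripts, using a mock layout (work_demo, a hypothetical directory) so it runs anywhere:

```shell
# Mock a single task directory in the layout Nextflow uses: work/<2-char>/<hash>/
mkdir -p work_demo/ab/1b2c3d
touch work_demo/ab/1b2c3d/.command.sh \
      work_demo/ab/1b2c3d/.command.log \
      work_demo/ab/1b2c3d/.exitcode

# List every task script — handy when tracking down a failed process.
# On a real run this would be: find work -name '.command.sh'
find work_demo -name '.command.sh'
```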
The Lesson-2.1 directory
All final outputs will be presented in the directory specified by the --outdir flag.
The .nextflow directory
This directory contains a cache subdirectory that stores cached data, such as downloaded files, which can be used to speed up subsequent pipeline runs. It also contains a history file that records pipeline executions, including the run time, the unique run name, and the command-line arguments used.
The .nextflow.log file
This file is created by Nextflow during the execution of a pipeline and contains information about all processes, along with any warnings or errors that occurred during execution.
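When a run misbehaves, grepping the log for WARN or ERROR lines is often the quickest first step. A sketch on a mock log file (the line format here is illustrative, not Nextflow’s exact layout):

```shell
# Build a small mock log to illustrate the search.
printf '%s\n' \
  'Apr-21 03:50:01.000 [main] INFO  nextflow.cli.Launcher - pipeline started' \
  'Apr-21 03:51:12.000 [main] WARN  nextflow.Nextflow - strandedness check failed' \
  > nextflow_log_demo.txt

# Count warning lines (on a real run: grep -c 'WARN' .nextflow.log).
grep -c 'WARN' nextflow_log_demo.txt
```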