Skip to content

1.4 Introduction to nf-core/rnaseq

Objectives

  • Learn about the nf-core/rnaseq pipeline
  • Understand the levels of customisation available for nf-core pipelines
  • Use the nf-core documentation to select appropriate parameters for a run command
  • Write a run command for nf-core/rnaseq
  • Explore pipeline deployment and outputs

1.4.1 An introduction to nf-core/rnaseq

For the remainder of this workshop, we will be working with a real-world nf-core pipeline for RNA-seq analysis: nf-core/rnaseq. The nf-core website describes this pipeline as "a bioinformatics pipeline that can be used to analyse RNA sequencing data obtained from organisms with a reference genome and annotation". It implements a variety of tools and has several branching paths that allow users to select the type of alignment and post-processing they desire, as demonstrated in the nf-core metro-map below:

In this lesson, we will download nf-core/rnaseq (version 3.23.0 - the latest version at the time of writing), explore its functionality, and identify processes that may need to be adjusted or customised. By the end of the lesson, we will have constructed a basic run command to execute the pipeline, and then in Session 2 we will dive further into the customisation options. While nf-core pipelines are designed to run with 'sensible' default settings, these may not always suit the needs of your experiment or compute environment. Designing a custom run command requires you to identify which pipeline parameters you need to specify to suit your circumstances and experimental design, and which configurations you may need to apply to efficiently execute the pipeline on your compute platform

Let's start by creating a fresh working directory for all our experiments, and then download the pipeline to that directory.

Exercise 1.4.1 ⏱ 5 mins

First, make a new session2 folder in your home directory and open it in VS Code.

mkdir ~/session2

Open the new directory via the VS Code GUI (File > Open Folder...) and navigating to it.

Next, clone the nf-core/rnaseq pipeline to your working directory. Ensure that you download the version 3.23.0 by using the -r parameter.

nextflow clone -r 3.23.0 nf-core/rnaseq

The pipeline files will be downloaded to ~/session2/rnaseq. Let's have a look at the directory structure.

ls -lh rnaseq

You should see the following:

Output
total 340K
drwxrwxr-x 2 tdev01 tdev01 4.0K Apr 16 03:31 assets
drwxrwxr-x 2 tdev01 tdev01 4.0K Apr 16 03:31 bin
-rwxrwxr-x 1 tdev01 tdev01 113K Apr 16 03:31 CHANGELOG.md
-rwxrwxr-x 1 tdev01 tdev01  11K Apr 16 03:31 CITATIONS.md
-rwxrwxr-x 1 tdev01 tdev01  14K Apr 16 03:31 CODE_OF_CONDUCT.md
drwxrwxr-x 2 tdev01 tdev01 4.0K Apr 16 03:31 conf
drwxrwxr-x 5 tdev01 tdev01 4.0K Apr 16 03:31 docs
-rwxrwxr-x 1 tdev01 tdev01 1.1K Apr 16 03:31 LICENSE
-rwxrwxr-x 1 tdev01 tdev01 7.0K Apr 16 03:31 main.nf
drwxrwxr-x 4 tdev01 tdev01 4.0K Apr 16 03:31 modules
-rwxrwxr-x 1 tdev01 tdev01  24K Apr 16 03:31 modules.json
-rwxrwxr-x 1 tdev01 tdev01  17K Apr 16 03:31 nextflow.config
-rwxrwxr-x 1 tdev01 tdev01  57K Apr 16 03:31 nextflow_schema.json
-rwxrwxr-x 1 tdev01 tdev01 1.5K Apr 16 03:31 nf-test.config
-rwxrwxr-x 1 tdev01 tdev01  13K Apr 16 03:31 README.md
-rwxrwxr-x 1 tdev01 tdev01  23K Apr 16 03:31 ro-crate-metadata.json
drwxrwxr-x 4 tdev01 tdev01 4.0K Apr 16 03:31 subworkflows
drwxrwxr-x 2 tdev01 tdev01 4.0K Apr 16 03:31 tests
-rwxrwxr-x 1 tdev01 tdev01 3.0K Apr 16 03:31 tower.yml
drwxrwxr-x 3 tdev01 tdev01 4.0K Apr 16 03:31 workflows

The files and directories we will be interacting with in this workshop are:

Feature Importance
conf/ Contains files, default configuration settings and optional profiles that build on global settings set by nextflow.config
main.nf The executable Nextflow script that defines the structure and flow of the workflow. It calls workflows/rnaseq.nf
modules/ Contains Nextflow processes used by the workflow. Each process is split into a module with its own main.nf file
workflows/rnaseq.nf The complete rnaseq pipeline, containing modules and subworkflows that are connected by channels

1.4.2 Design your run command

Now that we have the pipeline code downloaded, we need to design the nextflow run command we will use to execute it. But before we can do that, we need to understand what the pipeline is doing and what parameters are available to us to customise.

👀 Looking again at the nf-core/rnaseq pipeline structure provided in the introduction, we can see that the developers have:

  1. Organised the pipeline into 5 stages based on the type of work that is being done
  2. Provided a choice of multiple methods
  3. Provided a choice of tool for some steps

Poll 1.4.2 ⏱ 2 min

❓ Observing the diagram above, which statement is true regarding the choice of alignment and quantification methods provided by the nf-core/rnaseq pipeline?

  • a. The pipeline uses a fixed method for alignment and quantification
  • b. Users can choose between several different methods for alignment and quantification
  • c. The pipeline always performs alignment and quantification using STAR or HISAT2
  • d. The choice of alignment and quantification method is determined automatically based on the input data
Answer

The correct answer is b. The nf-core/rnaseq pipeline allows users to choose between pseudo-alignment and quantification or several different methods for genome-based read alignment and quantification.

  • a is incorrect because the pipeline is not limited to a single method.
  • c is incorrect because HISAT2 is an alignment tool, not a quantification tool, and users can choose between one of three alignment tools (STAR, HISAT2, and Bowtie2) and two quantification tools (RSEM and Salmon), or they can choose the joint pseudo-alignment and quantification method.
  • d is also incorrect, as the pipeline only accepts fastq files as input and the choice of alignment and quantification method must be specified by the user.

The choices shown in the metro-map above demonstrate the flexibility common to many nf-core pipelines, and thus the importance of learning how to customise nf-core runs to best suit your data. Alternative tools and methods are routinely incorporated in pipelines, as each dataset is unique and what works well for one may not work well for another. Once you are familiar with customising nf-core pipelines, you can easily try out different combinations of tools and parameters to get the best results out of your data 🏆

As we learnt in Lesson 1.3.3, all nf-core pipelines have a unique set of pipeline-specific parameters that can be used in conjunction with Nextflow parameters to configure the pipeline. Generally, nf-core pipelines can be customised at a few different levels:

Level of effect Customisation feature
The workflow Where diverging methods are available for a pipeline, you may choose a path to follow
A process Where more than one tool is available for a single step, you may choose which to use
A tool Apply specific thresholds or optional flags for a tool on top of the default run command
Compute resources Specify resource thresholds or software execution methods for the workflow or a process

All nf-core pipelines are provided with comprehensive documentation that explain what the default pipeline structure entails and options for customising this based on your needs. It is important to remember that nf-core pipelines typically do not include all possible tool parameters. This makes it challenging to piece these different sources of information together to determine which parameters you should be using.

The following sections of the documentation can be used to understand what the pipeline is doing and inform your choices about aspects of pipeline-specific customisations:

Docs Description Customisation level
Introduction pipeline summary
  • workflow
  • process
Usage Inputs and options
  • workflow
  • process
Parameters Available flags
  • workflow
  • process
  • compute resources
Output Files from all processes
  • workflow
  • process
  • tool

1.4.2.1 Default pipeline usage

The number and type of default and optional parameters an nf-core pipeline accepts is at the discretion of its developers. However, at a minimum, nf-core pipelines typically:

  • Require users to specify a sample sheet (--input) detailing sample data and relevant metadata
  • Autogenerate or acquire missing reference files from iGenomes (using the --genome parameter) if not provided by the user

Exercise 1.4.2.1 ⏱ 1 min

Recall that we can print out pipeline information, including available paramters, with the --help parameter. Print out the available parameters for the nf-core/rnaseq pipeline by running:

nextflow run rnaseq --help 

The typical or recommended run command for this pipeline is provided at the top of the screen:

It outlines a requirement for a few basic things:

  • An input samplesheet
  • A location to store outputs
  • A software management method

Reminder: hyphens matter in Nextflow!

Nextflow-specific parameters use one (-) hyphen, whereas pipeline-specific parameters use two (--). In the typical run command above -profile is a Nextflow parameter, while --input is an nf-core parameter.

Most of us will need to adjust the default run command for our experiments. To get us started with running nf-core/rnaseq, we will be adjusting the typical run command by:

  1. Providing our own reference files
  2. Using the Singularity software management profile
  3. Specifying the computing resource limitations of our VMs (2 CPUs, 8 GB RAM)
  4. Overwriting some default parameters to speed up execution and keep intermediate files

1.4.2.2 Our dataset

Our input FASTQ files (fastq/), reference data (mm10_reference/), and full sample sheet (samplesheet.csv) are already available within the home directory. Let's take a look at the input files.

Exercise 1.4.2.2 ⏱ 2 min

Start by listing the ~/data directory to see what it contains:

ls -l ~/data
Output
-rwxr-xr-x 1 training training  641 Feb 16 05:57 create_samplesheet.sh
drwxr-xr-x 2 training training 4096 Feb 14 05:36 fastq
drwxr-xr-x 3 training training 4096 Feb 14 05:46 mm10_reference
-rw-r--r-- 1 training training  641 Feb 16 05:57 samplesheet.csv
-rw-r--r-- 1 training training  641 Feb 16 05:57 samplesheet.template.csv

We can also use the tree command to get a deeper look at the folder structure:

tree ~/data
Output
data
├── create_samplesheet.sh
├── fastq
│   ├── SRR3473988_selected.fastq.gz
│   └── SRR3473989_selected.fastq.gz
├── mm10_reference
│   ├── mm10_chr18.fa
│   ├── mm10_chr18.gtf
│   ├── salmon-index
│   │   ├── complete_ref_lens.bin
│   │   ├── ctable.bin
│   │   ├── ctg_offsets.bin
│   │   ├── duplicate_clusters.tsv
│   │   ├── info.json
│   │   ├── mphf.bin
│   │   ├── pos.bin
│   │   ├── pre_indexing.log
│   │   ├── rank.bin
│   │   ├── refAccumLengths.bin
│   │   ├── ref_indexing.log
│   │   ├── reflengths.bin
│   │   ├── refseq.bin
│   │   ├── seq.bin
│   │   └── versionInfo.json
│   └── STAR
│       ├── chrLength.txt
│       ├── chrNameLength.txt
│       ├── chrName.txt
│       ├── chrStart.txt
│       ├── exonGeTrInfo.tab
│       ├── exonInfo.tab
│       ├── geneInfo.tab
│       ├── Genome
│       ├── genomeParameters.txt
│       ├── Log.out
│       ├── SA
│       ├── SAindex
│       ├── sjdbInfo.txt
│       ├── sjdbList.fromGTF.out.tab
│       ├── sjdbList.out.tab
│       └── transcriptInfo.tab
├── samplesheet.csv
└── samplesheet.template.csv

5 directories, 38 files

Finally, take a look at the samplesheet.csv file to see what information the nf-core/rnaseq pipeline requires for each sample:

cat ~/data/samplesheet.csv
Output
sample,fastq_1,fastq_2,strandedness
SRR3473988,/home/training/data/fastq/SRR3473988_selected.fastq.gz,,forward
SRR3473989,/home/training/data/fastq/SRR3473989_selected.fastq.gz,,forward

Sample data source

The sample data was obtained from a public dataset that has been made available on NCBI's BioProject collection and published in the following paper: Corley, S.M., Canales, C.P., Carmona-Mora, P. et al. RNA-Seq analysis of Gtf2ird1 knockout epidermal tissue provides potential insights into molecular mechanisms underpinning Williams-Beuren syndrome. BMC Genomics 17, 450 (2016). https://doi.org/10.1186/s12864-016-2801-4 (BioProject ID: PRJNA320433).

1.4.2.3 Reference data

Many nf-core pipelines have a minimum requirement for reference data inputs. The input reference data requirements for this pipeline are provided in the usage documentation. To see what reference files we can specify using parameters, look back to the output from the pipeline's help command and review the Reference genome options section:

nextflow run rnaseq --help
Reference genome options
    --genome                      [string]  Name of iGenomes reference. 
    --fasta                       [string]  Path to FASTA genome file. 
    --gtf                         [string]  Path to GTF annotation file. 
    --gff                         [string]  Path to GFF3 annotation file. 
    --gene_bed                    [string]  Path to BED file containing gene intervals. This will be created from the GTF file if not specified. 
    --transcript_fasta            [string]  Path to FASTA transcriptome file. 
    --additional_fasta            [string]  FASTA file to concatenate to genome FASTA file e.g. containing spike-in sequences. 
    --splicesites                 [string]  Splice sites file required for HISAT2. 
    --star_index                  [string]  Path to directory or tar.gz archive for pre-built STAR index. 
    --hisat2_index                [string]  Path to directory or tar.gz archive for pre-built HISAT2 index. 
    --rsem_index                  [string]  Path to directory or tar.gz archive for pre-built RSEM index. 
    --salmon_index                [string]  Path to directory or tar.gz archive for pre-built Salmon index. 
    --kallisto_index              [string]  Path to directory or tar.gz archive for pre-built Kallisto index. 
    --bowtie2_index               [string]  Path to directory or tar.gz archive for pre-built Bowtie2 index. 
    --hisat2_build_memory         [string]  Minimum memory required to use splice sites and exons in the HiSAT2 index build process. [default: 200.GB] 
    --gencode                     [boolean] Specify if your GTF annotation is in GENCODE format. 
    --gffread_transcript_fasta    [boolean] Use gffread to generate transcript FASTA instead of RSEM. 
    --gtf_extra_attributes        [string]  By default, the pipeline uses the `gene_name` field to obtain additional gene identifiers from the input GTF file when running Salmon. 
    [default: gene_name]  
    --gtf_group_features          [string]  Define the attribute type used to group features in the GTF file when running Salmon. [default: gene_id] 
    --featurecounts_group_type    [string]  The attribute type used to group feature types in the GTF file when generating the biotype plot with featureCounts. [default: gene_biotype] 
    
    --featurecounts_feature_type  [string]  By default, the pipeline assigns reads based on the 'exon' attribute within the GTF file. [default: exon]

There are a lot of parameters that we can provide, but which ones do we need today? First of all, the rnaseq parameters documentation tells us that we can either specify:

  • --genome to name an AWS iGenomes reference to use; or
  • --fasta and --gtf to specify a FASTA file and a GTF file to use

The documentation also discourages using iGenomes, so we will be using the --fasta and --gtfoptions.

Next, if we look again at the pipeline help text under the Alignment options section, we can see that the default alignment mode is star_salmon, corresponding to the green line in the metro map:

nextflow run rnaseq --help
Alignment options
  --aligner                     [string]  Specifies the alignment algorithm to use - available options are 'star_salmon', 'star_rsem', 'hisat2', and 'bowtie2_salmon'.  (accepted: star_salmon, 
star_rsem, hisat2, bowtie2_salmon) [default: star_salmon]

This means we will be running both STAR and Salmon during our pipeline run. Both of these tools allow us to pass a pre-built reference genome index to speed up execution of the pipeline, which we will do today.

All up, we will require four of the reference genome parameters:

  • --fasta: A path to a FASTA file containing our reference genome sequence.
  • --gtf: A path to a GTF file containing genome annotations such as locations of genes, transcripts, exons, etc.
  • --star_index: A path to a set of files used by the STAR aligner software to map sequencing reads to their origin in the genome.
  • --salmon_index: A path to a set of files used by the Salmon pseudoaligner to count transcript reads from RNA sequencing data.

For each of these parameters, we have the following files that we can use:

Reference file File path
FASTA /home/<USERNAME>/data/mm10_reference/mm10_chr18.fa
GTF /home/<USERNAME>/data/mm10_reference/mm10_chr18.gtf
STAR index /home/<USERNAME>/data/mm10_reference/STAR
Salmon index /home/<USERNAME>/data/mm10_reference/salmon-index

Note that our test dataset has been subsampled to chromosome 18, so this should help to keep the run time for our exercises nice and short.

1.4.2.4 Writing the run command: required --input and --outdir parameters

The pipeline requires us to define both an input samplesheet and an output directory to place our results. We supply these with the --input and --outdir parameters, respectively. We've already looked at our input samplesheet: ~/data/samplesheet.csv. Our output directory can be named anything we want, and will be automatically created by Nextflow if it doesn't already exists.

Exercise 1.4.2.4 ⏱ 3 mins

Start writing a run command for the rnaseq pipeline. Start by providing the samplesheet as input. Also define an output directory called lesson-1.4.

Start by writing out the basic nextflow run command:

nextflow run rnaseq \

Writing multi-line commands

Note that we have added a space and a backslash (\) to the end of the line so we may continue writing the full command over multiple lines for legibility. You can choose to follow along like this or you can write the command on one single line and omit the backslashes. If you're following the multi-line convention, you can press Enter immediately after the backslash and you will be provided a new line to continue writing the command.

Be aware, however, that:

  • The space before the backslash is important.
  • It is also important that you don't have any spaces or other characters after the backslash, or the command will run prematurely and fail.

Next, add the --input parameter and pass it the path to the samplesheet:

nextflow run rnaseq \
    --input ~/data/samplesheet.csv \

Finally, add the --outdir parameter and give it the name lesson-1.4:

nextflow run rnaseq \
    --input ~/data/samplesheet.csv \
    --outdir lesson-1.4 \

1.4.2.5 Writing the run command: reference data

With the inputs and outputs defined, we next need to tell the pipeline where to find the necessary reference data. We have already determined the parameters and files we need to pass to the pipeline, so let's add them to the command now.

Exercise 1.4.2.5 ⏱ 5 mins

Continue writing your run command by passing the reference files to their respective parameters.

Following on from the last line from Exercise 1.4.2.1, add the --fasta, --gtf, --star_index, and --salmon_index parameters, and pass them the files we determined above in Reference data:

nextflow run rnaseq \
    --input ~/data/samplesheet.csv \
    --outdir lesson-1.4 \
    --fasta ~/data/mm10_reference/mm10_chr18.fa \
    --gtf ~/data/mm10_reference/mm10_chr18.gtf \
    --star_index ~/data/mm10_reference/STAR \
    --salmon_index ~/data/mm10_reference/salmon-index \

1.4.2.6 Optional parameters

Now that we have prepared our input and reference data, we have defined all the required parameters for the pipeline. However, Nextflow still needs to be configured to use Singularity, and we will add an additional, optional pipeline parameter. To do this, we will add:

  • -profile singularity
    • Recall that this is a Nextflow parameter and tell it to use nf-core's Singularity profile, rather than the default Docker profile, and run each process using Singularity containers.
  • --skip_markduplicates true
    • This is a pipeline parameter that tells the rnaseq pipeline to skip duplicate read marking. Skipping this step is standard for RNA-seq analysis.

Exercise 1.4.2.6 ⏱ 3 mins

Finish writing the run command by adding the -profile and --skip_markduplicates parameters.

The final run command
nextflow run rnaseq \
    --input ~/data/samplesheet.csv \
    --outdir lesson-1.4 \
    --fasta ~/data/mm10_reference/mm10_chr18.fa \
    --gtf ~/data/mm10_reference/mm10_chr18.gtf \
    --star_index ~/data/mm10_reference/STAR \
    --salmon_index ~/data/mm10_reference/salmon-index \
    -profile singularity \
    --skip_markduplicates true

Remember that -profile is a Nextflow parameter and therefore only uses a single hyphen. The remaining parameters are pipeline parameters and use a double hyphen.

Note also that we have left off the trailing space and bashslash from the final line (--skip_markduplicates true) since this line concludes our initial run command.

What if the parameter I want to apply isn't available?

Recall from the previous lesson that nf-core modules use ext.args to pass additional arguments to a module. This uses a special Nextflow directive ext. If an nf-core pipeline does not have a pre-defined parameter for a process, you may be able to implement ext.args; we'll see this in action in Session 2.

The inclusion of ext.args is currently best practice for all DSL2 nf-core modules where additional parameters may be required to run a process. However, this may not be implemented for all modules in all nf-core pipelines. Depending on the pipeline, these process modules may not have defined the ext.args variable in the script blocks and is thus not available for applying customisation. If that is the case consider submitting a feature request or a making pull request on the pipeline's GitHub repository to implement this!

1.4.3 Run the pipeline

You should now have a multi-line command in your terminal waiting to run. Now if you hit Enter, Nextflow should launch and the pipeline will start to run. It will take a few seconds to start up, and then you should start seeing processes spawning and running.

However, very quickly, we run into an error!

Why did this fail? 🤔

1.4.4 Setting resource limits

It turns out that there is one thing left to do in order to run the pipeline: set some resource limits. The nf-core/rnaseq pipeline is designed to run on large datasets and therefore expects to require lots of CPU and memory resources to run. However, we're using a small test dataset that doesn't need a lot of computing power, and as such we're also using low-resource VMs. Running the pipeline with its default settings causes some of the processes to crash due to insufficient CPU and memory requirements.

We can fix this by telling Nextflow that we want to limit the resource requests from each process to an upper bound of 2 CPUs and 6 GB of memory. We do this within a custom configuration file using the process.resourceLimits directive. This takes a list of upper resource limits like so:

process.resourceLimits = [
    cpus: 2,
    memory: 6.GB,
    time: 1.h
]

Exercise 1.4.4 ⏱ 5 mins

Create a configuration file called nectar_vm.config within your current working directory (~/session2) and add the resourceLimits directive, giving our pipeline a limit of 2 CPUs and 6GB of memory.

First, create the nectar_vm.config file and open it in VS Code:

code nectar_vm.config

Next, add the resourceLimits directive. You can do this in one of two ways. You can use the process.resourceLimits form as shown above:

nectar_vm.config
process.resourceLimits = [
    cpus: 2,
    memory: 6.GB
]

Alternatively, you can use the scope block form by nesting resourceLimits within a process scope with curly braces:

nectar_vm.config
process {
    resourceLimits = [
        cpus: 2,
        memory: 6.GB
    ]
}

The scope block form is preferred

The scope block form is preferable because it helps to break up the configuration file into logical sections. It also makes it simpler to add new configuration options to each scope.

In fact, we will be configuring the process scope further in Session 2, so it's a good idea to start using this form now.

We now have a finished initial run command. Now we just need to update our run command to include the new configuration file, as well as tell Nextflow to resume from where it left off - there's no sense re-running tasks that already succeeded!

Our final run command and default config file look like:

nextflow run rnaseq \
    --input ~/data/samplesheet.csv \
    --outdir lesson-1.4 \
    --fasta ~/data/mm10_reference/mm10_chr18.fa \
    --gtf ~/data/mm10_reference/mm10_chr18.gtf \
    --star_index ~/data/mm10_reference/STAR \
    --salmon_index ~/data/mm10_reference/salmon-index \
    -profile singularity \
    --skip_markduplicates true \
    -c nectar_vm.config \
    -resume
nectar_vm.config
process {
    resourceLimits = [
        cpus: 2,
        memory: 6.GB
    ]
}

Go ahead and re-run the pipeline. It should now run successfully to completion in about 5 minutes!

1.4.5 Examine the outputs

👀 Take a look at the stdout printed to the screen. Your pipeline configuration and parameter customisations are all documented here. You can use this to confirm if your parameters have been correctly passed to the run command:

As the pipeline starts, you will also see a number of processes spawn out underneath this. Recall from Lesson 1.1.2 that processes are executed independently and can run in parallel. Nextflow manages the data dependencies between processes, ensuring that each process is executed only when its input data is available and all of its dependencies have been satisfied.

To understand how this is coordinated, consider the STAR_ALIGN process that is being run.

👀 You'll notice a few things:

  • We can see which inputs are being processed by looking at the text within parentheses at the end of the process name, e.g. ...:GTF2BED (mm10_chr18.gtf)
  • A separate TRIMGALORE process is run for each of samples in our samplesheet.csv before STAR_ALIGN begins
  • Once a TRIMGALORE task is completed for a sample, the STAR_ALIGN task for that sample begins
  • When the STAR_ALIGN process starts, it spawns 2 tasks, one for each of the 2 input samples

Once your pipeline has completed, you should see this message printed to your terminal:

Output
-[nf-core/rnaseq] Pipeline completed successfully -
-[nf-core/rnaseq] Please check MultiQC report: 2/2 samples failed strandedness check.-
Completed at: 20-Apr-2026 01:33:59
Duration    : 7m 6s
CPU hours   : 0.2
Succeeded   : 70

The pipeline ran successfully, however, note the warning about all samples having failed the strandedness check. We'll explore that in Session 2.

In the meantime, list the contents of your directory. You will see a few new directories (and a hidden directory and log file) have been created.

List the working directory

ls -lha
Output
total 184K
drwxrwxr-x  6 tdev01 tdev01 4.0K Apr 20 01:27 .
drwxr-x--- 16 tdev01 tdev01 4.0K Apr 20 01:26 ..
drwxrwxr-x  8 tdev01 tdev01 4.0K Apr 19 23:51 lesson-1.4
drwxrwxr-x  4 tdev01 tdev01 4.0K Apr 20 01:34 .nextflow
-rw-rw-r--  1 tdev01 tdev01   79 Apr 17 03:43 nectar_vm.config
-rw-rw-r--  1 tdev01 tdev01 150K Apr 20 01:34 .nextflow.log
drwxrwxr-x  4 tdev01 tdev01 4.0K Apr 17 06:55 rnaseq
drwxrwxr-x 66 tdev01 tdev01 4.0K Apr 20 01:32 work

The work directory

As each workflow task is run, a unique sub-directory is created in the work directory. These directories house temporary files and various command logs created by a process. If a task fails, we can find all of the information we need to troubleshoot it within the work directory. Successful outputs in these task directories can also be re-used when using the -resume flag, assuming the inputs and process script remain unchanged.

The lesson-1.4 directory

All final outputs will be presented in a directory specified by the --outdir flag.

The .nextflow directory

This directory contains a cache subdirectory to store cached data such as downloaded files and can be used to speed up subsequent pipeline runs. It also contains a history file which contains a record of pipeline executions including run time, the unique run name, and command line arguments used.

The .nextflow.log file

This file is created by Nextflow during the execution of a pipeline and contains information about all processes and any warnings or errors that occurred during execution.

Exercise 1.4.5 ⏱ 5 mins

Was the runtime for the STAR_ALIGN process comparable for samples SRR3473988 and SRR3473989?

Remember we can use the Nextflow log and trace fields to find this information. Specifically, we can look up the realtime for each process.

Run the following:

nextflow log $(nextflow log -q | tail -n 1) -f name,realtime | grep "STAR_ALIGN" 

The command should print two lines corresponding to the STAR_ALIGN tasks for each sample. From the output, we can see that read alignment was comparable for both samples:

Output
NFCORE_RNASEQ:RNASEQ:ALIGN_STAR:STAR_ALIGN (SRR3473988) 1m 16s
NFCORE_RNASEQ:RNASEQ:ALIGN_STAR:STAR_ALIGN (SRR3473989) 53.8s

Key points

  • nf-core pipelines are provided with sensible default settings and have a combination of required and optional inputs and parameters
  • Each nf-core pipeline has a Usage, Output, and Parameters documentation page that should be consulted to help you design a suitable run command for your analysis
  • Parameters can be used to customise how the pipeline runs