Run the nf-core/rnaseq pipeline

Questions

How do I run the nextflow command in the terminal?

Now that we have prepared our working space and downloaded all the data we need, we can run the nf-core/rnaseq pipeline. The mouse genome is large, with ~20,000 genes, and processing all of it would take hours. To save time, we will run the pre-processing workflow on a subset of the mouse genome. Tomorrow we will provide you with a whole-genome count matrix for the differential expression and enrichment analyses.

Why run the command now?

We need to allow ~20 mins for the command to run on the test data. For this reason we will run the command first and then discuss each step of this pipeline once it has completed.

Execute the nf-core/rnaseq run command

In your terminal, check your path by running:

pwd

It should return:

/home/training/base_directory/working_directory

If it does not, move into the correct directory by running:

cd /home/training/base_directory/working_directory
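The check-and-move steps above can also be combined into one small shell function. This is only a sketch: go_to_workdir is a hypothetical helper name, not part of the workshop materials, and running pwd and cd by hand works just as well.

```shell
# Hypothetical helper: move into the expected directory only if we are not already there.
go_to_workdir() {
    expected="$1"
    if [ "$PWD" = "$expected" ]; then
        echo "Already in $expected"
    elif [ -d "$expected" ]; then
        cd "$expected" && echo "Moved to $expected"
    else
        # The directory does not exist, so report it instead of failing silently
        echo "Directory not found: $expected" >&2
        return 1
    fi
}

# Example call for this workshop:
#   go_to_workdir /home/training/base_directory/working_directory
```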

Run the command:

cvmfs_path=/cvmfs/data.biocommons.aarnet.edu.au/training_materials/SIH_training/IntroRNAseq_0922

nextflow run $cvmfs_path/nfcore_pipeline/rnaseq/ \
                --input samplesheet.csv \
                -profile singularity \
                --fasta $cvmfs_path/Mouse_chr18_reference/chr18.fa \
                --gtf $cvmfs_path/Mouse_chr18_reference/chr_18_startOfLine.gtf \
                --star_index $cvmfs_path/Mouse_chr18_reference/chr18_STAR_singularity_index/ \
                --max_memory '6 GB' --max_cpus 2 \
                --outdir results \
                -with-report excecution_report.html \
                -with-timeline timeline_report.html \
                -with-dag flowchart.png

To run the pipeline for all 6 samples, we need to provide the nf-core/rnaseq command with some essential parameters and inputs. Our run command specifies some extra details that we will explore in depth later today, but for now the important things to know are:

The reference data we are working with (i.e. the fasta, gtf, and STAR index files) are stored in a centralised file system called the CernVM File System (CVMFS), which we are accessing remotely. The cvmfs_path variable holds the path to this data.
We have specified how much memory (--max_memory) and how many CPUs (--max_cpus) the nf-core/rnaseq pipeline can use. Doing this ensures the pipeline won’t try to consume more memory or CPUs than are available on our system and crash.
Outputs will be saved to the results directory (--outdir).
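If you are unsure what limits to pass to --max_memory and --max_cpus on another machine, you can check what the system has first. The sketch below assumes a Linux system with nproc and /proc/meminfo available. The variable names cpus and mem_gb are chosen for illustration only.

```shell
# Number of CPU cores visible to this session
cpus=$(nproc)

# Total memory, converted from kB to whole GB (rounded down); Linux only
mem_gb=$(awk '/^MemTotal:/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)

echo "CPUs available: $cpus"
echo "Total memory: ${mem_gb} GB"
```

Choose limits at or below these values, leaving some headroom for the operating system and other users.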

A note on genome indexing

Today we are using a pre-indexed reference genome (--star_index) stored centrally on the CernVM File System (CVMFS). Tools like our aligner use index files to search the genome much faster. Think of it like the index at the back of a textbook, where you can easily find what pages certain words are on.

Note that indexing requires a lot of memory, so if you didn’t have a prepared index and you needed nf-core/rnaseq to perform indexing, you would need to increase memory (--max_memory).

The progress of the workflow will be displayed in the terminal and updated in real-time. Take a look at all the processes you have just run with a single command!

What if my command fails?

If you need help, please ask one of our demonstrators in the Slack channel. Don’t worry if your run fails though, we have a set of results available for anyone who experiences issues.
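Before asking for help, it is often useful to look at the end of the Nextflow log, which Nextflow writes to a hidden file called .nextflow.log in the directory where you launched the command. The function below is a minimal sketch for doing that. show_log_tail is a hypothetical helper name, and the choice of 20 lines is arbitrary.

```shell
# Hypothetical helper: print the last 20 lines of a Nextflow log file.
# With no argument, it looks for .nextflow.log in the current directory.
show_log_tail() {
    log_file="${1:-.nextflow.log}"
    if [ -f "$log_file" ]; then
        tail -n 20 "$log_file"
    else
        echo "No log file found at $log_file" >&2
        return 1
    fi
}

# Example call from the directory where you ran the pipeline:
#   show_log_tail
```

Once the cause is fixed, re-running the same nextflow run command with -resume added lets Nextflow reuse the steps that already completed rather than starting from scratch.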

If you are an experienced CLI user, you might want to run commands that take a while to finish in the background, so you can keep using your terminal while it runs. If that sounds nifty to you, feel free to try the challenge below to run your nf-core/rnaseq command.

Challenge for experienced CLI users

We can run the nf-core/rnaseq command using a utility called screen. The screen command lets you launch and use multiple shell sessions from a single ssh session. Can you:

Create a screen session
Run the nf-core/rnaseq run command above inside that session
Detach from the screen session
Reopen the screen session to check on the progress of the run command

Hint: to work out what flags you will need to do this, run:

screen --help

Solution

To create a new screen session with the name run_nextflow_in_screen, run:

screen -S run_nextflow_in_screen

To detach from the screen session, but keep it running in the background, run:

screen -d

You can also detach with the keyboard shortcut Ctrl-a followed by d.

To reattach the screen session, run:

screen -r run_nextflow_in_screen
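If you forget the name of your session, screen -ls lists the sessions that currently exist. The wrapper below is a small sketch that also handles the case where screen is not installed. list_screen_sessions is a hypothetical name used for illustration.

```shell
# Hypothetical helper: list existing screen sessions, if screen is installed.
list_screen_sessions() {
    if command -v screen >/dev/null 2>&1; then
        # screen -ls exits with a non-zero status when no sessions exist,
        # so do not treat that as an error here
        screen -ls || true
    else
        echo "screen is not installed on this system"
    fi
}

list_screen_sessions
```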

For now, we will leave the pipeline to run in our terminals and discuss Nextflow and nf-core/rnaseq while we wait. Keep your terminal window open to allow the nf-core/rnaseq command to finish and proceed to the next lesson by clicking on Why use nf-core/rnaseq? > Introduction to Nextflow on the menu bar.

Key points

The nf-core/rnaseq pipeline is a simplified, easy-to-run solution for RNA-seq data pre-processing.

All materials copyright Sydney Informatics Hub, University of Sydney