Run the nf-core/rnaseq pipeline
Questions
- How do I run the nextflow command in the terminal?
Now that we have prepared our working space and downloaded all the data we need, we can run the nf-core/rnaseq pipeline. The mouse genome is large and contains ~20,000 genes, so processing all of it would take hours. To save time, we are running the pre-processing workflow on a subset of the mouse genome, chromosome 18. Tomorrow we will provide you with a whole-genome count matrix so that you can perform differential expression and enrichment analyses.
Execute the nf-core/rnaseq run command
In your terminal, check your path by running:
pwd
It should return:
/home/training/base_directory/working_directory
If it does not, move into that directory by running:
cd /home/training/base_directory/working_directory
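The directory matters because the run command in this lesson passes --input samplesheet.csv as a relative path, so Nextflow looks for the file in the directory you launch from. A minimal sketch of a pre-flight check, assuming samplesheet.csv was saved in the working directory (has_samplesheet is a hypothetical helper name used only for illustration):

```shell
# Succeed if the given directory contains the samplesheet the run command expects
has_samplesheet() {
    [ -f "$1/samplesheet.csv" ]
}

if has_samplesheet "$(pwd)"; then
    echo "samplesheet.csv found in $(pwd)"
else
    echo "samplesheet.csv not found in $(pwd); check that you are in the working directory"
fi
```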
Run the command:
cvmfs_path=/cvmfs/data.biocommons.aarnet.edu.au/training_materials/SIH_training/IntroRNAseq_0922
nextflow run $cvmfs_path/nfcore_pipeline/rnaseq/ \
--input samplesheet.csv \
-profile singularity \
--fasta $cvmfs_path/Mouse_chr18_reference/chr18.fa \
--gtf $cvmfs_path/Mouse_chr18_reference/chr_18_startOfLine.gtf \
--star_index $cvmfs_path/Mouse_chr18_reference/chr18_STAR_singularity_index/ \
--max_memory '6 GB' --max_cpus 2 \
--outdir results \
-with-report excecution_report.html \
-with-timeline timeline_report.html \
-with-dag flowchart.png
To run the pipeline for all 6 samples, we need to provide the nf-core/rnaseq command with some essential parameters and inputs. Our run command specifies some extra details that we will explore in depth later today, but for now the important things to know are:
- The reference data we are working with (i.e. the fasta, gtf, and STAR index files) are stored in a centralised file system called the CernVM File System (CVMFS), which we are accessing remotely. The cvmfs_path variable holds the path to this data.
- We have specified how much memory (--max_memory) and how many CPUs (--max_cpus) the nf-core/rnaseq pipeline can use. This stops the pipeline from requesting more memory or CPUs than our system has available, which would cause it to crash.
- Outputs will be saved to the results directory (--outdir).
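To see how the limits in the run command compare with what a machine actually has, you can query the system before launching. The following is a minimal sketch that assumes a Linux system (it relies on nproc and /proc/meminfo); the variable names and the smaller_of helper are hypothetical and not part of nf-core/rnaseq:

```shell
# Limits requested in the run command above
requested_cpus=2
requested_mem_gb=6

# Print the smaller of two integers
smaller_of() {
    if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Number of CPUs available to this shell
available_cpus=$(nproc)
# MemTotal in /proc/meminfo is reported in kB; convert to whole GB (rounded down)
available_mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)

echo "CPUs: requested $requested_cpus, available $available_cpus"
echo "Memory: requested $requested_mem_gb GB, available $available_mem_gb GB"
echo "Values within system limits: --max_cpus $(smaller_of "$requested_cpus" "$available_cpus") --max_memory '$(smaller_of "$requested_mem_gb" "$available_mem_gb") GB'"
```

If the requested values are larger than what is available, lowering --max_memory and --max_cpus to the reported values keeps the pipeline within the system's capacity.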
The progress of the workflow will be displayed in the terminal and updated in real-time. Take a look at all the processes you have just run with a single command!
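Nextflow also writes a log file named .nextflow.log in the directory the run was launched from, which is handy if the terminal output has scrolled away. A small sketch for viewing its most recent lines (show_recent_log is a hypothetical helper name, and the five-line window is an arbitrary choice):

```shell
# Print the last lines of a Nextflow log file (default: .nextflow.log in the current directory)
show_recent_log() {
    log_file="${1:-.nextflow.log}"
    if [ -f "$log_file" ]; then
        tail -n 5 "$log_file"
    else
        echo "No log found at $log_file"
    fi
}

show_recent_log
```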
If you are an experienced CLI user, you might want to run commands that take a while to finish in the background, so you can keep using your terminal while it runs. If that sounds nifty to you, feel free to try the challenge below to run your nf-core/rnaseq command.
Challenge for experienced CLI users
We can run the nf-core/rnaseq command using a utility called screen. The screen command lets you launch and use multiple shell sessions from a single SSH session. Can you:
- Create a screen session
- Run the nf-core/rnaseq run command above inside that session
- Detach from the screen session
- Reopen the screen session to check on the progress of the run command
Hint: to work out what flags you will need to do this, run:
screen --help
Solution
- To create a new screen session with the name run_nextflow_in_screen, run:
screen -S run_nextflow_in_screen
- To detach from the screen session, but keep it running in the background, run:
screen -d
This can also be done with the keyboard shortcut Ctrl-a followed by d.
- To reattach the screen session, run:
screen -r run_nextflow_in_screen
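Before reattaching, it can help to confirm that the session is still there. screen -ls lists the sessions owned by your user; the sketch below wraps it in a hypothetical session_exists helper (not part of screen itself) and assumes the session name used in the solution above:

```shell
# Succeed if a screen session with the given name appears in `screen -ls`
session_exists() {
    # Sessions are listed as <pid>.<name> followed by whitespace
    screen -ls 2>/dev/null | grep -q "\.$1[[:space:]]"
}

session=run_nextflow_in_screen

if session_exists "$session"; then
    echo "Session $session exists; reattach with: screen -r $session"
else
    echo "No session named $session; create one with: screen -S $session"
fi
```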
For now, we will leave the pipeline to run in our terminals and discuss Nextflow and nf-core/rnaseq while we wait. Keep your terminal window open so that the nf-core/rnaseq command can finish, then proceed to the next lesson by clicking on Why use nf-core/rnaseq? > Introduction to Nextflow on the menu bar.
Key points
- The nf-core/rnaseq pipeline is a simplified, easy-to-run solution for RNA-seq data pre-processing
All materials copyright Sydney Informatics Hub, University of Sydney