Customising the workflow

Questions

What are the essential nf-core/rnaseq parameters?
What parameters can be customised?

The nf-core/rnaseq pipeline can be run with a single command using default parameters. The parameters are explained in detail here, and the essential ones are:

Either an input sample sheet (--input) or a quoted glob pattern matching the fastq files (--reads), depending on the pipeline release
A reference genome, either one available through Illumina’s iGenomes database (--genome), or a user-specified reference assembly (--fasta) and annotation file (--gtf)
A configuration profile suitable for the computing environment you’re working on.

For example, a user working with human samples, who has Singularity installed and all fastq files stored in one directory, could run the nf-core/rnaseq pipeline with this command:

# --reads    location of the fastq files
# --genome   reference from the Illumina iGenomes database
# -profile   use Singularity for pre-installed software
nextflow run nf-core/rnaseq \
  --reads '*_R{1,2}.fastq.gz' \
  --genome GRCh38 \
  -profile singularity

Most of us will need to customise the command a little more than this though. For example, a user working with multiple samples, who wants to provide their own pre-indexed reference data, and has computing resource limitations might run a command that looks more like this:

# Set each $... variable to the matching path on your system before running.
# --input                      samplesheet file name
# -profile                     for pre-installed software
# --fasta                      genome sequence file
# --gtf                        GTF file of gene locations on the genome
# --star_index                 pre-built index for the aligner
# --max_memory, --max_cpus     memory and CPU resources
# --outdir                     results folder
# -with-report, -with-timeline execution report and timeline file names
nextflow run "$Path_to_nf_core_rnaseq" \
  --input "$Path_to_samplesheet" \
  -profile singularity \
  --fasta "$Genome_fasta_file" \
  --gtf "$Path_to_genome_gtf" \
  --star_index "$Path_to_index_file" \
  --max_memory '6 GB' --max_cpus 2 \
  --outdir "$Path_to_results" \
  -with-report execution_report.html \
  -with-timeline timeline_report.html
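
A comment placed after a line-continuation backslash stops the shell from joining the lines, so one way to keep a note beside each option is to build the argument list one step at a time. The sketch below is a dry run under stated assumptions: the file names (samplesheet.csv, genome.fa, genes.gtf, star_index, results) and the function name rnaseq_args are hypothetical placeholders, and the function prints the arguments rather than launching the pipeline.

```shell
# Hypothetical placeholder paths; replace them with your own files.
samplesheet="samplesheet.csv"
genome_fasta="genome.fa"
genome_gtf="genes.gtf"
star_index="star_index"
results_dir="results"

rnaseq_args() {
  set -- run nf-core/rnaseq
  set -- "$@" --input "$samplesheet"             # samplesheet file name
  set -- "$@" -profile singularity               # for pre-installed software
  set -- "$@" --fasta "$genome_fasta"            # genome sequence file
  set -- "$@" --gtf "$genome_gtf"                # gene locations on genome
  set -- "$@" --star_index "$star_index"         # pre-built aligner index
  set -- "$@" --max_memory '6 GB' --max_cpus 2   # memory and CPU limits
  set -- "$@" --outdir "$results_dir"            # results folder
  set -- "$@" -with-report execution_report.html
  set -- "$@" -with-timeline timeline_report.html
  # Dry run: print one argument per line instead of launching the pipeline.
  printf '%s\n' nextflow "$@"
}

rnaseq_args
```

To launch the pipeline for real, the printf line could be swapped for nextflow "$@". Quoting "$@" keeps values that contain spaces, such as '6 GB', together as a single argument.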

Some useful customisation options

Input and output options

--input   Path to comma-separated file containing information about the samples.
--outdir  The output directory where the results will be saved.
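
As an illustration of what the file passed to --input might contain, the sketch below writes a samplesheet for paired-end data. It makes several assumptions: the column names (sample, fastq_1, fastq_2, strandedness) are those used by the 3.x releases of nf-core/rnaseq and should be checked against the usage documentation for your release; the files are assumed to be named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz; every sample is labelled unstranded, which may not match your library preparation; and make_samplesheet is a hypothetical helper, not part of the pipeline.

```shell
# Print a samplesheet for files named <sample>_R1.fastq.gz / <sample>_R2.fastq.gz.
make_samplesheet() {
  fastq_dir=$1
  echo "sample,fastq_1,fastq_2,strandedness"
  for r1 in "$fastq_dir"/*_R1.fastq.gz; do
    [ -e "$r1" ] || continue                 # nothing matched the pattern
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"      # expected mate file
    sample=$(basename "$r1" _R1.fastq.gz)
    if [ -e "$r2" ]; then
      # Assumption: all libraries are unstranded.
      echo "$sample,$r1,$r2,unstranded"
    else
      echo "skipping $sample: missing $r2" >&2
    fi
  done
}

# Demo on empty placeholder files in a temporary directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/ctrl_R1.fastq.gz" "$demo_dir/ctrl_R2.fastq.gz"
make_samplesheet "$demo_dir" > "$demo_dir/samplesheet.csv"
cat "$demo_dir/samplesheet.csv"
```

On real data, the demo lines would be replaced by a call such as make_samplesheet /path/to/fastqs > samplesheet.csv, and the resulting file checked by eye before it is passed to --input.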

Reference genome options

--genome            Name of iGenomes reference.
--star_index        Path to directory or tar.gz archive for pre-built STAR index.
--hisat2_index      Path to directory or tar.gz archive for pre-built HISAT2 index.
--save_reference    If generated by the pipeline, save the STAR index in the results directory.

Alignment options

--aligner           Alignment algorithm to use.
--pseudo_aligner    Pseudo aligner to use.
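
Because a mistyped aligner name only surfaces once the pipeline starts, a wrapper script could check the value first. The sketch below assumes the --aligner values accepted by the 3.x releases of nf-core/rnaseq (star_salmon, star_rsem and hisat2); the list may differ in your release, so treat it as an assumption and confirm it against the parameter documentation. check_aligner is a hypothetical helper name.

```shell
# Return 0 if the value is one of the aligner names assumed here.
check_aligner() {
  case "$1" in
    star_salmon|star_rsem|hisat2) return 0 ;;
    *) echo "unknown --aligner value: $1" >&2; return 1 ;;
  esac
}

aligner="star_salmon"   # hypothetical choice for this run
if check_aligner "$aligner"; then
  # Print the option that would be added to the nextflow command.
  echo "--aligner $aligner"
fi
```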

Process skipping/use-alternate options

--deseq2_vst      Use vst transformation instead of rlog with DESeq2.
--skip_fastqc     Skip FastQC.
--skip_multiqc    Skip MultiQC.

For details of all parameters in nf-core, take a look here.

Proceed to the next lesson by clicking on What is nf-core/rnaseq doing? > RNA-seq workflow overview on the menu bar.

Key points

A single nf-core command can be run for the complete pipeline.
Many different parameters can be used to customise pipeline runs.
nf-core allows users to skip non-mandatory steps in the pipeline.

All materials copyright Sydney Informatics Hub, University of Sydney