Running Parabricks-Genomics-nf

To call SNVs and short indels for Illumina short read data, you will be running the Parabricks-Genomics-nf pipeline. This pipeline was developed by the SIH Bioinformatics group. It leverages existing best practice tools to identify short variants in whole genome sequence data. The pipeline is written in Nextflow and can be configured for multiple compute environments. It has been configured specifically for NCI’s Gadi HPC.

Considerations

This pipeline is a GPU-accelerated implementation of best practice tools for variant calling pipeline. While the software you are running is called “Parabricks”, by running this pipeline you are essentially running significantly sped up versions of BWA mem, Samtools, Google’s DeepVariant, and GLNexus. Parabricks guarantees complete fidelity with the original tools, but with a significant speedup.

The pipeline is written in Nextflow and can be configured for multiple compute environments but can only run on NVIDIA GPUs that support CUDA architecture. It has been configured specifically for Gadi and would require significant customisation of the nextflow.config in order to run on other systems.

For detailed instructions on how to run the pipeline, see the README.md file in the Parabricks-Genomics-nf GitHub repository.

Reference data on Gadi

Previous cohorts have been run using the Hg38 (chromosome only) and chm13-t2t assemblies. Copies of these reference and their index files are available at /g/data/<project>/reference on Gadi.

Quickstart guide

It is strongly suggested you read through the detailed instructions in the Parabricks-Genomics-nf GitHub repository. Below is a quickstart guide to get you up and running if you are already familiar with the pipeline.

1. Download a copy of the Parabricks-Genomics-nf pipeline code to your Gadi environment

git clone https://github.com/Sydney-Informatics-Hub/Parabricks-Genomics-nf.git

Move into this directory, this is where you will be working:

cd Parabricks-Genomics-nf

2. Prepare your input file for all samples you wish to process

You can do this in a text editor or Excel if that is easier. Just ensure you have a file that has comma-separated columns with a header for each column. For example:

sample,fq1,fq2,platform,library,center
sample1,/scratch/aa00/sample1_L001_1_1k.fastq.gz,/scratch/aa00/sample1_L001_2_1k.fastq.gz,illumina,1,Ramaciotti
sample1,/scratch/aa00/sample1_L002_1_1k.fastq.gz,/scratch/aa00/sample1__L002_2_1k.fastq.gz,illumina,1,Ramaciotti
sample2,/scratch/aa00/sample2_1_1k.fastq.gz,/scratch/aa00/sample2_2_1k.fastq.gz,illumina,1,Ramaciotti
sample3,/scratch/aa00/sample3_1_1k.fastq.gz,/scratch/aa00/sample3_2_1k.fastq.gz,illumina,1,Ramaciotti

3. Run the pipeline

Use the run_gadi.pbs script provided alongside the code inside the scripts/ directory. This script will submit all jobs to the Gadi HPC queue system. Before executing you will need to edit the script, how you do this will depend on how you intend to run the pipeline. Edit the following PBS variables at the top of the script:

#PBS -P <PROJECT>
#PBS -l storage=scratch/<PROJECT>
  • Provide your group’s project code in place of <PROJECT>
  • Provide the storage you will be using for the pipeline in place of storage=. This can be scratch or gdata

For example:

#PBS -P aa00
#PBS -l storage=scratch/aa00+gdata/aa00

Next, edit the script variables to pass the required files and parameters to the pipeline:

gadi_account= #REQUIRED: your gadi project code for accounting 
cohort_name= #REQUIRED: name of your sample cohort  
outdir= #OPTIONAL: name of the results directory 
ref= #REQUIRED: full path to your chosen reference genome e.g. /g/data/aa00/reference/hg38.fa
input= #REQUIRED: full path to your input sample sheet e.g. /scratch/aa00/samples.csv
vep_species= #OPTIONAL which VEP species do you want to use for variant annotation e.g. homo_sapiens
vep_assembly= #OPTIONAL: which genome assembly do you want to use for variant annotation e.g. GRCh38

For example:

gadi_account=aa00
cohort_name=cmt-2024
outdir=/scratch/aa00/Parabricks-Genomics-nf/cmt-2024-results
ref=/g/data/aa00/reference/hg38.fa
input=/scratch/aa00/Parabricks-Genomics-nf/cmt-2024.csv 
vep_species=homo_sapiens
vep_assembly=GRCh38

Save your changes and run the script with:

qsub scripts/run_gadi.pbs

You will receive a message with your unique job id as a numerical code. This will submit the head job (i.e. the run_gadi.pbs script) and all tasks to the job scheduler for execution. You can check the progress of your job by running:

qstat -Esw 

4. View the results of the pipeline

This pipeline runs multiple processes described here, most processes generate some output that you can view in your specified outdir. See here for a description of how the outputs are organised and what files are generated.

Resources

All materials copyright Sydney Informatics Hub, University of Sydney