Running Parabricks-Genomics-nf
To call SNVs and short indels in Illumina short-read data, you will run the Parabricks-Genomics-nf pipeline. This pipeline was developed by the SIH Bioinformatics group and leverages existing best-practice tools to identify short variants in whole-genome sequence data. The pipeline is written in Nextflow and can be configured for multiple compute environments; it has been configured specifically for NCI’s Gadi HPC.
Considerations
This pipeline is a GPU-accelerated implementation of a best-practice variant calling workflow. While the software you are running is called “Parabricks”, by running this pipeline you are essentially running significantly sped-up versions of BWA-MEM, Samtools, Google’s DeepVariant, and GLnexus. Parabricks guarantees complete fidelity with the original tools, but with a significant speedup.
The pipeline is written in Nextflow and can be configured for multiple compute environments, but it can only run on NVIDIA GPUs that support the CUDA architecture. It has been configured specifically for Gadi and would require significant customisation of the nextflow.config file in order to run on other systems.
For detailed instructions on how to run the pipeline, see the README.md file in the Parabricks-Genomics-nf GitHub repository.
Reference data on Gadi
Previous cohorts have been run using the Hg38 (chromosome-only) and chm13-t2t assemblies. Copies of these references and their index files are available at /g/data/<project>/reference on Gadi.
Quickstart guide
It is strongly suggested you read through the detailed instructions in the Parabricks-Genomics-nf GitHub repository. Below is a quickstart guide to get you up and running if you are already familiar with the pipeline.
1. Download a copy of the Parabricks-Genomics-nf pipeline code to your Gadi environment
git clone https://github.com/Sydney-Informatics-Hub/Parabricks-Genomics-nf.git
Move into this directory; this is where you will be working:
cd Parabricks-Genomics-nf
2. Prepare your input file for all samples you wish to process
You can do this in a text editor or Excel if that is easier. Just ensure you have a file that has comma-separated columns with a header for each column. For example:
sample,fq1,fq2,platform,library,center
sample1,/scratch/aa00/sample1_L001_1_1k.fastq.gz,/scratch/aa00/sample1_L001_2_1k.fastq.gz,illumina,1,Ramaciotti
sample1,/scratch/aa00/sample1_L002_1_1k.fastq.gz,/scratch/aa00/sample1_L002_2_1k.fastq.gz,illumina,1,Ramaciotti
sample2,/scratch/aa00/sample2_1_1k.fastq.gz,/scratch/aa00/sample2_2_1k.fastq.gz,illumina,1,Ramaciotti
sample3,/scratch/aa00/sample3_1_1k.fastq.gz,/scratch/aa00/sample3_2_1k.fastq.gz,illumina,1,Ramaciotti
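Before moving on, it can help to sanity-check the sample sheet, since a mistyped FASTQ path will only surface once the pipeline is running. The function below is a minimal sketch (the function name and sheet filename are illustrative, not part of the pipeline): it checks that each row has six fields and that both FASTQ files exist.

```shell
#!/usr/bin/env bash
# Minimal sample sheet checker (illustrative; not part of Parabricks-Genomics-nf).
# Skips the header, then checks each row has six fields and both FASTQ paths exist.
check_samplesheet() {
  local sheet="$1"
  tail -n +2 "$sheet" | while IFS=, read -r sample fq1 fq2 platform library center; do
    # A well-formed row fills all six fields, so "center" must be non-empty.
    [ -n "$center" ] || echo "BAD ROW (fewer than 6 fields): $sample"
    for f in "$fq1" "$fq2"; do
      [ -f "$f" ] || echo "MISSING FILE: $sample -> $f"
    done
  done
}

# Example (hypothetical sheet name):
# check_samplesheet cmt-2024.csv
```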
3. Configure and submit the run script
Use the run_gadi.pbs script provided alongside the code in the scripts/ directory. This script will submit all jobs to the Gadi HPC queue system. Before executing it, you will need to edit the script; how you do this will depend on how you intend to run the pipeline. Edit the following PBS directives at the top of the script:
#PBS -P <PROJECT>
#PBS -l storage=scratch/<PROJECT>
- Provide your group’s project code in place of <PROJECT>
- Provide the storage paths you will be using for the pipeline in the storage= directive. These can be scratch, gdata, or both joined with +
For example:
#PBS -P aa00
#PBS -l storage=scratch/aa00+gdata/aa00
Next, edit the script variables to pass the required files and parameters to the pipeline:
gadi_account= #REQUIRED: your gadi project code for accounting
cohort_name= #REQUIRED: name of your sample cohort
outdir= #OPTIONAL: name of the results directory
ref= #REQUIRED: full path to your chosen reference genome e.g. /g/data/aa00/reference/hg38.fa
input= #REQUIRED: full path to your input sample sheet e.g. /scratch/aa00/samples.csv
vep_species= #OPTIONAL: which VEP species do you want to use for variant annotation e.g. homo_sapiens
vep_assembly= #OPTIONAL: which genome assembly do you want to use for variant annotation e.g. GRCh38
For example:
gadi_account=aa00
cohort_name=cmt-2024
outdir=/scratch/aa00/Parabricks-Genomics-nf/cmt-2024-results
ref=/g/data/aa00/reference/hg38.fa
input=/scratch/aa00/Parabricks-Genomics-nf/cmt-2024.csv
vep_species=homo_sapiens
vep_assembly=GRCh38
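Because the script depends on these paths being correct, a quick pre-flight check before submitting can save a failed job. This is a sketch under assumptions: the variable names mirror the script above, but the preflight function itself is not part of run_gadi.pbs, and the paths are the illustrative examples from the text.

```shell
#!/usr/bin/env bash
# Pre-flight check for run_gadi.pbs variables (illustrative; not part of the pipeline).
gadi_account=aa00
cohort_name=cmt-2024
ref=/g/data/aa00/reference/hg38.fa                        # example path
input=/scratch/aa00/Parabricks-Genomics-nf/cmt-2024.csv   # example path

preflight() {
  local v missing=0
  # Every REQUIRED variable must be non-empty.
  for v in gadi_account cohort_name ref input; do
    [ -n "${!v:-}" ] || { echo "ERROR: $v is not set"; missing=1; }
  done
  # Warn (without failing) if the reference or sample sheet is not visible.
  [ -f "$ref" ]   || echo "WARNING: reference not found: $ref"
  [ -f "$input" ] || echo "WARNING: sample sheet not found: $input"
  return "$missing"
}

# Only submit once the required variables are all set, e.g.:
# preflight && qsub scripts/run_gadi.pbs
```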
Save your changes and run the script with:
qsub scripts/run_gadi.pbs
This will submit the head job (i.e. the run_gadi.pbs script) to the job scheduler, which will in turn submit all pipeline tasks for execution. You will receive a message with your unique job ID as a numerical code. You can check the progress of your job by running:
qstat -Esw
4. View the results of the pipeline
This pipeline runs multiple processes, most of which generate output that you can view in your specified outdir. See the Parabricks-Genomics-nf GitHub repository for a description of each process, how the outputs are organised, and what files are generated.
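A small helper like the one below can give a quick first overview of what a run produced. It is a sketch: the function name and the example results path are assumptions, and the actual output layout is described in the repository documentation.

```shell
#!/usr/bin/env bash
# List each top-level subdirectory of a results directory with its file count
# (illustrative helper; not part of Parabricks-Genomics-nf).
summarise_outdir() {
  local outdir="$1" d
  if [ ! -d "$outdir" ]; then
    echo "No such directory: $outdir"
    return 1
  fi
  find "$outdir" -mindepth 1 -maxdepth 1 -type d | sort | while read -r d; do
    printf '%s\t%d files\n' "${d##*/}" "$(find "$d" -type f | wc -l)"
  done
}

# Example (hypothetical results path):
# summarise_outdir /scratch/aa00/Parabricks-Genomics-nf/cmt-2024-results
```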
Resources
All materials copyright Sydney Informatics Hub, University of Sydney