Working on the Gadi HPC

Watch our Pro tips for bioinformatics on HPC webinar (recording and slides) to understand how HPC systems like Gadi work.


What is high performance computing?

High performance computing refers to the use of parallel processing techniques to solve complex computation problems efficiently. HPC systems, like Gadi, consist of clusters of interconnected computers, each equipped with multiple processors and large amounts of memory. These systems are designed to handle massive datasets and perform computations at speeds far beyond those achievable by your personal computer.

Why do we need HPC for bioinformatics?

In bioinformatics, researchers deal with massive datasets generated by technologies such as next-generation sequencing (genomics, transcriptomics) and mass spectrometry (proteomics). Analysing these datasets requires computationally intensive tasks such as sequence alignment, genome assembly, and statistical analysis. HPC systems provide the computational power and memory resources necessary to process these datasets efficiently.

Key components of the Gadi HPC system

See the Gadi Resources Guide for a detailed explanation of the following.

Computing nodes

Login nodes

These nodes are the gateway through which users access the resources of the HPC cluster. They are where you log in to Gadi, move around the filesystem, submit jobs to the scheduler, and perform small tasks like viewing the contents of a file.
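For example, you connect to a login node over SSH (the username below is a placeholder; substitute your own NCI username):

ssh aa1234@gadi.nci.org.au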

Compute nodes

These nodes are the workhorses of any HPC. They are dedicated to executing the computational tasks delegated to them by the job scheduler when you submit a job. There are various types of compute nodes on Gadi, with different hardware built for different purposes. Depending on the resource requirements of your job (e.g. high memory, GPUs) and the queue you specify, your job will be sent to a specific type of compute node. You can find a breakdown of their technical specifications here.

‼️ Pay Attention ‼️

Compute nodes on Gadi don’t currently have access to the external internet. If any task within a submitted job on a compute node needs to access the internet, it will fail. These tasks should be run separately on the copyq, which uses the data mover nodes.
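As a minimal sketch, an internet-requiring step (such as downloading a reference file) could be submitted to the copyq as below; the project code, storage directive, and URL are placeholders to adjust for your own work:

#!/bin/bash
#PBS -q copyq
#PBS -P <project>
#PBS -l ncpus=1
#PBS -l mem=4GB
#PBS -l walltime=01:00:00
#PBS -l storage=scratch/<project>
#PBS -l wd

# copyq jobs run on the data mover nodes, which can reach the internet
wget -P /scratch/<project>/$USER https://example.org/reference.fasta.gz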

Data mover nodes

These nodes are designed specifically for fast data movement. You can use them to transfer files to and from Gadi at high speed; the steps are outlined here. A script for moving data between USyd RDS and Gadi is provided in /g/data/scripts and explained in the following section, transferring data.
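For example, you can push files from your local machine (or an RDS-mounted host) through the dedicated data mover hostname; the username, project code, and paths below are placeholders:

rsync -avP my_data/ aa1234@gadi-dm.nci.org.au:/scratch/<project>/aa1234/my_data/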

Filesystems

$HOME

When you first log in to Gadi, you’ll be placed in your personal $HOME directory (e.g. /home/555/aa1234). You are the only person who can access this directory. No work should be done in here, but you may wish to install things like custom R or Python libraries here. It is backed up, but you have a 10 GB storage limit.

You can navigate back here at any point:

cd ~
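As an example of the kind of per-user install that belongs in $HOME (the package name is illustrative, and you may need to load a specific python3 module version):

module load python3
pip install --user pysam

The --user flag places the library under ~/.local, keeping it inside your $HOME rather than in the read-only system areas.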

/scratch

All Gadi projects have a dedicated /scratch allocation that is only accessible to members of that group. It is intended for active work on big files, not for long-term storage. Your allocation is currently 11 TB of storage. This space is not backed up and any files not accessed for 100 days will be purged from the system, so be sure to back up your work to RDS. Your /scratch contains a directory for each user (named after their Gadi username), but you can organise things here however you wish.
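To keep an eye on your usage against these limits, NCI provides quota and file-expiry reporting commands you can run from a login node:

# show your storage usage and limits across your projects
lquota

# list files approaching the 100-day purge window
nci-file-expiry list-warnings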

/g/data

Some Gadi projects, including your group’s, have a dedicated /g/data allocation that is only accessible to members of that group. This is intended for long-term storage of large data. Your allocation is currently 2 TB of storage. This is not backed up though, so ensure you transfer all important files back to RDS. Your /g/data contains the following shared directories:

  • pipelines: copies of SIH public pipelines from our GitHub
  • reference: T2T and Hg38 assemblies and their indexes
  • scripts: copies of shared code for general tasks
ls -lah /g/data/<project>
drwxr-sr-x 3 gs5517 iz89 4.0K Feb 13 22:42 pipelines
drwxr-sr-x 4 gs5517 iz89 4.0K Sep 14  2023 reference
drwxr-sr-x 2 gs5517 iz89 4.0K Feb 13 22:43 scripts
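As an illustrative sketch of pointing a tool at the shared reference directory (the subdirectory and file names below are hypothetical; check what is actually there first):

ls /g/data/<project>/reference
bwa mem -t 4 /g/data/<project>/reference/Hg38/Hg38.fasta sample_R1.fastq.gz sample_R2.fastq.gz > sample.sam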

/apps

This directory is accessible to all Gadi users. It is a read-only system containing centrally installed software applications and their module files. You can check what software is installed here:

ls /apps

You can use any software that is installed here by first loading the module file, e.g.:

module load samtools

Then run the tool as per its user guide, e.g.:

samtools view -H sample.bam
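If you need a particular version of a tool, you can list what is available and load it explicitly (the version number below is illustrative):

module avail samtools
module load samtools/1.19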

Queues

Like on Artemis, the job scheduler is PBSPro, however it is implemented in a slightly different way. To run jobs on Gadi, users submit to a specific queue, which maps to a corresponding set of nodes. The queue and node you choose to run on will depend on the types of resources your job needs. The pipelines your group uses have already been configured to run on specific queues.

For custom PBS scripts, you can work out which queue to run your job on by checking the NCI queue documentation and queue limits explainer. Most jobs will be suited to the normal or normalbw queues. The normal queues have the most nodes available for your jobs, and using them leaves the specialised queues free for the users and jobs that genuinely require them. Express queues are designed to support work that needs a faster turnaround, but are charged accordingly at a higher service unit rate.
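A minimal PBS script for the normal queue might look like the sketch below; the project code, storage directives, and resource requests are placeholders you should tailor to your job:

#!/bin/bash
#PBS -q normal
#PBS -P <project>
#PBS -l ncpus=4
#PBS -l mem=16GB
#PBS -l walltime=02:00:00
#PBS -l storage=scratch/<project>+gdata/<project>
#PBS -l wd

# example task: sort a BAM using the requested CPUs
module load samtools
samtools sort -@ ${PBS_NCPUS} -o sample.sorted.bam sample.bam

Submit it from a login node with:

qsub myjob.pbs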

Accounting

Check your group’s quarterly storage and SU usage report by running:

nci_project -P <project> -v
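For a snapshot of compute (service unit) usage against your group’s allocation, there is also a companion command (check its help output for available options):

nci_account -P <project>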

All materials copyright Sydney Informatics Hub, University of Sydney