Transferring data to and from Gadi
Introduction
In this section, we will look at transferring data between the Research Data Store (RDS) and Gadi. There is a lot of content with step-by-step examples here due to the numerous methods available for data transfer.
The table below summarises the data transfer methods covered in this section:
Method | Suitable data size/type | Pros/Cons
---|---|---
GUI based data transfer client e.g. filezilla or cyberduck | Only small files, config scripts etc. | Simple to use; local computer is a bottleneck and transfers are generally not resumable.
sftp transfer from Gadi terminal | Files up to a few GB. | Quick for ad-hoc copies; connection is lost if your terminal closes or your computer sleeps.
rsync transfer from Artemis terminal | Files up to a few GB. | Syncs only changed files; available only until Artemis is decommissioned on 29 August 2025.
sftp copy using copyq | Large files and datasets to copy all in one go. | Faster transfer speed; re-copies all data on every run.
lftp sync using copyq | Large datasets with many files and only a few updated with each transfer. | Slower transfer speed; copies only files that have changed.
Artemis rsync using dtq | Any large dataset. | Resumable; available only until Artemis is decommissioned on 29 August 2025.
Globus | Any file/dataset size. | Coming soon; will be the preferred method once available.
Where to put project files on Gadi
On Gadi (like on Artemis) you have access to a number of different storage areas for project files and data. Check the NCI User Guides for more detailed information. Here we provide a brief introduction for those familiar with Artemis.
Each space is intended for use in a specific way:
/home
Your home space (/home/<nci_user_id>) is owned by you and has 10 GiB of available space. This cannot be increased.
It works similarly to your /home space on Artemis and should only be used to store things like program code, batch scripts or software configuration information. As space is extremely limited here, it is inadvisable to use this space for storing project data.
/scratch
Your scratch space (/scratch/<project>) is owned by your project and has 1 TiB of available space, which can be increased upon request to NCI. It is roughly equivalent to /scratch on Artemis.
Data are not backed up, and files not accessed for 100 days will be quarantined for 14 days and then removed (see here for instructions on removing files from quarantine).
Scratch should be used for temporary files associated with jobs that have large data I/O, not for longer-term storage. Always ensure any data you need from a job left in /scratch is promptly backed up to the Research Data Store.
/g/data
Your /g/data space (/g/data/<project>) is owned by your project and has the space allocated by the Sydney Scheme Manager.
The files on /g/data are not backed up, but they will persist there for the lifetime of your project.
/g/data is intended to be used to store longer-term files that are regularly used by your project during its lifetime. Always ensure your data is regularly backed up from here to the Research Data Store.
/g/data may be accessed directly from PBS job scripts by using the -lstorage PBS directive.
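For example, a job that needs both scratch and /g/data for the example NCI project aa00 (the project code used in the worked examples later on this page) would include a directive like the following, using the same -lstorage syntax as the template scripts below:
#PBS -lstorage=scratch/aa00+gdata/aa00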
To check the amount of disk space you have available in the data areas listed above, type the command lquota at the Gadi prompt.
Research Data Store (RDS)
The RDS is NOT being decommissioned along with Artemis HPC. Any RDS projects you currently have will persist on RDS. It is your responsibility to back up any data on Artemis filesystems (/home, /scratch, /project) that you wish to keep prior to the decommission date of 29 August 2025. For information on how to go about this, see the SIH Artemis Training Series.
In this section, we will mainly focus on how to transfer data between Gadi HPC and RDS. You should be able to follow similar methods for copying data between your own laptop/server and Gadi.
Data transfer options
Depending on the size and complexity of the data you are transferring you have multiple options available:
- For small transfers (<1 GB) you can use a GUI based data transfer client such as filezilla or cyberduck.
- For mid-sized transfers up to tens of GB you can use terminal based transfer.
- For large transfers you should use the data transfer queue options on either Gadi (copyq) or Artemis (dtq).
All data transfer to and from Gadi should be made using the “Data Mover Node” at gadi-dm.nci.org.au where possible, rather than the login nodes. This ensures that data transfers do not consume the limited resources on the login nodes.
Globus - COMING SOON
In the coming months, Globus will be available for simplified and efficient data transfer. We will provide training and materials on this once available.
In the meantime, the below options are available, and detailed examples for each method are provided in the subsequent sections.
Transfer using RDS mapped network drive and data transfer client
For smaller files or datasets, for example a set of scripts that you are going to run, you can map your RDS project as a network drive and transfer the data to Gadi via an intermediate data transfer client GUI such as filezilla or cyberduck.
While simple to use, these clients are not recommended for large data transfers: the local computer becomes a bottleneck, and transfers are generally not resumable after interruption. Faster speeds will be obtained if you are on campus, but even then this method may be prohibitively slow for larger datasets.
cyberduck
The following instructions use the cyberduck data transfer client. The process using filezilla is similar, with the same username and server address as shown in this example.
To mount your RDS drive in either Windows or MacOS, please follow the instructions described here. You should have a File Explorer (Windows) or Finder (MacOS) window open and displaying the files and folders in your RDS project directory.
Next, download cyberduck from https://cyberduck.io, open it, and connect to Gadi:
- Click on the Open Connection icon at the top of the window.
- Select SFTP (SSH File Transfer Protocol) from the drop-down menu at the top of the box.
- In the Server field, enter gadi-dm.nci.org.au.
- In the Username field, enter your NCI username.
- In the Password field, enter your NCI password.
- Click Connect.
- If an Unknown fingerprint box appears, click the Always checkbox in the lower-left corner, then click Allow.
If you have successfully logged in, you will see a directory listing of /home/<user_login>. You can browse to your project folder in either /scratch or /g/data by pressing Ctrl + g and then typing /scratch/<project_id> or /g/data/<project_id>.
You can then transfer data between RDS and NCI by dragging and dropping files between your computer’s file explorer and the Cyberduck window.
Transfer from Gadi/Artemis terminal to/from RDS
You can use commands in your terminal application (Mac and Linux: Terminal; Windows: Windows Terminal or PowerShell) as an alternative to graphical applications.
Since the connection will be terminated if your computer sleeps, the terminal crashes, or the network drops out, this method is not particularly robust for large transfers. For these, use the queue based methods (either copyq on Gadi or dtq on Artemis) described below.
Transfers from a terminal on Gadi
Due to stringent security settings around Artemis and RDS, familiar commands like rsync or scp cannot be initiated from NCI Gadi login nodes. Instead, you have to use commands like sftp to copy the data.
Please note that running these commands on the login nodes is not the recommended way to transfer research data to Gadi! For smaller downloads this is OK, but for normal purposes Gadi’s copyq and data mover nodes are the appropriate tools.
To transfer data between RDS and Gadi from the Gadi login shell:
- Open a terminal (using the ‘Terminal’ app on MacOS or the ‘Command Prompt’ app on Windows) and then log into Gadi using ssh:
ssh <nci_user_id>@gadi.nci.org.au
You may be prompted to enter your NCI password at this point.
- Get the data from RDS, to a specific location on Gadi, e.g:
sftp -r <your_unikey>@research-data-ext.sydney.edu.au:/rds/PRJ-<Project Short ID>/<Path to File or Folder> <location on Gadi>
For example, if I wanted to copy data from my Training project on RDS in the folder MyData to Gadi in the scratch space for my NCI project named qc03:
sftp -r <my_unikey>@research-data-ext.sydney.edu.au:/rds/PRJ-Training/MyData /scratch/qc03/MyData
Replace /rds in the above with /project or /scratch for your preferred source folder or file.
If you want to copy the other way around (i.e. from Gadi /scratch to RDS), use:
sftp <your_unikey>@research-data-ext.sydney.edu.au:/rds/PRJ-<Project Short ID>/<Path to File or Folder> <<< $"put -r <location on Gadi>"
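For example, using the same example names as above, to put the MyData folder from Gadi scratch into the Training project on RDS:
sftp <my_unikey>@research-data-ext.sydney.edu.au:/rds/PRJ-Training <<< $"put -r /scratch/qc03/MyData"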
Transfers from a terminal on Artemis using rsync
This option is only available prior to the decommissioning of Artemis on 29 August 2025. After that date you will have to use either Globus (preferred), or copy data when logged into the Gadi terminal (either at the login shell or using copyq scripts).
When logged into Artemis you can use the rsync or scp command to copy data directly to/from Gadi, since Gadi allows the kind of secure connection these commands require.
Using rsync allows you to sync data between RDS and Gadi: only files that have been updated since the last transfer are copied, so small changes to large datasets can be transferred quickly.
You can also initiate the transfer from a terminal connected via ssh to research-data-int.sydney.edu.au (on campus or USyd VPN), using the rsync method described here - just replace hpc.sydney.edu.au with research-data-int.sydney.edu.au in step 1.
To transfer data between RDS and Gadi from the Artemis login shell:
- Open a terminal (using the ‘Terminal’ app on MacOS or the ‘Command Prompt’ app on Windows) and then log into Artemis using ssh:
ssh <your_unikey>@hpc.sydney.edu.au
You may be prompted to enter your password at this stage.
- Copy the data from RDS to a specific location on Gadi, e.g:
rsync -rtlPvz /rds/PRJ-<Project Short ID>/<Path to File or Folder in RDS> <nci_user_id>@gadi-dm.nci.org.au:<Destination on Gadi>
For example, if I wanted to sync data from my Training project on RDS in the folder MyData to Gadi in the scratch space for my NCI project named qc03:
rsync -rtlPvz /rds/PRJ-Training/MyData <nci_user_id>@gadi-dm.nci.org.au:/scratch/qc03/MyData
- You will be prompted for the password associated with your username on Gadi and the transfer will commence.
If you want to copy the other way around (i.e. from Gadi /scratch to RDS), simply reverse the source and destination in the command in step 2, e.g:
rsync -rtlPvz <nci_user_id>@gadi-dm.nci.org.au:<Source on Gadi> /rds/PRJ-<Project Short ID>/<Destination on RDS>
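For example, using the same example names as above:
rsync -rtlPvz <nci_user_id>@gadi-dm.nci.org.au:/scratch/qc03/MyData /rds/PRJ-Training/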
Using tmux to run commands in persistent terminal sessions
While this method can help you run your copy job in the background over a long period of time, for large file transfers it is better to use the copyq transfer method described below, as running jobs on the login node can overwhelm its limited resources.
You can run data transfers or other long-running commands in a tmux persistent session, which can be detached to run in the background. This allows you to log out and even switch off your computer while the command still runs within the detached session. You can reattach to the tmux session later to review the command’s progress and output.
For example, to copy data from RDS to Gadi with rsync within a tmux session:
- Login to RDS:
ssh <unikey>@research-data-int.sydney.edu.au
- Start a tmux session by entering: tmux
- Once inside the new session, issue the rsync command:
rsync -rtlPvz /rds/PRJ-<project>/<path-to-data> <nci_user_id>@gadi-dm.nci.org.au:<gadi-destination-path>
- Detach from the session by pressing Ctrl+b, then d.
- You can later reattach to the same session from a new ssh connection to research-data-int.sydney.edu.au with: tmux attach
For more information about tmux and its options, check here.
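If you have more than one session running, tmux’s standard session commands let you list and target them:
tmux ls                    # list active sessions
tmux attach -t <session>   # reattach to a specific session by name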
Transfer using sftp or lftp from Gadi copyq
The data transfer queue on Gadi is called copyq. This is comparable to the data transfer queue on Artemis, dtq. Data transfer methods/scripts that you used to put data onto Artemis, for example from the web via wget or from another server, should be easily portable to Gadi’s copyq.
If you have been relying on ssh key pairs between Gadi and Artemis for passwordless data transfers, please create a new ssh key pair between Gadi and research-data-ext (the RDS login server), as after 29 August your Gadi-Artemis key pair will no longer work. Users have a home directory on research-data-ext with a very small quota to enable storing of ssh key files.
Please note that the compute nodes on Gadi do not have internet access like the Artemis compute nodes do, so all required data must first be downloaded before submitting a compute job that requires it.
Due to stringent security settings around Artemis and RDS, commands like rsync or scp cannot be initiated from NCI Gadi login nodes or copyq. To initiate the transfer from Gadi, sftp or lftp must be used. In the near future, Globus will become available and will then be the preferred method for transferring data to and from Gadi.
How to set up SSH keys for passwordless data transfer
If you are transferring data directly, for example with scp on the command line or via a transfer client on your local computer, entering a password to initiate the transfer is straightforward. If, however, you want to transfer via a job submitted to either copyq or dtq, you will need to set up SSH keys first, or else your script will halt while it waits for a password to be entered.
SSH key pairs are used for secure communication between two systems. The pair consists of a private key and a public key. The private key should remain private and only be known by the user; it is stored securely on the user’s computer. The public key can be shared with any system the user wants to connect to, where it is added to the remote system’s authorized keys. When a connection is attempted, the remote system uses the public key to issue a challenge that can only be answered with the matching private key, so no password needs to be sent.
There are many general guides for this online, for example this one.
Follow the below steps carefully to set up SSH keys between RDS and Gadi. Note, you only need to do this once.
- Log into Gadi with your chosen method, e.g:
ssh ab1234@gadi.nci.org.au
- Move to your home directory:
cd ~
- Make a .ssh directory, if you don’t already have one:
mkdir -p .ssh
- Set suitable permissions for the .ssh directory and move into it:
chmod 700 .ssh
cd .ssh
- Generate an SSH key pair:
ssh-keygen
Hit enter when prompted, saving the key in ~/.ssh/id_rsa, and enter again for NO passphrase. A public key will be created at ~/.ssh/id_rsa.pub and a private key at ~/.ssh/id_rsa.
- Set suitable permissions for the keys:
chmod 600 id_rsa
chmod 644 id_rsa.pub
- Make an authorized_keys file if you don’t already have one:
touch ~/.ssh/authorized_keys
- Copy the contents of the public key file (~/.ssh/id_rsa.pub) to the authorized_keys file:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- Set permissions for the authorized_keys file:
chmod 600 ~/.ssh/authorized_keys
- Connect to USyd’s RDS login server using lftp and your unikey:
lftp sftp://<your-unikey>@research-data-ext.sydney.edu.au
Provide your unikey password when prompted. When you log in, you are in your personal home directory on the RDS login server. This is NOT the place to store data, only ssh key files.
Then make and move into a .ssh directory, if you don’t already have one (these commands run on the remote server from within lftp):
mkdir -p ~/.ssh
cd ~/.ssh
- Transfer the authorized_keys file from Gadi to USyd’s RDS login server:
put authorized_keys
Doing this will transfer the authorized_keys file on Gadi (your local side of the lftp session) to your current directory on the RDS login server, i.e. the ~/.ssh directory you just created. You can check the local and remote working directories from within lftp with lpwd and pwd respectively.
- Exit your lftp connection to RDS by entering Ctrl + d. You are now back in your Gadi session. Then test the passwordless connection:
sftp <your-unikey>@research-data-ext.sydney.edu.au
This time, you shouldn’t be prompted for a password. You can now proceed to transfer data between Gadi and RDS on the copyq.
If you get the error “Fatal error: Host key verification failed” you may have to accept the host’s “ssh fingerprint” first. Do this by sending an ssh request to RDS with:
ssh <your-unikey>@research-data-ext.sydney.edu.au
Accept that you trust the connection and enter your password. The connection will then close with the following message:
This service allows sftp connections only.
Connection to research-data-ext.sydney.edu.au closed.
Now try the lftp connection again.
Template copyq scripts for transferring data with sftp
The scripts below use sftp to transfer data between RDS and Gadi on the Gadi copyq. sftp can transfer whole files and directories, but it must copy all of your data every time; it cannot copy only modified files like rsync can. This makes it considerably slower for large datasets where only minor changes have been made during a run. An alternative command, lftp, can behave like rsync but has a slower transfer rate than sftp. We also provide a template lftp script below.
Copies of these scripts have been placed in /scratch/qc03/data-transfer-scripts/gadi-scripts. You can copy these scripts to your /scratch/<nci-project-code> or /home/<nci-user-id> workspace on Gadi and edit them (for example using nano <script>), replacing the names described in the header to suit your needs.
There are two scripts:
- from_gadi_to_rds.pbs is used to transfer a file or folder from Gadi to RDS
- from_rds_to_gadi.pbs is used to transfer a file or folder from RDS to Gadi
In the from_gadi_to_rds.pbs script, replace the variables in <brackets> as described:
#!/bin/bash
# Transfer from Gadi to RDS
#
# Make the following replacements in this file:
# <unikey> : Your USyd unikey
# <rds_project> : Your RDS project name
# <local_path> : The local file or folder you want to copy
# <remote_path> : The location on RDS to put your folder
# <nci_project> : Your NCI project name
#PBS -P <nci_project>
#PBS -N transfer
#PBS -l walltime=04:00:00
#PBS -l ncpus=1
#PBS -l mem=8GB
#PBS -W umask=022
#PBS -q copyq
#PBS -l wd
#PBS -lstorage=scratch/<nci_project>
remote_user=<unikey>
remote_host=research-data-ext.sydney.edu.au
remote_path=/rds/PRJ-<rds_project>/<remote_path>
# NOTE: Add a trailing slash (/) to local_path if you don't want to create the
# parent directory at the destination.
local_path=<local_path>
sftp ${remote_user}@${remote_host}:${remote_path} <<< $"put -r ${local_path}"
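Once edited, submit the script to the Gadi copyq with qsub, as shown for the second template in the worked example further below:
qsub from_gadi_to_rds.pbs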
In the from_rds_to_gadi.pbs script, replace the variables in <brackets> as described:
#!/bin/bash
# Transfer a folder from RDS to Gadi
# This will recreate your RDS path (/rds/PRJ-<rds_project>)
# on Gadi in /scratch/<nci_project>
# Make the following replacements in this file:
# <unikey> : Your USyd unikey
# <rds_project> : Your RDS project name
# <remote_path> : The location on RDS of your file or directory to copy
# <local_path> : The name of the folder to copy to
# <nci_project> : Your NCI project name
#PBS -P <nci_project>
#PBS -N transfer
#PBS -l walltime=04:00:00
#PBS -l ncpus=1
#PBS -l mem=8GB
#PBS -W umask=022
#PBS -q copyq
#PBS -l wd
#PBS -lstorage=scratch/<nci_project>
# RDS:
remote_user=<unikey>
remote_host=research-data-ext.sydney.edu.au
# NOTE: Add a trailing slash (/) to remote_path if you don't want to create the
# parent directory at the destination.
remote_path=/rds/PRJ-<rds_project>/<remote_path>
# Gadi:
dest_path=/scratch/<nci_project>/<local_path>
# Copy folder with sftp
sftp -r ${remote_user}@${remote_host}:${remote_path} ${dest_path}
Here is an example showing how to transfer a folder called MyData in the RDS project Training to some scratch space on Gadi owned by project aa00.
copyq transfer from RDS to Gadi
Log into Gadi, change directory to your project space, and make a folder for your workspace:
# Using an example username tm0000
ssh tm0000@gadi.nci.org.au
cd /scratch/aa00
# Make a folder called workspace in /scratch/aa00/tm0000
mkdir -p /scratch/aa00/tm0000/workspace
Copy the required data transfer script template from /scratch/qc03 to your newly made workspace. In this case we are copying from RDS to Gadi, so we use the from_rds_to_gadi.pbs script. You can also cut and paste the template script from above into your editor and save the edited script to your workspace.
cp /scratch/qc03/data-transfer-scripts/gadi-scripts/from_rds_to_gadi.pbs /scratch/aa00/tm0000/workspace
Then move to that workspace and open the script in an editor (in this example we’ll use the nano editor):
cd /scratch/aa00/tm0000/workspace
nano from_rds_to_gadi.pbs
You need to edit the script by replacing all the variables marked with <> described in the script header, filling in the following details before using it:
In the #PBS directives part of the script:
- Provide the -P variable by replacing <nci_project> with your NCI project code. In this example, aa00.
- Increase the walltime if you are transferring large files; the limit on this queue is 10 hours.
- Alter -lstorage=scratch/<project> as required. If you also need to access /g/data, you can change this to scratch/<project>+gdata/<project>. In this example we’ll just use scratch/aa00.
In the body of the script:
- Provide the remote_user variable by replacing <unikey> with your USyd unikey.
- Provide the remote_path variable by replacing <rds_project> and <remote_path> with your RDS project name and the path to the file or directory you want to transfer. In this example we use remote_path=/rds/PRJ-Training/MyData
Here is what the script will look like when correctly edited for this example:
#!/bin/bash
#PBS -P aa00
#PBS -N transfer
#PBS -l walltime=04:00:00
#PBS -l ncpus=1
#PBS -l mem=8GB
#PBS -W umask=022
#PBS -q copyq
#PBS -l wd
#PBS -lstorage=scratch/aa00
# RDS:
remote_user=tm0000 #Example unikey tm0000
remote_host=research-data-ext.sydney.edu.au
# NOTE: Add a trailing slash (/) to remote_path if you don't want to create the
# parent directory at the destination.
remote_path=/rds/PRJ-Training/MyData
# Gadi:
# This will create /scratch/aa00/MyData if transferring a folder and it doesn't already exist.
dest_path=/scratch/aa00/MyData
# Copy with sftp
sftp -r ${remote_user}@${remote_host}:${remote_path} ${dest_path}
When you have finished editing the script, save it (using Ctrl + x and answering y at the prompt if using nano as your editor).
Run the transfer script
Once you have customised the script, you can submit it to the copyq on Gadi. Run the qsub command from the directory where you saved the script:
qsub from_rds_to_gadi.pbs
This can be a nerve-wracking process, especially if you are transferring large files. You can check the status of your job on Gadi using:
qstat -Esw
Once it shows R (running), you can confirm the data is arriving at the intended location on Gadi (or on RDS, for transfers in the other direction) by logging into the destination system and checking for the presence of the file/directory in its expected location using:
ls MyData
Confirm transfers after completion
To confirm the transfer was successful, you’ll need to check your job logs. These are located in the same directory as your script and are named transfer.o<jobid>.
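For example, for a hypothetical job ID 12345678, review the log with:
cat transfer.o12345678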
However, an error-free log doesn’t guarantee the integrity of the files. You should check the files themselves to ensure they are intact, using MD5 checksums as described below.
Template copyq scripts for syncing data with lftp
If you have a large project with many files and only a few of them are modified at any time, you can use the mirror command in lftp to sync only the modified files when you are backing up data to RDS. The lftp mirror command transfers data at a slower speed than sftp, but it will significantly speed up backups since it generally only needs to copy a small subset of the data on Gadi if only a few files have changed since the last sync.
Below is a template .pbs script that can be used to sync files between Gadi and RDS. You can copy it to your workspace and modify it as needed, similarly to the example above. You can also find this script on Gadi in /scratch/qc03/data-transfer-scripts/gadi-scripts/sync_gadi_to_rds.pbs.
Note that when using this script to copy a folder you should ensure the target directory exists on RDS first, otherwise the parent directory will not be synced with the data.
#!/bin/bash
# Mirror directory from Gadi to RDS
#
# Make the following replacements in this file:
# <unikey>: Your USyd unikey
# <rds_project>: Your RDS project name
# <path-on-rds>: The location on RDS to put your directory
# <path-on-gadi>: The directory on Gadi to mirror to rds
# <nci_project>: Your NCI project name
#PBS -P <nci_project>
#PBS -N transfer
#PBS -l walltime=04:00:00
#PBS -l ncpus=1
#PBS -l mem=8GB
#PBS -W umask=022
#PBS -q copyq
#PBS -l wd
#PBS -lstorage=scratch/<nci_project>
remote_user=<unikey>
remote_host=research-data-ext.sydney.edu.au
remote_path=/rds/PRJ-<rds_project>/<path-on-rds>
source_path=<path-on-gadi>
outfile="${PBS_O_WORKDIR}/${PBS_JOBNAME}_${PBS_JOBID}.log"
echo "Writing log to: ${outfile}"
touch $outfile
lftp -u ${remote_user}, sftp://${remote_host} <<EOF
set xfer:log true
set xfer:log-file "${outfile}"
mirror -p --verbose -R ${source_path} ${remote_path}
exit
EOF
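Once you have edited the template, submit it to the copyq in the usual way:
qsub sync_gadi_to_rds.pbs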
Suggested workflow for copying data between RDS and Gadi and keeping it up-to-date
The sftp copy method and lftp sync method described above both have their pros and cons:
- sftp has a faster transfer speed but can only copy all your files in bulk when it is run.
- lftp has a slower transfer speed but can sync only the subset of files that have changed.
Because of this we suggest users be selective about which method they use based on their needs.
For example, a simple workflow for large projects that have a lot of data and many files in a folder, only a few of which change between backups to RDS, would be:
- Initially use the sftp based from_rds_to_gadi.pbs script to bulk copy your data from RDS to your project space on Gadi, since this is faster for bulk transfers.
- Subsequently use the lftp based sync_gadi_to_rds.pbs to sync smaller files (e.g. output logs and data) back to RDS, without having to re-copy the bulk of the data back to RDS.
Transfer using rsync from Artemis dtq
This option is only available prior to the decommissioning of Artemis on 29 August 2025. After that date you will have to use either Globus (preferred), or copy data when logged into the Gadi terminal (either at the login shell or using copyq scripts).
For transfer of large files directly from Artemis to Gadi, the use of resumable rsync is recommended (see the script below). The transfer can be initiated from the Artemis dtq using Gadi’s data mover node, gadi-dm.nci.org.au.
Further information about copying data using the Artemis dtq can be found in the SIH Artemis Training Series.
Template dtq script using rsync:
#!/bin/bash
# This is an Artemis data transfer script
#PBS -P <project>
#PBS -N transfer
#PBS -l walltime=04:00:00
#PBS -l ncpus=1
#PBS -l mem=8GB
#PBS -W umask=022
#PBS -q dtq
# NOTE: Add a trailing slash (/) to source_path if you don't want to create the
# parent directory at the destination.
source_path=/rds/PRJ-<project>/<path>/<file>
destination=<user>@gadi-dm.nci.org.au
destination_path=<path-on-gadi>
# Retry loop: --append-verify lets rsync resume partially transferred files
while true
do
  echo "Transferring ${source_path} to ${destination}:${destination_path}"
  rsync -rtlPvz --append-verify ${source_path} ${destination}:${destination_path}
  if [ "$?" = "0" ]
  then
    echo "Rsync completed normally"
    echo "Finished transferring at $(date)"
    exit
  else
    echo "Rsync failure. Backing off and retrying in 3 minutes"
    sleep 180
  fi
done
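Save the edited template on Artemis (the filename below is a hypothetical example) and submit it to the dtq with qsub:
qsub artemis_rsync_transfer.pbs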
Verify the integrity of your transferred data
When you transfer files between systems, especially over a network using tools described above, there is always a small risk that the data could become corrupted. This might happen for example due to network issues, disk errors, or file system problems. It is therefore important that you verify that your files were transferred completely and accurately. We can do this using a checksum.
A checksum is a small, fixed-size value calculated from the contents of a file using a specific algorithm (like MD5, SHA-1, or SHA-256). Think of it as a unique fingerprint for the file’s data: if even one byte changes, the checksum will be different. By comparing the checksum before and after the transfer, you can confirm that the file is exactly the same on both ends.
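For example, md5sum prints a 32-character hexadecimal hash followed by the filename. The MD5 of an empty file is always the same value, which makes for an easy sanity check:
touch empty.txt    # create an empty test file
md5sum empty.txt   # prints: d41d8cd98f00b204e9800998ecf8427e  empty.txt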
The overall process of verifying a data transfer is outlined below. In some cases, you may already have checksum files that you previously generated, or even downloaded with your raw data. If so, there is no need to re-generate the “source” checksum files.
For transfers from RDS to Gadi, RDS is the “source” and Gadi is the “destination”. Note that the source and destination are reversed when transferring output files generated on Gadi back to RDS.
Using md5sum to validate files: from RDS to Gadi
Here we demonstrate how you can use the command md5sum to generate a list of MD5 checksums for a directory on RDS, and then use these checksums to validate the integrity of your files after the data has been transferred to Gadi.
1. Create a list of checksums for your files at the source (before copying the data)
Connect to the RDS internal login node with ssh (remember to have the VPN connected if off-campus):
ssh <unikey>@research-data-int.sydney.edu.au
Enter your unikey password, and once connected to RDS, navigate to the folder containing the data you are going to transfer to Gadi.
Use the md5sum command to produce an MD5 hash for every file within that directory and its subdirectories:
cd /rds/PRJ-<dashR-project>/myData
find -type f \( -not -name "myData.md5" \) -exec md5sum '{}' \; > myData.md5
This will generate a file called myData.md5 in your folder, listing a hash for every file (except the myData.md5 file itself).
2. Copy the data to Gadi
You should do this using one of the methods described above on this page - make sure to also transfer the .md5 file and keep it in the same relative location as it was at the source (RDS).
3. Verify the copy of the files on Gadi
Once the transfer of data and checksum files to Gadi is complete, you can use the -c (check) flag of the md5sum command to verify file integrity.
For small files, you can easily run this on the Gadi login nodes, eg:
cd /scratch/<nci-project-id>/myData
md5sum -c myData.md5
This command goes through the list of files in myData.md5 and compares the MD5 sum of each file on Gadi to the MD5 sum that was created for the file on RDS.
For larger files, or very large numbers of files, the checksum process can take some time (e.g. >10 minutes for a single 100 GB file), so it is more robust to run the checksum commands within a PBS job submitted to the Gadi job scheduler.
You can do this simply by placing the checksum command described above inside a Gadi job script, as sketched below.
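A minimal sketch of such a job script, reusing the PBS header style of the transfer templates above. The project code aa00, the queue, and the resource requests are illustrative assumptions - adjust them, and the paths, to your own project:
#!/bin/bash
# Run checksum verification as a PBS job on Gadi
#PBS -P aa00
#PBS -N checksum
#PBS -q normal
#PBS -l walltime=02:00:00
#PBS -l ncpus=1
#PBS -l mem=4GB
#PBS -l wd
#PBS -lstorage=scratch/aa00
# Compare each file against the hashes generated at the source
cd /scratch/aa00/myData
md5sum -c myData.md5 > myData.md5.check
Submit it with qsub as for the transfer scripts.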
Files which have been transferred with full integrity will report “OK”:
file1.txt: OK
sub_folder/file2.txt: OK
If there was a problem with any file, you’ll see a warning like:
file3.txt: FAILED
Redirecting the output of the md5sum -c command to a file can simplify collecting failed transfers for resubmission:
md5sum -c myData.md5 > myData.md5.check
grep FAILED myData.md5.check > myData.md5.check.FAILED
Any files with FAILED status should be re-copied from RDS to Gadi, and the checksum comparison run again.
Using md5sum to validate files: from Gadi to RDS
When working on Gadi, you will want to regularly back up important data (for example job outputs, logs, scripts) to RDS. Gadi scratch and gdata filesystems are not backed up, so regular backups to RDS are critical.
The high-level checksum process is the same as it was for RDS to Gadi, but the source and destination are reversed.
1. Create a list of checksums for your files at the source (before copying the data)
On Gadi, you can create the checksum file with the same md5sum command as described above, either on a Gadi login node or from within a Gadi job script.
Importantly, the Gadi compute queues can read gdata, unlike Artemis, where the general compute queues could not read RDS. So if your data is on gdata, you do not need to copy it to scratch to run checksum commands within Gadi PBS jobs.
2. Copy the data to RDS
You should do this using one of the methods described above on this page - make sure to also transfer the .md5 file and keep it in the same relative location as it was at the source (Gadi).
3. Verify the copy of the files on RDS
Once the transfer of data and checksum files to RDS is complete, you can use the -c (check) flag of the md5sum command to verify file integrity.
Connect to the RDS internal login node with ssh (remember to have the VPN connected if off-campus):
ssh <unikey>@research-data-int.sydney.edu.au
Enter your unikey password, and once connected, navigate to the directory containing the data you have just transferred from Gadi to RDS.
cd /rds/PRJ-<dashR-project>/myData
md5sum -c myData.md5 > myData.md5.check
grep FAILED myData.md5.check > myData.md5.check.FAILED
Any files with FAILED status should be re-copied from Gadi to RDS, and the checksum run again.
Alternate methods for running checksum commands on RDS
For large files or datasets with numerous files, creating checksums or running checksum comparisons can be time-consuming. When we ssh into research-data-int.sydney.edu.au and run the checksum command, we rely on the terminal window staying open and connected, the internet connection remaining stable, the computer not going to sleep or being switched off, and so on. On Gadi, we can mitigate these issues by submitting long-walltime checksum commands to the job scheduler. Below are some other options for running checksum commands (or any other data management commands) on data stored on RDS:
- Map your RDS as a network drive on your local computer. Then use your Mac terminal, Windows PowerShell, WSL2, etc. to run the commands using your local computer’s resources. You will still need to ensure a stable internet connection and that your computer does not go to sleep (unless using tmux, see below); however, this setup may be more stable than relying on the ssh connection to the RDS login node.
- Run your checksum commands within a persistent terminal session with tmux. You can do this with an ssh connection to research-data-int.sydney.edu.au, or from the Mac/PowerShell/WSL2 terminal reading the RDS mapped network drive. When you run your checksum command inside a tmux session and then detach from it, the session (and the command) will remain active, even if you lose internet connection, close your terminal, or even power off your computer. You can reattach to the same session later to check the status of the checksum command. See the section on tmux above for basic use.
- Until the decommission date of 29 August 2025, the data transfer queue dtq on Artemis can be used to run checksums on data within RDS. dtq is the only queue on Artemis that can read RDS.
- Globus (coming soon) will simplify and expedite data transfer between RDS and Gadi. Globus will handle checksum creation and verification for you if the appropriate flags are set.