Data transfer

Overview

Teaching: 5 min
Exercises: 1 min
Questions
  • Transfer to, from and within Gadi

Objectives
  • Understand Gadi copyq and how to move data between Gadi, Artemis and RDS

Data transfer

The data transfer queue on Gadi is called ‘copyq’. Once you have set up SSH keys for password-less transfers between Gadi and Artemis/RDS, you can use this queue to transfer data between Gadi and RDS (or other locations).
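
The key setup mentioned above could look something like the sketch below. It assumes OpenSSH is available and uses a hypothetical helper name (make_transfer_key) and a hypothetical key file name. How the public key is installed on the remote system depends on that system, so that step is shown only as a comment.

```shell
#!/bin/bash
# Sketch only: create a dedicated, passphrase-less key pair for batch transfers.
# make_transfer_key and the key file name are illustrative, not site requirements.

make_transfer_key() {
    local key_path="$1"
    local key_dir
    key_dir=$(dirname "${key_path}")
    mkdir -p "${key_dir}"
    chmod 700 "${key_dir}"
    # -N "" sets an empty passphrase so that batch jobs can use the key unattended
    if [ ! -f "${key_path}" ]; then
        ssh-keygen -q -t ed25519 -N "" -f "${key_path}"
    fi
    # Print the path of the public key, which is the half that goes to the remote system
    echo "${key_path}.pub"
}

# Example (run on Gadi):
#   make_transfer_key "${HOME}/.ssh/id_ed25519_transfer"
# Then add the contents of the .pub file to ~/.ssh/authorized_keys on the
# remote system, for example with:
#   ssh-copy-id -i ~/.ssh/id_ed25519_transfer.pub <user>@<remote-host>
```

Because the key uses a non-default file name, ssh, sftp and rsync will only pick it up if it is named explicitly (for example with the -i option) or listed as an IdentityFile in ~/.ssh/config.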


For large files, ‘resumable’ rsync is recommended. Because the USyd RCOS servers only allow sftp connections, rsync to RDS cannot be run from Gadi’s copyq. Instead, initiate the transfer from the Artemis ‘dtq’ and connect to Gadi’s ‘data mover’ node: gadi-dm.nci.org.au.


Example script: RDS to Gadi from Artemis dtq with rsync

An example script to transfer a large data file from RDS to Gadi scratch using the Artemis ‘dtq’ is shown below. To transfer in the other direction, swap the source and destination. To transfer a whole directory or a set of wildcard-matched files, give a path to a directory rather than a file, or add a ‘*’ wildcard. In the event of a network interruption, the script keeps partially transferred files (the rsync default is to delete them), waits three minutes and then tries again, appending the newly transferred bytes to the partial file.

#!/bin/bash

# This is an Artemis data transfer script

#PBS -P <project>
#PBS -N transfer
#PBS -l walltime=04:00:00
#PBS -l ncpus=1
#PBS -l mem=8GB
#PBS -W umask=022
#PBS -q dtq

source_path=/rds/PRJ-<project>/<path>/<file>

destination=<user>@gadi-dm.nci.org.au
destination_path=<path-on-gadi>

# Retry until rsync succeeds
while true
do
        echo "Transferring ${source_path} to ${destination}:${destination_path}"
        # -P keeps partial files and shows progress; --append-verify resumes them
        # ${source_path} is left unquoted so that a '*' wildcard can expand
        if rsync -rtlPvz --append-verify ${source_path} "${destination}:${destination_path}"
        then
                echo "Rsync completed normally"
                echo "Finished transferring at $(date)"
                exit 0
        else
                echo "Rsync failure. Backing off and retrying in 3 minutes"
                sleep 180
        fi
done
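
The loop above keeps retrying until the transfer succeeds or the job runs out of walltime. If you would rather give up after a fixed number of attempts, a bounded variant could be sketched as below. The function name (retry) and its parameters are illustrative and not part of the original script.

```shell
#!/bin/bash
# Sketch: run a command until it succeeds or a maximum number of attempts is reached.
# retry, max_attempts and delay are illustrative names.

retry() {
    local max_attempts="$1"
    local delay="$2"
    shift 2
    local attempt=1
    while true
    do
        # The remaining arguments are the command to run
        if "$@"
        then
            return 0
        fi
        if [ "${attempt}" -ge "${max_attempts}" ]
        then
            echo "Giving up after ${attempt} attempts" >&2
            return 1
        fi
        echo "Attempt ${attempt} failed. Retrying in ${delay} seconds" >&2
        sleep "${delay}"
        attempt=$((attempt + 1))
    done
}

# Example: at most 10 attempts, 180 seconds apart
#   retry 10 180 rsync -rtlPvz --append-verify ${source_path} "${destination}:${destination_path}"
```

A non-zero return value from retry can then be used to make the job exit with an error, so that a failed transfer is visible in the job’s exit status.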


Example script: Gadi to RDS from Gadi copyq with sftp

To instead use Gadi’s copyq to transfer data to RDS via sftp, an example script is shown below:

#!/bin/bash

# This is a Gadi data transfer script

#PBS -P <project>
#PBS -N transfer
#PBS -l walltime=01:00:00
#PBS -l ncpus=1
#PBS -l mem=8GB
#PBS -W umask=022
#PBS -q copyq
#PBS -l wd
#PBS -l storage=scratch/<project>

# Remote server details:
remote_user=<unikey>
remote_host=research-data-ext.sydney.edu.au
remote_path=/rds/PRJ-<project>/<path>

# Download a file from RDS to Gadi:
dest_path=/scratch/<project>/<path>
remote_file=<filename>
sftp "${remote_user}@${remote_host}:${remote_path}/${remote_file}" "${dest_path}"

# Download a directory from RDS to Gadi:
remote_dir=<dirname>
sftp -r "${remote_user}@${remote_host}:${remote_path}/${remote_dir}" "${dest_path}"

# Upload a file from Gadi to RDS:
local_file=<filename>
sftp "${remote_user}@${remote_host}:${remote_path}" <<< "put ${local_file}"

# Upload a directory from Gadi to RDS:
# CAVEAT: this method will only work if a directory of the same name already exists at the destination!
local_dir=<dirname>
sftp "${remote_user}@${remote_host}:${remote_path}" <<< "put -r ${local_dir}"
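
One possible way around the directory caveat above is an sftp batch file that tries to create the destination directory before uploading. In OpenSSH sftp batch files, prefixing a command with ‘-’ lets the batch continue if that command fails, which covers the case where the directory already exists. The helper name (write_upload_batch) and the batch file name are hypothetical, and the sketch assumes directory names without spaces.

```shell
#!/bin/bash
# Sketch: build an sftp batch file that creates the destination directory
# (ignoring the error if it already exists) before uploading recursively.
# write_upload_batch is a hypothetical helper name.

write_upload_batch() {
    local local_dir="$1"
    local batch_file="$2"
    local dir_name
    dir_name=$(basename "${local_dir}")
    {
        # The leading '-' tells sftp to carry on if mkdir fails
        echo "-mkdir ${dir_name}"
        echo "put -r ${local_dir}"
    } > "${batch_file}"
}

# Example, using the variables from the script above:
#   write_upload_batch "${local_dir}" upload.batch
#   sftp -b upload.batch "${remote_user}@${remote_host}:${remote_path}"
```

Batch mode (-b) requires non-interactive authentication, so it relies on the SSH keys having been set up beforehand.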


:boom: As always, don’t forget to perform checksumming between source and destination files to ensure data integrity!
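
As a rough sketch of one way to do this, the functions below record SHA-256 checksums for every file in a directory at the source and verify them at the destination. They assume GNU coreutils (sha256sum) and findutils are available on both systems. The function names and the manifest file name are illustrative.

```shell
#!/bin/bash
# Sketch: record checksums at the source, then verify them at the destination.
# make_manifest, verify_manifest and the manifest name are illustrative.

make_manifest() {
    # Write one "checksum  ./relative/path" line per regular file under the directory
    local dir="$1"
    local manifest="$2"
    ( cd "${dir}" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "${manifest}"
}

verify_manifest() {
    # Re-compute checksums under the directory and compare them with the manifest.
    # Returns non-zero if any file is missing or differs.
    local dir="$1"
    local manifest="$2"
    ( cd "${dir}" && sha256sum --quiet -c - ) < "${manifest}"
}

# Example:
#   On the source system:       make_manifest /path/to/data data.sha256
#   Transfer data.sha256 along with the data, then on the destination:
#                               verify_manifest /path/to/copied/data data.sha256
```

The manifest is written outside the data directory so that it is not included in its own list of files.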


Wget some data onto your Gadi scratch space

Sections 7 Optimisation and 8 Example parallel job require some sample data. Please select the sample data that matches your interests, and use the wget command to download it to your Gadi scratch processing area.


:exclamation: Please note that running wget commands on the login nodes is not the recommended way to transfer research data to Gadi! For small downloads such as these, this is OK, but for normal purposes Gadi’s copyq and data mover nodes are the appropriate tools.


Choose your emoji:

  • :sob: If you do not intend to stay for these optional (but recommended) sessions, you do not need any sample data


Otherwise, please execute the following commands, and then download the applicable sample data:

        cd /scratch/<project>
        mkdir <training_dir>
        cd <training_dir>


  • :muscle: If you intend to stay for Section 7 Optimisation and do not intend to stay for Section 8 Example parallel job, please download Sample_data_benchmark (144 MB)

          wget -O Sample_data_benchmark.tar.gz https://cloudstor.aarnet.edu.au/plus/s/LYqsJ1rGDRjx46x/download
    
  • :snake: If you intend to stay for Section 7 Optimisation and the python ‘simulation’ example of Section 8 Example parallel job, please download Sample_data_benchmark (144 MB)

          wget -O Sample_data_benchmark.tar.gz https://cloudstor.aarnet.edu.au/plus/s/LYqsJ1rGDRjx46x/download
    
  • :dog: If you intend to stay for Section 7 Optimisation and the ‘bioinformatics’ example of Section 8 Example parallel job, please download Sample_data_bio (200 MB) and Sample_data_bio_fastq (5.8 GB)

          wget -O Sample_data_bio.tar.gz https://cloudstor.aarnet.edu.au/plus/s/5QpiXJnYdhJ8gQO/download
          wget -O Sample_data_bio_fastq.tar.gz https://cloudstor.aarnet.edu.au/plus/s/1rQp7axpgRfdspA/download        
    


Questions

:raised_hand: Does anyone have any questions about data transfer between NCI Gadi and USyd Artemis/RCOS?


Key Points

  • Gadi’s data transfer queue is copyq

  • Gadi’s data mover node is gadi-dm.nci.org.au

  • Use Gadi copyq or Artemis dtq depending on your data transfer requirements