Data transfer
Overview
Teaching: 5 min
Exercises: 1 min
Questions
Transfer to, from and within Gadi
Objectives
Understand Gadi copyq and how to move data between Gadi, Artemis and RDS
Data transfer
The data transfer queue on Gadi is called ‘copyq’. You can easily use this queue to transfer data between Gadi and RDS (or other locations) by first setting up ssh keys for password-less transfers between Gadi and Artemis/RDS.
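The key setup might look like the following sketch. The key path and remote hostname here are illustrative placeholders; in practice you would generate the key under ~/.ssh and copy it to the Artemis/RDS login host.

```shell
# Sketch: generate a dedicated key pair for password-less transfers.
# A temp directory is used here so the sketch is self-contained;
# in practice use ~/.ssh and consider a passphrase per your site policy.
keydir="$(mktemp -d)"
ssh-keygen -t ed25519 -N "" -f "$keydir/id_transfer" -q

# The public key is what gets appended to ~/.ssh/authorized_keys
# on the remote host:
cat "$keydir/id_transfer.pub"

# In practice, copy it across with (placeholder hostname):
#   ssh-copy-id -i "$keydir/id_transfer.pub" <unikey>@hpc.sydney.edu.au
```

Once the public key is in place on the remote side, rsync and sftp jobs in copyq or dtq can authenticate without prompting for a password.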
For transfer of large files, the use of 'resumable' rsync is recommended. As the USyd RCOS servers only allow sftp connections, this method cannot be run from Gadi's copyq. Instead, the transfer can be initiated from Artemis' 'dtq', targeting Gadi's 'data mover' node: gadi-dm.nci.org.au.
Example script: RDS to Gadi from Artemis dtq with rsync
An example script to transfer a large data file from RDS to Gadi scratch using the Artemis 'dtq' is shown below. It can easily be adapted for transfers in the other direction by swapping source and destination, or for whole-directory/wildcard transfers by giving a directory path or adding a '*' wildcard. In the event of a network interruption, rsync will keep partially transferred files (the default is to delete them), wait for three minutes and then retry, appending the newly transferred bytes to the partial file.
#!/bin/bash
# This is an Artemis data transfer script
#PBS -P <project>
#PBS -N transfer
#PBS -l walltime=04:00:00
#PBS -l ncpus=1
#PBS -l mem=8GB
#PBS -W umask=022
#PBS -q dtq

source_path=/rds/PRJ-<project>/<path>/<file>
destination=<user>@gadi-dm.nci.org.au
destination_path=<path-on-gadi>

while true
do
    echo "Transferring ${source_path} to ${destination}:${destination_path}"
    rsync -rtlPvz --append-verify "${source_path}" "${destination}:${destination_path}"
    if [ "$?" -eq 0 ]
    then
        echo "Rsync completed normally"
        echo "Finished transferring at $(date)"
        exit 0
    else
        echo "Rsync failure. Backing off and retrying in 3 minutes"
        sleep 180
    fi
done
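The retry logic in the script above is a general pattern: loop until the command succeeds, backing off between attempts. A minimal, self-contained sketch of the same pattern, using a stand-in function (`attempt_transfer`, hypothetical) in place of the real rsync call, and a short sleep instead of 180 seconds:

```shell
# Retry a command until it exits 0, with a back-off between attempts.
# 'attempt_transfer' is a stand-in for the real rsync invocation.
attempts=0
attempt_transfer() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]    # pretend the first two attempts fail
}

until attempt_transfer
do
    echo "Transfer failed. Backing off before retrying"
    sleep 1                  # the dtq script above uses sleep 180
done
echo "Transfer completed after ${attempts} attempt(s)"
```

Because rsync with --append-verify resumes from the partial file, each retry only moves the bytes that are still missing rather than starting over.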
Example script: Gadi to RDS from Gadi copyq with sftp
To instead use Gadi’s copyq to transfer data to RDS via sftp, an example script is shown below:
#!/bin/bash
# This is a Gadi data transfer script
#PBS -P <project>
#PBS -N transfer
#PBS -l walltime=01:00:00
#PBS -l ncpus=1
#PBS -l mem=8GB
#PBS -W umask=022
#PBS -q copyq
#PBS -l wd
#PBS -l storage=scratch/<project>
# Remote server details:
remote_user=<unikey>
remote_host=research-data-ext.sydney.edu.au
remote_path=/rds/PRJ-<project>/<path>
# Download a file from RDS to Gadi:
dest_path=/scratch/<project>/<path>
remote_file=<filename>
sftp ${remote_user}@${remote_host}:${remote_path}/${remote_file} ${dest_path}
# Download a directory from RDS to Gadi:
remote_dir=<dirname>
sftp -r ${remote_user}@${remote_host}:${remote_path}/${remote_dir} ${dest_path}
# Upload a file from Gadi to RDS:
local_file=<filename>
sftp ${remote_user}@${remote_host}:${remote_path} <<< "put ${local_file}"
# Upload a directory from Gadi to RDS:
# CAVEAT: this method will only work if a directory of the same name exists at the destination!
local_dir=<dirname>
sftp ${remote_user}@${remote_host}:${remote_path} <<< "put -r ${local_dir}"
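The here-string form above passes a single command to sftp; for several operations in one session, sftp's batch mode (-b) is convenient. A sketch, with the same placeholder remote details as the script above (the sftp call itself is left commented since the placeholders are not a real host):

```shell
# Write the sftp commands to a batch file, then run them in one session.
batchfile="$(mktemp)"
cat > "$batchfile" <<'EOF'
put <filename>
put -r <dirname>
EOF

# In practice (placeholders as in the copyq script above):
#   sftp -b "$batchfile" <unikey>@research-data-ext.sydney.edu.au:/rds/PRJ-<project>/<path>
```

Batch mode also makes sftp exit non-zero if any command in the file fails, which is useful for checking job success in copyq logs.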
As always, don’t forget to perform checksumming between source and destination files to ensure data integrity!
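One way to do that, assuming md5sum is available at both ends, is to record checksums at the source and verify them at the destination. A self-contained sketch using a throwaway file and a local copy as a stand-in for the actual transfer:

```shell
# Sketch: checksum a file before transfer and verify the copy afterwards.
workdir="$(mktemp -d)"
echo "example payload" > "$workdir/data.bin"

# At the source: record the checksum alongside the data
( cd "$workdir" && md5sum data.bin > data.bin.md5 )

# "Transfer" the file and its checksum (stand-in for rsync/sftp)
mkdir "$workdir/dest"
cp "$workdir/data.bin" "$workdir/data.bin.md5" "$workdir/dest/"

# At the destination: verify (prints "data.bin: OK" on success)
( cd "$workdir/dest" && md5sum -c data.bin.md5 )
```

Transferring the .md5 file together with the data means the verification step can be run on the destination system without a second pass over the source.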
Wget some data onto your Gadi scratch space
Sections 7 Optimisation and 8 Example parallel job require some sample data. Please select the sample data appropriate for your interests, and use the wget command to download it to your Gadi scratch processing area.
Please note that running wget commands on the login nodes is not the recommended way to transfer research data to Gadi! For small downloads such as these it is fine, but for normal purposes Gadi's copyq and data mover nodes are the appropriate tools.
Choose your emoji:
- If you do not intend to stay for these optional (but recommended) sessions, you do not need any sample data.

Otherwise, please execute the following commands, and then download the applicable sample data:

cd /scratch/<project>
mkdir <training_dir>
cd <training_dir>

- If you intend to stay for Section 7 Optimisation and do not intend to stay for Section 8 Example parallel job, please download Sample_data_benchmark (144 MB):

wget -O Sample_data_benchmark.tar.gz https://cloudstor.aarnet.edu.au/plus/s/LYqsJ1rGDRjx46x/download

- If you intend to stay for Section 7 Optimisation and the python 'simulation' example of Section 8 Example parallel job, please download Sample_data_benchmark (144 MB):

wget -O Sample_data_benchmark.tar.gz https://cloudstor.aarnet.edu.au/plus/s/LYqsJ1rGDRjx46x/download

- If you intend to stay for Section 7 Optimisation and the 'bioinformatics' example of Section 8 Example parallel job, please download Sample_data_bio (200 MB) and Sample_data_bio_fastq (5.8 GB):

wget -O Sample_data_bio.tar.gz https://cloudstor.aarnet.edu.au/plus/s/5QpiXJnYdhJ8gQO/download
wget -O Sample_data_bio_fastq.tar.gz https://cloudstor.aarnet.edu.au/plus/s/1rQp7axpgRfdspA/download
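Once downloaded, the archives need to be unpacked in your training directory. A sketch of the unpack-and-tidy step, using a dummy archive built on the spot so it is self-contained (substitute the real Sample_data_*.tar.gz you downloaded):

```shell
# Sketch: unpack a downloaded .tar.gz and remove the archive afterwards.
workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in for the downloaded sample data archive
mkdir Sample_data_benchmark
echo "placeholder" > Sample_data_benchmark/README
tar -czf Sample_data_benchmark.tar.gz Sample_data_benchmark
rm -r Sample_data_benchmark

# The real step after wget: extract, then delete the archive
# to save space against your scratch quota
tar -xzf Sample_data_benchmark.tar.gz
rm Sample_data_benchmark.tar.gz
```

Deleting the archive after extraction matters on scratch, where both quota and file-count limits apply.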
Questions
Does anyone have any questions about data transfer between NCI Gadi and USyd Artemis/RCOS?
Key Points
- Gadi's data transfer queue is copyq
- Gadi's data mover node is gadi-dm.nci.org.au
- Use Gadi copyq or Artemis dtq depending on your data transfer requirements