Data transfer

THIS PAGE IS UNDER DEVELOPMENT


There are many tools for data transfer, both command-line and GUI-based, and the best method will depend on the size of your dataset, the connection between the two machines, and personal preference. In many cases, numerous methods could do an equally good job. This page provides a quickstart overview of some common data transfer tasks for USyd researchers working on Pawsey.

Data transfer best practices

  • Use command-line transfers for small datasets via the data mover nodes
  • Submit large transfers as jobs on the copy partition
  • Use resumable transfer commands for large transfers
  • Where possible, chunk large datasets (e.g. by folder or name pattern) and submit them as separate jobs
  • Regularly sync your important output to RDS to avoid the 21-day /scratch file purge
  • Archive folders containing numerous files with tar for faster, more efficient transfers
  • Use Pawsey’s Acacia storage buckets as an intermediate path for increased speed
  • Always verify the integrity of transferred files, e.g. with MD5 or SHA-256 checksums
  • Keep an eye out for announcements from SIH regarding Globus availability for fast, simplified data transfer
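Two of the tips above, archiving many small files with tar and verifying integrity with checksums, work well together. A minimal local sketch (the directory and file names are illustrative, not part of any Pawsey setup):

```shell
# Bundle a directory of many small files into a single archive;
# one large file transfers far more efficiently than thousands of small ones
mkdir -p results
echo "sample output" > results/run1.txt
tar -czf results.tar.gz results

# Record a checksum before the transfer...
sha256sum results.tar.gz > results.tar.gz.sha256

# ...then, after transferring both files, verify on the destination machine
sha256sum -c results.tar.gz.sha256
```

If the archive arrived intact, `sha256sum -c` prints `results.tar.gz: OK`; any other output indicates a corrupted transfer that should be repeated.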

Read on for guidance on applying some of these tips to your work on Pawsey!

Data mover nodes

Pawsey provides dedicated data-mover nodes. These should be used instead of the login nodes for any data transfers to, from, or within Pawsey filesystems to avoid overloading the login nodes.

The hostname for these nodes is data-mover.pawsey.org.au.

Users can connect to the data mover nodes over ssh and transfer small datasets with command line transfer tools.

Transferring small datasets via command line

Note

Visit Pawsey’s file transfer guide for detailed instructions and recommended flags for scp, rsync, and sftp command-line transfers.

Below are some common usage scenarios to get you started.

scp

Copy a file from local to Pawsey

scp <file> <pawsey-user-id>@data-mover.pawsey.org.au:/scratch/<pawsey-project-id>/<pawsey-user-id>

Copy a file from Pawsey to local

scp <pawsey-user-id>@data-mover.pawsey.org.au:<pawsey-file-path> <local-destination-path>

Copy a file from Pawsey to remote

ssh <pawsey-user-id>@data-mover.pawsey.org.au
scp <pawsey-file> <remote-user>@<remote-host>:<remote-destination-path>

sftp

Copy a file from local to Pawsey

sftp <pawsey-user-id>@data-mover.pawsey.org.au
cd /scratch/<pawsey-project-id>/<pawsey-user-id>
put <file>

Copy a file from Pawsey to local

sftp <pawsey-user-id>@data-mover.pawsey.org.au
cd /scratch/<pawsey-project-id>/<pawsey-user-id>
get <file>

Copy a folder from local to Pawsey

sftp <pawsey-user-id>@data-mover.pawsey.org.au
cd /scratch/<pawsey-project-id>/<pawsey-user-id>
put -r ./<folder>

Copy a folder from Pawsey to local

sftp <pawsey-user-id>@data-mover.pawsey.org.au
cd /scratch/<pawsey-project-id>/<pawsey-user-id>
get -r <folder>

Copy a folder from Pawsey to RDS

USyd network or VPN not required.

sftp <unikey>@research-data-ext.sydney.edu.au
cd /rds/PRJ-<project>
put -r ./<folder>

rsync

Copy a folder from RDS to Pawsey

You must be on the USyd network or have the VPN on.

ssh <unikey>@research-data-int.sydney.edu.au
cd /rds/PRJ-<project>/<data-location>
rsync -vhsrl --chmod=Dg+s -e ssh <folder> <user>@data-mover.pawsey.org.au:/scratch/<pawsey-project-id>/<user>

Transferring to and from Acacia storage buckets

What is Acacia?

Acacia is Pawsey’s warm-tier object storage cluster. It is a performant and highly scalable resource, but it differs from traditional filesystems by arranging data as objects in buckets rather than files in a folder hierarchy. For detailed information, refer to Pawsey’s full Acacia user guide.

Below is a quick-start summary of Pawsey’s user guide.

What are buckets?

Buckets are a way of organising your data. Unlike directories/folders, they organise data in a ‘flat’ structure: there is no hierarchy and no sub-buckets. To simplify backing up data to a bucket while maintaining nested directory structures, the forward slashes within your data’s filepaths are preserved and interpreted as ‘pseudo-folders’. Your data will therefore appear in the bucket with the hierarchical organisation you are familiar with, even though the ‘folders’ do not actually exist as separate objects within the bucket.

Acacia bucket rules

  • Each user can have up to 1,000 buckets
  • Buckets can hold up to 1 million objects; however, Pawsey recommends no more than 100,000 objects per bucket for optimal performance
  • Bucket names must:
    • Be globally unique on Acacia
    • Contain 3–63 characters comprising only numbers or lowercase letters; full stops/periods (.) and hyphens (-) are allowed, but not at the beginning or end of the name
    • Not include IP addresses, Pawsey usernames, or sensitive information

What are objects?

Once copied to your bucket, a file becomes an ‘object’. The object includes both the file and any metadata describing it. Objects cannot be edited within the bucket, only stored. They also cannot be read directly by the compute nodes on Setonix, so if you need to use Acacia objects as input data for a job, you will first need to copy them to /scratch on Setonix.

Create access keys

Log in to the Pawsey Origin portal and follow the instructions to create Acacia access keys.

Note that separate keys are required for project storage (accessible to all members of the project) and for user storage (private to you) on Acacia.

Configure Rclone

Tip

Pawsey recommends rclone for accessing Acacia from Setonix; however, you can also use Amazon’s boto3 if you prefer.

Log on to Setonix and load the latest version of rclone. Note that Setonix does not allow modules to be loaded without specifying a version.

module avail rclone
module load rclone/<latest-version>

Then follow the instructions to configure rclone, adding your private Access ID and Secret Key to $HOME/.config/rclone/rclone.conf.

Any new key pairs that you generate will need to be added to your rclone.conf file. The simplest way is to copy the command provided by the Origin portal when you create your key pair, paste it into a terminal connected to a Setonix login node, and run it. Each time you run the recommended tee command, it appends to the end of your rclone.conf file.

If you forget to run this command and have not otherwise saved the access ID and secret key, you can delete that key within the Origin portal and generate a new key.
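After running the tee command, your rclone.conf should contain one section per key pair, along these lines (the field values below are illustrative placeholders; copy the exact command from the Origin portal rather than writing a section by hand):

```ini
[<profile>]
type = s3
provider = Ceph
access_key_id = <your-access-id>
secret_access_key = <your-secret-key>
endpoint = <acacia-endpoint-url>
```

The section name in square brackets is the profile you will reference in rclone commands such as `rclone mkdir <profile>:<bucket-name>`.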

Create a bucket

rclone mkdir <profile>:<bucket-name>

In the above command, <profile> is the name of a Pawsey project you are a member of, or your Pawsey username. The bucket name is a name of your choosing, following the naming rules described above.

If you have not configured a key pair for the profile, you will receive an error:

CRITICAL: Failed to create file system for "<profile>:<bucket-name>": didn't find section in config file

You can resolve this by logging in to Origin and following Step 1 to create a key pair for the profile (profiles appear in the “Storage Name” dropdown menu on the Origin portal).

Copy data from scratch to Acacia bucket

To copy a file from /scratch to your Acacia bucket:

rclone copy <file-on-scratch> <profile>:<bucket-name>

You can also add a pseudo-folder prefix to the bucket name if desired. For example, to copy results.csv to a ‘Results’ pseudo-folder:

rclone copy results.csv pawsey1234:experiment-001-2025/Results

Since there are no true ‘directories’ in object storage, the ‘Results’ pseudo-folder does not need to exist in your bucket beforehand.

To copy a directory to your Acacia bucket, the same copy command is used, but take care to ensure that the contents of your local directory appear in the bucket as you intend. If you want the current top-level directory to remain the top-level pseudo-folder in the bucket, you will need to specify that pseudo-folder path, as shown in the examples below:

Copy the contents of New-results to the root level of the bucket (ie, the New-results prefix will not exist in the bucket):

rclone copy New-results pawsey1234:experiment-001-2025

Copy the contents of New-results to a pseudo-folder called New-results within the bucket:

rclone copy New-results pawsey1234:experiment-001-2025/New-results

All other sub-directories nested within New-results will appear organised as they are on /scratch; it is only the top-level directory that must be explicitly specified as the destination pseudo-folder.

Copy objects from Acacia bucket to scratch

Copy a single object to /scratch:

rclone copy <profile>:<bucket-name>/<object> /scratch/<pawsey-project-id>/<pawsey-user-id>

Copy all objects within a pseudo-folder to /scratch:

rclone copy <profile>:<bucket-name>/<pseudo-folder> /scratch/<pawsey-project-id>/<pawsey-user-id>/<pseudo-folder>

List Acacia contents

List contents of all buckets within a profile:

rclone tree <profile>:

List objects within a specific bucket:

rclone tree <profile>:<bucket-name>

Deleting objects and buckets

Delete an object:

rclone delete <profile>:<bucket-name>/<object>

Delete a pseudo-folder:

rclone delete <profile>:<bucket-name>/<pseudo-folder>

Delete all objects within a bucket:

rclone delete <profile>:<bucket-name>

Delete a bucket (the bucket must be empty first):

rclone rmdir <profile>:<bucket-name>

To delete a bucket together with all of its contents in one step, rclone purge can be used instead.

Mount your Acacia bucket locally

To be completed

Share contents of your Acacia bucket with others

To be completed

Transferring large datasets via the copy partition

The copy partition has 7 data mover nodes dedicated to data transfer. By submitting your data transfer to the copy partition on Setonix, your transfers can run on the data mover nodes for up to 48 hours, leaving your terminal free for other work or for you to power down your machine.

By using a resumable method of transfer on the copy nodes, you can transfer extremely large datasets reliably and with minimal effort.

Set up ssh keys

To run data transfers via the copy partition, you will first need to set up ssh keys between Setonix and the other machine.

Follow Pawsey’s guide to set up your keys.

Tip

Remember to set up keys between Pawsey and all machines where you have data requiring transfer, for example your local computer, RDS, and other workstations/compute environments.
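Key setup follows the usual OpenSSH pattern. A sketch (the key filename and comment are your choice, and the ssh-copy-id destination is only an example):

```shell
# Generate an ed25519 key pair; -N "" sets an empty passphrase so the
# key can be used non-interactively in batch jobs (use a passphrase
# where your workflow allows it)
ssh-keygen -t ed25519 -f ./id_pawsey -N "" -C "setonix-transfer-key"

# Then install the public key in the remote machine's authorized_keys,
# for example (requires network access, shown here for reference):
#   ssh-copy-id -i ./id_pawsey.pub <unikey>@research-data-int.sydney.edu.au
```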

SLURM data transfer script

Any of the command line examples shown above (scp, sftp, rsync) can be included within a SLURM job script to be submitted to the copy partition.

Below is a script template for transferring a large dataset from Pawsey /scratch to USyd RDS:
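This is a sketch rather than a finalised recipe: the SBATCH directives follow standard SLURM conventions, the partition name comes from the copy partition described above, and every angle-bracketed value is a placeholder you must replace with your own details before submitting.

```shell
# Write a SLURM batch script for the copy partition; the placeholders
# (<pawsey-project-id>, <unikey>, <folder>, <project>) must be filled
# in before use
cat > transfer_to_rds.slurm << 'EOF'
#!/bin/bash
#SBATCH --partition=copy
#SBATCH --account=<pawsey-project-id>
#SBATCH --ntasks=1
#SBATCH --time=48:00:00
#SBATCH --job-name=rds-transfer

# Resumable transfer from /scratch to RDS (external endpoint, so no
# USyd VPN is needed); resubmit this job to resume if it hits walltime
rsync -av --partial /scratch/<pawsey-project-id>/<unikey>/<folder> \
  <unikey>@research-data-ext.sydney.edu.au:/rds/PRJ-<project>/
EOF

# Submit with: sbatch transfer_to_rds.slurm
```

This assumes ssh keys between Setonix and RDS have already been set up as described above, since the batch job cannot prompt for a password.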

SLURM data transfer script - S3

To be completed

Transferring between RDS and Pawsey

Small datasets

Copy a folder from Pawsey to RDS

USyd network or VPN not required.

sftp <unikey>@research-data-ext.sydney.edu.au
cd /rds/PRJ-<project>
put -r ./<folder>

Copy a folder from RDS to Pawsey

You must be on the USyd network or have the VPN on.

ssh <unikey>@research-data-int.sydney.edu.au
cd /rds/PRJ-<project>/<data-location>
rsync -vhsrl --chmod=Dg+s -e ssh <folder> <user>@data-mover.pawsey.org.au:/scratch/<pawsey-project-id>/<user>

Large datasets

To be completed

This section will include examples of transferring between RDS and Pawsey via the copy partition.