Accounting

Overview

Teaching: 5 min
Exercises: 3 min

Questions

How do I query the disk and KSU available per project?

Objectives

Understand SU/KSU

Get to know the nci_account, lquota and nci-files-report commands

Service units

A service unit (SU) is the charging unit for compute time on Gadi. SU are charged per resource hour, where one resource hour is one hour of compute walltime on one CPU, under the assumption that the job uses no more than the memory per core on the node. The number of SU charged per resource hour depends on the queue.

For example, using 1 CPU and 4 GB of RAM for 1 hour on the ‘normal’ queue which has 4 GB RAM per core equates to 1 resource hour consumed. If the same job used 8 GB RAM, 2 resource hours would be consumed. This would equate to an SU charge of 2 and 4, respectively - as the ‘charge rate’ of the normal queue is 2 SU per resource hour.
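
The worked example above can be sketched as a small shell function. This is only an illustration, assuming the charge is driven by whichever is larger: the CPUs requested, or the memory requested divided by the memory per core. The function name su_charge and its argument order are made up for this sketch and are not an NCI tool.

```shell
#!/usr/bin/env bash
# Sketch only: su_charge is a hypothetical helper, not an NCI command.
# Usage: su_charge NCPUS MEM_GB HOURS MEM_PER_CORE_GB RATE
su_charge() {
    awk -v ncpus="$1" -v mem="$2" -v hours="$3" -v mpc="$4" -v rate="$5" 'BEGIN {
        # Express the memory request as an equivalent number of cores
        mem_cores = mem / mpc
        # Assumption: the job is charged for the larger of CPUs and memory share
        cores = (ncpus > mem_cores) ? ncpus : mem_cores
        # SU = resource hours x queue charge rate
        printf "%g\n", cores * hours * rate
    }'
}

su_charge 1 4 1 4 2   # 1 CPU, 4 GB, 1 hour, normal queue (4 GB per core, rate 2)
su_charge 1 8 1 4 2   # the same job using 8 GB
```

The two calls correspond to the two cases described above: the first job is charged for one resource hour and the second for two.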

We often speak in terms of KSU, ie 1,000 SU.

The charge rates on Gadi range from 1.25 SU to 6 SU per resource hour, depending on the queue. See the Gadi queue limits page for the queue rates and limits.

It is critical to understand SU and charge rates, as each project is assigned a finite amount of SU to use.

Checking compute resource availability

In order to submit a job, you must have adequate KSU available under your project for that job to complete.

Check your KSU with the ‘nci_account’ command:

nci_account -P <project_code>

Any running jobs will display KSU under ‘Reserved’.

To calculate the KSU required for your job, multiply the number of cores requested by the walltime requested in hours by the charge rate for the chosen queue, then divide by 1,000 to convert SU to KSU. Eg the job below requires 9,504 SU (3168 cores × 1.5 hours × 2 SU per resource hour), ie 9.504 KSU. If this project had only 9.503 KSU available, the job would be immediately rejected by the scheduler.

#PBS -l walltime=01:30:00
#PBS -l ncpus=3168
#PBS -l mem=12540GB
#PBS -q normal
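
The arithmetic for the job request above can be checked with a short shell sketch before submitting. The helper names ksu_required and has_enough_ksu are hypothetical, and the available balance is typed in by hand rather than read from nci_account. The sketch also assumes the memory request stays within the memory per core of the queue, as it does for this job, so only cores, walltime and charge rate matter.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical, not NCI commands.

# ksu_required NCPUS WALLTIME(HH:MM:SS) RATE -> KSU needed for the request
ksu_required() {
    awk -v ncpus="$1" -v wall="$2" -v rate="$3" 'BEGIN {
        # Convert HH:MM:SS to decimal hours
        split(wall, t, ":")
        hours = t[1] + t[2] / 60 + t[3] / 3600
        # SU = cores x hours x charge rate; divide by 1000 for KSU
        printf "%.3f\n", ncpus * hours * rate / 1000
    }'
}

# has_enough_ksu AVAILABLE_KSU REQUIRED_KSU -> exit status 0 if the request fits
has_enough_ksu() {
    awk -v avail="$1" -v need="$2" 'BEGIN { exit !(avail >= need) }'
}

need=$(ksu_required 3168 01:30:00 2)
echo "Required: ${need} KSU"
if has_enough_ksu 9.503 "$need"; then
    echo "Enough KSU available"
else
    echo "Not enough KSU available"
fi
```

Replace 9.503 with the available KSU reported for your project to run the same check for your own job.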

At completion, your project is only charged the SU actually consumed by the job (ie based on walltime used, not walltime requested). Like Artemis, Gadi produces PBS logs. The “.o” job log will report the compute used (similar to the Artemis “.o” and “usage” logs combined).

To view the number of running and queued jobs in each queue, use the command:

qstat -q

Each queue will have two rows: the queue name, and the queue name with ‘-exec’ suffix. The jobs in <queue>-exec are executing, while the jobs in <queue> have not yet been scheduled.

If you are familiar with Artemis, you may be put off by the number of jobs running and in queue! Do not be alarmed: Gadi queues have a very high turnover, as Gadi has more than 26 times the cores of Artemis and shorter maximum walltimes.

Checking disk resource availability

Like KSU, each project is assigned a finite amount of disk space and iNode (index node - can be likened to the total number of files and folders). You MUST monitor your disk and iNode usage, and this can be done with the command:

lquota

which shows disk resource availability for every project you are a member of.

Key lquota terminology:

Usage: how much disk the project is using

Quota: how much disk has been assigned to that project

Limit: twice the quota. Think of this as a grace quota: your project can exceed the quota, but can never exceed the limit. If your project remains above quota for 1 week, jobs under that project will not run (they will enter ‘Held’ status) until you have brought your usage back under the quota.

iUsage: how much iNode the project is using

iQuota: how much iNode has been assigned to that project

iLimit: as per disk limit
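
The relationship between usage, quota and limit described above can be sketched as follows. The function quota_status is hypothetical, and the numbers passed to it are typed in by hand for illustration; it does not parse lquota output. The same logic applies to iUsage and iQuota.

```shell
#!/usr/bin/env bash
# Sketch only: quota_status is a hypothetical helper, not an NCI command.
# Usage: quota_status USAGE QUOTA   (same units for both, eg GB or file counts)
quota_status() {
    awk -v used="$1" -v quota="$2" 'BEGIN {
        limit = 2 * quota   # the limit is twice the quota
        if (used <= quota) {
            print "within quota"
        } else if (used < limit) {
            print "over quota"     # grace period applies; jobs held after 1 week
        } else {
            print "at limit"       # no further writes possible
        }
    }'
}

quota_status 80 100    # illustrative values only
quota_status 150 100
```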

It is important to have an understanding of how much output your job will create, and ensure that you can remain within quotas/limits. Jobs can fail with “disk quota exceeded” messages.

Gadi has a 90 day purge policy on /scratch. We have yet to see this implemented, however be aware that it could commence at any stage, and cleanup and backup to RDS MUST be performed regularly. Any requests for increased disk and iNode are met with rigorous scrutiny from NCI!
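
To get a feel for which files might be affected, the standard find command can list files that have not been touched for more than 90 days. This is a sketch under an assumption: it uses last access time as the age criterion, which is not stated above, so check the current NCI policy before relying on it. The function name list_stale_files is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only: list_stale_files is a hypothetical helper.
# Assumption: file age is judged by last access time (find -atime).
# Usage: list_stale_files DIR [DAYS]   (DAYS defaults to 90)
list_stale_files() {
    find "$1" -type f -atime +"${2:-90}"
}

# Example (hypothetical path), listing candidates for backup to RDS:
# list_stale_files /scratch/<project>/$USER
```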

Other useful disk monitoring commands

To check the disk usage of current directory:

du -hs

Or a specific directory:

du -hs <dir>

Or all directories (max depth 1):

du -h --max-depth=1

To count the iNode used under the current directory, with a subtotal for each first-level subdirectory:

find . -mindepth 1 -maxdepth 1 -type d | sort | while IFS= read -r d; do printf '%s\t\t- %s\n' "$(find "$d" | wc -l)" "${d#./}"; done; printf 'Total: \t\t%s\n' "$(find . | wc -l)"

Note that there is a shorter command for reporting iNode (du --inodes); however, this can under-report total iNode.

Another great management tool is the nci-files-report command. This shows the amount of disk and iNode used by each member of the specified project in the specified filesystem. Note that files belonging to one project can be in the filesystem of another (eg via the chgrp command) so if du and lquota queries do not match up, use the below command to find files that may be contributing to your project quota but not actually visible under /scratch/<your-project>.

nci-files-report -g <project> -f scratch
nci-files-report -g <project> -f gdata

What to do if you run out of KSU

KSU are assigned on Gadi on a quarterly basis, so when you run nci_account -P <project>, you will see KSU used and available for the current quarter.

Under NCMAS, you may be able to request that some of your KSU be brought forward from other quarters, for example shifting 10% of your allocations for Q2-Q4 to Q1. Lodge the request via help ticket. The request will be assessed against the capacity of the system, so it may not be granted.

Projects are free to seek support from multiple stakeholders, so even if you have an NCMAS allocation, you can still apply for KSU from SIH, Intersect or other schemes. Please note that if multiple stakeholders contribute to your project KSU budget, the KSU form one pool, and you cannot preferentially draw down KSU from one source before another.

Under the SIH HPC Allocation Scheme, KSU are awarded over one or two quarters. You may request to bring some of your KSU from a subsequent quarter (if applicable) forward, contact SIH to ask if there are any unused KSU available under the scheme for that quarter, or submit an application for the next round of the scheme.

Users can also purchase additional KSU on Gadi. Please contact SIH for current charge rates (currently very affordable!).

What to do if you run out of disk or iNode

Clean up your directory!

Requests for increased allowances to /scratch disk and iNode can be made via help ticket. You will be required to justify why you need additional disk and/or iNode, detailing the file types/sizes and their expected duration on disk. The performance of the Lustre filesystem suffers when it holds millions of small files, so increases to iNode are met with particular scrutiny.

Efforts should always be made to clean up your /scratch allocation before requesting additional disk and/or iNode. Data that is not being actively used should be backed up to RDS. The time taken to transfer and checksum the files is no excuse for leaving data sitting on /scratch for extended periods between access. Creating tar archives of directories with excess small files can help reduce your iNode usage.
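
As an illustration of the tar approach, the sketch below packs a directory into a single compressed archive, checks that the archive lists the same number of entries as the directory, and only then removes the original. The function name archive_dir and the example path are hypothetical, and you should still checksum and back up the archive to RDS as described above.

```shell
#!/usr/bin/env bash
# Sketch only: archive_dir is a hypothetical helper.
# Usage: archive_dir DIR   -> creates DIR.tar.gz, then removes DIR if verified
archive_dir() {
    local dir="${1%/}"
    local parent name n_src n_tar
    parent=$(dirname -- "$dir")
    name=$(basename -- "$dir")

    # Count files and folders before archiving
    n_src=$(find "$dir" | wc -l)

    # Pack the directory into one file (one iNode instead of many)
    tar -czf "${dir}.tar.gz" -C "$parent" "$name" || return 1

    # Verify the archive is readable and holds the same number of entries
    n_tar=$(tar -tzf "${dir}.tar.gz" | wc -l) || return 1
    [ "$n_src" -eq "$n_tar" ] || return 1

    # Only now remove the original directory
    rm -rf -- "$dir"
}

# Example (hypothetical path):
# archive_dir /scratch/<project>/many_small_files
```

The entry count is a quick sanity check, not a substitute for checksums when transferring the archive.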

Unused KSU

Any unused KSU at the end of the quarter are forfeited. Unused KSU impact negatively on future NCMAS applications, so all efforts should be made to use your KSU within each quarter.

If you are able to predict early in the quarter (say, first half) that you will not need all the KSU you have been allocated in that quarter, you can request them to be voluntarily forfeited, so that NCI/SIH (depending on your allocation provider) may assign them to another project that requires them. This will effectively reduce your allocation, so that these unused KSU do not show as ‘unused’ and impact your future applications. Requests to voluntarily forfeit unused KSU in the last few weeks of the quarter are usually unsuccessful. You cannot request KSU from an earlier quarter be rolled forward to a later quarter.

If you have an SIH HPC allocation, please contact SIH for KSU matters
If you have an NCMAS allocation, please contact NCI for KSU matters
Irrespective of your allocation source, please contact NCI for job and hardware matters

End of quarter workloads

The end of each quarter is the busiest time for the Gadi queues, as users are trying to expend their quarterly KSU budget, so you may have more difficulty pushing large workloads through at this time. Keep this in mind when planning your compute for the quarter and, where possible, spread the compute workload.

Questions

Does anyone have any questions about access or accounting?

Key Points

Service units (SU) are the charging unit on Gadi, and projects have a finite allocation of SU/KSU

nci_account -P <project> for KSU per project

lquota for disk and iNode by project

nci-files-report -g <project> -f <filesystem> for disk and iNode by user

Unused KSU are forfeited, and negatively impact future applications
