Running AlphaFold on the Artemis HPC

AlphaFold on Artemis

AlphaFold can be accessed by executing

module load alphafold

There are several (large) genetic databases and parameter sets that AlphaFold requires. Different versions of these are conveniently available on Artemis in the common folder /project/data/alphafold2/. If there are additional databases you require, please file a support ticket.
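For example, you can list the database snapshots available and select one for your jobscript; the path below is the Artemis common folder from this guide, and 20220323 is the snapshot used in the example jobscript later on:

```shell
# List the database snapshots available on Artemis, then select one to use
ls /project/data/alphafold2/
export ALPHADB=/project/data/alphafold2/20220323
```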

To run a computationally intensive AlphaFold job, we must create a PBS jobscript: a standard shell script with a few PBS-specific directives, which is sent to the job scheduler for execution. Most requirements are already on Artemis, but you must bring your own .fasta file.

Optional: Generate a FASTA file

FASTA files are simply text. You can create one directly in your jobscript or on the command line, for example:

echo ">T1084 Meio, Meiothermus silvanus, 73 residues|" > demo.fasta
echo "MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH" >> demo.fasta
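As a quick sanity check, the residue count of the sequence line should match the 73 residues stated in the header:

```shell
# Recreate the demo FASTA file and count the residues in the sequence line
echo ">T1084 Meio, Meiothermus silvanus, 73 residues|" > demo.fasta
echo "MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH" >> demo.fasta
# Count residues, ignoring the ">" header line (should print 73)
grep -v '^>' demo.fasta | tr -d '\n' | wc -c
```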

Make a PBS Jobscript

Make a new file called alpha_job.pbs. Execute nano alpha_job.pbs to edit a new text file in the nano text editor. Make your changes, then press Ctrl+X to exit and save the changes when prompted.

At a minimum, the contents should look something like this:

#!/bin/bash

#PBS -P Training
#PBS -l select=1:ncpus=8:mem=32gb:ngpus=0
#PBS -l walltime=48:00:00
#PBS -N job01

# Load necessary modules (on Artemis, this will load the correct python environment with AlphaFold installed)
module load alphafold/2.2.0-cpu hmmer hh-suite kalign

# Navigate to your directory
# $PBS_O_WORKDIR is an alias for "the folder you submit your job in", likely /project/Training/demo1/af_demo or similar.
cd $PBS_O_WORKDIR

# Make a directory to save output data to
OUTDIR=${PBS_O_WORKDIR}/${PBS_JOBNAME}_output
mkdir -p $OUTDIR

# Set the alphafold base database directory. Visit this folder to see other options.
export ALPHADB=/project/data/alphafold2/20220323

# Run the AlphaFold2 prediction command. Note most database paths are required
run_alphafold.py \
        --fasta_paths=/project/Training/DATA/input.fasta \
        --output_dir=${OUTDIR} \
        --data_dir=${ALPHADB} \
        --uniref90_database_path=${ALPHADB}/uniref90/uniref90.fasta \
        --mgnify_database_path=${ALPHADB}/mgnify/mgy_clusters_2018_12.fa \
        --template_mmcif_dir=${ALPHADB}/pdb_mmcif/mmcif_files/ \
        --obsolete_pdbs_path=${ALPHADB}/pdb_mmcif/obsolete.dat \
        --db_preset=full_dbs \
        --max_template_date=2022-03-23 \
        --use_gpu_relax=False \
        --model_preset=monomer \
        --pdb70_database_path=${ALPHADB}/pdb70/pdb70 \
        --bfd_database_path=${ALPHADB}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
        --uniclust30_database_path=${ALPHADB}/uniclust30/uniclust30_2018_08/uniclust30_2018_08
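Outside of a running PBS job, $PBS_O_WORKDIR and $PBS_JOBNAME are unset, but you can sketch how the output directory resolves using placeholder values (the paths below are illustrative only; PBS sets these variables automatically inside a job):

```shell
# Illustrative values only; inside a job PBS sets these for you
PBS_O_WORKDIR=/project/Training/demo1/af_demo
PBS_JOBNAME=job01
# Same expansion as in the jobscript above
OUTDIR=${PBS_O_WORKDIR}/${PBS_JOBNAME}_output
echo "$OUTDIR"   # → /project/Training/demo1/af_demo/job01_output
```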

Submit a PBS Jobscript to the queue

Now we can submit the job to the scheduler

qsub alpha_job.pbs

This will execute when compute resources become available. Check the status with:

qstat -x -u <unikey>
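qsub prints the new job's identifier, which you can capture and reuse in later qstat calls. The ID below is hypothetical; on Artemis it will carry the scheduler's own server suffix:

```shell
# JOBID as returned by qsub (hypothetical value for illustration)
JOBID="1234567.pbsserver"
# Strip the server suffix to get the bare numeric ID some tools expect
JOBNUM=${JOBID%%.*}
echo "$JOBNUM"
```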

Output

By default, several output files will be generated.

Artemis specific outputs

By default, at the completion of a job (successful or otherwise), these three output files will be generated in the folder from which the job was submitted.

PBS job outputs
JobName.eJobID       Contains error/warning messages (those usually printed to stderr)
JobName.oJobID       Contains output messages (those usually printed to stdout)
JobName.oJobID_usage A short summary of the resources used by your job
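For example, if your job was named job01 (via #PBS -N) and the scheduler assigned it the hypothetical ID 1234567, the three files would be named as follows:

```shell
JOBNAME=job01
JOBID=1234567   # hypothetical job ID assigned by the scheduler
echo "${JOBNAME}.e${JOBID}"        # stderr log
echo "${JOBNAME}.o${JOBID}"        # stdout log
echo "${JOBNAME}.o${JOBID}_usage"  # resource usage summary
```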

If your job has not completed successfully, or AlphaFold outputs have not been generated as expected, these files may help diagnose what went wrong (or confirm what went right!).

AlphaFold specific outputs

Using the configuration options above will produce output in the folder specified, i.e.:

--output_dir=${OUTDIR}

In our case (${PBS_O_WORKDIR}/${PBS_JOBNAME}_output, with the job name job01), this should resolve to the directory

/project/Training/demo1/af_demo/job01_output

AlphaFold will create this folder if it does not exist. It will contain the predicted structures of the target protein in the following key files, though the exact contents may vary depending on the options in your job.

AF outputs
msas Folder of intermediate output from the other tools (hmmer, hh-suite, kalign) used to construct the input MSA.
timings.json A JSON format text file containing the times taken to run each section of the AlphaFold pipeline.
ranking_debug.json A JSON format text file containing the pLDDT values used to perform the model ranking, and a mapping back to the original model names.
features.pkl A Python pickle file containing the input feature NumPy arrays used by the models to produce the structures.
result_model_*.pkl A Python pickle file containing a nested dictionary of the NumPy arrays produced directly by each model, including the structure prediction and confidence metrics.
ranked_*.pdb A PDB format text file containing the predicted structures, after reordering by model confidence.
relaxed_model_*.pdb A PDB format text file containing the predicted structure, after performing an Amber relaxation procedure on the unrelaxed structure prediction.
unrelaxed_model_*.pdb A PDB format text file containing the predicted structure, exactly as output by the model.
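To see which model produced ranked_0.pdb, inspect the "order" field of ranking_debug.json. The sketch below builds a synthetic ranking_debug.json with invented pLDDT values, since the real file only exists after a completed run:

```shell
# Synthetic ranking_debug.json (invented values) mirroring AlphaFold's layout
cat > ranking_debug.json <<'EOF'
{"plddts": {"model_1": 82.1, "model_2": 90.3}, "order": ["model_2", "model_1"]}
EOF
# The first entry of "order" corresponds to ranked_0.pdb
python3 -c "import json; print(json.load(open('ranking_debug.json'))['order'][0])"
```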

Visualise the output

Quick-view an example of the output (note: this requires X11 forwarding to be enabled when connecting to Artemis):

qsub -X -I -P Training
module load pymol/2.4.0
pymol example.pdb

Using the GPU

The script below runs the same workflow but leverages the GPU in certain steps. Keep in mind that on Artemis the GPUs are in high demand, so you may have to wait longer in the queue.

#!/bin/bash

#PBS -P Training
#PBS -l select=1:ncpus=8:mem=32gb:ngpus=1
#PBS -l walltime=48:00:00
#PBS -N job01_gpu

# Load necessary modules (on Artemis, this will load the correct python environment with AlphaFold installed)
module load alphafold/2.2.0-cpu hmmer hh-suite kalign

# Navigate to your directory
# $PBS_O_WORKDIR is an alias for "the folder you submit your job in", likely /project/Training/ or similar.
cd $PBS_O_WORKDIR

# Make a directory to save output data to
OUTDIR=${PBS_O_WORKDIR}/${PBS_JOBNAME}_output
mkdir -p $OUTDIR

# You must install extra packages in the python environment that is loaded with alphafold
pip install -U https://storage.googleapis.com/jax-releases/cuda102/jaxlib-0.1.71+cuda102-cp37-none-manylinux2010_x86_64.whl
pip install -U jax==0.2.25

# Set the alphafold base database directory. Visit this folder to see other options.
export ALPHADB=/project/data/alphafold2/20220323

# Run the AlphaFold2 prediction command. Note most database paths are required
run_alphafold.py \
        --fasta_paths=/project/Training/DATA/input.fasta \
        --output_dir=${OUTDIR} \
        --data_dir=${ALPHADB} \
        --uniref90_database_path=${ALPHADB}/uniref90/uniref90.fasta \
        --mgnify_database_path=${ALPHADB}/mgnify/mgy_clusters_2018_12.fa \
        --template_mmcif_dir=${ALPHADB}/pdb_mmcif/mmcif_files/ \
        --obsolete_pdbs_path=${ALPHADB}/pdb_mmcif/obsolete.dat \
        --db_preset=full_dbs \
        --max_template_date=2022-03-23 \
        --use_gpu_relax=False \
        --model_preset=monomer \
        --pdb70_database_path=${ALPHADB}/pdb70/pdb70 \
        --bfd_database_path=${ALPHADB}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
        --uniclust30_database_path=${ALPHADB}/uniclust30/uniclust30_2018_08/uniclust30_2018_08

For this simple example, with 8 CPUs and 32 GB RAM, the job takes around 35 hours. Only the final step is GPU optimised, reducing the total time to roughly 34 hours.