Running AlphaFold on the Artemis HPC
AlphaFold on Artemis
AlphaFold can be accessed by executing:
module load alphafold
AlphaFold requires several large genetic databases and model parameters. Different versions of these are available on Artemis in the common folder: /project/data/alphafold2/
If you require additional databases, please submit a support ticket.
To run a computationally intensive AlphaFold job, we must create a PBS jobscript: a standard shell script with a few PBS-specific directives that is sent to the job scheduler for execution. Most requirements are already on Artemis, but you must bring your own .fasta file.
Optional: Generate a FASTA file
FASTA files are plain text. You can create one in your jobscript or on the command line, for example:
echo ">T1084 Meio, Meiothermus silvanus, 73 residues|" > demo.fasta
echo "MAAHKGAEHHHKAAEHHEQAAKHHHAAAEHHEKGEHEQAAHHADTAYAHHKHAEEHAAQAAKHDAEHHAPKPH" >> demo.fasta
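Before submitting a long job it can be worth confirming that the FASTA file looks as expected. The following is a minimal sketch, not part of AlphaFold or Artemis: fasta_summary is a hypothetical helper name, and it only counts header lines and sequence characters without validating the amino-acid alphabet.

```shell
#!/bin/bash
# Hypothetical helper (not an AlphaFold or Artemis tool): summarise a FASTA file.
fasta_summary() {
    local fasta="$1"
    local records residues
    # Header lines start with ">"; every other line is sequence.
    records=$(grep -c '^>' "$fasta")
    # Drop header lines and all whitespace, then count the remaining characters.
    residues=$(grep -v '^>' "$fasta" | tr -d '[:space:]' | wc -c)
    # $((...)) strips any padding that wc adds on some systems.
    echo "records=$((records)) residues=$((residues))"
}
```

Running fasta_summary demo.fasta on the file created above should report one record and, matching its header, 73 residues.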
Make a PBS Jobscript
Make a new file called alpha_job.pbs. Execute nano alpha_job.pbs to open it in the nano text editor. Make your changes, then press Ctrl+X to exit nano, saving the changes when prompted.
At a minimum, the contents should look something like this:
#!/bin/bash
#PBS -P Training
#PBS -l select=1:ncpus=8:mem=32gb:ngpus=0
#PBS -l walltime=48:00:00
#PBS -N job01
# Load necessary modules (on Artemis, this will load the correct python environment with AlphaFold installed)
module load alphafold/2.2.0-cpu hmmer hh-suite kalign
# Navigate to your directory
# $PBS_O_WORKDIR is an alias for "the folder you submit your job in", likely /project/Training/demo1/af_demo or similar.
cd $PBS_O_WORKDIR
# Make a directory to save output data to
OUTDIR=${PBS_O_WORKDIR}/${PBS_JOBNAME}_output
mkdir -p $OUTDIR
# Set the alphafold base database directory. Visit this folder to see other options.
export ALPHADB=/project/data/alphafold2/20220323
# Run the AlphaFold2 prediction command. Note most database paths are required
run_alphafold.py \
--fasta_paths=/project/Training/DATA/input.fasta \
--output_dir=${OUTDIR} \
--data_dir=${ALPHADB} \
--uniref90_database_path=${ALPHADB}/uniref90/uniref90.fasta \
--mgnify_database_path=${ALPHADB}/mgnify/mgy_clusters_2018_12.fa \
--template_mmcif_dir=${ALPHADB}/pdb_mmcif/mmcif_files/ \
--obsolete_pdbs_path=${ALPHADB}/pdb_mmcif/obsolete.dat \
--db_preset=full_dbs \
--max_template_date=2022-03-23 \
--use_gpu_relax=False \
--model_preset=monomer \
--pdb70_database_path=${ALPHADB}/pdb70/pdb70 \
--bfd_database_path=${ALPHADB}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniclust30_database_path=${ALPHADB}/uniclust30/uniclust30_2018_08/uniclust30_2018_08
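A mistyped database path is only discovered once the job has left the queue and started, so it can help to check the paths beforehand. The following is a minimal sketch under stated assumptions: check_alphafold_db is a hypothetical helper name, not an Artemis or AlphaFold tool, and the relative paths are simply copied from the jobscript above.

```shell
#!/bin/bash
# Hypothetical pre-flight check (not an Artemis or AlphaFold tool).
# $1 = database base directory, e.g. the value of ALPHADB in the jobscript.
check_alphafold_db() {
    local base="$1" missing=0 rel
    # Relative paths copied from the jobscript. The last three are file
    # prefixes rather than single files, so anything starting with them counts.
    for rel in \
        uniref90/uniref90.fasta \
        mgnify/mgy_clusters_2018_12.fa \
        pdb_mmcif/mmcif_files \
        pdb_mmcif/obsolete.dat \
        pdb70/pdb70 \
        bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
        uniclust30/uniclust30_2018_08/uniclust30_2018_08
    do
        # ls fails if nothing matches the prefix.
        if ! ls -d "${base}/${rel}"* > /dev/null 2>&1; then
            echo "MISSING: ${base}/${rel}" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every path was found.
    [ "$missing" -eq 0 ]
}
```

For example, check_alphafold_db /project/data/alphafold2/20220323 prints any missing entries to stderr and returns a non-zero status if at least one is absent.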
Submit a PBS Jobscript to the queue
Now we can submit the job to the scheduler:
qsub alpha_job.pbs
This will execute when compute resources become available. Check the status with:
qstat -x -u <unikey>
Output
By default, several output files will be generated.
Artemis specific outputs
By default, at the completion of a job (successful or otherwise) these 3 output files will be generated in the folder in which the job was submitted.
JobName.eJobID | Contains error/warning messages – those usually printed to stderr |
JobName.oJobID | Contains output messages – those usually printed to stdout |
JobName.oJobID_usage | Short summary of the resources used by your job |
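To make the naming pattern in the table concrete, the sketch below builds the three expected file names from a job name and the identifier that qsub prints. pbs_log_names is a hypothetical helper, and it assumes the identifier has the usual PBS form of a number followed by a dot and the server name.

```shell
#!/bin/bash
# Hypothetical helper: list the log files expected for a finished job.
# $1 = job name (the "#PBS -N" value), $2 = identifier printed by qsub.
pbs_log_names() {
    local name="$1"
    # Assumption: qsub prints "<number>.<server>"; keep only the number.
    local id="${2%%.*}"
    echo "${name}.e${id}"        # stderr messages
    echo "${name}.o${id}"        # stdout messages
    echo "${name}.o${id}_usage"  # resource usage summary
}
```

One way to use it is to capture the identifier at submission time with JOBID=$(qsub alpha_job.pbs) and then call pbs_log_names job01 "$JOBID".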
If your job has not completed successfully or AlphaFold outputs have not been generated as expected, these files may help diagnose what went wrong (or inform you what went right!)
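As a starting point for diagnosis, a log file can be scanned for lines that commonly indicate a failure. This is a minimal sketch: scan_job_log is a hypothetical helper, and the keyword list is an assumption rather than an exhaustive set of the messages AlphaFold or PBS can produce.

```shell
#!/bin/bash
# Hypothetical helper: print log lines that look like failures.
# $1 = path to a log file, e.g. the JobName.eJobID file.
scan_job_log() {
    local log="$1"
    # -n: prefix matches with line numbers, -i: ignore case,
    # -E: extended regex. The keywords are an assumed, non-exhaustive list.
    grep -n -i -E 'error|traceback|killed|exceeded' "$log"
}
```

The function returns a non-zero status when no matching lines are found, which does not by itself guarantee that the job succeeded.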
AlphaFold specific outputs
With the configuration options above, AlphaFold writes its output to the folder specified by:
--output_dir=${OUTDIR}
For the jobscript above (job name job01, submitted from /project/Training/demo1/af_demo), this should resolve to a directory such as
/project/Training/demo1/af_demo/job01_output
AlphaFold creates this folder if it does not exist. It will contain the predicted structures of the target protein in the following key files, although the exact contents may vary depending on the options in your job.
msas | Folder of intermediate step output from other tools (hmmer, kalign) used to construct the input MSA. |
timings.json | A JSON format text file containing the times taken to run each section of the AlphaFold pipeline. |
ranking_debug.json | A JSON format text file containing the pLDDT values used to perform the model ranking, and a mapping back to the original model names. |
features.pkl | A Python pickle file containing the input feature NumPy arrays used by the models to produce the structures. |
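result_model_*.pkl | A Python pickle file containing a nested dictionary of the various NumPy arrays directly produced by the model. |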
ranked_*.pdb | A PDB format text file containing the predicted structures, after reordering by model confidence. |
relaxed_model_*.pdb | A PDB format text file containing the predicted structure, after performing an Amber relaxation procedure on the unrelaxed structure prediction. |
unrelaxed_model_*.pdb | A PDB format text file containing the predicted structure, exactly as output by the model. |
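Once a job finishes, a quick check that the key files from the table exist can save opening each one by hand. The sketch below is illustrative only: check_af_outputs is a hypothetical helper, and it tests just a subset of the files listed above. It also assumes you pass the directory that actually holds the files; AlphaFold usually places them in a subfolder of the output directory named after the input FASTA file, so for the jobscript above that may be ${OUTDIR}/input.

```shell
#!/bin/bash
# Hypothetical helper: confirm that key AlphaFold outputs exist.
# $1 = directory expected to contain the output files.
check_af_outputs() {
    local dir="$1" missing=0 item
    for item in msas timings.json ranking_debug.json features.pkl; do
        if [ ! -e "${dir}/${item}" ]; then
            echo "missing: ${item}" >&2
            missing=1
        fi
    done
    # At least one ranked structure should be present.
    if ! ls "${dir}"/ranked_*.pdb > /dev/null 2>&1; then
        echo "missing: ranked_*.pdb" >&2
        missing=1
    fi
    # Succeed only if nothing was reported missing.
    [ "$missing" -eq 0 ]
}
```

The function lists anything it cannot find on stderr and returns a non-zero status in that case, so it can be combined with other commands, for example check_af_outputs "$OUTDIR/input" && echo "outputs look complete".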
Visualise the output
Quick-view an example of the output (note: this requires X11 forwarding to be enabled when connected to Artemis):
qsub -X -I -P Training
module load pymol/2.4.0
pymol example.pdb
Using the GPU
The script below runs the same workflow but uses the GPU in certain steps. Keep in mind that GPUs on Artemis are in high demand, so you may have to wait longer in the queue.
#!/bin/bash
#PBS -P Training
#PBS -l select=1:ncpus=8:mem=32gb:ngpus=1
#PBS -l walltime=48:00:00
#PBS -N job01_gpu
# Load necessary modules (on Artemis, this will load the correct python environment with AlphaFold installed)
module load alphafold/2.2.0-cpu hmmer hh-suite kalign
# Navigate to your directory
# $PBS_O_WORKDIR is an alias for "the folder you submit your job in", likely /project/Training/ or similar.
cd $PBS_O_WORKDIR
# Make a directory to save output data to
OUTDIR=${PBS_O_WORKDIR}/${PBS_JOBNAME}_output
mkdir -p $OUTDIR
# You must install extra packages in the python environment that is loaded with alphafold
pip install -U https://storage.googleapis.com/jax-releases/cuda102/jaxlib-0.1.71+cuda102-cp37-none-manylinux2010_x86_64.whl
pip install -U jax==0.2.25
# Set the alphafold base database directory. Visit this folder to see other options.
export ALPHADB=/project/data/alphafold2/20220323
# Run the AlphaFold2 prediction command. Note most database paths are required
run_alphafold.py \
--fasta_paths=/project/Training/DATA/input.fasta \
--output_dir=${OUTDIR} \
--data_dir=${ALPHADB} \
--uniref90_database_path=${ALPHADB}/uniref90/uniref90.fasta \
--mgnify_database_path=${ALPHADB}/mgnify/mgy_clusters_2018_12.fa \
--template_mmcif_dir=${ALPHADB}/pdb_mmcif/mmcif_files/ \
--obsolete_pdbs_path=${ALPHADB}/pdb_mmcif/obsolete.dat \
--db_preset=full_dbs \
--max_template_date=2022-03-23 \
--use_gpu_relax=False \
--model_preset=monomer \
--pdb70_database_path=${ALPHADB}/pdb70/pdb70 \
--bfd_database_path=${ALPHADB}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--uniclust30_database_path=${ALPHADB}/uniclust30/uniclust30_2018_08/uniclust30_2018_08
For this simple example, with 8 CPUs and 32 GB of RAM, the CPU-only job takes around 35 hours. Only the final step is GPU-optimised, so using the GPU reduces the run time only slightly, to around 34 hours.