Introduction to AlphaFold
AlphaFold is a groundbreaking artificial intelligence system developed by DeepMind that focuses on predicting the three-dimensional structure of proteins. Proteins are essential molecules in our bodies that carry out various functions, and their structure plays a critical role in understanding their behaviour and function.
AlphaFold2 utilises deep learning algorithms to analyse the amino acid sequence of a protein and predict its 3D structure accurately. By understanding the structure of proteins, scientists can gain insights into how they interact with other molecules, such as drugs, and how they function in biological processes. This knowledge can have profound implications for various fields, including medicine, drug discovery, and bioengineering.
Traditional methods for determining protein structures, such as X-ray crystallography and cryo-electron microscopy, can be time-consuming and expensive. AlphaFold2 significantly accelerates this process by providing predictions in a fraction of the time. For many proteins, its predictions reach an accuracy comparable to that of experimental methods.
By providing researchers with reliable protein structure predictions, AlphaFold2 has the potential to revolutionise our understanding of biology and accelerate scientific discoveries. It can aid in designing more effective drugs, understanding disease mechanisms, and developing enzymes for industrial applications. Ultimately, AlphaFold2’s advancements in protein folding have the potential to transform multiple areas of research and have a profound impact on human health and well-being.
The original AlphaFold was introduced in 2018 and utilised deep learning techniques to predict protein structures. It made significant advancements in protein folding predictions compared to previous methods but still had limitations in accuracy.
AlphaFold2, introduced in 2020, represents a major breakthrough in protein structure prediction. It incorporates novel algorithmic improvements and advanced machine learning techniques, including a redesigned deep neural network architecture, to significantly improve accuracy. AlphaFold2 achieved remarkable performance in the 14th Critical Assessment of Structure Prediction (CASP14) competition in 2020, outperforming other methods and approaching experimental accuracy for many protein structures.
AlphaFold2 is computationally optimised and can deliver predictions within a matter of days, or even hours, for many proteins. Moreover, AlphaFold2 has been designed with scalability in mind. It can efficiently utilise distributed computing resources, enabling multiple protein structure predictions to be processed simultaneously.
AlphaFold2 incorporates an attention mechanism that allows it to focus on relevant parts of the protein sequence when making predictions. This attention mechanism provides insights into the model’s decision-making process and makes it more interpretable than the original AlphaFold.
Here is a good introductory video on AlphaFold by SciShow: How AI Could Change Biology
Manual Database Search
If you only want to predict a small number of naturally occurring proteins, use the online ColabFold notebook or download the structures from the AlphaFold Protein Structure Database. Both are accessed through a web browser: the ColabFold notebook lets users submit protein sequences and receive predicted structures, while the database offers precomputed structures for download. Their graphical interfaces make them accessible to non-experts and researchers without extensive programming or command line experience.
The AlphaFold Protein Structure Database is primarily intended for users who want to obtain protein structure predictions quickly and conveniently without the need for complex setups. Users search for a protein, and the database returns its precomputed predicted structure. It offers a straightforward interface with pre-defined settings and options.
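For more than a handful of proteins, the same database download can be scripted. The following is a minimal sketch, not an official client. It assumes the database serves files under `https://alphafold.ebi.ac.uk/files` with names of the form `AF-<UniProt accession>-F1-model_v<version>.pdb`, and that version 4 is available; the version number changes between database releases, so check the current one before relying on it. The helper names `model_url` and `download_model` are hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: base URL and file naming pattern used by the database.
BASE_URL = "https://alphafold.ebi.ac.uk/files"


def model_url(accession: str, version: int = 4) -> str:
    """Build the download URL for the first fragment (F1) of a model.

    `version` is an assumption; it must match the current database release.
    """
    accession = accession.strip().upper()
    return f"{BASE_URL}/AF-{accession}-F1-model_v{version}.pdb"


def download_model(accession: str, out_dir: str = ".", version: int = 4) -> Path:
    """Download one predicted structure and return the local file path."""
    url = model_url(accession, version)
    target = Path(out_dir) / url.rsplit("/", 1)[-1]
    urlretrieve(url, str(target))  # raises an error if the file does not exist
    return target


if __name__ == "__main__":
    # P69905 is the UniProt accession of human haemoglobin subunit alpha.
    print(model_url("P69905"))
```

Calling `download_model("P69905")` would then save the file to the current directory, provided the accession and version exist in the database.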
Automation
Automation (via AlphaFold, LocalColabFold, or similar) is suitable for more advanced applications such as batch processing of structure predictions for non-natural proteins and complexes, or predictions with manually specified multiple sequence alignments (MSAs) or templates.
The command line AlphaFold2 application is a software package that runs on a local machine or server. It is designed for advanced users and researchers familiar with command line interfaces and programming. It requires installation and configuration, but it offers more flexibility and control over the prediction process.
It is customisable and allows researchers to fine-tune the prediction process according to their specific requirements. It provides a wide range of command line options and parameters that can be modified to optimise predictions. This level of control is beneficial for researchers who want to experiment with different settings or integrate AlphaFold2 into their existing computational pipelines.
This course is aimed at researchers seeking to automate their use of AlphaFold. See the following pages for examples of how to use AlphaFold in different conditions and on different systems. Whilst these are aimed at Sydney University researchers, others may benefit from the general routines.
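One common preparatory step for batch processing is to split a file containing many sequences into one FASTA file per sequence (the format is described in the FASTA Format section), so that each prediction can be submitted as its own job. The sketch below shows one way to do this. The function name `split_fasta` and the output naming scheme are illustrative assumptions, not part of AlphaFold.

```python
import re
from pathlib import Path
from typing import List


def split_fasta(multi_fasta: str, out_dir: str) -> List[Path]:
    """Write each record of a multi-sequence FASTA file to its own file.

    Returns the paths of the files written, in input order.
    """
    # Collect records as (header line, list of sequence lines).
    records = []
    for raw in Path(multi_fasta).read_text().splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            records.append((line, []))
        elif records:
            records[-1][1].append(line)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index, (header, seq_lines) in enumerate(records, start=1):
        # Use the first word of the header as a file-safe name.
        words = header[1:].split()
        name = re.sub(r"[^A-Za-z0-9_.-]", "_", words[0]) if words else "sequence"
        path = out / f"{index:04d}_{name}.fasta"
        path.write_text("\n".join([header, *seq_lines]) + "\n")
        written.append(path)
    return written
```

Each returned path can then be passed to whichever prediction command or job scheduler is in use. The numeric prefix keeps file names unique even when two headers start with the same word.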
FASTA Format
AlphaFold is mostly driven by the use of FASTA files.
A FASTA (pronounced “fast-ay”) file is a common plain text format used to store biological sequence data, such as DNA, RNA, or protein sequences. It is named after the FASTA software package, which originally introduced this format. Each record in a FASTA file consists of two main components:
- Header: The header line begins with a “>” symbol, followed by a unique identifier for the sequence. The rest of the line can contain information about the sequence, like its source or other relevant details.
- Sequence Data: The following lines contain the actual sequence data, composed of letters representing nucleotides (A, C, G, T, U) for DNA and RNA sequences, or amino acids (A, R, N, etc.) for protein sequences.
For example:
>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*
FASTA files are easy to read by both humans and computers, making them a standard format for sharing and storing biological sequence information. They are used in various bioinformatics applications, including sequence analysis, sequence alignment, database searches, and data exchange between different software tools and databases.
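To make the header and sequence structure concrete, here is a minimal Python sketch that reads FASTA-formatted text and checks a protein sequence against the 20 standard one-letter amino acid codes. The function names are illustrative. Dropping a trailing `*` (as seen in the calmodulin example) is an assumption about what a downstream tool expects, so adjust it if your tool handles that character itself.

```python
from typing import Iterator, Set, Tuple

# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def clean_protein(sequence: str) -> str:
    """Upper-case a sequence and drop a trailing '*' terminator, if present."""
    return sequence.upper().rstrip("*")


def unexpected_residues(sequence: str) -> Set[str]:
    """Return the characters that are not standard amino acid codes."""
    return set(clean_protein(sequence)) - STANDARD_AMINO_ACIDS


if __name__ == "__main__":
    example = """>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*
"""
    for header, sequence in read_fasta(example):
        print(header)
        print(len(clean_protein(sequence)), "residues")
        print("unexpected characters:", unexpected_residues(sequence))
```

A check like this is useful before submitting many sequences, since a stray character is cheaper to catch here than after a prediction job has been queued.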
AlphaFold Use Cases
- Protein Structure Prediction: Understanding the 3D structure of proteins is crucial because a protein’s function is intricately linked to its shape. By accurately predicting protein structures, researchers can gain insights into their functions, interactions, and potential roles in various biological processes.
At its core, AlphaFold needs a FASTA file to make predictions:
run_alphafold.py input.fasta ...
- Drug Discovery: Many diseases are caused by malfunctioning proteins. By interpreting accurate protein structure predictions, researchers can identify potential drug targets and design molecules that specifically interact with the target protein, leading to the development of new drugs and therapies.
- Biological Research: AlphaFold aids researchers in studying the underlying mechanisms of biological processes. Protein structures provide insights into how proteins function, interact with other molecules, and carry out their roles in cells. One may use the predicted protein structures as input for docking simulations to study protein-protein interactions, for example:
rosetta_scripts -s receptor.pdb -l ligand.pdb -parser:protocol docking.xml -score:weights ref15.wts
Where receptor.pdb is the AlphaFold output, ligand.pdb is the ligand structure, and docking.xml is a RosettaScripts protocol file for docking simulations.
- Disease Understanding: Knowing the structures of proteins associated with diseases helps researchers understand how these proteins contribute to the disease’s development. This understanding can pave the way for novel therapeutic approaches. For example, predicted structures can be used in molecular dynamics simulations to understand how mutations affect protein stability:
gmx grompp -f md.mdp -c input.gro -p topol.top -o md.tpr
gmx mdrun -v -deffnm md
Where md.mdp is the GROMACS configuration file, input.gro is the structure predicted by AlphaFold (converted from .pdb), and topol.top is the topology file.
- Engineering and Design: AlphaFold can assist in protein engineering by providing starting structures for designing proteins with specific functions or properties, such as enzymes with improved catalytic activities or novel protein structures for industrial applications. For example:
rosetta_scripts -s starting_structure.pdb -parser:protocol design.xml -score:weights ref15.wts
Where starting_structure.pdb is the input structure from AlphaFold, design.xml is a RosettaScripts protocol file for design, and ref15.wts contains the scoring weights.
- Bioinformatics and Data Interpretation: AlphaFold’s predictions can complement experimental data, helping researchers interpret experimental results and guiding further experiments. They can provide structural context to help explain biochemical and genetic observations, and predictions may be integrated into analysis pipelines.
- Functional Annotations: Predicting protein structures can aid in annotating the functions of newly sequenced genes and proteins, especially when experimental data is limited. One can relate AlphaFold predictions to known proteins and their functions by searching with the input sequence using tools like BLAST or Pfam:
blastp -query query.fasta -db nr -out results.txt -evalue 1e-5 -outfmt 6
Where query.fasta is your FASTA sequence, and nr is the BLAST database.
Akdel, M., Pires, D.E.V., Pardo, E.P. et al. A structural biology community assessment of AlphaFold2 applications. Nat Struct Mol Biol 29, 1056–1067 (2022). https://doi.org/10.1038/s41594-022-00849-w
- Reducing Experimental Costs: Determining protein structures experimentally using methods like X-ray crystallography or nuclear magnetic resonance (NMR) can be time-consuming and expensive. AlphaFold’s predictions can provide a cost-effective alternative to obtaining initial structural information and prioritising which proteins to study experimentally.
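The blastp command in the Functional Annotations item writes tab-separated output (`-outfmt 6`) to results.txt. As a minimal sketch of how such results might feed an analysis pipeline, the Python below reads that file and keeps the strongest hits. It assumes the default 12-column layout of `-outfmt 6`; the function names and the `top_n` cut-off are illustrative choices.

```python
import csv
from typing import Dict, List

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_blast_tabular(path: str) -> List[Dict[str, object]]:
    """Read a BLAST -outfmt 6 file into a list of dictionaries."""
    hits = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue  # skip blank lines and comment lines
            hit: Dict[str, object] = dict(zip(BLAST_COLUMNS, row))
            # Convert the numeric fields used for filtering and ranking.
            for field in ("pident", "evalue", "bitscore"):
                hit[field] = float(row[BLAST_COLUMNS.index(field)])
            hits.append(hit)
    return hits


def best_hits(hits, max_evalue: float = 1e-5, top_n: int = 5):
    """Keep hits at or below the E-value cut-off, ranked by bit score."""
    kept = [h for h in hits if h["evalue"] <= max_evalue]
    return sorted(kept, key=lambda h: h["bitscore"], reverse=True)[:top_n]


if __name__ == "__main__":
    for hit in best_hits(read_blast_tabular("results.txt")):
        print(hit["sseqid"], hit["pident"], hit["evalue"])
```

The default `max_evalue` mirrors the `-evalue 1e-5` threshold in the command, so the filter only matters when the search was run with a looser cut-off.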
Citations
If you use the AlphaFold code or data, please cite:
@Article{AlphaFold2021,
author = {Jumper, John and Evans, Richard and Pritzel, Alexander and Green,
Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool,
Kathryn and Bates, Russ and {\v{Z}}{\'\i}dek, Augustin and Potapenko,
Anna and Bridgland, Alex and Meyer, Clemens and Kohl, Simon A A and Ballard,
Andrew J and Cowie, Andrew and Romera-Paredes, Bernardino and Nikolov,
Stanislav and Jain, Rishub and Adler, Jonas and Back, Trevor and Petersen,
Stig and Reiman, David and Clancy, Ellen and Zielinski,
Michal and Steinegger, Martin and Pacholska, Michalina and Berghammer,
Tamas and Bodenstein, Sebastian and Silver, David and Vinyals,
Oriol and Senior, Andrew W and Kavukcuoglu, Koray and Kohli,
Pushmeet and Hassabis, Demis},
journal = {Nature},
title = {Highly accurate protein structure prediction with {AlphaFold}},
year = {2021},
volume = {596},
number = {7873},
pages = {583--589},
doi = {10.1038/s41586-021-03819-2}
}