These are some cleaned-up notes based on Preetham Venkatesh’s talk to IISc Naturalists on AlphaFold2’s recent CASP win. For a basic summary of what happened, see Deepmind’s blog post AlphaFold: a solution to a 50-year-old grand challenge in biology, and for the details, Mohammed Al Quraishi’s AlphaFold2 @ CASP14: “It feels like one’s child has left home.”
Proteins are complex biomolecules that are composed of a string of submolecules (called amino acids) connected to each other in a line. While it is easy nowadays to figure out what amino acids a protein is composed of, and in what order (its sequence), proteins don’t exist in a linear shape, but fold into weird and wonderful tangled-up structures that strongly influence the functions they have in a cell. Figuring out how they go from their original state to their final low-energy structure is the notorious protein folding problem. Instead of understanding the whole dynamics, you could also just try to predict what final structure a given sequence will form. This is very hard, since there are exponentially many possibilities, so just checking which one has the lowest energy doesn’t work. In practice, people try to “condense” many copies of a protein into a crystal – like regular array, which makes it easy to use X-ray diffraction to figure out the structure. This is considered the “true structure” though in reality the protein probably takes on a slightly different form while swimming around in the cytoplasm.
CASP is a biennial competition in protein structure prediction. The organizers pick a bunch of proteins that have just been successfully crystallized, and ask the experimentalists to keep their results under wraps for a while. Competitors submit predictions of the structure of the protein from its sequence, and these predictions are evaluated by comparing to the experimentally determined crystal structures once they are released. DeepMind took part in CASP in 2018, winning by a comfortable margin. This year, they returned and completely blew everyone else out of the water, with accuracy scores so high that for the first time, we can say the protein structure prediction problem is basically solved, instead of mostly unsolved.
There are tons and tons of caveats to this statement, with confusion added by DeepMind’s conflation of protein folding and protein structure prediction (to be fair, everyone does this).
“Here’s what I think AF2 can do: reliably (>90% of the time) predict to reasonable accuracy (<3-4Å) the lowest energy structure of vanilla (no co-factors, no obligate oligomerization) single protein chains using a list of homologous protein sequences”Here’s a more accurate summary from Quraishi
Nevertheless, this is a stunning scientific achievement, and a start (or continuation!) of a really exciting time for biology, and for applications of ML to the physical and life sciences.
Protein structure prediction pre-Alphafold2
CASP has two types of problems, free modelling and template modelling. A template is a sequence similar to your query sequence, which has a solved structure which you can use as a template to tweak your model. Naturally, this makes template modelling much easier than free modelling, where you have nothing to go on. On a smaller scale, you can use the same trick, if subsequences of your protein sequence are known to fold into known structures, or standard patterns like alpha-helices and beta-sheets. The tricky part is combining the known pieces in the correct way.
What else can you do?
An extremely cool idea that emerged in the lates 90s is using evolutionary data. If two residues (amino acids) are close together in the structure, then if one mutates, the other might be forced to mutate too. Reversing this, if you look at many sequences for the same protein, and notice that two residues always seem to evolve in tandem, that’s a pretty good hint that they’re connected somehow in the final structure.
The current (pre-AlphaFold2) approach was to use multiple sequence alignment (MSA), which essentially means lining up many sequences to extract pairs of co-evolving residues (as described above), which are then fed to a neural network which then predicts a distogram, which you can think of as an adjacency matrix keeping track of which pairs of atoms are close to each other. These pairwise distances are then used, together with templates and known pieces, to predict the final structure (in a manner I don’t completely understand).
“A folded protein can be thought of as a “spatial graph”, where residues are the nodes and edges connect the residues in close proximity. This graph is important for understanding the physical interactions within proteins, as well as their evolutionary history. For the latest version of AlphaFold, used at CASP14, we created an attention-based neural network system, trained end-to-end, that attempts to interpret the structure of this graph, while reasoning over the implicit graph that it’s building. It uses evolutionarily related sequences, multiple sequence alignment (MSA), and a representation of amino acid residue pairs to refine this graph.”-DeepMind
One of AlphaFold2’s major innovations was to avoid summarizing the co-evolving pairs from the multiple sequence alignment, but to stick with the raw sequences and use attention (in the ML sense) to figure out which parts mattered. They then build a distogram iteratively, going back and forth between the distogram (a residue-residue matrix) and the MSAs (a sequence-residue matrix). After this, they use yet more transformers, (SE(3)-equivariant ones), and template structures, in another iterative process, to directly predict a final 3D structure.
It looks like one advantage of this is that you can focus on multiple residues at the same time instead of being stuck with pairs. Additionally, you might have very few sequences for the whole protein, but lots for a small part. There will also be lots of noise, making it unclear which sequences to use. Attention lets you have a neural network learn to deal with all of this and pick the right sequences at the right point instead of being forced to do it manually.
SE(3)-equivariant transformers are a new type of transformer architecture specifically designed for modelling 3d point clouds, which are equivariant (similar to invariant but with technical caveats) to permutations, translations and rotations. It’s fairly easy to make networks permutation and translation invariant (eg translation-invariance holds for CNNs), and they use spherical harmonic functions and lots of fancy stuff for the rotational invariance. Essentially, this allows you to output something that captures the important features of a molecular structure (where the atoms are vis-a-vis each other) and not unimportant global features like orientation.
This means that their model is far more end-to-end than previous work, which (I imagine) allows them to make full use of their massive compute budget. And massive it is. Apart from training, which took 128 TPUs running for several weeks, AlphaFold2 seems to require a shockingly large amount of time to actually predict a single protein structure. According to Demis Hassabis, depending on the protein, they used between 5 to 40 GPUs for hours to days. There are two sets of iteration within the process, one when building the distogram, and one when refining the 3D model with the SE(3)-equivariant transformers, so it’s not as simple as a classic forward pass, but it’s still quite unclear why on earth it takes this long.
There is not much more that is known about how the model actually works. Most of what I’ve said above is from people reading the entrails of the diagram above, since DeepMind hasn’t really been forthcoming on the details, unlike last CASP. This is weird and sad and I hope they release a preprint soon.
It seems like this is another instance of the Bitter Lesson – that at scale, search and learning simply beats all the fancy tricks an AI researcher can hard-code. Quraishi and others did try end-to-end models before, but for various reasons, including their lack of compute, it didn’t really work out until AlphaFold2. Preetham speculated that next CASP will have people compressing MSA data into some new kind of representation, possibly one amenable to large language models, since MSA are currently very computationally intensive to work with.
All I can say is that I am very excited 🙂
Answers to some questions I asked Preetham after the talk
Is the de novo/MSA-available distinction related to the free/template distinction or is it totally orthogonal?
De novo vs MSA is availability of co-evolutionary information to predict contacts/constraints. Free/template refers to availability of a structural template, ie, a sequence similar to my query sequence has a solved structure which I can use as a template to build my model. MSA makes use of the nearly 250 million sequences we have in the sequence database to use evolutionary information to predict constraints for folding. Template makes use of the 150,000 structures we have in the pdb (of which the number of unique folds is far lesser) to build our 3D structure upon.
Does running the model require Protein Data Bank access or is the info encoded in the weights?
Since AF2 incorporates template information, it would require pdb access.
Why doesn’t the Protein Data Bank have mammalian proteins?
If I’m not wrong, the reason is that they’re generally harder to synthesize and purify at scale. Mammalian proteins have a lot more complexity (post-translational modifications) that make them very difficult to overexpress in bacterial systems (cheapest). You need fancy-ass cells to express them, and that’s pretty expensive. And of course, getting a protein to crystallize involves a lot of trial and error, so it’s even more expensive and difficult. –answered by Raj M.
Re hassabis saying 5-40gpus, hours to days, could this be a miscommunication and he was referring to the time required to generate final solutions to all the structures in casp14?
I doubt it was a miscommunication. He was specifically asked about inference time for a single protein, and there was extensive discussion on this in the CASP discord channels and even beyond that, and at no point did anyone from DeepMind correct this.
Why did Quraishi et al not manage to make end-to-end work? Lack of compute? Lack of Alpha’s other innovations?
Quraishi tried to completely do away with evolutionary information as much as possible. He only input the query sequence and a PSSM matrix, which is the minimum evolutionary information that can be used. I think this was a big reason his model failed. He also showed that while his predictions were good at a local level, they were completely off on a global level resulting in poor GDT scores.
You mentioned language models might get used next casp. How tho?
Currently using MSAs is extremely computational intensive. There has been quite interesting work on sequence representation now, including a paper from FAIR which came out yesterday, and I expect that we might see usage of these representations instead of raw sequences as inputs. To be fair, the DeepMind team was asked if they used sequence representations and they said it was one of the things they tried but it did not impact performance much. So it remains to be seen for now. Quraishi’s RGN model did see improvement by using sequence representation so that indicates that they might still be useful.