I last wrote about DeepMind’s efforts to predict protein folding and structure here, with their AlphaFold software. AlphaFold really performed very strongly in the 2020 protein folding challenge, and that got a lot of attention. Well, they’ve recently published a great deal of detail on how they did this, released their source code, and they’ve announced that they’re going to be releasing their computed structures of 350,000 proteins, to be followed in the coming months by up to 100 million more. Here’s the database. And a group at the University of Washington has just published on their own similar approach (RosettaFold) and has also made this code freely available.
So it’s clear that the world of computational protein structure prediction is in a very different place than it was a couple of years ago. But I’m already on record as (1) cheering this sort of thing on while (2) saying that it doesn’t make as much difference to drug discovery as many stories and press releases have had it. Do the latest developments change my mind? What does this all mean? I’ve been talking with colleagues and seeing comments from people in the field, so here’s my attempt.
Well, for one thing, it means that a lot of people in academia are going to have to rewrite their research grants. If you have been working on computational protein folding yourself, odds are that you have had your doors blown off by these recent developments and will need to rethink. That doesn’t mean there’s nothing left to do (far from it – read on) but anyone who was trying to do similar stuff to DeepMind was already in the position of the RosettaFold people, at best, and if you were trying to do similar work to RosettaFold, well. . .you’d better target things carefully.
Another thing that we’re all going to have to get used to is that for years (decades) people have considered computational error to be the most likely source of error when a predicted structure and an experimental one don’t match, and quite rightly so. We’re getting to the point now where the ball is in the experimentalists’ court, and that’s something new. Right now, if you have a big mismatch between the two, it is frankly more likely to be an experimental error, because the folding predictions are getting so solid. This is disorienting, to say the least.
There’s also the synergy with the experimental data. For the many, many of us who are not crystallographers, X-ray data can seem like it’s being delivered on golden tablets to the sound of trumpets. But protein X-ray structures depend on model-building as well; you try to see which structures best fit the experimental electron density data. And those data can very often be interpreted in different ways, especially when it comes to subtle details of conformation. I say “subtle”, but sometimes those small structural things can make huge differences in protein function – just look at the prolyl isomerase enzymes, whose job it is to make proline residues sit in the cis or trans fashion in the overall chain, and at the number of proteins whose downstream activities depend on such state-switching. Having access to such well-attested structural models changes the way that X-ray data are handled, and for the better. There are also differences in protein structures depending on the method used to determine them, and these computations could help to resolve those, too.
But what this leads to is what I keep pointing out when I give my various talks on the effects of AI and ML methods on chemistry and drug discovery. These things redefine grunt work: they make larger and larger areas that were formerly the site of human labor into machine labor instead, which is faster, more tireless, and getting more accurate all the time. What does that do to us humans? It pushes us towards higher-level problems that are not yet subject to computational or automated solutions. In the case of protein folding and structure, it means that we now will spend more of our time on the harder stuff: protein complexes, the classification and function of protein surfaces in general, the effects of all the wide variety of post-translational modifications, the dynamics of protein conformation changes in real cell-biology time, the subtleties of how small-molecule ligands work their way in and out of binding sites, and the related question of how allosteric sites and cofactors modify these things from afar.
It’s important to realize that the new protein computational tools do not make all these into solved problems. Not even close. They clear out a lot of obstacles so that we can get to these problems more easily and more productively, for sure, but they do not solve them once we get up to the actual rock faces in our particular gold mines. To pick an immediate example of this, I’ve seen a comment from a structural biologist pointing out that when you ask AlphaFold for the structures of various key kinase enzymes, it gives you a very accurate one of what we already know to be the inactive form of the protein. Kinases have several regions that flop and scoot into different and easily accessible conformational states, and (at present, anyway) these structure prediction suites will not necessarily capture all of these, and they most certainly will not tell you which ones are associated with the active enzyme or would be more relevant to a protein’s different functions in vivo.
Consider those prolines I just mentioned above: AlphaFold might give you a cis proline in one protein structure or it might give you a trans one at some particular residue in a particular protein, but it will not be able to tell you that both of these are found in a living cell, that they are interconverted by yet another enzyme, and that the two forms will have totally different functions. Meanwhile, other prolines in the same protein will never interconvert at all, and you won’t know that, either. These details are up to us humans to work through. Similarly, many enzymes need cofactor molecules bound to them to do some of their work, and AlphaFold structures have no way to consider these – nor the presence of things like zinc or calcium ions that can also have a profound effect on protein structure and function. These are the tougher problems that will be ironed out by humans (literally, in the case of iron-associated proteins), with machine help.
And that’s why I talk the way I do about protein structure prediction and its effects on drug discovery. Drug discovery is all about those biological effects – what else could it be concerned with? And these are higher-order things than just the naked protein structure, as valuable as that can be. Remember, our failure rate in the clinic is around 90% overall, and none of those failures were due to lack of a good protein structure. They were caused by much harder problems: what those proteins actually do in a living cell, how those functions differ in health and disease, how they differ between different sorts of human patients and between humans in general and the animal models that were used to develop the compounds, what other protein targets the drug candidate might have hit and the downstream effects (usually undesirable) that those kicked off, and on and on.
So structural biology has been greatly advanced by these new tools. But it has not been outmoded, replaced, or rendered irrelevant. It’s more relevant than ever, and now we can get down to even bigger questions with it.
The post More Protein Folding Progress – What’s It Mean? first appeared on In the Pipeline.