I wanted to mention a timely new book, Deep Learning for the Life Sciences, that I’ve received a copy of. It’s by Bharath Ramsundar at Computable, Peter Eastman at Stanford, Pat Walters at Relay, and Vijay Pande at Andreessen Horowitz, and I’ve been using it to shore up my knowledge in this area. From what I can see, there are not too many people who have much understanding of what deep learning/machine learning really entails – not that this stops folks from delivering their opinions on it. So actually acquiring some of that understanding will make you stand out from the crowd (!)
This book is written for those of us out in biology and chemistry who would like to get up to speed on the topic; it’s not a detailed dive into any one area. But I think that’s a large market: if you would like to know in brief about (say) what a neural network is, the general scheme by which it processes inputs and generates outputs, and how one goes about applying such a thing to a pile of chemical structures or cell images, this would be an excellent place to start. The authors recommend further reading at many points, since they touch on a whole range of topics that have far more detail to them than they’re trying to cover.
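To make that "general scheme" concrete, here is a minimal sketch, in plain NumPy, of how a small neural network turns a batch of numeric inputs into outputs. Everything here (the layer sizes, the idea that each molecule is eight numbers, the random weights standing in for trained ones) is invented for illustration; the book goes into the real details:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Nonlinearity applied between layers
    return np.maximum(0.0, x)

def forward(x, w1, b1, w2, b2):
    # First layer: weighted sum of the inputs, then a nonlinearity
    hidden = relu(x @ w1 + b1)
    # Second layer: a linear readout of the hidden activations
    return hidden @ w2 + b2

# Pretend each molecule is described by 8 numeric features
x = rng.normal(size=(4, 8))           # a batch of 4 "molecules"
w1 = rng.normal(size=(8, 16)) * 0.1   # weights (these would be learned)
b1 = np.zeros(16)
w2 = rng.normal(size=(16, 1)) * 0.1
b2 = np.zeros(1)

y = forward(x, w1, b1, w2, b2)        # one predicted value per molecule
print(y.shape)
```

Training is then the business of adjusting those weight arrays so the outputs match known answers, which is where all the interesting machinery lives.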
Several parts of the book make use of the open-source DeepChem toolbox – there are examples of processing chemical structure and property data, genomic data, protein structural information, imaging data, and so on. Since it’s written for a wide audience, there are introductory sections throughout explaining to the non-life-science computational types what (for example) pi-stacking is and how a SMILES string is generated, and explaining to the chemists and biologists what (for example) a convolutional neural network is and how it might be less susceptible to overfitting than some other architectures. A good feature is that the authors have a realistic view of the problems:
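As a flavor of what "processing chemical structure data" can mean at its very simplest, here is a toy featurizer that turns a SMILES string into a fixed-length vector of character counts. This is a deliberately crude stand-in for the real fingerprint featurizers a toolkit like DeepChem provides; the vocabulary and function here are invented for illustration and are not DeepChem's actual API:

```python
# A small alphabet of SMILES symbols (atoms, bonds, ring/branch markers)
VOCAB = "CNOSPFclBrI=#()[]123456789@+-"

def featurize(smiles: str) -> list:
    # Count each recognized character into a fixed-length vector,
    # so every molecule maps to the same-shaped numeric input
    counts = [0] * len(VOCAB)
    for ch in smiles:
        idx = VOCAB.find(ch)
        if idx >= 0:
            counts[idx] += 1
    return counts

aspirin = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, as a SMILES string
vec = featurize(aspirin)
print(len(vec), sum(vec))
```

Real featurizers capture far more (connectivity, substructures, charges), but the principle is the same: a molecule goes in, a fixed-length numeric vector comes out, and that vector is what the network actually sees.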
At present [the PDB] contains over 142,000 structures. . .that may seem like a lot, but it is far less than we really want. The number of known proteins is orders of magnitude larger, with more being discovered all the time. For any protein that you want to study, there is a good chance that its structure is still unknown. And you really want many structures for each protein, not just one. Many proteins can exist in multiple functionally different states. . .the PDB is a fantastic resource, but the field is still in its “low data” stage. We have far less data than we want, and a major challenge is figuring out how to make the most of what we have. That is likely to remain true for decades.
The book also highlights the limits of what software can accomplish and when it needs human assistance. The same section quoted above goes on to warn that PDB files often contain problematic regions where the protein or ligand is not modeled well, and advises that (at present) there’s no substitute for having an experienced modeler look over the structure for a reality check. Similarly, from the other end, the chapter on image processing notes that generating good segmentation masks (read the book!) is often not feasible without some human input as well. That’s something that people outside the field don’t always realize: these things are not 100% machine, but rather what Garry Kasparov calls centaur systems, using humans and machines in tandem, with each doing what it does best.
As someone without much expertise (compared to the authors!), I’ve been particularly enjoying the discussions of “meta” topics such as choosing between different architectures (and how to evaluate such choices), interpretability of the results (and how to quantify that), and testing the validity of output datasets. You may not be surprised to learn that some of these topics are complex enough to be candidates for deep-learning approaches of their own, a recursive feature that will set you wondering what techniques are then appropriate to evaluate the evaluations. The answer to the question of quis custodiet ipsos custodes turns out, perhaps, to be “this subroutine right over here”, but those are human judgment calls as well.
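One of those meta topics, choosing between architectures, comes down in its simplest form to comparing candidates on data held out from training. Here is a toy sketch of that idea with synthetic data; the "architectures" are just polynomials of different degrees, and nothing here is taken from the book:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a linear truth plus a little noise
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + rng.normal(scale=0.1, size=200)

# Hold out the last 50 points; the models never train on them
train_x, val_x = x[:150], x[150:]
train_y, val_y = y[:150], y[150:]

def val_mse(degree):
    # Fit on the training split, score on the held-out split
    coeffs = np.polyfit(train_x, train_y, degree)
    preds = np.polyval(coeffs, val_x)
    return float(np.mean((preds - val_y) ** 2))

mse_simple = val_mse(1)    # matches the underlying linear truth
mse_flexible = val_mse(9)  # more flexible, freer to chase the noise
print(mse_simple, mse_flexible)
```

The held-out error is what you trust; how much to trust it, and which candidates to try in the first place, remain exactly the sort of human judgment calls the book keeps pointing out.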
So overall, this book should make you much more able to digest what people are talking about when they start talking deep learning, and if you’re motivated to try some yourself, it will show you how to get started and where to learn more. And it will also (perhaps paradoxically) reassure you about the current limits of the technique in general and the continued need for intelligent human oversight and intervention. Making ourselves a bit more intelligent about that is no bad thing.