The whole machine learning field has a huge amount to offer chemistry, medicinal chemistry, and biomedical science in general. I don’t think that anyone seriously disputes that part – the arguing starts when you ask when this promise might be realized. In the abstract, the idea of tireless, relentless analysis of the huge piles of data that we generate is very appealing. Those piles have long since passed the ability of humans to wring out all the results unaided, or even with the sort of computational aids that we’re already used to. There are just too many correlations to check, too many ideas to try out, and too many hypotheses to validate.
I am not a specialist in the field. I mean, I know people who are, and I know more about it than many people who aren’t doing it for a living, but I am in no way qualified to sit down and read (say) four different papers on machine-learning approaches to chemical problems and tell you which one is best. The problem is, I’m not so sure anyone else can do that very easily, either. This is even more obvious to researchers in this area than it is to the rest of us. You might hope that additional expertise would allow people to make such calls, but as things are now, it mostly allows them to see just how tangled things really are.
It would be nice (although perhaps quite difficult) to put together some standard test cases for evaluating performance in (say) machine-learning drug discovery programs. You could imagine a pile of several thousand compounds with associated assay data, from some area where we’ve already worked out a lot of conclusions. You’d turn the software loose and see how much of that hard-won knowledge could be recapitulated – aha, Program Number Three caught on to how those are actually two different SAR series, good for it, but it missed the hERG liabilities that Program Number Six flagged successfully, and only Program Number Three was able to extrapolate into that productive direction that we deliberately held back from the data set, and so on. Drug repurposing (here’s a recent effort in that line) could be a good fit for a standard comparator set like this as well.
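To make that idea a bit more concrete, here is a rough sketch (in Python) of what scoring a single program against such a comparator set might look like. Everything in it is hypothetical: the BenchmarkTruth and ProgramOutput structures, the score_program function, and the assumption that each program reports its SAR series using the same labels as the reference set. As far as I know, no standard benchmark like this exists; this is just the paragraph above written out as code.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkTruth:
    """The hard-won conclusions we already have about the compound set (hypothetical format)."""
    series_labels: dict       # compound ID -> known SAR series name
    herg_liabilities: set     # compound IDs with known hERG risk
    held_out_actives: set     # actives deliberately withheld from the data given to the programs


@dataclass
class ProgramOutput:
    """What a given program reports after being turned loose on the data (hypothetical format)."""
    predicted_series: dict    # its assignment of compounds to series, using the reference labels
    flagged_herg: set         # compounds it flags for hERG risk
    proposed_actives: set     # new compounds it proposes as likely actives


def score_program(truth: BenchmarkTruth, out: ProgramOutput) -> dict:
    """Fraction of each category of known conclusions that the program recapitulates."""
    # Did it notice that the compounds really fall into two different SAR series?
    series_accuracy = sum(
        out.predicted_series.get(cid) == label
        for cid, label in truth.series_labels.items()
    ) / len(truth.series_labels)

    # Did it catch the hERG liabilities we already know about?
    herg_recall = len(truth.herg_liabilities & out.flagged_herg) / len(truth.herg_liabilities)

    # Did it extrapolate into the productive direction that was held back from it?
    extrapolation_recall = len(truth.held_out_actives & out.proposed_actives) / len(truth.held_out_actives)

    return {
        "series_accuracy": series_accuracy,
        "herg_recall": herg_recall,
        "extrapolation_recall": extrapolation_recall,
    }
```

A real comparator set would need far more than this, of course: agreed-upon assay endpoints, a way to align each program’s clustering of compounds with the reference series, and some statistics on top of the raw fractions. But even a toy version makes the point that deliberately holding data back is what turns “can it extrapolate?” into something you can actually measure.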
Thus calls like this one, to try to put the whole machine learning/artificial intelligence area onto a more sound (and comparable) footing. This is far from the first effort of its type, but it seems to me that these calls have been getting louder, larger, and more insistent. Good. We really have to have some common ground in order for this field to progress, instead of a mass of papers with semi-impenetrable techniques, benchmarked (if at all) in ways that make it hard to compare efficiency and outcomes. Some of this behavior – much of it – is being driven by pressure to publish, so you’d think that journal editors themselves would have to be part of the solution.
But maybe not. This is an area with a strong preprint tradition, and not long ago there was a controversy about (yet another) journal from the Nature Publishing Group, Nature Machine Intelligence. Over two thousand researchers in the field signed up to boycott the journal because it would be subscription-only, which the signers feared would erode the no-paywall system currently in place. In that case, though, the pressure for higher-quality publications will have to come from others in the field somehow, with a willingness to provide full details and useful benchmark tests helping to drive reputations (rather than just numbers of papers and/or their superficial impressiveness). Machine learning is far from the only field that could benefit from this approach, of course, and the fact that we can still speak in those terms makes a person wonder how effective voluntary calls for increased quality will be. But I certainly hope that they work.
Meanwhile, just recently, there’s been a real blot on the whole biomedical machine-learning field. I’ve written some snarky things about IBM’s Watson efforts in this area, and you know what? It looks as if the snark was fully deserved, and then some. STAT reports that the company’s efforts to use Watson technology for cancer-care recommendations were actually worse than useless. The system made a good number of wrong (and even unsafe) calls, which rapidly decreased physician confidence in it (as well it should). Worse, this was going on at the same time that IBM was promoting the wonders of the whole effort, stating that doctors loved it, didn’t want to be without it once they’d been exposed to its glories, and so on. It’s a shameful episode, if STAT has its facts right, and so far I have no reason to think that they don’t.
So there are the two ends of the scale: efforts to make machine-learning papers more comprehensive and transparent, and a company’s apparent efforts to obfuscate its own machine-learning shortcomings in order to boost its commercial prospects. You don’t need to turn a bunch of software loose on the philosophical ethics literature to come to a conclusion about the latter. In this case, anyway, mere human instincts tell you all you need to know.