This is an excellent overview at Stat on the current problems with machine learning in healthcare. It’s a very hot topic indeed, and has been for some time. There has especially been a flood of manuscripts during the pandemic, applying ML/AI techniques to all sorts of coronavirus-related issues. Some of these have been pretty far-fetched, but others are working in areas where everyone agrees that machine learning can be truly useful, such as image analysis.
How about coronavirus pathology as revealed in lung X-ray data? This new paper (open access) reviewed hundreds of such reports and focused on 62 papers and preprints on this exact topic. On closer inspection, none of these is of any clinical use at all. Every single one of the studies falls into clear methodological errors that invalidate their conclusions. These range from failures to reveal key details about the training and experimental data sets, to not performing robustness or sensitivity analyses of their models, not performing any external validation work, not showing any confidence intervals around the final results (or not revealing the statistical methods used to compute them), and many more.
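Confidence intervals, to pick one item from that list, are not hard to produce. Here is a minimal sketch of a percentile bootstrap around a sensitivity figure, using only the Python standard library. The test set is made up for illustration (200 scans, 20 positives, 16 of them caught), the function names are my own, and none of this is taken from the paper under discussion.

```python
import random

def sensitivity(y_true, y_pred):
    """Fraction of truly positive cases (label 1) that the model flagged as positive."""
    flagged = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(flagged) / len(flagged) if flagged else float("nan")

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample cases with replacement and recompute the metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if value == value:  # drop NaN from a resample that happened to contain no positives
            stats.append(value)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi

# Hypothetical test set: 200 scans, 20 positives, of which the model catches 16.
y_true = [1] * 20 + [0] * 180
y_pred = [1] * 16 + [0] * 4 + [0] * 180

point = sensitivity(y_true, y_pred)
lo, hi = bootstrap_ci(y_true, y_pred, sensitivity)
print(f"sensitivity = {point:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

With only 20 positive cases the interval comes out wide, which is exactly the information a reader needs and exactly what a bare point estimate hides.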
A very common problem was the (unacknowledged) risk of bias right up front. Many of these papers relied on public collections of radiological data, but these have not been checked to see if the scans marked as COVID-19 positive really came from positive patients (or if the ones marked negative really came from negative ones). It also needs to be noted that many of these collections are very light on actual COVID scans compared to the whole database, which is not a good foundation to work from, either, even if everything actually is labeled correctly by some miracle. Some papers used the entire dataset in such cases, while others excluded images using criteria that were not revealed, which is naturally a further source of unexamined bias.
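To see why a collection that is light on actual COVID scans is shaky ground, consider this small sketch. The numbers are invented for illustration (1,000 scans, 20 of them positive), and the "model" is the laziest one imaginable: it calls every scan negative.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 means positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

# Hypothetical collection: 1,000 scans, only 20 of them COVID-positive.
y_true = [1] * 20 + [0] * 980
# A "model" that has learned nothing and labels every scan negative.
y_pred = [0] * 1000

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
sens = tp / (tp + fn)   # share of positive scans that were caught
spec = tn / (tn + fp)   # share of negative scans correctly cleared
balanced_accuracy = (sens + spec) / 2

print(f"accuracy={accuracy:.2f} sensitivity={sens:.2f} "
      f"specificity={spec:.2f} balanced={balanced_accuracy:.2f}")
```

The headline accuracy is 98% even though not a single positive scan is detected, which is why sensitivity, specificity, or a balanced figure has to be reported alongside it.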
In all AI/ML approaches, data quality is absolutely critical. “Garbage in, garbage out” is turbocharged to an amazing degree under these conditions, and you have to be really, really sure about what you’re shoveling into the hopper. “We took all the images from this public database that anyone can contribute to and took everyone’s word for it” is, sadly, insufficient. For example, one commonly used pneumonia dataset turns out to be a pediatric collection of patients between one and five years old, so comparing that to adults with coronavirus infections is problematic, to say the least. You’re far more likely to train the model to tell children from adults than to recognize coronavirus infection.
That point is addressed in this recent preprint, which shows how such radiology analysis systems are vulnerable to this kind of short-cutting. That’s a problem for machine learning in general, of course: if your data include some actually-useless-but-highly-correlated factor for the system to build a model around, it will do so cheerfully. Why wouldn’t it? Our own brains pull stunts like that if we don’t keep a close eye on them. That paper shows that ML methods too often pick up on markings around the edges of the actual CT and X-ray images if the control set came from one source or type of machine and the disease set came from another, just to pick one example.
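One cheap sanity check for this sort of short-cutting needs no images at all: ask how well you could guess each label knowing only where the scan came from. The sketch below is my own illustration, not a method from the preprint, and the source names and counts are hypothetical.

```python
from collections import Counter, defaultdict

def source_only_accuracy(records):
    """Accuracy of guessing each label from its source alone (majority label per source)."""
    by_source = defaultdict(Counter)
    for source, label in records:
        by_source[source][label] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_source.values())
    return correct / len(records)

def majority_class_accuracy(records):
    """Accuracy of always guessing the single most common label, ignoring source."""
    labels = Counter(label for _, label in records)
    return labels.most_common(1)[0][1] / len(records)

# Hypothetical metadata: (source, label) pairs for 300 scans.
records = (
    [("hospital_A", "covid")] * 95
    + [("hospital_A", "control")] * 5
    + [("public_set_B", "control")] * 190
    + [("public_set_B", "covid")] * 10
)

print(f"majority-class baseline: {majority_class_accuracy(records):.2f}")
print(f"source-only baseline:    {source_only_accuracy(records):.2f}")
```

In this made-up example the source alone gets 95% of the labels right, against 65% for always guessing the commonest label. When the gap is that large, a model that scores well may simply have learned to tell the scanners apart, and only validation on data from a source it has never seen can settle the question.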
To return to the original Nature paper, remember, all this trouble is after the authors had eliminated (literally) hundreds of other reports on the topic, for insufficient documentation. They couldn’t even get far enough to see if something had gone wrong, or how, because these other papers did not provide details of how the imaging data were pre-processed, how the training of the model was accomplished, how the model was validated, or how the final “best” model was selected at all. These fall into Pauli’s category of “not even false”. A machine learning paper that does not go into such details is, for all real-world purposes, useless. Unless you count “putting a publication on the CV” as a real-world purpose, and I suppose it is.
But if we want to use these systems for some slightly more exalted purposes, we have to engage in a lot more tire-kicking than most current papers do. I have a not-very-controversial prediction: in coming years, virtually all of the work that’s being published now on such systems is going to be deliberately ignored and forgotten about, because it’s of such low quality. Hundreds, thousands of papers are going to be shoved over into the digital scrap heap, where they most certainly belong, because they never should have been published in the state that they’re in. Who exactly does all this activity benefit, other than the CV-padders and the scientific publishers?
The post Machine Learning Deserves Better Than This first appeared on In the Pipeline.