I fear that mentioning the phrase “Big Data” in the first sentence of a blog post will make half the potential readers suddenly remember that they have podiatrist appointments or something. But that’s the only way to approach this article at Wired. After all, the title is “The Cure For Cancer is Data – Mountains of Data”.
But this is a more realistic look than most of these articles. The problem is that when we use a term like “Big”, there’s a natural tendency to think, OK, really really large, got it, and to sort of assume that once you get to something that has to be considered really large, then you’ve clearly reached the goal and can start getting things to happen. “Really large”, though, is one of those concepts that just keep on going, and our brains are notoriously poor at picturing and manipulating quantities in that range. Here’s what happened along the way in this project:
In their search for these “resilient individuals,” (Eric) Schadt and his team amassed a pool of genetic data from 600,000 people, then the largest such genetic study ever conducted, with data assembled from a dozen sources (23andMe, the Beijing Genomics Institute, and the Broad Institute of MIT and Harvard, most notably). But in searching the 600,000 genomes, the researchers found potentially resilient individuals for only eight of the 170 diseases they were targeting. The study size was too small. By calculating the frequency of the disease-causing mutations in the population, Schadt and his team came to believe that the number of subjects they’d need to be useful wasn’t 600,000—it was more on the order of 10 million.
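To see why 600,000 genomes came up so short, it helps to put rough numbers on the search. Here’s a back-of-envelope sketch in which the frequencies are my own illustrative assumptions, not the project’s actual figures: a supposedly always-causal mutation carried by one person in ten thousand, with one carrier in a hundred turning out to be unexpectedly healthy.

```python
# Back-of-envelope yield estimate for the resilience search. The
# frequencies are illustrative assumptions, not the project's figures:
# a causal mutation carried at population frequency f, with a fraction r
# of carriers turning out to be unexpectedly healthy ("resilient").

def expected_resilient(cohort_size: int, mutation_freq: float,
                       resilience_rate: float) -> float:
    """Expected number of resilient carriers of one mutation in the cohort."""
    return cohort_size * mutation_freq * resilience_rate

for n in (600_000, 10_000_000):
    hits = expected_resilient(n, mutation_freq=1e-4, resilience_rate=0.01)
    print(f"cohort of {n:>10,}: ~{hits:.1f} resilient individuals per disease")

# cohort of    600,000: ~0.6 resilient individuals per disease
# cohort of 10,000,000: ~10.0 resilient individuals per disease
```

With numbers anywhere in that neighborhood, 600,000 genomes will come up empty for most diseases – which squares with hits for only eight of the 170 – while ten million at least gives you something to work with.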
Schadt is now founding a company called Sema4 that will try to assemble genomic information on this scale, figuring that the number of competitors will be small and that there may well be a business model once they’re up to those kinds of numbers (the data will be free to academic and nonprofit researchers). Handling information on that scale certainly is a problem, but as the article makes clear, the bigger problem is just getting information on that scale. How do you convince ten million people, from appropriately diverse genetic backgrounds, to have their genomes completely sequenced and to hand the results over to you?
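On the pure data-handling side, a quick back-of-envelope calculation shows what “that scale” means in bytes. The per-genome sizes here are my own ballpark assumptions, not figures from the article:

```python
# Rough storage arithmetic for ten million genomes. The per-genome sizes
# are assumed ballpark figures (aligned 30x whole-genome reads vs. a
# compressed file of variant calls), not numbers from the article.

GENOMES = 10_000_000
READS_GB_EACH = 100   # assumption: aligned 30x whole-genome reads (BAM)
VARIANTS_GB_EACH = 1  # assumption: compressed variant calls only (gVCF)

print(f"aligned reads:  {GENOMES * READS_GB_EACH / 1e9:.0f} EB")
print(f"variant calls:  {GENOMES * VARIANTS_GB_EACH / 1e6:.0f} PB")
# aligned reads:  1 EB
# variant calls:  10 PB
```

That part, at least, is an engineering problem. Recruiting and consenting is not, and the article makes plain how unsuited the existing data pools are for the purpose: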
“There are companies today that claim access to millions of patient records,” Schadt explains. “But from the standpoint of what we intend to do, the data is meaningless. It’s often inaccurate, incomplete, and not easily linked across systems. Plus, that data doesn’t typically include access to DNA or to the genomic data generated on their DNA.” To take the example of the Resilience Project, it wasn’t simply that the universe of data was too small—it was also that the 600,000 genomes were governed under a hash of various consenting arrangements. If something vital was discovered, hundreds of thousands of participants could not be recontacted or tracked, making the data useless from a practical research standpoint.
What the article doesn’t go on to lay out, though, is how all this is going to lead to any cures, for cancer or anything else. That’s actually the hard part; rounding up the ten million genomes will seem comparatively straightforward. One way to go about it is (as described above) to look for people who, from what we know, should have some sort of genomically-driven disease but don’t. What compensatory mutations do they have, and how are these protective? That won’t be easy, because everyone has their own collection of mutations, and there’s no guarantee that any of them will leap out as being biochemically plausible. There’s also a reasonable chance that no single mutation will turn out to be the answer by itself – it may be an ensemble, working together. And what if none of them are the answer? You could be looking at an environmental effect that’s not going to be in the DNA sequence at all, or one that shows up there only very subtly, as a sort of bounce-shot mechanism. This is still very valuable work, and you can learn a great deal from “human genetic knockouts” that can’t really be learned any other way, but it’s far from straightforward.
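In its simplest single-variant form, by the way, that search is conceptually just a filter and a tally. Here’s a toy sketch – every identifier in it is invented, and a real analysis would have to handle combinations of variants, incomplete penetrance, and far messier phenotype records:

```python
# Toy sketch of the single-variant version of the resilience search:
# find carriers of a supposedly always-causal allele who show no disease,
# then tally which other variants those carriers share. Every identifier
# here (sample IDs, gene and variant names) is invented for illustration.
from collections import Counter

CAUSAL = "DISEASE_Y:causal"  # stand-in for a fully penetrant mutation

# hypothetical cohort records: (sample_id, variants carried, has disease?)
cohort = [
    ("s001", {CAUSAL, "GENE_A:v1"},              True),
    ("s002", {CAUSAL, "GENE_X:v7"},              False),  # resilient?
    ("s003", {CAUSAL, "GENE_X:v7", "GENE_B:v3"}, False),  # resilient?
    ("s004", {"GENE_A:v1"},                      False),  # not a carrier
]

resilient = [variants for _, variants, sick in cohort
             if CAUSAL in variants and not sick]
candidates = Counter(v for variants in resilient for v in variants
                     if v != CAUSAL)
print(candidates.most_common())  # [('GENE_X:v7', 2), ('GENE_B:v3', 1)]
```

Everything interesting – and everything hard – is in what that sketch waves away.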
You’re also unlikely to find cancer cures like this, at least not directly. Cancer is a disease of cellular mutations, and it shows up after something, or more likely several things, has gone wrong in a single cell. The main way that a person’s background DNA sequence will prove useful is if they have something going on with their DNA repair systems, cellular checkpoints, or the other mechanisms that actually guard against mutations and uncontrolled cell division, and those are almost certainly going to manifest themselves as greater susceptibility to tumor formation. (A number of these mutations are already known.) I’m not aware of any mutations that go the other way and seem to confer a greater resistance to carcinogenesis – finding such things would be rather difficult. The efforts to sequence especially long-lived people are about the best idea I have in that line, and that’s not going to be very straightforward either, for the reasons mentioned above.
But let’s say that you really do identify Protein X as a possible mechanism to cancel out or ameliorate Disease Y. Things have still only just begun. Now you have to see how feasible it is, mechanistically, to target this protein as a therapeutic – how “druggable” it is. Your best hope is that it’s an enzyme or receptor whose lowered activity confers the beneficial effect, because we drug-discovery types are at our best when we’re throwing wrenches into the gears to stop some part of the machinery from working. That we can sometimes do. Making a specific protein work better, on the other hand, is extremely rare. There are a lot of disease-associated proteins that are considered more or less undruggable because they fail this step – or, more accurately, because we fail this step and can’t come up with a way to make anything work.
There’s an early scene in Brideshead Revisited where Charles Ryder, in the army during World War II, is looking at a much younger officer under him named Hooper, finding him a bit baffling and frightening. He tries substituting “Hooper” for the word “Youth” in various slogans and phrases to see if they still hold up – Hooper Hostels, the International Hooper Movement, etc., and finds it a pretty severe test. The equivalent, when you’re hearing about some new technique that could provide breakthroughs in human disease, is to wedge the word “Alzheimer’s” in there, and see if it still makes sense.
It’s a severe test as well. All sorts of genomic searches have been done with Alzheimer’s in mind, and (as far as I know) the main things that have been found are the various hideous mutations in amyloid processing that lead to early-onset disease (see the intro to this paper), and the connection with the ApoE4 lipoprotein. Neither of these explains the prevalence of Alzheimer’s in the general population; if there were a single genetic smoking gun for the disease, it would have been found by now. What you can get are some clues. The amyloid mutations are some of the strongest evidence for the whole amyloid hypothesis of the disease, but there’s still plenty of argument about how relevant they are to the regular form of it. Developing animal models of Alzheimer’s based on these mutations has been fraught with difficulty. And the ApoE4 correlation has led to a lot of hypotheses, some of which are difficult or impossible to put to the test, and others of which remain unproven more than twenty years after the initial discovery.
I’m sure that Eric Schadt and his people have a realistic picture of what they’re up to, but a lot of other people outside of biomedical research might read some of these Big Data articles and get the wrong idea. The point is that Big Data will only help you insofar as it leads to Big Understanding, and if you think the data collection and handling are a rate-limiting step, wait until you get to that one. I know it says that ye shall know the truth, and the truth shall make you free (a motto compelling enough that it’s in the lobby of the CIA’s headquarters), but in this kind of research, it’s more like ye shall sort of know parts of the truth, and they will confuse you thoroughly. It goes on like that for quite a while, usually. Big Data efforts will help, but they will not suddenly throw open the repair manual. There is no repair manual. It’s up to us to write it.