We’ve made it to the point – a while back, actually – where people who know the subject roll their eyes a bit when the term “artificial intelligence” is used without some acknowledgment that it’s not a very useful term. I think that’s a real sign that the field itself is becoming useful. Things are to the point where you have to say, implicitly, “You know that phrase that everyone’s used for years, the one that’s never really been defined but that headline writers like? Well, this is probably what people had in mind, and it’s actually about to be worth something, but we really should have better words for it.”
“Machine learning” was an attempt at those better words, but I fear that one is heading down the same water slide, although “AI” does have a substantial head start. This interesting new paper from a group at LBNL avoids either one in its title, and uses “machine learning” once in the abstract, sparingly thereafter, and “artificial intelligence” not at all. But it is what people have in mind when they talk about those things. The big topic is how you get data into these things, particularly how you get it into them in a form where you can hope to get anything useful back out.
In the same way as the old line about how armchair military buffs talk strategy and tactics while professionals talk logistics, professionals in this field tend to devote a lot of time to data curation and preparation. That’s partly because the real-world data we would like to use are often in rather shaggy piles, and also because even the best machine-learning techniques tend to be a bit finicky and brittle compared to what you’d actually want. We’re used to that with internal combustion engines: diesel fuel, ethanol, gasoline, and jet fuel are not perfectly interchangeable in most situations, and so it is with engines of knowledge. They are tuned up for specific types of input, and will stall if fed something else. To use a different analogy, data curation is the surface preparation of this field: just as the advice for a good paint job is to spend more time preparing the surface than applying the actual paint, in almost every case you will spend far more time getting your data into shape for machine learning than the actual computations will take.
The paper under discussion notes this, and also notes that a lot of the information out there is (1) not in very structured formats and (2) not even numerical. This has been clear for a long time; thus the interest in “natural language processing”. Can ML algorithms sort through the formats that we humans tend to use to communicate with each other? Words, sentences, paragraphs, journal articles, slide decks, chapter headings, reference lists, bibliographies, conference proceedings, abstracts, patent claims, reports and summaries? These things often have numerical data stuck to them, but look where the data go: into the appendices or the supplementary files. When people are communicating to people, we use words and pictures – we try to summarize the numbers in graphical form with captions, rather than just plop a big spreadsheet up on the screen or scroll through a long table of formatted ten-digit numbers. Those tables are what you feed the software; we humans are not built for them, although you do attend talks where the speaker doesn’t quite seem to have realized that.
Dealing usefully with natural language has been a general information-sciences goal for a long, long time (think machine translation, voice-activated controls, automated customer service, and so on). And we really have been getting better at it, although we’re still well short of our science-fiction-movie goals. Most of the attempts at extracting information from publications have needed significant human supervision – think of the old joke about the machine-translation program parsing “Out of sight, out of mind” as equivalent to “Invisible lunatic”. But this new paper is trying to move beyond that, and here’s the eye-catching part of the abstract:
Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications.
That vector-representation trick has been around a while, in various forms, and it is pretty neat. It can allow you to slip right up into the world of mathematics, with all the available tools. A series of 2013 papers from a team at Google on the “Word2Vec” technique (here’s one) really set off a lot of work in the field (here’s an intro, and here’s another), and this paper builds on that work. The idea is that you represent a word as a multidimensional vector (as many “directions” as you like) in an effort to encode its meaning and definition.
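Under the hood, the skip-gram variant of Word2vec (the training scheme this paper uses) learns those vectors by having each word predict the words that surround it within a sliding window. Here is a minimal sketch of just the pair-generation step, with a made-up scrap of abstract-like text and an arbitrary window size – the paper’s actual preprocessing and hyperparameters are in its methods section:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for skip-gram:
    each word is trained to predict its neighbors within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# A made-up fragment of abstract-like text, purely for illustration.
tokens = "thermoelectric properties of chalcogenide materials".split()
for center, context in skipgram_pairs(tokens):
    print(center, "->", context)
```

A neural network is then trained on millions of such pairs, and the learned weights for each word become its embedding vector.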
If you set the value along each of those dimensions to run from 0 to 1, the word “cat” would max out at 1.0 on the “mammal” and “carnivore” dimensions, and it would have a pretty high setting along the “furry” dimension (still, there are those weird-looking sphynx cats) and on the “tail” dimension (still, there are Manx cats). It would have a strong partial score along the “pet” dimension, because cats are an important class of pets (just ask them), but not all cats are pets (just ask a bobcat, who lowers the score for the “tail” dimension as well). “Cat” would zero out on many others, of course: “metallic”, for example, although based on the photo at right I would argue for a bit of length along the “liquid” dimension.
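To make that concrete, here’s a toy sketch with hand-labeled dimensions and entirely made-up component values – real Word2vec dimensions are learned automatically and carry no such readable names. The standard way to compare two word vectors is cosine similarity, which is what the paper uses later on:

```python
import math

# Hypothetical, human-readable dimensions; real learned embeddings
# have no such labels.
DIMS = ["mammal", "carnivore", "furry", "tail", "pet", "metallic"]

# Made-up component values, following the "cat" example in the text.
vectors = {
    "cat":    [1.0, 1.0, 0.9, 0.9, 0.7, 0.0],
    "dog":    [1.0, 1.0, 0.9, 0.9, 0.9, 0.0],
    "copper": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Related words score near 1; unrelated words score near 0.
print(cosine(vectors["cat"], vectors["dog"]))     # close to 1
print(cosine(vectors["cat"], vectors["copper"]))  # 0.0
```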
That’s what this team has done for the materials science literature. Importantly, you don’t have to do all these vector evaluations by hand – that’s what the algorithms inside things like Word2vec and GloVe do for you. In this case, the group turned Word2vec loose on about 3.3 million abstracts in the materials science literature from 1922 to 2018, which led to a vocabulary of about 500,000 words. There are details about how to handle phrases (such as “resistance of manganese alloys”), how many vector dimensions are used (200, in this case), what training algorithm is used and how many cycles it’s run for with various cutoffs, etc. In the loose spirit of the “skip-gram” technique used here, I’m going to skip over those and refer folks who want the nitty-gritty to the actual paper. We will skip right to the results:
The abstracts delivered a pretty robust set of embeddings (vector representations) which allow for real vector operations. The example given:
For instance, ‘NiFe’ is to ‘ferromagnetic’ as ‘IrMn’ is to ‘?’, where the most appropriate response is ‘antiferromagnetic’. Such analogies are expressed and solved in the Word2vec model by finding the nearest word to the result of subtraction and addition operations between the embeddings.
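In code, that subtraction-and-addition trick is just componentwise arithmetic followed by a nearest-neighbor search. A sketch with tiny, entirely made-up 2-D vectors (the real model’s vectors are 200-dimensional and learned from the abstracts):

```python
import math

# Made-up 2-D vectors purely for illustration.
vec = {
    "NiFe":              [1.0,  1.0],
    "ferromagnetic":     [0.0,  1.0],
    "IrMn":              [1.0, -1.0],
    "antiferromagnetic": [0.0, -1.0],
    "paramagnetic":      [0.0,  0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' by finding the vocabulary word
    nearest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vec[a], vec[b], vec[c])]
    candidates = (w for w in vec if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vec[w], target))

print(analogy("NiFe", "ferromagnetic", "IrMn"))  # antiferromagnetic
```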
I’d guess that one reason things worked out so well is that many of the “words” in the abstracts were empirical formulae, which encode a lot of information by themselves. And what’s kind of startling is that when you project the embeddings of the various element symbols – pulled right out of the literature abstracts – into two dimensions, the result is robust enough to largely recapitulate the periodic table. A number of other real-world properties emerge from the embeddings as well, such as crystal symmetries.
But interesting as that is, it’s just demonstrating that we can pull out things that we already knew. The team also noticed, though, that if you looked for “cosine similarities” in vector space between (say) empirical formulae and a word like “thermoelectric”, you did indeed find compounds whose thermoelectric properties had been noted as interesting. But you also see such cosine similarities for compounds that do not actually show up in the same abstracts, anywhere in the database, as the word “thermoelectric” (or any other words that would clearly identify them as thermoelectric materials). Examining these by DFT (density functional theory) and by experimental data from the full-text literature (data not found in the abstracts used to produce the vectors) confirmed a good correlation with real-world evidence.
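That screening step amounts to computing the cosine similarity between each candidate’s embedding and the embedding of the word “thermoelectric”, then sorting on it. A sketch with invented placeholder names and vectors (the real embedding table holds about 500,000 words, each 200-dimensional):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_candidates(embeddings, query, k=3):
    """Rank every other word in the embedding table by cosine
    similarity to the query word's vector, highest first."""
    q = embeddings[query]
    others = [w for w in embeddings if w != query]
    return sorted(others, key=lambda w: cosine(embeddings[w], q), reverse=True)[:k]

# Entirely made-up 3-D embeddings standing in for learned 200-D ones.
embeddings = {
    "thermoelectric": [0.90, 0.10, 0.20],
    "formula_A":      [0.80, 0.20, 0.10],  # hypothetical candidate formulae
    "formula_B":      [0.10, 0.90, 0.30],
    "formula_C":      [0.85, 0.15, 0.25],
}
print(rank_candidates(embeddings, "thermoelectric", k=2))
```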
The paper then slices the literature at 18 different cutoff years between 2001 and 2018, using only the abstracts published before each cutoff to try to predict the most likely new thermoelectric materials that would show up in the later literature. The team found that materials in the top 50 embedding predictions were about eight times more likely to have shown up as studied thermoelectrics within the next five years of publications than some random material. Even when you restrict things to materials that have a non-zero bandgap by density functional theory (the first price of admission for this property), the embedding-predicted materials were still three times more likely to show up than a random material from that list. For example, looking only at abstracts from before 2009, the embeddings had CuGaTe2 (currently one of the best materials of its kind) as a top-five prediction four years before it actually appeared in the thermoelectric literature. How is this possible? Here you go:
For instance, CsAgGa2Se4 has high likelihood of appearing next to ‘chalcogenide’, ‘band gap’, ‘optoelectronic’ and ‘photovoltaic applications’: many good thermoelectrics are chalcogenides, the existence of a bandgap is crucial for the majority of thermoelectrics, and there is a large overlap between optoelectronic, photovoltaic and thermoelectric materials (see Supplementary Information section S8). Consequently, the correlations between these keywords and CsAgGa2Se4 led to the prediction. This direct interpretability is a major advantage over many other machine learning methods for materials discovery. We also note that several predictions were found to exhibit promising properties despite not being in any well known thermoelectric material classes (see Supplementary Information section S10). This demonstrates that word embeddings go beyond trivial compositional or structural similarity and have the potential to unlock latent knowledge not directly accessible to human scientists.
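The year-by-year evaluation described above is, in spirit, a backtest: rank candidates using only pre-cutoff abstracts, then score how many of the top picks appear as studied thermoelectrics within the next five years. A sketch of the scoring step alone, with invented prediction lists and first-report years (none of these numbers come from the paper):

```python
def hit_rate(predictions, first_report, cutoff, horizon=5):
    """Fraction of predicted materials first reported in the literature
    within `horizon` years after the cutoff year."""
    hits = sum(
        1 for m in predictions
        if m in first_report and cutoff < first_report[m] <= cutoff + horizon
    )
    return hits / len(predictions)

# Made-up data: hypothetical predicted materials, and the year each was
# first reported as a thermoelectric.
predictions = ["mat_A", "mat_B", "mat_C", "mat_D"]
first_report = {"mat_A": 2012, "mat_B": 2011, "mat_C": 2020}

print(hit_rate(predictions, first_report, cutoff=2009))  # 0.5
```

Comparing this hit rate for the top embedding predictions against the rate for randomly chosen materials gives the eight-fold enrichment figure reported above.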
Now that’s machine learning, as far as I’m concerned. The authors are well aware, though, that they’re working in a comparatively orderly field (as opposed, say, to drug discovery!) and that by using paper abstracts they’ve been able to tap into a deliberately information-dense source of raw material. The natural extension would be to dive into full-text databases, but frankly that’s going to need better software than the current stuff – who knows, maybe even better hardware, by the time we’re through. But there are already further refinements to context-independent methods like Word2vec, ones that try to infer context and thus zoom in on the important things more quickly. That idea has its own pitfalls, naturally, but it does look like a promising way to go.
Extending this to the biomedical literature will be quite an effort – many will recall that this is just what one aspect of “Watson For Drug Discovery” was supposed to do (root through PubMed for new correlations). As I mentioned in that linked post, though, the failure of Watson (and some other well-hyped approaches, some of which are in the process of failing now, I believe) does not mean that the whole idea is a bust. It just means that it’s hard. And that people who are promising you that they’ve solved it and that you can get in on the ground floor if you’ll just pull out your wallet should be handled with caution. The paper today gives us a hint of what could be possible, eventually, after a lot of time, a lot of money, and a lot of (human) brainpower. Bring it on!