David Weininger passed on last week, and you probably have to be into chemoinformatics for that name to immediately register. He came up with the SMILES notation for chemical structures, though, so that should make his contributions clear. Here’s an excellent appreciation by Anthony (“Ant”) Nicholls that will really give you a sense of the guy; I very much recommend it.
One thing that may not be appreciated is just how big a deal SMILES really is. The reason is that chemical structures, although excellent and meaningful representations (which is why we chemists get so ticked off when art directors mangle them), are actually pretty unwieldy for computers to handle. Until the 1950s or so, that probably wasn’t much of a problem, but it then became clear that there needed to be ways to turn structures into some sort of numerical or alphabetical form so that they could be dealt with by software.
Natural language, as usual, wasn’t much help. For the non-chemists in the crowd, you’ve surely seen chemical names written out in English or whatever your native language might be, all the way from “methane” on up. Those names are systematic, and can be converted back and forth from structures, but they can be pretty unwieldy – for example, here’s the systematic name for testosterone: (8R,9S,10R,13S,14S,17S)-17-hydroxy-10,13-dimethyl-1,2,6,7,8,9,11,12,14,15,16,17-dodecahydrocyclopenta[a]phenanthren-3-one. Not very enjoyable, and it gets a lot worse from there. IUPAC nomenclature like that tends to assume that polycyclic systems are aromatic by default, and then reduces the bonds as needed, hence that “dodecahydro” part. It’s also hard to compare bits of a structure with that naming system, since the root of the name (and the numbering scheme that goes along with it) can totally flip over with even fairly minor changes in structure.
An early way to turn structures into text that software could handle was Wiswesser line notation, which I noted to my amazement was first worked out in 1949. I remember seeing it once in a while in the 1980s, in grad school, but I never really learned it. One problem with it that became apparent as the years went on was that it was difficult to write software that could read it easily. So in the 1980s, Weininger worked up the SMILES format, which is much more friendly in that regard (although not as compact). For an explanation of how it works, I can’t do better than the Wikipedia graphic at right, which uses the antibiotic ciprofloxacin. What you can see is that the molecule’s rings are broken apart, and it’s written out as branches off of the resulting chain, with matching digits marking where each ring was opened. There are ways to note what sort of bond connects the atoms, the stereochemistry (if applicable), positive and negative charges, even isotopes.
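To make the "easy for software to read" point concrete, here is a minimal sketch of a SMILES reader in Python. It is my own toy, not Weininger's algorithm or any toolkit's parser, and it assumes a small subset of the notation: plain atoms, bracket atoms treated as opaque tokens, the bond symbols - = # :, parentheses for branches, and single-digit ring closures. The names `parse_smiles`, `TOKEN`, and `CIPRO` are made up for the illustration, and the ciprofloxacin string is one valid spelling of the molecule, not necessarily the one in the graphic.

```python
import re
from collections import Counter

# Tokens for the assumed subset: bracket atoms, two-letter halogens,
# one-letter atoms (upper or aromatic lower case), bonds, branches, ring digits.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|[-=#:]|[()]|\d")


def parse_smiles(smiles):
    """Return (atoms, bonds); each bond is an (i, j, symbol) tuple."""
    atoms, bonds = [], []
    prev = None       # atom that the next atom will be bonded to
    pending = None    # explicit bond symbol waiting for its second atom
    stack = []        # atoms to return to when a branch closes
    open_rings = {}   # ring digit -> (atom index, bond symbol seen at opening)
    pos = 0
    while pos < len(smiles):
        m = TOKEN.match(smiles, pos)
        if m is None:
            raise ValueError(f"unsupported SMILES syntax at position {pos}")
        tok, pos = m.group(), m.end()
        if tok == "(":
            stack.append(prev)            # remember the branch point
        elif tok == ")":
            prev = stack.pop()            # go back to the branch point
        elif tok in "-=#:":
            pending = tok
        elif tok.isdigit():
            if tok in open_rings:         # second use of a digit closes the ring
                other, sym = open_rings.pop(tok)
                bonds.append((other, prev, pending or sym or "-"))
            else:                         # first use opens it
                open_rings[tok] = (prev, pending)
            pending = None
        else:                             # an atom token
            atoms.append(tok)
            idx = len(atoms) - 1
            if prev is not None:
                # Implicit bonds are recorded as "-" (a simplification).
                bonds.append((prev, idx, pending or "-"))
            pending = None
            prev = idx
    if open_rings or stack:
        raise ValueError("unclosed ring or branch")
    return atoms, bonds


# One valid SMILES for ciprofloxacin (an assumption; other spellings exist).
CIPRO = "C1CC1N2C=C(C(=O)C3=CC(=C(C=C32)N4CCNCC4)F)C(=O)O"
atoms, bonds = parse_smiles(CIPRO)
print(Counter(atoms))
print("bonds:", len(bonds), "rings:", len(bonds) - len(atoms) + 1)
```

For that string the sketch should report 24 heavy atoms (17 carbons, 3 nitrogens, 3 oxygens, and a fluorine), 27 bonds, and four rings, which matches ciprofloxacin's formula of C17H18FN3O3 once you add the implied hydrogens. A real parser also has to deal with aromaticity, stereochemistry, charges, isotopes, and two-digit ring numbers, none of which are handled here.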
And as you can imagine, there are a lot of different SMILES strings that are equally possible for any complex molecule, depending on which atom you start from and which way you walk around the structure. Any given software package should be able to come up with the same representation every time you give it the same molecule, but it’s not necessarily the one that another SMILES generator might return. (When you convert it back to a chemical structure, though, you should get the same thing from either.) These text strings can be searched, sorted, stored, and compared for all sorts of purposes, and modern chemoinformatics wouldn’t be possible without something like this.
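Here is a small Python sketch of that "many strings, one molecule" situation, going in the writing direction. It assumes an acyclic molecule (no ring closures) given as a list of atoms and a list of (i, j, bond symbol) tuples, and the function names `write_smiles` and `canonical` are hypothetical. The "canonical" rule used here, sorting the branches and taking the alphabetically smallest string over all starting atoms, is a toy choice of mine and not what any real toolkit does, which is rather the point: a different rule would pick a different, equally valid string.

```python
def write_smiles(atoms, bonds, start):
    """Depth-first SMILES writer for an acyclic molecule, rooted at `start`."""
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, sym in bonds:
        neighbors[i].append((j, sym))
        neighbors[j].append((i, sym))

    def visit(atom, parent):
        # Write each neighbor (except the one we came from) as a substring,
        # omitting the symbol for single bonds as SMILES allows.
        parts = sorted(
            ("" if sym == "-" else sym) + visit(nbr, atom)
            for nbr, sym in neighbors[atom]
            if nbr != parent
        )
        # All but the last substring become parenthesized branches.
        branches = "".join(f"({p})" for p in parts[:-1])
        return atoms[atom] + branches + (parts[-1] if parts else "")

    return visit(start, None)


def canonical(atoms, bonds):
    """Toy canonical form: smallest string over every possible starting atom."""
    return min(write_smiles(atoms, bonds, s) for s in range(len(atoms)))


# Acetic acid heavy atoms: CH3-C(=O)-OH
atoms = ["C", "C", "O", "O"]
bonds = [(0, 1, "-"), (1, 2, "="), (1, 3, "-")]

for start in range(len(atoms)):
    print(start, write_smiles(atoms, bonds, start))
print("toy canonical:", canonical(atoms, bonds))
```

Starting from each of the four heavy atoms gives four different strings, all of which describe acetic acid. Feeding the same molecule in with its atoms numbered in a different order should still give the same toy canonical string, which is the property a real canonical SMILES generator guarantees for its own output (with far more work, since rings and symmetry make the problem much harder).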
There are other “structure to text string” systems out there, naturally. One of the most common, besides SMILES (which is basically everywhere there are chemicals and computers), is InChI, the International Chemical Identifier. One big difference between the two is that InChI aims to give a unique representation for every molecule – there’s only one way to write it for each molecule you put in. That’s also the case for Chemical Abstracts Service (CAS) registry numbers, of course, but CAS numbers, though short, are arbitrary and tell you absolutely nothing about the structure or anything else. InChI is trying to be both unique and comprehensive at the same time, which is a tall order. Attempts have been made to bring InChI and SMILES together as well, at least to make them more freely interconvertible.
With practice, you can sort of eyeball out structures from either SMILES or InChI, or at least differences between two structures. But they aren’t meant for us – they’re meant for machines. And given how we rely on those machines, I’m glad that they work for them.