I think that every synthetic organic chemist should take a look at this paper in Angewandte Chemie. It’s on the application of computer algorithms to planning synthetic routes, which is a subject that’s been worked on for fifty years or more – without, it has to be said, making too much of an impression on working organic chemists. But after reading this paper, from Bartosz Grzybowski and co-workers at the Polish Academy of Sciences, I’m convinced that that is going to change. I don’t know if this particular software, or even this particular computational approach (which I last wrote about here) is going to do it, although they both look very promising. But to a first approximation, those details don’t matter – what seems inescapable to me is that the generation of synthetic routes in organic chemistry is subject to automation, and we are getting very close to seeing this put into use.
Here’s the paper’s summary of the situation:
Overall, we believe that modern computers can finally provide valuable help to practicing organic chemists. While the machines are not yet likely to match the creativity of top-level total-synthesis masters, they can combine an incredible amount of chemical knowledge and can process it in intelligent ways with rapidity never to be matched by humans. In retrosynthetic planning, even inexpensive desktop machines can consider thousands of matching reaction motifs per second and can identify those that would be difficult to discern even by expert chemists—in fact, even desktop computers can be distinctly superior to humans in their capability to recognize complex rearrangement patterns and multicomponent reactions. Of course, it could be argued that one might be able to recognize these motifs using human intuition. But this is like arguing that we could, using paper and pencil, “eventually” divide two ten-digit numbers to the precision of ten decimal places—why do so if we have a pocket calculator available? Our thinking about all synthesis-aiding programs is that they should be regarded precisely as “chemical calculators,” accelerating and facilitating synthetic planning, rapidly offering multiple synthetic options which a human expert can then evaluate and perhaps improve in creative ways.
That’s well put. And the corollary is that for less demanding syntheses, this technique should generate perfectly good routes that will not be subject to much improvement at all. The analogy to a calculator is a good one, although many will object that a mathematical operation has a definite correct answer, while a synthetic route is more a matter of opinion. But as the paper shows, these opinions can be (and are being) taken into account: you can set the software to generate the shortest routes, or the routes using the least expensive reagents or the most well-precedented reactions, or some combination of these – it’s like searching online for air travel, taking into account ticket prices, number of connections, layovers, and so on. The software shown can even take into account less-common preferences, such as using no regulated starting materials or reagents, the likely time needed for each reaction step, stipulating that particular intermediates, solvents, or reaction conditions have to be used or have to be avoided, and so on. (The analogies to GPS-driven map route algorithms will probably be clear by now as well).
There’s still a lot to be done, as the authors show in the third section of the paper. But none of these problems seem to be computationally intractable – far from it. After looking at the computational approaches to things like chess and Go (or driving to Tucson), there seems to be no reason at all why organic synthesis shouldn’t fall into the same general category. Many of the same considerations apply – choosing moves that don’t reveal later vulnerabilities (which can be a computationally intensive process), trading off various factors (convenience, cost, literature precedence), and so on. You can, in chess terms, think of each step in a synthesis as a position, and the possibilities for the next step in the synthesis as the move to be made. It’s easier to do with chess, but that doesn’t mean it’s impossible with chemistry. The tricky part, over the years, has been converting organic chemistry into a computationally tractable form, but that’s well underway. As the paper shows, dealing with the problem in terms of networks and graph theory is particularly promising.
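To make that chess analogy concrete, here’s a toy sketch – mine, not the paper’s actual machinery – of retrosynthesis as search over an AND-OR graph: each molecule either is purchasable or must be disconnected by some reaction template, and every precursor a template demands must itself resolve all the way down. The molecule names and the template table are invented placeholders:

```python
# Toy retrosynthesis as AND-OR search over a reaction graph.
# Molecule names and the TEMPLATES table are invented for illustration;
# a real system encodes many thousands of expert-curated transforms.
TEMPLATES = {
    # product -> list of alternative precursor sets ("OR" branches);
    # every molecule in a chosen set must resolve ("AND" requirement).
    "biaryl": [("aryl_halide", "aryl_boronate")],  # Suzuki-style coupling
    "aryl_boronate": [("aryl_halide",)],           # borylation
}
PURCHASABLE = {"arene", "aryl_halide"}

def retro_search(target, depth=5):
    """Return one route as a list of (product, precursors) steps,
    or None if nothing resolves to purchasable materials."""
    if target in PURCHASABLE:
        return []                      # nothing left to make
    if depth == 0 or target not in TEMPLATES:
        return None                    # dead end
    for precursors in TEMPLATES[target]:
        sub_routes = [retro_search(p, depth - 1) for p in precursors]
        if all(r is not None for r in sub_routes):
            route = [(target, precursors)]
            for r in sub_routes:
                route.extend(r)
            return route
    return None

route = retro_search("biaryl")
# -> [('biaryl', ('aryl_halide', 'aryl_boronate')),
#     ('aryl_boronate', ('aryl_halide',))]
```

The combinatorics, of course, are where the trouble lives: a real template table has thousands of entries, each hedged with steric and electronic caveats, which is exactly the encoding problem the rest of the paper discusses.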
The examples shown, which range all the way up to synthetic routes for Taxol, are quite interesting. The first examples are patchwork, chimeras of various known routes and reactions from the literature (in these cases, on the exact substrates), assembled into something that’s very likely to work. That is exactly how you or I would do it, if we were trying to get to some compound in the quickest and most feasible way possible, only the software searches through the literature much more quickly and thoroughly. The software can be set to display the various sources and their year of publication, so you can see how the route was assembled.
By this point another thought will have occurred to many readers: that this sort of thing may well be fine for Frankensteining a synthesis together from literature precedents, and in fact can probably do that better than any human can, but what can it do with de novo synthesis? What if you feed it molecules that no one has made, in structural classes that haven’t been explored? The paper refers to a 2009 book that I hadn’t seen: Knowledge-Based Expert Systems in Chemistry (Not Counting on Computers), and refers readers there for a good history of attempts at this sort of thing. But the paper itself takes readers through a useful summary of LHASA, SECS, SYNLMA, SYNCHEM, SYNGEN, CHIRON, and other efforts over the years. The paper hypothesizes that the field may have actually been held back by its own enthusiasm and ambitions – many of these efforts were made with what we would now call completely inadequate hardware, which meant that many simplifying assumptions had to be introduced (which naturally limited their applicability). Chess, by contrast, is an easier problem to deal with. As everyone knows, Deep Blue marked the transition where computation defeated the best human player in the world, but it should be remembered that chess programs had been able to give ordinary (or even reasonably good) human players all they could handle for many years before that.
Organic chemistry software, though, has never come close to that standard. The problem isn’t subject to as much algorithmic compressibility as you would like – in the end, a brutal number of chemical transformations just have to be put in individually, without trying to generalize too much. Even trying to introduce shortcuts into the reaction-entering process will lead to trouble:
To do things right, the reactions must be coded by human experts carefully delineating which substituents are or are not allowed, and considering both steric and electronic factors, and more. This expert-based approach is actually not an exception in teaching computers to solve complex problems—indeed, Deep Blue was able to score chess positions because it was “taught” an incredible number, 700 000, of grandmaster games; Mathematica began to do its wonders of symbolic mathematics only after it has been “taught” by humans a certain number of rules, heuristics and algorithms, some of which took years to develop and volumes to describe. . .
Machine extraction of synthetic transformations from the literature works just fine – at first. And then it generates worthless gibberish when you try to apply all that data, because the context of the molecules involved is so important – that protecting group will fall off, that other group will get reduced, too, that stereocenter will racemize, that sulfur will inactivate that catalyst, and on and on. Getting things in at that level of detail has taken years, but once you do it right, you don’t have to do it again, and the software just keeps getting more powerful as you keep adding more detailed knowledge. What you end up with are many thousands of synthetic reactions, with information about their limits, vulnerabilities, and ranges all coded along with them. (They’ve also coded a separate list of structures to avoid – highly strained, basically impossible intermediates that might otherwise be considered plausible by the program).
Now this can be used to propose syntheses of new molecules, and they show an example using a recently identified natural product, epicolactone. It took several hours for the software to help come up with a solution, with an organic chemist going along step-by-step, but it’s actually a pretty impressive solution, and it’s also mechanistically related to a recently published total synthesis from Dirk Trauner’s group at Munich (both take inspiration from the biosynthesis of purpurogallin).
And that takes us to the ultimate goal – yeah, the one where an organic chemist isn’t watching each step, but instead is off doing something else while the software grinds away. (Some readers will interpret this as “The one where there isn’t an organic chemist at all”, but we’ll get to that). This is going to take very careful work on defining the “position” at each step, and on scoring the synthetic possibilities. Brute force is not going to cut it, unfortunately – there are many more possible reactions from a given chemical position than there are possible moves from a chess position, and the whole thing gets out of control very quickly if you just try to grind it out (and even grinding it out requires a way to score and evaluate each possibility, in order for a recommendation to be made). The paper goes into details about the various scoring functions you can use (for the kinds of variables discussed above – cost, length of the route, literature analogs to it, and so on).
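As an illustration of what such a scoring function might look like – the weights and the `Step` fields here are invented, not taken from the paper – the idea is to fold length, reagent cost, and literature precedent into one comparable number per candidate route:

```python
# Sketch of a composite route-scoring function of the general kind the
# paper describes: shorter, cheaper, better-precedented routes score
# better. The weights and the Step fields are invented for illustration.
from dataclasses import dataclass

@dataclass
class Step:
    reagent_cost: float     # dollars, say
    precedent_count: int    # literature examples of this transform

def route_score(steps, w_length=1.0, w_cost=0.01, w_precedent=0.5):
    """Lower is better: penalize length and cost, reward precedent."""
    score = w_length * len(steps)
    for s in steps:
        score += w_cost * s.reagent_cost
        score -= w_precedent * min(s.precedent_count, 10)  # cap the reward
    return score

short_cheap = [Step(50, 8), Step(20, 12)]
long_exotic = [Step(500, 1), Step(300, 0), Step(100, 2)]
assert route_score(short_cheap) < route_score(long_exotic)
```

Tuning those weights is where the air-travel analogy from earlier comes back: different users will legitimately want different trade-offs between the same ingredients.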
. . .we require that the search algorithm be 1) non-local—that is, able to explore not only one synthetic “branch” of synthetic solutions at a time but consider numerous distinct possibilities simultaneously; 2) strategizing—that is, able to perform few individual reaction “moves” that might locally appear sub-optimal but could ultimately lead to a “winning” synthetic solution; 3) self-correcting—that is, able to revert from hopeless branches and to switch to completely different synthetic approaches. In addition, we require that the searches always terminate at either known or commercially available substances (with the threshold molecular weights specified by the user).
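For what it’s worth, those requirements map naturally onto a generic best-first search over a priority queue: many branches stay alive at once (non-local), a branch whose score worsens simply sinks in the queue rather than being pursued (self-correcting), and a locally expensive “move” can still win if its descendants score well (strategizing). A sketch under those assumptions, with placeholder chemistry rather than anything from the paper:

```python
import heapq

def best_first(target, expand, purchasable, max_nodes=10_000):
    """Generic best-first retrosynthetic search skeleton.
    expand(mol) yields (step_cost, precursors) alternatives for one
    molecule; search ends when every open molecule is purchasable."""
    # Each frontier entry: (cumulative score, tiebreaker, open set, route)
    frontier = [(0.0, 0, frozenset([target]), [])]
    counter = 1
    while frontier and counter < max_nodes:
        score, _, open_mols, route = heapq.heappop(frontier)
        unsolved = {m for m in open_mols if m not in purchasable}
        if not unsolved:
            return route          # terminated at purchasable substances
        mol = next(iter(unsolved))
        for step_cost, precursors in expand(mol):
            new_open = (open_mols - {mol}) | set(precursors)
            heapq.heappush(frontier,
                           (score + step_cost, counter, frozenset(new_open),
                            route + [(mol, tuple(precursors))]))
            counter += 1
    return None                   # budget exhausted or no route

# Toy usage with placeholder "chemistry":
templates = {"biaryl": [(1.0, ("aryl_halide", "aryl_boronate"))],
             "aryl_boronate": [(1.0, ("aryl_halide",))]}
route = best_first("biaryl", lambda m: templates.get(m, []),
                   purchasable={"aryl_halide"})
```

The hard part is everything this skeleton hides inside `expand` and the score – which is to say, the chemistry.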
These are stiff requirements, but the paper demonstrates some proposed syntheses of fairly recently-identified (and in some cases unsynthesized) natural products, such as tacamonidine and goniothalesdiol A. This is by far the most impressive thing of its kind that I have ever seen; it’s a real leap past what most people would think of as “organic synthesis software”, assuming that they think of it at all. At the same time, and as the paper strongly emphasizes, this is not a solved problem yet, either. There’s still a lot to be done, both in terms of the chemistry that’s being put into such systems, and with the scoring and evaluating algorithms that drive them. Steric effects (and stereoelectronic effects) in particular need shoring up, as the paper freely admits.
But these now seem like solvable problems, and that’s what I want to emphasize. There seems to be no reason why time, money, and effort cannot continue to make this sort of approach work better and better – and it’s already working better than most people realize. In fact, I’m willing to stipulate that software will, in fact, eventually (how soon?) turn out to provide plausible synthetic routes to most compounds that you give it, with the definition of “most” becoming progressively more stringent as time goes on. So where exactly does that leave us organic chemists?
Well, back in the 1960s, it took an R. B. Woodward (or someone at his level, and there weren’t too damn many of those) to look at some honking big alkaloid and figure out a way that it could be made. There were fewer reactions to choose from, and analytical techniques were (by our standards) quite primitive, so just figuring out that you’d made what you thought you’d made at each step could be a major challenge all by itself. But the science has advanced greatly – we have a lot more bond-forming reactions than we used to, and a lot of elegant ways to set stereochemistry that just weren’t available. And we have far better ways to know just what our reactions have produced, to make sure that we’re on the right track.
One response in the total synthesis community has been to turn to making even larger and more complicated molecules, but that (at least to me) tends to reduce to the “find new chemistry” rationale, and there may be easier or more direct ways to find new chemistry. That, though, looks like the place to be in general. All those transformations that have been painstakingly entered into this software were discovered by organic chemists themselves – and no one knew that there was such a thing as a Friedel-Crafts reaction or a Suzuki coupling until those reactions were discovered. Each new reaction opens up a new part of the world to the techniques of organic chemistry. If it does, in the end, come down to software for reaction planning, then each new reaction that’s discovered will suddenly expand the capability of that program, and it will be able to generate good synthetic routes to things that it couldn’t before (and neither could anyone else).
I think that George Whitesides is right, that organic chemistry in general needs to become more focused on what to make and why, rather than how to make it. “How” is, frankly, becoming a less and less interesting (and rate-limiting) question as the years go on, and the advent of software like this is only going to speed that process up. On one level, that’s kind of a shame, because “how” used to be a place that a person could spend a whole interesting and useful career. But it’s not going to be that way in the future. It may not be that way even now.
This isn’t the first time this has happened. NMR was too much for some of the older generation of chemists who’d proven structures by degradation analysis. And a colleague of mine mentioned to me this morning that when computer-driven searching of the Chemical Abstracts database became more widely available during the 1980s, his PhD advisor looked at his years of index cards, turned to him and said “This is the beginning of the end. Now any bonehead with a mouse can call himself a synthetic chemist”.
So the same debates over automated synthesis are sure to start up around this kind of software, too. But that work, too, forces us to work on the harder problems and the harder “what” and “why” questions. I’ll finish up with a thought – how about taking this new software (named Syntaurus, by the way) and asking it for routes that prioritize the starting MIDA boronates and couplings that the Burke synthesis machine is so good at working with? Close the loop. Someone’s going to do it, you know – someone’s probably doing it now. Best to start thinking about how we’ll deal with it.
Update: Wavefunction has thoughts on this paper here.