I wrote a little while back about a brute-force approach to finding metal-catalyzed coupling conditions. These reactions have a lot of variables in them and can be notoriously finicky about what combination of these will actually give decent amounts of product. At the same time, it appears that almost any given metal-catalyzed coupling reaction is capable of being optimized, if you care enough. So this is a good field for both miniaturized reaction searching (as in the link above) and for machine learning (as in this new paper).
It’s going after what’s often an even more challenging case, the C-N Buchwald-Hartwig coupling. That one gets used a lot in medicinal chemistry (we like nitrogen atoms), but it’s also a well-known beast when it comes to variation in yield. You can honestly never be quite sure that it’s going to work well when you set it up on a new system for the first time, because there are a lot of conditions (solvent, catalyst, additives) to pick from, and often very little guidance about which region of reaction space is going to be most fruitful. This work, a Princeton/Merck collaboration, is an attempt to calculate a way out of the woods.
It’s important to note that the two approaches mentioned in the first paragraph are not an either/or case. Far from it. If you’re going to do machine learning, you’re going to need a lot of reliable data (positive and negative) to feed into the model, and how better to generate that than in an automated, high-throughput setup? That’s especially true as you start adding in more reaction variables: every one of those can increase the complexity of the machine learning model in an exponential fashion, and stuff gets out of control quickly under those conditions. So the question is, can ML provide something useful here, and can it outperform the simpler-to-implement regression models?
You’d definitely want robotic help up front: the data set was the yields from 4,608 coupling reactions (15 aryl or heteroaryl halides, 4 ligands, 3 bases, and 23 isoxazole additives). About 30% of the reactions gave no product at all – valuable fodder for the model – and the rest were spread in a wide range from “pretty darn good” to “truly crappy”. Just what you need. At this point, you need to figure out just what you’re going to tell the system about these reactions (a nontrivial decision). Too few (or wrongly chosen) parameters, and you’ll never get a useful model. Too many, and you risk an overfitted model that looks good from a distance but is both hard to implement and still not robust. In this case, the Spartan program was used to provide atomic, molecular, and vibrational descriptors of the substrate(s), catalysts, bases, and additives. So what went in were electrostatics at various atoms, vibrational modes (frequencies and intensities), dipole moments, electronegativity, surface areas, and so on.
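To make that featurization step concrete, here’s a hypothetical sketch of how per-component descriptors might be flattened into one feature row per reaction. The component names, descriptor names, and values below are illustrative placeholders of my own, not the actual Spartan outputs from the paper:

```python
# Hypothetical sketch of the featurization step: flatten the computed
# descriptors for each reaction component into a single feature row.
# All names and numbers here are placeholders for illustration only.
import pandas as pd

halide_desc   = {"4-chloroanisole":   {"halide_C1_charge": -0.12, "halide_dipole": 2.1}}
ligand_desc   = {"XPhos":             {"ligand_P_charge": 0.35, "ligand_surface_area": 410.0}}
base_desc     = {"P2Et":              {"base_N_charge": -0.55, "base_dipole": 3.4}}
additive_desc = {"3-methylisoxazole": {"additive_LUMO": -0.9, "additive_C3_nmr_shift": 150.2}}

def featurize(halide, ligand, base, additive, observed_yield):
    """Concatenate the per-component descriptors into one flat row."""
    row = {}
    for table, key in ((halide_desc, halide), (ligand_desc, ligand),
                       (base_desc, base), (additive_desc, additive)):
        row.update(table[key])
    row["yield"] = observed_yield
    return row

df = pd.DataFrame([featurize("4-chloroanisole", "XPhos", "P2Et",
                             "3-methylisoxazole", 37.5)])
print(df)
```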
The team then took 70% of the data as a training set to see if they could predict the outcomes in the other 30%, and ran the data through a whole set of possibilities: linear and polynomial regression for starters, then k-nearest neighbors, Bayes generalized linear models, layered neural networks, random-forest techniques, etc. Only when they got to those last two did they start to see real predictive value, and random forest seems to have been the clear winner. Its predictions, even when trained on only 5% of the data, were better than linear-regression models tuned up on the whole 70%. But even so, there were limits. “Activity cliffs” are always a problem in these reactions, and those are hard to pick up (even with 4,608 reactions to train on!). The wider the range of chemical/reaction space you’re trying to cover with your model, the better the chance you’re going to miss these things (and the better the chance you’re going to end up with an overfitted model).
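For a concrete picture of the model comparison itself, here’s a minimal sketch (not the authors’ actual code) of a 70/30 split scored with linear regression versus a random forest in scikit-learn, using a stand-in random matrix in place of the real descriptors and yields:

```python
# Minimal sketch of the 70/30 model comparison described above, with
# synthetic stand-in data so it runs on its own. Swap in the real
# descriptor matrix and measured yields to reproduce the actual exercise.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(4608, 120))                    # ~120 computed descriptors per reaction (assumed)
y = np.clip(50 + 20 * X[:, 0] + 10 * X[:, 1] ** 2   # toy nonlinear "yield" relationship
            + rng.normal(scale=10, size=4608), 0, 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("linear regression", LinearRegression()),
                    ("random forest", RandomForestRegressor(n_estimators=300, random_state=0))]:
    model.fit(X_train, y_train)
    print(f"{name}: R^2 on the held-out 30% = {r2_score(y_test, model.predict(X_test)):.2f}")
```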
One way out of that is to restrict your question to a narrower region of potential reactions (and thus get more fine detail into the model). But that, naturally, makes it less useful: after all, the dream is a Buchwald-Hartwig Box, where you walk up, draw the structures of your two reactants (no matter what they might be), and it pauses for a moment and spits out a set of good reaction conditions. That’s…still a little ways off. But I do have to say that this current effort beats both human intuition (insofar as anyone has any about these reactions) and other attempts to calculate likely starting points. So it is a real improvement, and the finding that random-forest techniques outperformed everything else is worth building on, too.
The group did try putting in some additives that were not in the training set at all, to see how the model would handle them. And the answer is “not bad”: the root-mean-square error between predicted and observed yields for the original test set was about 8%, while the RMSE for the new-additive set was about 11%. So it was not quite as good, but on the right track. (By comparison, the single-layer neural network model had about a 10% error on the original test set, while the other methods came in at 15 to 16%.)
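Here’s a rough sketch (with made-up data and column names) of that kind of “unseen additive” check: hold out every reaction that uses certain additives, train on everything else, and look at the RMSE on the held-out block:

```python
# Sketch of an out-of-sample-additive evaluation: every reaction using the
# held-out additives is excluded from training, then predicted. The data
# and column names here are invented for illustration.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "additive": rng.choice([f"isoxazole_{i}" for i in range(23)], size=n),
    "f1": rng.normal(size=n),   # stand-ins for the computed descriptors
    "f2": rng.normal(size=n),
})
df["yield"] = np.clip(40 + 15 * df["f1"] + rng.normal(scale=8, size=n), 0, 100)

held_out = {"isoxazole_0", "isoxazole_1", "isoxazole_2"}
train = df[~df["additive"].isin(held_out)]
test = df[df["additive"].isin(held_out)]

model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(train[["f1", "f2"]], train["yield"])
rmse = np.sqrt(mean_squared_error(test["yield"], model.predict(test[["f1", "f2"]])))
print(f"RMSE on reactions with unseen additives: {rmse:.1f}% yield")
```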
The results suggested a look at the final random-forest procedure to see if there was anything that could be learned about the mechanism (a tall order, from what I know about the field). Some of the most important descriptors were things like the electrostatic charges on the isoxazole additive’s atoms, its calculated LUMO energy, and its carbon NMR shifts. That suggests that the additive’s electrophilic behavior was important, but those parameters by themselves weren’t enough to produce any kind of useful model. Still, a check of isoxazoles from both ends of the scale suggested that the more electrophilic ones were capable of undergoing a side reaction with the Pd catalyst that reduced yields, which may well be what the model was picking up on.
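That kind of readout comes almost for free from a fitted random forest, which ranks its input descriptors by how much they drive the tree splits. Here’s an illustrative sketch with invented feature names and a toy dependence on the additive’s LUMO, not the paper’s actual analysis:

```python
# Illustrative sketch of pulling descriptor importances out of a fitted
# random forest. Feature names and the toy yield relationship are invented.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

feature_names = ["additive_C3_charge", "additive_LUMO", "additive_C3_nmr_shift",
                 "halide_dipole", "ligand_surface_area"]
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(500, len(feature_names))), columns=feature_names)
y = 30 + 20 * X["additive_LUMO"] + rng.normal(scale=5, size=500)   # toy: yield tracks the LUMO

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
ranking = pd.Series(model.feature_importances_, index=feature_names).sort_values(ascending=False)
print(ranking)   # the LUMO descriptor should dominate in this toy setup
```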
Overall, I’d say that this paper does show (as have some others) that machine-learning techniques are going to help us out eventually in predicting reaction conditions, even if that day has not quite arrived. When you get down to it, the parameters going into this model are not particularly complicated, so it may well be a good sign that it’s worked as well as it has. You’d have to think that larger data sets and new inputs are only going to make these things perform better (after plenty of human effort and care, of course), and that’s leaving aside the general possibility of improved algorithms. I think we’ll get there, but we’re not there now.