Here’s another step along the way to automated synthesis, in a new paper from MIT. The eventual hope is to unite the software and the hardware in this area, both of which are developing these days, and come up with a system that can produce new compounds with a minimum of human intervention. Let’s stipulate up front that we’re not there yet; this paper is both a very interesting look at the state of the art and a reminder that the state of the art isn’t up to that goal yet.
The software end of things involves (in the ideal case) being able to come up with plausible synthetic routes to the desired molecules, with “plausible” being not only in the abstract but fitted to the abilities of the hardware synthesizer itself. And since that synthesizer is very likely going to partake of a lot of flow chemistry, you’ll have a lot of thinking to do about concentrations, flow rates, non-clogging conditions, and so on. I should mention that an even more ideal system would be able to come up with its own ideas about what to synthesize, but doing that from a standing start is even further off. I think what we’ll see before that (and people have already been working on this as well) is a system that can suggest reasonable rounds of analogs given the assay data from a previous round of simple analogs, and then can turn its attention to how to synthesize them.
In this case, the flow of operations looks like this: (1) Select a synthetic target, (2) search the literature for the compound, (3) retrosynthetic analysis, (4) select reaction conditions, (5) estimate feasibility, (6) formulate a recipe for the hardware to follow, (7) configure the platform for that recipe, (8) test run of the process, (9) scaleup of the synthesis in flow, and there’s your product. But as the paper notes, there are still several of these stages that need human input, as you can certainly imagine if you’ve ever done organic synthesis, done any flow chemistry, or worked with automation of any kind at all. One of the good things about this work, actually, is the job it does highlighting just the areas that turned out to need the most human help as they got this system into shape.
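If you like, that nine-stage flow can be sketched as a (very schematic) bit of code. To be clear, every name below is mine, not the paper's; the point is just to show the shape of the pipeline and which stages the authors report as human, automated, or mixed:

```python
# Hypothetical sketch of the nine-stage workflow described above.
# The stage names and data structure are invented for illustration;
# the real system (ASKCOS plus the flow platform) is far more involved.
from dataclasses import dataclass, field

@dataclass
class SynthesisJob:
    target: str                                   # desired compound (e.g. SMILES)
    recipe: dict = field(default_factory=dict)    # the "chemical recipe file"

STAGES = [
    "select_target",         # (1) human
    "search_literature",     # (2) automated
    "retrosynthesis",        # (3) mixed: software proposes, humans tune
    "select_conditions",     # (4) mixed
    "estimate_feasibility",  # (5) mixed
    "formulate_recipe",      # (6) mostly software, human reality checks
    "configure_platform",    # (7) robot arm assembles the modules
    "test_run",              # (8) small-scale trial of the process
    "scale_up",              # (9) production run in flow
]

def run_pipeline(job: SynthesisJob) -> SynthesisJob:
    # In practice each stage can fail and bounce the job back to a
    # human; here we just record that every stage was visited in order.
    for stage in STAGES:
        job.recipe.setdefault("log", []).append(stage)
    return job

job = run_pipeline(SynthesisJob(target="CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
print(len(job.recipe["log"]))  # 9
```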
The MIT group has been working on their own retrosynthesis software (ASKCOS), and they give some details of it here. The system was trained on millions of reactions abstracted from both Reaxys and the USPTO database, so it’s seen plenty of organic chemistry. For example, out of the 12.5 million or so single-step reactions to be found in Reaxys, the system is set to pay attention only to those with ten or more examples, which knocks things down to about 160,000 “rules” for valid single-step transformations. They then trained a neural network to try to predict which of these rules would be most applicable to a given new target structure – this step was put in to decrease the computational load in the next step and also to try to increase that step’s success rate.
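That frequency cutoff is conceptually simple: count how often each extracted transformation template shows up in the corpus and discard the rare ones. A toy version (the real extraction reduces those ~12.5 million Reaxys reactions to roughly 160,000 rules) might look like this:

```python
# Toy illustration of the template-frequency cutoff described above:
# keep only transformation "rules" seen ten or more times in the corpus.
from collections import Counter

def filter_rules(rule_per_reaction, min_examples=10):
    counts = Counter(rule_per_reaction)
    return {rule for rule, n in counts.items() if n >= min_examples}

# Hypothetical corpus: rule "A" has 12 literature examples, "B" only 3,
# so only "A" survives as a valid single-step transformation.
corpus = ["A"] * 12 + ["B"] * 3
print(sorted(filter_rules(corpus)))  # ['A']
```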
Each proposed retrosynthesis step first gets a binary filter applied to it: are there any conditions that the program knows about that could generate the desired product from the stated reactants? Getting rid of stuff at that stage saves a lot of pointless calculation. Then if the answer is “yes”, the program turns its attention to the hits, and the proposed reaction sequences are evaluated by a more computationally intensive forward-prediction model, trained on the most synthetically plausible conditions for each transformation. If that result matches up, the route is considered believable.
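The logic of that two-stage screen, stripped of all the actual chemistry, is a cheap filter in front of an expensive check. Both "models" below are trivial stand-ins of my own invention, but the control flow is the point: the costly forward pass only runs on proposals that survive the yes/no filter, and a route is kept only when the forward prediction recovers the intended product.

```python
# Sketch of the two-stage screen: a fast binary filter discards
# retrosynthetic proposals with no known enabling conditions, and only
# the survivors go to the expensive forward-prediction step, which must
# predict the intended product back from the proposed reactants.
def has_known_conditions(step):
    # Cheap yes/no filter: any conditions on record for this step?
    return step["conditions"] is not None

def forward_predict(step):
    # Stand-in for the computationally intensive neural model;
    # here it just returns a recorded outcome.
    return step["predicted_product"]

def plausible_steps(proposals, intended_products):
    kept = []
    for step, intended in zip(proposals, intended_products):
        if not has_known_conditions(step):
            continue                      # skip the costly forward pass
        if forward_predict(step) == intended:
            kept.append(step)             # route considered believable
    return kept

proposals = [
    {"conditions": "Ac2O, acid cat.", "predicted_product": "aspirin"},
    {"conditions": None,              "predicted_product": "aspirin"},
]
print(len(plausible_steps(proposals, ["aspirin", "aspirin"])))  # 1
```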
The hardware is a later version of the system I wrote about here, a deliberately modular plug-and-play setup. This leaves the individual modules themselves open to upgrading on their own without affecting the rest of the system, and cuts down on the number of valves and connections in any given overall plan – and as anyone who’s set up an HPLC, LC/MS, or flow chemistry apparatus will be able to tell you, mo’ connections = mo’ problems. Every new fitting is an opportunity for something to fail later on. There is, as before, a manipulator arm in the middle of the thing that can reach over and plug the individual modules into a common rack to assemble the sequence needed for the synthetic scheme (heated loop flows into phase separator flows into solid-supported reagent bed flows into. . .). There is a lot of engineering involved in getting this to work, a forest of little details that have to be addressed. Just to pick one example, the tubing involved is all tensioned via spring-loaded reels to cut down on looping and tangling – you can easily imagine what might happen otherwise as the robot arm merrily assembles something that looks like a colander full of angel hair pasta when you walk around and look at it from behind. The system generates a “chemical recipe file” for a given synthetic path; this CRF has the mapping of the physical reaction setup and the operations needed to execute the synthesis. This includes locations of stock solutions, assembly of the modules, solvents, flow rates, temperatures, and all the rest of it.
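The paper describes what a CRF contains but (to my reading) not its exact format, so here's one plausible shape for such a file, with the caveat that the schema and every value in it are invented for illustration; only the categories of information (stock solution locations, module order, flow rates, temperatures) come from the paper:

```python
# Hypothetical "chemical recipe file" (CRF) layout. The schema and all
# values here are illustrative, not the paper's actual format; only the
# kinds of information stored match the description in the text.
crf = {
    "target": "lidocaine",
    "stock_solutions": {
        "bay_1": "2,6-dimethylaniline stock",
        "bay_2": "chloroacetyl chloride stock",
    },
    # Order in which the robot arm plugs modules into the rack:
    "module_sequence": ["heated_loop", "phase_separator",
                        "packed_reagent_bed"],
    # Per-step process parameters the platform executes:
    "steps": [
        {"module": "heated_loop", "temp_C": 120,
         "flow_rate_mL_min": 0.5, "residence_time_min": 10},
    ],
}

print(crf["module_sequence"][0])  # heated_loop
```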
The examples given in the paper are part of the “drugs on demand” work that the Jamison lab has been doing for several years now (which I wrote about here). I remain somewhat skeptical of the stated overall DARPA goal of this work, as that post shows, but it’s an excellent proving ground for automated synthesis (which to be fair, is also one of the goals). Links below added by me:
We therefore chose a suite of 15 medicinally relevant small molecules, which ultimately required eight particular retrosynthetic routes and nine specific process configurations. Although literature precedents exist for all 15 targets, the synthesis-planning program is prevented from merely recalling any synthetic route from memory as exact matches; all pathways are required to be discovered de novo through abstracted transformation rules and learned patterns of chemical reactivity. . .In order of increasing complexity, we investigated the synthesis of aspirin and racemic secnidazole run back to back; lidocaine and diazepam run back to back to use a common feedstock; and (S)-warfarin and safinamide to demonstrate the planning program’s stereochemical awareness. . .
They also include (in the SI) a synthesis for bezafibrate that had to be abandoned due to poor flow chemistry performance, which is a detail that I very much appreciate. The paper also includes two small library syntheses around ACE inhibitor and COX-2 inhibitor scaffolds. This is a good time, though, to go back over that list of steps above and note which ones needed (in one case or another) human evaluation. The answer is, most of them. Step one, selecting the target, is of course a completely human operation. Step two, searching the literature, is done automatically. Steps 3, 4, and 5 (retrosynthesis, selection of conditions, and evaluation of feasibility) are a mixture. Human input is available (and desirable) for each of them – users can and should set various thresholds and biases depending on how the runs are going. But I invite my fellow humans not to get too cocky, considering that all of these steps were until very recently considered our exclusive domain. The same goes for step 6 (formulate a CRF) and step 7 (configure the apparatus). The software does most of this itself, but human “reality checks” are needed here as well. As for the other physical steps 8 and 9, actually running the chemistry, the machine obviously does this, but I strongly suspect that there are people standing around watching it while it does, at least at first.
The details of each synthesis are quite interesting, as the software picks out certain reactions and conditions. Even simple steps have a lot of decisions: making aspirin is dead easy, but do you do the acetylation with acetyl chloride or with acetic anhydride? Any added acid catalyst? What solvent? What concentration? What temperature? If you run it neat, you have an increased risk of clogging, for example. Different acidic conditions might be more or less compatible with the downstream apparatus. Some solvents that drive the reaction more quickly to completion could be harder to remove at the end. Every single step in a synthesis, as practitioners will appreciate, involves a list of such decisions. Watching the software deal with them is startlingly like the experience of teaching a teenager to drive: you realize how many small details you take for granted that you have to call attention to. And the flow chemistry aspect introduces new complexities at the same time that it enables the whole idea to work at all – for example, you’re probably going to want to use a soluble liquid base like triethylamine or DBU rather than potassium carbonate, which would have to come in as an aqueous solution in flow and would generate gas bubbles even then. Which base, though, will produce a hydrochloride that’s less likely to clog the system?
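From the software's point of view, these decisions amount to constraint filtering: score each candidate reagent against the flow platform's requirements and keep the survivors. Here's a minimal sketch of that idea for the base-selection question; the property values assigned below are made up purely to show the mechanism, and are emphatically not a recommendation about which base actually clogs less:

```python
# Minimal sketch of condition selection as constraint filtering.
# The property values here are invented for illustration only; the
# real answer to "which base clogs less?" is exactly the kind of
# empirical knowledge the text says we still lack.
BASES = {
    "triethylamine":       {"soluble_liquid": True,  "salt_clog_risk": "high"},
    "DBU":                 {"soluble_liquid": True,  "salt_clog_risk": "low"},
    "potassium carbonate": {"soluble_liquid": False, "salt_clog_risk": "n/a"},
}

def flow_compatible(props):
    # Want a base that stays in solution in organic solvent and whose
    # hydrochloride salt is unlikely to clog the lines downstream.
    return props["soluble_liquid"] and props["salt_clog_risk"] == "low"

print([base for base, props in BASES.items() if flow_compatible(props)])
```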
That brings up an important point that the paper highlights: we actually don’t know enough chemistry to predict how these things will work, even for rather simple reactions. There is a lot of empirical experimentation every time you set up such a system, for reasons like this:
Approximate conditions for batch synthesis can be generated based on the literature, as we have done in this study, but their direct implementation in flow is challenging. The desire for process intensification (e.g., to decrease reaction times), the need to mitigate solids formation to avoid clogging, and the importance of telescoping multiple unit operations requires deviation from batch conditions and a level of confidence of predictions that flow chemistry has not yet achieved. Computational prediction of solubilities to within even a factor of 2 in nonaqueous solvents and at nonambient temperatures remains elusive. Predicting suitable purification procedures is a general challenge, not just for flow chemistry, particularly when using nonchromatographic methods.
And very much so on! No, these are not solved problems, and you should hold on to your wallet in the presence of anyone who tries to tell you that they are. But at the same time, there is no reason for them not to be solvable. We just need more information and to get better at what we’re doing. Over time, as the paper notes, we will assemble (we already are) a great deal more knowledge about flow chemistry and reaction prediction, and systems such as these will gradually become more and more capable. You can read a paper like this two ways: you can look at the limitations and what remains to be done, or you can look at what’s already been accomplished and how much of that you might have once thought was restricted to human effort. My advice? Don’t neglect either perspective, because they’re both valid.