Let’s face some hard facts first. The good news is that in the last twenty years or so, computer-aided drug design (CADD, or computational chemistry) has become a relatively standard part of drug discovery, and even in organizations not formally employing CADD scientists, some form of computation – sometimes as simple as property calculation or visualization – is used. The bad news is that, as it is largely practiced now, CADD is not considered a core part of the drug discovery process. Period. Rather, it is considered a supporting part. This situation has only marginally improved in twenty years. Biology and synthetic chemistry are still the core driving disciplines of drug discovery and will remain so for the foreseeable future. If there is a marketed drug that benefits extravagantly from CADD, it comes along once every decade or so at best (good luck finding another HIV protease, for instance). As integral as CADD scientists consider themselves and as impressive as they think the rotating pictures on their screens are, the fact that their peers consider what they do only marginally relevant to the big picture is a bitter pill that needs to be swallowed.
Why is CADD not considered a core part of drug discovery, even though it is now part of every organization’s drug discovery portfolio in one form or another? There are two reasons, one related to fact and the other related to perception. The fact is that drug discovery and biology are still very much experimental disciplines, and you simply can’t compute your way to a drug; in this sense, CADD simply is not good enough yet. The perception of CADD is that unlike synthetic chemistry and biology, which generate products (compounds in bottles, assays, animal models etc.), CADD generates ideas. In that sense, experimental scientists often have the same view of CADD scientists that Ernest Rutherford had of theorists: "They play games with their symbols while we discover the mysteries of the universe."
Now one of the advantages of CADD is that the ideas can be useful because of their generality (e.g., “Put a hydrophobic group here.”), but it’s this very generality that often counts against CADD. Synthetic chemists in particular want to see specific suggestions, and while plenty of them appreciate general guidance, it doesn’t quite have the same impact. The fact that CADD scientists also often make suggestions that are considered obvious – filling hydrophobic pockets is a standard suggestion – does even less to endear the discipline to other scientists.
This rather undervalued reputation of CADD impacts both the discipline and the scientists. The market for CADD scientists is usually a niche market, with low demand and low supply, and typically a small organization will have only one or two CADD scientists. But the supporting role of CADD means that when there are layoffs, a CADD scientist is far more likely to go than a cell biologist or assay developer. Because CADD scientists are not abundant, they also may not have direct reports. All this means that CADD scientists may not have the ear of senior management. They thus rarely occupy senior managerial positions like CSO, VP of drug discovery or CTO; if you take a random sample of top management at both small and large drug discovery organizations, you will very rarely find scientists with a background in CADD in these positions. The supporting role of CADD, combined with the lack of core influence and understanding that CADD scientists have, means that they often have to fight uphill battles to make their voices heard. While this is a good character-building experience, it doesn’t necessarily contribute to career progression or to a growing influence for the field.
So how can CADD make a bigger impact both on the facts of drug discovery and on the perception? At least part of the answer – as unseemly as it may seem to a lot of drug hunters today – does involve large datasets and machine learning.
I am not saying that ML or some form of AI will have the kind of immediate, hype-heavy, transformational impact that often seems all-too-apparent through Silicon Valley sunglasses. But I am saying that the impact of ML and AI on drug discovery will inexorably increase as time passes, and anyone who simply chooses to dismiss these technologies will be left behind. There are some important reasons why I believe this.
The most important reason perhaps is that traditional CADD, as it pertains to protein structure and physics-based algorithms, has fallen far short of its promise. Part of the problem was the hype in the late 80s and early 90s, but a more realistic problem is that the data is often not good enough, and even when it’s good it can impact only a very limited part of early drug discovery. The biggest problem is that, even thirty years after a leading CADD scientist stated the challenges inherent in using CADD to predict binding affinity (protein flexibility, water behavior, crowding effects), we are still struggling with the same issues. Ignorance of the basic physics of protein-ligand interactions, especially as it pertains to making reliable, general predictions, is still rife.
I am not saying that physics-based approaches haven’t worked or improved at all over the last two decades. Some applications, like shape-based similarity searches and pose prediction, have certainly shown impressive progress. But the fact remains that we are still swimming in a sea of ignorance, either ignorance born of gaps in basic scientific understanding, or ignorance born of the difficulty of engineering our current understanding into viable approaches. Equally important, even when we understand these factors, they will be applicable to a very narrow part of drug development, namely improving binding affinity between a single small molecule and (usually) a single protein. At best CADD as we know it will design good ligands, not drugs.
Approaches based on large datasets, in contrast, are agnostic about the physics of protein-ligand binding. In principle, they can take a bunch of well-curated data points about interaction energies and predict what a new interaction will be like without ever explicitly modeling a single hydrogen bond. They don’t understand causation but can provide actionable correlations.
But the second and truly novel advantage of machine learning approaches, in my opinion, is that, unlike traditional CADD, which applies to a very narrow part of drug discovery even on its best day, ML can in theory encompass multiple aspects of drug discovery by simply adding more columns to the Excel file. Thus, if one has multiple kinds of data – in vitro and in vivo binding affinity, mutations, metabolic data, solubility, clearance etc. – traditional CADD can pretty much take advantage of only one or two kinds of data, but ML can consider all the datasets at once. Thus, machine learning provides pretty much the only way to solve the multiparameter optimization problem that is at the heart of drug discovery.
Now, as is well known, the real problem with data is not the physics but the quantity, quality and curation of the datasets (including negative data). But these are without a doubt getting better and more integrated across various phases of drug discovery every day. Whether it’s structures in the PDB, synthetic reactions for making drugs or patient data in The Cancer Genome Atlas (TCGA), the improvement of data quantity and quality is inexorable. And as this quantity and quality improve, so will the utility of this data for machine learning. This is a certainty. The naysayers therefore are fighting against a current which is only going to get stronger, even if it doesn’t quite turn into the tsunami that its practitioners say will engulf us very soon.
Are we there yet? The simple answer is no.
And we probably won’t be in terms of measurable impact for some time, especially since we have to hack our way through so much hype to get to the nugget of causal or even correlative truth. The domain of applicability of even impressive techniques like deep learning is still limited (for instance, it works best on images), and the data problem is still a big one. But what I am saying is that unless CADD embraces machine learning and AI – irrespective of the hype – CADD will always occupy a marginal perch in the grand stadium of drug development, and CADD scientists will increasingly be cast to the sidelines, both in terms of career and discipline progress. This will especially be true as both protein and drug classes expand into space (unstructured proteins, macrocycles etc.) that is even more challenging for traditional CADD than usual.
Sadly, as with many things these days, the mention of AI or ML in CADD continues to evoke two extreme reactions. The first, which is more common, is to say that it’s going to solve everything and is the Next Big Thing. The second, which is also common, says that it’s all hype and good old thinking will always be better. But good old thinking will always be around. The data is getting better and the machine learning is getting better; these are pretty much certainties which we ignore at our own peril. The bottom line is that unless CADD starts including machine learning, and the related paraphernalia of techniques that either accurately or misleadingly fall under the rubric of “AI”, in its very definition, its role in drug discovery and development threatens to dwindle to a point of vanishing relevance. This will do an enormous disservice both to CADD and to drug discovery in general.
Every year when the Nobel Prize in chemistry is announced, there’s some bellyaching when the work leading to the prize turns out to be as much related to biology or physics as to chemistry. "But is it chemistry?" is the collective outcry. However, a more refined understanding of the matter over the years leads us to realize that chemistry has absorbed biology and physics as much as they have absorbed chemistry. Chemistry’s definition has been changing ever since it emerged from the shroud of alchemy. Chemical biology, chemical physics and chemical engineering are as much chemistry as they are biology, physics and engineering. The same thing needs to happen to CADD, in my opinion. By embracing computer science or automation or AI, by bringing ML engineers or data scientists into its midst, CADD is not losing its identity. It’s showing that its identity is much bigger than anyone thought. That it is vast and contains multitudes.
Ernest Rutherford, with a physics background, won the Nobel Prize in chemistry. Dan Shechtman, a metallurgist, won the Nobel in chemistry. V. Ramakrishnan, a physicist, won the Nobel in chemistry. This year it was F. Arnold, with an aerospace and mechanical engineering background and then a PhD in biophysical chemistry. Has anyone with an undergraduate chemistry background won a Nobel in physics?
This is somewhat true even in fields where CAD has a proven track record, i.e., semiconductors. The issues are similar. The predictive power is only with respect to the action of dopants and crystal structure, and everything else is fairly obvious generalities. As a colleague of mine used to joke, TCAD is excellent at solving yesterday's problem. I don't think AI/ML is really CAD. And I don't think the size of the training sets will reach critical mass in the next 10 years. Commercial data will likely remain siloed, which hinders NN training. That doesn't mean that people should not be trying. NNs were a backwater field until 2012, and now registration for NIPS sells out in 10 minutes and people use bots to nab slots.
Hi Ash, here are some thoughts on the subject.
I don’t think that it’s useful to compare CADD with synthetic chemistry. I would draw a distinction between medicinal chemistry (deciding what gets made and/or tested) and synthetic chemistry (making what needs to be made) even when the same person performs both roles. CADD is ultimately part of medicinal chemistry and one mistake that CADD scientists sometimes make is to treat medicinal chemists as (purely) synthetic chemists. Towards the end of my time at AZ, we introduced design teams which included medicinal chemists, synthetic chemists, CADD scientists, physical chemists and DMPK scientists.
In my experience, CADD methods tend to have more impact in lead identification than in lead optimization. Lead optimization often focuses on solving specific problems and a CADD scientist with a strong background in physical chemistry can often provide useful input. However, this problem solving is not the same as trying to suggest a development candidate. It’s also worth pointing out that drug design is incremental and it is generally easier to predict the effects of relatively small (i.e. within structural series) changes than it is to predict activity or properties directly from molecular structure. The difficulty for QSAR modelers is that QSAR is a data-hungry way to make predictions. My view is that a big part of drug design is assembling the data needed by the project team as efficiently as possible and this is more Design of Experiments than drug design.
Another problem that CADD scientists encounter is that CADD capabilities have traditionally been oversold. These days, it seems that any regression or classification model can be described as ML and can therefore be touted as AI. A lot of Kool-Aid is getting drunk.
Agree that machine learning has been oversold... however, I also agree with the post author that it is a powerful technique as long as we recognize its limitations and restrict it to its proper domain of applicability, and it is -- HAS to be -- part of the future of CADD. I think that just as overhyping ML is dangerous, so too is the kind of skepticism you encounter among (some) medicinal chemists who are skeptical at least partly because they don't know what it is. I have worked with synthetic/medicinal chemists who were deeply skeptical of all this "machine-learning stuff", who usually had only a very vague notion of what it was, but would at the same time make arguments about which compounds to make based on imagined correlations from a simple plot of logP vs activity that, to be generous, had an R^2 of perhaps 0.2 or 0.3 at the most. There is danger in overhyping a useful tool; there is also a danger in rejecting it because I'm Not Sure What It Is And Anyway It's Not How We Do Things Around Here.
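To put the R^2 figure in the comment above in perspective, here is a minimal sketch of the arithmetic. For a one-variable least-squares fit, the residual standard deviation is sqrt(1 - R^2) times the original standard deviation of the activity values, so an R^2 of 0.2 to 0.3 leaves roughly 84 to 89 percent of the spread unexplained. The function names and the logP and pIC50 numbers are hypothetical, invented purely for illustration, and do not come from any real project.

```python
import math


def r_squared(x, y):
    """Squared Pearson correlation, equal to R^2 of a one-variable linear fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def residual_sd_fraction(r2):
    """Fraction of the original spread in y that remains after the fit."""
    return math.sqrt(1.0 - r2)


# Hypothetical values, made up for illustration only.
logp = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
pic50 = [6.0, 5.0, 7.5, 5.5, 7.0, 6.5]

r2 = r_squared(logp, pic50)
print(f"R^2 of the toy data: {r2:.2f}")
print(f"Spread left unexplained: {residual_sd_fraction(r2):.0%}")

# The range quoted in the comment.
for value in (0.2, 0.3):
    print(f"R^2 = {value}: {residual_sd_fraction(value):.0%} of the spread remains")
```

Under these assumptions, a trend line with an R^2 in that range narrows the uncertainty about the next compound's activity by only a little over a tenth, which is thin support for deciding what to synthesize.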
Interesting article, thanks for posting. Feels overly negative to me if I'm being honest. One thing to highlight is that there are a number of recent oncology agents where comp chem has been key to their discovery. Similar to Pete's comment, maybe I've just been fortunate to work at places where comp chem was an embedded and valued part of a discovery team. We just need to be open and honest about what comp chem can and can't do. Out of the chemistry specialisms, comp chem seems to have more employment opportunities than most. Finally, I think we need to be careful not to oversell AI/ML or we'll go back to the days of hype and overpromising. Great discussion though!