

The Ten Commandments of Molecular (and other) Modeling


Thou shalt not extrapolate too much beyond the training data.

Thou shalt prioritize simple experiments over complex models.

Thou shalt never forget the difference between accuracy and precision.

Thou shalt never try to woo thy audience with pretty pictures or tales of fast GPUs.

Thou shalt not worship “physics-based” models that are not actually physics-based.

Thou shalt not be biased toward favorite models, but shalt use whatever gets the job done.

Thou shalt use good statistics as much as possible.

Thou shalt always remember that modeling is a means and a tool, not an end unto itself.

Thou shalt always understand and explain the limitations of thy models.

Thou shalt never forget: Only good experiments can uncover facts. The rest is crude poetry and imagination. 

Conquering the curse of morphine, one docked structure at a time

Morphine, one of the most celebrated molecules in history, also happens to be one of the deadliest. It belongs to the same family of opioid compounds that includes oxycodone and heroin. So does fentanyl, a weaponized form of which is believed to have killed dozens of hostages during a terrorist standoff in Russia a few years ago. Potent painkillers that relieve pain beyond the reach of traditional analgesics like ibuprofen, morphine and its cousins are also devils in disguise.

Constipation is the least concerning side effect of opioid use. Even in moderate doses they carry a high risk of death from respiratory failure; in the jargon of drug science, their therapeutic index - the margin between a beneficial dose and a fatal one - is very low. To get an idea of how dangerous these compounds are, it's sobering to know that in 2014 more than 28,000 deaths were caused by opioid overdoses, and half of these involved prescription opioids; the addiction potential of these substances is frighteningly high. The continuing opioid epidemic in places like New Hampshire bears testimony to the hideous trail of death and misery that these drugs can leave in their wake.

Quite unsurprisingly, finding a chemical agent that has the same unique painkilling profile as morphine without the insidious side effects has been a longstanding goal in drug discovery, almost a holy grail. It would be hard to overestimate the amount of resources that pharmaceutical companies and academic labs alike have spent in vain in this endeavor. It was long thought that it might be impossible to decouple morphine's good effects from its bad ones, but studies in the last few decades have provided important insights into the molecular mechanism of morphine's action that may help us accomplish this goal. Morphine binds to the so-called µ-opioid receptor (MOR), one of a family of several opioid receptors involved in pain, addiction and reward pathways. The MOR is a GPCR, a member of a family of complex seven-helix protein bundles that are ubiquitously involved in pretty much every important physiological process, from vision to neuronal regulation.

Almost 30% of all drugs work by modulating the activity of GPCRs, but the problem until now has been the difficulty of obtaining atomic-level crystal structures of these proteins that would allow researchers to try out a rational drug design approach against them. All that changed with a series of breakthroughs that has allowed us to crystallize dozens of GPCRs (that particular tour de force bagged its pioneers - Brian Kobilka and Robert Lefkowitz - the 2012 Nobel Prize, and I had the pleasure of interviewing Kobilka a few years ago). More recently, researchers have crystallized the various opioid receptors, and this represents a promising opportunity to find drugs against these proteins using rational drug design. When activated or inhibited by small molecules, GPCRs bind to several different proteins and engage several different biochemical pathways. Among other insights, detailed studies of GPCRs have established that the side effects of opioid drugs emerge mainly from engaging one particular protein called ß-arrestin.

A paper published this week in Nature documents a promising effort to this end. Using computational docking, Brian Shoichet (UCSF), Bryan Roth (UNC) and their colleagues have found a molecule called PZM21 that seems to activate the MOR without also activating the molecular mechanisms which cause dangerous side effects like respiratory depression. The team started with the crystal structure of the MOR and docked about 3 million commercial lead-like molecules against it. The top 2500 entries were visually inspected, and using known data on GPCR-ligand interactions (especially knowledge of the interaction between the ligands and an aspartate, Asp 147) and sound physicochemical principles (discarding strained ligands), they narrowed the set down to about 25 ligands with binding affinities ranging from 2 µM to 14 µM; interestingly, along with the expected Asp 147 interaction, the model located an additional hydrogen bond to Asp 147 that had not been seen before in GPCR-ligand binding (compound 7). Using traditional structure-based design that allowed the group to access novel interactions, the affinity was then improved all the way to single-digit nanomolar, which is an excellent starting point for future pharmacokinetic studies. The resulting compound hit all the right endpoints in the right assays: it showed greatly reduced recruitment of ß-arrestin and consequently reduced respiratory depression and constipation, it mitigated pain efficiently in mice, and it showed much lower addiction potential. It also showed selectivity against other opioid receptor subtypes, which may make it a more targeted drug.
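
For readers who like to see the shape of such a workflow, here is a deliberately simplified, hypothetical sketch of that kind of screening funnel in Python. The functions dock(), hbond_to_asp147() and strain_energy() are placeholders for whatever docking engine and strain model one actually uses; nothing here reproduces the published protocol, and the cutoffs are invented.

```python
# Hypothetical sketch of a docking funnel: score a large library, shortlist the top
# poses, then filter on a key interaction and on ligand strain before picking a small
# set of compounds to test. dock(), hbond_to_asp147() and strain_energy() are placeholders.

def screen(library, receptor, n_top=2500, strain_cutoff=5.0, n_final=25):
    scored = [(dock(mol, receptor), mol) for mol in library]    # score every molecule
    scored.sort(key=lambda pair: pair[0])                       # best (lowest) docking scores first
    shortlist = [mol for score, mol in scored[:n_top]]          # poses kept for visual inspection

    picks = []
    for mol in shortlist:
        if not hbond_to_asp147(mol, receptor):   # require the anchoring interaction with Asp 147
            continue
        if strain_energy(mol) > strain_cutoff:   # discard conformationally strained ligands
            continue
        picks.append(mol)
        if len(picks) == n_final:
            break
    return picks
```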

This paper is noteworthy in several respects, and not just for the promising result it achieved. Foremost among them is the fact that a relatively straightforward docking approach was used to find a novel ligand with novel interactions for one of the most coveted and intractable protein targets in pharmacology. But to me, equally noteworthy is the fact that the group examined 2500 docked ligands to pick the right ones and discard bad ones based on chemical intuition and existing knowledge; the final 25 ligands were not simply selected from the top 50 or 100. That exercise shows that no modeling can compensate for the virtues of what chemist Garland Marshall called "a functioning pair of eyes connected to a good brain", and as I have found out myself, there's really no substitute for patiently slogging through hundreds of ligands and using your chemical intuition to pick the right ones.

There is little doubt that approaches such as this one will continue to prove valuable in discovering new ligand chemistry and biology in important areas of pharmacology and medicine. As is becoming clear in the age of computing, the best results emerge not from computers or human beings alone, but when the two collaborate. This is as good an example of that collaboration as any I have seen recently.

Physics, biology and models: A view from 1993 and now

Twenty-three years ago the computer scientist Danny Hillis wrote an essay titled "Why physicists like models, and why biologists should." The essay gently took biologists to task for not borrowing model-building tricks from the physicists' trade. Hillis's contention was that simplified models have been hugely successful in physics, from Newton to Einstein, and that biologists have largely dismissed model building by invoking the knee-jerk reason that biological systems are far too complex.

There is a grain of truth in what Hillis says about biologists not adopting modeling and simulation to understand reality. While some of it probably is still a matter of training - most experimental biologists do not receive training in statistics and computer science as a formal part of their education - the real reasons probably have more to do with culture and philosophy. Historically, too, biology has been a much more experimental science than physics; Carl Linnaeus was still classifying animals and plants decades after Isaac Newton had mathematized all of classical mechanics.

Hillis documents three reasons why biologists aren't quick to use models:

"For various reasons, biologists who are willing to accept a living model as a source of insight are unwilling to apply the same criterion to a computational model. Some of the reasons for this are legitimate, others are not. For example, some fields of biology are so swamped with information that new sources of ideas are unwelcome. A Nobel Prize winning molecular biologist said to me recently, 'There may be some good ideas there, but we don't really need any more good ideas right now.' He might be right.

A second reason for biologists' general lack of interest in computational models is that they are often expressed in mathematical terms. Because most mathematics is not very useful in biology, biologists have little reason to learn much beyond statistics and calculus. The result is that the time investment required for many biologists to understand what is going on in computational models is not worth the payoff.

A third reason why biologists prefer living models is that all known life is related by common ancestry. Two living organisms may have many things in common that are beyond our ability to observe. Computational models are only similar by construction; life is similar by descent."
Many of these reasons still apply but have evolved for the better since 1993. The information glut has, if anything, increased in important fields of biology like genomics and neuroscience. Hillis was not writing in the age of 'Big Data', but his observation anticipates it. However, data by itself should not preclude the infusion of modeling; if anything it should encourage it even more. Also, the idea that "most mathematics is not very useful in biology" pertains to the difficulty (or even impossibility) of writing down, say, a mathematical description of a cat. But you don't always have to go that far in order to use mathematics effectively. For instance ecologists have used simple differential equations to model the rise and fall of predator and prey populations, and systems biologists are now using similar equations to model the flux of nutrients, metabolites and chemical reactions in a cell. Mathematics is certainly more useful in biology now than it was in 1993, and much of this resurgence has been enabled by the rise of high-speed computing and better algorithms.
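
As a concrete illustration of the predator-prey example above, here is a minimal sketch (not from Hillis's essay) of the classic Lotka-Volterra equations; the parameter values are arbitrary and chosen only to show the characteristic oscillations.

```python
# Lotka-Volterra predator-prey model: two coupled ODEs are enough to capture the
# cyclical rise and fall of the two populations. Parameters are arbitrary illustrations.
import numpy as np
from scipy.integrate import odeint

def lotka_volterra(y, t, a, b, c, d):
    prey, predator = y
    dprey = a * prey - b * prey * predator          # prey reproduce and get eaten
    dpredator = c * prey * predator - d * predator  # predators grow by eating prey, then die off
    return [dprey, dpredator]

t = np.linspace(0, 50, 2000)
populations = odeint(lotka_volterra, [10.0, 5.0], t, args=(1.0, 0.1, 0.02, 0.5))
print(populations[-1])  # the populations oscillate rather than settling to fixed values
```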

The emergence of better computing also speaks to the difficulty in understanding computational models that Hillis talks about; to a large extent it has now mitigated this difficulty. When Hillis was writing, it took a supercomputer to perform the kind of calculation that you can now do on your laptop in a few hours. The advent of Moore's Law-enabled software and hardware has enormously expanded number-crunching in biology. The third objection - that living things are similar by descent rather than construction - is a trivial one in my opinion. If anything it makes the use of computational models even more important, since by comparing similarities across various species one can actually get insights into potential causal relationships between them. Another reason Hillis gives for biologists not embracing computation is that they are emotionally invested in living things rather than non-living material things. This is probably much less of a problem now, especially since computers are used commonly even by biologists to perform routine calculations like graphing.

While the problems responsible for biologists' lukewarm response to computational models are legitimate, Hillis then talks about why biologists should still borrow from the physicists' modeling toolkit. The basic reason is that by constructing a simple system with analogous behavior, models can capture the essential features of a more complex system: This is in fact the sine qua non of model building. The tricky part of course is in figuring out whether the simple features truly reproduce the behavior of the real world system. The funny thing about models however is that they simply need to be useful, so they need not correspond to any of our beliefs about real world systems. In fact, trying to incorporate too much reality into models can make them worse and less accurate.

A good example I know is this paper from Merck that correlated simple minimized energies of a set of HIV protease inhibitor drugs to their inhibition values (IC50s) against the enzyme. Now, nobody believes that the very simple force field underlying this calculation actually reproduces the complex interplay of protein, small molecules and solvent that takes place in the enzyme. But the point is that they don't care: as long as the model is predictive, it's all good. Hillis, though, is making the point that models in physics have been more than just predictive; they have been explanatory. He exhorts biologists to move away from simple prediction, as useful as it might be, and towards explanation.
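
A toy version of that Merck-style exercise, with entirely made-up numbers, shows how modest the machinery can be: regress an activity measure against a crude computed energy and ask only whether the trend predicts. Hillis would call this prediction without explanation, which is exactly the point.

```python
# Toy correlation of a crude computed quantity against activity (all numbers invented).
import numpy as np

minimized_energy = np.array([-45.2, -51.8, -38.9, -60.3, -55.1])  # kcal/mol, made-up
pIC50 = np.array([6.1, 6.9, 5.4, 8.0, 7.3])                       # made-up activities

slope, intercept = np.polyfit(minimized_energy, pIC50, 1)
r = np.corrcoef(minimized_energy, pIC50)[0, 1]
print(f"pIC50 ~ {slope:.3f} * E_min + {intercept:.2f}  (r^2 = {r**2:.2f})")

# Use the trend to rank a new compound; no claim that the force field captures the
# real physics of protein, ligand and solvent - only that the correlation is useful.
print(f"predicted pIC50 for E_min = -58.0: {slope * -58.0 + intercept:.2f}")
```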

I agree with this sentiment, especially since prediction alone can lead you down a blissful path in which you may get more and more divorced from reality. Something similar can happen especially in fields like machine learning, where combinations of abstract descriptors that defy real-world interpretation are used merely because they are predictive. These kinds of models are very risky in the sense that you can't really figure out what went wrong when they break down; at that point you have a giant morass of descriptors and relationships to sort through, very few of which make any physical sense. This dilemma is true of biological models as a whole, so one needs to tread a fine line between keeping a model simple and descriptive and incorporating enough real-world variables to make it scientifically sensible.

Later the article talks about models separating out the important variables from the trivial ones, and that's certainly true. Hillis also talks about models being able to synthesize experimental variables into a seamless whole, and I think machine learning and multiparameter optimization in drug discovery, for instance, can achieve this kind of data fusion. There is a twist here though, since the kinds of emergent variables that are rampant in biology are not usually seen in physics. Thus, modeling in biology needs to account for the mechanisms underlying the generation of emergent phenomena. We are still not at the point where we can do this successfully, but we seem to be getting there. Finally, I would also like to emphasize one very important utility of models: as negative filters. They can tell you what experiment not to do, what variable to ignore, or what molecule not to make. That at the very least leads to a saving of time and resources.

The bottom line is that there is certainly more cause for biologists to embrace computational models than there was in 1993. And it should have nothing to do with physics envy.

Physicist Leo Kadanoff on reductionism and models: "Don't model bulldozers with quarks."

I have been wanting to write about Leo Kadanoff who passed away a few weeks ago. Among other things Kadanoff made seminal contributions to statistical physics, specifically the theory of phase transitions, that were undoubtedly Nobel caliber. But he should also be remembered for something else - a cogent and very interesting attack on 'strong reductionism' and a volley in support of emergence, topics about which I have written several times before.

Kadanoff introduced and clarified what can be called the "multiple platform" argument. The multiple platform argument is a response to physicists like Steven Weinberg who believe that higher-order phenomena like chemistry and biology have a strict one-to-one relationship with lower-order physics, most notably quantum mechanics. Strict reductionists like Weinberg tell us that "the explanatory arrows always point downward". But leading emergentist physicists like P W Anderson and Robert Laughlin have taken objection to this interpretation. Some of their counterarguments deal with very simple definitions of emergence; for instance a collection of gold atoms has a property (the color yellow) that does not directly flow from the quantum properties of individual gold atoms.

Kadanoff further revealed the essence of this argument by demonstrating that the Navier-Stokes equations, the fundamental classical equations of fluid flow, cannot be accounted for purely by quantum mechanics. Even today one cannot directly derive these equations from the Schrodinger equation, but what Kadanoff demonstrated is that even a simple 'toy model' in which classical particles move around on a hexagonal grid can give rise to fluid behavior described by the Navier-Stokes equations. There clearly isn't just one 'platform' (quantum mechanics) that can account for fluid flow. The complexity theorist Stuart Kauffman captures this well in his book "Reinventing the Sacred".



Others have demonstrated that a simple 'bucket brigade' toy model, in which empty and filled buckets corresponding to binary 1s and 0s (which in turn can be linked to well-defined quantum properties) are passed around, can account for computation. Thus, as real as the electrons obeying quantum mechanics which flow through the semiconducting chips of a computer are, we do not need to invoke their specific properties in order to account for a computer's behavior. A simple toy model can do equally well.
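
A throwaway sketch (my own, not Hillis's or Kadanoff's) makes the point concrete: treat a bit as a bucket that is either empty (0) or full (1), and addition becomes nothing more than rules for combining and passing buckets; no electrons or Schrodinger equation required.

```python
# Bits as buckets: 0 = empty, 1 = full. Computation is just rules for combining buckets.

def half_adder(bucket_a, bucket_b):
    """Combine two buckets into a sum bucket and a carry bucket."""
    sum_bucket = bucket_a ^ bucket_b    # full if exactly one input bucket is full
    carry_bucket = bucket_a & bucket_b  # full only if both input buckets are full
    return sum_bucket, carry_bucket

def add_bucket_rows(row_a, row_b):
    """Add two rows of buckets (least significant bucket first), passing carries down the brigade."""
    carry, result = 0, []
    for a, b in zip(row_a, row_b):
        s, c1 = half_adder(a, b)
        s, c2 = half_adder(s, carry)
        result.append(s)
        carry = c1 | c2
    result.append(carry)
    return result

# 6 (buckets 0,1,1) + 3 (buckets 1,1,0) = 9 (buckets 1,0,0,1)
print(add_bucket_rows([0, 1, 1], [1, 1, 0]))
```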

Kadanoff's explanatory device is in a way an appeal to the great utility of models which capture the essential features of a complicated phenomenon. But at a deeper level it's also a strike against strong reductionism. Note that nobody is saying that a toy model of classical particles is a more accurate and fundamental description of reality than quantum mechanics, but what Kadanoff and others are saying is that the explanatory arrows going from complex phenomena to simpler ones don't strictly flow downward; in fact the details of such a flow cannot even be truly demonstrated.

In some of his other writings Kadanoff makes a very clear appeal based on such toy models for understanding complex systems. Two of his statements provide the very model of pithiness when it comes to using and building models:

"1. Use the right level of description to catch the phenomena of interest. Don't model bulldozers with quarks.

2. Every good model starts from a question. The modeler should always pick the right level of detail to answer the question."


"This lesson applies with equal strength to theoretical work aimed at understanding complex systems. Modeling complex systems by tractable closure schemes or complicated free-field theories in disguise does not work. These may yield a successful description of the small-scale structure, but this description is likely to be irrelevant for the large-scale features. To get these gross features, one should most often use a more phenomenological and aggregated description, aimed specifically at the higher level. 

Thus, financial markets should not be modeled by simple geometric Brownian motion based models, all of which form the basis for modern treatments of derivative markets. These models were created to be analytically tractable and derive from very crude phenomenological modeling. They cannot reproduce the observed strongly non-Gaussian probability distributions in many markets, which exhibit a feature so generic that it even has a whimsical name, fat tails. Instead, the modeling should be driven by asking what are the simplest non-linearities or non-localities that should be present, trying to separate universal scaling features from market specific features. The inclusion of too many processes and parameters will obscure the desired qualitative understanding."

This paragraph captures as well as anything else why chemistry requires its own language, rules and analytical devices for understanding its details, and why biology and psychology require their own similar implements. Not everything can be understood through quantum mechanics, because as you try to get more and more fundamental, true understanding might simply slip through your fingers.

RIP, Leo Kadanoff.

Nobelist John Pople on using theories and models the right way

John Pople was a towering figure in theoretical and computational chemistry. He contributed to several aspects of the field, meticulously consolidated those aspects into rigorous computer algorithms for the masses and received a Nobel for his efforts. The Gaussian set of programs that he pioneered is now a mainstay of almost every lab in the world which does any kind of molecular calculation at the quantum level.

On the website of Gaussian is a tribute to Pople from his former student (and president of Gaussian) Michael Frisch. Of particular interest are Pople’s views on the role of theories and models in chemistry which make for interesting contemplation, not just for chemists but really for anyone who uses theory and modeling.

  • Theorists should compute what is measured, not just what is easy to calculate.
  • Theorists should study systems people care about, not just what is easy or inexpensive to study.
Both these points are well taken as long as one understands that it's often important to perform calculations on 'easy' model systems to benchmark techniques and software (think of spherical cows…). However, computing what is actually measured, on systems people care about, is a key aspect of modeling that's often lost on those who simulate more complex systems. For instance the thrust of a lot of protein-small molecule modeling is in determining or rationalizing the binding affinity between the protein and the small molecule. This is an important goal, but it's often quite insufficient for understanding the 'true' binding affinity between the two components in the highly complex milieu of a cell, where other proteins, ions, cofactors and water jostle for attention with the small molecule. Thus, while modelers should indeed try to optimize the affinity of their small molecules for proteins, they should also try to understand how these calculations might translate to a more biological context.
  • Models should be calibrated carefully and the results presented with scrupulous honesty about their weaknesses as well as their strengths.
This is another aspect of theory and modeling that's often lost in the din of communicating results that seem to make sense. The importance of training sets in validating models on known systems is well understood, although even in this case the right kind of statistics isn't always applied to get a real sense of the model's behavior. But one of the simpler problems with training sets is that they are often incomplete and miss essential features that are rampant among the real world's test sets (more pithily, all real cows as far as we know are non-spherical). This is where Pople's point about presenting the strengths and weaknesses of models applies: if you are unsure how similar the test case is to the training set, let the experimentalists know about this limitation. Pople's admonition also speaks to the more general one about always communicating the degree of confidence in a model to the experimentalists; a simple sketch of one way to do this follows. Often even a crude assessment of this degree can help prioritize which experiments should be done and which ones are best weighed against cost and ease of implementation.
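
Here is one minimal, hedged sketch of what "communicating the degree of confidence" could look like in practice: report, alongside each prediction, how similar the new compound is to its nearest neighbor in the training set. The SMILES strings and the 0.4 threshold below are purely illustrative choices, not a standard.

```python
# Report nearest-neighbor similarity to the training set as a crude confidence flag.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=2048)

training_smiles = ["CCO", "CCN", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]   # toy training set
training_fps = [fingerprint(s) for s in training_smiles]

def nearest_training_similarity(candidate_smiles):
    fp = fingerprint(candidate_smiles)
    return max(DataStructs.TanimotoSimilarity(fp, t) for t in training_fps)

sim = nearest_training_similarity("CC(C)Cc1ccc(cc1)C(C)C(=O)O")  # ibuprofen, as an example
print(f"nearest-neighbor Tanimoto similarity to training set: {sim:.2f}")
if sim < 0.4:  # arbitrary illustrative cutoff
    print("candidate lies far from the training data - flag the prediction as low confidence")
```
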
  • One should recognize the strengths as well as the weaknesses of other people's models and learn from them.
Here we are talking about plagiarism in the best sense of the word. It is key to be able to compare different methods and borrow from their strengths. But comparing methods is also important for another, more elemental reason: without proper comparison you might often be misled into thinking that your method actually works, and more importantly that it works because of a chain of causality embedded in the technique. But if a simpler method works as well as your technique, then perhaps your technique worked not because of but in spite of the causal chain that appears so logical to you (a minimal version of this baseline check is sketched below). A case in point is the whole field of molecular dynamics: Ant Nicholls from OpenEye has made the argument that you can't really trust MD as a robust and 'real' technique if simpler methods are giving you the answer (often faster).
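
A minimal sketch of that baseline check, with invented numbers: score the same compounds with the expensive method and with a much simpler one, and compare both against experiment before crediting the expensive method's causal story.

```python
# Compare an expensive method against a simple baseline on the same (invented) data.
import numpy as np

experimental = np.array([6.2, 7.1, 5.5, 8.0, 6.8, 7.4])
expensive    = np.array([6.0, 7.3, 5.9, 7.6, 6.5, 7.8])  # e.g. a long MD-based estimate
baseline     = np.array([6.1, 7.0, 5.8, 7.8, 6.9, 7.2])  # e.g. a quick empirical score

for name, pred in [("expensive method", expensive), ("simple baseline", baseline)]:
    rmse = np.sqrt(np.mean((pred - experimental) ** 2))
    r = np.corrcoef(pred, experimental)[0, 1]
    print(f"{name}: RMSE = {rmse:.2f}, r = {r:.2f}")
# If the baseline does just as well, the elaborate method deserves extra skepticism.
```
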
  • If a model is worth implementing in software, it should be implemented in a way which is both efficient and easy to use. There is no point in creating models which are not useful to other chemists.
This should be an obvious point but it isn’t always one. One of the resounding truths in the entire field of modeling and simulation is that the best techniques are the ones which are readily accessible to other scientists and provided to them cheaply or free of cost. Gaussian itself is a good example – even today a single user license is offered for about $1500. Providing user-friendly graphical interfaces seems like a trivial corollary of this principle, but it can make a world of difference for non-specialists. The Schrodinger suite is an especially good example of user-friendly GUIs. Conversely, software for which a premium was charged died a slow death, simply because very few people could use, validate and improve it.

There do seem to be some exceptions to this rule. For instance the protein-modeling program Rosetta is still rather expensive for general industrial use. More importantly, Rosetta seems to be particularly user-unfriendly and is of greatest use to descendants of David Baker’s laboratory at the University of Washington. However the program has still seen some very notable successes, largely because the Baker tribe counts hundreds of accomplished people who are still very actively developing and using the program.

Notwithstanding such exceptions though, it seems almost inevitable in the age of open source software that only the easiest to use and cheapest programs will be widespread and successful, with lesser competitors culled in a ruthlessly Darwinian process.

Molecules, software and function: Why synthesis is no longer chemistry's outstanding problem.

R. B. Woodward and his followers have solved the general problem of organic synthesis. We need to solve the general problem of design and function.
Yesterday's post by Derek titled 'The End of Synthesis' (follow up here) ruffled quite a few feathers. But I suspect one of the reasons it did so is because it actually hits too close to home for those who are steeped in the art and science of organic synthesis. The paper the post talks about, by chemist Martin Burke in Science, may or may not be the end of synthesis, but it may well be the beginning of the end, at least symbolically. As much as it might be unpalatable to some, the truth is that synthesis is no longer chemistry's outstanding general problem.

Synthesis was the great philosophical question of the twentieth century, not the twenty-first. As Derek pointed out, in the 50s it was actually not clear at all that a molecule like strychnine, with its complex shape and stereochemistry, could even be made. But now there is little doubt that given enough manpower (read graduate students and postdocs) and money, basically anything that you can sketch on paper can be made. Now I am certainly not claiming that synthesizing a complex natural product with fifty rotatable bonds and twenty chiral centers is even today a trivial task. There is still a lot of challenge and beauty in the details. I am also not saying that synthesis will cease to be a fruitful source of solutions for humanity's most pressing problems, such as disease or energy; as a tool the importance of synthesis will remain undiminished. What I am saying is that the general problem of synthesis has now been solved in an intellectual sense (as an aside, this would be consistent with the generally pessimistic outlook regarding total synthesis seen on many blogs, including Derek's). And since the general problem has been solved, it's not too audacious to think that it might lend itself to automation. Once you have a set of formulas for making a molecule, it is not too hard to imagine that machines might be able to bring about specific instantiations of those formulas, even if you will undoubtedly have to work out the glitches along the way.

I always think of software when I think about the current state of synthesis. When software was being developed in the 40s and 50s it took the ingenuity of a John von Neumann or a Grace Hopper to figure out how to encode a set of instructions in the form of a protocol that a machine could understand and implement. But what it took von Neumann to do in the 50s, it takes a moderately intelligent coder from India or China to do in 2014. Just as the massive outsourcing of software was a good example of software's ability to be commoditized and automated, so is the outsourcing of large swathes of organic synthesis a telling sign of its future in automation. Unlike software, synthesis is not quite there yet, but given the kind of complexity that it can create on demand, it will soon be.

To see how we have gotten here it's worth taking a look at the history, and this history contains more than a nugget of comparison between synthesis and computer science. In the 1930s the general problem was unsolved. It was also unsolved in the 50s. Then Robert Burns Woodward came along. Woodward was a wizard who made molecules whose construction had previously defied belief. He had predecessors, of course, but it was Woodward who solved the general problem by proving that one could apply well-known principles of physical organic chemistry, conformational analysis and stereochemical control to essentially synthesize any molecule. He provided the definitive proof of principle. All that was needed after that was enough time, effort and manpower. Woodward was followed by others like Corey, Evans, Stork and now Phil Baran, and all of them demonstrated the facile nature of the general problem.

If chemistry were computer science, then Woodward could be said to have created a version of the Turing Machine, a general formula that could allow you to synthesize the structure of any complex molecule, as long as you had enough NIH funding and cheap postdocs to fill in the specific gaps. Every synthetic chemist who came after Woodward has really developed his or her own special versions of Woodward’s recipe. They might have built new models of cars, but their Ferraris, Porsches and Bentleys – as elegant and impressive as they are – are a logical extension of Woodward and his predecessors’ invention of the internal combustion engine and the assembly line. And it is this Turing Machine-like nature of synthetic schemes that lends them first to commoditization, and then to automation. The human component is still important and always will be, but the proportion of that creative human contribution is definitely changing.

A measure of how the general problem of synthesis has been solved is readily apparent to me in my own small biotech company which specializes in cyclic peptides, macrocycles and other complex bioactive molecules. The company has a vibrant internship program for undergraduates in the area. To me the most remarkable thing is to see how quickly the interns can bring themselves up to speed on the synthetic protocols. Within a month or so of starting at the bench they start churning out these compounds with the same expertise and efficiency as chemists with PhDs. The point is, synthesizing a 16-membered ring with five stereocenters has not only become a routine, high-throughput task but it’s something that can be picked up by a beginner in a month. This kind of synthesis might have easily fazed a graduate student in Woodward's group twenty years ago and taken up a good part of his or her PhD project. The bottom line is that we chemists have to now face an uncomfortable fact: there are still a lot of unexpected gems to be found in synthesis, but the general problem is now solved and the incarnation of chemical synthesis as a tool for other disciplines is now essentially complete.

Functional design and energetics are now chemistry’s outstanding general problems

So if synthesis is no longer the general problem, what is? George Whitesides has held forth on this question in quite a few insightful articles, and my own field of medicinal chemistry and molecular modeling provides a good example. It may be easy to synthesize a highly complex drug molecule using routine techniques, but it is impossible, even now, to calculate the free energy of binding of an arbitrary simple small molecule with an arbitrary protein. There is simply no general formula, no Turing Machine, that can do this. There are of course specific cases where the problem can be solved, but the general solution seems light years away. And not only is the problem unsolved in practice but it is also unsolved in principle. It's not a question of manpower and resources; it's a question of basic understanding. Sure, we modelers have been saying for over twenty years that we have not been able to calculate entropy or to account for tightly bound water molecules. But these are mostly convenient explanations which, when enunciated, make us feel more emotionally satisfied. There have certainly been some impressive strides in addressing each of these and other problems, but the fact is that when it comes to calculating the free energy of binding, we are still today where we were in 1983. So yes, the calculation of free energies – for any system – is certainly a general problem that chemists should focus on.

But here’s the even bigger challenge that I really want to talk about: We chemists have been phenomenal in being able to design structure, but we have done a pretty poor job in designing function. We have of course determined the function of thousands of industrial and biological compounds, but we are still groping in the dark when it comes to designing function. An example from software would be designing an emergency system for a hospital: there the real problem is not writing the code but interfacing the code with the several human, economic and social factors that make the system successful. 

Here are a few examples from chemistry: Through combinatorial techniques we can now synthesize antibodies that we want to bind to a specific virus or molecule, but the very fact that we have to adopt a combinatorial, brute force approach means that we still can’t start from scratch and design a single antibody with the required function (incidentally this problem subsumes the problem of calculating the free energy of antigen-antibody binding). Or consider solar cells. Solid-state and inorganic chemists have developed an impressive array of methods to synthesize and characterize various materials that could serve as more efficient solar materials. But it’s still very hard to lay out the design principles – in general terms – for a solar material with specified properties. In fact I would say that the ability to rapidly make molecules has even hampered the ability to think through general design principles. Who wants to go to the trouble of designing a specific case when you can simply try out all combinations by brute force?

I am not taking anything away from the ingenuity of chemists – nor am I refuting the belief that you do whatever it takes to solve the problem – but I do think that in their zeal to perfect the art of synthesis chemists have neglected the art of de novo design. Yet another example is self-assembly, a phenomenon which operates in everything from detergent action to the origin of life. Today we can study the self-assembly of diverse organic and inorganic materials under a variety of conditions, but we still haven't figured out the rules – either computational or experimental – that would allow us to specify the forces between multiple interacting partners so that these partners assemble in the desired geometry when brought together in a test tube. Ideally what we want is the ability to come up with a list of parts and the precise relationships between them that would allow us to predict the end product in terms of function. This would be akin to what an architect does when he puts together a list of parts that allows him to predict not only the structure of a building but also the interplay of air and sunlight in it.

I don’t know what we can do to solve this general problem of design but there are certainly a few promising avenues. A better understanding of theory is certainly one of them. The fact is that when it comes to estimating intermolecular interactions, the theories of statistical thermodynamics and quantum mechanics do provide – in principle – a complete framework. Unfortunately these theories are usually too computationally expensive to apply to the vast majority of situations, but we can still make progress if we understand what approximations work for what kinds of systems. Psychologically I do think that there has to be a general push away from synthesis and toward understanding function in a broad sense. Synthesis still rules chemical science and for good reason; it's what makes chemistry unique among the sciences. But that also often makes synthetic chemists immune to the (well deserved) charms of conformation, supramolecular interactions and biology. It's only when synthetic chemists seamlessly integrate themselves into the end uses of their work that they will better appreciate synthesis as an opportunity to distill general design principles. Part of this solution will also be cultural: organic synthesis has long enjoyed a cultish status which still endures.

However the anguish should not obscure the opportunity here. The solution of the general problem of synthesis and its possible automation should not leave chemists chagrined since it's really a tribute to the amazing success that organic synthesis has engendered in the last thirty years. Instead of reinventing ourselves as synthetic chemists let's retool. Let the synthetic chemist interact with the physical biochemist, the structural engineer, the photonics expert; let him or her see synthesis through the requirement of function rather than structure. The synthesis is there, the other features are not. Whitesides was right when he said that chemists need to broaden out, but another way to interpret his statement would be to ask other scientists to channel their thoughts into synthesis in a feedback process. As chemists we have nailed structure, but nailing design will bring us untold dividends and will enormously enhance the contribution of chemistry to our world.
 

Adapted from a previous post on Scientific American Blogs.

Free Energy Perturbation (FEP) methods in drug discovery: Or, Waiting for Godot

For interested folks in the Boston area it's worth taking a look at this workshop on Free Energy Perturbation (FEP) methods in drug design at Vertex from May 19-21. The list of speakers and topics is quite impressive, and this is about as much of a state-of-the-art discussion on the topic as you can expect to find in the area.

If computational drug discovery were a series of plays, then FEP might well be the "Waiting for Godot" candidate among them. In fact I would say that FEP is a textbook case of an idea that, if it really works, can truly transform the early stages of drug discovery. What medicinal chemist would not want to know the absolute free energy of binding of his molecules to a protein so that he can actually rank known and unknown compounds in order of priority? And what medicinal chemist would not want to know exactly what she should make next?

But that's what medicinal chemists have expected from modelers ever since modeling started to be applied realistically to drug discovery, and I think it's accurate to say that it's good they haven't held their breath. FEP methods have always looked very promising because they aim to be very rigorous, bringing the whole machinery of statistical mechanics to bear on a protein-ligand system. The basic goal is "simple": you calculate the free energies of the protein and the drug separately - in explicit water - and then you calculate the free energy of the bound complex. The difference is the free energy of binding. Problem solved.

Except, not really. Predicting relative free energies is still a major challenge, and predicting absolute free energies is asking for a lot. The major obstacle to the application of these methods for decades was considered to be the lack of enough computing power. But if you really thought that was the major obstacle then you were still a considerable way off. Even now there seems to be a belief that given enough computing power and simulation time we can accurately calculate the free energy of binding between a drug and a target. But that's assuming that the fundamental underlying methodology is accurate, which is a big assumption.

The "fundamental underlying methodology" in this case mainly refers to two factors: the quality of the force field which you use to calculate the energy of the various components and the sampling algorithm which you use to simulate their motions and exhaustively explore their conformations. The force fields can overemphasize electrostatic interactions and can neglect polarization, and the sampling algorithms can fail to overcome large energy barriers. Thus both these components are imperfectly known and applied in most cases, which means that no amount of simulation time or computing power is then going to be sufficient. It's a bit like the Polish army fighting the Wehrmacht in September 1939; simply having a very large number of horses or engaging them in the fight for enough time is not going to help you win against tanks and Stukas.

These problems have all been well recognized however; in fact the two most general issues in any modeling technique are sampling and energy calculation. So parts of this month's workshop are aimed exactly at dissecting the factors that can help us understand and improve sampling and scoring.

The end goal of any applied modeling technique of course is how good it is at prediction. Not surprisingly, progress on this front using FEP has been rather thin. In fact FEP is the quintessential example of a technique whose successes have been anecdotal. Even retrospective examples, while impressive, are not copious. One of the problems is that FEP works only when you are trying to predict the impact of very tiny changes in structure on ligand affinity; for instance the impact of changing a methyl group on a benzene ring to a hydroxyl group. The trouble is that the method doesn't work across the board even for these minor changes; there are projects where a CH3--->OH change will give you quantitative agreement with experiment, and there are cases where it will produce error bars large enough to drive a car through.

But anecdotes, while not data, are still quite valuable in telling us what may or may not work. Computing power may not solve all our problems but it has certainly given us the opportunity to examine a large number of cases and try to abstract general rules or best practices for drug discovery. We may not be able to claim consistent successes for FEP right now, but it would help quite a lot even if we knew what kinds of systems it works best for. And that, to me, is as good an outcome as we could expect at this time.

The right questions about modeling in drug discovery

There's been a fair bit of discussion recently about the role of modeling in drug design, and as a modeler I have found it encouraging to see people recognize the roles that modeling, to varying extents, plays in the whole "rational" drug design paradigm. But sometimes I also see a question being asked in a way that's not quite how it should be pitched.

This question is usually along the lines of "Would you (as a medicinal chemist) or would you not make a compound that a modeler recommends?". Firstly, the way the question is set up ignores the statistical process evident in all kinds of drug design and not just modeling. You never recommend a single compound or even five of them but usually a series, often built around a common scaffold. Now suppose the chemist makes fifty compounds and finds only one of them to be active. Does this mean the modeling was useful? It depends. If the one hit that you get is a novel scaffold, or a compound with better properties, or one without that pesky unstable ester, or even a compound similar to ones that you have made that nonetheless picks up some hitherto unexplored interactions with the protein, then modeling would not have been in vain in spite of what is technically a low hit rate. The purpose of modeling is to provide interesting ideas, not rock-solid predictions. Plus you always have to ask the other question: "How likely is it that you would have come up with this structure without the aid of modeling or structural information"? As with almost everything else in drug discovery, the value of a particular approach is not absolute and you have to judge it in the context of alternatives.

The other problem I have is that it's frequently just not possible to objectively quantify the value of modeling or any specific approach for that matter in drug design. That's because anyone who has contributed to the design of a successful drug or even a potent ligand knows how haphazard and multifaceted the process is, with contributions flying in from every direction in many guises. Every project that I have contributed to which has resulted in more potent ligands has involved a patchwork process with an essential back-and-forth exchange with the chemists. Typically I will recommend a particular compound based on some modeling metric and then the synthetic chemist will modify parts of it based on synthetic accessibility, cost of building blocks, ease of purification etc. I might then remodel it and preserve or extend the synthetic modifications.

But it can also be the other way round. The synthetic chemist will recommend a scaffold and I will suggest tweaks based on what I see on the computer, which will lead to another round of suggestions and counter-suggestions. The creative process goes both ways and the point is that it's hard to put a number on the individual contributions that go into the making of the final product. Sometimes there may be a case where one individual recommends that magic fluorine which makes all the difference, but this is usually the exception rather than the rule and more importantly, it's often not possible to trace that suggestion to a strictly rational thought process. Finally, it's key to realize that my suggestions themselves will very often be based as much on my knowledge of physical and organic chemistry as on any specific modeling technique. Thus, even internally it's not always possible for me to separate modeling-based suggestions from other kinds. On a related note, the frequent necessity of this interactive process is why I tend to find the "modeling vs medicinal chemistry" turf wars a little irksome (although they sometimes do help in fleshing out the details); the fact of the matter is that both kinds of scientists are ultimately an important part of successful drug discovery projects involving any kind of model-building.

Now this rather unmethodical approach makes it difficult to pin down the role of any specific technology in the drug discovery process, and this usually makes me skeptical every time someone writes a paper trying to assess "The contribution of HTS/virtual screening/fragment-based drug design in the discovery of major drugs during the last decade". But that's how drug discovery is, a hodgepodge of recommendations and tweaks made by a variety of rational and irrational human beings in an environment where any kind of idea from any corner is welcome.

Unruly beasts in the jungle of molecular modeling

The Journal of Computer-Aided Molecular Design is having a smorgasbord of accomplished modelers reflect upon the state and future of modeling in drug discovery research, and I would definitely recommend that anyone - and especially experimentalists - interested in the role of modeling take a look at the articles. Many of the articles are extremely thoughtful and balanced and take a hard look at the lack of rigorous studies and results in the field; if there was ever a need to make journal articles freely available it was for these kinds, and it's a pity they aren't. But here's one that is open access, and it's by researchers from Simulations Plus who talk about three beasts (or in the authors' words, "Lions and tigers and bears, oh my!") in the field that are either unsolved or ignored or both.

1. Entropy: As they say, death, taxes and entropy are the three constant things in life. In modeling both small molecules and proteins, entropy has always been the elephant in the room, blithely ignored in most simulations. At the beginning there was no entropy. Early modeling programs then started exacting a rough entropic penalty for freezing certain bonds in the molecule. While this approximated the loss of ligand entropy upon binding, it did nothing to account for the conformational entropy loss that results from the compression of a panoply of diverse conformations in solution into a single bound conformation.
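
One standard, back-of-the-envelope way to organize this bookkeeping (a textbook-style decomposition, not something prescribed by the paper) is to split the entropy change into ligand-conformational, solvent and protein terms, with the conformational piece estimated from the populations of the solution conformers:

$$
\Delta G_{\mathrm{bind}} \;=\; \Delta H \;-\; T\big(\Delta S_{\mathrm{conf}} + \Delta S_{\mathrm{solv}} + \Delta S_{\mathrm{prot}}\big),
\qquad
S_{\mathrm{conf}} \;=\; -R \sum_i p_i \ln p_i
$$

Collapsing many populated conformers into a single bound pose drives S_conf toward zero, a cost often ballparked at several tenths of a kcal/mol per effectively frozen rotatable bond; the solvent and protein terms are the ones discussed next.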

But we were just getting started. A very large part of the entropy of binding a ligand by a protein comes from the displacement of water molecules in the active site, essentially their liberation from being constrained prisoners of the protein to free-floating entities in the bulk. A significant advance in trying to take this factor into account was an approach that explicitly and dynamically calculated the enthalpy, entropy and therefore the free energy of bound waters in proteins. We have now reached the point where we can at least think of doing a reasonable calculation on such water molecules. But water molecules are often ill-localized in protein crystal structures because of low-resolution, inadequate refinement and other reasons. It's not easy to perform such calculations for arbitrary proteins without crystal structures.

However, a large piece of the puzzle that's still missing is the entropy of the protein, which is extremely difficult to calculate on many fronts. Firstly, the dynamics of the protein is often not captured by a static x-ray structure, so any attempt to calculate protein entropy in the presence and absence of ligands would have to shake the protein around. Currently the favored process for doing this is molecular dynamics (MD), which suffers from its own problems, most notably the accuracy of what's under the hood - namely force fields. Secondly, even if we can calculate the total entropy changes, what we really need to know is how the entropy is distributed between various modes, since only some of these modes are affected upon ligand binding. An example of the kind of situation in which such details would be important is the case of slow, tight-binding inhibitors illustrated in the paper. The example is of two different prostaglandin synthase inhibitors which show almost identical binding orientations in the crystal structure. Yet one is a weakly binding inhibitor which dissociates rapidly and the other is a slow, tight binder. Only a dynamic treatment of entropy can explain such differences, and we are still quite far from being able to do this in the general case.

2. Uncertainty: Out of all the hurdles facing the successful application and development of modeling in any field, this might be the most fundamental. To reiterate, almost every kind of modeling starts by using a training set of molecules for which the data is known and then proceeds to apply the results from this training set to a test set for which the results are unknown. Successful modeling hinges on the expectation that the data in the test set is sufficiently similar to that in the training set. But problems abound. For one thing, similarity is in the eye of the beholder, and what seems to be a reasonable criterion for assuming similarity may turn out to be irrelevant in the real world. Secondly, overfitting is a constant issue, and results that look perfect for the training set can fail abysmally on the test set.

But as the article notes, the problems go further and the devil's in the details. Modeling studies very rarely try to quantify the exact differences between the two sets and the error resulting from that difference. What's needed is an estimate of predictive uncertainty for single data points, something which is virtually non-existent. The article notes the seemingly obvious but often ignored fact when it says that "there must be something that distinguishes a new candidate compound from the molecules in the training set". This 'something' will often be a function of the data that was ignored when fitting the model to the training set. Outliers which were thrown out because they were...outliers might return with a vengeance in the form of a new set of compounds that are enriched in their particular properties which were ignored.

But more fundamentally, the very nature of the model used to fit the training set may be severely compromised. In its simplest incarnation, for instance, linear regression may be used to fit data points to a set of relationships that are inherently non-linear. In addition, descriptors (such as molecular properties supposedly related to biological activity) may not be independent. As the paper notes, "The tools are inadequate when the model is non-linear or the descriptors are correlated, and one of these conditions always holds when drug responses and biological activity are involved". This problem penetrates every level of drug discovery modeling, from basic molecular-level QSAR to higher-level clinical or toxicological modeling. Only a judicious and high-quality application of statistics, constant validation, and a willingness to wait (for publication, press releases etc.) until the entire analysis is available will keep erroneous results from seeing the light of day.
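
A minimal sketch, with invented data, of one way to attach an uncertainty to a single prediction: train an ensemble of models on bootstrap resamples of the training set and report the spread of their predictions for each new compound; a wide spread flags a compound the training data barely constrains.

```python
# Bootstrap ensemble to estimate per-compound predictive uncertainty (toy data).
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 3))                                  # three made-up descriptors
y_train = X_train @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.3, size=40)

def fit_linear(X, y):
    """Least-squares linear fit with an intercept term."""
    A = np.column_stack([X, np.ones(len(X))])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

def predict(coeffs, x):
    return np.append(x, 1.0) @ coeffs

x_new = np.array([0.2, 1.5, -0.7])       # a hypothetical new compound's descriptors
preds = []
for _ in range(200):                     # refit on bootstrap resamples of the training set
    idx = rng.integers(0, len(X_train), len(X_train))
    preds.append(predict(fit_linear(X_train[idx], y_train[idx]), x_new))

print(f"prediction = {np.mean(preds):.2f} +/- {np.std(preds):.2f}")
```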

3. Data Curation: This is an issue that should be of enormous interest to not just modelers but to all kinds of chemical and biological scientists concerned about information accuracy. The well-known principle of Garbage-In Garbage Out (GIGO) is at work here. The bottom line is that there is an enormous amount of chemical data on the internet that is flawed. For instance there are cases where incorrect structures were inferred from correct names of compounds:

"The structure of gallamine triethiodide is a good illustrative example where many major databases ended up containing the same mistaken datum. Until mid-2011, anyone relying on an internet search would have erroneously concluded that gallamine triethiodide is a tribasic amine. The error resulted from mis-parsing the common name at some point as meaning that the compound is a salt of gallamine and ‘‘ethiodidic acid,’’ identifying gallamine as the active component and retrieving the relevant structure. In fact, gallamine triethiodide is what you get when you react gallamine with three equivalents of ethyl iodide"


So gallamine triethiodide is the triply quaternized ammonium salt, not the tribasic amine. Assuming otherwise can only lead to chemical mutilation and death. And this case is hardly unique. An equally common problem is simply assigning the wrong ionization state to a chemical compound, as illustrated at the beginning of the post. I have already mentioned this as a rookie mistake, but nobody is immune to it. It should hardly need mentioning that any attempt to model an incorrect structure will produce completely wrong results. The bigger problem of course is when the results seem right and prevent us from locating the error; for example, docking an incorrectly positively charged structure into a negatively charged binding site will give very promising but completely spurious results.
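
A hedged sketch of the kind of cheap sanity check that catches some of this garbage before it goes in: parse the database structure, recompute its formal charge, and flag anything that disagrees with what the record claims. The SMILES and "expected" charges below are illustrative only.

```python
# Basic structure hygiene: reparse each record and check its formal charge.
from rdkit import Chem

records = [
    ("acetate as drawn in the database", "CC(=O)[O-]", -1),
    ("a tertiary amine recorded as neutral", "CCN(CC)CC", 0),
]

for name, smiles, expected_charge in records:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        print(f"{name}: structure does not even parse - fix it before any modeling")
        continue
    charge = Chem.GetFormalCharge(mol)
    status = "ok" if charge == expected_charge else "MISMATCH - check protonation/salt form"
    print(f"{name}: formal charge {charge:+d} ({status})")
```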

It's hard to see how exactly the entire modeling community can rally together, collectively rectify these errors and establish a common and inviolable standard for performing studies and communicating their results. Until then all we can do is point out the pitfalls, the possibilities, the promises and the perils.

Clark, R., & Waldman, M. (2011). Lions and tigers and bears, oh my! Three barriers to progress in computer-aided molecular design. Journal of Computer-Aided Molecular Design. DOI: 10.1007/s10822-011-9504-3

On reproducibility in modeling

A recent issue of Science has an article discussing an issue that has been a constant headache for anyone involved with any kind of modeling in drug discovery - the lack of reproducibility in computational science. The author, Roger Peng, a biostatistician at Johns Hopkins, talks about modeling standards in general, but I think many of his caveats apply to drug discovery modeling. The problem has been recognized for a few years now, yet there have been very few concerted efforts to address it.

An old anecdote from my graduate advisor's research drives the point home. He wanted to replicate a protein-ligand docking study done with a particular compound, so he contacted the scientist who had performed the study and processed the protein and ligand according to that scientist's protocol. He adjusted the parameters appropriately and ran the calculation. To his surprise he got a very different result. He repeated the protocol several times but consistently saw the discrepancy. Finally he called up the original researcher. The two went over the protocol a few times and realized that the problem lay in a minor but overlooked detail: the two scientists were using slightly different versions of the modeling software. It wasn't even a new version, just an update, but for some reason that was enough to significantly change the results.


These and other problems dot the landscape of modeling in drug discovery. The biggest problem to begin with is, of course, the sheer lack of detail reported in modeling studies. I have seen more than my share of papers where the authors find it enough to simply state the name of the software used. No mention of parameters, versions, inputs, "pre-processing" steps, hardware, operating system, computer time or "expert" tweaking. That last factor is crucial, and I will come back to it. In any case, it's quite obvious that no modeling study can be reproduced without these details. Ironically, the same process that made modeling accessible to the experimental masses has also encouraged the reporting of incomplete results; the incarnation of simulation as black-box technology has inspired experimentalists to use it widely, but on the flip side it has also discouraged many from worrying about communicating the under-the-hood details.

A related problem is the lack of objective statistical validation in reporting modeling results, a very important topic that has been highlighted recently. Even when protocols are accurately described, the absence of error bars or any measure of statistical variation means that one can get a different result even if the original recipe is followed meticulously. Even simple things like docking runs can give slightly different numbers on the same system, so it's important to be mindful of the variation in the results along with its probable causes. Feynman talked about the irreproducibility of individual experiments in quantum mechanics, and while it's not nearly that bad in modeling, it's not irrelevant either.
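
As a toy illustration of what such reporting could look like, the sketch below takes scores from repeated runs of the same (hypothetical) docking protocol and reports a mean with a bootstrap confidence interval instead of a single number. The values are invented and only NumPy is assumed.

```python
# A toy sketch: instead of reporting a single docking score, report the mean
# and a bootstrap 95% confidence interval over repeated runs of the same
# protocol. The score values below are invented for illustration.
import numpy as np

# Scores from, say, ten repeat runs of the same docking protocol with
# different random seeds (hypothetical values, in kcal/mol)
scores = np.array([-8.1, -7.9, -8.4, -7.6, -8.2, -8.0, -7.8, -8.3, -8.1, -7.7])

rng = np.random.default_rng(42)
n_boot = 10_000
# Resample the runs with replacement and recompute the mean each time
boot_means = np.array([
    rng.choice(scores, size=scores.size, replace=True).mean()
    for _ in range(n_boot)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])

print(f"mean score = {scores.mean():.2f} kcal/mol, "
      f"95% bootstrap CI = [{lo:.2f}, {hi:.2f}]")
```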

This brings us to one of those important but often unquantifiable factors in successful modeling campaigns - the role of expert knowledge and intuition. Since modeling is still an inexact science (and will probably remain so for the foreseeable future), intuition, gut feelings and a "feel" for the particular system based on experience can be an important part of massaging a protocol into delivering the desired results. At least in some cases these intangibles are captured in any number of little tweaks, from constraining the geometry of certain parts of a molecule based on past knowledge to pulling in an unanticipated technique to clean up the data. A lot of this is never reported in papers, and some of it probably can't be. But is there a way to capture and communicate at least the tangible part of this kind of thinking?

The paper alludes to a simple possible solution, one that will have to be implemented by journals. Any modeling protocol generates a log file which can be easily interpreted by the relevant program. Some modeling software, such as Schrödinger's, also writes a script that records every step in a format the program can replay. Almost any little tweak you make is usually recorded in these files or scripts, and a log file documents concrete steps more accurately than an English-language description ever could. One can imagine a generic log-file-generating program that records the steps across different modeling packages. Such a venture would need collaboration between different software companies, but it could be very useful in providing a single log file that captures as much of the modeler's tangible and intangible thinking as possible. Journals could insist that authors upload these log files and make them available to the community.
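
Nothing like a cross-vendor log format exists yet as far as I know, but a toy sketch of the minimal provenance such a file might record - program name and version, parameters, platform, and a hash of every input file - could look like the following. Every field name and the example usage are hypothetical, not an existing standard.

```python
# A toy sketch of a minimal, program-agnostic provenance record: software
# version, parameters, and hashes of the input files, written as JSON next to
# the results. Every field name here is hypothetical, not an existing standard.
import hashlib
import json
import platform
from datetime import datetime, timezone
from pathlib import Path

def sha256_of(path):
    """Hash an input file so readers can verify they are using the same one."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def write_provenance(program, version, parameters, input_files, out="run_log.json"):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "program": program,
        "program_version": version,
        "platform": platform.platform(),
        "parameters": parameters,
        "inputs": {str(p): sha256_of(p) for p in input_files},
    }
    Path(out).write_text(json.dumps(record, indent=2))
    return record

# Hypothetical usage for a docking run (file names are placeholders):
# write_provenance("SomeDockingProgram", "4.2.1",
#                  {"scoring_function": "default", "num_poses": 10},
#                  ["receptor.pdb", "ligand.sdf"])
```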


Ultimately it is journals that can play the biggest role in implementing rigorous and useful modeling standards. In the Science article the author describes a very useful system used by the journal Biostatistics. Under this system, authors of simulation studies can request a "reproducibility review" in which one of the associate editors runs the protocols using the code supplied by the authors. Papers that pass this test are clearly flagged with an "R" - reviewed for reproducibility. At the very least, the system gives readers a way to distinguish rigorously validated papers from the rest, so they know which ones to trust more. You might expect a backlash from authors who would rather not advertise that their protocols have not been verified, but the fact that the system is working suggests that the community at large sees its value.

Unfortunately, in the case of drug discovery any such system will have to deal with the problem of proprietary data. Plenty of papers contain no such data and could benefit from the system as it stands, but even proprietary data can be amenable to partial reproducibility. For instance, proprietary molecular structures could be encoded into organization-specific formats that are hard to decode (formats used by OpenEye or Rosetta are examples). One could still run a set of modeling protocols on such an encoded data set and generate statistics without revealing the identity of the structures. Naturally there would have to be safeguards against the misuse of any such evaluation, but it's hard to see why they would be difficult to institute.
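
One crude way to get the flavor of this: share only salted hashes of the structures together with the computed descriptors and scores, so that the summary statistics can be re-derived without the structures themselves ever being disclosed. The sketch below uses only the Python standard library and invented data; it illustrates the idea and is in no way a real anonymization or security scheme.

```python
# A crude sketch of partial reproducibility on proprietary data: replace each
# structure with a salted hash and share only the hash, descriptors, and score,
# so summary statistics can be re-derived without disclosing the structures.
# This illustrates the idea only; it is not a real anonymization scheme.
import hashlib
import statistics

SECRET_SALT = "replace-with-an-organization-secret"   # hypothetical

def anonymize(smiles):
    """Map a structure to an opaque identifier."""
    return hashlib.sha256((SECRET_SALT + smiles).encode()).hexdigest()[:16]

# Hypothetical proprietary compounds with a computed descriptor and a score
private_data = [
    ("CCO",  46.1, -6.2),
    ("CCCN", 59.1, -7.0),
    ("CCCl", 64.5, -5.8),
]

shared = [
    {"id": anonymize(smi), "mol_weight": mw, "docking_score": score}
    for smi, mw, score in private_data
]

# The shared table alone is enough to reproduce the summary statistics
scores = [row["docking_score"] for row in shared]
print(f"n = {len(scores)}, mean score = {statistics.mean(scores):.2f}, "
      f"stdev = {statistics.stdev(scores):.2f}")
```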


Finally, only a community-wide effort drawing equally on industry and academia can lead to the validation and adoption of successful modeling protocols. The article suggests creating a kind of "CodeMed Central" repository akin to PubMed Central, and I think modeling could benefit greatly from such a central resource. Code for successful protocols in virtual screening, homology modeling, molecular dynamics or what have you could be uploaded to the site (along with the log files, of course). Not only could these protocols be checked for reproducibility, they could also be applied directly to similar systems. The community as a whole would benefit.

Before there is any data generation or sharing, before any drawing of conclusions, before any advancement of scientific knowledge, there is reproducibility, a virtue that has guided every field of science since its modern origins. Sadly this virtue has been neglected in modeling, and it's about time we paid more attention to it.


Peng, R. (2011). Reproducible Research in Computational Science. Science, 334 (6060), 1226-1227. DOI: 10.1126/science.1213847