Field of Science

Unruly beasts in the jungle of molecular modeling

The Journal of Computer-Aided Molecular Design is running a smorgasbord of articles by accomplished modelers reflecting on the state and future of modeling in drug discovery research, and I would definitely recommend that anyone interested in the role of modeling, especially experimentalists, take a look. Many of the articles are extremely thoughtful and balanced and take a hard look at the lack of rigorous studies and results in the field; if there were ever a case for making journal articles freely available it is for articles like these, and it's a pity they aren't. But here's one that is open access, by some researchers from Simulations Inc., who talk about three beasts (or in the authors' words, "Lions and tigers and bears, oh my!") in the field that are either unsolved or ignored or both.

1. Entropy: As they say, death, taxes and entropy are the three constant things in life. In modeling both small molecules and proteins, entropy has always been the elephant in the room, blithely ignored in most simulations. At the beginning there was no entropy. Early modeling programs then started exacting a rough entropic penalty for freezing certain bonds in the molecule. While this approximated the loss of ligand entropy in binding, it did nothing to take care of the conformational entropy loss that results from the compression of a panoply of diverse conformations in solution into a single bound conformation.
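To put a rough number on that compression, here is a minimal sketch in Python. It is my own illustration, not something taken from the paper: it assumes the ligand's solution ensemble can be described by a handful of discrete conformer populations, applies the Gibbs formula S = -R Σ p ln p, and treats the bound state as a single conformer with zero conformational entropy. The function names and the ten-conformer ligand are hypothetical.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol*K)

def conformational_entropy(populations):
    """Gibbs entropy S = -R * sum(p ln p) over conformer populations, in kJ/(mol*K)."""
    total = sum(populations)
    probs = [p / total for p in populations if p > 0]
    return -R * sum(p * math.log(p) for p in probs)

def binding_entropy_penalty(populations, temperature=298.15):
    """-T*dS in kJ/mol for collapsing the solution ensemble into one bound conformer."""
    s_free = conformational_entropy(populations)
    s_bound = 0.0  # assumption: a single bound conformer, so p = 1 and S = 0
    return -temperature * (s_bound - s_free)

# Hypothetical ligand with 10 equally populated conformers in solution.
penalty = binding_entropy_penalty([1.0] * 10)
print(f"-T*dS = {penalty:.2f} kJ/mol")
```

For N equally populated conformers this reduces to RT ln N, so the penalty grows only logarithmically with the number of conformers. Real ensembles are neither discrete nor equally populated, which is part of why the problem is hard.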

But we were just getting started. A very large part of the entropy change when a ligand binds to a protein comes from the displacement of water molecules in the active site, essentially their liberation from being constrained prisoners of the protein to free-floating entities in the bulk. A significant advance in trying to take this factor into account was an approach that explicitly and dynamically calculated the enthalpy, entropy and therefore the free energy of bound waters in proteins. We have now reached the point where we can at least think of doing a reasonable calculation on such water molecules. But water molecules are often ill-localized in protein crystal structures because of low resolution, inadequate refinement and other reasons. It's not easy to perform such calculations for arbitrary proteins without crystal structures.

However, a large piece of the puzzle that's still missing is the entropy of the protein, which is extremely difficult to calculate on many fronts. Firstly, the dynamics of the protein is often not captured by a static x-ray structure, so any attempt to calculate protein entropy in the presence and absence of ligands would have to shake the protein around. Currently the favored process for doing this is molecular dynamics (MD), which suffers from its own problems, most notably the accuracy of what's under the hood, namely force fields. Secondly, even if we can calculate the total entropy change, what we really need to know is how the entropy is distributed among the various modes, since only some of these modes are affected by ligand binding. An example of the kind of situation in which such details would be important is the case of slow, tight-binding inhibitors illustrated in the paper. The example is of two different prostaglandin synthase inhibitors which demonstrate almost identical binding orientations in the crystal structure. Yet one is a weak-binding inhibitor that dissociates rapidly and the other is slow and tight-binding. Only a dynamic treatment of entropy can explain such differences, and we are still quite far from being able to do this in the general case.

2. Uncertainty: Out of all the hurdles facing the successful application and development of modeling in any field, this might be the most fundamental. To reiterate, almost every kind of modeling starts by using a training set of molecules for which the data is known and then proceeds to apply the results from this training set to a test set for which the results are unknown. Successful modeling hinges on the expectation that the data in the test set is sufficiently similar to that in the training set. But problems abound. For one thing, similarity is in the eye of the beholder, and what seems to be a reasonable criterion for assuming similarity may turn out to be irrelevant in the real world. Secondly, overfitting is a constant issue, and results that look perfect for the training set can fail abysmally on the test set.

But as the article notes, the problems go further and the devil's in the details. Modeling studies very rarely try to quantify the exact differences between the two sets and the error resulting from that difference. What's needed is an estimate of predictive uncertainty for single data points, something which is virtually non-existent. The article notes the seemingly obvious but often ignored fact when it says that "there must be something that distinguishes a new candidate compound from the molecules in the training set". This 'something' will often be a function of the data that was ignored when fitting the model to the training set. Outliers which were thrown out because they were...outliers might return with a vengeance in the form of a new set of compounds that are enriched in their particular properties which were ignored.
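One crude way to attach a per-compound warning flag to a prediction is a distance-based applicability-domain check. The sketch below is my own illustration of that general idea, not the method proposed in the paper. It assumes every compound is described by a tuple of already-scaled numerical descriptors, and it uses the largest nearest-neighbour distance inside the training set as a threshold; the descriptor values and function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_training_distance(query, training_set):
    """Distance from a query compound to its closest training compound."""
    return min(euclidean(query, t) for t in training_set)

def domain_threshold(training_set):
    """Largest nearest-neighbour distance observed within the training set itself."""
    distances = []
    for i, point in enumerate(training_set):
        others = training_set[:i] + training_set[i + 1:]
        distances.append(nearest_training_distance(point, others))
    return max(distances)

def in_domain(query, training_set):
    """True if the query is no further from the training set than its members are from each other."""
    return nearest_training_distance(query, training_set) <= domain_threshold(training_set)

# Hypothetical, already-scaled two-descriptor training set.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(in_domain((0.5, 0.5), training))  # sits among the training compounds
print(in_domain((5.0, 5.0), training))  # far outside the training compounds
```

A flag like this says nothing about how wrong a prediction is, only that the model has no business being trusted there. And it inherits the similarity problem noted above: the answer depends entirely on which descriptors and which distance you chose.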

But more fundamentally, the very nature of the model used to fit the training set may be severely compromised. In its simplest incarnation for instance, linear regression may be used to fit data points to a set of relationships that are inherently non-linear. In addition, descriptors (such as molecular properties supposedly related to biological activity) may not be independent. As the paper notes, "The tools are inadequate when the model is non-linear or the descriptors are correlated, and one of these conditions always holds when drug responses and biological activity are involved". This problem penetrates into every level of drug discovery modeling, from basic molecular level QSAR to higher-level clinical or toxicological modeling. Only a judicious and high-quality application of statistics, constant validation, and a willingness to wait (for publication, press releases etc.) until the entire analysis is available will preclude erroneous results from seeing the light of day.
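Checking descriptors for independence is cheap to do before any model is fitted. As a minimal sketch, assuming each descriptor is simply a list of values over the same compounds, the code below computes pairwise Pearson correlations and flags pairs above a cutoff. The descriptor names, the made-up values and the 0.9 cutoff are all hypothetical choices of mine, and Pearson's r only detects linear dependence.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length value lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlated_pairs(descriptors, cutoff=0.9):
    """Return (name_a, name_b, r) for every descriptor pair with |r| >= cutoff."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(descriptors.items(), 2):
        r = pearson(a, b)
        if abs(r) >= cutoff:
            flagged.append((name_a, name_b, r))
    return flagged

# Made-up descriptor values for five hypothetical compounds.
descriptors = {
    "mol_weight": [180.0, 250.0, 310.0, 420.0, 500.0],
    "heavy_atoms": [13, 18, 22, 30, 36],
    "logp": [1.2, 3.5, 0.8, 2.9, 1.6],
}
for name_a, name_b, r in correlated_pairs(descriptors):
    print(f"{name_a} vs {name_b}: r = {r:.3f}")
```

Here molecular weight and heavy-atom count are flagged as near duplicates, which is exactly the kind of redundancy that makes regression coefficients uninterpretable.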

3. Data Curation: This is an issue that should be of enormous interest not just to modelers but to all kinds of chemical and biological scientists concerned about information accuracy. The well-known principle of Garbage In, Garbage Out (GIGO) is at work here. The bottom line is that there is an enormous amount of chemical data on the internet that is flawed. For instance there are cases where incorrect structures were inferred from correct names of compounds:

"The structure of gallamine triethiodide is a good illustrative example where many major databases ended up containing the same mistaken datum. Until mid-2011, anyone relying on an internet search would have erroneously concluded that gallamine triethiodide is a tribasic amine. The error resulted from mis-parsing the common name at some point as meaning that the compound is a salt of gallamine and 'ethiodidic acid,' identifying gallamine as the active component and retrieving the relevant structure. In fact, gallamine triethiodide is what you get when you react gallamine with three equivalents of ethyl iodide"

So gallamine triethiodide is the tris-quaternary ammonium salt, in which each nitrogen has been ethylated rather than protonated, not the tribasic amine. Assuming otherwise can only lead to chemical mutilation and death. And this case is hardly unique. An equally common problem is simply assigning the wrong ionization state for chemical compounds as illustrated at the beginning of the post. I have already mentioned this as a rookie mistake, but nobody is immune to it. It hardly needs mentioning that any attempt to model an incorrect structure will give completely wrong results. The bigger problem of course is when the results seem right and prevent us from locating the error; for example, docking an incorrectly positively charged structure into a negatively charged binding site will give very promising but completely spurious results.
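A cheap sanity check that would catch this particular class of error is to tally formal charges per fragment before a structure goes anywhere near a docking run. The following is a rough sketch under loud assumptions: it only reads explicit charges on bracket atoms in a SMILES string using regular expressions, it is no substitute for a real cheminformatics toolkit, and the two SMILES strings are ones I wrote by hand for illustration and should be checked against a curated source before use.

```python
import re

# Bracket atoms such as [N+], [O-], [NH3+], [Fe+2] or [Cu++].
BRACKET_ATOM = re.compile(r"\[[^\]]*\]")
# A charge token inside a bracket atom: sign plus digits, or repeated signs.
CHARGE = re.compile(r"(\+\d+|-\d+|\++|-+)")

def bracket_charge(atom):
    """Formal charge written on a single bracket atom (0 if none is written)."""
    match = CHARGE.search(atom)
    if match is None:
        return 0
    token = match.group(1)
    sign = 1 if token[0] == "+" else -1
    if token[1:].isdigit():
        return sign * int(token[1:])
    return sign * len(token)

def fragment_charges(smiles):
    """Net formal charge of each dot-separated fragment in a SMILES string."""
    return [
        sum(bracket_charge(atom) for atom in BRACKET_ATOM.findall(fragment))
        for fragment in smiles.split(".")
    ]

# Hand-written SMILES for illustration only (assumption, not from the paper).
gallamine_free_base = "CCN(CC)CCOc1cccc(OCCN(CC)CC)c1OCCN(CC)CC"
gallamine_triethiodide = (
    "CC[N+](CC)(CC)CCOc1cccc(OCC[N+](CC)(CC)CC)c1OCC[N+](CC)(CC)CC"
    ".[I-].[I-].[I-]"
)

print(fragment_charges(gallamine_free_base))
print(fragment_charges(gallamine_triethiodide))
```

If the database entry you pulled for a named salt comes back with every fragment neutral, that is a good moment to stop and look at the structure by eye.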

It's hard to see how exactly the entire modeling community can rally together, collectively rectify these errors and establish a common and inviolable standard for performing studies and communicating their results. Until then all we can do is point out the pitfalls, the possibilities, the promises and the perils.

Clark, R., & Waldman, M. (2011). Lions and tigers and bears, oh my! Three barriers to progress in computer-aided molecular design. Journal of Computer-Aided Molecular Design. DOI: 10.1007/s10822-011-9504-3


  1. A timely post! And one that will lead to prodigious downloading on my part very soon. A number of salient points all around, even for those of us who aren't in the drug discovery business.

    I do have a modest inquiry, to which I realize there may not be enough information to properly answer -

    "Firstly, the dynamics of the protein is often not captured by a static x-ray structure so any attempts to calculate protein entropy in the presence and absence of ligands would have to shake the protein around. Currently the favored process for doing this is molecular dynamics (MD) which suffers from its own problems, most notably the accuracy of what's under the hood, namely force fields."

    What if, instead of starting with a static structure from x-ray crystallography, you started with a family of structures as is traditionally determined by NMR? I ask this not just because I want to keep myself employed in the future, but because it seems that it might prove useful.

  2. Many thanks for this great post Ash! It is exceedingly important to keep having these things spelled out for beginning modelers as well as the experts.

  3. MJ: I think that's certainly possible, although I don't know how many proteins considered "druggable" and of interest in industry are amenable to NMR structure determination. I also don't know how facile (or even possible) it is to extract protein entropy from NMR structures.

    I think NMR can certainly complement solid-state structures in other valuable ways, including providing insights into kinetics. NMR may not always be able to provide the kind of high-throughput structural data important for industrial drug design programs. But I certainly think there's a place for it in more proof-of-principle studies. I personally find NMR enormously valuable in shedding light on both protein and ligand conformations (the latter by way of the transfer-NOESY technique).

    Anon: Thanks for reading.

