Field of Science

On reproducibility in modeling

A recent issue of Science has an article discussing an issue that has been a constant headache for anyone involved with any kind of modeling in drug discovery - the lack of reproducibility in computational science. The author Roger Peng who is a biostatistician at Johns Hopkins talks about modeling standards in general but I think many of his caveats could apply to drug discovery modeling. The problem has been recognized for a few years now but there have been very few concerted efforts to address it.

An old anecdote from my graduate advisor's research drives the point home. He wanted to replicate a protein-ligand docking study done with a compound so he contacted the scientist who had performed the study and processed the protein and ligand according to the former's protocol. He appropriately adjusted the parameters and ran the experiment. To his surprise he got a very different result. He repeated the protocol several times but consistently saw the wrong result. Finally he called up the original researcher. The two went over the protocol a few times and finally realized that the problem lay in a minor but overlooked detail - the two scientists were using slightly different versions of the modeling software. This wasn't even a new version, just an update, but for some reason it was enough to significantly change the results.

These and other problems dot the landscape of modeling in drug discovery. The biggest problem to begin with is of course the sheer lack of reporting of details in modeling studies. I have seen more than my share of papers where the authors find it enough to simply state the name of the software used for modeling. No mention of parameters, versions, inputs, "pre-processing" steps, hardware, operating system, computer time or "expert" tweaking. The latter factor is crucial and I will come back to it. In any case, it's quite obvious that no modeling study can be reproducible without these details. Ironically, the same process that made modeling more accessible to the experimental masses has also encouraged the reporting of incomplete results; the incarnation of simulation as black-box technology has inspired experimentalists to widely use it, but on the flip side it has also discouraged many from being concerned about communicating under-the-hood details.

A related problem is the lack of objective statistical validation in reporting modeling results, a very important topic that has been highlighted recently. Even when protocols are supposedly accurately described, the absence of error bars or statistical variation means that one can get a different result even if the original recipe is meticulously followed. Even simple things like docking runs can give slightly different numbers on the same system, so it's important to be mindful of variation in the results along with their probable causes. Feynman talked about the irreproducibility of individual experiments in quantum mechanics, and while it's not quite that bad in modeling, it's still not irrelevant.

This brings us to one of those important but often unquantifiable factors in successful modeling campaigns - the role of expert knowledge and intuition. Since modeling is still an inexact science (and will probably remain so for the foreseeable future), intuition, gut feelings and a "feel" for the particular system under consideration based on experience can often be an important part of massaging the protocol to deliver the desired results. At least in some cases these intangibles are captured in any number of little tweaks, from constraining the geometry of certain parts of a molecule based on past knowledge to suddenly using a previously unexpected technique to improve the clarity of the data. A lot of this is never reported in papers and some of it probably can't be. But is there a way to capture and communicate at least the tangible part of this kind of thinking?

The paper alludes to a possible simple solution and this solution will have to be implemented by journals. Any modeling protocol generates a log file which can be easily interpreted by the relevant program. In case of some modeling software like Schrodinger, there's also a script that records every step in a format comprehensible to the program. Almost any little tweak that you make is usually recorded in these files or scripts. A log file is more accurate than an English language description at documenting concrete steps. One can imagine a generic log file- generating program which can record the steps across different modeling programs. This kind of venture will need collaboration between different software companies but it could be very useful in providing a single log file that captures as much of both the tangible and intangible thought processes of the modeler as possible. Journals could insist that authors upload these log files and make them available to the community.

Ultimately it's journals which can play the biggest role in the implementation of rigorous and useful modeling standards. In the Science article the author describes a very useful system of communicating modeling results used by the journal Biostatistics. Under this system authors doing simulation can request a "reproducibility review" in which one of the associate editors runs the protocols using the code supplied by the authors. Papers which pass this test are clearly flagged as "R" - reviewed for reproducibility. At the very least, this system gives readers a way to distinguish rigorously validated papers from others so that they know which ones to trust more. You would think that there would be backlash against the system from those who don't want to explicitly display the lack of verification of their protocols, but the fact that it's working seems to indicate its value to the community at large.

Unfortunately in case of drug discovery, any such system will have to deal with the problem of proprietary data. There are several papers without such data which could also benefit from this system, but there can be ways to handle proprietary data. Even proprietary data can be amenable to partial reproducibility. In a typical example for instance, molecular structures which are proprietary could be encoded into special organization-specific formats that are hard to decode (an example would be formats used by OpenEye or Rosetta). One could still run a set of modeling protocols on this cryptic data set and generate statistics without revealing the identity of the structures. Naturally there will have to be safeguards against the misuse of any such evaluation but it's hard to see why they would be difficult to institute.

Finally, it's only a community-wide effort equally comprised of industry and academia which can lead to the validation and use of successful modeling protocols. The article suggests creating a kind of "CodeMed Central" repository akin to PubMed Central, and I think modeling could greatly benefit from such a central data source. Code for successful protocols in virtual screening or homology modeling or molecular dynamics or what have you can be uploaded to a site (along with the log files of course). Not only would these protocols be used to verify their reproducibility, but they could also be used to practically aid data extraction from similar systems. The community as a whole would benefit.

Before there's any data generation or sharing, before there's any drawing of conclusions, before there's any advancement of scientific knowledge, there's reproducibility, a scientific virtue that has guided every field of science since its modern origin. Sadly this virtue has been neglected in modeling, so it's about time that we pay more attention to it.

Peng, R. (2011). Reproducible Research in Computational Science Science, 334 (6060), 1226-1227 DOI: 10.1126/science.1213847


  1. Computational scientists serious about reproducibility might benefit from this new software project, Sumatra

  2. I remember one of the cases: running GOLD several times for the same protein-ligand complex will give you different results. Because of the genetic algorithm used, but if you provide the seed used - it will give the almost the same result.

    So, we need standards! PMML is not the best choice...maybe some other solutions.

  3. There has been a similar problem in functional MRI work (fMRI) showing which areas of the brain were activated by particular mental activities. You never saw the original data, just the final images, which usually were quite tweaked.

    I began to suspect the results, when fMRIs always confirmed the initial hypothesis of the article. There were very few unexpected results. Scientific inquiry just isn't like that. The result was the following Science 2 August '02 p. 749 it is stated that "neuroimaging has a reputation for producing "pretty pictures" but not replicable data. It has been characterized as pseudocolor phrenology.

  4. Thanks for that interesting piece of information. You know your model is in trouble when
    a. It can be used to justify any result.
    b. It always predicts what you see.


Markup Key:
- <b>bold</b> = bold
- <i>italic</i> = italic
- <a href="">FoS</a> = FoS