Field of Science

Reproducibility in molecular modeling research

When I was a graduate student, a pet peeve my advisor and I shared was how rarely papers included 3D coordinates for the molecules they described. A typical paper on conformational analysis would claim to derive a solution conformation for a druglike molecule and present a picture of the conformation, but without 3D coordinates or an SDF or PDB file, detailed inspection of the structure was impossible. Sadly, this was in line with a more general lack of structural information in molecular modeling and structural chemistry articles.

Last year Pat Walters (Vertex) wrote an editorial in the Journal of Chemical Information and Modeling which I have wanted to highlight for a while. Walters laments the general lack of reproducibility in the molecular modeling literature and notes that while reproducibility is taken for granted in experimental science, there has been an unusual dearth of it in computational science. He also makes it clear that authors are running out of excuses for not making things like original source code and 3D coordinates available under the pretext of a lack of standard hardware and software platforms.
It is possible that the wide array of computer hardware platforms available 20 years ago would have made supporting a particular code base more difficult. However, over the last 10 years, the world seems to have settled on a small number of hardware platforms, dramatically reducing any sort of support burden. In fact, modern virtual machine technologies have made it almost trivial to install and run software developed in a different computing environment...
When chemical structures are provided, they are typically presented as structure drawings that cannot be readily translated into a machine-readable format. When groups do go to the trouble of redrawing dozens of structures, it is almost inevitable that structure errors will be introduced. Years ago, one could have argued that too many file formats existed and that it was difficult to agree on a common format for distribution. However, over the last 10 years, the community seems to have agreed on SMILES and MDL Mol or SD files as a standard means of distribution. Both of these formats can be easily processed and translated using freely available software such as OpenBabel, CDK, and RDKit. At the current time, there do not seem to be any technical impediments to the distribution of structures and data in electronic form.
Indeed, even experimental scientists are now quite familiar with the SD and SMILES file formats, and standard chemical illustration software like ChemDraw is now able to readily handle such formats, so nobody should really claim to be hobbled by the lack of a standard communication language.
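
As a rough illustration of how little is needed to share 3D coordinates, the sketch below writes a single SD record (a V2000 molfile block followed by data fields) using only the Python standard library. The function name `to_sd_record`, the water geometry, and the `SMILES` data field are illustrative assumptions and are not taken from Walters's editorial. The writer covers only the basic fixed-width columns; in practice a toolkit such as RDKit or OpenBabel would normally generate the file.

```python
def to_sd_record(name, atoms, bonds, data=None):
    """Return one SD-file record (V2000 molfile plus data items) as text.

    atoms: list of (element_symbol, x, y, z), coordinates in angstroms
    bonds: list of (first_atom, second_atom, bond_order), 1-based indices
    data:  optional dict of data fields, e.g. {"SMILES": "O"}
    """
    # Header block: title line, program/timestamp line, comment line.
    lines = [name, "", ""]
    # Counts line: atom count, bond count, unused fields, version tag.
    lines.append(f"{len(atoms):3d}{len(bonds):3d}" + "  0" * 8 + "999 V2000")
    # Atom block: x, y, z in 10.4f columns, then the element symbol.
    for symbol, x, y, z in atoms:
        lines.append(f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3} 0" + "  0" * 11)
    # Bond block: the two atom indices, the bond order, no stereo flag.
    for first, second, order in bonds:
        lines.append(f"{first:3d}{second:3d}{order:3d}  0")
    lines.append("M  END")
    # Data items: a "> <NAME>" header, the value, then a blank line.
    for key, value in (data or {}).items():
        lines.extend([f"> <{key}>", str(value), ""])
    lines.append("$$$$")  # record terminator
    return "\n".join(lines) + "\n"


# Illustrative water geometry (approximate values, not from any paper).
water_atoms = [
    ("O", 0.0, 0.0, 0.1173),
    ("H", 0.0, 0.7572, -0.4692),
    ("H", 0.0, -0.7572, -0.4692),
]
water_bonds = [(1, 2, 1), (1, 3, 1)]

record = to_sd_record("water", water_atoms, water_bonds, {"SMILES": "O"})
print(record)
```

The resulting text can be saved with an `.sdf` extension and attached as supporting information, where any of the toolkits mentioned above should be able to read it back.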

Walters proposes some commonsense guidelines for aiding reproducibility in computational research, foremost of which is to include source code. These guidelines mirror some of the suggestions I had made in a previous post. Unfortunately, source code cannot always be released for proprietary reasons, but as Walters notes, executable versions of the code can still be offered. The other guidelines are also simple and should be required for any computational submission: mandatory provision of SD files; inclusion of scripts and parameter files for verification; a clear description of the method, preferably with an example; and an emphasis by reviewers on the reproducibility aspects of a study.
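
One way the "scripts and parameter files for verification" guideline might look in practice is a small manifest shipped alongside the supporting files. The sketch below is only an assumption about how that could be done, not something the editorial prescribes: it records run parameters together with SHA-256 checksums of the input files, so a reader can confirm they are working from the same data. The file name `ligands.sdf` and the parameter names are made up for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths, parameters):
    """Collect run parameters and per-file checksums in one dictionary."""
    return {
        "parameters": parameters,
        "files": {Path(p).name: sha256_of(p) for p in paths},
    }


def verify_manifest(manifest, directory):
    """Return names of files that are missing or whose checksum differs."""
    mismatched = []
    for name, expected in manifest["files"].items():
        path = Path(directory) / name
        if not path.is_file() or sha256_of(path) != expected:
            mismatched.append(name)
    return mismatched


# Example in a throwaway directory with hypothetical parameter names.
with tempfile.TemporaryDirectory() as workdir:
    sd_file = Path(workdir) / "ligands.sdf"
    sd_file.write_text("placeholder structure data\n")

    manifest = build_manifest(
        [sd_file], {"max_conformers": 100, "solvent": "water"}
    )
    print(json.dumps(manifest, indent=2))

    # Files are untouched, so nothing is reported.
    unchanged = verify_manifest(manifest, workdir)

    # Editing a file after the manifest was written is detected.
    sd_file.write_text("edited structure data\n")
    changed = verify_manifest(manifest, workdir)
```

A manifest like this does not replace the source code or the SD files themselves; it only makes it easier to check that the files being processed are the ones the authors used.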

As many molecular modelers have pointed out in the recent past (Ant Nicholls from OpenEye, for instance), it's hard to judge even supposedly successful molecular modeling results because the relevant statistical validation has not been done and because there is scant method comparison. At the heart of validation, however, is simply reproduction, because your opinion is only as good as my processing of your data. The simple guidelines from this editorial should go some way toward establishing the right benchmarks.


  1. A minor (but important) typo:
    "ChemDraw is not able to readily handle such formats" should be "ChemDraw is now able to readily handle such formats"

    and I definitely agree with the sentiments expressed. There is essentially no limitation on the size of files you can include in Supplemental/Supporting Information with a journal article, so including structures (for reproducibility or further analysis by others) should be encouraged by editors and reviewers.

    1. Thanks, fixed. I agree that neither size of data nor concerns about file formats should be limiting problems at this point in time.

