John Pople was a towering
figure in theoretical and computational chemistry. He contributed to several
aspects of the field, meticulously consolidated those aspects into rigorous
computer algorithms for the masses, and received a Nobel Prize for his efforts. The Gaussian
set of programs that he pioneered is now a mainstay of almost every lab in the
world that does any kind of molecular calculation at the quantum level.
The Gaussian website carries a tribute to Pople from his former student (and president of Gaussian)
Michael Frisch. Of particular interest are Pople’s views on the role of
theories and models in chemistry, which make for interesting contemplation not
just for chemists but really for anyone who uses theory and modeling.
- Theorists should compute what is measured, not just what is easy to calculate.
- Theorists should study systems people care about, not just what is easy or inexpensive to study.
Both these points are well taken, as long as one understands that
it’s often important to perform calculations on ‘easy’ model systems to
benchmark techniques and software (think of spherical cows…). However, this advice
also captures a key aspect of modeling that’s often lost on people who simulate more complex
systems. For instance, the thrust of a lot of protein-small molecule modeling is
in determining or rationalizing the binding affinity between the protein and
the small molecule. This is an important goal, but it’s often quite
insufficient for understanding the ‘true’ binding affinity between the two
components in the highly complex milieu of a cell, where other proteins, ions,
cofactors and water jostle for attention with the small molecule. Thus, while
modelers should indeed try to optimize the affinity of their small molecules
for proteins, they should also try to understand how these calculations might
translate to a more biological context.
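To put a number on what these calculations deliver: what is usually computed is a binding free energy, which connects to the measured dissociation constant through the standard textbook relation below, stated here only to make the stakes concrete.

$$\Delta G^{\circ}_{\mathrm{bind}} = RT\,\ln\!\left(\frac{K_d}{c^{\circ}}\right), \qquad K_d = c^{\circ}\,\exp\!\left(\frac{\Delta G^{\circ}_{\mathrm{bind}}}{RT}\right)$$

At room temperature RT is about 0.59 kcal/mol, so an error of RT ln 10, roughly 1.4 kcal/mol, in the computed free energy already shifts the predicted Kd by an order of magnitude, and that is before any of the cellular complications above enter the picture.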
- Models should be calibrated carefully and the results presented with scrupulous honesty about their weaknesses as well as their strengths.
This is another aspect of theory and modeling that’s often lost in
the din of communicating results that seem to make sense. The importance of
training sets in validating models on known systems is well-understood,
although even in this case the right kind of statistics isn’t always applied to
get a real sense of the model’s behavior. But one of the simpler problems with
training sets is that they are often incomplete and miss essential features
that are rampant among the real world’s test sets (more pithily, all real cows
as far as we know are non-spherical). This is where Pople’s point about
presenting the strengths and weaknesses of models applies: if you are
unsure how similar the test case is to the training set, let the
experimentalists know about this limitation. Pople’s admonition also speaks to
the more general principle of always communicating the degree of confidence in a
model to the experimentalists. Often even a crude assessment of this confidence can
help prioritize which experiments should be done right away and which ones should first
be weighed against their cost and ease of implementation.
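To make that concrete, here is a minimal sketch of what even a crude confidence assessment could look like; the choice of model, descriptors and the distance cutoff are illustrative assumptions of mine, not a recommendation from any particular package.

```python
# Minimal sketch: cross-validated error plus a crude applicability-domain flag.
# The model, descriptors and the default distance cutoff are illustrative
# assumptions; the point is simply to attach a confidence statement to every
# prediction that gets passed on to the experimentalists.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def report_model_quality(X_train, y_train):
    """Cross-validated RMSE with a spread, rather than a single optimistic fit statistic."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    scores = cross_val_score(model, X_train, y_train,
                             scoring="neg_root_mean_squared_error", cv=5)
    rmse = -scores
    print(f"5-fold CV RMSE: {rmse.mean():.2f} +/- {rmse.std():.2f}")
    return model.fit(X_train, y_train)

def predict_with_confidence(model, X_train, X_test, distance_cutoff=1.0):
    """Flag test cases whose nearest training-set neighbor is beyond the cutoff."""
    preds = model.predict(X_test)
    for i, x in enumerate(np.asarray(X_test)):
        nearest = np.min(np.linalg.norm(np.asarray(X_train) - x, axis=1))
        note = "LOW confidence: outside training domain" if nearest > distance_cutoff \
               else "within training domain"
        print(f"compound {i}: predicted {preds[i]:.2f}  [{note}]")
    return preds
```

Distance in descriptor space is only the bluntest proxy for ‘similarity to the training set’, but even this level of bookkeeping gives experimentalists something concrete to weigh against the cost of the next assay.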
- One should recognize the strengths as well as the weaknesses of other people's models and learn from them.
Here we are talking about plagiarism in the best sense of the
word. It is key to be able to compare different methods and borrow from
their strengths. But comparing methods is also important for another, more
elemental reason: without proper comparison you might often be misled into
thinking that your method actually works, and more importantly that it works
because of a chain of causality embedded in the technique. But if a simpler
method works as well as your technique, then perhaps your technique worked not because of but in spite of the causal chain that appears so logical to you. A case
in point is the whole field of molecular dynamics: Ant Nicholls from OpenEye
has made the argument that you can’t really trust MD as a robust and ‘real’
technique if simpler methods are giving you the answer (often faster).
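A minimal sketch of the kind of sanity check this argument implies: score the expensive method against deliberately simple baselines on the same data before crediting its causal story. Everything here (the function names, the choice of RMSE and of a predict-the-mean baseline) is an illustrative assumption.

```python
# Sketch: compare an expensive method's predictions against trivial baselines.
# `expensive_predictions` stands in for whatever the elaborate technique produced
# (e.g. MD-derived estimates); the arrays are placeholders supplied by the caller.
import numpy as np

def rmse(pred, truth):
    """Root-mean-square error between predictions and measured values."""
    pred, truth = np.asarray(pred, dtype=float), np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

def compare_to_baselines(expensive_predictions, simple_predictions, truth):
    """If a dumb baseline does about as well, the elaborate causal story needs more evidence."""
    truth = np.asarray(truth, dtype=float)
    results = {
        "expensive method":          rmse(expensive_predictions, truth),
        "simple method":             rmse(simple_predictions, truth),
        "predict-the-mean baseline": rmse(np.full_like(truth, truth.mean()), truth),
    }
    for name, err in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{name:28s} RMSE = {err:.2f}")
    return results
```

If the predict-the-mean baseline or the simple method lands within noise of the expensive one, the expensive method may still be fine, but the comparison has removed the evidence that it works for the reasons its causal chain suggests.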
- If a model is worth implementing in software, it should be implemented in a way which is both efficient and easy to use. There is no point in creating models which are not useful to other chemists.
This should be an obvious point, but it isn't always treated as one. One of the resounding
truths in the entire field of modeling and simulation is that the best
techniques are the ones which are readily accessible to other scientists and
provided to them cheaply or free of cost. Gaussian itself is a good example –
even today a single user license is offered for about $1500. Providing
user-friendly graphical interfaces seems like a trivial corollary of this
principle, but it can make a world of difference for non-specialists. The
Schrodinger suite is an especially good example of user-friendly GUIs. Conversely,
software for which a premium was charged has often died a slow death, simply because very
few people could use, validate and improve it.
There do
seem to be some exceptions to this rule. For instance, the protein-modeling
program Rosetta is still rather expensive for general industrial use. More importantly,
Rosetta seems to be particularly user-unfriendly and is of greatest use to
descendants of David Baker’s laboratory at the University of Washington.
However, the program has still seen some very notable successes, largely because
the Baker tribe counts hundreds of accomplished people who are still very
actively developing and using the program.
Notwithstanding such exceptions, though, it seems almost inevitable in the age of open-source
software that only the easiest-to-use and cheapest programs will be widespread
and successful, with lesser competitors culled in a ruthlessly Darwinian
process.
Oh, that Gaussian could be one of those culled!
The price is high for academics and worse for those who can't prove academic affiliation. The input format is fairly dreadful to write or read by hand. The parallel scaling is poor though serial performance is quite good, if you are one of the five computational chemists in the world who does not yet have a multicore computer.
On the plus side it has a lot of capabilities.
Its biggest advantage isn't broad capabilities though. A bevy of programs have (e.g.) the ability to do a geometry optimization followed by frequency calculation with popular DFT functionals and basis sets. Buying Gaussian for that is like buying a full Photoshop license to put captions on funny cat pictures.
No, the seductive appeal of Gaussian is ease of replication: its software defaults are the defaults of the thousands of prior publications that used Gaussian. How are core orbitals chosen for freezing in an MP2 calculation? How are grids generated for DFT? It's too much work to replicate prior Gaussian results to the microhartree in another program. Just buy Gaussian and don't think about it. Everyone else does.
On the one hand it irks me that Gaussian so dominates published results even when calculations could have been done in many programs. On the other hand I can understand that non-specialists in particular may not wish to save a few thousand dollars at the expense of a lot of additional learning. And on the gripping hand I wonder how robust results are when cross-software replication is attempted less frequently than it should be, and implicit software choices lead to different results.
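One modest way to ease that replication worry is to spell out the implicit choices in the input instead of leaning on defaults, so a published result can at least be restated unambiguously for another program. A hedged sketch of that habit follows; the keyword spellings reflect my reading of recent Gaussian documentation and should be checked against the manual for your version, since defaults such as the DFT integration grid have changed between releases.

```python
# Sketch: write a Gaussian-style input deck with the usual implicit choices made
# explicit (integration grid, SCF convergence), so a published result is easier to
# restate in another program. Keyword spellings are my assumption from recent
# Gaussian documentation; verify them against the manual for your version.
# For an MP2 job, the frozen-core choice would analogously be stated as
# MP2(FC) or MP2(Full) rather than left to the default.
def write_input(filename, title, charge, multiplicity, atoms):
    route = ("#P B3LYP/6-31G(d) Opt Freq "
             "Integral(Grid=UltraFine) "  # state the DFT grid rather than trust the default
             "SCF=Tight")                 # state the SCF convergence criterion
    with open(filename, "w") as f:
        f.write("%NProcShared=4\n%Mem=2GB\n")          # Link 0 resource requests
        f.write(route + "\n\n")                        # route section, then blank line
        f.write(title + "\n\n")                        # title line, then blank line
        f.write(f"{charge} {multiplicity}\n")          # charge and spin multiplicity
        for symbol, x, y, z in atoms:                  # Cartesian coordinates in Angstroms
            f.write(f"{symbol:2s} {x:12.6f} {y:12.6f} {z:12.6f}\n")
        f.write("\n")                                  # terminating blank line

# Example: a water molecule with an illustrative geometry.
write_input("water_opt.gjf",
            "B3LYP/6-31G(d) opt+freq with grid and SCF settings stated explicitly",
            0, 1,
            [("O", 0.000000, 0.000000, 0.117300),
             ("H", 0.000000, 0.757200, -0.469200),
             ("H", 0.000000, -0.757200, -0.469200)])
```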