John Pople was a towering
figure in theoretical and computational chemistry. He contributed to several
aspects of the field, meticulously consolidated those aspects into rigorous
computer algorithms for the masses, and received a Nobel Prize for his efforts. The Gaussian
set of programs that he pioneered is now a mainstay of almost every lab in the
world that does any kind of molecular calculation at the quantum level.
The Gaussian website carries a tribute to Pople from his former student (and president of Gaussian)
Michael Frisch. Of particular interest are Pople’s views on the role of
theories and models in chemistry, which make for interesting contemplation not
just for chemists but really for anyone who uses theory and modeling.
- Theorists should compute what is measured, not just what is easy to calculate.
- Theorists should study systems people care about, not just what is easy or inexpensive to study.
Both these points are well taken, as long as one understands that
it’s often important to perform calculations on ‘easy’ model systems to
benchmark techniques and software (think of spherical cows…). However, this advice
also captures a key aspect of modeling that’s often lost on people who simulate more complex
systems. For instance, the thrust of a lot of protein-small molecule modeling is
in determining or rationalizing the binding affinity between the protein and
the small molecule. This is an important goal, but it’s often quite
insufficient for understanding the ‘true’ binding affinity between the two
components in the highly complex milieu of a cell, where other proteins, ions,
cofactors and water jostle for attention with the small molecule. Thus, while
modelers should indeed try to optimize the affinity of their small molecules
for proteins, they should also try to understand how these calculations might
translate to a more biological context.
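To put a number on what these calculations deliver: what is usually computed is a binding free energy, which connects to the measured dissociation constant through the standard textbook relation below, stated here only to make the stakes concrete.

$$\Delta G^{\circ}_{\mathrm{bind}} = RT\,\ln\!\left(\frac{K_d}{c^{\circ}}\right), \qquad K_d = c^{\circ}\,\exp\!\left(\frac{\Delta G^{\circ}_{\mathrm{bind}}}{RT}\right)$$

At room temperature RT is about 0.59 kcal/mol, so an error of RT ln 10, roughly 1.4 kcal/mol, in the computed free energy already shifts the predicted Kd by an order of magnitude, and that is before any of the cellular complications above enter the picture.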
- Models should be calibrated carefully and the results presented with scrupulous honesty about their weaknesses as well as their strengths.
This is another aspect of theory and modeling that’s often lost in
the din of communicating results that seem to make sense. The importance of
training sets in validating models on known systems is well-understood,
although even in this case the right kind of statistics isn’t always applied to
get a real sense of the model’s behavior. But one of the simpler problems with
training sets is that they are often incomplete and miss essential features
that are rampant among the real world’s test sets (more pithily, all real cows
as far as we know are non-spherical). This is where Pople’s point about
presenting the strengths and weaknesses of models applies: if you are
unsure how similar the test case is to the training set, let the
experimentalists know about this limitation. Pople’s admonition also speaks to
the more general principle of always communicating the degree of confidence in a
model to the experimentalists. Often even a crude assessment of this confidence can
help prioritize which experiments should be done right away and which ones should first
be weighed against their cost and ease of implementation.
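To make that concrete, here is a minimal sketch of what even a crude confidence assessment could look like; the choice of model, descriptors and the distance cutoff are illustrative assumptions of mine, not a recommendation from any particular package.

```python
# Minimal sketch: cross-validated error plus a crude applicability-domain flag.
# The model, descriptors and the default distance cutoff are illustrative
# assumptions; the point is simply to attach a confidence statement to every
# prediction that gets passed on to the experimentalists.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def report_model_quality(X_train, y_train):
    """Cross-validated RMSE with a spread, rather than a single optimistic fit statistic."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    scores = cross_val_score(model, X_train, y_train,
                             scoring="neg_root_mean_squared_error", cv=5)
    rmse = -scores
    print(f"5-fold CV RMSE: {rmse.mean():.2f} +/- {rmse.std():.2f}")
    return model.fit(X_train, y_train)

def predict_with_confidence(model, X_train, X_test, distance_cutoff=1.0):
    """Flag test cases whose nearest training-set neighbor is beyond the cutoff."""
    preds = model.predict(X_test)
    for i, x in enumerate(np.asarray(X_test)):
        nearest = np.min(np.linalg.norm(np.asarray(X_train) - x, axis=1))
        note = "LOW confidence: outside training domain" if nearest > distance_cutoff \
               else "within training domain"
        print(f"compound {i}: predicted {preds[i]:.2f}  [{note}]")
    return preds
```

Distance in descriptor space is only the bluntest proxy for ‘similarity to the training set’, but even this level of bookkeeping gives experimentalists something concrete to weigh against the cost of the next assay.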
- One should recognize the strengths as well as the weaknesses of other people's models and learn from them.
Here we are talking about plagiarism in the best sense of the
word. It is key to be able to compare different methods and borrow from
their strengths. But comparing methods is also important for another, more
elemental reason: without proper comparison you might often be misled into
thinking that your method actually works, and more importantly that it works
because of a chain of causality embedded in the technique. But if a simpler
method works as well as your technique, then perhaps your technique worked not because of but in spite of the causal chain that appears so logical to you. A case
in point is the whole field of molecular dynamics: Ant Nicholls from OpenEye
has made the argument that you can’t really trust MD as a robust and ‘real’
technique if simpler methods are giving you the answer (often faster).
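A minimal sketch of the kind of sanity check this argument implies: score the expensive method against deliberately simple baselines on the same data before crediting its causal story. Everything here (the function names, the choice of RMSE and of a predict-the-mean baseline) is an illustrative assumption.

```python
# Sketch: compare an expensive method's predictions against trivial baselines.
# `expensive_predictions` stands in for whatever the elaborate technique produced
# (e.g. MD-derived estimates); the arrays are placeholders supplied by the caller.
import numpy as np

def rmse(pred, truth):
    """Root-mean-square error between predictions and measured values."""
    pred, truth = np.asarray(pred, dtype=float), np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

def compare_to_baselines(expensive_predictions, simple_predictions, truth):
    """If a dumb baseline does about as well, the elaborate causal story needs more evidence."""
    truth = np.asarray(truth, dtype=float)
    results = {
        "expensive method":          rmse(expensive_predictions, truth),
        "simple method":             rmse(simple_predictions, truth),
        "predict-the-mean baseline": rmse(np.full_like(truth, truth.mean()), truth),
    }
    for name, err in sorted(results.items(), key=lambda kv: kv[1]):
        print(f"{name:28s} RMSE = {err:.2f}")
    return results
```

If the predict-the-mean baseline or the simple method lands within noise of the expensive one, the expensive method may still be fine, but the comparison has removed the evidence that it works for the reasons its causal chain suggests.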
- If a model is worth implementing in software, it should be implemented in a way which is both efficient and easy to use. There is no point in creating models which are not useful to other chemists.
This should be an obvious point, but it isn't always treated as one. One of the resounding
truths in the entire field of modeling and simulation is that the best
techniques are the ones which are readily accessible to other scientists and
provided to them cheaply or free of cost. Gaussian itself is a good example –
even today a single user license is offered for about $1500. Providing
user-friendly graphical interfaces seems like a trivial corollary of this
principle, but it can make a world of difference for non-specialists. The
Schrodinger suite is an especially good example of user-friendly GUIs. Conversely,
software for which a premium was charged has often died a slow death, simply because very
few people could use, validate and improve it.
There do
seem to be some exceptions to this rule. For instance, the protein-modeling
program Rosetta is still rather expensive for general industrial use. More importantly,
Rosetta seems to be particularly user-unfriendly and is of greatest use to
descendants of David Baker’s laboratory at the University of Washington.
However, the program has still seen some very notable successes, largely because
the Baker tribe counts hundreds of accomplished people who are still very
actively developing and using the program.
Notwithstanding such exceptions, though, it seems almost inevitable in the age of open-source
software that only the easiest-to-use and cheapest programs will be widespread
and successful, with lesser competitors culled in a ruthlessly Darwinian
process.
Oh, that Gaussian could be one of those culled!
The price is high for academics and worse for those who can't prove academic affiliation. The input format is fairly dreadful to write or read by hand. The parallel scaling is poor though serial performance is quite good, if you are one of the five computational chemists in the world who does not yet have a multicore computer.
On the plus side it has a lot of capabilities.
Its biggest advantage isn't broad capabilities though. A bevy of programs have (e.g.) the ability to do a geometry optimization followed by frequency calculation with popular DFT functionals and basis sets. Buying Gaussian for that is like buying a full Photoshop license to put captions on funny cat pictures.
No, the seductive appeal of Gaussian is ease of replication: its software defaults are the defaults of the thousands of prior publications that used Gaussian. How are core orbitals chosen for freezing in an MP2 calculation? How are grids generated for DFT? It's too much work to replicate prior Gaussian results to the microhartree in another program. Just buy Gaussian and don't think about it. Everyone else does.
On the one hand it irks me that Gaussian so dominates published results even when calculations could have been done in many programs. On the other hand I can understand that non-specialists in particular may not wish to save a few thousand dollars at the expense of a lot of additional learning. And on the gripping hand I wonder how robust results are when cross-software replication is attempted less frequently than it should be, and implicit software choices lead to different results.
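One modest way to ease that replication worry is to spell out the implicit choices in the input instead of leaning on defaults, so a published result can at least be restated unambiguously for another program. A hedged sketch of that habit follows; the keyword spellings reflect my reading of recent Gaussian documentation and should be checked against the manual for your version, since defaults such as the DFT integration grid have changed between releases.

```python
# Sketch: write a Gaussian-style input deck with the usual implicit choices made
# explicit (integration grid, SCF convergence), so a published result is easier to
# restate in another program. Keyword spellings are my assumption from recent
# Gaussian documentation; verify them against the manual for your version.
# For an MP2 job, the frozen-core choice would analogously be stated as
# MP2(FC) or MP2(Full) rather than left to the default.
def write_input(filename, title, charge, multiplicity, atoms):
    route = ("#P B3LYP/6-31G(d) Opt Freq "
             "Integral(Grid=UltraFine) "  # state the DFT grid rather than trust the default
             "SCF=Tight")                 # state the SCF convergence criterion
    with open(filename, "w") as f:
        f.write("%NProcShared=4\n%Mem=2GB\n")          # Link 0 resource requests
        f.write(route + "\n\n")                        # route section, then blank line
        f.write(title + "\n\n")                        # title line, then blank line
        f.write(f"{charge} {multiplicity}\n")          # charge and spin multiplicity
        for symbol, x, y, z in atoms:                  # Cartesian coordinates in Angstroms
            f.write(f"{symbol:2s} {x:12.6f} {y:12.6f} {z:12.6f}\n")
        f.write("\n")                                  # terminating blank line

# Example: a water molecule with an illustrative geometry.
write_input("water_opt.gjf",
            "B3LYP/6-31G(d) opt+freq with grid and SCF settings stated explicitly",
            0, 1,
            [("O", 0.000000, 0.000000, 0.117300),
             ("H", 0.000000, 0.757200, -0.469200),
             ("H", 0.000000, -0.757200, -0.469200)])
```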