Computational chemistry as an independent discipline has its roots in theoretical chemistry, itself an outgrowth of the revolutions in quantum mechanics in the 1920s and 30s. Theoretical and quantum chemistry advanced rapidly in the postwar era and led to many protocols for calculating molecular and electronic properties, which became amenable to algorithmic implementation once computers came on the scene. Rapid growth in software and hardware in the 80s and 90s led to the transformation of theoretical chemistry into computational chemistry and to the availability of standardized, relatively easy-to-use computer programs like GAUSSIAN. By the end of the first decade of the new century, the field had advanced to a stage where key properties of simple molecular systems such as energies, dipole moments and stable geometries could in many cases be calculated from first principles with an accuracy matching experiment. Developments in computational chemistry were recognized by the Nobel Prize in Chemistry awarded in 1998 to John Pople and Walter Kohn.
In parallel with these theoretical advances, another thread started developing in the 80s which attempted something much more ambitious: to apply the principles of theoretical and computational chemistry to complex systems like proteins and other biological macromolecules and to study their interactions with drugs. The practitioners of this paradigm wisely realized that it would be futile to calculate properties of such complex systems from first principles, which led to parametrized approaches in which properties would be "pre-fit" to experiment rather than calculated ab initio. Typically there would be an extensive set of experimental data (the training set) which would be used to parametrize algorithms which would then be applied to unknown systems (the test set). The adoption of this approach led to molecular mechanics and molecular dynamics, both grounded in classical physics, and to quantitative structure-activity relationships (QSAR), which sought to correlate molecular descriptors of various kinds with biological activity. The first productive approach to docking a small molecule in a protein active site in its lowest energy configuration was developed by Irwin "Tack" Kuntz at UCSF. And beginning in the 60s, Corwin Hansch at Pomona College had already made remarkable forays into QSAR.
These methods gradually started to be applied to actual drug discovery in the pharmaceutical industry. Yet it was easy to see that the field was getting far ahead of itself and in fact even today it suffers from the same challenges that plagued it thirty years back. Firstly, nobody had solved the twin cardinal problems of modeling protein-ligand interactions. The first one was conformational sampling wherein you had to exhaustively search the conformation space of a ligand or protein. The second one was energetic ranking wherein you had to rank these structures, either in their isolated form or in the context of their interactions with a protein. Both of these problems remain the central problems of computation as applied to drug discovery. In the context of QSAR, spurious correlations based on complex combinations of descriptors can easily befuddle its practitioners and create an illusion of causation. Furthermore, there have been various long-standing problems such as the transferability of parameters from a known training set to an unknown test set, the calculation of solvation energies for even the simplest molecules and the estimation of entropies. And finally, it's all too easy to forget the sheer complexity of the protein systems we are trying to address which display a stunning variety of behaviors, from large conformational changes to allosteric binding to complicated changes in ionization states and interactions with water. The bottom line is that in many cases we just don't understand the system which we are trying to model well enough.
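The danger of spurious QSAR correlations is easy to demonstrate with nothing but random numbers. The sketch below is only an illustration under stated assumptions: the "activity" and the "descriptors" are Gaussian noise, the function names are made up for this example, and the compound and descriptor counts are arbitrary. It shows how the best correlation found by chance can only grow as more candidate descriptors are tried against a small set of compounds.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_chance_correlation(n_compounds, n_descriptors, seed=0):
    """Largest |r| between a random 'activity' and many random 'descriptors'.

    Everything here is noise, so any correlation found is spurious by
    construction (hypothetical setup, for illustration only).
    """
    rng = random.Random(seed)
    activity = [rng.gauss(0.0, 1.0) for _ in range(n_compounds)]
    best = 0.0
    for _ in range(n_descriptors):
        descriptor = [rng.gauss(0.0, 1.0) for _ in range(n_compounds)]
        best = max(best, abs(pearson_r(descriptor, activity)))
    return best


if __name__ == "__main__":
    # 15 compounds, increasing numbers of meaningless descriptors
    for n_desc in (1, 10, 100, 1000):
        print(n_desc, round(best_chance_correlation(15, n_desc), 2))
```

With only fifteen compounds, screening a large descriptor pool will almost always turn up something that looks like a respectable correlation, which is why a high correlation coefficient on the training set says little by itself.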
Not surprisingly, a young field still plagued with multiple problems could be relied upon as no more than a guide when it came to solving practical problems in drug design. Yet the discipline saw unfortunate failures in PR as it was periodically hyped. Even in the 80s there were murmurs about designing drugs using computers alone. Part of the hype unfortunately came from the practitioners themselves who were less than cautious about announcing the strengths and limitations of their approaches. The consequence was that although there continued to be significant advances in both computing power and algorithms, many in the drug discovery community looked at the discipline with a jaundiced eye.
Yet the significance of the problems that the field is trying to address means that it will continue to be promising. What's its future and what would be the most productive direction in which it could be steered? An interesting set of thoughts is offered in a series of articles published in the Journal of Computer-Aided Molecular Design. The articles are written by experienced practitioners in the field and offer a variety of opinions, critiques and analyses which should be read by all those interested in the future of modeling in the life sciences.
Jurgen Bajorath from the University of Bonn, along with his fellow modelers from Novartis, laments the fact that studies in the field have not aspired to a high standard of validation, presentation and reproducibility. This is an important point. No scientific field can advance if there is wide variation in the quality with which its results are presented. When it comes to modeling in drug discovery, the use of statistics and well-defined metrics has been highly subjective, leading to great difficulty in separating the wheat from the chaff and honestly assessing the impact of specific techniques. Rigorous statistical validation in particular has been virtually non-existent, with the highly suspect correlation coefficient being the weapon of choice for many scientists in the field. An important step in emphasizing the virtue of objective statistical methods in modeling was taken by Anthony Nicholls of OpenEye Software, who in a series of important articles laid out the statistical standards and sensible metrics that any well-validated molecular modeling study should aspire to. I suspect that these articles will go down in the annals of the field as key documents.
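One simple way to see why a bare correlation coefficient is suspect is to put a confidence interval around it. The following is a minimal sketch of a percentile bootstrap, a generic statistical technique and not a reproduction of the procedures in the articles mentioned above. The function names and the ten predicted and measured values are hypothetical, made up purely for illustration.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def bootstrap_r_interval(predicted, measured, n_resamples=2000,
                         level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for Pearson r."""
    rng = random.Random(seed)
    n = len(predicted)
    estimates = []
    for _ in range(n_resamples):
        # Resample compound indices with replacement, keeping pairs together
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [predicted[i] for i in idx]
        ys = [measured[i] for i in idx]
        # Skip degenerate resamples in which r is undefined (zero variance)
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue
        estimates.append(pearson_r(xs, ys))
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * len(estimates))]
    upper = estimates[min(len(estimates) - 1,
                          int((1.0 - tail) * len(estimates)))]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical predicted vs. measured activities for ten compounds
    predicted = [5.1, 6.3, 7.0, 5.8, 8.2, 6.9, 7.7, 5.5, 6.1, 7.4]
    measured = [5.6, 6.0, 7.4, 5.2, 7.9, 6.2, 8.1, 6.0, 5.7, 7.0]
    print("r =", round(pearson_r(predicted, measured), 2))
    print("95% interval:", bootstrap_r_interval(predicted, measured))
```

For a data set this small the interval is typically wide, which is the point: two methods whose correlation coefficients differ in the second decimal place may not be distinguishable at all.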
In addition, as MIT physics professor Walter Lewin is fond of emphasizing in his popular lectures, any measurement you make without knowledge of its uncertainty is meaningless. It is remarkable that in a field as fraught with complexity as modeling, there has been a rather insouciant indifference to the estimation of error and uncertainty. Modelers egregiously quote numbers involving protein-ligand energies, dipole moments and other properties to four or six significant figures when in reality those numbers are suspect even to one decimal place. Part of the problem has simply been an insufficient grounding in statistics. Tying every number to its estimated error margin (if it can be estimated at all) will not only give experimentalists and other modelers an accurate feel for the validity of the analysis and guide the ensuing improvement of methods, but will also keep semi-naive interpreters from being overly impressed by the numbers. Whether it's finance or pharmaceutical modeling, it's always a bad idea to get swayed by figures.
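As a small illustration of tying a number to its error margin, here is a hedged sketch that takes replicate estimates of a quantity, computes the mean and its standard error, and reports only as many digits as the uncertainty justifies. The helper names are invented for this example, the five "binding energies" are made-up values, and rounding the uncertainty to one significant figure is one common convention among several.

```python
import math
import statistics


def mean_with_uncertainty(values):
    """Return (mean, standard error of the mean) for replicate estimates."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


def format_with_uncertainty(value, uncertainty):
    """Round the uncertainty to one significant figure and the value to match."""
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    # Decimal position of the leading digit of the uncertainty
    exponent = math.floor(math.log10(uncertainty))
    decimals = max(0, -exponent)
    rounded_unc = round(uncertainty, -exponent)
    rounded_val = round(value, -exponent)
    return f"{rounded_val:.{decimals}f} +/- {rounded_unc:.{decimals}f}"


if __name__ == "__main__":
    # Hypothetical replicate binding energies in kcal/mol
    replicates = [-9.1, -10.2, -8.7, -9.9, -9.3]
    mean, sem = mean_with_uncertainty(replicates)
    print("raw mean:", mean)
    print("reported:", format_with_uncertainty(mean, sem), "kcal/mol")
```

The standard error here captures only the scatter between replicates. Systematic errors from the force field or the scoring function are not included and are usually larger.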
Then there's the whole issue, as the modelers from Novartis emphasize, of spreading the love. The past few years have seen the emergence of several rigorously constructed datasets carefully designed to test and benchmark different modeling algorithms. The problem is that these datasets have been most often validated in an industry that's famous for its secrecy. Until the pharmaceutical industry makes at least some efforts to divulge the results of its studies, a true assessment of the value of modeling methods will always come in fits and starts. I have been recently reading Michael Nielsen's eye-opening book on open science, and it's startling to realize the gains in advancement of knowledge that can result from sharing of problems, solutions and ideas. If modeling is to advance and practically contribute to drug discovery, it's imperative for industry - historically the most valuable generator of any kind of data in drug discovery - to open its vaults and allow scientists to use its wisdom to perfect fruitful techniques and discard unproductive ones.
Perhaps the most interesting article on the future of modeling in drug discovery comes from Arizona State University's Gerald Maggiora, who writes about a topic close to my heart: the limitations of reductionist science in computational drug design. Remember the physics joke about the physicist suggesting a viable solution to the dairy farmer, but one that applies only to spherical cows in a vacuum? Maggiora describes a similar situation in computational chemistry, which has focused on rather unrealistic and highly simplified representations of the real world. Much of the modeling of proteins and ligands focuses on single proteins and single ligands in an implicitly represented solvent. Reality is of course far different, with highly crowded cells constantly buffeting thousands of small molecules, proteins, lipids, metals and water in an unending dance of chemistry and electricity. Molecules interact constantly with multiple proteins, proteins interact with other proteins, changes in local environments significantly impact structure and function, water adopts different guises depending on its location and controlled chaos generally reigns everywhere. As we move from simple molecules to societies of molecules, unexpected emergent properties may kick in which may be invisible at the single molecule level. We haven't even started to tackle this level of complexity in our models and it's a wonder they work at all, but it's clear that any attempt to fruitfully apply computational science to the behavior of molecules in the real world can succeed only when our models include these multiple levels of complexity. As Maggiora says, part of the solution can come from interfacing computational chemistry with other branches of science including biology and engineering.
Maggiora's version of computation in drug discovery thus involves seamlessly integrating computational chemistry, biology, fluid dynamics and other disciplines into an integrated model-building paradigm which bravely crisscrosses problems at all levels of molecular and cellular complexity, unifying techniques from any field it chooses to address the most important problems in drug design and discovery.
Maggiora does not discuss all the details of the effort that would lead to this computational utopia, but I see at least two tantalizing signs around. Until now we have focused on the thermodynamic aspects of molecular design (more specifically, being able to calculate free energies of binding) and while this will remain a challenge in the foreseeable future, it's quite clear that any merging of traditional computational chemistry with higher-order phenomena will involve accurately modeling the kinetics of interactions between proteins and small ligands and between proteins themselves. Modeling kinetics can pave the way to understanding fluxes between various components in biochemical networks, thus leading directly to the interfacing of computational chemistry with network biology. Such advances could provide essential understanding of the non-linear mechanisms that control the production, reactions, down-regulation and turnover of key proteins, a process that itself is at the heart of most higher-order physiological processes.
How can such an understanding come about? Earlier this year, scientists at D. E. Shaw Research performed a study in which they ran a completely blind molecular dynamics simulation of a drug molecule placed outside a kinase protein. By running the simulation for a sufficiently long time, the drug could efficiently sample all configurations and finally lock into its binding site where it stayed put. The model system was simple and we don't know yet how generalizable such a simulation is, but one tantalizing possibility such simulations offer is to be able to correlate calculated protein-drug kinetic binding data to experimental on and off binding rates. The D. E. Shaw method certainly won't be the only computational method to estimate such data, but it offers a glimpse of a future in which accurate computed kinetic data will be available to estimate higher-order fluxes between biomolecules. Such approaches promise revealing insights.
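To see why on and off rates carry information that a binding free energy alone does not, consider the textbook relations for simple one-step, 1:1 binding: the dissociation constant is K_d = k_off / k_on, the standard binding free energy is RT ln(K_d) relative to a 1 M standard state, and the mean residence time is 1 / k_off. The sketch below assumes that two-state picture, which real protein-ligand systems may not obey. The function name and the rate constants are hypothetical and were chosen only so that two ligands share the same K_d.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # kcal/(mol*K)


def binding_summary(k_on, k_off, temperature=298.15):
    """Equilibrium quantities implied by on/off rates for one-step 1:1 binding.

    k_on is in 1/(M*s), k_off is in 1/s, temperature is in kelvin.
    """
    if k_on <= 0 or k_off <= 0:
        raise ValueError("rate constants must be positive")
    k_d = k_off / k_on  # dissociation constant in M
    # Standard binding free energy, 1 M standard state
    delta_g = GAS_CONSTANT_KCAL * temperature * math.log(k_d)
    residence_time = 1.0 / k_off  # mean lifetime of the complex in s
    return {"K_d": k_d, "delta_G": delta_g, "residence_time": residence_time}


if __name__ == "__main__":
    # Two hypothetical ligands with identical affinity but different kinetics
    slow = binding_summary(k_on=1e5, k_off=1e-4)
    fast = binding_summary(k_on=1e7, k_off=1e-2)
    for name, result in (("slow", slow), ("fast", fast)):
        print(name, result)
```

Both ligands have the same nanomolar K_d and therefore the same binding free energy, yet the first stays bound a hundred times longer than the second. A purely thermodynamic calculation cannot tell them apart.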
The other auspicious development concerns the modeling of crowded environments. As mentioned above, no simulation of molecular structure and function can be considered truly realistic until it takes into account the myriad molecular partners that populate a cell's interior. Until now our methods have been too primitive to take these crowded conditions into account. But recent proof-of-principle studies have provided the first glimpse into how protein folding under crowded conditions can be predicted. Hopefully at some point, improved computing power will afford us the capability to simulate multiple small and large molecules in proximity to each other. Only when we can do this can we claim to have taken the first steps in moving from one level of complexity to another.
The future abounds with these possibilities. Whatever the hype, promises and current status of modeling in drug discovery, one thing is clear: the problems are tough, many of them have been well-defined for a long time and all of them without exception are important. Judged by the significance of the problems themselves, computational chemistry remains a field pregnant with possibilities and enormous scope. Calculating the free energy of binding of an arbitrary small molecule to an arbitrary protein still remains a foundational goal. Homing in on the correct, minimal set of molecular descriptors that would allow us to predict the biological activity of any newly synthesized molecule is another. So is being able to simulate protein folding, or calculate the interaction of a drug with multiple protein targets, or model what happens to the concentration of a protein in a cell when a drug is introduced. All these problems really present different sides of the same general challenge of modeling biomolecular systems over an expansive set of levels of complexity. And all of them are tough, involved, multifaceted and essentially interdisciplinary problems that will keep the midnight oil burning in the labs of experimental and computational scientists alike.
But as Jack Kennedy would say, that's precisely why we need to tackle them, not because they are easy but because they are difficult. That's what we have done. That's what we will continue to do.
Bajorath, J. (2011). Computational chemistry in pharmaceutical research: at the crossroads. Journal of Computer-Aided Molecular Design. DOI: 10.1007/s10822-011-9488-z
Martin, E., Ertl, P., Hunt, P., Duca, J., & Lewis, R. (2011). Gazing into the crystal ball; the future of computer-aided drug design. Journal of Computer-Aided Molecular Design. DOI: 10.1007/s10822-011-9487-0
Maggiora, G. (2011). Is there a future for computational chemistry in drug research? Journal of Computer-Aided Molecular Design. DOI: 10.1007/s10822-011-9493-2