Field of Science

The future of computation in drug discovery

Computational chemistry as an independent discipline has its roots in theoretical chemistry, itself an outgrowth of the revolutions in quantum mechanics in the 1920s and 30s. Theoretical and quantum chemistry advanced rapidly in the postwar era and led to many protocols for calculating molecular and electronic properties which became amenable to algorithmic implementation once computers came on the scene. Rapid growth in software and hardware in the 80s and 90s led to the transformation of theoretical chemistry into computational chemistry and to the availability of standardized, relatively easy to use computer programs like GAUSSIAN. By the end of the first decade of the new century, the field had advanced to a stage where key properties of simple molecular systems such as energies, dipole moments and stable geometries could be calculated in many cases from first principles with an accuracy matching experiment. Developments in computational chemistry were recognized by the Nobel Prize for chemistry awarded in 1998 to John Pople and Walter Kohn.

In parallel with these theoretical advances, another thread started developing in the 80s which attempted something much more ambitious: to apply the principles of theoretical and computational chemistry to complex systems like proteins and other biological macromolecules and to study their interactions with drugs. The practitioners of this paradigm wisely realized that it would be futile to calculate properties of such complex systems from first principles, thus leading to the initiation of parametrized approaches in which properties would be "pre-fit" to experiment rather than calculated ab initio. Typically there would be an extensive set of experimental data (the training set) which would be used to parametrize algorithms which would then be applied to unknown systems (the test set). The adoption of this approach led to molecular mechanics and molecular dynamics, both grounded in classical physics, and to quantitative structure activity relationships (QSAR) which sought to correlate molecular descriptors of various kinds to biological activity. The first productive approach to docking a small molecule in a protein active site in its lowest energy configuration was refined by Irwin "Tack" Kuntz at UCSF. And beginning in the 60s, Corwin Hansch at Pomona College had already made remarkable forays into QSAR.

These methods gradually started to be applied to actual drug discovery in the pharmaceutical industry. Yet it was easy to see that the field was getting far ahead of itself and in fact even today it suffers from the same challenges that plagued it thirty years back. Firstly, nobody had solved the twin cardinal problems of modeling protein-ligand interactions. The first one was conformational sampling wherein you had to exhaustively search the conformation space of a ligand or protein. The second one was energetic ranking wherein you had to rank these structures, either in their isolated form or in the context of their interactions with a protein. Both of these problems remain the central problems of computation as applied to drug discovery. In the context of QSAR, spurious correlations based on complex combinations of descriptors can easily befuddle its practitioners and create an illusion of causation. Furthermore, there have been various long-standing problems such as the transferability of parameters from a known training set to an unknown test set, the calculation of solvation energies for even the simplest molecules and the estimation of entropies. And finally, it's all too easy to forget the sheer complexity of the protein systems we are trying to address which display a stunning variety of behaviors, from large conformational changes to allosteric binding to complicated changes in ionization states and interactions with water. The bottom line is that in many cases we just don't understand the system which we are trying to model well enough.
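The danger of spurious QSAR correlations can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: the "descriptors" and "activities" are pure random noise, the function names and parameters are made up for this example, and picking the single best-correlated descriptor stands in for the far more elaborate descriptor selection used in real QSAR work.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_random_descriptor(n_train=15, n_test=200, n_descriptors=500, seed=0):
    """Pick the random descriptor that best 'explains' random activities.

    Everything here is noise (an assumption of this sketch), so any
    correlation found on the training compounds is spurious by construction.
    Returns (r on training set, r of the same descriptor on held-out set).
    """
    rng = random.Random(seed)
    n = n_train + n_test
    activity = [rng.gauss(0, 1) for _ in range(n)]

    best_r, best_desc = 0.0, None
    for _ in range(n_descriptors):
        desc = [rng.gauss(0, 1) for _ in range(n)]
        r = pearson(desc[:n_train], activity[:n_train])
        if abs(r) > abs(best_r):
            best_r, best_desc = r, desc

    # The "winning" descriptor is now checked on compounds it never saw.
    r_test = pearson(best_desc[n_train:], activity[n_train:])
    return best_r, r_test


if __name__ == "__main__":
    r_train, r_test = best_random_descriptor()
    print(f"best training-set r: {r_train:+.2f}")
    print(f"same descriptor on held-out set: {r_test:+.2f}")
```

With a few hundred candidate descriptors and only fifteen training compounds, some descriptor will almost always correlate impressively with activity by chance alone, and that correlation should largely vanish on the held-out compounds.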

Not surprisingly, a young field still plagued with multiple problems could be relied upon as no more than a guide when it came to solving practical problems in drug design. Yet the discipline saw unfortunate failures in PR as it was periodically hyped. Even in the 80s there were murmurs about designing drugs using computers alone. Part of the hype unfortunately came from the practitioners themselves who were less than cautious about announcing the strengths and limitations of their approaches. The consequence was that although there continued to be significant advances in both computing power and algorithms, many in the drug discovery community looked at the discipline with a jaundiced eye.

Yet the significance of the problems that the field is trying to address means that it will continue to be promising. What's its future and what would be the most productive direction in which it could be steered? An interesting collection of thoughts is offered in a set of articles published in the Journal of Computer-Aided Molecular Design. The articles are written by experienced practitioners in the field and offer a variety of opinions, critiques and analyses which should be read by all those interested in the future of modeling in the life sciences.

Jurgen Bajorath from the University of Bonn along with his fellow modelers from Novartis laments the fact that studies in the field have not aspired to a high standard of validation, presentation and reproducibility. This is an important point. No scientific field can advance if there is wide variation in the presentation of the quality of its results. When it comes to modeling in drug discovery, the proper use of statistics and well-defined metrics has been highly subjective, leading to great difficulty in separating the wheat from the chaff and honestly assessing the impact of specific techniques. Rigorous statistical validation in particular has been virtually non-existent, with the highly suspect correlation coefficients being the most refined weapon of choice for many scientists in the field. An important step in emphasizing the virtue of objective statistical methods in modeling was taken by Anthony Nicholls of OpenEye Software who in a series of important articles laid out the statistical standards and sensible metrics that any well-validated molecular modeling study should aspire to. I suspect that these articles will go down in the annals of the field as key documents.

In addition, as MIT physics professor Walter Lewin is fond of constantly emphasizing in his popular lectures, any measurement you make without knowledge of its uncertainty is meaningless. It is remarkable that in a field as fraught with complexity as modeling, there has been a rather insouciant indifference to the estimation of error and uncertainty. Modelers egregiously quote numbers involving protein-ligand energies, dipole moments and other properties to four or six figures of significance when ideally those numbers are suspect even to one decimal point. Part of the problem has simply been an insufficient grounding in statistics. Tying every number to its estimated error margin (if it can be estimated at all) will not only give experimentalists and other modelers an accurate feel for the validity of the analysis and the ensuing improvement of methods but will also keep semi-naive interpreters from being overly impressed by the numbers. Whether it's finance or pharmaceutical modeling, it's always a bad idea to get swayed by figures.
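One simple way of attaching an error margin to a computed number is the bootstrap. The sketch below is a minimal illustration, not a prescription: the binding energies are invented values standing in for repeated estimates of the same quantity, and the function name, resample count and percentile-style interval are choices made for this example.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a simple percentile bootstrap.

    The interval is approximate; it only reflects scatter in the supplied
    values, not systematic errors in the underlying model.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.mean(rng.choice(values) for _ in range(n))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.mean(values), lower, upper


if __name__ == "__main__":
    # Hypothetical repeated estimates of one binding energy, in kcal/mol.
    energies = [-9.1, -8.4, -10.2, -7.9, -9.6, -8.8]
    mean, lower, upper = bootstrap_ci(energies)
    # Report one decimal place: the spread does not justify more.
    print(f"{mean:.1f} kcal/mol (95% interval {lower:.1f} to {upper:.1f})")
```

Quoting the interval alongside the mean makes it obvious how many of the printed digits actually carry information.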

Then there's the whole issue, as the modelers from Novartis emphasize, of spreading the love. The past few years have seen the emergence of several rigorously constructed datasets carefully designed to test and benchmark different modeling algorithms. The problem is that these datasets have been most often validated in an industry that's famous for its secrecy. Until the pharmaceutical industry makes at least some efforts to divulge the results of its studies, a true assessment of the value of modeling methods will always come in fits and starts. I have been recently reading Michael Nielsen's eye-opening book on open science, and it's startling to realize the gains in advancement of knowledge that can result from sharing of problems, solutions and ideas. If modeling is to advance and practically contribute to drug discovery, it's imperative for industry - historically the most valuable generator of any kind of data in drug discovery - to open its vaults and allow scientists to use its wisdom to perfect fruitful techniques and discard unproductive ones.

Perhaps the most interesting article on the future of modeling in drug discovery comes from Arizona State University's Gerald Maggiora who writes about a topic close to my heart - the limitations of reductionist science in computational drug design. Remember the physics joke about the physicist suggesting a viable solution to the dairy farmer, but one that applies only to spherical cows in a vacuum? Maggiora describes a similar situation in computational chemistry which has focused on rather unrealistic and highly simplified representations of the real world. Much of the modeling of proteins and ligands focuses on single proteins and single ligands in an implicitly represented solvent. Reality is of course far different, with highly crowded cells constantly buffeting thousands of small molecules, proteins, lipids, metals and water in an unending dance of chemistry and electricity. Molecules interact constantly with multiple proteins, proteins interact with other proteins, changes in local environments significantly impact structure and function, water adopts different guises depending on its location and controlled chaos generally reigns everywhere. As we move from simple molecules to societies of molecules, unexpected emergent properties may kick in which may be invisible at the single molecule level. We haven't even started to tackle this level of complexity in our models and it's a wonder they work at all, but it's clear that any attempts to fruitfully apply computational science to the behavior of molecules in the real world can only be possible when our models include these multiple levels of complexity. As Maggiora says, part of the solution can come from interfacing computational chemistry with other branches of science including biology and engineering.
Maggiora's version of computation in drug discovery thus involves seamlessly integrating computational chemistry, biology, fluid dynamics and other disciplines into an integrated model-building paradigm which bravely crisscrosses problems at all levels of molecular and cellular complexity, unifying techniques from any field it chooses to address the most important problems in drug design and discovery.

Maggiora does not discuss all the details of effort that would lead to this computational utopia, but I see at least two tantalizing signs around. Until now we have focused on the thermodynamic aspects of molecular design (more specifically, being able to calculate free energies of binding) and while this will remain a challenge in the foreseeable future, it's quite clear that any merging of traditional computational chemistry with higher-order phenomena will involve accurately modeling the kinetics of interactions between proteins and small ligands and between proteins themselves. Modeling kinetics can pave the way to understanding fluxes between various components in biochemical networks, thus leading directly to the interfacing of computational chemistry with network biology. Such advances could provide essential understanding of the non-linear mechanisms that control the production, reactions, down-regulation and turnover of key proteins, a process that itself is at the heart of most higher-order physiological processes.

How can such an understanding come about? Earlier this year, scientists at D. E. Shaw Research performed a study in which they ran a completely blind molecular dynamics simulation of a drug molecule placed outside a kinase protein. By running the simulation for a sufficiently long time, the drug could efficiently sample all configurations and finally lock into its binding site where it stayed put. The model system was simple and we don't know yet how generalizable such a simulation is, but one tantalizing possibility such simulations offer is to be able to correlate calculated protein-drug kinetic binding data to experimental on and off binding rates. The D. E. Shaw method certainly won't be the only computational method to estimate such data, but it offers a glimpse of a future in which accurate computed kinetic data will be available to estimate higher-order fluxes between biomolecules. Such approaches promise revealing insights.
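The link between on and off rates and the binding free energy discussed earlier can be written down in a few lines. The sketch below uses the standard textbook relations for simple one-step binding (K_d = k_off / k_on and ΔG° = RT ln K_d, relative to a 1 M standard state); the function names and the rate constants in the example are hypothetical and are not taken from the D. E. Shaw study.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def dissociation_constant(k_on, k_off):
    """K_d in M from k_on (1/(M*s)) and k_off (1/s), assuming one-step binding."""
    return k_off / k_on


def binding_free_energy(k_d, temperature=298.15):
    """Standard binding free energy in kcal/mol, relative to a 1 M standard state."""
    return R_KCAL * temperature * math.log(k_d)


def residence_time(k_off):
    """Mean lifetime of the bound complex in seconds."""
    return 1.0 / k_off


if __name__ == "__main__":
    # Hypothetical rate constants chosen only for illustration.
    k_on, k_off = 1.0e6, 1.0e-3
    k_d = dissociation_constant(k_on, k_off)
    print(f"K_d = {k_d:.1e} M")
    print(f"dG  = {binding_free_energy(k_d):.1f} kcal/mol")
    print(f"residence time = {residence_time(k_off):.0f} s")
```

Two ligands with the same K_d, and hence the same free energy, can have very different off rates, which is one reason kinetic data carry information that thermodynamic data alone do not.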

The other auspicious development concerns the modeling of crowded environments. As mentioned above, no simulation of molecular structure and function can be considered truly realistic until it takes into account all the myriad molecular partners that populate a cell's interior. Until now our methods have been too primitive to take these crowded conditions into account. But recent proof-of-principle studies have provided the first glimpse into how protein folding under crowded conditions can be predicted. Hopefully at some point, improved computing power will afford us the capability to simulate multiple small and large molecules in proximity to each other. Only when we can do this can we claim to have taken the first steps in moving from one level of complexity to another.

The future abounds with these possibilities. Whatever the hype, promises and current status of modeling in drug discovery, one thing is clear: the problems are tough, many of them have been well-defined for a long time and all of them without exception are important. In the sense of the significance of the problems themselves, computational chemistry remains a field pregnant with possibilities and enormous scope. Calculating the free energy of binding of an arbitrary small molecule to an arbitrary protein still remains a foundational goal. Homing in on the correct, minimal set of molecular descriptors that would allow us to predict the biological activity of any newly synthesized molecule is another. So is being able to simulate protein folding, or calculate the interaction of a drug with multiple protein targets, or model what happens to the concentration of a protein in a cell when a drug is introduced. All these problems really present different sides of the same general challenge of modeling biomolecular systems over an expansive set of levels of complexity. And all of them are tough, involved, multifaceted and essentially interdisciplinary problems that will keep the midnight oil burning in the labs of experimental and computational scientists alike.

But as Jack Kennedy would say, that's precisely why we need to tackle them, not because they are easy but because they are difficult. That's what we have done. That's what we will continue to do.


Bajorath, J. (2011). Computational chemistry in pharmaceutical research: at the crossroads. Journal of Computer-Aided Molecular Design. DOI: 10.1007/s10822-011-9488-z

Martin, E., Ertl, P., Hunt, P., Duca, J., & Lewis, R. (2011). Gazing into the crystal ball; the future of computer-aided drug design. Journal of Computer-Aided Molecular Design. DOI: 10.1007/s10822-011-9487-0

Maggiora, G. (2011). Is there a future for computational chemistry in drug research? Journal of Computer-Aided Molecular Design. DOI: 10.1007/s10822-011-9493-2


Autism studies among the Asian-American diaspora.

Last week's issue of Nature had a special section on autism research. One look at the series of articles should convince anyone how complex the determination of causal factors for this disorder is. From a time when pseudoscientific environmental factors (such as "frigid" mothers) were supposed to play a major role, we have reached a stage where massive amounts of genetic data are uncovering tantalizing hints behind Autism Spectrum Disorders (the title itself pointing to the difficulty of diagnosis and description) without a clear indication of causes. Indeed, as pointed out in the Nature articles, some researchers think that the pendulum has now swung to the other side and environmental factors need to be taken into account again.

All the articles are worth reading, but for me the most interesting was a piece describing the research of psychologist Simon Baron-Cohen (cousin of the colorful actor Sacha Baron Cohen) who believes that there is a link between autistic children and the probability of their having technically-minded parents like engineers or scientists. Baron-Cohen's hypothesis has not been validated by rigorous studies but it's extremely intriguing. He thinks that the correlation may have to do with the existence of a "systematizing" brain, one which is adept at deciphering and constructing working relationships in mechanical and logical systems, but which is simultaneously rather poor at grasping the irrational, ill-defined nature of human relationships. Baron-Cohen's hypothesis would be consistent with the lack of empathy and human understanding sometimes found among autistic individuals who also seem to have an aptitude for mathematics, science and engineering.

The moment I read the article, I immediately thought of the substantial Asian-American diaspora in the US, especially Indian and Chinese. I don't have precise statistics (although these would be easy to obtain) but I would think that the majority of Indians or Chinese who emigrated to the US in the last twenty years or so are engineers. If not engineers then they would mostly be scientists or doctors, with businessmen, lawyers and others making up the rest. Chinese and Indian engineers and scientists have always been immigrants here, but the last twenty years have undoubtedly seen a dramatic increase in their numbers.

Now most of the Asians who migrated to the US in the last few years have children who are quite young. From what I read in the Nature article, it seems to me that this Asian community, especially concentrated in places employing large numbers of technically-minded professionals like Silicon Valley and New Jersey, might provide a very good population sample to test Baron-Cohen's hypothesized link between autism in children and the probability of their having parents who are engineers or physical scientists. Have there been any such studies indicating a relatively higher proportion of ASDs among Asian-American children? I would think that geographic localization and a rather "signal-rich" sample to test Baron-Cohen's hypothesis would provide fertile ground. And surveys conducted with these people by email or in person might be a relatively easy way to test the idea. In fact you may even gain some insight into the phenomenon by analyzing existing records detailing the ethnicity and geographic location of children diagnosed with autism in the last two decades in the US (however, this sample may be skewed since awareness of autism among Asian parents has been relatively recent).

If the prevalence of autism among recent Chinese and Indian children turns out to have seen an upswing in the last few years (therefore contributing to the national average), it would not prove Baron-Cohen's hypothesis. There could be many other factors responsible for the effect. But the result would certainly be consistent with Baron-Cohen's thinking. And it would provide an invitation to further inquiry. That's what good science is about.

A Nobel Prize for "fundamental biology"?

There's a letter in this week's Nature by a William Hughes from Carleton University in Ottawa lamenting the fact that there's no Nobel Prize for "fundamental biology". It's somewhat ironic to hear a biologist complaining about a lack of a biology prize after several chemists have complained that biologists have poached the chemistry prize from them.

But the important question is, what does Hughes mean by "fundamental biology"? In his letter he makes it clear that he is talking about an award that recognizes all discoveries in biology and not just those related to medicine. But the problem is that several of the biological discoveries that were later recognized by the Nobel Prize in medicine had little to do with medicine when they were made. For instance what would you say about the discovery of telomerase (2009)? Or cell cycle control (2001)? Or any number of discoveries in molecular biology including restriction enzymes, DNA and RNA polymerase, phosphorylation or indeed, the structure of DNA itself? By any definition all of these discoveries seem to fall under the "fundamental biology" rubric and I think even Hughes would have a hard time denying this. It just happened that most of them also turned out to be important for medicine (and why not, medicine is after all grounded in biology).

Perhaps Hughes means discoveries in "non-molecular" biology, including ecology and evolutionary biology. But I believe that these fields are often honored by the Crafoord Prize which has gone to thinkers like E O Wilson, Robert May and William Hamilton. Maybe the Crafoord doesn't quite satisfy biologists' Nobel cravings because only one prize is awarded each year, rotating among different fields. It's also sort of odd that the Nobel Prize in medicine was only once awarded to ethologists studying animal behavior, and perhaps they should hand out more of those. But in any case, I think there have been ample prizes awarded to fundamental biological discoveries, so it doesn't seem very meaningful to me to institute a separate prize for biology. And ultimately of course, there's not much point getting hung up over any of these prizes since the work done by these scientists speaks for itself.

But maybe having a separate biology prize would make chemists feel happier. Which may not be an entirely good thing.

Age is of course a fever chill?: Why even older scientists can make important contributions

A ditty often attributed to Paul Dirac conveys the following warning about doing scientific work in your later years:

Age is of course a fever chill
That every physicist must fear
He is better dead than living still
When past his thirtieth year

Dirac was of course a member of the extraordinary generation of physicists who changed our understanding of the physical world through the development of quantum mechanics in the 1920s and 30s. These men were all in their 20s when they made their revolutionary discoveries, with one glaring exception - Erwin Schrodinger was, by the standards of theoretical physics, a ripe old thirty-eight when he stumbled across the famous equation bearing his name.

When we look at the careers of these individuals we could be forgiven for assuming that if you are past thirty you will probably not make an important contribution to science. The legend of the Young Turks has not quite played out in other fields of science though. Now a paper in PNAS confirms what we suspected, that since the turn of the twentieth century there has been a general increase in the age at which scientists make their important discoveries. What is more surprising is that this increase exists more across time than across fields. The study looks at Nobel Laureates, but since many people who make Nobel Prize worthy discoveries never get the prize, the analysis applies to others too.

So what does the paper find out? It finds out that in the early years of the last century, young men and women made contributions in every field at a relatively young age, often in their twenties. This was most pronounced for theoretical physics. Einstein famously formulated special relativity when he was a 26-year-old patent clerk, and Heisenberg formulated quantum mechanics when he was 24. The same was true for many other physicists including Bohr, Pauli, Dirac and De Broglie and the trend continued into the 70s, although it was less true for chemists and biologists studied by the authors.

The reasons postulated in the paper are probably not too surprising but they illustrate the changing nature of scientific research during the last one hundred years or so. It is easiest for a brilliant individual to make an early contribution to a theoretical field like theoretical physics or mathematics, since achievement in such fields depends more on raw, innate ability than skills gained over time. In an experimental field it's much harder to make early contributions since one needs time to assimilate the large body of experimental work done and to learn the painstaking trade of the experimentalist, an endeavor where patience and perseverance count much more than innate intelligence. As the paper puts it, deductive knowledge lends itself more easily to innate analytical thinking skills visible at a young age than inductive knowledge based on a large body of existing work. This is true even for theoretical physicists where fundamental discoveries have become extremely hard and scarce, and where new ideas depend as much on integrating an extensive set of facts into your thinking process as on "Eureka!" moments. And this difference holds even more starkly for social sciences like economics and psychology where you find very few young people making Nobel Prize winning contributions. In these cases success depends as much on intellectual maturity gained from a thorough assimilation of data about extremely complex systems ("humans") as it does on precocity.

But if this were purely the distinction, then we wouldn't find young people making contributions even to experimental chemistry and biology in the early twentieth century. The reason why this happened is also clear; there was a lot of low-hanging fruit to be picked. So little was known for instance about the molecular components of living organisms that almost every newly discovered vitamin, protein, alkaloid, carbohydrate or steroid could bag its discoverer a Nobel prize. The mean age for achievement was not as early as in theoretical physics, but the contrast is still clear. Even in theoretical physics, the playing field was so ripe for new discoveries in the 1930s that in Dirac's words, "even a second-rate physicist could make a first-rate discovery". The paper draws the unsurprising conclusion that there is much more opportunity for a young person to discover something new in a field where little is known.

This conclusion is starkly illustrated in the case of DNA. Watson and Crick are the "original" Young Turks. Watson was only 25 and Crick was in his mid-thirties when they cracked open the DNA structure, although one has to give Crick a pass since his career was interrupted by the war. What's important to note is that both Watson and Crick came swinging into the field with very little prior knowledge. For instance they both knew very little chemistry. But in this case this lack of knowledge did not really hold them back and in fact freed up their imagination because they were working in a field where there were no experts, where even newcomers could use the right kind of knowledge (crystallography and model building in this case) to make important discoveries. Watson and Crick's story points to a tantalizing thought: that it may yet be possible to make fundamental contributions at a young age to fields in which virgin territory is still widely available. Neuroscience comes to mind right away.

Since this is a chemistry blog, let's look at the authors' conclusions as they apply to chemistry. Linus Pauling provides a very interesting example since he plays into both categories. The "early Pauling" made his famous contributions to chemical bonding in his twenties, and this contribution was definitely more of the deductive kind where you could indulge in much armchair analysis based on principles and approximations drawn from quantum theory. In contrast, contributions by the "late Pauling" are much more inductive. These would include his landmark discovery of the fundamental elements of protein structure (the alpha helix and the beta sheet) and the first description of a disease at a molecular level (sickle cell anemia). Pauling did both these things in his 40s, and both of them needed him to build up from an extensive body of knowledge about crystallography, chemical bonding and biochemistry. It would be hard to imagine even a Linus Pauling deducing protein structure the way he deduced orbital hybridization.

If we move to more inductive fields then the relatively advanced age of the participants is even more obvious. In fact in chemistry, in contrast to mathematics or physics, it's much harder to pinpoint a young revolutionary precisely because chemistry more than physics is an experimental science based on the accumulation of facts. Thus even exceptional chemists are often singled out more for lifelong contributions than for lone flashes of inspiration. Even someone as brilliant as R. B. Woodward (who did make his mark at a young age) was really known for his career-wide contributions to organic synthesis rather than any early idea. It's also interesting that Woodward did make a very important contribution in his late 40s, the elucidation of the Woodward-Hoffmann rules, and although Hoffmann provided a robust deductive component, inspiration for the rules came to Woodward through anomalies in his synthesis of Vitamin B12 and his vast knowledge of experimental data on pericyclic reactions. Woodward was definitely building up from a lot of inductive knowledge.

An additional factor that the authors don't discuss is the contribution of collaborations. From a general standpoint it has now become very difficult for scientists in any field to make lone significant contributions. In fact one can make a good case that even the widely cited lone contributions to theoretical physics in the 1920s involved constant collaboration and exchange of ideas (mostly through Niels Bohr's institute in Copenhagen). Such collaboration was far from the norm for most of scientific history, when you had people like Cavendish, Lavoisier, Maxwell, Faraday, Kekule and Planck working alone and producing spectacular results. But things have significantly changed, especially in the case of experimental particle physics and genomics where even the most outstanding thinkers can often work only as part of a team. In such cases it may even be meaningless to talk about the young vs advanced age dichotomy since no one individual makes the most important discovery.

Finally, one rather disturbing reason that could potentially contribute to an even greater advancement of age in the context of important discoveries is left undiscussed. As the biologist Bob Weinberg lamented in an editorial a few years ago, the mean age at which new academic researchers receive their first important research grant has been advancing. This means that even brilliant scientists may be held back from making important discoveries simply because they lack the resources. While this trend has really been visible in the last decade or so, it could contribute as an unfortunate factor to the age-corrected generation of novel ideas. One only hopes that this does not make things so bad that scientists are forced to consider contributing to their fields in their 70s.

Ultimately there's one thing that age brings that's hard to replace with raw brilliance, and that's the nebulous but invaluable entity called 'intuition'. As scientific problems become more and more complex and interdisciplinary, it is inevitable that intuition and experience will play more important roles in the divining of new scientific phenomena. And these are definitely a product of age, so there may be something to look forward to when you grow old after all.