I am attending the Gordon Conference on Computer-Aided Drug Design (CADD) in the verdant mountains of Vermont this week, and while conference rules prohibit me from divulging the details of the talks, even the first day of the meeting reinforces a feeling that I have had for a while about the field of molecular modeling: the problems that plague the field cannot be solved by modelers alone.
This realization is probably apparent to anyone who has been working in the field for a while, but its ramifications have become really clear in the last decade or so. It should be obvious by now to many that while modeling has seen some real and solid progress in the last few years, the general gap between promise and deliverables is still quite big. The good news is that modeling has been integrated into the drug discovery process in many small and sundry ways, ranging from getting rid of duplicates and "rogue" molecules in chemical libraries to quick similarity searching of newly proposed compounds against existing databases to refinement of x-ray crystal structures. These are all very useful and noteworthy advances, but they don't by themselves promise a game-changing impact of modeling on the field of drug discovery and development.
The reasons why this won't happen have thankfully been reiterated several times in several publications over the last fifteen-odd years, to the extent that most reasonable people in the field don't get defensive anymore when they are pointed out. There's the almost complete lack of statistics that has plagued the literature, leading people to believe that specific algorithms were better than they actually were and to continue applying them (this aspect was well emphasized by the last GRC). There's the constant drumbeat about how badly we treat things like water molecules, entropy, protein flexibility and the conformational flexibility of ligands. There are the organizational issues concerning the interactions between modelers and other kinds of scientists, which in my opinion people don't formally talk about with anywhere near the seriousness and frequency they deserve (although we are perfectly happy to discuss them in person).
All these are eminently legitimate reasons whose ills must be exorcised if we are to turn modeling into not just a useful but a consequential and even paradigm-shifting part of the drug discovery process. And yet there is one other aspect that we should be constantly talking about, one that really puts a ceiling on top of even the most expert modeler. And this is the crucial reliance on data obtained from other fields. Because this is a ceiling erected by other fields, it's not something that even the best modelers can punch through alone. And breaking this ceiling is really going to need both scientific and organizational changes in the ways that modelers do their daily work, interact with people from other disciplines and even organize conferences.
The problem is simply that of not having the right kind of data. It's not a question of 'Big Data' but of data at the right level of relevance to a particular kind of modeling. One illusion that I have felt gradually creeping up the spines of people at modeling-related conferences is that of somehow being awash in data. Too often we are left with the feeling that the problem is not a lack of data but only a lack of tools to interpret that sea of information.
The problem of tools is certainly an important one, but the data problem has by no means been resolved. To understand this, let's divide the kind of data that is crucial for 'lower level' or basic modeling into three categories: structural, thermodynamic and kinetic. It should be obvious to anyone in the field that we have made amazing progress in the form of the PDB as far as structural information is concerned. It did take us some time to realize that PDB structures are not sacrosanct, but what I want to emphasize is that when serious structure-based modeling like docking, homology modeling and molecular dynamics really took off, the structural data was already there, either in the PDB or readily obtained in house. Today the PDB boasts more than a hundred thousand structures. Meticulous tabulation and analysis of these structures has resulted in high-quality datasets like Iridium. In addition, there is no dearth of publications pointing out the care that must be exercised in using these structures for actual drug design. Finally, with the recent explosion of crystallographic advances in the field of membrane protein structure, data is now available for virtually every important family of pharmaceutically relevant proteins.
Now consider where the field might have been in a hypothetical universe where the PDB was just getting off the ground in the year 2015. Docking, homology modeling, protein refinement and molecular dynamics would all have been in the inky backwaters of the modeling landscape. None of these methods could have been validated in the absence of good protein structures, and we would have had scant understanding of water molecules, protein flexibility and protein-protein interactions. The Gordon Conference on CADD would likely still be the Gordon Conference on QSAR.
Apply the same kind of thinking to the other two categories of data - thermodynamic and kinetic - and I think we can see some of the crucial problems holding the field back. Unlike the situation with structures, there is simply no PDB-like database of tens of thousands of reliable thermodynamic data points that would aid the validation of methods like Free Energy Perturbation (FEP). There is some data to be found in repositories like PDBbind, but this is still a pale shadow of the quantity and (curated) quality of structures in the PDB. No wonder that our understanding of energies - relative to structure - is so poor. When it comes to kinetics the situation is much, much worse. In the absence of kinetic data, how can we start to truly model the long residence times in protein-ligand interactions that so many people are talking about these days? The same situation also applies to what we can call 'higher order' data concerning toxicology, network effects on secondary targets in pathways and so on.
The situation is reminiscent of the history of the development of quantum mechanics. When quantum mechanics was formulated in the twenties, it was made possible only by the existence of a large body of spectroscopic data that had been gathered since the late 1870s. If that data had not existed in the 1920s, even wunderkinder like Werner Heisenberg and Paul Dirac would not have been able to revolutionize our understanding of the physical world. Atomic physics in the 1920s was thus data-rich and theory-poor. Modeling in 2015 is not exactly theory-rich to begin with, but I would say it's distinctly data-poor. That's a pretty bad situation to be in.
The reality is very simple in my view: unless somebody else - not modelers - generates the thermodynamic, kinetic and higher-order data critical to advancing modeling techniques, the field will not advance. This problem is not going to be solved by a bunch of modelers, even genius ones, brainstorming for days in a locked room. Just as the current status of modeling would have been impossible to imagine without the contributions of crystallographers, the future status of modeling is impossible to imagine without the contributions of biophysical chemists and biologists. Modelers alone simply cannot punch through that ceiling.
One of the reasons I note this problem is that even now, I see very few (none?) meetings which serve as common platforms for biophysical chemists, biologists and modelers to come together and talk not just about problems in modeling but about how people from these other fields can address them. As long as modelers think of Big Data as some kind of ocean of truth simply waiting to spill out its secrets in the presence of the right tools, the field will not advance. Modelers need to keep in mind that interfacing with other disciplines is an absolute must for progress in their own field. What would make that field advance is its practitioners knocking on the doors of their fellow kineticists, thermodynamicists and network biologists to ask for the data that they need.
That last problem suddenly catapults the whole challenge to a new level of complexity and urgency, since convincing other kinds of scientists to do the experiments and procure the data that would allow your field to advance is a daunting cultural challenge, not a scientific one. Crystallographers were busy solving pharmaceutically relevant protein structures long before there were modelers, and most of them were doing it based on pure curiosity. But it took them fifty years to generate the kind of data that modelers could realistically use. We don't have the luxury of waiting for fifty years to get the same kind of data from biophysical chemists, so how do we incentivize them to speed up the process?
There are no easy ways to address this challenge, but a start would be to recognize its looming existence. And to invite more scientists from other fields to the next Gordon Conference on CADD. How to get people from other fields to contribute to your own in a mutually beneficial relationship is a research problem in its own right that deserves separate space at a conference like this. And there is every reason to fill that space if we want our field to progress rapidly.