Just like bioinformatics, cheminformatics has come into its own as an independent framework and tool for drug design. As a measure of the field's independence and importance, consider that at least five journals primarily dedicated to it have emerged in the last couple of years, and 15000 articles on it have been published since 2003.
But how useful is it in drug discovery? The answer, just like for other approaches and technologies, is that it depends. For calculating and analyzing some properties it is more useful than for others. A group at Abbott summarizes the current knowledge of cheminformatics approaches as applied to various parameters.
To do this the group utilizes a useful classification scheme made (in)famous by ex-Secretary of Defense Donald Rumsfeld (Rumsfeld et al., J. Improb. Res. 2002). This is the classification of knowledge and facts into 'known knowns', 'unknown knowns', 'known unknowns' and 'unknown unknowns'.
The 'known known' category of properties calculated by cheminformatics consists of those that we think we have a very accurate handle on and that are easy to calculate. These include molecular weight, substructure (SS) searches, and ligand efficiency. Molecular weight can be easily calculated, and a variety of studies have indicated that high MW generally impacts drug discovery negatively. Thus the general emphasis is on keeping your compounds small. Ligand efficiency is calculated as the free energy of binding per heavy atom. Medicinal chemists are more familiar with IC50 than free energy. Since IC50s are usually easy to measure and the number of heavy atoms is of course known, ligand efficiency would be a 'known known'. However the caveat is that IC50 is not the same as in vivo activity, so biochemical potency in terms of ligand efficiency might be a very different kettle of fish. Lastly, substructure searches can be carried out easily by many computer programs. Such searches are usually used when a compound having a substructure similar to a known one is sought. The problem with SS searches arises when the decision on which SS to search for becomes subjective. SS searching is also valuable when certain substructures are to be avoided. In general SS searching belongs in the 'known known' category because of its ease, but subjective interpretations can render this more fuzzy.
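To make the ligand efficiency arithmetic concrete, here is a minimal Python sketch. It rests on an assumption that is mine and not the article's: that IC50 can stand in for the dissociation constant, so that the binding free energy is approximated as RT ln(IC50). The function names and the default temperature of 298.15 K are illustrative choices.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(ic50_molar, temperature=298.15):
    """Approximate binding free energy in kcal/mol as RT * ln(IC50).

    Assumption: IC50 (in mol/L) is used as a stand-in for Kd, which is
    only a rough approximation and depends on assay conditions.
    """
    if ic50_molar <= 0:
        raise ValueError("IC50 must be a positive concentration in mol/L")
    return R_KCAL * temperature * math.log(ic50_molar)


def ligand_efficiency(ic50_molar, heavy_atoms, temperature=298.15):
    """Ligand efficiency: negative binding free energy per heavy atom."""
    if heavy_atoms <= 0:
        raise ValueError("heavy atom count must be positive")
    return -binding_free_energy(ic50_molar, temperature) / heavy_atoms


if __name__ == "__main__":
    # Hypothetical compound: IC50 of 10 nM, 25 heavy atoms.
    print(round(ligand_efficiency(10e-9, 25), 2))
```

For the hypothetical 10 nM compound with 25 heavy atoms this works out to roughly 0.44 kcal/mol per heavy atom. The number is only as meaningful as the IC50 behind it, which is exactly the caveat raised above.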
In the 'unknown knowns' category lies an interesting set of properties: those which we think we know how to calculate but which sometimes look deceptively simple and are often subject to overconfidence in calculation. The first property in this category is one of the most important ones in drug design: logP, which is regarded as a measure of the lipophilicity of the compound, a key parameter dictating absorption, bioavailability, and partitioning of drugs between membranes and body fluids. Several programs can calculate logP. However, as the article notes, a recent study of no fewer than 30 such programs found a mean error in logP calculation of 1 log unit, which means that some programs would do much worse. Thus calculation of logP, just like other computational techniques, crucially depends on the method used to calculate it. The caveat is that absolute cutoffs for logP values in library design or compound searches might mislead, but many programs seem to do a fairly good job of reproducing trends. Solubility is another parameter that is notoriously difficult to calculate, especially since its calculation hinges on calculation of pKa values. pKa values in turn are very method-dependent, and trouble arises especially for charged compounds, which include most drugs. Medicinal chemists are understandably suspicious of theoretical solubility prediction, but as for logP, such calculations may at least be used for quick estimation of qualitative trends. Another parameter in this category is plasma protein binding. Although we know a fair amount about the effect of this parameter on drug ADME, good luck trying to calculate it. Lastly, in vivo ADME is a minefield of complications. I personally would not have placed this in the 'unknown knowns' category, but at least some of the properties in this category can be calculated on the basis of specialized fragment-based models.
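The point about cutoffs versus trends can be illustrated with a small Python sketch. Everything in it is hypothetical: the logP values are toy numbers invented for illustration, not data from the article or from any program, and the cutoff of 5 is borrowed from Lipinski's rule of five purely as an example threshold.

```python
def mean_absolute_error(predicted, measured):
    """Average absolute difference between predicted and measured values."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(predicted)


def cutoff_disagreements(predicted, measured, cutoff=5.0):
    """Count compounds that land on opposite sides of the cutoff."""
    return sum((p <= cutoff) != (m <= cutoff) for p, m in zip(predicted, measured))


def same_rank_order(predicted, measured):
    """True if both lists sort the compounds into the same order."""
    def order(values):
        return sorted(range(len(values)), key=values.__getitem__)
    return order(predicted) == order(measured)


# Toy values, made up for illustration only.
TOY_MEASURED = [1.2, 2.8, 4.6, 5.4]
TOY_PREDICTED = [2.1, 3.9, 5.3, 6.5]

if __name__ == "__main__":
    print(mean_absolute_error(TOY_PREDICTED, TOY_MEASURED))
    print(cutoff_disagreements(TOY_PREDICTED, TOY_MEASURED))
    print(same_rank_order(TOY_PREDICTED, TOY_MEASURED))
```

In this toy set the predictions are off by about one log unit on average, and one compound is wrongly rejected by the hard cutoff, yet the rank order of the four compounds is reproduced. That is the sense in which a program can be poor at absolute values and still useful for trends.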
What about 'known unknowns'? This category includes polar surface area (PSA). PSA is a known unknown because we know that it is not exactly a real, measurable quantity. However it is still a useful parameter since PSA has been shown to relate to membrane permeability and hence is especially useful for guessing blood-brain barrier penetration for CNS drugs. A set of rules similar to the Lipinski rules says that compounds with a PSA of less than 120 Å² are more likely to penetrate membranes. As for other properties though, calculation of PSA depends on method and usually involves calculating the 3D structure of a molecule and then calculating the PSA by assigning some kind of a 'surface area' associated with polar atoms. As the article notes, one can get wildly different results depending on which atoms are assigned as 'polar'. In addition, compounds containing sulfur or phosphorus can result in big discrepancies. Clearly PSA is a useful parameter, but we still have some way to go in calculating it reliably. Finally, similarity searching is all the rage these days and promises a windfall of potential discoveries. In its simplest incarnation, similarity searching aims to discover compounds similar in some structural metric to a given compound, with the assumption that similarity in structure would correspond to similarity in function. But similarity, just like beauty, is notoriously in the eye of the beholder. As the authors quip from a quaint piece of literature-related controversy:
"Deciding whether two molecules are similar is much like trying to decide whether something is beautiful. There are no concrete definitions, and most chemists take an “I know it when I see it” attitude (attributed to United States Justice Potter Stewart, concurring opinion in Jacobellis v. Ohio 378 U.S. 184 (1964), regarding possible obscenity in The Lovers)."
As the article again illustrates, different similarity searching methods give very different hits. One of the most successful metrics for measuring similarity has turned out to be the 'Tanimoto coefficient', but other metrics also abound. Thus it is quite remarkable that similarity is already being used widely in every endeavor from virtual screening to finding new protein targets for old drugs based on drug similarity. One of the most valuable applications of similarity is in finding bioisosteres (chemically similar fragments with improved properties), something which medicinal chemists try to do all the time. A recent review of similarity-based methods in JMC summarizes the utility of such techniques. Nonetheless, similarity-based methods are still among the 'known unknowns' because we don't have an objective handle on what constitutes similarity, and we may never have such a handle even if such methods are widely and successfully adopted.
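For readers who have not met the Tanimoto coefficient, a minimal Python sketch follows. It assumes each molecule has already been reduced to a fingerprint, represented here as the set of its 'on' bit positions; the coefficient is then the number of shared bits divided by the number of bits set in either molecule. The fingerprints below are made-up toy sets, and returning 0.0 for two empty fingerprints is my own convention, not something the article specifies.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits.

    Computed as |A & B| / |A | B|. Two empty fingerprints return 0.0,
    which is a convention chosen for this sketch.
    """
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Toy fingerprints, invented for illustration.
    query = {1, 2, 3, 4}
    candidate = {3, 4, 5, 6}
    print(tanimoto(query, candidate))  # 2 shared bits out of 6 in total
```

The formula itself is objective; the subjectivity discussed above enters one step earlier, in deciding which structural features get encoded as bits in the first place. Two different fingerprint schemes can rank the same pair of molecules quite differently.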
Finally we come to the dreaded 'unknown unknowns'. As the authors ask, can we even list properties which we have no idea about? We can take a shot. This category includes flights of fancy that may never be achieved but which are worth striving for. One such holy grail is the large-scale, high-throughput computation of binding free energies. The binding free energy for a protein-ligand complex includes contributions from an enormous number of complicated factors, but the mere attempt to calculate all these factors has valuably increased our understanding of biological systems. Thus we should continue in this endeavor. Another fancy endeavor which is the talk of the town these days is systems biology, where construction of biological networks and applications of graph theory are supposed to shed valuable light on molecular interactions. Such approaches may well be successful, but they always run the risk of becoming too abstract and divorced from reality to be truly understandable. QSAR models can suffer from the same shortcomings.
Ultimately, only a robust collaboration between informaticians, computer scientists, medicinal chemists and biologists can make sense of the jungle of data uncovered by cheminformatics approaches. It is key for each group of scientists to keep reality checks on the others. In the end it's all about reality checks. We all know what happened when these were not applied to the original classification scheme.
Muchmore, S., Edmunds, J., Stewart, K., & Hajduk, P. (2010). Cheminformatic Tools for Medicinal Chemists. Journal of Medicinal Chemistry. DOI: 10.1021/jm100164z