As Niels Bohr said, prediction is very difficult, especially about the future. In computational modeling, the real value lies in prediction. But before any methodology can predict, it must first be able to evaluate. Sound evaluation of known data (retrospective testing) is the only route to accurate prediction of new data (prospective testing).
Over the last few years, several papers have compared different structure-based and ligand-based methods for virtual screening (VS), binding mode prediction and binding affinity prediction. Each of these goals, if accurately achieved, could save the industry immense amounts of time and money. Every paper concludes that some method is better than another. For virtual screening, for example, many have concluded that ligand-based 3D methods are better than docking methods, and that 2D ligand-based methods are at least as good, if not better.
However, such studies have to be conducted very carefully to make sure that you are not biasing your experiment for or against any method, or comparing apples and oranges. In addition, you have to use the correct metrics to evaluate your results. Failing at either of these can lead to erroneous or artificially inflated results, and hence to unreliable predictions.
Here I will talk about two aspects of virtual screening: choosing the correct dataset, and choosing the evaluation metric. The basic problems in VS are false positives and false negatives, and one wants to minimize both. Sound statistical analysis can do wonders for generating and evaluating good virtual screening data. This has been documented in several recent papers, notably one by Ant Nicholls from OpenEye. If you have a VS method, it's not of much use to test it on an arbitrary screen of 100,000 compounds. You need to choose the nature and number of actives and inactives in the screen judiciously to avoid bias. Here are a few things to remember that I took from the literature:
1. Standard statistical analysis tells you that the error in your results depends on the size of your sample. Thus, you need an adequate number of actives and inactives in your screening dataset. What is much more important is the ratio of inactives to actives. The errors inherent in choosing various such ratios have been quantified; for example, with an inactive:active ratio of 4:1, the error incurred is 11% more than that incurred with a theoretical ratio of infinite:1. For a ratio of 100:1 it's only 0.5% more than with infinite. Clearly we must use a high enough ratio of inactives to actives to keep statistical error down. Incidentally, you can also increase the number of actives to reduce this error, but this is not representative of real-life HTS, where actives are (usually) very few, sometimes not more than 0.1% of the screen.
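The ratio figures above can be roughly reproduced with a back-of-the-envelope model. This is only a sketch under an assumption of mine, not necessarily the calculation in the cited paper: that the standard error scales as sqrt(1/N_actives + 1/N_inactives), with equal per-compound variance in both classes. The function names are hypothetical.

```python
import math


def relative_error(inactive_to_active_ratio: float) -> float:
    """Error relative to the infinite-decoy limit.

    ASSUMPTION (mine): error ~ sqrt(1/N_a + 1/N_i) with equal variance per
    class. With N_i = R * N_a this is sqrt(1/N_a) * sqrt(1 + 1/R), so the
    factor relative to R -> infinity is sqrt(1 + 1/R).
    """
    if inactive_to_active_ratio <= 0:
        raise ValueError("ratio must be positive")
    return math.sqrt(1.0 + 1.0 / inactive_to_active_ratio)


def excess_error_percent(inactive_to_active_ratio: float) -> float:
    """Extra error, in percent, compared with an infinite number of inactives."""
    return 100.0 * (relative_error(inactive_to_active_ratio) - 1.0)


if __name__ == "__main__":
    for ratio in (1, 4, 10, 36, 100):
        print(f"{ratio:>4}:1  excess error = {excess_error_percent(ratio):.1f}%")
```

Under this assumption a 4:1 ratio costs a little under 12% extra error and 100:1 costs about 0.5%, which is in line with the numbers quoted above. It also shows why piling on decoys beyond a certain point buys very little.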
2. Numbers are one thing. The nature of your actives and decoys is equally important; simply overwhelming your screen with decoys won't do the trick. For example, consider a kinase inhibitor virtual screen in which the decoys are things like hydrocarbons and inorganic ions. In his paper, Nicholls calls distinguishing these decoys the "dog test"; that is, even your dog should be able to distinguish them from actives (not that I am belittling dogs here). We don't want a screen that makes it too easy for the method to reject decoys. Thus, simply throwing a screen of random compounds at your method might make it too easy for the method to pick out the actives, and the results will mislead.
We also don't want a method that rejects chemically similar molecules on the basis of some property like logP or molecular weight. For example, consider a method or scoring function that is sensitive to logP, and suppose it is supplied with two hypothetical molecules which have an identical core and a nitrogen in the side chain. If one side chain has an NH2 and the other is N-alkylated, where the alkyl is butyl, then there will be a substantial difference in logP between the two, and your method will fail to recognise them as "similar", especially from a 2D perspective. Thus, a challenging dataset for similarity-based methods is one in which the decoys and actives are property-matched. Just such a dataset has been put together by Irwin, Huang and Shoichet: this is the DUD (Directory of Useful Decoys) dataset of property-matched compounds. In it, 40 protein targets and their corresponding actives have been selected, and 36 property-matched decoys have been chosen for every active. This dataset is much more challenging for many methods that do well on random datasets. For more details, take a look at the original DUD paper. In general, there can be different kinds of decoys (random, drug-like, drug-like and property-matched, and so on), and one needs to know exactly how to choose the correct dataset. Datasets like DUD are an attempt to provide benchmarks for the modeling community.
3. Then there is the extremely important matter of evaluation. After doing a virtual screen with a well-chosen dataset and well-chosen targets, how do you actually evaluate the results and put your method in perspective? There are several metrics, but until now the most popular has been enrichment, and this is how it has been done in several publications. The idea is simple; you want your top-ranked compounds to contain as many of the actives as possible. Enrichment is simply the fraction of actives found in a certain fraction of screened compounds. Ideally you want your enrichment curve to shoot up at the beginning; that is, you want most (ideally all) of the actives to show up in the first 1% or so of your ranked molecules. Then you compare that enrichment curve to a curve (actually a straight line) that would stem from an ideal result. The problem with enrichment is that it is a function of both the method and the dataset, hence of the entire experiment. For example, the ideal straight line depends on the number of actives in the dataset. If you want to do a controlled experiment, then you want to make sure that the only differences in the results come from your method, and enrichment introduces another variable that complicates interpretation. Other failings of enrichment are documented in this paper.
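To make the dataset dependence concrete, here is a minimal sketch of enrichment on a ranked list. The function names and the toy labels are hypothetical, and the "enrichment factor" (hit rate in the top slice divided by the overall hit rate) is the common convention rather than something taken from the papers above.

```python
def _top_n(n_total: int, top_fraction: float) -> int:
    """Number of compounds in the top slice (at least one)."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    return max(1, round(top_fraction * n_total))


def fraction_of_actives_found(ranked_labels, top_fraction):
    """Fraction of all actives recovered in the top slice of the ranking.

    ranked_labels: 1 for an active, 0 for a decoy, best-scored compound first.
    """
    n_actives = sum(ranked_labels)
    if n_actives == 0:
        raise ValueError("need at least one active")
    n_top = _top_n(len(ranked_labels), top_fraction)
    return sum(ranked_labels[:n_top]) / n_actives


def enrichment_factor(ranked_labels, top_fraction):
    """Hit rate in the top slice divided by the hit rate of the whole screen."""
    n_total = len(ranked_labels)
    n_top = _top_n(n_total, top_fraction)
    hit_rate_top = sum(ranked_labels[:n_top]) / n_top
    hit_rate_all = sum(ranked_labels) / n_total
    return hit_rate_top / hit_rate_all


def max_enrichment_factor(n_actives, n_total, top_fraction):
    """Best enrichment factor a perfect ranking could reach on this dataset."""
    n_top = _top_n(n_total, top_fraction)
    return (min(n_actives, n_top) / n_top) / (n_actives / n_total)


if __name__ == "__main__":
    toy_ranking = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # invented example data
    print(fraction_of_actives_found(toy_ranking, 0.2))
    print(enrichment_factor(toy_ranking, 0.2))
    # Same perfect method, two datasets: the ceiling changes with the ratio.
    print(max_enrichment_factor(10, 100, 0.1))
    print(max_enrichment_factor(50, 100, 0.1))
```

A perfect ranking tops out at a tenfold enrichment factor in the first 10% when 10 of 100 compounds are active, but only twofold when 50 of 100 are. The number reported therefore says as much about the dataset as about the method.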
Instead, what's recommended for evaluating methods are receiver operating characteristic (ROC) curves.
Essentially, ROC curves can be used in any situation where one needs to distinguish signal from noise, and boy, is there a lot of noise around. ROC curves have an interesting history; they were developed by radar scientists during World War 2 to distinguish the signal of enemy warplanes from the noise of false hits and other artifacts. In recent times they have been used in diverse fields: psychology, medicine, epidemiology, engineering quality control, anywhere we want to pick the bad apples from the good ones. ROC curves simply plot the true positive (TP) rate against the false positive (FP) rate as the score threshold is varied. A purely random result gives a straight line at 45 degrees, implying that for every FP you get a TP: dismal performance. A good ROC curve bows well above that straight line, toward the top-left corner, and a very useful measure of your method's performance is the Area Under the Curve (AUC). The AUC needs to be prudently interpreted; for instance, an AUC of 0.8 means that a randomly chosen active will be given a higher score than a randomly chosen inactive in 8 out of 10 cases. Here's a paper discussing the advantages of ROC curves for VS and detailing an actual example.
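The two readings of the AUC, as an area under the curve and as the fraction of active/inactive pairs that are ranked correctly, can be checked against each other in a few lines. This is a plain-Python sketch with hypothetical function names and invented scores, assuming that a higher score means "more likely active".

```python
def roc_points(scores, labels):
    """(FP rate, TP rate) points as the score threshold is lowered.

    labels: 1 for an active, 0 for an inactive. Tied scores are handled
    together so they produce a single diagonal step.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one active and one inactive")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def auc_pairwise(scores, labels):
    """Fraction of (active, inactive) pairs where the active scores higher.

    Ties count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # invented example data
    labels = [1, 1, 0, 1, 0, 0, 1, 0]
    print(auc_trapezoid(roc_points(scores, labels)))
    print(auc_pairwise(scores, labels))
```

Both calculations give 0.75 on this toy ranking: 12 of the 16 active/inactive pairs are ordered correctly. Unlike enrichment, the value does not shift when the proportion of actives in the screen changes, although its statistical error still does.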
One thing seems to be striking. The papers linked here and elsewhere suggest that ROC curves may currently be the single best metric for measuring the performance of virtual screening methods. This is probably not too surprising given that they have proved so successful in other fields.
Why should modeling be different? Just like in other fields, rigorous and standard statistical metrics need to be established for the field. Only then will the comparisons between different methods and programs commonly seen these days be valid. For this, as in other fields, experiments need to be judiciously planned (including choosing the correct datasets here) and their results need to be carefully evaluated with unbiased techniques.
It is worth noting that these are mostly prescriptions for retrospective evaluations. When confronted with an unknown and novel screen, which method or combination of methods does one use? The answer to this question is still out there. In fact some of the real-life challenges run contrary to the known scenarios. For example consider a molecular screen from some novel plant or marine sponge. Are the molecules in this screen going to be drug-like? Certainly not. Is this going to have the right ratio of actives to decoys? Who knows (the whole point is to find the actives). Is it going to be random? Yes. If so, how random? In all actual screenings, there are a lot of unknowns out there. But it's still very useful to know about "known unknowns" and "unknown unknowns", and retrospective screening and the design of experiments can help us unearth some of these. If nothing else, it indicates attention to sound scientific and statistical principles.
In later posts, we will take a closer look at statistical evaluation and the dangers in pose prediction, including always being wary of crystal structures, as well as something I found fascinating: bias in virtual screen design and evaluation that throws light on chemist psychology itself. This is a learning experience for me as much as, or more than, it is for anyone else.
1. Hawkins, P.C., Warren, G.L., Skillman, A.G., Nicholls, A. (2008). How to do an evaluation: pitfalls and traps. Journal of Computer-Aided Molecular Design, 22(3-4), 179-190. DOI: 10.1007/s10822-007-9166-3
2. Triballeau, N., Acher, F., Brabet, I., Pin, J., Bertrand, H. (2005). Journal of Medicinal Chemistry, 48(7), 2534-2547. DOI: 10.1021/jm049092j
3. Huang, N., Shoichet, B., Irwin, J. (2006). Journal of Medicinal Chemistry, 49(23), 6789-6801. DOI: 10.1021/jm0608356