Field of Science

A fair effort

My advisor does not trust QSAR. And for good reason. Every week hundreds of papers are published, with someone trying to predict molecule activities by building QSAR models; few if any are actually predictive, but many explain the known data fairly well.

But I was curious to know what pitfalls lie in the way of QSAR, so I convinced him to let me do a small QSAR project. To make it more interesting, two friends and I started with the same data for about 50 molecules with activities ranging from tens of micromolar to tens of nanomolar. We all used a program by a well-known company, and all started with the same molecules in the training and test sets. From then on we were on our own, and in the end we would compare results.

In the end, we each got a decent model. Since we may possibly publish the work later, I cannot say which molecules we were working on. But I actually ended up learning more about statistics than QSAR itself. The main point is that the usual statistical predictor, the correlation coefficient r^2, can be made as high as you want by including more factors in the partial least squares (PLS) regression, and yet you may end up with a model that's nonsense. A high r^2 can mean you have simply fit the existing data fairly well, or it can mean you have overfit the data.
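
To make the overfitting point concrete, here is a minimal sketch in Python. It is not what our program does: it swaps in ordinary least squares for PLS, and both the "activities" and the "descriptors" are random numbers invented for illustration, so any fit it finds is spurious by construction. The helper names (`solve_linear_system`, `fit_r_squared`) are hypothetical. The qualitative effect should be the same, though: the training r^2 can only go up as more terms are added.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_r_squared(descriptors, activities):
    """Training r^2 of an ordinary least-squares fit with an intercept."""
    rows = [[1.0] + list(d) for d in descriptors]
    p = len(rows[0])
    # Normal equations: (X^T X) coef = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, activities)) for i in range(p)]
    coef = solve_linear_system(xtx, xty)
    predicted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(activities) / len(activities)
    ss_res = sum((y - yp) ** 2 for y, yp in zip(activities, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in activities)
    return 1.0 - ss_res / ss_tot

random.seed(0)
n_molecules = 15
# Made-up "activities" and pure-noise "descriptors": there is nothing to learn.
activities = [random.gauss(6.0, 1.0) for _ in range(n_molecules)]
noise = [[random.gauss(0.0, 1.0) for _ in range(12)] for _ in range(n_molecules)]

r2_by_count = []
for k in (1, 4, 8, 12):
    r2 = fit_r_squared([row[:k] for row in noise], activities)
    r2_by_count.append(r2)
    print(f"{k:2d} random descriptors: training r^2 = {r2:.3f}")
```

Because each model contains the previous one, the printed r^2 values never decrease, even though the descriptors carry no information about the activities.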

On the other hand, another predictor, q^2, is the correlation coefficient for the test set. That really tells you how well your model can predict new activities. The interesting thing is that a q^2 above 0.50 or so is considered publishable; that's how inefficient and relatively primitive QSAR is. Then there's also the third common statistical parameter, the standard deviation. The standard deviation should be as small as possible, but logically it cannot be smaller than that in the experimental data.
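
For the test-set statistic, a rough sketch of one common definition looks like this. I am assuming the form 1 - PRESS/SS, where PRESS is the sum of squared prediction errors on the test set and SS is the sum of squared deviations of the test activities from the training-set mean; I can't say whether the program we used computes it exactly this way. The function name and the numbers are made up.

```python
def r_squared_pred(train_activities, test_observed, test_predicted):
    """Predictive r^2 for an external test set (assumed definition):
    1 - PRESS / SS, with SS measured against the training-set mean."""
    mean_train = sum(train_activities) / len(train_activities)
    press = sum((obs - pred) ** 2
                for obs, pred in zip(test_observed, test_predicted))
    ss = sum((obs - mean_train) ** 2 for obs in test_observed)
    return 1.0 - press / ss

# Hypothetical pIC50-style values, for illustration only.
train = [5.0, 6.0, 7.0, 8.0]          # training-set mean is 6.5
test_obs = [5.5, 7.5]

perfect = r_squared_pred(train, test_obs, [5.5, 7.5])
mean_only = r_squared_pred(train, test_obs, [6.5, 6.5])
print(perfect, mean_only)
```

A model that predicts the test set perfectly scores 1, and a "model" that just predicts the training mean for everything scores 0, which is why a value barely above 0.5 is not much to boast about.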

The most interesting observation was that all three of us felt that we should lay emphasis on totally different parameters in the model building. Yet with our own unique approaches, we managed to get models with similar statistics.

I learnt that QSAR is indeed not very helpful, but like some other methods, it works if you don't have anything else to go on. A lot of times computational results are simply idea generating devices that save time spent by mere educated guesswork. In fact sometimes the hit rate for educated guesswork might be higher, but the time taken may be longer. In such cases, computational methods can be deemed useful even with a lower hit rate.

In the end, there's one thing we always have to remember about statistics, the rather ribald statement about it which I am still going to quote because I think it's apt (and apologies to women!): Statistics are like women's skirts. What they show us can be useful, but what they hide is much more interesting.


  1. Actually, q^2 is not really a valid measure of predictive ability. See

    Golbraikh, A. and Tropsha, A. "Beware of q^2". J. Mol. Graph. Model. 20, 269-276 (2002).

    I agree that there are far too many 'plug-and-chug' QSAR papers published. The paper by Doweyko is also an interesting read:

  2. I agree with Rajashi's comment on Q^2, which is not the correlation coefficient for the test set but a coefficient derived from a repeated "leave-one-out" validation step from within the learning set. There are also other issues with validation of QSAR. Up till now the methodology and validation of QSAR have been a matter of choice for the experimenter, but there are clear guidelines from the OECD on what constitutes a valid QSAR (see
    which should address most of the issues that concern sceptics such as your advisor. Gradually we will see that papers will only be accepted for publication if they address the OECD guidelines. Modern QSAR is not as inefficient as you say, and I suggest that you need to get a bit more up to date with developments in the field. QSARs are now beginning to be used in regulatory contexts (e.g. for safety assessments), and for such purposes Q^2 would not really be a suitable statistic; rather, an r^2 for an independent test set stratified for activity across the whole domain of the training set would be expected. A common value for this indicator of predictivity would be 0.75 (i.e. 75% of the test set predicted at, for example, 95% confidence limits).

  3. thanks very much for the references, they are very helpful. actually there is some confusion here as a colleague of mine pointed out; the q^2 i am talking about actually refers to the r^2 (pred) that you and doweyko talk about for an independent test set. it's not the LOO q^2. the confusion lies in our program calling the r^2(pred) "q^2". on the other hand, the program does not have a way to calculate the LOO cross validation q^2 that you are talking about.
    in any case, i am relatively new to QSAR so it's good to know stuff, thanks.

  4. Aah, OK. r^2 (pred) might be more reasonable if the test set is a true test set. More often than not, the TSET / PSET split is just not very rigorous.

    In the end the only rigorous way to measure the value of the model is to do prospective testing - synthesize new molecules.

  5. Something you really need to watch when building QSAR models is the extent that the descriptors cluster the compounds substructurally. In an extreme case you can end up with what are effectively two point correlations. In a high-dimensional space these can sometimes be difficult to spot. All the usual diagnostics tell you that you have a wonderful model and in a sense you do. The problem is that you might be using 5 descriptors to encode a simple relevant substructural feature such as the presence of an anionic group.

    One problem with CoMFA models is knowing the extent of their scope. Because CoMFA uses a sampling of molecular fields as descriptors, there are no inherent substructural limitations in applying the models. However, if the CoMFA model is derived for compounds in a tightly defined series, would one be confident in applying it outside that series? Something that I'd like to see with 3D QSARs is some measure of sensitivity of the model to the molecular alignment.

    There are some truly execrable QSAR analyses in the literature. One that is (hopefully) not there came to me as reviewer and none of the residuals in pIC50 for the training set exceeded 0.03.

    If you're planning to do a lot of QSAR and quantitative modeling, you might want to take a look at:

    We will be taking a closer look at data analysis in The Crapshoot once we've finished looking at hydrogen bonds.

  6. GMC: I agree with you. Actually what PHASE does is produce several models based on alignment, something like what you want I think. The user is free to choose whatever model he finds the best, based on some parameters which the program has for evaluating the models. As you know, when doing pharmacophore modeling/QSAR, it's best to have as many possible models as you can and then hopefully have an astute criterion for deciding between them.
    Thanks for the reference.
    P.S. Even if I don't comment, I am reading all your informative posts on h-bonds etc.
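
For anyone following the q^2 discussion in the comments above, the leave-one-out version can be sketched roughly as follows. This is only an illustration under simplifying assumptions: a single descriptor, a straight-line fit in place of PLS, and made-up numbers. The function names are hypothetical and it is not what any particular QSAR package does.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one descriptor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def loo_q_squared(xs, ys):
    """Leave-one-out q^2: refit without each molecule in turn,
    predict the one left out, then compute 1 - PRESS / SS_total."""
    press = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_total

# Hypothetical descriptor values and activities, for illustration only.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
activity = [5.1, 5.9, 7.2, 7.8, 9.1, 9.9]
q2 = loo_q_squared(descriptor, activity)
print(f"LOO q^2 = {q2:.3f}")
```

Note that every molecule here comes from the learning set, which is the distinction the commenters draw between this statistic and an r^2 computed on a truly independent test set.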

