My advisor does not trust QSAR. And for good reason. Every week hundreds of papers are published in which someone tries to predict the activities of molecules by building QSAR models; few if any are actually predictive, although many explain the known data fairly well.
But I was curious to know what pitfalls lie in the way of QSAR, so I convinced him to let me do a small QSAR project. To make it more interesting, two friends and I started with the same data for about 50 molecules, with activities ranging from tens of micromolar to tens of nanomolar. We all used a program by a well-known company, and we all started with the same molecules in the training and test sets. From then on we were on our own, and in the end we would compare results.
In the end, we each got a decent model. Since we may possibly publish the work later, I cannot say which molecules we were working on. But I actually ended up learning more about statistics than about QSAR itself. The main point is that the usual measure of fit, the squared correlation coefficient r^2, can be made as high as you want by including more factors in the partial least squares (PLS) regression, and yet you may end up with a model that's nonsense. A high r^2 can mean you have simply fit the existing data fairly well, or it can mean you have overfit the data.
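To make the point about r^2 concrete, here is a minimal sketch on made-up data, not our actual molecules: 50 fake "molecules" with 60 random descriptors and activities that are pure noise, so there is nothing real to learn. The function names (`pls1_coefficients`, `r_squared`), the 35/15 split and the descriptor count are all assumptions for illustration, and the PLS is a bare-bones single-response NIPALS written in NumPy, not the commercial program we used.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """Single-response PLS (NIPALS). Returns (coefs, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean          # centered, then deflated in the loop
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)               # weight vector
        t = Xk @ w                           # scores for this factor
        tt = t @ t
        p = Xk.T @ t / tt                    # X loadings
        c = (yk @ t) / tt                    # y loading
        Xk = Xk - np.outer(t, p)             # remove what this factor explained
        yk = yk - c * t
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coefs = W @ np.linalg.solve(P.T @ W, q)  # regression vector in descriptor space
    return coefs, x_mean, y_mean

def r_squared(y_obs, y_pred):
    """1 - SS_res / SS_tot, with SS_tot taken about the mean of y_obs."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic data (assumption): random descriptors, activities unrelated to them.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 60))
y = rng.normal(loc=6.5, scale=1.0, size=50)
X_train, y_train = X[:35], y[:35]
X_test, y_test = X[35:], y[35:]

train_r2, test_r2 = [], []
for k in range(1, 11):
    coefs, x_mean, y_mean = pls1_coefficients(X_train, y_train, k)
    train_r2.append(r_squared(y_train, (X_train - x_mean) @ coefs + y_mean))
    test_r2.append(r_squared(y_test, (X_test - x_mean) @ coefs + y_mean))
    print(f"{k:2d} factors: training r^2 = {train_r2[-1]:.3f}, "
          f"held-out r^2 = {test_r2[-1]:.3f}")
```

The training r^2 can only go up as factors are added, even though the activities here are noise; the number computed on the 15 held-out molecules is the one that shows whether anything has been learned.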
On the other hand, another measure, q^2, is the corresponding coefficient for the test set (the symbol is more commonly used for the cross-validated value, but the idea is the same: it is computed on molecules the model has not seen). That really tells you how well your model can predict new activities. The interesting thing is that a q^2 above 0.50 or so is considered publishable; that's how inefficient and relatively primitive QSAR is. Then there's also the third common statistical parameter, the standard deviation of the prediction errors. It should be as small as possible, but logically it cannot be smaller than the error in the experimental data.
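For reference, the two quantities can be written side by side. This is the usual textbook form rather than anything specific to the program we used; y_i is the measured activity, \hat{y}_i the model's fitted value, and \hat{y}_{(i)} the prediction for a molecule that was left out of the fit. Conventions differ on which mean \bar{y} goes in the denominator of q^2 (training set or held-out set), so treat that detail as an assumption.

```latex
r^2 = 1 - \frac{\sum_{i \in \text{train}} \left(y_i - \hat{y}_i\right)^2}
               {\sum_{i \in \text{train}} \left(y_i - \bar{y}\right)^2},
\qquad
q^2 = 1 - \frac{\sum_{i \in \text{held out}} \left(y_i - \hat{y}_{(i)}\right)^2}
               {\sum_{i \in \text{held out}} \left(y_i - \bar{y}\right)^2}
```

Written this way, q^2 can be negative when the model predicts worse than simply guessing the mean activity, which r^2 on the training data will not show.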
The most interesting observation was that all three of us felt that we should lay emphasis on totally different parameters in the model building. Yet with our own unique approaches, we managed to get models with similar statistics.
I learnt that QSAR is indeed not very helpful, but like some other methods, it works if you don't have anything else to go on. A lot of the time, computational results are simply idea-generating devices that save the time that mere educated guesswork would take. In fact, the hit rate for educated guesswork may sometimes be higher, but the time taken may be longer. In such cases, computational methods can be deemed useful even with a lower hit rate.
In the end, there's one thing we always have to remember about statistics. It is a rather ribald statement, which I am still going to quote because I think it's apt (and apologies to women!): "Statistics are like women's skirts. What they show us can be useful, but what they hide is much more interesting."