

What Oliver Sacks could teach us about the value of anecdotal information in drug discovery research

Oliver Sacks was a writer who elevated the anecdote to an art form. He did not write authoritative medical books filled with numbers, graphs and statistics. Instead he focused on individual cases and the human element in medicine. A 2015 Wired magazine profile of Sacks highlighted his signature achievement in resurrecting the value of anecdote in medical research.
By restoring narrative to a central place in the practice of medicine, Sacks has regrafted his profession to its roots. Before the science of medicine thought of itself as a science, at the crux of the healing arts was an exchange of stories. The patient related a confusing odyssey of symptoms to the doctor, who interpreted the tale and recast it as a course of treatment. The compiling of detailed case histories was considered an indispensable tool of physicians from the time of Hippocrates. It fell into disrepute in the 20th century, as lab tests replaced time-consuming observation, merely “anecdotal” evidence was dismissed in favor of generalizable data, and the house call was rendered quaintly obsolete. 
...By exiling the clinical anecdote to the margins of medical practice—to stories passed down in hallways from attending physician to resident—the culture of medicine had blinded itself, forgetting things it had once known. Sacks calls these knowledge gaps “scotomas,” the clinical term for blind spots or shadows in the field of vision. 
...Sacks immersed himself in the neglected anecdotal literature of migraine, feeling that every one of his patients “opened out into an entire encyclopedia of neurology.” 
By studying individual patients with unusual or bizarre neurological symptoms or personality changes, Sacks revealed the intricacies not only of the infamous sleeping sickness of "Awakenings" but also of insomnia, migraines and Tourette's syndrome, all dissected with skill and empathy.
Sacks's explorations of anecdotal data were on my mind as I contemplated the pros and cons of anecdotal data in drug discovery, and especially in molecular modeling. It's a topic that is the subject of much discussion, and occasional pugnacious debate. One of the smartest people I know in the field often scorns individual stories of successes in modeling chemical and biological data. If you read about or listen to case studies in the field, you will often find scientists attributing success in a particular drug design study to a particular method, software algorithm or physicochemical factors like electrostatic effects or hydrophobic effects. 
My statistically enlightened colleague has a tendency to dismiss these individual stories: Where, he asks, are the comparisons, often with simpler techniques? Where are the controls? How do you know for certain that there is a causal relationship between a particular method or factor and the success of a particular drug design?
My colleague is absolutely right about the difficulty of extrapolating to causal explanations or general rules based on individual studies. Not a day passes when I don't hear the words "Method X worked for us" or "Method X failed abysmally for us". "Method X" could be a particular assay, chemical reagent, cell line or computational algorithm. The problem is when these statements are extrapolated to "Therefore Method X works" or "Method X does not work". The difference between "works for me" and simply "works" can be the difference between storytelling and actual science.
And yet I cannot help but think, partly based on Oliver Sacks's career, that storytelling has a unique place in drug discovery. And this is not just because of the cultural and community-building role of storytelling that sociologists often extol. It's because I see anecdotes not as data but as starting points for gathering data. And this is because of the sheer complexity of biology, and of drug discovery by extension. The problem is that when you are dealing with a very complex and multifactorial system, one in which the variance of success is large, it's often impossible to find general rules even by running well-designed statistical studies. Do you then completely dismiss individual data points? I think that would be a mistake. 
If a colleague tells me that using a method for optimizing electrostatic interactions worked well for his system, I would try to use that method for my system within time and resource constraints, especially if his study is well documented and carefully done. I would certainly try it out if both of us were working on similar systems (say, kinases) but even if the systems were different I would give it a shot. That's because we know for a fact that similar laws of physics and chemistry operate even in very different systems; often it's a matter of teasing apart which laws among them are dominant and which ones are weaker.
Then there's also Sacks's quote about each one of his patients providing openings into entire encyclopedias of discovery. What Sacks is really talking about here are model systems which highlight a particular feature that is present in other systems. For instance, a patient with memory loss may not be representative of the population as a whole, but he can shed some very valuable insights into the workings of the mechanisms of memory (sadly although illuminatingly, it's only by observing certain damaged neurological mechanisms that you can know more about normal, undamaged ones). 
Model systems are similarly invaluable in fundamental drug discovery research. A good example is a hydrophobic protein cavity in which a single engineered polar amino acid residue can provide information on the role of electrostatics in small molecule binding. There are also "virtual" model systems; for instance, a molecular mechanics force field in which the electrostatic interactions are turned off in order to study only the influence of the van der Waals interactions. Synthetic chemists can also tell you about scores of model systems that are constructed to demonstrate the feasibility of certain reactions or conformational effects. These model systems are not infallible, and they are certainly not data by themselves. But they are invitations to discovery, curious starting points, hints that point the way to interesting phenomena. Whether one uses them depends to some extent on time and resource availability, but we would be naive to simply discard them because they represent isolated or anecdotal cases.
That brings me to the value of outliers. Outliers are interesting in almost any investigation, but they are especially important in a paradigm like biology or medicine where the emergent, non-linear nature of the system leads to significant variance. In fact one could argue that medicine is currently being revolutionized by the study of outliers in clinical trials. Outliers can point the way to selecting subgroups of patient populations, so-called "extraordinary responders" who show an unusual response to certain drugs. A great recent example is the anticancer drug Iressa, which was withdrawn from the market because of lack of efficacy. A few years later, however, AstraZeneca was able to bring the drug back to market when it found that patients with EGFR mutations responded much better to it than the general population. It was thus the outliers that helped repurpose a drug, one which is saving lives today. And as the genetic basis of medicine is unraveled more and more, we can be sure that specific outlier patients who are distinguished by unique genetic signatures will be critical not only in targeting specific drugs but also in advancing basic science; a great example along these lines is the aerobics instructor in Texas who had a mutation in a protein called PCSK9, now a lucrative target for heart disease drugs. It would have been a shame if her particular case had simply been dismissed as a statistical anomaly.
That then is the great value of anecdotes, not as markers of scientific laws in themselves but as starting points for experiment and theory, as signposts leading us to intellectual forests and mountaintops. Whether these forests and mountaintops contain anything of value is a separate question, but without the anecdotes we may never know about their existence.

Naomi Oreskes and false positives in climate change: Do we know enough?

Historian of science Naomi Oreskes, who has long been a campaigner against climate change denial (her book "Merchants of Doubt" is a touchstone in this regard), has a New York Times op-ed titled "Playing Dumb on Climate Change" in which she essentially - to my eyes at least - asks that scientists slightly lower the standards of statistical significance when assessing the evidence for climate change.

It's not that Oreskes is saying that we should actually use sloppier standards for climate change analysis. Her argument is much more reasonable and specific, but ultimately she still seems to embrace a kind of certainty that I do not think exists in the field. She starts by reminding us of the two common errors in any kind of statistical analysis: Type I and Type II. In my field I tend to use the more informative names 'false positives' and 'false negatives', so that's what I will stick with here. False positives arise when you are too generous and allow spurious data in along with legitimate facts. False negatives arise when you are too curmudgeonly and disallow legitimate facts. In some sense false positives are better than false negatives, since you would rather run the risk of getting some noise along with all the signal than miss some of the signal.

The risk of getting false negatives vs false positives is governed in part by the confidence limits you are using. A confidence limit of 95% means that your standards for accepting something as a signal are very high - it means that you are bound to reject some true signal, resulting in a false negative. When you reduce the confidence limits, say to 90%, then you are lowering the bar for marking something as a signal, leading potentially to some false positives creeping in along with the true signal.
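To make the tradeoff concrete, here is a minimal simulation sketch in Python (synthetic data and an invented effect size, not climate data): lowering the threshold from the 95% level (alpha = 0.05) to the 90% level (alpha = 0.10) raises the chance of detecting a real signal, but also raises the rate at which pure noise gets flagged.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, trials = 30, 5000
true_effect = 0.4   # a modest real signal, in standard-deviation units (invented for illustration)

def rates(alpha):
    # False-positive rate: how often noise alone clears the threshold
    fp = np.mean([stats.ttest_1samp(rng.normal(0, 1, n), 0).pvalue < alpha
                  for _ in range(trials)])
    # True-positive rate (power): how often a real effect clears the threshold
    tp = np.mean([stats.ttest_1samp(rng.normal(true_effect, 1, n), 0).pvalue < alpha
                  for _ in range(trials)])
    return fp, tp

for alpha in (0.05, 0.10):
    fp, tp = rates(alpha)
    print(f"alpha = {alpha:.2f}: false-positive rate ~ {fp:.3f}, power ~ {tp:.3f}")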

The crux of Oreskes's argument is that until now climate scientists have largely used a very high confidence limit for accepting and rejecting evidence, which in her opinion has led to some false negatives. She is right that the magic number of 95% is arbitrary and has no objective basis in natural law. But, and this is her crucial point, we now know enough about the climate to warrant using slightly lower confidence limits, since any false positives emerging from this lowered limit can be identified as such and rejected. Perhaps, she suggests, we could consider using a lower confidence limit like 90% for climate change studies.

I see where Oreskes is coming from, and her general argument about being able to use lower confidence limits in light of increased knowledge about a field is valid, but I am simply not as convinced as she is that we will be able to identify false positives in climate data and reject them. First of all, 'climate data' is a humongous and staggeringly heterogeneous mass of facts from fields ranging from oceanography to soil chemistry. While we may be more confident about some of these facts, we don't know enough about others (especially those concerning the biosphere) to be as confident about them. That means we won't even know which facts are more likely to be false positives than others. There are certainly some false positives that are in the 'known unknown' category, but there are also many others in the 'unknown unknown' category, so how will we identify these if they show up?

The second problem is that, even though I said before that accepting false positives is sometimes better than rejecting true positives, in the case of climate change one must weigh this tradeoff against the enormous cost and resources needed to pursue false positives. If a false positive data point concerns melting glaciers in some part of the world, for instance, actually taking action to address that data point would mean the expenditure of tens of millions or even billions of dollars. Sadly, in the case of climate change, pursuing false positives is not as simple as spending a few man hours and a few thousand dollars investigating blind alleys.

Oreskes, like many other thinkers in the field, is well-meaning, and her suggestions apply to fields where enough is known to confidently identify and weed out false positives. But in the case of climate change we are straitjacketed by a problem that has tragically been the bane of the field since its inception: our lack of knowledge and general ignorance about a very complex, unpredictable system. Unless we bridge this gulf of ignorance significantly, a 90% confidence limit would be as arbitrary and dangerous as a 95% confidence limit.

Biologists, chemists, math and computing

Here are some words of wisdom from C. Titus Brown, a biology professor at Michigan State University, on the critical importance of quantitative math, stats and computing skills in biology. The larger question is what it means to 'do biology' in an age when biology has become so interdisciplinary. 

Brown's post itself is based on another post by Sean Eddy at Janelia Farm on how biologists must learn scripting these days. Here's Brown:
So here's my conclusion: to be a biologist, one must be seriously trying to study biology. Period. Clearly you must know something about biology in order to be effective here, and critical thinking is presumably pretty important there; I think "basic competency in scientific practice" is probably the minimum bar, but even there you can imagine lab techs or undergraduates putting in useful work at a pretty introductory level here. I think there are many useful skills to have, but I have a hard time concluding that any of them are strictly necessary. 
The more interesting question, to my mind, is what should we be teaching undergraduates and graduate students in biology? And there I unequivocally agree with the people who prioritize some reasonable background in stats, and some reasonable background in data analysis (with R or Python - something more than Excel). What's more important than teaching any one thing in specific, though, is that the whole concept that biologists can avoid math or computing in their training (be it stats, modeling, simulation, programming, data science/data analysis, or whatever) needs to die. That is over. Dead, done, over.
And here's Eddy:
The most important thing I want you to take away from this talk tonight is that writing scripts in Perl or Python is both essential and easy, like learning to pipette. Writing a script is not software programming. To write scripts, you do not need to take courses in computer science or computer engineering. Any biologist can write a Perl script. A Perl or Python script is not much different from writing a protocol for yourself. The way you get started is that someone gives you one that works, and you learn how to tweak it to do more of what you need. After a while, you’ll find you’re writing your own scripts from scratch. If you aren’t already competent in Perl and you’re dealing with sequence data, you really need to take a couple of hours and write your first Perl script.  
The thing is, large-scale biological data are almost as complicated as the organism itself. Asking a question of a large, complex dataset is just like doing an experiment.  You can’t see the whole dataset at once; you have to ask an insightful question of it, and get back one narrow view at a time.  Asking insightful questions of data takes just as much time, and just as much intuition as asking insightful questions of the living system.  You need to think about what you’re asking, and you need to think about what you’re going to do for positive and negative controls. Your script is the experimental protocol. The amount of thought you will find yourself putting into many different experiments and controls vastly outweighs the time it will take you to learn to write Perl.
And they are both right. The transformation of biology in the last thirty years or so into a data-intensive (especially sequencing-data-intensive) and information-intensive discipline means that it's no longer optional to be able to apply the tools of information technology, math and computing to biological problems. You either use the tools yourself or you collaborate with someone who knows how to use them. Collaboration has of course also become much easier in the last thirty years, but it's easiest to do it yourself if your goal is simply to write some small but useful scripts for manipulating data, as in the sketch below. R and Python (or Perl) are usually both necessary and sufficient for such purposes. Both Eddy and Brown also raise important questions about including such tools in the traditional education of undergraduate and graduate students in biology, something that's not yet standard practice.
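As an illustration of the kind of small, protocol-like script Eddy describes, here is a minimal Python sketch (the filename "sequences.fasta" is just a placeholder) that reads a FASTA file and reports the length and GC content of each sequence.

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line.upper())
        if header is not None:
            yield header, "".join(chunks)

for name, seq in read_fasta("sequences.fasta"):
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    print(f"{name}\tlength={len(seq)}\tGC={gc:.1%}")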
What I find interesting and rather ironic is that for most of its existence biology was a highly descriptive and experimental science that seemed to have nothing to do with mathematical analysis. This changed partially with the advent of genetic analysis in the early twentieth century, but quantitative analysis did not really hit the molecular basis of biology until the growth of sequencing tools.
Another irony is that chemistry is usually considered to be more data-intensive and mathematical than biology, especially in subfields like physical chemistry. And yet today we don't really find the average organic or inorganic chemist facing a need to learn and apply scripting or statistical analysis, although computational chemists have of course long recognized the value of cheminformatics and data analysis. Compared to this, the average biologist has become much more dependent on such tools.
However, I think this scenario in chemistry is changing. Organic synthesis may not be information-intensive right now, but automated organic synthesis - which I personally think is going to be revolutionary at some point in the future - will undoubtedly generate large amounts of data that will have to be analyzed and made statistical sense of. One specific area of chemistry where I think this will almost certainly happen is 'accelerated serendipity', which involves the screening of catalysts and the correlation of molecular properties with catalytic activity. Some chemists are also looking at network theory to calculate optimal synthetic paths. At some point, to use these tools effectively, the average chemist will have to borrow tools from the computational chemist, and perhaps chemistry will then have become as data-intensive as biology. It's never too early to pick up that Python book or look at that R course.

Stephen Hawking's advice for twenty-first century grads: Embrace complexity


Charles Joseph Minard's famous graph showing the decreasing size of Napoleon's Grande Armée as it marched to Moscow; a classic in data visualization (Image: Wikipedia Commons)
As the economy continues to chart its own tortuous, uncertain course, there seems to have been a fair amount of much-needed discussion on the kinds of skills new grads should possess. These skills of course have to be driven by market demand. As chemist George Whitesides asks for instance, what's the point of getting a degree in organic synthesis in the United States if most organic synthesis jobs are in China?

Upcoming grads should indeed focus on what sells. But from a bigger standpoint, especially in the sciences, new skill sets are also inevitably driven by the course that science is taking at that point. The correlation is not perfect (since market forces still often trump science), but a few examples make this science-driven demand clear. If you were growing up in the immediate post-WW2 era, for instance, getting a degree in physics would have helped. Because of its prestige and a glut of government funding, physics was in the middle of one of its most exciting periods. New particles were streaming out of the woodwork, giant particle accelerators were humming, and federal and industrial labs were enthusiastically hiring. If you were graduating in the last twenty years or so, getting a degree in biology would have been useful because the golden age of biology was just entering its most productive years. Similarly, organic chemists enjoyed a remarkably fertile period in the pharmaceutical industry from the 50s through the 80s, because new drugs were flowing out of drug companies at a rapid pace and scientists like R. B. Woodward were taking the discipline to new heights.

Demand for new grads is clearly driven by the market, but it also depends on the prevalence of certain scientific disciplines at specific time points. This in turn dictates the skills you should have; a physics-heavy market would need skills in mathematics and electronics for instance, a biology-heavy market would mop up people who can run Western blots and PCR. Based on this trend, what kind of skills and knowledge would best serve graduates in the twenty-first century?

To me the answer partly comes from an unlikely source: Stephen Hawking. A few years ago, Hawking was asked what he thought of the common opinion that the twentieth century was that of biology and the twenty-first century would be that of physics. Hawking replied that in his opinion the twenty-first century would be the "century of complexity". That remark probably holds more useful advice for contemporary students than they realize since it points to at least two skills which are going to be essential for new college grads in the age of complexity: statistics and data visualization.

Let's start with the need for statistics. Many of the most important fields of twenty-first century research including neuroscience, synthetic and systems biology, materials science and energy are inherently composed of multilevel phenomena that proliferate across different levels of complexity. While the reductionist zeitgeist of the twentieth century yielded great dividends, we are now seeing a movement away from strict reductionism toward emergent phenomena. While the word "emergence" is often thrown around as a fashionable place-card, the fact is that complex, emergent phenomena do need a different kind of skill set.

The hallmark of complexity is a glut of data. These days you often hear talk of the analysis of 'Big Data' as an independent field, and of the advent of 'data scientists'. Big Data has started making routine appearances in the pharmaceutical and biotech industry, whether in the form of extensive multidimensional structure-activity relationship (SAR) datasets or as bushels of genomic sequence information. It's also important in any number of diverse fields ranging from voter behavior to homeland security. Statistical analysis is undoubtedly going to be key to analyzing this data. In my own field of molecular modeling, statistical analysis is now considered routine in the analysis of virtual screening hits, although it's not as widely used as it should be.

Statistics was of course always a useful science, but now it's going to be paramount; job postings for 'data scientists', for instance, specifically ask for a mix of programming skills and statistics. Sadly, many formal college requirements still don't include statistics, and most scientists, if they learn it at all, learn statistics on the job. For thriving in the new age of complexity this scenario has to change. Statistics must now become a mandatory part of science majors. A modest step in this direction is the publication of user-friendly popular books on statistics like Charles Wheelan's "Naked Statistics" or Nate Silver's "The Signal and the Noise", which have been quickly devoured by science-savvy readers. Some of these are good enough to be prescribed in college statistics courses for non-majors.

Along with statistics, the other important skill for students of complexity is going to be data visualization, and formal college courses should also reflect this increasingly important skill set. Complex systems often yield data that's spread over different levels of hierarchy and even different fields, and it's quite a challenge to visualize such data well. One resource that's often recommended for data visualization is Edward Tufte's pioneering series of books; Tufte shows us how to present complex data that is often obscured by the constraints of Excel spreadsheets. Developments in human-computer interaction and graphics will further ease visual access to complicated datasets. Sound data visualization is important not just for understanding a multilayered system or problem but also for communicating that understanding to non-specialists. The age of complexity will inherently involve researchers from different disciplines working together. And while we are at it, it's also important to stress - especially to college grads - the value of being able to harmoniously co-exist with other professionals.

Hawking's century of complexity will call upon all the tools of twentieth century problem solving along with a few more. Statistics and data visualization are going to be at the forefront of the data-driven revolution in complex systems. It's time that college requirements reflected these important paradigms.

First published on the Scientific American Blog Network.

On reproducibility in modeling

A recent issue of Science has an article discussing an issue that has been a constant headache for anyone involved with any kind of modeling in drug discovery - the lack of reproducibility in computational science. The author Roger Peng who is a biostatistician at Johns Hopkins talks about modeling standards in general but I think many of his caveats could apply to drug discovery modeling. The problem has been recognized for a few years now but there have been very few concerted efforts to address it.

An old anecdote from my graduate advisor's research drives the point home. He wanted to replicate a protein-ligand docking study done with a particular compound, so he contacted the scientist who had performed the study and processed the protein and ligand according to that scientist's protocol. He appropriately adjusted the parameters and ran the experiment. To his surprise he got a very different result. He repeated the protocol several times but consistently saw the wrong result. Finally he called up the original researcher. The two went over the protocol a few times and finally realized that the problem lay in a minor but overlooked detail - the two scientists were using slightly different versions of the modeling software. This wasn't even a new version, just an update, but for some reason it was enough to significantly change the results.


These and other problems dot the landscape of modeling in drug discovery. The biggest problem to begin with is of course the sheer lack of reporting of details in modeling studies. I have seen more than my share of papers where the authors find it enough to simply state the name of the software used for modeling. No mention of parameters, versions, inputs, "pre-processing" steps, hardware, operating system, computer time or "expert" tweaking. This last factor is crucial and I will come back to it. In any case, it's quite obvious that no modeling study can be reproducible without these details. Ironically, the same process that made modeling more accessible to the experimental masses has also encouraged the reporting of incomplete results; the incarnation of simulation as black-box technology has inspired experimentalists to use it widely, but on the flip side it has also discouraged many from being concerned about communicating under-the-hood details.

A related problem is the lack of objective statistical validation in reporting modeling results, a very important topic that has been highlighted recently. Even when protocols are supposedly accurately described, the absence of error bars or statistical variation means that one can get a different result even if the original recipe is meticulously followed. Even simple things like docking runs can give slightly different numbers on the same system, so it's important to be mindful of variation in the results along with their probable causes. Feynman talked about the irreproducibility of individual experiments in quantum mechanics, and while it's not quite that bad in modeling, it's still not irrelevant.

This brings us to one of those important but often unquantifiable factors in successful modeling campaigns - the role of expert knowledge and intuition. Since modeling is still an inexact science (and will probably remain so for the foreseeable future), intuition, gut feelings and a "feel" for the particular system under consideration based on experience can often be an important part of massaging the protocol to deliver the desired results. At least in some cases these intangibles are captured in any number of little tweaks, from constraining the geometry of certain parts of a molecule based on past knowledge to suddenly using a previously unexpected technique to improve the clarity of the data. A lot of this is never reported in papers and some of it probably can't be. But is there a way to capture and communicate at least the tangible part of this kind of thinking?

The paper alludes to a simple possible solution, and this solution will have to be implemented by journals. Any modeling protocol generates a log file which can be easily interpreted by the relevant program. In the case of some modeling software, such as Schrodinger's, there's also a script that records every step in a format comprehensible to the program. Almost any little tweak that you make is usually recorded in these files or scripts. A log file is more accurate than an English-language description at documenting concrete steps. One can imagine a generic log-file-generating program which can record the steps across different modeling programs. This kind of venture will need collaboration between different software companies, but it could be very useful in providing a single log file that captures as much of both the tangible and intangible thought processes of the modeler as possible. Journals could insist that authors upload these log files and make them available to the community.
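To make the idea concrete, here is a hedged Python sketch (not an existing tool; the program name, file names and parameter values are hypothetical placeholders) of the kind of machine-readable record such a generic logger might append for each step of a protocol: the software version, every parameter, and a checksum of every input file.

import hashlib, json, platform, time

def file_checksum(path):
    # A checksum pins down exactly which input file was used
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()

def record_protocol(step_name, program, version, parameters, input_files, out="protocol_log.json"):
    entry = {
        "step": step_name,
        "program": program,
        "version": version,        # the detail that tripped up the docking study above
        "parameters": parameters,
        "inputs": {f: file_checksum(f) for f in input_files},
        "platform": platform.platform(),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    with open(out, "a") as log:
        log.write(json.dumps(entry) + "\n")
    return entry

# Hypothetical usage: the program, files and parameters below are placeholders.
record_protocol("docking_run", "SomeDockingProgram", "3.5.012",
                {"grid_spacing": 0.375, "num_poses": 20, "scoring": "default"},
                ["receptor.pdb", "ligand.sdf"])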


Ultimately it's journals which can play the biggest role in the implementation of rigorous and useful modeling standards. In the Science article the author describes a very useful system of communicating modeling results used by the journal Biostatistics. Under this system authors doing simulation can request a "reproducibility review" in which one of the associate editors runs the protocols using the code supplied by the authors. Papers which pass this test are clearly flagged as "R" - reviewed for reproducibility. At the very least, this system gives readers a way to distinguish rigorously validated papers from others so that they know which ones to trust more. You would think that there would be backlash against the system from those who don't want to explicitly display the lack of verification of their protocols, but the fact that it's working seems to indicate its value to the community at large.

Unfortunately, in the case of drug discovery, any such system will have to deal with the problem of proprietary data. Many papers don't involve such data and could benefit from this system as it stands, but even proprietary data can be amenable to partial reproducibility. In a typical example, molecular structures which are proprietary could be encoded into special organization-specific formats that are hard to decode (examples would be formats used by OpenEye or Rosetta). One could still run a set of modeling protocols on this cryptic data set and generate statistics without revealing the identity of the structures. Naturally there will have to be safeguards against the misuse of any such evaluation, but it's hard to see why they would be difficult to institute.


Finally, it's only a community-wide effort comprising both industry and academia that can lead to the validation and use of successful modeling protocols. The article suggests creating a kind of "CodeMed Central" repository akin to PubMed Central, and I think modeling could greatly benefit from such a central data source. Code for successful protocols in virtual screening or homology modeling or molecular dynamics or what have you could be uploaded to a site (along with the log files, of course). Not only could these protocols be checked for reproducibility, but they could also be used to practically aid data extraction from similar systems. The community as a whole would benefit.

Before there's any data generation or sharing, before there's any drawing of conclusions, before there's any advancement of scientific knowledge, there's reproducibility, a scientific virtue that has guided every field of science since its modern origin. Sadly this virtue has been neglected in modeling, so it's about time that we pay more attention to it.


Peng, R. (2011). Reproducible Research in Computational Science. Science, 334(6060), 1226-1227. DOI: 10.1126/science.1213847

How pretty is your (im)perfect p?

In my first year of grad school we were once sitting in a seminar in which someone had put up a graph with a p-value in it. My advisor asked all the students what a p-value was. When no one answered, he severely admonished us and said that any kind of scientist engaged in any kind of endeavor should know what a p-value is. It is after all paramount in most statistical studies, and especially in judging the consequences of clinical trials.

So do we really know what it is? Apparently not - not even scientists who routinely use statistics in their work. In spite of this, the use and overuse of the p-value and the constant invoking of "statistical significance" are universal phenomena. When a very low p-value is cited for a study, it is taken for granted that the results of the study must be true. My dad, who is a statistician turned economist, often says that one of the big laments about modern science is that too many scientists are never formally trained in statistics. This may be true, especially with regard to the p-value. An article in ScienceNews drives this point home and should, I think, be read by every scientist or engineer who even remotely uses statistics in his or her work. The article builds on concerns raised recently by several statisticians about the misuse and over-interpretation of these concepts in clinical trials. In fact there's an entire book devoted to criticizing the overuse of statistical significance: "The Cult of Statistical Significance".

So what's the p-value? A reasonably accurate definition is that it is the probability of getting a result at least as extreme as the one you got, under the assumption that the null hypothesis is true. The null hypothesis is the hypothesis that your sample is a completely unbiased sample. The p-value is thus related to the probability that you would get the result by chance alone in a completely unbiased sample; the lower it is, the more "confident" you can be that chance did not produce the result. It is usually used with a cutoff value, which Ronald Fisher, the great statistician who pioneered its use, rather arbitrarily set at 0.05. Thus, generally, if one gets a p-value of 0.05 or less, he or she is "confident" that the observed result is "statistically significant".
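To make the definition concrete, here is a minimal Python sketch of a permutation test with invented numbers: the p-value is simply the fraction of randomly relabeled, "null" datasets that produce a difference at least as extreme as the observed one.

import numpy as np

rng = np.random.default_rng(0)
treated = np.array([7.1, 6.8, 7.9, 8.2, 7.5, 8.0])   # invented measurements
control = np.array([6.5, 7.0, 6.9, 7.2, 6.4, 6.8])
observed = treated.mean() - control.mean()

pooled = np.concatenate([treated, control])
n_perm, count = 20_000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)                               # relabel at random: this is the null hypothesis
    diff = pooled[:len(treated)].mean() - pooled[len(treated):].mean()
    if diff >= observed:                              # "at least as extreme" (one-sided here)
        count += 1

print(f"p-value ~ {count / n_perm:.4f}")              # P(result this extreme | null is true)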

But this is where misuses and misunderstandings about the p-value just begin to manifest themselves. Firstly, a p-value of 0.05 does not immediately point to a definitive conclusion:
But in fact, there’s no logical basis for using a P value from a single study to draw any conclusion. If the chance of a fluke is less than 5 percent, two possible conclusions remain: There is a real effect, or the result is an improbable fluke. Fisher’s method offers no way to know which is which. On the other hand, if a study finds no statistically significant effect, that doesn’t prove anything, either. Perhaps the effect doesn’t exist, or maybe the statistical test wasn’t powerful enough to detect a small but real effect.
Secondly, as the article says, the most common and erroneous conclusion drawn from a p-value of 0.05 is that there is 95% confidence that the null hypothesis is false, that is, 95% confidence that the observed result is statistically significant and not due to chance alone. But as the article notes, this is a rather blasphemous logical fallacy:
Correctly phrased, experimental data yielding a P value of 0.05 means that there is only a 5 percent chance of obtaining the observed (or more extreme) result if no real effect exists (that is, if the no-difference hypothesis is correct). But many explanations mangle the subtleties in that definition. A recent popular book on issues involving science, for example, states a commonly held misperception about the meaning of statistical significance at the 0.05 level: “This means that it is 95 percent certain that the observed difference between groups, or sets of samples, is real and could not have arisen by chance.”
That interpretation commits an egregious logical error (technical term: “transposed conditional”): confusing the odds of getting a result (if a hypothesis is true) with the odds favoring the hypothesis if you observe that result. A well-fed dog may seldom bark, but observing the rare bark does not imply that the dog is hungry. A dog may bark 5 percent of the time even if it is well-fed all of the time
This underscores the above point: a low p-value may also mean that the result is indeed a fluke. An especially striking problem with p-values emerged in studies of the controversial link between antidepressants and suicide. The placebo is the gold standard in clinical trials. But it turns out that not enough attention is always paid to the fact that the frequency of incidents in two different placebo groups might itself be different. Thus, when comparing drugs one must consider not the absolute significance of each drug's effect but the difference in effects with reference to the respective placebos. Because of this, studies investigating possible links between the antidepressants Paxil and Prozac and suicidal behavior reached a conclusion that was opposite to the real one. Consider this:
“Comparisons of the sort, ‘X is statistically significant but Y is not,’ can be misleading,” statisticians Andrew Gelman of Columbia University and Hal Stern of the University of California, Irvine, noted in an article discussing this issue in 2006 in the American Statistician. “Students and practitioners [should] be made more aware that the difference between ‘significant’ and ‘not significant’ is not itself statistically significant.”

A similar real-life example arises in studies suggesting that children and adolescents taking antidepressants face an increased risk of suicidal thoughts or behavior. Most such studies show no statistically significant increase in such risk, but some show a small (possibly due to chance) excess of suicidal behavior in groups receiving the drug rather than a placebo. One set of such studies, for instance, found that with the antidepressant Paxil, trials recorded more than twice the rate of suicidal incidents for participants given the drug compared with those given the placebo. For another antidepressant, Prozac, trials found fewer suicidal incidents with the drug than with the placebo. So it appeared that Paxil might be more dangerous than Prozac.

But actually, the rate of suicidal incidents was higher with Prozac than with Paxil. The apparent safety advantage of Prozac was due not to the behavior of kids on the drug, but to kids on placebo — in the Paxil trials, fewer kids on placebo reported incidents than those on placebo in the Prozac trials. So the original evidence for showing a possible danger signal from Paxil but not from Prozac was based on data from people in two placebo groups, none of whom received either drug. Consequently it can be misleading to use statistical significance results alone when comparing the benefits (or dangers) of two drugs.
As a general and simple example of how p-values can mislead, consider two drugs X and Y, one of whose effects seem more significant than the other when compared to placebo. In this case Drug X has a p-value of 0.04 while the second one Y has a p-value of 0.06. Thus Drug X has more "significant" effects than Drug Y, right? Well, but what happens when you just compare the two drugs to each other rather than to placebo? As the article says:
If both drugs were tested on the same disease, a conundrum arises. For even though Drug X appeared to work at a statistically significant level and Drug Y did not, the difference between the performance of Drug X and Drug Y might very well NOT be statistically significant. Had they been tested against each other, rather than separately against placebos, there may have been no statistical evidence to suggest that one was better than the other (even if their cure rates had been precisely the same as in the separate tests)
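A quick simulation sketch makes the Gelman-Stern point vivid (all numbers here are invented): give two drugs exactly the same true effect and test each against its own placebo group. Quite often one drug comes out "significant" and the other doesn't, yet a head-to-head comparison of the two drugs shows no significant difference at all.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, effect, n_pairs = 50, 0.35, 5000   # invented trial size and effect (in SD units)
discordant, head_to_head_sig = 0, 0

for _ in range(n_pairs):
    drug_x, plac_x = rng.normal(effect, 1, n), rng.normal(0, 1, n)
    drug_y, plac_y = rng.normal(effect, 1, n), rng.normal(0, 1, n)
    p_x = stats.ttest_ind(drug_x, plac_x).pvalue
    p_y = stats.ttest_ind(drug_y, plac_y).pvalue
    if (p_x < 0.05) != (p_y < 0.05):                        # one drug "significant", the other not
        discordant += 1
        if stats.ttest_ind(drug_x, drug_y).pvalue < 0.05:   # direct comparison of the two drugs
            head_to_head_sig += 1

print(f"{discordant / n_pairs:.0%} of trial pairs are discordant at the 0.05 level,")
print(f"but only {head_to_head_sig / max(discordant, 1):.0%} of those show a significant "
      f"difference between the drugs themselves")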
Nor does the hallowed "statistical significance" equate, as some would think, to "importance". Whether a result is really important depends on the exact field and study under consideration. Thus, a new drug whose effect is statistically significant may only lead to one or two extra benefits per thousand people, which may not be clinically significant. It is flippant to publish a statistically significant result as a generally significant result for the field.

The article has valuable criticisms of many other common practices, including the popular practice of meta-analysis (combining and analyzing different analyses), where differences between the various trials may be obscured. These issues also point to a perpetual problem in statistics which is frequently ignored: that of ensuring that variations in your sample are uniformly distributed. Usually this is an assumption, but it's an assumption that may be flawed, especially in genetic studies of disease inheritance. This is part of an even larger and more general problem in any kind of statistical modeling: that of determining the distribution of your data. In fact a family friend who is an academic statistician once told me that about 90% of his graduate students' time is spent in determining the form of the data distribution. Computer programs have made the job easier, but the problem is not going to go away. The normal distribution may be one of the most remarkable aspects of her identity that nature reveals to us, but assuming such a distribution is also a slap in the face, since it indicates that we don't really know the true shape of the distribution.

A meta-analysis of trials of the best-selling anti-diabetic drug Avandia zeroes in on the problems:
Meta-analyses have produced many controversial conclusions. Common claims that antidepressants work no better than placebos, for example, are based on meta-analyses that do not conform to the criteria that would confer validity. Similar problems afflicted a 2007 meta-analysis, published in the New England Journal of Medicine, that attributed increased heart attack risk to the diabetes drug Avandia. Raw data from the combined trials showed that only 55 people in 10,000 had heart attacks when using Avandia, compared with 59 people per 10,000 in comparison groups. But after a series of statistical manipulations, Avandia appeared to confer an increased risk.

In principle, a proper statistical analysis can suggest an actual risk even though the raw numbers show a benefit. But in this case the criteria justifying such statistical manipulations were not met. In some of the trials, Avandia was given along with other drugs. Sometimes the non-Avandia group got placebo pills, while in other trials that group received another drug. And there were no common definitions.
Clearly it is extremely messy to compare the effects of a drug across different trials with different populations and parameters. So what is the way out of this tortuous statistical jungle? One simple remedy would be to insist on a much lower p-value threshold for significance, say 0.0001. But there's another way. The article says that Bayesian methods, which were neglected for a while in such studies, are now being preferred to address the shortcomings of these interpretations, since they incorporate prior knowledge in drawing new conclusions. Bayesian, or "conditional", probability has a long history, and it can very valuably provide counter-intuitive but correct answers. The problem with prior information is that it might not itself be available and may need to be guessed, but an educated guess is better than not using it at all. The article ends with a simple example that demonstrates this:
For a simplified example, consider the use of drug tests to detect cheaters in sports. Suppose the test for steroid use among baseball players is 95 percent accurate — that is, it correctly identifies actual steroid users 95 percent of the time, and misidentifies non-users as users 5 percent of the time.

Suppose an anonymous player tests positive. What is the probability that he really is using steroids? Since the test really is accurate 95 percent of the time, the naïve answer would be that probability of guilt is 95 percent. But a Bayesian knows that such a conclusion cannot be drawn from the test alone. You would need to know some additional facts not included in this evidence. In this case, you need to know how many baseball players use steroids to begin with — that would be what a Bayesian would call the prior probability.

Now suppose, based on previous testing, that experts have established that about 5 percent of professional baseball players use steroids. Now suppose you test 400 players. How many would test positive?

• Out of the 400 players, 20 are users (5 percent) and 380 are not users.

• Of the 20 users, 19 (95 percent) would be identified correctly as users.

• Of the 380 nonusers, 19 (5 percent) would incorrectly be indicated as users.


So if you tested 400 players, 38 would test positive. Of those, 19 would be guilty users and 19 would be innocent nonusers. So if any single player’s test is positive, the chances that he really is a user are 50 percent, since an equal number of users and nonusers test positive.
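The arithmetic in the example above is just Bayes' theorem; a few lines of Python reproduce it.

# The steroid-testing example as a direct application of Bayes' theorem.
sensitivity = 0.95   # P(test positive | user)
false_pos   = 0.05   # P(test positive | non-user)
prior       = 0.05   # P(user): prior prevalence among players

# P(user | positive) = P(positive | user) * P(user) / P(positive)
posterior = (sensitivity * prior) / (sensitivity * prior + false_pos * (1 - prior))
print(f"P(user | positive test) = {posterior:.2f}")   # 0.50, not 0.95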
In the end though, the problem is certainly related to what my dad was talking about. I know almost no scientists around me, whether chemists, biologists or computer scientists (these mainly being the ones I deal with) who were formally trained in statistics. A lot of statistical concepts that I know were hammered into my brain by my dad who knew better. Sure, most scientists including myself were taught how to derive means and medians by plugging numbers into formulas, but very few were exposed to the conceptual landscape, very few were taught how tricky the concept of statistical significance can be, how easily p-values can mislead. The concepts have been widely used and abused even in premier scientific journals and people who have pressed home this point have done all of us a valuable service. In other cases it might simply be a nuisance, but when it comes to clinical trials, simple misinterpretations of p-values and significance could translate to differences between life and death.

Statistics is not just another way of obtaining useful numbers; like mathematics it's a way of looking at the world and understanding its hidden, counterintuitive aspects that our senses don't reveal. In fact isn't that what science is precisely supposed to be about?

Some books on statistics I have found illuminating:

1. The Cartoon Guide to Statistics- Larry Gonick and Woollcott Smith. This actually does a pretty good job of explaining serious statistical concepts.
2. The Lady Tasting Tea- David Salsburg. A wonderful account of the history and significance of the science.
3. How to Lie with Statistics- Darrell Huff. How statistics can disguise facts, and how you can use this fact to your sly advantage.
4. The Cult of Statistical Significance. I haven't read it yet but it sure looks useful.
5. A Mathematician Reads The Newspaper- John Allen Paulos. A best-selling author's highly readable exposition on using math and statistics to make sense of daily news.

Datasets in Virtual Screening: starting off on the right foot

As Niels Bohr said, prediction is very difficult, especially about the future. In case of computational modeling, the real grist of value is in prediction. But for any methodology to predict it must first be able to evaluate. Sound evaluation of known data (retrospective testing) is the only means to proceed to accurate prediction of new data (prospective testing).

Over the last few years, several papers have come out comparing different structure-based and ligand-based methods for virtual screening (VS), binding mode prediction and binding affinity prediction. Every one of these goals, if accurately achieved, could save immense amounts of time and money for the industry. Every paper concludes that some method is better than another. For virtual screening, for example, many have concluded that ligand-based 3D methods are better than docking methods, and that 2D ligand-based methods are at least as good if not better.

However, such studies have to be conducted very carefully to make sure that you are not biasing your experiment for or against any method, or comparing apples and oranges. In addition, you have to use the correct metrics for evaluating your results. Failure to do these and other things can lead to erroneous and/or artificially inflated results, and hence to unreliable predictions.

Here I will talk about two aspects of virtual screening: choosing the correct dataset, and choosing the evaluation metric. The basic problems in VS are false positives and false negatives, and one wants to minimize the occurrence of both. Sound statistical analysis can do wonders for generating and evaluating good virtual screening data. This has been documented in several recent papers, notably one by Ant Nicholls from OpenEye. If you have a VS method, it's not of much use to test it on a randomly picked screen of 100,000 compounds. You need to choose the nature and number of actives and inactives in the screen judiciously to avoid bias. Here are a few things to remember that I got from the literature:

1. Standard statistical analysis tells you that the error in your results depends on the number of representatives in your sample. Thus, you need to have an adequate number of actives and inactives in your screening dataset. What is even more important is the ratio of inactives to actives. The errors inherent in choosing various such ratios have been quantified; for example, with an inactive:active ratio of 4:1, the error incurred is 11% more than that incurred with a theoretical ratio of infinite:1. For a ratio of 100:1 it's only 0.5% more than with infinite. Clearly we must use a generous ratio of inactives to actives to reduce statistical error. Incidentally, you can also increase the number of actives to reduce this error, but this is not compatible with real-life HTS, where actives are (usually) very scarce, sometimes not more than 0.1% of the screen. A small simulation below illustrates the trend.
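The exact percentages above come from the error analysis in the Nicholls paper; the qualitative trend, however, is easy to reproduce. Here is a hedged Python sketch with purely synthetic scores (not real screening data) that estimates how the spread of the measured AUC shrinks as the decoy:active ratio grows, for a fixed number of actives.

import numpy as np

rng = np.random.default_rng(7)
n_actives, n_sims = 50, 500

def auc(active_scores, decoy_scores):
    # Probability that a randomly chosen active outscores a randomly chosen decoy
    return np.mean(active_scores[:, None] > decoy_scores[None, :])

for ratio in (1, 4, 20, 100):
    n_decoys = ratio * n_actives
    aucs = [auc(rng.normal(1.0, 1.0, n_actives),    # synthetic "good" scores for actives
                rng.normal(0.0, 1.0, n_decoys))     # synthetic decoy scores
            for _ in range(n_sims)]
    print(f"decoy:active = {ratio:>3}:1  AUC = {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")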

2. Number is one thing; the nature of your actives and decoys is equally important, and simply overwhelming your screen with decoys won't do the trick. For example, consider a kinase inhibitor virtual screen in which the decoys are things like hydrocarbons and inorganic ions. In his paper, Nicholls calls distinguishing these decoys the "dog test", that is, even your dog should be able to tell them apart from the actives (not that I am belittling dogs here). We don't want a screen that makes it too easy for the method to reject the decoys. Simply throwing a screen of random compounds at your method might make it too easy for it to pick out the actives, and mislead you about its real performance.

We also don't want a method that rejects chemically similar molecules on the basis of some property like logP or molecular weight. For example, consider a method or scoring function that is sensitive to logP, and suppose it is supplied with two hypothetical molecules which have an identical core and a nitrogen in the side chain. If one side chain has an NH2 and the other is N-alkylated with a butyl group, there will be a substantial difference in logP between the two, and your method will fail to recognise them as "similar", especially from a 2D perspective. Thus, a challenging dataset for similarity-based methods is one in which the decoys and actives are property-matched. Just such a dataset has been put together by Irwin, Huang and Shoichet: the DUD (Directory of Useful Decoys) dataset of property-matched compounds. In it, 40 protein targets and their corresponding actives have been selected, and 36 property-matched decoys have been chosen for every active. This dataset is much more challenging for many methods that do well on other, random datasets. For more details, take a look at the original DUD paper. In general, there can be different kinds of decoys - random, drug-like, drug-like and property-matched, and so on - and one needs to know exactly how to choose the correct dataset. With datasets like DUD, there is an attempt to provide possible benchmarks for the modeling community.

3. Then there is the extremely important matter of evaluation. After doing a virtual screen with a well-chosen dataset and well-chosen targets, how do you actually evaluate the results and put your method in perspective? There are several metrics, but so far the most popular has been enrichment, and this is how it has been done in several publications. The idea is simple: you want your top-ranked compounds to contain as many of the actives as possible. Enrichment is simply the fraction of actives found in a certain fraction of the screened compounds. Ideally you want your enrichment curve to shoot up at the beginning, that is, you want most (ideally all) of the actives to show up in the first 1% or so of your ranked molecules. Then you compare that enrichment curve to the curve (actually a straight line) that would stem from an ideal result.
The problem with enrichment is that it is a function of both the method and the dataset, and hence of the entire experiment. For example, the ideal straight line depends on the number of actives in the dataset. If you want to do a controlled experiment, you want to make sure that the only differences in the results come from your method, and enrichment introduces another variable that complicates interpretation. Other failings of enrichment are documented in this paper.
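For concreteness, here is a minimal Python sketch (synthetic scores and labels, not real screening data) of one common variant of this metric, the enrichment factor at 1%: the fraction of all actives recovered in the top 1% of the ranked list, divided by the fraction that random selection would recover.

import numpy as np

def enrichment_factor(scores, labels, fraction=0.01):
    """Fraction of all actives found in the top `fraction` of the ranked list,
    divided by the fraction expected from random selection."""
    order = np.argsort(scores)[::-1]              # rank compounds by descending score
    n_top = max(1, int(round(fraction * len(scores))))
    actives_in_top = labels[order][:n_top].sum()
    return (actives_in_top / labels.sum()) / fraction

# Hypothetical example: 10,000 compounds, 100 actives, synthetic scores
rng = np.random.default_rng(3)
labels = np.zeros(10_000, dtype=int)
labels[:100] = 1
scores = rng.normal(0, 1, 10_000) + 1.5 * labels  # actives score higher on average
print(f"EF(1%) = {enrichment_factor(scores, labels):.1f}")   # well above 1 (random) here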

Instead, what's recommended for evaluating methods are ROC (receiver operating characteristic) curves.

Essentially, ROC curves can be used in any situation where one needs to distinguish signal from noise - and boy, is there a lot of noise around. ROC curves have an interesting history; they were developed by radar scientists during World War 2 to distinguish the signal of enemy warplanes from the noise of false hits and other artifacts. In recent times they have been used in diverse fields - psychology, medicine, epidemiology, engineering quality control - anywhere we want to pick the bad apples from the good ones. A ROC curve simply plots the false positive (FP) rate against the true positive (TP) rate. A purely random result gives a straight line at 45 degrees, implying that for every FP you get a TP - dismal performance. A good ROC curve is a hyperbola that shoots above the straight line, and a very useful measure of your method's performance is the Area Under the Curve (AUC). The AUC needs to be prudently interpreted; for instance, an AUC of 0.8 means that you can discriminate a TP by assigning it a higher score than a FP in 8 out of 10 cases. Here's a paper discussing the advantages of ROC curves for VS and detailing an actual example.
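Here is a minimal sketch of the calculation with the same kind of synthetic scores as above (not any published method's output): rank compounds by score, accumulate true and false positive rates, and integrate to get the AUC.

import numpy as np

def roc_curve(scores, labels):
    """Return false-positive and true-positive rates as the score threshold is lowered."""
    order = np.argsort(scores)[::-1]          # rank compounds by descending score
    labels = labels[order]
    tp = np.cumsum(labels)                    # actives recovered so far
    fp = np.cumsum(1 - labels)                # decoys recovered so far
    tpr = np.concatenate(([0.0], tp / tp[-1]))
    fpr = np.concatenate(([0.0], fp / fp[-1]))
    return fpr, tpr

def auc(fpr, tpr):
    # Trapezoidal integration of the ROC curve
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Same synthetic setup as before: actives score higher on average
rng = np.random.default_rng(3)
labels = np.zeros(10_000, dtype=int)
labels[:100] = 1
scores = rng.normal(0, 1, 10_000) + 1.5 * labels
fpr, tpr = roc_curve(scores, labels)
print(f"AUC = {auc(fpr, tpr):.2f}")   # roughly 0.85: an active outscores a decoy about 85% of the time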

One thing is striking: the papers linked here and elsewhere document that ROC curves may currently be the single best metric for measuring the performance of virtual screening methods. This is probably not too surprising given that they have proved so successful in other fields.

Why should modeling be different? Just like in other fields, rigorous and standard statistical metrics need to be established for the field. Only then will the comparisons between different methods and programs commonly seen these days be valid. For this, as in other fields, experiments need to be judiciously planned (including choosing the correct datasets here) and their results need to be carefully evaluated with unbiased techniques.

It is worth noting that these are mostly prescriptions for retrospective evaluations. When confronted with an unknown and novel screen, which method or combination of methods does one use? The answer to this question is still out there. In fact some of the real-life challenges run contrary to the known scenarios. For example consider a molecular screen from some novel plant or marine sponge. Are the molecules in this screen going to be drug-like? Certainly not. Is this going to have the right ratio of actives to decoys? Who knows (the whole point is to find the actives). Is it going to be random? Yes. If so, how random? In all actual screenings, there are a lot of unknowns out there. But it's still very useful to know about "known unknowns" and "unknown unknowns", and retrospective screening and the design of experiments can help us unearth some of these. If nothing else, it indicates attention to sound scientific and statistical principles.

In later posts, we will take a closer look at statistical evaluation and dangers in pose-prediction including being always wary of crystal structures, as well as something I found fascinating- bias in virtual screen design and evaluation that throws light on chemist psychology itself. This is a learning experience for me as much or more than it is for anyone else.


References:

1. Hawkins, P.C., Warren, G.L., Skillman, A.G., Nicholls, A. (2008). How to do an evaluation: pitfalls and traps. Journal of Computer-Aided Molecular Design, 22(3-4), 179-190. DOI: 10.1007/s10822-007-9166-3

2. Triballeau, N., Acher, F., Brabet, I., Pin, J., Bertrand, H. (2005). Virtual screening workflow development guided by the "receiver operating characteristic" curve approach. Application to high-throughput docking on metabotropic glutamate receptor subtype 4. Journal of Medicinal Chemistry, 48(7), 2534-2547. DOI: 10.1021/jm049092j

3. Huang, N., Shoichet, B., Irwin, J. (2006). Benchmarking sets for molecular docking. Journal of Medicinal Chemistry, 49(23), 6789-6801. DOI: 10.1021/jm0608356