How pretty is your (im)perfect p?

In my first year of grad school we were once sitting in a seminar in which someone had put up a graph with a p-value in it. My advisor asked all the students what a p-value was. When no one answered, he severely admonished us and said that any kind of scientist engaged in any kind of endeavor should know what a p-value is. It is after all paramount in most statistical studies, and especially in judging the consequences of clinical trials.

So do we really know what it is? Apparently not; not even scientists who routinely use statistics in their work. In spite of this, the use and overuse of the p-value and the constant invoking of "statistical significance" are universal phenomena. When a very low p-value is cited for a study, it is taken for granted that the results of the study must be true. My dad, who is a statistician turned economist, often says that one of the big laments about modern science is that too many scientists are never formally trained in statistics. This may be true, especially with regard to the p-value. An article in ScienceNews drives home this point, and I think it should be read by every scientist or engineer who uses statistics even remotely in his or her work. The article builds on concerns raised recently by several statisticians about the misuse and over-interpretation of these concepts in clinical trials. In fact there is an entire book devoted to criticizing the misuse of statistical significance: "The Cult of Statistical Significance".

So what's the p-value? A reasonably accurate definition is that it is the probability of getting a result at least as extreme as the one you actually got, under the assumption that the null hypothesis is true. The null hypothesis is the hypothesis of "no real effect": the assumption that any difference you see in your sample arose by chance alone. The p-value is thus related to the probability of getting your result by chance alone; the lower it is, the more "confident" you can be that chance alone did not produce the result. In practice it is compared against a cutoff value. Ronald Fisher, the great statistician who pioneered its use, rather arbitrarily set this cutoff at 0.05. Thus, if one gets a p-value of 0.05 or lower, one is generally "confident" that the observed result is "statistically significant".
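
Since the definition is so easy to garble, it may help to see it as a computation. Here is a minimal sketch (with entirely made-up measurements, and using a simple permutation test rather than Fisher's original machinery) that estimates a p-value directly from the definition: the fraction of "chance-only" shuffles of the data that produce a difference at least as extreme as the one actually observed.

    # A sketch of what a p-value measures, via a two-sided permutation test:
    # shuffle the group labels (which enforces the null hypothesis of "no real
    # difference") and count how often chance alone produces a difference at
    # least as extreme as the observed one.
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical measurements for a treated group and a control group
    treated = np.array([5.1, 4.9, 6.2, 5.8, 5.5, 6.0, 5.7, 5.9])
    control = np.array([4.8, 5.0, 4.6, 5.2, 4.9, 5.1, 4.7, 5.3])

    observed = treated.mean() - control.mean()
    pooled = np.concatenate([treated, control])

    n_shuffles, count = 100_000, 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)                    # a world where the null is true
        diff = pooled[:len(treated)].mean() - pooled[len(treated):].mean()
        if abs(diff) >= abs(observed):         # "at least as extreme"
            count += 1

    print(f"observed difference = {observed:.2f}, p = {count / n_shuffles:.4f}")

Note that nothing in this calculation says anything about how probable the null hypothesis itself is; that distinction is exactly where the trouble starts.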

But this is where misuses and misunderstandings about the p-value just begin to manifest themselves. Firstly, a p-value of 0.05 does not immediately point to a definitive conclusion:
But in fact, there’s no logical basis for using a P value from a single study to draw any conclusion. If the chance of a fluke is less than 5 percent, two possible conclusions remain: There is a real effect, or the result is an improbable fluke. Fisher’s method offers no way to know which is which. On the other hand, if a study finds no statistically significant effect, that doesn’t prove anything, either. Perhaps the effect doesn’t exist, or maybe the statistical test wasn’t powerful enough to detect a small but real effect.
Secondly, as the article says, the most common and erroneous conclusion drawn from a p-value of 0.05 is that there is 95% confidence that the null hypothesis is false; that is, 95% confidence that the observed result is real and did not arise by chance alone. But as the article notes, this is a rather blasphemous logical fallacy:
Correctly phrased, experimental data yielding a P value of 0.05 means that there is only a 5 percent chance of obtaining the observed (or more extreme) result if no real effect exists (that is, if the no-difference hypothesis is correct). But many explanations mangle the subtleties in that definition. A recent popular book on issues involving science, for example, states a commonly held misperception about the meaning of statistical significance at the 0.05 level: “This means that it is 95 percent certain that the observed difference between groups, or sets of samples, is real and could not have arisen by chance.”
That interpretation commits an egregious logical error (technical term: “transposed conditional”): confusing the odds of getting a result (if a hypothesis is true) with the odds favoring the hypothesis if you observe that result. A well-fed dog may seldom bark, but observing the rare bark does not imply that the dog is hungry. A dog may bark 5 percent of the time even if it is well-fed all of the time.
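
To make the "real effect or improbable fluke" dilemma concrete, here is a rough simulation with made-up numbers (the assumption that only 10% of tested hypotheses correspond to real effects, and the chosen effect size, are purely illustrative): even when every reported finding clears p < 0.05, a large fraction of those findings can still be flukes if real effects are rare.

    # How many "significant" results are flukes? Simulate many small two-group
    # experiments in which only 10% involve a real effect, test each one, and
    # look at what survives the p < 0.05 filter.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_experiments, n_per_group, effect_size = 20_000, 30, 0.5

    reals = flukes = 0
    for _ in range(n_experiments):
        has_effect = rng.random() < 0.10                  # assumed 10% base rate
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(effect_size if has_effect else 0.0, 1.0, n_per_group)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            if has_effect:
                reals += 1
            else:
                flukes += 1

    print(f"{reals + flukes} significant results, "
          f"of which {flukes / (reals + flukes):.0%} are flukes")

With these hypothetical numbers, roughly half of the "statistically significant" findings come from experiments in which no effect existed at all.
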
This underscores the above point: a low p-value may also mean that the result is indeed a fluke. An especially striking problem with p-values emerged in studies of the controversial link between antidepressants and suicide. The placebo is the gold standard in clinical trials. But not enough attention is always paid to the fact that the frequency of incidents can differ even between two different placebo groups. Since each drug's effect is judged not in absolute terms but by comparison with its own trial's placebo group, studies of the antidepressants Paxil and Prozac ended up suggesting a conclusion opposite to what the raw rates actually showed. Consider this:
“Comparisons of the sort, ‘X is statistically significant but Y is not,’ can be misleading,” statisticians Andrew Gelman of Columbia University and Hal Stern of the University of California, Irvine, noted in an article discussing this issue in 2006 in the American Statistician. “Students and practitioners [should] be made more aware that the difference between ‘significant’ and ‘not significant’ is not itself statistically significant.”

A similar real-life example arises in studies suggesting that children and adolescents taking antidepressants face an increased risk of suicidal thoughts or behavior. Most such studies show no statistically significant increase in such risk, but some show a small (possibly due to chance) excess of suicidal behavior in groups receiving the drug rather than a placebo. One set of such studies, for instance, found that with the antidepressant Paxil, trials recorded more than twice the rate of suicidal incidents for participants given the drug compared with those given the placebo. For another antidepressant, Prozac, trials found fewer suicidal incidents with the drug than with the placebo. So it appeared that Paxil might be more dangerous than Prozac.

But actually, the rate of suicidal incidents was higher with Prozac than with Paxil. The apparent safety advantage of Prozac was due not to the behavior of kids on the drug, but to kids on placebo — in the Paxil trials, fewer kids on placebo reported incidents than those on placebo in the Prozac trials. So the original evidence for showing a possible danger signal from Paxil but not from Prozac was based on data from people in two placebo groups, none of whom received either drug. Consequently it can be misleading to use statistical significance results alone when comparing the benefits (or dangers) of two drugs.
As a general and simple example of how p-values can mislead, consider two drugs X and Y, each compared against a placebo. Suppose Drug X comes out with a p-value of 0.04 while Drug Y comes out with a p-value of 0.06. Thus Drug X has a "significant" effect and Drug Y does not, right? Well, what happens when you compare the two drugs directly to each other rather than each to its placebo? As the article says:
If both drugs were tested on the same disease, a conundrum arises. For even though Drug X appeared to work at a statistically significant level and Drug Y did not, the difference between the performance of Drug X and Drug Y might very well NOT be statistically significant. Had they been tested against each other, rather than separately against placebos, there may have been no statistical evidence to suggest that one was better than the other (even if their cure rates had been precisely the same as in the separate tests).
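
This can be checked with a quick back-of-the-envelope calculation. The cure counts below are made up purely to illustrate the trap, and the helper two_prop_p is just a plain two-proportion z-test, not anything from the article:

    # Hypothetical trial results: Drug X barely clears the 0.05 bar against its
    # placebo, Drug Y barely misses it, yet the two drugs are statistically
    # indistinguishable from each other.
    from math import sqrt
    from scipy.stats import norm

    def two_prop_p(success1, n1, success2, n2):
        """Two-sided p-value from a simple two-proportion z-test."""
        p1, p2 = success1 / n1, success2 / n2
        pooled = (success1 + success2) / (n1 + n2)
        se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        return 2 * norm.sf(abs((p1 - p2) / se))

    # 200 patients per arm, cure counts chosen for illustration
    print("Drug X vs placebo:", round(two_prop_p(68, 200, 50, 200), 3))   # about 0.05, "significant"
    print("Drug Y vs placebo:", round(two_prop_p(66, 200, 50, 200), 3))   # about 0.08, "not significant"
    print("Drug X vs Drug Y :", round(two_prop_p(68, 200, 66, 200), 3))   # about 0.8, no detectable difference

The first two comparisons fall on opposite sides of the 0.05 line, yet the direct comparison between the two drugs gives no evidence whatsoever that one outperforms the other.
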
Nor does, as some would think, the hallowed "statistical significance" equate to "importance". Whether a result is really important depends on the exact field and study under consideration. A new drug whose effect is statistically significant may provide only one or two extra benefits per thousand people, which may not be clinically significant. It is flippant to present a statistically significant result as a generally important result for the field.
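
To see the gap between the two kinds of significance, consider a hypothetical, deliberately oversized trial: with enough patients, a benefit of about two people per thousand produces a p-value that looks overwhelming.

    # Statistically significant is not the same as clinically important.
    # Hypothetical trial with 100,000 patients per arm and a benefit of about
    # 2 extra people per 1,000: the p-value is tiny, the absolute benefit small.
    from scipy.stats import chi2_contingency

    treated = [1200, 98800]    # [benefited, did not benefit]
    placebo = [1000, 99000]

    chi2, p, dof, expected = chi2_contingency([treated, placebo])
    benefit = treated[0] / sum(treated) - placebo[0] / sum(placebo)

    print(f"p-value = {p:.1e}")                     # far below 0.05
    print(f"absolute benefit = {benefit:.3%}")      # about 0.2 percentage points

Whether that 0.2 percentage point benefit justifies the drug's cost and side effects is a clinical and economic question that no p-value can answer.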

The article has valuable criticisms of many other common practices, including the popular practice of meta-analysis (pooling and re-analyzing data from several different studies), where differences between the various trials may be obscured. These issues also point to a perpetual problem in statistics which is frequently ignored: that of ensuring that the variation in your sample is distributed the way you assume it is. Usually this is just an assumption, and it may be a flawed one, especially in genetic studies of disease inheritance. This is part of an even larger and more general problem in any kind of statistical modeling: that of determining the distribution of your data. In fact a family friend who is an academic statistician once told me that about 90% of his graduate students' time is spent in determining the form of the data distribution. Computer programs have made the job easier, but the problem is not going to go away. The normal distribution may be one of the most remarkable aspects of her identity that nature reveals to us, but assuming such a distribution is also a slap in the face, since it indicates that we don't really know the shape of the distribution.
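
As a small illustration of that chore (with deliberately skewed, made-up data), here is the kind of check a statistician might run before trusting a normality assumption:

    # The data below are drawn from a log-normal (skewed) distribution; blindly
    # assuming normality would go unnoticed unless the assumption is tested.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    data = rng.lognormal(mean=0.0, sigma=0.75, size=500)

    stat, p = stats.normaltest(data)               # D'Agostino-Pearson normality test
    print(f"mean = {data.mean():.2f}, median = {np.median(data):.2f}")
    print(f"normality test p-value = {p:.2g}")     # tiny: normality is a poor assumption

    # A common remedy: work on a transformed scale where the assumption holds better
    stat_log, p_log = stats.normaltest(np.log(data))
    print(f"after a log transform, p-value = {p_log:.2g}")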

A meta-analysis of trials of the best-selling anti-diabetic drug Avandia locks in on the problems:
Meta-analyses have produced many controversial conclusions. Common claims that antidepressants work no better than placebos, for example, are based on meta-analyses that do not conform to the criteria that would confer validity. Similar problems afflicted a 2007 meta-analysis, published in the New England Journal of Medicine, that attributed increased heart attack risk to the diabetes drug Avandia. Raw data from the combined trials showed that only 55 people in 10,000 had heart attacks when using Avandia, compared with 59 people per 10,000 in comparison groups. But after a series of statistical manipulations, Avandia appeared to confer an increased risk.

In principle, a proper statistical analysis can suggest an actual risk even though the raw numbers show a benefit. But in this case the criteria justifying such statistical manipulations were not met. In some of the trials, Avandia was given along with other drugs. Sometimes the non-Avandia group got placebo pills, while in other trials that group received another drug. And there were no common definitions.
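
The raw numbers in that quote can be checked in a couple of lines. The naive pooled comparison below is only a sanity check with the rounded figures given above, not a substitute for the trial-by-trial analysis a real meta-analysis performs:

    # Pooled raw counts as quoted: 55 heart attacks per 10,000 on Avandia versus
    # 59 per 10,000 in the comparison groups.
    from scipy.stats import chi2_contingency

    avandia = [55, 10000 - 55]          # [heart attacks, no heart attack]
    comparison = [59, 10000 - 59]

    risk_ratio = (55 / 10000) / (59 / 10000)
    chi2, p, dof, expected = chi2_contingency([avandia, comparison])

    print(f"raw risk ratio = {risk_ratio:.2f}")          # below 1: fewer events on the drug
    print(f"p-value for the raw 2x2 table = {p:.2f}")    # nowhere near 0.05

On the face of it the raw counts slightly favor Avandia; it was only after the statistical adjustments, whose preconditions the article argues were not met, that the drug appeared to confer an increased risk.
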
Clearly it is extremely messy to compare the effects of a drug across different trials with different populations and parameters. So what is the way out of this tortuous statistical jungle? One remedy would be simply to insist on a much lower p-value for significance, say 0.0001. But there is another way. The article says that Bayesian methods, which were neglected for a while in such studies, are now being preferred to address these shortcomings, since they explicitly incorporate prior knowledge into new conclusions. Bayesian reasoning, built on conditional probability, has a long history and can provide valuably counter-intuitive but correct answers. The problem is that the prior information itself might not be available and may need to be guessed, but an educated guess about it is better than not using it at all. The article ends with a simple example that demonstrates this:
For a simplified example, consider the use of drug tests to detect cheaters in sports. Suppose the test for steroid use among baseball players is 95 percent accurate — that is, it correctly identifies actual steroid users 95 percent of the time, and misidentifies non-users as users 5 percent of the time.

Suppose an anonymous player tests positive. What is the probability that he really is using steroids? Since the test really is accurate 95 percent of the time, the naïve answer would be that the probability of guilt is 95 percent. But a Bayesian knows that such a conclusion cannot be drawn from the test alone. You would need to know some additional facts not included in this evidence. In this case, you need to know how many baseball players use steroids to begin with — that would be what a Bayesian would call the prior probability.

Now suppose, based on previous testing, that experts have established that about 5 percent of professional baseball players use steroids. Now suppose you test 400 players. How many would test positive?

• Out of the 400 players, 20 are users (5 percent) and 380 are not users.

• Of the 20 users, 19 (95 percent) would be identified correctly as users.

• Of the 380 nonusers, 19 (5 percent) would incorrectly be indicated as users.


So if you tested 400 players, 38 would test positive. Of those, 19 would be guilty users and 19 would be innocent nonusers. So if any single player’s test is positive, the chances that he really is a user are 50 percent, since an equal number of users and nonusers test positive.
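
The arithmetic in that example is just Bayes' theorem, and writing it out makes the role of the prior explicit (the small function below is my own illustration, not something from the article):

    # P(steroid user | positive test), given a prior prevalence of use, a test
    # sensitivity of 95%, and a false-positive rate of 5%.
    def p_user_given_positive(prior, sensitivity=0.95, false_positive_rate=0.05):
        true_pos = prior * sensitivity                    # users who test positive
        false_pos = (1 - prior) * false_positive_rate     # non-users who test positive
        return true_pos / (true_pos + false_pos)

    print(p_user_given_positive(0.05))    # 0.50: the article's 400-player example
    print(p_user_given_positive(0.50))    # 0.95: only if half of all players were users
    print(p_user_given_positive(0.01))    # ~0.16: if use is rare, most positives are innocent

The naive "95 percent certain" answer would be right only if half of all players were users to begin with; the same test result means very different things under different priors, and that prior information is exactly what the p-value framework leaves out.
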
In the end, though, the problem is certainly related to what my dad was talking about. I know almost no scientists around me, whether chemists, biologists or computer scientists (these mainly being the ones I deal with), who were formally trained in statistics. A lot of the statistical concepts I know were hammered into my brain by my dad, who knew better. Sure, most scientists including myself were taught how to calculate means and medians by plugging numbers into formulas, but very few were exposed to the conceptual landscape; very few were taught how tricky the concept of statistical significance can be, and how easily p-values can mislead. The concepts have been widely used and abused even in premier scientific journals, and the people who have pressed home this point have done all of us a valuable service. In some cases the misuse might simply be a nuisance, but when it comes to clinical trials, simple misinterpretations of p-values and significance can translate into the difference between life and death.

Statistics is not just another way of obtaining useful numbers; like mathematics, it is a way of looking at the world and understanding its hidden, counterintuitive aspects that our senses don't reveal. In fact, isn't that precisely what science is supposed to be about?

Some books on statistics I have found illuminating:

1. The Cartoon Guide to Statistics - Larry Gonick and Woollcott Smith. This actually does a pretty good job of explaining serious statistical concepts.
2. The Lady Tasting Tea - David Salsburg. A wonderful account of the history and significance of the science.
3. How to Lie with Statistics - Darrell Huff. How statistics can disguise facts, and how you can use this fact to your sly advantage.
4. The Cult of Statistical Significance - Stephen Ziliak and Deirdre McCloskey. I haven't read it yet but it sure looks useful.
5. A Mathematician Reads the Newspaper - John Allen Paulos. A best-selling author's highly readable exposition on using math and statistics to make sense of daily news.

5 comments:

  1. Somehow this is a scary post! I can't help wondering if the stats on these large investigations of nutrition are properly done. I am especially thinking of those "weird" results like "one bear a day is good" and "the more red wine you drink the longer you live" kinds of studies.

  2. I would almost certainly take a hard look at those!

  3. I would certainly question a study that showed that consuming one "bear" a day was good for you!

  4. I would especially question the volunteers enrolled in that particular trial!

  5. Yeah - Okay I need to learn to proofread my comments before posting..
    It was of course a beer not a bear

