\( \alpha < 0.05 \)!


Wait, what does P stand for again?

I will never forget a conversation I had with an undergraduate mentor while showing him one of my research reports. Upon going through some of my figures, several of them proudly labelled with p-values, he asked if I knew what a p-value was. At that time I had yet to take a formal course on probability theory or statistics, but my apparent ignorance clearly did not stop me from using these magic numbers as embellishments.

Looking back, my mentor gave a really great answer to my dumbstruck silence. In fact, his answer was a somewhat less formal and more general version of what Sir Ronald Fisher [1] spoke of almost a hundred years ago:

… what is the probability that a value so distributed, chosen at random, shall exceed a given deviation.

In other words, a p-value is the probability of obtaining results that are at least as extreme as the observed data, assuming that the null hypothesis is true. There’s quite a bit to unpack here, so let’s begin with the null hypothesis and hypothesis testing.

A test of patience

Imagine that for as long as you can remember, your next-door neighbor has been practicing his trumpet at an obnoxiously loud volume an average of 5 times a week. Fed up with his antics, you confront him and he agrees to reduce the number of practice sessions per week. Skeptical of his response, you then record the number of practice sessions in the following week and hypothesize that one of two things could happen:

\(H_0 : \) He refuses to change his ways
\(H_1 : \) He reduces his practice frequency

Here we have conceptualized both the null (\(H_0 \)) and alternative (\(H_1\)) hypotheses and collected some data to test them, but what do \(H_0 \) and \(H_1\) actually represent? To answer this question, we turn once again to Sir Ronald Fisher [2]:

We have spoken of the experiment as testing a certain null hypothesis … we have too assigned as appropriate to this hypothesis … the frequency distribution appropriate to a classification by pure chance.

If we view hypotheses as testable claims about observable phenomena, and if we assume that the processes by which phenomena arise are fundamentally random, then defining a statistical hypothesis amounts to specifying the probability distribution that the data would follow under \(H_0\) or \(H_1\).

Now when it comes to probability distributions, it doesn’t get more commonplace than the normal distribution. In fact, Google “Hypothesis Testing” and under Images you will find a plethora of bell-curve diagrams. Although it may be tempting to immediately associate hypothesis testing with normal distributions, I believe it is important to realize that normal distributions are everywhere because they are often used as large-sample approximations and make computation convenient. In principle, you could perform a hypothesis test with any probability distribution.
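To see why bell curves show up once samples get large, here is a minimal Python sketch of the central limit theorem at work; the choice of numpy, the exponential distribution, and the sample sizes are all my own illustrative assumptions, not anything from the sources cited here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 10,000 samples of size n from a clearly non-normal distribution
# (exponential with mean 1) and look at the distribution of their sample means.
n = 100
sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

# By the central limit theorem, the sample means should be approximately
# normal with mean 1 and standard deviation 1 / sqrt(n).
print(f"mean of sample means: {sample_means.mean():.3f} (theory: 1.000)")
print(f"std  of sample means: {sample_means.std():.3f} (theory: {1 / np.sqrt(n):.3f})")
```

Averages of many independent observations tend toward a normal shape even when the underlying data are nothing like a bell curve, which is why normal curves dominate large-sample approximations.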

If we assume that the neighbor’s weekly practice sessions after the confrontation (\(X\)) follow a Poisson distribution \(X \sim \text{Poisson}(\lambda)\), with the average rate of weekly practice parameterized by \( \lambda \), we can then rewrite our original hypotheses as:

\(H_0 : \lambda = 5 \)
\(H_1 : \lambda < 5 \)

Going back to our earlier definition of p-values, the idea of results being “at least as extreme” is connected to \(H_1\). Here, extremity is tied to \(\lambda < 5 \), which implies a lower rate of weekly practice. As such, results that are less than or equal to what was observed would be considered “at least as extreme”. Assuming we observe 4 practice sessions, we calculate our p-value (\( \alpha \)) as:

\( \begin{aligned} P(X \leq 4 \mid H_0 \ \text{is true}) &= P(X \leq 4 \mid \lambda = 5) \\ &= P(\text{observing} \leq 4 \ \text{sessions given that} \ X \sim \text{Poisson}(5)) \\ &= \alpha \end{aligned} \)

Although the exact computation of this p-value is not central to our discussion, the same concept holds for any other dataset or distribution. In fact, you could proceed to collect data over \(2, 3, 4, \ldots, N\) weeks and generate a p-value for each set of observed results. At the end of the day, p-values are just a numerical representation of how strongly our observed data presents evidence against the null hypothesis.
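That said, the exact number is easy to obtain if you are curious. A minimal sketch, assuming scipy is available (the library choice is mine, not the post’s), evaluating the Poisson CDF from the equation above:

```python
from scipy.stats import poisson

# Null hypothesis: weekly practice sessions X ~ Poisson(lambda = 5).
# Observed: 4 sessions in the week after the confrontation.
# One-sided p-value: P(X <= 4 | lambda = 5), i.e. the probability of a
# result at least as extreme (as low or lower) as what was observed.
p_value = poisson.cdf(4, mu=5)
print(f"P(X <= 4 | lambda = 5) = {p_value:.3f}")  # roughly 0.44
```

A p-value of roughly 0.44 means that, even if your neighbor changed nothing at all, you would see 4 or fewer sessions in a given week almost half the time, which is hardly compelling evidence against \(H_0\).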

Some misconceptions

Now that we have outlined what p-values are, I believe it is almost more important to understand what p-values are not. The discussion below is based on the following publications [3, 4, 5]. For the sake of brevity, only two selected points are presented here, and I highly recommend giving these enlightening papers a read.

#1: Misinterpreting the Probability in P-value

While the “p” in p-value does stand for probability, mistaking it for the wrong probability can lead to some very erroneous conclusions. For instance, common misinterpretations are that \( \alpha \):

\( \begin{aligned} \alpha &= P(H_0 \ \text{is true}) \quad \text{or} \\ \alpha &= P(\text{the observed data occurs, assuming} \ H_0 \ \text{is true}) \end{aligned} \)

As we have thoroughly discussed above, a p-value represents the probability of collecting data at least as extreme as what was observed, given that \(H_0\) is true.
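To make the distinction concrete, here is a small simulation sketch (again assuming numpy; the simulation itself is my illustration, not something taken from the cited papers) contrasting the p-value with the probability of observing exactly the data at hand:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate many weeks under the null hypothesis H0: X ~ Poisson(5).
weeks_under_h0 = rng.poisson(lam=5, size=1_000_000)

# The p-value is the probability of data AT LEAST as extreme as observed
# (here, 4 or fewer sessions), given that H0 is true ...
p_value = np.mean(weeks_under_h0 <= 4)

# ... which is not the same as the probability of observing exactly the
# data we saw under H0, and says nothing by itself about P(H0 is true).
p_exact = np.mean(weeks_under_h0 == 4)

print(f"P(X <= 4 | H0) ~ {p_value:.3f}")   # around 0.44
print(f"P(X  = 4 | H0) ~ {p_exact:.3f}")   # around 0.18
```

Neither of these quantities is \(P(H_0 \ \text{is true})\); computing that would require additional assumptions (such as a prior probability for \(H_0\)) that a p-value alone does not provide.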

#2: Placing too much significance on “significance”

This misconception is somewhat multifaceted and will be addressed in several points. Firstly, the idea that \( \alpha < 0.05 \) implies a statistically significant result is entirely arbitrary. Unsurprisingly, it was Fisher who suggested this 0.05 threshold, and in fact his response to an \( \alpha < 0.05 \) was that we should repeat the experiment:

A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.

This concept of “significance” brings us to our second point: p-values are not an accurate representation of the true effect size or magnitude of the experiment in question. A theoretical explanation is beyond the scope of this discussion (perhaps for another post), but essentially, p-values are heavily dependent on sample size. With reference to our neighbor example, instead of just 1 week, you could have collected \( N \) weeks’ worth of data and calculated a potentially different p-value. Due to the increased sample size, the second p-value could have been “significant” even if the first was not. Accordingly, just because an experiment was shown to be “significant” does not mean there was a truly conclusive effect, and vice versa.
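To illustrate that sample-size dependence, here is a hypothetical sketch (the 4-sessions-per-week average and the week counts are made-up numbers, and scipy is again an assumed dependency). Because the sum of independent Poisson(5) counts over \(N\) weeks is Poisson(\(5N\)), the one-sided p-value for the very same per-week average shrinks as \(N\) grows:

```python
from scipy.stats import poisson

# Hypothetical illustration: suppose the neighbor averages 4 practice
# sessions per week, and we test H0: lambda = 5 using the total count
# over N weeks. Under H0 the total is Poisson(5 * N), so the one-sided
# p-value is P(total <= 4 * N | lambda = 5 * N).
for n_weeks in [1, 4, 16, 52]:
    p_value = poisson.cdf(4 * n_weeks, mu=5 * n_weeks)
    print(f"N = {n_weeks:2d} weeks -> p-value = {p_value:.4f}")
```

The observed effect (4 sessions per week instead of 5) never changes, yet the p-value drifts from clearly “non-significant” to well below 0.05 purely because more weeks were observed.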

Lastly, it is also incorrect to compare p-values between studies with different experimental conditions, for instance concluding that an experiment with \( \alpha = 0.01 \) is more effective or efficacious than another with \( \alpha = 0.06 \). Not only would these p-values be inaccurate measures of effectiveness, they are also defined in relation to their specific null hypotheses. Therefore, comparing “significance” between p-values from two different null hypotheses would be the equivalent of comparing apples to oranges.




  1. Fisher, Ronald Aylmer. Statistical Methods for Research Workers. Springer New York, 1992.

  2. Fisher, Ronald Aylmer. “Design of experiments.” British Medical Journal 1.3923 (1936): 554.

  3. Goodman, Steven N. “Toward evidence-based medical statistics. 1: The P value fallacy.” Annals of Internal Medicine 130.12 (1999): 995-1004.

  4. Goodman, Steven. “A dirty dozen: twelve p-value misconceptions.” Seminars in Hematology. Vol. 45. No. 3. WB Saunders, 2008.

  5. Sullivan, Gail M., and Richard Feinn. “Using effect size—or why the P value is not enough.” Journal of Graduate Medical Education 4.3 (2012): 279-282.