MCB 111 week 4 Section

p-values and p-hacking

There is an infectious disease that if left untreated, affected people have a probability of recovery of $p_0=0.4$. Let's call this the null hypothesis $H_0$.

Your lab is testing two new treatments, treatment A (TrA) and treatment B (TrB). You run two different experiments (expA and expB) where TrA or TrB is given to two different groups of affected people.

Your analysis tells you that the p-values for the outcomes of expA and expB respect to the null hypothesis $H_0$ are

$$\begin{aligned} \text{pval}(expA) &= 0.2131,\\ \text{pval}(expB) &= 2.09\times 10^{-6}.\\ \end{aligned}$$

Can we conclude that TrB is more effective than TrA? The p-value is lower!


Let's see a simple confounding variable that can underly this observation. Say that

What are the p-values under the null hypothesis of no treatment?

$$\begin{aligned} pval(expA) &= P(\text{observing result as or more extreme than 8 infected} \mid N=15, p_0=0.4) \\ &= \sum_{n=8}^{15} P(n \mid N = 15, p_0=0.4) \\ &= \sum_{n=8}^{15} \frac{15!}{n!(15-n)!} 0.4^n 0.6^{15-n} \\ &= 0.2131 \end{aligned}$$$$\begin{aligned} pval(expB) &= P(\text{observing result as or more extreme than 160 infected} \mid N=300, p_0=0.4) \\ &= \sum_{n=160 }^{300} \frac{300!}{n!(300-n)!} 0.4^n 0.6^{300-n} \\ &= 2.09\times10^{-6} \end{aligned}$$

But, let's compare the effectiveness for each treatment:

$$\hat{p}_A = \frac{8}{15} = 0.5\bar{3}$$$$\hat{p}_B = \frac{160 }{300} = 0.5\bar{3}$$

These correspond to the modes of the respective posterior distributions obtained for each probability assuming a uniform prior (i.e. maximum likelihood estimators for $p_A$ and $p_B$)

So what have we seen here? These two treatments have the same effectiveness, but because the sample sizes of the experiments are so different, the null distribution for expB is much narrower than that of expA, so the p-value for expB is smaller as a result.

You can never compare two hypotheses by looking to their p-values relative to a third null hypothesis. There are many hidden and possibly confounding variables in that calculation, one of them the sample size.

Let's instead look at the posterior distributions for the two experiments, as done before in week 2:

$$\begin{aligned} P(p_A\mid expA) &= \frac{16!} {8! 7!}\, p_A^{8} (1-p_A)^{7}\\ P(p_B\mid expB) &= \frac{301!}{160! 140!}\, p_B^{160} (1-p_B)^{140} \\ \end{aligned}$$

These posteriors are compatible with both treatments having the same effectiveness, as they are both centered around the same value, but the estimate of $p_B$ is more precise than that of $p_A$ because there expB had much more data.

As we did in the homework of w02, you remember that when the data follows a binomial distribution, the best estimate of the Bernoulli parameter p and its confidence value are given by

$$ \begin{equation} \hat{p} = \tilde{\mu} = \frac{n}{N} \quad \tilde{\sigma} = \sqrt{\frac{\hat{p}(1-\hat{p})}{N}}. \end{equation} $$

The best estimates and confidence values for the effectiveness of the two treatments are

$$ \begin{aligned} p_A &\approx 0.533 \pm 0.129\\ p_B &\approx 0.523 \pm 0.029. \end{aligned} $$

In fact, if the two experiments had been run with using the same treatment, you would have obtained one “significant” p-value and one “non significant” p-value just because of the different sample size.

$$P(p_A<p_B)= 1 -\text{cdf}(0 | \mu = p_A - p_B, \sigma^2 = \sigma_A^2 + \sigma_B^2)$$$$P(p_A<p_B)= 1 -\Phi\left(\frac{\mu_A-\mu_B}{\sqrt{\sigma_A^2+\sigma_B^2}}\right)$$
$$ \begin{aligned} P(p_A<p_B)&= P(p_A - p_B < 0) \\ \\ P(p_A<p_B)&= 1 -\text{cdf}(0 | \mu = p_A - p_B, \sigma^2 = \sigma_A^2 + \sigma_B^2)\\ \\ P(p_A<p_B)&= 1 -\Phi\left(\frac{\mu_A-\mu_B}{\sqrt{\sigma_A^2+\sigma_B^2}}\right)\\ \\ &= 1 - \Phi\left(\frac{0.5\bar{3}-0.5\bar{3}}{\sqrt{0.129^2+0.029^2}}\right) \\ \\ &= 1 - \Phi\left(0\right) \\ P(p_A<p_B) &\approx 50\% \\ \end{aligned} $$

Do not compare p-values of different experiments!