MCB111 w04 Section¶

p-hacking¶

In [1]:
import numpy as np
from scipy.special import comb
import scipy.stats as stats
import matplotlib.pyplot as plt
In [2]:
np.random.seed(2023)

https://xkcd.com/882/


Examples of p-Hacking:¶

  1. Stop collecting data once $p \lt 0.05$
  2. Analyze many measures, but report only those with $p \lt 0.05$
  3. Collect and analyze many conditions, but only report those with $p \lt 0.05$
  4. Use covariates to get $p \lt 0.05$
  5. Exclude participants to get $p \lt 0.05$.
  6. Transform the data to get $p \lt 0.05$
  7. Increase the n in one group to get a $p \lt 0.05$.

Hypothetical scenario¶

Your lab wants to develop a drug to reduce the size of tumors in a particular kind of cancer. In a screening fashion, your lab is exhaustively testing compounds to find out whether any of them work. So, using an animal model, you measured the size of the tumors from three specimens after a certain time, both without any drug and with each candidate drug.

Below are results from one drug.

In [3]:
def measure_tumor(group):
    # both groups are drawn from the SAME distribution, i.e., this "drug" has no real effect
    loc,scale = 15,5
    fig,ax = plt.subplots(ncols=1,sharey=True,figsize=(3,3))
    control = stats.norm.rvs(loc=loc,scale=scale,size=3)
    drug = stats.norm.rvs(loc=loc,scale=scale,size=3)
    ax.scatter([0,0,0],control)
    ax.scatter([1,1,1],drug)
    ax.set_xlim(-1,2)
    ax.set_ylim(np.min(np.concatenate((control,drug)))*0.9,
                np.max(np.concatenate((control,drug)))*1.1)
    ax.set_ylabel('Tumor size')
    ax.set_xlabel('Treatment')
    ax.set_xticks([0,1],['Control',group])
In [4]:
measure_tumor('Drug')

If we do the same for another 100 drugs:

In [5]:
def measure_many_tumors():
    loc,scale,n = 15,5,3 # mean, standard deviation, and sample size
    fig,axs = plt.subplots(ncols=10,nrows=10,figsize=(20,20),sharey=True,sharex=True)
    axs = axs.flatten()
    N = len(axs) # number of drugs
    data = np.zeros((N,2,n))
    pvals = np.zeros(N)
    for i in range(N):
        control = stats.norm.rvs(loc=loc,scale=scale,size=n)
        drug = stats.norm.rvs(loc=loc,scale=scale,size=n)
        data[i,0,:] = control
        data[i,1,:] = drug
        axs[i].scatter([0,0,0],control)
        axs[i].scatter([1,1,1],drug)
        p = stats.ttest_ind(control, drug)[1] # get p-value from t-test
        axs[i].set_title('p = %.3f'%p,y=0.8)
        if p<0.01:
            axs[i].set_facecolor('orange')
        elif p<0.05:
            axs[i].set_facecolor('yellow')
        elif p<0.06:
            axs[i].set_facecolor('lightyellow')
        pvals[i] = p
    axs[-1].set_xlim(-1,2)
    axs[-1].set_xticks([0,1],['',''])
    fig.text(0.5,0.1,'Treatment',ha='center')
    fig.text(0.1,0.5,'Tumor size',va='center',rotation='vertical')
    return data,pvals
In [6]:
data_100, pvals_100 = measure_many_tumors()

We see a few drugs with "significant" p-values, even though none of them actually has any effect. BAM, we just p-hacked! p-hacking refers to the misuse and abuse of analysis techniques, which results in being fooled by false positives.

Setting the threshold for significance to 0.05 means that approximately 5% of the statistical tests we perform on data drawn from the same distribution will result in false positives.

That means that if we did 100 tests, we would expect about 5 false positives (5%), and if we did 10,000 tests, we would expect about 500. In other words, the more tests we do, the more false positives we have to deal with.

If we simulate 10,000 drugs and plot the histogram of all 10,000 p-values, we see the standard uniform distribution. Everything to the left of the $p=0.05$ dashed line is a false positive.

In [7]:
N = 10000
pvals = np.zeros(N)
loc,scale,n=15,5,3
for i in range(N):
    control = stats.norm.rvs(loc=loc,scale=scale,size=n)
    drug = stats.norm.rvs(loc=loc,scale=scale,size=n)
    p = stats.ttest_ind(control,drug)[1]
    pvals[i] = p
plt.hist(pvals,bins=20,density=True)
plt.xticks((0,0.05,1),(0,0.05,1))
plt.xlabel('p-value')
plt.ylabel('Probability density')
plt.axvline(0.05,0,1.1,linestyle='--',color='k')
N_positive = sum(pvals<0.05)
print('%i out of 10000 drugs tested have p-values < 0.05.'%N_positive)
465 out of 10000 drugs tested have p-values < 0.05.

On the other hand, if drugs are truly effective, we are more likely to get $p<0.05$.

In [8]:
N = 10000
pvals = np.zeros(N)
loc1,scale1 = 15,5
loc2,scale2 = 10,5
n = 3
for i in range(N):
    control = stats.norm.rvs(loc=loc1,scale=scale1,size=n)
    drug = stats.norm.rvs(loc=loc2,scale=scale2,size=n)
    p = stats.ttest_ind(control,drug)[1]
    pvals[i] = p
plt.hist(pvals,bins=20,density=True)
plt.xticks((0,0.05,1),(0,0.05,1))
plt.xlabel('p-value')
plt.ylabel('Probability density')
plt.axvline(0.05,0,1.1,linestyle='--',color='k')
N_positive = sum(pvals<0.05)
print('%i out of 10000 drugs tested have p-values < 0.05.'%N_positive)
1594 out of 10000 drugs tested have p-values < 0.05.

A very "useful" hacking technique: Keep adding data until $p<0.05$?¶

Now let's talk about a slightly less obvious form of p-hacking. Remember those two groups with p-values between 0.05 and 0.06?

In [9]:
# Find the drugs that have p-values between 0.05 and 0.06
almost_significant_index = np.where(np.logical_and(pvals_100<0.06,pvals_100>0.05))[0]
print(almost_significant_index)
[ 0 89]
In [10]:
m = len(almost_significant_index)
fig,axs = plt.subplots(1,m,figsize=(6,3),sharey=True,sharex=True)
for i in range(m):
    idx = almost_significant_index[i]
    control = data_100[idx,0,:]
    drug = data_100[idx,1,:]
    axs[i].scatter([0,0,0],control)
    axs[i].scatter([1,1,1],drug)
    p = pvals_100[idx]
    axs[i].set_title('p = %.3f'%p,y=0.8)
    axs[i].set_facecolor('lightyellow')
    axs[i].set_xlabel('Treatment')
axs[0].set_ylabel('Tumor size')
axs[-1].set_xlim(-1,2)
axs[-1].set_xticks([0,1],['Control','Drug']);

Now, we know that both groups came from the same distribution. But typically, when we are doing experiments, we don't know whether the observations came from the same distribution or from different ones. And, let's be honest, we usually hope that they come from two different distributions. In this example we are looking for a new drug to help people, so we want to see an improvement.

When we get data like this, where the p-value is close to 0.05 but not below it, it is very tempting to think, "Hmm, I wonder if the p-value will get smaller if I add more data." So we add a few more measurements to each group.

In [11]:
n = data_100.shape[2] # original sample size per group (3)
extra_n = 10          # number of extra measurements to add to each group
extra_pvals = np.zeros((m,extra_n+1))
fig,ax = plt.subplots()
for i in range(m):
    idx = almost_significant_index[i]
    # the extra data still come from the same null distribution (loc=15, scale=5 from above)
    extra_control = stats.norm.rvs(loc=loc,scale=scale,size=extra_n)
    extra_drug = stats.norm.rvs(loc=loc,scale=scale,size=extra_n)
    control = np.append(data_100[idx,0,:],extra_control)
    drug = np.append(data_100[idx,1,:],extra_drug)
    for j in range(extra_n+1):
        # recompute the p-value each time one more measurement is included
        p = stats.ttest_ind(control[:n+j],drug[:n+j])[1]
        extra_pvals[i,j] = p
    ax.plot(np.arange(extra_n+1),extra_pvals[i],'.-')
ax.axhline(0.05,c='k',linestyle='--')
y_max = np.ceil(np.min((extra_pvals.max()*11,10)))/10
ax.set_yticks((0,0.05,y_max),(0,0.05,y_max))
ax.set_ylabel('p-value')
ax.set_xlabel('# Additional measurements');

And now when we calculate the p-value we get something less than 0.05, so we can report a statistically significant difference.

Hooray! We got what we wanted, right? No, we p-hacked again.

When a p-value is close to 0.05 like what we had with the original data, there's a surprisingly high probability that adding (even just one) new measurement to both groups will result in a false positive.

In [12]:
n = 3
N = 100
pvals = np.zeros((N,2))
i = 0
while i<N:
    control = stats.norm.rvs(loc=loc,scale=scale,size=n)
    drug = stats.norm.rvs(loc=loc,scale=scale,size=n)
    p = stats.ttest_ind(control, drug)[1]
    pvals[i,0] = p
    if 0.05<p<0.06:
        extra_control = stats.norm.rvs(loc=loc,scale=scale,size=1)
        extra_drug = stats.norm.rvs(loc=loc,scale=scale,size=1)
        p = stats.ttest_ind(np.append(control,extra_control),np.append(drug,extra_drug))[1]
        pvals[i,1] = p
        i += 1
N_positive = np.sum(pvals[:,1]<0.05)
print('Among %i drugs with an initial p-value between 0.05 and 0.06, %i have p-values < 0.05 after just one additional measurement.'%(N,N_positive))
Among 100 drugs with an initial p-value between 0.05 and 0.06, 34 have p-values < 0.05 after just one additional measurement.

Ultimately, if we keep adding data points and re-testing along the way, over half of all tests will at some point dip below $p<0.05$, i.e., become false positives.

In [13]:
loc,scale = 15,5 # same null distribution for both groups
n = 5000         # maximum number of measurements per group
N = 100          # number of drugs
pvals = np.zeros((N,n-3))
for i in range(N):
    control = stats.norm.rvs(loc=loc,scale=scale,size=n)
    drug = stats.norm.rvs(loc=loc,scale=scale,size=n)
    for j in range(n-3):
        p = stats.ttest_ind(control[:j+3], drug[:j+3])[1]
        pvals[i,j] = p
N_positive = np.any(pvals<0.05,axis=1).sum()
print('%i out of %i drugs tested have p-values < 0.05 before reaching n = %i.'%(N_positive,N,n))
57 out of 100 drugs tested have p-values < 0.05 before reaching n = 5000.

A simple correction for multiple tests to reduce the overall false-positive rates¶

Under the null hypothesis, the distribution of the p-value is uniform over $[0,1]$. Thus, among $n$ independent tests, it is the smallest p-value that determines whether you will see any false positives among the results.

Suppose the ground truth is that all tests are performed on data following their own null distributions (i.e., there is no real effect whatsoever). What is the distribution of the smallest p-value? If the significance threshold for each test is fixed at $\hat{p}$, then the probability of observing at least one false positive is

$$ \begin{align} &\mathbb{P}(\min(p_1,p_2,\cdots,p_n)\leq \hat{p})\\ =&1-\mathbb{P}(\text{All of } p_1,p_2,\cdots,p_n > \hat{p})\\ =&1-\prod_{i=1}^n \mathbb{P}(p_i > \hat{p})\\ =&1-(1-\hat{p})^n\\ \approx & 1-\exp(-n\hat{p}) \end{align} $$
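For instance, with $n=100$ tests at $\hat{p}=0.05$, this probability is $1-(1-0.05)^{100}\approx 0.994$ (the exponential approximation gives $1-e^{-5}\approx 0.993$), so at least one false positive is almost guaranteed.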

This tells you that as the number of tests $n$ goes up, the probability that at least one test is a false positive rises exponentially fast towards 1. To reduce this effect, one can choose $\hat{p}$ depending on the number of tests $n$; for instance, a popular correction method is to set

$$ \hat{p}=\frac{0.05}{n} $$

Then the probability that at least one test comes out as a false positive remains acceptably small:

$$ 1-\exp(-0.05)\approx 0.04877 $$

With this correction (the Bonferroni correction), if you have 1000 tests, the significance threshold for each individual test will be as small as $0.05/1000 = 0.00005$.
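As a sanity check (not part of the original notebook), here is a minimal simulation sketch that estimates the family-wise false-positive rate with and without the Bonferroni threshold, reusing the same two-group t-test setup as above; the numbers of tests and repetitions are arbitrary choices.

In [ ]:
# Sketch: family-wise false-positive rate with and without the Bonferroni correction.
# Both groups come from the same null distribution, Normal(15, 5), with n = 3 per group.
n_tests = 100       # number of tests per simulated screening campaign
n_reps = 500        # number of simulated campaigns
loc,scale,n = 15,5,3
fwer_raw, fwer_bonf = 0, 0
for _ in range(n_reps):
    ps = np.array([stats.ttest_ind(stats.norm.rvs(loc=loc,scale=scale,size=n),
                                   stats.norm.rvs(loc=loc,scale=scale,size=n))[1]
                   for _ in range(n_tests)])
    fwer_raw  += np.any(ps < 0.05)           # at least one false positive at 0.05
    fwer_bonf += np.any(ps < 0.05/n_tests)   # at least one at the corrected threshold
print('Uncorrected FWER ~ %.3f (theory %.3f), Bonferroni FWER ~ %.3f (theory %.3f)'
      % (fwer_raw/n_reps, 1-(1-0.05)**n_tests, fwer_bonf/n_reps, 1-(1-0.05/n_tests)**n_tests))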

Comparing p-values of different experiments – don’t¶

There is an infectious disease for which, if left untreated, affected people recover with probability $p_0=0.4$. Let's call this the null hypothesis $H_0$.

Your lab is testing two new treatments, treatment A (TrA) and treatment B (TrB). You run two different experiments (expA and expB) where TrA or TrB is given to two different groups of affected people.

Your analysis tells you that the p-values for the outcomes of expA and expB with respect to the null hypothesis $H_0$ are

$$\begin{aligned} \mbox{pval}(expA) &= 0.2131,\\ \mbox{pval}(expB) &= 1.05\times 10^{-5}.\\ \end{aligned}$$

Can we conclude that TrB is more effective than TrA? The p-value is lower!

No!

Let's see a simple confounding variable that can underlie this observation. Say that

  • expA involves 15 infected people, 8 of whom recovered using TrA
  • expB involves 300 infected people, 157 of whom recovered using TrB

What are the p-values under the null hypothesis of no treatment?

$$\begin{aligned} pval(expA) &= P(\mathrm{observing~a~result~as~or~more~extreme~than~8~recoveries} \mid N=15, p_0=0.4) \\ &= \sum_{n=8}^{15} P(n \mid N = 15, p_0=0.4) \\ &= \sum_{n=8}^{15} \frac{15!}{n!(15-n)!} 0.4^n 0.6^{15-n} \\ &= 0.2131 \end{aligned}$$

$$\begin{aligned} pval(expB) &= P(\mathrm{observing~a~result~as~or~more~extreme~than~157~recoveries} \mid N=300, p_0=0.4) \\ &= \sum_{n=157}^{300} \frac{300!}{n!(300-n)!} 0.4^n 0.6^{300-n} \\ &= 1.05\times 10^{-5} \end{aligned}$$
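These tail sums can be checked directly with the binomial survival function (a quick sketch, not part of the original analysis):

In [ ]:
# Quick check of the two binomial tail probabilities under H0 (p0 = 0.4)
pval_A = stats.binom.sf(8-1, 15, 0.4)     # P(n >= 8   | N = 15,  p0 = 0.4)
pval_B = stats.binom.sf(157-1, 300, 0.4)  # P(n >= 157 | N = 300, p0 = 0.4)
print('pval(expA) = %.4f, pval(expB) = %.2e' % (pval_A, pval_B))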

But, let's compare the effectiveness for each treatment:

$$p_A = \frac{8}{15} \approx 0.533$$ $$p_B = \frac{157}{300} \approx 0.523$$

In [14]:
from matplotlib.patches import Polygon
def exp_pmf(ax,n,N,p,exp_name):
    x = np.arange(N+1)
    y = stats.binom.pmf(x,N,p)
    ax.plot(x,y,'.-')
    ax.set_title('Exp%s ($p%s=%.1f,N=%i$)'%(exp_name,exp_name,p,N))
    ax.set_ylim(0,y.max()*1.1)
    ax.set_xlabel('Number of survivors ($n$)')
    ax.set_ylabel('Probability, $P(n|p%s=%.1f,N=%i)$'%(exp_name,p,N))
    ix = np.arange(n,N)
    iy = stats.binom.pmf(ix,N,p)
    verts = [(n,0),*zip(ix, iy),(N,0)]
    poly = Polygon(verts,facecolor='lightblue')
    ax.add_patch(poly)
    xmax = np.argmax(y)
    ax.set_xticks((0,n,xmax,N),(0,n,xmax,N))
    ax.axvline(n,linestyle='--',c='gray')
    pval = stats.binom.sf(n-1,N,p) # P(X >= n) under the null
    ax.text(n*1.03,stats.binom.pmf(n,N,p)+np.max(y)*0.03,
            '    p-value\n$=P(n\geqslant %i|N=%i,p_{%s}=%.1f)$\n$=%f$'%(n,N,exp_name,p,pval))
fig,ax = plt.subplots(2,1,figsize=(5.5,10))
exp_pmf(ax[0],8,15,0.4,'A')
exp_pmf(ax[1],157,300,0.4,'B')
fig.tight_layout()

So what have we seen here? The two treatments have essentially the same effectiveness, but because the sample sizes of the two experiments are so different, the null distribution for expB is much narrower than that of expA, and the p-value for expB is smaller as a result.

You can never compare two hypotheses by looking at their p-values relative to a third (null) hypothesis. There are many hidden and possibly confounding variables in that calculation, one of them being the sample size.

Let's instead look at the posterior distributions for the two experiments, as done before in week 2:

$$\begin{aligned} P(p_A\mid expA) &= \frac{16!} {8! 7!}\, p_A^{8} (1-p_A)^{7}\\ P(p_B\mid expB) &= \frac{301!}{157! 143!}\, p_B^{157} (1-p_B)^{143} \\ \end{aligned}$$

In [15]:
p = np.arange(0,1.001,0.001)
fig,ax = plt.subplots()
# With a uniform prior, the posterior is Beta(n+1, N-n+1), whose density equals
# (N+1) * Binomial(n | N, p); hence the factors 16 = 15+1 and 301 = 300+1.
ax.plot(p,16*stats.binom.pmf(8,15,p),label='$p(pA|n=8,N=15)$')
ax.plot(p,301*stats.binom.pmf(157,300,p),label='$p(pB|n=157,N=300)$')
ax.set_xlabel('$p$')
ax.set_ylabel('Posterior probability density')
ax.legend();

These posteriors are compatible with both treatments having the same effectiveness, as they are both centered around the same value, but the estimate of $p_B$ is more precise than that of $p_A$ because expB had much more data.

As you may remember from the homework of w02, when the data follow a binomial distribution, the best estimate of the Bernoulli parameter $p$ and its confidence value are given by

$$ \begin{equation} p^\ast = \frac{n}{N} \quad \sigma = \sqrt{\frac{p^\ast(1-p^\ast)}{N}}. \end{equation} $$

The best estimates and confidence values for the effectiveness of the two treatments are

$$ \begin{aligned} p_A &\approx 0.533 \pm 0.129\\ p_B &\approx 0.523 \pm 0.029. \end{aligned} $$

In fact, if the two experiments had been run using the same treatment, you would have obtained one "significant" p-value and one "non-significant" p-value just because of the different sample sizes.
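As a quick numerical check of these estimates (a sketch, using the formulas above):

In [ ]:
# Best estimate p* = n/N and its uncertainty sigma = sqrt(p*(1-p*)/N) for each treatment
for name, n_rec, N_tot in (('A', 8, 15), ('B', 157, 300)):
    p_star = n_rec/N_tot
    sigma = np.sqrt(p_star*(1-p_star)/N_tot)
    print('p_%s = %.3f +/- %.3f' % (name, p_star, sigma))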