MCB111 w04 Section¶
p-hacking¶
import numpy as np
from scipy.special import comb
import scipy.stats as stats
import matplotlib.pyplot as plt
np.random.seed(2023)
Examples of p-Hacking:¶
- Stop collecting data once $p \lt 0.05$
- Analyze many measures, but report only those with $p \lt 0.05$
- Collect and analyze many conditions, but only report those with $p \lt 0.05$
- Use covariates to get $p \lt 0.05$
- Exclude participants to get $p \lt 0.05$
- Transform the data to get $p \lt 0.05$
- Increase the $n$ in one group to get $p \lt 0.05$
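As a toy illustration of the first item, here is a minimal simulation sketch (the normal parameters 15 and 5 match the notebook below; the choice to peek after every new pair of observations up to a hypothetical maximum of 30 per group is an assumption for illustration). Both groups are drawn from the same distribution, yet stopping as soon as $p \lt 0.05$ yields far more than 5% false positives:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

def optional_stopping_trial(n_start=3, n_max=30, loc=15, scale=5):
    """Add one observation per group at a time; stop as soon as p < 0.05."""
    control = list(rng.normal(loc, scale, n_start))
    drug = list(rng.normal(loc, scale, n_start))
    while len(control) <= n_max:
        p = stats.ttest_ind(control, drug)[1]
        if p < 0.05:
            return True  # "significant", despite no real effect
        control.append(rng.normal(loc, scale))
        drug.append(rng.normal(loc, scale))
    return False

trials = 1000
hits = sum(optional_stopping_trial() for _ in range(trials))
print('False-positive rate with optional stopping: %.3f' % (hits / trials))
```

With a fixed $n$ the false-positive rate would be 5%; repeatedly peeking at the p-value and stopping on the first "hit" roughly quadruples it.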
Hypothetical scenario¶
Your lab wants to develop a drug that reduces the size of tumors in a particular kind of cancer. In a screening fashion, your lab is exhaustively testing compounds to find out whether any of them work. So, using an animal model, you measured the size of the tumors from three specimens after a certain time, both without any drug and with each candidate drug.
Below are results from one drug.
def measure_tumor(group):
    loc, scale = 15, 5  # mean and standard deviation of tumor sizes
    fig, ax = plt.subplots(ncols=1, sharey=True, figsize=(3, 3))
    # both groups are drawn from the same distribution: the drug has no effect
    control = stats.norm.rvs(loc=loc, scale=scale, size=3)
    drug = stats.norm.rvs(loc=loc, scale=scale, size=3)
    ax.scatter([0, 0, 0], control)
    ax.scatter([1, 1, 1], drug)
    ax.set_xlim(-1, 2)
    ax.set_ylim(np.min(np.concatenate((control, drug))) * 0.9,
                np.max(np.concatenate((control, drug))) * 1.1)
    ax.set_ylabel('Tumor size')
    ax.set_xlabel('Treatment')
    ax.set_xticks([0, 1], ['Control', group])

measure_tumor('Drug')
If we do the same for another 100 drugs:
def measure_many_tumors():
    loc, scale, n = 15, 5, 3  # mean, standard deviation, and sample size
    fig, axs = plt.subplots(ncols=10, nrows=10, figsize=(20, 20),
                            sharey=True, sharex=True)
    axs = axs.flatten()
    N = len(axs)  # number of drugs
    data = np.zeros((N, 2, n))
    pvals = np.zeros(N)
    for i in range(N):
        control = stats.norm.rvs(loc=loc, scale=scale, size=n)
        drug = stats.norm.rvs(loc=loc, scale=scale, size=n)
        data[i, 0, :] = control
        data[i, 1, :] = drug
        axs[i].scatter([0, 0, 0], control)
        axs[i].scatter([1, 1, 1], drug)
        p = stats.ttest_ind(control, drug)[1]  # get p-value from t-test
        axs[i].set_title('p = %.3f' % p, y=0.8)
        if p < 0.01:
            axs[i].set_facecolor('orange')
        elif p < 0.05:
            axs[i].set_facecolor('yellow')
        elif p < 0.06:
            axs[i].set_facecolor('lightyellow')
        pvals[i] = p
    axs[-1].set_xlim(-1, 2)
    axs[-1].set_xticks([0, 1], ['', ''])
    fig.text(0.5, 0.1, 'Treatment', ha='center')
    fig.text(0.1, 0.5, 'Tumor size', va='center', rotation='vertical')
    return data, pvals
data_100, pvals_100 = measure_many_tumors()
We see a few drugs with "significant" p-values, even though none of them actually has any effect. BAM, we just p-hacked! p-hacking refers to the misuse and abuse of analysis techniques that results in being fooled by false positives.
Setting the threshold for significance to 0.05 means that approximately 5% of the statistical tests we do on data gathered from the same distribution will result in false positives.
That means that if we did 100 tests, we would expect about 5 false positives (5 percent), and if we did 10,000 tests, we would expect about 500 false positives. In other words, the more tests we do, the more false positives we have to deal with.
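This arithmetic can be checked directly: under the null hypothesis, each test is approximately an independent Bernoulli trial with success probability 0.05, so the number of false positives is binomially distributed. A quick sketch:

```python
import scipy.stats as stats

alpha = 0.05
for n_tests in (100, 10000):
    expected = alpha * n_tests                       # mean of Binomial(n, alpha)
    p_any = 1 - stats.binom.pmf(0, n_tests, alpha)   # P(at least one false positive)
    print('%5d tests: expect %.0f false positives; '
          'P(at least one) = %.4f' % (n_tests, expected, p_any))
```

Even at 100 tests, the chance of seeing at least one false positive is over 99%, which is why reporting only the "hits" is so misleading.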
If we simulate 10,000 drugs and plot the histogram of all 10,000 p-values we get, we see a standard uniform distribution. Everything to the left of the $p = 0.05$ dashed line is a false positive.
N = 10000
pvals = np.zeros(N)
loc, scale, n = 15, 5, 3
for i in range(N):
    control = stats.norm.rvs(loc=loc, scale=scale, size=n)
    drug = stats.norm.rvs(loc=loc, scale=scale, size=n)
    p = stats.ttest_ind(control, drug)[1]
    pvals[i] = p
plt.hist(pvals, bins=20, density=True)
plt.xticks((0, 0.05, 1), (0, 0.05, 1))
plt.xlabel('p-value')
plt.ylabel('Probability density')
plt.axvline(0.05, 0, 1.1, linestyle='--', color='k')
N_positive = sum(pvals < 0.05)
print('%i out of 10000 drugs tested have p-values < 0.05.' % N_positive)
465 out of 10000 drugs tested have p-values < 0.05.
On the other hand, if drugs are truly effective, we are more likely to get $p<0.05$.
N = 10000
pvals = np.zeros(N)
loc1, scale1 = 15, 5  # control group
loc2, scale2 = 10, 5  # drug group: smaller mean, i.e. a real effect
n = 3
for i in range(N):
    control = stats.norm.rvs(loc=loc1, scale=scale1, size=n)
    drug = stats.norm.rvs(loc=loc2, scale=scale2, size=n)
    p = stats.ttest_ind(control, drug)[1]
    pvals[i] = p
plt.hist(pvals, bins=20, density=True)
plt.xticks((0, 0.05, 1), (0, 0.05, 1))
plt.xlabel('p-value')
plt.ylabel('Probability density')
plt.axvline(0.05, 0, 1.1, linestyle='--', color='k')
N_positive = sum(pvals < 0.05)
print('%i out of 10000 drugs tested have p-values < 0.05.' % N_positive)
1594 out of 10000 drugs tested have p-values < 0.05.
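The roughly 16% detection rate above agrees with the theoretical power of a two-sample t-test at these tiny sample sizes. A sketch using the noncentral t distribution (assuming equal variances, $n = 3$ per group, and a true difference of 5 with standard deviation 5, i.e. effect size $d = 1$, as in the simulation):

```python
import numpy as np
import scipy.stats as stats

n, d, alpha = 3, 1.0, 0.05             # per-group n, effect size, threshold
df = 2 * n - 2                         # degrees of freedom of the pooled t-test
ncp = d * np.sqrt(n / 2)               # noncentrality parameter
t_crit = stats.t.ppf(1 - alpha / 2, df)
# power = P(|T| > t_crit) when T follows a noncentral t distribution
power = (1 - stats.nct.cdf(t_crit, df, ncp)
         + stats.nct.cdf(-t_crit, df, ncp))
print('Theoretical power: %.3f' % power)
```

With only three specimens per group, even a full standard deviation of improvement is detected less than one time in five, so most truly effective drugs would be missed while the false positives from the previous section accumulate.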
A very "useful" hacking technique: Keep adding data until $p<0.05$?¶
Now let's talk about a slightly less obvious form of p-hacking. Remember those two groups with p-values between 0.05 and 0.06?
# Find the drugs that have p-values between 0.05 and 0.06
almost_significant_index = np.where(np.logical_and(pvals_100<0.06,pvals_100>0.05))[0]
print(almost_significant_index)
[ 0 89]
m = len(almost_significant_index)
fig, axs = plt.subplots(1, m, figsize=(6, 3), sharey=True, sharex=True)
for i in range(m):
    idx = almost_significant_index[i]
    control = data_100[idx, 0, :]
    drug = data_100[idx, 1, :]
    axs[i].scatter([0, 0, 0], control)
    axs[i].scatter([1, 1, 1], drug)
    p = pvals_100[idx]
    axs[i].set_title('p = %.3f' % p, y=0.8)
    axs[i].set_facecolor('lightyellow')
    axs[i].set_xlabel('Treatment')
axs[0].set_ylabel('Tumor size')
axs[-1].set_xlim(-1, 2)
axs[-1].set_xticks([0, 1], ['Control', 'Drug']);
Now, we know that both groups came from the same distribution. But typically, when we are doing experiments, we don't know whether the two sets of observations came from the same distribution or from different ones. And, let's be honest, we usually hope that they came from two different distributions. In this example we are looking for a new drug to help people, so we want to see an improvement.
When we get data like this, where the p-value is close to 0.05 but not below it, it is very tempting to think: hmm, I wonder if the p-value will get smaller if I add more data. So we add a few more measurements to each group.
n = data_100.shape[2]
extra_n = 10
extra_pvals = np.zeros((m, extra_n + 1))
fig, ax = plt.subplots()
for i in range(m):
    idx = almost_significant_index[i]
    # extra measurements come from the same distribution: still no real effect
    extra_control = stats.norm.rvs(loc=loc, scale=scale, size=extra_n)
    extra_drug = stats.norm.rvs(loc=loc, scale=scale, size=extra_n)
    control = np.append(data_100[idx, 0, :], extra_control)
    drug = np.append(data_100[idx, 1, :], extra_drug)