MCB 111 week 4 Section

p-values and p-hacking

https://xkcd.com/882/


Examples of p-hacking:

  1. Stop collecting data once $p \lt 0.05$
  2. Analyze many measures, but report only those with $p \lt 0.05$
  3. Collect and analyze many conditions, but report only those with $p \lt 0.05$
  4. Use covariates to get $p \lt 0.05$
  5. Exclude participants to get $p \lt 0.05$
  6. Transform the data to get $p \lt 0.05$
  7. Increase the $n$ in one group to get $p \lt 0.05$

Hypothetical scenario

Your lab wants to develop a drug that reduces the size of tumors in a particular kind of cancer. In a screening fashion, your lab is exhaustively testing compounds to find out whether any of them work. So, using an animal model, you measure the size of the tumors in three specimens after a certain time, both without any drug and with each candidate drug.

Setting the threshold for significance to 0.05 means that approximately 5% of the statistical tests we run on data drawn from the same distribution (i.e., where there is no real effect) will come out as false positives.

That means that if we did 100 tests, we would expect about 5 false positives, and if we did 10,000 tests, we would expect about 500 false positives. In other words, the more tests we do, the more false positives we have to deal with.
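This is easy to check by simulation. The sketch below (the tumor-size distribution, its mean of 10 and standard deviation of 2, and the use of a two-sample t-test are all illustrative assumptions, not part of the original scenario) runs 10,000 experiments in which the "drug" does nothing, and counts how many reach $p < 0.05$ anyway:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_tests = 10_000     # number of independent experiments
n_per_group = 3      # three specimens per group, as in the scenario
false_positives = 0

for _ in range(n_tests):
    # Both groups are drawn from the SAME distribution: the drug does nothing.
    control = rng.normal(loc=10.0, scale=2.0, size=n_per_group)
    treated = rng.normal(loc=10.0, scale=2.0, size=n_per_group)
    _, p = stats.ttest_ind(control, treated)
    if p < 0.05:
        false_positives += 1

print(false_positives)  # roughly 5% of 10,000, i.e. on the order of 500
```

The exact count fluctuates from run to run, but it stays near 5% of the tests, exactly as the threshold promises.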

A very "useful" hacking technique: Keep adding data until $p<0.05$?

Perhaps surprisingly, this approach can result in over 50% of experiments being falsely reported as positive (i.e., ineffective drugs being reported as effective!).
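You can see the inflation directly by simulating "optional stopping". In this sketch (again with made-up tumor sizes and a t-test; the starting group size, the cap of 100 specimens, and the helper name `optional_stopping` are all illustrative choices), we re-test after every added specimen and stop as soon as $p < 0.05$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def optional_stopping(max_n=100, start_n=3, alpha=0.05):
    """Keep adding one specimen per group and re-testing until p < alpha
    (or until we give up at max_n per group). The drug does nothing, so
    returning True means we 'found' a false positive."""
    control = list(rng.normal(10.0, 2.0, size=start_n))
    treated = list(rng.normal(10.0, 2.0, size=start_n))
    while len(control) <= max_n:
        _, p = stats.ttest_ind(control, treated)
        if p < alpha:
            return True
        control.append(rng.normal(10.0, 2.0))
        treated.append(rng.normal(10.0, 2.0))
    return False

n_runs = 500
hacked = sum(optional_stopping() for _ in range(n_runs))
print(hacked / n_runs)  # far above the nominal 0.05
```

The false-positive rate keeps climbing the longer you are willing to keep adding data; with no cap at all it eventually approaches 1.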

A simple correction for multiple tests to reduce the overall false-positive rates

Under the null hypothesis, the distribution of the $p$-value is uniform over $[0,1]$. Thus, among $n$ independent tests, it is the smallest $p$-value that determines whether you will see any false positives among the results.

Suppose the ground truth is that all tests are performed on data following their own null distributions (i.e., there is no real effect whatsoever). What is the distribution of the smallest $p$-value? If the significance threshold for each test is fixed at $\hat{p}$, then the probability of observing at least one false positive is

$$ \begin{align} &\mathbb{P}(\min(p_1,p_2,\cdots,p_n)\leq \hat{p})\\ =&1-\mathbb{P}(\text{All of } p_1,p_2,\cdots,p_n > \hat{p})\\ =&1-\prod_{i=1}^n \mathbb{P}(p_i > \hat{p})\\ =&1-(1-\hat{p})^n\\ \approx & 1-\exp(-n\hat{p}) \end{align} $$
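A quick numerical check of the last two lines of this derivation (the choice of $\hat{p}=0.05$ and the particular values of $n$ are just for illustration):

```python
import math

p_hat = 0.05
for n in (1, 10, 100, 1000):
    exact = 1 - (1 - p_hat) ** n        # exact probability of >= 1 false positive
    approx = 1 - math.exp(-n * p_hat)   # the exponential approximation
    print(n, round(exact, 4), round(approx, 4))
```

For $n=10$ the exact probability is already about $0.40$, and by $n=100$ it is essentially 1: with a fixed per-test threshold, at least one false positive becomes all but certain.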

This tells you that as the number of tests $n$ goes up, the probability that at least one test comes out as a false positive rises exponentially fast toward 1. To reduce this effect, one can choose $\hat{p}$ depending on the number of tests $n$; for instance, a popular correction method is to set

$$ \hat{p}=\frac{0.05}{n} $$

Then the probability that at least one test comes out as a false positive stays acceptably small:

$$ 1-\exp(-0.05)\approx 0.04877 $$

With this correction (the Bonferroni correction), if you have 1000 tests, the significance threshold for each test will be as small as $0.05/1000 = 0.00005$.
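To confirm that this correction actually controls the family-wise error rate, we can simulate many entire "families" of 1000 null experiments and ask how often *any* test in a family slips below the Bonferroni threshold. As before, the tumor-size distribution, group size of 3, and the t-test are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

n_tests = 1000                 # tests per family (e.g. 1000 candidate drugs)
alpha = 0.05
threshold = alpha / n_tests    # Bonferroni threshold: 0.00005

# Simulate 500 whole families of null experiments, 3 specimens per group.
n_families = 500
control = rng.normal(10.0, 2.0, size=(n_families, n_tests, 3))
treated = rng.normal(10.0, 2.0, size=(n_families, n_tests, 3))
_, p = stats.ttest_ind(control, treated, axis=-1)  # p has shape (500, 1000)

# A family-wise error occurs when ANY of the 1000 tests beats the threshold.
fwer = np.mean(p.min(axis=1) < threshold)
print(fwer)  # close to 1 - exp(-0.05), i.e. just under 0.05
```

Despite running half a million null tests in total, the fraction of families with at least one false positive stays near the intended 5%, rather than near 100% as it would with an uncorrected threshold.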