MCB111: Mathematics in Biology (Spring 2018)
 pvalue: definition
 Example:
 The Student’s ttest: a widely used pvalue
 Correcting for multiple hypothesis testing
 Proper pvalue etiquette
 Comparing pvalues of different experiments – don’t
week 04:
Significance: The Student’s ttest and pvalues
pvalue: definition
You have a hypothesis $H$, usually complicated, that you suspect may explain your experimental observations, and a null hypothesis $H_0$, usually simpler, that could also explain your results.
A new result comes in, to which your hypothesis assigns a particular “score” $s$.
The pvalue of $s$ is the cumulative probability that, under the null hypothesis, you could have obtained that score or higher (assuming that a higher score is better).
If the pvalue is small, there is little chance of that score being obtained under the null model. A common mistake is to think that such a small pvalue implies that your hypothesis is true. A pvalue says nothing about your hypothesis; it speaks only about the hypothesis you would like to reject. Rejecting the null hypothesis does not imply that your hypothesis is true.
Example:
Remember the homework where you were running a new experiment which, for a previous researcher, had a failure frequency of $0.2$, and you observed ‘ssfff’, that is, a sequence of two successes (s) followed by three failures (f)?
We can introduce two hypotheses:

$H_0$: The failure frequency is equal to $0.2$ ($f = 0.2$).

$H_1$: The failure frequency is larger than $0.2$ ($f > 0.2$).
The pvalue approach
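We can calculate the probability, under the null hypothesis, of obtaining in 5 attempts the observed result or something more extreme.
Those events are three, four, or five failures, as in ‘ssfff’, ‘sffff’, and ‘fffff’, counted in any order. Resulting in a pvalue,
\begin{equation} \mbox{pvalue} = {5 \choose 3}\, 0.2^3\, 0.8^2 + {5 \choose 4}\, 0.2^4\, 0.8 + 0.2^5 = 0.0512 + 0.0064 + 0.0003 = 0.0579 \approx 0.06. \end{equation}
What would you do with this result?
Oftentimes, a pvalue is used to “reject” the null hypothesis if the pvalue is smaller than $0.05$. Would you or would you not reject the null hypothesis after this result? Why 0.05?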
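If you want to check this pvalue yourself, here is a minimal Python sketch. It assumes that "the observed result or something more extreme" means three or more failures out of five, in any order, under a null failure frequency of 0.2; the function name `binomial_tail` is mine, not something from the homework.

```python
from math import comb

def binomial_tail(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p): k_min or more failures in n attempts."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

f0 = 0.2          # failure frequency under the null hypothesis
n_attempts = 5    # 'ssfff' has five attempts
n_failures = 3    # observed number of failures

# Probability, under the null, of the observed number of failures or more
pvalue = binomial_tail(n_failures, n_attempts, f0)
print(f"pvalue = {pvalue:.4f}")  # prints "pvalue = 0.0579"
```

Note that the pvalue only ever uses the null frequency `f0`; the alternative hypothesis never enters the calculation.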
The Bayesian approach
We have learned to calculate the posterior probability, which tells us a lot about the value of the failure parameter $f$ given the data. Namely,
\begin{equation} P(f \mid \mbox{‘ssfff’}) = \frac{6!}{3!\, 2!}\, f^3 (1-f)^2, \end{equation}
which is given in Figure 1. The maximum of this posterior distribution is $f = 3/5 = 0.6$, far away from the failure frequency of the null hypothesis.
Figure 1. Posterior probabilities of the frequency of failure given that you have obtained 2 successes and 3 failures.
Bayesian hypothesis comparison tells us
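Let us assume that both hypotheses are equally likely a priori, $P(H_0) = P(H_1)$. Then the ratio of the posterior probabilities of the two hypotheses is the ratio of the evidences of the data $D$ = ‘ssfff’ given $H_1$ and $H_0$, as
\begin{equation} \frac{P(H_1\mid D)}{P(H_0\mid D)} = \frac{P(D\mid H_1)}{P(D\mid H_0)}. \end{equation}
And using marginalization, for each hypothesis $H$,
\begin{equation} P(D\mid H) = \int P(D\mid f)\, P(f\mid H)\, df, \quad \mbox{with} \quad P(D\mid f) = f^3 (1-f)^2. \end{equation}
Let’s assume that for the hypothesis $H_0$ the failure probability does not have to be precisely $0.2$, but can take values in a small range around $0.2$.
Using uniform priors over the range of $f$ allowed by each hypothesis, each evidence is the average of $f^3 (1-f)^2$ over that range.
Thus, given the data and our priors, there is an 80:20 chance that the failure probability is larger than 0.2.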
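Here is a small Python sketch of that comparison. The ranges are an assumption of mine for illustration: I take the null hypothesis to cover failure frequencies in [0.15, 0.25] and the alternative to cover (0.25, 1], each with a uniform prior. Other reasonable ranges around 0.2 give similar numbers. The integral of $f^3(1-f)^2$ is done exactly with its antiderivative.

```python
def integral_likelihood(a, b):
    """Exact integral of f^3 (1 - f)^2 between a and b."""
    def antiderivative(f):
        # f^3 (1-f)^2 = f^3 - 2 f^4 + f^5
        return f**4 / 4 - 2 * f**5 / 5 + f**6 / 6
    return antiderivative(b) - antiderivative(a)

def evidence(a, b):
    """P(D | H) with a uniform prior on [a, b]: average likelihood over the range."""
    return integral_likelihood(a, b) / (b - a)

# ASSUMPTION: hypothetical ranges, chosen only for illustration
h0_range = (0.15, 0.25)   # null: failure frequency close to 0.2
h1_range = (0.25, 1.0)    # alternative: failure frequency larger than that

ratio = evidence(*h1_range) / evidence(*h0_range)
posterior_h1 = ratio / (1 + ratio)   # equal prior probabilities for H0 and H1

print(f"evidence ratio H1:H0 = {ratio:.2f}")
print(f"P(H1 | data) = {posterior_h1:.2f}")
```

With these assumed ranges the ratio comes out close to 4, that is, roughly 80:20 in favor of the larger failure frequency.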
Which method do you prefer?

As you can see in this example, the pvalue does not use any information about the hypothesis $H_1$, and thus does not provide any information about $H_1$, while the Bayesian method does.

Calculating a pvalue can be easier, especially for methods with many parameters, and you should feel free to use it, provided that you know what you are doing and interpret the results correctly. The above pvalue of 0.06 does not mean that the probability of the null hypothesis being false is 94%; it is just the probability of obtaining the results you got, or more extreme ones, under the null hypothesis.
Sometimes it is difficult to decide what to do with a given pvalue result. More about this later in this lecture.

In my opinion, if possible, the Bayesian method is preferable as it conveys a lot more information. Just the posterior probability (Figure 1) gives you a lot of information about what the data tell you about the value of the parameters. It is also more robust to small fluctuations, as it allows you to integrate around the value 0.2 for $H_0$.
The Student’s ttest: a widely used pvalue
It is typical to provide a pvalue to compare the mean of an experimental distribution to a known mean of a null hypothesis. Let’s consider one practical case.
There are genes that do not produce proteins; their functional product is the RNA molecule itself. Many of those functional RNAs form stable structures. RNaseP RNA is an example of a structural RNA that functions as a ribozyme (an RNA enzyme). RNaseP RNA is responsible for trimming the ends of tRNAs. Like many other functional RNAs, RNaseP RNA has a structure.
Finding the structure of one of those functional RNAs is an interesting problem. One conclusive way is using crystallography, but that is hard and slow. We have faster computational methods that can predict structures given the RNA sequence. There are also experimental techniques (named RNA structural or chemical probing) that given an RNA molecule, can calculate the reactivity of each residue for rather long RNA molecules. These chemical reactivities correlate with the flexibility of the RNA molecule, thus they are expected to inform us about which residues are paired and unpaired in the molecule, thus providing experimental information about RNA structure (see for instance).
I have taken chemical probing data for a well-known RNA molecule, RNaseP RNA, from the RNA Mapping DataBase ([RMDB](https://rmdb.stanford.edu/)), a repository of RNA structure probing. Here is the original data if you want to double check my analysis (which I encourage you to do). And here are the files extracted from the previous file of reactivities for bases that are paired, and reactivities for bases that are not paired, for the wild type RNaseP RNA sequence.
Figure 2. Mean and standard deviation for the reactivity data of one chemical probing experiment for RNaseP RNA.
In Figure 2, I show a typical plot. As you well know, I do not like those plots, and I think you should never use them. This plot simply tells us what the mean and standard deviation of the reactivities are if we separate the base paired residues from those that are unpaired.
Another typical thing to do is to add a mysterious pvalue associated with that plot. I have followed that trend. Let the null hypothesis be that a residue is unpaired. One such typical method is called Student’s ttest. I have used the matlab function that implements it, named ttest, out of the box, in order to calculate the pvalue of the distribution of paired residues relative to that of unpaired residues (the null hypothesis).
The ttest function assigns a pvalue of for the paired residue distribution to have the mean of the unpaired residue distribution. Here is my matlab code to obtain that result.
From this super small pvalue, you may be tempted to conclude that you can clearly use chemical probing data to distinguish residues that are paired from those that are not, and thus to infer a structure for the RNA.
That conclusion would be wrong.
Read the documentation – look at your data
Figure 3. Matlab documentation for function ttest(x,m).
If you go to the matlab documentation for the ttest function, the first thing you read is that in order to use the Student’s ttest, the distribution of the data has to be Gaussian (see Figure 3).
Is that the case for this data?
Figure 4 shows that the distributions of reactivity scores both for paired and unpaired residues are obviously not Gaussian.
Never use a pvalue test without looking at the actual distribution first.
Figure 4. The distributions of reactivity scores for paired and unpaired residues.
What does the Student’s distribution have to do with any of this?
For this section, I am following Sivia’s book “Data Analysis”, Sections 2.3 and 3.2.
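Let us be Bayesian again. We have an experiment in which the measurements $D = \{x_1,\ldots,x_N\}$ are Gaussian. The ttest calculates the probability that the experiment’s mean could have come from a process of mean $\mu$.
Bayes’ theorem tells us that the posterior probability for the two parameters of the Gaussian distribution, $\mu$ and $\sigma$, is given by
\begin{equation} P(\mu,\sigma\mid D) \propto P(D\mid \mu,\sigma)\, P(\mu,\sigma), \end{equation}
where $P(\mu,\sigma)$ is the prior probability of the parameters.
Since we are comparing only means, the Bayesian thing to do is to integrate over all possible values of the variance,
\begin{equation} P(\mu\mid D) = \int_0^{\infty} P(\mu,\sigma\mid D)\, d\sigma. \end{equation}
The probability of the data given $\mu$ and $\sigma$ is
\begin{equation} P(D\mid \mu,\sigma) = \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = \left(\sigma\sqrt{2\pi}\right)^{-N} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{N} (x_i-\mu)^2\right]. \end{equation}
Assuming uninformative and independent priors
\begin{equation} P(\mu,\sigma) = P(\mu)\, P(\sigma), \quad P(\mu) = c_\mu, \quad P(\sigma) = c_\sigma \;\; (\sigma > 0), \end{equation}
where $c_\mu$ and $c_\sigma$ are constants, which allows us to include the prior in the proportionality.
The posterior probability of $\mu$ is then
\begin{equation} P(\mu\mid D) \propto \int_0^{\infty} \sigma^{-N} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{N} (x_i-\mu)^2\right] d\sigma. \end{equation}
Introducing a change of variable
\begin{equation} t = \frac{1}{\sigma}\sqrt{\sum_{i=1}^{N} (x_i-\mu)^2}, \end{equation}
we obtain,
\begin{equation} P(\mu\mid D) \propto \left[\sum_{i=1}^{N} (x_i-\mu)^2\right]^{-\frac{N-1}{2}} \int_0^{\infty} t^{N-2}\, e^{-t^2/2}\, dt. \end{equation}
Then, since the integral is a constant that does not depend on the data or $\mu$,
\begin{equation} P(\mu\mid D) \propto \left[\sum_{i=1}^{N} (x_i-\mu)^2\right]^{-\frac{N-1}{2}}. \end{equation}
Introducing the sample mean $\bar{x} = \frac{1}{N}\sum_i x_i$, and using the identity
\begin{equation} \sum_{i=1}^{N} (x_i-\mu)^2 = N(\bar{x}-\mu)^2 + \sum_{i=1}^{N} (x_i-\bar{x})^2, \end{equation}
and introducing $V = \sum_{i=1}^{N} (x_i-\bar{x})^2$, which is not dependent on $\mu$,
we have,
\begin{equation} P(\mu\mid D) \propto \left[N(\bar{x}-\mu)^2 + V\right]^{-\frac{N-1}{2}}, \end{equation}
which is one of many representations of the Student’s t distribution.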
Figure 5. Comparison of the Student's t distribution for a zero sample mean and different values of N to the Gaussian distribution of mean zero and sigma one.
The Student’s t distribution is much like the Gaussian distribution. It is symmetric around the sample mean $\bar{x}$. The main difference from a Gaussian is that the tails are much fatter and extend further from the mean value. As $N$ becomes larger, the Student’s t distribution becomes closer to the Gaussian distribution (see Figure 5).
Is this pvalue what you really want to know?
Student’s ttest or not, it is obvious that the means of the two chemical probing distributions for paired and unpaired residues are different (Figure 4). For some experiments, knowing that the means are different is all you need. But let us think again about what we would like to achieve here. You want to see if chemical probing reactivities will help you distinguish residues that are base paired from those that are not paired.
The pvalue calculation you would like to do is:
for a given reactivity value, what is the probability that I could have obtained that reactivity or smaller if the residue were unpaired?
We can calculate an empirical version of this pvalue using the data we have. From the distribution of unpaired residues in Figure 4, I find
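| reactivity | pvalue | unpaired residues with this reactivity or lower |
|------------|--------|--------------------------------------------------|
| 0.0029 | 0.02 | 2% |
| 0.0034 | 0.05 | 5% |
| 0.0042 | 0.10 | 10% |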
This result tells us that we need to restrict ourselves to very small reactivities so that not too many false positives start appearing.
Notice that I have not had to invoke any fancy-named statistical test to do this calculation, nor have I had to make any assumptions about the data.
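The empirical pvalue is just a count, as this short Python sketch shows. The reactivities below are made-up numbers for illustration, not the RNaseP data, and `empirical_pvalue` is a hypothetical helper name.

```python
def empirical_pvalue(null_scores, score):
    """Fraction of null scores that are <= score (here, lower reactivity is more extreme)."""
    n_as_extreme = sum(1 for s in null_scores if s <= score)
    return n_as_extreme / len(null_scores)

# HYPOTHETICAL reactivities of unpaired residues (the null distribution)
unpaired = [0.0021, 0.0035, 0.0040, 0.0052, 0.0068,
            0.0075, 0.0090, 0.0110, 0.0150, 0.0200]

p = empirical_pvalue(unpaired, 0.0042)
print(f"pvalue = {p:.2f}")  # 3 of the 10 unpaired values are <= 0.0042
```

No distributional assumption is needed: the null distribution is whatever the unpaired data say it is.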
Remember that our objective is to decide whether and how I can use the residue reactivities to infer which residues are paired and which aren’t. Pvalues refer to a single test: I look at the reactivity of one single residue. Answering the above question requires testing all residues in the RNA structure. This is a problem of multiple testing that we approach next.
Correcting for multiple hypothesis testing
The pvalues we have discussed so far give us a probability per test. If I do my experiment one more time, and I obtain one particular outcome, the pvalue is the probability of that outcome (or a more extreme one) happening under the null hypothesis alone.
How should you interpret a pvalue if you repeat the experiment many times? This issue is very relevant in modern high-throughput biology. If you repeat a test many times, even a small probability $p$ of obtaining that result under the null hypothesis could result in a large number of cases governed by the null hypothesis with pvalue $p$ or smaller (false positives).
How many false positives? Approximately $p \times n$, for $n$ tests, when you assume that all observed scores come from the null hypothesis.
Here is a good read about what to do with pvalues for multiple testing: “How does multiple testing correction work?” by W. Noble.
The Bonferroni correction: controlling the family-wise error rate
A simple thing to do is, if you aim for a certain pvalue $p$ and you do $n$ tests, to require that any given test obtain a pvalue of $p/n$ or smaller. This is called the Bonferroni correction.
This is also called controlling the family-wise error rate, as \begin{equation} P\left(\cup_i \{p_i \leq p/n\}\right) \leq \sum_i P(p_i\leq p/n) \leq n\, \frac{p}{n} = p. \end{equation}
This is considered a very conservative approach for a very large number of tests.
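As a quick illustration, the following Python sketch computes the expected number of false positives for $n$ tests and applies the Bonferroni threshold $p/n$. The pvalues in the list are invented for the example, and the function names are mine.

```python
def expected_false_positives(p, n):
    """Expected number of null cases with pvalue <= p among n tests, if all are null."""
    return p * n

def bonferroni_reject(pvalues, p=0.05):
    """Call a test positive only if its pvalue is <= p / n, with n the number of tests."""
    threshold = p / len(pvalues)
    return [pv <= threshold for pv in pvalues]

# HYPOTHETICAL pvalues from five tests
pvalues = [0.0001, 0.004, 0.02, 0.03, 0.6]

print(expected_false_positives(0.05, 265))   # about 13 false positives in 265 tests
print(bonferroni_reject(pvalues))            # threshold is 0.05 / 5 = 0.01
```

Without the correction, four of these five tests would pass a 0.05 threshold; with it, only the first two do.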
False Discovery Rate (FDR)
Let’s go back to our problem of using chemical probing data to estimate RNA residues that are paired into a structure.

$N$ = number of tests (number of residues in the RNA molecule, $N=265$)

$s$ = a chosen score (reactivity)

$p$ = pvalue, the probability that an unpaired residue has a score $\leq s$.

$F$ = number of residues with a score $\leq s$.
The False Discovery Rate is defined as the fraction of the measurements we called positives at the pvalue threshold (named above $F$) that are expected to be false positives. That is, assuming that all measurements are drawn from the null hypothesis,
\begin{equation} \mbox{FDR} = \frac{p\, N}{F}. \end{equation}
Introducing also

$T$ = Trues (residues that are paired, $T=160$)

$TF$ = number of $T$ with a score $\leq s$, that is, $T \cap F$.
We can calculate the sensitivity corresponding to that FDR as
\begin{equation} \mbox{Sensitivity} = \frac{TF}{T}. \end{equation}
In a given experiment, you want to find a good trade-off between the cost of having false positives (the FDR) and the benefit of finding more positives (the sensitivity).
Using the data we have, we can calculate pvalues, FDR, and sensitivity for different reactivity scores as,
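| reactivity | pvalue | FDR (%) | Sensitivity (%) |
|------------|--------|---------|-----------------|
| 0.0029 | 0.02 | 8.7 | 14.4 |
| 0.0034 | 0.05 | 15.0 | 21.1 |
| 0.0042 | 0.10 | 22.4 | 31.9 |
| 0.0063 | 0.28 | 26.9 | 50.0 |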
Thus, in order to identify 50% of the paired bases (sensitivity), you should expect that about 27% of the bases that you call paired are really not paired (FDR). This is a more informative and sober view of the power of chemical probing than our original assertion about the means of the two distributions being different with a very small pvalue.
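To make the bookkeeping concrete, here is a Python sketch that computes a pvalue, an FDR, and a sensitivity for one reactivity threshold. It reflects my reading of the definitions above, under stated assumptions: the FDR is taken as the expected number of false positives if all measurements came from the null (pvalue times number of tests) divided by the number of called positives, and the sensitivity is the fraction of paired residues that are called. The scores and labels are invented, so the numbers will not match the table.

```python
def fdr_and_sensitivity(scores, is_paired, threshold):
    """Return (pvalue, FDR, sensitivity) when calling 'paired' every score <= threshold."""
    n_tests = len(scores)
    null_scores = [s for s, paired in zip(scores, is_paired) if not paired]
    n_trues = sum(is_paired)

    # Empirical pvalue: fraction of unpaired residues at or below the threshold
    pvalue = sum(1 for s in null_scores if s <= threshold) / len(null_scores)

    # Residues called positive, and how many of them are truly paired
    called = [paired for s, paired in zip(scores, is_paired) if s <= threshold]
    n_called = len(called)
    n_true_called = sum(called)

    # ASSUMPTION: FDR = (pvalue * number of tests) / (number called), capped at 1
    fdr = min(1.0, pvalue * n_tests / n_called) if n_called else 0.0
    sensitivity = n_true_called / n_trues
    return pvalue, fdr, sensitivity

# HYPOTHETICAL data: ten residues with their reactivities and true pairing status
scores = [0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.008, 0.010, 0.015, 0.020]
is_paired = [True, True, True, False, True, True, False, False, False, False]

pvalue, fdr, sensitivity = fdr_and_sensitivity(scores, is_paired, 0.0045)
print(f"pvalue = {pvalue:.2f}, FDR = {fdr:.2f}, sensitivity = {sensitivity:.2f}")
```

Because this estimate treats every test as if it came from the null, it is pessimistic when many residues are truly paired: in this toy example only one of the four called residues is actually unpaired.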
Proper pvalue etiquette
As David MacKay puts it, “[..] pvalues [..] should be treated with extreme caution”. Even the American Statistical Association has issued recent warnings on the use of pvalues.
What pvalues are not

pvalues say nothing about your beloved hypothesis. Rejecting the null hypothesis does not mean that your hypothesis is true.

pvalues are not the probability of the null model given the data. pvalues tell you about the probability of the data, given the null hypothesis. A pvalue of 0.01 does not mean that there is a 1% chance for the null model to be true; rather, that you should expect 1% false positives.

pvalues are not posteriors of the null model given the data, and cannot be used to assess model strength or to compare different models.

pvalues are often used as a measure of statistical significance, but that is bogus. pvalues tell you about the number of false positives that have sneaked into your experiment. In an experiment with $N$ observations, a pvalue of $p$ means that you should expect to find about $p \times N$ false positives amongst your positives.
Define precisely your pvalue as a probability of something
When you say you are calculating a pvalue, you cannot assume the reader should know what you are talking about. And conversely, if you read in a paper that something is “significant with a pvalue of blah” and that is all the explanation they provide, do not feel stupid if you have no idea what they are talking about!
Specify the null hypothesis
It is often not obvious what the null hypothesis is; it should always be specified.
Look in your data for those expected false cases
Do not use the word “significant”. All a pvalue is giving you is an idea of how many false positives you should find in your set of predictions. It is your responsibility to estimate if that number is reasonable or not, and to find those false predictions. They may give you an idea of other effects (alternative hypotheses) that you are so far ignoring.
Comparing pvalues of different experiments – don’t
There is an infectious disease from which, if left untreated, affected people have a probability of recovery $p_0$. We will call this the null hypothesis $H_0$.
Your lab is testing two new treatments, treatment A (TrA) and treatment B (TrB). You run two different experiments (expA and expB) where TrA or TrB is given to two different groups of affected people.
Your analysis tells you that the pvalue for the outcome of expB with respect to the null hypothesis is much smaller than the pvalue for the outcome of expA.
It seems tempting to conclude from these results that TrB is a more effective treatment than TrA.
That conclusion would be wrong.
You can never compare two hypotheses by looking at their pvalues relative to a third null hypothesis. There are many hidden and possibly confounding variables in that calculation, one of them being the sample size.
Let’s see what could be going on behind the scenes. A way of obtaining the previous result is the following:
 expA includes 15 infected people, 8 of which recovered after using TrA.
 expB includes 300 infected people, 157 of which recovered after using TrB.
Figure 6. Probability density under the null hypothesis and pvalues for expA and expB.
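The corresponding pvalues, that is, the probabilities of at least 8 (157) recoveries out of 15 (300) under the null hypothesis of no treatment at all, with recovery probability $p_0$, are
\begin{equation} \mbox{pvalue}_A = \sum_{n=8}^{15} {15 \choose n}\, p_0^{n} (1-p_0)^{15-n}, \quad \mbox{pvalue}_B = \sum_{n=157}^{300} {300 \choose n}\, p_0^{n} (1-p_0)^{300-n}. \end{equation}
However, any simple analysis (for instance looking at the sample means for $p_A$ and $p_B$, the probabilities of recovery under TrA and TrB) tells you that those two treatments seem to be similar in effectiveness,
\begin{equation} \hat{p}_A = \frac{8}{15} = 0.533, \quad \hat{p}_B = \frac{157}{300} = 0.523. \end{equation}
What is going on?
Because the sample sizes for the two experiments are so different, the distribution of the recovery frequency under the null hypothesis is much narrower for expB than for expA; thus the pvalue of expB is smaller simply because the sample size is larger (see Figure 6).
If you calculate the posterior distributions $P(p_A\mid \mbox{expA})$ and $P(p_B\mid \mbox{expB})$, which we have done before in w02,
\begin{equation} P(p_A\mid \mbox{expA}) = \frac{16!}{8!\, 7!}\, p_A^{8} (1-p_A)^{7}, \quad P(p_B\mid \mbox{expB}) = \frac{301!}{157!\, 143!}\, p_B^{157} (1-p_B)^{143}, \end{equation}
you can conclude that your assessment of $p_B$ is more precise than that of $p_A$, due to the larger amount of data, but they are perfectly compatible with both treatments having the same effectiveness (see Figure 7).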
Figure 7. Posterior probability of the efficiency of treatment A and treatment B given the data.
As we did in the homework of w02, remember that when the data follow a binomial distribution with $n$ successes in $N$ trials, the best estimate of the Bernoulli parameter and its confidence value are given by
\begin{equation} p^\ast = \frac{n}{N}, \quad \sigma = \sqrt{\frac{p^\ast (1-p^\ast)}{N}}. \end{equation}
The best estimates and confidence values for the effectiveness of the two treatments are
\begin{equation} p_A = 0.533 \pm 0.129, \quad p_B = 0.523 \pm 0.029. \end{equation}
In fact, if the two experiments had been run using the same treatment, you would have obtained one “significant” pvalue and one “non significant” pvalue just because of the different sample sizes.
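You can reproduce the effect with a few lines of Python. This is a sketch under two assumptions of mine: the null recovery probability `p0 = 0.4` is a hypothetical value chosen only for illustration, and the estimate and its error bar use the usual approximation of $n/N$ with standard deviation $\sqrt{p^\ast(1-p^\ast)/N}$.

```python
from math import comb, sqrt

def binomial_tail(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

def estimate_with_error(n_recovered, n_total):
    """Best estimate of the recovery probability and its standard deviation."""
    p_star = n_recovered / n_total
    sigma = sqrt(p_star * (1 - p_star) / n_total)
    return p_star, sigma

p0 = 0.4  # ASSUMPTION: hypothetical recovery probability without treatment

# (recovered, total) for each experiment, as given in the text
experiments = {"expA": (8, 15), "expB": (157, 300)}

results = {}
for name, (n_recovered, n_total) in experiments.items():
    pvalue = binomial_tail(n_recovered, n_total, p0)
    p_star, sigma = estimate_with_error(n_recovered, n_total)
    results[name] = (pvalue, p_star, sigma)
    print(f"{name}: pvalue = {pvalue:.2g}, estimate = {p_star:.3f} +/- {sigma:.3f}")
```

Under this assumed null, the two pvalues differ by orders of magnitude while the two estimates overlap almost completely, which is the whole point: the pvalues are telling you about the sample sizes, not about which treatment is better.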