MCB111: Mathematics in Biology (Spring 2018)
week 03:
Model Comparison and Hypothesis Testing
For this week’s lectures, I am following MacKay’s book, Chapter 37 and Chapter 28, in that order. MacKay’s video Lecture 10 is also relevant.
Hypothesis testing: an example
Continuing with our malaria vaccine example, we now want to compare the vaccine to a control treatment with no active ingredients. We have 40 volunteers, 30 of whom receive the vaccine (V) and the other 10 the control (C). Of the 30 that receive the vaccine, one contracts malaria; of the 10 that receive the control treatment, three contract malaria.
Is our vaccine treatment V better than the control treatment C?
Let us assume that the probability of contracting malaria after taking the vaccine is $f_v$, and the probability of contracting malaria under the control treatment is $f_c$. These are the two unknown quantities we would like to infer. What does the data tell us about them?
You may want to compare these two alternative hypotheses:

$H_1$: The vaccine treatment is more effective than the control, $f_v < f_c$.

$H_0$: Both treatments have the same effectiveness, $f_v = f_c = f$.
The data measurements are
$N_v = 30$, the number of subjects that took the vaccine,
$N^+_v = 1$, the number of subjects that took the vaccine and contracted malaria,
$N_c = 10$, the number of subjects in the control,
$N^+_c = 3$, the number of subjects that contracted malaria under the control treatment.
The classical test: the $\chi^2$ test
A well-used classical statistical hypothesis test is the $\chi^2$ test.
In the $\chi^2$ test, one compares the observed values ($O_i$) to the expected values ($E_i$) under the null hypothesis as \begin{equation} \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}. \end{equation}
Then we assume that this quantity follows the $\chi^2$ distribution with $k$ degrees of freedom, where $k$ is equal to the number of measurements minus the number of “constraints”.
In this example, the null model is that both treatments are equally effective. The probability $f$ of contracting malaria under the null hypothesis is an unknown parameter that a frequentist would estimate as \begin{equation} \hat{f} = \frac{1+3}{40} = 0.1. \end{equation}
|  | Observed | Expected |
| --- | --- | --- |
| took vaccine & malaria | 1 | $30 \times 0.1 = 3$ |
| took vaccine & no malaria | 29 | $30 \times 0.9 = 27$ |
| no vaccine & malaria | 3 | $10 \times 0.1 = 1$ |
| no vaccine & no malaria | 7 | $10 \times 0.9 = 9$ |
Then, $\chi^2$ is \begin{equation} \chi^2 = \frac{(1-3)^2}{3} + \frac{(29-27)^2}{27} + \frac{(3-1)^2}{1} + \frac{(7-9)^2}{9} = 5.926. \end{equation}
To calculate the number of degrees of freedom: from the four data values, we realize that there are two constraints ($1+29=30$ and $3+7=10$), plus the fact that $f$ is not known but estimated from the data. The result is $4 - 2 - 1 = 1$ degree of freedom.
A classical statistician then goes to a table of $\chi^2$ values for one degree of freedom and finds \begin{equation} P(\chi^2 \geq 3.84) = 0.05, \qquad P(\chi^2 \geq 6.63) = 0.01, \end{equation} and interpolating she decides that the p-value is 0.02.
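The $\chi^2$ computation above can be checked with a short script (a sketch; the observed and expected counts come from the table above):

```python
# Pearson's chi^2 for the 2x2 malaria table; the pooled malaria rate
# under the null hypothesis is f_hat = (1+3)/40 = 0.1.
obs = [1, 29, 3, 7]                      # V+malaria, V+no, C+malaria, C+no
f_hat = (1 + 3) / 40
expected = [30 * f_hat, 30 * (1 - f_hat), 10 * f_hat, 10 * (1 - f_hat)]
chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, expected))
print(round(chi2, 3))  # 5.926
```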
What does this result mean?

It does not mean that there is a 98% chance that the two treatments are different.

It means (and this is the actual definition of a p-value) that if we were to repeat the experiment many times and the two treatments were identical, 2% of the time we would find a $\chi^2$ at least as large as 5.926.
A lot more about p-values in next week’s lectures!
Paraphrasing MacKay: this is at best an indirect answer to the question we really care about: do the two treatments have different effectiveness?
Which test?
There are a number of different $\chi^2$ tests:
- Pearson’s $\chi^2$ test, which is the one we just used above
- Yates’s $\chi^2$ test
- Cochran–Mantel–Haenszel $\chi^2$ test
- Tukey’s test of additivity
- portmanteau test
- …
For instance, if you use Yates’s $\chi^2$ test, which says \begin{equation} \chi^2_{\mathrm{Yates}} = \sum_i \frac{\left(|O_i - E_i| - 0.5\right)^2}{E_i}, \end{equation} we would have obtained $\chi^2_{\mathrm{Yates}} = 3.33$. This corresponds to a p-value between 0.05 and 0.10, interpolating let’s say 0.07.
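Yates’s continuity-corrected statistic can be recomputed the same way (a sketch, reusing the expected counts from the table above):

```python
# Yates's continuity-corrected chi^2 on the same 2x2 malaria table
obs = [1, 29, 3, 7]
expected = [3.0, 27.0, 1.0, 9.0]         # expected counts under the pooled null
chi2_yates = sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(obs, expected))
print(round(chi2_yates, 2))  # 3.33
```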
Which result would you believe?
In classical statistics, a pvalue of 0.05 is usually accepted as the boundary between rejecting (if the pvalue is smaller than 0.05) or not rejecting (if pvalue larger than 0.05) the null hypothesis.
- Pearson’s $\chi^2$ rejects the null hypothesis with a p-value of 0.02
- Yates’s $\chi^2$ does not reject the null hypothesis with a p-value of 0.07
On this topic, I recommend that you read this note from David MacKay, “Bayes or Chi-square? Or does it not matter?”. In this note, MacKay uses the example of inferring the language of a text by sampling random characters from the text.
Let’s now take the Bayesian approach to hypothesis testing for the vaccine question.
The Bayesian approach
As we have done before in this course, let us see what the data tells us about the possible values of the unknown parameters $f_v$ and $f_c$.
The posterior probability of the parameters given the data is \begin{equation} P(f_v, f_c\mid D) = \frac{P(D\mid f_v, f_c) P(f_v, f_c)}{P(D)}, \end{equation}
where the probability of the data given the parameters $f_v$ and $f_c$ is \begin{equation} P(D\mid f_v, f_c) = {N_v \choose N^+_v} f_v^{N^+_v} (1-f_v)^{N_v-N^+_v} {N_c \choose N^+_c} f_c^{N^+_c} (1-f_c)^{N_c-N^+_c}. \end{equation}
If we assume uniform priors, $P(f_v, f_c) = P(f_v)\,P(f_c) = 1$, then we can factorize the posterior and obtain \begin{equation} P(f_v, f_c\mid D) = P(f_v\mid N_v, N^+_v)\,P(f_c\mid N_c, N^+_c), \end{equation} with \begin{equation} P(f_v\mid N_v, N^+_v) = \frac{f_v^{N^+_v} (1-f_v)^{N_v-N^+_v}}{\int_0^1 f^{N^+_v} (1-f)^{N_v-N^+_v}\, df}, \end{equation} and similarly for $f_c$.
Using the beta integral again, \begin{equation} \int_0^1 f^{a} (1-f)^{b}\, df = \frac{a!\, b!}{(a+b+1)!}, \end{equation} we obtain \begin{equation} P(f_v\mid N_v, N^+_v) = \frac{(N_v+1)!}{N^+_v!\,(N_v-N^+_v)!}\, f_v^{N^+_v} (1-f_v)^{N_v-N^+_v}. \end{equation}
Figure 1. Posterior probabilities of the effectiveness for the vaccine and the control treatments. The parameters’ standard deviations are calculated using a second-order Taylor expansion around the optimal value.
For our particular example, the posteriors peak at $f_v^\ast = N^+_v/N_v = 1/30 \approx 0.03$ and $f_c^\ast = N^+_c/N_c = 3/10 = 0.3$.
The two posterior distributions are shown in Figure 1.
Exercise: reproduce on your own the optimal values and standard deviations given in Figure 1 for the parameters of the two treatments. (The standard deviations have been calculated according to the Gaussian approximation.)
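As a quick check of the exercise, under a uniform prior the posterior peaks at $f^\ast = N^+/N$, and the curvature of the log posterior at the peak gives $\sigma^\ast = \sqrt{f^\ast(1-f^\ast)/N}$ (this follows from the second derivative of $N^+\log f + (N-N^+)\log(1-f)$):

```python
import math

# MAP value and Gaussian-approximation standard deviation for a
# binomial posterior under a uniform prior:
#   f* = N+/N,  sigma* = sqrt(f*(1-f*)/N)
def map_and_sigma(N, Np):
    fstar = Np / N
    return fstar, math.sqrt(fstar * (1 - fstar) / N)

fv, sv = map_and_sigma(30, 1)    # vaccine
fc, sc = map_and_sigma(10, 3)    # control
print(f"vaccine: {fv:.3f} +/- {sv:.3f}")   # 0.033 +/- 0.033
print(f"control: {fc:.3f} +/- {sc:.3f}")   # 0.300 +/- 0.145
```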
The question “how probable is it that $f_v < f_c$?” now has a very clean answer: \begin{equation} P(f_v < f_c\mid D) = \int_0^1 df_c\, P(f_c\mid N_c, N^+_c) \int_0^{f_c} df_v\, P(f_v\mid N_v, N^+_v). \end{equation}
This integral does not have an easy analytical expression, but it is straightforward to compute numerically \begin{equation} P(f_v<f_c\mid D) = 0.987. \end{equation}
This means that there is a 98.7% chance, given the data and our prior assumptions, that the vaccine treatment is more effective than the control.
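This number can be reproduced with a simple grid integration over the region $f_v < f_c$ (a sketch; the posteriors are the normalized beta densities derived above):

```python
import numpy as np
from math import factorial

# Numerical check of P(f_v < f_c | D): integrate the product of the two
# beta posteriors over the region f_v < f_c, on a midpoint grid.
def posterior(f, N, Np):
    # normalized Beta(Np+1, N-Np+1) density, from the beta integral
    norm = factorial(N + 1) / (factorial(Np) * factorial(N - Np))
    return norm * f**Np * (1 - f)**(N - Np)

n = 20000
f = (np.arange(n) + 0.5) / n          # midpoints of [0, 1]
df = 1.0 / n
pv = posterior(f, 30, 1)              # vaccine posterior
pc = posterior(f, 10, 3)              # control posterior

cdf_v = np.cumsum(pv) * df            # P(f_v < f) on the grid
prob = np.sum(pc * cdf_v) * df
print(round(prob, 3))  # 0.987
```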
Model comparison
If we actually want to quantitatively compare the two hypotheses, $H_1$: $f_v < f_c$, and $H_0$: $f_v = f_c$, then we can use \begin{equation} \frac{P(H_1\mid D)}{P(H_0\mid D)} = \frac{P(D\mid H_1) P(H_1)}{P(D\mid H_0) P(H_0)}. \end{equation}
Giving both hypotheses equal prior probability gives \begin{equation} \frac{P(H_1\mid D)}{P(H_0\mid D)} = \frac{P(D\mid H_1)}{P(D\mid H_0)}. \end{equation}
The probability of the data under $H_1$ (the evidence of the data under $H_1$) is, using marginalization, given by
\begin{equation} P(D\mid H_1) = \int_{f_v<f_c} P(D\mid f_v, f_c, H_1) P(f_v,f_c\mid H_1)\, df_v\, df_c, \end{equation}
and the probability of the data under $H_0$ is given by
\begin{equation} P(D\mid H_0) = \int_{f_v=f_c=f} P(D\mid f, f, H_0) P(f\mid H_0)\, df. \end{equation}
If we use uniform priors again, in this case $P(f\mid H_0) = 1$ for $H_0$, and $P(f_v, f_c\mid H_1) = 1$ restricted to the region $f_v < f_c$ for $H_1$, then we have \begin{equation} P(D\mid H_1) = {N_v \choose N^+_v} {N_c \choose N^+_c} \int_0^1 df_c\, f_c^{N^+_c} (1-f_c)^{N_c-N^+_c} \int_0^{f_c} df_v\, f_v^{N^+_v} (1-f_v)^{N_v-N^+_v}, \end{equation}
and \begin{equation} P(D\mid H_0) = {N_v \choose N^+_v} {N_c \choose N^+_c} \int_0^1 df\, f^{N^+_v+N^+_c} (1-f)^{N_v+N_c-N^+_v-N^+_c}. \end{equation}
Then, \begin{equation} \frac{P(H_1\mid D)}{P(H_0\mid D)} = \frac{\int_0^1 df_c\, f_c^{N^+_c} (1-f_c)^{N_c-N^+_c} \int_0^{f_c} df_v\, f_v^{N^+_v} (1-f_v)^{N_v-N^+_v}}{\int_0^1 df\, f^{N^+_v+N^+_c} (1-f)^{N_v+N_c-N^+_v-N^+_c}}. \end{equation}
In our particular example, $N_v = 30$, $N^+_v = 1$, $N_c = 10$, $N^+_c = 3$, we have \begin{equation} \frac{P(H_1\mid D)}{P(H_0\mid D)} = \frac{\int_0^1 df_c\, f_c^{3} (1-f_c)^{7} \int_0^{f_c} df_v\, f_v (1-f_v)^{29}}{\int_0^1 df\, f^{4} (1-f)^{36}}. \end{equation}
The integral in the numerator does not have a compact analytical expression, but we (you as well as me) can calculate it numerically, resulting in
\begin{equation} \frac{P(H_1\mid D)}{P(H_0\mid D)} = 3.0140. \end{equation}
This result tells us that hypothesis $H_1$ (the vaccine treatment is more effective than the control) is about 3 times more probable than the null hypothesis $H_0$ (both treatments are equally effective).
The raw data sort of told us already that the vaccine was more effective. The Bayesian analysis gives us a quantification of that intuitive assessment.
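The evidence ratio can be reproduced numerically with the same midpoint-grid trick (a sketch; the binomial coefficients cancel between numerator and denominator):

```python
import numpy as np

# Numerical check of the evidence ratio P(H1|D)/P(H0|D) for the
# malaria data: N_v=30, N_v+=1, N_c=10, N_c+=3.
n = 4000
f = (np.arange(n) + 0.5) / n                 # midpoints of [0, 1]
df = 1.0 / n

lv = f * (1 - f) ** 29                       # vaccine likelihood factor
lc = f ** 3 * (1 - f) ** 7                   # control likelihood factor

cum_lv = np.cumsum(lv) * df                  # inner integral over f_v < f_c
evidence_h1 = np.sum(lc * cum_lv) * df       # H1: integrate over the triangle
evidence_h0 = np.sum(f ** 4 * (1 - f) ** 36) * df   # H0: single shared rate f

print(round(evidence_h1 / evidence_h0, 2))   # 3.01
```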
Consider now this other case, in which the amount of data is small:
| Treatment | Vaccine | Control |
| --- | --- | --- |
| Got malaria | 0 | 1 |
| Did not | 3 | 1 |
| Total treated | 3 | 2 |
In this case, using the expression we just deduced, \begin{equation} \frac{P(H_1\mid D)}{P(H_0\mid D)} = \frac{\int_0^1 df_c\, f_c (1-f_c) \int_0^{f_c} df_v\, (1-f_v)^{3}}{\int_0^1 df\, f (1-f)^{4}} = \frac{30}{28} \approx 1.07. \end{equation}
So, from our prior assumption that the two hypotheses were 50:50, the data shifts the posterior probability to 52:48 in favor of $H_1$.
Bayesian analysis gives us a quantification of the intuitive acknowledgment that such a small sample gives us “some” weak evidence in favor of the vaccine treatment.
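The small-sample ratio can be checked with the same numerical recipe (a sketch; here the vaccine group has 0 of 3 malaria cases and the control 1 of 2):

```python
import numpy as np

# Same evidence-ratio computation for the small-sample table:
# N_v=3, N_v+=0, N_c=2, N_c+=1.
n = 4000
f = (np.arange(n) + 0.5) / n
df = 1.0 / n

lv = (1 - f) ** 3                    # vaccine: 0 of 3 contracted malaria
lc = f * (1 - f)                     # control: 1 of 2 contracted malaria

cum_lv = np.cumsum(lv) * df
ratio = (np.sum(lc * cum_lv) * df) / (np.sum(f * (1 - f) ** 4) * df)
posterior_h1 = ratio / (1 + ratio)
print(round(ratio, 2), round(posterior_h1, 2))   # 1.07 0.52
```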
Occam’s razor
Or why more complicated models can turn out to be less probable
Occam’s razor advises that if several explanations are compatible with a set of observations, one should go with the simplest. Applied to our model selection problem, Occam’s razor advises that if two different models can explain the data, we should use the one with the fewer parameters.
Why should we follow Occam’s razor?
There is an aesthetic reason: more complexity is uglier, simpler is prettier. Bayesian inference offers another justification.
The plausibility of two alternative hypotheses $H_1$ and $H_2$ can be expressed as
\begin{equation} \frac{P(H_1\mid D)}{P(H_2\mid D)} = \frac{P(D\mid H_1) P(H_1)}{P(D\mid H_2) P(H_2)}. \end{equation}
If we use uniform priors, $P(H_1) = P(H_2)$, then the ratio of the two hypotheses is the ratio of the evidences of the data under the two hypotheses
\begin{equation} \frac{P(H_1\mid D)}{P(H_2\mid D)} = \frac{P(D\mid H_1)}{P(D\mid H_2)}. \end{equation}
Assume that $H_1$ depends on parameter(s) $p_1$, and $H_2$ on parameter(s) $p_2$. We can write
\begin{equation} P(D\mid H_1) = \int_{p_1} P(D\mid p_1, H_1) P(p_1\mid H_1) dp_1, \end{equation}
and similarly for $H_2$.
Remember that in many cases we have considered, the posterior probability of the parameter has a strong peak at the most probable value $p_1^\ast$. We can expand the probability around the most probable value, defined by \begin{equation} \frac{d}{dp_1} \left.\log P(D\mid p_1, H_1)\right|_{p_1=p_1^\ast} = 0. \end{equation}
A Taylor expansion gives us \begin{equation} \log P(D\mid p_1, H_1) \approx \log P(D\mid p_1^\ast, H_1) - \frac{(p_1-p_1^\ast)^2}{2\,{\sigma_1^\ast}^2}, \end{equation}
such that \begin{equation} \frac{1}{{\sigma_1^\ast}^2} = -\left.\frac{d^2}{dp_1^2} \log P(D\mid p_1, H_1)\right|_{p_1=p_1^\ast}. \end{equation}
Then we can approximate $P(D\mid p_1, H_1)$ as
\begin{equation} P(D\mid p_1, H_1) \approx P(D\mid p^\ast_1, H_1)\, e^{-\frac{(p_1-p_1^\ast)^2}{2{\sigma_1^{\ast}}^2}}. \end{equation}
This expansion goes by the name of Laplace’s Method.
Using the Laplace approximation in the evidence $P(D\mid H_1)$, we have \begin{equation} P(D\mid H_1) \approx P(D\mid p^\ast_1, H_1)\, P(p_1^\ast\mid H_1) \int e^{-\frac{(p_1-p_1^\ast)^2}{2{\sigma_1^{\ast}}^2}}\, dp_1. \end{equation}
If we assume a uniform prior \begin{equation} P(p_1\mid H_1) = \frac{1}{\sigma_1}, \end{equation}
such that $p_1$ varies in a range of width $\sigma_1$, and such that the bounds are large enough that the Gaussian integral is complete, \begin{equation} P(D\mid H_1) \approx P(D\mid p^\ast_1, H_1)\, \frac{\sqrt{2\pi}\,\sigma_1^\ast}{\sigma_1}. \end{equation}
Then the probability ratio between the two hypotheses results in \begin{equation} \frac{P(H_1\mid D)}{P(H_2\mid D)} \approx \frac{P(D\mid p^\ast_1, H_1)}{P(D\mid p^\ast_2, H_2)} \times \frac{\sigma_1^\ast/\sigma_1}{\sigma_2^\ast/\sigma_2}. \end{equation}
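As a sanity check of Laplace’s method, we can compare it with the exact beta integral for the vaccine likelihood $f^{1}(1-f)^{29}$, whose exact evidence is $1!\,29!/31! = 1/930$ (a sketch; the $\sigma^\ast$ formula comes from the curvature at the peak):

```python
import math

# Laplace's method on the vaccine likelihood f^1 (1-f)^29:
# exact evidence is the beta integral 1!29!/31! = 1/930.
N, Np = 30, 1
fstar = Np / N                                  # peak of the likelihood
sigma = math.sqrt(fstar * (1 - fstar) / N)      # from the 2nd derivative at the peak
peak = fstar**Np * (1 - fstar)**(N - Np)
laplace = peak * math.sqrt(2 * math.pi) * sigma
exact = 1 / 930
print(round(laplace / exact, 2))  # 0.95
```

Even with only 30 data points, the approximation recovers about 95% of the exact evidence.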
The ratio of the probability of the two hypotheses has two terms:

Best data fit: the ratio $\frac{P(D\mid p_1^\ast, H_1)}{P(D\mid p_2^\ast, H_2)}$.
This term gives us the best fit of each model to the data. If we were not Bayesian, this would be the only term to consider.
If $H_1$ and $H_2$ are two nested hypotheses such that $H_2$ has more parameters and includes $H_1$ as a particular case, then it is always going to be possible for $H_2$ to fit the data better than $H_1$. The model with more parameters can use the extra parameters to overfit the data.
That is, for nested hypotheses, \begin{equation} P(D\mid p_2^\ast, H_2) \geq P(D\mid p_1^\ast, H_1). \end{equation}
This term alone would tell us that a model with more parameters always has higher probability.

Bayesian Occam razor: the ratio $\frac{\sigma_1^\ast/\sigma_1}{\sigma_2^\ast/\sigma_2}$.
But we are Bayesians, so we have another term. Each factor $\sigma^\ast/\sigma$ is the posterior accessible volume of the parameters of the hypothesis relative to the prior accessible volume. The more posterior uncertainty we have about the parameters, the more the hypothesis is favored.
For the two nested hypotheses

On the one hand, \begin{equation} P(D\mid p^\ast_2, H_2) \geq P(D\mid p^\ast_1, H_1), \end{equation} because $H_2$ is going to fit the data at least as well as $H_1$.

On the other hand, \begin{equation} \frac{\sigma_2^\ast}{\sigma_2} \leq \frac{\sigma_1^\ast}{\sigma_1}, \end{equation} because for $H_2$ the posterior of the parameters has a higher peak, and thus a smaller spread.
Then, for a model with more parameters to have a higher probability, the gain in the fit to the data (the best-data-fit term) has to compensate for the decrease in uncertainty about the parameter(s) (the Bayesian Occam’s razor term).
Occam’s razor: an example
Let us go back to our example in Week 02 about the wait time for a bacterial mutation. In that example, we assumed there was one type of bacteria, with one single mutational time constant $\lambda$.
What if there were more than one species of bacteria with distinct mutational time constants? Can the data tell us if that is the case?
In order to do that, we want to compare two nested hypotheses:

$H_1$: One mutation time constant $\lambda$.
The probability for a bacterium to wait a time $t$ before mutating under $H_1$ is \begin{equation} P(t\mid \lambda, H_1) = \frac{e^{-t/\lambda}}{Z(\lambda)}, \end{equation} where $Z(\lambda)$ is determined by the normalization condition \begin{equation} Z(\lambda) = \int_{t_{-}}^{t_{+}} e^{-t/\lambda}\, dt, \end{equation} and we have assumed that we observe the bacterial colony in the time interval $[t_{-}, t_{+}]$.

$H_2$: Two mutation time constants $\lambda_1$ and $\lambda_2$.
The probability for a bacterium to wait a time $t$ before mutating under $H_2$ is \begin{equation} P(t\mid \lambda_1,\lambda_2, H_2) \propto \eta\, e^{-t/\lambda_1} + (1-\eta)\, e^{-t/\lambda_2}, \end{equation}
such that there is a probability $\eta$ that the bacterium mutates with time constant $\lambda_1$ and a probability $1-\eta$ that it mutates with time constant $\lambda_2$. For simplicity, we assume that $\eta$ is a fixed value.
Imposing normalization, we obtain \begin{equation} P(t\mid \lambda_1,\lambda_2, H_2) = \frac{\eta\, e^{-t/\lambda_1} + (1-\eta)\, e^{-t/\lambda_2}}{\eta\, Z(\lambda_1) + (1-\eta)\, Z(\lambda_2)}. \end{equation}
We want to calculate the ratio of the probability of the two hypotheses, which as we have seen before is given by \begin{equation} \frac{P(H_1\mid D)}{P(H_2\mid D)} = \frac{P(D\mid H_1)}{P(D\mid H_2)}, \end{equation} assuming an equal prior for the two hypotheses.
Let’s calculate the evidences $P(D\mid H_1)$ and $P(D\mid H_2)$ for the two hypotheses. If we assume we have data $D$ that correspond to $N$ mutation times $t_1,\ldots,t_N$, then:

Evidence for H_1: For a given parameter $\lambda$ we have \begin{equation} P(D\mid \lambda, H_1) = \frac{e^{-\sum_i t_i/\lambda}}{Z^N(\lambda)}. \end{equation}
The evidence is obtained by marginalization over all allowed values of $\lambda$ as \begin{equation} P(D\mid H_1) = \int_{\lambda_{-}}^{\lambda_{+}} P(D\mid \lambda, H_1)\, P(\lambda\mid H_1)\, d\lambda. \end{equation}
Here, I assume a uniform prior for $\lambda$ (which can vary in $[\lambda_{-}, \lambda_{+}]$) given by \begin{equation} P(\lambda\mid H_1) = \frac{1}{\lambda_{+} - \lambda_{-}}. \end{equation}

Evidence for H_2: For two given parameters $\lambda_1$ and $\lambda_2$ we have \begin{equation} P(D\mid \lambda_1,\lambda_2, H_2) = \prod_{i=1}^{N} \frac{\eta\, e^{-t_i/\lambda_1} + (1-\eta)\, e^{-t_i/\lambda_2}}{\eta\, Z(\lambda_1) + (1-\eta)\, Z(\lambda_2)}. \end{equation}
The evidence is obtained by marginalization over all allowed values of $\lambda_1$ and $\lambda_2$ as \begin{equation} P(D\mid H_2) = \int_{\lambda_{-}}^{\lambda_{+}} \int_{\lambda_{-}}^{\lambda_{+}} P(D\mid \lambda_1,\lambda_2, H_2)\, P(\lambda_1\mid H_2)\, P(\lambda_2\mid H_2)\, d\lambda_1\, d\lambda_2. \end{equation}
Uniform priors are also assumed for $\lambda_1$ and $\lambda_2$: $P(\lambda_1\mid H_2) = P(\lambda_2\mid H_2) = \frac{1}{\lambda_{+} - \lambda_{-}}$.
The ratio of the two evidences gives us the ratio of the two hypotheses. The integrals in the evidences can be easily calculated numerically.
For the example of Week 02, the data was \begin{equation} \{t_1,\ldots,t_6\} = \{1.2, 2.1, 3.4, 4.1, 7, 11\}. \end{equation}
Assuming fixed values for $\eta$, for the prior bounds $\lambda_{-}$ and $\lambda_{+}$ (in mins), and for the observation window $[t_{-}, t_{+}]$, the numerical integration produces
\begin{equation} \frac{P(H_1\mid {1.2,2.1,3.4,4.1,7,11})}{P(H_2\mid {1.2,2.1,3.4,4.1,7,11})} = 1.30 = \frac{0.56}{0.44}. \end{equation}
That is, the one-time-constant model is 1.3 times more probable than the two-constant model, thus bearing out Occam’s razor advice.
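The two evidence integrals are easy to sketch numerically. Note that the values of `eta`, the prior range `[lam_lo, lam_hi]`, and the observation window `[t_lo, t_hi]` below are hypothetical stand-ins (the lecture’s actual values are not reproduced here), so the resulting ratio will differ from 1.30:

```python
import numpy as np

# Numerical sketch of the evidence ratio P(D|H1)/P(D|H2) for the mutation
# example, with hypothetical values for eta, the prior range, and the
# observation window.
t = np.array([1.2, 2.1, 3.4, 4.1, 7.0, 11.0])
t_lo, t_hi = 0.0, 20.0                # assumed observation window (mins)
eta = 0.5                             # assumed fixed mixing parameter
lam = np.linspace(0.5, 20.0, 200)     # assumed uniform prior range (mins)
dlam = lam[1] - lam[0]

def Z(l):
    # normalization of exp(-t/lambda) over the observation window
    return l * (np.exp(-t_lo / l) - np.exp(-t_hi / l))

def like1(l):
    # H1 likelihood for one time constant
    return np.exp(-np.sum(t) / l) / Z(l) ** len(t)

def like2(l1, l2):
    # H2 likelihood for a two-component mixture
    num = eta * np.exp(-t / l1) + (1 - eta) * np.exp(-t / l2)
    return np.prod(num / (eta * Z(l1) + (1 - eta) * Z(l2)))

# Marginalize over the uniform priors (the 1/(lam_hi - lam_lo) factors)
ev1 = sum(like1(l) for l in lam) * dlam / (lam[-1] - lam[0])
ev2 = sum(like2(l1, l2) for l1 in lam for l2 in lam) * dlam**2 / (lam[-1] - lam[0])**2
print(ev1 / ev2)
```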
Question: what would you do if you wanted to relax the condition that the mixing parameter $\eta$ has a fixed value? Can you do inference to determine its value given the data?