MCB111: Mathematics in Biology (Spring 2018)

week 03:

Model Comparison and Hypothesis Testing

For this week’s lectures, I am following MacKay’s book, Chapter 37 and Chapter 28, in that order. MacKay’s video Lecture 10 is also relevant.

Hypothesis testing: an example

Let us continue with our malaria vaccine example. Now we want to compare the vaccine to a control treatment with no active ingredient. We have 40 volunteers: 30 receive the vaccine (V) and the other 10 receive the control (C). Of the 30 that receive the vaccine, one contracts malaria; of the 10 that receive the control treatment, three contract malaria.

Is our vaccine treatment V better than the control treatment C?

Let us assume that the probability of contracting malaria after taking the vaccine is $f_v$, and the probability of contracting malaria under the control treatment is $f_c$. These are the two unknown quantities we would like to infer. What does the data tell us about them?

You may want to compare these two alternative hypotheses:

$H_0$: the two treatments are equally effective, $f_v = f_c$;

$H_1$: the vaccine treatment is more effective, $f_v < f_c$.

The data measurements are

$N_v = 30$, the number of subjects that took the vaccine,

$N^+_v = 1$, the number of subjects that took the vaccine and contracted malaria,

$N_c = 10$, the number of subjects in the control,

$N^+_c = 3$, the number of subjects that contracted malaria under the control treatment.

The classical test: the $\chi^2$ test

A widely used classical statistical hypothesis test is the $\chi^2$ test.

In the $\chi^2$ test, one compares the observed values ($O_i$) to the expected values ($E_i$) under the null hypothesis as \begin{equation} \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}. \end{equation}

Then we assume that this quantity follows the $\chi^2$ distribution with $\nu$ degrees of freedom, where $\nu$ is equal to the number of measurements minus the number of “constraints”.

In this example, the null model is that both treatments are equally effective. The probability $f$ of contracting malaria under the null hypothesis is an unknown parameter that a frequentist would estimate as \begin{equation} \hat{f} = \frac{N^+_v + N^+_c}{N_v + N_c} = \frac{1+3}{30+10} = 0.1. \end{equation}

                              Observed   Expected
took vaccine & malaria            1          3
took vaccine & no malaria        29         27
no vaccine & malaria              3          1
no vaccine & no malaria           7          9

Then, $\chi^2$ is \begin{equation} \chi^2 = \frac{(1-3)^2}{3} + \frac{(29-27)^2}{27} + \frac{(3-1)^2}{1} + \frac{(7-9)^2}{9} = 5.93. \end{equation}

To calculate the number of degrees of freedom: from the four data values, we realize that there are two constraints ($1+29=30$ and $3+7=10$), and also the fact that $\hat{f}$ is not known. The result is $4-2-1=1$ degree of freedom.

A classical statistician then goes to a table of $\chi^2$ values for one degree of freedom and finds \begin{equation} P(\chi^2 \geq 3.84) = 0.05, \qquad P(\chi^2 \geq 6.63) = 0.01, \end{equation} and interpolating she decides that the p-value is about 0.02.
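As a sanity check, this $\chi^2$ computation can be reproduced in a few lines of Python (a minimal sketch; the counts are taken from the table above):

```python
import numpy as np

# Observed 2x2 counts: rows = (vaccine, control), columns = (malaria, no malaria)
obs = np.array([[1.0, 29.0],
                [3.0, 7.0]])

# Null hypothesis: a single infection rate, estimated from the pooled data
f_hat = obs[:, 0].sum() / obs.sum()               # (1 + 3) / 40 = 0.1
totals = obs.sum(axis=1)                          # 30 vaccinated, 10 controls
exp = np.column_stack([totals * f_hat, totals * (1 - f_hat)])

chi2 = ((obs - exp) ** 2 / exp).sum()
print(round(chi2, 2))                             # 5.93
```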

What does this result mean?

Paraphrasing MacKay: this is at best an indirect result regarding what we really want to know: do the two treatments have different effectiveness?

Which test?

There are a number of different versions of the $\chi^2$ test, and they do not always agree.

For instance, if you use Yates’s corrected $\chi^2$ test, which says \begin{equation} \chi^2 = \sum_i \frac{\left(|O_i - E_i| - 0.5\right)^2}{E_i}, \end{equation} we would have obtained $\chi^2 = 3.33$. This corresponds to a p-value of about 0.07.

Which result would you believe?

In classical statistics, a p-value of 0.05 is usually accepted as the boundary between rejecting (if the p-value is smaller than 0.05) or not rejecting (if p-value larger than 0.05) the null hypothesis.

On this topic, I recommend that you read this note from David MacKay, “Bayes or Chi-square? Or does it not matter?”. In this note, MacKay uses the example of inferring the language of a text by sampling random characters from the text.

Let’s now take the Bayesian approach to hypothesis testing for the vaccine question.

The Bayesian approach

As we have done before in this course, let us see what the data tells us about the possible values of the unknown parameters $f_v$ and $f_c$.

The posterior probability of the parameters given the data is \begin{equation} P(f_v, f_c\mid D) = \frac{P(D\mid f_v, f_c) P(f_v, f_c)}{P(D)}, \end{equation}

where the probability of the data given the parameters and is \begin{equation} P(D\mid f_v, f_c) = {N_v \choose N^+_v} f_v^{N^+_v} (1-f_v)^{N_v-N^+_v} {N_c \choose N^+_c} f_c^{N^+_c} (1-f_c)^{N_c-N^+_c}. \end{equation}

If we assume uniform priors, $P(f_v, f_c) = 1$, then we can factorize the posterior and obtain \begin{equation} P(f_v, f_c\mid D) = P(f_v\mid N_v, N^+_v)\, P(f_c\mid N_c, N^+_c), \end{equation} and, using the beta integral again, \begin{equation} P(f_v\mid N_v, N^+_v) = \frac{f_v^{N^+_v} (1-f_v)^{N_v-N^+_v}}{B(N^+_v+1,\, N_v-N^+_v+1)}, \qquad P(f_c\mid N_c, N^+_c) = \frac{f_c^{N^+_c} (1-f_c)^{N_c-N^+_c}}{B(N^+_c+1,\, N_c-N^+_c+1)}. \end{equation}

Figure 1. Posterior probabilities of the effectiveness of the vaccine and the control treatments. The parameters' standard deviations are calculated using a second-order Taylor expansion around the optimal value.

For our particular example \begin{equation} f_v^\ast = \frac{N^+_v}{N_v} = \frac{1}{30} \approx 0.03, \qquad f_c^\ast = \frac{N^+_c}{N_c} = \frac{3}{10} = 0.3, \end{equation} with Gaussian-approximation standard deviations $\sigma = \sqrt{f^\ast(1-f^\ast)/N}$.

The two posterior distributions are shown in Figure 1.

Exercise: reproduce on your own the optimal values and standard deviations given in Figure 1 for the parameters of the two treatments. (The standard deviations have been calculated according to the Gaussian approximation.)

The question “how probable is it that $f_v < f_c$?” now has a very clean answer

\begin{equation} P(f_v<f_c\mid D) = \int_0^1 df_c \int_0^{f_c} df_v\; P(f_v\mid N_v, N^+_v)\, P(f_c\mid N_c, N^+_c). \end{equation}

This integral does not have an easy analytical expression, but it is easy to do it numerically \begin{equation} P(f_v<f_c\mid D) = 0.987. \end{equation}

This means that there is a 98.7% chance, given the data and our prior assumptions, that the vaccine treatment is more effective than the control.
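The numerical integration is straightforward; here is a minimal sketch in Python (the grid resolution is an arbitrary choice):

```python
import numpy as np

# Posteriors under uniform priors: f_v ~ Beta(2, 30), f_c ~ Beta(4, 8)
f = np.linspace(0.0, 1.0, 2001)
df = f[1] - f[0]

post_v = f**1 * (1 - f)**29           # unnormalized P(f_v | D)
post_v /= post_v.sum() * df
post_c = f**3 * (1 - f)**7            # unnormalized P(f_c | D)
post_c /= post_c.sum() * df

# P(f_v < f_c | D): integrate the f_v CDF against the f_c posterior
cdf_v = np.cumsum(post_v) * df
p = np.sum(cdf_v * post_c) * df
print(round(p, 3))                    # about 0.987
```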

Model comparison

If we actually want to quantitatively compare the two hypotheses, $H_1$: $f_v < f_c$, and $H_0$: $f_v = f_c$, then we can use \begin{equation} \frac{P(H_1\mid D)}{P(H_0\mid D)} = \frac{P(D\mid H_1) P(H_1)}{P(D\mid H_0) P(H_0)}. \end{equation}

Giving both hypotheses equal prior probability gives \begin{equation} \frac{P(H_1\mid D)}{P(H_0\mid D)} = \frac{P(D\mid H_1)}{P(D\mid H_0)}. \end{equation}

The probability of the data under $H_1$ (the evidence of the data under $H_1$) is, using marginalization, given by

\begin{equation} P(D\mid H_1) = \int_{f_v<f_c} P(D\mid f_v, f_c, H_1) P(f_v,f_c\mid H_1)\, df_v\, df_c, \end{equation}

and the probability of the data under $H_0$ is given by

\begin{equation} P(D\mid H_0) = \int_{f_v=f_c=f} P(D\mid f, f, H_0) P(f\mid H_0)\, df. \end{equation}

If we use uniform priors again, taking the prior over $(f_v, f_c)$ to be the uniform density on the unit square restricted to the region $f_v<f_c$, and the prior over $f$ to be uniform on $[0,1]$, then we have (the binomial coefficients cancel between numerator and denominator)

\begin{equation} \frac{P(H_1\mid D)}{P(H_0\mid D)} = \frac{\int_{f_v<f_c} f_v^{N^+_v}(1-f_v)^{N_v-N^+_v}\, f_c^{N^+_c}(1-f_c)^{N_c-N^+_c}\, df_v\, df_c}{\int_0^1 f^{N^+_v+N^+_c}(1-f)^{N_v+N_c-N^+_v-N^+_c}\, df}. \end{equation}

In our particular example, $N_v=30$, $N^+_v=1$, $N_c=10$, $N^+_c=3$, we have

\begin{equation} \frac{P(H_1\mid D)}{P(H_0\mid D)} = \frac{\int_{f_v<f_c} f_v(1-f_v)^{29}\, f_c^{3}(1-f_c)^{7}\, df_v\, df_c}{\int_0^1 f^{4}(1-f)^{36}\, df}. \end{equation}

The integral in the numerator does not have a compact analytic expression, but we (you as well as I) can calculate it numerically, resulting in

\begin{equation} \frac{P(H_1\mid D)}{P(H_0\mid D)} = 3.0140. \end{equation}

This result tells us that hypothesis $H_1$ (the vaccine treatment is more effective than the control) is about 3 times more probable than the null hypothesis $H_0$ that the two treatments are equally effective.
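A minimal numerical sketch of this ratio (the binomial coefficients cancel, so they are omitted; the prior over $(f_v, f_c)$ is taken as the uniform density on the unit square restricted to $f_v<f_c$, which reproduces the value quoted above):

```python
import numpy as np

# Counts: N = subjects, k = infections, for vaccine (v) and control (c)
Nv, kv = 30, 1
Nc, kc = 10, 3

f = np.linspace(0.0, 1.0, 1001)
df = f[1] - f[0]

# Evidence under H1: joint likelihood integrated over the region f_v < f_c
fv, fc = np.meshgrid(f, f, indexing="ij")
lik = fv**kv * (1 - fv)**(Nv - kv) * fc**kc * (1 - fc)**(Nc - kc)
num = np.sum(lik * (fv < fc)) * df * df

# Evidence under H0: a single common infection rate f
den = np.sum(f**(kv + kc) * (1 - f)**(Nv + Nc - kv - kc)) * df

print(round(num / den, 2))            # close to 3.01
```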

The raw data sort of told us that the vaccine treatment was more effective. The Bayesian analysis gives us a quantification of that intuitive assessment.

Consider now this other case, in which the amount of data is small:

Treatment        Vaccine   Control
Got malaria         0         1
Did not             3         1
Total treated       3         2

In this case, using the expression we just deduced,

\begin{equation} \frac{P(H_1\mid D)}{P(H_0\mid D)} = \frac{\int_{f_v<f_c} (1-f_v)^{3}\, f_c(1-f_c)\, df_v\, df_c}{\int_0^1 f(1-f)^{4}\, df} = \frac{1/28}{1/30} = \frac{15}{14} \approx 1.07. \end{equation}

So, from our prior assumption that the two hypotheses were 50:50, the data shifts the posterior probability to 52:48 in favor of $H_1$.
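The same numerical recipe confirms the small-sample result (a sketch; in this case the integrals can also be done by hand, giving exactly 15/14):

```python
import numpy as np

# Small-sample case: vaccine 0 of 3 infected, control 1 of 2 infected
f = np.linspace(0.0, 1.0, 1001)
df = f[1] - f[0]

fv, fc = np.meshgrid(f, f, indexing="ij")
num = np.sum((1 - fv)**3 * fc * (1 - fc) * (fv < fc)) * df * df   # H1 evidence
den = np.sum(f * (1 - f)**4) * df                                 # H0 evidence

print(num / den)                      # about 15/14 = 1.07
```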

Bayesian analysis gives us a quantification of the intuitive acknowledgment that such a small sample gives us “some” weak evidence in favor of the vaccine treatment.

Occam’s razor

Or why more complicated models can turn out to be less probable

Occam’s razor advises that if several explanations are compatible with a set of observations, one should go with the simplest. Applied to our model selection problem, Occam’s razor advises that if two different models can explain the data, we should use the one with the fewest parameters.

Why should we follow Occam’s razor?

There is an aesthetic reason: more complexity is uglier, simpler is prettier. Bayesian inference offers another justification.

The plausibility of two alternative hypotheses $H_1$ and $H_2$ can be expressed as

\begin{equation} \frac{P(H_1\mid D)}{P(H_2\mid D)} = \frac{P(D\mid H_1) P(H_1)}{P(D\mid H_2) P(H_2)}. \end{equation}

If we use uniform priors, $P(H_1) = P(H_2)$, then the ratio of the two hypotheses is the ratio of the evidence of the data under the two hypotheses

\begin{equation} \frac{P(H_1\mid D)}{P(H_2\mid D)} = \frac{P(D\mid H_1)}{P(D\mid H_2)}. \end{equation}

Assume that $H_1$ depends on parameter(s) $p_1$, and $H_2$ on parameter(s) $p_2$. We can write

\begin{equation} P(D\mid H_1) = \int_{p_1} P(D\mid p_1, H_1) P(p_1\mid H_1) dp_1, \end{equation}

and similarly for $P(D\mid H_2)$.

Remember that in many cases we have considered, the posterior probability of the parameter has a strong peak at the most probable value $p_1^\ast$. We can expand the probability around the most probable value, defined by \begin{equation} \frac{d}{dp_1} \left. \log P(D\mid p_1, H_1)\right|_{p_1=p_1^\ast} = 0. \end{equation}

A Taylor expansion gives us

\begin{equation} \log P(D\mid p_1, H_1) \approx \log P(D\mid p_1^\ast, H_1) - \frac{(p_1-p_1^\ast)^2}{2{\sigma_1^\ast}^2}, \end{equation}

such that

\begin{equation} \frac{1}{{\sigma_1^\ast}^{2}} = -\left.\frac{d^2}{dp_1^2} \log P(D\mid p_1, H_1)\right|_{p_1=p_1^\ast}. \end{equation}

Then we can approximate $P(D\mid p_1, H_1)$ as

\begin{equation} P(D\mid p_1, H_1) \approx P(D\mid p^\ast_1, H_1)\, e^{-\frac{(p_1-p_1^\ast)^2}{2{\sigma_1^{\ast}}^2}}. \end{equation}

This expansion goes by the name of Laplace’s Method.
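As a quick numerical illustration of Laplace’s method (a sketch using the binomial likelihood from the vaccine arm, $k=1$ success out of $N=30$ trials; for this skewed likelihood the approximation is good to a few percent):

```python
import numpy as np

# Likelihood f^k (1-f)^(N-k) with k = 1 success out of N = 30 trials
N, k = 30, 1
f = np.linspace(1e-6, 1.0 - 1e-6, 20001)
lik = f**k * (1 - f)**(N - k)

# Exact evidence under a uniform prior (the beta integral B(k+1, N-k+1))
exact = np.sum(lik) * (f[1] - f[0])

# Laplace approximation: peak height times sigma * sqrt(2*pi),
# with 1/sigma^2 = -(d^2/df^2) log-likelihood at the peak f* = k/N
fstar = k / N
sigma2 = 1.0 / (k / fstar**2 + (N - k) / (1 - fstar)**2)
laplace = fstar**k * (1 - fstar)**(N - k) * np.sqrt(2.0 * np.pi * sigma2)

print(exact, laplace)                 # the two agree to within a few percent
```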

Using the Laplace approximation in $P(D\mid H_1)$, we have

\begin{equation} P(D\mid H_1) \approx P(D\mid p_1^\ast, H_1)\, \sigma_1^\ast \sqrt{2\pi}\; P(p_1^\ast\mid H_1). \end{equation}

If we assume a uniform prior

\begin{equation} P(p_1\mid H_1) = \frac{1}{p_1^{max} - p_1^{min}}, \qquad p_1^{min} \leq p_1 \leq p_1^{max}, \end{equation}

such that $p_1^{min} \leq p_1^\ast \leq p_1^{max}$, and such that the bounds are wide enough that the Gaussian is not truncated, then the probability ratio between the two hypotheses results to be

\begin{equation} \frac{P(H_1\mid D)}{P(H_2\mid D)} \approx \frac{P(D\mid p_1^\ast, H_1)}{P(D\mid p_2^\ast, H_2)} \times \frac{\sigma_1^\ast \sqrt{2\pi}\,/\,(p_1^{max}-p_1^{min})}{\sigma_2^\ast \sqrt{2\pi}\,/\,(p_2^{max}-p_2^{min})}. \end{equation}
The ratio of the probability of the two hypotheses has two terms: the ratio of the best-fit likelihoods, $P(D\mid p_1^\ast, H_1)/P(D\mid p_2^\ast, H_2)$ (the data-fit term), and the ratio of the Occam factors, $\sigma^\ast\sqrt{2\pi}/(p^{max}-p^{min})$ for each hypothesis, which penalize a model whose prior parameter range is much wider than the range the data actually allow (the Bayesian Occam’s razor term).

For two nested hypotheses, where the more complex model contains the simpler one as a special case, the data-fit term always favors (or at worst ties with) the more complex model.

Then, for a model with more parameters to have a higher probability, the gain in the fit of the data (the data-fit term) has to compensate for the smaller Occam factor that comes with the extra parameter(s) (the Bayesian Occam’s razor term).

Occam’s razor: an example

Let us go back to our example in Week 02 about the wait time for bacterial mutation. In that example, we assumed there was one type of bacteria, with one single mutational time constant $\lambda$.

What if there was more than one species of bacteria with distinct mutational time constants? Can the data tell us if that is the case?

In order to do that, we want to compare two nested hypotheses:

$H_1$: there is one species of bacteria, with a single mutational time constant $\lambda$;

$H_2$: there are two species of bacteria, with time constants $\lambda_1$ and $\lambda_2$, mixed in proportion $a$ to $1-a$.

We want to calculate the ratio of the probability of the two hypotheses, which as we have seen before is given by \begin{equation} \frac{P(H_1\mid D)}{P(H_2\mid D)} = \frac{P(D\mid H_1)}{P(D\mid H_2)}, \end{equation} assuming an equal prior for the two hypotheses.

Let’s calculate the evidences $P(D\mid H_1)$ and $P(D\mid H_2)$ for the two hypotheses. If we assume we have $N$ data points that correspond to mutation times $t_1,\ldots,t_N$, then

\begin{equation} P(D\mid H_1) = \int d\lambda\; P(\lambda\mid H_1) \prod_{i=1}^{N} \frac{1}{\lambda} e^{-t_i/\lambda}, \end{equation}

\begin{equation} P(D\mid H_2) = \int d\lambda_1 \int d\lambda_2\; P(\lambda_1\mid H_2)\, P(\lambda_2\mid H_2) \prod_{i=1}^{N} \left[ a\, \frac{1}{\lambda_1} e^{-t_i/\lambda_1} + (1-a)\, \frac{1}{\lambda_2} e^{-t_i/\lambda_2} \right], \end{equation}

where $a$ is the (for now fixed) mixing proportion of the two species.

The ratio of the two evidences gives us the ratio of the two hypotheses. The integrals in the evidences can be easily calculated numerically.

For the example of Week 02, the data was \begin{equation} \{t_1,\ldots,t_6\} = \{1.2, 2.1, 3.4, 4.1, 7, 11\}. \end{equation}

Assuming uniform priors for the time constants between fixed bounds $\lambda^{min}$ and $\lambda^{max}$ (in minutes), and a fixed value of the mixing proportion $a$, produces

That is

\begin{equation} \frac{P(H_1\mid \{1.2,2.1,3.4,4.1,7,11\})}{P(H_2\mid \{1.2,2.1,3.4,4.1,7,11\})} = 1.30 = \frac{0.56}{0.44}. \end{equation}

That is, the one-time-constant model is $1.3$ times more probable than the two-constant model, thus bearing out Occam’s razor’s advice.
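A sketch of the numerical evidence calculation. The prior bounds on the time constants and the fixed mixing proportion $a = 1/2$ below are illustrative assumptions (the values used above are not reproduced here), so the exact ratio will differ from the 1.30 quoted above:

```python
import numpy as np

t = np.array([1.2, 2.1, 3.4, 4.1, 7.0, 11.0])

# Hypothetical prior bounds on the time constants (assumed for illustration)
lmin, lmax = 0.5, 20.0
lam = np.linspace(lmin, lmax, 400)
dlam = lam[1] - lam[0]
prior = 1.0 / (lmax - lmin)

# Evidence under H1: a single time constant
lik1 = np.prod((1.0 / lam) * np.exp(-t[:, None] / lam), axis=0)
E1 = np.sum(prior * lik1) * dlam

# Evidence under H2: equal mixture (a = 1/2) of two time constants
a = 0.5
L1, L2 = np.meshgrid(lam, lam, indexing="ij")
mix = (a * np.exp(-t[:, None, None] / L1) / L1
       + (1 - a) * np.exp(-t[:, None, None] / L2) / L2)
lik2 = np.prod(mix, axis=0)
E2 = np.sum(prior**2 * lik2) * dlam**2

print(E1, E2, E1 / E2)
```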

Question: what would you do if you wanted to relax the condition that the mixing parameter $a$ has a fixed value? Can you do inference to determine its value given the data?