In [1]:

```
%pylab inline
# import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

Populating the interactive namespace from numpy and matplotlib

- Go over general problem for w5 homework -- the underlying models of data we are considering, and the big picture behind the calculation of evidences for the models.

For ten different loci, we want to know if the genotype at that locus is linked to the phenotype.

The phenotype for fly $k$ is $f_k$, a song frequency measured in Hz.

At a particular locus, we also have the genotype of the $k$th fly, $g_k$.

Our null model is that the genotype does not explain the phenotype:

$$f_k = c + \epsilon, ~~\epsilon \sim N(0, \sigma)$$Here, $c$ is some constant song frequency, and $\varepsilon$ is a Gaussian noise variable with standard deviation $\sigma$.

Our QTL model:

$$f_k = a + b g_k + \epsilon, ~~\epsilon \sim N(0, \sigma)$$What does this mean?

- if fly $k$ has $g_k = 0$, then its phenotype has mean frequency $a$ with Gaussian noise with standard deviation $\sigma$ (a $N(a, \sigma)$ process),
- whereas if $g_k = 1$, then it's $N(a+b, \sigma)$

Let's visualize this for different $a, b, c$:

In [2]:

```
a = 100
b = 10
c = 100
sigma = 5
data_QTL_0 = np.random.normal(loc=a, scale=sigma, size=1000)
data_QTL_1 = np.random.normal(loc=a+b, scale=sigma, size=1000)
data_NQTL = np.random.normal(loc=c, scale=sigma, size=1000)
all_data = [data_QTL_0, data_QTL_1, data_NQTL]
bs = np.linspace(min(np.concatenate(all_data)), max(np.concatenate(all_data)), 50)
fig, ax = plt.subplots(2, 1, figsize=(9, 6), sharex=True)
# plot QTL case
ax[0].set_title('QTL model: a = {:.0f}, b = {:.0f}, $\sigma$ = {:.0f}'.format(a, b, sigma))
ax[0].hist(data_QTL_0, bins=bs,
alpha=0.5, label='$g=0: N(a, \sigma)$')
ax[0].axvline(a, ls='--', c='C0', label='a')
ax[0].hist(data_QTL_1, bins=bs,
alpha=0.5, label='$g=1: N(a+b, \sigma)$')
ax[0].axvline(a+b, ls='--', c='C1', label='a+b')
ax[0].legend(loc='upper left', bbox_to_anchor=(1.05, 1), borderaxespad=0)
# plot NQTL case
ax[1].set_title('NQTL model: c = {:.0f}, $\sigma$ = {:.0f}'.format(c, sigma))
ax[1].hist(data_NQTL, bins=bs,
alpha=0.5, color='C2', label='$N(c, \sigma)$')
ax[1].axvline(c, ls='--', c='C2', label='c')
ax[1].legend(loc='upper left', bbox_to_anchor=(1.05, 1), borderaxespad=0)
plt.show()
```

Given our data, we want to compare two hypotheses, so let's compute our old friend:

$$\begin{aligned} \frac{P(QTL \mid D)}{P(NQTL \mid D)} &= \frac{P(D \mid QTL)}{P(D \mid NQTL)} \frac{P(QTL)}{P(NQTL)} \\ &= \frac{P(D \mid QTL)}{P(D \mid NQTL)} ~\mathrm{(assuming~models~equally~ likely)} \\ \end{aligned}$$Let's go through $P(D \mid NQTL)$.

In the NQTL model, what we assume is that phenotypes are governed by

$$f_k = c + \epsilon, ~~\epsilon \sim N(0, \sigma)$$So, a frequency $f_k$ is assumed to come from a $N(c, \sigma)$ distribution.

The pdf of a Gaussian $N(\alpha, \beta)$ is:

$$f(x; \alpha, \beta) = \frac{1}{\sqrt{2\pi}\beta}e^{-\frac{(x-\alpha)^2}{2\beta^2}}$$That's exactly what we use to write the likelihood of a single datapoint:

$$P(f_k \mid c, \sigma, NQTL) = \frac{1}{\sqrt{2\pi}\sigma}e^{\frac{-(f_k - c)^2}{2\sigma^2}}$$And for the full data, with $N$ flies: $$\begin{aligned} P(D \mid c, \sigma, NQTL) &= \prod_{k=1}^N P(f_k \mid c, \sigma, NQTL) \\ &= \prod_{k=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(f_k - c)^2}{2\sigma^2}} \\ &= \frac{1}{(\sqrt{2\pi}\sigma)^N}\prod_{k=1}^N e^{-\frac{(f_k - c)^2}{2\sigma^2}} \\ &= \frac{1}{(\sqrt{2\pi}\sigma)^N} e^{\frac{-\sum_k (f_k - c)^2}{2\sigma^2}} \\ \end{aligned}$$

The $f_k$'s are fixed and given to us. We can compute the term above if we set a particular $c$ and $\sigma$. But, what $c$ should we use? We will marginalize over different $c$. For our homework, it is OK to assume a single $\sigma$ (though in principle, you could leave this as a parameter and marginalize over different $\sigma$'s too).

$$P(D \mid NQTL) = \int_{-\infty}^\infty dc~ P(D \mid c, \sigma, NQTL)P(c \mid NQTL)$$Similar to previous homeworks, we will consider uniform priors on parameters like $c$ and call $\sigma_c = c_+ -c_-$, where $c_+$ and $c_-$ are endpoints for $c$ that are arbitrary but don't cut off meaningful values.

$$P(D \mid NQTL) = \frac{1}{\sigma_c}\int_{-\infty}^\infty dc~ P(D \mid c, \sigma, NQTL)$$We need to compute the above integral in order to get the evidence for the NQTL model. What follows in the lecture notes is a simplification of that integral by doing a Taylor expansion around the peak, very similar to how we did the Laplace approximation of using Taylor expansions around a peak of a posterior to find an approximating Gaussian.

Here, we consider the log likelihood:

$$\begin{aligned} L(c) &= \log P(D \mid c, \sigma, NQTL) \\ &\approx L(c^*) + \frac{1}{2} \frac{\partial^2 L}{\partial c^2} (c - c^*)^2 \end{aligned}$$We are expanding around an optimal $c^*$, for which the first partial derivative of $L$ in $c$ is zero. Hence, there are no first partial derivatives in the above Taylor expansion. We got this $c^*$ by the following:

$$\frac{\partial L}{\partial c} = \frac{1}{\sigma^2} \sum_k (f_k - c) = 0 \implies c^* = \frac{1}{N} \sum_k f_k = \bar{f}$$There is another key result from lecture notes, $\frac{\partial^2 L}{\partial c^2} = -\frac{N}{\sigma^2}$, which we will now plug in to the above expression:

$$\begin{aligned} L(c) &= \log P(D \mid c, \sigma, NQTL) \\ &\approx L(c^*) + \frac{1}{2} \frac{\partial^2 L}{\partial c^2} (c - c^*)^2\\ &\approx L(c^*) - \frac{N}{2\sigma^2} (c - c^*)^2 \end{aligned}$$Going back to non-logs:

$$\begin{aligned} P(D \mid c, \sigma, NQTL) &= \exp\big(\log P(D \mid c, \sigma, NQTL)\big)\\ &\approx \exp\big( L(c^*) - \frac{N}{2\sigma^2} (c - c^*)^2 \big)\\ &\approx e^{L(c^*)}e^{-\frac{N}{2\sigma^2} (c - c^*)^2}\\ &\approx P(D \mid c^*, \sigma, NQTL)e^{-\frac{N}{2\sigma^2} (c - c^*)^2}\\ \end{aligned}$$Plug this into our integral:

$$\begin{aligned} P(D \mid NQTL) &= \frac{1}{\sigma_c}\int_{-\infty}^\infty dc~ P(D \mid c, \sigma, NQTL) \\ &\approx \frac{1}{\sigma_c}\int_{-\infty}^\infty dc~ P(D \mid c^*, \sigma, NQTL)e^{-\frac{N}{2\sigma^2} (c - c^*)^2}\\ &\approx \frac{1}{\sigma_c} P(D \mid c^*, \sigma, NQTL) \int_{-\infty}^\infty dc~ e^{-\frac{N}{2\sigma^2} (c - c^*)^2}\\ \end{aligned}$$That integral is simple to integrate (it is a form of a Gaussian integral), so we arrive at a simple expression:

$$P(D\mid NQTL) \approx P(D\mid c^\ast, \sigma, NQTL) \times \sqrt{\frac{2\pi\sigma^2}{N}}\frac{1}{\sigma_c}$$Here, $ P(D \mid c^*, \sigma, NQTL)$ is simply the likelihood of the data, given a particular $c=c^* = \frac{1}{N} \sum_k f_k$ (which makes sense, right? In the null model, we assume that genotype has no effect on phenotype, and that each observed phenotype is generated by a gaussian with mean frequency $c$... so if we wanted to estimate $c$, the best we can do is take the mean of the frequencies $f$).

Let's sum up. So what did we do here? We

- assumed that we had a known $\sigma$ (something you can set by looking at the data -- what would be a reasonable choice? You can set this as a variable in your code and see how your results change for different $\sigma$'s if you're interested),
- used a Taylor expansion around the optimal $c$ to approximate $P(D \mid c, \sigma, NQTL)$
- doing that approximation gave us a new integral with an easy analytical answer, giving us a one-line calculation for the integral we wanted to compute in the first place (we don't need to do integration ourselves! just need to evaluate the above expression)

For the QTL model, the lecture notes go through a similar process. But there, we have two parameters, $a$ and $b$, and we incorporate genotype information, $g$.

We will do the same tricks though!

$$\begin{aligned} P(D\mid QTL) &= \int_{-\infty}^{\infty} da \int_{-\infty}^{\infty} db ~P(D\mid a, b, \sigma, QTL) P(a,b\mid QTL) \\ &= \frac{1}{\sigma_a\sigma_b}\int_{-\infty}^{\infty} da \int_{-\infty}^{\infty} db ~P(D\mid a, b, \sigma, QTL) \\ &\approx \frac{1}{\sigma_a\sigma_b} P(D\mid a^*, b^*, \sigma, QTL)\int_{-\infty}^{\infty} da \int_{-\infty}^{\infty} db ~e^{-\frac{1}{2}(a-a^*, b-b^*)Q{a-a^\ast\choose b-b^\ast}} \\ &\approx \frac{1}{\sigma_a\sigma_b} P(D\mid a^*, b^*, \sigma, QTL)\frac{2\pi}{\sqrt{\det Q}} \\ \end{aligned}$$Above, I introduce quantities like the matrix $Q$, and the optimal $a^*, b^*$ that are defined in lecture notes. The key point is that the we again arrived at a closed form solution of the integral using terms that can be quickly computed using the genotypes $g$ and frequencies $f$ for the data that we're interested in.

Finally, we arrive at our ratio of probabilities:

$$ \frac{P(QTL\mid D)}{P(NQTL\mid D)} = \frac{P(D\mid a^\ast, b^\ast, \sigma, QTL)}{P(D\mid c^\ast, \sigma, NQTL)} \times \sqrt{\frac{2\pi/N}{\overline{gg} - \bar{g} \bar{g}}} \times \sigma \times \frac{\sigma_c}{\sigma_a\sigma_b} $$So really this is the term you need to compute for your homework, after all the beautiful simplification shown in full in the lecture notes. Terms like $\overline{gg}$ are defined there. If this term is much greater than 1, then the QTL model is more plausible. If it's close to 0, NQTL is more plausible.

In [ ]:

```
```