Probability and inference of parameters¶
Bayes' Theorem¶
Thomas Bayes used some properties of conditional probabilities to produce the elegant: $$P(A|B) = \frac{P(B|A) ⋅P(A)}{P(B)}$$
We can rewrite this in human language as: $$ Posterior = \frac{likelihood ⋅ prior}{evidence}$$
But what the heck does this mean, and why do we care? Bayesian inference enables our inference of parameters given the data we actually abserve. Such inference allows us to rigorously explore the uncertainty that's inherent to scientific research. By integrating over all possible values that a parameter may be, we can get a best-guess of the parameters controlling our system (maximum posterior probability).
What this also means is we can update our maximum posterior probability as we collect more data. As a wet lab biologist this makes me excited - the more data I collect, the better I can infer the parameters governing my system.
Ok, let's get into some applications of Bayes' Theorem that will prepare you for your homework.
What happened in class this week: Part 1¶
Bacterial mutation wait time if we hypothesize an exponential decay¶
Let's return to the class example of bacterial mutation times. In Luria and Delbruck's work, they found that Lamarckian biology is, in fact, not a thing, and that random mutations accrue independent of environmental "stimuli".
Let's say we want to know the average time it takes for a bacterium to mutate. We can assume that the probability of acquiring a virus-insensitive mutation follows an exponential decay.
$$P(mutation) = e^\frac{-t}{\lambda}$$
And, in an experiment, we can see the accumulation of mutations via a signal in a flourescent channel.(If you're interested, this paper does exactly that - except it measures the basal rate of mutations due to replication errors).
So, the data we'll collect is times that signal occurs in our flourescent channel. $$Data = {t_1, t_2,t_3,...t_i}$$
If we want to estimate the best parameter of λ, we need to look to Bayes'... Given our specific context: $$P(λ|Data) = \frac{P(Data | λ) ⋅ P(λ)}{P(Data)}$$
$$P(λ|data) ∝ \sum_{i=1}^{N} \frac{e^{\left(\frac{1}{\lambda}\sum_{i} t_i\right)}}{Z^N(λ)} \cdot P(λ)$$
When you know nothing about your possible λ values, the prior disappears. Why?
If we assume the integral from 0 to infinity of P(lambda | data) = 1, the evidence can also be ignored. Why?
These simplifications leave us with: $$P(λ|data) ∝ Likelihood$$
$$∝ \sum_{i=1}^{N} e^{\left(\frac{1}{\lambda}\sum_{i} t_i\right)}$$
What happened in class this week: Part 2¶
Bayesian inference allows you to report the confidence you have about your posterior by calculating the Laplacian approximation¶
Goal: Use a guassian centered around the maximum of your posterior distribution to give an estimation of your confidence intervals (σ)
Method: Taylor expand around the log of your posterior. Generally you only need to do two terms because the following terms get smaller and smaller.
Mean = $μ = f^\star$ = maximum of your posterior (also called the maximum a posteriori)
Variance = $σ^2 = \frac{-1}{\frac{d^2L}{df^L}\bigg\rvert_{f^⋆}}$. Meaning that the inverse of the second derivative of the log of the posterior, evaluated at the posterior maximum, tells us about the width of the peak near that optimum (the standard deviation 𝜎 of a Gaussian curve approximating that peak).¶
All together, the steps you'd take to synthesize this week:¶
- Generate a probability distribution to estimate the best parameter, given your data:
Define your posterior probability 2. Identify the parameter with the largest probability:
Analytically, this is the point where the first derivative of your posterior = 0. In Python, you can use a function like np.argmax()
to calculate the maxmimum (the peak of your posterior probability).
3. Estimate your confidence in this parameter:
To get your standard deviation around this peak, calculate the second derivative of the log of your posterior probability, evaluated at the peak.
Let's do Bayesian inference of the Poisson rate parameter¶
Here we're solving Exercise 27.1 from MacKay's Information Theory, Inference and Learning Algorithms textbook (available as pdf here).
A photon counter is pointed at a remote star for one minute, in order to infer the rate of photons arriving at the counter per minute, $\lambda$. Assuming the number of photons collected $r$ has a Poisson distribution with mean $\lambda$, $$ P ( r \mid \lambda) = exp(-\lambda) \frac{\lambda^r}{r!}$$ and assuming the improper prior $P(\lambda) = \frac{1}{\lambda}$, make Laplace aproximations to the posterior distribution over $\lambda$.
On our homework, we'll do a similar thing but for a Binomial distribution, not Poisson.
Quick aside: Poisson pmf (probability mass function).¶
Poisson random variables take on discrete values. Given that events of interest occur at some rate $\lambda$, what is the probability you observe $r=1$ event, $r=2$ events, etc.? These questions are what's answered by the Poisson pmf (probability mass function):
$$ P(r \mid \lambda) = e^{-\lambda} \frac{\lambda^r}{r!} $$
We're going to use the Poisson pmf as our likelihood.
Okay, let's follow the steps outlined above to estimate the best λ given some data:¶
1. Generate a probability distribution to estimate the best parameter, given your data:¶
Define your posterior probability:
We want to use the data, an observed number of photons $r$, to infer our rate $\lambda$:
$$ P (\lambda \mid \mbox{data}) = \frac{P(\mbox{data} \mid \lambda) P (\lambda)}{P(\mbox{data})} \implies P (\lambda \mid r) = \frac{P(r \mid \lambda) P (\lambda)}{P(r)} $$
To compute the posterior, let's go one-by-one on each term:
1a) likelihood¶
$$ P ( r \mid \lambda) = \exp(-\lambda) \frac{\lambda^r}{r!}$$
This is just the Poisson pmf as discussed above.
1b) prior¶
$$ P(\lambda) = \frac{1}{\lambda} $$
Wait. This looks a little weird. $\int_0^\infty \frac{1}{\lambda} d\lambda$ doesn't equal 1 -- it doesn't even converge! As alluded to by the problem writeup, this is an example of an improper prior, which doesn't obey the proper normalization rule (of integrating to 1). If you're interested in why we use the improper prior here, scroll to the end of this notebook for MacKay's explanation.
1c) evidence¶
$$\begin{aligned} P(r) &= \int_0^\infty P(r \mid \lambda) P(\lambda) d\lambda \\ &= \int_0^\infty e^{-\lambda}\frac{\lambda^r}{r!} \frac{1}{\lambda} d\lambda \\ &= \int_0^\infty e^{-\lambda}\frac{\lambda^{r-1}}{r!} d\lambda \end{aligned}$$
It turns out that this whole term can get significantly simplified via the gamma function:
$$\Gamma(z) = \int_0^\infty e^{-x} x^{z-1}dx $$ (Elena mentioned that the gamma integral can be used in the Bernouilli, vaccine efficacy example)
An especially nice property of the gamma function is that for an integer $n$, $\Gamma(n) = (n-1)!$. Let's simplify our expression:
$$\begin{aligned} P(r) &= \int_0^\infty e^{-\lambda}\frac{\lambda^{r-1}}{r!} d\lambda \\ &= \frac{\Gamma(r)}{r!} \\ &= \frac{(r-1)!}{r!} \\ &= \frac{1}{r} \\ \end{aligned} $$
1d) posterior¶
Combining all our terms to get an expression for the posterior...
$$\begin{aligned} P (\lambda \mid r) &= \frac{P(r \mid \lambda) P (\lambda)}{P(r)} \\ &= \frac{e^{-\lambda} \frac{\lambda^r}{r!} \frac{1}{\lambda}}{1/r} \\ &= e^{-\lambda} \frac{\lambda^{r-1}}{(r-1)!} \\ \end{aligned}$$
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
r_array = np.arange(0, 50)
lambdas = [1, 2, 8, 16]
plt.figure(figsize=(10,5))
for lam in lambdas:
plt.plot(r_array, stats.poisson.pmf(r_array, mu=lam), 'o-', label=r'$\lambda = $'+f'{lam}')
plt.xlabel(r'$r$')
plt.ylabel(r'$P(r \mid \lambda)$')
plt.title('Poisson pmf at various values of' r' $\lambda$')
plt.legend()
plt.show()
#instead of using scipy's poisson pmf function, we can define our own
def poisson_pmf(r, lamb):
return np.exp(-lamb) * lamb**r / np.math.factorial(r)
#observed photon count
r = 10
# make dense array of lambdas
lam_array = np.linspace(0.01, r*3, 1000)
# set up terms of posterior
lik_array = poisson_pmf(r=r, lamb=lam_array)
pri_array = 1/lam_array
post_array = lik_array * pri_array
#plot the posterio
plt.title('Posterior probability of the poisson rate parameter')
plt.plot(lam_array, post_array, c='C2', label='posterior')
plt.legend()
<matplotlib.legend.Legend at 0x7fab5900f610>
2. and 3. Laplace approximation around the maximum of the posterior¶
We can write the analytical form of the inverse of the second derivative of the log of our posterior, then we'll compute that number at the mode of our posterior.
It's often convenient to work with the logarithm of a posterior because it becomes easier to differentiate various terms.
Let $L(\lambda) = \log P(\lambda | r)$ = log of our posterior.
Let's say we found a peak of the posterior, a $\lambda^*$ satisfying
$$\frac{d L}{d\lambda} \bigg\rvert_{\lambda=\lambda^*} = 0$$
We are going to do a Taylor expansion around this peak:
$$\begin{aligned} L(\lambda) &\approx L(\lambda^*) + \frac{d L}{d\lambda} \bigg\rvert_{\lambda=\lambda^*} (\lambda - \lambda^*) + \frac{1}{2} \frac{d^2L}{d\lambda^2} \bigg\rvert_{\lambda=\lambda^*} (\lambda - \lambda^*)^2 + \dots \\ &\approx L(\lambda^*) + \frac{1}{2} \frac{d^2L}{d\lambda^2} \bigg\rvert_{\lambda=\lambda^*} (\lambda - \lambda^*)^2 + \dots \\ \end{aligned}$$
Going back to the non-logged version:
$$P(\lambda \mid r) = e^{L(\lambda)} \propto e^{\frac{1}{2} \frac{d^2L}{d\lambda^2} \bigg\rvert_{\lambda=\lambda^*} (\lambda - \lambda^*)^2} $$
We've got an expression with $e$ to something squared... this resembles the equation of a Gaussian. It is easy to construct confidence intervals with Gaussians, so it would be convenient to use a Gaussian to approximate the posterior near this optimum. The pdf of a Gaussian variable with mean $\mu$ and standard deviation $\sigma$ is
$$P(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{\frac{-(x-\mu)^2}{2\sigma^2}}$$
To match the expression in the exponent, we thus have
$$\sigma^2 = -\bigg( \frac{\partial^2 L}{\partial \lambda^2} \bigg\rvert_{\lambda=\lambda^*} \bigg)^{-1} $$
In words, the inverse of the second derivative of the log of the posterior, evaluated at the optimal $\lambda^*$, tells us about the width of the peak near that optimum (the standard deviation $\sigma$ of a Gaussian curve approximating that peak).
So let's do this with our example, with $P(\lambda \mid r) = e^{-\lambda} \frac{\lambda^{r-1}}{(r-1)!}$.
First let's get our optimal $\lambda$ under the posterior. We'll work with the log posterior:
$$\log P(\lambda \mid r) = -\lambda + (r-1)\log \lambda - \log\big((r-1)!\big)$$
Take the first derivative:
$$\frac{\partial \log P(\lambda \mid r)}{\partial \lambda} = -1 + \frac{r-1}{\lambda} = 0 \implies \lambda^* = r - 1$$
Now, take second derivative of the log posterior:
$$\frac{\partial^2 \log P(\lambda \mid r)}{\partial \lambda^2} = -\frac{r-1}{\lambda^2}$$
Evaluating at $\lambda^* = r-1$:
$$\frac{\partial^2 \log P(\lambda \mid r)}{\partial \lambda^2} \bigg\rvert_{\lambda=\lambda^*} = -\frac{r-1}{(r-1)^2} = -\frac{1}{r-1}$$
So,
$$\begin{aligned} \sigma^2 &= -\bigg( \frac{\partial^2 \log P(\lambda \mid r)}{\partial \lambda^2} \bigg\rvert_{\lambda=\lambda^*} \bigg)^{-1} \\ &= r-1\\ \end{aligned}$$
The standard deviation of the Gaussian approximating the posterior at its peak, $\sigma$, is $\sqrt{r-1}$.
Now let's visualize this and see how well this approximation works with our computed posterior.
# define gaussian pdf
def gaussian_pdf(x, mu, sigma):
# equivalently, can use scipy and do stats.norm.pdf(x, loc=mu, scale=sigma)
return 1/np.sqrt(2*np.pi*sigma**2) * np.exp(-(x - mu)**2/2/(sigma**2))
# use results above to get gaussian approximation to posterior near optimum
gauss_array = gaussian_pdf(lam_array, mu=r-1, sigma=np.sqrt(r-1))
# normalize for plotting purposes
lik_array /= np.max(lik_array)
pri_array /= np.max(pri_array)
post_array /= np.max(post_array)
gauss_array /= np.max(gauss_array)
print("The value of lambda with the maximal posterior is " + str(lam_array[np.argmax(post_array)]))
fig, axs = plt.subplots(1, 1, figsize=(8,4), sharex=True)
plt.suptitle(r'$r = $'+f'{r}')
axs.set_title('Posterior and Laplace approximation show same peak')
axs.plot(lam_array, post_array, c='C2', label='posterior')
axs.plot(lam_array, gauss_array, c='gray', label='laplace approximation')
axs.axvline(r-1, c='k', ls='--', label=r'$r - 1$')
axs.set_xlabel(r'$\lambda$')
axs.legend(loc="upper right")
# for ax in axs:
# ax.set_yticks([])
# ax.legend(loc='upper left', bbox_to_anchor=(1.025, 1), borderaxespad=0)
plt.show()
The value of lambda with the maximal posterior is 8.985985985985986
Some info on Poisson processes¶
Requirements:¶
- Events occur (in/dependently) ?
- Events can occur (a set/any) number of times in a given interval of time ?
Examples of the Poisson:¶
- Bacterial mutations
- Radioactive particle discharge
- Arrivals at a restaurant from 5-6pm
Calculations we can make:¶
Time to first event (exponential)$$\text{mean} = \lambda, \quad P(t) = \frac{1}{\lambda} \cdot e^{\frac{t}{\lambda}}$$
Time to nth event (gamma) $$P(t | \lambda, n) = \frac{t^{n-1}}{\lambda^n} \cdot e^{-\frac{t}{\lambda}} \cdot \frac{1}{\Gamma(n)}$$
where $\Gamma(n) = (n-1)! $
Number of events in a given time interval $$P(n)=\frac{(t/λ)^n}{n!} ⋅e^{−{\frac{t}{\lambda}}}$$
Properties of Poisson random variables¶
Mean:
$$\begin{aligned} \mathbb{E}(r) &= \sum_{r=0}^\infty r P(r \mid \lambda) \\ &= \sum_{r=0}^\infty r e^{-\lambda} \frac{\lambda^r}{r!} \\ &= \sum_{r=1}^\infty r e^{-\lambda} \frac{\lambda^r}{r!} \\ &= e^{-\lambda} \sum_{r=1}^\infty \frac{\lambda^r}{(r-1)!} \\ &= \lambda e^{-\lambda} \sum_{r=1}^\infty \frac{\lambda^{r-1}}{(r-1)!} \\ &= \lambda e^{-\lambda} \sum_{r=0}^\infty \frac{\lambda^{r}}{r!} \\ &= \lambda e^{-\lambda} e^\lambda \\ &= \lambda \\ \end{aligned}$$
Variance:
$$Var(r) = \lambda$$
The proof is left for you to do :D.
You can use the following fact and follow a similar procedure above:
$$Var(r) = \mathbb{E}(r^2) - \big[ \mathbb{E}(r) \big]^2 $$
Another useful term:
$$P(0 \mid \lambda) = e^{-\lambda} \frac{\lambda^0}{0!} = e^{-\lambda}$$
MacKay's thoughts on the improper prior¶
Here's MacKay in Chapter 24, page 320:
The [$1/\lambda$] prior is less-strange-looking if we examine the resulting density over [$\ln \lambda$]... which is flat. This is the prior that expresses ignorance about [$\lambda$] by saying, well, it could be 10, or it could be 1, or it could be 0.1, ... Using improper priors is viewed as distasteful in some circles, so let me excuse myself by saying it’s for the sake of readability; if I included proper priors, the calculations could still be done but the key points would be obscured by the flood of extra parameters
So there is some sense to use such a prior if you have complete ignorance about the scale of $\lambda$ -- this prior is uniform over the logarithm of $\lambda$.