MCB111: Mathematics in Biology (Spring 2018)
 What is information content, and how to measure it
 Information, Redundancy and Compression
 Entropy: the average information content of an ensemble
 Relative entropy: the Kullback-Leibler divergence
 The principle of maximizing entropy. Or trusting Shannon to be fair
 Practical case 1: The length of a unique sequence in a genome
 Practical case 2: Bacteria adapting to a changing environment
week 01:
Introduction to Information Theory
For this topic, you will probably find MacKay’s book (Sections 2.4, 2.5, 2.6) relevant, as well as MacKay’s lectures 1 and 2 from his online “Course on Information Theory, Pattern Recognition, and Neural Networks”, from which I have borrowed a lot of the language.
What is information content, and how to measure it
The term information content was introduced by Claude Shannon (1916-2001) in 1948 to solve communication problems. The problem was how to obtain “reliable communication over an unreliable channel”.
Examples relevant to you can be:
 voice → ear (channel = air)
 DNA → replicated DNA (channel = DNA polymerase)
 biological system perturbation → measurement (channel = your experimental design)
The received signal is never identical to the sent signal. How can we make it so the received message is as similar as possible to the sent message?
For the case of your biological experiment, you could either
 Improve the experiment
 Add redundancy (Hamming codes).
Properties of information content
Information content is inversely related to how frequent an event is
Consider these examples. We barely know each other, I run into you and I say:
 I had breakfast this morning,
 Tomorrow is my birthday,
 I ran the first (2016) Cambridge half-marathon,
 I ran the first and second Cambridge half-marathons,
 I have been to Antarctica,
 I have been to Antarctica in the winter (June).
Which of these events do you think carries the most information? (Disclaimer: not all events directly apply to me.)
The information conveyed by one event is ‘somehow’ inversely related to how frequent that event is (we will specify the ‘how’ later). And oftentimes, how frequent an event is depends on the number of possible outcomes. Such that,

The rarer an event is, the lower its probability, and the more information you could transmit by running the experiment (listening at the end of the channel).

Equivalently, the more ignorant you are about the outcome (because the number of possibilities is very large), the smaller the probability of each individual outcome, and the more information you could get from listening at the end of the channel.
Thus, for an event e that has probability p, we expect that
Information_Content(e) grows monotonically with 1/p.
Information content is additive
A second desired property of information is that, for two independent events e_1 and e_2, we expect the total information to be
\begin{equation} \mbox{Information_Content} (e_1 \& e_2) = \mbox{Information_Content} (e_1) + \mbox{Information_Content} (e_2). \end{equation}
Tada!
Using these two desirable properties for information content, we can derive its mathematical form.
The logarithm function has the convenient property:
\begin{equation} \log(ab) = \log(a) + \log(b). \end{equation}
Shannon postulated that, for an event e that has probability p, the right way of measuring its information is,
\begin{equation} \mbox{Information Content}(e) = h(e) = \log_2 \frac{1}{p} \quad \mbox{in bits.} \end{equation}
Figure 1. Shannon information content as a function of the probability value. Examples: h(p=0.1) = 3.32 bits, h(p=0.9) = 0.15 bits.
In Figure 1, we see how information content varies as a function of the probability. Going back to my examples,
Event | Probability | Information content (bits)
I had breakfast today | 1.0 | 0.0
Tomorrow is my birthday | 1/365 | 8.5
Ran first Cambridge half-marathon | 4,500/4,700,000 | 10.0
Ran first & second Cambridge half-marathons | (4,500/4,700,000) x (6,500/4,700,000) | 19.5
I’ve been to Antarctica | 925,000/7,400,000,000 | 13.0
I’ve lived in Antarctica in winter (June) | 25,000/7,400,000,000 | 18.2
I have considered the greater Boston population (4.7 million).
I have assumed that the two events (running the first and the second half-marathon) are independent, but in fact they are not.
Tourists to Antarctica in the 2009-2010 season amounted to 37,000. I have considered 25 years of visits (since the 1990s).
About 1,000 people live in Antarctica in winter. I have considered 25 years of Antarctica colonization.
Do these results agree with your intuition about how much you learned about me by knowing those facts?
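These information-content values are easy to reproduce. A minimal sketch in Python, using the population figures assumed above:

```python
from math import log2

def info_content(p):
    """Shannon information content h = log2(1/p) of an event with probability p, in bits."""
    return log2(1 / p)

# two rows of the table above
print(round(info_content(1 / 365), 1))                  # birthday
print(round(info_content(925_000 / 7_400_000_000), 1))  # been to Antarctica
```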
Let’s see how additivity works. For two independent events e_1 and e_2,
\begin{equation} P(e_1,e_2) = P(e_1)P(e_2), \end{equation}
then
\begin{eqnarray}
h(e_1 \& e_2)
&=& \log_2 \frac{1}{P(e_1,e_2)} \\
&=& \log_2 \frac{1}{P(e_1)P(e_2)} \\
&=& \log_2 \frac{1}{P(e_1)} + \log_2 \frac{1}{P(e_2)} \\
&=& h(e_1) + h(e_2).
\end{eqnarray}
Information, Redundancy and Compression
The information content is inversely related to redundancy. The less information an event carries, the more it can be compressed.
Shannon postulated that, for an event of probability p, its information content log_2(1/p) is the number of bits to which we should aspire to compress its description.
Entropy: the average information content of an ensemble
The entropy of a random variable X is the average information content over the ensemble of all possible outcomes x of X,
\begin{equation} \mbox{Entropy}(X) = H(X) = \sum_x p(x) \log \frac{1}{p(x)}, \end{equation}
with the convention that if p(x) = 0, then p(x) \log\frac{1}{p(x)} = 0.
Notice that the entropy is never negative, \begin{equation} H(X) \geq 0, \end{equation}
with
\begin{equation} H(X) = 0\quad\mbox{ if and only if}\quad p(x)=1 \quad\mbox{for one}\quad x. \end{equation}
A desirable property of H that Shannon describes in his classic paper is that if a choice is broken down into two successive choices, the original H should be the weighted sum of the individual entropies. Graphically, using Shannon’s own figure, a choice among three outcomes with probabilities (1/2, 1/3, 1/6) can be split into a first choice between two options with probabilities (1/2, 1/2), followed half of the time by a second choice with probabilities (2/3, 1/3).
We have the equality,
\begin{equation} H\left(\frac{1}{2},\frac{1}{3},\frac{1}{6}\right) = H\left(\frac{1}{2},\frac{1}{2}\right) + \frac{1}{2}\, H\left(\frac{2}{3},\frac{1}{3}\right). \end{equation}
The coefficient 1/2 weights the fact that the second choice only happens half of the time.
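This decomposition is easy to check numerically. A short Python sketch of Shannon's example, where a choice among probabilities (1/2, 1/3, 1/6) is broken into a (1/2, 1/2) choice followed half of the time by a (2/3, 1/3) choice:

```python
from math import log2

def H(*ps):
    """Entropy in bits of a discrete distribution, skipping zero-probability terms."""
    return sum(p * log2(1 / p) for p in ps if p > 0)

direct = H(1/2, 1/3, 1/6)                    # one choice among three outcomes
staged = H(1/2, 1/2) + (1/2) * H(2/3, 1/3)   # two successive choices, weighted
assert abs(direct - staged) < 1e-12
```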
Relative entropy: the Kullback-Leibler divergence
For two probability distributions P and Q defined over the same space of outcomes, their relative entropy or Kullback-Leibler (KL) divergence is defined by
\begin{equation} D(P\|Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}. \end{equation}
The relative entropy has the following properties

It satisfies Gibbs’s inequality,
\begin{equation} D(P\|Q) \geq 0. \end{equation}
It is zero only if P = Q.

It is not symmetric. In general,
\begin{equation} D(P\|Q) \neq D(Q\|P). \end{equation}
Thus, it is not a distance, although oftentimes it is called the KL distance.

For continuous distributions it takes the form
\begin{equation} D(P\|Q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx, \end{equation}
where p(x) and q(x) are the densities of P and Q.
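A minimal Python sketch of these properties for discrete distributions (the two example distributions are arbitrary):

```python
from math import log2

def kl(P, Q):
    """Relative entropy D(P||Q) in bits between two discrete distributions."""
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]
assert kl(P, P) == 0            # zero only when the distributions are equal
assert kl(P, Q) > 0             # Gibbs' inequality
assert kl(P, Q) != kl(Q, P)     # not symmetric, hence not a distance
```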
The principle of maximizing entropy. Or trusting Shannon to be fair
Imagine you have 8 pieces of cake that you want to distribute amongst 3 kids, without cutting the pieces further. How would you do that? Unless you are Gru (Despicable Me), you would like to be as fair as possible to all three kids.
Here are some options. Would you give?
 4 4 0
 4 2 2
 3 3 2
The entropy of these three different scenarios is
 H(4,4,0) = 1.0 bits
 H(4,2,2) = 1.5 bits
 H(3,3,2) = 1.56 bits
Do these entropy values correlate with your gut feel about fairness?
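You can check your gut feeling directly. A small Python sketch that computes the entropy of each split of the 8 pieces:

```python
from math import log2

def split_entropy(counts):
    """Entropy in bits of the distribution obtained by normalizing the counts."""
    total = sum(counts)
    return sum(c / total * log2(total / c) for c in counts if c > 0)

for split in [(4, 4, 0), (4, 2, 2), (3, 3, 2)]:
    print(split, round(split_entropy(split), 2))
```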
In the same way, when you are working on an experiment for which you expect different possible outcomes, but you have no idea which ones are more favorable, being fair to all of them (that is, giving preference to none) means you should also use the maximum entropy principle.
The maximum entropy distribution is the one that favors any of the possible outcomes the least. Thus, the maximum entropy distribution is the one that should be used in the absence of any other information.
When you do not know anything
Still on the issue of being fair to the 3 kids and the 8 pieces of cake, now consider this option in which you give 2 pieces to each kid and leave 2 pieces out:
\begin{equation} H(2,2,2) = \frac{2}{6}\log_2\frac{1}{2/6} + \frac{2}{6}\log_2\frac{1}{2/6} + \frac{2}{6}\log_2\frac{1}{2/6} = \log_2(3) = 1.59 \, \mbox{bits}. \end{equation}
What do you think about this result? This entropy is even higher than the (3,3,2) case that seemed pretty fair. How come?
You can formally ask, what is the probability distribution that maximizes the entropy of the system? This is an optimization problem.
For a discrete distribution with probabilities p_j for 1 ≤ j ≤ N, we want to optimize the entropy,
\begin{equation} \sum_j p_j \log\frac{1}{p_j} \end{equation}
which we do by calculating the point(s) of zero derivative. Because the p_j’s are not arbitrary but probabilities that sum to one (∑_j p_j = 1), we need to incorporate that constraint into the optimization function using a Lagrange multiplier (λ), and the function to optimize is
\begin{equation} L = \sum_j p_j \log\frac{1}{p_j} - \lambda \left(\sum_j p_j - 1\right). \end{equation}
The derivative with respect to one such probability p_i is
\begin{equation} \frac{d}{d p_i} L = \log\frac{1}{p_i} - p_i\, \frac{1}{p_i} - \lambda, \end{equation}
which becomes optimal at the value p*_i for which the derivative is zero, then
\begin{equation} \log\frac{1}{p^{\ast}_i} - 1 - \lambda = 0, \end{equation}
or
\begin{equation} p^{\ast}_i = e^{-(1+\lambda)}. \end{equation}
Because the p*_i sum to one, that implies N e^{-(1+λ)} = 1, or e^{-(1+λ)} = 1/N.
Thus, the probability that maximizes the entropy is the uniform distribution \begin{equation} p^{\ast}_i = \frac{1}{N}, \quad \mbox{for}\quad 1\leq i\leq N. \end{equation}
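A quick numerical sanity check of this result: any perturbation of the uniform distribution that still sums to one lowers the entropy. A sketch in Python:

```python
from math import log2

def H(ps):
    """Entropy in bits of a discrete distribution."""
    return sum(p * log2(1 / p) for p in ps if p > 0)

N = 4
uniform = [1 / N] * N
perturbed = [1 / N + 0.05, 1 / N - 0.05, 1 / N, 1 / N]  # still sums to 1
assert abs(H(uniform) - log2(N)) < 1e-12
assert H(perturbed) < H(uniform)
```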
Notice that in the case of the three kids each with 2 pieces of cake, we used a uniform distribution.
We could have done the same for a continuous distribution in a range [a,b]; then the maximum entropy uniform distribution looks like \begin{equation} p(x) = \frac{1}{b-a}, \quad \mbox{for}\quad x\in [a,b], \end{equation} such that H = \log_2 (b-a).
When you know the mean
But sometimes we have some information; for instance, we may know (or have an estimate of) the mean of the distribution (μ). That adds one more constraint to the optimization problem, and one more Lagrange multiplier λ_μ. Given
\begin{equation} L = \int_y p(y) \log\frac{1}{p(y)}\, dy - \lambda \left(\int_y p(y)\, dy - 1\right) - \lambda_{\mu} \left(\int_y y\, p(y)\, dy - \mu\right), \end{equation}
we want to calculate the p*(x) that optimizes L, \begin{equation} \left.\frac{d L}{d p(x)}\right|_{p(x)=p^\ast(x)} = 0. \end{equation}
That is, \begin{equation} \log\frac{1}{p^{\ast}(x)} - 1 - \lambda - \lambda_{\mu}\, x = 0, \end{equation} or \begin{equation} p^{\ast}(x) = e^{-(1+\lambda + x\lambda_{\mu})}. \end{equation}
We can eliminate λ and λ_μ using the two constraints again.
Using the normalization condition, we obtain, \begin{equation} e^{-(1+\lambda)} = \frac{1}{\int_x e^{-x\lambda_{\mu}}\, dx}, \end{equation} then the maximum-entropy distribution \begin{equation} p^{\ast}(x) = \frac{e^{-x\lambda_{\mu}}}{\int_y e^{-y\lambda_{\mu}}\, dy}, \end{equation} that is, the exponential distribution.
The value of λ_μ can be determined using the condition \begin{equation} \mu \equiv \int_x x\, p^{\ast}(x)\, dx = \frac{\int_x x\, e^{-x\lambda_{\mu}}\, dx}{\int_y e^{-y\lambda_{\mu}}\, dy}. \end{equation}
So far, we have said nothing about the range of x. This equation has a solution only if all x values are positive.
Now, we remember that \begin{equation} \int_0^{\infty} e^{-\lambda_{\mu} x}\, dx = \frac{1}{\lambda_{\mu}} \end{equation}
and
\begin{equation} \int_0^{\infty} x\, e^{-\lambda_{\mu} x}\, dx = \frac{1}{\lambda_{\mu}^2}. \end{equation}
(You can find these results in Spiegel’s manual).
The solution is \begin{equation} \mu = \frac{1/\lambda_{\mu}^2}{1/\lambda_{\mu}} = \frac{1}{\lambda_{\mu}}. \end{equation}
Thus, the exponential distribution \begin{equation} p(x) = \frac{1}{\mu}\, e^{-x/\mu}, \quad x\geq 0, \end{equation} is the maximum entropy distribution amongst all distributions in the range [0, ∞) that have mean μ.
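As a sanity check, we can compare differential entropies (in nats) of two distributions on [0, ∞) with the same mean μ, using their standard closed forms: 1 + ln μ for the exponential, and ln(2μ) for the uniform on [0, 2μ], which also has mean μ. The exponential wins, as the derivation predicts:

```python
from math import log

mu = 1.0
h_exponential = 1 + log(mu)   # differential entropy of the exponential with mean mu (nats)
h_uniform = log(2 * mu)       # uniform on [0, 2*mu], same mean mu
assert h_exponential > h_uniform
```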
When you know the standard deviation
If what we know is the standard deviation of the distribution (σ), then there is a different condition (and its corresponding Lagrange multiplier λ_σ) in the optimization method.
For
\begin{equation} L = \int_y p(y) \log\frac{1}{p(y)}\, dy - \lambda \left(\int_y p(y)\, dy - 1\right) - \lambda_{\sigma} \left(\int_y (y-\mu)^2\, p(y)\, dy - \sigma^2\right), \end{equation}
where μ = ∫_y y p(y) dy is the mean of the distribution, but it is not a fixed value. We are only imposing a known standard deviation σ.
In this case, optimization implies,
\begin{equation} \left.\frac{d L}{d p(x)}\right|_{p(x)=p^\ast(x)} = 0, \end{equation}
or
\begin{equation} \log\frac{1}{p^{\ast}(x)} - 1 - \lambda - \lambda_{\sigma}(x-\mu)^2 + \lambda_{\sigma}\int_y 2(y-\mu) \left(\frac{d\mu}{d p(x)}\right) p(y)\, dy = 0. \end{equation}
That is, using the fact that dμ/dp(x) = x,
\begin{equation} \log\frac{1}{p^{\ast}(x)} - 1 - \lambda - \lambda_{\sigma}(x-\mu)^2 + \lambda_{\sigma}\, x \int_y 2(y-\mu)\, p(y)\, dy = 0. \end{equation}
The last term cancels out, because ∫_y (y-μ) p(y) dy = 0, so that,
\begin{equation} \log\frac{1}{p^{\ast}(x)} - 1 - \lambda - \lambda_{\sigma}(x-\mu)^2 = 0, \end{equation}
or
\begin{equation} p^{\ast}(x) = e^{-(1+\lambda) - (x-\mu)^2\lambda_{\sigma}}. \end{equation}
Using the normalization condition, we obtain, \begin{equation} e^{-(1+\lambda)} = \frac{1}{\int_x e^{-(x-\mu)^2\lambda_{\sigma}}\, dx}, \end{equation}
then
\begin{equation} p^{\ast}(x) = \frac{e^{-(x-\mu)^2\lambda_{\sigma}}}{\int_y e^{-(y-\mu)^2\lambda_{\sigma}}\, dy}, \end{equation}
that is the Gaussian distribution.
The value of λ_σ can be determined using the condition \begin{equation} \sigma^2 \equiv \int_x (x-\mu)^2\, p(x)\, dx = \frac{\int_x (x-\mu)^2\, e^{-(x-\mu)^2\lambda_{\sigma}}\, dx}{\int_y e^{-(y-\mu)^2\lambda_{\sigma}}\, dy}. \end{equation}
Now, we remember that
\begin{equation} \int_{-\infty}^{\infty} e^{-(x-\mu)^2\lambda_{\sigma}}\, dx = \sqrt{\frac{\pi}{\lambda_{\sigma}}} \end{equation} and \begin{equation} \int_{-\infty}^{\infty} (x-\mu)^2\, e^{-(x-\mu)^2\lambda_{\sigma}}\, dx = \sqrt{\frac{\pi}{\lambda_{\sigma}}}\, \frac{1}{2\lambda_{\sigma}}. \end{equation}
(You can find these results in Spiegel’s manual).
The solution is \begin{equation} \sigma^2 = \frac{1}{2\lambda_{\sigma}}. \end{equation}
Thus, the Gaussian distribution
\begin{equation} p(x) = \frac{e^{-(x-\mu)^2\lambda_{\sigma}}}{\sqrt{\pi/\lambda_{\sigma}}} = \frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}}, \end{equation}
is the maximum entropy distribution amongst all distributions in the range (-∞, ∞) that have variance σ².
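Again a quick check using standard closed-form differential entropies (in nats): (1/2) ln(2πeσ²) for the Gaussian, versus ln(√12 σ) for a uniform distribution with the same variance σ². The Gaussian wins:

```python
from math import log, pi, e, sqrt

sigma = 1.0
h_gaussian = 0.5 * log(2 * pi * e * sigma**2)  # Gaussian with variance sigma^2 (nats)
h_uniform = log(sqrt(12) * sigma)              # uniform with the same variance
assert h_gaussian > h_uniform
```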
In summary, when you want to model a set of observations with a probability distribution, the maximum entropy principle tells you that if you do not know anything about the probability of the different outcomes, you should use a uniform distribution; if you know the mean, the maximum entropy distribution is the exponential distribution; and if you know the standard deviation, the maximum entropy distribution is the Gaussian distribution.
Practical case 1: The length of a unique sequence in a genome
In a genome of length L, what is the shortest sequence (motif) that we expect to appear only once?
Eukaryotic genomes have on the order of 10^9 nucleotides. We assume that the length of the unique sequence is l. Let’s also assume that the genome has nucleotide frequencies given by p_A, p_C, p_G, p_T, which sum to one.
On the one hand, we have the entropy in the motif. Each position is independent of the others, so their entropies add up, and if the sequence has n_A A’s, n_C C’s, n_G G’s and n_T T’s, such that n_A + n_C + n_G + n_T = l, then
Entropy of sequence:
\begin{equation} h(\mbox{motif}) = n_A \log_2 \frac{1}{p_A} + n_C \log_2 \frac{1}{p_C} + n_G \log_2 \frac{1}{p_G} + n_T \log_2 \frac{1}{p_T}. \end{equation}
On the other hand, we have the entropy of where the sequence is going to land in the genome (the genomic position); if we assume all positions are equally likely we have
Entropy in genome position:
\begin{equation} H(\mbox{position}) = \log_2 L. \end{equation}
Figure 2. The ignorance about the location in the genome is determined by the length of the genome L as log_2(L), and it doesn't depend on the length of the motif l (as long as l<<L and we can ignore edges). The total number of possible motifs is 4^l, and the information content of each one of them is log_2(4^l)= 2l (assuming all residues are equally likely). In a situation where the segment size is small, the total number of possible segments is small (for instance, for l=2 there are 16 possible segments), thus we expect that none of the segments is going to be unique in the genome. As the segment size increases, the number of possible segments does too, so does the information content of each one of them (for instance, for l=16 there are over 4 billion possible segments) then it becomes more likely that a given segment is unique in the genome.
We want to have the condition (see Figure 2)
\begin{equation} \mbox{Entropy of sequence} \geq \mbox{Entropy in genome position}, \end{equation}
or
\begin{equation} n_A \log_2 \frac{1}{p_A} + n_C \log_2 \frac{1}{p_C} + n_G \log_2 \frac{1}{p_G} + n_T \log_2 \frac{1}{p_T} \geq \log_2 L. \end{equation}
Let’s look at some examples. Assume a genome of length L = 10^6; the “location entropy” is log_2 10^6 = 19.9 bits. The sequence entropies for two different base compositions (uniform: p_A = p_C = p_G = p_T = 1/4; A-rich: approximately p_A = 0.85, p_C = p_G = p_T = 0.05) are
genome entropy: 19.9 bits | AAAA | CCCC | AAAAAA | CCCCCC | AAAAAAAAAA | CCCCCCCCCC
uniform | 8.0 | 8.0 | 12.0 | 12.0 | 20.0 | 20.0
A-rich | 0.8 | 17.3 | 1.4 | 25.9 | 2.3 | 43.2
The situations in which the sequence entropy is larger than the genome position entropy (19.9 bits) are those in which we should expect the sequence to be unique.
For a genome where all nucleotides are distributed equally and a genome length of L = 10^6, sequences of length l = 10 or larger will already be unique. However, for a very A-rich genome, the situation is very different depending on the base composition of the sequence.
We will run some “live” examples in class. Here I show results of how many times the different sequences were found (mean ± stdv) in 20 sampled genomes.
counts (mean ± stdv) | AAAA | CCCC | AAAAAA | CCCCCC | AAAAAAAAAA | CCCCCCCCCC
uniform | 3,894±88 | 3,904±93 | 240±13 | 247±25 | 1.2±1.2 | 0.9±1.2
A-rich | 521,626±1152 | 6±2 | 376,815±824 | 0.05±0.22 | 196,616±924 | 0.00±0.00
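A minimal version of such a “live” example in Python, sampling one uniform genome of length 10^6 and counting overlapping occurrences of a 4-mer and a 10-mer (expected about 10^6/4^4 ≈ 3,906 and 10^6/4^10 ≈ 0.95, respectively):

```python
import random

random.seed(1)
L = 1_000_000
genome = ''.join(random.choices('ACGT', k=L))  # uniform base composition

def count_overlapping(text, motif):
    """Count occurrences of motif in text, allowing overlaps."""
    n, start = 0, 0
    while (start := text.find(motif, start)) != -1:
        n += 1
        start += 1
    return n

print(count_overlapping(genome, 'CCCC'))        # about 3,900
print(count_overlapping(genome, 'CCCCCCCCCC'))  # usually 0 or 1
```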
As a rule of thumb, for a genome with residues uniformly distributed (p_A = p_C = p_G = p_T = 1/4), the information content of a nucleotide is 2 bits (log_2 4 = 2) and the length of a unique motif is \begin{equation} l = \frac{\log_2(L)}{2}. \end{equation}
Some examples,
genome | length | genome inf content | unique motif length
human | ~3.2 x 10^9 | 32 bits | 16
worm | ~10^8 | 27 bits | 13
E. coli | ~4.6 x 10^6 | 22 bits | 11
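The rule of thumb in code, using the approximate genome sizes assumed above:

```python
from math import log2

def unique_motif_length(L):
    """Rule-of-thumb minimal length of a unique motif: log2(L) / 2 (uniform bases)."""
    return log2(L) / 2

# assumed approximate genome sizes
for name, L in [('human', 3.2e9), ('worm', 1e8), ('E.coli', 4.6e6)]:
    print(f'{name}: {log2(L):.0f} bits, motif length {unique_motif_length(L):.0f}')
```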
Practical case 2: Bacteria adapting to a changing environment
A type of bacterium can produce 2 different factors, α and β. Bacteria of type α live well in environment A but cannot survive in environment B, and vice versa for type β.
We force the colony to live in either environment A or environment B, such that there is a probability p_A of environment A, and a probability p_B = 1 - p_A of environment B. We choose the environments randomly, and we assume that the environment lasts long enough for all bacteria to have the opportunity to replicate.
Which strategy do you think is best for the bacteria to follow for better growth? That is, how should the bacteria update their bets on which fraction (f_α) should be type α and which fraction (f_β = 1 - f_α) should be type β?
During a given environment, each type of bacteria has gains given by

 | Environment A | Environment B
α bacteria | g^A_α | g^B_α
β bacteria | g^A_β | g^B_β
Since β type bacteria do not survive in A, and α type bacteria do not survive in B, then g^A_β = g^B_α = 0. If all surviving bacteria duplicate in a given interval, then g^A_α = g^B_β = 2.
Assume that the initial number of bacteria is B_0. After one interval, the number of bacteria is B_1 = g^A_α f_α B_0 if the environment was A, or B_1 = g^B_β f_β B_0 if it was B.
The number of bacteria can be written in compact form as
\begin{equation} B_1 = [\, \delta^A_1\, g^A_\alpha\, f_\alpha + \delta^B_1\, g^B_\beta\, f_\beta\,]\, B_0, \end{equation}
where δ^A_1 = 1 if environment 1 is A or zero otherwise, and equivalently for δ^B_1.
After i switches, the number of bacteria is
\begin{equation} B_i = [ \delta^A_i g^A_\alpha\ f_\alpha + \delta^B_i g^B_\beta\ f_\beta] B_{i1}\equiv G_i B_{i1}, \end{equation}
where G_i is the gain after interval i.
After a large number of switches N,
\begin{equation} B_N = \prod_{i=1}^{N} [ \delta^A_i g^A_\alpha\ f_\alpha + \delta^B_i g^B_\beta\ f_\beta] B_{0}\equiv G B_0, \end{equation}
The total gain is \begin{equation} G = \prod_{i=1}^{N} [ \delta^A_i\ g^A_\alpha\ f_\alpha + \delta^B_i\ g^B_\beta\ f_\beta] \end{equation}
Assuming an exponential growth rate, G ≡ e^{Nr},
or
\begin{equation} r = \frac{1}{N} \log G = \frac{n_A}{N} \log (g^A_\alpha\, f_\alpha) + \frac{n_B}{N} \log (g^B_\beta\, f_\beta), \end{equation}
where n_A is the number of times the bacteria have experienced environment A, and similarly for n_B.
Thus, for N large enough, n_A/N → p_A and n_B/N → p_B, and
\begin{equation} r = p_A \log (g^A_\alpha\, f_\alpha) + p_B \log (g^B_\beta\, f_\beta). \end{equation}
The optimal frequency f*_α that optimizes the gain satisfies (remembering that f_β = 1 - f_α, we avoid a Lagrange multiplier)
\begin{equation} \left.\frac{d r}{d f_\alpha}\right|_{f_\alpha = f^{\ast}_\alpha} = \frac{p_A}{f^{\ast}_\alpha} - \frac{p_B}{1 - f^{\ast}_\alpha} = 0. \end{equation}
This means that the optimal frequency satisfies the condition
\begin{equation} f^{\ast}_\alpha = p_A, \qquad f^{\ast}_\beta = p_B. \end{equation}
That means that the maximum gain solution is for the bacteria to produce type α and type β matching the probabilities of the two environments A (f*_α = p_A) and B (f*_β = p_B)! Simple.
The optimal growth rate is given by
\begin{equation} r^{\ast} = p_A \log (g^A_\alpha\, p_A) + p_B \log (g^B_\beta\, p_B). \end{equation}
If the gain is equal to the inverse of the environment probability (g^A_α = 1/p_A and g^B_β = 1/p_B), the optimal growth is zero.
You can rewrite the optimal growth rate as
\begin{equation} r^{\ast} = \left[ p_A \log g^A_\alpha + p_B \log g^B_\beta \right] - \left[ p_A \log \frac{1}{p_A} + p_B \log \frac{1}{p_B} \right]. \end{equation}
The first term in brackets is due to the bacterial growth; the second term is the entropy of the environment fluctuations.

As the fluctuations in the environment become more random (p_A closer to 1/2), the entropy term increases and the growth rate decreases.

As the environment becomes more certain (p_A closer to 1 or 0), the entropy term decreases, and the bacteria can achieve larger growth.
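The result f*_α = p_A (proportional betting) can be checked numerically. A sketch in Python, assuming both gains equal 2 and an environment probability p_A = 0.7; scanning the growth rate over a grid of allocation fractions recovers the optimum at f_α = p_A:

```python
from math import log2

def growth_rate(f_alpha, p_A, g=2.0):
    """Long-run growth rate (bits per interval): p_A*log2(g*f) + p_B*log2(g*(1-f))."""
    p_B = 1 - p_A
    return p_A * log2(g * f_alpha) + p_B * log2(g * (1 - f_alpha))

p_A = 0.7
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda f: growth_rate(f, p_A))
assert abs(best - p_A) < 1e-9       # the optimum sits at f_alpha = p_A
assert growth_rate(p_A, p_A) > 0    # positive growth when both gains are 2
```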
This analysis can be generalized to

An arbitrary number of environments: A, B, C,…, K

 An arbitrary number of bacterial types: α, β, γ, …

By modeling the dynamics under which the environments change
All these generalizations are present in the paper “Phenotypic diversity, population growth, and information in fluctuating environments” by Kussell & Leibler, where they discuss two possible strategies for the bacteria to switch phenotypes: stochastic switching versus switching by sensing the environment. Kussell & Leibler calculate the environment’s entropy contribution to bacterial growth and conclude that stochastic switching is favored when the environment’s fluctuations are slow.