# Introduction to Information Theory

For this topic, you will probably find relevant MacKay’s book (Sections 2.4, 2.5, 2.6), and also MacKay’s lectures 1 and 2 from his online Course on Information Theory, Pattern Recognition, and Neural Networks’’, from which I have borrowed a lot of the language.

## What is information content, and how to measure it

The term information content was introduced by Claude Shannon (1916-2001) in 1948 to solve communication problems. The problem was how to obtain “reliable communication over an unreliable channel”.

Examples relevant to you can be:

• voice $\rightarrow$ ear (channel = air)
• DNA $\rightarrow$ DNA replication (channel = DNA polymerase)
• biological system perturbation $\rightarrow$ measurement (channel = your experimental design)

The received signal is never identical to the sent signal. How can we make it so the received message is as similar as possible to the sent message?

For the case of your biological experiment, you could either

• Improve the experiment

### Properties of information content

Consider these examples. We barely know each other, I run into you and I say:

• I had breakfast this morning,
• Tomorrow is my birthday,
• I ran the first (2016) Cambridge half-marathon,
• I ran the first and second Cambridge half-marathons,
• I have been to Antarctica,
• I have been to Antarctica in the winter (June).

Which of these events you think carries the most information? (Disclaimer: not all events directly apply to me.)

The information conveyed by one event is ‘somehow’ inversely related to how frequent that event is (we will specify the ‘how’ later). And often times, how frequent an event is depends on the number of possible outcomes. Such that,

• The rarer an event is, the lower its probability, and the more information you could transmit by running the experiment (listening at the end of the channel).

• Equivalently, the more ignorant you are about the outcome (because the number of possibilities is very large), the smaller the probability of each individual outcome, and the more information you could get from listening at the end of the channel.

Thus, for an event $e$ that has probability $p$, we expect that

      Information_Content(e) grows monotonically as 1/p.


A second desired property of information is that, for two independent events $e_1$ and $e_2$, we expect the total information to be

$$\mbox{Information_Content} (e_1 \& e_2) = \mbox{Information_Content} (e_1) + \mbox{Information_Content} (e_2).$$

### Ta-da!

Using these two desirable properties for information content, we can derive its mathematical form.

The logarithm function has the convenient property:

$$\log(ab) = \log(a) + \log(b).$$

Shannon postulated that, for an event $e$ that has probability $p$, the right way of measuring its information is,

$$\mbox{Information Content}(e) = h(e) = \log_2 \frac{1}{p} \quad \mbox{in bits.}$$

Figue 1. Shannon information content as a function of the probability value. Examples: h(p=0.1) = 3.32 bits, h(0.9)= 0.15 bits.

In Figure 1, we see how information content varies as a function of the probability. Going back to my examples,

Event Probability Entropy (bits)
I had breakfast today 1.0 0.0
Tomorrow is my birthday 1/365 8.5
Ran first Cambridge half-marathon 4,500/4,700,000$^\ast$ 10.0
Ran first & second Cambridge half-marathon (4,500/4,700,000)x(6,500/4,700,000)$^{\ast\ast}$ 13.5
I’ve been to Antarctica 925,000/7,400,000,000$^{+}$ 13.0
I’ve lived in Antarctica in winter (June) 25,000/7,400,000,000$^{++}$ 18.2

$^{\ast}$I have considered the greater Boston population (4.7 millions).

$^{\ast\ast}$I have assumed that the two events (running the first and second half-marathon) are independent, but in fact they aren’t.

$^{+}$Tourists to Antarctica in 2009-2010 season amounted to 37,000. I have considered 25 years of visits (since the 1990s).

$^{++}$ About 1,000 people live in Antarctica in winter. I have considered 25 years of Antarctica colonization.

Do these results agree with your intuition about how much you learned about me by knowing those facts?

Let’s see how additivity works. For two independent events $e_1$ and $e_2$,

$$P(e_1,e_2) = P(e_1)P(e_2),$$

then

\begin{eqnarray} h(e_1 \& e_2) &=& \log_2 \frac{1}{P(e_1,e_2)}
&=& \log_2 \frac{1}{P(e_1)P(e_2)}
&=& \log_2 \frac{1}{P(e_1)} + \log_2 \frac{1}{P(e_2)}
&=& h(e_1) + h(e_2). \end{eqnarray}

## Information, Redundancy and Compression

The information content is inversely related to redundancy. The less information an event carries, the more it can be compressed.

Shannon postulated that, for an event $e$ of probability $p$, its information content is the compression to which we should aspire.

## Entropy: the average information content of an ensemble

The entropy of a random variable X is the average information content of an ensemble of all the possible outcomes (variables) for X,

$$\mbox{Entropy}(X) = H(X) = \sum_x p(x) \log \frac{1}{p(x)},$$

with the convention that if $p(x) = 0$, then $0\times \log\frac{1}{0} = 0$.

Notice that the entropy is never negative, $$H(X) \geq 0,$$

with

$$H(X) = 0\quad\mbox{ if and only if}\quad p(x)=1 \quad\mbox{for one}\quad x.$$

A desirable property of H that Shannon describes in his classical paper is that if a choice is broken down in two different choices the original H should be the weighted sum of the individual entropies. Graphically, using Shannon’s own figure

We have the equality,

The coefficient $\frac{1}{2}$ weights the fact that the second choice only happens half of the time.

## Relative entropy: the Kullback-Leibler divergence

For two probability distributions $P(x)$ and $Q(x)$ defined over the same space, their relative entropy or Kullback-Leibler (KL) divergence is defined by

The relative entropy has the following properties

• It satisfies Gibbs’s inequality

It is zero only if $P=Q$.

• It is not symmetric. In general,

Thus, it is not a distance although oftentimes it is called the KL distance.

• For continuous distributions it takes the form

where $p(x)$ and $q(x)$ are the densities of $P$ and $Q$.

## The principle of maximizing entropy. Or trusting Shannon to be fair

Imagine you have 8 pieces of cake that you want to distribute amongst 3 kids, without cutting the pieces further. How would you do that? Unless you are Gru (despicable me), you would like to be as fair as possible to all three kids.

Here are some options. Would you give?

• 4 4 0
• 4 2 2
• 3 3 2

Then entropy of these three different scenarios are

• $H(4,4,0) = 1.00$ bits
• $H(4,2,2) = 1.50$ bits
• $H(3,3,2) = 1.56$ bits

In the same way, when you are working in an experiment for which you expect different possible outcomes, but you have no idea of which are more favorable, being fair to all of them (that is, not giving preference to each) you should also use the maximum entropy principle.

The maximum entropy distribution is the one that favors any of the possible outcomes the least. Thus, the maximum entropy distribution is the one that should be used in the absence of any other information.

### When you do not know anything

Still in the issue of being fair to the 3 kids and the 8 pieces of cake, now consider this option in which you give 2 pieces to each kids and leave 2 pieces out:

$$H(2,2,2) = \frac{2}{6}\log_2\frac{1}{2/6} + \frac{2}{6}\log_2\frac{1}{2/6} + \frac{2}{6}\log_2\frac{1}{2/6} = \log_2(3) = 1.59 \, \mbox{bits}.$$

What do you think about this result? This entropy is even higher than the (3,3,2) case that seemed pretty fair. How come?

You can formally ask, what is the probability distribution that maximizes the entropy of the system? This is an optimization problem.

For a discrete distribution $X$ with distribution $\{p_i\}$ for $1\leq i\leq N$, we want to optimize the entropy,

$$\sum_j p_j \log\frac{1}{p_j}$$

which we do by calculating the point(s) of zero derivative. Because the $p_j$’s are not arbitrary but probabilities that sum to one ($\sum_j p_j = 1$), we need to incorporate that constraint in the optimization function using a Lagrange multiplier ($\lambda > 0$), and the function to optimize is

$$L = \sum_j p_j \log\frac{1}{p_j} -\lambda \left(\sum_j p_j - 1\right).$$

The derivative respect to one such probability $p_i$ is

$$\frac{d}{d p_i} L = \log\frac{1}{p_i} - p_i \frac{1}{p_i} - \lambda,$$

which becomes optimal when $\frac{d}{d p_i} L = 0$, then

$$\log\frac{1}{p^{\ast}_i} - 1 - \lambda = 0,$$

or

$$p^{\ast}_i = e^{-(1+\lambda)}.$$

Because $\sum_i p_i = 1$, that implies that $1 = N e^{-(1+\lambda)}$ or $e^{-(1+\lambda)} = 1/N$.

Thus, the probability that maximizes the entropy is the uniform distribution $$p^{\ast}_i = \frac{1}{N}, \quad \mbox{for}\quad 1\leq i\leq N.$$

Notice that in the case of the three kids each with 2 pieces of cake, we used a uniform distribution.

We could have done the same for a continuous distribution in a range $[a,b]$, then the maximum entropy uniform distribution looks like $$p(x) = \frac{1}{b-a}, \quad \mbox{for}\quad x\in [a,b],$$ such that, $\int_{x=a}^{x=b} p(x) dx = 1$.

### When you know the mean

But sometimes we have some information, for instance, we may know (or have an estimate) of the mean of the distribution ($\mu$). That adds one more constraint to the optimization problem, and one more Lagrange multiplier $\lambda_{\mu}>0$. Given

$$L = \int_y p(y) \log\frac{1}{p(y)} dy -\lambda \left(\int_y p(y) dy - 1\right) -\lambda_{\mu} \left(\int_y y p(y) dy - \mu\right),$$

we want to calculate the $p(x)$ that optimizes $L$, $$\left.\frac{d L}{d p(x)}\right|_{p(x)=p^\ast(x)} = 0$$

That is, $$\log\frac{1}{p^{\ast}(x)} - 1 - \lambda -\lambda_{\mu} x = 0,$$ or $$p^{\ast}(x) = e^{-(1+\lambda + x\lambda_{\mu})}.$$

We can eliminate $\lambda$ and $\lambda_{\mu}$ using the two constraints again.

Using the normalization condition, we obtain, $$e^{-(1+\lambda)} = \frac{1}{\int_x e^{-x\lambda_{\mu}} dx},$$ then the maximum-entropy distribution $$p^{\ast}(x) = \frac{e^{-x\lambda_{\mu}}}{\int_y e^{-y\lambda_{\mu}} dy},$$ that is the exponential distribution.

The value of $\lambda_{\mu}$ can be determined using the condition $$\mu \equiv \int_x x p^{\ast}(x) dx = \frac{\int_x x e^{-x\lambda_{\mu}} dx}{\int_y e^{-y\lambda_{\mu}} dy}.$$

So far, we have said nothing about the range of x. This equation has a solution only if all $x$ are positive.

Now, we remember that $$\int_0^{\infty} e^{-\lambda_{\mu} x} dx = \frac{1}{\lambda_{\mu}}$$

and

$$\int_0^{\infty} x\, e^{-\lambda_{\mu} x} dx = \frac{1}{\lambda_{\mu}^2}.$$

(You can find these results in Spiegel’s manual).

The solution is $$\mu = \frac{1/\lambda_{\mu}^2}{1/\lambda_{\mu}} = \frac{1}{\lambda_{\mu}}.$$

Thus, the exponential distribution $$p(x) = \frac{1}{\mu} e^{-x/\mu} \quad x\geq 0,$$ is the maximum entropy distribution amongst all distributions in the $[0,\infty)$ range that have mean $\mu$.

### When you know the standard deviation

If what we know is the standard deviation of the distribution ($\sigma$), then there is a different condition (and it corresponding Lagrange multiplier $\lambda_{\sigma}>0$) in the optimization method

For

$$L = \int_y p(y) \log\frac{1}{p(y)} dy -\lambda \left(\int_y p(y) dy - 1\right) -\lambda_{\sigma} \left(\int_y (y-\mu)^2 p(y) dy - \sigma^2\right)$$

where $\mu = \int_x x p(x) dx$, is the mean of the distribution, but it is not a fixed value. We are only imposing a known standard deviation $\sigma$.

In this case, optimization implies,

$$\left.\frac{d L}{d p(x)}\right|_{p(x)=p^\ast(x)} = 0.$$

or

$$\log\frac{1}{p^{\ast}(x)} - 1 - \lambda - \lambda_{\sigma}(x-\mu)^2 - \lambda_{\sigma}\int_y 2(y-\mu) \left(-\frac{d\mu}{p(x)}\right) p(y) dy = 0.$$

That is,

$$\log\frac{1}{p^{\ast}(x)} - 1 - \lambda - \lambda_{\sigma}(x-\mu)^2 - \lambda_{\sigma}\int_y 2(y-\mu) (-x) p(y) dy = 0.$$

The last term cancels out so that,

$$\log\frac{1}{p^{\ast}(x)} - 1 - \lambda - \lambda_{\sigma}(x-\mu)^2 = 0.$$

or

$$p^{\ast}(x) = e^{-(1+\lambda) - (x-\mu)^2\lambda_{\sigma}}.$$

Using the normalization condition, we obtain, $$e^{-(1+\lambda)} = \frac{1}{\int_x e^{-(x-\mu)^2\lambda_{\sigma}} dx},$$

then

$$p^{\ast}(x) = \frac{e^{-(x-\mu)^2\lambda_{\sigma}}}{\int_y e^{-(y-\mu)^2\lambda_{\sigma}} dy},$$

that is the Gaussian distribution.

The value of $\lambda_{\sigma}$ can be determined using the condition $$\sigma^2 \equiv \int_x (x-\mu)^2 p(x) dx = \frac{\int_x (x-\mu)^2 e^{-(x-\mu)^2\lambda_{\sigma}} dx}{\int_y e^{-(y-\mu)^2\lambda_{\sigma}} dy}.$$

Now, we remember that

$$\int_{-\infty}^{\infty} e^{-(x-\mu)^2\lambda_{\sigma}} dx = \sqrt{\frac{\pi}{\lambda_{\sigma}}}$$ and $$\int_{-\infty}^{\infty} (x-\mu)^2 e^{-(x-\mu)^2\lambda_{\sigma}} dx = \sqrt{\frac{\pi}{\lambda_{\sigma}}} \frac{1}{2\lambda_{\sigma}}.$$

(You can find these results in Spiegel’s manual).

The solution is $$\sigma^2 = \frac{1}{2\lambda_{\sigma}}.$$

Thus, the Gaussian distribution

$$p(x) = \frac{e^{-(x-\mu)^2\lambda_{\sigma}}}{\sqrt{\pi/\lambda_{\sigma}}} = \frac{e^{-\frac{(x-\mu)^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}} ,$$

is the maximum entropy distribution amongst all distributions in the $(-\infty,\infty)$ range that have variance $\sigma^2$.

In summary, when you want to model a set of observations with a probability distribution, them maximum entropy principle tells you that if you don’t know anything about the probability of the different outcomes, you should use a uniform distribution, if you know the mean, the maximum entropy distribution is an exponential distribution, and if you know the standard deviation the maximum entropy distribution is the Gaussian distribution.

## Practical case 1: The length of a unique sequence in a genome

In a genome of length $L$, what is the shortest sequence (motif) that we expect to appear only once?

Eukaryotic genomes have on the order of $L=10^{6}-10^{10}$ nucleotides. We assume that the length of the unique sequence is $% $. Lets also assume that the genome has nucleotide frequencies given by $p_A, p_C, p_G, p_T$, that sum to one.

On the one hand, we have the entropy in the motif. Each position is independent from each other, so their entropies add up, and if the sequence has $l_A$ A’s, $l_C$ C’s, $l_G$ G’s and $l_T$ T’s, such that $l = l_A + l_C+l_G+l_T$, then

Entropy  of sequence


On the other hand, we have the entropy of where the sequence is going to land in the genome (the genomic position), if we assume all $L$ positions are equally likely we have,

Entropy in genome position


Figure 2. The ignorance about the location in the genome is determined by the length of the genome L as log_2(L), and it doesn't depend on the length of the motif l (as long as l<<L and we can ignore edges). The total number of possible motifs is 4^l, and the information content of each one of them is log_2(4^l)= 2l (assuming all residues are equally likely). In a situation where the segment size is small, the total number of possible segments is small (for instance, for l=2 there are 16 possible segments), thus we expect that none of the segments is going to be unique in the genome. As the segment size increases, the number of possible segments does too, so does the information content of each one of them (for instance, for l=16 there are over 4 billion possible segments) then it becomes more likely that a given segment is unique in the genome.

We want to have the condition (see Figure 2)

or

Let’s look at some examples. Assume a genome of length $L = 10^6$, the location entropy’’ is $\log_2{10^6}=19.93$ bits, and the entropy for different sequences for two different base frequencies are

genome entropy 19.9 AAAA CCCC AAAAAA CCCCCC AAAAAAAAAA CCCCCCCCCC
$p_{A,C,G,T}=0.25$ 8.0 8.0 12.0 12.0 20.0 20.0
$p_A=0.85$ $p_{C,G,T}=0.05$ 0.8 17.3 1.4 25.9 2.3 43.2

In bold, situations in which the sequence entropy is larger than the genome position entropy, and we should expect the sequence to be unique.

For a genome where all nucleotides are distributed equally and for a genome of length $10^6$, sequences of length $10$ or larger will already be unique. However for a very A-rich genome, the situation is very different depending on the base composition of the sequence.

We will run some live’’ examples in class. Here I show results of how many times the different sequences were found (mean $\pm$ stdv) in 20 sampled genome.

AAAA CCCC AAAAAA CCCCCC AAAAAAAAAA CCCCCCCCCC
$p_{A,C,G,T}=0.25$ 3,894$\pm$88 3,904$\pm$93 240$\pm$13 247$\pm$25 1.2$\pm$1.2 0.9$\pm$1.2
$p_A=0.85$ $p_{C,G,T}=0.05$ 521,626$\pm$1152 6$\pm$2 376,815$\pm$824 0.05$\pm$0.22 196616$\pm$924 0.00$\pm$0.00

As a rule of thumb, for a genome with residues uniformly distributed ($p_A=p_C=p_G=p_T=0.25$), then the information content of a nucleotide is 2 bits ($log_2\frac{1}{0.25}$) and the length of a unique motif is $$l = \frac{\log_2(L)}{2}.$$

Some examples,

genome length $L$ genome inf content unique motif length $l$
human $4\times 10^9$ 32 bits 16
worm $100\times 10^6$ 27 bits 13
E.coli $6\times 10^6$ 22 bits 11

## Practical case 2: Bacteria adapting to a changing environment

A type of bacterium can produce 2 different factors $\alpha$ and $\beta$. Bacteria of $\alpha$-type live well in environment A but cannot survive in environment B and viceversa.

We force the colony to live in either environment A or environment B such that there is a probability $p_A$ of environment A, and a probability $p_B=1-p_A$ of environment B. We choose the environments randomly, and we assume that the environment lasts long enough for all bacteria to have the opportunity to replicate.

Which strategy do you think is best for the bacteria to follow for better growth? That is, how the bacteria should update her bets on which fraction ($f_\alpha$) should be $\alpha$-type and which fraction ($f_\beta=1-f_\alpha$) should be $\beta$-type?

During a given environment, each type of bacteria have gains given by

Environment A Environment B
$\alpha$ bacteria $g_{\alpha}^A$ $g_{\alpha}^B \approx 0$
$\beta$ bacteria $g_{\beta}^A\approx 0$ $g_{\beta}^B$

Since $\alpha$-type bacteria do not survive in B, and $\beta$-type bacteria do not survive in A, then $g_{\alpha}^B = g_{\beta}^A\approx 0.$ If all bacteria duplicate in a given interval then $g^A_\alpha = g^B_\beta = 2.$

Assume that the initial number of bacteria is $B_0$. After one interval the number of bacteria is

The number of bacteria $B_1$ can be writen in compact form as

where $\delta^A_1= 1$ if environment 1 is A or zero otherwise, and equivalently for $\delta^B_1$.

After $i$ switches, the number of bacteria $B_i$ is

$$B_i = [ \delta^A_i g^A_\alpha\ f_\alpha + \delta^B_i g^B_\beta\ f_\beta] B_{i-1}\equiv G_i B_{i-1},$$

where $G_i$ is the gain after interval $i$.

After a large number of switches $N$

$$B_N = \prod_{i=1}^{N} [ \delta^A_i g^A_\alpha\ f_\alpha + \delta^B_i g^B_\beta\ f_\beta] B_{0}\equiv G B_0,$$

The total gain $G$ is $$G = \prod_{i=1}^{N} [ \delta^A_i\ g^A_\alpha\ f_\alpha + \delta^B_i\ g^B_\beta\ f_\beta]$$

Assuming exponential growth rate $\Lambda$

or

where $n_A=\sum_i \delta_i^A$ the number of times the bacterial have experienced environment A, and similarly for $n_B$.

Thus, for large enough $N$

and

The optimal frequency $f^{\ast}_\alpha$ that optimizes the gain satisfies, (remembering that $f_\beta = 1 - f_\alpha$, we avoid a Lagrange multiplier)

This means that the optimal frequency $f^{\ast}_\alpha$ satisfies the condition

That means that the maximum gain solution is for the bacteria to produce $\alpha$-type and $\beta$-type matching the probabilities of the two environments A ($p_A$) and B ($p_B=1-p_A$)! simple.

The optimal growth rate is given by

If the gain is equal to the inverse of the environment probability $g^A_\alpha p_A = 1,$ and the optimal growth is zero.

You can rewrite the optimal growth rate as

The first term in brackets is due to the bacterial growth, the second term is the entropy of the environment fluctuations.

• As the fluctuations in the environment become more random ($p_A$ closer to 1/2), the entropy term increases and the growth rate decreases.

• As the environment becomes more certain, ($p_A$ closer to 1 or 0), the entropy term decreases, and the bacteria can achieve larger growth.

This analysis can be generalized to

• An arbitrary number of environments: A, B, C,…, K

• An arbitrary number of bacterial types: $\alpha$, $\beta$, $\gamma$,…

• By modeling the dynamics under which the environments change

All these generalizations are present in a paper “Phenotypic diversity, population growth, and information in fluctuating environments” by Kussell & Leibler where they discuss two possible strategies for the bacteria to switch environments: stochastic switching versus sensing switching. Kussell & Leibler calculate the environment’s entropy contribution to bacterial growth to conclude that stochastic switching is favored when the environment’ fluctuations are slow.