# Introduction to Information Theory

For this topic, you will probably find relevant MacKay’s book (Sections 2.4, 2.5, 2.6), and also MacKay’s lectures 1 and 2 from his online Course on Information Theory, Pattern Recognition, and Neural Networks’’, from which I have borrowed a lot of the language.

Here is another set of related materials “Visual Information Theory”. I don’t follow it directly, but it covers many of the same basic topic, with very explicit visualizations. If you find my notes or MacKay’s book difficult to follow, you should definitely check this reference out. We don’t all learn the same way.

## What is information content, and how to measure it

The term information content was introduced by Claude Shannon (1916-2001) in his 1948 paper “A Mathematical Theory of Communication” to solve communication problems. The problem was how to obtain “reliable communication over an unreliable channel”.

Examples relevant to you can be:

• voice $$\rightarrow$$ ear (channel = air)
• DNA $$\rightarrow$$ DNA replication (channel = DNA polymerase)
• biological system perturbation $$\rightarrow$$ measurement (channel = your experimental design)

The received signal is never identical to the sent signal. How can we make it so the received message is as similar as possible to the sent message?

For the case of your biological experiment, you could either

• Improve the experiment

### Properties of information content

Consider these examples. We barely know each other, I run into you and I say:

• I had breakfast this morning,
• Today is not my birthday,
• Today is my birthday,
• I ran the first (2016) Cambridge half-marathon,
• I ran the first and second Cambridge half-marathons,
• I ran the first, second and third Cambridge half-marathons,
• I have been to Antarctica,

Which of these events you think carries the most information? (Disclaimer: not all events directly apply to me.)

The information conveyed by one event is ‘somehow’ inversely related to how frequent that event is (we will specify the ‘how’ later). And often times, how frequent an event is depends on the number of possible outcomes. Such that,

• The rarer an event is, the lower its probability, and the more information you could transmit by running the experiment (listening at the end of the channel).

• Equivalently, the more ignorant you are about the outcome (because the number of possibilities is very large), the smaller the probability of each individual outcome, and the more information you could get from listening at the end of the channel.

Thus, for an event $$e$$ that has probability $$p$$, we expect that

      Information_Content(e) grows monotonically as 1/p.


A second desired property of information is that, for two independent events $$e_1$$ and $$e_2$$, we expect the total information to be

$$\mbox{Information_Content} (e_1 \& e_2) = \mbox{Information_Content} (e_1) + \mbox{Information_Content} (e_2).$$

### Ta-da!

Using these two desirable properties for information content, we can derive its mathematical form.

The logarithm function has the convenient property:

$$\log(ab) = \log(a) + \log(b).$$

Shannon postulated that, for an event $$e$$ that has probability $$p$$, the right way of measuring its information is,

$$\mbox{Information Content}(e) = h(e) = \log_2 \frac{1}{p} \quad \mbox{in bits.}$$

Figue 1. Shannon information content as a function of the probability value. Examples: h(p=0.1) = 3.32 bits, h(0.9)= 0.15 bits.

In Figure 1, we see how information content varies as a function of the probability. Going back to my examples,

Event Probability Entropy (bits)
I had breakfast today 1.0 0.0
Tomorrow is my birthday 1/365 8.5
Ran first Cambridge half-marathon 4,500/4,700,000$$^\ast$$ 10.0
Ran first & second Cambridge half-marathon (4,500/4,700,000)x(6,500/4,700,000)$$^{\ast\ast}$$ 13.5
I’ve been to Antarctica 925,000/7,400,000,000$$^{+}$$ 13.0
I’ve lived in Antarctica in winter (June) 25,000/7,400,000,000$$^{++}$$ 18.2

$$^{\ast}$$I have considered the greater Boston population (4.7 millions).

$$^{\ast\ast}$$I have assumed that the two events (running the first and second half-marathon) are independent, but in fact they aren’t.

$$^{+}$$Tourists to Antarctica in 2009-2010 season amounted to 37,000. I have considered 25 years of visits (since the 1990s).

$$^{++}$$ About 1,000 people live in Antarctica in winter. I have considered 25 years of Antarctica colonization.

Do these results agree with your intuition about how much you learned about me by knowing those facts?

### Let’s see how additivity for the information of two independent events works

For two independent events $$e_1$$ and $$e_2$$,

$$P(e_1,e_2) = P(e_1)P(e_2),$$

then

\begin{eqnarray} h(e_1 \& e_2) &=& \log_2 \frac{1}{P(e_1,e_2)}
&=& \log_2 \frac{1}{P(e_1)P(e_2)}
&=& \log_2 \frac{1}{P(e_1)} + \log_2 \frac{1}{P(e_2)}
&=& h(e_1) + h(e_2). \end{eqnarray}

### Note about the log scale

You can define entropy using any logarithm base you want.

• If I say $$I = \ln(\frac{1}{p})\quad\quad$$ I am measuring entropy in nats
• If I say $$I = \log_2(\frac{1}{p})\quad\$$ I am measuring entropy in bits
• If I say $$I = \log_{10}(\frac{1}{p})\quad$$ I am measuring entropy in base 10

The relationship is

\begin{aligned} \ln(a) &= b\quad\,\,\,\ \mbox{means}\quad a = e^{b}\\ \log_2(a) &= b_2\quad\,\, \mbox{means}\quad a = 2^{b_{2}}\\ \log_{10}(a) &= b_{10}\quad \mbox{means}\quad a = 10^{b_{10}}\\ \log_{n}(a) &= b_{n}\quad\,\ \mbox{means}\quad a = n^{b_{n}} \quad\mbox{for an arbitrary base}\, n\\ \end{aligned}

Then

$e^{b} = 2^{b_{2}} = 10^{b_{10}} = n^{b_n},$

Thus, the relationship between natural logs and logs in any other base $$n$$ is

$b = \ln(n^{b_n}) = b_n\ln(n).$

In particular, the relationship between nats and bits is

$b \mbox{nats} = \ln(2)\ b_2.$

In general, I will use $$\log(x)$$ interchangeably with $$\ln(x)$$ to refer to natural logs. If I don’t specify a base, I am using natural logs.

## Information, Redundancy and Compression

The information content is inversely related to redundancy. The less information an event carries, the more it can be compressed.

Shannon postulated that, for an event $$e$$ of probability $$p$$, its information content is the compression to which we should aspire.

## Entropy: the average information content of an ensemble

The entropy of a random variable X is the average information content of an ensemble of all the possible outcomes (variables) for X,

$$\mbox{Entropy}(X) = H(X) = \sum_x p(x) \log \frac{1}{p(x)},$$

with the convention that if $$p(x) = 0$$, then $$0\times \log\frac{1}{0} = 0$$.

Notice that the entropy is never negative, $$H(X) \geq 0,$$

with

$$H(X) = 0\quad\mbox{ if and only if}\quad p(x)=1 \quad\mbox{for one}\quad x.$$

A desirable property of H that Shannon describes in his classical paper is that if a choice is broken down in two different choices the original H should be the weighted sum of the individual entropies. Graphically, using Shannon’s own figure

We have the equality,

$H\left(\frac{1}{2},\frac{1}{3},\frac{1}{6}\right) = H\left(\frac{1}{2},\frac{1}{2}\right) + \frac{1}{2} H\left(\frac{2}{3},\frac{1}{3}\right).$

The coefficient $$\frac{1}{2}$$ weights the fact that the second choice only happens half of the time.

## Relative entropy: the Kullback-Leibler divergence

For two probability distributions $$P(x)$$ and $$Q(x)$$ defined over the same space, their relative entropy or Kullback-Leibler (KL) divergence is defined by

$D_{KL}(P||Q) = \sum_x P(x)\, \log{\frac{P(x)}{Q(x)}}.$

The relative entropy has the following properties

• It satisfies Gibbs’s inequality

$D_{KL}(P||Q) \geq 0.$

It is zero only if $$P=Q$$.

• It is not symmetric. In general,

$D_{KL}(P||Q) \neq D_{KL}(Q||P).$

Thus, it is not a distance although oftentimes it is called the KL distance.

• For continuous distributions it takes the form

$D_{KL}(P||Q) = \int_x p(x)\, \log{\frac{p(x)}{q(x)}}\, dx,$

where $$p(x)$$ and $$q(x)$$ are the densities of $$P$$ and $$Q$$.

## Mutual Information

Mutual Information is a particular case of relative entropy.

For two random variables $$X$$ and $$Y$$, with join probability distribution $$P_{XY}$$, and marginal probability distributions $$P_X$$ and $$P_Y$$, we define the mutual information as

$MI(X,Y) = D_{KL}(P_{XY}||P_X P_Y) = \int_x\int_y p(x,y)\, \log{\frac{p(x,y)}{p(x)p(y)}}\, dx\, dy,$

Mutual information measures how different the random variables $$X, Y$$ are from being independent of each other.

## The principle of maximizing entropy. Or trusting Shannon to be fair

Imagine you have 8 pieces of cake that you want to distribute amongst 3 kids, without cutting the pieces further. How would you do that? Unless you are Gru (despicable me), you would like to be as fair as possible to all three kids.

Here are some options. Would you give?

• 4 4 0
• 4 2 2
• 3 3 2

Then entropy of these three different scenarios are

• $$H(4,4,0) = 1.00$$ bits
• $$H(4,2,2) = 1.50$$ bits
• $$H(3,3,2) = 1.56$$ bits

In the same way, when you are working in an experiment for which you expect different possible outcomes, but you have no idea of which are more favorable, being fair to all of them (that is, not giving preference to each) you should also use the maximum entropy principle.

The maximum entropy distribution is the one that favors any of the possible outcomes the least. Thus, the maximum entropy distribution is the one that should be used in the absence of any other information.

The principle of maximal entropy was introduced by E.T. Jaynes in his 1957 paper “Information theory and statistical mechanics.

### When you do not know anything

Still in the issue of being fair to the 3 kids and the 8 pieces of cake, now consider this option in which you give 2 pieces to each kids and leave 2 pieces out:

$$H(2,2,2) = \frac{2}{6}\log_2\frac{1}{2/6} + \frac{2}{6}\log_2\frac{1}{2/6} + \frac{2}{6}\log_2\frac{1}{2/6} = \log_2(3) = 1.59 \, \mbox{bits}.$$

What do you think about this result? This entropy is even higher than the (3,3,2) case that seemed pretty fair. How come?

You can formally ask, what is the probability distribution that maximizes the entropy of the system? This is an optimization problem.

For a discrete distribution $$X$$ with distribution $$\{p_i\}$$ for $$1\leq i\leq N$$, we want to optimize the entropy,

$$\sum_j p_j \log\frac{1}{p_j}$$

which we do by calculating the point(s) of zero derivative. Because the $$p_j$$’s are not arbitrary but probabilities that sum to one ($$\sum_j p_j = 1$$), we need to incorporate that constraint in the optimization function using a Lagrange multiplier ($$\lambda > 0$$), and the function to optimize is

$$L = \sum_j p_j \log\frac{1}{p_j} -\lambda \left(\sum_j p_j - 1\right).$$

The derivative respect to one such probability $$p_i$$ is

$$\frac{d}{d p_i} L = \log\frac{1}{p_i} - p_i \frac{1}{p_i} - \lambda,$$

which becomes optimal when $$\frac{d}{d p_i} L = 0$$, then

$$\log\frac{1}{p^{\ast}_i} - 1 - \lambda = 0,$$

or

$$p^{\ast}_i = e^{-(1+\lambda)}.$$

Because $$\sum_i p_i = 1$$, that implies that $$1 = N e^{-(1+\lambda)}$$ or $$e^{-(1+\lambda)} = 1/N$$.

Thus, the probability that maximizes the entropy is the uniform distribution $$p^{\ast}_i = \frac{1}{N}, \quad \mbox{for}\quad 1\leq i\leq N.$$

Notice that in the case of the three kids each with 2 pieces of cake, we used a uniform distribution.

We could have done the same for a continuous distribution in a range $$[a,b]$$, then the maximum entropy uniform distribution looks like $$p(x) = \frac{1}{b-a}, \quad \mbox{for}\quad x\in [a,b],$$ such that, $$\int_{x=a}^{x=b} p(x) dx = 1$$.

### When you know the mean

But sometimes we have some information, for instance, we may know (or have an estimate) of the mean of the distribution ($$\mu$$). That adds one more constraint to the optimization problem, and one more Lagrange multiplier $$\lambda_{\mu}>0$$. Given

$$L = \int_y p(y) \log\frac{1}{p(y)} dy -\lambda \left(\int_y p(y) dy - 1\right) -\lambda_{\mu} \left(\int_y y p(y) dy - \mu\right),$$

we want to calculate the $$p(x)$$ that optimizes $$L$$, $$\left.\frac{d L}{d p(x)}\right|_{p(x)=p^\ast(x)} = 0$$

That is, $$\log\frac{1}{p^{\ast}(x)} - 1 - \lambda -\lambda_{\mu} x = 0,$$ or $$p^{\ast}(x) = e^{-(1+\lambda + x\lambda_{\mu})}.$$

We can eliminate $$\lambda$$ and $$\lambda_{\mu}$$ using the two constraints again.

Using the normalization condition, we obtain, $$e^{-(1+\lambda)} = \frac{1}{\int_x e^{-x\lambda_{\mu}} dx},$$ then the maximum-entropy distribution $$p^{\ast}(x) = \frac{e^{-x\lambda_{\mu}}}{\int_y e^{-y\lambda_{\mu}} dy},$$ that is the exponential distribution.

The value of $$\lambda_{\mu}$$ can be determined using the condition $$\mu \equiv \int_x x p^{\ast}(x) dx = \frac{\int_x x e^{-x\lambda_{\mu}} dx}{\int_y e^{-y\lambda_{\mu}} dy}.$$

So far, we have said nothing about the range of x. This equation has a solution only if all $$x$$ are positive.

Now, we remember that $$\int_0^{\infty} e^{-\lambda_{\mu} x} dx = \frac{1}{\lambda_{\mu}}$$

and

$$\int_0^{\infty} x\, e^{-\lambda_{\mu} x} dx = \frac{1}{\lambda_{\mu}^2}.$$

(You can find these results in Spiegel’s manual).

The solution is $$\mu = \frac{1/\lambda_{\mu}^2}{1/\lambda_{\mu}} = \frac{1}{\lambda_{\mu}}.$$

Thus, the exponential distribution $$p(x) = \frac{1}{\mu} e^{-x/\mu} \quad x\geq 0,$$ is the maximum entropy distribution amongst all distributions in the $$[0,\infty)$$ range that have mean $$\mu$$.

### When you know the standard deviation

If what we know is the standard deviation of the distribution ($$\sigma$$), then there is a different condition (and it corresponding Lagrange multiplier $$\lambda_{\sigma}>0$$) in the optimization method

For

$$L = \int_y p(y) \log\frac{1}{p(y)} dy -\lambda \left(\int_y p(y) dy - 1\right) -\lambda_{\sigma} \left(\int_y (y-\mu)^2 p(y) dy - \sigma^2\right)$$

where $$\mu$$ and $$\sigma$$ are the parameters of the Gaussian distribution. it is not a fixed value.

We are only imposing to known standard deviation $$\sigma$$. In this case, optimization implies,

$$\left.\frac{d L}{d p(x)}\right|_{p(x)=p^\ast(x)} = 0.$$

or

$$\log\frac{1}{p^{\ast}(x)} - 1 - \lambda - \lambda_{\sigma}(x-\mu)^2 = 0.$$

That is,

$$\log\frac{1}{p^{\ast}(x)} = 1 + \lambda + \lambda_{\sigma}(x-\mu)^2.$$

Taking the exponential in both sides of the equation, we have

$$p^{\ast}(x) = e^{-(1+\lambda) - (x-\mu)^2\lambda_{\sigma}}.$$

Using the normalization condition, we obtain, $$e^{-(1+\lambda)} = \frac{1}{\int_x e^{-(x-\mu)^2\lambda_{\sigma}} dx},$$

then

$$p^{\ast}(x) = \frac{e^{-(x-\mu)^2\lambda_{\sigma}}}{\int_y e^{-(y-\mu)^2\lambda_{\sigma}} dy},$$

that is the Gaussian distribution.

The value of $$\lambda_{\sigma}$$ can be determined using the condition $$\sigma^2 \equiv \int_x (x-\mu)^2 p(x) dx = \frac{\int_x (x-\mu)^2 e^{-(x-\mu)^2\lambda_{\sigma}} dx}{\int_y e^{-(y-\mu)^2\lambda_{\sigma}} dy}.$$

Now, we remember that

$$\int_{-\infty}^{\infty} e^{-(x-\mu)^2\lambda_{\sigma}} dx = \sqrt{\frac{\pi}{\lambda_{\sigma}}}$$ and $$\int_{-\infty}^{\infty} (x-\mu)^2 e^{-(x-\mu)^2\lambda_{\sigma}} dx = \sqrt{\frac{\pi}{\lambda_{\sigma}}} \frac{1}{2\lambda_{\sigma}}.$$

(You can find these results in Spiegel’s manual).

The solution is $$\sigma^2 = \frac{1}{2\lambda_{\sigma}}.$$

Thus, the Gaussian distribution

$$p(x) = \frac{1}{\sqrt{\pi/\lambda_{\sigma}}}\ e^{-(x-\mu)^2\lambda_{\sigma}} = \frac{1}{\sqrt{2\pi\sigma^2}}\ e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$

is the maximum entropy distribution amongst all distributions in the $$(-\infty,\infty)$$ range that have variance $$\sigma^2$$.

In summary, when you want to model a set of observations with a probability distribution, them maximum entropy principle tells you that if you don’t know anything about the probability of the different outcomes, you should use a uniform distribution, if you know the mean, the maximum entropy distribution is an exponential distribution, and if you know the standard deviation the maximum entropy distribution is the Gaussian distribution.