# Introduction to Information Theory

## Q1

In Sections, we derived the conditions that determine when we should expect to find a single copy of a motif of length $$l$$ in a genome of length $$L$$ when all four residues A, C, G, T are equally likely. In general, genomes have compositional biases. The human genome ($$3.1\times 10^{9}$$ bases) has an average G-C content of 42%. Some organisms, including some thermophilic Archaea, can be really extreme in their base composition. The malaria parasite Plasmodium falciparum, for example, has a GC content of 20% and a genome size of roughly $$10^7$$ bases.

• Derive an expression for the length $$l$$ of a motif expected to be unique in a genome of length $$L$$ with arbitrary probabilities $$p_A, p_C, p_G, p_T$$.

• Particularize your result for the human genome and the P. falciparum genome.

• Sample a human-like genome and show how well your estimation works. Select a particular motif of the estimated unique length, and test experimentally how many times you actually find the motif in your simulated genome. Repeat the experiment to get a variance of your estimate.
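The simulation step above could be set up along the following lines. This is only a sketch under stated assumptions, not the expected solution: it assumes $$p_G = p_C$$ and $$p_A = p_T$$, it uses an entropy-based estimate $$l \approx \log_2 L / H$$ (one possible answer to the first bullet, obtained from $$L\, 2^{-lH} = 1$$), and it uses a genome far shorter than the human one so that it runs quickly. The function names `base_probs`, `unique_length`, and `count_motif` are illustrative.

```python
import numpy as np

def base_probs(gc):
    """Probabilities for (A, C, G, T), assuming p_G = p_C and p_A = p_T."""
    return np.array([(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2])

def unique_length(L, p):
    """Entropy-based estimate (an assumption): L * 2**(-l*H) = 1 => l = log2(L) / H."""
    H = -np.sum(p * np.log2(p))  # entropy per base, in bits
    return np.log2(L) / H

def count_motif(genome, motif):
    """Count occurrences (overlaps allowed) of an integer-coded motif in a genome."""
    windows = np.lib.stride_tricks.sliding_window_view(genome, len(motif))
    return int(np.sum(np.all(windows == motif, axis=1)))

rng = np.random.default_rng(0)
L = 200_000                      # toy genome length, much shorter than 3.1e9
p = base_probs(0.42)             # human-like GC content
l = int(np.ceil(unique_length(L, p)))

counts = []
for _ in range(20):              # repeat the experiment to estimate a variance
    genome = rng.choice(4, size=L, p=p)   # bases coded as 0..3 = A, C, G, T
    motif = rng.choice(4, size=l, p=p)    # a motif drawn with the same biases
    counts.append(count_motif(genome, motif))

print("motif length:", l)
print("mean count:", np.mean(counts), "variance:", np.var(counts))
```

Drawing the motif from the same biased distribution makes it a "typical" motif; an AT-rich or GC-rich motif chosen by hand would have a different expected count, which is worth comparing.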

## Choose between Q2 or Q3

Q2 is a theoretical question; no coding is involved. You have the option of answering Q2 or going straight to Q3. You will get extra credit if you answer both.

### Q2

Consider a system of 3 neurons that are all interconnected. We are going to assume that each neuron $$n_i$$ has two states: +1 for firing and -1 for not firing.

Your experimental design allows you to measure only two neurons simultaneously (unfortunately, not all three at the same time), and what you can obtain are the averages of those pair measurements

\begin{aligned} J_{12} = < n_1 n_2 >_{\mbox{obs}} \\ J_{13} = < n_1 n_3 >_{\mbox{obs}}\\ J_{23} = < n_2 n_3 >_{\mbox{obs}}\\ \end{aligned}

Obviously, these averages are symmetric, so we only need to consider the three cases above, since $$< n_1 n_2>_{\mbox{obs}} = < n_2 n_1>_{\mbox{obs}}$$, etc.

• Calculate the probability distribution that describes the state of the three neurons,

\begin{equation} P(n_1, n_2, n_3) \end{equation}

that has the maximum entropy if you impose the average correlations $$J_{12}, J_{13},J_{23}$$ that you have observed.

Notice that the correlations are defined by

\begin{aligned} < n_1 n_2 > = \sum_{n_1=1,-1}\sum_{n_2=1,-1}\sum_{n_3=1,-1} n_1 n_2\ P(n_1, n_2, n_3)\\ < n_1 n_3 > = \sum_{n_1=1,-1}\sum_{n_2=1,-1}\sum_{n_3=1,-1} n_1 n_3\ P(n_1, n_2, n_3)\\ < n_2 n_3 > = \sum_{n_1=1,-1}\sum_{n_2=1,-1}\sum_{n_3=1,-1} n_2 n_3\ P(n_1, n_2, n_3) \end{aligned}

• Can you generalize to $$n$$ neurons for which you know all average pairwise measurements?
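The correlation sums defined above are easy to evaluate numerically by enumerating the eight joint states. The sketch below does not solve the maximum entropy problem; it is only a way to check whether a candidate distribution reproduces the observed $$J_{12}, J_{13}, J_{23}$$ and to compute its entropy. The names `states`, `pair_correlations`, and `entropy_bits` are illustrative.

```python
import itertools
import numpy as np

# All 2**3 joint states (n1, n2, n3), each entry +1 or -1.
states = np.array(list(itertools.product([1, -1], repeat=3)))

def pair_correlations(P):
    """Return {(i, j): <n_i n_j>} for a distribution P over the rows of `states`."""
    P = np.asarray(P, dtype=float)
    assert np.isclose(P.sum(), 1.0), "P must be normalized"
    return {
        (i + 1, j + 1): float(np.sum(states[:, i] * states[:, j] * P))
        for i, j in itertools.combinations(range(3), 2)
    }

def entropy_bits(P):
    """Shannon entropy of P in bits, skipping zero-probability states."""
    P = np.asarray(P, dtype=float)
    nz = P > 0
    return float(-np.sum(P[nz] * np.log2(P[nz])))

# Example: the uniform distribution has no pairwise correlations.
uniform = np.full(8, 1 / 8)
print(pair_correlations(uniform))
print(entropy_bits(uniform))
```

Replacing `itertools.product([1, -1], repeat=3)` with `repeat=n` gives the same check for $$n$$ neurons, although the number of states grows as $$2^n$$.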

If you need some inspiration, you may want to look into this paper: Schneidman E, Berry MJ, Segev R, Bialek W (2006) Weak pairwise correlations imply strongly correlated network states in a neural population. In this manuscript, Schneidman et al. show that in the vertebrate retina, the maximum entropy model that captures just pairwise correlations (like the one we have introduced here) is sufficient to account for the majority of other non-pairwise collective behaviors.

### Q3

For this question, we are going back to our experiment with P. falciparum aggregation probability. In Q2 of the previous homework, you were asked to compare the probability distribution for the aggregation probability from the bacteria in your current (low sample, $$N=10$$) experiment under some new conditions with that derived from a well-determined set of standard conditions.

You were asked to compare several samples of similar size in a qualitative way as to whether you thought they could be from the same distribution or not.

Now, with the material we learned in this week’s lecture, you can address that question again doing a quantitative comparison of the actual probability distributions. Hopefully, your conclusions from last week will not change (or if they do, explain), but you should be able to give numbers for the corresponding comparisons.

• For instance, a good exercise would be to determine if the KL divergence you observe from your data1 distribution to an $$N=10$$ subsample of data_default is distinguishable from the same quantity calculated between two $$N=10$$ samples of data_default.

• In fact, you can repeat the experiment several times and calculate the PDF (or CDF) for two random variables: one of the random variables measures the difference between data1 and $$N=10$$ samples of data_default, the other random variable measures the difference between two $$N=10$$ data_default samples. You can calculate the difference by drawing the two samples and reporting the Kullback-Leibler divergence between their distributions.
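The repeated-sampling comparison described above might look like the following sketch. Several things here are assumptions rather than part of the assignment: the arrays `data_default` and `data1` are synthetic stand-ins for the homework data, the aggregation probabilities are assumed to lie in $$[0, 1]$$, the samples are binned into five equal bins, and a pseudocount of 1 per bin is added so that the KL divergence stays finite with only $$N=10$$ points.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two discrete distributions on the same bins."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

def histogram_probs(sample, bins, pseudocount=1.0):
    """Binned probabilities with a pseudocount (an assumption) to avoid empty bins."""
    counts, _ = np.histogram(sample, bins=bins)
    counts = counts + pseudocount
    return counts / counts.sum()

def kl_samples(a, b, bins):
    """KL divergence from the binned distribution of a to that of b."""
    return kl_divergence(histogram_probs(a, bins), histogram_probs(b, bins))

rng = np.random.default_rng(1)
# Hypothetical stand-ins for the homework data sets.
data_default = rng.beta(2, 5, size=1000)
data1 = rng.beta(5, 2, size=10)
bins = np.linspace(0, 1, 6)  # five equal-width bins on [0, 1]

n_rep = 500
# Random variable 1: data1 versus an N=10 subsample of data_default.
kl_new = np.array([
    kl_samples(data1, rng.choice(data_default, 10, replace=False), bins)
    for _ in range(n_rep)
])
# Random variable 2: two independent N=10 subsamples of data_default.
kl_null = np.array([
    kl_samples(rng.choice(data_default, 10, replace=False),
               rng.choice(data_default, 10, replace=False), bins)
    for _ in range(n_rep)
])

print("data1 vs default:   mean KL =", kl_new.mean())
print("default vs default: mean KL =", kl_null.mean())
```

Plotting histograms or empirical CDFs of `kl_new` and `kl_null` then shows how much the two random variables overlap. The result depends on the binning and on the pseudocount, so it is worth varying both.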