MCB111: Mathematics in Biology (Fall 2019)
week 06:
Probabilistic models and Inference
Finding the breakpoints of backcrossed flies
For this homework, we are going to infer the breakpoints of a male fly for which we have mapped reads in a region of 10,000 nucleotides.
The fly is a backcross resulting from a largescale experiment in which from parents Drosophila Simulans (Dsim) and Drosophila Sechellia (Dsec), F1 hybrid females (Dsec/Dsim) were crossed with Dsec/Dsec parent males.
We also know the sequences of Dsim () and Dsec () for the region of interest, which are giving here.
You should use all the assumptions we make in the w06 lecture, with one simplification. Since we know the Dsim/Dsim ancestry is not possible, we are going to assume only two possible ancestry states
Here are some more details about the conditional probabilities that we skipped on the lecture notes.
The model
The
Lets introduce the notation described in Figure 1,
to indicate the probability of staying in the same ancestry.
Let’s assume that due to previous experience with these two Drosophila species, we expect that there is going to be on average one breakpoint (that is a change from homozygous BB to heterozygous AB or viceversa) per chromosome.
Figure 1. A graphical representation of the HMM.
Our synthetic chromosome has length , in general, if the number of expected breakpoint is , then the length of a continuous regions without breakpoints is
and because for a geometric distribution
then we set the probability of staying in the same ancestry as
if we expect breaks in a region of length .
The
Let’s introduce , where is the count of residue in the reads that cover position .
The conditional probability is a multinomial distribution
For some Gaussian noise with standard deviation given by and
and
For this homework, the person who generated the reads tells you that .
As an example, for locus 10 in Figure 1, we have
and
Figure 2. Reads and genome data are given for three hypothetical loci as examples.
The
As for the probability of the genotype given the ancestry, we have
where is the residue at position for strain , and similarly for .
For locus 10 in Figure 1, we have
Andolfatto et al. assume a more general case in which there could be sequencing error, thus a nonzero probability for other genotypes. We have simplified the model and assume no sequencing error at all.
Finally
is given by
For locus 10 in Figure 1, we have
Since is an error term, then and , then we have
as one would expect by looking at the data in Figure 1.
The data
You are given two files

The mapped reads for a region of 10,000 nucleotides.
This file has 10,000 rows, each with four columns. Each column represents the number of As, Cs, Gs, and Ts (in that order) that appeared in all the reads that mapped to that position.

The sequence of the genomic regions for the two species “A” and “B”, in fasta format.
Implementation issues
 need to work in logarithms
The forward and backward algorithms require to multiply many small numbers (probabilities). If you try to implement them as multiplications, you will see that soon you run out of precission. Numbers will be so small that you cannot distinguish them.
The solution is to work in logarithm space. Insted of using probabilities, you take the logarithm of the probabilities, and instead of multiplying them you sum them.
So for instance, the first recursion in the forward algorithm
becomes
 sum exponentials in a clever way
In the previous expression, we still need to calculate terms of the form
A numerical trick to keep this quantity in bounds is to find the largest of the two values, say and to calculate
Now the term we are exponentiating is and that exponential is always small.
There are functions both in Matlab and Python that implement this “logsumexp” calculation.
A luxury you will not have in real live
Because this is a simulated experiment, as a control, I can give you information about the actual breakpoints, so that you can check whether your inference is getting the right answer or not.
However, one thing you could do in real live is to generate synthetic data to test your implementation before you deploy your algorithm on real data for which you do not know the real answer.
Because the model we have introduced is probabilistic, for each node, you have defined a probability distribution from which you can sample. A probabilistic model is a generative model. That is how I generated the data for this homework. That is a powerful tool to have.