MCB111: Mathematics in Biology (Fall 2019)


week 06:

Probabilistic models and Inference

Finding the breakpoints of backcrossed flies

For this homework, we are going to infer the breakpoints of a male fly for which we have mapped reads in a region of 10,000 nucleotides.

The fly is a backcross resulting from a large-scale experiment in which from parents Drosophila Simulans (Dsim) and Drosophila Sechellia (Dsec), F1 hybrid females (Dsec/Dsim) were crossed with Dsec/Dsec parent males.

We also know the sequences of Dsim () and Dsec () for the region of interest, which are giving here.

You should use all the assumptions we make in the w06 lecture, with one simplification. Since we know the Dsim/Dsim ancestry is not possible, we are going to assume only two possible ancestry states

Here are some more details about the conditional probabilities that we skipped on the lecture notes.

The model

The

Lets introduce the notation described in Figure 1,

to indicate the probability of staying in the same ancestry.

Let’s assume that due to previous experience with these two Drosophila species, we expect that there is going to be on average one breakpoint (that is a change from homozygous BB to heterozygous AB or viceversa) per chromosome.


Figure 1. A graphical representation of the HMM.

Our synthetic chromosome has length , in general, if the number of expected breakpoint is , then the length of a continuous regions without breakpoints is

and because for a geometric distribution

then we set the probability of staying in the same ancestry as

if we expect breaks in a region of length .

The

Let’s introduce , where is the count of residue in the reads that cover position .

The conditional probability is a multinomial distribution

For some Gaussian noise with standard deviation given by and

and

For this homework, the person who generated the reads tells you that .

As an example, for locus 10 in Figure 1, we have

and


Figure 2. Reads and genome data are given for three hypothetical loci as examples.

The

As for the probability of the genotype given the ancestry, we have

where is the residue at position for strain , and similarly for .

For locus 10 in Figure 1, we have

Andolfatto et al. assume a more general case in which there could be sequencing error, thus a non-zero probability for other genotypes. We have simplified the model and assume no sequencing error at all.

Finally

is given by

For locus 10 in Figure 1, we have

Since is an error term, then and , then we have

as one would expect by looking at the data in Figure 1.

The data

You are given two files

Implementation issues

The forward and backward algorithms require to multiply many small numbers (probabilities). If you try to implement them as multiplications, you will see that soon you run out of precission. Numbers will be so small that you cannot distinguish them.

The solution is to work in logarithm space. Insted of using probabilities, you take the logarithm of the probabilities, and instead of multiplying them you sum them.

So for instance, the first recursion in the forward algorithm

becomes

In the previous expression, we still need to calculate terms of the form

A numerical trick to keep this quantity in bounds is to find the largest of the two values, say and to calculate

Now the term we are exponentiating is and that exponential is always small.

There are functions both in Matlab and Python that implement this “logsumexp” calculation.

A luxury you will not have in real live

Because this is a simulated experiment, as a control, I can give you information about the actual breakpoints, so that you can check whether your inference is getting the right answer or not.

However, one thing you could do in real live is to generate synthetic data to test your implementation before you deploy your algorithm on real data for which you do not know the real answer.

Because the model we have introduced is probabilistic, for each node, you have defined a probability distribution from which you can sample. A probabilistic model is a generative model. That is how I generated the data for this homework. That is a powerful tool to have.