MCB111: Mathematics in Biology (Spring 2018)
week 06:
Probabilistic models and Inference
In this week’s lectures, we discuss probabilistic graphical models and how to do inference using them. As a practical case, we are going to implement the probabilistic model introduced by Andolfatto et al. to assign ancestry to chromosomal segments.
There are many good books to learn about probabilistic models. “Probabilistic graphical models: principles and techniques” (by Koller & Friedman) is a comprehensive source about more general probabilistic models than the one we are going to study here. “Biological sequence analysis” (by Durbin et al.), while it concentrates on genomic sequences, is a good reference to learn about some basic (and widely used) probabilistic models, as it provides detailed descriptions of some of the fundamental algorithms. Another good source is “Pattern Theory” (by Mumford and Desolneux).
Probabilistic Graphical Models
Definition
A probabilistic graphical model is a graphical representation of the relationships between different random variables. It consists of nodes that represent random variables and edges that represent their relationships by conditional probabilities. Probabilistic graphical models also go by the name of Bayesian networks (BN).
Let me illustrate with an example in Figure 1, taken from Koller & Friedman.
Figure 1. An example of a Bayesian network (BN).
A professor is writing a recommendation letter for a student. Let “letter” be the strength of the letter. The strength of the letter is going to be based on the student’s “grades” and “SAT” scores, both of which are available to the professor. The model in Figure 1 assumes that the grades depend both on the student’s “intelligence” and on the course’s “difficulty”, while the SAT scores depend only on the intelligence.
The parts of the Bayesian network
The random variables are
 intelligence ($I$)
 course difficulty ($D$)
 course grades ($G$)
 SAT scores ($S$)
 strength of letter ($L$)
Thus, we have a joint distribution of 5 random variables $P(I, D, G, S, L)$.
With all generality (using the chain rule) we can write
$$P(I, D, G, S, L) = P(I)\, P(D \mid I)\, P(G \mid I, D)\, P(S \mid I, D, G)\, P(L \mid I, D, G, S).$$
This is a complex set of possible dependencies, some of which do not really exist for the problem at hand; for instance, the intelligence of the student does not depend on the difficulty of the courses she takes, thus in the expression above $P(D \mid I) = P(D)$.
What the BN in Figure 1 describes is how to simplify the joint distribution of the 5 random variables. It is telling us which are the dependencies (amongst all possible) that we are going to consider. The edges represent conditional dependences between the random variables (nodes), such that the node at the end of the arrow depends on the node from where the edge starts. In a general BN, a node can have many arrows landing on it (many dependencies) from other nodes.
The graphical model in Figure 1 says that
$$P(S \mid I, D, G) = P(S \mid I), \qquad P(L \mid I, D, G, S) = P(L \mid G),$$
so that the joint probability of all random variables is given by
$$P(I, D, G, S, L) = P(I)\, P(D)\, P(G \mid I, D)\, P(S \mid I)\, P(L \mid G).$$
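As a quick sanity check of this factorization, here is a small Python sketch with made-up conditional probability tables (the numbers are hypothetical, not taken from Koller & Friedman):

```python
from itertools import product

# Hypothetical CPTs -- illustrative numbers only. Encoding: I, D, S, L are
# binary (0 = low, 1 = high); the grade G takes values 1, 2, 3.
P_I = {0: 0.7, 1: 0.3}                                   # P(I)
P_D = {0: 0.6, 1: 0.4}                                   # P(D)
P_G = {(0, 0): {1: 0.30, 2: 0.40, 3: 0.30},              # P(G | I, D)
       (0, 1): {1: 0.05, 2: 0.25, 3: 0.70},
       (1, 0): {1: 0.90, 2: 0.08, 3: 0.02},
       (1, 1): {1: 0.50, 2: 0.30, 3: 0.20}}
P_S = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.20, 1: 0.80}}     # P(S | I)
P_L = {1: {0: 0.10, 1: 0.90},                            # P(L | G)
       2: {0: 0.40, 1: 0.60},
       3: {0: 0.99, 1: 0.01}}

def joint(i, d, g, s, l):
    """Joint probability from the BN factorization:
    P(I,D,G,S,L) = P(I) P(D) P(G|I,D) P(S|I) P(L|G)."""
    return P_I[i] * P_D[d] * P_G[(i, d)][g] * P_S[i][s] * P_L[g][l]

# The factorization defines a proper joint: it sums to 1 over all assignments.
total = sum(joint(i, d, g, s, l)
            for i, d, g, s, l in product((0, 1), (0, 1), (1, 2, 3), (0, 1), (0, 1)))
print(round(total, 10))  # → 1.0

# Example inference: P(intelligence high | strong letter, high SAT),
# summing out the hidden variables D and G.
num = sum(joint(1, d, g, 1, 1) for d in (0, 1) for g in (1, 2, 3))
den = sum(joint(i, d, g, 1, 1) for i in (0, 1) for d in (0, 1) for g in (1, 2, 3))
print(round(num / den, 3))
```

The last two lines preview the inference computations discussed next: evidence is fixed, hidden variables are summed out, and the ratio gives a posterior.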
What to do with a Bayesian network?
Inference. Inference about some of the random variables, assuming that we have information about others.
In a Bayesian network, the random variables for which we have information become the evidence. From the other variables, we can decide which to do inference on and which to treat as hidden variables, which will be summed over all their possible values in order to make our inferences.
Several scenarios for the student letter example,

We know the outcome of “letter”, and also have access to the grades and SAT scores, and we may want to infer the intelligence of the student.
In this case the course difficulty becomes a “hidden” variable that you will sum over all its possible values.

You want to predict the strength of the letter you will be able to write by knowing the grades and SAT scores, in which case both intelligence and course difficulty become hidden variables.
Let’s approach inference with a probabilistic graphical model using a biology example implemented recently.
Our example: From genomic reads to recombination breakpoints
The probabilistic model proposed by Andolfatto et al. is given in their Figure S8, reproduced here. The problem is to assign ancestry (homozygous for one of the parents, or heterozygous) to the different genomic loci of a given male fly by analyzing genomic reads.
Andolfatto uses male flies created by crossing female F1 hybrids of Drosophila simulans (Dsim) and Drosophila sechellia (Dsec) with Dsec males. The resulting males would have an ancestry (Z) that is either Dsec/Dsec or Dsec/Dsim at different loci (see Figure S5 from Andolfatto et al., reproduced here). The genotype (G), that is, the collection of actual nucleotides in both alleles, is unknown, and will be treated as a nuisance variable. What we know from the experiment is a collection of mapped reads (X) for each locus (genomic position).
For a given locus in one individual, the question is how to determine which reads correspond to parent Dsim versus parent Dsec, which will in turn determine the ancestry (Z) of the locus being either Dsec/Dsec or Dsec/Dsim. Because the F1 females were crossed only to Dsec parent males (but not to Dsim parent males), the genotype Dsim/Dsim does not occur. (The reason for only crossing F1 females is that the F1 males are sterile.)
Figure 1 from Andolfatto et al. (http://genome.cshlp.org/content/21/4/610.full)
Andolfatto’s model assumes a more general situation in which the ancestry of a given individual at different loci, named $Z_i$, could be either AA, or BB, or AB (Figure 1 from Andolfatto et al., reproduced here).
The Model (a Hidden Markov Model)
The random variables
Figure 2 shows specific examples of the three random variables for the problem of inferring ancestry from mapped reads.
The random variables for a given locus $i$ (i.e. a genomic position) are

$X_i$ is the collection of reads for locus $i$,
$X_i = (x_i^A, x_i^C, x_i^G, x_i^T)$, where $x_i^a$ is the number of reads that have residue “a” at the locus,…

$G_i$ is the genotype at locus $i$.
$G_i \in \{AA, AC, AG, AT, CC, CG, CT, GG, GT, TT\}$ are the actual residues at the locus for the two alleles. We have no information about order, thus there are 10 possible values for $G_i$.

$Z_i$ is the ancestry at locus $i$,
$Z_i \in \{AA, AB, BB\}$, where $Z_i = AA$ means the locus is homozygous for parent $A$, $Z_i = BB$ means the locus is homozygous for parent $B$, and $Z_i = AB$ means the locus is heterozygous.
Figure 2. An illustration of the problem of inferring ancestry from mapped reads. Three loci are given as examples. In principle, each $Z_i$ can take three values. For the particular experiment described by Andolfatto et al. (assigning $A \equiv$ Dsim and $B \equiv$ Dsec), only two cases are possible, $Z_i = AB$ and $Z_i = BB$, as the backcrosses are between Dsim/Dsec females and Dsec/Dsec males (Figure S5). Andolfatto describes the more general model with three possible ancestors.
The conditional probabilities
We want to calculate the joint probability of all random variables for all sites $i = 1, \ldots, L$, for a chromosome of length $L$,
$$P(X_1, \ldots, X_L,\; G_1, \ldots, G_L,\; Z_1, \ldots, Z_L).$$
The Bayesian network in Andolfatto’s Figure S8 establishes the following dependencies:
$$P(Z_i \mid Z_1, \ldots, Z_{i-1}) = P(Z_i \mid Z_{i-1}), \qquad P(G_i \mid Z_1, \ldots, Z_i) = P(G_i \mid Z_i), \qquad P(X_i \mid G_1, \ldots, G_i) = P(X_i \mid G_i).$$
This particular set of dependencies in the joint probability, such that the probabilities at one node $i$ depend only on the probabilities at the node before, $i-1$, defines this model as a Hidden Markov Model. Hidden Markov Models (HMMs) are a subset of the much larger set of probabilistic graphical models or Bayesian networks.
This graphical model also tells us (see Figure S8)
$$P(X_1, \ldots, X_L, G_1, \ldots, G_L, Z_1, \ldots, Z_L) = P(Z_1)\, P(G_1 \mid Z_1)\, P(X_1 \mid G_1) \prod_{i=2}^{L} P(Z_i \mid Z_{i-1})\, P(G_i \mid Z_i)\, P(X_i \mid G_i).$$
The inference question
An example of the inference problem is given in Figure 2, which shows three loci as examples. Each locus has 8 mapped reads (the evidence). We also know the genome sequence of the two parental species, A (top, in red) and B (top, in blue). The data for the first locus almost unambiguously indicate that the ancestry is AB. For the two other loci, either due to ambiguity or to errors in the reads, both possibilities AB and BB are plausible. The question is: which one is more likely?
We know the reads $X_i$, and we want to infer the ancestry $Z_i$ for each locus $i$. Thus, the genotypes $G_i$ are hidden random variables that we want to integrate out.
The joint marginal is given by
$$P(X_1, \ldots, X_L, Z_1, \ldots, Z_L) = \sum_{G_1} \cdots \sum_{G_L} P(X_1, \ldots, X_L, G_1, \ldots, G_L, Z_1, \ldots, Z_L).$$
Because of the Markovian nature of this particular Bayesian network,
$$P(X_i \mid Z_i) = \sum_{G_i} P(X_i \mid G_i)\, P(G_i \mid Z_i),$$
so the marginal probabilities are
$$P(X_1, \ldots, X_L, Z_1, \ldots, Z_L) = P(Z_1)\, P(X_1 \mid Z_1) \prod_{i=2}^{L} P(Z_i \mid Z_{i-1})\, P(X_i \mid Z_i).$$
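The genotype marginalization $P(X_i \mid Z_i) = \sum_{G_i} P(X_i \mid G_i)\, P(G_i \mid Z_i)$ can be sketched numerically. The exact forms of $P(X_i \mid G_i)$ and $P(G_i \mid Z_i)$ are left for the homework; the simple read-error model and the deterministic genotype model below are hypothetical stand-ins, not Andolfatto’s actual distributions:

```python
from itertools import combinations_with_replacement

NTS = "ACGT"
eps = 0.01   # hypothetical per-read sequencing error rate (made up)

def p_read_given_genotype(a, g):
    """P(one read shows residue a | genotype g = unordered pair of alleles):
    pick one of the two alleles at random, read it correctly with
    probability 1 - eps, otherwise misread it as one of the 3 other residues."""
    return sum(0.5 * ((1 - eps) if a == allele else eps / 3) for allele in g)

def p_reads_given_genotype(counts, g):
    """P(X_i | G_i) assuming reads are independent given the genotype;
    counts maps each residue to its number of reads at the locus."""
    p = 1.0
    for a, n in counts.items():
        p *= p_read_given_genotype(a, g) ** n
    return p

def p_genotype_given_ancestry(g, z, resA, resB):
    """Toy P(G_i | Z_i): the genotype is exactly the parental residues
    (resA from parent A, resB from parent B), with no mutation."""
    expected = {"AA": (resA, resA), "BB": (resB, resB),
                "AB": tuple(sorted((resA, resB)))}[z]
    return 1.0 if tuple(sorted(g)) == tuple(sorted(expected)) else 0.0

def p_reads_given_ancestry(counts, z, resA, resB):
    """P(X_i | Z_i): sum of P(X_i | G_i) P(G_i | Z_i) over the 10 genotypes."""
    return sum(p_reads_given_genotype(counts, g) *
               p_genotype_given_ancestry(g, z, resA, resB)
               for g in combinations_with_replacement(NTS, 2))

# 8 reads at a locus where parent A carries 'C' and parent B carries 'T':
counts = {"C": 4, "T": 4}
for z in ("AA", "AB", "BB"):
    print(z, p_reads_given_ancestry(counts, z, "C", "T"))
```

With an even split of reads between the two parental residues, the heterozygous ancestry AB comes out far more likely than AA or BB, matching the intuition from Figure 2.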
The parameters
The conditional probabilities of this probabilistic model can be separated into two classes, depending on whether or not they are conditioned on the locus before:

The transition probabilities $P(Z_i \mid Z_{i-1})$.
Because any ancestry can be one of three cases, $\{AA, AB, BB\}$, those represent three conditional probability distributions $P(Z_i \mid Z_{i-1} = AA)$, $P(Z_i \mid Z_{i-1} = AB)$, $P(Z_i \mid Z_{i-1} = BB)$,
given by the probabilities $P(Z_i = z' \mid Z_{i-1} = z)$ for each pair of ancestries $z, z'$,
and normalized such that $\sum_{z'} P(Z_i = z' \mid Z_{i-1} = z) = 1$ for each $z$.

The emission probabilities $P(X_i \mid Z_i) = \sum_{G_i} P(X_i \mid G_i)\, P(G_i \mid Z_i)$.
More about these probability distributions when we get to the homework.
Figure 3 shows a different graphical representation of the Andolfatto HMM, where all the parameters that connect one locus with the next are explicit.
Figure 3. Graphical representation of the HMM for assigning ancestry.
Andolfatto et al. set these parameters by looking at actual data.
Double breakpoints are not observed,
$$P(Z_i = BB \mid Z_{i-1} = AA) = P(Z_i = AA \mid Z_{i-1} = BB) = 0.$$
If we assume that all segments have on average the same length, then we can use one single parameter $\tau$ to characterize when there is no breakpoint, as
$$P(Z_i = z \mid Z_{i-1} = z) = 1 - \tau \quad \text{for each ancestry } z.$$
Finally, Andolfatto uses the fact that on average there is about 1 breakpoint per chromosome in Drosophila species. The length of a region without breakpoints is determined by a geometric distribution of Bernoulli parameter $\tau$.
Then, having on average only two regions (one breakpoint) per chromosome of length $L$ results in a mean region length of
$$\frac{1}{\tau} = \frac{L}{2},$$
or
$$\tau = \frac{2}{L},$$
for a chromosome of length $L$.
Next week’s lectures will cover parameter estimation for a probabilistic model.
Inference
Using the probabilistic model (Figure 1 and Andolfatto’s Figure S8) we can calculate the breakpoints for a given chromosome of a given individual. In order to do that, we need to calculate the probability of the mapped reads under any possible ancestry,
$$P(X_1, \ldots, X_L) = \sum_{Z_1} \cdots \sum_{Z_L} P(X_1, \ldots, X_L, Z_1, \ldots, Z_L).$$
We will see later why we need to calculate this quantity. Let’s focus now on how we are going to calculate it.
Because of the Markovian property of this particular probabilistic model, we can write
$$P(X_1, \ldots, X_L) = \sum_{Z_1} \cdots \sum_{Z_L} P(Z_1)\, P(X_1 \mid Z_1) \prod_{i=2}^{L} P(Z_i \mid Z_{i-1})\, P(X_i \mid Z_i).$$
This marginal probability has $3^L$ terms, too many in general to be calculated directly.
Fortunately, we can calculate that marginal probability very efficiently with an algorithm that is linear in $L$ instead of exponential.
In fact, we have two algorithms that can calculate $P(X_1, \ldots, X_L)$ efficiently, the forward and the backward algorithms. We introduce both because both algorithms will be used to infer the breakpoints.
Forward Algorithm
For each locus $i$, we define $f_i(z)$ as the probability of loci $1, \ldots, i$, where all loci from $1$ to $i-1$ have been marginalized (summed) over all possible ancestries, and locus $i$ has ancestry $z$, that is
$$f_i(z) = P(X_1, \ldots, X_i, Z_i = z) = \sum_{Z_1} \cdots \sum_{Z_{i-1}} P(X_1, \ldots, X_i, Z_1, \ldots, Z_{i-1}, Z_i = z).$$
The name forward should be self-explanatory now. In order to calculate $f_i(z)$, I need to know $f_{i-1}(AA)$, $f_{i-1}(AB)$, …
More explicitly, the forward probabilities are calculated using a forward recursion, for $i = 2, \ldots, L$,
$$f_i(z) = P(X_i \mid Z_i = z) \sum_{z'} f_{i-1}(z')\, P(Z_i = z \mid Z_{i-1} = z').$$
The initialization conditions for $i = 1$ use the prior probabilities $P(Z_1)$. For the first locus,
$$f_1(z) = P(Z_1 = z)\, P(X_1 \mid Z_1 = z).$$
Then the probability $P(X_1, \ldots, X_L)$ is given by
$$P(X_1, \ldots, X_L) = \sum_{z} f_L(z).$$
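As a minimal sketch of the recursion (all numbers are hypothetical, and the reads at each locus are collapsed into one discrete observation rather than Andolfatto’s read counts), the forward algorithm can be implemented in a few lines and checked against brute-force enumeration of all ancestry paths:

```python
import itertools, math

# Toy 3-state HMM: the states play the role of the ancestries AA, AB, BB.
STATES = ["AA", "AB", "BB"]
pi = {"AA": 0.25, "AB": 0.50, "BB": 0.25}                 # P(Z_1)
tau = 0.1                                                 # breakpoint probability
trans = {"AA": {"AA": 1 - tau, "AB": tau, "BB": 0.0},     # no double breakpoints
         "AB": {"AA": tau / 2, "AB": 1 - tau, "BB": tau / 2},
         "BB": {"AA": 0.0, "AB": tau, "BB": 1 - tau}}
emit = {"AA": {0: 0.7, 1: 0.2, 2: 0.1},                   # P(X_i | Z_i)
        "AB": {0: 0.2, 1: 0.6, 2: 0.2},
        "BB": {0: 0.1, 1: 0.2, 2: 0.7}}

def forward(xs):
    """f_i(z) = P(x_1, ..., x_i, Z_i = z), filled in left to right."""
    f = [{z: pi[z] * emit[z][xs[0]] for z in STATES}]      # initialization
    for x in xs[1:]:                                       # recursion
        f.append({z: emit[z][x] * sum(f[-1][zp] * trans[zp][z] for zp in STATES)
                  for z in STATES})
    return f

xs = [0, 0, 1, 2, 2]
p_forward = sum(forward(xs)[-1].values())                  # P(X) = sum_z f_L(z)

# The O(L K^2) recursion must agree with brute-force summation over all
# K^L ancestry paths (feasible only for tiny L).
p_brute = sum(pi[zs[0]] * emit[zs[0]][xs[0]] *
              math.prod(trans[zs[i - 1]][zs[i]] * emit[zs[i]][xs[i]]
                        for i in range(1, len(xs)))
              for zs in itertools.product(STATES, repeat=len(xs)))
print(abs(p_forward - p_brute) < 1e-12)  # → True
```

For $L = 5$ the brute force has $3^5 = 243$ terms; the recursion touches only $L \cdot 3^2 = 45$.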
This type of algorithm is called a dynamic programming (DP) algorithm. Dynamic programming algorithms were introduced by Richard Bellman in 1954.
Backward Algorithm
For each locus $i$, we define $b_i(z)$ as the probability of loci $i+1, \ldots, L$ (summing over all possible ancestries), conditioned on the fact that locus $i$ has ancestry $z$,
$$b_i(z) = P(X_{i+1}, \ldots, X_L \mid Z_i = z).$$
The name backward should be self-explanatory now. In order to calculate $b_i(z)$, I need to know $b_{i+1}(AA)$, $b_{i+1}(AB)$, …
Those backward probabilities are calculated using a backwards recursion, for $i = L-1, \ldots, 1$,
$$b_i(z) = \sum_{z'} P(Z_{i+1} = z' \mid Z_i = z)\, P(X_{i+1} \mid Z_{i+1} = z')\, b_{i+1}(z').$$
The initialization condition for $i = L$ is
$$b_L(z) = 1 \quad \text{for all } z.$$
The backward probabilities are calculated independently from the forward probabilities.
The probability $P(X_1, \ldots, X_L)$ is given, using the prior probabilities $P(Z_1)$, by
$$P(X_1, \ldots, X_L) = \sum_{z} P(Z_1 = z)\, P(X_1 \mid Z_1 = z)\, b_1(z).$$
And in general, for any locus $i$,
$$P(X_1, \ldots, X_L) = \sum_{z} f_i(z)\, b_i(z).$$
Why do we want to do pretty much the same calculation in the forward and backward directions?
Decoding
Now we can use the forward and backward probabilities to make inferences about ancestry.
The posterior probabilities that the ancestor at a given position is either $AA$ or $AB$ or $BB$ are given for each locus $i$ by
$$P(Z_i = z \mid X_1, \ldots, X_L) = \frac{P(X_1, \ldots, X_L, Z_i = z)}{P(X_1, \ldots, X_L)}.$$
And using the forward and backward calculation, we can write
$$P(Z_i = z \mid X_1, \ldots, X_L) = \frac{f_i(z)\, b_i(z)}{P(X_1, \ldots, X_L)}.$$
These posterior probabilities are robust because they make inference on the state of a given position (locus) based on what has been observed at all the loci.
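Posterior decoding can be sketched end to end on a toy 3-state HMM (self-contained, with hypothetical transition and emission numbers, and one discrete observation per locus standing in for the reads):

```python
import math

# Toy HMM: states play the role of the ancestries; all numbers hypothetical.
STATES = ["AA", "AB", "BB"]
pi = {"AA": 0.25, "AB": 0.50, "BB": 0.25}                 # P(Z_1)
tau = 0.1                                                 # breakpoint probability
trans = {"AA": {"AA": 1 - tau, "AB": tau, "BB": 0.0},     # no double breakpoints
         "AB": {"AA": tau / 2, "AB": 1 - tau, "BB": tau / 2},
         "BB": {"AA": 0.0, "AB": tau, "BB": 1 - tau}}
emit = {"AA": {0: 0.7, 1: 0.2, 2: 0.1},                   # P(X_i | Z_i)
        "AB": {0: 0.2, 1: 0.6, 2: 0.2},
        "BB": {0: 0.1, 1: 0.2, 2: 0.7}}

def forward(xs):
    """f_i(z) = P(x_1..x_i, Z_i = z)."""
    f = [{z: pi[z] * emit[z][xs[0]] for z in STATES}]
    for x in xs[1:]:
        f.append({z: emit[z][x] * sum(f[-1][zp] * trans[zp][z] for zp in STATES)
                  for z in STATES})
    return f

def backward(xs):
    """b_i(z) = P(x_{i+1}..x_L | Z_i = z), with b_L(z) = 1."""
    b = [{z: 1.0 for z in STATES}]
    for x in reversed(xs[1:]):
        b.insert(0, {z: sum(trans[z][zn] * emit[zn][x] * b[0][zn] for zn in STATES)
                     for z in STATES})
    return b

def posteriors(xs):
    """P(Z_i = z | x_1..x_L) = f_i(z) b_i(z) / P(X) for every locus i."""
    f, b = forward(xs), backward(xs)
    pX = sum(f[-1].values())
    # Consistency check: sum_z f_i(z) b_i(z) equals P(X) at every locus i.
    for fi, bi in zip(f, b):
        assert math.isclose(sum(fi[z] * bi[z] for z in STATES), pX)
    return [{z: fi[z] * bi[z] / pX for z in STATES} for fi, bi in zip(f, b)]

# Observations that switch character halfway, mimicking one breakpoint:
xs = [0, 0, 0, 2, 2, 2]
for i, post in enumerate(posteriors(xs), start=1):
    print(i, {z: round(post[z], 3) for z in STATES})
```

At each locus the three posteriors sum to 1, and the loop inside `posteriors` verifies numerically the identity $\sum_z f_i(z)\, b_i(z) = P(X)$ from the previous section.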
Figure 4. Posterior probabilities for a male fly backcross in a 1,000 nucleotides region.
I have simulated mapped reads for a male fly backcross and a hypothetical region of 1,000 bases, where I introduced one breakpoint. In Figure 4, I display for each position $i$ the three posterior probabilities $P(Z_i = z \mid X_1, \ldots, X_L)$, obtained using the posterior decoding algorithm.
The posterior probabilities clearly indicate the presence of two regions, Dsim/Dsec and Dsec/Dsec, with one breakpoint (and only one) occurring around position 600. As expected, due to the nature of the backcross, the posterior probabilities of having ancestry Dsim/Dsim are zero at every position.