MCB111: Mathematics in Biology (Spring 2018)


week 06:

Probabilistic models and Inference

In this week’s lectures, we discuss probabilistic graphical models and how to do inference using them. As a practical case, we are going to implement the probabilistic model introduced by Andolfatto et al. to assign ancestry to chromosomal segments.

There are many good books about probabilistic models. “Probabilistic graphical models: principles and techniques” (by Koller & Friedman) is a comprehensive source on probabilistic models more general than the one we are going to study here. “Biological sequence analysis” (by Durbin et al.), while it concentrates on genomic sequences, is a good reference for learning about some basic (and widely used) probabilistic models, as it provides detailed descriptions of some of the fundamental algorithms. Another good source is “Pattern Theory” (by Mumford and Desolneux).

Probabilistic Graphical Models

Definition

A probabilistic graphical model is a graphical representation of the relationships between different random variables. It consists of nodes that represent random variables and edges that represent their relationships by conditional probabilities. Probabilistic graphical models also go by the name of Bayesian networks (BN).

Let me illustrate with the example in Figure 1, taken from Koller & Friedman.


Figure 1. An example of a Bayesian network (BN).

A professor is writing a recommendation letter for a student. Let “letter” be the strength of the letter. The strength of the letter is going to be based on the student’s “grades” and “SAT” scores, both of which are available to the professor. The model in Figure 1 assumes that the grades depend both on the student’s “intelligence” and on the course’s “difficulty”, while the “SAT” scores depend only on the intelligence.

The parts of the Bayesian network

The random variables are the course’s difficulty (D), the student’s intelligence (I), the student’s grade (G), the SAT score (S), and the strength of the letter (L).

Thus, we have a joint distribution of 5 random variables

$$P(D, I, G, S, L).$$

In full generality (using the chain rule) we can write

$$P(D, I, G, S, L) = P(D)\, P(I \mid D)\, P(G \mid D, I)\, P(S \mid D, I, G)\, P(L \mid D, I, G, S).$$

This is a complex set of possible dependencies, some of which do not really exist for the problem at hand; for instance, the intelligence of the student does not depend on the difficulty of the courses she takes, thus in the expression above $P(I \mid D) = P(I)$.

What the BN in Figure 1 describes is how to simplify the joint distribution of the 5 random variables. It is telling us which are the dependencies (amongst all possible) that we are going to consider. The edges represent conditional dependences between the random variables (nodes), such that the node at the end of the arrow depends on the node from where the edge starts. In a general BN, a node can have many arrows landing on it (many dependencies) from other nodes.

The graphical model in Figure 1 says that the grade depends only on the difficulty and the intelligence, the SAT score depends only on the intelligence, and the letter depends only on the grade, so that the joint probability of all random variables is given by

$$P(D, I, G, S, L) = P(D)\, P(I)\, P(G \mid D, I)\, P(S \mid I)\, P(L \mid G).$$
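
As a sketch of how this factorization is used in practice, the snippet below encodes hypothetical conditional probability tables for the student example (all numerical values are invented for illustration; only the factorization itself comes from the model) and evaluates the joint probability of one assignment:

```python
# Hypothetical CPTs for the student Bayesian network (numbers are made up).
P_D = {"easy": 0.6, "hard": 0.4}                        # difficulty
P_I = {"low": 0.7, "high": 0.3}                          # intelligence
P_G_given_DI = {("easy", "low"):  {"A": 0.3,  "B": 0.4,  "C": 0.3},
                ("easy", "high"): {"A": 0.9,  "B": 0.08, "C": 0.02},
                ("hard", "low"):  {"A": 0.05, "B": 0.25, "C": 0.7},
                ("hard", "high"): {"A": 0.5,  "B": 0.3,  "C": 0.2}}
P_S_given_I = {"low":  {"good": 0.05, "bad": 0.95},
               "high": {"good": 0.8,  "bad": 0.2}}
P_L_given_G = {"A": {"strong": 0.9, "weak": 0.1},
               "B": {"strong": 0.6, "weak": 0.4},
               "C": {"strong": 0.1, "weak": 0.9}}

def joint(d, i, g, s, l):
    """P(D=d, I=i, G=g, S=s, L=l) under the Figure 1 factorization."""
    return (P_D[d] * P_I[i] * P_G_given_DI[(d, i)][g]
            * P_S_given_I[i][s] * P_L_given_G[g][l])

print(joint("easy", "high", "A", "good", "strong"))
```

Because every table is normalized, summing `joint` over all assignments returns 1, as any joint distribution must.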


What to do with a Bayesian network?

Inference: drawing conclusions about some of the random variables, assuming that we have information about others.

In a Bayesian network, the random variables for which we have information constitute the evidence. Among the other variables, we can decide which to do inference on and which to treat as hidden variables, which will be summed over all their possible values in order to make our inferences.

Several inference scenarios are possible for the student letter example.

Let’s approach inference with a probabilistic graphical model using a recently implemented example from biology.

Our example: From genomic reads to recombination breakpoints


The probabilistic model proposed by Andolfatto et al. is given in their Figure S8 reproduced here. The problem is to assign ancestry (homozygous for one of the parents or heterozygous) to the different genomic loci of a given male fly by analyzing genomic reads.

Andolfatto et al. use male flies created by crossing female F1 hybrids of Drosophila simulans (Dsim) and Drosophila sechellia (Dsec) with Dsec males. The resulting males have an ancestry (Z) that is either Dsec/Dsec or Dsec/Dsim at different loci (see Figure S5 from Andolfatto et al., reproduced here). The genotype (G), that is, the collection of actual nucleotides in both alleles, is unknown, and will be treated as a nuisance variable. What we know from the experiment is a collection of mapped reads (X) for each locus (genomic position).

For a given locus in one individual, the question is how to determine which reads correspond to parent Dsim versus parent Dsec, which will in turn determine the ancestry (Z) of the locus as being either Dsec/Dsec or Dsec/Dsim. Because the F1 females were crossed only to Dsec parent males (but not to Dsim parent males), the genotype Dsim/Dsim does not occur. (The reason for only crossing F1 females is that the F1 males are sterile.)


Figure 1 from Andolfatto et al. (http://genome.cshlp.org/content/21/4/610.full)

Andolfatto’s model assumes a more general situation in which the ancestry of a given individual at locus $i$, named $Z_i$, could be either AA, or BB, or AB (Figure 1 from Andolfatto et al., reproduced here).

The Model (a Hidden Markov Model)

The random variables

Figure 2 shows specific examples of the three random variables for the problem of inferring ancestry from mapped reads.

The random variables for a given locus $i$ (i.e. a genomic position) are the ancestry $Z_i$, the genotype $G_i$, and the mapped reads $X_i$.

The conditional probabilities

We want to calculate the joint probability of all random variables for all sites $i = 1, \ldots, L$, for a chromosome of length $L$,

$$P(Z_1, \ldots, Z_L,\; G_1, \ldots, G_L,\; X_1, \ldots, X_L).$$

The Bayesian network in Andolfatto’s Figure S8 establishes the following dependencies: the ancestry $Z_i$ depends only on the ancestry at the previous locus, $Z_{i-1}$; the genotype $G_i$ depends only on the ancestry $Z_i$; and the reads $X_i$ depend only on the genotype $G_i$.

This particular set of dependencies in the joint probability, such that the ancestry at one locus $Z_i$ depends only on the ancestry at the locus before, $Z_{i-1}$, defines this model as a Hidden Markov Model. Hidden Markov Models (HMMs) are a subset of the much larger set of probabilistic graphical models or Bayesian networks.

This graphical model also tells us (see Figure S8) that the joint probability factorizes as

$$P(Z_1, \ldots, Z_L, G_1, \ldots, G_L, X_1, \ldots, X_L) = \prod_{i=1}^{L} P(Z_i \mid Z_{i-1})\, P(G_i \mid Z_i)\, P(X_i \mid G_i),$$

where for the first locus $P(Z_1 \mid Z_0)$ stands for the prior probability $P(Z_1)$.

The inference question

An example of the inference problem is given in Figure 2, which shows three loci. Each locus has 8 mapped reads (the evidence). We also know the genome sequence of the two parental species A (top, in red) and B (top, in blue). The data for the first locus almost unambiguously indicate that the ancestry is AB. For the other two loci, either due to ambiguity or to errors in the reads, both possibilities AB and BB are plausible. The question is: which one is more likely?

We know the reads $X_i$, and we want to infer the ancestry $Z_i$ for each locus $i$. Thus, the genotypes $G_i$ are hidden random variables that we want to integrate out.

The joint marginal is given by

$$P(Z_1, \ldots, Z_L, X_1, \ldots, X_L) = \sum_{G_1, \ldots, G_L} P(Z_1, \ldots, Z_L, G_1, \ldots, G_L, X_1, \ldots, X_L).$$

Because of the Markovian nature of this particular Bayesian network,

$$P(Z_1, \ldots, Z_L, X_1, \ldots, X_L) = \prod_{i=1}^{L} P(Z_i \mid Z_{i-1})\, P(X_i \mid Z_i),$$

where for each locus $i$ the marginal probabilities are

$$P(X_i \mid Z_i) = \sum_{G_i} P(X_i \mid G_i)\, P(G_i \mid Z_i).$$

The parameters

The conditional probabilities of this probabilistic model can be separated into two groups, depending on whether or not they are conditioned on the locus before: the transition probabilities $P(Z_i \mid Z_{i-1})$, which connect one locus to the next, and the emission probabilities $P(X_i \mid Z_i)$, which depend on the locus alone.

Figure 3 shows a different graphical representation of the Andolfatto HMM, where all the parameters that connect one locus with the next are explicit.


Figure 3. Graphical representation of the HMM for assigning ancestry.

Andolfatto et al. set these parameters by looking at actual data.

Double breakpoints are not observed,

$$P(Z_i = \text{AA} \mid Z_{i-1} = \text{BB}) = P(Z_i = \text{BB} \mid Z_{i-1} = \text{AA}) = 0.$$

If we assume that all segments have on average the same length, then we can use one single parameter $\tau$ to characterize when there is no breakpoint, as

$$P(Z_i = z \mid Z_{i-1} = z) = 1 - \tau.$$

Finally, Andolfatto uses the fact that on average there is about 1 breakpoint per chromosome in Drosophila species. The length of a region without breakpoints is determined by a geometric distribution of Bernoulli parameter $\tau$.

Then, having on average only two regions (one breakpoint) per chromosome of length $L$ results in

$$L\,\tau \approx 1,$$

or

$$\tau \approx \frac{1}{L}$$

for a chromosome of length $L$.
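
As a quick sanity check of this choice of parameter (a sketch with a hypothetical chromosome length, not Andolfatto’s actual numbers):

```python
# Sanity check (illustrative numbers only): with a per-locus breakpoint
# probability tau, the number of breakpoints along a chromosome of L loci
# is Binomial(L-1, tau), so its expectation is (L-1) * tau.
L = 1_000_000            # hypothetical chromosome length in nucleotides
tau = 1.0 / L            # choice that yields ~1 expected breakpoint per chromosome
expected_breakpoints = (L - 1) * tau
print(f"tau = {tau:.2e}")
print(f"expected breakpoints per chromosome = {expected_breakpoints:.4f}")
```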

Next week’s lectures will cover parameter estimation for a probabilistic model.

Inference

Using the probabilistic model (Figure 1 and Andolfatto’s Figure S8) we can calculate the breakpoints for a given chromosome of a given individual. In order to do that, we need to calculate the probability of the mapped reads, summed over all possible ancestries,

$$P(X_1, \ldots, X_L) = \sum_{Z_1, \ldots, Z_L} P(Z_1, \ldots, Z_L, X_1, \ldots, X_L).$$

We will see later why we need to calculate this quantity. Let’s focus now on how we are going to calculate it.

Because of the Markovian property of this particular probabilistic model, we can write

$$P(X_1, \ldots, X_L) = \sum_{Z_1, \ldots, Z_L} \prod_{i=1}^{L} P(Z_i \mid Z_{i-1})\, P(X_i \mid Z_i).$$

This marginal probability has $3^L$ terms (one per combination of the three possible ancestries at each of the $L$ loci), too many in general to be calculated directly.

Fortunately, we can calculate that marginal probability very efficiently with an algorithm that is linear in $L$ instead of exponential.

In fact, we have two algorithms that can calculate $P(X_1, \ldots, X_L)$ efficiently: the forward and backward algorithms. We introduce both because both algorithms will be used to infer the breakpoints.

Forward Algorithm

For each locus $i$, we define $f_i(z)$ as the probability of loci $1, \ldots, i$, where all loci from $1$ to $i-1$ have been marginalized (summed) over all possible ancestries, and locus $i$ has ancestry $z$, that is,

$$f_i(z) = P(X_1, \ldots, X_i,\; Z_i = z).$$

The name forward should be self-explanatory now. In order to calculate $f_i(z)$, I need to know $f_{i-1}(z')$ for all ancestries $z'$.

More explicitly, the forward probabilities are calculated using a forward recursion, for $i = 2, \ldots, L$,

$$f_i(z) = P(X_i \mid Z_i = z) \sum_{z'} f_{i-1}(z')\, P(Z_i = z \mid Z_{i-1} = z').$$

The initialization conditions for $f_1$ use the prior probabilities $P(Z_1 = z)$. For the first locus,

$$f_1(z) = P(Z_1 = z)\, P(X_1 \mid Z_1 = z).$$

Then the probability $P(X_1, \ldots, X_L)$ is given by

$$P(X_1, \ldots, X_L) = \sum_{z} f_L(z).$$

This type of algorithm is called a dynamic programming (DP) algorithm. Dynamic programming algorithms were introduced by Richard Bellman in 1954.
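
The forward recursion can be sketched in a few lines of Python. The three states, transition matrix, priors, and emission likelihoods below are all made-up toy numbers (not Andolfatto’s estimates); only the recursion itself follows the equations above.

```python
import numpy as np

# Toy 3-state ancestry HMM; all probability values are invented.
states = ["AA", "AB", "BB"]
tau = 0.01                               # hypothetical breakpoint probability
# Transitions P(Z_i = z | Z_{i-1} = z'): stay with prob 1 - tau;
# direct AA <-> BB transitions (double breakpoints) are forbidden.
T = np.array([[1 - tau, tau,     0.0],
              [tau / 2, 1 - tau, tau / 2],
              [0.0,     tau,     1 - tau]])
prior = np.array([0.25, 0.5, 0.25])      # hypothetical priors P(Z_1 = z)
# Emission likelihoods P(X_i | Z_i = z) for a toy sequence of 5 loci.
E = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1],
              [0.1, 0.4, 0.5],
              [0.1, 0.3, 0.6],
              [0.2, 0.3, 0.5]])

def forward(prior, T, E):
    """Return f, where f[i, z] = P(X_1..X_i, Z_i = z)."""
    f = np.zeros_like(E)
    f[0] = prior * E[0]                  # initialization with the priors
    for i in range(1, E.shape[0]):
        # f_i(z) = P(X_i|z) * sum_z' f_{i-1}(z') P(z|z')
        f[i] = E[i] * (f[i - 1] @ T)
    return f

f = forward(prior, T, E)
print("P(X) =", f[-1].sum())             # summing over the last locus gives P(X)
```

For 5 loci this equals the brute-force sum over all $3^5$ ancestry paths, but the recursion costs only $O(L)$ instead of $O(3^L)$.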

Backward Algorithm

For each locus $i$, we define $b_i(z)$ as the probability of loci $i+1, \ldots, L$ (summing over all possible ancestries), conditioned on the fact that locus $i$ has ancestry $z$,

$$b_i(z) = P(X_{i+1}, \ldots, X_L \mid Z_i = z).$$

The name backward should be self-explanatory now. In order to calculate $b_i(z)$, I need to know $b_{i+1}(z')$ for all ancestries $z'$.

Those backward probabilities are calculated using a backward recursion, for $i = L-1, \ldots, 1$,

$$b_i(z) = \sum_{z'} P(Z_{i+1} = z' \mid Z_i = z)\, P(X_{i+1} \mid Z_{i+1} = z')\, b_{i+1}(z').$$

The initialization condition for the last locus is $b_L(z) = 1$ for all ancestries $z$; the prior probabilities $P(Z_1 = z)$ enter at the end of the recursion.

The backward probabilities are calculated independently from the forward probabilities.

The probability $P(X_1, \ldots, X_L)$ is given by

$$P(X_1, \ldots, X_L) = \sum_{z} P(Z_1 = z)\, P(X_1 \mid Z_1 = z)\, b_1(z).$$

And in general, for any locus $i$,

$$P(X_1, \ldots, X_L) = \sum_{z} f_i(z)\, b_i(z).$$
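
The backward recursion can be sketched the same way (the toy numbers below are invented for illustration, matching the forward sketch above in spirit; only the recursion follows the equations):

```python
import numpy as np

# Toy 3-state ancestry HMM (states AA, AB, BB); all values are invented.
tau = 0.01
T = np.array([[1 - tau, tau,     0.0],   # transitions P(Z_{i+1}=z' | Z_i=z)
              [tau / 2, 1 - tau, tau / 2],
              [0.0,     tau,     1 - tau]])
prior = np.array([0.25, 0.5, 0.25])      # priors P(Z_1 = z)
E = np.array([[0.7, 0.2, 0.1],           # emissions P(X_i | Z_i = z), 5 loci
              [0.6, 0.3, 0.1],
              [0.1, 0.4, 0.5],
              [0.1, 0.3, 0.6],
              [0.2, 0.3, 0.5]])

def backward(T, E):
    """Return b, where b[i, z] = P(X_{i+1}..X_L | Z_i = z)."""
    b = np.ones_like(E)                  # initialization: b_L(z) = 1
    for i in range(E.shape[0] - 2, -1, -1):
        # b_i(z) = sum_z' P(z'|z) P(X_{i+1}|z') b_{i+1}(z')
        b[i] = T @ (E[i + 1] * b[i + 1])
    return b

b = backward(T, E)
print("P(X) =", (prior * E[0] * b[0]).sum())   # the priors enter at the end
```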

Why do we want to do pretty much the same calculation in the forward and backward directions?

Decoding

Now we can use the forward and backward probabilities to make inferences about ancestry.

The posterior probabilities that the ancestry at a given position is either AA, or AB, or BB are given for each locus $i$ by

$$P(Z_i = z \mid X_1, \ldots, X_L) = \frac{P(Z_i = z,\; X_1, \ldots, X_L)}{P(X_1, \ldots, X_L)}.$$

And using the forward and backward calculations, we can write

$$P(Z_i = z \mid X_1, \ldots, X_L) = \frac{f_i(z)\, b_i(z)}{\sum_{z'} f_i(z')\, b_i(z')}.$$

These posterior probabilities are robust because they make inference on the state of a given position (locus) based on what has been observed at all the other loci.
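
Putting the two passes together, posterior decoding can be sketched as follows (a self-contained toy example with invented numbers, in the spirit of the forward and backward sketches above):

```python
import numpy as np

# Posterior decoding for a toy 3-state ancestry HMM; all values are invented.
tau = 0.01
T = np.array([[1 - tau, tau,     0.0],
              [tau / 2, 1 - tau, tau / 2],
              [0.0,     tau,     1 - tau]])
prior = np.array([0.25, 0.5, 0.25])
E = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1],
              [0.1, 0.4, 0.5],
              [0.1, 0.3, 0.6],
              [0.2, 0.3, 0.5]])

n = E.shape[0]
f = np.zeros_like(E)                 # forward: f[i,z] = P(X_1..X_i, Z_i=z)
f[0] = prior * E[0]
for i in range(1, n):
    f[i] = E[i] * (f[i - 1] @ T)

b = np.ones_like(E)                  # backward: b[i,z] = P(X_{i+1}..X_n | Z_i=z)
for i in range(n - 2, -1, -1):
    b[i] = T @ (E[i + 1] * b[i + 1])

# Posterior: P(Z_i = z | X) = f_i(z) b_i(z) / P(X); each row sums to 1.
post = f * b / f[-1].sum()
print(post)
```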


Figure 4. Posterior probabilities for a male fly backcross in a 1,000-nucleotide region.

I have simulated mapped reads for a male fly backcross and a hypothetical region of 1,000 bases, where I introduced one breakpoint.

In Figure 4, I display for each position $i$ the three posterior probabilities $P(Z_i = z \mid X_1, \ldots, X_L)$ obtained using the posterior decoding algorithm.

The posterior probabilities clearly indicate the presence of two regions, Dsim/Dsec and Dsec/Dsec, with one breakpoint (and only one) occurring around position 600. As expected, due to the nature of the backcross, the posterior probability of having ancestry Dsim/Dsim is zero at every position.