- The biological example
- Why you want to optimize the parameters
- Maximum likelihood: parameter estimation for complete data
- Expectation Maximization: parameter estimation for incomplete data
- Why does Expectation-Maximization work?
Probabilistic models and Expectation Maximization
In this lecture, we are going to investigate the problem of training a probabilistic graphical model or Bayesian network. We will describe the method of Maximum likelihood used when we have completely annotated data, and the method of Expectation-Maximization used when the available data for training is incompletely labeled.
For this week’s lectures, if you do not read anything else, I recommend the article “What is the expectation maximization algorithm?” by Do & Batzoglou.
I am going to rephrase the contents of that article using our example of the HMM model described by Andolfatto et al. to assign ancestry to chromosomal segments.
The biological example
We are continuing with the biological problem that we introduced in week 06. Andolfatto et al. introduced an HMM to assign ancestry in male fly backcrosses, for which we know the genomic sequence and have mapped reads for each backcross individual.
The experimental design allows two ancestral states:
- Heterozygous: one chromosome from Drosophila simulans (A = Dsim) and one from Drosophila sechellia (Dsec)
- Homozygous for Drosophila sechellia (B = Dsec)
Figure 1. The ancestry HMM.
The HMM is described in Figure 1. The transition probabilities of the ancestry $\pi_i$ at locus $i$ given the ancestry $\pi_{i-1}$ at the previous locus are parameterized by one Bernoulli parameter $p$,

$$P(\pi_i \neq \pi_{i-1}) = p, \qquad P(\pi_i = \pi_{i-1}) = 1 - p.$$
In their paper, Andolfatto et al. assigned the value of $p$ based on what is already known about recombination in Drosophila species: mainly, that one should expect on average one recombination event per chromosome. That means that in some individuals you will find one, in others none, in others two, and maybe in a few rare cases three or more, such that the average is close to one.
If there is on average one break per chromosome, then the average length of a region without breakpoints should be $L$ for a chromosome of length $L$, and because the mean of a geometric distribution with Bernoulli parameter $p$ is $1/p$, that determines the parameter to take the value

$$p = \frac{1}{L}.$$
For some real numbers in Drosophila,
- Chromosome 2L: length = 23.01 million bases
- Chromosome 4 (much shorter): length = 1.35 million bases
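As a quick numerical sketch, assuming one expected breakpoint per chromosome so that $p = 1/L$ (an assumption of this sketch, following the derivation above):

```python
# Bernoulli switch parameter per locus, assuming p = 1/L
# (one expected breakpoint per chromosome).
chromosomes = {"2L": 23_010_000, "4": 1_350_000}  # lengths in bases

for name, length in chromosomes.items():
    p = 1.0 / length
    print(f"chr{name}: L = {length:>10,} bases -> p = {p:.2e}")
```

The two values of $p$ differ by about a factor of 17, yet both are tiny numbers.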
Does such a small difference in the values of $p$ make any difference in the inference results?
Figure 2. Analysis of a chromosomal fragment with 4 breakpoints, indicated on top with arrows. Posterior probabilities identifying the breakpoints are depicted for four different values of the Bernoulli parameter p of the HMM: (A) assumes that one breakpoint is expected. (B) assumes 4 breakpoints (the actual number in the segment). (C) assumes that 90 breaks are expected. (D) the Bernoulli parameter is set to 0.8, which should produce on average 2,500 breakpoints.
Why you want to optimize the parameters
Let’s now concentrate on our re-implementation of the Andolfatto HMM, which we did last week. Because our implementation uses a relatively slow scripting language (Python, Perl, MATLAB…), we probably cannot deal with very long sequences, so let’s assume a short chromosome of length $L$.
Because this is a generative model, I have the advantage that I can generate the data that you would otherwise have to painfully collect in your experiments. For this example, I have introduced 4 breakpoints.
I ran the decoding algorithm of week 06 for four different values of the Bernoulli parameter, corresponding to an expected number of breakpoints of 1, 4, 90, and 2,500, respectively. Results are given in Figure 2.
Panel (B) corresponds to the optimal value of $p$, which identifies the 4 breakpoints, and only those breakpoints. In panel (A), a smaller value of $p$, which would result in fewer expected breakpoints, misses two of them: the two that are very close together. In panel (C) you observe how, as your parameter increases, the number of predicted breakpoints (false positives) increases. Finally, panel (D) shows how sensitive the results are to changes in the value of the parameter.
So, if you are now convinced that the values of the parameters matter, sometimes even when the differences are tiny, how do we obtain their optimal values?
Maximum likelihood: parameter estimation for complete data
In many cases, you have a number of examples in which the data is actually labeled. In our case, that would mean a collection of individuals (male flies) with mapped reads for which we know the ancestry at all loci (see Figure 3). We call this set of labeled data the training set.
Using the training set, you can calculate the number of times (counts) that each possible transition occurs at each locus $i$. Call $n$ the total number of transitions in which the ancestry switches ($\pi_{i-1} \neq \pi_i$) and $m$ the total number in which it stays the same.

After collecting all the counts for all loci $i$ and for all individuals $s$, we can calculate the frequency estimate of the parameter as

$$p^{ML} = \frac{n}{n+m}.$$
This intuitive estimation of the value of the parameters is called the maximum likelihood (ML) estimate. The reason is that those estimates are the values of the parameters that maximize the probability of the data in the training set.
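As a small illustration of this counting, here is a sketch (with made-up toy ancestry strings, not real data) of the ML estimate of the switch parameter from labeled paths:

```python
def ml_switch_probability(paths):
    """ML estimate of the Bernoulli switch parameter p from labeled paths."""
    switches = stays = 0
    for path in paths:
        # Compare every locus with the previous one.
        for prev, curr in zip(path, path[1:]):
            if prev != curr:
                switches += 1
            else:
                stays += 1
    return switches / (switches + stays)

# Toy labeled data: 'A' = heterozygous, 'B' = homozygous Dsec.
paths = ["AAABBB", "AAAAAA", "ABBBBA"]
p_ml = ml_switch_probability(paths)  # 3 switches / 15 transitions = 0.2
```

The estimate is simply the observed frequency of ancestry switches among all adjacent-locus transitions.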
Maximum likelihood derivation
In order to apply ML, we need to have a set of $S$ individuals for which we know both the reads $x^s$ and the ancestry $\pi^s$ for a genomic region with $L$ loci. That is, the data is

$$D = \{ (x^1, \pi^1), \ldots, (x^S, \pi^S) \}.$$
The probability of the data, given the model that depends on the parameter $p$, is

$$P(D \mid p) = \prod_{s=1}^{S} P(x^s, \pi^s \mid p).$$
Introducing the log probability, the log probability of the data given the parameter is given by

$$\log P(D \mid p) = \sum_{s=1}^{S} \log P(x^s, \pi^s \mid p).$$
The ML value of the parameter, named $p^{ML}$, is given by

$$p^{ML} = \underset{p}{\arg\max}\ \log P(D \mid p).$$
Notice that under a uniform prior hypothesis for $p$, the ML value $p^{ML}$ is also the value that maximizes the posterior probability of the parameter, $P(p \mid D)$.
With all generality, we can write

$$\log P(D \mid p) = n \log p + m \log(1-p) + \mathrm{constant},$$

where $n$ and $m$ are the total numbers of switch and no-switch transitions in the labeled data, and the constant collects the terms that do not depend on $p$.
The derivative with respect to $p$ is given by

$$\frac{\partial \log P(D \mid p)}{\partial p} = \frac{n}{p} - \frac{m}{1-p}.$$
The ML condition $\frac{\partial \log P(D \mid p)}{\partial p} = 0$ results in

$$p^{ML} = \frac{n}{n+m}.$$
That is, the expression we proposed above
This result generalizes when there is more than one parameter. If the transition from ancestry $k$ could have $K$ possible values $t_{k\ell}$, such that $\sum_{\ell=1}^{K} t_{k\ell} = 1$, the ML estimate from labeled data is given by the fraction of the number of times that the transition $k \to \ell$ occurs in the data ($C_{k\ell}$),

$$t^{ML}_{k\ell} = \frac{C_{k\ell}}{\sum_{\ell'} C_{k\ell'}}.$$
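For instance, with two ancestry states the counts form a 2×2 matrix and the ML transition probabilities are the row-normalized counts (a sketch with made-up counts):

```python
import numpy as np

# Toy transition counts (rows: from-state A, B; columns: to-state A, B).
counts = np.array([[90.0, 10.0],
                   [ 5.0, 95.0]])

# ML estimate: t_kl = C_kl / sum_l' C_kl'  (row-normalize the counts).
t_ml = counts / counts.sum(axis=1, keepdims=True)
print(t_ml)  # each row sums to 1
```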
Figure 3. A re-implementation of Do & Batzoglou's Figure 1, comparing Maximum likelihood training in the presence of labeled data versus Expectation Maximization training when the data is not labeled. In ML estimation, the parameters are given by their frequencies in the training set. In EM estimation, there is an iterative process. At each EM cycle, all possible labelings of the data are considered, each weighted by its expected counts (the E-step). Using all those expected counts, new ML parameters are estimated (the M-step). The recursion ends when the parameters converge.
Expectation Maximization: parameter estimation for incomplete data
But most of the time we do not have labeled data: you may have mapped reads for many different male flies, but not the corresponding labeling of the ancestry.
In this situation, you can use an iterative method named Expectation Maximization (EM). For the particular case of an HMM, as is our case here, it also goes by the name of the Baum-Welch algorithm.
The EM algorithm has the following steps (Figure 3):
One starts with some arbitrary value of the parameters; for our case, with one parameter, an initial guess $p^{(0)}$.
In this step (the E-step) you calculate the expected number of times each of the transitions is used at each locus, summing over all possible values of the ancestry at all other loci. That is,

$$\hat{n} = \sum_{s} \sum_{i} P(\pi^s_{i-1} \neq \pi^s_i \mid x^s, p^{(n)}), \qquad \hat{m} = \sum_{s} \sum_{i} P(\pi^s_{i-1} = \pi^s_i \mid x^s, p^{(n)}).$$
These expected counts are the equivalent of the counts in the case of having labeled data.
The expected counts are easily calculated from the forward and backward probabilities, computed with the forward and backward algorithms introduced in week 06 for each individual $s$,

$$P(\pi_{i-1} = k, \pi_i = \ell \mid x^s) = \frac{f_k(i-1)\, t_{k\ell}\, e_\ell(x^s_i)\, b_\ell(i)}{P(x^s)},$$

where $f_k(i)$ and $b_k(i)$ are the forward and backward probabilities, $t_{k\ell}$ is the transition probability, and $e_\ell(x_i)$ is the emission probability.
In the M-step, the new parameter value is the expected fraction of switches among all transitions, $p^{(n+1)} = \hat{n} / (\hat{n} + \hat{m})$; the new $p^{(n+1)}$ is then used for a new round of the E-step.
We iterate the E-step/M-step until convergence, that is, until $|p^{(n+1)} - p^{(n)}| < \epsilon$ for some small tolerance $\epsilon$.
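The whole E-step/M-step loop can be sketched as follows. This is a minimal illustration, not the course implementation: it assumes two states, a uniform initial state distribution, and known, fixed emission probabilities `emit[k, v]` (all names and toy values are made up for the sketch); only the switch parameter $p$ is estimated.

```python
import numpy as np

def baum_welch_p(x, emit, p0=0.3, n_iter=100, tol=1e-8):
    """Estimate the Bernoulli switch parameter p by EM (Baum-Welch sketch).

    x    : sequence of observation indices (ints)
    emit : emit[k, v] = P(observation v | state k), held fixed here
    """
    K, L = emit.shape[0], len(x)
    start = np.full(K, 1.0 / K)  # uniform initial state distribution
    p = p0
    for _ in range(n_iter):
        T = np.array([[1 - p, p], [p, 1 - p]])  # symmetric 2-state transitions
        # Forward pass, scaled at each position to avoid underflow.
        f = np.zeros((L, K))
        scale = np.zeros(L)
        f[0] = start * emit[:, x[0]]
        scale[0] = f[0].sum()
        f[0] /= scale[0]
        for i in range(1, L):
            f[i] = (f[i - 1] @ T) * emit[:, x[i]]
            scale[i] = f[i].sum()
            f[i] /= scale[i]
        # Backward pass with the same scaling factors.
        b = np.ones((L, K))
        for i in range(L - 2, -1, -1):
            b[i] = (T @ (emit[:, x[i + 1]] * b[i + 1])) / scale[i + 1]
        # E-step: expected switch / stay counts over all adjacent loci.
        switch = stay = 0.0
        for i in range(1, L):
            xi = f[i - 1][:, None] * T * (emit[:, x[i]] * b[i])[None, :]
            xi /= xi.sum()  # posterior over the transition at locus i
            switch += xi[0, 1] + xi[1, 0]
            stay += xi[0, 0] + xi[1, 1]
        # M-step: expected-count version of p = n / (n + m).
        p_new = switch / (switch + stay)
        if abs(p_new - p) < tol:
            return p_new
        p = p_new
    return p

# Toy usage: a sequence with one evident switch midway.
emit = np.array([[0.9, 0.1],   # state A favors symbol 0
                 [0.2, 0.8]])  # state B favors symbol 1
x = [0] * 50 + [1] * 50
p_hat = baum_welch_p(x, emit)
```

Starting from an arbitrary $p^{(0)} = 0.3$, the estimate converges toward a small value consistent with a single switch among 99 transitions.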
EM optimization of other conditional probabilities
We could also use the EM algorithm to optimize the conditional (emission) probabilities $P(x_i \mid \pi_i = A)$ and $P(x_i \mid \pi_i = B)$.
We can introduce the posterior probability of being in state $k$ at locus $i$ for individual $s$,

$$\gamma^s_k(i) = P(\pi^s_i = k \mid x^s) = \frac{f_k(i)\, b_k(i)}{P(x^s)}.$$

Then, the EM algorithm for optimizing the emission probabilities $e_A$ and $e_B$ is as follows.

Calculate the expected values: for each state $k$ and each possible observation $v$, the expected number of times that $v$ is emitted from state $k$,

$$\hat{E}_k(v) = \sum_s \sum_i \gamma^s_k(i)\, \mathbb{1}(x^s_i = v).$$

Update $e_A$ and $e_B$ as follows:

$$e_k(v) = \frac{\hat{E}_k(v)}{\sum_{v'} \hat{E}_k(v')}.$$
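The emission update can be sketched in the same spirit. Again a toy illustration with made-up numbers: `gamma[i, k]` stands for the posterior $P(\pi_i = k \mid x)$ that would come out of forward-backward.

```python
import numpy as np

def update_emissions(x, gamma, n_symbols):
    """M-step for emissions: expected emission counts, row-normalized.

    x     : observation indices
    gamma : gamma[i, k] = P(pi_i = k | x) from forward-backward
    """
    K = gamma.shape[1]
    e = np.zeros((K, n_symbols))
    for i, v in enumerate(x):
        e[:, v] += gamma[i]  # expected count of emitting v from each state
    return e / e.sum(axis=1, keepdims=True)

# Toy posteriors for a 4-locus, 2-state, 2-symbol example.
x = [0, 0, 1, 1]
gamma = np.array([[0.9, 0.1],
                  [0.8, 0.2],
                  [0.3, 0.7],
                  [0.1, 0.9]])
e_new = update_emissions(x, gamma, 2)
```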
Figure 4. The probability distribution is approximated at each step $(n)$ by a function $G^{(n)}$ guaranteed to be a lower bound for the actual distribution. In the E-step, we use the current values of the parameters to find the "expected values". In the M-step, we use those expectations in order to find new values for the parameters that optimize the lower bound function $G^{(n)}$.
Why does Expectation-Maximization work?
A graphical description of how EM works
In the EM algorithm, we alternate between two actions, see Figure 4.
In the E-step, given a set of values for the parameters, we obtain expected counts for all possible labelings.
In the M-step, we use those expected counts to obtain new estimates for the parameters. The M-step can be justified because the new values of the parameters are the maximum-likelihood values for a function that is related to our full probability distribution by being a lower bound of it.
EM as a Variational Method
The EM algorithm belongs to the class of variational methods, in which the optimal value of a function is found by using a different but related (variational) objective function.
Let's use $x$ to describe the observed data (in our case, the collection of mapped reads for all loci for all male backcross flies), $\pi$ to refer to the unobserved latent data (in our case, the ancestry at all loci for all male backcross flies), and $\theta$ to refer to a vector of unknown parameters (in our case just one, $p$).
We can always write, by marginalization over the unobserved data,

$$\log P(x \mid \theta) = \log \sum_{\pi} P(x, \pi \mid \theta).$$
For any arbitrary probability distribution $Q(\pi)$ on the unobserved data,

$$\log P(x \mid \theta) = \log \sum_{\pi} Q(\pi)\, \frac{P(x, \pi \mid \theta)}{Q(\pi)},$$

such that $Q$ is a probability distribution in $\pi$, that is, $\sum_\pi Q(\pi) = 1$. The variational distribution could also be conditioned on the data $x$ or on fixed parameter values (or even on a completely different set of parameters!), but not on the parameters $\theta$ that we are trying to optimize.
Using Jensen’s inequality, we can write

$$\log P(x \mid \theta) \geq \sum_{\pi} Q(\pi) \log \frac{P(x, \pi \mid \theta)}{Q(\pi)}.$$
The two expressions become equal for $Q(\pi) = P(\pi \mid x, \theta)$.
In the EM algorithm, we are going to select a different variational distribution for each iteration $n$, in which the parameters take the values $\theta^{(n)}$:

$$Q^{(n)}(\pi) = P(\pi \mid x, \theta^{(n)}).$$

Notice that this variational distribution is exactly the quantity that we calculate in the E-step of each iteration, and that maximizing the resulting lower bound at each iteration results in an improvement of $\log P(x \mid \theta)$.
In summary, EM consists of optimizing the function

$$\log P(x \mid \theta) = \log \sum_{\pi} P(x, \pi \mid \theta)$$

by optimizing at each EM iteration the variational function

$$G^{(n)}(\theta) = \sum_{\pi} Q^{(n)}(\pi) \log \frac{P(x, \pi \mid \theta)}{Q^{(n)}(\pi)},$$

which has the following properties (see Figure 4):
- The variational function is a lower bound to the actual function we want to optimize: $G^{(n)}(\theta) \leq \log P(x \mid \theta)$.
- The two functions take the same value when the parameters are set to the current estimate $\theta^{(n)}$: $G^{(n)}(\theta^{(n)}) = \log P(x \mid \theta^{(n)})$.
In addition, optimizing $G^{(n)}(\theta)$ is equivalent to optimizing the function

$$\sum_{\pi} Q^{(n)}(\pi) \log P(x, \pi \mid \theta),$$

as the term $-\sum_{\pi} Q^{(n)}(\pi) \log Q^{(n)}(\pi)$ is independent of the parameters $\theta$.
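The lower-bound and equality properties are easy to verify numerically on a toy model with a single observation and two possible labels (the joint probabilities below are arbitrary made-up numbers):

```python
import math

# Toy joint probabilities P(x, pi | theta) for the two labels of one observation.
joint = {"A": 0.12, "B": 0.03}
log_p = math.log(sum(joint.values()))  # log P(x | theta) by marginalization

def lower_bound(Q):
    # G(theta) = sum_pi Q(pi) * log( P(x, pi | theta) / Q(pi) )
    return sum(Q[k] * math.log(joint[k] / Q[k]) for k in joint)

Q_arbitrary = {"A": 0.5, "B": 0.5}  # any distribution gives a lower bound
posterior = {k: v / sum(joint.values()) for k, v in joint.items()}

print(log_p, lower_bound(Q_arbitrary), lower_bound(posterior))
```

The bound holds for the arbitrary `Q`, and becomes an equality when `Q` is the posterior $P(\pi \mid x, \theta)$.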
Thus, at each iteration in the EM algorithm:
Calculate $Q^{(n)}(\pi) = P(\pi \mid x, \theta^{(n)})$ for all possible realizations of the unobserved variables. In our particular example, we calculate the expected counts of the four transitions $A \to A$, $A \to B$, $B \to A$, and $B \to B$.
Optimize $\sum_{\pi} Q^{(n)}(\pi) \log P(x, \pi \mid \theta)$ in $\theta$, that is, the probability of the data for each possible completion of the unobserved data, each completion weighted by its corresponding posterior probability.
A similar proof can be found in the supplemental materials of Do & Batzoglou.
The EM algorithm is only guaranteed to find a local maximum. You should re-start the algorithm from different starting points and compare the results.