MCB111: Mathematics in Biology (Spring 2018)
week 05:
Maximum likelihood and least squares
For this topic, I will follow Sivia’s Chapters 3 and 5. One of the first instances of using maximum likelihood for quantitative trait loci mapping is the Lander and Botstein paper from 1989, “Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps.”
Maximum Likelihood
We have used Bayes' theorem to calculate the probability of the parameters ($\theta$) of a given hypothesis ($H$) given the data ($D$) as

$$P(\theta \mid D, H) = \frac{P(D \mid \theta, H)\, P(\theta \mid H)}{P(D \mid H)},$$

where $P(\theta \mid H)$ is the prior probability of the parameters.
We have seen some examples in which we have calculated the posterior probability distribution explicitly. Oftentimes we get away with just finding the parameter values that maximize the posterior probability.
If we do not have any information before doing the analysis, we will use a flat prior, which is a constant, so then we can write

$$P(\theta \mid D, H) \propto P(D \mid \theta, H).$$

Thus the values of the parameters that maximize the posterior are those that maximize the probability of the data,

$$\theta^* = \arg\max_{\theta} P(D \mid \theta, H),$$

and they are usually referred to as the maximum likelihood (ML) estimate of the parameters.
We have already calculated those for some distributions. Here is a recap,
| Probability of the data | ML parameters | Confidence of the ML estimate |
|---|---|---|
| Exponential ($\mu$): $P(D\mid\mu)=\prod_{i=1}^{N} \frac{1}{\mu}\, e^{-x_i/\mu}$ | $\mu^* = \frac{1}{N}\sum_i x_i$ | $\sigma_{\mu} = \mu^*/\sqrt{N}$ |
| Binomial ($p$): $P(n\mid p, N)=\binom{N}{n} p^n (1-p)^{N-n}$ | $p^* = n/N$ | $\sigma_{p} = \sqrt{p^*(1-p^*)/N}$ |
| Gaussian ($\mu,\sigma$): $P(D\mid\mu,\sigma)=\prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma}\, e^{-(x_i-\mu)^2/2\sigma^2}$ | $\mu^* = \frac{1}{N}\sum_i x_i$ | $\sigma_{\mu} = \sigma/\sqrt{N}$ |

assuming $N$ data points $x_1,\dots,x_N$, or $n$ successes in $N$ trials in the binomial case.

For any given parameter $\theta$, the confidence of the ML estimate follows from the curvature of the log-likelihood at its maximum,

$$\sigma_\theta^2 = -\left(\left.\frac{\partial^2 \log P(D\mid\theta)}{\partial\theta^2}\right|_{\theta=\theta^*}\right)^{-1}.$$
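As a sanity check, the recap above can be verified numerically. Here is a small Python sketch for the exponential case (with simulated data and an invented true mean), comparing the closed-form ML estimate to a brute-force maximization of the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw N samples from an exponential with (assumed, illustrative) true mean mu = 2.0.
mu_true, N = 2.0, 10_000
x = rng.exponential(scale=mu_true, size=N)

# ML estimate of mu is the sample mean; its confidence is mu*/sqrt(N).
mu_ml = x.mean()
sigma_mu = mu_ml / np.sqrt(N)

# Cross-check: brute-force maximization of the log-likelihood
# log P(D|mu) = -N log(mu) - sum(x)/mu on a grid of mu values.
grid = np.linspace(1.0, 3.0, 2001)
loglik = -N * np.log(grid) - x.sum() / grid
mu_grid = grid[np.argmax(loglik)]

print(mu_ml, sigma_mu, mu_grid)
```

The grid maximum lands on the sample mean, and the estimate sits within a few $\sigma_\mu$ of the true value, as the table predicts.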
Least squares
If we assume that the data are independent, then the joint probability is given by the product of the probabilities of each independent experiment,

$$P(D \mid \theta, H) = \prod_{i=1}^{N} P(y_i \mid \theta, H).$$

And if we further assume that we know how ideal data behave as a function of the parameters, except for a noise that can be modeled by a Gaussian distribution,

$$y_i = f(x_i; \theta) + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma_i),$$

if we further assume that all measurements follow the same Gaussian distribution, that is $\sigma_i = \sigma$ for all $i$, and that $\sigma$ is fixed and known, then we have,

$$P(D \mid \theta, H) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2}\chi^2\right), \qquad \chi^2 = \sum_{i=1}^{N} \frac{\left(y_i - f(x_i;\theta)\right)^2}{\sigma^2}.$$

Since the maximum of the posterior distribution of the parameters will occur when $\chi^2$ is minimum, the optimal solution for the parameters (which are hiding in $f$) is referred to as the least-squares estimate.

The least-squares procedure is widely used in data analysis. Here, we have seen how its justification relies on the data being Gaussian and on the use of a uniform prior. The least-squares quantity $\chi^2$ is referred to as the $L_2$ norm.
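The correspondence between Gaussian ML and least squares can be illustrated with a short simulation, taking $f(x;\theta) = a + b\,x$ as the ideal-data model (parameter values below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y = 1.5 + 0.8 x plus Gaussian noise of known sigma = 0.3.
sigma = 0.3
x = np.linspace(0.0, 10.0, 50)
y = 1.5 + 0.8 * x + rng.normal(0.0, sigma, size=x.size)

# Least-squares estimate: minimize chi^2 over (a, b) via the normal equations.
A = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
a_ls, b_ls = np.linalg.lstsq(A, y, rcond=None)[0]

# chi^2 at the optimum, the quantity being minimized.
chi2 = np.sum((y - (a_ls + b_ls * x)) ** 2) / sigma**2

# Same answer from a generic polynomial fitter.
b_pf, a_pf = np.polyfit(x, y, 1)

print(a_ls, b_ls, chi2)
```

Both solvers return the same $(a, b)$, and at the optimum $\chi^2$ is of order $N$, as expected for data drawn from the assumed Gaussian model.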
Fit to a straight line
Next we are going to consider a biological case in which the data are given by pairs $(x_i, y_i)$ such that they fit to a straight line with Gaussian noise,

$$y_i = a + b\, x_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma).$$

Here, the parameters are those defining the straight line, offset ($a$) and slope ($b$), and the Gaussian noise parameter $\sigma$.
Quantitative Trait Loci (QTL) analysis
We have two strains of a given organism that differ in one quantitative trait. We want to see if we can identify which genomic region is responsible for this phenotypic variability. That is, to map the genomic loci responsible for this quantitative trait difference (QTL mapping).
For instance, wild isolates of two different species of fly: D. simulans and D. mauritiana that have different fly male sine song carrier frequency means
(This is an example we have seen before by Ding et al.).
In order to do this, one performs backcrosses between the F1 hybrid progeny (mau × sim) and one of the parental strains, producing a large population of backcross males B (F1 × parent) in which we expect to find a mosaic of the genomic loci of the parents.
You then collect two types of data for the backcrossed population,
Figure 1. Sine song carrier frequency in parental strains, F1 hybrids, and backcross males.

their phenotype (sine song male frequencies for our example),

their genotype (using for instance RNAseq).
We have observed that the mau allele is largely dominant (Figure 1, a reproduction of Figure 1b in Ding et al.). For a backcrossed male fly $i$, we use $g_i = 0$ to designate a sim homozygous male, that is, if the sim sequence is present in both alleles we assign $g_i = 0$; otherwise (the mau sequence is in at least one allele) we use $g_i = 1$.
In real life, you are going to test a large number of potentially informative loci; here, for simplicity, we consider one single locus.
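A minimal simulation of a single-locus backcross dataset is sketched below; the numbers (a 150 Hz baseline, a 20 Hz allele effect, 5 Hz noise) are made up for illustration and are not from Ding et al.:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical backcross population: each male is sim homozygous (g = 0) or
# carries a mau allele (g = 1) at the candidate locus with probability 1/2.
N = 200
g = rng.integers(0, 2, size=N)

# Illustrative phenotype model (numbers invented, not from Ding et al.):
# carrier frequency = a + b*g + Gaussian noise.
a, b, sigma = 150.0, 20.0, 5.0
y = a + b * g + rng.normal(0.0, sigma, size=N)

# Mean phenotype in each genotype class.
print(y[g == 0].mean(), y[g == 1].mean())
```

Datasets of this form, one phenotype value and one genotype label per male, are what the two models below are compared on.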
The models

QTL model: The locus is responsible for the phenotype, such that there is a linear fit

$$y_i = a + b\, g_i + \epsilon_i,$$

where the error $\epsilon_i$ is drawn from a Gaussian distribution $N(0, \sigma)$.

Then, the probability of the data given the QTL model with parameters $a$, $b$, and $\sigma$ is

$$P(D \mid a, b, \sigma, \mathrm{QTL}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_i - a - b\, g_i)^2}{2\sigma^2}\right).$$

NQTL model: the locus does not explain the phenotype (the null model),

$$y_i = a + \epsilon_i.$$

Then, the probability of the data given the NQTL model with parameters $a$ and $\sigma$ is

$$P(D \mid a, \sigma, \mathrm{NQTL}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_i - a)^2}{2\sigma^2}\right).$$

Deciding if this locus has a quantitative influence on the phenotype becomes a hypothesis testing question, which requires evaluating this quantity:

$$\frac{P(\mathrm{QTL} \mid D)}{P(\mathrm{NQTL} \mid D)}.$$
Hypothesis testing
As we have seen before,

$$\frac{P(\mathrm{QTL} \mid D)}{P(\mathrm{NQTL} \mid D)} = \frac{P(D \mid \mathrm{QTL})}{P(D \mid \mathrm{NQTL})} \times \frac{P(\mathrm{QTL})}{P(\mathrm{NQTL})}.$$

Assuming we have no idea of whether the locus is or is not a QTL, we use $P(\mathrm{QTL}) = P(\mathrm{NQTL}) = 1/2$; then the comparison of the models becomes the comparison of the evidences,

$$\frac{P(\mathrm{QTL} \mid D)}{P(\mathrm{NQTL} \mid D)} = \frac{P(D \mid \mathrm{QTL})}{P(D \mid \mathrm{NQTL})}.$$

For the QTL model, we have 2 parameters ($a$ and $b$) for the straight-line fit, and one parameter for the Gaussian noise ($\sigma$). The NQTL model depends on 2 parameters, $a$ and $\sigma$.

We have assumed that $\sigma$ is the same under both hypotheses. If we further assume that $\sigma$ is known,

$$P(D \mid \mathrm{QTL}) = \int\! da \int\! db\; P(D \mid a, b, \mathrm{QTL})\, P(a, b \mid \mathrm{QTL}),$$

$$P(D \mid \mathrm{NQTL}) = \int\! da\; P(D \mid a, \mathrm{NQTL})\, P(a \mid \mathrm{NQTL}).$$
The prior probability
We assume independent and flat priors,

$$P(a, b \mid \mathrm{QTL}) = P(a)\, P(b), \qquad P(a \mid \mathrm{NQTL}) = P(a),$$

such that,

$$P(a) = \frac{1}{\Delta_a} \ \ \text{for } |a| \le \Delta_a/2, \qquad P(b) = \frac{1}{\Delta_b} \ \ \text{for } |b| \le \Delta_b/2,$$

where we have introduced,

$$P(a, b \mid \mathrm{QTL}) = \frac{1}{\Delta_a \Delta_b},$$

for some constant values $\Delta_a$ and $\Delta_b$ which are arbitrary but large enough not to introduce a cut in the evidence.
Approximating the probability of the data
Remember from w02 that you can take the ML value of the parameters and their confidence value to approximate $P(D \mid \mathrm{QTL})$ and $P(D \mid \mathrm{NQTL})$.

Introducing the log probabilities,

$$L_{\mathrm{QTL}}(a, b) = \log P(D \mid a, b, \mathrm{QTL}), \qquad L_{\mathrm{NQTL}}(a) = \log P(D \mid a, \mathrm{NQTL}),$$

a Taylor expansion around the ML values (where the first-order terms vanish) results in

$$L_{\mathrm{QTL}}(a, b) \approx L_{\mathrm{QTL}}(a^*, b^*) + \frac{1}{2}\left[\frac{\partial^2 L_{\mathrm{QTL}}}{\partial a^2}(a-a^*)^2 + 2\,\frac{\partial^2 L_{\mathrm{QTL}}}{\partial a\,\partial b}(a-a^*)(b-b^*) + \frac{\partial^2 L_{\mathrm{QTL}}}{\partial b^2}(b-b^*)^2\right],$$

$$L_{\mathrm{NQTL}}(a) \approx L_{\mathrm{NQTL}}(a^*) + \frac{1}{2}\,\frac{\partial^2 L_{\mathrm{NQTL}}}{\partial a^2}(a-a^*)^2,$$

with the second derivatives evaluated at the ML values, and where the ML values of the parameters are defined by

$$\left.\frac{\partial L_{\mathrm{QTL}}}{\partial a}\right|_{a^*, b^*} = 0, \qquad \left.\frac{\partial L_{\mathrm{QTL}}}{\partial b}\right|_{a^*, b^*} = 0, \qquad \left.\frac{\partial L_{\mathrm{NQTL}}}{\partial a}\right|_{a^*} = 0.$$
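The quality of this second-order (Laplace-type) approximation can be checked numerically. The sketch below, for a toy one-parameter model of the NQTL form with invented data, compares the evidence integral computed on a grid with its Taylor-expansion counterpart:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data for the one-parameter (NQTL-like) model y_i = a + Gaussian noise.
sigma = 2.0
y = rng.normal(10.0, sigma, size=50)
N = y.size

# Log-likelihood on a grid of a values (constant terms dropped).
a_grid = np.linspace(8.0, 12.0, 4001)
ll = -0.5 * ((y[None, :] - a_grid[:, None]) ** 2).sum(axis=1) / sigma**2

# Numerical evidence integral, shifted by max(ll) for numerical stability.
da = a_grid[1] - a_grid[0]
Z_num = np.exp(ll - ll.max()).sum() * da

# Second-order Taylor result: with the same shift, the integral of the
# quadratic approximation is sqrt(2*pi*sigma^2/N).
Z_taylor = np.sqrt(2 * np.pi * sigma**2 / N)

print(Z_num, Z_taylor)
```

Here the log-likelihood is exactly quadratic in $a$, so the two numbers agree up to grid resolution; for models where it is not, the agreement holds to the extent that the posterior is sharply peaked.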
The ML values of the parameters and their confidence values
Let’s calculate the actual values of the ML parameters and their confidence values.
The log probabilities are given by

$$L_{\mathrm{QTL}}(a, b) = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - a - b\,g_i)^2 + \text{const}, \qquad L_{\mathrm{NQTL}}(a) = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - a)^2 + \text{const},$$

where we are keeping only the terms that depend on the parameters we are going to optimize.

Then, the first derivatives with respect to the parameters are

$$\frac{\partial L_{\mathrm{QTL}}}{\partial a} = \frac{1}{\sigma^2}\sum_i (y_i - a - b\,g_i), \qquad \frac{\partial L_{\mathrm{QTL}}}{\partial b} = \frac{1}{\sigma^2}\sum_i g_i\,(y_i - a - b\,g_i), \qquad \frac{\partial L_{\mathrm{NQTL}}}{\partial a} = \frac{1}{\sigma^2}\sum_i (y_i - a).$$

Then, introducing the averages

$$\bar{y} = \frac{1}{N}\sum_i y_i, \qquad \bar{g} = \frac{1}{N}\sum_i g_i, \qquad \overline{gy} = \frac{1}{N}\sum_i g_i\,y_i, \qquad \overline{g^2} = \frac{1}{N}\sum_i g_i^2,$$

we have for the ML values of the parameters,

$$b^* = \frac{\overline{gy} - \bar{g}\,\bar{y}}{\overline{g^2} - \bar{g}^2}, \qquad a^* = \bar{y} - b^*\,\bar{g} \quad (\mathrm{QTL}), \qquad a^* = \bar{y} \quad (\mathrm{NQTL}),$$

provided that $\overline{g^2} - \bar{g}^2 \neq 0$, that is, not all backcross males have the same genotype.
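These closed-form ML values can be checked against a generic least-squares solver on simulated data (the parameter values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated locus/phenotype data (illustrative parameter values).
N = 500
g = rng.integers(0, 2, size=N).astype(float)
y = 3.0 + 1.5 * g + rng.normal(0.0, 0.5, size=N)

# Closed-form ML estimates from the zero-derivative conditions.
g_bar, y_bar = g.mean(), y.mean()
gy_bar, g2_bar = (g * y).mean(), (g * g).mean()

b_ml = (gy_bar - g_bar * y_bar) / (g2_bar - g_bar**2)   # needs var(g) > 0
a_ml = y_bar - b_ml * g_bar

# Cross-check against a generic least-squares solver on the design [1, g].
A = np.column_stack([np.ones_like(g), g])
a_ls, b_ls = np.linalg.lstsq(A, y, rcond=None)[0]

print(a_ml, b_ml)
```

The closed-form and numerical solutions coincide, since both minimize the same $\chi^2$.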
As for the confidence values, the second derivatives of the log probabilities are

$$\frac{\partial^2 L_{\mathrm{QTL}}}{\partial a^2} = -\frac{N}{\sigma^2}, \qquad \frac{\partial^2 L_{\mathrm{QTL}}}{\partial a\,\partial b} = -\frac{N\,\bar{g}}{\sigma^2}, \qquad \frac{\partial^2 L_{\mathrm{QTL}}}{\partial b^2} = -\frac{N\,\overline{g^2}}{\sigma^2}, \qquad \frac{\partial^2 L_{\mathrm{NQTL}}}{\partial a^2} = -\frac{N}{\sigma^2}.$$

Exponentiating the approximations to the log probabilities, we obtain the following approximations,

$$P(D \mid a, b, \mathrm{QTL}) \approx P(D \mid a^*, b^*, \mathrm{QTL})\, \exp\!\left(-\frac{N}{2\sigma^2}\left[(a-a^*)^2 + 2\bar{g}\,(a-a^*)(b-b^*) + \overline{g^2}\,(b-b^*)^2\right]\right),$$

$$P(D \mid a, \mathrm{NQTL}) \approx P(D \mid a^*, \mathrm{NQTL})\, \exp\!\left(-\frac{N}{2\sigma^2}\,(a-a^*)^2\right).$$
Putting things together
Introducing the matrix

$$H = \frac{N}{\sigma^2}\begin{pmatrix} 1 & \bar{g} \\ \bar{g} & \overline{g^2} \end{pmatrix},$$

then we can write

$$P(D \mid \mathrm{QTL}) \approx \frac{P(D \mid a^*, b^*, \mathrm{QTL})}{\Delta_a\, \Delta_b} \int\! da \int\! db\; \exp\!\left(-\frac{1}{2}\,\boldsymbol{\delta}^{\top} H\, \boldsymbol{\delta}\right), \qquad \boldsymbol{\delta} = \begin{pmatrix} a - a^* \\ b - b^* \end{pmatrix},$$

$$P(D \mid \mathrm{NQTL}) \approx \frac{P(D \mid a^*, \mathrm{NQTL})}{\Delta_a} \int\! da\; \exp\!\left(-\frac{N}{2\sigma^2}\,(a - a^*)^2\right).$$

These integrals can be approximated by a closed-form expression (the case of the NQTL model, with one parameter, we did explicitly in w03); we have,

$$P(D \mid \mathrm{QTL}) \approx \frac{P(D \mid a^*, b^*, \mathrm{QTL})}{\Delta_a\, \Delta_b}\, \frac{2\pi}{\sqrt{\det H}}, \qquad P(D \mid \mathrm{NQTL}) \approx \frac{P(D \mid a^*, \mathrm{NQTL})}{\Delta_a}\, \sqrt{\frac{2\pi\sigma^2}{N}},$$

with $\det H = \frac{N^2}{\sigma^4}\left(\overline{g^2} - \bar{g}^2\right)$.
Finally!
The ratio of the probabilities of the QTL with respect to the NQTL model is given by

$$\frac{P(\mathrm{QTL} \mid D)}{P(\mathrm{NQTL} \mid D)} \approx \underbrace{\frac{P(D \mid a^*, b^*, \mathrm{QTL})}{P(D \mid a^*, \mathrm{NQTL})}}_{\text{data fit}} \times \underbrace{\frac{1}{\Delta_b}\sqrt{\frac{2\pi\sigma^2}{N\left(\overline{g^2} - \bar{g}^2\right)}}}_{\text{Occam's razor}}.$$

Notice the presence of both the data-fit term and the Occam's razor term. The QTL model includes the NQTL model as a particular case ($b = 0$), so by the data-fit term alone, the QTL model would always have a probability at least as high as the NQTL model's. It is the Occam's razor term that compensates for the fact that the QTL model has one more parameter than the NQTL model.
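Putting the data-fit and Occam's razor terms into code, here is a sketch that evaluates the log posterior ratio on data simulated under each model; the prior width `delta_b` is an arbitrary assumption, just as in the derivation, and all simulation parameters are invented:

```python
import numpy as np

rng = np.random.default_rng(5)

def log_posterior_ratio(y, g, sigma, delta_b=10.0):
    """log of P(QTL|D)/P(NQTL|D) under the quadratic approximation, with
    sigma known and flat priors; only the (arbitrary) width delta_b of the
    prior on b survives in the ratio."""
    N = y.size
    var_g = g.var()                                  # g2_bar - g_bar^2
    b_ml = ((g * y).mean() - g.mean() * y.mean()) / var_g
    a_ml = y.mean() - b_ml * g.mean()
    # Data-fit term: difference of the two maximized log-likelihoods (>= 0).
    chi2_qtl = np.sum((y - a_ml - b_ml * g) ** 2) / sigma**2
    chi2_nqtl = np.sum((y - y.mean()) ** 2) / sigma**2
    fit = 0.5 * (chi2_nqtl - chi2_qtl)
    # Occam's razor term: log of sqrt(2*pi)*sigma_b/delta_b (< 0 when
    # delta_b is much wider than the confidence sigma_b).
    sigma_b = sigma / np.sqrt(N * var_g)
    occam = np.log(np.sqrt(2.0 * np.pi) * sigma_b / delta_b)
    return fit + occam

N, sigma = 200, 1.0
g = rng.integers(0, 2, size=N).astype(float)
y_qtl = 5.0 + 2.0 * g + rng.normal(0.0, sigma, N)   # the locus matters
y_null = 5.0 + rng.normal(0.0, sigma, N)            # the locus is irrelevant

print(log_posterior_ratio(y_qtl, g, sigma))
print(log_posterior_ratio(y_null, g, sigma))
```

On the data simulated with a real genotype effect the log ratio is large and positive, while on the null data the small data-fit gain cannot beat the Occam penalty and the ratio favors the NQTL model, with no ad hoc threshold needed.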
In Ding et al., they use a method similar to Lander & Botstein's, in which they only calculate the data-fit term (named LOD). All LOD scores are positive (see Lander & Botstein, Figures 2 and 3). Thus, they need to find threshold values (named T), “above which a QTL will be declared present”. Selecting the appropriate threshold T is “an important issue […] What LOD threshold T should be used in order to maintain an acceptably low rate of false positives?”.

Identifying the threshold(s) T becomes a classical p-value calculation based on additional experiments for some previously known true and false QTLs. Besides, different thresholds have to be applied depending on the genome size and the density of restriction enzyme markers (see Lander & Botstein, Figure 4).
Doing the proper Bayesian calculation removes the need for such threshold value.
Generalization to many loci
If we are trying to test whether the observed phenotype variability is due to more than one locus, we would include one parameter for each locus tested, as

$$y_i = a + \sum_{k=1}^{K} b_k\, g_{ik} + \epsilon_i.$$
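A sketch of this multi-locus fit on simulated data follows; the genotype matrix and effect sizes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative multi-locus case: the phenotype depends additively on K loci.
N, K = 400, 3
G = rng.integers(0, 2, size=(N, K)).astype(float)   # genotype matrix g_ik
b_true = np.array([2.0, 0.0, -1.0])                 # locus 2 has no effect
y = 1.0 + G @ b_true + rng.normal(0.0, 0.5, size=N)

# The ML fit of a and b_1..b_K is again least squares on the design [1, G].
A = np.column_stack([np.ones(N), G])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
a_ml, b_ml = coef[0], coef[1:]

print(a_ml, b_ml)
```

The fitted effect for the locus simulated with $b_k = 0$ comes out near zero; deciding whether to keep each such locus in the model is again the QTL-versus-NQTL comparison, now applied per locus.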
Assumptions
These are the assumptions in the QTL mapping derivation presented here, each of which could be questioned and removed if necessary:

Linear fit between phenotype and genotype.

Gaussian noise.

All noise follows the same Gaussian distribution $N(0, \sigma)$,

$\sigma$ is known.

For the linear regression parameters, we have integrated their posterior probability, approximated by a second-order Taylor expansion of the log probability around their maximum likelihood values.