MCB111: Mathematics in Biology (Fall 2024)
week 05:
Maximum likelihood and least squares
Preliminaries
Present all your reasoning, derivations, plots and code as part of the homework. Imagine that you are writing a short paper that anyone in class should be able to understand. If you get stuck at some point, please describe the issue and how far you got. If you are working in Python, a Jupyter notebook is not required, but it is recommended.
More on QTLs
We continue with the example we discussed in class from Ding et al. about fly variants with different male sine song frequencies.
For each backcross fly, we do RNA-seq and, using a clever model named multiplex shotgun genotyping, we can assign each allele to either the sim parent or the mau parent. The mau allele seems to be largely dominant, so we assign the genotype “1” if both alleles are sim and “0” otherwise.
Here is the data file w05-homework.dat. Each line corresponds to one backcross male. Each line includes the sine song frequency (first field) followed by the genotypes (1 or 0) for 10 independent loci.
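As a starting point, here is a minimal sketch of how a file with this layout could be read and split into phenotypes and genotypes. It assumes the fields are separated by whitespace, which is a guess about the real file, and it uses three made-up rows rather than w05-homework.dat. The variable names are illustrative only.

```python
import io
import numpy as np

# Made-up rows in the layout described above: frequency, then 10 genotypes.
# Whitespace-separated fields are an assumption about the real file.
example = io.StringIO(
    "152.3 1 0 0 1 1 0 1 0 0 1\n"
    "148.9 0 0 1 1 0 0 1 1 0 0\n"
    "160.1 1 1 0 0 1 0 0 1 1 1\n"
)

data = np.loadtxt(example)             # shape (n_flies, 11)
freqs = data[:, 0]                     # sine song frequency of each fly
genotypes = data[:, 1:].astype(int)    # shape (n_flies, 10), entries 0 or 1

# Bin counts for a quick look at the distribution of frequencies
counts, edges = np.histogram(freqs, bins=5)
print(data.shape, genotypes.shape, counts.sum())
```

For the real file, the `io.StringIO` object would be replaced by the path to the data file, and the bin counts can be passed to any plotting routine.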
- 
    Sanity check: calculate the histogram of sine song frequencies for all backcross males. 
- 
    For each locus, compare the two hypotheses and decide whether it is linked to the phenotype or not. 
- 
    Technical note: what to do with \(\sigma\), the noise of the Normal fit of the phenotypes to the genotypes?
    - The simplest thing to do is to take the maximum likelihood estimate from the data under each hypothesis,
      \[\begin{aligned} \sigma_{QTL} &= \sqrt{\frac{\sum_{i=1}^N (f_i - a^\ast - b^\ast g_i)^2}{N}}\\ \sigma_{NQTL} &= \sqrt{\frac{\sum_{i=1}^N (f_i - c^\ast)^2}{N}} \end{aligned}\]
      where \(f_i\) is the phenotype for fly \(i\), \(g_i\) is the genotype for the same fly at a given locus, and \(a^\ast\), \(b^\ast\), and \(c^\ast\) are the ML values of the parameters of the two hypotheses.
    - Otherwise, for extra credit, you could be Bayesian on the noise parameter \(\sigma\) and integrate it out. In that case, use a constant prior over a finite range \([0,\sigma_{max}]\), where \(\sigma_{max}\) is larger than the values of \(\sigma\) you observe.
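To make the simplest option concrete, the sketch below computes the ML parameters and the two noise estimates for a single locus, on synthetic data rather than the homework file. It assumes the QTL hypothesis is \(f_i \sim \mathrm{Normal}(a + b\, g_i, \sigma)\) and the NQTL hypothesis is \(f_i \sim \mathrm{Normal}(c, \sigma)\), as the formulas for \(\sigma_{QTL}\) and \(\sigma_{NQTL}\) suggest. The function name `ml_fit` and the simulated effect sizes are made up for illustration. The difference of maximized log-likelihoods printed at the end is only one possible ingredient of the comparison: it does not account for the extra parameter of the QTL hypothesis.

```python
import numpy as np

def ml_fit(f, g):
    """ML parameters, noise estimates, and maximized log-likelihoods for one locus."""
    n = len(f)
    # QTL hypothesis: f_i ~ Normal(a + b*g_i, sigma). For a Normal model the
    # ML values a*, b* are the least-squares line through the points (g_i, f_i).
    b, a = np.polyfit(g, f, 1)
    sigma_qtl = np.sqrt(np.sum((f - a - b * g) ** 2) / n)
    # NQTL hypothesis: f_i ~ Normal(c, sigma). The ML value c* is the mean.
    c = np.mean(f)
    sigma_nqtl = np.sqrt(np.sum((f - c) ** 2) / n)

    def max_loglik(sigma):
        # With sigma at its ML value, the exponent of the Normal sums to -n/2
        return -0.5 * n * (np.log(2 * np.pi * sigma**2) + 1.0)

    return {
        "a": a, "b": b, "c": c,
        "sigma_qtl": sigma_qtl, "sigma_nqtl": sigma_nqtl,
        "loglik_qtl": max_loglik(sigma_qtl),
        "loglik_nqtl": max_loglik(sigma_nqtl),
    }

# Synthetic data (not the homework file): one linked and one unlinked locus
rng = np.random.default_rng(0)
n_flies = 200
g_linked = rng.integers(0, 2, size=n_flies)
g_unlinked = rng.integers(0, 2, size=n_flies)
f = 150.0 + 10.0 * g_linked + rng.normal(0.0, 5.0, size=n_flies)

for name, g in [("linked", g_linked), ("unlinked", g_unlinked)]:
    fit = ml_fit(f, g)
    diff = fit["loglik_qtl"] - fit["loglik_nqtl"]
    print(f"{name}: b*={fit['b']:.2f}  log-likelihood difference={diff:.2f}")
```

Because the NQTL hypothesis is the special case \(b = 0\) of the QTL hypothesis, \(\sigma_{QTL}\) can never exceed \(\sigma_{NQTL}\), so the maximized log-likelihood difference is never negative. That is why the raw difference alone cannot settle the question.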
The phenotypic data are real (from Ding et al.); the genotypes for the 10 loci are made up.
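For the extra-credit option of integrating over \(\sigma\), one possible numerical approach is a sum over a grid of \(\sigma\) values, done in log space to avoid underflow. The sketch below is a simplification: it treats the residuals as fixed (for example, computed at the ML values of the other parameters) and integrates only over \(\sigma\) with a constant prior \(1/\sigma_{max}\) on \([0,\sigma_{max}]\). The function name, the grid size, and the synthetic residuals are all assumptions made for illustration.

```python
import numpy as np

def log_marginal_over_sigma(residuals, sigma_max, n_grid=2000):
    """log of (1/sigma_max) * integral over [0, sigma_max] of
    prod_i Normal(r_i | 0, sigma) d sigma, approximated on a grid."""
    n = len(residuals)
    ss = np.sum(residuals ** 2)
    # The grid starts just above 0, where the likelihood is vanishingly small
    sigma = np.linspace(sigma_max / n_grid, sigma_max, n_grid)
    dsigma = sigma[1] - sigma[0]
    log_like = -0.5 * n * np.log(2 * np.pi * sigma**2) - ss / (2 * sigma**2)
    # Riemann sum in log space: subtract the maximum before exponentiating
    m = np.max(log_like)
    log_integral = m + np.log(np.sum(np.exp(log_like - m))) + np.log(dsigma)
    # The constant prior contributes a factor 1/sigma_max
    return log_integral - np.log(sigma_max)

# Synthetic residuals (not from the homework data)
rng = np.random.default_rng(1)
residuals = rng.normal(0.0, 5.0, size=200)
log_marginal = log_marginal_over_sigma(residuals, sigma_max=50.0)

# For comparison: the log-likelihood with sigma fixed at its ML value
n = len(residuals)
sigma_ml = np.sqrt(np.sum(residuals ** 2) / n)
log_max = -0.5 * n * (np.log(2 * np.pi * sigma_ml**2) + 1.0)
print(log_marginal, log_max)
```

The marginal value is an average of the likelihood over the prior range, so it is always below the maximized log-likelihood, and it decreases as \(\sigma_{max}\) grows. Using the same \(\sigma_{max}\) for both hypotheses keeps that factor from biasing the comparison.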