week 07:

Probabilistic models and Expectation Maximization

Preliminars

Present all your reasoning, derivations, plots and code as part of the homework. Imagine that you are writing a short paper that anyone in class should to be able to understand. If you are stuck in some point, please describe the issue and how far you got. A jupyter notebook if you are working in Python is not required, but recommended.

The highly recombinant Drosophila

A new species related to Drosophila has been found. While Drosophila melanogaster (Dmel) recombines on average once per chromosome, this new species, Drosophila new (Dnew), appears recombine at much higher rate.

You are given the task to estimate the recombination rate of this new species \(Dnew\).

In order to do this, you are going to use a genomic region that is conserved in both species.

Starting with a female \(Dmel/Dmel\) and a male \(Dnew/Dnew\), You have crossed heterozygous F1 females (\(Dmel/Dnew\)) with male parents \(Dnew/Dnew\). Thus your progeny of backcross flies are either homozygous \(Dnew/Dnew\) or heterozygous \(Dmel/Dnew\).

You have used RNA-seq to sequence backcross flies. The reads have been mapped to the genomic region. We are cheap, and we only have data for one fly.

This is the reads file. This file reports for each position, the number of times that nucleotide “A” was found in the reads, the number of times “C” was found, “G”, and “T”, in that order.

So, a line like

3       4       0       3

means that 3 “A”s, 4 “C”s , and 3 “T”s were found in the reads that mapped that position. Positions in the file are listed from 1 to L, where L is the total length of the genomic region we are studying.

In addition to the mapped read data, you have also sequenced the regions for both genomes which are given in fasta format in the file genomes.

Using the forward and backward algorithms we implemented last week, you can now implement an EM algorithm to optimize the transition parameter from \(Dmel/Dnew\) to \(Dnew/Dnew\), thus to determine the number of expected breakpoints for the genomic region of interest, which we can assume is representative of the overall recombination rate for \(Dnew\).

Usually, this analysis would have been done using reads from many backcross individuals for a good reason. Real Droshophila species have low recombination rates (1 break on average per chromosome), and one single fly does not provide enough information. For our weird \(Dnew\), we expect higher rates of recombination, so our cheap one-fly experiment may be enough. Was the one fly data enough for the EM algorithm to work? What would expect to happen if you had more data?