%pylab inline
Populating the interactive namespace from numpy and matplotlib
Data wrangling and the data in HW6¶
For this week's homework we have two sources of data that we are going to need to use:
- The mapped reads
- The sequence data
Let's get started with the mapped reads. When we receive a file whose format we don't know, the best first step is to understand its structure. The homework tells us that "This file has 10,000 rows, each with four columns. Each column represents the number of As, Cs, Gs, and Ts (in that order) that appeared in all the reads that mapped to that position."
However, it is always good to have a visual sense of what is going on. Let's look at the mapped reads first:
x = np.loadtxt('http://mcb111.org/w06/w06-homework_backcross.reads')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-be56f2ad1e0f> in <cell line: 1>()
----> 1 x = np.loadtxt('http://mcb111.org/w06/w06-homework_backcross.reads')

...

ValueError: the number of columns changed from 4 to 1 at row 10000; use `usecols` to select a subset and avoid this error
The error above is telling us that the number of columns changes at row 10000. What does this mean? If we click the link and scroll all the way to the bottom, we see a stray \\
character on the last line, which prevents us from using the loadtxt function directly. Otherwise, the data does have 4 columns, each of which corresponds to A's, C's, G's, and T's according to the homework.
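Besides eyeballing the file in the browser, we can diagnose this kind of problem programmatically by counting fields per line. The snippet below runs on a small inline sample that mimics the file's structure (the real file lives at the URL above); the sample rows and its stray last line are illustrative:

```python
import io

# Inline sample mimicking the reads file: four whitespace-separated
# counts per row, plus a stray final line (mimicking the file's
# trailing backslash) that breaks np.loadtxt.
sample = "2 1 7 0\n3 3 2 2\n1 6 0 3\n\\\\\n"

for lineno, line in enumerate(io.StringIO(sample), start=1):
    n_fields = len(line.split())
    if n_fields != 4:
        print(f"line {lineno} has {n_fields} field(s): {line.rstrip()!r}")
```

Running this on the real file would point us straight at the offending line, without any scrolling.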
How do we get around this?¶
Let's look at the documentation for the numpy.loadtxt() function:
np.loadtxt?
There are a couple of arguments that come to mind to handle this; the one we will use is:
max_rows : int, optional
Read `max_rows` lines of content. The default is to read all the lines.
A way to work around this is to cap the number of rows read below 10,000, since by visual inspection we know the problem is there. By specifying 9999 we skip the last row, which is not informative anyway.
data = np.loadtxt('http://mcb111.org/w06/w06-homework_backcross.reads', max_rows=9999)
data.shape
(9999, 4)
data
array([[2., 1., 7., 0.], [3., 3., 2., 2.], [1., 6., 0., 3.], ..., [6., 2., 0., 2.], [1., 7., 0., 2.], [1., 0., 6., 3.]])
By calling data.shape
we can see that we ended up with an array of 9999 rows with 4 entries each, corresponding to A's, C's, G's, and T's. If, for whatever reason, we wanted each base's counts in its own array so that we could call them by position, we could make use of numpy slicing.
If we have an array of arrays we can get the first position of every sub array in the array by calling the following:
array[:,0]
Similarly for the second position we'd call
array[:,1]
It should be obvious how to extend this to any desired position i in your sub-array:
array[:,i]
mapped_A = data[:,0]
mapped_C = data[:,1]
mapped_G = data[:,2]
mapped_T = data[:,3]
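As a quick sanity check on arrays like this, summing across the four columns gives the total read coverage at each position. A minimal sketch, using a toy stand-in for data (the first three rows shown above):

```python
import numpy as np

# Toy stand-in for `data`: each row holds (A, C, G, T) counts at one position.
data = np.array([[2., 1., 7., 0.],
                 [3., 3., 2., 2.],
                 [1., 6., 0., 3.]])

coverage = data.sum(axis=1)   # total reads mapped to each position
print(coverage)               # -> [10. 10. 10.]
print(coverage.mean())        # -> 10.0
```

Checks like this catch loading mistakes early: if a column were dropped or misaligned, the per-position totals would look wrong immediately.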
Let's continue by looking at the sequence data. The homework tells us that this data is in FASTA format; if you're not familiar with the format, please look it up on Wikipedia and make sure you understand what it is. Click on the homework link to see an example, and let's figure out how to load this for our homework. Let's again try the good old numpy.loadtxt() function:
np.loadtxt('http://mcb111.org/w06/w06-homework_backcross_genome.afa')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
ValueError: could not convert string to float: '>A'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-7-a5ab630c95f4> in <cell line: 1>()
----> 1 np.loadtxt('http://mcb111.org/w06/w06-homework_backcross_genome.afa')

...

ValueError: could not convert string '>A' to float64 at row 0, column 1.
Did this useless function disappoint us again? Not really: if you look in the documentation, they mention the following:
dtype : data-type, optional
Data-type of the resulting array; default: float. If this is a structured data-type,
the resulting array will be 1-dimensional, and each row will be interpreted as an element of the array.
In this case, the number of columns used must match the number of fields in the data-type.
If you click the link you'll see that the sequence data is full of letters, which are of type str. So let's go ahead and modify that argument.
data = np.loadtxt('http://mcb111.org/w06/w06-homework_backcross_genome.afa',dtype=str)
data
array(['>A', 'GACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAAC
GAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACG
ACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGTGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGATCTTCGATTTTTTGGCAACCCAAAATGGAGGCGGACGAACGAGATGATAATGATAAGATGATTCAAAAAGACAATGCACGACAGAGAGAGCAGAAAAGATAATTAAATTGCCCCTCATTTTCTCTGGCAAATTGTAGGGTGAATTATGATCGCGTATGCGAGAGTGGTGCCAACATATTGTGCTCTTCGATTTTTTGGCAACCCAAAATGGAGGCGGATGAACGAGATGATAATATTTTCAAGTTGCCGCTAATCAAAAATAAATTCCTTGCAACATAAAATAAAGCACAAAATGCCCGCTCAAAAAAAGGCATGAATATATAAGCTCGAACATAGAACATAGGCTTGAACATATAATGACTGCCTTTCATTCTCTATCTTATATTACCGCAAACACAAAATGACAATGCACGACATAGAGAGAAAGAGAGATATTCAGATTGCCTCTCATTGTCTCACCCATATTATGGGAACCAAATATGAGCACGTATGCGAGAGGAGTGCCAACATATTGTGCTCTACGATTTTTTTGCAACCCAAAATGGCGGCGTACGAACGAGATGATAATATATTCAAATTGCCGCTAATCAGAAGCAAGTTTATTGCAATGTTCAGTGCAGCGCAAAATGGCCGCTCAAGAAAAGGCTCGAATATATATTGCCTGCCTCTCATTCACTCTCTTTTATTACCGCAAGACCAAAATGACAATGTACAACAGAGAGAGCAAGAGAGATATTTAGATTGCCTCTCCTTGTCTCTCCCATATTATAGAGACCGAAAATGATTGCGTATGCGAGAAGAGTGCCATTGTATTGAGCTCCTCGACCCAAAATA
GCGTCGGACGAACGAGATTATATATTTAAAATGCCGATCATTTTCTCATCCATATAAATACTACCGAAAATGACTGTCTAAAGGTACTCATCGACTATATTTAAATCTGTGTATTTCTGTGAATAGATTGACCTTTGCAATTTTTAACGGCATTGTCTATTAAATTAATATAATTTTCTTTTTTGATGAATATTTAACCGAACATTTACTTGAAATTAAATTATAAAATTGGTTAAATAATGTTGAAATCTTACTTTCAGCTAAATGGGGCTATTTTGCAAGGGTTCCATCATGACATTGGTAAATAATTTTTAAAGAATTAATTGTAAGTTCCAATAGACTGGAAATTATTTTGCAATATCATTCTTATCCCTATTTCCAAAAGCGAATTATTAGTTGCGTGAAAATCAGAAGGAAAATTATTTAACGTGTTATGCCACGCCAAATAGCCGCGCAATAGGAAGCTAGACTATATAATGACTGCAACGAAAATTGTAAATTCCAATTAAAAGGATATTATTGTGCGATTTCACTTTAATTCTTATTTCAAAAAAGTTAATTATTAGTTGACGGAAATCAGAACGAATTTCACCGCAACGTCTTATGCAGCACAAAATGGCGGCGCAAAAGGATGGTTGCATATACAATAACTTCATCTCATTCAATCTCTCCTATATTACCGCAAACTCGAAAGCCAAAACACGAATGATGAAGAGGGATAGATTTTATTGGGACAAAAATGATAGGTCACGCGAGAGGAGTGGTCTAAATTTTACTCTCACAAAAATGTTGGCAATACAAAATGGCGGCGGAATGAAGAGGTGAAAATATATTAAAATTGCCGCTCATTTTCTTCGCGGTAGAATTAGGACTGAACGTTGCCGGGTATAGGATCTCTATTGATGGCCTTTACTTATAAAGTGTATTTCTACAGATCAAATTACTTTTTACTCTTTATCAATATTTAAATATTATAAATTGATTTAGTTAAAATACAATTCGAACAATCTTTTCTCCAAATAATAATAATGTTTAATACCTATTTGCGCATATGCGTTTATTTTTGGGATTTAATTTTAACATTTTTCAACAAAACCGTTACAAATGTAATTTTAAATCAGGAAACGACTTTGGTATGAAAATATGTTTTTTTGTGCGCTTTTAAACATGTAACTGCTCTTTTGTGCTGTTTTATTGAATGCTATCACAGCGTAAAATTTTAGTTTTAATACCAATACATTGGGAATAATTTGCGATTTCATTCTATTCTTATGCCCAAATAAGGAAATAGTTTCCGGCAAAAAATCAGAATTTAGCTTTTACAAAAACTAGAGAGGAGAGGACAATATTATAATTGTAGACCGTTTTAAACACTTTAAAATGTTTAACCATTTATCAATTATTCTACTAAATGTAGGTGATTTTATTTATTAGAATACGAATTCTTTATCTGAATCGAACTAAGTAAGCCTAAGCGCTTAGGAAAAATACATACTTGACGAGTAGAGTGAAATAATTACAAATATTAGACATATCCATTGCTACTCGCATGTAGAGATTTCCACTTATGTTTTCTCTACTTTCAGCAACCGAGAAGAGAACCCACGTTTGAACAAGTATCGGCGTGTGGACAACAGCTATCCCCGCTTCATAACGAATGAGGCTGCCGAGGACCTGATTTACAAGAAGTCCATGGGCGAGCGGGATCAGCCACAGAGCTCAGAGCGGATCTCAATATTTAATCCGCCAGTATACACGCAGCACCAGGTGCGCAATGAAGCCCCCTACATACCCACCACATTTGACCTCCTCTCAGACGATGAGGAGTCGTCACAGAGAGTTGCCAACGCCGGGCCATCTTTCAGGCCCTTGACTTACTCGGATGCTGTGCGTCTAAGCCAGAATGGCTTCGCCAACTCCCGCGTAAGTGGGCACTCCAGCTATACGGTGCGCAGACCACCGGCACTAGTTGACAGAAGCATTCTATCCCAGGAAATGGA
GCGCATGGACCAAGAGCAGTATATCTACCTTATCCGTACCGCAGCCCAAAGTAATTCCGTGGGCAGTCACTACGCCGAACCGGTTACTGATAACTCGGAGGTCAAGAAAGTCAGTGAAACCAACAAAAGGTAAATAAATTTTTTATATCCATCCATATCCGAATCAGTGGCAATAATGCAAAATGCTGATTTTATCACCAATTAGTGACGCACCACAGCCGTTAACCCCTCAACCTACCAGACTCACCAGAACAGAATCCTTGCACCGTCGTTTTGCCAGCTGCGTCAACTTAAATGATGACTTCGCCGAGCAATTTAAAGCAAGAGCGGCGGACTGTGAAGAGAAATCCAAACATCGTCTTAGATTAGCTGAAGAGCAGAGGCTTTTTTCGAATTTCAGTGCTATAAAGAACATAGATGAACTCCGTGCCTATGAACGAAAAGTAGTGGAAAACATATTCCAGTCTTGTATCGCCCACAAGCCCATTTTTGTACTCGGGCCCTTGGACAAGCCAAATGTGAAGAAAGTGACCAAGCTCATTCCGTTAACAGAGGAGCACCACGATCGCTTTAACGAAATTACACAGGATGATAAATCGACGGTATGGCAACGAATATATTGATGTCTTTCGTACCCATTGAAAACGTTGTGGTGCTTGCGCTTTAAAATCTTATATTAGGAAATTATTTTTAAATTTAACCTACACATAACTACCGAAGACATATGCACGTTTATTAATGGGAAATGGCTTAACGACGAGGTCATTAACTTTTACATGTCCTTGCTGACAGAACGGTCGGAGAAGAGATCTGGCGTACTTCCCGCCACTTACGCCATAAACACATTCTTCGTGCCCCGCCTCCTGCAAGCTGGGCATGCAGGCATTAAGCGCTGGACTCGCAAAGTGGACTTGTTCAGCAAGGACATAATCCCGGTACCAGTGCACTGCAACGGCGTCCACTGGTGCATGGCCATCATACACTTGCGGAACAAGACAATCCGGTATTATGACTCAAAGGGAAAGCCAAACCGACCAGTGCTGGACGCTCTAGAGAAATATCTACGCGAAGAGTCAATATTCAAGCCCAAAAAGCAGTTTGATACCAGCGATTTTGTTATTGAGAGCGTGCAGAATATACCACGACAGTTAGATGGCAGCGATTGCGGTATCTTCAGCTGCATGTTCGCCGAGTATATAACGTGTGATGTGCCAATTACCTTTACCCAGTCGGAAATGTTGTACTTCCGCAAGAAGATGGCTCTAGAAATCGTCGACGGAGAGTTGTGACAGTAGAATCACACAGCTACGCAAGAATGTGGAGAATCCAGTTTAGTTATTTTTACAAATCTTACGTAAACACTCCAAGCATGAATTCGCAACAAGTGCTTAGCTATTTAATTGAATTGAGCTGGCCGAGAGATGTGCTGGTGCAATAACTTGTTCTCATATCTGATTGTAACAGAGAATCTAGTTTTTCAATAAAATTTCCCCAAGTAAAAACAATGCGAATAGGGACGTATTAATTGCCGAATCTCTTCGAGTTAATAATTAATTTTTACAATACAGCAAGCTGAGAATATGCAATTGTAATGTCCAATTCAATATTTGTAATTTACTATTTTAAGCCTAACTCTTATCTAGGGATTACTCGATTCCAACTATATTAGAGTAGAAGAAAACAATTTATTGTAACGAAGTACAAAGATCATTCTAGAAAATCACTCATACAAACCTCTAAGGCTCAAAACCGAGGTATGATCTTTAAATAAGTCAAAATTAGGAGTTTTCAGTTTGAGACCTACAACTAAATAGATCGGTGTTCTTCCACAAAATATTGTAAAGCCAGTTTGTTAAATAAAATACATGTTTTATTAATAATCCTAAGCTAAATACTCAATTATATACTTTATATGGTCGGAAAAGCTTCCTTCTGCCTGTAACATACTTCTCAACGAATCTACAATACTATTGTATATACTATACCTTTTACTATACGAG
TAACGGAGTAACGG', '>B', 'GACAATGCACGACAGAATAAGCAGAACAAATATTTAGATTGACTCTAATTTGCTCTCCCATAATATACGGAGAAATATGATCGCGTTGGCGAGTGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCTAAAATAGTGGCGGATGAACGAGACGATAATAGATTCAAGTTGCCGCTAATCAGAAACAAATTAATTGCCACGTTAAATACAGGACGGTATATGATCTTGTATGGGAGAGTAGTGCCAACATATTGTGCTAAGGAGTGCCTCTCGTTATCTTTCTTAGATTTCCGTAAAACCAAAAACACAATACACGACAGAAAGAGAGTGCAGCGGAGACATTTAGATTGCCTATTAAATATGATCGCGTATTCGAGAGTACTGCTAACATATTGTGCTCTCTATAAAAGGACAGCCTCTCATTGTGTCGTGTTTTACCGCAAACCCAAACCGACAATACACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCGTATTATTGGGAGAAATATGATCGCGTATGCGAGAGTTGTGCCAACCTATTGAGCTCTTTGATTGTTGAGCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATACAAGTTGCCGCTAATCAGAAATAAAATCATTGCAACGTTAACTACAGCACCATATATGATCGCGTATGCGAGAGTAGTGCCAACATATTAGGGTAATGAGTGCCGCTCATGCTCTGTCGTATATCACCGCGAACCCCAAAAGACAATACACGACAGAAAGAGAGAGCTACGGAGACATTTAGATTGCCTATGAAATATTATCGCGTATGAGACAGTAGTCCCAATATATTATGCCCTCTAAATAATGAATGCCCCTCATTCTGTCTTCTCTAACCGAAAACCCAAATACACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATTGGGAGAAAGATAATCGCATATGCGAGAGTAGTGCCAACATATTGTTCGCATTGATTGATTGCCAACCCAAAATGGTGGCGGATGAACGAGATGATAATATATTATAGTTGGCGCTAATCAGAAATAAATTCATTGCATCGGAAAATACAGCACAATATATGATCGCATATGCGTGAGTAGTGCCAACATATTGTGCTAGTGATTACCTCTCATTCTCCGTCTTATATTACCGCATACCTACGAAGCCGATACACGACAGAGAGAGAGAGCTGTGGAGAGATTTTGGTTGCCTATTAAATATGATCGCGTATACGAGAGTAGTGGCAACATATTGTGCTCTCTATATAATGACTGCCTCTCAGTCTGTCTTATTTTGCCGCAACCCCAAATCGACAATGCTCGACAGCGGAAGCAGAACAGATATTGAGTTTGCCTCTCATTTTGTCTCCCATATTATAGGGAGAAATCTGATCGCGCATGCGAGAGTAGGGCCCACATATTGTGCCCTGTGGTTTTTTGGCAACCCAAAATGGTCGCGAATGAACGAGATGATATCAGATTCAAGTGGCCGCTAATCAGAAACTAATTCATTGCACCGTTAACACAAGCACAATATATGAGCGCATATGCGAGAGTAGAGTCAACATATTGTGCGAATAAGGGCCTTTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGGCAATACACGACAGAGAGAGGGAGCAGCGGCGATATTTATACTGCCTATTATATATGATCGCGTATGCGAGAGTAGTGGCATAAGTTTGTGCTCTCTATATAATGACTGCCTCACATACTGACTTATTTTACTGCAAACCCAAATGGGCAATGCCCGACAGAGGAAGCAGAACAGATATTTAGATTGCCACTCATTTTCTATCCCATATTATAGGGAGAATTATGATCGTGTATGCGAGAGTAGTACCCATATATTGTGCTCTTTGATTTTTTGGCAACACAAAATGCTG
ACGGATGAACGAGATGATAATATATTCAGGTTGCCGCTAATCAGGAATAAATTCATTGCAACGTTAAATACAGCATAATATAGGACCGCGTATGCGAGAGTAGTGTAGACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACTCAAAAAGACAATCCACGAAAGGGAAAGAGAGCAGCGGAGACAGTTAGATTGCCTATTAATTCTCATCGTGTATGCGAGGGTAGGGCCAGCACATTGTGCTCTCTATATAATGATTACCTCTCTTTCTGTCTTATTTTACCGCAAACCCAACTCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGTTTGCCTTTCATTTTCTCTCCCATATTATAGGGAGACATATGATCCCGTATGCGAGAGTCGTGCCAACATATTGTGCTCTTTGATATTTTGGCAAACCAAAACTGTGGCGGGTGAACCAGACGAGAGTATCTTCAAGTTGCCGGTAATCAGAGATACATTCATTGCACTGTCAATTACAGTACAATATATCATCGCGTATGCGAGAGTAGTGCCAACATATTGCGTTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAAAGAGACAGAGCATCGGAGATATTTAGATTGCCTATCAAATGTGATCGCGTTTGCGAGAATAGTGCCAACATATTGTGCTCTCTCTATAATGACTGCCTTTCATTCTGTCTTATTTTACCGCAAACCCAAATCGACAATGCAAGTCAGAGGACGCAGAACCGATATTTAGATTGCCTCTCATTTTCTCTCCCATGTTTTAGGGAGAAATATGATCGGGTATGCGATAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGCACGAGATGATAATATATACAAGTTGCCGCAAATCAGAAATAAATTCATTGCAAGGTTAAATACACCACATTATATGACGGCGTATGCGAGAGTAGTGCCAACATATTGTGCTAATGATTGCCTCTCCTTCTACGTCTAATATTACCGCAAACCCAGAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATGGTTAGATTGCCCATTAACTATGATCGCGGATGCGAGACAAGTGCCAACCTATTGTACTCTCTATCTAATGACAGATGCTCGTTCTGTCTTATTTTACGGCAAACCCAAATCGAAAATGCACGACAGAGGAAGCAGAAGAGATATTTCGATTGCATCTCATTTTCTCTCCTATATTATCGGTACAAATATGATGGCGTATGACAGAGTAGTGCCCACACATGGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGGTGAACGAGAAGATACGATATTCAAGTTGCCGGTAATCAGAAAGAAATTCATTGCAACGTAAAATACAGCAGAATATATGATCGCGTATGTGAGAGTAATTCCATCATATTGGGCTAATGGGTGCCTATCGTTCTCTGTCTTATATGCCCGCAATCCCAAAAACACAATACACGACAGAAAGAGAGAGCAACGGAGATATTAAGATTGCCTATTAAATATGAACGCGGATGCGGGAGTAGTGCGAACAGATTGTGCTCTATTTATAATGACTGCCTATCATTCTGTCTTAAGTTACCGCAATCTCAAATCGACAAAGCACGACAGAGGAAGCATAACAGATATTTAGACTGTCACTCATTTTGTCTCCCATATGATATGGAGAAATCAAATCGAATATCCGAGAGTAGTGCCAACATATTGTGATCTTAGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGAAGGTACGGTTTTCAAAGTGCCGCTGATCAGAAATAAATTCATGATAAGGTTAAATACAGCTCATTATATGATAGCGTATGCGAGGGTAGGGCCAACACATTGTGCTAATGGGTGTCTCTCGTTCTCTGGCTTATGTTACCGCAAACCCAAAAAG
ACAATACACGGCAGAGAGAGGGAGCAGCGGAAATATTTAGGTTGGCTATTAAATATGATCACGTATGCGAGAGTGGTGCCAACATATTATGCTCTCTATAGTATGACTGTCTCTTACTCTGTCTTATTTTACCGCACACCCAAATCGACAATGCACGACAGAGGGAACAGAACAGATATTTAGATTGCCTCTCATTTACTCTCCCATATTATAGGGAGAAATATGATCGAGTATGCGAGAGTAGTGTCAACATATTGTGCACTTTGATCGCATGGCAACCCCTAATGGTGGCGGATGAACGAGATGATAATATTTTCAAGTTACCGCTAATCAGAAATAAATTCATTGCAACCTTAAAGACAGCACTATATACTATCGTGTATACAAGAGTATTGCCAACGTTTTGTGCTAATGAGTGCCTGTCGTGGTCTGTGTTATAGTACCGCAAACCCAAAAAGACAAGACACGACGGAGAAAGAGAGCAGCGGAGATATTCAGACTGCATATTAAACATGTTCGCGTATGCGAGAGTAGTGCCAATATATTGGGCTCTCTATATAACGATTGCCTCTCATTCTGTCTTATTTTAACTCAAACCTAAATCGACAAGGCACGACAGGGGAAGCAGAACCGATAACTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGACAAATATGAACGCGTATGCGAGAGTAGTGCAAACATATTCTGCTCTTTGATTCTTTGGCAACCCAAAATGGTGGGGGATGAACGAGATGATAATATATTGAAGTTGCCGCTAATCAAAAATAAATTCATTGGAACGTTAAATAGAGCACAATATATGATCGCGGATGCGGGAGTAGTGCCAACATATTGTGCTGATGAGTGCATCTCGTTCTCTGCCTTGTATTACCGCTAACCCAAAAAGCCAATACACGACAGAGAGAAAGAGCAGCGGAGATATTTAGAATGCCTATTAACTATGATCGCGTGTGCGAGAGTAGTGCCAACATATTGTGCTCTCTATATAATGCCGGCCTGTCATTCTGGCTTATTTTACCGCACAACCCAATCGACACTGCACGACAGAGCAACCAGAACAGACATTTAGATTGCCTCTCATTTTCTCTCGCCTATTATAGGGTGAAATAACATCGCGCATGCGAGAGTAGTCCCAACATATTGTGATCTTCGATTTTTTGGTACACCAAAATGGAGGCGGACGATCGAGATAAGAATGCGTAGCTGCTTCAAAAAGACAATGCACGACAGAGATAGCAGAGAAGATAATTAACTTGCCCCTCATTATCTCTGGTAAATTGTAGGGTGAATTATGATCCCGTAAGCGAGAGTGGTGCCGCCATGTCGTGCTCTACGATTTTTTCGCGACCCAAAATGGGGGCGGATGACCGAGATGATAATATCTTCAAGTTGCCGCTAATCAAAAATAAATTCCTTGCAACATAAAATAAAGCACAAAATGCCCGCACAAAAAAAGGCATGAATATATAAGCGCGAACATAGAACATAGGTTTGAACATATAATGACAGCCTTTCATTCTCTGGATAAAATCACCCCAAACACAAAATTGCATTGCACGACAAAGAGAGTAAGTCAGATAATCCGATTGCCTCTCATTGCCTCACCAATGTTACGGGAACCAAATATGAGCACGTGGGCGAGAGGAGTGCTAACATAGTGTGCTCTACGATTTTTTTGCAACCCTAAATGGCGGCGTACGAACGACATGATAATATACTCAAATTGCCGCTAATCAGAAGCAAGTTTGTTGCAATGCTCAGTGAAGCGCAAAATGGCCGCCCTAGAAATGCCTTTAATTTATATTGACTGCCTCTCCTTCTCTCTCTTTTATTACATCAAGACCAAAATGACAATGTACAACAGGGTGAGCAAGAGAGATATTTAGAATGACTTTCCTTGTCTCTCCCATATTAAGGAGACCGAAAGTGATAGCGTATGCGAGAAGAGTGTCATTGTATTGAGCTCCTCG
AACCAAAATAGCGTCTGACGAGCGACATTATATATTTAAAACGCCGATCCTTTTCCCATCCATATAAATACTACCGAAAATGACTGTCTAAAGGTACTCAGCGACTATAGTAAAATCTGTGAATTACTGTGGACAGGTTGACCTTTTCAATTTTTACCGGCATTGTCAAGTAAATTAATATAATTTTCTTTTTAGATGATTATTTAACAGAACTTTTACTTGAAAATATTTTATAAAATTCGTTTTATAATATTGAAATCTTACTTTCAGTTAAATGGGGCTAGTTTGCAAGGGTTCCATCATGACCTTTTTAAATAGGTTTGAAAGAATTAATTGTAAGTTCCAATAGACTGGAAATTATTTTACAATTGCATTTTTATACCCAGATCCATCATCGAATTGTTAGTTGCATGAAAATCAGAAGGAAAAATCTTTAACTTGTTAGGCCCTGCCACATAGCCGCGCAATAGGGAGAGAGACCATATAATCACTGCAAGGAAAAATGTAAATGCCAATTAGAAGGATAAGATTGGGAGCCTTCATTTTAATTCTAATTTCAAAAAACTTAATGATGAGATGACGGAAATCAGAACGAATTTCTCCGCAACGTCATTACCAGCACAAAATGGCGGCGCAAAAGGATGGTTGCAGATACAATAACTACATCTCATTCCATCTCTCCTATATTTCCGCAAACTCGAAAGCCAAAACACGAATGATTATGAGGGATAGGTTGTGTTGGGACAAAAATTTTAGGTCACGCGATAGGAGTGGTCTAAATTTTACTCTTACAAAAATGTTGGCGATACAAAATGGCGGCGGAATAAAGAGGTGGAAATTTATTAAGATTGCCGCTTATGTGCTTCGCGGTGGAATTAGGACTGAACGGTCCCGAGTATAAGCTCTCGATTGCTGGCCTTTACCTATAAAGTGTATTACTACAGATTAAATTACTTTTTACCGTTTATCAATATTTAAATATTATATATTGATTTAGTTATACTACAATTCGAACAATCTTTTCTCCAAATAATAATAATGGTTAATACCTATTTGATCATATACGTTTATGTTTGGGATTTAATTTTAACATTTTTCAACGTAACCGTTATAGATGTAGTTTTATATCAGGAAACGACTTTGGTATGAAATTATGTTGTTTTGCGCTCTTTTAAACATGTAACTGCTATTATGTGCTCTTTTATTGAATCCTATCGCTACGTAAAATATTAGTTTTAATATCAATACATTGGTACTAATTCGCGATTTCATTCTATTCTTATGCCCAAATAAGGAAATAGTTTCCGGCGAAAACTCACAATTTACCTTTTACAAAAACTAGAGACGAGAGGACAATATTATAATTGTAGACCGTTATAAACACTTTAAAATGTTTTACCATTTATAAATTATTTTACTAATCGTAGGTATTTTTACTTATTAGAATACGAATTCTGTATCTGATGCGAACTAAGTATGCCTAATCGCTTAGAAAATATACATACTTGACGAGTAGAGCGATAAATTAACAAATATTAGACATATTCATTGCAACCCGCAGGTAGTGATTTCCACTTATGTTTTCGCTACTTCCAGCAACAGACAAGAGAACCCGCGTTTGAACTAGTTTCGGCGTGTAGACAACAGCTATCCACGCTTCATAACGAATGAGGCTGCGGAGGACCTGATTAACAGGAAGCCCATGGGCGTGCGGGATAAGCCACAGGCCTCAGAGCGGATCTCAAAATTAAATCTGGCAGAATACGCGCAGCACCAGGTGCGCAATGAAGCACCCTACATACCCACCACATTTGACCTCCTCTCAGACGATTAGAAGGCATCACAGAGAGTTGCCAACGCCGCGACATCTATGAGGGCCTTGACTTACTCGGATGCTGTGCGTCTAAGCCGGAATGGCATCGCCAACTCCCCCGTAAGTGGGCACTCCAGCTATACGGTGCGCAGACCACTGGCACTCGATGACAGAAGCATTCTATCCC
ACGAAATGGAGCGCATGGACCTACAGTAGTATATAAACCTTAACCTGACCGCAGCCCAAAGTAATTCCGTGAGCAGTCACTACGCCGAACCGGTTACTGATAACTCGGAGGTTAAGAAAGTCAATGAAATCAACGAGAGGTAAATAAATTTTTTTTATCCATTCACATCCGAAGCAGTGGTAATAAAGCAAAATGCTGATCCTATCACCAATTAGTAACGCACCACAGGCGTTAACCCCCCAATCTACGAGACGCTCCGGAACAGATTCATTGCACCGTCGTTTTACCAGCTGCGTCAACTTATACGATGGCTTCGCCACGCAATCTGAATCCAGAGCGGCAGACTGTGAAGAGAAGTCCAAACATCGTCCTAGATTAGCTCATGAGCAGAGGGTTTTTTGGAAGTTCTGTGCTAAAAAGAACATAGAAGAACTCCGCGCCTATGTACGAAAAATAGTGCAAAACATATTCCCGTCTTGTATCGCCCTCAAGCCCATCTTTGTACTCTGGCTCTTGTACAAGCCAAATGTGAGGCAAGTGACCAAGCTCATTCAGTCAACATAGGAGCACCACGATAGCTTTACCGAAATTACACAGGATGATATATCGACGTTATTGTAACGAATATATTGATGTCTTTCGTACCCATTGAAAAGGTTGTGGTGCTTGCGATTTAAAATCTTATATTATGAATTTATTTTTAAATTGAACGTACACATAACTACCGAAGACATATGCACGTTTATTAATGGGAAATGGCTTAACGAAGAGGGCTTAAATTCTTACATGTCCTTGTTCACAGAACGCTCGGAGAAGAGATCTGGCGTGCTTCCCGCCACTTACTCCATAAACTCATTCTTCGTGCAACGGCTCCTGCATGCTAGGCATGCAGGAACTAAGCGCTGGGCTCGCAAAGTGGACTTGTCCAGCATGGACATAATCCCGGTACCAGCGCTCTGCGACGGCGACCACTGGTGCATGGCCATCATACACTAGCATAACAATACAACCCGGAAAAATGACACAAAGGGAAAGCCTTACCGAACAGTGCGGGACGCTCTAGAGAAATGACTACGCGTAAAGTCAATATTCAAGCCCAAAAAGTAGTTTGATATCAGCGATTTTGTTATTGCGAGCGTGCAGAATATACCACGACAGTTAGATGCCAGCGCTCGCGGTAACGTCAGCTGCATGTTCGCCGAGTTTATAACGTTTGGTGTGCCAATTACAGTTACCCACTCCGATATGTCGTACTTCCGCAAGACGATGACTCTAGAAATCGTCGACGGAGAGTTGTGACAGTAGAATCCTACAGCTACGATAGAATGTGGAGAATGCAGTTTAGTTATTTTGACAAATCTTACGTAAACTATCCAAGCATGAATGAGTAACAAGTGCTTAGCTATTGAATGAAACTGAGCTGGCCGAGTGATGTGCTGGTGCAATAACTTGTTCTCATATCTTATTGAAACAGAGAATCCAGTTATTCAATAAAACTTTCCCAAGTAATAACAATGCGAATAGTGACGTAATTATTGCAGAAACTCTTCGAGTTTATAATTAATTTTTACAACACAGCAGGCCGAGAATATGCAATTGTAATGTCCATTTCAATATTTGGAATTTACTAATTTAAGCATAACTCTTATCTAGGGATTACTAGATTCTAACTATATTCGAGTAGAATAAAACAATTTATAGTAACAAAGTACAAAGATCCTTATAGATAATCACTCATACCAACCTCAAAAGATCAAAACCGAGGTATGATCTTTCAATAAGTCAAAATTAGGAGTTTTCAGTTGGAGACCTACAACTAAATAGATCGGTGTTCTTCCACAAAATATTCTAAAGCCAGTTTATTAAATTAAATACATGTTTTATTAATAATCCCAAGCTAACTACTCACTTATAAACTTTATCTGGTCTAAAAAGCTTCCTCCTGCCTGTAACATACTTTTCAACGAATCTACAATACTATTGTATATACTATACCTTTT
ACTCTACCGGTAACGGTCTAACGG'], dtype='<U10000')
So we have the first header in the first position and its sequence in the second position; the next two positions hold the header and sequence of the second record. We can strip the > character, since all it does is mark a line as a header:
data.shape
(4,)
seq = []
header = []
for row in data:
    if row.startswith('>'):
        header.append(row[1:])
    else:
        seq.append([c for c in row])
species = np.array(header)
sequence = np.array(seq)
print(species)
print(sequence)
['A' 'B'] [['G' 'A' 'C' ... 'C' 'G' 'G'] ['G' 'A' 'C' ... 'C' 'G' 'G']]
Now you have an array called species with the headers of the FASTA file and, ordered in the same way, an array sequence with the sequence under each header.
species[0], sequence[0]
('A', array(['G', 'A', 'C', ..., 'C', 'G', 'G'], dtype='<U1'))
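The loop above works here because loadtxt handed us each record as a single string. For a general FASTA file, where a sequence usually wraps across many lines, a slightly more careful parser helps. This is a sketch; parse_fasta is our own helper, not a library function:

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Handles sequences that wrap across multiple lines by concatenating
    every non-header line until the next '>' header appears.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)

# Tiny example with a wrapped sequence:
example = ['>A', 'GACA', 'ATGC', '>B', 'TTTT']
print(list(parse_fasta(example)))
# [('A', 'GACAATGC'), ('B', 'TTTT')]
```

For this homework the loadtxt route is enough, but a parser like this is the safer habit for FASTA files in the wild.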
The homework also provides a .info file. Since its lines begin with #, which loadtxt treats as a comment marker by default, we pass comments=None so they are not skipped:
np.loadtxt('http://mcb111.org/w06/w06-homework_backcross_genome.info', dtype=str, max_rows=2, comments=None)
array([['#t', '1', 'pos', '1600', 'st', '1->2'], ['#t', '2', 'pos', '3441', 'st', '2->1']], dtype='<U4')
Problem overview¶
What do we want to do in this problem? Given the entire observed sequence $X_1, \dots, X_L$, we want to infer the genotype at any desired position in the sequence, $Z_i$. In the homework, we consider two possible genotypes, BB and AB. Formally, we would like to know, for every location $i$, how $P(Z_i = BB \mid X_1, \dots, X_L)$ compares to $P(Z_i = AB \mid X_1, \dots, X_L)$. Are there large contiguous blocks where there is one genotype? Are there clear transition points (breakpoints) between such blocks?
Let's look at how we can compute these posteriors. For $Z_i = BB$:
$$\begin{aligned} P(Z_i = BB \mid X_1, \dots, X_L) &= \frac{P (X_1, \dots, X_i, Z_i=BB, X_{i+1}, \dots, X_L)}{P (X_1, \dots, X_L)} \\ &= \frac{P (X_1, \dots, X_i, Z_i = BB) P (X_{i+1}, \dots, X_L \mid Z_i = BB)}{ P(X_1, \dots, X_L, Z_i=BB) + P(X_1, \dots, X_L, Z_i=AB)} \\ &= \frac{f_{BB}(i) b_{BB}(i)}{f_{BB}(i) b_{BB}(i) + f_{AB}(i) b_{AB}(i)} \\ \end{aligned}$$
In the numerator, we went from the 1st step to the 2nd by applying the rule $P(A, B) = P(A) P (B \mid A)$, which is valid for any events $A$ and $B$.
How did we go from the 1st step to the 2nd in the denominator? We used the law of total probability. If you have some event $A$, and another event $B$ that takes on $n$ possible states, then $P(A) = \sum_n P(A, B_n)$. In our case, the genotype $Z$ takes on 2 possible values, $BB$ or $AB$, so $$\begin{aligned} P(X_1, \dots, X_L) &= \sum_{Z_i} P(X_1, \dots, X_L, Z_i) \\ &= P(X_1, \dots, X_L, Z_i=BB) + P(X_1, \dots, X_L, Z_i=AB) \end{aligned}$$
Above, we also introduced some notation: $f_Z(i) = P(X_1, \dots, X_i, Z_i = Z)$, the "forward probability", and $b_Z(i) = P(X_{i+1}, \dots, X_L \mid Z_i = Z)$, the "backward probability".
We saw in class that we can write the forward/backward probabilities as recursions.
Forward: $$\begin{aligned} f_{AB}(i) &= P(X_i \mid AB)[ p f_{AB}(i-1) + (1-p) f_{BB}(i-1)] \\ f_{BB}(i) &= P(X_i \mid BB)[ p f_{BB}(i-1) + (1-p) f_{AB}(i-1)] \\ \end{aligned}$$
The initial conditions are $f_{AB}(1) = P(X_1 \mid AB) P(AB)$ and $f_{BB}(1) = P(X_1 \mid BB) P(BB)$.
Backward: $$\begin{aligned} b_{AB}(i) &= p b_{AB}(i+1)P(X_{i+1} \mid AB) + (1-p) b_{BB}(i+1)P(X_{i+1} \mid BB) \\ b_{BB}(i) &= p b_{BB}(i+1)P(X_{i+1} \mid BB) + (1-p) b_{AB}(i+1)P(X_{i+1} \mid AB) \\ \end{aligned}$$
The initial conditions are $b_{AB}(L) = 1$ and $b_{BB}(L) = 1$.
It will be extremely important to do everything in log space, because of the huge number of multiplications of small probabilities.
We also highly recommend building arrays: one representing $\log P(X_i \mid AB)$ for all $i$ in $1, \dots, L$, and similarly one for $\log P(X_i \mid BB)$, since these terms are used in the recursion for the forward and backward probabilities.
So, make $L$-long arrays for both $\log P(X_i \mid AB)$ and $\log P(X_i \mid BB)$ first.
Then, you can use a for loop to build up $\log b_{AB}(i)$, $\log b_{BB}(i)$, $\log f_{AB}(i)$ and $\log f_{BB}(i)$.
For instance, using the forward probability step:
The first position:
$$\log f_{AB}(1) = \log P(X_1 \mid AB) + \log P(AB)$$
Starting with the second index:
$$\log f_{AB}(i) = \log P(X_i \mid AB) + \mathrm{logsumexp}\big([\log p + \log f_{AB}(i-1), \log(1-p) + \log f_{BB}(i-1)] \big)$$
You could either use the logsumexp function from scipy.special, or implement the $\log(e^a + e^b)$ trick discussed previously in section.
In this implementation, it's great to have an array that you can index into, since at each $i = 2, \dots, L$, you need to access the term you already computed at $i-1$.
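Putting the forward pieces together, here is a minimal sketch of the log-space forward pass. The emission arrays log_e_AB and log_e_BB (our placeholder names for $\log P(X_i \mid AB)$ and $\log P(X_i \mid BB)$), the stay probability p, and the uniform prior are all illustrative; you would fill them in from the homework's model. We use np.logaddexp, which computes $\log(e^a + e^b)$ stably and stands in for logsumexp over two terms:

```python
import numpy as np

# Illustrative placeholders -- replace with the real per-position
# log-emission probabilities from the multinomial model.
L = 5
rng = np.random.default_rng(0)
log_e_AB = np.log(rng.uniform(0.1, 1.0, size=L))   # stand-in for log P(X_i | AB)
log_e_BB = np.log(rng.uniform(0.1, 1.0, size=L))   # stand-in for log P(X_i | BB)

p = 0.999                          # probability of staying in the same genotype
log_p, log_q = np.log(p), np.log1p(-p)

log_fAB = np.zeros(L)
log_fBB = np.zeros(L)

# Initial conditions: f_Z(1) = P(X_1 | Z) P(Z), with a uniform prior P(Z) = 1/2.
log_fAB[0] = log_e_AB[0] + np.log(0.5)
log_fBB[0] = log_e_BB[0] + np.log(0.5)

# Recursion: f_Z(i) = P(X_i | Z) [p f_Z(i-1) + (1-p) f_{Z'}(i-1)], in log space.
for i in range(1, L):
    log_fAB[i] = log_e_AB[i] + np.logaddexp(log_p + log_fAB[i-1],
                                            log_q + log_fBB[i-1])
    log_fBB[i] = log_e_BB[i] + np.logaddexp(log_p + log_fBB[i-1],
                                            log_q + log_fAB[i-1])

# log P(X_1, ..., X_L) is the log-sum of the two terminal forward values.
log_PX = np.logaddexp(log_fAB[-1], log_fBB[-1])
print(log_PX)
```

The same loop structure, run in reverse, gives the backward probabilities described next.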
The backward probabilities are done in a similar fashion, just working backwards from $b_{AB}(L) = b_{BB}(L) = 1$. You can initialize arrays of length $L$ with zeros via np.zeros(L), and then fill them in at position $L$, $L-1$, $L-2$, and so on.
import numpy as np

L = 5
log_bAB = np.zeros(L)
log_bBB = np.zeros(L)
for i in np.arange(L-2, -1, -1):
    # why L-2?
    # if an array x has length L,
    # then to access the very last index,
    # we need to index it like so: x[L-1]
    # but in our backward probability,
    # we actually already fill in the last index with b_Z(L) = 1
    # (and with logs, log b_Z(L) = log 1 = 0)
    # so, in our computation, we need to start indexing at L-2 :)
    log_bAB[i] = -0.5 + log_bAB[i+1]
log_bAB
array([-2. , -1.5, -1. , -0.5, 0. ])
log_bAB[L-1]
0.0
print(np.arange(3, -1, -1))
print(np.arange(2, 5.1))
[3 2 1 0] [2. 3. 4. 5.]
Multinomials¶
$P(X_i | G_i)$¶
Before we talk about the multinomial term encountered in the w06 homework writeup, let's first look at the binomial PMF:
$$P(X = k \mid N, p) = \frac{N!}{k!(N-k)!} p^k (1-p)^{N-k}$$
This probability mass function tells us, for some event that has a success probability $p$ (and failure probability $1-p$), the probability of observing $k$ successes out of $N$ trials (key: the trials are independent from each other).
The term with factorials can also be expressed as ${N \choose k}$ (read "N choose k"), which expresses how many different ways $k$ objects can be chosen out of $N$ objects (if we have $N$ trials, we need to count up the different ways we get $k$ successes... all in the first $k$ trials, have one failure first, etc.)
Then, the rest of the binomial PMF accounts for the probability of $k$ successes ($p^k$) and, necessarily, of $N-k$ failures ($(1-p)^{N-k}$).
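As a quick sanity check of the formula, the PMF can be evaluated directly and verified to sum to 1 over $k = 0, \dots, N$ (binom_pmf is a hypothetical helper, not part of the homework):

```python
from math import comb

def binom_pmf(k, N, p):
    # N!/(k!(N-k)!) * p^k * (1-p)^(N-k)
    return comb(N, k) * p**k * (1 - p)**(N - k)

# the PMF sums to 1 over all possible k
total = sum(binom_pmf(k, 10, 0.3) for k in range(11))
```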
It turns out that there is a natural extension to the binomial, if you had trials with more than two outcomes -- that's the multinomial.
So now let's look at the multinomial term from class: $$ P(X_i \mid G_i) = \frac{(n_a+n_c+n_g+n_t)!}{n_a! n_c! n_g! n_t!} \left\{ \begin{array}{ll} p_1^{n_a} q_1^{n_c+n_g+n_t} & \mbox{for}\, G_i = aa\\ p_1^{n_c} q_1^{n_g+n_t+n_a} & \mbox{for}\, G_i = cc\\ \ldots & \\ p_2^{n_a+n_c} q_2^{n_g+n_t} & \mbox{for}\, G_i = ac\\ \ldots & \\ \end{array} \right. $$
Let's look at a homozygous case, $G_i = cc$:
$$\begin{aligned} P(X_i = n_a, n_c, n_g, n_t \mid G_i = cc) &= \frac{(n_a+n_c+n_g+n_t)!}{n_a! n_c! n_g! n_t!} p_1^{n_c}q_1^{n_g + n_t + n_a} \\ &= \frac{(n_a+n_c+n_g+n_t)!}{n_a! n_c! n_g! n_t!} (1-\epsilon)^{n_c}\big(\frac{\epsilon}{3}\big)^{n_g + n_t + n_a} \\ &= \frac{(n_a+n_c+n_g+n_t)!}{n_a! n_c! n_g! n_t!} (1-\epsilon)^{n_c}\big(\frac{\epsilon}{3}\big)^{n_g}\big(\frac{\epsilon}{3}\big)^{n_t}\big(\frac{\epsilon}{3}\big)^{n_a} \\ \end{aligned}$$
All the terms are very analogous to the binomial case. Here, we're looking at the probability of observing some particular ensemble $(n_a, n_c, n_g, n_t)$ -- counts of reads -- given a particular genotype.
The term with factorials expresses the number of ways of drawing $n_a$ A's, $n_c$ C's, $n_g$ G's, and $n_t$ T's, out of the total read counts ($n_a + n_c + n_g + n_t$).
The remainder of the term deals with probabilities.
If the genotype is truly $cc$, we should expect all of the reads to be $c$. But, we assume there could be some error, whether it's sequencing or alignment-based, so the probability of seeing a $c$ is 1 minus some small error, $\epsilon$. That forms $p_1$: the probability of seeing the correct read for the genotype. Since we have a multinomial process where a read can be C or any of A, G, T, we also need terms for A, G, T. In our case, we need $p_c + p_a + p_g + p_t = 1$, so if $p_c = 1-\epsilon$, then setting $p_a = p_g = p_t = \frac{\epsilon}{3}$ would satisfy this.
Your code will need a way to switch like this. Given some genotype, you need to use a particular multinomial distribution, which lines up the success probability with the bases in your genotype.
Now we are in the clear. We have a factorial term describing the number of ways we can order our particular set of $n_a$ A's... $n_t$ T's, and we have probability terms for all of the bases too. IN THIS PARTICULAR CASE, where $cc$ is the genotype, the "correct" base is $c$, and we would want $p_1^{n_c} = (1-\epsilon)^{n_c}$ in the expression. If we were looking at $G_i = aa$, then we would want to match $p_1$ with $n_a$.
Now let's look at a heterozygous case, say $G_i = ag$:
$$\begin{aligned} P(X_i = n_a, n_c, n_g, n_t \mid G_i = ag) &= \frac{(n_a+n_c+n_g+n_t)!}{n_a! n_c! n_g! n_t!} p_2^{n_a + n_g}q_2^{n_c + n_t} \\ &= \frac{(n_a+n_c+n_g+n_t)!}{n_a! n_c! n_g! n_t!} \big(\frac{1}{2} - \frac{\epsilon}{2}\big)^{n_a + n_g}\big(\frac{\epsilon}{2}\big)^{n_c + n_t} \\ &= \frac{(n_a+n_c+n_g+n_t)!}{n_a! n_c! n_g! n_t!} \big(\frac{1}{2} - \frac{\epsilon}{2}\big)^{n_a}\big(\frac{1}{2} - \frac{\epsilon}{2}\big)^{n_g}\big(\frac{\epsilon}{2}\big)^{n_c}\big(\frac{\epsilon}{2}\big)^{n_t} \\ \end{aligned}$$
Here, if the genotype is $ag$, then we would expect to see either $a$ or $g$ as the correct bases. But, if there's some sequencing/alignment error, we could see a different base. Let's again say the probability of a correct base is $1 - \epsilon$. If we assume that $a$ and $g$ are equally likely to occur, then we need $1 - \epsilon = p_a + p_g$. Thus, $p_a = p_g = \frac{1}{2} - \frac{\epsilon}{2}$. And because $p_a + p_g + p_c + p_t = 1$, then $p_c = p_t = \frac{\epsilon}{2}$.
from math import factorial
def log_factorial(n):
    return np.log(factorial(n))
log_factorial(10), log_factorial(0)
(15.104412573075516, 0.0)
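Putting the homozygous and heterozygous cases together, here is one way the genotype "switch" could look in code. This is a sketch under the error model above; the function name, the genotype-as-string convention, and the (n_a, n_c, n_g, n_t) count ordering are all assumptions:

```python
import numpy as np
from math import factorial

def log_factorial(n):
    return np.log(factorial(n))

def log_P_X_given_G(counts, genotype, eps=0.01):
    # counts = (n_a, n_c, n_g, n_t); genotype is a 2-character string, e.g. 'cc' or 'ag'
    # log of the multinomial coefficient: log N! - sum of log n_b!
    logP = log_factorial(sum(counts)) - sum(log_factorial(c) for c in counts)
    if genotype[0] == genotype[1]:
        # homozygous: p1 = 1 - eps for the correct base, q1 = eps/3 for the other three
        log_p, log_q = np.log(1 - eps), np.log(eps / 3)
    else:
        # heterozygous: p2 = 1/2 - eps/2 per correct base, q2 = eps/2 for the other two
        log_p, log_q = np.log(0.5 - eps / 2), np.log(eps / 2)
    for base, c in zip('acgt', counts):
        logP += c * (log_p if base in genotype else log_q)
    return logP
```

For large read counts, scipy.special.gammaln(n + 1) is a numerically safer way to compute $\log n!$ than taking the log of math.factorial.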
$P(G_i | Z_i)$¶
To calculate the probability of the genotype given the ancestry, we need to know two things: the ancestry (which is given) and the genotype, a pair of residues at a particular position in the chromosome. The possible genotypes are:
AA, AC (or CA), AT (or TA), AG (or GA), CC, CT (or TC), CG (or GC), TT, TG (or GT), GG
Since we can't tell which allele was provided by which parent, the order doesn't matter, and we are left with 10 distinct genotypes.
So if we are looking at the genotypes that are possible given ancestry BB, $P(G_i|BB)$, for example, we need to look at parental genome B at the given position. Since we are assuming ancestry BB, both alleles come from parent B, so only the homozygous genotype matching B's base at that position is possible, and no other.
Similarly, if we are looking at the genotypes that are possible given ancestry AB, $P(G_i|AB)$, we need to look at the same position in both parental genomes; only the combination of the two parental bases at that position is possible, and no other.
Let's look at a couple of examples:
data = np.loadtxt('http://mcb111.org/w06/w06-homework_backcross_genome.afa',dtype=str)
seq = []
header = []
for row in data:
    if row.startswith('>'):
        header.append(row[1:])
    else:
        seq.append([c for c in row])
species = np.array(header)
sequence = np.array(seq)
species[0],sequence[0][203]
('A', 'A')
species[1],sequence[1][203]
('B', 'C')
If we are looking at position 203 (counting from 0), then given an ancestry of BB the genotype has to be CC, because both alleles come from the same parent; but given an ancestry of AB the genotype has to be AC, because each allele comes from a different parent.
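Since the genotype is fully determined by the ancestry and the parental bases, $P(G_i|Z_i)$ is just an indicator. A minimal sketch (the function name and argument conventions are assumptions):

```python
def P_G_given_Z(genotype, i, seqA, seqB, ancestry):
    # genotype: an unordered pair of bases, e.g. ('A', 'C')
    # seqA, seqB: the two parental sequences; ancestry: 'AB' or 'BB'
    if ancestry == 'BB':
        allowed = (seqB[i], seqB[i])   # both alleles come from parent B
    else:
        allowed = (seqA[i], seqB[i])   # one allele from each parent
    return 1.0 if sorted(genotype) == sorted(allowed) else 0.0
```

Because genotypes are unordered, both pairs are sorted before comparing, so ('A', 'C') and ('C', 'A') are treated as the same genotype.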
While loops!¶
From section in week 5 you might remember this example, where we are trying to infer the parameters underlying these points. Much like in the case of expectation maximization, the approach we took required iterating a number of times, updating our parameters and re-calculating a score. Let's look at two ways we could run this example with loops, and what we need to keep in mind for each.
%pylab inline
# Simulate points
w1 = 0.5
w0 = 3
sigma2 = 3
N = 50
err = np.random.normal(0, np.sqrt(sigma2), N)  # np.random.normal takes the standard deviation, so take the square root of the variance sigma2
X = np.array(sorted(np.random.uniform(-10,10,N)))
T = w0 + X*w1 + err
plt.plot(X,T,".")
plt.show()
Populating the interactive namespace from numpy and matplotlib
from scipy.special import logsumexp
def log_posterior(x, t, w0, w1, sigma2):
    # Gaussian log-likelihood with a flat prior on (w0, w1)
    logconst = -0.5*np.log(2*np.pi*sigma2)
    logP = sum(-(t - w0 - w1*x)*(t - w0 - w1*x)/2/sigma2 + logconst)
    return logP
w0_prev = 0
w1_prev = 0
posterior_prev= log_posterior(X, T, w0_prev, w1_prev,sigma2)
iterations = 100000
w0_sampled = np.empty(iterations)
w1_sampled = np.empty(iterations)
for i in range(iterations):
    # draw a new set of parameters using the previous values
    w0_cur = w0_prev + np.random.normal(0, 0.1)
    w1_cur = w1_prev + np.random.normal(0, 0.1)
    # calculate the posterior with the new parameters
    posterior_cur = log_posterior(X, T, w0_cur, w1_cur, sigma2)
    # log of the acceptance ratio; a single scalar, so no logsumexp is needed
    log_alpha = posterior_cur - posterior_prev
    # Metropolis rule: accept with probability min(1, exp(log_alpha))
    if np.log(np.random.uniform()) < log_alpha:
        w0_prev = w0_cur
        w1_prev = w1_cur
        posterior_prev = posterior_cur
    # update the result arrays (store the current state, accepted or not)
    w0_sampled[i] = w0_prev
    w1_sampled[i] = w1_prev
For loop:
You have to pre-define the number of iterations, which means your algorithm might not have enough room to converge properly, or might have too much room and take longer than needed. We used fixed-size structures in this example (the numpy arrays), meaning that we allocated the space in which our results would be stored before running the code. The same can also be achieved with growable structures, like lists.
plt.plot(list(range(iterations)), w0_sampled)
plt.hlines(np.mean(w0_sampled[1000:]), -2000, iterations+2000, 'r')
plt.xlabel("Iterations")
plt.ylabel("sampled w_0")
plt.show()
plt.plot(list(range(iterations)), w1_sampled)
plt.hlines(np.mean(w1_sampled[1000:]), -2000, iterations+2000,'r')
plt.xlabel("Iterations")
plt.ylabel("sampled w_1")
plt.show()
from scipy.special import logsumexp
def log_posterior(x, t, w0, w1, sigma2):
    # Gaussian log-likelihood with a flat prior on (w0, w1)
    logconst = -0.5*np.log(2*np.pi*sigma2)
    logP = sum(-(t - w0 - w1*x)*(t - w0 - w1*x)/2/sigma2 + logconst)
    return logP
w0_prev = 0
w1_prev = 0
posterior_prev= log_posterior(X, T, w0_prev, w1_prev,sigma2)
w0_sampled = []
w1_sampled = []
i = 0
stop=1
while stop == 1:
    # draw a new set of parameters using the previous values
    w0_cur = w0_prev + np.random.normal(0, 0.1)
    w1_cur = w1_prev + np.random.normal(0, 0.1)
    # calculate the posterior with the new parameters
    posterior_cur = log_posterior(X, T, w0_cur, w1_cur, sigma2)
    # log of the acceptance ratio; a single scalar, so no logsumexp is needed
    log_alpha = posterior_cur - posterior_prev
    # Metropolis rule: accept with probability min(1, exp(log_alpha))
    if np.log(np.random.uniform()) < log_alpha:
        w0_prev = w0_cur
        w1_prev = w1_cur
        posterior_prev = posterior_cur
    # update the result lists
    w0_sampled.append(w0_prev)
    w1_sampled.append(w1_prev)
    i += 1
    # crude exit condition: stop once the last 10 samples have stopped changing
    if i > 101:
        if (np.mean(w0_sampled[-10:-1]) - w0_sampled[-1] == 0) and (np.mean(w1_sampled[-10:-1]) - w1_sampled[-1] == 0):
            stop = 0
len(w0_sampled)
71842
While loops:
They will run forever unless you specify an exit condition. In this case we look at the last set of sampled values and measure how much they changed; because of the nature of this algorithm (unlike EM) we can get fluctuations over and under the optimal value, so we compare against an average of the recent samples. In the case of the EM algorithm it is sufficient to look at the last value and the one before it and set a threshold for their difference!
Using this approach potentially reduces the number of iterations you go through before convergence, but we could run into the scenario where our condition is never reached and the code runs forever, or until your computer runs out of memory.
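For the EM case mentioned above, the exit condition can literally be a threshold on the difference between the last two scores. A minimal sketch (converged is a hypothetical helper, and the tolerance is an arbitrary choice):

```python
def converged(scores, tol=1e-6):
    # EM-style stopping rule: compare only the last two scores
    return len(scores) >= 2 and abs(scores[-1] - scores[-2]) < tol
```

Inside a while loop you would append each iteration's score to a list and break as soon as converged(scores) returns True.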
plt.plot(list(range(len(w0_sampled))), w0_sampled)
plt.hlines(np.mean(w0_sampled[1000:]), 0, len(w0_sampled), 'r')
plt.xlabel("Iterations")
plt.ylabel("sampled w_0")
plt.show()
plt.plot(list(range(len(w1_sampled))), w1_sampled)
plt.hlines(np.mean(w1_sampled[1000:]), 0, len(w1_sampled), 'r')
plt.xlabel("Iterations")
plt.ylabel("sampled w_1")
plt.show()