Data wrangling and the data in HW6

For this week's homework we have two sources of data that we are going to need to use:

  1. The mapped reads
  2. The sequence data

Let's get started with the mapped reads. If we get a file that we don't know how it is formatted, the best way to attempt to handle it is to understand its structure. The homework tells us that "This file has 10,000 rows, each with four columns. Each column represents the number of As, Cs, Gs, and Ts (in that order) that appeared in all the reads that mapped to that position."

However, it is always good to have a visual sense of what is going on. Let's look at the mapped reads first:

The error above is telling us that there is a wrong number of columns at line 10000? what does this mean? Let's take a look at the data by clicking the link scroll all the way to the bottom and see that there is a \\ character which is not allowing us to use the loadtxt function directly. However, we can see that the data is 4 different columns, each of which corresponds to A's, C's, G's, and T's according to the homework.

How do we get around this?

Let's look at the documentation for the numpy.loadtxt() function

There are two arguments that come to mind to handle this, the first is:

max_rows : int, optional
    Read `max_rows` lines of content. The default is to read all the lines.

A way to work around this is to set the max number of rows read below 10000, since by visual inspection we know the problem is there. by specifying 9999 we'd skip the last row which is not informative anyways.

By calling data.shape we can see that we ended up with an array of 9999 elements with 4 elements each. corresponding to A's, C's, G's, and T's. If for whatever reason we wanted all the bases to be on the same array so that we could call them by position we could make use of numpy slicing.

If we have an array of arrays we can get the first position of every sub array in the array by calling the following:


Similarly for the second position we'd call


It should be obvious how to extend this to any desired position i in your sub-array:


Let's continue by looking at the sequence data, the homework tells us that this data is in a FASTA format. if you're not familiar with the format please look it up on wikipedia, and just understand what it is, click on the homework link to see an example and let's figure out how to load this into our homework, let's again try with the good old numpy.loadtxt() function:

So we have the header, in the first position. The sequence of that in the second position. In the next two positions are the header and sequence for the next sequence data.

We know we can ignore the > character as all it does is depict the start of a line that contains a header:

Tips on implementing iterations

If the full dimension of final results is known and all results should be kept

In our case, we know the forward-backward algorithm will generate results of size $4\times L$ (note: the final position is missing in the mapped reads). For instance, $f_{BB}(i)$ is of length $L$. We want to keep all the results

Recipe: initialize a dummy array matching the dimension of the final results, and update the elements during each iteraction of a for-loop

If the total number of iteractions is unknown and all results should be kept

Recipe: initialize an empty array (or an array with necessary elements to begin with), and append the elements during each iteraction of a while loop

If the total number of iterations is unknown and only the final result should be kept

Recipe: use a variable(s) that will be updated in a while loop