MCB111: Mathematics in Biology (Fall 2020)


week 00:

Sampling from a probability distribution

What is sampling?

Sampling from a distribution means to take a number of data points such that their likelihood is determined by the distribution’s probability density function. As the number of samples increases, the histogram of the drawn point will increasingly resemble the original distribution.

Consider the distributions in Figure 1, you expect to see the points to start accumulating around the regions with higher probability. If the sampling continues indefinitely, you expect to eventually cover the whole sampling region. The histogram of sampled points will approximate the actual probably density function. The larger the number of points, the better the approximation.


Figure 1. Top: samples for a Normal distribution. Bottom: samples for an exponential distribution. Left: 10 samples, Right: 50 samples. Sampled points are in red.

Why sampling?

Sampling from a distribution has a number of uses. One of them is to test alternative hypothesis. We do that when we want to test how data produced in an experiment fits a simpler and perhaps more boring hypothesis than the interesting hypothesis we are testing. Oftentimes the alternative hypothesis can be formulated mathematically and samples can be taken from it, and those can be compared to the actual measurements. Another use is when the probability distribution is the posterior distribution of the parameters of a model, and sampling from that distribution provides samples of models compatible with the data.

Once we have a sample for a given distribution probability density , then we can use those to estimate the averages of quantities derived from , as follows

For instance, an estimation of the mean of the distribution from the sample corresponds to

The cool thing we are showing here

Here we are going to show that one can produce samples from any arbitrary probability distribution just by using samples from a uniform random variable .

This is an important result because even though Python has functions to generate samples from most standard probability distributions using scypy.stats, for instance, you can sample from the exponential distribution using expon.rvs, those don’t help if you have an experimentally determined probability distribution without an analytical expression.

The thing is, you only need math.random() to sample from any arbitrary distribution.

Some notation on PDFs and CDFs

For a random variable , we define the probability density function (PDF) as

\begin{equation} P(X = x) = p(x), \end{equation}

and the cumulative density function (CDF) as

\begin{equation} F_X(x) = P(X\leq x) = \int_{y\leq x} p(y)\ dy. \end{equation}

is the probability of getting the value or any value smaller than it. When the PDF is high, the slope of the CDF is large, when the PDF is low, the slope of the CDF is small.

The CDF has (for a continuous variable) the following properties,

The CDF function is always increasing, that is if then . Most continuous distribution are also strictly increasing, but it does not have to be that way, and it could be the case that for .

Geometrically is the area under the curve of up to . See Figure 2. And we can see that


Figure 2. Relationship between the probability density function (PDF) and the cumulative density function (CDF).

The CDF for the Uniform distribution

Let’s take a moment to remind ourselves of an important property of the Uniform distribution.

A uniform random variable on is such that the PDF is given by

And the CDF is given by

And in the particular case of a uniform random variable on , we have the results


Figure 3. The Uniform distribution. (left) the probability density function (PDF), (right) the cumulative density function (CDF)

That is, if we sample a number using a random number generator, such as random.random(), we have that

In particular, because for any CDF, takes values in then we have,

This apparently trivial result is at the core of the Inverse Transformation method for sampling that we discuss next.

The random variable is uniformly distributed

For any random variable that follows the probability distribution , such that , we can show that is the uniform distribution U in .

What does this mean?

It means that if is a sample from a probability distribution with CDF , then the sample given by


follows a uniform distristribution in .


Figure 4.

In python pseudo code

Produces the result in Figure 4.

Proof

Introduce . Then for the CDF of Z we have,

Thus, is the uniform distribution in .`

We will also use this result later in the course when we discuss p-values. After all, a p-value is a CDF (or more precisely), which means that p-values are uniformly distributed. We will discuss the implications of that fact.

The Inverse Transformation Method for sampling

This method allows us to sample from an arbitrary probability distribution using only a random number generator that samples uniformly on .

How it works

The Inverse Transformation method just requires to sample a value from the Uniform distribution on , and to take as the sampled value.

In practice, it requires to have the cumulative distribution function and


Figure 5. Graphical representation of the Inverse transformation sampling methods

Notice that for this Inverse Transformation to work, the CDF has to be invertible, which requires that if then , but not , in which case the inverse would not be possible to calculate.

Why it works

We want to show that when and has been sampled uniformly in is indeed .

Introduce the random variable , we want to show that .

Because means that , then

where the last equalty is the result of the property we show before for the Uniform distribution, so that .

where for the last step, I have used the property that we found for the uniform distribution in the previous section, and described in the bottom right plot in Figure 2.

Example: sampling from an Exponential Distribution

The exponential distribution is one of the few cases of a distribution for which the CDF has an inverse with an analytic expression.

The exponential distribution describes processes that occur randomly over time with a fixed rate . The exponential distribution is widely used in biology to describe decay processes such as RNA or protein degradation for instance.

The PDF of the exponential distribution is given by

where is an arbitrary parameter, and the range of is .


Figure 6. Inverse sampling for the exponential distribution.

The CDF can be calculated analytically as

For a value sampled uniformly in , a sampled according to the exponential distribution can be obtained using the expression,

that is

or

finally

In summary, to sample from an exponential distribution can be obtained from uniformly sampled values on as

Notice that the sampled values are positive as if ,

Indirect sampling methods

The Inverse Transformation Method directly samples from the distribution . In some cases, is not available, and several algorithms have been designed to sample when directly sampling from is not possible.

Rejection sampling


Figure 7. Rejection sampling

To sample from a probability density is to sample uniformly the area under the curve. In rejection sampling, you sample the area under a different density , with the restriction that , see Figure 7 from “MonteCarlo Sampling” by M I Jordan.

To generate a sample from ,

In regions where is high, we are less likely to reject a sample , and we will get more values in that region.

This method we may waste a lot of time rejecting samples if the distribution has many peaks and valleys.

Importance sampling

Importance sampling is used when sampling from a given distribution with density is very difficult, but there is a related distribution from which it is easier to sample.

Importance sampling does not allow to calculate directly samples of , but it can be used to calculate averages of an arbitrary function respect to the density .

which can be expressed using the auxiliary distribution as

where the weights are defined as

The importance sampling method goes as follows

Monte Carlo sampling

Monte Carlo sampling is a very generic term to describe any method that allows you to approximate a probability distribution by taking a number of samples , such that for any function of the random variable you can use,

Many difference techniques or “sampling rules” can be used to produce those samples representatives of .

Rejection sampling and Importance sampling are examples of Monte Carlo sampling. Another well known method is Markov Chain Monte Carlo (MCMC), in which the sampling rule is a Markov process (that is, a process that depends only on the previously sampled value).