## Probabilities can be very small numbers¶

Calculating the probability of some data given a model, usually requires multiplying many different probably terms

$$P(D\mid M) = \prod_{i=1}^N P(d_i\mid M).$$

This quantity easily underflows, which means that you are trying to calculate something smaller than the precision of your computer.

If you had a data set $D$, in which each individual datum had probability of the order of $0.001$ (which is not such small quantity), then the probability of all the data is

$$P(D) = \left(10^{-3}\right)^{N} = 10^{-3 \times N}$$

## The smallest double-precision floting point number,¶

$$k_{min} \approx 10^{-323}$$

which means that at $N = 100$ data points,

$$P(D; N=100) = 10^{-300}$$

Has become so small that you would not be able to distinguish having a data set with $N=100$ measurements from having another with $N=200$ points. In other words when we go and try to calculate said probabiliy we are going to end up with 0 instead of what we know it to actually be: a really, really small number.

How do we get around this?

### Working in log space¶

Working with the log of the probability can be very useful to avoid underflow problems. Because it expands the dynamic range of the probability effectively giving you access to those small numbers that you couldn't calculate before. Also, because the logarithm of the probability is monotonically increasing, maximizing the probability and the logarithm of the probability is the same task.

But working in log-space can be helpful in many ways, let's go back to our example:

$$\log P(D\mid M) = \log \prod_{i=1}^N P(d_i\mid M) = \sum_{i=1}^N \log P(d_i\mid M).$$

If you are having trouble seeing why this is the case remember that it is a property of logarithms that $\log(a_1a_2) = \log(a_1) + \log(a_2)$. Convince yourself that this extends to N elements, $\log(a_1a_2...a_N) = \log(a_1) + \log(a_2) + ...+ \log(a_N)$ when you are sold, it is trivial to see that $\log(a_1a_2...a_N) = \sum_{i=1}^N \log(a_i)$.

If we go back to the previous example:

$$\log P(D; N=100) = -300 * \log(10) = -690.776$$

but

$$\log P(D; N=200) = -600 * \log(10) = -1382.551$$

thus, the two cases are easily distinguishable in log space.

$$\log P(D; N=200) - \log P(D; N=100) = -691.775$$

or

$$P(D; N=200) = e^{-691.775} P(D; N=100)$$

Sometimes you describe your system with a mixture of probability distributions. For instance like mixture of Gaussian distributions, or like in the case of our homework this week a mixture of exponential distributions.

Now we work in log space so you have, so you end up having to do the following calculation $\log(e^a + e^b)$

If $a$ and $b$ are large and negative (as it is the case when they represent log probabilities), then that calculation cannot be directly. For instance, a naive calculation of $log(e^{-1000} + e^{-999})$ will give you -infinity. And if we tried with positive number, for instance $\log(e^{1000} + e^{999})$ will give you +infinity.

The way to do this calculation robustly is as follows, assume $a > b$, then

\begin{aligned} \log(e^a + e^b) &= \log(\frac{e^a}{e^a}(e^a + e^b)) \\ \log(e^a + e^b) &= \log \left(e^a ( 1 + e^{b-a})\right) \\ \log(e^a + e^b) &= \log(e^a) + \log\left( 1 + e^{b-a}\right)\\ \log(e^a + e^b) &= a + \log\left( 1 + e^{b-a}\right) \end{aligned}

since $b-a <0$, the exponential $e^{b-a}\leq1$ never becomes a large number, and the calculation is robust.

Then you will calculate

$$\log(e^{-1000} + e^{-999}) = -999 + \log(1+e^{-1}) = -999 + 0.31 = - 998.69,$$

and

$$\log(e^{1000} + e^{999}) = 1000 + \log(1+e^{-1}) = 1000 + 0.31 = 1000.31,$$

The log-sum-exp trick can be generalized for an arbitrary number of terms $$n$$

$$\log(e^{a_1} + \ldots + e^{a_n}) = a_{max} + \log\left(1+\sum_{i=1}^ne^{(a_i - a_{max})}\right),$$

where $a_{max}$ is the maximum value of the set $\{a_1,\ldots, a_n\}$.

# How to integrate?¶

We want to find the Area under the curves!

### But what about higher dimensional integrals?¶

$$\int \int (x^2-y^2) dxdy$$

We don't want the area under the curve, we want the volume under the surface!

Try is yourself:

• What does the silcing [:, np.newaxis] do to the numpy array being sliced?
• How does the operation propagate?

### Don't know where to start with the homework! Or getting confused with the log computations?¶

Try starting here