week 01:

The entropy of a continuous distribution can be positive or negative

In class, we introduced the entropy of a distribution

\begin{equation} \mbox{Entropy}(X) = H(X) = \sum_i p_i \log \frac{1}{p_i}. \end{equation}

I mentioned that the entropy of a distribution was always positive, and could only be zero if one of the elements \(i\) had probability \(p_i=1\).

I was following Shannon (bottom of page 11), but I failed to realize that Shannon’s statement applies only to discrete distributions.

The Shannon entropy of the exponential distribution can be negative

In fact, when you did your homework, you should have noticed, that, for the exponential distribution, given by the probability density function

\begin{equation} p(x) = \frac{1}{\mu}\exp^{-\frac{x}{\mu}} \end{equation}

its entropy is given by

\begin{equation} 1 + \log(\mu), \end{equation}

and it can take negative values for many values of the mean \(\mu\)!.

Robert B. Ash in his 1965 paper Information Theory (page 237) noted this: unlike a discrete distribution, for a continuous distribution, the entropy can be positive or negative, in fact it may even be \(+\infty\) or \(-\infty\).

The reason is that for a discrete distribution, the normalization condition

\begin{equation} \sum_i p_i = 1 \end{equation}

requires that no individual probability can be larger than one, that is, \(0\leq p_i \leq 1\) for all possible values of \(i\).

In contrast, for a continuous distribution, the normalization condition

\begin{equation} \int p(x)\, dx = 1 \end{equation}

does not require for all probability density values, \(p(x)\), to be smaller than one. For instance if you look at the wikipedia definition of the exponential distribution you can see for \(\lambda = 1.5\) that some values of the probability density function are indeed larger than one.

If we discretize the continuous distribution

\begin{equation} \int p(x)\, dx \sim \sum_i p(x_i)\, \Delta = 1 \end{equation}

we see that only the product \(p(x_i) \Delta\) for each term \(i\) in the sum has to be smaller than 1.

Having values of \(p(x)\) that can be larger than one is the cause why the entropy of a continuous distribution

\begin{equation} H = \int p(x)\, \log \frac{1}{p(x)}\, dx. \end{equation}

cannot be guarantee to be positive, as the term \(\log \frac{1}{p(x)}\) takes a negative value if \(p(x)> 1\).

That means that for a continuous distribution the meaningful quantity to consider is the relative entropy (or Kullback-Leibler distance) relative to some arbitrary reference distribution \(p_0\). For any continuous distribution \(p(x)\) the relative entropy

\begin{equation} D_{KL}(p||p_0) = \int p(x)\, \log \frac{p(x)}{p_0(x)}\, dx, \end{equation}

is guaranteed to be always positive.