Entropy

Definition

For a discrete probability distribution $p$ on the finite set $\mathcal{X}$ with $\sum_{x \in \mathcal{X}} p(x) = 1$, the entropy of $p$ is defined as

$$H(p) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$
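As a minimal sketch of this definition in code (entropy in nats; the function name `entropy` and the example vectors are our own choices, not taken from the text):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log p(x), in nats.

    Terms with p(x) = 0 contribute nothing, by the convention 0 log 0 = 0.
    """
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))   # log 2 ≈ 0.693
print(entropy([0.9, 0.1]))   # less uncertainty, lower entropy
print(entropy([1.0, 0.0]))   # a deterministic outcome has entropy 0
```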

For a continuous probability density function $f$ on an interval $I \subseteq \mathbb{R}$, the entropy of $f$ is defined as

$$h(f) = -\int_I f(x) \log f(x) \, dx.$$

Theorem

For a probability distribution $p$ on a finite set $\mathcal{X}$,

$$H(p) \le \log |\mathcal{X}|,$$

with equality iff $p$ is uniform, i.e. $p(x) = \frac{1}{|\mathcal{X}|}$ for all $x \in \mathcal{X}$.

Uniform probability yields maximum uncertainty and therefore maximum entropy.
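A quick numerical sanity check of the theorem, as a sketch: for randomly drawn distributions on $n = 5$ points the entropy stays below $\log n$, and the uniform distribution attains the bound. The Dirichlet sampling is only a convenient way to generate arbitrary probability vectors.

```python
import numpy as np

def entropy(p):
    # H(p) in nats, skipping zero-probability terms
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
n = 5

# Random distributions on n points never exceed the uniform bound log n.
for _ in range(5):
    p = rng.dirichlet(np.ones(n))
    print(round(entropy(p), 4), "<=", round(np.log(n), 4))

# The uniform distribution attains the bound exactly.
print(entropy(np.full(n, 1.0 / n)), np.log(n))
```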

Theorem

For a continuous probability density function $f$ on $\mathbb{R}$ with variance $\sigma^2$,

$$h(f) \le \frac{1}{2} \log\!\left(2 \pi e \sigma^2\right),$$

with equality iff $f$ is Gaussian with variance $\sigma^2$, i.e. for some $\mu$ we have

$$f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \, e^{-\frac{(x - \mu)^2}{2 \sigma^2}}.$$
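The bound can be checked numerically for a specific case. The sketch below integrates $-f \log f$ on a grid for a Gaussian with $\sigma = 1.5$ (an arbitrary choice) and compares the result with the closed form $\tfrac{1}{2}\log(2\pi e \sigma^2)$.

```python
import numpy as np

sigma = 1.5
x, dx = np.linspace(-12, 12, 200001, retstep=True)

# Gaussian density with mean 0 and standard deviation sigma
f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Differential entropy h(f) = -∫ f log f dx, approximated by a Riemann sum
h_numeric = -np.sum(f * np.log(f)) * dx

# Closed-form maximum from the theorem
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

print(h_numeric, h_closed)   # the two values agree to several decimals
```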

Theorem

For a continuous probability density function $f$ on $[0, \infty)$ with mean $\mu$,

$$h(f) \le 1 + \log \mu,$$

with equality iff $f$ is exponential with mean $\mu$, i.e.

$$f(x) = \frac{1}{\mu} \, e^{-x/\mu}.$$
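An analogous numerical check for the exponential case, as a sketch with the arbitrary choice $\mu = 2$:

```python
import numpy as np

mu = 2.0
x, dx = np.linspace(1e-9, 60, 200001, retstep=True)

# Exponential density with mean mu on [0, infinity)
f = np.exp(-x / mu) / mu

# Differential entropy by a Riemann sum, compared with 1 + log(mu)
h_numeric = -np.sum(f * np.log(f)) * dx
print(h_numeric, 1 + np.log(mu))
```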


Cross entropy

The cross entropy of a distribution $q$ relative to a distribution $p$ over a given set $\mathcal{X}$ is defined as follows:

$$H(p, q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x).$$
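A small sketch of the cross-entropy formula (the function name and example distributions are our own):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(cross_entropy(p, q))   # at least H(p)
print(cross_entropy(p, p))   # equals H(p) when q = p
```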

Kullback-Leibler divergence

The Kullback-Leibler divergence (relative entropy) was introduced as the directed divergence between two distributions $p$ and $q$:

$$D_{\mathrm{KL}}(p \parallel q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}.$$

The Kullback-Leibler divergence is then interpreted as the average difference in the number of bits required for encoding samples of $p$ using a code optimized for $q$ rather than one optimized for $p$; indeed, $D_{\mathrm{KL}}(p \parallel q) = H(p, q) - H(p)$.
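The following sketch computes $D_{\mathrm{KL}}(p \parallel q)$ for two arbitrary example distributions and confirms the identity $D_{\mathrm{KL}}(p \parallel q) = H(p, q) - H(p)$:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x)), in nats."""
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

H_p  = -np.sum(p * np.log(p))   # entropy H(p)
H_pq = -np.sum(p * np.log(q))   # cross entropy H(p, q)

print(kl_divergence(p, q), H_pq - H_p)   # equal: D_KL(p||q) = H(p,q) - H(p)
print(kl_divergence(p, p))               # 0 when the distributions coincide
```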

Jensen-Shannon divergence

The Jensen-Shannon divergence is a symmetrized version of the Kullback-Leibler divergence:

$$D_{\mathrm{JS}}(p \parallel q) = \frac{1}{2} D_{\mathrm{KL}}(p \parallel m) + \frac{1}{2} D_{\mathrm{KL}}(q \parallel m),$$

where

$$m = \frac{1}{2}(p + q).$$
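A direct transcription of the definition into code, as a sketch (the helper `kl` and the example distributions are our own):

```python
import numpy as np

def kl(p, q):
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def js_divergence(p, q):
    """D_JS(p || q) = (1/2) D_KL(p || m) + (1/2) D_KL(q || m), with m = (p + q)/2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.3, 0.2]
q = [0.1, 0.6, 0.3]
print(js_divergence(p, q), js_divergence(q, p))   # symmetric in p and q
```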


Mutual Information

Let $(X, Y)$ be a pair of random variables with values over the space $\mathcal{X} \times \mathcal{Y}$. If their joint distribution is $P_{(X,Y)}$ and the marginal distributions are $P_X$ and $P_Y$, the mutual information is defined as

$$I(X; Y) = D_{\mathrm{KL}}\!\left(P_{(X,Y)} \parallel P_X \otimes P_Y\right),$$

where $D_{\mathrm{KL}}$ is the Kullback–Leibler divergence.

PMFs for discrete distributions

The mutual information of two jointly discrete random variables $X$ and $Y$ is calculated as a double sum:

$$I(X; Y) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} P_{(X,Y)}(x, y) \log \frac{P_{(X,Y)}(x, y)}{P_X(x) \, P_Y(y)},$$

where $P_{(X,Y)}$ is the joint probability mass function of $X$ and $Y$, and $P_X$ and $P_Y$ are the marginal probability mass functions of $X$ and $Y$ respectively.
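As an illustration, the sketch below evaluates the double sum for a small hypothetical joint PMF (the table values are invented for the example):

```python
import numpy as np

# A hypothetical joint PMF p(x, y) on a 2 x 3 space (rows index x, columns index y)
p_xy = np.array([[0.10, 0.20, 0.15],
                 [0.25, 0.05, 0.25]])

p_x = p_xy.sum(axis=1)   # marginal PMF of X
p_y = p_xy.sum(axis=0)   # marginal PMF of Y

# I(X;Y) = sum_y sum_x p(x,y) log( p(x,y) / (p(x) p(y)) )
mi = 0.0
for i in range(p_xy.shape[0]):
    for j in range(p_xy.shape[1]):
        if p_xy[i, j] > 0:
            mi += p_xy[i, j] * np.log(p_xy[i, j] / (p_x[i] * p_y[j]))

print(mi)   # nonnegative; zero iff X and Y are independent
```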

PDFs for continuous distributions

In the case of jointly continuous random variables, the double sum is replaced by a double integral:

$$I(X; Y) = \int_{\mathcal{Y}} \int_{\mathcal{X}} p_{(X,Y)}(x, y) \log \frac{p_{(X,Y)}(x, y)}{p_X(x) \, p_Y(y)} \, dx \, dy,$$

where $p_{(X,Y)}$ is now the joint probability density function of $X$ and $Y$, and $p_X$ and $p_Y$ are the marginal probability density functions of $X$ and $Y$ respectively.
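For a concrete continuous example, the sketch below approximates the double integral on a grid for a standard bivariate Gaussian with correlation $\rho = 0.6$ (our own choice) and compares it with the known closed form $-\tfrac{1}{2}\log(1-\rho^2)$ for jointly Gaussian variables:

```python
import numpy as np

rho = 0.6
x, dx = np.linspace(-8, 8, 801, retstep=True)
X, Y = np.meshgrid(x, x, indexing="ij")

# Standard bivariate Gaussian density with correlation rho
norm = 1.0 / (2 * np.pi * np.sqrt(1 - rho**2))
p_xy = norm * np.exp(-(X**2 - 2 * rho * X * Y + Y**2) / (2 * (1 - rho**2)))

# Standard normal marginal densities
p_x = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Double integral approximated by a Riemann sum on the grid
integrand = p_xy * np.log(p_xy / (p_x[:, None] * p_x[None, :]))
mi_numeric = np.sum(integrand) * dx * dx

print(mi_numeric, -0.5 * np.log(1 - rho**2))   # known closed form for this case
```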

Mutual information and Kullback–Leibler divergence

Mutual information is the Kullback–Leibler divergence of the joint distribution $P_{(X,Y)}$ from the product of the marginal distributions $P_X \otimes P_Y$:

$$I(X; Y) = D_{\mathrm{KL}}\!\left(P_{(X,Y)} \parallel P_X \otimes P_Y\right).$$
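The sketch below evaluates this Kullback–Leibler form on the same hypothetical joint PMF used in the discrete example above and recovers the same value as the double sum:

```python
import numpy as np

def kl(p, q):
    p, q = np.ravel(p), np.ravel(q)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

# The same hypothetical joint PMF as in the discrete example above
p_xy = np.array([[0.10, 0.20, 0.15],
                 [0.25, 0.05, 0.25]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# I(X;Y) as the KL divergence of the joint PMF from the product of the marginals
print(kl(p_xy, np.outer(p_x, p_y)))
```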