Goodfellow: Probability and Information Theory

Probability Theory

Frequentist x Bayesian

Saying that a \(3\) will appear \(1/6\) of the times that a die is rolled has a different meaning than a doctor saying you have a \(40\%\) chance of having the flu. The first is frequentist probability, and it relates directly to the rate at which an event occurs. The latter relates to qualitative levels of certainty, and is known as Bayesian probability.

Probability is an extension of formal logic. Probability theory provides a set of formal rules for determining the likelihood of a proposition being true given the likelihood of other propositions.

Independence x Zero Covariance

Independence is a much stronger requirement than zero covariance because independence also excludes nonlinear relationships. In other words, every pair of independent variables has zero covariance, but not every pair of variables with zero covariance is independent.

Ex: \(y = sx\), where
\(x \sim U[-1,1]\)
\(s \in \{-1, 1\}\), each value with probability \(0.5\)

The pair of variables \((x, y)\) has zero covariance, but \(x\) and \(y\) are clearly dependent on each other.
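This can be checked numerically. The sketch below (plain Python, no external libraries; sample size and seed are arbitrary choices) draws samples of the pair above and estimates covariances: the sample covariance of \(x\) and \(y\) is close to zero, while the covariance of \(x^2\) and \(y^2\) is clearly positive, revealing the dependence.

```python
import random

random.seed(0)
n = 100_000
xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
ss = [random.choice([-1.0, 1.0]) for _ in range(n)]
ys = [s * x for s, x in zip(ss, xs)]  # y = s * x

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return mean([(ai - ma) * (bi - mb) for ai, bi in zip(a, b)])

cov_xy = cov(xs, ys)                                    # close to 0
cov_sq = cov([x * x for x in xs], [y * y for y in ys])  # positive: y^2 = x^2 exactly
print(cov_xy, cov_sq)
```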

Normal Distribution

Out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers. We can thus think of the normal distribution as the one that inserts the least amount of prior knowledge into a model. Fully developing and justifying this idea requires more mathematical tools and is postponed to section 19.4.2.

Dirac Function

The Dirac delta function is defined such that it is zero valued everywhere except 0, yet integrates to 1. We can think of the Dirac delta function as being the limit point of a series of functions that put less and less density on all points other than zero.
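A minimal numeric illustration of this limiting view, using Gaussian densities with shrinking standard deviation as the narrowing sequence (the choice of Gaussians, the grid, and the step size are assumptions of the sketch): each density still integrates to approximately 1, while its peak at zero grows without bound.

```python
import math

def normal_pdf(x, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (sigma * math.sqrt(2.0 * math.pi))

step = 1e-4
grid = [i * step for i in range(-10_000, 10_001)]  # grid over [-1, 1]

integrals = []
peaks = []
for sigma in (0.2, 0.05, 0.01):
    # Riemann sum approximates the integral; it stays ~1 for every sigma
    integrals.append(sum(normal_pdf(x, sigma) for x in grid) * step)
    # while the density at 0 keeps growing as sigma shrinks
    peaks.append(normal_pdf(0.0, sigma))
```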

Common Functions

  • Logistic Sigmoid: \(\sigma(x) = \frac{1}{1 + \exp(-x)}\)
  • Softplus: \(\varsigma(x) = \log(1+\exp(x))\)

Useful properties

\[ \begin{align} \sigma(x) &= \frac{\exp(x)}{\exp(x) + \exp(0)} \\[1em] \frac{d}{dx}\sigma(x) &= \sigma(x)(1-\sigma(x))\\[1em] 1-\sigma(x) &= \sigma(-x)\\[1em] \log\sigma(x) &= -\varsigma(-x)\\[1em] \frac{d}{dx}\varsigma(x) &= \sigma(x)\\[1em] \forall x \in (0,1), \sigma^{-1}(x) &= \log\left(\frac{x}{1-x}\right)\\[1em] \forall x > 0, \varsigma^{-1}(x) &= \log (\exp(x) - 1)\\[1em] \varsigma(x) &= \int_{-\infty}^{x}\sigma(y)dy\\[1em] \varsigma(x) - \varsigma(-x) &= x \end{align} \]
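The identities above are easy to spot-check numerically. A short sketch in plain Python (the test points are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    return math.log1p(math.exp(x))  # log(1 + exp(x))

for x in (-2.0, 0.3, 1.7):
    assert math.isclose(1.0 - sigmoid(x), sigmoid(-x))
    assert math.isclose(math.log(sigmoid(x)), -softplus(-x))
    assert math.isclose(softplus(x) - softplus(-x), x)
    p = sigmoid(x)
    assert math.isclose(math.log(p / (1.0 - p)), x)            # sigma^{-1} (the logit)
    assert math.isclose(math.log(math.expm1(softplus(x))), x)  # varsigma^{-1}
```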

Information Theory

Information theory is based on the idea that the occurrence of an unlikely event carries much more information than the occurrence of a likely event. That is the reasoning behind the definition of the self-information of an event \(x \sim P\).

\[ I(x) = - \log P(x) \]
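With base-2 logarithms, self-information is measured in bits. A tiny sketch:

```python
import math

def self_information_bits(p):
    """Self-information -log2 P(x) of an event with probability p, in bits."""
    return -math.log2(p)

fair_coin = self_information_bits(0.5)   # 1 bit
rare = self_information_bits(1.0 / 8.0)  # 3 bits: rarer events carry more information
```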

Shannon Entropy

The Shannon entropy is the average of self-information: \(H(x) = E_{X \sim P}[I(x)]\).

In a Bernoulli distribution, the Shannon entropy is maximized when the parameter \(p\) is set to \(0.5\); and it is zero (the distribution is deterministic) when \(p=0\) or \(p=1\).

\[ \begin{align} X &\sim Bernoulli(p) \\ H(x) &= - \left( (1-p)\log(1-p) + p\log p \right) \\ &= (p-1)\log (1-p) - p\log p \end{align} \]
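A numeric check of this behavior (natural logarithm, so entropy is in nats; the maximum at \(p = 0.5\) is \(\log 2 \approx 0.693\)):

```python
import math

def bernoulli_entropy(p):
    """H(p) = -((1-p) log(1-p) + p log p), with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -((1.0 - p) * math.log(1.0 - p) + p * math.log(p))

h_half = bernoulli_entropy(0.5)  # log 2, the maximum
h_edge = bernoulli_entropy(0.0)  # 0: deterministic
h_skew = bernoulli_entropy(0.9)  # strictly between 0 and log 2
```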

Kullback-Leibler divergence

\[ \begin{align} D_{KL}(P||Q) &= E_{X \sim P}\left[\log \frac{P(X)}{Q(X)}\right] \\ &= \sum_i{P(X=x_i)\log\left( \frac{P(X=x_i)}{Q(X=x_i)} \right)} \end{align} \]

The KL-divergence measures the extra amount of information needed to send a message containing symbols drawn from probability distribution \(P\) when we use a code that was designed to minimize the length of messages drawn from probability distribution \(Q\).

  • It is a statistical distance, although it is not a metric (it does not respect the triangle inequality and it is not symmetric).
  • It is often called the information gain achieved if \(P\) were used instead of \(Q\).
  • Notice that the KL divergence is non-negative!
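Both the asymmetry and the non-negativity are easy to verify on a concrete pair of discrete distributions (the distributions below are arbitrary choices for illustration):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

P = [0.5, 0.3, 0.2]
Q = [0.1, 0.6, 0.3]

d_pq = kl_divergence(P, Q)  # non-negative
d_qp = kl_divergence(Q, P)  # non-negative, but different: not symmetric
d_pp = kl_divergence(P, P)  # zero: no extra information needed
```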

In order to see why the KL-divergence is non-negative, consider Jensen's inequality. Let \(f\) be a convex function. Then,

\[ \frac{\sum{a_i f(x_i)}}{\sum a_i} \geq f\left( \frac{\sum{a_ix_i}}{\sum a_i} \right) \]

Notice that Jensen's inequality is the generalization of the fact that a secant line of a convex function lies above the function. The inequality transfers immediately to the context of probability theory: the weights \(a_i\) can be interpreted as a probability distribution.

\[ \begin{align} \sum_i P(X=x_i)f(x_i) &\geq f \left( \sum_i P(X=x_i)x_i \right) \\ E\left[ f(X) \right] &\geq f\left( E\left[X\right]\right) \end{align} \]

If \(f\) is concave (such as \(\log\)), the inequality is inverted. Therefore

\[ \begin{align} -D_{KL}(P||Q) &= -\sum_i P(X=x_i)\log \frac{P(X=x_i)}{Q(X=x_i)} \\[1em] &= \sum_i P(X=x_i)\log \frac{Q(X=x_i)}{P(X=x_i)} \\[1em] & \leq \log \left( \sum_i P(X=x_i) \frac{Q(X=x_i)}{P(X=x_i)} \right) \\[1em] &= \log \sum_i Q(X=x_i) \\[1em] &= \log 1 \\[1em] &= 0 \end{align} \]

Therefore \(-D_{KL} \leq 0\), which implies that \(D_{KL} \geq 0\).

Cross-Entropy

\[ H(P,Q) = - E_{X \sim P}\left[ \log Q(x) \right] \]

Minimizing the cross-entropy with respect to \(Q\) is equivalent to minimizing the KL-divergence, because \(H(P,Q) = H(P) + D_{KL}(P||Q)\) and the omitted term \(H(P)\) does not depend on \(Q\).
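The decomposition \(H(P,Q) = H(P) + D_{KL}(P||Q)\) can be verified numerically; this sketch reuses a pair of arbitrary discrete distributions:

```python
import math

P = [0.5, 0.3, 0.2]
Q = [0.1, 0.6, 0.3]

entropy = -sum(p * math.log(p) for p in P)                   # H(P)
cross_entropy = -sum(p * math.log(q) for p, q in zip(P, Q))  # H(P, Q)
kl = sum(p * math.log(p / q) for p, q in zip(P, Q))          # D_KL(P || Q)

# H(P, Q) = H(P) + D_KL(P || Q); only the KL term depends on Q
assert math.isclose(cross_entropy, entropy + kl)
```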