Goodfellow: Probability and Information Theory
Probability Theory
Frequentist x Bayesian
Saying that a \(3\) will appear \(1/6\) of the times that a die is rolled has a different meaning than a doctor saying you have a \(40\%\) chance of having the flu. The first is frequentist probability, and it is directly related to the rate at which an event occurs. The latter is related to qualitative levels of certainty, and is known as Bayesian probability.
Probability is an extension of formal logic. Probability theory provides a set of formal rules for determining the likelihood of a proposition being true given the likelihood of other propositions.
Independence x Zero Covariance
Independence is a much stronger requirement than zero covariance because independence also excludes non-linear relationships. In other words, every pair of independent variables has zero covariance, but not every pair of variables with zero covariance is independent.
Ex: \(y = sx\)
\(x \sim U[-1,1]\)
\(s \in \{-1, 1\}\), each value with probability \(0.5\)
The pair of variables \(x, y\) has zero covariance, but \(x\) and \(y\) are clearly dependent on each other (indeed, \(|y| = |x|\) always).
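The example above can be checked numerically. This is a small sketch (variable names are my own) that estimates the sample covariance of \(x\) and \(y = sx\) and confirms the deterministic relationship \(|y| = |x|\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# x uniform on [-1, 1]; s takes -1 or +1, each with probability 1/2
x = rng.uniform(-1.0, 1.0, size=n)
s = rng.choice([-1.0, 1.0], size=n)
y = s * x

# Sample covariance is close to zero (E[xy] = E[s] E[x^2] = 0)...
print(np.cov(x, y)[0, 1])

# ...yet x and y are dependent: |y| equals |x| exactly
print(np.max(np.abs(np.abs(y) - np.abs(x))))  # 0.0
```

Multiplying by \(\pm 1\) is exact in floating point, so the second check holds with no tolerance at all.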
Out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers. We can thus think of the normal distribution as being the one that inserts the least amount of prior knowledge into a model. Fully developing and justifying this idea requires more mathematical tools and is postponed to section 19.4.2.
Dirac Function
The Dirac delta function is defined such that it is zero-valued everywhere except at 0, yet integrates to 1. We can think of the Dirac delta function as being the limit point of a series of functions that put less and less density on all points other than zero.
Common Functions
- Logistic Sigmoid: \(\sigma(x) = \frac{1}{1 + \exp(-x)}\)
- Softplus: \(\varsigma(x) = \log(1+\exp(x))\)
Useful properties:
- \(\sigma(x) = \frac{\exp(x)}{\exp(x) + \exp(0)}\)
- \(\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))\)
- \(1 - \sigma(x) = \sigma(-x)\)
- \(\log \sigma(x) = -\varsigma(-x)\)
- \(\frac{d}{dx}\varsigma(x) = \sigma(x)\)
- \(\varsigma(x) - \varsigma(-x) = x\)
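The standard sigmoid/softplus identities (from the book's list of useful properties) can be verified numerically. A minimal sketch, with the function names `sigmoid` and `softplus` chosen by me:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # log(1 + exp(x)), using log1p for numerical stability
    return np.log1p(np.exp(x))

x = np.linspace(-5, 5, 101)

# d/dx sigma(x) = sigma(x) (1 - sigma(x)), checked by central differences
h = 1e-6
num_grad = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
assert np.allclose(num_grad, sigmoid(x) * (1 - sigmoid(x)), atol=1e-6)

# 1 - sigma(x) = sigma(-x)
assert np.allclose(1 - sigmoid(x), sigmoid(-x))

# softplus(x) - softplus(-x) = x
assert np.allclose(softplus(x) - softplus(-x), x)

# log sigma(x) = -softplus(-x)
assert np.allclose(np.log(sigmoid(x)), -softplus(-x))

print("all identities hold")
```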
Information Theory
Information theory is based on the idea that the occurrence of unlikely events carries much more information than the occurrence of likely events. That is the reasoning behind the definition of the self-information of an event \(x \sim P\): \(I(x) = -\log P(x)\).
Shannon Entropy
The Shannon entropy is the average of self-information: \(H(x) = E_{X \sim P}[I(x)]\).
In a Bernoulli distribution, the Shannon entropy is highest when the parameter \(p\) is set to \(0.5\); and it is zero (the distribution is deterministic) when \(p=0\) or \(p=1\).
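This behavior of the Bernoulli entropy is easy to plot or spot-check. A small sketch (function name mine; entropy in nats, with the convention \(0 \log 0 = 0\)):

```python
import numpy as np

def bernoulli_entropy(p):
    # H(p) = -p log p - (1 - p) log(1 - p), with 0 log 0 := 0
    p = np.asarray(p, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        return (np.where(p > 0, -p * np.log(p), 0.0)
                + np.where(p < 1, -(1 - p) * np.log(1 - p), 0.0))

print(bernoulli_entropy(0.5))  # log 2, about 0.693: the maximum
print(bernoulli_entropy(0.0))  # 0.0: deterministic outcome
print(bernoulli_entropy(1.0))  # 0.0: deterministic outcome
```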
Kullback-Leibler divergence
The KL-divergence measures the extra amount of information needed to send a message containing symbols drawn from probability distribution \(P\) when we use a code that was designed to minimize the length of messages drawn from probability distribution \(Q\).
- It is a statistical distance, although it is not a metric (it does not respect the triangle inequality and it is not symmetric).
- It is often called the information gain achieved if \(P\) were used instead of \(Q\).
- Notice that the KL divergence is non-negative!
In order to see why the KL-divergence is non-negative, consider Jensen's inequality. Let \(f\) be a convex function. Then, for weights \(a_i \geq 0\) with \(\sum_i a_i = 1\),
\(f\left(\sum_i a_i x_i\right) \leq \sum_i a_i f(x_i)\)
Notice that Jensen's inequality is the generalization of the fact that a secant line of a convex function lies above its graph. The inequality can be immediately transferred to the context of probability theory: the weights \(a_i\) can be interpreted as a distribution, giving \(f(E[x]) \leq E[f(x)]\).
If \(f\) is concave (such as \(\log\)), the inequality is inverted. Therefore
\(-D_{KL}(P \| Q) = E_{x \sim P}\left[\log \frac{Q(x)}{P(x)}\right] \leq \log E_{x \sim P}\left[\frac{Q(x)}{P(x)}\right] = \log \sum_x Q(x) = \log 1 = 0\)
Therefore \(-D_{KL} \leq 0\) which implies that \(D_{KL} \geq 0\).
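The non-negativity and the asymmetry of the KL-divergence can both be checked numerically on random discrete distributions. A sketch under my own naming, using Dirichlet draws to generate strictly positive probability vectors:

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    # D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x)), in nats
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(np.where(p > 0, p * (np.log(p) - np.log(q)), 0.0))

# random distributions over 5 symbols (all entries strictly positive)
p = rng.dirichlet(np.ones(5))
q = rng.dirichlet(np.ones(5))

print(kl(p, q))  # non-negative
print(kl(p, p))  # 0.0: the divergence vanishes iff the distributions match
print(kl(p, q) - kl(q, p))  # generally nonzero: KL is not symmetric
```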
Cross-Entropy
Minimizing the cross-entropy with respect to \(Q\) is equivalent to minimizing the KL-divergence, because \(Q\) does not participate in the omitted term: \(H(P, Q) = H(P) + D_{KL}(P \| Q)\), and \(H(P)\) does not depend on \(Q\).
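The decomposition of the cross-entropy into entropy plus KL-divergence can be verified directly. A minimal sketch (function names mine), using small discrete distributions with strictly positive entries:

```python
import numpy as np

def entropy(p):
    # H(P) = -sum_x P(x) log P(x), in nats
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    # H(P, Q) = -E_{x~P}[log Q(x)]
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.6, 0.3])

# H(P, Q) = H(P) + D_KL(P || Q); since H(P) does not depend on Q,
# minimizing H(P, Q) over Q is the same as minimizing D_KL(P || Q)
print(np.isclose(cross_entropy(p, q), entropy(p) + kl(p, q)))  # True
```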