Goodfellow: Classical Machine Learning

Supervised vs. Unsupervised

  • Roughly speaking, unsupervised learning involves observing several examples of a random vector \(\mathbf{x}\) and attempting to implicitly or explicitly learn the probability distribution \(p(\mathbf{x})\), or some interesting properties of that distribution, while supervised learning involves observing several examples of a random vector \(\mathbf{x}\) and an associated value vector \(\mathbf{y}\), then learning to predict \(\mathbf{y}\) from \(\mathbf{x}\), usually by estimating \(p(\mathbf{y} \mid \mathbf{x})\).
  • In some scenarios, one can solve an unsupervised problem by solving \(n\) supervised problems plus the chain rule of probability.
  • Alternatively, one may solve a supervised problem by learning the joint distribution \(p(\mathbf{x}, \mathbf{y})\) and then deriving the conditional probability \(p(\mathbf{y} | \mathbf{x})\).
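The two reductions above can be written explicitly using the chain rule and the definition of conditional probability:

\[ \begin{align} p(\mathbf{x}) &= \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1}) \\ p(y \mid \mathbf{x}) &= \frac{p(\mathbf{x}, y)}{\sum_{y'} p(\mathbf{x}, y')} \end{align} \]

The first line turns the unsupervised problem of learning \(p(\mathbf{x})\) into \(n\) supervised problems, one per component; the second recovers the supervised conditional from an unsupervised model of the joint distribution.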

Statistical Learning Theory

The goal is to derive results relating training and test error by making assumptions such as: the training and test examples are drawn from the same distribution, and the examples are mutually independent (the i.i.d. assumptions).

Definitions

  • Underfitting: The training error is too large.
  • Overfitting: The gap between training error and test error (generalization error) is too large.
  • Capacity of a model: The model's flexibility to represent data. Among linear-in-parameters models, a polynomial of degree 2 has more capacity than one of degree 1. Be aware, though, that capacity is not the same as code dimension. An encoder with code dimension \(1\) that uses non-linear functions to learn its code has more capacity than a linear model with code dimension \(1\) (depending on the non-linear functions used, more capacity than a linear model with any code dimension). Capacity is related to the flexibility of the model to create mappings between numbers: a linear model can only create linear mappings, while a non-linear model can create both linear and non-linear mappings, so it has more capacity.
  • Discrepancy error: \(D(C,n) := |T_{error} - G_{error}|\), where \(C\) is model capacity and \(n\) is the number of training examples.

  • One can use regularization as a more general way to control a model's capacity. When we use weight decay regularization (penalizing solutions that assign high values to certain weights), we are implicitly controlling the polynomial degree in a linear model: the higher the coefficient controlling weight decay, the lower the effective degree of the fitted polynomial.
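A small numpy sketch of this effect, assuming numpy is available (the data, degree, and penalty values are illustrative): we fit a degree-9 polynomial to nearly linear data, with weak and strong weight decay, and observe that stronger decay shrinks the coefficient vector.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = x + 0.1 * rng.standard_normal(30)  # underlying function is degree 1

X = np.vander(x, 10, increasing=True)  # degree-9 polynomial features

def ridge_fit(X, y, lam):
    """Closed-form weight decay (ridge) solution: (X^T X + lam I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w_weak = ridge_fit(X, y, lam=1e-6)    # almost no weight decay
w_strong = ridge_fit(X, y, lam=10.0)  # strong weight decay

# A larger decay coefficient yields a smaller-norm solution; in particular
# the high-order coefficients are pushed toward zero, so the fitted
# polynomial behaves like a low-degree one.
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

The norm of the ridge solution is non-increasing in the penalty coefficient, which is what makes weight decay a capacity control.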

Statistical learning theory shows that

\[ \begin{align} D(C,n) &\text{ increases as } C \text{ increases}\\ D(C,n) &\text{ shrinks as } n \text{ increases} \end{align} \]

  • We have little theoretical understanding of the general nonconvex optimization problems involved in deep learning.

No free lunch theorem

The most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.

That means there is no general algorithm able to solve all tasks. We need specialized algorithms for each task, taking advantage of prior knowledge associated with the task.

Validation and Cross-Validation sets

  • Validation set: Typically one takes 20% of the training set to do hyperparameter optimization.

  • Cross-validation: If the number of instances available to train the model is too small, one may use cross-validation. In \(k\)-fold cross-validation, the instances are divided into \(k\) disjoint subsets. Then \(k\) training trials are executed: in trial \(i\), the \(i\)-th subset is used as the test set and the rest of the data as the training set. At the end, we estimate the error by averaging the errors of all trials.

  • Cross-validation allows us to use (in some sense) all the available instances to train the model. This is better than using a small training set, due to the high variance associated with small sets.
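The \(k\)-fold procedure above can be sketched in a few lines of numpy (numpy assumed available; the `fit`/`error` callables and the toy data are illustrative):

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k disjoint folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validate(X, y, fit, error, k=5):
    """Average the test error over k trials; in trial i, fold i is the test set."""
    folds = k_fold_indices(len(y), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        errors.append(error(model, X[test_idx], y[test_idx]))
    return np.mean(errors)

# Usage with a least-squares line fit on noiseless linear data:
X = np.linspace(0, 1, 40).reshape(-1, 1)
y = 2 * X.ravel() + 1
fit = lambda X, y: np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)[0]
error = lambda w, X, y: np.mean((np.c_[np.ones(len(X)), X] @ w - y) ** 2)
print(cross_validate(X, y, fit, error, k=5))  # essentially zero here
```

Every instance is used for training in \(k-1\) trials and for testing in exactly one, which is the sense in which all the data is exploited.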

Estimators

  • Bias of an estimator: the difference between the expected value of the estimator and the true value of the quantity being estimated.
  • The sample mean of a Gaussian distribution is an unbiased estimator of the mean.

  • The sample variance of a Gaussian distribution (dividing by \(m\)) is a biased estimator of the variance.

  • We can also compute the variance of an estimator. As with the bias, we generally prefer estimators with lower bias and lower variance.

  • Some estimators have a variance that is a function of the number of samples used: the higher the number of samples, the lower the variance.

  • Estimator consistency: \(\text{plim}_{m\rightarrow \infty}\hat{\theta}_m = \theta\), where \(m\) is the number of data points.
  • Maximum Likelihood Estimator: \(\text{argmax}_{\mathbf{\theta}}\sum_{i=1}^{m}\log p_{model}(\mathbf{x}^{(i)};\mathbf{\theta})\). This is equivalent to maximizing the product of the likelihoods because the \(\log\) function is monotonically increasing.
  • Since scaling factors do not change the argmax, we can also interpret it as \(\text{argmax}_{\mathbf{\theta}} E_{\mathbf{X} \sim \hat{p}_{data}}\log p_{model}(\mathbf{x};\mathbf{\theta})\), where \(\hat{p}_{data}\) is the empirical data distribution, that is, the distribution we get by sampling from the true, unknown distribution \(p_{data}\).
  • We can also interpret the MLE as minimizing the KL divergence \(D_{KL}(\hat{p}_{data}\|p_{model}) = E_{\mathbf{X}\sim \hat{p}_{data}}[\log \hat{p}_{data}(\mathbf{x}) - \log p_{model}(\mathbf{x})]\); the first term does not depend on the model, so minimizing the divergence is the same as maximizing the expected log-likelihood.
  • Linear regression is equivalent to MLE applied to a normal distribution with fixed variance. That is, find the best parameter of the normal distribution (in this case, only the mean \(\mu\), taken as a function of the input) that produces the highest combined probability for the set of observed points.
  • Under appropriate conditions, the MLE has the consistency property described above.
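The bias of the sample variance can be checked numerically. A small numpy sketch (numpy assumed available; the constants are illustrative): we repeatedly draw small samples from a Gaussian with known variance and average the two estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0
m = 5  # a small sample size makes the bias visible

# Approximate the expected value of each estimator over many repeated samples.
biased, unbiased = [], []
for _ in range(20000):
    x = rng.normal(0.0, np.sqrt(true_var), size=m)
    biased.append(np.var(x, ddof=0))    # divides by m   -> biased estimator
    unbiased.append(np.var(x, ddof=1))  # divides by m-1 -> unbiased estimator

# The biased estimator underestimates the variance by a factor (m-1)/m.
print(np.mean(biased))    # close to 4 * (5-1)/5 = 3.2
print(np.mean(unbiased))  # close to 4.0
```

Dividing by \(m-1\) instead of \(m\) removes the bias, at the cost of slightly higher variance of the estimator itself.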

Bayesian Statistics

  • Application. Parameter estimation. For example, estimate the parameter \(\theta\) of some distribution.
  • Input. Prior probability and some distribution \(P\) parameterized by \(\theta\) that we believe models the data.
    • Prior probability. \(P(\theta)\)
    • Conditional probability. \(P(X|\theta)\)
  • Posterior distribution. \(P(\theta|X) = \frac{P(X|\theta)P(\theta)}{P(X)}\)
    • Recall that \(P(X)\) can be computed by marginalizing over all values of \(\theta\).
  • Belief update. \(P(x^{(k+1)}|x^{(k)},x^{(k-1)},\cdots) = \int{P(x^{(k+1)}|\theta)P(\theta|x^{(k)},x^{(k-1)},\cdots)d\theta}\)
  • Maximum a posteriori. Similar to maximum likelihood: we choose the value of \(\theta\) that maximizes the posterior probability.
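A minimal worked example of these steps, on a hypothetical coin-flip problem (numpy-free; the prior, data, and variable names are illustrative). The Beta prior is conjugate to the Bernoulli likelihood, so the posterior has a closed form.

```python
# Estimate theta = P(heads) of a coin.
# Prior: P(theta) = Beta(a, b); likelihood: P(X | theta) is Bernoulli.
# Posterior: P(theta | X) = Beta(a + #heads, b + #tails).

a, b = 1.0, 1.0                   # Beta(1, 1) is the uniform prior
flips = [1, 1, 0, 1, 1, 0, 1, 1]  # observed data (1 = heads)

heads = sum(flips)
tails = len(flips) - heads
a_post, b_post = a + heads, b + tails  # belief update

posterior_mean = a_post / (a_post + b_post)          # Bayesian point estimate
map_estimate = (a_post - 1) / (a_post + b_post - 2)  # maximum a posteriori
mle = heads / len(flips)                             # maximum likelihood

print(posterior_mean, map_estimate, mle)  # 0.7, 0.75, 0.75
```

With a uniform prior, the MAP estimate coincides with the MLE; an informative prior would pull it toward the prior's mode.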

Challenges motivating deep learning:

  • Curse of dimensionality: As the number of dimensions in the data increases, we face a combinatorial explosion in terms of the number of valid configurations in our solution space.
  • Local Constancy and Smoothness Regularization: Classical machine learning methods make the assumption that the function they are trying to interpolate is locally well behaved. In k-nearest neighbors, the function is constant in each region where the k nearest neighbors are the same instances; if doing regression, one may use the average of nearby points to predict an unobserved point. Notice, however, that if the function looks like a checkerboard, it would be difficult to recover it without sufficiently many examples. In general, the number of training examples must be excessively large to obtain good results with a classical machine learning method on high-dimensional data.
  • Manifold Learning: Even if the data is high dimensional, interesting data is located on a manifold (an embedded subspace of the high-dimensional space in which one can locally move along a smaller number of dimensions). Think about the fact that structured data such as images, audio, and text belong to a huge-dimensional space, but only a few instances of that space constitute interesting examples. If someone randomly types English letters, the probability that something intelligible comes out is practically zero.

Supervised Learning

k-nearest neighbor limitation

It can perform very badly if training data is scarce. Think about a task in which points are located in \(\mathbb{R}^{100}\) and assume that the output depends only on the first feature; more precisely, \(y=x_1\). Now, say we build a training set by sampling points from an isotropic Gaussian. As usual with nearest neighbors, the prediction is derived from the average of the k nearest neighbors, so for a small training set the response will be essentially random. One needs a very large training set to identify even this simple pattern with k-nearest neighbors.
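A numerical sketch of exactly this scenario (numpy assumed available; the sizes and `avg_error` helper are illustrative): k-NN regression in \(\mathbb{R}^{100}\) where the target is \(y=x_1\).

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_predict(X_train, y_train, x, k=5):
    """Average the targets of the k nearest training points (k-NN regression)."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return y_train[nearest].mean()

def avg_error(n_train, d=100, n_test=200):
    X = rng.standard_normal((n_train, d))  # isotropic Gaussian inputs
    y = X[:, 0]                            # target depends only on the first feature
    Xt = rng.standard_normal((n_test, d))
    preds = np.array([knn_predict(X, y, x) for x in Xt])
    return np.mean((preds - Xt[:, 0]) ** 2)

# The 99 irrelevant dimensions dominate the distances, so the neighbors are
# nearly random with respect to x_1 and the error stays far above zero:
print(avg_error(50), avg_error(5000))
```

Even with thousands of examples the squared error remains on the order of the target's own variance, because Euclidean distance in 100 dimensions barely reflects closeness in the one feature that matters.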

Kernel Trick

The following section is based on Andrew Ng's lecture notes: cs229-notes3.pdf

Recall linear regression problem

Given \(y \in \mathbb{R}^{1\times n}\) and \(x \in \mathbb{R}^{f\times n}\), find \(\theta^{\star} \in \mathbb{R}^{f\times 1}\) such that

\[ \theta^{\star} = \text{argmin}_{\theta} \|y - \theta^Tx\|^2 \]

with gradient descent update rule:

\[ \theta = \theta + h(y - \theta^Tx)x^T \]

Using a feature extension map

\[ \phi : \mathbb{R}^{f\times n} \rightarrow \mathbb{R}^{g(f) \times n} \]

Example:

\[ \begin{align} X &= (x_1,x_2,x_3)^T\\ \phi(X) &= (1,x_1,x_2,x_3,x_1x_2,x_1x_3,x_2x_3)^T \end{align} \]

with update rule

\[ \theta = \theta + h(y - \theta^T\phi(x))\phi(x)^T \]

Problem: At each iteration, one has to compute \(\phi(x)\), which quickly becomes expensive as the complexity of the feature extension map increases.

Fundamental result

We can assume that \(\theta\) is a linear combination of the columns of \(\phi(X)\) throughout the whole process.

Let \(\theta^{(0)}=0\). Therefore \(\theta^{(0)} = \phi(X)0\).
Now, for some \(k>0\), assume that \(\theta^{(k)} = \phi(X)B^{(k)}\). Therefore

\[ \begin{align} \theta^{(k+1)} &= \theta^{(k)} + h\Big(y - \theta^{(k)^T}\phi(x)\Big)\phi(x) \\ &= \phi(x)B^{(k)} + h\Big(y-\theta^{(k)^T}\phi(x)\Big)\phi(x) \\ &= \phi(x)\Bigg(B^{(k)} + h\Big(y - (\phi(x)B^{(k)})^T\phi(x)\Big)\Bigg) \end{align} \]

In particular, the coefficients \(B^{(k)}\) evolve as

\[ \begin{align} B^{(k+1)} &= B^{(k)} + h\Big(y - (\phi(x)B^{(k)})^T\phi(x)\Big)\\ &= B^{(k)} + h\Big(y - B^{(k)^T}\phi(x)^T\phi(x)\Big)\\ &= B^{(k)} + h\Big(y - B^{(k)^T}K\Big) \end{align} \]

where \(K = \phi(x)^T\phi(x)\) is the kernel (Gram) matrix.

Kernel methods are based on this trick. The kernel matrix does not change during gradient descent, so it can be computed once. One disadvantage is that the cost of evaluating the training function is proportional to the number of training examples; that is, one iteration of gradient descent takes longer as more examples are added to the training set.
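A minimal numpy sketch of this derivation (numpy assumed available; the data, feature map, step size, and iteration count are illustrative): we run gradient descent entirely in the coefficients \(B\), using only the precomputed Gram matrix, and recover \(\theta = \phi(X)B\) at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))  # 20 examples, 3 raw features
y = X[:, 0] * X[:, 1] + X[:, 2]   # target with an interaction term

def phi(X):
    """Feature extension map (1, x1, x2, x3, x1x2, x1x3, x2x3) per example."""
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    return np.stack([np.ones(len(X)), x1, x2, x3, x1*x2, x1*x3, x2*x3], axis=1)

# Kernelized gradient descent: track B (one coefficient per training example)
# instead of theta. The Gram matrix K[i, j] = <phi(x_i), phi(x_j)> is computed
# once and never changes across iterations.
P = phi(X)
K = P @ P.T
B = np.zeros(len(X))
h = 0.005
for _ in range(5000):
    B += h * (y - K @ B)

theta = P.T @ B  # recover theta = phi(X)^T B if ever needed explicitly
pred = K @ B     # in-sample predictions computed from the kernel alone
print(np.max(np.abs(pred - y)))  # small residual
```

Note that the per-iteration cost is governed by the number of training examples (the size of \(K\)), not by the dimension of the extended feature space, which is the whole point of the trick.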

Support Vector Machine

The goal is to find the hyperplane that optimally separates the training set with respect to its geometric margin.

Optimally separating hyperplane

Many believe that the SVM is the best "off-the-shelf" supervised learning algorithm.

Notation and Glossary

  • Training set: \(\{ (x^{(i)},y^{(i)}) \;|\; y^{(i)} \in \{-1,1\} \}\)

Functional Margin (training instance): \(\hat{\gamma}^{(i)}=y^{(i)}(w^Tx^{(i)}+b)\)
Geometric Margin (training instance): \(\gamma^{(i)} = y^{(i)}\Big(\left(\frac{w}{||w||}\right)^Tx^{(i)}+\frac{b}{||w||}\Big)\)

The geometric margin with respect to the whole training set is the minimum geometric margin among the training instances.

Optimal Margin Classifier

The separating hyperplane defines a frontier along which data is classified. The farther a point is from the frontier, the higher our confidence in its classification. Therefore, to maximize our confidence, we would like to find the hyperplane for which the training set has the highest geometric margin.

\[ \begin{align} max_{\gamma,w,b} \quad & \gamma \\ \text{s.t.} \quad & y^{(i)}(w^Tx^{(i)}+b) \geq \gamma \\ & ||w|| = 1 \end{align} \]

equivalently

\[ \begin{align} max_{\hat{\gamma},w,b} \quad & \frac{\hat{\gamma}}{||w||} \\ \text{s.t.} \quad & y^{(i)}(w^Tx^{(i)}+b) \geq \hat{\gamma} \end{align} \]

equivalently

\[ \begin{align} min_{w,b} \quad & \frac{1}{2}||w||^2 \\ \text{s.t.} \quad & y^{(i)}(w^Tx^{(i)}+b) \geq 1 \end{align} \]

In the last step, we arbitrarily constrained the model to have a functional margin (with respect to the training set) equal to \(1\). This amounts to scaling \(w\) and \(b\) by a constant, which does not change the final result of the classification problem.

Linear Programming Duality

Primal problem

\[ max \quad c^Tx \text{ s.t. } Ax \leq b \text{ and } x \geq 0 \]

Dual problem

\[ min \quad b^Ty \text{ s.t. } A^Ty \geq c \text{ and } y \geq 0 \]

Lagrangian Duality

Consider the optimization problem

\[ \begin{align} min & f(x) \\ \text{s.t.} & g_i(x) \leq 0 \\ & h_i(x) = 0 \end{align} \]

Its Lagrangian is given by (with \(\alpha_i \geq 0\))

\[ \mathbb{L}(x,\alpha,\beta) = f(x) + \sum_{i} \alpha_ig_i(x) + \sum_{i} \beta_ih_i(x) \]

It is possible to prove that

\[ d^*=max_{\alpha,\beta:\alpha_i \geq 0} \; min_{x} \mathbb{L}(x,\alpha,\beta) \leq min_{x} \; max_{\alpha,\beta:\alpha_i \geq 0} \mathbb{L}(x,\alpha,\beta) = p^* \]

Under certain conditions, \(d^*=p^*\). These are the following:

  1. \(f\) and \(g_i\) are convex and \(h_i\) are affine;
  2. The \(g_i\) constraints are strictly feasible; that is, there exists \(x\) such that \(g_i(x)<0\) for all \(i\).

Under these conditions, the optimal values of the dual and primal problems are equal. Moreover, they respect the KKT conditions.

\[ \begin{align} \frac{\partial}{\partial x_i} \mathbb{L}(x^*,\alpha^*,\beta^*) &= 0 \\ \frac{\partial}{\partial \beta_i} \mathbb{L}(x^*,\alpha^*,\beta^*) &= 0 \\ \alpha_i^*g_i(x^*) &= 0 \\ g_i(x^*) &\leq 0 \\ \alpha_i^* &\geq 0 \end{align} \]

Solving by Lagrangian Duality

We compute the Lagrangian of our optimization problem

\[ \mathbb{L}(w,b,\alpha) = \frac{1}{2}||w||^2 - \sum\alpha_i\big(y^{(i)}(w^Tx^{(i)}+b) - 1\big) \]

Next, we derive the Lagrangian dual by substituting into the expression above the expressions we get by taking the derivatives of \(\mathbb{L}\) with respect to \(w\) and \(b\) and setting them equal to zero. That is,

\[ \begin{align} \frac{\partial}{\partial w}\mathbb{L} = w - \sum\alpha_iy^{(i)}x^{(i)} = 0 \\ \frac{\partial}{\partial b}\mathbb{L} = \sum\alpha_iy^{(i)} = 0 \end{align} \]

Substituting the expressions above, we get the Lagrangian dual below

\[ \begin{align} max_{\alpha} \quad & \sum\alpha_i - \frac{1}{2}\sum{y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)},x^{(j)}\rangle} \\ \text{s.t.}\quad & \alpha_i \geq 0 \\ & \sum\alpha_iy^{(i)} = 0 \end{align} \]

The problem above can be optimized by any quadratic programming solver. By finding \(\alpha^*\) we can easily find \(w^*\) and \(b^*\).

From the KKT conditions, the \(\alpha_i\) values will all be zero except for those \(i\) whose corresponding \(g_i\) constraint holds with equality (the so-called support vectors). We will take advantage of this fact to come up with a more efficient algorithm.

Given that we have found the optimal values, here is how our model predicts the classification of new instances.

\[ \begin{align} h_{w,b}(x) = w^Tx + b &= \Big(\sum\alpha_iy^{(i)}x^{(i)}\Big)^Tx + b \\ &= \sum\alpha_iy^{(i)}\langle x^{(i)},x\rangle + b \\ h_{w,b}(x) > 0 &\rightarrow y=1 \\ h_{w,b}(x) < 0 &\rightarrow y=-1 \end{align} \]
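A minimal numerical check of this prediction formula on a toy two-point problem whose dual solution can be derived by hand (numpy assumed available; the data and the hand-derived \(\alpha\) are specific to this instance, not a general solver):

```python
import numpy as np

# Two training points; the optimal hyperplane is their perpendicular
# bisector, and both points are support vectors.
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# Dual solution worked out by hand for this instance:
# w = sum alpha_i y_i x_i and sum alpha_i y_i = 0 give alpha = (1/4, 1/4),
# hence w = (1/2, 1/2) and b = 0.
alpha = np.array([0.25, 0.25])
w = (alpha * y) @ X
b = 1.0 - w @ X[0]  # a support vector satisfies y_1 (w^T x_1 + b) = 1

def h(x):
    """Predict using only inner products with the support vectors."""
    return sum(a * yi * (xi @ x) for a, yi, xi in zip(alpha, y, X)) + b

print(w, b)                      # [0.5 0.5] 0.0
print(h(np.array([2.0, 0.5])))   # positive -> class +1
```

The prediction never needs \(w\) explicitly: only inner products between the new point and the support vectors appear, which is what makes the kernel substitution possible.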

Regularization

To prevent outliers from having a strong influence on the choice of the separating hyperplane, one may use \(\ell_1\) regularization on slack variables \(e_i\).

\[ \begin{align} min \quad & \frac{1}{2}||w||^2 + C\sum{e_i} \\ \text{s.t.} & y^{(i)}(w^Tx^{(i)} + b) \geq 1 - e_i \\ & e_i \geq 0 \end{align} \]

Sequential Minimal Optimization Algorithm

Read Platt's paper.

References

  1. Andrew Ng lecture notes: cs229-notes3.pdf
  2. Murphy, K. P. (2012). Machine Learning: a Probabilistic Perspective. MIT Press, Cambridge, MA, USA.
  3. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  4. Matousek (2006). Understanding and Using Linear Programming. Springer.
  5. Platt (1998). Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines.

Stochastic Gradient Descent

Classic gradient descent does not scale well when the training set is large: the complexity of one iteration is \(O(m)\), where \(m\) is the number of training data points. Clearly this is not feasible when we have billions of data points.

The idea of stochastic gradient descent is to interpret the gradient as an expected value over the training points. Therefore, it is enough to sample just a few data points to get a good estimate of the gradient.
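A minimal numpy sketch of minibatch SGD on linear regression (numpy assumed available; the data sizes, step size, and batch size are illustrative): each step estimates the full gradient from a random minibatch, so the per-iteration cost is \(O(\text{batch})\) rather than \(O(m)\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data with known true weights.
m, d = 10000, 3
w_true = np.array([2.0, -1.0, 0.5])
X = rng.standard_normal((m, d))
y = X @ w_true + 0.01 * rng.standard_normal(m)

# Minibatch SGD on the mean squared error: the minibatch gradient is an
# unbiased estimate of the full gradient, interpreted as an expectation
# over the training points.
w = np.zeros(d)
h, batch = 0.1, 32
for step in range(2000):
    idx = rng.integers(0, m, size=batch)  # sample a minibatch
    grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
    w -= h * grad

print(w)  # close to w_true
```

Note that 2000 steps touch only 64,000 examples' worth of gradient computation, independent of whether \(m\) is ten thousand or ten billion.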

The combination of deep learning and stochastic gradient descent is key to understanding the boom of deep learning: deep learning needs lots of training examples, and stochastic gradient descent scales well with the size of the training set.