
Goodfellow: Deep Feedforward Networks

Most tasks that consist of mapping an input vector to an output vector, and that are easy for a person to do rapidly, can be accomplished via deep learning, given sufficiently large models and sufficiently large datasets of labeled training examples. Other tasks, which cannot be described as associating one vector to another, or which are difficult enough that a person would need time to think and reflect in order to accomplish them, remain beyond the scope of deep learning for now.

Feedforward networks

Feedforward networks are called this way because they lack feedback connections. They are the most basic neural network model, yet powerful enough to be used in several domains, computer vision included. Convolutional networks, popular in computer vision, are a kind of feedforward network.

Here is the basic model of a feedforward network

\[ f^{(3)}(f^{(2)}(f^{(1)}(x))) \]

As one can see, it is a composition of functions. In the example above we have three layers. The last layer is called the output layer, and it is the only one that is observable; that is, the training examples are supposed to represent values in the output layer. Because of that, the other layers are called hidden layers. It is the job of the model to figure out what they should look like.
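In code, this composition is just nested function calls. A minimal sketch (the layer functions below are arbitrary placeholders chosen for illustration, not a trained model):

```python
def f1(x):
    return 2 * x + 1      # first "layer": an arbitrary affine map

def f2(h):
    return max(0, h)      # second "layer": a ReLU-like nonlinearity

def f3(h):
    return h - 3          # output "layer": another arbitrary affine map

def f(x):
    return f3(f2(f1(x)))  # f = f3 o f2 o f1, the three-layer composition

print(f(1))  # f1(1)=3, f2(3)=3, f3(3)=0 -> prints 0
```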

Neural Networks and Classical Machine Learning

Linear regression, logistic regression, and SVMs are classical models in machine learning. A big advantage of these models is that they pose the problem as a convex function, and because of that we are able to optimize them globally. But that characteristic is also their weak point.

The methods above cannot represent the complexity of every relationship present in every dataset. We need non-linear expressions to model those relationships. As we have seen, we can use a feature map to fit non-linear models within the same framework, but we need to know a priori what non-linear function to use. The main feature of neural networks is that they figure out the non-linear function by themselves.

Before neural networks we had basically two alternatives:

  1. Use a generic feature map \(\phi\). A popular choice is the RBF kernel (infinite dimensional) in conjunction with the kernel trick. But this raises problems regarding the generalization of the model.
  2. Manually engineer \(\phi\). This is hard, and it takes years of experimentation by practitioners to figure out what works best for each task.

RBF kernels

This seems like an interesting topic. To me it looked like using \(\sin\) and \(\cos\), as in the Fourier transform, to represent functions.

Indeed, with wave functions such as \(\sin\), or with RBFs, you have all the derivatives at your disposal. If you have all the derivatives, you just need to parameterize the basis functions correctly in order to obtain whatever function you want.

The advantage of the RBF is that its density is concentrated in a range; outside this range the values are very close to zero. This is different from \(\sin\), which is periodic.
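This locality is easy to see numerically. A quick sketch, using a Gaussian RBF \(\exp(-\gamma (x - c)^2)\) centered at \(0\) (the center and \(\gamma\) are arbitrary choices for illustration):

```python
import math

def rbf(x, center=0.0, gamma=1.0):
    """Gaussian radial basis function exp(-gamma * (x - center)^2)."""
    return math.exp(-gamma * (x - center) ** 2)

# Near the center the RBF is large; far away it is essentially zero.
# sin, by contrast, keeps oscillating at full amplitude everywhere.
print(rbf(0.0))       # 1.0
print(rbf(5.0))       # ~1.4e-11, negligible
print(math.sin(5.0))  # ~ -0.96, still full amplitude
```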

XOR Neural Network Example

Let \(F(x)\) be the XOR function in two variables. We want to find a function \(f(x,\theta)\) with parameter vector \(\theta\) that minimizes the loss function

\[ \sum_{x \in \Omega}{ \big( F(x) - f(x,\theta) \big)^2} \]

where \(\Omega = \{ (0,0),(0,1),(1,0),(1,1) \}\).

Notice that the XOR function cannot be represented by a linear function.

| \(x_1\) | \(x_2\) | \(F(x)\) |
|---------|---------|----------|
| 0       | 0       | 0        |
| 0       | 1       | 1        |
| 1       | 0       | 1        |
| 1       | 1       | 0        |

Therefore, we are going to construct a feedforward neural network model for it. The model has one hidden layer (two hidden units) and can be schematized as follows:

FFN scheme for the XOR problem

or in mathematical notation as:

\[ \begin{align} f^{(2)}(g(f^{(1)}(x))) \end{align} \]

where

\[ \begin{align} f^{(1)}(x) &= W^Tx + c \\ f^{(2)}(h) &= w^Th + b \\ g(z) &= \max\{0,z\} \end{align} \]

The function \(g\) is the activation function. Modern practice recommends the rectified linear unit (ReLU).
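The well-known hand-constructed solution for XOR uses \(W = \begin{bmatrix}1 & 1\\1 & 1\end{bmatrix}\), \(c = (0,-1)^T\), \(w = (1,-2)^T\), \(b = 0\). A quick check in plain Python (a sketch of \(f(x) = w^Tg(W^Tx + c) + b\) with lists standing in for vectors):

```python
def relu(z):
    return max(0.0, z)

def f(x, W, c, w, b):
    # f(x) = w^T g(W^T x + c) + b, with g = ReLU
    h = [relu(sum(W[i][j] * x[i] for i in range(2)) + c[j]) for j in range(2)]
    return sum(w[j] * h[j] for j in range(2)) + b

W, c, w, b = [[1, 1], [1, 1]], [0, -1], [1, -2], 0
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, f(x, W, c, w, b))  # reproduces the XOR table: 0, 1, 1, 0
```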

For a general problem modeled by this network, training and inference have the following forms:

Goal: Given \(n\) examples, each a \(q\)-feature vector, find parameters \(W,c,w,b\) of \(f\) that minimize the loss function \(L(f)\).

Model: FFN with one hidden layer and ReLU activation function.

\[ \begin{align} \text{Training: } & g(XW + \mathbb{1}c^T)w + \mathbb{1}b \\ \text{Inference: } & w^Tg(W^Tx +c) + b \end{align} \]

where,

\[ \begin{align} X &\in \mathbb{R}^{n \times q} \\ x &\in \mathbb{R}^{q \times 1} \\ W &\in \mathbb{R}^{q \times r} \\ c &\in \mathbb{R}^{r \times 1} \\ w &\in \mathbb{R}^{r \times 1} \\ b &\in \mathbb{R} \\ \mathbb{1} &\in \mathbb{R}^{n \times 1} \text{ (vector of ones)} \end{align} \]
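The shape bookkeeping in the batch form \(g(XW + \mathbb{1}c^T)w + \mathbb{1}b\) can be checked with a small sketch in plain Python (the numbers below are arbitrary; lists stand in for matrices):

```python
def batch_forward(X, W, c, w, b):
    """g(XW + 1 c^T) w + 1 b with plain lists.
    X: n x q, W: q x r, c: length r, w: length r, b: scalar -> n outputs."""
    out = []
    for x in X:                                    # each of the n rows of X
        z = [sum(x[i] * W[i][j] for i in range(len(x))) + c[j]
             for j in range(len(c))]               # one row of XW + 1 c^T
        h = [max(0.0, zj) for zj in z]             # g = ReLU, applied elementwise
        out.append(sum(w[j] * h[j] for j in range(len(w))) + b)
    return out

# n=3 examples, q=2 features, r=2 hidden units (all values arbitrary)
X = [[1.0, 2.0], [0.5, -1.0], [0.0, 0.0]]
W = [[1.0, -1.0], [0.5, 1.0]]
c = [0.0, 0.5]
w = [1.0, 1.0]
b = 0.1
print(batch_forward(X, W, c, w, b))  # one scalar output per row of X
```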

Optimization of Neural Networks

We use a variation of gradient descent. It is a little more complicated to implement due to the compositional structure of neural networks, but the backpropagation algorithm does the job. Just bear in mind that the result of the optimization depends on the initialization. The loss functions used in neural networks are generally the same as those used in classical machine learning. But the functions composing the intermediate layers are often non-convex, which ultimately turns the problem into a non-convex optimization problem.
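The dependence on initialization is easy to demonstrate on a non-convex function. A sketch (the function \(f(x) = x^4 - 3x^2 + x\) is an arbitrary choice with two local minima):

```python
def gradient_descent(grad, x0, lr=0.01, steps=500):
    """Plain gradient descent on a scalar function, from starting point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# f(x) = x^4 - 3x^2 + x is non-convex; f'(x) = 4x^3 - 6x + 1
grad = lambda x: 4 * x**3 - 6 * x + 1

left = gradient_descent(grad, x0=-2.0)   # converges near x = -1.30
right = gradient_descent(grad, x0=+2.0)  # converges near x = +1.13
print(left, right)  # two different minima, purely due to initialization
```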

Maximum likelihood is the same as minimizing cross-entropy

Cross-entropy and maximum likelihood

Loss function and cross-entropy

Using cross-entropy as the cost function immediately determines the loss function to use: once the model specifies a distribution \(p(y \mid x)\), the loss is its negative log-likelihood.

Prefer cross-entropy as cost function

Mean squared error and mean absolute error are poor choices of cost function because they may perform very badly with gradient descent. That happens whenever the immediately prior layer saturates (becomes flat). The combination of a saturated output with mean squared error produces very small gradients, which leads to slow progress towards the minimum.

Cross-entropy, on the other hand, usually undoes the saturation. Exponential functions are commonly used in the output units, and these saturate for very negative arguments. But the logarithm in the cross-entropy cost function undoes the exponential, turning the product into a sum and restoring a usable gradient.

Linear units do not saturate. That is why linear regression works out fine with mean squared error as the cost function.
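A small numerical illustration of the saturation effect (the values are arbitrary). For a sigmoid output \(\hat y = \sigma(z)\) with target \(y = 1\), the gradient of MSE with respect to \(z\) is \(2(\sigma(z) - y)\,\sigma(z)(1 - \sigma(z))\), while for cross-entropy it is simply \(\sigma(z) - y\):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z, y = -10.0, 1.0  # a confidently wrong prediction: sigmoid(z) ~ 4.5e-5
s = sigmoid(z)

mse_grad = 2 * (s - y) * s * (1 - s)  # vanishes because the sigmoid saturates
ce_grad = s - y                       # stays close to -1: learning proceeds

print(mse_grad)  # ~ -9.1e-5, nearly no gradient signal
print(ce_grad)   # ~ -1.0, a healthy gradient
```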

Output units

Interpretation of the sigmoid function as representing a Bernoulli distribution

Softmax (the extension of the Bernoulli case to \(n\) classes)

Hidden Units

Rectified Linear

Sigmoid and Hyperbolic Tangent

They were widely used in the past, but their use is now discouraged because they saturate (small gradient) almost everywhere. Only in a neighborhood of zero does the gradient assume values that do not disrupt gradient descent.

Network Architecture

Universal Approximation Theorem

A feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero error, provided that the network is given enough hidden units.

Notice that the Universal Approximation theorem does not mean that the optimization algorithm will be able to find such a representation. The optimization could fail to find the target function for different reasons, among them local minima, overfitting, and underfitting. It is rarely the case that we have enough data to infer the exact function that generated the data.

Depth and Generalization

The book gives an extensive list of works indicating that the deeper the network, the better its capacity to generalize. The same cannot be said about the number of parameters (width) in each layer.

Gradient descent in a neural network

In this section we are going to compute by hand the gradient of a simple neural network. This will help us gain familiarity with the notation and also with the issues that motivate the backpropagation algorithm.

Consider the following network:

Gradient computation in a simple neural network

We are going to use the following notation

Notation Description
\(w_{jk}^{(L_i)}\) Weight applied to activation node \(a_k^{(L_i-1)}\) to obtain activation node \(a_j^{(L_i)}\).
\(z_j^{(L_i)}\) \(\displaystyle \sum_{(j,k) \in \Omega(L_i,L_i-1)}{w_{jk}^{(L_i)}a_k^{(L_i-1)}}\)
\(a_j^{(L_i)}\) \(g(z_j^{(L_i)})\), where \(g\) is the activation function, for example, ReLU.
\(C_0(a^{(3)})\) \((a_0^{(3)} - y_0)^2 + (a_1^{(3)} - y_1)^2\)

Next, we rewrite \(C_0\) in terms of progressively lower-level parameters, from \(a^{(3)}\) down to \(a^{(1)}\).

\[ \begin{align} C_0(a^{(3)}) &= (a_0^{(3)} - y_0)^2 + (a_1^{(3)} - y_1)^2 \\[1em] C_0(z^{(3)}) &= \big( g(z_0^{(3)}) - y_0 \big)^2 + \big( g(z_1^{(3)}) - y_1 \big)^2 \\[1em] C_0(a^{(2)}) &= \big( g(w_{00}^{(3)}a_0^{(2)} + w_{01}^{(3)}a_1^{(2)}) - y_0 \big)^2 + \big( g(w_{10}^{(3)}a_0^{(2)} + w_{11}^{(3)}a_1^{(2)}) - y_1 \big)^2 \\[1em] C_0(z^{(2)}) &= \Big( g\big(w_{00}^{(3)}g(z_0^{(2)}) + w_{01}^{(3)}g(z_1^{(2)})\big) - y_0 \Big)^2 \\ & + \Big( g\big(w_{10}^{(3)}g(z_0^{(2)}) + w_{11}^{(3)}g(z_1^{(2)})\big) - y_1 \Big)^2 \\[1em] C_0(a^{(1)}) &= \Big( g\big(w_{00}^{(3)}g(w_{00}^{(2)}a_0^{(1)} + w_{01}^{(2)}a_1^{(1)}) + w_{01}^{(3)}g(w_{10}^{(2)}a_0^{(1)} + w_{11}^{(2)}a_1^{(1)})\big) - y_0 \Big)^2 \\ & + \Big( g\big(w_{10}^{(3)}g(w_{00}^{(2)}a_0^{(1)} + w_{01}^{(2)}a_1^{(1)}) + w_{11}^{(3)}g(w_{10}^{(2)}a_0^{(1)} + w_{11}^{(2)}a_1^{(1)})\big) - y_1 \Big)^2 \end{align} \]
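A quick numerical sanity check of the \(C_0(z^{(3)})\) step (a sketch, with ReLU for \(g\) and arbitrary values for \(z^{(3)}\) and \(y\)): the analytic derivative \(\partial C_0 / \partial z_0^{(3)} = 2\big(g(z_0^{(3)}) - y_0\big)g'(z_0^{(3)})\) should match a finite-difference estimate.

```python
def g(z):
    return max(0.0, z)  # ReLU activation

def C0(z3, y):
    # C_0(z^(3)) = (g(z_0) - y_0)^2 + (g(z_1) - y_1)^2
    return (g(z3[0]) - y[0]) ** 2 + (g(z3[1]) - y[1]) ** 2

z3, y = [0.5, -0.3], [1.0, 0.0]  # arbitrary values

# Analytic derivative via the chain rule; g'(z) = 1 for z > 0, else 0
analytic = 2 * (g(z3[0]) - y[0]) * (1.0 if z3[0] > 0 else 0.0)

# Central finite-difference estimate of the same derivative
h = 1e-6
numeric = (C0([z3[0] + h, z3[1]], y) - C0([z3[0] - h, z3[1]], y)) / (2 * h)
print(analytic, numeric)  # both ~ -1.0
```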

Backpropagation Algorithm

Consider a neural network in which the activation nodes are ordered in such a way that nodes in a higher layer have higher indexes than nodes in a lower layer. For example, consider the following graph \(\mathcal{G}\).

Ordered neural network

The backpropagation algorithm uses a dynamic programming technique to compute the chain rule over the neural network. The algorithm works in two steps.

  1. Forward step: compute the values of the activation nodes.
  2. Backward step: compute the gradient of the output nodes with respect to the other nodes.

Forward step

At this step we compute the values of the activation nodes \(u_i\). To each node \(u_i\) is associated a function \(f_i\) that computes its value. The function \(f_i\) takes the parents of \(u_i\) in the graph \(\mathcal{G}\) as parameters. If the node \(u_i\) has no parents, then \(u_i\) is an input node, and \(f_i\) simply returns the input value \(x_i\).

def forward(G, x):
    # Visit nodes in topological order (inputs first)
    u = {}
    for i in range(1, n + 1):
        Ai = [u[j] for j in Parent(i)]  # values of the parents of u_i
        if len(Ai) == 0:
            u[i] = x[i]                 # input node: copy the input value
        else:
            u[i] = f[i](Ai)             # interior node: apply its function
    return u

Backward step

The backward step consists of computing the gradient of the output nodes with respect to all activation nodes. This is done via dynamic programming. The backward function outputs a matrix \(T\), with one row per output node and one column per node, where

\[ T[j,i] = \frac{\partial u_j}{\partial u_i} \]
def backward(G):
    # T[j,i] will hold the partial derivative du_j/du_i
    T = zeros((len(OutputNodes(G)), len(G)))
    for uj in OutputNodes(G):
        T[j,j] = 1

    # All activation nodes that are not output nodes
    M = [ ui for ui in G if ui not in OutputNodes(G) ]

    # Visit in decreasing order: m, m-1, ..., 1
    M.sort(reverse=True)

    for uj in OutputNodes(G):
        for ui in M:
            # u_j depends on u_i only through the children of u_i
            # (the nodes that take u_i as an input), so the chain rule
            # sums over Children(ui), not Parent(ui)
            T[j,i] = sum( T[j,k]*derivative(uk,ui) for uk in Children(ui) )

    return T
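A runnable miniature of both steps (a sketch on a hypothetical four-node graph with inputs \(u_1, u_2\), interior node \(u_3 = u_1 u_2\), and single output \(u_4 = u_1 + u_3\)):

```python
# Graph structure: node -> list of its parents
parents = {1: [], 2: [], 3: [1, 2], 4: [1, 3]}

def f(i, u):
    """Compute node i from the values of its parents."""
    if i == 3:
        return u[1] * u[2]
    if i == 4:
        return u[1] + u[3]

def local_grad(k, i, u):
    """Local derivative du_k/du_i, given current node values u."""
    if k == 3:
        return u[2] if i == 1 else u[1]  # d(u1*u2)/du1 = u2, /du2 = u1
    if k == 4:
        return 1.0                        # d(u1+u3)/du1 = d(u1+u3)/du3 = 1

x = {1: 2.0, 2: 3.0}

# Forward step: fill u in topological (increasing) order
u = {}
for i in sorted(parents):
    u[i] = x[i] if not parents[i] else f(i, u)

# Backward step: T[i] = du_4/du_i, filled in reverse topological order,
# summing over the children of each node (the nodes that consume u_i)
children = {i: [k for k in parents if i in parents[k]] for i in parents}
T = {4: 1.0}
for i in sorted((n for n in parents if n != 4), reverse=True):
    T[i] = sum(T[k] * local_grad(k, i, u) for k in children[i])

# u_4 = x1 + x1*x2 = 8, du4/dx1 = 1 + x2 = 4, du4/dx2 = x1 = 2
print(u[4], T[1], T[2])  # 8.0 4.0 2.0
```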