Goodfellow: Deep Feedforward Networks
Most tasks that consist of mapping an input vector to an output vector, and that are easy for a person to do rapidly, can be accomplished via deep learning, given sufficiently large models and sufficiently large datasets of labeled training examples. Other tasks, which cannot be described as associating one vector to another, or which are difficult enough that a person would need time to think and reflect in order to accomplish them, remain beyond the scope of deep learning for now.
Feedforward networks
Feedforward networks are so called because they lack feedback connections. They are the most basic neural network model, yet powerful enough to be used in several domains, computer vision included. Convolutional networks, popular in computer vision, are a kind of feedforward network.
Here is the basic model of a feedforward network with three layers:

\[ f(x) = f^{(3)}\!\left(f^{(2)}\!\left(f^{(1)}(x)\right)\right) \]
As one can see, it is a composition of functions. In the example above we have three layers. The last layer is called the output layer, and it is the only one that is observable: the training examples are supposed to represent values in the output layer. Because of that, the other layers are called hidden layers. It is the job of the model to figure out what they should look like.
Neural Networks and Classical Machine Learning
Linear regression, logistic regression, and SVMs are classical machine learning models. A big advantage of these methods is that they model the problem as a convex function, which lets us optimize them globally. But that characteristic is also their weak point.
The methods above cannot represent the complexity of every relationship present in every dataset; we need non-linear expressions to model that. As we have seen, we can use a feature map \(\phi\) to fit non-linear relationships within the same framework, but then we need to know a priori which non-linear function to use. The main feature of neural networks is that they figure out the non-linear function by themselves.
Before neural networks we had basically two alternatives:
- Use a generic feature map \(\phi\). A popular choice is the RBF kernel (infinite dimensional) in conjunction with the kernel trick. But this approach has problems regarding generalization of the model.
- Manually engineer \(\phi\). This is hard, and it takes years of experimentation by practitioners to figure out what works best for each task.
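As a tiny illustration of the second alternative, here is a hand-engineered feature map for XOR (the example worked out in detail below). The specific map \(\phi(x) = (x_1, x_2, x_1 x_2)\) and the coefficients are my own illustrative choice; in the extended feature space, XOR becomes a linear function.

```python
# Hand-engineered feature map for XOR: phi(x) = (x1, x2, x1*x2).
# In this feature space XOR is the *linear* function
# f(phi) = phi_1 + phi_2 - 2*phi_3.
def phi(x1, x2):
    return (x1, x2, x1 * x2)

def xor_linear(x1, x2):
    f1, f2, f3 = phi(x1, x2)
    return f1 + f2 - 2 * f3

for a in (0, 1):
    for b in (0, 1):
        print((a, b), "->", xor_linear(a, b))  # matches a XOR b
```

The catch, of course, is that we had to know in advance that the product feature \(x_1 x_2\) was the right one to add.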
RBF kernels
This seems an interesting topic. To me it looked like using \(\sin\) and \(\cos\), as in the Fourier transform, to represent functions.
Indeed, with wave functions such as \(\sin\), or with RBFs, you have all the derivatives at your disposal. If you have all the derivatives, you just need to parameterize the basis functions correctly in order to obtain whatever function you want.
The advantage of the RBF is that its density is concentrated in a range; outside this range the values are very close to zero. This is different from \(\sin\), which is periodic.
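A quick numerical sketch of that locality, using a Gaussian RBF with parameters of my own choosing:

```python
import math

# Gaussian RBF centered at c: close to 1 near the center and
# essentially zero far away from it, unlike the periodic sin basis.
def rbf(x, c=0.0, gamma=1.0):
    return math.exp(-gamma * (x - c) ** 2)

print(rbf(0.0))  # 1.0 exactly at the center
print(rbf(5.0))  # ~1.4e-11, negligible far from the center
```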
XOR Neural Network Example
Let \(F(x)\) be the XOR function in two variables. We want to find a function \(f(x,\theta)\) with parameter vector \(\theta\) that minimizes the loss function

\[ L(\theta) = \frac{1}{4}\sum_{x \in \Omega} \left( F(x) - f(x,\theta) \right)^2, \]

where \(\Omega = \{ (0,0),(0,1),(1,0),(1,1) \}\).
Notice that the XOR function cannot be represented by a linear function.
| \(x_1\) | \(x_2\) | \(F(x)\) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Therefore, we are going to construct a feedforward neural network model for it. The model has one hidden layer (two hidden units) and can be schematized as follows:

or in mathematical notation as:

\[ f(x; W, c, w, b) = w^\top \max\{0,\, W^\top x + c\} + b, \]

where \(W\) and \(c\) are the weights and biases of the hidden layer, and \(w\) and \(b\) those of the output layer.
The function \(g\) is the activation function. Modern practice is to use the rectified linear unit (ReLU), \(g(z) = \max\{0, z\}\).
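This small network has a known exact solution for XOR; the weights below are the ones usually presented for this architecture (one hidden layer, two ReLU units). A sketch in NumPy:

```python
import numpy as np

# Exact XOR solution for f(x) = w^T max(0, W^T x + c) + b.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])   # hidden-layer weights
c = np.array([0.0, -1.0])    # hidden-layer biases
w = np.array([1.0, -2.0])    # output-layer weights
b = 0.0                      # output-layer bias

def relu(z):
    return np.maximum(0.0, z)

def f(x):
    return w @ relu(W.T @ x + c) + b

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", f(np.array(x, dtype=float)))  # 0, 1, 1, 0
```

Note how the second hidden unit only activates on input \((1,1)\) and is then subtracted twice by the output weight, which is exactly what bends the linear model into XOR.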
For a general problem modeled by this network, training and inference have the following form
Goal: Given \(n\) examples of a \(q\)-feature vector, find the parameters \(W, c, w, b\) of \(f\) that minimize the loss function \(L(f)\).
Model: FFN with one hidden layer and ReLU activation function,

\[ f(x; W, c, w, b) = w^\top \max\{0,\, W^\top x + c\} + b, \]

where \(W \in \mathbb{R}^{q \times h}\), \(c \in \mathbb{R}^{h}\), \(w \in \mathbb{R}^{h}\), and \(b \in \mathbb{R}\), with \(h\) the number of hidden units.
Optimization of Neural Networks
We use a variation of gradient descent. It is a little more complicated to implement due to the compositionality of neural networks, but the backpropagation algorithm does the job. Just bear in mind that the result of the optimization depends on the initialization. The loss functions used in neural networks are generally the same as those used in classical machine learning, but the functions composing the intermediate layers are non-convex, which ultimately turns the problem into a non-convex optimization problem.
Maximizing likelihood is the same as minimizing cross-entropy
Cross-entropy and maximum likelihood
Loss function and cross-entropy
Training by maximum likelihood immediately gives you the cost function to use: the cross-entropy between the training data and the model distribution.
Prefer cross-entropy as cost function
Mean squared error and mean absolute error are poor choices of cost function because they may perform very badly with gradient descent. That happens whenever the layer immediately before them saturates (becomes flat). The combination of a saturated output with mean squared error produces very small gradients, which slows progress towards the minimum.
Cross-entropy, on the other hand, usually undoes the saturation. Exponential functions are commonly used in the output units, and they saturate for very negative arguments; but the logarithm in the cross-entropy cost undoes the exponential, transforming it into a sum.
Linear units do not saturate. That's why it works out fine to do linear regression using mean squared error as the cost function.
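A small numerical illustration of the saturation problem, with a sigmoid output unit and numbers of my own choosing: for a saturated, confidently wrong unit, the MSE gradient nearly vanishes while the cross-entropy gradient stays large.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    # d/dz (sigmoid(z) - y)^2 = 2 * (sigmoid(z) - y) * sigmoid'(z);
    # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) vanishes when saturated.
    s = sigmoid(z)
    return 2 * (s - y) * s * (1 - s)

def grad_xent(z, y):
    # For binary cross-entropy with a sigmoid output, the log undoes the
    # exponential and the gradient simplifies to sigmoid(z) - y.
    return sigmoid(z) - y

# A saturated, confidently wrong unit: z = -10 but the target is y = 1.
z, y = -10.0, 1.0
print(grad_mse(z, y))   # ~ -9e-5: almost no learning signal
print(grad_xent(z, y))  # ~ -1.0:  a strong learning signal
```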
Output units
Interpretation of the sigmoid function representing a Bernoulli distribution
Softmax (the extension of the Bernoulli/sigmoid case to \(n\) classes, i.e., a multinoulli distribution)
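A minimal sketch of softmax, written in the numerically stable form; subtracting the maximum is safe because softmax is invariant to shifting all inputs by a constant.

```python
import numpy as np

def softmax(z):
    # Shift by the max so exp never overflows; the result is unchanged
    # because the shift cancels between numerator and denominator.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(p, p.sum())  # three probabilities summing to 1
```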
Hidden Units
Rectified Linear
Sigmoid and Hyperbolic Tangent
They were widely used in the past, but their use is now discouraged because they saturate (near-zero gradient) almost everywhere. Only in a neighborhood of zero does the gradient assume values that do not disrupt gradient descent.
Network Architecture
Universal Approximation Theorem
A feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero error, provided that the network is given enough hidden units.
Notice that the Universal Approximation Theorem does not mean that the optimization algorithm will be able to find such a representation. The optimization could fail to find the desired function for different reasons, among them local minima, overfitting, and underfitting. It is rarely the case that we have enough data to infer the exact function that the data/task came from.
Depth and Generalization
The book gives an extensive list of works indicating that the deeper the network, the better its capacity to generalize. The same cannot be said of the width (number of parameters) of each layer.
Gradient descent in a neural network
In this section we compute by hand the gradient descent updates of a simple neural network. This will help us gain familiarity with the notation and also with the issues that motivate the backpropagation algorithm.
Consider the following network:

We are going to use the following notation
| Notation | Description |
|---|---|
| \(w_{jk}^{(L_i)}\) | Weight applied to activation node \(a_k^{(L_i-1)}\) to obtain activation node \(a_j^{(L_i)}\). |
| \(z_j^{(L_i)}\) | \(\displaystyle \sum_{k}{w_{jk}^{(L_i)}a_k^{(L_i-1)}}\), summing over the nodes \(k\) of layer \(L_i-1\) connected to node \(j\). |
| \(a_j^{(L_i)}\) | \(g(z_j^{(L_i)})\), where \(g\) is the activation function, for example ReLU. |
| \(C_0(a^{(3)})\) | \((a_0^{(3)} - y_0)^2 + (a_1^{(3)} - y_1)^2\) |
Next, we rewrite \(C_0\) by expanding it from the output activations \(a^{(3)}\) down to the low-level activations \(a^{(1)}\).
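As a sketch of the first step of that expansion, for a weight of the last layer and using the table's notation, the chain rule gives

\[
\frac{\partial C_0}{\partial w_{jk}^{(3)}}
= \frac{\partial C_0}{\partial a_j^{(3)}}
  \cdot \frac{\partial a_j^{(3)}}{\partial z_j^{(3)}}
  \cdot \frac{\partial z_j^{(3)}}{\partial w_{jk}^{(3)}}
= 2\left(a_j^{(3)} - y_j\right)\, g'\!\left(z_j^{(3)}\right)\, a_k^{(2)},
\]

where the three factors come directly from the definitions of \(C_0\), \(a_j^{(L_i)}\), and \(z_j^{(L_i)}\) above. Repeating this through \(a^{(2)}\) and \(a^{(1)}\) multiplies in one such chain-rule factor per layer, which is exactly the redundancy that backpropagation exploits.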
BackPropagation Algorithm
Consider a neural network in which the activation nodes are ordered in such a way that nodes in a later layer have higher indexes than nodes in an earlier layer. For example, consider the following graph \(\mathcal{G}\).

The backpropagation algorithm uses a dynamic programming technique to compute the chain rule over the network. The algorithm works in two steps.
- Forward step: compute the value of every activation node.
- Backward step: compute the gradient of the output nodes with respect to every activation node.
Forward step
At this step we compute the value of each activation node \(u_i\). Each node \(u_i\) has an associated function \(f_i\) that computes its value; \(f_i\) takes the parents of \(u_i\) in the graph \(\mathcal{G}\) as parameters. If \(u_i\) has no parents, then \(u_i\) is an input node, and \(f_i\) simply returns the input value \(x_i\).
```python
def forward(G, x):
    # Visit the nodes u_1, ..., u_n in topological order, so that the
    # parents of a node are always computed before the node itself.
    for i in range(1, n + 1):
        A_i = [u[j] for j in Parents(i)]
        if len(A_i) == 0:
            u[i] = x[i]       # no parents: input node
        else:
            u[i] = f[i](A_i)  # interior node: apply its function
    return u
```
Backward step
The backward step consists of computing the gradient of the output nodes
with respect to all activation nodes. This is done via dynamic programming.
The backward function outputs a matrix \(T\) with one row per output node and one column per node, where \(T[j,i] = \partial u_j / \partial u_i\).
```python
def backward(G):
    outputs = OutputNodes(G)
    T = zeros((len(outputs), len(G)))
    for j, uj in enumerate(outputs):
        T[j, uj] = 1  # the derivative of an output with respect to itself
    # All activation nodes that are not output nodes,
    # visited in reverse topological order: m, m-1, ..., 1.
    M = [ui for ui in G if ui not in outputs]
    M.sort(reverse=True)
    for j, uj in enumerate(outputs):
        for ui in M:
            # Chain rule: sum over the *children* of ui (the nodes that
            # consume ui), whose gradients T[j, uk] were already computed.
            T[j, ui] = sum(T[j, uk] * derivative(uk, ui)
                           for uk in Children(ui))
    return T
```
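To make the two steps concrete, here is a tiny hand-rolled instance on a three-node chain of my own choosing, checked against a finite-difference approximation:

```python
# Tiny instance of the two-step scheme on a 3-node chain:
#   u1 = x (input),  u2 = relu(w * u1),  u3 = (u2 - y)^2 (output).
w, x, y = 2.0, 1.5, 1.0

# Forward step: compute every node value in topological order.
u1 = x
z = w * u1
u2 = max(0.0, z)            # ReLU
u3 = (u2 - y) ** 2          # squared-error output node

# Backward step: dynamic programming from the output node down,
# each derivative reusing the one computed just above it.
du3_du3 = 1.0
du3_du2 = du3_du3 * 2 * (u2 - y)           # through the square
du3_du1 = du3_du2 * (w if z > 0 else 0.0)  # through ReLU and the weight

# Finite-difference check of du3/du1.
eps = 1e-6
def total(x_):
    h = max(0.0, w * x_)
    return (h - y) ** 2
numeric = (total(x + eps) - total(x - eps)) / (2 * eps)
print(du3_du1, numeric)  # the two estimates should agree closely
```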