• What is a deep neural network? You've already seen logistic regression, and you've also seen neural networks with a single hidden layer.


  • Given below are examples of a neural network with two hidden layers and a neural network with 5 hidden layers.


  • Figure: deep-neural-network


  • We say that logistic regression is a very "shallow" model, whereas the 5-hidden-layer model given above is a much deeper model.


  • A neural network with a single hidden layer is called a 2-layer neural network. Note that when we count layers in a neural network, we don't count the input layer; we only count the hidden layers and the output layer.


  • The figure above also shows a 2-layer neural network, which is still quite shallow, but not as shallow as logistic regression.


  • Technically, logistic regression is a one-layer neural network. Although shallow models are still useful, over the last several years the machine learning community has realized that there are functions that very deep neural networks can learn that shallower models are often unable to.


  • For any given problem, though, it might be hard to predict in advance exactly how deep a network you will want. So it would be reasonable to try logistic regression, then one and two hidden layers, and view the number of hidden layers as a hyperparameter that you could try a variety of values of and evaluate on cross-validation data, or on your development set, as sketched below.
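  • A minimal sketch of that idea, where `train_and_score` is a hypothetical callable standing in for whatever code you already have that trains a model with a given number of hidden layers and returns its development-set score:

```python
def choose_depth(train_and_score, depths=(0, 1, 2, 3)):
    """Treat the number of hidden layers as a hyperparameter: try several
    values and keep the one that scores best on the development set.

    train_and_score(num_hidden_layers) -> dev-set score is a hypothetical
    stand-in for your own training and evaluation code.
    """
    best_depth, best_score = None, float("-inf")
    for num_hidden_layers in depths:  # 0 hidden layers is essentially logistic regression
        score = train_and_score(num_hidden_layers)
        if score > best_score:
            best_depth, best_score = num_hidden_layers, score
    return best_depth
```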


  • More details on that will be given later.






  • Figure: deep-neural-network-notations


  • The figure above shows a network with three hidden layers; the numbers of units in these hidden layers are 5, 5, and 3, and there is one output unit.


  • We use capital L, i.e. \( L \), to denote the number of layers in the network, so in this case \( L = 4 \).


  • We use \( n^{[l]} \) to denote the number of nodes, or units, in layer \( l \).


  • In the figure above, the input is indexed as layer 0, followed by layers 1, 2, 3, and 4.


  • Thus, the full list of notations is:
    \( n^{[1]} = 5 \\ n^{[2]} = 5 \\ n^{[3]} = 3 \\ n^{[4]} = n^{[L]} = 1 \\ n^{[0]} = n_x = 3 \)


  • Explanation of the notation: \( n^{[1]} = 5 \) because the first hidden layer has 5 units, \( n^{[2]} = 5 \) for the second hidden layer, \( n^{[3]} = 3 \), and \( n^{[4]} = n^{[L]} = 1 \), since the output layer has a single unit and \( L = 4 \).


  • Then we're also going to set the input layer \( n^{[0]} = n_x = 3 \).
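  • As a minimal sketch, this notation can be written down directly in Python; `layer_dims` is just an illustrative name for the list of \( n^{[l]} \) values:

```python
# Layer sizes for the network in the figure: index l holds n^[l],
# with the input layer at index 0.
layer_dims = [3, 5, 5, 3, 1]   # n^[0] = n_x = 3, n^[1] = 5, n^[2] = 5, n^[3] = 3, n^[4] = 1
L = len(layer_dims) - 1        # the input layer is not counted, so L = 4
```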


  • For each layer \( l \), we also use \( a^{[l]} \) to denote the activations in layer \( l \). The activation for layer \( l \) is computed as \( a^{[l]} = g(z^{[l]}) \), where the activation function g may also be indexed by the layer \( l \). We then use \( W^{[l]} \) to denote the weights for computing the value \( z^{[l]} \) in layer \( l \), and similarly \( b^{[l]} \) is the bias used to compute \( z^{[l]} \).


  • Finally, the input features are called x, but x is also the activations of layer zero, so \( a^{[0]} = x \); and the activation of the final layer is \( a^{[L]} = \hat{y} \).


  • So \( a^{[L]} \) is equal to the predicted output \( \hat{y} \) of the neural network.
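  • A minimal sketch of what these shapes imply in code, assuming NumPy and the `layer_dims` list from the sketch above; the small random initialization is just one common choice, not something fixed by the notation:

```python
import numpy as np

def initialize_parameters(layer_dims, seed=1):
    """Create W^[l] of shape (n^[l], n^[l-1]) and b^[l] of shape (n^[l], 1),
    which follows from z^[l] = W^[l] a^[l-1] + b^[l]."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters
```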






  • In this section, you will see how to perform forward propagation in a deep network.


  • Let's first go over what forward propagation will look like for a single training example x, and then later on we'll talk about the vectorized version, where you want to carry out forward propagation on the entire training set at the same time.


  • Given a single training example x, here's how you compute the activations of the first layer of the network shown below.


  • Figure: forward-propagation-in-a-deep-network
  • For this first layer, you compute \( z^{[1]} = w^{[1]}*x + b^{[1]} \). So \( w^{[1]} \) and \( b^{[1]} \) are the parameters that affect the activations in layer one.


  • This is layer one of the neural network, and then you compute the activations for that layer to be equal to \( g(z^{[1]}) \). The activation function g depends on what layer you're at.


  • Similarly, for layer 2, you would then compute \( z^{[2]} = w^{[2]}*a^{[1]} + b^{[2]} \). So \( z^{[2]} \) for layer two is the weight matrix times the outputs of layer one plus the bias vector for layer two.


  • Then \( a^{[2]} \) equals the activation function 'g' applied to \( z^{[2]} \).


  • So that's it for layer two, and so on and so forth, until you get to the output layer, which is layer four.


  • There you would have that \( z^{[4]} \) is equal to the parameters for that layer times the activations from the previous layer, plus that layer's bias vector, i.e. \( z^{[4]} = w^{[4]}*a^{[3]} + b^{[4]} \).


  • Then similarly, \( a^{[4]} = g(z^{[4]}) \). So, that's how you compute your estimated output, \( \hat{y} \).


  • One thing to notice: x here is also equal to \( a^{[0]} \), because the input feature vector x is also the activations of layer zero. So if we cross out x and put \( a^{[0]} \) in its place, all of these equations take the same form.


  • \( z^{[1]} = w^{[1]}*a^{[0]} + b^{[1]} \\ a^{[1]} = g(z^{[1]}) \\ z^{[2]} = w^{[2]}*a^{[1]} + b^{[2]} \\ a^{[2]} = g(z^{[2]}) \\ z^{[3]} = w^{[3]}*a^{[2]} + b^{[3]} \\ a^{[3]} = g(z^{[3]}) \\ z^{[4]} = w^{[4]}*a^{[3]} + b^{[4]} \\ \hat{y} = a^{[4]} = g(z^{[4]}) \\ \)


  • The general rule is that \( z^{[l]} = w^{[l]}*a^{[l-1]} + b^{[l]} \) and \( a^{[l]} = g(z^{[l]}) \): the activations for a layer are the activation function applied to that layer's values of z.
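  • A minimal sketch of this general rule for a single example, assuming the `parameters` dictionary from the initialization sketch above; the use of ReLU in the hidden layers and a sigmoid at the output is an assumption of this sketch, not something fixed by the equations:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_single_example(x, parameters, L):
    """Forward propagation for one training example x of shape (n_x, 1)."""
    a = x                                      # a^[0] = x
    for l in range(1, L + 1):
        W = parameters["W" + str(l)]
        b = parameters["b" + str(l)]
        z = W @ a + b                          # z^[l] = W^[l] a^[l-1] + b^[l]
        a = sigmoid(z) if l == L else relu(z)  # a^[l] = g(z^[l])
    return a                                   # a^[L] = y_hat
```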


  • How about for doing it in a vectorized way for the whole training set at the same time?


  • \( Z^{[1]} = W^{[1]}*A^{[0]} + b^{[1]} \\ A^{[1]} = g(Z^{[1]}) \\ Z^{[2]} = W^{[2]}*A^{[1]} + b^{[2]} \\ A^{[2]} = g(Z^{[2]}) \\ Z^{[3]} = W^{[3]}*A^{[2]} + b^{[3]} \\ A^{[3]} = g(Z^{[3]}) \\ Z^{[4]} = W^{[4]}*A^{[3]} + b^{[4]} \\ \hat{Y} = A^{[4]} = g(Z^{[4]}) \\ \)


  • The equations look quite similar to before. For the first layer, you would have \( Z^{[1]} = W^{[1]}*X + b^{[1]} \), where the bias vector \( b^{[1]} \) is broadcast across the columns. Then \( A^{[1]} = g(Z^{[1]}) \). Bear in mind that X is equal to \( A^{[0]} \).


  • These are just the training examples stacked in different columns.


  • Similarly for the capital \( A^{[l]} \) matrices: just as with capital X, the per-example activation vectors are stacked as columns, left to right.


  • In this process, you end up with \( \hat{Y} = A^{[4]} = g(Z^{[4]}) \).


  • That's the predictions on all of your training examples stacked horizontally.
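  • A minimal sketch of the vectorized version, reusing `relu`, `sigmoid`, and the `parameters` dictionary from the sketches above; X has shape \( (n_x, m) \), with the m training examples stacked as columns:

```python
def forward_vectorized(X, parameters, L):
    """Vectorized forward propagation: the m training examples are the
    columns of X, so every A^[l] also has one column per example."""
    A = X                                      # A^[0] = X, shape (n_x, m)
    for l in range(1, L + 1):
        W = parameters["W" + str(l)]
        b = parameters["b" + str(l)]
        Z = W @ A + b                          # Z^[l] = W^[l] A^[l-1] + b^[l], b broadcasts over columns
        A = sigmoid(Z) if l == L else relu(Z)  # A^[l] = g(Z^[l])
    return A                                   # A^[L] = Y_hat, shape (1, m)
```

  • For the network above you would call, for example, `parameters = initialize_parameters(layer_dims)` and then `Y_hat = forward_vectorized(X, parameters, L)` with X of shape (3, m).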