• One of the problems of training neural networks, especially very deep neural networks, is vanishing and exploding gradients.


  • What that means is that when you're training a very deep network, your derivatives (or slopes) can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training difficult.


  • Let's see what this problem of exploding and vanishing gradients really means, as well as how you can use careful choices of the random weight initialization to significantly reduce this problem.


  • Assume you're training a very deep neural network like the one shown below:


  • [Figure: a very deep neural network (vanishing/exploding gradients illustration)]


  • This neural network will have parameters \( W^{[1]}, W^{[2]}, W^{[3]}, \ldots, W^{[L]} \). For the sake of simplicity, let's say we're using the activation function g(z) = z, i.e., a linear activation function.


  • And let's ignore b; that is, let's say \( b^{[1]} = b^{[2]} = b^{[3]} = \dots = b^{[L]} = 0 \).


  • In that case you can show that the output \( \hat{y} \) is computed as follows:


  • \( Z^{[1]} = W^{[1]}X \\ A^{[1]} = g(Z^{[1]}) = Z^{[1]} \\ Z^{[2]} = W^{[2]}A^{[1]} \\ A^{[2]} = g(Z^{[2]}) = Z^{[2]} \\ \vdots \\ Z^{[L-1]} = W^{[L-1]}A^{[L-2]} \\ A^{[L-1]} = g(Z^{[L-1]}) = Z^{[L-1]} \\ Z^{[L]} = W^{[L]}A^{[L-1]} \\ \hat{y} = A^{[L]} = g(Z^{[L]}) = Z^{[L]} \)


  • By substitution, we can then show that:


  • \( \hat{y} = W^{[L]} \, W^{[L-1]} \, W^{[L-2]} \cdots W^{[1]} X \)
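

  • A minimal numeric check of this collapse (a sketch, not from the original notes; the depth, width, and random values below are arbitrary):

```python
import numpy as np

# Sketch: with the linear activation g(z) = z and all biases b[l] = 0,
# the layer-by-layer forward pass collapses to the single product
# W[L] @ W[L-1] @ ... @ W[1] @ x.
rng = np.random.default_rng(0)
L, n = 5, 2                                            # depth and width (arbitrary)
Ws = [rng.standard_normal((n, n)) for _ in range(L)]   # W[1], ..., W[L]
x = rng.standard_normal((n, 1))

a = x
for W in Ws:                                  # Z[l] = W[l] A[l-1]; A[l] = Z[l]
    a = W @ a
y_hat_forward = a

P = np.eye(n)
for W in Ws:                                  # accumulate W[L] ... W[1]
    P = W @ P
y_hat_product = P @ x

print(np.allclose(y_hat_forward, y_hat_product))   # True
```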


  • Now, let's say that each of your weight matrices \( W^{[1]}, W^{[2]}, W^{[3]}, \ldots, W^{[L]} \) is \( \begin{bmatrix} 1.5 & 0\\ 0 & 1.5 \end{bmatrix} \).


  • Technically, the last matrix \( W^{[L]} \) has different dimensions, so let's say this holds just for the rest of these weight matrices.


  • Then \( \hat{y} = W^{[L]} \begin{bmatrix} 1.5 & 0\\ 0 & 1.5 \end{bmatrix}^{L-1} X \), because we assume that each of these weight matrices is equal to \( \begin{bmatrix} 1.5 & 0\\ 0 & 1.5 \end{bmatrix} \).


  • Since this matrix is just 1.5 times the identity matrix, you end up with the calculation:


  • \( \begin{bmatrix} 1.5 & 0\\ 0 & 1.5 \end{bmatrix}^{L-1} = 1.5^{\,L-1}\, I \)


  • And so \( \hat{y} \) will essentially grow like 1.5 to the power of L, and if L is large (a very deep neural network), \( \hat{y} \) will be very large.


  • In fact, it just grows exponentially; it grows like 1.5 raised to the number of layers.


  • And so if you have a very deep neural network, the value of \( \hat{y} \) will explode.
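

  • Here is a quick numeric sketch of that explosion (illustrative values; assuming every hidden-layer matrix equals 1.5 times the identity):

```python
import numpy as np

# Sketch: every hidden-layer weight matrix is 1.5 * I, so the activation
# norm grows like 1.5**(L-1) with depth (values are illustrative).
n, L = 2, 100
W = 1.5 * np.eye(n)

a = np.ones((n, 1))
for _ in range(L - 1):                  # layers 1 .. L-1
    a = W @ a

print(np.linalg.norm(a))                # ~ 1.5**99 * sqrt(2) ≈ 4e17
```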


  • Now, conversely, if we replace this 1.5 with 0.5, so something less than 1, then this becomes 0.5 to the power of L - 1:


  • \( \begin{bmatrix} 0.5 & 0\\ 0 & 0.5 \end{bmatrix}^{L-1} \)


  • This matrix product becomes \( 0.5^{\,L-1} \) times X, so if each of your weight matrices is a little less than the identity, the activations will scale like \( \begin{bmatrix} 0.5 & 0\\ 0 & 0.5 \end{bmatrix}^{L-1} X \).


  • So the activation values will decrease exponentially as a function of the number of layers L of the network.


  • So in the very deep network, the activations end up decreasing exponentially.
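

  • The same kind of sketch for the shrinking case (illustrative values; every hidden-layer matrix equals 0.5 times the identity), comparing the layer-by-layer pass to the closed form \( 0.5^{\,L-1} X \):

```python
import numpy as np

# Sketch: every hidden-layer weight matrix is 0.5 * I, so the activations
# shrink like 0.5**(L-1) with depth (values are illustrative).
n, L = 2, 100
W = 0.5 * np.eye(n)
x = np.ones((n, 1))

a = x
for _ in range(L - 1):
    a = W @ a

print(np.allclose(a, 0.5 ** (L - 1) * x))    # True: matches the closed form
print(float(a[0, 0]))                        # 0.5**99 ≈ 1.6e-30, essentially zero
```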


  • So the intuition you can take away from this is that if the weights W are all just a little bit bigger than one, or just a little bit bigger than the identity matrix, then with a very deep network the activations can explode.


  • And if W is just a little bit less than the identity, say \( \begin{bmatrix} 0.9 & 0\\ 0 & 0.9 \end{bmatrix} \), then with a very deep network the activations will decrease exponentially.


  • And a similar argument can be used to show that the derivatives or the gradients will also increase exponentially or decrease exponentially as a function of the number of layers.
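

  • A scalar sketch of that gradient argument (an illustrative toy, not the full backpropagation derivation): for a deep linear network \( \hat{y} = w_L \cdots w_1 x \) with every weight equal to the same value w, the derivative of \( \hat{y} \) with respect to the first weight is \( w^{L-1} x \), so it explodes or vanishes exactly like the activations.

```python
# Toy scalar network (illustrative): y_hat = w_L * ... * w_1 * x with all
# weights equal to w, so d(y_hat)/d(w_1) = w**(L-1) * x.
def d_yhat_d_w1(w, L, x=1.0):
    return (w ** (L - 1)) * x

L = 100
print(d_yhat_d_w1(1.5, L))   # ~ 2.7e17  -> exploding gradient
print(d_yhat_d_w1(0.5, L))   # ~ 1.6e-30 -> vanishing gradient
```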


  • With some modern neural networks, L can be around 150; Microsoft recently got great results with a 152-layer neural network.


  • But with such a deep neural network, if your activations or gradients increase or decrease exponentially as a function of L, then these values could get really big or really small.


  • And this makes training difficult, especially if your gradients are exponentially small as a function of L: gradient descent will then take tiny little steps.


  • It will take a long time for gradient descent to learn anything. To summarize, you've seen how deep networks suffer from the problems of vanishing or exploding gradients.


  • In fact, for a long time this problem was a huge barrier to training deep neural networks. It turns out there's a partial solution that doesn't completely solve this problem but helps a lot, which is a careful choice of how you initialize the weights (next section).