**One of the problems** of training neural networks, especially very deep neural networks, is vanishing and exploding gradients. **What that means** is that when you're training a very deep network, your derivatives or your slopes can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training difficult. **Let's see** what this problem of exploding and vanishing gradients really means, as well as how you can use careful choices of the random weight initialization to significantly reduce this problem. **Assume you're training** a very deep neural network with parameters \( W^{[1]}, W^{[2]}, W^{[3]}, \ldots, W^{[L]} \). For the sake of simplicity, let's say we're using a linear activation function, \( g(z) = z \). **And let's ignore b**: let's say \( b^{[1]} = b^{[2]} = \cdots = b^{[L]} = 0 \). **In that case** you can show that the output \( \hat{y} \) is computed as follows:

\( Z^{[1]} = W^{[1]}X \\ A^{[1]} = g(Z^{[1]}) = Z^{[1]} \\ Z^{[2]} = W^{[2]}A^{[1]} \\ A^{[2]} = g(Z^{[2]}) = Z^{[2]} \\ \vdots \\ Z^{[L-1]} = W^{[L-1]}A^{[L-2]} \\ A^{[L-1]} = g(Z^{[L-1]}) = Z^{[L-1]} \\ Z^{[L]} = W^{[L]}A^{[L-1]} \\ \hat{y} = A^{[L]} = g(Z^{[L]}) = Z^{[L]} \)
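**As a quick illustration**, here is a minimal NumPy sketch of this setup (hypothetical code, not from the course): a forward pass with the linear activation \( g(z) = z \) and all biases set to zero, where `weights` is an assumed list holding \( W^{[1]}, \ldots, W^{[L]} \).

```python
import numpy as np

def forward(weights, x):
    """Forward pass with linear activation g(z) = z and all biases b[l] = 0.

    weights -- list [W1, W2, ..., WL] of the layer weight matrices
    x       -- input column vector (the A[0] of the network)
    """
    a = x
    for W in weights:   # layers 1 .. L
        z = W @ a       # Z[l] = W[l] A[l-1]
        a = z           # A[l] = g(Z[l]) = Z[l]
    return a            # A[L] = y_hat
```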

**By repeated substitution**, we can show that:

\( \hat{y} = W^{[L]} W^{[L-1]} W^{[L-2]} \cdots W^{[1]} X \)
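**Continuing the sketch above**, we can check this collapse numerically; `np.linalg.multi_dot` multiplies the whole chain of matrices at once (the sizes and seed here are arbitrary assumptions):

```python
# With linear activations and zero biases, the layer-by-layer loop
# collapses into one matrix product: y_hat = W[L] @ ... @ W[1] @ x.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 3)) for _ in range(5)]
x = rng.standard_normal((3, 1))

y_loop = forward(weights, x)                          # layer by layer
y_prod = np.linalg.multi_dot(weights[::-1] + [x])     # W[L] ... W[1] x
assert np.allclose(y_loop, y_prod)
```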

**Now, let's say** that each of your weight matrices \( W^{[1]}, W^{[2]}, \ldots, W^{[L]} \) is \( \begin{bmatrix} 1.5 & 0\\ 0 & 1.5 \end{bmatrix} \). **Technically**, the last one has different dimensions, so this really describes the rest of the weight matrices. **Then**, because each of these matrices equals 1.5 times the identity matrix, the product of the remaining L − 1 matrices is

\( \begin{bmatrix} 1.5 & 0\\ 0 & 1.5 \end{bmatrix}^{L-1} \)

and so \( \hat{y} = W^{[L]} \begin{bmatrix} 1.5 & 0\\ 0 & 1.5 \end{bmatrix}^{L-1} X \).

**And so** \( \hat{y} \) will essentially grow like 1.5 to the power of L, and if L is large, as in a very deep neural network, \( \hat{y} \) will be very large. **In fact**, it grows exponentially, like 1.5 to the number of layers. **And so** if you have a very deep neural network, the value of \( \hat{y} \) will explode. **Now, conversely**, if we replace the 1.5 with 0.5, so something less than 1, the same product becomes

\( \begin{bmatrix} 0.5 & 0\\ 0 & 0.5 \end{bmatrix}^{L-1} \)
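**To see both regimes numerically**, here is a small sketch (hypothetical code) that raises \( 1.5I \) and \( 0.5I \) to growing powers, standing in for the product of the L − 1 hidden-layer matrices:

```python
import numpy as np

x = np.ones((2, 1))
for scale in (1.5, 0.5):
    W = scale * np.eye(2)            # each hidden-layer weight matrix
    for depth in (10, 50, 100):      # the L - 1 repeated matrices
        a = np.linalg.matrix_power(W, depth) @ x
        print(f"scale={scale}  depth={depth}  ||a|| = {np.linalg.norm(a):.3e}")
# scale 1.5 explodes (order 1e17 at depth 100);
# scale 0.5 vanishes (order 1e-30 at depth 100).
```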

**This** matrix power contributes \( 0.5^{L-1} \) times X, so if each of your matrices is a little less than the identity, the activations are scaled by \( \begin{bmatrix} 0.5 & 0\\ 0 & 0.5 \end{bmatrix}^{L-1} \). **So the activation values** will decrease exponentially as a function of the number of layers L of the network. **So in a very deep network**, the activations end up decreasing exponentially. **So the intuition** you can take away from this is that if the weights W are all just a little bit bigger than one, or just a little bit bigger than the identity matrix, then with a very deep network the activations can explode. **And if W is just a little bit** less than the identity, say with 0.9 on the diagonal, then with a very deep network the activations will decrease exponentially. **And a similar argument** can be used to show that the derivatives, or the gradients, will also increase or decrease exponentially as a function of the number of layers. **With some of the modern neural networks**, L can be around 150; Microsoft recently got great results with a 152-layer neural network. **But with such a deep neural network**, if your activations or gradients increase or decrease exponentially as a function of L, then these values can get really big or really small. **And this makes training difficult**: in particular, if your gradients are exponentially small in L, then gradient descent will take tiny little steps, and **it will take a long time for gradient descent** to learn anything.

To summarize, you've seen how deep networks suffer from the problems of vanishing or exploding gradients. **In fact, for a long time** this problem was a huge barrier to training deep neural networks. It turns out there's a partial solution that doesn't completely solve this problem but helps a lot, which is a careful choice of how you initialize the weights (next section).
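**As a preview** of that partial solution, here is a minimal sketch of one common such scheme, He initialization (an assumption about what follows, not the only choice; the variance factor 2/n_in is the one commonly paired with ReLU activations):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # He initialization: drawing each weight with variance 2 / n_in keeps
    # the scale of the activations roughly constant from layer to layer,
    # so they neither explode nor vanish as quickly with depth.
    W = rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)
    b = np.zeros((n_out, 1))
    return W, b
```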