• Previously, we saw how very deep neural networks can suffer from the problems of vanishing and exploding gradients.


  • It turns out that a partial solution to this (it doesn't solve the problem entirely, but it helps a lot) is a better, more careful choice of the random initialization for your neural network.


  • To understand this, let's start with the example of initializing the weights for a single neuron, and then we'll generalize this to a deep network.


  • Let's go through this with an example of just a single neuron, and then we'll talk about the deep network later.


  • So for a single neuron you might have four input features \( x_1, \ldots, x_4 \), then some \( a = g(z) \), and you end up with \( \hat{y} \). For this problem assume \( b = 0 \), so we get \( z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n \).


  • So in order to keep z from blowing up or becoming too small, notice that the larger n is, the smaller you want each \( w_i \) to be.


  • That's because z is the sum of the \( w_i x_i \) terms, so if you're adding up a lot of these terms, you want each of them to be smaller.


  • One reasonable thing to do would be to set the variance of each \( w_i \) equal to \( \frac{1}{n} \), where n is the number of input features going into the neuron.
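

  • The sketch below is a quick numerical check of this intuition (not from the lecture): with \( \text{Var}(w_i) = \frac{1}{n} \) and standard-normal inputs (an illustrative assumption), the variance of z stays close to 1 no matter how large n gets.

```python
import numpy as np

# Empirically check that scaling Var(w_i) by 1/n keeps Var(z) roughly constant
# as the number of input features n grows. Inputs are assumed standard normal
# purely for illustration.
rng = np.random.default_rng(0)
num_samples = 5_000

for n in [4, 100, 1_000]:
    x = rng.standard_normal((num_samples, n))                   # n features per sample
    w = rng.standard_normal((num_samples, n)) * np.sqrt(1 / n)  # fresh weights, Var(w_i) = 1/n
    z = np.sum(w * x, axis=1)                                   # z = w_1 x_1 + ... + w_n x_n
    print(f"n = {n:5d}   Var(z) ~ {z.var():.3f}")               # stays near 1 for every n
```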


  • So in practice, what you can do is set the weight matrix for a certain layer to W[l] = np.random.randn(shape) * np.sqrt(1 / n[l-1]), where \( n^{[l-1]} \) is the number of units in the previous layer.
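

  • As a minimal sketch (the function name and the layer_dims list are illustrative, not from the lecture), this is what that initialization might look like for a whole network, scaling each W[l] by \( \sqrt{\frac{1}{n^{[l-1]}}} \):

```python
import numpy as np

def initialize_parameters(layer_dims, seed=1):
    """layer_dims = [n_x, n_1, ..., n_L]: number of units in each layer."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        # Scale by sqrt(1 / n^{[l-1]}), the number of units feeding into layer l.
        parameters[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * np.sqrt(1 / n_prev)
        parameters[f"b{l}"] = np.zeros((n_curr, 1))   # biases can safely start at zero
    return parameters

params = initialize_parameters([4, 5, 3, 1])   # e.g. 4 input features, two hidden layers
print(params["W1"].std())                      # roughly sqrt(1/4) = 0.5
```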


  • It turns out that if you're using a ReLU activation function, then rather than \( \frac{1}{n} \), setting the variance to \( \frac{2}{n} \) works a little bit better.


  • So you often see, especially if you're using a ReLU activation function (i.e. g(z) = ReLU(z)), that the weight matrices are initialized with W[l] = np.random.randn(shape) * np.sqrt(2 / n[l-1]), i.e. scaled by \( \sqrt{\frac{2}{n^{[l-1]}}} \) (this is sometimes called He initialization).


  • A few other variants: if you are using a tanh activation function, there's a paper showing that instead of the constant 2 it's better to use the constant 1, so the code becomes W[l] = np.random.randn(shape) * np.sqrt(1 / n[l-1]) (this is sometimes called Xavier initialization).


  • Another variant uses both layer dimensions: W[l] = np.random.randn(shape) * np.sqrt(2 / (n[l-1] + n[l])). These three scalings are compared in the short sketch below.
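

  • Here is a minimal sketch comparing the three scalings above; the helper name and the "he" / "xavier" / "average" labels are my own shorthand, and only the scaling factors themselves come from these notes.

```python
import numpy as np

def init_weight(n_prev, n_curr, method="he", seed=0):
    """Draw an (n_curr, n_prev) weight matrix with the chosen variance scaling."""
    rng = np.random.default_rng(seed)
    if method == "he":            # ReLU layers: Var = 2 / n_prev
        scale = np.sqrt(2 / n_prev)
    elif method == "xavier":      # tanh layers: Var = 1 / n_prev
        scale = np.sqrt(1 / n_prev)
    elif method == "average":     # variant using both layer dimensions
        scale = np.sqrt(2 / (n_prev + n_curr))
    else:
        raise ValueError(f"unknown method: {method}")
    return rng.standard_normal((n_curr, n_prev)) * scale

for method in ["he", "xavier", "average"]:
    W = init_weight(n_prev=256, n_curr=128, method=method)
    print(f"{method:8s} std of entries ~ {W.std():.4f}")
```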


  • So we hope that gives you some intuition about the problem of vanishing and exploding gradients, as well as how choosing a reasonable scale for the weight initialization helps address it.


  • Hopefully that keeps your weights from exploding too quickly and from decaying to zero too quickly, so you can train a reasonably deep network without the weights or the gradients exploding or vanishing too much.


  • When you train deep networks, this is another trick that will help make your neural networks train much faster.