• Previously, we saw how very deep neural networks can suffer from the problems of vanishing and exploding gradients.

• It turns out that a partial solution to this, which doesn't solve it entirely but helps a lot, is a more careful choice of the random initialization for your neural network.

• To understand this, let's start with the example of initializing the weights for a single neuron, and then we'll generalize this to a deep network.

• So a single neuron might take four input features $$x_1, ..., x_4$$, compute $$a = g(z)$$, and output $$\hat{y}$$. For this problem assume $$b = 0$$, so we get $$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$.
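
• A minimal sketch of that single-neuron computation in NumPy (the feature values and the choice of g = ReLU here are illustrative assumptions, not from the notes):

```python
import numpy as np

# Single neuron with b = 0: z = w_1*x_1 + ... + w_n*x_n, then a = g(z)
x = np.array([0.5, -1.2, 0.3, 2.0])   # n = 4 input features (made-up values)
w = np.random.randn(4)                 # weights; how to scale these is discussed below

z = np.dot(w, x)                       # z = sum_i w_i * x_i  (bias b = 0)
a = np.maximum(0.0, z)                 # a = g(z), assuming g = ReLU
y_hat = a
```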

• So in order to keep z from blowing up or becoming too small, notice that the larger n is, the smaller you want each $$w_i$$ to be.

• This is because z is the sum of the $$w_i x_i$$ terms, so if you're adding up a lot of these terms you want each individual term to be smaller.

• One reasonable thing to do would be to set the variance of $$w_i$$ to be equal to $$\frac{1}{n}$$, where n is the number of input features going into the neuron.
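
• A quick numerical check of why this scaling helps (an illustrative sketch, not part of the notes): with unit-variance inputs, weights with variance 1 make the variance of z grow with n, while weights with variance $$\frac{1}{n}$$ keep it near 1.

```python
import numpy as np

np.random.seed(0)
for n in [10, 100, 1000]:
    x = np.random.randn(10000, n)                     # unit-variance inputs
    w_fixed = np.random.randn(n)                      # Var(w_i) = 1
    w_scaled = np.random.randn(n) * np.sqrt(1.0 / n)  # Var(w_i) = 1/n
    print(n, np.var(x @ w_fixed), np.var(x @ w_scaled))
# Var(z) grows roughly like n with fixed-variance weights,
# but stays near 1 with the 1/n scaling.
```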

• So in practice, what you can do is set the weight matrix W for a certain layer l to be: W[l] = np.random.randn(shape) * np.sqrt(1/n[l-1]), where $$n^{[l-1]}$$ is the number of units in the previous layer.

• It turns out that if you're using a ReLU activation function, then rather than $$\frac{1}{n}$$, setting the variance to $$\frac{2}{n}$$ works a little bit better.

• So you often see in initialization, especially if you're using a ReLU activation function, i.e. if $$g(z) = \text{ReLU}(z)$$, that the weight matrices are initialized by W[l] = np.random.randn(shape) * $$\sqrt{\frac{2}{n^{[l-1]}}}$$, where $$n^{[l-1]}$$ is the dimension of the previous layer.
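
• As a concrete sketch of that ReLU-friendly initialization for one layer (the layer sizes here are made up for illustration):

```python
import numpy as np

n_prev, n_curr = 256, 128   # n^[l-1] and n^[l] (illustrative sizes)

# Each entry of W gets variance 2 / n^[l-1]
W = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)
b = np.zeros((n_curr, 1))   # biases can simply start at zero

print(W.var())              # should be close to 2 / 256 ≈ 0.0078
```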

• There are a few other variants: if you are using a tanh activation function, there's a paper that shows that instead of the constant 2 it's better to use the constant 1, so the code becomes W[l] = np.random.randn(shape) * $$\sqrt{\frac{1}{n^{[l-1]}}}$$.

• Another variant is W[l] = np.random.randn(shape) * $$\sqrt{\frac{2}{n^{[l-1]} + n^{[l]}}}$$, which uses both the previous layer's and the current layer's dimensions.
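
• A sketch pulling the three scalings together in one hypothetical helper (the function name and variant labels are mine, not from the notes):

```python
import numpy as np

def init_weights(n_prev, n_curr, variant="relu"):
    """Initialize an (n_curr, n_prev) weight matrix with one of the scalings above."""
    if variant == "relu":          # sqrt(2 / n^[l-1]), the ReLU variant
        scale = np.sqrt(2.0 / n_prev)
    elif variant == "tanh":        # sqrt(1 / n^[l-1]), the tanh variant
        scale = np.sqrt(1.0 / n_prev)
    elif variant == "both_dims":   # sqrt(2 / (n^[l-1] + n^[l]))
        scale = np.sqrt(2.0 / (n_prev + n_curr))
    else:
        raise ValueError(f"unknown variant: {variant}")
    return np.random.randn(n_curr, n_prev) * scale

W = init_weights(256, 128, variant="both_dims")
```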

• So we hope that gives you some intuition about the problem of vanishing or exploding gradients, as well as how to choose a reasonable scaling for initializing the weights.

• Hopefully that keeps your weights from exploding too quickly or decaying to zero too quickly, so you can train a reasonably deep network without the weights or the gradients exploding or vanishing too much.

• When you train deep networks, this is another trick that will help your neural networks train much more quickly.