Previously, we saw how very deep neural networks can have the problems of vanishing and exploding gradients. It turns out that a partial solution to this (it doesn't solve the problem entirely, but it helps a lot) is a better, more careful choice of the random initialization for your neural network. To understand this, let's start with the example of initializing the weights for a single neuron, and then we'll generalize to a deep network.

So for a single neuron, you might have four input features \( x_1, \dots, x_4 \), then some activation \( a = g(z) \), and finally an output \( \hat{y} \). For this problem, assume \( b = 0 \), so we get \( z = W_1 x_1 + W_2 x_2 + \cdots + W_n x_n \).

In order for z not to blow up and not to become too small, notice that the larger n is, the smaller you want each \( W_i \) to be. Because z is the sum of the \( W_i x_i \) terms, if you're adding up a lot of these terms, you want each of them to be smaller. One reasonable thing to do is to set the variance of \( W_i \) to \( \frac{1}{n} \), where n is the number of input features going into the neuron.

So in practice, what you can do is set the weight matrix W for a certain layer l to `W[l] = np.random.randn(shape) * np.sqrt(1 / n[l-1])`, where \( n^{[l-1]} \) is the number of units in the previous layer.

It turns out that if you're using a ReLU activation function, then rather than \( \frac{1}{n} \), setting the variance to \( \frac{2}{n} \) works a little bit better. So you often see, especially when g(z) is ReLU(z), the weight matrices initialized as `W[l] = np.random.randn(shape) * np.sqrt(2 / n[l-1])`.

A few other variants: if you're using a tanh activation function, there's a paper showing that instead of the constant 2 it's better to use the constant 1 (this is sometimes called Xavier initialization), and we use the
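As a concrete sketch of this scheme for one layer (the function name and the (current, previous) shape convention here are illustrative, not from the lecture):

```python
import numpy as np

def initialize_layer(n_prev, n_curr, relu=True):
    """Initialize one layer's parameters with variance-scaled weights.

    Uses variance 2/n_prev for ReLU layers (the sqrt(2/n) rule above),
    and 1/n_prev otherwise.
    """
    scale = (2.0 if relu else 1.0) / n_prev
    W = np.random.randn(n_curr, n_prev) * np.sqrt(scale)
    b = np.zeros((n_curr, 1))  # biases can simply start at zero
    return W, b
```

With, say, `n_prev = 1000`, the sample variance of `W` lands close to 2/1000, so each \( z_i = \sum_j W_{ij} x_j \) keeps a reasonable scale no matter how wide the previous layer is.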
code `W[l] = np.random.randn(shape) * np.sqrt(1 / n[l-1])`. Another variant is `W[l] = np.random.randn(shape) * np.sqrt(2 / (n[l-1] + n[l]))`, which averages over the fan-in and fan-out of the layer.

So I hope that gives you some intuition about the problem of vanishing and exploding gradients, as well as how choosing a reasonable scale for initializing the weights helps. Hopefully that keeps your weights from exploding too quickly and from decaying to zero too quickly, so you can train a reasonably deep network without the weights or the gradients exploding or vanishing too much. When you train deep networks, this is another trick that will help you make your neural networks train much faster.
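To see why the scale of the initialization matters so much, here is a small experiment (not from the lecture; the width, depth, and seed are arbitrary choices): it pushes a random input through many tanh layers and compares the \( \sqrt{1/n} \) scaling against a naive small constant.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_activation_std(weight_scale, depth=50, n=512):
    """Forward-propagate one random input through `depth` tanh layers
    and return the standard deviation of the final activations."""
    a = rng.standard_normal((n, 1))
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * weight_scale(n)
        a = np.tanh(W @ a)
    return float(a.std())

# Variance-scaled init (sqrt(1/n), the tanh rule above): the activations
# keep a usable scale even after 50 layers.
scaled = final_activation_std(lambda n: np.sqrt(1.0 / n))

# A fixed small constant (0.01): the signal shrinks by roughly a factor
# of sqrt(n) * 0.01 per layer and vanishes almost immediately.
naive = final_activation_std(lambda n: 0.01)
```

With these settings, `scaled` stays many orders of magnitude larger than `naive`, which collapses toward zero; the gradients flowing backward through such a network behave the same way.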