• Previously, we saw how very deep neural networks can suffer from the problems of vanishing and exploding gradients.

• It turns out that a partial solution to this, which doesn't solve it entirely but helps a lot, is a more careful choice of the random initialization for your neural network.

• To understand this, let's start with the example of initializing the weights for a single neuron, and then we'll generalize this to a deep network.

• So a single neuron might take four input features $$x_1, ..., x_4$$, compute $$a = g(z)$$, and output $$\hat{y}$$. For this problem assume $$b = 0$$, so we get $$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$.
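
• A minimal sketch of that single-neuron computation in NumPy (the feature values and the choice of g = ReLU here are illustrative assumptions, not from the notes):

```python
import numpy as np

# Single neuron with b = 0: z = w_1*x_1 + ... + w_n*x_n, then a = g(z)
x = np.array([0.5, -1.2, 0.3, 2.0])   # n = 4 input features (made-up values)
w = np.random.randn(4)                 # weights; how to scale these is discussed below

z = np.dot(w, x)                       # z = sum_i w_i * x_i  (bias b = 0)
a = np.maximum(0.0, z)                 # a = g(z), assuming g = ReLU
y_hat = a
```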

• So in order to keep z from blowing up or becoming too small, notice that the larger n is, the smaller you want each $$w_i$$ to be.

• This is because z is the sum of the $$w_i x_i$$ terms, so if you're adding up a lot of these terms you want each individual term to be smaller.

• One reasonable thing to do would be to set the variance of $$w_i$$ to be equal to $$\frac{1}{n}$$, where n is the number of input features going into the neuron.
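
• A quick numerical check of why this scaling helps (an illustrative sketch, not part of the notes): with unit-variance inputs, weights with variance 1 make the variance of z grow with n, while weights with variance $$\frac{1}{n}$$ keep it near 1.

```python
import numpy as np

np.random.seed(0)
for n in [10, 100, 1000]:
    x = np.random.randn(10000, n)                     # unit-variance inputs
    w_fixed = np.random.randn(n)                      # Var(w_i) = 1
    w_scaled = np.random.randn(n) * np.sqrt(1.0 / n)  # Var(w_i) = 1/n
    print(n, np.var(x @ w_fixed), np.var(x @ w_scaled))
# Var(z) grows roughly like n with fixed-variance weights,
# but stays near 1 with the 1/n scaling.
```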

• So in practice, what you can do is set the weight matrix W for a certain layer l to be: W[l] = np.random.randn(shape) * np.sqrt(1/n[l-1]), where $$n^{[l-1]}$$ is the number of units in the previous layer.

• It turns out that if you're using a ReLU activation function, then rather than $$\frac{1}{n}$$, setting the variance to $$\frac{2}{n}$$ works a little bit better.

• So you often see in initialization, especially if you're using a ReLU activation function, i.e. if $$g(z) = \text{ReLU}(z)$$, that the weight matrices are initialized by W[l] = np.random.randn(shape) * $$\sqrt{\frac{2}{n^{[l-1]}}}$$, where $$n^{[l-1]}$$ is the dimension of the previous layer.
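
• As a concrete sketch of that ReLU-friendly initialization for one layer (the layer sizes here are made up for illustration):

```python
import numpy as np

n_prev, n_curr = 256, 128   # n^[l-1] and n^[l] (illustrative sizes)

# Each entry of W gets variance 2 / n^[l-1]
W = np.random.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)
b = np.zeros((n_curr, 1))   # biases can simply start at zero

print(W.var())              # should be close to 2 / 256 ≈ 0.0078
```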

• There are a few other variants: if you are using a tanh activation function, there's a paper that shows that instead of the constant 2 it's better to use the constant 1, so the code becomes W[l] = np.random.randn(shape) * $$\sqrt{\frac{1}{n^{[l-1]}}}$$.

• Another variant is W[l] = np.random.randn(shape) * $$\sqrt{\frac{2}{n^{[l-1]} + n^{[l]}}}$$, which uses both the previous layer's and the current layer's dimensions.
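
• A sketch pulling the three scalings together in one hypothetical helper (the function name and variant labels are mine, not from the notes):

```python
import numpy as np

def init_weights(n_prev, n_curr, variant="relu"):
    """Initialize an (n_curr, n_prev) weight matrix with one of the scalings above."""
    if variant == "relu":          # sqrt(2 / n^[l-1]), the ReLU variant
        scale = np.sqrt(2.0 / n_prev)
    elif variant == "tanh":        # sqrt(1 / n^[l-1]), the tanh variant
        scale = np.sqrt(1.0 / n_prev)
    elif variant == "both_dims":   # sqrt(2 / (n^[l-1] + n^[l]))
        scale = np.sqrt(2.0 / (n_prev + n_curr))
    else:
        raise ValueError(f"unknown variant: {variant}")
    return np.random.randn(n_curr, n_prev) * scale

W = init_weights(256, 128, variant="both_dims")
```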

• So we hope that gives you some intuition about the problem of vanishing or exploding gradients, as well as how to choose a reasonable scaling for initializing the weights.

• Hopefully that keeps your weights from exploding too quickly or decaying to zero too quickly, so you can train a reasonably deep network without the weights or the gradients exploding or vanishing too much.

• When you train deep networks, this is another trick that will help your neural networks train much more quickly.