• Why does regularization help with overfitting? Why does it help with reducing variance problems?


  • Let's go through the example below to gain some intuition about how regularization works. Recall what high bias and high variance look like; they look something like the figure below.


  • Suppose we are also fitting a large, deep neural network.


  • *(Figure: overfitting-regularization-underfitting)*


  • For this network we have a cost function J that looks like the equation below:


  • \( J(w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \)


  • So what we did for regularization was add an extra term that penalizes the weight matrices for being too large. This term uses the squared Frobenius norm, \( \|w^{[l]}\|^2_F = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (w_{ij}^{[l]})^2 \). The final cost is given below:


  • \( J(w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \|w^{[l]}\|^2_F \)
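
  • As a concrete sketch, here is how this regularized cost could be computed in NumPy. The function name, the `parameters` dictionary layout, and the use of binary cross-entropy as the loss are assumptions made for illustration; they are not part of the original notes.

```python
import numpy as np

def compute_cost_with_regularization(AL, Y, parameters, lambd):
    """Cross-entropy cost plus the L2 / Frobenius-norm penalty (illustrative sketch).

    AL         -- output-layer activations, shape (1, m)
    Y          -- true labels, shape (1, m)
    parameters -- dict assumed to hold W1, b1, ..., WL, bL
    lambd      -- regularization strength lambda
    """
    m = Y.shape[1]
    L = len(parameters) // 2  # number of layers, given the assumed key layout

    # Ordinary cost: (1/m) * sum of per-example losses (binary cross-entropy here)
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m

    # Penalty: (lambda / 2m) * sum over layers of the squared Frobenius norm of W[l]
    frobenius_sum = sum(np.sum(np.square(parameters["W" + str(l)]))
                        for l in range(1, L + 1))
    l2_penalty = (lambd / (2 * m)) * frobenius_sum

    return cross_entropy + l2_penalty
```

  • Note that the penalty only involves the weight matrices \( w^{[l]} \); the bias terms are usually left unregularized.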


  • So why is it that shrinking the L2 norm, or the Frobenius norm, of the parameters might cause less overfitting?


  • One intuition is that if you set the regularization parameter lambda to be big, gradient descent will be strongly incentivized to set the weight matrices W reasonably close to zero.
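
  • To see why, look at how the penalty changes the gradient descent update (a sketch; \( \alpha \) is the learning rate and "from backprop" denotes the gradient of the unregularized cost):

\( dw^{[l]} = (\text{from backprop}) + \frac{\lambda}{m} w^{[l]} \)

\( w^{[l]} := w^{[l]} - \alpha \, dw^{[l]} = \left(1 - \frac{\alpha \lambda}{m}\right) w^{[l]} - \alpha \, (\text{from backprop}) \)

  • Every update therefore multiplies \( w^{[l]} \) by a factor slightly less than 1 (which is why L2 regularization is also called "weight decay"), and the larger lambda is, the stronger that shrinkage is.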


  • So one piece of intuition is that it may push the weights so close to zero for a lot of hidden units that it is basically zeroing out much of the impact of those hidden units.


  • And if that's the case, then the network effectively becomes a much smaller, simpler neural network, as shown in the figure below.


  • *(Figure: regularized-neural-network)*


  • And so that will take you from this overfitting case much closer to the high bias case.


  • But hopefully there is an intermediate value of lambda that lands closer to the "just right" case.


  • The intuition of completely zeroing out a bunch of hidden units isn't quite right, though. What actually happens is that the network still uses all of its hidden units, but each one just has a much smaller effect.
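
  • In code, this shows up as a single extra term in each weight gradient during backpropagation; nothing is hard-zeroed, every weight is just pulled a little toward zero on each step. A minimal sketch for one layer (the names `dZ`, `A_prev`, and `lambd` are illustrative assumptions):

```python
import numpy as np

def backward_with_regularization(dZ, A_prev, W, lambd, m):
    """Weight gradient for one layer with the L2 penalty included (illustrative sketch).

    dZ     -- gradient of the cost w.r.t. this layer's pre-activation, shape (n_l, m)
    A_prev -- activations from the previous layer, shape (n_{l-1}, m)
    W      -- this layer's weight matrix, shape (n_l, n_{l-1})
    lambd  -- regularization strength lambda
    m      -- number of training examples
    """
    # Unregularized gradient from backprop, plus the (lambda / m) * W term
    # contributed by the Frobenius-norm penalty.
    dW = (dZ @ A_prev.T) / m + (lambd / m) * W
    db = np.sum(dZ, axis=1, keepdims=True) / m
    return dW, db
```

  • Plugged into the usual update \( w^{[l]} := w^{[l]} - \alpha \, dw^{[l]} \), a larger `lambd` shrinks the weights more aggressively, but all hidden units remain in use.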


  • But you do end up with a simpler network, one that behaves as if it were smaller and is therefore less prone to overfitting.