• Why does regularization help with overfitting? Why does it help with reducing variance problems?


  • Let's go through the example below to gain some intuition about how regularization works. Recall what high bias and high variance look like; they look something like the figure below.


  • Suppose we are also fitting a large, deep neural network.


  • *(Figure: overfitting-regularization-underfitting)*


  • For this network we have a cost function J that looks like the equation below:


  • \( J(w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) \)


  • So what we did for regularization was add an extra term that penalizes the weight matrices for being too large. This term uses the squared Frobenius norm, \( \|w^{[l]}\|^2_F = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (w_{ij}^{[l]})^2 \). The final cost is given below:


  • \( J(w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \|w^{[l]}\|^2_F \)
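
  • As a concrete sketch, here is how this regularized cost could be computed in NumPy. The function name, the `parameters` dictionary layout, and the use of binary cross-entropy as the loss are assumptions made for illustration; they are not part of the original notes.

```python
import numpy as np

def compute_cost_with_regularization(AL, Y, parameters, lambd):
    """Cross-entropy cost plus the L2 / Frobenius-norm penalty (illustrative sketch).

    AL         -- output-layer activations, shape (1, m)
    Y          -- true labels, shape (1, m)
    parameters -- dict assumed to hold W1, b1, ..., WL, bL
    lambd      -- regularization strength lambda
    """
    m = Y.shape[1]
    L = len(parameters) // 2  # number of layers, given the assumed key layout

    # Ordinary cost: (1/m) * sum of per-example losses (binary cross-entropy here)
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m

    # Penalty: (lambda / 2m) * sum over layers of the squared Frobenius norm of W[l]
    frobenius_sum = sum(np.sum(np.square(parameters["W" + str(l)]))
                        for l in range(1, L + 1))
    l2_penalty = (lambd / (2 * m)) * frobenius_sum

    return cross_entropy + l2_penalty
```

  • Note that the penalty only involves the weight matrices \( w^{[l]} \); the bias terms are usually left unregularized.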


  • So why is it that shrinking the L2 norm, or the Frobenius norm, of the parameters might cause less overfitting?


  • One intuition is that if you set the regularization parameter lambda to be big, gradient descent will be strongly incentivized to set the weight matrices W reasonably close to zero.
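
  • To see why, look at how the penalty changes the gradient descent update (a sketch; \( \alpha \) is the learning rate and "from backprop" denotes the gradient of the unregularized cost):

\( dw^{[l]} = (\text{from backprop}) + \frac{\lambda}{m} w^{[l]} \)

\( w^{[l]} := w^{[l]} - \alpha \, dw^{[l]} = \left(1 - \frac{\alpha \lambda}{m}\right) w^{[l]} - \alpha \, (\text{from backprop}) \)

  • Every update therefore multiplies \( w^{[l]} \) by a factor slightly less than 1 (which is why L2 regularization is also called "weight decay"), and the larger lambda is, the stronger that shrinkage is.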


  • So one piece of intuition is that it may push the weights so close to zero for a lot of hidden units that it is basically zeroing out much of the impact of those hidden units.


  • And if that's the case, then the network effectively becomes a much smaller, simpler neural network, as shown in the figure below.


  • *(Figure: regularized-neural-network)*


  • And so that will take you from this overfitting case much closer to the high bias case.


  • But hopefully there is an intermediate value of lambda that lands closer to the "just right" case.


  • The intuition of completely zeroing out a bunch of hidden units isn't quite right, though. What actually happens is that the network still uses all of its hidden units, but each one just has a much smaller effect.
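
  • In code, this shows up as a single extra term in each weight gradient during backpropagation; nothing is hard-zeroed, every weight is just pulled a little toward zero on each step. A minimal sketch for one layer (the names `dZ`, `A_prev`, and `lambd` are illustrative assumptions):

```python
import numpy as np

def backward_with_regularization(dZ, A_prev, W, lambd, m):
    """Weight gradient for one layer with the L2 penalty included (illustrative sketch).

    dZ     -- gradient of the cost w.r.t. this layer's pre-activation, shape (n_l, m)
    A_prev -- activations from the previous layer, shape (n_{l-1}, m)
    W      -- this layer's weight matrix, shape (n_l, n_{l-1})
    lambd  -- regularization strength lambda
    m      -- number of training examples
    """
    # Unregularized gradient from backprop, plus the (lambda / m) * W term
    # contributed by the Frobenius-norm penalty.
    dW = (dZ @ A_prev.T) / m + (lambd / m) * W
    db = np.sum(dZ, axis=1, keepdims=True) / m
    return dW, db
```

  • Plugged into the usual update \( w^{[l]} := w^{[l]} - \alpha \, dw^{[l]} \), a larger `lambd` shrinks the weights more aggressively, but all hidden units remain in use.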


  • But you do end up with a simpler network, one that behaves as if it were smaller and is therefore less prone to overfitting.