**Why does regularization** help with overfitting? Why does it help with reducing variance? Let's go through the example below to gain some intuition about how it works. **Recall** the high-bias and high-variance cases, which look something like the image given below. **Also consider** a large, deep neural network. **For this network** we have a cost function J that looks like the equation given below: \( J( w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) \)
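As a minimal sketch of the unregularized cost J above, here is the average cross-entropy loss over m examples. The helper name `cost` and the toy labels/predictions are illustrative, not from the original:

```python
import numpy as np

# Unregularized cost J: average cross-entropy loss over m examples.
# y_hat holds predicted probabilities, y holds the true 0/1 labels.
def cost(y_hat, y):
    m = y.shape[0]
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return np.sum(losses) / m

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.8])
J = cost(y_hat, y)  # small, since predictions mostly agree with labels
```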

**So what we did** for regularization was add an extra term that penalizes the weight matrices for being too large. This term uses the Frobenius norm, \( \|w^{[l]}\|^2_F = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (w_{ij}^{[l]})^2 \). The resulting cost function is: \( J( w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \|w^{[l]}\|^2_F \)
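The penalty term can be sketched directly from the formula. The helper `regularized_cost` below is hypothetical (and the argument is spelled `lambd` because `lambda` is a Python keyword); it adds the scaled sum of squared Frobenius norms to an already-computed cross-entropy cost:

```python
import numpy as np

# Add the L2 / Frobenius-norm penalty to an existing cross-entropy cost.
# `weights` is the list of weight matrices W[1]..W[L]; `m` is the number
# of training examples; `lambd` is the regularization strength.
def regularized_cost(cross_entropy_cost, weights, lambd, m):
    # Sum of squared entries of every layer's weight matrix
    frobenius = sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + (lambd / (2 * m)) * frobenius

W1 = np.array([[1.0, 2.0], [3.0, 4.0]])   # ||W1||_F^2 = 30
W2 = np.array([[0.5, 0.5]])               # ||W2||_F^2 = 0.5
J = regularized_cost(0.8, [W1, W2], lambd=0.1, m=10)
# penalty = 0.1 / (2*10) * 30.5 = 0.1525, so J = 0.9525
```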

**So why is it** that shrinking the L2 norm (the Frobenius norm) of the parameters might cause less overfitting? **One intuition** is that if you set the regularization parameter lambda to be big, gradient descent will be strongly incentivized to set the weight matrices W reasonably close to zero. **So one piece of intuition** is that maybe it sets the weights so close to zero for a lot of hidden units that it essentially zeros out much of the impact of those hidden units. **And if that's the case**, then this simplified neural network becomes a much smaller neural network, as in the figure below. **And so that** will take you from the overfitting case much closer to the high-bias case. **But hopefully** there's an intermediate value of lambda that results in something closer to the "just right" case. **The intuition** of completely zeroing out a bunch of hidden units isn't quite right, though. It turns out that what actually happens is that the network still uses all of its hidden units, but each of them just has a much smaller effect. **But you do end up** with a simpler network, as if you had a smaller network, which is therefore less prone to overfitting.
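The mechanism behind this intuition can be sketched with the regularized gradient-descent update, W := W - alpha * (dW + (lambda/m) * W), sometimes called "weight decay". The names `update`, `alpha`, and the zeroed data gradient below are illustrative assumptions used to isolate the decay effect:

```python
import numpy as np

# Gradient-descent step with the L2 penalty folded into the gradient.
def update(W, dW, alpha, lambd, m):
    return W - alpha * (dW + (lambd / m) * W)

rng = np.random.default_rng(0)
W0 = rng.standard_normal((4, 4))
dW = np.zeros_like(W0)  # zero data gradient, so only the penalty acts

W_small_lambda = W0.copy()
W_big_lambda = W0.copy()
for _ in range(100):
    W_small_lambda = update(W_small_lambda, dW, alpha=0.1, lambd=0.1, m=10)
    W_big_lambda = update(W_big_lambda, dW, alpha=0.1, lambd=5.0, m=10)

# With large lambda, the weights shrink much closer to zero, which is
# what "zeroing out the impact of hidden units" looks like numerically.
```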