• If you suspect your neural network is overfitting your data (that is, you have a high variance problem), one of the first things you should try is probably regularization.

• The other way to address high variance is to get more training data, which is also quite reliable.

• But you can't always get more training data, or it could be expensive to get more data.

• But adding regularization will often help to prevent overfitting, or to reduce the errors in your network. So let's see how regularization works.

• Recall that for logistic regression, you try to minimize the cost function J, defined below.

• $$J(w,b) = \frac{1}{m} \sum_{i = 1}^{m} L(\hat{y}^{(i)}, y^{(i)}) = − \frac{1}{m} \sum_{i = 1}^{m} [y^{(i)}\log(\hat{y}^{(i)}) + (1−y^{(i)})\log(1−\hat{y}^{(i)}) ]$$

• This is the average, over your m training examples, of the losses on the individual predictions, where you recall that w and b are the parameters of logistic regression.

• So w is an $$n_x$$-dimensional parameter vector ($$n_x$$ is the number of features), and b is a real number. To add regularization to logistic regression, you add a term scaled by lambda, which is called the regularization parameter.

• Usually L2 regularization is applied to regularize the model. The L2 regularization term, which is added to the cost function J, is defined as $$\frac{\lambda}{2m} ||w||^2_2$$.

• This is read as lambda over 2m times the norm of w squared. Here, the norm of w squared is just w transpose w ( $$||w||^2_2 = \sum_{j = 1}^{n_x} w_j^2$$ ); it's the squared Euclidean norm of the vector w.
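The two ways of writing the squared norm can be checked numerically; a minimal numpy sketch (the vector `w` is just illustrative data):

```python
import numpy as np

w = np.array([0.5, -1.2, 3.0])  # example parameter vector

# ||w||_2^2 as w^T w
norm_sq_dot = float(w.T @ w)

# ||w||_2^2 as an explicit sum of squares
norm_sq_sum = float(np.sum(w ** 2))

print(norm_sq_dot, norm_sq_sum)  # the two agree
```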

• Now, why do you regularize just the parameter w? Why don't we add something here about b as well? $$\frac{\lambda}{2m} ||b||^2_2$$

• In practice, you could do this, but we usually just omit this.

• Because if you look at your parameters, w is usually a pretty high dimensional parameter vector, especially with a high variance problem.

• Maybe w just has a lot of parameters, so you aren't fitting all the parameters well, whereas b is just a single number.

• So almost all the parameters are in w rather than in b. And if you add this last term, in practice it won't make much of a difference, because b is just one parameter out of a very large number of parameters.

• In practice, we usually just don't bother to include it. But you can if you want.

• So L2 regularization is the most common type of regularization.

• You might have also heard some people talk about L1 regularization. That's when, instead of the L2 norm, you add the term $$\frac{\lambda}{2m} ||w||_1$$, where $$||w||_1 = \sum_{j = 1}^{n_x} |w_j|$$.

• And this is also called the L1 norm of the parameter vector w. If you use L1 regularization, w tends to end up sparse, meaning many entries of w are zero.

• The last detail - Lambda is called the regularization parameter.

• And usually, you set this using your development set, or using cross validation.

• Regularization helps prevent overfitting, so lambda is another hyperparameter that you might have to tune. Note that, for the programming exercises, lambda is a reserved keyword in the Python programming language, so in the programming exercises we use lambd instead.

• This is how you implement L2 regularization for logistic regression.
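A minimal numpy sketch of the regularized logistic-regression cost; the data `X`, `y` and parameters `w`, `b` here are illustrative, and `lambd` stands in for the reserved word `lambda`:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_l2(w, b, X, y, lambd):
    """Cross-entropy cost plus the L2 penalty (lambd / 2m) * ||w||_2^2."""
    m = X.shape[1]                        # X has shape (n_x, m)
    y_hat = sigmoid(w.T @ X + b)          # predictions, shape (1, m)
    cross_entropy = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    l2_penalty = (lambd / (2 * m)) * np.sum(w ** 2)
    return cross_entropy + l2_penalty

# illustrative data: 2 features, 4 examples
X = np.array([[0.5, -1.0, 1.5, 0.2],
              [1.0,  0.3, -0.7, 2.0]])
y = np.array([[1, 0, 1, 0]])
w = np.array([[0.1], [-0.2]])
b = 0.0

print(cost_l2(w, b, X, y, lambd=0.7))
```

Setting `lambd=0` recovers the unregularized cost; any positive `lambd` adds a penalty that grows with the size of w.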

• In a neural network, you have a cost function that's a function of all of your parameters, $$w^{[1]}, b^{[1]}$$ through $$w^{[L]}, b^{[L]}$$, where capital L is the number of layers in your neural network.

• And so the cost function of this is given below, sum of the losses, summed over your m training examples.

• $$J( w^{[1]}, b^{[1]} , .... , w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)} , y^{(i)})$$

• So for regularization, you add lambda over 2m times the sum of the squared norms of all of your parameter matrices $$w^{[l]}$$: $$J( w^{[1]}, b^{[1]} , .... , w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)} , y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} ||w^{[l]}||^2_F$$

• Where the squared norm of a matrix is defined as $$||w^{[l]}||^2_F = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (w_{ij}^{[l]})^2$$. This is called the Frobenius norm of the matrix: you square each element of the matrix and sum all of them.

• The limits of this summation are $$n^{[l]}, n^{[l-1]}$$ because $$w^{[l]}$$ is an $$n^{[l]} \times n^{[l-1]}$$ dimensional matrix, where $$n^{[l]}$$ is the number of units in layer l and $$n^{[l-1]}$$ is the number of units in layer l-1.

• So this matrix norm is denoted with an F in the subscript.

• So for arcane linear algebra technical reasons, this is not called the L2 norm of a matrix. Instead, it's called the Frobenius norm of a matrix.
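The squared Frobenius norm is just the element-wise sum of squares, which can be confirmed against numpy's built-in norm (the matrix `W` here is illustrative):

```python
import numpy as np

W = np.array([[1.0, -2.0, 0.5],
              [0.0,  3.0, -1.0]])  # e.g. an (n_l x n_{l-1}) weight matrix

# squared Frobenius norm: sum of the squares of every entry
frob_sq_manual = float(np.sum(W ** 2))

# same quantity via numpy's built-in Frobenius norm
frob_sq_builtin = float(np.linalg.norm(W, 'fro') ** 2)

print(frob_sq_manual, frob_sq_builtin)
```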

• Now for the implementation of gradient descent. Previously, we would compute $$dW^{[l]}$$ using backprop, where backprop gives us the partial derivative of J with respect to $$w^{[l]}$$. And then you update $$w^{[l]}$$ as $$w^{[l]}$$ minus the learning rate times $$dW^{[l]}$$. This is shown below

• $$dW^{[2]} = dz^{[2]} a^{[1]^T} \\ W^{[2]} = W^{[2]} - \alpha \, dW^{[2]}$$

• Now we add the extra regularization term to the gradient. Since $$\frac{\partial}{\partial w^{[l]}} \left( \frac{\lambda}{2m} ||w^{[l]}||^2_F \right) = \frac{\lambda}{m} w^{[l]}$$, the regularized gradient is the backprop gradient plus $$\frac{\lambda}{m} w^{[l]}$$.

• $$dW^{[2]} = dz^{[2]} a^{[1]^T} + \frac{\lambda}{m} W^{[2]} \\ W^{[2]} = W^{[2]} - \alpha \left[ dz^{[2]} a^{[1]^T} + \frac{\lambda}{m} W^{[2]} \right] \\ W^{[2]} = W^{[2]} - \alpha \, dz^{[2]} a^{[1]^T} - \alpha \frac{\lambda}{m} W^{[2]} \\ W^{[2]} = (1 - \frac{\alpha \lambda}{m}) W^{[2]} - \alpha \, dz^{[2]} a^{[1]^T}$$

• L2 regularization is sometimes also called weight decay.

• This is demonstrated as given below:

• From the above equation, you are taking the matrix $$W^{[2]}$$ and multiplying it by $$(1 - \frac{\alpha \lambda}{m})$$.

• This is equivalent to taking the matrix $$W^{[2]}$$ and subtracting $$\frac{\alpha \lambda}{m}$$ times $$W^{[2]}$$. That is, you're multiplying the matrix $$W^{[2]}$$ by $$(1 - \frac{\alpha \lambda}{m})$$, which is going to be a little bit less than 1, since $$\frac{\alpha \lambda}{m}$$ is positive.

• So this is why L2 norm regularization is also called weight decay. It's like ordinary gradient descent, where you update $$W^{[2]}$$ by subtracting alpha times the original gradient you got from backprop.

• But now you're also multiplying $$W^{[2]}$$ by $$(1 - \frac{\alpha \lambda}{m})$$, which is a little bit less than 1. So the alternative name for L2 regularization is weight decay.