• If you suspect your neural network is overfitting your data, that is, you have a high variance problem, one of the first things you should try is probably regularization.


  • The other way to address high variance is to get more training data, which is also quite reliable.


  • But you can't always get more training data, or it could be expensive to get more data.


  • But adding regularization will often help to prevent overfitting, or to reduce the errors in your network. So let's see how regularization works.






  • Recall that for logistic regression, you try to minimize the cost function J, which is defined below.


  • \( J(w,b) = \frac{1}{m} \sum_{i = 1}^{m} L(\hat{y}^{(i)}, y^{(i)}) = − \frac{1}{m} \sum_{i = 1}^{m} [\, y^{(i)}\log(\hat{y}^{(i)}) + (1−y^{(i)})\log(1−\hat{y}^{(i)}) \,] \)


  • This is the sum, over your m training examples, of the losses on the individual predictions, where you recall that w and b are the parameters of the logistic regression.


  • So w is an nx-dimensional parameter vector (nx is the number of features), and b is a real number. To add regularization to logistic regression, you add a penalty term scaled by lambda, which is called the regularization parameter.


  • Usually L2 regularization is applied to regularize the model. The L2 regularization term is defined as \( \frac{\lambda}{2m} ||w||^2_2 \), so the cost becomes \( J(w,b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} ||w||^2_2 \).


  • This is read as lambda/2m times the squared norm of w. Here, the squared norm of w is just w transpose w ( \( ||w||^2_2 = \sum_{j = 1}^{n_x} w_j^2 = w^T w \) ); it's the squared Euclidean norm of the vector w.
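
  • Below is a minimal numpy sketch of this norm and the resulting penalty; the sizes, the variable name lambd, and the random w are just illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: ||w||_2^2 = w^T w = sum_j w_j^2, and the L2 penalty lambda/(2m) * ||w||_2^2.
n_x, m = 5, 100              # number of features, number of training examples (arbitrary)
lambd = 0.7                  # regularization parameter (lambda is a reserved word in Python)
w = np.random.randn(n_x, 1)  # column parameter vector

norm_sq_dot = (w.T @ w).item()       # w^T w
norm_sq_sum = np.sum(np.square(w))   # sum_j w_j^2
assert np.isclose(norm_sq_dot, norm_sq_sum)

l2_penalty = (lambd / (2 * m)) * norm_sq_sum  # term added to the cost J
```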


  • Now, why do you regularize just the parameter w? Why don't we also add a term for b, such as \( \frac{\lambda}{2m} b^2 \)?


  • In practice, you could do this, but we usually just omit this.


  • Because if you look at your parameters, w is usually a pretty high dimensional parameter vector, especially with a high variance problem.


  • Maybe w just has a lot of parameters, so you aren't fitting all the parameters well, whereas b is just a single number.


  • So almost all the parameters are in w rather than in b. And if you add this b term, in practice, it won't make much of a difference, because b is just one parameter out of a very large number of parameters.


  • In practice, we usually just don't bother to include it. But you can if you want.


  • So L2 regularization is the most common type of regularization.


  • You might have also heard some people talk about L1 regularization. That's when, instead of the L2 norm, you add a term proportional to the L1 norm of w, i.e. \( \frac{\lambda}{2m} ||w||_1 = \frac{\lambda}{2m} \sum_{j = 1}^{n_x} |w_j| \).


  • The quantity \( ||w||_1 \) is called the L1 norm of the parameter vector w.
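
  • For contrast, here is a short sketch of the two penalty terms side by side (same illustrative m, lambd, and w as above):

```python
import numpy as np

m, lambd = 100, 0.7          # illustrative values
w = np.random.randn(5, 1)

l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))  # lambda/(2m) * ||w||_2^2
l1_penalty = (lambd / (2 * m)) * np.sum(np.abs(w))     # lambda/(2m) * ||w||_1
```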


  • The last detail: lambda is called the regularization parameter.


  • And usually, you set this using your development set, or using cross validation.


  • Regularization helps prevent overfitting, so lambda is another hyperparameter that you might have to tune. Note that lambda is a reserved keyword in the Python programming language, so in the programming exercises we'll use the variable name lambd instead (see the sketch below).
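
  • A small illustration of the naming issue, plus a hypothetical tuning loop; train_model and dev_error are placeholder names used only for this sketch, not real library functions.

```python
# `lambda` is a reserved keyword in Python, so it cannot be used as a variable name:
#     lambda = 0.7    # SyntaxError
# The programming exercises therefore use `lambd` instead:
lambd = 0.7

# Hypothetical dev-set tuning sketch: try several values and keep the one with the
# lowest dev-set error (train_model and dev_error are placeholders).
candidate_lambdas = [0.0, 0.01, 0.1, 0.3, 1.0, 3.0]
# best_lambd = min(candidate_lambdas, key=lambda l: dev_error(train_model(l)))
```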


  • This is how you implement L2 regularization for logistic regression.
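
  • As a concrete illustration, here is a minimal numpy sketch of the L2-regularized logistic regression cost; the helper names sigmoid and regularized_cost are just for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(w, b, X, Y, lambd):
    """X: (n_x, m) features, Y: (1, m) labels, w: (n_x, 1) weights, b: scalar."""
    m = X.shape[1]
    Y_hat = sigmoid(w.T @ X + b)                                   # predictions, shape (1, m)
    cross_entropy = -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))          # lambda/(2m) * ||w||_2^2
    return cross_entropy + l2_penalty
```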






  • In a neural network, you have a cost function that's a function of all of your parameters, \( w^{[1]}, b^{[1]} \) through \( w^{[L]}, b^{[L]} \), where capital L is the number of layers in your neural network.


  • And so the cost function is given below: one over m times the sum of the losses over your m training examples.


  • \( J( w^{[1]}, b^{[1]}, \dots, w^{[L]}, b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)} , y^{(i)}) \)


  • So for regularization, you add lambda over 2m times the sum of the squared norms of all of your weight matrices: \( \frac{\lambda}{2m} \sum_{l=1}^{L} ||w^{[l]}||^2_F \).


  • Here the squared norm of a matrix is defined as \( ||w^{[l]}||^2_F = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (w_{ij}^{[l]})^2 \). This is called the Frobenius norm of the matrix: you square each element of the matrix and sum all of them.
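
  • A quick numpy sketch of this definition (the matrix shape is an arbitrary example):

```python
import numpy as np

W = np.random.randn(4, 3)   # e.g. n^[l] = 4 units, n^[l-1] = 3 units

frob_sq_manual = np.sum(np.square(W))            # sum of squared entries
frob_sq_numpy = np.linalg.norm(W, 'fro') ** 2    # same quantity via numpy
assert np.isclose(frob_sq_manual, frob_sq_numpy)
```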


  • The limits of this summation are \( n^{[l]} \) and \( n^{[l-1]} \) because \( w^{[l]} \) is an \( n^{[l]} \times n^{[l-1]} \)-dimensional matrix, where \( n^{[l]} \) and \( n^{[l-1]} \) are the numbers of units in layer l and layer l-1.


  • So this matrix norm is denoted with an F in the subscript.


  • So for arcane linear algebra technical reasons, this is not called the L2 norm of a matrix. Instead, it's called the Frobenius norm of a matrix.
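
  • Putting the pieces together, here is a sketch of the regularization term summed over layers, assuming the parameters live in a dict keyed "W1", "b1", ..., "WL", "bL" as in the programming exercises:

```python
import numpy as np

def l2_regularization_term(parameters, lambd, m):
    """parameters is assumed to be a dict keyed 'W1', 'b1', ..., 'WL', 'bL'."""
    L = len(parameters) // 2   # number of layers
    frob_sq_sum = sum(np.sum(np.square(parameters["W" + str(l)]))
                      for l in range(1, L + 1))
    return (lambd / (2 * m)) * frob_sq_sum   # lambda/(2m) * sum_l ||W^[l]||_F^2
```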


  • For the implementation of gradient descent: previously, we would compute dW using backprop, where backprop gives us the partial derivative of J with respect to w. Then you update \( w^{[l]} \) as \( w^{[l]} \) minus the learning rate times \( dW^{[l]} \). This is shown below for layer 2:


  • \( dW^{[2]} = dz^{[2]} a^{[1]T} \\ W^{[2]} = W^{[2]} - \alpha \, dW^{[2]} \)


  • Now we add the extra regularization term to the objective. Its derivative contributes an extra \( \frac{\lambda}{m} W \) to the gradient, since \( \frac{d}{dW} \left( \frac{\lambda}{2m} ||W||^2_F \right) = \frac{\lambda}{m} W \).


  • \( dW^{[2]} = dz^{[2]} a^{[1]T} \\ W^{[2]} = W^{[2]} - \alpha \left[ dW^{[2]} + \frac{\lambda}{m} W^{[2]} \right] \\ W^{[2]} = W^{[2]} - \alpha \, dW^{[2]} - \frac{\alpha \lambda}{m} W^{[2]} \\ W^{[2]} = \left( 1 - \frac{\alpha \lambda}{m} \right) W^{[2]} - \alpha \, dW^{[2]} \)
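
  • The same derivation written as a short numpy sketch; the shapes and the name dW_backprop (the gradient from backprop before regularization) are illustrative assumptions:

```python
import numpy as np

alpha, lambd, m = 0.01, 0.7, 100
W = np.random.randn(4, 3)
dW_backprop = np.random.randn(4, 3)   # stands in for the gradient from backprop

# Update with the extra (lambda/m) * W term added to the gradient:
dW = dW_backprop + (lambd / m) * W
W_updated = W - alpha * dW

# Equivalent "weight decay" form: shrink W by (1 - alpha*lambda/m), then take the
# ordinary gradient step.
W_decayed = (1 - alpha * lambd / m) * W - alpha * dW_backprop
assert np.allclose(W_updated, W_decayed)
```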


  • L2 regularization is sometimes also called weight decay.


  • This is demonstrated below:


  • From the above equation, you are taking the matrix \( W^{[2]} \) and multiplying it by \( (1 - \frac{\alpha \lambda}{m}) \).


  • This is equivalent to taking the matrix \( W^{[2]} \) and subtracting \( \frac{\alpha \lambda}{m} \) times \( W^{[2]} \). That is, you're multiplying the matrix \( W^{[2]} \) by \( 1 - \frac{\alpha \lambda}{m} \), which is going to be a little bit less than 1, since \( \frac{\alpha \lambda}{m} \) is positive.


  • So this is why L2 regularization is also called weight decay: it's like ordinary gradient descent, where you update \( W^{[2]} \) by subtracting alpha times the original gradient you got from backprop.


  • But now you're also multiplying \( W^{[2]} \) by \( (1 - \frac{\alpha \lambda}{m}) \), which is a little bit less than 1. So the alternative name for L2 regularization is weight decay.

