• Previously, you've seen the logistic regression model. You've seen the loss function that measures how well you're doing on a single training example, i.e. \( L(\hat{y}^{(i)}, y^{(i)}) \).


  • You've also seen the cost function that measures how well your parameters w and b are doing on your entire training set, \( J(w,b) \).


  • Now you can use the gradient descent algorithm to train, or to learn, the parameters w and b on your training set. To recap, here is the familiar logistic regression algorithm. \( \hat{y}^{(i)} = \sigma(w^Tx^{(i)} + b) \)


  • And we have on the second line the cost function, J, which is a function of your parameters w and b: \( J(w,b) = \frac{1}{m} \sum_{i = 1}^{m} L(\hat{y}^{(i)}, y^{(i)}) \). It's defined as the average, so 1 over m times the sum of the loss over the training examples.


  • And so the loss function measures how well your algorithm's output \( \hat{y}^{(i)} \) on each training example stacks up, or compares, to the ground-truth label \( y^{(i)} \) on that example.


  • So in order to learn the set of parameters w and b it seems natural that we want to find w and b that make the cost function J(w, b) as small as possible.
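To make the definitions above concrete, here is a minimal sketch of the sigmoid, the per-example loss L, and the cost J(w, b). The tiny dataset, the function names, and all the numbers are invented for illustration; they are not from the course.

```python
# Sketch of the loss L and cost J for logistic regression (illustrative only).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(y_hat, y):
    # Cross-entropy loss for a single example (the minus sign lives here).
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(w, b, X, Y):
    # J(w, b): the average of the loss over all m training examples.
    Y_hat = sigmoid(w @ X + b)   # one prediction per column of X
    return np.mean(loss(Y_hat, Y))

# Two features, three examples (columns), arbitrary labels.
X = np.array([[1.0, -2.0, 0.5],
              [0.3,  1.5, -1.0]])
Y = np.array([1.0, 0.0, 1.0])
print(cost(np.zeros(2), 0.0, X, Y))  # with w = 0, b = 0, every y_hat is 0.5, so J = ln 2
```

Note that the minus sign sits inside the per-example loss, which is why J itself is a plain average of losses.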


  • So here's an illustration of gradient descent. In this diagram, the horizontal axes represent your parameters, w and b.


  • In practice, w can be much higher dimensional, but for the purposes of plotting, let's illustrate w as a single real number and b as a single real number.


  • The cost function J(w,b) is, then, some surface above these horizontal axes w and b.


  • So the height of the surface represents the value of J(w,b) at a certain point. And what we want to do is really to find the value of w and b that corresponds to the minimum of the cost function J.


  • (figure: surface plot of the convex cost function J(w, b))
  • It turns out that this cost function J is a convex function, so it's just a single big bowl.


  • The fact that our cost function J(w,b) as defined here is convex (bowl-shaped, with a single global minimum) is one of the main reasons we use this particular cost function, J, for logistic regression.


  • So to find a good value for the parameters, what we'll do is initialize w and b to some initial values. For logistic regression almost any initialization method works; usually you initialize the values to zero.


  • Random initialization also works, but people don't usually do that for logistic regression. But because this function is convex, no matter where you initialize, you should get to the same point.


  • And what gradient descent does is start at that initial point and then take a step in the steepest downhill direction. After many iterations, hopefully you converge to the global optimum, or get to something close to it.


  • So this picture illustrates the gradient descent algorithm. For the purpose of illustration, let's say that there's some function J(w) of a single parameter w that you want to minimize, as shown above.


  • So gradient descent does this, we're going to repeatedly carry out the following update.


  • \( \begin{align*}& Repeat \; \lbrace \newline & \; w := w - \alpha \dfrac{dJ(w)}{dw} \newline & \rbrace\end{align*} \)
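As a hedged illustration of this Repeat loop, here is a sketch that runs the update w := w - alpha * dJ/dw on a toy one-dimensional function, J(w) = (w - 3)^2, chosen only because its derivative, 2(w - 3), is easy to write down; the starting point, alpha, and the iteration count are arbitrary.

```python
# Sketch of the gradient descent loop on a toy convex function (illustrative).
def gradient_descent_1d(w, alpha, n_iters):
    for _ in range(n_iters):
        dw = 2.0 * (w - 3.0)   # dJ/dw for J(w) = (w - 3)^2
        w = w - alpha * dw     # the update: w := w - alpha * dJ/dw
    return w

w = gradient_descent_1d(w=0.0, alpha=0.1, n_iters=100)
print(w)   # converges toward 3, the minimizer of J
```

Because J here is convex, the loop approaches the same minimum regardless of the starting w, mirroring the initialization discussion above.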


  • In this notation, alpha is the learning rate, which controls how big a step we take on each iteration of gradient descent.


  • The second quantity is the derivative. It is basically the update, or the change, you want to make to the parameter w. When we start to write code to implement gradient descent, we're going to use the convention that the variable name dw in our code represents this derivative term.


  • Let's say that w was over here. So you're at this point on the cost function J(w). Remember that the definition of a derivative is the slope of a function at the point.


  • So the slope of the function is really the height divided by the width of a little triangle (shown by the purple arrow at right) formed by the tangent to J(w) at that point. And here, the derivative is positive.


  • (figure: gradient descent steps on J(w))
  • w gets updated as w minus the learning rate times the derivative. The derivative is positive, so you end up subtracting from w and taking a step to the left.


  • And so gradient descent will make your algorithm slowly decrease the parameter if you started off with a large value of w. As another example, if w was over here (shown by the purple arrow at left in the image), then at that point the slope dJ/dw will be negative, and so the gradient descent update would subtract alpha times a negative number.


  • And so you end up slowly increasing w, making w bigger and bigger with successive iterations of gradient descent.


  • So that hopefully whether you initialize on the left or on the right gradient descent will move you towards this global minimum here.


  • The overall intuition for now is that this derivative term represents the slope of the function at the current setting of the parameters, and knowing that slope tells us which direction to step in order to go downhill on the cost function J.


  • In logistic regression, your cost function is a function of both w and b. So in that case, the inner loop of gradient descent becomes as follows.


  • \( \begin{align*}& Repeat \; \lbrace \newline & \; w := w - \alpha \dfrac{\partial J(w,b)}{\partial w} \newline & \; b := b - \alpha \dfrac{\partial J(w,b)}{\partial b} \newline & \rbrace\end{align*} \)
  • You end up updating w as w minus the learning rate times the derivative of J(w,b) with respect to w, and you update b as b minus the learning rate times the derivative of the cost function with respect to b.


  • So these two equations at the top are the actual update you implement.
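A sketch of what those two updates might look like in code, assuming the standard analytic gradients for logistic regression, dJ/dw = (1/m) X (y_hat - y) and dJ/db = (1/m) sum(y_hat - y), which this section has not derived yet; the dataset, hyperparameters, and function names are invented for illustration.

```python
# Sketch of the gradient descent inner loop for logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.5, n_iters=1000):
    n, m = X.shape
    w = np.zeros(n)   # zero initialization, as discussed above
    b = 0.0
    for _ in range(n_iters):
        Y_hat = sigmoid(w @ X + b)
        dw = X @ (Y_hat - Y) / m     # dJ/dw, named "dw" per the text's convention
        db = np.sum(Y_hat - Y) / m   # dJ/db
        w -= alpha * dw              # w := w - alpha * dJ/dw
        b -= alpha * db              # b := b - alpha * dJ/db
    return w, b

# Tiny linearly separable dataset: label is 1 when the first feature is positive.
X = np.array([[ 1.0,  2.0, -1.0, -2.0],
              [ 0.5, -0.5,  0.5, -0.5]])
Y = np.array([1.0, 1.0, 0.0, 0.0])
w, b = train(X, Y)
preds = (sigmoid(w @ X + b) > 0.5).astype(float)
print(preds)   # should match Y on this separable toy data
```

The variable names dw and db follow the coding convention introduced earlier for the two derivative terms.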


  • Note: If J is a function of two or more variables, you use the partial derivative symbol \( \partial \); if J is a function of only one variable, you use lowercase d for the derivative.