**This is a learning algorithm** that you use when the output labels \( y \) in a supervised learning problem are all either zero or one, i.e. binary classification problems.

**Given an input feature vector \( x \)**, maybe corresponding to an image that you want to recognize as either a cat picture or not a cat picture, you want an algorithm that can output a prediction, \( \hat{y} \), which is an estimate of \( y \).

**More formally**, you want \( \hat{y} \) to be the probability that \( y \) is equal to one given the input features \( x \). In other words, if \( x \) is a picture, you want \( \hat{y} \) to tell you: what is the chance that this is a cat picture?

**Given \( x \)**, \( \hat{y} = P(y = 1 | x) \), where \( 0 \leq \hat{y} \leq 1 \).

**The parameters used in logistic regression are:**

- **The input feature vector:** \( x \in \mathbb{R}^{n_x} \), where \( n_x \) is the number of features.
- **The training label:** \( y \in \{0, 1\} \)
- **The weights:** \( w \in \mathbb{R}^{n_x} \), where \( n_x \) is the number of features.
- **The bias (threshold):** \( b \in \mathbb{R} \)
- **The output:** \( \hat{y} = \sigma(w^Tx + b) \)
- **Sigmoid function:** \( s = \sigma(w^Tx + b) = \sigma(z) = \frac{1}{1 + e^{-z}} \)

**\( (w^Tx + b) \) is a linear function** (like \( ax + b \)), but since we are looking for a probability constrained to [0, 1], the sigmoid function is applied to it; the sigmoid is bounded between [0, 1].

**Some observations about the sigmoid:**

- **If \( z \)** is a large positive number, then \( \sigma(z) \approx 1 \)
- **If \( z \)** is a large negative number, then \( \sigma(z) \approx 0 \)
- **If \( z = 0 \)**, then \( \sigma(z) = 0.5 \)

**So \( x \) is an \( n_x \)-dimensional vector**, and the parameters of logistic regression are \( w \), which is also an \( n_x \)-dimensional vector, together with \( b \), which is just a real number.

**So given an input \( x \) and the parameters \( w \) and \( b \)**, how do we generate the output \( \hat{y} \)? One thing you could try, that doesn't work, would be to have \( \hat{y} = w^Tx + b \), a linear function of the input \( x \).

**But this isn't a very good algorithm** for binary classification, because you want \( \hat{y} \) to be the chance that \( y \) is equal to one, so \( \hat{y} \) should really be between zero and one. That is difficult to enforce, because \( w^Tx + b \) can be much bigger than one, or it can even be negative, which doesn't make sense for a probability.

**So in logistic regression** our output is instead going to be \( \hat{y} \) equal to the sigmoid function applied to this quantity. **The sigmoid goes smoothly from zero up to one.** When you implement logistic regression, your job is to try to learn parameters \( w \) and \( b \) so that \( \hat{y} \) becomes a good estimate of the chance of \( y \) being equal to one.
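As a concrete illustration, here is a minimal NumPy sketch of the forward computation \( \hat{y} = \sigma(w^Tx + b) \). The feature values, parameter values, and function names are made-up placeholders for illustration, not values from the course:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b):
    """Logistic regression forward pass: y_hat = sigma(w^T x + b)."""
    z = np.dot(w, x) + b
    return sigmoid(z)

# Made-up example: n_x = 3 features, arbitrary parameter values.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.2])
b = 0.1

print(predict(x, w, b))  # a probability in (0, 1)

# Sigmoid behavior at a few values of z, matching the observations above:
print(sigmoid(10.0))   # ~1 for large positive z
print(sigmoid(-10.0))  # ~0 for large negative z
print(sigmoid(0.0))    # exactly 0.5 at z = 0
```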

**To train the parameters \( w \) and \( b \)**, we need to define a cost function.

**Recap:** \( \hat{y}^{(i)} = \sigma(w^Tx^{(i)} + b) \), where \( \sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}} \) and \( x^{(i)} \) is the i-th training example.

**Given** \( \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\} \), we want \( \hat{y}^{(i)} \approx y^{(i)} \).

**Loss (error) function:** The loss function measures the discrepancy between the prediction \( \hat{y}^{(i)} \) and the desired output \( y^{(i)} \). In other words, the loss function computes the error for a single training example. One natural choice would be the squared error:

\( L(\hat{y}^{(i)}, y^{(i)}) = \frac{1}{2} (\hat{y}^{(i)} - y^{(i)})^2 \)

However, with the sigmoid output this makes the optimization problem non-convex, so gradient descent may get stuck in local optima. Logistic regression therefore uses the following loss instead:

\( L(\hat{y}^{(i)}, y^{(i)}) = -\left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right] \)

**If** \( y^{(i)} = 1 \): \( L(\hat{y}^{(i)}, y^{(i)}) = -\log(\hat{y}^{(i)}) \), so minimizing the loss means making \( \log(\hat{y}^{(i)}) \) large, i.e. \( \hat{y}^{(i)} \) should be close to 1.

**If** \( y^{(i)} = 0 \): \( L(\hat{y}^{(i)}, y^{(i)}) = -\log(1 - \hat{y}^{(i)}) \), so minimizing the loss means making \( \log(1 - \hat{y}^{(i)}) \) large, i.e. \( \hat{y}^{(i)} \) should be close to 0.
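A small sketch of this per-example loss (the function name `loss` and the sample prediction values are my own, chosen for illustration):

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for one example: -[y*log(y_hat) + (1-y)*log(1-y_hat)]."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# y = 1: loss is small when y_hat is close to 1, large when it is close to 0.
print(loss(0.99, 1))  # ~0.01
print(loss(0.01, 1))  # ~4.6

# y = 0: loss is small when y_hat is close to 0, large when it is close to 1.
print(loss(0.01, 0))  # ~0.01
print(loss(0.99, 0))  # ~4.6
```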

**Cost function**

**The cost function** is the average of the loss function over the entire training set. We are going to find the parameters \( w, b \) that minimize the overall cost function:

\( J(w,b) = \frac{1}{m} \sum_{i = 1}^{m} L(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m} \sum_{i = 1}^{m} \left[ y^{(i)}\log(\hat{y}^{(i)}) + (1 - y^{(i)})\log(1 - \hat{y}^{(i)}) \right] \)
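Putting the pieces together, here is a minimal vectorized sketch of \( J(w, b) \) over a whole training set. The matrix layout (one training example per column of `X`) and the toy data are assumptions made for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    """Average cross-entropy cost J(w, b) over m training examples.

    X: (n_x, m) matrix, one training example per column (an assumed layout).
    Y: (1, m) row vector of 0/1 labels.
    """
    m = X.shape[1]
    Y_hat = sigmoid(np.dot(w.T, X) + b)  # (1, m) predictions
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    return np.sum(losses) / m

# Toy data: n_x = 2 features, m = 4 examples (made-up numbers).
X = np.array([[0.5, -1.0,  2.0, 0.0],
              [1.5,  0.5, -0.5, 1.0]])
Y = np.array([[1, 0, 1, 0]])
w = np.zeros((2, 1))
b = 0.0

# With w = 0 and b = 0, every prediction is 0.5, so J = -log(0.5) ~ 0.693.
print(cost(w, b, X, Y))
```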