• With supervised learning, the performance of many learning algorithms will be pretty similar. What often matters less is whether you use learning algorithm A or learning algorithm B; what often matters more is the amount of data you train these algorithms on, as well as your skill in applying them.

• That skill includes things like your choice of the features you design to give to the learning algorithms, how you choose the regularization parameter, and so on.

• But, there's one more algorithm that is very powerful and is very widely used both within industry and academia, and that's called the support vector machine.

• Compared to both logistic regression and neural networks, the support vector machine (SVM) sometimes gives a cleaner, and sometimes more powerful, way of learning complex non-linear functions.

• In order to describe the support vector machine, we are going to start with logistic regression, and show how we can modify it a bit, and get what is essentially the support vector machine.

• In logistic regression, we have our familiar form of the hypothesis, $$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$, which is the sigmoid activation function applied to $$z = \theta^T x$$.

• In logistic regression, if we have an example with y = 1, we want $$h_\theta(x) \approx 1$$, which means $$\theta^T x \gg 0$$. Conversely, if we have an example with y = 0, we want $$h_\theta(x) \approx 0$$, which means $$\theta^T x \ll 0$$.
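
• As a small illustration of this behaviour, here is a minimal sketch of the sigmoid and the hypothesis in plain Python. The function names `sigmoid` and `hypothesis`, and the use of Python lists for θ and x, are choices made for this example only.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), with theta and x given as plain lists."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# The output approaches 1 for large positive z and 0 for large negative z.
for z in (-10, -1, 0, 1, 10):
    print(z, round(sigmoid(z), 4))
```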

• If you look at the cost function of logistic regression, you'll find that each example (x, y) contributes the following term to the overall cost function: $$-\left(y \log h_\theta(x) + (1 - y) \log(1 - h_\theta(x))\right)$$

• For the overall cost function, we sum this term over all the training examples and multiply by 1/m. If you then plug in the definition of the hypothesis $$h_\theta(x)$$, you get the expanded per-example term: $$-y \log \frac{1}{1 + e^{-\theta^T x}} - (1 - y) \log\left(1 - \frac{1}{1 + e^{-\theta^T x}}\right)$$

• So each training example contributes that term to the cost function for logistic regression. If y = 1, then only the first term in the objective matters, namely $$-\log \frac{1}{1 + e^{-z}}$$ with $$z = \theta^T x$$. If we plot this function against z, we get a curve that falls smoothly towards 0 as z increases.

• This plot shows the cost contribution of an example when y = 1, as a function of z. If z is big, the cost is low, but if z is 0 or negative the cost contribution is high. This is why, when logistic regression sees a positive example, it tries to make $$\theta^T x$$ very large.

• If y = 0, then only the second term matters. We can again plot it and get a similar, mirrored graph. Here, if z is very negative the cost is low, but if z is large the cost is massive.
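
• The two curves described above can be evaluated directly. The sketch below uses the algebraically simplified forms of the two logistic regression terms; the function names are hypothetical and chosen for this example.

```python
import math

def logistic_cost_y1(z):
    """Cost of a positive example: -log(1 / (1 + e^(-z))) = log(1 + e^(-z))."""
    return math.log(1.0 + math.exp(-z))

def logistic_cost_y0(z):
    """Cost of a negative example: -log(1 - 1 / (1 + e^(-z))) = log(1 + e^(z))."""
    return math.log(1.0 + math.exp(z))

# For y = 1 the cost shrinks as z grows; for y = 0 it shrinks as z becomes more negative.
for z in (-5, 0, 5):
    print(z, round(logistic_cost_y1(z), 4), round(logistic_cost_y0(z), 4))
```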

• To build an SVM we must redefine our cost functions. When y = 1, take the y = 1 function and create a new cost function: instead of a curved line, create two straight lines (magenta) which act as an approximation to the logistic regression y = 1 function.

• Take the point z = 1 on the z axis. The new cost is flat at 0 from z = 1 onwards and grows along a straight line as z falls below 1. So we have two straight line segments: a flat one where the cost is 0, and a straight, growing one to the left of 1. This is the new y = 1 cost function, which we call cost1(z). It gives the SVM a computational advantage and an easier optimization problem.

• Similarly, when y = 0, do the equivalent with the y = 0 function plot. We call this function cost0(z). So here we define the two cost function terms for our SVM graphically.
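
• One possible way to write these two piecewise-linear costs in code is sketched below. The notes do not fix the slope of the growing segment, nor state where cost0 bends, so this sketch assumes a slope of 1 and a mirrored bend at z = -1 (the usual hinge-loss form).

```python
def cost1(z):
    """Cost for y = 1: zero for z >= 1, growing linearly as z drops below 1.
    Assumption: the growing segment has slope -1."""
    return max(0.0, 1.0 - z)

def cost0(z):
    """Cost for y = 0: zero for z <= -1, growing linearly as z rises above -1.
    Assumption: mirrored version of cost1 with its bend at z = -1."""
    return max(0.0, 1.0 + z)

for z in (-2, -1, 0, 1, 2):
    print(z, cost1(z), cost0(z))
```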

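• The complete SVM cost function: as a comparison/reminder, the regularized logistic regression problem is $$\min_\theta \frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \left(-\log h_\theta(x^{(i)})\right) + (1 - y^{(i)}) \left(-\log(1 - h_\theta(x^{(i)}))\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$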

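• If this looks unfamiliar, it's because we previously had the minus sign outside the expression. For the SVM we take the two logistic regression terms for y = 1 and y = 0 described previously and replace them with $$\text{cost}_1(\theta^T x)$$ and $$\text{cost}_0(\theta^T x)$$. So we get: $$\min_\theta \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \text{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \text{cost}_0(\theta^T x^{(i)}) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$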

• In keeping with SVM convention, we change a few things here:

• Get rid of the 1/m terms. This is just a slightly different convention. Because 1/m is a positive constant, removing it gives the same optimal values for θ. For example, say you have a minimization problem such as $$(x-5)^2 + 1$$, which is minimized at x = 5. If you multiply the cost function by a constant, i.e. $$10 [(x-5)^2 + 1] = 10(x-5)^2 + 10$$, the minimum is still at x = 5, even though the minimum value itself changes.
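
• The scaling argument can be checked numerically. The sketch below does a brute-force search over a grid of x values; the grid range and step are arbitrary choices for this example.

```python
def f(x):
    """The example objective (x - 5)^2 + 1."""
    return (x - 5) ** 2 + 1

def scaled_f(x):
    """The same objective multiplied by the positive constant 10."""
    return 10 * f(x)

# Candidate values 0.00, 0.01, ..., 10.00
grid = [i / 100 for i in range(0, 1001)]

argmin_f = min(grid, key=f)
argmin_scaled = min(grid, key=scaled_f)

print(argmin_f, argmin_scaled)               # location of the minimum
print(f(argmin_f), scaled_f(argmin_scaled))  # value at the minimum
```

• Both searches land on x = 5, while the minimum value itself goes from 1 to 10.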

• For logistic regression we had two terms: a training data term (the one we sum over m), call it A, and a regularization term (the one we sum over n), call it B. So we could describe the objective as A + λB. For SVMs the convention is instead to use a different parameter called C and minimize CA + B. If C is equal to 1/λ, the two objectives (CA + B and A + λB) differ only by a constant factor, so they give the same optimal θ.
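
• To spell out the relationship between the two parameterizations, assuming λ > 0 and substituting C = 1/λ:

$$C A + B = \frac{1}{\lambda} A + B = \frac{1}{\lambda} \left( A + \lambda B \right)$$

• Since 1/λ is a positive constant, this is the same situation as dropping the 1/m factor: the value of the objective changes, but the θ that minimizes it does not.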

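• So, our overall optimization problem is: $$\min_\theta C \sum_{i=1}^{m} \left[ y^{(i)} \text{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \text{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$$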
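
• Putting the conventions above together (no 1/m, C on the data term, half the squared parameters as regularization), the objective can be sketched in plain Python as follows. This is only an illustration: it assumes hinge-style cost1 and cost0 with slope 1, that each x starts with a constant feature 1, and that theta[0] is not regularized. The toy data is made up.

```python
def cost1(z):
    # Assumed form: zero for z >= 1, linear growth below 1.
    return max(0.0, 1.0 - z)

def cost0(z):
    # Assumed form: zero for z <= -1, linear growth above -1.
    return max(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    """C * sum_i [y_i cost1(z_i) + (1 - y_i) cost0(z_i)] + 1/2 * sum_{j>=1} theta_j^2,
    where z_i = theta^T x_i."""
    data_term = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t * v for t, v in zip(theta, x_i))
        data_term += y_i * cost1(z) + (1 - y_i) * cost0(z)
    # Assumption: theta[0] is the intercept and is left out of the regularization.
    reg_term = 0.5 * sum(t * t for t in theta[1:])
    return C * data_term + reg_term

# Made-up toy data: first feature is the constant 1.
X = [[1.0, 2.0], [1.0, -2.0]]
y = [1, 0]

print(svm_objective([0.0, 1.0], X, y, C=1.0))
print(svm_objective([0.0, 0.25], X, y, C=1.0))
```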

• Unlike logistic regression, $$h_\theta(x)$$ doesn't give us a probability; instead we get a direct prediction of 1 or 0. If $$\theta^T x \geq 0$$, then $$h_\theta(x) = 1$$; otherwise $$h_\theta(x) = 0$$.
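
• This decision rule is short enough to write out directly. In the sketch below, the function name `svm_predict` and the list representation of θ and x are choices made for this example.

```python
def svm_predict(theta, x):
    """Return 1 if theta^T x >= 0, otherwise 0."""
    z = sum(t * v for t, v in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [0.0, 1.0]  # made-up parameters for illustration
print(svm_predict(theta, [1.0, 2.0]))
print(svm_predict(theta, [1.0, -2.0]))
```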