• This chapter may give you better intuition about how the optimization problem of the support vector machine leads to large margin classifiers.

• Let's say we have two vectors u and v that look like below, so both are two-dimensional vectors.

• Then what does $$u^Tv$$ look like? $$u^Tv$$ is also called the inner product between the vectors u and v.

• Plot u on a graph, i.e. $$u_1$$ vs. $$u_2$$, as shown below.

• One quantity we will need is the norm of a vector, written $$||u||$$. This is the Euclidean length of the vector u, so $$||u|| = \sqrt{u_1^2 + u_2^2}$$, a real number: the length of the arrow above (this follows from Pythagoras). For the inner product, first plot v on the same axes in the same way ($$v_1$$ vs. $$v_2$$), then take v and orthogonally project it down onto u.

• Measure the length/magnitude of the projection

• So here, the green line is the projection: p is the length along u to the intersection, i.e. the magnitude of the projection of vector v onto vector u. It is possible to show that $$u^Tv = p \cdot ||u||$$.

• So this is one way to compute the inner product: $$u^Tv = u_1v_1 + u_2v_2$$. Therefore $$p \cdot ||u|| = u_1v_1 + u_2v_2$$.
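As a quick numerical check, here is a minimal NumPy sketch (u and v are arbitrary made-up vectors) showing that the component-wise sum and the projection form give the same inner product:

```python
import numpy as np

u = np.array([2.0, 1.0])   # arbitrary example vectors
v = np.array([1.0, 3.0])

inner = u @ v                      # u1*v1 + u2*v2 = 2*1 + 1*3 = 5
norm_u = np.linalg.norm(u)         # ||u|| = sqrt(u1^2 + u2^2)
p = inner / norm_u                 # signed length of v's projection onto u

print(inner)          # 5.0
print(p * norm_u)     # 5.0, matching the identity u^T v = p * ||u||
```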

• This is an important rule in linear algebra. We can reverse the order too:

• So we could do $$v^Tu = v_1u_1 + v_2u_2$$, which would obviously give you the same number. Note that p can be negative, namely when the angle between the vectors is more than 90 degrees.

• So here p is negative. We will use this vector inner product theory to try and understand SVMs a little better.
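A tiny sketch of the sign behaviour (again with made-up vectors): when the angle between u and v exceeds 90 degrees, the projection p comes out negative:

```python
import numpy as np

u = np.array([1.0, 0.0])
v = np.array([-2.0, 1.0])   # angle between u and v is more than 90 degrees

p = (u @ v) / np.linalg.norm(u)   # signed projection of v onto u
print(p)   # -2.0: negative, because v points "away" from u
```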

• For the following explanation we make two simplifications: set $$\theta_0 = 0$$ (i.e. ignore the intercept term), and set $$n = 2$$, i.e. each example has only two features $$(x_1, x_2)$$. Given we only have two parameters, we can simplify our function, given below:

• To obtain the new simplified optimization objective as follows: $$\frac{1}{2} (\theta_1^2 + \theta_2^2)$$, which can be re-written as $$\frac{1}{2} \left(\sqrt{\theta_1^2 + \theta_2^2}\right)^2$$. Both give the same answer.

• Notice that $$\sqrt{\theta_1^2 + \theta_2^2} = ||\theta||$$, the norm of θ, treating θ as a 2x1 vector. (This still holds if we include $$\theta_0$$, because we assumed $$\theta_0 = 0$$.) So, finally, this means our optimization objective can be re-written as $$\frac{1}{2} ||\theta||^2$$
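A one-off numerical check with an arbitrary θ (assuming $$\theta_0 = 0$$) that the component form and the squared-norm form of the objective agree:

```python
import numpy as np

theta = np.array([3.0, 4.0])   # arbitrary parameter vector, theta_0 = 0 assumed

obj_components = 0.5 * (theta[0]**2 + theta[1]**2)   # (1/2)(theta1^2 + theta2^2)
obj_norm = 0.5 * np.linalg.norm(theta)**2            # (1/2)||theta||^2

print(obj_components)   # 12.5
print(obj_norm)         # 12.5, the same value
```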

• So the SVM is minimizing the squared norm. Given this, what is the $$\theta^Tx$$ term doing? Given θ and a training example x, what is this equal to? We can look at this in the same way we just looked at u and v. Say we have a single positive training example (the red cross below), $$x^{(i)}$$, with features $$x_1^{(i)}, x_2^{(i)}$$.

• Although we haven't been thinking about examples as vectors, each example can be described as such.

• Now, say we have our parameter vector θ and we plot that on the same axis

• The next question is: what is the inner product of these two vectors?

• It is $$p^{(i)}$$, the length of the projection for example i. Given our previous discussion we know $$\theta^Tx^{(i)} = p^{(i)} \cdot ||\theta|| = \theta_1x_1^{(i)} + \theta_2x_2^{(i)}$$

• So these are both equally valid ways of computing $$\theta^Tx^{(i)}$$. What does this mean? The constraints we defined earlier were $$\theta^Tx^{(i)} \geq 1 \text{ if } y^{(i)} = 1$$ and $$\theta^Tx^{(i)} \leq -1 \text{ if } y^{(i)} = 0$$

• Can be replaced/substituted with the constraints $$p^{(i)} \cdot ||\theta|| \geq 1 \text{ if } y^{(i)} = 1$$ and $$p^{(i)} \cdot ||\theta|| \leq -1 \text{ if } y^{(i)} = 0$$
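A hedged sketch with made-up values for θ and a positive example x, showing that the projection form of the constraint evaluates to the same number as $$\theta^Tx$$:

```python
import numpy as np

theta = np.array([1.0, 2.0])   # made-up parameter vector (theta_0 = 0)
x = np.array([3.0, 1.0])       # made-up positive example (y = 1)

p = (theta @ x) / np.linalg.norm(theta)   # projection of x onto theta

lhs_direct = theta @ x                    # theta^T x
lhs_proj = p * np.linalg.norm(theta)      # p * ||theta||

print(lhs_direct)        # 5.0
print(lhs_proj)          # 5.0, the same number
print(lhs_direct >= 1)   # True, so the y = 1 constraint is satisfied
```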

• Writing that into our optimization objective

• So, given we've redefined these functions, let us now consider the training examples below.

• Given this data, what boundary will the SVM choose? Note that we are still assuming $$\theta_0 = 0$$, which means the boundary has to pass through the origin (0, 0). The green line below, a boundary with very small margins, is one candidate.

• The SVM would not choose this line, because the decision boundary comes very close to the training examples. Let's discuss why the SVM would not choose this decision boundary, looking at this line.

• We can show that θ is at 90 degrees to the decision boundary; in fact, θ is always orthogonal to the decision boundary.
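A quick check of the orthogonality claim (θ is a made-up vector; with $$\theta_0 = 0$$ the boundary is the set of x with $$\theta^Tx = 0$$, so rotating θ by 90 degrees gives a direction along the boundary):

```python
import numpy as np

theta = np.array([1.0, 2.0])   # made-up parameter vector

# With theta_0 = 0, the decision boundary is {x : theta^T x = 0}.
# Rotating theta by 90 degrees gives a direction vector along that line.
boundary_dir = np.array([-theta[1], theta[0]])

print(theta @ boundary_dir)   # 0.0: theta is orthogonal to the boundary
```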

• So now let's look at what this implies for the optimization objective. Look at the first example, $$x^{(1)}$$.

• Project a line from $$x^{(1)}$$ onto the θ vector (so it hits at 90 degrees).

• The distance between the intersection and the origin is $$p^{(1)}$$. Similarly, look at the second example, $$x^{(2)}$$, and project a line from $$x^{(2)}$$ onto the θ vector. This is the magenta line, and its projection $$p^{(2)}$$ will be negative.

• If we overlay these two lines below we see a graphical representation of what's going on:

• We find that both these p values are going to be pretty small. If we look back at our optimization objective, we know we need $$p^{(1)} \cdot ||\theta|| \geq 1$$ for positive examples.

• If p is small, then $$||\theta||$$ must be pretty large. Similarly, for negative examples we need $$p^{(2)} \cdot ||\theta|| \leq -1$$.

• We saw in this example that $$p^{(2)}$$ is a small negative number, so $$||\theta||$$ must again be a large number. Why is this a problem? The optimization objective is trying to find a set of parameters where the norm of θ is small.

• So this doesn't seem like a good direction for the parameter vector: as the p values get smaller, $$||\theta||$$ must get larger to compensate. Instead we should make the p values larger, which allows $$||\theta||$$ to become smaller. So let's choose a different boundary.

• Now if you look at the projection of the examples onto θ, we find that $$p^{(1)}$$ becomes large, so $$||\theta||$$ can become small.

• So with some values drawn in

• This means that by choosing this second decision boundary we can make $$||\theta||$$ smaller, which is why the SVM chooses this hypothesis as better.
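The trade-off can be sketched numerically (the example point and candidate boundary directions are made up): along a fixed direction for θ, the constraint $$p \cdot ||\theta|| \geq 1$$ means the smallest admissible norm is 1/p, so a direction that gives a larger projection permits a smaller $$||\theta||$$:

```python
import numpy as np

x_pos = np.array([1.0, 1.0])   # made-up positive example

def min_norm_needed(direction, x):
    """Smallest ||theta|| along this direction satisfying p * ||theta|| >= 1."""
    d = direction / np.linalg.norm(direction)
    p = d @ x                  # projection of x onto the unit direction
    return 1.0 / p             # need ||theta|| >= 1 / p

# theta pointing along x gives a large projection p, so a small norm suffices
good = min_norm_needed(np.array([1.0, 1.0]), x_pos)
# theta nearly orthogonal to x gives a tiny p, forcing a much larger norm
bad = min_norm_needed(np.array([1.0, -0.9]), x_pos)

print(good < bad)   # True: larger projections permit a smaller ||theta||
```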

• This is how we generate the large margin effect. The magnitude of this margin is a function of the p values, so by maximizing these p values we minimize $$||\theta||$$.

• Finally, we did this derivation assuming $$\theta_0 = 0$$. In this case we are entertaining only decision boundaries which pass through the origin (0, 0).

• If you allow $$\theta_0$$ to take other values, then this simply means you can have decision boundaries which cross the axes at points other than (0, 0).

• One can show, with basically the same logic, that this works even when $$\theta_0$$ is non-zero: with the optimization objective described above (when C is very large), the SVM is still looking for a large margin separator between the classes.