• This chapter may give you better intuition about how the optimization problem of the support vector machine leads to large margin classifiers.


  • Let's say we have two vectors, u and v, both two-dimensional.




  • What does u transpose v, \( u^T v \), look like? \( u^T v \) is also called the inner product of the vectors u and v.


  • Plot u on a graph, i.e. as the point \( (u_1, u_2) \), drawn as an arrow from the origin.




  • One useful quantity is the norm of a vector, written \( ||u|| \). This is the Euclidean length of the vector u, so \( ||u|| = \sqrt{u_1^2 + u_2^2} \); we get a real number, i.e. the length of the arrow above (this follows from Pythagoras). For the inner product, take v and orthogonally project it down onto u. First we can plot v on the same axes in the same way, as the point \( (v_1, v_2) \).


  • Then measure the length/magnitude of the projection.




  • Here the green line is the projection, and p is the length along u from the origin to the intersection point, i.e. p is the magnitude of the projection of vector v onto vector u. It is possible to show that \( u^T v = p \cdot ||u|| \).


  • Another way to compute the inner product is \( u^T v = u_1 v_1 + u_2 v_2 \). Therefore \( p \cdot ||u|| = u_1 v_1 + u_2 v_2 \).


  • This is an important rule in linear algebra. We can also reverse the order:


  • So we could compute \( v^T u = v_1 u_1 + v_2 u_2 \), which gives the same number. Note that p can be negative if the angle between the two vectors is more than 90 degrees.




  • So here p is negative. We will now use this vector inner product theory to try to understand SVMs a little better.
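  • To make the two ways of computing the inner product concrete, here is a small NumPy sketch (the vectors u, v and w are made up for illustration; this is not part of the original notes). It computes p geometrically from the angle between the vectors and checks that \( p \cdot ||u|| \) matches \( u_1 v_1 + u_2 v_2 \), and that p goes negative once the angle exceeds 90 degrees.

```python
import numpy as np

# Arbitrary example vectors (any 2-D vectors would do)
u = np.array([4.0, 2.0])
v = np.array([1.0, 3.0])

norm_u = np.sqrt(u[0]**2 + u[1]**2)          # ||u|| = sqrt(u1^2 + u2^2)
norm_v = np.sqrt(v[0]**2 + v[1]**2)          # ||v||

# p = signed length of the orthogonal projection of v onto u,
# computed geometrically from the angle between the two vectors.
angle = np.arctan2(v[1], v[0]) - np.arctan2(u[1], u[0])
p = norm_v * np.cos(angle)

print(p * norm_u)                 # u^T v = p * ||u||        -> 10.0
print(u[0]*v[0] + u[1]*v[1])      # u^T v = u1*v1 + u2*v2    -> 10.0

# When the angle between the vectors is more than 90 degrees, p is negative.
w = np.array([-3.0, 1.0])
angle_w = np.arctan2(w[1], w[0]) - np.arctan2(u[1], u[0])
print(np.sqrt(w[0]**2 + w[1]**2) * np.cos(angle_w))   # negative p, roughly -2.24
```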






  • For the following explanation we make two simplifications: set \( \theta_0 = 0 \) (i.e. ignore the intercept term), and set \( n = 2 \), i.e. each example has only two features \( (x_1, x_2) \). Given that we only have two parameters, we can simplify the SVM cost function:




  • We obtain the new simplified optimization objective \( \frac{1}{2} (\theta_1^2 + \theta_2^2) \), which can be re-written as \( \frac{1}{2} \left(\sqrt{\theta_1^2 + \theta_2^2}\right)^2 \). Both give the same answer.


  • Notice that \( \sqrt{\theta_1^2 + \theta_2^2} = ||\theta|| \).


  • That is, the expression under the square root is the norm of θ, treating θ as a 2x1 vector (this is still true if we include \( \theta_0 \), since we are assuming \( \theta_0 = 0 \)). So, finally, our optimization objective can be re-written as \( \frac{1}{2} ||\theta||^2 \).
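  • Spelling the algebra out in one line (this is just the expressions above chained together):

\[
\frac{1}{2}\left(\theta_1^2 + \theta_2^2\right)
\;=\; \frac{1}{2}\left(\sqrt{\theta_1^2 + \theta_2^2}\,\right)^{2}
\;=\; \frac{1}{2}\,||\theta||^2
\]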


  • So the SVM is minimizing the squared norm of θ. Given this, what is \( \theta^T x \) doing? Given θ and a training example x, what is this quantity equal to? We can look at this in a manner comparable to how we just looked at u and v. Say we have a single positive training example (the red cross below), \( x^{(i)} \), with features \( x_1^{(i)}, x_2^{(i)} \).




  • Although we haven't been thinking of our examples as vectors, each example can be described as one.




  • Now, say we have our parameter vector θ, and we plot it on the same axes.




  • The next question is: what is the inner product of these two vectors?




  • It is p, or more precisely \( p_i \), the length of the projection for example i. Given our previous discussion we know that \( \theta^T x^{(i)} = p_i \cdot ||\theta|| = \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)} \).


  • So these are both equally valid ways of computing \( \theta^T x^{(i)} \). What does this mean? The constraints we defined earlier were \( \theta^T x^{(i)} \geq 1 \) if \( y^{(i)} = 1 \), and \( \theta^T x^{(i)} \leq -1 \) if \( y^{(i)} = 0 \).


  • These can be replaced/substituted with the constraints \( p_i \cdot ||\theta|| \geq 1 \) if \( y^{(i)} = 1 \), and \( p_i \cdot ||\theta|| \leq -1 \) if \( y^{(i)} = 0 \).


  • Writing that into our optimization objective:
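  • Putting the pieces together, the simplified problem (still assuming \( \theta_0 = 0 \) and \( n = 2 \)) is:

\[
\min_{\theta}\;\; \frac{1}{2}\,||\theta||^2
\quad\text{s.t.}\quad
\begin{cases}
p_i \cdot ||\theta|| \;\geq\; 1 & \text{if } y^{(i)} = 1 \\
p_i \cdot ||\theta|| \;\leq\; -1 & \text{if } y^{(i)} = 0
\end{cases}
\]

where \( p_i \) is the projection of \( x^{(i)} \) onto the vector θ.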








  • So, given we've redefined these functions, let us now consider the training set below.


  • Given this data, what boundary will the SVM choose? Note that we're still assuming \( \theta_0 = 0 \), which means the boundary has to pass through the origin (0,0). Suppose the SVM chose the green line shown below, a decision boundary with small margins.


  • The SVM would not choose this line, because the decision boundary comes very close to the training examples. Let's discuss why the SVM would not choose this decision boundary by looking at this line more closely.


  • We can show that θ is at 90 degrees to the decision boundary; in fact, θ is always at 90 degrees to the decision boundary. (One way to see this: with \( \theta_0 = 0 \) the boundary is the set of points x with \( \theta^T x = 0 \), i.e. exactly the vectors orthogonal to θ.)


  • So now let's look at what this implies for the optimization objective. Look at the first example, \( x^{(1)} \).


  • Project a line from \( x^{(1)} \) onto the θ vector (so that it hits θ at 90 degrees).


  • The distance between the intersection and the origin is \( p_1 \). Similarly, look at the second example, \( x^{(2)} \), and project a line from \( x^{(2)} \) onto the θ vector. This is the magenta line, whose signed length \( p_2 \) will be negative.


  • If we look at these two projections together, we get a graphical representation of what's going on.


  • We find that both of these p values are going to be pretty small. If we look back at our optimization objective, we know we need \( p_1 \cdot ||\theta|| \geq 1 \) for positive examples.


  • If \( p_1 \) is small, that means \( ||\theta|| \) must be pretty large. Similarly, for negative examples we need \( p_2 \cdot ||\theta|| \leq -1 \).


  • We saw in this example that \( p_2 \) is a small negative number, so \( ||\theta|| \) must again be a large number. Why is this a problem? The optimization objective is trying to find a set of parameters where the norm of θ is small.


  • So this doesn't seem like a good direction for the parameter vector (because as the p values get smaller, \( ||\theta|| \) must get larger to compensate). We should instead make the p values larger, which allows \( ||\theta|| \) to become smaller. So let's choose a different boundary.
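  • The following numeric sketch illustrates this trade-off (the four training points and the two candidate θ directions are made up for this example; they are not from the lecture). For each candidate boundary through the origin it computes the projections \( p_i = \theta^T x^{(i)} / ||\theta|| \) and the smallest \( ||\theta|| \) that can satisfy the constraints \( p_i \cdot ||\theta|| \geq 1 \) (positives) and \( p_i \cdot ||\theta|| \leq -1 \) (negatives).

```python
import numpy as np

# Hypothetical training set: the boundary must pass through the origin
# because we are assuming theta_0 = 0.
X = np.array([[ 2.0,  0.5],
              [ 3.0,  1.0],
              [-2.0, -0.5],
              [-3.0, -1.0]])
y = np.array([1, 1, 0, 0])

def projections_and_required_norm(direction):
    """For a boundary whose theta points along `direction`, return the
    projections p_i of each example onto theta and the smallest ||theta||
    satisfying p_i*||theta|| >= 1 (y=1) and p_i*||theta|| <= -1 (y=0)."""
    unit = direction / np.linalg.norm(direction)
    p = X @ unit                            # p_i = theta^T x^(i) / ||theta||
    margins = np.where(y == 1, p, -p)       # must all be positive to separate
    required = np.inf if np.any(margins <= 0) else 1.0 / margins.min()
    return p, required

# Boundary 1: passes very close to the examples (small projections p_i).
p_close, norm_close = projections_and_required_norm(np.array([0.0, 1.0]))

# Boundary 2: roughly perpendicular to the data spread, far from every
# example (large projections p_i).
p_far, norm_far = projections_and_required_norm(np.array([1.0, 0.25]))

print("close boundary: p =", np.round(p_close, 2), " needs ||theta|| >=", round(norm_close, 2))
print("far boundary:   p =", np.round(p_far, 2),   " needs ||theta|| >=", round(norm_far, 2))
```

The first direction needs \( ||\theta|| \geq 2.0 \), while the second only needs \( ||\theta|| \geq 0.49 \), so minimizing \( \frac{1}{2}||\theta||^2 \) favours the boundary that stays far from the examples.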


  • Now if you look at the projections of the examples onto θ for this new boundary, we find that \( p_1 \) becomes large and \( ||\theta|| \) can become small.


  • With some concrete values drawn in, we can see that the projections \( p_1 \) and \( p_2 \) are much larger for this boundary.


  • This means that by choosing this second decision boundary we can make \( ||\theta|| \) smaller, which is why the SVM chooses this hypothesis as the better one.


  • This is how we generate the large margin effect
    The magnitude of this margin is a function of the p values, so by maximizing these p values we minimize \( ||\theta|| \).


  • Finally, note that we did this derivation assuming \( \theta_0 = 0 \). In this case we're entertaining only decision boundaries which pass through the origin (0,0).


  • If you allow \( \theta_0 \) to take other values, then this simply means you can have decision boundaries which do not pass through (0,0).


  • It can be shown, with basically the same logic, that even when \( \theta_0 \) is non-zero, the optimization objective described above (when C is very large) still means the SVM is looking for a large margin separator between the classes.
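  • If you want to see this behaviour with an off-the-shelf solver (this uses scikit-learn, which is not part of these notes, and a tiny made-up dataset), a linear SVM with a very large C approximates the hard-margin objective above, with \( \theta_0 \) now allowed to be non-zero:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny made-up dataset, purely for illustration.
X = np.array([[2.0, 0.5], [3.0, 1.0], [-2.0, -0.5], [-3.0, -1.0]])
y = np.array([1, 1, 0, 0])

# kernel="linear" with a very large C approximates the hard-margin,
# large-margin separator discussed above.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

theta = clf.coef_[0]            # [theta_1, theta_2]
theta_0 = clf.intercept_[0]     # intercept term (not forced to 0 here)
print("theta =", theta, "theta_0 =", theta_0)

# Distance from the boundary to the closest examples = 1 / ||theta||
print("margin =", 1.0 / np.linalg.norm(theta))
```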