This chapter may give you better intuition about how the optimization problem of the support vector machine leads to large margin classifiers.
Let's say we have two vectors u and v, both two-dimensional, that look like below.
What does u transpose v, \( u^Tv \), look like? \( u^Tv \) is also called the inner product of the vectors u and v.
Plot u on a graph, i.e. \( u_1 \) vs. \( u_2 \), as shown below.
One useful property is the norm of a vector, written \( ||u|| \). This is the Euclidean length of the vector u, so \( ||u|| = \sqrt{u_1^2 + u_2^2} \), a real number: the length of the arrow above (this also follows from Pythagoras). For the inner product, take v and orthogonally project it down onto u. First plot v on the same axes in the same way (\( v_1 \) vs. \( v_2 \)).
Measure the length/magnitude of the projection
So here the green line is the projection, and p is the length along u to the intersection: p is the magnitude of the projection of vector v onto vector u. It is possible to show that \( u^Tv = p \cdot ||u|| \).
Another way to compute the inner product is \( u^Tv = u_1v_1 + u_2v_2 \). So therefore \( p \cdot ||u|| = u_1v_1 + u_2v_2 \).
This is an important rule in linear algebra. We can reverse this too:
So we could compute \( v^Tu = v_1u_1 + v_2u_2 \), which gives the same number. Note that p can be negative if the angle between the vectors is 90 degrees or more.
So here p is negative. We'll use this vector inner product theory to understand SVMs a little better; a quick numerical check of the relationships above is sketched below.
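Here is a minimal numerical sketch of these relationships, assuming NumPy and two made-up vectors (nothing here comes from the lecture itself):

```python
import numpy as np

# Two made-up 2-D vectors
u = np.array([4.0, 2.0])
v = np.array([1.0, 3.0])

inner = u @ v                     # u1*v1 + u2*v2
norm_u = np.linalg.norm(u)        # ||u|| = sqrt(u1^2 + u2^2)
p = inner / norm_u                # signed length of v's projection onto u

print(inner, v @ u)               # u'v and v'u give the same number
print(p * norm_u)                 # p * ||u|| equals the inner product

# p goes negative when the angle between the vectors exceeds 90 degrees
w = np.array([-3.0, 1.0])
print((u @ w) / norm_u)           # negative projection
```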
For the following explanation we make two simplifications: set \( \theta_0 = 0 \) (i.e. ignore the intercept term) and set \( n = 2 \), so each example has only two features \( (x_1, x_2) \). Given we only have two parameters we can simplify our function, given below:
To obtain the new simplified optimization objective: \( \frac{1}{2}(\theta_1^2 + \theta_2^2) \), which can be re-written as \( \frac{1}{2}\left(\sqrt{\theta_1^2 + \theta_2^2}\right)^2 \). Both give the same answer.
Notice that \( \sqrt{\theta_1^2 + \theta_2^2} = ||\theta|| \).
That quantity inside the square is just the norm of θ, taking θ as a 2×1 vector; this is still true if we include \( θ_0 \), as long as \( θ_0 = 0 \). So, finally, this means our optimization objective can be re-written as \( \frac{1}{2}||\theta||^2 \).
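As a concrete illustration with made-up numbers: if \( \theta = (3, 4)^T \) then \( ||\theta|| = \sqrt{3^2 + 4^2} = 5 \), and the objective value is \( \frac{1}{2}||\theta||^2 = \frac{1}{2}\cdot 25 = 12.5 \).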
So the SVM is minimizing the squared norm. Given this, what are the \( θ^Tx^{(i)} \) terms doing? Given θ and given an example x, what is this equal to? We can look at this in the same way we just looked at u and v. Say we have a single positive training example (red cross below), \( x^{(i)} \), with features \( x_1^{(i)}, x_2^{(i)} \).
Although we haven't been thinking about examples as vectors, each example can be described as one.
Now, say we have our parameter vector θ and we plot that on the same axis
The next question is: what is the inner product of these two vectors?
It is \( p^{(i)} \): the length of the projection of example \( x^{(i)} \) onto θ. Given our previous discussion we know \( θ^Tx^{(i)} = p^{(i)} \cdot ||θ|| = θ_1x_1^{(i)} + θ_2x_2^{(i)} \).
So these are both equally valid ways of computing \( θ^Tx^{(i)} \). What does this mean? The constraints we defined earlier were \( θ^Tx \geq 1 \) if \( y = 1 \) and \( θ^Tx \leq -1 \) if \( y = 0 \).
These can be replaced/substituted with the constraints \( p^{(i)} \cdot ||θ|| \geq 1 \) if \( y = 1 \) and \( p^{(i)} \cdot ||θ|| \leq -1 \) if \( y = 0 \). A small numerical check of this substitution follows.
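Here is a sketch of that check, assuming NumPy, with a made-up θ and a made-up example (neither comes from the lecture):

```python
import numpy as np

theta = np.array([2.0, 1.0])     # made-up parameter vector (theta_0 = 0)
x_i = np.array([1.0, 3.0])       # made-up positive training example

lhs = theta @ x_i                      # theta' x = theta_1*x_1 + theta_2*x_2
p_i = lhs / np.linalg.norm(theta)      # projection of x onto theta
rhs = p_i * np.linalg.norm(theta)      # p_i * ||theta||

# Identical values, so "theta' x >= 1" and "p_i * ||theta|| >= 1"
# are the same constraint.
print(lhs, rhs)
```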
Writing that into our optimization objective
So, given we've redefined these functions, let us now consider the training examples below.
Given this data, what boundary will the SVM choose? Note that we're still assuming \( θ_0 = 0 \), which means the boundary has to pass through the origin (0,0). The green line below, with small margins, is one possible choice.
The SVM would not choose this line, because the decision boundary comes very close to the examples. Let's discuss why. Looking at this line:
We can show that θ is always at 90 degrees to the decision boundary.
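To see why: with \( θ_0 = 0 \) the boundary is the set of points where \( θ^Tx = 0 \), i.e. exactly the points whose projection onto θ is zero, which is the line through the origin perpendicular to θ. A tiny sketch with a made-up θ (NumPy again):

```python
import numpy as np

theta = np.array([2.0, 1.0])                   # made-up parameter vector

# With theta_0 = 0 the decision boundary is { x : theta' x = 0 }.
# Every point on it is a multiple of theta rotated by 90 degrees.
direction = np.array([-theta[1], theta[0]])

for t in [-2.0, 0.5, 3.0]:
    x_on_boundary = t * direction
    print(theta @ x_on_boundary)               # 0.0 for every point on the boundary
```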
So now let's look at what this implies for the optimization objective.
Look at the first example, \( x^{(1)} \).
Project a line from \( x^{(1)} \) onto the θ vector (so it hits at 90 degrees).
The distance between the intersection and the origin is \( p^{(1)} \). Similarly, look at the second example, \( x^{(2)} \), and project a line from \( x^{(2)} \) onto the θ vector. This is the magenta line, whose projection \( p^{(2)} \) will be negative.
If we overlay these two projections below, we see a graphical representation of what's going on:
We find that both of these p values are going to be pretty small. If we look back at our optimization objective, we know we need \( p^{(1)} \cdot ||θ|| \geq 1 \) for positive examples.
If p is small, that means \( ||θ|| \) must be pretty large. Similarly, for negative examples we need \( p^{(2)} \cdot ||θ|| \leq -1 \).
We saw in this example that \( p^{(2)} \) is a small negative number, so \( ||θ|| \) must again be large. Why is this a problem? The optimization objective is trying to find a set of parameters where the norm of θ is small.
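As a concrete illustration with made-up numbers: if \( p^{(1)} = 0.05 \), the constraint \( p^{(1)} \cdot ||θ|| \geq 1 \) forces \( ||θ|| \geq 20 \); if instead \( p^{(1)} = 0.5 \), it only requires \( ||θ|| \geq 2 \).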
So this doesn't seem like a good direction for the parameter vector: as the p values get smaller, \( ||θ|| \) must get larger to compensate. We should instead make the p values larger, which allows \( ||θ|| \) to become smaller. So let's choose a different boundary.
Now if you look at the projections of the examples onto θ, we find that \( p^{(1)} \) becomes large and \( ||θ|| \) can become small.
So with some values drawn in
This means that by choosing this second decision boundary we can make \( ||θ|| \) smaller, which is why the SVM chooses this hypothesis as better.
This is how we generate the large margin effect
The magnitude of this margin is a function of the p values
So by maximizing these p values we allow \( ||θ|| \) to be minimized.
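The sketch below makes this concrete (NumPy, with made-up data and two made-up candidate θ directions; for a fixed direction that separates the data, the smallest \( ||θ|| \) satisfying the constraints is \( 1 / \min_i |p^{(i)}| \)):

```python
import numpy as np

# Made-up data: one positive and one negative example
X = np.array([[ 1.0,  1.5],     # y = 1
              [-1.0, -1.5]])    # y = 0 (written as -1 below for the margin arithmetic)
y = np.array([1.0, -1.0])

def min_norm_for_direction(theta_dir):
    """Smallest ||theta|| pointing along theta_dir that satisfies
    p_i * ||theta|| >= 1 for positives and <= -1 for negatives."""
    unit = theta_dir / np.linalg.norm(theta_dir)
    p = X @ unit                 # signed projections p_i of each example onto theta
    margins = y * p
    if np.any(margins <= 0):
        return np.inf            # this direction cannot separate the data at all
    return 1.0 / margins.min()

bad_direction  = np.array([1.0, 0.0])   # boundary passes close to the examples
good_direction = np.array([1.0, 1.5])   # boundary stays far from the examples

print(min_norm_for_direction(bad_direction))    # larger ||theta|| required (1.0)
print(min_norm_for_direction(good_direction))   # smaller ||theta|| suffices (~0.55)
```

The direction whose boundary leaves larger projections needs a smaller norm, which is exactly the large-margin preference.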
Finally, we did this derivation assuming \( θ_0 = 0 \). If this is the case, we're entertaining only decision boundaries which pass through the origin (0,0).
If you allow \( θ_0 \) to take other values, then this simply means the decision boundary does not have to pass through the origin (0,0).
One can show, with essentially the same logic, that even when \( θ_0 \) is non-zero, the SVM with the optimization objective described above (when C is very large) is still looking for a large margin separator between the classes.
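As an end-to-end sketch of this behaviour (this uses scikit-learn, which the notes themselves don't, plus made-up data): a linear SVM with a very large C approximates the hard-margin optimizer described above, and the resulting margin width is \( 2/||θ|| \).

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable 2-D data
X = np.array([[ 1.0,  2.0], [ 2.0,  3.0], [ 3.0,  3.0],      # positive class
              [-1.0, -1.0], [-2.0, -2.5], [-3.0, -2.0]])     # negative class
y = np.array([1, 1, 1, 0, 0, 0])

# A very large C approximates the large-margin (hard-margin) setting discussed above
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

theta = clf.coef_.ravel()            # learned (theta_1, theta_2)
theta_0 = clf.intercept_[0]          # learned intercept (non-zero is allowed here)
margin_width = 2.0 / np.linalg.norm(theta)

print(theta, theta_0, margin_width)
```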