**This chapter** may give you better intuition about how the optimization problem of the support vector machine leads to large margin classifiers.

**Let's say** we have two vectors u and v, both two dimensional.

**Then u transpose v**, \( u^Tv \), is a number. \( u^Tv \) is also called the inner product between the vectors u and v.

**Plot u on a graph** - i.e. \( u_1 \) vs. \( u_2 \) - as shown below.

**One property** which is useful is the norm of a vector, written \( ||u|| \). This is the Euclidean length of the vector u, so \( ||u|| = \sqrt{u_1^2 + u_2^2} \). We get a real number - the length of the arrow above - which follows from Pythagoras.

For the inner product, take v and orthogonally project it down onto u. First we can plot v on the same axes in the same way (\( v_1 \) vs. \( v_2 \)).

**Measure the length/magnitude of the projection.**

**So here, the green line** is the projection. p = the length along u to the intersection; p is the (signed) magnitude of the projection of vector v onto vector u. It is possible to show that \( u^Tv = p \cdot ||u|| \).

**So this is one way** to compute the inner product. The other way is \( u^Tv = u_1v_1 + u_2v_2 \). Therefore \( p \cdot ||u|| = u_1v_1 + u_2v_2 \).

**This is an important rule in linear algebra.** We can reverse the order too: \( v^Tu = v_1u_1 + v_2u_2 \), which obviously gives the same number. Note that p can be negative, namely when the angle between the vectors is more than 90 degrees.

We'll now use this vector inner product theory to try and understand SVMs a little better.
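As a quick numeric check of this identity, here is a minimal sketch (the vectors u and v are made-up values, not from the lecture). The projection length p is computed from the angle between the vectors, so the two sides of the identity are computed independently:

```python
import numpy as np

# Made-up example vectors u and v
u = np.array([2.0, 1.0])
v = np.array([1.0, 3.0])

norm_u = np.sqrt(u[0]**2 + u[1]**2)          # ||u|| via Pythagoras

# Projection length p of v onto u, computed geometrically from the angle between them
angle_u = np.arctan2(u[1], u[0])
angle_v = np.arctan2(v[1], v[0])
p = np.linalg.norm(v) * np.cos(angle_v - angle_u)

print(u[0] * v[0] + u[1] * v[1])             # u1*v1 + u2*v2 -> 5.0
print(p * norm_u)                            # p * ||u||     -> 5.0 (same number)
print(v[0] * u[0] + v[1] * u[1])             # v^T u gives the same number again
```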

**For the following explanation** we make two simplifications: set \( \theta_0 = 0 \) (i.e. ignore the intercept term) and set \( n = 2 \), i.e. each example has only two features \( (x_1, x_2) \). Given we only have two parameters we can simplify the function given below.

**To obtain the new simplified optimization objective:** \( \frac{1}{2}(\theta_1^2 + \theta_2^2) \), which can be re-written as \( \frac{1}{2}\left(\sqrt{\theta_1^2 + \theta_2^2}\right)^2 \). Both give the same answer.

**As we notice,** \( \sqrt{\theta_1^2 + \theta_2^2} = ||\theta|| \).

**The term inside the square is the norm of θ**, if we take θ as a 2x1 vector (and if we include \( \theta_0 = 0 \) as well this is still true). So, finally, this means our optimization function can be re-written as \( \frac{1}{2}||\theta||^2 \).

**So the SVM is minimizing the squared norm of θ.**

Given this, what is the \( \theta^Tx \) term doing? Given θ and a given example x, what is this equal to? We can look at this in the same way we just looked at u and v. Say we have a single positive training example (the red cross below), \( x^{(i)} \), with features \( x_1^{(i)}, x_2^{(i)} \).

**Although** we haven't been thinking about examples as vectors, each one can be described as such.

**Now**, say we have our parameter vector θ and we plot it on the same axes.

**The next question** is: what is the inner product of these two vectors?

**It is in fact \( p^{(i)} \cdot ||\theta|| \)**, where \( p^{(i)} \) is the length of the projection of example i onto θ. Given our previous discussion we know \( \theta^Tx^{(i)} = p^{(i)} \cdot ||\theta|| = \theta_1x_1^{(i)} + \theta_2x_2^{(i)} \).

**So these are both equally valid ways of computing \( \theta^Tx^{(i)} \).**

What does this mean? The constraints we defined earlier, \( \theta^Tx^{(i)} \geq 1 \) if \( y^{(i)} = 1 \) and \( \theta^Tx^{(i)} \leq -1 \) if \( y^{(i)} = 0 \),

**can be replaced/substituted with the constraints** \( p^{(i)} \cdot ||\theta|| \geq 1 \) if \( y^{(i)} = 1 \) and \( p^{(i)} \cdot ||\theta|| \leq -1 \) if \( y^{(i)} = 0 \).

**Writing that into our optimization objective.**
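A small sketch confirming both identities used above (θ and x are made-up values, with \( \theta_0 = 0 \)):

```python
import numpy as np

theta = np.array([3.0, 4.0])   # made-up parameter vector (theta_1, theta_2); theta_0 = 0
x = np.array([1.0, 2.0])       # a made-up training example (x_1, x_2)

# The objective 1/2 (theta_1^2 + theta_2^2) is the same as 1/2 ||theta||^2
print(0.5 * (theta[0]**2 + theta[1]**2))        # 12.5
print(0.5 * np.linalg.norm(theta)**2)           # 12.5

# theta^T x computed directly, and as p * ||theta|| (p = projection of x onto theta)
p = np.dot(theta, x) / np.linalg.norm(theta)
print(theta[0] * x[0] + theta[1] * x[1])        # 11.0
print(p * np.linalg.norm(theta))                # 11.0
```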

**So**, given we've redefined these functions, let us now consider the training examples below.

**Given this data**, what boundary will the SVM choose? Note that we're still assuming \( \theta_0 = 0 \), which means the boundary has to pass through the origin (0,0). One possibility is the green line shown, which has small margins.

**The SVM would not choose this line.** The decision boundary comes very close to the examples. Let's discuss why the SVM would not choose this decision boundary. Looking at this line,

**we can show that θ is at 90 degrees** to the decision boundary (θ is always at 90 degrees to the decision boundary).
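A quick way to see the 90 degree claim (a sketch with a made-up θ): with \( \theta_0 = 0 \) the boundary is the set of points where \( \theta^Tx = 0 \), so any direction lying along the boundary has zero inner product with θ, i.e. it is orthogonal to θ:

```python
import numpy as np

theta = np.array([2.0, 1.0])            # made-up parameter vector; theta_0 = 0

# The boundary {x : theta^T x = 0} is a line through the origin whose direction d
# is theta rotated by 90 degrees
d = np.array([-theta[1], theta[0]])

print(np.dot(theta, d))                 # 0.0 -> theta is orthogonal to the boundary

cos_angle = np.dot(theta, d) / (np.linalg.norm(theta) * np.linalg.norm(d))
print(np.degrees(np.arccos(cos_angle))) # 90.0
```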

**So now let's look** at what this implies for the optimization objective. Look at the first example, \( x^{(1)} \).

**Project a line** from \( x^{(1)} \) onto the θ vector (so it hits it at 90 degrees).

**The distance** between the intersection and the origin is \( p^{(1)} \). Similarly, look at the second example, \( x^{(2)} \). Project a line from \( x^{(2)} \) onto the θ vector. This is the magenta line, and its projection \( p^{(2)} \) will be negative.

**If we overlay** these two lines below we see a graphical representation of what's going on.
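As a rough numeric sketch of those projections (made-up coordinates, chosen so that θ points along the x1-axis and the boundary passes close to both examples):

```python
import numpy as np

theta = np.array([1.0, 0.0])  # theta along the x1-axis, so the boundary is the x2-axis
x1 = np.array([0.5, 2.0])     # positive example, just to the right of the boundary
x2 = np.array([-0.5, -2.0])   # negative example, just to the left of the boundary

norm_theta = np.linalg.norm(theta)
p1 = np.dot(theta, x1) / norm_theta   # signed projection of x^(1) onto theta
p2 = np.dot(theta, x2) / norm_theta   # signed projection of x^(2) onto theta
print(p1, p2)                         # 0.5  -0.5  (small positive, small negative)
```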

**We find that both of these p values are going to be pretty small.** If we look back at our optimization objective, we know we need \( p^{(1)} \cdot ||\theta|| \) to be greater than or equal to 1 for positive examples.

**If p is small**, this means that \( ||\theta|| \) must be pretty large. Similarly, for negative examples we need \( p^{(2)} \cdot ||\theta|| \) to be less than or equal to -1.

**We saw in this example** that \( p^{(2)} \) is a small negative number, so \( ||\theta|| \) must again be a large number.

Why is this a problem? The optimization objective is trying to find a set of parameters where the norm of θ is small.

**So this doesn't seem** like a good direction for the parameter vector (because as the p values get smaller, \( ||\theta|| \) must get larger to compensate). We should instead make the p values larger, which allows \( ||\theta|| \) to become smaller. So let's choose a different boundary.

Now if you look at the projections of the examples onto θ, we find that \( p^{(1)} \) becomes large and \( ||\theta|| \) can become small.

**So with some values drawn in, this means** that by choosing this second decision boundary we can make \( ||\theta|| \) smaller, which is why the SVM chooses this hypothesis as better.

**This is how we generate the large margin effect.**
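To make the trade-off concrete, here is a sketch with made-up separable data. Since the constraints are \( p^{(i)} \cdot ||\theta|| \geq 1 \) (positives) and \( p^{(i)} \cdot ||\theta|| \leq -1 \) (negatives), the smallest feasible \( ||\theta|| \) for a given direction of θ is 1 divided by the smallest \( |p^{(i)}| \):

```python
import numpy as np

# Made-up separable data: two positive and two negative examples
pos = np.array([[1.0, 3.0], [2.0, 4.0]])
neg = np.array([[-1.0, -3.0], [-2.0, -4.0]])

def smallest_feasible_norm(direction):
    """Smallest ||theta|| satisfying p_i*||theta|| >= 1 (positives) and
    p_i*||theta|| <= -1 (negatives) when theta points along `direction`
    (theta_0 = 0; assumes this direction actually separates the data)."""
    d = direction / np.linalg.norm(direction)
    p_pos = pos @ d                         # projections of positive examples
    p_neg = neg @ d                         # projections of negative examples
    worst = min(p_pos.min(), -p_neg.max())  # smallest |p| over all examples
    return 1.0 / worst

# theta along (1, 0): small projections, so a large ||theta|| is required
print(smallest_feasible_norm(np.array([1.0, 0.0])))   # 1.0
# theta along (0, 1): large projections, so a much smaller ||theta|| suffices
print(smallest_feasible_norm(np.array([0.0, 1.0])))   # 0.333...
```

The second direction gives larger projections, so the norm of θ can be smaller; that is the boundary the SVM prefers.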

The magnitude of this margin is a function of the p values, so by maximizing these p values we minimize \( ||\theta|| \).

**Finally**, we did this derivation assuming \( \theta_0 = 0 \). If this is the case we're only considering decision boundaries which pass through the origin (0,0).

**If you allow** \( \theta_0 \) to take other values, then the decision boundary can cross the x and y axes at points other than (0,0).

**It can be shown with basically the same logic** that, even when \( \theta_0 \) is non-zero, with the optimization objective described above (when C is very large) the SVM is still looking for a large margin separator between the classes.