Support Vector Machines II Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In the last video, we talked about linear support vector machines. In this video, we talk about general support vector machines. Earlier in the support vector machine notebook, we learned about the maximal margin classifier and the support vector classifier, also known as the soft margin classifier. These were both support vector machines for classification that provide a linear boundary, with one more strict than the other. In the maximal margin case, we didn't let points cross over the separating line, and we also didn't let points cross over the margin. With the support vector classifier, the soft margin classifier, we still had a linear decision boundary, but we allowed some points to cross over the margin or even cross over the separating line altogether. Those violations were controlled by slack variables, and the value of C that we chose for the LinearSVC model in sklearn determined how much we allowed points to cross over either the margin or the separating line itself. Those models work well for problems where you suspect your data has a linear decision boundary, meaning you can draw a hyperplane and more or less get the classification correct. But if you're doing classification work, you will encounter problems that have nonlinear decision boundaries, or where you can't determine whether the decision boundary is linear because you can't visualize the data. Here are some examples. This is a two-dimensional example where we have some negative ones in a blob over here, some negative ones in a blob up here, and then a blob of orange triangles representing ones in the middle. We would not be able to draw a line in 2D that separates these points, or even comes close to separating them, so we couldn't use a linear decision boundary here. Both our maximal margin and our soft margin classifiers would perform poorly on this particular example. Here's an even simpler example in 1D. We've got some blue circles on the left and on the right, and some orange triangles in the middle, so there would be no way to choose a single cut point on the real line that separates these well. So what can we do? Luckily, we have nonlinear support vector machines as well. The foundational idea for all of these sounds a little weird: we take a lower dimensional data set, for instance one in R, the real line, and lift it into a higher dimensional space. In this example we lift it into R^2, the plane, the x-y coordinate axes. In that higher dimension we can separate the data, or come close to separating it, with a linear decision boundary, and then project that boundary back down into the lower dimension. Let's see what that looks like with this one-dimensional data set. What we're going to do is add a new variable. We had x_1 as a feature; now we add x_1 squared. So we have a single feature in this example, x_1, and we ask: what if we also consider the variable x_1 squared? ...
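To make the lifting step concrete, here is a minimal sketch in numpy. The array X is an illustrative stand-in for the notebook's one-dimensional data, not the actual values shown in the video.

```python
import numpy as np

# Illustrative stand-in for the 1D data: blue circles (-1) on the outside,
# orange triangles (1) in the middle.
X = np.array([-3.0, -2.5, -2.0, 2.0, 2.5, 3.0,   # label -1
              -0.5, 0.0, 0.5]).reshape(-1, 1)    # label  1
y = np.array([-1, -1, -1, -1, -1, -1, 1, 1, 1])

# Lift the data from R into R^2 by appending x_1^2 as a second column.
X_new = np.hstack([X, X**2])
print(X_new[:3])
```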
So here we have x_1 on the horizontal axis and x_2 = x_1^2 on the vertical axis. You can see that now we can draw a line to separate these two classes: a line that looks something like what I'm tracing out with my mouse would separate the negative ones from the ones. Now that we've lifted the data into a higher dimensional space, we can separate it. So we'll call LinearSVC; I'll let C equal 100, say, and set the maximum iterations high, as I always like to do so that it will converge. Now I'm going to fit, and then let's check, what did I call it, X_new? X_new is the array with x_1 and x_2 = x_1^2, and the target should just be y. And here's a plot ... in two dimensions. This is what it looks like in two dimensions: I now have a decision boundary line where anything above it I would classify as a negative one and anything below it I would classify as a one. Now, for any point given to me with a value of x_1, I can quickly compute x_1^2, plot it out here, and then make a decision. Essentially, if we look at this earlier plot, you may not have been able to use a linear support vector machine, but you could look at it with your eyes and say: if I make a cut point here, everything to the left is a negative one; if I make a cut point here, everything to the right is a negative one; and between those two cut points, everything is a one. That's essentially all we're doing, but we're using the fact that in higher dimensions I can fit a linear support vector machine. At some points, the function x^2 will equal the decision line, and those two points are where your cutoffs would be. OK. So this was a nice little trick that worked for this very simple problem. But this nice little trick is, in essence, the foundation of the more general support vector machine that provides for nonlinear decision boundaries. We have data in a lower dimensional space. We somehow find a way, in this instance by performing a polynomial transformation, to take it from a lower dimensional space and lift it into a higher dimensional space where we can find a linear decision boundary that separates the classes, or maybe comes close to separating them, sort of like a soft margin. And once we have that separation, we can project everything back down to the lower dimensional space and get the nonlinear decision boundary that we would like. Essentially, this process took a lower dimensional space and embedded it within a higher dimensional space: the lower dimensional space of the real line was embedded in two-dimensional space along this parabola. In this example it was easy to choose that embedding based on my experience with mathematics over my long career doing math. But one question that might arise in general is: how do I choose an embedding? Do I just arbitrarily add polynomials? Do I arbitrarily add nonlinear transformations? There is a standard way to go about this, and there are some nice standard options that have been developed by researchers and implemented in sklearn. All of these come from the concept of a kernel function.
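Here is a rough sketch of the LinearSVC fit just described, using the same illustrative stand-in data as above; C=100 and a large max_iter mirror the values mentioned in the video, but are not tuned.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Same illustrative stand-in data as in the previous sketch.
X = np.array([-3.0, -2.5, -2.0, 2.0, 2.5, 3.0, -0.5, 0.0, 0.5]).reshape(-1, 1)
y = np.array([-1, -1, -1, -1, -1, -1, 1, 1, 1])
X_new = np.hstack([X, X**2])            # features (x_1, x_1^2)

# Fit a linear support vector classifier in the lifted space.
lin_svc = LinearSVC(C=100, max_iter=100_000)
lin_svc.fit(X_new, y)

# The fitted boundary is the line w_1*x_1 + w_2*x_2 + b = 0 in the lifted space.
print(lin_svc.coef_, lin_svc.intercept_)
print(lin_svc.predict(X_new))
```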
There's not one single kernel function, but there is a single concept behind kernel functions. Up to this point, for ease of discussion, we have not explicitly discussed how the solution to the support vector classifier is computed, that is, how we find the optimal betas and the optimal M for the constrained optimization problem of the linear support vector classifier. We haven't talked about that because it's a little long and technical. But it turns out that the end result comes down to this: all you need to do is take inner products between training observations. OK, so what's an inner product? For this support vector machine setting, we can think of the inner product, for those of you familiar with linear algebra, as essentially just a dot product. The inner product between two vectors a and b, for our purposes, is the sum from i = 1 to m of a_i times b_i. Vector a has entries one through m, vector b has entries one through m; you multiply the corresponding entries and add them together. That's what we'll call the inner product between a and b, and that's all that's involved in getting the estimates for the beta hats. Now suppose we have some function phi, which in the example above was the map that appended x^2, that takes our observations from the lower dimensional space R^m into a higher dimensional space. All we would need to do to compute the support vector machine in the higher dimensional space is compute the inner product between those transformations. So if we had two observations, x* and x# (x-pound, or x-hashtag if you like), we would need to be able to compute the inner product between the transformations phi(x*) and phi(x#) that embed them in the higher space. We wouldn't have to do it by hand; our computer does it for us. The issue is that phi usually takes the lower dimensional space into a space of very, very high dimension. Sometimes that space is even infinite dimensional: sometimes you'll come up with a function that embeds your lower dimensional space into an infinite dimensional space, and there your computer is out of luck. It cannot do infinite calculations; you would run out of time, you would die before it finished, because it's infinite. But the nice thing about kernel functions is that it's actually possible to calculate the inner product between the transformed versions of your observations in the higher dimensions using only the lower dimensional data. Let's say your data has two features, x_1 and x_2; this is an example that illustrates the idea. And let's say the function phi that takes it into the higher space is given by phi(x_1, x_2) = (x_1^2, sqrt(2) x_1 x_2, x_2^2), so phi is a transformation of R^2 into R^3. For two vectors a and b, we can then go ahead and compute the inner product between phi(a) and phi(b), and if you work it all out, what you end up with is (a_1 b_1 + a_2 b_2)^2. But that is just the inner product between a and b in the regular space R^2, squared.
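A quick numerical check of that identity, using two arbitrary illustrative vectors in R^2 of my own choosing (not values from the video):

```python
import numpy as np

def phi(x):
    # The explicit map from R^2 into R^3 used in the example.
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

a = np.array([1.0, 3.0])
b = np.array([2.0, -1.0])

lhs = phi(a) @ phi(b)      # inner product in the higher dimensional space R^3
rhs = (a @ b) ** 2         # kernel evaluated in the original space: (a . b)^2
print(lhs, rhs)            # both equal 1 here: (1*2 + 3*(-1))^2 = 1
```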
So the inner product in the higher dimensional space R^3 turns out to be the same as the square of the inner product between a and b in the two-dimensional space in this example. Even though we knew the map phi took us to a higher space, we actually didn't need to know what that map was. We didn't need to know anything about phi in order to compute the inner product between phi(a) and phi(b), once we worked it out, because it's just the square of the dot product between a and b. And this is the idea behind the kernel function. If you have a map phi, we say that phi has a kernel function K if the inner product between phi(a) and phi(b), for vectors a and b, is just a function K(a, b), where K depends only on a and b in the original feature space. Meaning we don't need to know the higher dimensional transformation; we just need the values of the vectors a and b. In this example, the kernel function for the phi I've highlighted right here would be the square of the inner product between a and b. Some very common support vector machine kernel functions that you will want to try, and that sklearn allows you to implement, are the following. The linear kernel: this is just the identity, the inner product between a and b. If you use it, you'll get the same thing as the LinearSVC models we used earlier, so if you're going to do a linear classification, you might as well just use LinearSVC, because it has been optimized to be faster than choosing the general SVC with a linear kernel. There's the polynomial kernel, which in general is (gamma * <a, b> + r)^d, where gamma and r are hyperparameters that you'd have to tune and d is the degree of the polynomial; the example above would be a degree two polynomial kernel with gamma equal to one and r equal to zero. There's the Gaussian radial kernel, or Gaussian RBF, which takes a and b and computes the exponential exp(-gamma * ||a - b||^2), where the norm is the Euclidean norm. And finally there's the sigmoid kernel, which is the hyperbolic tangent tanh(gamma * <a, b> + r). These are the most common support vector machine kernel functions, and they're also implemented in sklearn, which we will now see. The general support vector machine for classification is called SVC, and it's stored in sklearn.svm. So from sklearn.svm we import SVC ... and now we will plot it using this code. So here is ... the example. This was that 1D example, and remember we use the polynomial kernel here. So we call SVC, we set the kernel equal to 'poly' as a string, we want a degree two polynomial because we just want to square the feature, and I'm going to set C equal to a relatively high number, so once we get into the higher dimensional space we're essentially fitting something like a maximum margin classifier. Then we call svc.fit(X, y) ... oh, gotta do reshape. ... OK. And now I can plot the predictions that we get. Our decision boundary here is marked by these large black X's, and to the left and to the right of those two X's, we would classify points as a negative one.
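Here is a sketch of that 1D polynomial-kernel fit, again on the illustrative stand-in data from earlier. C=1000 is my stand-in for the "relatively high number" mentioned in the video; the exact value there may differ.

```python
import numpy as np
from sklearn.svm import SVC

# Same illustrative stand-in data as before.
X = np.array([-3.0, -2.5, -2.0, 2.0, 2.5, 3.0, -0.5, 0.0, 0.5]).reshape(-1, 1)
y = np.array([-1, -1, -1, -1, -1, -1, 1, 1, 1])

# Degree-two polynomial kernel with a high C, roughly a maximal margin fit
# in the lifted space. A flat 1D array would need .reshape(-1, 1) first.
poly_svc = SVC(kernel="poly", degree=2, C=1000)
poly_svc.fit(X, y)

# Predictions flip sign at the two cutoff points on the real line.
grid = np.linspace(-4, 4, 17).reshape(-1, 1)
print(poly_svc.predict(grid))
```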
And in between them, we would classify as a one. ... As another example, we can go back to the 2D example from earlier. Here's an example in two dimensions where we're going to use both a polynomial kernel and a radial basis function kernel, and we'll see the difference. So we call SVC with kernel equal to the string 'poly'; I'm just going to use a degree two polynomial, and here I'll set C equal to 10, which is much closer to a soft margin classifier in the higher dimensional space. Essentially, if you think about this as a degree two transformation, it's kind of like taking this space and lifting it up into a paraboloid; I guess we'll see. And then for an RBF kernel, we call SVC with kernel equal to the string 'rbf' and C equal to 10. You might notice that in both the last example and this example I didn't set gamma or r; I did set the degree in the polynomial. Gamma and r have default values; the sklearn documentation is where you can see what they are, and I'm just using the defaults. In practice, you would have to do something like a validation set check or cross-validation to find the values that work best for your data. ... So now I've made both of these model objects, one with a polynomial kernel and one with an RBF kernel; I'll fit them and then plot them with the data. ... On the left, we can see that the polynomial kernel essentially copied the approach from the 1D problem, where we just drew vertical lines: in between these two lines we would classify as a one, and outside of those lines we would classify as a negative one. In contrast, the radial basis function kernel, the Gaussian kernel, essentially draws an ellipse or an oval: inside it we would classify as a one, and outside it we would classify as a negative one. So that's the idea behind general support vector machines. You take lower dimensional data and lift it into a higher dimensional space, where a kernel function allows you to compute inner products in the higher dimensional space using only the original lower dimensional data, and then you can project your decision boundaries back down, like we did here and here. For the mathematical details of both linear support vector machines and general support vector machines, you can see chapters 4 and 12 of The Elements of Statistical Learning, linked to here, for the maximal margin classifier and the support vector classifier; chapter 12 of that same book also has the derivations for the general support vector machine. And that's it. OK, so now you know about support vector machines in both a linear sense and a nonlinear sense. I hope you enjoyed learning about this model type, and I hope you enjoyed watching this lecture video. I hope to see you in the next video, and I hope you have a great rest of your day. Bye.
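As a closing reference, here is a rough sketch of the two-dimensional comparison walked through above. The X2d and y2d arrays are illustrative stand-ins for the notebook's 2D data (two blobs of negative ones flanking a central blob of ones), not the actual values from the video; C=10 and the default gamma/coef0 follow what was described.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative stand-in for the 2D example: -1 blobs on the outside, 1s in the middle.
rng = np.random.default_rng(0)
X2d = np.vstack([rng.normal([-3, 0], 0.5, (20, 2)),   # left blob, label -1
                 rng.normal([3, 0], 0.5, (20, 2)),    # right blob, label -1
                 rng.normal([0, 0], 0.5, (20, 2))])   # middle blob, label 1
y2d = np.array([-1] * 40 + [1] * 20)

# Degree-two polynomial kernel vs. Gaussian RBF kernel, both with C=10 and
# the default gamma/coef0 values, as in the video.
poly_svc = SVC(kernel="poly", degree=2, C=10).fit(X2d, y2d)
rbf_svc = SVC(kernel="rbf", C=10).fit(X2d, y2d)

print(poly_svc.score(X2d, y2d), rbf_svc.score(X2d, y2d))
```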