Support Vector Machines I Video Lecture Transcript. This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video we're going to learn about support vector machines, starting specifically with linear support vector machines. My Jupyter notebook is the regular support vector machines notebook, which covers both linear and general support vector machines; it's in the classification folder in the lectures. This is a popular type of classification algorithm developed by the computer science community. We're going to learn about the technique in a couple of different ways. We'll start with the maximum margin classifier, which is the strictest linear version, and then we'll talk about the soft margin classifier, which is a relaxation of the maximum margin classifier. At the end we'll talk about some limitations of this approach (I don't remember exactly where this video stops, so we may or may not get to the limitations here), and in the next video we'll cover general support vector machines.

A linear support vector machine is our first class of support vector machine. These are support vector machines, or SVMs, designed for data sets that can be separated linearly, or close to linearly. What do we mean by this? We say that a data set for a binary classification problem is linearly separable if you can draw a hyperplane that perfectly separates the two classes. If you're unfamiliar with the term hyperplane, one way to think about it is that in a given space, a hyperplane is any flat subspace that is one dimension lower. So in R, the real line, a hyperplane is just a point, because it separates the line into two halves. In R^2, a hyperplane is a line, because it cuts the plane into two sides. In R^3, a hyperplane is an ordinary two-dimensional plane, and in R^n it is any (n-1)-dimensional subspace. That's the whole idea behind a linear support vector machine: you have some zeros over here, or in this case we'll call them negative ones to respect the literature on support vector machines, so we have negative ones over here and ones down here, and we can draw a line between them.

The first type of linear support vector machine is the maximum margin classifier. It's called this for a reason we'll understand soon, but first some motivation. Here's a data set I've generated where the negative ones are in the upper left-hand corner and the ones are in the lower right-hand corner; the negative ones are blue circles, the ones are orange triangles. These appear linearly separable, meaning I could draw a line between them and separate them perfectly. But a natural question becomes: which line should I choose? Here's an example where I have three different lines: a black solid line that goes straight through the middle, a red dot-dash line that gets a little closer to either class at various points, and a blue dotted line that also gets close to each class at different points. All three of these separate the two classes.
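The exact data-generating code isn't shown in the transcript, but a minimal sketch along these lines produces a similar picture; the class means, spread, and sample sizes here are my own illustrative choices, not the notebook's actual values.

```python
import numpy as np
import matplotlib.pyplot as plt

# Two roughly linearly separable clusters: class -1 in the upper left,
# class +1 in the lower right. All numbers below are illustrative guesses.
rng = np.random.default_rng(216)
X_neg = rng.normal(loc=[-2, 2], scale=0.7, size=(50, 2))
X_pos = rng.normal(loc=[2, -2], scale=0.7, size=(50, 2))

X = np.vstack([X_neg, X_pos])
y = np.concatenate([-np.ones(50), np.ones(50)])

plt.scatter(X_neg[:, 0], X_neg[:, 1], marker="o", label="y = -1")
plt.scatter(X_pos[:, 0], X_pos[:, 1], marker="^", label="y = 1")
plt.legend()
plt.show()
```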
But one of them will generalize better to future data points, meaning that in the future, when we want to predict on a point we don't have the label for, one of these three lines will do better, and the one that does is the black solid line. Why is that? Well, the red dot-dash line and the blue dotted line get too close to the actual training data at various points. For instance, in the future you may go out into the world and collect an example that is actually a blue circle, but it departs slightly from where we see our training data, crosses over the red dot-dash line, and is incorrectly classified as a triangle. That is less likely to happen with the solid black line because of the distance it has from either class. Here we think of the distance between the hyperplane and a class as the minimum distance from any point in that class to the hyperplane. That's the idea behind the maximal margin classifier: the smallest distance from a point to the hyperplane is known as the margin. For instance, the margin for the black line is given by the distance from one of these closest points to the line (I have a drawing in a second). The maximum margin classifier wants to maximize the minimum distance from any point in the data set to the separating line itself. In the instance of the black line, which is pretty close to the maximum margin separator, we want to maximize the smallest distance from any point to that line.

Here is the formal setup. We have a binary variable y taking values in {-1, 1}; again, we use -1 and 1 here to follow the original formulation of these problems, as opposed to the 0/1 labels we've been using up to this point. Just like before, we have a set of m features stored in the columns of a feature matrix X, where X has been extended with a column of ones as its zeroth column. The maximum margin classifier, if one exists (it may not), is found by solving a constrained optimization problem: find beta, which is as always a vector of weights or coefficients, and the largest M we can, subject to the squared Euclidean norm of beta being equal to one and, for every observation i in the training set, y_i times X_i beta being greater than or equal to M. Here X_i beta is a dot product, so it ends up being a single number that you can compare against M. It's possible to solve this optimization problem; we don't do it in this notebook, but there is a reference at the end that goes through it. You may also be thinking that this condition looks a little weird: what does it have to do with a hyperplane? If you set X times beta equal to zero, it defines a hyperplane.
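For reference, here is one way to write out the problem just described as a math block; this is a reconstruction from the spoken description, in the style of the usual maximal margin formulation, so the notebook's exact notation may differ.

```latex
\begin{aligned}
&\max_{\beta,\; M} \quad M \\
&\text{subject to} \quad \lVert \beta \rVert_2^2 = 1, \\
&\qquad y_i \, (x_i^{\top} \beta) \ge M \quad \text{for all } i = 1, \dots, n,
\end{aligned}
```

where x_i denotes the i-th row of X, extended with a leading 1.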
So for example, beta_0 + beta_1 x_1 + beta_2 x_2 = 0 gives the formula for a line in 2D, and beta_0 + beta_1 x_1 + beta_2 x_2 + beta_3 x_3 = 0 is the formula for a plane in 3D. So X times beta gives you the hyperplane, and essentially you're setting things up so that y times this quantity has to be at least some margin away from the hyperplane. M is the margin, and you want the points to be as far away from the actual hyperplane as possible. OK.

We can see what this means by implementing it in sklearn. We're going to use LinearSVC, which stands for linear support vector classifier, so from sklearn.svm (SVM standing for support vector machine) we import LinearSVC. Now I'm going to make my model and then fit it, and there are a couple of arguments that might seem weird. The first thing you'll see is C=1000; we'll talk about C when we get to the very next support vector machine, which is where it comes from. You'll also see the argument max_iter=100000. LinearSVC uses an iterative algorithm to estimate the betas, probably something like gradient descent, so it essentially has to go through a loop, and max_iter increases the number of times it is allowed to go through that loop in order to find the optimal betas. So our classifier, which I've called max_margin, is LinearSVC with C=1000 and max_iter=100000, and then we fit the model with max_margin.fit(X, y).

The next bit of code makes a grid; what it actually does is let us plot both the separating line and the margin around it. So here, the solid line is our decision boundary, the linear separating line. Anything above and to the left of this line would be classified as a negative one; anything below it would be classified as a one. The dotted lines give the margin around our separating line: the distance between a dotted line and the separating line is the margin. You'll notice that a few of these points on either side actually touch those dotted lines; we've got three, maybe four, blue circles over here and these two orange triangles over here. This is actually where the term support vector machine comes from: these five or six points, four blue circles and two orange triangles, are called the support vectors. They're called the support vectors because they're what support the decision for the linear separating line; these are the points that are exactly M away from it. Now, I want to point out for the math people that this has nothing to do with the idea of the support of a function that you may be familiar with. These are support vectors in the sense that if you were to move any one of them, you might end up with a different linear separating line. They are the support because they are the limiting factor in increasing the margin or wiggling the separating hyperplane around any further. OK.
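A sketch of the fitting and plotting steps described above might look like the following; the contour-based margin plot is my own reconstruction of the notebook's helper code, and X and y are assumed to be the arrays from the earlier data sketch.

```python
from sklearn.svm import LinearSVC
import numpy as np
import matplotlib.pyplot as plt

# A very large C approximates the hard (maximum) margin classifier;
# max_iter is raised so the iterative solver has room to converge.
max_margin = LinearSVC(C=1000, max_iter=100_000)
max_margin.fit(X, y)

# Evaluate the decision function on a grid covering the feature space.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = max_margin.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.scatter(X[:, 0], X[:, 1], c=y)
# The 0 level is the separating line; the -1 and +1 levels trace out the margin.
plt.contour(xx, yy, Z, levels=[-1, 0, 1],
            colors="k", linestyles=["--", "-", "--"])
plt.show()
```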
So that's the maximum margin classifier. In practice, you're rarely going to encounter a situation where you can exactly separate the data with a straight line, so that all of class negative one is on one side and all of class one is on the other. That's the idea behind what's known as a support vector classifier, a general support vector classifier, or a soft margin classifier; we'll see the reason for the name in a second. You might have a problem where you simply can't separate the classes with a hyperplane. Or maybe you can separate them with a hyperplane, but the maximum margin classifier would end up very close to a training point, which means it might not generalize well. So maybe the training set you got is linearly separable, but you suspect it wouldn't be if you went out and collected more samples. Here is an example, the same data set as before, where I've shifted three orange triangles up here and two of the blue circles down here. Now we have something that we cannot separate with a line in the plane; it is not linearly separable, so we couldn't fit a maximum margin classifier.

So that's the idea here: we're going to set up a version of the problem where we allow some points to cross over. Above, we set up the problem so that none of the points could cross over to the other side; they had to stay on the side where their class would be predicted. Now we're going to allow some of those points to cross over, meaning we're essentially softening up the margin, and that's where the name soft margin comes from. Here is the constrained optimization problem. The first part should look familiar: we want to find beta and the maximal M subject to the squared Euclidean norm of beta being equal to one, and y_i times X_i beta being greater than or equal to M, but now times a new factor, one minus epsilon_i, where the epsilon_i depend on the observation. This is the new part that allows for going over the margin or over the separating line: each epsilon_i is greater than or equal to zero, and their sum has to be less than or equal to one over C. This is the C from above, and essentially it determines how much crossover you're going to allow. When we chose a large C above (I believe we chose 1000), the epsilons couldn't really be greater than zero, so we effectively ensured the maximum margin situation, because if epsilon_i is close to zero for all i, then the constraint is close to the hard margin one. What we're learning is that in sklearn there is no separate maximum margin classifier object; there's just the support vector classifier, the soft margin classifier, but we can essentially recreate the maximum margin classifier by choosing a large value for C. The smaller C is, the larger this bound becomes, and the more you allow points to cross over the margin or the separating line, meaning the more flexible your margin is.
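Written out, the soft margin problem described above looks roughly like this; again, this is a reconstruction from the spoken description (with the slack budget expressed as 1/C, as in the lecture), so the notebook's notation may differ.

```latex
\begin{aligned}
&\max_{\beta,\; \epsilon_1, \dots, \epsilon_n,\; M} \quad M \\
&\text{subject to} \quad \lVert \beta \rVert_2^2 = 1, \\
&\qquad y_i \, (x_i^{\top} \beta) \ge M (1 - \epsilon_i), \quad
\epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le \frac{1}{C}.
\end{aligned}
```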
These epsilon_i are called slack variables because they control how much observation i can violate the margin. Looking at them clearly: if epsilon_i is equal to zero, then observation i has to be on its own side of the margin, so if it's a negative one, it's where the negative ones are being classified, and if it's a one, it's where the ones are being classified. If epsilon_i is between zero and one, then observation i is on the wrong side of the margin; if we go back up to the picture, that would mean a negative one observation sitting inside the margin, but it has not yet crossed over the linear separating line, so it would not, for instance, be all the way over on the other side. And if epsilon_i is greater than one, observation i is on the incorrect side of the hyperplane; that's when it crosses over the separating line.

As I said before, the value C can be thought of as our budget. So what we're going to do is take this problem, train our LinearSVC on this data, and slowly decrease the value of C. As we slowly decrease the value of C, the budget gets larger and larger, meaning the classifier allows more and more crossover of both the margin and the linear separating line. So we have a for loop where C takes the values 10, 1, 0.1, 0.01, and 0.001; for each value I fit a soft margin classifier, a support vector classifier, to the data pictured here, where we have the two blue circles over here and the three orange triangles up here, and then I plot the same image I plotted before with this new data and this new classifier. So let me zoom back out. When C equals 10, this is what it looks like, and it is pretty close to the separating line we would have gotten above. But as we decrease C, the margin gets a little bit bigger, and we see more violations of the margin down here and more violations of the margin up here. Eventually all of the points fall within the margin; that's when C is 0.1. Then the margin gets so large that you can barely see it, and we start to see a lot more crossover of the separating line. And at this point, every negative one observation is on the incorrect side of the line, so this is where C is probably too small; we're giving up too much slack and allowing too much crossover. You would do something like cross-validation, or look at a validation set, to determine the best value of C for your problem. But this is the effect C has on the separating line, the margin, and which points are allowed to cross over.
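The loop described above might look something like this in code; as before, the plotting portion is a reconstruction, and X and y are assumed to hold the shifted, non-separable version of the data set.

```python
from sklearn.svm import LinearSVC
import numpy as np
import matplotlib.pyplot as plt

# Decreasing C enlarges the slack budget, so the margin widens and more
# points are allowed to violate the margin and the separating line.
for C in [10, 1, 0.1, 0.01, 0.001]:
    soft_margin = LinearSVC(C=C, max_iter=100_000)
    soft_margin.fit(X, y)

    xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                         np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
    Z = soft_margin.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.contour(xx, yy, Z, levels=[-1, 0, 1],
                colors="k", linestyles=["--", "-", "--"])
    plt.title(f"C = {C}")
    plt.show()
```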
OK. So that's actually going to be it for this video. As you can imagine, we're not always going to be in situations where a linear decision boundary is the best idea for our problem; we may have nonlinear problems, which is what we will see in the next video with general support vector machines. In the meantime, I hope you enjoyed learning about linear support vector machines, both the maximal margin classifier and the soft margin, or support vector, classifier. All right, I hope you enjoyed this video. I enjoyed having you watch it, and I hope you have a great rest of your day. Bye.