Regression Versions of Classification Algorithms Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video we're going to talk about regression versions of some of the classification algorithms that we now know. Let me go ahead and share my Jupyter notebook and we can get started.

A lot of the classification algorithms we learned in the classification section also have analogous setups for regression problems. In particular, we'll learn about the K nearest neighbors regression algorithm; the decision tree regression algorithm, which you can then use to build a random forest or extra trees regressor; and, finally, the support vector regression algorithm. As in previous regression videos and notebooks, we'll assume we have n observations of m features stored in a matrix X, along with n corresponding outputs. That's the theoretical setup; in the implementation examples we'll stick to a single feature, just because that's easiest to visualize.

The first algorithm is K nearest neighbors regression, and it works a lot like K nearest neighbors classification. To make a prediction for a point X*, you find X*'s K closest neighbors in the data space. For instance, if I have a point here and there are K neighbors next to it, we find those K neighbors and then take the average of their y values. Those neighbors are in the training set, so we have labels for them, and the prediction is just the average of those labels. It's basically the same idea as classification: there the "average" is usually just a simple vote — what's the majority class among the K neighbors — while in K nearest neighbors regression we have something we can actually take an arithmetic mean of.

We can implement this in sklearn with KNeighborsRegressor, whose documentation is linked here, and we're going to compare it on that baseball data we talked about a while ago. Remember, our baseball data looks like this, and we're going to regress wins on run differential. We'll make KNeighborsRegressor objects with K equal to 1 and K equal to 10 and show the difference in comparison to the simple linear regression we covered in the regression content.

The first thing we do is import the regressor: from sklearn.neighbors we import KNeighborsRegressor. Then we make our model objects. The setup is exactly the same for the regressor as it was for the classifier: you call KNeighborsRegressor and put in the number of neighbors, so for the object called knr_1 I put in 1, and for knr_10 I put in 10. There are other arguments you could set, but I'll leave it to you to experiment with those if you ever use this on a real problem. Then we fit both models, knr_1 and knr_10, and the .fit call is just the same as it's been.
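Roughly, the code being typed here looks like the sketch below. The variable and column names (ball_train, RD for run differential, W for wins) are my best reading of the notebook, so treat them as assumptions.

```python
# Sketch of the KNN regression setup described above.
# Assumes a training DataFrame `ball_train` (from the baseball notebook)
# with columns `RD` (run differential) and `W` (wins); names are assumed.
from sklearn.neighbors import KNeighborsRegressor

# one very flexible model (k = 1) and one smoother model (k = 10)
knr_1 = KNeighborsRegressor(n_neighbors=1)
knr_10 = KNeighborsRegressor(n_neighbors=10)

# sklearn wants a 2D feature array, hence the reshape(-1, 1)
X_train = ball_train.RD.values.reshape(-1, 1)
y_train = ball_train.W.values

knr_1.fit(X_train, y_train)
knr_10.fit(X_train, y_train)
```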
So we've got .fit with ball_train.RD.values — we need to remember that reshape — and then ball_train.W.values. We now have three model objects: two K nearest neighbors models, one with K equal to 1 and one with K equal to 10, plus a simple linear regression model, and we're going to compare their predictions.

On the left you can see what K nearest neighbors with K equal to 1 looks like: it's very bumpy, very noisy. That's because with K equal to 1 it just finds the single neighbor that's closest in run differential and predicts that neighbor's value. The K equal to 10 version is a little less bumpy, smoother in some sense. But you can also see that with a larger value of K, the predictions start to go off at the edges of the data space — near the boundary, whatever the dimension is (here it's a single dimension). Once you get out there, your predictions stagnate because you're running out of observations to consider: past roughly -275 in run differential you're stuck averaging the same 10 observations, which is why you get a constant value out there, and the same thing happens up around +250.

The larger you make K, the more points you're averaging over, so the more bias you should expect in your predictions. Similarly, the smaller your value of K, the more the model tries to fit the training data exactly, so the higher the variance. You can find the value of K that works best for your problem with cross-validation or a validation set — for K nearest neighbors, cross-validation would be a natural option (there's a quick sketch of that below) — and that lets you balance the bias-variance tradeoff.

Now, it definitely looks like simple linear regression would work better for this particular data set, so you might wonder why you would ever want to use K nearest neighbors regression. Simple linear regression is what's known as a parametric technique, because we're estimating a parameter beta, whereas K nearest neighbors is a nonparametric technique: we're not actually estimating any parameters, we're just using the training data directly and taking local averages. Sometimes nonparametric techniques are useful. For instance, if we had a data set where we couldn't confirm some of the assumptions of linear regression, or where the target doesn't look like a linear function of the features and you don't want to start guessing a bunch of nonlinear or polynomial transformations, it might be useful to use K nearest neighbors, because it will naturally pick up any nonlinearity in the data — with the caveat that you may overfit to the training data, and it maybe wouldn't generalize as well as a linear model.

So that was our first one. Our next classification-to-regression algorithm is decision trees: you can make a decision-tree-based regressor.
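Before we build the trees, here's a small sketch of the cross-validation idea mentioned a moment ago for choosing K. The candidate values of K and the scoring choice are illustrative assumptions, not something from the video.

```python
# Hedged sketch: cross-validating the number of neighbors.
# The grid of k values and the scoring metric are illustrative choices.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [1, 5, 10, 20, 40]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(ball_train.RD.values.reshape(-1, 1), ball_train.W.values)
print(grid.best_params_)
```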
Essentially — you can see I don't have a model written out here — the only difference between doing a decision tree for classification and a decision tree for regression is what the CART algorithm uses to make its choices. For classifiers, the CART algorithm goes through a random subset of features and cut points and finds the feature/cut-point pairing that most reduces the impurity, which was either the Gini impurity or the entropy. In regression, the CART algorithm does the same search, but instead of those impurity measures it tries to find the feature/cut-point pairing that gives you the greatest reduction in the mean squared error, or whatever cost function you want to use — typically it's the mean squared error.

So the tree is constructed in just the same way; the only thing that changes is that we're using MSE. Once it's constructed and you want to predict, you put a data point at the top of the tree, follow it all the way down, and see what bin it ends up in; then, just like K nearest neighbors, you take an average. Whatever bin it lands in — say there are 10 training observations in that bin — the prediction for X* is the arithmetic mean over those 10 observations: you add up the y values of every training observation in that bin and divide by the total number of observations in the bin.

In sklearn we can implement this with DecisionTreeRegressor: from sklearn.tree we import DecisionTreeRegressor, and then we make tree objects just the same as before. We're going to play around with the max depth here, showing the difference between a max depth of 1 and a max depth of 5. So we call DecisionTreeRegressor with max_depth equal to 1 for the first one and 5 for the second. The fit is exactly the same as before: fit on ball_train's run differential values, with the .reshape(-1, 1), and ball_train.W.values. Then we do the same kind of comparison we did above, but instead of K equal to 1 and K equal to 10, we have max depth equal to 1 and max depth equal to 5.

OK, so here it's kind of the opposite of what we got with K nearest neighbors. On the left we have a very biased model, because we're only allowing the tree to make a single cut. Essentially we make one cut, which looks like it happens at a run differential of zero, and then all the predictions to the left of zero are the average of the training points to the left of zero, and all the predictions to the right of zero are the average of the training points to the right of zero. Once we get to a max depth of 5, we get a lot more cuts — up to five levels of them — which gets us closer to what I'm guessing is probably the true relationship, this simple linear regression line. But we're also starting to see that we may be overfitting, because we're getting some weird jagged upticks that probably shouldn't be there, even though we're closer to the simple linear regression line, which I think is probably the better model. So again, you could use cross-validation to choose the maximum depth.
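For reference, the tree code described here would look roughly like the following sketch, with the same assumed ball_train columns as before.

```python
# Sketch of the decision tree regression comparison described above.
# Column names (`RD`, `W`) are assumed from the earlier baseball notebook.
from sklearn.tree import DecisionTreeRegressor

tree_1 = DecisionTreeRegressor(max_depth=1)  # a single cut: very biased
tree_5 = DecisionTreeRegressor(max_depth=5)  # more cuts: lower bias, higher variance

X_train = ball_train.RD.values.reshape(-1, 1)
y_train = ball_train.W.values

tree_1.fit(X_train, y_train)
tree_5.fit(X_train, y_train)
```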
If you've seen what a random forest is, we could then use these trees to make a random forest model on these data, as well as an extra trees model.

The last one we'll talk about is support vector regression, which is probably the most complicated, just because there's a lot of notation. The idea behind support vector regression really draws on the concept of margin that we talked about for support vector machines, where, remember, we were trying to get all of our classified points on either side of some separating line, and we didn't want them to cross a margin. Here we're essentially going to make the assumption that all of our observations fall within epsilon of the true relationship, f(X). The idea is that we don't want epsilon to be too large, because if we chose epsilon arbitrarily large, this would be true no matter what f(X) we estimate. So that's the basic assumption.

If we build it up like we did with support vector machines, the first step is the equivalent of the maximum margin classifier. We'll do this for linear support vector regression; there is a generalization that allows you to do that kernel "lifting" technique, but we're not going to explicitly cover its formulation here — it's basically the same as the classification counterpart. In the linear formulation of the algorithm, you assume that f(X) is equal to Xw, so you're essentially saying that f(X) is a hyperplane. Then you try to solve the problem where you want the weights of the hyperplane — the coefficients that make it up — to have as small a squared norm as possible, but you constrain yourself so that the absolute value (the norm) of the prediction minus the actual value is less than or equal to epsilon for all of the training observations. Essentially, you're trying to move the hyperplane around so that all of your observations fall within epsilon of it, while also keeping the weight vector small. This is the setup analogous to the maximum margin classifier.

As you might guess, this isn't always the best model, so sometimes you want the soft margin version, where you add in what are known as slack variables. (I always forget how to pronounce these Greek letters, so I'm just going to call them squiggly letters.) With these slack variables you're essentially allowing things to cross the margin, if that makes sense. You still want to minimize the squared norm of the weights, but now you allow yourself some slack: if some observations cross that epsilon border around the line, you add a small penalty term for them, and you do your best to minimize the whole objective without letting too many things cross over.
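To make that notation a bit more concrete, here is one standard way to write the linear support vector regression problems just described (the "squiggly letters" are the Greek letter xi, ξ). This is my reconstruction of the setup, not a formula copied from the notebook.

```latex
% Hard-constraint linear SVR: keep every residual within epsilon
\min_{w}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad \text{subject to} \quad
\lvert y_i - x_i w \rvert \le \varepsilon \quad \text{for all } i .

% Soft-margin version: slack variables xi_i, xi_i^* let some points fall
% outside the epsilon tube, at a cost controlled by a constant C
\min_{w,\,\xi,\,\xi^{*}}\ \tfrac{1}{2}\lVert w\rVert^{2}
  + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^{*} \right)
\quad \text{subject to} \quad
\begin{cases}
  y_i - x_i w \le \varepsilon + \xi_i, \\
  x_i w - y_i \le \varepsilon + \xi_i^{*}, \\
  \xi_i,\ \xi_i^{*} \ge 0 .
\end{cases}
```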
These are the constraints, and I think what makes it a little easier to see is this picture, without worrying too much about the details of the setup. The regression line is the line in the middle, and these are the epsilon margins around it. The goal of the algorithm is to wiggle this line around, by minimizing with respect to w, so that most of the points fit within the epsilon margin; the points that fall outside the epsilon margin of the regression line contribute to the thing you're trying to minimize. So essentially your goal — well, not yours, the computer's and the algorithm's — is to wiggle this hyperplane around so that as few points as possible go beyond the epsilon border. The points that do go beyond it — sometimes they're just so far out that you'd have a bad model if you tried to fit them in — simply get added onto the penalty term. So the soft margin version of the regression is trying to balance minimizing the weights while simultaneously not letting too many things cross the epsilon border.

The hyperparameter we're going to control in this model is epsilon, and what we'll do now is see the effect of different values of epsilon on that baseball data. The first thing we do is an import from sklearn: we're going to use LinearSVR. In general you could use SVR to do a general support vector regression, but we'll restrict ourselves to the linear one here because we're working on a problem that can be solved with a line. So from sklearn.svm we import LinearSVR. I've also written some code that will do some of the work for us: it takes in the data and the model, fits the model, and then plots this picture — the regression line, the epsilon borders around it, and the regular simple linear regression model as a comparison point. We'll do this for different values of epsilon, going from 0 all the way up to 1000 by powers of 10.

This is what it looks like when epsilon is 0. In this case all of the points cross the border, and essentially what we're doing is pretty similar to regular linear regression, which is why the line is so close to it. As we start to increase epsilon, the fitted line starts to depart: here, at epsilon equal to 10, we've already departed and gone a little crooked. And once epsilon is large enough that you can fit all of the points inside the margins relatively easily, you eventually get an algorithm that just wants to make a horizontal line — a horizontal hyperplane — because then all the weights can just be zero, so you'd essentially be predicting y equal to zero for all values of X. That's what happens when you make epsilon too large. So, as always, there's a tradeoff between choosing a large epsilon and a small epsilon, and you can use cross-validation to find the epsilon value that works best for you. I've also included some extra support vector regression resources, and there's a short code sketch of this epsilon comparison below as well.
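Here's a rough sketch of that epsilon comparison loop. The plotting helper from the notebook isn't reproduced, so this version just fits a LinearSVR for each epsilon and prints the fitted slope and intercept; the exact grid of epsilon values and the max_iter bump are my own assumptions.

```python
# Hedged sketch of the epsilon comparison described above.
# The notebook's plotting helper is not reproduced here; we just fit a
# LinearSVR for each epsilon and report the fitted line.
from sklearn.svm import LinearSVR

X_train = ball_train.RD.values.reshape(-1, 1)
y_train = ball_train.W.values

for eps in [0, 1, 10, 100, 1000]:  # assumed grid: 0, then powers of 10
    svr = LinearSVR(epsilon=eps, max_iter=10_000)
    svr.fit(X_train, y_train)
    print(f"epsilon={eps}: slope={svr.coef_[0]:.3f}, "
          f"intercept={svr.intercept_[0]:.3f}")
```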
So here are some links on support vector regression that might be useful if you want to understand more about the mathematical setup that we glossed over a little bit, because it's a lot of notation and I didn't want to spend too much time talking about it. I also want to point out — I did mention it as we were going through — that we focused on basically simple linear regression in all of these examples. All of these methods work with multiple predictors, and in particular support vector machines, as I said, have a way to deal with nonlinear data as well. So if you suspect that your underlying relationship is a nonlinear function, you might want to try a general support vector regressor (SVR) with some sort of kernel.

OK, so that's it for this video. I hope you enjoyed learning about regression versions of some of the classification algorithms we've learned, and I hope to see you in the next video. Have a great rest of your day. Bye.