Polynomial Regression and Nonlinear Regression Video Lecture Transcript. This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video we're going to talk about regression some more, with polynomial regression and nonlinear transformations. Let me go ahead and share my Jupyter notebook. In this notebook we will introduce the concept of polynomial regression, review interaction terms between continuous features, and then discuss adding nonlinear transformations of your features at the end. We're going to demonstrate this by working through a very simple data set stored in our data folder as poly.csv. This data set has columns X1, X2, and Y, where X1 and X2 are the features and Y is the desired output. I'll make a note that in this video I'm not going to do train test splits, cross validation, or a validation set; those are for comparing predictive models, and here I just want to illustrate how polynomial regression works with this data set. If you're interested in trying to find the best predictive model for this data set, you can do that in a separate notebook on your own time.

So we're first going to make a plot to look at the relationships between X1, X2, and Y, using pandas' scatter_matrix function, which is a nice feature. Here is what it looks like. In the scatter matrix, each row of the grid uses one column of the data frame as the vertical axis, and each column of the grid uses one column of the data frame as the horizontal axis. In the cells where the same variable is both the vertical and horizontal axis, you get a histogram instead of a scatter plot. What we're interested in are the plots where Y is the vertical axis, so we go down to the bottom row. There we can see that there seem to be relationships between X1 and Y as well as between X2 and Y. The relationship between X2 and Y may in fact be a linear one, but the relationship between X1 and Y appears not to be linear; it may be something else. For instance, maybe it's a quadratic, meaning Y is proportional to X1 squared, or a quartic, meaning Y is proportional to X1 to the fourth. These features are not present in our original data set, right? We only have X1 and X2. But there's nothing that keeps us from making one of these features ourselves. For instance, we can make X1 squared relatively quickly: we take the X1 column and square it, and now we have a data set that has X1 squared in it. If we go back down to the row that corresponds to Y, we can see that Y appears to have a much more linear relationship with X1 squared than it had with X1. So we can take advantage of our suspicion that there is a quadratic relationship between Y and X1, meaning the shape of a parabola in X1, by including beta_0 + beta_1*X1 + beta_2*X1^2 in a model. And remember, we still think there may be a linear relationship between Y and X2, so we also include a beta_3*X2 term, plus our random error.
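Here is a minimal sketch of the steps just described. The file path data/poly.csv and the lowercase column names x1, x2, and y are assumptions based on how the data set is described in the video, not code taken from the original notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# load the data set with features x1, x2 and target y
# (path and column names are assumptions based on the transcript)
df = pd.read_csv("data/poly.csv")

# scatter matrix: each row of the grid uses one column as the vertical axis,
# each column of the grid uses one column as the horizontal axis,
# and the diagonal cells show histograms instead of scatter plots
scatter_matrix(df, figsize=(8, 8))
plt.show()

# the relationship between y and x1 looks quadratic, so add an x1 squared column
df["x1_sq"] = df["x1"] ** 2

# replot to check that y versus x1_sq now looks roughly linear
scatter_matrix(df, figsize=(10, 10))
plt.show()
```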
And so this should be familiar: we import LinearRegression, we make our regression object, and then we fit the object. Now, here's something we haven't done yet. One of the tricks that you may have learned in a statistics course, or that you'll learn today, is to make what are known as residual plots. We'll talk more about these in a later notebook, but here I'm introducing the idea to show their utility. A residual plot is when you plot your residuals, meaning your actual values minus your predicted values, against the actual values. So you have the residual, actual minus predicted, on the vertical axis and the actual values on the horizontal axis. The idea is that if our model is a close approximation of the true relationship, then actual minus predicted should be approximately equal to the error term. So if our model is good, we should expect all of these residuals, which again are actual minus predicted, to look like draws from a normal distribution, and they should fall in a roughly uniform band around the horizontal axis. If we see points that depart from this pattern, it tells us that our model is missing some signal in the data that could help explain or predict Y. As I said, we'll talk more about residuals in depth in a coming notebook.

So here's our residual plot. We have our actual values on the horizontal axis and Y minus Y hat, our residuals, on the vertical axis. We can see that this is not a uniform band; it displays a very distinct pattern, almost like the crossing pattern I'm highlighting with my mouse. When we see something like this instead of that uniform band along the horizontal axis, it's an indicator that we're missing some input to our model. In this setting we don't have any other features in the data set that we aren't already including, and whenever you see this sort of crisscross shape, that's a sign that maybe you want to include an interaction term between some of your features. The features we have are X1 and X2, so we're going to try adding an interaction term between X1 and X2, remembering that an interaction term just means we multiply two columns together. This is what our new model looks like: Y = beta_0 + beta_1*X1 + beta_2*X1^2 + beta_3*X2 + beta_4*X1*X2 plus the error. In the code we make a new column in the data frame that is the interaction term, X1 times X2. Then once again we make a new regression object for this new model and fit it. You might also notice that I've started adding .values here. The reason is that sklearn has been updated so that if you don't use .values and you just pass in a data frame, it records all of the column names, which can be nice, but then in later notebooks it gives us warnings when we try to make predictions on plain arrays. So I've started to put .values here because I prefer to fit without the column names being stored. I'm just explaining what I'm doing with my code. OK.
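Here is a hedged sketch of the fits and the residual plot described above. It assumes the data frame df and the columns x1, x1_sq, x2, and y from the earlier sketch; those names are assumptions, not the original notebook's code.

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# fit y = beta_0 + beta_1*x1 + beta_2*x1^2 + beta_3*x2 + error,
# using .values so sklearn does not store the column names
reg = LinearRegression()
reg.fit(df[["x1", "x1_sq", "x2"]].values, df["y"].values)

# residual plot: residual (actual minus predicted) against the actual values
residuals = df["y"].values - reg.predict(df[["x1", "x1_sq", "x2"]].values)
plt.scatter(df["y"].values, residuals)
plt.axhline(0, color="black")
plt.xlabel("actual y")
plt.ylabel("residual (y - y_hat)")
plt.show()

# the crisscross pattern suggests a missing term, so add the interaction x1*x2
df["x1x2"] = df["x1"] * df["x2"]

# refit with the interaction term included
reg_inter = LinearRegression()
reg_inter.fit(df[["x1", "x1_sq", "x2", "x1x2"]].values, df["y"].values)
```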
And so now we're going to plot the new residual plot for this model, again using .values. Here's what the residual plot looks like now, and we can refresh ourselves on what it looked like for the original model without the interaction term: it was that weird crossing shape. Now it's the uniform band that I told you we should expect. Don't worry about this one point up here; sometimes you will randomly get an observation with a high residual, and that's fine. So this is a much better plot, which suggests that we have maybe found all the signal in the data set that we can find. I just want to point out that this may seem somewhat mystical if it's your first time seeing it. Don't worry, that's a perfectly natural feeling. The key point, which we'll see in a later notebook and which I'll talk about more then, is that making these residual plots can help improve the fits of our models. It can help us see that maybe we need to create another polynomial term or an interaction term. We'll talk more about these in a later notebook, so if it seems mystical now, hopefully it gets demystified there; just know that checking residual plots is a standard part of model building for regression.

OK, so let's go back and look at the coefficients of the model we just fit; there is a short sketch of this right after this paragraph. From this we see the coefficient on X1, the coefficient on X1 squared, the coefficient on X2, and the coefficient on X1 times X2. The coefficient on X1 is somewhat close to zero, not exactly zero, but somewhat close, and so we might be tempted to remove X1 from the model and see if that improves the fit. This might be especially tempting if, for instance, we knew ahead of time, which we don't, that this was the true relationship between Y and X1 and X2. But in practice and in the real world there's no way you would know that ahead of time; you're not going to know the true relationship between the target and the features, because otherwise you wouldn't need to do the regression, you would already know it. So why do we want to include X1 even if we suspect it has a zero coefficient? The idea is this: imagine in practice we had a variable Y where the only relationship between Y and X1 and X2 is that Y is proportional to X1 squared, so Y is some multiple of X1 squared plus random error. If we didn't include X1, we would be limiting ourselves to only fitting parabolas of the form constant plus beta_1 times X1 squared, which leaves out a large number of possible parabolas. In order for our model to be as flexible as possible, we need to do what is called respecting the hierarchy when we're modeling with polynomial features and interaction terms. What this means is that if you're going to include a polynomial term such as X1 squared, you need to include all the powers of that term below it. So if we have an X1 squared, we need to include an X1; if we had an X1 cubed, we would want to include X1 squared and X1 as well. This works the same for interaction terms.
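Here is a hedged sketch of inspecting the fitted coefficients mentioned above. It assumes the reg_inter object and column order from the earlier sketch; those names are illustrative assumptions rather than the original notebook's code.

```python
# coefficients are stored in .coef_ in the same order as the columns passed to .fit()
coef_names = ["x1", "x1_sq", "x2", "x1x2"]
for name, coef in zip(coef_names, reg_inter.coef_):
    print(name, coef)
print("intercept", reg_inter.intercept_)

# Even if the coefficient on x1 comes out close to zero, we keep x1 in the model
# to respect the hierarchy: including x1 squared without x1 would restrict us to
# parabolas of the form constant + beta * x1^2.
```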
The same goes for interaction terms: if we include X1 times X2, we need to include both X1 and X2 on their own as inputs as well. That's what it means when I say we're going to respect the hierarchy of modeling: if you have a polynomial transformation of a feature or an interaction term between two features, you include the lower-order terms too, so you have the most flexible version of whatever polynomial you're looking at.

So that's it for polynomial regression. As a final quick touch, nonlinear transformations are also something people will do. Sometimes you may want other forms of input: maybe you want to take X1 and look at its square root or its log. Sometimes people include trigonometric transformations as well, or exponential transformations like e to the X1. All of these are possible; you just need to code them up, for example making a new column equal to np.exp of the X1 column, as in the short sketch at the end of this transcript. There will likely be an example of this in a problem session notebook that you can look at to see what we're talking about. So just like we made polynomial transformations for this model, you can also make nonlinear transformations just by applying them to the columns. OK, so that's it for this notebook. I hope you enjoyed learning about polynomial regression and nonlinear transformations, as well as that nice residual plot trick that we'll learn more about in a later notebook. All right, have a great rest of your day. I hope to see you in the next notebook. Bye.
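As a quick illustration of the nonlinear transformations mentioned above, here is a hedged sketch. The transcript only describes the exponential column verbally; the other column names and the guards against negative values are illustrative assumptions.

```python
import numpy as np

# exponential transformation of x1, as described in the video
df["exp_x1"] = np.exp(df["x1"])

# other common nonlinear transformations (illustrative; the abs/shift guards
# are assumptions to keep sqrt and log defined if the columns can be negative)
df["sqrt_x1"] = np.sqrt(df["x1"].abs())
df["log_x2"] = np.log(df["x2"].abs() + 1)

# these new columns can then be fed to LinearRegression just like x1_sq and x1x2
```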