Linear Regression Diagnostic Plots Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video we're gonna talk about some diagnostic plots that can help you with your linear regression modeling. Let me go ahead and share my Jupyter notebook and we'll get started. So in this notebook, we're gonna talk about some diagnostic plots that can help you see when there maybe are some issues with your model, some issues with your modeling assumptions, or just some nice tips and tricks for model building. As a reminder, all of these plots are gonna involve the residual, so let's remind ourselves what a residual is. We're gonna regress y on m features that are stored in X, and our model for this, as we've been saying all along, is y = X beta + epsilon, where epsilon is drawn from a normal(0, sigma) distribution and is independent of X. So epsilon, as we've said, is our random error, and then we make n observations (x_i, y_i) which we use to fit the model above. The residuals for this model are given by the following: the i-th residual is the i-th observation minus the i-th prediction, y_i - y_i hat. Writing this out from the model, y_i equals x_i beta + epsilon_i, and then you subtract off the fit, which is x_i beta hat. Working it out, the residual is equal to x_i (beta - beta hat) + epsilon_i. And so, if our model is good and our assumptions hold, then beta - beta hat should be close to zero, which means that our residual y_i - y_i hat should be approximately the same as the random draw from the normal(0, sigma) distribution.
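The residual computation described above can be sketched in a few lines of NumPy. This is a minimal illustration of my own (simulated data, not the notebook's code), with arbitrary coefficient and sigma choices:

```python
import numpy as np

# Minimal sketch: compute residuals by hand for a well-specified model.
# All numbers here are illustrative choices, not the lecture's values.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
eps = rng.normal(0, 2.0, size=n)       # epsilon ~ Normal(0, sigma), sigma = 2
y = 3.0 + 1.5 * x + eps                # true model really is linear in x

# Fit y = beta0 + beta1 * x by ordinary least squares.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# The i-th residual is the i-th observation minus the i-th fitted value.
residuals = y - X @ beta_hat

# With a good model, the residuals behave like fresh draws from
# Normal(0, sigma): mean near zero, standard deviation near sigma.
print(residuals.mean(), residuals.std())
```

Because the design matrix includes an intercept, the residuals average to exactly zero; their spread recovers the noise level sigma.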
And so, if we have a good enough model and our assumptions are holding, then we should have a residual that is more or less distributed according to a normal(0, sigma) distribution. Remember, essentially what we're seeing is the model in this image: for a particular value of x, you go to the blue line and then randomly draw an error from this normal curve. What we end up with is this data here, which we use to estimate; the estimate here is this solid black line. The residuals are these red dotted lines, the distance from the actual value to the predicted value. The idea here is that these red dotted line distances should be close to whatever the original draws from the normal(0, sigma) distribution were. That's the hope. And so if our model is good, what we should see when we make residual plots, which are plots where we put the residual on the vertical axis, is a nice uniform band of dots. OK, so that's what we're gonna be looking for. So what can be useful when building a model is a series of residual plots, and we're gonna talk about two kinds of plots that are typically investigated. The first type of plot puts the residuals on the vertical axis and various features on the horizontal axis. This first type of plot can tell you a few different things. One thing it can tell you is whether or not you're missing some signal from the data. What do we mean by signal? It means that the data is telling us something that we have not included in our model. Here's gonna be a very simple example: we're gonna have x, and then y is gonna be equal to x squared plus some random noise. And now this is that data set.
So here's our y and our x, and we can see the x squared shape, but what we're gonna do is incorrectly regress y on x and not include an x squared term. So here I fit a simple linear regression of y on x alone, not x squared, just x. And here's what we got: this is our simple linear regression line, and here's our observed data. You can see that in the middle we're over-predicting on these points, and near the outskirts of the line we're under-predicting, meaning that our residuals should be quite positive on the outside and somewhat negative on the inside. That gets reflected in the residual plot, where we can see that the residuals more or less have the same relationship with the variable x that y did. So when you see a residual plot like this, where it looks almost as if the residuals are a function of x as well, that means that you've missed some feature that you should be including in the model. What we're hoping to see is a uniform band, here between maybe negative 10 and 10, that does not have a clear pattern. So when you see something like this in your residual plot, that should be a sign that you should check: one, is there a variable that I've missed in the data set that I should include? Or two, should I maybe try including a polynomial transformation or some other nonlinear transformation of the data? In this setting we know it's a polynomial, so now we're gonna do that polynomial regression of y on x and x squared. And this is what our residual plot looks like now: it's the nice random blob that primarily falls between 10 and negative 10. Why do we know it should be 10 and negative 10?
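The missing-signal example above can be sketched as follows. This is my own simulation rather than the notebook's exact code; I've used sigma = 5 as in the lecture, and an x range of my own choosing. One way to see the leftover pattern without a plot is to check how strongly the residuals still correlate with the omitted x squared term:

```python
import numpy as np

# Sketch of the missed-quadratic example: the truth is y = x^2 + noise,
# but we first (incorrectly) regress y on x alone.
rng = np.random.default_rng(1)
n = 300
x = rng.uniform(-4, 4, size=n)
y = x**2 + rng.normal(0, 5.0, size=n)   # sigma = 5, as in the lecture

def fit_residuals(design, y):
    """OLS fit; returns the residual vector y - y_hat."""
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta_hat

# Wrong model: y ~ 1 + x.  The residuals inherit the x^2 shape.
r_wrong = fit_residuals(np.column_stack([np.ones(n), x]), y)

# Corrected model: y ~ 1 + x + x^2.  The residuals become a plain band
# whose spread matches sigma = 5, hence the +/- 10 band on the plot.
r_right = fit_residuals(np.column_stack([np.ones(n), x, x**2]), y)

# The wrong model's residuals still correlate strongly with x^2;
# the corrected model's residuals do not.
print(abs(np.corrcoef(r_wrong, x**2)[0, 1]))
print(abs(np.corrcoef(r_right, x**2)[0, 1]))
```

Plotting `r_wrong` against `x` would show the parabola-shaped pattern from the video; `r_right` against `x` gives the uniform band.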
Remember, my standard deviation was five, and a normally distributed variable tends to fall within two standard deviations of its mean, so between negative 10 and 10. ... OK. Another thing that we can assess with this type of plot is the assumption of equal variance. Remember, all the way up here when we defined the model, we assumed that the errors were normally distributed, that all of the epsilons had the same variance, and that they were independent of the value of x. What we're gonna see here is that with these plots we can assess the assumption of equal variance as well as the assumption of independence from x. So here are gonna be some errors that are not drawn from the same distribution, but instead have a variance that depends upon the value of x. So they don't have the same variance, and they're not independent of the value of x. This is what it looks like: here's y, here's x. Now I'm gonna fit a simple linear regression to it, and here's my residual plot. It's slightly hard to see, but what's going on in this residual plot is that you can see sort of a funnel. What do I mean by a funnel? Well, the data on the left has less spread than the data all the way on the right, so it opens up and opens down as you move to the right, like a funnel shape. When you see a funnel shape like this, it could mean (it doesn't necessarily mean) that the variance of epsilon is not the same across all values of x. There are some approaches that you can take for this, one of which is addressed in the practice problems notebook. Also, we will note that when you have an instance of unequal variances like this, it is sometimes referred to as heteroskedasticity.
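A quick sketch of the funnel situation, using my own simulated numbers (not the notebook's): the noise standard deviation is made to grow with x, and the funnel shows up numerically as the residual spread being much larger at high x than at low x.

```python
import numpy as np

# Sketch of heteroskedastic errors: the noise standard deviation grows
# with x, violating the equal-variance assumption.
rng = np.random.default_rng(2)
n = 400
x = rng.uniform(0, 10, size=n)
eps = rng.normal(0, 1.0, size=n) * (0.5 + x)   # sd depends on x: the funnel
y = 2.0 + 3.0 * x + eps

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# On a residual-vs-x plot this is the funnel; numerically, residuals
# over small x are much tighter than residuals over large x.
spread_low = residuals[x < 3].std()
spread_high = residuals[x > 7].std()
print(spread_low, spread_high)
```

A common remedy (one is covered in the practice problems notebook the lecturer mentions) is to transform y or use weighted least squares; the sketch above only demonstrates the diagnostic itself.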
So "hetero" means different, and "skedastic" refers to the scatter or dispersion, so when we have differing variances for all of these errors, that's heteroskedasticity. Another thing you can do with these residual plots is sometimes notice that you've missed an interaction term. So here we generate X as two random variables, and then y is x1 times x2 plus some random noise. What I did was fit a regression model regressing y on X, and here are the residuals for that model. Here we're not seeing that nice uniform band we'd want to see; it looks like the uniform band should be between one and negative one, and we're not seeing that. What we're seeing is this sort of weird cross behavior, both on the x1 plot and the x2 plot. And when you see weird behavior like this, where it's kind of like a bow tie or an X or a cross, that's usually a sign that you're gonna want to add an interaction term. And so if we add our interaction term ... and now look at the residual plots, we can see that they do tend to fall in that uniform band between one and negative one. So any time you look at a residual plot and you see this sort of weird crisscross pattern, an X, or a bow tie, that's usually a sign that you're missing an interaction term. ... OK. So that was the first type of residual plot: you plot your residuals against one of your features. The second kind of residual plot is your residuals versus your predicted values. In this case, you plot y minus y hat on the vertical axis, and on the horizontal axis you plot y hat. So that's residual versus predicted.
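The missed-interaction example above can be sketched like this (again my own simulated numbers, with sigma chosen so the expected band is roughly plus or minus one). Without the x1*x2 column, the leftover interaction signal is what draws the bow-tie pattern on the plots; with it, the residuals are a plain band:

```python
import numpy as np

# Sketch of the missed-interaction example: y = x1 * x2 + noise.
rng = np.random.default_rng(4)
n = 500
x1 = rng.uniform(-1, 1, size=n)
x2 = rng.uniform(-1, 1, size=n)
y = x1 * x2 + rng.normal(0, 0.5, size=n)   # sigma = 0.5, band about +/- 1

def fit_residuals(design, y):
    """OLS fit; returns the residual vector y - y_hat."""
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta_hat

# Without the interaction term, the residuals still carry the x1*x2
# signal (the bow-tie / crisscross on the residual plots).
r_without = fit_residuals(np.column_stack([np.ones(n), x1, x2]), y)

# With the interaction term included, that signal is absorbed.
r_with = fit_residuals(np.column_stack([np.ones(n), x1, x2, x1 * x2]), y)

print(abs(np.corrcoef(r_without, x1 * x2)[0, 1]))  # clearly nonzero
print(abs(np.corrcoef(r_with, x1 * x2)[0, 1]))     # essentially zero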
And these plots are useful in cases when you have so many features that it would be infeasible to make residual versus feature plots for all of them. And we can point out that you should be able to see most, if not all, of the things we pointed out that you can see with the residual versus feature plots. So here's that first example we talked about, where we had an x squared that we were missing. This is the incorrect simple linear regression model, just a regular simple linear regression of y on x, and you can see the clear pattern here, right? And then here's the correct model, where you can see that this is more or less the random blob between 10 and negative 10. ... You can also see unequal variance here as well. This is the same exact problem as the above plot with the features, but now on the residual versus predicted plot, and here, once again, we have a funnel opening up. If we were to go correct it, we should be able to get a random band between 10 and negative 10 as well. Again, we do not cover how to correct it in this Jupyter notebook or this video, but I do touch on it in the practice problems notebook associated with the regression content. ... OK. And then finally, you can see missed interaction terms in these plots as well. This is the same model and problem as before, randomly regenerated, so not the same exact points. On the left we see the fit where we did not include the interaction term, and you again see this sort of weird crisscross pattern, which now almost looks like a weird letter M. Then on the right we have added in the interaction term, and once again we get the random blob between one and negative one that we're looking for.
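A residual-versus-predicted check can be sketched as follows (my own simulation of the missed-quadratic setup, not the notebook's code). Instead of eyeballing the plot, the sketch bins the residuals by predicted value: a well-specified model gives bin means near zero everywhere, while here the middle bin sits well below zero and the outer bins sit above it, which is exactly the curved pattern you would see on the plot.

```python
import numpy as np

# Sketch of a residual-versus-predicted diagnostic.  One plot against
# y_hat can stand in for many per-feature plots.
rng = np.random.default_rng(3)
n = 600
x = rng.uniform(-4, 4, size=n)
y = x**2 + rng.normal(0, 5.0, size=n)   # true model has an x^2 term

X = np.column_stack([np.ones(n), x])    # misspecified fit: no x^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
residuals = y - y_hat

# Bin the residuals by predicted value.  The nonzero bin means reveal
# the leftover pattern a plot of residuals vs y_hat would show.
terciles = np.quantile(y_hat, [1 / 3, 2 / 3])
low = residuals[y_hat <= terciles[0]].mean()
mid = residuals[(y_hat > terciles[0]) & (y_hat <= terciles[1])].mean()
high = residuals[y_hat > terciles[1]].mean()
print(low, mid, high)
```

To draw the actual plot in a notebook, you would scatter `y_hat` on the horizontal axis against `residuals` on the vertical axis.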
So one thing you might be asking yourself, and it's a natural question, is: why residual versus predicted plots? A very common mistake is to instead plot residuals against the actual values, y minus y hat against y. Occasionally you might get more or less the same plots with that approach, but it's not a good idea, because we are not making an assumption in the model setup that the random errors are independent of the value of y. So even if you've made a perfect model, a residual versus actual plot could still display something that is not the random blob we're looking for, which would lead you to do some tinkering with your already perfect model. Again, this comes from the fact that we're not assuming that epsilon is independent of the value of y in the model setup. We do assume that it's independent of X, and remember, y hat, the predicted value, is a function of X, not of y. So here are some major takeaways from this notebook. Residual plots can help you make model assessments, which can help you check whether your assumptions are wrong or whether you're missing something, which can improve both inferential and predictive modeling. One type of residual plot is to plot the residuals against the features; another is to plot them against the predicted values. In either kind of plot, what you're looking for are random blobs of points that tend to fall between the negative and positive values of some number. Why is this? Well, that's because normally distributed variables tend to fall within two standard deviations of their mean; that's where about 95% of the data is. OK. So that's it for this notebook. I hope you enjoyed learning about linear regression diagnostic plots, mainly residual plots, and I hope to see you in the next video.
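The residual-versus-actual pitfall described above can be demonstrated directly (my own simulation). Even for a perfectly specified model, OLS residuals are orthogonal to the fitted values but not to the actual y, because the noise is itself part of y, so a residual-versus-actual plot shows a tilt that could fool you into "fixing" a correct model:

```python
import numpy as np

# Sketch: why residual vs actual misleads even for a perfect model.
rng = np.random.default_rng(5)
n = 500
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 3.0, size=n)   # the fitted model IS the truth

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
residuals = y - y_hat

# OLS residuals are orthogonal to the fitted values, so residual vs
# predicted is patternless.  But residuals still correlate with the
# actual y, so residual vs actual shows a tilt for this perfect model.
print(np.corrcoef(residuals, y_hat)[0, 1])   # essentially zero
print(np.corrcoef(residuals, y)[0, 1])       # clearly positive
```

The positive correlation with y equals sqrt(1 - R^2) here, so it only vanishes when the fit is noiseless, which is exactly why predicted values, not actuals, belong on the horizontal axis.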
Have a great rest of your day. Bye.