Interpreting Linear Regression Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video, we're going to talk about how you can interpret the results of a linear regression. Let me go ahead and share my Jupyter notebook and we'll get started. This notebook will not have a separate complete version, because all of the code you get is already filled in. I'm not going to do any additional coding that has been left blank or anything like that.
It's just already complete for you. So we have talked about linear regression, and we've built up some understanding of it in the context of predictive modeling. In this notebook, I want to talk a little about how we can use linear regression to make inferences about our data. In particular, we're going to talk about how we can use regression as a means of comparison.
We'll interpret the coefficient of simple linear regression and then mention some slight differences between interpreting simple and multiple linear regression. ... So one of the key ideas about regression is that it's actually about making comparisons. When it comes to making inferences about the dependent variable Y, we can use regression models to compare what we would expect for Y at the values of X that we're interested in.
This holds for simple as well as multiple linear regression. We're going to use the baseball data set. I'll point out that I'm not doing predictive modeling in this notebook, so we're just going to use the data set as it is; we're not going to make a train test split. In the past, we've regressed on either R or RA. In the present notebook, for simple linear regression purposes, I'm going to regress W solely on RD.
RD stands for run differential, which is your runs minus your runs allowed, and is a measure of a team's offensive performance in relation to its defensive performance. So if you have a positive RD, that means your team has scored more runs than it has given up, and vice versa. ... OK. So here is the relationship, visually, between wins and run differential.
So we still have a very strong positive linear relationship here. I'm just going to make a simple linear regression, regressing wins onto run differential, and then plot the training points along with the regression line. So what do we mean by the expected value of Y? If we recall, the regression model regressing Y on X is given by Y is equal to X times beta plus epsilon, where epsilon is random noise from a Normal(0, sigma) distribution
and is independent of X. So for a given value of X, meaning X equals X star, we can work out the conditional expectation of Y given that X is equal to X star. You just go through and apply the expectation across the equals sign. Expectation is a linear operator, so you can apply it across the addition.
We're left with the expected value of X star times beta, which is a constant because we've assumed that X is this constant X star, plus the expectation of epsilon, which is zero. So we're left with X star times beta. So what the fitted line above is actually giving us is the expected number of wins for any given value of run differential.
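Written out, the derivation just described looks roughly like the following. This is a sketch in the transcript's spoken notation; the notebook's own displayed equation is not captured in the transcript, so the exact typesetting here is an assumption.

```latex
\begin{align*}
E[Y \mid X = x^*] &= E[x^* \beta + \epsilon \mid X = x^*] \\
&= x^* \beta + E[\epsilon \mid X = x^*] \\
&= x^* \beta + E[\epsilon] && \text{($\epsilon$ is independent of $X$)} \\
&= x^* \beta && \text{($E[\epsilon] = 0$)}
\end{align*}
```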
So this is a model for predicting the expected value of wins if you know a run differential. A fitted regression model thus gives you a way to compare how different values of X may impact what you expect for Y. For instance, take a team with a 10 run differential: we just plug it into our model as a prediction. We would expect that a team with a 10 run differential, so 10 more runs scored than given up, should average about 82 wins, which I believe
is maybe one above a 50-50 season. As a note, sometimes it might be useful to do some sort of shift. Maybe instead of being interested in the number of wins, like 82, we're more interested in how many wins above an average team we are. What we can do is shift the number of wins to be wins above average.
The way we could do that is by making a new column: by subtracting the average value off of wins, we get wins above average. We now have a model that regresses wins above average onto the run differential. And we can use this to make inferences: OK, let's say hypothetically I have a run differential of this, what is my expected wins above average?
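As a rough illustration of the prediction and the shift to wins above average, here is a minimal Python sketch. It does not use the lecture's baseball data set or the notebook's code; the toy numbers and the helper `fit_simple` are made up for this example, and the data were picked so that the results land near the figures mentioned in the lecture.

```python
# Toy data, invented for illustration (not the lecture's baseball data set).
run_diff = [-150, -100, -50, 0, 50, 100, 150]
wins = [67, 69, 77, 81, 85, 93, 95]


def fit_simple(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Model 1: regress wins on run differential, then plug in RD = 10.
b0, b1 = fit_simple(run_diff, wins)
expected_wins = b0 + b1 * 10

# Model 2: shift the outcome by subtracting the average number of wins,
# then regress wins above average on run differential.
mean_wins = sum(wins) / len(wins)
wins_above_avg = [w - mean_wins for w in wins]
c0, c1 = fit_simple(run_diff, wins_above_avg)
expected_waa = c0 + c1 * 10

print("expected wins at RD = 10:", expected_wins)
print("expected wins above average at RD = 10:", expected_waa)
```

Shifting the outcome by a constant only moves the intercept; the slope is the same in both models, so the comparison between teams is unchanged.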
So, wins more or less than an average-level MLB team. Here we can see that a team with a 10 run differential should average about one win above average. So one more win than an average team is what you get with a 10 run differential. Again, while we did this for simple linear regression, the same thing holds for multiple linear regression, except now we'd have to put in more inputs than just a single
number. In addition to making comparisons about the outcomes, we can also interpret the coefficient in a nice way. The coefficient in simple linear regression gives you the slope of this line here. So beta one hat gives me the slope of this line, which allows me to say something about what I should expect to happen to my output if I increase or decrease my input in some way.
So here we interpret this beta one hat for this specific problem. We can see that if we increase our run differential by one (let me add some space here to make it easier for me to read), a one run increase in run differential should average us about 0.1 more wins. So if we want to add another win in the season, on average, we need to increase our run differential by more than just one.
Another nice way to do this is to put it into terms that a stakeholder would understand. Maybe they're interested in: how many more runs do I need to score, or what does my run differential need to go up by, if I want to increase my wins by a certain amount? Here, beta one hat is in terms of wins per unit of run differential.
If we flip it, we get run differential per win. So if we want to increase our wins by one, we can see that we need to increase our run differential by about 10. Essentially, every 10 extra points we get in run differential should increase our wins by one on average.
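The two readings of the slope, wins per run and runs per win, can be sketched in a few lines of Python. The toy data below are invented for illustration and are not the lecture's data set; the function name `expected_waa` is likewise just a label for this example.

```python
# Toy data, invented for illustration (not the lecture's baseball data set).
run_diff = [-150, -100, -50, 0, 50, 100, 150]
wins_above_avg = [-14, -12, -4, 0, 4, 12, 14]

n = len(run_diff)
x_bar = sum(run_diff) / n
y_bar = sum(wins_above_avg) / n

# Least-squares slope: expected wins gained per one-run increase in run differential.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(run_diff, wins_above_avg))
sxx = sum((x - x_bar) ** 2 for x in run_diff)
slope = sxy / sxx
intercept = y_bar - slope * x_bar


def expected_waa(rd):
    """Expected wins above average for a given run differential."""
    return intercept + slope * rd


# Comparing two teams one run apart recovers the slope, whatever the starting point.
one_run_gain = expected_waa(41) - expected_waa(40)

# Flipping the units: run differential needed per additional expected win.
runs_per_win = 1 / slope

print("wins per run of differential:", slope)
print("gain from 40 to 41:", one_run_gain)
print("run differential per win:", runs_per_win)
```

Reporting the reciprocal is only a change of units. It answers "how much run differential per win" using the same fitted line, not a new regression of run differential on wins.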
... And that is essentially the way we can interpret it. For the intercept, you just plug in a value of zero for the input. We can see that if you have a run differential of zero, then your team should average about zero wins above average. So an average team has zero as its run differential. ... You can also do this for multiple linear regression.
For this example, we're going to split up run differential again and make the model W is equal to beta zero plus beta one times R plus beta two times RA plus epsilon. Then we'll fit that model. The slight difference between simple and multiple linear regression is that before we had this really nice interpretation, right, that beta one was just a slope.
So it was very easy for someone to ask me: OK, if I increase run differential by this much, how many wins should that add? Now we have to be slightly more careful and keep track of what we're increasing and what we're keeping the same. For example, we can estimate the effect of increasing the team's runs by 10 while holding their runs allowed constant.
There are a couple of different ways to look at this. In the last model, we talked about what happens if I change run differential. Now that we've broken it up into its components, we can say: OK, I probably can't get any better at runs allowed, but I can maybe increase my runs by a certain amount. This sort of approach allows me to then say what the individual effects of each of these things are.
We can do that by plugging this into our model. If we want to increase our runs by 10 while keeping our runs allowed the same, we now have this new input, R star plus 10 comma RA star, and we compare it to the original baseline of R star and RA star. It's not important what R star and RA star are. What's important is the change between the baseline observation X star and the new observation.
And so you can plug this in. We're interested in the expected number of wins given that X is equal to the new observation, minus the expected number of wins given that X is equal to X star. And let me change that so that it's correct. OK. Then we can just plug the stuff in. So plugging in on the left gives us this, and plugging in on the right gives us this.
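The equations shown on screen at this point are not captured in the transcript. Under the model stated earlier, W equal to beta zero plus beta one times R plus beta two times RA plus epsilon, the plug-in step would look roughly like this reconstruction, which writes out both inputs explicitly rather than guessing the symbol used in the notebook:

```latex
\begin{align*}
&E[W \mid X = (R^* + 10,\ RA^*)] - E[W \mid X = (R^*,\ RA^*)] \\
&\quad = \bigl(\beta_0 + \beta_1 (R^* + 10) + \beta_2 RA^*\bigr)
       - \bigl(\beta_0 + \beta_1 R^* + \beta_2 RA^*\bigr) \\
&\quad = 10\,\beta_1
\end{align*}
```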
What we're left with once everything cancels out is 10 times beta one. So if a team increases their runs by 10 while maintaining their runs allowed, we'd expect an increase of 10 times beta one wins. And here we are interpreting that: if we score 10 more runs over the course of the season, we should expect to increase our wins by about 0.97.
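A small Python sketch can confirm that the baseline cancels out. The coefficients below are hypothetical values chosen only for illustration; they are not the estimates from the lecture's fitted model, and the function names are made up for this example.

```python
# Hypothetical fitted coefficients, chosen only for illustration; the
# lecture's actual estimates are not given in the transcript.
BETA_0 = 81.0    # intercept
BETA_R = 0.097   # coefficient on runs scored (R)
BETA_RA = -0.1   # coefficient on runs allowed (RA)


def expected_wins(runs, runs_allowed):
    """Expected wins under W = beta0 + beta1 * R + beta2 * RA."""
    return BETA_0 + BETA_R * runs + BETA_RA * runs_allowed


def effect_of_more_runs(extra_runs, runs, runs_allowed):
    """Change in expected wins when runs rise by extra_runs and RA is held fixed."""
    return expected_wins(runs + extra_runs, runs_allowed) - expected_wins(runs, runs_allowed)


# The baseline does not matter: the difference is always extra_runs * BETA_R.
baselines = [(650, 700), (700, 700), (800, 600)]
effects = [effect_of_more_runs(10, r, ra) for r, ra in baselines]

for baseline, effect in zip(baselines, effects):
    print(baseline, round(effect, 6))
```

Every baseline gives the same change in expected wins, which is why only the size of the increase in runs needs to be tracked, as long as runs allowed is held fixed.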
So if we want to add an additional win to the season, we're going to need to increase by more than just 10 runs, maybe like 11. For categorical variables, you do the same sort of process where you just plug in, doing a careful job of keeping track of which categorical variable is being set to one and which is being set to zero.
So you can do the same exact process for interpreting coefficients on categorical variables as well. So now you have an idea of how to interpret the model itself, seeing it as an expected value of Y given a value for all the features, as well as how to see what you would expect to happen to your output when you change various features.
And again, expect here means expectation. OK. All right. So that's it for this notebook. I hope you enjoyed learning about how you can interpret various linear regression models and coefficients. I hope to see you in the next notebook. Have a great rest of your day. Bye.