Categorical Variables and Interactions
Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video, we continue to build upon our regression tool set with categorical variables and interactions. Let me go ahead and share my Jupyter notebook and we can get started. We're going to build upon what we just learned about multiple linear regression and see how we can include categorical variables, as well as what are known as interaction terms, in our regression model. We will start by introducing a new data set that contains data on beer. Through this, we will show how to incorporate categorical variables into the multiple linear regression model, demonstrate what's known as one-hot encoding, and then, at the end, discuss interaction terms and how they impact the model.

So we're going to start just by learning about this new data set, which will then motivate us to want to know about categorical variables. This data set is stored in a file called beer1.csv, but I'm going to import it as beer. I make my train-test split, and then here is a sample from the training set. We have four columns, and each row of this data set represents a beer. We have the IBU, which stands for International Bitterness Units and has to do with the flavor of the beer. We have the ABV, which is the alcohol by volume; a user rating for the beer, which comes from the website the data was scraped from; and a beer type, whether it's an IPA or a stout. So those are our two types of beer: we're going to have IPAs and stouts in this data set.

In this problem, we want to be able to predict the bitterness of the beer using just the ABV as well as the beer type. We're going to start by looking at any possible relationship that might exist between the two, and we're just going to start with a plot. Here is ABV, alcohol by volume, on the horizontal axis and IBU on the vertical axis. It does seem here like maybe there is some sort of positive linear relationship: the higher the alcohol content of the beer, the higher the bitterness units of the beer. So from this plot we have two models to think about. The first, of course, is our baseline, where we're just going to say that the IBU of the beer is the expected value plus some random noise: IBU = E[IBU] + ε. And then the standard simple linear regression model: IBU = β₀ + β₁ ABV + ε. That's what we get from this first plot.

Now we're going to make some more plots where we look at the categorical variable: that's the beer type column. The beer type may or may not have an impact on the IBU; maybe some types of beer tend to have higher bitterness than others. So we're going to investigate this with a couple more plots. One nice plot that I like to use is called the swarm plot, and here's what that looks like. For the categorical column you're interested in, a swarm plot will plot every observation of the data set. So every beer that is a stout is plotted along with its IBU; for instance, this stout has an IBU of under 20. The same thing for IPAs over here: this IPA, for instance, has an IBU of 120.
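The notebook itself isn't shown in the transcript, so for reference, here is a minimal sketch of what this setup might look like. The file name beer1.csv comes from the video, but the exact column names ("IBU", "ABV", "Beer Type") and the split parameters are assumptions for illustration, not the instructor's actual code:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Load the beer data (file name from the video; columns are assumed)
beer = pd.read_csv("beer1.csv")

# Set aside a test set before exploring the data
beer_train, beer_test = train_test_split(beer, test_size=0.2, random_state=216)

# Swarm plot: every training observation's IBU, grouped by beer type
sns.swarmplot(data=beer_train, x="Beer Type", y="IBU")
plt.show()

# Scatter plot of IBU against ABV, colored and styled by beer type
sns.scatterplot(data=beer_train, x="ABV", y="IBU", hue="Beer Type", style="Beer Type")
plt.show()
```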
From this, we may get a sense that IPAs tend to have higher IBUs than stouts. We can also look back at that scatter plot we made and color the points by whether each beer is an IPA or a stout. We want to do this because maybe what's going on isn't that IPAs have higher IBUs than stouts, but rather that IPAs just tend to have higher ABVs. So here we've recreated that plot with the points colored: our IPAs are orange triangles, which tend to live up here, and our stouts are blue circles. It does look to me like the relationship between ABV and IBU may be different for IPAs and stouts, which suggests that we may be able to make a better model if we can include this information in the models we're considering.

So how can we include a categorical variable? Beer type is a categorical variable: it can be IPA or stout. How can I include this data in a model, both theoretically and then practically with code? What you need to do is called one-hot encoding. Currently, beer type is a column stored as strings. That's nice for us as readers, because we can go through the column and say, OK, this one's a stout, this one's an IPA. But string input is not good for actually fitting models on a computer; models need numbers, a float or an integer. What one-hot encoding does is allow you to take a categorical variable and represent it as a new collection of 0-1 variables. For stout, you'd have a variable that is 1 if the beer is a stout and 0 if it's an IPA, and vice versa if you encoded IPA instead.

To make it more formal, suppose we have some variable x that has K unique categories. With one-hot encoding for regression, you create K − 1 indicator variables, denoted 1_j, where 1_j = 1 if x = j and 1_j = 0 if x ≠ j. For the beer data set, this could be 1_stout or 1_IPA. So why do we only need K − 1 of these variables? Because of the process of elimination: if we know that all K − 1 indicator variables are equal to zero, the only remaining option is the Kth category, so that information is absorbed by the model without needing an explicit variable for it. For instance, in this situation, if a beer is a stout, I know for a fact that it can't be an IPA; and if a beer is an IPA, I know that the indicator for stout must be zero. So because of that process of elimination, we only need K − 1 indicator, or dummy, variables for any variable with K unique categories.

We're going to demonstrate how to do this in Python by making a dummy, or indicator, variable for stouts; then we'll work that into a multiple linear regression model. We do this with the pd.get_dummies function, whose documentation I've linked to here.
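As an aside before the beer-specific demonstration, here is a small illustrative sketch of the K − 1 idea on a made-up toy column (not the beer data), using pd.get_dummies with drop_first=True to drop one category:

```python
import pandas as pd

# Toy example: a categorical variable with K = 3 unique categories
x = pd.Series(["ale", "lager", "stout", "ale", "lager"], name="style")

# drop_first=True keeps K - 1 = 2 indicator columns; the dropped
# category ("ale") is implied whenever both indicators are 0
print(pd.get_dummies(x, drop_first=True).astype(int))
#    lager  stout
# 0      0      0   <- must be "ale" by process of elimination
# 1      1      0
# 2      0      1
# 3      0      0
# 4      1      0
```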
So we can do this with pd.get_dummies. This is a demonstration, so I'm not asking you to do it on your own, but if you want to pause the video and try it yourself, you can. We call pd.get_dummies, and within it we don't want dummies for all the columns of beer_train, so we just pass in the beer type column. Remember, beer type was IPA, stout, IPA, stout, and so on. What we get returned is a column called IPA, which is zeros and ones, and a column called stout, which is ones and zeros. So, for instance, the row with index 133 must be a stout. We can always check with beer_train.loc[133], and yes, here we have a stout. And remember, we have two possible categories, so we only need one dummy. So we're going to take the stout dummy, like I said above, and create it as a new column in beer_train right here: I call pd.get_dummies on the beer type column of beer_train and grab the stout column. Now you can see I have this stout column in my beer training data set that accurately keeps track of whether a beer is a stout or an IPA.

With this new indicator variable, I'm now able to make a new model, which we're going to call the stout model: IBU = β₀ + β₁ ABV + β₂ stout + ε. We're going to fit this model using linear regression like we did before; at this point, hopefully we're familiar enough with this that I can breeze through the code. Then I'm going to plot the predictions this model makes, along with the training data. We've got our training data, which are the faded orange triangles for IPAs and the faded blue circles for stouts, and then we have the model output: the predicted IBU for IPAs is the orange dotted line, and the solid blue line is for stouts. So for any value of ABV, I can go up: the blue line gives me the prediction for a stout, and the orange dotted line gives me the prediction for an IPA.

So what can we see here? Well, we might notice that, one, the two lines are parallel, and two, if we look back at the original plot, now that the points aren't faded, it kind of seems like IPAs should have a different, somewhat steeper slope than the stouts. We can't get that with our current model, so we have to add what's called an interaction term. And why do we need to do that? Well, this is the model that we just fit, the stout model; I've highlighted it here, and we can see what happens for the two different values of stout. When stout = 0, the β₂ · stout term goes to zero, and what's left is IBU = β₀ + β₁ ABV + ε. When stout = 1, β₂ · stout becomes β₂, and our model becomes IBU = (β₀ + β₂) + β₁ ABV + ε. So all this model does is change the intercept of the line for the two different values of stout; the slope of the line, β₁, is the same for both. If we want to be able to change the slope, we need to add in a new term called an interaction term.
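Here is a hedged sketch of the steps just described, building on the assumed setup above (the column names "ABV", "IBU", and "Beer Type", and the dummy column name "stout", are assumptions about the data, not confirmed by the transcript):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Two categories, so one dummy suffices: keep only the "stout" column
# (.astype(int) ensures 0/1 integers rather than booleans)
beer_train["stout"] = pd.get_dummies(beer_train["Beer Type"])["stout"].astype(int)

# Stout model: IBU = beta_0 + beta_1 * ABV + beta_2 * stout + epsilon
stout_model = LinearRegression()
stout_model.fit(beer_train[["ABV", "stout"]], beer_train["IBU"])

# Same slope on ABV for both beer types; beta_2 only shifts the
# intercept when stout = 1, which is why the two lines come out parallel
print(stout_model.intercept_, stout_model.coef_)
```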
This is an interaction between ABV and stout, which really just means we're going to add in another variable, called an interaction term, that is the product of ABV with the indicator variable stout. So this is our interaction model: IBU = β₀ + β₁ ABV + β₂ stout + β₃ (ABV × stout) + ε, where the β₃ term is the interaction term and ε is our random noise. And once again, what's the difference between stout = 0 and stout = 1? Well, when stout = 0, we have IBU = β₀ + β₁ ABV + ε. And when stout = 1, we now have a different intercept, β₀ + β₂, and a different slope, β₁ + β₃. So when we add an interaction term, we allow for both different intercepts and different slopes.

We're going to quickly visualize this. First, we have to make our interaction term: beer_train.ABV times beer_train.stout. Then I make the model, fit the model, and plot it with the training data like I did before. OK, so here we can now see that the orange dotted line, which is the model output for an IPA, has a different slope and a different intercept than the solid blue line, which is for stouts. So, eyeballing it, this looks like a slightly better fit for our data set.

Why don't we go ahead and do a comparison using cross-validation? I've written all the code here; this is not about practicing cross-validation, this is just seeing which of the four models we considered does best. We can review them real quickly: we have the baseline model; the simple linear regression model, which produced the line above; the stout model, which included the dummy variable but not the interaction term; and finally our interaction model. We want to see which one does best on average at predicting the IBU using cross-validation, and that's what we're going to do now. So I import all the stuff for cross-validation, I make the KFold object, and notice we have four models and five splits. Then I go through, and here are the cross-validation results for all four models. Going through them, it looks like the interaction model has the lowest average cross-validation error, but in practice we might get similar performance if we were to use the stout model; they're pretty close in the scheme of things.

OK, so that's how you add categorical variables and interaction terms to your model. Remember: for a categorical variable with K possible categories, you make K − 1 dummy or indicator variables, and interaction terms are just the product of two of your feature columns. All right, I hope you enjoyed this video, I hope you learned something about categorical variables and interactions, and I hope to see you in the next video, where we continue on with regression. Have a great rest of your day. Bye.
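For reference, here is a rough sketch of the comparison just described, continuing from the assumed setup above. The model list matches the video, but the column names, error metric (mean squared error), and fold settings are assumptions, not the instructor's actual notebook code:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Interaction term: the product of ABV with the stout indicator
beer_train["ABV_stout"] = beer_train["ABV"] * beer_train["stout"]

# Feature sets for the three regression models (the baseline has none)
features = {
    "slr": ["ABV"],
    "stout": ["ABV", "stout"],
    "interaction": ["ABV", "stout", "ABV_stout"],
}

kfold = KFold(n_splits=5, shuffle=True, random_state=216)
mses = {name: [] for name in ["baseline"] + list(features)}

for train_idx, holdout_idx in kfold.split(beer_train):
    tt, ho = beer_train.iloc[train_idx], beer_train.iloc[holdout_idx]
    # Baseline: predict the training-fold mean IBU for every beer
    baseline_pred = np.full(len(ho), tt["IBU"].mean())
    mses["baseline"].append(mean_squared_error(ho["IBU"], baseline_pred))
    for name, cols in features.items():
        model = LinearRegression().fit(tt[cols], tt["IBU"])
        mses[name].append(mean_squared_error(ho["IBU"], model.predict(ho[cols])))

# Average cross-validation MSE for each of the four models
for name, scores in mses.items():
    print(name, np.mean(scores))
```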