Feature Selection Approaches
Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video we're going to talk about some feature selection approaches you can use when doing regression. Let me go ahead and share my Jupyter notebook. Here we go.

In this notebook we're going to talk about a couple of approaches you might take to choosing different models or features when doing linear regression. In particular, we're going to work on a problem that models car seat sales. We'll demonstrate an implementation of one algorithm, mention two variations on that approach, and then finally remind ourselves that lasso regression is useful for feature selection.

The data we're going to work with comes from An Introduction to Statistical Learning, which is linked to here, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. This is their car seats data set, which we're going to load and then look at a sample of. It's a synthetic data set the authors came up with to demonstrate some algorithms. The column we're interested in predicting is Sales, the sales of car seats at stores in various communities. The features are CompPrice, the price charged by the competitor at each location; Income, the average income level of the community the store is in; Advertising, how much that location spent on advertising; Population, the population level of the community; Price, the price of the car seat at that store; ShelveLoc, the quality of the shelf location the car seats get within the store; Age, the average age of the local population; Education, the local education level; Urban, whether or not it's an urban area; and US, whether or not the store is in the United States. The goal is to build the best model we can that uses these features to predict the sales of car seats at various locations.

We're going to start by making a train test split and then cleaning the data. The first thing I do is make dummy variables for the three categorical variables we have. For shelf location I make Shelf_Good and Shelf_Bad, because there are three possible categories there. The other two are yes/no variables, so I just turn the yeses into ones and the nos into zeros, and then I make my train test split.

One of the first steps when you're thinking about what features to include in a linear regression model is to do some exploratory data analysis. A lot of times this involves making a series of plots as well as computing some introductory but useful statistics, things like correlations or standard deviations. In this example we're going to look at scatter plots using a nice seaborn function, sns.pairplot, whose documentation I've linked to here. It takes in a data frame, and you can add an extra argument, hue, which colors the points by a categorical variable. There are too many features to look at comfortably all at once, so we'll go through them in three batches, starting with the first. We looked at something similar with pandas' scatter_matrix, but I find the seaborn version nicer. So let's take a look.
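Here's a rough sketch of the setup I just described. The file name, variable names like car_train, and the dummy column names Shelf_Good and Shelf_Bad are my stand-ins rather than the notebook's exact code, but the steps are the same: get_dummies for ShelveLoc, 0/1 encoding for the yes/no columns, a train test split, and a pair plot colored by shelf location.

    import pandas as pd
    import seaborn as sns
    from sklearn.model_selection import train_test_split

    # load the car seats data (file name is a placeholder)
    carseats = pd.read_csv("carseats.csv")

    # ShelveLoc has three categories (Bad, Medium, Good), so two dummy columns suffice
    shelf_dummies = pd.get_dummies(carseats["ShelveLoc"])
    carseats["Shelf_Good"] = shelf_dummies["Good"]
    carseats["Shelf_Bad"] = shelf_dummies["Bad"]

    # Urban and US are yes/no, so turn them into 0/1
    carseats["Urban"] = (carseats["Urban"] == "Yes").astype(int)
    carseats["US"] = (carseats["US"] == "Yes").astype(int)

    # make the train test split before any modeling
    car_train, car_test = train_test_split(carseats, test_size=0.2,
                                           shuffle=True, random_state=440)

    # first batch of features, with points colored by shelf location
    sns.pairplot(car_train[["Sales", "CompPrice", "Income", "Advertising",
                            "Population", "ShelveLoc"]],
                 hue="ShelveLoc")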
The first row of the plot is the sales row, which is what we're most interested in; every point is colored by the shelf location, which is good, medium, or bad, and the columns are competitor price, income, advertising, and population. If we look at just the distribution of sales for the different shelf locations, shelf location already looks useful: the distribution for good is shifted slightly to the right, which suggests that stores with a good shelf location tend to have higher sales than stores with a bad or medium shelf location. For competitor price, there doesn't seem to be much of a relationship with sales. There also doesn't seem to be much of a relationship between the general income of the area and sales. The next column is advertising, and I would say there is maybe a positive linear relationship between advertising and sales. And going back to population, there doesn't seem to be much of a relationship between population and sales either. So out of this first batch, including shelf location, I'd say shelf location and advertising look interesting.

Next we can look at price, age, and education. Once again everything is colored by shelf location, but we've already decided we're interested in that one. Price comes first, then age, then education. There definitely seems to be a negative linear relationship between price and sales, which makes sense. For the other two, it's hard for my eye to tell whether there's a relationship between age and sales or between education and sales; possibly with age, but it's hard to see just by looking.

Finally we'll look at sales against urban and US. Here it looks like the urban distributions more or less line up at zero and one, whereas with US it does seem like sales are maybe slightly higher for stores in the US than for international ones. So I'd say US is the useful one to add from this batch.

From looking at all of these plots, the variables that stuck out to me were shelf location, advertising, price, and US. These would be a good first set of features to play around with if you're going to do further exploration and modeling.

Beyond exploratory data analysis, there are a couple of algorithmic approaches you might take, which in theory you could do without looking at the plots at all. The first is called best subset selection. It gets that name because you look at every possible model you can build from a subset of the features. What does this mean? If, for instance, we were only interested in models involving competitor price and advertising, best subset selection would compute some performance metric for the following four models: the baseline model, the one regressing sales on just competitor price, the one regressing sales on just advertising, and the one regressing sales on both competitor price and advertising. That's every possible combination of the two features: none of them, each one on its own, and both of them; a quick sketch of that enumeration is below.
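To make that enumeration concrete, here's a minimal sketch of cross-validating those four candidates, assuming the car_train frame and the CompPrice, Advertising, and Sales column names from above; the baseline model with no features is handled with scikit-learn's DummyRegressor, which just predicts the training mean.

    from sklearn.dummy import DummyRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # the four candidate models: no features, each feature on its own, both features
    subsets = [[], ["CompPrice"], ["Advertising"], ["CompPrice", "Advertising"]]

    for subset in subsets:
        if subset:
            model = LinearRegression()
            X = car_train[subset]
        else:
            # baseline: ignore the features and predict the mean training sales
            model = DummyRegressor(strategy="mean")
            X = car_train[["CompPrice"]]  # placeholder; DummyRegressor ignores X
        scores = cross_val_score(model, X, car_train["Sales"],
                                 cv=5, scoring="neg_mean_squared_error")
        print(subset, -scores.mean())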
For us, we're going to implement this in Python using Age, CompPrice, Advertising, Price, Population, US, and ShelveLoc, just to demonstrate; it's not that I think these are the best ones. The way we'll do this is to go through every possible model built on a subset of these features, calculate the k-fold cross-validation MSE for each of those models, store it, and then choose the one with the lowest cross-validation MSE.

Here I'm importing KFold and LinearRegression. This next function takes in a list of features and gives me back the power set of that set of features, which is just every possible combination of the features. The code was slightly adapted from code I found at this Stack Overflow link, which you can still go look at today. ... Here's a demonstration of the power set function using the example from before: I put in the list with CompPrice and Advertising ... and I get back CompPrice, Advertising, and then both. What I then do by hand is add in the baseline case.

Next I import mean_squared_error and define my KFold object. Now I want to record all of the models we'll consider, so I call the power set function on the list of all the features I'm interested in. Remember from our categorical variables notebook and content that we can't include ShelveLoc on its own, because it contains strings; we need to include the one-hot encoded variables I made above, Shelf_Good and Shelf_Bad. So we loop through all of the possible models returned by the power set function, and if a model contains ShelveLoc, we replace it with the two dummy variables I made with pd.get_dummies. Then at the end I append the baseline model to the list.

I then keep track of the cross-validation MSEs for all of my models, with five values per model because I'm doing five-fold cross-validation. In this process all I do is loop through all of the splits and all of the models, train each model, and record its MSE on the holdout set. Once that cell finishes running, I can use np.argmin to find the model with the lowest average cross-validation MSE, where the averages over splits come from np.mean. That turned out to be the 110th model, and here I print out the results: the model with the lowest cross-validation MSE included the features Age, CompPrice, Advertising, Price, US, Shelf_Good, and Shelf_Bad, and had an average CV MSE of 1.266. So I believe we ended up with the model that included nearly everything we offered it. We can check this directly with np.mean(cv_mses, axis=0), and then ... find the minimum of this. These are the average cross-validation MSEs for all of the models I considered, and the minimum is 1.266. We can also look at the baseline model, which should be the last one, and the baseline had an average cross-validation MSE of 6.85. The model we came up with greatly improved on that, coming in at somewhere between one sixth and one seventh of the baseline's MSE.
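Here's a rough sketch of that whole best subset procedure, again using my assumed names car_train, Shelf_Good, and Shelf_Bad rather than the notebook's exact variables; the power set function is the standard itertools recipe, restricted to non-empty subsets since the baseline is appended by hand.

    from itertools import chain, combinations

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold

    def powerset(features):
        # every non-empty combination of the features
        return list(chain.from_iterable(combinations(features, r)
                                        for r in range(1, len(features) + 1)))

    features = ["Age", "CompPrice", "Advertising", "Price",
                "Population", "US", "ShelveLoc"]

    # build the candidate models, swapping ShelveLoc for its dummy columns
    models = []
    for subset in powerset(features):
        cols = list(subset)
        if "ShelveLoc" in cols:
            cols.remove("ShelveLoc")
            cols = cols + ["Shelf_Good", "Shelf_Bad"]
        models.append(cols)
    models.append("baseline")

    kfold = KFold(n_splits=5, shuffle=True, random_state=440)
    cv_mses = np.zeros((5, len(models)))

    for i, (train_idx, holdout_idx) in enumerate(kfold.split(car_train)):
        tt = car_train.iloc[train_idx]
        ho = car_train.iloc[holdout_idx]
        for j, model in enumerate(models):
            if model == "baseline":
                # baseline model: predict the training mean for every observation
                pred = np.full(len(ho), tt["Sales"].mean())
            else:
                reg = LinearRegression().fit(tt[model], tt["Sales"])
                pred = reg.predict(ho[model])
            cv_mses[i, j] = mean_squared_error(ho["Sales"], pred)

    avg_mses = cv_mses.mean(axis=0)
    best = np.argmin(avg_mses)
    print(models[best], avg_mses[best])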
Now, this is not always the best approach, because you may end up fitting an enormous number of models. Say you have m possible features: best subsets then has to fit and assess 2^m models, and that gets out of hand rather quickly. So there are two other approaches that are still somewhat computationally expensive, but better than best subsets in the sense that you don't have to fit as many models, because they are greedy algorithms. A greedy algorithm works step by step, and each time it has to make a choice, it makes the choice that most improves model performance at that very next step. The two we'll look at are forward selection and backward selection.

So what do I mean by the best choice at the next possible step? I think it makes sense to walk through one of them and use it to illustrate. For forward selection, you start off with the baseline model: fit it and record its average cross-validation MSE. For step one, assuming you have m possible features, you go through and fit each of the m possible simple linear regression models, one per feature, and calculate the average cross-validation MSE for each. If none of these outperform the baseline model, forward selection says the baseline is the best model. If at least one of them does outperform the baseline, you choose whichever one has the lowest average CV MSE; say three of the m outperform the baseline, then you look among those three and take the one with the lowest average cross-validation MSE. That now becomes the default model you compare against going forward.

For the more general step, after you've gone through this process ℓ times, you'll have m - ℓ features that are not included in your current default model. For each of those, you fit the regression model that adds it to the current default, so you have m - ℓ models to consider, and you calculate the average cross-validation MSE for each one. If none of them outperform your current default model, you're done. If at least one of them does, you choose the one with the lowest average cross-validation MSE, that becomes your new default model, and you keep going in this way. For instance, from step one to step two you'd have m - 1 features not yet included; you loop through and ask, what does the model look like if I add this feature to my current simple linear regression model, and what if I add that one instead, and so on. If none of those two-feature models outperform the simple linear regression, forward selection says the simple linear regression is your best model; if one or more of them do outperform it, you choose the one with the lowest cross-validation error, that becomes your default model, and you keep going.
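Here's a minimal sketch of forward selection under the same assumptions as before (a car_train frame with a Sales target and the scikit-learn pieces we've been using). One caveat: this sketch treats every candidate feature individually, so for a real run you'd want to handle the two shelf location dummy columns as a unit.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold

    def forward_selection(df, candidate_features, target="Sales", n_splits=5):
        kfold = KFold(n_splits=n_splits, shuffle=True, random_state=440)

        def avg_cv_mse(cols):
            # average holdout MSE across the folds for a given feature list
            mses = []
            for train_idx, holdout_idx in kfold.split(df):
                tt, ho = df.iloc[train_idx], df.iloc[holdout_idx]
                if cols:
                    reg = LinearRegression().fit(tt[cols], tt[target])
                    pred = reg.predict(ho[cols])
                else:
                    # an empty feature list is the baseline: predict the training mean
                    pred = np.full(len(ho), tt[target].mean())
                mses.append(mean_squared_error(ho[target], pred))
            return np.mean(mses)

        selected = []
        best_mse = avg_cv_mse(selected)  # step 0: the baseline model
        remaining = list(candidate_features)
        while remaining:
            # try adding each remaining feature to the current default model
            scores = [(avg_cv_mse(selected + [f]), f) for f in remaining]
            mse, feature = min(scores)
            if mse >= best_mse:
                break  # no addition improves on the current default, so stop
            selected.append(feature)
            remaining.remove(feature)
            best_mse = mse
        return selected, best_mse

    selected, mse = forward_selection(car_train,
                                      ["Age", "CompPrice", "Advertising", "Price",
                                       "Population", "US", "Income", "Education"])
    print(selected, mse)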
Another greedy algorithm that's similar to this is backward selection, where you essentially do everything we did for forward selection but in reverse. For backward selection, step zero is to fit the regression model that includes every feature, and then you slowly remove features one by one until you reach your stopping point; it can terminate with just the baseline model being the best. Either of these approaches, forward or backward selection, terminates in at most m steps, so you end up fitting on the order of m^2 models rather than the 2^m that best subsets requires.

We'll finish with a lasso reminder. Another algorithmic approach you can use, algorithmic in the sense that you're running an algorithm, is lasso regression. What I mean by this isn't that we fit a lasso model and then just use the lasso model instead of the linear regression model. Rather, if you recall from our regularization notebook, the lasso model can be used for feature selection by slowly increasing the value of the hyperparameter alpha and observing the persistence of the coefficients: which coefficients stay nonzero the longest?

Here's where we demonstrate this using our car seats data and all of our possible features. Remember, here's Lasso, and here's StandardScaler, because I need to scale the data. ... I make a holder for my coefficients and my different values of alpha, plus an array that's going to hold my training data. Here's my scaler; I'm scaling only these columns, because you don't want to scale the shelf location dummies, since they're 0/1 variables. Then I loop through all my values of alpha, fit the lasso model for each one, and record the coefficients. ...

Now I can look at my coefficients and see which ones stick around the longest. Going through them, we definitely want to keep price; it sticks around the longest of all of them. If we're looking for the next round of features to include, it looks like the shelf location variables, good and bad, would have to be included, even though the coefficient on bad is zero at this point: you can't include just one variable out of a set of dummy variables, you have to include the whole set. Age stuck around ... and advertising stuck around. So lasso would say we should consider price, the shelf location variables, age, and advertising. You could go even further if you wanted to, but at that point we'd start to include a lot more variables, and it might not be worth it. From this we could try two models: the advertising, price, age, and shelf location model, and then the model that adds in comp price and income, so comp price, income, advertising, price, age, and shelf location. A rough sketch of this lasso loop is below.
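Here's roughly what that loop looks like, again with my assumed column names; the particular range of alphas and the matplotlib plot of the coefficient paths are my choices for the sketch, not necessarily what the notebook uses.

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.preprocessing import StandardScaler

    # scale only the numeric columns; leave the 0/1 dummies alone
    num_cols = ["CompPrice", "Income", "Advertising", "Population",
                "Price", "Age", "Education"]
    all_cols = num_cols + ["Urban", "US", "Shelf_Good", "Shelf_Bad"]

    X = car_train[all_cols].copy()
    X[num_cols] = StandardScaler().fit_transform(X[num_cols])

    alphas = np.logspace(-3, 1, 50)
    coefs = np.zeros((len(alphas), len(all_cols)))

    # fit a lasso model for each alpha and record the coefficients
    for i, alpha in enumerate(alphas):
        lasso = Lasso(alpha=alpha, max_iter=100_000)
        lasso.fit(X, car_train["Sales"])
        coefs[i, :] = lasso.coef_

    # the features whose coefficients stay nonzero the longest as alpha grows
    # are the ones lasso suggests keeping
    for j, col in enumerate(all_cols):
        plt.plot(alphas, coefs[:, j], label=col)
    plt.xscale("log")
    plt.xlabel("alpha")
    plt.ylabel("coefficient")
    plt.legend()
    plt.show()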
OK, all right. So those are three approaches to model building. You have the exploratory data analysis approach to get started; you have the more algorithmic approach with best subsets or forward and backward selection, which is pretty costly; and then you have the reminder that lasso is useful for feature selection, by looking at what happens to the coefficients as you increase alpha. That's going to be it for this video. I hope you enjoyed learning about these different selection approaches, and I hope you enjoyed this video. I hope to see you in the next video, and I hope you have a great rest of your day. Bye.