Imputation Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video we're going to learn a little bit more about data cleaning with imputation. Imputation is a process for filling in, or replacing, missing values in your data set. Sometimes in the real world, whether for research, industry, or personal projects, you may have to deal with data sets that have missing values. This becomes a problem in particular when columns or features that you'd like to use in a model, maybe a predictive model or an explanatory model, have missing observations. You don't want to just throw those observations out, because there may be useful information in the other features; you can't just throw away every observation that has some feature missing. So what can you do? Well, imputation is the process by which we fill in missing values in an informed way. In this notebook we'll discuss this method of replacing missing values, demonstrate it on a data set we maybe haven't seen yet, the penguins data set from seaborn, illustrate a couple of different approaches to imputation, and then show at the end how you can integrate imputation into any sort of predictive modeling train/test/validation or cross-validation split.

So here is an example where I've taken the penguins data set, which you can find illustrated here from seaborn, and replaced some of its entries with missing values. So penguins is now a DataFrame that I've loaded in from a file stored in the data folder. And if we use .info(), a nice, useful pandas DataFrame method, we can see several properties of the data set without actually looking at any of the observations.
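Since the notebook cells themselves aren't in the transcript, here is a minimal sketch of this loading-and-inspecting step. The DataFrame below is a small hypothetical stand-in for the modified penguins file; in the notebook the data is read from a file in the data folder instead.

```python
import numpy as np
import pandas as pd

# Hypothetical mini stand-in for the lecture's modified penguins file;
# the real notebook loads the full 344-row data set from the data folder.
penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo"],
    "bill_length_mm": [39.1, np.nan, 46.5, 49.3, 45.2],
    "body_mass_g": [3750.0, 3800.0, np.nan, 3775.0, np.nan],
})

# .info() reports each column's non-null count and dtype
# without printing any of the observations themselves.
penguins.info()

# .isna().sum() counts the missing values per column.
print(penguins.isna().sum())
```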
So for one, it tells us all of the columns we have, how many rows of each column are non-null, meaning they have values in them and are not missing, as well as the dtype. From the top we know we have 344 entries, meaning there are 344 observations. And from here we can see that various features or columns have missing observations: sex is missing quite a few, body_mass_g is missing quite a few, flipper_length_mm is missing a couple, and bill_depth_mm and bill_length_mm are missing a couple as well. OK. So now we have these missing values, and we've been told by whoever is in charge that we want to use this data to build some models, maybe to predict species, or maybe to explain some things. So what can we do? Well, we can do imputation. Imputation is the name we give to the process of replacing missing data with a value. There are a couple of different approaches. The first is that maybe you just have a preset constant that you would like to impute. In this notebook we're primarily going to look at body_mass_g. So it's possible that there is some well-known value for the body mass of a penguin that previous scientists have studied, and it's been published, peer reviewed, and replicated, so we feel comfortable using this preset value; as a hypothetical, let's say that this is 4207. What we can then do is go into our data set and just replace, or impute, any missing values of body_mass_g with this preset value of 4207. So here I've made a copy of my penguins DataFrame where I'm going to do this, and I can use this nice pandas feature called isna, which checks for missing data.
So first we'll just demonstrate penguins_constant_impute.loc with the mask penguins_constant_impute.body_mass_g.isna(). If we run that, we'll see that we get back all of the rows that have missing body mass. Now, how can we do the imputation? Well, we just have to specify that for body_mass_g we're going to put in this 4207. OK. And now if we go back to .info() for the constant-impute copy that we've just imputed, we can see that body_mass_g now has 344 non-null, meaning not empty, observations.

That's one approach to imputing. Another approach, for when such a constant doesn't exist and you have to use the data to help you infer what the missing value might be, is that you can impute the missing values with some sort of sample statistic. By sample statistic I mean things like the mean, the median, or maybe the mode, which is the most frequent value; which one you use depends on the problem you're working on and the feature you're looking at. For instance, if you have a categorical variable, you wouldn't want to use the mean or the median, but you might use the mode, meaning the category that shows up the most. So we've gone over these examples. We could do this by hand with numpy or pandas, just like we did above, but it's going to be easier, and more easily used in predictive modeling, if we use scikit-learn's SimpleImputer object. Here's a link to the documentation, and we're going to import it: SimpleImputer is stored in the impute subpackage, so from sklearn.impute we import SimpleImputer. And now we're going to make the SimpleImputer object. What do we have to give SimpleImputer as an argument? Well, we need to tell it what strategy we're going to use. SimpleImputer comes with four strategies, which are chosen using a string input.
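The constant-imputation step described above might look like the following sketch, using a small hypothetical DataFrame in place of the full penguins data, since the notebook cells aren't in the transcript:

```python
import numpy as np
import pandas as pd

# Hypothetical mini version of the penguins data.
penguins = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, np.nan, 3775.0],
})

# Work on a copy so the original stays untouched.
penguins_constant_impute = penguins.copy()

# isna() flags the rows where body_mass_g is missing.
missing = penguins_constant_impute["body_mass_g"].isna()

# Impute the lecture's preset constant, 4207, into those rows.
penguins_constant_impute.loc[missing, "body_mass_g"] = 4207
```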
So if I set strategy equal to "mean", it will use the arithmetic mean; if I set it equal to "median", it will use the median; if I set it equal to "most_frequent", it will use the mode; and if I set it equal to "constant", it will do the same thing we did above, except we also have to provide the constant. We're going to use the median, no particular reason, let's just use that, so we say strategy="median". Now, just like with StandardScaler, this is a transformer object, so we have to fit it first, meaning the SimpleImputer is going to go through and find the median value. So we'll do impute.fit, and we only want it for the body mass, so we fit on penguins body_mass_g, and I need to reshape, so .values.reshape. OK. And now, what does the transformed data look like? Well, you can use transform just like with a StandardScaler, so I can call it on penguins body_mass_g, I'll just copy and paste this, and let's see what it returns. And then, just to show that it worked, I can look at the rows of penguins where body_mass_g was NA, and we can see that where there used to be NAs, there are now imputed values, in particular the median of the non-missing values. We could even check that this is the median by calling penguins.body_mass_g.median(), which is computed by excluding all the missing values.

The last sort of approach, not the last approach in the world, but the last that we're going to cover, is that you can also build a predictive model to do the imputation for you. This is not uncommon, and it is maybe more intelligent, and leverages the other data you have better, than just a simple median or mean.
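Here is a sketch of that SimpleImputer workflow, again with a hypothetical mini column standing in for the real data; the reshape is needed because SimpleImputer expects a 2-D array:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical body_mass_g column with two missing values.
penguins = pd.DataFrame({
    "body_mass_g": [3750.0, 3800.0, np.nan, 4400.0, np.nan],
})

impute = SimpleImputer(strategy="median")

# fit finds the median of the non-missing values (here 3800.0) ...
impute.fit(penguins["body_mass_g"].values.reshape(-1, 1))

# ... and transform fills the NaNs with it.
filled = impute.transform(penguins["body_mass_g"].values.reshape(-1, 1))

# pandas computes the same median, also excluding missing values.
print(penguins["body_mass_g"].median())
```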
And so what we can do is build a regression model that takes in the other columns of my data set and predicts the missing values of body_mass_g. Now, the key here is that if we're going to do this, we can't use observations that are missing other features as well, because that will mess with the model. So what we're going to do is first get a version of the penguins DataFrame that doesn't have any missing values outside of body_mass_g. If we look at penguins.loc with the mask penguins.body_mass_g.isna(), we can see there are some observations where there are missing values for the other variables as well. We're going to essentially throw those out, because we need all of the other variables in order to be able to make predictions. So in the real world, if we did this, we would use our linear regression model to get predictions for the missing body mass values in the rows that still have the other features, and then for rows 3 and 339, which are missing the other features too, maybe we would just use something like the mean or the median. OK. So this is what this does: I'm just going to get rid of the rows that also have the other features missing, and that's just going to be rows 3 and 339. And now I'm going to make some dummy columns. This gives me dummy columns for the species of penguin, because species is also going to be used in the regression. And then this is something that we're hopefully familiar with, where I build a regression model. So I import LinearRegression, I make my model object, I fit my model object, noticing that the y in this example is the body mass, and then I predict on the missing part, the rows of penguins where body_mass_g is NA. So these are the predictions for those missing values, the rows where only body mass was missing.
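The regression-based imputation described above can be sketched like this, with a hypothetical mini DataFrame in place of the full penguins data; species is dummy-coded so it can enter the regression alongside the numeric feature:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical mini penguins frame: body_mass_g has two missing values,
# but every row has the other features, so all rows can get predictions.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "flipper_length_mm": [181.0, 186.0, 217.0, 220.0, 195.0, 198.0],
    "body_mass_g": [3750.0, 3800.0, np.nan, 5400.0, 3700.0, np.nan],
})

# Dummy-code species so it can be used as a regression input.
X_all = pd.get_dummies(df[["species", "flipper_length_mm"]], columns=["species"])

missing = df["body_mass_g"].isna()

# Fit on the rows where body mass is observed, with body mass as y ...
model = LinearRegression()
model.fit(X_all[~missing], df.loc[~missing, "body_mass_g"])

# ... then predict body mass for the rows where it was missing and fill them in.
df.loc[missing, "body_mass_g"] = model.predict(X_all[missing])
```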
So we're going to make a quick note before we show you how to put this into a predictive modeling environment. If, for instance, your end goal was to build a model that predicted the species of penguin given all the other features, so maybe we want to take in bill length all the way through sex and then predict what species of penguin it is, then we obviously could not use the Adelie and Gentoo dummy columns as inputs into our imputation model. That would be sort of like working backwards, right? We could infer something about the species because we used the species to fill in the missing value. So if we were doing something like that, our regression model would have to not include the type of penguin and only include these four features.

OK. And talking of predictive modeling, well, how can we use imputation in predictive modeling projects? When we did scaling, we made a very key point that you fit the scaler using only the training data, and then when it comes to something like the test set, the validation set, or a cross-validation holdout set, you only transform. You never fit on the data that's being held out; you only ever fit the scaler on the training data. The same is true for imputation. So for example, whether we were imputing with the median or doing the sort of imputation procedure we did above, where we fit a regression model, we would only fit the imputer using the training data. Then, when we want missing values replaced in a test set, we would use the imputer fitted on the training data. The reason for this is that we want to avoid data leakage; that's the main reason. And thankfully, SimpleImputer is easily set up for this, just like StandardScaler was.
So for instance, we're going to make our train test split, and then we'll see that both the training set and the test set have observations with missing values: we're supposed to have 69 total entries in the test set, and we have some columns that are missing entries; and we're supposed to have 275 entries in the training set, and a couple of columns there have missing observations as well. And so then we're going to go ahead and define an imputer with the mean strategy: impute = SimpleImputer(strategy="mean"). To demonstrate the procedure you would follow in a real project or problem, or maybe you're just practicing: once you have the imputer, you fit it on the training set, and then we can impute the training set. And then, when it comes time to work with the test set, you would only transform the test set. You do not do another fitting on the test set to try to get the mean from it for the missing values there. OK, so remember: you only fit on the training set; you never touch the test set except to get the transformed, imputed version. And the reason we do this, remember, is that we're trying to pretend that the test set, validation set, cross-validation holdout split, whatever it is, is a data set for which we do not have labels, meaning we wouldn't be able to impute the values there, right? OK. So again, SimpleImputer is just like StandardScaler, so we could put it into a pipeline like we did for StandardScaler in a different notebook. So now you know about the concept of imputation, you know about three approaches for specific imputation techniques, and you know how to integrate imputation into a train test split predictive modeling environment.
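A minimal sketch of that fit-on-train, transform-on-test pattern, using a small hypothetical feature array (the 275/69 split sizes in the lecture come from the full data set):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# One hypothetical feature column with missing values.
X = np.array([[3750.0], [np.nan], [4200.0], [np.nan], [3900.0], [5400.0]])

X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

impute = SimpleImputer(strategy="mean")

# Fit on the training set only, then impute the training set.
X_train_imputed = impute.fit_transform(X_train)

# Reuse the training-set mean for the test set: transform only, no refit.
X_test_imputed = impute.transform(X_test)
```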
So I hope to see you in the next video and I hope you enjoyed this video. Uh Have a great rest of your day. Bye.