Scaling Data Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video, we're gonna talk about some cleaning we're going to do to scale some data. So sometimes you're going to have to scale data. For instance, in the next couple of algorithm/model videos and notebooks, we're gonna have a couple of models and algorithms where, if you do not scale the data prior to trying to fit the model, it will give you a very messed up fit and it won't work very well. So sometimes, prior to fitting a model or running an algorithm, we're going to scale our data. And this is particularly true if you have some features or columns of X that are on vastly different scales. So maybe you have one in the ones or the tens and then you have another one in the millions. Then it's not going to work very well and you need to scale your data prior to fitting whatever model or algorithm. So in this notebook, we will introduce the concept of scaling. In particular, we'll demonstrate what's known as the StandardScaler object in sklearn, we will then mention the differences between fit, transform, and fit_transform, and then we'll demonstrate the process of how to scale data with a scaler, not at the end actually, but throughout the entire video. So we're first gonna start off by generating some data. ...

So we have this set of features X, which is 1000 observations of four different columns. And below, I've printed out the mean and the variance of the four columns. So X1 has a mean of about 12, but it has a variance of about one million. The mean of X2 is negative nine, with the variance on the scale of the hundreds. X3 has a mean of negative 52 with a variance on the scale of the ten thousands. And then X4 has a mean of about negative 75, and it has a variance on the scale of the hundreds. So we can see here this is a set of features that have vastly different scales, ranging from the hundreds all the way up to the millions. And we will have some algorithms in the future where this would mess up the algorithm. So what we need to do now is we need to scale it to make sure that the scales of X1, X2, X3, and X4 are all the same. So ones, tens, hundreds, as long as they're all the same, we're OK.

So the way to do this, or one way to do this, is to standardize your data. This is gonna be the first scaling procedure that we'll see; there are others that you may use in your work, whether it be in industry, academia, or a personal project. This is just one of them. So you standardize data: if you have a variable x, standardizing it would mean that you subtract the mean and then divide by the standard deviation, that is, z = (x - x̄) / s, where x̄ is the sample mean of the column and s is its standard deviation. So this is how you produce a standardized variable: you subtract the sample mean, so you would take the mean of the column, and then divide by the standard deviation of that column. So if you've ever taken a statistics course, or maybe you haven't taken a statistics course but you've had to use a z table or have generated a z-statistic using some sort of software, this maybe looks familiar. This transformation is exactly the transformation that you would apply to a normal random variable if you want to turn it into a standard normal random variable, that is, a standard normal with zero as the mean and one as the standard deviation.
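To make the standardization formula concrete, here is a minimal sketch in numpy using made-up columns on very different scales; the specific means, spreads, and random seed are illustrative, not the exact data generated in the notebook.

```python
import numpy as np

rng = np.random.default_rng(216)  # illustrative seed

# 1000 observations of four columns on very different scales (made-up values)
X = np.column_stack([
    rng.normal(loc=12, scale=1000, size=1000),   # variance on the order of a million
    rng.normal(loc=-9, scale=20, size=1000),     # variance on the order of hundreds
    rng.normal(loc=-52, scale=100, size=1000),   # variance on the order of ten thousands
    rng.normal(loc=-75, scale=15, size=1000),    # variance on the order of hundreds
])

print("column means:    ", X.mean(axis=0))
print("column variances:", X.var(axis=0))

# Standardize by hand: subtract each column's mean, divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print("scaled means:    ", X_std.mean(axis=0))   # approximately 0
print("scaled variances:", X_std.var(axis=0))    # approximately 1
```

Doing this by hand works, but it gets tedious, which is exactly why we reach for sklearn next.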
So if you standardize your data, at the end of it the scaled data will then have mean zero and standard deviation one. This is a very straightforward process; we could write up some code to do this ourselves using numpy, but that actually becomes tedious quite quickly. sklearn provides what is known as a scaler object called StandardScaler that will do this for us. It has nice functionality that does it on all the columns simultaneously, and it also plays very nicely with train test splits and the data science paradigm. So we're gonna import StandardScaler. From sklearn.preprocessing, which may be a new subpackage for you, we'll import StandardScaler. Now the first step is very similar to making a model and then fitting that model: you make the scaler object, so scaler is equal to StandardScaler(), and then we fit the scaler. So what happens when we fit the scaler? ... When we fit the scaler, it's gonna go through all those columns of X, and what does it need for each column? Well, it needs the mean and it needs the standard deviation. So it's going to go through all four columns of X, find the mean of each column and the standard deviation of each column, and then store them within the scaler object. So now scaler, the variable, has the mean and standard deviation for all four columns of X, which means that we can now scale the data using scaler. And here, when we want to perform a scaling, we call .transform, because we're going to transform the data into its scaled form. And now we will go through and check that the standardized columns have a mean of approximately zero and a variance of approximately one. OK. And so here you see some numbers like 1e-14 or 1e-17; this means this is essentially zero. Sometimes our computer cannot get exactly to zero, just like it can't get exactly to one, but for all intents and purposes .999999999999 is one. OK. So now you might be wondering: well, we did fit, and that seems familiar from when we had, sorry, not logistic but linear regression. What is transform? And then you're saying there's something called fit_transform, what does that mean? So there are three things that you can do with a scaler object: you can do fit, transform, or fit_transform. There's a procedure to this that you should follow if you want it to work the right way, and we're gonna touch on that now. So the first thing you need to do for any scaler object, the first step, would be fitting. What fit does is, for instance for StandardScaler, it would find the mean and the standard deviation of each column. Each different scaler will do something slightly different; there's one where you can scale so that the minimum value of a column goes to zero and the maximum goes to one, so there'd be some parameters that the scaler would need from the data for that. And that's what the fit does: it figures out all the parameters it needs in order to perform the scaling for each of the columns. So as an important note, just to stress again, fit has to be called before you do transform. OK. So if we were to go up and rerun this after commenting out fit and then try to transform, we would get an error. Specifically, we get a NotFittedError. So we always need to fit the scaler before we can transform the data. OK. So the next step you would do after fitting your scaler is you're gonna transform your data.
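Here is a minimal sketch of that fit-then-transform sequence, assuming X is a 2-D numpy array like the one in the sketch above; the variable names are just illustrative.

```python
from sklearn.preprocessing import StandardScaler

# Make the scaler object, much like making a model object
scaler = StandardScaler()

# fit: learns the mean and standard deviation of each column of X
# and stores them inside the scaler object
scaler.fit(X)

# transform: subtracts those stored means and divides by those standard deviations
X_scaled = scaler.transform(X)

# The scaled columns have mean ~0 and variance ~1 (up to floating point error,
# so you may see values like 1e-16 instead of exactly 0)
print(X_scaled.mean(axis=0))
print(X_scaled.var(axis=0))

# Note: calling scaler.transform before scaler.fit raises a NotFittedError
```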
So this is what actually performs the scaling. For StandardScaler, this would be the function that goes through, subtracts off the means, and divides by the standard deviations; you call this after fit. OK. There's also a step that combines both of them: fit_transform. This will do both steps at once. It would fit the scaler object with the data that you put in and then transform that data using the fitted scaler from the first step. OK. So this does both fit and transform, in the correct order, all at once. And so now you might be asking yourself, well, why would I ever need anything other than fit_transform? This is a great question. And above, when we were showing this off, we could have done fit_transform and we would have been fine. But in practice, when you're doing data science with train test splits or validation sets or cross validation, you do not want to refit the scaler on the test set, validation set, or the holdout set of cross validation. The fitted scaler object, so the mean and the standard deviation in this case, comes from the training data, and we think of that, theoretically, as part of the algorithm or part of the model. So if you're going to scale some data as a part of the model or algorithm you're working on, you think of that scaling procedure as part of the fitting process, and the fitting for any algorithm or model comes only from the training data. We then don't want to go in later and refit the scaler using the test data, because there is a chance that, theoretically, the distribution that we have for our training data, since it's an empirical, observed sample, is slightly different from the distribution of our test set. And so if we allow the fitted scaler to change from the training set to the test set, we're going to have what's known as data leakage, where our performance will be influenced by using observations about the test data which we would not have known ahead of time when we were fitting the model. So for train test splits, you only ever fit your scaler on the training set, and on your test set you only ever use transform. And that's why we have fit and transform separated out, and not just fit_transform: this allows us to make sure that we're fitting exclusively on training sets, and on test sets only using transform. OK. So we're gonna illustrate this with a quick final example. What do I mean by illustrate this? We're gonna illustrate the proper procedure for when you're doing a data science project with a scaler. So I'm gonna make a train test split, and let's imagine now I'm about to go work on a model. I'm then going to make a scaler, so maybe this model needs some scaling. I make my scaler object first, then I fit that scaler object on the training data, and then I'm going to scale the training data for whatever algorithm I'm going to run. Now, I will say, alternatively, because this is the training data, I could use fit_transform instead of doing fit and transform separately. I tend to like to separate them out because it makes sure that I know I'm doing it the correct way. But technically, in this step, we could have done fit_transform because we're only working with the training set.
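Here is a minimal sketch of that training-side procedure; the split proportions, random seed, and the scaler_new name are illustrative, and it assumes X is the same feature array from the earlier sketches.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, so the scaler never sees the test data (illustrative split settings)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=216)

# Make the scaler object, then fit it on the training data only
scaler_new = StandardScaler()
scaler_new.fit(X_train)

# Scale the training data for whatever model or algorithm comes next
X_train_scaled = scaler_new.transform(X_train)

# Equivalently, since this is the training set, fit_transform would be fine:
# X_train_scaled = scaler_new.fit_transform(X_train)
```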
Now, we're going to imagine that we're building some amazing model that will change the world, and now we're ready to check its performance on a test set, a holdout set, or a validation set. And so on the test set, we would only do the following: X_test_scaled = scaler_new.transform(X_test). And then we can go through and look at the scaled test set. It's not that we knew what the original data looked like, but this is what the scaled test set looks like. Now we could imagine that we're going out and we're gonna check our performance as a final check before we put this model out into the world. So as I said before, sklearn has more scalers than just the StandardScaler, and there may be other arguments for those; here is a link to the documentation for the scalers if you're interested in seeing some of the other scaler objects they have. As I said earlier in this video, I believe, there's a min-max one, but there are others as well. OK. So now you know all about scaling data, and you also know about the fit, transform, and fit_transform steps, which will be applied in other sorts of things we do in the future. All right, I hope you enjoyed this video and I hope to see you in the next video. Have a great rest of your day. Bye.
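For reference, here is a minimal sketch of the test-set step described above, continuing from the train/test sketch earlier, along with the min-max style scaler mentioned in passing; the variable names are illustrative, and MinMaxScaler is sklearn's scaler that sends each column's minimum to zero and maximum to one.

```python
from sklearn.preprocessing import MinMaxScaler

# Test set: only transform, using the scaler already fitted on the training data
X_test_scaled = scaler_new.transform(X_test)

# An alternative scaler from sklearn.preprocessing: MinMaxScaler maps each
# column's minimum to 0 and maximum to 1. The same fit/transform rules apply.
minmax = MinMaxScaler()
X_train_minmax = minmax.fit_transform(X_train)   # fit on the training data only
X_test_minmax = minmax.transform(X_test)         # only transform the test data
```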