Data Splits for Predictive Modeling I Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody, and welcome back. We're going to continue learning about supervised learning by touching on a series of data splits, the first of which is known as the train test split. Let me go ahead and share my Jupyter notebook and we can get started.

So here's what we're going to accomplish in this notebook, which is the data splits for predictive modeling notebook found within the supervised learning folder in our lectures. We will discuss why we might want to split our data set in the first place. Then we will continue on through a series of videos and define three different types of data splits. This video will only touch on the train test split portion, and then at the end, in the third video, we'll touch on scenarios in which you might prefer one splitting technique over another.

At the top here, you'll see this little code chunk. This will start appearing at the top of a lot of our Jupyter notebooks moving forward. It's just a common set of items that I know I'm going to need throughout the notebook, such as NumPy and Matplotlib; eventually we'll see pandas up here as well. So that's what this code chunk at the top of the notebook is going to be from now on.

So let's talk about the rationale for why we would want to split our data set. Why wouldn't we want to use all of our data to train the model? Let's imagine we are solving a predictive modeling problem, which we're soon going to be doing. We go out and collect some data (X, y), which is, say, n observations of m features with n corresponding outputs. The goal in predictive modeling is to find the model with the lowest generalization error. We define generalization error, somewhat loosely, as the model's error on a randomly collected new set that the model was not trained on; let's call that set (X*, y*). The idea is that we fix this new data set, but the initial training set (X, y) was randomly collected. Because the training set is random, the error on this fixed generalization set is itself a random variable, which we can call capital G.

If we want to determine the best model, meaning the model with the lowest G (sometimes you'll want the highest, but for this video let's say we want G as low as possible), we will choose it from a set of candidate models. Maybe we have a few models we're choosing between, and we'll pick the one with the optimal G, which in this setting means the smallest G. So what we really need to know for each model, in order to make our decision, is something about G, and hopefully something about its distribution.

One way we could try to do this, in an ideal world, is to go out and collect a whole bunch of large data sets, fit models on each of them, test them on this fixed generalization set, and see how they do. But we should point out that this isn't practical. It can be very expensive to collect data.
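To make the "ideal world" thought experiment concrete, here is a minimal sketch of repeatedly collecting a fresh training set, fitting a model, and measuring its error on one fixed, never-trained-on set (X*, y*). The data-generating process and the linear model are illustrative assumptions, not the notebook's actual data.

```python
# Sketch of the ideal-world experiment: G is a random variable because the
# training set is random; sampling many training sets approximates G's distribution.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(216)

def collect_data(n):
    # hypothetical data-generating process: y = 2x + noise
    X = rng.uniform(0, 10, size=(n, 1))
    y = 2 * X[:, 0] + rng.normal(0, 1, size=n)
    return X, y

# one fixed "generalization" set (X_star, y_star)
X_star, y_star = collect_data(500)

G_samples = []
for _ in range(200):
    X, y = collect_data(1000)                     # collect a fresh training set
    model = LinearRegression().fit(X, y)          # fit a candidate model
    G_samples.append(mean_squared_error(y_star, model.predict(X_star)))

print(np.mean(G_samples), np.std(G_samples))      # rough picture of G's distribution
```

In practice we cannot rerun this loop on real data, which is exactly why the data splits below exist.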
Maybe it's not possible to collect any more data, or maybe it's not ethical to continuously collect data from individuals just for the purposes of training a model. So in practice, we're going to be constrained to a single training set. And because we're constrained to a single training set, we're going to need to simulate the process of going out and getting many different sets with a series of data splits. These data splits allow us to sort of simulate collecting a bunch of different randomly drawn training sets, which we can then use to our advantage to estimate something about this generalization error.

We're going to start off with the first type of data split, called the train test split. This is one you're going to want to do in every predictive modeling problem you work on; you should probably start by doing a train test split. The purpose of the train test split is to create two distinct data sets. The first data set we create is called the training set. The training set is the subset used to fit models and then compare the potential models you're choosing between. So if you have a few different models, you'll use the training set to figure out which one is best. This data set is usually split further, as we'll see: we make this initial train test split, and then the training set is usually split into even smaller sets. The testing set is the subset of the data that you're going to use to make a sort of final check before putting your model into production, whatever that means for you. Maybe it means using the model for the research paper you're working on, or maybe it means making it part of a production pipeline in an industry setting.

Because we're going to split the training set further, we usually want it to contain a majority of the original data set. Commonly, you'll split into something like an 80/20 or a 75/25 split, meaning 80% train and 20% test, or 75% train and 25% test. Sometimes it can be appropriate to use different split sizes; it just depends on the particular problem. These splits are done randomly, and how the randomness is determined depends on your project. For the first set of models we'll look at, we'll split uniformly at random, but for other models you're going to have to tweak it a little bit.

Here's a visualization of the train test split. Let's say this green rectangle represents all the data we collected before we started our project. The random sampling is going to take a random subset and assign it to the training set, and then the remaining data gets assigned to the test set. That training set is going to be used to train and compare a bunch of different models, and once we feel like we've found the best model for our problem, we'll test it on the test set and check for things like coding errors or for something known as overfitting.

I want to stop for a second, before moving on to see how we can make a train test split with Python, and point out a potential point of confusion. Up above, I went on and on about how we want to measure the generalization error, and yet with this first data split, we don't actually use the test set to figure out which model has the lowest generalization error; we use the training set.
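Before looking at sklearn's helper in the next part, here is a sketch of doing the random 80/20 assignment "by hand" with shuffled indices, just to make the mechanics in the visualization concrete. The array sizes and seed are illustrative assumptions.

```python
# A hand-rolled 80/20 train test split: shuffle the row indices, assign the
# first 80% to training and the remainder to testing.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))   # 1000 observations of 3 features (illustrative)
y = rng.normal(size=1000)

indices = rng.permutation(len(X))        # randomly shuffle the row indices
n_train = int(0.8 * len(X))              # 80% of the rows go to the training set
train_idx, test_idx = indices[:n_train], indices[n_train:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(len(X_train), len(X_test))  # 800 200
```

In practice you would use sklearn's train_test_split, shown next, rather than rolling your own.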
So the point of a test set is not explicitly to measure the generalization error; rather, it's a final check. The test set serves as a sort of final check on your chosen model. It's nice to hold out a data set that the model was not trained on, so we can do a final check by making predictions on the test set, comparing them to the actual values, and seeing if anything looks out of the ordinary. One thing that might happen is that you have a small coding error somewhere in your training work that makes the model look like it's doing a heck of a lot better than it actually should be. When you get to the test set, that coding error may no longer be there, and then you can see, wait a minute, these aren't the sorts of errors I should be expecting based on all the work I did before. Another issue you might see is something known as overfitting, which we will talk about more in a later notebook. So test sets are for catching coding errors and overfitting on the training set. The training set is the only data set we're going to use to both train models and compare models. The model comparison part that we spent a long time on up above comes in with our two other types of data splits, which we'll see in the next video.

OK, so now that we've talked about train test splits theoretically, how can we make them in sklearn? sklearn has a nice function called train_test_split that will do this for you. The documentation is found here, and I'm going to show you how to implement it now. First, I'm just going to make a random data set: here's my X, and here's my y. Now I have to import train_test_split: from sklearn.model_selection import train_test_split. Now that I have train_test_split, I can use it to make a train test split. As you can see here, train_test_split is going to return four items: first the training set and the test set for my features, my X, and then the training set and test set for my outcomes, my y. Into train_test_split I first put my features, followed by my outcome variable, and I'm formatting the call this way just to make it easier to read. Next, I set my shuffle argument to True; this tells sklearn's train_test_split function that I want it to randomly shuffle the data before making the split (shuffling is the default behavior, but setting it explicitly makes the intent clear).

The next thing I'm going to do is set a random state. A random state just ensures that no matter when I run this code, I end up with the same random split. So if I set random_state equal to 440, and you also set your random_state equal to 440 when copying my code, you and I would get the same random split. Finally, I want to specify the size of my test set, and I want it to be 20% of the data, so I set test_size equal to 0.2. Alternatively, I believe you can set train_size equal to 0.8, but I'm going to go with test_size; that's just what I normally do. And now we can check that the lengths match, which is one way to see if it's doing what we think it should do. Remember, our original data had 1,000 observations, so the train set should have 800, the test set should have 200, and it does.
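Here is the call as described above, written out in full. The synthetic X and y are stand-ins, since the notebook's own data-generating code isn't shown in the transcript; the shuffle, random_state, and test_size arguments follow the lecture.

```python
# The sklearn train test split described in the lecture.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(440)
X = rng.normal(size=(1000, 2))   # 1000 observations of 2 features (illustrative)
y = rng.normal(size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    shuffle=True,      # shuffle the rows before splitting
    random_state=440,  # fixes the split so anyone running this gets the same result
    test_size=0.2      # hold out 20% of the data as the test set
)

print(len(X_train), len(X_test))  # 800 200
```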
As a quick check, you can do this exercise on your own by pausing the video and working through it, or just work along with me right now. I want to set aside 25% of the data (Z, W), so I will have Z_train, Z_test, W_train, W_test equal to train_test_split(Z, W) with shuffle equal to True, random_state equal to 909, and finally test_size equal to 0.25. And remember, you should generate the data first (a sketch of this exercise appears below).

OK, so that's going to be it for this video on the train test split. In the next video, we will continue this notebook and learn about the next type of data split, the validation set. I hope you enjoyed this video, and I hope to see you next time, where we'll learn more about data splits. Bye.
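For reference, here is a sketch of the exercise described above. Z and W are not defined in the transcript ("generate the data first"), so the synthetic data here is an illustrative assumption; the split arguments match the ones dictated in the lecture.

```python
# Exercise sketch: set aside 25% of (Z, W) as a test set.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(909)
Z = rng.normal(size=(400, 4))    # hypothetical features
W = rng.normal(size=400)         # hypothetical outcomes

Z_train, Z_test, W_train, W_test = train_test_split(
    Z, W,
    shuffle=True,
    random_state=909,
    test_size=0.25   # 25% of the data is held out for testing
)

print(len(Z_train), len(Z_test))  # 300 100
```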