Data Splits for Predictive Modeling III
Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi and welcome back. If you're watching this video, I assume that you have seen or are familiar with train test splits and validation sets. We're going to continue learning about data splits with K-fold cross validation. Let me go ahead and share my Jupyter notebook and we'll be on our way.

So in the previous video, as I said, we talked about train test splits, in which we split our data set into a training set and a testing set with a random split. The training set, as we mentioned in previous videos, is used to train and then compare different models, while the testing set is held out until the final model is chosen. Then we use it to check for things like coding errors and overfitting; we'll learn about overfitting later. A validation set is one method that we've talked about to choose and compare models: you make a smaller training set out of your training set using a random split, and the rest of the data is used to validate the models and choose the one that performs best. In this video, we're going to learn a different approach called K-fold cross validation.

As a reminder, if you haven't done the earlier coded sections of the notebook, or you have done them but turned off the notebook and are coming back to it later, make sure to run through all the previous coding chunks in the notebook up to this point. Go all the way through until you get here, and then come back and we can start with the content.

The idea behind the validation set, in essence, is that it gives us a point estimate. You can think of this as going out and surveying 100 people, asking who they are going to vote for in the next election. That would be a point estimate of the true proportion of people who are going to vote for candidate A or candidate B. The same thing can be said here: the validation set gives us a point estimate of the generalization error, which we called capital G. One issue with this approach is that a single point estimate may not always be reflective of overall or average model performance. What would be really nice is if we could know something about the distribution from which G is drawn. That is typically difficult to do with a finite set of data, but we can leverage a well-known rule from probability theory called the law of large numbers.

The law of large numbers says that if I have a sequence of random variables V_1, V_2, ..., V_N that are independent and identically distributed, and that shared distribution has some true mean, which I'm going to call mu, then the arithmetic mean of those N random variables approaches the true mean in the limit as N goes to infinity. In other words, the average of your N samples gets closer and closer to the actual mean of the random variables as N gets bigger and bigger.
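Written out in symbols, the statement I'm describing looks like this. The display below isn't in the notebook itself; it's just the standard statement of the law of large numbers in the V and mu notation from above, typeset in LaTeX:

\text{If } V_1, V_2, \ldots, V_N \text{ are i.i.d. with true mean } \mu, \text{ then}
\qquad
\bar{V}_N = \frac{1}{N} \sum_{i=1}^{N} V_i \;\longrightarrow\; \mu \quad \text{as } N \to \infty.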
So in essence, what this is saying is that the arithmetic mean of a set of random draws will be close (how close depends on a lot of things) to the expected value of the distribution, provided you have enough draws. Here's where we're going to loosely try to leverage that. The thinking goes: if we could somehow generate an independent sequence of observations of this error, say G_1, G_2, all the way to G_N, then we know that their arithmetic mean should be somewhat close to the expected value, which is the true mean of the generalization error.

K-fold cross validation is a way to sort of MacGyver our way into having a sequence of random variables that are identically distributed and independent from one another. If you're unfamiliar with MacGyver, it was a TV show in the eighties where the star would get into a series of situations and use whatever was at his disposal to get out of them and win the episode. Essentially, what we're going to do is make the best of the data we have to get a sequence of independent, identically distributed random variables.

So how do we do that with K-fold cross validation? First you conduct your train test split, like we have illustrated below. Then we once again split our training set, this time randomly, into K equally sized chunks (or roughly equal, depending on how your division works out). Your observations of G, your G_1, G_2, all the way to G_K in this instance, are generated by cycling through each of the K chunks: you hold out that chunk, train your model on the remaining K - 1 chunks, and your observation of G is the model's performance on the chunk you left out.

I think it's easier to see this with a picture. Here is an illustration where I've set K equal to 5. I have my data, I do my train test split, and I hold onto that test set until I'm all the way done. Then I generate a split into five randomly chosen sets; all of these are random, even though it doesn't necessarily appear that way. Then you go through one at a time. You leave the first set out, train on the remaining four, and calculate G on the one you left out. Then you go to the next split: leave the second set out, train on the other four, and calculate G on that second set. Leave the third set out, train on the other four, calculate G on that third set. Leave the fourth set out, train on the other four, calculate G on that fourth set. Finally, leave the fifth set out, train on the first four, and calculate G on that fifth set. Then you average those together by adding them up and dividing by five, and in theory this should be roughly the expected value of G. The "roughly" might be pretty rough because we're only doing five folds, and there are a lot of assumptions here, but this is the idea behind K-fold cross validation.

So here we chose K equal to 5. As for other common values: if you've taken some statistics courses, maybe you've heard of leave-one-out cross validation. That's where you choose K equal to the number of observations in the training set.
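To make the averaging step concrete, here's a minimal sketch of the five-fold procedure just described. It uses scikit-learn's cross_val_score helper, which the lecture itself doesn't cover, and the regression data and LinearRegression model are placeholders I've made up purely for illustration:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Made-up placeholder data and model, just to have something to cross validate.
X, y = make_regression(n_samples=200, n_features=3, noise=1.0, random_state=582)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=582)

# For each of the 5 folds, train on the other 4 chunks and measure the error
# on the held-out chunk. The five resulting scores are our G_1, ..., G_5.
kfold = KFold(n_splits=5, shuffle=True, random_state=582)
scores = cross_val_score(LinearRegression(), X_train, y_train,
                         cv=kfold, scoring="neg_mean_squared_error")

fold_errors = -scores       # scikit-learn reports negated MSE, so flip the sign
print(fold_errors)          # the five individual observations of G
print(fold_errors.mean())   # their average: our estimate of the expected value of G

Later in the lecture we build this same loop by hand with the KFold object, which makes it clearer what a helper like cross_val_score is doing under the hood.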
For us, though, typical values are going to be 5 or 10, and again, what you choose may depend upon your problem and what you're trying to accomplish.

K-fold cross validation in sklearn is accomplished with sklearn's KFold object. The documentation for it can be found at this link if you're interested; I'm going to show you how to implement it right now. So from sklearn.model_selection I'm going to import KFold, capital K, capital F. Now, to make a KFold object, you call KFold. First you input the number of splits you want, so n_splits equals, we'll choose 5 here. Again, we have to set shuffle equal to True. Then I'm going to set a random_state, let's say 582. OK, so now I have my KFold object, and I'll show you what happens when I call it.

The way you do the split is to call .split, and then you put in your features and then your observations. Note that you could do this with just one input: maybe you want to do a K-fold split on a data set that has no outcomes, just features, and that can be done. But if you have two things that you'd like to split, you put in one and then the other, and the syntax is typically to put your features first, followed by your output. We can see that we get something called a generator object, which means we're going to have to loop through it.

So we'll write a for loop, just to demonstrate: for train_index, test_index in kfold.split(...). We'll see that what the KFold object returns is a set of indices that correspond to the four blue boxes, our training set, and a set of indices that correspond to the yellow box, which is the set we left out. Here I've called it test_index; that's just the usual syntax I've seen. You could call it leave_out_index or holdout_index. For the first split here, these are the indices of the training set, the four blue boxes from our image below, while these are the indices for the holdout set, that yellow box up above. Notice that there was a 7 in the test index; when we go to the next split, that entire test index gets reabsorbed into the indices used for training, and we can see that in the second split the 7 is back among them. It goes through and gives you all the splits, and because we chose K equal to 5, we have five different splits.

Now let's say, hypothetically, that we're fitting a model. The typical process would be: for train_index, test_index in kfold.split(...), you first get the training data. Here I've indexed X_train and y_train with the train_index that gets returned, to make two smaller training sets, the blue boxes up above. Then to get my holdout data, the set that I left out, those yellow boxes, you do the same thing, just indexing with test_index. Then you would fit the model. We'll see that the form is something like model.fit on the smaller training data, followed by some error function applied to the holdout outputs and model.predict on the holdout features; we'll see this soon. This is the form of a lot of what we're going to see, but for now we don't have a model or an error function.
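Putting that walkthrough together, here is a minimal, runnable sketch of the pattern being described. The KFold settings match the ones typed in the video; the X_train and y_train arrays are small placeholders I've made up so the snippet stands on its own, and the model and error lines stay commented out since, as noted, we don't have either one yet:

import numpy as np
from sklearn.model_selection import KFold

# Made-up placeholder arrays standing in for the notebook's X_train and y_train.
X_train = np.arange(20).reshape(10, 2)
y_train = np.arange(10)

# The KFold object from the video: 5 splits, shuffled, with a fixed random_state.
kfold = KFold(n_splits=5, shuffle=True, random_state=582)

# kfold.split returns a generator, so we loop through it. Each pass yields the
# indices of the four "blue box" chunks and the indices of the held-out "yellow box".
for train_index, test_index in kfold.split(X_train, y_train):
    X_train_train = X_train[train_index]   # smaller training set for this split
    y_train_train = y_train[train_index]
    X_holdout = X_train[test_index]        # holdout set for this split
    y_holdout = y_train[test_index]

    # Once we have a model and an error function, this is the form it will take:
    # model.fit(X_train_train, y_train_train)
    # error(y_holdout, model.predict(X_holdout))
    print(train_index, test_index)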
So in the notebook I'm going to comment those model lines out, but this is the form you should expect to see, and expect to use, in the coming notebooks.

OK, so here is an exercise to check that you comprehended everything we just talked about. Feel free to pause the video and work on it on your own, or you can code along with me, or you can just watch me do it. So: I need to make a new KFold object, because now I want 10 folds. So n_splits equals 10, shuffle equals True, random_state equals, let's do 323. Then I can write: for train_index, test_index in kfold.split(Z_train, W_train). You should have Z_train and W_train from the earlier code that got run, whether you wrote it yourself or copied the code that I wrote. Then we'll just pretend like we're fitting a model: Z_train_train is Z_train at train_index, W_train_train is W_train at train_index, Z_holdout is Z_train at test_index, and W_holdout is W_train at test_index (sorry, test_index, brain fart). There we go. OK, so hopefully you were able to get that, or at least it makes sense.

We'll end by talking about why you would want to use a validation set versus a cross validation approach. In general, if you have the ability to, you should use cross validation. It is typically preferred to a single validation set because cross validation gives you more measurements; it gives you a better sense of the distribution of G, as opposed to a single point estimate.

There are some cases where you can't really do cross validation. Two limiting factors you might need to consider are the size of your data set and the training time of your model. First, if you have too few observations, cross validation isn't practical, because splitting your data set into too many different sets can lead to deficiencies in both the model fitting and the estimation of G. For instance, this is a silly example, but if our training set only had 10 observations after the train test split, you wouldn't want to do cross validation there. Second, maybe your model takes a long time to train, in which case cross validation becomes infeasible because you would be training it K times; K times the training time of your model is roughly the amount of time it takes to run cross validation. This won't really be a factor in any of the notebooks we're doing, and probably not for a lot of the projects that many of you might work on on your own, but in some settings your models can take days or weeks to train, in which case cross validation is out of the question. In either of those cases, you'd want to use a validation set, which is better than nothing, but in general worse than cross validation.

OK, so that's it for this video. We now have a good idea of why we want to do data splits for predictive modeling, and how: train test splits, validation sets, and cross validation. We'll see this in action in the coming notebooks, where we actually start to learn some predictive models and make predictions with them. I hope you enjoyed this video, and I hope to see you next time. Have a great rest of your day. Bye.