Data Splits for Predictive Modeling II
Video Lecture Transcript
This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video we're going to continue learning about data splits, this time with validation sets. Let me go ahead and share my Jupyter notebook.

In the last video, the one that comes before this one, we learned about train test splits. In particular, we first learned why we need to do data splits for predictive modeling at all. Then we saw our first type of data split, called a train test split, in which we split our data set into a training set and a testing set through some sort of random sampling. Now we're going to learn how we can divvy up the training set with additional data splits to train and compare models, which will eventually allow us to choose a model, after which we can use the test set for a final sanity check on the model we've chosen.

If you stopped the notebook and have restarted it since the last time, or you didn't watch the train test split video, make sure you go through and rerun the code chunks down to the next code chunk, which would be here.

All right. So we're going to learn about two split types for model comparison and selection, just one in this video; the second one will come in the next video. The first split type we're going to learn about, as I said, is a validation set. Procedurally, a validation set is the same kind of step as a train test split: you make a random split of the training set to get a smaller training set and a validation set. So after the random sampling that produced our train test split, we randomly sample data from the training set to get a smaller training set and a validation set.

This validation set is used for model comparison. Say, hypothetically, we have five different models that we're comparing. I will train all of those models on my smaller training set. I will then calculate whatever my performance metric is, that g we talked about in the last video, on the validation set, and whichever model has the best performance metric on the validation set, so earlier we said lowest was best, that's the one I would use. And then once I've chosen the best of the five, I will look at my test set as a final sanity check, getting the model's performance on that. A sketch of that comparison loop is below.
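To make the model-comparison idea concrete, here is a minimal sketch in Python. The three candidate regression models and the mean squared error metric are illustrative assumptions, not models from this lecture, and it uses the smaller training set and validation set (X_train_train, y_train_train, X_val, y_val) that we build later in this video.

```python
# Hypothetical model comparison with a validation set.
# Assumes X_train_train, y_train_train, X_val, y_val exist (made below);
# the candidate models and the MSE metric are illustrative choices.
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error

models = {"linear": LinearRegression(),
          "lasso": Lasso(alpha=0.1),
          "ridge": Ridge(alpha=1.0)}

val_errors = {}
for name, model in models.items():
    model.fit(X_train_train, y_train_train)   # train on the smaller training set
    preds = model.predict(X_val)              # predict on the validation set
    val_errors[name] = mean_squared_error(y_val, preds)

# Lower MSE is better, so keep the model with the smallest validation error
best_name = min(val_errors, key=val_errors.get)
```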
Theoretically, that's a very simple split, especially now that we understand what a train test split is. And luckily, since it is so straightforward, we get to just use train_test_split again: even though the function is called train test split, we can use it to make any data split we'd like. So we're now going to make that.

Remember, I told you to go through and rerun the earlier chunks, so you should have X_train, y_train, X_test, and y_test stored in your session. We're now going to use train_test_split to make a smaller training set, which I'm calling X_train_train, and a validation set, which I'm calling X_val. The important thing is that we don't want to overwrite our initial X_train and y_train; we want to keep those as they are. That's why we're making this smaller data split and calling it train train, which is a mouthful, but we don't want to overwrite our original variables.

So I'll code this up. My features have to go first, X_train; my outputs, my outcomes, go next, y_train. I need shuffle set to True; I'm going to set another random_state, let's go with 321; and then I want my test_size. I said 15%, so we'll go with 0.15.

And then once again we can look. Remember, what was the length of X_train? I believe it was 800, and we know that 15% of 800 is 120. So now we're just going to check the lengths of those. So print, and then we'll fix the words so they're right: here we want train train, here we want val, and here we want y. I'm just printing out words to make it look nice, "length of", and then we add an extra parenthesis at the end of each of these and put in the correct names. And actually, for the Xs I don't want the length, I want the shape, because these have a column dimension as well. Probably didn't need to do that, but I did it, and now it's here. Okay, so this checks out.

So for your exercise, to make sure you comprehended how to do this in Python, let's go ahead and make a validation set with 22% of the training set from the Z, W split that we did in the previous video, the train test split video. As always, you can pause the video and try to do it on your own, or you can watch me do it right now.

Okay. So Z_train_train, Z_val, W_train_train, W_val is equal to train_test_split of Z_train and W_train; those should have "train" on them, I almost forgot. Shuffle is equal to True; random_state is equal to 51, it could be any random state, I just chose that; and I said I want 22%, so test_size is equal to 0.22. And then we can copy the print statements from above: length of Z_train_train, and we can replace this 800 with the length of Z_train. Okay, so it should give us either 112 or 113. Let's see what it did. Change these Xs to Zs, change these ys to Ws. So it looks like it gave us the validation set. Oh, and we wanted 22%, not 15. Okay, there we go: 165 and 165. And then what remains must be 585.

Okay. So this is a short video, but I didn't want to confuse validation sets with cross validation. So in the next video, we'll talk about the next data split technique, called k-fold cross validation. In this video, we talked about validation sets, where we took a random split of the training set from the train test split to make a smaller training set and a validation set. All right, I hope you enjoyed this video, and I will see you in the next data split video on k-fold cross validation. Bye.
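For reference, here is a cleaned-up sketch of the two splits coded in this video. The variable names and the 750-row size of the Z, W training set are inferred from the audio (165 + 585), so treat them as assumptions about the notebook rather than exact matches.

```python
# Sketch of the splits from this video; assumes X_train, y_train, Z_train,
# and W_train already exist from the previous (train test split) video.
from sklearn.model_selection import train_test_split

# Validation split: 15% of the 800-row training set, i.e. 120 validation rows
X_train_train, X_val, y_train_train, y_val = train_test_split(
    X_train, y_train, shuffle=True, random_state=321, test_size=0.15)

print("shape of X_train_train:", X_train_train.shape)  # features keep a column dimension
print("shape of X_val:", X_val.shape)
print("length of y_train_train:", len(y_train_train))
print("length of y_val:", len(y_val))

# Exercise: 22% of the Z, W training set becomes the validation set
Z_train_train, Z_val, W_train_train, W_val = train_test_split(
    Z_train, W_train, shuffle=True, random_state=51, test_size=0.22)

print("shape of Z_train_train:", Z_train_train.shape)  # 585 rows remain for training
print("shape of Z_val:", Z_val.shape)                  # 165 rows, 22% of 750
```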