Adjustments for Classification Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. In this video we're gonna start learning about classification; in particular, we're gonna learn about some adjustments that we have to make for classification problems. Let me go ahead and share my Jupyter notebook. In this notebook we're gonna talk about the concept and the motivation for stratified splits, we'll demonstrate how to use the stratify argument in sklearn's train_test_split, and then finally we'll introduce stratified k-fold at the end.

We're gonna start with a contrived example, sort of to give the idea and motivation for stratified splits. Here's some sample output data, just 10 observations: two of them are 1 and eight of them are 0. We're gonna run a loop here that goes five different times and does a train test split each time. What we can see is that of the five train test splits we generate, four of them have zero of the one observations in the test set, and one of them has zero of the one observations in the training set. Both of these types of outcomes are bad for predictive modeling.

For instance, in split three, all of your ones are in the test set and none of the ones are in the training set, so it would not be possible to train a classifier that is able to identify a one, because you just don't have any observations of one in your training set. In the other extreme, where you have no observations of one in your test set, it becomes impossible to see how well your classifier performs on a one. We train our model on that training set, and then when it comes time to validate the model on the test set as a final check, we don't know if it's actually doing what it's supposed to do for one observations, because there are no observations of one in the test set.

While this might be a silly example, it does highlight issues that can occur when you do classification train test splits, validation set splits, and cross validation. This is a particular problem if you have a data set that is highly imbalanced, just like we did here, meaning one of the categories is far more present than the other, which does happen quite a bit in the real world.

A major assumption in supervised learning is that the data you're training and then evaluating your model on is drawn from the same distribution. Essentially, we're assuming when we make these splits that I went out and more than once took a random draw from some probability distribution, and that it was the same probability distribution for our training sets, for our validation sets, for our holdout sets, and for our test sets. When we do the train test split up above, what we're getting is the example that I've drawn on the left, which is what we don't want: we have some sample, and then when we do our train test split, more of the ones, in this case all of the ones, end up in the training set and fewer end up in the test set. This is not identical to the distribution of zeros and ones that we had in the original sample, which is hopefully reflective of the distribution of zeros and ones out in the real world.
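For reference, here's a small code sketch of the kind of loop described above. The exact values, random states, and test size are assumptions for illustration, not necessarily the notebook's exact code.

import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy observations: two 1s and eight 0s, a highly imbalanced sample
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Five ordinary (non-stratified) train test splits
for i in range(5):
    y_train, y_test = train_test_split(y, test_size=0.2, shuffle=True, random_state=i)
    print("Split", i, "- ones in train:", y_train.sum(), "| ones in test:", y_test.sum())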
What we would like to do is use some sort of sampling scheme for this train test split where the distribution of zeros and ones is roughly the same between the training set and the test set, which is roughly the same as the original sample, which again we're hoping, and assuming, is the same as the distribution out in the real world. So how can we do this? The way we do it is called a stratified split.

What does a stratified split do? First, before we go ahead and do our random splits, and this works for train test splits, validation set splits, and cross validation, you take all of your data and split it up into all the ones and all the zeros. Let's say we put all the ones down here and all the zeros up here. This is pictured with binary data, but it works for multiclass classification as well; if we had three classes, we'd have zeros, ones, and twos, and so forth. You then take the subset that contains all of the zeros and do a regular train test split on it, so we do a random split into a set of training zeros and a set of test zeros, or if we're making a validation set, training zeros and validation zeros, and so forth. Then we do a separate random split on all of the ones, where all the training ones go over here and all the test ones go over here. Once we have train test splits for all of the unique classes in our data set, we recombine them so that all of the training zeros and ones go into a training set and all of the test zeros and ones go into a test set. Once we do this, our sample and our train test split will have roughly the same distribution, and hopefully the same as the real world distribution. Again, this can be done for multiple classes, it can be done in cross validation, and it can be done for validation sets.

So how can we implement this in sklearn? We'll return to the data set called beer. This is a data set with two different beer types, stouts and IPAs: roughly 56% of the data are IPAs and roughly 44% are stouts. So what can we do? Well, train_test_split has a stratify argument. Let's check, did I import train_test_split above? Yes. OK, so how do I use the stratify argument? Well, I call train_test_split, and I pass in beer, and because it's a data frame, I'll call .copy() on it. Shuffle will still be True, and I'll still put in my random state, let's call it 21. Then finally we'll use the stratify argument: stratify equals, and then you put in the column or the NumPy array that you would like the split to be stratified by. For us, we want it to be stratified by beer type, so we would put in the beer type column of beer.

Now we can see that the split for the training set is basically the same as the split for the original data set, and here's the split for the test set, which is roughly the same as both the original and the training set. So that's how you do a stratified train test split, which you would also use for making a validation set.
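Here is a rough sketch of the stratified train test split just described. The beer data isn't loaded here, so a small stand-in DataFrame is used, and the column names (IBU, ABV, Beer_Type) and the test size are assumptions rather than the notebook's exact ones.

import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the beer data set from the regression notebooks
beer = pd.DataFrame({'IBU': [65, 70, 60, 55, 68, 72, 35, 40, 45, 30],
                     'ABV': [6.5, 7.0, 6.2, 5.9, 6.8, 7.2, 8.5, 9.0, 8.0, 7.8],
                     'Beer_Type': ['IPA'] * 6 + ['Stout'] * 4})

# stratify takes the column (or array) whose class proportions should be
# preserved in both the training set and the test set
beer_train, beer_test = train_test_split(beer.copy(),
                                         shuffle=True,
                                         random_state=21,
                                         test_size=0.25,
                                         stratify=beer['Beer_Type'])

# The class proportions should come out roughly the same in all three
print(beer['Beer_Type'].value_counts(normalize=True))
print(beer_train['Beer_Type'].value_counts(normalize=True))
print(beer_test['Beer_Type'].value_counts(normalize=True))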
Furthermore, we can do stratified k-fold. There's not a nice stratify option in KFold; sklearn instead has a separate StratifiedKFold object. So from sklearn.model_selection we would import StratifiedKFold, and then we make the k-fold object just like before, with StratifiedKFold. And now we split things up.

So maybe we could have done beer_train before with a regular cross validation. Now we're gonna split it up into the features, which let's say are IBU and ABV, and then the output, which is what we want to stratify by. What you want to stratify by has to come second, and that's beer type. And then all the other stuff we had before: shuffle equals True, random state equals 21, test size equals... well, you don't do that for cross validation, that's for a train test split. OK, what's going on here? Hm. OK, I figured it out, I was having a brain fart. This should not have been here; that's for the next step. We just want the number of splits. So sorry about that, I was skipping ahead to the next step and combining it with this step.

What we need to do: you set up StratifiedKFold exactly the same way you set up a regular KFold object, so this is the same exact setup that we've done before. And then this is the part where the stratification comes in: you first put in the features like before, and then the second argument is the thing you want to do the stratification by. So this is gonna be a loop that goes through the train index and the test index of the k-fold split, where now the thing we want to stratify by is the second argument to the k-fold's split method. As we go through, we can see that these all have roughly the same split between IPA and stout as the original set. (There's a rough code sketch of this at the end of the transcript.)

All right. So in this video we talked about why we want to do a stratified train test split and what could go wrong if we don't, we showed what a stratified train test split is, and then we showed how to implement it with sklearn using the beer data set from the regression notebooks. I hope you enjoyed learning about stratified splits, I hope you enjoyed this video and notebook, and I hope to see you in the next notebook where we start to do some classification. All right, have a great rest of your day. Bye.
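For reference, here's a rough sketch of the StratifiedKFold usage walked through above. The stand-in data, the column names, and the number of splits are assumptions for illustration, not the notebook's exact code.

import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Stand-in for the beer training set; in the notebook this would come from
# the stratified train test split above
beer_train = pd.DataFrame({'IBU': [65, 70, 60, 55, 68, 72, 35, 40, 45, 30],
                           'ABV': [6.5, 7.0, 6.2, 5.9, 6.8, 7.2, 8.5, 9.0, 8.0, 7.8],
                           'Beer_Type': ['IPA'] * 6 + ['Stout'] * 4})

# Set up exactly like a regular KFold object: just the number of splits,
# shuffle, and random state (no test_size for cross validation)
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=21)

# The stratification happens in .split: features first, then the column to stratify by
for train_index, test_index in skf.split(beer_train[['IBU', 'ABV']], beer_train['Beer_Type']):
    holdout = beer_train.iloc[test_index]
    print(holdout['Beer_Type'].value_counts(normalize=True))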