Data Splits and Overfitting Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video, we're going to talk about data splits and overfitting. So let me go ahead and share my Jupyter notebook once again. There is no code in this notebook; we are just going to talk about a conceptual topic so that we have a better idea about overfitting and what it means from our bias-variance tradeoff notebook, and I want to talk about how we can use data splits to keep us from choosing models that are overfitting on the data.

...

So remember, we talked about three different types of data splits and have used them at various points: train test splits, validation sets, and cross validation. When we defined these approaches, we said we want to use them to get some estimate of our generalization error and then use that estimate to compare various models. This is particularly true in the case of cross validation and validation sets. Oftentimes I've said, and you'll hear other people say, that we use data splits (insert whichever data split you're interested in) to combat overfitting. What we mean here is not that using the data split itself gives you a model that is not overfit. It is still possible to have a model that is overfitting on the training data even if you use a validation set, cross validation, or a train test split; doing this alone does not eliminate the risk of overfitting. It does, however, allow you to choose models that overfit less compared to others, even though the model you choose may still be overfitting on the training data itself.

Importantly, this is one of the points of the train test split: it allows you to assess overfitting by comparing the performance, let's say mean squared error, on the training set against the performance on whatever your validation, holdout, or test set was. So in addition to comparing which models perform best on holdout or validation sets, you can take the performance metric on the set you used to train and compare it to the performance metric on the validation, holdout, or test set. If the training set shows significantly better performance than the test, holdout, or validation set, that's an indicator that your model is overfitting on the training data.

Now, as we said, you can use validation sets or cross validation to choose the model that does this the least; that doesn't change the fact that your model may still be overfitting. But at the very least, measuring how much your model may be overfitting, by looking at the difference between training performance and validation, test, or holdout performance, allows you to understand how much your model is overfitting on the training data and gives you a sense of what to expect going forward, whether you are looking at test sets or, in the future, assessing performance on actual data that you don't have the output for.
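To make that comparison concrete, here is a minimal sketch of the idea, assuming scikit-learn and NumPy are available. The synthetic data and the deliberately flexible polynomial model are illustrative assumptions, not anything taken from the lecture itself.

```python
# Minimal sketch: compare training error to validation error to spot overfitting.
# The data and model choice here are made up for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic regression data: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)

# Hold out a validation set; the model only ever sees the training portion.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A deliberately flexible model (degree-15 polynomial) that tends to overfit.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

# The split does not prevent overfitting; it lets us measure the gap.
train_mse = mean_squared_error(y_train, model.predict(X_train))
val_mse = mean_squared_error(y_val, model.predict(X_val))
print(f"Training MSE:   {train_mse:.4f}")
print(f"Validation MSE: {val_mse:.4f}")
```

A validation MSE noticeably larger than the training MSE is exactly the gap described above, and cross validation would simply repeat this comparison across several folds before averaging.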
OK. So that was a short video, but this is an important concept that a lot of people get slightly incorrect, or maybe majorly incorrect, on the internet. And a lot of us, after this boot camp, will presumably be doing our work by looking up things we don't understand on the internet. So it's important to have a strong foundation from these videos and lectures. All right. So I hope you enjoyed this video, where we talked about how data splits can be used to combat overfitting, which is really more about identifying when overfitting is occurring. And I hope to see you in the next video. Have a great rest of your day. Bye.