Bias-Variance Trade-Off Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. In this video we continue learning about supervised learning by talking about the bias-variance tradeoff, which comes into play when comparing different types of models and fitting models. Let me go ahead and share my Jupyter notebook. This Jupyter notebook isn't going to involve any coding by you; I've written all the code and all the code has been run, though you may want to rerun it before watching the video. This is just an educational video about something called the bias-variance tradeoff. In previous notebooks we may have mentioned things like overfitting and how the use of cross-validation or validation sets might allow us to combat it. In this notebook we're going to give the idea behind the bias-variance tradeoff, which is the source of the notion of overfitting. So we're going to discuss the bias of an estimate F hat, which leads to underfitting; we're going to discuss the variance of an estimate F hat, which leads to overfitting; and then we'll demonstrate that there is a trade-off between the two of them when you increase or decrease the complexity of the model.

Remember, in the supervised learning framework that we set up, we're trying to fit a model Y equals F of X plus epsilon, where F is some function, Y is our output, X are our features, and epsilon is random noise. In this process we're trying to produce an estimate of the function F, and usually we call this estimate F hat. We've discussed before that for predictive modeling we want an estimate with very low generalization error. Remember, generalization error is the error of the model on values that we haven't seen before. So, as with cross-validation, we're going to look at the expected value of the squared difference between Y and Y hat (it could be a different type of error function, but in general we'll use the squared difference). We're going to let y zero and x zero denote a single test set, which should be distinguished from the notion of doing a train-test split; here we're talking about going out, randomly collecting a new test set, and treating that set as fixed. That's how we evaluate the generalization error: how does the model perform on this set-aside set?

Taking all this into account, we can talk about the expected generalization error. For notation purposes, we'll write it as the expected value of the actual observation on this test set minus the predicted or estimated observation on this test set, squared. Then we just go ahead and plug everything in. What does that mean? Well, we know that we're talking about y zero minus F hat of x zero, because that's what y hat zero is: we take the estimated value of F and plug in x zero. And then we remember what y zero was. According to our model, y zero is F of x zero plus epsilon, and I've arranged things so that the F minus F hat part comes first and then epsilon comes second. For those of you familiar with probability theory and taking expectations over probability spaces, we'll say in a moment exactly which probability space this expectation is being taken over.
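To make the derivation easier to follow, here is a sketch of the quantity being described and the decomposition it leads to, written out symbolically (the expectation is over the random draw of the training set, and the test point is held fixed):

```latex
% model: y = F(x) + \epsilon, estimate at the test point: \hat{y}_0 = \hat{F}(x_0)
% expectation taken over the random draw of the training set, with (x_0, y_0) fixed
E\left[\left(y_0 - \hat{y}_0\right)^2\right]
  = E\left[\left(F(x_0) - \hat{F}(x_0) + \epsilon\right)^2\right]
  = \mathrm{Var}\!\left(\hat{F}(x_0)\right)
    + \left(E\!\left[F(x_0) - \hat{F}(x_0)\right]\right)^2
    + \mathrm{Var}(\epsilon)
```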
What we're saying here is that the probability space we're taking the expectation over is the space of all possible training sets. Remember, at the beginning we can think of randomly going out and collecting data for the training set, so our randomness is coming from the selection of the training set, and that's what we're taking the expectation over. If you manipulate some things around, using some rules from probability theory and a little bit of algebra (I believe you have to use the "plus one minus one" trick, which those of you coming from math should be somewhat familiar with), you're able to show that the expression we derived above, just by plugging things in, is equivalent to the variance of the estimate plus the bias squared of the estimate plus the variance of the random noise epsilon, which is the decomposition written out above. If you've never heard of bias before, the bias is the expected value of the actual thing minus the estimated thing. One way to think about it heuristically is: how far, on average, is your estimator from the thing that it's estimating?

Now, variance is always non-negative, and bias squared is also non-negative. So the best we can possibly do, according to the derivation outlined above, with any estimate of F is a variance of zero and a bias squared of zero, leaving the variance of epsilon, which is our irreducible error. That means that no matter how well we do, the best model you can find is always going to have some small amount of irreducible error.

Another thing that might occur to you is to simply find the model that has variance equal to zero and bias equal to zero. It's often not possible to reduce both of these to zero. Oftentimes when you lower a model or algorithm's bias, you end up increasing its variance, and vice versa. As a side note, high bias usually means that we're underfitting the data, while high variance is akin to overfitting the data. We're going to show that in action with this code.

Here I've made some data to demonstrate what we're talking about. X is 500 evenly spaced points from negative 3 to 3, and Y is equal to X times (X minus 1), which is a parabola, plus random noise with a standard deviation of 1.2, I believe. Then we plot that (a sketch of this data-generation step is given below). So here is our training data, the blue dots; remember, we're going to take a bunch of different training sets in order to drive home the bias-variance tradeoff. The true relationship underlying the training data is this black line, which is the parabola. Remember that in the real world, if this were a real problem we were working on, we would only see the blue dots, not the black line, and the black line is the thing we want to estimate.

So what's an example of a model with high bias for this data set? A linear regression model with high bias but low variance would be the one that uses no predictors, which is what we've been using as a baseline in our regression materials. We just take the average value of Y as our model: high bias, low variance. This is very much underfitting the data, because the data has a clear pattern.
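The notebook's exact code isn't reproduced in this transcript, but a minimal sketch of the data-generation step and the baseline mean-of-Y model described above might look like the following; the noise standard deviation of 1.2 and the range of X come from the description in the video, while the variable names, random seed, and plotting details are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(216)  # arbitrary seed, just for reproducibility

# 500 evenly spaced points from -3 to 3
x = np.linspace(-3, 3, 500)

# true relationship: the parabola y = x * (x - 1), plus noise with standard deviation 1.2
y = x * (x - 1) + rng.normal(loc=0, scale=1.2, size=len(x))

# high bias / low variance baseline "model": ignore x and predict the mean of y everywhere
baseline_prediction = np.full_like(x, y.mean())

plt.scatter(x, y, s=10, label="training data")
plt.plot(x, x * (x - 1), "k", label="true relationship")
plt.plot(x, baseline_prediction, "r--", label="baseline (mean of y)")
plt.legend()
plt.show()
```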
It's high bias because we're far from the true relationship: we know the true relationship is this parabola, but we're assuming Y is always fixed at a horizontal line, meaning X has no impact on Y at all. It's low variance due to the law of large numbers. The law of large numbers tells us that if our sample is large enough, the arithmetic mean of Y should be close to its expected value. So as long as we have enough of these blue dots, our mean-of-Y model should be roughly the same over different training sets. That's low variance.

A model with high variance would be one where we take a very high degree polynomial of X, say a polynomial regression with terms up to X to the 30th. This is low bias because high degree polynomials are actually going to be able to fit the pattern pretty well, so on average we won't be very far from the true relationship; we'll have a wiggly line that goes back and forth across it. It's high variance because high degree polynomials tend to try to fit all of the data points at once, so with each new training set we randomly pick, we're going to get very different wiggly curves. We'll see that in a second. The model we're looking for, the one that doesn't have too much bias or too much variance, is a low degree polynomial. Since we know how the data was generated, we would want to choose X squared as our highest power, but something like X to the fourth would probably also have low bias and low variance.

We're going to demonstrate the bias-variance tradeoff with this code chunk. There's some code here that we haven't touched on yet if you're watching these videos in order, so you may not know what a pipeline is or what polynomial features are. Again, the key is not to focus on the code but on the concepts; do know that we will cover pipelines and polynomial features in a different video and notebook. What I'm going to do is randomly generate five sets of training data. Each time through this loop I generate a new training set, and then I fit three different types of models. The first model is a high degree polynomial of degree 20; this is our high variance, low bias model. Then I fit what's known as our Goldilocks model, if you're familiar with that childhood tale, the "just right" model, which has low bias and low variance; this is a degree two polynomial. And finally we plot the high bias, low variance model, which is just the average value of Y for all values of X. We do this, as I said, over five distinct randomly selected training sets, and we'll see which of these models are low bias but high variance and which are high bias but low variance. (A sketch of this loop is given below.)
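Here is a hedged sketch of the loop just described. The use of scikit-learn's Pipeline, PolynomialFeatures, and LinearRegression matches what the video names, but the panel layout, random seed, and plotting details are assumptions about what the notebook actually does.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(216)
x_grid = np.linspace(-3, 3, 500)

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
titles = ["high bias / low variance (mean of y)",
          "just right (degree 2)",
          "low bias / high variance (degree 20)"]

# five randomly generated training sets
for i in range(5):
    x_train = rng.uniform(-3, 3, 500)
    y_train = x_train * (x_train - 1) + rng.normal(0, 1.2, 500)

    # high bias / low variance: predict the training mean everywhere
    axes[0].plot(x_grid, np.full_like(x_grid, y_train.mean()), alpha=0.6)

    # the "Goldilocks" and high-variance models: polynomial regression pipelines
    for ax, degree in [(axes[1], 2), (axes[2], 20)]:
        pipe = Pipeline([("poly", PolynomialFeatures(degree)),
                         ("reg", LinearRegression())])
        pipe.fit(x_train.reshape(-1, 1), y_train)
        ax.plot(x_grid, pipe.predict(x_grid.reshape(-1, 1)), alpha=0.6)

# overlay the true relationship (the parabola) on each panel
for ax, title in zip(axes, titles):
    ax.plot(x_grid, x_grid * (x_grid - 1), "k", linewidth=2)
    ax.set_title(title)
plt.show()
```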
In all three of these plots we have the true relationship as the solid black line, and then the model fits from the five different training sets are plotted in a bunch of different colors.

For the high bias, low variance model, the fit is just the expected value of Y. We can see that it's definitely high bias, because it's far away from our true relationship, but it is low variance, because over the different training sets we're basically ending up with the same horizontal line. Now let's go to the low bias, high variance plot on the right. We can see that this is low bias because, with all of these different wiggly curves, we stay very close to the parabola. But we can tell it's high variance because for each of the five training sets we get very different, non-overlapping curves; each individual curve is highly specific to its training data. In contrast, the "just right" model has very low bias, as it should, because it's basically right on top of the actual relationship, and very little variance, because the sample is large enough that it pins down the relationship closely across training sets.

So here we can see the tradeoff, and why it's called a tradeoff. If we go from left to right in these plots, the model on the left is not very complex (we're just taking the average), and as we go further right we're increasing the degree of the polynomial we're fitting; we can think of that as the model complexity. The tradeoff between bias and variance is that as you increase the complexity of your model, you may be lowering the bias squared of the model while increasing its variance; as you increase or decrease model complexity, you're usually making one of them larger while making the other smaller. The goal is to find the model with the lowest generalization error, which isn't always at the point where the bias and variance curves intersect; it may be a point where you're willing to take a more biased model because it gives you lower variance. A lot of the models we'll learn from here on out may seem weird, because we're deliberately introducing additional bias into the model itself, but by introducing that additional bias we may be lowering the variance by enough that we end up with a lower generalization error. That's the idea of the bias-variance tradeoff.

We're going to go ahead and... oh, it looks like something didn't run there, but that's because I forgot I had already run all the code, so ignore that. Your version will have all the code run for it; don't run any of the code chunks. OK. So what we're going to do is plot the generalization error as we increase the model complexity, and by that I mean we're going to fit polynomials of many different degrees. We will have 10 different training sets, and for each of those training sets we will fit polynomials of degree 1 through 100. Then we will plot the average generalization error across the 10 training sets below. (A sketch of this experiment follows.)
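A rough sketch of what that experiment could look like is below, under the assumption that the "generalization error" is measured as mean squared error on a single fixed, set-aside test set, as described earlier in the video; the seed, helper function, and plotting details are illustrative, not the notebook's actual code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(216)

def make_data(n=500):
    # draw a fresh sample from the same data-generating process
    x = rng.uniform(-3, 3, n)
    y = x * (x - 1) + rng.normal(0, 1.2, n)
    return x.reshape(-1, 1), y

# one fixed, set-aside test set used to estimate the generalization error
x_test, y_test = make_data()

degrees = range(1, 101)
errors = np.zeros((10, len(degrees)))

# ten randomly drawn training sets, degrees 1 through 100 fit on each
# (very high degrees are numerically ill-conditioned; this is only meant to show the trend)
for i in range(10):
    x_train, y_train = make_data()
    for j, degree in enumerate(degrees):
        pipe = Pipeline([("poly", PolynomialFeatures(degree)),
                         ("reg", LinearRegression())])
        pipe.fit(x_train, y_train)
        errors[i, j] = mean_squared_error(y_test, pipe.predict(x_test))

# average generalization error across the ten training sets, by polynomial degree
plt.plot(degrees, errors.mean(axis=0))
plt.xlabel("polynomial degree (model complexity)")
plt.ylabel("average generalization error (MSE)")
plt.show()
```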
So that's what we have here: the average generalization error for each polynomial degree. The degree two polynomial is the one with the lowest error out of all of these, and we can see that after we pass that level of complexity (the complexity for us being the polynomial degree), the generalization error starts to increase from there on. OK, so that is the idea behind the bias-variance tradeoff. In this setting of polynomial regression, the way to increase or decrease model complexity is to change the degree of the polynomial; in general with multiple linear regression it's usually by introducing more and more variables, and other settings will have different notions of what model complexity is. So I hope you enjoyed this video, I hope you learned something about the bias-variance tradeoff, and I hope to see you in the next video. Have a great rest of your day. Bye.