Regularization Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi, everybody. Welcome back. In this video we talk about regularization as a regression technique. Let me go ahead and share my Jupyter notebook. So the idea behind regularization is that we're going to add what's known as a penalty term to our loss function, and we'll show that this can help us with problems regarding overfitting. We're going to introduce the general idea behind regularization. We'll then use that to set regularization up as a constrained optimization problem. We'll discuss ridge and lasso regression as two very particular versions of regularization, and then we'll show at the end how we can use lasso for feature selection, which will be touched upon again in a later notebook. As a quick note, this one's a little bit heavier on the math. It's not too heavy, but it's a little heavier than some of our other notebooks. I do my best to provide both mathematical insight into how this stuff works and a broader overview for those of you who aren't as interested in the mathematical details.

So we're going to return to an example from the bias-variance tradeoff notebook, where we had a true relationship that was a parabola, along with some observed data. Remember, in the bias-variance tradeoff notebook we talked about how models can either have high bias and low variance or high variance and low bias. We're going to see what happens when we fit higher and higher degree polynomials to this data — I believe we go up to degree 26 — adding a new degree one at a time, and then we'll see what happens to the coefficients. Here I've imported some things that we'll need to do this. We're going to record the coefficients — beta one on X, beta two on X squared, beta three on X cubed, and so forth — using a for loop that goes from 1 to 26 and fits a polynomial regression model of each degree in sequence. So the first model will just be Y equals beta zero plus beta one X, the second one will be Y equals beta zero plus beta one X plus beta two X squared, then beta zero plus beta one X plus beta two X squared plus beta three X cubed, and so forth. Then we'll record the coefficients and print them nicely using pandas.

So here are our coefficients. You can see X all the way up to X to the tenth, then some columns are hidden, and then we pick back up and go all the way to X to the twenty-sixth. Each row here represents the coefficients of one model — maybe I'll even zoom in a little. For the first one, for instance, we're basically just getting the line Y equals minus X. As we go along, higher and higher degree terms get added. The key thing starts to happen once we get to the higher-degree models, so we'll focus in on those. Let's start with degree 19. Earlier, at relatively low degrees, the coefficient on something like X to the fifth was small, say around negative 0.1 or negative 0.2. But as the degree gets larger, we get up to values like negative 21, negative 31, and we can see here that X to the sixth gets up to around negative 200. And by "up" I mean in magnitude, not sign — negative numbers are obviously down, but magnitude-wise it gets up there. For X to the seventh we get to 262, negative 207.
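Before moving on, here is a minimal sketch of the kind of loop just described. The array names x and y are illustrative stand-ins for the observed data, and the pipeline structure is an assumption about how the notebook builds the polynomial models, not the notebook's exact code.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

max_degree = 26
coef_rows = []

for degree in range(1, max_degree + 1):
    # Build the features x, x^2, ..., x^degree and fit ordinary least squares.
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
        ("reg", LinearRegression()),
    ])
    pipe.fit(x.reshape(-1, 1), y)

    # Pad with NaN so every row has the same number of columns.
    row = np.full(max_degree, np.nan)
    row[:degree] = pipe.named_steps["reg"].coef_
    coef_rows.append(row)

# One row per model (indexed by degree), one column per power of x.
coef_df = pd.DataFrame(
    coef_rows,
    columns=[f"x^{p}" for p in range(1, max_degree + 1)],
    index=range(1, max_degree + 1),
)
print(coef_df.round(2))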
And so essentially what people noticed is that when you start to overfit like this — with high-degree polynomials, in this particular example — you start to get very large values for the coefficients. This is sometimes referred to as a coefficient explosion: when we overfit, at least in this way, we get very large coefficients. One way we can look at this is by visualizing the vector norm of the coefficient vector as we increase the degree of the polynomial we're fitting. We can see that for the most part it's not too big, but then it starts to creep up and then it sort of explodes as we approach degree 26. So, looking at the data frame we just discussed as well as this image, we can see that our coefficients start to explode as we overfit with these high-degree polynomials. That observation is a big part of the idea behind regularization.

For notation's sake, we're going to separate out our intercept beta zero from the rest of our coefficients, and we're going to collect those coefficients in beta — beta one, beta two, and so on — as a column vector. Remember, in ordinary least squares we're trying to minimize the mean squared error: we want to find the beta hats, the estimates, that make the mean squared error as small as possible. And we noticed that in this sort of example, when we have a lot of features, one way the error gets driven down is by taking on really high-value betas. So how do we measure "high-value betas" in general? With vector norms. A vector norm, in essence, is a way to measure the length of a vector. There are various norms we could consider — in this notebook we'll consider two explicitly, but there are many other types. If you don't know what a norm is, just think of it as the way we tell how long a vector is. Notation-wise, we denote norms with double bars; think of absolute value, but with an extra bar on each side. So this notation, beta inside double bars, is the vector norm of beta.

The idea behind regularization is that we're still going to try to minimize the mean squared error, but we're going to do it on a budget. What do we mean by "on a budget"? We only allow ourselves to consider estimates of beta whose vector norm is less than or equal to some constant C, and among those we look for the smallest mean squared error. So think of this as trying to find the smallest mean squared error while we're on a budget of how much we can "spend" on beta. This is actually equivalent to minimizing the following: it looks slightly different, but it's still just the mean squared error plus alpha times the norm of beta. Alpha here is some constant that you choose ahead of time — it's a hyperparameter. The notation with a two on the bottom means it's the two norm, and the superscript two is just an exponent. So the two norm squared is a one squared plus a two squared plus dot dot dot plus a n squared, if a is a vector in R n. Minimizing this penalized version, as I said, is equivalent to the constrained, "on a budget" version of minimizing the MSE.
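To pin the notation down, here is one common way to write the two formulations side by side. This is just a sketch: the exact scaling of the MSE term (for example, whether you divide by n) varies by source, and sklearn's own objective differs slightly in its constants.

\min_{\beta_0,\,\beta}\ \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2 \quad \text{subject to} \quad \lVert \beta \rVert \le C

\min_{\beta_0,\,\beta}\ \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2 \;+\; \alpha\,\lVert \beta \rVert_2^2

The first line is the constrained, "on a budget" version with a generic norm; the second is the penalized version written with the squared two norm, which is the ridge case. For lasso you would swap the little l one norm of beta into the penalty instead.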
A mathematical derivation of the equivalence between the constrained optimization form of the problem and the penalty term version is given in the references below. So alpha, as I said, is a hyperparameter. This may be our first time seeing a hyperparameter: these sorts of parameters are not like the betas; they're something we have to choose before we can fit the algorithm. In practice you can usually find the best alpha — the one that gives, say, the lowest MSE — with a grid search: you go through a sequence of values of alpha and choose the one that does best using cross-validation or a validation set. Different values of alpha will give you different results. For alpha equals zero, since the penalty term goes away, you just get back the standard ordinary least squares estimate for beta. If alpha is infinite, the only way to minimize this is to set beta equal to zero. Values of alpha between those two extremes give you behavior somewhere in between.

That's the general regularization approach. Two specific regularization models are ridge regression and lasso regression. Ridge regression is where we take the vector norm to be specifically the Euclidean norm squared — as I said before, a one squared plus a two squared plus dot dot dot plus a n squared. The other one people regularly consider is lasso regression, where you take the norm to be the little l one norm, which is just the sum of the absolute values of the components: the absolute value of a one, plus the absolute value of a two, plus dot dot dot, plus the absolute value of a n.

To implement this in sklearn, which is what we're going to do now, you have to import Ridge, which is stored in linear_model, as well as Lasso, also in linear_model. And I will point out that ridge and lasso regression are examples of models where unscaled data can mess up the results. One way to think about this: remember, we're limiting the budget on beta, and features on different scales can change the values of beta. For instance, imagine a model where one column of the data set is measured in miles and its coefficient is two. If you converted that same column into a different unit, say feet, the coefficient would have to change by a large factor to express the same relationship, and since the penalty only looks at the size of the coefficients, the choice of units can eat into your beta budget. One way to get around this scale problem is to put all of the features on the same scale using a standard scaler.

So now we're going to import these: from sklearn.linear_model import — and since they're both in linear_model, we can just do Ridge, Lasso, and that imports both of them. This chunk of code is going to demonstrate how different values of alpha impact the values of the coefficients. We're going to do basically the same thing as before, except this time we're not looping through the degree: we're going to fit a degree 10 polynomial, then loop through the different values of alpha and record the coefficients.
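Here is a sketch of the imports just described and the kind of scaled pipeline used in the next part. The step names, the degree, the alpha value, and max_iter are all illustrative assumptions rather than the notebook's exact settings.

from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Scale the input, build the degree-10 polynomial features, then fit ridge.
# Lasso is used the same way; see the loop sketched in the next part.
ridge_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=10, include_bias=False)),
    ("ridge", Ridge(alpha=1.0, max_iter=100_000)),
])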
So you can see what happens to the coefficients as we increase alpha — and remember, increasing alpha makes the penalty more severe, so the larger the alpha, the less room beta has to move. OK, so first let me show you how to fit a ridge regression model. We're using a pipeline here so I can scale the data and then get the polynomial features, and then this part here is the ridge regression model. I put in the name I want so I can access it later — "ridge" — and then you call Ridge and pass in the value for alpha, so alpha equals alphas at index i. There's also this other argument, max_iter. The model is fit with an algorithm that needs to converge, and max_iter is the maximum number of iterations; increasing it just allows the algorithm to actually converge to the minimum point.

Now for lasso it's going to be the same thing: Lasso, with alpha equal to alphas at index i, and max_iter equal to that same number — I think I copied and pasted it. After fixing a couple of small typos — I forgot some commas, and I should change these to alphas — the code runs and we can look at the coefficients for ridge.

So this is what happens with ridge. Each row represents a different value of alpha, going from 10 to the negative fifth all the way up to 1000, multiplying by 10 with each row. We can see that when alpha is very small the coefficients are pretty close to what we would get from a regular linear regression, but as we increase alpha, the coefficients get smaller and smaller, and if we kept going they would eventually shrink to zero.

This is what we get for lasso — let me change that alpha to alphas. We get a similar pattern, so the first row should be pretty similar to what we had up above, but then... hmm, something looks off, let me check my code. There we go: I had accidentally typed Ridge here when it should have been Lasso, so I changed it to Lasso. OK, here we go. So this is what we get for lasso, and what you can notice is a stark difference between the ridge coefficients, which all seem to slowly and gradually shrink toward zero, and the lasso coefficients, which, once we get to alpha equals 0.1 or maybe 1, pretty much all drop to zero at once. So lasso coefficients tend to shrink to zero suddenly and abruptly, whereas ridge coefficients go down toward zero very slowly and gradually. That's a big difference between the two, and it's what makes lasso appealing for feature selection. With lasso, because the coefficients go to zero relatively quickly and abruptly, the things that stick around — well, this time X to the seventh sticks around even though it probably shouldn't — but the things that tend to stick around are the variables that tend to be most important: the ones that provide the most signal, or help lower the MSE the most.
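Putting that together, here is a minimal sketch of the alpha loop just described, again assuming illustrative arrays x and y for the data; the alpha grid matches the one discussed (10 to the negative fifth up through 1000), while degree and max_iter are illustrative.

import pandas as pd
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

alphas = [10.0 ** p for p in range(-5, 4)]   # 1e-05, 1e-04, ..., 1000
ridge_coefs, lasso_coefs = [], []

for alpha in alphas:
    for Model, store in [(Ridge, ridge_coefs), (Lasso, lasso_coefs)]:
        # Scale, build degree-10 polynomial features, then fit the penalized model.
        pipe = Pipeline([
            ("scale", StandardScaler()),
            ("poly", PolynomialFeatures(degree=10, include_bias=False)),
            ("reg", Model(alpha=alpha, max_iter=100_000)),
        ])
        pipe.fit(x.reshape(-1, 1), y)
        store.append(pipe.named_steps["reg"].coef_)

# One row per alpha, one column per power of x.
cols = [f"x^{p}" for p in range(1, 11)]
print(pd.DataFrame(ridge_coefs, index=alphas, columns=cols).round(3))
print(pd.DataFrame(lasso_coefs, index=alphas, columns=cols).round(3))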
And so those are the features that you would want to keep.

OK, so why does lasso do this? We'll talk about that first, and then we'll end by talking about which one is better, or which one you might want to use. So why does lasso shrink coefficients to zero so suddenly? The answer comes from the setup of the problem. Go back to the constrained version: for ridge, we want to minimize the MSE subject to the square of the two norm of beta being less than or equal to C. If we have two features and rework this, it says beta one squared plus beta two squared has to be less than or equal to C. Now, if you remember your geometry — or maybe algebra or trig — that's the formula for a filled-in circle centered at the origin with radius square root of C in R two, and it's visualized down here; we'll get back to that picture in a second. The setup for lasso is that we want to minimize the MSE subject to the l one norm being less than or equal to C. When you have two features, that works out to be a filled-in square with vertices at (C, 0), (0, C), (negative C, 0), and (0, negative C). That's pictured over here, and this image comes from the book The Elements of Statistical Learning.

Essentially, the reason lasso coefficients tend to go to zero quickly and abruptly has to do with the geometry of the constraint region. The blue circle and the blue square are the acceptable values for beta — the blue circle for ridge and the blue square for lasso. Any value in there is a value that beta is allowed to take on given the constraint, namely that the l one norm, or the square of the l two norm, has to be below C. The red ellipses are the level curves of the MSE. The MSE, as a function of the coefficients, is shaped like a paraboloid, so its level curves come out as these oval curves: along one curve the MSE equals two, along another it equals one, along another it equals 0.5, something like that. This point here, beta hat, in both pictures, is the estimate you would get for beta if you did regular ordinary least squares regression — the one that minimizes the MSE, but it sits outside the constraint region. The estimate you get from lasso or ridge regression is going to be where the constraint region — the blue area — first touches a level curve. For lasso, that intersection usually happens at one of the vertices of the square. This picture is in two dimensions, but it's similar in higher dimensions: for lasso the constraint region is the higher-dimensional equivalent of this square, whereas for ridge it's the equivalent of a sphere or circle. The shape of the lasso constraint region tends to make the intersection with the level curve happen on one of the axes, meaning one or more of the coefficients are exactly zero, while the shape of the ridge constraint region tends to make the intersection happen off the axes, out in the interior of the quadrant in this case. So that's the idea here.
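For the two-feature case just described, the two constrained problems can be written out like this (a sketch, using the same MSE notation as before):

\text{ridge:}\quad \min_{\beta_0,\,\beta_1,\,\beta_2}\ \mathrm{MSE}(\beta_0,\beta_1,\beta_2) \quad \text{subject to}\quad \beta_1^2 + \beta_2^2 \le C

\text{lasso:}\quad \min_{\beta_0,\,\beta_1,\,\beta_2}\ \mathrm{MSE}(\beta_0,\beta_1,\beta_2) \quad \text{subject to}\quad \lvert\beta_1\rvert + \lvert\beta_2\rvert \le C

The first constraint set is the disk of radius square root of C (the blue circle), and the second is the diamond with vertices at (C, 0), (0, C), (negative C, 0), (0, negative C) (the blue square), whose corners sit on the axes — which is why the lasso solution so often lands at a point where a coefficient is exactly zero.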
That's why lasso tends to send coefficients to zero very quickly and abruptly, and why ridge does not. So you might ask: well, which one should I use? It really just depends on the problem. Both are good at addressing overfitting concerns, because they both rein in this coefficient explosion, but they have some unique pros and cons.

A nice thing about lasso, obviously, is feature selection, which can give you a sparser model — good for computational reasons. It also works well when you have a large number of features that do not have any effect on the target, which comes back to this whole shrinking-to-zero-quickly behavior: if you have a lot of features in your data set that you don't think have much impact on the target, lasso should help shrink those to zero quite quickly, and you'll be left with the ones that do have an impact. A con for lasso is that it can have trouble when you have features that are highly correlated. Say X one and X two have high correlation: sometimes it's hard for the lasso algorithm to tell which one actually has the effect on Y, and it will essentially choose arbitrarily among the correlated variables. So if Y is actually a function of X two, but X one and X two are highly correlated, lasso might just select X one even though X two would give better predictions.

Ridge regression works well if you think the target might depend on all of the features: if you have a bunch of features that you believe are all related to the outcome, you might want to use ridge to handle the overfitting. It also does better at handling collinearity. However, this same behavior can be a con in some sense, because ridge tends to keep most of the predictors in the model — the coefficients don't all go to zero — and that can be computationally costly if you have a really high number of predictors or features.

Elastic net is an algorithm that sits in between the two; in the practice problems notebook I go over how to implement it in sklearn, and there's a short sketch after the sign-off below. It uses a penalty that mixes the little l one norm and the square of the Euclidean norm. Some nice notebook-specific references are given down here in case you'd like to learn more about both of these techniques.

OK, so I hope you enjoyed this notebook and learning about regularization. I hope to see you next time. Have a great rest of your day, and yeah, I hope to see you next time. Bye.
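As mentioned above, here is a quick sketch of how elastic net can be fit in sklearn. The arrays x and y are again illustrative stand-ins for the data, and the alpha, l1_ratio, degree, and max_iter values are assumptions chosen just for demonstration; the practice problems notebook may set these up differently.

from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# l1_ratio controls the mix between the l1 (lasso-style) and squared l2
# (ridge-style) penalties: 1 recovers lasso, 0 recovers ridge, values in
# between blend the two.
enet_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=10, include_bias=False)),
    ("enet", ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=100_000)),
])
enet_pipe.fit(x.reshape(-1, 1), y)
print(enet_pipe.named_steps["enet"].coef_)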