WEBVTT 00:00:00.000 --> 00:00:02.000 Okay, so I'm gonna start recording. 00:00:02.000 --> 00:00:10.000 Alright! Welcome back! This is day 5 of the lectures for the 2023 May data science boot camp from the Institute. 00:00:10.000 --> 00:00:16.000 So we're gonna keep learning. Today's our last day on linear regression type stuff. 00:00:16.000 --> 00:00:25.000 Tomorrow, on day 6, we'll start learning a little time series. So we're actually gonna start today with something that is not explicitly just regression. 00:00:25.000 --> 00:00:26.000 We're gonna finish up a little bit of data cleaning stuff. 00:00:26.000 --> 00:00:41.000 So I have all of the lectures I'm confident we'll be able to get through already pre-loaded, just because I didn't want to waste time waiting for the kernel, but if you're trying to follow along on your own, we're working on the basic 00:00:41.000 --> 00:00:50.000 pipelines notebook first. So we're gonna talk about pipelines, and let me get my chat window open. 00:00:50.000 --> 00:00:57.000 Okay, so hi. Last week we talked about StandardScaler, and then I believe somebody in the chat pointed out, oh, there are other neat pre-processing steps like polynomial features. 00:00:57.000 --> 00:01:05.000 And so basically, there are times when you want to do a lot of automatable pre-processing steps for your models. 00:01:05.000 --> 00:01:24.000 And it can be a hassle to have to code each one of those steps out one by one by one, particularly when you're doing things like cross-validation, or when you want to go from the training set to the test set, and so there's this concept known as a pipeline which will 00:01:24.000 --> 00:01:29.000 allow you to do everything all at once. You just first have to define the pipeline. 00:01:29.000 --> 00:01:34.000 So we're gonna start with the most basic one, and that's all we're going to cover in live lecture. 00:01:34.000 --> 00:01:49.000 However, there's another one. If you're really interested in doing things with more advanced pipelines, there is a notebook called More Advanced Pipelines, which does have a pre-recorded video. We're not gonna have time to cover that one in live lecture, but if you need to learn more 00:01:49.000 --> 00:01:56.000 about making more complicated pipelines, check out that notebook. It's just one. Okay, so what is a pipeline? 00:01:56.000 --> 00:02:07.000 So we've done, as I said, a little bit of pre-processing, including scaling, and we've made some new features using 00:02:07.000 --> 00:02:14.000 pd.get_dummies. We've also by hand created features like polynomial transformations, like X squared or X cubed. 00:02:14.000 --> 00:02:18.000 And then we've talked about the possibility that you can make other nonlinear transformations like logs or square roots. 00:02:18.000 --> 00:02:33.000 And so these are a lot of different pre-processing steps, and the concept of a pipeline is just a nice framework for combining all of those steps into a single code container.
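As a point of reference, the kind of one-off pre-processing just listed might look roughly like this in code. This is a minimal sketch with a made-up toy frame; the column names and values are purely for illustration, not from the lecture notebooks.

from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

# made-up toy frame, purely for illustration
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "color": ["red", "blue", "red", "blue"]})

x_scaled = StandardScaler().fit_transform(df[["x"]])   # scaling step
dummies = pd.get_dummies(df["color"])                  # new features from a categorical column
df["x_squared"] = df["x"] ** 2                         # hand-made polynomial term
df["log_x"] = np.log(df["x"])                          # another nonlinear transformation

Doing each of these by hand, and then repeating them on a test set or inside cross-validation, is exactly the hassle the pipeline is meant to remove.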
00:02:33.000 --> 00:02:51.000 And so the idea about pipelines, and we'll see more of this in action when we're coding it up, is we want something that will take in our data on one side and then systematically or sequentially go through and apply all the different pre-processing steps, and then the last step would be the model that 00:02:51.000 --> 00:03:30.000 we're going to fit and then get predictions from. And so basically, we're going to define our pipeline to be this sort of coding object that has both all of the pre-processing steps and the model, which we can fit and which then provides things like the transformed data, the fitted model, and predictions with commands like .fit, .transform, or .predict. 00:03:30.000 --> 00:03:33.000 So I'm just generating some random data here. 00:03:33.000 --> 00:03:34.000 So this is random, synthetic data, so I could always go back and generate more. 00:03:34.000 --> 00:03:42.000 So you might be wondering, why aren't you doing a train test split? 00:03:42.000 --> 00:03:47.000 It's just because this is random data, and the focus here isn't predictive modeling, it's pipelines. 00:03:47.000 --> 00:04:06.000 So my goal here is I want to fit a polynomial regression model that regresses Y on X, and I wanna do that in a way where I don't have to go through and always do like finding X to the first power, X to the second power, X to the third power. So the first thing we're 00:04:06.000 --> 00:04:09.000 gonna learn is something known as polynomial features. 00:04:09.000 --> 00:04:15.000 So PolynomialFeatures is what's known as a transformer object in sklearn. 00:04:15.000 --> 00:04:20.000 And so these are really similar to scaler objects like StandardScaler. 00:04:20.000 --> 00:04:30.000 And so remember, StandardScaler had things like fit, transform, and fit_transform. Transformer objects have the same thing, where you fit it 00:04:30.000 --> 00:04:43.000 in some sense, then you transform the data. And so PolynomialFeatures takes in your data and then does a polynomial transformation, where it will take in your columns and you specify a degree. 00:04:43.000 --> 00:04:51.000 A degree-2 PolynomialFeatures would take in the data and fit itself, and what that means is, when it says fit, it's like, okay, how many columns do I have? 00:04:51.000 --> 00:04:58.000 And then what transformations do I need to compute, which for us would be x 00:04:58.000 --> 00:05:04.000 1, x 2, x 1 squared, x 1 times x 2, and x 2 squared with degree 2. 00:05:04.000 --> 00:05:08.000 And then, when you call .transform, it will actually provide this data frame. 00:05:08.000 --> 00:05:18.000 So this is stored in preprocessing, and I've already typed it out here. 00:05:18.000 --> 00:05:20.000 So from sklearn.preprocessing import PolynomialFeatures, capitalized. 00:05:20.000 --> 00:05:24.000 And so, in order to use this, we're going to first define it. So we do 00:05:24.000 --> 00:05:32.000 PolynomialFeatures. Now there's gonna be 2 arguments that I wanna put in. 00:05:32.000 --> 00:05:37.000 The first argument is interaction.
Actually, sorry, the first argument should be the degree. 00:05:37.000 --> 00:05:43.000 So I'm gonna put 2, because I want to... or what do I want for this? 00:05:43.000 --> 00:05:46.000 Let's go with, I guess, for the demonstration, 00:05:46.000 --> 00:05:50.000 I'm just gonna do 2. So 2. And then the next argument is something called interaction underscore 00:05:50.000 --> 00:06:02.000 only. And so when this is true, it's only gonna give you interaction terms like x 1 times x 2. 00:06:02.000 --> 00:06:08.000 When this is false, it will give you all the terms, so like the ones that I've shown you here up above. 00:06:08.000 --> 00:06:10.000 So I'm gonna set this equal to false, 'cause 00:06:10.000 --> 00:06:17.000 I don't want just the interactions. And then there's finally another argument called include_bias. 00:06:17.000 --> 00:06:20.000 And so include_bias takes in a true or a false value. 00:06:20.000 --> 00:06:28.000 If it's true, it will include a column of ones at the front. So in machine learning, they call intercepts biases. 00:06:28.000 --> 00:06:32.000 So, if it's true, it will include a column of ones, and if it's false it will not include a column of ones. 00:06:32.000 --> 00:06:46.000 So I'm going to set it equal to false, because the other sklearn models don't expect to receive a column of ones. 00:06:46.000 --> 00:06:47.000 So then, once we have our object, we're gonna do fit. 00:06:47.000 --> 00:06:54.000 And did I define this X? So if I do X... 00:06:54.000 --> 00:07:00.000 we're gonna fit X. This might seem weird, it's like, well, what do you need to fit here? It's not like with StandardScaler, where you need to find a mean and a standard deviation. 00:07:00.000 --> 00:07:14.000 So for this particular transformer, when you call fit, what's happening is it's seeing: how many columns does the input data have? 00:07:14.000 --> 00:07:19.000 What is my degree? And then, depending on what the answers to both of those are, 00:07:19.000 --> 00:07:23.000 what new columns do I need to generate? 00:07:23.000 --> 00:07:32.000 So, because my degree here is 2, and the number of columns I have here is one, which reminds me, I think I need to do a reshape 'cause 00:07:32.000 --> 00:07:38.000 it's one dimensional data. So my number of columns here is one, so that means I'm gonna need to produce an X and an X squared. 00:07:38.000 --> 00:07:48.000 And so, after fitting it, we need to do transform to actually get the new stuff. 00:07:48.000 --> 00:07:54.000 So we'll do transform, and then reshape, and then negative one. 00:07:54.000 --> 00:07:58.000 And so you can see here, you know, negative 3: 00:07:58.000 --> 00:08:09.000 that's x, and then the x squared is 9, and if we went back and did cubed just as an example, we can see how it's x, x squared, x cubed. 00:08:09.000 --> 00:08:10.000 I think I'll end up wanting cubed, so I'll keep that for now. 00:08:10.000 --> 00:08:32.000 But before we continue on to talk about pipelines, are there any questions about polynomial features? 00:08:32.000 --> 00:08:46.000 Let's see. So Zack is asking: does the term pipeline refer exclusively to several steps that can be cross-validated together, as stated in the sklearn documentation, for the Pipeline object context?
00:08:46.000 --> 00:08:55.000 In my academic field, pipeline is used to refer to the full push-button, replicable, end-to-end analysis, including data cleaning and visualization. 00:08:55.000 --> 00:09:05.000 So, I would say, in this particular context it's like the sklearn usage. 00:09:05.000 --> 00:09:23.000 I know there's also sort of the business context of pipeline, and I'm not sure what your field is, but it's kind of the same thing where, when they say pipeline, they mean sort of the whole thing of collecting the data, cleaning the data, fitting models, and 00:09:23.000 --> 00:09:33.000 then producing analytics-based visualizations or tables based on the results of models. So here, when we say pipeline, we literally just mean the sklearn stuff that we're fitting, 00:09:33.000 --> 00:09:40.000 so the pre-processing in sklearn followed by whatever model we're gonna do. Dustin is asking, does it sample 00:09:40.000 --> 00:09:57.000 non-integer powers? So PolynomialFeatures is only going to give you the polynomials, which means the positive integer powers. 00:09:57.000 --> 00:09:58.000 I have a good question. 00:09:58.000 --> 00:10:00.000 Yeah. 00:10:00.000 --> 00:10:14.000 Why did your... so you see in the graphic you show that you have the x 1, x 2, and then the output would have all those degree-2 terms. 00:10:14.000 --> 00:10:15.000 Yup! 00:10:15.000 --> 00:10:18.000 How come you only got, like, just x 1 and then x 1 squared, x 1 cubed? 00:10:18.000 --> 00:10:19.000 Yeah, so this possibly is just a bad visual to have, 00:10:19.000 --> 00:10:24.000 but I wanted to show the fact that it also does interaction terms. 00:10:24.000 --> 00:10:27.000 So the particular X we have here is one dimensional, like it just has one column. 00:10:27.000 --> 00:10:56.000 So this is X, but presented as a column vector. So because this only has one column, it can only produce, like, X squared, X cubed, etc. If we had one that had 2 columns, it would produce something like this. 00:10:56.000 --> 00:11:07.000 Yeah. Any other questions about polynomial features? 00:11:07.000 --> 00:11:08.000 Okay, so now we're gonna talk about how to define a pipeline in sklearn. 00:11:08.000 --> 00:11:19.000 So the first thing you need to do is you need to import Pipeline, which is conveniently stored in the pipeline sub-package. 00:11:19.000 --> 00:11:22.000 I'm not quite sure what it's called; 00:11:22.000 --> 00:11:34.000 I call it a sub-package. So from sklearn dot lowercase pipeline import uppercase-P Pipeline, and then once you have it, what we're going to try and fit 00:11:34.000 --> 00:11:45.000 here is the following polynomial regression model: Y is equal to beta 0 plus beta 1 x plus beta 2 x squared plus beta 3 x cubed plus epsilon. 00:11:45.000 --> 00:11:56.000 And so for us, that means the first step for our data is we want to apply a polynomial features transformer object of degree 3, 00:11:56.000 --> 00:11:58.000 so this one that we showed up here, and then after that we want to fit a linear regression model. 00:11:58.000 --> 00:12:05.000 So this is, schematically, what we're looking at. 00:12:05.000 --> 00:12:09.000 So now we just need to show you how to actually implement that in Python. 00:12:09.000 --> 00:12:12.000 So the first thing you do is you type the object name, 00:12:12.000 --> 00:12:17.000 so Pipeline, then within Pipeline you put a list, and then the entries of your list
00:12:17.000 --> 00:12:32.000 are tuples, and each tuple has as its first entry a string which is the name of that step, and then the actual object that is that step. 00:12:32.000 --> 00:12:40.000 So, for instance, in this diagram the first one I'm gonna call poly, and then the object is this particular PolynomialFeatures object. 00:12:40.000 --> 00:12:41.000 And then the second one I'm going to call reg, and then the object that goes along with it is this LinearRegression. 00:12:41.000 --> 00:12:50.000 So I'm gonna first put in a tuple for my polynomial features that I'm going to call poly, 00:12:50.000 --> 00:12:57.000 and then I'm going to put PolynomialFeatures, 00:12:57.000 --> 00:13:09.000 degree 3, interaction_only equals false, include_bias equals false. 00:13:09.000 --> 00:13:14.000 So remember, include_bias is false here because the linear regression model fits an intercept by default, 00:13:14.000 --> 00:13:19.000 so I don't need the bias column. Next, 00:13:19.000 --> 00:13:21.000 I'm gonna put a tuple. So notice the comma here; 00:13:21.000 --> 00:13:29.000 it's a list of tuples. I'm gonna put a tuple called reg that will contain my linear regression. 00:13:29.000 --> 00:13:30.000 And did I import that already? I don't think so. 00:13:30.000 --> 00:13:39.000 So let me also just make sure. So from sklearn dot 00:13:39.000 --> 00:13:43.000 linear_model import 00:13:43.000 --> 00:13:46.000 LinearRegression. There's a chance I imported it already above, 00:13:46.000 --> 00:13:56.000 but just to make sure. Okay, so LinearRegression, and then just copy_X equals true. 00:13:56.000 --> 00:14:01.000 Okay. So now I have a pipeline, and you fit and make predictions in exactly the same way. 00:14:01.000 --> 00:14:05.000 So we do pipe dot fit, and you put in your features first, 00:14:05.000 --> 00:14:14.000 so x dot reshape negative one comma one, followed by y. 00:14:14.000 --> 00:14:17.000 And now that I have a fitted pipeline, I can make predictions. 00:14:17.000 --> 00:14:21.000 So pipe dot predict. So the pipeline basically acts identically to the model; 00:14:21.000 --> 00:14:27.000 I can just do pipe dot predict and get my predictions. 00:14:27.000 --> 00:14:33.000 You can access individual steps of the pipeline, just like a dictionary. 00:14:33.000 --> 00:14:43.000 So remember, I called my polynomial features transformer poly, and so now I can get the aspects of the polynomial features step. 00:14:43.000 --> 00:15:00.000 I called my regression object reg, and so I can even get out things like the coefficients on the fitted regression or the intercept on the fitted regression object. 00:15:00.000 --> 00:15:01.000 Okay, so this is a good introduction to basic pipelines. 00:15:01.000 --> 00:15:16.000 When you get more complicated data and more complicated models, you'll have to do slightly more advanced things like making custom transformer objects or custom scaler objects. 00:15:16.000 --> 00:15:26.000 If you need to learn about how to do that for your projects or any other type of thing you're doing, check out notebook number 5 in cleaning; it goes over how to do all of that. 00:15:26.000 --> 00:15:34.000 But for us, in the rest of the lectures we're good with just the basic pipeline. 00:15:34.000 --> 00:15:35.000 So Brooks is asking, is there any reason to prefer Pipeline over 00:15:35.000 --> 00:15:42.000 make_pipeline? Both are from
00:15:42.000 --> 00:15:47.000 sklearn, but have slightly different structure. So, I've never used make_pipeline, 00:15:47.000 --> 00:15:51.000 so I don't actually know what it does. I've only ever used Pipeline. 00:15:51.000 --> 00:16:01.000 So it's possible that they introduced make_pipeline after I had already learned Pipeline, so I never bothered looking into it. So I would just say, if you're interested in knowing the difference, check out the documentation. 00:16:01.000 --> 00:16:02.000 I'm sure they point out maybe pros or cons of using one versus the other. 00:16:02.000 --> 00:16:08.000 Maybe not phrased that way; you'd have to look and see, like, okay, make_pipeline does this versus Pipeline. 00:16:08.000 --> 00:16:18.000 But you'd have to read the documentation. 00:16:18.000 --> 00:16:27.000 So, Pedro, I'm not sure if I got the whole question, but here I did fit X comma y, 00:16:27.000 --> 00:16:28.000 because with the pipeline, remember, the goal is to fit a model. 00:16:28.000 --> 00:16:49.000 So basically what the pipeline's doing is the X goes along, gets transformed, and then the Y kind of just follows along with it until the model at the end, where the regression gets fit. 00:16:49.000 --> 00:16:53.000 So the model needs the Y, so the Y just kinda goes along, 00:16:53.000 --> 00:16:56.000 but that's why it's fit of X comma y: 00:16:56.000 --> 00:17:06.000 because the pipeline is fitting not just the polynomial features but also the linear regression model. 00:17:06.000 --> 00:17:10.000 Then Pedro asks: we did not include the task of transforming the data in the pipeline; 00:17:10.000 --> 00:17:21.000 could we have done otherwise? So, the pipeline automatically, once it is fit, will transform the data for you. 00:17:21.000 --> 00:17:31.000 So basically, how the pipeline's working is you just give it the objects, and it knows, based on its structure, that PolynomialFeatures has a fit and a transform. 00:17:31.000 --> 00:17:42.000 So what happens when you call fit is it first goes through and fits the polynomial features on X, so it's probably calling like a fit_transform, then transforms those features to be used in the fitting of the linear regression. 00:17:42.000 --> 00:17:59.000 Then, when I call predict down here, it's not refitting the polynomial features, because they're already fit; 00:17:59.000 --> 00:18:08.000 it's just transforming this data and then feeding it into the prediction part of the linear regression model. 00:18:08.000 --> 00:18:30.000 Okay, are there any other questions about the pipeline? 00:18:30.000 --> 00:18:33.000 Okay. 00:18:33.000 --> 00:18:38.000 So that's gonna be it for pipelines. 00:18:38.000 --> 00:18:41.000 And now we're gonna go to supervised learning once again. 00:18:41.000 --> 00:18:55.000 I already have it open, but if you're trying to follow along, we are in supervised learning, and to sort of motivate what we're gonna do in one of our last regression live lecture notebooks, 00:18:55.000 --> 00:19:00.000 we're gonna talk about something called the bias-variance trade-off, which is a concept in supervised learning. 00:19:00.000 --> 00:19:04.000 So we're going to start here with our supervised learning for today and then go back into regression after we cover this.
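Before leaving pipelines behind, here is a minimal sketch pulling together the workflow that was just demonstrated. The synthetic data and specific numbers are stand-ins, not the notebook's actual values; only the Pipeline, PolynomialFeatures, and LinearRegression usage mirrors the lecture.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# synthetic 1-D data, standing in for the random data from the notebook
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 50)
y = x**3 - 2 * x + rng.normal(0, 1, 50)

pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=3, interaction_only=False, include_bias=False)),
    ("reg", LinearRegression(copy_X=True)),
])

pipe.fit(x.reshape(-1, 1), y)            # fits PolynomialFeatures, then the regression
preds = pipe.predict(x.reshape(-1, 1))   # reuses the already-fit transformer, then predicts

# individual steps are accessible by name, like a dictionary
print(pipe["poly"].n_output_features_)
print(pipe["reg"].coef_, pipe["reg"].intercept_)

Calling fit runs the transformer's fit and transform in order before fitting the final model; calling predict only transforms and then predicts, which matches the behavior described above.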
00:19:04.000 --> 00:19:13.000 So this notebook is a conceptual one. There's not going to be a ton of coding; 00:19:13.000 --> 00:19:17.000 all the coding is just sort of to help illustrate the concepts. 00:19:17.000 --> 00:19:20.000 So we're not introducing new commands in this notebook. 00:19:20.000 --> 00:19:30.000 So in particular, we're gonna talk about things like the bias of the estimate of the function. 00:19:30.000 --> 00:19:36.000 So remember, in supervised learning we have this model that we're trying to fit and estimate, 00:19:36.000 --> 00:19:39.000 and the thing we're particularly trying to estimate is the function of X. 00:19:39.000 --> 00:19:42.000 So we're going to talk about the bias of that estimate, 00:19:42.000 --> 00:19:51.000 the variance of that estimate, and then sort of show you that there's a trade-off between those two things as you increase or decrease 00:19:51.000 --> 00:19:55.000 something called the complexity of the model that you're trying to fit. 00:19:55.000 --> 00:19:59.000 So, remember our framework, where we're assuming that Y is equal to a function of the features 00:19:59.000 --> 00:20:06.000 X plus some random noise, and typically we're assuming this random noise is independent of the features 00:20:06.000 --> 00:20:24.000 X. So we have some algorithm, up to this point it's been linear regression, but it could be any supervised learning algorithm, that we then use to estimate f with an estimate that I'm going to call f hat, 00:20:24.000 --> 00:20:28.000 so the little caret on top of the f. 00:20:28.000 --> 00:20:36.000 So last week we discussed that we're really just interested in understanding the generalization error of our algorithm, meaning our goal is to get as low a generalization error as possible. 00:20:36.000 --> 00:21:00.000 So remember, generalization error is the error on a set that the algorithm was not trained on, and so in particular, if we have a set called (y 0, x 0), which is denoting a single test observation or a set of data we're trying to generalize on, so data it was 00:21:00.000 --> 00:21:04.000 not trained on, we can write this mathematically as: we want to know the expected value, if we're using something like MSE, 00:21:04.000 --> 00:21:16.000 of (y 0 minus y hat 0) squared, meaning the actual values minus the predicted values, 00:21:16.000 --> 00:21:27.000 that squared error. Then, if you go through the process of substituting all of that in: what is y hat 0? 00:21:27.000 --> 00:21:47.000 Well, it's f hat evaluated at the features x 0, and then, if you substitute in what y is, well, that's f at x 0 plus epsilon. And here the expectation is being taken over the probability space of all possible training sets, so if you're wondering what we're 00:21:47.000 --> 00:21:56.000 taking the expectation over, it's that. If you do a little bit of algebra and probability theory, you can rewrite this to give you the variance of your estimate, 00:21:56.000 --> 00:22:08.000 remember, we're estimating the function, plus the bias of your estimate squared, plus the variance of the error term. And so I didn't write all this out because it's a lot of algebra 00:22:08.000 --> 00:22:14.000 that I don't wanna talk through, but if you do the algebra and work it out, 00:22:14.000 --> 00:22:25.000 you'll get this.
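Written out, the decomposition being described is the standard identity (with the expectation taken over possible training sets, x_0 held fixed, and hats denoting estimates):

\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big]
  = \mathrm{Var}\big(\hat{f}(x_0)\big)
  + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2
  + \mathrm{Var}(\epsilon),
\qquad
\mathrm{Bias}\big(\hat{f}(x_0)\big) = \mathbb{E}\big[\hat{f}(x_0)\big] - f(x_0).

The three terms on the right are the variance of the estimate, the squared bias, and the irreducible error that the next part of the lecture refers to.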
And so the important thing to take away from this is that you have the variance of the estimate, plus the bias squared of the estimate, plus the irreducible error. 00:22:25.000 --> 00:22:32.000 So there are two kinds of things to take away from this. And maybe just as a refresher, if you don't recall what bias means: it's the difference between the expected value of your estimate and the actual thing it's estimating. 00:22:32.000 --> 00:22:49.000 So this holds not just for this particular function but in general: if you're estimating something, the bias of the estimate is the expected value of the estimate minus the actual thing. 00:22:49.000 --> 00:22:54.000 So one way to think of this is, how far, on average, is your estimator, 00:22:54.000 --> 00:22:59.000 the thing that you're using to make the estimate, from the actual thing that it's estimating? 00:22:59.000 --> 00:23:04.000 So: how far away, on average, is your estimate from the actual value? 00:23:04.000 --> 00:23:11.000 So, because variance and bias squared are both non-negative, 00:23:11.000 --> 00:23:24.000 the best that you could do as an algorithm is something that has 0 bias and 0 variance, meaning only this variance of epsilon is left, which is why it's called the irreducible error. 00:23:24.000 --> 00:23:31.000 So, whatever the variance is of your random noise, you're always gonna have that as part of your generalization error. 00:23:31.000 --> 00:23:51.000 So the best we could do is get down to that. However, it's usually not possible to actually get all the way down to the irreducible error, and also there tends to be this phenomenon where, when you decrease your bias, you'll in general tend to 00:23:51.000 --> 00:24:09.000 increase your variance, and likewise, when you decrease your variance, you tend to increase your bias. The takeaway is just that there's, in general, an inverse relationship between the two, where if one goes down the other one tends to go up, 00:24:09.000 --> 00:24:10.000 so to sort of give you a feeling for what this looks like, 00:24:10.000 --> 00:24:19.000 we're gonna play around with another toy example, where I've got some evenly spaced data from negative 3 to 3 00:24:19.000 --> 00:24:31.000 as my features, and then the model, or the actual relationship, is that y is x times (x minus 1), and then our error is this random noise that I'm highlighting here. 00:24:31.000 --> 00:24:35.000 So the true relationship is the one given by the black line, 00:24:35.000 --> 00:24:39.000 and the data that we've observed are these blue dots. 00:24:39.000 --> 00:24:43.000 So we're gonna give you a sense of what's going on here 00:24:43.000 --> 00:24:49.000 by looking at 3 scenarios. So the first scenario is a model with high bias. 00:24:49.000 --> 00:24:54.000 So remember, bias is sort of how far off from the actual relationship are we? 00:24:54.000 --> 00:25:11.000 So the model that is going to give us high bias would just be our regular baseline that we've talked about before, where we're just going to assume that the value of Y is the expected value of Y plus random noise. 00:25:11.000 --> 00:25:23.000 So basically, we're just assuming that there is no relationship between Y and X; Y is just given by the expected value, or average, of Y plus random noise. 00:25:23.000 --> 00:25:29.000 So this is high bias, because, as we can see, there is a relationship here between Y and X.
00:25:29.000 --> 00:25:39.000 So what we would get each time through is sort of a horizontal line at the average value of Y, and that's pretty far from our true relationship. 00:25:39.000 --> 00:25:49.000 But it's low variance, because the law of large numbers tells us that, as we go through the different training sets, remember, that's where the randomness is coming from, 00:25:49.000 --> 00:26:00.000 as long as we have enough observations of Y, the sample average will be pretty close to the expected value of Y. So that's why it's low variance. 00:26:00.000 --> 00:26:05.000 Now, the other model we'll consider is one that would have high variance 00:26:05.000 --> 00:26:17.000 but low bias. So here, and remember, this is variance with respect to the random training set that you draw, the model that we would consider is a high degree polynomial of X, 00:26:17.000 --> 00:26:19.000 so a really overfitting, high-degree polynomial. 00:26:19.000 --> 00:26:27.000 It's low bias, because if you have a high enough degree polynomial, on average you're gonna be hovering around this true relationship, 00:26:27.000 --> 00:26:32.000 the parabola. But it's high variance, because the higher the degree of your polynomial, the more you're trying to fit every training observation as closely as you can, 00:26:32.000 --> 00:26:46.000 so the actual model you get from each training set that you pull is going to change drastically. 00:26:46.000 --> 00:27:03.000 So this is known as overfitting to the training data, because as you get different training data, your model changes drastically, whereas the other one, I forgot to say this, the one with high bias is known as underfitting the data, because you're completely missing sort of the signal in 00:27:03.000 --> 00:27:10.000 the data. So what you try and go for with this bias-variance trade-off is something that's in the middle, 00:27:10.000 --> 00:27:17.000 so it has a little bit of bias, but not too much, and a little bit of variance, but not too much, so sort of a Goldilocks model. 00:27:17.000 --> 00:27:22.000 So for us, because this is a parabola, that would end up being, you know, a low-degree polynomial; a degree 00:27:22.000 --> 00:27:27.000 2 polynomial would be perfect. So that's what we would want here. 00:27:27.000 --> 00:27:38.000 So what I've done here in this code is I've gone through and generated, as you can see, 5 different random training sets. 00:27:38.000 --> 00:27:42.000 So my X stays the same; the only thing that's different is I'm getting different 00:27:42.000 --> 00:27:48.000 random noise here. Then I'm gonna go through and fit the 3 different models. 00:27:48.000 --> 00:27:50.000 So the first one I fit is that high variance model; 00:27:50.000 --> 00:27:54.000 here it's a degree 20 polynomial. 00:27:54.000 --> 00:27:57.000 The second one I fit is what I called the Goldilocks model; 00:27:57.000 --> 00:28:01.000 it's just the parabola, I'm just fitting a parabola. 00:28:01.000 --> 00:28:02.000 And then the final one is that high bias model. So this is the one where I just take it to be 00:28:02.000 --> 00:28:16.000 the average value of the training set. And so then I plot the fitted models along with the true relationship. 00:28:16.000 --> 00:28:17.000 So what you can see here, I think, goes from highest bias to lowest bias, from left to right. 00:28:17.000 --> 00:28:19.000 So for the one with high bias, you can see these are all the model fits.
00:28:19.000 --> 00:28:29.000 Just like I said, we have enough observations that we're basically hovering around the expected value of Y for this data. 00:28:29.000 --> 00:28:37.000 So you can see the high bias, because we're not lining up with the true relationship at all, 00:28:37.000 --> 00:28:44.000 but we can see it's low variance, because basically all of the models sit on top of each other. 00:28:44.000 --> 00:28:53.000 The Goldilocks model, the one that's just right, is falling on top of the true relationship, and most of the estimates from the training sets are basically on top of each other, 00:28:53.000 --> 00:29:00.000 so low bias and low variance. And then the one that is low bias, high variance: 00:29:00.000 --> 00:29:05.000 you can see that it's low bias, because it is basically right on top of the parabola, 00:29:05.000 --> 00:29:12.000 but we can tell it's high variance, because all the different estimates are wiggling kind of wildly and not really lining up with one another. 00:29:12.000 --> 00:29:22.000 So that's sort of the idea here. And what you're seeing on the horizontal axis of the different plots is what's known as the model complexity. 00:29:22.000 --> 00:29:30.000 So the way that this trade-off works is: the less complex your model, the more likely it is to have high bias squared, 00:29:30.000 --> 00:29:32.000 and that's sort of what we're seeing over here. 00:29:32.000 --> 00:29:49.000 And then as you increase the complexity, which in this setting was the degree of the polynomial we're fitting, in other settings it will be different, as you increase the complexity you can see your bias tends to go down, because you get really close to the true relationship, but then your variance tends 00:29:49.000 --> 00:29:56.000 to go up. And so why does this matter? Well, remember, all the way back up here: 00:29:56.000 --> 00:30:04.000 your generalization error is the variance of your estimate, plus the bias squared of your estimate, plus the irreducible error. 00:30:04.000 --> 00:30:22.000 So what you're looking for is to find the minimum on this generalization error curve, which tends to occur somewhere along, not exactly where they're both at their lowest or where they intersect, but somewhere where bias has been lowered but before variance starts to go up too much, 00:30:22.000 --> 00:30:23.000 and so that's sort of the idea here. And then, if we were to look at this particular problem, we can see where that is. 00:30:23.000 --> 00:30:42.000 I've plotted the test set error across different training sets, and we can see that the lowest error occurs here, and after that the variance starts to increase enough to increase the generalization error. 00:30:42.000 --> 00:30:51.000 Okay. So that was a lot of me talking, so maybe now is a great time to pause for questions about this trade-off, or things like "what is model complexity? 00:30:51.000 --> 00:31:01.000 I didn't understand that." So if you have any questions, now is a great time to ask. 00:31:01.000 --> 00:31:08.000 So Zack is asking in the chat: if you train your model on a data sample that is not representative of the population 00:31:08.000 --> 00:31:09.000 the model will be applied on, is the resulting error 00:31:09.000 --> 00:31:20.000 bias, or is it variance? So that is a slightly different question than what's covered by the bias-variance trade-off.
00:31:20.000 --> 00:31:34.000 So in the definition of this sort of thing, you're assuming that the samples you're getting are drawn from the... 00:31:34.000 --> 00:31:46.000 they all follow the same distribution. So the model's variance 00:31:46.000 --> 00:31:57.000 and bias are more dependent on the type of model, and what you're asking is sort of a different kind of question about, what if my sample is bad? 00:31:57.000 --> 00:32:02.000 So, like Erica is suggesting, an example of this could be something like your data 00:32:02.000 --> 00:32:05.000 has selection bias, or something like that. 00:32:05.000 --> 00:32:10.000 Yeah, they're sort of different types of concepts, 00:32:10.000 --> 00:32:20.000 I would guess, if that makes any sense. 00:32:20.000 --> 00:32:23.000 Matthew, can I make a comment? 00:32:23.000 --> 00:32:24.000 Alright, sure! 00:32:24.000 --> 00:32:30.000 I'm thinking about the answer to Zack's question. 00:32:30.000 --> 00:32:37.000 I think it should be that the bias will be more, because you are off from the model. 00:32:37.000 --> 00:32:41.000 So, I mean, it could be more, right? 00:32:41.000 --> 00:32:45.000 It just depends on how the data is not representative of the population 00:32:45.000 --> 00:32:57.000 data, if that makes sense. I could see a situation where... I just think it depends. 00:32:57.000 --> 00:33:01.000 I don't know, it depends on how the sampling is wrong, 00:33:01.000 --> 00:33:03.000 if that makes sense. 00:33:03.000 --> 00:33:24.000 I think, when the data is off from the representative data, that essentially also means whatever model, Y equals f of X, is totally off. 00:33:24.000 --> 00:33:31.000 From that perspective, I'm thinking that the bias will be quite big. 00:33:31.000 --> 00:33:37.000 Again, I think it just depends on how the sampling is wrong. 00:33:37.000 --> 00:33:38.000 Okay. 00:33:38.000 --> 00:33:41.000 Yeah. 00:33:41.000 --> 00:33:42.000 Are there any other... yeah. 00:33:42.000 --> 00:34:03.000 Hi, so, can you say the bias is the expected value of the residual? 00:34:03.000 --> 00:34:07.000 Is that the same thing? 00:34:07.000 --> 00:34:17.000 This expression is looking like the residual, and then I'm wondering if... 00:34:17.000 --> 00:34:41.000 So, not quite. The residual would also include... so this is the actual f minus the estimate of f, but the residual is y minus the estimate, if that makes sense. 00:34:41.000 --> 00:34:50.000 So in the residual there is this error term as well, according to our assumptions, so the residual is y minus f hat of x, 00:34:50.000 --> 00:34:52.000 so it includes the epsilon. 00:34:52.000 --> 00:35:00.000 So the residual is f of x 0 plus epsilon minus f hat, 00:35:00.000 --> 00:35:05.000 but this is f of x minus f hat of x. It's a slight difference, but yeah, 00:35:05.000 --> 00:35:06.000 it's not the residual. 00:35:06.000 --> 00:35:10.000 Okay, so it's missing the epsilon. 00:35:10.000 --> 00:35:17.000 Alright, I get that. The other thing I'm wondering is, when I hear the terms overfitting and underfitting, 00:35:17.000 --> 00:35:24.000 I think of when your model is doing well on the training set, but when you do it on the test set, you know, it doesn't work.
00:35:24.000 --> 00:35:34.000 That's usually when people say, oh, you're overfitting: it's performing really well on your training set but not doing so well on the validation or your test set. 00:35:34.000 --> 00:35:41.000 So I'm just thinking how this relates to those terms, or that idea. 00:35:41.000 --> 00:35:48.000 So overfitting, like a measure of how much you're overfitting is: 00:35:48.000 --> 00:35:57.000 if you're performing way better on the training set than on the test set, then that is sort of a measure of how much you're overfitting. 00:35:57.000 --> 00:36:04.000 So, essentially, you could have a model 00:36:04.000 --> 00:36:07.000 that predicts perfectly on the training set, but perhaps because you're overfitting on the data, maybe you're over here, right? 00:36:07.000 --> 00:36:20.000 So your bias is almost 0, but your variance is really high. If you're fitting all the training examples perfectly, you'd have 0 error on the training set, 00:36:20.000 --> 00:36:28.000 but maybe because of the high variance you'd have a high generalization error. Does that sort of make sense? That is one way to get a sense of how much you're overfitting, 00:36:28.000 --> 00:36:37.000 but it isn't exactly the same thing. 00:36:37.000 --> 00:36:38.000 Alright. Yeah, that is a way... I think I kinda lost the original question. 00:36:38.000 --> 00:36:46.000 But, yeah, I'm just curious how you would use these metrics. 00:36:46.000 --> 00:36:49.000 Do you calculate these metrics when you build a model? 00:36:49.000 --> 00:36:53.000 No, so this is just sort of a theoretical concept 00:36:53.000 --> 00:37:00.000 that guides some of the techniques we'll be seeing throughout the rest of the boot camp. 00:37:00.000 --> 00:37:15.000 So you don't go and calculate your variance or calculate your bias squared, because in order to actually get an estimate of those you'd need to be able to get a lot of different training sets. Again, when we're doing this fitting, you're only ever 00:37:15.000 --> 00:37:22.000 really going to look at the generalization error, never, individually, the bias squared or the variance. 00:37:22.000 --> 00:37:24.000 Okay, got it. Thanks. 00:37:24.000 --> 00:37:27.000 Yup! 00:37:27.000 --> 00:37:31.000 And then Icon, sorry if I mispronounced your name. 00:37:31.000 --> 00:37:33.000 So they are asking a plot coding question here: 00:37:33.000 --> 00:37:42.000 with for i in range(5) you create 5 plots; where do you indicate the i that the code loops over? 00:37:42.000 --> 00:37:47.000 And I believe Zack answered it, but maybe we should clarify. 00:37:47.000 --> 00:37:56.000 Where did I do that? Okay, so when you have a for loop in Python, you don't actually have to use the i anywhere. 00:37:56.000 --> 00:38:12.000 This is just iterating through the range, so range(5) will have 0, 1, 2, 3, 4. And so then basically what it's saying is, for each of the things within that range, do the following thing. And so if you never 00:38:12.000 --> 00:38:19.000 use i, it's just saying, do this 5 times, and you don't have to use the i. 00:38:19.000 --> 00:38:22.000 It's just saying: hey, I have this chunk of code,
00:38:22.000 --> 00:38:31.000 I want you to do it 5 times. 00:38:31.000 --> 00:38:32.000 Yeah. 00:38:32.000 --> 00:38:41.000 I have a question. If you were to be given a lot of training data, like different sets of training data, how would you actually calculate... 00:38:41.000 --> 00:38:45.000 like, how would you actually give a number for the bias? 00:38:45.000 --> 00:38:50.000 And the variance? Because, yeah. 00:38:50.000 --> 00:38:52.000 Oh, yeah, what were you gonna say? 00:38:52.000 --> 00:39:04.000 No, that's the question. Like, if you were actually given many, like thousands of, training data sets, then how would you actually give a number for the bias? 00:39:04.000 --> 00:39:06.000 Because the bias of f hat is the expectation of f(x) minus f 00:39:06.000 --> 00:39:17.000 hat(x), where f(x) is the true answer, and we don't know what the true answer is, 00:39:17.000 --> 00:39:29.000 even if we have lots of training data. Yeah, I'm just asking, how would you actually calculate a number for the bias if you were given a lot of training data, or different sets of training data? 00:39:29.000 --> 00:39:36.000 So, yeah, I think in order to do it, to actually be able to do it, 00:39:36.000 --> 00:39:52.000 I think you would need to be in a situation like here, where we know that the true relationship is y equals x times (x minus 1). But in general, in a real world situation, we don't know what the true relationship is, we don't know what 00:39:52.000 --> 00:39:56.000 f is, so we wouldn't be able to calculate it. 00:39:56.000 --> 00:40:02.000 So the statement is: whatever the true relationship is, in theory we are far off from it 00:40:02.000 --> 00:40:04.000 if the bias is high, and we are closer to it 00:40:04.000 --> 00:40:08.000 if the bias is low. 00:40:08.000 --> 00:40:09.000 Okay. Thank you. 00:40:09.000 --> 00:40:25.000 Yes, I think, yeah. And then maybe the last one before we move on: the question is, how would you decipher if the variance is high because of the choice of model versus being the nature of the data? 00:40:25.000 --> 00:40:32.000 The variance is a property of the model you're choosing, not a property of the data, 00:40:32.000 --> 00:40:42.000 if that makes any sense. 00:40:42.000 --> 00:40:48.000 Trying to think if there's a better way to answer... yeah. 00:40:48.000 --> 00:40:57.000 So it's a property of the model, so it wouldn't be the case that you wanna change something about your data. 00:40:57.000 --> 00:41:04.000 It would be the case that you wanna change something about your model if you suspect that you're overfitting, 00:41:04.000 --> 00:41:14.000 so if you suspect your model has high variance. 00:41:14.000 --> 00:41:20.000 Okay. 00:41:20.000 --> 00:41:24.000 Alright! So that's gonna be it for the bias-variance trade-off, 00:41:24.000 --> 00:41:27.000 and, you know, thanks for all the questions; they're very good. 00:41:27.000 --> 00:41:32.000 This is sort of the motivation for what we're going to learn in this notebook called Regularization. 00:41:32.000 --> 00:41:33.000 So we're going to introduce the general idea behind regularization, 00:41:33.000 --> 00:41:41.000 we'll set up a couple of different formulations and then work our way through them, 00:41:41.000 --> 00:41:44.000 we'll show you how to do them in Python.
00:41:44.000 --> 00:41:53.000 And then at the end we'll show you this nice property of lasso for feature selection. 00:41:53.000 --> 00:41:59.000 So I will do a quick note: this notebook gets a little bit heavy into some math. 00:41:59.000 --> 00:42:09.000 If you're not the biggest math person, you don't have to worry too much about those sorts of things, and just try and take away the general concepts of 00:42:09.000 --> 00:42:11.000 how the regularization problems are being set up, and then don't worry so much about the formal, actual setups. 00:42:11.000 --> 00:42:28.000 Just remember the gist and, you know, hold on until we get to the Python parts, if that's what you're interested in. 00:42:28.000 --> 00:42:29.000 Okay, so this is the example that we literally just looked at in the bias-variance notebook. So we have 00:42:29.000 --> 00:42:44.000 y is equal to x times (x minus 1), and so then I'm gonna go ahead and plot that, with fewer observations 00:42:44.000 --> 00:43:02.000 this time, and here's the real relationship. And so what I'm gonna go ahead and do is a loop where, for each i, I'm gonna fit degree-1 through degree-26 polynomials using my polynomial features 00:43:02.000 --> 00:43:12.000 pipeline, and each time through I'm gonna record the coefficients on each of the degrees. 00:43:12.000 --> 00:43:15.000 So here's what that looks like. So maybe, yeah? 00:43:15.000 --> 00:43:23.000 Matthew, sorry, where is this file in your GitHub? In the lectures folder, under which one? 00:43:23.000 --> 00:43:26.000 So this is in regression, 00:43:26.000 --> 00:43:34.000 so under supervised learning. Okay? Okay. 00:43:34.000 --> 00:43:39.000 Yeah. So these are the last notebooks we'll do today: I hope to get through 6, 00:43:39.000 --> 00:43:40.000 and then I think we'll be able to get through 9. 00:43:40.000 --> 00:43:47.000 If we are able to get through more, all of it will be in regression; I should have specified. 00:43:47.000 --> 00:43:49.000 Okay, so let me go ahead and zoom in just to make the table a little easier to see. 00:43:49.000 --> 00:44:00.000 So as you can see, each row is one of the polynomials that I fit, and then each column is the coefficient on that degree. 00:44:00.000 --> 00:44:08.000 So, for instance, when we only fit a line, X has the coefficient negative 0.959. 00:44:08.000 --> 00:44:19.000 Okay, so the motivation here is: if you look in general at the magnitudes of the coefficients, so absolute values, they tend to be increasing. 00:44:19.000 --> 00:44:26.000 So, for instance, it's kind of more or less staying around the same for degree one, 00:44:26.000 --> 00:44:36.000 but if we look at ones like x to the fourth, x to the fifth, we can quickly see that the coefficients are getting larger. 00:44:36.000 --> 00:44:41.000 So here we have one that's a 75, a negative 195, a 262. 00:44:41.000 --> 00:44:52.000 And so basically what's going on is something that in linear regression is kind of referred to as coefficient explosion. 00:44:52.000 --> 00:45:09.000 And so what does that mean? So here, as a function of the degree of the polynomial, the size of beta, meaning the norm of beta, is increasing quite drastically as you increase the degree of the polynomial 00:45:09.000 --> 00:45:23.000 you're considering.
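For reference, the kind of loop being described might look roughly like this. This is a minimal sketch: the synthetic data, the seed, and the degree range are stand-ins for whatever the notebook actually uses.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# stand-in synthetic data based on the true relationship y = x(x - 1) plus noise
rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 30)
y = x * (x - 1) + rng.normal(0, 0.5, size=len(x))

norms = []
for degree in range(1, 27):
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
        ("reg", LinearRegression()),
    ])
    pipe.fit(x.reshape(-1, 1), y)
    norms.append(np.linalg.norm(pipe["reg"].coef_))   # size of the fitted coefficient vector

print(norms[:3], norms[-3:])   # the norms tend to blow up as the degree grows

Printing the norms shows them growing rapidly with the degree, which is the coefficient explosion being discussed.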
And so what's happening is basically, as you increase the degree of the polynomial we're fitting, the polynomial tries to wiggle around quite a bit to fit as many training points as it can, meaning that these coefficients are getting 00:45:23.000 --> 00:45:35.000 larger and larger in magnitude. And so this is the motivation behind what we're going to learn, behind the little trick that regularization does. 00:45:35.000 --> 00:45:42.000 And so the idea behind regularization is we're still going to try and minimize that MSE 00:45:42.000 --> 00:45:52.000 for linear regression. So remember, this is just a different formula for that MSE: it's one over n times the sum of the squared differences between the actual and the predicted values, and this is just a way to write that in linear algebra terms. 00:45:52.000 --> 00:46:06.000 So we're still trying to minimize this, but now we're trying to do it in a way that we don't let the norm of our coefficients get too big. 00:46:06.000 --> 00:46:13.000 And so what is a norm? It's a way to measure the size of a vector. 00:46:13.000 --> 00:46:19.000 So we're going to focus in on 2 norms later on, but for now just think of it as a way to measure the size of a vector. In 2 dimensions 00:46:19.000 --> 00:46:27.000 it would be sort of like the length of the vector, depending on what norm you're using. 00:46:27.000 --> 00:46:30.000 So, like, if you were to draw an arrow on a piece of paper, then measure it with a ruler, 00:46:30.000 --> 00:46:34.000 you can think of that as being the same sort of thought process behind a norm. 00:46:34.000 --> 00:46:38.000 And then later on we'll define actual formulas for particular norms. 00:46:38.000 --> 00:46:49.000 So for now, just know a norm is a way to measure the size of a vector, and in particular we want norms that are going to measure the size of our coefficients. 00:46:49.000 --> 00:47:03.000 And so regularization is essentially taking our MSE minimization and rewriting it in a way that we're trying to do this while not letting the coefficients get too large. 00:47:03.000 --> 00:47:06.000 And so we're still trying to minimize the MSE, 00:47:06.000 --> 00:47:23.000 but now we're sort of operating on a budget, where we're only doing this in a world where the norm of my coefficient vector has to be less than or equal to some constant C. And so, thinking of this again, we're trying to find the smallest mean squared error 00:47:23.000 --> 00:47:27.000 we can while we're on a budget for beta. 00:47:27.000 --> 00:47:30.000 So we set it up as a constrained optimization problem; 00:47:30.000 --> 00:47:43.000 that's sort of just the motivation, like why are people doing this. You can equivalently rewrite this to be a penalized optimization, where, 00:47:43.000 --> 00:47:45.000 again, this is the MSE now multiplied by n, and we wanna minimize this plus, 00:47:45.000 --> 00:47:56.000 now, an added penalty for large beta. So alpha is something known as a hyperparameter. 00:47:56.000 --> 00:48:14.000 And now, in addition to our MSE,
or, I guess, the sum of squared errors, we're now adding a penalty term where, if beta gets too large, the thing we're trying to minimize gets large as well. So here, this is our first instance of a particular norm: this is the square of 00:48:14.000 --> 00:48:27.000 the 2-norm, which is just the sum of the squares of the entries of some vector. So if we have a vector a, it's a 1 squared plus a 2 squared plus dot dot dot plus a n squared. 00:48:27.000 --> 00:48:33.000 So minimizing, sorry, not maximizing, minimizing this first term is the same as minimizing the MSE. 00:48:33.000 --> 00:48:44.000 And to give a mathematical equivalency of the constrained optimization with this penalized form, 00:48:44.000 --> 00:48:45.000 there are some references at the bottom; I'm not gonna go into the details 00:48:45.000 --> 00:49:09.000 there. So basically what's going on here is, by adding this penalty, or if you want to think about it as the constrained optimization, you're forcing this minimization to happen in a way that doesn't let beta get too large. So that's the idea behind regularization: you're 00:49:09.000 --> 00:49:13.000 enforcing some sort of penalty or constraint that makes it so 00:49:13.000 --> 00:49:14.000 your coefficients can't get too big. 00:49:14.000 --> 00:49:15.000 So this is our first instance of a hyperparameter. 00:49:15.000 --> 00:49:18.000 So what's the hyperparameter here? It's alpha. 00:49:18.000 --> 00:49:36.000 So beta is a vector of parameters that we have to try and estimate; alpha is a hyperparameter. 00:49:36.000 --> 00:49:39.000 The difference is that parameters get estimated during the algorithm's fitting, while hyperparameters you have to set from the very beginning. 00:49:39.000 --> 00:49:50.000 So before you even try and fit, you have to decide what value of alpha you're going to use, and then fit the algorithm. 00:49:50.000 --> 00:49:52.000 So the way that you choose the alpha depends. 00:49:52.000 --> 00:50:08.000 So we're gonna see some examples below where we choose different values of alpha just to get something out. In other cases you'll do something called hyperparameter tuning, where you'll do something like a cross-validation where, each time through the cross-validation, 00:50:08.000 --> 00:50:25.000 you're choosing a different value for the hyperparameter, and then at the end you see which one performs best. In this particular case, if alpha is 0, we recover what we used to have for beta, the ordinary least squares estimate, and, theoretically, 00:50:25.000 --> 00:50:37.000 if alpha equals infinity, the estimate would imply that beta has to be 0 in order to minimize. 00:50:37.000 --> 00:50:47.000 Okay. So I think I saw a question. 00:50:47.000 --> 00:50:48.000 So Icon is asking: isn't linear regression also estimating the effect sizes of features? 00:50:48.000 --> 00:50:55.000 Then what does it mean when we use regularization? Is it because in machine learning 00:50:55.000 --> 00:50:58.000 we do not care about beta? So that's kind of the gist of it. 00:50:58.000 --> 00:51:09.000 So here we're in a predictive modeling framework where, at the end of the day, we may not care about being able to 00:51:09.000 --> 00:51:12.000 clearly interpret, you know,
oh, if I increase X by 2 units, then this does blank to the output. 00:51:12.000 --> 00:51:20.000 So you do lose some of the nice interpretability that you get from ordinary least squares, like a regular linear regression, 00:51:20.000 --> 00:51:38.000 but this is thinking of it in terms of improving predictive performance. 00:51:38.000 --> 00:51:43.000 Okay. So here are the 2 specific regularization models that we're gonna look at. 00:51:43.000 --> 00:51:49.000 The first is called ridge regression, and ridge regression chooses this square of the 2-norm 00:51:49.000 --> 00:52:00.000 here. So this is, again, as we said, a 1 squared plus a 2 squared plus dot dot dot plus a n squared for an n-dimensional vector a; a is not involved in the problem at all, 00:52:00.000 --> 00:52:03.000 it's just a placeholder name for a vector. The other one 00:52:03.000 --> 00:52:07.000 we're gonna look at uses what's known as the L1 norm. 00:52:07.000 --> 00:52:24.000 So the L1 norm is the sum of the absolute values, so the absolute value of a 1 plus the absolute value of a 2 plus dot dot dot plus the absolute value of a n, where, if it wasn't clear, a sub 1 is the first entry of the vector a, a sub 2 is the second 00:52:24.000 --> 00:52:27.000 entry, and a sub n is the nth entry. So, yeah. 00:52:27.000 --> 00:52:34.000 Can I ask you a quick question? How do the different norms... 00:52:34.000 --> 00:52:36.000 adding this regularization term, does it make a difference which one you choose? 00:52:36.000 --> 00:52:45.000 Or what's the difference between choosing, you know, the L1 norm versus the squared 2-norm? 00:52:45.000 --> 00:52:51.000 Yeah, so I'll show you quickly, and then we'll come back to this in maybe more detail later. 00:52:51.000 --> 00:52:57.000 So basically, this is from a book called The Elements of Statistical Learning; 00:52:57.000 --> 00:53:07.000 I didn't make this image. This is a way to visualize it in 2D: basically, you're just changing the shape of the constraint region in the parameter space. 00:53:07.000 --> 00:53:27.000 So here is a space that represents beta 1 and beta 2, and the square is if you use the lasso norm, and so all of your coefficients have to exist either within or on the edge of that region, versus 00:53:27.000 --> 00:53:36.000 the circle for ridge. And so it makes a difference in the type of estimates you're able to get, and you can use that to your advantage when trying to do things like feature selection, 00:53:36.000 --> 00:53:50.000 and we'll talk about that at the end. So this is a little preview: basically, the different norms change the shape of the region that the coefficients can live in. So for lasso it has to be within this square 00:53:50.000 --> 00:54:08.000 or on the edge of the square; for ridge it has to be in this circle or on the edge of the circle. And then there are higher dimensional equivalents of this, which I just can't draw in a 2 dimensional space. 00:54:08.000 --> 00:54:17.000 Okay. Are there any other questions before we go on to how to do this with Python? 00:54:17.000 --> 00:54:26.000 Could you, sorry, explain again how the norm is coming into the equation 00:54:26.000 --> 00:54:30.000 for regularization? I kind of missed it.
00:54:30.000 --> 00:54:36.000 Yeah, so basically, with what we're trying to do, there are 2 ways to think of it. 00:54:36.000 --> 00:54:41.000 The first is that you're still minimizing the MSE, or whatever you want to minimize, 00:54:41.000 --> 00:54:49.000 but now you're forcing the norm of your vector of coefficients, 00:54:49.000 --> 00:54:59.000 so here, think of a being replaced by beta in either one of these, to be less than or equal to some constant. 00:54:59.000 --> 00:55:03.000 That's one way. Equivalently, you can think of it as: now 00:55:03.000 --> 00:55:08.000 you're minimizing the MSE, or something that's equivalent to the MSE, 00:55:08.000 --> 00:55:13.000 plus a penalty, where the larger beta gets, the more penalty you get. 00:55:13.000 --> 00:55:16.000 So basically you can't just automatically get the best one, 00:55:16.000 --> 00:55:20.000 the OLS estimate, our regular linear regression estimate. 00:55:20.000 --> 00:55:27.000 Now you're forced to maybe get a subpar minimum, because if you were to try and get the minimum, beta would get too big. 00:55:27.000 --> 00:55:35.000 So the norms are just different ways of measuring the size of beta. 00:55:35.000 --> 00:55:42.000 It could be any norm, but the 2 most commonly used are the ones in ridge and lasso. 00:55:42.000 --> 00:55:43.000 And the a, when you say a, it's really beta 00:55:43.000 --> 00:55:45.000 then? 00:55:45.000 --> 00:55:49.000 In this particular case it would be beta. This is just a general vector, 00:55:49.000 --> 00:55:51.000 like a is a vector. 00:55:51.000 --> 00:55:56.000 Yeah, but in the example it's represented by beta, in your example. 00:55:56.000 --> 00:55:59.000 Yeah, yeah, and for what we're using it for, it would be beta. 00:55:59.000 --> 00:56:11.000 I just wanted to use a general vector here because the norms are independent of the vector you use. 00:56:11.000 --> 00:56:12.000 Yeah. 00:56:12.000 --> 00:56:13.000 I have a question. Can you go back a little bit towards the, yeah. 00:56:13.000 --> 00:56:24.000 Yeah. So you say that the first and the second are equivalent. So in the second one, we are basically just redefining 00:56:24.000 --> 00:56:31.000 our cost function to include a regularization term, and in the first one we are just saying that, okay, we have the regular cost function, 00:56:31.000 --> 00:56:35.000 but additionally we require the norm of beta to be less than some constant. 00:56:35.000 --> 00:56:37.000 And in the second, I'm just trying to think about. 00:56:37.000 --> 00:56:43.000 Yes, these are equivalent in terms of the ideas. 00:56:43.000 --> 00:56:59.000 But does the second one automatically imply that the norm of the beta that would come out of the calculation would be upper bounded by some constant C? 00:56:59.000 --> 00:57:07.000 Maybe it does. 00:57:07.000 --> 00:57:08.000 Okay. 00:57:08.000 --> 00:57:10.000 So there are some notes at the bottom that go through the derivation of how this is equivalent to that, and so I'll refer you to those notes. 00:57:10.000 --> 00:57:11.000 Okay. Thanks. 00:57:11.000 --> 00:57:17.000 Yeah. Yup. 00:57:17.000 --> 00:57:20.000 So how do we do this in sklearn? Both of these are in sklearn. 00:57:20.000 --> 00:57:26.000 Here are the links to the documentation if you want to check out more, and I think maybe it's a good point to look at one.
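(For reference, here is one compact way to write the two formulations that came up in this exchange. The 1/n scaling is just one convention; sklearn's own Ridge and Lasso objectives differ from this by constant factors, so treat it as a sketch rather than the exact objective the library minimizes.)

```latex
% Constrained form: minimize the usual fit criterion subject to a "budget" C on the size of beta
\hat{\beta} = \arg\min_{\beta}\ \tfrac{1}{n}\lVert y - X\beta\rVert_2^2
\quad\text{subject to}\quad
\lVert\beta\rVert_2^2 \le C \ (\text{ridge})
\quad\text{or}\quad
\lVert\beta\rVert_1 \le C \ (\text{lasso})

% Equivalent penalized form: a larger alpha corresponds to a smaller budget C
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta}\ \tfrac{1}{n}\lVert y - X\beta\rVert_2^2 + \alpha\lVert\beta\rVert_2^2,
\qquad
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta}\ \tfrac{1}{n}\lVert y - X\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1

% with the norms defined entrywise as
\lVert a\rVert_2^2 = a_1^2 + a_2^2 + \dots + a_n^2,
\qquad
\lVert a\rVert_1 = \lvert a_1\rvert + \lvert a_2\rvert + \dots + \lvert a_n\rvert
```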
00:57:26.000 --> 00:57:30.000 Let's go to Ridge, 'cause I think there is something here, you know. 00:57:30.000 --> 00:57:34.000 I'm thinking of a different model, but this is what the documentation looks like. 00:57:34.000 --> 00:57:37.000 And so maybe it's good to see at least that there's a default value for alpha. 00:57:37.000 --> 00:57:45.000 Another thing that we're gonna see here is something called max_iter. 00:57:45.000 --> 00:57:58.000 So this is the number of iterations. I think we sort of talked about this last week when we talked about fitting linear regression and worrying about the ill-conditioning of a matrix. 00:57:58.000 --> 00:58:06.000 So sklearn doesn't use the normal equations. For instance, for ridge regression you can derive a formula for the estimate, 00:58:06.000 --> 00:58:08.000 but it doesn't use that formula. It uses sort of like a gradient descent or something like that. 00:58:08.000 --> 00:58:16.000 And so the algorithm goes through a number of iterations. 00:58:16.000 --> 00:58:23.000 And so there's a default value for the number of iterations that you may have to change here. 00:58:23.000 --> 00:58:32.000 It's saying it's None, but sometimes you'll get a warning from sklearn saying something like the algorithm did not converge before the number of iterations was exceeded. 00:58:32.000 --> 00:58:33.000 And so when that happens, you may need to take this and increase it. 00:58:33.000 --> 00:58:42.000 So we'll see more examples of this as we go through the boot camp. 00:58:42.000 --> 00:58:43.000 But because I clicked on this I wanted to at least demonstrate something; 00:58:43.000 --> 00:58:45.000 I was thinking of a different model that I wanted to look at the documentation for. 00:58:45.000 --> 00:58:54.000 But here's the documentation. 00:58:54.000 --> 00:58:59.000 Okay, so first we're gonna import them. So from 00:58:59.000 --> 00:59:08.000 sklearn.linear_model I'm gonna import Ridge 00:59:08.000 --> 00:59:24.000 and Lasso. And maybe this is new: when you're importing 2 things from the same place, you can separate them by a comma instead of having to write each one on a different line. 00:59:24.000 --> 00:59:34.000 So this can save you some typing. So then, what I'm gonna do is I'm gonna go through and show, for different values of alpha, 00:59:34.000 --> 00:59:39.000 how this impacts the coefficients that you get as the estimate. 00:59:39.000 --> 00:59:49.000 So each time through, for each different value of alpha, what we're gonna do is fit the high-degree polynomial, which is a degree-10 polynomial. 00:59:49.000 --> 00:59:54.000 And then also you might notice that we're using pipelines here. 00:59:54.000 --> 00:59:55.000 So this is a situation where you need to scale your columns before you fit the model. 00:59:55.000 --> 01:00:08.000 If you don't scale your columns before fitting the model, it can mess up the way the model gets fit. 01:00:08.000 --> 01:00:22.000 And so, if you have very different scales, thinking of the constraint as a budget, the budget can get dedicated entirely to one column just because its scale is vastly different from the other columns, 01:00:22.000 --> 01:00:31.000 and it won't actually give you the right fit. So you need to always scale when you use ridge or lasso.
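(A minimal sketch of the kind of loop being built next. The synthetic data, the alphas list, the step names, and the choice to scale after building the polynomial terms are all assumptions for illustration, not the notebook's exact code.)

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge, Lasso

# stand-in data; in the notebook X and y already exist
rng = np.random.default_rng(216)
X = rng.uniform(-2, 2, (200, 1))
y = X[:, 0] + X[:, 0] ** 2 + rng.normal(0, 1, 200)

alphas = [10.0 ** p for p in range(-2, 5)]   # powers of 10, as mentioned later

def fit_and_get_coefs(model):
    """Degree-10 polynomial, scaled so the penalty treats every term fairly, then the regularized fit."""
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=10, include_bias=False)),
        ("scale", StandardScaler()),
        ("reg", model),
    ])
    pipe.fit(X, y)
    return pipe.named_steps["reg"].coef_

# one fit per alpha for each of ridge and lasso, recording the coefficients
ridge_coefs = [fit_and_get_coefs(Ridge(alpha=a, max_iter=100_000)) for a in alphas]
lasso_coefs = [fit_and_get_coefs(Lasso(alpha=a, max_iter=100_000)) for a in alphas]
```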
01:00:31.000 --> 01:00:34.000 So that's basically what we're doing here. And so the last thing we need to add: 01:00:34.000 --> 01:00:37.000 so we have the scaling and the polynomial features; 01:00:37.000 --> 01:00:41.000 now we just need to add the ridge and the lasso. 01:00:41.000 --> 01:00:47.000 And so here I'll add Ridge. 01:00:47.000 --> 01:00:56.000 So you just call Ridge, and then for alpha, each time through 01:00:56.000 --> 01:01:01.000 we're going to go ahead and get a different value of alpha, 01:01:01.000 --> 01:01:06.000 and so alpha at i. And then, I think, to be safe, 01:01:06.000 --> 01:01:10.000 I just like to set the max_iter pretty high. 01:01:10.000 --> 01:01:17.000 And then the same thing here for lasso. 01:01:17.000 --> 01:01:20.000 I'm gonna call it lasso, and then I'll call capital-L 01:01:20.000 --> 01:01:24.000 Lasso, alpha equals alphas at i. 01:01:24.000 --> 01:01:29.000 Why is it alphas at i? Because alphas is this vector that I've defined up here. 01:01:29.000 --> 01:01:33.000 Maybe I should call it alphas, with an s. 01:01:33.000 --> 01:01:42.000 And max_iter equals 100,000. Then I fit them and record the coefficients in an array. 01:01:42.000 --> 01:01:50.000 So let's go ahead and make those even bigger. 01:01:50.000 --> 01:01:55.000 Let's try one more time. 01:01:55.000 --> 01:02:06.000 There we go. Oh, come on, guys! 01:02:06.000 --> 01:02:09.000 Alright! 01:02:09.000 --> 01:02:11.000 So we can look at the coefficients for these. 01:02:11.000 --> 01:02:16.000 And so you can see, as you increase alpha, so the bigger alpha is, 01:02:16.000 --> 01:02:20.000 so here, the alpha, you can think of it like 01:02:20.000 --> 01:02:29.000 this is that particular alpha, which is why it shows alpha here. The bigger alpha is, the smaller the norm of the coefficients has to be. 01:02:29.000 --> 01:02:44.000 So you can kind of see them shrinking. Now, one thing you might notice is that with ridge regression the shrinking seems to happen more or less uniformly, like no one coefficient goes down to 0 on its own, whereas with 01:02:44.000 --> 01:03:01.000 lasso, when you make the comparison, and this goes back to the question that was asked earlier, some of the coefficients will go down to 0 faster than others. And so that's a nice feature of lasso that we're gonna talk about in a second. But just to 01:03:01.000 --> 01:03:04.000 demonstrate: as alpha gets bigger, the coefficients go to 0, and in lasso they tend to hit 0, 01:03:04.000 --> 01:03:11.000 whereas in ridge they tend to not hit 0. 01:03:11.000 --> 01:03:19.000 So, you know, why is this happening? This goes back to sort of what we talked about earlier. In ridge regression 01:03:19.000 --> 01:03:25.000 we're trying to minimize the MSE subject to the 2-norm being less than or equal to C, 01:03:25.000 --> 01:03:26.000 which you can rearrange so that, in 2 dimensions, 01:03:26.000 --> 01:03:49.000 you have this disk with a radius of C, or, sorry, a radius of the square root of C, so that's where you get this picture, whereas with lasso you now have the absolute values, which give you a square with these vertices. 01:03:49.000 --> 01:03:54.000 And so what you see here is: this point is the ordinary least squares estimate, 01:03:54.000 --> 01:04:07.000 the thing that we got with normal linear regression that we learned about last week, and these ellipses are what are known as the level curves of the MSE.
01:04:07.000 --> 01:04:08.000 And so the minimum of the MSE is sort of the bottom; 01:04:08.000 --> 01:04:15.000 it's this paraboloid, and the bottom is the minimum. 01:04:15.000 --> 01:04:19.000 And these blue regions are the constraint region, 01:04:19.000 --> 01:04:22.000 so the constraint region being this part, where the norm of the coefficients is less than or equal to the constant. 01:04:22.000 --> 01:04:37.000 And so your estimate for the betas has to be where the level curves of the MSE hit the constraint region. 01:04:37.000 --> 01:04:57.000 So with lasso, because you have this square, pointy constraint region, you typically end up hitting on one of the axes, which explains why one or more of the coefficients tend to hit 0, 01:04:57.000 --> 01:05:17.000 while in comparison, with ridge, they don't exactly hit 0. Because ridge has more of this circular, or in higher dimensions spherical, shape, the curve tends to intersect out here away from the axes, 01:05:17.000 --> 01:05:23.000 which is why you tend to see this sort of uniform shrinking, but not quite getting to 0, 01:05:23.000 --> 01:05:29.000 whereas with lasso one of them will go to 0 much earlier than the others. 01:05:29.000 --> 01:05:39.000 That's sort of what's going on. So this gives lasso a nice property for feature 01:05:39.000 --> 01:05:56.000 selection. Because this happens, the features that stick around with lasso, the ones that don't go to 0, tend to be important features for predictive power. And so what you can do, as we'll see in the next notebook we cover, is use lasso and then change different values 01:05:56.000 --> 01:05:59.000 of alpha. 01:05:59.000 --> 01:06:10.000 You can use lasso, change different values of alpha, and then follow along with which ones stick around the longest, and those tend to be your more important features. 01:06:10.000 --> 01:06:11.000 Now, in this example it's kind of acting weirdly. 01:06:11.000 --> 01:06:19.000 We know the correct answer ahead of time: it's x squared and x to the first. 01:06:19.000 --> 01:06:21.000 But then things like x to the sixth stick around a little bit longer than those 2, same with x to the tenth. 01:06:21.000 --> 01:06:31.000 But these 2 are the ones that actually stay farthest away from 0 for the longest, 01:06:31.000 --> 01:06:47.000 if you look at all of them. So I think, if we were to look at it, even if we didn't have that knowledge, we'd be more inclined to keep these than, like, x to the tenth. Okay. 01:06:47.000 --> 01:06:48.000 So Mitch is asking a math-heavy question: we've seen L1 and L 01:06:48.000 --> 01:06:51.000 2 regularization; are other Lp norms ever used, for other values of p? 01:06:51.000 --> 01:07:07.000 So there's something called elastic net, which is, well, it's not quite what your question is asking, but elastic net is sort of like a weighted sum of the L 01:07:07.000 --> 01:07:14.000 1 and the L2 norms. 01:07:14.000 --> 01:07:19.000 It's not quite that. So I think you in theory could use other Lp 01:07:19.000 --> 01:07:20.000 norms, but from my experience it's usually L1, L 01:07:20.000 --> 01:07:29.000 2, or this elastic net, which is a combination of the 2. 01:07:29.000 --> 01:07:34.000 Eric is asking, can you explain why the red contour lines are shifted from
01:07:34.000 --> 01:07:42.000 one picture to the next, is that a mistake? Oh, yeah. 01:07:42.000 --> 01:07:55.000 I think they are not the same; I think the 2 in the middle are the same, and then the ones on the outside are different, like this level curve is different from this level 01:07:55.000 --> 01:08:06.000 curve, because they intersect at different places. Does that make sense? 01:08:06.000 --> 01:08:24.000 Yeah, I think that's why. So this level curve is different from that one, but the 2 in the middle, I think, are the same. There's another question: do people use the strategy of using lasso for feature selection, followed by going back to unregularized regression, to 01:08:24.000 --> 01:08:27.000 get a lower MSE? So I think that's something you could do. 01:08:27.000 --> 01:08:38.000 You could use lasso to figure out what features you might want to consider, do a cross-validation, and then just use unregularized linear regression. 01:08:38.000 --> 01:08:45.000 You could also compare it with the lasso version of the same model and see which performs better. 01:08:45.000 --> 01:09:00.000 But, like the earlier question, if you're in a situation where the people you're making the model for like to be able to see the direct interpretations, you might stick with the linear one, because you can get that. 01:09:00.000 --> 01:09:06.000 Yeah, also, so you were using the chart with the different alphas and x to the 1 through 01:09:06.000 --> 01:09:13.000 x to the k power values. Was it that you were choosing the features? 01:09:13.000 --> 01:09:18.000 Then what are, or how are, the features represented in this chart? 01:09:18.000 --> 01:09:25.000 So we are fitting this model, the tenth-degree polynomial, for each value of alpha, 01:09:25.000 --> 01:09:26.000 and so these are the coefficients of that model. 01:09:26.000 --> 01:09:36.000 So I use this model because, well, I just wanna do an aside, because every time I do this I think people assume that in the real world you're just always using polynomial regression. 01:09:36.000 --> 01:09:44.000 That's a bad assumption. I don't think you're typically doing stuff like this; 01:09:44.000 --> 01:09:47.000 this is just to make it easy. That's an aside from your question. 01:09:47.000 --> 01:09:58.000 So in the real world your columns would be the features, so thinking back to the problem 01:09:58.000 --> 01:10:03.000 session, like kilometres driven, age, that sort of thing. 01:10:03.000 --> 01:10:14.000 And so then what you'll do is you'll follow the coefficients on those different features, and see which ones tend to persist as you increase the value of alpha. 01:10:14.000 --> 01:10:23.000 Now, what I tend to do is powers of 10, and if that doesn't work I might switch it up and do like 0.25, 0.5, 01:10:23.000 --> 01:10:35.000 something like that. Does that answer your question? 01:10:35.000 --> 01:10:36.000 Yeah. Yup. 01:10:36.000 --> 01:10:40.000 Yes. So you're just saying that this is a specific example with polynomial regression, and I could still use the same process, right, of choosing the powers of alpha to see what features stick around. 01:10:40.000 --> 01:10:45.000 And so when I'm doing the powers of alpha, with the higher penalty,
01:10:45.000 --> 01:11:01.000 so alpha is giving me a higher degree of penalty, then the features that drop off are the ones that I choose, because those are the ones that are, 01:11:01.000 --> 01:11:05.000 or why do I choose the ones that drop off as alpha goes higher? 01:11:05.000 --> 01:11:07.000 So you want to choose the ones that persist. 01:11:07.000 --> 01:11:20.000 Okay, persist. Okay, okay. 01:11:20.000 --> 01:11:37.000 Yup. So Jonathan is asking, what is the thought process in deciding to underfit by exclusion versus underfit by ridge regression? The regularization is still biasing it, even though it doesn't force it to 0, right? So with ridge, basically, you could 01:11:37.000 --> 01:11:44.000 use ridge regression as your model, and if that performs better than regular linear regression and you just care about predictive power, do that. 01:11:44.000 --> 01:11:53.000 But the idea here is that you could not use ridge regression for feature selection. 01:11:53.000 --> 01:11:54.000 And so we're not necessarily saying that lasso is better 01:11:54.000 --> 01:12:08.000 because, look, all these go to 0. We're saying that lasso can be used for feature selection, because the things that stick around tend to be the features that are most important in making predictions. 01:12:08.000 --> 01:12:18.000 And so that's sort of the idea here. It's not that the lasso model is inherently better at providing predictions in comparison to ridge just because the ridge coefficients don't go away. 01:12:18.000 --> 01:12:24.000 Lasso is better for feature selection, because you can see that the important features tend to stick around the longest. 01:12:24.000 --> 01:12:25.000 So that's the idea there. Now, ridge may still give you a better predictive model, and it doesn't shrink 01:12:25.000 --> 01:12:47.000 everything to 0, and that's fine. It's just that lasso is sort of a unique regularization in the sense that you can see which features are most important to making predictions. 01:12:47.000 --> 01:12:50.000 Okay. And so maybe let's wrap up this notebook. 01:12:50.000 --> 01:12:59.000 We kind of talked about this a little bit: when might you use lasso versus ridge? So lasso has some pros; for example, this feature selection is really nice. 01:12:59.000 --> 01:13:10.000 It also works well when you have a very large number of features and it turns out, you don't know this ahead of time, 01:13:10.000 --> 01:13:15.000 but it turns out that, in actuality, a lot of these features don't have a ton of effect on the target, 01:13:15.000 --> 01:13:20.000 the thing you're trying to predict. Lasso turns out to be really good for these types of problems, because it will get rid of the ones that aren't important 01:13:20.000 --> 01:13:24.000 as you increase alpha. One problem is it can have trouble with something called collinearity. 01:13:24.000 --> 01:13:53.000 So collinearity is when one of your columns is highly correlated with another one of your columns. That can be difficult for lasso, because it will typically just choose one of those variables, and not necessarily the one that's providing the signal to y, so it 01:13:53.000 --> 01:14:05.000 can struggle with that.
Ridge regression is good when you have a target that does end up depending on a lot of, or all of, the features, and it works a little bit better with collinearity than lasso. One con of ridge regression is that it tends to keep most of 01:14:05.000 --> 01:14:06.000 the predictors in the model, as we saw in this example. 01:14:06.000 --> 01:14:18.000 So this can be computationally costly if the data set has a large number of features. So that's sort of the pros and cons of both. 01:14:18.000 --> 01:14:19.000 And again, some of these you won't know ahead of time, but you can keep them in mind. 01:14:19.000 --> 01:14:31.000 And then, as I mentioned earlier, there's something called elastic net; there's an example in the regression practice problems where you can learn about that. 01:14:31.000 --> 01:14:34.000 And as I also said, here are some references that kind of go over 01:14:34.000 --> 01:14:41.000 some of the mathy stuff. For the math ones, you'll probably want these PDFs at the bottom, would be my guess, 01:14:41.000 --> 01:14:48.000 and then maybe this one right here, on the constrained and unconstrained forms. 01:14:48.000 --> 01:14:52.000 Okay. So with 10 minutes left, give or take, I want to end by going over some feature selection approaches for linear regression models. 01:14:52.000 --> 01:15:01.000 And so I'm gonna go through what used to be a problem session, 01:15:01.000 --> 01:15:09.000 basically, and go through the data that we're gonna use, and then show you a couple of different approaches. 01:15:09.000 --> 01:15:19.000 This is gonna be largely already coded, just given the length of time, but, you know, feel free to ask questions at the end 01:15:19.000 --> 01:15:24.000 if you have questions about particular code chunks. 01:15:24.000 --> 01:15:30.000 So this is a synthetic data set called Carseats that comes from this really great book, 01:15:30.000 --> 01:15:37.000 An Introduction to Statistical Learning, and this summer I guess they have a Python edition, which is great. 01:15:37.000 --> 01:15:49.000 The current edition is for R, but I like this book a lot; even if you don't care about learning R, it's just good for theory and stuff, and I believe it's free, which is awesome. 01:15:49.000 --> 01:15:53.000 So they have this data set called Carseats, which has these columns. 01:15:53.000 --> 01:16:10.000 You're trying to predict sales, and they have columns like competitor price, income, advertising, population, price, shelve location, age, education, urban, and US. 01:16:10.000 --> 01:16:21.000 And so basically these are synthetic data. Each row represents a store that sells car seats; sales represents the amount in sales 01:16:21.000 --> 01:16:29.000 they're getting at that store, and then there are various columns about the different features of that store. 01:16:29.000 --> 01:16:33.000 So, for instance, for this store, what are its competitors' 01:16:33.000 --> 01:16:38.000 prices? What is the population of the area? What is the education level of the area? 01:16:38.000 --> 01:16:44.000 What is the average age of the population where the store is? So that's the idea of the data set. 01:16:44.000 --> 01:16:45.000 If you're interested to learn more, I've provided this link, 01:16:45.000 --> 01:16:50.000 and you can also just look at the book, which is free online.
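(A minimal sketch of the preprocessing and quick EDA steps described next. The file name, split settings, and random seed here are made up for illustration; the column names are the ones from the ISLR version of Carseats.)

```python
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

# hypothetical file name; the notebook loads its own copy of the data
carseats = pd.read_csv("carseats.csv")

# K-1 dummies for the 3-level ShelveLoc column (keep Good and Bad, Medium is the baseline level)
carseats["ShelveLoc_Good"] = (carseats["ShelveLoc"] == "Good").astype(int)
carseats["ShelveLoc_Bad"] = (carseats["ShelveLoc"] == "Bad").astype(int)

# the Yes/No columns become 0/1
carseats["Urban"] = (carseats["Urban"] == "Yes").astype(int)
carseats["US"] = (carseats["US"] == "Yes").astype(int)

# train-test split before any model comparisons
cs_train, cs_test = train_test_split(carseats, test_size=0.2,
                                     shuffle=True, random_state=440)

# quick EDA: Sales against a handful of candidate features
sns.pairplot(cs_train, y_vars=["Sales"],
             x_vars=["CompPrice", "Advertising", "Population", "Price"])
```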
01:16:50.000 --> 01:16:57.000 So I go ahead, and just as a first step I make some dummy variables. 01:16:57.000 --> 01:17:00.000 So the first ones are shelve location good and shelve location bad; 01:17:00.000 --> 01:17:04.000 shelve location takes bad, good, and medium as potential values. 01:17:04.000 --> 01:17:13.000 So this is just doing the K minus 1 dummies, and the other 2 columns can just be quickly turned into 0 or 1 from yes or no. 01:17:13.000 --> 01:17:27.000 Then I make my train test split. So one of the very first steps in feature selection, that you guys have gotten a lot of practice with in the last 2 problem sessions, is called exploratory data analysis. 01:17:27.000 --> 01:17:36.000 So here, if it's possible, like if you don't have too many features, you can just try and look at various plots 01:17:36.000 --> 01:17:42.000 and, you know, basic statistics to get a sense of what might be most important. 01:17:42.000 --> 01:17:51.000 So one that's useful is this pairplot function, and pairplot takes in the data, 01:17:51.000 --> 01:17:55.000 and, like you've seen in 01:17:55.000 --> 01:18:01.000 the problem session, you can specify things like what should go on the y-axis and what should go on the x-axis. 01:18:01.000 --> 01:18:05.000 Here we can see, we're interested in sales, 01:18:05.000 --> 01:18:10.000 and so we can see whether the sales seem to be related to these other variables. 01:18:10.000 --> 01:18:14.000 In addition, sort of looking at good, medium, bad: if we look here, there's a density plot on the diagonal, 01:18:14.000 --> 01:18:29.000 and so we can kind of get a sense from that. It does appear that maybe shelve location has an impact on sales, just looking at the distributions here, the densities. 01:18:29.000 --> 01:18:32.000 Here's just another one looking at the remaining 01:18:32.000 --> 01:18:47.000 variables. So here's an example of one that looks like it has a very negative correlation with sales, and that is price, which makes sense: the higher the price, the less likely people are to buy. 01:18:47.000 --> 01:18:51.000 And then here's just the final one, with the remaining variables. 01:18:51.000 --> 01:18:56.000 So, you've seen some examples of this: if you go through and look at the different plots, you might find, 01:18:56.000 --> 01:19:03.000 okay, things like advertising, population, price, shelve location, 01:19:03.000 --> 01:19:06.000 and whether or not the store is located in the United States potentially have an impact on sales. 01:19:06.000 --> 01:19:17.000 And so we might want to consider including them in the sorts of models we'd consider. 01:19:17.000 --> 01:19:26.000 So the first type of automatable, algorithmic approach to feature selection we're gonna look at is called best subset selection. 01:19:26.000 --> 01:19:36.000 So this is an algorithm that looks at every possible model, fits that model, and gets the error on that model for comparison. 01:19:36.000 --> 01:19:45.000 So, for instance, if we were only looking at competitor price and advertising, best subsets would fit and compare the following 4 models: 01:19:45.000 --> 01:19:48.000 the baseline, the one that regresses sales on comp price, 01:19:48.000 --> 01:19:52.000 the one that regresses sales on advertising, and the one that regresses sales on both comp price and advertising.
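(To make that counting concrete, here's a tiny illustration, not the notebook's own helper, of enumerating every subset of those two features with itertools; the empty subset plays the role of the baseline model.)

```python
from itertools import chain, combinations

def powerset(features):
    """All subsets of `features`, from the empty set up to the full set."""
    return list(chain.from_iterable(
        combinations(features, r) for r in range(len(features) + 1)))

print(powerset(["CompPrice", "Advertising"]))
# [(), ('CompPrice',), ('Advertising',), ('CompPrice', 'Advertising')]
```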
01:19:52.000 --> 01:19:59.000 And so we're going to fit each of those. 01:19:59.000 --> 01:20:01.000 So it would, you know, 01:20:01.000 --> 01:20:04.000 do something like cross-validation, 01:20:04.000 --> 01:20:06.000 get the average cross-validation MSE on all 4 models, and then choose the one with the lowest cross-validation 01:20:06.000 --> 01:20:16.000 MSE. So we're gonna do that with those features that I outlined above, that 01:20:16.000 --> 01:20:20.000 I'm highlighting right now, and I'm just gonna show you; 01:20:20.000 --> 01:20:23.000 it's already been programmed up, but I'm just gonna show you. 01:20:23.000 --> 01:20:27.000 So here's the linear regression; 01:20:27.000 --> 01:20:30.000 we've seen that before. So this is a function 01:20:30.000 --> 01:20:31.000 I always get questions about, and I forgot to look up what this little less-than less-than does. 01:20:31.000 --> 01:20:38.000 I don't remember what it does. So I got this original function; 01:20:38.000 --> 01:20:43.000 it's a slight adjustment of a function that's found here on Stack Overflow. 01:20:43.000 --> 01:21:00.000 It's essentially just going to take in a list of features, and then it will return the power set of that list, including 01:21:00.000 --> 01:21:01.000 the empty set. And so, for instance, if I did the comp 01:21:01.000 --> 01:21:10.000 price and advertising example, it gives out the list that has comp price in a list, 01:21:10.000 --> 01:21:11.000 advertising in a list, and then comp price and advertising in a list. 01:21:11.000 --> 01:21:23.000 And so we're gonna use this function to go through and produce all of the models that we're gonna fit. 01:21:23.000 --> 01:21:25.000 Okay. 01:21:25.000 --> 01:21:38.000 So first I make my k-fold object, then I get my power set of all the features that I'm considering. And then, because I have a categorical variable in here 01:21:38.000 --> 01:21:47.000 that isn't just a binary 0 or 1, I have to make an adjustment so that any time one of these subsets includes shelve location, 01:21:47.000 --> 01:21:52.000 I go through and include both shelve location good and shelve 01:21:52.000 --> 01:22:00.000 location bad. You can't have just one; 01:22:00.000 --> 01:22:01.000 you have to have both, because you need to include all possibilities for shelve location. Now I also, for the model that's sort of my baseline, 01:22:01.000 --> 01:22:06.000 just the expected value, add in the option of being a baseline. 01:22:06.000 --> 01:22:18.000 Then I make a holder that's gonna have all of my MSEs for my different splits and models. 01:22:18.000 --> 01:22:27.000 And then I do this for loop, where this is the traditional k-fold for loop, we've seen this before, and then I loop through all my potential models. 01:22:27.000 --> 01:22:44.000 The first is the baseline model, so when I do this one I just fit the mean of the training set, and then for the rest of them I make the linear regression model. And I don't know why it says that; that's old. 01:22:44.000 --> 01:22:52.000 So now it's going through, and it's fitting this.
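(A simplified sketch of what this best subset loop is doing, reusing the powerset helper sketched above and the cs_train split from the earlier preprocessing sketch. The candidate feature list, fold count, and random seed are assumptions, so the notebook's actual code and model count will differ. As an aside, the less-than less-than in the Stack Overflow version is presumably Python's left bit-shift operator: 1 << n equals 2**n, the number of subsets of n features.)

```python
import numpy as np
from itertools import chain, combinations
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def powerset(features):
    return list(chain.from_iterable(
        combinations(features, r) for r in range(len(features) + 1)))

# assumed candidate list; ShelveLoc stands in for its pair of dummy columns
candidate_feats = ["CompPrice", "Advertising", "Population", "Price", "ShelveLoc", "US"]

def expand(feats):
    """Swap ShelveLoc for its two dummies so they always enter the model together."""
    cols = []
    for f in feats:
        cols += ["ShelveLoc_Good", "ShelveLoc_Bad"] if f == "ShelveLoc" else [f]
    return cols

models = [expand(s) for s in powerset(candidate_feats)]   # the empty list is the baseline

kfold = KFold(n_splits=5, shuffle=True, random_state=440)
mses = np.zeros((5, len(models)))

# cs_train is the training DataFrame from the preprocessing sketch above
for i, (tr_idx, ho_idx) in enumerate(kfold.split(cs_train)):
    tr, ho = cs_train.iloc[tr_idx], cs_train.iloc[ho_idx]
    for j, cols in enumerate(models):
        if not cols:   # baseline model: predict the training-set mean
            pred = np.full(len(ho), tr["Sales"].mean())
        else:
            reg = LinearRegression().fit(tr[cols], tr["Sales"])
            pred = reg.predict(ho[cols])
        mses[i, j] = np.mean((ho["Sales"].values - pred) ** 2)

best = int(np.argmin(mses.mean(axis=0)))   # model with the lowest average CV MSE
print(models[best], mses.mean(axis=0)[best])
```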
And I forgot to run that code. 01:22:52.000 --> 01:22:59.000 Here we go. And so now you can use something like argmin to find, all right, which one had the lowest cross-validation 01:22:59.000 --> 01:23:02.000 MSE, and it was the 110th model. 01:23:02.000 --> 01:23:10.000 So we fit a lot of different models here. And so, after running that, you can go through and see, okay, the model with the lowest average cross-validation MSE 01:23:10.000 --> 01:23:12.000 had these features, and had an average cross-validation 01:23:12.000 --> 01:23:19.000 MSE of this. 01:23:19.000 --> 01:23:25.000 And I don't know why this is here; this must just be left over from, yeah, that's nothing, 01:23:25.000 --> 01:23:30.000 ignore that. Okay, so are there any questions about best subsets? 01:23:30.000 --> 01:23:35.000 And it looks like everybody has chimed in to explain what the less-than less-than does, so 01:23:35.000 --> 01:23:39.000 thank you, Terry, who did that. 01:23:39.000 --> 01:23:44.000 I mean, are you just more likely to get a lower MSE 01:23:44.000 --> 01:23:47.000 if you use more features? 01:23:47.000 --> 01:23:56.000 So, not necessarily. There can be models where, if you put in bad data, like if you put in features that aren't at all related to the target, right, 01:23:56.000 --> 01:23:59.000 remember, this is sort of that 01:23:59.000 --> 01:24:07.000 bias-variance trade-off, right? The more features you include in your model, the more likely you are to overfit the model, meaning your generalization 01:24:07.000 --> 01:24:08.000 Right. 01:24:08.000 --> 01:24:10.000 error will go up! 01:24:10.000 --> 01:24:14.000 Oh! So this captures that! 01:24:14.000 --> 01:24:15.000 Okay. 01:24:15.000 --> 01:24:16.000 Yeah, right? Because it is the case that the more features you include, the more likely you are to have a better fit on the training data. 01:24:16.000 --> 01:24:24.000 But remember, we're looking at the error on the test data, 01:24:24.000 --> 01:24:25.000 okay, 01:24:25.000 --> 01:24:28.000 or not the test data, the holdout set from the cross-validation. 01:24:28.000 --> 01:24:36.000 The holdout, yeah. 01:24:36.000 --> 01:24:40.000 Okay. So another thing that you might have noticed is we fit a lot of models for this. 01:24:40.000 --> 01:24:44.000 That's because best subset fits 01:24:44.000 --> 01:24:54.000 every possible model, which can be a lot of models; I think it's something like 2 to the m models, and that's just a lot of models. 01:24:54.000 --> 01:25:04.000 So there are things called greedy approaches; they're called greedy because, at each point where they can make a choice, 01:25:04.000 --> 01:25:05.000 they make the choice that looks best at that moment. 01:25:05.000 --> 01:25:15.000 So the 2 that you can consider are called forward selection and backward selection, and they're the same basic idea. 01:25:15.000 --> 01:25:40.000 So basically, in forward selection you're starting with the baseline and then slowly adding in features. Step 0 of forward selection is: you fit the baseline model and get the average CV MSE. Then you go through, and for each of the m possible features you fit that simple 01:25:40.000 --> 01:25:44.000 linear regression model and calculate the average CV MSE. 01:25:44.000 --> 01:25:49.000 If none of them outperform the baseline, then you just stop.
If one of them does, 01:25:49.000 --> 01:25:50.000 if there are some that do outperform the baseline, you choose the one that performs best; that's your new default. 01:25:50.000 --> 01:26:06.000 And then for the remaining steps you basically just repeat: the second time through, for each of the remaining m minus 1 features not in your model, 01:26:06.000 --> 01:26:10.000 you fit the regression model that includes that feature, 01:26:10.000 --> 01:26:13.000 calculate those MSEs, and find the one that does best. 01:26:13.000 --> 01:26:23.000 If it's the current model, you stop; if it's not the current model, you have a new default model, and you go through and try fitting in the remaining features. 01:26:23.000 --> 01:26:41.000 And so basically forward selection only stops when you either (a) have the model that includes every feature, or (b) find a model that adding another feature to will not improve. So that's the idea behind forward selection. Backward selection is sort of the 01:26:41.000 --> 01:26:49.000 backwards approach. In backward selection you start with the model that includes everything, and then slowly remove features, 01:26:49.000 --> 01:26:59.000 one at a time, seeing if you outperform whatever your current default is, and doing the same sort of stopping process. 01:26:59.000 --> 01:27:09.000 So in backward selection, you're either gonna end with the baseline model or a model that has removed some of the features. 01:27:09.000 --> 01:27:23.000 So Erit is asking, can we automate these in a pipeline? So these can be automated; like, for instance, you'd have to do some sort of for-loop type thing, 01:27:23.000 --> 01:27:24.000 but it can be automated. Forward and backward selection can be coded up like this; 01:27:24.000 --> 01:27:32.000 it's not something you have to do by hand every time. 01:27:32.000 --> 01:27:44.000 You might use a pipeline as a part of your steps, but I don't know off the top of my head whether there's, within sklearn, a forward selection or backward selection sort of object. 01:27:44.000 --> 01:27:47.000 There might be; I just don't know. 01:27:47.000 --> 01:27:50.000 And then, to close out the notebook, you can also 01:27:50.000 --> 01:28:05.000 just use lasso. And so here's the example where I have a lasso looking at all the features, and then you just track the different alpha values and see which ones stick around. So price sticks around the longest, 01:28:05.000 --> 01:28:15.000 and then the shelve locations, I think, stick around for a decent amount of time, so you'd probably consider those at the same time. 01:28:15.000 --> 01:28:22.000 Let's see, what is this? Advertising sticks around, and age sticks around. 01:28:22.000 --> 01:28:26.000 So you might consider using these, seeing how they do with a cross-validation in comparison to other models. 01:28:26.000 --> 01:28:28.000 So that's sort of the idea with lasso. 01:28:28.000 --> 01:28:33.000 Okay, so we are over time; I apologize for that. 01:28:33.000 --> 01:28:45.000 We had a lot of questions, so I think that pushed me a little bit over. I'll stick around for anyone who still has questions, but otherwise have a good rest of your evening; feel free to leave. 01:28:45.000 --> 01:28:46.000 And yeah, I hope to see you tomorrow, where we're gonna start time series.
01:28:46.000 --> 01:28:52.000 So in tomorrow's lecture we'll be talking about time series. 01:28:52.000 --> 01:28:56.000 Alright!