WEBVTT 00:00:00.000 --> 00:00:02.000 Okay, so I'm gonna start recording. 00:00:02.000 --> 00:00:10.000 Alright! Welcome back! This is day 5 of the lectures for the 2023 May data science boot camp from the Institute. 00:00:10.000 --> 00:00:16.000 So we're gonna keep learning. Today's our last day on linear regression type stuff. 00:00:16.000 --> 00:00:25.000 Tomorrow, on day 6, we'll start learning a little time series. So we're actually gonna start today with something that is not explicitly just regression. 00:00:25.000 --> 00:00:26.000 We're gonna finish up a little bit of data cleaning stuff. 00:00:26.000 --> 00:00:41.000 So I have all of the lectures I'm confident we'll be able to get through already pre-loaded, just because I didn't want to waste time waiting for the kernel, but if you're trying to follow along on your own, we're working on the basic 00:00:41.000 --> 00:00:50.000 pipelines notebook first. So we're gonna talk about pipelines, and let me get my chat window open. 00:00:50.000 --> 00:00:57.000 Okay, so hi. Last week we talked about StandardScaler, and then I believe somebody in the chat pointed out, oh, there are other neat pre-processing steps like polynomial features. 00:00:57.000 --> 00:01:05.000 And so basically, there are times when you want to do a lot of automatable pre-processing steps for your models. 00:01:05.000 --> 00:01:24.000 And it can be a hassle to have to code each one of those steps out one by one by one, particularly when you're doing things like cross-validation, or when you want to go from the training set to the test set, and so there's this concept known as a pipeline which will 00:01:24.000 --> 00:01:29.000 allow you to do everything all at once. You just first have to define the pipeline. 00:01:29.000 --> 00:01:34.000 So we're gonna start with the most basic one, and that's all we're going to cover in live lecture. 00:01:34.000 --> 00:01:49.000 However, there's another one. If you're really interested in doing things with more advanced pipelines, there is a notebook called More Advanced Pipelines, which does have a pre-recorded video. We're not gonna have time to cover that one in live lecture, but if you need to learn more 00:01:49.000 --> 00:01:56.000 about making more complicated pipelines, check out that notebook. It's just one. Okay, so what is a pipeline? 00:01:56.000 --> 00:02:07.000 So we've done, as I said, a little bit of pre-processing, including scaling, and we've made some new features using 00:02:07.000 --> 00:02:14.000 pd.get_dummies. We've also by hand created features like polynomial transformations, like X squared or X cubed. 00:02:14.000 --> 00:02:18.000 And then we've talked about the possibility that you can make other nonlinear transformations like logs or square roots. 00:02:18.000 --> 00:02:33.000 And so these are a lot of different pre-processing steps, and the concept of a pipeline is just a nice framework for combining all of those steps into a single code container.
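As a point of reference, the kind of one-off pre-processing just listed might look roughly like this in code. This is a minimal sketch with a made-up toy frame; the column names and values are purely for illustration, not from the lecture notebooks.

from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

# made-up toy frame, purely for illustration
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "color": ["red", "blue", "red", "blue"]})

x_scaled = StandardScaler().fit_transform(df[["x"]])   # scaling step
dummies = pd.get_dummies(df["color"])                  # new features from a categorical column
df["x_squared"] = df["x"] ** 2                         # hand-made polynomial term
df["log_x"] = np.log(df["x"])                          # another nonlinear transformation

Doing each of these by hand, and then repeating them on a test set or inside cross-validation, is exactly the hassle the pipeline is meant to remove.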
00:02:33.000 --> 00:02:51.000 And so the idea about pipelines, and we'll see more of this in action when we're coding it up, is we want something that will take in our data on one side and then systematically or sequentially go through and apply all the different pre-processing steps, and then the last step would be the model that 00:02:51.000 --> 00:03:30.000 we're going to fit and then get predictions from. And so basically, we're going to define our pipeline to be this sort of coding object that has both all of the pre-processing steps and the model, which we can fit and which then provides things like the transformed data, the fitted model, and predictions with commands like .fit, .transform, or .predict. 00:03:30.000 --> 00:03:33.000 So I'm just generating some random data here. 00:03:33.000 --> 00:03:34.000 So this is random, synthetic data, so I could always go back and generate more. 00:03:34.000 --> 00:03:42.000 So you might be wondering, why aren't you doing a train test split? 00:03:42.000 --> 00:03:47.000 It's just because this is random data, and the focus here isn't predictive modeling, it's pipelines. 00:03:47.000 --> 00:04:06.000 So my goal here is I want to fit a polynomial regression model that regresses Y on X, and I wanna do that in a way where I don't have to go through and always do like finding X to the first power, X to the second power, X to the third power. So the first thing we're 00:04:06.000 --> 00:04:09.000 gonna learn is something known as polynomial features. 00:04:09.000 --> 00:04:15.000 So PolynomialFeatures is what's known as a transformer object in sklearn. 00:04:15.000 --> 00:04:20.000 And so these are really similar to scaler objects like StandardScaler. 00:04:20.000 --> 00:04:30.000 And so remember, StandardScaler had things like fit, transform, and fit_transform. Transformer objects have the same thing, where you fit it 00:04:30.000 --> 00:04:43.000 in some sense, then you transform the data. And so PolynomialFeatures takes in your data and then does a polynomial transformation, where it will take in your columns and you specify a degree. 00:04:43.000 --> 00:04:51.000 A degree-2 PolynomialFeatures would take in the data and fit itself, and what that means is, when it says fit, it's like, okay, how many columns do I have? 00:04:51.000 --> 00:04:58.000 And then what transformations do I need to compute, which for us would be x 00:04:58.000 --> 00:05:04.000 1, x 2, x 1 squared, x 1 times x 2, and x 2 squared with degree 2. 00:05:04.000 --> 00:05:08.000 And then, when you call .transform, it will actually provide this data frame. 00:05:08.000 --> 00:05:18.000 So this is stored in preprocessing, and I've already typed it out here. 00:05:18.000 --> 00:05:20.000 So from sklearn.preprocessing import PolynomialFeatures, capitalized. 00:05:20.000 --> 00:05:24.000 And so, in order to use this, we're going to first define it. So we do 00:05:24.000 --> 00:05:32.000 PolynomialFeatures. Now there's gonna be 2 arguments that I wanna put in. 00:05:32.000 --> 00:05:37.000 The first argument is interaction.
Actually, sorry, the first argument should be the degree. 00:05:37.000 --> 00:05:43.000 So I'm gonna put 2, because I want to... or what do I want for this? 00:05:43.000 --> 00:05:46.000 Let's go with, I guess, for the demonstration, 00:05:46.000 --> 00:05:50.000 I'm just gonna do 2. So 2. And then the next argument is something called interaction underscore 00:05:50.000 --> 00:06:02.000 only. And so when this is true, it's only gonna give you interaction terms like x 1 times x 2. 00:06:02.000 --> 00:06:08.000 When this is false, it will give you all the terms, so like the ones that I've shown you here up above. 00:06:08.000 --> 00:06:10.000 So I'm gonna set this equal to false, 'cause 00:06:10.000 --> 00:06:17.000 I don't want just the interactions. And then there's finally another argument called include_bias. 00:06:17.000 --> 00:06:20.000 And so include_bias takes in a true or a false value. 00:06:20.000 --> 00:06:28.000 If it's true, it will include a column of ones at the front. So in machine learning, they call intercepts biases. 00:06:28.000 --> 00:06:32.000 So, if it's true, it will include a column of ones, and if it's false it will not include a column of ones. 00:06:32.000 --> 00:06:46.000 So I'm going to set it equal to false, because the other sklearn models don't expect to receive a column of ones. 00:06:46.000 --> 00:06:47.000 So then, once we have our object, we're gonna do fit. 00:06:47.000 --> 00:06:54.000 And did I define this X? So if I do X... 00:06:54.000 --> 00:07:00.000 we're gonna fit X. This might seem weird, it's like, well, what do you need to fit here? It's not like with StandardScaler, where you need to find a mean and a standard deviation. 00:07:00.000 --> 00:07:14.000 So for this particular transformer, when you call fit, what's happening is it's seeing: how many columns does the input data have? 00:07:14.000 --> 00:07:19.000 What is my degree? And then, depending on what the answers to both of those are, 00:07:19.000 --> 00:07:23.000 what new columns do I need to generate? 00:07:23.000 --> 00:07:32.000 So, because my degree here is 2, and the number of columns I have here is one, which reminds me, I think I need to do a reshape 'cause 00:07:32.000 --> 00:07:38.000 it's one dimensional data. So my number of columns here is one, so that means I'm gonna need to produce an X and an X squared. 00:07:38.000 --> 00:07:48.000 And so, after fitting it, we need to do transform to actually get the new stuff. 00:07:48.000 --> 00:07:54.000 So we'll do transform, and then reshape, and then negative one. 00:07:54.000 --> 00:07:58.000 And so you can see here, you know, negative 3: 00:07:58.000 --> 00:08:09.000 that's x, and then the x squared is 9, and if we went back and did cubed just as an example, we can see how it's x, x squared, x cubed. 00:08:09.000 --> 00:08:10.000 I think I'll end up wanting cubed, so I'll keep that for now. 00:08:10.000 --> 00:08:32.000 But before we continue on to talk about pipelines, are there any questions about polynomial features? 00:08:32.000 --> 00:08:46.000 Let's see. So Zack is asking: does the term pipeline refer exclusively to several steps that can be cross-validated together, as stated in the sklearn documentation, for the Pipeline object context?
00:08:46.000 --> 00:08:55.000 In my academic field, pipeline is used to refer to the full push-button, replicable, end-to-end analysis, including data cleaning and visualization. 00:08:55.000 --> 00:09:05.000 So, I would say, in this particular context it's like the sklearn usage. 00:09:05.000 --> 00:09:23.000 I know there's also sort of the business context of pipeline, and I'm not sure what your field is, but it's kind of the same thing where, when they say pipeline, they mean sort of the whole thing of collecting the data, cleaning the data, fitting models, and 00:09:23.000 --> 00:09:33.000 then producing analytics-based visualizations or tables based on the results of models. So here, when we say pipeline, we literally just mean the sklearn stuff that we're fitting, 00:09:33.000 --> 00:09:40.000 so the pre-processing in sklearn followed by whatever model we're gonna do. Dustin is asking, does it sample 00:09:40.000 --> 00:09:57.000 non-integer powers? So PolynomialFeatures is only going to give you the polynomials, which means the positive integer powers. 00:09:57.000 --> 00:09:58.000 I have a good question. 00:09:58.000 --> 00:10:00.000 Yeah. 00:10:00.000 --> 00:10:14.000 Why did your... so you see in the graphic you show that you have the x 1, x 2, and then the output would have all those degree-2 terms. 00:10:14.000 --> 00:10:15.000 Yup! 00:10:15.000 --> 00:10:18.000 How come you only got, like, just x 1 and then x 1 squared, x 1 cubed? 00:10:18.000 --> 00:10:19.000 Yeah, so this possibly is just a bad visual to have, 00:10:19.000 --> 00:10:24.000 but I wanted to show the fact that it also does interaction terms. 00:10:24.000 --> 00:10:27.000 So the particular X we have here is one dimensional, like it just has one column. 00:10:27.000 --> 00:10:56.000 So this is X, but presented as a column vector. So because this only has one column, it can only produce, like, X squared, X cubed, etc. If we had one that had 2 columns, it would produce something like this. 00:10:56.000 --> 00:11:07.000 Yeah. Any other questions about polynomial features? 00:11:07.000 --> 00:11:08.000 Okay, so now we're gonna talk about how to define a pipeline in sklearn. 00:11:08.000 --> 00:11:19.000 So the first thing you need to do is you need to import Pipeline, which is conveniently stored in the pipeline sub-package. 00:11:19.000 --> 00:11:22.000 I'm not quite sure what it's called; 00:11:22.000 --> 00:11:34.000 I call it a sub-package. So from sklearn dot lowercase pipeline import uppercase-P Pipeline, and then once you have it, what we're going to try and fit 00:11:34.000 --> 00:11:45.000 here is the following polynomial regression model: Y is equal to beta 0 plus beta 1 x plus beta 2 x squared plus beta 3 x cubed plus epsilon. 00:11:45.000 --> 00:11:56.000 And so for us, that means the first step for our data is we want to apply a polynomial features transformer object of degree 3, 00:11:56.000 --> 00:11:58.000 so this one that we showed up here, and then after that we want to fit a linear regression model. 00:11:58.000 --> 00:12:05.000 So this is, schematically, what we're looking at. 00:12:05.000 --> 00:12:09.000 So now we just need to show you how to actually implement that in Python. 00:12:09.000 --> 00:12:12.000 So the first thing you do is you type the object name, 00:12:12.000 --> 00:12:17.000 so Pipeline, then within Pipeline you put a list, and then the entries of your list
00:12:17.000 --> 00:12:32.000 are tuples, and each tuple has as its first entry a string which is the name of that step, and then the actual object that is that step. 00:12:32.000 --> 00:12:40.000 So, for instance, in this diagram the first one I'm gonna call poly, and then the object is this particular PolynomialFeatures object. 00:12:40.000 --> 00:12:41.000 And then the second one I'm going to call reg, and then the object that goes along with it is this LinearRegression. 00:12:41.000 --> 00:12:50.000 So I'm gonna first put in a tuple for my polynomial features that I'm going to call poly, 00:12:50.000 --> 00:12:57.000 and then I'm going to put PolynomialFeatures, 00:12:57.000 --> 00:13:09.000 degree 3, interaction_only equals false, include_bias equals false. 00:13:09.000 --> 00:13:14.000 So remember, include_bias is false here because the linear regression model fits an intercept by default, 00:13:14.000 --> 00:13:19.000 so I don't need the bias column. Next, 00:13:19.000 --> 00:13:21.000 I'm gonna put a tuple. So notice the comma here; 00:13:21.000 --> 00:13:29.000 it's a list of tuples. I'm gonna put a tuple called reg that will contain my linear regression. 00:13:29.000 --> 00:13:30.000 And did I import that already? I don't think so. 00:13:30.000 --> 00:13:39.000 So let me also just make sure. So from sklearn dot 00:13:39.000 --> 00:13:43.000 linear_model import 00:13:43.000 --> 00:13:46.000 LinearRegression. There's a chance I imported it already above, 00:13:46.000 --> 00:13:56.000 but just to make sure. Okay, so LinearRegression, and then just copy_X equals true. 00:13:56.000 --> 00:14:01.000 Okay. So now I have a pipeline, and you fit and make predictions in exactly the same way. 00:14:01.000 --> 00:14:05.000 So we do pipe dot fit, and you put in your features first, 00:14:05.000 --> 00:14:14.000 so x dot reshape negative one comma one, followed by y. 00:14:14.000 --> 00:14:17.000 And now that I have a fitted pipeline, I can make predictions. 00:14:17.000 --> 00:14:21.000 So pipe dot predict. So the pipeline basically acts identically to the model; 00:14:21.000 --> 00:14:27.000 I can just do pipe dot predict and get my predictions. 00:14:27.000 --> 00:14:33.000 You can access individual steps of the pipeline, just like a dictionary. 00:14:33.000 --> 00:14:43.000 So remember, I called my polynomial features transformer poly, and so now I can get the aspects of the polynomial features step. 00:14:43.000 --> 00:15:00.000 I called my regression object reg, and so I can even get out things like the coefficients on the fitted regression or the intercept on the fitted regression object. 00:15:00.000 --> 00:15:01.000 Okay, so this is a good introduction to basic pipelines. 00:15:01.000 --> 00:15:16.000 When you get more complicated data and more complicated models, you'll have to do slightly more advanced things like making custom transformer objects or custom scaler objects. 00:15:16.000 --> 00:15:26.000 If you need to learn about how to do that for your projects or any other type of thing you're doing, check out notebook number 5 in cleaning; it goes over how to do all of that. 00:15:26.000 --> 00:15:34.000 But for us, in the rest of the lectures we're good with just the basic pipeline. 00:15:34.000 --> 00:15:35.000 So Brooks is asking, is there any reason to prefer Pipeline over 00:15:35.000 --> 00:15:42.000 make_pipeline? Both are from
00:15:42.000 --> 00:15:47.000 sklearn, but have slightly different structure. So, I've never used make_pipeline, 00:15:47.000 --> 00:15:51.000 so I don't actually know what it does. I've only ever used Pipeline. 00:15:51.000 --> 00:16:01.000 So it's possible that they introduced make_pipeline after I had already learned Pipeline, so I never bothered looking into it. So I would just say, if you're interested in knowing the difference, check out the documentation. 00:16:01.000 --> 00:16:02.000 I'm sure they point out maybe pros or cons of using one versus the other. 00:16:02.000 --> 00:16:08.000 Maybe not phrased that way; you'd have to look and see, like, okay, make_pipeline does this versus Pipeline. 00:16:08.000 --> 00:16:18.000 But you'd have to read the documentation. 00:16:18.000 --> 00:16:27.000 So, Pedro, I'm not sure if I got the whole question, but here I did fit X comma y, 00:16:27.000 --> 00:16:28.000 because with the pipeline, remember, the goal is to fit a model. 00:16:28.000 --> 00:16:49.000 So basically what the pipeline's doing is the X goes along, gets transformed, and then the Y kind of just follows along with it until the model at the end, where the regression gets fit. 00:16:49.000 --> 00:16:53.000 So the model needs the Y, so the Y just kinda goes along, 00:16:53.000 --> 00:16:56.000 but that's why it's fit of X comma y: 00:16:56.000 --> 00:17:06.000 because the pipeline is fitting not just the polynomial features but also the linear regression model. 00:17:06.000 --> 00:17:10.000 Then Pedro asks: we did not include the task of transforming the data in the pipeline; 00:17:10.000 --> 00:17:21.000 could we have done otherwise? So, the pipeline automatically, once it is fit, will transform the data for you. 00:17:21.000 --> 00:17:31.000 So basically, how the pipeline's working is you just give it the objects, and it knows, based on its structure, that PolynomialFeatures has a fit and a transform. 00:17:31.000 --> 00:17:42.000 So what happens when you call fit is it first goes through and fits the polynomial features on X, so it's probably calling like a fit_transform, then transforms those features to be used in the fitting of the linear regression. 00:17:42.000 --> 00:17:59.000 Then, when I call predict down here, it's not refitting the polynomial features, because they're already fit; 00:17:59.000 --> 00:18:08.000 it's just transforming this data and then feeding it into the prediction part of the linear regression model. 00:18:08.000 --> 00:18:30.000 Okay, are there any other questions about the pipeline? 00:18:30.000 --> 00:18:33.000 Okay. 00:18:33.000 --> 00:18:38.000 So that's gonna be it for pipelines. 00:18:38.000 --> 00:18:41.000 And now we're gonna go to supervised learning once again. 00:18:41.000 --> 00:18:55.000 I already have it open, but if you're trying to follow along, we are in supervised learning, and to sort of motivate what we're gonna do in one of our last regression live lecture notebooks, 00:18:55.000 --> 00:19:00.000 we're gonna talk about something called the bias-variance trade-off, which is a concept in supervised learning. 00:19:00.000 --> 00:19:04.000 So we're going to start here with our supervised learning for today and then go back into regression after we cover this.
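Before leaving pipelines behind, here is a minimal sketch pulling together the workflow that was just demonstrated. The synthetic data and specific numbers are stand-ins, not the notebook's actual values; only the Pipeline, PolynomialFeatures, and LinearRegression usage mirrors the lecture.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# synthetic 1-D data, standing in for the random data from the notebook
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 50)
y = x**3 - 2 * x + rng.normal(0, 1, 50)

pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=3, interaction_only=False, include_bias=False)),
    ("reg", LinearRegression(copy_X=True)),
])

pipe.fit(x.reshape(-1, 1), y)            # fits PolynomialFeatures, then the regression
preds = pipe.predict(x.reshape(-1, 1))   # reuses the already-fit transformer, then predicts

# individual steps are accessible by name, like a dictionary
print(pipe["poly"].n_output_features_)
print(pipe["reg"].coef_, pipe["reg"].intercept_)

Calling fit runs the transformer's fit and transform in order before fitting the final model; calling predict only transforms and then predicts, which matches the behavior described above.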
00:19:04.000 --> 00:19:13.000 So this notebook is a conceptual one. There's not going to be a ton of coding; 00:19:13.000 --> 00:19:17.000 all the coding is just sort of to help illustrate the concepts. 00:19:17.000 --> 00:19:20.000 So we're not introducing new commands in this notebook. 00:19:20.000 --> 00:19:30.000 So in particular, we're gonna talk about things like the bias of the estimate of the function. 00:19:30.000 --> 00:19:36.000 So remember, in supervised learning we have this model that we're trying to fit and estimate, 00:19:36.000 --> 00:19:39.000 and the thing we're particularly trying to estimate is the function of X. 00:19:39.000 --> 00:19:42.000 So we're going to talk about the bias of that estimate, 00:19:42.000 --> 00:19:51.000 the variance of that estimate, and then sort of show you that there's a trade-off between those two things as you increase or decrease 00:19:51.000 --> 00:19:55.000 something called the complexity of the model that you're trying to fit. 00:19:55.000 --> 00:19:59.000 So, remember our framework, where we're assuming that Y is equal to a function of the features 00:19:59.000 --> 00:20:06.000 X plus some random noise, and typically we're assuming this random noise is independent of the features 00:20:06.000 --> 00:20:24.000 X. So we have some algorithm, up to this point it's been linear regression, but it could be any supervised learning algorithm, that we then use to estimate f with an estimate that I'm going to call f hat, 00:20:24.000 --> 00:20:28.000 so the little caret on top of the f. 00:20:28.000 --> 00:20:36.000 So last week we discussed that we're really just interested in understanding the generalization error of our algorithm, meaning our goal is to get as low a generalization error as possible. 00:20:36.000 --> 00:21:00.000 So remember, generalization error is the error on a set that the algorithm was not trained on, and so in particular, if we have a set called (y 0, x 0), which is denoting a single test observation or a set of data we're trying to generalize on, so data it was 00:21:00.000 --> 00:21:04.000 not trained on, we can write this mathematically as: we want to know the expected value, if we're using something like MSE, 00:21:04.000 --> 00:21:16.000 of (y 0 minus y hat 0) squared, meaning the actual values minus the predicted values, 00:21:16.000 --> 00:21:27.000 that squared error. Then, if you go through the process of substituting all of that in: what is y hat 0? 00:21:27.000 --> 00:21:47.000 Well, it's f hat evaluated at the features x 0, and then, if you substitute in what y is, well, that's f at x 0 plus epsilon. And here the expectation is being taken over the probability space of all possible training sets, so if you're wondering what we're 00:21:47.000 --> 00:21:56.000 taking the expectation over, it's that. If you do a little bit of algebra and probability theory, you can rewrite this to give you the variance of your estimate, 00:21:56.000 --> 00:22:08.000 remember, we're estimating the function, plus the bias of your estimate squared, plus the variance of the error term. And so I didn't write all this out because it's a lot of algebra 00:22:08.000 --> 00:22:14.000 that I don't wanna talk through, but if you do the algebra and work it out, 00:22:14.000 --> 00:22:25.000 you'll get this.
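Written out, the decomposition being described is the standard identity (with the expectation taken over possible training sets, x_0 held fixed, and hats denoting estimates):

\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big]
  = \mathrm{Var}\big(\hat{f}(x_0)\big)
  + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2
  + \mathrm{Var}(\epsilon),
\qquad
\mathrm{Bias}\big(\hat{f}(x_0)\big) = \mathbb{E}\big[\hat{f}(x_0)\big] - f(x_0).

The three terms on the right are the variance of the estimate, the squared bias, and the irreducible error that the next part of the lecture refers to.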
And so the important thing to take away from this is that you have the variance of the estimate, plus the bias squared of the estimate, plus the irreducible error. 00:22:25.000 --> 00:22:32.000 So there are two kinds of things to take away from this. And maybe just as a refresher, if you don't recall what bias means: it's the difference between the expected value of your estimate and the actual thing it's estimating. 00:22:32.000 --> 00:22:49.000 So this holds not just for this particular function but in general: if you're estimating something, the bias of the estimate is the expected value of the estimate minus the actual thing. 00:22:49.000 --> 00:22:54.000 So one way to think of this is, how far, on average, is your estimator, 00:22:54.000 --> 00:22:59.000 the thing that you're using to make the estimate, from the actual thing that it's estimating? 00:22:59.000 --> 00:23:04.000 So: how far away, on average, is your estimate from the actual value? 00:23:04.000 --> 00:23:11.000 So, because variance and bias squared are both non-negative, 00:23:11.000 --> 00:23:24.000 the best that you could do as an algorithm is something that has 0 bias and 0 variance, meaning only this variance of epsilon is left, which is why it's called the irreducible error. 00:23:24.000 --> 00:23:31.000 So, whatever the variance is of your random noise, you're always gonna have that as part of your generalization error. 00:23:31.000 --> 00:23:51.000 So the best we could do is get down to that. However, it's usually not possible to actually get all the way down to the irreducible error, and also there tends to be this phenomenon where, when you decrease your bias, you'll in general tend to 00:23:51.000 --> 00:24:09.000 increase your variance, and likewise, when you decrease your variance, you tend to increase your bias. The takeaway is just that there's, in general, an inverse relationship between the two, where if one goes down the other one tends to go up, 00:24:09.000 --> 00:24:10.000 so to sort of give you a feeling for what this looks like, 00:24:10.000 --> 00:24:19.000 we're gonna play around with another toy example, where I've got some evenly spaced data from negative 3 to 3 00:24:19.000 --> 00:24:31.000 as my features, and then the model, or the actual relationship, is that y is x times (x minus 1), and then our error is this random noise that I'm highlighting here. 00:24:31.000 --> 00:24:35.000 So the true relationship is the one given by the black line, 00:24:35.000 --> 00:24:39.000 and the data that we've observed are these blue dots. 00:24:39.000 --> 00:24:43.000 So we're gonna give you a sense of what's going on here 00:24:43.000 --> 00:24:49.000 by looking at 3 scenarios. So the first scenario is a model with high bias. 00:24:49.000 --> 00:24:54.000 So remember, bias is sort of how far off from the actual relationship are we? 00:24:54.000 --> 00:25:11.000 So the model that is going to give us high bias would just be our regular baseline that we've talked about before, where we're just going to assume that the value of Y is the expected value of Y plus random noise. 00:25:11.000 --> 00:25:23.000 So basically, we're just assuming that there is no relationship between Y and X; Y is just given by the expected value, or average, of Y plus random noise. 00:25:23.000 --> 00:25:29.000 So this is high bias, because, as we can see, there is a relationship here between Y and X.
00:25:29.000 --> 00:25:39.000 So what we would get each time through is sort of a horizontal line at the average value of Y, and that's pretty far from our true relationship. 00:25:39.000 --> 00:25:49.000 But it's low variance, because the law of large numbers tells us that, as we go through the different training sets, remember, that's where the randomness is coming from, 00:25:49.000 --> 00:26:00.000 as long as we have enough observations of Y, the sample average will be pretty close to the expected value of Y. So that's why it's low variance. 00:26:00.000 --> 00:26:05.000 Now, the other model we'll consider is one that would have high variance 00:26:05.000 --> 00:26:17.000 but low bias. So here, and remember, this is variance with respect to the random training set that you draw, the model that we would consider is a high degree polynomial of X, 00:26:17.000 --> 00:26:19.000 so a really overfitting, high-degree polynomial. 00:26:19.000 --> 00:26:27.000 It's low bias, because if you have a high enough degree polynomial, on average you're gonna be hovering around this true relationship, 00:26:27.000 --> 00:26:32.000 the parabola. But it's high variance, because the higher the degree of your polynomial, the more you're trying to fit every training observation as closely as you can, 00:26:32.000 --> 00:26:46.000 so the actual model you get from each training set that you pull is going to change drastically. 00:26:46.000 --> 00:27:03.000 So this is known as overfitting to the training data, because as you get different training data, your model changes drastically, whereas the other one, I forgot to say this, the one with high bias is known as underfitting the data, because you're completely missing sort of the signal in 00:27:03.000 --> 00:27:10.000 the data. So what you try and go for with this bias-variance trade-off is something that's in the middle, 00:27:10.000 --> 00:27:17.000 so it has a little bit of bias, but not too much, and a little bit of variance, but not too much, so sort of a Goldilocks model. 00:27:17.000 --> 00:27:22.000 So for us, because this is a parabola, that would end up being, you know, a low-degree polynomial; a degree 00:27:22.000 --> 00:27:27.000 2 polynomial would be perfect. So that's what we would want here. 00:27:27.000 --> 00:27:38.000 So what I've done here in this code is I've gone through and generated, as you can see, 5 different random training sets. 00:27:38.000 --> 00:27:42.000 So my X stays the same; the only thing that's different is I'm getting different 00:27:42.000 --> 00:27:48.000 random noise here. Then I'm gonna go through and fit the 3 different models. 00:27:48.000 --> 00:27:50.000 So the first one I fit is that high variance model; 00:27:50.000 --> 00:27:54.000 here it's a degree 20 polynomial. 00:27:54.000 --> 00:27:57.000 The second one I fit is what I called the Goldilocks model; 00:27:57.000 --> 00:28:01.000 it's just the parabola, I'm just fitting a parabola. 00:28:01.000 --> 00:28:02.000 And then the final one is that high bias model. So this is the one where I just take it to be 00:28:02.000 --> 00:28:16.000 the average value of the training set. And so then I plot the fitted models along with the true relationship. 00:28:16.000 --> 00:28:17.000 So what you can see here, I think, goes from highest bias to lowest bias, from left to right. 00:28:17.000 --> 00:28:19.000 So for the one with high bias, you can see these are all the model fits.
00:28:19.000 --> 00:28:29.000 Just like I said, we have enough observations that we're basically hovering around the expected value of Y for this data. 00:28:29.000 --> 00:28:37.000 So you can see the high bias, because we're not lining up with the true relationship at all, 00:28:37.000 --> 00:28:44.000 but we can see it's low variance, because basically all of the models sit on top of each other. 00:28:44.000 --> 00:28:53.000 The Goldilocks model, the one that's just right, is falling on top of the true relationship, and most of the estimates from the training sets are basically on top of each other, 00:28:53.000 --> 00:29:00.000 so low bias and low variance. And then the one that is low bias, high variance: 00:29:00.000 --> 00:29:05.000 you can see that it's low bias, because it is basically right on top of the parabola, 00:29:05.000 --> 00:29:12.000 but we can tell it's high variance, because all the different estimates are wiggling kind of wildly and not really lining up with one another. 00:29:12.000 --> 00:29:22.000 So that's sort of the idea here. And what you're seeing on the horizontal axis of the different plots is what's known as the model complexity. 00:29:22.000 --> 00:29:30.000 So the way that this trade-off works is: the less complex your model, the more likely it is to have high bias squared, 00:29:30.000 --> 00:29:32.000 and that's sort of what we're seeing over here. 00:29:32.000 --> 00:29:49.000 And then as you increase the complexity, which in this setting was the degree of the polynomial we're fitting, in other settings it will be different, as you increase the complexity you can see your bias tends to go down, because you get really close to the true relationship, but then your variance tends 00:29:49.000 --> 00:29:56.000 to go up. And so why does this matter? Well, remember, all the way back up here: 00:29:56.000 --> 00:30:04.000 your generalization error is the variance of your estimate, plus the bias squared of your estimate, plus the irreducible error. 00:30:04.000 --> 00:30:22.000 So what you're looking for is to find the minimum on this generalization error curve, which tends to occur somewhere along, not exactly where they're both at their lowest or where they intersect, but somewhere where bias has been lowered but before variance starts to go up too much, 00:30:22.000 --> 00:30:23.000 and so that's sort of the idea here. And then, if we were to look at this particular problem, we can see where that is. 00:30:23.000 --> 00:30:42.000 I've plotted the test set error across different training sets, and we can see that the lowest error occurs here, and after that the variance starts to increase enough to increase the generalization error. 00:30:42.000 --> 00:30:51.000 Okay. So that was a lot of me talking, so maybe now is a great time to pause for questions about this trade-off, or things like "what is model complexity? 00:30:51.000 --> 00:31:01.000 I didn't understand that." So if you have any questions, now is a great time to ask. 00:31:01.000 --> 00:31:08.000 So Zack is asking in the chat: if you train your model on a data sample that is not representative of the population 00:31:08.000 --> 00:31:09.000 the model will be applied on, is the resulting error 00:31:09.000 --> 00:31:20.000 bias, or is it variance? So that is a slightly different question than what's covered by the bias-variance trade-off.
00:31:20.000 --> 00:31:34.000 So in the definition of this sort of thing, you're assuming that the samples you're getting are drawn from the... 00:31:34.000 --> 00:31:46.000 they all follow the same distribution. So the model's variance 00:31:46.000 --> 00:31:57.000 and bias are more dependent on the type of model, and what you're asking is sort of a different kind of question about, what if my sample is bad? 00:31:57.000 --> 00:32:02.000 So, like Erica is suggesting, an example of this could be something like your data 00:32:02.000 --> 00:32:05.000 has selection bias, or something like that. 00:32:05.000 --> 00:32:10.000 Yeah, they're sort of different types of concepts, 00:32:10.000 --> 00:32:20.000 I would guess, if that makes any sense. 00:32:20.000 --> 00:32:23.000 Matthew, can I make a comment? 00:32:23.000 --> 00:32:24.000 Alright, sure! 00:32:24.000 --> 00:32:30.000 I'm thinking about the answer to Zack's question. 00:32:30.000 --> 00:32:37.000 I think it should be that the bias will be more, because you are off from the model. 00:32:37.000 --> 00:32:41.000 So, I mean, it could be more, right? 00:32:41.000 --> 00:32:45.000 It just depends on how the data is not representative of the population 00:32:45.000 --> 00:32:57.000 data, if that makes sense. I could see a situation where... I just think it depends. 00:32:57.000 --> 00:33:01.000 I don't know, it depends on how the sampling is wrong, 00:33:01.000 --> 00:33:03.000 if that makes sense. 00:33:03.000 --> 00:33:24.000 I think, when the data is off from the representative data, that essentially also means whatever model, Y equals f of X, is totally off. 00:33:24.000 --> 00:33:31.000 From that perspective, I'm thinking that the bias will be quite big. 00:33:31.000 --> 00:33:37.000 Again, I think it just depends on how the sampling is wrong. 00:33:37.000 --> 00:33:38.000 Okay. 00:33:38.000 --> 00:33:41.000 Yeah. 00:33:41.000 --> 00:33:42.000 Are there any other... yeah. 00:33:42.000 --> 00:34:03.000 Hi, so, can you say the bias is the expected value of the residual? 00:34:03.000 --> 00:34:07.000 Is that the same thing? 00:34:07.000 --> 00:34:17.000 This expression is looking like the residual, and then I'm wondering if... 00:34:17.000 --> 00:34:41.000 So, not quite. The residual would also include... so this is the actual f minus the estimate of f, but the residual is y minus the estimate, if that makes sense. 00:34:41.000 --> 00:34:50.000 So in the residual there is this error term as well, according to our assumptions, so the residual is y minus f hat of x, 00:34:50.000 --> 00:34:52.000 so it includes the epsilon. 00:34:52.000 --> 00:35:00.000 So the residual is f of x 0 plus epsilon minus f hat, 00:35:00.000 --> 00:35:05.000 but this is f of x minus f hat of x. It's a slight difference, but yeah, 00:35:05.000 --> 00:35:06.000 it's not the residual. 00:35:06.000 --> 00:35:10.000 Okay, so it's missing the epsilon. 00:35:10.000 --> 00:35:17.000 Alright, I get that. The other thing I'm wondering is, when I hear the terms overfitting and underfitting, 00:35:17.000 --> 00:35:24.000 I think of when your model is doing well on the training set, but when you do it on the test set, you know, it doesn't work.
00:35:24.000 --> 00:35:34.000 That's usually when people say, oh, you're overfitting: it's performing really well on your training set but not doing so well on the validation or your test set. 00:35:34.000 --> 00:35:41.000 So I'm just thinking how this relates to those terms, or that idea. 00:35:41.000 --> 00:35:48.000 So overfitting, like a measure of how much you're overfitting is: 00:35:48.000 --> 00:35:57.000 if you're performing way better on the training set than on the test set, then that is sort of a measure of how much you're overfitting. 00:35:57.000 --> 00:36:04.000 So, essentially, you could have a model 00:36:04.000 --> 00:36:07.000 that predicts perfectly on the training set, but perhaps because you're overfitting on the data, maybe you're over here, right? 00:36:07.000 --> 00:36:20.000 So your bias is almost 0, but your variance is really high. If you're fitting all the training examples perfectly, you'd have 0 error on the training set, 00:36:20.000 --> 00:36:28.000 but maybe because of the high variance you'd have a high generalization error. Does that sort of make sense? That is one way to get a sense of how much you're overfitting, 00:36:28.000 --> 00:36:37.000 but it isn't exactly the same thing. 00:36:37.000 --> 00:36:38.000 Alright. Yeah, that is a way... I think I kinda lost the original question. 00:36:38.000 --> 00:36:46.000 But, yeah, I'm just curious how you would use these metrics. 00:36:46.000 --> 00:36:49.000 Do you calculate these metrics when you build a model? 00:36:49.000 --> 00:36:53.000 No, so this is just sort of a theoretical concept 00:36:53.000 --> 00:37:00.000 that guides some of the techniques we'll be seeing throughout the rest of the boot camp. 00:37:00.000 --> 00:37:15.000 So you don't go and calculate your variance or calculate your bias squared, because in order to actually get an estimate of those you'd need to be able to get a lot of different training sets. Again, when we're doing this fitting, you're only ever 00:37:15.000 --> 00:37:22.000 really going to look at the generalization error, never, individually, the bias squared or the variance. 00:37:22.000 --> 00:37:24.000 Okay, got it. Thanks. 00:37:24.000 --> 00:37:27.000 Yup! 00:37:27.000 --> 00:37:31.000 And then Icon, sorry if I mispronounced your name. 00:37:31.000 --> 00:37:33.000 So they are asking a plot coding question here: 00:37:33.000 --> 00:37:42.000 with for i in range(5) you create 5 plots; where do you indicate the i that the code loops over? 00:37:42.000 --> 00:37:47.000 And I believe Zack answered it, but maybe we should clarify. 00:37:47.000 --> 00:37:56.000 Where did I do that? Okay, so when you have a for loop in Python, you don't actually have to use the i anywhere. 00:37:56.000 --> 00:38:12.000 This is just iterating through the range, so range(5) will have 0, 1, 2, 3, 4. And so then basically what it's saying is, for each of the things within that range, do the following thing. And so if you never 00:38:12.000 --> 00:38:19.000 use i, it's just saying, do this 5 times, and you don't have to use the i. 00:38:19.000 --> 00:38:22.000 It's just saying: hey, I have this chunk of code,
00:38:22.000 --> 00:38:31.000 I want you to do it 5 times. 00:38:31.000 --> 00:38:32.000 Yeah. 00:38:32.000 --> 00:38:41.000 I have a question. If you were to be given a lot of training data, like different sets of training data, how would you actually calculate... 00:38:41.000 --> 00:38:45.000 like, how would you actually give a number for the bias? 00:38:45.000 --> 00:38:50.000 And the variance? Because, yeah. 00:38:50.000 --> 00:38:52.000 Oh, yeah, what were you gonna say? 00:38:52.000 --> 00:39:04.000 No, that's the question. Like, if you were actually given many, like thousands of, training data sets, then how would you actually give a number for the bias? 00:39:04.000 --> 00:39:06.000 Because the bias of f hat is the expectation of f(x) minus f 00:39:06.000 --> 00:39:17.000 hat(x), where f(x) is the true answer, and we don't know what the true answer is, 00:39:17.000 --> 00:39:29.000 even if we have lots of training data. Yeah, I'm just asking, how would you actually calculate a number for the bias if you were given a lot of training data, or different sets of training data? 00:39:29.000 --> 00:39:36.000 So, yeah, I think in order to do it, to actually be able to do it, 00:39:36.000 --> 00:39:52.000 I think you would need to be in a situation like here, where we know that the true relationship is y equals x times (x minus 1). But in general, in a real world situation, we don't know what the true relationship is, we don't know what 00:39:52.000 --> 00:39:56.000 f is, so we wouldn't be able to calculate it. 00:39:56.000 --> 00:40:02.000 So the statement is: whatever the true relationship is, in theory we are far off from it 00:40:02.000 --> 00:40:04.000 if the bias is high, and we are closer to it 00:40:04.000 --> 00:40:08.000 if the bias is low. 00:40:08.000 --> 00:40:09.000 Okay. Thank you. 00:40:09.000 --> 00:40:25.000 Yes, I think, yeah. And then maybe the last one before we move on: the question is, how would you decipher if the variance is high because of the choice of model versus being the nature of the data? 00:40:25.000 --> 00:40:32.000 The variance is a property of the model you're choosing, not a property of the data, 00:40:32.000 --> 00:40:42.000 if that makes any sense. 00:40:42.000 --> 00:40:48.000 Trying to think if there's a better way to answer... yeah. 00:40:48.000 --> 00:40:57.000 So it's a property of the model, so it wouldn't be the case that you wanna change something about your data. 00:40:57.000 --> 00:41:04.000 It would be the case that you wanna change something about your model if you suspect that you're overfitting, 00:41:04.000 --> 00:41:14.000 so if you suspect your model has high variance. 00:41:14.000 --> 00:41:20.000 Okay. 00:41:20.000 --> 00:41:24.000 Alright! So that's gonna be it for the bias-variance trade-off, 00:41:24.000 --> 00:41:27.000 and, you know, thanks for all the questions; they're very good. 00:41:27.000 --> 00:41:32.000 This is sort of the motivation for what we're going to learn in this notebook called Regularization. 00:41:32.000 --> 00:41:33.000 So we're going to introduce the general idea behind regularization, 00:41:33.000 --> 00:41:41.000 we'll set up a couple of different formulations and then work our way through them, 00:41:41.000 --> 00:41:44.000 we'll show you how to do them in Python.
00:41:44.000 --> 00:41:53.000 And then at the end we'll show you this nice property of lasso for feature selection. 00:41:53.000 --> 00:41:59.000 So I will do a quick note: this notebook gets a little bit heavy into some math. 00:41:59.000 --> 00:42:09.000 If you're not the biggest math person, you don't have to worry too much about those sorts of things, and just try and take away the general concepts of 00:42:09.000 --> 00:42:11.000 how the regularization problems are being set up, and then don't worry so much about the formal, actual setups. 00:42:11.000 --> 00:42:28.000 Just remember the gist and, you know, hold on until we get to the Python parts, if that's what you're interested in. 00:42:28.000 --> 00:42:29.000 Okay, so this is the example that we literally just looked at in the bias-variance notebook. So we have 00:42:29.000 --> 00:42:44.000 y is equal to x times (x minus 1), and so then I'm gonna go ahead and plot that, with fewer observations 00:42:44.000 --> 00:43:02.000 this time, and here's the real relationship. And so what I'm gonna go ahead and do is a loop where, for each i, I'm gonna fit degree-1 through degree-26 polynomials using my polynomial features 00:43:02.000 --> 00:43:12.000 pipeline, and each time through I'm gonna record the coefficients on each of the degrees. 00:43:12.000 --> 00:43:15.000 So here's what that looks like. So maybe, yeah? 00:43:15.000 --> 00:43:23.000 Matthew, sorry, where is this file in your GitHub? In the lectures folder, under which one? 00:43:23.000 --> 00:43:26.000 So this is in regression, 00:43:26.000 --> 00:43:34.000 so under supervised learning. Okay? Okay. 00:43:34.000 --> 00:43:39.000 Yeah. So these are the last notebooks we'll do today: I hope to get through 6, 00:43:39.000 --> 00:43:40.000 and then I think we'll be able to get through 9. 00:43:40.000 --> 00:43:47.000 If we are able to get through more, all of it will be in regression; I should have specified. 00:43:47.000 --> 00:43:49.000 Okay, so let me go ahead and zoom in just to make the table a little easier to see. 00:43:49.000 --> 00:44:00.000 So as you can see, each row is one of the polynomials that I fit, and then each column is the coefficient on that degree. 00:44:00.000 --> 00:44:08.000 So, for instance, when we only fit a line, X has the coefficient negative 0.959. 00:44:08.000 --> 00:44:19.000 Okay, so the motivation here is: if you look in general at the magnitudes of the coefficients, so absolute values, they tend to be increasing. 00:44:19.000 --> 00:44:26.000 So, for instance, it's kind of more or less staying around the same for degree one, 00:44:26.000 --> 00:44:36.000 but if we look at ones like x to the fourth, x to the fifth, we can quickly see that the coefficients are getting larger. 00:44:36.000 --> 00:44:41.000 So here we have one that's a 75, a negative 195, a 262. 00:44:41.000 --> 00:44:52.000 And so basically what's going on is something that in linear regression is kind of referred to as coefficient explosion. 00:44:52.000 --> 00:45:09.000 And so what does that mean? So here, as a function of the degree of the polynomial, the size of beta, meaning the norm of beta, is increasing quite drastically as you increase the degree of the polynomial 00:45:09.000 --> 00:45:23.000 you're considering.
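For reference, the kind of loop being described might look roughly like this. This is a minimal sketch: the synthetic data, the seed, and the degree range are stand-ins for whatever the notebook actually uses.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# stand-in synthetic data based on the true relationship y = x(x - 1) plus noise
rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 30)
y = x * (x - 1) + rng.normal(0, 0.5, size=len(x))

norms = []
for degree in range(1, 27):
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
        ("reg", LinearRegression()),
    ])
    pipe.fit(x.reshape(-1, 1), y)
    norms.append(np.linalg.norm(pipe["reg"].coef_))   # size of the fitted coefficient vector

print(norms[:3], norms[-3:])   # the norms tend to blow up as the degree grows

Printing the norms shows them growing rapidly with the degree, which is the coefficient explosion being discussed.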
And so what's happening is basically, as you increase the degree of the polynomial we're fitting, the polynomial tries to wiggle around quite a bit to fit as many training points as it can, meaning that these coefficients are getting 00:45:23.000 --> 00:45:35.000 larger and larger in magnitude. And so this is the motivation behind what we're going to learn, behind the little trick that regularization does. 00:45:35.000 --> 00:45:42.000 And so the idea behind regularization is we're still going to try and minimize that MSE 00:45:42.000 --> 00:45:52.000 for linear regression. So remember, this is just a different formula for that MSE: it's one over n times the sum of the squared differences between the actual and the predicted values, and this is just a way to write that in linear algebra terms. 00:45:52.000 --> 00:46:06.000 So we're still trying to minimize this, but now we're trying to do it in a way that we don't let the norm of our coefficients get too big. 00:46:06.000 --> 00:46:13.000 And so what is a norm? It's a way to measure the size of a vector. 00:46:13.000 --> 00:46:19.000 So we're going to focus in on 2 norms later on, but for now just think of it as a way to measure the size of a vector. In 2 dimensions 00:46:19.000 --> 00:46:27.000 it would be sort of like the length of the vector, depending on what norm you're using. 00:46:27.000 --> 00:46:30.000 So, like, if you were to draw an arrow on a piece of paper, then measure it with a ruler, 00:46:30.000 --> 00:46:34.000 you can think of that as being the same sort of thought process behind a norm. 00:46:34.000 --> 00:46:38.000 And then later on we'll define actual formulas for particular norms. 00:46:38.000 --> 00:46:49.000 So for now, just know a norm is a way to measure the size of a vector, and in particular we want norms that are going to measure the size of our coefficients. 00:46:49.000 --> 00:47:03.000 And so regularization is essentially taking our MSE minimization and rewriting it in a way that we're trying to do this while not letting the coefficients get too large. 00:47:03.000 --> 00:47:06.000 And so we're still trying to minimize the MSE, 00:47:06.000 --> 00:47:23.000 but now we're sort of operating on a budget, where we're only doing this in a world where the norm of my coefficient vector has to be less than or equal to some constant C. And so, thinking of this again, we're trying to find the smallest mean squared error 00:47:23.000 --> 00:47:27.000 we can while we're on a budget for beta. 00:47:27.000 --> 00:47:30.000 So we set it up as a constrained optimization problem; 00:47:30.000 --> 00:47:43.000 that's sort of just the motivation, like why are people doing this. You can equivalently rewrite this to be a penalized optimization, where, 00:47:43.000 --> 00:47:45.000 again, this is the MSE now multiplied by n, and we wanna minimize this plus, 00:47:45.000 --> 00:47:56.000 now, an added penalty for large beta. So alpha is something known as a hyperparameter. 00:47:56.000 --> 00:48:14.000 And now, in addition to our MSE,
or, I guess, the sum of squared errors, we're now adding a penalty term where, if beta gets too large, the thing we're trying to minimize gets large as well. So here, this is our first instance of a particular norm: this is the square of 00:48:14.000 --> 00:48:27.000 the 2-norm, which is just the sum of the squares of the entries of some vector. So if we have a vector a, it's a 1 squared plus a 2 squared plus dot dot dot plus a n squared. 00:48:27.000 --> 00:48:33.000 So minimizing, sorry, not maximizing, minimizing this first term is the same as minimizing the MSE. 00:48:33.000 --> 00:48:44.000 And to give a mathematical equivalency of the constrained optimization with this penalized form, 00:48:44.000 --> 00:48:45.000 there are some references at the bottom; I'm not gonna go into the details 00:48:45.000 --> 00:49:09.000 there. So basically what's going on here is, by adding this penalty, or if you want to think about it as the constrained optimization, you're forcing this minimization to happen in a way that doesn't let beta get too large. So that's the idea behind regularization: you're 00:49:09.000 --> 00:49:13.000 enforcing some sort of penalty or constraint that makes it so 00:49:13.000 --> 00:49:14.000 your coefficients can't get too big. 00:49:14.000 --> 00:49:15.000 So this is our first instance of a hyperparameter. 00:49:15.000 --> 00:49:18.000 So what's the hyperparameter here? It's alpha. 00:49:18.000 --> 00:49:36.000 So beta is a vector of parameters that we have to try and estimate; alpha is a hyperparameter. 00:49:36.000 --> 00:49:39.000 The difference is that parameters get estimated during the algorithm's fitting, while hyperparameters you have to set from the very beginning. 00:49:39.000 --> 00:49:50.000 So before you even try and fit, you have to decide what value of alpha you're going to use, and then fit the algorithm. 00:49:50.000 --> 00:49:52.000 So the way that you choose the alpha depends. 00:49:52.000 --> 00:50:08.000 So we're gonna see some examples below where we choose different values of alpha just to get something out. In other cases you'll do something called hyperparameter tuning, where you'll do something like a cross-validation where, each time through the cross-validation, 00:50:08.000 --> 00:50:25.000 you're choosing a different value for the hyperparameter, and then at the end you see which one performs best. In this particular case, if alpha is 0, we recover what we used to have for beta, the ordinary least squares estimate, and, theoretically, 00:50:25.000 --> 00:50:37.000 if alpha equals infinity, the estimate would imply that beta has to be 0 in order to minimize. 00:50:37.000 --> 00:50:47.000 Okay. So I think I saw a question. 00:50:47.000 --> 00:50:48.000 So Icon is asking: isn't linear regression also estimating the effect sizes of features? 00:50:48.000 --> 00:50:55.000 Then what does it mean when we use regularization? Is it because in machine learning 00:50:55.000 --> 00:50:58.000 we do not care about beta? So that's kind of the gist of it. 00:50:58.000 --> 00:51:09.000 So here we're in a predictive modeling framework where, at the end of the day, we may not care about being able to 00:51:09.000 --> 00:51:12.000 clearly interpret, you know,
oh, if I increase X by 2 units, then this does blank to the output. 00:51:12.000 --> 00:51:20.000 So you do lose some of the nice interpretability that you get from ordinary least squares, like a regular linear regression, 00:51:20.000 --> 00:51:38.000 but this is thinking of it in terms of improving predictive performance. 00:51:38.000 --> 00:51:43.000 Okay. So here are the 2 specific regularization models that we're gonna look at. 00:51:43.000 --> 00:51:49.000 The first is called ridge regression, and ridge regression chooses this square of the 2-norm 00:51:49.000 --> 00:52:00.000 here. So this is, again, as we said, a 1 squared plus a 2 squared plus dot dot dot plus a n squared for an n-dimensional vector a; a is not involved in the problem at all, 00:52:00.000 --> 00:52:03.000 it's just a placeholder name for a vector. The other one 00:52:03.000 --> 00:52:07.000 we're gonna look at uses what's known as the L1 norm. 00:52:07.000 --> 00:52:24.000 So the L1 norm is the sum of the absolute values, so the absolute value of a 1 plus the absolute value of a 2 plus dot dot dot plus the absolute value of a n, where, if it wasn't clear, a sub 1 is the first entry of the vector a, a sub 2 is the second 00:52:24.000 --> 00:52:27.000 entry, and a sub n is the nth entry. So, yeah. 00:52:27.000 --> 00:52:34.000 Can I ask you a quick question? How do the different norms... 00:52:34.000 --> 00:52:36.000 adding this regularization term, does it make a difference which one you choose? 00:52:36.000 --> 00:52:45.000 Or what's the difference between choosing, you know, the L1 norm versus the squared 2-norm? 00:52:45.000 --> 00:52:51.000 Yeah, so I'll show you quickly, and then we'll come back to this in maybe more detail later. 00:52:51.000 --> 00:52:57.000 So basically, this is from a book called The Elements of Statistical Learning; 00:52:57.000 --> 00:53:07.000 I didn't make this image. This is a way to visualize it in 2D: basically, you're just changing the shape of the constraint region in the parameter space. 00:53:07.000 --> 00:53:27.000 So here is a space that represents beta 1 and beta 2, and the square is if you use the lasso norm, and so all of your coefficients have to exist either within or on the edge of that region, versus 00:53:27.000 --> 00:53:36.000 the circle for ridge. And so it makes a difference in the type of estimates you're able to get, and you can use that to your advantage when trying to do things like feature selection, 00:53:36.000 --> 00:53:50.000 and we'll talk about that at the end. So this is a little preview: basically, the different norms change the shape of the region that the coefficients can live in. So for lasso it has to be within this square 00:53:50.000 --> 00:54:08.000 or on the edge of the square; for ridge it has to be in this circle or on the edge of the circle. And then there are higher dimensional equivalents of this, which I just can't draw in a 2 dimensional space. 00:54:08.000 --> 00:54:17.000 Okay. Are there any other questions before we go on to how to do this with Python? 00:54:17.000 --> 00:54:26.000 Could you, sorry, explain again how the norm is coming into the equation 00:54:26.000 --> 00:54:30.000 for regularization? I kind of missed it.
00:54:30.000 --> 00:54:36.000 Yeah, so basically, with what we're trying to do, there are 2 ways to think of it. 00:54:36.000 --> 00:54:41.000 The first is that you're still minimizing the MSE, or whatever you want to minimize, 00:54:41.000 --> 00:54:49.000 but now you're forcing the norm of your vector of coefficients, 00:54:49.000 --> 00:54:59.000 so here, think of a being replaced by beta in either one of these, to be less than or equal to some constant. 00:54:59.000 --> 00:55:03.000 That's one way. Equivalently, you can think of it as: now 00:55:03.000 --> 00:55:08.000 you're minimizing the MSE, or something that's equivalent to the MSE, 00:55:08.000 --> 00:55:13.000 plus a penalty, where the larger beta gets, the more penalty you get. 00:55:13.000 --> 00:55:16.000 So basically you can't just automatically get the best one, 00:55:16.000 --> 00:55:20.000 the OLS estimate, our regular linear regression estimate. 00:55:20.000 --> 00:55:27.000 Now you're forced to maybe get a subpar minimum, because if you were to try and get the minimum, beta would get too big. 00:55:27.000 --> 00:55:35.000 So the norms are just different ways of measuring the size of beta. 00:55:35.000 --> 00:55:42.000 It could be any norm, but the 2 most commonly used are the ones in ridge and lasso. 00:55:42.000 --> 00:55:43.000 And the a, when you say a, it's really beta 00:55:43.000 --> 00:55:45.000 then? 00:55:45.000 --> 00:55:49.000 In this particular case it would be beta. This is just a general vector, 00:55:49.000 --> 00:55:51.000 like a is a vector. 00:55:51.000 --> 00:55:56.000 Yeah, but in the example it's represented by beta, in your example. 00:55:56.000 --> 00:55:59.000 Yeah, yeah, and for what we're using it for, it would be beta. 00:55:59.000 --> 00:56:11.000 I just wanted to use a general vector here because the norms are independent of the vector you use. 00:56:11.000 --> 00:56:12.000 Yeah. 00:56:12.000 --> 00:56:13.000 I have a question. Can you go back a little bit towards the, yeah. 00:56:13.000 --> 00:56:24.000 Yeah. So you say that the first and the second are equivalent. So in the second one, we are basically just redefining 00:56:24.000 --> 00:56:31.000 our cost function to include a regularization term, and in the first one we are just saying that, okay, we have the regular cost function, 00:56:31.000 --> 00:56:35.000 but additionally we require the norm of beta to be less than some constant. 00:56:35.000 --> 00:56:37.000 And in the second, I'm just trying to think about. 00:56:37.000 --> 00:56:43.000 Yes, these are equivalent in terms of the ideas. 00:56:43.000 --> 00:56:59.000 But does the second one automatically imply that the norm of the beta that would come out of the calculation would be upper bounded by some constant C? 00:56:59.000 --> 00:57:07.000 Maybe it does. 00:57:07.000 --> 00:57:08.000 Okay. 00:57:08.000 --> 00:57:10.000 So there are some notes at the bottom that go through the derivation of how this is equivalent to that, and so I'll refer you to those notes. 00:57:10.000 --> 00:57:11.000 Okay. Thanks. 00:57:11.000 --> 00:57:17.000 Yeah. Yup. 00:57:17.000 --> 00:57:20.000 So how do we do this in sklearn? Both of these are in sklearn. 00:57:20.000 --> 00:57:26.000 Here are the links to the documentation if you want to check out more, and I think maybe it's a good point to look at one.
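(For reference, here is one compact way to write the two formulations that came up in this exchange. The 1/n scaling is just one convention; sklearn's own Ridge and Lasso objectives differ from this by constant factors, so treat it as a sketch rather than the exact objective the library minimizes.)

```latex
% Constrained form: minimize the usual fit criterion subject to a "budget" C on the size of beta
\hat{\beta} = \arg\min_{\beta}\ \tfrac{1}{n}\lVert y - X\beta\rVert_2^2
\quad\text{subject to}\quad
\lVert\beta\rVert_2^2 \le C \ (\text{ridge})
\quad\text{or}\quad
\lVert\beta\rVert_1 \le C \ (\text{lasso})

% Equivalent penalized form: a larger alpha corresponds to a smaller budget C
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta}\ \tfrac{1}{n}\lVert y - X\beta\rVert_2^2 + \alpha\lVert\beta\rVert_2^2,
\qquad
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta}\ \tfrac{1}{n}\lVert y - X\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1

% with the norms defined entrywise as
\lVert a\rVert_2^2 = a_1^2 + a_2^2 + \dots + a_n^2,
\qquad
\lVert a\rVert_1 = \lvert a_1\rvert + \lvert a_2\rvert + \dots + \lvert a_n\rvert
```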
00:57:26.000 --> 00:57:30.000 Let's go to Ridge, 'cause I think there is something here, you know. 00:57:30.000 --> 00:57:34.000 I'm thinking of a different model, but this is what the documentation looks like. 00:57:34.000 --> 00:57:37.000 And so maybe it's good to see at least that there's a default value for alpha. 00:57:37.000 --> 00:57:45.000 Another thing that we're gonna see here is something called max_iter. 00:57:45.000 --> 00:57:58.000 So this is the number of iterations. I think we sort of talked about this last week when we talked about fitting linear regression and worrying about the ill-conditioning of a matrix. 00:57:58.000 --> 00:58:06.000 So sklearn doesn't use the normal equations. For instance, for ridge regression you can derive a formula for the estimate, 00:58:06.000 --> 00:58:08.000 but it doesn't use that formula. It uses sort of like a gradient descent or something like that. 00:58:08.000 --> 00:58:16.000 And so the algorithm goes through a number of iterations. 00:58:16.000 --> 00:58:23.000 And so there's a default value for the number of iterations that you may have to change here. 00:58:23.000 --> 00:58:32.000 It's saying it's None, but sometimes you'll get a warning from sklearn saying something like the algorithm did not converge before the number of iterations was exceeded. 00:58:32.000 --> 00:58:33.000 And so when that happens, you may need to take this and increase it. 00:58:33.000 --> 00:58:42.000 So we'll see more examples of this as we go through the boot camp. 00:58:42.000 --> 00:58:43.000 But because I clicked on this I wanted to at least demonstrate something; 00:58:43.000 --> 00:58:45.000 I was thinking of a different model that I wanted to look at the documentation for. 00:58:45.000 --> 00:58:54.000 But here's the documentation. 00:58:54.000 --> 00:58:59.000 Okay, so first we're gonna import them. So from 00:58:59.000 --> 00:59:08.000 sklearn.linear_model I'm gonna import Ridge 00:59:08.000 --> 00:59:24.000 and Lasso. And maybe this is new: when you're importing 2 things from the same place, you can separate them by a comma instead of having to write each one on a different line. 00:59:24.000 --> 00:59:34.000 So this can save you some typing. So then, what I'm gonna do is I'm gonna go through and show, for different values of alpha, 00:59:34.000 --> 00:59:39.000 how this impacts the coefficients that you get as the estimate. 00:59:39.000 --> 00:59:49.000 So each time through, for each different value of alpha, what we're gonna do is fit the high-degree polynomial, which is a degree-10 polynomial. 00:59:49.000 --> 00:59:54.000 And then also you might notice that we're using pipelines here. 00:59:54.000 --> 00:59:55.000 So this is a situation where you need to scale your columns before you fit the model. 00:59:55.000 --> 01:00:08.000 If you don't scale your columns before fitting the model, it can mess up the way the model gets fit. 01:00:08.000 --> 01:00:22.000 And so, if you have very different scales, thinking of the constraint as a budget, the budget can get dedicated entirely to one column just because its scale is vastly different from the other columns, 01:00:22.000 --> 01:00:31.000 and it won't actually give you the right fit. So you need to always scale when you use ridge or lasso.
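(A minimal sketch of the kind of loop being built next. The synthetic data, the alphas list, the step names, and the choice to scale after building the polynomial terms are all assumptions for illustration, not the notebook's exact code.)

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge, Lasso

# stand-in data; in the notebook X and y already exist
rng = np.random.default_rng(216)
X = rng.uniform(-2, 2, (200, 1))
y = X[:, 0] + X[:, 0] ** 2 + rng.normal(0, 1, 200)

alphas = [10.0 ** p for p in range(-2, 5)]   # powers of 10, as mentioned later

def fit_and_get_coefs(model):
    """Degree-10 polynomial, scaled so the penalty treats every term fairly, then the regularized fit."""
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=10, include_bias=False)),
        ("scale", StandardScaler()),
        ("reg", model),
    ])
    pipe.fit(X, y)
    return pipe.named_steps["reg"].coef_

# one fit per alpha for each of ridge and lasso, recording the coefficients
ridge_coefs = [fit_and_get_coefs(Ridge(alpha=a, max_iter=100_000)) for a in alphas]
lasso_coefs = [fit_and_get_coefs(Lasso(alpha=a, max_iter=100_000)) for a in alphas]
```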
01:00:31.000 --> 01:00:34.000 So that's basically what we're doing here. And so the last thing we need to add: 01:00:34.000 --> 01:00:37.000 so we have the scaling and the polynomial features; 01:00:37.000 --> 01:00:41.000 now we just need to add the ridge and the lasso. 01:00:41.000 --> 01:00:47.000 And so here I'll add Ridge. 01:00:47.000 --> 01:00:56.000 So you just call Ridge, and then for alpha, each time through 01:00:56.000 --> 01:01:01.000 we're going to go ahead and get a different value of alpha, 01:01:01.000 --> 01:01:06.000 and so alpha at i. And then, I think, to be safe, 01:01:06.000 --> 01:01:10.000 I just like to set the max_iter pretty high. 01:01:10.000 --> 01:01:17.000 And then the same thing here for lasso. 01:01:17.000 --> 01:01:20.000 I'm gonna call it lasso, and then I'll call capital-L 01:01:20.000 --> 01:01:24.000 Lasso, alpha equals alphas at i. 01:01:24.000 --> 01:01:29.000 Why is it alphas at i? Because alphas is this vector that I've defined up here. 01:01:29.000 --> 01:01:33.000 Maybe I should call it alphas, with an s. 01:01:33.000 --> 01:01:42.000 And max_iter equals 100,000. Then I fit them and record the coefficients in an array. 01:01:42.000 --> 01:01:50.000 So let's go ahead and make those even bigger. 01:01:50.000 --> 01:01:55.000 Let's try one more time. 01:01:55.000 --> 01:02:06.000 There we go. Oh, come on, guys! 01:02:06.000 --> 01:02:09.000 Alright! 01:02:09.000 --> 01:02:11.000 So we can look at the coefficients for these. 01:02:11.000 --> 01:02:16.000 And so you can see, as you increase alpha, so the bigger alpha is, 01:02:16.000 --> 01:02:20.000 so here, the alpha, you can think of it like 01:02:20.000 --> 01:02:29.000 this is that particular alpha, which is why it shows alpha here. The bigger alpha is, the smaller the norm of the coefficients has to be. 01:02:29.000 --> 01:02:44.000 So you can kind of see them shrinking. Now, one thing you might notice is that with ridge regression the shrinking seems to happen more or less uniformly, like no one coefficient goes down to 0 on its own, whereas with 01:02:44.000 --> 01:03:01.000 lasso, when you make the comparison, and this goes back to the question that was asked earlier, some of the coefficients will go down to 0 faster than others. And so that's a nice feature of lasso that we're gonna talk about in a second. But just to 01:03:01.000 --> 01:03:04.000 demonstrate: as alpha gets bigger, the coefficients go to 0, and in lasso they tend to hit 0, 01:03:04.000 --> 01:03:11.000 whereas in ridge they tend to not hit 0. 01:03:11.000 --> 01:03:19.000 So, you know, why is this happening? This goes back to sort of what we talked about earlier. In ridge regression 01:03:19.000 --> 01:03:25.000 we're trying to minimize the MSE subject to the 2-norm being less than or equal to C, 01:03:25.000 --> 01:03:26.000 which you can rearrange so that, in 2 dimensions, 01:03:26.000 --> 01:03:49.000 you have this disk with a radius of C, or, sorry, a radius of the square root of C, so that's where you get this picture, whereas with lasso you now have the absolute values, which give you a square with these vertices. 01:03:49.000 --> 01:03:54.000 And so what you see here is: this point is the ordinary least squares estimate, 01:03:54.000 --> 01:04:07.000 the thing that we got with normal linear regression that we learned about last week, and these ellipses are what are known as the level curves of the MSE.
01:04:07.000 --> 01:04:08.000 And so the minimum of the MSE is sort of the bottom; 01:04:08.000 --> 01:04:15.000 it's this paraboloid, and the bottom is the minimum. 01:04:15.000 --> 01:04:19.000 And these blue regions are the constraint region, 01:04:19.000 --> 01:04:22.000 so the constraint region being this part, where the norm of the coefficients is less than or equal to the constant. 01:04:22.000 --> 01:04:37.000 And so your estimate for the betas has to be where the level curves of the MSE hit the constraint region. 01:04:37.000 --> 01:04:57.000 So with lasso, because you have this square, pointy constraint region, you typically end up hitting on one of the axes, which explains why one or more of the coefficients tend to hit 0, 01:04:57.000 --> 01:05:17.000 while in comparison, with ridge, they don't exactly hit 0. Because ridge has more of this circular, or in higher dimensions spherical, shape, the curve tends to intersect out here away from the axes, 01:05:17.000 --> 01:05:23.000 which is why you tend to see this sort of uniform shrinking, but not quite getting to 0, 01:05:23.000 --> 01:05:29.000 whereas with lasso one of them will go to 0 much earlier than the others. 01:05:29.000 --> 01:05:39.000 That's sort of what's going on. So this gives lasso a nice property for feature 01:05:39.000 --> 01:05:56.000 selection. Because this happens, the features that stick around with lasso, the ones that don't go to 0, tend to be important features for predictive power. And so what you can do, as we'll see in the next notebook we cover, is use lasso and then change different values 01:05:56.000 --> 01:05:59.000 of alpha. 01:05:59.000 --> 01:06:10.000 You can use lasso, change different values of alpha, and then follow along with which ones stick around the longest, and those tend to be your more important features. 01:06:10.000 --> 01:06:11.000 Now, in this example it's kind of acting weirdly. 01:06:11.000 --> 01:06:19.000 We know the correct answer ahead of time: it's x squared and x to the first. 01:06:19.000 --> 01:06:21.000 But then things like x to the sixth stick around a little bit longer than those 2, same with x to the tenth. 01:06:21.000 --> 01:06:31.000 But these 2 are the ones that actually stay farthest away from 0 for the longest, 01:06:31.000 --> 01:06:47.000 if you look at all of them. So I think, if we were to look at it, even if we didn't have that knowledge, we'd be more inclined to keep these than, like, x to the tenth. Okay. 01:06:47.000 --> 01:06:48.000 So Mitch is asking a math-heavy question: we've seen L1 and L 01:06:48.000 --> 01:06:51.000 2 regularization; are other Lp norms ever used, for other values of p? 01:06:51.000 --> 01:07:07.000 So there's something called elastic net, which is, well, it's not quite what your question is asking, but elastic net is sort of like a weighted sum of the L 01:07:07.000 --> 01:07:14.000 1 and the L2 norms. 01:07:14.000 --> 01:07:19.000 It's not quite that. So I think you in theory could use other Lp 01:07:19.000 --> 01:07:20.000 norms, but from my experience it's usually L1, L 01:07:20.000 --> 01:07:29.000 2, or this elastic net, which is a combination of the 2. 01:07:29.000 --> 01:07:34.000 Eric is asking, can you explain why the red contour lines are shifted from
01:07:34.000 --> 01:07:42.000 one picture to the next, is that a mistake? Oh, yeah. 01:07:42.000 --> 01:07:55.000 I think they are not the same; I think the 2 in the middle are the same, and then the ones on the outside are different, like this level curve is different from this level 01:07:55.000 --> 01:08:06.000 curve, because they intersect at different places. Does that make sense? 01:08:06.000 --> 01:08:24.000 Yeah, I think that's why. So this level curve is different from that one, but the 2 in the middle, I think, are the same. There's another question: do people use the strategy of using lasso for feature selection, followed by going back to unregularized regression, to 01:08:24.000 --> 01:08:27.000 get a lower MSE? So I think that's something you could do. 01:08:27.000 --> 01:08:38.000 You could use lasso to figure out what features you might want to consider, do a cross-validation, and then just use unregularized linear regression. 01:08:38.000 --> 01:08:45.000 You could also compare it with the lasso version of the same model and see which performs better. 01:08:45.000 --> 01:09:00.000 But, like the earlier question, if you're in a situation where the people you're making the model for like to be able to see the direct interpretations, you might stick with the linear one, because you can get that. 01:09:00.000 --> 01:09:06.000 Yeah, also, so you were using the chart with the different alphas and x to the 1 through 01:09:06.000 --> 01:09:13.000 x to the k power values. Was it that you were choosing the features? 01:09:13.000 --> 01:09:18.000 Then what are, or how are, the features represented in this chart? 01:09:18.000 --> 01:09:25.000 So we are fitting this model, the tenth-degree polynomial, for each value of alpha, 01:09:25.000 --> 01:09:26.000 and so these are the coefficients of that model. 01:09:26.000 --> 01:09:36.000 So I use this model because, well, I just wanna do an aside, because every time I do this I think people assume that in the real world you're just always using polynomial regression. 01:09:36.000 --> 01:09:44.000 That's a bad assumption. I don't think you're typically doing stuff like this; 01:09:44.000 --> 01:09:47.000 this is just to make it easy. That's an aside from your question. 01:09:47.000 --> 01:09:58.000 So in the real world your columns would be the features, so thinking back to the problem 01:09:58.000 --> 01:10:03.000 session, like kilometres driven, age, that sort of thing. 01:10:03.000 --> 01:10:14.000 And so then what you'll do is you'll follow the coefficients on those different features, and see which ones tend to persist as you increase the value of alpha. 01:10:14.000 --> 01:10:23.000 Now, what I tend to do is powers of 10, and if that doesn't work I might switch it up and do like 0.25, 0.5, 01:10:23.000 --> 01:10:35.000 something like that. Does that answer your question? 01:10:35.000 --> 01:10:36.000 Yeah. Yup. 01:10:36.000 --> 01:10:40.000 Yes. So you're just saying that this is a specific example with polynomial regression, and I could still use the same process, right, of choosing the powers of alpha to see what features stick around. 01:10:40.000 --> 01:10:45.000 And so when I'm doing the powers of alpha, with the higher penalty,
01:10:45.000 --> 01:11:01.000 so alpha is giving me a higher degree of penalty, then the features that drop off are the ones that I choose, because those are the ones that are, 01:11:01.000 --> 01:11:05.000 or why do I choose the ones that drop off as alpha goes higher? 01:11:05.000 --> 01:11:07.000 So you want to choose the ones that persist. 01:11:07.000 --> 01:11:20.000 Okay, persist. Okay, okay. 01:11:20.000 --> 01:11:37.000 Yup. So Jonathan is asking, what is the thought process in deciding to underfit by exclusion versus underfit by ridge regression? The regularization is still biasing it, even though it doesn't force it to 0, right? So with ridge, basically, you could 01:11:37.000 --> 01:11:44.000 use ridge regression as your model, and if that performs better than regular linear regression and you just care about predictive power, do that. 01:11:44.000 --> 01:11:53.000 But the idea here is that you could not use ridge regression for feature selection. 01:11:53.000 --> 01:11:54.000 And so we're not necessarily saying that lasso is better 01:11:54.000 --> 01:12:08.000 because, look, all these go to 0. We're saying that lasso can be used for feature selection, because the things that stick around tend to be the features that are most important in making predictions. 01:12:08.000 --> 01:12:18.000 And so that's sort of the idea here. It's not that the lasso model is inherently better at providing predictions in comparison to ridge just because the ridge coefficients don't go away. 01:12:18.000 --> 01:12:24.000 Lasso is better for feature selection, because you can see that the important features tend to stick around the longest. 01:12:24.000 --> 01:12:25.000 So that's the idea there. Now, ridge may still give you a better predictive model, and it doesn't shrink 01:12:25.000 --> 01:12:47.000 everything to 0, and that's fine. It's just that lasso is sort of a unique regularization in the sense that you can see which features are most important to making predictions. 01:12:47.000 --> 01:12:50.000 Okay. And so maybe let's wrap up this notebook. 01:12:50.000 --> 01:12:59.000 We kind of talked about this a little bit: when might you use lasso versus ridge? So lasso has some pros; for example, this feature selection is really nice. 01:12:59.000 --> 01:13:10.000 It also works well when you have a very large number of features and it turns out, you don't know this ahead of time, 01:13:10.000 --> 01:13:15.000 but it turns out that, in actuality, a lot of these features don't have a ton of effect on the target, 01:13:15.000 --> 01:13:20.000 the thing you're trying to predict. Lasso turns out to be really good for these types of problems, because it will get rid of the ones that aren't important 01:13:20.000 --> 01:13:24.000 as you increase alpha. One problem is it can have trouble with something called collinearity. 01:13:24.000 --> 01:13:53.000 So collinearity is when one of your columns is highly correlated with another one of your columns. That can be difficult for lasso, because it will typically just choose one of those variables, and not necessarily the one that's providing the signal to y, so it 01:13:53.000 --> 01:14:05.000 can struggle with that.
Ridge regression is good when you have a target that does end up depending on a lot of, or all of, the features, and it works a little bit better with collinearity than lasso. One con of ridge regression is that it tends to keep most of 01:14:05.000 --> 01:14:06.000 the predictors in the model, as we saw in this example. 01:14:06.000 --> 01:14:18.000 So this can be computationally costly if the data set has a large number of features. So that's sort of the pros and cons of both. 01:14:18.000 --> 01:14:19.000 And again, some of these you won't know ahead of time, but you can keep them in mind. 01:14:19.000 --> 01:14:31.000 And then, as I mentioned earlier, there's something called elastic net; there's an example in the regression practice problems where you can learn about that. 01:14:31.000 --> 01:14:34.000 And as I also said, here are some references that kind of go over 01:14:34.000 --> 01:14:41.000 some of the mathy stuff. For the math ones, you'll probably want these PDFs at the bottom, would be my guess, 01:14:41.000 --> 01:14:48.000 and then maybe this one right here, on the constrained and unconstrained forms. 01:14:48.000 --> 01:14:52.000 Okay. So with 10 minutes left, give or take, I want to end by going over some feature selection approaches for linear regression models. 01:14:52.000 --> 01:15:01.000 And so I'm gonna go through what used to be a problem session, 01:15:01.000 --> 01:15:09.000 basically, and go through the data that we're gonna use, and then show you a couple of different approaches. 01:15:09.000 --> 01:15:19.000 This is gonna be largely already coded, just given the length of time, but, you know, feel free to ask questions at the end 01:15:19.000 --> 01:15:24.000 if you have questions about particular code chunks. 01:15:24.000 --> 01:15:30.000 So this is a synthetic data set called Carseats that comes from this really great book, 01:15:30.000 --> 01:15:37.000 An Introduction to Statistical Learning, and this summer I guess they have a Python edition, which is great. 01:15:37.000 --> 01:15:49.000 The current edition is for R, but I like this book a lot; even if you don't care about learning R, it's just good for theory and stuff, and I believe it's free, which is awesome. 01:15:49.000 --> 01:15:53.000 So they have this data set called Carseats, which has these columns. 01:15:53.000 --> 01:16:10.000 You're trying to predict sales, and they have columns like competitor price, income, advertising, population, price, shelve location, age, education, urban, and US. 01:16:10.000 --> 01:16:21.000 And so basically these are synthetic data. Each row represents a store that sells car seats; sales represents the amount in sales 01:16:21.000 --> 01:16:29.000 they're getting at that store, and then there are various columns about the different features of that store. 01:16:29.000 --> 01:16:33.000 So, for instance, for this store, what are its competitors' 01:16:33.000 --> 01:16:38.000 prices? What is the population of the area? What is the education level of the area? 01:16:38.000 --> 01:16:44.000 What is the average age of the population where the store is? So that's the idea of the data set. 01:16:44.000 --> 01:16:45.000 If you're interested to learn more, I've provided this link, 01:16:45.000 --> 01:16:50.000 and you can also just look at the book, which is free online.
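(A minimal sketch of the preprocessing and quick EDA steps described next. The file name, split settings, and random seed here are made up for illustration; the column names are the ones from the ISLR version of Carseats.)

```python
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

# hypothetical file name; the notebook loads its own copy of the data
carseats = pd.read_csv("carseats.csv")

# K-1 dummies for the 3-level ShelveLoc column (keep Good and Bad, Medium is the baseline level)
carseats["ShelveLoc_Good"] = (carseats["ShelveLoc"] == "Good").astype(int)
carseats["ShelveLoc_Bad"] = (carseats["ShelveLoc"] == "Bad").astype(int)

# the Yes/No columns become 0/1
carseats["Urban"] = (carseats["Urban"] == "Yes").astype(int)
carseats["US"] = (carseats["US"] == "Yes").astype(int)

# train-test split before any model comparisons
cs_train, cs_test = train_test_split(carseats, test_size=0.2,
                                     shuffle=True, random_state=440)

# quick EDA: Sales against a handful of candidate features
sns.pairplot(cs_train, y_vars=["Sales"],
             x_vars=["CompPrice", "Advertising", "Population", "Price"])
```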
01:16:50.000 --> 01:16:57.000 So I go ahead, and just as a first step I make some dummy variables. 01:16:57.000 --> 01:17:00.000 So the first ones are shelve location good and shelve location bad; 01:17:00.000 --> 01:17:04.000 shelve location takes bad, good, and medium as potential values. 01:17:04.000 --> 01:17:13.000 So this is just doing the K minus 1 dummies, and the other 2 columns can just be quickly turned into 0 or 1 from yes or no. 01:17:13.000 --> 01:17:27.000 Then I make my train test split. So one of the very first steps in feature selection, that you guys have gotten a lot of practice with in the last 2 problem sessions, is called exploratory data analysis. 01:17:27.000 --> 01:17:36.000 So here, if it's possible, like if you don't have too many features, you can just try and look at various plots 01:17:36.000 --> 01:17:42.000 and, you know, basic statistics to get a sense of what might be most important. 01:17:42.000 --> 01:17:51.000 So one that's useful is this pairplot function, and pairplot takes in the data, 01:17:51.000 --> 01:17:55.000 and, like you've seen in 01:17:55.000 --> 01:18:01.000 the problem session, you can specify things like what should go on the y-axis and what should go on the x-axis. 01:18:01.000 --> 01:18:05.000 Here we can see, we're interested in sales, 01:18:05.000 --> 01:18:10.000 and so we can see whether the sales seem to be related to these other variables. 01:18:10.000 --> 01:18:14.000 In addition, sort of looking at good, medium, bad: if we look here, there's a density plot on the diagonal, 01:18:14.000 --> 01:18:29.000 and so we can kind of get a sense from that. It does appear that maybe shelve location has an impact on sales, just looking at the distributions here, the densities. 01:18:29.000 --> 01:18:32.000 Here's just another one looking at the remaining 01:18:32.000 --> 01:18:47.000 variables. So here's an example of one that looks like it has a very negative correlation with sales, and that is price, which makes sense: the higher the price, the less likely people are to buy. 01:18:47.000 --> 01:18:51.000 And then here's just the final one, with the remaining variables. 01:18:51.000 --> 01:18:56.000 So, you've seen some examples of this: if you go through and look at the different plots, you might find, 01:18:56.000 --> 01:19:03.000 okay, things like advertising, population, price, shelve location, 01:19:03.000 --> 01:19:06.000 and whether or not the store is located in the United States potentially have an impact on sales. 01:19:06.000 --> 01:19:17.000 And so we might want to consider including them in the sorts of models we'd consider. 01:19:17.000 --> 01:19:26.000 So the first type of automatable, algorithmic approach to feature selection we're gonna look at is called best subset selection. 01:19:26.000 --> 01:19:36.000 So this is an algorithm that looks at every possible model, fits that model, and gets the error on that model for comparison. 01:19:36.000 --> 01:19:45.000 So, for instance, if we were only looking at competitor price and advertising, best subsets would fit and compare the following 4 models: 01:19:45.000 --> 01:19:48.000 the baseline, the one that regresses sales on comp price, 01:19:48.000 --> 01:19:52.000 the one that regresses sales on advertising, and the one that regresses sales on both comp price and advertising.
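(To make that counting concrete, here's a tiny illustration, not the notebook's own helper, of enumerating every subset of those two features with itertools; the empty subset plays the role of the baseline model.)

```python
from itertools import chain, combinations

def powerset(features):
    """All subsets of `features`, from the empty set up to the full set."""
    return list(chain.from_iterable(
        combinations(features, r) for r in range(len(features) + 1)))

print(powerset(["CompPrice", "Advertising"]))
# [(), ('CompPrice',), ('Advertising',), ('CompPrice', 'Advertising')]
```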
01:19:52.000 --> 01:19:59.000 And so we're going to fit each of those. 01:19:59.000 --> 01:20:01.000 So it would, you know, 01:20:01.000 --> 01:20:04.000 do something like cross-validation, 01:20:04.000 --> 01:20:06.000 get the average cross-validation MSE on all 4 models, and then choose the one with the lowest cross-validation 01:20:06.000 --> 01:20:16.000 MSE. So we're gonna do that with those features that I outlined above, that 01:20:16.000 --> 01:20:20.000 I'm highlighting right now, and I'm just gonna show you; 01:20:20.000 --> 01:20:23.000 it's already been programmed up, but I'm just gonna show you. 01:20:23.000 --> 01:20:27.000 So here's the linear regression; 01:20:27.000 --> 01:20:30.000 we've seen that before. So this is a function 01:20:30.000 --> 01:20:31.000 I always get questions about, and I forgot to look up what this little less-than less-than does. 01:20:31.000 --> 01:20:38.000 I don't remember what it does. So I got this original function; 01:20:38.000 --> 01:20:43.000 it's a slight adjustment of a function that's found here on Stack Overflow. 01:20:43.000 --> 01:21:00.000 It's essentially just going to take in a list of features, and then it will return the power set of that list, including 01:21:00.000 --> 01:21:01.000 the empty set. And so, for instance, if I did the comp 01:21:01.000 --> 01:21:10.000 price and advertising example, it gives out the list that has comp price in a list, 01:21:10.000 --> 01:21:11.000 advertising in a list, and then comp price and advertising in a list. 01:21:11.000 --> 01:21:23.000 And so we're gonna use this function to go through and produce all of the models that we're gonna fit. 01:21:23.000 --> 01:21:25.000 Okay. 01:21:25.000 --> 01:21:38.000 So first I make my k-fold object, then I get my power set of all the features that I'm considering. And then, because I have a categorical variable in here 01:21:38.000 --> 01:21:47.000 that isn't just a binary 0 or 1, I have to make an adjustment so that any time one of these subsets includes shelve location, 01:21:47.000 --> 01:21:52.000 I go through and include both shelve location good and shelve 01:21:52.000 --> 01:22:00.000 location bad. You can't have just one; 01:22:00.000 --> 01:22:01.000 you have to have both, because you need to include all possibilities for shelve location. Now I also, for the model that's sort of my baseline, 01:22:01.000 --> 01:22:06.000 just the expected value, add in the option of being a baseline. 01:22:06.000 --> 01:22:18.000 Then I make a holder that's gonna have all of my MSEs for my different splits and models. 01:22:18.000 --> 01:22:27.000 And then I do this for loop, where this is the traditional k-fold for loop, we've seen this before, and then I loop through all my potential models. 01:22:27.000 --> 01:22:44.000 The first is the baseline model, so when I do this one I just fit the mean of the training set, and then for the rest of them I make the linear regression model. And I don't know why it says that; that's old. 01:22:44.000 --> 01:22:52.000 So now it's going through, and it's fitting this.
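(A simplified sketch of what this best subset loop is doing, reusing the powerset helper sketched above and the cs_train split from the earlier preprocessing sketch. The candidate feature list, fold count, and random seed are assumptions, so the notebook's actual code and model count will differ. As an aside, the less-than less-than in the Stack Overflow version is presumably Python's left bit-shift operator: 1 << n equals 2**n, the number of subsets of n features.)

```python
import numpy as np
from itertools import chain, combinations
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

def powerset(features):
    return list(chain.from_iterable(
        combinations(features, r) for r in range(len(features) + 1)))

# assumed candidate list; ShelveLoc stands in for its pair of dummy columns
candidate_feats = ["CompPrice", "Advertising", "Population", "Price", "ShelveLoc", "US"]

def expand(feats):
    """Swap ShelveLoc for its two dummies so they always enter the model together."""
    cols = []
    for f in feats:
        cols += ["ShelveLoc_Good", "ShelveLoc_Bad"] if f == "ShelveLoc" else [f]
    return cols

models = [expand(s) for s in powerset(candidate_feats)]   # the empty list is the baseline

kfold = KFold(n_splits=5, shuffle=True, random_state=440)
mses = np.zeros((5, len(models)))

# cs_train is the training DataFrame from the preprocessing sketch above
for i, (tr_idx, ho_idx) in enumerate(kfold.split(cs_train)):
    tr, ho = cs_train.iloc[tr_idx], cs_train.iloc[ho_idx]
    for j, cols in enumerate(models):
        if not cols:   # baseline model: predict the training-set mean
            pred = np.full(len(ho), tr["Sales"].mean())
        else:
            reg = LinearRegression().fit(tr[cols], tr["Sales"])
            pred = reg.predict(ho[cols])
        mses[i, j] = np.mean((ho["Sales"].values - pred) ** 2)

best = int(np.argmin(mses.mean(axis=0)))   # model with the lowest average CV MSE
print(models[best], mses.mean(axis=0)[best])
```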
And I forgot to run that code. 01:22:52.000 --> 01:22:59.000 Here we go. And so now you can use something like argmin to find, all right, which one had the lowest cross-validation 01:22:59.000 --> 01:23:02.000 MSE, and it was the 110th model. 01:23:02.000 --> 01:23:10.000 So we fit a lot of different models here. And so, after running that, you can go through and see, okay, the model with the lowest average cross-validation MSE 01:23:10.000 --> 01:23:12.000 had these features, and had an average cross-validation 01:23:12.000 --> 01:23:19.000 MSE of this. 01:23:19.000 --> 01:23:25.000 And I don't know why this is here; this must just be left over from, yeah, that's nothing, 01:23:25.000 --> 01:23:30.000 ignore that. Okay, so are there any questions about best subsets? 01:23:30.000 --> 01:23:35.000 And it looks like everybody has chimed in to explain what the less-than less-than does, so 01:23:35.000 --> 01:23:39.000 thank you, Terry, who did that. 01:23:39.000 --> 01:23:44.000 I mean, are you just more likely to get a lower MSE 01:23:44.000 --> 01:23:47.000 if you use more features? 01:23:47.000 --> 01:23:56.000 So, not necessarily. There can be models where, if you put in bad data, like if you put in features that aren't at all related to the target, right, 01:23:56.000 --> 01:23:59.000 remember, this is sort of that 01:23:59.000 --> 01:24:07.000 bias-variance trade-off, right? The more features you include in your model, the more likely you are to overfit the model, meaning your generalization 01:24:07.000 --> 01:24:08.000 Right. 01:24:08.000 --> 01:24:10.000 error will go up! 01:24:10.000 --> 01:24:14.000 Oh! So this captures that! 01:24:14.000 --> 01:24:15.000 Okay. 01:24:15.000 --> 01:24:16.000 Yeah, right? Because it is the case that the more features you include, the more likely you are to have a better fit on the training data. 01:24:16.000 --> 01:24:24.000 But remember, we're looking at the error on the test data, 01:24:24.000 --> 01:24:25.000 okay, 01:24:25.000 --> 01:24:28.000 or not the test data, the holdout set from the cross-validation. 01:24:28.000 --> 01:24:36.000 The holdout, yeah. 01:24:36.000 --> 01:24:40.000 Okay. So another thing that you might have noticed is we fit a lot of models for this. 01:24:40.000 --> 01:24:44.000 That's because best subset fits 01:24:44.000 --> 01:24:54.000 every possible model, which can be a lot of models; I think it's something like 2 to the m models, and that's just a lot of models. 01:24:54.000 --> 01:25:04.000 So there are things called greedy approaches; they're called greedy because, at each point where they can make a choice, 01:25:04.000 --> 01:25:05.000 they make the choice that looks best at that moment. 01:25:05.000 --> 01:25:15.000 So the 2 that you can consider are called forward selection and backward selection, and they're the same basic idea. 01:25:15.000 --> 01:25:40.000 So basically, in forward selection you're starting with the baseline and then slowly adding in features. Step 0 of forward selection is: you fit the baseline model and get the average CV MSE. Then you go through, and for each of the m possible features you fit that simple 01:25:40.000 --> 01:25:44.000 linear regression model and calculate the average CV MSE. 01:25:44.000 --> 01:25:49.000 If none of them outperform the baseline, then you just stop.
If one of them does, 01:25:49.000 --> 01:25:50.000 if there are some that do outperform the baseline, you choose the one that performs best; that's your new default. 01:25:50.000 --> 01:26:06.000 And then for the remaining steps you basically just repeat: the second time through, for each of the remaining m minus 1 features not in your model, 01:26:06.000 --> 01:26:10.000 you fit the regression model that includes that feature, 01:26:10.000 --> 01:26:13.000 calculate those MSEs, and find the one that does best. 01:26:13.000 --> 01:26:23.000 If it's the current model, you stop; if it's not the current model, you have a new default model, and you go through and try fitting in the remaining features. 01:26:23.000 --> 01:26:41.000 And so basically forward selection only stops when you either (a) have the model that includes every feature, or (b) find a model that adding another feature to will not improve. So that's the idea behind forward selection. Backward selection is sort of the 01:26:41.000 --> 01:26:49.000 backwards approach. In backward selection you start with the model that includes everything, and then slowly remove features, 01:26:49.000 --> 01:26:59.000 one at a time, seeing if you outperform whatever your current default is, and doing the same sort of stopping process. 01:26:59.000 --> 01:27:09.000 So in backward selection, you're either gonna end with the baseline model or a model that has removed some of the features. 01:27:09.000 --> 01:27:23.000 So Erit is asking, can we automate these in a pipeline? So these can be automated; like, for instance, you'd have to do some sort of for-loop type thing, 01:27:23.000 --> 01:27:24.000 but it can be automated. Forward and backward selection can be coded up like this; 01:27:24.000 --> 01:27:32.000 it's not something you have to do by hand every time. 01:27:32.000 --> 01:27:44.000 You might use a pipeline as a part of your steps, but I don't know off the top of my head whether there's, within sklearn, a forward selection or backward selection sort of object. 01:27:44.000 --> 01:27:47.000 There might be; I just don't know. 01:27:47.000 --> 01:27:50.000 And then, to close out the notebook, you can also 01:27:50.000 --> 01:28:05.000 just use lasso. And so here's the example where I have a lasso looking at all the features, and then you just track the different alpha values and see which ones stick around. So price sticks around the longest, 01:28:05.000 --> 01:28:15.000 and then the shelve locations, I think, stick around for a decent amount of time, so you'd probably consider those at the same time. 01:28:15.000 --> 01:28:22.000 Let's see, what is this? Advertising sticks around, and age sticks around. 01:28:22.000 --> 01:28:26.000 So you might consider using these, seeing how they do with a cross-validation in comparison to other models. 01:28:26.000 --> 01:28:28.000 So that's sort of the idea with lasso. 01:28:28.000 --> 01:28:33.000 Okay, so we are over time; I apologize for that. 01:28:33.000 --> 01:28:45.000 We had a lot of questions, so I think that pushed me a little bit over. I'll stick around for anyone who still has questions, but otherwise have a good rest of your evening; feel free to leave. 01:28:45.000 --> 01:28:46.000 And yeah, I hope to see you tomorrow, where we're gonna start time series.
01:28:46.000 --> 01:28:52.000 So in tomorrow's lecture we'll be talking about time series. 01:28:52.000 --> 01:28:56.000 Alright!