Okay, hitting record. Alright, welcome back everybody. Today is day 3 of lectures for the May 2023 Erdős Institute boot camp. Today we're going to start learning about supervised learning, and in particular dive into regression. Remember, you can find the lectures in the repository if you have the GitHub repo cloned. Once you open up your Jupyter environment, we're going to go to lectures, and we're going to start in the supervised learning folder.

In there we're going to skip the introduction notebook; that's just a notebook that lays out what we're going to talk about. We'll dive straight into the supervised learning framework notebook first, then after that we'll look at data splits, and then after that we'll dive into regression.

In this notebook I just want to lay the groundwork for what supervised learning is, and set up a common framework that any supervised learning problem can be placed into. From there we'll branch out and actually start to learn algorithms.

So, supervised learning. The idea here is that you have data, X and y. X is a collection of features: think of these as inputs, data you have about your observations, that you think you can use to predict the output, which we'll store in a vector called y. y can be continuous, it can be categorical, it can be binary; it's something we'd like to predict using the data stored in a matrix X, an n by m matrix: n rows, which is the number of observations, and m columns, which is the number of features. This will become more clear as we dive into the algorithms.

The framework for supervised learning is that we assume (it may not be true, but we're going to assume) that the output we're interested in is equal to a function of the inputs plus some random noise. So f is a function from the m-dimensional reals, all of those features, down to the real numbers, and it's what we're trying to estimate. The idea being, once we estimate f, if we do a good enough job, we can predict what various values of y would be given the inputs. f(X) is also known as the systematic information that X gives about y, or it's sometimes referred to as the signal that X is providing about y. And then epsilon is random noise that we think of as independent from X; that's one of our assumptions, that the random noise is independent of the observation.
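To put the framework just described in symbols (a light formalization of the spoken description, using the lecture's n-observations-by-m-features convention):

```latex
% X \in \mathbb{R}^{n \times m}: n observations (rows) of m features (columns)
% y \in \mathbb{R}^{n}: the outputs we want to predict
y_i = f(x_i) + \varepsilon_i, \qquad f : \mathbb{R}^{m} \to \mathbb{R},
\qquad \varepsilon_i \text{ independent of } x_i .
```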
The shape of that random noise, its distribution, depends on the problem you're working on. I think it's easiest to understand this with an example. We're going to assume that X is a single one-dimensional vector, a column vector, and that the relationship is simply linear: y equals x plus random noise, where the random noise is normally distributed.

You're going to see some code here that might not make sense right now; we're going to dive into it more deeply later. It's mostly for making graphs and illustration purposes, and hopefully it will become more clear when we dive into specific algorithms. So for now don't worry about understanding the code you're seeing, just worry about the concepts.

Okay, so basically what we're saying is that in the real world there's this y that is equal to f(x); this is the systematic information. Somewhere out there, there's a variable y and a variable x, and in nature they are related in this way, ignoring any sort of random perturbation. The idea here is that we don't know what this relationship is ahead of time. But what we can do is collect data. So we go out into the world and, let's say, we collect a hundred observations; in the notebook this is simulated with random draws. Graphically, what we're thinking of is that in the background there's this black line, which is the true relationship, and the observations represent the random deviations from the true relationship that will always occur in nature.

These blue dots are the things that we have, and we'd like to use these blue dots to make an estimate of the black line, the true relationship. Typically we'll use some sort of algorithm for this; in this particular notebook, that algorithm is linear regression. So we use that to make an estimate. Basically, what we're saying is: using these blue dots, we made this estimate, which is the red solid line, and our hope is that the red solid line, our estimate, is a good approximation of the real-world relationship represented by the black line. Here "good" tends to mean that the estimate is close, in some sort of distance metric, to the actual relationship; I know that doesn't sound very definitive. And so this is the process of supervised learning: you assume that there is some true relationship, you go out and observe some data to estimate that relationship, and the hope is that the estimate is close to the true relationship.
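A minimal sketch of the kind of simulation the notebook is running here, assuming the true relationship y = x from the example; how x is sampled, the noise level, the seed, and the plotting details are illustrative rather than the notebook's actual code:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(216)

# "Nature": the true relationship is y = x, plus normally distributed noise
x = rng.uniform(0, 1, 100)            # 100 observations of a single feature
y = x + rng.normal(0, 0.2, 100)       # noise level is illustrative

# Estimate the relationship from the observations (blue dots -> red line)
slr = LinearRegression()
slr.fit(x.reshape(-1, 1), y)          # sklearn expects a 2D feature array

xs = np.linspace(0, 1, 50)
plt.scatter(x, y, label="observations")
plt.plot(xs, xs, "k", label="true relationship")
plt.plot(xs, slr.predict(xs.reshape(-1, 1)), "r", label="estimate")
plt.legend()
plt.show()
```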
Okay, so Ashley's asking: how did I get the true relationship? This was just a hypothetical; this is not real-world data. In the real world, with real data, you're basically never going to know the real relationship, so you're not going to have the ability to graph this true-relationship line. But in this imaginary world, where I'm demonstrating how supervised learning works, I can pretend that I know it ahead of time, just to demonstrate the process of supervised learning.

So in supervised learning there are two main modeling goals: making predictions and making inferences. Making predictions means that you want to produce your estimate, your algorithm, so that the predictions you make are as close as possible to the real-world observations you're going to see. Here you're maybe less concerned with trying to understand the true relationship between things, and more concerned with producing models that make really good predictions. We'll talk about what it means to be a good prediction as we dive into the course.

The other goal of supervised learning techniques is to make inferences. The difference between a prediction and an inference is that with an inference you're really trying to understand the relationship itself, and then describe it in a way that is useful: describe how changes in X impact values of y. Here the best estimate is typically the model that explains the variance in y while still being parsimonious. This is more of a statistics point of view; basically, you're trying to explain the y given the X, whereas with predictions you don't actually care whether you're able to explain how different values of X impact values of y, you just care about making good predictions.

Sometimes these turn out to be the same model: the model that makes the best predictions is also the model that allows you to most easily make inferences. But it's not always the case that the two are one and the same. An example comes from a competition years ago called the Netflix Prize, where Netflix put up a large prize, a million dollars or something like that, for whichever team could improve their recommendation algorithm by a certain amount. The team that won ended up using a model that predicted better over a model that explained better. They had two different models: one of them was better at explaining why people liked certain movies or television shows more than others, but it was not as good at making predictions as the one that couldn't be used to make those explanations. So in this boot camp we're going to focus on the making-predictions side of things,
because that's not touched upon as much in classical statistics courses. We'll leave it to you to learn about making inferences by referring back to more standard statistics texts and that sort of thing.

Okay. So before we move on to actually doing things with code, are there any questions about this idea of the supervised learning framework?

Okay. So with all of that being said, we're actually going to stay slightly abstract before we learn an actual algorithm, and I want to talk about a concept known as data splits for predictive modeling. Nope, wrong one, I want the lecture copy. Here we go. So there's this idea of splitting your data that you have to do in order to make predictive models, and that's what we're going to talk about here. We're going to talk about why you want to split your data set into smaller data sets, and then we're also going to talk about the three different types of data splits that you'll be making as you train models and try to find the best model.

Okay. For a lot of the notebooks, I think almost all of them, I'll be importing a series of packages at the very top. These are numpy, often pandas, matplotlib, and then from seaborn the set_style function, so I can add a white grid to make things easier to see. I'm going to use these in most of the notebooks to generate data and to plot data, so you'll always see a code chunk like this at the top; that's just because I'm going to be handling data a lot.

So I think a reasonable question is: why the heck would I want to split up my data set? I only have one of those, and I might want to use all of it. Let's imagine we're doing a predictive modeling project. We go out and randomly collect some data, say n observations of m features that have n corresponding outputs, (X, y). The goal with any predictive modeling project is to build a model that has the lowest generalization error. Here, generalization error is defined to be the error of the model on a new, randomly collected data set, meaning a data set it was not trained on. So if we fix a hypothetical new data set (X*, y*), then, since the data we collected originally was randomly collected, we can think of the generalization error of any particular model we're training as a random variable. Generalization error, again, meaning the error of the model on this new data set. When we say error, in regression it's going to be something called the mean squared error;
in classification it might be something like the accuracy. Either way, whatever it is, we're going to call this random variable capital G; that's the generalization error. The best model for predictive modeling purposes is the one that has the smallest capital G. So it would be nice if we could know something about capital G and its distribution. But if we use all of the data we collected out in the world to train our model, it's impossible for us to get an estimate of this capital G with the data we have in hand. Typically in the real world you collect as much data as you can, and then, either because of budget or logistical reasons, it's not practical to go out and collect additional data to then test your model on. So usually there's some sort of limitation on the data you're able to collect to train a model and then test that model's performance.

What you'll do instead is create data splits: one part will be set aside for training the model, and the other part of the split will be set aside for testing the model's performance, to simulate the process of going out, getting new data, and calculating this generalization error. That's the idea. So we're going to talk about three different splitting strategies that people use in data science and machine learning. But before we do that, does anybody have questions about our rationale for why we would want to do a data split?

Okay. So the first split type we're going to talk about is called the train test split. I know we just had this very long explanation about why we want to do a data split so we can estimate this G, and the train test split is sort of for that, but: when you make this split, you're going to be setting aside a small portion of the data, oftentimes 10, 15, 20, or 25% of your data, and then you don't touch that data until you've already selected a best model. The smaller part of your data set, that 10 to 25%, is called the test set, and the purpose of this set is to serve as a final stopgap test before you go out, take the model you've selected, and deploy it in the wild. What can happen (maybe not often, but it can happen) is that you could be working on a model, you think it's a really great model, but it turns out that maybe there was a typo in your code, or there was some data leakage from whatever process you were using, and so you're erroneously choosing not the best model.
This test set acts as a sanity check before you then, if you're working in industry, go out and potentially waste a lot of money and resources on a model that's not very good, whether because of a typo in your code or something like that. So the test set you set aside until the very end, after you've selected your model. The training set is then all the data that's left over, which you're going to use for the process of training your models and making model selections.

Here's an illustration of this. The data split is done randomly. This green rectangle represents all of the data we've collected through our sampling. Then we do a random sampling so that some portion, the larger portion of the data, gets set aside as the training set, which we're going to use to train and compare our models. And then over here the smaller portion, again 10, 15, 20, or 25%, gets held out until the final model is chosen. This is done randomly in general; we'll learn in a later notebook that there are sometimes problems where this randomness has to be relaxed, or the way we make our split has to be a little more prescribed than just random, but we'll come to that when we get to those notebooks.

So Kirthan's asking: you said you do not touch the test set until the best model is selected, but how do you select the model if you don't have the generalization error? These are great questions. The test set, and this is a potential point of confusion, is not used to directly estimate G. There are going to be other splits that are used for that. The test set is sort of a final check on your chosen model. So let's say you do an eighty-twenty split: that 20% is set aside until you've already selected a model using other methods, and then you use it as a sanity check, just to confirm that its performance is not wildly out of line with what you observed from your other forms of testing.

Aziz is asking: if we do data augmentation, should we do the splitting after that, or should we do data augmentation after we have the training set? It depends on the preprocessing you're doing. For instance, let's say you have a column that you want to apply a log transform to, so you just want to take the natural logarithm of that column; that you're able to do at the very beginning, before the split. But other things, like scaling and imputation, as well as preprocessing steps like PCA,
those you do have to do after you've made the train test split, because you cannot allow the test set to influence the things that need to be fit in those processes; that would be called data leakage. We'll talk about those more specifically once we get to those techniques. But that's a great question. Are there any other questions about the train test split?

Okay. So how do I make a train test split? One way you could do it is with either the random or the numpy.random packages, by hand, but that is tedious. So sklearn, which stands for scikit-learn, has a function called train_test_split, and that's what we're going to use. Let's imagine we have an imaginary data set where X is a thousand observations of 10 features, and y is a thousand observations of the output. Now we're going to use this function called train_test_split. In order to use this function we have to import it: from sklearn.model_selection, which is where train_test_split is stored (which we can see by looking at the documentation link I've provided here), we will import train_test_split.

Okay, so once we have that, we're going to run train_test_split. I've already started the code for that here, just to save myself time, but it might make more sense to do it like this. In train_test_split you first input your features, X, then you follow with your outputs, y. The next thing I usually do is set the argument called shuffle equal to True; shuffle just ensures that the data is randomly shuffled before the split is made. It might do this by default, but I always just like to make sure, because I want my split to be random. Then you can also specify the size of the split with an argument called test_size: you can either put in a number of rows, like 200, or you can put in a fraction, so if I put in 0.2 this will specify that 20% of the data set should be set aside as the test set. And then the last thing you might want to do is provide a random state. The random_state argument is a positive integer; I could make it anything I want, so maybe 4289. This integer sets the random state that is used to generate the split, and it ensures that if I use the random state 4289 and you use the random state 4289, then when we both run the code we'll get the same random split.
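Putting the call together, a minimal sketch of what was just described; the X and y here are random stand-ins for the imaginary 1000-by-10 data set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stand-in for the imaginary data set: 1000 observations of 10 features
X = np.random.rand(1000, 10)
y = np.random.rand(1000)

# 80/20 split, shuffled, reproducible because of the fixed random_state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=True, test_size=0.2, random_state=4289
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# (800, 10) (200, 10) (800,) (200,)
```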
So the splits are made randomly, but by specifying the random state you ensure that every time you run the code you get the same split.

Let me show what this puts out. You can see that we get a list, and inside of that list are arrays: 1, 2, 3, 4 arrays. Those arrays are the training split of the features, followed by the test split of the features, followed by the training set of the outputs, followed by the test set of the outputs. So when I take this, copy it, paste it here, move these over so they're closer, and then run it, it will store all of these, and you can check the shapes. We did an 80-20 split, right? So X_train should have 800 rows, X_test should have 200 rows, y_train should have 800 observations, and y_test should have 200.

So Mitch is asking, does random_state have a default value? No. If you do not specify a random state, I believe it just uses your computer's internal clock at the time the code was run as the pseudo-random starting state, so it will be different every time. That's why you typically want to include a random state. And in practice you don't always want to use the same random state throughout all of your coding; maybe that leads to some weird behavior where your particular random state happens to provide better performance on the training sets, so try to switch it up every now and then.

Aziz is asking: if y is categorical, is there a quick way to guarantee that the train test split keeps the distributions of the classes approximately the same in the train and test sets? Yes, and we are going to learn that when we get to classification. It's called a stratified train test split, and there is an argument in train_test_split to do that.

Are there any other questions?

Yeah, I have a question. So when we are training our model, is it common practice to, instead of using just one train test split, do it multiple times and see whether your outcomes are consistent across all those train test splits, without a random state?

So I think people don't typically do that in the sense of making, let's say, 10 train sets and then training on each one separately. So, yes and no, if that makes sense. There is a thing we're going to learn in this notebook that's basically, in essence, the same as what you're saying:
00:25:29.000 --> 00:25:37.000 It's called a cross validation, and so it will take your training set split that into 10 sets, and do sort of the model fitting, and then you take an average we'll talk about that, and just a little bit. 00:25:37.000 --> 00:25:47.000 But like typically that is done on, it gets confusing, because that's done on the training set. 00:25:47.000 --> 00:25:53.000 And then the test set is still. You don't touch it until the very end. 00:25:53.000 --> 00:25:55.000 Yeah, okay, so I'll leave this exercise for you to try and complete like later tonight or tomorrow. 00:25:55.000 --> 00:26:05.000 So we're gonna talk about now that we have the trained test split settled. 00:26:05.000 --> 00:26:11.000 We're gonna talk about 2 types of splitting that are then used for model comparison and selection. 00:26:11.000 --> 00:26:21.000 So the 2 types. The first one is called a validation set, so a validation set is basically just a retread of the trained test split. 00:26:21.000 --> 00:26:26.000 But now you will configure your models. Performance on the validation set, and see how like which one compares better. 00:26:26.000 --> 00:26:39.000 So this is typically done when you have a model that takes a very long time to train, and you don't want to wait for the next thing, we'll learn about which is cross validation. 00:26:39.000 --> 00:26:48.000 Or if you have a data set where it's a very small data set and you don't have enough data to split it multiple times, we'll talk about that again before we end the notebook. 00:26:48.000 --> 00:26:52.000 So schematically. What this looks like is, you start off with a picture we had before, where the entire data set is split into a train set, and then a test set. 00:26:52.000 --> 00:26:59.000 And then that training set is further randomly split into a validation set, and then a smaller training set. 00:26:59.000 --> 00:27:09.000 So this smaller training set is the one that's then used to train the models. 00:27:09.000 --> 00:27:14.000 And then this validation set is used for the models, and then this validation set is used for the estimate. 00:27:14.000 --> 00:27:32.000 The estimating of the capital. G. So like I might train 4 or 5 different models here, calculate the error on the validation set, and then find the one that has the best error so in practice, this is done exactly the same way with train test split but instead of X comma Y it 00:27:32.000 --> 00:27:39.000 would be X underscore trained? Y underscore train, and then you just provide all the remaining arguments. 00:27:39.000 --> 00:27:45.000 So shuffle equals true test size equals point. 00:27:45.000 --> 00:27:52.000 Let's say point 2, and then random States is equal to. 00:27:52.000 --> 00:28:13.000 Let's do 2, 3, 2. Okay, so are there questions about the validation split? 00:28:13.000 --> 00:28:14.000 Sorry I've got. 00:28:14.000 --> 00:28:20.000 Sorry go ahead. 00:28:20.000 --> 00:28:30.000 Okay. So I was just gonna ask, like, it might happen that you have minimized G over all your validation over your validation set or cross validation settings. 00:28:30.000 --> 00:28:38.000 When you like doing. 00:28:38.000 --> 00:28:43.000 I said the address in your testing set is still like high. 00:28:43.000 --> 00:28:56.000 So what do you? 00:28:56.000 --> 00:29:03.000 And do like repeat the process by changing your you know what I mean. Like, the yeah. 00:29:03.000 --> 00:29:06.000 Like, how would you do that when? 00:29:06.000 --> 00:29:20.000 Yeah, yeah, so that can happen. 
So basically, if you have a model that, like I said earlier, takes a really, really long time to train (sometimes you'll have models that take a whole evening to train), and those are the models you're comparing, you'll typically just go with the one that performs best on the validation set, as long as the test set final check didn't reveal anything wrong with the model. Meaning you didn't have a typo, and it isn't performing unexpectedly, something like that.

So I get it. Even though it increases the overall error on the test set a little bit more than on the validation set?

Yeah. Because these are different sets, the errors will be different, and so it could possibly be the case that you perform worse on your test set than on your validation set. Unless there's a huge difference that you can't explain by looking at the data and saying, okay, my test set had this one outlier and that's why it's performing way worse...

You would still go with that model.

Yup! And there was another question, or did you have the same question?

Yeah, you've got it addressed. Thank you.

Okay, yup. Okay. So the other process that you'll do, which is ideal (if you can do this, you would prefer to do this), is K-fold cross-validation. If you've taken statistics courses, you might have seen a variation of this called leave-one-out validation, or something like that, where you go through and train on everything except for one observation, which is left out. This is the same sort of process, and if you haven't heard of that, you're going to learn about cross-validation now. So the validation set approach, in essence, coming from a frequentist statistics point of view, gives you a point estimate of G. Instead of having a single estimate of G, where maybe model one is just randomly better than model two on this particular set even though model two is better on average, it would be ideal to know something about the distribution of G, as opposed to just a single estimate. Now, it's difficult to get an idea of the distribution of G,
but what we can do is try to leverage a rule from probability called the law of large numbers. As a reminder of what that says: if you have a sequence of independent, identically distributed random variables with some true mean mu, then the arithmetic mean of them, meaning you add them all up and divide by the total number of them, will approach mu in the limit as N goes to infinity. Basically, what this is saying is that you can use the arithmetic mean to estimate the expected value of the distribution. So you can use a sequence of estimates of this generalization error (we'll talk about how you can get that in a second) to try to estimate the expected value of the error. And that's the idea here with K-fold cross-validation: we're going to try to create a sequence of estimates that we then average together.

So how do we do this? You take your training set and randomly split it into K different smaller subsets; the K in this example is 5. Then, after you do that split, you sequentially go through: train on 4 of the 5 (K minus 1 of the K), and then get the error on the one that's left out. Then you sequentially go through and change which one is left out. That will give you, in this case, 5 estimates of that error, which you can then average together to try to get an estimate of G. Common values of K are 5 and 10, but you could use different values if you'd like. The idea is that by doing this process we can imagine we're getting random draws of the generalization error as a random variable, which we can then use, via the law of large numbers, to get some estimate of the expected value. It's not a perfect estimate, because 5 is not a large sample, but we're limited by the size of our data and how long model training takes.
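In symbols (a light formalization of the argument above; the notation G_i and the hat-G for the cross-validation estimate is mine, not the notebook's):

```latex
% Law of large numbers: for i.i.d. random variables G_1, G_2, \dots with mean \mu,
%   \frac{1}{k}\sum_{i=1}^{k} G_i \to \mu \quad \text{as } k \to \infty.
% K-fold CV treats the k held-out errors as approximate draws of G and averages them:
\hat{G}_{\mathrm{CV}} \;=\; \frac{1}{k}\sum_{i=1}^{k} G_i,
\qquad G_i = \text{error on fold } i \text{ for a model trained on the other } k-1 \text{ folds}.
```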
Okay. So before we talk about how to implement this, are there any questions about it?

Quick question: is this preferred over something like bootstrapping, where you're sampling with replacement from the set?

So bootstrapping is sort of a different thing, and Jason is asking in the chat whether the left-out validation sets can overlap, so I'll combine the two questions. With bootstrapping the samples would overlap, but in cross-validation they don't, and the idea there is to prevent leakage from your left-out set into the training set across the different estimates. So here your left-out set is guaranteed to be different from estimate to estimate.

Okay, so this is a more robust estimate of the error than bootstrapping, which also gives an estimate of error?

Yeah. I just know that, in practice, this is what's done. I think bootstrapping will sometimes be done for estimating parameters, as opposed to estimating the error of the model.

Ashley's asking, is each of these columns of the data the same data? Yes. This is the same training set: each column in the figure is the training set, just broken down into the splits, and it's demonstrating that each time you go through, you leave out a different one of the splits.

Brantley is asking: to apply the law of large numbers we need G_1 through G_5 to be independent samples from the true distribution of G, and these seem like they would be dependent if we generate them like this. So, it's not a perfect one-to-one with going out and getting new samples; we can't do that. We're limited: we went out and got our observations, the green "all data" bar, and we are unable to go out and get additional observations. We don't live in the ideal world where we can just conduct experiments until we have enough data to make estimates. So with the restrictions we have from our initial sampling, this is our best approximation of i.i.d. random variables that we can then use to get an estimate of the expected value. This is not a perfect repetition of the statistical process we may have learned in stats class; it's trying to do our best to replicate that, to get an idea of what we might expect the generalization error to be.

Question: you mentioned knowing something about the distribution of the generalization error, as opposed to just having a single estimate for it. I don't know whether it's ever feasible to get anything about the distribution other than the mean, but if so, would you ever care about it? Like, normal versus something else; does that factor into model selection?
So, you could compare the distributions. Let's say hypothetically you could get an estimate of the distribution; you could compare the two and see which you'd want. The cross-validation estimate is kind of like a mean, but a distribution is not defined by its mean, right? If you had a bimodal distribution, the mean would be in the middle, but you'd never actually observe the mean. So basically, you could compare the two distributions and then you'd want to choose the one that has a higher probability of giving the better error.

Ricky's asking, if your training set is really large, would you benefit from a larger K value? I think, as long as you can guarantee that your holdout sets have enough observations to be a good measure of the error, then yes, a larger K. But again, it's not just about that; you also have to consider the computation time. The larger K is, the more times you have to fit the model and then calculate the error, and a lot of the time the models you choose may take a while to train. So that's another consideration. From what I understand, in practice you tend to stick to 5 or 10. You can try different values if you'd like, but if you let K get too large you're spending a lot of time training your model.

I have a question. So with cross-validation, there's a pretty good chance that the mean error here will be much higher than what you would find on your training set after you select the model. Is that a common problem, and if that's the case, how would you go about dealing with it? Is K-fold cross-validation really effective in that case, or is it better to move on to a different way of splitting the data?

Yeah. So in predictive modeling we don't really care about the performance on the training set, because we already know the labels for those observations. What we want is to produce a model that is good at guessing, or predicting, what the labels would be on data we don't know. In this case we're pretending we don't know the left-out sets (we do know them), but the point is we want a model that's good at predicting the labels for things we don't know ahead of time. We could have a model that's perfect on the training set, but if it's unable to make good predictions on things it doesn't know the label for, it's useless to us. So it's typically going to be the case that your models perform better on the training set than on these holdout sets, or on the validation set, but that's fine.
The point is to find the model that performs the best on these left-out sets, or on the validation set if that's the approach you take, because that's your goal with predictive modeling: to make the best predictions. With predictions we're assuming we don't know what the labels are; here we're just taking advantage of the fact that we do have the labels to get a sense of how good our predictions are.

Thank you.

Yeah. Okay, so for the sake of time I'll cut off questions there, so we have enough time to finish the other two notebooks.

So how do we do K-fold cross-validation? sklearn has an object for that: in model_selection there's the KFold class. So we'll do, from sklearn.model_selection, import KFold, capital K, capital F. The way this works is that you first create what's known as the KFold object. You put in the number of splits, n_splits, and here we'll do 5. The next thing you want to do is say that you do want it to randomly shuffle the data, so shuffle equals True, and, just like with train_test_split, you can specify the random state, so why don't we do 758. So now we have... oh, what did I do? Oh, n_splits, not splits. There we go. Okay, so now that we have that, when we want to go ahead and make the 5-fold cross-validation split, you'll call kfold.split, and then you put in the features first, followed by the outputs. Here I'm going to demonstrate what you get: this is what's known as a generator object, and in Python those have to be iterated through with something like a for loop. So you write a for loop, and what comes out are indices, indices of either the array or the DataFrame, whatever you put in; you'll do train_index first, because the train index comes first, followed by test_index, for each split. I'm going to run this and you'll see what you get: first you get a list of the indices of the training set, and then you get a list of the indices of the test set. We'll focus on just showing that the splits are in fact different from one another: you've got 6, 7, 9 in the test set here, and then, if you look here, you can see that 6, 7, and 9 are now in the training set. So this will give you the train and test indices for all 5 splits.
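A minimal sketch of the KFold usage just described, carried through the fit-and-record loop that the lecture turns to next; the LinearRegression model and mean-squared-error scoring inside the loop are illustrative choices, not prescribed by the lecture:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.random.rand(1000, 10)   # stand-in data
y = np.random.rand(1000)

kfold = KFold(n_splits=5, shuffle=True, random_state=758)

holdout_errors = []
for train_index, test_index in kfold.split(X, y):
    # subset to get the training fold and the held-out fold by index
    X_tt, X_ho = X[train_index], X[test_index]
    y_tt, y_ho = y[train_index], y[test_index]

    # fit on the k-1 training folds, record the error on the held-out fold
    model = LinearRegression()
    model.fit(X_tt, y_tt)
    holdout_errors.append(mean_squared_error(y_ho, model.predict(X_ho)))

print(np.mean(holdout_errors))  # cross-validation estimate of the expected error
```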
In practice, then, what you would do is: each time through the loop, you subset to get the training set for that split and the holdout set for that split (up in the graphic above, those are the pieces labeled "leave out"), and then you fit the model and record the error on the holdout set in some kind of array that you can look at later.

Okay. So Jacob has asked, how is model_selection.KFold different from model_selection.cross_validate? I've only ever used KFold, so I don't know the difference offhand, but if you go to the documentation links you can figure it out; they will tell you what each one does and what each one returns. Just to demonstrate what the documentation links look like: this is the documentation from sklearn for KFold. It tells you it's a K-fold cross-validator that provides train/test indices to split the data. We can compare this to cross_validate: if I just replace KFold with cross_validate, this one says it evaluates metrics by cross-validation and also records fit and score times. So this is something where you'd have to provide the algorithm, the data, and how you want to score. It basically does the whole cross-validation process for you, whereas KFold just provides the splits of the data, and then, in the for loop that I wrote out below, you have to fit the model and record the holdout error yourself.

So why would we want to do one versus the other, a validation set versus cross-validation? The main considerations when you're choosing between the two are data set size and model training time. In general, you want to use cross-validation when it's feasible, because you get a better sense of the errors than from just a single validation set. In practice, if your data set is small (and it's hard to say what small means; I don't have a number like a hundred or ten, it really depends on the problem you're doing and the type of model you're trying to fit), or if your model takes a very long time to train, you'll use a validation set, because that's just what you're able to do. If you have a large data set and your models are relatively quick to train, where again there's not a good rule of thumb and it just depends on the problem you're working on, then you'd use cross-validation.
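For reference, a minimal sketch of the cross_validate alternative discussed above, which wraps the whole split-fit-score loop for you; the LinearRegression estimator and the negative-MSE scoring string are illustrative assumptions, not what the notebook uses:

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression

X = np.random.rand(1000, 10)   # stand-in data, as in the earlier sketch
y = np.random.rand(1000)

# cross_validate handles the splitting, fitting, and scoring in one call;
# sklearn reports MSE as a negative score so that "higher is better" everywhere
results = cross_validate(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")

print(-results["test_score"].mean())  # average holdout MSE across the 5 folds
```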
Okay, so just for the sake of time, I'll pause questions and move on; you'll have a chance to practice with cross-validation in the problem session tomorrow. For now, let's dive in and learn an actual algorithm.

We're going to start with simple linear regression, which I'm sure is familiar to a lot of you, but it's a good place to start, because over the course of the next two lectures we'll build upon it to get more complicated models, and it's also going to let us apply this cross-validation approach in the next lecture notebook.

Okay. So the simple linear regression model is for this situation: you have a variable you want to predict, called y, and then you have a single feature that I'm going to call little x here, just because that's what I'm used to. Remember our supervised learning framework, that y is equal to f(x) plus some error. In this case the f(x) is an actual function that we're assuming a form for, which is beta_0 plus beta_1 times x, and then plus epsilon, which is our error. beta_0 and beta_1 are real-number constants known as parameters, which we're going to estimate. And then for simple linear regression we have the assumption that the epsilons are normally distributed with mean 0 and a common standard deviation, and that the error term is independent of x.

To visualize what this model looks like: if we have this blue line representing y equals beta_0 plus beta_1 x, the systematic form, then the epsilons for each value of x are being drawn from the same normal distribution and then added to the value of y given by the line. You can imagine that for any given value of x we go to the line, draw a random error, and add it to that value; so it's less likely that we'd be way up here or way down here, and more likely that we'd be closer to the line. This is what the model is assuming, and then you collect your data and we'll see whether it's a good fit. We make these assumptions because, if the assumptions hold, we can derive some nice features of the estimates and the predictions. Some of them are touched on in later lecture notebooks, and some of them are touched on in the practice problems for regression; they allow you to derive nice properties of the estimators.

So Melanie asked... okay, Melanie, I'm not sure why you're unable to see my screen; it says that I'm screen sharing on my end, and everybody else can see it. Okay, great, awesome.

Okay, so how do you fit this model? In general we're going to use Python to fit the models, but I also think it's useful to know how the algorithms are fit. There are a couple of reasons for this. One is that it's nice to get away from the black-box
idea of machine learning, where data goes in, prediction comes out, and you don't know what's happening in the middle. A lot of times things can go wrong with your model, and it's useful to have an idea of what's going on behind the scenes, just like it's useful to have an idea of how your car works if something breaks and you want to be able to fix it yourself. That being said, you don't always have to know how it works. There are plenty of positions where your bosses don't care whether you know how the thing is being fit in the background, as long as you know enough to make the business money. So if you're a person who just wants to figure out how to do it in Python, we're going to cover how to do that, and if you're a person who wants to know how the algorithms work, we're also going to try our best to cover that. Just feel free to pay attention to the parts you're most interested in and ask questions about those.

Okay, so how do we fit the model? The way we fit a model is by defining what's known as a loss, or error, function. The things we're estimating are beta_0 and beta_1, and to estimate them we need a loss function. For regression problems, which are problems where the outputs are continuous, like the one we have here, our loss function is the mean squared error, or MSE. The mean squared error is given by one over n times the sum from i equals 1 to n (remember, n is the number of observations) of the actual value minus the estimated value, which is what the little hat denotes, squared. So that's the "squared" part, you're taking the square of the difference, and the "mean" part is the one over n of that sum; a mean is an average value. Plugging this in for simple linear regression, our estimate is beta_0 hat plus beta_1 hat times x_i, so that's where this part comes from. If you do a little bit of calculus and then some algebra, you can find that the values of beta_0 hat and beta_1 hat that minimize this MSE (we want our errors to be small, so we want the values that make this as small as possible) are given by: beta_0 hat is the average value of y minus beta_1 hat times the average value of x, where these averages are found using the data you've observed, the training set; and beta_1 hat is given by the sample covariance of x and y divided by the sample variance of x.
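Written out, the formulas just described (reconstructed in LaTeX from the spoken description):

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
             = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2,
\qquad
\hat{\beta}_1 = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},
```

with the sample covariance, sample variance, and averages all computed on the training set.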
00:52:56.000 --> 00:52:57.000 Okay, so MSE is used as the default loss function for a number of reasons. 00:52:57.000 --> 00:53:06.000 A lot of those reasons come from its roots as a statistical regression technique. 00:53:06.000 --> 00:53:18.000 A nice reason to use it is that this function is differentiable with respect to the beta hats; it's also convex, meaning that if you're able to find the minimum, it is a unique minimum. 00:53:18.000 --> 00:53:31.000 Other things you might use: some people might want you to use the mean absolute error, or MAE, and if you're interested in learning more about that, you can check out the regression practice problems 00:53:31.000 --> 00:53:35.000 notebook. Okay, so before we show you how to do this in sk 00:53:35.000 --> 00:53:36.000 learn, does anybody have a question about the model, or how to fit it? 00:53:36.000 --> 00:53:54.000 So what would you do if you had errors in your X and y data? 00:53:54.000 --> 00:54:11.000 Yeah, so in practice you might have, you know, some sort of error with recording your X, but in the model you're just assuming that your X's are what they say they are. So, for instance, if you had data that was related to 00:54:11.000 --> 00:54:15.000 the height and the weight of somebody, and you're using those as features, 00:54:15.000 --> 00:54:20.000 you're just assuming that those are correct. That's just an assumption of the model. 00:54:20.000 --> 00:54:27.000 Yeah, I don't know how else to say it: it's just an assumption of the model. 00:54:27.000 --> 00:54:39.000 You would not have a situation with linear regression, the way it's classically set up, where you allow for the inputs to also have errors. 00:54:39.000 --> 00:54:40.000 Yup! 00:54:40.000 --> 00:54:41.000 Thanks. 00:54:41.000 --> 00:54:42.000 I think, in other words, that's basically saying you have bad data, and you're trying to build a model on data that's very much error-prone. 00:54:42.000 --> 00:54:43.000 So I think it's best to have data that's right, 00:54:43.000 --> 00:54:44.000 and yes, there will be some inherent error. 00:54:44.000 --> 00:55:14.000 But again, going back to the purpose of machine learning models: it is to actually make predictions without you going back to your experiment, or whatever it is, to generate that data. 00:55:25.000 --> 00:55:26.000 Yeah, so that's one way to think about it. 00:55:26.000 --> 00:55:47.000 My high school mathematics teacher, whenever we were using our calculator to solve a problem, would always encourage us to double-check the inputs of our calculations, saying "GIGO," which stood for garbage in, garbage out. So, like they said, if you had 00:55:47.000 --> 00:55:58.000 bad input data from, you know, faulty measurements, then your model is also not going to be very good. 00:55:58.000 --> 00:56:06.000 Any other questions? 00:56:06.000 --> 00:56:17.000 Okay. So, because these estimates are found with sample means, a sample covariance, and a sample variance, 00:56:17.000 --> 00:56:21.000 you could calculate this by hand using numpy or pandas. 00:56:21.000 --> 00:56:25.000 But we're just going to get into the swing of using sklearn. 00:56:25.000 --> 00:56:39.000 So sklearn is sort of the workhorse of traditional machine learning algorithms, and by that I mean the non-deep-learning stuff, so it's sort of the workhorse in Python for doing this sort of thing.
00:56:39.000 --> 00:56:53.000 So they have what are known as model objects. For almost all of the algorithms we learn, there's going to be a model object that will take in the data, fit to find whatever parameters it needs to fit, 00:56:53.000 --> 00:57:04.000 and then allow you to make predictions with that. So we're going to learn that workflow in this notebook; in particular, we're learning how to use the linear regression model object right now. 00:57:04.000 --> 00:57:10.000 So we're gonna use it to predict on this synthetic data. 00:57:10.000 --> 00:57:13.000 So this is synthetic because I used numpy to generate it. 00:57:13.000 --> 00:57:14.000 So it's random data, but it's not real data. 00:57:14.000 --> 00:57:25.000 So we have x, which is randomly distributed uniformly from 0 to 1; it's 100 observations. And then y, where 00:57:25.000 --> 00:57:26.000 the true relationship is 2x plus 1, and our random noise is 00:57:26.000 --> 00:57:36.000 normally distributed with a standard deviation of 0.5. 00:57:36.000 --> 00:57:37.000 Okay, so this is what that looks like. These are our observations. 00:57:37.000 --> 00:57:43.000 And now we're going to use sklearn to fit 00:57:43.000 --> 00:57:49.000 a simple linear regression model of y regressed onto x. So the first thing you're gonna do in these workflows is you'll import the model class. 00:57:49.000 --> 00:58:01.000 So from sklearn, linear regression is stored in linear_model. 00:58:01.000 --> 00:58:09.000 We'll import LinearRegression, with a capital L and a capital R. 00:58:09.000 --> 00:58:17.000 And so this might be a good time to pause and say: in Python the syntax standard is that when you have a class, you'll use what's known as camel case. 00:58:17.000 --> 00:58:21.000 I believe it's called camel case typing. 00:58:21.000 --> 00:58:24.000 So each new word 00:58:24.000 --> 00:58:42.000 starts with a capital letter, then all of the remaining letters are lowercase, and when you have a new word, instead of an underscore, it starts with another capital. This is for classes and objects, whereas other things like functions tend to be separated with underscores. So this is 00:58:42.000 --> 00:58:47.000 just a note on Python syntax for those of you that are new to Python. 00:58:47.000 --> 00:58:52.000 Okay, so after we've imported our LinearRegression class, we can now make an empty model object. 00:58:52.000 --> 00:59:00.000 So let's do: slr is gonna be the variable 00:59:00.000 --> 00:59:07.000 I store it in, then I'm gonna do LinearRegression, parentheses. 00:59:07.000 --> 00:59:13.000 So some of these models will have inputs that you can use to customize the model. 00:59:13.000 --> 00:59:16.000 I'm gonna use the standard model, the default. 00:59:16.000 --> 00:59:22.000 One thing that might be worth doing is putting in the argument copy_X equals True. 00:59:22.000 --> 00:59:38.000 What this ensures is that when LinearRegression takes in our X and our y, it will make copies of the arrays before fitting the model. 00:59:38.000 --> 00:59:43.000 And so in Python, with these sorts of things, you wanna make sure that you make copies, 00:59:43.000 --> 00:59:52.000 so you're not accidentally altering the original data, which can happen with the way that Python stores data in your computer.
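(A minimal sketch of that setup. The data-generating line follows the description above, x uniform on [0, 1] and y equal to 2x plus 1 plus normal noise with standard deviation 0.5, but the seed and the exact code are mine, not the notebook's.)

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(440)

    # synthetic data as described: 100 observations, true relationship y = 2x + 1, noise sd 0.5
    x = rng.uniform(0, 1, 100)
    y = 2 * x + 1 + rng.normal(0, 0.5, 100)

    # an empty, not-yet-fitted model object; copy_X=True tells sklearn to copy the inputs before fitting
    slr = LinearRegression(copy_X=True)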
00:59:52.000 --> 00:59:53.000 So just to be safe, I usually will put the copy_X 00:59:53.000 --> 00:59:54.000 argument equal to True. Okay? So now I have a linear regression model. Yup? 00:59:54.000 --> 01:00:07.000 Why only the copy_X? What about copy_y? 01:00:07.000 --> 01:00:08.000 So this is just what the argument is called. 01:00:08.000 --> 01:00:16.000 In the algorithm, I believe the X is the one that's manipulated, whereas y, I think, just gets to be y. 01:00:16.000 --> 01:00:17.000 That might be why they use copy_X instead of copy_y; I don't think there is a copy_y argument. 01:00:17.000 --> 01:00:29.000 Okay. Thank you. 01:00:29.000 --> 01:00:30.000 So now that we have the model object, we could look at it. 01:00:30.000 --> 01:00:38.000 This is what it looks like. It's not fit yet. 01:00:38.000 --> 01:00:39.000 And you might not be able to see this; when you do this, you might just see the text. 01:00:39.000 --> 01:00:40.000 It might just depend on your version of Jupyter Notebooks. 01:00:40.000 --> 01:00:43.000 So once we have the empty model object, we can fit it. So we do 01:00:43.000 --> 01:00:56.000 slr.fit. So here's gonna be our first instance of something that I think really tends to confuse people. 01:00:56.000 --> 01:01:03.000 In order to fit your models, your features have to be what's known as a 2D array. 01:01:03.000 --> 01:01:14.000 So if we look at X right now and do X.shape, it's a 1D numpy array, meaning that it has a single direction; it's just one-dimensional. 01:01:14.000 --> 01:01:23.000 So right now, mathematically, we could think of it as a row vector, but what we need it to be is a column vector, or two-dimensional. 01:01:23.000 --> 01:01:27.000 So what we're gonna do is what's known as a reshape. So we do 01:01:27.000 --> 01:01:39.000 X.reshape(-1, 1). And so what this does: if we look at the original X, we can kind of see it's like a row. 01:01:39.000 --> 01:01:47.000 But once we've done the reshape, this now makes it a column, and if we look at the shape of that, X.reshape(-1, 1).shape, 01:01:47.000 --> 01:01:55.000 it is now a two-dimensional 01:01:55.000 --> 01:02:00.000 array. It has 100 rows and one column. 01:02:00.000 --> 01:02:02.000 So what reshape does is it allows you to input arguments that will dictate the shape of the array. 01:02:02.000 --> 01:02:16.000 The one in the second position tells numpy that I want it to have a single column, and then the negative one here says: make this whatever dimension you need it to be to fill in the array. 01:02:16.000 --> 01:02:38.000 So I could have replaced this with 100 and it still would have worked, but in general we don't know how many observations we're going to have, so it's better practice to use a negative one, because no matter what shape our X is, this will still work. Okay, why did we need to 01:02:38.000 --> 01:02:41.000 go through that big, long spiel about doing reshape? 01:02:41.000 --> 01:02:43.000 Well, this is what 01:02:43.000 --> 01:02:45.000 would happen if we did X comma y: we get an error. 01:02:45.000 --> 01:02:51.000 And why do we get this error? We can scroll all the way down, and we can see that it says you got this 01:02:51.000 --> 01:02:58.000 error because it expected a 2D array but got a 1D array instead.
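(A sketch of that reshape-and-fit step, continuing from the x, y, and slr in the sketch above:)

    print(x.shape)          # (100,): a 1D array, which sklearn rejects as a feature matrix
    X = x.reshape(-1, 1)    # the 1 means "one column"; the -1 means "however many rows are needed"
    print(X.shape)          # (100, 1): a 2D array with a single feature column

    slr.fit(X, y)           # fitting finds the intercept and slope estimates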
01:02:58.000 --> 01:03:01.000 So, the way that sklearn writes its algorithms to fit models in the background, it's assuming that the features are stored in a 2D array. 01:03:03.000 --> 01:03:04.000 That's why we have to do the reshape(-1, 1). 01:03:04.000 --> 01:03:05.000 Okay. So now we have a fitted linear regression. And once our linear regression is fit, we can do 01:03:06.000 --> 01:03:20.000 slr.predict. I'm just going to go ahead and copy and paste this, and see that these are the predictions of 01:03:20.000 --> 01:03:25.000 the linear regression model on the data we used to train it. 01:03:25.000 --> 01:03:32.000 Okay. So maybe before going into looking at all the different parts of the simple linear regression model, are there any 01:03:33.000 --> 01:03:39.000 other questions? 01:03:39.000 --> 01:03:41.000 Okay, so here's one: can you explain one more time the reason to do reshaping? 01:03:41.000 --> 01:03:49.000 So, the simple linear regression model 01:03:58.000 --> 01:04:04.000 could not be fit with a one-dimensional array. So let's go back to what this looks like. 01:04:04.000 --> 01:04:13.000 So, putting that comma back: remember, X on its own is one-dimensional. 01:04:13.000 --> 01:04:19.000 If we go down to the error message, it will say, 01:04:19.000 --> 01:04:37.000 and even here it's a really good error message, because it says: reshape your data either using reshape(-1, 1) if your data has a single feature, or reshape(1, -1) if it contains a single sample. Because ours just has a single feature, that's why we use the 01:04:37.000 --> 01:04:40.000 first one. So it has to be a 2D array. 01:04:40.000 --> 01:04:43.000 So we do reshape(-1, 1). There are 2 entries here, which means it will be two-dimensional. 01:04:43.000 --> 01:04:55.000 We know that X is a single column, so we put a one in the second spot. 01:04:55.000 --> 01:04:59.000 In general we might not know the number of rows our data has, 01:04:59.000 --> 01:05:06.000 so then we would use a negative one instead. 01:05:06.000 --> 01:05:10.000 So there's a question asking: wouldn't a simple transpose function work? 01:05:10.000 --> 01:05:15.000 So if we did X.transpose(), we can check this out, 01:05:15.000 --> 01:05:23.000 and you can see that the shape of this is the same. 01:05:23.000 --> 01:05:26.000 Okay, so we have to do reshape. So we've got some more questions. 01:05:26.000 --> 01:05:27.000 The next question is, why didn't I do an X, 01:05:27.000 --> 01:05:39.000 y train test split? So, one, I know it's like, oh, I just went through this big long notebook of why I do splits. 01:05:39.000 --> 01:05:47.000 This is just to demonstrate the model, so I'm not trying to make any predictions. 01:05:47.000 --> 01:05:50.000 I'm just trying to show you: this is the model, this is how it works, 01:05:50.000 --> 01:06:04.000 this is how you fit it with Python, this is how you fit it in general. If this was a predictive modeling problem, I would make my train test split at the very beginning and then go from there. For the simplicity of not having to 01:06:04.000 --> 01:06:09.000 go through and make those steps, I just showed it to you with the data.
01:06:09.000 --> 01:06:20.000 Another reason is that this is a situation where, if I wanted to, I could just go generate more data at any time I want, because this is synthetic data. So anytime I want, 01:06:20.000 --> 01:06:24.000 I can rerun the random generation and generate X and y all over again. 01:06:24.000 --> 01:06:28.000 Okay. So then Payelle is asking: what about y? 01:06:28.000 --> 01:06:34.000 Why don't we reshape that? This is just the way that sklearn has written its code. It 01:06:34.000 --> 01:06:40.000 does not expect y to be a two-dimensional vector or two-dimensional numpy array. It's fine, and I think it's better, to leave it as a one-dimensional array. 01:06:40.000 --> 01:06:43.000 So you do not have to reshape y; you do have to reshape X. 01:06:43.000 --> 01:06:56.000 Basically, when we look at multiple linear regression, you'll see that X is a matrix. 01:06:56.000 --> 01:07:01.000 So basically, the people who wrote sklearn, I think, are expecting your features to be a matrix 01:07:01.000 --> 01:07:02.000 and your y to be like a regular row vector, so I think that's why it's like that. 01:07:02.000 --> 01:07:13.000 But in general your y does not have to be a two-dimensional array for sklearn; 01:07:13.000 --> 01:07:21.000 it can be a one-dimensional array. 01:07:21.000 --> 01:07:32.000 Okay. Any other questions? 01:07:32.000 --> 01:07:39.000 Okay, so this is a regression model, which means, remember, we said that it has a beta 0 01:07:39.000 --> 01:07:44.000 we are trying to estimate and a beta 1 we are trying to estimate, so we can get all of that data. 01:07:44.000 --> 01:07:48.000 To get the intercept, which is the estimate of beta 0, 01:07:48.000 --> 01:08:00.000 you just do slr, the name of the variable, dot intercept_, and so here we're estimating the intercept to be 0.9975. 01:08:00.000 --> 01:08:02.000 You can get the estimate of beta 1 with 01:08:02.000 --> 01:08:16.000 slr.coef_. So here we're estimating that the coefficient is 2.15, and then here we can use this to actually predict, to show what the model is saying. 01:08:16.000 --> 01:08:21.000 So the black line here is the model that I just fit with simple linear regression. 01:08:21.000 --> 01:08:30.000 I'm just providing an evenly spaced array from 0 to 1 and then predicting on that to get the y values. Okay? 01:08:30.000 --> 01:08:32.000 So this is the model that I fit 01:08:32.000 --> 01:08:38.000 just now in this notebook. Okay? 01:08:38.000 --> 01:08:39.000 Yeah, so Chris is asking: sorry, what would the coefficient show with 01:08:39.000 --> 01:08:40.000 multiple betas, if they were in the model? 01:08:40.000 --> 01:08:41.000 So when we learn about multiple linear regression, we'll see that 01:08:41.000 --> 01:08:42.000 if you're doing multiple linear 01:08:42.000 --> 01:08:47.000 regression, coef_ will hold all of the coefficients from the model, and intercept_ will still hold the intercept. 01:08:47.000 --> 01:09:01.000 Any other questions about this notebook before we move on to our last notebook for today? 01:09:01.000 --> 01:09:08.000 Okay. 01:09:08.000 --> 01:09:09.000 Alright!
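(Before moving on, a minimal sketch tying together the predict, intercept_, and coef_ steps from this notebook, continuing from the slr and x above. The printed values depend on the random draw, so the 0.9975 and 2.15 quoted in the lecture are just one outcome.)

    import numpy as np

    preds = slr.predict(x.reshape(-1, 1))   # predictions on the data used to fit the model

    print(slr.intercept_)   # the estimate of beta_0
    print(slr.coef_)        # array of slope estimates; length 1 for simple linear regression

    # an evenly spaced grid from 0 to 1, used to draw the fitted line
    grid = np.linspace(0, 1, 100).reshape(-1, 1)
    line_values = slr.predict(grid)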
01:09:09.000 --> 01:09:12.000 So the last notebook we're gonna do is sort of give 01:09:12.000 --> 01:09:18.000 you a rundown of how a predictive modeling workflow might go. 01:09:18.000 --> 01:09:19.000 It's not exactly how it will be once 01:09:19.000 --> 01:09:43.000 you hopefully get your dream job. 01:09:43.000 --> 01:09:44.000 It's just hard to imagine now, because we're in the middle of the baseball season, 01:09:44.000 --> 01:09:47.000 but let's imagine that we're in November and it's the off-season. 01:09:47.000 --> 01:10:08.000 During the off-season, baseball teams are looking at players that they can bring in or keep to improve on the number of wins they had this past season going into the coming season. So one question you might have, as someone who's working for a baseball team, is: alright, is it better to have better 01:10:08.000 --> 01:10:15.000 defensive players, which in the sport of baseball means that you're limiting the number of runs that your team is allowing, 01:10:15.000 --> 01:10:22.000 or is it better to have good offensive players, meaning that you're increasing the number of runs that your team scores? 01:10:22.000 --> 01:10:26.000 So basically, what you're trying to see in this question is: given the number of runs 01:10:26.000 --> 01:10:34.000 and the number of runs allowed, which is better at predicting the number of wins you will have in a given season? 01:10:34.000 --> 01:10:40.000 Now, this is a silly question, because it isn't realistic; the real world of baseball is much more complicated. 01:10:40.000 --> 01:10:46.000 But we only know simple linear regression right now, so this is a perfect question for us. 01:10:46.000 --> 01:10:51.000 So the first thing we're gonna do is load the data. 01:10:51.000 --> 01:10:52.000 And then here's a random sample of that data, so we can see what it looks like. 01:10:52.000 --> 01:11:00.000 So here we have 5 rows of the data. We have teams, 01:11:00.000 --> 01:11:05.000 we have the year that the data comes from, the league of the team, 01:11:05.000 --> 01:11:08.000 so in Major League Baseball we have the National League and the American League, 01:11:08.000 --> 01:11:23.000 the number of games that team played in that season, the number of wins and losses that team had during the season, and then the number of runs scored by that team and the number of runs allowed by that team. If you're unfamiliar with baseball, runs 01:11:23.000 --> 01:11:29.000 allowed means the total number of runs, or points, that the other teams scored against them. 01:11:29.000 --> 01:11:42.000 So in 2012 the Pittsburgh Pirates scored 651 runs, and teams scored 674 runs against them. 01:11:42.000 --> 01:11:49.000 Okay. So once you get your data, the very first thing you should do, assuming you're not doing any sort of data transformation like a log transform or anything, is the train test split. 01:11:49.000 --> 01:12:00.000 So I'm going to go ahead and import my train test split, which we saw earlier today. 01:12:00.000 --> 01:12:01.000 Then I'm going to make my train test split. 01:12:01.000 --> 01:12:14.000 You should notice here that I have baseball.copy(). The .copy() here is making, I think it's called a hard copy, 01:12:14.000 --> 01:12:31.000 of the baseball data frame. If you're working with a pandas data frame, you want to do this when you make the train test split, because otherwise you're going to technically be working on the rows of the original data frame, and if you make any changes to it you'd 01:12:31.000 --> 01:12:38.000 also be changing the original data frame.
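(A minimal sketch of that first split. The data frame name baseball matches what's described above, but the split size and random state are my own picks since they weren't read out.)

    from sklearn.model_selection import train_test_split

    # .copy() hands train_test_split an actual copy of the data frame rather than a view of its rows
    bb_train, bb_test = train_test_split(baseball.copy(),
                                         test_size=0.2,      # assumed split size
                                         shuffle=True,
                                         random_state=216)   # assumed seed, just for reproducibility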
So the .copy() ensures that you're getting an actual copy of the data frame, 01:12:38.000 --> 01:12:41.000 instead of just a view of a subset of the rows of the original data frame. 01:12:41.000 --> 01:12:48.000 So this is a Python thing; it's just the way that they chose to store data in your computer. 01:12:48.000 --> 01:12:54.000 Okay, so now that I have my train test split, I will do some exploratory data analysis, and you'll get some practice with this in tomorrow's problem session. 01:12:54.000 --> 01:13:09.000 So the first thing you wanna check: one of the biggest assumptions in linear regression is that there is a linear relationship between your output, which for us is the wins, the W's, 01:13:09.000 --> 01:13:10.000 and your inputs, which for us are runs or runs allowed. 01:13:10.000 --> 01:13:23.000 So for this I'm going to make scatter plots of my training set's wins against runs and runs allowed. 01:13:23.000 --> 01:13:29.000 Okay. So on the left-hand side plot I've got wins on my vertical axis and runs scored on my horizontal, and then on my right-hand plot 01:13:29.000 --> 01:13:41.000 I've got wins on the vertical and runs allowed on the horizontal. And I would say, based on these 2 plots, 01:13:41.000 --> 01:13:53.000 these look like linear relationships to me. So simple linear regression would be an appropriate model for either of these potential relationships. 01:13:53.000 --> 01:13:57.000 So that's why we're okay to use simple linear regression. 01:13:57.000 --> 01:14:08.000 If we were to look at this and it didn't look like a linear relationship at all, we might want to try something else, and we'll talk about that in the coming lecture notebooks. 01:14:08.000 --> 01:14:09.000 So then we're gonna choose some candidate models to try. 01:14:09.000 --> 01:14:21.000 The first model will just be regressing wins on runs, and then the second model is going to be regressing wins on runs allowed. 01:14:21.000 --> 01:14:26.000 An important step in any predictive modeling project is to have what's known as a baseline model. 01:14:26.000 --> 01:14:33.000 These are models that aren't necessarily good, but they give us some sort of context to see how good our models are in general. 01:14:33.000 --> 01:14:38.000 It's hard to tell whether or not a model is good on its own. 01:14:38.000 --> 01:14:40.000 So let's say we go through this process and it turns out our model has a mean squared error of 100. 01:14:40.000 --> 01:14:59.000 It's hard to tell in the abstract if that is good. We want to have a very simple model to compare our more complicated models with, to say: okay, these models are outperforming the simpler one. 01:14:59.000 --> 01:15:06.000 And, for instance, let's say our baseline model, the very simple one that we're going to compare to, 01:15:06.000 --> 01:15:08.000 let's say it had an MSE of 1,000. 01:15:08.000 --> 01:15:12.000 In this case, if we were able to find a model that had an MSE 01:15:12.000 --> 01:15:20.000 of 100, this model is a good improvement over the baseline. But, for instance, if our baseline had an MSE 01:15:20.000 --> 01:15:25.000 of 10, our more complicated model would be underperforming the baseline. 01:15:25.000 --> 01:15:26.000 So it has a worse error, and it's more complicated.
01:15:26.000 --> 01:15:33.000 So we shouldn't stick with that one, even if it is the best one among the ones we've tried. 01:15:33.000 --> 01:15:40.000 So basically, that's the idea of having a baseline model: you just want to do a sanity check of, is this any better than the baseline? That's the whole goal. 01:15:40.000 --> 01:15:50.000 A really good baseline to start out with, when you're just starting a regression problem fresh, is to just say that there is no relationship: basically 01:15:50.000 --> 01:16:04.000 saying that the number of wins is independent of the runs or the runs allowed, so the number of wins is the expected value plus some random noise. 01:16:04.000 --> 01:16:05.000 To do this, we would estimate it with just the average, 01:16:05.000 --> 01:16:11.000 the arithmetic mean. Okay? So those are the 3 models: 01:16:11.000 --> 01:16:19.000 model 0, our baseline, just taking the average and always predicting that; model 01:16:19.000 --> 01:16:27.000 1, regressing wins onto runs; and then model 2, regressing wins onto runs allowed. 01:16:27.000 --> 01:16:39.000 So to see which of these 3 models performs best, and then see if our models 1 and 2 outperform model 0, we're gonna do k-fold cross-validation. 01:16:39.000 --> 01:16:50.000 So here I import KFold, and now I'm gonna make my KFold object with 5 splits; shuffle will be True. 01:16:50.000 --> 01:16:54.000 And then, so you can compare it later on your computer, we'll do a random state, 01:16:54.000 --> 01:17:00.000 and let's do 616. 01:17:00.000 --> 01:17:08.000 Okay, so we want to calculate the mean squared error for all 3 of these models across all of the cross-validation splits. 01:17:08.000 --> 01:17:15.000 Now, I could do this by hand, but scikit-learn has a function that I can use called mean_squared_error. 01:17:15.000 --> 01:17:22.000 It takes in the true values along with the predicted values, and then will output the MSE. 01:17:22.000 --> 01:17:27.000 So I'm gonna import the mean squared error along with my regression model. 01:17:27.000 --> 01:17:30.000 And so now I don't have to calculate it by hand; 01:17:30.000 --> 01:17:36.000 I can just input what my prediction is along with what the actual values are. 01:17:36.000 --> 01:17:45.000 So what I'm gonna do now is create an array of zeros, and this array of zeros is going to keep track of the mean squared error 01:17:45.000 --> 01:17:49.000 on the holdout set from my cross-validation splits. 01:17:49.000 --> 01:17:59.000 The 3 represents the fact that I have 3 models, and the 5 represents that I'm doing 5-fold cross-validation. 01:17:59.000 --> 01:18:06.000 So: for train_index, test_index in kfold 01:18:06.000 --> 01:18:13.000 .split of bb_train. 01:18:13.000 --> 01:18:21.000 What am I gonna do first? I'm gonna get my training set from this particular 01:18:21.000 --> 01:18:27.000 cross-validation split; I should probably actually do this as a .copy() as well. 01:18:27.000 --> 01:18:40.000 So I'm gonna do bb_train.loc[train_index].copy(); I guess if I do that, I don't need this copy. 01:18:40.000 --> 01:18:44.000 Alright, and then I'm gonna get my holdout 01:18:44.000 --> 01:18:51.000 set: bb_train.loc[test_index].copy(). Here 01:18:51.000 --> 01:18:54.000 I'm getting the mean prediction from my baseline.
01:18:54.000 --> 01:18:55.000 So I just take the mean number of wins for the training split from the cross-validation, times 01:18:55.000 --> 01:19:06.000 a vector of ones that's the length of the holdout set. Now I'm gonna go through and make my linear regression model. 01:19:06.000 --> 01:19:16.000 So, LinearRegression, copy_X equals True. Now I'm going to fit it on the training data. 01:19:16.000 --> 01:19:19.000 So .fit, on bb_tt. 01:19:19.000 --> 01:19:30.000 Model 1 was wins on runs, so model 1's fit takes bb_tt.R 01:19:30.000 --> 01:19:42.000 .values.reshape(-1, 1), and then bb_tt.W.values. Alright, so just as a note: 01:19:42.000 --> 01:19:43.000 you can do this without the .values and just use the columns themselves. 01:19:43.000 --> 01:19:53.000 I prefer to do it with the .values, because sklearn, as of a recent update, as of 01:19:53.000 --> 01:20:03.000 maybe last year, if you use the columns themselves, will always want you to provide input with the column names, and if you don't, it will give you a warning. I really dislike seeing the red warning box, and that's why I turn them into numpy 01:20:03.000 --> 01:20:11.000 arrays here. It would work without the .values as well. 01:20:11.000 --> 01:20:17.000 Okay. So now, now that the model's fit, at this step I'm gonna make my prediction. 01:20:17.000 --> 01:20:27.000 So model 1 dot predict on the training set, or sorry, on the holdout set: bb 01:20:27.000 --> 01:20:38.000 _ho.R.values.reshape(-1, 1). 01:20:38.000 --> 01:20:41.000 And then this is doing the same thing, but for model 2, 01:20:41.000 --> 01:20:46.000 so you don't have to watch me type it all again. 01:20:46.000 --> 01:20:47.000 So now, as we're going through the splits, we're getting our training set 01:20:47.000 --> 01:20:54.000 and our holdout set, we fit the baseline model and get the predictions, 01:20:54.000 --> 01:21:02.000 we fit model 1 and get the predictions, we fit model 2 and get the predictions. 01:21:02.000 --> 01:21:10.000 And now we just have to record the mean squared error. For that we do mean_squared_error, 01:21:10.000 --> 01:21:16.000 then we would do bb_ho.W.values, 01:21:16.000 --> 01:21:30.000 so the true values, remember from the documentation, the true values of y, and then the predicted values of y, which I store for this model in a variable called pred 01:21:30.000 --> 01:21:37.000 1, and then this is just helping me keep track of the cross-validation split I'm on. 01:21:37.000 --> 01:21:40.000 Oh no, what did I do? I think I want them to be iloc. 01:21:40.000 --> 01:21:47.000 Yeah, so these should be iloc. That's what I thought. 01:21:47.000 --> 01:21:52.000 So now that I have that, I can go ahead, and what I'm showing here is, for each of my 3 models, 01:21:52.000 --> 01:21:57.000 so the baseline, model 1, and model 2, the black circles represent the cross-validation 01:21:57.000 --> 01:22:08.000 MSE for a single one of the splits, and then the larger red circles represent the mean cross-validation error 01:22:08.000 --> 01:22:19.000 across all 5 splits. So we can see here that the model that performs best from the cross-validation is the one with the lowest MSE, 01:22:19.000 --> 01:22:23.000 which is model 2 here. 01:22:23.000 --> 01:22:30.000 Okay.
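(Putting that loop together in one place, here is a sketch of the cross-validation comparison. The variable names bb_train, bb_tt, bb_ho and the column names W, R, RA follow what was read out, but they, and the loop structure, are my reconstruction rather than the notebook verbatim; the plotting of the black and red circles is omitted.)

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    kfold = KFold(n_splits=5, shuffle=True, random_state=616)

    # rows: the 3 models (baseline, wins on runs, wins on runs allowed); columns: the 5 splits
    mses = np.zeros((3, 5))

    for i, (train_index, test_index) in enumerate(kfold.split(bb_train)):
        bb_tt = bb_train.iloc[train_index].copy()   # training portion of this split
        bb_ho = bb_train.iloc[test_index].copy()    # holdout portion of this split

        # model 0: the baseline, always predicting the mean number of wins from the training portion
        pred0 = bb_tt.W.mean() * np.ones(len(bb_ho))

        # model 1: regress wins on runs scored
        model_1 = LinearRegression(copy_X=True)
        model_1.fit(bb_tt.R.values.reshape(-1, 1), bb_tt.W.values)
        pred1 = model_1.predict(bb_ho.R.values.reshape(-1, 1))

        # model 2: regress wins on runs allowed
        model_2 = LinearRegression(copy_X=True)
        model_2.fit(bb_tt.RA.values.reshape(-1, 1), bb_tt.W.values)
        pred2 = model_2.predict(bb_ho.RA.values.reshape(-1, 1))

        # record the holdout MSE for each model on this split
        mses[0, i] = mean_squared_error(bb_ho.W.values, pred0)
        mses[1, i] = mean_squared_error(bb_ho.W.values, pred1)
        mses[2, i] = mean_squared_error(bb_ho.W.values, pred2)

    print(mses.mean(axis=1))   # average cross-validation MSE for each of the 3 models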
And so then, in the real world you'd probably try more than 2 models; in this world 01:22:30.000 --> 01:22:33.000 there's nothing else that we can try, based on what we know. 01:22:33.000 --> 01:22:42.000 So in the real world you would do some additional modeling, like trying different models and comparing them to model 2, 01:22:42.000 --> 01:22:43.000 but we're done with that for the sake of this notebook. 01:22:43.000 --> 01:22:46.000 So let's imagine we've done that, and it turns out model 2 is still our best choice. 01:22:46.000 --> 01:22:51.000 So we're going to select model 2. What you then do is sort of a test set 01:22:51.000 --> 01:23:03.000 sanity check. You're gonna take your model, refit it on the entire training set, 01:23:03.000 --> 01:23:08.000 and then calculate the training set MSE along with the test set MSE, 01:23:08.000 --> 01:23:20.000 which is what I'm doing here. And you can see that, unexpectedly (this doesn't usually happen), the training set has a worse MSE 01:23:20.000 --> 01:23:22.000 than the test set. That doesn't usually happen, but it can, because it's just sort of random, right? 01:23:22.000 --> 01:23:29.000 But they are comparable; they're not vastly different from one another. 01:23:29.000 --> 01:23:30.000 And so here I would say: okay, our model isn't doing something unexpected, 01:23:30.000 --> 01:23:36.000 and we're also clearly not having an error with the way that our model was fit, 01:23:36.000 --> 01:23:41.000 so we're okay to, you know, take this and put it into production 01:23:41.000 --> 01:23:48.000 if we wanted to. So that's the idea. 01:23:48.000 --> 01:23:53.000 And then, if we were to find an error here, let's say this test set MSE 01:23:53.000 --> 01:23:59.000 was much larger than the training set's, or much, much smaller than the training set's, 01:23:59.000 --> 01:24:04.000 then we would want to look at the code that we used to fit the model, as well as maybe the actual data, to see: 01:24:04.000 --> 01:24:11.000 okay, did the test set have some weird outliers or something? 01:24:11.000 --> 01:24:12.000 That's the point of the test set: it's the sanity check. Okay? 01:24:12.000 --> 01:24:13.000 So that's the whole process. With the last 3 minutes, 01:24:13.000 --> 01:24:17.000 now is a great time for questions; that's 01:24:17.000 --> 01:24:28.000 today's lecture. So if there are any questions, now's a great time to ask.
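(A minimal sketch of that final sanity check, refitting the chosen model on the full training set and comparing training and test MSEs; as before, the column names are assumptions.)

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    # refit the chosen model (wins regressed on runs allowed) on the entire training set
    final_model = LinearRegression(copy_X=True)
    final_model.fit(bb_train.RA.values.reshape(-1, 1), bb_train.W.values)

    train_mse = mean_squared_error(bb_train.W.values,
                                   final_model.predict(bb_train.RA.values.reshape(-1, 1)))
    test_mse = mean_squared_error(bb_test.W.values,
                                  final_model.predict(bb_test.RA.values.reshape(-1, 1)))

    print(train_mse, test_mse)   # these should be comparable; a large gap is a red flag worth investigating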
01:24:28.000 --> 01:24:30.000 Hello! 01:24:30.000 --> 01:24:31.000 Hi! 01:24:31.000 --> 01:24:41.000 How do people go about setting the baseline? Is there any criteria to determine the shape or formula for the baseline? 01:24:41.000 --> 01:24:48.000 So with regression problems, it's typical to start out with just the one that we did, where you'll choose the expected value of your output. 01:24:48.000 --> 01:24:59.000 Then, once you've gone through and done this and found, say, model 2, 01:24:59.000 --> 01:25:06.000 now we want to update our baseline to just be the simple linear regression model, because that's a simple model: 01:25:06.000 --> 01:25:07.000 it can be fit very quickly, and it gives us a reasonable performance that we can compare to. 01:25:07.000 --> 01:25:10.000 So that's what we might do. To start out, you 01:25:10.000 --> 01:25:23.000 typically will assume there's no relationship, and then you might update your baseline as you continue and identify better models. 01:25:23.000 --> 01:25:35.000 Is it the simplest model you'd use when setting the baseline? 01:25:35.000 --> 01:25:36.000 Hmm! 01:25:36.000 --> 01:25:43.000 Yeah, so the idea behind a baseline is you typically do want to have a simple model to compare to. Here the simplest model is just assuming there's no relationship, 01:25:43.000 --> 01:25:44.000 but in practice that might not be a great comparison. 01:25:44.000 --> 01:25:51.000 So you might then say: okay, simple linear regression is 01:25:51.000 --> 01:26:00.000 usually a pretty simple model, so you compare to that. And if a model can't outperform linear regression, then you just might use linear regression. 01:26:00.000 --> 01:26:02.000 Got it. Thanks! 01:26:02.000 --> 01:26:05.000 Yup. I see that we have a question from Steven. 01:26:05.000 --> 01:26:06.000 We have 3 variables: W, R, and RA. We have modeled functions regressing W on R 01:26:06.000 --> 01:26:16.000 and W on RA. If we also modeled runs with runs allowed with linear regression, 01:26:16.000 --> 01:26:24.000 would the resulting triangle of functions commute? In other words, would R predicting RA predicting 01:26:24.000 --> 01:26:29.000 W be the same as R predicting W? Sorry for the naive question. 01:26:29.000 --> 01:26:30.000 Oh no, Stephen, no need to apologize at all. 01:26:30.000 --> 01:26:37.000 That's what these questions are for, you know, asking the questions and getting the answer. 01:26:37.000 --> 01:26:42.000 So in this case, thinking causally, well, I don't know, 01:26:42.000 --> 01:26:55.000 I think it would be a little bit difficult to make an argument that your runs cause your runs allowed, because in baseball offensive performance is typically somewhat independent of defensive performance. 01:26:55.000 --> 01:26:56.000 Now, I think you could probably make arguments that it's a little more subtle than that. 01:26:56.000 --> 01:27:04.000 But I think you wouldn't typically do something where you take one of your features and use one feature to predict another feature, 01:27:04.000 --> 01:27:13.000 then use those predicted features to predict the output. You're adding a layer of complexity 01:27:13.000 --> 01:27:19.000 there. Now, that being said, there are some cases where you would do something like that, called imputation. 01:27:19.000 --> 01:27:35.000 I don't know if we'll talk about that in the live lecture, but you can see an example of imputation in the pre-recorded videos, where, if you had a missing value for a particular feature, you might use the other features to fill in that missing value, to then be used 01:27:35.000 --> 01:27:36.000 for the predictive model. It's a slightly different situation. 01:27:36.000 --> 01:27:46.000 But in general you would not use an entire column of predicted features to then predict the output. 01:27:46.000 --> 01:28:06.000 You'll stick with the ones that you've been given, because, going back to that earlier question, we're assuming that these features are set in stone, and then using them to predict the output. 01:28:06.000 --> 01:28:07.000 Yeah. 01:28:07.000 --> 01:28:11.000 I have a question. When we observe that both R
01:28:11.000 --> 01:28:20.000 and RA have linear correlation with the output, and then we decided to use these to be the features: 01:28:20.000 --> 01:28:26.000 since both of them have linear correlation with the output, can we say that any linear combination of R 01:28:26.000 --> 01:28:30.000 and RA will have linear correlation with the output? 01:28:30.000 --> 01:28:36.000 And is there a way to find the best linear combination of R and RA, 01:28:36.000 --> 01:28:42.000 such that the model will be the best in terms of the lowest MSE? 01:28:42.000 --> 01:28:48.000 Yeah, so in this particular notebook we're self-imposing 01:28:48.000 --> 01:28:51.000 the restriction that we only know simple linear regression, because that's all we've covered so far. 01:28:51.000 --> 01:28:58.000 But in the next notebook, tomorrow, we'll learn about something called multiple linear regression. 01:28:58.000 --> 01:29:04.000 And so what you would do in practice is compare this to the model regressing W 01:29:04.000 --> 01:29:05.000 on both R and RA. In that case the actual model will find the best, meaning 01:29:05.000 --> 01:29:24.000 the lowest MSE it can find, coefficients for R and RA. It's not something where we would systematically go through and test different coefficients by hand and see what's best; 01:29:24.000 --> 01:29:31.000 we let the model do that for us, you know, algorithmically. 01:29:31.000 --> 01:29:32.000 Yeah. 01:29:32.000 --> 01:29:33.000 Thanks. 01:29:33.000 --> 01:29:34.000 Yeah. 01:29:34.000 --> 01:29:40.000 I just had a question, kind of going back to your validation lecture. 01:29:40.000 --> 01:29:50.000 I was just curious: is the point of validation to try to give you a better estimate of the error? 01:29:50.000 --> 01:29:52.000 Is that what it's designed for? What's the output? 01:29:52.000 --> 01:29:56.000 I didn't quite follow the output of that exercise. 01:29:56.000 --> 01:30:00.000 Yeah, so let's use this picture as an example. 01:30:00.000 --> 01:30:06.000 These little black dots are the errors on the holdout sets from the 01:30:06.000 --> 01:30:08.000 cross-validation. So the idea here is that a validation set will give you an estimate, like one estimate. 01:30:08.000 --> 01:30:22.000 And so what you could ultimately end up doing, if you use a validation set approach, is just build the model that performs best on that single validation set. 01:30:22.000 --> 01:30:32.000 So the idea with cross-validation, which is why it's generally preferred, is we're now getting to see the performance of these 3 different models on, 01:30:32.000 --> 01:30:37.000 in this case, 5 different validation sets. Granted, you're using different training sets as well, 01:30:37.000 --> 01:30:39.000 so it's a little bit, you know, 01:30:39.000 --> 01:30:42.000 you kind of have to squint a little bit. 01:30:42.000 --> 01:30:54.000 But the idea being: now, in addition to seeing, okay, on average model 1 does better than the baseline and model 2 does better than model 1, 01:30:54.000 --> 01:30:55.000 you can also get a sense, if you look at the different splits, of, okay,
01:30:55.000 --> 01:31:09.000 in almost all of the splits model 2 did better; it wasn't the case that a model had one split it performed really poorly on but did better on the rest. 01:31:09.000 --> 01:31:26.000 So I guess the short answer to your question is yes, the cross-validation gives you a better estimate of the generalization error, in that it's estimating sort of what the average generalization error would be, as opposed to just a single generalization error. 01:31:26.000 --> 01:31:29.000 Okay, I see. Thanks. 01:31:29.000 --> 01:31:34.000 Yes, and then there's a question: where can I find the recorded lectures? 01:31:34.000 --> 01:31:39.000 Yes, so those are on the webinar page. You go down to the program content, 01:31:39.000 --> 01:31:40.000 and there are videos. You can go through the process of scrolling all the way down to the bottom, 01:31:40.000 --> 01:31:47.000 but Roman has graciously provided a filter button. 01:31:47.000 --> 01:31:59.000 So you click on that filter and then search for the label, like May live lectures or something like that, and you can find it there. 01:31:59.000 --> 01:32:00.000 The next question is asking: is only the mean value from each model compared, or is the MSE 01:32:00.000 --> 01:32:11.000 computed for each model? So each of these black dots is the MSE 01:32:11.000 --> 01:32:17.000 on a particular holdout split, going back to that picture of the left-out set; 01:32:17.000 --> 01:32:23.000 that's what each of these dots represents. So you could, if you wanted to, try to get a sense by comparing all the splits to one another, 01:32:23.000 --> 01:32:42.000 but what's generally done in practice, because you probably won't have the time to go through and do that, is you'll just compare the average of these holdout values to one another. 01:32:42.000 --> 01:32:52.000 Sure! 01:32:52.000 --> 01:32:53.000 So, 01:32:53.000 --> 01:32:54.000 yeah, so can I clarify my question: so it's not just one average value that's computed; there's one MSE value from each holdout set? 01:32:54.000 --> 01:32:59.000 Yeah, so for the splits, there are 5 01:32:59.000 --> 01:33:12.000 holdout sets. We calculate the error on those 5, average them together, and that gave us the red dot for each of the models. You then compare that average across models to figure out which one performed best on average. 01:33:12.000 --> 01:33:22.000 Okay, so I know the mean squared error has a divide by n, 01:33:22.000 --> 01:33:27.000 so in this case should the n just be one? 01:33:27.000 --> 01:33:38.000 No. So the n in the formula for mean squared error, 01:33:38.000 --> 01:33:48.000 this n refers to the number of observations in the training set, or in whatever set you're looking at. 01:33:48.000 --> 01:33:49.000 Okay. 01:33:49.000 --> 01:33:50.000 So if it was on the training set, you know, we typically call that n, 01:33:50.000 --> 01:33:58.000 and if it's on the test set or holdout set, it's whatever the size of that set is, whatever the number of observations is. That's where the n comes from. 01:33:58.000 --> 01:34:01.000 Okay. Okay, I see. Okay, gotcha. Okay, thank you. 01:34:01.000 --> 01:34:03.000 Yup! 01:34:03.000 --> 01:34:19.000 Alright, maybe one more question, and then we'll sign off for today. 01:34:19.000 --> 01:34:21.000 Okay. There don't seem to be any more questions.
01:34:21.000 --> 01:34:26.000 Thank you so much to everyone that stuck around. That was day number 3. 01:34:26.000 --> 01:34:29.000 I'll upload the video later tonight, and you'll be able to find it if you'd like to rewatch it later. 01:34:29.000 --> 01:34:36.000 Okay.